Three Attempts at Masking a Voice Agent's Thinking Silence
After a caller stops talking, a production voice agent we run at Taritas has about 2.2 to 2.5 seconds of dead air before its first audio frame. We made three serious attempts to mask it: a classifier-driven filler, a state-change-driven filler, and preemptive generation. All three are disabled in production today, each for a different structural reason, and the usable lesson is about speech queue ordering: the only filler insertion point that works is the one where the reply does not exist yet.
By Supreet Tare
All names, numbers, and identifiers in this post are anonymized. The patterns are real.
The silence
One of our production voice agents answers a public-sector phone line. After the caller stops talking, there are about 2.2 to 2.5 seconds before the agent’s first audio frame plays. On a phone, that is unmistakable thinking silence, and it makes a competent agent feel robotic.
We made three serious attempts to mask that gap. All three sit disabled in production code today. Each failed for a different structural reason, and each taught us something the next design has to respect. This post is the autopsy.
Where the time actually goes
The budget after the caller stops, measured in production:
EOU detection (is the caller done?)      ~1.0 s
Turn classifier (routing, guardrails)    ~0.6 s (serial)
LLM time to first token                  ~0.6 s
TTS time to first audio byte             ~0.3 s
Total                                    ~2.2-2.5 s
One uncomfortable detail: our dashboard latency metric was defined as EOU + LLM first token + TTS first byte. It excluded the serial classifier, so the dashboard understated what callers actually heard by about 0.6 seconds. If your latency metric does not include every serial step, it is a comforting lie.
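The gap is easy to reproduce from the budget itself. This is a minimal sketch using the rounded figures above; the step names and the `total_latency` helper are illustrative, not the names used in our metrics pipeline.

```python
# Per-step latencies in seconds, rounded figures from the budget above.
BUDGET_S = {
    "eou_detection": 1.0,
    "turn_classifier": 0.6,
    "llm_first_token": 0.6,
    "tts_first_byte": 0.3,
}

# The steps the dashboard metric summed; the serial classifier is missing.
DASHBOARD_STEPS = ("eou_detection", "llm_first_token", "tts_first_byte")


def total_latency(budget, steps=None):
    """Sum the named steps, or every step when no subset is given."""
    names = budget.keys() if steps is None else steps
    return sum(budget[name] for name in names)


heard = total_latency(BUDGET_S)                      # every serial step
reported = total_latency(BUDGET_S, DASHBOARD_STEPS)  # dashboard definition
gap = heard - reported

print(f"caller hears ~{heard:.1f} s, dashboard shows ~{reported:.1f} s, gap ~{gap:.1f} s")
```

Summing over the whole dictionary by default means a newly added serial step is counted unless someone deliberately leaves it out.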
Attempt 1: let the classifier suggest the filler
The turn classifier already runs every turn, so the first idea was to have it return a context-aware filler phrase along with its classification, and have the agent speak that filler before the main LLM call.
It failed for a reason that took a day to see clearly: the classifier IS the silence. A filler that rides on the classifier’s output can only play after the classifier returns, and by then the main response is about to start anyway. There was nothing left to mask. The change was rolled back.
Attempt 2: fire a filler when the agent enters “thinking”
The second design hooked the SDK’s agent state event: when the agent transitions to thinking, wait 600 ms, check the state again, and if it is still thinking, speak a short generic filler (“One sec.”, “Okay.”). Fast turns finish inside the delay and suppress the filler; only slow turns get one.
This failed more subtly. In LiveKit Agents, the state change to thinking fires at the same moment the reply’s speech handle is enqueued. The event is descriptive, not predictive. The speech queue is FIFO, so a filler fired in response to that event enters the queue behind the reply and plays after the answer. A filler after the answer is strictly worse than silence.
The handler is commented out in production with the explanation in the comment; the helper functions stay, because the next design reuses them at a different call site.
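The failure can be reproduced without the SDK. The sketch below is a toy model, not LiveKit code: `ToySession`, `maybe_fire_filler`, and `simulate` are hypothetical names, the delays are scaled down from 600 ms, and the model assumes what we observed, namely that the reply is enqueued at the same moment the state flips to thinking.

```python
import asyncio


class ToySession:
    """Toy stand-in for a voice session: a state flag and a FIFO enqueue log."""

    def __init__(self):
        self.state = "listening"
        self.enqueued = []  # order in which speech was enqueued (FIFO)

    def start_reply(self):
        # Assumption from the post: reply enqueue and the state change to
        # "thinking" happen at the same moment.
        self.enqueued.append("reply")
        self.state = "thinking"


async def maybe_fire_filler(session, delay_s):
    """Attempt 2: wait, re-check the state, enqueue a filler if still thinking."""
    await asyncio.sleep(delay_s)
    if session.state == "thinking":
        session.enqueued.append("filler")


async def simulate(reply_ready_after_s, filler_delay_s=0.01):
    session = ToySession()
    session.start_reply()
    # The state-change handler runs here, after the reply is already queued.
    filler_task = asyncio.create_task(maybe_fire_filler(session, filler_delay_s))
    await asyncio.sleep(reply_ready_after_s)
    session.state = "speaking"  # first reply audio is ready
    await filler_task
    return session.enqueued


fast_turn = asyncio.run(simulate(reply_ready_after_s=0.0))
slow_turn = asyncio.run(simulate(reply_ready_after_s=0.1))
print(fast_turn, slow_turn)
```

In this model the fast turn suppresses the filler as intended, while the slow turn enqueues it behind the reply, which is the ordering callers heard.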
Attempt 3: preemptive generation
The SDK offers a setting that starts LLM generation before the caller has finished speaking, betting that the end of utterance is imminent. We tried it. The agent began answering while callers were still mid-sentence, because the turn detector fires on natural intra-sentence pauses. Cutting a caller off is the worst UX failure a voice agent can have. The setting is now hardcoded off in our entrypoint, with a pointer to the public issue (LiveKit Agents issue 3701).
What did ship: parallelizing the serial steps
While the filler attempts failed, one latency change shipped and stayed. About 80 percent of classifiable turns in this deployment need a knowledge base search after classification. Running the classifier and the search serially cost about 1.3 seconds (0.6 + 0.7) before the LLM could start; running them in parallel, and cancelling the search on the turns that do not need it, costs max(0.6 s classifier, 0.7 s search), about 0.7 seconds. Median win: roughly 0.6 seconds on the most common turn type, for the price of an occasionally wasted embedding lookup that costs a few thousandths of a cent.
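The pattern is speculative execution with cancellation. Here is a minimal asyncio sketch; `classify_turn`, `search_kb`, and `prepare_turn` are hypothetical stand-ins for the real classifier and knowledge base search, the "needs search" rule is a placeholder, and the delays are the 0.6 s and 0.7 s figures scaled down tenfold.

```python
import asyncio

# Hypothetical stand-ins; delays are the post's figures scaled down 10x.
CLASSIFY_S = 0.06
SEARCH_S = 0.07


async def classify_turn(text):
    await asyncio.sleep(CLASSIFY_S)
    # Placeholder rule for illustration only: questions need a search.
    return {"needs_search": "?" in text}


async def search_kb(text):
    await asyncio.sleep(SEARCH_S)
    return [f"kb result for: {text}"]


async def prepare_turn(text):
    """Run classification and search concurrently; drop the search if unneeded."""
    # Start the search speculatively so it overlaps with classification.
    search_task = asyncio.create_task(search_kb(text))
    try:
        label = await classify_turn(text)
    except BaseException:
        search_task.cancel()  # do not leak the speculative task
        raise
    if label["needs_search"]:
        return label, await search_task
    # Wasted lookup: cancel it and wait for the cancellation to settle.
    search_task.cancel()
    await asyncio.gather(search_task, return_exceptions=True)
    return label, None


print(asyncio.run(prepare_turn("When is the office open?")))
print(asyncio.run(prepare_turn("Thanks, goodbye")))
```

The search starts before the classifier's verdict is known, so a turn that needs it waits for the slower of the two calls rather than their sum.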
The design that should work, and why
The correct insertion point fell out of the two filler failures: fire the filler inside the turn-completed callback, before the reply is created at all. At that moment the speech queue is provably empty for the turn, so the filler enqueues first and the reply lines up behind it. FIFO ordering then works for us instead of against us.
async def on_user_turn_completed(self, turn_ctx, new_message):
    # The speech queue is empty for this turn at this point.
    if self._should_fire_filler(turn_ctx):
        self.session.say(self._pick_filler(), allow_interruptions=True)
    # ...then kick off classifier + retrieval tasks, then the reply...
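The ordering argument can be checked with a toy FIFO queue. This sketch models only enqueue order for a single turn; `play_order` and the site labels are hypothetical names, and nothing here calls the SDK.

```python
from collections import deque


def play_order(filler_site):
    """Return the playback order of a toy FIFO speech queue for one turn.

    filler_site is "turn_completed" (filler enqueued before the reply exists)
    or "state_changed" (filler enqueued in response to the thinking event).
    """
    queue = deque()
    if filler_site == "turn_completed":
        queue.append("filler")  # queue is still empty for this turn
    queue.append("reply")       # reply enqueued; state flips to thinking
    if filler_site == "state_changed":
        queue.append("filler")  # too late: the reply is already queued
    return [queue.popleft() for _ in range(len(queue))]


print(play_order("turn_completed"))
print(play_order("state_changed"))
```

The same filler, enqueued on the same queue, plays before or after the answer depending only on whether the reply already exists at the call site.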
The open question for that redesign is whether the filler should fire unconditionally or be gated on a predicted latency budget, and we do not yet have input-side latency prediction. Until that is answered, we ship the silence. Honest dead air beats a filler that plays in the wrong place.
Key takeaways
- The component you are using to decide whether to mask latency may itself be the latency. Map the serial chain before designing the mask.
- Platform events are often descriptive, not predictive. An event that says “thinking started” can fire at the exact moment it is already too late to act.
- On a FIFO speech queue, the only question that matters for inserted audio is: is the queue empty when I enqueue? Find the one call site where that is guaranteed.
- Latency dashboards must include every serial step, including the ones that feel like bookkeeping.
- Sometimes the correct production state is the unmasked flaw. Worse-but-clever audio loses to honest silence.
What this means if you are an IT services firm
If a client’s voice agent feels slow, the fix is not “make the model faster” and it is not “play hold music.” Ask your team two questions: what is the full serial chain between caller-stop and first audio byte, with a measured number on each link, and what controls the ordering of any audio you want to insert into that chain? If nobody can answer the second question for your platform’s speech queue, every filler design is a guess. This is the kind of work we do behind IT services firms, under their brand.