Semantic Turn Detection: How a Voice Agent Knows You're Done
Turn detection is how a voice agent decides you have finished speaking. Once it decides, it replies. The simplest method is a silence timer. The agent waits a fixed pause after you stop, then treats the turn as over. That one number forces a bad trade. At taritas we lowered the silence window from 1.5 to 1.0 seconds. We wanted lower latency, and replies did come faster. But shorter silence had a cost. Our speech-to-text began splitting one spoken sentence into fragments whenever a caller paused. More fragments meant more turns marked interrupted. That stress exposed a bug that put the call's opening greeting on a later turn. The honest fix is to stop deciding turns by silence alone. Semantic turn detection is the upgrade. It is a small model that reads the words so far and judges, by meaning, whether you are done. On LiveKit it runs next to voice-activity detection, and the trade mostly goes away.
By Supreet Tare
All names, numbers, and identifiers in this post are anonymized. The patterns are real.
A voice agent makes one decision thousands of times per call. Has the caller finished talking, or are they just pausing to think? Get it wrong one way and the agent talks over people. Get it wrong the other way and it goes quiet. A second of dead air on a phone reads as a dropped line. This decision is called turn detection. It is the most underrated tuning problem in a voice agent. This post covers three things: how turn detection works, the trade we hit tuning it on a production phone line, and why semantic turn detection beats a better timeout.
The three layers of knowing the caller is done
Turn detection is not one thing. It is three layers. Most teams start with only the first.
Layer one is voice-activity detection. It works at the audio level. It labels each frame of incoming audio as speech or silence, in real time, before any transcription. It answers a narrow question: is someone talking right now? It is also what lets the agent notice an interruption.
Layer two is endpointing. It works at the transcript level. It watches the words arrive and looks for the end of a sentence.
Layer three is semantic turn detection. A small model reads the partial transcript. It predicts whether the caller is actually done, based on meaning. Silence alone cannot tell the cases apart. “Transfer me to” is unfinished even after a full second of quiet. “Transfer me to billing” is done. Only a model that reads the words knows the difference.
The trap is building turn detection from layer one alone. That means voice-activity detection plus a fixed silence timer. You wait a set pause after speech stops, then call the turn over. It is the simplest thing that works in a demo. It also forces a trade that gets worse the more you tune it.
Why a silence timeout forces a trade
A silence timer makes you choose between two things you want at the same time. We ran a production voice agent for a public-sector phone line. We wanted it to feel faster. The obvious lever was the silence window. That is how long the agent waits after the caller stops before it decides the turn is over. We lowered it from 1.5 seconds to 1.0 seconds.
Replies came sooner, so in one sense it worked. But that one number does two jobs. Shortening it helped one and hurt the other. The job it helped was latency. The job it hurt was holding a sentence together. Real callers pause mid-sentence. They think, they read a reference number, they breathe. At 1.5 seconds those pauses stayed inside one turn. At 1.0 seconds, the speech-to-text engine read more of them as the end of speech. It split one spoken sentence into two or three fragments.
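The split described above can be sketched in a few lines. This is a minimal illustration, not our production code. It assumes the voice-activity detector emits one speech-or-silence label per 100 ms frame, and the names split_turns and make_frames are hypothetical. The caller's 1.2-second pause is an invented example.

```python
def split_turns(frames, frame_ms, silence_window_ms):
    """Group VAD frame labels into turns using a fixed silence timer.

    frames: list of booleans, True = speech, False = silence
            (a stand-in for real voice-activity detector output).
    Returns a list of (start_ms, end_ms) spans, one per detected turn.
    """
    turns = []
    start = None        # start time of the turn in progress
    last_speech = None  # end time of the most recent speech frame
    for i, is_speech in enumerate(frames):
        now = i * frame_ms
        if is_speech:
            if start is None:
                start = now
            last_speech = now + frame_ms
        elif start is not None and (now + frame_ms) - last_speech >= silence_window_ms:
            # Silence has lasted the full window: call the turn over.
            turns.append((start, last_speech))
            start = None
    if start is not None:
        turns.append((start, last_speech))
    return turns


def make_frames(segments, frame_ms=100):
    """Build frame labels from (kind, duration_ms) pairs."""
    frames = []
    for kind, duration_ms in segments:
        frames.extend([kind == "speech"] * (duration_ms // frame_ms))
    return frames


# One sentence with a 1.2 s mid-sentence pause, then the caller stops.
caller = make_frames([
    ("speech", 2000), ("silence", 1200), ("speech", 1500), ("silence", 2000),
])

print(len(split_turns(caller, 100, 1500)))  # 1
print(len(split_turns(caller, 100, 1000)))  # 2
```

The same audio yields one turn at 1.5 seconds and two at 1.0. Nothing about the caller changed. Only the window did.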
More fragments meant more turns marked interrupted. The next fragment of audio arrived before the agent’s reply to the previous one had been saved. So we had traded a latency problem for a fragmentation problem. Lowering the window further would make it worse. Raising it back would bring the latency complaint back. That is the dilemma in one line. The same number controls both speed and correctness, and they pull in opposite directions.
What the fragmentation actually broke
The fragmentation did not stay contained. It surfaced a bug two layers downstream, in the code that builds the post-call transcript.
Here is why that code existed. When a turn is interrupted, the agent’s spoken response can fail to be recorded. The next caller turn fires before the reply has been saved to the conversation history. So we had recovery code. When a turn looked interrupted, it scanned backward through the history for the most recent assistant message not yet filed to a turn. It used that as the answer. For the common case, that is correct. The agent did finish speaking, and the record just needs to catch up.
The recovery had one flaw. It had no ownership boundary. It scanned backward until it found any unfiled assistant line, and took it. Two facts made that dangerous. First, the call’s opening greeting is never filed to a turn, because turns start at the first caller question. So the greeting is permanently unfiled. Second, the aggressive timeout now produced more interrupted turns whose own answers had not been saved yet.
So on one interrupted turn, the scan walked too far. It went past that turn’s question, past several earlier turns, and landed on the opening greeting. The transcript then showed the agent answering a mid-call question with “Thanks for calling.” The live caller had heard the right answer in real time. Only the record was wrong. That is the kind of bug that survives a demo and fails an audit.
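The failure is easy to reproduce in miniature. The sketch below is a simplified stand-in for the flawed recovery, not the original code. The Item class, the recover_unbounded name, and the sample dialogue are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Item:
    role: str  # "user" or "assistant"
    text: str


def recover_unbounded(history, already_logged):
    """Flawed recovery: take the most recent assistant line not yet filed.

    There is no ownership boundary, so the scan can walk past the
    current turn's question and keep going.
    """
    for item in reversed(history):
        if item.role == "assistant" and item.text not in already_logged:
            return item.text
    return None


history = [
    Item("assistant", "Thanks for calling."),       # greeting, never filed to a turn
    Item("user", "What are your opening hours?"),
    Item("assistant", "We are open nine to five."),
    Item("user", "Can I renew my permit online?"),  # interrupted; reply not saved yet
]
already_logged = {"We are open nine to five."}

print(recover_unbounded(history, already_logged))  # Thanks for calling.
```

The interrupted turn has no saved answer of its own, every earlier answer is already filed, and the only unfiled assistant line left is the greeting.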
Our first guess was wrong, and worth admitting. From the transcript alone, we blamed a phantom second session. The structured event log disproved it in minutes. There was one session. The misattributed line had been generated on time. It was just filed to the wrong turn. The lesson is plain. The transcript is the thing that lies. Confirm a visible bug against the structured log before you commit to a cause.
The fix that was not about timeouts
The transcript fix is small, and it generalizes. Recovery has to respect ownership. An answer follows its question. The greeting comes before all questions. So the scan must stop at the current turn’s own caller message and refuse to look past it.
recovered = None  # stays None if this turn has no answer of its own yet
for item in reversed(history.items):
    text = (item.text or "").strip()
    # Ownership boundary: stop at this turn's own question.
    if item.role == "user" and text == current_question:
        break
    # Skip anything that is not a non-empty, unfiled assistant line.
    if item.role != "assistant" or not text or text in already_logged:
        continue
    if text == greeting_text:
        continue  # the opening greeting is never any turn's answer
    recovered = text
    break
A short interrupted fragment gets handled separately. If it arrives within a merge window of about 6 seconds, it is merged forward into the next utterance. It is not saved as its own broken turn. That is the honest way to put a fragmented sentence back together. The wider rule is simple. Any code that recovers from a partial state by scanning backward for something unclaimed needs an ownership rule. Position is a safe signal. Content matching is not.
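The merge-forward step might look something like the sketch below. It rests on assumptions the text does not spell out: that each utterance carries start and end timestamps and an interrupted flag, and that the window is measured from the end of the fragment to the start of the next utterance. The Utterance class and merge_fragments are hypothetical names.

```python
from dataclasses import dataclass

MERGE_WINDOW_S = 6.0  # "about 6 seconds" in the text; the exact value is an assumption


@dataclass
class Utterance:
    text: str
    start_s: float
    end_s: float
    interrupted: bool = False


def merge_fragments(utterances, window_s=MERGE_WINDOW_S):
    """Merge each interrupted fragment forward into the next utterance
    when that utterance starts within window_s of the fragment's end."""
    merged = []
    pending = None  # interrupted fragment waiting for the next utterance
    for utt in utterances:
        if pending is not None:
            if utt.start_s - pending.end_s <= window_s:
                # Join the fragment onto the front of this utterance.
                utt = Utterance(
                    f"{pending.text} {utt.text}",
                    pending.start_s,
                    utt.end_s,
                    utt.interrupted,
                )
            else:
                merged.append(pending)  # too late to merge; keep it as its own turn
            pending = None
        if utt.interrupted:
            pending = utt
        else:
            merged.append(utt)
    if pending is not None:
        merged.append(pending)
    return merged


turns = merge_fragments([
    Utterance("My reference number is", 0.0, 1.8, interrupted=True),
    Utterance("four four seven nine", 3.0, 5.0),
    Utterance("Thanks", 20.0, 20.5),
])
print([t.text for t in turns])
# ['My reference number is four four seven nine', 'Thanks']
```

A fragment that is followed by nothing inside the window is kept as it is, so no caller speech is dropped.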
Why semantic turn detection is the real fix
The transcript fix stops the bug. It does not remove the reason the bug fired so often. That reason is the fixed silence timeout creating fragments. Semantic turn detection removes it.
A small model can read the partial transcript and judge intent. It can tell “mid-sentence” from “finished.” Then you no longer pack that judgment into one silence number. A caller who pauses after “my reference number is” is not done. The model knows it from the words. So the turn is not cut, and no fragment is created. A caller who finishes a full request can be committed quickly, without waiting out a long, cautious timeout. So latency drops too. The old trade, speed against fragmentation, mostly dissolves. The two jobs are no longer controlled by one dial.
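One way to picture the two dials is the sketch below. It is not LiveKit's API and not a real model. The toy_end_of_turn_probability function is a deliberately crude stand-in for the semantic model, and the threshold and both delays are hypothetical values chosen for illustration.

```python
# Words that usually signal an unfinished sentence (toy list, an assumption).
DANGLING = {"to", "is", "and", "the", "my", "a", "for"}


def toy_end_of_turn_probability(partial_transcript):
    """Crude stand-in for a semantic turn-detection model.

    A real model scores the whole partial transcript. This toy version
    only checks whether the last word leaves the sentence hanging.
    """
    words = partial_transcript.lower().rstrip(".?!").split()
    if not words or words[-1] in DANGLING:
        return 0.1
    return 0.9


def silence_needed_ms(p_done, threshold=0.5, short_ms=300, long_ms=1500):
    """Wait briefly when the model thinks the caller is done, longer when not."""
    return short_ms if p_done >= threshold else long_ms


def should_commit(partial_transcript, silence_ms):
    """Commit the turn only once silence exceeds the model-dependent wait."""
    p_done = toy_end_of_turn_probability(partial_transcript)
    return silence_ms >= silence_needed_ms(p_done)


print(should_commit("Transfer me to", 1000))         # False
print(should_commit("Transfer me to billing", 400))  # True
```

The unfinished request survives a full second of quiet, and the finished one is committed after 400 ms. Speed and sentence integrity are now set by separate values.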
LiveKit supports this directly. It ships an open-weights turn-detector model, distilled to about half a billion parameters. The model runs next to the Silero voice-activity detector. Voice-activity detection still handles speech presence and interruptions. The model supplies the semantic signal for when to commit the turn. Other 2026 options take the same shape. They fold context-aware turn detection into the speech-to-text layer. The shared idea is the one that matters. Stop deciding turns by silence alone.
Key takeaways
Turn detection is how a voice agent decides you are done. A fixed silence timeout is the version that fails quietly. That one number sets both latency and sentence integrity. Tune it for speed and you fragment real speech. On our deployment, that fragmentation reached all the way into the post-call transcript. There are two fixes, at two levels. First, give any partial-state recovery code an ownership boundary, so it cannot grab the wrong line. Second, move turn detection from a silence timer to a semantic model, so the latency-versus-fragmentation trade stops being a dilemma. The first is defensive. The second is the real upgrade.
What this means if you are an IT services firm
Ask your team one question about any voice agent you run for a client. How does it decide the caller has finished talking, and what happens to a turn that gets cut mid-sentence? If the answer is a single silence-timeout number, you have a latency-versus-fragmentation trade hiding in production. Its effects can reach places you would not expect, like the transcript a client reads after the call. Semantic turn detection is the durable fix. It is the kind of hardening that separates a voice agent that demos well from one a client can put on its main line. We covered a related edge of the same problem, the silence right before the agent speaks, in our note on masking a voice agent’s thinking latency. This kind of hardening is the substance of how we work with partners.
Reading this because a client asked for voice AI? That is the conversation we are built for. What taritas does for partners.