Field notes POST-14 8 min read

Debugging a Voice Agent's Behavior, Not Its Code: The First Live Call

You debug a voice agent's behavior by listening to a real call, not by reading its code. At taritas, a chronic-pain education agent passed every text test, then failed out loud on its first live call. It talked too long, answered its own questions, and dropped its more careful footing once a crisis moment had passed. None of that was a code bug. The fix was four edits to the system prompt: a numeric length cap, a stop rule that ends a turn at a question, a rule that keeps crisis awareness alive for the rest of the session, and one reminder repeated at the end of the prompt. We added no new branches. The one true code bug, a doctor's name the avatar spoke that belongs to no one on the project, came from a hardcoded string the prompt could not see. The lesson is to sort failures by root cause first: behavior problems belong in the prompt, and identity problems belong in the code.

By Supreet Tare

All names, numbers, and identifiers in this post are anonymized. The patterns are real.

Timeline of one voice-agent test call. A single horizontal lane marks the turns. Above the lane, green flags show what worked on the first try: a classifier caught a suicidal-ideation cue on the first turn that phrased it, and knowledge-base retrieval returned a full-confidence match. Below the lane, red flags show what failed: the avatar introduced itself with a name from old code, one answer ran long enough to draw the complaint that it was very long, the agent asked a question and answered it in the same message, and after the crisis moment it returned to full teaching mode. A side panel lists the four prompt edits from version v0.3 to v0.4, each pointing back at one red flag, under the label four edits, zero new branches.

A voice agent can pass every text test you write and still fail its first phone call. We had a chronic-pain education agent that handled written prompts cleanly. Then it took its first analyzed live call, with a trained tester playing a realistic patient, and four problems showed up in the first few minutes. None of them were visible on paper. None of them were code bugs in the usual sense. This is the story of how we sorted them, and why the fix was four edits to a prompt rather than even one new branch in the code.

The architecture in 30 seconds

The agent runs a cascade pipeline on LiveKit. Speech to text turns the caller’s audio into words. A small, fast classifier model reads each caller turn and labels it, for example crisis, education, or out of scope. That label routes the turn. Only then does the main, larger language model generate the spoken reply. An avatar layer speaks the result.

The key design choice is that safety routing sits before generation. A small classifier decides whether a turn is a crisis, and it does that independently of the model that writes the answer. So a generator mistake cannot silently swallow a safety signal. Keep that split in mind, because it is what worked on the first call.
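
A minimal sketch of that ordering, in plain Python rather than the LiveKit API. The function names, label strings, and canned replies are all hypothetical, and the post does not say whether the real system bypasses the generator on a crisis turn or passes the label along to it. This sketch assumes a bypass, only to show that the routing decision is made before any generated text exists.

```python
from typing import Callable

# Hypothetical labels and canned replies, named here only for illustration.
CRISIS = "crisis"
OUT_OF_SCOPE = "out_of_scope"
CRISIS_PROTOCOL_REPLY = "<crisis protocol with resources>"
OUT_OF_SCOPE_REPLY = "<out-of-scope redirect>"


def handle_turn(
    transcript: str,
    classify: Callable[[str], str],
    generate: Callable[[str], str],
) -> tuple[str, str]:
    """Label the caller turn first, then decide whether the generator runs."""
    label = classify(transcript)  # small, fast model; independent of the generator
    if label == CRISIS:
        # Assumption: the crisis reply is fixed and the generator is bypassed.
        return label, CRISIS_PROTOCOL_REPLY
    if label == OUT_OF_SCOPE:
        return label, OUT_OF_SCOPE_REPLY
    # Every other label reaches the larger model.
    return label, generate(transcript)
```

Because `classify` and `generate` are passed in separately, a test can prove that a crisis label never depends on what the generator would have said.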

The clues, in the order we heard them

Two things worked on the first try, and they are worth stating because they tell you where not to look.

First, the classifier caught a suicidal-ideation cue on the first turn that phrased it, and the crisis protocol fired with the correct resources. Second, knowledge-base retrieval returned a full-confidence match on the right therapy topic when the conversation went there. The safety net and the retrieval both held.

Then came the failures.

The avatar introduced itself with a doctor’s name that belongs to no one on the project. It was a legacy demo persona string, left over in older agent code. The classifier and the prompt were both clean. The name came from somewhere neither of them could see.

The answers were too long on every substantial turn. The test patient said it plainly: “very, very long answer, I need time to process it.” For a product built on trauma-informed care, where short turns are part of the clinical method, a monologue is a clinical failure, not a style note.

The agent asked a question and answered it in the same breath. The method here is Elicit-Provide-Elicit: ask what the person knows, give one piece of information, then ask what they make of it. The agent collapsed the ask and the answer into one message, so the patient never got the gap to respond.

The crisis footing did not last. Once the patient moved past the hard moment, the agent went straight back into full teaching mode, as if the disclosure had not happened.

Root cause: three problems, easily conflated

It would be easy to call all four of these the same bug. They are not. They have three different root causes, and each one lives in a different layer.

The wrong name is a configuration split-brain. The agent’s identity was defined in two places: the versioned system prompt, and a hardcoded string in the application code. The unversioned one won at runtime. This is a code and wiring problem, not a behavior problem.

The length and the collapsed turns share a cause. The model’s default behavior is to be thorough and complete. That directly opposes the clinical method, which is short turns and ask-then-wait. Voice makes it worse, because a paragraph that reads fine on a screen takes 45 seconds or more to speak. The prompt did say “be concise.” Without a number and a stop rule, the model’s helpfulness habit won anyway.
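
The speaking-time arithmetic is easy to check. The sketch below assumes a speaking rate of about 150 words per minute, which is an illustrative figure and not one measured on this project. The real rate depends on the synthesized voice and its pacing.

```python
def estimated_speaking_seconds(word_count: int, words_per_minute: float = 150.0) -> float:
    """Rough time to speak a turn aloud; 150 wpm is an assumed rate."""
    return word_count * 60.0 / words_per_minute


# Under the assumed rate, a 120-word paragraph runs close to 50 seconds,
# while an 80-word turn stays near half a minute.
long_paragraph = estimated_speaking_seconds(120)
capped_turn = estimated_speaking_seconds(80)
```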

The crisis continuity is a third kind of mistake. The prompt treated a crisis as an event to respond to, rather than a state that changes the footing for the rest of the session. The trigger fired correctly. The aftermath did not hold.

The point that matters: none of these surfaced in text testing. All three behavior issues showed up in the first minutes of live voice.

The fix: four prompt edits, zero new branches

We moved the system prompt from version v0.3 to v0.4. The design rule we set for ourselves was deliberate: no new sections, no if-then branching. We wanted to preserve the model’s own judgment rather than encode a flowchart. Four edits did it.

First, an output rule with a real number: a soft cap of about 80 words per turn, long content broken into several short turns, and one concern per turn.
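
A number in the prompt also gives reviewers something to measure in transcripts. The check below is a sketch for QA use, not runtime enforcement. The 25 percent tolerance is an assumption, added because the cap is described as soft.

```python
def over_soft_cap(turn: str, cap: int = 80, tolerance: float = 0.25) -> bool:
    """True when a turn runs past the soft cap by more than the tolerance."""
    word_count = len(turn.split())  # whitespace-separated words
    return word_count > cap * (1 + tolerance)
```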

Second, a stop rule for the turn structure:

When you reach an Elicit question, end your turn there.
Do not answer your own question in the same message.
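
The stop rule can be spot-checked in transcripts with a crude heuristic: a turn fails if any spoken text follows its last question mark. This is a sketch for QA review only. It assumes the transcript keeps punctuation, and it will also flag turns that open with a rhetorical question, so a human should read whatever it catches.

```python
import re


def answers_own_question(turn: str) -> bool:
    """Flag a turn that keeps talking after its last question mark."""
    last_question = turn.rfind("?")
    if last_question == -1:
        return False  # no question asked, so the stop rule does not apply
    trailing = turn[last_question + 1:]
    # Closing quotes or whitespace are fine; letters or digits mean more speech.
    return bool(re.search(r"[A-Za-z0-9]", trailing))
```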

Third, a rule that turns crisis from an event into a state:

Once crisis mode has been triggered, stay alert to it
for the rest of the session.
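
The fix itself lives in the prompt, but the event-versus-state distinction is easy to see in a QA helper. Given the classifier label for each caller turn, the sketch below marks every turn from the first crisis onward, so a reviewer knows which agent replies to check for footing. The label string and function name are hypothetical.

```python
def crisis_footing_by_turn(labels: list[str]) -> list[bool]:
    """Latch on the first crisis label and stay on for the rest of the session."""
    active = False
    footing = []
    for label in labels:
        active = active or label == "crisis"  # never resets within a session
        footing.append(active)
    return footing
```

An event-style check would flag only the crisis turn itself. The latch flags every turn after it as well, which is where the first call went wrong.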

Fourth, one brevity reminder added to the final reminders at the very end of the prompt. This is the part teams skip. In a long prompt, a model tends to drift from an instruction that is stated only once. We now state the brevity rule three times: at the top as a rule, in the middle as conversational discipline, and at the end as a reminder. The three-layer repeat is what makes it stick.
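
A repeat like this can disappear during a later prompt edit, so it may be worth a lint. The sketch below assumes each brevity statement carries a shared marker string, which is a hypothetical convention and not something the post describes. It also splits the prompt into rough thirds by character position, a crude stand-in for real section boundaries.

```python
def layers_with_marker(prompt: str, marker: str) -> set[str]:
    """Report which rough thirds of the prompt contain the marker."""
    names = ("top", "middle", "end")
    found: set[str] = set()
    start = prompt.find(marker)
    while start != -1:
        # Map the character offset to the first, second, or final third.
        third = min(start * 3 // max(len(prompt), 1), 2)
        found.add(names[third])
        start = prompt.find(marker, start + 1)
    return found
```

A prompt that keeps all three layers returns `{"top", "middle", "end"}`, and a smaller set shows which copy was lost.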

The classifier prompt needed no change at all. The failures were all on the generation side. Touching the classifier would only have risked the one thing that worked. The wrong name went to a code ticket instead, plus a product decision on what the avatar should actually be called.

A bonus lesson on clinical shorthand

One more thing the same project taught us, because it shows how a prompt can be confidently wrong. An early version of the prompt assumed a piece of clinical shorthand, “the four Ps,” meant one thing. The physician founder clarified that it meant something else entirely: a treatment taxonomy, not the assessment framework we had baked in. We had to rewrite a whole section of the prompt. The lesson is simple. Clinical and domain shorthand is ambiguous. Verify every acronym with the expert before it calcifies into a prompt that sounds authoritative and is wrong.

Hardening

The call’s failure modes did not just get fixed. They became permanent tests. We built a manual QA tracker with 71 scripted cases across 12 categories, plus multi-turn scenarios, with a release gate that counts critical fails. Crisis, conversational discipline, and the close are all standing categories now, grounded in the corrected prompt. A failure you found once and did not write a test for is a failure you will ship again.
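
A release gate of this kind reduces to a small counting rule. The sketch below is one way to express it, with hypothetical field names. The threshold of zero critical fails is an assumption, since the post does not give the gate's actual number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    case_id: str
    category: str   # e.g. "crisis" or "conversational discipline"
    critical: bool  # whether a fail on this case should block a release
    passed: bool


def critical_fails(results: list[CaseResult]) -> list[CaseResult]:
    """Return the cases that are both critical and failed."""
    return [r for r in results if r.critical and not r.passed]


def release_allowed(results: list[CaseResult], max_critical_fails: int = 0) -> bool:
    """Gate the release on the count of critical fails (threshold assumed)."""
    return len(critical_fails(results)) <= max_critical_fails
```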

The wrong-name bug argues for one more standing rule: a single source of truth for everything the avatar can say about itself. Identity, name, and credentials all live in the versioned prompt. Application code carries no persona strings. If runtime code can override the prompt about who the agent is, you have a leak waiting to speak.
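
One way to keep that rule from eroding is a check that scans application source for persona-like strings. The sketch below is illustrative. The two patterns are assumptions about what a leaked persona might look like, and a real project would tune them to its own naming.

```python
import re

# Hypothetical patterns for persona strings that belong in the prompt only.
PERSONA_PATTERNS = (
    re.compile(r"\bDr\.\s+[A-Z]\w+"),
    re.compile(r"\bmy name is\b", re.IGNORECASE),
)


def find_persona_strings(source: str) -> list[tuple[int, str]]:
    """Return (line number, stripped line) for each line matching a pattern."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if any(pattern.search(line) for pattern in PERSONA_PATTERNS):
            hits.append((lineno, line.strip()))
    return hits
```

Run over the agent code in CI, a non-empty result means that something other than the versioned prompt is describing who the agent is.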

Key takeaways

Your voice agent’s first live call will fail in ways no text test catches. The failures will look alike and have different roots. Length that reads fine but speaks for a minute, a question answered before the caller can respond, and a crisis disclosure treated as a one-turn event are all behavior problems, and they belong in the prompt. A wrong name that came from a hardcoded string is a wiring problem, and it belongs in the code. Sort by root cause before you touch anything. Enforce behavior with numbers and stop rules, not the word “concise.” Repeat the rules that matter at the top, the middle, and the end. And do not touch the parts that worked.

What this means if you are an IT services firm

If you run a voice agent for a client, ask your team two questions. Have we listened to a real call end to end, not just read the transcripts? And does any string in the runtime code override the versioned prompt about who the agent is or what it says? Most teams answer no to the first and are not sure about the second. The first live call is where the gap shows, and it shows in front of a real user. Building the prompt discipline, the regression tracker, and the single source of truth before that call is the difference between a demo that impresses and an agent a client can put on its main line. That is the substance of how we work with partners.

Related questions
How do you debug a voice agent's conversational behavior instead of its code?
You listen to a real call end to end. Text tests miss behavior that only shows up in speech, like answers that read fine but take a minute to say, or a question the agent answers before the caller can. Sort each failure by root cause first. Behavior failures go in the prompt. Identity or wiring failures go in the code. Then change only the layer that owns the problem.
Why does a voice agent talk too much when the prompt already says be concise?
Concise is not enforceable. Large models revert to their habit of giving thorough, complete answers. What works is a numeric soft cap, such as about 80 words per turn, plus a structural rule of one concern per turn, plus repeating the brevity instruction at the end of the prompt as well as the top. Voice makes length worse, because a paragraph that reads fine can take 45 seconds or more to speak.
What is Elicit-Provide-Elicit and why do language models break it?
It is a motivational interviewing move: ask what the person already knows, provide one piece of information, then ask what they make of it. Models break it by answering their own elicit question in the same turn, because completion pressure pushes them to finish the thought. The fix is an explicit stop rule that ends the turn at the question.
Where did the wrong name the avatar spoke come from?
From a hardcoded legacy persona string in the agent code that outlived a prompt rewrite. The versioned system prompt never named that persona. The runtime code did, and the runtime code won. The fix is to keep identity in one place, the versioned prompt, and to carry zero persona strings in application code.
Should a voice agent fix go in the prompt or the code?
Sort by root cause. Behavior failures like length, turn structure, and crisis continuity belong in the prompt. Identity leakage belongs in the code. If classification was already correct, leave the classifier alone. The discipline of not touching what worked keeps the change small and reviewable.


PROJECT taritas.com/blog
DWG POST-14
REV 1.0
DATE 2026-06-30