
Chatterbox vs Azure Dragon HD: Choosing a Voice Agent's TTS in Production

Text-to-speech is the decision that most shapes how a voice agent sounds and what it costs. At Taritas we made the same choice three times for one production agent: a managed cloud voice, then self-hosted Chatterbox for quality, then back to a managed Azure Dragon HD voice once it cleared the bar. The honest comparison is that self-hosted Chatterbox wins on control and in-region flexibility but turns a line item into a GPU operations program, while managed Azure Dragon HD wins on cost and simplicity once its quality is good enough in your region. Neither is universally right; the answer changes with your region, your scale, and how much operations load you can carry.

By Supreet Tare

All names, numbers, and identifiers in this post are anonymized. The patterns are real.

Figure: the round-trip text-to-speech decision. Stage one: a managed standard voice, rejected for sounding robotic in-region. Stage two: self-hosted Chatterbox on a GPU, which cleared quality but added operations cost. Stage three: back to managed Azure Dragon HD once it reached the region, which cut cost about 54 percent.

We made the same text-to-speech decision three times for one production voice agent, and ended up close to where we started, for completely different reasons each time. In voice AI the voice is the product, so this is not a minor config choice. It decides whether callers find the agent natural, what each minute costs, and how much infrastructure you operate. Here is the honest comparison of the two engines we ran in production, self-hosted Chatterbox and managed Azure Dragon HD, and how we chose between them.

What actually decides a voice agent’s TTS

Six things, roughly in the order they bite:

- Naturalness in your languages and your region. The voice has to sound human on a phone line, and the catalog you can actually use is constrained by where your data has to live (more on that below).
- Latency, specifically time-to-first-byte of audio, because dead air after a caller stops reads as broken.
- Streaming, because a token-streaming engine can start talking sooner.
- Cost, which is per-character for a managed API and mostly fixed GPU cost for a self-hosted model.
- Operations burden, which is near zero for managed and substantial for self-hosted.
- Control and residency, where self-hosting lets you run any model in any region you can rent a GPU.

Round one: managed, rejected on naturalness

We started on a managed cloud voice (standard neural tier). It was the obvious default: no infrastructure, an API call. The client judged the in-region voices robotic for a citizen-facing line, and that was a fair call, not fussiness. The newer, more natural HD generation existed but was not yet available in the region our client’s data-residency rules required. So the managed path, as it stood in that region, failed the one gate that matters most for voice.

The lesson worth keeping: managed quality is not uniform. It varies by voice tier and by region, and the flagship voice in the demo is not always the voice your region offers.

Round two: self-hosted Chatterbox, quality won, operations followed

So we self-hosted. Chatterbox is Resemble AI’s open text-to-speech family (the Turbo variant is roughly 350M parameters, MIT-licensed, sub-200ms inference when tuned, with emotion control and voice cloning, per Resemble’s model docs and independent write-ups). It cleared the quality bar immediately. It also converted a line item into an operations program.

Running it meant a GPU node and everything that orbits one. We measured about 3 clean concurrent streams per mid-tier GPU, so capacity planning became real. We built a Redis slot counter to admit calls against available GPU capacity, with a managed-TTS fallback for overflow. And we learned the long-uptime failure mode the hard way: the GPU server crept to its memory ceiling after about six weeks of continuous uptime and needed recycling. The audio quality was excellent and the per-minute compute looked cheap on paper (around three cents per audio-minute of variable cost), but the GPU node was the binding cost of the whole cluster, and the reliability work was ongoing.
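A minimal sketch of that admission pattern, assuming one GPU node and the three-stream figure above. The `SlotCounter` class and `choose_engine` function are hypothetical names, and the counter here lives in process memory as a stand-in; in a multi-worker deployment the same acquire and release steps would run against a shared store such as Redis.

```python
import threading


class SlotCounter:
    """In-memory stand-in for a shared slot counter (hypothetical helper)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._in_use = 0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        """Take a slot if one is free; never exceed capacity."""
        with self._lock:
            if self._in_use >= self.capacity:
                return False
            self._in_use += 1
            return True

    def release(self) -> None:
        """Give a slot back when the call ends."""
        with self._lock:
            if self._in_use > 0:
                self._in_use -= 1


def choose_engine(slots: SlotCounter) -> str:
    """Route a new call to the GPU if a slot is free, else to the fallback."""
    return "self_hosted" if slots.try_acquire() else "managed_fallback"


STREAMS_PER_GPU = 3  # figure measured in the post
GPU_NODES = 1        # ASSUMPTION for this example

slots = SlotCounter(capacity=STREAMS_PER_GPU * GPU_NODES)
routes = [choose_engine(slots) for _ in range(4)]
print(routes)
```

The fourth call overflows to the managed path instead of degrading the three streams already on the GPU. Releasing the slot in a `finally` block at call teardown matters in practice, because a leaked slot silently shrinks capacity.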

Self-hosting an open model is the right answer when quality forces it and you can carry the operations. It is not free just because the model is.

Round three: back to managed Azure Dragon HD

Then the HD generation (Dragon HD) reached our client’s required region. We re-auditioned it, it cleared the quality bar, and the economics flipped: dropping the GPU node cut cost about 54 percent, because we removed the binding constraint and the entire operations subsystem with it. Azure’s HD voices are billed at about $22 per 1M characters as of early 2026 (per Azure Speech pricing, reduced from $30 in March of that year), and the HD catalog expanded to more regions in 2026, which is what made the move possible at all.
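To put a per-character price next to per-minute figures, it helps to convert it. The sketch below does that arithmetic with the $22 figure above; the speaking rate of 900 characters per audio-minute is an assumption for illustration (roughly 150 words per minute at about six characters per word, spaces included), not a number from our deployment.

```python
HD_PRICE_PER_MILLION_CHARS = 22.00  # USD, the figure quoted above
CHARS_PER_AUDIO_MINUTE = 900        # ASSUMPTION: depends on language and pacing


def managed_cost_per_audio_minute(price_per_million: float,
                                  chars_per_minute: float) -> float:
    """Convert a per-character price into cost per synthesized audio-minute."""
    return price_per_million * chars_per_minute / 1_000_000


cost = managed_cost_per_audio_minute(HD_PRICE_PER_MILLION_CHARS,
                                     CHARS_PER_AUDIO_MINUTE)
print(f"${cost:.4f} per synthesized audio-minute")
```

This covers only the minutes the agent spends speaking, not whole call duration, and the result moves linearly with the assumed speaking rate, so measure characters per minute on your own transcripts before relying on it.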

The move was not free of catches. The LiveKit Azure TTS plugin is non-streaming (issue #4714), so it synthesizes before audio starts, which raises time-to-first-byte. It is manageable rather than disqualifying: the framework chunks by sentence so audio begins after the first sentence, our responses are one or two sentences by design, and narrated lookups (“let me check that”) cover the gap callers actually perceive. The rule we set was to ship Dragon HD, measure TTFB on real calls, and only move a streaming engine onto the critical path if the numbers demanded it. The other upgrade: SSML. We replaced about 150 lines of hand-tuned letter-by-letter spelling with SSML say-as for emails, URLs, acronyms, and phone numbers, plus break for pacing and a calm express-as style, which is cleaner and more dependable than the phonetic hack the self-hosted setup had required.
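As an illustration of the SSML approach, here is a small sketch that assembles a document with `say-as`, `break`, and `mstts:express-as`. The voice name is a placeholder, the helper names are hypothetical, and whether a given voice honors a particular `interpret-as` value or the `calm` style is something to verify against the voice's documentation and by ear.

```python
from xml.sax.saxutils import escape, quoteattr

# ASSUMPTION: placeholder voice name; use one your region's catalog offers.
VOICE_NAME = "en-US-Example:DragonHDLatestNeural"


def spell_out(text: str) -> str:
    """Ask the engine to read text character by character (acronyms, codes)."""
    return f'<say-as interpret-as="characters">{escape(text)}</say-as>'


def telephone(number: str) -> str:
    """Mark a phone number so it is read digit-wise rather than as a quantity."""
    return f'<say-as interpret-as="telephone">{escape(number)}</say-as>'


def pause(ms: int) -> str:
    """Insert an explicit pause for pacing."""
    return f'<break time="{ms}ms"/>'


def build_ssml(body: str, style: str = "calm", lang: str = "en-US") -> str:
    """Wrap already-marked-up body text in speak, voice, and express-as."""
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang={quoteattr(lang)}>'
        f'<voice name={quoteattr(VOICE_NAME)}>'
        f'<mstts:express-as style={quoteattr(style)}>{body}</mstts:express-as>'
        '</voice></speak>'
    )


body = (
    escape("Your reference is ") + spell_out("AB12") + pause(300)
    + escape("You can reach the office at ") + telephone("555-0100") + escape(".")
)
ssml = build_ssml(body)
print(ssml)
```

Escaping every piece of caller-derived text before it goes into the markup is the part worth keeping even if the helpers change, since an unescaped `&` or `<` in a name or address makes the whole request invalid.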

Honest current state: production still runs Chatterbox while we validate Dragon HD in development, and we go to production with it shortly. The decision is made; the rollout is measured, not flipped.

The comparison, side by side

| | Self-hosted Chatterbox | Managed Azure Dragon HD |
| --- | --- | --- |
| Type | Open model (Resemble AI), MIT-licensed | Managed API |
| Naturalness | Cleared our bar | Cleared our bar (HD); standard tier did not |
| Latency / TTFB | Sub-200ms inference, you control the path | Low synth, but non-streaming plugin raises TTFB (#4714) |
| Streaming | Yes, self-managed | Non-streaming via the LiveKit plugin |
| Cost model | GPU node + operations (mostly fixed) | Per character (about $22 / 1M chars, HD) |
| Region / residency | Any region you can run a GPU | Limited to the per-region voice catalog |
| Operations burden | High: capacity, leaks, autoscale, fallback | Low |
| Best when | Quality gap, scale to amortize the GPU, or residency forces in-region | It clears quality in your region and you would rather not run GPUs |

If TTFB on a non-streaming managed voice is the problem, the third option is a streaming managed engine like Cartesia (Sonic, around 90ms time-to-first-audio) or ElevenLabs (Flash, around 75ms) on the latency-critical path, reserving the HD voice for where it matters.

How to choose, and the meta-lesson

Self-host when a managed tier cannot meet your quality in your region, you have the volume to amortize a GPU, or residency rules force the model in-region and the managed catalog there is thin. Go managed when a voice clears quality in your region and you would rather not operate GPUs. And measure naturalness and TTFB on real calls before committing, because both are ear-and-data questions, not spec-sheet ones.
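Measuring TTFB does not need much tooling. The sketch below times the gap between requesting audio and receiving the first non-empty chunk, for any engine that exposes its output as an iterable of byte chunks. `fake_engine` is a hypothetical stand-in that sleeps and then emits silence; on real calls you would wrap the actual TTS stream and log the result per utterance.

```python
import time
from typing import Iterable, Iterator, Optional, Tuple


def measure_ttfb(audio_chunks: Iterable[bytes]) -> Tuple[Optional[float], int]:
    """Return (seconds until the first non-empty chunk, total bytes received)."""
    start = time.perf_counter()
    ttfb: Optional[float] = None
    total = 0
    for chunk in audio_chunks:
        if chunk and ttfb is None:
            ttfb = time.perf_counter() - start
        total += len(chunk)
    return ttfb, total


def fake_engine(first_chunk_delay_s: float) -> Iterator[bytes]:
    """Hypothetical engine: waits, then emits three 320-byte chunks."""
    time.sleep(first_chunk_delay_s)
    for _ in range(3):
        yield b"\x00" * 320


ttfb, total = measure_ttfb(fake_engine(0.05))
print(f"TTFB {ttfb * 1000:.0f} ms, {total} bytes")
```

Start the timer at the point the caller would perceive silence beginning (end of their speech, or the moment text is handed to TTS, depending on what you want to isolate), and look at percentiles rather than the mean, since the slow tail is what callers notice.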

The meta-lesson is the one our three rounds taught: managed versus self-hosted is not a one-time decision. It is a loop you re-enter every time the managed catalog ships a new generation or reaches a new region, and every time your own GPU bill compounds. The expensive mistake is letting an old verdict (“that managed voice sounded robotic”) quietly veto a re-audition a year later.

What this means if you are an IT services firm

If you are reselling voice AI to a client, the voice is the part they judge first and the part procurement constrains hardest, through data-residency rules that decide which engines you can even use. Being able to say “here is the voice, here is where it runs, here is what it costs, and here is the plan if the residency-compliant option is not good enough yet” is what separates a credible build from a demo. That readiness is a large part of how we work with partners.

Related questions
Is self-hosted TTS cheaper than a managed API?
Only at the right scale, and only if you count the full cost. A self-hosted open model like Chatterbox has no per-character fee, but it needs a GPU node plus the operations around it: capacity planning, memory-leak handling, autoscaling, and a fallback path. On our deployment the GPU was the binding cost. When a managed HD voice later cleared our quality bar, dropping the GPU cut cost about 54 percent. Self-hosting pays off when you have enough volume to amortize the node and a reason the managed tier cannot meet.
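A rough way to see where "the right scale" sits is a break-even calculation. The sketch below finds the monthly character volume at which a managed per-character bill equals a fixed GPU bill. The $1,100 monthly GPU figure is purely hypothetical, and the calculation deliberately ignores operations labor and the self-hosted variable cost, both of which push the real break-even higher.

```python
def break_even_chars_per_month(gpu_fixed_cost_per_month: float,
                               price_per_million_chars: float) -> float:
    """Monthly characters at which managed spend equals the fixed GPU cost."""
    return gpu_fixed_cost_per_month / price_per_million_chars * 1_000_000


GPU_FIXED_COST = 1100.0  # USD per month, HYPOTHETICAL figure for illustration
HD_PRICE = 22.0          # USD per 1M characters, the figure quoted in the post

chars = break_even_chars_per_month(GPU_FIXED_COST, HD_PRICE)
print(f"{chars:,.0f} characters per month")
```

Below that volume the managed voice is cheaper on compute alone; above it, self-hosting starts to pay for the node, but only if the engineering time to run it is also accounted for.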
Is Chatterbox good enough for a production voice agent?
Yes. Chatterbox is Resemble AI's open text-to-speech family (the Turbo variant is about 350M parameters, MIT-licensed, sub-200ms inference when tuned, with emotion control and voice cloning). It cleared our quality bar for a citizen-facing line when a managed standard voice did not. The cost is operational, not quality: you run it on a GPU and own the reliability work that comes with a long-running model server.
Why did the managed Azure voices sound robotic at first?
Voice quality on managed platforms varies by voice tier and by region. The standard neural voices available in our client's required region did not sound natural enough for a citizen-facing line. The newer HD generation (Dragon HD) is a different product and cleared the bar, but it was not yet available in that region when we first needed it, which is what pushed us to self-host.
Does Azure text-to-speech support streaming?
The LiveKit Azure TTS plugin is non-streaming (tracked as LiveKit Agents issue #4714), so it synthesizes a unit of text before audio starts, which raises time-to-first-byte. It is manageable: the framework chunks by sentence, so audio starts after the first sentence, and short one-or-two-sentence responses plus narrated lookups cover most of the gap. Measure TTFB after switching; if it is too high, put a streaming TTS like Cartesia or ElevenLabs on the latency-critical path.
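To see why sentence chunking helps, here is a simplified sketch of the idea, not LiveKit's implementation: split the reply at sentence boundaries and synthesize each piece separately, so playback can begin once the first sentence is ready. `fake_synthesize` is a hypothetical stand-in for a blocking, non-streaming TTS request, and the regex splitter is naive (it will break on abbreviations such as "Dr.").

```python
import re
from typing import Iterator

# Split after ., ! or ? when followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences(text: str) -> Iterator[str]:
    """Yield sentences in order, skipping empty fragments."""
    for part in _SENTENCE_END.split(text.strip()):
        if part:
            yield part


def fake_synthesize(sentence: str) -> bytes:
    """Hypothetical blocking TTS call; returns bytes standing in for audio."""
    return sentence.encode("utf-8")


def speak(text: str) -> Iterator[bytes]:
    """Synthesize one sentence at a time so the first can play early."""
    for sentence in sentences(text):
        yield fake_synthesize(sentence)


reply = "Let me check that. Your appointment is on Monday at ten."
chunks = list(speak(reply))
print(len(chunks), chunks[0])
```

With this shape, time-to-first-byte is bounded by the length of the first sentence rather than the whole reply, which is why a short opening sentence does so much of the work.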
Should you use SSML with a voice agent?
It helps once you are on a managed engine that supports it. We replaced about 150 lines of hand-tuned letter-by-letter spelling with SSML say-as for emails, URLs, acronyms, and phone numbers, and used break and a calm express-as style for pacing and tone. It is cleaner and more reliable than phonetic re-spelling, but the exact pause timings and styles need tuning by ear on real calls.

Reading this because a client asked for voice AI? That is the conversation we are built for. What taritas does for partners.

PROJECT taritas.com/blog
DWG POST-6
REV 1.0
DATE 2026-06-18