Why We Rejected the Realtime Voice API: Unit Economics From a Live Deployment
Because a realtime speech-to-speech API costs more per minute than the customer pays. At taritas we measured a realtime API at about 25 to 35 cents per minute against a standard price of 18 cents per minute, while our cascade pipeline runs near 3 cents per minute in variable cost. A voice agent is priced by the minute, so any architecture whose variable cost per minute exceeds the price cannot scale into profit. The cascade also carries a fixed infrastructure base of about 1,900 Canadian dollars a month, which amortizes from 2.28 dollars per minute at one tenant down to about 26 cents at twenty-five, and drops below the 18 cent price as still more customers share it. That is why the decision is reversible rather than ideological: if realtime pricing ever drops below the price line, the same arithmetic flips. Capability was never the blocker. Arithmetic was.
By Supreet Tare
All names, numbers, and identifiers in this post are anonymized. The patterns are real.
We rejected the most-hyped voice AI architecture of 2026 with one line of arithmetic. The loudest answer in the industry was “use a speech-to-speech realtime model”: collapse the whole stack into a single API that hears audio and speaks audio. We ran the numbers from a live deployment and kept the older, less fashionable design instead. Here is the math, because it is the part nobody publishes.
The one line that decided it
A realtime speech-to-speech API cost about 25 to 35 cents per minute in mid-2026, depending on the provider. Our standard price to the end customer was 18 cents per minute. The architecture would have cost more than the revenue each minute carried.
That is the entire decision. No amount of scale fixes a variable cost that sits above the price, because every additional minute loses money. Capability was not the question. A realtime model can absolutely hold a conversation. The question was whether we could sell its minutes for more than they cost, and we could not.
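As a minimal sketch, the one-line check is price minus variable cost, per minute. The figures are the ones quoted above, held in whole cents to avoid floating-point noise; the function and constant names are illustrative, not part of any real system.

```python
# All amounts are cents per minute, taken from the figures quoted above.
PRICE_CENTS = 18
REALTIME_COST_CENTS = (25, 35)  # low and high end of the quoted range


def margin_per_minute(price_cents: int, cost_cents: int) -> int:
    """Contribution per minute: what a minute earns minus what it costs."""
    return price_cents - cost_cents


realtime_margins = [margin_per_minute(PRICE_CENTS, c) for c in REALTIME_COST_CENTS]
print(realtime_margins)  # [-7, -17]
```

Both ends of the range are negative, so each extra minute adds a loss rather than a profit.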
The 30-second architecture
There are two ways to build a phone agent.
A cascade pipeline runs three separate stages: speech-to-text turns the caller’s audio into words, a language model decides what to say, and text-to-speech turns the reply back into audio. Each stage is a swappable component you can price, measure, and replace on its own.
A realtime speech-to-speech model collapses all three into one API that takes audio in and gives audio out. It is simpler to wire up and it removes the seams between stages. It is also priced per audio-minute as a single meter you do not control.
We kept the cascade. The reason is not nostalgia. It is the cost structure underneath each option.
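The "swappable component" idea can be sketched as three small interfaces and a pipeline that only knows about the interfaces. This is an illustration under assumptions, not the production design: every class and method name here is hypothetical, and the stand-in stages do no real speech processing.

```python
from dataclasses import dataclass
from typing import Protocol


class SpeechToText(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class LanguageModel(Protocol):
    def reply(self, text: str) -> str: ...


class TextToSpeech(Protocol):
    def synthesize(self, text: str) -> bytes: ...


@dataclass
class Cascade:
    """Three independent stages; each can be priced, measured, and replaced alone."""
    stt: SpeechToText
    llm: LanguageModel
    tts: TextToSpeech

    def handle_turn(self, audio: bytes) -> bytes:
        text = self.stt.transcribe(audio)   # stage 1: audio -> words
        answer = self.llm.reply(text)       # stage 2: decide what to say
        return self.tts.synthesize(answer)  # stage 3: words -> audio


# Stand-in stages so the sketch runs without any provider SDK.
class EchoSTT:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")


class CannedLLM:
    def reply(self, text: str) -> str:
        return f"You said: {text}"


class BytesTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


class ShoutingTTS:
    def synthesize(self, text: str) -> bytes:
        return text.upper().encode("utf-8")


agent = Cascade(EchoSTT(), CannedLLM(), BytesTTS())
first = agent.handle_turn(b"hello")

# Replacing one stage leaves the other two untouched.
agent.tts = ShoutingTTS()
second = agent.handle_turn(b"hello")
```

A realtime speech-to-speech API would replace all of `handle_turn` with a single call, which is simpler but leaves no per-stage seam to price or swap.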
What does a cascade pipeline voice agent actually cost per minute?
From a 30-day billing window on the live cluster, the cascade pipeline’s variable cost was about 3 cents per audio-minute. That is speech-to-text, the language model, text-to-speech, and the variable share of infrastructure, added up per minute of conversation.
On top of that sits fixed infrastructure: about 1,900 Canadian dollars a month for the single-tenant cluster, most of it compute, with no reserved-capacity discounts at the time. Fixed cost behaves very differently from variable cost, and that difference is the whole game.
Cost per minute, cascade pipeline:
variable cost ~$0.03 / min (scales with usage)
fixed infra base ~$1,872 CAD / month (does NOT scale with usage)
Blended cost per minute = fixed / minutes served + variable:
1 tenant -> ~$2.28 / min fixed cost spread over few minutes
25 tenants -> ~$0.26 / min fixed cost spread over many minutes
price line at $0.18 / min sits below both: the crossover comes past 25 tenants
At one tenant the blended cost is 2.28 dollars per minute, because a large fixed bill is divided over a small number of minutes. At twenty-five tenants on the same cluster it falls to about 26 cents per minute. That is still above the 18 cent price line, so on these figures the business turns profitable only once the fixed cost is spread over more minutes than twenty-five tenants supply.
Margin at the standard price exists for two reasons working together: variable cost is only about 3 cents, and fixed cost spreads thinner with every customer added. That leaves roughly 15 cents per minute of contribution before fixed costs.
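The blended figure can be reproduced with a short sketch: the fixed bill divided by minutes served, plus the per-minute variable cost. Two assumptions are worth flagging. The post does not state monthly minute volumes, so the sketch back-computes them from the quoted blended figures. And only the fixed bill is labeled CAD, so the sketch treats every amount as being in the same currency. Function names are illustrative.

```python
FIXED_PER_MONTH = 1872.0  # fixed infrastructure base per month (quoted in CAD)
VARIABLE_PER_MIN = 0.03   # variable cost per audio-minute


def blended_cost_per_minute(fixed: float, variable: float, minutes: float) -> float:
    """Fixed cost spread over the minutes served, plus variable cost per minute."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return fixed / minutes + variable


def implied_minutes(fixed: float, variable: float, blended: float) -> float:
    """Invert the formula: monthly minutes implied by a quoted blended figure."""
    if blended <= variable:
        raise ValueError("blended cost must exceed variable cost")
    return fixed / (blended - variable)


# Monthly volumes implied by the two quoted points (assumption: same currency).
minutes_at_1_tenant = round(implied_minutes(FIXED_PER_MONTH, VARIABLE_PER_MIN, 2.28))
minutes_at_25_tenants = round(implied_minutes(FIXED_PER_MONTH, VARIABLE_PER_MIN, 0.26))
print(minutes_at_1_tenant, minutes_at_25_tenants)  # 832 8139
```

Under those assumptions the one-tenant figure corresponds to roughly 830 minutes a month and the twenty-five-tenant figure to roughly 8,100.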
Can scale make a realtime voice API profitable?
Now apply the same lens to the realtime option. Its 25 to 35 cents is almost entirely variable: it is a per-minute meter that charges the same whether you serve one customer or a thousand. There is no large fixed component to amortize, which sounds like an advantage until you notice it is the opposite.
A mostly-fixed cost falls per minute as volume grows. A mostly-variable cost above the price line stays above the price line forever. The realtime API converts our amortizable fixed cost into an unamortizable variable one that already exceeds revenue. It can never spread its way to profit. That is why the rejection is not ideological: it is a direct consequence of where the cost is fixed and where it is variable.
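The difference between the two cost structures shows up as whether a break-even volume exists at all. The sketch below works in whole cents and assumes the price and the fixed bill share a currency; treating the realtime option as having zero fixed cost is also an assumption, standing in for "almost entirely variable". The function name is hypothetical.

```python
from typing import Optional


def break_even_minutes(
    fixed_cents: int, variable_cents: int, price_cents: int
) -> Optional[float]:
    """Monthly minutes at which revenue covers fixed plus variable cost.

    Returns None when each minute loses money, because no volume fixes that.
    """
    margin = price_cents - variable_cents
    if margin <= 0:
        return None
    return fixed_cents / margin


# Cascade: 1,872.00 fixed per month, 3 cents variable, 18 cent price.
cascade = break_even_minutes(187_200, 3, 18)
# Realtime at the low end of the quoted range, fixed cost assumed to be zero.
realtime = break_even_minutes(0, 25, 18)
print(cascade, realtime)  # 12480.0 None
```

The cascade has a finite target to grow into; the realtime option has none while its per-minute cost stays above the price.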
Model choice runs the same logic one level down
The same arithmetic chooses the language model. The agent runs a mini-class model, roughly a third of the cost of the full model, protected by deterministic fast-paths and canned responses for known failure modes so that the cheap model never has to be clever in the moments that matter.
To test whether a newer model was worth it, we ran a three-way experiment: the mini baseline, a next-generation mini at minimal reasoning effort (newly available in the Canadian region, which matters for data residency), and the full model. The gate had a hard rule before quality even entered: the next-generation model had to run with reasoning effectively off.
Observed response latency, reasoning enabled: 45 to 120 s (unusable for voice)
Required mode for voice: reasoning_effort = minimal
Main model time-to-first-token, production: ~1.9 s (sample: 1,936 ms)
With reasoning on, responses took 45 to 120 seconds. A phone call has a budget of a second or two before silence reads as a dropped line. Reasoning models are only viable in voice at a minimal-effort setting, which removes most of the reason to use one. We gate on latency mode first, then cost, then quality, in that order, because that is the order in which a phone call actually fails.
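The gate order can be sketched as a function that returns the first gate a candidate fails, checked in the order a call breaks. The thresholds, the single `quality_score`, and all names are hypothetical placeholders; only the 45 second and 1,936 ms latencies come from the figures above, and the 2.0 second budget is an assumed reading of "a second or two".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    latency_seconds: float   # observed response latency
    cost_per_minute: float   # variable cost per audio-minute
    quality_score: float     # hypothetical aggregate score, 0.0 to 1.0


def first_failed_gate(
    c: Candidate,
    *,
    latency_budget_s: float,
    max_cost_per_minute: float,
    min_quality: float,
) -> Optional[str]:
    """Check latency, then cost, then quality; report the first failure."""
    if c.latency_seconds > latency_budget_s:
        return "latency"
    if c.cost_per_minute > max_cost_per_minute:
        return "cost"
    if c.quality_score < min_quality:
        return "quality"
    return None


# Assumed limits for illustration only.
limits = dict(latency_budget_s=2.0, max_cost_per_minute=0.03, min_quality=0.8)

reasoning_on = Candidate("next-gen mini, reasoning on", 45.0, 0.02, 0.95)
production = Candidate("mini baseline", 1.936, 0.03, 0.85)

print(first_failed_gate(reasoning_on, **limits))  # latency
print(first_failed_gate(production, **limits))    # None
```

A candidate that fails on latency is rejected before its cost or quality is even considered, which keeps slow but impressive models out of the comparison.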
How should you decide between a realtime API and a cascade pipeline?
Out of all of this came one rule we now apply to every component: pick the cheapest model or architecture that clears three criteria. It has to sound natural and respond fast, it has to cost the same or less than what it replaces, and a majority of quality metrics have to improve. Architecture follows price-per-minute. Price-per-minute is set by the market, deliberately positioned against hosted-platform pricing, not by our internal costs.
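One way to write the rule down is as a filter followed by a minimum. This is a sketch under assumptions: "sounds natural and responds fast" is reduced to a boolean someone has already judged, quality metrics are expressed as deltas against the incumbent where positive means better, and every name and number below is hypothetical.

```python
from dataclasses import dataclass
from typing import Mapping, Optional, Sequence


@dataclass(frozen=True)
class Option:
    name: str
    cost_per_minute: float
    natural_and_fast: bool
    quality_deltas: Mapping[str, float]  # metric -> change vs. incumbent


def clears_bar(option: Option, incumbent_cost: float) -> bool:
    """All three criteria: natural and fast, no more expensive, majority improved."""
    improved = sum(1 for delta in option.quality_deltas.values() if delta > 0)
    majority_improved = improved * 2 > len(option.quality_deltas)
    return (
        option.natural_and_fast
        and option.cost_per_minute <= incumbent_cost
        and majority_improved
    )


def pick(options: Sequence[Option], incumbent_cost: float) -> Optional[Option]:
    """Cheapest option that clears the bar, or None to keep the incumbent."""
    passing = [o for o in options if clears_bar(o, incumbent_cost)]
    return min(passing, key=lambda o: o.cost_per_minute, default=None)


# Illustrative candidates; none of these numbers are measurements.
options = [
    Option("a", 0.010, True, {"task_success": 0.02, "barge_in": 0.01, "handoff": -0.01}),
    Option("b", 0.005, True, {"task_success": 0.02, "barge_in": -0.01, "handoff": -0.01}),
    Option("c", 0.030, True, {"task_success": 0.05, "barge_in": 0.04, "handoff": 0.03}),
]
chosen = pick(options, incumbent_cost=0.012)
```

Here `b` is cheapest but improves only one metric in three, and `c` improves everything but costs more than the incumbent, so the rule selects `a`.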
Hardening: keeping the decision honest over time
A price-dependent decision has to be revisited when prices move:
- Re-run the realtime-versus-cascade arithmetic whenever provider pricing shifts by more than about 2x. The conclusion is arithmetic, not doctrine, and it will flip if realtime pricing drops below the price line.
- Keep every component swappable, which is the real advantage of the cascade. The text-to-speech stage has already been replaced twice; the language model is a one-line config change behind the experiment harness.
- Track variable cost per minute as a first-class production metric, sitting right next to latency. A cost regression ships as silently as a latency regression if nobody watches it.
- Do not benchmark models on quality alone. Gate on latency mode, then cost, then quality, the order that matches how a phone call breaks.
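Two of the checks above reduce to small predicates that can run next to existing latency alerts. The sketch below is illustrative: the function names are hypothetical, the factor of two mirrors the "about 2x" rule of thumb, and the 10 percent regression tolerance is an assumed value, not one from the deployment.

```python
def pricing_shift_warrants_rerun(
    old_price: float, new_price: float, factor: float = 2.0
) -> bool:
    """True when provider pricing moved by more than `factor`, in either direction."""
    if old_price <= 0 or new_price <= 0:
        raise ValueError("prices must be positive")
    ratio = max(old_price, new_price) / min(old_price, new_price)
    return ratio > factor


def variable_cost_per_minute(total_variable_cost: float, minutes: float) -> float:
    """The production metric: variable spend divided by audio-minutes served."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return total_variable_cost / minutes


def cost_regressed(current: float, baseline: float, tolerance: float = 0.10) -> bool:
    """True when cost per minute exceeds the baseline by more than the tolerance."""
    return current > baseline * (1 + tolerance)


# A realtime price falling from 0.30 to 0.14 per minute is worth a fresh look.
rerun = pricing_shift_warrants_rerun(0.30, 0.14)
# A move from 0.03 to 0.04 per minute of variable cost should raise an alert.
alert = cost_regressed(variable_cost_per_minute(40.0, 1000), baseline=0.03)
```

Neither predicate makes the decision; each only flags that the arithmetic should be run again with current numbers.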
Key takeaways
A voice agent is a per-minute business, so every architectural choice is a per-minute number compared against the price, not a question of which option is more capable. The realtime API failed on unit economics, not capability. The cascade pipeline’s 3 cent variable cost is what turns an 18 cent price into a business, and its swappable stages are what let the decision change cleanly when the market does.
What this means if you are an IT services firm
When a client asks “should we use the realtime APIs everyone is demoing,” the correct first artifact is one line, not a proof of concept: the provider’s cost per minute against the per-minute value of the call. Then ask your team three questions. What is our variable cost per minute? What does the client pay, or save, per minute? At what volume or tenant count does fixed cost amortize below that spread?
If those three numbers are not known, the architecture conversation is decoration. Knowing them is most of what separates a voice AI engagement that makes money from one that quietly loses it, and it is the substance of how we work with partners.
Reading this because a client asked for voice AI? That is the conversation we are built for.