
Why We Rejected the Realtime Voice API: Unit Economics From a Live Deployment

Because a realtime speech-to-speech API costs more per minute than the price the customer pays. At taritas we measured a realtime API at about 25 to 35 cents per minute against a standard price of 18 cents per minute, while our cascade pipeline runs near 3 cents per minute in variable cost. A voice agent is priced by the minute, so any architecture that costs more per minute than the line charges cannot scale into profit. The cascade also carries a fixed infrastructure base of about 1,900 Canadian dollars a month, which amortizes from 2.28 dollars per minute at one tenant down to about 26 cents at twenty-five, and keeps falling toward the 18 cent price as more customers share it. That is why the decision is reversible rather than ideological: if realtime pricing ever drops below the line, the same arithmetic flips. Capability was never the blocker. Arithmetic was.

Published · Updated · Supreet Tare

All names, numbers, and identifiers in this post are anonymized. The patterns are real.

A two-panel voice AI unit economics diagram. The left panel is a bar chart of cost per minute: a realtime API band at 25 to 35 cents sits above an 18 cent price line and is marked rejected, while a cascade pipeline variable cost of about 3 cents sits far below the price line and is marked kept. The right panel is a blended-cost curve falling from 2.28 dollars per minute at one tenant to 26 cents at twenty-five tenants, crossing the 18 cent price line where fixed costs amortize.

We rejected the most-hyped voice AI architecture of 2026 with one line of arithmetic. The loudest answer in the industry was “use a speech-to-speech realtime model”: collapse the whole stack into a single API that hears audio and speaks audio. We ran the numbers from a live deployment and kept the older, less fashionable design instead. Here is the math, because it is the part nobody publishes.

The one line that decided it

A realtime speech-to-speech API cost about 25 to 35 cents per minute in mid-2026, depending on the provider. Our standard price to the end customer was 18 cents per minute. The architecture would have cost more than the revenue each minute carried.

That is the entire decision. No amount of scale fixes a variable cost that sits above the price, because every additional minute loses money. Capability was not the question. A realtime model can absolutely hold a conversation. The question was whether we could sell its minutes for more than they cost, and we could not.
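
The one-line check can be written down directly. This is a minimal sketch using only the figures quoted above; the function and constant names are illustrative, not taken from any production code.

```python
PRICE_PER_MIN = 0.18               # standard price to the end customer (from the post)
REALTIME_COST_BAND = (0.25, 0.35)  # realtime API cost per minute, low and high (from the post)


def margin_per_minute(price: float, variable_cost: float) -> float:
    """Contribution per minute before any fixed cost, in dollars."""
    return price - variable_cost


# Evaluate both ends of the quoted realtime band.
best_case = margin_per_minute(PRICE_PER_MIN, REALTIME_COST_BAND[0])
worst_case = margin_per_minute(PRICE_PER_MIN, REALTIME_COST_BAND[1])
print(f"realtime margin per minute: {worst_case:+.2f} to {best_case:+.2f}")
```

Both ends of the band come out negative, which is the whole argument: the sign of this number does not depend on volume.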

The 30-second architecture

There are two ways to build a phone agent.

A cascade pipeline runs three separate stages: speech-to-text turns the caller’s audio into words, a language model decides what to say, and text-to-speech turns the reply back into audio. Each stage is a swappable component you can price, measure, and replace on its own.
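
One way to picture "swappable" is as three small interfaces behind a single turn handler. The sketch below is illustrative only: the class and method names are assumptions for this example, and the stub stages stand in for real providers.

```python
from typing import Protocol


class SpeechToText(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class LanguageModel(Protocol):
    def reply(self, text: str) -> str: ...


class TextToSpeech(Protocol):
    def synthesize(self, text: str) -> bytes: ...


class CascadePipeline:
    """Runs one conversational turn through three independent stages."""

    def __init__(self, stt: SpeechToText, llm: LanguageModel, tts: TextToSpeech) -> None:
        self.stt, self.llm, self.tts = stt, llm, tts

    def handle_turn(self, audio: bytes) -> bytes:
        text = self.stt.transcribe(audio)    # stage 1: audio -> words
        answer = self.llm.reply(text)        # stage 2: decide what to say
        return self.tts.synthesize(answer)   # stage 3: words -> audio


# Stub stages (hypothetical) so the sketch runs without any external service.
class StubSTT:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")


class StubLLM:
    def reply(self, text: str) -> str:
        return text.upper()


class StubTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


pipeline = CascadePipeline(StubSTT(), StubLLM(), StubTTS())
result = pipeline.handle_turn(b"hello")
```

Because each stage only has to satisfy its own interface, replacing one provider means passing a different object to the constructor and nothing else.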

A realtime speech-to-speech model collapses all three into one API that takes audio in and gives audio out. It is simpler to wire up and it removes the seams between stages. It is also priced per audio-minute as a single meter you do not control.

We kept the cascade. The reason is not nostalgia. It is the cost structure underneath each option.

What does a cascade pipeline voice agent actually cost per minute?

From a 30-day billing window on the live cluster, the cascade pipeline’s variable cost was about 3 cents per audio-minute. That is speech-to-text, the language model, text-to-speech, and the variable share of infrastructure, added up per minute of conversation.

On top of that sits fixed infrastructure: about 1,900 Canadian dollars a month for the single-tenant cluster, most of it compute, with no reserved-capacity discounts at the time. Fixed cost behaves very differently from variable cost, and that difference is the whole game.

Cost per minute, cascade pipeline:

  variable cost                      ~$0.03 / min   (scales with usage)
  fixed infra base            ~$1,872 CAD / month   (does NOT scale with usage)

Blended cost per minute = fixed / minutes served + variable cost per minute:

  1 tenant     ->  ~$2.28 / min     fixed cost spread over few minutes
  25 tenants   ->  ~$0.26 / min     fixed cost spread over many minutes
                       ^ still above the $0.18 price line: the crossover comes as more minutes share the fixed base

At one tenant the blended cost is 2.28 dollars per minute, because a large fixed bill is divided over a small number of minutes. At twenty-five tenants on the same cluster it falls to about 26 cents per minute. That is still above the 18 cent price line, so on these figures the crossover into profit comes after that point, as the fixed cost keeps amortizing across more customers and minutes.

Margin at the standard price exists for two reasons working together: variable cost is only about 3 cents, and fixed cost spreads thinner with every customer added. That leaves roughly 15 cents per minute of contribution before fixed costs.
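
The amortization can be sketched in a few lines. This uses the post's figures and makes one loud assumption: the fixed base, the variable cost, and the price are treated as if they were in the same currency, which the post does not state. Function names are illustrative.

```python
FIXED_PER_MONTH = 1872.0   # fixed infra base per month (from the post; currency mixing is an assumption)
VARIABLE_PER_MIN = 0.03    # variable cost per audio-minute (from the post)
PRICE_PER_MIN = 0.18       # standard price per minute (from the post)


def blended_cost(minutes: float) -> float:
    """Cost per minute once the fixed base is spread over `minutes` served in a month."""
    return FIXED_PER_MONTH / minutes + VARIABLE_PER_MIN


def implied_minutes(blended: float) -> float:
    """Monthly minutes that would produce a given blended cost per minute."""
    return FIXED_PER_MONTH / (blended - VARIABLE_PER_MIN)


def break_even_minutes() -> float:
    """Monthly minutes at which blended cost equals the price."""
    return FIXED_PER_MONTH / (PRICE_PER_MIN - VARIABLE_PER_MIN)


print(f"minutes implied at $2.28/min: {implied_minutes(2.28):,.0f}")
print(f"minutes implied at $0.26/min: {implied_minutes(0.26):,.0f}")
print(f"break-even minutes per month: {break_even_minutes():,.0f}")
```

Under that assumption the break-even volume is the fixed base divided by the 15 cent contribution, and it lands above the volume implied by the 26 cent figure. The exact tenant count depends on minutes per tenant, which the post does not give.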

Can scale make a realtime voice API profitable?

Now apply the same lens to the realtime option. Its 25 to 35 cents is almost entirely variable: it is a per-minute meter that charges the same whether you serve one customer or a thousand. There is no large fixed component to amortize, which sounds like an advantage until you notice it is the opposite.

A mostly-fixed cost falls per minute as volume grows. A mostly-variable cost above the price line stays above the price line forever. The realtime API converts our amortizable fixed cost into an unamortizable variable one that already exceeds revenue. It can never spread its way to profit. That is why the rejection is not ideological: it is a direct consequence of where the cost is fixed and where it is variable.
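
A short comparison makes the asymmetry concrete. The sketch below is a simplification under stated assumptions: the realtime option is given its most generous case (the 25 cent low end and zero fixed cost), all figures are treated as one currency, and the monthly volumes are arbitrary illustration points, not measured traffic.

```python
def monthly_profit(minutes: float, price: float, variable: float, fixed: float = 0.0) -> float:
    """Monthly profit: per-minute contribution times volume, minus the fixed base."""
    return minutes * (price - variable) - fixed


PRICE = 0.18  # standard price per minute (from the post)

for minutes in (1_000, 10_000, 100_000):  # illustrative volumes only
    cascade = monthly_profit(minutes, PRICE, variable=0.03, fixed=1872.0)
    realtime = monthly_profit(minutes, PRICE, variable=0.25)  # best case for realtime
    print(f"{minutes:>7,} min  cascade {cascade:>10,.0f}  realtime {realtime:>10,.0f}")
```

The cascade starts in the red and climbs out as volume grows; the realtime line starts slightly negative and gets worse with every added minute.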

Model choice runs the same logic one level down

The same arithmetic chooses the language model. The agent runs a mini-class model, roughly a third of the cost of the full model, protected by deterministic fast-paths and canned responses for known failure modes so that the cheap model never has to be clever in the moments that matter.

To test whether a newer model was worth it, we ran a three-way experiment: the mini baseline, a next-generation mini at minimal reasoning effort (newly available in the Canadian region, which matters for data residency), and the full model. The gate had a hard rule before quality even entered: the next-generation model had to run with reasoning effectively off.

Observed response latency, reasoning enabled:   45 to 120 s   (unusable for voice)
Required mode for voice:                         reasoning_effort = minimal
Main model time-to-first-token, production:      ~1.9 s        (sample: 1,936 ms)

With reasoning on, responses took 45 to 120 seconds. A phone call has a budget of a second or two before silence reads as a dropped line. Reasoning models are only viable in voice at a minimal-effort setting, which removes most of the reason to use one. We gate on latency mode first, then cost, then quality, in that order, because that is the order in which a phone call actually fails.
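
The latency gate is easy to express as a ratio against the turn budget. In this sketch the 2 second budget is an assumption taken from "a second or two" above; the observed values are the ones quoted in this section, and the labels are illustrative.

```python
TURN_BUDGET_S = 2.0  # assumption: upper end of "a second or two"

# Observed values quoted in the post, in seconds.
observed_s = {
    "reasoning enabled, fastest": 45.0,
    "reasoning enabled, slowest": 120.0,
    "production time-to-first-token sample": 1.936,
}


def budget_ratio(latency_s: float, budget_s: float = TURN_BUDGET_S) -> float:
    """How many turn budgets a response consumes; above 1.0 fails the gate."""
    return latency_s / budget_s


for label, latency in observed_s.items():
    ratio = budget_ratio(latency)
    verdict = "within budget" if ratio <= 1.0 else "fails latency gate"
    print(f"{label}: {ratio:.1f}x budget ({verdict})")
```

Anything that fails here never reaches the cost or quality comparison, which is why latency mode is checked first.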

How should you decide between a realtime API and a cascade pipeline?

Out of all of this came one rule we now apply to every component: pick the cheapest model or architecture that clears three criteria. It has to sound natural and respond fast, it has to cost the same or less than what it replaces, and a majority of quality metrics have to improve. Architecture follows price-per-minute. Price-per-minute is set by the market, deliberately positioned against hosted-platform pricing, not by our internal costs.
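
The rule reads naturally as a filter followed by a minimum. The sketch below is one possible encoding, not the actual experiment harness: the `Candidate` fields, the candidate names, and every number in the example list are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    meets_latency: bool      # sounds natural and responds within the turn budget
    cost_per_min: float      # variable cost per minute
    metrics_improved: int    # quality metrics that improved versus the incumbent
    metrics_total: int       # quality metrics measured


def clears(c: Candidate, incumbent_cost: float) -> bool:
    """All three criteria: latency, cost no higher than incumbent, majority of metrics improved."""
    return (
        c.meets_latency
        and c.cost_per_min <= incumbent_cost
        and c.metrics_improved * 2 > c.metrics_total
    )


def pick(candidates: List[Candidate], incumbent_cost: float) -> Optional[Candidate]:
    """Cheapest candidate that clears all three criteria, or None to keep the incumbent."""
    passing = [c for c in candidates if clears(c, incumbent_cost)]
    return min(passing, key=lambda c: c.cost_per_min, default=None)


# Hypothetical candidates with made-up numbers, for illustration only.
candidates = [
    Candidate("option-a", meets_latency=False, cost_per_min=0.020, metrics_improved=5, metrics_total=6),
    Candidate("option-b", meets_latency=True, cost_per_min=0.028, metrics_improved=4, metrics_total=6),
    Candidate("option-c", meets_latency=True, cost_per_min=0.025, metrics_improved=3, metrics_total=6),
]
chosen = pick(candidates, incumbent_cost=0.03)
```

Returning `None` when nothing clears the bar keeps the default honest: the incumbent stays unless a challenger wins on all three.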

Hardening: keeping the decision honest over time

A price-dependent decision has to be revisited when prices move:

  • Re-run the realtime-versus-cascade arithmetic whenever provider pricing shifts by more than about 2x. The conclusion is arithmetic, not doctrine, and it will flip if realtime pricing drops below the price line.
  • Keep every component swappable, which is the real advantage of the cascade. The text-to-speech stage has already been replaced twice; the language model is a one-line config change behind the experiment harness.
  • Track variable cost per minute as a first-class production metric, sitting right next to latency. A cost regression ships as silently as a latency regression if nobody watches it.
  • Do not benchmark models on quality alone. Gate on latency mode, then cost, then quality, the order that matches how a phone call breaks.
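
Tracking variable cost per minute can start as a very small check. The sketch below assumes a billing export that can be reduced to per-stage totals; the line-item names, the dollar amounts, and the 10 percent tolerance are all hypothetical, chosen only so the example reproduces the 3 cent figure.

```python
from typing import Dict


def variable_cost_per_minute(line_items: Dict[str, float], audio_minutes: float) -> float:
    """Sum of variable line items divided by audio-minutes served in the same window."""
    if audio_minutes <= 0:
        raise ValueError("audio_minutes must be positive")
    return sum(line_items.values()) / audio_minutes


def cost_regressed(current: float, baseline: float, tolerance: float = 0.10) -> bool:
    """True when current cost per minute exceeds the baseline by more than the tolerance."""
    return current > baseline * (1 + tolerance)


# Hypothetical 30-day window: the split across stages is invented for illustration.
window = {
    "speech_to_text": 120.0,
    "language_model": 90.0,
    "text_to_speech": 80.0,
    "variable_infra": 10.0,
}
current = variable_cost_per_minute(window, audio_minutes=10_000)
alert = cost_regressed(current, baseline=0.03)
```

Wiring the boolean into the same alerting path as latency is the point: the check is trivial, but only if someone runs it on every billing window.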

Key takeaways

A voice agent is a per-minute business, so every architectural choice is a per-minute number compared against the price, not a question of which option is more capable. The realtime API failed on unit economics, not capability. The cascade pipeline’s 3 cent variable cost is what turns an 18 cent price into a business, and its swappable stages are what let the decision change cleanly when the market does.

What this means if you are an IT services firm

When a client asks “should we use the realtime APIs everyone is demoing,” the correct first artifact is one line, not a proof of concept: the provider’s cost per minute against the per-minute value of the call. Then ask your team three questions. What is our variable cost per minute? What does the client pay, or save, per minute? At what volume or tenant count does fixed cost amortize below that spread?

If those three numbers are not known, the architecture conversation is decoration. Knowing them is most of what separates a voice AI engagement that makes money from one that quietly loses it, and it is the substance of how we work with partners.

Related questions
What does a production voice agent actually cost to run?
On a cascade pipeline of separate speech-to-text, language model, and text-to-speech stages, variable cost can run near 3 cents per audio-minute. On top of that sits a fixed infrastructure base, around 1,900 Canadian dollars a month for a single-tenant Kubernetes cluster, which dominates the total until it spreads across customers. The blended figure fell from about 2.28 dollars per minute at one tenant to roughly 26 cents at twenty-five tenants on the same shared cluster.
Why not use a speech-to-speech realtime model for a production voice agent?
In 2026 realtime pricing ran about 25 to 35 cents per minute, which exceeds a standard 18 cent per minute price to the customer. A negative gross margin per minute cannot be scaled away, because a realtime API turns mostly-fixed cost into mostly-variable cost that sits above the price. The model's capability was not the problem. The per-minute arithmetic was.
Is a small or cheap language model good enough for phone calls?
Yes, if you protect it. Deterministic fast-paths and canned responses for known failure modes let a mini-class model, around a third of the cost of the full model, carry a government service line. The right way to test a bigger model is a controlled experiment gated on cost and latency first, not on a general sense that bigger is better.
Can reasoning models be used in voice agents?
Only with reasoning effectively turned off. With reasoning enabled we measured responses of 45 to 120 seconds, which is one to two orders of magnitude beyond a conversational turn budget of a second or two. A minimal reasoning-effort setting is the only viable mode for voice, which mostly defeats the point of using a reasoning model at all.
Platform versus custom build: how do the economics differ?
A hosted platform charges a per-minute fee that sits on top of model costs and compresses your margin permanently. A custom cascade pipeline trades higher fixed engineering and infrastructure cost for very low variable cost per minute. The crossover depends on call volume and how many customers share the fixed base, which is why a custom pipeline is priced against platform pricing while its cost structure is deliberately not platform-shaped.

Reading this because a client asked for voice AI? That is the conversation we are built for. What taritas does for partners.

PROJECT taritas.com/blog
DWG POST-9
REV 1.0
DATE 2026-06-24