Build decisions POST-4 7 min read

White-Label Voice AI for IT Services Firms: 7 Production Lessons

At Taritas we build white-label voice AI that regional IT services firms resell under their own brand, and the same lessons hold on every engagement. The ones that decide whether it works are rarely about the model: who owns the customer and the IP, whether the per-minute price can carry the architecture, and the configuration and operational discipline that keeps a live agent up. The demo is never the hard part.

By Supreet Tare

All names, numbers, and identifiers in this post are anonymized. The patterns are real.

Unit-economics diagram: a horizontal price line at 18 cents per minute, a realtime speech-to-speech API band at 25 to 35 cents per minute sitting above the price line and marked rejected, and a cascade pipeline variable cost of about 3 cents per minute sitting well below the price line

Thirty minutes after one of our agents went live, every call stopped connecting. Nothing in our code had changed. We build voice AI as the behind-the-scenes engineering partner to IT services firms: they own the customer and the brand, we build and run the engine, and our name never reaches the end client. After shipping production agents this way across a public-sector phone line and a healthcare platform, the same handful of lessons keep proving themselves. Almost none of them are about the model. They are about boundaries, unit economics, and the unglamorous discipline that keeps a live system up. The demo is the easy part.

The first decision is who owns the customer, not which model to use

On our largest deployment, the end customer has never heard the name Taritas. A regional IT firm sells the agent under its own brand, owns the relationship, and renews it. We are invisible by design, and the intellectual property transfers to the partner under standard work-for-hire terms. That is exactly what lets them sell the result as their own product.

This reads like paperwork, but it is the most important decision in the engagement. An IT services firm partners with a specialist precisely so it does not lose an account it spent years earning. If ownership of the customer or the code is ambiguous, every later conversation carries friction, whether it is a feature request, a renewal, or a referral. Settle it in writing before the first line of code, and everything after runs on trust instead of negotiation.

A voice agent is priced by the minute, so the minute decides the architecture

We rejected the most-hyped architecture of 2026 with one line of arithmetic. Speech-to-speech realtime APIs, which collapse the whole stack into a single model, ran about 25 to 35 cents per minute. Our standard price to the end customer was about 18 cents per minute. The architecture would have cost more than the revenue it carried, and no amount of scale fixes a variable cost that sits above price.

The cascade pipeline we kept, with separate speech-to-text, language model, and text-to-speech stages, has a variable cost near 3 cents per minute, which leaves real margin once fixed infrastructure spreads across tenants. The general rule: in a per-minute business, every component is judged in cost-per-minute terms against the price, not against the question “is it better.” Price is set by the market. Architecture lives underneath it.
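As a rough sketch, the arithmetic fits in a few lines of Python. The 18, 25 to 35, and 3 cent figures are the ones quoted above; the fixed monthly cost is a hypothetical number chosen only to show how break-even volume falls out, and the function names are illustrative.

```python
from typing import Optional

PRICE_CENTS = 18.0  # per-minute price to the end customer (from the post)

# Variable cost per minute for each candidate architecture (from the post).
VARIABLE_COST_CENTS = {
    "realtime speech-to-speech, low end": 25.0,
    "realtime speech-to-speech, high end": 35.0,
    "cascade pipeline": 3.0,
}


def margin_per_minute(price_cents: float, variable_cost_cents: float) -> float:
    """Contribution margin per billed minute, before fixed infrastructure."""
    return price_cents - variable_cost_cents


def breakeven_minutes(fixed_cents: float, price_cents: float,
                      variable_cost_cents: float) -> Optional[float]:
    """Minutes needed to cover a fixed cost; None if the margin is not positive."""
    margin = margin_per_minute(price_cents, variable_cost_cents)
    if margin <= 0:
        return None  # no volume recovers a variable cost that sits above price
    return fixed_cents / margin


if __name__ == "__main__":
    # Assumption for illustration only: $1,500 per month of fixed infrastructure.
    hypothetical_fixed_cents = 150_000.0
    for name, cost in VARIABLE_COST_CENTS.items():
        margin = margin_per_minute(PRICE_CENTS, cost)
        minutes = breakeven_minutes(hypothetical_fixed_cents, PRICE_CENTS, cost)
        verdict = "rejected" if minutes is None else f"break-even at {minutes:,.0f} min"
        print(f"{name}: margin {margin:+.0f} c/min, {verdict}")
```

The point of returning `None` rather than a negative number is that a negative margin has no break-even volume at all, which is the whole argument against the realtime option at this price.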

Managed versus self-hosted is a loop you re-enter, not a box you tick

We made the same text-to-speech decision three times for one client. First we used a managed cloud voice; the client judged it robotic for a citizen-facing line, and because the voice is the product in voice AI, that rejection was correct. So we self-hosted an open model on a GPU to get an acceptable voice. That cleared the quality bar and turned a line item into an operations program: capacity planning, autoscaling, a fallback path, and a memory leak that surfaced only after six weeks of uptime. Then a newer generation of the managed voice shipped, cleared the quality bar, and modeled out about 54 percent cheaper than the self-hosted setup, so we moved back.

None of the three decisions was wrong on its date. The trap is letting an old verdict (“that vendor sounds robotic”) quietly outlive its truth and veto a re-test. Put a calendar reminder against managed tiers, re-audition each major generation, and when you self-host for quality, book the full cost honestly: the node plus the whole operations subsystem it drags in behind it.

Your first production outages will be configs nobody owns

That go-live outage, the one where every call stopped after thirty minutes, was not a defect. A soft cap of 100 calls per day was sitting in an admin panel, set during testing where 100 was generous, never re-checked against launch volume. It lived in panel state, so it appeared in no code diff and no review. Clearing it fixed everything in minutes. Finding it was the whole cost.

Weeks later, on the same project, call transfers started failing at exactly 15.0 seconds because a proxy’s default route timeout collided with a slow upstream phone line. Another default nobody had set on purpose. Two outages, same shape. The fix became a launch-day checklist: enumerate every limit, cap, and timeout that lives in panel state or proxy defaults, confirm each against expected launch conditions, and make any soft limit raise an alert before it silently starts declining calls.
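One way to turn that checklist into something reviewable is a small inventory that lives in the repository, sketched below in Python. The 100-calls-per-day cap and the 15-second timeout come from the incidents above; the expected launch figures, the third entry, the 1.5x headroom factor, and the 80 percent warning threshold are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Limit:
    name: str
    where: str            # where the value lives, e.g. "admin panel"
    value: float          # the configured limit
    expected_peak: float  # what launch conditions are expected to demand


def undersized(limits: List[Limit], headroom: float = 1.5) -> List[Limit]:
    """Return every limit that does not cover expected peak times headroom."""
    return [lim for lim in limits if lim.value < lim.expected_peak * headroom]


def should_alert(used: int, cap: int, warn_percent: int = 80) -> bool:
    """True once usage reaches warn_percent of a soft cap (integer math)."""
    return used * 100 >= cap * warn_percent


# Hypothetical launch inventory; only the 100 and 15 values are from the post.
LAUNCH_LIMITS = [
    Limit("daily call cap", "admin panel", value=100, expected_peak=400),
    Limit("route timeout (s)", "proxy default", value=15, expected_peak=20),
    Limit("concurrent calls", "telephony trunk", value=50, expected_peak=10),
]

if __name__ == "__main__":
    for lim in undersized(LAUNCH_LIMITS):
        print(f"REVIEW {lim.name} ({lim.where}): {lim.value} vs peak {lim.expected_peak}")
```

Keeping the inventory in code does not move the setting out of the panel, but it does put the expected value in a diff, where a reviewer can question it before launch.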

A cheap model stays reliable only if you give it one job at a time

To make 3-cents-a-minute economics work, the agent runs a small, inexpensive model. The reliability trick is that we never ask it to do two things at once. Every turn fires two separate calls: a classifier that returns one routing label, and a responder that produces the spoken reply. Reviewers always ask why we pay the extra latency instead of combining them into one structured-output call. We tried exactly that and reverted. With the combined prompt the small model drifted: sometimes a good answer with the wrong routing label, sometimes the right label with a broken formatting rule. Either failure is bad, because the classifier is the gatekeeper for several deterministic guardrails downstream.

Two prompts, each with one job and one output shape, were far more reliable, and we hid most of the extra latency by running the classifier in parallel with the knowledge-base lookup. The principle travels well beyond voice: if economics push you to a cheaper model, give it one narrow job at a time, not one big one.
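The shape of that turn loop can be sketched with `asyncio`. Everything below is illustrative: the model and knowledge-base calls are stubs that sleep briefly, and the label set, function names, and fail-closed fallback are assumptions rather than the production design.

```python
import asyncio
from typing import List, Tuple

ALLOWED_LABELS = {"answer", "transfer", "end_call"}  # hypothetical label set
FALLBACK_LABEL = "transfer"  # assumption: fail closed to a human


def validate_label(raw: str) -> str:
    """Accept only known routing labels; anything else falls back."""
    label = raw.strip().lower()
    return label if label in ALLOWED_LABELS else FALLBACK_LABEL


async def classify(utterance: str) -> str:
    """Stub for the classifier call: one job, one routing label."""
    await asyncio.sleep(0.01)  # stands in for model latency
    return "transfer" if "human" in utterance.lower() else "answer"


async def lookup_kb(utterance: str) -> List[str]:
    """Stub for the knowledge-base lookup."""
    await asyncio.sleep(0.01)
    return ["Office hours are 9 to 5 on weekdays."]


async def respond(utterance: str, passages: List[str]) -> str:
    """Stub for the responder call: one job, one spoken reply."""
    await asyncio.sleep(0.01)
    return passages[0] if passages else "I am not sure, let me check."


async def handle_turn(utterance: str) -> Tuple[str, str]:
    # The classifier runs concurrently with the lookup, hiding its latency.
    raw_label, passages = await asyncio.gather(
        classify(utterance), lookup_kb(utterance)
    )
    label = validate_label(raw_label)
    reply = await respond(utterance, passages)
    return label, reply


if __name__ == "__main__":
    print(asyncio.run(handle_turn("What are your office hours?")))
```

Because the label comes back as a single validated string, downstream guardrails can branch on it without parsing anything out of the spoken reply.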

In regulated work, the security review is a deliverable you ship

On one healthcare launch we ran a structured security review against the live code on launch day. It returned a high-severity authentication bypass and two medium issues. We fixed all three and verified them with tests the same day, then handed the client the report with the real findings and the exact fixes still in it.

Showing a client a genuine vulnerability and its remediation built more confidence than a spotless report would have. Public-sector and healthcare buyers buy on trust, and trust is something you produce evidence for: where the data lives, how long it is kept, who else touches it, and what the audit log records. For an IT services firm whose client runs procurement, being able to hand over a residency map and a real security report is often what actually closes the deal.

The last two seconds of a call are the hardest engineering in the build

After a caller stops talking, there are about 2.2 to 2.5 seconds before the agent’s first audio plays, and on a phone that reads as thinking silence, which reads as robotic. We have made three serious attempts to mask it. All three are currently switched off, each defeated by a subtle timing or speech-queue-ordering reason that taught us what the next attempt has to respect. Separately, when we asked the model to emit markup for more natural speech, the agent sometimes read the tags aloud, so naturalness moved to a deterministic code boundary instead of living in the prompt.
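A minimal sketch of such a code boundary, assuming the unwanted markup looks like XML-style tags: strip anything tag-shaped from the model's text before it reaches text-to-speech, so a tag can never be spoken. The function name and the regular expression are illustrative, not the production filter.

```python
import re

# Matches XML-style tags such as <break time="300ms"/> or </emphasis>.
# Requiring a letter right after "<" or "</" leaves plain comparisons
# like "3 < 5" untouched.
_TAG_RE = re.compile(r"</?[A-Za-z][^<>]*>")
_WHITESPACE_RE = re.compile(r"\s+")


def strip_markup(model_text: str) -> str:
    """Remove tag-shaped markup and collapse the whitespace left behind."""
    without_tags = _TAG_RE.sub("", model_text)
    return _WHITESPACE_RE.sub(" ", without_tags).strip()


if __name__ == "__main__":
    raw = 'Sure. <break time="300ms"/> Your appointment is <emphasis>confirmed</emphasis>.'
    print(strip_markup(raw))
```

Any pauses or emphasis would then be added by code after this step, where the output format is under deterministic control rather than left to the prompt.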

None of this shows up in a demo script. But for a partner reselling the engine to their own client, the gap between an agent that feels natural and one that feels stilted is the gap between a reference and a refund. Plan that polish as real engineering, not a finishing touch.

What this means if you are an IT services firm

Two of these lessons are commercial, two sit where money meets architecture, and three are field-tested engineering. They share one root: white-label voice AI works when the boundaries are explicit (who owns what, what the minute costs, what the regulator requires) and when the unglamorous failure modes are treated as a known class rather than a surprise. The demo will always look fine. These seven are where the work actually is. If you are weighing whether to offer voice AI to your own clients, that is where we would tell you to look first, and it is the substance of how we work with partners.

Related questions
What does white-label voice AI mean for an IT services firm?
The IT firm sells and owns the voice AI engagement under its own brand, while a specialist technical partner builds and runs the engine invisibly behind it. The end customer sees only the IT firm, and the intellectual property transfers to the firm or its client under standard work-for-hire terms. The firm keeps the account; the partner supplies capability it does not have to hire for.
How much does a production voice agent cost per minute?
On a cascade pipeline of separate speech-to-text, language model, and text-to-speech components, variable cost can run around three cents per audio-minute, with the rest dominated by fixed infrastructure that falls sharply as more tenants share a cluster. Speech-to-speech realtime APIs ran roughly 25 to 35 cents per minute in 2026, which can exceed the per-minute price charged to the customer.
Should you use a speech-to-speech realtime model for a production voice agent?
Often not. In 2026 these models frequently cost more per minute than a per-minute voice product can charge, and they turn mostly fixed infrastructure cost into mostly variable cost that cannot be amortized across tenants. A cascade of separate, swappable components stays cheaper and lets you replace any one stage when a better or cheaper option ships.
Can a voice AI agent pass a government or healthcare security review?
Yes, if it is built for one. On a healthcare launch, a structured security review of the live code found a high-severity authentication bypass and two medium issues, all fixed and test-verified the same day. Handing the client the report with the real findings and fixes in it built more trust than a clean-looking report would have.
Who owns the intellectual property in a white-label voice AI build?
In the model Taritas uses, IP transfers to the partner firm or its end client under work-for-hire terms, which is what lets the partner sell it as their own product. It has to be agreed in writing before the build starts. The technical partner keeps only its general methods and reusable internal tooling.

Reading this because a client asked for voice AI? That is the conversation we are built for. What taritas does for partners.

PROJECT taritas.com/blog
DWG POST-4
REV 1.0
DATE 2026-06-16