Build decisions POST-4 7 min read

White-Label Voice AI for IT Services Firms: 7 Production Lessons

At taritas we build white-label voice AI that regional IT services firms resell under their own brand, and the same lessons hold on every engagement. The ones that decide whether it works are rarely about the model: who owns the customer and the intellectual property, whether the per-minute price can carry the architecture, and the configuration and operational discipline that keeps a live agent up. Across production builds those lessons came with numbers: realtime APIs at 25 to 35 cents per minute against an 18 cent line price, a go-live outage from a 100-calls-per-day cap, a 15.0-second transfer timeout, a launch-day security review that fixed a high-severity auth bypass, and a text-to-speech rebuild that cut cost about 54 percent. The transferable point for an IT firm is that the demo is never the hard part: price the minute, own the boundaries, and treat the unglamorous failure modes as a known class.

Published June 16, 2026 · Updated June 24, 2026 · Supreet Tare

All names, numbers, and identifiers in this post are anonymized. The patterns are real.

Unit-economics diagram: a horizontal price line at 18 cents per minute, a realtime speech-to-speech API band at 25 to 35 cents per minute sitting above the price line and marked rejected, and a cascade pipeline variable cost of about 3 cents per minute sitting well below the price line

Thirty minutes after one of our agents went live, every call stopped connecting. Nothing in our code had changed. We build voice AI as the behind-the-scenes engineering partner to IT services firms: they own the customer and the brand, we build and run the engine, and our name never reaches the end client. After shipping production agents this way for a public-sector phone line and a healthcare platform, we keep seeing the same handful of lessons prove themselves. Almost none of them are about the model. They are about boundaries, unit economics, and the unglamorous discipline that keeps a live system up. The demo is the easy part.

The first decision is who owns the customer, not which model to use

On our largest deployment, the end customer has never heard the name taritas. A regional IT firm sells the agent under its own brand, owns the relationship, and renews it. We are invisible by design, and the intellectual property transfers to the partner under standard work-for-hire terms. That is exactly what lets them sell the result as their own product.

This reads like paperwork, but it is the most important decision in the engagement. An IT services firm partners with a specialist precisely so it does not lose an account it spent years earning. If ownership of the customer or the code is ambiguous, every later conversation carries friction, whether it is a feature request, a renewal, or a referral. Settle it in writing before the first line of code, and everything after runs on trust instead of negotiation.

Why does the per-minute price decide the voice AI architecture?

We rejected the most-hyped architecture of 2026 with one line of arithmetic. Speech-to-speech realtime APIs, which collapse the whole stack into a single model, ran about 25 to 35 cents per minute. Our standard price to the end customer was about 18 cents per minute. The architecture would have cost more than the revenue it carried, and no amount of scale fixes a variable cost that sits above price.

The cascade pipeline we kept, with separate speech-to-text, language model, and text-to-speech stages, has a variable cost near 3 cents per minute, which leaves real margin once fixed infrastructure spreads across tenants. The general rule: in a per-minute business, every component is judged in cost-per-minute terms against the price, not against the question “is it better.” Price is set by the market. Architecture lives underneath it.
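The arithmetic behind that rejection can be written down in a few lines. This is a minimal sketch, not our pricing model: the 18, 25 to 35, and 3 cent figures come from this post, while the function names and the fixed monthly infrastructure figure are hypothetical and exist only to show the shape of the calculation.

```python
import math
from typing import Optional

PRICE_CENTS = 18.0  # per-minute price to the end customer (from the post)

def margin_per_minute(price_cents: float, variable_cost_cents: float) -> float:
    """Contribution margin per minute in cents, before fixed infrastructure."""
    return price_cents - variable_cost_cents

def breakeven_minutes(fixed_monthly_cents: float, price_cents: float,
                      variable_cost_cents: float) -> Optional[int]:
    """Minutes per month needed to cover fixed cost, or None if unreachable."""
    margin = margin_per_minute(price_cents, variable_cost_cents)
    if margin <= 0:
        # Variable cost at or above price: more volume only deepens the loss.
        return None
    return math.ceil(fixed_monthly_cents / margin)

# Variable cost per minute in cents (figures from the post).
CANDIDATES = {
    "realtime, low end": 25.0,
    "realtime, high end": 35.0,
    "cascade": 3.0,
}

# Hypothetical fixed infrastructure cost: 500 dollars a month, in cents.
FIXED_MONTHLY_CENTS = 50_000

for name, cost in CANDIDATES.items():
    margin = margin_per_minute(PRICE_CENTS, cost)
    minutes = breakeven_minutes(FIXED_MONTHLY_CENTS, PRICE_CENTS, cost)
    print(f"{name}: margin {margin:+.0f} cents/min, breakeven minutes: {minutes}")
```

The `None` branch is the point of the sketch: a negative margin has no breakeven volume, so scale cannot rescue the architecture.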

Should you use managed or self-hosted TTS for a voice agent?

We made the same text-to-speech decision three times for one client. First we used a managed cloud voice; the client judged it robotic for a citizen-facing line, and because the voice is the product in voice AI, that rejection was correct. So we self-hosted an open model on a GPU to get an acceptable voice. That cleared the quality bar and turned a line item into an operations program: capacity planning, autoscaling, a fallback path, and a memory leak that surfaced only after six weeks of uptime. Then a newer generation of the managed voice shipped, cleared the quality bar, and modeled out about 54 percent cheaper than the self-hosted setup, so we moved back.

None of the three decisions was wrong on its date. The trap is letting an old verdict (“that vendor sounds robotic”) quietly outlive its truth and veto a re-test. Put a calendar reminder against managed tiers, re-audition each major generation, and when you self-host for quality, book the full cost honestly: the node plus the whole operations subsystem it drags in behind it.

Your first production outages will be configs nobody owns

That go-live outage, the one where every call stopped after thirty minutes, was not a defect. A soft cap of 100 calls per day was sitting in an admin panel, set during testing where 100 was generous, never re-checked against launch volume. It lived in panel state, so it appeared in no code diff and no review. Clearing it fixed everything in minutes. Finding it was the whole cost.

Weeks later, on the same project, call transfers started failing at exactly 15.0 seconds because a proxy’s default route timeout collided with a slow upstream phone line. Another default nobody had set on purpose. Two outages, same shape. The fix became a launch-day checklist: enumerate every limit, cap, and timeout that lives in panel state or proxy defaults, confirm each against expected launch conditions, and make any soft limit raise an alert before it silently starts declining calls.
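One way to make that checklist executable is a small registry of limits that gets reviewed against launch expectations. The sketch below is an illustration under assumptions, not our tooling: the `Limit` type, the headroom factor, the 80 percent alert ratio, the expected-peak numbers, and the third registry entry are all hypothetical. Only the 100-call cap and the 15-second timeout come from the incidents above.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Limit:
    name: str             # what is being limited
    source: str           # where the value lives (panel state, proxy default, ...)
    configured: float     # the value currently in effect
    expected_peak: float  # expected launch-day demand, in the same unit

def undersized(limits: List[Limit], headroom: float = 1.5) -> List[Limit]:
    """Return limits configured below expected peak times a safety headroom."""
    return [lim for lim in limits if lim.configured < lim.expected_peak * headroom]

def should_alert(used: float, cap: float, warn_ratio: float = 0.8) -> bool:
    """True once usage crosses the warning ratio, before the cap declines calls."""
    return used >= cap * warn_ratio

# expected_peak values are hypothetical placeholders for a launch forecast.
LAUNCH_LIMITS = [
    Limit("daily_call_cap", "admin panel", configured=100, expected_peak=2000),
    Limit("transfer_route_timeout_s", "proxy default", configured=15.0, expected_peak=20.0),
    Limit("concurrent_calls", "telephony account", configured=200, expected_peak=50),
]

for lim in undersized(LAUNCH_LIMITS):
    print(f"REVIEW {lim.name} ({lim.source}): "
          f"configured {lim.configured}, expected peak {lim.expected_peak}")
```

The value of the registry is less the check itself than the act of enumerating: a limit that has to be typed into a reviewed file can no longer live only in panel state.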

A cheap model stays reliable only if you give it one job at a time

To make 3-cents-a-minute economics work, the agent runs a small, inexpensive model. The reliability trick is that we never ask it to do two things at once. Every turn fires two separate calls: a classifier that returns one routing label, and a responder that produces the spoken reply. Reviewers always ask why we pay the extra latency instead of combining them into one structured-output call. We tried exactly that and reverted. With the combined prompt the small model drifted: sometimes a good answer with the wrong routing label, sometimes the right label with a broken formatting rule. Either failure is bad, because the classifier is the gatekeeper for several deterministic guardrails downstream.

Two prompts, each with one job and one output shape, were far more reliable, and we hid most of the extra latency by running the classifier in parallel with the knowledge-base lookup. The principle travels well beyond voice: if economics push you to a cheaper model, give it one narrow job at a time, not one big one.
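The turn structure described above can be sketched with `asyncio`. Everything named here is a stand-in: `classify`, `lookup_kb`, and `respond` simulate the model and knowledge-base calls with short sleeps, and the label set and the fallback rule are assumptions made for the example. What the sketch is meant to show is the shape: two single-purpose calls, with the classifier overlapped with the lookup.

```python
import asyncio
from typing import List

ALLOWED_LABELS = {"billing", "general"}  # hypothetical routing labels

async def classify(utterance: str) -> str:
    """Stand-in for the classifier call: one job, one routing label."""
    await asyncio.sleep(0.01)  # simulated model latency
    return "billing" if "invoice" in utterance.lower() else "general"

async def lookup_kb(utterance: str) -> List[str]:
    """Stand-in for the knowledge-base lookup."""
    await asyncio.sleep(0.01)  # simulated retrieval latency
    return ["Invoices are issued on the first of each month."]

async def respond(utterance: str, label: str, passages: List[str]) -> str:
    """Stand-in for the responder call: one job, the spoken reply."""
    await asyncio.sleep(0.01)  # simulated model latency
    return f"[{label}] {passages[0]}"

async def handle_turn(utterance: str) -> str:
    # The classifier and the lookup do not depend on each other, so they run
    # concurrently and the classifier's latency is mostly hidden.
    label, passages = await asyncio.gather(classify(utterance), lookup_kb(utterance))
    # Deterministic guardrail: an unknown label never reaches downstream logic.
    if label not in ALLOWED_LABELS:
        label = "general"
    return await respond(utterance, label, passages)

if __name__ == "__main__":
    print(asyncio.run(handle_turn("Where is my invoice?")))
```

Because the label is checked in code rather than trusted from a combined output, a drifting responder cannot corrupt routing.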

Can a white-label voice AI pass a client’s security review?

On one healthcare launch we ran a structured security review against the live code on launch day. It returned a high-severity authentication bypass and two medium issues. We fixed all three and verified them with tests the same day, then handed the client the report with the real findings and the exact fixes still in it.

Showing a client a genuine vulnerability and its remediation built more confidence than a spotless report would have. Public-sector and healthcare buyers buy on trust, and trust is something you produce evidence for: where the data lives, how long it is kept, who else touches it, and what the audit log records. For an IT services firm whose client runs procurement, being able to hand over a residency map and a real security report is often what actually closes the deal.

The last two seconds of a call are the hardest engineering in the build

After a caller stops talking, there are about 2.2 to 2.5 seconds before the agent’s first audio plays, and on a phone that reads as thinking silence, which reads as robotic. We have made three serious attempts to mask it. All three are currently switched off, each defeated by a subtle timing or speech-queue-ordering reason that taught us what the next attempt has to respect. Separately, when we asked the model to emit markup for more natural speech, the agent sometimes read the tags aloud, so naturalness moved to a deterministic code boundary instead of living in the prompt.
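A deterministic boundary of that kind can be as small as the sketch below. It assumes a text-to-speech engine that accepts standard SSML `<speak>` and `<break>` elements, which may not match any particular vendor. The function names and the 250 ms pause are hypothetical. The model is asked for plain text only, any tags it emits anyway are stripped, and the code alone decides where markup goes.

```python
import re
from xml.sax.saxutils import escape

# Matches things that look like tags (<emphasis>, </emphasis>, <break .../>)
# without swallowing a bare "<" in text such as "3 < 5".
_TAG = re.compile(r"</?[A-Za-z][^<>]*>")
_WHITESPACE = re.compile(r"\s+")
_SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+")

def strip_model_markup(text: str) -> str:
    """Remove any markup the model emitted so it can never be read aloud."""
    without_tags = _TAG.sub(" ", text)
    return _WHITESPACE.sub(" ", without_tags).strip()

def to_ssml(text: str, pause_ms: int = 250) -> str:
    """Build SSML in code: escaped plain sentences joined by fixed pauses."""
    clean = strip_model_markup(text)
    sentences = [s for s in _SENTENCE_BREAK.split(clean) if s]
    pause = f'<break time="{pause_ms}ms"/>'
    return "<speak>" + pause.join(escape(s) for s in sentences) + "</speak>"

if __name__ == "__main__":
    raw = '<emphasis>Sure.</emphasis> Your ticket is <break time="1s"/> open.'
    print(strip_model_markup(raw))
    print(to_ssml(raw))
```

The design choice is that the prompt carries no formatting rules at all, which also removes one of the two things a small model could get wrong.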

None of this shows up in a demo script. But for a partner reselling the engine to their own client, the gap between an agent that feels natural and one that feels stilted is the gap between a reference and a refund. Plan that polish as real engineering, not a finishing touch.

What this means if you are an IT services firm

Two of these lessons are commercial, two sit where money meets architecture, and three are field-tested engineering. They share one root: white-label voice AI works when the boundaries are explicit (who owns what, what the minute costs, what the regulator requires) and when the unglamorous failure modes are treated as a known class rather than a surprise. The demo will always look fine. These seven are where the work actually is. If you are weighing whether to offer voice AI to your own clients, that is where we would tell you to look first, and it is the substance of how we work with partners.

