Procurement and compliance POST-5 6 min read

What Happens When Your Voice AI Breaks in Production

The demo never shows you what happens at 2 a.m. when the agent stops answering and a real customer is on the line. Before buying voice AI, the operations questions decide more than the feature list: who is on call, how fast you hear about an outage, whether a failure degrades safely, and what the postmortem looks like. At Taritas we run production voice agents with on-call rotations, a communication discipline kept separate from the debugging, guardrails that fail open, and blameless postmortems, because in an always-on, per-minute product, how you handle the break is what the customer remembers.

Published · Updated · Supreet Tare

All names, numbers, and identifiers in this post are anonymized. The patterns are real.

Diagram of an incident status lifecycle: Investigating at 08:24, Identified at 08:54, Monitoring at 09:37, then Resolved, with a note that the status updates are written by a communication role that is separate from the engineers doing the fix

Recently I watched a major AI provider’s status page narrate an outage in real time. Investigating. Then root cause identified. Then a fix being applied, with a new note every half hour. What struck me was not the outage. It was the calm. Someone whose only job in that moment was to keep customers informed was writing those updates while other people fixed the system. The status update was a role, not a courtesy.

That is the part of voice AI no demo ever shows you. Every vendor can make an agent sound good in a controlled call. The harder question, the one that actually decides whether you should buy, is what happens when it breaks at 2 a.m. with a real customer on the line. Here are the operations questions worth asking any voice AI vendor, and how we answer them at Taritas.

Who is on call, and how fast do you find out?

The honest version of reliability is not “it never breaks.” It is “we find out before you do, and someone is already on it.” That means on-call rotations and automated alerts on error rate, latency, and capacity, rather than waiting for a customer to phone in angry.

We learned where that bar sits the hard way. On one go-live, calls stopped connecting about thirty minutes into normal operation. The cause was a configuration cap sitting in an admin panel, not a code defect, and clearing it took minutes. The lesson we kept was not about that one cap. It was that any limit which can silently decline a call must raise an alert well before it does. So the question to ask a vendor is concrete: what do you monitor, who gets paged, and how quickly does someone acknowledge it?
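
One way to picture "alert well before it does" is a check that compares current usage against the ceiling and escalates as the gap closes. This is a minimal sketch, not our production monitoring: the `Limit` type, the `alert_level` function, the example values, and the 70% and 90% thresholds are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Limit:
    name: str
    current: int   # usage observed right now
    ceiling: int   # value at which calls start being declined


def alert_level(limit: Limit, warn_at: float = 0.7, page_at: float = 0.9) -> str:
    """Return 'ok', 'warn', or 'page' based on how close usage is to the ceiling.

    The thresholds are illustrative defaults, not recommended values.
    """
    if limit.ceiling <= 0:
        # A zero or negative ceiling would decline every call: page immediately.
        return "page"
    used = limit.current / limit.ceiling
    if used >= page_at:
        return "page"
    if used >= warn_at:
        return "warn"
    return "ok"


# Hypothetical limit: 46 of 50 concurrent calls in use.
concurrent_calls = Limit(name="concurrent_calls", current=46, ceiling=50)
print(alert_level(concurrent_calls))  # prints "page"
```

The point of the sketch is the ordering: someone is paged while calls are still connecting, not after the first one is declined.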

When it breaks, who tells you, and how often?

The status page I was watching worked because communication was a defined job, separate from the debugging. One person coordinates the incident, others investigate, and someone owns the customer-facing updates on a cadence, even when there is nothing new to say beyond “still on it.”

For a buyer, this is the difference between an outage you can live with and one that costs you the relationship. Silence during a failure erodes trust faster than the failure itself. Ask the vendor plainly: when something breaks, will I hear from a human on a schedule, and will the updates move through a clear lifecycle from investigating to identified to resolved, or will I be left guessing?
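
The lifecycle and the cadence can both be written down as small, checkable rules. The sketch below is illustrative only: the four states come from the diagram at the top of this post, while the transition table, the function names, and the 30-minute default cadence (borrowed from the status page described earlier) are assumptions.

```python
from datetime import datetime, timedelta
from enum import Enum


class Status(Enum):
    INVESTIGATING = "investigating"
    IDENTIFIED = "identified"
    MONITORING = "monitoring"
    RESOLVED = "resolved"


# Allowed moves. Monitoring may fall back to investigating if a fix does not hold.
NEXT = {
    Status.INVESTIGATING: {Status.IDENTIFIED},
    Status.IDENTIFIED: {Status.MONITORING},
    Status.MONITORING: {Status.RESOLVED, Status.INVESTIGATING},
    Status.RESOLVED: set(),
}


def can_move(current: Status, target: Status) -> bool:
    """True if the status page may move from `current` to `target`."""
    return target in NEXT[current]


def update_due(
    last_update: datetime,
    now: datetime,
    cadence: timedelta = timedelta(minutes=30),
) -> bool:
    """True once the cadence has elapsed, even if there is nothing new to say."""
    return now - last_update >= cadence
```

A communication owner would post whenever `update_due` returns true, whether or not the status itself has changed.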

Does it fail safely?

Things will fail. What matters is how. We run a daily call cap as a cost guardrail, and we designed it to fail open: if the database check behind it errors, the call is allowed through rather than rejected. Wrongly blocking a real caller is worse than letting one extra call past a cost limit. A security boundary, by contrast, should fail closed. Neither choice is universally right. The discipline is making each one deliberately and documenting it, so a failure produces a known behavior instead of a surprise.
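
The two choices look almost identical in code, which is why they are worth writing down. This is a simplified sketch rather than our actual guardrail: `allow_call`, `is_authorized`, `failing_dependency`, and the cap value of 1000 are hypothetical names and numbers used only to show the pattern.

```python
import logging
from typing import Callable

log = logging.getLogger("guardrails")

DAILY_CALL_CAP = 1000  # hypothetical value for illustration


def allow_call(count_calls_today: Callable[[], int], cap: int = DAILY_CALL_CAP) -> bool:
    """Cost guardrail: fails OPEN. If the lookup errors, the call goes through."""
    try:
        return count_calls_today() < cap
    except Exception:
        # Deliberate choice: blocking a real caller is worse than one extra call.
        # The failure is still logged so that an alert can fire on it.
        log.warning("call-cap check failed; allowing call", exc_info=True)
        return True


def is_authorized(check_credentials: Callable[[], bool]) -> bool:
    """Security boundary: fails CLOSED. If the check errors, access is denied."""
    try:
        return check_credentials()
    except Exception:
        log.warning("authorization check failed; denying access", exc_info=True)
        return False


def failing_dependency():
    """Stand-in for a database that is down."""
    raise ConnectionError("database unavailable")


print(allow_call(failing_dependency))     # prints True  (fail open)
print(is_authorized(failing_dependency))  # prints False (fail closed)
```

The only difference between the two functions is the value returned in the `except` branch, and that single line is the decision that needs to be made on purpose.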

The procurement question is simple and revealing: when a dependency fails mid-call, what happens to the caller? A vendor who has thought about this will answer immediately. A vendor who has not will improvise.

Your first outages will be configurations nobody owns

Two of our most instructive incidents were not bugs. One was the go-live cap above. The other was a set of call transfers that failed at exactly 15.0 seconds, because a proxy’s default route timeout collided with a slow upstream phone line. We wrote that one up in full (the 15-second timeout postmortem). Both were configuration defaults nobody had set on purpose, invisible in testing, and only dangerous under real traffic.

The takeaway became a launch-day checklist: enumerate every limit, cap, timeout, and quota that lives in panel state or a default, and confirm each against expected launch conditions before going live. Ask the vendor whether they have one. Early production failures are far more often misconfigurations than code, and a team that knows this has a checklist for it.
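
A checklist like this can be kept as data and reviewed mechanically. The sketch below assumes a hypothetical `ConfiguredLimit` record and `review` function; the 1.5x headroom factor and every number except the 15-second timeout mentioned above are made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ConfiguredLimit:
    name: str
    configured: float        # value currently set in the panel, or its default
    expected_peak: float     # what launch conditions are expected to need
    set_deliberately: bool   # has a person chosen and signed off on this value?


def review(limits: List[ConfiguredLimit], headroom: float = 1.5) -> List[str]:
    """Return human-readable findings; an empty list means nothing was flagged."""
    findings = []
    for limit in limits:
        if not limit.set_deliberately:
            findings.append(f"{limit.name}: still a default that nobody owns")
        if limit.configured < limit.expected_peak * headroom:
            findings.append(
                f"{limit.name}: configured {limit.configured} is under "
                f"{headroom}x the expected {limit.expected_peak}"
            )
    return findings


launch_limits = [
    ConfiguredLimit("daily_call_cap", 5000, 1000, True),
    # A 15 s default timeout against an upstream assumed to need up to 20 s.
    ConfiguredLimit("proxy_route_timeout_s", 15, 20, False),
]
for finding in review(launch_limits):
    print(finding)
```

Here the deliberately set call cap passes, while the untouched timeout default is flagged twice: once for having no owner and once for lacking headroom.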

For an always-on AI product, capacity is the failure mode

A voice agent is not a normal web app. Many incidents are not defects at all. They are demand meeting a capacity ceiling: concurrent-call limits, inference capacity, or a serving change that quietly raises latency. The mitigations look different too: they are more about rate-limiting, shifting load, and staged rollouts than about patching a bug.

This is why a careful vendor rolls out a model or serving change gradually, to a small slice of traffic first, with automatic rollback if quality or latency regresses, rather than flipping it on for everyone at once. And it is why monitoring up-or-down status is not enough: a model can be fully available and still be answering worse than yesterday. Ask how a vendor deploys a change, whether they can roll it back automatically, and whether they watch output quality, not just uptime.
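
The decision at each stage of a gradual rollout can be sketched as a comparison between the small slice and the baseline. Everything here is an assumption made for illustration: the `Metrics` fields, the traffic stages, the 10% latency tolerance, the 0.02 quality tolerance, and the idea of a single quality score are placeholders for whatever a team actually measures.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    p95_latency_ms: float
    quality_score: float  # 0..1, from whatever evaluation the team trusts


STAGES = [1, 5, 25, 100]  # percent of traffic at each stage (hypothetical)


def next_action(
    baseline: Metrics,
    canary: Metrics,
    stage_index: int,
    max_latency_ratio: float = 1.10,
    max_quality_drop: float = 0.02,
) -> str:
    """Decide whether to roll back, promote to the next stage, or finish."""
    if canary.p95_latency_ms > baseline.p95_latency_ms * max_latency_ratio:
        return "rollback"
    if canary.quality_score < baseline.quality_score - max_quality_drop:
        # The service is up, but it is answering worse than before.
        return "rollback"
    if stage_index + 1 < len(STAGES):
        return f"promote to {STAGES[stage_index + 1]}%"
    return "complete"


baseline = Metrics(p95_latency_ms=800, quality_score=0.90)
print(next_action(baseline, Metrics(820, 0.91), stage_index=0))  # prints "promote to 5%"
print(next_action(baseline, Metrics(800, 0.85), stage_index=1))  # prints "rollback"
```

The second call is the case that uptime monitoring alone would miss: latency is unchanged and the service is available, yet the quality drop triggers the rollback.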

What does the postmortem look like?

After an incident is resolved, the artifact that matters is a blameless postmortem: a written timeline, the root cause, what worked, and the action items that stop it recurring. Blameless is the operative word. The goal is fixing the system, not assigning blame, because fear is what stops people reporting problems honestly.

The strongest signal a buyer can look for is whether a vendor will actually hand you that document. We write them, and we publish anonymized versions of the most useful ones on this blog. A vendor willing to show you exactly how something broke and what they changed is telling you something a clean marketing page cannot.

What this means if you are an IT services firm

When you resell voice AI to an enterprise client, their procurement team will ask these exact questions, and “the vendor handles it” is not an answer you can give with confidence unless you actually know the vendor’s operations posture. On-call coverage, status communication, fail-open design, a launch-day configuration checklist, staged rollouts, and real postmortems are the substance behind an SLA. They are what we hand you, so that when your client asks what happens when it breaks, you have the answer ready. That is a large part of how we work with partners.

Related questions
What should I ask a voice AI vendor about reliability before buying?
Ask the operations questions the demo hides: what is monitored and who gets paged when it breaks, how quickly they acknowledge an incident, whether a failed dependency degrades safely or drops the call, how they roll out and roll back a model change, and whether you get a written postmortem after an incident. A vendor who can answer these clearly has run production before.
What does on-call look like for a small voice AI provider?
An on-call engineer carries a pager, and automated alerts fire on elevated error rates, latency, or capacity limits rather than waiting for a customer to call. At a small shop the rotation is lean, so the real test is not headcount but whether the alerts are wired to the failures that actually happen and whether a human acknowledges them quickly.
What does it mean for a voice agent to fail open?
Fail open means that if a supporting check errors, for example a database lookup behind a rate limit, the call is allowed through rather than rejected. A cost-control guardrail should usually fail open, because wrongly blocking a real caller is worse than one extra call. A security boundary should fail closed. The point is choosing each one deliberately and writing it down.
Who updates the status page during an outage?
In a well-run incident, communication is a separate role from debugging. One person coordinates, others investigate, and someone owns the customer-facing updates on a cadence, even when there is nothing new to report. That is why a status page can stay calm and current while engineers are heads-down on the fix.
What uptime or SLA should I expect for a production voice agent?
It depends on the deployment, and a serious vendor sets specific terms per contract rather than quoting a number that sounds good. More useful than a headline percentage is what the SLA actually covers: time to acknowledge, time to communicate, how a degraded state is handled, and whether quality, not just up-or-down status, is monitored.

Reading this because a client asked for voice AI? That is the conversation we are built for. See what Taritas does for partners.

PROJECT taritas.com/blog
DWG POST-5
REV 1.0
DATE 2026-06-17