Procurement and compliance POST-5 6 min read

What Happens When Your Voice AI Breaks in Production

The demo never shows you what happens at 2 a.m. when the agent stops answering and a real customer is on the line. Before buying voice AI, the operations questions decide more than the feature list: who is on call, how fast you hear about an outage, whether a failure degrades safely, and what the postmortem looks like. At taritas we run production voice agents with on-call rotations, a communication discipline kept separate from the debugging, guardrails that fail open, and blameless postmortems. The reason this matters is concrete: our worst outages were not exotic bugs but quiet defaults, a 15.0-second proxy timeout and a 100-calls-per-day cap left over from testing, each of which stopped real calls until someone found it. In an always-on, per-minute product, how you handle the break is what the customer remembers, so treat incident response as a feature you ship, not a thing you improvise.

Published June 17, 2026 · Updated June 24, 2026 · Supreet Tare

All names, numbers, and identifiers in this post are anonymized. The patterns are real.

Diagram of an incident status lifecycle: Investigating at 08:24, Identified at 08:54, Monitoring at 09:37, then Resolved, with a note that the status updates are written by a communication role that is separate from the engineers doing the fix

Recently I watched a major AI provider’s status page narrate an outage in real time. Investigating. Then root cause identified. Then a fix being applied, with a new note every half hour. What struck me was not the outage. It was the calm. Someone whose only job in that moment was to keep customers informed was writing those updates while other people fixed the system. The status update was a role, not a courtesy.

That is the part of voice AI no demo ever shows you. Every vendor can make an agent sound good in a controlled call. The harder question, the one that actually decides whether you should buy, is what happens when it breaks at 2 a.m. with a real customer on the line. Here are the operations questions worth asking any voice AI vendor, and how we answer them at taritas.

Who is on call, and how fast do you find out?

The honest version of reliability is not “it never breaks.” It is “we find out before you do, and someone is already on it.” That means on-call rotations and automated alerts on error rate, latency, and capacity, rather than waiting for a customer to phone in angry.

We learned where that bar sits the hard way. On one go-live, calls stopped connecting about thirty minutes into normal operation. The cause was a configuration cap sitting in an admin panel, not a code defect, and clearing it took minutes. The lesson we kept was not about that one cap. It was that any limit which can silently decline a call must raise an alert well before it does. So the question to ask a vendor is concrete: what do you monitor, who gets paged, and how quickly does someone acknowledge it?
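One way to picture "alert well before it declines a call" is a headroom check that runs on a schedule and warns when usage approaches a ceiling. This is a minimal sketch, not our production monitor: the `Limit` type, the limit names, and the 80 percent warning threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limit:
    name: str     # hypothetical label, e.g. "daily_call_cap"
    ceiling: int  # the value at which calls start being declined


def headroom_alerts(limits, usage, warn_ratio=0.8):
    """Return one warning per limit whose usage has reached warn_ratio of its ceiling.

    warn_ratio=0.8 is an assumed threshold, chosen only for the example.
    """
    alerts = []
    for limit in limits:
        used = usage.get(limit.name, 0)
        if limit.ceiling <= 0:
            # A zero or negative cap declines everything, so flag it outright.
            alerts.append(f"{limit.name}: ceiling is {limit.ceiling}, every call will be declined")
        elif used >= warn_ratio * limit.ceiling:
            pct = 100 * used / limit.ceiling
            alerts.append(f"{limit.name}: {used}/{limit.ceiling} used ({pct:.0f}%)")
    return alerts


limits = [Limit("daily_call_cap", 100), Limit("concurrent_calls", 20)]
usage = {"daily_call_cap": 85, "concurrent_calls": 3}
alerts = headroom_alerts(limits, usage)
print(alerts)
```

In this sketch the page fires at 85 of 100 calls, while there is still room to raise the cap before a caller is turned away.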

When it breaks, who tells you, and how often?

The status page I was watching worked because communication was a defined job, separate from the debugging. One person coordinates the incident, others investigate, and someone owns the customer-facing updates on a cadence, even when there is nothing new to say beyond “still on it.”

For a buyer, this is the difference between an outage you can live with and one that costs you the relationship. Silence during a failure erodes trust faster than the failure itself. Ask the vendor plainly: when something breaks, will I hear from a human on a schedule, and will the updates move through a clear lifecycle from investigating to identified to resolved, or will I be left guessing?
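The "on a schedule" part can be made mechanical, so the person who owns communication is prompted rather than relying on memory. The sketch below assumes a 30-minute cadence and the four lifecycle states from the diagram; the function name and the calendar date are arbitrary choices for the example.

```python
from datetime import datetime, timedelta

# Lifecycle states as shown in the diagram above.
LIFECYCLE = ("investigating", "identified", "monitoring", "resolved")


def update_due(last_update, now, status, cadence=timedelta(minutes=30)):
    """True when the next customer-facing note is owed.

    A note is owed on cadence even if all it says is "still on it".
    The 30-minute cadence is an assumption for this sketch.
    """
    if status not in LIFECYCLE:
        raise ValueError(f"unknown status: {status}")
    return status != "resolved" and now - last_update >= cadence


# Arbitrary date; only the clock times matter here.
posted = datetime(2026, 1, 1, 8, 24)
print(update_due(posted, datetime(2026, 1, 1, 8, 40), "investigating"))
print(update_due(posted, datetime(2026, 1, 1, 8, 54), "investigating"))
```

The check deliberately ignores whether anything new has been learned, because the cadence is owed to the customer either way.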

Does it fail safely?

Things will fail. What matters is how. We run a daily call cap as a cost guardrail, and we designed it to fail open: if the database check behind it errors, the call is allowed through rather than rejected. Wrongly blocking a real caller is worse than letting one extra call past a cost limit. A security boundary, by contrast, should fail closed. Neither choice is universally right. The discipline is making each one deliberately and documenting it, so a failure produces a known behavior instead of a surprise.
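The two choices differ by a single line: what the guard returns when its own check raises. The following is a minimal sketch under stated assumptions, not our actual code. `count_calls_today` and `is_authorized` are hypothetical callables standing in for the database check and an authorization lookup.

```python
import logging

logger = logging.getLogger("guardrails")


def allow_call(count_calls_today, daily_cap):
    """Cost guardrail that fails open.

    If the check itself errors, the call is allowed through, because wrongly
    blocking a real caller is worse than one extra call past a cost limit.
    """
    try:
        return count_calls_today() < daily_cap
    except Exception:
        logger.exception("daily cap check failed; failing open")
        return True


def allow_admin_action(is_authorized):
    """Security boundary that fails closed.

    If the check itself errors, the action is denied.
    """
    try:
        return bool(is_authorized())
    except Exception:
        logger.exception("authorization check failed; failing closed")
        return False


def broken_dependency():
    # Stands in for a database or auth service that is down.
    raise ConnectionError("dependency unavailable")


print(allow_call(broken_dependency, daily_cap=100))
print(allow_admin_action(broken_dependency))
```

Both functions log the failure, so a guard that is failing open still shows up in monitoring instead of silently disabling itself.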

The procurement question is simple and revealing: when a dependency fails mid-call, what happens to the caller? A vendor who has thought about this will answer immediately. A vendor who has not will improvise.

What causes your first production outages?

Two of our most instructive incidents were not bugs. One was the go-live cap above. The other was a set of call transfers that failed at exactly 15.0 seconds, because a proxy’s default route timeout collided with a slow upstream phone line. We wrote that one up in full (the 15-second timeout postmortem). Both were configuration defaults nobody had set on purpose, invisible in testing, and only dangerous under real traffic.

The takeaway became a launch-day checklist: enumerate every limit, cap, timeout, and quota that lives in panel state or a default, and confirm each against expected launch conditions before going live. Ask the vendor whether they have one. Early production failures are far more often misconfigurations than code, and a team that knows this has a checklist for it.
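A checklist like this can live as data plus one comparison. The sketch below is illustrative only: the `Setting` fields, the setting names, and the launch requirements of 2,000 calls per day and a 30-second timeout are assumptions, while the configured values of 100 and 15 echo the two incidents described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Setting:
    name: str
    configured: float         # value currently in panel state or a default
    required_at_least: float  # what expected launch conditions need (assumed)
    set_on_purpose: bool      # False if nobody has reviewed the default


def launch_findings(settings):
    """Return one finding per setting that blocks launch or was never reviewed."""
    findings = []
    for s in settings:
        if s.configured < s.required_at_least:
            findings.append(
                f"{s.name}: configured {s.configured:g} < required {s.required_at_least:g}"
            )
        elif not s.set_on_purpose:
            findings.append(f"{s.name}: passes, but is an unreviewed default")
    return findings


settings = [
    Setting("daily_call_cap", 100, 2000, set_on_purpose=False),
    Setting("proxy_route_timeout_s", 15.0, 30, set_on_purpose=False),
    Setting("concurrent_calls", 50, 20, set_on_purpose=True),
    Setting("max_call_minutes", 60, 30, set_on_purpose=False),
]
findings = launch_findings(settings)
for line in findings:
    print(line)
```

The last entry shows why the `set_on_purpose` flag is worth tracking: a default can pass today and still be a value nobody chose.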

Why is capacity the failure mode for an always-on AI product?

A voice agent is not a normal web app. Many incidents are not defects at all. They are demand meeting a capacity ceiling: concurrent-call limits, inference capacity, or a serving change that quietly raises latency. The mitigations look different too: they are more about rate-limiting, shifting load, and staged rollouts than about patching a bug.

This is why a careful vendor rolls out a model or serving change gradually, to a small slice of traffic first, with automatic rollback if quality or latency regresses, rather than flipping it on for everyone at once. And it is why monitoring up-or-down status is not enough: a model can be fully available and still be answering worse than yesterday. Ask how a vendor deploys a change, whether they can roll it back automatically, and whether they watch output quality, not just uptime.
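A staged rollout with automatic rollback reduces to a loop over traffic fractions with a gate at each step. This is a sketch under assumptions, not a description of any particular deployment system: `observe` is a hypothetical callable that returns metrics for the slice, and the stage fractions, metric names, and regression tolerances are invented for the example.

```python
def staged_rollout(stages, observe, baseline,
                   max_latency_increase=0.10, max_quality_drop=0.02):
    """Widen traffic stage by stage and roll back on the first regression.

    observe(fraction) is assumed to return a dict with "p95_latency_ms" and
    "quality" for the slice of traffic on the new version.
    """
    latency_limit = baseline["p95_latency_ms"] * (1 + max_latency_increase)
    quality_floor = baseline["quality"] - max_quality_drop
    for fraction in stages:
        metrics = observe(fraction)
        latency_regressed = metrics["p95_latency_ms"] > latency_limit
        quality_regressed = metrics["quality"] < quality_floor
        if latency_regressed or quality_regressed:
            # Automatic rollback: no traffic stays on the new version.
            return {"status": "rolled_back", "failed_at": fraction, "serving": 0.0}
    return {"status": "complete", "failed_at": None, "serving": stages[-1]}


baseline = {"p95_latency_ms": 800, "quality": 0.90}
stages = [0.01, 0.05, 0.25, 1.0]


def healthy(fraction):
    return {"p95_latency_ms": 800, "quality": 0.90}


def worse_answers_at_scale(fraction):
    # Latency stays flat, but answer quality drops once the slice is larger.
    quality = 0.85 if fraction >= 0.25 else 0.90
    return {"p95_latency_ms": 800, "quality": quality}


print(staged_rollout(stages, healthy, baseline))
print(staged_rollout(stages, worse_answers_at_scale, baseline))
```

The second run rolls back even though latency never moves, which is the case an up-or-down health check would miss.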

What does the postmortem look like?

After an incident is resolved, the artifact that matters is a blameless postmortem: a written timeline, the root cause, what worked, and the action items that stop it recurring. Blameless is the operative word. The goal is fixing the system, not assigning blame, because fear is what stops people reporting problems honestly.

The strongest signal a buyer can look for is whether a vendor will actually hand you that document. We write them, and we publish anonymized versions of the most useful ones on this blog. A vendor willing to show you exactly how something broke and what they changed is telling you something a clean marketing page cannot.

What this means if you are an IT services firm

When you resell voice AI to an enterprise client, their procurement team will ask these exact questions, and “the vendor handles it” is not an answer you can give with confidence unless you actually know the vendor’s operations posture. On-call coverage, status communication, fail-open design, a launch-day configuration checklist, staged rollouts, and real postmortems are the substance behind an SLA. They are what we hand you, so that when your client asks what happens when it breaks, you have the answer ready. That is a large part of how we work with partners.

