
A 15-Second Default Timeout Broke Our Voice AI's Call Transfers

A production voice AI agent we run at taritas stopped transferring callers to staff. Every transfer failed with a 504 at exactly 15 seconds, Envoy Gateway's default route timeout, while the destination line had developed a 13-second post-dial delay. Because the agent's transfer ran as a background API call, each failed dial kept ringing the destination, so staff answered ghost calls with no one on the line. The fix was one scoped timeouts block on the HTTPRoute that raised the limit above the real post-dial delay. The deeper lesson is that the outage was a default nobody chose: 15.0 seconds is sensible for a web request and wrong for a phone transfer that legitimately takes longer to connect. When you put a proxy in front of telephony, enumerate every default timeout and cap on the path and set each one against real call behavior, because an unset default will pick the worst moment to enforce itself.

Published June 12, 2026 · Updated June 24, 2026 · Supreet Tare

All names, numbers, and identifiers in this post are anonymized. The patterns are real.

Figure: Timeline of a voice AI call transfer failing at the 15-second gateway timeout while the SIP dial keeps ringing in the background; staff answer a ghost call 55 seconds in.

The symptoms

We run a voice AI agent that answers the inbound phone line of a public-sector service desk. It answers questions, looks things up, and transfers the caller to staff when they ask for a human. One afternoon, transfers stopped working. The reports did not fit together:

Some callers heard the AI say it could not transfer them.
Others got about 20 seconds of silence, after which the AI simply resumed the conversation.
Staff said their phones rang, they picked up, and the AI was already mid-conversation with a caller. Then the line dropped on its own.

The trunk provider’s status page was green. The transfer destination number worked when we dialed it by hand. Nothing had been deployed that day.

The architecture in 30 seconds

The stack: a Python agent built on LiveKit Agents, self-hosted LiveKit on Kubernetes, and SIP trunking through a major carrier. A transfer works like this: the agent says goodbye, calls LiveKit’s CreateSIPParticipant API to dial the staff desk into the same room with wait_until_answered=True, and once the human answers, the agent leaves. Caller and staff stay connected.

In front of the LiveKit API sits Envoy Gateway, configured through the Kubernetes Gateway API. That detail matters later.

Clue 1: the error message pointed the wrong way

The agent logs showed every transfer failing the same way:

TRANSFER[step=6/dial_out]: FAILED after 15009ms
ContentTypeError: 504, message='Attempt to decode JSON with
unexpected mimetype: text/plain',
url='https://<livekit-host>/twirp/livekit.SIP/CreateSIPParticipant'

This looks like a JSON parsing bug. It is not. The SDK expects JSON error bodies. It received a plain-text 504 instead, and the client library failed while parsing the error. So the stack trace points at JSON decoding instead of the real problem. The real signal is in three details: status 504, content type text/plain, and elapsed time 15009ms.

A debugging habit that would have saved us an hour: when an HTTP client throws a parsing error, log the raw response body and headers first. The transport facts are the evidence. The parser exception is noise.
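
A minimal sketch of that habit, using only the Python standard library. The helper name `summarize_error_response` and its arguments are illustrative and not part of any SDK, and the body text in the example call is a placeholder. The idea is to record status, content type, elapsed time, and a truncated raw body before anything tries to parse JSON.

```python
import json


def summarize_error_response(status, headers, body, elapsed_ms, max_body=200):
    """Build a log line from transport facts before any JSON parsing.

    Hypothetical helper: `headers` is a plain dict and `body` is bytes,
    both taken from whatever HTTP client is in use.
    """
    # Header names are case-insensitive, so normalize before lookup.
    lowered = {k.lower(): v for k, v in headers.items()}
    content_type = lowered.get("content-type", "").split(";")[0].strip()

    text = body.decode("utf-8", errors="replace")
    parsed_ok = False
    if content_type == "application/json":
        try:
            json.loads(text)
            parsed_ok = True
        except ValueError:
            parsed_ok = False

    return (
        f"status={status} content_type={content_type or 'unknown'} "
        f"elapsed_ms={elapsed_ms} json_body={parsed_ok} "
        f"raw_body={text[:max_body]!r}"
    )


# Placeholder body; status, content type, and timing match the log above.
line = summarize_error_response(
    504, {"Content-Type": "text/plain"}, b"upstream request timeout", 15009
)
print(line)
```

With a line like this in the logs, the 504, the text/plain content type, and the 15009ms duration sit next to each other instead of behind a parser exception.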

Clue 2: exactly 15.0 seconds, every time

Failures at a round number are a timeout signature. The content type tells you whose timeout. LiveKit returns structured JSON errors, so a plain-text 504 means something in front of LiveKit answered instead. Envoy’s default route timeout is exactly 15 seconds, and when it fires, Envoy returns a 504 with a plain-text body.

Why would a dial-out API call take more than 15 seconds? Because wait_until_answered=True holds the HTTP request open for as long as the destination phone rings. The request duration is not bounded by your infrastructure. It is bounded by how fast a human picks up a phone.
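
The round-number check can be automated over a batch of failure durations. This is a rough sketch under stated assumptions: the candidate list and the 100ms tolerance are arbitrary choices, and every sample other than 15009 is made up for illustration.

```python
def timeout_signature(durations_ms, candidates_s=(15, 30, 60), tolerance_ms=100):
    """Return the candidate timeout (seconds) that every failure duration
    sits just above, or None if the durations do not cluster on one."""
    if not durations_ms:
        return None
    for candidate in candidates_s:
        limit_ms = candidate * 1000
        if all(0 <= d - limit_ms <= tolerance_ms for d in durations_ms):
            return candidate
    return None


# 15009 is the duration from the agent log; the other samples are invented.
clustered = timeout_signature([15009, 15004, 15012])
scattered = timeout_signature([3200, 15009, 41870])
print(clustered, scattered)
```

A match does not say whose timeout it is, only that one exists. The content type of the error body is what narrows down the hop.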

Clue 3: the carrier records told the rest of the story

We pulled the carrier’s call records and matched every outbound transfer attempt against its inbound caller session. The records made the picture clear:

Every transfer dial was being placed successfully. The SIP service was healthy.
The destination line showed a 13.1-second post-dial delay: thirteen seconds before ringing even started. The carrier had flagged it as high PDD.
One call was created at 10:27:14 and answered at 10:28:09. That is 55 seconds of call setup and ringing. It lasted 20 seconds and ended at the exact second the original caller hung up.
Other attempts ended in SIP 487 Request Terminated: our side canceling dials that had outlived the conversation.

That explained the ghost calls. The 504 killed the API request, not the dial. The agent’s error handler concluded the transfer had failed, skipped its leave-the-room step, and kept talking to the caller. Meanwhile the dial kept ringing in the background. When staff answered up to a minute later, they joined a live room where the AI was still mid-conversation. When the caller hung up, the room closed and the staff line dropped with it.
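
The timing for that one call can be worked through directly. The sketch below assumes the API request started at roughly the moment the carrier created the call, which the records do not state explicitly.

```python
from datetime import datetime

GATEWAY_TIMEOUT_S = 15  # Envoy route timeout that produced the 504

created = datetime.strptime("10:27:14", "%H:%M:%S")
answered = datetime.strptime("10:28:09", "%H:%M:%S")

# Time from call creation to a human answering.
setup_s = int((answered - created).total_seconds())

# Assumption: the API request began at about the time the call was created,
# so this is how long the dial kept ringing after the agent saw the 504.
unattended_s = setup_s - GATEWAY_TIMEOUT_S

print(f"setup and ringing: {setup_s}s")
print(f"dial still ringing after the 504: {unattended_s}s")
```

Under that assumption, the dial rang for about 40 seconds after the agent had already decided the transfer failed.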

Zero transfers succeeded during the incident window. Every “completed” outbound call was a ghost.

What was the root cause of the transfer failures?

The destination line degraded. The carrier route to the transfer number developed 13 or more seconds of post-dial delay and answer times near a minute. This change was entirely outside our system.
Envoy’s default 15-second route timeout turned that slowness into a hard failure on every attempt.

Neither factor alone causes an outage. A slow line behind a generous timeout is just a slow transfer. A fast line behind a 15-second timeout works as long as staff answer within the window. Together they produced a 100 percent failure rate, plus a design lesson: our error path never considered the case where the API call failed but the dial succeeded.
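
One possible shape for an error path that covers that case, as a sketch and not our production code. `GatewayTimeout` and the three callables are hypothetical stand-ins that the caller would wire to its own HTTP layer and room lookups.

```python
from enum import Enum


class DialResult(Enum):
    CONNECTED = "connected"
    FAILED = "failed"


class GatewayTimeout(Exception):
    """Hypothetical: raised by the caller's HTTP layer on a proxy 504."""


def transfer(dial, staff_in_room, cancel_dial):
    """Run a transfer whose error path allows for a dial that outlives the request.

    All three arguments are hypothetical callables supplied by the caller:
    dial() places the call, staff_in_room() reports whether the staff
    participant has joined, cancel_dial() tears down a pending dial.
    """
    try:
        dial()
        return DialResult.CONNECTED
    except GatewayTimeout:
        # The response failed, but the side effect may have succeeded.
        if staff_in_room():
            return DialResult.CONNECTED
        # Still ringing or unknown: stop it so nobody answers a ghost call.
        cancel_dial()
        return DialResult.FAILED


cancelled = []


def timed_out_dial():
    raise GatewayTimeout()


result = transfer(timed_out_dial, lambda: False, lambda: cancelled.append(True))
print(result, cancelled)
```

The design choice is that a timeout is treated as "outcome unknown" and resolved explicitly: either the staff participant is already there, or the pending dial is canceled before the agent resumes the conversation.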

How do you fix an Envoy Gateway 15-second transfer timeout?

The Gateway API makes this a small change. We added a dedicated rule for the LiveKit API path with explicit timeouts and left everything else untouched:

rules:
  # LiveKit APIs (CreateSIPParticipant etc.): allow long ringing
  - matches:
      - path:
          type: PathPrefix
          value: /twirp
    timeouts:
      request: 120s
      backendRequest: 120s
    backendRefs:
      - name: livekit-server
        port: 80
  # Catch-all (WebSocket signaling etc.): unchanged
  - matches:
      - path:
          type: PathPrefix
          value: /
    backendRefs:
      - name: livekit-server
        port: 80

Scoping the override to /twirp keeps the blast radius small. The WebSocket path that carries live calls is untouched, and the change applies without restarts. We verified it without a single live caller by using the LiveKit CLI (lk sip participant create with --wait-until-answered). Before the change, it died at 15 seconds. After, it held through the ringing and returned when the desk answered.
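
To keep unset defaults visible after a change like this, a small check can list the rules that still rely on the proxy default. The sketch below is an assumption-laden illustration: it takes the rules as already-parsed Python data (no YAML library), and the function name is made up.

```python
def rules_on_default_timeout(rules):
    """Return the path prefixes of HTTPRoute rules with no explicit request timeout."""
    unset = []
    for rule in rules:
        has_timeout = "request" in rule.get("timeouts", {})
        if not has_timeout:
            for match in rule.get("matches", []):
                unset.append(match.get("path", {}).get("value", "/"))
    return unset


# Mirrors the two rules in the route as parsed data.
rules = [
    {
        "matches": [{"path": {"type": "PathPrefix", "value": "/twirp"}}],
        "timeouts": {"request": "120s", "backendRequest": "120s"},
    },
    {"matches": [{"path": {"type": "PathPrefix", "value": "/"}}]},
]

on_default = rules_on_default_timeout(rules)
print(on_default)
```

A flagged rule is not necessarily wrong. The catch-all was left alone on purpose; the point is that the default becomes a recorded decision instead of an accident.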

Hardening beyond the hotfix

The timeout change restored transfers. The incident also exposed assumptions worth fixing properly:

Decouple transfers from any proxy timeout. Dial with wait_until_answered=False, then poll for the new participant joining the room under your own deadline. Keep the caller informed (“still connecting you”) or play ringback with play_dialtone=True instead of silence.
Treat answer latency as an input, not an assumption. A transfer target is an external dependency. Measure how long it takes a human to pick up, and alert when it drifts.
Log raw bodies on non-JSON error responses, so a gateway timeout never looks like a parsing bug again.
Make telephony settings configuration, not code. Transfer numbers, extensions, and DTMF behavior change at the client’s convenience. They belong in an admin panel, not a deployment.
Alert on transfer outcomes. A 100 percent transfer failure rate should page someone before users report it.
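
The first item, polling under your own deadline, could look roughly like this. It is a sketch, not LiveKit SDK code: `list_identities` is a hypothetical async callable that returns the identities currently in the room, and the demo uses a fake in place of a real lookup.

```python
import asyncio
import time


async def wait_for_participant(list_identities, identity, deadline_s, poll_s=1.0):
    """Poll until `identity` appears in the room or `deadline_s` elapses.

    `list_identities` is a hypothetical async callable returning the
    identities currently in the room; wire it to your own room lookup.
    """
    stop_at = time.monotonic() + deadline_s
    while True:
        if identity in await list_identities():
            return True
        remaining = stop_at - time.monotonic()
        if remaining <= 0:
            return False
        await asyncio.sleep(min(poll_s, remaining))


async def demo():
    calls = {"n": 0}

    async def fake_room():
        # Stand-in for a real lookup: staff joins on the third poll.
        calls["n"] += 1
        return ["caller", "staff-desk"] if calls["n"] >= 3 else ["caller"]

    return await wait_for_participant(
        fake_room, "staff-desk", deadline_s=2, poll_s=0.01
    )


joined = asyncio.run(demo())
print(joined)
```

Because the deadline lives in the agent, no proxy timeout can end the wait early, and a False return is the natural place to cancel the pending dial and tell the caller what happened.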

Key takeaways

A proxy default you never set is still production configuration. Audit the timeouts in every hop in front of long-running API calls.
Failures at an exact round number (15.0s, 30.0s, 60.0s) are a timeout signature. Start hunting for whose timeout it is.
A plain-text 504 from a JSON API means an intermediary answered, not the service.
Correlate application logs with carrier call records, call by call. Telephony bugs rarely show up in one log source.
An API call that times out is not an operation that stopped. Design error paths for “failed response, succeeded side effect.”
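
The first takeaway can be turned into a recurring check that compares each hop's timeout against how long transfers really take to be answered. This is a sketch with invented inputs: the 15s and 120s values come from the incident, the answer-time samples are made up apart from the 55-second call, and the 95th percentile and 1.5x margin are arbitrary starting points.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def hops_too_tight(hop_timeouts_s, answer_times_s, pct=95, margin=1.5):
    """Return hops whose timeout is below margin x the observed percentile."""
    needed = percentile(answer_times_s, pct) * margin
    return {hop: t for hop, t in hop_timeouts_s.items() if t < needed}


# Illustrative data: route timeout before and after the fix.
hops = {"route timeout before fix": 15, "route timeout after fix": 120}
answer_times = [8, 12, 21, 34, 55]  # seconds; invented except the 55s call

tight = hops_too_tight(hops, answer_times)
print(tight)
```

Run against fresh answer-time data, the same check also covers the drift case: a timeout that was generous last month can become too tight when the destination line degrades.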

What this means if you are an IT services firm

If your clients run voice AI that transfers calls to humans, this failure mode is waiting in any stack with a proxy in front of the call-control API. The questions to ask: what is the timeout on every hop in front of the dial-out API, what happens when the API call fails but the dial succeeds, and who gets alerted when transfers fail. If nobody can answer those, that is the audit to run this week. This is the kind of work we do behind IT services firms, under their brand.
