tarıtas
Production engineering POST-1 7 min read

A 15-Second Default Timeout Broke Our Voice AI's Call Transfers

A production voice AI agent we run at Taritas stopped transferring callers to staff. Every transfer failed with a 504 at exactly 15 seconds, Envoy Gateway's default route timeout, while the destination line had developed a 13-second post-dial delay. The failed API calls left dials ringing in the background, so staff answered ghost calls. One scoped timeouts block on the HTTPRoute fixed it.

Published · Updated · Supreet Tare

All names, numbers, and identifiers in this post are anonymized. The patterns are real.

[Figure: timeline of a voice AI call transfer failing at the 15-second gateway timeout while the SIP dial keeps ringing in the background; staff answer a ghost call 55 seconds in.]

The symptoms

We run a voice AI agent that answers the inbound phone line of a public-sector service desk. It answers questions, looks things up, and transfers the caller to staff when they ask for a human. One afternoon, transfers stopped working. The reports did not fit together:

  • Some callers heard the AI say it could not transfer them.
  • Others got about 20 seconds of silence, after which the AI simply resumed the conversation.
  • Staff said their phones rang, they picked up, and the AI was already mid-conversation with a caller. Then the line dropped on its own.

The trunk provider’s status page was green. The transfer destination number worked when we dialed it by hand. Nothing had been deployed that day.

The architecture in 30 seconds

The stack: a Python agent built on LiveKit Agents, self-hosted LiveKit on Kubernetes, and SIP trunking through a major carrier. A transfer works like this: the agent says goodbye, calls LiveKit’s CreateSIPParticipant API to dial the staff desk into the same room with wait_until_answered=True, and once the human answers, the agent leaves. Caller and staff stay connected.

In front of the LiveKit API sits Envoy Gateway, configured through the Kubernetes Gateway API. That detail matters later.

Clue 1: the error message pointed the wrong way

The agent logs showed every transfer failing the same way:

TRANSFER[step=6/dial_out]: FAILED after 15009ms
ContentTypeError: 504, message='Attempt to decode JSON with
unexpected mimetype: text/plain',
url='https://<livekit-host>/twirp/livekit.SIP/CreateSIPParticipant'

This looks like a JSON parsing bug. It is not. The SDK expects JSON error bodies. It received a plain-text 504 instead, and the client library failed while parsing the error. So the stack trace points at JSON decoding instead of the real problem. The real signal is in three details: status 504, content type text/plain, and elapsed time 15009ms.

A debugging habit that would have saved us an hour: when an HTTP client throws a parsing error, log the raw response body and headers first. The transport facts are the evidence. The parser exception is noise.
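Here is a minimal sketch of that habit, using only the standard library. The helper name decode_or_describe and the sample body are illustrative assumptions, not part of any SDK. The idea is to keep the status, content type, and a preview of the raw body whenever JSON decoding is not possible, instead of letting the parser exception become the headline.

```python
import json


def decode_or_describe(status: int, content_type: str, body: bytes) -> dict:
    """Return parsed JSON when possible; otherwise return the transport facts.

    Hypothetical helper: it never raises on a non-JSON body, so the caller
    can log status, content type, and a body preview instead of a parser error.
    """
    if "json" in content_type.lower():
        try:
            return {"ok": status < 400, "status": status, "json": json.loads(body)}
        except (ValueError, UnicodeDecodeError):
            pass  # declared JSON but not decodable: fall through to the raw view
    return {
        "ok": False,
        "status": status,
        "content_type": content_type,
        # Keep a bounded preview so log lines stay small.
        "raw_preview": body[:200].decode("utf-8", errors="replace"),
    }


# Illustrative values shaped like the incident: a plain-text 504.
facts = decode_or_describe(504, "text/plain", b"upstream request timeout")
print(facts)
```

Logged this way, the first thing on screen is "504, text/plain", which is the evidence, not a JSON decoding stack trace.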

Clue 2: exactly 15.0 seconds, every time

Failures at a round number are a timeout signature. The content type tells you whose timeout. LiveKit returns structured JSON errors, so a plain-text 504 means something in front of LiveKit answered instead. Envoy’s default route timeout is exactly 15 seconds, and when it fires, Envoy returns a 504 with a plain-text body.
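The two signals above can be combined into a rough triage check. This is a sketch under stated assumptions: the function name, the 250 ms tolerance, and the table of known timeouts are all hypothetical, and the 60-second client timeout is made up for illustration. Only the 15-second gateway value comes from this incident.

```python
def timeout_suspect(status, content_type, elapsed_ms, known_timeouts_s, tolerance_ms=250):
    """Guess which configured timeout produced a 504.

    known_timeouts_s maps a hop name to its timeout in seconds (hypothetical
    inventory you maintain yourself). Returns (hop, responder) or None.
    """
    if status != 504:
        return None
    for hop, timeout_s in known_timeouts_s.items():
        if abs(elapsed_ms - timeout_s * 1000) <= tolerance_ms:
            # A non-JSON body from a JSON API suggests an intermediary replied.
            responder = "service" if "json" in content_type.lower() else "intermediary"
            return hop, responder
    return None


# The 15 s gateway default is from the post; the client value is illustrative.
known = {"gateway_route": 15, "agent_http_client": 60}
print(timeout_suspect(504, "text/plain", 15009, known))
```

With the numbers from our logs (504, text/plain, 15009 ms), the check points at the gateway route and at an intermediary as the responder.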

Why would a dial-out API call take more than 15 seconds? Because wait_until_answered=True holds the HTTP request open for as long as the destination phone rings. The request duration is not bounded by your infrastructure. It is bounded by how fast a human picks up a phone.

Clue 3: the carrier records told the rest of the story

We pulled the carrier’s call records and matched every outbound transfer attempt against its inbound caller session. The records made the picture clear:

  • Every transfer dial was being placed successfully. The SIP service was healthy.
  • The destination line showed a 13.1-second post-dial delay: thirteen seconds before ringing even started. The carrier had flagged it as high PDD.
  • One call was created at 10:27:14 and answered at 10:28:09. That is 55 seconds of call setup and ringing. Once answered, the call lasted 20 seconds and ended at the exact second the original caller hung up.
  • Other attempts ended in SIP 487 Request Terminated: our side canceling dials that had outlived the conversation.

That explained the ghost calls. The 504 killed the API request, not the dial. The agent’s error handler concluded the transfer had failed, skipped its leave-the-room step, and kept talking to the caller. Meanwhile the dial kept ringing in the background. When staff answered up to a minute later, they joined a live room where the AI was still mid-conversation. When the caller hung up, the room closed and the staff line dropped with it.

Zero transfers succeeded during the incident window. Every “completed” outbound call was a ghost.
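The arithmetic behind the ghost calls is worth writing down. The sketch below uses only numbers quoted in this post and assumes, as a simplification, that the dial was created at roughly the same moment the API request started.

```python
from datetime import datetime

GATEWAY_TIMEOUT_S = 15.0  # Envoy default route timeout
POST_DIAL_DELAY_S = 13.1  # post-dial delay reported by the carrier

# Timestamps from the carrier record of the one answered call.
created = datetime.strptime("10:27:14", "%H:%M:%S")
answered = datetime.strptime("10:28:09", "%H:%M:%S")

time_to_answer_s = (answered - created).total_seconds()
# Ringing time available before the gateway gives up on the request.
ringing_budget_s = GATEWAY_TIMEOUT_S - POST_DIAL_DELAY_S
# How long the dial kept going after the API request had already failed.
ghost_gap_s = time_to_answer_s - GATEWAY_TIMEOUT_S

print(f"time to answer: {time_to_answer_s:.0f}s")
print(f"ringing time left before the 504: {ringing_budget_s:.1f}s")
print(f"answer arrives {ghost_gap_s:.0f}s after the 504")
```

Under that assumption, staff had under two seconds of ringing before the request died, and the one answered call was picked up about 40 seconds after the agent had already decided the transfer failed.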

Root cause: two failures, one outage

  1. The destination line degraded. The carrier route to the transfer number developed 13 or more seconds of post-dial delay and answer times near a minute. This change was entirely outside our system.
  2. Envoy’s default 15-second route timeout turned that slowness into a hard failure on every attempt.

Neither factor alone causes an outage. A slow line behind a generous timeout is just a slow transfer. A fast line behind a 15-second timeout works fine. Together they produced a 100 percent failure rate, plus a design lesson: our error path never considered the case where the API call failed but the dial succeeded.
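One way to cover the "API call failed, dial succeeded" case is to reconcile before giving up. The sketch below is not our production handler and does not use a real SDK: dial, is_answered, and cancel_dial are hypothetical async callables standing in for whatever call-control client you use.

```python
import asyncio


async def transfer_with_reconcile(dial, is_answered, cancel_dial, staff_identity):
    """Dial staff; if the API call errors, reconcile before declaring failure.

    All three callables are hypothetical stand-ins for a call-control client.
    """
    try:
        await dial(staff_identity)
        return "transferred"
    except Exception:
        # The response failed, but the dial itself may still be alive.
        if await is_answered(staff_identity):
            return "transferred"  # staff did pick up; treat it as success
        # Cancel the pending dial so it cannot ring into a stale room.
        await cancel_dial(staff_identity)
        return "failed_and_cleaned_up"


async def _demo():
    canceled = []

    async def dial(identity):
        raise TimeoutError("504 from gateway")  # simulated gateway timeout

    async def is_answered(identity):
        return False

    async def cancel_dial(identity):
        canceled.append(identity)

    outcome = await transfer_with_reconcile(dial, is_answered, cancel_dial, "staff-desk")
    return outcome, canceled


outcome, canceled = asyncio.run(_demo())
print(outcome, canceled)
```

The important property is that every exit from the error path leaves the dial in a known state: either answered and honored, or explicitly canceled.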

The fix: one scoped timeouts block

The Gateway API makes this a small change. We added a dedicated rule for the LiveKit API path with explicit timeouts and left everything else untouched:

rules:
  # LiveKit APIs (CreateSIPParticipant etc.): allow long ringing
  - matches:
      - path:
          type: PathPrefix
          value: /twirp
    timeouts:
      request: 120s
      backendRequest: 120s
    backendRefs:
      - name: livekit-server
        port: 80
  # Catch-all (WebSocket signaling etc.): unchanged
  - matches:
      - path:
          type: PathPrefix
          value: /
    backendRefs:
      - name: livekit-server
        port: 80

Scoping the override to /twirp keeps the blast radius small: the WebSocket path that carries live calls is untouched, and the change applies without restarts. We verified it without a single live caller by using the LiveKit CLI (lk sip participant create with --wait-until-answered). Before the change, it died at 15 seconds. After, it held through the ringing and returned when the desk answered.
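To catch rules that silently inherit a proxy default, a small audit can walk the route rules and list the ones with no explicit request timeout. This is a sketch, not a tool we ship: the function name is hypothetical, and the rules are written as a Python list that mirrors the YAML above after parsing. Loading the manifest is left out.

```python
def rules_without_timeouts(rules):
    """Return path values of HTTPRoute rules with no explicit timeouts.request."""
    missing = []
    for rule in rules:
        if "request" in rule.get("timeouts", {}):
            continue  # this rule sets its own request timeout
        # A rule with no matches applies to every path, so report it as "/".
        for match in rule.get("matches", [{}]):
            missing.append(match.get("path", {}).get("value", "/"))
    return missing


# Mirrors the two rules shown above.
rules = [
    {
        "matches": [{"path": {"type": "PathPrefix", "value": "/twirp"}}],
        "timeouts": {"request": "120s", "backendRequest": "120s"},
    },
    {"matches": [{"path": {"type": "PathPrefix", "value": "/"}}]},
]
print(rules_without_timeouts(rules))
```

A flagged rule is a prompt to decide, not an error. Here the catch-all is listed because we left it on the default on purpose.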

Hardening beyond the hotfix

The timeout change restored transfers. The incident also exposed assumptions worth fixing properly:

  1. Decouple transfers from any proxy timeout. Dial with wait_until_answered=False, then poll for the new participant joining the room under your own deadline. Keep the caller informed (“still connecting you”) or play ringback with play_dialtone=True instead of silence.
  2. Treat answer latency as an input, not an assumption. A transfer target is an external dependency. Measure how long it takes a human to pick up, and alert when it drifts.
  3. Log raw bodies on non-JSON error responses, so a gateway timeout never looks like a parsing bug again.
  4. Make telephony settings configuration, not code. Transfer numbers, extensions, and DTMF behavior change at the client’s convenience. They belong in an admin panel, not a deployment.
  5. Alert on transfer outcomes. A 100 percent transfer failure rate should page someone before users report it.
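The first item can be sketched as a small loop that owns its deadline. Everything here is an assumption for illustration: start_dial, has_joined, cancel_dial, and notify_caller are hypothetical async callables you would back with your own SDK calls, and the 45-second deadline is an example value, not a recommendation.

```python
import asyncio
import time


async def transfer_with_deadline(start_dial, has_joined, cancel_dial, notify_caller,
                                 deadline_s=45.0, poll_s=1.0, notify_every_s=10.0):
    """Start a dial without waiting, then poll for the answer under our own deadline."""
    await start_dial()  # assumed to return as soon as the dial is placed
    started = time.monotonic()
    last_notice = started
    while time.monotonic() - started < deadline_s:
        if await has_joined():
            return True
        now = time.monotonic()
        if now - last_notice >= notify_every_s:
            await notify_caller("Still connecting you.")
            last_notice = now
        await asyncio.sleep(poll_s)
    await cancel_dial()  # never leave a dial ringing past the deadline
    return False


async def _demo():
    polls = {"n": 0}
    events = []

    async def start_dial():
        events.append("dial")

    async def joins_on_third_poll():
        polls["n"] += 1
        return polls["n"] >= 3

    async def never_joins():
        return False

    async def cancel_dial():
        events.append("cancel")

    async def notify_caller(text):
        events.append("notify")

    answered = await transfer_with_deadline(
        start_dial, joins_on_third_poll, cancel_dial, notify_caller,
        deadline_s=1.0, poll_s=0.01)
    timed_out = await transfer_with_deadline(
        start_dial, never_joins, cancel_dial, notify_caller,
        deadline_s=0.05, poll_s=0.01)
    return answered, timed_out, events


answered, timed_out, events = asyncio.run(_demo())
print(answered, timed_out, events)
```

Because the request that places the dial returns immediately, no proxy timeout sits between the agent and the ringing phone. The deadline, the hold message, and the cleanup all live in code we control.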

Key takeaways

  • A proxy default you never set is still production configuration. Audit the timeouts in every hop in front of long-running API calls.
  • Failures at an exact round number (15.0s, 30.0s, 60.0s) are a timeout signature. Start hunting for whose timeout it is.
  • A plain-text 504 from a JSON API means an intermediary answered, not the service.
  • Correlate application logs with carrier call records, call by call. Telephony bugs rarely show up in one log source.
  • An API call that times out is not an operation that stopped. Design error paths for “failed response, succeeded side effect.”

What this means if you are an IT services firm

If your clients run voice AI that transfers calls to humans, this failure mode is waiting in any stack with a proxy in front of the call-control API. The questions to ask: what is the timeout on every hop in front of the dial-out API, what happens when the API call fails but the dial succeeds, and who gets alerted when transfers fail. If nobody can answer those, that is the audit to run this week. This is the kind of work we do behind IT services firms, under their brand.

Related questions

Why does Envoy Gateway return a 504 after exactly 15 seconds?
Envoy's route-level request timeout defaults to 15 seconds. If the upstream has not completed the response by then, Envoy abandons the request and returns 504 Gateway Timeout with a plain-text body. You can override it per route with the Gateway API timeouts field on an HTTPRoute rule, or with an Envoy Gateway BackendTrafficPolicy.

What is SIP 487 Request Terminated?
487 means the calling side sent a CANCEL before the call was answered. The INVITE was abandoned while still ringing. In our incident it appeared when the AI's room closed because the original caller hung up while the background transfer dial was still ringing, so the platform canceled the pending dial.

What is post-dial delay (PDD)?
PDD is the time between sending the call (the SIP INVITE) and receiving the first ringing indication from the far end. High PDD means the destination carrier or PBX is slow to even start ringing. Callers hear dead air, and any upstream timeout budget starts draining before the phone rings.

How do I increase the timeout for LiveKit's CreateSIPParticipant behind an ingress or gateway?
Find out which proxy fronts your LiveKit server URL, then raise the request timeout on the route serving /twirp. For the Gateway API, set timeouts.request to 120s on the HTTPRoute rule, as in the fix above. For nginx ingress, use the proxy-read-timeout annotation. Scope it to the API path so WebSocket routes are unaffected.

Should voice AI agents use wait_until_answered for transfers?
It is the simplest correct pattern, if every hop allows the request to live as long as a phone can ring. For production resilience, prefer dialing asynchronously and watching for the participant to join, with your own timeout, caller-facing hold feedback, and cleanup logic for dials that outlive the conversation.

Reading this because a client asked for voice AI? That is the conversation we are built for. What taritas does for partners.

PROJECT taritas.com/blog
DWG POST-1
REV 1.0
DATE 2026-06-12