blog.category.engineeringApril 22, 2026·9 blog.minRead

Why we picked Anthropic Claude for our AI phone operator

We compared Claude, GPT-4o, and Gemini on natural Spanish, latency, and real tool use. When voice is the channel, the gap shows.

OLISE Team

Engineering

When you build an AI phone operator, the language model is not "one more piece." It is the piece. If it misunderstands, mistransfers, or mis-schedules, everything else —telephony, voice, integrations— loses value.

We spent three months comparing Claude (Anthropic), GPT-4o (OpenAI), and Gemini 1.5 (Google) under real conditions: car rental, hotel, and restaurant calls in Castilian Spanish, Mexican Spanish, and Rioplatense Spanish. Here is what we found.

The criterion: precision over vocabulary

In voice, it does not matter how many pretty sentences the model produces. What matters is whether it correctly extracts data from the transcribed audio —dates, phone numbers, car names— and whether it decides well when to transfer to a human.

We measured four things, in order of importance:

Slot extraction precision (pickup date, return date, car category, customer name).
First-token latency (how long until it starts speaking after the customer's last audio).
Correct tool calling on the first try (did it call the right tool with the right arguments?).
Unnecessary transfer rate (did it transfer to a human when it could have resolved alone?).

What we saw in spoken Spanish

GPT-4o has a more "conversational style" in Spanish, no question. Its sentences sound more natural in the abstract. But when the audio has noise —and phone audio always has noise— Claude extracts data with more reliability. On our 480-call hand-labeled test set, Claude Opus hit 97% on key slots in car rental. GPT-4o, 91%. Gemini, 88%.

Six points of precision sound small until you multiply by 1,000 calls/month. At 91% that is 90 calls with one wrong field landing in the CRM. At 97%, 30. The difference between operational confidence and operational doubt.

Tool use: where Claude pulls ahead

Here Claude is on another level. When the customer says "I want a midsize SUV for Friday and Saturday, charge it to last year's card," Claude:

1Calls search_availability with category: 'SUV-medium', start_date: 2026-05-01, end_date: 2026-05-02.
2In parallel, calls get_customer_payment_methods.
3Waits for both, then composes the next sentence.

GPT-4o, on the same transcript, tends to issue a single tool call and ask "which card?" without checking first. That is an extra round-trip that, in voice, becomes 600-900ms more silence.

Latency: technical tie

In streaming, the three models are reasonably close on first token. The actual difference is your pipeline: ASR → LLM → TTS. To keep the sub-second response we promise, we use:

Streaming ASR (Twilio Stream + Deepgram Nova-3).
Claude with stream: true and max_tokens: 500 (short voice sentences).
ElevenLabs Flash 2.5 with audio cache for common phrases.

The bottleneck is almost never the LLM. It is the network between the telephony provider and our edge.

Safety: what is invisible

Claude has the cleanest integration with prompt-injection defenses via the system message. Because public input (the customer's audio) always arrives as role: 'user', we can lock down behavior from the system prompt without the customer being able to override it. We wrote that up at length in another post.

Why we picked

Better precision on extracting data from noisy spoken Spanish.
Parallel tool use, not sequential.
Fewer unnecessary human transfers.
Acceptable use policy more aligned with regulated B2B product.

This is not final. We re-evaluate every six months. For now, Claude.

LLMClaudeVoice

blog.share X

blog.related

blog.category.engineering