Why we picked Anthropic Claude for our AI phone operator
We compared Claude, GPT-4o, and Gemini on natural Spanish, latency, and real tool use. When voice is the channel, the gap shows.
OLISE Team
Engineering
When you build an AI phone operator, the language model is not "one more piece." It is the piece. If it misunderstands, mistransfers, or mis-schedules, everything else —telephony, voice, integrations— loses value.
We spent three months comparing Claude (Anthropic), GPT-4o (OpenAI), and Gemini 1.5 (Google) under real conditions: car rental, hotel, and restaurant calls in Castilian Spanish, Mexican Spanish, and Rioplatense Spanish. Here is what we found.
The criterion: precision over vocabulary
In voice, it does not matter how many pretty sentences the model produces. What matters is whether it correctly extracts data from the transcribed audio —dates, phone numbers, car names— and whether it decides well when to transfer to a human.
We measured four things, in order of importance:
- Slot extraction precision (pickup date, return date, car category, customer name).
- First-token latency (how long until it starts speaking after the customer's last audio).
- Correct tool calling on the first try (did it call the right tool with the right arguments?).
- Unnecessary transfer rate (did it transfer to a human when it could have resolved alone?).
What we saw in spoken Spanish
GPT-4o has a more "conversational style" in Spanish, no question. Its sentences sound more natural in the abstract. But when the audio has noise —and phone audio always has noise— Claude extracts data with more reliability. On our 480-call hand-labeled test set, Claude Opus hit 97% on key slots in car rental. GPT-4o, 91%. Gemini, 88%.
Six points of precision sound small until you multiply by 1,000 calls/month. At 91% that is 90 calls with one wrong field landing in the CRM. At 97%, 30. The difference between operational confidence and operational doubt.
Tool use: where Claude pulls ahead
Here Claude is on another level. When the customer says "I want a midsize SUV for Friday and Saturday, charge it to last year's card," Claude:
- 1Calls
search_availabilitywithcategory: 'SUV-medium',start_date: 2026-05-01,end_date: 2026-05-02. - 2In parallel, calls
get_customer_payment_methods. - 3Waits for both, then composes the next sentence.
GPT-4o, on the same transcript, tends to issue a single tool call and ask "which card?" without checking first. That is an extra round-trip that, in voice, becomes 600-900ms more silence.
Latency: technical tie
In streaming, the three models are reasonably close on first token. The actual difference is your pipeline: ASR → LLM → TTS. To keep the sub-second response we promise, we use:
- Streaming ASR (Twilio Stream + Deepgram Nova-3).
- Claude with
stream: trueandmax_tokens: 500(short voice sentences). - ElevenLabs Flash 2.5 with audio cache for common phrases.
The bottleneck is almost never the LLM. It is the network between the telephony provider and our edge.
Safety: what is invisible
Claude has the cleanest integration with prompt-injection defenses via the system message. Because public input (the customer's audio) always arrives as role: 'user', we can lock down behavior from the system prompt without the customer being able to override it. We wrote that up at length in another post.
Why we picked
- Better precision on extracting data from noisy spoken Spanish.
- Parallel tool use, not sequential.
- Fewer unnecessary human transfers.
- Acceptable use policy more aligned with regulated B2B product.
This is not final. We re-evaluate every six months. For now, Claude.
blog.related
blog.category.engineering
How we built prompt-injection defense in production
Three layers of defense, an output filter with regex, and why we never trust a single line of protection. Real case with code.
blog.category.industry
AI phone operator vs chatbot: why it matters for luxe hospitality
Not all channels are equal. Voice has a different cognitive cost, a different service expectation, and a different tolerance for error.
blog.category.industry
Case study: how a Founder Pricing partner scaled their phone coverage
A car rental operation in Florida. Missed calls before and after OLISE. Lessons from the first 90 days.