My AI agent booked a patient with a doctor that doesn't exist.
Not a crash. Not an error message. The model confidently said "You're all booked for Tuesday at 3 with Dr. Smith!" and the caller hung up happy. Except the provider ID it passed to the booking API was off by one digit. The appointment went into a void. The patient showed up to a clinic that had no record of them.
This happened because I was doing what every AI agent framework tells you to do. Give the model a list of tools with JSON schemas, let it decide when to call them, and hope it formats the parameters correctly.
That "hope" was failing about 30% of the time. I tried prompt engineering, few-shot examples, structured output enforcement. Got it down to maybe 15%. For a text chatbot that might be fine. For a phone agent handling real medical bookings where there's no UI to catch the mistake? Not even close.
So I stopped letting the AI call tools. And started letting it say keywords instead.

When you ask an AI to "call a tool," it's actually doing two things at once that have nothing to do with each other.
The first is understanding what the person wants. Someone says "I'd like to book something for next week" and the model recognises that's a booking request. This is what language models are literally built for. Even small ones nail this almost every time.
The second is assembling the exact data the booking system needs. The right doctor ID, the right patient ID, the right date in the right format with the right timezone. That's not a language task. That's a data entry task. And it's the part that keeps going wrong.
Think of it like asking someone to have a phone conversation while simultaneously filling out a tax form. Under time pressure. In a language they're okay at but not fluent in. They'll keep talking fine. The tax form will have mistakes.
On a voice call you can't use a big powerful model that takes 2-5 seconds to think. Add pipeline and network latency and that's 8-10 seconds of silence. You need a small fast model that responds in under 300ms. These models are great at conversation. They're not great at producing perfectly structured JSON with five correctly-typed parameters while also talking naturally.
The fix was obvious once I saw it. Stop asking one model to do both jobs.
Instead of giving my model tool schemas and JSON formats, I gave it a list of keywords.
Available actions (the caller won't hear these):
- {{check_availability}} — Check open appointment slots
- {{book_appointment:TIME DAY}} — Book at the chosen time
- {{search_info:question}} — Look up business information
- {{end_call}} — End the call
- {{transfer}} — Transfer to a human

When the model decides it's time to check availability, it generates something like

"Sure, let me see what we've got {{check_availability}}"

That's it. The model's entire contribution is a keyword tag tucked inside a natural sentence. The text-to-speech filter catches the tag and strips it before the caller hears anything. They just hear "Sure, let me see what we've got." Clean. Natural.
Behind the scenes, the keyword dispatcher intercepts the tag and hands the action off to a second, more powerful model running in the background. This bigger model isn't under any time pressure. The conversation is already flowing, the caller heard a natural response, nobody is waiting in silence.
This bigger model gets the full conversation history, the pre-fetched integration data with all the provider IDs and service types and patient records, the variables collected during the call, and all the time it needs to reason through the tool call correctly.
Because it's not racing against a 150ms deadline, it can actually think. It reads the reference data carefully, matches "Dr Smith" to the correct provider ID, resolves "3pm Tuesday" to a properly formatted timestamp with the right timezone, checks whether this is a new or existing patient, and gets it right.
The result flows back through an async queue. The speaking model gets nudged with the confirmation and naturally says "You're all set, Tuesday at 3pm with Dr. Smith."
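The hand-off itself can be sketched with an asyncio queue: the speaking loop drops the keyword on the queue and keeps talking, while a background worker stands in for the larger model. `resolve_action` is an assumed placeholder, not a real API.

```python
import asyncio

async def resolve_action(keyword, hint):
    await asyncio.sleep(0.05)  # stands in for slow LLM reasoning + API calls
    return f"resolved:{keyword}"

async def worker(queue, results):
    # Background consumer: pulls keywords off the queue as they arrive.
    while True:
        keyword, hint = await queue.get()
        results.append(await resolve_action(keyword, hint))
        queue.task_done()

async def main():
    queue, results = asyncio.Queue(), []
    task = asyncio.create_task(worker(queue, results))
    await queue.put(("book_appointment", "3pm Tuesday with Dr Smith"))
    # The speaking loop would keep the conversation going here;
    # nothing caller-facing blocks on the result.
    await queue.join()
    task.cancel()
    return results

print(asyncio.run(main()))  # ['resolved:book_appointment']
```

The key property is that `put` returns immediately; the conversation never waits on the background work.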
Let me make this concrete. A caller wants to book a dental cleaning with Dr. Smith next Tuesday at 3pm. The system has already identified them as an existing patient (PAT-4821) and the pre-fetched data includes Dr. Smith (DOC-1092) and Dental Cleaning as a service (SVC-055).
The model receives a tool schema and has to dig through its context window to find the right IDs, format the date, remember the timezone, and assemble it all into JSON.
```json
{
  "name": "book_appointment",
  "parameters": {
    "provider_id": "DOC-1092",
    "patient_id": "PAT-4821",
    "service_id": "SVC-055",
    "start_time": "2026-03-03T15:00:00+11:00",
    "duration_minutes": 45
  }
}
```

Looks clean on paper. Here's what actually happens with a small fast model under time pressure. These are real failure patterns I saw in production.
```json
{
  "provider_id": "DOC-1093",  // off by one digit. doesn't exist.
  "patient_id": "PAT-4821",
  "service_id": "SVC-055",
  "start_time": "2026-03-03T15:00:00+11:00",
  "duration_minutes": 45
}
```

```json
{
  "provider_id": "PAT-4821",  // patient ID in provider slot
  "patient_id": "DOC-1092",   // provider ID in patient slot
  "service_id": "SVC-055",
  "start_time": "2026-03-03T15:00:00+11:00",
  "duration_minutes": 45
}
```

```json
{
  "provider_id": "DOC-1092",
  "patient_id": "PAT-4821",
  "service_id": "SVC-055",
  "start_time": "next Tuesday at 3pm",  // not a timestamp
  "duration_minutes": 45
}
```

The swapped parameters one is the scariest. Both values are valid strings. No type error. The API accepts it and creates a completely corrupted record. Nobody finds out until the patient shows up.
Same scenario. Same caller. Same small model on the front line. But now the model just says
"Perfect, let me lock that in for you {{book_appointment:3pm Tuesday with Dr Smith}}"The caller hears "Perfect, let me lock that in for you."
The keyword gets intercepted and handed to the bigger model running in the background. That model has the full context, the pre-fetched data, and no clock ticking. It builds everything correctly.
Keyword: "book_appointment"
Hint: "3pm Tuesday with Dr Smith"
Background model resolving with full context:
→ provider_id: "DOC-1092" matched "Dr Smith" against prefetched provider list
→ patient_id: "PAT-4821" from caller phone number lookup at call start
→ service_id: "SVC-055" from conversation state
→ start_time: "2026-03-03T15:00:00+11:00" resolved "3pm Tuesday" + business timezone
→ duration_minutes: 45 from service type config
All parameters verified. Executing.The small model contributed a keyword and a conversational time reference. The big model contributed the actual structured tool call, built correctly because it had the intelligence and the time to get it right.
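The deterministic parts of that resolution can be sketched as lookups against the pre-fetched data. The tables, field names, and +11:00 offset here mirror the worked example; in production the larger model does the fuzzy matching, not a dict.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical pre-fetched integration data, loaded at call start.
PROVIDERS = {"dr smith": "DOC-1092"}
SERVICES = {"dental cleaning": {"id": "SVC-055", "duration_minutes": 45}}

def resolve_booking(provider_hint, service, slot, patient_id):
    tz = timezone(timedelta(hours=11))  # business timezone, assumed +11:00
    svc = SERVICES[service]
    return {
        "provider_id": PROVIDERS[provider_hint.lower().replace(".", "")],
        "patient_id": patient_id,  # from phone-number lookup at call start
        "service_id": svc["id"],
        "start_time": slot.replace(tzinfo=tz).isoformat(),
        "duration_minutes": svc["duration_minutes"],
    }

params = resolve_booking("Dr Smith", "dental cleaning",
                         datetime(2026, 3, 3, 15, 0), "PAT-4821")
print(params["start_time"])  # 2026-03-03T15:00:00+11:00
```

Every field comes from authoritative data rather than from the model's short-term memory, which is why the off-by-one and swapped-slot failures can't happen here.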
With the traditional approach, one model does everything. It talks to the caller, decides when to act, builds the parameters, and handles the response. All under a 150ms latency budget. With a model small enough to hit that speed.
With keyword dispatch, the jobs are split. The small fast model handles conversation and intent. It says when to act, in natural language, embedded in natural speech. The big powerful model handles the structured work in the background, with no time pressure, with full access to all the data it needs.
The small model was never bad at understanding what the caller wanted. It was bad at turning that understanding into perfectly formatted JSON while also holding a conversation under time pressure. I didn't fix the small model. I stopped making it do the part it's bad at and gave that part to a model that has the capacity and the time to do it properly.
Traditional tool calling fails silently and dangerously. Hallucinated IDs that look plausible. Swapped parameters that pass type checking because they're all strings. The model tells the caller they're booked and everything looks fine until it isn't.
Keyword dispatch has two failure modes, both recoverable. Either the small model doesn't output the keyword when it should, in which case the tag never fires and the conversation just continues. Or the big model fails to build the parameters, in which case it can retry without any caller-facing silence because the conversation is still flowing.
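The second failure mode is cheap to handle precisely because it happens off-call. A background retry sketch, where `build_params` stands in for the larger model's parameter-building call (names are illustrative):

```python
import time

def with_retries(build_params, attempts=3, delay=0.0):
    """Retry the background build; fall back to a human hand-off."""
    last_err = None
    for _ in range(attempts):
        try:
            return build_params()
        except ValueError as err:  # e.g. a provider it couldn't resolve
            last_err = err
            time.sleep(delay)  # backoff happens off-call; no dead air
    return {"status": "needs_human", "reason": str(last_err)}

calls = {"n": 0}
def flaky_builder():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ValueError("provider unresolved")
    return {"provider_id": "DOC-1092"}

print(with_retries(flaky_builder))  # {'provider_id': 'DOC-1092'}
```

If every attempt fails, the result is an explicit escalation rather than a silently wrong booking.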
A recoverable non-action versus a confident wrong action. For a system handling real appointments for real people that's not a tradeoff. It's an obvious choice.
This pattern exists because of constraints unique to phone conversations.
Latency kills. You have maybe 200-300ms before a pause feels unnatural. That forces small fast models. Small models can't reliably produce structured tool calls. Give me a big model with 5 seconds of think time and traditional tool calling works fine. On a voice call I don't have that luxury. But with keyword dispatch I can use the big model in the background where latency doesn't matter.
There's no visual safety net. In a chat UI a user can see the tool call, see the error, type a correction. On a phone call everything is invisible. The caller trusts what they hear. "You're booked for Tuesday at 3" is either right or wrong, and they won't know which until they show up.
And there's no retry loop. In a text agent if a tool call fails you can retry with corrected parameters. On a voice call every retry is dead air. With keyword dispatch, retries happen in the background. The caller never notices.
This isn't free.
You need a bounded action space. I know every action my agent might take because the conversation flow defines them. A general-purpose coding agent that might call any of 500 tools has a fundamentally different problem.
You're running two models instead of one. That's more infrastructure, more cost, more complexity. The tradeoff is worth it when silent failures on a live call cost you real patients and real trust. It might not be worth it for a low-stakes chatbot.
And about 3% of the time the small model understands it should act but doesn't include the keyword. That's recoverable through conversation monitoring. Compare it to the 15% rate of silently corrupted data with the old approach and the math is pretty clear.
Before keyword dispatch, after extensive prompt optimization, I was seeing about 15% tool call failures. Hallucinated parameters, format errors, wrong tool selection, missing fields. Every failure was either silently wrong or required awkward recovery on a live call.
After: 0% parameter hallucination, because parameters come from a model that has the time and context to get them right. About 3% keyword miss rate, which is recoverable. Net reliability went from 85% to 97%+, with the remaining 3% being gracefully recoverable instead of silently corrupt.
Humans separate deciding from executing all the time. A doctor decides a patient needs blood work. They don't personally operate the lab equipment. They write an order and the lab handles execution with the right parameters and protocols.
That's what keyword dispatch is. The small model is the doctor. The big model is the lab. The keyword is the order form.
Neither model changed. The architecture changed. And that made all the difference.