You call your dentist to book a cleaning. The receptionist picks up, and before you even finish saying hello she reads out every available time slot for the next three weeks, asks for your insurance number, tries to transfer you to billing, and then asks if you'd like to reschedule your 2019 root canal. All at the same time. In one breath.
You'd hang up. Obviously.
But this is exactly what happens when you build a conversational AI agent the "normal" way. I know because I built one that did exactly this. And then I spent the next year figuring out why it kept going off the rails and how to actually fix it.
The fix turned out to be embarrassingly simple. And it had nothing to do with using a smarter model.

Here's something most people don't realize about phone conversations: if someone takes more than about a quarter of a second to respond, it feels weird. You start wondering if they're still there. Half a second and you're already saying "hello?"
Now the really smart AI models, the ones that write poetry and solve math problems, take 2-5 seconds just to think. Add pipeline and network latency on top of that and you're looking at 8-10 seconds for a response. Fine for a chatbot. For a phone call that's an eternity of awkward silence.
So you're stuck using the smaller, faster models. Think of it like this. The smart model is a professor who takes a minute to give you a perfect answer. The fast model is your mate who responds instantly but sometimes says something completely unhinged if you confuse him.
The fast mate is who I'm working with. The question is how do you stop him from saying unhinged things.
Picture a restaurant waiter on their first day. The manager hands them a 30-page manual that covers everything. How to greet guests, the full menu with allergens, the wine list, how to process payments, the reservation system, the fire evacuation procedure, and the protocol for when someone proposes at table 12.
Now a customer walks up and says "Hi, table for two?"
The waiter, overwhelmed by 30 pages of instructions competing for attention in their brain, panics and starts reciting the wine list. The customer just wanted a table. But the waiter's head is so full of everything they might need to know that they grabbed the wrong bit of information.
This is exactly what happens with conversational AI when you give a small fast model too many instructions at once. I call it context pollution. Instructions for one situation bleed into another.
A good restaurant doesn't hand the waiter one giant manual. They break the job into stages.
Stage 1, greeting. Smile. Ask how many guests. Show them to a table. That's it. You don't need to know the wine list yet.
Stage 2, taking orders. Now you need the menu knowledge. But you don't need the payment system yet.
Stage 3, payment. Now you need the card machine. But you can completely forget the fire evacuation procedure (unless something is actively on fire, in which case you have bigger problems).
Each stage has its own small cheat sheet. The waiter only focuses on what's relevant right now. Their brain isn't cluttered with stuff that doesn't matter yet.
That's exactly what I built. Instead of one massive set of instructions, the AI moves through conversation states. Greeting, conversation, collecting information, booking appointments, transferring to a human. At each state it gets only the instructions and tools it needs for that specific moment.
The "everything at once" approach gives the AI roughly 18,000 characters of instructions to process. That's about 7 pages of text the AI has to hold in its head while trying to respond in a quarter second.
The state-based approach? About 3,000 characters per state. Just over one page. The AI goes from reading a novel to reading a sticky note.
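The whole idea fits in a small table. Here's a minimal sketch of what that looks like in code; the state names, prompts, and tools below are illustrative stand-ins, not the exact ones from my system:

```python
# Each state carries its own small prompt, its own tools, and an
# explicit list of states it's allowed to move to. Nothing else.
STATES = {
    "greeting": {
        "prompt": "Greet the caller warmly and ask how you can help.",
        "tools": [],
        "next": ["conversation"],
    },
    "conversation": {
        "prompt": "Find out what the caller needs. Don't book anything yet.",
        "tools": ["faq_lookup"],
        "next": ["collect_info", "transfer"],
    },
    "collect_info": {
        "prompt": "Collect name, phone, and preferred time. One question at a time.",
        "tools": ["patient_lookup"],
        "next": ["booking", "transfer"],
    },
    "booking": {
        "prompt": "Confirm the requested slot, then book it.",
        "tools": ["check_calendar", "book_appointment"],
        "next": ["conversation"],
    },
    "transfer": {
        "prompt": "Tell the caller you're transferring them, then transfer.",
        "tools": ["transfer_to_human"],
        "next": [],
    },
}

def context_for(state: str) -> dict:
    """Build the model's context from the current state ONLY.
    The model never sees instructions meant for other states."""
    s = STATES[state]
    return {"system_prompt": s["prompt"], "tools": s["tools"]}

def transition(state: str, target: str) -> str:
    """Only allow moves the state machine explicitly permits."""
    if target not in STATES[state]["next"]:
        raise ValueError(f"illegal transition: {state} -> {target}")
    return target
```

The `next` lists are doing quiet but important work: the model can't wander from greeting straight into booking, because the state machine simply won't let it.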
Unsurprisingly it gets a lot less confused.
Once you have these states you can do something really clever. Give the AI two brains.
Think of it like a TV news broadcast. You have the anchor on camera, talking smoothly, keeping the audience engaged. And you have the producer in the control room pulling up research, queueing clips, making the complex decisions behind the scenes.
The anchor doesn't need to be a genius. They need to be quick, natural, and good at talking. The producer doesn't need to be fast. They need to be thorough and accurate.
My conversational AI agent works the same way. The fast "anchor" model handles the conversation in real time. When something complex needs to happen, like actually booking an appointment or looking up patient records, it basically says "hey handle this" and the slower smarter "producer" model takes over that task in the background.
The conversation never pauses. The person on the phone hears "Let me check that for you" while the smart model is quietly doing the hard work behind the scenes.
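Here's roughly how that hand-off looks, sketched with asyncio. The two model calls are hypothetical placeholders for the real fast and slow endpoints; the point is that the slow one runs as a background task while the fast one answers immediately:

```python
import asyncio

async def call_fast_model(user_text: str) -> str:
    # Placeholder for the low-latency "anchor" model.
    return "Let me check that for you."

async def call_slow_model(task: str) -> str:
    # Placeholder for the slower, smarter "producer" model.
    await asyncio.sleep(0.1)  # stands in for a few seconds of thinking
    return f"done: {task}"

async def handle_turn(user_text: str, background: set) -> str:
    if "book" in user_text.lower():
        # Hand the hard part to the producer WITHOUT awaiting it,
        # so the conversation never blocks on the slow model.
        job = asyncio.create_task(call_slow_model("book appointment"))
        background.add(job)
    # The anchor replies immediately either way.
    return await call_fast_model(user_text)
```

The caller gets "Let me check that for you" in anchor-speed time, and the booking result arrives a turn or two later, when the producer finishes.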
This is my favourite part. When you book an appointment with a human receptionist they don't guess the doctor's schedule. They look at the calendar. They don't improvise your patient ID. They look it up.
Most AI systems? They guess. They try to construct all the booking details themselves and sometimes they get creative. "Creative" is great for writing fiction. It's terrible for booking your root canal at the right time with the right dentist.
In my system the AI's only job is to understand what you want. "3pm Tuesday please." The system then takes that and fills in everything else from actual data. The correct doctor, the right appointment type, your patient record from your phone number, the exact time converted to the right timezone.
The AI contributed the intent. The system built the facts. No guessing. No hallucinating a doctor that doesn't exist. No booking you at 3am instead of 3pm because it got confused about timezones (yes that has happened).
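A minimal sketch of that split, with made-up lookup tables standing in for the real calendar and patient database. The model's output is the small `intent` dict; every other field is resolved from data:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Illustrative stand-ins for the real patient database and rota.
PATIENTS_BY_PHONE = {"+15550001": {"id": "p-001", "name": "Alex"}}
DOCTORS_BY_SERVICE = {"cleaning": "dr-smith"}
CLINIC_TZ = ZoneInfo("America/New_York")

def build_booking(intent: dict, caller_phone: str) -> dict:
    """intent is the model's ONLY contribution, e.g.
    {"service": "cleaning", "date": "2024-06-04", "time": "15:00"}.
    Everything else is looked up, never generated."""
    patient = PATIENTS_BY_PHONE[caller_phone]        # looked up, not guessed
    doctor = DOCTORS_BY_SERVICE[intent["service"]]   # looked up, not guessed
    naive = datetime.fromisoformat(f"{intent['date']} {intent['time']}")
    when = naive.replace(tzinfo=CLINIC_TZ)           # explicit timezone: no 3am surprises
    return {
        "patient_id": patient["id"],
        "doctor": doctor,
        "start": when.isoformat(),
    }
```

If the model names a service or time that doesn't exist, the lookup fails loudly instead of a fictional booking slipping through, which is exactly the behavior you want.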
Building conversational AI that actually works on the phone isn't about making the AI smarter. It's about making its job simpler.
A confused genius is worse than a focused average person. Give someone clear instructions for exactly what they need to do right now and they'll nail it. Dump an entire operations manual on them and ask them to respond in a quarter second? They'll panic and start reciting the wine list.
The secret isn't a bigger brain. It's a better plan.
The state machine is the external intelligence that compensates for what small models lack.