My AI agent said "You're all booked for Tuesday at 3!" and then did absolutely nothing. No booking was made. No appointment existed.
The model wasn't broken. It wasn't hallucinating. It genuinely believed it had triggered the booking. But somewhere between saying "let me check that for you" and the tool call completing, the async operation silently failed. The agent never noticed. It just kept talking like everything was fine.
That's the thing about voice AI failures. They're invisible. In a text chatbot the user sees the error, types "try again," and moves on. On a phone call the caller trusts what they hear. If the agent says you're booked, you're booked. Except when you're not.
I spent months trying to make the speaking model more reliable. Better prompts. More guardrails. Retry logic. None of it worked because I was solving the wrong problem. The speaking model's job is to talk. Asking it to also monitor itself, track its own promises, detect its own silence, and predict what comes next is like asking a surgeon to perform an operation while simultaneously monitoring their own vital signs.
So I stopped trying to fix the model. I built three systems around it instead.

Think of a live television broadcast. The anchor sits in front of the camera, talking smoothly, reacting in real time, keeping millions of viewers engaged. But behind them is an entire control room they never see. A producer queuing segments. A director switching cameras. A technical operator monitoring audio levels and ready to cut to backup if anything goes wrong.
The anchor doesn't know any of this is happening. They just talk. And because the crew is handling everything else, the broadcast runs smoothly even when things go wrong behind the scenes.
That's what I built for my conversational AI. Three invisible systems running in parallel, each handling a different class of failure that the speaking model can't handle itself. A predictive engine that sees the future. A watchdog that understands context. A guardian agent that supervises the AI. The speaking model has no idea they exist. It just works better because of them.
Here's a timing problem most people don't think about. When you call a business and ask a question, the receptionist starts answering almost instantly. Maybe 200 milliseconds of pause feels natural. But when an AI model needs to generate a response, that takes 200-400ms minimum. Add network latency, text-to-speech processing, and audio streaming on top, and you're looking at a noticeable delay on every single response.
It's like talking to someone on a satellite phone. Technically functional. Practically awful. People start talking over each other. The rhythm of conversation breaks down.
But there's a window most systems waste completely. After the agent finishes speaking, the caller takes 2-5 seconds to think and respond. During those seconds the system is just sitting there, idle, waiting. My predictive engine uses that dead time to guess what the caller will say next and pre-generate responses before they even open their mouth.
My conversation system runs on a flow graph. Think of it like a choose-your-own-adventure book. Each page has a limited number of places you can go next. The predictive engine looks at the current page and asks "what are the most likely next moves?"
Not every prediction is equally confident. A caller who just heard available appointment times will almost certainly pick one or ask for more options. That's high confidence. A caller in an open-ended conversation could say anything. That's low confidence. The engine scores each prediction and only pre-generates for the top 2-3 candidates.
Keyword triggers (caller says "book" after hearing times) → 0.9 confidence
Flow-based transitions (next expected step in the process) → 0.5-0.7 confidence
Data collection (predicting next missing piece of info) → 0.6 confidence
Generic fallback (predict "yes" or simple acknowledgment) → 0.4 confidence

The engine also learns from patterns. It keeps a small in-memory tracker of what callers actually said at each point in the conversation. If 80% of callers at the "pick a time" step say "Tuesday works," the confidence for that prediction goes up. If nobody ever takes a particular path, it drops off the prediction list entirely.
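Here's a rough sketch of how that scoring and learning loop could fit together. The trigger types and base confidences mirror the list above; the class name, the boost factor, and the cutoffs are illustrative assumptions, not my production values.

```python
from collections import Counter, defaultdict

# Base confidences per trigger type, mirroring the list above.
BASE_CONFIDENCE = {
    "keyword": 0.9,
    "flow_transition": 0.6,
    "data_collection": 0.6,
    "generic_fallback": 0.4,
}

class PredictionScorer:
    """Scores candidate next-utterances and nudges them with observed history."""

    def __init__(self, top_k: int = 3, min_confidence: float = 0.35):
        self.top_k = top_k
        self.min_confidence = min_confidence
        # flow node id -> Counter of what callers actually said at that node
        self.history: dict[str, Counter] = defaultdict(Counter)

    def record(self, node_id: str, utterance: str) -> None:
        """Remember what a caller actually said at this point in the flow."""
        self.history[node_id][utterance.lower()] += 1

    def score(self, node_id: str, candidates: list[tuple[str, str]]) -> list[tuple[str, float]]:
        """candidates: (predicted_utterance, trigger_type). Returns top-k (utterance, confidence)."""
        seen = self.history[node_id]
        total = sum(seen.values())
        scored = []
        for utterance, trigger_type in candidates:
            conf = BASE_CONFIDENCE.get(trigger_type, 0.4)
            if total:
                # Boost predictions callers actually make: if 80% of callers at this
                # node say "tuesday works", that candidate's confidence climbs.
                observed = seen[utterance.lower()] / total
                conf = min(0.95, conf + 0.3 * observed)
            if conf >= self.min_confidence:
                scored.append((utterance, conf))
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[: self.top_k]
```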
When the caller does speak, the system tries to match their actual words against the pre-generated responses. Exact match first. Then keyword overlap. Then fuzzy matching for cases where someone says "yeah Tuesday is good" instead of "Tuesday works." If any match hits, the response fires instantly. Zero generation latency. The caller hears a response faster than any human receptionist could manage.
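A minimal sketch of that matching cascade, assuming the cache maps predicted caller utterances to pre-generated agent responses. The stopword list, overlap threshold, and fuzzy cutoff are stand-ins, not tuned values.

```python
import difflib

# Filler words that carry no intent; illustrative, not a tuned list.
STOPWORDS = {"yeah", "yes", "sure", "ok", "okay", "is", "are", "that", "the", "a",
             "an", "works", "good", "fine", "sounds", "for", "me", "please"}

def keywords(text: str) -> set[str]:
    return {w for w in text.lower().split() if w not in STOPWORDS}

def match_cached_response(utterance: str, cache: dict[str, str]) -> str | None:
    """Return a pre-generated response if the caller's words match a prediction."""
    cache = {k.lower(): v for k, v in cache.items()}
    said = utterance.lower().strip()

    # 1. Exact match: cheapest and safest.
    if said in cache:
        return cache[said]

    said_keys = keywords(said)
    for predicted, response in cache.items():
        pred_keys = keywords(predicted)
        # 2. Keyword overlap: "yeah tuesday is good" and "tuesday works"
        #    both reduce to {"tuesday"}, so they match here.
        if pred_keys and len(said_keys & pred_keys) / len(pred_keys) >= 0.7:
            return response
        # 3. Fuzzy match as a last resort for near-identical phrasings.
        if difflib.SequenceMatcher(None, said, predicted).ratio() >= 0.85:
            return response

    return None  # cache miss: fall through to normal generation
```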
The tricky part is knowing when to throw predictions away. If the conversation state changes, if a tool call completes with new data, if instructions update mid-call, every cached prediction becomes stale. The engine aggressively invalidates on any state change. A stale prediction served confidently is worse than no prediction at all.
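One simple way to keep that invalidation aggressive is to not be clever about it at all: any state change drops the whole cache. This is a simplified sketch, not the actual implementation.

```python
class PredictionCache:
    """Pre-generated responses that live only as long as the current conversation state."""

    def __init__(self):
        self._entries: dict[str, str] = {}
        self.invalidations = 0      # kept for observability
        self.last_reason = ""

    def put(self, predicted_utterance: str, response: str) -> None:
        self._entries[predicted_utterance.lower()] = response

    def get(self, utterance: str) -> str | None:
        return self._entries.get(utterance.lower())

    def invalidate(self, reason: str) -> None:
        """Called on any state change: flow node moved, tool result arrived,
        instructions updated mid-call. Everything cached is dropped outright."""
        self._entries.clear()
        self.invalidations += 1
        self.last_reason = reason
```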
Silence on a phone call can mean two completely different things. Either someone is checking their calendar, thinking about their answer, looking up an insurance number. Or something has gone horribly wrong and the system is stuck.
A naive approach would be a simple timer. If nobody speaks for 10 seconds, do something. But that's like a smoke detector that goes off every time you make toast. It triggers recovery when the caller is just thinking, which interrupts them. And it waits too long when the system is actually stuck, which loses them.
My watchdog understands context. It knows why the silence is happening.
The watchdog tracks five distinct conversation states, each with different rules for how long silence is acceptable.
AGENT_WAITING_FOR_USER
The agent asked a question. Silence means the caller is thinking.
Threshold: 15 seconds. Gentle check-in, max 2 times.
USER_WAITING_FOR_AGENT
The caller asked something. Silence means the system might be stuck.
Threshold: 15 seconds. When it hits, skip the check-in and trigger recovery immediately.
NATURAL_PAUSE
Nobody asked anything. Both sides are quiet.
Threshold: 20 seconds. Soft "Was there anything else?"
STARTING
Call just connected. Special grace period.
CALL_ENDING
Farewells detected. Recovery disabled entirely.

The difference matters enormously. When the agent just asked "What time works for you?" and the caller goes quiet for 8 seconds, the right response is patience. They're checking their schedule. Interrupting them with "Are you still there?" is annoying.
But when the agent said "Let me check that for you" and then went silent for 8 seconds, something is wrong. The tool call probably failed. The system needs to recover before the caller hangs up.
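Wired up as configuration, the per-state rules might look something like the sketch below. The state names and thresholds come from the list above; the STARTING grace period length and the action names are assumptions for illustration.

```python
import enum
from dataclasses import dataclass

class ConversationState(enum.Enum):
    AGENT_WAITING_FOR_USER = enum.auto()
    USER_WAITING_FOR_AGENT = enum.auto()
    NATURAL_PAUSE = enum.auto()
    STARTING = enum.auto()
    CALL_ENDING = enum.auto()

@dataclass(frozen=True)
class SilenceRule:
    threshold_s: float | None   # None means silence never triggers anything
    action: str                 # what to do once the threshold is crossed
    max_triggers: int = 1

# Per-state rules mirroring the thresholds described above.
SILENCE_RULES = {
    ConversationState.AGENT_WAITING_FOR_USER: SilenceRule(15.0, "gentle_check_in", max_triggers=2),
    ConversationState.USER_WAITING_FOR_AGENT: SilenceRule(15.0, "trigger_recovery"),
    ConversationState.NATURAL_PAUSE: SilenceRule(20.0, "soft_prompt"),
    ConversationState.STARTING: SilenceRule(30.0, "grace_period_check"),  # grace period length assumed
    ConversationState.CALL_ENDING: SilenceRule(None, "none"),             # recovery disabled entirely
}

def on_silence(state: ConversationState, silence_s: float, triggers_so_far: int) -> str | None:
    """Return the action for this much silence in this state, or None to keep waiting."""
    rule = SILENCE_RULES[state]
    if rule.threshold_s is None or silence_s < rule.threshold_s:
        return None
    if triggers_so_far >= rule.max_triggers:
        return None
    return rule.action
```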
This is where it gets interesting. The watchdog listens for transitional phrases. "Let me check." "One moment." "I'll look that up." When the agent says something like this, the watchdog sets a short 4-second fuse. If the agent doesn't follow up within 4 seconds, something broke. Recovery kicks in.
But there's an edge case that took me a while to catch. Sometimes the agent says "Let me check that. What was your date of birth?" That message contains both a transitional phrase and a question. The transitional phrase is incidental. The real intent is the question. If the watchdog treated this as a pending action, it would trigger false recovery while the caller is answering the question. So the system checks whether the message also contains a question or delivers real content. If it does, the transitional phrase is ignored.
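In code, the transitional-phrase check plus that edge case could look roughly like this. The phrase list and the word-count heuristic for "real content" are illustrative stand-ins, not the exact rules I run.

```python
TRANSITIONAL_PHRASES = ("let me check", "let me see", "let me look",
                        "one moment", "i'll look that up", "checking")
TRANSITIONAL_FUSE_S = 4.0

def transitional_fuse(agent_message: str) -> float | None:
    """Return a short fuse (seconds) if this message is a bare transitional phrase.

    If the message also asks a question or delivers real content, the transitional
    phrase is incidental and should not arm the fuse.
    """
    text = agent_message.lower()
    if not any(phrase in text for phrase in TRANSITIONAL_PHRASES):
        return None
    # A question means the agent is waiting on the caller, not on a tool.
    # "Let me check that. What was your date of birth?" falls through here.
    if "?" in agent_message:
        return None
    # A longer message probably carries real content alongside the phrase;
    # the word-count cutoff is an assumed heuristic, not a tuned value.
    if len(agent_message.split()) > 12:
        return None
    return TRANSITIONAL_FUSE_S
```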
This is the failure that started it all. The agent confidently says "You're all booked!" but no booking tool ever completed.
The watchdog tracks every tool execution in real time. Completed tools. In-flight tools. Failed tools. When the agent claims a booking was made, the watchdog checks whether that's actually true. If no booking tool completed, two things happen.
First, a pre-speech guard catches the claim before it reaches text-to-speech. The caller never hears "You're all booked." Second, the system triggers the actual booking in the background. If that succeeds, the agent says the confirmation for real. If it fails after two attempts, the agent tells the caller honestly that something went wrong and offers to transfer to a human.
The booking success flag is only set when a tool confirms it. Never when the model claims it. The model's opinion on whether a booking happened is completely irrelevant. Only the system of record matters.
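A stripped-down version of the tool tracker and the pre-speech guard might look like this. The tool name book_appointment and the claim regex are hypothetical; the point is that only a completed tool lets the claim through, never the model's wording.

```python
import re

class ToolTracker:
    """Tracks tool executions so spoken claims can be checked against reality."""

    def __init__(self):
        self.completed: set[str] = set()
        self.in_flight: set[str] = set()
        self.failed: set[str] = set()

    def started(self, tool: str) -> None:
        self.in_flight.add(tool)

    def succeeded(self, tool: str) -> None:
        self.in_flight.discard(tool)
        self.completed.add(tool)

    def errored(self, tool: str) -> None:
        self.in_flight.discard(tool)
        self.failed.add(tool)

# Phrases that count as a booking claim; illustrative, not exhaustive.
BOOKING_CLAIM = re.compile(r"you'?re (all )?booked|appointment is (set|confirmed)", re.I)

def safe_to_speak(agent_message: str, tools: ToolTracker) -> bool:
    """Pre-speech guard: block a booking claim unless a booking tool actually completed.

    "book_appointment" is a hypothetical tool name. The model's own belief about
    whether the booking happened is ignored; only the tracker counts.
    """
    if BOOKING_CLAIM.search(agent_message) and "book_appointment" not in tools.completed:
        return False  # never reaches text-to-speech; recovery runs the real booking
    return True
```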
The watchdog handles timing and tool tracking. It knows when things go wrong. But it uses pattern matching to understand what the agent said. Regex and keyword detection. That works for 80% of cases. The other 20% is where things get messy.
"Sure, I can check on that." Is the agent committing to an action, or just acknowledging what the caller said? "Let me see what I can do." Is that a promise to act, or a polite way of saying maybe? Sarcasm, hedging, multi-intent messages. Pattern matching breaks on all of these.
So I added a third system. A separate lightweight language model whose only job is reading what the agent says and classifying the intent. Not the caller's intent. The agent's intent. It's AI watching AI.
The guardian model is a small 8-billion parameter model. Tiny by modern standards. It classifies every agent message into one of five categories in about 20-50 milliseconds.
PENDING_ACTION "Let me check availability for you"
Agent committed to doing something. Clock starts ticking.
COMPLETE_RESPONSE "We have openings at 2pm and 4pm on Tuesday"
Agent delivered actual information. Commitments fulfilled.
QUESTION "What time works best for you?"
Agent is waiting for caller input.
ACKNOWLEDGMENT "Got it, Tuesday at 3"
Agent heard the caller but may still need to act.
FAREWELL "Thanks for calling, have a great day!"
Conversation ending. Stand down.

The classification comes back with a confidence score. For PENDING_ACTION it also includes a description of exactly what the agent committed to. "Check appointment availability." "Look up patient records." "Process the booking." This creates a concrete commitment with a 10-second countdown. If the agent doesn't deliver a COMPLETE_RESPONSE within 10 seconds, the commitment expires and recovery triggers.
When the agent does deliver information, every active commitment gets fulfilled at once. The clock stops. The system moves on.
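The commitment bookkeeping itself is small. This sketch assumes the guardian hands back a label plus a description string; the 10-second TTL matches the behaviour described above, everything else is illustrative.

```python
import time
from dataclasses import dataclass, field

COMMITMENT_TTL_S = 10.0

@dataclass
class Commitment:
    description: str          # e.g. "Check appointment availability"
    created_at: float = field(default_factory=time.monotonic)

    def expired(self) -> bool:
        return time.monotonic() - self.created_at > COMMITMENT_TTL_S

class CommitmentLedger:
    """Turns guardian classifications into tracked promises with a countdown."""

    def __init__(self):
        self.active: list[Commitment] = []

    def on_classification(self, label: str, description: str = "") -> None:
        if label == "PENDING_ACTION":
            self.active.append(Commitment(description))
        elif label == "COMPLETE_RESPONSE":
            # Delivering real information fulfils every open commitment at once.
            self.active.clear()

    def overdue(self) -> list[Commitment]:
        """Commitments past their 10-second window; each one should trigger recovery."""
        return [c for c in self.active if c.expired()]
```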
Here's something that will go wrong if you're not careful. Recovery fires. The agent says something like "Sorry about that, let me try again." The guardian sees "let me try again" and classifies it as PENDING_ACTION. That creates a new commitment. If that commitment isn't fulfilled fast enough, recovery fires again. The agent apologises again. New commitment. Recovery again. Infinite loop.
The fix is a 30-second suppression window after any recovery. During that window, the guardian still classifies messages but doesn't create new commitments from PENDING_ACTION results. The agent gets breathing room to actually complete the recovery without triggering more recovery. Active commitments are also capped at 5. If the agent has somehow made 5 unfulfilled promises, something is deeply wrong and piling on more monitoring won't help.
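Here's roughly what that gating looks like. The 30-second window and the cap of 5 are the values described above; classification still runs, it's only commitment creation that gets suppressed.

```python
import time

SUPPRESSION_WINDOW_S = 30.0
MAX_ACTIVE_COMMITMENTS = 5

class RecoverySuppressor:
    """Prevents recovery messages from spawning new commitments (and new recoveries)."""

    def __init__(self):
        self._last_recovery: float | None = None

    def recovery_fired(self) -> None:
        self._last_recovery = time.monotonic()

    def suppressed(self) -> bool:
        if self._last_recovery is None:
            return False
        return time.monotonic() - self._last_recovery < SUPPRESSION_WINDOW_S

def should_create_commitment(label: str, active_count: int, suppressor: RecoverySuppressor) -> bool:
    """Gate commitment creation; the classification itself still happens."""
    if label != "PENDING_ACTION":
        return False
    if suppressor.suppressed():
        return False          # inside the 30s window after a recovery
    if active_count >= MAX_ACTIVE_COMMITMENTS:
        return False          # something is deeply wrong; don't pile on
    return True
```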
The guardian model adds about 20-50ms of latency per classification. That's fast. But models fail. Networks go down. So the system has a regex-based fallback that kicks in automatically if the guardian model is unavailable.
"let me," "one moment," "checking" → PENDING_ACTION (0.7 confidence)
ends with "?" or "what is your" → QUESTION (0.7 confidence)
"goodbye," "bye," "have a great day" → FAREWELL (0.8 confidence)
everything else → COMPLETE_RESPONSE (0.5 confidence)

The confidence scores are deliberately lower than the guardian model produces. The system knows the fallback is less accurate and adjusts its behaviour accordingly. Thresholds become more generous. Recovery triggers are slower to fire. It's a degraded mode, not a broken mode.
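The fallback itself is just a handful of patterns checked in order, first hit wins. This sketch mirrors the table above; the exact regexes are illustrative.

```python
import re

def fallback_classify(agent_message: str) -> tuple[str, float]:
    """Regex-only classification used when the guardian model is unreachable.

    Confidences are deliberately lower than the model's, so downstream
    thresholds loosen up in degraded mode.
    """
    text = agent_message.lower().strip()
    if re.search(r"\b(let me|one moment|checking)\b", text):
        return "PENDING_ACTION", 0.7
    if text.endswith("?") or "what is your" in text:
        return "QUESTION", 0.7
    if re.search(r"\b(goodbye|bye|have a great day)\b", text):
        return "FAREWELL", 0.8
    return "COMPLETE_RESPONSE", 0.5
```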
Let me walk through a real scenario. A caller asks about available appointment times. The agent says "Let me see what appointments are open."
Three things happen simultaneously. The guardian classifies this as PENDING_ACTION and starts a 10-second commitment timer. The watchdog detects the transitional phrase and sets its own 4-second fuse. The predictive engine, knowing the agent is about to deliver appointment times, starts pre-generating responses for likely follow-ups. "Tuesday works." "Do you have anything earlier?" "What about next week?"
Now imagine the tool call succeeds. The agent reads out the available times. The guardian sees a COMPLETE_RESPONSE, fulfils the commitment, and resets. The watchdog sees the agent spoke and transitions to AGENT_WAITING_FOR_USER. The caller says "Tuesday works." The predictive engine matches this against its cache and fires back a response instantly. Zero latency. The caller experiences a perfectly smooth, fast conversation.
Now imagine the tool call fails silently. Four seconds pass. The watchdog's transitional phrase timer expires. It triggers recovery with full context about what the agent was trying to do. The speaking model gets nudged with "The availability check hasn't returned yet. Acknowledge the delay naturally and retry." The agent says something like "Still pulling that up, one moment." The tool retries. Meanwhile the guardian enters its 30-second suppression window so the recovery message doesn't create a new commitment loop. The tool succeeds on retry. The agent delivers the times. The caller never knew anything went wrong.
Three parallel systems means three things that can fail, interact in unexpected ways, or add latency. This is real infrastructure with real operational cost.
The predictive engine wastes compute on wrong predictions. At a 50-60% hit rate, roughly half the pre-generated responses get thrown away. That's real money spent on responses nobody will hear. The tradeoff only makes sense if response speed matters more than compute cost. On a phone call it absolutely does. For a text chatbot it probably doesn't.
The watchdog adds complexity to every conversation state transition. Each new state, each new edge case, each new transitional phrase pattern needs to be accounted for. The five conversation states I described took months of production calls to get right. And there are still edge cases that surprise me.
The guardian model is an extra dependency. It's fast, but it's another service that can go down. The regex fallback works, but it's noticeably less accurate. About 20% of the nuanced cases that the guardian model handles correctly get misclassified by the fallback.
And debugging is harder. When something goes wrong, the cause could be in any of the three systems, or in the interaction between them. A bad prediction can conflict with a watchdog recovery. A guardian misclassification can prevent the watchdog from triggering when it should. Logging and observability become critical, not optional.
So why not just make the speaking model smarter? Give it instructions to track its own commitments, monitor its own silence, predict what comes next?
Because it's generating speech in real time under a 300ms deadline. That model is using every bit of its capacity to respond naturally, stay in context, follow the conversation flow, and sound human. Asking it to simultaneously run a timer, check tool completion status, pre-generate alternative responses, and classify its own intent is like asking a juggler to also solve sudoku. Technically possible. Practically guaranteed to make both tasks worse.
Every monitoring task I moved outside the speaking model made it better at speaking. Less in its context window. Fewer competing objectives. Simpler instructions. The model got smarter not because I made it smarter, but because I stopped asking it to do things it was bad at.