When I first built the Retone AI voice agent and called it, I was genuinely impressed. The responses were smart. The voice sounded natural. It understood what I was asking and gave good answers. For the first few calls I thought I'd nailed it.
But after a while something started bothering me. The conversations felt off. Not wrong exactly. Just hollow. Like talking to someone who was technically saying all the right things but wasn't really there. I couldn't figure out what was missing.
One afternoon I was working on it in a cafe, stuck on this feeling that the agent was 90% of the way to human but that last 10% was a wall. So I closed my laptop, closed my eyes, and just listened. To the cafe. To the people around me having conversations.
And it hit me. Every conversation around me had this invisible layer I'd never really noticed. People don't just take turns talking. While one person speaks the other is constantly producing tiny signals. "Mm-hmm." "Oh no." "Yeah." "Right." Not interrupting. Not adding anything substantial. Just a steady rhythm of sounds that say "I'm here, I'm listening, keep going."
My agent didn't do any of that. While the caller talked, it sat in complete silence. No reactions. No acknowledgment. Nothing. It waited for them to finish making sounds and then produced a response. Technically correct. Conversationally dead.
The hardest thing my agent had to learn wasn't talking. It was listening.

Back-channeling is the first of three layers, and the hardest, because the caller is actively talking. Any acknowledgment needs to land at a natural pause rather than cutting them off mid-sentence, match the tone of what they're saying, and play on a separate audio channel so the system doesn't think the agent is taking its turn.
The speech-to-text engine sends partial transcription results every 100-200 milliseconds as the caller speaks. The system monitors the gaps between these updates. A short gap, under 300ms, means the caller is mid-sentence. Words are still flowing. A longer gap, 500-700ms, means they paused. Collecting their thoughts. Taking a breath. That's the window.
The pause threshold isn't fixed. It's configurable on a 1-10 sensitivity scale. At low sensitivity, the system only triggers on long pauses of 700ms or more. At high sensitivity, it catches shorter breaths at 300ms. There's also a minimum gap between consecutive acknowledgments. Low sensitivity waits 8 seconds between them. High sensitivity, 3 seconds. Without this limit the agent says "mm-hmm" after every sentence, which sounds less like a listener and more like a parrot.
And nothing triggers until the caller has been speaking for at least 2 seconds. If someone just said "yes," they don't need acknowledgment. They need a response.
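Here's a rough sketch of that gating logic in Python. The class name, the linear interpolation across the 1-10 sensitivity scale, and the reset hook are illustrative, not the production code; it assumes the speech-to-text client reports a timestamp for each partial result.

```python
class BackchannelGate:
    """Decides when a back-channel is allowed, based on pauses in partial transcripts."""

    def __init__(self, sensitivity: int = 5):
        t = (sensitivity - 1) / 9                        # 0.0 at sensitivity 1, 1.0 at 10
        self.pause_threshold = 0.7 - t * 0.4             # 700ms (low) down to 300ms (high)
        self.min_gap_between_acks = 8.0 - t * 5.0        # 8s (low) down to 3s (high)
        self.min_speech_duration = 2.0                   # caller must have spoken for at least 2s
        self.speech_started_at = None
        self.last_partial_at = None
        self.last_ack_at = float("-inf")

    def on_partial_transcript(self, now: float) -> None:
        if self.speech_started_at is None:
            self.speech_started_at = now
        self.last_partial_at = now

    def should_acknowledge(self, now: float) -> bool:
        if self.last_partial_at is None:
            return False
        return (now - self.last_partial_at >= self.pause_threshold        # caller paused
                and now - self.speech_started_at >= self.min_speech_duration
                and now - self.last_ack_at >= self.min_gap_between_acks)  # don't parrot

    def acknowledged(self, now: float) -> None:
        self.last_ack_at = now

    def reset(self) -> None:
        """Call when the agent takes its turn, so the next utterance starts fresh."""
        self.speech_started_at = None
        self.last_partial_at = None
```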
When someone describes a problem, the right response is "I see" or "oh." When someone is listing things, it's "okay" or "mm-hmm." When someone is excited, it's "oh nice." When someone is upset, it's "I understand" or "I'm sorry about that."
The system detects the caller's intent using two layers. The first is keyword matching. Instant, sub-millisecond. It scans the partial transcript for indicators across 12 intent categories.
Narrating ("what happened was," "so basically") → "mm-hmm" or "yeah"
Describing a problem ("broken," "not working") → "I see" or "oh"
Providing info ("my name is," "the number is") → "got it" or "okay"
Positive emotion ("great," "wonderful," "love it") → "oh nice" or "yeah"
Negative emotion ("frustrated," "worried," "upset") → "I understand" or "I'm sorry"
Hesitating ("um," "uh," "well," "hmm") → "mm-hmm" (gentle encouragement)
Listing items ("first," "second," "also") → "mm-hmm" or "okay"
Clarifying ("what I mean is," "to clarify") → "right" or "I see"
Silent categories (no acknowledgment, they want a real response):
Confirming ("yes," "correct," "exactly")
Questioning ("what," "how," "can you")The second layer is a model fallback for ambiguous cases where keywords aren't enough. A fast model classifies the intent in the background, debounced so it only fires when the transcript has grown by at least 3 new words since the last classification.
Once the intent is determined, the system randomly picks from the mapped sounds. This randomness matters. Hearing the exact same "mm-hmm" on every pause sounds robotic. Variety creates the illusion of a real person reacting naturally.
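As a sketch, the keyword layer can be as simple as a phrase-to-sounds table with word-boundary matching. The function names, ordering, and matching style here are mine, not the production implementation, and the debounced model fallback sits behind it for anything this misses.

```python
import random
import re

# Order matters: the more specific acknowledged intents are checked before the
# silent ones, so "so basically what happened was" reads as narrating, not questioning.
INTENT_SOUNDS = [
    (["what happened was", "so basically"],  ["mm-hmm", "yeah"]),              # narrating
    (["broken", "not working"],              ["I see", "oh"]),                 # describing a problem
    (["my name is", "the number is"],        ["got it", "okay"]),              # providing info
    (["great", "wonderful", "love it"],      ["oh nice", "yeah"]),             # positive emotion
    (["frustrated", "worried", "upset"],     ["I understand", "I'm sorry"]),   # negative emotion
    (["um", "uh", "well", "hmm"],            ["mm-hmm"]),                      # hesitating
    (["first", "second", "also"],            ["mm-hmm", "okay"]),              # listing
    (["what i mean is", "to clarify"],       ["right", "I see"]),              # clarifying
    (["yes", "correct", "exactly"],          []),                              # confirming: stay silent
    (["what", "how", "can you"],             []),                              # questioning: stay silent
]

def _contains(text: str, phrase: str) -> bool:
    return re.search(rf"\b{re.escape(phrase)}\b", text) is not None

def pick_backchannel(partial_transcript: str) -> str | None:
    """Return a back-channel sound for the transcript so far, or None to stay silent."""
    text = partial_transcript.lower()
    for keywords, sounds in INTENT_SOUNDS:
        if any(_contains(text, k) for k in keywords):
            return random.choice(sounds) if sounds else None   # random pick for variety
    return None   # no keyword hit: hand off to the debounced model fallback
```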
The audio plays on a dedicated background track, separate from the primary speech channel. This is critical. The agent's turn detection only monitors the primary channel. If acknowledgments played there, the system would think the agent is "speaking" and would cut off the caller's turn. On a separate track, the caller keeps talking uninterrupted.
All 11 sounds are pre-synthesised at call start using the agent's configured voice. Generated concurrently during initialisation and cached. When a back-channel triggers, there's no text-to-speech delay. The audio plays immediately.
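Caching is straightforward if the text-to-speech client is async. This sketch assumes a hypothetical `synthesize(text, voice)` coroutine that returns raw audio bytes; the sound list mirrors the defaults described above.

```python
import asyncio

BACKCHANNEL_SOUNDS = ["mm-hmm", "yeah", "I see", "oh", "got it", "okay",
                      "oh nice", "I understand", "I'm sorry", "right", "uh-huh"]

async def preload_backchannels(voice: str, synthesize) -> dict[str, bytes]:
    """Generate every back-channel clip concurrently at call start and cache it in memory."""
    clips = await asyncio.gather(*(synthesize(text, voice) for text in BACKCHANNEL_SOUNDS))
    return dict(zip(BACKCHANNEL_SOUNDS, clips))

# Later, when a back-channel triggers, the cached audio goes straight to the
# background track with no TTS round trip in the hot path:
#   background_track.play(cache["mm-hmm"])
```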
The caller finishes a sentence. There's a gap before the model produces its response, at minimum 200-500ms and often much longer. In that silence, the caller wonders if the agent heard them.
The fast pre-response system fills this gap with a contextually appropriate filler. "Got it." "Sure." "Oh no, sorry about that." Spoken in under 100ms while the main model generates its full response in parallel.
Caller finishes speaking
│
├── Track 1 (fast): Classify → Select filler → Speak (~100ms)
│ NOT added to conversation history
│
└── Track 2 (slow): Full model inference → Generate response → Speak (~2-5s)
Added to conversation history
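In code the two tracks are just two coroutines run in parallel. This is a minimal sketch, assuming hypothetical `speak` and `generate_full_response` helpers passed in by the caller, plus a filler selector along the lines of the one sketched a little further down.

```python
import asyncio

async def respond(utterance: str, history: list[dict],
                  pick_filler, speak, generate_full_response) -> None:
    """Run the fast filler (track 1) and the full model response (track 2) in parallel."""

    async def fast_track():
        # Track 1: near-instant acknowledgment. Deliberately NOT appended to history,
        # so the model never knows it was spoken.
        filler = pick_filler(utterance)          # message-type + sentiment rules, sketched below
        if filler is not None:
            await speak(filler)

    async def slow_track():
        # Track 2: the official conversation turn.
        reply = await generate_full_response(history + [{"role": "user", "content": utterance}])
        await speak(reply)
        history.extend([{"role": "user", "content": utterance},
                        {"role": "assistant", "content": reply}])

    await asyncio.gather(fast_track(), slow_track())
```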
The filler is spoken almost instantly but the model doesn't know it happened.
The full response follows naturally as the official conversation turn.
The filler selection runs on two dimensions. Message type and sentiment.
Message type is a simple rule cascade. Contains "can you repeat" or "say that again"? That's a clarification request. Starts with "hi" or "hello"? That's a greeting. Starts with "yes" or "correct"? That's agreement. Contains numbers or is 5+ words? Information. And so on.
Sentiment scans for positive indicators ("thanks," "great," "awesome," "appreciate") and negative ones ("problem," "frustrated," "terrible," "broken"). Whichever count is higher wins.
Negative sentiment, any message → "Oh no, sorry about that"
Positive sentiment, any message → "Oh nice!"
Clarification request → "Oh yeah, of course!"
Question → "Hmm, yeah"
Statement → "Right"
Information → "Got it"
Fallback → "Okay"
Priority: exact match on both dimensions first, then sentiment match, then type match, then fallback.
The matching priority means a negative-sentiment information message hits "Oh no, sorry about that" rather than "Got it." Sentiment wins over message type because empathy is more important than accuracy. Saying "Got it" when someone just told you something went wrong sounds dismissive.
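A sketch of that lookup, keyed on (sentiment, message type). The dictionary mirrors the default phrases above, and it's exactly the kind of table the frontend configuration would override.

```python
# Sentiment is "positive", "negative", or None for neutral; message type is None when unknown.
FILLER_RULES = {
    ("negative", None):       "Oh no, sorry about that",
    ("positive", None):       "Oh nice!",
    (None, "clarification"):  "Oh yeah, of course!",
    (None, "question"):       "Hmm, yeah",
    (None, "statement"):      "Right",
    (None, "information"):    "Got it",
}
FALLBACK_FILLER = "Okay"

def select_filler(sentiment: str | None, message_type: str | None) -> str:
    for key in ((sentiment, message_type),   # exact match on both dimensions first
                (sentiment, None),           # then sentiment alone (empathy beats accuracy)
                (None, message_type)):       # then message type alone
        if key in FILLER_RULES:
            return FILLER_RULES[key]
    return FALLBACK_FILLER

# select_filler("negative", "information") -> "Oh no, sorry about that", not "Got it"
```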
These rules are fully configurable. The agent creator can customise which phrases map to which conditions through the frontend. The defaults sound like a real receptionist, but a law firm might want more formal language while a surf school might want something more casual.
Real people don't acknowledge everything. Sometimes they just listen. Sometimes they make a small noise. Sometimes they say nothing and let the silence do the work.
The system replicates this by randomly skipping 30% of filler responses. It rolls a random number between 0 and 1, and if the number is under 0.30, no filler is spoken. The caller goes straight to the full response after the normal processing delay.
This makes the agent feel unpredictable in a human way. Consistent acknowledgment of every single statement is a dead giveaway of automation. Real conversations have rhythm and variation. Sometimes you say "got it." Sometimes you just let the person keep going.
But there are exceptions where the skip is overridden.
Greetings are never skipped. If the caller says "Hi," the agent always acknowledges immediately. Silence after a greeting is deeply awkward.
Negative sentiment is never skipped. If the caller expresses frustration, worry, or dissatisfaction, the agent always responds with empathy. Silence after "I'm really worried about this" would feel dismissive.
And then there are empathy triggers. These detect genuine distress. Medical concerns, bereavement, anxiety, financial worry. When triggered, the system responds immediately with the appropriate tone and never skips.
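The skip decision then reduces to a handful of override checks before the dice roll. A minimal sketch, with illustrative category names:

```python
import random

def should_speak_filler(message_type: str, sentiment: str,
                        empathy_triggered: bool, skip_rate: float = 0.30) -> bool:
    if message_type == "greeting":       # silence after "Hi" is deeply awkward
        return True
    if sentiment == "negative":          # frustration always gets an empathetic reply
        return True
    if empathy_triggered:                # distress (medical, bereavement, anxiety...) is never skipped
        return True
    return random.random() >= skip_rate  # otherwise, skip roughly 30% of the time
```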
"Can you hear me?" will always get "Yep, I can hear you!" regardless of conversation context. "What's your name?" will always get "I'm [agent name]! How can I help?" There's no reasoning to do. Calling a language model for these is pure waste.
The instant reply system maintains a library of pattern-response pairs across 8 categories. Connection checks, identity questions, small talk, politeness, jokes, capability questions, confirmations, and farewells. Each message runs through a three-stage matching cascade.
Stage 1: Exact match
Normalise (lowercase, strip punctuation, trim) → hash map lookup
"can you hear me" → CONNECTION_CHECK
O(1) lookup. Sub-millisecond.
Stage 2: Regex match
"can you hear (me|this)" catches variations exact match misses
"Can you hear this?" matches the same category
Stage 3: Fuzzy match
Levenshtein distance against all known patterns
"Can u hear me" → 0.88 similarity to "can you hear me"
Above 0.85 threshold → match
Confidence: Exact = 1.0, Regex = 0.95, Fuzzy = similarity ratio
No match above threshold → falls through to the normal model pipeline
Each category has multiple response options, randomly selected to prevent repetition. Some use template variables. "I'm {agent_name}! How can I help?" becomes "I'm Sarah! How can I help?"
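A compressed sketch of the cascade, using difflib's similarity ratio as a stand-in for the Levenshtein-based comparison described above. The pattern tables are trimmed to a couple of entries for illustration.

```python
import re
from difflib import SequenceMatcher

EXACT = {"can you hear me": "CONNECTION_CHECK",   # keys are stored pre-normalised
         "whats your name": "IDENTITY"}
REGEXES = [(re.compile(r"can you hear (me|this)"), "CONNECTION_CHECK")]
FUZZY_THRESHOLD = 0.85

def normalise(text: str) -> str:
    return re.sub(r"[^\w\s]", "", text.lower()).strip()

def match_instant_reply(message: str) -> tuple[str, float] | None:
    text = normalise(message)
    if text in EXACT:                                      # stage 1: O(1) hash lookup
        return EXACT[text], 1.0
    for pattern, category in REGEXES:                      # stage 2: regex variations
        if pattern.search(text):
            return category, 0.95
    best = max(((SequenceMatcher(None, text, known).ratio(), cat)   # stage 3: fuzzy
                for known, cat in EXACT.items()), default=(0.0, None))
    if best[0] >= FUZZY_THRESHOLD:
        return best[1], best[0]
    return None                                            # fall through to the model pipeline
```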
When an instant reply fires, the model is never called. The response is spoken in under 50ms. On a typical 5-minute call, instant replies handle 3-5 exchanges. Connection checks, thank yous, identity questions. That's 3-5 model calls saved, and each one answered faster than any model could even begin generating.
Let me walk through a real scenario showing all three layers working together.
0:00 — Call connects
Caller: "Hello?"
Layer 3 (Instant Reply): Pattern match → GREETING → "Hi! How can I help?"
Spoken in 40ms. Model never called.
0:15 — Caller describes their problem
Caller: "Yeah so basically what happened was I booked an appointment online..."
[500ms pause]
Layer 1 (Back-Channel): Pause detected, intent = narrating → plays "mm-hmm"
"...and I got a confirmation email..."
[600ms pause, 5.5 seconds since last back-channel]
Layer 1 (Back-Channel): Pause detected, intent = narrating → plays "yeah"
"...but when I showed up they said there was no booking."
Caller finishes speaking.
Layer 2 (Fast Pre-Response): negative sentiment → "Oh no, sorry about that"
Spoken in 90ms. Model starts generating in parallel.
Model: Full response 2.5 seconds later.
0:45 — Caller provides information
Caller: "Yeah it's B-K-4-7-2-9."
Layer 2 (Fast Pre-Response): information + neutral → "Got it"
Random skip check: 0.42 > 0.30, not skipped. Spoken in 80ms.
Model: Processes the confirmation number.
1:10 — Connection check
Caller: "Can you hear me okay? The line sounds weird."
Layer 3 (Instant Reply): Exact match → CONNECTION_CHECK
"Yep, I can hear you fine!" Spoken in 35ms. Model never called.
2:30 — Caller expresses worry
Caller: "I'm really worried because the appointment is tomorrow."
Layer 2 (Fast Pre-Response): Empathy trigger, "really worried" matches anxiety
Never-skip override. "I completely understand your concern" in 95ms.
Model: Full reassurance response in parallel.
3:15 — Caller says thank you
Caller: "Thanks so much for your help."
Layer 3 (Instant Reply): Pattern match → POLITENESS
"No worries! Anything else?" Spoken in 40ms. Model never called.Look at the timing of a single exchange without these layers versus with them.
Without active listening:
0ms Caller finishes speaking
100ms Model inference begins
2500ms Response ready, spoken to caller
→ 2.5 seconds of dead silence
With active listening:
0ms Caller finishes speaking
80ms Fast pre-response: "Got it" (spoken, not in history)
100ms Model inference begins
2500ms Full response ready, spoken to caller
→ 80ms to first acknowledgment, then natural conversation
The perceived latency drops from "uncomfortably long" to "instant." The actual model inference time hasn't changed at all. The caller just never sits in silence while it happens.
Pre-synthesising 11 sounds at call start adds about 200-400ms to initialisation. On a call that lasts 5 minutes this is invisible. On a 15-second automated callback it's a noticeable delay before the agent first speaks.
The 30% skip rate is tuned for general receptionist conversations. A therapist bot would want a lower skip rate, maybe 10%. An automated survey would want a higher one, maybe 50%. The right number depends on the use case, and getting it wrong makes the agent feel either robotic (too much acknowledgment) or cold (too little).
The keyword-based intent classifier handles most cases but misses sarcasm and subtle emotional shifts. Someone saying "Oh that's just great" sarcastically will get matched as positive emotion. The model fallback catches some of these, but not all. For most business calls this doesn't matter. For sensitive contexts like healthcare or counselling it could.
And the empathy pattern exclusions are never truly complete. Every new edge case discovered in production means another exclusion rule. "I'm dying to try that" shouldn't trigger a medical concern pattern, but without an explicit exclusion for "dying to," it will. This is an ongoing maintenance cost.
That afternoon in the cafe changed how I think about conversational AI entirely. The agent's responses were already good. The voice was already natural. The intelligence was already there. But none of it mattered because the caller didn't feel heard.
Conversation isn't just about what you say. It's about the constant stream of signals that tell the other person you're present, you're engaged, and you care about what they're saying.
A good response after dead silence feels like talking to a machine that happens to be smart. An "mm-hmm" at the right moment followed by a good response feels like talking to a person. The model didn't change. The intelligence didn't change. The only thing that changed was whether the caller felt heard.