I was three weeks into trying to make my AI follow a conversation properly and nothing was working. The agent would ask for a caller's name, get it, then ask again two turns later. Someone would say "I want to book an appointment" and the system would keep collecting information it already had. A caller would say "Hi, I'm Sarah Johnson, I'd like to book an appointment" and my agent would respond with "Hello! What's your name?"
Three pieces of information in one sentence. Name, intent, action. And the system ignored all of it because it could only process one step at a time. Sarah had to repeat herself three times before the agent caught up to where she started.
I was quite burnt out. I'd rewritten the conversation logic twice, added more prompt engineering than I want to admit, and the fundamental problem hadn't moved. The AI could only walk a straight line through the conversation, one step after another, no matter what the caller actually said. So I did what I always do when I'm stuck. I went for a drive.
Long drive. No destination. I drove so far out into the countryside that I had no idea where I was. Completely lost. So I did the obvious thing, pulled up GPS, typed in "home," and started following it back.
And then something clicked. I recognised a few turns along the way. Took them early, before the GPS told me to. The GPS didn't care. It recalculated instantly. I took a completely different route through a part of town I knew, skipping three of its planned turns. It recalculated again. The route kept changing but the destination never did. It didn't matter how I got there. The system always knew where I was, where I was going, and how to connect the two.
That was the moment. My conversation system was turn-by-turn directions. It needed to be a map.

Most chatbot frameworks treat conversations like a recipe. Step 1, then step 2, then step 3. If you need branching, you add an if-else. "Did the user say yes? Go to step 4a. Did they say no? Go to step 4b." It works for ordering a pizza. It falls apart the moment a real person does something unexpected. And on a phone call, people do unexpected things constantly. They interrupt. They change their mind. They give you everything at once or nothing at all. They answer questions you haven't asked yet.
Turn-by-turn directions work perfectly as long as every road is open and you never miss a turn. The moment something changes, you're lost. A map is different. A map shows every street, every intersection, every possible route. Miss a turn and you reroute. Take a shortcut and the map keeps up. You always know where you are.
That's what I built after that drive. A navigation engine where every conversation is a directed graph of nodes. Each node is a distinct stage. Greeting, collecting information, checking availability, booking an appointment, transferring to a human, ending the call. The connections between nodes are the possible paths the conversation can take. The engine's job is to figure out where the conversation is right now and where it should go next. Just like a GPS, except the roads are conversation topics and the destination is whatever the caller actually needs.
When a call starts, the entire conversation flow loads into memory. Every node gets stored in a dictionary keyed by a unique ID. Each node has a type (start, conversation, data extraction, booking, transfer, end call), its own configuration, and a list of outgoing transitions.
The connections between nodes are stored as an adjacency list. A dictionary where each node ID maps to a list of node IDs it connects to. This sounds like a small detail but it's the reason pathfinding is fast. When the engine needs to find a route from node A to node B, looking up the outgoing connections is instant instead of scanning every edge in the graph.
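A minimal sketch of that storage layout in Python. The class and field names here are illustrative, not the engine's real API, but the shape is the one described: nodes in a dictionary keyed by ID, edges in an adjacency list for constant-time lookup of outgoing connections.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str
    node_type: str                              # "start", "conversation", "extraction", ...
    config: dict = field(default_factory=dict)  # per-node settings
    transitions: list = field(default_factory=list)

class ConversationGraph:
    def __init__(self, nodes, edges):
        # Nodes keyed by unique ID for O(1) lookup.
        self.nodes = {n.node_id: n for n in nodes}
        # Adjacency list: node ID -> list of node IDs it connects to.
        self.adjacency = {nid: [] for nid in self.nodes}
        for src, dst in edges:
            self.adjacency[src].append(dst)

    def neighbors(self, node_id):
        # Outgoing connections without scanning every edge in the graph.
        return self.adjacency.get(node_id, [])
```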
The engine also tracks state. Which node is active, how long the conversation has been on this node, how many times each node has been visited, and every piece of data collected so far. Names, phone numbers, appointment preferences, tool results, conversation history. All of it available to any node at any time.
When a caller speaks, the engine has a decision to make. Stay on the current node, or move somewhere else. This happens in two layers with a strict priority order.
Think of a building. Every room has regular doors to adjacent rooms. But the building also has fire exits that you can reach from anywhere. You don't need to walk through three rooms to get to the emergency exit. You just go.
Layer one is the fire exits. Before checking any of the current node's transitions, the engine scans every "global" node in the flow. Booking nodes, transfer nodes, end-call nodes, anything marked as globally accessible. Each of these has intent patterns that can trigger from anywhere in the conversation. If the caller says "I want to book an appointment" while the agent is still collecting their name, the global booking intent fires and the engine starts navigating there immediately.
All global intents evaluate in parallel. Every global node's condition runs at the same time, and the first match wins. For built-in types like end-call, there's a fast path with keyword matching. "Goodbye," "bye," "hang up," "see you later." For custom global nodes, the condition can be a natural language expression evaluated by a lightweight model.
Two exceptions. Global intents are skipped on the START node because "Hi there" shouldn't accidentally trigger a booking flow. And they're skipped during cascade hops to prevent infinite re-triggering. More on cascading later.
Layer two is the regular doors. If no global intent matches, the engine evaluates the current node's outgoing transitions, sorted by priority. Lower number means higher importance. Again, all transitions run in parallel, and the first match by priority wins.
The parallel evaluation matters a lot for speed. A node might have four outgoing transitions, each requiring a model call to evaluate a condition. Running them one after another would be 4 times 200ms, which is 800ms. Running them simultaneously brings it down to a single 200ms round. On a phone call that's the difference between a natural pause and an awkward silence.
Not every transition works the same way. Some are instant. Some require understanding what the caller said. Some wait for conditions outside the conversation entirely.
Keyword: trigger words in the message. "Book" fires the booking transition. Sub-millisecond, zero ambiguity.

Any-message: fires on any non-empty input. Used after START. The caller says anything → move to the first real stage.

Immediate: fires when a node has been executed at least once. For nodes that don't need user input, like branching logic.

All-extracted: fires when every required variable has been collected. All fields filled → move on automatically.

Timeout: fires after X seconds on the current node. Caller silent for 30 seconds → "Are you still there?"

Condition prompt: a natural language condition evaluated by a small fast model. "Caller wants to reschedule" or "user confirmed the appointment." The fast path tries keyword matching first, with a model call only as fallback. Results are cached for 15 seconds to avoid redundant calls.

Variable compare: checks extracted variables against conditions. Equals, not equals, contains, greater than, regex match. Used for branching based on collected data.

The condition prompt type is the most interesting. The engine doesn't immediately call a model to evaluate every condition. It tries a fast path first, parsing simple patterns like "customer says X" and checking if X appears in the message. For literal conditions like "always" or "default," it returns immediately. Only when the fast path fails does it call the model. This keeps most evaluations under a millisecond while still handling complex semantic conditions when needed.
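A rough sketch of that fast path. The pattern here is simplified to one regex and a literal table; the real engine presumably handles more shapes, and `None` is my convention for "fall back to the model":

```python
import re

# Conditions that need no evaluation at all.
LITERALS = {"always": True, "default": True}

def fast_path(condition, message):
    """Try to resolve a condition without a model call.

    Returns True/False when the condition is simple enough to decide
    locally, or None to signal that the caller must fall back to a
    model evaluation.
    """
    cond = condition.strip().lower()
    if cond in LITERALS:
        return LITERALS[cond]          # resolved immediately
    # Simple patterns like "customer says X": check X against the message.
    m = re.match(r"(?:customer|caller|user) (?:says|mentions) (.+)", cond)
    if m:
        return m.group(1) in message.lower()
    return None  # semantic condition ("caller wants to reschedule"): needs the model
```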
When a global intent triggers, the target node might not be directly connected to the current one. The conversation might be on a data extraction node, and the booking node might be several hops away through intermediate stages. A confirmation node. A validation step. A transition announcement.
Jumping directly to the target would skip those intermediate nodes. And those nodes might be necessary. Skipping a confirmation step means the caller never confirmed. Skipping a validation step means bad data gets through.
Back to the building analogy. If someone on the third floor needs to get to the basement, they don't teleport. They take the stairs. Each floor they pass through is a real floor with real things happening. The building still makes sense.
The engine calculates the shortest path using breadth-first search over the adjacency list. Starting from the current node, it explores outgoing connections level by level until it reaches the target. For a typical conversation flow with 8-15 nodes, this completes in microseconds.
If the path is just two nodes (current to target), the transition is direct. If it's longer, the engine enters "navigation mode." It stores the final destination and the full route, then moves to the first intermediate node. On each subsequent transition check, it gives preference to transitions that continue along the calculated path. After each hop, it checks whether it's arrived.
Navigation clears itself when the destination is reached, or when a new global intent overrides the current route. If the caller changes their mind halfway through navigating to the booking node and says "actually, just transfer me to a person," the old route is abandoned and a new path to the transfer node is calculated.
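The pathfinding step itself is a textbook breadth-first search. A minimal sketch over the adjacency-list shape described earlier:

```python
from collections import deque

def shortest_path(adjacency, start, target):
    """BFS over the adjacency list. Returns the list of node IDs along
    the shortest route (by hop count), or None if unreachable."""
    if start == target:
        return [start]
    visited = {start}
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        # Explore outgoing connections level by level.
        for nxt in adjacency.get(path[-1], []):
            if nxt in visited:
                continue
            if nxt == target:
                return path + [nxt]
            visited.add(nxt)
            queue.append(path + [nxt])
    return None
```

A two-element result means a direct transition; anything longer would put the engine into navigation mode with the full route stored.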
This is the part that fixed Sarah's problem. Cascading.
After a successful transition, the engine asks a simple question. Does the caller's message also satisfy transitions on the new node? If yes, it transitions again. And again. Up to 5 hops deep.
Remember Sarah. "Hi, I'm Sarah Johnson, I'd like to book an appointment."
Turn 1: "Hi, I'm Sarah Johnson, I'd like to book an appointment"
→ START to GREETING (any-message trigger)
Agent: "Hello! Welcome. What's your name?"
Turn 2: "...I just said it. Sarah Johnson."
→ GREETING to DATA_COLLECTION (name detected)
Agent: "Thanks Sarah! How can I help you today?"
Turn 3: "I said I want to book an appointment."
→ DATA_COLLECTION to BOOKING (intent match)
Agent: "Sure! When works for you?"
Three turns. Caller repeated herself twice. Frustrating.

Here's the same opening sentence with cascading enabled.

Turn 1: "Hi, I'm Sarah Johnson, I'd like to book an appointment"
→ START to GREETING (any-message trigger)
→ cascade: GREETING to DATA_COLLECTION (name "Sarah Johnson" detected)
→ cascade: DATA_COLLECTION to BOOKING (booking intent matches)
Agent: "Sure Sarah, when works for you?"
One turn. Zero repetition. Three nodes traversed in ~100ms.

The engine processed all three transitions in a single conversational turn. Sarah said one sentence and the agent jumped straight to the booking stage, already knowing her name. No "what's your name?" No "how can I help you?" The system extracted everything it needed and skipped every step that was already satisfied.
Cascade depth is capped at 5 to prevent infinite loops. And global intent checks are disabled during cascade hops. Only the explicit transitions on each node are evaluated. Without this safeguard, a global intent could keep re-triggering on every hop and the engine would spin forever.
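The cascade loop reduces to something like this. `evaluate` is a placeholder for the engine's explicit-transition check (global intents deliberately excluded, as described above):

```python
MAX_CASCADE_DEPTH = 5  # hard cap against pathological loops

def cascade(start_node, message, evaluate):
    """After a transition lands on `start_node`, keep transitioning while
    the same message satisfies a transition on each new node.

    `evaluate(node_id, message)` returns the next node ID or None.
    Returns the final node and the list of hops taken.
    """
    current = start_node
    hops = []
    for _ in range(MAX_CASCADE_DEPTH):
        nxt = evaluate(current, message)
        if nxt is None:
            break                # message satisfies nothing further
        hops.append(nxt)
        current = nxt
    return current, hops
```

With a transition table like GREETING → DATA_COLLECTION → BOOKING, Sarah's single sentence walks all the way to BOOKING in one call, while a misconfigured two-node loop gets cut off at the depth cap instead of spinning forever.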
Some nodes don't need the caller to say anything. If-else nodes evaluate their conditions against the collected data and immediately return a next node. No conversational turn. No delay. The caller doesn't even know it happened.
These can chain. Node A evaluates to B. B evaluates to C. C evaluates to D. All in the same execution, recursively calling the evaluator on each new node until landing on one that actually needs input from the caller. This enables complex branching logic, routing callers to different flows based on their type, their history, or the data they've provided, without any visible pause in the conversation.
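A sketch of that silent chaining, with an invented node shape (each logic node holds branch conditions and a default target):

```python
def resolve_logic_nodes(nodes, start_id, data, max_chain=10):
    """Follow chained if-else nodes until reaching one that needs input.

    Each logic node's branches are (variable, expected_value, target_id)
    tuples checked against the collected data, with a default target if
    none match. No conversational turn happens anywhere in here.
    """
    current = start_id
    for _ in range(max_chain):
        node = nodes[current]
        if node["type"] != "logic":
            return current              # landed on a node that needs input
        for var, expected, target in node["branches"]:
            if data.get(var) == expected:
                current = target
                break
        else:
            current = node["default"]
    return current
```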
The full lifecycle, from the caller finishing a sentence to the agent responding from a potentially completely different context, looks like this.
1. Speech-to-text finalises the transcription.
2. Fast response check — instant replies for predictable phrases
("Can you hear me?", "What's your name?") that bypass traversal entirely.
3. If on a data extraction node, passive extraction runs first.
Regex and keyword matching against the message to capture variables.
Under 5ms. Ensures the all-extracted evaluator has current data.
4. Transition evaluation fires as a background task, running alongside
the speaking model's response generation.
5. Inside transition evaluation:
Global intents first (parallel) → explicit transitions (parallel, priority-ordered)
First match wins. If the match leads to a distant node, BFS calculates the path.
6. On successful transition:
Active node updates. Execution counts increment. Node timer resets.
Watchdog gets notified. Integration data prefetches if the new node needs it.
(Booking nodes pre-load availability, for example.)
7. New node's executor returns updated instructions.
These replace the speaking model's system prompt via a live update.
The model is now operating with completely different context.
New rules. New tools. New conversation stage.
8. Cascade check. If the message satisfies transitions on the new node,
the engine recurses. Up to 5 hops in a single turn.
9. The speaking model generates its response using the new instructions.

The caller experiences this as a natural, fluid conversation. They don't know the system just traversed three nodes, prefetched booking data, and rebuilt the AI's entire instruction set between their sentence ending and the response beginning.
All extracted variables persist for the entire call. When the conversation transitions from a data extraction node to a booking node, everything collected earlier is still available. This is stored in a flat dictionary that every node and every transition evaluator can read.
There's a memory protection layer for long calls. Conversation history caps at 50 messages, oldest trimmed first. Extracted variables cap at 100 entries. Tool results cap at 20. For a typical 5-minute call these limits never come close. For a 30-minute call with heavy tool usage they prevent memory issues without losing anything critical.
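Those caps fall out of standard library containers almost for free. A sketch, with the class and method names invented for illustration:

```python
from collections import deque, OrderedDict

class CallMemory:
    """Bounded per-call state: oldest entries are trimmed first."""

    def __init__(self, max_history=50, max_vars=100, max_tools=20):
        self.history = deque(maxlen=max_history)      # auto-trims oldest messages
        self.tool_results = deque(maxlen=max_tools)   # auto-trims oldest results
        self.variables = OrderedDict()
        self.max_vars = max_vars

    def set_var(self, key, value):
        self.variables[key] = value
        self.variables.move_to_end(key)               # updates count as fresh
        while len(self.variables) > self.max_vars:
            self.variables.popitem(last=False)        # evict the oldest entry
```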
There's also a cleaning step I had to add after a frustrating week of debugging. Speech-to-text sometimes adds trailing punctuation. A caller answering "What's your name?" gets transcribed as "Sarah?" with a question mark. Downstream systems were trying to look up patients named "Sarah?" The engine now strips trailing question marks, periods, commas, and other punctuation from extracted values so everything downstream gets clean data.
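The fix itself is tiny; something along these lines, with the exact punctuation set being my guess:

```python
TRAILING_PUNCTUATION = "?.,!;: "

def clean_extracted(value: str) -> str:
    """Strip trailing punctuation that speech-to-text tacks onto short
    answers, so a transcription like "Sarah?" is stored as "Sarah"."""
    return value.strip().rstrip(TRAILING_PUNCTUATION)
```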
Most conversational turns don't trigger a transition. And that's fine. A data extraction node stays active across several turns while collecting multiple fields. A conversation node stays active while discussing a topic. The engine evaluates transitions, finds no match, and the agent responds using its existing instructions. The next message triggers evaluation again.
When multiple transitions match on the same message, priority ordering resolves it. Transitions are sorted before evaluation. Even though they run in parallel, results are processed in priority order. A high-priority keyword transition like "cancel" always beats a lower-priority condition prompt like "user seems unhappy," even if both evaluate to true on the same message.
Parallel evaluation trades strict determinism for speed. All transition conditions run simultaneously, including model calls with non-deterministic latency. I accept this because priority ordering provides the determinism that actually matters. It doesn't matter which evaluation finishes first. It matters which matching transition has the highest priority.
Breadth-first search finds the shortest path by hop count, not by any weighted metric. In a conversation graph, the "best" path isn't always the shortest. Sometimes a longer route through a confirmation node is better than a shortcut. For most flows, shortest path is correct because the graph is designed with linear progression in mind. For edge cases where it isn't, the flow designer adjusts the edge structure.
The cascade depth cap of 5 could theoretically miss a valid 6-hop cascade. In practice I've never seen real input that needs more than 3 cascades. The cap exists to prevent pathological loops, not to limit normal operation.
Global intents can theoretically hijack a conversation from any node. A poorly configured global intent could accidentally trigger on benign input. The START node exception and cascade bypass handle the most common cases, but it's still something to watch for when designing flows.
A traditional state machine would work for simple flows. But a graph gives properties that state machines don't.
Pathfinding. When the caller's intent doesn't match any transition on the current node but matches a global node elsewhere in the graph, the engine can calculate and follow a multi-hop path. A state machine can only transition to adjacent states.
Parallel evaluation. Transitions on a node are independent and run concurrently. In a state machine, transition conditions are typically evaluated one at a time.
Cascading. The same input can satisfy conditions on multiple consecutive nodes. A state machine processes one transition per input.
Dynamic instruction swapping. Each node carries its own prompt, tools, and behavioural rules. Transitioning to a new node completely rebuilds the AI's context. A state machine typically maintains the same behavioural context across states.
The graph isn't harder to design. A visual builder makes it as simple as drawing lines between boxes. But it's dramatically more capable at handling the chaotic, non-linear reality of phone conversations where people don't follow scripts and never have.
None of this came from a whiteboard or a design document. It came from getting lost in the countryside because I was too frustrated to keep staring at code. I took turns the GPS didn't plan for, skipped steps it thought were necessary, and arrived home anyway. Navigation has been a solved problem for decades. I just hadn't thought to apply it to a phone call.