AI Engineering 10 min read

Giving an LLM Agent Hands

Notes on turning Zoe from a copilot that talks into one that edits the document you are looking at — the bridge, reviewable edits, and the bugs that came from a model acting on stale context.

Zoe is the AI copilot inside Zomentum. The first version could read your pipeline and answer questions about it, which is a fine demo and a thin product. The version that mattered was the one that could change the quote you had open — add a section, rewrite a paragraph, edit a pricing line — while you watched.

That gap, between an assistant that talks and an agent that acts, is most of the work. This is the part of it I found least obvious going in: how you let a language model reach into a live document a person is editing without it becoming a liability. None of the hard problems turned out to be about the model. They were about grounding it in what was actually on the screen, and never letting it commit a change the user hadn’t seen.

This is the first of three posts about building Zoe. This one is about giving it hands.

Talking is easy, acting is not

A chat assistant has a clean contract. Text comes in, text goes out, and if it’s wrong you ignore it. Nothing in the world changed.

An editing agent breaks that contract in every direction. The document has live state that the model can’t see directly: which section the cursor is in, which pricing block is selected, what was edited two seconds ago and not yet saved. The model produces an edit, and that edit has to land in a real editor that has its own undo stack, its own schema, and a human typing into it at the same time. And the cost of being wrong is no longer zero. A wrong answer in chat is noise. A wrong edit is damage to a document somebody is about to send a client.

So before any of the interesting agent behaviour, we needed two things: a reliable way for the model to know the true state of the document, and a reliable way for a proposed change to become a real change only after a human said yes.

The bridge

The editor lives in the React frontend. The agent lives in a Python service talking to the model. These two need to have a conversation about a document, and that conversation needs a single, well-defined surface. We called it the bridge.

The bridge is one object, registered when the editor mounts and torn down when it unmounts, and there is exactly one of them per open document. That single-owner rule did more work than it looks like. The agent never has to wonder which editor instance it’s talking to. There’s no question of two tabs racing to mutate the same content through two different handles. When the document closes, the bridge unregisters and any attempt to act on it fails cleanly instead of writing into a dead editor.
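A minimal sketch of the single-owner rule, written in Python for brevity even though the real bridge lives in the React frontend. The names `Bridge`, `BridgeRegistry`, and `BridgeUnavailable` are hypothetical, and the list of strings is only a stand-in for real editor state. What it shows is the shape of the guarantee: one handle per open document, and a clean failure once the document is closed.

```python
from dataclasses import dataclass, field


class BridgeUnavailable(Exception):
    """Raised when no live editor is registered for a document."""


@dataclass
class Bridge:
    """Hypothetical handle to one mounted editor."""
    doc_id: str
    blocks: list[str] = field(default_factory=list)  # stand-in for editor state

    def outline(self) -> list[str]:
        # Read: return a copy of the current state, never a cached one.
        return list(self.blocks)

    def insert(self, text: str) -> None:
        # Write: mutate the live editor.
        self.blocks.append(text)


class BridgeRegistry:
    """Holds at most one bridge per open document."""

    def __init__(self) -> None:
        self._bridges: dict[str, Bridge] = {}

    def register(self, bridge: Bridge) -> None:
        # Called when the editor mounts; a second owner is refused.
        if bridge.doc_id in self._bridges:
            raise RuntimeError(f"a bridge is already registered for {bridge.doc_id}")
        self._bridges[bridge.doc_id] = bridge

    def unregister(self, doc_id: str) -> None:
        # Called when the editor unmounts.
        self._bridges.pop(doc_id, None)

    def get(self, doc_id: str) -> Bridge:
        # Fail cleanly instead of handing back a dead editor.
        try:
            return self._bridges[doc_id]
        except KeyError:
            raise BridgeUnavailable(f"no open editor for {doc_id}") from None
```

In this sketch, any agent action goes through `get`, so a write against a closed document surfaces as `BridgeUnavailable` rather than a silent no-op.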

What the bridge exposes splits into two kinds of operation. There are reads — give me the current document outline, the selected text, the live pricing blocks, the section the user is in — and there are writes — insert this prose here, replace this section, apply these pricing changes. The reads are how the model grounds itself. The writes are the hands.

The important design choice was that reads are pulled fresh, every turn, from the editor’s live state. Not from the last save. Not from a snapshot taken when the conversation started. Every time the agent takes a turn, it asks the bridge for the current truth and gets it. That sounds obvious written down. It was not what the first version did, and the gap between those two is a whole class of bugs I’ll get to.

Edits are proposals, not actions

The single most important decision in the whole feature: Zoe never edits your document. It proposes an edit, and you apply it.

Every change the model wants to make comes back to the UI as a card. A card to insert a generated section. A card to rewrite a paragraph. A card to change a pricing line. The card shows you what the change is, and it has an Apply button. Nothing touches the document until you press it.

This is partly a trust decision and partly an engineering one. The trust part is obvious — people will not let an AI silently rewrite a document they’re responsible for, and they shouldn’t. The engineering part is subtler: the review step is also the seam where you get to validate. By the time a change is a card with an Apply button, it has already been checked, sanitized, and resolved against the real document. The button is the last gate, but it’s not the only one.

The apply itself is a small state machine, because applying can fail. A card moves through pending, to applying, to either applied or failed, and a failed card offers a retry. That matters more than it sounds. The first version treated apply as fire-and-forget — click the button, assume it worked. When it didn’t work, which I’ll explain shortly, the user got nothing: no change, no error, no signal. Modelling apply as something that can fail, and showing that it failed, was the difference between a feature people trusted and one they didn’t.
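The state machine described above can be sketched roughly like this. `EditCard`, `CardState`, and the `do_apply` callback are hypothetical names, and the real cards live in the frontend, so treat this as an illustration of the transitions rather than the actual implementation.

```python
from enum import Enum
from typing import Callable, Optional


class CardState(Enum):
    PENDING = "pending"
    APPLYING = "applying"
    APPLIED = "applied"
    FAILED = "failed"


class EditCard:
    """A proposed change; nothing touches the document until apply() runs."""

    def __init__(self, description: str, do_apply: Callable[[], None]) -> None:
        self.description = description
        self.state = CardState.PENDING
        self.error: Optional[str] = None
        self._do_apply = do_apply  # hypothetical callback that performs the edit

    def apply(self) -> CardState:
        # Only a pending card (first attempt) or a failed card (retry) may apply.
        if self.state not in (CardState.PENDING, CardState.FAILED):
            return self.state
        self.state = CardState.APPLYING
        self.error = None
        try:
            self._do_apply()
        except Exception as exc:
            # Record the failure so the UI can show it and offer a retry.
            self.state = CardState.FAILED
            self.error = str(exc)
        else:
            self.state = CardState.APPLIED
        return self.state
```

The point of the explicit `FAILED` state is that the error is kept on the card, so there is always something to show the user, and calling `apply()` again is the retry.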

When a single turn produces more than one change (say, a new prose section and an edit to the pricing table underneath it), the cards stack, and there’s an “insert all in order” affordance. Order matters there. If you apply them in the wrong sequence, the prose lands below the table it was meant to introduce. So the stack applies them in document order, and each apply moves the insertion point forward so the next one lands after it, not on top of it.
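Here is a simplified sketch of "insert all in order". It assumes the document is a flat list of blocks and that every card is an insert carrying a hypothetical `order` field. The real stack also handles edits to existing blocks, which this leaves out.

```python
from dataclasses import dataclass


@dataclass
class InsertCard:
    order: int    # hypothetical: position of this change in the intended document order
    content: str


def apply_all_in_order(document: list[str], cards: list[InsertCard], insert_at: int) -> int:
    """Insert every card starting at insert_at; return the new insertion point."""
    for card in sorted(cards, key=lambda c: c.order):
        document.insert(insert_at, card.content)
        insert_at += 1  # advance so the next card lands after this one
    return insert_at
```

For example, with a document `["Cover page", "Terms"]` and cards arriving as table first, intro second, applying at index 1 still produces the intro above the table, because the sort restores document order and the insertion point moves forward after each apply.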

The phantom quote

Here is the bug that taught me the most.

Early on, Zoe would cheerfully offer to add a product to your quote when there was no quote in the document at all. You’d open a blank proposal, ask for help, and it would suggest editing a pricing block that didn’t exist. You’d click apply, and the apply would fail, because there was nothing to apply it to.

The cause was that the model’s picture of the document came from the wrong place. Instead of reading the actual blocks present in the editor, the context was being built from a cached field on the last saved revision — a grand total left over from a previous state. On a fresh document that number was stale or zero, but the context still asserted a quote block was there, summarised confidently, with a total and an item count. The model believed it, because why wouldn’t it. It was told the quote existed.

This is the failure mode that I think is genuinely specific to agents, and it’s worth naming. A model acting on confidently wrong context doesn’t hesitate. It doesn’t say “hm, are you sure there’s a quote here?” It acts, because the context said the quote was real, and an agent’s whole job is to act on its context. Garbage grounding doesn’t produce garbage caveats. It produces garbage actions, delivered with total composure.

The fix had three parts, and all three were about truthfulness. First, build the document context from the live blocks actually in the editor, not from a saved revision. Second, recompute that context every single turn, so it can never drift from what’s on the screen — if the live store says there are no pricing blocks, the context says there are no pricing blocks, even if a stale snapshot would have claimed otherwise. Third, validate edit targets on the backend: if the model asks to edit a block by an ID that isn’t actually present in the document, reject it with an error that names the valid targets, so the model can correct itself instead of producing a card that’s doomed to fail on apply.
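The three parts of the fix can be sketched together. The `Block` shape, the field names in the context dict, and the `InvalidEditTarget` error are all assumptions for illustration. The idea is that the context is derived only from the blocks passed in, and that a bad target is rejected with a message listing the valid ones.

```python
from dataclasses import dataclass


@dataclass
class Block:
    id: str
    kind: str            # hypothetical block kinds, e.g. "prose" or "pricing"
    item_count: int = 0


class InvalidEditTarget(Exception):
    """Raised when the model targets a block that is not in the document."""


def build_context(live_blocks: list[Block]) -> dict:
    """Summarise only what is in the editor right now; call this every turn."""
    pricing = [b for b in live_blocks if b.kind == "pricing"]
    return {
        "block_ids": [b.id for b in live_blocks],
        "has_quote": bool(pricing),
        "pricing_item_count": sum(b.item_count for b in pricing),
    }


def validate_target(target_id: str, live_blocks: list[Block]) -> None:
    """Reject edits to missing blocks, naming the valid targets in the error."""
    valid = [b.id for b in live_blocks]
    if target_id not in valid:
        raise InvalidEditTarget(
            f"block {target_id!r} is not in the document; valid targets: {valid}"
        )
```

Because `build_context` takes no cached input, a blank proposal can only ever produce `has_quote: False`. The error text from `validate_target` is meant to be fed back to the model so it can pick a real target on the next attempt.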

The lesson generalises past this one bug. An agent is only as good as the honesty of its context. Most of the reliability work on Zoe’s editing was not “make the model smarter.” It was “stop lying to the model about the state of the world.”

The cursor that wasn’t there

A smaller bug, but a nice illustration of how editor reality intrudes.

When you generate two sections in a row, you expect the second to land after the first. Ours landed before it — the document built itself in reverse. The reason was the cursor. A freshly opened editor puts a collapsed cursor at the very start of the document, position zero, and the editing surface isn’t focused. Our insert logic saw a cursor at the start and faithfully inserted there. Every time. So each new section went in above the last one.

The problem was that a collapsed cursor at the start of an unfocused editor is indistinguishable, by position alone, from someone who has deliberately clicked at the top of the document and wants to insert there. Same position, opposite intent.

We resolved it with a small heuristic: if the editor isn’t focused and the cursor is sitting at the very start, treat that as “no meaningful cursor” and insert at the end of the document instead. If the editor is focused, respect the cursor — the user put it there on purpose. And after every insert, move the cursor to the end of what was just inserted, so a sequence of inserts naturally flows down the page in order.
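The heuristic is small enough to sketch in full. This version models the document as a plain string and the cursor as a character offset, both simplifications, and the function names are hypothetical. One case the heuristic above doesn't specify is an unfocused editor with a cursor somewhere other than the start. This sketch assumes the cursor is respected there too.

```python
def resolve_insert_position(cursor_pos: int, is_focused: bool, doc_length: int) -> int:
    """Pick where an insert should land, given the editor's cursor state."""
    if not is_focused and cursor_pos == 0:
        # Unfocused editor with the default cursor: no meaningful cursor, so append.
        return doc_length
    # Otherwise assume the cursor position is intentional.
    return cursor_pos


def insert_text(doc: str, text: str, cursor_pos: int, is_focused: bool) -> tuple[str, int]:
    """Insert text; return the new document and a cursor placed after the insert."""
    pos = resolve_insert_position(cursor_pos, is_focused, len(doc))
    return doc[:pos] + text + doc[pos:], pos + len(text)
```

Feeding the returned cursor into the next call is what makes a run of inserts flow down the page: the second section starts where the first one ended.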

None of this is deep. But it’s the kind of thing you only find by watching the agent act in a real editor, and it’s a reminder that “let the model edit the document” smuggles in a hundred small assumptions about how the editor behaves.

Never trust the model’s HTML

One more piece, quickly, because it’s load-bearing. The model emits HTML for the prose it writes. You cannot insert that HTML into the document as-is. Ever.

We sanitize in two layers. The backend runs the model’s HTML through an allowlist: only a small set of permitted tags survives, and everything else is dropped before the change ever becomes a card. Then the editor itself does a second pass: incoming HTML is converted through the editor’s own schema, which silently discards anything the document model doesn’t support. The first layer is policy; the second is the editor’s own immune system. By the time a generated paragraph is on the page, it has been through both, and at no point was the model’s raw output trusted.
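To make the first layer concrete, here is an illustrative allowlist pass built on Python's standard `html.parser`. The tag set is a guess, not our actual policy. The sketch also makes two assumptions of its own: it drops every attribute, and it keeps the text inside disallowed tags while discarding `script` and `style` content entirely. A production system would more likely lean on a maintained sanitization library than on a hand-rolled parser.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allowlist; the real set of permitted tags is not shown here.
ALLOWED_TAGS = {"p", "strong", "em", "ul", "ol", "li", "h2", "h3"}
DROP_CONTENT = {"script", "style"}  # tags whose inner text is discarded too


class AllowlistSanitizer(HTMLParser):
    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.out: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self._skip_depth += 1
        elif tag in ALLOWED_TAGS and not self._skip_depth:
            self.out.append(f"<{tag}>")  # every attribute is dropped

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in ALLOWED_TAGS and not self._skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self._skip_depth:
            # Re-escape text so it can never be read back as markup.
            self.out.append(escape(data, quote=False))


def sanitize(html: str) -> str:
    parser = AllowlistSanitizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Even with a pass like this, the editor's schema conversion still runs afterwards. The allowlist expresses policy, and the schema catches whatever the policy missed.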

What I’d tell someone building this

If you’re about to give an agent the ability to change something real, the model is the least of your problems. Three things mattered far more.

Ground it in live truth, every turn. The expensive bugs all came from the agent acting on a stale or fabricated picture of the document. Recompute context from the real thing, don’t cache it, and validate that what the model wants to act on actually exists before you let it try.

Make every action a reviewed proposal. The Apply button isn’t just a trust feature for users; it’s the seam where you validate, sanitize, and resolve against reality. And model the apply as something that can fail, with a visible failure and a retry, because it will fail and silent failure is worse than no feature.

Respect the surface you’re editing into. The editor has a cursor, a schema, an undo stack, and a human inside it. Half the work of “let the AI edit the document” is the unglamorous business of behaving correctly inside someone else’s editor.

In the next post I’ll get into the other half of giving an agent capability: designing the tools it calls, and why the tool layer — not the prompt — is where an agent’s behaviour is really decided.
