Making an LLM Agent Production-Grade

The first two posts in this series were about capability — letting Zoe edit documents, and designing the tools it calls. This one is about the part nobody demos: making it survive production.

A copilot that works in a demo and a copilot that works for thousands of users on a Tuesday afternoon are different products, and the distance between them is almost entirely failure handling. The model is the same. What changes is everything around it — the retries, the half-finished requests, the slow dependencies, the cost. Zomentum’s reliability pass on Zoe was a P1 because the feature was, by then, genuinely useful and genuinely fragile, and the fragility had nothing to do with the quality of its answers.

Here’s what production actually demanded.

The conversation that breaks forever

The worst failure mode we had is specific to how tool-using agents talk to a model, and I want to explain it properly because it’s a trap anyone building this will fall into.

An agent turn works in a loop. The model says “call this tool,” your code runs the tool, you hand the result back, the model continues. The protocol has a hard rule: every tool call the model makes must be paired with a result you send back. A tool call with no matching result is an invalid conversation.

Now imagine a turn gets cut off in the middle. The model emits a tool call — a big one, a multi-section document draft, say — and the response hits its output limit or the stream drops before your code finishes handling it. You’ve now persisted a conversation where the last thing the model did was call a tool that never got a result. The next time the user sends a message, you replay that history to the model, and the API rejects it: there’s an unpaired tool call in here. It rejects it again on the next message. And the next. The conversation is poisoned — permanently, 400 every time, with no path forward except someone manually resetting it.

We fixed this from both ends. At read time, before replaying a conversation to the model, we heal it: scan the history, and for any tool call left dangling without a result, synthesise an error result — “this call was interrupted” — so the history is valid again. A poisoned conversation repairs itself the next time it’s loaded. And inside the agent loop, when we detect a turn ended with a tool call still pending, we inject that synthetic error result immediately and let the loop continue, so the model can recover within the same turn instead of leaving a landmine for the next one.

The general lesson: the agent loop has invariants the protocol enforces, and any interruption — a limit, a timeout, a dropped stream — can leave you in a state that violates them. You have to assume turns will be cut off, and make the conversation self-healing rather than hoping they won’t.

Retries that duplicate your work

The second failure came from the network being normal.

A user’s browser sends a message. The request is slow — the model is thinking, tools are running — and the browser, or a proxy, or an impatient retry somewhere, sends it again. Now the same message is being processed twice. For a chat reply that’s wasteful but harmless. For an agent that acts, it’s a real bug: two drafts of the same task, two copies of the same edit, the same change proposed twice.

The fix is idempotency, done at the message boundary. Each user message carries a unique ID. Before we start processing one, we try to claim that ID with an atomic conditional write — store it, but only if it hasn’t been stored already. If the claim succeeds, this is the first time we’ve seen the message and we process it. If the claim fails, we’ve seen it before and we reject the duplicate quietly. The atomicity matters: two concurrent copies of the same request race for the claim, exactly one wins, and the other is turned away.

The one nuance worth stating: when the store backing the claim is itself unavailable, we fail open — we allow the request rather than block it. A brief lapse in duplicate protection is a far smaller harm than refusing every legitimate message because the dedupe store hiccuped. You choose which way to fail, and for this we chose availability.

Bounding the blast radius

An agent fans out. One user message can trigger several tool calls, each of which hits a backend that hits a database. Any one of those can be slow. And a slow dependency, left unbounded, doesn’t just slow one tool — it holds a worker, holds the stream, and the user sits there watching a spinner with no idea whether anything is happening.

So every tool call has a timeout, enforced on both sides of the call. If a tool doesn’t return in time, it doesn’t hang the conversation — it comes back as a tool error, and the model adapts. “That search timed out; here’s what I can tell you from what I already have.” A bounded failure the agent can talk its way around is infinitely better than an unbounded wait the user has to abandon.

Above the per-call timeout sits rate limiting, scoped at a few levels. Per user, so one person can’t burn through capacity in a burst. Per tenant, so one organisation’s usage doesn’t starve another’s. And per session, to cap runaway tool-calling within a single conversation. When a limit is hit, the user gets a clear, honest message — you’ve sent a lot of requests, try again shortly — rather than a degraded experience or a silent stall. Rate limits aren’t only about cost; they’re about keeping one heavy user from becoming everyone else’s outage.

Being able to see what happened

When a user says “Zoe didn’t apply my pricing edit,” that report crosses three systems — the frontend that showed the card, the agent service that ran the turn, and the backend that the tools called. Without a way to stitch those together, debugging is archaeology.

So every user action carries a trace ID, generated at the start and threaded through every leg: the frontend stream, the agent’s turn, every backend tool call it makes. One ID, one story. When something goes wrong, you search for that ID and see the whole arc — which tools ran, which were slow, where it failed — instead of guessing across three sets of logs that don’t know about each other. The first time I debugged a real user complaint by grepping a single trace ID and watching the entire request reconstruct itself, the feature paid for itself.

Alongside tracing, we log the shape of every turn: how many iterations of the agent loop it took, how many tokens went in and out, how much of the input was served from cache, how many retries happened, how long it took. That per-turn record is what turns “Zoe feels slow today” into “this class of request is doing three loop iterations and a retry, here’s why.”

Making it affordable

Token cost on a multi-turn agent adds up fast, and most of what you send the model every turn is identical — the system prompt, the tool definitions, the policy. Re-sending and re-billing that on every turn is pure waste.

Prompt caching fixes it: the static part of the request — the system instructions and tool schemas that don’t change between turns — is marked cacheable, so after the first turn the model serves it from cache at a fraction of the cost and latency. The one discipline this demands is keeping the cacheable part byte-stable. Anything dynamic — the current page context, the live document state, the user’s specific scope — has to go in a separate, uncacheable part of the request. If you let dynamic content bleed into the cached block, the cache key changes every turn and you’ve cached nothing. So the structure is deliberate: stable instructions in the cached block, live context in the fresh block, and the boundary between them maintained on purpose.

The telemetry from the previous section is what made this visible. Because we logged cache reads versus cache writes per turn, we could actually see the cache working — a healthy conversation showing mostly reads after its first turn — instead of trusting that we’d wired it up correctly. Caching you can’t measure is caching you’re guessing about.

Failures are data, not exceptions

The thread running through all of this is a single shift in how we treated failure.

The first version of Zoe treated failures as exceptions — things that go wrong, get logged somewhere, and surface to the user as nothing at all. You’d click Apply on a card and it would quietly do nothing, because the document it targeted was no longer the one you had open, and there was no concept of telling you that.

The production version treats failures as typed, first-class outcomes. An edit that can’t apply because you’ve switched documents isn’t a silent no-op; it’s a specific state with a specific, actionable message — this block belongs to another document, switch to it or start a new one. A tool timeout is a result the model can reason about. A rate limit is a message with a retry. A poisoned conversation is a condition we detect and heal. Each failure mode has a name, a representation, and a defined behaviour, instead of all of them collapsing into the same silent nothing.

That’s really the whole of what “production-grade” meant here. Not a better model — the model never changed. It meant naming every way the thing could fail, deciding what should happen in each case, and making sure the user and the operator could always tell what happened. No silent failures. No mystery hangs. No conversation that breaks forever with no way back.

What I’d carry to the next one

If I were starting another agent from scratch, the reliability work is what I’d front-load, because it’s the part that’s invisible right up until it’s the only thing anyone notices.

Assume turns get interrupted, and make the conversation self-healing rather than betting they won’t. Make actions idempotent at the message boundary, and decide deliberately which way you fail when the dedupe store is down. Bound every external call with a timeout and turn the timeout into something the model can talk around. Thread one trace ID through every layer before you need it, because you’ll need it during an incident, not before. Cache the stable part of your prompt and measure that the cache is actually hitting. And treat every failure as a named outcome with a defined behaviour, not an exception that evaporates.

The model is the easy part. It mostly works. Everything that decides whether people trust it lives in the boring machinery around it — and that machinery is the actual product.

That’s the series. From giving the agent hands, to the tools it reaches with, to keeping the whole thing standing under real load. If there’s a single thread across all three, it’s that building a good agent is mostly not about the model. It’s about grounding it in truth, shaping what it can touch, and engineering honestly for every way it can go wrong.

Making an LLM Agent Production-Grade

The conversation that breaks forever

Retries that duplicate your work

Bounding the blast radius

Being able to see what happened

Making it affordable

Failures are data, not exceptions

What I’d carry to the next one

Related Posts

Designing Tools an LLM Can Actually Use

Giving an LLM Agent Hands

From vibe coding to engineering management: making AI coding assistants actually work

Stop learning frameworks. Start building things.