Designing Tools an LLM Can Actually Use

When Zoe — Zomentum’s AI copilot — does something useful, almost none of it is the model reasoning in the abstract. It’s the model calling tools: list these opportunities, get that contact, draft this task. The tools are how the agent touches the product. Which means the way you design the tools is, more than anything else, how the agent behaves.

I came into this expecting to spend my time on prompts. I spent most of it on tools. Every time Zoe did something dumb — declared you had no open deals when you had forty, edited the wrong field, gave up on a search that should have worked — the fix was almost never “rewrite the prompt to be sterner.” It was a tool returning a shape the model couldn’t reason about, or a gap between what the model could write and what it could read.

This is the second of three posts on building Zoe. The first was about letting the agent edit documents. This one is about the tools underneath, and the small, repeatable ways a tool can lead a model astray.

Tools are an API whose only consumer is a model

The mental shift that helped most was to treat the tool layer as a real API with exactly one client — the model — and to take that client’s quirks as seriously as you’d take a frontend team’s.

That client is fast, tireless, and reads your schemas literally. It’s also weirdly suggestible: it pattern-matches on names, it gives up at the first ambiguous signal, and it will confidently pass the wrong thing if your tool gives it room to. Designing for it is less like writing a backend and more like writing a backend for a brilliant intern who takes everything at face value and never asks a clarifying question.

A few conventions came out of that, and they’re boring on purpose. Tools that fetch a list are named list_something. Tools that fetch one thing by ID are named get_something. Tools that propose a change are named draft_something. The model learns those prefixes fast and generalises across them: once it understands that list_ tools paginate and draft_ tools produce a reviewable card, it applies that to a list_ or draft_ tool it’s never seen before. Consistency in naming is free, and it compounds.
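One way to keep a naming convention like this from drifting is to check it when a tool is registered. The sketch below is only an illustration: the three prefixes come from the convention above, while `ALLOWED_PREFIXES` and `check_tool_name` are hypothetical names, not Zoe’s actual code.

```python
# Hypothetical convention check. Only the three prefixes come from the post;
# the helper and its error message are illustrative.
ALLOWED_PREFIXES = {
    "list_": "fetches a paginated list",
    "get_": "fetches one record by ID",
    "draft_": "proposes a change as a reviewable card",
}


def check_tool_name(name: str) -> str:
    """Return the prefix a tool name uses, or raise if it follows no convention."""
    for prefix in ALLOWED_PREFIXES:
        # Require something after the prefix, so a bare "list_" is rejected too.
        if name.startswith(prefix) and len(name) > len(prefix):
            return prefix
    raise ValueError(
        f"Tool name {name!r} must start with one of: {', '.join(ALLOWED_PREFIXES)}"
    )
```

Run at startup or in a unit test, a check like this would reject a name such as fetch_contacts before the model ever sees it.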

One pagination shape, everywhere

The first real lesson was about consistency of response shape, and it came from getting it wrong.

The list tools had grown up at different times, and they paginated differently. One returned a page number and a total. Another returned an offset and a count. A third just returned everything. To a human integrator this is mildly annoying. To the model it was paralysing in a specific way: it couldn’t form one strategy for “is there more data?” so it improvised per tool, and improvised badly. Sometimes it would stop after one page of a long list and tell the user that was everything. Sometimes it would try to walk pages that didn’t exist.

We made every list tool return the same shape — the items, the current page, and a plain boolean for whether more exist. Once that boolean was uniform, the model’s behaviour around it became uniform too. “More exist” is a clear signal to keep going if the user asked for everything; “no more” is a clear signal to stop. No guessing.
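As a rough sketch of that shape, assuming an in-memory list of records and 1-based pages: the field names `items`, `page`, and `has_more` and the `paginate` helper are my own illustrative choices, not the names Zoe’s tools use.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ListPage:
    items: list[Any]
    page: int       # 1-based current page
    has_more: bool  # the one signal the model uses to decide whether to continue


def paginate(records: list[Any], page: int, limit: int) -> ListPage:
    """Slice records into one page and report whether more exist."""
    if page < 1 or limit < 1:
        raise ValueError("page and limit must both be >= 1")
    start = (page - 1) * limit
    end = start + limit
    return ListPage(items=records[start:end], page=page, has_more=end < len(records))
```

Every list tool returning a `ListPage` means the model only has to learn one rule: if `has_more` is true and the user asked for everything, request the next page.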

We also pushed page-size intent into the prompt: match the limit to what the user asked for. “Top five deals” wants five, not a walk through the whole pipeline. “All my open tasks” wants a large page. And don’t pre-emptively walk every page just because you can — fetch what answers the question. A list tool that returns a clean, uniform shape lets you actually teach that, because the model isn’t spending its attention decoding three different pagination dialects.

Don’t let an empty result mean “zero”

This one is my favourite, because it’s so specifically an agent problem.

A user asks Zoe for their open opportunities. The model calls the list tool with a stage filter of “open.” It gets back an empty list. And it tells the user: you have no open opportunities. Except they have plenty. “Open” just isn’t a real stage name; it’s a concept. The actual stages are things like “Discovery” and “Proposal Sent,” and “open” means “any stage that isn’t a terminal closed-won or closed-lost stage.” The model passed a word that felt like a filter value but wasn’t one, and got nothing back. A human would have thought “huh, that’s odd, let me check.” The model just reported the empty result as fact.

An empty list is genuinely ambiguous. It can mean “you have none,” or it can mean “your filter was wrong,” and the model has no way to tell which from an empty array alone. So we stopped sending it a bare empty array. When a list tool returns nothing, it now also returns a small diagnostic: how many records exist with no filters applied, and whether the filter the model used was even valid. That turns an ambiguous silence into something the model can reason about. If the unfiltered count is zero, it’s safe to tell the user they have none. If the unfiltered count is high but the filtered result is empty, the filter is the problem — loosen it and retry. If the filter value wasn’t valid at all, the response says so and lists the valid ones.
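Here is a minimal sketch of that diagnostic, assuming records are plain dicts with a `stage` key. The stage list, the `empty_hint` key, and its field names are hypothetical, and pagination fields are left out to keep the example short.

```python
from typing import Optional

# Illustrative stage names; a real system would load these from its own configuration.
VALID_STAGES = ["Discovery", "Proposal Sent", "Closed Won", "Closed Lost"]


def list_opportunities(records: list[dict], stage: Optional[str] = None) -> dict:
    """Filter by stage; when nothing matches, explain why instead of returning a bare []."""
    if stage is None:
        items = list(records)
    else:
        items = [r for r in records if r["stage"] == stage]

    response: dict = {"items": items}
    if not items:
        filter_valid = stage is None or stage in VALID_STAGES
        hint = {
            "unfiltered_count": len(records),  # how many exist with no filter at all
            "filter_valid": filter_valid,      # was the filter value a real stage?
        }
        if not filter_valid:
            hint["valid_stages"] = VALID_STAGES
        response["empty_hint"] = hint
    return response
```

Called with `stage="open"` against a non-empty pipeline, this returns an empty `items` list plus a hint saying the filter was invalid and listing the real stages, which gives the model something concrete to retry with.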

The same class of problem shows up with anything the user expresses as a concept rather than an ID. “Mine,” “my deals,” “Sarah’s accounts” — those are not IDs, and the tools want IDs. The instinct is to hope the model figures it out. The thing that actually worked was to write the constraint and the recipe into the prompt: tool inputs are IDs, never names; to resolve “mine,” call the current-user tool first and use that ID; to resolve a person’s name, look them up and use the resulting ID; never pass a bare name where an ID belongs. Pair the rule with the procedure and the model follows it reliably. State only the rule and it improvises.

If the model can write it, it must be able to read it

A subtle bug taught me a rule I now apply everywhere: read/write parity.

Zoe could update whether a contact was the primary contact. It could set that flag through a draft. But the tool that fetched a contact’s details didn’t return that flag. So the model could write the field but never read it. And an agent that can act but can’t observe the result of acting does something pathological — it over-corrects. Unable to see the current state, it would assume the worst and propose changes that were already true, or flip things back and forth, because it had no feedback loop on its own writes.

The fix was just to close the gap: every field the model can change, it can also see. Once it could read the primary-contact flag, the flailing stopped, because now it could check before it acted. It sounds trivial written down, but it’s an easy asymmetry to introduce — read tools and write tools are often built separately, by different people, at different times — and the model exposes it instantly.
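Because the asymmetry is so easy to introduce, it may be worth checking mechanically. A small sketch, assuming each entity can declare the fields its read tool returns and the fields its draft tool accepts; the field sets and the `write_only_fields` helper below are hypothetical.

```python
def write_only_fields(readable: set[str], writable: set[str]) -> set[str]:
    """Return fields the model can change but can never observe."""
    return writable - readable


# Hypothetical field sets for a contact, reproducing the bug described above.
CONTACT_READ_FIELDS = {"id", "name", "email"}
CONTACT_WRITE_FIELDS = {"name", "email", "is_primary"}

missing = write_only_fields(CONTACT_READ_FIELDS, CONTACT_WRITE_FIELDS)
if missing:
    print(f"Write-only fields on contact: {sorted(missing)}")
```

Turned into a test that asserts the result is empty for every entity, this would fail the build the moment someone adds a writable field without exposing it on the read side.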

Validate the draft before the human sees it

The agent’s changes come back as reviewable cards — that’s the pattern from the first post. But a card is only useful if what’s on it is coherent. So every draft tool validates before it ever becomes a card.

Validation here is unglamorous and strict. A draft that updates a record has a whitelist of fields that may be touched; anything outside it is rejected. Enum-valued fields are checked against their allowed values. Required fields are required. A draft that would change nothing — an empty set of updates — is refused outright, because an Apply button that does nothing is worse than no button.

The important detail is what happens on a validation failure: the error goes back to the model, in plain language, as the tool’s result. Not a 500, not a silent drop — a sentence the model can read and act on. “That field can’t be updated; the updatable fields are X, Y, Z.” The model reads it and tries again correctly. The validator isn’t just protecting the frontend from bad data; it’s a feedback channel that lets the agent self-correct within the same conversation. That reframing — errors as instructions to the model, not just failures — changed how I write tool validation.
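A sketch of what such a validator might look like, under the assumption that a draft is a dict of field updates. The whitelist, the enum values, and the wording of the messages are all hypothetical, and required-field checks are omitted for brevity.

```python
# Hypothetical whitelist and enum for a task draft; real tools would define their own.
UPDATABLE_FIELDS = ("due_date", "priority", "title")
ENUM_VALUES = {"priority": ("low", "medium", "high")}


def validate_draft(updates: dict) -> list[str]:
    """Return plain-language problems; an empty list means the draft may become a card."""
    if not updates:
        return ["This draft changes nothing. Include at least one field to update."]

    problems = []
    for field, value in updates.items():
        if field not in UPDATABLE_FIELDS:
            allowed = ", ".join(UPDATABLE_FIELDS)
            problems.append(
                f"The field '{field}' can't be updated; the updatable fields are {allowed}."
            )
        elif field in ENUM_VALUES and value not in ENUM_VALUES[field]:
            allowed = ", ".join(ENUM_VALUES[field])
            problems.append(f"'{value}' is not a valid {field}; use one of {allowed}.")
    return problems
```

The returned sentences would be sent back as the tool’s result rather than raised as an exception, so the model reads them in the same turn and can retry with a corrected draft.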

The status the user sees

A small thing that punched above its weight: humanised tool labels. When Zoe is working, the user sees a line of status — “Looking through your pipeline,” “Checking engagement history,” “Drafting an email.” Early on these sometimes leaked the raw tool name, so a user would see “Running list_opportunities,” which is exactly the kind of detail that makes a product feel like a thing engineers forgot to finish.

The fix was to make the backend the single source of truth for that copy. Each tool owns its human-readable label, the label travels with the tool’s status event, and the frontend simply displays what it’s given rather than keeping its own mapping that could drift out of sync. One place to author the copy, no fallbacks to raw names. The user always sees a sentence written for them, not an internal identifier.
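In code, the idea could look roughly like this. The `Tool` class, the `status_event` function, and the event’s keys are hypothetical; the point is only that the label is defined next to the tool and is the only text sent to the frontend.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str   # internal identifier, never shown to the user
    label: str  # human-readable status line, authored alongside the tool

    def __post_init__(self) -> None:
        # Refuse to register a tool without copy, so there is no raw-name fallback.
        if not self.label.strip():
            raise ValueError(f"Tool {self.name!r} needs a human-readable label")


def status_event(tool: Tool) -> dict:
    """Build the event sent to the frontend; it carries the label, not the raw name."""
    return {"type": "tool_status", "label": tool.label}
```

Because the event never includes the raw name, the frontend has nothing to fall back to, and a tool with no label fails at definition time instead of in front of a user.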

The system prompt is policy, not vibes

Pulling these together: I stopped thinking of the system prompt as a place to set a tone and started treating it as executable policy. Not “be helpful and friendly,” but a precise set of rules tied to real tool behaviour. Concept words map to these procedures. Inputs are IDs, never names. On an empty result, read the hint before concluding zero. Mutations are proposals, never silent actions. Match page size to intent.

What made the prompt work wasn’t sternness or length. It was specificity, and the pairing of every constraint with a recipe for satisfying it. The model is excellent at following a clear procedure and unreliable at inferring one. So you write the procedure down. The prompt becomes the contract between the model and your tools, and the tools are built to make that contract followable — uniform shapes, honest empties, readable errors, no write-only fields.

The tools decide what the agent can do. The prompt decides what it should do. Get both right and the model, honestly, mostly just works.

In the final post I’ll get into what it took to make all of this survive contact with production — the retries, the poisoned conversations, the timeouts, and the observability that made any of it debuggable.
