Agent harnesses: the part of AI nobody is showing you

Everyone is talking about AI agents. Most of what is being shown is a chatbot wired to a model, given a few buttons, and called an agent. You ask it to do a thing. It does the thing. The demo lands. The room claps.

That is not an agent. That is a model with a microphone.

The thing that makes an AI agent actually work, and more importantly, actually safe, is the harness around it. It is the part nobody puts on stage. It is also the only part that matters once you stop demoing and start running a business on the thing. The numbers bear this out. Only 12% of enterprise AI agent initiatives ever make it to production at scale, and the average failed project costs around £340,000 in direct spend before anyone calls a halt (Composio AI Agent Report 2025; Digital Applied, Why 88% of AI Agents Fail Production, 2026).

What an agent harness actually is

The harness is everything around the model. The tools it can call. The data it can see. The rules it must follow. The checks that run before and after every action. The things it is not allowed to touch. The way it records what it did and why.

The model is the brain. The harness is the body, the muscles, the nervous system, and the supervisor standing behind it with a clipboard and a slightly worried expression.

The way to picture it is this. The model is a brilliant new hire on their first day. Sharp, quick, fluent, knows a remarkable amount about the world. Has no idea what your processes are, who can sign off what, where the data lives, or what counts as a fireable offence. The harness is the onboarding, the access controls, the policies, and the manager. Without all of that, you are letting a stranger loose in your accounts on day one and hoping for the best.

How a harness works

The loop is simpler than it sounds. The model receives a request. It decides what it wants to do. The harness intercepts that decision before anything happens. It checks whether this user is allowed to do this thing, on this data, in this state. It runs the action through the right tool, which is not always the model itself. It returns a result. It logs everything.

The model never directly touches the database, the payment rails, or the customer's data. It asks. The harness answers, or refuses.
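
The loop above can be sketched in a few lines. This is a minimal illustration, not Finzu's implementation: the names ProposedAction, Harness, handle, and get_invoice_total are all hypothetical, permissions are reduced to a per-user set of tool names, and the audit log is an in-memory list.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Callable


@dataclass(frozen=True)
class ProposedAction:
    """What the model asks for. It is a request, not an execution."""
    user: str
    tool: str
    args: dict[str, Any]


@dataclass
class Harness:
    # Hypothetical registry: tool name -> callable that does the real work.
    tools: dict[str, Callable[..., Any]]
    # Hypothetical permission model: user -> tool names they may invoke.
    permissions: dict[str, set[str]]
    audit_log: list[dict[str, Any]] = field(default_factory=list)

    def handle(self, action: ProposedAction) -> dict[str, Any]:
        allowed = action.tool in self.permissions.get(action.user, set())
        known = action.tool in self.tools
        if allowed and known:
            value = self.tools[action.tool](**action.args)
            result = {"status": "ok", "value": value}
        else:
            result = {"status": "refused", "value": None}
        # Every decision is logged, refusals included.
        self.audit_log.append({
            "when": datetime.now(timezone.utc).isoformat(),
            "who": action.user,
            "tool": action.tool,
            "args": action.args,
            "status": result["status"],
        })
        return result


def get_invoice_total(invoice_id: str) -> str:
    # Stand-in for a real, deterministic lookup.
    return {"INV-1": "120.00"}.get(invoice_id, "0.00")


harness = Harness(
    tools={"get_invoice_total": get_invoice_total},
    permissions={"alice": {"get_invoice_total"}},
)
ok = harness.handle(
    ProposedAction("alice", "get_invoice_total", {"invoice_id": "INV-1"})
)
refused = harness.handle(
    ProposedAction("bob", "get_invoice_total", {"invoice_id": "INV-1"})
)
```

The model only ever produces a ProposedAction. Whether anything runs is decided by code the model cannot edit, and the refusal for bob is recorded just as carefully as the success for alice.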

This is the bit nobody is putting on stage at AI conferences, and it is the bit that makes the difference between a useful agent and a very expensive mistake.

What happens without one

A model with a database connection and no harness is not neutral. It is helpful. That is the problem.

It runs a query it was not asked to run, because it thought you would want to see the wider picture, and returns data from another company because nothing stopped it. It calculates a VAT total by doing the arithmetic itself, which it cannot do reliably, and gets it wrong by a rounding error that compounds across two thousand line items. It pays the same invoice twice because the user retried after a timeout and the model has no idea the first call already went through. It produces a board pack with confident numbers it made up, because nothing in the system told it to refuse. It logs none of this, because logging was not part of the prompt.

This is not a hypothetical. 88% of organisations reported confirmed or suspected AI agent security incidents in the last year, with 61% of those incidents involving data exposure and 41% involving unintended actions in business processes (Gravitee, State of AI Agent Security 2026; Cloud Security Alliance / Token Security, AI Agent Security Report, 2026). In a separate study, 80% of organisations reported their agents had performed unintended actions, including accessing unauthorised systems and sharing protected data (Domo, As AI Agents Scale, So Does the Security Risk, 2026).

The model is not lying. It is doing exactly what it was asked, with the information it had, which was nothing. That is the problem.

Without a harness, every interaction is an act of trust in a system that has no concept of trust. It works until it doesn't. When it doesn't, you find out from the auditor, the supplier, or the tenant whose data just walked out the door.

The biggest mistake: pushing everything through the LLM

The temptation, especially for teams shipping fast, is to use the model for everything. Need to add two numbers. Ask the model. Need to look up a supplier. Ask the model. Need to reconcile a hundred line items. Ask the model.

This is wrong on three counts.

It is expensive. Every call costs money and adds waiting time.

It is unreliable. Language models are probabilistic. Arithmetic is not. Recent benchmarking across 37 models found hallucination rates between 15% and 52% on structured analysis tasks (SQ Magazine, LLM Hallucination Rate Up to 82%: 40+ Stats, 2026). On finance specifically, even modern models still routinely produce confident, wrong numbers when asked to do the maths themselves.

It is unauditable. When a model decides something, you cannot rerun it and count on getting the same answer. In finance that is not a quirk. It is a problem.

Asking a language model to do your VAT return is like asking a poet to file your taxes. They will produce something. You will not enjoy what HMRC says about it.

The right tool for the job

A good harness routes each step to the thing best suited to do it.

Deterministic code does the maths. Decimal arithmetic, reconciliation, currency conversion, anything where the same input must always produce the same output. Database queries fetch the data, so the model never invents a balance. Specialised models do specialised jobs. OCR for documents, classifiers for invoice categories, embedding models for search, each chosen because it is good at one thing.
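
To make "deterministic code does the maths" concrete, here is a small sketch using Python's standard decimal module. The 20% rate, the per-line rounding rule, and the function names are assumptions chosen for illustration; the real rounding policy is a decision for your accountants, not for this example.

```python
from decimal import Decimal, ROUND_HALF_UP

VAT_RATE = Decimal("0.20")  # assumed rate, for illustration only
PENNY = Decimal("0.01")


def vat_for_line(net: Decimal) -> Decimal:
    """VAT on one line, rounded to the penny with an explicit rule."""
    return (net * VAT_RATE).quantize(PENNY, rounding=ROUND_HALF_UP)


def vat_total(net_amounts: list[Decimal]) -> Decimal:
    """Sum of per-line VAT. Same input, same output, every time."""
    return sum((vat_for_line(n) for n in net_amounts), Decimal("0.00"))


# Two thousand identical line items, as in the failure case described earlier.
lines = [Decimal("19.99")] * 2000
total = vat_total(lines)

# Binary floats cannot represent most decimal fractions exactly.
float_is_exact = (0.1 + 0.2 == 0.3)
decimal_is_exact = (Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))
```

The rounding rule is written down in one place and applied the same way on every run, which is the property an auditor cares about. The last two lines show why even ordinary floating-point arithmetic is the wrong tool for money: float_is_exact is False, decimal_is_exact is True.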

The LLM does what it is genuinely good at. Understanding what you asked for. Drafting prose. Summarising context. Deciding what to do next.

The harness is the conductor. The LLM is one instrument, not the whole orchestra.

Why finance demands a harness built for it

General-purpose agent harnesses are being built for general-purpose work. Replying to emails, booking flights, researching markets. The bar is "useful most of the time." That bar does not survive contact with finance.

Finance has a different list of non-negotiables. The numbers must be exact, always. The data must be isolated. Every action must be logged with who, what, when, and why. Sensitive fields must be encrypted. Approvals must follow real workflows, not be invented on the fly. A retry must never pay the same supplier twice. A mistake must be reversible, or at least visible.
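
One of those non-negotiables, "a retry must never pay the same supplier twice", is usually handled with an idempotency key. The sketch below is a hedged illustration of that general technique, not a description of any particular product: PaymentLedger and its methods are hypothetical, and an in-memory dictionary stands in for what would normally be a database table with a unique constraint on the key.

```python
from dataclasses import dataclass, field
from decimal import Decimal


@dataclass
class PaymentLedger:
    """In-memory stand-in for a payments table keyed by idempotency key."""
    _by_key: dict[str, dict] = field(default_factory=dict)

    def pay(self, idempotency_key: str, supplier: str, amount: Decimal) -> dict:
        existing = self._by_key.get(idempotency_key)
        if existing is not None:
            # A retry: report the original outcome and move no money.
            return {**existing, "replayed": True}
        record = {"supplier": supplier, "amount": amount, "replayed": False}
        self._by_key[idempotency_key] = record
        return record

    def total_paid(self) -> Decimal:
        return sum((r["amount"] for r in self._by_key.values()), Decimal("0"))


ledger = PaymentLedger()
# The key is derived from the business event (here, the invoice),
# so a retry after a timeout carries the same key as the first attempt.
first = ledger.pay("invoice-INV-1", "Acme Ltd", Decimal("120.00"))
retry = ledger.pay("invoice-INV-1", "Acme Ltd", Decimal("120.00"))
```

The model is free to ask twice. The ledger decides whether money moves, and it only moves once per key.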

A harness for finance is not a chatbot with guard rails bolted on. It is a system designed around those constraints first, with the model invited in to help where it actually adds value.

But why not just use Claude or Cowork?

A reasonable question. General-purpose agents are genuinely impressive. Claude, Cowork, and the rest can read your documents, draft your emails, and handle a great deal of useful work. For most jobs, they are the right tool.

Finance is not most jobs.

A general-purpose harness is built to be broadly useful. It is given wide access to do many things, reasonably well, across many domains. That is the right design when the cost of a mistake is a misworded email or a slightly off summary. It is the wrong design when the cost of a mistake is a duplicate payment, a leaked balance sheet, or a number on a board pack that nobody can defend.

A specialised harness for finance starts from the opposite end. It begins with what must never happen, builds the rules to make those things impossible, and then invites the model in to help within those rules. Narrower on purpose. Safer by design.

What Finzu's harness looks like

The Finzu harness is built around three principles.

Security as the foundation. The model never touches the database directly. There is always a layer between the data and the LLM, and that layer decides what the model is allowed to ask for, what it is allowed to see, and what it is not. Tenant isolation is enforced at the database layer itself, not in the model's prompt. Sensitive fields are encrypted at rest. Every action is logged. Authentication and authorisation are checked before the model is even consulted. This matters more than it sounds. IBM's most recent Cost of a Data Breach report found that 97% of the organisations that had suffered an AI-related breach lacked proper AI access controls (IBM, Cost of a Data Breach Report 2025). The breach is almost always at the boundary, not in the model.
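
As a rough picture of the layer that sits between the data and the LLM, consider the sketch below. It is an assumption-laden illustration rather than Finzu's code: TenantScopedStore, the invoices table, and the tenant names are invented, and SQLite stands in for the real database. It shows scoping in the access layer only. Enforcement inside the database itself, for example through row-level security policies, is a separate mechanism that this sketch does not attempt to reproduce.

```python
import sqlite3


class TenantScopedStore:
    """The only data handle the model-facing code receives.

    The tenant is fixed at construction from the authenticated session,
    never taken from anything the model wrote.
    """

    def __init__(self, conn: sqlite3.Connection, tenant_id: str) -> None:
        self._conn = conn
        self._tenant_id = tenant_id

    def invoices(self) -> list[tuple[str, str]]:
        # Parameterised query: the tenant filter cannot be omitted or edited.
        rows = self._conn.execute(
            "SELECT invoice_id, amount FROM invoices "
            "WHERE tenant_id = ? ORDER BY invoice_id",
            (self._tenant_id,),
        )
        return rows.fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoices (tenant_id TEXT, invoice_id TEXT, amount TEXT)"
)
conn.executemany(
    "INSERT INTO invoices VALUES (?, ?, ?)",
    [
        ("acme", "INV-1", "120.00"),
        ("acme", "INV-2", "80.00"),
        ("globex", "INV-9", "999.00"),
    ],
)

acme_view = TenantScopedStore(conn, "acme").invoices()
```

There is no method on the store that accepts a tenant id or raw SQL, so a model that "wants to see the wider picture" has nothing to call. acme_view contains the two Acme invoices and nothing from the other tenant.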

Confidence in the numbers. Arithmetic does not go through the LLM. It runs through deterministic code with decimal precision. The model can describe the numbers. It cannot make them up. When Finzu tells you what your payables are, the figure was calculated, not generated.

The right tool for the job. Document classification uses a classifier. Document data extraction uses an OCR engine. Reconciliation uses code. The LLM is reserved for the things only it can do, such as understanding what you asked for, drafting a board summary, or working out which step of an agent workflow to run next. Each provider is pluggable, so as the field moves, Finzu moves with it without rebuilding the foundation.
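
"Pluggable" can be illustrated with a small interface. The sketch below assumes a Python codebase and uses typing.Protocol; Classifier, KeywordClassifier, and Pipeline are hypothetical names, and the keyword matcher is a toy stand-in for a real classification model.

```python
from decimal import Decimal
from typing import Protocol


class Classifier(Protocol):
    """Anything with a classify method can be plugged in."""

    def classify(self, text: str) -> str: ...


class KeywordClassifier:
    """Toy stand-in for a real document classifier."""

    def classify(self, text: str) -> str:
        return "invoice" if "invoice" in text.lower() else "other"


class Pipeline:
    def __init__(self, classifier: Classifier) -> None:
        # Swappable provider: the rest of the pipeline does not change.
        self.classifier = classifier

    def process(self, text: str, amounts: list[Decimal]) -> dict:
        return {
            # The classifier's job.
            "category": self.classifier.classify(text),
            # Deterministic code's job, never the LLM's.
            "total": sum(amounts, Decimal("0.00")),
        }


result = Pipeline(KeywordClassifier()).process(
    "Invoice 42 from Acme Ltd",
    [Decimal("10.10"), Decimal("20.20")],
)
```

Replacing KeywordClassifier with a better model means writing one new class that has a classify method. The totalling code, and everything that depends on it, stays exactly as it was.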

The harness is not a feature you see in the product. It is the reason you can trust what you do see.

The point

Finance has spent decades building software that is correct, auditable, and safe by default. The temptation to throw all of that away because the model is impressive at the demo is real. The mistake would be expensive.

The future of AI in finance is not the model. It is the harness around the model. Build the harness for finance, and the model becomes useful. Skip it, and you have built a very "trustworthy" way to lose money.