Safety · 10 min read

Hallucination at $1M: How We Prevent AI Errors in Financial Workflows

One hallucinated invoice total in an accounts payable (AP) workflow can trigger an overpayment that is slow and expensive to recover. We detail the seven-layer defense system we have built, from structured output enforcement and a deterministic rules engine to production drift detection.

Why financial workflows are different

AI errors in a customer service chatbot are annoying. AI errors in financial workflows are material. A hallucinated invoice amount that passes through AP automation and hits a payment run can result in an overpayment that takes 60 days to recover — if it is caught at all. A misclassified GL code compounds into incorrect financial statements. A missed duplicate invoice creates a liability that surfaces in an audit.

The standard AI industry response to this is "humans in the loop." That is necessary but not sufficient. A human reviewer who is processing 200 invoices per day will not catch every error. The review interface, the alert thresholds, the escalation paths, and the fallback behavior when confidence is low all determine whether human oversight actually works or whether it is theater.

We have built financial automation systems for logistics companies, financial services providers, and government entities. What follows is every layer of defense we deploy in production. None of these are theoretical — all are running in live systems today.

5%: a correction rate sustained at this level for 7 days triggers automatic escalation to human review. This is the drift detection threshold that catches problems before they reach financial statements.

The seven layers

Structured output enforcement

Every LLM in a financial workflow is constrained to output JSON that conforms to a defined schema with explicit field types, required fields, and value constraints. A model that cannot produce valid structured output for a given input does not produce output at all; it escalates. We use Pydantic validation on every agent response. If the model outputs "approximately $1,200" instead of 1200.00, validation fails and the document goes to human review. Not to the next pipeline step.
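The check described above can be sketched without any dependencies. This is a minimal standard-library stand-in for the Pydantic validation mentioned in the text, not the production code; the field names, `SchemaError`, and the routing labels `"human_review"` and `"rules_engine"` are assumptions made for illustration.

```python
import json
from dataclasses import dataclass
from decimal import Decimal


class SchemaError(ValueError):
    """Raised when model output does not match the invoice schema."""


@dataclass(frozen=True)
class InvoiceExtraction:
    # Hypothetical schema; a real one would carry more fields.
    vendor_id: str
    invoice_number: str
    total: Decimal
    currency: str


REQUIRED_FIELDS = ("vendor_id", "invoice_number", "total", "currency")


def parse_extraction(raw: str) -> InvoiceExtraction:
    """Parse raw model output, raising SchemaError on any deviation."""
    try:
        # parse_float=Decimal keeps amounts exact instead of using binary floats.
        data = json.loads(raw, parse_float=Decimal)
    except json.JSONDecodeError as exc:
        raise SchemaError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise SchemaError("top-level value must be an object")
    missing = [name for name in REQUIRED_FIELDS if name not in data]
    if missing:
        raise SchemaError(f"missing required fields: {missing}")
    total = data["total"]
    # bool is a subclass of int, so it is excluded explicitly.
    if isinstance(total, bool) or not isinstance(total, (int, Decimal)):
        raise SchemaError(f"total must be a JSON number, got {total!r}")
    for name in ("vendor_id", "invoice_number", "currency"):
        if not isinstance(data[name], str) or not data[name]:
            raise SchemaError(f"{name} must be a non-empty string")
    return InvoiceExtraction(
        vendor_id=data["vendor_id"],
        invoice_number=data["invoice_number"],
        total=Decimal(total),
        currency=data["currency"],
    )


def handle_model_output(raw: str) -> str:
    """Return the next destination for the document."""
    try:
        parse_extraction(raw)
    except SchemaError:
        return "human_review"  # escalate instead of passing bad data on
    return "rules_engine"
```

With this sketch, a payload whose `total` is the string "approximately $1,200" is routed to `"human_review"`, while a payload with the number 1200.00 continues to the next step.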

Deterministic rules engine

After structured output passes schema validation, it runs through a deterministic rules engine that encodes financial business logic. Invoice amount must be positive. Tax rate must be within the legal range for the jurisdiction. Vendor ID must exist in the approved supplier master. Payment terms must match the contract on file. None of this uses LLM reasoning. These are hard-coded rules that run in milliseconds and catch the hallucinations that pass schema validation.
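A rules engine of this kind can be as simple as a list of pure functions. The sketch below implements the four rules named above; the `Invoice` shape, the lookup tables, and their contents are hypothetical placeholders for the supplier master, tax tables, and contract store a real deployment would query.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Invoice:
    vendor_id: str
    amount: Decimal
    tax_rate: Decimal
    jurisdiction: str
    payment_terms: str


# Illustrative reference data only (assumed values).
APPROVED_VENDORS = {"V-1001", "V-1002"}
TAX_RATE_RANGES = {"DE": (Decimal("0.00"), Decimal("0.19"))}
CONTRACT_TERMS = {"V-1001": "NET30", "V-1002": "NET45"}

# A rule returns None when it passes, or a violation message when it fails.
Rule = Callable[[Invoice], Optional[str]]


def amount_positive(inv: Invoice) -> Optional[str]:
    return None if inv.amount > 0 else "amount must be positive"


def tax_rate_in_range(inv: Invoice) -> Optional[str]:
    bounds = TAX_RATE_RANGES.get(inv.jurisdiction)
    if bounds is None:
        return f"unknown jurisdiction {inv.jurisdiction!r}"
    low, high = bounds
    if low <= inv.tax_rate <= high:
        return None
    return f"tax rate {inv.tax_rate} outside {low}..{high}"


def vendor_approved(inv: Invoice) -> Optional[str]:
    if inv.vendor_id in APPROVED_VENDORS:
        return None
    return f"vendor {inv.vendor_id!r} not in supplier master"


def terms_match_contract(inv: Invoice) -> Optional[str]:
    expected = CONTRACT_TERMS.get(inv.vendor_id)
    if expected is None:
        return "no contract on file for vendor"
    if inv.payment_terms != expected:
        return f"payment terms {inv.payment_terms!r} do not match {expected!r}"
    return None


RULES: List[Rule] = [
    amount_positive,
    tax_rate_in_range,
    vendor_approved,
    terms_match_contract,
]


def run_rules(inv: Invoice) -> List[str]:
    """Run every rule and collect all violations, not just the first."""
    violations = []
    for rule in RULES:
        message = rule(inv)
        if message is not None:
            violations.append(message)
    return violations
```

Collecting every violation rather than stopping at the first gives the reviewer the full picture in one pass; an empty list means the record may proceed.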

Confidence scoring with explicit thresholds

Every extracted field carries a confidence score. Fields below a configurable threshold — typically 0.85 for amounts, 0.92 for GL codes — are flagged for human confirmation before the record is created. High-confidence fields are auto-populated. Low-confidence fields are highlighted in the review interface with the source evidence so the reviewer can verify quickly. This is not binary pass/fail — it is a risk-proportional review load.
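As a rough illustration of per-field thresholds, the sketch below splits extracted fields into an auto-populate set and a review set. The 0.85 and 0.92 values come from the text above; `DEFAULT_THRESHOLD`, the `ExtractedField` shape, and the field names are assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Thresholds from the text; each is configurable per deployment.
THRESHOLDS = {"amount": 0.85, "gl_code": 0.92}
DEFAULT_THRESHOLD = 0.90  # assumed fallback for fields not listed above


@dataclass(frozen=True)
class ExtractedField:
    name: str
    value: str
    confidence: float
    source_snippet: str  # evidence shown to the reviewer


def split_for_review(
    fields: List[ExtractedField],
) -> Tuple[List[ExtractedField], List[ExtractedField]]:
    """Return (auto_populated, needs_review) according to per-field thresholds."""
    auto, review = [], []
    for field in fields:
        threshold = THRESHOLDS.get(field.name, DEFAULT_THRESHOLD)
        # Fields strictly below the threshold are flagged for confirmation.
        if field.confidence >= threshold:
            auto.append(field)
        else:
            review.append(field)
    return auto, review
```

An amount extracted at 0.91 would be auto-populated here, while a GL code at 0.88 would be highlighted for the reviewer together with its `source_snippet`.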

Dual-control for high-value actions

Any automated action above a configurable financial threshold — typically the same threshold used in manual approval workflows — requires a second human confirmation before execution. This is not a fallback for when AI fails. It is a permanent control, the same control that would exist in a manual workflow. Automation does not remove controls. It enforces them more consistently.
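A dual-control gate can be sketched in a few lines. The threshold value, the function name, and the rule that the confirmer must differ from the initiator are assumptions for illustration; the text only states that a second human confirmation is required above a configurable amount.

```python
from decimal import Decimal
from typing import Optional

# Hypothetical value; in practice this mirrors the manual approval threshold.
DUAL_CONTROL_THRESHOLD = Decimal("10000.00")


class ApprovalRequired(Exception):
    """Raised when an action cannot run without a valid second confirmation."""


def authorize_payment(
    amount: Decimal,
    initiated_by: str,
    confirmed_by: Optional[str] = None,
) -> bool:
    """Return True if the payment may execute, else raise ApprovalRequired."""
    if amount <= DUAL_CONTROL_THRESHOLD:
        return True  # at or below the threshold, no second confirmation needed
    if confirmed_by is None:
        raise ApprovalRequired("second human confirmation is required")
    if confirmed_by == initiated_by:
        # Assumed policy: the same person cannot fill both roles.
        raise ApprovalRequired("confirmer must differ from initiator")
    return True
```

Raising an exception rather than returning False makes it hard for calling code to ignore a missing confirmation by accident.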

Immutable audit trail

Every action taken by every agent in the pipeline is written to an append-only audit log before the action executes. The log records the agent, the input, the output, the confidence score, the timestamp, and the human actor if applicable. This log cannot be modified or deleted. If an error surfaces two weeks later, we can reconstruct the exact state of the system at the moment of the error, including which model version was running and what its inputs were.
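The log-before-execute pattern can be illustrated with a small in-memory sketch. The hash chain used here for tamper evidence is an assumed technique, not something the text specifies, and a real system would back the log with write-once storage. All class, function, and field names are hypothetical, and inputs and outputs are assumed to be JSON-serializable.

```python
import hashlib
import json
from datetime import datetime, timezone

GENESIS_HASH = "0" * 64


def _digest(body: dict) -> str:
    # sort_keys gives a stable serialization so the hash is reproducible.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


class AuditLog:
    """Append-only log; each entry commits to the hash of the previous one."""

    def __init__(self):
        self._entries = []

    def append(self, agent, model_version, input_data, output_data,
               confidence, human_actor=None):
        prev_hash = self._entries[-1]["hash"] if self._entries else GENESIS_HASH
        body = {
            "agent": agent,
            "model_version": model_version,
            "input": input_data,
            "output": output_data,
            "confidence": confidence,
            "human_actor": human_actor,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "prev_hash": prev_hash,
        }
        entry = dict(body, hash=_digest(body))
        self._entries.append(entry)
        return entry

    def entries(self):
        """Return copies so callers cannot edit the log through this method."""
        return tuple(dict(entry) for entry in self._entries)

    def verify(self) -> bool:
        """Recompute the chain; any edited or removed entry breaks it."""
        prev_hash = GENESIS_HASH
        for entry in self._entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if body["prev_hash"] != prev_hash or _digest(body) != entry["hash"]:
                return False
            prev_hash = entry["hash"]
        return True


def execute_with_audit(log, agent, model_version, input_data, output_data,
                       confidence, action):
    """Write the audit entry first, then run the action."""
    log.append(agent, model_version, input_data, output_data, confidence)
    return action()
```

Because the entry is written before `action()` runs, a crash mid-action still leaves a record of what was attempted, and `verify()` reports whether any stored entry has been altered since.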

Rollback at every execution step

Every ERP write in a financial pipeline is wrapped in a transaction that can be rolled back cleanly. If a pipeline completes steps 1–4 and fails at step 5, the previous four steps are reversed automatically. We have never shipped a financial automation system that does not have this property. Partial financial records are almost always worse than no record — they corrupt reconciliation and audit trails in ways that take days to untangle.
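The all-or-nothing behavior can be demonstrated with SQLite standing in for the ERP. This is only an analogy under stated assumptions: the table names and the `fail_at_payment` switch are invented for the example, and an ERP integration that lacks multi-step transactions may need explicit compensating writes instead.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with three hypothetical ERP tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE invoices (id TEXT PRIMARY KEY, amount_cents INTEGER)")
    conn.execute("CREATE TABLE ledger_entries (invoice_id TEXT, amount_cents INTEGER)")
    conn.execute("CREATE TABLE payments (invoice_id TEXT, amount_cents INTEGER)")
    conn.commit()
    return conn


def post_invoice(conn, invoice_id, amount_cents, fail_at_payment=False) -> bool:
    """Write all three records or none of them."""
    try:
        # Used as a context manager, the connection commits on success
        # and rolls back if the block raises.
        with conn:
            conn.execute("INSERT INTO invoices VALUES (?, ?)",
                         (invoice_id, amount_cents))
            conn.execute("INSERT INTO ledger_entries VALUES (?, ?)",
                         (invoice_id, amount_cents))
            if fail_at_payment:
                raise RuntimeError("simulated failure at the payment step")
            conn.execute("INSERT INTO payments VALUES (?, ?)",
                         (invoice_id, amount_cents))
    except RuntimeError:
        return False
    return True


def row_counts(conn):
    """Return the row count of each table, in a fixed order."""
    tables = ("invoices", "ledger_entries", "payments")
    return tuple(
        conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
        for table in tables
    )
```

After a simulated failure at the payment step, all three tables are still empty, so the same invoice can be retried without leaving a partial record behind.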

Production monitoring and drift detection

After go-live, every model in production is monitored for accuracy drift. We track the rate of human corrections to auto-populated fields week over week. If the correction rate rises above a threshold, typically 5% sustained for 7 days, the system automatically raises its confidence threshold, routing more documents to human review until the root cause is identified. Model drift is not a hypothetical. It is a production reality, and it needs to be caught before it reaches financial statements.
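One way to read "5% sustained for 7 days" is that the daily correction rate exceeds 5% on each of the last seven days; a rolling 7-day aggregate is another reasonable reading. The sketch below uses the first interpretation. The `DailyStats` shape and the `step` and `ceiling` parameters are assumptions, not values from the text.

```python
from dataclasses import dataclass
from typing import List

CORRECTION_RATE_LIMIT = 0.05  # 5%, from the text
SUSTAINED_DAYS = 7            # from the text


@dataclass(frozen=True)
class DailyStats:
    auto_populated: int  # fields filled in without human confirmation
    corrected: int       # of those, how many a human later changed


def correction_rate(day: DailyStats) -> float:
    """Share of auto-populated fields that were corrected on one day."""
    if day.auto_populated == 0:
        return 0.0
    return day.corrected / day.auto_populated


def drift_detected(history: List[DailyStats]) -> bool:
    """True if each of the most recent SUSTAINED_DAYS days exceeds the limit."""
    if len(history) < SUSTAINED_DAYS:
        return False
    recent = history[-SUSTAINED_DAYS:]
    return all(correction_rate(day) > CORRECTION_RATE_LIMIT for day in recent)


def adjusted_threshold(current: float, history: List[DailyStats],
                       step: float = 0.05, ceiling: float = 0.99) -> float:
    """Raise the confidence threshold while drift persists (assumed policy)."""
    if drift_detected(history):
        return min(current + step, ceiling)
    return current
```

Raising the threshold sends more fields to human confirmation, which trades reviewer time for safety while the root cause is investigated.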

The layer most teams skip

The seventh layer, production monitoring and drift detection, is the one we see skipped most often, including by teams that have done everything else correctly. The assumption is: if it works at launch, it will keep working. That assumption is wrong.

Models drift for reasons that have nothing to do with the model itself. Supplier invoice formats change. A new ERP upgrade changes field names. A new accounting period introduces categories that did not exist in the training data. A new supplier uses a layout the model has never seen. Any of these can degrade accuracy silently while the system continues to process documents and create records.

Watch out

The minimum viable monitoring setup is a 7-day rolling correction rate dashboard with a human-reviewed alert threshold. The companies that skip it discover problems during audits — not during weekly ops reviews.

The minimum viable monitoring setup: track the human correction rate for auto-populated fields on a 7-day rolling window. Set an alert threshold. Review the alert within 24 hours. This does not require sophisticated ML infrastructure — it requires a dashboard and a process. The companies that skip it are the ones who discover a problem during an audit rather than during a weekly ops review.

What this means for AI vendor selection

When evaluating an AI vendor for financial workflow automation, the accuracy number in the demo is the least important metric. Ask instead:

- What happens when the model is not confident? Is there a defined low-confidence path or does it guess?
- Is there an audit trail? Can I see exactly what the model extracted, what confidence score it had, and which human reviewed it?
- Can the system be rolled back if an error is discovered after the fact?
- How is production accuracy monitored? Who gets alerted and how fast when accuracy degrades?
- Has this system been used in a live financial audit? What did auditors ask for and were you able to provide it?

A vendor who cannot answer these questions has not shipped a financial automation system in production. Their demo is the most controlled condition their system will ever run in.

The accuracy number in a vendor demo is nearly useless. The questions that matter are about what happens when the model is wrong — and whether the system was designed to catch that before money moves.