A Post-Mortem We Can Almost Share (Anonymized)

What the system was supposed to do

The client was a regional distribution company with multiple entities, two currencies, and procurement spend across roughly 180 active suppliers. Their AP team was processing about 400 invoices per week manually in Business Central, with a standard three-tier approval matrix: invoices under $5K were auto-approved by department heads, invoices from $5K to $50K required finance manager sign-off, and invoices above $50K went to the CFO.

The automation we built extracted invoice data from PDFs using a fine-tuned extraction model, validated fields against the supplier master and PO register, and then routed invoices to the appropriate approval tier based on the extracted amount. The approval_tier classification was handled by a separate routing agent that read the extracted amount and the supplier's contracted approval matrix.
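To make the routing step concrete, here is a minimal sketch of a deterministic tier lookup under the standard matrix described above. The names `ApprovalMatrix` and `approval_tier` are hypothetical, and the per-supplier matrix is modelled as nothing more than two thresholds, which is an assumption rather than the actual design.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class ApprovalMatrix:
    # Defaults are the standard thresholds quoted in the text. A supplier with
    # a contracted matrix would carry its own values (hypothetical structure).
    department_head_below: Decimal = Decimal("5000")
    cfo_above: Decimal = Decimal("50000")


def approval_tier(amount: Decimal, matrix: ApprovalMatrix = ApprovalMatrix()) -> str:
    """Map an extracted invoice amount to an approval tier."""
    if amount <= 0:
        raise ValueError("amount must be positive")
    if amount < matrix.department_head_below:
        return "department_head"
    if amount <= matrix.cfo_above:
        return "finance_manager"
    return "cfo"
```

A function like this is only as good as the number it is handed: it has no way to tell a correct gross amount from a misread subtotal.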

We ran two weeks of parallel testing. Accuracy on the extraction layer was 97.3% across 800 test invoices. The routing logic was deterministic. We went live with the extraction layer in phase one, keeping human approval routing in place. Phase two — the routing automation — went live on day one of month two.

~$180K queued incorrectly before the anomaly was caught; the payment run was stopped before any funds transferred

What went wrong

On day 11 post-phase-two go-live, a finance manager noticed that several invoices from a specific supplier — one that had historically required CFO-level approval due to a custom contractual requirement — were routing to the department head tier. She flagged it. We pulled logs.

The supplier had changed their invoice layout. A redesign of their invoice template moved the invoice total to a different position in the document. The extraction model was reading a subtotal line — the pre-tax amount — rather than the gross amount. For invoices where VAT was significant, this pushed the extracted amount below the $50K CFO threshold when the actual gross amount was above it.

The routing agent had no visibility into the extraction confidence score for the amount field. It received a number, it was a valid number, and it routed accordingly. The extraction model had flagged the amount field at 0.71 confidence — below our 0.85 threshold — but the routing agent was a separate service and did not receive the confidence payload. We had wired the two agents together without passing confidence metadata between them.

Watch out

The extraction layer and the routing layer were tested in isolation. The integration between them — specifically what happened to confidence scores when they crossed service boundaries — was never tested end-to-end. That gap is where the failure lived.

Twenty-three invoices from this supplier had routed incorrectly. Nine had been approved by department heads who had no reason to question them, since the amounts looked normal for that category. The remaining fourteen were in the queue for the next payment run, which ran weekly on Fridays. This happened on a Tuesday. We had three days.

What we had missed in review

Looking back at our review process for this project, four things were missing.

Confidence score propagation was never reviewed. We reviewed the extraction service in isolation. We reviewed the routing service in isolation. Nobody reviewed the API contract between them and asked: does the routing service receive confidence scores for fields it depends on? It did not. This is a review gap that a standard API contract review would have caught if the reviewer knew to ask for it.
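One way to picture a contract that would have closed this gap: the extracted amount travels together with its confidence, and the consumer gates on it. This is a sketch only. `ExtractedField` and `route_or_hold` are hypothetical names, and the only figure taken from the incident is the 0.85 threshold.

```python
from dataclasses import dataclass
from decimal import Decimal

MIN_CONFIDENCE = 0.85  # the threshold quoted in this post-mortem


@dataclass(frozen=True)
class ExtractedField:
    """A value plus the extraction model's confidence in it (hypothetical shape)."""
    value: Decimal
    confidence: float  # expected to lie in [0, 1]


def route_or_hold(amount: ExtractedField) -> str:
    """Send low-confidence amounts to a person instead of routing them."""
    if not 0.0 <= amount.confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if amount.confidence < MIN_CONFIDENCE:
        return "human_review"
    return "auto_route"
```

With this shape, an amount flagged at 0.71 returns `"human_review"` instead of being routed on its face value.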

No supplier layout change detection. We assumed supplier invoice layouts were stable. They are not. We should have built a checksum comparison for known supplier layouts and triggered a human review flag when a document from a known supplier deviated from its expected structure. We had discussed this in design. We cut it from scope.
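A rough sketch of what such a comparison could look like, assuming the extractor can report each labelled field with page-normalized coordinates. That input format, the grid size, and the function names are all assumptions for illustration. Positions are bucketed into a coarse grid so that scan jitter does not change the fingerprint, while a field moving across the page does.

```python
import hashlib


def layout_fingerprint(fields, grid=20):
    """Hash the coarse positions of labelled fields.

    fields: iterable of (label, x, y), with x and y normalized to [0, 1].
    """
    cells = sorted(
        (label.strip().lower(), int(x * grid), int(y * grid))
        for label, x, y in fields
    )
    canonical = "|".join(f"{label}:{cx}:{cy}" for label, cx, cy in cells)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def layout_changed(known_fingerprint, fields, grid=20):
    """True when a document's structure no longer matches the stored fingerprint."""
    return layout_fingerprint(fields, grid) != known_fingerprint
```

Grid bucketing is crude: a field sitting near a cell boundary can flip the hash on a small shift, so a production version would likely need a tolerance-based comparison rather than exact hash equality.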

Amount reasonableness check was absent. We had a rules engine that validated amounts were positive and within schema. We did not have a rule that compared the extracted amount against the PO amount on file. This supplier had a PO for the full gross amount. A simple check — extracted amount differs from PO amount by more than 15% — would have caught every one of the misrouted invoices.
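The rule itself is only a few lines. The sketch below assumes both amounts are in the same currency and treats the tolerance as a parameter; the function name is hypothetical, and the amounts in the usage note are illustrative, not taken from the incident.

```python
from decimal import Decimal


def deviates_from_po(
    extracted: Decimal,
    po_amount: Decimal,
    tolerance: Decimal = Decimal("0.15"),
) -> bool:
    """True when the extracted amount differs from the PO amount by more than
    `tolerance`, expressed as a fraction of the PO amount."""
    if po_amount <= 0:
        raise ValueError("po_amount must be positive")
    return abs(extracted - po_amount) / po_amount > tolerance
```

For example, with a PO for 60,000 gross and an extracted pre-tax subtotal of 50,000, the deviation is about 16.7% and the check fires. An extracted 59,000 against the same PO stays inside the tolerance.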

No monitoring alert on approval tier distribution. The proportion of invoices routing to each approval tier should be approximately stable week over week. A spike in department-head approvals for a supplier that historically hit CFO approval should have triggered an alert. We did not have that dashboard.
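A sketch of the kind of check that could sit behind such an alert. It compares each tier's share of invoices against a baseline and reports tiers that moved by more than a tolerance. Interpreting the tolerance as a relative change in share is an assumption, as are the function names. The same functions could be run per supplier rather than across all invoices.

```python
from collections import Counter


def tier_shares(tiers):
    """Fraction of invoices routed to each tier, from a list of tier names."""
    counts = Counter(tiers)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tier: n / total for tier, n in counts.items()}


def drifted_tiers(baseline, current, tolerance=0.20):
    """Tiers whose share changed by more than `tolerance` relative to baseline."""
    alerts = []
    for tier in set(baseline) | set(current):
        base = baseline.get(tier, 0.0)
        now = current.get(tier, 0.0)
        if base == 0.0:
            # A tier that never appeared before is always worth a look.
            if now > 0.0:
                alerts.append(tier)
        elif abs(now - base) / base > tolerance:
            alerts.append(tier)
    return sorted(alerts)
```

A relative measure is sensitive for small tiers, which suits this failure mode: CFO-level approvals are rare, so a handful of misrouted invoices moves that share a long way.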

Every one of these gaps was catchable in review. We simply did not have a review checklist that included AI-specific failure patterns. We do now.

What we changed

We reversed the nine approved invoices, none of which had yet been paid. The client's finance manager sent hold notifications to three suppliers. No payments went out incorrectly. Total financial impact: zero. Operational impact: two days of remediation work and a difficult conversation with the client that we deserved to have.

The changes we made to our process and to the system:

- Confidence scores are now part of every inter-service contract. Any downstream agent that acts on an extracted field must receive and gate on that field's confidence score.
- Supplier layout change detection was added. Known supplier documents are fingerprinted on structural layout. A deviation flags the document for human review before extraction even runs.
- PO amount comparison is now a standard validation rule for all invoiced items that have a corresponding PO. The threshold is ±12%, tight enough to catch both extraction errors and genuine discrepancies.
- Approval tier distribution is now a monitored metric with a weekly baseline and a ±20% deviation alert.
- Integration testing now explicitly covers confidence score propagation across every service boundary in the pipeline.
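As an illustration of the last item, a boundary test can push a low-confidence amount through the serialized payload and assert that the consumer holds it. The two service functions below are stand-ins written for this sketch, not the real services, and the payload shape and invoice values are assumptions.

```python
import json


def extraction_service(document: dict) -> str:
    """Stand-in producer: returns the JSON payload it would send downstream."""
    return json.dumps({
        "invoice_id": document["invoice_id"],
        "amount": {
            "value": document["amount"],
            "confidence": document["confidence"],
        },
    })


def routing_service(payload: str, min_confidence: float = 0.85) -> str:
    """Stand-in consumer: refuses amounts that arrive without confidence."""
    amount = json.loads(payload)["amount"]
    if "confidence" not in amount:
        raise ValueError("amount arrived without a confidence score")
    if amount["confidence"] < min_confidence:
        return "human_review"
    return "auto_route"


def test_low_confidence_amount_is_held():
    payload = extraction_service(
        {"invoice_id": "INV-TEST-1", "amount": "41000.00", "confidence": 0.71}
    )
    assert routing_service(payload) == "human_review"


def test_missing_confidence_is_rejected():
    stripped = json.dumps(
        {"invoice_id": "INV-TEST-2", "amount": {"value": "41000.00"}}
    )
    try:
        routing_service(stripped)
    except ValueError:
        return
    raise AssertionError("routing accepted an amount without confidence")
```

The second test matters as much as the first: it fails loudly if a future change to the payload drops the confidence field, which is the silent failure described in this post-mortem.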

Our take

The core lesson is not that the AI made a mistake — extraction models fail on layout changes and that is expected. The lesson is that the system was not designed to surface that failure before it had consequences. Build for graceful degradation, not just happy-path accuracy.