5 Enterprise AI Failures We Have Seen Up Close (And What Caused Each One)

Failure mode 1: the confidence collapse

A document extraction system ran at 91% field-level accuracy for six weeks of staging and early production. On day 44, accuracy dropped to 58% and stayed there. Nothing in the model changed. No deployment happened. The ERP schema had not changed.

The cause: the system went live in October. The training and testing data was from August and September — a period when the client's largest supplier was using a standard invoice template. In November, that supplier switched to a redesigned template for their new fiscal year. 34% of all invoice volume came from this one supplier. The model had never seen the new layout.

This is the confidence collapse: a system works well in staging because staging data looks like training data, then fails in production because production data drifts into cases the training set never represented. The failure is particularly insidious because confidence scores often stay high: the model is confident about the wrong answer.

Prevention: Build your evaluation set with data from as many distinct time periods as you can access. If your system processes external documents, specifically include documents from before and after supplier or vendor template changes. And track accuracy per supplier, so that a single supplier's template change is noticed before it brings the whole system down.
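A minimal sketch of what supplier-level accuracy tracking could look like. Everything named here is an assumption for illustration: the FieldResult record, the supplier_id field, and the max_drop and min_samples defaults are hypothetical, and it presumes you already have reviewed results that say whether each extracted field was correct.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class FieldResult:
    supplier_id: str  # hypothetical identifier for the document's sender
    correct: bool     # whether the extracted field matched the reviewed value


def accuracy_by_supplier(results):
    """Return {supplier_id: (accuracy, sample_count)} for a batch of results."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in results:
        totals[r.supplier_id] += 1
        hits[r.supplier_id] += int(r.correct)
    return {s: (hits[s] / totals[s], totals[s]) for s in totals}


def suppliers_to_review(results, baseline, max_drop=0.10, min_samples=50):
    """List suppliers whose accuracy fell more than max_drop below baseline.

    Suppliers with fewer than min_samples results are skipped, because a
    handful of documents is too noisy to alert on.
    """
    flagged = []
    for supplier, (acc, n) in accuracy_by_supplier(results).items():
        if n >= min_samples and baseline - acc > max_drop:
            flagged.append(supplier)
    return sorted(flagged)
```

Run over a rolling window of reviewed documents with baseline=0.91, a check like this would point at the one supplier whose template changed rather than leaving you to explain an aggregate drop.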

Failure mode 2: the scope creep spiral

A project that started as invoice processing automation ended, 14 months later, as an attempted complete procurement management system. We watched this one from the inside as the implementation partner.

Month 2: "Can it also check that the vendor is on the approved list?" Yes, straightforward. Month 4: "Can it route to the right approver based on vendor category?" Adds some complexity, but doable. Month 7: "Can it handle PO creation, not just invoice processing?" Different system. Month 10: "Can it manage vendor onboarding too?" Different project. Month 14: "Can it basically run procurement?"

Each step felt incremental. The cumulative effect was a system that tried to do everything and did nothing reliably. The invoice processing accuracy — the original goal — had degraded from 91% to 74% because every scope addition required prompt changes that affected performance on the core task.

Prevention: Scope boundaries need to be written, not verbal. A written agreement that specifies exactly what the system does and does not do, signed by the person who will later ask "can it also do X." Every scope addition gets a new estimate and a go/no-go decision — not a quiet yes in a weekly check-in.
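One way to make the go/no-go decision concrete is to record each request and refuse it by default until it has an estimate, a written sign-off, and a measured effect on the core task. This is a sketch under assumptions: the ScopeChangeRequest fields and the max_core_regression default are hypothetical, and it presumes you can re-run the invoice processing evaluation with the proposed change applied.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ScopeChangeRequest:
    description: str
    estimate_days: Optional[float]  # None means nobody has estimated it yet
    approved_by: Optional[str]      # None means there is no written sign-off
    core_accuracy_before: float     # core-task accuracy on the eval set today
    core_accuracy_after: float      # same eval set, with the change applied


def decide(request, max_core_regression=0.02):
    """Return ("go" or "no-go", reason) for one proposed scope addition."""
    if request.estimate_days is None:
        return "no-go", "no estimate"
    if request.approved_by is None:
        return "no-go", "no written sign-off"
    drop = request.core_accuracy_before - request.core_accuracy_after
    if drop > max_core_regression:
        return "no-go", f"core task accuracy drops by {drop:.1%}"
    return "go", "within agreed limits"
```

The point of the sketch is the default: a request with no estimate or no named approver cannot come back as "go", which is the opposite of a quiet yes in a weekly check-in.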

74% invoice processing accuracy at month 14, down from 91% at launch: fourteen months of scope additions degraded core task performance

Failure mode 3: the reversibility trap

A procurement automation system was designed to auto-approve purchase orders up to ~$8,000 when confidence was above 0.92. The design document described human override capabilities. Nobody designed what that override actually looked like.

When the first wrong auto-approval happened — a $7,400 PO for the wrong item category, sent to the wrong supplier — the operations team could not reverse it. The PO had been sent to the supplier's system via API. The ERP showed it as confirmed. The finance system had already reserved the budget. The undo path required manual intervention across three systems and a phone call to the supplier. It took 6 hours to resolve what should have been a one-click cancellation.

The gap: the system was designed for the happy path, while the error path was described at a high level with no engineering work behind it. "Humans can override" is not a design. A specific interface, a specific reversibility window, and a specific compensation action for each write is a design.

Prevention: For every action the system can take, design the undo before you design the action. Not as an afterthought, not as a post-go-live improvement. The compensation path gets built first. This also forces useful conversations about which actions actually can be undone — some cannot, and those need different handling.
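A sketch of "design the undo first" as a code-level rule: an action cannot be registered without a compensation function unless someone explicitly marks it irreversible, and irreversible actions are never auto-approved. The ReversibleAction and ActionRegistry names are hypothetical; the amount and confidence defaults are taken from the example above, and the real compensation functions would be the cancellation calls for each downstream system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, Optional


@dataclass(frozen=True)
class ReversibleAction:
    name: str
    execute: Callable[[dict], None]
    compensate: Optional[Callable[[dict], None]]  # None means it cannot be undone
    undo_window: timedelta                        # how long the undo stays valid


class ActionRegistry:
    def __init__(self):
        self._actions = {}

    def register(self, action, allow_irreversible=False):
        """Refuse actions without an undo unless explicitly acknowledged."""
        if action.compensate is None and not allow_irreversible:
            raise ValueError(f"{action.name}: no compensation action defined")
        self._actions[action.name] = action

    def can_undo(self, name, executed_at, now=None):
        """True if the action has an undo and is still inside its window."""
        action = self._actions[name]
        now = now or datetime.now(timezone.utc)
        return (action.compensate is not None
                and now - executed_at <= action.undo_window)

    def can_auto_approve(self, name, amount, confidence,
                         max_amount=8000.0, min_confidence=0.92):
        """Irreversible actions always go to a human, whatever the confidence."""
        action = self._actions[name]
        return (action.compensate is not None
                and amount <= max_amount
                and confidence > min_confidence)
```

Registering an action with compensate=None and allow_irreversible=True is the "different handling" case: it is allowed to exist, but it never qualifies for auto-approval.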

Watch out

If the undo path for any AI-driven action requires a phone call, you have not designed undo. You have described damage control.

Failure mode 4: the adoption cliff

This is the most common failure mode we see and the most preventable. The system works. The AI is accurate. The technical implementation is solid. And six months after go-live, fewer than 20% of users are using it.

On one project we were brought in to diagnose post-launch, user adoption had dropped from 85% at launch to 22% at month five. The system was a document triage tool — AI pre-classified incoming documents, humans confirmed or corrected the classification. The accuracy was 87%. The problem: the confirmation interface required 4 clicks to agree with the AI and 7 clicks to disagree. People found it faster to classify everything manually and stopped using the AI suggestions.

The system was technically correct. It was operationally slower than the old way. Users are rational — they will abandon a tool that makes their job harder, regardless of its accuracy.

Prevention: Measure time-per-task before and after deployment, not just accuracy. If the system is slower than the manual process for the median user, it will be abandoned — even if it is more accurate. Build the user workflow around what people actually do under time pressure, not what they should ideally do.
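The arithmetic behind this is simple enough to sketch. The first function estimates the average number of clicks per document from the accuracy and the click counts of the confirmation interface; the second compares median task times. Both function names are hypothetical, and the timing samples are assumed to come from your own before-and-after observation of users.

```python
from statistics import median


def expected_clicks(accuracy, clicks_to_agree, clicks_to_disagree):
    """Average clicks per document when users confirm or correct every suggestion."""
    return accuracy * clicks_to_agree + (1 - accuracy) * clicks_to_disagree


def slower_for_median_user(manual_seconds, assisted_seconds):
    """True when the median assisted task takes longer than the median manual task."""
    return median(assisted_seconds) > median(manual_seconds)
```

With the numbers from the triage tool, expected_clicks(0.87, 4, 7) comes to about 4.39 clicks per document. If classifying a document by hand takes fewer clicks than that, the AI suggestions add work no matter how accurate they are.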

Failure mode 5: the vendor dependency trap

A client built an AI system in 2022 on GPT-3.5 Turbo. The system was well-built: good prompts, robust evaluation, monitoring in place. In January 2024, OpenAI deprecated GPT-3.5 Turbo legacy. The system had hardcoded model identifiers, prompt engineering tuned specifically for GPT-3.5 Turbo behaviors, and evaluation thresholds calibrated to that model's confidence scoring patterns.

Migrating to GPT-4o — the recommended replacement — was not a drop-in substitution. GPT-4o processes the same prompts differently. Response formatting changed. Confidence score distributions shifted. Evaluation thresholds that worked for GPT-3.5 Turbo flagged too many items for human review on GPT-4o. The migration required six weeks of re-tuning and re-validation. The client had budgeted zero weeks.

Prevention: Treat the underlying model as a dependency that will change. Abstract it behind a model configuration layer. Maintain your evaluation set so that testing a new model is a matter of running the eval, not re-discovering what your thresholds should be. Build model migration into your maintenance budget — not as a known amount, but as an acknowledged cost category.
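A minimal sketch of a model configuration layer plus an eval-driven threshold, under stated assumptions: ModelConfig, calibrate_threshold, and migrate are hypothetical names, and eval_results is assumed to be a list of (score, is_correct) pairs produced by running the new model over your maintained evaluation set. It does not call any vendor API.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ModelConfig:
    model_id: str            # the only place a model identifier should appear
    prompt_version: str      # prompts are versioned with the model they were tuned for
    review_threshold: float  # scores below this are routed to human review


def calibrate_threshold(eval_results, target_precision):
    """Return the lowest score cutoff whose accepted items meet target_precision.

    Accepted items are those scoring at or above the cutoff. Returns None if
    no cutoff reaches the target on this evaluation set.
    """
    ranked = sorted(eval_results, key=lambda r: r[0], reverse=True)
    best = None
    correct = 0
    for i, (score, ok) in enumerate(ranked, start=1):
        correct += int(ok)
        # Items with equal scores are accepted together, so only evaluate
        # precision once the whole tie group has been counted.
        if i < len(ranked) and ranked[i][0] == score:
            continue
        if correct / i >= target_precision:
            best = score
    return best


def migrate(config, new_model_id, new_prompt_version, eval_results, target_precision):
    """Build the config for a new model from a fresh eval run."""
    threshold = calibrate_threshold(eval_results, target_precision)
    if threshold is None:
        raise ValueError("no cutoff meets the target precision on this eval set")
    return replace(config, model_id=new_model_id,
                   prompt_version=new_prompt_version,
                   review_threshold=threshold)
```

The design choice is that the threshold is derived from the eval run rather than carried over, so a shifted score distribution shows up as a different number in the config instead of a flood of items in the review queue.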

Every AI system built on a specific model version is carrying a deprecation liability. The question is not whether the underlying model will change; it will. The question is whether you will find out 6 weeks before deprecation or 6 months before.

The one we consider most preventable

Failure mode 4 — the adoption cliff — is the one that frustrates us most because it is almost entirely avoidable with five days of user research before building. Sit with the people who will use the system. Watch them do the task the system is supposed to help with. Count their clicks. Ask what makes their job slow. Build the AI workflow around actual behavior, not idealized behavior.

The other failure modes require architectural decisions and ongoing discipline to prevent. The adoption cliff just requires watching someone do their job before you design the interface they will use.

We have seen multi-million dirham AI systems sit unused because the confirmation UI was too slow. The waste is not the AI cost — it is the organizational cost of a system people have stopped trusting. That trust, once lost, is very hard to rebuild.