Deployment Patterns for Enterprise AI That Don't Require a 3am Rollback
An AI workflow update deployed Tuesday night broke the invoice approval queue for 8 hours. Prompt fine. Model fine. A 200-token addition to the system prompt had pushed one document type over the 32K context limit for Claude 3 Haiku — and there was no detection. Documents routed silently to a dead state with no alert.
Nobody knew until the next morning when the approval queue was empty and three finance team members were asking why their documents had not moved overnight. Eight hours of silent failure. The system had no circuit breaker on token count. No monitoring on document routing. The deployment process had been: test locally, deploy to production, check in the morning. That process was wrong.
AI systems fail differently than deterministic software. They fail silently. They degrade instead of crashing. They fail on specific inputs that never appeared in testing. The deployment patterns that work for standard software are necessary but not sufficient for AI. These are the patterns we use now.
Why AI failure modes need different patterns
Deterministic software fails loudly. Exceptions are thrown, services return 500s, monitoring catches the spike. AI systems fail quietly. The API returns 200. The output looks like output. The document gets processed. The result is wrong. No alarm fires.
Three AI-specific failure modes slip past standard deployment monitoring. First, silent degradation: the model is processing inputs, but accuracy has dropped because a prompt change interacted badly with a document type you did not test. Second, context-sensitive failures: the pipeline works on 95% of documents and fails silently on a specific document type that happens to be longer or formatted differently. Third, prompt regression: a change that improves performance on the cases you tested breaks performance on cases you did not.
All three require detection at the application level, not the infrastructure level. No standard monitoring tool catches them automatically.
Blue/green deployment with behavioral monitoring
Blue/green deployment for AI means running two parallel pipeline environments, one on the current version and one on the new version, and routing a percentage of traffic to the new version while the previous version stays live. This is standard practice for web services. For AI pipelines, the difference is what you monitor during the comparison window.
Infrastructure metrics — latency, error rate, throughput — are not enough. You need behavioral metrics: correction rate per document type on both versions. If the new prompt version is producing a 6% correction rate and the old version is at 3%, the new version has a regression even if no errors are being thrown.
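A minimal sketch of that per-document-type comparison, in Python. The `Outcome` record, the "blue"/"green" labels, the `min_samples` guard, and the choice to reuse the 1.5x ratio from the canary section as the regression test are all assumptions made for illustration, not a description of our production code.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    version: str     # "blue" = current, "green" = candidate (illustrative labels)
    doc_type: str    # e.g. "invoice"
    corrected: bool  # True if a human had to correct the pipeline's output


def tally(outcomes):
    """Return {(version, doc_type): [corrected_count, total_count]}."""
    counts = defaultdict(lambda: [0, 0])
    for o in outcomes:
        entry = counts[(o.version, o.doc_type)]
        entry[0] += int(o.corrected)
        entry[1] += 1
    return counts


def find_regressions(outcomes, old="blue", new="green",
                     max_ratio=1.5, min_samples=50):
    """List (doc_type, old_rate, new_rate) where the new version's
    correction rate exceeds max_ratio times the old version's rate."""
    counts = tally(outcomes)
    flagged = []
    for (version, doc_type), (new_corrected, new_total) in counts.items():
        if version != new or new_total < min_samples:
            continue  # not the candidate, or too few samples to judge
        old_entry = counts.get((old, doc_type))
        if old_entry is None or old_entry[1] < min_samples:
            continue  # no usable baseline for this document type
        old_rate = old_entry[0] / old_entry[1]
        new_rate = new_corrected / new_total
        if new_rate > old_rate * max_ratio:
            flagged.append((doc_type, old_rate, new_rate))
    return flagged


# 3 corrections in 100 on the old version, 6 in 100 on the new one.
sample = (
    [Outcome("blue", "invoice", i < 3) for i in range(100)]
    + [Outcome("green", "invoice", i < 6) for i in range(100)]
)
print(find_regressions(sample))  # flags "invoice"
```

Grouping by document type matters here: an aggregate correction rate can look flat while one low-volume document type has doubled.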
The monitoring window we use: 48 hours at 10% traffic before cutover. Not 2 hours. Not 24. A 48-hour window is more likely than a shorter one to include document types that only appear on certain days, such as month-end invoices and weekly summary reports. It only catches them if the window actually overlaps those days, so schedule it accordingly.
The shadow run pattern
For high-stakes deployments — changes to the core extraction prompt, model version upgrades, significant system prompt additions — we use a shadow run before any traffic is routed to the new version.
The shadow run processes real incoming documents in parallel with the production version. The new version's outputs are logged but not acted on — nothing writes to the ERP, no approvals are triggered, no notifications go to users. We compare outputs between the production version and the shadow version across 200–500 real documents before the new version handles any live traffic.
Divergence rate above 8% between production and shadow triggers a review before proceeding. Not a hard block — sometimes the new version is genuinely better in ways that look like divergence. But "someone look at these 40 diverging outputs before we flip the switch" is a 30-minute review that has caught three regressions in the last year that would otherwise have hit production.
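The divergence check itself can be small. The sketch below assumes each output is a dict of extracted fields and that "divergence" means any field differs; real comparisons often need tolerance for formatting differences. The function name and the return shape are hypothetical.

```python
def divergence_report(pairs, threshold=0.08):
    """Compare production and shadow outputs for the same documents.

    pairs: list of (doc_id, production_output, shadow_output), where each
    output is a dict of extracted fields (an assumption for this sketch).
    """
    if not pairs:
        raise ValueError("no documents to compare")
    diverging = [doc_id for doc_id, prod, shadow in pairs if prod != shadow]
    rate = len(diverging) / len(pairs)
    return {
        "rate": rate,
        "diverging_ids": diverging,        # the outputs a reviewer should read
        "needs_review": rate > threshold,  # a review trigger, not a hard block
    }


# 500 shadowed documents, 45 of which differ on the extracted total.
pairs = [
    (f"doc-{i}", {"total": "100.00"}, {"total": "100.00" if i >= 45 else "10.00"})
    for i in range(500)
]
report = divergence_report(pairs)
print(report["rate"], report["needs_review"])
```

Returning the diverging document IDs rather than just the rate keeps the human review step cheap: the reviewer opens the list instead of hunting for examples.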
Circuit breakers specific to AI
Standard circuit breakers trip on error rates and latency. AI circuit breakers need to trip on additional conditions. Three we now build into every pipeline.
Token count monitoring before each LLM call. The system prompt plus document content plus conversation history is measured before the API call executes. If it exceeds 90% of the model's context limit, the document is routed to a fallback — either a model with a larger context window or a human review queue. It never silently fails. This would have prevented the 8-hour invoice queue outage.
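A sketch of that pre-call guard follows. `approx_tokens` is a crude stand-in (roughly four characters per token) used only to keep the example self-contained; a real pipeline should count with the model provider's tokenizer. The route names and the `fallback_limit` parameter are hypothetical.

```python
def approx_tokens(text):
    """Rough stand-in for a real tokenizer: about 4 characters per token."""
    return (len(text) + 3) // 4


def route_by_token_budget(parts, context_limit, fallback_limit=None,
                          headroom=0.90, count_tokens=approx_tokens):
    """Decide where a request goes before any LLM call is made.

    parts: the strings that will be sent (system prompt, document,
    conversation history). Returns (route, tokens_used).
    """
    used = sum(count_tokens(p) for p in parts)
    if used <= headroom * context_limit:
        return "primary", used
    if fallback_limit is not None and used <= headroom * fallback_limit:
        return "large_context_fallback", used
    return "human_review", used  # never drop the document silently


system_prompt = "s" * 4_000    # about 1,000 tokens under the stand-in counter
long_document = "d" * 120_000  # about 30,000 tokens
print(route_by_token_budget([system_prompt, long_document], context_limit=32_000))
```

The important property is that every branch returns an explicit route. There is no path where an oversized request proceeds to the API and fails without anyone being told.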
Confidence score distribution monitoring. In normal operation, the distribution of confidence scores across a document type has a shape — most fields extract at high confidence, some fields consistently extract at lower confidence. When that distribution shifts materially — mean confidence drops by more than 0.05 over a 4-hour window — it signals that the model is encountering input it has not seen before. The circuit breaker increases the confidence threshold, routing more documents to human review until the shift is understood.
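As a rough illustration, the shift check can compare the mean confidence in the latest 4-hour window against a baseline for the same document type. The 0.05 drop comes from the text above; the 0.80 and 0.90 review thresholds, and the use of a simple mean rather than a fuller distribution test, are assumptions for this sketch.

```python
from statistics import fmean


def review_threshold(baseline_scores, window_scores, base_threshold=0.80,
                     raised_threshold=0.90, max_drop=0.05):
    """Return (threshold, drop) for one document type.

    Fields extracted below the returned threshold go to human review.
    The threshold is raised when mean confidence in the current window
    has dropped by more than max_drop relative to the baseline.
    """
    if not baseline_scores or not window_scores:
        raise ValueError("need scores for both baseline and current window")
    drop = fmean(baseline_scores) - fmean(window_scores)
    threshold = raised_threshold if drop > max_drop else base_threshold
    return threshold, drop


baseline = [0.95, 0.90, 0.85]    # mean 0.90
last_4_hours = [0.85, 0.80, 0.80]
print(review_threshold(baseline, last_4_hours))  # threshold raised to 0.90
```

A mean can hide a shift in shape, for example a new cluster of very low scores offset by a few high ones, so a production version might also track a low percentile per field.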
Error rate auto-fallback. If API error rate exceeds 5% over a 15-minute window, the system automatically falls back to the previous prompt version and pages the on-call engineer. Not the on-call engineer at 3am reading a dashboard. An automated fallback that restores service, then a page.
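One way to sketch this breaker is a sliding window over recent API calls. The class name, the `rollback` and `page` callbacks, and the `min_calls` guard (so that one failure out of two calls does not trip it) are hypothetical; the 5% and 15-minute values are the ones described above.

```python
from collections import deque


class ErrorRateBreaker:
    """Trips when the API error rate over a sliding window exceeds a limit."""

    def __init__(self, rollback, page, window_seconds=15 * 60,
                 max_error_rate=0.05, min_calls=20):
        self.rollback = rollback  # restores the previous prompt version
        self.page = page          # notifies the on-call engineer
        self.window_seconds = window_seconds
        self.max_error_rate = max_error_rate
        self.min_calls = min_calls
        self.events = deque()     # (timestamp, is_error), oldest first
        self.tripped = False

    def record(self, timestamp, is_error):
        """Record one API call result; return True if the breaker is tripped."""
        self.events.append((timestamp, is_error))
        cutoff = timestamp - self.window_seconds
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()  # drop calls that fell out of the window
        total = len(self.events)
        errors = sum(1 for _, failed in self.events if failed)
        over_limit = total >= self.min_calls and errors / total > self.max_error_rate
        if over_limit and not self.tripped:
            self.tripped = True
            self.rollback()  # restore service first
            self.page()      # then wake someone up
        return self.tripped


actions = []
breaker = ErrorRateBreaker(rollback=lambda: actions.append("rollback"),
                           page=lambda: actions.append("page"))
for second in range(100):
    breaker.record(float(second), is_error=(second % 10 == 0))  # 10% errors
print(breaker.tripped, actions)
```

The breaker stays tripped until something resets it. That is deliberate in this sketch: returning to the new prompt version should be a human decision made after the page, not something the system does on its own once the error rate recovers.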
Canary deployment thresholds for AI
Canary deployments for AI follow the same infrastructure pattern as standard software — route a small percentage of traffic to the new version, expand if metrics hold — but the promotion criteria are different.
The thresholds we use: start at 5% traffic for 2 hours. If correction rate on the canary is not more than 1.5x the production correction rate, expand to 25% for 4 hours. Apply the same check. If it holds, move to 100%. If at any step the correction rate on the canary exceeds the threshold, roll back automatically and review.
The 1.5x threshold sounds lenient. It is not. A production system at 3% correction rate has a canary threshold of 4.5%. If the new prompt version is at 5%, it fails. That 2-point difference represents real errors that real users would need to correct, at scale, after full rollout.
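The promotion rule is simple enough to write down directly. This sketch encodes the stages and the 1.5x check described above; the function name and the tuple return values are illustrative only.

```python
# (traffic share, hours to hold) for each canary stage before full rollout
CANARY_STAGES = [(0.05, 2), (0.25, 4)]


def canary_decision(stage_index, canary_rate, production_rate, max_ratio=1.5):
    """Decide what happens at the end of a canary stage.

    Returns ("rollback", None), ("expand", next_traffic_share),
    or ("promote", 1.0).
    """
    threshold = production_rate * max_ratio
    if canary_rate > threshold:
        return ("rollback", None)
    if stage_index + 1 < len(CANARY_STAGES):
        next_share, _hours = CANARY_STAGES[stage_index + 1]
        return ("expand", next_share)
    return ("promote", 1.0)


# Production at 3% gives a 4.5% threshold, so a canary at 5% fails.
print(canary_decision(0, canary_rate=0.05, production_rate=0.03))
# A canary at 4% passes the first stage and moves to 25% traffic.
print(canary_decision(0, canary_rate=0.04, production_rate=0.03))
```

Because the threshold is a ratio, it tightens in absolute terms as the production pipeline improves: at a 1% production correction rate, the canary fails above 1.5%.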
Watch out
The most dangerous AI deployments are the ones where nobody defined what a failure looks like. If you cannot answer "how will we know within 30 minutes that this deployment is broken?", you are not ready to deploy. Define the signal, build the alert, test that it fires — then ship.