Production Monitoring for LLM Applications: What to Actually Track
Three weeks after go-live on a document extraction pipeline, a client mentioned that the finance team had been silently correcting output fields every morning. Not reporting errors. Just fixing them and moving on. We had no alerting on correction rate. We had response time dashboards, error rate dashboards, throughput dashboards. We had no signal that the model's field-level accuracy had drifted from 94% to 71% over three weeks.
The pipeline was technically healthy. Uptime at 99.8%. P99 latency under 600ms. Zero unhandled exceptions. Every infrastructure metric looked exactly as it should. The model was returning output. The schema validation was passing. Documents were routing through. The workflow was continuing. Nearly a third of the extracted fields were wrong, and four finance team members had quietly built a morning correction routine around it.
That incident changed how we think about LLM monitoring. Infrastructure monitoring tells you whether the pipes are working. It says nothing about whether the water is clean. Accuracy monitoring is a completely different category, and most teams treat it as optional.
Why infrastructure metrics are not enough
P99 latency tells you the API is responding. It does not tell you whether the response is right. A pipeline with 99.9% uptime and 400ms response time can simultaneously be producing wrong outputs on 30% of documents. The failure mode is silent — documents route through, schema validation passes, the workflow continues, and the error only surfaces when someone downstream acts on wrong data.
Standard observability stacks — Datadog, Grafana, CloudWatch — are built for deterministic systems. They track exceptions, response times, error codes, resource utilization. An LLM returning a plausible-looking wrong answer registers as a success across all of these. There is no "wrong answer exception." The failure is invisible to every layer of infrastructure monitoring.
The implication: you need a monitoring layer that operates at the output level, not the infrastructure level. And the most important signal at that layer is one that requires you to log human behavior, not system behavior.
The correction rate metric
The most important signal in any AI pipeline is the human correction rate: the percentage of AI-populated fields that a human changed within 24 hours of the AI populating them. This metric is a direct measurement of output quality. Not a proxy. Not an inference. A direct measure of whether the model's output was accepted by the humans who actually know whether it was right.
Building it requires that you log both the AI's initial population and any subsequent human edits. Most systems log the final state. The field shows the corrected value. The AI's original value is gone. To measure correction rate, you need to log the delta: what the model wrote, and what a human subsequently changed it to.
Track it by field, by document type, and by week. A 3% weekly correction rate on invoice amount extraction indicates a healthy system. The same field climbing from 3% to 8% over four weeks without an alert is how AI pipelines become expensive manual correction processes.
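A minimal sketch of the calculation, assuming one log record per AI-populated field that keeps both the model's original value and any later human edit. The `FieldEvent` record and its field names are hypothetical, not taken from any particular logging system.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class FieldEvent:
    # Hypothetical log record: one row per AI-populated field.
    doc_type: str
    field: str
    ai_value: str                       # what the model originally wrote
    populated_at: datetime
    human_value: Optional[str] = None   # None if nobody edited the field
    edited_at: Optional[datetime] = None


def was_corrected(ev, window=timedelta(hours=24)):
    """True if a human changed the AI value within `window` of population."""
    if ev.human_value is None or ev.edited_at is None:
        return False
    changed = ev.human_value != ev.ai_value
    in_window = timedelta(0) <= ev.edited_at - ev.populated_at <= window
    return changed and in_window


def correction_rates(events):
    """Return {(doc_type, field, iso_year, iso_week): correction_rate}."""
    totals = defaultdict(int)
    corrected = defaultdict(int)
    for ev in events:
        iso = ev.populated_at.isocalendar()
        key = (ev.doc_type, ev.field, iso[0], iso[1])
        totals[key] += 1
        if was_corrected(ev):
            corrected[key] += 1
    return {key: corrected[key] / totals[key] for key in totals}
```

Edits that land outside the 24-hour window, or that leave the value unchanged, are not counted as corrections here. Whether that is the right rule depends on how your reviewers work.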
Confidence score distribution monitoring
A well-calibrated model produces confidence scores that track actual accuracy. When a model says 0.93 on an invoice amount, it should be right roughly 93% of the time on that field. The calibration is not perfect, but the trend is real — and a shift in the distribution of confidence scores often precedes a rise in correction rate by 3 to 5 days.
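One way to check calibration on your own data is to bucket predictions by confidence and compare each bucket's mean confidence with the share of outputs humans accepted. The sketch below assumes you can pair each confidence score with an accepted-or-corrected flag. The function name and the ten-bucket default are illustrative choices.

```python
def calibration_table(records, n_bins=10):
    """records: iterable of (confidence in [0, 1], accepted: bool).

    Returns one row per non-empty confidence bucket, lowest bucket first.
    """
    buckets = {}
    for confidence, accepted in records:
        # A confidence of exactly 1.0 goes into the top bucket.
        idx = min(int(confidence * n_bins), n_bins - 1)
        buckets.setdefault(idx, []).append((confidence, accepted))

    rows = []
    for idx in sorted(buckets):
        items = buckets[idx]
        rows.append({
            "bin": idx,
            "n": len(items),
            "mean_confidence": sum(c for c, _ in items) / len(items),
            "accuracy": sum(1 for _, ok in items if ok) / len(items),
        })
    return rows
```

A bucket whose accuracy sits well below its mean confidence is one where the model is overconfident on that field.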
This is your early warning signal. If the average confidence on invoice amount extraction drops from 0.91 to 0.83 over two weeks, something changed before your correction rate dashboards show it: the document population changed, the model was updated by the provider, or upstream data quality degraded. The confidence distribution shift tells you to look.
Track the weekly mean and variance of confidence scores per field per document type. Alert if the weekly mean drops by more than 0.05 (5 points on a 0 to 1 scale) on any high-stakes field. That threshold sounds tight. It is not: a drop of that size on invoice amount extraction typically corresponds to a 12–18% rise in correction rate, which is already a serious accuracy problem.
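The weekly statistics and the drop check can be sketched as follows, assuming confidence scores are already grouped per field, per document type, and per week. The function names are hypothetical, and the 0.05 default mirrors the threshold above.

```python
from statistics import mean, pvariance


def weekly_stats(scores_by_week):
    """scores_by_week: {week_label: [confidence, ...]}.

    Returns {week_label: (mean, variance)} for weeks that have data.
    """
    return {
        week: (mean(scores), pvariance(scores))
        for week, scores in scores_by_week.items()
        if scores
    }


def confidence_drop_alert(prev_scores, curr_scores, max_drop=0.05):
    """True if the mean confidence fell by more than `max_drop` week over week."""
    if not prev_scores or not curr_scores:
        return False  # not enough data to compare
    return mean(prev_scores) - mean(curr_scores) > max_drop
```

Run the check once per high-stakes field and document type rather than on a pooled average, so a drop in one field is not diluted by stable fields.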
Token distribution and prompt saturation
Track input token counts per document type per week. A rising average often means document formatting is changing — longer documents, more boilerplate, embedded tables that expand token counts when flattened. When input tokens approach the model's context limit, extraction quality degrades before it hits a hard error.
Prompt saturation is the point where the combination of system prompt, document content, and conversation history pushes total tokens close enough to the context limit that relevant content starts getting cut. The model does not error. It just operates with less of the document than you intended. P95 and P99 token counts matter more than averages here — the average document may be fine while a minority of large documents are hitting the limit silently.
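A small sketch of why the tail matters, assuming you log total input tokens per request and know the context limit of the model you call. It uses a nearest-rank percentile, and the 80% threshold and function names are illustrative.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct between 0 and 100)."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def saturation_report(token_counts, context_limit, threshold=0.8):
    """Compare mean, P95 and P99 input tokens against a share of the limit."""
    cutoff = threshold * context_limit
    avg = sum(token_counts) / len(token_counts)
    p95 = percentile(token_counts, 95)
    p99 = percentile(token_counts, 99)
    return {
        "mean": avg,
        "p95": p95,
        "p99": p99,
        "mean_over_cutoff": avg > cutoff,
        "p95_over_cutoff": p95 > cutoff,
        "p99_over_cutoff": p99 > cutoff,
    }
```

With 90 documents at 1,000 tokens and 10 at 9,000 against a 10,000-token limit, the mean is 1,800 and looks safe, while P95 and P99 both sit at 9,000 and cross the 80% cutoff.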
Model version drift
If you are using hosted models such as Claude 3.5 Sonnet, GPT-4o, or Gemini 1.5 Pro, the model behind the same API endpoint can change over time. OpenAI's gpt-4o-2024-05-13 and gpt-4o-2024-08-06 snapshots produce measurably different outputs on some extraction tasks. Anthropic updates Claude versions on a similar cadence. You are not always notified when it happens.
Run your eval set against the production model on a weekly cadence. If eval accuracy drops more than 3 percentage points, check for a model version change before looking for problems in your code. Pin to specific model versions in production where the provider allows it. When you cannot pin, the weekly eval run is your only protection against silent accuracy regressions caused by provider-side model updates.
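The weekly check can be sketched like this, assuming you store the previous run's accuracy along with the model identifier you recorded for that run. How you obtain the identifier depends on your provider and client. The function names and result keys here are hypothetical.

```python
def eval_accuracy(predictions, expected):
    """Share of eval items where the prediction exactly matches the label."""
    if len(predictions) != len(expected) or not expected:
        raise ValueError("predictions and expected must be non-empty and aligned")
    hits = sum(1 for p, e in zip(predictions, expected) if p == e)
    return hits / len(expected)


def check_regression(baseline_acc, current_acc,
                     baseline_model, current_model,
                     max_drop_points=3.0):
    """Flag an accuracy drop and whether the recorded model version changed."""
    drop_points = (baseline_acc - current_acc) * 100
    return {
        "drop_points": drop_points,
        "regressed": drop_points > max_drop_points,
        # If this is True, look at the provider before looking at your code.
        "model_changed": baseline_model != current_model,
    }
```

Exact-match scoring is the simplest option. Fields such as amounts or dates may need normalization before comparison.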
The three alerts you need on day one
Most production AI monitoring systems are over-instrumented on infrastructure and under-instrumented on accuracy. These three alerts cover 90% of production AI incidents we have seen.
Correction rate above baseline plus 2 percentage points, sustained for 48 hours: page on-call. This is the signal that output quality has degraded and users have noticed. Every hour of delay adds to the number of incorrect records already in your ERP.
Weekly mean confidence drop greater than 0.05 (5 points on a 0 to 1 scale) over 7 days: create a ticket. This is the early warning signal. Not an emergency, but something changed and it needs investigation before it becomes one.
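Average input tokens for any document type above 80% of context limit: create a ticket. You are approaching silent truncation territory. Apply the same check to P95 and P99 token counts, since a minority of large documents can hit the limit while the average still looks fine. Find out why token counts are rising and address it before they tip over.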
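The first alert is the only one of the three with a "sustained" condition, which is easy to get wrong. A minimal sketch, assuming correction rate is sampled at regular intervals and a baseline is known. The function name and the sample format are assumptions.

```python
from datetime import datetime, timedelta


def sustained_breach(samples, baseline, margin=0.02,
                     window=timedelta(hours=48)):
    """samples: list of (timestamp, correction_rate) pairs, in any order.

    True only if the samples cover the whole window and every sample in
    the most recent `window` is above baseline + margin.
    """
    if not samples:
        return False
    ordered = sorted(samples)
    latest = ordered[-1][0]
    start = latest - window
    if ordered[0][0] > start:
        return False  # not enough history to call it sustained
    recent = [rate for ts, rate in ordered if ts >= start]
    return all(rate > baseline + margin for rate in recent)
```

Requiring every sample in the window to breach means a single good hour resets the alert. If your correction volume is low and hourly rates are noisy, a rolling average over the window may be a better fit.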
Correction rate is the only metric that directly measures whether your AI system is doing its job. Everything else is proxy. If you are not logging and alerting on correction rate, you do not have production monitoring — you have infrastructure monitoring on a system that happens to include an LLM.