
Testing AI Systems: Why Your Existing QA Process Is Not Enough

Our test suite had a 100% pass rate. The model was producing wrong answers in production 23% of the time. Both were true simultaneously, because we were testing function calls, not outputs. This is the gap standard QA never closes.

Standard QA tests structural correctness: did the API respond with a 200, did schema validation pass, did the function execute without throwing. For deterministic software this is mostly sufficient. For AI systems, structural correctness and behavioral correctness are completely separate problems — and standard QA only addresses one of them.

We discovered this the hard way on a document extraction pipeline. The unit tests verified that the LLM was called, that the response was parsed, that the output was written to the database. All green. What they did not verify was whether the extracted values were actually correct. The answer was: not always. Not even close.
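The gap between the two kinds of check can be shown in a few lines. This is a minimal sketch, not the pipeline described above: `fake_llm_extract`, the field names, and the sample invoice are all hypothetical stand-ins chosen for illustration.

```python
import json

def fake_llm_extract(document_text: str) -> str:
    """Hypothetical stand-in for the LLM call: well-formed JSON, wrong total."""
    return json.dumps({"invoice_number": "INV-1042", "total": "1,250.00"})

def structural_check(raw_response: str) -> bool:
    """What a typical unit test verifies: the response parses and has the fields."""
    parsed = json.loads(raw_response)
    return {"invoice_number", "total"}.issubset(parsed)

def behavioral_check(raw_response: str, expected: dict) -> bool:
    """What an eval verifies: the extracted values match the known answer."""
    return json.loads(raw_response) == expected

document = "Invoice INV-1042. Total due: 1,520.00"
expected = {"invoice_number": "INV-1042", "total": "1,520.00"}
response = fake_llm_extract(document)

print(structural_check(response))            # True: the plumbing works
print(behavioral_check(response, expected))  # False: the answer is wrong
```

Both checks run against the same response. Only the second one needs a known correct answer, which is why it cannot be written without an eval set.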

The eval set: what QA looks like for AI

The right replacement for unit tests as a correctness signal is a hand-curated evaluation set — real documents with known correct outputs, assembled before any build work starts. Not 10 documents. Not the 10 cleanest documents in the archive.

The structure we use has three parts. 40–50 standard cases cover the common document types the system will handle in normal operation. 20–30 edge cases cover unusual formatting, missing fields, ambiguous values, and documents that look like one type but are actually another. 10–15 adversarial cases are drawn directly from real production failures; unlike the first two groups, these cannot be assembled up front, so they are added retroactively after go-live as the system encounters them.

That last category matters. The adversarial set is how a system gets smarter over time rather than just accumulating failures. Every time a user corrects an output, we ask: should this become an eval case? Most of the time, yes.
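One way to represent such an eval set is sketched below. The `EvalCase` fields, the exact-match scoring rule, and `toy_extract` are assumptions made for illustration; the article does not specify how cases are stored or scored.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class EvalCase:
    case_id: str
    category: str       # "standard", "edge", or "adversarial"
    doc_type: str       # e.g. "invoice"
    document_text: str
    expected: dict      # hand-verified correct field values

def score_eval_set(cases, extract):
    """Return (overall accuracy, accuracy per category).

    A case counts as correct only if every extracted field matches exactly,
    which is a strict scoring rule assumed here for simplicity.
    """
    if not cases:
        raise ValueError("eval set is empty")
    hits, totals = defaultdict(int), defaultdict(int)
    for case in cases:
        totals[case.category] += 1
        if extract(case.document_text) == case.expected:
            hits[case.category] += 1
    overall = sum(hits.values()) / sum(totals.values())
    per_category = {cat: hits[cat] / totals[cat] for cat in totals}
    return overall, per_category

def toy_extract(text):
    """Hypothetical extractor that only handles the 'Total: X' layout."""
    prefix = "Total: "
    return {"total": text[len(prefix):]} if text.startswith(prefix) else {}

cases = [
    EvalCase("std-001", "standard", "invoice", "Total: 100.00", {"total": "100.00"}),
    EvalCase("edge-001", "edge", "invoice", "Total due 250.00 (see note)", {"total": "250.00"}),
    EvalCase("adv-001", "adversarial", "receipt", "TOTAL 75,00", {"total": "75.00"}),
]
overall, per_category = score_eval_set(cases, toy_extract)
```

Keeping the category on each case is what makes per-category reporting possible later, and a user correction can become a new case by appending an `EvalCase` with `category="adversarial"`.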

23% of real-world inputs producing wrong outputs in a system with a 100% unit test pass rate, because we were testing plumbing, not answers

Confidence calibration: the most important test nobody runs

Most AI pipelines output a confidence score alongside each extracted value. Most teams treat this as a threshold gate: above 0.85, auto-accept; below 0.85, route to human review. This is necessary. It is not sufficient.

The problem is miscalibration. A model that reports 90% confidence and is actually right 60% of the time passes every threshold check. It goes straight to production. It sends wrong answers without triggering a human review flag. We got this wrong on the first two projects we shipped. The failure was invisible until we correlated confidence scores against correction rates and saw the gap.
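The arithmetic of that failure is worth spelling out. A small sketch, assuming 100 hypothetical outputs that all report 0.90 confidence while only 60 are correct; treating a score of exactly 0.85 as auto-accept is also an assumption.

```python
THRESHOLD = 0.85

def route(confidence: float) -> str:
    """Threshold gate; scores at or above the threshold are auto-accepted."""
    return "auto_accept" if confidence >= THRESHOLD else "human_review"

# 100 hypothetical outputs: (reported confidence, actually correct?)
outputs = [(0.90, i < 60) for i in range(100)]

auto_accepted = [correct for conf, correct in outputs if route(conf) == "auto_accept"]
wrong_auto_accepted = auto_accepted.count(False)

print(len(auto_accepted), wrong_auto_accepted)  # 100 40
```

Every output clears the gate, so all 40 wrong answers skip human review. The gate is working exactly as written; the scores it relies on are not.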

The calibration test is straightforward. Run your eval set. Bucket all outputs by confidence score range: 0.95–1.0, 0.90–0.95, 0.85–0.90, and so on. For each bucket, measure actual accuracy: what percentage of those outputs were correct. A well-calibrated model should show accuracy tracking confidence closely: the 0.95–1.0 bucket should be right at least 95% of the time.
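The bucketing step can be sketched as follows. The input shape (a list of `(confidence, is_correct)` pairs from an eval run) and the sample numbers are assumptions for illustration.

```python
from collections import defaultdict

def calibration_table(results, bucket_width=0.05):
    """Bucket (confidence, is_correct) pairs and measure accuracy per bucket.

    Returns rows of (low, high, n, accuracy), highest bucket first.
    """
    n_buckets = round(1 / bucket_width)
    counts = defaultdict(lambda: [0, 0])  # bucket index -> [correct, total]
    for confidence, is_correct in results:
        # The small epsilon guards against float error at bucket edges;
        # a confidence of exactly 1.0 falls into the top bucket.
        idx = min(int(confidence * n_buckets + 1e-9), n_buckets - 1)
        counts[idx][0] += int(is_correct)
        counts[idx][1] += 1
    return [
        (idx / n_buckets, (idx + 1) / n_buckets, total, correct / total)
        for idx, (correct, total) in sorted(counts.items(), reverse=True)
    ]

# Hypothetical eval results: 25 outputs near 0.97, 10 outputs near 0.91.
results = [(0.97, True)] * 18 + [(0.97, False)] * 7 + [(0.91, True)] * 9 + [(0.91, False)]
rows = calibration_table(results)
for low, high, n, accuracy in rows:
    print(f"{low:.2f}-{high:.2f}  n={n}  accuracy={accuracy:.0%}")
```

In this made-up run the top bucket is right only 72% of the time while the bucket below it is right 90% of the time. Small buckets give noisy accuracy figures, so it helps to read `n` alongside each percentage.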

Watch out

A miscalibrated confidence score is more dangerous than no confidence score. It provides false assurance that routes wrong answers straight to production while appearing to have human oversight in place. Test calibration explicitly on your eval set before go-live.

If the 0.95–1.0 bucket is only right 72% of the time, you have a miscalibrated model and your threshold is meaningless. The right response is to raise the threshold until the outputs above it are as accurate as their scores claim, or to build a post-hoc calibration layer that adjusts scores before thresholding.
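A post-hoc calibration layer can take several forms, and the article does not say which one was used. The sketch below uses histogram binning, one simple option: each raw score is replaced by the accuracy measured for its bucket on held-out eval data. Sending scores from unseen buckets to review by mapping them to 0.0 is a conservative assumption.

```python
def fit_histogram_calibrator(results, bucket_width=0.05):
    """Learn bucket -> observed accuracy from (confidence, is_correct) pairs."""
    n_buckets = round(1 / bucket_width)
    correct = [0] * n_buckets
    total = [0] * n_buckets

    def bucket_of(confidence):
        return min(int(confidence * n_buckets + 1e-9), n_buckets - 1)

    for confidence, is_correct in results:
        idx = bucket_of(confidence)
        total[idx] += 1
        correct[idx] += int(is_correct)

    def calibrate(confidence):
        idx = bucket_of(confidence)
        if total[idx] == 0:
            return 0.0  # no evidence for this range: force human review
        return correct[idx] / total[idx]

    return calibrate

# Hypothetical held-out results: the top bucket is right 18 times out of 25.
held_out = [(0.97, True)] * 18 + [(0.97, False)] * 7
calibrate = fit_histogram_calibrator(held_out)

raw = 0.97
adjusted = calibrate(raw)
decision = "auto_accept" if adjusted >= 0.85 else "human_review"
print(adjusted, decision)  # 0.72 human_review
```

The calibrator should be fitted on different cases from the ones used to check it afterwards. Otherwise the adjusted scores will look perfectly calibrated by construction.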

Prompt regression testing: run it before every merge

Prompt changes are code changes. They need to go through the same gate. We treat the eval set as a regression test suite that runs automatically on every prompt change before it merges.

This sounds obvious. In practice, almost nobody does it. The typical process is: someone edits the system prompt, tests it manually on 3–5 documents that look good, and deploys. Three weeks later, accuracy on a document type they did not think to test has dropped 18 points.

The CI gate we use: any prompt change that drops overall eval accuracy by more than 2 percentage points, or drops any single document category by more than 5 percentage points, blocks the merge and requires a review. Not a hard fail — someone can override it with justification. But it has to be a deliberate choice, not something that slips through because the author only tested the happy path.
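The gate logic itself is small. This is a minimal sketch using the 2-point and 5-point limits above; the result dictionary shape, the category names, and the `override_reason` mechanism are hypothetical, and a real CI job would call `sys.exit` with the returned code.

```python
MAX_OVERALL_DROP = 2.0    # percentage points
MAX_CATEGORY_DROP = 5.0   # percentage points

def regression_failures(baseline, candidate):
    """Compare two eval runs shaped like {"overall": pct, "by_category": {name: pct}}."""
    failures = []
    overall_drop = baseline["overall"] - candidate["overall"]
    if overall_drop > MAX_OVERALL_DROP:
        failures.append(f"overall: dropped {overall_drop:.1f} points")
    for category, base_acc in baseline["by_category"].items():
        cand_acc = candidate["by_category"].get(category)
        if cand_acc is None:
            failures.append(f"{category}: no candidate result")
        elif base_acc - cand_acc > MAX_CATEGORY_DROP:
            failures.append(f"{category}: dropped {base_acc - cand_acc:.1f} points")
    return failures

def gate_exit_code(failures, override_reason=""):
    """Return 1 to block the merge, 0 to allow it.

    A non-empty override_reason models the deliberate, justified override.
    """
    return 1 if failures and not override_reason.strip() else 0

baseline = {"overall": 91.0, "by_category": {"invoice": 95.0, "receipt": 88.0}}
candidate = {"overall": 90.0, "by_category": {"invoice": 96.0, "receipt": 70.0}}

failures = regression_failures(baseline, candidate)
print(failures)                  # ['receipt: dropped 18.0 points']
print(gate_exit_code(failures))  # 1
```

In this made-up run the overall number moves by one point and would pass on its own. The per-category check is what catches the 18-point drop on receipts.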

Our take

Add the eval set as a CI job that runs on every prompt change. The job takes about 3 minutes to run 80 cases through Claude 3 Haiku at ~$0.04 total. It catches the regressions that manual testing misses.

Production monitoring as continuous testing

Once the system is live, production monitoring is the eval set running at scale. The signals we track: correction rate per field per week, broken down by document type. Confidence score distribution shifts — if the mean confidence on invoice date extraction drops from 0.93 to 0.87 over two weeks without a prompt change, something in the document population has changed. P95 latency, not average, because tail latency is where user experience breaks.
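Two of these signals can be sketched in a few lines. The nearest-rank percentile method, the 0.03 alert margin for a drop in mean confidence, and the sample numbers are assumptions for illustration.

```python
import math
from statistics import mean

def p95(latencies_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    if not latencies_ms:
        raise ValueError("no latencies recorded")
    ordered = sorted(latencies_ms)
    rank = math.ceil(95 * len(ordered) / 100)
    return ordered[rank - 1]

def confidence_shift(previous, current, max_drop=0.03):
    """Return (drop in mean confidence, whether it exceeds max_drop)."""
    drop = mean(previous) - mean(current)
    return drop, drop > max_drop

# Hypothetical latencies: 18 fast requests and 2 slow ones.
latencies = [200] * 18 + [4000] * 2
print(mean(latencies), p95(latencies))  # 580 4000

# Hypothetical invoice-date confidences two weeks apart.
drop, alert = confidence_shift([0.93, 0.93, 0.93], [0.87, 0.87, 0.87])
```

The latency example shows why the average hides the problem: a mean of 580 ms looks acceptable while one request in ten takes 4 seconds.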

Correction rate is the most important signal. A 3% weekly correction rate is a healthy system. A correction rate climbing from 3% to 8% over four weeks, quietly, without an alert, is how AI pipelines become expensive manual processes that people route around.
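A sketch of how that quiet climb could be turned into an alert follows. The event tuple shape, the week labels, the four-week window, and the 0.02 minimum rise are all assumptions; appropriate values depend on the system's volume.

```python
from collections import defaultdict

def weekly_correction_rates(events):
    """events: (week, doc_type, field, was_corrected) tuples.

    Returns {(doc_type, field, week): correction rate}.
    """
    corrected, total = defaultdict(int), defaultdict(int)
    for week, doc_type, field, was_corrected in events:
        key = (doc_type, field, week)
        total[key] += 1
        corrected[key] += int(was_corrected)
    return {key: corrected[key] / total[key] for key in total}

def is_climbing(rates, window=4, min_rise=0.02):
    """rates: weekly correction rates for one field, oldest first."""
    if len(rates) < window:
        return False
    recent = rates[-window:]
    return recent[-1] - recent[0] > min_rise

# Hypothetical week: 100 invoice-date extractions, 3 corrected by users.
events = [("W1", "invoice", "date", i < 3) for i in range(100)]
rates = weekly_correction_rates(events)
print(rates[("invoice", "date", "W1")])  # 0.03

print(is_climbing([0.03, 0.045, 0.06, 0.08]))   # True
print(is_climbing([0.03, 0.031, 0.029, 0.03]))  # False
```

Comparing the ends of a window rather than checking a fixed ceiling is what catches a slow drift. Each weekly step in the first series is small enough to be ignored on its own.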

We build correction rate dashboards into every production AI system now. Not as a nice-to-have. As a go-live requirement. If the correction rate is climbing and we cannot explain why, the system is broken — even if unit tests are green and no exceptions are being thrown.

Unit tests for AI pipelines are necessary, but they are not a QA strategy. A 100% unit test pass rate tells you the plumbing works. It tells you nothing about whether the outputs are correct. Treat eval sets, calibration tests, and correction rate monitoring as the actual QA layer, because they are.