AI Engineering | 8 min read

How to Actually Evaluate an LLM for Enterprise Use

A client once chose the wrong model for a $200K document processing project because their evaluation process was, in practice, a demo. Here is what real enterprise LLM evaluation looks like — and what you measure when leaderboard scores mean nothing for your actual use case.

The client had done everything that looks like due diligence. They tested three models. They ran each one against a set of sample documents. They compared outputs side by side. They chose the model with the highest score on the MMLU leaderboard at the time, which was also the most expensive, which was also the one that failed in production within three weeks.

The failure mode: their documents included mixed Arabic/English headers, inconsistent date formats, and tables embedded inside scanned PDFs. None of these appeared in the evaluation set. When production documents hit the pipeline, field extraction accuracy dropped from the demo's 94% to a real-world 61%. They had evaluated the wrong things entirely. We inherited the project and spent six weeks rebuilding.

61%: real-world field extraction accuracy, down from 94% in the vendor demo, on the same document type

The benchmark trap

MMLU, HellaSwag, and HumanEval are the benchmarks that show up in press releases and leaderboard comparisons. They test general knowledge, reasoning over common language patterns, and code generation on textbook-style problems. As research artifacts for comparing models in general, they are legitimate. For enterprise use case selection, they are nearly useless.

We tested this directly. A model ranked third on a popular leaderboard completely failed a structured document extraction task we run routinely — pulling line-item data from supplier invoices with inconsistent formatting. A model ranked ninth handled it reliably at 91% field-level accuracy. The benchmark only tells you who trained well for that benchmark.

The models are being optimized for benchmark performance. That optimization does not transfer automatically to your use case. Your job is to test for your problem, not theirs.

Build your eval set first. Before you touch a model.

This is the highest-leverage step in any AI project and the one most teams skip. Before you open a model API, before you write a prompt, before you look at a leaderboard — build a hand-curated evaluation set of 50 to 100 examples from your actual use case. Real documents. Real queries. Real expected outputs.

The composition matters. Roughly 60% should be the normal case: typical documents, standard queries, clean data. The remaining 40% should be edge cases: the documents that are partially scanned, the queries that are ambiguous, the outputs where getting it wrong has real consequences. Include examples that should produce an "I don't have enough information" output, and test whether the model says so or fabricates.
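To make the composition checkable rather than aspirational, it helps to store each example with a label. The sketch below is one possible layout, not a prescribed format: the `EvalExample` fields, the `kind` labels, and the tag names are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EvalExample:
    example_id: str
    input_text: str
    expected: Optional[dict]  # None means the correct behaviour is to abstain
    kind: str = "normal"      # "normal" or "edge" (hypothetical labels)
    tags: List[str] = field(default_factory=list)


def composition(examples):
    """Return (share of each kind, number of abstention examples)."""
    counts = Counter(ex.kind for ex in examples)
    shares = {kind: n / len(examples) for kind, n in counts.items()}
    abstain = sum(1 for ex in examples if ex.expected is None)
    return shares, abstain


# Toy set: 6 normal cases, 3 edge cases with answers, 1 edge case with no answer.
examples = (
    [EvalExample(f"n{i}", "clean invoice text", {"total": "100.00"}) for i in range(6)]
    + [EvalExample(f"e{i}", "partially scanned invoice", {"total": "100.00"},
                   "edge", ["scanned"]) for i in range(3)]
    + [EvalExample("e3", "invoice with no total", None, "edge", ["unanswerable"])]
)
shares, abstain = composition(examples)
print(shares, abstain)
```

Running the check whenever the set changes keeps the edge-case share from quietly eroding as people add the easy examples first.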

Our take

The eval set is the most important artifact in your AI project. Not the prompt. Not the model. The eval set. It takes two to three days to build properly. Most teams skip it because writing prompts feels more interesting. That trade-off costs weeks in production.

Every model swap, every prompt edit, every parameter change becomes a measurable engineering decision the moment you have a ground truth to compare against.

What production LLM testing actually measures

Once you have your eval set, here is what to measure — not what vendor demos show, not what benchmark tables report:

Task accuracy on your specific data. Field-level extraction accuracy, answer correctness on your documents, classification precision and recall for your categories. Not general knowledge. Not reasoning benchmarks.
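As a minimal sketch of field-level accuracy, assume predictions and ground truth are both lists of flat dictionaries keyed by field name. The field names and the exact-match rule are illustrative assumptions; a real pipeline would likely normalize dates and amounts before comparing.

```python
def field_accuracy(predictions, ground_truth):
    """Fraction of expected fields whose predicted value matches exactly."""
    correct = 0
    total = 0
    for pred, truth in zip(predictions, ground_truth):
        for name, expected_value in truth.items():
            total += 1
            # A missing field counts as wrong, same as a wrong value.
            if pred.get(name) == expected_value:
                correct += 1
    return correct / total if total else 0.0


truth = [
    {"invoice_no": "A-17", "total": "120.00"},
    {"invoice_no": "B-02", "total": "75.50"},
]
preds = [
    {"invoice_no": "A-17", "total": "120.00"},
    {"invoice_no": "B-02", "total": "57.50"},  # transposed digits
]
accuracy = field_accuracy(preds, truth)
print(f"field-level accuracy: {accuracy:.2f}")
```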

Hallucination rate. Does the model fabricate information when the answer is not present in the provided context? Give it documents that do not contain the answer. Measure how often it generates something plausible but wrong versus admitting uncertainty. This is the number that determines whether the system is safe to deploy.
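One way to put a number on this, sketched under the assumption that the prompt instructs the model to return a fixed sentinel string when the answer is absent. The `ABSTAIN` sentinel and the toy outputs are hypothetical; free-text refusals would need a more forgiving matcher.

```python
ABSTAIN = "INSUFFICIENT_INFORMATION"  # hypothetical sentinel the prompt asks for


def hallucination_rate(outputs, answerable_flags):
    """Share of unanswerable examples where the model answered anyway."""
    unanswerable = [
        out for out, answerable in zip(outputs, answerable_flags) if not answerable
    ]
    if not unanswerable:
        return 0.0
    fabricated = sum(1 for out in unanswerable if out != ABSTAIN)
    return fabricated / len(unanswerable)


outputs = ["2024-03-01", ABSTAIN, "ACME Ltd", ABSTAIN, "42"]
answerable = [True, False, False, False, True]
rate = hallucination_rate(outputs, answerable)
print(f"hallucination rate on unanswerable examples: {rate:.2f}")
```

The rate is computed only over examples whose context does not contain the answer, so a model cannot improve it by abstaining on everything without also hurting task accuracy.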

Structured output reliability. If your pipeline depends on JSON output — and most production enterprise pipelines do — measure how often you get valid, schema-conforming JSON. Models that produce valid JSON 98% of the time will fail your pipeline 2% of the time, which at enterprise volume is not acceptable.
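A rough sketch of measuring this with only the standard library. The `REQUIRED_FIELDS` schema and the daily volume are made-up illustrations; a production check would more likely use a full schema validator.

```python
import json

REQUIRED_FIELDS = {"invoice_no": str, "total": float}  # hypothetical schema


def is_valid(raw):
    """True if raw parses as a JSON object with the required typed fields."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(
        name in data and isinstance(data[name], expected_type)
        for name, expected_type in REQUIRED_FIELDS.items()
    )


def validity_rate(raw_outputs):
    return sum(is_valid(raw) for raw in raw_outputs) / len(raw_outputs)


raw_outputs = [
    '{"invoice_no": "A-17", "total": 120.5}',    # valid
    '{"invoice_no": "A-18"}',                    # missing field
    'Sure! Here is the JSON: {"total": 1.0}',    # chatter before the JSON
    '{"invoice_no": "A-19", "total": "n/a"}',    # wrong type
]
print(f"validity rate: {validity_rate(raw_outputs):.2f}")

# What a 98% validity rate means at an assumed 10,000 documents per day.
daily_failures = round((1 - 0.98) * 10_000)
print(f"expected failed documents per day: {daily_failures}")
```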

P95 latency, not average. Average latency hides the tail. A system averaging 800ms might have a P95 of 4.2 seconds on complex documents. Measure the tail before you commit to an architecture.
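To show how the mean can hide the tail, here is a small sketch using the nearest-rank definition of a percentile. The latency samples are invented for illustration, and other percentile definitions will give slightly different values on small samples.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# 18 fast requests and 2 slow ones on complex documents (made-up numbers).
latencies_ms = [700] * 18 + [4200] * 2

mean_ms = statistics.mean(latencies_ms)
p95_ms = percentile(latencies_ms, 95)
print(f"mean: {mean_ms} ms, P95: {p95_ms} ms")
```

Here the mean looks tolerable while one request in ten takes several seconds, which is the gap a timeout or queueing design has to absorb.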

Cost per correct output. Not cost per token. A cheaper model that needs human correction 30% of the time can cost more in total than an expensive model that routes 5% to human review, because reviewer time usually costs far more than tokens. Run the numbers before making the cost argument.
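A back-of-the-envelope version of that calculation, as a sketch. It assumes every routed document is fixed by a reviewer, so all final outputs are correct, and every price below is a placeholder rather than a real quote.

```python
def cost_per_correct_output(model_cost, review_rate, review_cost):
    """Per-document cost, assuming every reviewed document ends up correct."""
    return model_cost + review_rate * review_cost


def break_even_review_cost(cheap_cost, cheap_rate, pricey_cost, pricey_rate):
    """Review cost per document at which both options cost the same."""
    return (pricey_cost - cheap_cost) / (cheap_rate - pricey_rate)


REVIEW_COST = 2.00  # assumed cost of one human correction, in dollars

cheap = cost_per_correct_output(0.01, 0.30, REVIEW_COST)
pricey = cost_per_correct_output(0.05, 0.05, REVIEW_COST)
threshold = break_even_review_cost(0.01, 0.30, 0.05, 0.05)

print(f"cheap model:     ${cheap:.2f} per correct output")
print(f"expensive model: ${pricey:.2f} per correct output")
print(f"break-even review cost: ${threshold:.2f}")
```

With these placeholder figures the cheaper model only wins if a human correction costs less than the break-even value, which is rarely true once reviewer time is priced honestly.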

The confidence calibration problem

We got this wrong on the first two projects. We built confidence thresholds — route anything below 0.85 to human review — without ever testing whether the model's confidence scores actually correlated with accuracy. They did not.

A model that says "95% confident" and is right 60% of the time is worse than a model that says "60% confident" and is right 60% of the time. Miscalibrated confidence is the mechanism by which AI errors get past human oversight — the system is confident, so reviewers trust it, so errors compound.

Watch out

How to test for calibration: take your eval set, get confidence scores on every output, bucket them into bands (0.7–0.8, 0.8–0.9, 0.9–1.0), and measure actual accuracy within each band. If your 0.9–1.0 bucket has 68% actual accuracy, your confidence threshold is meaningless.
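The bucketing step can be sketched in a few lines. The confidence scores and correctness flags below are toy data, and the band edges simply mirror the ones named above; where the confidence scores come from (log-probabilities, a self-reported score, a separate verifier) is left open.

```python
def calibration_by_band(confidences, correct_flags,
                        bands=((0.7, 0.8), (0.8, 0.9), (0.9, 1.0))):
    """Map each band to (sample count, observed accuracy or None if empty)."""
    report = {}
    top_edge = bands[-1][1]
    for low, high in bands:
        hits = [
            ok for conf, ok in zip(confidences, correct_flags)
            # Bands are half-open, except the top band which includes its upper edge.
            if low <= conf < high or (high == top_edge and conf == high)
        ]
        accuracy = sum(hits) / len(hits) if hits else None
        report[(low, high)] = (len(hits), accuracy)
    return report


confidences = [0.95, 0.92, 0.97, 1.0, 0.85, 0.82, 0.75, 0.71]
correct = [True, False, True, False, True, True, False, True]

report = calibration_by_band(confidences, correct)
for (low, high), (count, accuracy) in report.items():
    print(f"{low:.1f} to {high:.1f}: n={count}, accuracy={accuracy}")
```

In this toy data the top band is right only half the time, so a 0.85 routing threshold would wave through exactly the outputs that most need review. With a 50 to 100 example eval set, each band holds few samples, so treat the per-band numbers as a warning signal rather than a precise estimate.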

For financial workflows specifically — AP automation, contract extraction, anything with monetary fields — calibration testing is non-negotiable.

Red-teaming for your specific domain

After basic evaluation comes adversarial testing — deliberately constructing inputs designed to trigger failures. This is not optional for finance, legal, or compliance deployments.

In practice: take documents that should cause problems and test them explicitly. A purchase order where the supplier name appears three times with slightly different formatting. An invoice with a handwritten amendment that contradicts the printed total. A contract clause where the effective date is embedded in a footnote. These documents exist in every enterprise archive. Test against them before they hit production.

In the Gulf region specifically, we test all models against Arabic/English mixed documents. An invoice header in Arabic, line items in English, totals mixed. This is a consistent failure point for models that benchmark well on pure English or pure Arabic. The failure is not obvious in a demo. It surfaces when you run the actual document corpus.
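One way to keep these failures visible is to tag adversarial examples and report accuracy per tag instead of a single headline number. This is only a sketch: the tag names and results are invented, and an example may carry several tags.

```python
from collections import defaultdict


def accuracy_by_tag(results):
    """results: iterable of (tags, is_correct) pairs. Returns accuracy per tag."""
    totals = defaultdict(lambda: [0, 0])  # tag -> [count, correct]
    for tags, is_correct in results:
        for tag in tags:
            totals[tag][0] += 1
            totals[tag][1] += int(is_correct)
    return {tag: hit / count for tag, (count, hit) in totals.items()}


results = [
    (["english"], True),
    (["english"], True),
    (["mixed_ar_en"], False),
    (["mixed_ar_en", "scanned"], False),
    (["mixed_ar_en"], True),
    (["scanned"], True),
]
by_tag = accuracy_by_tag(results)
for tag, accuracy in sorted(by_tag.items()):
    print(f"{tag}: {accuracy:.2f}")
```

An overall score across these six examples would look passable, while the mixed-language slice on its own shows the problem.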

The vendor demo tells you nothing

Every model looks good on the vendor's demo dataset. The demo is designed to show the model performing well. That is not evaluation — that is marketing.

When a vendor wants to demonstrate their model, ask to run your eval set against it. Real documents. Real edge cases. Your ground truth. If they resist, that tells you something meaningful about how confident they are in their model's performance on real data.

Most LLM selection decisions in enterprise are made on vibes and demos. The companies hurt by AI failures in production are almost uniformly the ones who skipped the eval set. There is no shortcut that replaces building your eval set and running real tests. Anyone telling you otherwise is selling something.