What standard review actually checks
A standard code review looks at: correctness (does the code do what it says it does), security (SQL injection, auth bypasses, exposed secrets), performance (obvious N+1 queries, missing indexes), and style (naming, structure, tests). These are the right things to check for deterministic software.
AI systems are not deterministic. The same input can produce a different output on the second call. The failure modes are probabilistic, not binary. A reviewer who checks that the chat_completion call is correctly structured and moves on has reviewed almost nothing of importance.
This is not a criticism of reviewers — it is a gap in the review framework. The framework was not built for systems that reason about input and produce outputs that are plausible but sometimes wrong. We had to build a new checklist, and we are sharing it here because the first version we built was also incomplete.
Prompt injection surfaces
Prompt injection is to AI systems what SQL injection was to web applications in 2005. Obvious in retrospect. Routinely missed until it becomes a production incident.
The review question is: what user-controlled or external data touches the prompt? Every place where a string from a database, a user input, an email body, or an API response gets interpolated into a prompt template is a potential injection surface. Review each one. Ask: what happens if this string contains Ignore previous instructions? What happens if it contains a role-switch attempt? What happens if it contains structured instructions that the model might follow?
The fix is usually simple — sanitize external content, use system prompts for instructions and user turns only for data, and add output validation that catches out-of-schema responses. But you have to review for it explicitly. It will not surface in a test suite that uses clean, well-formed inputs.
Watch out
Failure mode coverage
For deterministic code, a failure mode is usually an exception. Catch it, log it, return an error. For AI systems, failure modes are: model returns a plausible but wrong answer with high confidence, model returns a correct answer in the wrong format, model times out, model rate-limits, model is unavailable, model version changes behavior, and model produces output that passes schema validation but fails business logic.
Review question: for each model call, what is the defined behavior in each of these cases? Not just the timeout case — all of them. Specifically: does a high-confidence wrong answer produce the same behavior as a low-confidence answer, or is there a different path? If the answer is "same behavior," that is a gap.
The silent failure case — where the system completes without error but drops or corrupts the record — is the most dangerous and the hardest to test for. Require that every AI system has a reconciliation check: input count in, output count and status out, with explicit alerts when they diverge.
Output validation completeness
Schema validation is not output validation. A model that outputs valid JSON with a numeric amount field containing -9999.00 has passed schema validation. It has not passed output validation.
Review question: for every field in the model output, what are the constraints that the business requires? Not the type constraints — the semantic constraints. Is this amount within a reasonable range for this document type? Does this vendor ID exist in the master data? Does this date fall within a valid fiscal period? Is this GL code active and appropriate for this expense category?
These constraints should be encoded in a deterministic validation layer that runs after every model call, before any downstream action. We use a combination of Pydantic validators for field-level rules and a separate rules engine for cross-field business logic. The two layers together catch most hallucinations that reach this stage.
Schema validation tells you the model produced valid JSON. Output validation tells you the model produced a result the business can actually use. You need both, and they are not the same thing.
Observability hooks
This is what we were worst at on our first deployment. The system worked. It processed documents. It created ERP records. And when a record was wrong, we had no way to answer: which model made this decision, with what input, at what confidence level, and why did it produce this output?
Review question: can you replay any model call from 30 days ago with the exact same inputs? If not, you cannot debug production issues after the fact. Every AI system needs an append-only log of: model version, full prompt (or prompt hash), input data hash, output, confidence scores, and timestamp. This is not optional. It is the audit trail that makes the system defensible to a client when something goes wrong.
Beyond logging, review for alerting completeness. When accuracy drops — measured by human correction rate — who gets notified and how fast? When confidence scores drift downward across a category of documents, is there a dashboard showing that trend or is it invisible until a human notices? When a model call fails and falls back to human review, is that tracked as a metric or just a one-off event?
Our take
Related reading