Structured Output vs Function Calling: When Each Breaks at Production Scale

The incident that made us rethink function calling

On a supplier onboarding pipeline we shipped in late 2024, the model started calling update_supplier_record when it should have been calling create_draft_supplier. Nothing crashed, because both functions accepted the same parameters. The side effect was that 14 supplier records were partially overwritten before anyone noticed.

The root cause: the system prompt had been updated three days earlier to add context about existing suppliers. The model had reasonably inferred that if a supplier already existed, updating made more sense than creating a draft. Logical reasoning. Wrong business rule.

This is the failure mode nobody warns you about with function calling. The model does not just execute instructions — it reasons about which function to call. That reasoning can be wrong in ways that are completely invisible until a record is in a bad state.

What function calling actually is — and what it is not

Function calling (OpenAI's term) or tool use (Anthropic's term) lets you define a set of available tools — with names, descriptions, and parameter schemas — and let the model decide which tool to call and with what arguments. The model produces a structured tool invocation rather than free text.
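For orientation, a tool definition is little more than a name, a description, and a JSON Schema for the arguments. The sketch below follows the field layout of Anthropic's tool use API (name, description, input_schema). The parameters legal_name and tax_id and the example values are hypothetical, chosen only to match the supplier pipeline described above.

```python
# Hypothetical tool definition in the shape Anthropic's tool use API expects.
# OpenAI's function calling carries the same information in a slightly
# different wrapper (the schema sits under "parameters").
create_draft_supplier_tool = {
    "name": "create_draft_supplier",
    "description": "Create a draft supplier record for review. "
                   "Never modifies an existing record.",
    "input_schema": {
        "type": "object",
        "properties": {
            "legal_name": {"type": "string"},  # assumed field
            "tax_id": {"type": "string"},      # assumed field
        },
        "required": ["legal_name", "tax_id"],
    },
}

# The model answers with a structured invocation rather than free text.
# Illustrative and abridged shape only, not real model output.
example_invocation = {
    "type": "tool_use",
    "name": "create_draft_supplier",
    "input": {"legal_name": "Example Supplier Ltd", "tax_id": "000000000"},
}
```

Note that nothing in this structure tells the model when the tool is the right one to call. That decision rests on the description and the surrounding prompt.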

The appeal is obvious: you get structured, actionable output with built-in reasoning about which action is appropriate. The problem is that "built-in reasoning" is not bounded. The model can call functions in unexpected sequences, invent argument values for optional parameters, or decide that none of your defined tools are appropriate and produce free text instead.

Watch out

Function calling gives you model-side reasoning about what action to take. That is a feature when the reasoning is correct and a failure mode when it is not. Never expose a function whose side effects you cannot recover from.
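One way to make that rule mechanical is to tag every tool with whether its side effects can be undone, and to build the list handed to the model from that tag. This is a minimal sketch under our own assumptions: the Tool class, the reversible flag, and the dispatch function are hypothetical names, not part of any vendor SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    reversible: bool  # True only if the side effect is a draft or can be undone


# Hypothetical registry; the names follow the incident described earlier.
REGISTRY = [
    Tool("create_draft_supplier", reversible=True),
    Tool("update_supplier_record", reversible=False),
]


def exposed_tools(registry):
    """Return only the tools that are safe to hand to the model."""
    return [tool for tool in registry if tool.reversible]


def dispatch(tool_name, registry=REGISTRY):
    """Refuse any call to a tool that was not exposed, even if the model names it."""
    allowed = {tool.name for tool in exposed_tools(registry)}
    if tool_name not in allowed:
        raise PermissionError(f"tool {tool_name!r} is not exposed to the model")
    return tool_name  # a real dispatcher would invoke the handler here
```

The check in dispatch is deliberately redundant with the filtering in exposed_tools: the first controls what the model sees, the second controls what the system will actually run.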

The structured output failure we did not anticipate

Structured output — constraining the model's response to a specific JSON schema — looks simpler. You define the schema, the model fills it. No function selection, no action ambiguity. Just extraction.

The failure mode we hit on a document extraction pipeline: nested arrays. We were extracting line items from invoices — each with a unit price, quantity, and product code. When the invoice had more than about 12 line items, Claude 3 Haiku (the model we were using for cost reasons) would occasionally produce valid-looking JSON with a trailing comma in the line items array. Pydantic would reject it. The document would route to exception handling.

The fix was straightforward: switch to JSON mode with explicit schema enforcement using Claude 3.5 Sonnet for anything with more than 8 line items. But we only found this failure mode after three weeks of production data showed a 6.2% exception rate on multi-line invoices that we could not explain from the logs alone.

6.2% exception rate on multi-line invoices before we identified the trailing-comma failure mode: invisible in logs, visible only in exception queue analysis
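A failure like this is easier to spot when every rejected document carries a machine-readable reason. The sketch below is a standard-library stand-in for that idea, not the Pydantic setup the pipeline used. The line_items key, the reason labels, and the in-memory counter are assumptions for illustration.

```python
import json
from collections import Counter

# Fields named in the invoice example above.
REQUIRED_FIELDS = {"unit_price", "quantity", "product_code"}

# Stand-in for a metrics backend: counts exceptions by coarse reason.
exception_reasons = Counter()


def parse_line_items(raw):
    """Return (items, None) on success or (None, reason) on failure."""
    try:
        payload = json.loads(raw)  # strict parser: a trailing comma fails here
    except json.JSONDecodeError as exc:
        return None, f"invalid_json: {exc.msg}"
    items = payload.get("line_items") if isinstance(payload, dict) else None
    if not isinstance(items, list):
        return None, "missing_line_items"
    for index, item in enumerate(items):
        if not isinstance(item, dict):
            return None, f"item_{index}_not_object"
        missing = REQUIRED_FIELDS - item.keys()
        if missing:
            return None, f"item_{index}_missing_{sorted(missing)[0]}"
    return items, None


def handle(raw):
    """Parse one model response; record why it failed if it did."""
    items, reason = parse_line_items(raw)
    if reason is not None:
        # Bucket by the part before the colon so a spike in one failure
        # type shows up as a single growing counter.
        exception_reasons[reason.split(":")[0]] += 1
    return items
```

With reasons counted this way, a rise in invalid_json on long invoices would appear as a trend rather than as scattered items in an exception queue.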

When to use each approach

This is our working rule after shipping both approaches across 11 production pipelines:

Use structured output when

—The task is extraction — you want structured data out of unstructured input.
—The output schema is known in advance and does not depend on what the model finds.
—You need to process high volume at low cost (Haiku with JSON mode is significantly cheaper than Sonnet with tool use).
—The side effects of an error are a bad field value, not a bad action.

Use function calling when

—The task requires choosing between multiple possible actions, not just extracting data.
—The correct next step depends on what the model finds — and that decision logic is too complex for a routing layer.
—You can afford the latency and cost overhead of Sonnet or Opus-class models.
—Every exposed function is side-effect-safe or fully reversible.

The mistake we made on those first few projects: using function calling because it felt more "agentic" — more sophisticated. Sophistication is not a design goal. Reliability is.

The specific failure modes to design against

Function calling: phantom arguments

When a function has optional parameters, the model will sometimes populate them with plausible-looking invented values. We saw this on a supplier lookup function that had an optional country_code parameter — the model would fill it with a two-letter code that looked correct but was not in our approved list. The fix: remove optional parameters wherever possible, or add validation that treats unexpected values as hard errors rather than warnings.
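The hard-error approach can be sketched as a validator that runs before the lookup executes. The approved codes below are placeholders rather than the real list, and the supplier_name argument and ToolArgumentError class are hypothetical.

```python
# Placeholder values for illustration; the real approved list is not shown here.
APPROVED_COUNTRY_CODES = {"DE", "FR", "NL"}
ALLOWED_ARGS = {"supplier_name", "country_code"}


class ToolArgumentError(ValueError):
    """Raised when the model supplies an argument we refuse to act on."""


def validate_lookup_args(args):
    """Reject unknown arguments and unapproved values instead of logging a warning."""
    unknown = set(args) - ALLOWED_ARGS
    if unknown:
        raise ToolArgumentError(f"unexpected arguments: {sorted(unknown)}")
    if "supplier_name" not in args:
        raise ToolArgumentError("supplier_name is required")
    code = args.get("country_code")
    if code is not None and code not in APPROVED_COUNTRY_CODES:
        raise ToolArgumentError(f"country_code {code!r} is not on the approved list")
    return args
```

A two-letter code that merely looks valid now stops the call. Whether to retry the model or route to a human after that is a separate decision for the calling code.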

Function calling: sequential calls in wrong order

In agentic workflows with multiple available functions, the model may call them in a different sequence than you intended. If function B depends on the output of function A, and the model calls B first, you get either a runtime error (if B requires A's output) or silent wrong behavior (if B can run without A's output but produces a worse result). Explicit sequencing in the orchestration layer is non-negotiable — do not rely on the model to know the dependency order.
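A minimal sketch of sequencing enforced outside the model, assuming a static dependency map. The names function_a and function_b mirror the A and B in the paragraph above; the Orchestrator class is hypothetical.

```python
# Hypothetical dependency map: each tool lists the tools that must finish first.
PREREQUISITES = {
    "function_a": set(),
    "function_b": {"function_a"},
}


class OutOfOrderCall(RuntimeError):
    """Raised when the model requests a tool before its prerequisites have run."""


class Orchestrator:
    def __init__(self, prerequisites):
        self.prerequisites = prerequisites
        self.completed = set()

    def run(self, tool_name, handler):
        """Run handler only if every prerequisite of tool_name has completed."""
        if tool_name not in self.prerequisites:
            raise KeyError(f"unknown tool: {tool_name}")
        missing = self.prerequisites[tool_name] - self.completed
        if missing:
            raise OutOfOrderCall(f"{tool_name} called before {sorted(missing)}")
        result = handler()
        self.completed.add(tool_name)  # marked complete only after success
        return result
```

This turns the silent case (B runs without A and produces a worse result) into the loud one: the call is refused before it executes.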

Structured output: schema drift

When you update a schema — adding a required field, changing a type — existing prompts that were working may start failing. We have seen this on three projects. The model was generating outputs that matched the old schema perfectly, and the validation layer started rejecting them after a schema migration. Track schema versions and pin prompts to schema versions explicitly.
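Pinning can be as simple as recording, next to each prompt, the schema version it was tested against, then validating against that version rather than against whatever is newest. The identifiers below (invoice.v1, invoice.v2, extract_invoice) are made up for illustration.

```python
# Hypothetical schema registry: only the required fields are modelled here.
SCHEMAS = {
    "invoice.v1": {"required": {"product_code", "quantity"}},
    "invoice.v2": {"required": {"product_code", "quantity", "unit_price"}},
}

# Each prompt records the schema version it was written and tested against.
PROMPTS = {
    "extract_invoice": {"schema": "invoice.v1"},
}


def validate(prompt_id, output):
    """Validate output against the schema version the prompt is pinned to."""
    schema_id = PROMPTS[prompt_id]["schema"]
    missing = SCHEMAS[schema_id]["required"] - output.keys()
    if missing:
        raise ValueError(f"{prompt_id} fails {schema_id}: missing {sorted(missing)}")
    return output


def stale_prompts(active_schema):
    """List prompts still pinned to a schema version other than active_schema."""
    return [pid for pid, prompt in PROMPTS.items() if prompt["schema"] != active_schema]
```

Running something like stale_prompts as part of a schema migration surfaces the mismatch at deploy time instead of in the exception queue.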

Both approaches need a validation layer between the model output and your business logic. The model output — whether a function call or a JSON blob — should never be trusted as correct without deterministic validation. If you are skipping this layer to save development time, you are building technical debt that compounds at production volume.

Our strong opinion: most enterprise use cases do not need function calling

This will frustrate some readers. Function calling is genuinely powerful and genuinely useful for the right problems. But most enterprise AI tasks — document extraction, classification, data transformation, summarization — are extraction tasks, not action tasks. They belong in structured output.

The tendency to reach for function calling comes from demo culture. Demos show agents taking actions, calling tools, producing results. That is compelling. In production, an agent calling the wrong function at 2am is not compelling — it is an incident report.

Use the simplest approach that solves the problem reliably. Add complexity only when simpler approaches demonstrably fail. We have learned this the expensive way, so you do not have to.