The Real Cost of AI Inference at Scale (And How to Control It)
A document classification pipeline we built looked like a bargain at $0.003 per call. Then it hit 90 million calls per month and the invoice came in at $8K. Not a calculation error. Not a runaway bug. Just the gap between what the pricing page says and what you actually pay when you factor in retries, system prompt overhead, and the fact that nobody batches by default.
The gap between the pricing page and your bill
Every major model provider quotes cost per million tokens. GPT-4o is currently $2.50 per million input tokens. Claude 3.5 Sonnet is $3.00. Gemini 1.5 Flash is $0.075. These numbers are accurate. They are also nearly useless for production budget planning.
Here is what the pricing page does not show: your system prompt is re-sent on every call unless you are using prompt caching. A 2,000-token system prompt across 90 million monthly calls is 180 billion tokens of input before a single user message is processed. At $3.00 per million, that alone is $540K annually. From a system prompt.
We got this wrong on the first two projects we scaled past 10M calls per month. We priced the happy path — one user message, one response, modest context — and ignored everything around it. The system prompt overhead. The retry tokens when the model returns malformed JSON. The logging calls where we echo the full input back to our own tracing system. These are not edge cases. They are the majority of your token spend.
In a typical production pipeline we measure, the actual token ratio is 3.2x the naive estimate. You calculate cost per call on the happy path, then multiply by 3.2 to get close to reality.
Batching vs streaming: which one is killing your budget
Streaming is the right default for user-facing features. When a person is waiting, streaming makes the product feel faster and increases completion rates. But streaming has real infrastructure overhead: a persistent connection per request, server-sent event handling, and — critically — it prevents the request-level batching that cuts costs dramatically for background workloads.
Background pipelines almost never need streaming. Document processing, classification, entity extraction, batch summarization — none of these have a human watching the tokens arrive. Using streaming here is a habit, not a requirement. The Batch API on both Anthropic and OpenAI returns results asynchronously at a 50% cost discount. Same models, same quality, half the price. For a pipeline spending $8K/month, that is $4K back with a one-day migration.
The rule we apply now: any workload without a human waiting on the response gets the Batch API. Streaming is reserved for chat, copilot features, and real-time output. This alone typically cuts 30–40% of inference spend on production systems with mixed workloads.
Caching: what it actually saves
Prompt caching on Anthropic's API caches the prefix of a conversation — typically the system prompt and any fixed context — and charges 10% of the normal input rate on cache hits. The requirement is that the cached prefix is at least 1,024 tokens and the same content appears at the start of subsequent requests within the cache TTL.
In a well-structured pipeline with a 3,000-token system prompt, we consistently see cache hit rates of 78–85% after the first few requests warm the cache. At those rates, you are paying full price on roughly 20% of your input tokens and 10% of the price on the rest. For a $3.00/million input model, your effective rate drops to around $0.82/million on the cached portion.
Semantic caching is a different mechanism entirely. Rather than caching at the API level, you store embedding vectors of past queries and responses in a vector database. When a new query arrives, you check for semantic similarity above a threshold — typically 0.93 cosine similarity — and return the cached response if the match is close enough. This skips the model call entirely.
Semantic caching works well for FAQ-style workloads with predictable query patterns. We have seen 40% cache hit rates on customer support pipelines where the same questions recur. It is a poor fit for document processing or creative generation where every input is genuinely novel. Know which workload you have before investing in the infrastructure.
Our take
Model routing: the decision that moves the needle most
Routing is the highest-leverage cost control we have found. The core idea: not every query needs your most capable model. A request to classify a document as invoice, receipt, or contract does not need Claude 3.5 Sonnet at $3.00/million input tokens. It needs Claude 3 Haiku at $0.25/million. The accuracy difference on a binary classification task is near zero. The cost difference is 12x.
We implement routing as a lightweight classifier that runs before every request. The classifier — itself a small, fast model or a rules-based system — assigns each request to one of three tiers: fast/cheap, balanced, or capable. Fast/cheap handles classification, entity extraction, simple transformations. Capable handles long-form generation, multi-step reasoning, synthesis across many documents. Balanced is the middle ground.
On a production pipeline we rebuilt last year, routing dropped the average cost per call from $0.0031 to $0.0009 — a 71% reduction — with no measurable quality degradation on the tasks routed to cheaper models. The key was defining routing criteria based on task type and complexity, not on ad-hoc guesswork.
Most teams route based on token count. This is wrong. A 50-token query that asks for legal reasoning is harder than a 500-token query asking for a date extraction. Route on task type and expected reasoning depth, not input length.
The hidden costs nobody tracks
Retry logic is invisible on most cost dashboards because it looks like legitimate traffic. When a model returns malformed JSON — which happens roughly 2–4% of the time on structured output tasks without strict output enforcement — the typical implementation retries up to three times. On a high-volume pipeline, 2% error rate with three retries means some fraction of your calls are paying for four model calls to get one result.
The fix is not fewer retries. It is constrained decoding. Both Anthropic and OpenAI support structured output modes that force the model to conform to a JSON schema at the generation level. This drops malformed output rates to near zero and eliminates the retry tax. We enable this on every pipeline that has a defined output schema. No exceptions.
Error handling overhead is a separate category. When a pipeline step fails, the default behavior in most frameworks is to log the full input and output for debugging. If your inputs are 4,000 tokens and you are logging them to a downstream system that charges per character, error logging can become a meaningful cost at scale. We set a token budget on logged payloads — truncating inputs to the first 500 tokens for error logs — and reduced logging costs by 60% with no loss of debugging utility.
Observability tooling is worth auditing separately. Tools like LangSmith, Langfuse, and Helicone are genuinely useful but charge per trace or per token logged. A system processing 5 million calls per day at $0.0008 per trace is paying $4K/month just for observability. Sample your traces in production — 10% sampling catches nearly every systemic issue at 10% of the cost.
Watch out
The $8K pipeline: what actually happened
The pipeline was a document intake system. Every incoming document — contracts, invoices, correspondence — got classified, had key fields extracted, and was routed to the appropriate downstream process. At launch it handled 3,000 documents per day. Eight months later it was processing 90,000 per day after the client expanded to three new business units.
The original cost estimate was based on a single classification call at 800 input tokens and 150 output tokens. Reality had three steps per document: pre-classification, extraction, and a validation call that we added during QA to catch edge cases. The system prompt was 2,800 tokens and was not cached. Retries were running at 6% due to a JSON parsing bug we had not caught in testing. And we were using streaming everywhere because the original prototype was a chat interface.
After the audit: we enabled prompt caching and got an immediate 70% reduction in input token costs on the static prefix. We switched all three pipeline steps to the Batch API and got the 50% discount. We fixed the JSON schema enforcement and dropped retries to 0.3%. We routed the pre-classification step to Haiku. And we consolidated the extraction and validation steps into a single call with a richer schema.
New monthly cost: $1,100. Same throughput, same quality on production evaluation. The work took four days.
Where to start
If you are running a pipeline spending more than $2K/month on inference, run this audit in order:
- 01Measure your actual token ratio. Pull real API logs, calculate average input and output tokens per call, and compare to your original estimate. If reality is more than 1.5x your estimate, find out why before optimizing anything else.
- 02Identify every background workload. List every pipeline step. Mark each one as "human waiting" or "background." Migrate every background step to the Batch API.
- 03Enable prompt caching on every pipeline with a system prompt over 1,024 tokens. Structure prompts so all static content comes first. Check your cache hit rate after 48 hours.
- 04Audit retries. If your retry rate is above 3%, fix structured output enforcement before anything else.
- 05Profile each step by task type. Route simple classification and extraction to a smaller model. Measure quality. Keep routing only for steps where quality holds.
These are not advanced optimisations. They are corrections to defaults that were set for convenience in development and never revisited. Most pipelines have at least two of these problems running simultaneously. Finding them is a spreadsheet exercise with API logs. The engineering to fix them is typically one to three days of work.
Related reading