AI Engineering9 min read

Context Window Management in Long-Running Enterprise Workflows

A 12-step procurement workflow we shipped last year worked flawlessly through step 6. At step 7, it started approving purchase orders for vendors the system had explicitly rejected at step 3. The context window was full. The agent could no longer see its own earlier decisions. This is the failure mode nobody designs for until it causes a production incident.

Context window limits are documented. Every model provider publishes them. Claude 3.5 Sonnet: 200K tokens. GPT-4o: 128K tokens. These numbers look enormous in isolation. In a long-running agentic workflow with tool calls, retrieved documents, accumulated conversation history, and a substantial system prompt, they evaporate faster than anyone expects.

We got this wrong on the first three multi-step agentic projects we shipped. Not catastrophically — but wrong enough that we spent months retrofitting context management into systems we should have designed for it from the start. What follows is how we think about it now.

The context window is a budget, not a buffer

Most teams treat the context window as a large buffer — a space the agent can fill up as it works. This mental model is wrong in one important way: a buffer is something you drain as you consume it. The context window works like a sliding window that silently truncates what falls off the left edge when it fills up. The agent does not know what got dropped. It cannot tell you. It keeps running with whatever portion of its history fits.

The better mental model: treat the context window as a budget with line items. System prompt: fixed cost, paid on every call. Conversation history: grows with each round-trip. Tool call inputs and outputs: highly variable, often large. Retrieved documents: can be enormous if you are not careful. Current task state: the thing you actually need the model to reason about.

When you model it as a budget, you start asking the right questions: what does this workflow cost per step? At what step does the budget run out? What gets evicted when it does, and what are the consequences?

Step 7is where a standard 12-step procurement workflow hits context pressure — long before the workflow is complete, and long before most teams have designed for it

What actually happens when you exceed the limit

There are two failure modes, and they behave very differently.

The first is a hard error: the API call fails with a context length exceeded error, the workflow crashes, and you know exactly what happened. This is the good failure mode. It is recoverable if you have designed for it. Without a recovery path, it is still a production incident — but at least it is a visible one.

The second is silent truncation: the provider silently drops the oldest tokens from the context to fit within the limit, and the model continues running with a degraded view of its history. This is the dangerous failure mode. The workflow does not crash. The agent keeps producing outputs. Those outputs are based on an incomplete picture of what happened in steps 1 through 6, and there is no error signal telling you any of this.

Watch out

Silent truncation is not a provider bug — it is documented behavior for some configurations. Check whether your provider truncates silently or errors loudly, and design accordingly. Assuming you will get a hard error when you run over is not a safe assumption.

The procurement workflow incident we described earlier was silent truncation. The vendor rejections from step 3 had been in the conversation history. When the context window filled, those rejections fell off. The model did not hallucinate vendor names — it simply re-approved vendors it had once rejected, because it could no longer see the rejection.

Summarization as a compression strategy

The most commonly recommended solution is progressive summarization: periodically compress older conversation history into a concise summary, replacing the full history with the summary in the context. This works and we use it. It also has real costs that people understate.

Summarization loses information. That is the point — it is a lossy compression. What gets lost is unpredictable. A detail that seemed minor at step 2 may be critical at step 10. We had a workflow where a supplier minimum order quantity established at step 2 was summarized away as "initial supplier constraints were discussed," and the agent at step 9 placed an order below the minimum because the specific number was gone.

The practical rule: summarize narrative context aggressively. Never summarize structured decisions, commitments, or constraint values — those go into the structured state store (more on this in the checkpoint pattern section below), not into the context window.

What to keep and what to evict

Not everything in the context window has equal value at every step. A good eviction policy is not "drop the oldest tokens" — that is what providers do when you hit the limit. A good eviction policy is one you design based on what the model actually needs to reason correctly at each step of the workflow.

Things that should almost never be evicted from the active context: the system prompt, the current step's input and task definition, structured decisions made in prior steps (not the narrative — the actual values), and any constraint or rule established earlier that governs later steps.

Things that compress well: intermediate reasoning traces, tool call chains where the final output is what matters and the intermediate steps are scaffolding, retrieved documents that informed a decision already made, and status updates that have since been superseded.

We build a context manifest — a structured description of what is currently in the window, what has been summarized, and what has been moved to the state store. The orchestrator consults this manifest before each step to decide whether the current context is sufficient for the next task. If it is not, it reconstructs the relevant pieces before calling the model.

Stateful memory patterns: pulling from the state store

The right architecture separates ephemeral context — what the model is currently reasoning about — from durable state — what the workflow has decided and what has been written to external systems. The durable state lives in a database, not in the context window.

This connects directly to the checkpoint pattern we use for agentic workflow state management. Each step writes its structured outputs — decisions, extracted values, external write confirmations — to the checkpoint store before the next step starts. The context window carries the narrative. The state store carries the facts.

When context pressure builds, the orchestrator can reconstruct the critical facts from the state store and inject them back into the context as a compact structured summary. "Prior decisions: Vendor A rejected (lead time: 21 days, limit: 14 days). Vendor B selected. PO draft #PO-2024-1147 created. Approval threshold: $50,000. Current total: $43,200." That is 40 tokens conveying what might have been 8,000 tokens of conversation history.

Our take

The checkpoint store and the context window serve different purposes. The checkpoint store is for the system's recovery — it records what was decided and what was written. The context window is for the model's reasoning — it carries the narrative thread. Never use one to substitute for the other.

Measuring and monitoring context pressure in production

Every agentic workflow we ship now includes context utilization tracking: before each model call, we log the token count breakdown — system prompt, history, tool outputs, injected state, current task — against the model's limit. We alert at 75% utilization. We trigger active compression at 85%. We fail the step and route to human review at 95%.

Token counting is not free. The Anthropic and OpenAI tokenizers add overhead to every request if you call them client-side. We use approximation (4 chars ≈ 1 token for English text) for monitoring and reserve exact counting for the compression decision. Accuracy matters most at the boundary, not throughout.

The workflows that fail silently from context overflow are not the ones where you tested with a short task and assumed it would scale. They are the ones where you never measured how much context a realistic 12-step workflow actually consumes. Measure it before you ship. Design the eviction policy before you hit the limit.