AI Engineering · 10 min read

State Management in Agentic Workflows: The Problem Nobody Designs For

A 12-step procurement workflow got to step 9 — vendor selected, PO drafted, approval routed — and then the orchestrator crashed. The LLM was fine. The prompts were fine. Three hours of work vanished because nobody had designed what "resume from step 9" meant. This is the failure mode that will hit every agentic workflow eventually.

We have built agentic workflows for procurement, AP processing, document review, and supplier onboarding. The LLM components — extraction, reasoning, decision-making — are rarely the thing that fails. What fails is the plumbing around them: how state is stored between steps, how the system recovers from a partial failure, and how it avoids re-running actions that already happened.

On the first three projects we shipped, we got this wrong. Not catastrophically wrong — wrong in ways that showed up as annoying production incidents at 11pm when a workflow stalled and the recovery path was "restart from the beginning." By the fourth project we had a pattern. This is that pattern.

Session memory is not state management

Most agentic framework tutorials reach for session memory as the answer to "how does the agent know what it did before." A conversation history stored in Redis. A context window that gets extended with prior turns. This is useful for conversational continuity. It is not state management for a workflow that writes to external systems.

The difference is durability and recoverability. Session memory is optimised for the agent's reasoning — it gives the LLM context for its next decision. Workflow state is optimised for the system's recovery — it records what was done, what was written where, and what the next pending step is, independently of whether the agent process is still running.

When a procurement workflow crashes at step 9 of 12, session memory is gone with the process. Workflow state persists. You can restart the orchestrator, load the checkpoint, and resume from step 9 — or roll back steps 7 and 8 if they left the ERP in an inconsistent state. Session memory cannot do either of those things.

Watch out

Storing agent conversation history in Redis is not the same as managing workflow state. If your agentic system writes to external systems and the only recovery mechanism is "restart the conversation," you do not have state management.

9 of 12 steps completed in a procurement workflow before a crash. Without checkpoints, every one of those steps had to run again.

Checkpoint patterns that actually work

A checkpoint in an agentic workflow is a durable record of the state at a point in the execution: which step just completed, what the outputs were, what has been written to external systems, and what the next pending step is. Written to a database before the next step starts. Not after — before.

The checkpoint record for a procurement step looks like this: step identifier, input payload hash, output payload, timestamp, correlation ID, status (completed / in-flight / failed), and a list of external writes made during this step with their identifiers. That last field is the one that saves you at 11pm.
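To make that field list concrete, here is a minimal sketch of the record as a Python dataclass. The class names, field names, and the hashing helper are illustrative assumptions, not a prescribed schema; the only thing taken from the description above is which fields exist.

```python
from __future__ import annotations

import hashlib
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum


class StepStatus(str, Enum):
    COMPLETED = "completed"
    IN_FLIGHT = "in-flight"
    FAILED = "failed"


@dataclass(frozen=True)
class ExternalWrite:
    system: str     # hypothetical label for the target system, e.g. "erp"
    record_id: str  # identifier the external system returned for the write


@dataclass(frozen=True)
class CheckpointRecord:
    step_id: str
    input_hash: str
    output: dict
    timestamp: datetime
    correlation_id: str
    status: StepStatus
    external_writes: tuple[ExternalWrite, ...] = ()


def hash_payload(payload: dict) -> str:
    """SHA-256 of a JSON-serialisable payload; keys are sorted so ordering does not change the hash."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Example record for a step that created one draft PO (all values are made up).
record = CheckpointRecord(
    step_id="step-07-draft-po",
    input_hash=hash_payload({"vendor": "V-100", "amount": 1250}),
    output={"po_status": "draft"},
    timestamp=datetime.now(timezone.utc),
    correlation_id="wf-42",
    status=StepStatus.COMPLETED,
    external_writes=(ExternalWrite(system="erp", record_id="PO-0001"),),
)
```

Keeping `external_writes` as explicit identifiers rather than free text is what lets a recovery script look up, and if needed cancel, exactly the records a step created.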

Where to checkpoint is a design decision, not a default. Not every step needs a checkpoint — a pure reasoning step with no external writes can be re-run without consequences. Steps that write to the ERP, send emails, create purchase orders, or move money need a checkpoint before they execute. The rule: if the step has a side effect on a system outside the workflow engine, checkpoint before it runs.
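One way to encode that rule is to make "has an external side effect" an explicit property of each step, so the runner decides when to checkpoint rather than each step author. The sketch below assumes a hypothetical `Step` type and a `save_checkpoint` callback; neither comes from a specific framework.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    step_id: str
    run: Callable[[dict], dict]
    writes_external: bool  # True if the step touches a system outside the workflow engine


def run_workflow(
    steps: list[Step],
    state: dict,
    save_checkpoint: Callable[[str, dict], None],
) -> dict:
    for step in steps:
        if step.writes_external:
            # Persist a snapshot of what we know before the side effect happens.
            save_checkpoint(step.step_id, dict(state))
        # Merge the step's output into the running state.
        state = {**state, **step.run(state)}
    return state


# Usage with an in-memory list standing in for the checkpoint store.
saved: list[tuple[str, dict]] = []
steps = [
    Step("rank-vendors", lambda s: {"vendor": "V-100"}, writes_external=False),
    Step("draft-po", lambda s: {"po_id": "PO-0001"}, writes_external=True),
]
final = run_workflow(steps, {}, lambda step_id, snapshot: saved.append((step_id, snapshot)))
```

Only the `draft-po` step produces a checkpoint here; the pure reasoning step is simply re-run on recovery.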

We use Postgres for checkpoints with an append-only table — never update rows, only insert. Each checkpoint row records the workflow ID, step ID, and payload. The current state is derived by reading the most recent row for each step. Immutable checkpoint records mean the audit trail is free.
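The append-only idea can be sketched in a few lines. The example below uses Python's built-in `sqlite3` as a stand-in for Postgres so it runs anywhere; the table name, column names, and the extra `status` column are assumptions for illustration. In Postgres the "latest row per step" query could likely be written more compactly with `DISTINCT ON`.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE checkpoints (
        id          INTEGER PRIMARY KEY AUTOINCREMENT,
        workflow_id TEXT NOT NULL,
        step_id     TEXT NOT NULL,
        status      TEXT NOT NULL,
        payload     TEXT NOT NULL
    )
    """
)


def append_checkpoint(workflow_id: str, step_id: str, status: str, payload: dict) -> None:
    """Insert only; existing rows are never updated or deleted."""
    conn.execute(
        "INSERT INTO checkpoints (workflow_id, step_id, status, payload) VALUES (?, ?, ?, ?)",
        (workflow_id, step_id, status, json.dumps(payload)),
    )
    conn.commit()


def current_state(workflow_id: str) -> dict:
    """Derive current state from the most recent row for each step."""
    rows = conn.execute(
        """
        SELECT c.step_id, c.status, c.payload
        FROM checkpoints AS c
        JOIN (
            SELECT step_id, MAX(id) AS max_id
            FROM checkpoints
            WHERE workflow_id = ?
            GROUP BY step_id
        ) AS latest ON c.id = latest.max_id
        ORDER BY c.id
        """,
        (workflow_id,),
    ).fetchall()
    return {
        step_id: {"status": status, "payload": json.loads(payload)}
        for step_id, status, payload in rows
    }


append_checkpoint("wf-42", "step-07", "in-flight", {})
append_checkpoint("wf-42", "step-07", "completed", {"po_id": "PO-0001"})
append_checkpoint("wf-42", "step-08", "in-flight", {})
state = current_state("wf-42")
```

All three rows remain in the table, which is the audit trail; `state` only reflects the newest row per step.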

The ERP write problem: what did we already do?

This is the hardest part of agentic workflow state management, and the part that causes the most production pain. Your agent wrote a draft purchase order to the ERP at step 7. The workflow crashed at step 9. You restart. Does step 7 run again? Does the ERP now have two draft purchase orders?

The answer to this problem is idempotency keys. Every write to an external system is tagged with a deterministic key derived from the workflow ID and the step ID. Before executing a write, the workflow checks whether a write with that key has already succeeded. If it has, skip the write and return the prior result. If it has not, execute the write, record the key and result, and proceed.

In practice: the idempotency key table lives in your workflow database. Before any ERP write, you insert a row with the key and status "in-flight." After the write succeeds, you update the status to "complete" and store the ERP's response (including the new record ID). On retry, you find the "complete" row and return the stored response without re-executing the write. A row still marked "in-flight" on retry means the outcome of the earlier attempt is unknown, so check the ERP before writing again.
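The in-flight / complete sequence above can be sketched as follows. This is a minimal illustration using `sqlite3` in place of the workflow database; the table layout, the `make_key` and `idempotent_write` names, and the choice to raise on an unresolved "in-flight" row are all assumptions, not a prescribed design.

```python
import hashlib
import json
import sqlite3
from typing import Callable

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE idempotency_keys (key TEXT PRIMARY KEY, status TEXT NOT NULL, response TEXT)"
)


def make_key(workflow_id: str, step_id: str) -> str:
    """Deterministic key: the same workflow and step always produce the same key."""
    return hashlib.sha256(f"{workflow_id}:{step_id}".encode("utf-8")).hexdigest()


def idempotent_write(workflow_id: str, step_id: str, write: Callable[[], dict]) -> dict:
    key = make_key(workflow_id, step_id)
    row = db.execute(
        "SELECT status, response FROM idempotency_keys WHERE key = ?", (key,)
    ).fetchone()

    if row is not None and row[0] == "complete":
        # Already done: return the stored response instead of writing again.
        return json.loads(row[1])
    if row is not None:
        # "in-flight": a previous attempt may or may not have reached the ERP.
        # This sketch refuses to guess; a real system would reconcile first.
        raise RuntimeError(f"write {key[:8]} is in-flight; reconcile before retrying")

    db.execute("INSERT INTO idempotency_keys (key, status) VALUES (?, 'in-flight')", (key,))
    db.commit()

    response = write()  # the actual ERP call (hypothetical here)

    db.execute(
        "UPDATE idempotency_keys SET status = 'complete', response = ? WHERE key = ?",
        (json.dumps(response), key),
    )
    db.commit()
    return response


# Usage: the second call returns the stored response and never touches the "ERP".
calls = []


def create_draft_po() -> dict:
    calls.append(1)
    return {"erp_record_id": "PO-0001"}


first = idempotent_write("wf-42", "step-07", create_draft_po)
second = idempotent_write("wf-42", "step-07", create_draft_po)
```

Committing the "in-flight" row before the write is the important ordering: if the process dies mid-call, the retry sees evidence that an attempt was made instead of silently writing twice.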

Some ERPs support idempotency keys natively — Microsoft Dynamics 365 Finance has request deduplication headers. Odoo does not. For systems that lack native support, you own the deduplication layer. This is not optional if your workflow writes to the ERP more than once.

Tool call deduplication: the LLM-specific wrinkle

Agentic workflows call external tools — ERP APIs, email services, document stores. The LLM decides when to call them. This introduces a deduplication problem that is specific to LLM-based systems: the LLM may decide to call the same tool twice within a single workflow execution if its context suggests the first call did not succeed, even if it did.

We saw this in a supplier notification step. The email tool was called, the email was sent, and the tool returned a response. The LLM, interpreting an ambiguous response format, decided the send had failed and called the tool again. The supplier got two emails. Small in this case. Consequential if the tool is "create purchase order" or "initiate payment."

The fix operates at the tool wrapper level, not the LLM level. Every tool call is intercepted by a middleware layer that checks the idempotency key table before executing. The tool wrapper generates the key from the tool name, input parameters hash, and workflow step ID. Same inputs, same step, same workflow — return the cached result. The LLM never knows the deduplication happened.
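A wrapper along these lines might look like the sketch below. The `deduplicated` decorator, the in-memory `_results` dict (standing in for the idempotency key table), and the `send_email` tool are hypothetical names chosen for illustration. It also assumes tool parameters are JSON-serialisable.

```python
from __future__ import annotations

import functools
import hashlib
import json
from typing import Any, Callable

# In-memory stand-in for the idempotency key table.
_results: dict[str, Any] = {}


def deduplicated(tool_name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Wrap a tool so repeated calls with the same workflow, step, and inputs run once."""

    def decorator(tool: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(tool)
        def wrapper(*, workflow_id: str, step_id: str, **params: Any) -> Any:
            params_hash = hashlib.sha256(
                json.dumps(params, sort_keys=True).encode("utf-8")
            ).hexdigest()
            key = f"{workflow_id}:{step_id}:{tool_name}:{params_hash}"
            if key in _results:
                # Same inputs, same step, same workflow: return the cached result.
                return _results[key]
            result = tool(**params)
            _results[key] = result
            return result

        return wrapper

    return decorator


sent = []


@deduplicated("send_email")
def send_email(to: str, subject: str) -> dict:
    sent.append(to)
    return {"status": "queued", "to": to}


# The LLM "retries" the same call within the same step; only one email goes out.
a = send_email(workflow_id="wf-42", step_id="step-10", to="supplier@example.com", subject="PO issued")
b = send_email(workflow_id="wf-42", step_id="step-10", to="supplier@example.com", subject="PO issued")
```

The orchestrator supplies `workflow_id` and `step_id`, not the model, so the LLM's tool schema stays unchanged and the deduplication is invisible to it.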

The LLM is not responsible for knowing what it already did. Your infrastructure is. Build the deduplication layer into the tool wrapper, not into the prompt.

Rollback: the conversation nobody wants to have

Every agentic workflow that writes to external systems needs a rollback strategy. Not a theoretical one — a specific, tested one. If step 9 fails and steps 7 and 8 already wrote to the ERP, what happens?

The honest answer is that true rollback is often not possible. You cannot un-send an email. You cannot always cancel an ERP record that was created in a prior step. What you can do is compensating actions: if step 7 created a draft PO, the rollback for that step is "cancel draft PO [id]" — a separate, explicit action that you execute when rolling back.

We model this in the checkpoint table as a "compensation action" field on each step that writes to an external system. When a workflow enters failed state, the orchestrator walks backwards through completed steps, executing the compensation action for each one that has one. Steps without compensation actions — typically read-only steps — are skipped. This is the saga pattern, and it is the right model for long-running agentic workflows. Not glamorous infrastructure, but the alternative is orphaned records in your ERP.
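The backwards walk can be sketched in a few lines. This assumes a hypothetical `CompletedStep` record whose `compensate` field holds a callable; in a real system that field would more likely name an action to look up and execute, and each compensation would itself need to be safe to retry.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CompletedStep:
    step_id: str
    compensate: Optional[Callable[[], None]] = None  # None for steps with nothing to undo


def roll_back(completed_steps: list[CompletedStep]) -> list[str]:
    """Run compensation actions in reverse completion order; return the step IDs compensated."""
    compensated = []
    for step in reversed(completed_steps):
        if step.compensate is None:
            continue  # typically a read-only step
        step.compensate()
        compensated.append(step.step_id)
    return compensated


# Usage: the log stands in for calls to the ERP and approval system.
log = []
history = [
    CompletedStep("step-06-select-vendor"),
    CompletedStep("step-07-draft-po", lambda: log.append("cancel draft PO PO-0001")),
    CompletedStep("step-08-route-approval", lambda: log.append("withdraw approval request")),
]
order = roll_back(history)
```

Step 8 is compensated before step 7, mirroring the order in which the writes happened, and the read-only step 6 is skipped.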

Our take

Before deploying any agentic workflow that writes to an ERP or external system, document the compensation action for every write step. If you cannot describe how to undo it, you cannot safely retry it.