System Manual
The Harness
The reliability layer that wraps every agent run: credit pre-checks, trust gates, a circuit breaker, retries, timeouts, dynamic model escalation, and a second-opinion evaluator on high-stakes agents.
Why no agent runs raw
Every run is wrapped by the harness — the layer that makes agents safe to run unattended. In order, a single call through the harness enforces a credit pre-check, a model downshift, a trust gate, a circuit-breaker check, a retry loop with timeouts, dynamic model escalation, and — for high-stakes agents — a second-opinion evaluator. The orchestrator invokes the harness for every node; nothing reaches your systems without passing through it.
The defaults at a glance
| Max retries | 2 (3 attempts total) |
| Backoff | Exponential — 800 ms, then 1,600 ms |
| Per-attempt timeout | 120 seconds (120,000 ms) |
| Circuit breaker threshold | 5 consecutive failures → circuit opens |
| Circuit breaker reset | Auto-resets 5 minutes (300,000 ms) after opening |
| Escalation threshold | Fast-tier (Gemini Flash) confidence < 0.72 → retry once on claude-opus-4-6 (plan-gated) |
| Evaluator | Runs after the 6 high-stakes agents; can force triage or request an Opus retry |
Retries, timeout & circuit breaker
A run is retried up to 2 times with exponential backoff — the first retry waits 800 ms, the second 1,600 ms. Each attempt races against a 120-second timeout: an agent that hangs is treated as a failed attempt rather than blocking the run forever.
The circuit breaker
Failures are counted per agent, per organization. After 5 consecutive failures the circuit opens and further calls fast-fail for 5 minutes before auto-resetting — preventing a broken integration or a model outage from burning credits in a retry loop. The breaker state lives in the agent_circuit_breakers table (with an in-memory fallback), so it is shared across processes. A successful run resets the count to zero and closes the circuit.
Dynamic model escalation
When a fast-tier (Gemini Flash) agent returns confidence below 0.72, the harness retries the run once on Opus (claude-opus-4-6) and keeps the escalated result only if confidence actually improved. If Opus does no better, the original output stands and a failed escalation is non-fatal.
allowOpusEscalation is true (Growth and Enterprise). It also never fires on a run that was already downshifted or already using a model override — escalation is for a genuine low-confidence first pass, not a second guess.Plan-gated model downshift
Before the run even starts, the credit meter resolves the model the agent may use. If the agent's default model exceeds the plan's allowed tiers, it is downshifted to the best model the plan permits — the agent still runs, just on a cheaper brain. Downshift and escalation are two ends of the same dial: the plan sets the ceiling, confidence can reach toward it.
The evaluator on high-stakes agents
A defined set of 6 high-stakes agents trigger an independent EvaluatorAgent after a successful run. The evaluator grades the output (scored out of 100) and returns a suggested action.
// HIGH_STAKES_AGENTS — trigger the evaluator after a successful run
const HIGH_STAKES_AGENTS = new Set([
"ap_specialist",
"payroll_processor",
"tax_filer",
"capex_approver",
"contract_reviewer",
"fraud_detector",
]);The evaluator can only make a run more cautious, never less. Its two non-pass verdicts are:
- triage — forces the run into the triage inbox, attaching the evaluator's score and the issues it found.
- retry_opus — re-runs the agent on Opus (only on plans that allow Opus escalation), then uses that output.
Worked example: a low-confidence payroll run
Example · Payroll Processor on a Growth plan
A payroll run starts. The credit meter confirms credits remain and that Growth allows the premium tier, so no downshift. The agent runs and succeeds, but because payroll_processor is high-stakes, the evaluator grades the output and scores it 62/100, flagging an unusual deduction.
The evaluator returns triage. The harness overrides the run to requires_triage: true with the reason "Evaluator forced triage (score 62/100): unusual deduction", resets the circuit, and records a trust outcome. The run lands in the inbox for a human — even though the agent itself was confident.
◳ Screenshot
An agent run trace timeline: attempt 1 (model, confidence), optional escalation step (Opus retry, new confidence), evaluator verdict for high-stakes agents (score / suggested action), credits consumed, and the final outcome badge — completed, triaged, or failed.