Agent Fleets — Continuously Operating Agent Infrastructure

Can provenance infrastructure scale to hundreds of long-lived agents that come and go, react to events, checkpoint their state, and produce durable outputs?

Far horizon · agents · checkpointing · fleet management · long-running

— — — — —

The idea

Not a workflow with five agents. An operating layer for agents — long-lived, continuously running groups that react to stimuli (timers, webhooks, inboxes, sensor events), process them in episodes, checkpoint their state as durable records, and resume on demand. Agents are deployed, upgraded, scaled, and retired. The fleet persists even as individual agents come and go.

Think IoT, but for intelligent agents. A fleet of 200 monitoring agents, each watching a different data source, each with its own checkpoint state, each producing outputs when something interesting happens. Some are always active. Some wake on a timer. Some respond to external events. The underlying runtime — built for managing millions of concurrent connections — is architecturally native to this pattern.

How it maps

Fleet Deployment Spec ──→ stimulus arrives ──→ dispatch ──→ load checkpoint ──→ handle ──→ perform action
                                                  │                              │              │
                                                  │  route to correct            │  agent       │  external
                                                  │  agent by type               │  logic/AI    │  effects
                                                  │  + deployment                │  DECISION    │  gated
                                                  │                              │  RECORDED    │
                                                  │                              ▼              │
                                                  │                     new checkpoint ─────────┘
                                                  │                     (durable, resumable)
                                                  │
                                            ┌─────┴───────┐
                                            │ Episode run │  ← one stimulus = one run
                                            │ append-only │    with full provenance
                                            │ event log   │
                                            └─────────────┘

Each stimulus produces an episode — a complete run with its own event log, decision records, and outputs. The checkpoint from one episode becomes input to the next. The agent’s history is the chain of episode runs, each independently replayable.
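The episode chain can be sketched in a few lines. This is a minimal illustration, not the runtime's actual API: `Episode`, `run_episode`, and the `count_alerts` handler are hypothetical names, and the state is a plain dict standing in for a durable checkpoint record.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Episode:
    """One stimulus = one run (hypothetical record shape)."""
    stimulus: dict
    checkpoint_in: dict
    checkpoint_out: dict
    events: tuple  # append-only event log for this run


def run_episode(handler: Callable, checkpoint_in: dict, stimulus: dict) -> Episode:
    events = []
    # The handler works on a copy, so the input checkpoint is never mutated.
    checkpoint_out = handler(dict(checkpoint_in), stimulus, events.append)
    return Episode(stimulus, checkpoint_in, checkpoint_out, tuple(events))


def count_alerts(state: dict, stimulus: dict, record: Callable) -> dict:
    """Example agent logic (assumed): count readings above a threshold."""
    record(("stimulus_received", stimulus["kind"]))
    if stimulus["value"] > 10:
        state["alerts"] = state.get("alerts", 0) + 1
        record(("decision", "alert"))
    else:
        record(("decision", "ignore"))
    return state


history = []
checkpoint = {}
for value in (5, 12, 30):
    episode = run_episode(count_alerts, checkpoint, {"kind": "sensor", "value": value})
    history.append(episode)
    checkpoint = episode.checkpoint_out  # output of one episode feeds the next

# Replaying a single episode needs only its own inputs, not the whole history.
replayed = run_episode(count_alerts, history[1].checkpoint_in, history[1].stimulus)
```

Because each episode stores its input checkpoint and stimulus, any link in the chain can be re-run in isolation, provided the handler is deterministic.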

What makes this more than “run many pipelines”

Deployment as configuration. A fleet spec defines agent types, tool allowlists, schedules, budgets, and checkpoint policies — all as versioned records. Redeploy = new spec version; agents reconfigure automatically.
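As a rough sketch of what a versioned fleet spec might look like, assuming immutable records and a simple integer version (the field names, `redeploy`, and `diff` are illustrative, not a defined schema):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentType:
    name: str
    tool_allowlist: frozenset
    schedule: str           # assumed free-form, e.g. "every 5m" or "on_event"
    budget_units: int
    checkpoint_policy: str  # assumed label, e.g. "every_episode"


@dataclass(frozen=True)
class FleetSpec:
    version: int
    agent_types: tuple


def redeploy(spec: FleetSpec, *agent_types: AgentType) -> FleetSpec:
    """Redeploy = a new spec version; the old record is left untouched."""
    return FleetSpec(version=spec.version + 1, agent_types=tuple(agent_types))


def diff(old: FleetSpec, new: FleetSpec) -> dict:
    """What a runtime would need to reconfigure between two spec versions."""
    old_by = {a.name: a for a in old.agent_types}
    new_by = {a.name: a for a in new.agent_types}
    return {
        "added": sorted(new_by.keys() - old_by.keys()),
        "removed": sorted(old_by.keys() - new_by.keys()),
        "changed": sorted(
            n for n in old_by.keys() & new_by.keys() if old_by[n] != new_by[n]
        ),
    }


watcher = AgentType("watcher", frozenset({"http_get"}), "every 5m", 100, "every_episode")
v1 = FleetSpec(version=1, agent_types=(watcher,))

bigger_watcher = AgentType("watcher", frozenset({"http_get"}), "every 5m", 250, "every_episode")
summarizer = AgentType("summarizer", frozenset({"send_email"}), "on_event", 50, "every_episode")
v2 = redeploy(v1, bigger_watcher, summarizer)

changes = diff(v1, v2)
```

Keeping specs immutable means every past deployment remains available as a record, and reconfiguration reduces to acting on the diff between two versions.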

Stimulus routing. Incoming events are dispatched to the correct agent(s) by a routing step. Timer ticks, webhooks, inbox messages, sensor readings — the fleet handles heterogeneous event sources.
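One way to picture the routing step is a lookup keyed by stimulus type and deployment, as in the diagram above. The `Router` class and the `type` / `deployment` fields on a stimulus are assumptions made for this sketch.

```python
from collections import defaultdict


class Router:
    """Maps (stimulus type, deployment) to the agents that should handle it."""

    def __init__(self):
        self._routes = defaultdict(list)

    def subscribe(self, stimulus_type: str, deployment: str, agent_id: str) -> None:
        self._routes[(stimulus_type, deployment)].append(agent_id)

    def dispatch(self, stimulus: dict) -> list:
        # .get avoids creating empty entries for unknown routes.
        key = (stimulus["type"], stimulus["deployment"])
        return list(self._routes.get(key, []))


router = Router()
router.subscribe("timer", "prod", "agent-001")
router.subscribe("webhook", "prod", "agent-002")
router.subscribe("webhook", "prod", "agent-003")

targets = router.dispatch({"type": "webhook", "deployment": "prod", "body": "..."})
unrouted = router.dispatch({"type": "sensor", "deployment": "prod"})
```

A stimulus with no matching route yields an empty list rather than an error, which leaves the decision about unrouted events to the caller.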

Checkpointing as records. Agent state is not in memory — it’s a sealed, fingerprinted record. Stop an agent, restart it tomorrow, it resumes from its last checkpoint. State lives on disk; computation happens only on activation.
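A sealed, fingerprinted checkpoint could look roughly like the following. The sketch assumes JSON-serializable state, a SHA-256 digest over canonical JSON as the fingerprint, and one file per agent; `seal`, `save`, and `load` are hypothetical helpers, not the runtime's interface.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def seal(state: dict) -> dict:
    """Wrap state with a fingerprint computed over its canonical JSON form."""
    body = json.dumps(state, sort_keys=True, separators=(",", ":"))
    fingerprint = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return {"state": state, "fingerprint": fingerprint}


def save(path: Path, state: dict) -> str:
    record = seal(state)
    path.write_text(json.dumps(record, sort_keys=True), encoding="utf-8")
    return record["fingerprint"]


def load(path: Path) -> dict:
    """Read a checkpoint and refuse to resume if it does not match its fingerprint."""
    record = json.loads(path.read_text(encoding="utf-8"))
    if seal(record["state"])["fingerprint"] != record["fingerprint"]:
        raise ValueError("checkpoint fingerprint mismatch")
    return record["state"]


with tempfile.TemporaryDirectory() as tmp:
    checkpoint_path = Path(tmp) / "agent-017.json"
    state = {"cursor": 42, "seen": ["a", "b"]}
    fingerprint = save(checkpoint_path, state)
    # Nothing is held in memory between activations: resuming is just a load.
    resumed = load(checkpoint_path)
```

Verifying the fingerprint on load means a corrupted or hand-edited checkpoint is rejected before the agent resumes from it.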

Idempotent stimuli. Every stimulus has a deduplication key. Process the same event twice? The second time is a no-op. Essential for reliability when agents number in the hundreds.
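The deduplication behaviour can be sketched as follows. `IdempotentInbox` is a hypothetical name, and the in-memory set is an assumption for brevity; in a durable setup the set of seen keys would itself need to be persisted, for example as part of the checkpoint.

```python
class IdempotentInbox:
    """Delivers each stimulus to the handler at most once per dedup key."""

    def __init__(self, handler):
        self._handler = handler
        self._seen = set()

    def deliver(self, dedup_key: str, payload: dict) -> bool:
        if dedup_key in self._seen:
            return False  # duplicate: no-op
        self._handler(payload)
        # Mark as seen only after the handler succeeds, so a failed
        # attempt can be retried with the same key.
        self._seen.add(dedup_key)
        return True


processed = []
inbox = IdempotentInbox(processed.append)

first = inbox.deliver("webhook:evt-123", {"value": 1})
second = inbox.deliver("webhook:evt-123", {"value": 1})  # redelivery of the same event
third = inbox.deliver("webhook:evt-124", {"value": 2})
```

With this in place, upstream sources are free to retry deliveries without the agent acting twice on the same event.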

Budget and policy. Per-agent budgets, tool allowlists, secret scoping. The fleet controls what agents can do, not just what they compute. An agent exceeding its budget is paused, not crashed.
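A minimal sketch of per-agent gating, assuming budgets are counted in abstract units and each tool call has a known cost (`AgentPolicy`, `Status`, and `authorize` are illustrative names):

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    ACTIVE = "active"
    PAUSED = "paused"


@dataclass
class AgentPolicy:
    budget_units: int
    tool_allowlist: frozenset
    spent: int = 0
    status: Status = Status.ACTIVE

    def authorize(self, tool: str, cost: int) -> bool:
        """Gate an external effect; never raises, only allows or denies."""
        if self.status is Status.PAUSED:
            return False
        if tool not in self.tool_allowlist:
            return False  # denied, but the agent keeps running
        if self.spent + cost > self.budget_units:
            self.status = Status.PAUSED  # over budget: pause, don't crash
            return False
        self.spent += cost
        return True


policy = AgentPolicy(budget_units=10, tool_allowlist=frozenset({"http_get"}))

allowed = policy.authorize("http_get", 6)
denied_tool = policy.authorize("shell", 1)
over_budget = policy.authorize("http_get", 6)
```

Pausing is modelled as a status change rather than an exception, so the agent's checkpoint and history stay intact and an operator can raise the budget and resume it.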

Key properties

Long-lived runs at scale — hundreds of runs that start, checkpoint, stop, restart
Independent failure domains — individual agent failures don’t cascade
Artifact accumulation over time — thousands of episode checkpoints and outputs, fingerprinted, collectable by policy
Clear boundary between runtime (scheduling, checkpointing, routing) and agent logic (AI decisions, tool use)
Fleet-level observation — seeing the health and behavior of the collective, not just individual agents
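Independent failure domains and fleet-level observation can be illustrated together. This sketch assumes each agent is a plain callable and runs them sequentially for simplicity; `run_fleet` and `summarize` are hypothetical helpers.

```python
from collections import Counter


def run_fleet(agents: dict, stimulus: dict) -> dict:
    """Run every agent in isolation and collect a per-agent health status."""
    health = {}
    for agent_id, handler in agents.items():
        try:
            handler(stimulus)
            health[agent_id] = "ok"
        except Exception as exc:
            # One agent's failure is recorded but does not stop the others.
            health[agent_id] = f"failed: {type(exc).__name__}"
    return health


def summarize(health: dict) -> Counter:
    """Collapse per-agent statuses into a fleet-level view."""
    return Counter("ok" if status == "ok" else "failed" for status in health.values())


def healthy_agent(stimulus: dict) -> None:
    pass


def broken_agent(stimulus: dict) -> None:
    raise RuntimeError("data source unreachable")


agents = {"agent-001": healthy_agent, "agent-002": broken_agent, "agent-003": healthy_agent}
health = run_fleet(agents, {"type": "timer"})
summary = summarize(health)
```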
