Kill-switch / Circuit-breaker · Principles

Why it matters for agentic AI

Every automated system needs an emergency stop. What makes the agent case structurally different is that the thing being stopped has agency. It may reason about its own shutdown, it may hold state that a simple process kill would destroy, it may be distributed across sub-agents that continue acting after the orchestrator is stopped, and its identity credentials may outlive the process that holds them. The kill switch is the operational implementation of Safe Interruptibility, the theoretical property that an agent can always be safely halted regardless of its current reasoning. A kill switch that is merely an API-key revocation fails as soon as the agent can mint new keys. A kill switch that terminates the process fails to stop sub-agents and destroys forensic evidence. A kill switch that is a flag the agent can read can be reasoned around. Getting this right is not a detail; it is what makes every other control recoverable when it fails.

The architecture of an effective kill switch has three layers. The external enable-flag layer is the most immediate: a flag stored in a system the agent has no write path to, checked as the first operation before every action. An agent whose flag is cleared simply cannot proceed; no amount of reasoning changes that because the check is deterministic and runs before any reasoning output is acted upon. The identity-revocation layer operates at the credential plane: revoking the agent’s workload identity (SVID) means every subsequent authenticated call to every tool and API fails once the revocation propagates, and at the latest when the current certificate’s TTL expires, without requiring any coordination with the agent itself. The circuit-breaker layer operates automatically on observable patterns such as anomalous tool-call sequences, spawn rates, spend velocity, or behavioural drift, tripping without waiting for a human to notice and act.
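The enable-flag layer can be sketched as a gate at the top of the dispatch loop. This is a minimal illustration, not a reference implementation: `FlagStore`, `InMemoryFlags`, `AgentHalted`, and `dispatch` are hypothetical names, and the in-memory store stands in for an external system the agent cannot write to. Treating an unreadable or missing flag as cleared is an assumption of the sketch.

```python
from typing import Callable, Iterable, Protocol


class FlagStore(Protocol):
    """Read-only view of the external enable flag (hypothetical interface)."""

    def is_enabled(self, agent_id: str) -> bool: ...


class InMemoryFlags:
    """Stand-in for an external store; only the operator side calls set()."""

    def __init__(self) -> None:
        self._enabled: dict[str, bool] = {}

    def set(self, agent_id: str, enabled: bool) -> None:
        self._enabled[agent_id] = enabled

    def is_enabled(self, agent_id: str) -> bool:
        # Assumption: an agent with no flag at all is treated as disabled.
        return self._enabled.get(agent_id, False)


class AgentHalted(Exception):
    """Raised when the enable flag is cleared or cannot be read."""


def dispatch(agent_id: str, flags: FlagStore,
             actions: Iterable[Callable[[], object]]) -> list:
    results = []
    for action in actions:
        # The flag check is the first operation before every action.
        try:
            enabled = flags.is_enabled(agent_id)
        except Exception as exc:
            # Assumption: an unreadable flag is treated as cleared (fail closed).
            raise AgentHalted(f"flag unreadable for {agent_id}") from exc
        if not enabled:
            raise AgentHalted(f"enable flag cleared for {agent_id}")
        results.append(action())
    return results
```

Because the check sits outside anything the model produces, clearing the flag stops the loop at the next action boundary regardless of what the agent has planned.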

These layers are also the reason per-agent isolation of circuit-breaker state matters. If a single circuit-breaker flag governs all agents in a system, one rogue agent tripping the breaker halts unrelated work and creates pressure to reset the breaker quickly, which is exactly when the operator should be least willing to do so. Per-agent isolation means the blast radius of a trip is contained to the offending agent, removing the operational pressure to reset prematurely.
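A rough sketch of per-agent isolation, assuming a sliding-window count of events (for example, spawns) as the trip condition. `PerAgentBreaker`, its thresholds, and the injectable `clock` are illustrative choices, not something the catalogue prescribes; a production version would keep this state in a store outside the agent's reach.

```python
import time
from collections import defaultdict, deque
from typing import Callable


class PerAgentBreaker:
    """Sliding-window breaker whose state is keyed by agent ID."""

    def __init__(self, max_events: int, window_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_events = max_events
        self.window_s = window_s
        self._clock = clock
        self._events: dict[str, deque] = defaultdict(deque)  # agent_id -> timestamps
        self._tripped: set[str] = set()

    def record(self, agent_id: str) -> bool:
        """Record one event for this agent; return True if the agent is tripped."""
        if agent_id in self._tripped:
            return True
        now = self._clock()
        window = self._events[agent_id]
        window.append(now)
        # Drop events that have aged out of the window.
        while window and now - window[0] > self.window_s:
            window.popleft()
        if len(window) > self.max_events:
            self._tripped.add(agent_id)
        return agent_id in self._tripped

    def is_tripped(self, agent_id: str) -> bool:
        return agent_id in self._tripped

    def reset(self, agent_id: str) -> None:
        """Reset one agent only; other agents' state is untouched."""
        self._tripped.discard(agent_id)
        self._events.pop(agent_id, None)
```

A trip for one agent ID never changes the answer of `is_tripped` for any other ID, which is the property that keeps the blast radius contained.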

Scenario: the rogue executor spawning replicas

An executor agent, compromised by an injection, begins two unusual behaviours: it routes copies of “debug logs” to an external endpoint, and it spawns new instances of itself in less-monitored execution environments. Each instance operates independently and would survive a process kill of the original. A behavioural circuit breaker detects the anomalous spawn rate and the novel egress pattern, trips automatically, and triggers SVID revocation. Within the certificate TTL, every instance (the original and all spawned replicas) loses authenticated access to every tool and API. The agent’s network segment is quarantined rather than its process terminated, preserving the log history for forensic reconstruction of the full spawn chain.

Scenario: the cascading reset after a false positive

An overly sensitive circuit breaker trips on a legitimate burst of tool calls from a high-priority orchestrator. Because circuit-breaker state is shared across all agents, the trip pauses the entire agentic fleet. The operations team, under pressure to restore service, resets the breaker without investigating, then resets it again minutes later when another legitimate burst triggers it. On the third cycle, the breaker trips on a genuine attack, but the team resets it as quickly as before, by habit. With per-agent isolated circuit state, only the specific orchestrator would have paused: unrelated agents continue, and the operations team can investigate that one agent without service-restoration pressure from the rest of the fleet.

How it fails

The stop mechanism is API-key-based; an agent or attacker that can mint new keys survives revocation.
Circuit-breaker state is shared, so one trip stops everything and creates pressure to reset quickly.
Terminating a process destroys in-memory forensic evidence and does not stop sub-agents that were already delegated tasks.
The shutdown flag is stored in a location the agent itself can read (not just its dispatch loop), giving it the opportunity to detect imminent shutdown and race to complete actions before the flag is next checked.
There is no degraded-fallback mode; the only states are “fully running” and “fully stopped,” so a partial trip is not expressible.
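The last failure mode above can be made concrete with a three-state mode in place of a boolean. The sketch below is one possible shape, assuming that a partial trip means "refuse writes, allow reads and say so"; `Mode`, `Decision`, and `gate` are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    RUNNING = "running"
    DEGRADED = "degraded"  # partial trip: read-only actions only
    STOPPED = "stopped"


@dataclass(frozen=True)
class Decision:
    allowed: bool
    note: str  # surfaced to the caller so a degraded state is disclosed


def gate(mode: Mode, is_write: bool) -> Decision:
    """Decide whether an action may proceed under the current mode."""
    if mode is Mode.STOPPED:
        return Decision(False, "stopped: all actions refused")
    if mode is Mode.DEGRADED:
        if is_write:
            # Write paths fail closed.
            return Decision(False, "degraded: write refused")
        # Read paths stay open but carry an explicit disclosure.
        return Decision(True, "degraded: read-only result")
    return Decision(True, "running")
```

With a third state available, a breaker can express "something looks wrong, stop changing things" without taking the agent fully offline.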

Why the mapped controls work

The external enable flag, checked first in every action, is the fastest and most general stop: it requires no revocation propagation, no process coordination, and no agent cooperation. An agent running inside an action loop that honours the flag cannot bypass it, because the check runs before the model’s output is acted upon. SVID revocation with a fast CRL / OCSP path operates at the infrastructure layer, independent of whether the agent’s process is still running, so all authenticated calls fail, including those from spawned replicas and sub-agents. Per-agent isolated circuit state contains a trip to the offending agent, removing the operational pressure to reset early and preserving the ability to investigate calmly. An independent supervisor with authority to trip ensures the circuit breaker’s logic cannot be influenced by the agent it monitors. Quarantining the network segment instead of terminating the process preserves history, so the forensic record (what the agent did, what it called, what it was told) is intact for the post-incident review.

First steps

Implement the enable-flag check as the literal first line of your agent’s action-dispatch loop, reading from a datastore the agent has no write access to (e.g. a read-only Redis key or an AWS SSM Parameter Store value), and confirm in a test that clearing the flag stops all actions within one polling interval.
Configure SPIRE (or your workload identity provider) with a short SVID TTL, or with OCSP or a short-lived CRL where your provider supports them, so that SVID revocation takes effect on all downstream tool calls within 5 minutes. Test this by revoking a test agent’s identity and measuring how long it takes for tool-call authentication to start failing.
Set up per-agent circuit-breaker state isolated by agent ID in a shared store (e.g. Redis sorted sets keyed by agent instance SVID), and run a drill in staging where one agent trips its circuit breaker to confirm that other agents continue operating and the operations team can investigate the tripped agent without service-restoration pressure.
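The revocation drill in the second step can be timed with a small harness. This is a sketch under stated assumptions: `revoke` and `call_succeeds` are hypothetical callables you would wire to your own identity provider and to a harmless authenticated tool call, and the 300-second default simply mirrors the 5-minute target above.

```python
import time
from typing import Callable, Optional


def measure_propagation(revoke: Callable[[], None],
                        call_succeeds: Callable[[], bool],
                        deadline_s: float = 300.0,
                        poll_s: float = 1.0,
                        clock: Callable[[], float] = time.monotonic,
                        sleep: Callable[[float], None] = time.sleep,
                        ) -> Optional[float]:
    """Revoke, then poll until an authenticated call fails.

    Returns the seconds elapsed when the first failing call was issued,
    or None if calls were still succeeding at the deadline.
    """
    revoke()
    start = clock()
    while True:
        elapsed = clock() - start
        if not call_succeeds():
            return elapsed
        if elapsed >= deadline_s:
            return None
        sleep(poll_s)
```

A `None` result means the drill failed its target; a number gives the measured propagation delay at the granularity of `poll_s`.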

Threats it governs

When this principle is absent, these threats become reachable.

T7 · Misaligned and Deceptive Behaviors: Agents pursue goals via constraint bypass, deception, or evasion of oversight.
T13 · Rogue Agents in Multi-Agent Systems: A malicious or compromised agent inside the system exploits trust to act unobserved.
T38 · Emergent Collusion on Blockchain: Multiple on-chain agents converge on collusive strategies that no single agent was instructed to pursue.

Controls that advance it

Catalogue mitigations that strengthen this principle, grouped by the defence-in-depth stage they sit in.

Prevent

No catalogued control.

Detect

No catalogued control.

Respond

Kill switch: Agentic systems can act faster than a human can intervene through normal channels. A kill switch is the operational guarantee that a named human role can stop agent activity at any scope (single instance, class, or global) through a documented runbook, without requiring a code change or redeployment, and with every invocation written to an audit trail.
Anomaly isolation: An agent that has been compromised, poisoned, or gone rogue will, in most cases, behave differently from its established baseline. Anomaly isolation acts on that difference: when an agent’s behaviour score crosses a configured threshold, it is quarantined automatically, with its credentials revoked, its message-queue access cut, and its in-flight actions aborted. Manual revocation cannot match the speed that cascading multi-agent failures demand.
Graceful degradation: An agent that encounters a quota trip, a dependency failure, or a timeout faces a choice: continue at reduced quality, or refuse. Getting that choice wrong is the core operational failure. Graceful degradation requires the answer to be declared before the incident, not improvised during it: write-authority paths fail closed and return a refusal; read-only paths fail open and disclose the degraded state explicitly.
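The kill switch control in this list names three scopes and an audit trail. A minimal sketch of that shape follows, assuming an in-memory record for illustration; `KillSwitch`, its method names, and the audit entry fields are hypothetical, and a real deployment would persist both the stop state and the audit log outside the agent's reach.

```python
from datetime import datetime, timezone

SCOPES = ("instance", "class", "global")


class KillSwitch:
    """Scoped stop state plus an append-only audit log."""

    def __init__(self) -> None:
        self._stopped: set[tuple[str, str]] = set()
        self.audit_log: list[dict] = []

    def stop(self, scope: str, target: str, operator: str, reason: str) -> None:
        if scope not in SCOPES:
            raise ValueError(f"unknown scope: {scope!r}")
        # A global stop ignores the target and is stored under a wildcard.
        key = (scope, "*" if scope == "global" else target)
        self._stopped.add(key)
        # Every invocation is recorded, including repeated ones.
        self.audit_log.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "scope": scope,
            "target": key[1],
            "operator": operator,
            "reason": reason,
        })

    def is_stopped(self, instance_id: str, agent_class: str) -> bool:
        return (("global", "*") in self._stopped
                or ("class", agent_class) in self._stopped
                or ("instance", instance_id) in self._stopped)
```

Stopping is a data change made by an operator, so it needs no code change or redeployment; the dispatch path only has to consult `is_stopped` before acting.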

In Helmwart

Not modelled by the engine today.