Safety / Harm-limitation · Principles

Why it matters for agentic AI

Safety in agentic systems is not a separate concern layered on top of security. It is security applied to the dimension of irreversibility. It is the integration point of Reversibility (keeping recovery options open), Resilience and Recovery (rebuilding from a known-good state), and Human Oversight (the gate that cannot be reasoned around). Classical security focuses on preventing unauthorised access and detecting intrusions. Agents add a new threat surface: authorised actions, taken correctly by an agent acting within its nominal scope, that produce harm the human operator did not foresee and cannot undo. The security relevance is that autonomy and speed collapse the interval between a bad decision and an irreversible consequence. A human operator who makes a poor decision usually has seconds to minutes to recognise and correct it. An autonomous agent can commit an irreversible action (sending a mass communication, deleting a production dataset, executing a financial transaction) in the time it takes a human to read the notification that the action was proposed.

The operational form of safety is therefore a specific architectural constraint: prefer reversible actions by default, and enforce human approval before irreversible ones. The approval step must not be a polite prompt the model can reason its way around; it must be a technically enforced gate that holds regardless of what reasoning the agent produces. That gate is the convergence point of three other principles: Defence-in-Depth (the human gate is one layer; model-level harm avoidance is another), Fail-Securely (when in doubt, the safe action is the reversible one), and Resilience (the ability to recover depends on having something to roll back to). Safety is not a separate governance category; it is the name for the security property that those three principles jointly produce at the irreversibility boundary.
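A minimal sketch of that constraint, under stated assumptions: `ToolGateway`, its method names, and the two demo tools are hypothetical and not taken from any particular agent framework. The point it illustrates is that the agent only ever reaches `request`, while `approve` belongs to a separate human-facing path, so nothing in the agent's reasoning can turn a parked proposal into an executed one.

```python
import uuid
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Set, Tuple


@dataclass
class ToolGateway:
    """Hypothetical gateway between the agent and its tools."""
    tools: Dict[str, Callable[..., Any]]
    irreversible: Set[str]  # tool names classified as irreversible
    pending: Dict[str, Tuple[str, Dict[str, Any]]] = field(default_factory=dict)

    def request(self, tool: str, **kwargs: Any) -> Dict[str, Any]:
        """Agent-facing entry point."""
        if tool not in self.tools:
            return {"status": "refused", "reason": "unknown tool"}
        if tool in self.irreversible:
            # Park the exact call; the agent gets an ID back, not a result.
            proposal_id = uuid.uuid4().hex
            self.pending[proposal_id] = (tool, kwargs)
            return {"status": "pending_approval", "proposal_id": proposal_id}
        # Reversible actions run immediately.
        return {"status": "done", "result": self.tools[tool](**kwargs)}

    def approve(self, proposal_id: str) -> Dict[str, Any]:
        """Human-facing entry point; must not be reachable by the agent."""
        tool, kwargs = self.pending.pop(proposal_id)
        return {"status": "done", "result": self.tools[tool](**kwargs)}

    def reject(self, proposal_id: str) -> None:
        """Human-facing: discard the proposal without running it."""
        self.pending.pop(proposal_id, None)


deleted = []  # stands in for real, irreversible side effects


def save_draft(text: str) -> str:
    return f"draft saved: {text}"


def delete_dataset(name: str) -> str:
    deleted.append(name)
    return f"deleted {name}"


gateway = ToolGateway(
    tools={"save_draft": save_draft, "delete_dataset": delete_dataset},
    irreversible={"delete_dataset"},
)
reply = gateway.request("delete_dataset", name="prod")  # parked, not executed
```

Approval executes the stored call rather than a fresh one, so the arguments a human inspected are the arguments that run.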

Blast radius is the complementary concept. Even for actions that are technically reversible, the effort of recovery scales with scope: a mass notification can be retracted but not unread; a bulk database write can be rolled back, but only within a finite rollback window. Limiting blast radius means structuring agent permissions so that an individual agent’s worst-case action, under adversarial steering, model error, or brittle reasoning, affects the smallest possible set of resources. An agent that can write to one customer’s record is less dangerous than one that can write to all of them, even if the intended scope is the same in both cases. Bounded blast radius is therefore a safety design criterion, not merely an operational nicety.
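The one-record-versus-all-records contrast can be sketched as a scope check enforced by the data store rather than by the agent. `WriteToken`, `RecordStore`, and `ScopeError` are hypothetical names chosen for illustration; a real deployment would more likely express the same idea through scoped database credentials or tenant-bound API tokens.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


class ScopeError(PermissionError):
    """Raised when a write falls outside the credential's declared scope."""


@dataclass(frozen=True)
class WriteToken:
    # Hypothetical credential: the only record IDs this agent may write.
    allowed_record_ids: FrozenSet[str]


class RecordStore:
    def __init__(self, records: Dict[str, dict]) -> None:
        self._records = records

    def write(self, token: WriteToken, record_id: str, changes: dict) -> None:
        # The check lives in the store, so it holds whatever the agent was told.
        if record_id not in token.allowed_record_ids:
            raise ScopeError(f"token does not cover record {record_id!r}")
        self._records[record_id].update(changes)

    def read(self, record_id: str) -> dict:
        return dict(self._records[record_id])


store = RecordStore({"cust-1": {"tier": "basic"}, "cust-2": {"tier": "basic"}})
token = WriteToken(allowed_record_ids=frozenset({"cust-1"}))
store.write(token, "cust-1", {"tier": "gold"})  # inside scope: succeeds
```

With this token, the worst case under adversarial steering is a bad write to `cust-1`; a write to `cust-2` raises `ScopeError` before any data changes.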

Scenario: the irreversible action without a gate

An agent is deployed to manage a mailing list. Its instructions are to “send a summary to subscribers on the first of each month.” A prompt injection in a document it processed earlier in the session shifts its goal: it sends a mass email to all subscribers at once, including those who have not opted in. The action completes in under three seconds. By the time the operations team sees the anomaly alert, the emails are delivered. They are technically in a sent state, and no rollback is possible. The harm is regulatory (data-protection exposure) and reputational. A technically enforced gate requiring explicit human confirmation before any action that triggers more than N outbound communications would have stopped the action at the proposal stage, where it could have been inspected, rejected, and the injection investigated.
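The threshold gate described in this scenario might look like the sketch below. `OutboundGate` and its methods are hypothetical, and the limit of 5 is an arbitrary stand-in for N. The count is cumulative across calls as an assumption of this sketch, since a per-call threshold alone could be sidestepped by sending many small batches.

```python
import itertools
from typing import Dict, List, Tuple


class OutboundGate:
    """Hypothetical gate: holds sends once a cumulative limit would be passed."""

    def __init__(self, limit: int) -> None:
        self.limit = limit                      # the "N" from the scenario
        self.delivered: List[str] = []          # stands in for the mail system
        self.held: Dict[int, List[str]] = {}    # proposals awaiting a human
        self.approvals: List[Tuple[int, str]] = []
        self._auto_sent = 0                     # recipients sent without approval
        self._ids = itertools.count(1)

    def submit(self, recipients: List[str]) -> str:
        """Agent-facing: deliver if under the limit, otherwise hold."""
        if self._auto_sent + len(recipients) > self.limit:
            hold_id = next(self._ids)
            self.held[hold_id] = list(recipients)
            return f"held:{hold_id}"
        self._auto_sent += len(recipients)
        self.delivered.extend(recipients)
        return "sent"

    def release(self, hold_id: int, approver: str) -> None:
        """Human-facing: deliver a held batch and record who approved it."""
        self.delivered.extend(self.held.pop(hold_id))
        self.approvals.append((hold_id, approver))


gate = OutboundGate(limit=5)
first = gate.submit(["a@example.com", "b@example.com", "c@example.com"])
second = gate.submit(["d@example.com", "e@example.com", "f@example.com"])
```

Here the second batch would bring the unapproved total to six, so it is held at the proposal stage, where it can be inspected or rejected before anything is delivered.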

Scenario: the unbounded blast radius

A coding agent is given access to a production repository with broad write permissions “so it can work efficiently.” A misunderstood instruction causes it to refactor a shared library in a way that breaks a dozen dependent services. The change is committed, merged, and deployed in an automated pipeline before any human review. The blast radius (all services depending on the library) is a consequence of the permission design, not the specific error. If the agent’s write scope had been limited to a branch, the harm would have been contained to a review artefact. Reversible-by-default design, staging changes in a reviewable state before committing them, would have bounded the blast radius to a proposed change rather than a deployed failure.

How it fails

Irreversible actions have no human gate: the agent can commit, send, delete, or execute without any technically enforced pause for approval.
Blast radius is unbounded because the agent’s write scope is not limited to what the specific task requires.
Harm-limitation is not an explicit design objective. The system is evaluated on capability and speed, and safety is assumed to follow from model-level alignment.
The model’s own harm-avoidance reasoning is the only safety layer, and it can be bypassed by adversarial inputs or unusual reasoning paths.
No recovery path exists for irreversible actions because rollback procedures were never designed in.

Why the mapped controls work

Reversible-by-default design changes the failure mode from “agent commits irreversible harm” to “agent proposes reversible action that can be inspected.” The security value is that it relocates the decision point from inside the agent’s reasoning (which is opaque and can be adversarially steered) to a human-observable proposal that can be checked against intent. Technically enforced human gates before irreversible actions are distinct from model-level prompts asking the agent to seek confirmation: a technical gate cannot be bypassed by injected instructions that tell the agent confirmation is unnecessary. The model cannot reason its way past a gate that exists in the infrastructure layer, not the reasoning layer. Harm-avoidance as an explicit MANAGE objective ensures safety is a first-class property in the system’s design and evaluation cycle, not an assumed by-product of capability. When a model update or a new tool changes the blast-radius profile, there is a process to detect and remediate it.

First steps

Identify the three most consequential irreversible actions your agent can currently take and add a technically enforced confirmation gate for each. Implement this as a tool-gateway policy (an OPA rule or a middleware layer in your agent framework) that pauses execution and routes to a human approval queue when the action matches the defined criteria, so the gate cannot be bypassed by injected instructions.
Audit your agent’s write-scope and limit it to the minimum needed for the current task. If the agent can write to all customer records but its legitimate task only ever touches one, restrict the credential scope to a single-record write token (scoped by record ID or tenant) so that the worst-case blast radius is one affected record, not the entire database.
Define and document an explicit blast-radius budget for each agent capability. State the maximum number of records, users, or external communications the agent can affect in a single session, configure that limit as a hard cap in your tool gateway, and include it in the agent’s capability documentation so that reviewers can verify it has not silently expanded after a model update.
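The blast-radius budget in the last step can be sketched as a per-session counter that the tool gateway charges before each action. `BlastRadiusBudget`, `BudgetExceeded`, and the resource names are hypothetical; the limits shown are placeholders, not recommended values.

```python
from dataclasses import dataclass, field
from typing import Dict


class BudgetExceeded(RuntimeError):
    """Raised when an action would push a session past its declared cap."""


@dataclass
class BlastRadiusBudget:
    # Hard caps per session, e.g. {"records_written": 2, "emails_sent": 50}.
    limits: Dict[str, int]
    used: Dict[str, int] = field(default_factory=dict)

    def charge(self, resource: str, amount: int = 1) -> None:
        """Call before performing the action; raises instead of overspending."""
        if amount <= 0:
            raise ValueError("amount must be positive")
        limit = self.limits.get(resource, 0)  # unlisted resources get zero budget
        spent = self.used.get(resource, 0)
        if spent + amount > limit:
            raise BudgetExceeded(
                f"{resource}: {spent} used + {amount} requested exceeds cap {limit}"
            )
        self.used[resource] = spent + amount


budget = BlastRadiusBudget(limits={"records_written": 2})
budget.charge("records_written")
budget.charge("records_written")
```

Treating an unlisted resource as a budget of zero means a newly added tool is blocked until someone declares a cap for it, which gives reviewers a place to notice silent expansion.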

Threats it governs

When this principle is absent, these threats become reachable.

T6
Intent Breaking and Goal Manipulation: Adversaries manipulate planning, reasoning, or self-evaluation to override goals.
T10
Overwhelming Human-in-the-Loop (HITL): Reviewers are saturated with intervention requests; decision fatigue and HII manipulation make oversight ineffective.

Controls that advance it

Catalogue mitigations that strengthen this principle, grouped by the defence-in-depth stage they sit in.

Prevent

Dual control: An AI agent operating with broad authority can propose actions that are irreversible: deleting records, modifying IAM policies, moving funds. A single human reviewer at the approval gate is a single point of failure: one compromised account, one fatigued reviewer, or one successful social-engineering attempt is enough to commit the action. Human dual-control addresses that by requiring two distinct, independent humans to approve before the action commits.
OOB verify: An agent that can propose payments, update banking details, or modify production configuration is, by construction, a manipulation surface. If the only thing standing between a proposed change and its execution is the agent's own UI, a successful prompt injection or RAG poisoning attack requires no additional steps. Out-of-band verification breaks that dependency by routing a one-use confirmation code through a channel that is structurally separate from the agent's primary interaction channel, so an attacker who controls the agent's context cannot complete the approval without also compromising the user's registered secondary device.
Fail-closed: An agent that is uncertain about what to do next faces a choice: refuse and ask for clarification, or proceed on its best guess. In low-stakes situations that tradeoff is tolerable. In agentic systems that write, delete, or send, a confident-sounding but wrong output can commit an irreversible action. A fail-closed gate resolves that choice structurally: below a configured confidence threshold, the agent stops and escalates rather than guessing.
Blockchain tx guard: A blockchain transaction, once committed, cannot be undone. An agent that signs and broadcasts transactions with no enforcement layer in front of it can exceed its authorised value, call a contract it was never provisioned to reach, or drain a wallet in a runaway loop, and by the time anyone notices the funds are gone. A transaction guard intercepts each proposed transaction before signing, checks it against value bounds, a contract allowlist, a gas or compute-unit limit, and a replay-protection nonce, and refuses to sign anything that falls outside declared policy.
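As one illustration of the Prevent controls above, the four checks a transaction guard performs can be sketched as follows. Everything here is an assumption made for illustration: `ProposedTx`, `TxPolicy`, `TxGuard`, the field names, and the `fake_signer` callback are hypothetical and not tied to any specific chain or wallet library.

```python
from dataclasses import dataclass, field
from typing import Callable, FrozenSet, Optional, Set


@dataclass(frozen=True)
class ProposedTx:
    to: str         # destination contract address
    value: int      # amount in the chain's smallest unit
    gas_limit: int  # gas or compute-unit ceiling requested
    nonce: int      # replay-protection nonce


@dataclass(frozen=True)
class TxPolicy:
    max_value: int
    max_gas: int
    allowed_contracts: FrozenSet[str]


@dataclass
class TxGuard:
    policy: TxPolicy
    seen_nonces: Set[int] = field(default_factory=set)

    def check(self, tx: ProposedTx) -> Optional[str]:
        """Return a refusal reason, or None if the transaction is within policy."""
        if tx.to not in self.policy.allowed_contracts:
            return "destination not on contract allowlist"
        if tx.value < 0 or tx.value > self.policy.max_value:
            return "value outside authorised bounds"
        if tx.gas_limit > self.policy.max_gas:
            return "gas limit above policy ceiling"
        if tx.nonce in self.seen_nonces:
            return "nonce already used"
        return None

    def sign_if_allowed(
        self, tx: ProposedTx, signer: Callable[[ProposedTx], bytes]
    ) -> bytes:
        """Only path to the signing key: refuse anything outside policy."""
        reason = self.check(tx)
        if reason is not None:
            raise PermissionError(reason)
        self.seen_nonces.add(tx.nonce)
        return signer(tx)


def fake_signer(tx: ProposedTx) -> bytes:
    # Placeholder for a real signing call, which this sketch does not model.
    return b"signed"


guard = TxGuard(TxPolicy(max_value=100, max_gas=50_000,
                         allowed_contracts=frozenset({"0xVAULT"})))
ok_tx = ProposedTx(to="0xVAULT", value=10, gas_limit=21_000, nonce=1)
signature = guard.sign_if_allowed(ok_tx, fake_signer)
```

The design choice worth noting is that the guard holds the only route to the signer, so a refused transaction never reaches the key at all.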

Detect

No catalogued control.

Respond

Kill switch: Agentic systems can act faster than a human can intervene through normal channels. A kill switch is the operational guarantee that a named human role can stop agent activity at any scope (single instance, class, or global) through a documented runbook, without requiring a code change or redeployment, and with every invocation written to an audit trail.
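A kill switch with those properties can be sketched as shared state that every agent checks before acting. `KillSwitch`, the role name, and the in-memory audit list are hypothetical; in practice the state would need to live somewhere agents cannot write to, such as a configuration service or feature-flag store.

```python
import time
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class KillSwitch:
    authorised_roles: Set[str]
    global_halt: bool = False
    halted_classes: Set[str] = field(default_factory=set)
    halted_instances: Set[str] = field(default_factory=set)
    audit_log: List[dict] = field(default_factory=list)

    def halt(self, role: str, scope: str,
             target: Optional[str] = None, reason: str = "") -> None:
        authorised = role in self.authorised_roles
        # Every invocation is logged, including refused ones.
        self.audit_log.append({
            "ts": time.time(), "role": role, "scope": scope,
            "target": target, "reason": reason, "authorised": authorised,
        })
        if not authorised:
            raise PermissionError(f"role {role!r} may not invoke the kill switch")
        if scope == "global":
            self.global_halt = True
        elif scope == "class" and target:
            self.halted_classes.add(target)
        elif scope == "instance" and target:
            self.halted_instances.add(target)
        else:
            raise ValueError("scope must be 'global', or 'class'/'instance' with a target")

    def is_halted(self, instance_id: str, agent_class: str) -> bool:
        """Agents (or their gateway) call this before every action."""
        return (self.global_halt
                or agent_class in self.halted_classes
                or instance_id in self.halted_instances)


switch = KillSwitch(authorised_roles={"incident-commander"})
switch.halt("incident-commander", "class", target="mailer", reason="anomalous send volume")
```

Because the halt is a state change rather than a deployment, it takes effect on the next `is_halted` check without a code change.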

In Helmwart

Operationalised through the Defence-in-Depth audit and the HITL launch-blocking finding.