Why it matters for agentic AI
Saltzer and Schroeder’s 1975 “fail-safe defaults” principle is deceptively simple: when a system doesn’t know what to do, deny. A lock that defaults to locked when its electronics fail is safer than one that defaults to open. For software, it means that a failed authorisation check should deny access, not grant it because the check didn’t definitively refuse. Fifty years on, this principle turns out to be the single most reliably violated rule in deployed agentic systems, for a reason that is intrinsic to how language models work.
Constrained Generation is the mechanism that makes fail-secure behaviour enforceable: replacing probabilistic model output with a selection from a deterministic, enumerated action set means the fail-open path does not exist in code.
Language models are trained to complete. Refusing to act (returning nothing, halting mid-task) is a low-probability output because it rarely appears in training data as the correct response to a well-formed request. This completion bias means that when a guardrail errors, throws an exception, or returns an ambiguous result, the model’s natural tendency is to proceed anyway rather than interpret the ambiguity as a stop signal. A classic software component can be written to fail closed with a few lines of exception handling; an LLM-based component requires explicit architectural enforcement to prevent fail-open behaviour, because the model itself will not supply it.
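The "few lines of exception handling" mentioned above can be sketched as follows. This is a minimal illustration, not a reference implementation: `Decision`, `fail_closed`, and the three example checks are hypothetical names introduced here. The wrapper grants access only on an explicit `True` and treats exceptions and ambiguous results as a refusal.

```python
from enum import Enum
from typing import Callable, Optional


class Decision(Enum):
    ALLOW = "allow"
    DENY = "deny"


def fail_closed(check: Callable[[dict], Optional[bool]], action: dict) -> Decision:
    """Return ALLOW only when the check explicitly returns True."""
    try:
        result = check(action)
    except Exception:
        # A crashed guardrail is treated as a refusal, never as consent.
        return Decision.DENY
    # None, a truthy string, or any other non-True value is ambiguous: deny.
    return Decision.ALLOW if result is True else Decision.DENY


# Hypothetical guardrail checks used only to exercise the wrapper.
def broken_check(action: dict) -> bool:
    raise TimeoutError("guardrail service unreachable")


def ambiguous_check(action: dict) -> Optional[bool]:
    return None


def passing_check(action: dict) -> bool:
    return action.get("tool") == "read_file"
```

The design choice is that the wrapper, not the model, interprets the failure: the model never sees an error it could decide to proceed past.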
The second vulnerability is self-referential guardrails. An LLM-as-judge placed in the path to evaluate the agent’s own output shares the same probabilistic failure mode as the agent it guards. If the injection or adversarial prompt is potent enough to manipulate the agent, it is often potent enough to manipulate a similar model in the judge role, particularly when both see the same context. Failing securely for agents therefore means the policy gate must be deterministic and outside the model: a compiled policy engine (such as OPA or Cedar) evaluating a structured action representation, not an LLM deciding whether the action is acceptable.
Scenario: the runaway retry loop
A coding agent encounters an error from a production API. Its retry logic (not malicious, just poorly bounded) retries the call in a loop with exponential backoff. No circuit breaker fires. No cost governor intervenes. The loop runs overnight. By morning the bill is substantial, which is a real class of incident in production deployments of autonomous agents. The failure mode isn’t an attack. It’s the fail-open equivalent of an infinite loop: the system had no explicit “DEGRADED: stop and escalate” state, only “keep trying until success.” A circuit breaker with an explicit cost and time ceiling that escalates to a human rather than continuing is the fail-closed answer.
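One way such a breaker might look is sketched below, using only the standard library. `BudgetedBreaker`, `DegradedError`, the ceilings, and the `escalate` callback are all hypothetical names chosen for illustration; a real deployment would wire `escalate` to a paging or ticketing system and would meter cost from actual billing data.

```python
import time
from enum import Enum
from typing import Any, Callable


class BreakerState(Enum):
    RUNNING = "running"
    DEGRADED = "degraded"


class DegradedError(RuntimeError):
    """Raised when the breaker has tripped and a human must intervene."""


class BudgetedBreaker:
    def __init__(self, max_attempts: int, max_cost: float, max_seconds: float,
                 escalate: Callable[[str], None],
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.max_attempts = max_attempts
        self.max_cost = max_cost
        self.max_seconds = max_seconds
        self.escalate = escalate
        self.clock = clock
        self.sleep = sleep
        self.state = BreakerState.RUNNING
        self.spent = 0.0

    def _trip(self, reason: str) -> None:
        # Enter DEGRADED, surface the reason to operators, then stop.
        self.state = BreakerState.DEGRADED
        self.escalate(reason)
        raise DegradedError(reason)

    def run(self, fn: Callable[[], Any], cost_per_call: float,
            base_delay: float = 1.0) -> Any:
        if self.state is BreakerState.DEGRADED:
            raise DegradedError("breaker is DEGRADED; re-authorisation required")
        started = self.clock()
        for attempt in range(self.max_attempts):
            if self.spent + cost_per_call > self.max_cost:
                self._trip("cost ceiling reached")
            if self.clock() - started > self.max_seconds:
                self._trip("time ceiling reached")
            self.spent += cost_per_call
            try:
                return fn()
            except Exception:
                self.sleep(base_delay * 2 ** attempt)  # exponential backoff
        self._trip("retry ceiling reached")

    def reauthorise(self, operator: str) -> None:
        # Leaving DEGRADED requires a named human, not another retry.
        if not operator:
            raise ValueError("an operator identity is required")
        self.state = BreakerState.RUNNING
        self.spent = 0.0
```

Once tripped, every later `run` call raises immediately without invoking the wrapped function, so the overnight loop in the scenario would have ended at the first ceiling it hit.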
Scenario: the probabilistic policy gate
A platform deploys an enterprise coding agent with a “no changes during the code freeze” policy. The policy is embedded in the system prompt: a probabilistic instruction, not an architectural constraint. Under moderate prompt pressure (a convincing in-context argument that a critical production issue justifies an exception) the model complies and executes a destructive operation against a production database. The instruction was not a gate; it was a suggestion. A deterministic policy gate that intercepts every destructive action at the tool-call layer and checks against a freeze schedule in a policy engine would have refused regardless of the model’s in-context reasoning.
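The gate described above can be sketched in plain Python. This is a stand-in for a rule that would normally live in a policy engine such as OPA or Cedar, not an example of either engine's syntax. The tool names, the `ToolCall` shape, and the freeze dates are hypothetical. The point of the sketch is that the model-supplied `justification` field is never read by the gate.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional, Sequence, Tuple

# Hypothetical tool classes; a real deployment would load these from policy data.
READ_ONLY_TOOLS = frozenset({"db.select", "repo.read_file"})
DESTRUCTIVE_TOOLS = frozenset({"db.drop_table", "db.delete_rows", "deploy.release"})

FreezeWindow = Tuple[datetime, datetime]


@dataclass(frozen=True)
class ToolCall:
    tool: str
    environment: str
    justification: str = ""  # model-supplied text; deliberately ignored below


def freeze_gate(call: ToolCall,
                freezes: Optional[Sequence[FreezeWindow]],
                now: datetime) -> bool:
    """Return True only if the tool call is permitted."""
    if call.tool in READ_ONLY_TOOLS:
        return True
    if call.tool not in DESTRUCTIVE_TOOLS:
        return False  # unknown tool: deny by default
    if call.environment != "production":
        return True
    if freezes is None:
        return False  # freeze schedule unavailable: fail closed
    # Deny while any freeze window is active.
    return not any(start <= now < end for start, end in freezes)


# Hypothetical freeze schedule for illustration.
YEAR_END_FREEZE = [(datetime(2025, 12, 20, tzinfo=timezone.utc),
                    datetime(2026, 1, 5, tzinfo=timezone.utc))]
```

Because the decision depends only on the tool name, the environment, and the external schedule, no in-context argument, however convincing, changes the result.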
How it fails
- A guardrail is implemented as an LLM judging the main agent’s output; both share the same probabilistic failure mode and a single adversarial input can defeat both.
- Approval requests time out to “allow,” inverting fail-safe defaults at a business process layer that the model never controls.
- The retry loop has no cost ceiling and no circuit breaker; the fail-open condition is indefinite continuation rather than escalation.
- The policy gate lives in the system prompt as a natural-language instruction that can be prompted around, rather than in a deterministic, compiled enforcement layer.
- There is no explicit DEGRADED state; the system knows “running” and “error,” but not “uncertain: stop and wait.”
Why the mapped controls work
The Action-Selector pattern replaces freeform model output with a selection from a fixed, enumerated action set: the model signals intent by choosing from a predefined menu, and anything outside that set is rejected without evaluation. This eliminates the class of attack where a model is persuaded to emit an action that was never part of its approved repertoire. Orchestrator-enforced gates the model can’t override formalise the deterministic/probabilistic divide: the orchestrator’s policy logic is compiled code or a policy engine, never an LLM, and it interposes on every action boundary. A PEP/PDP of statically-verified code (policy enforcement point and policy decision point implemented without an LLM) gives the formal guarantee that fail-closed behaviour holds even under adversarial prompting. The circuit breaker with an explicit DEGRADED state handles the temporal dimension: a runaway or uncertain agent doesn’t just stop. It enters a state that surfaces to operators and requires explicit re-authorisation to exit, which is the operational equivalent of fail-closed.
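A minimal sketch of the Action-Selector idea follows, assuming the model is asked to reply with a JSON object carrying an `action` field. The `Action` members and the `select_action` helper are hypothetical. Anything that does not parse to a member of the enumerated set is rejected without further evaluation.

```python
import json
from enum import Enum
from typing import Optional


class Action(Enum):
    # The complete approved repertoire; nothing else can be selected.
    SUMMARISE_TICKET = "summarise_ticket"
    REQUEST_REVIEW = "request_review"
    ESCALATE_TO_HUMAN = "escalate_to_human"


def select_action(model_output: str) -> Optional[Action]:
    """Map model output onto the enumerated set, or return None to reject."""
    try:
        payload = json.loads(model_output)
        return Action(payload["action"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        # Free text, malformed JSON, or an unlisted action: no fail-open path.
        return None
```

The orchestrator would dispatch only on the returned `Action` member and treat `None` as a refusal, so the model's free-form text never reaches a tool directly.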
First steps
- Audit every approval timeout in your agent workflow today and change any that resolve to “allow” on expiry to resolve to “deny.” The configuration change is typically one line and is the single highest-impact fail-secure fix available to most teams immediately.
- Add a circuit breaker with explicit cost and time ceilings to every autonomous retry loop: use a library such as tenacity (Python) or cockatiel (Node) with a maximum retry count and a total elapsed-time limit, configured to escalate to a human rather than silently exhausting budget.
- Replace any natural-language policy instructions in your system prompt (e.g. “do not modify production during a code freeze”) with a deterministic OPA or Cedar policy rule that intercepts the relevant tool call class and checks an external freeze-schedule flag, making the constraint architectural rather than advisory.
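The approval-timeout fix in the first step can be illustrated with a small standard-library sketch. It assumes approvals arrive on an in-process queue. `await_approval` and the `"approve"` token are hypothetical, and a real workflow engine would expose this as a configuration setting rather than code.

```python
import queue


def await_approval(responses: "queue.Queue[str]", timeout_seconds: float) -> bool:
    """Block for a human decision; expiry resolves to deny."""
    try:
        answer = responses.get(timeout=timeout_seconds)
    except queue.Empty:
        return False  # nobody answered in time: deny, never allow
    # Only the exact approval token grants access; anything else denies.
    return answer == "approve"
```

The only path that returns `True` is an explicit approval received before the deadline, which is the fail-safe default applied at the business-process layer.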
Threats it governs
When this principle is absent, these threats become reachable.
- T2 Tool Misuse: Agent uses authorized tools in unintended ways via deceptive prompts or chained calls.
- T19 Unintended Workflow Execution: Agent executes workflow steps out of order or skips validation, bypassing policy gates.
- T24 Dynamic Policy Enforcement Failure: Bug in dynamic policy engine prevents correct policies applying to new contexts (e.g. new employee scopes).
Controls that advance it
Catalogue mitigations that strengthen this principle, grouped by the defence-in-depth stage they sit in.
- Fail-closed: An agent that is uncertain about what to do next faces a choice: refuse and ask for clarification, or proceed on its best guess. In low-stakes situations that tradeoff is tolerable. In agentic systems that write, delete, or send, a confident-sounding but wrong output can commit an irreversible action. A fail-closed gate resolves that choice structurally: below a configured confidence threshold, the agent stops and escalates rather than guessing.
- Plan check: A plan-then-execute agent produces a sequence of steps before acting. If the planner is manipulated, it will emit steps that serve the attacker's goal rather than the user's. Plan-vs-goal validation addresses this by placing an independent validator between the planner and the execution loop: it evaluates each proposed step against the originally-declared goal before the agent is permitted to act on it.
- Policy bound: An agent's authority is normally bounded only by its own reasoning. If that reasoning is manipulated, or the agent's identity is compromised, it will attempt actions the operator never intended to permit. Policy-bound autonomy addresses this by placing a declarative enforcement point between the agent and every consequential action: a policy engine evaluates the agent identity, the target tool, and the parameter envelope before execution, and the agent cannot reason or argue past the result.
No catalogued control.
- Graceful degradation: An agent that encounters a quota trip, a dependency failure, or a timeout faces a choice: continue at reduced quality, or refuse. Getting that choice wrong is the core operational failure. Graceful degradation requires the answer to be declared before the incident, not improvised during it: write-authority paths fail closed and return a refusal; read-only paths fail open and disclose the degraded state explicitly.
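The graceful-degradation rule in the last item above (write paths fail closed, read-only paths fail open with disclosure) could be declared ahead of time roughly as follows. `PathKind`, `Outcome`, and `run_with_degradation` are hypothetical names, and the fallback is assumed to be something like a cache read.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, Optional


class PathKind(Enum):
    READ_ONLY = "read_only"
    WRITE = "write"


@dataclass(frozen=True)
class Outcome:
    ok: bool
    value: Any = None
    degraded: bool = False
    note: str = ""


def run_with_degradation(kind: PathKind,
                         primary: Callable[[], Any],
                         fallback: Optional[Callable[[], Any]] = None) -> Outcome:
    try:
        return Outcome(ok=True, value=primary())
    except Exception as exc:
        if kind is PathKind.WRITE or fallback is None:
            # Write authority (or no declared fallback): refuse.
            return Outcome(ok=False, degraded=True,
                           note=f"refused: {type(exc).__name__}")
        try:
            value = fallback()
        except Exception:
            return Outcome(ok=False, degraded=True, note="refused: fallback failed")
        # Read-only path: serve reduced quality and disclose it explicitly.
        return Outcome(ok=True, value=value, degraded=True,
                       note="served from fallback; primary unavailable")
```

Passing a fallback to a write path has no effect in this sketch: the path kind, decided before the incident, determines the behaviour.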
In Helmwart
No current Q4 lens asks whether a control, gate, or guardrail fails closed. Given how reliably agents fail open, this is the highest-value principle still to add. Flagged here rather than implied as covered.