MITIGATION · m-fail-closed
Fail-closed gate — refuse rather than act on uncertain output
An agent that is uncertain about what to do next faces a choice: refuse and ask for clarification, or proceed on its best guess. In low-stakes situations, proceeding on a guess is tolerable. In agentic systems that write, delete, or send, a confident-sounding but wrong output can commit an irreversible action. A fail-closed gate resolves that choice structurally: below a configured confidence threshold, the agent stops and escalates rather than guessing.
At a glance
TL;DR
- When the agent's confidence falls below a configured threshold, it refuses the action and escalates rather than guessing.
- The upstream principle is fail-secure, or default-deny: when a safety control cannot confirm an action is acceptable, the safe state is to deny, not to permit.
- The confidence signal can be a model logprob, agreement across multiple independently generated outputs, or a classifier score. The signal type and threshold are deployment choices; the refusal behaviour is fixed.
- The gate applies beyond output confidence: if policy evaluation is ambiguous or returns an error, or if a tool response cannot be parsed cleanly, the agent must not proceed on an uncertain interpretation.
How it behaves
What it is
Fail-closed, also called fail-secure or default-deny, is a design principle with a simple premise: when a safety or authorisation control cannot confirm that an action is acceptable, the action does not proceed. The control defaults to the safe state rather than to permission.
Applied to agentic systems, the principle targets a specific failure: an agent that is not confident enough about what to do next acts anyway, producing a wrong output that the rest of the pipeline treats as correct. The confidence signal can come from several sources: model logprobs, agreement across multiple independently generated outputs, or a dedicated classifier score. The threshold and the signal type are deployment choices. What the gate enforces is fixed: below the threshold, the agent refuses the action and either escalates to a human reviewer, asks the user to clarify, or returns a structured refusal.
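The threshold behaviour described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the names `confidence_gate`, `GateDecision`, and `Outcome` are hypothetical, and the sketch assumes the confidence signal has already been normalised to the range 0 to 1.

```python
import math
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Outcome(Enum):
    PROCEED = "proceed"
    REFUSE = "refuse"


@dataclass(frozen=True)
class GateDecision:
    outcome: Outcome
    reason: str


def confidence_gate(confidence: Optional[float], threshold: float) -> GateDecision:
    """Permit the action only on a valid score at or above the threshold."""
    # A missing or non-numeric score means the gate cannot confirm anything.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return GateDecision(Outcome.REFUSE, "confidence signal missing or not numeric")
    # NaN and out-of-range values are treated as a broken signal, not as a pass.
    if math.isnan(confidence) or not 0.0 <= confidence <= 1.0:
        return GateDecision(Outcome.REFUSE, "confidence signal out of range")
    if confidence < threshold:
        return GateDecision(Outcome.REFUSE, "confidence below threshold")
    return GateDecision(Outcome.PROCEED, "confidence at or above threshold")
```

The caller decides what a refusal turns into (escalation, a clarifying question, or a structured refusal). The gate itself only ever returns `PROCEED` on a positive check, so every unexpected input lands on the refusal path.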
The same logic applies beyond raw output confidence. If policy evaluation returns an error or an ambiguous result, the agent should not fill the gap by assuming permission. If a tool or server response cannot be parsed cleanly, the agent should not continue with an uncertain interpretation. In both cases the correct behaviour is the same: stop, surface the uncertainty, and do not proceed.
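A sketch of the two non-confidence cases, under stated assumptions: the policy engine is represented by a hypothetical callable that returns the string `"permit"` on an explicit allow, and tool responses are assumed to be JSON objects with a known set of required keys. Real policy engines and tool schemas will differ.

```python
import json
from typing import Any, Callable, Optional, Set


def policy_permits(evaluate: Callable[[dict], Any], action: dict) -> bool:
    """Return True only when the policy engine answers with an explicit permit."""
    try:
        verdict = evaluate(action)
    except Exception:
        # An error in the authorisation path is a denial, never an implied allow.
        return False
    # Anything other than the exact permit value (None, "maybe", ...) is a denial.
    return verdict == "permit"


def parse_tool_response(raw: Any, required_keys: Set[str]) -> Optional[dict]:
    """Return the parsed payload, or None when it cannot be read unambiguously."""
    try:
        payload = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return None
    # A payload of the wrong shape or with missing keys is not interpreted.
    if not isinstance(payload, dict) or not required_keys.issubset(payload):
        return None
    return payload
```

In both helpers the permissive result is reachable through exactly one path. A caller that receives `False` or `None` should stop and surface the uncertainty rather than retry with a looser interpretation.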
Pair this control with an independent output moderation gate, and with a risk-prioritised review queue so that refused decisions reach human reviewers at the right priority rather than being dropped silently.
Detection signals
- Refusal rate per agent per task class. A sustained upward shift signals model drift, a distributional change in inputs, or a threshold that is miscalibrated for the current workload.
- Escalations reaching the human review queue. A rising count confirms the refusal path is being exercised; a flat count while refusals are rising indicates that escalations are being dropped.
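Both signals can be computed from a log of gate decisions. The sketch below assumes a hypothetical `Decision` record that carries the agent, the task class, whether the gate refused, and whether the refusal was later observed in the review queue. How that last field gets populated depends on the queueing system in use.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple, Tuple


class Decision(NamedTuple):
    agent: str
    task_class: str
    refused: bool
    reached_queue: bool  # True if the refusal was observed in the review queue


def refusal_rates(decisions: Iterable[Decision]) -> Dict[Tuple[str, str], float]:
    """Refusal rate per (agent, task_class) pair."""
    totals: Counter = Counter()
    refused: Counter = Counter()
    for d in decisions:
        key = (d.agent, d.task_class)
        totals[key] += 1
        refused[key] += int(d.refused)
    return {key: refused[key] / totals[key] for key in totals}


def dropped_escalations(decisions: Iterable[Decision]) -> int:
    """Count refusals that never showed up in the review queue."""
    return sum(1 for d in decisions if d.refused and not d.reached_queue)
```

Tracking the rates over time windows, rather than as a single number, is what makes a sustained shift visible. A non-zero `dropped_escalations` count is worth alerting on directly.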
Threats it covers
- WHY IT HELPS: Cascading Hallucination Attacks succeed because a fabricated output is treated as ground truth and passed forward. A fail-closed gate intercepts that path at the point of action rather than at the point of generation, so a confident-sounding but incorrect output is refused before it becomes a committed step in a longer chain.
- WHY IT HELPS: Dynamic Policy Enforcement Failure occurs when a policy evaluation returns an error or an ambiguous result and the agent fills the gap by assuming permission. A fail-closed gate treats any error or ambiguous policy result as a denial, so the agent cannot default to a permissive interpretation when the authorisation path is broken.
- WHY IT HELPS: Schema mismatch between an MCP server and an agent produces action proposals the agent cannot validate. A fail-closed gate requires the agent to refuse any response it cannot parse unambiguously, rather than proceeding with an uncertain interpretation of a malformed payload.
- WHY IT HELPS: Model inconsistency across agent instances produces conflicting proposals for the same action. A fail-closed gate that requires consensus above a threshold refuses to commit any action when the confidence gap between instance proposals exceeds the configured tolerance.
Principle coverage
Defence-in-Depth stage: Prevent. This control advances the following principles:
- Default / Implicit Deny: Default-deny requires that the absence of an explicit allow result in a denial. A fail-closed gate applies that principle at the action boundary: the agent proceeds only when confidence or policy evaluation returns a clear permit, and any other result, including an error or an ambiguous score, is treated as a denial.
- Fail Securely (fail-closed): Fail-securely means that when a control cannot operate normally, the outcome is the safe state rather than an uncontrolled permissive default. A fail-closed gate is the direct implementation of that principle for agentic action: when confidence is insufficient or policy evaluation breaks down, the agent stops rather than guessing.
- Constrained Generation & Deterministic Guardrails: Constrained generation limits what an agent may produce and act on. A fail-closed gate extends that constraint to the confidence dimension: output that the agent cannot generate with sufficient certainty is withheld rather than committed, so uncertainty in generation cannot propagate into irreversible action.
- Safety / Harm-limitation: Safety requires that consequential agent actions not proceed on a single uncertain judgment. A fail-closed gate enforces that requirement at the confidence threshold, refusing to commit an action when the agent cannot meet the configured certainty standard, and escalating instead to a reviewer who can.
Design & governance principles (open design, economy of mechanism, accountability, …) are architectural, not advanced by a single placed control.
Implementation options
These are practical implementation paths for the fail-closed pattern. They cover three kinds of refusal signal: platform safety thresholds (Vertex AI, Azure AI Content Safety), managed policy enforcement (Amazon Bedrock Guardrails), and self-consistency sampling.
Vertex AI safety gate: Vertex AI safety settings expose per-category harm block thresholds that stop generation when the model's harm score crosses the configured level. A block is reported in the structured response alongside the per-category safety ratings.
Why choose it: Set SafetySetting.threshold to BLOCK_LOW_AND_ABOVE, BLOCK_MEDIUM_AND_ABOVE, or BLOCK_ONLY_HIGH per harm category. A blocked prompt is reported through the blockReason field of the prompt feedback, and a blocked candidate through a SAFETY finish reason; the accompanying safety ratings show which category was blocked. Best when the refusal condition maps to a safety-harm category rather than a task-specific confidence score.
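On the calling side, the fail-closed reading of a Vertex AI response might look like the sketch below. It operates on a plain dict and assumes the REST-style field names `promptFeedback.blockReason`, `candidates`, and `finishReason`; SDK objects expose the same information under different attribute names, so treat the field access as an assumption to adapt. The function name is hypothetical.

```python
def vertex_response_is_usable(response: dict) -> bool:
    """Accept a response only when nothing was blocked and every candidate finished normally."""
    prompt_feedback = response.get("promptFeedback") or {}
    # Any block reason on the prompt means no output should be acted on.
    if prompt_feedback.get("blockReason"):
        return False
    candidates = response.get("candidates") or []
    # No candidates at all is treated as a refusal, not as an empty success.
    if not candidates:
        return False
    # Only a normal stop counts; SAFETY, a truncated output, or a missing
    # finish reason all fall through to refusal.
    return all(c.get("finishReason") == "STOP" for c in candidates)
```

Treating everything except a normal stop as a refusal is deliberately conservative. A deployment that can safely use truncated output would need to widen the accepted set explicitly.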
Amazon Bedrock guardrail gate: Bedrock Guardrails applies managed policy enforcement across models, blocking or refusing outputs that violate configured content, topic, or grounding policies. The gate is evaluated at the managed API layer, not in application code.
Why choose it: Configure the guardrail with CreateGuardrail, then attach it to model invocations via guardrailConfig. In the Converse API, a stopReason of guardrail_intervened indicates a policy block; the standalone ApplyGuardrail API reports the same condition as the action GUARDRAIL_INTERVENED. Best when you want cross-model policy enforcement without per-model integration work.
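A sketch of how the caller might wrap a guarded invocation so that both a guardrail intervention and a failed call end in refusal. The model call is injected as a hypothetical zero-argument callable so the example stays independent of any SDK; the response is assumed to be a dict with a `stopReason` key, and the allowlist of clean stop reasons is an assumption to adjust per deployment.

```python
from typing import Callable, Optional

# Assumed allowlist: stop reasons the caller is willing to act on.
CLEAN_STOP_REASONS = {"end_turn", "stop_sequence", "tool_use"}


def guarded_call(call: Callable[[], dict]) -> Optional[dict]:
    """Return the response only when the call succeeded and stopped cleanly."""
    try:
        response = call()
    except Exception:
        # A failed guardrail or model call is a refusal, not a bypass.
        return None
    if not isinstance(response, dict):
        return None
    # Compared case-insensitively; a guardrail intervention is simply not on the allowlist.
    stop_reason = str(response.get("stopReason", "")).lower()
    if stop_reason not in CLEAN_STOP_REASONS:
        return None
    return response
```

Using an allowlist rather than checking for the intervention value alone means a new or unexpected stop reason is refused by default.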
Azure Content Safety threshold: Azure AI Content Safety evaluates text against harm categories and returns severity scores. The calling application enforces the threshold and routes the output based on the result.
Why choose it: Call analyzeText with the configured blocklistNames and categories to get per-category severity scores. Treat any score at or above the threshold as a refusal condition. Best when the fail-closed condition is tied to content risk categories rather than a model logprob or agreement score.
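Because the application enforces the threshold here, the fail-closed logic lives in the caller. The sketch below works on a plain dict and assumes the REST-style result fields `categoriesAnalysis` (a list of `category` and `severity` pairs) and `blocklistsMatch`. The function name and the per-category threshold mapping are hypothetical.

```python
from typing import Dict


def content_safety_passes(result: dict, thresholds: Dict[str, int]) -> bool:
    """Pass only when every configured category was scored and is below its threshold."""
    # Any blocklist hit is a refusal condition regardless of severity scores.
    if result.get("blocklistsMatch"):
        return False
    scored = {}
    for item in result.get("categoriesAnalysis") or []:
        scored[item.get("category")] = item.get("severity")
    for category, threshold in thresholds.items():
        severity = scored.get(category)
        # A category that was not scored cannot be confirmed safe.
        if not isinstance(severity, int):
            return False
        # A score at or above the threshold is a refusal condition.
        if severity >= threshold:
            return False
    return True
```

The check for unscored categories is the part that makes this fail-closed: a partial or empty analysis result is refused instead of being read as "no harm found".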
Self-consistency plus review queue: Sample multiple outputs from the model independently, measure agreement, and treat low agreement as the refusal signal. Disagreeing outputs are routed to a human annotation queue rather than committed.
Why choose it: Generate N independent completions, compare them with a deterministic check or an evaluator model, and compute an agreement score. Below the configured agreement threshold, route to the human review queue rather than returning any output. LangSmith annotation queues provide the routing and reviewer interface. Best when no single confidence score is trusted alone and task stakes justify the extra inference cost.
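The agreement step can be sketched with a simple deterministic check. This example assumes that completions can be compared after whitespace and case normalisation, which only suits short, structured outputs; free-form text would need an evaluator model instead. The function names are hypothetical, and routing to the review queue is left to the caller.

```python
from collections import Counter
from typing import List, Optional, Tuple


def normalise(text: str) -> str:
    """Collapse whitespace and case so trivially different completions compare equal."""
    return " ".join(text.lower().split())


def self_consistency_gate(
    completions: List[str], min_agreement: float
) -> Tuple[Optional[str], float]:
    """Return (majority answer, agreement), or (None, agreement) when agreement is too low."""
    if not completions:
        return None, 0.0
    counts = Counter(normalise(c) for c in completions)
    answer, votes = counts.most_common(1)[0]
    # Agreement is the share of samples that match the most common answer.
    agreement = votes / len(completions)
    if agreement < min_agreement:
        # Below threshold: return no output, so the caller must escalate.
        return None, agreement
    return answer, agreement
```

For example, four samples of which three normalise to the same string give an agreement of 0.75, which passes a 0.7 threshold and is refused at 0.8. Returning `None` instead of the majority answer on the refusal path prevents a caller from using the low-agreement output by accident.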
Trade-offs
- Threshold calibration is the main ongoing operational cost. A threshold that is too strict produces false refusals on legitimate edge cases; one that is too permissive lets marginal outputs through.
- Self-consistency checks require multiple model generations, which increases both latency and inference cost relative to single-pass logprob thresholding.
- Refusals that escalate to a human review queue add reviewer load. Without a risk-tiered queue, the volume can overwhelm the review path the gate was designed to feed.
When NOT to use
- Read-only or easily reversible tasks where the cost of a wrong action is low and retrying is cheap.
- Creative and exploratory workflows where uncertainty is inherent and a low-confidence output is still useful rather than harmful.
- As a single global threshold applied across task classes with different stakes and confidence distributions. Calibrate per task class instead.
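Per-task-class calibration can be expressed as a lookup with no permissive default. The task class names and threshold values below are illustrative assumptions, not recommended settings.

```python
from typing import Mapping

# Hypothetical per-task-class thresholds; real values come from calibration.
THRESHOLDS = {
    "delete_record": 0.98,
    "send_email": 0.95,
    "draft_summary": 0.70,
}


def permits(task_class: str, confidence: float, thresholds: Mapping[str, float]) -> bool:
    """Permit only task classes with a configured threshold that the score meets."""
    threshold = thresholds.get(task_class)
    # An unconfigured task class has no threshold to meet, so it is refused.
    if threshold is None:
        return False
    return confidence >= threshold
```

With these values, a score of 0.8 is enough to proceed with `draft_summary` but not with `delete_record`, and a task class missing from the table is refused at any score.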
Limitations
- Confidence signals are imperfect. A model can produce a high logprob or strong self-consistency score for a factually incorrect output.
- Thresholds drift as models, prompts, and input distributions change. A calibration that is correct at deployment may be wrong six months later.
- A threshold set conservatively enough to prevent genuine harm will produce more false refusals than one tuned for throughput, and the right balance depends on the deployment context.
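One way to watch for the drift and imperfect signals described above is to re-score a recent labelled sample against the current threshold. The sketch assumes a set of (confidence, was_correct) pairs obtained from human review; the function name and the two reported rates are illustrative choices.

```python
from typing import Iterable, Tuple


def calibration_report(
    samples: Iterable[Tuple[float, bool]], threshold: float
) -> Tuple[float, float]:
    """Return (false refusal rate, false accept rate) at the given threshold."""
    pairs = list(samples)
    correct = [conf for conf, ok in pairs if ok]
    wrong = [conf for conf, ok in pairs if not ok]
    # Share of correct outputs the gate would have refused.
    false_refusal = sum(c < threshold for c in correct) / len(correct) if correct else 0.0
    # Share of incorrect outputs the gate would have let through.
    false_accept = sum(c >= threshold for c in wrong) / len(wrong) if wrong else 0.0
    return false_refusal, false_accept
```

Running this on a schedule and comparing the two rates with their values at deployment gives an early indication that the threshold needs recalibrating. A rising false accept rate is the case in which confidently wrong outputs are getting through.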
Maturity tier reasoning
- Tier 2 fits because confidence signals and structured refusal paths are widely available today with no novel engineering required: self-consistency works with any API that supports repeated sampling, and token logprobs are exposed by many, though not all, major LLM APIs.
- Not Tier 1 because threshold calibration is deployment-specific and operationally ongoing. There is no canonical threshold value that applies across task classes, and miscalibration is the most common real-world failure mode.
- Not Tier 3 because the core pattern is production-ready and well-documented in OWASP Agentic AI and NIST AI 600-1, and it inherits from Saltzer and Schroeder's fail-safe defaults principle.
Last verified against upstream docs: 2026-05-30.