EVIDENCE TRAIL

Fail closed on low confidence

Verbatim excerpts from the upstream sources cited on the mitigation page, with what each source does and does not prove. The title “fail closed on low confidence” is Helmwart’s normalised label — only one upstream document uses the phrase “fail closed” verbatim in an agentic context (OWASP Top 10 Agentic 2026, §ASI02).

Last cross-checked against upstream sources: 2026-05-29 · 8 sources

References

Each entry shows what the source supports and what it does not prove.

Reference 1

Version 2026 · published December 2025

OWASP Top 10 for Agentic Applications 2026

§ASI02 Tool Misuse and Exploitation — Mitigation 7 “Semantic and Identity Validation (Semantic Firewalls)”

“Fail closed on ambiguous resolution and prompt for user disambiguation.”

Supports: Verbatim use of the phrase “fail closed” in an agentic context. Closest upstream wording match for this control’s title.

Does not prove: Frames the refusal trigger as ambiguous tool resolution, not a numeric model-confidence score. Helmwart generalises the trigger to any low-confidence signal.
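
To illustrate the excerpt above, here is a minimal Python sketch of failing closed when a requested tool name resolves ambiguously. The registry layout, the `Resolution` type, and `resolve_tool` are hypothetical names invented for this example; neither OWASP nor Helmwart prescribes an implementation.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set, Tuple


@dataclass(frozen=True)
class Resolution:
    """Outcome of resolving a requested tool name (hypothetical type)."""
    status: str                   # "resolved" or "needs_disambiguation"
    tool: Optional[str]           # set only when exactly one tool matched
    candidates: Tuple[str, ...]   # options to show the user when asking


def resolve_tool(requested: str, registry: Dict[str, Set[str]]) -> Resolution:
    """Resolve a name against a registry of {tool_name: aliases}."""
    wanted = requested.strip().lower()
    matches = tuple(sorted(
        name for name, aliases in registry.items()
        if wanted == name.lower() or wanted in {a.lower() for a in aliases}
    ))
    if len(matches) == 1:
        return Resolution("resolved", matches[0], matches)
    # Zero or several matches: fail closed. No tool is selected, and the
    # caller is expected to ask the user which candidate was intended.
    return Resolution("needs_disambiguation", None, matches)
```

With a registry in which both `send_email` and `send_sms` carry the alias "mail", calling `resolve_tool("mail", registry)` selects no tool and returns both names as candidates for the user to choose between.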


Reference 2

Version 2026 · published December 2025

OWASP Top 10 for Agentic Applications 2026

§ASI09 Human-Agent Trust Exploitation — Mitigation 5 “Adaptive Trust Calibration”

“Adaptive Trust Calibration: Continuously adjust the level of agent autonomy and required human oversight based on contextual risk scoring. Implement confidence weighted cues (e.g., “low-certainty” or “unverified source”) that visually prompt users to question high-impact actions, reducing automation bias and blind approval.”

Supports: Establishes confidence signals as the input that should drive escalation and human attention in agentic systems.

Does not prove: Names a UI cue (visual prompt) and an autonomy adjustment, not a hard refusal gate. Helmwart hardens the cue into a refuse-or-escalate path.


Reference 3

v1.1 · published December 2025

OWASP Agentic AI — Threats & Mitigations v1.1

§T10 Overwhelming Human in the Loop — Mitigation

“Develop advanced human-AI interaction frameworks, and adaptive trust mechanisms. These are dynamic AI governance models that employ dynamic intervention thresholds to adjust the level of human oversight and automation based on risk, confidence, and context.”

Supports: Names “dynamic intervention thresholds … based on risk, confidence, and context” as the lever that should toggle automation vs. human oversight — the same threshold-on-confidence mechanic this control uses.

Does not prove: T10’s framing is about preventing reviewer overload, not refusing at low confidence per se. Adjacent rationale, not identical.
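
One way to read "dynamic intervention thresholds" is as a confidence bar that rises with risk. The sketch below assumes a confidence score in [0, 1] and three risk tiers; the threshold values and the names `Decision` and `decide` are illustrative assumptions, not figures taken from the OWASP document.

```python
from enum import Enum
from typing import Optional


class Decision(Enum):
    PROCEED = "proceed"
    ESCALATE = "escalate"   # hand off to a human reviewer
    REFUSE = "refuse"


# Hypothetical thresholds: the higher the risk, the more confidence is
# required before the agent may act without human oversight.
THRESHOLDS = {"low": 0.70, "medium": 0.85, "high": 0.95}


def decide(confidence: Optional[float], risk: str,
           reviewer_available: bool) -> Decision:
    """Gate an action on confidence, risk tier, and reviewer availability."""
    # A missing, non-numeric, NaN, or out-of-range score, or an unknown
    # risk tier, is treated as low confidence: fail closed.
    if (not isinstance(confidence, (int, float))
            or not (0.0 <= confidence <= 1.0)
            or risk not in THRESHOLDS):
        return Decision.REFUSE
    if confidence >= THRESHOLDS[risk]:
        return Decision.PROCEED
    # Below the bar: escalate if someone can review, otherwise refuse.
    return Decision.ESCALATE if reviewer_available else Decision.REFUSE
```

In this sketch the same score of 0.9 proceeds for a low-risk action but escalates for a high-risk one, and it refuses outright when no reviewer is available.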


Reference 4

v2.0.1

OWASP Secure Coding Practices Quick Reference Guide

§Error Handling and Logging (Secure Coding Practices Checklist). Related items appear in §Authentication and Password Management, §Access Control, and §System Configuration — all of the form “… should fail securely.”

“Error handling logic associated with security controls should deny access by default.”

Supports: Establishes the foundational fail-secure / default-deny pattern that this AI control specialises.

Does not prove: Does not name LLMs, agents, or confidence signals. Generic secure-coding guidance.
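
The quoted checklist item can be sketched in a few lines of Python: if the security check itself errors, the wrapper denies. The function name `is_allowed` and the logger name are hypothetical, and the check is assumed to be any callable that returns a boolean.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger("authz")  # hypothetical logger name


def is_allowed(check: Callable[[Any], bool], request: Any) -> bool:
    """Run a security check and deny access if the check itself fails."""
    try:
        verdict = check(request)
    except Exception:
        # The control could not reach a decision, so the error path
        # denies by default instead of letting the request through.
        logger.exception("authorization check failed; denying by default")
        return False
    # Only an explicit True grants access; any other value is a denial.
    return verdict is True
```

Treating anything other than an explicit `True` as a denial also covers checks that return a truthy but unexpected value.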


Reference 5

Published July 2024

NIST AI 600-1 — Generative AI Profile (NIST AI RMF)

MEASURE 2.6 — “AI system is evaluated regularly for safety risks”

“The AI system to be deployed is demonstrated to be safe, its residual negative risk does not exceed the risk tolerance, and it can fail safely, particularly if made to operate beyond its knowledge limits.”

Supports: Names the ability to “fail safely, particularly if made to operate beyond its knowledge limits” as a deployment-evaluation requirement. The “knowledge limits” framing is the closest NIST analogue to a confidence-threshold gate.

Does not prove: Does not specify how the fail-safe trigger fires, or that confidence/logprob signals are the trigger. (Earlier Helmwart copy cited MEASURE 2.7. That subcategory covers security and resilience evaluation, not fail-safety, so the citation was a misattribution and has been corrected to 2.6.)


Reference 6

Bundled with Threats & Mitigations v1.1 · December 2025

OWASP Agentic AI Mitigation Playbook P2 (Memory Poisoning)

No verbatim excerpt pulled yet; consult the original source to verify the cited section.

Supports: Names the pattern of routing unverified or suspect outputs into review rather than letting them propagate unchecked.

Does not prove: This entry is Helmwart’s internal mapping of the playbook onto a page, not the source of the fail-secure principle itself.


Reference 7

ATLAS catalogue (continuously updated)

MITRE ATLAS AML.M0029 — Human-in-the-Loop

No verbatim excerpt pulled yet; consult the original source to verify the cited section.

Supports: Defines the human-review escalation path that the refusal opens onto.

Does not prove: Does not specify how the refusal is triggered or where the confidence threshold sits.


Reference 8

Proc. IEEE 63(9), 1278–1308 · September 1975

Saltzer & Schroeder, “The Protection of Information in Computer Systems” (1975)

§I.A.3 “Design Principles”, Principle (b) Fail-safe defaults

“Base access decisions on permission rather than exclusion … the default situation is lack of access, and the protection scheme identifies conditions under which access is permitted. … A design or implementation mistake in a mechanism that gives explicit permission tends to fail by refusing permission, a safe situation, since it will be quickly detected.”

Supports: Original academic statement of fail-safe defaults — the design principle this AI control inherits.

Does not prove: Predates LLMs and agentic AI by more than four decades. Does not discuss confidence signals.
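
The contrast the excerpt draws between permission and exclusion can be shown with a small Python sketch. The roles, actions, and function names below are invented for illustration; the point is only that an omission in a permission list fails by refusing, while the same omission in an exclusion list fails by allowing.

```python
from typing import FrozenSet, Tuple

Rule = Tuple[str, str]  # (role, action)

# Hypothetical policy data. In both sets the author forgot the "intern" role.
ALLOWED: FrozenSet[Rule] = frozenset({
    ("analyst", "read_report"),
    ("admin", "delete_report"),
})
DENIED: FrozenSet[Rule] = frozenset({
    ("analyst", "delete_report"),
})


def permit_by_permission(role: str, action: str) -> bool:
    """Fail-safe default: access exists only where it is explicitly granted."""
    return (role, action) in ALLOWED


def permit_by_exclusion(role: str, action: str) -> bool:
    """Exclusion-based default: anything not explicitly denied is allowed."""
    return (role, action) not in DENIED
```

Under these assumptions `permit_by_permission("intern", "delete_report")` refuses, an error that is noticed the first time a legitimate intern request is blocked, whereas `permit_by_exclusion("intern", "delete_report")` silently grants the deletion.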
