Psychological Acceptability · Principles

Why it matters for agentic AI

A security control that is too burdensome will not be used. Saltzer and Schroeder named this as a design principle in 1975: the human interface must be designed for ease of use, so that users routinely and automatically apply the protection mechanisms correctly. A common later restatement is stricter: the mechanism should not make the resource harder to access than it would be without any security at all. This directly shapes how Observability tooling must be delivered: instrumentation that requires manual setup per agent will be skipped. It also explains why Safe Interruptibility controls need to be costless for operators to trigger. The principle sounds like a usability concern, and in one direction it is. In the other direction it is a security concern: when controls are bypassed because they are inconvenient, the bypass is invisible to the threat model. “We have a guardrail for that” is not a claim about security; it is a claim about what was shipped. Whether the guardrail is actually active in production depends on whether the people who could disable it had reason to do so.

Agentic AI surfaces this in two distinct failure directions. The first is developer bypass: guardrail libraries, threat-modelling steps, and security tooling are skipped under pressure to deploy. Guardrails enabled in testing are disabled in production because they add latency or token cost. Complex permission configurations are replaced with broad grants because the scoped alternative requires understanding a policy language. Each of these is a locally rational decision; together they produce an insecure system. The cure is not admonishment but design: controls that are on by default and opt-out rather than opt-in, so that the default is secure without any configuration and the insecure alternative requires deliberate, documented action.

The second failure direction is operator fatigue. This one is particular to agents and has no close parallel in classical systems. A human user in 1975 might find a password prompt annoying; they would not face hundreds of authorisation requests per hour. An agent acting at automation speed generates exactly that volume. If every action requires explicit human confirmation, the human’s cognitive load quickly exceeds their ability to make genuine decisions, so each confirmation becomes reflexive. And a reflexive approval is worse than no approval, because it provides the audit trail of human oversight while delivering none of its substance. An adversary who can induce this state by flooding the approval queue with plausible requests can conceal the one malicious request in the stream.

The design response to approval fatigue is not to remove human oversight but to restructure it: batched plan-mode review instead of per-action micro-prompts; contextual approval surfaces that make the dangerous action visually legible rather than burying it in a confirmation dialog; rate limits on approval volume with escalation to a second reviewer when the rate is anomalous. The human is still in the loop, but the loop is shaped so genuine attention is possible.
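The rate-limit-and-escalate idea can be sketched as a sliding-window router. This is a minimal sketch, not a production design: the class name ApprovalRouter, the "primary" and "second_reviewer" labels, and the default thresholds are all hypothetical, and a real system would persist its state and probably keep one window per agent or per approval queue.

```python
import time
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Optional


@dataclass
class ApprovalRouter:
    """Hypothetical router: sends approval requests to the primary reviewer
    until volume in the sliding window looks anomalous, then escalates."""

    max_per_window: int = 10        # illustrative threshold, not a recommendation
    window_seconds: float = 600.0   # 10-minute sliding window
    _arrivals: Deque[float] = field(default_factory=deque, init=False, repr=False)

    def route(self, now: Optional[float] = None) -> str:
        now = time.monotonic() if now is None else now
        # Forget arrivals that have slid out of the window.
        while self._arrivals and now - self._arrivals[0] >= self.window_seconds:
            self._arrivals.popleft()
        self._arrivals.append(now)
        # Above the threshold, each further request also goes to a second reviewer,
        # so flooding the queue buys an attacker more scrutiny, not less.
        if len(self._arrivals) > self.max_per_window:
            return "second_reviewer"
        return "primary"


if __name__ == "__main__":
    router = ApprovalRouter(max_per_window=3, window_seconds=60)
    for i in range(5):
        print(f"req-{i}", router.route(now=float(i)))
```

With a threshold of three requests per minute, the fourth and fifth requests in the loop are routed to the second reviewer. The design choice is that anomalous volume changes who looks, rather than silently dropping or auto-approving requests.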

Scenario: the reflexive approval

A finance operations team deploys an agent that handles expense processing. The agent generates confirmation requests at a rate of roughly forty per hour during peak periods. After the first week, the operations team has trained itself to approve them in batches: a few seconds of scanning and clicking. A poisoned invoice generates a confirmation request that is formally identical to the legitimate ones: a supplier name, an amount, a plausible description. It is approved in the batch. The funds transfer. The team did not bypass security; they used it exactly as designed, and the design’s psychological demand made genuine oversight impossible. A batched plan-mode surface that groups the forty requests into five meaningful decision points and highlights anomalies (new payee, unusual amount, first transfer to this jurisdiction) would have made the malicious request legible without requiring the team to examine each transaction individually.
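The batched surface described in this scenario can be sketched as a function that turns a stream of confirmation requests into a short plan: every request carrying an anomaly flag becomes its own decision, and the rest are grouped into routine batches by payee. This is a sketch under stated assumptions: the ExpenseRequest fields, the rule that an amount is "unusual" when it exceeds twice the largest previous payment to that payee, and grouping by payee are all illustrative choices, not a prescribed policy.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class ExpenseRequest:
    # Hypothetical request shape; real expense systems will carry more fields.
    request_id: str
    payee: str
    amount: float
    jurisdiction: str


def anomaly_flags(req: ExpenseRequest,
                  payee_history: Dict[str, List[float]],
                  seen_jurisdictions: Set[str],
                  amount_factor: float = 2.0) -> List[str]:
    """Return human-readable reasons this request deserves individual attention."""
    flags = []
    past = payee_history.get(req.payee)
    if not past:
        flags.append("new payee")
    elif req.amount > amount_factor * max(past):
        flags.append("unusual amount")
    if req.jurisdiction not in seen_jurisdictions:
        flags.append("first transfer to this jurisdiction")
    return flags


def build_plan(requests: List[ExpenseRequest],
               payee_history: Dict[str, List[float]],
               seen_jurisdictions: Set[str],
               ) -> Tuple[List[Tuple[ExpenseRequest, List[str]]], Dict[str, List[ExpenseRequest]]]:
    """Split requests into flagged items (one decision each) and routine batches by payee."""
    flagged: List[Tuple[ExpenseRequest, List[str]]] = []
    routine: Dict[str, List[ExpenseRequest]] = defaultdict(list)
    for req in requests:
        flags = anomaly_flags(req, payee_history, seen_jurisdictions)
        if flags:
            flagged.append((req, flags))
        else:
            routine[req.payee].append(req)
    return flagged, dict(routine)


if __name__ == "__main__":
    history = {"Acme Ltd": [120.0, 95.0], "Office Supplies Co": [40.0]}
    requests = [
        ExpenseRequest("r1", "Acme Ltd", 110.0, "GB"),
        ExpenseRequest("r2", "Acme Ltd", 900.0, "GB"),
        ExpenseRequest("r3", "Office Supplies Co", 35.0, "GB"),
        ExpenseRequest("r4", "Unknown Vendor", 500.0, "KY"),
    ]
    flagged, routine = build_plan(requests, history, {"GB"})
    for req, reasons in flagged:
        print(f"REVIEW {req.request_id}: {req.payee} {req.amount:.2f} -> {', '.join(reasons)}")
    for payee, batch in routine.items():
        print(f"BATCH {payee}: {len(batch)} request(s), total {sum(r.amount for r in batch):.2f}")
```

A reviewer sees the flagged items first, each with its reasons spelled out, and then a handful of routine batches. The poisoned invoice in the scenario would surface as "new payee" rather than as one more row in the stream.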

Scenario: the disabled guardrail

A development team builds an agent with an output-scanning guardrail that detects potential PII exfiltration in tool call parameters. During load testing, the guardrail adds 200ms to every call and produces three false positives per hundred requests. The lead developer, under deadline pressure, disables it for the production deployment with a note to re-enable it after optimisation. The optimisation task is never prioritised. The guardrail ships disabled. A one-click integration path for a well-tuned, low-latency guardrail library, maintained by the platform rather than the application team, would have spared the team the false choice between performance and safety. The default would have been secure, and disabling it would have required deliberate, documented action rather than flipping a single configuration flag.
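The secure-default, documented-opt-out pattern from this scenario can be sketched as a configuration object that refuses to be disabled without a waiver. Everything here is hypothetical (GuardrailConfig, GuardrailWaiver, pii_scan_enabled), and the expiry date is an added design assumption: it turns "re-enable after optimisation" from a note that can be forgotten into a date after which the guardrail switches itself back on.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class GuardrailWaiver:
    """Hypothetical record that documents a deliberate opt-out."""
    reason: str
    owner: str
    expires: date


@dataclass(frozen=True)
class GuardrailConfig:
    # Secure default: scanning is on unless someone supplies a waiver.
    pii_scan_enabled: bool = True
    waiver: Optional[GuardrailWaiver] = None

    def __post_init__(self) -> None:
        if self.pii_scan_enabled:
            return
        # Disabling is not a single flag: it needs a named owner and a stated reason.
        if self.waiver is None:
            raise ValueError("disabling PII scanning requires a documented waiver")
        if not self.waiver.reason.strip() or not self.waiver.owner.strip():
            raise ValueError("a waiver must state a reason and an owner")

    def is_active(self, today: date) -> bool:
        """The guardrail runs unless a waiver is in force; an expired waiver re-enables it."""
        if self.pii_scan_enabled:
            return True
        assert self.waiver is not None  # guaranteed by __post_init__
        return today > self.waiver.expires
```

GuardrailConfig() with no arguments is the secure configuration; only the insecure path requires extra work, which is the incentive flip the principle asks for.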

How it fails

Guardrails are enabled in development and disabled in production because they introduce friction, latency, or noise; the threat model assumes they are active.
Confirmation dialogs fire at a rate that makes genuine review impossible; operators approve reflexively and the oversight exists in form only.
Security tooling is complex enough that teams hand-roll simpler alternatives; the hand-rolled version misses cases the original handled.
Over-sensitive monitoring generates so much noise that teams mute alerts; once the channel is muted, genuine anomalies go unnoticed.

Why the mapped controls work

Secure defaults that require deliberate opt-out flip the incentive: the path of least resistance is the secure path, so deadline pressure and cognitive shortcuts produce a secure deployment by default, not an insecure one. Batch-approval and plan-mode patterns address operator fatigue directly: instead of forty micro-decisions, the human makes five meaningful ones, and the interface is designed to surface the decision that matters rather than treating all forty as equivalent. One-click guardrail integration removes the skill and effort barrier for developers: a well-configured, platform-maintained library is easier to adopt than to build around, so it gets used. Automated observability means developers do not face the choice between building monitoring and skipping it; it is instrumented at the platform layer and delivered as a readable signal rather than raw telemetry they must process themselves.
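One way to deliver both one-click guardrails and automated observability is a platform-level gateway that every tool call passes through: application teams only register their tools, and scanning plus telemetry are inherited. The sketch is hypothetical throughout: ToolGateway, toy_pii_scanner, and the telemetry record shape are illustrative, and the regex scanner is a deliberately crude stand-in for a real guardrail library, not a substitute for one.

```python
import re
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

# Crude stand-in for a real PII detector: matches email-like strings only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def toy_pii_scanner(params: Dict[str, Any]) -> List[str]:
    """Return the names of string parameters that contain email-like text."""
    return [k for k, v in params.items() if isinstance(v, str) and EMAIL_RE.search(v)]


class BlockedByGuardrail(Exception):
    """Raised when the platform guardrail refuses a tool call."""


@dataclass
class ToolGateway:
    scanner: Callable[[Dict[str, Any]], List[str]] = toy_pii_scanner
    telemetry: List[Dict[str, Any]] = field(default_factory=list, init=False)
    _tools: Dict[str, Callable[..., Any]] = field(default_factory=dict, init=False, repr=False)

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        # The only step an application team performs; scanning and telemetry are inherited.
        self._tools[name] = fn

    def call(self, name: str, **params: Any) -> Any:
        start = time.perf_counter()
        flagged = self.scanner(params)
        record: Dict[str, Any] = {"tool": name, "flagged_params": flagged, "blocked": bool(flagged)}
        try:
            if flagged:
                raise BlockedByGuardrail(f"{name}: possible PII in {flagged}")
            return self._tools[name](**params)
        finally:
            # Every call, allowed or blocked, produces one structured telemetry record.
            record["latency_ms"] = (time.perf_counter() - start) * 1000.0
            self.telemetry.append(record)
```

Because the scanner is a constructor argument, the platform team can swap in a better-tuned guardrail without any application team changing its code, and the telemetry list is already a readable per-call signal rather than raw logs.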

First steps

Review your current confirmation surfaces and group low-risk actions into batch approval plans. If your agent framework offers a pause point before execution (for example LangGraph’s human-in-the-loop interrupts, or the pause when the Claude Messages API returns a tool_use stop reason and your code, not the model, decides whether to run the requested tools), use it to present a structured summary of the full planned action sequence before execution rather than firing a confirmation per tool call.
Measure your guardrail’s production latency and false-positive rate this week. If either is high enough that operators have disabled or are routinely overriding the guardrail, treat that as a security incident and work with the platform team to tune it (adjust thresholds, switch to a lighter model, or cache common decisions) rather than leaving it disabled.
Adopt a guardrail library or classifier (NeMo Guardrails, Guardrails AI, or Llama Guard) as a platform-level default so that application teams do not need to hand-roll input and output scanning. Configure it once at the gateway layer and let application agents inherit it without any per-team effort.
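Assuming you can replay a labelled sample of production traffic (each item marked benign or malicious) through the guardrail’s check, the latency and false-positive measurement from the second step can be sketched as below. evaluate_guardrail, the report fields, and the threshold defaults are hypothetical; the false-positive rate here is computed over benign samples only, which is one common definition among several.

```python
import math
import time
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class GuardrailReport:
    p95_latency_ms: float
    false_positive_rate: float  # share of benign samples the guardrail flagged
    needs_tuning: bool


def evaluate_guardrail(check: Callable[[str], bool],
                       labelled: Iterable[Tuple[str, bool]],
                       max_p95_ms: float = 50.0,
                       max_fp_rate: float = 0.01) -> GuardrailReport:
    """Replay labelled (text, is_malicious) samples through a guardrail check.

    The threshold defaults are placeholders; set them from your own latency
    budget and operator tolerance.
    """
    latencies: List[float] = []
    benign = false_positives = 0
    for text, is_malicious in labelled:
        start = time.perf_counter()
        flagged = check(text)
        latencies.append((time.perf_counter() - start) * 1000.0)
        if not is_malicious:
            benign += 1
            if flagged:
                false_positives += 1
    if not latencies:
        raise ValueError("need at least one labelled sample")
    # Nearest-rank 95th percentile of per-call latency.
    ordered = sorted(latencies)
    p95 = ordered[math.ceil(0.95 * len(ordered)) - 1]
    fp_rate = false_positives / benign if benign else 0.0
    return GuardrailReport(p95, fp_rate, p95 > max_p95_ms or fp_rate > max_fp_rate)
```

A report with needs_tuning set is the cue to adjust thresholds, switch to a lighter model, or cache common decisions, rather than to switch the guardrail off.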

Threats it governs

When this principle is absent, these threats become reachable.

T10
Overwhelming Human-in-the-Loop (HITL)
Reviewers are saturated with intervention requests; decision fatigue and manipulation of the HITL process make oversight ineffective.

In Helmwart

Connects to the HITL approval-fatigue concern; not a scored lens.