Human Oversight (HITL / HOTL) · Principles

A rogue agent can weaponise the human gate: flood it until approvals are rubber-stamped. A watchdog must pre-filter.

Why it matters for agentic AI

Human oversight has always been a governance requirement for consequential automated decisions, but in agentic systems it acquires a technical design dimension that earlier automation never forced. A batch process either ran or it didn’t; a human reviewing its output had complete, durable information. An agent acts in milliseconds across many tool calls, its reasoning is not auditable in the way a deterministic function is, and its behaviour can be hijacked mid-session by injected content, so the human may be reviewing a plan that has already been manipulated to look benign. “Keep a human in the loop” is therefore not a staffing decision; it is an architectural one. Contestability is the downstream twin: where oversight fails to prevent a harmful action, contestability provides the mechanisms to reverse and attribute it.

The governing principle is to match oversight mode to action reversibility. Read operations and fully reversible writes can be covered by human-on-the-loop (HOTL) monitoring with authority to interrupt; consequential writes need a blocking human-in-the-loop (HITL) checkpoint before execution; irreversible, financial, or privileged actions need HITL plus approval from a second party. But the design of that checkpoint matters as much as its presence. A confirmation dialogue that auto-approves on timeout is not oversight: it gives the illusion of oversight while still creating the paper-trail liability of an approval. Approvals must be action-bound: tied to a specific action, a specific target, specific parameters, and a short expiry. “Approve this session” is not a meaningful gate.
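The tiering described above can be sketched as a small lookup. This is a minimal illustration, not a reference implementation: the `Action` fields, the `Reversibility.PARTIAL` label for "consequential" writes, and the function name are all assumptions introduced here.

```python
from dataclasses import dataclass
from enum import Enum


class Reversibility(Enum):
    FULL = "full"        # can be undone completely
    PARTIAL = "partial"  # assumed label for consequential writes
    NONE = "none"        # irreversible


class Oversight(Enum):
    HOTL = "human-on-the-loop monitoring with authority to interrupt"
    HITL = "blocking human-in-the-loop checkpoint"
    HITL_PLUS_SECOND_PARTY = "HITL plus a second party"


@dataclass(frozen=True)
class Action:
    name: str
    is_write: bool
    reversibility: Reversibility = Reversibility.FULL
    financial: bool = False
    privileged: bool = False


def required_oversight(action: Action) -> Oversight:
    # Check the strictest tier first, so a financial or privileged action
    # can never fall through to a weaker mode.
    irreversible_write = (
        action.is_write and action.reversibility is Reversibility.NONE
    )
    if action.financial or action.privileged or irreversible_write:
        return Oversight.HITL_PLUS_SECOND_PARTY
    # Reads and fully reversible writes only need monitoring.
    if not action.is_write or action.reversibility is Reversibility.FULL:
        return Oversight.HOTL
    # Everything else is treated as a consequential write.
    return Oversight.HITL
```

Evaluating the strictest condition first is a deliberate choice in this sketch: a misclassified reversibility flag cannot downgrade a financial or privileged action.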

Oversight is itself an attack surface. A rogue or injected agent can flood a human with well-formatted, confident-sounding approval requests, creating decision fatigue: a condition where the human begins rubber-stamping regardless of content. In high-volume pipelines this is not a theoretical attack; it is the operational default whenever the approval queue grows faster than a human can meaningfully evaluate it. The value of automation evaporates and the human becomes the denial-of-service target. The architectural answer is a watchdog that pre-filters proposals against policy before they reach the human and rate-limits the volume that does reach them. This preserves genuine human authority without saturating the reviewer or degrading throughput.
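A watchdog pre-filter of this kind might look roughly like the sketch below. It assumes a hypothetical `Proposal` shape, two example policy checks, and a set of routine action names that have already been reviewed; none of these come from a specific product. Rate limiting is left out here to keep the example focused on the policy step.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Proposal:
    action: str
    target: str
    amount: float = 0.0


# A policy check returns a rejection reason, or None when the proposal passes.
PolicyCheck = Callable[[Proposal], Optional[str]]


def max_amount(limit: float) -> PolicyCheck:
    def check(p: Proposal) -> Optional[str]:
        if p.amount > limit:
            return f"amount {p.amount} exceeds limit {limit}"
        return None
    return check


def forbid_actions(names: frozenset) -> PolicyCheck:
    def check(p: Proposal) -> Optional[str]:
        if p.action in names:
            return f"action {p.action!r} is forbidden by policy"
        return None
    return check


def triage(p: Proposal, checks: Sequence[PolicyCheck],
           routine_actions: frozenset) -> tuple:
    """Return (route, reason) where route is 'reject', 'auto', or 'human'."""
    # Deterministic checks run first; the agent has no input into them.
    for check in checks:
        reason = check(p)
        if reason is not None:
            return ("reject", reason)
    # Previously reviewed routine patterns skip the human queue.
    if p.action in routine_actions:
        return ("auto", "matches a previously approved pattern")
    # Anything novel is what the reviewer's attention is reserved for.
    return ("human", "novel action pattern")
```

The checks are plain functions over the proposal's parameters, so nothing the agent writes in its justification text can change the outcome.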

Scenario: the poisoned invoice approval

A poisoned invoice reaches a finance agent. The agent produces a well-justified, clearly formatted “Approve payment to [legitimate-looking name]” prompt. A busy manager clicks it. The funds go to the attacker’s account. No credential was stolen; the oversight mechanism fired correctly in form but was defeated in content. A contextual decision surface that surfaces the original invoice, the proposed payee, and a diff against the expected payee (rather than just the agent’s synthesised summary) gives the human the raw evidence needed to catch the discrepancy. Action-bound tokens with a short expiry also mean a manager cannot accidentally approve a modified version of the action later.
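The contextual decision surface from this scenario can be illustrated with a field-level diff between the payee on record and the payee the agent proposes. The record fields and the account strings in the usage are hypothetical; a real system would pull the expected payee from a vendor master the agent cannot write to.

```python
def field_diff(expected: dict, proposed: dict) -> dict:
    """Map each differing field to an (expected, proposed) pair."""
    fields = sorted(set(expected) | set(proposed))
    return {
        f: (expected.get(f), proposed.get(f))
        for f in fields
        if expected.get(f) != proposed.get(f)
    }


def decision_surface(invoice_text: str, expected: dict, proposed: dict) -> str:
    """Render raw evidence for the reviewer instead of the agent's summary."""
    lines = ["ORIGINAL INVOICE", invoice_text, "", "PROPOSED PAYEE"]
    lines += [f"  {k}: {v}" for k, v in sorted(proposed.items())]
    lines += ["", "DIFF AGAINST EXPECTED PAYEE"]
    diff = field_diff(expected, proposed)
    if not diff:
        lines.append("  (no differences)")
    for name, (want, got) in diff.items():
        lines.append(f"  {name}: expected {want!r}, proposed {got!r}")
    return "\n".join(lines)
```

With an expected record of `{"name": "Acme Supplies Ltd", "account": "ACCT-0001"}` and a proposal that keeps the name but changes the account, the diff section shows only the account line, which is the discrepancy the busy manager needs to see.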

Scenario: approval fatigue at scale

A high-volume customer support agent processes thousands of tickets per hour and surfaces a confirmation request for any action that touches a customer account. The queue rapidly outpaces the review team; approval requests arrive at two per second. After an hour, reviewers are clicking through without reading. An injected ticket instructs the agent to modify account permissions across a range of customers; the request looks like a routine account-edit approval and is approved in the flood. A pre-filtering watchdog that routes only novel or high-risk action patterns to human review, while automatically approving patterns that have already been reviewed and approved, would have kept the queue manageable and concentrated human attention on the anomaly.

How it fails

Approvals auto-allow on timeout, converting a safety gate into a delay with a paper trail.
The only available stop is “kill all agents,” a blunt instrument that interrupts unrelated work; the high cost of using it means it is often not used.
Confirmation requests are so frequent that reviewers approve reflexively, making the gate indistinguishable from no gate.
Approvals are session-scoped rather than action-bound, so a subsequent manipulated action inherits the earlier approval.
The agent’s synthesised summary, rather than raw evidence, is shown to the reviewer, who has no independent view of what the agent actually proposes to do.

Why the mapped controls work

Signed action-bound approval tokens with short expiry make it structurally impossible to accidentally approve a subsequent, different action under the cover of an earlier legitimate one: each approval covers exactly one parameterised action for a bounded window. Plan-preview surfaces that show the projected state delta rather than the agent’s interpretation give the reviewer independent evidence. Step-up authentication for critical actions raises the cost of an inadvertent or fatigue-driven approval at the moments it matters most. The watchdog pre-filter is the architectural answer to the fatigue attack: it interposes between the agent and the human, runs deterministic policy checks the agent cannot influence, and manages volume so human attention lands on genuine anomalies rather than routine confirmations.
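One way to picture a signed, action-bound approval token is an HMAC over a canonical encoding of the action, target, and parameters together with an expiry. The sketch below uses only the Python standard library; the token layout, function names, and the 300-second default are assumptions for illustration, and single-use tracking and key management are deliberately omitted.

```python
import hashlib
import hmac
import json
import time


def _action_digest(action: str, target: str, params: dict) -> str:
    # Canonical JSON so the same action always hashes to the same value.
    blob = json.dumps(
        {"action": action, "target": target, "params": params},
        sort_keys=True, separators=(",", ":"),
    ).encode()
    return hashlib.sha256(blob).hexdigest()


def _sign(key: bytes, digest: str, expires: int) -> str:
    msg = f"{digest}.{expires}".encode()
    return hmac.new(key, msg, hashlib.sha256).hexdigest()


def issue_token(key: bytes, action: str, target: str, params: dict,
                ttl_s: int = 300, now: float = None) -> dict:
    """Issued when a human approves exactly this action."""
    issued_at = time.time() if now is None else now
    expires = int(issued_at + ttl_s)
    digest = _action_digest(action, target, params)
    return {"expires": expires, "sig": _sign(key, digest, expires)}


def verify_token(key: bytes, token: dict, action: str, target: str,
                 params: dict, now: float = None) -> bool:
    """Checked at execution time against the action about to run."""
    current = time.time() if now is None else now
    # The digest is recomputed from the action being executed, so any
    # change to action, target, or parameters invalidates the signature.
    digest = _action_digest(action, target, params)
    expected = _sign(key, digest, token["expires"])
    return hmac.compare_digest(expected, token["sig"]) and current < token["expires"]
```

Because verification recomputes the digest from the action actually being executed, an approval for one payment cannot be replayed for a different amount or payee, and it lapses once the expiry passes.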

First steps

Audit every approval dialogue in your agent workflows and confirm that each is action-bound (tied to a specific set of parameters) with an expiry no longer than 10 minutes. Remove any session-scoped or open-ended approval that would allow a later, modified action to inherit an earlier approval.
Implement a rate limit on your approval queue: if more than N approval requests arrive within M minutes (start with 10 requests in 5 minutes as a threshold), automatically escalate to a second reviewer and surface a “high volume: possible automated attack” warning rather than routing them to the original reviewer.
Replace the agent’s synthesised approval summary with a plan-preview diff that shows the raw proposed parameters alongside the original task declaration, so the reviewer has independent evidence they can check rather than relying solely on the agent’s framing of its own proposed action.
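The queue rate limit from the second step can be sketched as a sliding-window counter. The class name and return labels are hypothetical; the defaults mirror the suggested starting threshold of 10 requests in 5 minutes.

```python
from collections import deque


class ApprovalRateMonitor:
    """Sliding-window counter over approval-request arrival times."""

    def __init__(self, max_requests: int = 10, window_s: float = 300.0):
        self.max_requests = max_requests
        self.window_s = window_s
        self._arrivals = deque()

    def record(self, now: float) -> str:
        """Register one request; return 'route' or 'escalate'."""
        self._arrivals.append(now)
        # Drop arrivals that have aged out of the window.
        while self._arrivals and now - self._arrivals[0] >= self.window_s:
            self._arrivals.popleft()
        if len(self._arrivals) > self.max_requests:
            # More than N requests in the window: send to a second reviewer
            # with a "high volume: possible automated attack" warning.
            return "escalate"
        return "route"
```

In this sketch the caller passes timestamps in explicitly, which keeps the behaviour easy to test; a deployment would more likely feed it `time.monotonic()`.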

Threats it governs

When this principle is absent, these threats become reachable.

T6
Intent Breaking and Goal Manipulation: Adversaries manipulate planning, reasoning, or self-evaluation to override goals.
T10
Overwhelming Human-in-the-Loop (HITL): Reviewers are saturated with intervention requests; decision fatigue and HII manipulation make oversight ineffective.
T15
Human Manipulation: Attacker turns the agent into a fluent, personalised social-engineering vector trusted by the user.

Controls that advance it

Catalogue mitigations that strengthen this principle, grouped by the defence-in-depth stage they sit in.

Prevent

Dual control: An AI agent operating with broad authority can propose actions that are irreversible: deleting records, modifying IAM policies, moving funds. A single human reviewer at the approval gate is a single point of failure: one compromised account, one fatigued reviewer, or one successful social-engineering attempt is enough to commit the action. Human dual-control addresses that by requiring two distinct, independent humans to approve before the action commits.
Risk queue: A human-in-the-loop review system saturates not from absolute decision volume but from undifferentiated volume: every item lands at the same priority, so reviewers cannot distinguish an irreversible high-consequence action from a routine low-stakes one. A risk-prioritised queue fixes this by scoring each decision before it enters the queue and routing it to the tier that matches its risk level, concentrating human attention where the cost of an error is highest.
Decision summaries: When an agent decision reaches a human reviewer, the reviewer must reconstruct the agent's reasoning from raw traces before they can form a judgment. OWASP T10 names this reconstruction burden as the mechanism behind reviewer fatigue and oversight failures. A decision summary addresses the problem by inserting an independent model call between the agent's output and the reviewer: that call compresses the decision, evidence chain, and risk factors into a fixed-format card, reducing the per-review cognitive load without removing the human from the decision.
OOB verify: An agent that can propose payments, update banking details, or modify production configuration is, by construction, a manipulation surface. If the only thing standing between a proposed change and its execution is the agent's own UI, a successful prompt injection or RAG poisoning attack requires no additional steps. Out-of-band verification breaks that dependency by routing a one-use confirmation code through a channel that is structurally separate from the agent's primary interaction channel, so an attacker who controls the agent's context cannot complete the approval without also compromising the user's registered secondary device.
AI label: When an AI agent generates content or proposes an action, users need to know that the source is an AI before they decide to act. Without that signal, users routinely over-trust agent output. AI-source disclosure addresses this by attaching a visible label to every AI-generated item and by requiring explicit confirmation for consequential actions, restoring the critical gap between receipt and acceptance.
Peer consensus: A single agent's judgment on a high-impact action can be wrong, manipulated, or compromised. Requiring N of M independent peer agents to agree before the action executes means an attacker or a systematic error must affect the quorum majority, not just one agent, before harm results.
Code review gate: An AI coding agent produces code that can be executed or merged to a production branch without a human ever reading it. If the agent has been manipulated, its generated code can contain hidden payloads, backdoors, or privilege-escalating logic. A code-generation review gate prevents that: every change attributable to an AI agent must pass automated static analysis and receive explicit human approval before it can merge or execute, and the agent identity that authored the change is structurally barred from also approving it.
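Several of the Prevent controls above share one core rule: an action commits only when enough distinct, eligible parties other than its author have approved it. A minimal sketch of that rule follows, with hypothetical identifiers; `required=2` corresponds to dual control, a larger value to an N-of-M peer quorum, and excluding the proposer mirrors the code review gate's bar on self-approval.

```python
from typing import Iterable


def quorum_met(proposer: str, approvals: Iterable[str],
               eligible: frozenset, required: int = 2) -> bool:
    """True when enough distinct eligible approvers, excluding the proposer, agree."""
    # A set collapses duplicate approvals from the same identity.
    counted = {a for a in approvals if a in eligible and a != proposer}
    return len(counted) >= required
```

The sketch only counts identities; whether two approvers are genuinely independent (separate accounts, devices, or reporting lines) is a question it does not attempt to answer.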

Detect

HITL calibration loop: An agent at a human-in-the-loop gate will be overridden when its decisions do not match the reviewer's judgment. Without a return path, those corrections are discarded: the same miscalibration surfaces again in the next review cycle and the one after that. A feedback loop closes that gap by capturing each override event as a structured record, accumulating those records into a calibration dataset, and using patterns in that dataset to drive targeted changes to the agent's system prompt, tool-scope policy, or divergence-monitor thresholds. A well-calibrated agent produces fewer out-of-distribution decisions, so the review queue contracts over time.
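The first half of that loop, turning review outcomes into a calibration signal, could be sketched as follows. The record fields, the threshold, and the minimum sample size are assumptions; the sketch also records every review rather than only overrides, so that a rate can be computed.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ReviewRecord:
    action_type: str
    agent_decision: str
    reviewer_decision: str


def override_rates(records: Sequence[ReviewRecord]) -> dict:
    """Fraction of reviews per action type where the human overrode the agent."""
    totals, overrides = Counter(), Counter()
    for r in records:
        totals[r.action_type] += 1
        if r.agent_decision != r.reviewer_decision:
            overrides[r.action_type] += 1
    return {t: overrides[t] / totals[t] for t in totals}


def flag_for_recalibration(records: Sequence[ReviewRecord],
                           threshold: float = 0.3,
                           min_samples: int = 5) -> list:
    """Action types with enough reviews and an override rate above threshold."""
    totals = Counter(r.action_type for r in records)
    rates = override_rates(records)
    return sorted(
        t for t, rate in rates.items()
        if totals[t] >= min_samples and rate > threshold
    )
```

What happens to a flagged action type (a prompt change, a tighter tool scope, a new monitor threshold) is a human decision that this sketch leaves outside the code.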

Respond

Adaptive load: Human reviewers make more errors as cognitive load accumulates over a shift. An adversary who floods a HITL gate, or a system that simply generates high output volume, exploits that degradation without bypassing the gate at all. Adaptive workload balancing addresses this by treating reviewer fatigue as a live routing input: each incoming review is assigned to the reviewer with the lowest current fatigue score, mandatory breaks are enforced before a reviewer's error rate climbs further, and items are held rather than assigned to any reviewer above the break threshold.
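The routing rule described here is small enough to sketch directly. How a fatigue score is computed is out of scope and assumed to be supplied by another component; the 0.0 to 1.0 scale and the 0.7 break threshold are illustrative values only.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Reviewer:
    name: str
    fatigue: float  # assumed scale: 0.0 (rested) to 1.0 (exhausted)


def assign(reviewers: Sequence[Reviewer],
           break_threshold: float = 0.7) -> Optional[Reviewer]:
    """Pick the least-fatigued reviewer, or None to hold the item."""
    # Reviewers above the break threshold are never assigned work.
    available = [r for r in reviewers if r.fatigue <= break_threshold]
    if not available:
        return None  # hold the item rather than hand it to a tired reviewer
    return min(available, key=lambda r: r.fatigue)
```

Returning `None` instead of falling back to the least-bad reviewer is the point of the design: under a flood, the queue backs up visibly rather than approvals degrading silently.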

In Helmwart

The Q1 human-in-the-loop signal feeds the trifecta check and Q4 findings. A launch-blocking finding fires when the trifecta is active with no HITL gate.