Continuous Verification · Principles

Why it matters for agentic AI

Complete mediation (every access to every object must be checked against current authority, not a cached result from an earlier check) is a Saltzer and Schroeder classic from 1975. NIST SP 800-207 restated it as the operational heart of Zero Trust: trust state is an instantaneous claim, not a durable one, and it must be re-evaluated each time an action is attempted. For agents, “each time” needs to mean something far more granular than it ever did for humans or services.

A human user authenticates once per session. An agent in a moderately busy task might invoke a hundred tools in that same window. More importantly, a human’s intent during a session is generally stable: a developer who authenticated to push code stays a developer trying to push code. An agent’s effective intent can be hijacked mid-session by a single document that enters its context, such as a malicious email, a poisoned RAG result, or a tool response containing injected instructions. The agent’s identity credential did not change; the agent’s goal just did. Authentication verified the principal; it said nothing about whether the principal’s current behaviour still matches the task it was authorised to perform.

The gap between holding a valid credential and behaving benignly right now is what continuous verification for agents has to close. It requires monitoring that goes beyond “is the token still valid?” to ask: does this agent’s current action stream still resemble the declared task? Verification must therefore include behavioural signals: goal drift, anomalous tool-chain sequences, action-velocity spikes, and behavioural changes correlated with ingesting low-trust content. Because agents can act thousands of times an hour, that monitoring has to run at machine speed, in a watchdog the monitored agent cannot influence.

The persistence threat makes this even more acute. Attacks like MINJA demonstrate that injected instructions can be written to an agent’s persistent memory and retrieved in future sessions, long after the initial poisoning event. A verification regime that checks only current-session tokens will never detect an instruction that has been silently incubating in memory for days. Provenance tagging on context content, marking which parts of the current context came from low-trust sources, gives the verification layer the information it needs to flag suspicious correlations.
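As a minimal sketch of provenance tagging, the Python below tags each context item at ingestion and reports the low-trust sources present when a novel high-impact tool call is attempted. The trust-level names are the ones listed under First steps; the numeric ordering, the class names, the source labels, and the lists of known and high-impact tools are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Trust(IntEnum):
    """Trust levels named in this document; the numeric ordering is an assumption."""
    ENVIRONMENT = 0
    USER = 1
    OPERATOR = 2
    SYSTEM = 3


@dataclass(frozen=True)
class ContextItem:
    text: str
    trust: Trust
    source: str  # hypothetical labels such as "email", "rag", "tool:search"


@dataclass
class ReasoningWindow:
    """Tracks what entered the context during one reasoning window."""
    known_tools: set[str]   # tools already seen for this task type (assumed input)
    high_impact: set[str]   # tools treated as high-impact (assumed input)
    items: list[ContextItem] = field(default_factory=list)

    def ingest(self, text: str, trust: Trust, source: str) -> None:
        # Tag at the point of ingestion so the label travels with the content.
        self.items.append(ContextItem(text, trust, source))

    def check_tool_call(self, tool: str) -> list[str]:
        """Return low-trust sources to review, or [] if there is nothing to flag."""
        novel = tool not in self.known_tools
        risky = tool in self.high_impact
        if not (novel and risky):
            return []
        return [i.source for i in self.items if i.trust == Trust.ENVIRONMENT]
```

A non-empty result does not prove that the low-trust item caused the call; it only makes the correlation visible to whatever layer decides to pause or escalate.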

Scenario: the goal-drift monitor

An agent ingests a poisoned email as part of its normal mail-triage task. Over the following ten tool calls it gradually shifts from “draft a reply to this thread” to “search the file store for documents containing contract terms” to “attach and forward them externally.” Each individual step is plausible in isolation. A behavioural monitor baselined on the agent’s normal triage pattern, which never involves file search or external forwarding, detects the sequence anomaly after the third deviation and suspends the agent for human review, before the forward completes.
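The scenario can be sketched as a very small sequence monitor. This is an illustration only: it treats any tool outside a fixed baseline set as a deviation and suspends on the third one, whereas a production monitor would likely use a statistical baseline. The class name, the tool names, and the default of three deviations are assumptions chosen to mirror the scenario.

```python
from dataclasses import dataclass, field


@dataclass
class SequenceMonitor:
    baseline_tools: frozenset[str]   # tools seen in normal runs of this task
    max_deviations: int = 3          # assumed threshold, taken from the scenario
    deviations: list[str] = field(default_factory=list)
    suspended: bool = False

    def observe(self, tool: str) -> bool:
        """Record one attempted tool call; return True if it may proceed."""
        if self.suspended:
            return False
        if tool not in self.baseline_tools:
            self.deviations.append(tool)
            if len(self.deviations) >= self.max_deviations:
                # Block the call that crosses the threshold and hold for review.
                self.suspended = True
                return False
        return True


monitor = SequenceMonitor(frozenset({"read_email", "draft_reply"}))
for call in ["read_email", "search_files", "read_file", "forward_external"]:
    allowed = monitor.observe(call)
# The external forward is the third deviation, so it is blocked.
```

Because the check runs before the call executes, the third deviation is stopped rather than merely logged. The monitor would have to run outside the agent's process for that guarantee to hold.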

Scenario: the rubber-stamp approval flood

A human-in-the-loop gate is placed on all irreversible actions. During a high-volume processing run the agent generates two hundred approval requests in ninety seconds. The reviewer, overwhelmed, begins clicking approve without reading. Continuous verification includes rate-limiting the approval queue and escalating to a second reviewer when approval velocity suggests rubber-stamping: the volume itself is the anomalous signal, even if each individual action looks plausible.
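A rough sketch of the two mechanisms in this scenario follows: a sliding-window rate limit on requests reaching the reviewer, and an escalation signal based on how quickly recent decisions were made. All names and default thresholds are hypothetical, and using the median review time as the rubber-stamping signal is one possible heuristic, not something the scenario prescribes.

```python
from collections import deque
from statistics import median


class ApprovalGate:
    """Rate-limits approval requests and watches for rubber-stamping."""

    def __init__(self, max_per_window=10, window_s=60.0,
                 min_median_review_s=5.0, sample=5):
        self.max_per_window = max_per_window
        self.window_s = window_s
        self.min_median_review_s = min_median_review_s
        self.admitted = deque()                  # timestamps of admitted requests
        self.review_times = deque(maxlen=sample) # seconds spent on recent decisions

    def admit(self, now):
        """Return True if a request may reach the reviewer now, else hold it."""
        while self.admitted and now - self.admitted[0] >= self.window_s:
            self.admitted.popleft()
        if len(self.admitted) >= self.max_per_window:
            return False
        self.admitted.append(now)
        return True

    def record_decision(self, shown_at, decided_at):
        self.review_times.append(decided_at - shown_at)

    def needs_second_reviewer(self):
        """Escalate once a full sample of recent decisions is implausibly fast."""
        full = len(self.review_times) == self.review_times.maxlen
        return full and median(self.review_times) < self.min_median_review_s
```

Timestamps are passed in explicitly so the behaviour is easy to test; a deployment would read them from a monotonic clock.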

How it fails

Authorisation is checked at admission and cached for the session; an intent-flip caused by a poisoned input lands inside an already-authorised session and is never re-checked.
Injected instructions persist in the agent’s memory (the MINJA pattern) and are retrieved in future sessions, where they pass all token-validity checks.
The behavioural monitor is implemented as an LLM judging the primary agent’s output. The judge and the agent share the same vulnerability to injection, so a sufficiently crafted payload fools both simultaneously.
Human approvers are routed all agent actions in series without rate-limiting, producing approval fatigue and de facto rubber-stamping.
Context provenance is not tracked, so a verification system cannot distinguish actions triggered by trusted instructions from actions triggered by low-trust tool output.

Why the mapped controls work

Behavioural baselining and anomaly detection by a separate watchdog (a process the monitored agent cannot reach or influence) close the shared-vulnerability gap that afflicts LLM-as-judge approaches. The watchdog evaluates the action stream against a statistical baseline, not a language model; its failure mode is a missed anomaly, not a shared injection surface. Provenance tags on context content give the verification layer the causal information it needs: when a behavioural deviation correlates with the ingestion of a low-trust document, the tag makes that correlation explicit rather than inferred. Circuit breakers translate a detected deviation into an automatic suspension, removing the dependency on human reaction speed. Signed, versioned policy bundles ensure the rules governing what counts as anomalous behaviour cannot be silently altered by a model update or a malicious config change. The policy is an auditable artefact, not a prompt.

First steps

Instrument your agent’s tool calls with OpenTelemetry traces and feed them into a separate watchdog process (not an LLM) that computes a rolling baseline of tool-call type frequencies per task category. Any session where one tool type exceeds 3× its baseline rate should trigger an automatic suspension.
Tag every item added to an agent’s context window with a trust level (SYSTEM / OPERATOR / USER / ENVIRONMENT) at the point of ingestion, and configure your verification layer to flag any session where an ENVIRONMENT-trust ingestion precedes a novel high-impact tool call within the same reasoning window.
Store your behavioural anomaly-detection rules as a signed OPA or Cedar policy bundle in version control, and add a CI check that any change to the bundle requires a cryptographic signature from a named security reviewer; this prevents a malicious config change from silently widening the detection threshold.
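The first step above can be sketched without any LLM in the loop. The Python below keeps a rolling per-category baseline of tool-call counts per session and lists the tools that exceed 3× that baseline in the current session. It assumes tool names have already been extracted from the trace spans; the class name, the history length, and the use of a per-session mean as the baseline rate are illustrative choices rather than a prescribed design.

```python
from collections import Counter, defaultdict, deque


class RateWatchdog:
    def __init__(self, history: int = 50, multiplier: float = 3.0):
        self.multiplier = multiplier
        # task category -> per-session tool counts for the most recent sessions
        self.sessions = defaultdict(lambda: deque(maxlen=history))

    def baseline(self, category: str) -> dict[str, float]:
        """Mean calls per session for each tool in this task category."""
        past = self.sessions[category]
        if not past:
            return {}
        totals = Counter()
        for counts in past:
            totals.update(counts)
        return {tool: n / len(past) for tool, n in totals.items()}

    def violations(self, category: str, calls: list[str]) -> list[str]:
        """Tools whose count in this session exceeds multiplier x baseline."""
        base = self.baseline(category)
        if not base:
            return []  # no history yet, so there is nothing to compare against
        counts = Counter(calls)
        return sorted(t for t, n in counts.items()
                      if n > self.multiplier * base.get(t, 0.0))

    def close_session(self, category: str, calls: list[str]) -> None:
        """Fold a completed, non-suspended session into the rolling baseline."""
        self.sessions[category].append(Counter(calls))
```

In this sketch a tool with no history has a baseline of zero, so any use of it counts as a violation. Whether that is the right default depends on how often legitimate tasks introduce new tools. A non-empty result from violations is where the automatic suspension would be wired in.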

Threats it governs

When this principle is absent, these threats become reachable.

T1 Memory Poisoning: Adversarial content written into short- or long-term memory contaminates future decisions.
T5 Cascading Hallucination Attacks: Fabricated outputs propagate via reflection, memory, or multi-agent comms.
T6 Intent Breaking and Goal Manipulation: Adversaries manipulate planning, reasoning, or self-evaluation to override goals.
T7 Misaligned and Deceptive Behaviors: Agents pursue goals via constraint bypass, deception, or evasion of oversight.

Controls that advance it

Catalogue mitigations that strengthen this principle, grouped by the defence-in-depth stage they sit in.

Prevent

JIT elevation: An agent running with a permanent high-privilege identity gives an attacker, or a misconfigured agent, broad access for as long as that identity persists. Time-bounded privilege elevation addresses this by issuing a short-lived credential tied to a specific action window: the agent holds elevated access only for the duration it needs, and the issuing platform revokes that access automatically when the TTL expires. This is the just-in-time (JIT) access pattern from PAM practice, applied to non-human identities.
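A minimal sketch of time-bounded elevation is shown below, assuming an in-memory issuer. The class names, the scope strings, and the platform-wide TTL cap are hypothetical; a real deployment would rely on the credential service of its identity platform rather than code like this.

```python
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    token: str
    agent_id: str
    scope: str
    expires_at: float


class JitIssuer:
    def __init__(self, max_ttl_s: float = 300.0):
        self.max_ttl_s = max_ttl_s   # assumed platform cap on any single grant
        self._grants = {}            # token -> Grant

    def elevate(self, agent_id: str, scope: str, ttl_s: float, now: float) -> Grant:
        ttl = min(ttl_s, self.max_ttl_s)  # never issue longer than the cap
        grant = Grant(secrets.token_urlsafe(16), agent_id, scope, now + ttl)
        self._grants[grant.token] = grant
        return grant

    def is_allowed(self, token: str, scope: str, now: float) -> bool:
        """Checked on every action, so expiry needs no separate cleanup job."""
        grant = self._grants.get(token)
        if grant is None or grant.scope != scope:
            return False
        if now >= grant.expires_at:
            del self._grants[token]  # expired: revoke on first sight
            return False
        return True
```

Checking the grant on each action, rather than once at issue time, is what ties this control back to continuous verification: an expired or out-of-scope credential fails at the next attempt.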

Detect

Identity monitoring: An AI agent operates under a non-human identity (NHI): a service principal, a task role, or a workload credential. That identity produces a stream of access events that, for a well-scoped agent, forms a narrow and predictable behavioural baseline. Identity monitoring applies User and Entity Behaviour Analytics (UEBA) to that stream, alerting when an observed access pattern deviates statistically from the baseline. Because agent behavioural distributions are tighter than those of human users, a deviation is a higher-confidence signal, and a spoofed or stolen credential used from the wrong workload origin is exactly the anomaly the technique is built to detect.
Cross-system audit: An agent that operates across HR, Finance, cloud, and SaaS systems picks up permissions at each boundary, often without any single team seeing the combined picture. That privilege builds up silently until a quarterly review finds it, by which point a compromised or misconfigured agent has had weeks of unchecked reach. Cross-system scope auditing prevents that by continuously reconciling the agent's actual entitlements against a declared baseline across every system it touches and raising a ticket the moment drift is detected.
Divergence monitor: An agent's behaviour can shift gradually over time: tool-selection patterns change, refusal rates drop, output style drifts. No single interaction reveals it, and a single-shot evaluation cannot catch a trend that spans weeks. Behavioural divergence monitoring detects that drift by comparing per-window statistical distributions of observable agent signals against a declared baseline, and alerting when the gap exceeds a threshold.
Goal consistency: An agent's goal can drift across reasoning steps without any single catastrophic event: a manipulated tool output, a planted instruction in retrieved content, or an incremental semantic shift across many planner outputs can each redirect the agent away from its original objective. Goal-consistency monitoring addresses this by persisting the originally declared goal, deriving a goal-state signal at each reasoning step, and computing a similarity score between the two. When the score falls below a per-task threshold, the monitor pauses the agent and surfaces the divergence for human review before any irreversible action executes.
Trust score: In a multi-agent system, each agent routes decisions based on what its peers report. If a peer's behaviour becomes unreliable or adversarial, agents that keep treating it with full authority will propagate whatever errors or manipulations that peer introduces. Per-agent trust scoring addresses this by maintaining a continuously updated reputation score for every peer, derived from observed behaviour, and using that score to determine how much authority each incoming message carries.
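The divergence monitor described above compares a per-window distribution against a declared baseline. One way to sketch that comparison is with total variation distance over tool-selection frequencies. The choice of metric, the function names, and the 0.2 threshold are assumptions for illustration; the catalogue entry does not specify any of them.

```python
from collections import Counter


def distribution(events: list[str]) -> dict[str, float]:
    """Relative frequency of each event type in a non-empty window."""
    counts = Counter(events)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}


def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    """Half the L1 distance between two discrete distributions.

    0.0 means identical, 1.0 means no overlap at all.
    """
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def diverged(baseline: dict[str, float], window: list[str],
             threshold: float = 0.2) -> bool:
    """True when the window's distribution is further than threshold from baseline."""
    if not window:
        return False  # nothing observed in this window
    return total_variation(baseline, distribution(window)) > threshold


baseline = {"read": 0.8, "reply": 0.2}
this_week = ["read"] * 5 + ["reply"] * 2 + ["search"] * 3
alert = diverged(baseline, this_week)
```

The same comparison could be run over other observable signals the entry mentions, such as refusal rates, by bucketing them into discrete categories first.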

Respond

No catalogued control.

In Helmwart

Audited indirectly as the “continuous verification” Zero-Trust tenet in Q4; no runtime behavioural monitor is modelled.