Assume Breach · Principles

Why it matters for agentic AI

Assume-breach entered mainstream security discourse as a posture shift: stop trying to guarantee the attacker never gets in, and start designing controls that hold after they already are in. For human-operated systems this mostly means east-west segmentation, minimal blast radius, and incident-response rehearsal. For agentic systems the posture shift runs deeper, because the adversarial channel isn’t the network. It’s the context window itself.

Any agent that reads external documents, email, web pages, search results, or other agents’ output is continuously processing attacker-reachable content. Prompt injection attacks exploit this directly: a crafted string inside an otherwise ordinary document can redirect the agent’s goal without any credential theft or network intrusion. Adaptive attacks on current defences succeed often enough that assuming injection will occasionally land is the only honest engineering stance. The consequence is stark: controls built on the assumption that injection can be fully prevented provide weak guarantees, because an attacker only needs to find one bypassed filter, while the defender needs every filter to hold every time. Assume-breach for agents means designing controls that still hold after the model is successfully injected.

The second major shift is temporal. Assume-breach in a classical network is usually triggered by an observed event: an alert fires, a log anomaly is detected. In agentic systems, a poisoned memory entry can sit inert for days before a specific trigger causes it to act; an injected instruction can persist across sessions if it writes to memory. There may be no error event at all. The agent appears to function correctly from the outside while executing attacker-directed actions. This demands proactive resilience: design the architecture so attacker-controlled content causes minimal damage even before any detection fires.

Scenario: EchoLeak, design-for-breach absent

EchoLeak (CVE-2025-32711) demonstrated the cost of assuming-away injection. A crafted email reached Microsoft 365 Copilot; the model, behaving “correctly” from its own perspective, read the user’s files and exfiltrated their content with no user interaction. The injection succeeded because the agent’s credential was available in the same context as untrusted external content; the attacker just needed the model to follow its instructions. Had the architecture separated untrusted-content processing (no credentials, structured output only) from the execution step that held credentials, using the dual-LLM pattern, the injection would have reached a quarantined context with nothing to steal. The breach was assumed-away rather than designed-for.

Scenario: incubating memory poisoning

A research assistant agent ingests thousands of documents daily and writes summaries into a long-term vector store. An attacker places a poisoned document in a public data source; the agent ingests it, writes a subtly altered “fact” to memory, and continues working normally. A week later a different query surfaces the false fact; the agent cites it in an authoritative report. No error was raised. Assume-breach says: treat every write to persistent memory as potentially attacker-influenced. Versioned, append-only memory with periodic integrity sampling and a rollback path means the damage is bounded and recoverable (the poison can be excised) rather than permanent.

How it fails

Defences assume injection can be fully prevented; when one filter is bypassed the entire posture collapses because no downstream control was designed to hold independently.
A poisoned memory entry is written with no tagging of its source trust level; it incubates and acts days later with no event to catch.
A prompt-injected agent still “completes its workflow” to monitoring, producing visible output and returning success, while executing attacker-directed side effects.
Secrets are held in the agent’s context or accessible to the same identity that processes untrusted input, so a successful injection directly yields credentials.
The agent’s autonomy level is not demoted on anomaly; after injection it continues with full privileges.

Why the mapped controls work

The dual-LLM pattern is the architecturally explicit embodiment of this posture: a privileged LLM with credentials and tool access never sees raw untrusted text; a quarantined LLM with no tools processes external content and returns only structured data through a narrow typed schema. The injection surface and the execution surface are separated at the architectural level, not just by a filter. Plan-then-execute with an immutable plan closes the in-flight manipulation path: the agent commits to a structured plan before retrieving external content, and the plan cannot be rewritten by anything it subsequently reads. Credential isolation via a proxy means even a perfectly successful injection finds no secrets in context to exfiltrate. Automatic autonomy demotion on anomaly applies the assume-breach posture at runtime: when behaviour departs from baseline the system doesn’t wait for confirmation, it immediately reduces what the agent is allowed to do.

First steps

Introduce the dual-LLM split for any agent that reads external documents: stand up a separate, tool-less “reader” model that returns only typed JSON fields, and verify in a code review that no raw external text can reach the privileged “executor” model’s context.
Apply append-only versioning to your persistent vector store (e.g. Chroma’s or Weaviate’s versioning features, or a write-once S3 prefix) and write a weekly integrity-sampling job that compares a random sample of stored entries against their source hashes, alerting on any mismatch.
Define and enforce an anomaly-demotion rule in your orchestration layer: if the agent’s action stream deviates from its declared task type (e.g. a summarisation agent issuing an HTTP POST to an external domain), automatically downgrade it to read-only mode and page an operator before it can take another action.

Threats it governs

When this principle is absent, these threats become reachable.

T1
Memory Poisoning Adversarial content written into short- or long-term memory contaminates future decisions.
T5
Cascading Hallucination Attacks Fabricated outputs propagate via reflection, memory, or multi-agent comms.
T15
Human Manipulation Attacker turns the agent into a fluent, personalised social-engineering vector trusted by the user.

Controls that advance it

Catalogue mitigations that strengthen this principle, grouped by the defence-in-depth stage they sit in.

Prevent

Context isolation An LLM processes everything in its context window as a single stream of tokens; it has no innate ability to tell instructions apart from data. If an attacker can place content where the model treats it as instruction, they control the agent. Context isolation prevents that by structurally separating untrusted content from system instructions at prompt construction time, so the boundary is enforced before the model ever sees the input.
Input sanitisation An LLM cannot distinguish data from instructions on its own: that boundary has to be enforced at the point where external content enters the prompt. Input sanitisation does this by normalising, filtering, and structurally segmenting untrusted content before the model ever sees it, so retrieved documents, tool results, and user messages are treated as data rather than commands.

Detect

Egress DLP An agent produces output continuously across multiple channels: user-facing responses, tool-call parameter envelopes, log records, and outbound HTTP requests. Any of those channels can carry sensitive content the agent has retrieved, been fed, or been tricked into including. Output egress DLP places an inspection gate at the boundary so that PII, credentials, and proprietary content are classified and either redacted or quarantined before they leave the trust boundary, regardless of how they got into the output.
Divergence monitor An agent's behaviour can shift gradually over time: tool-selection patterns change, refusal rates drop, output style drifts. No single interaction reveals it, and a single-shot evaluation cannot catch a trend that spans weeks. Behavioural divergence monitoring detects that drift by comparing per-window statistical distributions of observable agent signals against a declared baseline, and alerting when the gap exceeds a threshold.

Respond

Anomaly isolation An agent that has been compromised, poisoned, or gone rogue will, in most cases, behave differently from its established baseline. Anomaly isolation acts on that difference: when an agent's behaviour score crosses a configured threshold, it is quarantined automatically, credentials revoked, message-queue access cut, in-flight actions aborted. Manual revocation cannot match the speed that cascading multi-agent failures demand.
Kill switch Agentic systems can act faster than a human can intervene through normal channels. A kill switch is the operational guarantee that a named human role can stop agent activity at any scope (single instance, class, or global) through a documented runbook, without requiring a code change or redeployment, and with every invocation written to an audit trail.
Graceful degradation An agent that encounters a quota trip, a dependency failure, or a timeout faces a choice: continue at reduced quality, or refuse. Getting that choice wrong is the core operational failure. Graceful degradation requires the answer to be declared before the incident, not improvised during it: write-authority paths fail closed and return a refusal; read-only paths fail open and disclose the degraded state explicitly.

In Helmwart

The stance behind the Defence-in-Depth audit. Its reactive and detective phases are “detect and respond after compromise.” Named explicitly so it is not mistaken for a missing control.