Why it matters for agentic AI
Defence-in-depth predates computing: military planners drew concentric defensive rings so that any single wall being breached still left attackers short of the objective. The security world translated this into overlapping controls: firewalls, intrusion detection, host hardening, and data-layer encryption, each independently enforced so one control failing doesn’t hand over the system. Agentic AI fractures that picture in a specific, structurally important way: some of your layers are now probabilistic.
Fail Securely and Default Deny are the two principles that keep the deterministic layers hard when any probabilistic layer fails.
A model-level refusal is a layer of defence, but it is a soft layer. It can be prompted around, and the same adversarial input that manipulated the agent can manipulate a guardrail model you placed in its path to judge it. A single guardrail LLM asked to vet the agent’s output shares the agent’s vulnerability surface if the two models see the same input. This is the core agentic shift: the probabilistic layers add depth but must never be the enforcement layers. Only the orchestrator logic, the tool gateway, and the infrastructure (egress, networking, sandboxing) are deterministic and immune to prompting. In a well-layered system these must be the hard gates; the model’s built-in reluctance is a first filter, not the last.
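As a minimal sketch of what a deterministic hard gate at the tool gateway could look like, the Python below applies Default Deny (anything not on the allowlist is refused) and Fail Securely (any error inside the gate is treated as a denial). The tool names, the `ToolCall` type, and the `gate` function are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass

# Hypothetical allowlist: tool name -> argument keys the tool may receive.
ALLOWED_TOOLS = {
    "search_docs": {"query"},
    "read_file": {"path"},
}


@dataclass(frozen=True)
class ToolCall:
    name: str
    args: dict


def gate(call: ToolCall) -> bool:
    """Return True only when the call matches the allowlist."""
    try:
        allowed_args = ALLOWED_TOOLS.get(call.name)
        if allowed_args is None:
            return False  # default deny: unknown tools are refused
        # Every supplied argument key must be one the tool is allowed to take.
        return set(call.args) <= allowed_args
    except Exception:
        return False  # fail securely: an error in the gate means "deny"
```

The gate never inspects the model's reasoning or wording, so a persuasive prompt has nothing to act on; only the structured call is evaluated.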
Multi-agent pipelines add a second dimension. In a classical stack, depth is vertical, from network perimeter down to disk encryption. In a multi-agent graph, depth is also lateral: each hop between orchestrator and sub-agent is a place where a control can exist or be absent. An orchestrator that validates an incoming tool request but forwards it to a sub-agent without re-checking trust has only one layer where there could have been two. Re-verifying trust at every hop, even when the previous hop was already inside the perimeter, is the multi-agent form of defence-in-depth.
Scenario: the shared guardrail surface
An orchestrator uses a second LLM as its safety filter: every draft action is checked for policy compliance before execution. The input document is a poisoned PDF designed to exfiltrate the user’s files. The injection tricks both the agent and the safety filter in a single move, because they share the same probabilistic failure mode. Making the safety filter a second, independent LLM does not remove the problem: an attacker who can craft an adversarial input for one transformer architecture is often close to cracking a similar one. Genuine depth requires at least three layers, and at least one must be deterministic: input scanning, then an orchestrator policy gate (OPA/Cedar), then infrastructure egress control. The injection still enters, but the subsequent tool call and the outbound HTTP request each hit a layer the injection cannot bypass by persuasion.
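The egress layer in this scenario would normally be enforced in infrastructure rather than in application code, but its logic can be sketched in a few lines of Python. The host names and the `egress_permitted` function are assumptions made up for this example.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist of hosts the agent runtime may contact.
ALLOWED_EGRESS_HOSTS = frozenset({"api.internal.example", "docs.example.com"})


def egress_permitted(url: str) -> bool:
    """Allow an outbound request only to an allowlisted host over HTTPS."""
    try:
        parts = urlsplit(url)
        host = parts.hostname  # parsed host, lowercased, without userinfo or port
    except ValueError:
        return False  # unparseable URLs are denied
    if parts.scheme != "https":
        return False
    return host is not None and host in ALLOWED_EGRESS_HOSTS
```

However convincing the injected text is, the exfiltration request to an unlisted host is refused, because the check looks only at the parsed destination.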
Scenario: the single-hop trust boundary
A fintech orchestrator calls nine sub-agents. It authenticates and authorises each one at session start, then assumes they are trusted for the remainder. One sub-agent retrieves external news articles and its context is poisoned mid-run. The poisoned agent emits an instruction to another sub-agent (“transfer £5,000 to sort code …”), and because all nine inherited trust from the same session admission, the receiving agent executes it. Per-hop re-authorisation (a deterministic policy check at every inter-agent call, not just at the outer edge) is what turns one compromised node into a contained incident rather than a cross-system cascade.
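A per-hop check of this kind can be sketched as a static policy table consulted on every inter-agent call. The agent names, actions, and the `authorise_hop` and `dispatch` helpers below are hypothetical; a real deployment would more likely delegate the decision to a policy engine.

```python
from typing import Any, Callable

# Hypothetical per-hop policy: (caller, callee, action) triples that are allowed.
HOP_POLICY = frozenset({
    ("orchestrator", "news_agent", "fetch_articles"),
    ("orchestrator", "payments_agent", "transfer"),
    ("news_agent", "summary_agent", "summarise"),
})


def authorise_hop(caller: str, callee: str, action: str) -> bool:
    """Deterministic check: the exact triple must be listed, otherwise deny."""
    return (caller, callee, action) in HOP_POLICY


def dispatch(caller: str, callee: str, action: str,
             handler: Callable[[], Any]) -> Any:
    """Re-check policy on every call, regardless of earlier session admission."""
    if not authorise_hop(caller, callee, action):
        raise PermissionError(f"denied: {caller} -> {callee}:{action}")
    return handler()
```

Under this table the poisoned news agent cannot ask the payments agent to transfer funds, even though both were admitted to the same session.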
How it fails
- All safety rests on the model; a model version upgrade silently regresses safety with no runtime signal, and no secondary layer catches the regression.
- A second guardrail LLM is placed in the path but is evaluated on the same input as the agent, creating a shared failure surface rather than an independent one.
- Sub-agents inherit the orchestrator’s session trust; there is no per-hop policy gate, so depth is only one layer wide despite many nodes.
- One “everything agent” concentrates the full blast radius in a single failure domain, defeating the entire purpose of layering.
- Telemetry on tool calls is absent or aggregated too coarsely to detect a mid-session layer failure before it propagates.
Why the mapped controls work
The model → safety-system → orchestrator → infrastructure stack distributes defence across the probabilistic/deterministic divide: the model and a safety LLM are the early filters, but the orchestrator policy gate (OPA or Cedar, statically compiled) and the infrastructure egress controls are the gates that can’t be argued away. Distinct input and output guardrails close the shared-surface problem: an input filter operates on raw untrusted content before it shapes the agent’s reasoning, while an output filter sees the proposed action before execution, and the two are independently configured so bypassing one doesn’t trivially bypass the other. Per-hop re-verification keeps the lateral dimension of depth intact across multi-agent graphs. Telemetry on every tool call provides the detection layer so a control failure produces a signal: a cascade that starts with one failed layer is containable only if detection notices before the next layer is reached.
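To illustrate the detection layer, the sketch below wraps a tool call so that every invocation, whether it succeeds or fails, emits one structured record. The `call_with_telemetry` helper and the record fields are assumptions for illustration; the in-memory `sink` list stands in for whatever log pipeline is actually used.

```python
import json
import time
from typing import Any, Callable


def call_with_telemetry(tool_name: str, tool: Callable[..., Any],
                        sink: list, **kwargs: Any) -> Any:
    """Run a tool and append exactly one JSON record per call to `sink`."""
    record = {"tool": tool_name, "arg_names": sorted(kwargs), "ts": time.time()}
    try:
        result = tool(**kwargs)
        record["outcome"] = "ok"
        return result
    except Exception as exc:
        record["outcome"] = "error"
        record["error_type"] = type(exc).__name__
        raise  # the caller still sees the failure
    finally:
        sink.append(json.dumps(record))  # emitted on success and on failure
```

Only argument names are logged here, not values, on the assumption that tool arguments may contain sensitive data.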
First steps
- Map your current agent stack against the four-layer model (model, safety-system, orchestrator policy gate, infrastructure egress) and identify which layers you are currently missing; add the highest missing deterministic layer (OPA/Cedar gate or eBPF egress policy) before adding any further probabilistic guardrails.
- Configure your input and output guardrail models to use different base model weights from the primary agent (e.g. a smaller or differently fine-tuned checkpoint) and feed them independently so a single adversarial input cannot trivially bypass both at once.
- Implement per-hop re-authorisation at every inter-agent call using short-lived scoped tokens (see Agent Identity), and add a structured log entry at each hop so that a cascade failure across multiple agents produces a traceable signal rather than a silent propagation.
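The short-lived scoped tokens in the last step could look roughly like the HMAC-signed sketch below. This is an illustration under stated assumptions, not a recommended token format: the `issue_token` and `verify_token` helpers, the `|`-separated layout, and the hard-coded secret are all hypothetical, and a production system would more likely use an established token standard with keys held in a secrets manager.

```python
import hashlib
import hmac
import time
from typing import Optional

SECRET = b"demo-only-secret"  # assumption: real keys come from a secrets manager


def _sign(payload: str) -> str:
    return hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()


def issue_token(agent: str, scope: str, ttl_s: int,
                now: Optional[float] = None) -> str:
    """Issue a token bound to one agent and one scope, expiring after ttl_s."""
    issued = time.time() if now is None else now
    payload = f"{agent}|{scope}|{int(issued) + ttl_s}"
    return f"{payload}|{_sign(payload)}"


def verify_token(token: str, required_scope: str,
                 now: Optional[float] = None) -> bool:
    """Accept only an untampered, unexpired token carrying the required scope."""
    try:
        agent, scope, expires, sig = token.split("|")
        current = time.time() if now is None else now
        return (
            hmac.compare_digest(sig, _sign(f"{agent}|{scope}|{expires}"))
            and scope == required_scope
            and current < int(expires)
        )
    except (ValueError, TypeError):
        return False  # malformed tokens are denied
```

Each receiving agent would call `verify_token` with the scope its action requires, so a token minted for reading news is useless for a payment call.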
Controls that advance it
Catalogue mitigations that strengthen this principle, grouped by the defence-in-depth stage they sit in.
- PI defences: Prompt injection succeeds when untrusted content entering an agent's prompt is indistinguishable from trusted instruction. Three layered techniques address that: spotlighting tags untrusted content with a machine-readable origin mark before it reaches the model; delimiter defence rejects input carrying reserved framework tokens before the model is called; and dual-LLM extraction routes attacker-influenceable content through a quarantined model that holds no tool access, so injected instructions cannot reach the model that can act on them.
- Output moderation: An AI agent can produce output that is harmful, deceptive, or factually wrong while still sounding fluent and confident. Output moderation places an independent classifier or moderation model between the agent and its destination, checking every output before it reaches a user or a downstream system. The generating model does not evaluate its own answer; a separate gate does.
- Divergence monitor: An agent's behaviour can shift gradually over time: tool-selection patterns change, refusal rates drop, output style drifts. No single interaction reveals it, and a single-shot evaluation cannot catch a trend that spans weeks. Behavioural divergence monitoring detects that drift by comparing per-window statistical distributions of observable agent signals against a declared baseline, and alerting when the gap exceeds a threshold.
- Kill switch: Agentic systems can act faster than a human can intervene through normal channels. A kill switch is the operational guarantee that a named human role can stop agent activity at any scope (single instance, class, or global) through a documented runbook, without requiring a code change or redeployment, and with every invocation written to an audit trail.
- Graceful degradation: An agent that encounters a quota trip, a dependency failure, or a timeout faces a choice: continue at reduced quality, or refuse. Getting that choice wrong is the core operational failure. Graceful degradation requires the answer to be declared before the incident, not improvised during it: write-authority paths fail closed and return a refusal; read-only paths fail open and disclose the degraded state explicitly.
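The graceful-degradation rule in the list above (write paths fail closed, read-only paths fail open with disclosure) can be sketched as a small wrapper whose behaviour is fixed by a declared path kind. The `PathKind`, `Outcome`, and `run_with_declared_fallback` names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, Optional


class PathKind(Enum):
    READ_ONLY = "read_only"
    WRITE = "write"


@dataclass
class Outcome:
    ok: bool
    value: Any = None
    degraded: bool = False        # True when a fallback answer was served
    reason: Optional[str] = None  # why the primary path failed


def run_with_declared_fallback(kind: PathKind, primary: Callable[[], Any],
                               fallback: Optional[Callable[[], Any]] = None
                               ) -> Outcome:
    """Apply the pre-declared failure behaviour for this path kind."""
    try:
        return Outcome(ok=True, value=primary())
    except Exception as exc:
        reason = type(exc).__name__
        if kind is PathKind.WRITE or fallback is None:
            return Outcome(ok=False, reason=reason)  # fail closed: refuse
        try:
            # Fail open on read-only paths, and disclose the degraded state.
            return Outcome(ok=True, value=fallback(), degraded=True, reason=reason)
        except Exception:
            return Outcome(ok=False, reason=reason)  # fallback failed too: refuse
```

Because the path kind is declared at the call site, the choice between refusing and degrading is made before any incident rather than improvised during one.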
In Helmwart
A Q4-audited principle. It checks that each surfaced threat has mitigations across two or more phases (proactive/detective/reactive), and the canvas counts distinct control families per threat.