LLM01:2025: Prompt Injection (agentic delta)

What changes in an agent loop

In a chatbot, a prompt injection's worst outcome is a single misleading reply a human can catch. In an agent, the same injection triggers a chain of authorised tool calls (file writes, API requests, messages to peer agents) that the runtime cannot distinguish from legitimate work. This is the confused deputy problem at agent scale: a trusted principal executing unauthorised instructions on behalf of whoever crafted the malicious input. The attack surface widens with every integration point: a poisoned RAG document, a tool response carrying a hidden instruction, or a peer-agent message can all serve as the injection vector. Audit logs read clean throughout because each step was individually within policy. Containing this requires input validation at every trust boundary, not just at the user-facing entry point.

Canonical landings T6 Intent Breaking and Goal Manipulation T2 Tool Misuse T11 Unexpected RCE and Code Attacks

For the full definition, prevention checklist, and detection guidance, read OWASP's Prompt Injection page →. This page only adds the agentic angle and the bridge into Helmwart.

Mitigations

gVisor sandbox — a user-space kernel that intercepts every syscall a container makes T1

When an agent executes generated or retrieved code, that code runs as a process with access to the host kernel. A vulnerability in the generated code, or a deliberate exploit injected through the agent's prompt, can reach the kernel and affect other workloads or the host itself. gVisor prevents this by inserting a user-space kernel implementation between the container and the host: the container's syscalls go to the Sentry process, not to the host kernel, so the reachable attack surface from inside the container is structurally smaller.

Open Policy Agent — a policy-as-code engine for every tool call an agent makes T1

An agent can invoke any tool it has access to, constrained only by its own reasoning. If that reasoning is manipulated or the agent's permissions are misconfigured, it will call tools it should not. OPA addresses this by placing a policy decision point between the agent and every tool invocation: a Rego policy evaluates the agent identity, the tool, and the parameter envelope before execution proceeds, and the agent cannot reason or argue past the result.

Code-generation review gate — human approval before AI-generated code executes or merges T2

An AI coding agent produces code that can be executed or merged to a production branch without a human ever reading it. If the agent has been manipulated, its generated code can contain hidden payloads, backdoors, or privilege-escalating logic. A code-generation review gate prevents that: every change attributable to an AI agent must pass automated static analysis and receive explicit human approval before it can merge or execute, and the agent identity that authored the change is structurally barred from also approving it.

Context isolation — separate untrusted content from system instructions T2

An LLM processes everything in its context window as a single stream of tokens; it has no innate ability to tell instructions apart from data. If an attacker can place content where the model treats it as instruction, they control the agent. Context isolation prevents that by structurally separating untrusted content from system instructions at prompt construction time, so the boundary is enforced before the model ever sees the input.

Data classification with tool-access allow-lists — a sensitivity label on every dataset, enforced at every access seam T2

Every dataset, document, and external system an agent can reach carries a classification label. The agent's permitted-class set and the tool's permitted-class set are intersected at the moment of every read or write. When the requested data's class falls outside that intersection, access is denied at the seam. This is the data-side complement to least-privilege: it adds a data-sensitivity constraint that role scoping alone does not provide.

Human dual-control — four-eyes rule for irreversible high-impact approvals T2

An AI agent operating with broad authority can propose actions that are irreversible: deleting records, modifying IAM policies, moving funds. A single human reviewer at the approval gate is a single point of failure, one compromised account, one fatigued reviewer, or one successful social-engineering attempt is enough to commit the action. Human dual-control addresses that by requiring two distinct, independent humans to approve before the action commits.

Output egress DLP — inspection gate for PII, secrets, and IP at the agent boundary T2

An agent produces output continuously across multiple channels: user-facing responses, tool-call parameter envelopes, log records, and outbound HTTP requests. Any of those channels can carry sensitive content the agent has retrieved, been fed, or been tricked into including. Output egress DLP places an inspection gate at the boundary so that PII, credentials, and proprietary content are classified and either redacted or quarantined before they leave the trust boundary, regardless of how they got into the output.

Goal-consistency monitoring — a per-step check that the agent is still pursuing its original objective T2

An agent's goal can drift across reasoning steps without any single catastrophic event: a manipulated tool output, a planted instruction in retrieved content, or an incremental semantic shift across many planner outputs can each redirect the agent away from its original objective. Goal-consistency monitoring addresses this by persisting the originally-declared goal, deriving a goal-state signal at each reasoning step, and computing a similarity score between the two. When the score falls below a per-task threshold, the monitor pauses the agent and surfaces the divergence for human review before any irreversible action executes.

HITL feedback-loop calibration — reviewer overrides fed back into agent tuning T2

An agent at a human-in-the-loop gate will be overridden when its decisions do not match the reviewer's judgment. Without a return path, those corrections are discarded: the same miscalibration surfaces again in the next review cycle and the one after that. A feedback loop closes that gap by capturing each override event as a structured record, accumulating those records into a calibration dataset, and using patterns in that dataset to drive targeted changes to the agent's system prompt, tool-scope policy, or divergence-monitor thresholds. A well-calibrated agent produces fewer out-of-distribution decisions, so the review queue contracts over time.

Input sanitisation — enforcing the data/instruction boundary before content reaches the model T2

An LLM cannot distinguish data from instructions on its own: that boundary has to be enforced at the point where external content enters the prompt. Input sanitisation does this by normalising, filtering, and structurally segmenting untrusted content before the model ever sees it, so retrieved documents, tool results, and user messages are treated as data rather than commands.

Just-in-time tool grants — ephemeral access scoped to a single task T2

An agent that holds a persistent catalog of invokable tools can reach any of them at any point in its session. If its reasoning is manipulated or its identity is compromised, that persistent surface is fully available to an attacker. Just-in-time tool grants remove the standing surface: a policy broker issues a time-bound, task-scoped grant immediately before the tool is needed and revokes it automatically when the task completes or the window expires.

Kill switch: human authority to halt one agent, a class, or the entire deployment T2

Agentic systems can act faster than a human can intervene through normal channels. A kill switch is the operational guarantee that a named human role can stop agent activity at any scope (single instance, class, or global) through a documented runbook, without requiring a code change or redeployment, and with every invocation written to an audit trail.

MCP response sanitisation — validate and normalise tool outputs before they re-enter the LLM context T2

An MCP server response is content the LLM will reason over next. The model cannot distinguish tool output from instruction: that boundary must be enforced at the client, before the payload enters the context window. MCP response sanitisation applies schema validation, Unicode normalisation, control-token stripping, and structural wrapping to every tool result at the response boundary, so adversarial content embedded in a server response cannot redirect the agent's planner.

Advanced prompt-injection defences — spotlighting, delimiter gate, dual-LLM T2

Prompt injection succeeds when untrusted content entering an agent's prompt is indistinguishable from trusted instruction. Three layered techniques address that: spotlighting tags untrusted content with a machine-readable origin mark before it reaches the model; delimiter defence rejects input carrying reserved framework tokens before the model is called; and dual-LLM extraction routes attacker-influenceable content through a quarantined model that holds no tool access, so injected instructions cannot reach the model that can act on them.

Plan-vs-goal validation — independently check each proposed step against the original goal T2

A plan-then-execute agent produces a sequence of steps before acting. If the planner is manipulated, it will emit steps that serve the attacker's goal rather than the user's. Plan-vs-goal validation addresses this by placing an independent validator between the planner and the execution loop: it evaluates each proposed step against the originally-declared goal before the agent is permitted to act on it.

Pre-execution validation — a two-pass gate on every tool call an agent makes T2

An LLM produces tool-call arguments through generation, not through a type system, and generation is not reliable. The arguments may be wrong in type, out of range, or assembled in a combination that violates business rules. A pre-execution validation gate intercepts the call before it reaches the tool: a schema pass confirms each argument conforms to the declared JSON Schema, and a policy pass confirms the argument combination is permitted for this agent and this action. The tool executes only when both passes clear.

Link and HTML rendering restriction — an allow-list control on what agent output may render T2

An agent can include links and rich HTML in its output. When that output is attacker-influenced, a clickable link, embedded image, or rich preview card becomes the delivery mechanism for phishing or data exfiltration via markdown image injection. Rendering restriction removes that delivery vector by allowing clickable content only from an explicit allow-list of trusted domains and reducing everything else to plain text before the output reaches the user.

Secret scanning on agent-generated artefacts — detecting credentials before they escape the trust boundary T2

An agent produces code, configuration files, tool-call payloads, and log records continuously and at a rate no human reviewer can match. Any of those artefacts may contain a live API key, service token, or private certificate, placed there accidentally through model context, or deliberately through prompt injection or context poisoning. Secret scanning places an inspection gate at every agent output seam: regex patterns match known token formats, entropy analysis detects arbitrary high-entropy strings, and validator calls confirm which candidates are live credentials. The CI-secret-scanning pattern is mature; the agentic specialisation is seam placement, moving the scanner from the repository gate to the agent egress point, where artefacts can be intercepted before they reach any downstream system.

Static analysis on generated code — a pre-execution gate on LLM-emitted artifacts T2

An agent that can generate and execute code treats code generation as a tool call and code execution as the outcome. If the generated code contains a known-dangerous pattern, no amount of prompt engineering stops it from running once the execute call goes through. Static analysis closes that gap: it scans every code artifact the agent emits against a rule set before execution is permitted, catching the vulnerability patterns the same tooling already catches in human-written code.

Least-privilege tool scoping — a hard boundary on what each tool exposes T2

Each tool in an agent's catalog should expose only the methods, resources, and parameter ranges its designated role requires. Over-broad tool surfaces let individually authorised primitives compose into actions no human intended to grant; narrowing the scope at design time reduces both the attack surface and the blast radius of any compromise.

Intent attestation tokens — a cryptographic binding from user approval to tool execution T3

An agent acts on behalf of the user, but nothing in a standard OAuth bearer token records what the user actually approved. If the agent's planning is manipulated, it can invoke tools with parameters the user never sanctioned, while presenting credentials that look valid. Intent attestation fixes this by issuing a short-lived signed token that encodes the exact action and parameter envelope the user authorised, and requiring the resource server to verify that envelope before executing the call.