EVIDENCE TRAIL
Advanced prompt-injection defences — spotlighting, delimiters, dual-LLM
Verbatim excerpts from the upstream sources cited on the mitigation page, with what each source does and does not prove. The three techniques this control composes — spotlighting, delimiter defence, and dual-LLM extraction — each have distinct academic and vendor lineage; Simon Willison's April 2023 blog post is the canonical upstream reference for the privileged / quarantined split. MITRE ATLAS mitigations AML.M0015 and AML.M0020 are cross-referenced in the MDX but their current page text could not be fetched; they carry no excerpt here.
Last cross-checked against upstream sources: · 8 sources
References
Each entry shows what the source supports and what it does not prove.
Simon Willison — "The Dual LLM pattern for building AI assistants that can resist prompt injection"
Core concept section — definition of the Privileged / Quarantined split
"I think we need a pair of LLM instances that can work together: a Privileged LLM and a Quarantined LLM. … The Quarantined LLM is used any time we need to work with untrusted content—content that might conceivably incorporate a prompt injection attack. It does not have access to tools, and is expected to have the potential to go rogue at any moment. … it is absolutely crucial that unfiltered content output by the Quarantined LLM is never forwarded on to the Privileged LLM!"
Supports: Canonical statement of the dual-LLM extraction pattern. Verbatim source for the Privileged / Quarantined split as a prompt-injection defence. Establishes the key constraint: quarantined model gets no tool access and its raw output never reaches the privileged model.
Does not prove: April 2023 blog post predates current agent-framework primitives; does not specify a structured-output schema as the channel between the two models (that refinement is from later work). Does not describe spotlighting or delimiter defence.
Microsoft Azure AI Content Safety — Prompt Shields
Introduction and "Prompt Shields for documents" section
"Prompt Shields is a unified API in Azure AI Content Safety that detects and blocks adversarial user input attacks on large language models (LLMs). It helps prevent harmful, unsafe, or policy-violating AI outputs by analyzing prompts and documents before content is generated. … This shield aims to safeguard against attacks that use information not directly supplied by the user or developer, such as external documents. Attackers might embed hidden instructions in these materials in order to gain unauthorized control over the LLM session."
Supports: Production evidence for prompt-injection detection applied to both user prompts and third-party document content. Classifies document attacks (indirect prompt injection) by category. Demonstrates that spotlighting-class input inspection is a shipped vendor product, not only a research pattern.
Does not prove: The Microsoft docs describe what the API detects, not how the underlying spotlighting technique marks trusted vs. untrusted spans. Does not describe the delimiter-reservation or dual-LLM pattern.
Greshake et al. — "Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection" (arxiv:2302.12173)
Abstract
"LLM-Integrated Applications blur the line between data and instructions. We reveal new attack vectors, using Indirect Prompt Injection, that enable adversaries to remotely (without a direct interface) exploit LLM-integrated applications by strategically injecting prompts into data likely to be retrieved. … We derive a comprehensive taxonomy from a computer security perspective to systematically investigate impacts and vulnerabilities, including data theft, worming, information ecosystem contamination, and other novel security risks. … effective mitigations of these emerging threats are currently lacking."
Supports: Foundational academic evidence for indirect prompt injection as a real attack class. Establishes that the threat surfaces whenever an LLM reads attacker-controlled data (RAG, documents, emails). The "data vs. instructions" framing is the precise gap that spotlighting and delimiter defence are designed to close.
Does not prove: The paper establishes the threat but explicitly notes that effective mitigations were lacking at the time of writing. Does not propose spotlighting, delimiters, or dual-LLM as solutions; those defences developed after this paper.
Liu et al. — "Prompt Injection attack against LLM-integrated Applications" (arxiv:2306.05499)
Abstract and HouYi framework description
"This study deconstructs the complexities and implications of prompt injection attacks on actual LLM-integrated applications. … a novel black-box attack technique comprising three components: a seamlessly-incorporated pre-constructed prompt, an injection prompt inducing context partition, and a malicious payload designed to fulfill the attack objectives."
Supports: Names "context partition" as a core component of a successful injection attack — directly motivating delimiter defence (reserving delimiters the attacker cannot produce removes context partitioning as an attack primitive). Evaluates 36 real applications. Referenced in the MDX independentEvidence field.
Does not prove: Describes the attack mechanism; does not evaluate delimiter-reservation or spotlighting as defences. The paper's scope is attack characterisation, not defence validation.
OWASP Top 10 for Agentic Applications 2026
§ASI01 Agent Goal Hijack — Description and Prevention and Mitigation Guidelines (item 1)
"AI Agents exhibit autonomous ability to execute a series of tasks to achieve a goal. Due to inherent weaknesses in how natural-language instructions and related content are processed, agents and the underlying model cannot reliably distinguish instructions from related content. … Treat all natural-language inputs (e.g., user-provided text, uploaded documents, retrieved content) as untrusted. Route them through the same input-validation and prompt-injection safeguards defined in LLM01:2025 before they can influence goal selection, planning, or tool calls."
Supports: Establishes "cannot reliably distinguish instructions from related content" as the root cause of goal hijack — the exact gap that spotlighting addresses. The mitigation directive to treat all natural-language inputs as untrusted and route through injection safeguards is the upstream policy this control operationalises.
Does not prove: ASI01 mitigations are stated at policy level; the document does not name spotlighting, delimiters, or dual-LLM as specific implementation patterns. Helmwart supplies the concrete mechanism.
OWASP Agentic AI — Threats & Mitigations v1.1
§T6 Intent Breaking & Goal Manipulation — Description
"Intent Breaking and Goal Manipulation occurs when attackers exploit the lack of separation between data and instructions in AI agents, using prompt injections, compromised data sources, or malicious tools to alter the agent's planning, reasoning, and self-evaluation. This allows attackers to override intended objectives, manipulate decision-making, and force AI agents to execute unauthorized actions, particularly in systems with adaptive reasoning and external interaction capabilities (e.g., ReAct-based agents)."
Supports: Names "lack of separation between data and instructions" as the exploited condition. This is verbatim upstream support for the spotlighting / delimiter / dual-LLM family of controls, which all operate by re-establishing that separation. MDX maps this control to T6.
Does not prove: The T6 mitigation in the table is "Implement planning validation frameworks, boundary management for reflection processes, and dynamic protection mechanisms for goal alignment" — general guidance that does not specifically name spotlighting or dual-LLM.
Anthropic — Prompt Engineering Best Practices (Claude docs)
"Structure prompts with XML tags" section
"XML tags help Claude parse complex prompts unambiguously, especially when your prompt mixes instructions, context, examples, and variable inputs. Wrapping each type of content in its own tag (e.g. <instructions>, <context>, <input>) reduces misinterpretation."
Supports: Anthropic's own engineering guidance recommends wrapping content types in distinct structural tags — this is the delimiter-defence family of techniques applied to Claude specifically. The instruction to wrap <context> and <input> separately from <instructions> is the production-level expression of the principle that this mitigation formalises.
Does not prove: The Anthropic docs frame this as a prompting clarity technique, not a security control. The security-specific framing (reserved tokens, rejection on detection, fail-closed on delimiter misuse) is Helmwart's layering on top of this guidance.
MITRE ATLAS AML.T0051 — LLM Prompt Injection
No verbatim excerpt pulled yet — open the original to verify the cited section.
Supports: Catalogues prompt injection as an adversarial ML technique. This mitigation directly addresses AML.T0051. Referenced in the MDX atlasTechniques field.
Does not prove: ATLAS technique pages (AML.T) describe attack methods, not defences. ATLAS mitigation pages AML.M0015 and AML.M0020 (cross-referenced in the MDX) could not be fetched — their current text is unverified; they are listed in the MDX atlasMitigations field but no excerpts are included here.