← Atlas · Mitigations Tier 2 · Real-composable

MITIGATION · m-input-sanitization

Input sanitisation — enforcing the data/instruction boundary before content reaches the model

An LLM cannot distinguish data from instructions on its own: that boundary has to be enforced at the point where external content enters the prompt. Input sanitisation does this by normalising, filtering, and structurally segmenting untrusted content before the model ever sees it, so retrieved documents, tool results, and user messages are treated as data rather than commands.

Last reviewed 2026-05-12 · Status: published · Evidence →

At a glance

MATURITY
Tier 2
Available off-the-shelf or as a documented pattern, but newer or less broadly proven. Expect integration work and some operational nuance.
PLACES ON
node · edge
Restricted to node kinds: agent
COVERAGE
2 threats
T1 · T6
TRADE-OFFS
LAT
low
COST
low
UX
low
DEV
medium
Latency · cost · UX friction · dev effort.
TL;DR
  • An LLM cannot distinguish data from instructions on its own. Every external source (user messages, retrieved documents, tool results, peer-agent replies) must be treated as untrusted data, and that boundary has to be enforced at prompt construction time.
  • Structural segmentation is the primary mechanism: wrap untrusted content in typed delimiters (XML tags, labelled JSON fields, role channels) and declare in the system prompt that content inside those markers is data only. The boundary is enforced by construction, not by the model's judgment.
  • Layer a pattern filter on top: normalise Unicode, strip invisible and right-to-left characters, then scan for known injection signatures (override phrases, control tokens, encoded payloads) before the content reaches the model.
  • In agentic systems, every untrusted-content entry path must be covered. Missing one, whether a tool output, a RAG retrieval, or an inter-agent message, is the residual risk that prompt injection campaigns exploit.

How it behaves

Untrusted content arrives (user message, RAG result, tool output, peer-agent reply)
Normalise encoding, filter known injection patterns, then wrap in a structurally isolated segment
Content placed in a labelled untrusted segment; the model reasons over it as data
Content blocked or rewritten; the hit pattern and source channel are logged
Apply this gate at every external content boundary, not only at the user-facing edge. A gap on any one path is an unprotected entry point.

What it is

An LLM processes everything in its context window as a stream of tokens. It has no built-in mechanism to separate content it should follow from content it should merely read. Prompt injection exploits this: an attacker embeds text in an external source (a retrieved document, a tool result, a user message) that the model interprets as a new instruction, overriding the system prompt or hijacking the agent's next action. The fix is not to ask the model to be more careful, because the model has no reliable way to make that distinction without external structure. The fix is to enforce the boundary before the model sees the content.

Input sanitisation does this through three complementary steps, applied in sequence at every external content boundary.

Normalisation and re-encoding. Strip or escape control characters, normalise Unicode to a canonical form, and remove invisible or right-to-left characters. These characters are used to hide injection payloads from human reviewers while leaving them legible to the model. This step runs first because later pattern-matching depends on the content being in a predictable encoding.

Pattern filtering. Scan the normalised content for known injection patterns: instruction-override phrases, control tokens such as <|im_start|>, base64-encoded payloads. Block or rewrite matching content before it reaches the model. This step is a list of known-bad signatures, not a semantic judgment, so it is fast and deterministic.

Structural segmentation. Wrap the surviving content in unambiguous delimiters, XML-style tags, labelled JSON fields, or role-channel separation, and declare in the system prompt that content inside those markers is data only. The data/instruction boundary is enforced by prompt construction, not by the model's judgment, so it holds even when the model has not been fine-tuned for it.

In agentic systems, the surface that requires sanitisation is larger than a single user input field. It includes every path by which external content enters the model's context: retrieved documents, tool-call results, peer-agent messages, and webhook payloads. A sanitisation step applied only at the user-facing edge leaves every other entry path unprotected.

Detection signals

  • Rate of content blocked or modified at each entry channel. A sustained rise on a specific channel (RAG retrieval, a particular tool, inter-agent messages) points to a targeted injection campaign or a misconfigured upstream source.
  • Distribution of injection-pattern hits by category (override phrases, control tokens, encoded payloads). A sudden shift in category mix indicates attackers adapting their technique, which signals that the pattern list needs updating.

Threats it covers

  • T1 Memory Poisoning −1 severity step

    WHY IT HELPS Memory Poisoning works by writing attacker-controlled content into the agent's memory store, where it is later retrieved and treated as trusted context. Sanitising inbound content before it is written strips or neutralises injection payloads at the entry point, before they can persist into memory.

  • WHY IT HELPS Prompt Injection is the insertion of attacker-controlled text that the model interprets as a new instruction, overriding the system prompt or earlier context. Structural segmentation places untrusted content in a labelled data region that the system prompt declares off-limits for instruction, narrowing the channel through which an injection can succeed.

Principle coverage

Defence-in-Depth stage: Prevent — and it advances:

  • Assume Breach Sanitisation treats every external source as already hostile: it classifies and isolates all incoming content as untrusted before it reaches the model, rather than trusting it because it arrived through a normal channel.
  • Provenance & Trust-tagging Structural segmentation tags untrusted content as data at the boundary, so the model can tell instruction from input by construction rather than by judgment.
  • Input/Output Validation This control is the input half of I/O validation: it validates and normalises every external entry path before content is allowed into the prompt.

Design & governance principles (open design, economy of mechanism, accountability, …) are architectural, not advanced by a single placed control.

Implementation options

Use whichever fits your cloud platform and data-residency requirements. For most cloud workloads, start with a managed classifier (Prompt Shields or Bedrock Guardrails) wired to a Unicode-normalisation and structural-segmentation pre-processing step. Add NeMo Guardrails or a self-build hybrid when you need local execution, finer-grained control, or coverage of entry paths no managed service reaches.

Microsoft Prompt Shields A managed REST API under Azure AI Content Safety that classifies direct user-prompt attacks and indirect document attacks (RAG content, emails, tool results) before they reach the model. Returns a per-segment attack flag and sub-category breakdown.

Why choose it: Call AnalyzeTextAsync with UserPromptAnalysisResult (for user input) and DocumentAnalysisResult (for retrieved documents). Returns per-segment attackDetected and subCategory fields. Native to Azure AI Foundry; no self-hosted infrastructure required.

More details:

Amazon Bedrock Guardrails A policy-managed prompt-attack filter wired directly into Bedrock InvokeModel calls. Detects jailbreaks, prompt injection, and (Standard tier) prompt leakage. Content tags scope which segments are evaluated, so system-prompt content is never incorrectly flagged. Filter strength is configurable per call.

Why choose it: Create a guardrail with promptAttacksFilter set to the desired strength (NONE / LOW / MEDIUM / HIGH) via CreateGuardrail. Apply it inline with model inference using the guardrailIdentifier and guardrailVersion parameters on InvokeModel. No separate API hop.

More details:

Lakera Guard A provider-neutral SaaS classifier that evaluates message arrays for prompt injections, jailbreaks, and manipulation attempts. Returns a flagged boolean with category-level detail. Works in front of any model, regardless of cloud provider.

Why choose it: POST a message array to the Guard endpoint (https://api.lakera.ai/v2/guard) before forwarding to any LLM. Returns flagged and categories with per-category scores. Use as a drop-in pre-filter with no platform coupling. The project also publishes the open PINT benchmark for independent multi-vendor evaluation.

More details:

NVIDIA NeMo Guardrails An open-source Python framework (Apache 2.0) that defines input, output, and execution rails for tool calls using the Colang DSL. Includes built-in jailbreak and injection detection rails, sensitive-data masking on input, and composable LangChain integration. Runs locally with no external API dependency.

Why choose it: Define input rails in Colang that intercept user messages before they reach the LLM. Start from the jailbreak detection and sensitive data built-in rail templates, then extend with custom patterns. Install via pip install nemoguardrails; integrate with LangChain using RunnableRails.

More details:

Self-build normalisation and classifier hybrid A custom pre-processing pipeline: Unicode normalisation and invisible/RTL token stripping, regex matching against a maintained list of known injection signatures, a lightweight classifier call (such as Llama Guard 3) for ambiguous content, and structural wrapping of surviving content in typed delimiters before it reaches the model.

Why choose it: Best when you need sanitisation across entry paths no managed service natively covers (inter-agent messages, custom tool results, binary-parsed documents), or when you require deterministic control over every layer. For normalisation, use unicodedata.normalize('NFKC', text) and strip categories Cf and Cs. Maintain the regex list against OWASP LLM01 known-injection signatures. Higher dev effort: the regex list and classifier prompt require ongoing maintenance.

More details:

Trade-offs

  • Managed classifiers (Prompt Shields, Bedrock Guardrails, Lakera Guard) add 50 to 200 ms per call. A Unicode normalisation and regex pass is sub-millisecond and should always run, whether or not a managed classifier is in the stack.
  • Managed services are the fastest path to deployment but introduce vendor dependencies and may route content data through third-party infrastructure. Evaluate against your data-residency requirements before adopting one.
  • False positives (legitimate content blocked or rewritten) are the main UX cost. Tune blocking thresholds on representative traffic before enabling hard blocks; start in log-only mode.
  • The classification step is straightforward. The real effort is enumerating every untrusted-content entry path in the agent graph, and most teams underinvest here and discover gaps in production.

When NOT to use

  • Do not apply text-pattern sanitisation to binary or structured payloads the agent cannot interpret as instruction. Images, audio, and binary file content have different injection surfaces that text filters do not address.
  • Do not use sanitisation as the primary gate for agents that intentionally process adversarial text (red-team or security-research tools). The classifier will fire constantly on legitimate input, masking genuine signals.
  • Do not place sanitisation logic only on agent output rather than input. A classifier on outgoing content does not prevent an injection from executing during model reasoning; that is the job of output moderation, not this control.

Limitations

  • No classifier catches all injection classes. Adversarial researchers demonstrate new bypasses against every managed service each year. Treat classification as one layer of defence-in-depth, not as a complete defence.
  • Structural segmentation depends on the model honouring the data/instruction boundary declared in the system prompt. Model updates can change how reliably that boundary holds; re-validate after any model version change.
  • Pattern-filter evasion via encoding variants, Unicode homoglyphs, and multi-lingual attacks is well documented. A regex list that is not continuously maintained loses coverage as attack patterns evolve.
  • This control applies to text modalities. Vision, audio, and multimodal pipelines require separate injection-surface analysis; text sanitisation alone is not sufficient for them.

Maturity tier reasoning

  • Tier 2 fits because multiple production-grade managed implementations exist (Prompt Shields, Bedrock Guardrails, Lakera Guard) alongside a production-stable open-source option (NeMo Guardrails), each with documented integration paths and published evaluation data.
  • Not Tier 1, because no single implementation is industry-canonical. Deployments typically combine a self-hosted pre-processing step with a managed classifier, and the composition is bespoke per stack.
  • Not Tier 3, because all components are production-deployable today and the pattern is documented in OWASP, NIST AI 600-1, and multiple vendor integration guides.

Last verified against upstream docs: 2026-05-30.