MITIGATION · m-mcp-response-sanitization
MCP response sanitisation — validate and normalise tool outputs before they re-enter the LLM context
An MCP server response is content the LLM will reason over next. The model cannot distinguish tool output from instruction: that boundary must be enforced at the client, before the payload enters the context window. MCP response sanitisation applies schema validation, Unicode normalisation, control-token stripping, and structural wrapping to every tool result at the response boundary, so adversarial content embedded in a server response cannot redirect the agent's planner.
At a glance
TL;DR
- Every MCP tool result is content the model will reason over next. Treat it as untrusted data, not authoritative tool output.
- Validate the response shape (content[], isError, structuredContent) against the MCP spec and the tool's declared outputSchema before the payload touches the context window.
- Strip framework-control tokens, normalise Unicode to NFKC, and remove invisible characters; then wrap the cleaned text in a labelled untrusted-content segment with server and tool attribution.
- Layer a managed injection classifier (Prompt Shields document mode, Bedrock ApplyGuardrail) on top for adversarial payloads that pass structural checks. The two layers address different failure classes.
How it behaves
What it is
An MCP server response is, from the model's perspective, indistinguishable from any other text that arrives in its context window. The model does not have a native boundary between "tool result" and "instruction": that separation has to be enforced structurally at the client, before the payload reaches the prompt. MCP response sanitisation is that enforcement layer. It applies schema validation, Unicode normalisation, control-token stripping, and contextual wrapping to every tool result the client receives, so a malicious payload embedded in a server response cannot be processed as an instruction.
The failure mode without this control is indirect prompt injection via the tool-response channel. A compromised server, an attacker-controlled upstream source that the server forwards, or a server that returns ambiguously-typed payloads can each supply text that the model will treat as authoritative. The MCP specification acknowledges this directly: servers MUST validate all tool inputs and clients SHOULD validate tool results before passing them to the LLM. Sanitisation at the client implements the client-side obligation.
Three layers apply in order on every response:
Schema validation. The MCP specification defines a typed response shape for every tool result: a content array of typed segments (text, image, audio, resource_link, resource) plus an optional isError boolean and an optional structuredContent object. When a tool declares an outputSchema, the server must return structuredContent that conforms to it, and the client should validate structuredContent against that schema. Reject any response that does not match the declared shape; a schema-invalid payload is a hard rejection and must not be passed downstream.
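A minimal sketch of the shape check, using only the Python standard library. The names validate_tool_result and SchemaRejection are hypothetical, and the check covers only the fields named above; a production client would more likely express this as a Pydantic or Zod model.

```python
from typing import Any

# Segment types named in the MCP tool-result shape described above.
ALLOWED_SEGMENT_TYPES = {"text", "image", "audio", "resource_link", "resource"}


class SchemaRejection(ValueError):
    """Raised when a tool result does not match the expected shape (hypothetical name)."""


def validate_tool_result(result: Any) -> dict:
    """Return the result unchanged if its shape is acceptable, else raise SchemaRejection."""
    if not isinstance(result, dict):
        raise SchemaRejection("result must be an object")
    content = result.get("content")
    if not isinstance(content, list):
        raise SchemaRejection("content must be an array")
    for i, segment in enumerate(content):
        if not isinstance(segment, dict) or segment.get("type") not in ALLOWED_SEGMENT_TYPES:
            raise SchemaRejection(f"content[{i}] has a missing or unknown type")
        if segment["type"] == "text" and not isinstance(segment.get("text"), str):
            raise SchemaRejection(f"content[{i}].text must be a string")
    if "isError" in result and not isinstance(result["isError"], bool):
        raise SchemaRejection("isError must be a boolean")
    if "structuredContent" in result and not isinstance(result["structuredContent"], dict):
        raise SchemaRejection("structuredContent must be an object")
    return result
```

Raising an exception rather than returning a flag keeps the hard-rejection rule difficult to bypass: a caller that forgets to check cannot accidentally forward the payload.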
Normalise and strip. Normalise all text segments to Unicode NFKC. Strip framework-control tokens (<|im_start|>, <|im_end|>, [INST], [/INST], and role markers for any model family in deployment). Remove invisible characters: zero-width space (U+200B), zero-width non-joiner (U+200C), BOM (U+FEFF), and right-to-left override (U+202E). These characters hide injection payloads from human reviewers while remaining legible to the model.
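The step above could be sketched as follows. This is an illustration under assumptions, not a complete filter: the function name normalise_and_strip is hypothetical, the token list contains only the markers named in this section, and the invisible-character set is limited to the four code points listed. Extend both for the model families actually deployed.

```python
import re
import unicodedata

# Only the control tokens named in this section; add role markers per model family.
CONTROL_TOKENS = re.compile(r"<\|im_start\|>|<\|im_end\|>|\[/?INST\]")

# Translation table that deletes U+200B, U+200C, U+FEFF and U+202E.
INVISIBLE = dict.fromkeys([0x200B, 0x200C, 0xFEFF, 0x202E])


def normalise_and_strip(text: str) -> str:
    # 1. Fold compatibility forms first (e.g. fullwidth brackets become ASCII).
    text = unicodedata.normalize("NFKC", text)
    # 2. Delete invisible characters that could split a token to evade the regex.
    text = text.translate(INVISIBLE)
    # 3. Strip control tokens repeatedly, so removing one cannot splice a new one together.
    while True:
        cleaned = CONTROL_TOKENS.sub("", text)
        if cleaned == text:
            return cleaned
        text = cleaned
```

The order matters in this sketch: normalising and removing invisible characters before the pattern match means a token hidden as "[IN" + U+200B + "ST]" or written with fullwidth brackets is already in its plain form when the regex runs.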
Wrap. Place the cleaned content in a labelled untrusted-content segment with server and tool attribution before it re-enters the LLM context. This is the structural pattern that context isolation provides at the prompt-construction layer; applying it at the response boundary ensures the model processes the content as data, not as instruction.
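One way to sketch the wrap step, assuming an XML-style wrapper. The tag name untrusted_tool_output and the function name wrap_untrusted are hypothetical; the system prompt would need to tell the model how to treat that tag.

```python
from xml.sax.saxutils import escape, quoteattr


def wrap_untrusted(text: str, server_id: str, tool_name: str) -> str:
    """Wrap cleaned tool text in a labelled segment with server and tool attribution."""
    # quoteattr() adds the surrounding quotes and escapes the attribute value.
    opening = f"<untrusted_tool_output server={quoteattr(server_id)} tool={quoteattr(tool_name)}>"
    # escape() rewrites &, < and > so the payload cannot close the wrapper early.
    return f"{opening}\n{escape(text)}\n</untrusted_tool_output>"
```

Escaping the body is the important design choice here: without it, a payload containing the closing tag could end the untrusted segment and place the rest of its text outside the wrapper.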
Detection signals
- Schema rejection rate per MCP server. A rising rate on a specific server indicates adversarial or malformed responses from that upstream.
- Pattern-match hits for framework-control tokens (<|im_start|>, [INST], and equivalents) inside tool outputs. Treat any hit as a probable injection attempt and investigate it.
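Both signals can be tracked with a few counters. The sketch below is illustrative only: the class name ResponseSignals is hypothetical, the counters live in memory, and a real deployment would presumably export them to its existing metrics system and alert on a threshold chosen by the operator.

```python
import re
from collections import defaultdict

# Same tokens as named above; extend per model family in deployment.
CONTROL_TOKEN_RE = re.compile(r"<\|im_start\|>|<\|im_end\|>|\[/?INST\]")


class ResponseSignals:
    """In-memory per-server counters for the two detection signals (hypothetical helper)."""

    def __init__(self) -> None:
        self.total = defaultdict(int)
        self.rejected = defaultdict(int)
        self.token_hits = defaultdict(int)

    def record(self, server_id: str, raw_text: str, schema_ok: bool) -> None:
        # Count hits on the raw text, before stripping, or the signal is lost.
        self.total[server_id] += 1
        if not schema_ok:
            self.rejected[server_id] += 1
        self.token_hits[server_id] += len(CONTROL_TOKEN_RE.findall(raw_text))

    def rejection_rate(self, server_id: str) -> float:
        total = self.total[server_id]
        return self.rejected[server_id] / total if total else 0.0
```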
Threats it covers
- WHY IT HELPS: Indirect prompt injection plants adversarial instructions in external content the agent will retrieve or receive, relying on the model treating that content as authoritative. Sanitising at the MCP response boundary strips injection payloads and wraps the cleaned content as labelled data before it can reach the planner or be persisted to memory.
- WHY IT HELPS: Intent Breaking and Goal Manipulation exploits the agent's reliance on tool outputs to shape its next plan step. Normalising and wrapping MCP responses before context re-entry removes the structural pathway a crafted tool result would need to redirect the planner.
- WHY IT HELPS: Insecure Inter-Agent Protocol Abuse includes Context Hijacking via MCP Response Injection as a named scenario: an attacker-controlled server returns a payload designed to seize the agent's session. Schema validation rejects malformed responses at the boundary; pattern stripping and contextual wrapping reduce the attack surface for responses that pass schema.
- WHY IT HELPS: Schema Mismatch Leading to Errors arises when a server returns a response whose types diverge from the declared tool contract. Enforcing the MCP response shape and any declared outputSchema at the sanitisation boundary catches ambiguously-typed payloads before they reach the planner, preventing downstream actions driven by parsing divergence.
Principle coverage
Defence-in-Depth stage: Prevent — and it advances:
- Confused-Deputy Prevention: A confused-deputy attack requires an agent to exercise tool authority the operator did not intend. Response sanitisation reduces that surface by validating and wrapping every tool result before it re-enters the planner, so a crafted server response cannot supply parameters or instructions that redirect the agent into unintended tool invocations.
- Input/Output Validation: This control is the tool-response half of I/O validation: it inspects and normalises every MCP server result at the response boundary before content is allowed into the prompt, mirroring what input sanitisation does on the inbound user-prompt path.
Design & governance principles (open design, economy of mechanism, accountability, …) are architectural, not advanced by a single placed control.
Implementation options
Five verified implementation options covering different layers: MCP spec-native schema validation, a self-build normalisation pipeline, a managed injection classifier (Prompt Shields), the Bedrock standalone guardrail API, and self-hosted NeMo Guardrails execution rails. Use at least the first two for every deployment; layer a classifier for tools whose response surface is wide enough to carry adversarial payloads.
MCP outputSchema validation
Validate every tool result against the MCP typed response shape (content[], isError, structuredContent) and, when the tool declares an outputSchema, validate structuredContent against it. Reject any response that does not match; do not pass a schema-invalid payload downstream.
Why choose it: Best as the structural floor for any deployment. The schema is declared by the server itself; validation is zero-cost relative to the network call. Use Zod (TypeScript, v4 current) or Pydantic (Python, v2 current) to parse the content array at the MCP client boundary. If the tool declares outputSchema, also validate structuredContent against it. Log schema errors with the server ID and tool name.
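To illustrate validating structuredContent and logging failures with the server ID and tool name, here is a standard-library sketch. It deliberately implements only a small subset of JSON Schema (required, properties, and primitive type), which is an assumption made for brevity; the function names are hypothetical, and a full JSON Schema validator or a Pydantic/Zod model should replace check_structured_content in practice.

```python
import logging
from typing import Any

log = logging.getLogger("mcp.sanitiser")  # hypothetical logger name

_JSON_TYPES = {
    "string": str, "number": (int, float), "integer": int,
    "boolean": bool, "object": dict, "array": list,
}


def check_structured_content(structured: Any, output_schema: dict) -> list:
    """Return a list of error strings; an empty list means the subset check passed."""
    if not isinstance(structured, dict):
        return ["structuredContent must be an object"]
    errors = []
    for key in output_schema.get("required", []):
        if key not in structured:
            errors.append(f"missing required property '{key}'")
    for key, spec in output_schema.get("properties", {}).items():
        expected = _JSON_TYPES.get(spec.get("type"))
        if key not in structured or expected is None:
            continue
        value = structured[key]
        # bool is a subclass of int in Python, but JSON treats them as distinct types.
        bool_mismatch = isinstance(value, bool) and spec.get("type") != "boolean"
        if bool_mismatch or not isinstance(value, expected):
            errors.append(f"property '{key}' is not of type {spec['type']}")
    return errors


def accept_structured(server_id: str, tool_name: str, structured: Any, output_schema: dict) -> bool:
    """Log every schema error with server and tool attribution; return False on rejection."""
    errors = check_structured_content(structured, output_schema)
    for error in errors:
        log.warning("schema rejection server=%s tool=%s error=%s", server_id, tool_name, error)
    return not errors
```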
Self-build normalisation pipeline
A function applied to every MCP text segment before it enters the context window: normalise to NFKC, strip framework-control tokens, remove invisible characters, truncate at a hard character ceiling, then wrap the result in an XML-tagged untrusted-content block with server and tool attribution.
Why choose it: Best for every deployment. No managed service performs this step, and it is a prerequisite for the classifier options below. The pattern strip addresses the most common injection payloads; the contextual wrap ensures the model receives the cleaned content as data, not instruction. Unicode normalisation is the easiest step to skip, and skipping it is the most reliable way to miss payloads written in compatibility forms such as fullwidth or mathematical-alphanumeric characters. NFKC does not fold cross-script homoglyphs (for example, Cyrillic а for Latin a), so those need a separate confusables check. Dev effort is low: a single utility function applied at the MCP client callback site. Tokens to strip: <|im_start|>, <|im_end|>, [INST], [/INST], and role markers for any model family in deployment.
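A sketch of how the utility might be applied at the callback site, focusing on the hard character ceiling. The names sanitise_text_segments and nfkc and the 8000-character default are hypothetical. The cleaning function is passed in, so the same loop works with whatever normalise-and-strip routine a deployment uses.

```python
import unicodedata
from typing import Callable

MAX_CHARS = 8000  # hypothetical ceiling; tune per tool and context budget


def nfkc(text: str) -> str:
    """Stand-in cleaning step: NFKC only. A real pipeline would also strip tokens."""
    return unicodedata.normalize("NFKC", text)


def sanitise_text_segments(result: dict, clean: Callable[[str], str] = nfkc,
                           max_chars: int = MAX_CHARS) -> dict:
    """Return a copy of the tool result with every text segment cleaned, then truncated."""
    segments = []
    for segment in result.get("content", []):
        if segment.get("type") == "text":
            # Truncate after cleaning: NFKC can expand a string (one ligature
            # becomes several letters), so truncating first could exceed the ceiling.
            segment = {**segment, "text": clean(segment["text"])[:max_chars]}
        segments.append(segment)
    return {**result, "content": segments}
```

Non-text segments pass through untouched in this sketch; whether images, audio, or embedded resources need their own handling is a separate decision.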
Microsoft Prompt Shields
Azure AI Content Safety Prompt Shields exposes a documents parameter in the shieldPrompt REST endpoint. Pass each cleaned MCP text segment as a separate documents entry; the API returns documentsAnalysis[].attackDetected per segment, targeting the indirect-injection class of attack.
Why choose it: Best as the classifier layer on top of the self-build normalisation pipeline, for tools that return free-text content (search results, document excerpts, web fetch outputs). Document mode specifically targets indirect injection: content embedded in a document that attempts to gain control of the LLM session. API version: 2024-09-01. Adds 50–200 ms per call; apply to wide-surface tools, not to narrowly-typed tools that return integers or fixed enums.
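The request and response handling might look like the sketch below, which builds the HTTP request without sending it. The URL path, the userPrompt field, and the Ocp-Apim-Subscription-Key header reflect the public REST API as best recalled and should be verified against the current Azure reference; the function names are hypothetical.

```python
import json
import urllib.request

API_VERSION = "2024-09-01"


def build_shield_request(endpoint: str, api_key: str, user_prompt: str,
                         segments: list) -> urllib.request.Request:
    """Build (but do not send) a shieldPrompt request with one documents entry per segment."""
    url = f"{endpoint.rstrip('/')}/contentsafety/text:shieldPrompt?api-version={API_VERSION}"
    body = json.dumps({"userPrompt": user_prompt, "documents": segments}).encode("utf-8")
    headers = {"Content-Type": "application/json", "Ocp-Apim-Subscription-Key": api_key}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def flagged_segments(response_body: dict) -> list:
    """Return the indices of segments where documentsAnalysis[i].attackDetected is true."""
    analysis = response_body.get("documentsAnalysis", [])
    return [i for i, doc in enumerate(analysis) if doc.get("attackDetected")]
```

Sending the request with urllib.request.urlopen (or any HTTP client) and passing the decoded JSON body to flagged_segments gives the per-segment verdicts; a non-empty list would be treated as a rejection for those segments.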
AWS Bedrock ApplyGuardrail
The Bedrock ApplyGuardrail API (POST /guardrail/{id}/version/{v}/apply) evaluates an arbitrary content array against a configured guardrail without invoking a foundation model. Set source to OUTPUT to evaluate MCP tool results before they enter the context window; treat action: GUARDRAIL_INTERVENED as a hard rejection.
Why choose it: Best for AWS-native deployments where a Bedrock guardrail is already configured for the agent's inference calls. ApplyGuardrail decouples the content policy check from model invocation: the same policy filters that apply to model output can also apply to tool results entering context by calling the API on the raw MCP response text with source=OUTPUT.
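A minimal sketch of the call, assuming the client argument is a boto3 bedrock-runtime client (for example boto3.client("bedrock-runtime")). The function name tool_result_allowed is hypothetical, and the block does not import boto3 so that it can be exercised with a stub.

```python
def tool_result_allowed(client, guardrail_id: str, guardrail_version: str, text: str) -> bool:
    """Evaluate tool-result text against a guardrail; True only if it did not intervene.

    `client` is assumed to expose apply_guardrail() with the keyword arguments
    used below, as the boto3 bedrock-runtime client does.
    """
    response = client.apply_guardrail(
        guardrailIdentifier=guardrail_id,
        guardrailVersion=guardrail_version,
        source="OUTPUT",  # evaluate as output-side content, per the text above
        content=[{"text": {"text": text}}],
    )
    # Fail closed: anything other than an explicit "NONE" is treated as a rejection.
    return response.get("action") == "NONE"
```

Comparing against "NONE" rather than against "GUARDRAIL_INTERVENED" is a deliberate choice in this sketch: an unexpected or missing action value then results in rejection instead of silently passing the content through.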
NeMo Guardrails
NeMo Guardrails execution rails intercept tool results before they are returned to the model. An execution rail configured in a Colang flow runs a sanitisation or classification step within the guardrails framework before the output re-enters context.
Why choose it: Best when the agent already uses NeMo Guardrails for dialog or input rails and response sanitisation should be added without a separate service call. Execution rails compose with the same Colang flow language as input and output rails, so the sanitisation logic is co-located with the agent's other guardrail definitions. The framework is open-source and self-hosted, removing managed-classifier latency at the cost of operational responsibility for the guardrails server.
Trade-offs
- Schema validation (option 1) adds no measurable latency relative to the tool call itself; normalisation and wrapping (option 2) add sub-millisecond processing; managed classifiers (options 3 and 4) add 50–200 ms per call on wide-surface tools.
- The MCP spec is still evolving: outputSchema was added in the 2025-06-18 revision and is optional, so not all servers declare it. Schema enforcement requires tracking server-by-server capability as the protocol matures.
- Managed classifier options (Prompt Shields, ApplyGuardrail) add per-call cost in addition to latency. Apply them to wide-surface tools (search, web fetch, document retrieval), not to narrowly-typed tools whose response surface cannot carry a meaningful injection payload.
- Dev effort is medium and concentrated in two places: writing the schema assertions per tool (or per server capability), and keeping the control-token regex current as new model families introduce new token formats.
When NOT to use
- Do not apply a managed injection classifier to high-frequency tools that return deterministic, narrowly-typed outputs. A tool that returns a single integer, a fixed enum value, or a boolean has no injection surface, so the classifier overhead is disproportionate.
- Do not use response sanitisation as a substitute for not trusting a server. If the server's provenance is unknown or its operator is outside your trust boundary, the first control is MCP server attestation or exclusion; sanitising the output of an untrusted server is the wrong layer to lead with.
- Do not skip schema validation on the grounds that the server is internal or first-party. Schema enforcement catches malformed responses from buggy servers, not only from adversarial ones; it belongs at every MCP client boundary.
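One way to decide which tools skip the classifier is to derive the answer from the declared outputSchema instead of maintaining a hand-written list. The heuristic below is an assumption made for illustration, not part of the MCP specification, and is_narrow_surface is a hypothetical name. It only makes sense when schema validation is enforced first, and it treats a tool with no schema as wide-surface.

```python
from typing import Optional

# JSON Schema primitive types that cannot carry free text.
NARROW_TYPES = {"integer", "number", "boolean", "null"}


def is_narrow_surface(output_schema: Optional[dict]) -> bool:
    """True if every declared property is a non-text primitive or a fixed enum."""
    if not output_schema:
        return False  # no schema declared: assume free text is possible
    properties = output_schema.get("properties")
    if not properties:
        return False
    for spec in properties.values():
        if "enum" in spec or spec.get("type") in NARROW_TYPES:
            continue
        return False  # free strings, arrays, and nested objects count as wide
    # Undeclared extra properties could carry text, so require them to be forbidden.
    return output_schema.get("additionalProperties") is False
```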
Limitations
- Schema validation and pattern stripping address structurally malformed and pattern-matchable adversarial content. They do not address semantically valid but adversarial tool responses: a correctly-typed search result that contains a plausible-but-attacker-crafted policy claim passes all structural checks.
- A compromised MCP server can return valid-looking responses that redirect the agent without tripping pattern detectors. Defence in depth via trust scoring, inter-agent message signing, and multi-agent consensus is required for high-impact decisions.
- The MCP spec does not mandate response-level signing. Any unsigned MCP response is data-integrity-equivalent to unsigned user input: assume it can be tampered with in transit unless the transport is mutually authenticated and the response is signed.
- Classifier false-positive rates on legitimate tool outputs can cause agents to silently drop valid tool results. Log every rejection with enough context to diagnose false positives during the rollout period.
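A rejection log entry with enough context to diagnose false positives could be sketched as below. The field names and the function rejection_record are hypothetical. The preview is escaped so that invisible characters show up in the log viewer instead of staying hidden.

```python
import hashlib
import json
import logging

log = logging.getLogger("mcp.sanitiser")  # hypothetical logger name


def rejection_record(server_id: str, tool_name: str, stage: str, reason: str,
                     payload: str, preview_chars: int = 200) -> dict:
    """Build a structured record describing one rejected tool result."""
    return {
        "event": "mcp_response_rejected",
        "server_id": server_id,
        "tool": tool_name,
        "stage": stage,    # e.g. "schema", "pattern", "classifier"
        "reason": reason,
        # The hash lets reviewers correlate repeats without storing the full payload.
        "payload_sha256": hashlib.sha256(payload.encode("utf-8")).hexdigest(),
        "payload_chars": len(payload),
        # unicode_escape renders non-ASCII and invisible characters as visible escapes.
        "preview": payload[:preview_chars].encode("unicode_escape").decode("ascii"),
    }


def log_rejection(record: dict) -> None:
    log.warning(json.dumps(record))
```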
Maturity tier reasoning
- Tier 2 fits because every component is production-available: Zod v4 and Pydantic v2 for schema validation are mature libraries; Unicode normalisation and control-token stripping are trivially implemented; Prompt Shields document mode and Bedrock ApplyGuardrail are GA managed services; NeMo Guardrails is a production-stable open-source framework.
- What keeps the composed pattern at Tier 2 is the absence of a standard MCP client library that ships the full sanitisation pipeline (schema validation, normalisation, classifier integration, and contextual wrapping) as a first-class feature. Every deployment assembles the stack from the individual primitives.
- The MCP outputSchema field (added in the 2025-06-18 revision) is the first step toward spec-native response validation; it is optional and not yet universally implemented by server authors, which limits how much the spec alone can enforce.
Last verified against upstream docs: 2026-05-30.