LLM04:2025: Data and Model Poisoning (agentic delta)

What changes in an agent loop

A poisoned model is a poisoned planner. Where a chatbot uses the model for one isolated reply, an agent uses it for goal decomposition, tool selection, self-reflection, and coordination with peer agents across sessions that may span hours. A subtle reasoning bias (say, a backdoor trigger that causes the model to prefer a specific API endpoint) compounds at every step in the plan, each step inheriting the error from the one before. Memory-equipped agents extend the attack surface beyond model weights: an adversary who can write to a shared vector store or session-memory file poisons every future agent that retrieves from it, without ever touching the model itself. Retraining on clean data does not fix a contaminated memory layer; the memory store has its own poisoning lifecycle and requires its own integrity controls. Detection therefore needs anomaly monitoring on both model outputs and retrieval-layer contents.

Canonical landings T1 Memory Poisoning T18 RAG Input Manipulation Leading to Policy Bypass T27 Vector Database Poisoning with Malicious Smart Contract Data

For the full definition, prevention checklist, and detection guidance, read OWASP's Data and Model Poisoning page →. This page only adds the agentic angle and the bridge into Helmwart.

Mitigations

Context isolation — separate untrusted content from system instructions T2

An LLM processes everything in its context window as a single stream of tokens; it has no innate ability to tell instructions apart from data. If an attacker can place content where the model treats it as instruction, they control the agent. Context isolation prevents that by structurally separating untrusted content from system instructions at prompt construction time, so the boundary is enforced before the model ever sees the input.

Input sanitisation — enforcing the data/instruction boundary before content reaches the model T2

An LLM cannot distinguish data from instructions on its own: that boundary has to be enforced at the point where external content enters the prompt. Input sanitisation does this by normalising, filtering, and structurally segmenting untrusted content before the model ever sees it, so retrieved documents, tool results, and user messages are treated as data rather than commands.

MCP response sanitisation — validate and normalise tool outputs before they re-enter the LLM context T2

An MCP server response is content the LLM will reason over next. The model cannot distinguish tool output from instruction: that boundary must be enforced at the client, before the payload enters the context window. MCP response sanitisation applies schema validation, Unicode normalisation, control-token stripping, and structural wrapping to every tool result at the response boundary, so adversarial content embedded in a server response cannot redirect the agent's planner.

Memory anomaly detection — runtime detection of poisoning that slipped past validation T2

An agent's memory store can receive adversarial content that passes schema and policy validation because the content is structurally valid but statistically unusual. Memory anomaly detection addresses this by monitoring write rates, embedding distances, provenance tags, and retrieval patterns at runtime, and quarantining writes whose statistical signatures diverge from the established baseline.

Memory content validation — a write-boundary gate on what enters the agent's memory store T2

An agent's memory store is a persistent surface: anything written to it can be retrieved by any agent, in any session, for the lifetime of the corpus. Memory poisoning exploits that persistence by writing adversarial content that steers the agent's reasoning long after the attacker has gone. Write-boundary validation prevents this by running every candidate memory write through schema, policy, and provenance checks before it is committed. Content that fails any gate is rejected and never reaches the store.

Model registry — version pinning, canary, rollback T2

An agent loads whichever model weights are available at startup unless the runtime is told exactly which artifact to load. If a poisoned or regressed weight is published to the model store, the agent picks it up silently on the next restart. A model registry prevents that: every artifact is registered with a cryptographic checksum and an approval stage, the agent runtime loads by explicit version pin, and new versions must pass a canary evaluation before promotion to production.

Advanced prompt-injection defences — spotlighting, delimiter gate, dual-LLM T2

Prompt injection succeeds when untrusted content entering an agent's prompt is indistinguishable from trusted instruction. Three layered techniques address that: spotlighting tags untrusted content with a machine-readable origin mark before it reaches the model; delimiter defence rejects input carrying reserved framework tokens before the model is called; and dual-LLM extraction routes attacker-influenceable content through a quarantined model that holds no tool access, so injected instructions cannot reach the model that can act on them.

Output provenance tracking — record the source of every claim an agent makes T2

When an agent produces a claim derived from retrieved data, that claim needs a record of where it came from: the source document, version, and retrieval time. Without that record, a downstream verifier cannot distinguish a well-grounded output from a fabricated one, a tampered one, or a poisoned one. Provenance tracking attaches source attribution to every claim, carries it through each transformation in the pipeline, and surfaces it in audit logs and user-facing interfaces.

Session-scoped memory isolation — preventing cross-session context bleed T2

An agent that serves multiple users stores conversation history, retrieved facts, and intermediate state in a memory layer. If that layer is not scoped to the originating session, one user's writes can reach another user's retrieval path. Session-scoped memory isolation prevents that by enforcing a hard boundary at the storage layer, so each session can only read and write its own state.

Shared-memory ACL — per-agent, per-namespace read/write access control on shared vector stores T2

When multiple agents share a single vector store, the access boundaries between them are not enforced by the store itself unless you configure them explicitly. Without per-namespace write and retrieval controls, an agent that can write to the shared corpus can insert crafted vectors into any namespace it can reach, and any agent that can query the store can retrieve another agent's confidential documents through embedding-space proximity. Shared-memory ACL addresses this by tagging every vector with a principal identifier at write time and filtering every retrieval query to the requesting agent's namespace, enforced at the gateway layer where the agent cannot bypass it.

Permission-aware vector retrieval — ACLs at the retrieval boundary T2

A vector store returns results by embedding-space proximity, not by who is asking. Without a per-principal filter applied before similarity ranking, a query from tenant A can surface tenant B's vectors if the embeddings are close enough. Vector ACL closes that gap: every retrieval call is scoped to the requesting principal's namespace or payload partition before the store ranks any results, so cross-principal hits are structurally impossible rather than merely unlikely.

Memory-poisoning defence — embedding-space anomaly detection and retrieval re-ranking T3

An agent that reads from a vector store assumes the stored content reflects what was legitimately written. An adversary who can write to that store can inject passages that divert the agent's retrieval toward attacker-controlled content. This control applies two defensive layers: anomaly detection on writes, which quarantines incoming embeddings that are statistical outliers relative to existing cluster centroids; and re-ranking on reads, which uses a cross-encoder or probe-gradient scorer to demote adversarial candidates after dense retrieval. Both layers are research-stage. No turnkey production implementation exists as of catalogue version; deploy additively on top of Tier 2 baseline controls.