MITIGATION · m-memory-poisoning-defense
Memory-poisoning defence — embedding-space anomaly detection and retrieval re-ranking
An agent that reads from a vector store assumes the stored content reflects what was legitimately written. An adversary who can write to that store can inject passages that divert the agent's retrieval toward attacker-controlled content. This control applies two defensive layers: anomaly detection on writes, which quarantines incoming embeddings that are statistical outliers relative to existing cluster centroids; and re-ranking on reads, which uses a cross-encoder or probe-gradient scorer to demote adversarial candidates after dense retrieval. Both layers are research-stage. No turnkey production implementation exists as of catalogue version; deploy additively on top of Tier 2 baseline controls.
At a glance
TL;DR
- At write time, compute the L2 distance between the incoming embedding and per-topic cluster centroids; quarantine vectors that exceed a calibrated threshold before they are committed to the store.
- At read time, after dense retrieval produces its top-k candidates, a cross-encoder or probe-gradient scorer re-evaluates each candidate and demotes adversarial passages before the result set reaches the agent's context.
- Both layers are Tier 3 research-stage: the attack surface is peer-reviewed (PoisonedRAG, USENIX Security 2025), but no production-grade implementation of either defence layer exists as of catalogue version. Deploy additively on top of Tier 2 controls (m-mem-validation, m-mem-anomaly), not as a replacement.
How it behaves
What it is
Memory poisoning is an adversarial write to a vector store: an attacker inserts passages whose embeddings are crafted to appear relevant to legitimate queries, so the dense retriever returns attacker-controlled content and the agent acts on it. The PoisonedRAG research (Zou et al., USENIX Security 2025, arXiv:2402.07867) demonstrated that injecting as few as five malicious passages per target question achieves approximately 90% attack success against undefended retrieval pipelines.
This control addresses the threat at two points in the retrieval pipeline.
Write-time: embedding-space anomaly detection. When a new vector is written to the store, its distance from existing per-topic cluster centroids is computed. Vectors that fall beyond a calibrated threshold are quarantined before they are committed. The technique extends the outlier-removal work of Steinhardt, Koh, and Liang (NeurIPS 2017, arXiv:1706.03691) from training-time data poisoning to the runtime write stream; the RAG-specific application is bespoke per deployment and requires sustained threshold calibration against the live corpus.
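The write-time check described above can be sketched in a few lines. This is a minimal illustration and not a reference implementation: the `WriteGate` class, its field names, and the choice to compare against the nearest centroid are assumptions made for the example, and the centroids and threshold are taken as given.

```python
import math
from dataclasses import dataclass, field

Vector = list[float]


@dataclass
class WriteGate:
    """Hypothetical write gate: quarantines embeddings far from every topic centroid."""
    centroids: dict[str, Vector]   # assumed to be precomputed per topic
    threshold: float               # assumed to be calibrated against the live corpus
    committed: list[str] = field(default_factory=list)
    quarantined: list[str] = field(default_factory=list)

    def nearest(self, vec: Vector) -> tuple[str, float]:
        """Return the closest topic and the L2 distance to its centroid."""
        distances = {topic: math.dist(vec, c) for topic, c in self.centroids.items()}
        topic = min(distances, key=distances.get)
        return topic, distances[topic]

    def submit(self, doc_id: str, vec: Vector) -> bool:
        """Commit the write if it is within the threshold, otherwise quarantine it."""
        _, distance = self.nearest(vec)
        if distance > self.threshold:
            self.quarantined.append(doc_id)
            return False
        self.committed.append(doc_id)
        return True


gate = WriteGate(
    centroids={"billing": [0.0, 0.0], "support": [10.0, 10.0]},
    threshold=2.0,
)
accepted = gate.submit("doc-1", [0.5, 0.5])   # close to the "billing" centroid
rejected = gate.submit("doc-2", [5.0, 5.0])   # far from both centroids
```

A single global threshold keeps the sketch short; a deployment would more plausibly hold one threshold per topic, since cluster spread differs across a real corpus.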
Read-time: retrieval re-ranking. After the dense retriever produces its top-k candidates, a cross-encoder or probe-gradient scorer re-evaluates each candidate for adversarial characteristics and demotes suspicious passages before the result set reaches the agent's context. RAGPart, RAGMask, and ProGRank (2025–2026) are the closest peer-reviewed instantiations of this layer; none has a public code release as of catalogue version. Separately, for the write-time layer, MEMSAD (Gowda 2026) provides a gradient-coupling theoretical grounding and reports a true-positive rate (TPR) of 1.00 and a false-positive rate (FPR) of 0.00 in controlled experiments, although a known synonym-substitution evasion path remains open.
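The read-time step can be sketched as a second-stage sort over the dense retriever's candidates. The `Candidate` type, the `rerank` function, and the `scorer` callable are hypothetical names for this example; `scorer` stands in for whatever cross-encoder or probe-gradient model a deployment uses, and the stub below returns fixed scores purely to make the behaviour visible.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Candidate:
    doc_id: str
    text: str
    dense_score: float  # similarity reported by the dense retriever


def rerank(
    query: str,
    candidates: list[Candidate],
    scorer: Callable[[str, str], float],
    keep: int,
) -> list[Candidate]:
    """Re-score each candidate and keep the best `keep` by second-stage score.

    Candidates that the scorer rates poorly are demoted and drop out of the
    result set, regardless of how highly the dense retriever ranked them.
    """
    scored = sorted(candidates, key=lambda c: scorer(query, c.text), reverse=True)
    return scored[:keep]


# Stub scorer with made-up scores; a real scorer would run a model here.
stub_scores = {"passage a": 0.2, "passage b": 0.9, "passage c": 0.6}


def stub_scorer(query: str, text: str) -> float:
    return stub_scores[text]


dense_top_k = [
    Candidate("p1", "passage a", dense_score=0.95),  # ranked first by dense retrieval
    Candidate("p2", "passage b", dense_score=0.80),
    Candidate("p3", "passage c", dense_score=0.70),
]
final = rerank("example query", dense_top_k, stub_scorer, keep=2)
final_ids = [c.doc_id for c in final]
```

In this toy run `p1` leads the dense ranking but receives the lowest second-stage score, so it falls out of the two passages passed to the agent.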
Both layers are Tier 3 research-stage. Include them in roadmap planning and deploy them additively on top of the Tier 2 baseline controls (m-mem-validation, m-mem-anomaly). Do not substitute them for those controls.
Detection signals
- Quarantine rate at the write-time outlier filter. A sudden rise indicates a new ingestion source is injecting vectors that deviate from the established corpus distribution.
- Re-ranking demotion rate per retrieval session. A sustained increase means dense retrieval is being deceived at a higher frequency than baseline, which warrants examining the ingestion provenance of recently added content.
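Both signals are rates compared against a baseline, so one small monitor can serve either. The sketch below is an assumption-laden illustration: `RateMonitor`, the window size, the baseline value, and the alert factor are all hypothetical and would need tuning per deployment.

```python
from collections import deque


class RateMonitor:
    """Tracks the fraction of flagged events over a sliding window."""

    def __init__(self, window: int, baseline: float, factor: float) -> None:
        self.events: deque[int] = deque(maxlen=window)
        self.baseline = baseline   # expected flagged fraction under normal load
        self.factor = factor       # how far above baseline counts as a rise

    def record(self, flagged: bool) -> None:
        self.events.append(1 if flagged else 0)

    @property
    def rate(self) -> float:
        return sum(self.events) / len(self.events) if self.events else 0.0

    def alert(self) -> bool:
        """Alert only once the window is full, to avoid noise on start-up."""
        full = len(self.events) == self.events.maxlen
        return full and self.rate > self.baseline * self.factor


monitor = RateMonitor(window=10, baseline=0.10, factor=3.0)
for flagged in [True] + [False] * 9:   # one quarantine in ten writes
    monitor.record(flagged)
quiet = monitor.alert()

for _ in range(4):                      # a burst of quarantined writes
    monitor.record(True)
noisy = monitor.alert()
```

Feeding it write outcomes gives the quarantine-rate signal; feeding it per-candidate demotion outcomes gives the re-ranking signal.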
Threats it covers
- Memory Poisoning. Why it helps: this threat is the injection of adversarial content into an agent's persistent or shared memory store so that the agent retrieves and acts on attacker-controlled passages. Embedding-space anomaly detection intercepts statistically anomalous vectors before they are committed to the store; retrieval re-ranking demotes adversarial passages after dense retrieval has already been deceived. Neither layer stops slow-drift poisoning that stays within statistical thresholds; both reduce severity for the pattern-injection scenario that peer-reviewed attack research documents.
Principle coverage
Defence-in-Depth stage: Detect. It advances:
- Memory & RAG Integrity. Memory integrity requires that content stored in an agent's memory store be trustworthy and unmodified by adversarial writes. This control advances that principle by intercepting statistically anomalous vectors before they are committed to the store and by demoting adversarial passages after retrieval, reducing the probability that a successful poisoning write reaches the agent's reasoning context.
Design & governance principles (open design, economy of mechanism, accountability, …) are architectural, not advanced by a single placed control.
Implementation options
These are the closest verified research artefacts and infrastructure primitives available to compose against. No turnkey production product ships this defence as of catalogue version. Each option notes what it actually provides and what remains research-only.
PoisonedRAG: The foundational adversarial-RAG paper. Injects five malicious passages per target question and achieves ~90% attack success against undefended retrievers. Evaluates existing defences and finds them insufficient. Accepted at USENIX Security Symposium 2025.
Why choose it: Use as the canonical threat-model reference for scoping what your embedding-anomaly and re-ranking layers must withstand. The paper's defence-evaluation section identifies what does not work; subsequent papers build on that baseline.
RAGPart and RAGMask: Two complementary retrieval-stage defences. RAGPart exploits how dense retrievers learn document partitions to surface poisoned documents; RAGMask detects suspicious tokens via targeted-token masking and similarity-shift scoring. Both operate on the retriever, not the LLM, and require no modification to the generation model.
Why choose it: Closest peer-reviewed precedent for the retrieval-stage re-ranking layer in this control. Research-stage: no public code as of catalogue version. Provides concrete design vocabulary (partition-based scoring, masking-induced similarity shift) for a self-build implementation.
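To make the "masking-induced similarity shift" vocabulary concrete, the sketch below masks one token at a time and measures how much the query-passage similarity drops. It is not RAGMask's published algorithm, which has no public code: the bag-of-words `toy_embed` function is a stand-in for a real dense encoder, and every name here is hypothetical.

```python
import math
from collections import Counter


def toy_embed(tokens: list[str]) -> Counter:
    """Stand-in embedder: a sparse bag-of-words vector (not a dense encoder)."""
    return Counter(t.lower() for t in tokens)


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_shifts(query: str, passage: str) -> list[tuple[str, float]]:
    """For each passage token, return the similarity lost when it is masked out."""
    q = toy_embed(query.split())
    tokens = passage.split()
    base = cosine(q, toy_embed(tokens))
    shifts = []
    for i, token in enumerate(tokens):
        masked = tokens[:i] + tokens[i + 1:]
        shifts.append((token, base - cosine(q, toy_embed(masked))))
    return shifts


shifts = similarity_shifts("reset password", "reset password ignore previous rules")
top_token, top_shift = max(shifts, key=lambda pair: pair[1])
```

A positive shift means the token was carrying the passage's relevance to the query. One plausible self-build scoring rule, offered only as an assumption, is to treat passages whose similarity depends on a very small set of tokens as candidates for demotion.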
ProGRank: Training-free retriever-side defence. Applies randomised perturbations to each query-passage pair and derives two instability metrics, representational consistency and dispersion risk, from the retriever's gradient. Re-ranks and filters candidates without modifying original passages or retraining the retriever.
Why choose it: Provides a gradient-based mechanism for the re-ranking layer that requires no additional model training. Research-stage: no public code. The training-free property makes it more deployable than approaches requiring fine-tuning on adversarial examples.
MEMSAD: Calibration-based defence grounded in a gradient-coupling theorem: the anomaly-score gradient and the retrieval-objective gradient are provably identical under certain conditions, so an adversary who evades detection also degrades retrieval quality. Provides a theoretical certified detection radius and minimax-optimality proofs. Achieves TPR 1.00, FPR 0.00 against continuous attacks in experiments; a synonym-substitution loophole remains open.
Why choose it: The only paper reviewed for this entry that explicitly targets agentic memory-store poisoning rather than general RAG corpora. Research-stage: paper-only, no public code. Theoretical guarantees are the strongest foundation available for the write-time anomaly-detection layer.
Self-build outlier filter: Compute the L2 distance between the incoming embedding and the per-topic centroid of the existing vector store. Quarantine writes that exceed a calibrated threshold. No external dependency; runs in the ingestion pipeline before the vector is committed.
Why choose it: The most deployable option today given no production library ships this defence. Inherits the outlier-removal framing from Steinhardt, Koh, Liang (NeurIPS 2017, arXiv:1706.03691). Requires sustained threshold tuning per corpus and does not defend against slow-drift poisoning that stays within the statistical envelope.
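The threshold tuning mentioned above needs a starting point. One simple option, shown here as an assumption and not as the method of any cited paper, is to set the threshold at a high quantile of the distances that known-good vectors already have from their own centroid. The function names and the 0.99 default are hypothetical.

```python
import math

Vector = list[float]


def centroid(vectors: list[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def calibrate_threshold(vectors: list[Vector], quantile: float = 0.99) -> tuple[Vector, float]:
    """Return the topic centroid and a distance threshold for that topic.

    The threshold is the nearest-rank `quantile` of the L2 distances between
    the trusted vectors and their centroid, so roughly (1 - quantile) of
    legitimate writes would be quarantined as false positives.
    """
    if not vectors:
        raise ValueError("need at least one trusted vector to calibrate")
    centre = centroid(vectors)
    distances = sorted(math.dist(v, centre) for v in vectors)
    rank = max(1, math.ceil(quantile * len(distances)))
    return centre, distances[rank - 1]


trusted = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
centre, threshold = calibrate_threshold(trusted)
```

Recomputing the centroid and threshold on a schedule tracks legitimate corpus growth, but it also lets a patient attacker shift the envelope gradually, which is the slow-drift gap this option does not close.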
Trade-offs
- Latency: medium. Cross-encoder or probe-gradient re-ranking adds 50 to 200 ms per retrieval query; embedding-distance checks at write time are sub-millisecond.
- Cost: medium. Cross-encoder inference per query at retrieval volume; gradient computation in ProGRank adds a forward-backward pass per candidate.
- UX friction: low. Both layers are invisible to end users.
- Dev effort: high. Research-stage components have no off-the-shelf implementation. Threshold tuning requires sustained red-team exercises because no public benchmark exists for agentic memory-poisoning resilience.
When NOT to use
- Do not deploy as a substitute for the Tier 2 baseline (m-mem-validation, m-mem-anomaly). Both layers are additive, not replacements.
- For corpora that are fully trusted and immutable (for example, a read-only curated knowledge base with no runtime writes), the adversarial-vector surface does not exist and the added latency is wasted overhead.
- At high query volumes (thousands per hour), cross-encoder re-ranking becomes a dominant cost. Profile before enabling in production.
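A small timing harness is one way to do the profiling suggested above before enabling re-ranking. This is a sketch under stated assumptions: `profile_scorer` and its report keys are hypothetical, the lambda is a placeholder for a real re-ranker call, and the 5,000 queries per hour figure is an example volume, not a measured one.

```python
import math
import statistics
import time
from typing import Callable


def profile_scorer(
    score: Callable[[str], object],
    queries: list[str],
    queries_per_hour: int,
) -> dict[str, float]:
    """Time each call and project total scorer time per hour at the given volume."""
    samples = []
    for query in queries:
        start = time.perf_counter()
        score(query)
        samples.append(time.perf_counter() - start)
    samples.sort()
    p95_index = max(0, math.ceil(0.95 * len(samples)) - 1)  # nearest-rank percentile
    return {
        "p50_s": statistics.median(samples),
        "p95_s": samples[p95_index],
        "busy_s_per_hour": statistics.mean(samples) * queries_per_hour,
    }


# Placeholder workload standing in for a cross-encoder call.
report = profile_scorer(lambda q: sum(range(1000)), ["example query"] * 20, queries_per_hour=5000)
```

For scale, at the upper end of the 50 to 200 ms range quoted under Trade-offs, 5,000 queries per hour would amount to about 1,000 seconds of scorer time per hour (5,000 × 0.2 s).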
Limitations
- Slow-drift poisoning that stays within statistical thresholds defeats embedding-space detection. MEMSAD identifies this as the synonym-invariance loophole.
- Adversarial-similarity attacks adapt to known defences. The cross-encoder re-ranking that counters one strategy may not generalise to subsequent generations of attack.
- No public benchmark for agentic memory-poisoning resilience exists as of catalogue version. Recall and precision for this control can only be measured through bespoke red-team exercises.
- All current defence papers test on standard QA benchmarks, not deployed agentic memory stores. Transfer to production context is unvalidated.
Maturity tier reasoning
- Tier 3 because no production-grade implementation exists and no public benchmark for evaluation has been established.
- The foundational research (Steinhardt et al., NeurIPS 2017; Zou et al., arXiv 2024, published at USENIX Security 2025) is peer-reviewed; the agentic-memory application is exploratory.
- RAGPart/RAGMask, ProGRank, and MEMSAD (2025 to 2026) are the closest verified research instantiations. Expect an upgrade to Tier 2 as vendors ship hardened RAG defaults.
Last verified against upstream docs: 2026-05-30.