EVIDENCE TRAIL

Research-stage memory-poisoning defence — embedding-space anomaly + RAG re-ranking

Verbatim excerpts from the upstream sources cited on the mitigation page, with what each source does and does not prove. This is a Tier 3 research-stage control: the academic foundations (Steinhardt 2017, PoisonedRAG 2024) are peer-reviewed, but no production-grade implementation exists as of this catalogue version. Contemporary defence papers (RAGDefender, RAGuard) provide empirical grounding for the post-retrieval detection layer.

Last cross-checked against upstream sources: 2026-05-29 · 9 sources

References

Each entry shows what the source supports and what it does not prove.

Reference 1

arXiv preprint · February 2024

Zou et al. — "PoisonedRAG: Knowledge Corruption Attacks to Retrieval-Augmented Generation of Large Language Models" (arXiv:2402.07867)

Abstract

"We find that the knowledge database in a RAG system introduces a new and practical attack surface. Based on this attack surface, we propose PoisonedRAG, the first knowledge corruption attack to RAG, where an attacker could inject a few malicious texts into the knowledge database of a RAG system to induce an LLM to generate an attacker-chosen target answer for an attacker-chosen target question. … Our results show PoisonedRAG could achieve a 90% attack success rate when injecting five malicious texts for each target question into a knowledge database with millions of texts. We also evaluate several defenses and our results show they are insufficient to defend against PoisonedRAG, highlighting the need for new defenses."

Supports: Establishes the adversarial-vector injection threat surface that this mitigation targets. Directly motivates the need for embedding-space anomaly detection and retrieval re-ranking as countermeasures. The 90% attack success rate (ASR), together with the paper's finding that the defences it evaluated are insufficient, motivates adding a post-retrieval re-ranking pass rather than relying on passive filtering alone.

Does not prove: Evaluates existing defences and finds them insufficient — it does not propose or validate the embedding-anomaly + cross-encoder re-ranking approach used in this mitigation. Helmwart's defence composes against the PoisonedRAG threat class but is not derived from this paper.
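
For orientation on the ASR figure quoted above, the following is a minimal sketch of how an attack success rate could be computed from red-team trials. The substring-match success criterion, the function name `attack_success_rate`, and the sample trials are illustrative assumptions and are not taken from the paper's evaluation code.

```python
def attack_success_rate(trials):
    """Fraction of trials in which the model's answer contains the attacker's target.

    `trials` is a list of (model_answer, attacker_target) string pairs.
    Substring matching is an assumed, simplified success criterion.
    """
    if not trials:
        raise ValueError("need at least one trial")
    hits = sum(1 for answer, target in trials if target.lower() in answer.lower())
    return hits / len(trials)


# Hypothetical red-team outcomes for one target question.
trials = [
    ("The capital is Berlin.", "berlin"),
    ("Paris is the capital.", "berlin"),
    ("It is Berlin, per the records.", "berlin"),
    ("I could not find an answer.", "berlin"),
]
asr = attack_success_rate(trials)  # 2 of 4 answers contain the target
```

Measuring the same quantity before and after a defence is enabled gives a like-for-like comparison, which is the form in which the defence papers below report their results.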

Reference 2

NeurIPS 2017 · arXiv:1706.03691

Steinhardt, Koh, Liang — "Certified Defenses for Data Poisoning Attacks" (NeurIPS 2017, arXiv:1706.03691)

Abstract

"Machine learning systems trained on user-provided data are susceptible to data poisoning attacks, whereby malicious users inject false training data with the aim of corrupting the learned model. … We address this by constructing approximate upper bounds on the loss across a broad family of attacks, for defenders that first perform outlier removal followed by empirical risk minimization."

Supports: Establishes the foundational "outlier removal" paradigm that the embedding-space anomaly detection layer in this mitigation inherits. The concept of quarantining vectors whose distance from cluster centroids exceeds a threshold is a direct application of Steinhardt et al.'s outlier-removal framing to the vector-store context.

Does not prove: Applies to training-time data poisoning, not runtime vector-store injection in a deployed RAG system. The agentic application is bespoke and research-stage; this paper is related work, not a direct blueprint.
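
The centroid-distance quarantine described under "Supports" can be sketched as follows. This is a simplified illustration under stated assumptions, not Helmwart's implementation: it uses a single centroid computed from a trusted baseline, Euclidean distance, and a fixed threshold, and the names `partition_writes`, `baseline`, and `threshold` are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def partition_writes(baseline, candidates, threshold):
    """Split candidate vectors into (accepted, quarantined) index lists.

    Assumption: the centroid comes from a trusted baseline, so a poisoned
    candidate cannot shift the reference point it is measured against.
    """
    c = centroid(baseline)
    accepted, quarantined = [], []
    for i, v in enumerate(candidates):
        (quarantined if euclidean(v, c) > threshold else accepted).append(i)
    return accepted, quarantined


# Toy 2-D embeddings; real embeddings have hundreds of dimensions.
baseline = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
candidates = [[0.5, 0.6], [4.0, 4.0]]
accepted, quarantined = partition_writes(baseline, candidates, threshold=1.5)
```

A real vector store would likely hold several clusters, so a per-cluster centroid and a threshold calibrated on observed distances would be needed; both are outside what this sketch covers.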

Reference 3

arXiv preprint · November 2025

Kim, Lee, Koo — "Rescuing the Unpoisoned: Efficient Defense against Knowledge Corruption Attacks on RAG Systems" (arXiv:2511.01268)

Abstract

"RAGDefender operates during the post-retrieval phase, leveraging lightweight machine learning techniques to detect and filter out adversarial content without requiring additional model training or inference. … RAGDefender reduces the attack success rate (ASR) against the Gemini model from 0.89 to as low as 0.02, compared to 0.69 for RobustRAG and 0.24 for Discern-and-Answer when adversarial passages outnumber legitimate ones by a factor of four (4x)."

Supports: Post-retrieval filtering is the same phase targeted by the cross-encoder re-ranking layer in this mitigation. Demonstrates that lightweight post-retrieval anomaly detection is feasible and effective, providing empirical grounding for the re-ranking component's design rationale.

Does not prove: RAGDefender is a specific detection system, not the same as a cross-encoder re-ranking approach. Helmwart's mitigation generalises across re-ranking strategies; this paper is one contemporary instantiation, not the canonical specification.
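
To make the post-retrieval phase concrete, here is a minimal sketch of a re-ranking pass with a pluggable scorer. It is neither RAGDefender nor a real cross-encoder: `token_overlap` is a toy stand-in for a model-based relevance score, and `rerank`, `score_fn`, and `min_score` are hypothetical names chosen for illustration.

```python
def token_overlap(query, passage):
    """Toy relevance score: Jaccard overlap of lowercase tokens.

    Stand-in only; a deployment would plug in a cross-encoder score here.
    """
    q, p = set(query.lower().split()), set(passage.lower().split())
    union = q | p
    return len(q & p) / len(union) if union else 0.0


def rerank(query, passages, score_fn=token_overlap, min_score=0.1):
    """Score each retrieved passage, drop low scorers, return best-first."""
    scored = [(score_fn(query, p), p) for p in passages]
    kept = [(s, p) for s, p in scored if s >= min_score]
    return sorted(kept, key=lambda sp: sp[0], reverse=True)


retrieved = [
    "hamlet is a play",
    "buy cheap watches online",
    "shakespeare wrote hamlet",
]
ranked = rerank("who wrote hamlet", retrieved)
# The unrelated passage scores 0.0 and is dropped; the rest are ordered by score.
```

The sketch shows only the plumbing. A poisoned passage crafted to look relevant to the target question would pass a pure relevance filter, which is why the mitigation pairs re-ranking with a separate anomaly signal.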

Reference 4

arXiv preprint · October 2025

Cheng et al. — "Secure Retrieval-Augmented Generation against Poisoning Attacks" (arXiv:2510.25025)

Abstract

"RAGuard first expands the retrieval scope to increase the proportion of clean texts, reducing the likelihood of retrieving poisoned content. It then applies chunk-wise perplexity filtering to detect abnormal variations and text similarity filtering to flag highly similar texts. This non-parametric approach enhances RAG security, and experiments on large-scale datasets demonstrate its effectiveness in detecting and mitigating poisoning attacks, including strong adaptive attacks."

Supports: Names perplexity filtering and text-similarity detection as concrete implementation strategies for the detection-at-retrieval layer, lending empirical support to the cross-encoder score-gap signal described in this mitigation's detection signals list.

Does not prove: RAGuard is non-parametric and does not use embedding-space clustering; it is an alternative implementation approach, not the same architecture as Helmwart's embedding-anomaly layer. Results are on standard QA benchmarks, not agentic memory-store deployments.
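
The text-similarity filtering step named in the abstract can be illustrated with a short sketch. This is not RAGuard's algorithm: it assumes token-level Jaccard similarity and a fixed threshold, and the names `flag_near_duplicates` and `threshold` are hypothetical. Perplexity filtering is omitted because it requires a language model.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of the lowercase token sets of two texts."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def flag_near_duplicates(passages, threshold=0.8):
    """Return sorted indices of passages that closely resemble another passage.

    Rationale (assumed): injected texts for one target question tend to
    repeat the same claim, so they resemble each other more than clean
    passages do.
    """
    flagged = set()
    for i, j in combinations(range(len(passages)), 2):
        if jaccard(passages[i], passages[j]) >= threshold:
            flagged.update((i, j))
    return sorted(flagged)


passages = [
    "the capital of france is paris",
    "the capital of france is berlin according to new records",
    "the capital of france is berlin according to updated records",
    "paris hosts the louvre museum",
]
flagged = flag_near_duplicates(passages)  # the two near-identical texts
```

Flagged passages would normally go to review rather than being deleted outright, since legitimate corpora also contain near-duplicate text.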

Reference 5

Published July 2024

NIST AI 600-1 — Generative AI Profile (NIST AI RMF)

MEASURE 2.7 — Action MS-2.7-007

"Perform AI red-teaming to assess resilience against: Abuse to facilitate attacks on other systems (e.g., malicious code generation, enhanced phishing content), GAI attacks (e.g., prompt injection), ML attacks (e.g., adversarial examples/prompts, data poisoning, membership inference, model extraction, sponge examples)."

Supports: Names data poisoning as an ML attack class that requires red-team resilience assessment — providing the risk-management mandate under which embedding-space anomaly + re-ranking defences would be evaluated. Establishes data-poisoning resilience as a NIST AI RMF measurement obligation.

Does not prove: Does not prescribe specific detection methods, thresholds, or architectural patterns for vector-store defences. Generic GAI risk-management guidance, not an implementation specification.

Reference 6

OWASP LLM Top 10 v2025

OWASP LLM Top 10 v2025 — LLM04: Data and Model Poisoning

Core description — LLM04:2025

"Data poisoning occurs when pre-training, fine-tuning, or embedding data is manipulated to introduce vulnerabilities, backdoors, or biases. This compromises model security, performance, or ethical behavior, potentially leading to harmful outputs or degraded capabilities."

Supports: Explicitly names embedding data as a poisoning surface alongside training and fine-tuning data, directly establishing the vector-store write-time poisoning threat that the embedding-space anomaly detection layer addresses.

Does not prove: LLM04 covers the full data-poisoning threat class across the model lifecycle; embedding-vector-store poisoning in an agentic RAG context is a specific sub-class. The LLM Top 10 entry does not address re-ranking or cross-encoder defences.

Reference 7

v1.1 · published December 2025

OWASP Agentic AI — Threats & Mitigations v1.1

No verbatim excerpt pulled yet — open the original to verify the cited section.

Supports: T1 Memory Poisoning names this as an active threat against agentic systems and identifies it as an emerging research area requiring defence-in-depth. The v1.1 document is the upstream industry classification that this mitigation is filed against.

Does not prove: The landing page does not expose verbatim section text, and the full PDF was not accessible via automated fetch at the cross-check date. Section structure and T1 coverage are inferred from the document title and MDX inline citations. Direct PDF review is recommended.

Reference 8

ATLAS catalogue (continuously updated)

MITRE ATLAS AML.M0007 — Sanitize Training Data

No verbatim excerpt pulled yet — open the original to verify the cited section.

Supports: Names sanitisation of training/knowledge-base data as the canonical ATLAS mitigation for data-poisoning attacks. The embedding-space anomaly detection write-time check in this mitigation is an agentic operationalisation of this control.

Does not prove: The ATLAS live page returned HTTP 404 at the cross-check date, so no verbatim excerpt could be pulled. The ID and description above are based on the MDX entry and prior catalogue versions. Re-verify at next cross-check.

Reference 9

ATLAS catalogue (continuously updated)

MITRE ATLAS AML.M0031 — Memory Hardening

No verbatim excerpt pulled yet — open the original to verify the cited section.

Supports: Directly names memory hardening as a mitigation category, providing the ATLAS anchor for this mitigation's scope. Establishes that protecting agent memory stores against adversarial manipulation is a recognised ATLAS control objective.

Does not prove: The ATLAS live page returned HTTP 404 at the cross-check date, so no verbatim excerpt could be pulled. The ID is referenced in the MDX entry and in standard ATLAS memory-attack coverage. Re-verify at next cross-check.
