EVIDENCE TRAIL

Behavioural red-teaming — adversarial evaluation of agent reasoning

Verbatim excerpts from the upstream sources cited on the mitigation page, with what each source does and does not prove. The MDX cites NIST AI 600-1 MEASURE 2.7 for red-teaming — this is correct at the function level; the specific red-teaming sub-action is MS-2.7-007, confirmed verbatim below. MEASURE 2.7 as a whole covers security and resilience evaluation, not reasoning-surface adversarial testing specifically.

Last cross-checked against upstream sources: 2026-05-29 · 8 sources

References

Each entry shows what the source supports and what it does not prove.

Reference 1

Published July 2024

NIST AI 600-1 — Generative AI Profile (NIST AI RMF)

MEASURE 2.7 — "AI system security and resilience … are evaluated and documented" · Action MS-2.7-007

"Perform AI red-teaming to assess resilience against: Abuse to facilitate attacks on other systems (e.g., malicious code generation, enhanced phishing content), GAI attacks (e.g., prompt injection), ML attacks (e.g., adversarial examples/prompts, data poisoning, membership inference, model extraction, sponge examples)."

Supports: Verbatim mandate for AI red-teaming under the MEASURE function. Covers prompt injection, adversarial examples, and data poisoning — the primary agentic attack classes this control targets. MS-2.7-007 is the specific sub-action; the MDX cites "MEASURE-2.7" correctly at the function level.

Does not prove: MEASURE 2.7 as a whole is scoped to security and resilience evaluation, not reasoning-surface adversarial testing. The red-teaming action (MS-2.7-007) sits inside a broader security-measurement function; it does not name agentic multi-turn or tool-chain attack classes explicitly.

open original ↗

Reference 2

Published July 2024

NIST AI 600-1 — Generative AI Profile (NIST AI RMF)

MEASURE 1.3 — "Internal experts … and independent assessors are involved in regular assessments" · Action MS-1.3-002

"Engage in internal and external evaluations, GAI red-teaming, impact assessments, or other structured human feedback exercises in consultation with representative AI Actors with expertise and familiarity in the context of use, and/or who are representative of the populations associated with the context of use."

Supports: Requires red-teaming to be conducted with both internal and external actors. Directly establishes the human-campaign layer this control names alongside automated tools.

Does not prove: Framed as a measurement-independence requirement, not a behavioural-reasoning probe requirement. Does not distinguish automated from human campaigns.

open original ↗

Reference 3

Published July 2024

NIST AI 600-1 — Generative AI Profile (NIST AI RMF)

MAP 5.1 — "Likelihood and magnitude … of impacts are examined" · Action MP-5.1-005

"Conduct adversarial role-playing exercises, GAI red-teaming, or chaos testing to identify anomalous or unforeseen failure modes."

Supports: "Unforeseen failure modes" is the exact rationale for behavioural red-teaming over static analysis — this action names the discovery purpose directly.

Does not prove: MAP 5.1 is an impact-mapping action, not a deployment control. Does not specify frequency, tooling, or scope for agentic systems.

open original ↗

Reference 4

Published September 2022

Anthropic — "Red Teaming Language Models to Reduce Harms" (arXiv 2209.07858)

Abstract

"We describe our early efforts to red team language models in order to simultaneously discover, measure, and attempt to reduce their potentially harmful outputs."

Supports: Establishes the three-purpose framing — discover, measure, reduce — that this control inherits. Also reports release of 38,961 red-team attacks as a dataset, demonstrating the empirical basis for systematic adversarial probing.

Does not prove: Paper targets single-turn LLM outputs, not agentic multi-step reasoning or tool-chain composition. Results on scaling (RLHF models become harder to red-team at scale) are model-level findings, not deployment controls.

open original ↗

Reference 5

Open source · actively maintained

NVIDIA Garak — Generative AI Red-teaming & Assessment Kit

README — project description

"garak checks if an LLM can be made to fail in a way we don't want. garak probes for hallucination, data leakage, prompt injection, misinformation, toxicity generation, jailbreaks, and many other weaknesses."

Supports: Named toolkit for the automated probe pipeline step in this control. Verbatim confirms coverage of prompt injection and jailbreaks — the primary agentic attack classes. Open-source; integrates with CI/CD.

Does not prove: Probe library coverage skews toward single-turn LLM behaviour. Multi-step agentic tool-use scenarios require custom probe sets beyond the default library.

open original ↗

Reference 6

ATLAS catalogue (continuously updated)

MITRE ATLAS AML.M0008 — Validate AI Model

AML.M0008 description field (ATLAS YAML dist)

"Validate that AI models perform as intended by testing for backdoor triggers, potential for data leakage, or adversarial influence. Monitor AI model for concept drift and training data drift, which may indicate data tampering and poisoning."

Supports: Adversarial influence and backdoor testing are the supply-chain and model-integrity dimensions of red-teaming that this control covers when pairing with m-divergence-monitor. Provides the ATLAS anchor for the atlasMitigations frontmatter field.

Does not prove: Scoped to model validation at training/acquisition time, not ongoing adversarial behavioural probing at runtime or pre-deployment reasoning evaluation.

open original ↗

Reference 7

ATLAS catalogue (continuously updated)

MITRE ATLAS AML.M0022 — Generative AI Model Alignment

AML.M0022 description field (ATLAS YAML dist)

"When training or fine-tuning a generative AI model it is important to utilize techniques that improve model alignment with safety, security, and content policies. The fine-tuning process can potentially remove built-in safety mechanisms in a generative AI model, but utilizing techniques such as Supervised Fine-Tuning, Reinforcement Learning from Human Feedback or AI Feedback, and Targeted Safety Context Distillation can improve the safety and alignment of the model."

Supports: Establishes that red-teaming findings feed back into alignment improvement via RLHF and SFT — the feedback loop that this control's "feed findings into the threat model" step anticipates.

Does not prove: A training-time mitigation, not a red-teaming technique itself. Does not address agentic tool-chain or multi-turn evaluation.

open original ↗

Reference 8

Regulation (EU) 2024/1689 · enters into force 2 August 2026

EU AI Act — Article 15 (Accuracy, Robustness and Cybersecurity)

Article 15(1) and Article 15(5)

"High-risk AI systems shall be as resilient as possible regarding errors, faults or inconsistencies … inputs designed to cause the AI model to make a mistake (adversarial examples or model evasion) … attacks trying to manipulate the training data set (data poisoning), or pre-trained components used in training (model poisoning)."

Supports: Legislative mandate for adversarial robustness testing for high-risk AI systems in the EU. Names adversarial examples, model evasion, data poisoning, and model poisoning — the attack classes that behavioural red-teaming probes for.

Does not prove: Article 15 requires robustness, not red-teaming as a named method. The regulation does not specify red-team tooling, frequency, or agentic-system scope. Commission benchmarks and measurement methodologies are yet to be published (as of May 2026).

open original ↗