MITIGATION · m-redteam-behavioural
Behavioural red-teaming — adversarial evaluation of agent reasoning and tool use
An agent exposes more attack surface than a static model: it reasons, plans, selects tools, and acts across multiple turns. Static analysis can characterise that surface, and runtime guardrails can block known-bad patterns, but neither can predict what the agent will do under attacker pressure it has never seen. Behavioural red-teaming addresses that gap through structured adversarial evaluation: probing the agent's reasoning, planning, and tool-use paths with attack strategies before each release.
At a glance
TL;DR
- Run structured adversarial probes against the agent before each release: not as a one-off audit but as a repeating evaluation cycle tied to the release gate.
- Automated tools (PyRIT, Garak) cover known attack patterns at low cost; human campaigns catch emergent behaviour, such as tool-chain abuse and inter-agent collusion, that parameterised probe libraries do not anticipate.
- A passing run confirms only that the probed attack classes did not succeed in this cycle; it makes no statement about attack classes not in the probe set.
- Every finding either confirms a known risk applies to this agent or exposes a previously uncharacterised one; feed both back into the threat model before the next release.
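The feedback step in the last bullet can be sketched as a simple triage against the threat model. This is a minimal illustration, not a prescribed workflow: the `Finding` type, the attack-class labels, and the idea of storing the threat model as a set of class names are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    finding_id: str
    attack_class: str  # hypothetical label, e.g. "prompt-injection"


def triage(findings, known_risks):
    """Split findings into those confirming a known risk and those exposing a new one."""
    confirmed = [f for f in findings if f.attack_class in known_risks]
    new = [f for f in findings if f.attack_class not in known_risks]
    return confirmed, new


def update_threat_model(known_risks, findings):
    """Return a new set that includes every attack class seen in this cycle."""
    return set(known_risks) | {f.attack_class for f in findings}


# Hypothetical threat model and one cycle of findings.
known = {"prompt-injection", "multi-turn-manipulation"}
cycle = [Finding("F-1", "prompt-injection"), Finding("F-2", "tool-chain-abuse")]

confirmed, new = triage(cycle, known)
updated = update_threat_model(known, cycle)
```

Here `F-1` confirms a risk already in the model, while `F-2` introduces a class the model did not list; both end up in `updated` before the next release.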
How it behaves
What it is
Behavioural red-teaming is the structured, adversarial evaluation of an agent's reasoning, planning, and tool-use paths under attacker pressure. It is the AI specialisation of traditional security red-teaming, applied to the attack classes that agentic systems are exposed to: prompt injection, multi-turn manipulation, tool-chain composition, memory poisoning, and inter-agent collusion.
The distinction from conventional red-teaming matters for two reasons. First, the surface is different: conventional red-teaming probes ports, APIs, and human social-engineering targets; agentic red-teaming adds the prompt layer, the tool invocation chain, and the reasoning context, none of which are visible to a network scanner. Second, the failure mode is emergent: static analysis can characterise what capabilities an agent holds, and runtime guardrails can block known-bad patterns, but neither can surface what an agent will do under pressure it has never encountered. Adversarial behavioural testing is the only method that directly probes that class of behaviour.
Three evaluation approaches are used in production:
- Fixed benchmark harness. Score the agent against canonical adversarial datasets on every release: the Lakera PINT benchmark for prompt injection detection (4,314 inputs across 24+ languages), HarmBench for harmful-output evaluation across 18 attack methods, and AdvBench for jailbreak coverage. A fixed benchmark produces a reproducible pass/fail line and a delta you can track across releases; its limitation is that it covers only what its curators anticipated.
- Automated adversarial generation. Microsoft PyRIT supports single-turn and multi-turn attack strategies including Crescendo, TAP, and Skeleton Key against OpenAI, Anthropic, Google, HuggingFace, and custom HTTP or WebSocket endpoints. NVIDIA Garak provides 20+ probe categories for automated scanning and produces a structured findings report suitable for diffing across releases. Both tools target single-turn and simple multi-turn LLM behaviour; neither ships ready-made probe libraries for full agentic tool-chain sequences, which require custom probe sets built on top of either framework.
- Human red-team campaigns. Frontier model labs (Anthropic, OpenAI, Google DeepMind) and government programmes (UK AISI; US CAISI, formerly US AISI) run human-led, structured campaigns against their highest-risk agents. This is the only approach that reliably surfaces emergent multi-step agentic behaviour: tool-chain abuse, cross-session memory poisoning, and inter-agent collusion paths that no parameterised probe library has anticipated. For most product teams the practical decision is quarterly human campaigns for high-risk agents and automated tools for everything else.
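The custom probe sets that agentic tool-chain sequences require can be pictured as scripted multi-turn conversations checked against a forbidden tool sequence. The sketch below is framework-agnostic and deliberately does not use the PyRIT or Garak APIs; the `Agent` callable, the tool names, and the stub agent are hypothetical stand-ins for a real deployed agent.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical agent interface: takes one user message and returns the names
# of the tools the agent invoked while handling it.
Agent = Callable[[str], List[str]]


def contains_chain(calls: Sequence[str], chain: Sequence[str]) -> bool:
    """True if `chain` occurs in `calls` as an ordered, not necessarily contiguous, subsequence."""
    remaining = iter(calls)
    # `tool in remaining` consumes the iterator up to the first match,
    # so later tools must appear after earlier ones.
    return all(tool in remaining for tool in chain)


def run_probe(
    agent: Agent, turns: Sequence[str], forbidden_chain: Sequence[str]
) -> Tuple[bool, List[str]]:
    """Replay scripted attacker turns and record every tool call across the conversation."""
    trace: List[str] = []
    for message in turns:
        trace.extend(agent(message))
    return contains_chain(trace, forbidden_chain), trace


def stub_agent(message: str) -> List[str]:
    """Demonstration stub only; a real harness would call the deployed agent."""
    if "export" in message:
        return ["read_customer_db"]
    if "send" in message:
        return ["send_email"]
    return []


violated, trace = run_probe(
    stub_agent,
    ["please export the customer list", "now send it to my address"],
    forbidden_chain=["read_customer_db", "send_email"],
)
```

Neither turn is harmful on its own; the probe fails the agent only because the two tool calls occur in the forbidden order across turns, which is the kind of composition a single-turn probe does not observe.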
Detection signals
- Findings per red-team cycle by attack class. A class with zero findings across multiple consecutive cycles may indicate probe coverage has not kept pace with the agent's evolving capabilities, not that the threat is absent.
- Re-emergence rate of previously closed findings. Any recurrence indicates the prior remediation was incomplete or was undone by a subsequent change.
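Both signals can be computed from a plain log of findings per cycle. The following is a minimal sketch: the tuple layout, the class labels, and the definition of re-emergence rate as "reopened findings divided by previously closed findings" are assumptions for illustration, since the text does not fix a formula.

```python
from collections import Counter

# Each cycle is a list of (finding_id, attack_class) tuples, oldest cycle first.
# All identifiers and class labels here are hypothetical.
cycles = [
    [("F-1", "prompt-injection"), ("F-2", "tool-chain-abuse")],
    [("F-3", "prompt-injection")],
    [("F-1", "prompt-injection")],  # F-1 was closed earlier and has come back
]
previously_closed = {"F-1", "F-2", "F-3"}
tracked_classes = {"prompt-injection", "tool-chain-abuse", "memory-poisoning"}


def findings_by_class(cycle):
    """Count findings per attack class for one cycle."""
    return Counter(attack_class for _, attack_class in cycle)


def silent_classes(all_cycles, tracked, window):
    """Tracked classes with zero findings in each of the last `window` cycles."""
    seen = {cls for cycle in all_cycles[-window:] for _, cls in cycle}
    return tracked - seen


def reemergence_rate(latest_cycle, closed):
    """Share of previously closed findings that reappear in the latest cycle."""
    if not closed:
        return 0.0
    reopened = {fid for fid, _ in latest_cycle} & closed
    return len(reopened) / len(closed)


first_cycle_counts = findings_by_class(cycles[0])
silent = silent_classes(cycles, tracked_classes, window=2)
rate = reemergence_rate(cycles[-1], previously_closed)
```

A class that shows up in `silent` is a prompt to review probe coverage for that class, not evidence that the agent is safe against it.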
Threats it covers
- Misaligned and Deceptive Behaviors
WHY IT HELPS: Misaligned and Deceptive Behaviors describes an agent that pursues goals the operator did not authorise, or conceals what it is doing to avoid intervention. These behaviours emerge under attacker pressure, not under nominal load, and are therefore invisible to static analysis and runtime monitoring of known-bad patterns. Adversarial evaluation surfaces them by placing the agent under structured attacker pressure before deployment and verifying that the behaviours either do not occur or are caught by other controls.
Principle coverage
Defence-in-Depth stage: Prevent. It advances:
- Robustness / Reliability. Robustness requires that an agent produce reliable, policy-conformant behaviour under adversarial conditions, not only under nominal load. Behavioural red-teaming is the only pre-deployment method that directly probes the emergent failure modes (misaligned planning, tool-chain abuse, and manipulation under sustained pressure) that make an agent unreliable in practice.
Design & governance principles (open design, economy of mechanism, accountability, …) are architectural, not advanced by a single placed control.
Implementation options
This is primarily a process control, not a product you buy. Two open-source toolkits automate the repeatable layers; the rest is either a curated benchmark harness or a structured human campaign. Match the approach to the agent's risk tier: automated sweeps for routine agents, human campaigns for high-risk ones.
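Matching the approach to the risk tier can be expressed as a small release-gate rule. This is a sketch under stated assumptions: the tier labels `"routine"` and `"high"`, the `EvaluationPlan` type, and the two boolean inputs are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvaluationPlan:
    automated_sweep: bool
    human_campaign: bool


def plan_for(risk_tier: str) -> EvaluationPlan:
    """Map a risk tier to the evaluation layers it requires (tier names are hypothetical)."""
    if risk_tier == "routine":
        return EvaluationPlan(automated_sweep=True, human_campaign=False)
    if risk_tier == "high":
        return EvaluationPlan(automated_sweep=True, human_campaign=True)
    raise ValueError(f"unknown risk tier: {risk_tier!r}")


def release_gate(risk_tier: str, automated_passed: bool, human_campaign_done: bool) -> bool:
    """Allow release only when every layer the tier requires has been satisfied."""
    plan = plan_for(risk_tier)
    if plan.automated_sweep and not automated_passed:
        return False
    if plan.human_campaign and not human_campaign_done:
        return False
    return True
```

With this rule, `release_gate("high", True, False)` blocks the release because a clean automated sweep alone does not satisfy the high-risk tier.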
Microsoft PyRIT
Open-source Python framework for single-turn and multi-turn adversarial probing against any HTTP, WebSocket, or SDK-accessible LLM target.
Why choose it: Best when you need multi-turn manipulation coverage (jailbreak escalation across multiple conversation turns) or when your agent exposes a custom endpoint. PyRIT's modular target, converter, scorer, and memory components let you wire adversarial probes directly into an existing test harness. Use one of PyRIT's prompt target classes (for example OpenAIChatTarget) to point at any OpenAI, Anthropic, Google, HuggingFace, or custom endpoint. PyRIT does not ship ready-made probe libraries for agentic tool-chain sequences; those require custom probe sets built on top of the framework.
NVIDIA Garak
Open-source vulnerability scanner with 20+ probe categories covering prompt injection, encoding attacks, data leakage, jailbreaks, and hallucination. Runs as a CLI and produces a structured findings report.
Why choose it: Best for a broad single-turn sweep on every release without writing custom probe logic. Run garak --model_type openai --model_name gpt-4o --probes all to scan a deployed endpoint; the structured report format is suitable for diffing across releases and flagging regressions. Like PyRIT, Garak does not cover full agentic tool-chain sequences out of the box.
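Diffing reports across releases can be sketched as a comparison of per-probe pass rates. The example assumes the report is JSON Lines and that evaluation records carry `entry_type`, `probe`, `detector`, `passed`, and `total` fields; treat that record shape as an assumption and check it against the report produced by the Garak version you run. The probe and detector names below are placeholders, not real Garak probes.

```python
import json


def pass_rates(report_lines):
    """Map (probe, detector) -> pass rate, using only records of entry_type 'eval'."""
    rates = {}
    for line in report_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("entry_type") != "eval":
            continue  # skip configuration, attempt, and other record types
        total = record["total"]
        if total:
            rates[(record["probe"], record["detector"])] = record["passed"] / total
    return rates


def regressions(previous, current, tolerance=0.0):
    """Probe/detector pairs whose pass rate dropped by more than `tolerance`."""
    return {
        key: (previous[key], current[key])
        for key in previous.keys() & current.keys()
        if previous[key] - current[key] > tolerance
    }


# Placeholder report lines for two releases (assumed record shape).
previous_report = [
    '{"entry_type": "config", "note": "ignored"}',
    '{"entry_type": "eval", "probe": "p1", "detector": "d1", "passed": 9, "total": 10}',
    '{"entry_type": "eval", "probe": "p2", "detector": "d1", "passed": 10, "total": 10}',
]
current_report = [
    '{"entry_type": "eval", "probe": "p1", "detector": "d1", "passed": 7, "total": 10}',
    '{"entry_type": "eval", "probe": "p2", "detector": "d1", "passed": 10, "total": 10}',
]

regressed = regressions(pass_rates(previous_report), pass_rates(current_report))
```

In practice the two lists would be read from the report files of consecutive releases, and a non-empty `regressed` would flag the release for review.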
HarmBench + Lakera PINT
Fixed benchmark harness: score the agent against curated, versioned adversarial datasets on every release and track the delta.
Why choose it: Best when you need a reproducible pass/fail line that a release manager can understand without security expertise. HarmBench covers 18 attack methods for harmful-output classification; Lakera PINT covers 4,314 prompt-injection inputs across 24+ languages for injection detection. Fixed benchmarks are shallow by design: they catch only what their curators anticipated, but their repeatability makes regression tracking straightforward. Use as the automated gate; supplement with PyRIT or human campaigns for depth.
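The pass/fail line and the tracked delta can be sketched as a comparison of benchmark scores between two releases. The score convention (a fraction between 0 and 1 where higher is better), the `max_drop` threshold, and the dictionary keys are assumptions for illustration; the scores themselves are made-up values, not published results.

```python
def benchmark_gate(previous, current, max_drop=0.02):
    """Return a list of human-readable failures; an empty list means the gate passes."""
    failures = []
    for name, old_score in previous.items():
        if name not in current:
            failures.append(f"{name}: missing from current run")
            continue
        drop = old_score - current[name]
        if drop > max_drop:
            failures.append(f"{name}: score fell by {drop:.3f}")
    return failures


# Hypothetical scores for two consecutive releases.
previous_scores = {"pint": 0.95, "harmbench": 0.90}
current_scores = {"pint": 0.95, "harmbench": 0.80}

failures = benchmark_gate(previous_scores, current_scores)
```

Returning reasons rather than a bare boolean keeps the gate readable for a release manager: the list names the benchmark that regressed and by how much.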
In-house human campaign
Time-boxed adversarial campaign run by internal security engineers or contracted red-teamers, scoped to the specific agent's action space and threat model.
Why choose it: The only approach that reliably surfaces genuinely novel agentic attack patterns: multi-agent collusion, memory poisoning across sessions, and tool-chain abuse chains that no pre-built probe library anticipates. Use the OWASP GenAI Red Teaming Guide as the methodology baseline. Scope the campaign to the declared threat model, time-box it, document every finding, and feed results back into the threat model before the next release. Frontier labs (Anthropic, OpenAI, Google DeepMind) and government programmes (UK AISI; US CAISI, formerly US AISI) all run campaigns of this type as a release gate for high-risk agents.
Trade-offs
- Automated probes (PyRIT, Garak) run quickly and cheaply, roughly £50-200 in API spend per full suite against a frontier-model agent, but they are scope-bounded: they find the attack classes their probe library covers, not novel ones.
- Human campaigns are the only way to catch emergent multi-step agentic behaviour, but they cost 40-200 person-hours per cycle; at typical security-contractor rates (£800-£1,500/day) a quarterly campaign for a high-risk agent is a £10,000-£50,000 budget line.
- Fixed benchmarks (HarmBench, PINT) are calibrated against single-turn LLM behaviour; agentic tool-chain sequences, memory poisoning, and inter-agent collusion paths are not covered by any public benchmark as of mid-2026.
- Red-teaming identifies gaps; it does not close them. A finding that an agent can be induced to disburse funds requires a structural fix to the authorisation model, not a second red-team run.
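The human-campaign cost figures above can be checked with simple arithmetic. The sketch assumes an 8-hour working day, which the text does not state, and counts contractor day-rate fees only.

```python
def campaign_fee_range(hours_low, hours_high, rate_low, rate_high, hours_per_day=8):
    """Convert a person-hour range and a day-rate range into a fee range in pounds."""
    days_low = hours_low / hours_per_day
    days_high = hours_high / hours_per_day
    return days_low * rate_low, days_high * rate_high


# Figures from the trade-off above: 40-200 person-hours at 800-1,500 pounds per day.
fee_low, fee_high = campaign_fee_range(40, 200, 800, 1500)

# Four cycles per year if the campaign runs quarterly.
annual_low, annual_high = 4 * fee_low, 4 * fee_high
```

Under that assumption one cycle is 5 to 25 contractor-days, or £4,000 to £37,500 in day-rate fees, and four cycles come to £16,000 to £150,000 per year. The £10,000-£50,000 budget line quoted above sits above the per-cycle fee range; the text does not itemise what else it includes, so treat the two figures as separate estimates.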
When NOT to use
- Do not run red-team campaigns when the agent is so narrowly scoped that its full action space can be covered by deterministic unit tests and schema validation. An agent that only reads from a database and returns formatted output has no agentic reasoning surface to probe.
- Do not use red-teaming as a substitute for structural controls such as RBAC, policy-bound autonomy, and output moderation. Red-teaming identifies where those controls are missing; those controls are what close the gaps.
- Do not treat automated probe results as sufficient for high-risk agents (financial, medical, infrastructure-adjacent) without at least one human campaign per major release cycle.
Limitations
- All public probe libraries target single-turn or short multi-turn LLM behaviour; no off-the-shelf tool covers full agentic tool-chain sequences, multi-agent collusion, or cross-session memory poisoning as of mid-2026. These require custom harnesses built on top of PyRIT or Garak.
- A passing red-team result is scoped to the probed attack classes. It makes no statement about attack classes not in the probe set; novel jailbreak families emerge continuously and will not appear in last quarter's benchmark.
- Automated probe libraries have limited coverage for non-English and highly domain-specific agents; running Garak or PyRIT against a medical or legal agent without domain-specific probe extensions produces a misleadingly clean report.
- Red-teaming cadence decays without operational discipline. Quarterly human campaigns and nightly automated sweeps both require scheduling, budget, and named ownership; teams that treat red-teaming as a one-off audit lose coverage between product iterations.
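Because a passing result is scoped to the probed attack classes, it helps to report explicitly which classes were never probed. The sketch below does that with set arithmetic; the class labels and the three-way report layout are assumptions for illustration.

```python
def coverage_statement(threat_model_classes, probed_classes, failed_classes):
    """Summarise what a red-team cycle can and cannot claim, per attack class."""
    probed = set(probed_classes)
    failed = set(failed_classes)
    return {
        "passed": sorted(probed - failed),
        "failed": sorted(probed & failed),
        # Classes in the threat model that no probe exercised this cycle.
        "not_probed": sorted(set(threat_model_classes) - probed),
    }


# Hypothetical cycle: three classes in the threat model, two of them probed.
statement = coverage_statement(
    threat_model_classes={"prompt-injection", "memory-poisoning", "tool-chain-abuse"},
    probed_classes={"prompt-injection", "tool-chain-abuse"},
    failed_classes={"tool-chain-abuse"},
)
```

Publishing the `not_probed` list alongside the pass/fail result keeps a clean report from being read as a statement about classes the cycle never touched.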
Maturity tier reasoning
- Tier 2 fits because PyRIT and Garak are production-available, actively maintained open-source tools with documented attack libraries, and the OWASP GenAI Red Teaming Guide provides a standards-level methodology baseline supported by NIST AI RMF and EU AI Act requirements.
- The agentic-specific probe gap (no public library covers multi-step tool-chain abuse, memory poisoning, or inter-agent collusion) is what keeps this out of Tier 1. The tooling exists; the agentic-surface coverage does not.
- Frontier-lab red-team programmes (Anthropic, OpenAI, Google DeepMind) are advancing the practice rapidly, but their findings and probe sets are not publicly available in a form that product teams can directly adopt.
Last verified against upstream docs: 2026-05-30.