ASI10: Rogue Agents

Definition

A rogue agent is one whose behavioural objective has drifted from its authorised purpose, yet its identity still checks out, its actions remain inside its permissions, and its logs look clean. Divergence may originate from prompt injection, supply-chain tampering, or goal hijack; ASI10 names what happens after divergence begins: sustained, covert operation toward an attacker's goal with no single action that trips an alarm.

What it means in practice

In practice, a rogue agent is still authorised: it presents valid credentials, stays inside its permissions, and writes clean logs. What has changed is the objective it is working toward. Each action passes review on its own; only the pattern across time shows the shift toward an attacker's goal.

Detection has to be behavioural. Access controls don't help because the rogue agent is already inside them. What works: baselines on the agent's output distribution, canary instructions (does the agent still respond correctly to a known test prompt?), and periodic re-attestation (does the operator still intend for this agent to be doing this work?). By the time a rogue agent is detected by its impact, the impact has already happened.
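
A baseline on the agent's output distribution can be sketched as a comparison of action frequencies between a trusted reference window and a recent window. This is a minimal illustration, not a prescribed method: the function names, the use of total variation distance, and the 0.3 threshold are all assumptions made for the example.

```python
from collections import Counter

def action_distribution(actions):
    """Turn a list of action labels into a relative-frequency map."""
    counts = Counter(actions)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}

def total_variation(p, q):
    """Total variation distance: 0.0 means identical, 1.0 means disjoint."""
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(l, 0.0) - q.get(l, 0.0)) for l in labels)

def has_drifted(baseline_actions, recent_actions, threshold=0.3):
    """Flag drift when the recent action mix departs from the baseline.

    The threshold is an illustrative assumption and would need tuning
    against real traffic.
    """
    distance = total_variation(
        action_distribution(baseline_actions),
        action_distribution(recent_actions),
    )
    return distance > threshold
```

No single action in the recent window needs to look suspicious for the check to fire; the signal is the shift in the overall mix, which matches the pattern-across-time framing above.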

Threat catalogue links

Base-catalog T-numbers follow the OWASP source material; normalized MAS scenario entries are Helmwart editorial cross-references. Each threat's role sets the colour and display weight Helmwart gives it, and the chips in the hero use the same scheme.

Primary: strongest pivot. Removing this T-number would gut the entry.
Contributing: co-equal mechanism that combines with others to produce the ASI risk.
Related: touches the entry but isn't its core; useful cross-reference.

T13 Rogue Agents in Multi-Agent Systems · primary

A malicious or compromised agent inside the system exploits trust to act unobserved.

T14 Human Attacks on Multi-Agent Systems · primary

Adversaries exploit inter-agent delegation, trust, and task dependencies to escalate privileges or disrupt workflows.

T15 Human Manipulation · primary

Attacker turns the agent into a fluent, personalised social-engineering vector trusted by the user.

T38 Emergent Collusion on Blockchain · contributing

Multiple on-chain agents converge on collusive strategies that no single agent was instructed to pursue.

T47 Rogue MCP Server in Ecosystem · primary

Malicious MCP server registers in the agent ecosystem and is invoked under presumed-trustworthy framing.

MITRE ATLAS technique

OWASP has not published a 1:1 MITRE ATLAS mapping for this entry. The closest red-team techniques are referenced on the individual threat detail pages linked in the section above.

OWASP LLM Top 10 cross-references

From OWASP Appendix A (canonical inheritance)

LLM02:2025 Sensitive Information Disclosure · LLM09:2025 Misinformation

Helmwart mechanistic crossover (named in OWASP body text, not in Appendix A)

LLM01:2025 Prompt Injection · LLM03:2025 Supply Chain · LLM04:2025 Data and Model Poisoning

Recommended mitigations

No single control answers an ASI; each is met by a layered stack. The cards below are ranked by how directly each control counters ASI10. The chips on each card name the threats of this ASI that the control actually covers, colour-coded by each threat's role.

Counters the core

Controls that cover one or more of this ASI's primary threats: the strongest direct response.

Per-agent trust scoring — behavioural reputation for inter-agent message acceptance Tier 2

T13 · T14 · T47 · T38

In a multi-agent system, each agent routes decisions based on what its peers report. If a peer's behaviour becomes unreliable or adversarial, agents that keep treating it with full authority will propagate whatever errors or manipulations that peer introduces. Per-agent trust scoring addresses this by maintaining a continuously updated reputation score for every peer, derived from observed behaviour, and using that score to determine how much authority each incoming message carries.
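
As a rough sketch of the idea, a reputation score can be kept as an exponential moving average over observed outcomes and then used to weight incoming messages. The class name `TrustLedger`, the smoothing factor, the starting score, and the acceptance floor are hypothetical choices for illustration, not values taken from the source.

```python
class TrustLedger:
    """Keeps a behavioural reputation score in [0, 1] for each peer agent."""

    def __init__(self, initial=0.5, alpha=0.2, accept_above=0.4):
        self.initial = initial            # score assumed for an unseen peer
        self.alpha = alpha                # weight given to the newest observation
        self.accept_above = accept_above  # below this, messages carry no authority
        self.scores = {}

    def score(self, peer):
        return self.scores.get(peer, self.initial)

    def observe(self, peer, outcome_ok):
        """Fold one observed outcome (True = behaved as expected) into the score."""
        target = 1.0 if outcome_ok else 0.0
        updated = (1 - self.alpha) * self.score(peer) + self.alpha * target
        self.scores[peer] = updated
        return updated

    def weight(self, peer):
        """Authority an incoming message carries: the score, or 0.0 below the floor."""
        s = self.score(peer)
        return s if s >= self.accept_above else 0.0
```

How "outcome_ok" is decided is the hard part in practice and is left outside this sketch; the point shown here is only that authority follows observed behaviour rather than identity.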

Insider threat program — personnel security for operators of high-privilege agentic systems Tier 2

T13 · T15

Privileged-access personnel are the human layer behind every agentic system. A person with legitimate administrative credentials can tamper with logs, manipulate approval gates, or extract training data through authorised channels, and no technical control prevents it when the access itself is valid. An insider threat program addresses that gap: it governs who holds operator access, what they agree to, how quickly credentials are revoked on departure, and whether anomalous behaviour is surfaced before damage accumulates.

Multi-agent consensus — N-of-M independent agreement before high-impact actions Tier 2

T13 · T14

A single agent's judgment on a high-impact action can be wrong, manipulated, or compromised. Requiring N of M independent peer agents to agree before the action executes means that an attacker, or a systematic error, must affect at least N of those agents, not just one, before harm results.
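
The N-of-M check itself is small. The sketch below assumes votes arrive as a mapping from agent id to a boolean and that the set of eligible voters is known in advance; `quorum_approves` and its parameter names are hypothetical.

```python
def quorum_approves(votes, required, eligible):
    """Return True when at least `required` distinct eligible agents approve.

    votes:    dict mapping agent id -> True (approve) or False (reject)
    required: N, the number of approvals needed
    eligible: the M agents allowed to vote; votes from anyone else are ignored
    """
    eligible = set(eligible)
    if required > len(eligible):
        raise ValueError("required approvals exceed the number of eligible agents")
    approvals = {agent for agent, ok in votes.items() if ok and agent in eligible}
    return len(approvals) >= required
```

Ignoring votes from agents outside the eligible set matters as much as the count: otherwise an attacker could reach quorum by registering extra voters instead of compromising existing ones.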

Behavioural anomaly isolation — automatic quarantine on observable drift Tier 2

T13 · T38

An agent that has been compromised, poisoned, or gone rogue will, in most cases, behave differently from its established baseline. Anomaly isolation acts on that difference: when an agent's behaviour score crosses a configured threshold, it is quarantined automatically: credentials are revoked, message-queue access is cut, and in-flight actions are aborted. Manual revocation cannot match the speed that cascading multi-agent failures demand.
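
The quarantine step can be pictured as one function that applies all three effects together once the score crosses the threshold. `AgentState`, its fields, and the 0.8 threshold are assumptions for this sketch; in a real system each field would map to a call into the credential store, the message broker, and the task runner.

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    agent_id: str
    credentials_valid: bool = True
    queue_access: bool = True
    in_flight: list = field(default_factory=list)
    quarantined: bool = False

def isolate_if_anomalous(agent, anomaly_score, threshold=0.8):
    """Quarantine the agent when its anomaly score reaches the threshold.

    Returns True if the agent was quarantined by this call.
    """
    if anomaly_score < threshold:
        return False
    agent.credentials_valid = False   # revoke credentials
    agent.queue_access = False        # cut message-queue access
    agent.in_flight.clear()           # abort in-flight actions
    agent.quarantined = True
    return True
```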

Per-agent rate limits and quotas — bound compute, tokens, and external-API spend Tier 2

T13 · T38

An agent operates without direct human oversight, autonomously scheduling tool calls, external API requests, and reflection loops. Without a budget, a single triggering event can fan out into hundreds of downstream calls. Per-agent rate limits and quotas assign each agent identity its own ceiling on call rate, token consumption, and spend, so a misbehaving or compromised agent cannot exhaust shared resources, and its overconsumption becomes a visible, actionable signal.
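
A per-agent budget might look like the following sketch, which tracks calls, tokens, and spend for a single accounting window. The class name `AgentBudget` and its limits are hypothetical, and resetting the counters at the end of each window is deliberately left out.

```python
class AgentBudget:
    """Ceilings on calls, tokens, and spend for one agent in one window."""

    def __init__(self, max_calls, max_tokens, max_cost):
        self.max_calls = max_calls
        self.max_tokens = max_tokens
        self.max_cost = max_cost
        self.calls = 0
        self.tokens = 0
        self.cost = 0.0
        self.denied = 0  # count of refusals, usable as an alerting signal

    def try_spend(self, tokens, cost):
        """Record the request if it fits every ceiling; otherwise refuse it."""
        over = (
            self.calls + 1 > self.max_calls
            or self.tokens + tokens > self.max_tokens
            or self.cost + cost > self.max_cost
        )
        if over:
            self.denied += 1
            return False
        self.calls += 1
        self.tokens += tokens
        self.cost += cost
        return True
```

Keeping one `AgentBudget` per agent identity, rather than one shared pool, is what stops a single runaway agent from starving its peers. The `denied` counter is one way to surface the overconsumption signal mentioned above.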

Agent admission control — verify identity, capability claims, and provenance before a peer joins the system Tier 2

T13

In a multi-agent system, peer agents are granted authority by the other agents that accept their outputs. A rogue or compromised agent that enters the system inherits that authority immediately. Agent admission control is the registration gate that evaluates a peer's identity, declared capabilities, and binary provenance against policy before granting access. A peer that cannot pass attestation is refused entry and cannot participate in the system.

AI-source disclosure UI — visible AI labelling at the point of action Tier 2

T15

When an AI agent generates content or proposes an action, users need to know that the source is an AI before they decide to act. Without that signal, users routinely over-trust agent output. AI-source disclosure addresses this by attaching a visible label to every AI-generated item and by requiring explicit confirmation for consequential actions, restoring the critical gap between receipt and acceptance.

Data classification with tool-access allow-lists — a sensitivity label on every dataset, enforced at every access seam Tier 2

T15

Every dataset, document, and external system an agent can reach carries a classification label. The agent's permitted-class set and the tool's permitted-class set are intersected at the moment of every read or write. When the requested data's class falls outside that intersection, access is denied at the seam. This is the data-side complement to least-privilege: it adds a data-sensitivity constraint that role scoping alone does not provide.
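
The intersection rule reduces to a set operation. In this sketch the class labels are plain strings and `access_allowed` is a hypothetical helper; a real deployment would read the labels from a catalogue rather than pass them in by hand.

```python
def access_allowed(data_class, agent_classes, tool_classes):
    """Allow access only if the data's class is permitted for BOTH agent and tool."""
    permitted = set(agent_classes) & set(tool_classes)
    return data_class in permitted
```

For example, an agent cleared for `public` and `internal` data, using a tool cleared for `internal` and `confidential`, can touch only `internal` data: the agent's clearance does not widen the tool's, and the reverse holds as well.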

Human dual-control — four-eyes rule for irreversible high-impact approvals Tier 2

T15

An AI agent operating with broad authority can propose actions that are irreversible: deleting records, modifying IAM policies, moving funds. A single human reviewer at the approval gate is a single point of failure: one compromised account, one fatigued reviewer, or one successful social-engineering attempt is enough to commit the action. Human dual-control addresses that by requiring two distinct, independent humans to approve before the action commits.
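
A minimal sketch of the approval check follows. It assumes approvals are recorded as a list of approver ids and, as an extra assumption not stated above, that the person who requested the action cannot count as one of its approvers.

```python
def dual_control_satisfied(approvals, requester=None):
    """Return True when two distinct humans, neither the requester, have approved.

    approvals: iterable of approver ids (duplicates count once)
    requester: id of whoever initiated the action; excluded by assumption
    """
    distinct = {approver for approver in approvals if approver != requester}
    return len(distinct) >= 2
```

Counting distinct ids covers the case of one reviewer approving twice. It does not cover one person holding two accounts, which is an identity-management problem outside this check.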

Inter-agent message signing — end-to-end integrity for A2A and MCP Tier 2

T14

An inter-agent message travels through channels and intermediate agents the receiver did not originate. If nothing binds the message cryptographically to its source, any intermediate hop can substitute or inject content that the receiving agent will treat as authoritative. Message signing closes that gap: the source agent signs each message payload with its private key, and the receiver verifies the signature against a distributed trust bundle before the content reaches the reasoning layer.
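
The sign-then-verify flow can be illustrated with the Python standard library. Note the substitution: the text above describes private-key signatures, while this sketch uses HMAC-SHA256 with a shared secret per agent as a stand-in, because the standard library has no asymmetric signing. The function names and the shape of the trust bundle (a dict from agent id to key) are assumptions.

```python
import hashlib
import hmac
import json

def _canonical(sender, payload):
    """Serialise sender and payload deterministically so both sides hash the same bytes."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
    return sender.encode() + b"\x00" + body

def sign_message(sender, payload, key):
    tag = hmac.new(key, _canonical(sender, payload), hashlib.sha256).hexdigest()
    return {"sender": sender, "payload": payload, "sig": tag}

def verify_message(message, trust_bundle):
    """Check the tag before the payload is handed to the reasoning layer."""
    key = trust_bundle.get(message["sender"])
    if key is None:
        return False  # unknown sender: no key in the trust bundle
    expected = hmac.new(
        key, _canonical(message["sender"], message["payload"]), hashlib.sha256
    ).hexdigest()
    return hmac.compare_digest(expected, message.get("sig", ""))
```

With real asymmetric signatures the trust bundle would hold public keys only, so a receiver could verify messages without being able to forge them; a shared-secret scheme like this one does not give that property.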

Link and HTML rendering restriction — an allow-list control on what agent output may render Tier 2

T15

An agent can include links and rich HTML in its output. When that output is attacker-influenced, a clickable link, embedded image, or rich preview card becomes the delivery mechanism for phishing or data exfiltration via markdown image injection. Rendering restriction removes that delivery vector by allowing clickable content only from an explicit allow-list of trusted domains and reducing everything else to plain text before the output reaches the user.
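
For markdown output, the allow-list can be sketched as a rewrite pass that keeps links and images whose host is trusted and reduces the rest to their label text. The host `docs.example.com` is a placeholder and `restrict_links` is a hypothetical helper. The sketch handles only markdown-style links and images; raw HTML and bare URLs would need their own handling.

```python
import re
from urllib.parse import urlparse

ALLOWED_HOSTS = {"docs.example.com"}  # placeholder allow-list

# Matches [label](url) and ![alt](url)
_LINK = re.compile(r"(!?)\[([^\]]*)\]\(([^)\s]+)\)")

def restrict_links(text, allowed=ALLOWED_HOSTS):
    """Keep links to allowed HTTPS hosts; reduce everything else to plain text."""
    def replace(match):
        label, url = match.group(2), match.group(3)
        parsed = urlparse(url)
        if parsed.scheme == "https" and parsed.hostname in allowed:
            return match.group(0)
        return label  # drop the URL entirely so nothing is fetched or clickable
    return _LINK.sub(replace, text)
```

Dropping the URL rather than printing it matters for images in particular: an image URL is fetched on render, so a query string carrying stolen data would leave the system without any click.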

MCP server attestation — cryptographic proof of server identity and binary integrity Tier 2

T47

An MCP client connecting to a server has no built-in way to verify that the server at a given address is the expected workload or that its binary has not been replaced. An attacker who can intercept or substitute the server exploits that gap directly. MCP server attestation closes it by requiring the server to present cryptographic proof of two properties before the connection proceeds: that it holds a valid workload identity bound to a trusted certificate, and that its binary matches a signed hash recorded at build time.
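
The two checks can be sketched as a single gate. This example assumes the hard parts have already happened elsewhere: the workload identity has been extracted from a verified certificate, and the build-time hash record has had its signature checked. `attest_server` and its parameters are hypothetical, and nothing here is part of the MCP specification.

```python
import hashlib
import hmac

def attest_server(identity, binary_bytes, trusted_identities, build_hashes):
    """Allow the connection only if identity and binary hash both check out.

    identity:           workload identity, assumed already verified via certificate
    binary_bytes:       the server binary as measured at attestation time
    trusted_identities: set of identities the client is willing to talk to
    build_hashes:       dict identity -> SHA-256 hex digest recorded at build time
                        (assumed to come from a signature-verified record)
    """
    if identity not in trusted_identities:
        return False
    expected = build_hashes.get(identity)
    if expected is None:
        return False
    actual = hashlib.sha256(binary_bytes).hexdigest()
    return hmac.compare_digest(actual, expected)
```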

Out-of-band verification — independent-channel confirmation for irreversible agent actions Tier 2

T15

An agent that can propose payments, update banking details, or modify production configuration is, by construction, a manipulation surface. If the only thing standing between a proposed change and its execution is the agent's own UI, a successful prompt injection or RAG poisoning attack requires no additional steps. Out-of-band verification breaks that dependency by routing a one-use confirmation code through a channel that is structurally separate from the agent's primary interaction channel, so an attacker who controls the agent's context cannot complete the approval without also compromising the user's registered secondary device.
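
The one-use code can be sketched as follows. `OutOfBandVerifier` is a hypothetical class; delivery of the code over the secondary channel and code expiry are both omitted. The sketch also burns the code on the first attempt, right or wrong, which is a design assumption rather than something stated above.

```python
import hmac
import secrets

class OutOfBandVerifier:
    """Issues a one-use confirmation code per pending action."""

    def __init__(self):
        self._pending = {}

    def issue(self, action_id):
        """Create a 6-digit code; the caller sends it via the secondary channel."""
        code = f"{secrets.randbelow(10**6):06d}"
        self._pending[action_id] = code
        return code

    def confirm(self, action_id, code):
        """Accept the code at most once; any attempt consumes it."""
        expected = self._pending.pop(action_id, None)
        if expected is None:
            return False
        return hmac.compare_digest(expected, code)
```

The agent never sees the code in this arrangement: `issue` hands it to the delivery channel and `confirm` is called with whatever the user types, so controlling the agent's context is not enough to complete the approval.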

Output egress DLP — inspection gate for PII, secrets, and IP at the agent boundary Tier 2

T15

An agent produces output continuously across multiple channels: user-facing responses, tool-call parameter envelopes, log records, and outbound HTTP requests. Any of those channels can carry sensitive content the agent has retrieved, been fed, or been tricked into including. Output egress DLP places an inspection gate at the boundary so that PII, credentials, and proprietary content are classified and either redacted or quarantined before they leave the trust boundary, regardless of how they got into the output.
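
A very small version of the inspection gate can be built from pattern matching. The two patterns below (an email address and an AWS-style access key id) are illustrative assumptions only; production DLP relies on far broader detection than a pair of regular expressions, and `redact` is a hypothetical helper.

```python
import re

# Illustrative detectors only; real deployments need many more.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "AWS_KEY": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}

def redact(text):
    """Replace each match with a labelled placeholder and report what was found."""
    findings = []
    for label, pattern in PATTERNS.items():
        text, hits = pattern.subn(f"[REDACTED:{label}]", text)
        if hits:
            findings.append(label)
    return text, findings
```

The same function would be applied to every outbound channel named above, not just the user-facing response, since tool-call parameters and log records can carry the same content.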

Output moderation gates — independent moderation pass before emission Tier 2

T15

An AI agent can produce output that is harmful, deceptive, or factually wrong while still sounding fluent and confident. Output moderation places an independent classifier or moderation model between the agent and its destination, checking every output before it reaches a user or a downstream system. The generating model does not evaluate its own answer; a separate gate does.

Policy-bound autonomy — declarative runtime enforcement of the agent's action space Tier 2

T13

An agent's authority is normally bounded only by its own reasoning. If that reasoning is manipulated, or the agent's identity is compromised, it will attempt actions the operator never intended to permit. Policy-bound autonomy addresses this by placing a declarative enforcement point between the agent and every consequential action: a policy engine evaluates the agent identity, the target tool, and the parameter envelope before execution, and the agent cannot reason or argue past the result.
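
The enforcement point can be sketched as a default-deny lookup over a declarative rule list. The rule format, the agent and tool names, and the `is_permitted` function are all hypothetical; a real deployment would more likely use a dedicated policy engine than a hand-rolled check.

```python
# Hypothetical declarative policy: anything not listed is denied.
POLICY = [
    {"agent": "billing-agent", "tool": "refund", "max": {"amount": 100}},
    {"agent": "billing-agent", "tool": "lookup_invoice", "max": {}},
]

def is_permitted(agent, tool, params, policy=POLICY):
    """Evaluate agent identity, target tool, and parameters before execution."""
    for rule in policy:
        if rule["agent"] == agent and rule["tool"] == tool:
            # Every bounded parameter must be present and within its limit.
            return all(
                name in params and params[name] <= limit
                for name, limit in rule["max"].items()
            )
    return False  # default deny: no matching rule
```

Because the check runs outside the model and returns only a boolean, nothing the agent generates can change the outcome, which is the property the paragraph above describes.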

OWASP Top 10 for Agentic Applications 2026 (canonical source) · OWASP Gen AI Security Project · Dec 2025 · CC BY-SA 4.0
Agentic Top 10 side-by-side explainer · trydeepteam.com · secondary reference