Definition
A rogue agent is one whose behavioural objective has drifted from its authorised purpose, yet its identity still checks out, its actions remain inside its permissions, and its logs look clean. Divergence may originate from prompt injection, supply-chain tampering, or goal hijack; ASI10 names what happens after divergence begins: sustained, covert operation toward an attacker's goal with no single action that trips an alarm.
What it means in practice
A rogue agent is one that's still authorised. Its identity checks out, its actions stay inside its permissions, its logs look clean. But its actual objective has drifted from what it was deployed for. Individual actions all pass review; the pattern across time has shifted toward an attacker's goal.
Detection has to be behavioural. Access controls don't help because the rogue agent is already inside them. What works: baselines on the agent's output distribution, canary instructions (does the agent still respond correctly to a known test prompt?), and periodic re-attestation (does the operator still intend for this agent to be doing this work?). By the time a rogue agent is detected by its impact, the impact has already happened.
Threat catalogue links
Base-catalog T-numbers follow OWASP source material; normalized MAS scenario entries are Helmwart editorial cross-references. Role colour-codes Helmwart's display weight: chips in the hero use the same scheme.
-
T13 Rogue Agents in Multi-Agent Systems primary A malicious or compromised agent inside the system exploits trust to act unobserved.
Open threat detail → -
T14 Human Attacks on Multi-Agent Systems primary Adversaries exploit inter-agent delegation, trust, and task dependencies to escalate privileges or disrupt workflows.
Open threat detail → -
T15 Human Manipulation primary Attacker turns the agent into a fluent, personalised social-engineering vector trusted by the user.
Open threat detail → -
T38 Emergent Collusion on Blockchain contributing Multiple on-chain agents converge on collusive strategies that no single agent was instructed to pursue.
Open threat detail → -
T47 Rogue MCP Server in Ecosystem primary Malicious MCP server registers in the agent ecosystem and is invoked under presumed-trustworthy framing.
Open threat detail →
MITRE ATLAS technique
OWASP has not published a 1:1 MITRE ATLAS mapping for this entry. The closest red-team techniques are referenced on the individual threat detail pages linked in the section above.
OWASP LLM Top 10 cross-references
From OWASP Appendix A (canonical inheritance)
Helmwart mechanistic crossover (named in OWASP body text, not in Appendix A)
Recommended mitigations
No single control answers an ASI; it is met by a layered stack. The cards below are ranked by how directly each control counters ASI10: the chips on each card name the threat of this ASI it actually covers, colour-coded by that threat's role.
Counters the core
Cover one or more of this ASI's primary threats — the strongest direct response.
In a multi-agent system, each agent routes decisions based on what its peers report. If a peer's behaviour becomes unreliable or adversarial, agents that keep treating it with full authority will propagate whatever errors or manipulations that peer introduces. Per-agent trust scoring addresses this by maintaining a continuously updated reputation score for every peer, derived from observed behaviour, and using that score to determine how much authority each incoming message carries.
Privileged-access personnel are the human layer behind every agentic system. A person with legitimate administrative credentials can tamper with logs, manipulate approval gates, or extract training data through authorised channels, and no technical control prevents it when the access itself is valid. An insider threat program addresses that gap: it governs who holds operator access, what they agree to, how quickly credentials are revoked on departure, and whether anomalous behaviour is surfaced before damage accumulates.
A single agent's judgment on a high-impact action can be wrong, manipulated, or compromised. Requiring N of M independent peer agents to agree before the action executes means an attacker or a systematic error must affect the quorum majority, not just one agent, before harm results.
An agent that has been compromised, poisoned, or gone rogue will, in most cases, behave differently from its established baseline. Anomaly isolation acts on that difference: when an agent's behaviour score crosses a configured threshold, it is quarantined automatically, credentials revoked, message-queue access cut, in-flight actions aborted. Manual revocation cannot match the speed that cascading multi-agent failures demand.
An agent operates without direct human oversight, autonomously scheduling tool calls, external API requests, and reflection loops. Without a budget, a single triggering event can fan out into hundreds of downstream calls. Per-agent rate limits and quotas assign each agent identity its own ceiling on call rate, token consumption, and cost spend, so a misbehaving or compromised agent cannot exhaust shared resources and its overconsumption becomes a visible, actionable signal.
In a multi-agent system, peer agents are granted authority by the other agents that accept their outputs. A rogue or compromised agent that enters the system inherits that authority immediately. Agent admission control is the registration gate that evaluates a peer's identity, declared capabilities, and binary provenance against policy before granting access. A peer that cannot pass attestation is refused entry and cannot participate in the system.
When an AI agent generates content or proposes an action, users need to know that the source is an AI before they decide to act. Without that signal, users routinely over-trust agent output. AI-source disclosure addresses this by attaching a visible label to every AI-generated item and by requiring explicit confirmation for consequential actions, restoring the critical gap between receipt and acceptance.
Every dataset, document, and external system an agent can reach carries a classification label. The agent's permitted-class set and the tool's permitted-class set are intersected at the moment of every read or write. When the requested data's class falls outside that intersection, access is denied at the seam. This is the data-side complement to least-privilege: it adds a data-sensitivity constraint that role scoping alone does not provide.
An AI agent operating with broad authority can propose actions that are irreversible: deleting records, modifying IAM policies, moving funds. A single human reviewer at the approval gate is a single point of failure, one compromised account, one fatigued reviewer, or one successful social-engineering attempt is enough to commit the action. Human dual-control addresses that by requiring two distinct, independent humans to approve before the action commits.
An inter-agent message travels through channels and intermediate agents the receiver did not originate. If nothing binds the message cryptographically to its source, any intermediate hop can substitute or inject content that the receiving agent will treat as authoritative. Message signing closes that gap: the source agent signs each message payload with its private key, and the receiver verifies the signature against a distributed trust bundle before the content reaches the reasoning layer.
An agent can include links and rich HTML in its output. When that output is attacker-influenced, a clickable link, embedded image, or rich preview card becomes the delivery mechanism for phishing or data exfiltration via markdown image injection. Rendering restriction removes that delivery vector by allowing clickable content only from an explicit allow-list of trusted domains and reducing everything else to plain text before the output reaches the user.
An MCP client connecting to a server has no built-in way to verify that the server at a given address is the expected workload or that its binary has not been replaced. An attacker who can intercept or substitute the server exploits that gap directly. MCP server attestation closes it by requiring the server to present cryptographic proof of two properties before the connection proceeds: that it holds a valid workload identity bound to a trusted certificate, and that its binary matches a signed hash recorded at build time.
An agent that can propose payments, update banking details, or modify production configuration is, by construction, a manipulation surface. If the only thing standing between a proposed change and its execution is the agent's own UI, a successful prompt injection or RAG poisoning attack requires no additional steps. Out-of-band verification breaks that dependency by routing a one-use confirmation code through a channel that is structurally separate from the agent's primary interaction channel, so an attacker who controls the agent's context cannot complete the approval without also compromising the user's registered secondary device.
An agent produces output continuously across multiple channels: user-facing responses, tool-call parameter envelopes, log records, and outbound HTTP requests. Any of those channels can carry sensitive content the agent has retrieved, been fed, or been tricked into including. Output egress DLP places an inspection gate at the boundary so that PII, credentials, and proprietary content are classified and either redacted or quarantined before they leave the trust boundary, regardless of how they got into the output.
An AI agent can produce output that is harmful, deceptive, or factually wrong while still sounding fluent and confident. Output moderation places an independent classifier or moderation model between the agent and its destination, checking every output before it reaches a user or a downstream system. The generating model does not evaluate its own answer; a separate gate does.
An agent's authority is normally bounded only by its own reasoning. If that reasoning is manipulated, or the agent's identity is compromised, it will attempt actions the operator never intended to permit. Policy-bound autonomy addresses this by placing a declarative enforcement point between the agent and every consequential action: a policy engine evaluates the agent identity, the target tool, and the parameter envelope before execution, and the agent cannot reason or argue past the result.
Sources
- OWASP Top 10 for Agentic Applications 2026 (canonical source) ↗ · OWASP Gen AI Security Project · Dec 2025 · CC BY-SA 4.0
- Agentic Top 10 side-by-side explainer ↗ · trydeepteam.com · secondary reference