MITIGATION · m-trust-scoring

Per-agent trust scoring — behavioural reputation for inter-agent message acceptance

In a multi-agent system, each agent routes decisions based on what its peers report. If a peer's behaviour becomes unreliable or adversarial, agents that keep treating it with full authority will propagate whatever errors or manipulations that peer introduces. Per-agent trust scoring addresses this by maintaining a continuously updated reputation score for every peer, derived from observed behaviour, and using that score to determine how much authority each incoming message carries.

Last reviewed 2026-05-12 · Status: published

At a glance

MATURITY: Tier 2. Available off-the-shelf or as a documented pattern, but newer or less broadly proven. Expect integration work and some operational nuance.

PLACES ON: edge (restricted to edge kinds: a2a-message)

COVERAGE: 6 threats (T7 · T12 · T13 · T14 · T38 · T47)

TRADE-OFFS: latency low · cost low · UX friction low · dev effort medium

TL;DR

Maintain a per-peer score, updated from observed behaviour, and use it to gate how much authority each peer's messages carry.
Three signals drive the score: does the peer contradict its own prior messages, does it diverge from what other peers report, and were its claims correct when acted on?
A score above the acceptance threshold passes the message through; a falling score triggers cross-source verification; a critically low score routes to human escalation or quarantine.
No off-the-shelf product implements per-agent LLM trust scoring. You compose it from an observability store and a score-update function: EigenTrust for dense interaction graphs, the Beta Reputation System for calibrated uncertainty, or a rolling weighted average for most production teams.

How it behaves

Incoming A2A message from a peer agent.

Look up the peer's trust score: is it above the acceptance threshold?

If yes: accept the message and act on it.

If no: route to cross-source verification, HITL escalation, or quarantine, depending on the score band.

Score update runs after the outcome is observable, not at message-receipt time. A single bad observation nudges the score; a pattern of bad observations drives it below threshold.

What it is

A trust score is a numeric reputation value, maintained per peer and updated from observed behaviour, that determines how much authority an incoming message from that peer carries. In a multi-agent system, agents frequently act on claims made by peers without independently verifying each one. That reliance is efficient, but it creates a vulnerability: a peer that has been compromised, manipulated, or is simply producing unreliable output can propagate errors to every agent that trusts it. Trust scoring formalises the implicit reliance decision and makes it actionable.

Three observation sources drive each peer's score:

Internal consistency: does the peer's current message contradict its own prior messages?
Cross-peer consistency: does this peer's claim agree with what other peers report on the same subject?
Outcome correctness: when this peer's claim was acted on, did the resulting outcome match what the peer asserted?

A score above the acceptance threshold allows the message to proceed normally. A score in the mid range triggers cross-source verification before the message is acted on. A critically low score routes the peer to human escalation or quarantine. Score updates run after outcomes are observable, so the system learns from consequences rather than making real-time judgments about message content alone.
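The three score bands described above can be sketched as a small routing function. This is a minimal illustration, not a reference implementation: the names `Route` and `route_message` are hypothetical, the 0.7 acceptance threshold is a placeholder, and 0.2 is borrowed from the "critically low" figure quoted elsewhere on this page.

```python
from enum import Enum


class Route(Enum):
    ACCEPT = "accept"
    VERIFY = "cross_source_verification"
    ESCALATE = "hitl_escalation_or_quarantine"


# Assumed band edges. 0.7 is a placeholder acceptance threshold, not a
# recommendation; 0.2 mirrors the "critically low" figure used on this page.
ACCEPT_THRESHOLD = 0.7
CRITICAL_THRESHOLD = 0.2


def route_message(score: float) -> Route:
    """Map a peer's current trust score to a handling path."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    if score >= ACCEPT_THRESHOLD:
        return Route.ACCEPT      # message proceeds normally
    if score < CRITICAL_THRESHOLD:
        return Route.ESCALATE    # human escalation or quarantine
    return Route.VERIFY          # mid band: verify before acting
```

The lookup happens at message receipt, but the score it reads was last written when an earlier outcome became observable, so routing and scoring stay decoupled.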

Detection signals

Per-peer trust score time series. A declining score for a specific peer indicates anomalous behaviour: self-contradiction, cross-peer divergence, or outcome mismatch.
Trust-score/message-volume ratio. A peer with high message volume and a low or declining score is the highest-priority investigation target, as its influence on downstream decisions is proportionally greater.
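One way to turn these two signals into an investigation queue is to weight each peer's message volume by its distrust and recent decline. The sketch below is an assumed heuristic: the `PeerStats` fields and the `investigation_priority` formula are illustrative choices, not something the pattern prescribes.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class PeerStats:
    peer_id: str
    score: float           # current trust score in [0, 1]
    previous_score: float  # score at the start of the observation window
    messages: int          # messages sent during the window


def investigation_priority(p: PeerStats) -> float:
    """Hypothetical heuristic: volume weighted by distrust plus recent decline."""
    decline = max(0.0, p.previous_score - p.score)
    return p.messages * ((1.0 - p.score) + decline)


def rank_peers(peers: Iterable[PeerStats]) -> List[PeerStats]:
    """Highest-priority investigation targets first."""
    return sorted(peers, key=investigation_priority, reverse=True)
```

A high-volume peer with a stable high score still ranks above a nearly silent low-score peer here, which matches the point that influence scales with volume.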

Threats it covers

T7 Misaligned and Deceptive Behaviors −1 severity step

WHY IT HELPS Misaligned or deceptive agent behaviour produces observable inconsistencies: the peer contradicts its own prior messages, diverges from what other peers report, or its claims do not match outcomes when acted on. These signals drive the trust score down, automatically narrowing the scope the peer can influence before an operator is alerted.
T12 Agent Communication Poisoning −1 severity step

WHY IT HELPS Agent Communication Poisoning introduces false or manipulated content into the message stream between agents. A poisoned peer's messages will diverge from cross-peer consistency checks and produce outcome-correctness failures, both of which are observable signals that drive its trust score down before the manipulation can propagate further.
T13 Rogue Agents in Multi-Agent Systems −1 severity step

WHY IT HELPS A rogue agent pursuing goals outside its declared role will exhibit behavioural inconsistency across the three observation dimensions. Trust scoring surfaces that inconsistency as a score decline, triggering verification and escalation before the rogue agent's influence reaches downstream actions.
T14 Human Attacks on Multi-Agent Systems −1 severity step

WHY IT HELPS Cross-agent approval forgery and identity impersonation generate output that is inconsistent with what the legitimate peer would produce, creating a measurable cross-peer consistency gap. The impersonating agent's score declines on that gap, limiting its delegation authority until attestation is re-verified.
T38 Emergent Collusion on Blockchain −1 severity step

WHY IT HELPS Emergent collusion produces a cluster of peers whose scores are mutually consistent but whose claims diverge from ground-truth outcomes. That divergence is the detection signal; trust scoring is the monitoring layer that makes it observable as a correlated score pattern rather than isolated per-peer noise.
T47 Rogue MCP Server in Ecosystem −1 severity step

WHY IT HELPS A rogue MCP server injecting manipulated tool responses produces output that diverges from trusted peers or known-good baselines. That consistency gap drives the server's trust score down, and score decay is the automated response that reduces its influence until attestation is re-verified.

Principle coverage

Defence-in-Depth stage: Detect. It advances:

Zero Trust: no identity is inherently trusted beyond what it currently proves. Trust scoring extends that requirement to the behavioural dimension: a peer that holds a valid credential but whose observed output is inconsistent or unreliable is treated with reduced authority, so identity verification and behavioural reputation are both required before full trust is extended.
Continuous Verification: trust in a peer must be re-established at regular intervals rather than assumed from prior access. Trust scoring operationalises that requirement at the message layer: each verifiable outcome is a fresh data point that updates the peer's score, so the authority a peer carries is continuously recalculated from its recent record rather than granted once and held.

Design & governance principles (open design, economy of mechanism, accountability, …) are architectural, not advanced by a single placed control.

Implementation options

No managed service implements per-agent LLM trust scoring as a drop-in product. Every option below is a peer-reviewed algorithm you adapt, an adjacent infrastructure component, or a self-build pattern. All are verified against primary sources.

EigenTrust: computes a global trust vector across all peers as the left principal eigenvector of the normalised pairwise local-trust matrix. Each peer's local trust value is derived from the ratio of satisfactory to total interactions with each counterpart. Designed for P2P file-sharing networks; the update principle adapts directly to agent-fleet scoring.

Why choose it: Best when you have a dense interaction graph and want a transitive, network-wide trust signal: a peer trusted by other trusted peers receives higher weight than one trusted only by low-trust peers. Not suitable for sparse graphs or small fleets where the trust matrix is mostly zero; the power iteration converges reliably only when the normalised trust matrix is aperiodic and strongly connected. Implementation is self-build: store pairwise satisfaction vectors per agent pair, normalise them, and iterate. No Python library wraps EigenTrust for agentic use cases as of mid-2026.
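A minimal pure-Python sketch of the normalise-and-iterate loop follows. It assumes local trust values are already non-negative numbers (for example, the share of satisfactory interactions), and it mixes in a pre-trusted distribution on every step, as the original EigenTrust paper does. The function name, the `damping` value of 0.15, and the stopping tolerance are illustrative assumptions.

```python
from typing import List, Sequence


def eigentrust(
    local: Sequence[Sequence[float]],
    pretrusted: Sequence[float],
    damping: float = 0.15,   # assumed weight given to the pre-trusted set
    tol: float = 1e-9,
    max_iter: int = 1000,
) -> List[float]:
    """Global trust vector from local[i][j], peer i's local trust in peer j."""
    n = len(local)
    # Row-normalise local trust; self-trust is ignored. A peer with no
    # opinions defers to the pre-trusted distribution.
    c = []
    for i, row in enumerate(local):
        cleaned = [0.0 if j == i else max(v, 0.0) for j, v in enumerate(row)]
        total = sum(cleaned)
        c.append([v / total for v in cleaned] if total > 0 else list(pretrusted))

    t = list(pretrusted)
    for _ in range(max_iter):
        # t_next[j] = (1 - a) * sum_i c[i][j] * t[i] + a * p[j]
        nxt = [
            (1 - damping) * sum(c[i][j] * t[i] for i in range(n))
            + damping * pretrusted[j]
            for j in range(n)
        ]
        if sum(abs(a - b) for a, b in zip(nxt, t)) < tol:
            return nxt
        t = nxt
    return t
```

With three peers where two rate each other highly and both rate the third poorly, the third peer ends up with the smallest share of global trust even though it rates the others well. The pre-trusted mixing term also keeps the iteration convergent when the raw graph is not strongly connected.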

Beta Reputation System: models each peer's reliability as a beta distribution parameterised by (positive interactions + 1, negative interactions + 1). The score is the expected value, alpha / (alpha + beta). Each new observation is a Bayesian update: increment alpha for a positive outcome, beta for a negative one.

Why choose it: Best when you want well-calibrated uncertainty alongside the score. A peer with 3 positive and 0 negative interactions should be treated differently from one with 300 positive and 0 negative; the beta distribution captures that distinction naturally. Simpler to implement than EigenTrust and correct on sparse graphs. Weakness: purely local, it does not propagate trust transitively from other peers' observations. Implementation is a per-peer (alpha, beta) counter pair updated after each verifiable outcome.
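The counter pair and its expected value fit in a few lines. The sketch below is illustrative: the class name `BetaReputation` is hypothetical, and the `variance` property (the standard beta-distribution variance) is added to show how the 3-versus-300 distinction becomes measurable.

```python
from dataclasses import dataclass


@dataclass
class BetaReputation:
    alpha: float = 1.0  # positive outcomes + 1
    beta: float = 1.0   # negative outcomes + 1

    def update(self, positive: bool) -> None:
        """Bayesian update after one verifiable outcome."""
        if positive:
            self.alpha += 1
        else:
            self.beta += 1

    @property
    def score(self) -> float:
        """Expected value of the beta distribution."""
        return self.alpha / (self.alpha + self.beta)

    @property
    def variance(self) -> float:
        """Uncertainty around the score; shrinks as evidence accumulates."""
        n = self.alpha + self.beta
        return (self.alpha * self.beta) / (n * n * (n + 1))


few, many = BetaReputation(), BetaReputation()
for _ in range(3):
    few.update(True)
for _ in range(300):
    many.update(True)
```

Here `few.score` is 0.8 and `many.score` is about 0.997, but the more useful difference is the variance, which is several orders of magnitude smaller for the peer with 300 observations. A gating policy could require both a high score and a low variance before accepting without verification.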

More details:

Jøsang & Ismail 2002, The Beta Reputation System

Rolling average service: a lightweight sidecar service that owns a per-peer score table. On each verifiable outcome it applies a weighted update: score = score * (1 - alpha) + alpha * observation, where alpha is the learning rate (typically 0.05-0.1). An exponential-decay term ages old observations so a peer can recover from a single bad period. Exposes GET /trust/{peer_id} returning score, confidence, and last_updated.

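Why choose it: Best starting point for most teams. Requires no matrix operations or distribution fitting, just a key-value store and a simple update formula. The decay parameter gives an intuitive forgiveness adjustment. Weakness: no transitivity, and the learning rate must be tuned against your interaction volume. With alpha = 0.05, the weight of the starting score halves after about 14 verifiable outcomes and drops below 5% after about 59. For a workflow generating 50 peer interactions per hour, that is roughly 14 hours of production traffic before a new peer's score starts to stabilise if only about one interaction per hour yields a verifiable outcome, and a little over an hour if every interaction does.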
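The update formula and the tuning arithmetic can be sketched with an in-memory table. Everything here is an assumption for illustration: the class and function names, the neutral starting score of 0.5, and in particular the `confidence` definition (the share of the score that comes from observations rather than the starting value), which the pattern itself does not specify.

```python
import math
import time
from dataclasses import dataclass, field
from typing import Dict

NEUTRAL = 0.5  # assumed cold-start score


@dataclass
class PeerRecord:
    score: float = NEUTRAL
    observations: int = 0
    last_updated: float = field(default_factory=time.time)


class TrustTable:
    def __init__(self, alpha: float = 0.05) -> None:
        self.alpha = alpha
        self._peers: Dict[str, PeerRecord] = {}

    def record_outcome(self, peer_id: str, observation: float) -> float:
        """Apply score = score * (1 - alpha) + alpha * observation."""
        rec = self._peers.setdefault(peer_id, PeerRecord())
        rec.score = rec.score * (1 - self.alpha) + self.alpha * observation
        rec.observations += 1
        rec.last_updated = time.time()
        return rec.score

    def get(self, peer_id: str) -> dict:
        """Payload shaped like the GET /trust/{peer_id} response."""
        rec = self._peers.get(peer_id, PeerRecord())
        # Hypothetical confidence: 1 minus the weight still held by NEUTRAL.
        confidence = 1 - (1 - self.alpha) ** rec.observations
        return {"score": rec.score, "confidence": confidence,
                "last_updated": rec.last_updated}


def observations_until(residual: float, alpha: float) -> int:
    """Updates needed before the starting score's weight (1 - alpha)^n <= residual."""
    return math.ceil(math.log(residual) / math.log(1 - alpha))
```

`observations_until(0.5, 0.05)` gives 14 and `observations_until(0.05, 0.05)` gives 59. Dividing those counts by the number of verifiable outcomes per hour, not the raw interaction rate, gives the stabilisation time for a given workflow.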

More details:

OWASP Agentic AI v1.1 Playbook 6, Deploy agent trust scoring

LangSmith evaluation store: LangSmith's trace and evaluation API records per-run feedback scores, which can be queried by agent ID or chain name to produce a rolling reliability signal. It does not compute a trust score natively; you use its feedback endpoint and aggregation API as the observation data store, with a thin score-computation layer on top.

Why choose it: Best when your pipeline already uses LangChain or LangSmith for observability and you want to reuse the same trace store rather than running a separate reputation service. Limitation: LangSmith is a managed SaaS with data-residency and cost implications; the trust-score computation remains your code.

BGP / email reputation analogy: per-peer reputation scoring is Tier 1 production infrastructure in adjacent domains: BGP route reputation (Cloudflare RPKI), email sender reputation (Talos, Spamhaus), and federation trust in identity (Okta Workforce, Auth0). These systems share the same pattern: per-entity score from observed-vs-claimed behaviour, score-based routing, threshold-triggered re-verification.

Why choose it: Reference this when justifying the pattern to stakeholders. The mechanism is proven at scale in adjacent production systems; the gap is that no vendor has packaged it for LLM agent fleets as of mid-2026. The absence of an off-the-shelf product is an ecosystem gap, not a gap in the underlying concept.

Trade-offs

Score lookup adds one read per incoming A2A message. A constant-time read from a key-value store typically takes under 1 ms. The primary adoption cost is the observation-collection pipeline, not the lookup.
Cold-start: a new peer has no history and must start at a neutral score (0.5 to 0.7 depending on the model). Under the Beta model, a new peer starts at alpha=1, beta=1 (expected value 0.5, high variance). Treat that as verify-by-default rather than accept.
Outcome labelling is the dominant engineering cost, not the score function. For outcome-correctness to drive the score, you need a ground-truth signal from acting on the peer's claim. In workflows where only 1-5% of interactions produce a verifiable outcome, a new peer may need weeks of production traffic before its score stabilises. Design for delayed updates, not real-time.
Collusion resistance: a cluster of coordinated peers can mutually assign high cross-peer consistency scores. EigenTrust partially addresses this via pre-trusted seed nodes; the rolling-average model does not. For high-stakes deployments, add a structural audit layer rather than relying solely on the score.
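Designing for delayed updates usually means recording each claim when the message arrives and touching the score only when a ground-truth outcome turns up. The sketch below shows one possible shape for that ledger. The `OutcomeLedger` class, its method names, and the exact-match comparison are hypothetical simplifications; real outcome labelling is typically richer than string equality.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class PendingClaim:
    peer_id: str
    asserted: str  # what the peer said would be true


class OutcomeLedger:
    """Holds claims at receipt time; tallies change only when an outcome arrives."""

    def __init__(self) -> None:
        self._pending: Dict[str, PendingClaim] = {}
        self.counts: Dict[str, List[int]] = {}  # peer_id -> [positive, negative]

    def record_claim(self, claim_id: str, peer_id: str, asserted: str) -> None:
        # No score change here: the outcome is not yet observable.
        self._pending[claim_id] = PendingClaim(peer_id, asserted)

    def resolve(self, claim_id: str, observed: str) -> Optional[bool]:
        """Label a claim once ground truth is known; None if unknown or already resolved."""
        claim = self._pending.pop(claim_id, None)
        if claim is None:
            return None
        correct = observed == claim.asserted
        tally = self.counts.setdefault(claim.peer_id, [0, 0])
        tally[0 if correct else 1] += 1
        return correct

    def unresolved(self) -> int:
        """Claims still waiting for an outcome; useful for tracking label coverage."""
        return len(self._pending)
```

The positive and negative tallies can feed whichever score function is in use, and the unresolved count gives a direct view of how much of the traffic never produces a verifiable outcome.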

When NOT to use

Small, closed agent fleets where all agents are operated by the same team and share identical trust. Scores will be uniformly high and carry no signal. Use structural controls (SPIFFE mTLS, OPA policy) instead.
Automated quarantine without human review, unless the score is critically low (below 0.2 in the reference model). A score drop driven by a single bad observation should route to escalation, not automatic shutdown.
Agents whose actions are opaque, such as closed SaaS sub-agents or black-box APIs, and that yield no outcome-correctness signal. An unobservable peer holds a perpetually neutral score, which gives false assurance.
As a substitute for cryptographic identity verification. A high trust score does not prove the peer is who it claims to be. Pair with m-message-signing and m-spiffe for the identity layer.

Limitations

Observability dependency: the score is only as good as the outcome signals that drive it. Agents whose actions produce no verifiable outcome cannot accumulate a reliable score.
Collusion: coordinated peers can manufacture consistent cross-peer signals. The score is a behavioural signal, not a cryptographic guarantee.
Recovery latency: once a peer's score drops below threshold, recovery under exponential decay requires many interactions. Build an explicit re-attestation pathway so a peer can re-earn trust faster than the decay function allows.
No standard protocol: there is no cross-platform trust-score interchange format for LLM agents as of mid-2026. Scores are local to the operator's fleet; a peer moving between fleets starts with no history.

Maturity tier reasoning

Tier 2 (real-composable): the underlying algorithms (EigenTrust, Beta Reputation System, rolling average) are Tier 1 mature in adjacent domains. The agentic application is an operational composition of available primitives. The gap keeping this from Tier 1 is the absence of a standardised agentic trust-score protocol and production deployments with published metrics at the LLM-agent layer.
EigenTrust (2003) and the Beta Reputation System (2002) are peer-reviewed, widely cited, and algorithmically settled. Neither has a maintained open-source library targeting LLM agents as of mid-2026.
The self-build rolling-average pattern requires no novel components, only a key-value store and simple arithmetic, and is the recommended starting point for production deployments today.

Last verified against upstream docs: 2026-05-30.

MAESTRO LAYERS

L6 L7

ATLAS TECHNIQUES

AML.T0067 LLM Trusted Output Components Manipulation
Adversary manipulates the structured parts of an LLM response (citations, tool-call arguments, approved-action markup) that downstream systems treat as trusted.
AML.T0080 AI Agent Context Poisoning
Adversary contaminates an agent's context store (short-term scratchpad, vector memory, conversation history) so future reasoning is biased toward attacker goals.

ATLAS MITIGATIONS

AML.M0024 AI Telemetry Logging
Log inputs, outputs, and reasoning steps of deployed AI models so anomalous behaviour can be detected and incidents reconstructed.
AML.M0022 Generative AI Model Alignment
Train or fine-tune the model so its outputs align with intended behaviour; reduces the residual surface of jailbreak / misalignment attacks.

TRADE-OFFS

latency low
cost low
ux friction low
dev effort medium

PLAYBOOKS

3 OWASP v1.1 playbooks recommend this control: