T48: Model Inconsistency Leading to Variable Approvals

The decision that won’t sit still

A language model making an approval decision is not a calculator. Ask a calculator “is 49.50 under the 50 limit?” and it answers the same way every time. Ask a language model to judge whether an expense claim is “within policy and adequately justified,” and, unless you have explicitly pinned it down, you can get approved one minute and rejected the next, with nothing about the claim having changed. For a chatbot, that variability is the feature that makes it sound human. For an agent that approves expenses, grants access, or releases money, the same variability is a security flaw: the system enforces policy differently each time it looks, and nobody notices it contradicting itself.

That is T48. The cause is not an attacker tampering with the agent’s memory (that is Memory Poisoning) and not a cleverly worded prompt. It is the model’s own non-determinism, the randomness built into how it generates text. To exploit it, an attacker changes nothing. They just ask again.

It helps to understand why identical inputs produce different outputs. At each step, a language model produces a probability distribution over possible next tokens and then samples from it. A setting called temperature controls how much randomness is in that sampling: above zero, the model will sometimes choose a less-likely continuation. That is what gives it fluency, and what makes its judgements unrepeatable. A claim sitting right on a policy boundary is exactly where the model is most uncertain, so its internal distribution is split: perhaps 60% toward “approve,” 40% toward “reject.” Re-running that decision is not re-reading the facts; it is re-rolling a weighted die.
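A toy illustration of that weighted die, under stated assumptions: the two-token "approve"/"reject" vocabulary, the logit values, and the function names are all hypothetical, and a real model samples over a far larger vocabulary. The sketch only shows how temperature reshapes a split distribution and why greedy decoding stops the re-roll.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; a lower temperature sharpens the split."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_decision(logits, temperature, rng):
    """Sample 'approve' or 'reject'; temperature 0 is treated as greedy argmax."""
    labels = ["approve", "reject"]
    if temperature == 0:
        return labels[logits.index(max(logits))]
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(labels, weights=probs, k=1)[0]

# Hypothetical logits that give a 60/40 split at temperature 1.0.
BORDERLINE_LOGITS = [math.log(0.6), math.log(0.4)]

rng = random.Random(0)
sampled = [sample_decision(BORDERLINE_LOGITS, 1.0, rng) for _ in range(1000)]
greedy = [sample_decision(BORDERLINE_LOGITS, 0, rng) for _ in range(1000)]

print("temperature 1.0:", sampled.count("approve"), "approvals out of 1000")
print("temperature 0  :", greedy.count("approve"), "approvals out of 1000")
```

At temperature 1.0 the same input lands on both sides of the decision; with greedy decoding every run picks the more likely label.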

How the failure plays out

The OWASP MAS Guide’s worked example is an RPA agent that processes expense reimbursements. Two identical claims (same receipts, same line items, same submitter) pass through it at different moments. Because inference runs at a non-zero temperature, the first is approved and the second is flagged for manual review. Crucially, no human ever sees the contradiction: each claim is handled independently, each decision is treated as final, and there is no step whose job is to notice that the pipeline just judged the same facts two different ways. A single wrong decision is forgivable. Humans make those too. The real defect is that the system has no mechanism to catch itself being inconsistent.

Why a fairness bug becomes an exploit

Once an approval is effectively a weighted coin-flip, anyone who can resubmit gets free re-rolls. That is what turns an internal quality problem into an attackable control bypass. Suppose a borderline claim is approved roughly 40% of the time. An attacker who simply submits it again and again needs 2.5 attempts on average (1 / 0.4) and, if each attempt is an independent draw, has a better than 90% chance of at least one approval within five tries (1 - 0.6^5 ≈ 0.92). Every rejected attempt costs them nothing. They never forge a receipt or alter a figure; they resample the model until it says yes. The control isn’t being broken so much as worn down by repetition, which is why ordinary fraud checks, all of which look at the content of a single claim, never fire.
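The arithmetic behind the free re-rolls can be checked in a few lines. This is a minimal sketch that assumes every resubmission is an independent draw with the same approval probability; the function names are illustrative.

```python
def prob_approved_within(p_approve, attempts):
    """Chance that at least one of `attempts` independent tries is approved."""
    return 1 - (1 - p_approve) ** attempts

def expected_attempts(p_approve):
    """Mean number of tries until the first approval (geometric distribution)."""
    return 1 / p_approve

P_APPROVE = 0.40  # the borderline approval rate used in the text

print(f"expected attempts: {expected_attempts(P_APPROVE):.1f}")
for n in (1, 2, 3, 5, 10):
    print(f"approved within {n:2d} tries: {prob_approved_within(P_APPROVE, n):.1%}")
```

With the 40% figure, that works out to 2.5 expected attempts, 64% within two tries, and about 92% within five.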

Why autonomy makes it worse

In a human process, inconsistency is partly self-correcting: a reviewer who approved a claim on Monday and meets the same claim on Tuesday raises an eyebrow. Automation removes that safety net. The agent has no memory that it judged these facts before, and no instinct that “I already saw this.” Worse, in a multi-agent pipeline the randomness compounds. If a subordinate extraction agent variably reads the claim and a policy agent then variably judges it, the two sources of noise multiply rather than cancel, widening the band of claims whose outcome is effectively random. T26 is the same root cause in a higher-stakes setting, where the unstable decision drives an irreversible on-chain transaction. There is no “flag for review” once the funds have moved.
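One way to see the compounding is a two-stage toy model. Everything here is an assumption made for illustration: the extraction agent produces one of two readings, the policy agent approves each reading with a fixed probability, and the numbers are invented.

```python
def approve_probability(p_reading_a, p_approve_given_a, p_approve_given_b):
    """Overall approval chance when extraction yields reading A or reading B."""
    return p_reading_a * p_approve_given_a + (1 - p_reading_a) * p_approve_given_b

def disagreement_probability(p_approve):
    """Chance that two independent runs on the same claim reach different decisions."""
    return 2 * p_approve * (1 - p_approve)

# Stable extraction: always reading A, which the policy agent approves 90% of the time.
stable = approve_probability(1.0, 0.9, 0.2)

# Variable extraction: reading A only 70% of the time; reading B is approved 20% of the time.
noisy = approve_probability(0.7, 0.9, 0.2)

print(f"stable extraction: two runs disagree {disagreement_probability(stable):.0%} of the time")
print(f"noisy extraction : two runs disagree {disagreement_probability(noisy):.0%} of the time")
```

With these made-up numbers, the chance that two runs disagree rises from 18% to about 43% once the extraction step also varies.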

How you would notice

Nothing about any individual claim here is fraudulent, so the signal lives in repetition and variance, not in content. The practical approach is to fingerprint each claim at ingestion (a SHA-256 hash of its line items and merchant identifiers) and then watch the decisions attached to that fingerprint:

The same claim hash recurring above a threshold within a rolling 24-hour window. This is the footprint of someone re-rolling the dice.
A single claim hash that has been both approved and rejected inside one 7-day window. This decision-variance alert directly catches the model contradicting itself.
Low model-confidence approvals: if the pipeline exposes a confidence score, an approval issued below a confidence floor should be routed to a human rather than actioned.
A per-user approval rate climbing above its 30-day baseline without a matching rise in claims submitted. This is the statistical shape of a resubmission strategy.

The unifying idea: log a canonical input hash and the decision for every call, so “the same input produced two different answers” becomes a query you can alert on rather than an invisible event.
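The fingerprint and the first two signals can be sketched as follows. The field names, the in-memory `DecisionLog` class, and the decision labels are assumptions for illustration; a real pipeline would keep this in a database and choose its own canonical form. Amounts are stored as strings so that number formatting cannot change the hash.

```python
import hashlib
import json
from collections import defaultdict
from datetime import datetime, timedelta

def claim_fingerprint(line_items, merchant_ids):
    """SHA-256 over a canonical JSON form, so key order and item order cannot change the hash."""
    canonical = json.dumps(
        {
            "line_items": sorted(json.dumps(item, sort_keys=True) for item in line_items),
            "merchant_ids": sorted(merchant_ids),
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

class DecisionLog:
    """In-memory stand-in for a decision log keyed by claim fingerprint."""

    def __init__(self):
        self._events = defaultdict(list)  # fingerprint -> [(timestamp, decision)]

    def record(self, fingerprint, decision, when):
        self._events[fingerprint].append((when, decision))

    def resubmissions(self, fingerprint, now, window=timedelta(hours=24)):
        """How many times this fingerprint was seen inside the rolling window."""
        return sum(1 for ts, _ in self._events[fingerprint] if now - ts <= window)

    def has_decision_variance(self, fingerprint, now, window=timedelta(days=7)):
        """True if the same fingerprint was both approved and rejected in the window."""
        recent = {d for ts, d in self._events[fingerprint] if now - ts <= window}
        return {"approved", "rejected"} <= recent

now = datetime(2025, 1, 10, 12, 0)
fp = claim_fingerprint([{"desc": "taxi", "amount": "49.50"}], ["M-001"])
log = DecisionLog()
log.record(fp, "rejected", now - timedelta(hours=3))
log.record(fp, "approved", now - timedelta(hours=1))
print(log.resubmissions(fp, now), log.has_decision_variance(fp, now))
```

Here the same fingerprint appears twice in 24 hours with opposite outcomes, so both the recurrence count and the variance check have something to report.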

How to actually fix it

The fixes fall straight out of the cause, and they layer. This is Defence-in-Depth applied to a probabilistic component:

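Make the decision deterministic. Set temperature to zero on any model call that drives an approval, and reserve stochastic settings for tasks where variety is actually wanted. This removes the re-roll at the source: with sampling switched off, the same input yields the same output in all but rare cases, so resubmission gains the attacker almost nothing. Some serving stacks can still vary slightly at temperature zero, for example through batching or floating-point effects, which is one reason the remaining layers matter. It is the highest-value change and it costs nothing.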
Never let the model be its own final judge. Even pinned to temperature zero a model can be wrong, and a probabilistic one should certainly not be the last gate. Put a rule-based pre-filter in front of it (reject clearly out-of-policy claims before the model ever sees them) and a validation step behind it (re-check the model’s “approve” against hard policy limits before any money moves). The deterministic layers, not the model, hold the line.
Remove the attacker’s free re-rolls. Rate-limit resubmissions and act on the detection signals above: a user who submits the same claim hash N times in a window triggers a fraud review rather than another inference. Even if some non-determinism survives, the attacker can no longer cheaply sample it.
Make it observable. Log every decision with its input hash and confidence score. Without this, none of the detection above is even possible.

Read together: temperature zero removes most of the randomness at its source, the deterministic guardrails catch whatever residue remains, rate-limiting strips the attacker’s leverage, and logging makes the whole loop visible.
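The layering can be sketched as a single decision function. This is a minimal sketch, not a reference implementation: the `Claim` fields, the limits, and the `model_judgement` stub are hypothetical, and the stub deliberately approves everything to show that the deterministic gates hold the line.

```python
from dataclasses import dataclass

POLICY_LIMIT = 50.00   # hypothetical hard limit per claim
MAX_SUBMISSIONS = 3    # hypothetical cap on identical submissions

@dataclass
class Claim:
    fingerprint: str
    amount: float
    has_receipt: bool

def model_judgement(claim):
    """Placeholder for the model call, which would run at temperature zero."""
    return "approve"

def decide(claim, submission_counts, audit_log, judge=model_judgement):
    """Run a claim through the layers in order and return the final decision."""
    count = submission_counts.get(claim.fingerprint, 0) + 1
    submission_counts[claim.fingerprint] = count

    if count > MAX_SUBMISSIONS:
        # Repeats go to fraud review, not to another inference.
        decision, reason = "fraud_review", "resubmission limit exceeded"
    elif not claim.has_receipt:
        # Rule-based pre-filter: the model never sees this claim.
        decision, reason = "reject", "missing receipt"
    else:
        verdict = judge(claim)
        if verdict == "approve" and claim.amount > POLICY_LIMIT:
            # Post-validation: a model approval cannot override a hard limit.
            decision, reason = "manual_review", "approval exceeds hard limit"
        elif verdict == "approve":
            decision, reason = "approve", "within policy"
        else:
            decision, reason = "manual_review", "model did not approve"

    # Observability: every decision is logged with its input hash.
    audit_log.append((claim.fingerprint, decision, reason))
    return decision

counts, audit = {}, []
print(decide(Claim("a1", 49.50, True), counts, audit))
print(decide(Claim("b2", 80.00, True), counts, audit))
```

Even with a model stub that says yes to everything, the over-limit claim is held for review and a fourth identical submission is diverted to fraud review.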

Where it sits in the catalogue

T48 is the single-agent, decision-surface case of model instability. It extends T5 (Cascading Hallucination Attacks): T5 is about unstable outputs propagating between agents, whereas T48 is about one agent’s unstable output enforcing policy inconsistently. T26 is the same instability where the decision is an irreversible blockchain action.

OWASP Top 10 for Agentic Applications 2026

The Agentic Top 10 (ASI01 through ASI10) is a separate practitioner-facing publication that maps onto the master Threats & Mitigations threat numbering. T48 is covered by the following Top 10 entries:

ASI09 Human-Agent Trust Exploitation (primary)

Adversaries exploit the tendency of humans to trust fluent, authoritative-sounding agents: an agent presents plausible justification for a harmful action, the human approves it, and the resulting audit trail reads as deliberate human authorisation. The attack surface is the review step itself: human-in-the-loop oversight becomes the vector when reviewers lack the context, time, or authority to challenge what the agent recommends.

OWASP LLM Top 10: LLM01:2025, LLM05:2025, LLM06:2025, LLM09:2025

ASI01 Agent Goal Hijack (related)

An attacker manipulates an agent's objective, task selection, or decision pathway (via injected prompts, deceptive tool outputs, forged peer messages, or poisoned retrieval data) so that the agent pursues the attacker's goal rather than the operator's. Unlike a single-turn injection, the harm compounds across many authorised steps before any drift is visible.

OWASP LLM Top 10: LLM01:2025, LLM06:2025

Source: OWASP Top 10 for Agentic Applications 2026 (Dec 2025) · the Top 10 is a compass into the master Threats & Mitigations taxonomy, not a replacement for it.

Design principles at stake

When T48 is present, these security design principles are the ones being violated or tested; the mitigations below are how you restore them.

Defence-in-Depth: The failing layer here is the model itself: its output is non-deterministic, so it can never be its own safeguard. Depth means routing every approval through deterministic gates the model's instability can't slip past: a fixed confidence threshold that fails closed, out-of-band verification for high-value claims, and a resubmission limit on identical claims. Each is an independent layer, so a single inconsistent inference can't on its own produce an unauthorised approval.

Recommended mitigations

Auto-generated from the mitigation catalogue: every mitigation whose coverage map includes T48, sorted by maturity tier (Tier 1 production-canonical first, then Tier 2, then Tier 3 research-stage).

Tier 2 Cross-system audit (Cross-system scope auditing: continuous permission reconciliation)

An agent that operates across HR, Finance, cloud, and SaaS systems accumulates permissions at each boundary, often without any single team seeing the combined picture. Privilege accumulates silently across those boundaries until a quarterly review finds it, by which point a compromised or misconfigured agent has had weeks of unchecked reach. Cross-system scope auditing prevents that by continuously reconciling the agent's actual entitlements against a declared baseline across every system it touches and raising a ticket the moment drift is detected.
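At its core, the reconciliation step is a per-system set comparison. A minimal sketch, assuming entitlements can be exported as sets of permission strings per system; the system names and permission strings are invented for the example.

```python
def reconcile(baseline, actual):
    """Compare declared and observed entitlements per system; return only the drift."""
    drift = {}
    for system in sorted(set(baseline) | set(actual)):
        declared = baseline.get(system, set())
        observed = actual.get(system, set())
        unexpected = observed - declared  # held but never declared
        missing = declared - observed     # declared but not currently held
        if unexpected or missing:
            drift[system] = {
                "unexpected": sorted(unexpected),
                "missing": sorted(missing),
            }
    return drift

# Hypothetical declared baseline and observed entitlements for one agent.
BASELINE = {
    "finance": {"claims:read", "claims:approve"},
    "hr": {"employees:read"},
}
ACTUAL = {
    "finance": {"claims:read", "claims:approve", "payments:release"},
    "hr": {"employees:read"},
}

for system, delta in reconcile(BASELINE, ACTUAL).items():
    print(f"DRIFT in {system}: {delta}")
```

Run on a schedule, a non-empty result is what would raise the ticket; here only the undeclared `payments:release` entitlement is reported.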

Why it helps: Model inconsistency across agent instances can silently grant inconsistent access decisions; cross-system audit reconciles what each agent instance actually accessed against the declared permission baseline, surfacing inconsistencies caused by divergent model outputs before they accumulate into a policy violation.

Tier 2 Fail-closed (Fail-closed gate: refuse rather than act on uncertain output)

An agent that is uncertain about what to do next faces a choice: refuse and ask for clarification, or proceed on its best guess. In low-stakes situations that tradeoff is tolerable. In agentic systems that write, delete, or send, a confident-sounding but wrong output can commit an irreversible action. A fail-closed gate resolves that choice structurally: below a configured confidence threshold, the agent stops and escalates rather than guessing.
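A fail-closed gate can be very small. The sketch below assumes the pipeline exposes a confidence score for each proposal and, as a further assumption, that several proposals for the same action may be compared; the thresholds and the `Proposal` type are illustrative, not taken from any particular framework.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Proposal:
    decision: str      # "approve" or "reject"
    confidence: float  # 0.0 to 1.0, as reported by the pipeline

def fail_closed_gate(proposals, confidence_floor=0.85, max_confidence_gap=0.10):
    """Return the agreed decision, or 'escalate' whenever anything is uncertain."""
    if not proposals:
        return "escalate"  # nothing to act on: refuse

    decisions = {p.decision for p in proposals}
    confidences = [p.confidence for p in proposals]

    if len(decisions) != 1:
        return "escalate"  # proposals contradict each other
    if min(confidences) < confidence_floor:
        return "escalate"  # at least one proposal is below the floor
    if max(confidences) - min(confidences) > max_confidence_gap:
        return "escalate"  # proposals agree, but with very uneven confidence
    return decisions.pop()

print(fail_closed_gate([Proposal("approve", 0.95), Proposal("approve", 0.90)]))
print(fail_closed_gate([Proposal("approve", 0.95), Proposal("reject", 0.95)]))
```

Every path that is not a clear, confident agreement returns "escalate", so the default outcome of uncertainty is a refusal rather than a guess.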

Why it helps: Model inconsistency across agent instances produces conflicting proposals for the same action. A fail-closed gate that requires consensus above a threshold refuses to commit any action when the confidence gap between instance proposals exceeds the configured tolerance.

Tier 2 OOB verify (Out-of-band verification: independent-channel confirmation for irreversible agent actions)

An agent that can propose payments, update banking details, or modify production configuration is, by construction, a manipulation surface. If the only thing standing between a proposed change and its execution is the agent's own UI, a successful prompt injection or RAG poisoning attack requires no additional steps. Out-of-band verification breaks that dependency by routing a one-use confirmation code through a channel that is structurally separate from the agent's primary interaction channel, so an attacker who controls the agent's context cannot complete the approval without also compromising the user's registered secondary device.

Why it helps: Model inconsistency across agent instances can produce contradictory proposals; OOB verification surfaces the inconsistency to a human who can reconcile before committing to an irreversible action.
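The one-use confirmation code can be sketched with the standard library. This is a simplified, in-memory illustration: the class name, the six-digit format, and the five-minute lifetime are assumptions, and delivery over the secondary channel is left to the caller.

```python
import hmac
import secrets
import time

class OutOfBandVerifier:
    """Issues one-use confirmation codes for pending actions (in-memory sketch)."""

    def __init__(self, ttl_seconds=300, clock=time.monotonic):
        self._pending = {}  # action_id -> (code, expires_at)
        self._ttl = ttl_seconds
        self._clock = clock

    def issue(self, action_id):
        """Create a fresh code; the caller sends it over the separate channel."""
        code = f"{secrets.randbelow(10**6):06d}"
        self._pending[action_id] = (code, self._clock() + self._ttl)
        return code

    def confirm(self, action_id, submitted_code):
        """Accept a code at most once; any attempt consumes the pending entry."""
        entry = self._pending.pop(action_id, None)
        if entry is None:
            return False
        code, expires_at = entry
        if self._clock() > expires_at:
            return False
        # Constant-time comparison avoids leaking how many digits matched.
        return hmac.compare_digest(code.encode("utf-8"), submitted_code.encode("utf-8"))

verifier = OutOfBandVerifier()
code = verifier.issue("payment-001")
print(verifier.confirm("payment-001", code))  # first use of the correct code
print(verifier.confirm("payment-001", code))  # replay of the same code
```

Because `confirm` removes the entry before checking it, a wrong guess also burns the code, which trades some convenience for resistance to guessing.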

Red-team pivot: MITRE ATLAS techniques

MITRE ATLAS catalogues adversary techniques against AI systems. Where this OWASP threat has an attacker-perspective counterpart, the ATLAS technique is shown below. That is what a red team would actually be doing on the wire. Use this for detection-signal anchoring, threat-hunting hypotheses, and IR runbooks. Source: mitre-atlas/atlas-data v5.6.0.

AML.T0031 Erode AI Model Integrity

Adversary degrades model output quality over time so users lose confidence or downstream consumers act on incorrect predictions.

AML.T0065 LLM Prompt Crafting

Adversary engineers prompt content to maximise the model's likelihood of taking a specific attacker-favourable action. This is the precursor to most prompt-based attacks.

References

OWASP MAS Threat Modelling Guide v1.0 (April 2025) §3 RPA Expense Reimbursement Agent — Layer 1 Foundation Models. Originally published as T16 in that guide; renumbered T48 in the Helmwart catalogue to preserve alignment with OWASP Agentic AI v1.1 IDs.
