
EVIDENCE TRAIL

Plan-vs-goal validation

Verbatim excerpts from the upstream sources cited on the mitigation page, with what each source does and does not prove. The phrases "planning validation frameworks" and "another model check … flag significant goal deviations" appear verbatim in OWASP Agentic AI Threats & Mitigations v1.1 §T6. Note: the MDX file cites NIST MEASURE-2.7 for "independent verification of agent decisions", but that sub-control covers security and resilience evaluation, not goal-alignment verification; the closer anchor is MS-4.2-003.

Last cross-checked against upstream sources: · 7 sources

References

Each entry shows what the source supports and what it does not prove.

Reference 1
v1.1 · published December 2025

OWASP Agentic AI — Threats & Mitigations v1.1

§T6 Intent Breaking & Goal Manipulation — Mitigation column (threat catalogue table)

"Implement planning validation frameworks, boundary management for reflection processes, and dynamic protection mechanisms for goal alignment. Deploy AI behavioral auditing by having another model check the agent and flag significant goal deviations that could indicate manipulation."

Supports: Verbatim upstream statement of the control. Names "planning validation frameworks" and "another model check … flag significant goal deviations" — the exact independent-validator pattern this mitigation operationalises.

Does not prove: Describes the control class, not an implementation schema. Does not specify how the validator is prompted, what model family it should use, or the ALLOW/REJECT/ESCALATE output contract.
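
Because the upstream text leaves the output contract unspecified, the following is only a minimal sketch of how an ALLOW/REJECT/ESCALATE contract could be parsed. The names `Verdict`, `ValidationResult`, and `parse_validator_output`, and the `VERDICT: reason` wire format, are hypothetical and chosen for illustration. Mapping unparseable output to REJECT is an assumption in keeping with a fail-closed stance, not something OWASP prescribes.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    ALLOW = "ALLOW"
    REJECT = "REJECT"
    ESCALATE = "ESCALATE"


@dataclass(frozen=True)
class ValidationResult:
    verdict: Verdict
    reason: str


def parse_validator_output(raw: str) -> ValidationResult:
    """Parse hypothetical 'VERDICT: reason' text from the validator model."""
    head, _, reason = raw.strip().partition(":")
    try:
        verdict = Verdict(head.strip().upper())
    except ValueError:
        # Assumption: anything outside the three-value contract fails closed.
        return ValidationResult(Verdict.REJECT, f"unparseable validator output: {raw!r}")
    return ValidationResult(verdict, reason.strip())
```

Keeping the contract to three enumerated values means a malformed or chatty validator reply cannot be mistaken for approval.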

Reference 2
v1.1 · published December 2025

OWASP Agentic AI — Threats & Mitigations v1.1

§Playbook 1: Preventing AI Agent Reasoning Manipulation — Step 2: Prevent AI agent Goal Manipulation (Reactive)

"Use goal consistency validation to detect and block unintended AI behavioral shifts."

Supports: Names "goal consistency validation" as the reactive lever against intent breaking, confirming the cross-check-each-step-against-the-goal pattern.

Does not prove: One bullet inside a larger playbook. Does not cover the independent validator architecture (separate model family, different temperature, separate context window).

Reference 3
Version 2026 · published December 2025

OWASP Top 10 for Agentic Applications 2026

§ASI01 Agent Goal Hijack — Prevention and Mitigation Guideline 4

"At run time, validate both user intent and agent intent before executing goal-changing or high-impact actions. Require confirmation — via human approval, policy engine, or platform guardrails — whenever the agent proposes actions that deviate from the original task or scope. Pause or block execution on any unexpected goal shift, surface the deviation for review, and record it for audit."

Supports: Establishes run-time intent validation against the original task as the primary defence for goal-hijack scenarios. The "pause or block on unexpected goal shift" wording matches this control's fail-closed stance on goal-divergent steps.

Does not prove: Frames the check as a confirmation gate (human approval or policy engine), not specifically as an independent LLM-judge. The independence requirement is a Helmwart implementation choice on top of this upstream guidance.
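
The "pause or block, surface, and record" sequence quoted above could be sketched as a gate in front of every step. This is an illustration under stated assumptions, not the control's actual implementation: `gate_step`, the string verdicts, and the shape of the audit record are hypothetical, and treating a validator error as a block is an assumed fail-closed choice.

```python
from datetime import datetime, timezone
from typing import Any, Callable, Optional


def gate_step(
    goal: str,
    step: str,
    validate: Callable[[str, str], str],
    execute: Callable[[str], Any],
    audit_log: list,
) -> Optional[Any]:
    """Run `step` only if the validator returns ALLOW; record every decision."""
    try:
        verdict = validate(goal, step)
    except Exception as exc:
        # Assumption: a validator outage is treated as a block, never a pass.
        verdict = f"REJECT (validator error: {exc})"
    audit_log.append(
        {
            "time": datetime.now(timezone.utc).isoformat(),
            "goal": goal,
            "step": step,
            "verdict": verdict,
        }
    )
    if verdict == "ALLOW":
        return execute(step)
    return None  # blocked: the step is recorded for review but not executed
```

Writing the audit record before execution means a blocked step still leaves a trace for later review.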

Reference 4
ATLAS catalogue · created 2025-10-29, modified 2025-12-23

MITRE ATLAS AML.M0029 — Human In-the-Loop for AI Agent Actions

AML.M0029 description (verbatim from ATLAS.yaml, mitre-atlas/atlas-data)

"Systems should require the user or another human stakeholder to approve AI agent actions before the agent takes them. The human approver may be technical staff or business unit SMEs depending on the use case. Separate tools, such as dedicated audit agents, may assist human approval, but final adjudication should be conducted by a human decision-maker."

Supports: Defines the escalation path that plan validation's ESCALATE verdict routes to. "Dedicated audit agents" names exactly the validator-as-audit-agent pattern this control uses.

Does not prove: Does not describe when to escalate (i.e., does not define the goal-divergence signal). The trigger logic is this control's contribution; ATLAS names only the destination.
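
One way to picture the hand-off from validator to human approver is the sketch below. It assumes a hypothetical `adjudicate` function and an `ask_human` callback standing in for whatever approval channel a deployment uses; neither comes from ATLAS. The only property taken from the quoted text is that the audit agent assists while the human decision is final.

```python
from typing import Callable


def adjudicate(verdict: str, step: str, ask_human: Callable[[str], bool]) -> bool:
    """Return True if `step` may run. ESCALATE defers to a human approver."""
    if verdict == "ALLOW":
        return True
    if verdict == "ESCALATE":
        # The validator only assists; the human's answer is the final decision.
        return bool(ask_human(step))
    return False  # REJECT or any unknown verdict: do not run (assumed fail-closed)
```

In this sketch a REJECT verdict is never sent to the human, which is a design assumption; a deployment could equally let reviewers overturn rejections.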

Reference 5
ATLAS catalogue (continuously updated)

MITRE ATLAS AML.M0022 — Generative AI Model Alignment

AML.M0022 description (verbatim from ATLAS.yaml, mitre-atlas/atlas-data)

"When training or fine-tuning a generative AI model it is important to utilize techniques that improve model alignment with safety, security, and content policies. The fine-tuning process can potentially remove built-in safety mechanisms in a generative AI model, but utilizing techniques such as Supervised Fine-Tuning, Reinforcement Learning from Human Feedback or AI Feedback, and Targeted Safety Context Distillation can improve the safety and alignment of the model."

Supports: Names model alignment as a mitigation class. Plan validation is the run-time guard; model alignment is its training-time complement.

Does not prove: Training-time control only. Does not address run-time per-step validation or independent validator architecture. Covers a different layer of the stack.

Reference 6
arXiv:2303.17651 · published March 2023

Madaan et al. 2023 — Self-Refine: Iterative Refinement with Self-Feedback

Abstract (verbatim)

"Like humans, large language models (LLMs) do not always generate the best output on their first try. Motivated by how humans refine their written text, we introduce Self-Refine, an approach for improving initial outputs from LLMs through iterative feedback and refinement. The main idea is to generate an initial output using an LLMs; then, the same LLMs provides feedback for its output and uses it to refine itself, iteratively. Self-Refine does not require any supervised training data, additional training, or reinforcement learning, and instead uses a single LLM as the generator, refiner, and feedback provider."

Supports: Establishes the academic foundation for LLM-judge feedback loops. Demonstrates that a feedback-and-refine pass improves output quality (~20% absolute improvement on average across 7 tasks). The plan validator applies this pattern specifically to goal-alignment checking.

Does not prove: Self-Refine uses the same model as both generator and critic. This control intentionally requires a different model family or temperature for independence — the single-model limitation in Self-Refine is precisely what the validator-independence requirement addresses.
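
The independence requirement can be expressed as a configuration check run before the validator is wired in. The sketch below is illustrative only: `ModelConfig` and `is_independent` are hypothetical names, and the rule encoded is the literal "different model family or temperature" wording above, not a verified description of how the control enforces it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    family: str         # e.g. a vendor or model-line label (placeholder values)
    temperature: float


def is_independent(generator: ModelConfig, validator: ModelConfig) -> bool:
    """True if the validator differs from the generator in family or temperature."""
    return (
        generator.family != validator.family
        or generator.temperature != validator.temperature
    )
```

A stricter deployment might require a different family outright, since a temperature change alone still shares the generator's weights.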

Reference 7
Published July 2024

NIST AI 600-1 — Generative AI Profile (NIST AI RMF)

MEASURE 4.2 — "Measurement results regarding AI system trustworthiness … are informed by input from domain experts and relevant AI Actors to validate whether the system is performing consistently as intended." Action item MS-4.2-003

"Implement interpretability and explainability methods to evaluate GAI system decisions and verify alignment with intended purpose."

Supports: Names verification of alignment with intended purpose as a NIST measurement action. The independent plan validator is an operationalisation of this alignment-verification requirement in the agentic execution loop.

Does not prove: MEASURE 4.2 is about evaluation methodology, not a run-time control specification. Does not prescribe per-step validation or independent model families. Note: the MDX file cites MEASURE-2.7 ("AI system security and resilience … evaluated and documented") for this control — that sub-control covers security benchmarking and red-teaming, not independent goal-alignment verification. MS-4.2-003 is the closer anchor.