EVIDENCE TRAIL
HITL feedback-loop calibration
Verbatim excerpts from the upstream sources cited on the mitigation page, with what each source does and does not prove. Override capture and pattern analysis are directly named in OWASP v1.1 §T10, and the capture and evaluation of overrides in NIST AI 600-1 MANAGE 4.1; the closed-loop calibration deployment step (prompt updates, threshold tuning) is Helmwart's operationalisation of those requirements.
Note: the MDX independentEvidence field references "NIST AI RMF GOVERN-6 names continuous improvement loops". This is a misattribution. GOVERN 6 covers risks from third-party software, data, and other supply-chain issues, including third-party IP. The continual-improvement sub-control is MANAGE 4.2; the override and feedback mechanism is MANAGE 4.1. This evidence trail cites MANAGE 4.1 directly and does not rely on GOVERN 6.
Last cross-checked against upstream sources: · 8 sources
References
Each entry shows what the source supports and what it does not prove.
OWASP Agentic AI — Threats & Mitigations v1.1
§T10 Overwhelming Human-in-the-Loop — Step 3: Strengthen AI Decision Traceability & Logging (Detective)
"Monitor and log human overrides of AI recommendations, analyzing reviewer patterns for potential bias or AI misalignment."
Supports: Verbatim instruction to log reviewer overrides and analyse their patterns — the capture-and-analyse step this control formalises.
Does not prove: Describes logging and pattern analysis as a detective step; does not specify that those patterns should be fed back into prompt tuning or threshold calibration. The closed-loop deployment step is Helmwart's extension.
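A minimal sketch of the capture-and-analyse step, assuming each human decision on an AI recommendation is logged as a record carrying a reviewer ID and a flag for whether the reviewer overrode the recommendation. The `OverrideEvent` type, the "2x the pooled rate" rule, the "never overrides" rule, and the minimum-sample cut-off are hypothetical illustrations; OWASP §T10 names the analysis but does not prescribe any particular test.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class OverrideEvent:
    # Hypothetical log record: one human decision on one AI recommendation.
    reviewer_id: str
    overridden: bool  # True if the reviewer rejected or changed the recommendation


def override_rates(events):
    """Return {reviewer_id: (override_count, total_decisions)}."""
    counts = defaultdict(lambda: [0, 0])
    for e in events:
        counts[e.reviewer_id][1] += 1
        if e.overridden:
            counts[e.reviewer_id][0] += 1
    return {r: (c[0], c[1]) for r, c in counts.items()}


def flag_reviewers(events, ratio=2.0, min_decisions=5):
    """Flag reviewers whose override pattern departs sharply from the pool.

    A rate far above the pooled rate may point to reviewer bias, or to a class
    of AI misalignment only that reviewer sees; zero overrides across many
    decisions may point to rubber-stamping. Thresholds are illustrative only.
    """
    stats = override_rates(events)
    total = sum(n for _, n in stats.values())
    if total == 0:
        return {}
    pooled = sum(k for k, _ in stats.values()) / total
    flags = {}
    for reviewer, (k, n) in stats.items():
        if n < min_decisions:
            continue  # too few decisions to judge this reviewer
        rate = k / n
        if pooled > 0 and rate >= ratio * pooled:
            flags[reviewer] = "high_override_rate"
        elif pooled > 0 and k == 0:
            flags[reviewer] = "never_overrides"
    return flags
```

A flag here is a prompt for human investigation, not a verdict: a high override rate can equally mean the reviewer is catching real misalignment the rest of the pool misses.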
OWASP Agentic AI — Threats & Mitigations v1.1
§T10 Overwhelming Human in the Loop — Mitigation (summary table)
"Develop advanced human-AI interaction frameworks, and adaptive trust mechanisms. These are dynamic AI governance models that employ dynamic intervention thresholds to adjust the level of human oversight and automation based on risk, confidence, and context."
Supports: Names dynamic intervention thresholds, set by risk, confidence, and context, as the mechanism for adjusting the balance between human oversight and automation. This is the same threshold-calibration mechanic this control targets.
Does not prove: Frames threshold adjustment as a prospective governance design, not a feedback-loop mechanism driven by accumulated override data. Adjacent rationale, not identical.
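A sketch of a dynamic intervention threshold under assumed inputs: the agent emits a confidence score in [0, 1], each action carries a risk tier, and reviewers periodically audit a sample of auto-approved actions. The tier names, threshold values, target override rate, and step size are all hypothetical. The `recalibrate` function is the feedback-loop extension this control adds; the OWASP text itself stops at thresholds driven by risk, confidence, and context.

```python
# Hypothetical per-risk-tier confidence thresholds: an action whose confidence
# score falls below its tier's threshold is routed to a human reviewer.
DEFAULT_THRESHOLDS = {"low": 0.60, "medium": 0.80, "high": 0.95}


def needs_human_review(confidence: float, risk_tier: str, thresholds: dict) -> bool:
    """Route to a human when confidence is below the threshold for this risk tier."""
    return confidence < thresholds[risk_tier]


def recalibrate(threshold: float, audited: int, overridden: int,
                target_rate: float = 0.02, step: float = 0.05,
                lo: float = 0.5, hi: float = 0.99) -> float:
    """Adjust one tier's threshold from an audit window of auto-approved actions.

    `overridden` counts audited auto-approvals a reviewer would have overridden.
    Above the target rate, raise the threshold so more actions reach a human;
    below it, lower the threshold to relieve reviewer load. Clamp to [lo, hi].
    """
    if audited == 0:
        return threshold  # no audit evidence this window: leave unchanged
    rate = overridden / audited
    if rate > target_rate:
        threshold += step
    elif rate < target_rate:
        threshold -= step
    return min(hi, max(lo, threshold))
```

Stepping by a fixed amount on every window can oscillate around the target; a dead band around `target_rate`, or a step proportional to the gap, is a common refinement.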
OWASP Agentic AI — Threats & Mitigations v1.1
§T6 Intent Breaking & Goal Manipulation — Mitigation (summary table)
"Implement planning validation frameworks, boundary management for reflection processes, and dynamic protection mechanisms for goal alignment. Deploy AI behavioral auditing by having another model check the agent and flag significant goal deviations that could indicate manipulation."
Supports: Lists AI behavioural auditing, with another model flagging significant goal deviations, among the recommended mitigations. The pattern-analysis step of this control is meant to surface comparable deviations, using override data as its signal.
Does not prove: Does not name reviewer overrides as the signal source, nor feedback-driven prompt or threshold updates as the corrective mechanism.
NIST AI 600-1 — Generative AI Profile (NIST AI RMF)
MANAGE 4.1 — sub-control title
"Post-deployment AI system monitoring plans are implemented, including mechanisms for capturing and evaluating input from users and other relevant AI Actors, appeal and override, decommissioning, incident response, recovery, and change management."
Supports: MANAGE 4.1 explicitly requires mechanisms for "appeal and override" capture and evaluation as part of post-deployment monitoring — the override-capture step this control implements.
Does not prove: Does not specify that captured overrides must be fed back into model fine-tuning or prompt calibration. Change management is listed, but the calibration pipeline itself is Helmwart's operationalisation.
NIST AI 600-1 — Generative AI Profile (NIST AI RMF)
MANAGE 2.2 — Action MG-2.2-003
"Evaluate feedback loops between GAI system content provenance and human reviewers, and update where needed. Implement real-time monitoring systems to affirm that content provenance protocols remain effective."
Supports: Uses the phrase "feedback loops between GAI system … and human reviewers" explicitly — the closest NIST wording for the human-to-system calibration loop this control constructs.
Does not prove: Action is scoped to content provenance (information integrity risk), not agentic decision-making or override-driven threshold tuning. Conceptual match; not a direct normative requirement for this control.
Christiano et al., "Deep Reinforcement Learning from Human Preferences" (NeurIPS 2017)
Abstract
"We explore goals defined in terms of (non-expert) human preferences between pairs of trajectory segments. We show that this approach can effectively solve complex RL tasks without access to the reward function … while providing feedback on less than one percent of our agent's interactions with the environment. This reduces the cost of human oversight far enough that it can be practically applied to state-of-the-art RL systems."
Supports: Foundational paper demonstrating that sparse human feedback (reviewer preferences over trajectory segments) is sufficient to learn a reward model that shapes complex agent behaviour. This is the academic grounding for the override-driven calibration loop.
Does not prove: Addresses reward-function learning for RL agents in games and robotics; does not address production agentic AI HITL gates, prompt engineering, or threshold tuning as calibration targets.
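For reference, the paper fits a reward estimate so that the predicted probability of a human preferring segment 1 over segment 2 follows a Bradley-Terry model over each segment's summed per-step rewards, trained with cross-entropy against the human label. A minimal numeric sketch of that loss for one labelled pair; there is no learning loop, and the reward values in any call are made-up inputs. The paper's additional tweaks (such as ensembling and a small random-response term) are omitted.

```python
import math


def preference_prob(rewards_1, rewards_2):
    """P[segment 1 preferred over segment 2] under a Bradley-Terry model:
    exp(sum r1) / (exp(sum r1) + exp(sum r2)), rewritten as a sigmoid of the
    difference between the two summed rewards."""
    diff = sum(rewards_1) - sum(rewards_2)
    return 1.0 / (1.0 + math.exp(-diff))


def preference_loss(rewards_1, rewards_2, mu_1):
    """Cross-entropy between the human label distribution (mu_1 on segment 1,
    1 - mu_1 on segment 2; 0.5 encodes 'equally preferable') and the model."""
    p1 = preference_prob(rewards_1, rewards_2)
    return -(mu_1 * math.log(p1) + (1.0 - mu_1) * math.log(1.0 - p1))
```

Minimising this loss over many labelled pairs pushes the reward estimate to rank segments the way reviewers do, which is why a small number of comparisons can go a long way.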
Rafailov et al., "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (NeurIPS 2023)
Abstract
"The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight, eliminating the need for sampling from the LM during fine-tuning or performing significant hyperparameter tuning."
Supports: Establishes DPO as a lower-overhead alternative to RLHF for incorporating human preference data — directly relevant to the fine-tuning pathway in this control's calibration deployment step.
Does not prove: Addresses LLM alignment from static preference datasets, not live agentic override events. The integration of production reviewer overrides into a DPO pipeline is an engineering step not covered by the paper.
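The engineering step the paper does not cover can be sketched as follows, assuming each reviewer override yields one preference pair: the reviewer's corrected action is the preferred ("chosen") response and the agent's overridden proposal is the rejected one. The `override_to_pair` helper and its field names are hypothetical. The loss is the per-pair DPO objective from the paper, -log σ(β[(log π_θ(y_w|x) - log π_ref(y_w|x)) - (log π_θ(y_l|x) - log π_ref(y_l|x))]), written over sequence log-probabilities passed in as plain floats rather than computed from a model.

```python
import math


def override_to_pair(context: str, agent_proposal: str, reviewer_action: str) -> dict:
    """Hypothetical mapping from one override event to a DPO preference pair:
    the reviewer's action is 'chosen', the overridden proposal is 'rejected'."""
    return {"prompt": context, "chosen": reviewer_action, "rejected": agent_proposal}


def dpo_pair_loss(policy_chosen_logp: float, policy_rejected_logp: float,
                  ref_chosen_logp: float, ref_rejected_logp: float,
                  beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)).

    Each *_logp is the summed log-probability of the full response under the
    trainable policy or the frozen reference model.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(m)) == log(1 + exp(-m)); log1p keeps small values accurate
    return math.log1p(math.exp(-margin))
```

When the policy still matches the reference, the loss is log 2; it falls as the policy shifts probability toward reviewer-corrected actions relative to the reference. Whether live overrides are clean enough preference data for this is exactly the open question the paper does not address.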
MITRE ATLAS AML.M0029 — Human In-the-Loop for AI Agent Actions
AML.M0029 — Description (first paragraph)
"Systems should require the user or another human stakeholder to approve AI agent actions before the agent takes them. The human approver may be technical staff or business unit SMEs depending on the use case. Separate tools, such as dedicated audit agents, may assist human approval, but final adjudication should be conducted by a human decision-maker."
Supports: Defines the HITL approval gate that generates the override events this control captures and feeds back. Its allowance that separate tools, "such as dedicated audit agents, may assist human approval", is consistent with the pattern-analysis sub-component.
Does not prove: Describes the human-approval gate as a preventive, pre-action control, not a calibration feedback loop. Does not address what happens to the approval or override record after the decision is made.
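A minimal sketch of how an AML.M0029-style approval gate could retain the approval or override record that the ATLAS text leaves unaddressed, so that it becomes input to the calibration loop. The `approve` callback, the `Decision` values, and the in-memory list standing in for a durable audit store are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Callable, Optional


class Decision(Enum):
    APPROVE = "approve"  # human accepts the agent's proposed action
    REJECT = "reject"    # human blocks the action
    MODIFY = "modify"    # human substitutes a corrected action


@dataclass
class GateRecord:
    action_id: str
    proposed: str
    decision: Decision
    final_action: Optional[str]  # None when the action was rejected
    approver: str
    timestamp: str


def hitl_gate(action_id: str, proposed: str, approver: str,
              approve: Callable[[str], tuple], log: list) -> Optional[str]:
    """Ask a human to decide, record the outcome, and return the action to run.

    `approve` returns (Decision, corrected_action_or_None). Every outcome,
    including plain approvals, is logged so override rates have a denominator.
    """
    decision, corrected = approve(proposed)
    if decision is Decision.APPROVE:
        final = proposed
    elif decision is Decision.MODIFY:
        final = corrected
    else:
        final = None
    log.append(GateRecord(action_id, proposed, decision, final, approver,
                          datetime.now(timezone.utc).isoformat()))
    return final
```

Logging approvals as well as overrides matters: without them, the downstream analysis can count overrides but cannot turn them into rates.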