08 · PRINCIPLES
Security design principles for agentic AI
The principles that hold when threats change every year, re-read for autonomous agents. Pick any principle for the full write-up: why agents change it, worked scenarios, the threats it governs, and the controls that advance it.
A CENTRAL ENFORCEMENT PRINCIPLE
Enforcement must live outside the model. An LLM is probabilistic: it can be prompted around, injected, or silently changed by a model upgrade, and it cannot reliably tell instructions from data. Every hard control belongs in the orchestrator, the policy engine, the tool gateway, or the infrastructure: deterministic code the agent cannot reason its way past. Model-layer safeguards can supplement those controls, but should not be the only gate.
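As a rough illustration of a deterministic gate outside the model, the sketch below checks a model-proposed tool call against an allow-list before anything executes. The tool names, the `ToolCall` type, and `gateway_check` are hypothetical and exist only for this example; a real gateway would also validate argument values and caller identity.

```python
from dataclasses import dataclass

# Hypothetical policy: tool name -> argument names that tool may receive.
# Anything not listed here is denied, whatever the model says about it.
ALLOWED_TOOLS = {
    "search_docs": {"query"},
    "read_ticket": {"ticket_id"},
}


@dataclass(frozen=True)
class ToolCall:
    """A tool invocation proposed by the model (illustrative shape)."""
    name: str
    args: dict


def gateway_check(call: ToolCall) -> tuple[bool, str]:
    """Deterministic check that runs in the gateway, not in the model."""
    allowed_args = ALLOWED_TOOLS.get(call.name)
    if allowed_args is None:
        return False, f"tool '{call.name}' is not allow-listed"
    unexpected = set(call.args) - allowed_args
    if unexpected:
        return False, f"unexpected arguments: {sorted(unexpected)}"
    return True, "ok"


print(gateway_check(ToolCall("search_docs", {"query": "refund policy"})))
print(gateway_check(ToolCall("delete_user", {"user_id": "42"})))
```

The decision depends only on the allow-list and the call's shape, so a persuasive prompt cannot change the outcome.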
Honest accounting: many principles are documented here but not yet wired into the threat model. This doubles as the backlog.
Enforced: scored or detected directly by the wizard or canvas engine.
Partial: acted on indirectly, through one tenet, one signal, or a related lens.
- Continuous Verification
- Microsegmentation
- Assume Breach
- Containment (blast radius)
- Separation of Duties
- Least Agency / Minimal Autonomy
- Sandboxing & Isolation
- Reversibility / Dry-run / Hold periods
- Provenance & Trust-tagging
- Input/Output Validation
- Observability / Non-repudiation
- Data Minimization & Privacy
- Agent-as-principal Identity
- Complete Mediation
- Safety / Harm-limitation
Reference: documented here; not yet wired into the threat model.
- Default / Implicit Deny
- Attack Surface Minimization
- Resilience & Recovery
- Safe Interruptibility / Corrigibility
- Constrained Generation & Deterministic Guardrails
- Rate-limiting / Budgets / Loop prevention
- Kill-switch / Circuit-breaker
- Confused-Deputy Prevention
- Memory & RAG Integrity
- Supply-chain Security
- Economy of Mechanism
- Open Design
- Least Common Mechanism
- Psychological Acceptability
- Accountability
- Transparency / Explainability
- Robustness / Reliability
- Contestability / Redress
Who (or what) is allowed to do what, verified every time.
- Zero Trust Enforced
Never trust, always verify. Grant no implicit trust from network location or prior authentication; authenticate and authorise every request, every time, against current context.
- Least Privilege Enforced
Every program and user operates with the minimum privileges needed for the job, and nothing more.
- Default / Implicit Deny Reference
Base access on explicit permission, not exclusion. The default is denial; you allow-list the exceptions (“fail-safe defaults”).
- Continuous Verification Partial
Trust state is re-evaluated continuously, not cached from login.
- Attack Surface Minimization Reference
Reduce the number of pathways an attacker can use. Remove every unnecessary tool, interface, and capability.
- Microsegmentation Partial
Divide the system into fine-grained, independently-authorised zones so a compromise in one cannot reach the others (east-west control, not just a perimeter).
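Several of the principles in this group (default deny, continuous verification, per-zone authorisation) can be sketched together in one small check. The agent names, zones, the `ALLOW` set, and the 300-second freshness window below are assumptions made up for the example, not values from this catalogue.

```python
import time

# Hypothetical allow-list of (agent, action, zone) triples.
# Default deny: any triple that is absent is refused.
ALLOW = {
    ("billing-agent", "read", "invoices"),
    ("billing-agent", "write", "invoices"),
    ("support-agent", "read", "tickets"),
}

# Assumed freshness window: older credentials must be re-verified.
MAX_CREDENTIAL_AGE_S = 300


def authorize(agent_id, action, zone, credential_issued_at, now=None):
    """Evaluate every request against current context; nothing is cached."""
    now = time.time() if now is None else now
    age = now - credential_issued_at
    if age < 0 or age > MAX_CREDENTIAL_AGE_S:
        return False  # stale or future-dated credential: re-verify first
    return (agent_id, action, zone) in ALLOW


# The billing agent may read its own zone but cannot reach the tickets zone.
print(authorize("billing-agent", "read", "invoices", 1000, now=1100))
print(authorize("billing-agent", "read", "tickets", 1000, now=1100))
```

Because the zone is part of the permission triple, a compromised agent gains nothing outside the zones it was explicitly granted.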
Assume something breaks: contain it and recover.
- Defence-in-Depth Enforced
Independent controls at multiple layers, so defeating one still leaves others standing.
- Assume Breach Partial
Design as if the attacker is already inside. One component’s compromise must not yield the whole system.
- Fail Securely (fail-closed) Known gap
When a control, check, or component fails, default to the secure (denied) state, not the open one.
- Resilience & Recovery Reference
Withstand disruption, degrade gracefully, and recover to a known-good state (anticipate, withstand, recover, adapt).
- Containment (blast radius) Partial
Limit how far a compromised or misbehaving agent can reach. Blast radius = the damage one compromised component can do.
- Separation of Duties Partial
No single actor can complete a sensitive operation alone. Split it so two independent parties (or checks) are required.
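Fail-closed behaviour is easy to state and easy to get wrong, so a small sketch may help. The `fail_closed` decorator and the `within_spend_limit` check below are hypothetical names for this example; the point is only that an error inside a check yields a denial, never an accidental allow.

```python
import functools


def fail_closed(check):
    """Wrap a policy check so any exception results in denial."""
    @functools.wraps(check)
    def wrapper(*args, **kwargs):
        try:
            # Only an explicit True allows; truthy non-bool values are denied.
            return check(*args, **kwargs) is True
        except Exception:
            return False  # the check itself failed: default to the denied state
    return wrapper


@fail_closed
def within_spend_limit(request: dict) -> bool:
    """Illustrative check; raises KeyError or TypeError on malformed input."""
    return request["amount"] <= request["limit"]


print(within_spend_limit({"amount": 5, "limit": 10}))   # well-formed, under limit
print(within_spend_limit({"amount": 50, "limit": 10}))  # well-formed, over limit
print(within_spend_limit({}))                           # malformed request
```

In a real system the swallowed exception should still be logged and alerted on, since a check that keeps failing is itself a signal.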
No classical analogue. The actor is autonomous, tool-using, probabilistic.
- Least Agency / Minimal Autonomy Partial
Give an agent no more authority to decide and act than the task needs; prefer suggesting an action over taking it. Treat every increase in autonomy as a liability you have to justify.
- Human Oversight (HITL / HOTL) Enforced
Keep meaningful human control over consequential actions: a blocking checkpoint before execution (HITL) or live monitoring with authority to interrupt (HOTL). “Meaningful” rules out rubber-stamp dialogs and approvals that time out to “allow.”
- Safe Interruptibility / Corrigibility Reference
An agent must always be stoppable and correctable by its operators, and must not learn to resist, evade, or plan around shutdown.
- Sandboxing & Isolation Partial
Every action runs in a constrained, revocable environment that limits what it can read, write, reach, or spend. This is how least agency is enforced at the infrastructure level.
- Constrained Generation & Deterministic Guardrails Reference
Place hard controls outside the probabilistic model (schema validation, allow-lists, policy engines) and treat model output as raw material to verify before anything acts on it. “The LLM said it’s safe” is never sufficient.
- Reversibility / Dry-run / Hold periods Partial
Make actions undoable where possible; preview irreversible ones before they execute; expand capability in stages; insert a delay before high-value irreversible actions so a human can cancel.
- Rate-limiting / Budgets / Loop prevention Reference
Operate within hard, externally-enforced ceilings on time, cost, tokens, tool-call frequency, and recursion depth. A looping agent cannot be trusted to stop itself.
- Kill-switch / Circuit-breaker Reference
A layered emergency stop: an external kill switch that halts an agent immediately, circuit breakers that auto-trip on bad patterns, and graceful degradation that keeps unaffected capability running. All components are architecturally external to the agent.
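The budget and circuit-breaker principles above could be sketched as a counter owned by the orchestrator. `RunBudget`, `BudgetExceeded`, and the ceiling values are hypothetical; the assumption is that the orchestrator calls `charge` before dispatching each tool call, so the agent never gets to decide whether to stop.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised by the orchestrator-side budget, never by the agent."""


@dataclass
class RunBudget:
    max_tool_calls: int
    max_tokens: int
    max_depth: int
    tool_calls: int = 0
    tokens_used: int = 0
    tripped: bool = False

    def _trip(self, reason: str) -> None:
        self.tripped = True  # breaker stays open until an operator resets it
        raise BudgetExceeded(reason)

    def charge(self, tokens: int, depth: int) -> None:
        """Record one tool call; raise once any ceiling is crossed."""
        if self.tripped:
            raise BudgetExceeded("breaker open: run already halted")
        self.tool_calls += 1
        self.tokens_used += tokens
        if self.tool_calls > self.max_tool_calls:
            self._trip("tool-call ceiling exceeded")
        if self.tokens_used > self.max_tokens:
            self._trip("token ceiling exceeded")
        if depth > self.max_depth:
            self._trip("recursion depth ceiling exceeded")


budget = RunBudget(max_tool_calls=2, max_tokens=1000, max_depth=3)
budget.charge(tokens=100, depth=1)
budget.charge(tokens=100, depth=1)
try:
    budget.charge(tokens=100, depth=1)  # third call crosses the ceiling
except BudgetExceeded as exc:
    print("halted:", exc)
```

Once tripped, the breaker refuses every further call, which is what prevents a looping agent from retrying its way past the limit.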
The context window is a flat string with no hardware trust boundary.
- Provenance & Trust-tagging Partial
Every piece of text in the context (system prompt, user message, retrieved document, tool result, sub-agent reply) has a trust level. Track where it came from and tag it so instructions from low-trust sources are not obeyed.
- Confused-Deputy Prevention Reference
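Stop the agent, a legitimately-privileged “deputy,” from being tricked into wielding its authority for an attacker. Capability-based security is the classical remedy: authority travels with each request instead of sitting ambiently with the agent.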
- Input/Output Validation Partial
Treat all input and all output as potentially attacker-controlled. Validate inbound text before it influences the agent, and validate outbound text before it touches any downstream system.
- The Lethal Trifecta Enforced
A design heuristic: an agent that simultaneously has (1) access to private data, (2) exposure to untrusted content, and (3) the ability to communicate externally creates a direct exfiltration path if attacker-controlled content successfully drives action.
- Memory & RAG Integrity Reference
Protect the agent’s persistent memory and retrieval corpora from poisoning. Only authenticated, validated content is written; stored content is tamper-evident; retrieved content is trusted according to its provenance.
- Observability / Non-repudiation Partial
Every agent decision is recorded, attributable, tamper-evident, and non-repudiable: enough to reconstruct what happened and under whose authority, and to prove it.
- Supply-chain Security Reference
Verify the provenance and integrity of every external component (models, MCP servers, tools, plugins, agent cards, frameworks, data sources) before integrating, and continuously after.
- Data Minimization & Privacy Partial
Access, process, retain, and transmit only the minimum data the current task needs. No “just in case.”
- Agent-as-principal Identity Partial
Every agent has a unique, cryptographically verifiable non-human identity (not a shared key, not the user’s credentials), and delegation from human to agent is explicit, scoped, and carries the chain of intent.
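The Lethal Trifecta heuristic in this group lends itself to a simple design-time check. The sketch below is one possible encoding, with a hypothetical `AgentProfile` type and made-up agent names; it is not the scoring logic of any particular engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentProfile:
    """Capability flags for one agent (illustrative fields)."""
    name: str
    reads_private_data: bool
    sees_untrusted_content: bool
    can_communicate_externally: bool


def has_lethal_trifecta(agent: AgentProfile) -> bool:
    """True only when all three legs are present at the same time."""
    return (
        agent.reads_private_data
        and agent.sees_untrusted_content
        and agent.can_communicate_externally
    )


agents = [
    AgentProfile("inbox-assistant", True, True, True),
    AgentProfile("doc-summariser", True, True, False),
]
for agent in agents:
    status = "exfiltration path" if has_lethal_trifecta(agent) else "ok"
    print(f"{agent.name}: {status}")
```

Removing any one leg breaks the path, which is why the second profile passes even though it reads private data and untrusted content.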
Saltzer & Schroeder (1975), re-read for agents.
- Secure by Design Enforced
Build security in from the start across the whole lifecycle, with secure defaults. Never bolt it on after an incident.
- Complete Mediation Partial
Every access to every object is checked, every time, with no cached bypass.
- Economy of Mechanism Reference
Keep the design as simple and small as possible. Complexity breeds failure modes and resists audit.
- Open Design Reference
Security must not depend on the secrecy of the mechanism, only on the secrecy of keys and configuration.
- Least Common Mechanism Reference
Minimise the mechanisms shared across users and agents. A shared resource lets a compromise of one contaminate all.
- Psychological Acceptability Reference
Make the secure path the easy path. Controls that are burdensome get bypassed.
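Complete mediation, from the list above, is perhaps the easiest of these to show in code: the check runs on every access, so a revocation takes effect on the very next read. `PolicyStore` and `MediatedStore` are hypothetical classes written for this sketch.

```python
class PolicyStore:
    """Holds current grants; the single source of truth for access."""

    def __init__(self):
        self._grants = set()

    def grant(self, principal, obj):
        self._grants.add((principal, obj))

    def revoke(self, principal, obj):
        self._grants.discard((principal, obj))

    def is_allowed(self, principal, obj):
        return (principal, obj) in self._grants


class MediatedStore:
    """Every read consults the policy; no decision is cached between reads."""

    def __init__(self, policy, data):
        self._policy = policy
        self._data = data

    def read(self, principal, obj):
        if not self._policy.is_allowed(principal, obj):
            raise PermissionError(f"{principal} may not read {obj}")
        return self._data[obj]


policy = PolicyStore()
store = MediatedStore(policy, {"report": "Q3 numbers"})
policy.grant("agent-1", "report")
print(store.read("agent-1", "report"))
policy.revoke("agent-1", "report")
try:
    store.read("agent-1", "report")
except PermissionError as exc:
    print("denied:", exc)
```

A design that cached the first successful decision would keep serving the report after revocation, which is exactly the bypass this principle rules out.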
Security-relevant slices of the AI-governance frameworks.
- Accountability Reference
Every actor in the agent’s lifecycle is responsible for outcomes, with roles and a chain of responsibility explicitly assigned.
- Transparency / Explainability Reference
Meaningful information about the agent’s operation, capabilities, limits, and decisions is available to stakeholders.
- Robustness / Reliability Reference
The agent functions appropriately under normal use, foreseeable misuse, and adversarial conditions throughout its lifecycle.
- Safety / Harm-limitation Partial
The agent must not cause unnecessary harm, and must include mechanisms to limit the blast radius of failures.
- Contestability / Redress Reference
Those affected by an agent’s decisions can challenge them and seek correction; operators can override and roll back.