Data Minimization & Privacy · Principles

Why it matters for agentic AI

Data minimisation (access, process, retain, and transmit only what the current task requires) has been a core data-protection principle since before GDPR formalised it. For static applications the principal challenge is at design time: scope the database query, limit the API response fields, set appropriate retention periods. For agents the challenge is ongoing and harder to bound, because agents make retrieval decisions dynamically at runtime. The context window is effectively an unstructured working memory; the path of least resistance is to pull in everything that might be relevant and let the model figure out what matters. That path violates minimisation not once at design time but on every invocation. Least Agency is the enforcement partner: an agent that can only retrieve what the task declares cannot accumulate cross-context profiles regardless of what it is prompted to do.

Agents introduce privacy risks that static models do not. Cross-session memory enables aggregation: an agent that legitimately accesses a name in one session, a location in another, and an email address in a third can, if those sessions are linked in memory, construct a profile that the data subject never consented to. Cross-context access creates leakage between domains that organisational policy separates: an agent with access to both HR records and customer data, asked a question about one, may inadvertently retrieve semantically similar snippets from the other. Purpose creep is the agentic version of scope creep: what began as “process this support ticket” quietly expands to “access the full account history because it might be relevant.” And unbounded memory logs retain personal data long past any justifiable purpose simply because no one set a retention policy.

The LINDDUN framework maps these systematically. For agents the highest-risk threat types are: Linkability (the agent connects data across contexts that should be separate), Identifiability (aggregated retrieval re-identifies a pseudonymous record), Disclosure of information (cross-domain access leaks data to an agent, or a prompt, that has no legitimate need), and Non-compliance (unbounded memory and unscoped retrieval violate storage-limitation and purpose-limitation rules). Running a LINDDUN pass on any agent architecture that touches personal data should be treated as mandatory where GDPR, HIPAA, or equivalent frameworks apply; it is the mechanism that surfaces the specific minimisation failures in a given design rather than relying on general principles.

Scenario: the over-retrieving support agent

A customer-support agent is asked to check a single order’s delivery status. Its retrieval mechanism, optimised for completeness, queries the customer record and pulls the full account history (order history, payment methods, contact preferences, prior complaint notes) into context because those fields co-locate with the delivery record. The model uses one field and ignores the rest. The minimisation violation is not in the model’s output but in the retrieval: personal data that was never necessary to answer the query was loaded into an active processing context, logged in the session record, and exposed to whatever tools the agent subsequently called. Task-scoped retrieval credentials, which can only query the specific record type and fields declared for this task, would have prevented the over-collection structurally.
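
The structural fix in this scenario can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the record layout, the TaskScope type, and the scoped_fetch function are all hypothetical names introduced here, and in a real deployment the check would live in the data store or credential broker rather than in the agent's own process.

```python
from dataclasses import dataclass

# Hypothetical customer record; field names are illustrative only.
CUSTOMER_RECORD = {
    "order_id": "A-1001",
    "delivery_status": "out for delivery",
    "payment_methods": ["card-ending-4242"],
    "complaint_notes": ["late parcel, March"],
    "contact_preferences": {"email": True},
}


@dataclass(frozen=True)
class TaskScope:
    """The fields a task declares as necessary (hypothetical type)."""
    task: str
    allowed_fields: frozenset


def scoped_fetch(record: dict, scope: TaskScope, requested: set) -> dict:
    """Return only the requested fields, refusing any outside the declared scope."""
    denied = set(requested) - scope.allowed_fields
    if denied:
        raise PermissionError(f"{scope.task}: fields not in scope: {sorted(denied)}")
    return {name: record[name] for name in requested if name in record}


# The delivery-status task declares two fields and nothing else.
DELIVERY_STATUS = TaskScope(
    "check_delivery_status", frozenset({"order_id", "delivery_status"})
)
result = scoped_fetch(CUSTOMER_RECORD, DELIVERY_STATUS, {"order_id", "delivery_status"})
```

With this shape, a request for payment_methods under the delivery-status scope raises PermissionError before any data is loaded into context, so the over-collection never reaches the session log or downstream tools.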

Scenario: cross-domain leakage through a shared RAG corpus

An enterprise agent is connected to a single knowledge base that, for convenience, indexes both HR policy documents and customer-service guidelines. Asked a customer-service question, the agent occasionally retrieves an HR document (performance-review criteria, compensation bands, personal data on named employees) because the retrieval embedding finds a semantic match, and it includes the irrelevant HR content in its response. Separating the RAG corpora by data classification, with the HR index inaccessible from the customer-service agent’s credential, removes the leakage path entirely: not by filtering, which can be bypassed, but by capability.
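
The difference between filtering and capability can be shown with a small sketch. Everything here is an assumption made for illustration: CorpusRegistry, open_for, and search are hypothetical names, and a naive keyword match stands in for embedding similarity.

```python
class CorpusRegistry:
    """Hands out only the indexes a credential was issued for (hypothetical)."""

    def __init__(self, indexes: dict):
        self._indexes = indexes  # classification label -> list of documents

    def open_for(self, granted: frozenset) -> dict:
        # The returned handle holds no reference to ungranted indexes,
        # so there is no post-retrieval filter to bypass.
        return {
            label: docs for label, docs in self._indexes.items() if label in granted
        }


def search(handle: dict, term: str) -> list:
    """Naive keyword match standing in for embedding similarity."""
    needle = term.lower()
    return [doc for docs in handle.values() for doc in docs if needle in doc.lower()]


registry = CorpusRegistry({
    "customer_service": ["Refund policy: reviews are completed within 14 days"],
    "hr": ["Performance review criteria and compensation bands"],
})

# The customer-service credential is granted one classification only.
cs_handle = registry.open_for(frozenset({"customer_service"}))
hits = search(cs_handle, "review")
```

Both documents match the term "review", but the HR document cannot appear in hits because the handle issued to the customer-service agent never contained the HR index.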

How it fails

Memory retains personal data indefinitely because no TTL or retention policy was set; data processed for one purpose remains available for all future queries.
HR records, customer data, and financial data share one RAG corpus because it was simpler to index everything together.
Tool calls transfer personal data to third-party services without assessing the lawfulness of the cross-border or cross-purpose transfer.
The agent’s access credential is scoped to the data store, not to a task-specific subset; “read from the customer database” becomes a key to all customer data regardless of what the current task needs.

Why the mapped controls work

Task-scoped credentials enforce minimisation at the access-control layer: the agent cannot retrieve data outside the declared task scope because the credential does not permit it, not because a post-retrieval filter is applied. Per-tool data-class declarations make the permitted data categories explicit for each tool call, creating an auditable record of what was intended to be accessed. TTL on memory implements the storage-limitation principle automatically: personal data does not outlast its justifiable retention period because it expires. Separating RAG corpora by data classification removes the shared-index leakage path by capability rather than by policy. DPIA and LINDDUN analysis for high-risk deployments provide the systematic method for finding minimisation failures before they become violations, treating the privacy threat model as a first-class engineering artefact alongside the security threat model.
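
As a rough sketch of the TTL control, the following in-memory store deletes an entry on expiry instead of flagging it. TTLMemory and FakeClock are hypothetical names, and the injectable clock is an assumption made so that expiry can be exercised without waiting; a production store would more likely rely on its own native expiry mechanism.

```python
import time


class TTLMemory:
    """Key-value memory whose entries are deleted once their TTL elapses (sketch)."""

    def __init__(self, ttl_seconds: float, clock=time.time):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def put(self, key: str, value: str) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)

    def get(self, key: str):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # remove the data, do not merely flag it
            return None
        return value

    def purge(self) -> int:
        """Delete every expired entry and return how many were removed."""
        now = self._clock()
        expired = [k for k, (exp, _) in self._entries.items() if now >= exp]
        for k in expired:
            del self._entries[k]
        return len(expired)


class FakeClock:
    """Controllable clock for tests (hypothetical helper)."""

    def __init__(self):
        self.now = 0.0

    def __call__(self) -> float:
        return self.now


DAY = 24 * 3600
clock = FakeClock()
memory = TTLMemory(ttl_seconds=30 * DAY, clock=clock)
memory.put("ticket-1", "customer email address")
before = memory.get("ticket-1")   # still inside the retention window
clock.now += 31 * DAY
after = memory.get("ticket-1")    # past the window: entry is gone
```

Lazy deletion on read covers entries that are requested again; the purge method is there for entries that are never read after they expire, which would otherwise sit in the store indefinitely.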

First steps

Audit every tool your agents call today and annotate each with the data classes it can return (e.g. PII:name, PII:email, financial:account_balance); for any tool that returns a class not declared as necessary for its associated task, restrict the credential or the query projection before the next deployment.
Set a TTL on every entry written to your agent’s persistent memory store (30 days is a reasonable starting value for most support contexts) and verify in staging that an expired entry is not retrievable, rather than merely flagged.
Run a LINDDUN Linkability check on your highest-data-volume agent: map the three most-frequently-accessed data fields across sessions and confirm that no combination of them re-identifies a pseudonymous record; if any combination does, add a cross-session correlation block to the retrieval credential.
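
One way to approximate the third step is a uniqueness test over field combinations. The sketch below is a k-anonymity-style check swapped in as a stand-in; it is not the LINDDUN method itself, and the field names, sample records, and identifying_combinations function are hypothetical.

```python
from collections import Counter
from itertools import combinations


def identifying_combinations(records: list, fields: list, k: int = 2) -> list:
    """Return field combinations under which some record sits in a group smaller than k."""
    if not records:
        return []
    risky = []
    for size in range(1, len(fields) + 1):
        for combo in combinations(fields, size):
            groups = Counter(tuple(r[f] for f in combo) for r in records)
            if min(groups.values()) < k:
                risky.append(combo)
    return risky


# Hypothetical pseudonymous records: no single field isolates anyone,
# but some pairs of fields do.
RECORDS = [
    {"postcode": "AB1", "birth_year": 1980, "plan": "basic"},
    {"postcode": "AB1", "birth_year": 1991, "plan": "basic"},
    {"postcode": "CD2", "birth_year": 1980, "plan": "pro"},
    {"postcode": "CD2", "birth_year": 1991, "plan": "pro"},
]
risky = identifying_combinations(RECORDS, ["postcode", "birth_year", "plan"])
```

Any combination that appears in risky is a candidate for the cross-session correlation block: the retrieval credential should not allow those fields to be joined across sessions.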

Threats it governs

When this principle is absent, these threats become reachable.

T28
RAG Data Exfiltration: An adversary gains access to the vector database used by the RAG pipeline and exfiltrates its contents.
T46
Data Residency / Compliance Violation via MCP Server: An MCP server processes data in a jurisdiction or context the data is not authorised to traverse.

Controls that advance it

Catalogue mitigations that strengthen this principle, grouped by the defence-in-depth stage they sit in.

Prevent

Data classification: Every dataset, document, and external system an agent can reach carries a classification label. The agent's permitted-class set and the tool's permitted-class set are intersected at the moment of every read or write. When the requested data's class falls outside that intersection, access is denied at the seam. This is the data-side complement to least-privilege: it adds a data-sensitivity constraint that role scoping alone does not provide.
Session isolation: An agent that serves multiple users stores conversation history, retrieved facts, and intermediate state in a memory layer. If that layer is not scoped to the originating session, one user's writes can reach another user's retrieval path. Session-scoped memory isolation prevents that by enforcing a hard boundary at the storage layer, so each session can only read and write its own state.
Vector ACL: A vector store returns results by embedding-space proximity, not by who is asking. Without a per-principal filter applied before similarity ranking, a query from tenant A can surface tenant B's vectors if the embeddings are close enough. Vector ACL closes that gap: every retrieval call is scoped to the requesting principal's namespace or payload partition before the store ranks any results, so cross-principal hits are structurally impossible rather than merely unlikely.
Shared-memory ACL: When multiple agents share a single vector store, the access boundaries between them are not enforced by the store itself unless you configure them explicitly. Without per-namespace write and retrieval controls, an agent that can write to the shared corpus can insert crafted vectors into any namespace it can reach, and any agent that can query the store can retrieve another agent's confidential documents through embedding-space proximity. Shared-memory ACL addresses this by tagging every vector with a principal identifier at write time and filtering every retrieval query to the requesting agent's namespace, enforced at the gateway layer where the agent cannot bypass it.
Secret scan: An agent produces code, configuration files, tool-call payloads, and log records continuously and at a rate no human reviewer can match. Any of those artefacts may contain a live API key, service token, or private certificate, placed there accidentally through model context, or deliberately through prompt injection or context poisoning. Secret scanning places an inspection gate at every agent output seam: regex patterns match known token formats, entropy analysis detects arbitrary high-entropy strings, and validator calls confirm which candidates are live credentials. The CI-secret-scanning pattern is mature; the agentic specialisation is seam placement, moving the scanner from the repository gate to the agent egress point, where artefacts can be intercepted before they reach any downstream system.
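
The class-intersection check described under Data classification above can be sketched as follows. The labels reuse the annotation style from First steps; the function names and the two permitted-class sets are hypothetical, chosen only to show the intersection at work.

```python
def permitted_classes(agent_classes: frozenset, tool_classes: frozenset) -> frozenset:
    """Data classes that both the agent and the tool are allowed to touch."""
    return agent_classes & tool_classes


def guarded_read(data_class: str, agent_classes: frozenset, tool_classes: frozenset) -> str:
    """Deny the read unless the requested class lies inside the intersection."""
    if data_class not in permitted_classes(agent_classes, tool_classes):
        raise PermissionError(f"class {data_class!r} is outside the permitted intersection")
    return f"read allowed for {data_class}"


# Hypothetical permitted-class sets for one agent and one tool.
AGENT_CLASSES = frozenset({"PII:name", "PII:email"})
TOOL_CLASSES = frozenset({"PII:name", "financial:account_balance"})

allowed = permitted_classes(AGENT_CLASSES, TOOL_CLASSES)
outcome = guarded_read("PII:name", AGENT_CLASSES, TOOL_CLASSES)
```

Here the agent may see PII:email and the tool may return financial:account_balance, yet neither class is readable through this pairing, because only PII:name appears in both sets.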

Detect

Egress DLP: An agent produces output continuously across multiple channels: user-facing responses, tool-call parameter envelopes, log records, and outbound HTTP requests. Any of those channels can carry sensitive content the agent has retrieved, been fed, or been tricked into including. Output egress DLP places an inspection gate at the boundary so that PII, credentials, and proprietary content are classified and either redacted or quarantined before they leave the trust boundary, regardless of how they got into the output.
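
A minimal sketch of the redaction half of such a gate is shown below, using Python's standard re module. The two patterns are assumptions made for illustration: the email expression is deliberately simple, and the tok_ format is a made-up token shape, not the format of any real service. A real gate would add classifiers and a quarantine path.

```python
import re

# Illustrative patterns only; "token" matches a hypothetical tok_ format.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "token": re.compile(r"\btok_[A-Za-z0-9]{24}\b"),
}


def redact(text: str):
    """Replace each match with a labelled placeholder and report what was found."""
    findings = []
    for label, pattern in PATTERNS.items():
        def _replace(match, label=label):
            findings.append(label)
            return f"[REDACTED:{label}]"
        text = pattern.sub(_replace, text)
    return text, findings


outbound = "Contact jane@example.com, token tok_" + "a" * 24
clean, findings = redact(outbound)
```

Returning the findings alongside the cleaned text lets the caller log which classes were caught at the boundary without logging the sensitive values themselves.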

Respond

No catalogued control.

In Helmwart

The Q2 LINDDUN privacy lens fires when Q1 flags PII/PHI/financial/regulated data, prompting a parallel privacy pass.