MITIGATION · m-model-registry

Model registry — version pinning, canary, rollback

An agent loads whichever model weights are available at startup unless the runtime is told exactly which artifact to load. If a poisoned or regressed weight is published to the model store, the agent picks it up silently on the next restart. A model registry prevents that: every artifact is registered with a cryptographic checksum and an approval stage, the agent runtime loads by explicit version pin, and new versions must pass a canary evaluation before promotion to production.

Last reviewed 2026-05-12 · Status: published

At a glance

Maturity: Tier 2. Available off-the-shelf or as a documented pattern, but newer or less broadly proven. Expect integration work and some operational nuance.
Places on: node (restricted to node kinds: agent, shared-memory)
Coverage: 2 threats (T1, T17)
Trade-offs: latency low · cost low · UX friction low · dev effort medium

TL;DR
  • Every model artifact the agent loads (weights, adapters, embedding models) is registered with a cryptographic checksum and an approval stage before it can reach production; the agent runtime loads by explicit version pin, not by "latest".
  • A new artifact cannot enter production without an explicit approval-stage transition in the registry, removing the silent substitution path that exists when agents load from a mutable "latest" reference.
  • Canary rollout routes a fraction of production traffic to the candidate version; the prior production version handles the remainder until a held-out evaluation confirms behaviour is within threshold.
  • Rollback is a registry-level operation: set the prior version back to the Production approval stage and the agent reloads from the known-good artifact within one deploy cycle.

How it behaves

Trigger: New model artifact submitted to registry (CI/CD pipeline after training or vendor release)
Check: Compute SHA-256 of artifact and match against registry record; confirm version is in Staging or Production approval stage; run canary evaluation before promotion
Pass: Agent loads the version-pinned artifact and serves production traffic
Fail: Agent halts at startup or rollback is triggered; rejection logged with version, checksum, and failing metric
A registry with a vacuous canary evaluation set is bookkeeping, not a control; the evaluation harness is the load-bearing part.
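
The load-time checks in this flow can be sketched as a single gate function. This is a minimal illustration rather than any registry product's API: `RegistryRecord`, `LoadRejected`, and `gate_artifact` are hypothetical names, and the record is assumed to have been fetched from the registry already.

```python
import hashlib
import logging
from dataclasses import dataclass

log = logging.getLogger("model_gate")

# Approval stages the runtime may load from, as named in the flow above.
LOADABLE_STAGES = {"Staging", "Production"}


@dataclass(frozen=True)
class RegistryRecord:
    """Hypothetical view of one registry entry."""
    name: str
    version: str
    sha256: str
    stage: str


class LoadRejected(RuntimeError):
    """Raised so the agent halts at startup instead of loading the artifact."""


def gate_artifact(artifact: bytes, record: RegistryRecord) -> bytes:
    """Return the artifact only if checksum and approval stage both pass."""
    actual = hashlib.sha256(artifact).hexdigest()
    if actual != record.sha256:
        log.error("rejected %s@%s: checksum %s != registered %s",
                  record.name, record.version, actual, record.sha256)
        raise LoadRejected("checksum mismatch")
    if record.stage not in LOADABLE_STAGES:
        log.error("rejected %s@%s: stage %r is not loadable",
                  record.name, record.version, record.stage)
        raise LoadRejected("stage not approved")
    return artifact
```

The gate raises rather than returning a flag, so a caller cannot accidentally ignore a failed check and continue loading.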

What it is

A model registry is a structured store of every model artifact an agent may load: fine-tuned weights, base-model snapshots, adapter weights, and embedding models. Each entry carries a cryptographic checksum, a version tag, training-data provenance, an approval stage, and a rollback target. The agent runtime loads by version pin (for example my-agent-llm@2026-04-30 or my-embedding-model@v2) rather than by latest. That binding means a new artifact cannot reach production without an explicit approval-stage transition in the registry.
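
As a rough sketch of what such an entry and pin might look like in code, assuming the `name@version` pin syntax from the examples above (the `RegistryEntry` class, its field names, and `parse_pin` are illustrative, not taken from any registry product):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RegistryEntry:
    """Hypothetical record carrying the five fields described above."""
    name: str
    version: str                     # version tag, e.g. "2026-04-30" or "v2"
    sha256: str                      # cryptographic checksum of the artifact
    provenance: str                  # pointer to training-data provenance
    stage: str                       # approval stage
    rollback_target: Optional[str]   # version to fall back to, if any


def parse_pin(pin: str) -> tuple[str, str]:
    """Split 'name@version' and reject floating references such as 'latest'."""
    name, sep, version = pin.partition("@")
    if not sep or not name or not version:
        raise ValueError(f"pin must look like 'name@version', got {pin!r}")
    if version.lower() == "latest":
        raise ValueError("floating reference 'latest' is not a pin")
    return name, version
```

Rejecting `latest` at parse time keeps the mutable reference from ever reaching the loader.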

The failure mode this addresses is silent substitution. Without a registry, the agent loads whatever weight is present at a well-known path or at latest. A poisoned fine-tune, a regressed base-model update, or a supply-chain-compromised artifact can enter the production fleet on the next agent restart with no audit trail and no signal. Version pinning removes the silent path: the artifact must be registered, checksummed, and approved before the runtime will load it.

New versions enter through a canary stage. A small fraction of production traffic is routed to the candidate version while the prior production version handles the remainder. A held-out evaluation set compares the two versions on a set of behaviour criteria, and the registry promotes the candidate only when agreement exceeds a defined threshold. If the canary detects a regression or unexpected behavioural shift, promotion is blocked and the prior version continues serving. Rollback is the reverse of promotion: flip the approval stage back to the previous production version and the agent reloads from the known-good artifact within one deploy cycle.
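
The promotion decision and the rollback flip can be sketched as follows. Several details here are assumptions made for illustration: agreement is measured by exact match on per-case verdicts, the default threshold of 0.95 is arbitrary, and the rolled-back version is moved to an "Archived" stage. All function names are hypothetical.

```python
from typing import Sequence


def agreement_rate(candidate: Sequence[str], baseline: Sequence[str]) -> float:
    """Fraction of held-out cases where both versions give the same verdict."""
    if len(candidate) != len(baseline) or not baseline:
        raise ValueError("need two non-empty result lists of equal length")
    matches = sum(c == b for c, b in zip(candidate, baseline))
    return matches / len(baseline)


def decide_promotion(candidate: Sequence[str], baseline: Sequence[str],
                     threshold: float = 0.95) -> str:
    """Return 'promote' only when agreement exceeds the threshold."""
    return "promote" if agreement_rate(candidate, baseline) > threshold else "block"


def rollback(stages: dict[str, str], current: str, previous: str) -> dict[str, str]:
    """Flip approval stages so the previous version is Production again."""
    updated = dict(stages)          # leave the caller's mapping untouched
    updated[current] = "Archived"   # assumed destination stage for the bad version
    updated[previous] = "Production"
    return updated
```

A real harness would compare on named behaviour criteria rather than exact string equality; the point of the sketch is that promotion is a computed decision and rollback is a stage change, not an artifact hunt.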

Detection signals

  • Production model version does not match the registry-approved version. Any divergence is either a deployment process failure or an out-of-band artifact substitution.
  • Canary error rate rises against the prior production baseline during pre-promotion evaluation. A sustained increase indicates behavioural regression or adversarial modification in the candidate artifact.
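
Both signals reduce to small comparisons. The sketch below assumes the monitoring system can supply the running and approved versions as dictionaries and the canary error rate as a list of per-window values; the `margin` and `windows` defaults are placeholders to tune, not recommended values, and the function names are hypothetical.

```python
from typing import Sequence


def version_drift(running: dict[str, str],
                  approved: dict[str, str]) -> dict[str, tuple[str, str]]:
    """Map each model whose running version differs from the approved one."""
    return {
        name: (version, approved.get(name, "<unregistered>"))
        for name, version in running.items()
        if approved.get(name) != version
    }


def sustained_increase(canary_rates: Sequence[float], baseline_rate: float,
                       margin: float = 0.02, windows: int = 3) -> bool:
    """True if the last `windows` canary error rates all exceed baseline + margin."""
    recent = list(canary_rates)[-windows:]
    return len(recent) == windows and all(r > baseline_rate + margin for r in recent)
```

Requiring several consecutive windows above the margin is one simple way to express "sustained"; a single noisy window does not trip the signal.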

Threats it covers

  • T1 Memory Poisoning −1 severity step

    WHY IT HELPS Prompt Injection includes model-poisoning as a subclass: a fine-tuned weight with embedded adversarial behaviour can reproduce injection effects without a prompt-level trigger. A poisoned weight cannot enter production without an explicit approval-stage transition, and canary comparison against the prior production version surfaces behavioural deviation before full rollout.

  • T17 Supply Chain Compromise −2 severity steps

    WHY IT HELPS AI Supply Chain Vulnerabilities include the scenario where a poisoned model weight reaches the agent runtime through a legitimate update path. Version pinning requires an explicit registry transition to a new artifact, and the signed-checksum gate rejects any weight whose hash does not match the registered record. Canary evaluation against a held-out set catches behavioural drift before the full fleet is promoted.

Principle coverage

Defence-in-Depth stage: Prevent — and it advances:

  • Resilience & Recovery A version-pinned registry with a drilled rollback path means recovery from a bad model update is a single approval-stage reversion rather than an ad-hoc artifact hunt; the prior production version is always a named, checksummed target the runtime can load immediately.
  • Reversibility / Dry-run / Hold periods Version pinning preserves the prior production artifact as a named, checksummed registry entry, making a model rollback a deterministic operation rather than a best-effort reconstruction from backups.
  • Supply-chain Security Every artifact the agent loads must pass a checksum gate and an explicit approval-stage transition before it reaches production, so a poisoned or substituted weight introduced anywhere upstream of registry ingestion is blocked at the enforcement point rather than silently loaded.

Design & governance principles (open design, economy of mechanism, accountability, …) are architectural, not advanced by a single placed control.

Implementation options

Five verified implementation paths. MLflow Model Registry is the open-source default for self-managed deployments. AWS SageMaker Model Registry, Vertex AI Model Registry, and W&B Registry are the cloud-native options for teams already on those platforms. The self-build content-hash pinning pattern covers agentic runtimes that cannot integrate with a full registry; it carries more operational overhead but has no external service dependency.

MLflow Model Registry Open-source, Linux Foundation–governed registry with version stages (None, Staging, Production, Archived; marked deprecated since MLflow 2.9 in favour of model version aliases and tags), approval-status transitions, and a Python SDK for version-pinned artifact loading.

Why choose it: Best for teams that want a self-hosted substrate without cloud-vendor lock-in. Provides stage-based approval gates, model lineage, and a REST API for CI/CD integration. Canary traffic is implemented by routing a fraction of agent requests to the Staging version while the prior Production version handles the remainder. Supports any model format via the MLflow model flavour abstraction. The Apache 2.0 licence and active Linux Foundation governance make it a durable long-term choice.
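
A version pin in MLflow is expressed through the model URI. The helper below is a small sketch that builds a `models:/<name>/<version>` URI and refuses anything that is not an exact version number; the helper name, the model name, and the version number are hypothetical, and the load call is shown only as a comment because it needs MLflow installed and a tracking server configured.

```python
def pinned_model_uri(name: str, version: int) -> str:
    """Build an MLflow registry URI that names one exact model version."""
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(version, bool) or not isinstance(version, int) or version < 1:
        raise ValueError(
            "version must be a positive integer, not a stage, alias, or 'latest'"
        )
    return f"models:/{name}/{version}"


# With MLflow available (assumed), the agent loader would then call:
#   import mlflow.pyfunc
#   model = mlflow.pyfunc.load_model(pinned_model_uri("my-agent-llm", 3))
```

Loading by stage name or by `latest` would reintroduce a mutable reference, which is why the helper accepts only an integer version.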

AWS SageMaker Model Registry Managed AWS service that catalogs model versions in Model Groups, tracks approval status (PendingManualApproval / Approved / Rejected), and triggers CI/CD deployment pipelines on approval-status transitions.

Why choose it: Best for teams running inference on SageMaker endpoints already inside the AWS ecosystem. Approval transitions re-trigger CI/CD, making rollback a matter of re-approving the prior version with no custom runbook. Integrates with SageMaker Pipelines for automated canary evaluation using Condition steps that evaluate metrics before promotion.

Google Cloud Vertex AI Model Registry Managed GCP service that stores model versions with aliases, deploys to Vertex AI Endpoints with configurable traffic splits for canary rollout, and supports rollback by shifting traffic back to a prior version.

Why choose it: Best for teams running agents on Google Cloud, especially those using Gemini or other Vertex AI foundation models. Traffic splitting at the Endpoint level is the native canary mechanism: deploy two model versions to the same endpoint, assign traffic percentages, then shift 100% to the new version once evaluation passes. Rollback is an alias reassignment or a traffic-split reset with no re-deployment required.
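
The canary, promotion, and rollback steps all amount to different values of one traffic-split mapping whose percentages sum to 100. The sketch below only computes that mapping; it does not call the Vertex AI SDK, and the function name and deployed-model IDs are made up for illustration.

```python
def canary_split(stable_id: str, canary_id: str, canary_percent: int) -> dict[str, int]:
    """Return a {deployed_model_id: percent} mapping that sums to 100."""
    if stable_id == canary_id:
        raise ValueError("stable and canary must be different deployed models")
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    return {stable_id: 100 - canary_percent, canary_id: canary_percent}


# Hypothetical rollout sequence for two deployed models on one endpoint:
canary = canary_split("dm-prod", "dm-candidate", 10)       # 90/10 during evaluation
promoted = canary_split("dm-prod", "dm-candidate", 100)    # evaluation passed
rolled_back = canary_split("dm-prod", "dm-candidate", 0)   # traffic-split reset
```

Because the prior version stays deployed throughout, the rollback mapping can be applied without a re-deployment.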

Weights & Biases Registry Managed registry with automatic sequential versioning (v0, v1, ...), organisation-wide artifact linking, lineage tracking, tag-based promotion, and webhook-triggered downstream pipelines.

Why choose it: Best for teams already using W&B for experiment tracking who want to extend into registry governance without a separate service. Tags (for example "production", "staging", "deprecated") are the promotion mechanism; webhook triggers on tag changes drive CI/CD. The absence of an explicit approval-status field means approval workflow is enforced via tag access-control policies rather than a built-in gate.

Self-build: content-hash pinning Version pinning without a registry product: store each artifact with a SHA-256 hash in a versioned object store, pin the agent runtime to a specific hash in a config file under source control, and maintain a rollback runbook that flips the config to the previous hash.

Why choose it: The only option for agentic runtimes that cannot integrate with a full registry: embedded deployments, air-gapped environments, or teams with no MLOps platform. The agent loads by hash, not by path or "latest" tag; the runtime computes the SHA-256 of the downloaded artifact and refuses to start if it does not match the pinned value. A rollback is a Git revert of the config file plus a re-deploy. Dev effort is medium: hash pinning and checksum verification are configuration, but the canary harness and rollback drill are bespoke.
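
A minimal sketch of the startup check, under two assumptions that are not in the pattern description itself: the pin lives in a JSON config file under a `model_sha256` key, and each artifact is stored in a directory under a filename equal to its own hash. The function names are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large weight files are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_pinned(config_path: Path, artifact_dir: Path) -> Path:
    """Return the pinned artifact's path, or raise so the runtime refuses to start."""
    pinned = json.loads(config_path.read_text())["model_sha256"]
    artifact = artifact_dir / pinned   # assumed layout: file named by its hash
    actual = file_sha256(artifact)     # a missing file raises FileNotFoundError here
    if actual != pinned:
        raise RuntimeError(
            f"refusing to start: {artifact} hashes to {actual}, pinned {pinned}"
        )
    return artifact
```

With this layout, a rollback is exactly the Git revert described above: the config returns to the previous hash and the next deploy verifies and loads the previous file.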

Trade-offs

  • Managed registries (SageMaker, Vertex AI) add no per-request latency; artifact pull and checksum verification happen once at startup.
  • MLflow is open-source with no per-version cost; managed cloud registries are priced per stored model version and per API call at commodity rates.
  • The dominant adoption cost is the canary evaluation harness: defining the held-out set, the promotion threshold, and the rollback trigger. Pinning and checksumming are configuration, not engineering.
  • Self-build hash pinning has no registry service cost but requires manual enforcement of the approval gate and more bespoke tooling for the canary traffic split.

When NOT to use

  • Do not deploy a model registry for one-off or experimental agents where no production traffic is at stake; the overhead of approval workflows and canary stages slows exploration without reducing real risk.
  • Skip this control when the agent calls a third-party hosted model API; you do not control those artifacts and cannot pin them through a registry. Pin the API version in request headers and monitor for behavioural drift instead.
  • A registry with an empty or untested canary evaluation set is bookkeeping, not a control. Invest in the evaluation harness before standing up the registry workflow.

Limitations

  • A registry assumes the upstream artifact source is trustworthy at registration time. A model poisoned upstream of registry ingestion is recorded in its poisoned state; a canary evaluation set that does not probe the poisoned behaviour will not detect it.
  • The canary stage adds at minimum several days of validation traffic before a model can be promoted. For teams with weekly or faster release cadences, that window is the dominant operational cost.
  • There is no industry-standard canary evaluation profile for agentic AI; every deployment defines its own behaviour-regression suite from scratch.

Maturity tier reasoning

  • Tier 2 even though model registries themselves are Tier 1-mature in MLOps and shipped by every major cloud provider: the agentic application is operational composition (version-pinned agent loaders, canary harnesses with named evaluation criteria, and drilled rollback paths) that each deployment assembles itself.
  • What keeps it from Tier 1 is the absence of an agentic-AI-specific canary profile; every deployment defines its own behaviour-regression suite from scratch.

Last verified against upstream docs: 2026-05-30.