Sixth post in the series. In the previous one, we automated GPU cluster provisioning. Now let’s talk about what happens after the hardware is ready: how a model goes from “works on my notebook” to “running in production with an SLA.”

The model with no birth certificate

A data scientist drops a message in the team channel with a link to a shared drive: “Here’s the model. It’s a 15 GB PyTorch checkpoint. We need it in production by Friday.”

You open the folder and find a single file: model_final_v2_FIXED.pt.

You start asking questions. Which version? Trained on what data? Rollback plan if predictions go wrong? Latency and throughput SLAs? Framework and CUDA version? The answers are vague. “It’s the latest one. Works on my machine. Just put it behind an API.”

You’ve seen this movie before — just with different actors. Developers used to hand you a compiled binary and say “deploy this.” That chaos drove the industry to build container registries, CI/CD pipelines, semantic versioning, and automated rollback. Models are no different. They’re artifacts: large, versioned, environment-dependent. They deserve the same lifecycle management.

Models are artifacts: treat them like it

If you’ve ever pulled an image from a container registry, tagged a release in Git, or promoted a build from staging to production, you already understand the core concepts of the model lifecycle.

| Infra concept | ML equivalent |
| --- | --- |
| Container image | Model checkpoint (weights file) |
| Container registry (ACR) | Model registry (Azure ML, MLflow) |
| CI build | Training run |
| CD release pipeline | Model deployment pipeline |
| Dockerfile (build manifest) | Training config (hyperparameters, data version, framework version) |
| Artifact signature | Model provenance and lineage |
| Blue/green deployment | A/B testing with traffic splitting |

A model file without metadata is like a container image without a tag. You can deploy it, but you can’t reproduce, audit, or safely roll it back.
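
One way to give a model its birth certificate is to write a small manifest next to the weights before registering it. The sketch below is only an illustration, not a registry API: the `ModelManifest` fields, the `build_manifest` helper, and the example values are all assumptions made for this post.

```python
import hashlib
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass(frozen=True)
class ModelManifest:
    """Hypothetical 'birth certificate' stored alongside a checkpoint."""
    name: str
    version: str
    sha256: str          # content hash of the weights file
    size_bytes: int
    framework: str       # illustrative value, e.g. "pytorch==2.3.0"
    cuda: str            # illustrative value, e.g. "12.1"
    training_data: str   # dataset identifier or version
    training_run: str    # ID of the run that produced the file


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so multi-GB checkpoints never sit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(path: Path, **fields: str) -> ModelManifest:
    """Combine measured facts (hash, size) with declared facts (fields)."""
    return ModelManifest(
        sha256=sha256_of(path), size_bytes=path.stat().st_size, **fields
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        weights = Path(tmp) / "model.pt"
        weights.write_bytes(b"stand-in for real weights")
        manifest = build_manifest(
            weights,
            name="sentiment-classifier",
            version="3",
            framework="pytorch==2.3.0",
            cuda="12.1",
            training_data="reviews-2024-05",
            training_run="<run-id>",
        )
        print(json.dumps(asdict(manifest), indent=2))
```

The hash is the part that pays off later: if the file in production does not match the hash in the manifest, you are not running the model you think you are.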

Model registries

The registry is the single source of truth for the organization’s models. It stores artifacts with metadata: version, training metrics, lineage, and deployment status.

Azure Machine Learning Model Registry

# Register model from local file
az ml model create \
  --name sentiment-classifier \
  --version 3 \
  --path ./outputs/model.pt \
  --type custom_model \
  --tags task=sentiment framework=pytorch \
  --resource-group ml-prod-rg \
  --workspace-name ml-prod-ws

# List model versions
az ml model list \
  --name sentiment-classifier \
  --resource-group ml-prod-rg \
  --workspace-name ml-prod-ws \
  --output table

# View lineage: which run produced this model
az ml model show \
  --name sentiment-classifier \
  --version 3 \
  --resource-group ml-prod-rg \
  --workspace-name ml-prod-ws \
  --query "job_name"

MLflow (open-source, multi-framework)

MLflow is a widely used open-source platform for experiment tracking and model management. It is framework-agnostic, with built-in model flavors for PyTorch, TensorFlow, and scikit-learn, and Azure ML integrates natively with it.

# Local MLflow server (dev/test)
mlflow server \
  --backend-store-uri sqlite:///mlflow.db \
  --default-artifact-root ./mlruns \
  --host 0.0.0.0 --port 5000

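# Register the model and promote it. The mlflow CLI has no model-registry
# subcommands, so this goes through the Python API.
python - <<'EOF'
import mlflow
from mlflow.tracking import MlflowClient

mlflow.set_tracking_uri("http://localhost:5000")
mlflow.register_model("runs:/<run-id>/model", "sentiment-classifier")

# Promote to production. Stages are deprecated in recent MLflow releases
# in favor of aliases; check the behavior of your installed version.
MlflowClient().transition_model_version_stage(
    name="sentiment-classifier", version="3", stage="Production"
)
EOF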

Container Registry for model serving

When models are served via containers (Triton, TorchServe, or a FastAPI wrapper), the image becomes the deployable artifact:

# Build and push the serving container
az acr build \
  --registry mlmodelsacr \
  --image sentiment-classifier:v3 \
  --file Dockerfile.serve .

# Verify image
az acr repository show-tags \
  --name mlmodelsacr \
  --repository sentiment-classifier \
  --output table

Which registry to use?

| Criteria | Azure ML Registry | MLflow Registry | ACR (Container) |
| --- | --- | --- | --- |
| Best for | Azure-native teams | Multi-cloud / OSS | Containerized serving |
| Versioning | Built-in, immutable | Built-in with stages | Image tags |
| Lineage tracking | Deep (jobs, data, env) | Run-level | Dockerfile only |
| Infra overhead | Managed | Self-hosted or Azure ML | Managed (ACR) |
| When to avoid | Need multi-cloud | Need deep Azure integration | Models without containers |

Watch out: Never use shared file systems or blob storage as a “registry.” Without immutable versions and metadata APIs, you end up with model_final_v2_FIXED_actually_final.pt.
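
The property a shared drive lacks is immutability: once a version is published, nothing should be able to overwrite it. The toy in-memory registry below illustrates only that rule. The `ModelRegistry` class and its methods are hypothetical and do not mirror the Azure ML or MLflow APIs.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


class VersionExistsError(Exception):
    """Raised when someone tries to overwrite a published version."""


@dataclass
class ModelRegistry:
    # Maps model name -> {version number -> artifact hash}
    _models: Dict[str, Dict[int, str]] = field(default_factory=dict)

    def register(self, name: str, sha256: str, version: Optional[int] = None) -> int:
        """Publish a new version; auto-increment when no version is given."""
        versions = self._models.setdefault(name, {})
        if version is None:
            version = max(versions, default=0) + 1
        if version in versions:
            # The core rule: published versions are never replaced.
            raise VersionExistsError(f"{name}:{version} is already published")
        versions[version] = sha256
        return version

    def resolve(self, name: str, version: int) -> str:
        """Return the artifact hash recorded for an exact version."""
        return self._models[name][version]


registry = ModelRegistry()
registry.register("sentiment-classifier", "hash-of-first-checkpoint")
registry.register("sentiment-classifier", "hash-of-second-checkpoint")

try:
    registry.register("sentiment-classifier", "hash-of-hotfix", version=2)
except VersionExistsError as error:
    print(f"rejected: {error}")
```

A fix becomes version 3, never a silent replacement of version 2, which is what makes rollback to "the thing we ran last week" meaningful.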

CI/CD for models: the promotion pipeline

┌─────────┐     ┌─────────────┐     ┌──────────────┐
│   DEV   │────▶│   STAGING   │────▶│  PRODUCTION  │
│         │     │             │     │              │
│ Train   │     │ Validate    │     │ Serve        │
│ Track   │     │ Benchmark   │     │ Monitor      │
│ Version │     │ Security    │     │ Auto-rollback│
└─────────┘     └─────────────┘     └──────────────┘
     │               │                    │
  GPU Compute    Inference Infra      Load Balanced
  Blob Storage   Test Data Access     Multi-replica
  Experiment     Isolated Network     Prod Network
  Tracking                            SLA-bound

Validation gates between stages

| Gate | What it checks | Infra required |
| --- | --- | --- |
| Accuracy threshold | Metrics ≥ baseline (e.g., F1 > 0.92) | Storage for test dataset, compute for evaluation |
| Latency benchmark | P95 ≤ SLA (e.g., < 200ms) | Load testing infra |
| Throughput test | Requests/sec ≥ target under load | Load generator (k6, Locust) |
| Security scan | No vulnerable deps, signed artifact | Container scanning (Defender) |
| Cost estimate | Projected cost within budget | Cost modeling based on SKU |
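
The first three gates reduce to comparing measured numbers against thresholds, which makes them easy to encode. The sketch below is a minimal illustration: `evaluate_gates` and `GateResult` are hypothetical names, the 0.92 and 200 ms defaults come from the examples in the table, and the 50 requests/sec target is an invented placeholder.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class GateResult:
    name: str
    passed: bool
    detail: str


def p95(samples_ms: Sequence[float]) -> float:
    """95th percentile using the nearest-rank method."""
    if not samples_ms:
        raise ValueError("no latency samples")
    ordered = sorted(samples_ms)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def evaluate_gates(
    f1: float,
    latencies_ms: Sequence[float],
    requests_per_sec: float,
    *,
    min_f1: float = 0.92,        # example threshold from the table
    max_p95_ms: float = 200.0,   # example SLA from the table
    min_rps: float = 50.0,       # assumed target, not from the post
) -> List[GateResult]:
    observed_p95 = p95(latencies_ms)
    return [
        GateResult("accuracy", f1 >= min_f1, f"F1={f1:.3f}, need >= {min_f1}"),
        GateResult("latency", observed_p95 <= max_p95_ms,
                   f"P95={observed_p95:.0f}ms, need <= {max_p95_ms:.0f}ms"),
        GateResult("throughput", requests_per_sec >= min_rps,
                   f"{requests_per_sec:.0f} req/s, need >= {min_rps:.0f}"),
    ]


if __name__ == "__main__":
    results = evaluate_gates(0.93, [120.0] * 19 + [250.0], 80.0)
    for result in results:
        print(f"{'PASS' if result.passed else 'FAIL'} {result.name}: {result.detail}")
    # A non-zero exit code is what stops the pipeline from promoting.
    raise SystemExit(0 if all(r.passed for r in results) else 1)
```

In a pipeline, the exit code is the gate: any failed check returns non-zero and the next stage never starts.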

GitHub Actions workflow for model deployment

name: Model Deployment Pipeline

on:
  workflow_dispatch:
    inputs:
      model_name:
        description: 'Model name in registry'
        required: true
      model_version:
        description: 'Model version to deploy'
        required: true

env:
  AZURE_RG: ml-prod-rg
  AZURE_ML_WS: ml-prod-ws

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: azure/login@v2
        with:
          creds: ${{ secrets.AZURE_CREDENTIALS }}
      - name: Download model from registry
        run: |
          az ml model download \
            --name ${{ inputs.model_name }} \
            --version ${{ inputs.model_version }} \
            --download-path ./model \
            --resource-group ${{ env.AZURE_RG }} \
            --workspace-name ${{ env.AZURE_ML_WS }}
      - name: Run accuracy validation
        run: |
          python scripts/validate_model.py \
            --model-path ./model \
            --test-data ./data/holdout.csv \
            --min-accuracy 0.92

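  deploy-staging:
    needs: validate
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Each job runs on a fresh runner, so it has to log in again
      - uses: azure/login@v2
        with:
          creds: ${{ secrets.AZURE_CREDENTIALS }}
      - name: Deploy to staging endpoint
        # deploy/deployment.yml is an assumed path to the deployment spec
        # (environment, scoring script); --set overrides fields per run
        run: |
          az ml online-deployment create \
            --file deploy/deployment.yml \
            --name staging-${{ inputs.model_version }} \
            --endpoint-name sentiment-staging \
            --set model=azureml:${{ inputs.model_name }}:${{ inputs.model_version }} \
                  instance_type=Standard_NC4as_T4_v3 \
                  instance_count=1 \
            --resource-group ${{ env.AZURE_RG }} \
            --workspace-name ${{ env.AZURE_ML_WS }}

  deploy-production:
    needs: deploy-staging
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4
      - uses: azure/login@v2
        with:
          creds: ${{ secrets.AZURE_CREDENTIALS }}
      - name: Deploy canary (10% traffic)
        run: |
          # New deployments receive no traffic until the endpoint is updated
          az ml online-deployment create \
            --file deploy/deployment.yml \
            --name prod-${{ inputs.model_version }} \
            --endpoint-name sentiment-prod \
            --set model=azureml:${{ inputs.model_name }}:${{ inputs.model_version }} \
                  instance_type=Standard_NC4as_T4_v3 \
                  instance_count=2 \
            --resource-group ${{ env.AZURE_RG }} \
            --workspace-name ${{ env.AZURE_ML_WS }}

          # Shift 10% of requests to the new deployment
          az ml online-endpoint update \
            --name sentiment-prod \
            --traffic "prod-stable=90 prod-${{ inputs.model_version }}=10" \
            --resource-group ${{ env.AZURE_RG }} \
            --workspace-name ${{ env.AZURE_ML_WS }}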

Infra ↔ AI translation: This is your blue/green pipeline, but for model weights instead of container images. The --traffic flag works much like weighted routing in Azure Front Door: you shift a percentage of requests to the new model while the old one keeps serving the rest.
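
To make the weighted split concrete, here is a sketch of how a router could turn a traffic string such as "prod-stable=90 prod-v3=10" into a per-request decision. It illustrates the general technique only and makes no claim about how Azure ML endpoints implement routing internally. The `parse_traffic` and `route` helpers are hypothetical, and requiring the weights to sum to 100 is an assumption of this sketch.

```python
import hashlib
from typing import Dict


def parse_traffic(spec: str) -> Dict[str, int]:
    """Parse 'prod-stable=90 prod-v3=10' into {'prod-stable': 90, 'prod-v3': 10}."""
    weights: Dict[str, int] = {}
    for pair in spec.split():
        name, _, percent = pair.partition("=")
        weights[name] = int(percent)
    if sum(weights.values()) != 100:
        raise ValueError(f"weights must sum to 100, got {sum(weights.values())}")
    return weights


def route(request_id: str, weights: Dict[str, int]) -> str:
    """Pick a deployment; the same request ID always lands on the same one."""
    # Hash the ID into one of 100 buckets, then walk the cumulative weights.
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    upper = 0
    for name, percent in weights.items():
        upper += percent
        if bucket < upper:
            return name
    raise AssertionError("unreachable when weights sum to 100")


if __name__ == "__main__":
    weights = parse_traffic("prod-stable=90 prod-v3=10")
    counts = {name: 0 for name in weights}
    for i in range(10_000):
        counts[route(f"req-{i}", weights)] += 1
    print(counts)
```

Hashing the request ID rather than drawing a random number keeps routing sticky, which makes it easier to compare the two models on the same set of callers.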

Your responsibilities at each stage

As an infrastructure engineer, your ownership spans the entire pipeline:

  • Compute provisioning: GPU node pools for training (Dev), inference VMs for validation (Staging), GPU clusters with autoscaling for serving (Prod)
  • Networking: Isolated VNets for staging, private endpoints for the model registry, load balancer for traffic splitting
  • Storage: High-throughput blob for training data, low-latency for model artifacts, retention policies for old versions
  • Secrets management: Key Vault for API keys, managed identity for pipeline auth, RBAC for the model registry
  • Monitoring: Deployment health dashboards, latency alerting, automated rollback triggers
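
The automated rollback trigger in the last bullet is ultimately a small decision function over metrics you already collect. The sketch below shows one possible shape for it. The `DeploymentStats` type, the `should_roll_back` helper, and every threshold in it are assumptions for illustration, not recommended values.

```python
from dataclasses import dataclass


@dataclass
class DeploymentStats:
    requests: int
    errors: int
    p95_ms: float

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def should_roll_back(
    stable: DeploymentStats,
    canary: DeploymentStats,
    *,
    sla_p95_ms: float = 200.0,      # assumed latency SLA
    max_error_ratio: float = 2.0,   # canary may err at most 2x the stable rate
    min_error_rate: float = 0.001,  # floor so a perfect baseline is not a trap
    min_requests: int = 500,        # below this, there is too little data
) -> bool:
    """Return True when the canary should be pulled out of rotation."""
    if canary.requests < min_requests:
        return False  # keep observing before deciding
    if canary.p95_ms > sla_p95_ms:
        return True   # latency SLA breached
    allowed = max(stable.error_rate * max_error_ratio, min_error_rate)
    return canary.error_rate > allowed


if __name__ == "__main__":
    stable = DeploymentStats(requests=10_000, errors=50, p95_ms=120.0)
    canary = DeploymentStats(requests=1_000, errors=30, p95_ms=130.0)
    print(should_roll_back(stable, canary))
```

Comparing the canary against the stable deployment, rather than against a fixed error budget, keeps the trigger from firing when an upstream dependency is degrading both at once.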

Traffic splitting: canary and blue/green for models

Deploying a model isn’t a binary event. You shift traffic gradually:

| Pattern | How it works | When to use |
| --- | --- | --- |
| Canary | 5-10% of traffic goes to the new model, increase gradually | Default for most deployments |
| Blue/Green | Full parallel environment, instant switch | When you need instant rollback |
| Shadow | New model receives real traffic but responses are discarded | When you want to test without impacting users |

# Promote canary to 100% after validation
az ml online-endpoint update \
  --name sentiment-prod \
  --traffic "prod-v3=100" \
  --resource-group ml-prod-rg \
  --workspace-name ml-prod-ws
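
Between the first 10% and the final 100% there are usually a few intermediate steps. As a sketch, the helper below generates the sequence of traffic strings for such a ramp. The `ramp_plan` function and the 10/25/50/100 schedule are assumptions for illustration, and the deployment names follow the examples used in this post.

```python
from typing import List, Sequence


def ramp_plan(
    stable: str,
    canary: str,
    steps: Sequence[int] = (10, 25, 50, 100),  # assumed schedule
) -> List[str]:
    """Build one --traffic value per ramp step, ending with full promotion."""
    plan: List[str] = []
    previous = 0
    for percent in steps:
        if not previous < percent <= 100:
            raise ValueError("steps must increase and stay within 1..100")
        if percent == 100:
            plan.append(f"{canary}=100")
        else:
            plan.append(f"{stable}={100 - percent} {canary}={percent}")
        previous = percent
    return plan


if __name__ == "__main__":
    for traffic in ramp_plan("prod-stable", "prod-v3"):
        print(f'--traffic "{traffic}"')
```

Each string would be applied only after the previous step has held its health checks for an agreed soak period.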

In the next post

Now that models are deployed and serving traffic, how do you know they’re healthy? Next up: monitoring and observability for AI, including model drift, GPU metrics, and how to detect degradation before users notice.