Fourteenth post in the series. In the previous one, we used AI for our own infrastructure work. Now: how to take an entire organization from “let’s use AI” to a governed, scalable platform.

Best intentions, worst outcomes

Your CTO walks into the all-hands and says: “We’re going all-in on AI.” The room buzzes. Teams brainstorm use cases before the meeting ends. Within two weeks, Slack is full of threads about GPU availability.

Fast-forward three months. Five teams provisioned GPU VMs independently across four subscriptions. Nobody can tell which models are in production and which are weekend experiments. Two teams are paying for reserved instances on clusters that sit idle 80% of the time. Security hasn’t reviewed a single deployment. The CFO wants to know why the Azure bill went up 40%.

The enthusiasm was there. The framework wasn’t.

The 6-phase model

Inspired by Microsoft’s Cloud Adoption Framework, but rebuilt specifically for infra teams. Each phase has concrete deliverables and clear exit criteria.

Assessment → Enablement → Infra Preparation → Experimentation → Scale & Governance → Continuous Adoption

Think of it as the infrastructure lifecycle applied to AI: assess, build, validate, scale, operate, iterate.

Phase 1: assessment (where are we today?)

Before building anything, an honest assessment. The question: if a team needed to deploy a model to production tomorrow, would your infrastructure support it securely?

Readiness scorecard

| Area | Key questions | Rating (1-5) |
|---|---|---|
| Team skills | Can the team provision and manage GPU compute? | ___ |
| GPU readiness | Quotas approved? Regions selected? | ___ |
| Networking | Private endpoints, bandwidth, DNS? | ___ |
| Security | Managed identity, Key Vault, network isolation? | ___ |
| Automation | IaC coverage, CI/CD maturity, GitOps? | ___ |
| Shadow AI | Unauthorized deployments identified? | ___ |

A score below 3 in any area = focused work in Phase 2 before proceeding. Don’t hide low scores; they’re the most valuable output of this phase.
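The "below 3" rule is easy to automate once the scorecard is filled in. A minimal sketch, assuming the ratings are collected into a plain dictionary; the sample scores are made up for illustration.

```python
# Hypothetical ratings; area names follow the readiness scorecard.
THRESHOLD = 3  # areas rated below this need focused work in Phase 2


def areas_needing_work(ratings: dict[str, int], threshold: int = THRESHOLD) -> list[str]:
    """Return the areas rated below the threshold, lowest score first."""
    for area, score in ratings.items():
        if not 1 <= score <= 5:
            raise ValueError(f"{area}: rating must be 1-5, got {score}")
    low = [area for area, score in ratings.items() if score < threshold]
    return sorted(low, key=lambda area: ratings[area])


ratings = {
    "Team skills": 4,
    "GPU readiness": 2,
    "Networking": 3,
    "Security": 4,
    "Automation": 3,
    "Shadow AI": 1,
}
print(areas_needing_work(ratings))
```

Sorting lowest-first puts the weakest area at the top of the Phase 2 backlog.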

Shadow AI detection

The audit everyone skips and everyone needs. Look for: teams running models in personal subscriptions, API keys in code repos, GPU VMs provisioned outside IaC pipelines, SaaS AI tools processing company data without security review.

Shadow AI isn’t just a governance problem. It’s a security exposure. Every unreviewed model endpoint is a potential data leak. Treat it with the same urgency as unpatched servers.
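One of those checks, GPU VMs provisioned outside IaC pipelines, can be approximated from a resource inventory export. A rough sketch under two assumptions: the pipelines stamp every resource with a tag (the name `deployed-by` is hypothetical), and a VM size starting with `Standard_N` is a good-enough heuristic for Azure's GPU families.

```python
from dataclasses import dataclass, field

# Assumption: IaC pipelines apply this tag to everything they deploy.
# The tag name is hypothetical; use whatever your pipelines actually set.
IAC_TAG = "deployed-by"


@dataclass
class VM:
    name: str
    size: str          # Azure VM size, e.g. "Standard_NC24ads_A100_v4"
    subscription: str
    tags: dict = field(default_factory=dict)


def is_gpu_size(size: str) -> bool:
    """Heuristic: Azure GPU families (NC, ND, NV) use sizes starting with Standard_N."""
    return size.startswith("Standard_N")


def find_shadow_gpu_vms(inventory: list[VM]) -> list[VM]:
    """GPU VMs that carry no IaC tag, i.e. were probably created by hand."""
    return [vm for vm in inventory if is_gpu_size(vm.size) and IAC_TAG not in vm.tags]


inventory = [
    VM("train-01", "Standard_NC24ads_A100_v4", "ml-platform", {"deployed-by": "pipeline"}),
    VM("weekend-test", "Standard_NC24ads_A100_v4", "personal-sub"),
    VM("web-01", "Standard_D4s_v5", "apps"),
]
for vm in find_shadow_gpu_vms(inventory):
    print(f"{vm.subscription}/{vm.name}: GPU VM without IaC tag")
```

A hit is not proof of shadow AI, only a prompt for a conversation with the owning team.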

Phase 2: enablement (building the foundation)

Close the gaps from Assessment. Investment in people, processes, and foundational tooling.

Team upskilling

Infrastructure engineers don’t need to understand backpropagation. They need: GPU memory management, inference scaling patterns, token-based pricing. Three tiers:

  1. Foundational: AI concepts in infra language (the visual glossary coming in the next post)
  2. Operational: Deploying and monitoring AI workloads (posts 3-8 in this series)
  3. Advanced: Performance tuning, cost optimization (posts 9-12)

Security baseline (non-negotiable)

  • All services authenticate via managed identity (no exceptions)
  • All secrets live in Key Vault with automated rotation
  • All model endpoints behind private endpoints
  • All data access follows least-privilege RBAC

Document these as policies, not suggestions.
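Policies are easier to enforce when a machine can check them. A minimal sketch of the four rules as a pre-deployment check, assuming each deployment is described by a small dictionary; the field names (`auth`, `inline_secrets`, `private_endpoint`, `roles`) are invented for this example.

```python
# Roles treated as too broad for least privilege (illustrative choice).
BROAD_ROLES = {"Owner", "Contributor"}


def baseline_violations(deployment: dict) -> list[str]:
    """Check a deployment descriptor against the four baseline rules."""
    violations = []
    if deployment.get("auth") != "managed_identity":
        violations.append("authentication must use managed identity")
    if deployment.get("inline_secrets"):
        violations.append("secrets must live in Key Vault")
    if not deployment.get("private_endpoint", False):
        violations.append("endpoint must sit behind a private endpoint")
    broad = BROAD_ROLES & set(deployment.get("roles", []))
    if broad:
        violations.append(f"roles too broad for least privilege: {sorted(broad)}")
    return violations


compliant = {
    "auth": "managed_identity",
    "inline_secrets": [],
    "private_endpoint": True,
    "roles": ["Storage Blob Data Reader"],
}
risky = {"auth": "api_key", "inline_secrets": ["OPENAI_KEY"], "roles": ["Contributor"]}

print(baseline_violations(compliant))
print(len(baseline_violations(risky)), "violations")
```

Run as a CI gate, a non-empty result fails the pipeline, which is what turns a suggestion into a policy.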

Phase 3: infrastructure preparation (building the platform)

This is where your IaC skills become a superpower. Turn the baseline into a repeatable self-service platform. Everything codified; if it can’t be deployed from a git commit, it shouldn’t exist.

Templates for common patterns

  • GPU VM clusters for training (Bicep/Terraform)
  • AKS clusters with GPU node pools for inference
  • Azure ML workspaces with networking
  • Azure OpenAI deployments with diagnostic settings

Each template includes security controls baked in: private endpoints, managed identity, diagnostic settings, resource tagging.

Monitoring stack (deploy before workloads)

  • GPU utilization and memory (DCGM exporter)
  • Inference endpoint latency (P50/P95/P99)
  • Token consumption tracking
  • Cost attribution by team and project
  • Model health indicators
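For the latency item, percentiles matter because an average hides the tail. A small sketch using Python's standard library; the sample latencies are fabricated to show the effect, and a real stack would compute this in the metrics backend rather than in application code.

```python
from statistics import mean, quantiles


def latency_percentiles(samples_ms: list[float]) -> dict[str, float]:
    """P50/P95/P99 from raw latency samples in milliseconds."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    # n=100 yields 99 cut points; index k-1 is the k-th percentile.
    cuts = quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Fabricated sample: most requests are fast, a few hit a slow path.
samples = [120.0] * 95 + [4000.0] * 5
stats = latency_percentiles(samples)
print(f"mean={mean(samples):.0f} ms")
print({name: round(value) for name, value in stats.items()})
```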

Cost governance (implement before it gets expensive)

Budgets per team, alerts at 50%/75%/90%, mandatory tagging, GPU quota governance. If it doesn’t exist before the workloads, it won’t exist after.
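The alert thresholds reduce to simple arithmetic. A sketch with made-up spend figures; in practice the cloud provider's budget feature does this for you, and the function below only illustrates the logic.

```python
ALERT_THRESHOLDS = (0.50, 0.75, 0.90)  # 50% / 75% / 90% of budget


def crossed_thresholds(spend: float, budget: float,
                       thresholds: tuple = ALERT_THRESHOLDS) -> list[float]:
    """Return every alert threshold the current spend has reached."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    ratio = spend / budget
    return [t for t in thresholds if ratio >= t]


# Hypothetical team with a 10,000 monthly budget and 8,000 spent so far.
print(crossed_thresholds(spend=8_000, budget=10_000))
```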

Phase 4: experimentation (controlled exploration)

Platform ready, teams can experiment with guardrails.

Sandbox environments

  • Dedicated resource groups with cost guardrails: budgets with alerts for spend, Azure Policy to restrict what can be deployed (for example, allowed VM sizes)
  • GPU quotas sized for experimentation
  • Automatic cleanup: sandboxes inactive 14 days = flagged, 30 days = decommissioned
  • Unique tag per experiment from day 1
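The cleanup rule above can be sketched as a small decision function. It assumes you can obtain a last-activity date per sandbox (how you define "activity" is up to you) and leaves the actual flagging and deletion to your automation.

```python
from datetime import date

FLAG_AFTER_DAYS = 14
DECOMMISSION_AFTER_DAYS = 30


def sandbox_action(last_activity: date, today: date) -> str:
    """Decide what to do with a sandbox based on how long it has been idle."""
    idle_days = (today - last_activity).days
    if idle_days >= DECOMMISSION_AFTER_DAYS:
        return "decommission"
    if idle_days >= FLAG_AFTER_DAYS:
        return "flag"
    return "keep"


today = date(2025, 6, 30)
for last_seen in (date(2025, 6, 25), date(2025, 6, 10), date(2025, 5, 20)):
    print(last_seen, "->", sandbox_action(last_seen, today))
```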

Mandatory success criteria

Before starting an experiment, the team defines:

  • What success looks like (accuracy threshold, latency target, cost ceiling)
  • What infra signals indicate viability at scale
  • Next step if it works

Experiments without success criteria aren’t experiments; they’re hobbies.
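Writing the criteria down as data, before the experiment starts, makes the go/no-go decision mechanical. A sketch with illustrative thresholds; the field names and numbers are assumptions, not recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuccessCriteria:
    min_accuracy: float        # accuracy threshold
    max_p99_latency_ms: float  # latency target
    max_monthly_cost: float    # cost ceiling
    next_step: str             # what happens if the experiment succeeds

    def evaluate(self, accuracy: float, p99_latency_ms: float,
                 monthly_cost: float) -> dict[str, bool]:
        """Compare measured results against the agreed criteria."""
        return {
            "accuracy": accuracy >= self.min_accuracy,
            "latency": p99_latency_ms <= self.max_p99_latency_ms,
            "cost": monthly_cost <= self.max_monthly_cost,
        }


criteria = SuccessCriteria(
    min_accuracy=0.90,
    max_p99_latency_ms=800,
    max_monthly_cost=5_000,
    next_step="request production GPU quota",
)
result = criteria.evaluate(accuracy=0.93, p99_latency_ms=950, monthly_cost=4_200)
print(result)
print("proceed:", criteria.next_step if all(result.values()) else "not yet")
```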

Phase 5: scale & governance (going to production)

The transition from “works in the sandbox” to “runs reliably under SLAs.”

Multi-tenancy and isolation

  • Namespace or resource group isolation per team
  • GPU quota enforcement per tenant
  • Network segmentation between workloads
  • Per-team monitoring dashboards

SLA/SLO design for AI

Define SLOs for availability, latency (P99), and throughput, and derive an error budget from them. AI endpoints have unique failure modes (model loading delays, GPU memory exhaustion, token rate limiting) that your SLO design needs to account for.

Infra ↔ AI translation: “Inference endpoint SLA” works just like a web API SLA, with two differences: cold start can take 30 seconds (loading GBs of model weights into GPU memory), and resource exhaustion is usually GPU memory, not CPU. Same discipline, different resources.
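To see why cold starts belong in the SLO conversation, work the error budget arithmetic. A sketch assuming a 99.9% availability target over a 30-day window and, pessimistically, that every 30-second cold start counts as full unavailability; both numbers are illustrative.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over a window."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction between 0 and 1")
    return (1 - slo) * window_days * 24 * 60


def cold_starts_within_budget(budget_minutes: float, cold_start_seconds: float = 30) -> int:
    """How many cold starts fit in the budget if each one counts as downtime."""
    return int(budget_minutes * 60 // cold_start_seconds)


budget = error_budget_minutes(0.999)
print(f"error budget: {budget:.1f} minutes per 30 days")
print("cold starts that fit:", cold_starts_within_budget(budget))
```

At 99.9%, the budget is roughly 43 minutes a month, so a scale-to-zero policy that triggers a few cold starts per day can consume it on its own.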

Fleet management and runbooks

Document procedures for: scaling during traffic spikes, zero-downtime model version rotation, GPU hardware failure response, token rate limiting (429s), cost overrun management.

Phase 6: continuous adoption (never ends)

AI infra isn’t a project with an end date. It’s a capability that evolves continuously.

Quarterly cadence

  • Utilization trend review
  • Cost optimization actions
  • Security updates
  • Technology radar changes
  • Self-service adoption metrics
  • Next quarter roadmap

Technology radar

Categorize tools and services as:

  • Adopt: Proven, standardize
  • Trial: Promising, time-boxed evaluation
  • Assess: Interesting, monitor
  • Hold: Not ready

The 5 anti-patterns that kill adoption

| Anti-pattern | What happens | How to avoid |
|---|---|---|
| Big Bang | 6 months building a perfect platform, nobody uses it | Start with MVP, iterate |
| Shadow AI | Teams deploy without infra involvement | Make the governed path the easiest one |
| GPU Hoarding | Teams reserve quota “just in case” | Use-it-or-lose-it: <20% utilization for 30 days = reclaimed |
| Security Afterthought | “We’ll add security later” | Templates with managed identity and private endpoints by default |
| Build Everything | Custom framework when a managed service exists | Default to managed services |
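The use-it-or-lose-it rule for GPU Hoarding is the easiest of these to automate. A sketch that assumes one utilization figure per day (as a fraction) and reads the rule strictly: every one of the last 30 days must be below 20%. Using a 30-day average instead would be a looser, equally defensible reading.

```python
def should_reclaim(daily_utilization: list[float], window_days: int = 30,
                   threshold: float = 0.20) -> bool:
    """True if utilization stayed below the threshold for the whole window."""
    if len(daily_utilization) < window_days:
        return False  # not enough history to judge
    recent = daily_utilization[-window_days:]
    return all(day < threshold for day in recent)


idle_team = [0.05] * 30
busy_once = [0.05] * 29 + [0.85]  # one real training run resets the clock
print(should_reclaim(idle_team), should_reclaim(busy_once))
```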

These anti-patterns compound. Big Bang causes Shadow AI (teams won’t wait). Shadow AI creates Security Afterthought (deployments skip review). Recognizing the pattern is the first step.

In the next post

Adoption framework complete. In the final post of the series, the visual glossary: your infra ↔ AI Rosetta Stone. Every AI term mapped to an infrastructure concept you already know.