Fourteenth post in the series. In the previous one, we used AI for our own infrastructure work. Now: how to take an entire organization from “let’s use AI” to a governed, scalable platform.
Best intentions, worst outcomes
Your CTO walks into the all-hands and says: “We’re going all-in on AI.” The room buzzes. Teams brainstorm use cases before the meeting ends. Within two weeks, Slack is full of threads about GPU availability.
Fast-forward three months. Five teams provisioned GPU VMs independently across four subscriptions. Nobody can tell which models are in production and which are a weekend experiment. Two teams are paying for reserved instances on clusters that sit idle 80% of the time. Security hasn’t reviewed a single deployment. The CFO wants to know why the Azure bill went up 40%.
The enthusiasm was there. The framework wasn’t.
The 6-phase model
Inspired by Microsoft’s Cloud Adoption Framework, but rebuilt specifically for infra teams. Each phase has concrete deliverables and clear exit criteria.
Assessment → Enablement → Infra Preparation → Experimentation → Scale & Governance → Continuous Adoption
Think of it as the infrastructure lifecycle applied to AI: assess, build, validate, scale, operate, iterate.
Phase 1: assessment (where are we today?)
Before building anything, an honest assessment. The question: if a team needed to deploy a model to production tomorrow, would your infrastructure support it securely?
Readiness scorecard
| Area | Key questions | Rating (1-5) |
|---|---|---|
| Team skills | Can the team provision and manage GPU compute? | ___ |
| GPU readiness | Quotas approved? Regions selected? | ___ |
| Networking | Private endpoints, bandwidth, DNS? | ___ |
| Security | Managed identity, Key Vault, network isolation? | ___ |
| Automation | IaC coverage, CI/CD maturity, GitOps? | ___ |
| Shadow AI | Unauthorized deployments identified? | ___ |
A score below 3 in any area means focused work on that area in Phase 2 before proceeding. Don’t hide low scores; they’re the most valuable output of this phase.
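The scoring rule is simple enough to automate. A minimal sketch in Python, assuming the scorecard is kept as a plain dictionary; the function name and the example ratings are hypothetical, only the area names and the "below 3" rule come from the table:

```python
# Areas rated below this value need focused work in Phase 2 (rule from the scorecard).
PASS_THRESHOLD = 3


def areas_needing_work(scores: dict[str, int], threshold: int = PASS_THRESHOLD) -> list[str]:
    """Return the areas rated below the threshold, lowest rating first."""
    for area, rating in scores.items():
        if not 1 <= rating <= 5:
            raise ValueError(f"{area}: rating must be between 1 and 5, got {rating}")
    gaps = [area for area, rating in scores.items() if rating < threshold]
    return sorted(gaps, key=lambda area: scores[area])


# Made-up ratings, for illustration only.
example_scores = {
    "Team skills": 4,
    "GPU readiness": 2,
    "Networking": 3,
    "Security": 4,
    "Automation": 3,
    "Shadow AI": 1,
}

if __name__ == "__main__":
    print(areas_needing_work(example_scores))
```

Sorting by rating puts the weakest area first, which is usually where Phase 2 should start.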
Shadow AI detection
The audit everyone skips and everyone needs. Look for: teams running models in personal subscriptions, API keys in code repos, GPU VMs provisioned outside IaC pipelines, SaaS AI tools processing company data without security review.
Shadow AI isn’t just a governance problem. It’s a security exposure. Every unreviewed model endpoint is a potential data leak. Treat it with the same urgency as unpatched servers.
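Part of this audit can be scripted against a resource inventory export. The sketch below is one possible approach, not a complete detector: the `Resource` shape, the approved subscription names and the `deployed-by` tag are all hypothetical, and it only covers two of the signals above (GPU VMs outside approved subscriptions or outside IaC pipelines).

```python
from dataclasses import dataclass, field

# Hypothetical values: replace with your own subscriptions and tagging convention.
APPROVED_SUBSCRIPTIONS = {"platform-ai-prod", "platform-ai-sandbox"}
IAC_TAG = "deployed-by"  # assumed to be set by the IaC pipeline on every resource


@dataclass
class Resource:
    name: str
    subscription: str
    vm_size: str
    tags: dict[str, str] = field(default_factory=dict)


def is_gpu_size(vm_size: str) -> bool:
    # Assumption: Azure GPU VM sizes belong to the N-series (NC, ND, NV families).
    return vm_size.startswith("Standard_N")


def shadow_ai_findings(resources: list[Resource]) -> list[tuple[str, list[str]]]:
    """Return (resource name, reasons) for GPU VMs that look unmanaged."""
    findings = []
    for res in resources:
        if not is_gpu_size(res.vm_size):
            continue
        reasons = []
        if res.subscription not in APPROVED_SUBSCRIPTIONS:
            reasons.append("unapproved subscription")
        if IAC_TAG not in res.tags:
            reasons.append("not deployed through IaC")
        if reasons:
            findings.append((res.name, reasons))
    return findings
```

API keys in repos and unreviewed SaaS tools need different data sources (secret scanning, expense or SSO logs), so treat this as one input to the audit rather than the audit itself.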
Phase 2: enablement (building the foundation)
Close the gaps from Assessment. Investment in people, processes, and foundational tooling.
Team upskilling
Infrastructure engineers don’t need to understand backpropagation. They need: GPU memory management, inference scaling patterns, token-based pricing. Three tiers:
- Foundational: AI concepts in infra language (the visual glossary coming in the next post)
- Operational: Deploying and monitoring AI workloads (posts 3-8 in this series)
- Advanced: Performance tuning, cost optimization (posts 9-12)
Security baseline (non-negotiable)
- All services authenticate via managed identity (no exceptions)
- All secrets live in Key Vault with automated rotation
- All model endpoints behind private endpoints
- All data access follows least-privilege RBAC
Document these as policies, not suggestions.
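One way to make "policies, not suggestions" concrete is to express the baseline as a check that runs in CI. A rough sketch, assuming each endpoint can be described by a small config object; `EndpointConfig`, its fields and the list of "too broad" roles are illustrative assumptions, not an Azure API:

```python
from dataclasses import dataclass

# Assumption: these built-in Azure roles are too broad for a workload identity.
BROAD_ROLES = {"Owner", "Contributor"}


@dataclass(frozen=True)
class EndpointConfig:
    """Hypothetical description of one model endpoint, e.g. parsed from IaC output."""
    name: str
    auth_mode: str  # "managed_identity" or anything else
    secrets_in_key_vault: bool
    secret_rotation_automated: bool
    private_endpoint: bool
    rbac_roles: tuple[str, ...] = ()


def baseline_violations(cfg: EndpointConfig) -> list[str]:
    """Check one endpoint against the four baseline rules; empty list means compliant."""
    violations = []
    if cfg.auth_mode != "managed_identity":
        violations.append("must authenticate via managed identity")
    if not cfg.secrets_in_key_vault:
        violations.append("secrets must live in Key Vault")
    elif not cfg.secret_rotation_automated:
        violations.append("secret rotation must be automated")
    if not cfg.private_endpoint:
        violations.append("endpoint must sit behind a private endpoint")
    broad = sorted(BROAD_ROLES.intersection(cfg.rbac_roles))
    if broad:
        violations.append("overly broad RBAC roles: " + ", ".join(broad))
    return violations
```

Failing the pipeline when the list is non-empty turns the baseline into something teams cannot quietly skip.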
Phase 3: infrastructure preparation (building the platform)
This is where your IaC skills become a superpower. Turn the baseline into a repeatable self-service platform. Everything codified; if it can’t be deployed from a git commit, it shouldn’t exist.
Templates for common patterns
- GPU VM clusters for training (Bicep/Terraform)
- AKS clusters with GPU node pools for inference
- Azure ML workspaces with networking
- Azure OpenAI deployments with diagnostic settings
Each template includes security controls baked in: private endpoints, managed identity, diagnostic settings, resource tagging.
Monitoring stack (deploy before workloads)
- GPU utilization and memory (DCGM exporter)
- Inference endpoint latency (P50/P95/P99)
- Token consumption tracking
- Cost attribution by team and project
- Model health indicators
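For the latency item, it helps to be clear about what P50/P95/P99 mean before wiring up dashboards. A small sketch using the nearest-rank method; this is one common percentile definition, and your monitoring backend may interpolate differently, so treat the function as illustrative:

```python
def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be an integer between 1 and 100")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


def latency_summary(samples_ms: list[float]) -> dict[str, float]:
    """Summarize request latencies as the three percentiles tracked on the dashboard."""
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}
```

With ten requests where one took 2500 ms and the rest around 100 ms, P50 stays near 100 ms while P95 and P99 both land on the 2500 ms outlier. That gap is why tail percentiles, not averages, belong on inference dashboards.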
Cost governance (implement before it gets expensive)
Budgets per team, alerts at 50%/75%/90%, mandatory tagging, GPU quota governance. If it doesn’t exist before the workloads, it won’t exist after.
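The alert thresholds are easy to reason about in code. A minimal sketch of the 50%/75%/90% logic, assuming spend and budget are tracked as whole currency units per team; in practice the budget service raises these alerts for you, so this only illustrates the rule:

```python
ALERT_THRESHOLDS = (50, 75, 90)  # percent of budget, from the text above


def crossed_thresholds(spend: int, budget: int, thresholds=ALERT_THRESHOLDS) -> list[int]:
    """Return every alert threshold (in percent) that current spend has reached."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    # Compare with integer arithmetic to avoid floating-point edge cases at the boundary.
    return [t for t in thresholds if spend * 100 >= t * budget]


def new_alerts(previous_spend: int, current_spend: int, budget: int) -> list[int]:
    """Thresholds crossed since the last check, so each alert fires only once."""
    already = set(crossed_thresholds(previous_spend, budget))
    return [t for t in crossed_thresholds(current_spend, budget) if t not in already]
```

For a team with a 10,000 budget, moving from 4,000 to 7,600 in spend fires the 50% and 75% alerts together; the next check fires nothing until spend reaches 9,000.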
Phase 4: experimentation (controlled exploration)
With the platform ready, teams can experiment within guardrails.
Sandbox environments
- Dedicated resource groups with budgets (Azure Cost Management) and resource restrictions enforced via Azure Policy
- GPU quotas sized for experimentation
- Automatic cleanup: sandboxes inactive 14 days = flagged, 30 days = decommissioned
- Unique tag per experiment from day 1
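The cleanup rule above maps to a few lines of logic. A sketch assuming you can obtain a last-activity date per sandbox (how you define "activity" is up to you) and that the limits are inclusive, i.e. exactly 14 idle days already counts as flagged:

```python
from datetime import date

FLAG_AFTER_DAYS = 14          # inactive this long: flag for review
DECOMMISSION_AFTER_DAYS = 30  # inactive this long: tear down


def sandbox_action(last_activity: date, today: date) -> str:
    """Return 'keep', 'flag' or 'decommission' for one sandbox."""
    idle_days = (today - last_activity).days
    if idle_days >= DECOMMISSION_AFTER_DAYS:
        return "decommission"
    if idle_days >= FLAG_AFTER_DAYS:
        return "flag"
    return "keep"
```

Run on a schedule against the experiment tags, this gives owners a 16-day window between the flag and the teardown to object.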
Mandatory success criteria
Before starting an experiment, the team defines:
- What success looks like (accuracy threshold, latency target, cost ceiling)
- What infra signals indicate viability at scale
- Next step if it works
Experiments without success criteria aren’t experiments; they’re hobbies.
Phase 5: scale & governance (going to production)
The transition from “works in the sandbox” to “runs reliably under SLAs.”
Multi-tenancy and isolation
- Namespace or resource group isolation per team
- GPU quota enforcement per tenant
- Network segmentation between workloads
- Per-team monitoring dashboards
SLA/SLO design for AI
Define SLOs for availability, latency (P99), and throughput, and derive an error budget from each target. AI endpoints have unique failure modes (model loading delays, GPU memory exhaustion, token rate limiting) that your SLO design needs to account for.
Infra ↔ AI translation: “Inference endpoint SLA” works much like a web API SLA. The differences: cold start can be 30 seconds (loading GBs of model weights into GPU memory), and resource exhaustion is usually GPU memory, not CPU. Same discipline, different resources.
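To see why cold starts matter for SLO design, it helps to put numbers on the error budget. A back-of-the-envelope sketch, assuming a 30-day window and assuming requests that arrive during a cold start fail or time out, so each cold start counts fully as downtime; both assumptions are simplifications you should adapt:

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an availability SLO over the window."""
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be between 0 and 100 (exclusive)")
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_percent) / 100


def cold_starts_within_budget(budget_minutes: float, cold_start_seconds: float = 30) -> int:
    """How many full cold starts fit in the budget if each one counts as downtime."""
    return int(budget_minutes * 60 // cold_start_seconds)


if __name__ == "__main__":
    budget = error_budget_minutes(99.9)
    print(f"99.9% over 30 days allows about {budget:.1f} minutes of downtime")
    print(f"That covers {cold_starts_within_budget(budget)} cold starts of 30 seconds")
```

A 99.9% target leaves roughly 43 minutes a month. If scale-to-zero or frequent model swaps trigger a cold start a few times a day, the budget is gone before any real incident happens, which is an argument for keeping warm replicas on SLA-bound endpoints.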
Fleet management and runbooks
Document procedures for: scaling during traffic spikes, zero-downtime model version rotation, GPU hardware failure response, token rate limiting (429s), cost overrun management.
Phase 6: continuous adoption (never ends)
AI infra isn’t a project with an end date. It’s a capability that evolves continuously.
Quarterly cadence
- Utilization trend review
- Cost optimization actions
- Security updates
- Technology radar changes
- Self-service adoption metrics
- Next quarter roadmap
Technology radar
Categorize tools and services as:
- Adopt: Proven, standardize
- Trial: Promising, time-boxed evaluation
- Assess: Interesting, monitor
- Hold: Not ready
The 5 anti-patterns that kill adoption
| Anti-pattern | What happens | How to avoid |
|---|---|---|
| Big Bang | 6 months building a perfect platform, nobody uses it | Start with MVP, iterate |
| Shadow AI | Teams deploy without infra involvement | Make the governed path the easiest one |
| GPU Hoarding | Teams reserve quota “just in case” | Use-it-or-lose-it: <20% utilization for 30 days = reclaimed |
| Security Afterthought | “We’ll add security later” | Templates with managed identity and private endpoints by default |
| Build Everything | Custom framework when a managed service exists | Default to managed services |
These anti-patterns compound. Big Bang causes Shadow AI (teams won’t wait). Shadow AI creates Security Afterthought (deployments skip review). Recognizing the pattern is the first step.
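The use-it-or-lose-it rule from the GPU Hoarding row is the easiest of these to automate. A sketch under one possible reading of the rule: quota is reclaimed only if every one of the last 30 daily utilization averages is below 20%. Using a 30-day mean instead would be a different, stricter-or-looser policy depending on the workload, so pick one deliberately:

```python
def should_reclaim(
    daily_utilization_pct: list[float],
    threshold_pct: float = 20.0,
    window_days: int = 30,
) -> bool:
    """True if utilization stayed below the threshold on each of the last window_days days."""
    if len(daily_utilization_pct) < window_days:
        # Not enough history yet: never reclaim on partial data.
        return False
    recent = daily_utilization_pct[-window_days:]
    return all(day < threshold_pct for day in recent)
```

A single busy day resets the clock under this reading, which keeps bursty but legitimate workloads from losing their quota.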
In the next post
Adoption framework complete. In the final post of the series, the visual glossary: your infra ↔ AI Rosetta Stone. Every AI term mapped to an infrastructure concept you already know.