This is the first post in a series where I’ll translate the world of AI into the language that infrastructure engineers already speak. If you’re the kind of professional who configures VMs, builds CI/CD pipelines, and gets woken up at 2 AM when Nagios fires, this content is for you.
The series is based on my open-source book AI for Infrastructure Professionals, adapted and expanded here on the blog.
The Monday morning message
It’s 8:47 AM on a Monday. You’re halfway through your coffee, reviewing a Terraform plan for a network redesign, when a Slack message lights up your screen. It’s from the data science team lead:
“Hey — we need 8 GPU VMs provisioned by Wednesday for a fine-tuning job. We also need a private endpoint for the model’s inference API, and can you set up TPM monitoring? Thanks!”
You read it twice. GPU VMs? Fine-tuning? You know what a private endpoint is — you’ve configured hundreds. Monitoring? That’s your bread and butter. But what the hell is “TPM” in this context? It’s not Trusted Platform Module. It’s Tokens Per Minute, a throughput metric for language models. You don’t know that yet, but that’s fine.
Notice something: everything else in that request is pure infrastructure.
Provisioning compute. Configuring network security. Setting up observability. You’ve been doing this for years. The only difference is the type of workload.
At its core, AI is just another workload
Let me be direct. Strip away the buzzwords and AI is a workload. It consumes compute, storage, and networking, just like every other workload you’ve ever managed. The difference is in the shape of that consumption: more parallel compute, larger datasets, different performance metrics.
The AI stack runs on three layers you already know:
| AI Layer | What it does | Your infra equivalent |
|---|---|---|
| Data | Feeds the model with examples | Storage: Blob, Data Lake, NFS, databases |
| Model | Learns patterns and makes predictions | The application — your compiled binary running on compute |
| Infrastructure | Holds everything up underneath | Your domain: compute, networking, security, observability |
The model is the application. The data is what it consumes and produces. The infrastructure is everything that makes it run reliably, securely, and at scale. That last part? That’s you.
Translating AI into infrastructure language
Back in 2014, when I started writing about Docker on this blog, the first thing I did was translate the concepts into something sysadmins already understood. I’m doing the same thing now with AI.
When someone from the AI team throws jargon you don’t recognize, map it back to what you already know:
| AI Concept | Infrastructure Equivalent | Why it works |
|---|---|---|
| Trained model | Compiled binary | A static artifact produced by a build process, deployed to serve requests |
| Training a model | Batch job | Long-running, compute-intensive process that reads data and produces an output artifact |
| Inference | An API call | Request comes in, the model processes it, response goes out. Just like any microservice |
| Fine-tuning | Patching a binary | You take an existing artifact and customize it for your environment |
| Dataset | Database / Data Lake | Structured input that the workload depends on |
| Training pipeline | CI/CD pipeline | Automated workflow: ingest → process → build → validate → deploy |
| Model registry | Artifact repository | Versioned storage for deployable artifacts (like ACR, but for models) |
| GPU cluster | High-performance compute | Specialized hardware allocated for heavy workloads |
💡 Meeting tip: When the data science team starts talking about “epochs”, “hyperparameters”, and “loss functions”, don’t panic. Those are their tuning knobs — the equivalent of your connection pool sizes, cache TTLs, and autoscale thresholds. You don’t need to master their knobs. You need to understand what those knobs demand from your infrastructure.
What changes and what stays the same
The good news: AI infrastructure isn’t a different planet. It’s more like a new neighborhood in a city you already know. The streets follow the same grid, the utilities work the same way, but the buildings look different and the residents have unusual needs.
What changes
| Dimension | Traditional Infra | AI Infra |
|---|---|---|
| Compute | CPUs, general-purpose VMs | GPUs (NVIDIA T4, A100, H100), multi-GPU nodes |
| Storage | SSD/HDD, managed disks | Data Lakes, high-throughput Blob, local NVMe for scratch |
| Networking | 1–25 GbE Ethernet | InfiniBand (up to 400 Gb/s), RDMA, GPU-to-GPU communication |
| Deployment | VMs, App Services, containers | Inference endpoints, model-as-a-service, GPU-enabled containers |
| Observability | CPU %, memory, disk I/O | GPU utilization, VRAM, tokens/second, time-to-first-token |
| Cost | $/hour per VM | $/hour per GPU (10–30× CPU cost), PTUs for managed services |
What doesn’t change
And this is equally important. Maybe more. These fundamentals don’t change just because the workload runs on GPUs:
- Security: Network segmentation, private endpoints, identity management, encryption. A GPU VM still needs an NSG. An inference API still needs authentication.
- Networking: VNets, subnets, DNS, load balancing. Packets still flow the same way.
- Infrastructure as Code: Bicep, Terraform, ARM templates. GPU VMs are still Azure resources with properties and parameters (see the sketch after this list).
- Monitoring: You’ll still set thresholds, build dashboards, and respond to incidents. The metrics just have different names.
- Cost management: Budgets, tagging, right-sizing. If anything, cost governance is more critical with AI workloads.
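To ground the Infrastructure as Code point: a GPU VM is provisioned exactly like any other VM, with the GPU SKU as just one more property. A minimal sketch with the Azure CLI (resource names like my-rg and gpu-train-01 are hypothetical placeholders):

```bash
# Same command you use for any VM; only --size makes it a GPU workload.
# Hypothetical resource group and VM name; pick a size your quota allows.
az vm create \
  --resource-group my-rg \
  --name gpu-train-01 \
  --image Ubuntu2204 \
  --size Standard_NC24ads_A100_v4 \
  --tags project=finetune owner=data-science
```

The Bicep or Terraform version is the same idea declaratively: vmSize is just another property in the resource definition.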
⚠️ Production alert: The most common failures in production AI systems aren’t model accuracy problems. They’re the same old villains: disk full, network timeout, expired certificate, missing RBAC permission. Your instincts are right.
Why AI needs you (not the other way around)
The AI industry has a people problem, and it’s not what you’d expect. Data scientists who can build models in Jupyter notebooks are plentiful. What’s actually scarce are engineers who can take those models and run them reliably in production.
In my experience working with startups and enterprises at Microsoft, I see this pattern constantly:
Uncontrolled GPU sprawl. A data scientist requests 4 Standard_NC24ads_A100_v4 VMs for a training experiment. No resource locks, no budget alerts, no tagging. Three weeks later, the VMs are still running. Nobody remembers who provisioned them or whether the experiment finished. Monthly cost: $35,000+.
Exposed inference endpoints. The ML team deploys a model to a managed endpoint with a public IP. No private endpoint, no WAF, no API management. The model serves responses that include proprietary business logic.
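Putting that endpoint behind a private endpoint is standard network work. A hedged sketch with the Azure CLI: every name and the resource ID are placeholders, and the right --group-id depends on the target service (for example, account for an Azure OpenAI resource):

```bash
# Placeholder names throughout; point the connection resource ID at your
# actual inference service and match --group-id to that service's sub-resource.
az network private-endpoint create \
  --resource-group my-rg \
  --name pe-inference \
  --vnet-name my-vnet \
  --subnet endpoints-subnet \
  --private-connection-resource-id "/subscriptions/<sub-id>/resourceGroups/my-rg/providers/Microsoft.CognitiveServices/accounts/my-openai" \
  --group-id account \
  --connection-name inference-plink
```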
Blind observability. The team monitors model accuracy but not infrastructure health. When inference latency jumps from 200ms to 8 seconds, nobody can tell whether it’s the model, the compute, the network, or a noisy neighbor.
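The infrastructure half of that picture starts at the GPU itself. On any VM with the NVIDIA driver installed, nvidia-smi can stream the hardware metrics from the observability row above; a minimal sketch:

```bash
# Poll GPU utilization, VRAM usage, and temperature every 5 seconds as CSV.
# Ship this output to your monitoring stack like any other host metric.
nvidia-smi \
  --query-gpu=timestamp,utilization.gpu,memory.used,memory.total,temperature.gpu \
  --format=csv \
  -l 5
```

Tokens/second and time-to-first-token come from the serving layer, not the hardware, so those need application-level instrumentation.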
⚠️ The $50K GPU weekend: A team provisioned 8 Standard_ND96asr_v4 VMs (A100 GPUs) on a Friday afternoon for a training run that was supposed to finish Saturday morning. The job crashed at 3 AM due to a checkpoint storage misconfiguration, but the VMs kept running. Nobody had set up auto-shutdown or budget alerts. Monday surprise: $53,000 in compute for 60 hours of idle GPU. An infrastructure engineer would have configured auto-shutdown, set a budget alert at $5,000, and stored checkpoints in Blob with lifecycle policies. Fifteen minutes of infra work would have saved $48,000.
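Those fifteen minutes of guardrails look roughly like this. A sketch with the Azure CLI; names, amounts, and dates are hypothetical, and budget notification recipients are typically wired up afterwards in the portal or an ARM template:

```bash
# Guardrail 1: nightly auto-shutdown on the GPU VM (time is UTC, HHMM).
az vm auto-shutdown --resource-group my-rg --name gpu-train-01 --time 2300

# Guardrail 2: a monthly cost budget on the subscription.
az consumption budget create \
  --budget-name gpu-training-budget \
  --amount 5000 \
  --category cost \
  --time-grain monthly \
  --start-date 2025-07-01 \
  --end-date 2026-07-01
```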
Hands-on: your first AI reconnaissance
You don’t need to train a model or write Python. You need to know what GPU compute is available to you and what your subscription’s limits are. This is reconnaissance — the same first step you’d take before architecting any new workload.
Discover GPU VMs in your region
az vm list-skus --location eastus2 --size Standard_N --output table
This filters the Standard_N family, which includes all GPU-accelerated VMs in Azure. Pay attention to three prefixes:
- NC: Compute-optimized GPUs for training and inference (NVIDIA T4, A100)
- ND: High-end GPUs for distributed deep learning with InfiniBand (A100, H100)
- NV: GPUs for visualization and lightweight inference (AMD Radeon, NVIDIA A10)
Check your GPU quota
az vm list-usage --location eastus2 --output table | grep -E "NC|ND|NV"
On Windows/PowerShell, replace `grep -E "NC|ND|NV"` with `Select-String -Pattern "NC|ND|NV"`.
If your quota is zero across the board, you’ll need to request an increase before any provisioning. That’s exactly the kind of infra work that the data science team doesn’t know (and doesn’t want to know) how to do.
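If you’d rather script the request than click through the portal, a quota increase can be submitted with the az quota extension. A sketch, assuming that extension is installed and using placeholder IDs; the exact family name comes from the list-usage output above:

```bash
# One-time: install the quota extension.
az extension add --name quota

# Request 24 vCPUs of quota for an A100 family in eastus2.
# Quota is counted in vCPUs per family: one Standard_NC24ads_A100_v4 VM needs 24.
az quota create \
  --resource-name StandardNCADSA100v4Family \
  --scope "/subscriptions/<sub-id>/providers/Microsoft.Compute/locations/eastus2" \
  --limit-object value=24 \
  --resource-type dedicated
```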
Next up
I’ll talk about data and storage for AI workloads — the piece everyone ignores, and the one that ends up being the hidden performance bottleneck in virtually every AI project I’ve seen.
The full book is available for free at ai4infra.com.
This post is part of the AI for Infrastructure Engineers series, based on the book AI for Infrastructure Professionals. New posts every week.