Third post in the series where I translate AI into the language of those who live and breathe infrastructure. In the previous post, we talked about the hidden storage bottleneck. Today we get to what everyone thinks is the main topic of AI: compute.

Spoiler: it’s not just about having the most expensive GPU. It’s about having the right GPU, connected the right way.

The story you don’t want to live

The ML team asks for “a GPU cluster for training.” You do what any infra engineer would: provision eight Standard_D16s_v5 VMs. Sixteen vCPUs each, 64 GiB of RAM, premium SSD. On paper, plenty of power.

The team launches the training script. Progress bar: estimated completion in 47 hours. CPUs at 100%, network barely registers traffic, and nobody looks happy.

Then a colleague suggests two Standard_ND96asr_v4 nodes, each with eight A100 GPUs connected via 200 Gb/s InfiniBand. Same training job, same dataset, same code. The job finishes in 90 minutes.

The difference isn’t just the GPUs. It’s how they talk to each other inside the node (NVLink), how they synchronize gradients across nodes (InfiniBand), and how data flows without the CPU becoming a bottleneck. Compute for AI isn’t about raw horsepower. It’s about the right kind of power, connected the right way.

Training vs. inference: two different worlds

Before choosing any SKU, you need to know which workload will run. Training and inference look similar on the surface, but their infrastructure profiles are completely opposite.

| Dimension | Training | Inference |
| --- | --- | --- |
| Workload pattern | Batch, runs for hours/days/weeks | Real-time, millisecond responses |
| GPU demand | Saturates all available cores | Often runs on a single GPU (or CPU) |
| Memory pressure | GPU memory-bound (weights + gradients + optimizer states) | Compute-bound (forward pass only) |
| Scaling axis | Scale up (bigger GPUs, more nodes) | Scale out (more replicas behind a load balancer) |
| Cost model | Total job cost (hours × GPUs × price/hr) | Cost per request (latency × throughput × price) |
| Failure impact | Restart from last checkpoint, hours lost | Lost request, retry in milliseconds |
| Network sensitivity | Extreme: gradient sync every few seconds | Moderate: small payloads |
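
To make the inference side of the cost model concrete, here is a minimal sketch of cost per request. The function name and the 50 requests/sec figure are illustrative assumptions, not measurements; the hourly price is the approximate Standard_NC4as_T4_v3 price quoted later in this post.

```python
def cost_per_1k_requests(price_per_hour: float, requests_per_second: float) -> float:
    """Serving cost per 1,000 requests for one replica at a steady request rate."""
    requests_per_hour = requests_per_second * 3600
    return price_per_hour / requests_per_hour * 1000


# Assumption: one T4 replica at ~$0.53/hr sustaining 50 requests/sec.
cost = cost_per_1k_requests(price_per_hour=0.53, requests_per_second=50)
print(f"${cost:.4f} per 1,000 requests")
```

The sustained request rate is the number to measure first: halve it and the cost per request doubles, even though the VM price has not changed.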

Infra ↔ AI translation: Think of training as a massive batch job — like re-indexing a petabyte data warehouse. Think of inference as a high-traffic API endpoint — like your authentication service handling thousands of logins per second. The infra patterns you already know apply directly.

When CPU is enough

Not every AI workload needs a GPU. Lightweight inference scenarios (small classification models, embedding generation for search, edge deployment) run fine on Standard_D or Standard_F VMs. If the model fits comfortably in RAM and the latency requirement is above 50 ms, benchmark on CPU first. GPUs are expensive; don’t use them when you don’t need to.

Practical tip: ask the ML team two things before provisioning anything: (1) “Are we training or serving?” and (2) “How big is the model in parameters?” A 350-million-parameter model usually runs inference on CPU. A 70-billion-parameter one does not.
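
A rough way to turn the parameter count into a memory number is parameters × bytes per parameter. The sketch below covers weights only; activations, KV cache, and (for training) gradients and optimizer states come on top. The function name is made up for illustration.

```python
GIB = 1024 ** 3


def weights_gib(num_params: float, bytes_per_param: int) -> float:
    """Memory needed just to hold the model weights, in GiB."""
    return num_params * bytes_per_param / GIB


small = weights_gib(350e6, 4)  # 350M parameters in FP32 (4 bytes each)
large = weights_gib(70e9, 2)   # 70B parameters in FP16/BF16 (2 bytes each)
print(f"350M params, FP32: {small:.1f} GiB")
print(f"70B params, FP16:  {large:.1f} GiB")
```

The small model fits in the RAM of almost any VM. The large one needs about 130 GiB for the weights alone, which is more than a single 80 GB GPU can hold.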

Why GPUs dominate AI

A modern server CPU has 32 to 128 cores optimized for complex logic with branching. A GPU like the NVIDIA H100 has 16,896 CUDA cores and 528 Tensor Cores, all designed to do one thing extremely well: multiply matrices in parallel.

AI workloads are fundamentally matrix multiplication. Every layer of a neural network multiplies an input matrix by a weight matrix, adds a bias, and applies an activation function. The CPU processes this sequentially across a few dozen cores. The GPU processes thousands of these operations simultaneously.
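
Here is what one layer looks like when written out by hand. This is a deliberately naive sketch in plain Python (no framework does it this way); it is here to show that every output value is an independent multiply-accumulate, which is exactly the kind of work a GPU can run in parallel.

```python
def dense_layer(x, weights, bias):
    """Compute relu(x @ weights + bias) for a single input vector.

    x:       list of n_in floats
    weights: n_in rows, each a list of n_out floats
    bias:    list of n_out floats
    """
    out = []
    for j in range(len(bias)):        # each output neuron is independent of the others
        acc = bias[j]
        for i, x_i in enumerate(x):   # multiply-accumulate over the inputs
            acc += x_i * weights[i][j]
        out.append(max(0.0, acc))     # ReLU activation
    return out


y = dense_layer([1.0, 2.0], [[0.5, -1.0], [0.25, 0.5]], [0.0, -1.0])
print(y)
```

The outer loop is the part that parallelizes: nothing computed for one output neuron is needed by another.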

Infra ↔ AI translation: Think of the GPU like a SmartNIC that offloads packet processing from the CPU. Just as a SmartNIC handles millions of packets per second without overloading the host, the GPU offloads millions of matrix operations. The CPU orchestrates; the GPU does the heavy math.

CUDA Cores vs. Tensor Cores

Not all GPU cores are equal:

  • CUDA cores are general-purpose parallel processors that handle any floating-point math
  • Tensor Cores are specialized units that perform matrix multiply-and-accumulate in mixed-precision in a single clock cycle

For AI workloads using FP16 or BF16 (which is most training today), Tensor Cores deliver up to 8× the throughput of CUDA cores alone. When looking at GPU specs, pay attention to the Tensor Core count. That number defines your real AI performance more than the CUDA core count.

GPU VM families on Azure: the decision matrix

Choosing the right GPU VM family is the highest-impact decision you’ll make for an AI workload. Get it right and training finishes on time, within budget. Get it wrong and you burn money on idle hardware or wait days for results that should take hours.

| Family | Example SKU | GPUs | GPU Mem | Interconnect | Best For | ~Cost/hr |
| --- | --- | --- | --- | --- | --- | --- |
| NC T4 v3 | Standard_NC4as_T4_v3 | 1× T4 | 16 GiB | Ethernet | Cost-efficient inference, light training, dev/test | $0.53 |
| NC T4 v3 | Standard_NC64as_T4_v3 | 4× T4 | 64 GiB | Ethernet | Multi-model inference, batch scoring | $4.25 |
| ND A100 v4 | Standard_ND96asr_v4 | 8× A100 40GB | 320 GiB | InfiniBand 200 Gb/s | Distributed training, large model fine-tuning | $27.20 |
| ND H100 v5 | Standard_ND96isr_H100_v5 | 8× H100 80GB | 640 GiB | InfiniBand 400 Gb/s | Flagship training, LLMs, NCCL-optimized | $98.32 |
| NV A10 v5 | Standard_NV36ads_A10_v5 | 1× A10 (full) | 24 GiB | Ethernet | Visualization, light AI, dev/test | $1.80 |
| NV A10 v5 | Standard_NV6ads_A10_v5 | ⅙× A10 | 4 GiB | Ethernet | Fractional GPU for small workloads | $0.45 |
| D/E/F series | Standard_D16s_v5 | None | n/a | Accel. Networking | Preprocessing, data pipelines, CPU inference | $0.77 |

Approximate pay-as-you-go prices, East US. Always verify on the Azure Pricing Calculator.
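
With these prices you can put a number on the opening story. The sketch below assumes the job is billed as wall-clock hours × nodes × hourly price per node and ignores storage, egress, and setup time. The helper name is made up for illustration.

```python
def job_cost(hours: float, nodes: int, price_per_node_hour: float) -> float:
    """Total job cost: wall-clock hours x number of nodes x hourly price per node."""
    return hours * nodes * price_per_node_hour


cpu_cost = job_cost(hours=47, nodes=8, price_per_node_hour=0.77)     # 8x Standard_D16s_v5
gpu_cost = job_cost(hours=1.5, nodes=2, price_per_node_hour=27.20)   # 2x Standard_ND96asr_v4
speedup = 47 / 1.5

print(f"CPU cluster: ${cpu_cost:.2f}")
print(f"GPU cluster: ${gpu_cost:.2f}")
print(f"Speedup:     {speedup:.1f}x")
```

The GPU nodes cost about 35× more per hour, yet the job comes out cheaper because it finishes roughly 31× faster on a quarter of the nodes.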

Warning: The original ND-series (ND6s, ND12s, ND24s, ND24rs) was retired in September 2023. If you find Terraform templates or blog posts referencing those SKUs, they’ll fail on deploy. The current ND-series is Standard_ND96asr_v4 (A100) and Standard_ND96isr_H100_v5 (H100).

How to choose

For inference: start with Standard_NC4as_T4_v3. The T4 is NVIDIA’s inference workhorse: supports INT8 and FP16, has dedicated Tensor Cores, and costs a fraction of the A100. If the model fits in 16 GiB of GPU memory, start here.

For training: depends on model size. Fine-tuning a model with fewer than 10B parameters? A single Standard_ND96asr_v4 node with eight A100s may suffice. Training a 70B+ model from scratch? Multiple Standard_ND96isr_H100_v5 nodes connected via InfiniBand, running DeepSpeed or PyTorch FSDP.

For dev/test: use Standard_NV6ads_A10_v5 (fractional GPU) or CPU-only VMs. Don’t burn ND-series quota on Jupyter notebooks.
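
The rules above fit in a few lines of code. This is only a sketch of the starting points listed in this section, not a sizing tool. The function name is hypothetical, and the two fallback branches (inference models over 16 GiB, training between 10B and 70B parameters) are my own placeholders because the rules above don't cover them.

```python
def suggest_sku(workload: str, params_billions: float = 0.0, model_gib: float = 0.0) -> str:
    """Return a starting-point SKU for 'dev', 'inference', or 'training'."""
    if workload == "dev":
        return "Standard_NV6ads_A10_v5"
    if workload == "inference":
        if model_gib <= 16:
            return "Standard_NC4as_T4_v3"
        # Assumption: not covered by the rules above.
        return "larger GPU SKU (benchmark first)"
    if workload == "training":
        if params_billions < 10:
            return "1x Standard_ND96asr_v4"
        if params_billions >= 70:
            return "multi-node Standard_ND96isr_H100_v5"
        # Assumption: the 10B-70B range is not covered by the rules above.
        return "multi-node ND-series (size by benchmark)"
    raise ValueError(f"unknown workload: {workload!r}")


print(suggest_sku("inference", model_gib=12))
print(suggest_sku("training", params_billions=7))
print(suggest_sku("training", params_billions=70))
```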

Check availability before anything else:

az vm list-skus \
  --location eastus2 \
  --resource-type virtualMachines \
  --query "[?contains(name,'Standard_N')].{Name:name, Zones:locationInfo[0].zones, Restrictions:restrictions[0].reasonCode}" \
  -o table

If the Restrictions column shows NotAvailableForSubscription, you need to request a quota increase in the Azure portal under Subscriptions → Usage + quotas.
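
If you would rather script this check than read a table, the same command with -o json can be filtered in a few lines. The sketch below assumes the JSON is a list of SKU objects with name and restrictions fields; the sample data is hypothetical and trimmed to the fields the code reads.

```python
import json


def restricted_skus(skus_json: str) -> dict:
    """Map SKU name -> list of restriction reason codes, for restricted SKUs only."""
    result = {}
    for sku in json.loads(skus_json):
        reasons = [r.get("reasonCode") for r in sku.get("restrictions") or []]
        if reasons:
            result[sku["name"]] = reasons
    return result


# Hypothetical sample, standing in for the output of: az vm list-skus ... -o json
sample = json.dumps([
    {"name": "Standard_NC4as_T4_v3", "restrictions": []},
    {"name": "Standard_ND96asr_v4",
     "restrictions": [{"type": "Location", "reasonCode": "NotAvailableForSubscription"}]},
])

print(restricted_skus(sample))
```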

Clustering: when one VM isn’t enough

Three reasons to distribute an AI workload: the model is too large for one GPU’s memory, training is too slow on a single node, or you need to serve more inference requests than one VM can handle. Each reason points to a different clustering strategy.

| Platform | Best For | GPU Support | Scaling | Complexity |
| --- | --- | --- | --- | --- |
| AKS | Inference at scale, microservices | GPU node pools, device plugin, taints | HPA + Cluster Autoscaler | Medium |
| Azure Machine Learning | Experiment tracking, managed training | Managed compute clusters, auto-provisioning | Built-in, job-based | Low |
| VMSS | Homogeneous GPU workloads, batch | Custom images with pre-installed drivers | Instance-based autoscaling | Low-Medium |
| Ray / DeepSpeed / Horovod | Distributed training frameworks | Run on top of AKS or VMs | Managed by framework | High |

AKS for GPU workloads

AKS is the most common platform for serving AI models at scale. When you add GPU VMs to an AKS cluster, three things need to be configured correctly: the taint on the node pool, the NVIDIA device plugin, and the tolerations on your pods.

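GPU node pools should carry a taint so non-GPU workloads don’t land on expensive nodes. AKS does not add one for you; the convention in the AKS documentation is to set it when you create the node pool (--node-taints on az aks nodepool add):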

sku=gpu:NoSchedule

Your GPU pods need a matching toleration and must explicitly request GPU resources:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-inference
spec:
  tolerations:
  - key: "sku"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
  containers:
  - name: model-server
    image: myregistry.azurecr.io/model-server:latest
    resources:
      limits:
        nvidia.com/gpu: 1

The NVIDIA device plugin (a DaemonSet; v0.18.0 at the time of writing) runs on GPU nodes and exposes nvidia.com/gpu as a schedulable resource to Kubernetes. Without it, Kubernetes doesn’t even know GPUs exist on the node.

Warning: The toleration key must match the taint on the node pool. With the taint shown above, that key is sku, not nvidia.com/gpu (which is the resource name, not a taint). Many tutorials mix the two up, which leaves your pods stuck in Pending forever.
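
To see why a wrong key leaves a pod Pending, it helps to spell out how the scheduler compares a toleration with a taint. The function below is a simplified sketch of the Kubernetes matching rules (it ignores tolerationSeconds and is not the scheduler's actual code).

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if a pod toleration matches a node taint (simplified rules)."""
    # An empty effect on the toleration matches every taint effect.
    effect = toleration.get("effect")
    if effect and effect != taint["effect"]:
        return False

    key = toleration.get("key")
    operator = toleration.get("operator", "Equal")  # Equal is the default operator
    if operator == "Exists":
        # Exists ignores the value; an empty key with Exists matches every taint.
        return not key or key == taint["key"]
    # Equal: both key and value must match.
    return key == taint["key"] and toleration.get("value", "") == taint.get("value", "")


gpu_taint = {"key": "sku", "value": "gpu", "effect": "NoSchedule"}
good = {"key": "sku", "operator": "Equal", "value": "gpu", "effect": "NoSchedule"}
wrong_key = {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}

print(tolerates(good, gpu_taint))       # toleration from the pod spec above
print(tolerates(wrong_key, gpu_taint))  # toleration copied from a tutorial with a different key
```

The second toleration is perfectly valid YAML; it just doesn't match this taint, so the scheduler never considers the GPU nodes for that pod.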

Networking: the hidden multiplier

A fact that surprises most infra engineers when they first encounter AI workloads: the network is often the bottleneck, not the GPU. In distributed training, GPUs need to synchronize gradients after each forward-backward pass. With eight GPUs per node and multiple nodes, this synchronization generates tens of gigabytes of network traffic every few seconds. If the network can’t keep up, GPUs sit idle waiting for data, and you’re paying for expensive silicon that’s doing nothing.

InfiniBand and RDMA

InfiniBand enables RDMA (Remote Direct Memory Access): one machine reads from or writes to another machine’s GPU memory without involving the CPU on either side. Gradient synchronization happens directly between GPUs across nodes, completely bypassing the operating system’s network stack.

On Azure, InfiniBand is available on:

  • Standard_ND96asr_v4 — 200 Gb/s InfiniBand (HDR)
  • Standard_ND96isr_H100_v5 — 400 Gb/s InfiniBand (NDR)

For distributed training with NCCL (NVIDIA Collective Communications Library), InfiniBand delivers 10× or more throughput compared to TCP/IP over Ethernet. NCCL detects and uses InfiniBand automatically when available.
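
A back-of-the-envelope estimate shows why link speed matters so much. In a ring all-reduce, each GPU sends (and receives) roughly 2 × (N − 1) / N times the gradient size per synchronization. The sketch below uses a hypothetical 7B-parameter model with FP16 gradients on 16 GPUs and a hypothetical 25 Gb/s Ethernet link for comparison. It assumes line rate and no overlap with compute, and it ignores that NCCL uses NVLink for the hops inside a node, so treat the result as an order-of-magnitude figure only.

```python
def ring_allreduce_bytes_per_gpu(gradient_bytes: float, num_gpus: int) -> float:
    """Bytes each GPU sends during one ring all-reduce of the full gradient."""
    return 2 * (num_gpus - 1) / num_gpus * gradient_bytes


def transfer_seconds(num_bytes: float, link_gbps: float) -> float:
    """Time to move num_bytes over a link running at link_gbps (gigabits per second)."""
    return num_bytes * 8 / (link_gbps * 1e9)


gradient_bytes = 7e9 * 2  # assumption: 7B parameters, 2 bytes per FP16 gradient
per_gpu = ring_allreduce_bytes_per_gpu(gradient_bytes, num_gpus=16)

print(f"Traffic per GPU per sync: {per_gpu / 1e9:.2f} GB")
print(f"At 200 Gb/s: {transfer_seconds(per_gpu, 200):.2f} s")
print(f"At  25 Gb/s: {transfer_seconds(per_gpu, 25):.2f} s")
```

Under these assumptions every synchronization takes about a second on HDR InfiniBand and more than eight seconds on the slower link, and that gap is paid on every training step.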

Accelerated Networking

For VMs that don’t support InfiniBand (NC-series, NV-series, D/E/F-series), Accelerated Networking uses SR-IOV to bypass the host’s virtual switch. Network latency drops from ~500 μs to ~25 μs, and throughput reaches the VM’s maximum. No extra cost; just make sure it’s enabled on the NIC.

Network comparison table

| Feature | Throughput | Latency | Available On | Use Case |
| --- | --- | --- | --- | --- |
| InfiniBand NDR | 400 Gb/s | < 2 μs | ND H100 v5 | Multi-node LLM training |
| InfiniBand HDR | 200 Gb/s | < 2 μs | ND A100 v4 | Distributed training |
| Accelerated Networking | Up to 100 Gb/s | ~25 μs | Most D/E/F/N series | Inference, data pipelines |
| Standard Ethernet | Up to 100 Gb/s | ~500 μs | All VMs | General workloads |

Proximity placement groups

Deploying distributed training nodes across different availability zones adds cross-zone latency that can reduce training throughput by 30-50%. For multi-node jobs, always use a proximity placement group:

# Create proximity placement group
az ppg create \
  --resource-group rg-ai-training \
  --name ppg-training-cluster \
  --location eastus2 \
  --intent-vm-sizes Standard_ND96asr_v4

# Create VMSS inside the proximity placement group
az vmss create \
  --resource-group rg-ai-training \
  --name vmss-training \
  --image Ubuntu2204 \
  --vm-sku Standard_ND96asr_v4 \
  --instance-count 4 \
  --ppg ppg-training-cluster \
  --accelerated-networking true

Troubleshooting tip: when investigating slow distributed training, check network throughput before blaming the GPUs. Run ib_write_bw (InfiniBand bandwidth test) between nodes. If it’s significantly below the expected 200 or 400 Gb/s, the problem is network configuration, not model code.
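
Once you have bandwidth numbers for each node pair, flagging the slow links is easy to automate. In this sketch the results are assumed to be already converted to Gb/s and collected into a dict; the 80% threshold and the node names are arbitrary assumptions, so pick a threshold that matches what your healthy links actually deliver.

```python
def slow_links(measured_gbps: dict, expected_gbps: float, min_ratio: float = 0.8) -> list:
    """Return the node pairs whose measured bandwidth is below min_ratio x expected."""
    threshold = expected_gbps * min_ratio
    return sorted(pair for pair, gbps in measured_gbps.items() if gbps < threshold)


# Hypothetical measurements between ND A100 v4 nodes (expected: 200 Gb/s).
results = {"node0->node1": 196.0, "node0->node2": 92.5, "node1->node2": 188.3}
print(slow_links(results, expected_gbps=200))
```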

Hands-on: create your first GPU VM

Time to get your hands dirty. We’ll provision a GPU VM, install NVIDIA drivers, and validate the GPU is operational. We’ll use Standard_NC4as_T4_v3: the cheapest full-GPU SKU in the table above and perfect for learning.

Step 0: define variables

RESOURCE_GROUP="rg-ai-lab"
LOCATION="eastus2"
VM_NAME="vm-gpu-lab"
VM_SIZE="Standard_NC4as_T4_v3"
ADMIN_USER="azureuser"

Step 1: check quota

az vm list-skus \
  --location $LOCATION \
  --size $VM_SIZE \
  --resource-type virtualMachines \
  --query "[].{Name:name, Restrictions:restrictions[0].reasonCode}" \
  -o table

If it shows NotAvailableForSubscription, request a quota increase in the portal.

Step 2: create the resource group

az group create \
  --name $RESOURCE_GROUP \
  --location $LOCATION

Step 3: create the GPU VM

az vm create \
  --resource-group $RESOURCE_GROUP \
  --name $VM_NAME \
  --image Ubuntu2204 \
  --size $VM_SIZE \
  --admin-username $ADMIN_USER \
  --generate-ssh-keys \
  --accelerated-networking true \
  --public-ip-sku Standard

This provisions an Ubuntu 22.04 VM with one NVIDIA T4, 4 vCPUs, and 28 GiB of RAM.

Step 4: install NVIDIA drivers (via VM Extension)

The VM Extension is the recommended approach. It installs the correct driver version, signs the kernel module for Secure Boot, and integrates with Azure update management:

az vm extension set \
  --resource-group $RESOURCE_GROUP \
  --vm-name $VM_NAME \
  --name NvidiaGpuDriverLinux \
  --publisher Microsoft.HpcCompute \
  --version 1.6

Monitor progress (takes 5-10 minutes):

az vm extension show \
  --resource-group $RESOURCE_GROUP \
  --vm-name $VM_NAME \
  --name NvidiaGpuDriverLinux \
  --instance-view \
  --query "{Status:provisioningState, Message:instanceView.statuses[0].message}" \
  -o table

Step 5: validate the GPU

SSH into the VM and confirm the GPU is recognized:

ssh $ADMIN_USER@$(az vm show \
  --resource-group $RESOURCE_GROUP \
  --name $VM_NAME \
  --show-details \
  --query publicIps -o tsv)

Once connected:

nvidia-smi

You should see a Tesla T4 with ~15 GiB of available memory, driver version, and CUDA version. If nvidia-smi returns “command not found”, the extension hasn’t finished installing yet.
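
For scripted checks, nvidia-smi can also print machine-readable output: nvidia-smi --query-gpu=name,memory.total,memory.used,utilization.gpu,temperature.gpu --format=csv,noheader,nounits. The sketch below parses that format; the sample line is hypothetical, and the parser assumes every numeric field is actually reported (some GPUs print [N/A] for certain fields).

```python
import csv
import io

FIELDS = ["name", "memory_total_mib", "memory_used_mib", "utilization_pct", "temperature_c"]


def parse_nvidia_smi(csv_text: str) -> list:
    """Parse 'csv,noheader,nounits' output into one dict per GPU."""
    gpus = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if not row:
            continue  # skip blank lines
        record = dict(zip(FIELDS, (cell.strip() for cell in row)))
        for key in FIELDS[1:]:
            record[key] = int(record[key])  # all fields except the name are integers
        gpus.append(record)
    return gpus


sample = "Tesla T4, 15360, 3, 0, 34\n"  # hypothetical output for one idle T4
gpu = parse_nvidia_smi(sample)[0]
print(gpu["name"], gpu["memory_total_mib"] / 1024, "GiB")
```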

Step 6: cleanup

GPU VMs are expensive even when idle. Delete the resource group when done:

az group delete --name $RESOURCE_GROUP --yes --no-wait

Real cost: A Standard_NC4as_T4_v3 costs ~$0.53/hr. Manageable for a lab. But a Standard_ND96isr_H100_v5 costs ~$98/hr. Leaving one running for a 48-hour weekend costs more than $4,700. Always configure cost alerts and auto-shutdown policies for GPU VMs.
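
The arithmetic behind that warning, plus what auto-shutdown buys you on the lab VM. The 48-hour weekend, the 30-day month, and the 8 hours × 22 working days schedule are assumptions for illustration.

```python
def running_cost(price_per_hour: float, hours: float) -> float:
    """Compute cost for a VM that stays allocated for the given number of hours."""
    return price_per_hour * hours


weekend_h100 = running_cost(98.32, 48)          # ND H100 v5 node left on for a weekend
lab_always_on = running_cost(0.53, 24 * 30)     # T4 lab VM, never shut down
lab_office_hours = running_cost(0.53, 8 * 22)   # T4 lab VM with auto-shutdown

print(f"H100 node, idle weekend: ${weekend_h100:,.2f}")
print(f"T4 lab VM, always on:    ${lab_always_on:,.2f}/month")
print(f"T4 lab VM, office hours: ${lab_office_hours:,.2f}/month")
```

Even on the cheapest GPU SKU, shutting the VM down outside working hours cuts the monthly bill by roughly three quarters.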

Monitoring GPU workloads

GPU infrastructure needs specific observability. Traditional CPU metrics (load average, memory usage) tell you nothing about whether the GPU is being utilized or starving.

| Metric | Tool | What it tells you |
| --- | --- | --- |
| GPU utilization (%) | nvidia-smi, DCGM Exporter | Is the GPU computing or idle? |
| GPU memory used (GiB) | nvidia-smi, DCGM Exporter | Close to OOM (out-of-memory)? |
| GPU temperature (°C) | nvidia-smi, DCGM Exporter | Thermal throttling? GPUs reduce clock above 83°C |
| Inference latency (P50/P95/P99) | App Insights, OpenTelemetry | User experience, SLA compliance |
| Token throughput (tokens/sec) | Application logs, Azure OpenAI metrics | Model serving efficiency |

Recommended setup: Deploy NVIDIA DCGM Exporter as a DaemonSet on AKS GPU node pools. It exposes GPU metrics in Prometheus format, which Azure Managed Prometheus scrapes automatically. Combine with pre-built Grafana dashboards for GPU utilization, memory, temperature, and error rates.
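
Whatever collects the metrics, the alerting logic on top is simple. Here is a sketch of the three GPU checks from the table as code. Only the 83°C figure comes from the table; the utilization and memory thresholds are assumptions you should tune for your workload.

```python
THROTTLE_TEMP_C = 83      # from the table above
LOW_UTIL_PCT = 20         # assumption: below this the GPU is mostly waiting
HIGH_MEM_RATIO = 0.90     # assumption: above this an OOM is close


def gpu_findings(util_pct: float, mem_used_gib: float, mem_total_gib: float, temp_c: float) -> list:
    """Return a list of findings for one GPU sample; an empty list means healthy."""
    findings = []
    if util_pct < LOW_UTIL_PCT:
        findings.append("underutilized")
    if mem_used_gib / mem_total_gib >= HIGH_MEM_RATIO:
        findings.append("near-oom")
    if temp_c > THROTTLE_TEMP_C:
        findings.append("thermal-throttling-risk")
    return findings


print(gpu_findings(util_pct=5, mem_used_gib=2.0, mem_total_gib=16.0, temp_c=60))
print(gpu_findings(util_pct=95, mem_used_gib=15.5, mem_total_gib=16.0, temp_c=85))
```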

Next up

Now that you know which VMs to provision and how to connect them, it’s time to look inside the GPU. In the next post, we’ll do a deep dive into GPU architecture: CUDA memory hierarchy, multi-GPU strategies, the driver ecosystem, and how to read nvidia-smi output like a pro. You don’t need to write CUDA kernels, but understanding what happens inside the silicon will make you a better troubleshooter and a more efficient capacity planner.