Fourth post in the series. In the previous one, you learned which GPU VMs to provision and how to connect them. Now we’re going to look inside the GPU to understand what happens at the silicon level. Not to write CUDA kernels, but to be a better troubleshooter and have informed conversations with the ML team.

The 2 AM ticket

Slack fires at 2 AM. The ML team’s training job crashed again. The error is a single line:

CUDA out of memory. Tried to allocate 2.00 GiB

The data science lead is frustrated: “The model has 7 billion parameters in FP16. That’s only 14 GB. The A100 has 40 GB of memory. There should be 26 GB to spare. What’s going on?”

You SSH in, run nvidia-smi, and see memory usage at 100%. But the math doesn’t add up: 14 GB of weights don’t fill 40 GB. Unless something else is consuming the rest. And it is. Model parameters are just one piece of the memory puzzle. Gradients, optimizer states, and activations each claim their own slice. A “14 GB model” needs 90+ GB to train with full-precision Adam.

This post gives you the knowledge to answer that question and dozens like it.

GPU architecture for infrastructure engineers

You don’t need to design circuits. But you need a mental model of what’s inside the box, because it explains why workloads behave the way they do, why certain errors appear, and why certain SKUs are dramatically better than others for the same job.

Streaming Multiprocessors (SMs)

A GPU is built from repeated units called Streaming Multiprocessors (SMs). Each SM is an independent processor with its own cores, cache, and scheduling hardware. The A100 has 108 SMs. The H100 has 132.

Think of each SM as a small, self-contained factory floor. It has its own workers (cores), local storage (shared memory and registers), and its own scheduler.

CUDA Cores vs. Tensor Cores

Each SM contains two kinds of cores (the counts below are totals per GPU):

  • CUDA Cores: general-purpose parallel processors that handle floating-point and integer math. A100 = 6,912; H100 = 16,896.
  • Tensor Cores: specialized units that perform a small matrix multiply-and-accumulate as a single hardware operation. A100 = 432 (3rd gen); H100 = 528 (4th gen). They’re the reason modern GPUs dominate AI.
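As a quick sanity check, the per-GPU totals divide evenly across the SM counts quoted earlier. A minimal sketch; the dictionary layout and the per_sm helper are illustrative names, not part of any NVIDIA tool:

```python
# Per-GPU figures quoted in this post: SM count, CUDA cores, Tensor Cores.
GPUS = {
    "A100": {"sms": 108, "cuda_cores": 6912, "tensor_cores": 432},
    "H100": {"sms": 132, "cuda_cores": 16896, "tensor_cores": 528},
}

def per_sm(gpu: str) -> tuple[int, int]:
    """Return (CUDA cores per SM, Tensor Cores per SM) for a GPU in GPUS."""
    spec = GPUS[gpu]
    return spec["cuda_cores"] // spec["sms"], spec["tensor_cores"] // spec["sms"]

for name in GPUS:
    cuda, tensor = per_sm(name)
    print(f"{name}: {cuda} CUDA cores and {tensor} Tensor Cores per SM")
```

The H100 doubles the CUDA cores per SM while keeping the same number of Tensor Cores per SM; the Tensor Cores themselves are a newer generation.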

Infra ↔ AI translation: A GPU is like a 100-lane highway where each lane carries math operations simultaneously. A CPU is a 4-lane highway where each lane can handle curves, exits, and complex decisions. For matrix multiplication (the foundation of AI), you want the highway.

When you have multiple GPUs in the same node, they need to communicate. PCIe provides a baseline connection (about 64 GB/s bidirectional for a Gen4 x16 link), but for serious multi-GPU training, you need NVLink:

| GPU | NVLink bandwidth (bidirectional) |
|---|---|
| A100 | 600 GB/s |
| H100 | 900 GB/s |
| B200 | 1.8 TB/s |
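To get a feel for what these numbers mean, here is a rough sketch that estimates how long it would take to move 14 GB (the FP16 gradients of a 7B model) over each link. It assumes the peak figures quoted above and ignores protocol overhead and collective-communication patterns, so treat the results as best-case lower bounds. The names LINK_GBPS and transfer_ms are made up for this example.

```python
# Peak bidirectional bandwidth figures from this post, in GB/s.
LINK_GBPS = {"PCIe": 64, "NVLink A100": 600, "NVLink H100": 900, "NVLink B200": 1800}

def transfer_ms(payload_gb: float, link: str) -> float:
    """Idealized time to move payload_gb over a link, in milliseconds."""
    return payload_gb / LINK_GBPS[link] * 1000

for link in LINK_GBPS:
    print(f"{link:12s} {transfer_ms(14, link):7.1f} ms for 14 GB")
```

Even in this idealized form, the gap is roughly an order of magnitude, and gradient synchronization pays that cost on every training step.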

Check NVLink with:

nvidia-smi topo -m

If it shows PIX or PHB between GPUs instead of NV#, you’re on PCIe, not NVLink. Confirm you’re on the right SKU (ND-series) before debugging performance issues.
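If you want to automate that check, the sketch below classifies the cell labels from the topology matrix. It assumes you have already extracted the GPU-to-GPU cells into a list (parsing the full matrix is left out), and it relies on the labels from the legend that nvidia-smi prints under the matrix. The function names are hypothetical.

```python
def link_type(label: str) -> str:
    """Classify one cell of the `nvidia-smi topo -m` matrix."""
    if label.startswith("NV"):
        return "NVLink"          # NV# = number of bonded NVLinks
    if label in {"PIX", "PXB", "PHB", "NODE", "SYS"}:
        return "PCIe"            # traffic crosses PCIe and possibly the CPU
    if label == "X":
        return "self"            # diagonal of the matrix
    return "unknown"

def all_nvlink(labels: list[str]) -> bool:
    """True if every GPU-to-GPU cell (ignoring the diagonal) is NVLink."""
    peers = [cell for cell in labels if link_type(cell) != "self"]
    return bool(peers) and all(link_type(cell) == "NVLink" for cell in peers)

print(all_nvlink(["X", "NV12", "NV12"]))
print(all_nvlink(["X", "NV12", "PIX"]))
```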

GPU memory: the resource you’ll manage most

If you remember one section from this post, make it this one. GPU memory — specifically running out of it — is the #1 problem you’ll troubleshoot in AI infrastructure.

Memory hierarchy

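| Layer | A100 spec | H100 spec | Analogy |
|---|---|---|---|
| HBM (High Bandwidth Memory) | 40 or 80 GB, 1.6-2 TB/s | 80 GB, 3.35 TB/s | System RAM |
| L2 Cache | 40 MB | 50 MB | CPU L3 cache |
| Shared Memory / L1 | Up to 164 KB shared per SM (192 KB combined with L1) | Up to 228 KB shared per SM (256 KB combined with L1) | CPU L1/L2 cache |
| Registers | 256 KB per SM | 256 KB per SM | CPU registers |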

HBM is what nvidia-smi reports. It’s the GPU’s “main memory” — where model weights, training data, and intermediate results live.

What fills memory during training

Four main consumers:

1. Model Parameters (the weights): straightforward sizing, parameters × bytes per parameter. 7B params in FP16 (2 bytes each) = ~14 GB.

2. Gradients: one gradient per parameter during backpropagation. 7B × 2 bytes = another ~14 GB.

3. Optimizer States (the hidden killer): the Adam optimizer maintains two additional values per parameter (momentum and variance), stored in FP32 regardless of model precision. 7B × 4 bytes × 2 states = ~56 GB just for the optimizer.

4. Activations: intermediate results from each layer, saved during the forward pass and consumed during backpropagation. Their size depends on architecture and batch size.

The math that saves your weekend

Total GPU Memory ≈ Parameters + Gradients + Optimizer States + Activations

For the 7B model from the ticket:

| Component | Calculation | Memory |
|---|---|---|
| Parameters (FP16) | 7B × 2 bytes | ~14 GB |
| Gradients (FP16) | 7B × 2 bytes | ~14 GB |
| Optimizer States (FP32, Adam) | 7B × 4 bytes × 2 | ~56 GB |
| Activations (varies) | Depends on batch size | ~8-20 GB |
| Total | | ~92-104 GB |

A “14 GB model” needs 90+ GB to train. An A100-40GB never stood a chance. Even an A100-80GB is tight.
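The same arithmetic is easy to script, so you can answer the next ticket in seconds. This is a minimal sketch under the assumptions used in this post (FP16 weights and gradients, two FP32 Adam states, 1 GB = 10⁹ bytes); the function name and parameters are illustrative, and the activation figure is something you have to supply or measure.

```python
def training_memory_gb(params_billion: float,
                       param_bytes: int = 2,      # FP16/BF16 weights
                       grad_bytes: int = 2,       # FP16/BF16 gradients
                       optim_bytes: int = 4,      # FP32 optimizer values
                       optim_states: int = 2,     # Adam: momentum + variance
                       activations_gb: float = 0.0) -> dict[str, float]:
    """Estimate training memory in GB, broken down by consumer."""
    # 1e9 parameters × N bytes = N GB, so billions × bytes gives GB directly.
    parts = {
        "parameters": params_billion * param_bytes,
        "gradients": params_billion * grad_bytes,
        "optimizer": params_billion * optim_bytes * optim_states,
        "activations": activations_gb,
    }
    parts["total"] = sum(parts.values())
    return parts

for name, gb in training_memory_gb(7, activations_gb=8).items():
    print(f"{name:12s} {gb:6.1f} GB")
```

Setups that also keep an FP32 master copy of the weights would need one more term; it is left out here to match the table.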

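Rule of thumb: when an ML engineer says “the model is X gigabytes,” they almost always mean the parameter size (the checkpoint file). Training memory is typically 4-8× larger: in the 7B example, parameters, gradients, and Adam states alone come to 84 GB, or 6× the checkpoint size, before activations. Multiply by at least 4× for a quick estimate with Adam.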

Gradient checkpointing (activation recomputation) trades compute for memory: instead of saving all activations during the forward pass, it saves only some and recomputes the rest during backpropagation. Reduces training speed by ~20-30% but cuts activation memory by 60-80%.
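To see what that range means for the 7B example, here is a small sketch that applies the 60-80% saving to an activation budget. The percentages are the rough figures quoted above, not measurements, and the helper name is made up.

```python
def with_checkpointing(activations_gb: float, saving: float) -> float:
    """Activation memory left after gradient checkpointing.

    saving is the fraction of activation memory removed (0.6 to 0.8 per the text).
    """
    if not 0.0 <= saving <= 1.0:
        raise ValueError("saving must be between 0 and 1")
    return activations_gb * (1.0 - saving)

# 20 GB of activations is the upper end of the 7B example.
for saving in (0.6, 0.8):
    left = with_checkpointing(20, saving)
    print(f"{saving:.0%} saving leaves about {left:.0f} GB of activations")
```

Note that this only shrinks the activation slice. Parameters, gradients, and optimizer states are untouched, which is why checkpointing alone would not rescue the job from the ticket.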

Precision: trading accuracy for speed and memory

| Format | Bits | Bytes/param | Range | Use case |
|---|---|---|---|---|
| FP32 | 32 | 4 | ±3.4 × 10³⁸ | Full-precision training, master weights |
| TF32 | 19* | 4 (stored) | Same as FP32 | Default on A100+ for matmul, transparent |
| BF16 | 16 | 2 | ±3.4 × 10³⁸ | Preferred for training (same range as FP32) |
| FP16 | 16 | 2 | ±65,504 | Training with loss scaling, inference |
| INT8 | 8 | 1 | -128 to 127 | Quantized inference |
| INT4 | 4 | 0.5 | -8 to 7 | Aggressively quantized inference |

BF16 is the current sweet spot for training. It maintains the same exponent range as FP32 but uses half the memory. Most modern pipelines use BF16.

INT8/INT4 are for inference. A model trained in BF16 can be quantized to INT8 or INT4 after training, drastically reducing memory with slight quality loss.
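The bytes-per-parameter column translates directly into weight footprints. A minimal sketch for the 7B model, assuming weights only; real quantized checkpoints also store scales and other metadata, so actual files are somewhat larger. The names below are illustrative.

```python
# Bytes per parameter, taken from the precision table in this post.
BYTES_PER_PARAM = {"FP32": 4, "BF16": 2, "FP16": 2, "INT8": 1, "INT4": 0.5}

def weights_gb(params_billion: float, fmt: str) -> float:
    """Weight-only footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[fmt]

for fmt in BYTES_PER_PARAM:
    print(f"7B in {fmt}: {weights_gb(7, fmt):.1f} GB")
```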

Infra ↔ AI translation: Think of precision like JPEG quality. FP32 is RAW (highest quality, largest file). BF16 is high-quality JPEG (imperceptibly different, half the size). INT4 is a thumbnail (visibly lossy, but loads instantly).

Multi-GPU strategies

When the model doesn’t fit on one GPU or training takes days:

Data Parallelism (DP)

Full copy of the model on each GPU. Each GPU processes a different batch. After each step, GPUs synchronize gradients via all-reduce. With a fast interconnect, throughput scales nearly linearly: 8 GPUs ≈ 8× throughput.

Catch: each GPU needs to hold the entire model + gradients + optimizer states.
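The gradient synchronization step is easier to picture with a toy simulation. The sketch below is plain Python standing in for what NCCL does across real GPUs: each "GPU" contributes its gradients, and every one ends up with the same element-wise average. Whether the result is a mean or a sum depends on the framework; a mean is assumed here.

```python
def all_reduce_mean(per_gpu_grads: list[list[float]]) -> list[list[float]]:
    """Simulate a gradient all-reduce: every GPU ends up with the element-wise mean."""
    n = len(per_gpu_grads)
    mean = [sum(vals) / n for vals in zip(*per_gpu_grads)]
    # Each rank receives its own copy of the same averaged gradient.
    return [list(mean) for _ in range(n)]

# Two "GPUs", each with gradients computed on a different batch.
grads = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
synced = all_reduce_mean(grads)
print(synced[0])
print(synced[0] == synced[1])
```

Because every GPU applies the same averaged gradient, the model copies stay identical from step to step. The cost is that the full gradient tensor crosses the interconnect on every step.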

DeepSpeed ZeRO: the memory limit destroyer

| Stage | What’s partitioned | Savings per GPU | Communication overhead |
|---|---|---|---|
| ZeRO-1 | Optimizer states | Optimizer memory divided by the number of GPUs | Minimal |
| ZeRO-2 | Optimizer states + Gradients | Gradient memory also divided by the number of GPUs | Moderate |
| ZeRO-3 | Optimizer + Gradients + Parameters | Everything sharded, maximum savings | Higher |

Back to the 7B example: training with Adam on one GPU needs ~84 GB for parameters, gradients, and optimizer states, plus 8-20 GB of activations. With 8 GPUs and ZeRO-3, each GPU holds only ⅛ of those model states: ~10.5 GB per GPU, plus its own activations. The model that couldn’t fit on an A100-80GB now trains comfortably on eight A100-40GBs.
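The effect of each ZeRO stage can be estimated with a few lines of arithmetic. This sketch covers model states only (parameters, gradients, optimizer states; 84 GB in total for the 7B example) and assumes FP16 weights and gradients with FP32 Adam states. Activations and framework buffers are not included, and the function name is illustrative.

```python
def zero_model_state_gb(params_billion: float, gpus: int, stage: int) -> float:
    """Per-GPU model-state memory (GB) for FP16 weights/grads and FP32 Adam.

    stage 0 = plain data parallelism, 1-3 = ZeRO stages. Activations excluded.
    """
    params = params_billion * 2          # FP16 parameters
    grads = params_billion * 2           # FP16 gradients
    optim = params_billion * 4 * 2       # FP32 momentum + variance
    if stage >= 1:
        optim /= gpus                    # ZeRO-1 shards optimizer states
    if stage >= 2:
        grads /= gpus                    # ZeRO-2 also shards gradients
    if stage >= 3:
        params /= gpus                   # ZeRO-3 also shards parameters
    return params + grads + optim

for stage in range(4):
    print(f"stage {stage}: {zero_model_state_gb(7, 8, stage):.2f} GB per GPU")
```

The jump from stage 0 to stage 1 is the largest because the optimizer states are the biggest consumer to begin with.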

FSDP (Fully Sharded Data Parallel)

PyTorch’s native answer to ZeRO-3. Same capability (full sharding of parameters, gradients, optimizer states), integrated directly into PyTorch’s distributed training API. From an infra perspective, FSDP and ZeRO-3 have similar requirements.

Pipeline Parallelism (PP)

Divides the model’s layers sequentially across GPUs: GPU 0 = layers 1-10, GPU 1 = layers 11-20, etc. Each GPU holds only a fraction of the parameters. Downside: pipeline bubbles (GPUs waiting for data).
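A simple way to picture the split is to assign contiguous layer ranges to GPUs. The sketch below divides by layer count only, which is an assumption made for illustration; real frameworks usually balance stages by compute and memory cost. The 32-layer model is hypothetical.

```python
def partition_layers(num_layers: int, num_gpus: int) -> list[range]:
    """Split layers 1..num_layers into contiguous stages, one per GPU."""
    base, extra = divmod(num_layers, num_gpus)
    stages, start = [], 1
    for gpu in range(num_gpus):
        size = base + (1 if gpu < extra else 0)  # spread the remainder over the first GPUs
        stages.append(range(start, start + size))
        start += size
    return stages

for gpu, stage in enumerate(partition_layers(32, 4)):
    print(f"GPU {gpu}: layers {stage.start}-{stage.stop - 1}")
```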

Tensor Parallelism (TP)

The most granular: splits individual layers across GPUs, which means GPUs exchange data inside every layer. In practice NVLink is mandatory. Running TP over PCIe is technically possible but practically useless.
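To make "splitting a layer" concrete, the sketch below splits a weight matrix by columns, multiplies each shard separately (one per "GPU"), and concatenates the partial results. It is a pure-Python illustration of column-wise sharding, one common tensor-parallel layout. It is not how any particular framework implements it, and it assumes the column count divides evenly across GPUs.

```python
def matmul(x: list[list[float]], w: list[list[float]]) -> list[list[float]]:
    """Plain matrix multiply: (m x k) @ (k x n)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def column_parallel_matmul(x, w, num_gpus: int):
    """Split W by columns, multiply each shard separately, then concatenate."""
    n = len(w[0])
    step = n // num_gpus  # assumes n divides evenly across GPUs
    shards = [[row[i:i + step] for row in w] for i in range(0, n, step)]
    partials = [matmul(x, shard) for shard in shards]  # one partial result per "GPU"
    # The concatenation stands in for the all-gather between GPUs.
    return [sum((p[r] for p in partials), []) for r in range(len(x))]

x = [[1.0, 2.0]]
w = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
print(column_parallel_matmul(x, w, 2) == matmul(x, w))
```

The gather step happens for every sharded layer on every forward and backward pass, which is why the interconnect matters so much more here than for data parallelism.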

3D Parallelism (100B+ models)

Combines all three:

  • TP within the node (over NVLink)
  • PP across a few nodes
  • DP (with ZeRO) across many nodes

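This combination is how the largest models are trained. Meta’s LLaMA 3 report, for example, describes combining tensor, pipeline, and data parallelism (plus context parallelism for long sequences).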

| Model size | Strategy | GPUs | Network required |
|---|---|---|---|
| < 1B params | Single GPU or DP | 1-8 | PCIe OK |
| 1-10B params | DP + ZeRO-2 | 4-16 | NVLink preferred |
| 10-70B params | ZeRO-3 / FSDP | 8-64 | NVLink + InfiniBand |
| 70-200B+ params | 3D Parallelism | 64-512+ | NVLink + InfiniBand mandatory |
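If you want this table in a capacity-planning script, a lookup like the one below is enough as a starting point. The boundaries in the table overlap (10B and 70B appear in two rows), so this sketch assumes the larger bucket wins at the boundary. The function name is made up, and the result is only a first guess to discuss with the ML team.

```python
def suggest_strategy(params_billion: float) -> str:
    """Map a parameter count (in billions) to the strategy row in the table."""
    if params_billion < 1:
        return "Single GPU or DP"
    if params_billion < 10:
        return "DP + ZeRO-2"
    if params_billion < 70:
        return "ZeRO-3 / FSDP"
    return "3D Parallelism"

for size in (0.5, 7, 13, 175):
    print(f"{size}B params -> {suggest_strategy(size)}")
```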

The NVIDIA software stack

Every GPU debugging session ends up in software compatibility. The stack is layered, where each level depends on the one below:

Model code (training script)
Framework (PyTorch 2.x, TensorFlow, JAX)
cuDNN (optimized DL primitives) + NCCL (multi-GPU)
CUDA Toolkit (libraries, runtime, compiler)
NVIDIA Driver (kernel module → GPU hardware)
GPU Hardware (A100, H100, etc.)

A mismatch at any level = problems ranging from cryptic error messages to silent crashes.

The container escape hatch: Use NVIDIA NGC images (nvcr.io/nvidia/pytorch) that bundle a tested combination of CUDA, cuDNN, NCCL, and framework. The driver itself stays on the host, so it still has to be recent enough for the CUDA version in the image:

# Pull official NVIDIA PyTorch container (monthly releases)
docker pull nvcr.io/nvidia/pytorch:24.05-py3

# Run with GPU access
docker run --gpus all -it nvcr.io/nvidia/pytorch:24.05-py3

Troubleshooting tip: Always collect three versions first:

# Driver version + max supported CUDA
nvidia-smi

# Installed CUDA Toolkit
nvcc --version

# CUDA that PyTorch was compiled against
python -c "import torch; print(torch.version.cuda)"

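A mismatch here is the most likely cause of the problem. The comparison that matters most: the CUDA version shown by nvidia-smi is the maximum the driver supports, and it should be at least as new as the CUDA version the framework was built against. The nvcc version can differ from PyTorch’s without breaking anything, because PyTorch wheels ship their own CUDA runtime.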
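Once you have the version strings, the comparison itself is simple. The sketch below is a deliberately conservative check: it flags any framework CUDA build newer than what the driver reports. CUDA's minor-version compatibility can relax this within the same major version, so treat a False result as "investigate", not "definitely broken". The function names and sample values are illustrative.

```python
def parse_version(text: str) -> tuple[int, int]:
    """Turn '12.2' (or '12.2.140') into (12, 2)."""
    parts = text.strip().split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return major, minor

def driver_supports(driver_max_cuda: str, framework_cuda: str) -> bool:
    """True if the driver's max CUDA version covers the framework's CUDA build."""
    return parse_version(driver_max_cuda) >= parse_version(framework_cuda)

# Example values: "CUDA Version" from nvidia-smi vs. torch.version.cuda.
print(driver_supports("12.2", "12.1"))  # driver reports a newer version: fine
print(driver_supports("12.2", "12.4"))  # framework build is newer: investigate
```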

Reading nvidia-smi like a pro

nvidia-smi is the “top” of the GPU world: the first command you run to see what the hardware is doing. Typical output:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08    Driver Version: 535.161.08    CUDA Version: 12.2               |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |          Memory-Usage  | GPU-Util  Compute M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB         On   | 00000001:00:00.0  Off  |                    0 |
| N/A   42C    P0              72W / 400W |  71458MiB / 81920MiB   |     94%      Default |
+-----------------------------------------+------------------------+----------------------+

Fields that matter

| Field | What it means | Healthy (training) | Problematic |
|---|---|---|---|
| GPU-Util | % of time at least one kernel was running | 85-100% | Below 50% |
| Memory-Usage | HBM in use | 70-95% | 100% (OOM) or < 30% (underutilized) |
| Temp | Temperature °C | 35-75°C | Above 83°C (throttling) |
| Pwr:Usage/Cap | Consumption vs. limit | 60-90% of cap | Below 30% (idle) |
| Perf | Performance state | P0 | P2+ during active job |
| ECC Errors | Memory errors | 0 | Any value > 0 |
| Persistence-M | Persistent driver | On | Off (adds latency) |

Essential commands

# Basic snapshot (90% of usage)
nvidia-smi

# Continuous monitoring (refresh every 5s)
nvidia-smi -l 5

# CSV output for scripting and dashboards
nvidia-smi --query-gpu=name,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.used --format=csv

# Compact real-time monitoring
nvidia-smi dmon -s u

# GPU topology: check NVLink
nvidia-smi topo -m

# Check ECC errors (hardware health)
nvidia-smi --query-gpu=ecc.errors.uncorrected.volatile.total --format=csv

# List GPU processes
nvidia-smi pmon -s u -c 1
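For dashboards or alerting, the CSV query above can be fed into a small parser. The sketch below assumes the same field list as that query, run with --format=csv,noheader,nounits so each line contains bare values. The sample line is illustrative, not captured output, and the thresholds are the "Problematic" values from the fields table in this post.

```python
import csv
import io

FIELDS = ["name", "temp_c", "gpu_util", "mem_util", "mem_total_mib", "mem_used_mib"]

def parse_gpu_rows(text: str) -> list[dict]:
    """Parse CSV lines produced with --format=csv,noheader,nounits."""
    rows = []
    for raw in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not raw:
            continue  # skip blank lines
        row = dict(zip(FIELDS, (cell.strip() for cell in raw)))
        for key in FIELDS[1:]:
            row[key] = float(row[key])
        rows.append(row)
    return rows

def warnings_for(row: dict) -> list[str]:
    """Apply the 'Problematic' thresholds from the fields table."""
    found = []
    if row["gpu_util"] < 50:
        found.append("low GPU utilization")
    if row["temp_c"] > 83:
        found.append("possible thermal throttling")
    if row["mem_used_mib"] / row["mem_total_mib"] < 0.30:
        found.append("memory underutilized")
    return found

sample = "NVIDIA A100-SXM4-80GB, 42, 94, 60, 81920, 71458\n"  # illustrative values
for gpu in parse_gpu_rows(sample):
    print(gpu["name"], warnings_for(gpu) or "looks healthy")
```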

The 7 GPU problems you’ll encounter

1. CUDA Out of Memory (OOM)

RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB

Fixes (in order): reduce batch size → enable gradient checkpointing → ZeRO-2/3 → mixed precision BF16 → bigger GPU.

2. CUDA Version Mismatch

CUDA error: no kernel image is available for execution on the device

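This usually means the framework build contains no kernels compiled for this GPU’s architecture (for example, an old PyTorch build on a new GPU). Fix: check the 3 versions (driver, toolkit, framework). Use an NGC container with versions tested together.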

3. GPU Not Found

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver.

Fix: verify VM SKU (NC/ND/NV?), check VM Extension status, reboot, reinstall driver.

4. ECC Errors (hardware failure)

nvidia-smi --query-gpu=ecc.errors.uncorrected.volatile.total --format=csv
# If it returns > 0: open a ticket with Azure for replacement

GPUs don’t support live migration on Azure. Expect downtime.

5. Thermal Throttling

Temp above 83°C, perf state drops from P0 to P2/P3, throughput drops 20-40%. In cloud, this is Azure’s problem. Document it and open a ticket.

6. Low GPU Utilization

GPU-Util < 50% during active training usually points to data starvation: the GPU is waiting for the input pipeline. Fixes: increase DataLoader num_workers, set pin_memory=True, cache data on local NVMe, use optimized formats (WebDataset, TFRecord).

7. NVLink Not Active

nvidia-smi topo -m shows PHB/PIX instead of NV#. Verify you’re on ND-series (NC/NV don’t have NVLink).

GPU generations on Azure

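| Generation | GPU | Azure VM | HBM | NVLink | InfiniBand |
|---|---|---|---|---|---|
| Volta (2017) | V100 | NC v3 | 16/32 GB | No | No |
| Ampere (2020) | A100 | ND A100 v4 | 40/80 GB | 600 GB/s | 200 Gb/s |
| Hopper (2022) | H100 | ND H100 v5 | 80 GB | 900 GB/s | 400 Gb/s |
| Blackwell (2024) | B200 | ND GB200 v6 | 192 GB | 1.8 TB/s | 400 Gb/s |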

Each generation raises HBM bandwidth substantially (up to 2 TB/s on the A100, 3.35 TB/s on the H100) and introduces new precision formats: Ampere brought TF32 and structured sparsity, Hopper brought FP8 and Transformer Engine, Blackwell brings FP4 and 192 GB of HBM per GPU.

Next up

Now that you understand what happens inside the GPU (architecture, memory, software stack, debugging), it’s time to automate everything around it. Next post: Infrastructure as Code for AI — how to template GPU clusters, inference endpoints, and training pipelines in a reproducible, versioned, and auditable way.