Twelfth post in the series. In the previous one, we operated Azure OpenAI with HA and correct retry patterns. Now: when things break (and they will break).
This post is organized as real-world failure scenarios. Each follows: Symptoms → Diagnosis → Root Cause → Resolution → Prevention. Read it once for pattern recognition. Then bookmark it; you’ll be back.
Scenario 1: NVIDIA driver crash after kernel update
Symptoms
Monday morning. The ML team reports that all GPU workloads failed over the weekend. Nobody deployed anything. You SSH in:
$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver.
Make sure that the latest NVIDIA driver is installed and running.
GPU containers won’t start. Training jobs are dead. The VM itself is fine; CPU workloads run normally.
Diagnosis
# Check kernel messages
dmesg | grep -i nvidia
# [ 4.212] NIST: module nvidia not found in modules.dep
# Current kernel
uname -r
# 6.5.0-44-generic
# Installed driver
dpkg -l | grep nvidia-driver
# nvidia-driver-535 535.183.01
# What happened
grep -A 5 "linux-image" /var/log/apt/history.log
# unattended-upgrade installed new kernel
Root cause
Ubuntu’s unattended-upgrades installed a new kernel automatically. The NVIDIA kernel module is compiled against a specific kernel version. When the VM rebooted into the new kernel, there was no matching NVIDIA module for it to load.
Resolution
# Option A: reinstall driver extension (Azure VMs)
az vm extension set \
--resource-group myRG \
--vm-name myGPUVM \
--name NvidiaGpuDriverLinux \
--publisher Microsoft.HpcCompute \
--version 1.9
# Option B: pin kernel version and reinstall driver
sudo apt-mark hold linux-image-$(uname -r) linux-headers-$(uname -r)
sudo apt install --reinstall nvidia-driver-535
sudo reboot
Prevention
Disable automatic kernel upgrades on all GPU VMs. Add to /etc/apt/apt.conf.d/50unattended-upgrades:
Unattended-Upgrade::Package-Blacklist {
"linux-image";
"linux-headers";
"linux-modules";
};
Use the Azure NVIDIA GPU Driver Extension for driver lifecycle. Treat kernel upgrades as planned maintenance.
This failure is silent. The VM boots normally, passes health checks, responds to SSH. Only GPU workloads fail. If you don’t monitor nvidia-smi output, you only find out when users complain.
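One way to catch this before users do is a small probe that runs nvidia-smi on a schedule and treats any failure as an alert. The sketch below is a minimal illustration, not a full monitoring setup. The function name gpu_driver_healthy and the 10-second timeout are assumptions, and it relies on nvidia-smi exiting with a non-zero status when it cannot reach the driver.

```python
import subprocess


def gpu_driver_healthy(cmd=("nvidia-smi",), timeout=10):
    """Return True if the probe command exits with status 0.

    Hypothetical helper: `cmd` defaults to nvidia-smi, which is assumed to
    exit non-zero when it cannot communicate with the driver.
    """
    try:
        result = subprocess.run(
            list(cmd), capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired):
        # Binary missing, not executable, or hung: treat all as unhealthy.
        return False
    return result.returncode == 0
```

Wired into cron, a systemd timer, or a node health check, a False result becomes the alert that the default VM health checks never raise.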
Scenario 2: CUDA Out of Memory during fine-tuning
Symptoms
A fine-tuning job starts well, runs for 10-30 minutes, then crashes:
RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB
(GPU 0; 79.15 GiB total capacity; 77.42 GiB already allocated;
1.08 GiB free; 78.50 GiB reserved in total by PyTorch)
“But it worked for the first 500 steps.”
Diagnosis
# Continuous GPU memory monitoring
watch -n 1 nvidia-smi
# Memory log for analysis
nvidia-smi --query-gpu=timestamp,memory.used,memory.free,utilization.gpu \
--format=csv -l 5 > gpu_memory.csv
Calculate expected memory (7B model with Adam in BF16):
| Component | Memory |
|---|---|
| Parameters (BF16) | ~14 GB |
| Gradients (BF16) | ~14 GB |
| Optimizer States (FP32, Adam) | ~56 GB |
| Activations (varies with batch) | Variable |
| Minimum total | ~84 GB + activations |
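The arithmetic behind the table can be sketched in a few lines. This is a rough estimate under the table's own assumptions (2 bytes per BF16 value, two FP32 Adam states per parameter, 1 GB = 10^9 bytes). The function name is hypothetical, and activations are deliberately left out because they depend on batch size and sequence length.

```python
def estimate_training_memory_gb(n_params, param_bytes=2, grad_bytes=2,
                                optimizer_bytes=8):
    """Static memory for full fine-tuning in GB (10^9 bytes), without activations.

    Defaults mirror the table: BF16 parameters and gradients (2 bytes each)
    and two FP32 Adam states per parameter (2 x 4 = 8 bytes).
    """
    gb = 1e9
    breakdown = {
        "parameters": n_params * param_bytes / gb,
        "gradients": n_params * grad_bytes / gb,
        "optimizer_states": n_params * optimizer_bytes / gb,
    }
    breakdown["total"] = sum(breakdown.values())
    return breakdown


estimate = estimate_training_memory_gb(7_000_000_000)
for component, size_gb in estimate.items():
    print(f"{component}: ~{size_gb:.0f} GB")
```

Running the same function with your own parameter count before submitting a job is a cheap way to see whether the static footprint alone already fills the GPU.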
Root cause
Batch size = 8. At the start of training, short sequences in the dataset produced small activation tensors. As the data loader reached longer sequences, activation memory grew until it exceeded what was left on the GPU. OOM didn’t happen at step 1 because the first batches fit.
Resolution
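# Immediate fix: reduce batch size, maintain effective batch with accumulation
from transformers import TrainingArguments
training_args = TrainingArguments(
    output_dir="./output",            # Placeholder path, adjust to your setup
    per_device_train_batch_size=2,    # Reduced from 8
    gradient_accumulation_steps=4,    # Maintains effective batch = 2 x 4 = 8
)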
# Better fix: gradient checkpointing (trades 20-30% speed for 60-80% less activation memory)
model.gradient_checkpointing_enable()
# For larger models: LoRA (trains <1% of parameters)
from peft import LoraConfig, get_peft_model
lora_config = LoraConfig(
r=16,
lora_alpha=32,
target_modules=["q_proj", "v_proj"],
lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 6.5M || all params: 6.74B || trainable%: 0.096%
Prevention
- Always calculate required memory before starting
- Set max_seq_length explicitly to cap activation memory
- Use gradient_accumulation_steps to maintain effective batch with a small per-GPU batch
If OOM happens at random steps (not consistently at step N), suspect variable-length sequences. Set max_seq_length and pad/truncate.
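To make the pad/truncate idea concrete, here is a plain-Python sketch that works on lists of token IDs. The function name and the pad_token_id default are assumptions for illustration. In a real pipeline the tokenizer or data collator normally does this for you.

```python
def pad_and_truncate(batch, max_seq_length, pad_token_id=0):
    """Cut every sequence to max_seq_length, then pad to the longest one left."""
    truncated = [seq[:max_seq_length] for seq in batch]
    target = max((len(seq) for seq in truncated), default=0)
    return [seq + [pad_token_id] * (target - len(seq)) for seq in truncated]


batch = [[5, 6, 7], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]]
capped = pad_and_truncate(batch, max_seq_length=4)
print(capped)  # [[5, 6, 7, 0], [1, 2, 3, 4]]
```

With the cap in place, no batch can exceed batch size x max_seq_length tokens, so the worst-case activation memory is known at step 1 instead of being discovered at step 500.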
Scenario 3: AKS GPU pods stuck in Pending
Symptoms
$ kubectl get pods -n ml-team
NAME READY STATUS RESTARTS AGE
training-job-7b-xyz 0/1 Pending 0 20m
Diagnosis
$ kubectl describe pod training-job-7b-xyz -n ml-team
Events:
Warning FailedScheduling 18m 0/12 nodes are available:
3 node(s) had untolerated taint {sku=gpu:NoSchedule},
9 node(s) didn't match Pod's node affinity/selector.
The taint message is the key. GPU node pools on AKS are commonly created with the sku=gpu:NoSchedule taint (the value used in the AKS documentation), and these nodes have it. The pod needs a matching toleration.
# Check if it's a quota issue
az vm list-usage --location eastus -o table | grep -i "Standard NC\|Standard ND"
# Check node pool scaling limits
az aks nodepool show --cluster-name myAKS --resource-group myRG \
--name gpunp --query '{min:minCount, max:maxCount, current:count}'
Root cause
Pod spec missing the required toleration. The scheduler sees GPU nodes as ineligible.
Other common causes:
- GPU quota exhausted (cluster autoscaler can’t provision new nodes)
- Node pool at maxCount (autoscaler wants to scale but can’t)
Resolution
# Add toleration to the pod spec
spec:
  tolerations:
  - key: "sku"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
  containers:
  - name: training
    resources:
      limits:
        nvidia.com/gpu: 1
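If you want to lint pod specs for this before they reach the cluster, the matching rule the scheduler applies can be approximated in a few lines. This is a simplified sketch of the Kubernetes toleration semantics, not the scheduler's implementation. The function names are hypothetical, and taints and tolerations are plain dicts shaped like the YAML above.

```python
def tolerates(toleration, taint):
    """Return True if a single toleration matches a single taint.

    Simplified reading of the Kubernetes matching rules:
    - an empty effect on the toleration matches every effect
    - operator "Exists" matches any value (and any key if key is empty)
    - operator "Equal" (the default) needs the same key and value
    """
    effect = toleration.get("effect", "")
    if effect and effect != taint["effect"]:
        return False
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key", "")
    if operator == "Exists":
        return key == "" or key == taint["key"]
    return key == taint["key"] and toleration.get("value", "") == taint.get("value", "")


def untolerated_taints(pod_tolerations, node_taints):
    """List the node taints that no toleration on the pod matches."""
    return [
        taint for taint in node_taints
        if not any(tolerates(t, taint) for t in pod_tolerations)
    ]


gpu_taint = {"key": "sku", "value": "gpu", "effect": "NoSchedule"}
fixed = [{"key": "sku", "operator": "Equal", "value": "gpu", "effect": "NoSchedule"}]
print(untolerated_taints([], [gpu_taint]))     # the sku=gpu taint is reported
print(untolerated_taints(fixed, [gpu_taint]))  # []
```

A check like this fits naturally into CI for the pod templates, so a missing toleration fails the pipeline instead of leaving a job Pending for 20 minutes.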
If quota is the issue:
az quota create \
--resource-name "StandardNDSv2Family" \
--scope "/subscriptions/{sub-id}/providers/Microsoft.Compute/locations/eastus" \
--limit-object value=48 limit-object-type=LimitValue
Prevention
- Template all GPU pod specs with pre-configured tolerations
- Alert at 80% GPU quota usage
- Configure cluster autoscaler with headroom in maxCount
A pod stuck in Pending produces no logs, because no container exists. Always check kubectl describe pod for events, not kubectl logs.
Scenario 4: Azure OpenAI 429 storm
Symptoms
30%+ of requests returning HTTP 429. Users report slowness or timeouts.
{
"error": {
"code": "429",
"message": "Requests to the ChatCompletions_Create Operation under Azure OpenAI API have exceeded the token rate limit..."
}
}
Diagnosis
Check the Retry-After header:
- Retry-After: 1 = slightly over the limit
- Retry-After: 30 = dramatically above the limit
az monitor metrics list \
--resource "/subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.CognitiveServices/accounts/{account}" \
--metric "TokenTransaction" \
--interval PT1M \
--aggregation Total \
--filter "ModelDeploymentName eq 'gpt-4o-prod'"
Root cause
Standard deployment with 80K TPM. Product launch generated a burst of 200K+ TPM. Standard enforces hard rate limits; every request above the limit gets a 429.
Resolution
- Immediate: Implement exponential backoff with jitter (code in the previous post)
- Short-term: Second deployment in another region for overflow
- Long-term: Evaluate PTU for predictable, high-volume workloads
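As a quick reminder of what the immediate fix looks like, here is a minimal delay calculation using the "full jitter" variant of exponential backoff. It is a standalone sketch, not the code from the previous post. The function name, the 1-second base, and the 60-second cap are assumptions, and the rng parameter exists only to make the behavior testable.

```python
import random


def backoff_delay(attempt, retry_after=None, base=1.0, cap=60.0, rng=random.random):
    """Seconds to wait before retry number `attempt` (0-based).

    Full jitter: pick a random point between 0 and the capped exponential
    ceiling. If the service sent Retry-After, never wait less than that.
    """
    ceiling = min(cap, base * (2 ** attempt))
    delay = rng() * ceiling
    if retry_after is not None:
        delay = max(delay, float(retry_after))
    return delay
```

The jitter matters during a storm: without it, every client that was throttled at the same moment retries at the same moment, and the burst repeats.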
Prevention
- Multi-deployment architecture with APIM load balancing
- Alerts at 80% of provisioned TPM
- Token-aware queue on the client (estimate tokens before sending)
- Log token count per request to forecast before launches
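A token-aware queue can start as a simple token bucket sized to the deployment's TPM. The sketch below is a client-side approximation under stated assumptions: the class name is hypothetical, the caller supplies the estimated token count per request, and the service's own rate-limit accounting may differ from this smooth per-second refill.

```python
class TokenBudget:
    """Token bucket sized to a tokens-per-minute (TPM) limit."""

    def __init__(self, tokens_per_minute, now=0.0):
        self.capacity = float(tokens_per_minute)
        self.refill_per_second = self.capacity / 60.0
        self.available = self.capacity
        self.last_refill = now

    def try_acquire(self, tokens, now):
        """Reserve `tokens` if the budget allows it; return True on success."""
        elapsed = max(0.0, now - self.last_refill)
        self.available = min(
            self.capacity, self.available + elapsed * self.refill_per_second
        )
        self.last_refill = now
        if tokens <= self.available:
            self.available -= tokens
            return True
        return False


budget = TokenBudget(80_000)                    # the 80K TPM deployment above
print(budget.try_acquire(50_000, now=0.0))      # True
print(budget.try_acquire(40_000, now=0.0))      # False: only 30K left
print(budget.try_acquire(40_000, now=30.0))     # True: 30 s refilled 40K
```

In real code, pass time.monotonic() as now and queue the request when try_acquire returns False, rather than sending it and collecting a 429.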
Scenario 5: Inference latency spike
Symptoms
P99 latency jumps from 200ms to 3 seconds. No deployment, no config change. “The AI is slow.”
Diagnosis
# GPU busy?
nvidia-smi --query-gpu=utilization.gpu,utilization.memory,temperature.gpu \
--format=csv -l 2
# Container restarted?
kubectl get pods -n inference -w
kubectl describe pod model-serve-abc -n inference | grep -A 5 "Last State"
# Cold start? (model being reloaded)
kubectl logs model-serve-abc -n inference | grep -i "model loaded\|loading model"
# [2024-07-15 08:14:47] Model loaded in 164.2 seconds
164 seconds of model loading = almost 3 minutes of latency hole on every restart.
Root cause (usually a combination)
- Container cold start: Pod evicted (OOM, node drain, spot reclaim), model reloading from Blob Storage (14+ GB over network)
- GPU thermal throttling: Sustained 100% utilization → temperature > 83°C → automatic clock reduction
- Noisy neighbor: Another pod on the same node consuming CPU/memory/network needed for pre/post-processing
Resolution
For cold starts: Use an init container that downloads model weights to local NVMe before the serving container starts. Set a readiness probe that only marks ready after the model is loaded.
For thermal throttling: Monitor DCGM_FI_DEV_GPU_TEMP and alert above 78°C. Reduce batch size to lower sustained utilization.
For noisy neighbor: Use nodeSelector or dedicated taints to isolate inference pods on exclusive nodes.
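The readiness logic itself is small. The sketch below shows the idea independent of any serving framework: a flag that flips only after the model load returns, and a function that a hypothetical /ready endpoint could call. All names here are assumptions. Kubernetes HTTP probes treat any status from 200 to 399 as success, so 503 keeps the pod out of the Service until the model is in memory.

```python
import threading

# Set only after the model has finished loading.
model_ready = threading.Event()


def load_model(load_fn):
    """Run the (slow) model load, then flip the readiness flag."""
    model = load_fn()
    model_ready.set()
    return model


def readiness_response():
    """HTTP status and body a /ready endpoint could return to the kubelet."""
    if model_ready.is_set():
        return 200, "ready"
    return 503, "model loading"
```

In a server, load_model would typically run in a background thread at startup while the HTTP handler answers probes with readiness_response().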
Prevention
- Readiness probe that checks model loaded (not just container up)
- Model cache on local NVMe (not downloading from Blob on every start)
- GPU temperature monitoring with proactive alerts
- Inference pods on dedicated nodes without sharing
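If DCGM is not deployed yet, a stopgap temperature check can parse nvidia-smi output directly. This sketch assumes the two-column output of `nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader,nounits` (one "index, temperature" line per GPU). The function name is hypothetical and the default threshold is the 78°C alert level suggested above.

```python
def hot_gpus(csv_text, threshold_c=78):
    """Return (gpu_index, temperature) pairs above the threshold.

    Expects one line per GPU in the form "index, temperature", which is the
    assumed output of:
      nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader,nounits
    """
    alerts = []
    for line in csv_text.strip().splitlines():
        index, temperature = (field.strip() for field in line.split(","))
        if int(temperature) > threshold_c:
            alerts.append((int(index), int(temperature)))
    return alerts


sample = "0, 71\n1, 84\n"
print(hot_gpus(sample))  # [(1, 84)]
```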
In the next post
Troubleshooting covered. Next, we step out of the operational and into something broader: AI use cases for infra teams. How to use AI to improve your own infrastructure work, from AIOps to log analysis and predictive capacity planning.