Gpu | Ricardo Martins — Cloud Architecture, Azure, Kubernetes & AI

Troubleshooting playbook: incidents that will wake you at 2AM

Twelfth post in the series. In the previous one, we ran Azure OpenAI with HA and sane retry patterns. This one is for when the nice diagram meets real life. This post is organized as real-world failure scenarios. Each follows: Symptoms → Diagnosis → Root Cause → Resolution → Prevention. Read it once for pattern recognition. Then bookmark it. You will need it again.

tl;dr: Most late-night AI infra incidents come down to driver drift, memory pressure, scheduler mismatch, throttling, or cold starts. Start with the first check that rules out the biggest class of failure.

Scenario 1: NVIDIA driver crash after kernel update. Symptoms: Monday morning. The ML team reports that all GPU workloads failed over the weekend. Nobody deployed anything. You SSH in: ...

GPU deep dive: what happens inside the silicon

Fourth post in the series. In the previous one, you learned which GPU VMs to provision and how to connect them. This time we look inside the GPU so you can troubleshoot better and talk to the ML team without guessing.

tl;dr: GPU memory is consumed by more than model weights. Gradients, optimizer states, and activations usually dominate training memory. Understanding memory hierarchy and topology makes troubleshooting much faster.

The 2 AM ticket: Slack fires at 2 AM. The ML team’s training job crashed again. The error is a single line: ...

Compute for AI: choosing the right hardware (and connecting it properly)

Third post in the series where I translate AI into the language of people who live and breathe infrastructure. In the previous post, we talked about the storage bottleneck nobody notices until it hurts. This one is about compute. Spoiler: it is not enough to buy the most expensive GPU. You need the right GPU, connected the right way.

tl;dr: Pick hardware based on training vs inference, not on sticker price. GPU family, memory size, and interconnect matter more than vCPU count. For distributed jobs, quota, availability, and network fabric decide whether the cluster performs or stalls.

The story you don’t want to live: The ML team asks for “a GPU cluster for training.” You do what any infra engineer would do under time pressure: provision eight Standard_D64s_v5 VMs. Sixty-four vCPUs each, 256 GiB of RAM, Premium SSD. On paper, it looks respectable. ...

Data and storage for AI workloads: the bottleneck nobody sees

This is the second post in the series where I translate AI into the language of infrastructure engineers. In the first post, I showed that AI is just another workload and that your infra skills already prepare you more than you think. Now for the bottleneck everyone ignores: storage. It is the hidden villain behind performance problems in almost every AI project I’ve seen.

tl;dr: Storage is usually the first bottleneck in AI training. Keep durable data in Blob or Data Lake, but cache the active working set on local NVMe. Check GPU, disk, and network metrics together before blaming the model or the GPUs.

The midnight call: You did everything right. The ML team asked for a GPU cluster and you delivered: eight NVIDIA A100s across two nodes, high-bandwidth networking, CUDA drivers up to date. Clean deployment. The team kicked off their first training job Friday at 6 PM and you went home feeling good. ...

AI for infrastructure engineers: why AI needs you

This is the first post in a series where I’ll translate the world of AI into the language that infrastructure engineers already speak. If you’re the kind of professional who configures VMs, builds CI/CD pipelines, and gets woken up at 2 AM when Nagios fires, this content is for you. The series is based on my open-source book AI for Infrastructure Professionals, adapted and expanded here on the blog.

tl;dr: AI is another workload with different performance, cost, and data patterns. Infrastructure engineers already own the hard parts: compute, networking, security, observability, and cost control. If you can run production infrastructure well, you already have the foundation to support AI systems.

The Monday morning message: It’s 8:47 AM on a Monday. You’re halfway through your coffee, reviewing a Terraform plan for a network redesign, when a Slack message lights up your screen. It’s from the data science team lead: ...

ARO with Nvidia GPU Workloads

This article was originally published at ARO with Nvidia GPU Workloads | Red Hat Cloud Experts. ARO guide to running Nvidia GPU workloads.

Prerequisites: oc cli; Helm; jq, moreutils, and gettext package; an ARO 4.14 cluster.

Note: If you need to install an ARO cluster, please read our ARO Terraform Install Guide. Whether you are installing a new ARO cluster or using an existing one, make sure it is 4.14.x or higher.

Note: Please ensure your ARO cluster was created with a valid pull secret (to verify, make sure you can see the Operator Hub in the cluster’s console). If not, you can follow these instructions. ...