Nvidia

Troubleshooting playbook: incidents that will wake you at 2AM

Twelfth post in the series. In the previous one, we operated Azure OpenAI with HA and correct retry patterns. Now: when things break (and they will break). This post is organized as real-world failure scenarios. Each follows: Symptoms → Diagnosis → Root Cause → Resolution → Prevention. Read it once for pattern recognition. Then bookmark it; you’ll be back. Scenario 1: NVIDIA driver crash after kernel update. Symptoms: Monday morning. The ML team reports that all GPU workloads failed over the weekend. Nobody deployed anything. You SSH in: ...

GPU deep dive: what happens inside the silicon

Fourth post in the series. In the previous one, you learned which GPU VMs to provision and how to connect them. Now we’re going to look inside the GPU to understand what happens at the silicon level. Not to write CUDA kernels, but to be a better troubleshooter and have informed conversations with the ML team. The 2 AM ticket: Slack fires at 2 AM. The ML team’s training job crashed again. The error is a single line: ...

Compute for AI: choosing the right hardware (and connecting it properly)

Third post in the series where I translate AI into the language of those who live and breathe infrastructure. In the previous post, we talked about the hidden storage bottleneck. Today we’re going to tackle what everyone thinks is the main topic of AI: compute. Spoiler: it’s not just about having the most expensive GPU. It’s about having the right GPU, connected the right way. The story you don’t want to live: the ML team asks for “a GPU cluster for training.” You do what any infra engineer would: provision eight Standard_D16s_v5 VMs. Sixty-four vCPUs each, 128 GiB of RAM, premium SSD. On paper, plenty of power. ...

ARO with Nvidia GPU Workloads

This article was originally published at ARO with Nvidia GPU Workloads | Red Hat Cloud Experts. ARO guide to running Nvidia GPU workloads. Prerequisites: the oc CLI; Helm; the jq, moreutils, and gettext packages; an ARO 4.14 cluster. Note: If you need to install an ARO cluster, please read our ARO Terraform Install Guide. Whether you install a new ARO cluster or use an existing one, make sure it is version 4.14.x or higher. Note: Please ensure your ARO cluster was created with a valid pull secret (to verify, make sure you can see the Operator Hub in the cluster’s console). If not, you can follow these instructions. ...