Nvidia-Smi

Fourth post in the series. In the previous one, you learned which GPU VMs to provision and how to connect them. Now we’re going to look inside the GPU to understand what happens at the silicon level. Not to write CUDA kernels, but to be a better troubleshooter and have informed conversations with the ML team.

The 2 AM ticket

Slack fires at 2 AM. The ML team’s training job crashed again. The error is a single line: ...