Troubleshooting playbook: incidents that will wake you at 2 AM
Twelfth post in the series. In the previous one, we operated Azure OpenAI with HA and correct retry patterns. Now: when things break (and they will break).

This post is organized as real-world failure scenarios. Each follows: Symptoms → Diagnosis → Root Cause → Resolution → Prevention. Read it once for pattern recognition. Then bookmark it; you’ll be back.

Scenario 1: NVIDIA driver crash after kernel update

Symptoms

Monday morning. The ML team reports that all GPU workloads failed over the weekend. Nobody deployed anything. You SSH in: ...