Platform ops: building a self-service AI platform
Tenth post in the series. In the previous one, we controlled costs with Spot VMs, right-sizing, and FinOps. Now: how to stop being a human help desk for GPU. The Slack channel that ate your calendar Six months ago, you provisioned a single GPU VM for the ML team. Configured drivers, mounted storage, closed the ticket. Felt like any other infrastructure request. Today, you have four teams, three AKS clusters, dozens of GPU node pools, and a growing collection of Azure OpenAI endpoints. Each team wants their own resources, their own quotas, and their own SLAs. Your DMs have turned into a help desk: “Can we get more GPUs?” “Why is my training job Pending?” “Who’s using all the A100s?” ...