Kueue

Tenth post in the series. In the previous one, we controlled costs with Spot VMs, right-sizing, and FinOps. Now for the next problem: how to stop being a human help desk for GPU access. tl;dr Self-service AI platforms need isolation, quotas, and scheduling together. The goal is fewer tickets, not faster manual provisioning. The Slack channel that ate your calendar Six months ago, you provisioned a single GPU VM for the ML team. Configured drivers, mounted storage, closed the ticket. Felt like any other infrastructure request. ...