MLOps: model lifecycle for infra engineers

Sixth post in the series. In the previous one, we automated GPU cluster provisioning. Now let’s talk about what happens after the hardware is ready: how a model goes from “works on my notebook” to “running in production with an SLA.”

The model with no birth certificate

A data scientist drops a message in the team channel with a link to a shared drive: “Here’s the model. It’s a 15 GB PyTorch checkpoint. We need it in production by Friday.” ...

May 30, 2026 · 6 min · Ricardo Martins

Infrastructure as Code for AI: automating GPU clusters

Fifth post in the series. In the previous one, we dove inside the GPU. Now let’s automate everything around it. Because understanding GPUs is half the battle; provisioning them consistently and at scale is where infrastructure engineering actually meets AI.

The $4,000 typo

I started the week with a win. Manually provisioned a GPU cluster in East US 2 for an ML experiment: AKS with a Standard_NC6s_v3 node pool, accelerated networking, NVIDIA drivers, correct taints. Took almost a full day, but it worked. ...

May 26, 2026 · 7 min · Ricardo Martins