Infrastructure as Code for AI: automating GPU clusters

Fifth post in the series. In the previous one, we dove inside the GPU. Now let’s automate everything around it, because understanding GPUs is only half the battle; provisioning them consistently and at scale is where infrastructure engineering actually meets AI.

The $4,000 typo

I started the week with a win. I manually provisioned a GPU cluster in East US 2 for an ML experiment: AKS with a Standard_NC6s_v3 node pool, accelerated networking, NVIDIA drivers, and the correct taints. It took almost a full day, but it worked. ...

May 26, 2026 · 7 min · Ricardo Martins

Compute for AI: choosing the right hardware (and connecting it properly)

Third post in the series where I translate AI into the language of those who live and breathe infrastructure. In the previous post, we talked about the hidden storage bottleneck. Today we’re getting to what everyone thinks is the main topic of AI: compute. Spoiler: it’s not just about having the most expensive GPU. It’s about having the right GPU, connected the right way.

The story you don’t want to live

The ML team asks for “a GPU cluster for training.” You do what any infra engineer would: provision eight Standard_D16s_v5 VMs. Sixty-four vCPUs each, 128 GiB of RAM, premium SSD. On paper, plenty of power. ...

May 18, 2026 · 11 min · Ricardo Martins