Aks | Ricardo Martins — Cloud Architecture, Azure, Kubernetes & AI

Platform Engineering on Azure: building an Internal Developer Platform with AKS and Bicep (Part 1)

First post in a two-part series on Platform Engineering on Azure. If your developers still need tickets, handoffs, or tribal knowledge to get a usable environment, your delivery system is slower than your codebase. Platform Engineering is how you fix that. The goal is not to hide infrastructure from developers. The goal is to package infrastructure, security, and observability into a self-service product developers can trust. On Azure, that means combining Microsoft Dev Center, Azure Deployment Environments, Bicep, and a shared runtime such as AKS. ...

MCP and AI Agents 101 for Infrastructure Engineers

Chapter 1: MCP and AI Agents 101

At some point in the last few months, someone on your team probably showed up talking about an “AI agent” or an “MCP server” and asked for cluster access, a deployment, or an explanation for the CISO. I wish I’d had a clean mental model before I touched any of this. That’s what this post is: no hype, and a real Azure example so this does not stay in slideware. ...

Infrastructure as Code for AI: automating GPU clusters

Fifth post in the series. In the previous one, we went inside the GPU. This time we automate everything around it. Understanding GPUs is useful. Provisioning them consistently and at scale is where infrastructure engineering actually meets AI.

tl;dr: IaC is the only sane way to provision expensive GPU infrastructure repeatedly. Validate SKU choices, remote state, and deployment guardrails before apply. If the pipeline uses OIDC, Terraform and GitHub Actions both need explicit OIDC settings.

The $4,000 typo

I started the week with a win. I manually provisioned a GPU cluster in East US 2 for an ML experiment: AKS with a Standard_NC6s_v3 node pool, accelerated networking, GPU drivers, correct taints. It took most of a day, but it worked. ...
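The tl;dr advice to validate SKU choices and guardrails before apply could look something like the sketch below. It is a minimal illustration, not the pipeline from the post: `ALLOWED_GPU_SKUS`, `MAX_NODES`, `NodePoolRequest`, and `validate_node_pool` are hypothetical names, and the node limit is an assumed budget cap. Only the `Standard_NC6s_v3` SKU name comes from the excerpt.

```python
from dataclasses import dataclass

# Hypothetical allowlist of approved GPU SKUs (SKU name -> GPUs per node).
# Only Standard_NC6s_v3 is taken from the post; extend as your team approves more.
ALLOWED_GPU_SKUS = {"Standard_NC6s_v3": 1}

# Assumed budget guardrail: the largest node count allowed without manual review.
MAX_NODES = 4


@dataclass(frozen=True)
class NodePoolRequest:
    """The values a pipeline would read from its IaC variables before apply."""
    name: str
    vm_size: str
    node_count: int


def validate_node_pool(req: NodePoolRequest) -> list[str]:
    """Return a list of guardrail violations; an empty list means safe to apply."""
    errors = []
    if req.vm_size not in ALLOWED_GPU_SKUS:
        allowed = ", ".join(sorted(ALLOWED_GPU_SKUS))
        errors.append(
            f"{req.name}: vm_size '{req.vm_size}' is not an approved GPU SKU "
            f"(approved: {allowed})"
        )
    if not 1 <= req.node_count <= MAX_NODES:
        errors.append(
            f"{req.name}: node_count {req.node_count} is outside 1..{MAX_NODES}"
        )
    return errors


if __name__ == "__main__":
    # A one-character SKU typo and an oversized count are both caught before apply.
    request = NodePoolRequest(name="gpupool", vm_size="Standard_NC6_v3", node_count=40)
    for problem in validate_node_pool(request):
        print(problem)
```

Run as a pipeline step ahead of the apply stage, a non-empty result would fail the job, so a mistyped SKU or node count costs a failed check instead of a cloud bill.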

Compute for AI: choosing the right hardware (and connecting it properly)

Third post in the series where I translate AI into the language of people who live and breathe infrastructure. In the previous post, we talked about the storage bottleneck nobody notices until it hurts. This one is about compute. Spoiler: it is not enough to buy the most expensive GPU. You need the right GPU, connected the right way.

tl;dr: Pick hardware based on training vs inference, not on sticker price. GPU family, memory size, and interconnect matter more than vCPU count. For distributed jobs, quota, availability, and network fabric decide whether the cluster performs or stalls.

The story you don’t want to live

The ML team asks for “a GPU cluster for training.” You do what any infra engineer would do under time pressure: provision eight Standard_D64s_v5 VMs. Sixty-four vCPUs each, 256 GiB of RAM, Premium SSD. On paper, it looks respectable. ...