Infrastructure as Code for AI: automating GPU clusters

Fifth post in the series. In the previous one, we dove inside the GPU. Now let’s automate everything around it. Because understanding GPUs is half the battle; provisioning them consistently and at scale is where infrastructure engineering actually meets AI.

The $4,000 typo
I started the week with a win. Manually provisioned a GPU cluster in East US 2 for an ML experiment: AKS with a Standard_NC6s_v3 node pool, accelerated networking, NVIDIA drivers, correct taints. Took almost a full day, but it worked. ...

May 26, 2026 · 7 min · Ricardo Martins

GPU deep dive: what happens inside the silicon

Fourth post in the series. In the previous one, you learned which GPU VMs to provision and how to connect them. Now we’re going to look inside the GPU to understand what happens at the silicon level. Not to write CUDA kernels, but to be a better troubleshooter and have informed conversations with the ML team.

The 2 AM ticket
Slack fires at 2 AM. The ML team’s training job crashed again. The error is a single line: ...

May 22, 2026 · 10 min · Ricardo Martins

Compute for AI: choosing the right hardware (and connecting it properly)

Third post in the series where I translate AI into the language of those who live and breathe infrastructure. In the previous post, we talked about the hidden storage bottleneck. Today we turn to what everyone thinks is the main topic of AI: compute. Spoiler: it’s not just about having the most expensive GPU. It’s about having the right GPU, connected the right way.

The story you don’t want to live
The ML team asks for “a GPU cluster for training.” You do what any infra engineer would: provision eight Standard_D16s_v5 VMs. Sixteen vCPUs each, 64 GiB of RAM, premium SSD. On paper, plenty of power. ...

May 18, 2026 · 11 min · Ricardo Martins

Data and storage for AI workloads: the bottleneck nobody sees

This is the second post in the series where I translate AI into the language of infrastructure engineers. In the first post, I showed that AI is just another workload and that your infra skills already prepare you more than you think. Now let’s talk about the bottleneck that everyone ignores, the hidden villain behind performance issues in virtually every AI project I’ve seen: storage.

The midnight call
You did everything right. The ML team asked for a GPU cluster and you delivered: eight NVIDIA A100s across two nodes, high-bandwidth networking, CUDA drivers up to date. Flawless deployment. The team kicked off their first training job Friday at 6 PM and you went home feeling good. ...

May 14, 2026 · 9 min · Ricardo Martins

AI for infrastructure engineers: why AI needs you

This is the first post in a series where I’ll translate the world of AI into the language that infrastructure engineers already speak. If you’re the kind of professional who configures VMs, builds CI/CD pipelines, and gets woken up at 2 AM when Nagios fires, this content is for you. The series is based on my open-source book AI for Infrastructure Professionals, adapted and expanded here on the blog.

The Monday morning message
It’s 8:47 AM on a Monday. You’re halfway through your coffee, reviewing a Terraform plan for a network redesign, when a Slack message lights up your screen. It’s from the data science team lead: ...

May 10, 2026 · 7 min · Ricardo Martins

Why Azure Feels Harder Than AWS

…and why that’s not an accident. If you have worked with both Azure and AWS long enough, you have probably felt it. AWS feels straightforward. Azure feels… heavier. Not worse. Not broken. Just harder to reason about. The console feels denser. The mental model feels less obvious. The number of “extra” concepts feels higher. This is not a beginner problem. Senior engineers feel it too. And the most interesting part is this: that friction is not accidental. ...

February 3, 2026 · 5 min · Ricardo Martins

Cloud Maturity Is Not About Being 100% Cloud

For years, “cloud-first” has been treated as a badge of honor. Companies proudly announce that everything is in the cloud, architects optimize for migrations instead of outcomes, and teams equate progress with how little infrastructure they still own. But after working with dozens of real systems, across different industries and at different scales, one thing becomes clear. Cloud maturity is not about being 100% cloud. It is about knowing why each workload is where it is. ...

January 23, 2026 · 5 min · Ricardo Martins

Private ARO Cluster with Access via JumpHost

This article was originally published at https://cloud.redhat.com/experts/aro/private-cluster/
A Quickstart guide to deploying a Private Azure Red Hat OpenShift cluster.

Prerequisites
Azure CLI
Obviously you’ll need to have an Azure account to configure the CLI against.

macOS
See Azure Docs for alternative install options.
Install Azure CLI using homebrew:
    brew update && brew install azure-cli
Install sshuttle using homebrew:
    brew install sshuttle

Linux
See Azure Docs for alternative install options.
Import the Microsoft Keys:
    sudo rpm --import https://packages.microsoft.com/keys/microsoft.asc
Add the Microsoft Yum Repository:
    cat << EOF | sudo tee /etc/yum.repos.d/azure-cli.repo
    [azure-cli]
    name=Azure CLI
    baseurl=https://packages.microsoft.com/yumrepos/azure-cli
    enabled=1
    gpgcheck=1
    gpgkey=https://packages.microsoft.com/keys/microsoft.asc
    EOF
Install Azure CLI:
    sudo dnf install -y azure-cli sshuttle

Prepare Azure Account for Azure OpenShift
Log into the Azure CLI by running the following and then authorizing through your Web Browser:
    az login
Make sure you have enough Quota (change the location if you’re not using East US):
    az vm list-usage --location "East US" -o table
See Addendum – Adding Quota to ARO account if you have less than 36 Quota left for Total Regional CPUs ...

January 21, 2025 · 6 min · Ricardo Martins
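
The quota check in the Private ARO excerpt above can also be scripted rather than read off a table. What follows is a minimal sketch, not part of the original article: it assumes the JSON form of the same command (`az vm list-usage --location "East US" -o json`) returns a list of entries with `currentValue`, `limit`, and a nested `name.value` field, and that the regional vCPU entry is identified by `name.value == "cores"`. The helper names and the sample data are hypothetical; the threshold of 36 comes from the excerpt.

```python
import json

REQUIRED_VCPUS = 36  # threshold quoted in the quickstart excerpt


def remaining_quota(usages, usage_key="cores"):
    """Return limit minus current usage for one entry, or None if it is absent.

    Assumption: each entry looks like
    {"currentValue": "...", "limit": "...", "name": {"value": "..."}}.
    """
    for entry in usages:
        if entry.get("name", {}).get("value") == usage_key:
            # Values are converted with int() so string or numeric input both work.
            return int(entry["limit"]) - int(entry["currentValue"])
    return None


def has_enough_quota(usages, required=REQUIRED_VCPUS):
    """True only when the regional vCPU entry exists and leaves enough headroom."""
    left = remaining_quota(usages)
    return left is not None and left >= required


# Hypothetical sample data shaped like the assumed JSON output.
sample = json.loads("""
[
  {"currentValue": "4",  "limit": "2500", "name": {"value": "availabilitySets"}},
  {"currentValue": "70", "limit": "100",  "name": {"value": "cores"}}
]
""")

print(remaining_quota(sample))   # vCPUs still available in the sample
print(has_enough_quota(sample))  # whether that clears the 36 vCPU threshold
```

To run it against a real subscription, the `sample` list would be replaced with the parsed output of the `az` command, for example by piping it into a script that calls `json.load(sys.stdin)`. Check the field names against your own CLI output before relying on the result.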

Creating a Lightweight Jump Host in Azure with sshuttle (No VPN Required)

When working with development or test environments in Azure, a common need is secure access to internal resources without exposing them directly to the internet. While VPN solutions are a robust way to achieve this, they can often be overkill for simple use cases, especially when you just want to access a few VMs or services for testing. A jump host combined with sshuttle offers a simple, VPN-like solution that can be quickly deployed and used to tunnel traffic to your Azure resources—without the overhead of setting up a full VPN. ...

October 4, 2024 · 5 min · Ricardo Martins

Deploying Advanced Cluster Management and OpenShift Data Foundation for ARO Disaster Recovery

This article was originally published at https://cloud.redhat.com/experts/aro/acm-odf-aro/
A guide to deploying Advanced Cluster Management (ACM) and OpenShift Data Foundation (ODF) for Azure Red Hat OpenShift (ARO) Disaster Recovery.

Overview
VolSync is not supported for ARO in ACM (https://access.redhat.com/articles/7006295), so if you run into issues and file a support ticket, you will be told that ARO is not supported.

In today’s fast-paced and data-driven world, ensuring the resilience and availability of your applications and data has never been more critical. The unexpected can happen at any moment, and the ability to recover quickly and efficiently is paramount. That’s where OpenShift Advanced Cluster Management (ACM) and OpenShift Data Foundation (ODF) come into play. In this guide, we will explore the deployment of ACM and ODF for disaster recovery (DR) purposes, empowering you to safeguard your applications and data across multiple clusters. ...

October 4, 2024 · 12 min · Ricardo Martins