Infrastructure as Code for AI: automating GPU clusters

Fifth post in the series. In the previous one, we dove inside the GPU. Now let’s automate everything around it. Because understanding GPUs is half the battle; provisioning them consistently and at scale is where infrastructure engineering actually meets AI.
The $4,000 typo
I started the week with a win. Manually provisioned a GPU cluster in East US 2 for an ML experiment: AKS with a Standard_NC6s_v3 node pool, accelerated networking, NVIDIA drivers, correct taints. Took almost a full day, but it worked. ...

May 26, 2026 · 7 min · Ricardo Martins

GPU deep dive: what happens inside the silicon

Fourth post in the series. In the previous one, you learned which GPU VMs to provision and how to connect them. Now we’re going to look inside the GPU to understand what happens at the silicon level. Not to write CUDA kernels, but to be a better troubleshooter and have informed conversations with the ML team.
The 2 AM ticket
Slack fires at 2 AM. The ML team’s training job crashed again. The error is a single line: ...

May 22, 2026 · 10 min · Ricardo Martins

Compute for AI: choosing the right hardware (and connecting it properly)

Third post in the series where I translate AI into the language of those who live and breathe infrastructure. In the previous post, we talked about the hidden storage bottleneck. Today we’re getting to what everyone thinks is the main topic of AI: compute. Spoiler: it’s not just about having the most expensive GPU. It’s about having the right GPU, connected the right way.
The story you don’t want to live
The ML team asks for “a GPU cluster for training.” You do what any infra engineer would: provision eight Standard_D16s_v5 VMs. Sixteen vCPUs each, 64 GiB of RAM, premium SSD. On paper, plenty of power. ...

May 18, 2026 · 11 min · Ricardo Martins

Data and storage for AI workloads: the bottleneck nobody sees

This is the second post in the series where I translate AI into the language of infrastructure engineers. In the first post, I showed that AI is just another workload and that your infra skills already prepare you more than you think. Now let’s talk about the bottleneck that everyone ignores, the hidden villain behind performance issues in virtually every AI project I’ve seen: storage.
The midnight call
You did everything right. The ML team asked for a GPU cluster and you delivered: eight NVIDIA A100s across two nodes, high-bandwidth networking, CUDA drivers up to date. Flawless deployment. The team kicked off their first training job Friday at 6 PM and you went home feeling good. ...

May 14, 2026 · 9 min · Ricardo Martins

AI for infrastructure engineers: why AI needs you

This is the first post in a series where I’ll translate the world of AI into the language that infrastructure engineers already speak. If you’re the kind of professional who configures VMs, builds CI/CD pipelines, and gets woken up at 2 AM when Nagios fires, this content is for you. The series is based on my open-source book AI for Infrastructure Professionals, adapted and expanded here on the blog.
The Monday morning message
It’s 8:47 AM on a Monday. You’re halfway through your coffee, reviewing a Terraform plan for a network redesign, when a Slack message lights up your screen. It’s from the data science team lead: ...

May 10, 2026 · 7 min · Ricardo Martins

Introduction to AI and Comparing OpenAI with Azure OpenAI

As I embark on my journey of learning about artificial intelligence (AI), I am discovering the fascinating world of large language models (LLMs) and their applications in various technologies. In this article, I aim to share my newfound knowledge and insights with others who are also beginning their journey in AI. We will explore OpenAI, one of the leading organizations in AI research and development, and compare its offerings with Microsoft’s Azure OpenAI service. ...

May 10, 2024 · 3 min · Ricardo Martins

Real-World Applications and Ethical Implications of AI

As we continue our journey into artificial intelligence (AI), it’s important to understand how AI is transforming different industries and the ethical and legal challenges associated with its widespread adoption. In this new post, we will explore AI’s real-world applications and the complexities of ethical and legal concerns in detail.
AI in Healthcare
AI is making significant advances in healthcare, improving patient care and medical research:
Diagnostics: AI-powered algorithms can analyze medical images, such as X-rays, MRIs, and CT scans, to identify diseases like cancer or fractures with high accuracy. These systems can serve as a second opinion for radiologists, improving diagnostic accuracy and efficiency.
Personalized Medicine: AI enables the development of personalized treatment plans based on a patient’s unique genetic makeup. This approach can lead to more effective and targeted therapies, improving patient outcomes.
Drug Discovery: AI accelerates the process of discovering new drugs by analyzing vast amounts of data to identify potential compounds and predict their efficacy. This reduces the time and cost associated with bringing new drugs to market.
Remote Monitoring: AI-powered wearable devices and remote monitoring tools enable healthcare providers to track patients’ health in real-time, offering proactive care and reducing hospital readmissions.
Administrative Efficiency: AI streamlines administrative tasks such as scheduling, billing, and insurance claims processing, freeing up healthcare professionals to focus on patient care.
AI in Finance
AI is reshaping the finance industry by providing innovative solutions to complex problems: ...

May 10, 2024 · 5 min · Ricardo Martins