AI | Ricardo Martins — Cloud Architecture, Azure, Kubernetes & AI

Multi-Agent Orchestration: Correlating AKS and Azure OpenAI

Chapter 4: Multi-Agent Orchestration

So far the series has built two separate things: in post 1, an agent that talks to AKS via aks-mcp to diagnose the cluster; in posts 2 and 3, a watchdog that watches TPM consumption on Azure OpenAI and decides how urgent an alert should be. Both are useful on their own. Together, they still leave the first SRE question hanging in the air: when token consumption jumps out of nowhere, did somebody deploy something? In most teams that answer still lives in two browser tabs and one annoyed human. ...

From Script to Agent: Giving the Watchdog Decision Autonomy

Chapter 3: From Script to Agent

In the previous post, the Azure OpenAI quota watchdog was a script with if pct_of_tpm > 0.8: alert. That works, but it has the same flaw every blunt monitoring rule has: context does not exist. A batch job that predictably eats 90% of TPM for 10 minutes at month-end looks identical to an agent gone feral and burning tokens all afternoon. Both cross the threshold. Only one should wake somebody up. ...
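To make the "context does not exist" point concrete, here is a minimal sketch of the blunt rule applied to two usage traces. The sample values and the names `should_alert`, `month_end_batch`, and `runaway_agent` are made up for illustration; only the 0.8 threshold comes from the excerpt.

```python
TPM_THRESHOLD = 0.8  # the threshold quoted in the excerpt


def should_alert(pct_of_tpm: float) -> bool:
    """The blunt rule: fire whenever usage crosses the threshold."""
    return pct_of_tpm > TPM_THRESHOLD


# Hypothetical samples of pct_of_tpm, invented for this sketch.
month_end_batch = [0.35, 0.90, 0.91, 0.90, 0.40]  # spikes, then recovers
runaway_agent = [0.35, 0.90, 0.92, 0.93, 0.95]    # spikes and keeps climbing

batch_alerts = [should_alert(p) for p in month_end_batch]
agent_alerts = [should_alert(p) for p in runaway_agent]

# Both traces trip the rule at the same sample, so the first alert
# carries no information about which situation you are in.
print(batch_alerts.index(True), agent_alerts.index(True))
```

The rule sees one number at a time, so at the moment of the first alert the two traces are indistinguishable.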

Building a Deterministic 429 Watchdog for Azure OpenAI

Chapter 2: The Deterministic 429 Watchdog

In the previous post I explained what MCP is and how an agent decides its next move from the tools available. Now for something you could actually ship over a weekend: an MCP server that watches token consumption on your Azure OpenAI or Foundry deployment and warns you on Slack or email before the 429 lands in production.

tl;dr
- Watch Azure Monitor metrics before the client hits the first 429.
- Start with a deterministic threshold plus a rising-trend check.
- Add agent reasoning later, after the telemetry and alert path prove they work.

Why this is subtler than it looks

The first reaction from anyone who’s never been bitten by a 429 is “easy, just measure usage and compare it to the quota.” The problem is that TPM (tokens per minute) and RPM (requests per minute) on Azure OpenAI are evaluated over short rolling windows, not a smooth average across the minute. That means you can blow the limit even while staying “under quota” in aggregate, simply because requests arrived in a burst instead of spread out. That’s why teams report 429s “even within the documented limit”: the problem isn’t total volume, it’s distribution over time. ...
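The burst-versus-spread effect can be sketched in a few lines. This is an illustration under stated assumptions, not Azure's actual enforcement logic: the 10-second window, the 60,000 TPM quota, and the request sizes are all hypothetical values chosen to make the arithmetic visible.

```python
from collections import deque


def peak_window_tokens(events, window_s):
    """Largest token total seen in any sliding window of window_s seconds.

    events: list of (timestamp_seconds, tokens), sorted by timestamp.
    """
    window = deque()
    running = 0
    peak = 0
    for ts, tokens in events:
        window.append((ts, tokens))
        running += tokens
        # Drop events that are no longer inside the window ending at ts.
        while window and window[0][0] <= ts - window_s:
            running -= window[0][1]
            window.popleft()
        peak = max(peak, running)
    return peak


TPM_QUOTA = 60_000  # hypothetical quota
WINDOW_S = 10       # assumed window length, not a documented Azure value
window_budget = TPM_QUOTA * WINDOW_S // 60  # 10_000 tokens per window

# Same minute, same total (48,000 tokens), two arrival patterns.
spread = [(i * 5, 4_000) for i in range(12)]   # one request every 5 s
burst = [(i * 0.5, 4_000) for i in range(12)]  # all within about 6 s

for name, events in (("spread", spread), ("burst", burst)):
    total = sum(tokens for _, tokens in events)
    peak = peak_window_tokens(events, WINDOW_S)
    print(name, total, peak, "over budget" if peak > window_budget else "ok")
```

Both patterns stay under the per-minute quota in aggregate, but only the burst exceeds the per-window budget, which is the situation the excerpt describes as a 429 "even within the documented limit".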

Context engineering: the art of feeding LLMs

You build a RAG pipeline, connect it to Azure OpenAI, and the answers come back… meh. Generic. Sometimes it ignores the context you sent. Sometimes it makes things up. The model is powerful, but input quality usually determines most of the result. Context engineering is the discipline of assembling that input so the model gives you what you actually need. It is not just “prompt engineering” with a fresher label. It is engineering: structure, constraints, and trade-offs. ...

Visual glossary infra ↔ AI: your Rosetta Stone

Final post in the series. In the previous one, we built the 6-phase adoption framework. This one is the cheat sheet. You already speak infrastructure fluently. AI is not a foreign language. It is infrastructure with worse naming and more hype. This glossary maps each AI term to something you already understand.

tl;dr
- This glossary maps AI jargon to infra concepts so you can reason about AI systems without switching mental models.
- Use it as a translation sheet for conversations about models, data, compute, serving, and ops.
- When a term drives architecture or cost, check the underlying docs before repeating exact numbers.

How to use this

Every entry has: the AI term, the infra analogy in parentheses, a concise definition, and when you’ll encounter it in your work. It is split into 6 categories so you can find things fast instead of pretending you remember all of it. ...

From prompt engineering to frontier company: why the model is no longer the differentiator

Three years ago, the question I heard most was: “what’s the best prompt?” Two years ago, it shifted to: “how do I do RAG?” Last year: “how do I build an agent?” This year, the conversation is different. People are asking how to transform an entire organization to operate with agents. Not a chatbot on the website. Dozens of agents embedded in business processes, with governance, observability, granular permissions. That progression tells a story, and we often discuss each phase as if it appeared out of nowhere. ...

How RAG works: from theory to pipeline

The VP of Product walks into the daily standup: “I want the chatbot to answer questions about our internal documentation. We have 2,000 pages of runbooks, policies, and procedures. ChatGPT doesn’t know any of that.” The ML team says: “We’ll implement RAG.” Everyone nods. You get the job of provisioning the infrastructure. Before you start creating resources, you should know what RAG is actually doing under the hood.

tl;dr
- RAG is search plus an LLM.
- The retrieval layer determines whether the answer is grounded or generic.
- The main moving parts are chunking, embeddings, vector storage, and hybrid search.
- In production, watch retrieval quality and pricing before you obsess over the model.

The map for infra engineers

| RAG concept | What it does | Infra equivalent |
| Retrieval | Finds relevant documents | Search engine query |
| Augmentation | Adds docs to the LLM prompt | Build the request payload |
| Generation | LLM produces an answer using the context | The model response |
| Chunking | Splits documents into smaller pieces | Data partitioning, sharding |
| Indexing pipeline | Processes docs and generates embeddings | ETL/data pipeline |
| Hybrid search | Combines semantic search + keyword search | Using CDN + origin server together |

The problem RAG solves

LLMs have two fundamental limitations: ...
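The chunking, retrieval, and augmentation steps from the map can be sketched end to end without any cloud resources. This toy version scores chunks by plain keyword overlap instead of embeddings, purely to keep it self-contained; the function names, the sample documents, and the prompt wording are all hypothetical, and the generation step (the actual LLM call) is left out.

```python
def chunk(text, size=8):
    """Chunking: split a document into pieces of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def score(query, passage):
    """Keyword overlap; a stand-in for embedding similarity in this sketch."""
    return len(set(query.lower().split()) & set(passage.lower().split()))


def retrieve(query, chunks, k=1):
    """Retrieval: return the k chunks that best match the query."""
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)[:k]


def build_prompt(query, context_chunks):
    """Augmentation: put the retrieved chunks into the request payload."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


# Hypothetical internal docs, invented for this example.
docs = [
    "Restart the payments service with the rollback runbook "
    "before paging the on-call engineer.",
    "Vacation policy: submit requests two weeks ahead through the HR portal.",
]
chunks = [c for d in docs for c in chunk(d)]

query = "how do I restart the payments service"
top = retrieve(query, chunks, k=1)
prompt = build_prompt(query, top)
print(prompt)
```

In a real pipeline, `score` would typically be replaced by a vector similarity over embeddings (optionally combined with keyword search), but the shape of the flow stays the same: chunk, retrieve, then build the prompt.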

AI adoption framework: from enthusiasm to governance

Fourteenth post in the series. In the previous one, we used AI for our own infrastructure work. This time the scope is bigger: how to take an entire organization from “let’s use AI” to a governed platform that can survive contact with finance, security, and production support.

tl;dr
- AI adoption fails when teams skip readiness, guardrails, and cost controls.
- A workable path is assessment, enablement, platform prep, controlled experimentation, production governance, and continuous review.
- Treat AI as an operating capability with budgets, runbooks, and policies from day 1.

Best intentions, worst outcomes

Your CTO walks into the all-hands and says: “We’re going all-in on AI.” The room buzzes. Teams start brainstorming use cases before the meeting ends. Within two weeks, Slack is full of threads about GPU availability. ...

MCP and AI Agents 101 for Infrastructure Engineers

Chapter 1: MCP and AI Agents 101

At some point in the last few months, someone on your team probably showed up talking about an “AI agent” or an “MCP server” and asked for cluster access, a deployment, or an explanation for the CISO. I wish I’d had a clean mental model before I touched any of this. That’s what this post is: no hype, and a real Azure example so this does not stay in slideware. ...

AI use cases for infra teams: AIOps and beyond

Thirteenth post in the series. In the previous one, we dealt with the incidents that wake you up at 2 AM. This time the angle flips: using AI to make the infrastructure work itself less miserable.

tl;dr
- AI helps with summarizing, drafting, and finding patterns across noisy data.
- Do not hand it deterministic enforcement, compliance evidence, or unattended production actions.

Flipping the perspective

Over the past 12 posts, you’ve been building infra for AI: GPUs, clusters, pipelines, security, monitoring, cost management. You know how to keep the runway paved for data scientists. ...