Azure

Multi-Agent Orchestration: Correlating AKS and Azure OpenAI

Chapter 4: Multi-Agent Orchestration

So far the series has built two separate things: in post 1, an agent that talks to AKS via aks-mcp to diagnose the cluster; in posts 2 and 3, a watchdog that watches TPM consumption on Azure OpenAI and decides how urgent an alert should be. Both are useful on their own. Together, they still leave the first SRE question hanging in the air: when token consumption jumps out of nowhere, did somebody deploy something? In most teams that answer still lives in two browser tabs and one annoyed human. ...

From Script to Agent: Giving the Watchdog Decision Autonomy

Chapter 3: From Script to Agent

In the previous post, the Azure OpenAI quota watchdog was a script with if pct_of_tpm > 0.8: alert. That works, but it has the same flaw every blunt monitoring rule has: context does not exist. A batch job that predictably eats 90% of TPM for 10 minutes at month-end looks identical to an agent gone feral and burning tokens all afternoon. Both cross the threshold. Only one should wake somebody up. ...

Platform Engineering on Azure: governance, observability and security for your IDP (Part 2)

Second post in the Azure Platform Engineering series. In Part 1, we built the provisioning layer of the Internal Developer Platform: Dev Center, Azure Deployment Environments, Bicep templates, and shared AKS runtime patterns. That is necessary, but it is not sufficient. An Internal Developer Platform becomes trustworthy when it enforces standards without turning into a bureaucratic cage. That is where governance, observability, and security enter the picture. The platform must make the right path easy, the risky path difficult, and the unsupported path visible. ...

Platform Engineering on Azure: building an Internal Developer Platform with AKS and Bicep (Part 1)

First post in a two-part series on Platform Engineering on Azure. If your developers still need tickets, handoffs, or tribal knowledge to get a usable environment, your delivery system is slower than your codebase. Platform Engineering is how you fix that. The goal is not to hide infrastructure from developers. The goal is to package infrastructure, security, and observability into a self-service product developers can trust. On Azure, that means combining Microsoft Dev Center, Azure Deployment Environments, Bicep, and a shared runtime such as AKS. ...

Postmortems on Azure: automation with Azure DevOps and learning metrics (Part 2)

Second post in the Azure postmortem series. In Part 1, we built the foundation: blameless culture, a reusable template, KQL-based evidence collection, and Logic Apps automation. Now we move from documentation to operations. A mature postmortem process should leave traces in the engineering system: linked work items, measurable trends, dashboards, and visible feedback into reliability practices such as SLOs, alert tuning, and chaos experiments. If a postmortem ends as a document nobody operationalizes, the process failed. ...

Postmortems on Azure: implementing blameless incident analysis with Azure Monitor (Part 1)

First post in a two-part series on Azure postmortems. Incidents are inevitable. Repeat incidents are optional. A lot of teams say they do postmortems, but what they really have is a short meeting, a vague document, and a backlog item nobody revisits. A good Azure postmortem is different: it is blameless, evidence-based, and tightly connected to telemetry. If you already use Azure Monitor, Application Insights, Log Analytics, and Azure Activity Logs, you already have most of the raw material you need. ...

Building a Deterministic 429 Watchdog for Azure OpenAI

Chapter 2: The Deterministic 429 Watchdog

In the previous post I explained what MCP is and how an agent decides its next move from the tools available. Now for something you could actually ship over a weekend: an MCP server that watches token consumption on your Azure OpenAI or Foundry deployment and warns you on Slack or email before the 429 lands in production.

tl;dr

- Watch Azure Monitor metrics before the client hits the first 429.
- Start with a deterministic threshold plus a rising-trend check.
- Add agent reasoning later, after the telemetry and alert path prove they work.

Why this is subtler than it looks

The first reaction from anyone who’s never been bitten by a 429 is “easy, just measure usage and compare it to the quota.” The problem is that TPM (tokens per minute) and RPM (requests per minute) on Azure OpenAI are evaluated over short rolling windows, not a smooth average across the minute. That means you can blow the limit even while staying “under quota” in aggregate, simply because requests arrived in a burst instead of spread out. That’s why teams report 429s “even within the documented limit”: the problem isn’t total volume, it’s distribution over time. ...
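The "deterministic threshold plus a rising-trend check" from the tl;dr can be sketched in a few lines. This is a minimal illustration, not the post's implementation: it assumes the per-minute token totals have already been pulled from Azure Monitor, and the function name `should_alert`, the 0.8 threshold, and the three-sample trend window are all hypothetical choices.

```python
def should_alert(samples, limit_tpm, threshold=0.8, trend_window=3):
    """Decide whether to alert on token consumption.

    samples: per-minute token totals, oldest first (assumed to be
             fetched elsewhere, e.g. from Azure Monitor metrics).
    limit_tpm: the deployment's tokens-per-minute quota.
    threshold and trend_window are illustrative defaults.
    """
    if not samples:
        return False

    # Check 1: plain threshold on the most recent sample.
    if samples[-1] / limit_tpm >= threshold:
        return True

    # Check 2: rising trend. Requires the last `trend_window` samples
    # to be strictly increasing, then extrapolates one step ahead
    # using the average per-sample increase.
    recent = samples[-trend_window:]
    if len(recent) < trend_window:
        return False
    if not all(b > a for a, b in zip(recent, recent[1:])):
        return False

    slope = (recent[-1] - recent[0]) / (trend_window - 1)
    projected = recent[-1] + slope
    return projected / limit_tpm >= threshold


if __name__ == "__main__":
    quota = 100_000  # hypothetical TPM quota
    print(should_alert([50_000, 85_000], quota))          # over threshold now
    print(should_alert([40_000, 55_000, 70_000], quota))  # rising toward it
    print(should_alert([70_000, 60_000, 70_000], quota))  # flat, under threshold
```

Because the samples here are per-minute totals, this check cannot see bursts inside a minute; it only catches sustained or steadily climbing consumption, which is the case the threshold-plus-trend rule targets.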

Context engineering: the art of feeding LLMs

You build a RAG pipeline, connect it to Azure OpenAI, and the answers come back… meh. Generic. Sometimes it ignores the context you sent. Sometimes it makes things up. The model is powerful, but input quality usually determines most of the result. Context engineering is the discipline of assembling that input so the model gives you what you actually need. It is not just “prompt engineering” with a fresher label. It is engineering: structure, constraints, and trade-offs. ...

Visual glossary infra ↔ AI: your Rosetta Stone

Final post in the series. In the previous one, we built the 6-phase adoption framework. This one is the cheat sheet. You already speak infrastructure fluently. AI is not a foreign language. It is infrastructure with worse naming and more hype. This glossary maps each AI term to something you already understand.

tl;dr

- This glossary maps AI jargon to infra concepts so you can reason about AI systems without switching mental models.
- Use it as a translation sheet for conversations about models, data, compute, serving, and ops.
- When a term drives architecture or cost, check the underlying docs before repeating exact numbers.

How to use this

Every entry has: the AI term, the infra analogy in parentheses, a concise definition, and when you’ll encounter it in your work. It is split into 6 categories so you can find things fast instead of pretending you remember all of it. ...

How RAG works: from theory to pipeline

The VP of Product walks into the daily standup: “I want the chatbot to answer questions about our internal documentation. We have 2,000 pages of runbooks, policies, and procedures. ChatGPT doesn’t know any of that.” The ML team says: “We’ll implement RAG.” Everyone nods. You get the job of provisioning the infrastructure. Before you start creating resources, you should know what RAG is actually doing under the hood.

tl;dr

- RAG is search plus an LLM.
- The retrieval layer determines whether the answer is grounded or generic.
- The main moving parts are chunking, embeddings, vector storage, and hybrid search.
- In production, watch retrieval quality and pricing before you obsess over the model.

The map for infra engineers

| RAG concept | What it does | Infra equivalent |
| --- | --- | --- |
| Retrieval | Finds relevant documents | Search engine query |
| Augmentation | Adds docs to the LLM prompt | Build the request payload |
| Generation | LLM produces an answer using the context | The model response |
| Chunking | Splits documents into smaller pieces | Data partitioning, sharding |
| Indexing pipeline | Processes docs and generates embeddings | ETL/data pipeline |
| Hybrid search | Combines semantic search + keyword search | Using CDN + origin server together |

The problem RAG solves

LLMs have two fundamental limitations: ...