Azure OpenAI

Multi-Agent Orchestration: Correlating AKS and Azure OpenAI

Chapter 4: Multi-Agent Orchestration

So far the series has built two separate things: in post 1, an agent that talks to AKS via aks-mcp to diagnose the cluster; in posts 2 and 3, a watchdog that watches TPM consumption on Azure OpenAI and decides how urgent an alert should be. Both are useful on their own. Together, they still leave the first SRE question hanging in the air: when token consumption jumps out of nowhere, did somebody deploy something? In most teams that answer still lives in two browser tabs and one annoyed human. ...

From Script to Agent: Giving the Watchdog Decision Autonomy

Chapter 3: From Script to Agent

In the previous post, the Azure OpenAI quota watchdog was a script with if pct_of_tpm > 0.8: alert. That works, but it has the same flaw every blunt monitoring rule has: context does not exist. A batch job that predictably eats 90% of TPM for 10 minutes at month-end looks identical to an agent gone feral and burning tokens all afternoon. Both cross the threshold. Only one should wake somebody up. ...

Building a Deterministic 429 Watchdog for Azure OpenAI

Chapter 2: The Deterministic 429 Watchdog

In the previous post I explained what MCP is and how an agent decides its next move from the tools available. Now for something you could actually ship over a weekend: an MCP server that watches token consumption on your Azure OpenAI or Foundry deployment and warns you on Slack or email before the 429 lands in production.

tl;dr: Watch Azure Monitor metrics before the client hits the first 429. Start with a deterministic threshold plus a rising-trend check. Add agent reasoning later, after the telemetry and alert path prove they work.

Why this is subtler than it looks

The first reaction from anyone who’s never been bitten by a 429 is “easy, just measure usage and compare it to the quota.” The problem is that TPM (tokens per minute) and RPM (requests per minute) on Azure OpenAI are evaluated over short rolling windows, not a smooth average across the minute. That means you can blow the limit even while staying “under quota” in aggregate, simply because requests arrived in a burst instead of spread out. That’s why teams report 429s “even within the documented limit”: the problem isn’t total volume, it’s distribution over time. ...
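The "deterministic threshold plus a rising-trend check" from the tl;dr can be sketched in a few lines. This is a minimal illustration, not the post's actual watchdog: the function name `should_alert`, the 0.8 threshold (borrowed from the `pct_of_tpm > 0.8` rule mentioned elsewhere in the series), the five-sample window, and the 5%-per-minute rise are all assumptions, and the samples are plain numbers rather than a real Azure Monitor query.

```python
from statistics import mean


def should_alert(samples, tpm_limit, threshold=0.8, trend_window=5, min_rise=0.05):
    """Decide whether to alert from per-minute token counts (oldest first).

    All parameter defaults are illustrative assumptions, not values from the post.
    Returns a (bool, reason) tuple.
    """
    if not samples:
        return False, "no data"

    # Rule 1: the latest sample is already at or above the threshold.
    if samples[-1] / tpm_limit >= threshold:
        return True, "threshold"

    # Rule 2: usage rose on every step of the recent window, and the average
    # step is at least `min_rise` of the limit per minute.
    recent = samples[-trend_window:]
    if len(recent) >= 2:
        deltas = [(b - a) / tpm_limit for a, b in zip(recent, recent[1:])]
        if all(d > 0 for d in deltas) and mean(deltas) >= min_rise:
            return True, "rising trend"

    return False, "ok"


if __name__ == "__main__":
    print(should_alert([20_000, 28_000, 36_000, 44_000], tpm_limit=80_000))
```

Because the samples here are per-minute totals, this check can still miss the burst problem described above; a shorter sampling interval would be needed to see within-minute spikes.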

Context engineering: the art of feeding LLMs

You build a RAG pipeline, connect it to Azure OpenAI, and the answers come back… meh. Generic. Sometimes it ignores the context you sent. Sometimes it makes things up. The model is powerful, but input quality usually determines most of the result. Context engineering is the discipline of assembling that input so the model gives you what you actually need. It is not just “prompt engineering” with a fresher label. It is engineering: structure, constraints, and trade-offs. ...

How RAG works: from theory to pipeline

The VP of Product walks into the daily standup: “I want the chatbot to answer questions about our internal documentation. We have 2,000 pages of runbooks, policies, and procedures. ChatGPT doesn’t know any of that.” The ML team says: “We’ll implement RAG.” Everyone nods. You get the job of provisioning the infrastructure. Before you start creating resources, you should know what RAG is actually doing under the hood.

tl;dr: RAG is search plus an LLM. The retrieval layer determines whether the answer is grounded or generic. The main moving parts are chunking, embeddings, vector storage, and hybrid search. In production, watch retrieval quality and pricing before you obsess over the model.

The map for infra engineers

RAG concept | What it does | Infra equivalent
Retrieval | Finds relevant documents | Search engine query
Augmentation | Adds docs to the LLM prompt | Build the request payload
Generation | LLM produces an answer using the context | The model response
Chunking | Splits documents into smaller pieces | Data partitioning, sharding
Indexing pipeline | Processes docs and generates embeddings | ETL/data pipeline
Hybrid search | Combines semantic search + keyword search | Using CDN + origin server together

The problem RAG solves

LLMs have two fundamental limitations: ...

Troubleshooting playbook: incidents that will wake you at 2AM

Twelfth post in the series. In the previous one, we ran Azure OpenAI with HA and sane retry patterns. This one is for when the nice diagram meets real life. This post is organized as real-world failure scenarios. Each follows: Symptoms → Diagnosis → Root Cause → Resolution → Prevention. Read it once for pattern recognition. Then bookmark it. You will need it again.

tl;dr: Most late-night AI infra incidents come down to driver drift, memory pressure, scheduler mismatch, throttling, or cold starts. Start with the first check that rules out the biggest class of failure.

Scenario 1: NVIDIA driver crash after kernel update

Symptoms

Monday morning. The ML team reports that all GPU workloads failed over the weekend. Nobody deployed anything. You SSH in: ...

Azure OpenAI in production: tokens, throughput, and high availability

Eleventh post in the series. In the previous one, we built the self-service AI platform with multi-tenancy and scheduling. This time it’s the service everybody wants to consume: Azure OpenAI, and how to run it without getting slapped by 429s.

tl;dr: Azure OpenAI capacity is a token problem before it is a scaling problem. Design around TPM and RPM, back off on 429s, and route across deployments instead of betting everything on one endpoint.

The 429 that changed everything

Your team launched an internal GPT-4o chatbot on Monday. Day 1 was demos for leadership and Slack praise. Day 3 brought “the bot is slow.” Day 5 brought HTTP 429 on 30% of requests. You open Azure Monitor and find the 80K TPM ceiling waiting for you. ...
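As a rough illustration of "back off on 429s, and route across deployments", the sketch below retries a throttled call with exponential backoff and full jitter, moving to the next deployment on each attempt. Everything here is hypothetical: `call_with_backoff`, the `send` callable, the deployment names, and the delay values are assumptions for illustration, and no real Azure OpenAI client is involved.

```python
import random
import time


def call_with_backoff(send, deployments, max_attempts=5,
                      base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call send(deployment) -> (status_code, body), retrying on HTTP 429.

    `send` is a caller-supplied function (hypothetical here); each retry
    rotates to the next deployment in the list and waits a jittered,
    exponentially growing delay first.
    """
    for attempt in range(max_attempts):
        deployment = deployments[attempt % len(deployments)]
        status, body = send(deployment)
        if status != 429:
            return deployment, status, body
        # Full jitter: wait a random time up to the capped exponential delay.
        delay = min(max_delay, base_delay * 2 ** attempt)
        sleep(random.uniform(0, delay))
    raise RuntimeError("all attempts were throttled")


if __name__ == "__main__":
    # Fake backend for demonstration: the first deployment is always throttled.
    def fake_send(deployment):
        return (429, "throttled") if deployment == "primary" else (200, "ok")

    print(call_with_backoff(fake_send, ["primary", "secondary"]))
```

Rotating on every 429 is the simplest possible routing policy; a real setup would more likely weight deployments by remaining quota or put a gateway in front of them.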

Cost engineering for AI: when idle GPUs cost more than your car

Ninth post in the series. In the previous one, we hardened the platform against prompt injection and data leakage. Now for the part Finance notices first: how not to go bankrupt in the process.

tl;dr: AI cost control starts with lifecycle policy, model choice, and quota discipline. Shut down idle GPUs, use cheaper models where quality allows, and treat every exact cost number as time-sensitive.

The $127,000 Monday

Monday morning. Coffee in hand, email from Finance with the subject line: “URGENT: Azure invoice $127,000, please explain.” Forecast was $42,000. Two ND96isr_H100_v5 VMs, provisioned three weeks ago for a “quick experiment,” never shut down. At about $98/hour each, running 24/7 for three weeks: roughly $99,000 in idle GPU compute. Nobody was using them. Nobody remembered they existed. ...
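The "roughly $99,000" figure follows directly from the numbers in the story. A small sketch of the arithmetic, using the post's approximate $98/hour rate (the helper name `idle_cost` is made up for illustration, and the rate is the post's estimate, not a current price):

```python
def idle_cost(vm_count, hourly_rate, days):
    """Cost of VMs left running around the clock for a number of days."""
    return vm_count * hourly_rate * 24 * days


if __name__ == "__main__":
    # Two VMs at about $98/hour each, running 24/7 for three weeks (21 days).
    print(idle_cost(vm_count=2, hourly_rate=98, days=21))  # 98784
```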

Monitoring and observability for AI: when the green dashboard lies

Seventh post in the series. In the previous one, we put models into production with CI/CD pipelines. Now: how do you know they’re actually healthy?

tl;dr: Infra health is not model health. Track GPU, token, application, and answer-quality signals together or you will miss regressions while every dashboard stays green.

The silent failure

Your Azure OpenAI endpoint returns 200 OK on every request. Latency is normal, P95 under 800ms. CPU and memory within thresholds. Kubernetes shows healthy pods, no restarts. By every infra metric you trust, the system is perfect. ...

Introduction to AI and Comparing OpenAI with Azure OpenAI

As I embark on my journey of learning about artificial intelligence (AI), I am discovering the fascinating world of large language models (LLMs) and their applications in various technologies. In this article, I aim to share my newfound knowledge and insights with others who are also beginning their journey in AI. We will explore OpenAI, one of the leading organizations in AI research and development, and compare its offerings with Microsoft’s Azure OpenAI service. ...