Infrastructure

Context engineering: the art of feeding LLMs

You build a RAG pipeline, connect it to Azure OpenAI, and the answers come back… meh. Generic. Sometimes it ignores the context you sent. Sometimes it makes things up. The model is powerful, but input quality usually determines most of the result. Context engineering is the discipline of assembling that input so the model gives you what you actually need. It is not just “prompt engineering” with a fresher label. It is engineering: structure, constraints, and trade-offs. ...

Visual glossary infra ↔ AI: your Rosetta Stone

Final post in the series. In the previous one, we built the 6-phase adoption framework. This one is the cheat sheet. You already speak infrastructure fluently. AI is not a foreign language. It is infrastructure with worse naming and more hype. This glossary maps each AI term to something you already understand.

tl;dr: This glossary maps AI jargon to infra concepts so you can reason about AI systems without switching mental models. Use it as a translation sheet for conversations about models, data, compute, serving, and ops. When a term drives architecture or cost, check the underlying docs before repeating exact numbers.

How to use this. Every entry has: the AI term, the infra analogy in parentheses, a concise definition, and when you’ll encounter it in your work. It is split into 6 categories so you can find things fast instead of pretending you remember all of it. ...

How RAG works: from theory to pipeline

The VP of Product walks into the daily standup: “I want the chatbot to answer questions about our internal documentation. We have 2,000 pages of runbooks, policies, and procedures. ChatGPT doesn’t know any of that.” The ML team says: “We’ll implement RAG.” Everyone nods. You get the job of provisioning the infrastructure. Before you start creating resources, you should know what RAG is actually doing under the hood.

tl;dr: RAG is search plus an LLM. The retrieval layer determines whether the answer is grounded or generic. The main moving parts are chunking, embeddings, vector storage, and hybrid search. In production, watch retrieval quality and pricing before you obsess over the model.

The map for infra engineers:

RAG concept | What it does | Infra equivalent
Retrieval | Finds relevant documents | Search engine query
Augmentation | Adds docs to the LLM prompt | Build the request payload
Generation | LLM produces an answer using the context | The model response
Chunking | Splits documents into smaller pieces | Data partitioning, sharding
Indexing pipeline | Processes docs and generates embeddings | ETL/data pipeline
Hybrid search | Combines semantic search + keyword search | Using CDN + origin server together

The problem RAG solves. LLMs have two fundamental limitations: ...
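
To make the chunking, retrieval, and augmentation rows of the map concrete, here is a minimal sketch in plain Python. It is a toy under stated assumptions, not the pipeline from the post: the function names (`chunk_words`, `retrieve`, `build_prompt`) are made up for illustration, and retrieval is scored by simple keyword overlap instead of embeddings or a real search service.

```python
import re


def chunk_words(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    """Chunking: split text into overlapping windows of `size` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def tokenize(text: str) -> set[str]:
    """Lowercase alphanumeric terms; a stand-in for a real analyzer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Retrieval: rank chunks by shared terms with the query (keyword only)."""
    query_terms = tokenize(query)
    ranked = sorted(chunks, key=lambda c: len(query_terms & tokenize(c)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, context_chunks: list[str]) -> str:
    """Augmentation: place the retrieved chunks ahead of the question."""
    context = "\n---\n".join(context_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


# Hypothetical runbook snippets, already chunked for brevity.
docs = [
    "restart the nginx service with systemctl",
    "vacation policy allows 20 days",
    "nginx logs live in var log",
]
prompt = build_prompt("how do I restart nginx", retrieve("how do I restart nginx", docs))
```

Generation would be the call that sends `prompt` to the model. A production setup would likely replace the keyword scoring with embeddings and a vector or hybrid index, but the shape of the flow stays the same.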

AI adoption framework: from enthusiasm to governance

Fourteenth post in the series. In the previous one, we used AI for our own infrastructure work. This time the scope is bigger: how to take an entire organization from “let’s use AI” to a governed platform that can survive contact with finance, security, and production support.

tl;dr: AI adoption fails when teams skip readiness, guardrails, and cost controls. A workable path is assessment, enablement, platform prep, controlled experimentation, production governance, and continuous review. Treat AI as an operating capability with budgets, runbooks, and policies from day 1.

Best intentions, worst outcomes. Your CTO walks into the all-hands and says: “We’re going all-in on AI.” The room buzzes. Teams start brainstorming use cases before the meeting ends. Within two weeks, Slack is full of threads about GPU availability. ...

MCP and AI Agents 101 for Infrastructure Engineers

Chapter 1: MCP and AI Agents 101

At some point in the last few months, someone on your team probably showed up talking about an “AI agent” or an “MCP server” and asked for cluster access, a deployment, or an explanation for the CISO. I wish I’d had a clean mental model before I touched any of this. That’s what this post is: no hype, and a real Azure example so this does not stay in slideware. ...

AI use cases for infra teams: AIOps and beyond

Thirteenth post in the series. In the previous one, we dealt with the incidents that wake you up at 2 AM. This time the angle flips: using AI to make the infrastructure work itself less miserable.

tl;dr: AI helps with summarizing, drafting, and finding patterns across noisy data. Do not hand it deterministic enforcement, compliance evidence, or unattended production actions.

Flipping the perspective. Over the past 12 posts, you’ve been building infra for AI: GPUs, clusters, pipelines, security, monitoring, cost management. You know how to keep the runway paved for data scientists. ...

Troubleshooting playbook: incidents that will wake you at 2AM

Twelfth post in the series. In the previous one, we ran Azure OpenAI with HA and sane retry patterns. This one is for when the nice diagram meets real life. This post is organized as real-world failure scenarios. Each follows: Symptoms → Diagnosis → Root Cause → Resolution → Prevention. Read it once for pattern recognition. Then bookmark it. You will need it again.

tl;dr: Most late-night AI infra incidents come down to driver drift, memory pressure, scheduler mismatch, throttling, or cold starts. Start with the first check that rules out the biggest class of failure.

Scenario 1: NVIDIA driver crash after kernel update. Symptoms: Monday morning. The ML team reports that all GPU workloads failed over the weekend. Nobody deployed anything. You SSH in: ...

Azure OpenAI in production: tokens, throughput, and high availability

Eleventh post in the series. In the previous one, we built the self-service AI platform with multi-tenancy and scheduling. This time it’s the service everybody wants to consume: Azure OpenAI, and how to run it without getting slapped by 429s.

tl;dr: Azure OpenAI capacity is a token problem before it is a scaling problem. Design around TPM and RPM, back off on 429s, and route across deployments instead of betting everything on one endpoint.

The 429 that changed everything. Your team launched an internal GPT-4o chatbot on Monday. Day 1 was demos for leadership and Slack praise. Day 3 brought “the bot is slow.” Day 5 brought HTTP 429 on 30% of requests. You open Azure Monitor and find the 80K TPM ceiling waiting for you. ...
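
The "back off on 429s, and route across deployments" advice can be sketched without any SDK. The version below is an illustration under assumptions, not the post's implementation: `send` is a hypothetical callable you would wrap around your own HTTP client, the deployment names are placeholders, and the delays use capped exponential backoff with full jitter. If your responses carry a Retry-After header, honoring it is usually preferable to a computed delay.

```python
import random
import time
from typing import Callable, Sequence, Tuple

# Hypothetical transport: takes a deployment name, returns (http_status, body).
SendFn = Callable[[str], Tuple[int, str]]


def call_with_backoff(
    send: SendFn,
    deployments: Sequence[str],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[str, int, str]:
    """Retry on HTTP 429, rotating deployments and backing off with jitter."""
    if not deployments:
        raise ValueError("at least one deployment is required")
    for attempt in range(max_attempts):
        # Round-robin: each retry goes to the next deployment in the list.
        deployment = deployments[attempt % len(deployments)]
        status, body = send(deployment)
        if status != 429:
            return deployment, status, body
        if attempt < max_attempts - 1:
            # Capped exponential backoff; full jitter spreads out retries.
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
    raise RuntimeError(f"still throttled after {max_attempts} attempts")


def fake_send(deployment: str) -> Tuple[int, str]:
    """Stand-in for a real request: the 'primary' deployment is throttled."""
    return (429, "throttled") if deployment == "primary" else (200, "ok")


print(call_with_backoff(fake_send, ["primary", "secondary"], sleep=lambda s: None))
# prints ('secondary', 200, 'ok')
```

Passing `sleep` as a parameter keeps the function testable without real waits. Round-robin is the simplest routing choice; weighting deployments by remaining quota would be a natural refinement.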

Platform ops: building a self-service AI platform

Tenth post in the series. In the previous one, we controlled costs with Spot VMs, right-sizing, and FinOps. Now for the next problem: how to stop being a human help desk for GPU access.

tl;dr: Self-service AI platforms need isolation, quotas, and scheduling together. The goal is fewer tickets, not faster manual provisioning.

The Slack channel that ate your calendar. Six months ago, you provisioned a single GPU VM for the ML team. Configured drivers, mounted storage, closed the ticket. Felt like any other infrastructure request. ...

Cost engineering for AI: when idle GPUs cost more than your car

Ninth post in the series. In the previous one, we hardened the platform against prompt injection and data leakage. Now for the part Finance notices first: how not to go bankrupt in the process.

tl;dr: AI cost control starts with lifecycle policy, model choice, and quota discipline. Shut down idle GPUs, use cheaper models where quality allows, and treat every exact cost number as time-sensitive.

The $127,000 Monday. Monday morning. Coffee in hand, email from Finance with the subject line: “URGENT: Azure invoice $127,000, please explain.” Forecast was $42,000. Two ND96isr_H100_v5 VMs, provisioned three weeks ago for a “quick experiment,” never shut down. At about $98/hour each, running 24/7 for three weeks: roughly $99,000 in idle GPU compute. Nobody was using them. Nobody remembered they existed. ...
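
The "roughly $99,000" figure is easy to check. A quick sketch of the arithmetic, using only the numbers quoted in the excerpt (the $98/hour rate is the post's approximate figure, not a current price, and `idle_cost` is a helper named here for illustration):

```python
def idle_cost(vm_count: int, hourly_rate_usd: float, days: int) -> float:
    """Cost of VMs left running 24/7 for the given number of days."""
    return vm_count * hourly_rate_usd * 24 * days


# Two VMs at about $98/hour each, running for three weeks (21 days).
cost = idle_cost(vm_count=2, hourly_rate_usd=98.0, days=21)
print(f"${cost:,.0f}")  # prints $98,784
```

That is 2 × 98 × 24 × 21 = 98,784, which rounds to the roughly $99,000 quoted in the excerpt.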