Eleventh post in the series. In the previous one, we built the self-service AI platform with multi-tenancy and scheduling. Now: the service everyone wants to consume, Azure OpenAI, and how to operate it without getting 429’d in the face.

The 429 that changed everything

Your team launched an internal GPT-4o chatbot on Monday. Day 1: smooth sailing, demos for leadership, Slack full of praise. Day 3: “the bot is slow.” Day 5: 30% of requests return HTTP 429. You open Azure Monitor and discover you’re hitting the 80K TPM ceiling.

The data science team’s response? “Increase the limit.” But it’s not that simple. Quota increases aren’t instantaneous, and throwing more TPM at the problem doesn’t fix the underlying design. Some requests consume 4,000 tokens for a question that could fit in 200. The system prompt is 1,800 tokens, copied from a blog post and never trimmed. Retry logic hammers the endpoint without backoff, turning throttling into a cascading failure.

What you need isn’t a bigger pipe. You need to understand how Azure OpenAI measures, limits, and charges for capacity.

Tokens: the fundamental unit

A token is a chunk of a word. LLMs don’t process text character by character; they break it into subwords. In English, 1 token ≈ 4 characters ≈ 0.75 words.

Everything in Azure OpenAI is measured in tokens: billing, throughput limits, context windows, rate limiting.

Total Tokens = System Prompt + User Input + Output (completion)

Typical chatbot: 500 tokens (system) + 300 (user) + 800 (response) = 1,600 tokens/request. Multiply by concurrent users and requests per minute: that’s your throughput requirement.

Infra ↔ AI translation: Tokens are the payload packets of the AI world. TPM is your bandwidth ceiling (throughput per minute). RPM is your packets-per-second limit. Same diagnostic reasoning, different units.

Context windows

ModelContext Window
GPT-4o128K tokens
GPT-4o-mini128K tokens
GPT-4 Turbo128K tokens
GPT-3.5 Turbo16K tokens

A large context window doesn’t mean you should fill it. A 100K-token request consumes the same TPM as 62 requests of 1,600 tokens.

Deployment types: the architectural decision

CharacteristicStandardGlobal StandardProvisioned (PTU)
BillingPay per tokenPay per tokenFixed monthly cost per PTU
ThroughputQuota-limited (TPM/RPM)Quota-limited, higher defaultsReserved, guaranteed capacity
LatencyVariable (shared infra)Variable (Microsoft-routed)Predictable, low variance
Data residencySingle regionMicrosoft selects regionSingle region
Throttling429 when quota exceeded429 when quota exceededNo throttling within capacity
Best forDev/test, variable workloadsGlobal apps, no residency restrictionsProduction, apps with SLAs

When to use each

  1. Variable, low volume, experimental? → Standard or Global Standard
  2. Need higher quotas, no data residency restriction? → Global Standard
  3. Data residency within a geography (US, EU)? → Data Zone
  4. Production with SLA, consistently high volume? → Provisioned (PTU)
  5. Mission-critical production with overflow? → PTU primary + Standard overflow

Creating deployments via CLI

# Create Azure OpenAI resource
az cognitiveservices account create \
  --name aoai-prod \
  --resource-group rg-ai-prod \
  --kind OpenAI \
  --sku S0 \
  --location eastus

# Standard deployment (pay-per-token)
az cognitiveservices account deployment create \
  --name aoai-prod \
  --resource-group rg-ai-prod \
  --deployment-name gpt-4o-prod \
  --model-name gpt-4o \
  --model-version "2024-08-06" \
  --model-format OpenAI \
  --sku-name "Standard" \
  --sku-capacity 80

The sku-capacity in Standard is the TPM (in thousands). 80 = 80K TPM.

PTU throughput varies. There’s no fixed TPM-per-PTU number. It depends on the model, prompt length, and response length. Always use the Azure OpenAI capacity calculator with your actual traffic patterns and validate with load testing before committing.

Rate limiting: understanding the two axes

Azure OpenAI enforces two independent limits:

  • TPM (Tokens Per Minute): total tokens (input + output) processed
  • RPM (Requests Per Minute): number of API calls, regardless of tokens

You can hit TPM with a few large requests (RAG with long documents) or RPM with many small requests (single-line classification). They’re different constraints that need different solutions.

Checking deployment rate limits

az cognitiveservices account deployment show \
  --name aoai-prod \
  --resource-group rg-ai-prod \
  --deployment-name gpt-4o-prod \
  --query "properties.rateLimits"

The correct retry pattern (and the wrong one)

The most common mistake: immediate retry in a tight loop. This turns occasional throttling into a storm that takes down the system.

import time
import random
import openai

def call_with_backoff(client, messages, max_retries=5):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="gpt-4o",
                messages=messages
            )
        except openai.RateLimitError as e:
            if attempt == max_retries - 1:
                raise
            retry_after = int(e.response.headers.get("Retry-After", 1))
            wait = retry_after + random.uniform(0, 1)
            time.sleep(wait)

Always respect the Retry-After header and add random jitter to avoid thundering herd (all clients retrying at the same instant).

High availability: multi-deployment

For production, never depend on a single deployment in a single region.

Architecture with APIM as gateway

Azure API Management in front of multiple Azure OpenAI deployments:

  1. Primary: PTU deployment in East US (guaranteed capacity, no 429s)
  2. Secondary: Standard deployment in West US (overflow, pay-per-token)
  3. Tertiary: Global Standard (catch-all when primaries are under pressure)

APIM handles routing based on availability and rate limit headers. If primary returns 429, it redirects to secondary automatically.

Capacity monitoring

# Token transaction metrics
az monitor metrics list \
  --resource "/subscriptions/{sub}/resourceGroups/rg-ai-prod/providers/Microsoft.CognitiveServices/accounts/aoai-prod" \
  --metric "TokenTransaction" \
  --interval PT1M \
  --aggregation Total \
  --filter "ModelDeploymentName eq 'gpt-4o-prod'"

Alerts that matter

MetricThresholdAction
TPM usage > 80%Sustained 5 minEvaluate scale or routing
HTTP 429 rate > 1%Sustained 2 minActivate overflow deployment
TTFT P95 > 3sSustained 5 minInvestigate capacity
Error rate > 5%ImmediateIncident response

Cost and performance optimization

Prompt caching

Azure OpenAI supports automatic caching for repeated prefixes. If your system prompt is identical across all requests (and it should be), cached tokens are charged at a reduced price. Structure prompts with the static part first.

Multi-model routing

Not every request needs the most capable (and most expensive) model. Route accordingly:

Request typeModelRationale
Simple FAQ, classificationGPT-4o-mini94% cheaper, sufficient quality
Short summarizationGPT-4o-miniGood quality for simple texts
Complex reasoningGPT-4oNeeds the full model
Code generationGPT-4oAccuracy matters more than cost

A simple router (based on input length, keyword presence, or quick classification) can cut inference costs by 50-80%.

In the next post

Azure OpenAI running with HA, correct retry, and multi-model routing. Next: the troubleshooting playbook. The real scenarios that generate pages at 2 AM: NVIDIA driver crash, CUDA OOM, pods stuck in Pending, and inference latency spikes.