Rate-Limiting

Eleventh post in the series. In the previous one, we built the self-service AI platform with multi-tenancy and scheduling. This time it’s the service everybody wants to consume: Azure OpenAI, and how to run it without getting slapped by 429s. tl;dr Azure OpenAI capacity is a token problem before it is a scaling problem. Design around TPM and RPM, back off on 429s, and route across deployments instead of betting everything on one endpoint. The 429 that changed everything Your team launched an internal GPT-4o chatbot on Monday. Day 1 was demos for leadership and Slack praise. Day 3 brought “the bot is slow.” Day 5 brought HTTP 429 on 30% of requests. You open Azure Monitor and find the 80K TPM ceiling waiting for you. ...