GitHub Actions

Platform Engineering on Azure: governance, observability and security for your IDP (Part 2)

Second post in the Azure Platform Engineering series. In Part 1, we built the provisioning layer of the Internal Developer Platform: Dev Center, Azure Deployment Environments, Bicep templates, and shared AKS runtime patterns. That is necessary, but it is not sufficient. An Internal Developer Platform becomes trustworthy when it enforces standards without turning into a bureaucratic cage. That is where governance, observability, and security enter the picture. The platform must make the right path easy, the risky path difficult, and the unsupported path visible. ...

MLOps: model lifecycle for infra engineers

Sixth post in the series. In the previous one, we automated GPU cluster provisioning. Next comes what happens after the hardware is ready: how a model goes from “works on my notebook” to “running in production with an SLA.”
tl;dr: Models need the same artifact, promotion, and rollback discipline as application builds. Use a real registry with metadata and controlled deployments. Prefer MLflow aliases over deprecated stages when describing promotions.
The model with no birth certificate
A data scientist drops a message in the team channel with a link to a shared drive: “Here’s the model. It’s a 15 GB PyTorch checkpoint. We need it in production by Friday.” ...

Infrastructure as Code for AI: automating GPU clusters

Fifth post in the series. In the previous one, we went inside the GPU. This time we automate everything around it. Understanding GPUs is useful. Provisioning them consistently and at scale is where infrastructure engineering actually meets AI.
tl;dr: IaC is the only sane way to provision expensive GPU infrastructure repeatedly. Validate SKU choices, remote state, and deployment guardrails before apply. If the pipeline uses OIDC, Terraform and GitHub Actions both need explicit OIDC settings.
The $4,000 typo
I started the week with a win. I manually provisioned a GPU cluster in East US 2 for an ML experiment: AKS with a Standard_NC6s_v3 node pool, accelerated networking, GPU drivers, correct taints. It took most of a day, but it worked. ...