Seventh post in the series. In the previous one, we put models into production with CI/CD pipelines. Now: how do you know they’re actually healthy?

The silent failure

Your Azure OpenAI endpoint returns 200 OK on every request. Latency is normal, P95 under 800ms. CPU and memory within thresholds. Kubernetes shows healthy pods, no restarts. By every infra metric you trust, the system is perfect.

But the support tickets keep coming. Users report that the chatbot “gives worse answers”: responses that are fluent but factually incorrect. Hallucinations are up, summaries miss key points, and code suggestions introduce subtle bugs.

You pull up the monitoring stack. Azure Monitor: green. Application Insights: green. Grafana: all green. A wall of healthy metrics while the system is actively failing its users.

The problem? Model drift. A recent fine-tuning run introduced a quality regression, and outputs degraded gradually over two weeks. No alert fired because you’re monitoring infrastructure metrics, not AI metrics. Your observability stack was built for traditional workloads where “the server is up and responding” = “the system is working.” In AI, a model can be running perfectly and still be wrong.

The 6 dimensions of AI observability

Traditional monitoring covers compute, network, and storage. Necessary but insufficient for AI.

| # | Dimension | What to monitor | Priority |
|---|-----------|-----------------|----------|
| 1 | Compute (GPU) | Utilization, memory, temperature, ECC errors | P0 |
| 2 | Cost | GPU spend, tokens consumed, cost per inference | P0 |
| 3 | Model | Accuracy, drift, latency, error rates | P1 |
| 4 | Security | Prompt injection, data exfiltration, anomalous consumption | P1 |
| 5 | Network | InfiniBand health, cross-node latency, throughput | P2 |
| 6 | Data | Pipeline freshness, quality, ingestion failures | P2 |

Infra ↔ AI translation: Monitoring a web server means tracking CPU, memory, disk, and network. Monitoring an AI workload is like monitoring a web server, a database, a billing system, and a QA department at the same time. The model doesn’t just consume resources; it produces outputs that can be right or wrong, a dimension of correctness that traditional infrastructure metrics never had to capture.

GPU monitoring: the foundation

DCGM Exporter on AKS

NVIDIA DCGM Exporter runs as a DaemonSet (one pod per GPU node) and exposes metrics in Prometheus format:

# Add NVIDIA Helm repo
helm repo add nvidia https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update

# Install DCGM Exporter as DaemonSet on GPU nodes
helm install dcgm-exporter nvidia/dcgm-exporter \
  --namespace gpu-monitoring \
  --create-namespace \
  --set nodeSelector."agentpool"="gpu"

Azure Managed Prometheus

Eliminates the need to run your own Prometheus server:

# Enable Azure Monitor managed Prometheus
az aks update \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --enable-azure-monitor-metrics

# Verify it's enabled
az aks show \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --query "azureMonitorProfile.metrics.enabled"

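Managed Prometheus scrapes a default set of cluster targets (kubelet, cAdvisor, kube-state-metrics, node exporter) out of the box. DCGM Exporter is typically not among those defaults, so confirm that DCGM metrics are actually arriving. If they are not, add a PodMonitor or ServiceMonitor for the exporter, or enable pod-annotation scraping in the ama-metrics-settings-configmap ConfigMap.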

GPU metrics and alert thresholds

| Metric | DCGM Name | Warning | Critical | Meaning |
|--------|-----------|---------|----------|---------|
| GPU Utilization | DCGM_FI_DEV_GPU_UTIL | < 30% sustained | < 10% sustained | Wasted spend |
| GPU Memory Used | DCGM_FI_DEV_FB_USED | > 85% | > 95% | OOM risk |
| GPU Temperature | DCGM_FI_DEV_GPU_TEMP | > 78°C | > 83°C | Thermal throttling |
| ECC Errors | DCGM_FI_DEV_ECC_DBE_VOL_TOTAL | > 0 | > 0 | Degrading hardware |

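Nuance: Low GPU utilization isn’t always a problem. Latency-sensitive inference workloads intentionally keep utilization low to maintain fast responses. Check whether the workload optimizes for throughput (training: high utilization expected) or latency (inference: moderate is OK). Also note that DCGM_FI_DEV_FB_USED is reported in MiB, so the percentage thresholds for memory require dividing it by the GPU's total framebuffer memory.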
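To make the thresholds concrete, here is a minimal Python sketch that classifies one GPU sample against the table's warning and critical levels. The `GpuSample` class and `classify` function are hypothetical names, not part of DCGM or any SDK. It assumes the caller passes values already averaged over whatever window counts as "sustained", and it approximates total framebuffer as used plus free (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE, both in MiB), which ignores reserved memory.

```python
from dataclasses import dataclass


@dataclass
class GpuSample:
    """One windowed-average reading per GPU (hypothetical container)."""
    util_pct: float      # DCGM_FI_DEV_GPU_UTIL
    fb_used_mib: float   # DCGM_FI_DEV_FB_USED
    fb_free_mib: float   # DCGM_FI_DEV_FB_FREE
    temp_c: float        # DCGM_FI_DEV_GPU_TEMP
    ecc_dbe: int         # DCGM_FI_DEV_ECC_DBE_VOL_TOTAL


def classify(sample, latency_sensitive=False):
    """Return {dimension: severity} for every threshold the sample crosses."""
    alerts = {}

    # Low utilization only signals waste for throughput-oriented workloads.
    if not latency_sensitive:
        if sample.util_pct < 10:
            alerts["utilization"] = "critical"
        elif sample.util_pct < 30:
            alerts["utilization"] = "warning"

    # Convert MiB to a percentage of (approximate) total framebuffer.
    total_mib = sample.fb_used_mib + sample.fb_free_mib
    mem_pct = 100 * sample.fb_used_mib / total_mib if total_mib else 0.0
    if mem_pct > 95:
        alerts["memory"] = "critical"
    elif mem_pct > 85:
        alerts["memory"] = "warning"

    if sample.temp_c > 83:
        alerts["temperature"] = "critical"
    elif sample.temp_c > 78:
        alerts["temperature"] = "warning"

    # Any double-bit ECC error is treated as critical.
    if sample.ecc_dbe > 0:
        alerts["ecc"] = "critical"

    return alerts
```

Passing `latency_sensitive=True` suppresses the utilization check, which mirrors the nuance above: an inference pool tuned for fast responses should not page anyone for running at 25%.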

Azure OpenAI monitoring

Key metrics

| Metric | What it is | When to investigate |
|--------|------------|---------------------|
| TPM (Tokens Per Minute) | Throughput against allocated capacity | > 80% of limit, sustained |
| RPM (Requests Per Minute) | Individual calls regardless of tokens | Many small requests saturating before TPM |
| TTFT (Time to First Token) | Perceived latency for streaming | > 2 seconds (feels slow to the user) |
| HTTP 429 Rate | Throttling signal | > 1% sustained |

Infra ↔ AI translation: TPM limits are like bandwidth throttling. RPM limits are like connection-rate limiting. Token budgets are the AI equivalent of data transfer quotas.
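The "when to investigate" rules in the table can be sketched as a single check. This is an illustration only: `quota_signals` and its parameters are hypothetical, the inputs are assumed to be per-minute figures you have already pulled from your metrics backend, and "RPM saturating before TPM" is interpreted here as RPM utilization above 80% and higher than TPM utilization.

```python
def quota_signals(tokens_per_min, tpm_limit, requests_per_min, rpm_limit,
                  throttled_requests, ttft_seconds):
    """Return the names of the metrics that warrant investigation."""
    signals = []

    tpm_util = tokens_per_min / tpm_limit
    rpm_util = requests_per_min / rpm_limit

    # Token throughput close to the allocated capacity.
    if tpm_util > 0.80:
        signals.append("tpm")

    # Many small requests: request quota is the tighter constraint.
    if rpm_util > 0.80 and rpm_util > tpm_util:
        signals.append("rpm")

    # Share of requests rejected with HTTP 429.
    rate_429 = throttled_requests / requests_per_min if requests_per_min else 0.0
    if rate_429 > 0.01:
        signals.append("http_429")

    # Streaming feels slow once the first token takes over two seconds.
    if ttft_seconds > 2.0:
        signals.append("ttft")

    return signals


# Token-heavy traffic with some throttling.
print(quota_signals(90_000, 100_000, 500, 600, 10, 1.2))
```

In a real alert rule each condition would also need a duration ("sustained"), which this single-sample sketch leaves to the caller.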

Enable diagnostic logging

az monitor diagnostic-settings create \
  --resource "/subscriptions/<sub-id>/resourceGroups/myRG/providers/Microsoft.CognitiveServices/accounts/myAOAI" \
  --name "aoai-diagnostics" \
  --workspace "/subscriptions/<sub-id>/resourceGroups/myRG/providers/Microsoft.OperationalInsights/workspaces/myWorkspace" \
  --logs '[{"category":"RequestResponse","enabled":true},{"category":"Audit","enabled":true}]' \
  --metrics '[{"category":"AllMetrics","enabled":true}]'

Watch the volume: A deployment processing 1,000 RPM generates ~1.4 million log entries per day. Configure retention policies in Log Analytics: 30 days for operational debugging, longer for compliance.
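The volume figure is simple arithmetic, and it is worth running for your own traffic before choosing a retention period. The sketch below assumes one log entry per request; the function names are made up for illustration.

```python
MINUTES_PER_DAY = 60 * 24


def daily_log_entries(requests_per_minute):
    """Entries per day, assuming one log entry per request."""
    return requests_per_minute * MINUTES_PER_DAY


def retained_entries(requests_per_minute, retention_days):
    """Entries held in the workspace once retention reaches steady state."""
    return daily_log_entries(requests_per_minute) * retention_days


print(daily_log_entries(1_000))      # 1440000
print(retained_entries(1_000, 30))   # 43200000
```

At 1,000 RPM, a 30-day retention window holds roughly 43 million entries, which is the number that drives Log Analytics storage and query cost.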

Application-level observability

OpenTelemetry for distributed tracing

Modern AI applications involve multiple services: API gateway, preprocessing, embedding, vector search, LLM inference, post-processing. OpenTelemetry follows a request through the entire pipeline:

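from azure.monitor.opentelemetry import configure_azure_monitor
from opentelemetry import trace

# Reads the Application Insights connection string from the
# APPLICATIONINSIGHTS_CONNECTION_STRING environment variable.
configure_azure_monitor()
tracer = trace.get_tracer("inference-pipeline")

# embed(), search(), and generate() stand in for your application's own
# embedding, vector-search, and LLM client functions.
def process_request(user_query):
    # Parent span: covers the whole request end to end.
    with tracer.start_as_current_span("inference-pipeline"):
        with tracer.start_as_current_span("generate-embedding"):
            embedding = embed(user_query)

        with tracer.start_as_current_span("vector-search"):
            context = search(embedding, top_k=5)

        with tracer.start_as_current_span("llm-inference") as llm_span:
            response = generate(user_query, context)
            # Record token usage on the span so cost and latency can be correlated.
            llm_span.set_attribute("tokens.prompt", response.usage.prompt_tokens)
            llm_span.set_attribute("tokens.completion", response.usage.completion_tokens)

        return response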

Custom metrics for AI

  • Inference latency percentiles (P50, P95, P99): P50 = typical experience, P95/P99 = tail latency
  • Tokens per second: LLM inference throughput. Dropping = memory pressure or degradation
  • Queue depth: requests waiting for GPU. Growing with stable throughput = need to scale out
  • Cache hit rate: for semantic caching. High hit rate = less latency and cost
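As a rough illustration of how these metrics are derived from raw measurements, the sketch below computes nearest-rank percentiles, tokens per second, and cache hit rate using only the standard library. The function names are hypothetical; in production you would normally emit raw observations and let your metrics backend aggregate them.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def tokens_per_second(completion_tokens, duration_seconds):
    """Generation throughput for a single request."""
    return completion_tokens / duration_seconds


def cache_hit_rate(hits, misses):
    """Fraction of lookups served from the semantic cache."""
    total = hits + misses
    return hits / total if total else 0.0


latencies_ms = list(range(1, 101))  # synthetic data: 1 ms .. 100 ms
print(percentile(latencies_ms, 50), percentile(latencies_ms, 95), percentile(latencies_ms, 99))
print(tokens_per_second(480, 12.0))
print(cache_hit_rate(30, 70))
```

Percentiles are computed over the full sample here; a moving window (for example, the last five minutes) is the usual choice for alerting so that old traffic does not mask a fresh regression.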

Structured logging (mandatory)

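import logging
import json

class StructuredFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        # Custom fields arrive via logger.info(..., extra={...}) and fall back
        # to defaults when a call site omits them.
        log_entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            "service": getattr(record, "service", "inference-api"),
            "deployment": getattr(record, "deployment", "unknown"),
            "model_version": getattr(record, "model_version", "unknown"),
            "request_id": getattr(record, "request_id", None),
            "tokens_used": getattr(record, "tokens_used", None),
            "latency_ms": getattr(record, "latency_ms", None),
        }
        return json.dumps(log_entry)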

Always log model version and deployment name with every trace and metric. When you deploy a new version and latency spikes 40%, you need to be able to tie the spike to that release rather than guess.
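Here is a minimal sketch of the correlation that structured logs make possible: group latency by model version and compare the averages. It assumes JSON log lines with `model_version` and `latency_ms` fields; the helper names and sample data are hypothetical, and at real volumes you would run the equivalent query in Log Analytics instead.

```python
import json
from collections import defaultdict
from statistics import mean


def mean_latency_by_version(log_lines):
    """Average latency_ms per model_version from JSON-formatted log lines."""
    buckets = defaultdict(list)
    for line in log_lines:
        entry = json.loads(line)
        latency = entry.get("latency_ms")
        if latency is None:
            continue  # skip entries that carry no latency measurement
        buckets[entry.get("model_version", "unknown")].append(latency)
    return {version: mean(values) for version, values in buckets.items()}


def latency_change_pct(baseline_ms, candidate_ms):
    """Percentage change of candidate latency relative to the baseline."""
    return (candidate_ms - baseline_ms) * 100 / baseline_ms


# Synthetic log lines for two versions.
logs = [
    '{"model_version": "v1", "latency_ms": 100}',
    '{"model_version": "v1", "latency_ms": 100}',
    '{"model_version": "v2", "latency_ms": 140}',
    '{"model_version": "v2", "latency_ms": 140}',
]
averages = mean_latency_by_version(logs)
print(latency_change_pct(averages["v1"], averages["v2"]))
```

Without the version field on every entry, the same latency spike would show up only as an unexplained step in an aggregate chart.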

In the next post

With monitoring covering all 6 dimensions, we’ll cover security for AI: prompt injection, data leakage, managed identities, private endpoints, and the threats your WAF won’t catch.