This is the second post in the series where I translate AI into the language of infrastructure engineers. In the first post, I showed that AI is just another workload and that your infra skills already prepare you more than you think.

Now let’s talk about the bottleneck that everyone ignores — the hidden villain behind performance issues in virtually every AI project I’ve seen: storage.

The midnight call

You did everything right. The ML team asked for a GPU cluster and you delivered: eight NVIDIA A100s across two nodes, high-bandwidth networking, CUDA drivers up to date. Flawless deployment. The team kicked off their first training job Friday at 6 PM and you went home feeling good.

Your phone rings at midnight. The data science lead is frustrated: “The GPUs aren’t working. The training that was supposed to take four hours hasn’t even finished the first epoch.”

You remote in and pull the metrics:

  • GPU utilization: 12%
  • GPU memory: one-third of total
  • Disk I/O: 100%, read throughput crawling at 60 MB/s

The team stored 2 TB of training images on Blob Storage with Standard HDD, mounted via a basic SMB share. Your storage architecture is starving the most expensive hardware in the rack.

This story plays out at organizations every week. Teams invest fortunes in GPUs only to discover that the data pipeline — the part that we in infra own — is the real bottleneck.

Why everything starts with data

Every AI system, from a simple classifier to a trillion-parameter LLM, depends on a formula:

Data + Model + Compute = AI

Remove any of the three and you have nothing. But the insight most of us miss early on is: of the three components, data is the one that touches infrastructure at every stage. The model is code. Compute is provisioned and sits there running. But data needs to be ingested, stored, prepared, served for training, and delivered at inference — and each of those stages is an infrastructure problem.

| Infra concept | AI equivalent | Why it matters |
|---|---|---|
| Storage account / volume | Dataset repository | Where raw data lives before the model sees it |
| Read throughput (MB/s) | Data loader speed | Determines how fast GPUs receive training batches |
| IOPS | Samples per second | Small-file workloads (images) need high IOPS |
| Storage tiers (Hot/Cool/Archive) | Lifecycle stages | Hot for active training, Cool for completed datasets, Archive for compliance |
| NFS/SMB mount | POSIX access for frameworks | PyTorch and TensorFlow expect filesystem semantics |
| Encryption at rest | Data protection compliance | Mandatory for PII, medical data, and financial records |

If you already manage storage, networking, and access control, you understand 70% of the AI data stack. What changes is the intensity: AI workloads push read throughput, IOPS, and sequential I/O harder than almost anything you’ve provisioned before.

Data starvation: the invisible bottleneck

Here’s the counterintuitive truth about AI infrastructure: the most common cause of low GPU utilization isn’t a GPU problem — it’s a storage problem.

When the data loader can’t feed batches to the GPU fast enough, the GPU sits idle waiting for data. This is called data starvation, and it turns your $50,000/month GPU cluster into an expensive space heater.
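You can see how quickly starvation happens by running the arithmetic. Here is a quick back-of-envelope sketch in shell; the batch size, sample size, and step rate are illustrative assumptions, not measurements from any real workload:

```shell
# Back-of-envelope: how much read throughput the data loader must sustain.
# All three numbers below are illustrative; plug in your own workload's values.
BATCH_SIZE=256        # images per training batch
SAMPLE_KB=300         # ~300 KB per JPEG
STEPS_PER_SEC=4       # batches the GPU can consume per second

# Required read throughput in MB/s = batch size * sample size * step rate
REQUIRED_MBS=$(( BATCH_SIZE * SAMPLE_KB * STEPS_PER_SEC / 1024 ))
echo "Loader must sustain ~${REQUIRED_MBS} MB/s"   # prints ~300 MB/s
```

At those (modest) numbers a single GPU needs roughly 300 MB/s of sustained reads, while the Standard HDD share from the midnight call was delivering 60 MB/s: a fifth of the requirement.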

How to diagnose

If a data scientist reports suspiciously low GPU utilization, check storage before investigating anything else. Nine times out of ten, the problem is one of these:

  1. Training data on Standard HDD
  2. Remote mount without caching
  3. BlobFuse2 cache pointing at the OS disk instead of local NVMe

Classic data starvation signals:

# GPU utilization: if below 80% during training, storage is almost certainly the cause
nvidia-smi dmon -s u -d 5

# Disk I/O: if at 100% with low GPU, classic bottleneck
iostat -x 1 5

# Network (if dataset is remote): actual throughput vs capacity
sar -n DEV 1 5

The diagnostic pattern is simple:

| GPU util | CPU util | Disk I/O | Diagnosis |
|---|---|---|---|
| Low | Low | High | Data starvation: storage can't feed data fast enough |
| Low | High | Low | CPU preprocessing is the bottleneck (heavy data augmentation) |
| High | High | High | Balanced system, everything working well |
| Low | Low | Low | Problem in model code or wrong batch size |

⚠️ Rule of thumb: A five-minute storage fix can turn a three-day training run into an overnight one. Always start with storage.
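The diagnostic table can be encoded as a small triage helper for on-call runbooks. A sketch in shell; the 80% cutoff is an illustrative threshold, not a hard rule:

```shell
# Triage helper encoding the diagnostic table above.
# Arguments are utilization percentages; 80 is an assumed "high" cutoff.
diagnose() {
  local gpu=$1 cpu=$2 disk=$3
  if   (( gpu < 80 && cpu < 80 && disk >= 80 )); then echo "data starvation: check storage"
  elif (( gpu < 80 && cpu >= 80 && disk < 80 )); then echo "cpu preprocessing bottleneck"
  elif (( gpu >= 80 )); then echo "balanced: pipeline is healthy"
  else echo "check model code and batch size"
  fi
}

diagnose 12 20 100   # the midnight-call profile: prints "data starvation: check storage"
```

Feed it the numbers from nvidia-smi, iostat, and top before touching anything else.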

Choosing the right storage: the decision matrix

This is the highest-impact decision you’ll make for AI workload performance. Here’s the map:

| Storage | Best for | Throughput | Latency | Cost | Don't use when |
|---|---|---|---|---|---|
| Blob Storage | Datasets, artifacts, checkpoints | Up to 60 Gbps/account | Moderate (ms) | Low (~$0.018/GB/month) | You need native POSIX without a mount |
| Data Lake Gen2 | Analytical pipelines, versioned datasets | Up to 60 Gbps/account | Moderate (ms) | Low | Simple workload that doesn't need granular ACLs |
| Local NVMe | Training scratch, data loader cache | 3-7 GB/s per disk | Ultra-low (μs) | Included with the VM | You need persistence: data is lost on deallocation |
| Azure Files (NFS) | Shared datasets across nodes | Up to 10 Gbps (premium) | Low-moderate | Moderate | Single-node workload where local NVMe is enough |
| Azure Files (SMB) | Legacy compatibility, Windows | Up to 4 Gbps (premium) | Moderate | Moderate | High-performance Linux training |
| Cosmos DB | Feature stores, real-time inference | N/A (request-based) | Single-digit ms | Higher | Storing raw training datasets |

The most common production pattern is a two-tier approach: store raw datasets in Blob Storage or Data Lake Gen2 for durability and cost, then stage active data to local NVMe for performance.

Blob is your warehouse. NVMe is your workbench.

⚠️ Never use Standard HDD for training. The IOPS and throughput limits are orders of magnitude below what GPUs need. A single A100 can consume data faster than a Standard HDD storage account can serve it.

BlobFuse2: mounting Blob Storage as a filesystem

Most ML frameworks (PyTorch, TensorFlow) expect training data to be accessible through a filesystem path. BlobFuse2 is a virtual filesystem driver that mounts an Azure Blob Storage container as a local directory on Linux, bridging object storage and the POSIX semantics frameworks expect.

BlobFuse2 has two caching modes, and choosing the right one matters:

  • File cache: Downloads entire files to a local cache before serving reads. Use for training — datasets are read repeatedly across multiple epochs.
  • Block cache (streaming): Streams in chunks without downloading the complete file. Use for preprocessing or inference on large media files.
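The mount commands below reference a config.yaml. As a rough sketch, a file-cache configuration for training might look like the following; the account and container names are placeholders, and you should verify every key against the BlobFuse2 version you deploy:

```yaml
# Sketch of a BlobFuse2 config for training with file cache.
# Placeholder names; validate keys against your BlobFuse2 release.
logging:
  type: syslog
  level: log_warning

components:
  - libfuse
  - file_cache
  - attr_cache
  - azstorage

file_cache:
  path: /mnt/resource/blobfuse2cache   # local NVMe, matches --tmp-path
  timeout-sec: 120

attr_cache:
  timeout-sec: 7200

azstorage:
  type: block
  account-name: <storage-account>
  container: training-data
  mode: msi                            # managed identity, no storage keys
```

The mode: msi line is what keeps the mount consistent with the no-storage-keys rule later in this post.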

Mount with file cache for training

# Create cache directory on fast local storage (NVMe temp disk)
sudo mkdir -p /mnt/resource/blobfuse2cache
sudo chown $(whoami) /mnt/resource/blobfuse2cache

# Create mount point
sudo mkdir -p /mnt/training-data

# Mount with file cache
sudo blobfuse2 mount /mnt/training-data \
  --config-file=./config.yaml \
  --tmp-path=/mnt/resource/blobfuse2cache

Preload: data ready before training starts

# Mount with preload — downloads data to cache at mount time
sudo blobfuse2 mount /mnt/training-data \
  --config-file=./config.yaml \
  --tmp-path=/mnt/resource/blobfuse2cache \
  --preload

💡 Always point --tmp-path to the VM’s local NVMe disk (/mnt/resource on Azure VMs) — not to the OS disk. This gives the BlobFuse2 cache the lowest possible latency. On ND-series GPU VMs, the local temp disk delivers 3-7 GB/s read throughput.

AzCopy for bulk data ingestion

When you need to move large datasets into Azure (or between storage accounts), AzCopy is the fastest option. It supports parallel transfers, automatic retries, and resumable uploads.

# Login with Microsoft Entra ID
azcopy login

# Copy an entire dataset directory to Blob Storage
azcopy copy './local-dataset/' \
  "https://${STORAGE_ACCOUNT}.blob.core.windows.net/training-data/v1/" \
  --recursive

# Copy between storage accounts (server-side, no local download)
azcopy copy \
  "https://<source-account>.blob.core.windows.net/<container>" \
  "https://<dest-account>.blob.core.windows.net/<container>" \
  --recursive

💡 Use --cap-mbps to limit throughput during business hours and unleash full speed at night.

Security: built in, not bolted on

AI workloads handle some of the most sensitive data in the organization: customer records, medical images, financial transactions, proprietary text corpora.

Three non-negotiable rules:

1. Managed identities + RBAC, always.

Forget storage account keys. They’re static, shareable, and hard to rotate. Managed identities are bound to specific resources, automatically rotated, and auditable.

# Assign Storage Blob Data Reader to a VM's managed identity
az role assignment create \
  --role "Storage Blob Data Reader" \
  --assignee <managed-identity-principal-id> \
  --scope "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<account>"

2. Classify before ingesting.

Before any data enters a training pipeline: is it public, internal, confidential, or restricted? Your storage architecture needs to enforce these classifications with network isolation, encryption, and access controls.

3. Combat shadow data sprawl.

Data scientists frequently copy data to local machines, shared drives, or unmanaged storage accounts for “quick experiments.” Use Azure Policy to restrict storage account creation and Microsoft Purview to scan for copies outside approved locations.

Hands-on: end-to-end optimized storage for AI

Let’s build a complete flow: provision, transfer, mount, and validate. All commands use --auth-mode login — no storage keys.

1. Create the storage account with Data Lake Gen2

RESOURCE_GROUP="rg-ai-training"
LOCATION="eastus2"
STORAGE_ACCOUNT="staitraining$(openssl rand -hex 4)"

az group create \
  --name $RESOURCE_GROUP \
  --location $LOCATION

az storage account create \
  --name $STORAGE_ACCOUNT \
  --resource-group $RESOURCE_GROUP \
  --location $LOCATION \
  --sku Standard_LRS \
  --kind StorageV2 \
  --enable-hierarchical-namespace true \
  --min-tls-version TLS1_2 \
  --allow-blob-public-access false

2. Configure RBAC (no keys!)

az role assignment create \
  --role "Storage Blob Data Contributor" \
  --assignee "$(az ad signed-in-user show --query id -o tsv)" \
  --scope "/subscriptions/$(az account show --query id -o tsv)/resourceGroups/$RESOURCE_GROUP/providers/Microsoft.Storage/storageAccounts/$STORAGE_ACCOUNT"

Role assignments take 1-2 minutes to propagate. Wait before proceeding.
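Instead of sleeping for a fixed interval, you can poll until the data plane accepts your identity. A small pure-shell retry helper; the az command in the comment is an example built from the names used in these steps:

```shell
# Generic retry helper: run a command until it succeeds or attempts run out.
# Useful for waiting out RBAC propagation before the first data-plane call.
retry() {
  local attempts=$1 delay=$2
  shift 2
  local i
  for (( i = 1; i <= attempts; i++ )); do
    "$@" && return 0
    sleep "$delay"
  done
  return 1
}

# Example: poll every 15s, up to 10 times (names assumed from the steps above):
# retry 10 15 az storage container list --account-name "$STORAGE_ACCOUNT" --auth-mode login -o none
```

The same helper is handy for any eventually-consistent control-plane operation, not just role assignments.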

3. Create container and transfer data

az storage container create \
  --account-name $STORAGE_ACCOUNT \
  --name training-data \
  --auth-mode login

# For large datasets, use AzCopy
azcopy login
azcopy copy './local-dataset/' \
  "https://${STORAGE_ACCOUNT}.blob.core.windows.net/training-data/v1/" \
  --recursive

4. Mount with BlobFuse2 and NVMe cache

sudo mkdir -p /mnt/resource/blobfuse2cache
sudo chown $(whoami) /mnt/resource/blobfuse2cache
sudo mkdir -p /mnt/training-data

sudo blobfuse2 mount /mnt/training-data \
  --config-file=./config.yaml \
  --tmp-path=/mnt/resource/blobfuse2cache \
  --preload

5. Validate the pipeline is feeding the GPUs

# Verify data is accessible
ls /mnt/training-data/v1/ | head -20

# During training, monitor GPU vs I/O
nvidia-smi dmon -s u -d 5 &
iostat -x 1 5

If nvidia-smi shows GPU util above 80% and iostat isn’t pegged at 100%, your data pipeline is healthy.
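It's also worth measuring the raw sequential-read speed of the storage path itself, before any training job runs. A rough dd-based sketch; TEST_DIR defaults to /tmp so it runs anywhere, but point it at the BlobFuse2 mount or the NVMe cache disk to measure the real path, and note that the page cache can inflate the read number on a second pass:

```shell
# Rough sequential-read check: write a test file, then time reading it back.
# TEST_DIR defaults to /tmp; point it at /mnt/training-data or /mnt/resource
# to measure the actual training path. Page cache may inflate the read figure.
TEST_DIR="${TEST_DIR:-/tmp}"
TEST_FILE="$TEST_DIR/.throughput-test"

dd if=/dev/zero of="$TEST_FILE" bs=1M count=256 conv=fsync 2>/dev/null
dd if="$TEST_FILE" of=/dev/null bs=1M 2>&1 | tail -1   # last line reports MB/s or GB/s
rm -f "$TEST_FILE"
```

If the number you get here is below the requirement you computed from batch size and step rate, no amount of GPU tuning will help.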

Exit checklist

Before handing off storage for an AI workload:

  • Storage is Premium SSD or NVMe (never Standard HDD for training)
  • BlobFuse2 cache points to local NVMe (/mnt/resource), not the OS disk
  • Access via managed identity + RBAC, no storage keys
  • Data classified before entering the pipeline
  • Storage sized for 10× current data (datasets multiply with augmentation and versioning)
  • Throughput and IOPS alerts configured
  • Checkpoints writing back to Blob Storage for durability

Next up

Now that you understand how data flows through AI systems and why your storage decisions directly determine training performance, it’s time to look at the compute that consumes all that data. I’ll talk about GPUs, VM families, and cluster architecture — and why a well-tuned storage layer is only half the equation.

The full book is available for free at ai4infra.com.


This post is part of the AI for Infrastructure Engineers series, based on the book AI for Infrastructure Professionals. New posts every week.