This is the second post in the series where I translate AI into the language of infrastructure engineers. In the first post, I showed that AI is just another workload and that your infra skills already prepare you more than you think.
Now let’s talk about the bottleneck that everyone ignores — the hidden villain behind performance issues in virtually every AI project I’ve seen: storage.
The midnight call
You did everything right. The ML team asked for a GPU cluster and you delivered: eight NVIDIA A100s across two nodes, high-bandwidth networking, CUDA drivers up to date. Flawless deployment. The team kicked off their first training job Friday at 6 PM and you went home feeling good.
Your phone rings at midnight. The data science lead is frustrated: “The GPUs aren’t working. The training that was supposed to take four hours hasn’t even finished the first epoch.”
You remote in and pull the metrics:
- GPU utilization: 12%
- GPU memory: one-third of total
- Disk I/O: 100%, read throughput crawling at 60 MB/s
The team stored 2 TB of training images in a Standard (HDD-backed) storage account, exposed to the training nodes over a basic SMB share. Your storage architecture is starving the most expensive hardware in the rack.
This story plays out at organizations every week. Teams invest fortunes in GPUs only to discover that the data pipeline — the part that we in infra own — is the real bottleneck.
Why everything starts with data
Every AI system, from a simple classifier to a trillion-parameter LLM, depends on a formula:
Data + Model + Compute = AI
Remove any of the three and you have nothing. But the insight most of us miss early on is: of the three components, data is the one that touches infrastructure at every stage. The model is code. Compute is provisioned and sits there running. But data needs to be ingested, stored, prepared, served for training, and delivered at inference — and each of those stages is an infrastructure problem.
| Infra Concept | AI Equivalent | Why it matters |
|---|---|---|
| Storage account / volume | Dataset repository | Where raw data lives before the model sees it |
| Read throughput (MB/s) | Data loader speed | Determines how fast GPUs receive training batches |
| IOPS | Samples per second | Small-file workloads (images) need high IOPS |
| Storage tiers (Hot/Cool/Archive) | Lifecycle stages | Hot for active training, Cool for completed datasets, Archive for compliance |
| NFS/SMB mount | POSIX access for frameworks | PyTorch and TensorFlow expect filesystem semantics |
| Encryption at rest | Data protection compliance | Mandatory for PII, medical data, and financial records |
If you already manage storage, networking, and access control, you understand 70% of the AI data stack. What changes is the intensity: AI workloads push read throughput, IOPS, and sequential I/O harder than almost anything you’ve provisioned before.
Data starvation: the invisible bottleneck
Here’s the counterintuitive truth about AI infrastructure: the most common cause of low GPU utilization isn’t a GPU problem — it’s a storage problem.
When the data loader can’t feed batches to the GPU fast enough, the GPU sits idle waiting for data. This is called data starvation, and it turns your $50,000/month GPU cluster into an expensive space heater.
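To see why, run a quick back-of-the-envelope estimate. The batch size, image size, and iteration rate below are illustrative assumptions, not measurements from the story above:
# Back-of-the-envelope feed rate for one GPU (illustrative assumptions):
# 256 images/batch x 150 KB/image x 10 batches/s, converted to MB/s
echo "256 * 150 * 10 / 1024" | bc
Even with these modest assumptions, a single GPU wants roughly 375 MB/s of sustained reads. Compare that with the 60 MB/s the share in the midnight-call story was delivering, and the 12% GPU utilization stops being a mystery.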
How to diagnose
If a data scientist reports suspiciously low GPU utilization, check storage before investigating anything else. Nine times out of ten, the problem is one of these:
- Training data on Standard HDD
- Remote mount without caching
- BlobFuse2 cache pointing at the OS disk instead of local NVMe
Classic data starvation signals:
# GPU utilization: if below 80% during training, storage is almost certainly the cause
nvidia-smi dmon -s u -d 5
# Disk I/O: if at 100% with low GPU, classic bottleneck
iostat -x 1 5
# Network (if dataset is remote): actual throughput vs capacity
sar -n DEV 1 5
The diagnostic pattern is simple:
| GPU Util | CPU Util | Disk I/O | Diagnosis |
|---|---|---|---|
| Low | Low | High | Data starvation — storage can’t feed data fast enough |
| Low | High | Low | CPU preprocessing is the bottleneck (heavy data augmentation) |
| High | High | High | Everything working well, balanced system |
| Low | Low | Low | Problem in model code or wrong batch size |
⚠️ Rule of thumb: A five-minute storage fix can turn a three-day training run into an overnight one. Always start with storage.
Choosing the right storage: the decision matrix
This is the highest-impact decision you’ll make for AI workload performance. Here’s the map:
| Storage | Best for | Throughput | Latency | Cost | Don’t use when |
|---|---|---|---|---|---|
| Blob Storage | Datasets, artifacts, checkpoints | Up to 60 Gbps/account | Moderate (ms) | Low (~$0.018/GB/month) | You need native POSIX without a mount |
| Data Lake Gen2 | Analytical pipelines, versioned datasets | Up to 60 Gbps/account | Moderate (ms) | Low | Simple workload that doesn’t need granular ACLs |
| Local NVMe | Training scratch, data loader cache | 3-7 GB/s per disk | Ultra-low (μs) | Included with the VM | You need persistence — data lost on deallocation |
| Azure Files (NFS) | Shared datasets across nodes | Up to 10 Gbps (premium) | Low-moderate | Moderate | Single-node workload where local NVMe is enough |
| Azure Files (SMB) | Legacy compatibility, Windows | Up to 4 Gbps (premium) | Moderate | Moderate | High-performance Linux training |
| Cosmos DB | Feature stores, real-time inference | N/A (request-based) | Single-digit ms | Higher | Storing raw training datasets |
The most common production pattern is a two-tier approach: store raw datasets in Blob Storage or Data Lake Gen2 for durability and cost, then stage active data to local NVMe for performance.
Blob is your warehouse. NVMe is your workbench.
⚠️ Never use Standard HDD for training. The IOPS and throughput limits are orders of magnitude below what GPUs need. A single A100 can consume data faster than a Standard HDD storage account can serve it.
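A minimal sketch of the explicit staging variant of this pattern, assuming the dataset already lives in a training-data container and the VM's local NVMe temp disk is mounted at /mnt/resource (the next section shows the mounted-cache variant with BlobFuse2):
# Stage the active dataset from Blob (warehouse) to local NVMe (workbench) before training
azcopy copy \
"https://${STORAGE_ACCOUNT}.blob.core.windows.net/training-data/v1/" \
"/mnt/resource/dataset/" \
--recursive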
The recommended pattern: Blob + NVMe + BlobFuse2
Most ML frameworks (PyTorch, TensorFlow) expect training data accessible through a filesystem path. BlobFuse2 is the virtual filesystem driver that mounts Azure Blob Storage containers as a local directory on Linux.
BlobFuse2 has two caching modes, and choosing the right one matters:
- File cache: Downloads entire files to a local cache before serving reads. Use for training — datasets are read repeatedly across multiple epochs.
- Block cache (streaming): Streams in chunks without downloading the complete file. Use for preprocessing or inference on large media files.
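The mount commands below reference a config.yaml that isn't shown; here's a minimal sketch for file-cache mode. The account name is a placeholder, and mode: msi assumes the VM has a managed identity with data-plane access (covered in the security section):
# Minimal config.yaml sketch for file-cache mode (placeholder values; adjust to your environment)
cat > ./config.yaml <<'EOF'
components:
  - libfuse
  - file_cache
  - attr_cache
  - azstorage

file_cache:
  path: /mnt/resource/blobfuse2cache   # cache on the local NVMe temp disk
  timeout-sec: 120

azstorage:
  type: block
  account-name: <storage-account>
  container: training-data
  mode: msi                            # authenticate with the VM's managed identity
EOF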
Mount with file cache for training
# Create cache directory on fast local storage (NVMe temp disk)
sudo mkdir -p /mnt/resource/blobfuse2cache
sudo chown $(whoami) /mnt/resource/blobfuse2cache
# Create mount point
sudo mkdir -p /mnt/training-data
# Mount with file cache
sudo blobfuse2 mount /mnt/training-data \
--config-file=./config.yaml \
--tmp-path=/mnt/resource/blobfuse2cache
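A quick sanity check after mounting (paths match the commands above):
# Confirm the mount is live and the container contents are visible
mount | grep blobfuse2
ls /mnt/training-data | head -10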
Preload: data ready before training starts
# Mount with preload — downloads data to cache at mount time
sudo blobfuse2 mount /mnt/training-data \
--config-file=./config.yaml \
--tmp-path=/mnt/resource/blobfuse2cache \
--preload
💡 Always point --tmp-path to the VM’s local NVMe disk (/mnt/resource on Azure VMs) — not to the OS disk. This gives the BlobFuse2 cache the lowest possible latency. On ND-series GPU VMs, the local temp disk delivers 3-7 GB/s read throughput.
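If you're not sure where the local temp disk lands on a given image (some distros mount it at /mnt rather than /mnt/resource), check before pointing the cache at it:
# Verify the local temp disk location and free space before using it as the cache
df -h /mnt/resource
lsblk -o NAME,SIZE,MOUNTPOINT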
AzCopy for bulk data ingestion
When you need to move large datasets into Azure (or between storage accounts), AzCopy is the fastest option. It supports parallel transfers, automatic retries, and resumable uploads.
# Login with Microsoft Entra ID
azcopy login
# Copy an entire dataset directory to Blob Storage
azcopy copy './local-dataset/' \
"https://${STORAGE_ACCOUNT}.blob.core.windows.net/training-data/v1/" \
--recursive
# Copy between storage accounts (server-side, no local download)
azcopy copy \
"https://<source-account>.blob.core.windows.net/<container>" \
"https://<dest-account>.blob.core.windows.net/<container>" \
--recursive
💡 Use --cap-mbps to limit throughput during business hours and unleash full speed at night.
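For example, a daytime transfer capped at an assumed 200 Mbps (pick a number that fits your link):
# Daytime bulk copy, throttled so it doesn't compete with production traffic
azcopy copy './local-dataset/' \
"https://${STORAGE_ACCOUNT}.blob.core.windows.net/training-data/v1/" \
--recursive \
--cap-mbps 200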
Security: built in, not bolted on
AI workloads handle some of the most sensitive data in the organization: customer records, medical images, financial transactions, proprietary text corpora.
Three non-negotiable rules:
1. Managed identities + RBAC, always.
Forget storage account keys. They’re static, shareable, and hard to rotate. Managed identities are bound to specific resources, automatically rotated, and auditable.
# Assign Storage Blob Data Reader to a VM's managed identity
az role assignment create \
--role "Storage Blob Data Reader" \
--assignee <managed-identity-principal-id> \
--scope "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<account>"
2. Classify before ingesting.
Before any data enters a training pipeline: is it public, internal, confidential, or restricted? Your storage architecture needs to enforce these classifications with network isolation, encryption, and access controls.
3. Combat shadow data sprawl.
Data scientists frequently copy data to local machines, shared drives, or unmanaged storage accounts for “quick experiments.” Use Azure Policy to restrict storage account creation and Microsoft Purview to scan for copies outside approved locations.
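As one concrete guardrail, here's a sketch of a custom Azure Policy definition that denies storage accounts created without a data-classification tag. The policy name and tag name are assumptions, and you'd still assign the definition at a subscription or management-group scope:
# Deny new storage accounts that don't carry a data-classification tag (definition only; assign separately)
cat > classify-rule.json <<'EOF'
{
  "if": {
    "allOf": [
      { "field": "type", "equals": "Microsoft.Storage/storageAccounts" },
      { "field": "tags['data-classification']", "exists": "false" }
    ]
  },
  "then": { "effect": "deny" }
}
EOF
az policy definition create \
--name 'require-data-classification-tag' \
--display-name 'Storage accounts must declare a data classification' \
--mode Indexed \
--rules classify-rule.json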
Hands-on: end-to-end optimized storage for AI
Let’s build a complete flow: provision, transfer, mount, and validate. Every data-plane operation authenticates through Microsoft Entra ID (--auth-mode login for the Azure CLI, azcopy login for AzCopy) — no storage keys.
1. Create the storage account with Data Lake Gen2
RESOURCE_GROUP="rg-ai-training"
LOCATION="eastus2"
STORAGE_ACCOUNT="staitraining$(openssl rand -hex 4)"
az group create \
--name $RESOURCE_GROUP \
--location $LOCATION
az storage account create \
--name $STORAGE_ACCOUNT \
--resource-group $RESOURCE_GROUP \
--location $LOCATION \
--sku Standard_LRS \
--kind StorageV2 \
--enable-hierarchical-namespace true \
--min-tls-version TLS1_2 \
--allow-blob-public-access false
2. Configure RBAC (no keys!)
# Capture the signed-in user's object ID, then grant it data-plane access to the account
USER_ID=$(az ad signed-in-user show --query id -o tsv)
az role assignment create \
--role "Storage Blob Data Contributor" \
--assignee $USER_ID \
--scope "/subscriptions/$(az account show --query id -o tsv)/resourceGroups/$RESOURCE_GROUP/providers/Microsoft.Storage/storageAccounts/$STORAGE_ACCOUNT"
Role assignments take 1-2 minutes to propagate. Wait before proceeding.
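To avoid guessing, poll until the assignment shows up (uses the USER_ID captured above):
# List role assignments at the storage account scope; re-run until the role appears
az role assignment list \
--assignee $USER_ID \
--scope "/subscriptions/$(az account show --query id -o tsv)/resourceGroups/$RESOURCE_GROUP/providers/Microsoft.Storage/storageAccounts/$STORAGE_ACCOUNT" \
--query "[].roleDefinitionName" -o tsv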
3. Create container and transfer data
az storage container create \
--account-name $STORAGE_ACCOUNT \
--name training-data \
--auth-mode login
# For large datasets, use AzCopy
azcopy login
azcopy copy './local-dataset/' \
"https://${STORAGE_ACCOUNT}.blob.core.windows.net/training-data/v1/" \
--recursive
4. Mount with BlobFuse2 and NVMe cache
sudo mkdir -p /mnt/resource/blobfuse2cache
sudo chown $(whoami) /mnt/resource/blobfuse2cache
sudo mkdir -p /mnt/training-data
sudo blobfuse2 mount /mnt/training-data \
--config-file=./config.yaml \
--tmp-path=/mnt/resource/blobfuse2cache \
--preload
5. Validate the pipeline is feeding the GPUs
# Verify data is accessible
ls /mnt/training-data/v1/ | head -20
# During training, monitor GPU vs I/O
nvidia-smi dmon -s u -d 5 &
iostat -x 1 5
If nvidia-smi shows GPU util above 80% and iostat isn’t pegged at 100%, your data pipeline is healthy.
Exit checklist
Before handing off storage for an AI workload:
- Storage is Premium SSD or NVMe (never Standard HDD for training)
- BlobFuse2 cache points to local NVMe (/mnt/resource), not the OS disk
- Access via managed identity + RBAC, no storage keys
- Data classified before entering the pipeline
- Storage sized for 10× current data (datasets multiply with augmentation and versioning)
- Throughput and IOPS alerts configured (see the sketch after this checklist)
- Checkpoints writing back to Blob Storage for durability
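For the alerting item above, one option is an Azure Monitor metric alert on the storage account's Egress metric. The threshold below is a placeholder to tune per workload, and in practice you'd attach an action group:
# Alert when total egress over 15 minutes exceeds ~1 TB (placeholder threshold; tune per workload)
az monitor metrics alert create \
--name "storage-egress-high" \
--resource-group $RESOURCE_GROUP \
--scopes "/subscriptions/$(az account show --query id -o tsv)/resourceGroups/$RESOURCE_GROUP/providers/Microsoft.Storage/storageAccounts/$STORAGE_ACCOUNT" \
--condition "total Egress > 1000000000000" \
--window-size 15m \
--evaluation-frequency 5m \
--description "Training may be saturating storage throughput"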
Next up
Now that you understand how data flows through AI systems and why your storage decisions directly determine training performance, it’s time to look at the compute that consumes all that data. I’ll talk about GPUs, VM families, and cluster architecture — and why a well-tuned storage layer is only half the equation.
The full book is available for free at ai4infra.com.
This post is part of the AI for Infrastructure Engineers series, based on the book AI for Infrastructure Professionals. New posts every week.