DataLoader bottleneck guide
Why your GPU is waiting for data.
If GPU utilization drops while step time rises, your model may not be the problem. Use this guide to check decoding, tokenization, storage reads, collation, and DataLoader worker delays.
Same model. Slower batches.
TraceML shows the slowdown came from input loading, not compute.
Input time jumped from 1.9 ms to 531.8 ms, while compute time stayed roughly the same.
Reproduce the bottleneck.
Run the same training setup twice: once as a compute-heavy baseline, then again with a synthetic DataLoader delay. The --run-name values become the log folders used in the compare step.
Compute baseline
traceml run examples/dataloader_bottleneck_demo.py \
    --mode=summary \
    --run-name dataloader_fast_compute \
    --args --scenario fast --batch-size 256 \
    --num-samples 15360 --hidden-dim 4096 \
    --depth 4 --print-every 20
Delayed input path
traceml run examples/dataloader_bottleneck_demo.py \
    --mode=summary \
    --run-name dataloader_slow_input \
    --args --scenario slow --sleep-ms 2 \
    --batch-size 256 --num-samples 15360 \
    --hidden-dim 4096 --depth 4 --print-every 20
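The flags above are enough for a rough sanity check on the reported input time. The sketch below assumes, without confirmation from this guide, that --sleep-ms is applied once per sample and that each batch is assembled serially. Under that assumption, a 2 ms delay across 256 samples accounts for most of the 531.8 ms input time.

```python
# Back-of-envelope check. ASSUMPTION (not confirmed by the guide):
# the demo sleeps --sleep-ms once per sample and builds batches serially.
batch_size = 256      # --batch-size
num_samples = 15360   # --num-samples
sleep_ms = 2.0        # --sleep-ms

steps_per_epoch = num_samples // batch_size
expected_input_ms = batch_size * sleep_ms

observed_input_ms = 531.8  # input time reported for the slow run
unexplained_ms = observed_input_ms - expected_input_ms

print(f"steps per epoch: {steps_per_epoch}")
print(f"expected delay per batch: {expected_input_ms:.1f} ms")
print(f"remainder not explained by the sleep: {unexplained_ms:.1f} ms")
```

If the demo applies the delay differently (per batch, or spread across parallel workers), the expected value changes and this check no longer holds.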
Inspect what changed.
TraceML writes a final_summary.json to each run's log folder, logs/<run-name>/. Compare the two summaries to see how step time, input loading, compute, and GPU utilization changed.
traceml compare logs/dataloader_fast_compute/final_summary.json logs/dataloader_slow_input/final_summary.json
Result
Compute stayed roughly the same. The slowdown came from input loading, which dominated each step and left the GPU waiting.
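To see the same split by hand, you can time how long the loop waits for each batch separately from how long the step itself takes. This is a minimal standard-library sketch, not how TraceML measures it. The names split_step_time and slow_batches are hypothetical, and the no-op step stands in for a real training step.

```python
import time
from typing import Any, Callable, Iterable, Iterator, Tuple


def split_step_time(
    batches: Iterable[Any], train_step: Callable[[Any], None]
) -> Tuple[float, float]:
    """Return (seconds spent waiting for batches, seconds spent in train_step)."""
    wait_s = 0.0
    compute_s = 0.0
    it = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)  # blocks until the input path delivers a batch
        except StopIteration:
            break
        t1 = time.perf_counter()
        # On a GPU, kernels launch asynchronously, so a real train_step should
        # synchronize (for example torch.cuda.synchronize()) before returning,
        # otherwise compute time is undercounted.
        train_step(batch)
        t2 = time.perf_counter()
        wait_s += t1 - t0
        compute_s += t2 - t1
    return wait_s, compute_s


def slow_batches(n: int, delay_s: float) -> Iterator[int]:
    """Hypothetical input path that takes delay_s to produce each batch."""
    for i in range(n):
        time.sleep(delay_s)
        yield i


if __name__ == "__main__":
    wait_s, compute_s = split_step_time(slow_batches(5, 0.02), lambda b: None)
    print(f"input wait: {wait_s * 1e3:.1f} ms, compute: {compute_s * 1e3:.1f} ms")
```

When the wait total dwarfs the compute total, the input path is the place to look.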
Inspect the input path before tuning the model.
TraceML identifies the class of bottleneck. The fix depends on what your real input pipeline does between reading from storage and delivering the batch to the GPU.
Worker and transfer settings
Check worker count, persistent workers, prefetching, and pinned memory. More workers help only when the input path can actually run in parallel.
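The settings above map onto keyword arguments of PyTorch's torch.utils.data.DataLoader. The sketch below builds those arguments without importing PyTorch. The helper name loader_kwargs and the example values are assumptions for illustration, not recommended defaults.

```python
from typing import Any, Dict


def loader_kwargs(
    num_workers: int, use_cuda: bool, prefetch_factor: int = 2
) -> Dict[str, Any]:
    """Build keyword arguments for torch.utils.data.DataLoader (hypothetical helper)."""
    kwargs: Dict[str, Any] = {
        "num_workers": num_workers,
        # Pinned host memory speeds up host-to-GPU copies; it is only
        # useful when batches are actually moved to a CUDA device.
        "pin_memory": use_cuda,
    }
    if num_workers > 0:
        # DataLoader rejects both of these options when num_workers == 0,
        # so they are set only for multi-process loading.
        kwargs["persistent_workers"] = True  # keep workers alive across epochs
        kwargs["prefetch_factor"] = prefetch_factor  # batches queued per worker
    return kwargs


if __name__ == "__main__":
    print(loader_kwargs(num_workers=4, use_cuda=True))
    # Intended use, assuming PyTorch is installed and `dataset` exists:
    #   DataLoader(dataset, batch_size=256, **loader_kwargs(4, True))
```

Change one setting at a time and re-run the comparison, so that any gain can be attributed to a specific change.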
Preprocessing hot spots
Look for expensive decoding, tokenization, transforms, or collate functions. Cache derived features or move repeated work out of the step path.
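One way to move repeated work out of the step path is to compute each item's features once and reuse them. This is a minimal sketch of a map-style dataset wrapper. CachedFeatures and the toy featurize function are hypothetical, standing in for a real tokenizer or decoder.

```python
from typing import Callable, Dict, List, Sequence


class CachedFeatures:
    """Map-style dataset wrapper that computes each item's features at most once."""

    def __init__(self, raw: Sequence[str], featurize: Callable[[str], List[int]]):
        self.raw = raw
        self.featurize = featurize
        self._cache: Dict[int, List[int]] = {}

    def __len__(self) -> int:
        return len(self.raw)

    def __getitem__(self, idx: int) -> List[int]:
        # Pay for the expensive transform only on the first access.
        if idx not in self._cache:
            self._cache[idx] = self.featurize(self.raw[idx])
        return self._cache[idx]


def toy_featurize(text: str) -> List[int]:
    """Stand-in for an expensive tokenizer: one integer per whitespace token."""
    return [len(word) for word in text.split()]


if __name__ == "__main__":
    ds = CachedFeatures(["load the batch", "wait for data"], toy_featurize)
    print(ds[0], ds[1])
```

An in-memory cache like this lives in a single process. With multiple DataLoader workers, each worker holds its own copy, and the cache is lost whenever workers are restarted, so precomputing features to disk is often the more robust option.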
Storage and shard balance
Check object storage, network filesystems, small-file overhead, uneven shards, and rank-specific input delays in distributed jobs.
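Uneven shards can be checked before training starts. The sketch below assumes shards are assigned to ranks round-robin, which may not match your sharding scheme. The function names and shard sizes are hypothetical.

```python
from typing import Dict, List


def samples_per_rank(shard_sizes: List[int], world_size: int) -> Dict[int, int]:
    """Total samples per rank, ASSUMING round-robin shard assignment."""
    totals = {rank: 0 for rank in range(world_size)}
    for i, size in enumerate(shard_sizes):
        totals[i % world_size] += size
    return totals


def shard_skew(per_rank: Dict[int, int]) -> float:
    """Largest per-rank sample count divided by the mean (1.0 means balanced)."""
    if not per_rank:
        raise ValueError("per_rank must not be empty")
    counts = list(per_rank.values())
    mean = sum(counts) / len(counts)
    if mean == 0:
        raise ValueError("all ranks have zero samples")
    return max(counts) / mean


if __name__ == "__main__":
    per_rank = samples_per_rank([1000, 1000, 1000, 4000], world_size=2)
    print(per_rank, f"skew={shard_skew(per_rank):.2f}")
```

In synchronous data-parallel training every rank waits for the slowest one at each step, so a skew well above 1.0 shows up as wait time on the other ranks.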
Run it on your training job.
Install TraceML, wrap your training step, and run your script normally. Use the summary to see whether training time is going to input loading, compute, or waiting, and whether memory or rank skew is a factor.