DataLoader bottleneck guide

Why your GPU is waiting for data.

If GPU utilization drops while step time rises, your model may not be the problem. Use this guide to check decoding, tokenization, storage reads, collation, and DataLoader worker delays.

Real TraceML demo

Same model. Slower batches.

TraceML shows the slowdown came from input loading, not compute.

Fast baseline: 37.3ms step time (input: 1.9ms · GPU util: 67%)
Input-bound: 563.3ms step time (input: 531.8ms · GPU util: 7%)

Input time jumped from 1.9ms to 531.8ms, while compute stayed close to the baseline (35.0ms vs 31.0ms).

Run the demo

Reproduce the bottleneck.

Run the same training setup twice: once as a compute-heavy baseline, then again with a synthetic DataLoader delay. The --run-name values become the log folders used in the compare step.
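To make the idea of a synthetic input delay concrete, here is a minimal sketch of a dataset that stalls on every sample fetch. It is not the demo script's implementation: `SleepyDataset` and `time_batch_ms` are hypothetical names, and applying the delay per sample is an assumption made for illustration.

```python
import time


class SleepyDataset:
    """Map-style dataset that stalls for a fixed time on every sample fetch."""

    def __init__(self, num_samples: int, sleep_ms: float = 0.0):
        self.num_samples = num_samples
        self.sleep_ms = sleep_ms

    def __len__(self) -> int:
        return self.num_samples

    def __getitem__(self, index: int) -> int:
        if not 0 <= index < self.num_samples:
            raise IndexError(index)
        if self.sleep_ms > 0:
            # Stand-in for slow decoding, tokenization, or storage reads.
            time.sleep(self.sleep_ms / 1000.0)
        return index  # a real dataset would return features and a label


def time_batch_ms(dataset, batch_size: int) -> float:
    """Fetch one batch sequentially and return the elapsed wall time in ms."""
    start = time.perf_counter()
    _ = [dataset[i] for i in range(batch_size)]
    return (time.perf_counter() - start) * 1000.0


fast = SleepyDataset(num_samples=64, sleep_ms=0.0)
slow = SleepyDataset(num_samples=64, sleep_ms=2.0)
fast_ms = time_batch_ms(fast, batch_size=16)
slow_ms = time_batch_ms(slow, batch_size=16)
print(f"fast: {fast_ms:.1f}ms  slow: {slow_ms:.1f}ms")
```

In this sketch the delay is paid once per sample, so a batch fetched serially waits roughly batch size times the per-sample delay. That is why a small per-sample cost can dominate the step.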

Fast run

Compute baseline

traceml run examples/dataloader_bottleneck_demo.py \
  --mode=summary \
  --run-name dataloader_fast_compute \
  --args --scenario fast --batch-size 256 \
  --num-samples 15360 --hidden-dim 4096 \
  --depth 4 --print-every 20

Slow run

Delayed input path

traceml run examples/dataloader_bottleneck_demo.py \
  --mode=summary \
  --run-name dataloader_slow_input \
  --args --scenario slow --sleep-ms 2 \
  --batch-size 256 --num-samples 15360 \
  --hidden-dim 4096 --depth 4 --print-every 20

Run summaries

Inspect what changed.

TraceML writes a final_summary.json for each run, in the log folder named after its --run-name. Compare the two summaries to see how step time, input loading, compute, and GPU utilization changed.

Command

traceml compare logs/dataloader_fast_compute/final_summary.json logs/dataloader_slow_input/final_summary.json

Metric                 Fast baseline    Slow DataLoader
Step time diagnosis    COMPUTE-BOUND    INPUT-BOUND
Total step time        37.3ms           563.3ms
Input loading          1.9ms            531.8ms
Compute                35.0ms           31.0ms
GPU utilization        67%              7%

Result

Compute stayed roughly the same. The slowdown came from input loading, which took over the step and left the GPU waiting.
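The same conclusion can be checked with a little arithmetic on the numbers reported above. The sketch below uses only those values; the dictionary keys and the `input_share` helper are hypothetical names, not fields from final_summary.json.

```python
# Values copied from the comparison above (milliseconds).
fast = {"step_ms": 37.3, "input_ms": 1.9, "compute_ms": 35.0}
slow = {"step_ms": 563.3, "input_ms": 531.8, "compute_ms": 31.0}


def input_share(run: dict) -> float:
    """Fraction of the step spent waiting on input loading."""
    return run["input_ms"] / run["step_ms"]


fast_share = input_share(fast)
slow_share = input_share(slow)
slowdown = slow["step_ms"] / fast["step_ms"]
compute_delta_ms = slow["compute_ms"] - fast["compute_ms"]

print(f"input share, fast run: {fast_share:.1%}")
print(f"input share, slow run: {slow_share:.1%}")
print(f"step time slowdown:    {slowdown:.1f}x")
print(f"compute change:        {compute_delta_ms:+.1f}ms")
```

Input loading goes from about 5% of the step to about 94%, while compute moves by only a few milliseconds. A shift like that suggests looking at the data path before touching the model.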

What to check next

Inspect the input path before tuning the model.

TraceML points to the bottleneck class. The fix depends on what your real input pipeline is doing between storage and the batch arriving at the GPU.

01

Worker and transfer settings

Check worker count, persistent workers, prefetching, and pinned memory. More workers help only when the input path can actually run in parallel.
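As a starting point, the settings above map onto keyword arguments of PyTorch's torch.utils.data.DataLoader. The sketch below only builds the argument dictionary, so it runs without PyTorch installed. `loader_kwargs` and `suggested_workers` are hypothetical helpers, and the default values are assumptions to tune, not recommendations from TraceML.

```python
import os


def loader_kwargs(num_workers: int, prefetch_factor: int = 2) -> dict:
    """Build keyword arguments intended for torch.utils.data.DataLoader."""
    if num_workers < 0:
        raise ValueError("num_workers must be >= 0")
    # pin_memory speeds up host-to-GPU copies; it only helps on CUDA devices.
    kwargs = {"num_workers": num_workers, "pin_memory": True}
    if num_workers > 0:
        # These options apply only to worker processes; DataLoader rejects
        # them when num_workers is 0, so they are added conditionally.
        kwargs["persistent_workers"] = True
        kwargs["prefetch_factor"] = prefetch_factor
    return kwargs


def suggested_workers(reserve: int = 1, cap: int = 8) -> int:
    """Rough first guess: available CPUs minus a reserve, capped."""
    cpus = os.cpu_count() or 1
    return max(0, min(cap, cpus - reserve))


print(loader_kwargs(0))
print(loader_kwargs(suggested_workers()))
```

The result would be passed as DataLoader(dataset, batch_size=256, **loader_kwargs(4)). Treat the worker count as something to measure: rerun the summary after each change and keep the setting only if input time actually drops.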

02

Preprocessing hot spots

Look for expensive decoding, tokenization, transforms, or collate functions. Cache derived features or move repeated work out of the step path.
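One way to move repeated work out of the step path is to cache the result of a deterministic preprocessing function. The sketch below uses Python's functools.lru_cache around a stand-in tokenizer. `slow_tokenize`, `cached_tokenize`, and the tiny corpus are hypothetical and exist only to show the call count dropping.

```python
from functools import lru_cache

calls = {"tokenize": 0}  # counts how often the expensive path really runs


def slow_tokenize(text: str) -> tuple:
    """Stand-in for an expensive, deterministic preprocessing step."""
    calls["tokenize"] += 1
    return tuple(text.lower().split())


@lru_cache(maxsize=100_000)
def cached_tokenize(text: str) -> tuple:
    # Returns an immutable tuple so cached values cannot be mutated by callers.
    return slow_tokenize(text)


corpus = ["GPU waits for data", "input bound step", "GPU waits for data"]

for epoch in range(3):
    batch = [cached_tokenize(text) for text in corpus]

print(f"samples processed: {3 * len(corpus)}, tokenizer calls: {calls['tokenize']}")
```

An in-memory cache like this lives inside one process, so each DataLoader worker process would build its own copy. For large datasets, precomputing features to disk once is usually the more practical form of the same idea.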

03

Storage and shard balance

Check object storage, network filesystems, small-file overhead, uneven shards, and rank-specific input delays in distributed jobs.
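Uneven shards can be spotted before a job starts by adding up how many samples each rank would read. The sketch below assumes shards are assigned to ranks round-robin, which may not match your loader. `assign_round_robin`, `imbalance`, and the shard sizes are hypothetical.

```python
def assign_round_robin(shard_sizes: list, world_size: int) -> list:
    """Return total samples per rank when shard i goes to rank i % world_size."""
    per_rank = [0] * world_size
    for i, size in enumerate(shard_sizes):
        per_rank[i % world_size] += size
    return per_rank


def imbalance(per_rank: list) -> float:
    """Ratio of the busiest rank to the mean; 1.0 means perfectly balanced."""
    mean = sum(per_rank) / len(per_rank)
    return max(per_rank) / mean if mean else 1.0


# One oversized shard among otherwise equal ones (sample counts).
shard_sizes = [1000, 1000, 1000, 4000, 1000, 1000, 1000, 1000]
per_rank = assign_round_robin(shard_sizes, world_size=4)

print("samples per rank:", per_rank)
print(f"imbalance ratio: {imbalance(per_rank):.2f}")
```

In synchronized distributed training, the other ranks wait for the slowest one, so a ratio well above 1.0 is worth fixing by resharding or by rebalancing the assignment.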

Use it on your job

Run it on your training job.

Install TraceML, wrap your training step, and run your script normally. Use the summary to see whether training time is going into input loading, compute, wait, memory, or rank skew.

Start with the repo for examples, then use the quickstart for the minimal wrapper pattern.