DataLoader bottleneck guide

Why your GPU is waiting for data.

If GPU utilization drops while step time rises, your model may not be the problem. Use this guide to check decoding, tokenization, storage reads, collation, and DataLoader worker delays.


Real TraceML demo

Same model. Slower batches.

TraceML shows the slowdown came from input loading, not compute.

Fast baseline: 37.3ms step time · Input: 1.9ms · GPU util: 67%

Input-bound: 563.3ms step time · Input: 531.8ms · GPU util: 7%

Input time jumped from 1.9ms to 531.8ms, while compute stayed roughly the same (35.0ms vs 31.0ms).

Run the demo

Reproduce the bottleneck.

Run the same training setup twice: once as a compute-heavy baseline, then again with a synthetic DataLoader delay. The --run-name values become the log folders used in the compare step.

Fast run

Compute baseline

traceml run examples/diagnosis/dataloader_bottleneck_demo.py \
  --mode=summary \
  --run-name dataloader_fast_compute \
  --args --scenario fast --batch-size 256 \
  --num-samples 15360 --hidden-dim 4096 \
  --depth 4 --print-every 20

Slow run

Delayed input path

traceml run examples/diagnosis/dataloader_bottleneck_demo.py \
  --mode=summary \
  --run-name dataloader_slow_input \
  --args --scenario slow --sleep-ms 2 \
  --batch-size 256 --num-samples 15360 \
  --hidden-dim 4096 --depth 4 --print-every 20
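
As a rough sanity check on the slow run's settings, the sketch below estimates the input delay per batch. It assumes the --sleep-ms delay is applied once per sample and that samples are loaded one after another; the demo script may apply the delay differently, and the function name is hypothetical.

```python
def expected_input_ms(batch_size: int, sleep_ms: float, parallel_workers: int = 1) -> float:
    """Estimated input delay per batch if every sample sleeps for sleep_ms.

    Assumption: the delay is per sample and splits evenly across workers.
    """
    return batch_size * sleep_ms / max(parallel_workers, 1)


# Values taken from the commands above.
steps_per_epoch = 15360 // 256  # --num-samples / --batch-size
print(steps_per_epoch)            # 60
print(expected_input_ms(256, 2))  # 512.0
```

Under that assumption the estimate is 512ms per batch, which is in the same range as the 531.8ms input time reported for the slow run.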

Run summaries

Inspect what changed.

TraceML writes final_summary.json to the log folder for each run name (logs/<run-name>/ in the command below). Compare the two summaries to see how step time, input loading, compute, and GPU utilization changed.

Command

traceml compare logs/dataloader_fast_compute/final_summary.json logs/dataloader_slow_input/final_summary.json

Metric                 Fast baseline    Slow DataLoader
Step Time diagnosis    COMPUTE-BOUND    INPUT-BOUND
Total step time        37.3ms           563.3ms
Input loading          1.9ms            531.8ms
Compute                35.0ms           31.0ms
GPU utilization        67%              7%

Result

Compute stayed roughly the same. The slowdown came from input loading, which took over the step and left the GPU waiting.
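
To put the result in proportion, the short calculation below turns the reported numbers into an input share per run. It uses only the values from the comparison above; the dictionary layout and function name are illustrative and do not reflect the structure of final_summary.json.

```python
# Milliseconds, copied from the comparison above.
runs = {
    "fast": {"step": 37.3, "input": 1.9, "compute": 35.0},
    "slow": {"step": 563.3, "input": 531.8, "compute": 31.0},
}


def input_share(run: dict) -> float:
    """Fraction of total step time spent on input loading."""
    return run["input"] / run["step"]


for name, run in runs.items():
    print(f"{name}: input share {input_share(run):.1%}")

slowdown = runs["slow"]["step"] / runs["fast"]["step"]
print(f"step time slowdown: {slowdown:.1f}x")
```

Input loading goes from about 5.1% of the step in the fast run to about 94.4% in the slow run, and the step as a whole is about 15.1x slower.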

What to check next

Inspect the input path before tuning the model.

TraceML points to the bottleneck class. The fix depends on what your real input pipeline is doing between storage and the batch arriving at the GPU.

Worker and transfer settings

Check worker count, persistent workers, prefetching, and pinned memory. More workers help only when the input path can actually run in parallel.
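
A minimal sketch of how these settings might be grouped for a PyTorch DataLoader. The helper name and the candidate worker counts are assumptions for illustration; the keyword names (num_workers, pin_memory, persistent_workers, prefetch_factor) are standard DataLoader arguments. PyTorch is not imported here so the sketch runs on its own.

```python
import os


def loader_kwargs(num_workers: int, use_cuda: bool = True, prefetch_factor: int = 2) -> dict:
    """Build keyword arguments for torch.utils.data.DataLoader.

    persistent_workers and prefetch_factor are only valid when
    num_workers > 0, so they are left out for single-process loading.
    """
    kwargs = {"num_workers": num_workers, "pin_memory": use_cuda}
    if num_workers > 0:
        kwargs["persistent_workers"] = True
        kwargs["prefetch_factor"] = prefetch_factor
    return kwargs


# Candidate worker counts to try, capped at the CPU count. Measure each one.
cpu_count = os.cpu_count() or 1
candidates = [0] + [n for n in (1, 2, 4, 8, 16) if n <= cpu_count]
print(candidates)

# Usage with PyTorch, assuming `dataset` is your own Dataset:
# loader = torch.utils.data.DataLoader(dataset, batch_size=256, **loader_kwargs(4))
```

Treat the candidates as a starting point for a sweep rather than a recommendation: rerun the job at each setting and keep the one with the lowest input time.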

Preprocessing hot spots

Look for expensive decoding, tokenization, transforms, or collate functions. Cache derived features or move repeated work out of the step path.
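
One way to move repeated work out of the step path is to cache derived features. The sketch below uses an in-memory cache from the standard library; the tokenize function is a hypothetical stand-in for whatever decoding or tokenization your pipeline does.

```python
from functools import lru_cache

calls = 0  # counts how often the expensive path actually runs


@lru_cache(maxsize=100_000)
def tokenize(text: str) -> tuple:
    """Hypothetical stand-in for an expensive tokenizer or decoder."""
    global calls
    calls += 1
    return tuple(text.lower().split())


samples = ["The GPU is waiting", "for data", "The GPU is waiting"]
tokens = [tokenize(s) for s in samples]
print(calls)  # 2: the repeated sample is served from the cache
```

An in-memory cache like this lives in one process, so each DataLoader worker process would build its own copy. For large datasets, precomputing features to disk once is usually the more practical variant of the same idea.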

Storage and shard balance

Check object storage, network filesystems, small-file overhead, uneven shards, and rank-specific input delays in distributed jobs.
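
For the distributed case, a simple check is to compare each rank's input time against the median rank. The sketch below assumes you already have a per-rank input time in milliseconds; the function name, the 1.5x threshold, and the example numbers are made up for illustration and are not TraceML output.

```python
from statistics import median


def slow_ranks(input_ms_by_rank: dict, factor: float = 1.5) -> list:
    """Return ranks whose input time exceeds factor x the median rank."""
    typical = median(input_ms_by_rank.values())
    return sorted(r for r, ms in input_ms_by_rank.items() if ms > factor * typical)


# Made-up per-rank input times in milliseconds, for illustration only.
example = {0: 2.1, 1: 1.9, 2: 2.0, 3: 48.5}
print(slow_ranks(example))  # [3]
```

A single slow rank often points to an uneven shard or a slow storage path on one node rather than a problem with the pipeline as a whole.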

Use it on your job

Run it on your training job.

Install TraceML, wrap your training step, and run your script normally. Use the summary to see whether training time is going into input loading, compute, residual time, memory, or rank skew.

Start with the repo for examples, then use the quickstart for the minimal wrapper pattern.
