Low GPU utilization guide

Find what is making the GPU wait.

Low GPU utilization is a symptom, not a diagnosis. Use TraceML to separate input loading, H2D transfer, compute, wait time, memory behavior, and rank skew before changing model code.

First-pass triage
1. Confirm the symptom
System reports LOW GPU UTIL or MODERATE GPU UTIL.
2. Read Step Time
Input, H2D, compute, and wait explain why the GPU is idle.
3. Route to the right fix
Data path, transfer, wait, rank skew, memory, or compute behavior.
Classification map

Low GPU utilization can come from different phases.

Start with the System GPU utilization signal, then use Step Time to decide where to look. The label alone is not enough.

TraceML signal
Likely bottleneck
Where to inspect first
INPUT-BOUND
Batches arrive late
DataLoader workers, decoding, tokenization, storage, collation
High H2D time
CPU-to-GPU transfer is visible in the step
Pinned memory, non-blocking transfers, tensor placement, batch size
WAIT-HEAVY
Time sits outside traced phases
Logging, checkpointing, validation, framework orchestration, synchronization
STRAGGLER
One rank or phase holds the aligned step
Worst rank vs median rank, input skew, compute skew, wait skew
COMPUTE-BOUND + low util
Compute dominates, but utilization is still low
Small kernels, fragmented work, CPU launch overhead, model size, profiler trace
One concrete failure mode

DataLoader stalls look like low GPU utilization.

In the DataLoader bottleneck demo, model compute stayed roughly the same. Input loading grew from 1.9ms to 531.8ms, and GPU utilization dropped from 67% to 7%.

Example
Fast baseline
37.3ms

Step time with 1.9ms input loading and 67% GPU utilization.

Input-bound
563.3ms

Step time with 531.8ms input loading and 7% GPU utilization.

Use it on your job

Find where GPU time is lost.

Install TraceML, wrap your training step, and run your script normally. Use the summary to decide whether low GPU utilization is an input, transfer, wait, memory, rank, or compute problem.

Start with the repo for examples, then use the quickstart for the minimal wrapper pattern.