DDP rank straggler guide
One slow rank can hold back the whole DDP job.
Before changing distributed settings, use TraceML to compare worst-rank and median-rank timings for input loading, H2D transfer, forward, backward, optimizer, and wait time, along with memory.
Same DDP loop. Different slow rank.
The demo was run on two nodes with one GPU per node. TraceML separated the balanced baseline from input and compute rank stragglers.
COMPUTE-BOUND · input 1.4/1.4ms · compute 122.4/122.4ms
Reproduce balanced, input-skewed, and compute-skewed DDP.
Start with the single-node commands below. For two nodes, use the same run name, master address, master port, and script args on both nodes, changing only --node-rank.
traceml run examples/ddp_rank_straggler_demo.py --mode=summary --nproc-per-node=2 --run-name ddp_balanced --args --scenario balanced
traceml run examples/ddp_rank_straggler_demo.py --mode=summary --nproc-per-node=2 --run-name ddp_input_straggler --args --scenario input-straggler --straggler-rank 0 --input-sleep-ms 200
traceml run examples/ddp_rank_straggler_demo.py --mode=summary --nproc-per-node=2 --run-name ddp_compute_straggler --args --scenario compute-straggler --straggler-rank 0 --compute-extra-matmuls 8
Compare the slow rank against the typical rank.
The Step Time summary uses median/worst rank comparisons so one slow rank does not disappear inside an average.
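To see why an average hides a straggler, here is a minimal sketch that summarizes one phase across ranks. The helper name `rank_skew`, the dictionary layout, and the timings are illustrative assumptions, not TraceML's API or output.

```python
from statistics import mean, median

def rank_skew(per_rank_ms):
    """Summarize one phase's per-rank timings (hypothetical helper, not TraceML API)."""
    worst_rank = max(per_rank_ms, key=per_rank_ms.get)
    values = list(per_rank_ms.values())
    med = median(values)
    worst = per_rank_ms[worst_rank]
    return {
        "mean_ms": mean(values),
        "median_ms": med,
        "worst_ms": worst,
        "worst_rank": worst_rank,
        # How many times slower the worst rank is than the typical rank.
        "skew_ratio": worst / med if med > 0 else float("inf"),
    }

# Made-up numbers: eight ranks, rank 3 has a slow input path.
input_ms = {rank: 1.4 for rank in range(8)}
input_ms[3] = 201.4

summary = rank_skew(input_ms)
print(summary)
```

With these made-up numbers the mean lands far from every rank's actual timing, while the median/worst pair shows both the typical rank and the outlier.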
Why this matters
In DDP, one rank can make the whole job wait. TraceML checks nearby forward and optimizer evidence before blaming backward directly, then reports the phase and rank to inspect first.
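The waiting effect can be sketched with a deliberately simplified model: every rank must reach the gradient synchronization point before any rank can finish the step, so each rank idles for the gap between its own local work and the slowest rank's. The function name and the numbers below are assumptions for illustration. Real DDP overlaps communication with backward, so measured wait will differ.

```python
def estimate_wait_ms(local_work_ms):
    """Rough per-rank wait under a 'step ends when the slowest rank arrives' model.

    local_work_ms maps rank -> time spent on that rank's own input + compute.
    This is an illustrative model, not how TraceML measures wait.
    """
    slowest = max(local_work_ms.values())
    return {rank: slowest - t for rank, t in local_work_ms.items()}

# Made-up numbers: rank 0 carries an extra 200 ms of local work.
local_work = {0: 323.8, 1: 123.8}
wait = estimate_wait_ms(local_work)

for rank in sorted(wait):
    print(f"rank {rank}: local {local_work[rank]:.1f} ms, estimated wait {wait[rank]:.1f} ms")
```

Under this model the fast rank accumulates the wait. Because DDP's gradient all-reduce runs during backward, that wait tends to surface inside the fast rank's backward time, so a long backward on one rank is not by itself proof that backward is the slow phase.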
Fix the rank that actually moved.
TraceML points to the bottleneck class and rank. The real fix depends on what differs between that rank and the peer ranks.
Input rank skew
Check uneven shards, slow storage on one host, rank-local preprocessing, tokenization, collation, and noisy neighbors.
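One way to confirm an input-side suspect by hand is to time how long each rank blocks while fetching a batch. The sketch below wraps any iterable loader. It assumes the launcher exports a `RANK` environment variable (as torchrun does), and the list of lists stands in for a real DataLoader.

```python
import os
import time

def timed_batches(loader):
    """Yield (batch, fetch_ms), where fetch_ms is the time spent waiting for that batch."""
    iterator = iter(loader)
    while True:
        start = time.perf_counter()
        try:
            batch = next(iterator)
        except StopIteration:
            return
        yield batch, (time.perf_counter() - start) * 1000.0

# Assumption: RANK is set by the launcher; default to 0 for a single process.
rank = int(os.environ.get("RANK", "0"))

fetch_times = []
for batch, fetch_ms in timed_batches([[1, 2], [3, 4], [5, 6]]):  # stand-in loader
    fetch_times.append(fetch_ms)
    # ... the training step on `batch` would go here ...

print(f"rank {rank}: max input fetch {max(fetch_times):.3f} ms over {len(fetch_times)} batches")
```

If one rank's fetch times are consistently higher than its peers', the causes listed above (shard sizes, storage, rank-local preprocessing) are the places to look on that host.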
Compute rank skew
Inspect uneven input shapes, rank-local branches, extra optimizer work, callbacks, hooks, or model code that runs on only one rank.
Compare after a fix
Save the old and new final_summary.json files and compare total step time, input, compute, wait, memory, and rank skew.
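A before/after comparison can be scripted without knowing the exact layout of final_summary.json. The sketch below makes no assumption about the schema: it walks both files, collects every numeric value by its path, and reports the ones that changed. The function names are hypothetical, and the key names in the example are invented.

```python
import json
import sys
from pathlib import Path

def numeric_leaves(node, prefix=""):
    """Flatten nested dicts/lists into {path: number}, skipping non-numeric values."""
    out = {}
    if isinstance(node, dict):
        for key, value in node.items():
            out.update(numeric_leaves(value, f"{prefix}.{key}" if prefix else str(key)))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            out.update(numeric_leaves(value, f"{prefix}[{index}]"))
    elif isinstance(node, (int, float)) and not isinstance(node, bool):
        out[prefix] = float(node)
    return out

def diff_summaries(old, new):
    """Return {path: (old, new, delta)} for numeric values present in both and changed."""
    a, b = numeric_leaves(old), numeric_leaves(new)
    return {k: (a[k], b[k], b[k] - a[k]) for k in sorted(a.keys() & b.keys()) if a[k] != b[k]}

if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python compare_summaries.py old/final_summary.json new/final_summary.json
    old = json.loads(Path(sys.argv[1]).read_text())
    new = json.loads(Path(sys.argv[2]).read_text())
    for path, (before, after, delta) in diff_summaries(old, new).items():
        print(f"{path}: {before:g} -> {after:g} ({delta:+g})")
```

Values that exist in only one of the two files are ignored, so a schema change between runs will not raise an error, but it also will not be reported.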
Find the rank slowing down DDP.
Install TraceML, wrap your training step, and run your DDP job normally. Use the final summary to decide whether the slowdown is input, compute, wait, memory, or rank skew.