DDP rank straggler guide

One slow rank can hold back the whole DDP job.

Before changing distributed settings, use TraceML to compare worst-rank and median-rank timing across input loading, host-to-device (H2D) transfer, forward, backward, optimizer, wait time, and memory.

Real two-node proof

Same DDP loop. Different slow rank.

The demo ran on two nodes with one GPU per node. TraceML distinguished the balanced baseline from the input-straggler and compute-straggler runs.

Balanced
124.6/124.6ms median/worst step time
COMPUTE-BOUND · input 1.4/1.4ms · compute 122.4/122.4ms

Input straggler
r0 dataloader 201.6ms vs r1 at 1.4ms

Compute straggler
r0 optimizer 33.1ms vs r1 at 14.5ms

Run the demo

Reproduce balanced, input-skewed, and compute-skewed DDP.

Start with the single-node commands below. For two nodes, use the same run name, master address, master port, and script args on both nodes, changing only --node-rank.

Balanced baseline:

    traceml run examples/ddp_rank_straggler_demo.py --mode=summary --nproc-per-node=2 --run-name ddp_balanced --args --scenario balanced

Input straggler:

    traceml run examples/ddp_rank_straggler_demo.py --mode=summary --nproc-per-node=2 --run-name ddp_input_straggler --args --scenario input-straggler --straggler-rank 0 --input-sleep-ms 200

Compute straggler:

    traceml run examples/ddp_rank_straggler_demo.py --mode=summary --nproc-per-node=2 --run-name ddp_compute_straggler --args --scenario compute-straggler --straggler-rank 0 --compute-extra-matmuls 8

What TraceML reports

Compare the slow rank against the typical rank.

The Step Time summary uses median/worst rank comparisons so one slow rank does not disappear inside an average.

Run | Diagnosis | Signal
Balanced | COMPUTE-BOUND | total 124.6/124.6ms · input 1.4/1.4ms · compute 122.4/122.4ms
Input straggler | INPUT STRAGGLER | Rank r0 dataloader was 201.6ms vs median rank r1 at 1.4ms.
Compute straggler | COMPUTE STRAGGLER | Rank r0 optimizer was 33.1ms vs median rank r1 at 14.5ms.
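
To illustrate why a median/worst comparison keeps a straggler visible where an average would blur it, here is a minimal Python sketch. The function name `rank_skew` is hypothetical and is not part of TraceML. The rank 0 and rank 1 dataloader timings mirror the input-straggler demo, while ranks 2 and 3 are invented for illustration.

```python
from statistics import mean, median


def rank_skew(per_rank_ms):
    """Summarize one phase's per-rank timings (milliseconds)."""
    worst_rank = max(per_rank_ms, key=per_rank_ms.get)
    values = list(per_rank_ms.values())
    return {
        "median_ms": median(values),
        "worst_ms": per_rank_ms[worst_rank],
        "worst_rank": worst_rank,
        "mean_ms": mean(values),
    }


# Ranks 0 and 1 follow the demo; ranks 2 and 3 are made-up values.
dataloader_ms = {0: 201.6, 1: 1.4, 2: 1.5, 3: 1.3}
summary = rank_skew(dataloader_ms)
print(summary)
```

With these inputs the mean lands near 51 ms, a value no rank actually experienced. The median/worst pair instead shows a typical rank near 1.45 ms and rank 0 at 201.6 ms.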

Why this matters

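In DDP, one rank can make the whole job wait. Because gradients are synchronized across ranks during backward, the delay from a slow rank often surfaces as longer backward time on the faster ranks. TraceML therefore checks nearby forward and optimizer evidence before blaming backward directly, then reports the phase and rank to inspect first.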
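
The waiting effect can be sketched with a deliberately simplified model. It assumes a single synchronization point per step, whereas real DDP overlaps bucketed gradient all-reduce with backward, so real attribution is messier. The function `wait_per_rank` is hypothetical, and the inputs reuse the dataloader timings from the input-straggler demo as stand-ins for pre-sync work.

```python
def wait_per_rank(work_ms):
    """Return how long each rank idles at the sync point, in milliseconds.

    Assumes every rank must wait for the slowest rank before continuing.
    """
    slowest = max(work_ms.values())
    return {rank: slowest - ms for rank, ms in work_ms.items()}


# Pre-sync work per rank, modeled on the input-straggler demo.
work_ms = {0: 201.6, 1: 1.4}
waits = wait_per_rank(work_ms)
print(waits)
```

Under this model the slow rank never waits, and the fast rank absorbs the entire gap as idle time. That is why a symptom seen on one rank can have its cause on another.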

What to inspect

Fix the rank that actually moved.

TraceML points to the bottleneck class and rank. The real fix depends on what differs between that rank and the peer ranks.

01. Input rank skew

Check uneven shards, slow storage on one host, rank-local preprocessing, tokenization, collation, and noisy neighbors.
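
One way to confirm input skew outside the training loop is to time batch fetches on each rank and compare the results. This is a rough sketch, not a TraceML feature: `time_batches` is a hypothetical helper that accepts any iterable, and `slow_batches` stands in for a real data loader.

```python
import time


def time_batches(batches, num_batches=20):
    """Return fetch latency in ms for up to num_batches items."""
    latencies_ms = []
    iterator = iter(batches)
    for _ in range(num_batches):
        start = time.perf_counter()
        try:
            next(iterator)
        except StopIteration:
            break  # The iterable ran out before num_batches fetches.
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms


def slow_batches(n, delay_s):
    """Stand-in for a loader whose every batch takes delay_s to produce."""
    for i in range(n):
        time.sleep(delay_s)
        yield i


latencies = time_batches(slow_batches(5, 0.01), num_batches=3)
print([round(ms, 1) for ms in latencies])
```

Running the same helper against the real loader on every rank and logging the result with the rank id shows whether one host fetches consistently slower than its peers.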

02. Compute rank skew

Inspect uneven input shapes, rank-local branches, extra optimizer work, callbacks, hooks, or model code that runs on only one rank.
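
For the uneven-shapes case, a quick check is to compare how many elements each rank's batch contains. The sketch below is illustrative only: `shape_skew` and its `ratio` threshold are hypothetical, and the shapes are invented. In a real job the per-rank shapes could be collected with something like `torch.distributed.all_gather_object`, which is not shown here.

```python
from math import prod
from statistics import median


def shape_skew(shapes_by_rank, ratio=1.2):
    """Flag ranks whose batch element count exceeds ratio x the median."""
    sizes = {rank: prod(shape) for rank, shape in shapes_by_rank.items()}
    typical = median(sizes.values())
    return {rank: size for rank, size in sizes.items() if size > ratio * typical}


# Invented shapes: rank 0 receives much longer sequences than its peers.
shapes = {0: (32, 2048), 1: (32, 512), 2: (32, 512), 3: (32, 512)}
flagged = shape_skew(shapes)
print(flagged)
```

A flagged rank suggests the slowdown comes from the data it receives, for example padding to a longer sequence length, and not from rank-local code paths.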

03. Compare after a fix

Save the old and new final_summary.json files and compare total step time, input, compute, wait, memory, and rank skew.
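
The comparison can be scripted without knowing the exact file layout. The sketch below makes no assumption about the real `final_summary.json` schema: it flattens every numeric field and reports the ones that changed. All function names are hypothetical, and the field names and values in the example are invented.

```python
import json


def flatten_numeric(obj, prefix=""):
    """Map dotted key paths to the numeric leaves of nested dicts and lists."""
    flat = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            flat.update(flatten_numeric(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            flat.update(flatten_numeric(value, f"{prefix}{index}."))
    elif isinstance(obj, (int, float)) and not isinstance(obj, bool):
        flat[prefix.rstrip(".")] = obj
    return flat


def diff_summaries(old, new):
    """Return {path: (old, new, delta)} for numeric fields that changed."""
    old_flat, new_flat = flatten_numeric(old), flatten_numeric(new)
    return {
        path: (old_flat[path], new_flat[path], new_flat[path] - old_flat[path])
        for path in sorted(old_flat.keys() & new_flat.keys())
        if old_flat[path] != new_flat[path]
    }


def load_summary(path):
    """Load one saved summary file from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Invented field names and values; the real file layout may differ.
old = {"step_time": {"median_ms": 124.6, "worst_ms": 325.0}, "run": "before"}
new = {"step_time": {"median_ms": 124.6, "worst_ms": 126.1}, "run": "after"}
changes = diff_summaries(old, new)
print(changes)
```

For real files, pass `load_summary("old/final_summary.json")` and `load_summary("new/final_summary.json")` (placeholder paths) to `diff_summaries`. Only fields present in both files are compared, so fields added or removed between runs need a separate look.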

Use it on your job

Find the rank slowing down DDP.

Install TraceML, wrap your training step, and run your DDP job normally. Use the final summary to decide whether the slowdown is input, compute, wait, memory, or rank skew.

Start with the GitHub repo for examples. Use the docs for launcher flags, distributed setup, and output details.