DDP rank straggler guide

One slow rank can hold back the whole DDP job.

Use TraceML to compare worst-rank and median-rank timing across input loading, H2D transfer, forward, backward, optimizer, wait time, and memory before changing distributed settings.

View on GitHub Read docs

Real two-node proof

Same DDP loop. Different slow rank.

The demo ran on two nodes with one GPU per node. TraceML distinguished the balanced baseline from the input-straggler and compute-straggler runs.

Balanced: 124.6/124.6ms median/worst step time (COMPUTE-BOUND · input 1.4/1.4ms · compute 122.4/122.4ms)

Input straggler: r0 dataloader 201.6ms vs r1 at 1.4ms

Compute straggler: r0 optimizer 33.1ms vs r1 at 14.5ms

Run the demo

Reproduce balanced, input-skewed, and compute-skewed DDP.

Start with the single-node commands below. For two nodes, use the same run name, master address, master port, and script args on both nodes, changing only --node-rank.

Balanced baseline

traceml run examples/ddp_rank_straggler_demo.py --mode=summary --nproc-per-node=2 --run-name ddp_balanced --args --scenario balanced

Input straggler

traceml run examples/ddp_rank_straggler_demo.py --mode=summary --nproc-per-node=2 --run-name ddp_input_straggler --args --scenario input-straggler --straggler-rank 0 --input-sleep-ms 200

Compute straggler

traceml run examples/ddp_rank_straggler_demo.py --mode=summary --nproc-per-node=2 --run-name ddp_compute_straggler --args --scenario compute-straggler --straggler-rank 0 --compute-extra-matmuls 8

What TraceML reports

Compare the slow rank against the typical rank.

The Step Time summary uses median/worst rank comparisons so one slow rank does not disappear inside an average.

Run | Diagnosis | Signal
Balanced | COMPUTE-BOUND | total 124.6/124.6ms · input 1.4/1.4ms · compute 122.4/122.4ms
Input straggler | INPUT STRAGGLER | Rank r0 dataloader was 201.6ms vs median rank r1 at 1.4ms.
Compute straggler | COMPUTE STRAGGLER | Rank r0 optimizer was 33.1ms vs median rank r1 at 14.5ms.
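To see why a median/worst pair is more telling than an average, here is a minimal sketch in plain Python. It is not TraceML code. The eight-rank layout is hypothetical; only the two dataloader values (201.6ms and 1.4ms) are taken from the demo above.

```python
import statistics

# Hypothetical eight-rank job (index = rank). One rank has the demo's slow
# dataloader time, the other seven have the demo's healthy dataloader time.
dataloader_ms = [201.6, 1.4, 1.4, 1.4, 1.4, 1.4, 1.4, 1.4]

mean_ms = statistics.mean(dataloader_ms)      # blends the straggler into everyone
median_ms = statistics.median(dataloader_ms)  # what a typical rank looks like
worst_ms = max(dataloader_ms)                 # what the job actually waits for
worst_rank = dataloader_ms.index(worst_ms)    # which rank to inspect

print(f"mean {mean_ms:.1f}ms")
print(f"median/worst {median_ms}/{worst_ms}ms (worst rank r{worst_rank})")
```

The mean lands near 26ms, a value that no rank actually measured, while the median/worst pair shows both the typical rank and the one that is holding the job back.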

Why this matters

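In DDP, ranks synchronize gradients during the backward pass, so every rank finishes a step only when the slowest one does. A delay that starts in another phase can therefore surface as extra backward or wait time on the faster ranks. TraceML checks nearby forward and optimizer evidence before blaming backward directly, then reports the phase and rank to inspect first.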
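A toy model can make the waiting effect concrete. The sketch below assumes a single synchronization point per step and ignores the overlap between communication and backward that real DDP performs. The function name and the way the demo numbers are combined (dataloader time plus the balanced run's compute time) are assumptions for illustration, not TraceML output.

```python
from typing import Dict, List


def simulate_sync_step(work_ms: List[float]) -> Dict[str, object]:
    """Toy model: no rank leaves the sync point before the slowest rank arrives."""
    step_ms = max(work_ms)
    # Each rank idles for the difference between its own work and the slowest rank's.
    wait_ms = [step_ms - work for work in work_ms]
    return {"step_ms": step_ms, "wait_ms": wait_ms}


# Per-rank work before synchronization (index = rank), assembled from the
# input-straggler demo: dataloader time + the balanced run's compute time.
work_ms = [201.6 + 122.4, 1.4 + 122.4]

result = simulate_sync_step(work_ms)
for rank, (work, wait) in enumerate(zip(work_ms, result["wait_ms"])):
    print(f"r{rank}: work {work:.1f}ms, wait {wait:.1f}ms")
print(f"step {result['step_ms']:.1f}ms for every rank")
```

In this model the healthy rank spends most of the step waiting, which is why the phase that looks slow on a fast rank is not necessarily the phase that needs fixing.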

What to inspect

Fix the rank that actually moved.

TraceML points to the bottleneck class and rank. The real fix depends on what differs between that rank and the peer ranks.

Input rank skew

Check uneven shards, slow storage on one host, rank-local preprocessing, tokenization, collation, and noisy neighbors.
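One quick check for uneven shards is to total a per-sample cost for the indices each rank reads. The sketch below assumes strided sharding (rank r reads indices r, r + world_size, and so on), which is how PyTorch's DistributedSampler assigns indices when shuffling is off. The cost values are made up, and `per_rank_load` is a hypothetical helper, not part of TraceML.

```python
from typing import List, Sequence


def per_rank_load(sample_costs: Sequence[float], world_size: int) -> List[float]:
    """Sum the cost of the samples each rank would read under strided sharding."""
    return [sum(sample_costs[rank::world_size]) for rank in range(world_size)]


# Made-up per-sample costs (for example token counts or file sizes).
# The expensive samples happen to sit at even indices.
costs = [900, 100, 850, 120, 950, 110, 880, 90]

loads = per_rank_load(costs, world_size=2)
for rank, load in enumerate(loads):
    print(f"r{rank}: total cost {load}")
```

If one rank's total is far above the others, sorting order or shard layout is a likely cause of input skew, and shuffling or bucketing by size is worth trying before touching distributed settings.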

Compute rank skew

Inspect uneven input shapes, rank-local branches, extra optimizer work, callbacks, hooks, or model code that runs on only one rank.

Compare after a fix

Save the old and new final_summary.json files and compare total step time, input, compute, wait, memory, and rank skew.
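A before/after comparison can be scripted without knowing the exact layout of final_summary.json. The sketch below flattens every numeric field and prints the change for keys present in both files. The key names in the example data are hypothetical stand-ins, since the real schema is not shown here; only the millisecond values come from the demo.

```python
import json
from typing import Any, Dict


def flatten_numbers(node: Any, prefix: str = "") -> Dict[str, float]:
    """Collect every numeric leaf of a JSON document under a dotted path."""
    out: Dict[str, float] = {}
    if isinstance(node, dict):
        items = node.items()
    elif isinstance(node, list):
        items = enumerate(node)
    else:
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(node, (int, float)) and not isinstance(node, bool):
            out[prefix] = float(node)
        return out
    for key, value in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        out.update(flatten_numbers(value, path))
    return out


def compare_summaries(old: Any, new: Any) -> Dict[str, float]:
    """Return new - old for every numeric path found in both summaries."""
    before, after = flatten_numbers(old), flatten_numbers(new)
    return {path: after[path] - before[path] for path in sorted(before.keys() & after.keys())}


def compare_files(old_path: str, new_path: str) -> Dict[str, float]:
    """Load two saved summaries from disk and compare them."""
    with open(old_path) as f_old, open(new_path) as f_new:
        return compare_summaries(json.load(f_old), json.load(f_new))


# In-memory stand-ins with hypothetical keys; the real schema may differ.
old = {"dataloader_ms": {"median": 1.4, "worst": 201.6}}
new = {"dataloader_ms": {"median": 1.4, "worst": 1.4}}

deltas = compare_summaries(old, new)
for path, delta in deltas.items():
    print(f"{path}: {delta:+.1f}")
```

For saved runs, call `compare_files` with the paths to the old and new final_summary.json files. A large negative change in a worst-rank value with an unchanged median suggests the fix removed the skew rather than speeding up every rank.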

Use it on your job

Find the rank slowing down DDP.

Install TraceML, wrap your training step, and run your DDP job normally. Use the final summary to decide whether the slowdown is input, compute, wait, memory, or rank skew.

Start with the GitHub repo for examples. Use the docs for launcher flags, distributed setup, and output details.
