TraceOpt guides
Diagnose training slowdowns.
Practical guides for finding DataLoader stalls, low GPU utilization, DDP rank stragglers, memory creep, and run-to-run regressions.
Powered by TraceML, our open-source step-level diagnostics layer.
Find the guide that matches the slowdown you see.
Find DataLoader bottlenecks
Your GPU is idle because batches arrive late. Use this guide to check for slow collation, decoding, tokenization, and storage I/O, as well as uneven shards and input stalls on individual ranks.
Read guide
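As a rough illustration of "batches arrive late", the sketch below measures how much of a loop's wall time is spent blocked on the loader rather than in the training step. It uses only the Python standard library and is not TraceML's API; `timed_batches` and `input_wait_fraction` are hypothetical names introduced here. On a GPU, the step time is only meaningful if the step synchronizes before returning (for example with `torch.cuda.synchronize()`), because CUDA kernels launch asynchronously.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def timed_batches(loader: Iterable) -> Iterator[Tuple[object, float]]:
    """Yield (batch, wait_seconds); wait_seconds is time blocked on the loader."""
    it = iter(loader)
    while True:
        start = time.perf_counter()
        try:
            batch = next(it)
        except StopIteration:
            return
        yield batch, time.perf_counter() - start


def input_wait_fraction(loader: Iterable, train_step: Callable[[object], None]) -> float:
    """Return the share of loop time spent waiting for batches (0.0 to 1.0)."""
    wait_total = 0.0
    step_total = 0.0
    for batch, wait in timed_batches(loader):
        wait_total += wait
        start = time.perf_counter()
        train_step(batch)  # hypothetical callable: forward, backward, optimizer step
        step_total += time.perf_counter() - start
    total = wait_total + step_total
    return wait_total / total if total > 0 else 0.0
```

A fraction close to 1.0 suggests the loop is input-bound; a fraction close to 0.0 suggests the loader keeps up and the slowdown lies elsewhere.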
Debug low GPU utilization
Your GPU is busy in short bursts, then waits. Use this guide to connect low utilization with data loading, host-to-device (H2D) copies, compute time, memory pressure, rank skew, or wait states.
Read guide
Spot DDP rank stragglers
One slow rank can hold back the whole job. Use this guide to compare input time, compute time, wait time, memory, and skew across ranks.
Read guide
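To make "skew across ranks" concrete, here is a minimal sketch that takes one step time per rank and reports the slowest rank and how far it sits above the median. The function name `rank_skew` and the input layout (a dict of rank to seconds) are assumptions for illustration, not part of TraceML.

```python
from statistics import median
from typing import Dict, Tuple


def rank_skew(step_times: Dict[int, float]) -> Tuple[int, float]:
    """Return (slowest_rank, relative_skew) for per-rank step times in seconds.

    relative_skew is (slowest - median) / median, so 0.5 means the slowest
    rank takes 50% longer than the median rank.
    """
    if not step_times:
        raise ValueError("step_times must contain at least one rank")
    med = median(step_times.values())
    slowest = max(step_times, key=step_times.get)
    skew = (step_times[slowest] - med) / med if med > 0 else 0.0
    return slowest, skew
```

Because synchronous data-parallel training waits for every rank at each gradient sync, a persistent skew on one rank tends to show up as wait time on all the others.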
Catch memory creep
Your run looks fine at first, but memory keeps rising. Use this guide to separate steady growth from temporary pressure, per-rank imbalance, and OOM risk.
Read guide
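One simple way to tell steady growth from a temporary spike is to fit a straight line through per-step memory samples and look at the slope. The sketch below does this with an ordinary least-squares fit in plain Python; `memory_slope` is a hypothetical helper, and the samples are assumed to be evenly spaced readings in any consistent unit (for example MiB per step).

```python
from typing import Sequence


def memory_slope(samples: Sequence[float]) -> float:
    """Least-squares slope of memory samples against their step index."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    # Covariance of (step index, memory) and variance of the step index.
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var
```

A run that leaks a little every step gives a clearly positive slope, while a single transient spike on an otherwise flat trace leaves the slope near zero. Any threshold for "too steep" depends on the workload and is left to the reader.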
Compare two runs
Training got slower after a code, data, or infrastructure change? Compare run summaries to see which timing, memory, or rank-skew signals shifted.
Read guide
Use your existing stack
Run TraceML with custom PyTorch loops, Hugging Face, Lightning, Ray Train, W&B, MLflow, Slurm, and distributed jobs.
Read docs
Open-source diagnostics for training performance.
TraceML helps you inspect slow training runs, compare summaries, and understand where time and memory go.