TraceOpt guides

Diagnose training slowdowns.

Practical guides for finding DataLoader stalls, low GPU utilization, DDP rank stragglers, memory creep, and run-to-run regressions.

View on GitHub Read docs

The triage loop

01 Run TraceML
traceml run train.py

02 Read fingerprint
Check step time, input loading, host-to-device (H2D) copies, compute, residual time, GPU utilization, memory, and rank skew.

03 Choose guide
Use the matching symptom guide before changing DataLoader, model, or distributed settings.
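
As a rough illustration of step 03, the sketch below maps a fingerprint to one of the guides listed on this page. The `Fingerprint` fields and every threshold are hypothetical, chosen only for the example; they are not TraceML's output format or its decision rules.

```python
from dataclasses import dataclass


@dataclass
class Fingerprint:
    """Hypothetical per-step summary; field names are illustrative only."""
    step_s: float         # mean wall-clock time per step
    input_s: float        # time spent waiting for the next batch
    h2d_s: float          # host-to-device copy time
    compute_s: float      # forward + backward + optimizer time
    rank_skew: float      # (slowest rank - median rank) / median rank
    mem_growth_mb: float  # memory growth per 100 steps


def choose_guide(fp: Fingerprint) -> str:
    """Pick a symptom guide from the dominant signal (thresholds are arbitrary)."""
    if fp.mem_growth_mb > 50:
        return "Catch memory creep"
    if fp.rank_skew > 0.10:
        return "Spot DDP rank stragglers"
    if fp.input_s / fp.step_s > 0.25:
        return "Find DataLoader bottlenecks"
    # Residual time is whatever the named phases do not account for.
    residual_s = fp.step_s - (fp.input_s + fp.h2d_s + fp.compute_s)
    if residual_s / fp.step_s > 0.25 or fp.h2d_s / fp.step_s > 0.25:
        return "Debug low GPU utilization"
    return "Compare two runs"


# Input loading takes 45% of the step, so the DataLoader guide is chosen.
example = Fingerprint(step_s=0.40, input_s=0.18, h2d_s=0.02,
                      compute_s=0.18, rank_skew=0.03, mem_growth_mb=0.0)
guide = choose_guide(example)
```

The order of the checks is a design choice for the sketch: memory growth and rank skew are tested first because they can distort the per-phase timings that the later checks rely on.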

Common training issues

Find the guide that matches the slowdown you see.

Find DataLoader bottlenecks

Your GPU is idle because batches arrive late. Use this guide to check slow collation, decoding, tokenization, storage I/O, uneven shards, and per-rank input stalls.

Read guide
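
To check by hand whether batches arrive late, one option is to time the wait for each batch separately from the work done on it. The sketch below is a minimal, framework-free version of that idea. `split_step_time`, `slow_loader`, and `fast_step` are hypothetical names used only for this example, and the sleeps stand in for real decoding and training work.

```python
import time
from typing import Callable, Iterable, Tuple


def split_step_time(batches: Iterable, train_step: Callable) -> Tuple[float, float]:
    """Return (seconds waiting for batches, seconds inside train_step)."""
    wait_s = 0.0
    work_s = 0.0
    it = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)  # blocks while the loader prepares a batch
        except StopIteration:
            break
        t1 = time.perf_counter()
        # On a GPU, kernels launch asynchronously, so synchronize the
        # device inside train_step before trusting this measurement.
        train_step(batch)
        t2 = time.perf_counter()
        wait_s += t1 - t0
        work_s += t2 - t1
    return wait_s, work_s


def slow_loader(n: int):
    """Stand-in for an input pipeline that is slow to produce batches."""
    for i in range(n):
        time.sleep(0.03)
        yield i


def fast_step(batch) -> None:
    """Stand-in for a training step that finishes quickly."""
    time.sleep(0.002)


wait_s, work_s = split_step_time(slow_loader(5), fast_step)
input_bound = wait_s > work_s  # True here: the loop mostly waits on input
```

If the wait share stays high, the input pipeline is the place to look before touching the model.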

Debug low GPU utilization

Your GPU is busy in short bursts, then waits. Use this guide to connect low utilization with data loading, H2D copies, compute time, memory pressure, rank skew, or residual time.

Read guide

Spot DDP rank stragglers

One slow rank can hold back the whole job. Use this guide to compare input, compute, residual time, memory, and skew across ranks.

Read guide
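
Because gradient synchronization makes ranks wait for each other, total step time tends to look similar on every rank, so comparing a single phase (such as compute) is usually more telling. The sketch below flags ranks whose phase time sits well above the median. The function name, the input dictionary, and the 10% tolerance are assumptions made for the example.

```python
from statistics import median


def find_stragglers(phase_s_by_rank: dict[int, float],
                    tolerance: float = 0.10) -> dict[int, float]:
    """Return {rank: relative excess} for ranks slower than the median by more than `tolerance`."""
    mid = median(phase_s_by_rank.values())
    excess = {rank: (t - mid) / mid for rank, t in phase_s_by_rank.items()}
    return {rank: e for rank, e in excess.items() if e > tolerance}


# Hypothetical mean compute time per rank, in seconds.
compute_s = {0: 0.40, 1: 0.41, 2: 0.58, 3: 0.40}
stragglers = find_stragglers(compute_s)  # only rank 2 is flagged
```

The median is used instead of the mean so that one very slow rank does not pull the baseline toward itself.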

Catch memory creep

Your run looks fine until memory keeps rising. Use this guide to separate steady growth from temporary pressure, per-rank imbalance, and OOM risk.

Read guide
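
One simple way to separate steady growth from temporary pressure is to fit a line to sampled memory and check whether the peak was released by the end of the window. The sketch below assumes memory samples in MB taken once per step. The function names and both thresholds are illustrative, not values TraceML uses.

```python
def memory_slope_mb_per_step(samples_mb: list[float]) -> float:
    """Least-squares slope of memory (MB) against step index."""
    n = len(samples_mb)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(samples_mb) / n
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples_mb))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


def classify(samples_mb: list[float], min_slope: float = 1.0,
             spike_mb: float = 100.0) -> str:
    """Label a trace as 'temporary pressure', 'steady growth', or 'stable'."""
    peak_above_end = max(samples_mb) - samples_mb[-1]
    if peak_above_end > spike_mb:
        return "temporary pressure"  # memory rose, then came back down
    if memory_slope_mb_per_step(samples_mb) >= min_slope:
        return "steady growth"       # memory keeps climbing and is not released
    return "stable"


creep = [1000 + 5 * i for i in range(20)]          # +5 MB every step
spike = [1000] * 8 + [1600, 1650] + [1000] * 10    # brief peak, then release
creep_label = classify(creep)
spike_label = classify(spike)
```

Steady growth that never releases is the pattern most likely to end in an out-of-memory error, which is why it is kept distinct from a short-lived peak.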

Compare two runs

Training got slower after a code, data, or infrastructure change? Compare run summaries to see which timing, memory, or rank-skew signals shifted.

Read guide
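
A run comparison can be as simple as a relative diff over two summaries. The sketch below assumes each summary is a flat dictionary of metric name to value. The metric names, the numbers, and the 5% threshold are hypothetical and do not reflect TraceML's summary format.

```python
def compare_runs(baseline: dict[str, float], candidate: dict[str, float],
                 threshold: float = 0.05) -> dict[str, float]:
    """Return {metric: relative change} for shared metrics that moved more than `threshold`."""
    shifted = {}
    for name in baseline.keys() & candidate.keys():
        old, new = baseline[name], candidate[name]
        if old == 0:
            continue  # relative change is undefined for a zero baseline
        change = (new - old) / old
        if abs(change) > threshold:
            shifted[name] = change
    # Largest shift first, so the most likely cause is at the top.
    return dict(sorted(shifted.items(), key=lambda kv: abs(kv[1]), reverse=True))


before = {"step_s": 0.40, "input_s": 0.05, "compute_s": 0.30, "peak_mem_mb": 9000}
after = {"step_s": 0.52, "input_s": 0.16, "compute_s": 0.31, "peak_mem_mb": 9100}
shifted = compare_runs(before, after)
```

In this made-up example the step got about 30% slower, and the input time more than tripled while compute and memory barely moved, which points toward the data path.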

Use your existing stack

Run TraceML with custom PyTorch loops, Hugging Face, Lightning, Ray Train, W&B, MLflow, Slurm, and distributed jobs.

Read docs

Open-source diagnostics for training performance.

TraceML helps you inspect slow training runs, compare summaries, and understand where time and memory go.
