Compare runs guide
Compare slow and faster training runs.
Use TraceML compare to diff two final_summary.json files and see whether a slowdown or fix came from input loading, compute, residual time, GPU utilization, memory, or rank skew.
Same script. Faster input path.
Compare a slow input-bound run against the faster run after the input path is fixed.
Capture a slow run and an improved run.
This demo runs the same script twice with the same model settings and batch size. The first run adds a synthetic input delay; the second run uses the fast input path, so compare has one clear change to explain.
Slow input path
traceml run examples/diagnosis/dataloader_bottleneck_demo.py --mode=summary --run-name compare_slow --args --scenario slow --sleep-ms 2 --batch-size 256 --num-samples 15360 --hidden-dim 4096 --depth 4 --print-every 20
Faster input path
traceml run examples/diagnosis/dataloader_bottleneck_demo.py --mode=summary --run-name compare_fast --args --scenario fast --batch-size 256 --num-samples 15360 --hidden-dim 4096 --depth 4 --print-every 20
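The demo script itself is not reproduced in this guide. As a rough illustration of what a synthetic input delay does to a step, the pure-Python sketch below times the wait for each batch separately from a stand-in compute phase. The names `batches`, `fake_compute`, and `measure` are hypothetical and are not part of TraceML or the demo; only the idea of a per-batch sleep mirrors `--sleep-ms`.

```python
import time


def batches(num_batches, sleep_ms=0.0):
    """Yield dummy batches, sleeping before each one to mimic a slow input path."""
    for i in range(num_batches):
        if sleep_ms > 0:
            time.sleep(sleep_ms / 1000.0)  # synthetic input delay (assumption)
        yield i


def fake_compute(batch):
    """Stand-in for forward/backward work: a small CPU loop, not a real model."""
    total = 0
    for k in range(20_000):
        total += k * batch
    return total


def measure(num_batches, sleep_ms=0.0):
    """Return (input_seconds, compute_seconds) summed over all steps."""
    input_s = 0.0
    compute_s = 0.0
    it = batches(num_batches, sleep_ms)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)  # time spent waiting for input
        except StopIteration:
            break
        t1 = time.perf_counter()
        fake_compute(batch)  # time spent on "compute"
        t2 = time.perf_counter()
        input_s += t1 - t0
        compute_s += t2 - t1
    return input_s, compute_s


if __name__ == "__main__":
    for label, delay in (("slow", 2.0), ("fast", 0.0)):
        inp, comp = measure(50, sleep_ms=delay)
        print(f"{label}: input={inp:.4f}s compute={comp:.4f}s")
```

With the delay enabled, the input total grows by roughly the sleep time per batch while the compute total stays in the same range, which is the pattern the two demo runs are designed to produce.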
Compare the generated summary files.
TraceML writes a summary under each run name. Compare the slow run against the faster run to see which timing, memory, process, and system signals moved.
traceml compare logs/compare_slow/final_summary.json logs/compare_fast/final_summary.json
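`traceml compare` does the diffing for you. To inspect the two files by hand, a minimal sketch along the following lines may help. It assumes only that each final_summary.json is a JSON object with numeric leaf values; the actual schema is not described in this guide, and `flatten`, `diff_summaries`, and `load` are hypothetical helper names, not TraceML functions.

```python
import json


def flatten(obj, prefix=""):
    """Map nested dicts to {"a.b.c": value}, keeping only numeric leaves."""
    out = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            out.update(flatten(value, path))
    elif isinstance(obj, (int, float)) and not isinstance(obj, bool):
        out[prefix] = float(obj)
    # Strings, lists, and nulls are ignored in this sketch.
    return out


def diff_summaries(before, after):
    """Return {path: (before, after, delta)} for numeric keys present in both."""
    a, b = flatten(before), flatten(after)
    return {k: (a[k], b[k], b[k] - a[k]) for k in sorted(a.keys() & b.keys())}


def load(path):
    """Read one summary file into a Python object."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


if __name__ == "__main__":
    slow = load("logs/compare_slow/final_summary.json")
    fast = load("logs/compare_fast/final_summary.json")
    for path, (old, new, delta) in diff_summaries(slow, fast).items():
        print(f"{path}: {old} -> {new} ({delta:+})")
```

Keys that exist in only one file are skipped here, so a metric that appears or disappears between runs will not show up in this output.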
The improvement came from input loading.
The model compute stayed in the same range. Step time dropped because input loading stopped consuming nearly the whole step, and GPU utilization recovered.
A shift from input-bound to compute-bound means the bottleneck class changed, not just the absolute runtime.
The drop in input time explains the improvement. The speedup did not come from moving work into another phase.
Memory stayed balanced and process pressure stayed normal, so the first place to inspect is the input path.
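One way to think about a bottleneck class is as the phase that takes the largest share of step time. The sketch below applies that rule to three phase timings. It is an assumption for illustration only: this guide does not document TraceML's own diagnosis rules, `dominant_phase` is a hypothetical name, and the numbers in the usage lines are made up rather than measured.

```python
def dominant_phase(input_ms, compute_ms, residual_ms):
    """Label a step by whichever phase takes the largest share of step time."""
    phases = {"input": input_ms, "compute": compute_ms, "residual": residual_ms}
    total = sum(phases.values())
    if total <= 0:
        raise ValueError("step time must be positive")
    name = max(phases, key=phases.get)
    return f"{name}-bound", phases[name] / total


if __name__ == "__main__":
    # Made-up timings in milliseconds, chosen only to show the label flipping.
    print(dominant_phase(input_ms=90, compute_ms=8, residual_ms=2))
    print(dominant_phase(input_ms=1, compute_ms=8, residual_ms=1))
```

In the two made-up cases compute time is identical, yet the label moves from input to compute because the input share collapsed. That is the sense in which the class changed rather than just the runtime.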
Keep a before/after fingerprint for expensive changes.
Model or training-loop edits
Check whether step time moved into compute, residual time, memory pressure, or another phase after a code change.
Dataset, preprocessing, or storage
Catch input-path regressions from new transforms, tokenization, storage layout, object stores, or shard balance.
Machine, cluster, or distributed setup
Compare GPU utilization, memory headroom, residual time, and rank skew after changing hosts, launchers, or distributed settings.
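Rank skew is not defined in this guide. A common way to express it is the spread of per-rank step time relative to the mean, sketched below under that assumption. `rank_skew` is a hypothetical helper, and TraceML may compute its own figure differently.

```python
from statistics import mean


def rank_skew(step_ms_by_rank):
    """Relative spread of per-rank step time: (slowest - fastest) / mean."""
    values = list(step_ms_by_rank.values())
    if not values:
        raise ValueError("need at least one rank")
    avg = mean(values)
    if avg <= 0:
        raise ValueError("mean step time must be positive")
    return (max(values) - min(values)) / avg


if __name__ == "__main__":
    # Made-up per-rank step times in milliseconds.
    print(rank_skew({0: 100.0, 1: 100.0}))
    print(rank_skew({0: 90.0, 1: 110.0}))
```

A value near zero means the ranks finish steps at about the same pace. A growing value after a host, launcher, or sharding change suggests some ranks are waiting on a slower one.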
Compare your own before/after run.
Run TraceML on baseline and candidate jobs, keep final_summary.json, then compare them before guessing why training got slower.
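For repeated before/after checks, a small wrapper can build the compare command from two run names. This sketch assumes summaries live at logs/<run-name>/final_summary.json, as in the demo command in this guide; `compare_command` and its `log_root` parameter are hypothetical, and your log directory may differ.

```python
import subprocess
from pathlib import Path


def compare_command(baseline_run, candidate_run, log_root="logs"):
    """Build the argv for `traceml compare` from two run names."""
    root = Path(log_root)
    paths = [
        root / name / "final_summary.json"
        for name in (baseline_run, candidate_run)
    ]
    return ["traceml", "compare", *[p.as_posix() for p in paths]]


if __name__ == "__main__":
    cmd = compare_command("compare_slow", "compare_fast")
    print(" ".join(cmd))
    # Requires TraceML to be installed and both summary files to exist.
    subprocess.run(cmd, check=True)
```

Passing the baseline first keeps the argument order consistent with the demo, where the slow run is compared against the faster run.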