Compare runs guide

Compare slow and faster training runs.

Use TraceML compare to diff two final_summary.json files and see whether a slowdown or fix came from input loading, compute, residual time, GPU utilization, memory, or rank skew.

Real TraceML compare

Same script. Faster input path.

Compare a slow input-bound run against the faster run after the input path is fixed.

Verdict: IMPROVEMENT, from INPUT-BOUND to COMPUTE-BOUND
Step time: 563.1ms slow run, 42.3ms after fix
Input: 531.6ms slow run, 2.3ms after fix
GPU util: 6.8% slow run, 49.8% after fix
Memory: BALANCED, no memory diagnosis change

Run the before/after jobs

Capture a slow run and an improved run.

This demo runs the same script twice with the same model settings and batch size. The first run adds a synthetic input delay (--scenario slow --sleep-ms 2); the second uses the fast input path (--scenario fast), so compare has one clear change to explain.

Before

Slow input path

traceml run examples/diagnosis/dataloader_bottleneck_demo.py --mode=summary --run-name compare_slow --args --scenario slow --sleep-ms 2 --batch-size 256 --num-samples 15360 --hidden-dim 4096 --depth 4 --print-every 20

After

Faster input path

traceml run examples/diagnosis/dataloader_bottleneck_demo.py --mode=summary --run-name compare_fast --args --scenario fast --batch-size 256 --num-samples 15360 --hidden-dim 4096 --depth 4 --print-every 20

Compare summaries

Compare the generated summary files.

TraceML writes a final_summary.json under each run name, here logs/compare_slow/ and logs/compare_fast/. Compare the slow run against the faster run to see which timing, memory, process, and system signals moved.

Compare command:
traceml compare logs/compare_slow/final_summary.json logs/compare_fast/final_summary.json

JSON output: compare/compare_slow_vs_compare_fast.json
Text output: compare/compare_slow_vs_compare_fast.txt
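
For a quick look at raw numeric movement between two summaries, independent of the compare report, a schema-agnostic diff can help. The sketch below is not how traceml compare works: it assumes nothing about the layout of final_summary.json beyond it being valid JSON, and the helper names (flatten, numeric_deltas, diff_files) are hypothetical.

```python
import json
from pathlib import Path


def flatten(node, prefix=""):
    """Flatten nested dicts/lists into {dotted.path: leaf_value}."""
    if isinstance(node, dict):
        items = node.items()
    elif isinstance(node, list):
        items = enumerate(node)
    else:
        return {prefix: node}
    flat = {}
    for key, value in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(value, path))
    return flat


def is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def numeric_deltas(before, after):
    """Return {path: (before, after, after - before)} for numeric leaves
    that exist in both inputs and whose values differ."""
    a, b = flatten(before), flatten(after)
    return {
        path: (a[path], b[path], b[path] - a[path])
        for path in sorted(a.keys() & b.keys())
        if is_number(a[path]) and is_number(b[path]) and a[path] != b[path]
    }


def diff_files(before_path, after_path):
    """Load two JSON files and diff their numeric leaves."""
    before = json.loads(Path(before_path).read_text())
    after = json.loads(Path(after_path).read_text())
    return numeric_deltas(before, after)
```

Calling diff_files("logs/compare_slow/final_summary.json", "logs/compare_fast/final_summary.json") would list every numeric field that moved, whatever the file's actual keys turn out to be. It does not interpret the fields, so the compare report remains the place to read diagnoses.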

What changed

The improvement came from input loading.

Model compute stayed in the same range (31.0ms in the slow run, 38.9ms in the faster run). Step time dropped because input loading stopped consuming nearly the whole step (531.6ms of 563.1ms), and GPU utilization recovered from 6.8% to 49.8%.

Metric             Slow run      Faster run      Change
Primary diagnosis  INPUT-BOUND   COMPUTE-BOUND   improved
Total step         563.1ms       42.3ms          -520.8ms (-92.5%)
Input              531.6ms       2.3ms           -529.3ms
Compute            31.0ms        38.9ms          +7.9ms
GPU utilization    6.8%          49.8%           +42.9 pp
Step memory        BALANCED      BALANCED        same
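
The Change column mixes three kinds of difference: absolute milliseconds, relative percent, and percentage points (pp). As a minimal sketch, the arithmetic can be reproduced from the rounded values shown above; the function names are hypothetical and this is not TraceML's own code.

```python
def change_ms(slow_ms, fast_ms):
    """Absolute change in milliseconds (negative means faster)."""
    return fast_ms - slow_ms


def change_pct(slow, fast):
    """Relative change as a percentage of the slow-run value."""
    return (fast - slow) / slow * 100.0


def change_pp(slow_pct, fast_pct):
    """Difference between two percentages, in percentage points."""
    return fast_pct - slow_pct


# Values copied from the comparison table.
print(f"Total step: {change_ms(563.1, 42.3):+.1f}ms ({change_pct(563.1, 42.3):+.1f}%)")
print(f"Input:      {change_ms(531.6, 2.3):+.1f}ms")
print(f"Compute:    {change_ms(31.0, 38.9):+.1f}ms")
print(f"GPU util:   {change_pp(6.8, 49.8):+.1f} pp")
```

From the rounded table values the GPU utilization difference comes out as +43.0 pp rather than the reported +42.9 pp. A gap that small would be consistent with the report being computed from unrounded utilization values, though the source does not state this.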

Diagnosis shift

Input-bound to compute-bound means the bottleneck class changed, not just the absolute runtime.
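
One way to read a bottleneck class is as the phase that takes the largest share of step time. The sketch below illustrates that idea with this demo's numbers. The dominant-share rule and the helpers phase_shares and dominant_phase are assumptions for illustration, not TraceML's actual diagnosis logic.

```python
def phase_shares(total_ms, phases):
    """Return each phase's share of total step time, plus the remainder as 'residual'."""
    shares = {name: ms / total_ms for name, ms in phases.items()}
    shares["residual"] = max(0.0, 1.0 - sum(shares.values()))
    return shares


def dominant_phase(total_ms, phases):
    """Name the phase with the largest share of the step (hypothetical rule)."""
    shares = phase_shares(total_ms, phases)
    name = max(shares, key=shares.get)
    return name, shares[name]


# Step, input, and compute times from the slow and faster runs.
slow = dominant_phase(563.1, {"input": 531.6, "compute": 31.0})
fast = dominant_phase(42.3, {"input": 2.3, "compute": 38.9})

print(f"slow run: {slow[0]} takes {slow[1]:.1%} of the step")
print(f"fast run: {fast[0]} takes {fast[1]:.1%} of the step")
```

Under this rule, input dominates the slow run at roughly 94% of the step and compute dominates the faster run at roughly 92%, which matches the reported shift in class.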

Input moved

The input drop explains the improvement. The faster run did not come from hiding work in another phase.
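
The claim that the input drop accounts for the improvement can be checked by splitting the step-time change by phase. This sketch treats residual time as total minus input minus compute, which is an assumption about how the phases add up, not a documented TraceML definition; phase_deltas is a hypothetical helper.

```python
def phase_deltas(slow, fast):
    """Split the change in total step time into input, compute, and residual parts.

    Residual is assumed to be total - input - compute (hypothetical definition).
    """
    def residual(run):
        return run["total"] - run["input"] - run["compute"]

    return {
        "input": fast["input"] - slow["input"],
        "compute": fast["compute"] - slow["compute"],
        "residual": residual(fast) - residual(slow),
        "total": fast["total"] - slow["total"],
    }


# Millisecond values from the slow and faster runs.
slow = {"total": 563.1, "input": 531.6, "compute": 31.0}
fast = {"total": 42.3, "input": 2.3, "compute": 38.9}

deltas = phase_deltas(slow, fast)
for phase, delta in deltas.items():
    print(f"{phase:>8}: {delta:+.1f}ms")
```

Input falls by 529.3ms while the total falls by 520.8ms. Compute and residual together rise by about 8.5ms, so no other phase absorbed the time that input gave back.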

Context stayed stable

Memory stayed balanced and process pressure stayed normal, so the first place to inspect is the input path.

Use compare on your own runs

Keep a before/after fingerprint for expensive changes.

Code changes

Model or training-loop edits

Check whether step time moved into compute, residual time, memory pressure, or another phase after a code change.

Data changes

Dataset, preprocessing, or storage

Catch input-path regressions from new transforms, tokenization, storage layout, object stores, or shard balance.

Infrastructure changes

Machine, cluster, or distributed setup

Compare GPU utilization, memory headroom, residual time, and rank skew after changing hosts, launchers, or distributed settings.

Use it on your job

Compare your own before/after run.

Run TraceML on baseline and candidate jobs, keep final_summary.json, then compare them before guessing why training got slower.
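
A small wrapper can make the baseline-versus-candidate comparison repeatable. The sketch below assumes summaries land at logs/<run-name>/final_summary.json, as in the demo commands, and that traceml compare takes the two paths as positional arguments in before-then-after order. The helper names are hypothetical.

```python
import subprocess
from pathlib import Path


def summary_path(run_name, log_root="logs"):
    """Assumed layout: <log_root>/<run_name>/final_summary.json (as in the demo)."""
    return Path(log_root) / run_name / "final_summary.json"


def compare_command(baseline_run, candidate_run, log_root="logs"):
    """Build the argv for `traceml compare <baseline> <candidate>`."""
    return [
        "traceml",
        "compare",
        str(summary_path(baseline_run, log_root)),
        str(summary_path(candidate_run, log_root)),
    ]


def run_compare(baseline_run, candidate_run, log_root="logs"):
    """Run the comparison, failing early if either summary is missing."""
    for run in (baseline_run, candidate_run):
        path = summary_path(run, log_root)
        if not path.is_file():
            raise FileNotFoundError(f"missing summary: {path}")
    cmd = compare_command(baseline_run, candidate_run, log_root)
    return subprocess.run(cmd, check=True)
```

For the demo, run_compare("compare_slow", "compare_fast") would issue the same command shown earlier. Checking for the summary files first gives a clearer error than a failed CLI call when a run did not finish.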

Start with the GitHub repo for examples. Use the docs for compare output fields and artifact paths.