Compare runs guide

Compare a slow training run against a faster one.

Use TraceML compare to diff two final_summary.json files and see whether a slowdown or fix came from input loading, compute, residual time, GPU utilization, memory, or rank skew.


Real TraceML compare

Same script. Faster input path.

Compare a slow input-bound run against the faster run after the input path is fixed.

Verdict: IMPROVEMENT, from INPUT-BOUND to COMPUTE-BOUND

Step time: 563.1ms slow run, 42.3ms after fix

Input: 531.6ms slow run, 2.3ms after fix

GPU util: 6.8% slow run, 49.8% after fix

Memory: BALANCED, no memory diagnosis change

Run the before/after jobs

Capture a slow run and an improved run.

This demo runs the same model settings and batch size twice. The first run adds a synthetic input delay; the second run uses the fast input path, so compare has one clear change to explain.

Before

Slow input path

traceml run examples/diagnosis/dataloader_bottleneck_demo.py --mode=summary --run-name compare_slow --args --scenario slow --sleep-ms 2 --batch-size 256 --num-samples 15360 --hidden-dim 4096 --depth 4 --print-every 20

After

Faster input path

traceml run examples/diagnosis/dataloader_bottleneck_demo.py --mode=summary --run-name compare_fast --args --scenario fast --batch-size 256 --num-samples 15360 --hidden-dim 4096 --depth 4 --print-every 20
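The flags in these two commands hint at why the slow run ends up input-bound. The following is a back-of-the-envelope sketch only: it assumes `--sleep-ms` is a per-sample delay and that samples are loaded one after another, neither of which is confirmed here, so check the demo script before relying on it.

```python
# Rough estimate of the synthetic input delay per training step.
# ASSUMPTION: --sleep-ms is applied once per sample and samples are
# loaded sequentially; the demo script may behave differently.
sleep_ms = 2          # --sleep-ms
batch_size = 256      # --batch-size
num_samples = 15360   # --num-samples

steps = num_samples // batch_size            # batches in one pass over the data
delay_per_step_ms = sleep_ms * batch_size    # sleep accumulated per batch

print(f"steps per pass: {steps}")
print(f"estimated input delay per step: {delay_per_step_ms} ms")
```

Under those assumptions the estimate is 512 ms per step, which is in the same range as the 531.6ms input time reported for the slow run.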

Compare summaries

Compare the generated summary files.

TraceML writes a summary under each run name. Compare the slow run against the faster run to see which timing, memory, process, and system signals moved.

Compare command

traceml compare logs/compare_slow/final_summary.json logs/compare_fast/final_summary.json

JSON output

compare/compare_slow_vs_compare_fast.json

Text output

compare/compare_slow_vs_compare_fast.txt
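If you want to inspect the two summary files yourself, a generic numeric diff is one option. The sketch below makes no assumption about the layout of final_summary.json: it flattens whatever JSON it finds and reports numeric leaves present in both files. The helper names (`flatten`, `numeric_diff`, `diff_files`) are hypothetical and are not part of TraceML.

```python
import json
from pathlib import Path


def flatten(obj, prefix=""):
    """Flatten nested dicts/lists into {"a.b.0": value} pairs."""
    items = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            items.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            items.update(flatten(value, f"{prefix}{index}."))
    else:
        items[prefix.rstrip(".")] = obj
    return items


def numeric_diff(before, after):
    """Return {path: (before, after, delta)} for numeric leaves in both inputs."""
    flat_before, flat_after = flatten(before), flatten(after)
    diff = {}
    for path in sorted(flat_before.keys() & flat_after.keys()):
        a, b = flat_before[path], flat_after[path]
        # bool is a subclass of int in Python, so skip flags explicitly.
        if isinstance(a, bool) or isinstance(b, bool):
            continue
        if isinstance(a, (int, float)) and isinstance(b, (int, float)):
            diff[path] = (a, b, b - a)
    return diff


def diff_files(before_path, after_path):
    """Load two JSON files and diff their numeric leaves."""
    before = json.loads(Path(before_path).read_text())
    after = json.loads(Path(after_path).read_text())
    return numeric_diff(before, after)
```

Calling `diff_files("logs/compare_slow/final_summary.json", "logs/compare_fast/final_summary.json")` returns a dictionary keyed by dotted path. This complements `traceml compare` and does not replace it, because it reports raw deltas without any diagnosis.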

What changed

The improvement came from input loading.

Model compute stayed in the same range (31.0ms in the slow run, 38.9ms in the faster run). Step time dropped because input loading stopped consuming nearly the whole step, and GPU utilization recovered from 6.8% to 49.8%.

| Metric | Slow run | Faster run | Change |
| --- | --- | --- | --- |
| Primary diagnosis | INPUT-BOUND | COMPUTE-BOUND | improved |
| Total step | 563.1ms | 42.3ms | -520.8ms (-92.5%) |
| Input | 531.6ms | 2.3ms | -529.3ms |
| Compute | 31.0ms | 38.9ms | +7.9ms |
| GPU utilization | 6.8% | 49.8% | +42.9 pp |
| Step memory | BALANCED | BALANCED | same |
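The changes reported above follow from simple arithmetic on the two runs. A minimal sketch, using only the rounded values shown here (the `change` helper is illustrative, not a TraceML function):

```python
def change(before, after):
    """Return (absolute delta, percent change) going from before to after."""
    delta = after - before
    percent = delta / before * 100
    return delta, percent


# Total step time: slow run vs. faster run, in milliseconds.
step_delta, step_pct = change(563.1, 42.3)
print(f"Total step: {step_delta:+.1f}ms ({step_pct:+.1f}%)")

# Speedup factor is the ratio of the two step times.
speedup = 563.1 / 42.3
print(f"Speedup: {speedup:.1f}x")

# Utilization is already a percentage, so the difference is in
# percentage points (pp), not percent.
gpu_pp = 49.8 - 6.8
print(f"GPU utilization: {gpu_pp:+.1f} pp")
```

Recomputing from the rounded figures gives 43.0 pp for GPU utilization, so the reported +42.9 pp presumably comes from unrounded measurements.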

Diagnosis shift

Input-bound to compute-bound means the bottleneck class changed, not just the absolute runtime.

Input moved

The input drop explains the improvement: input time fell by 529.3ms while compute rose by only 7.9ms. The speedup did not come from hiding work in another phase.

Context stayed stable

Memory stayed balanced and process pressure stayed normal, so the first place to inspect is the input path.
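One way to see the diagnosis shift in numbers is to express each phase as a share of step time. The sketch below uses the values reported above and rests on two assumptions: residual time is taken as total minus input minus compute, and "largest share" stands in for the diagnosis. TraceML's actual rules and thresholds are not described here and may differ.

```python
def phase_shares(total_ms, input_ms, compute_ms):
    """Split a step into input, compute, and residual shares (0..1).

    ASSUMPTION: residual = total - input - compute.
    """
    residual_ms = total_ms - input_ms - compute_ms
    return {
        "input": input_ms / total_ms,
        "compute": compute_ms / total_ms,
        "residual": residual_ms / total_ms,
    }


slow = phase_shares(563.1, 531.6, 31.0)
fast = phase_shares(42.3, 2.3, 38.9)

for name, shares in (("slow", slow), ("fast", fast)):
    # Hypothetical rule: the phase with the largest share dominates.
    dominant = max(shares, key=shares.get)
    summary = ", ".join(f"{phase} {share:.1%}" for phase, share in shares.items())
    print(f"{name}: {summary} -> dominant phase: {dominant}")
```

With these numbers, input takes roughly 94% of the slow step and compute takes roughly 92% of the faster step, which matches the shift from input-bound to compute-bound.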

Use compare on your own runs

Keep a before/after fingerprint for expensive changes.

Code changes

Model or training-loop edits

Check whether step time moved into compute, residual time, memory pressure, or another phase after a code change.

Data changes

Dataset, preprocessing, or storage

Catch input-path regressions from new transforms, tokenization, storage layout, object stores, or shard balance.

Infrastructure changes

Machine, cluster, or distributed setup

Compare GPU utilization, memory headroom, residual time, and rank skew after changing hosts, launchers, or distributed settings.

Use it on your job

Compare your own before/after run.

Run TraceML on baseline and candidate jobs, keep final_summary.json, then compare them before guessing why training got slower.
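For a recurring before/after check, a small gate can flag which metrics got worse before you dig into the compare output. This is a sketch under stated assumptions: the metric names and the 5% tolerance are hypothetical, and you would need to map them to whatever fields your own summaries contain.

```python
# Hypothetical metric names; lower values are better for these.
LOWER_IS_BETTER = {"step_ms", "input_ms", "compute_ms"}


def regressions(baseline, candidate, tolerance=0.05):
    """List metrics that got worse by more than `tolerance` (relative)."""
    worse = []
    for name, base in baseline.items():
        if name not in candidate or base == 0:
            continue  # nothing to compare, or relative change undefined
        rel = (candidate[name] - base) / abs(base)
        if name not in LOWER_IS_BETTER:
            rel = -rel  # higher is better, so a drop counts as worse
        if rel > tolerance:
            worse.append((name, base, candidate[name]))
    return worse


# Example: treat the faster run as baseline and the slow run as candidate.
baseline = {"step_ms": 42.3, "input_ms": 2.3, "compute_ms": 38.9, "gpu_util_pct": 49.8}
candidate = {"step_ms": 563.1, "input_ms": 531.6, "compute_ms": 31.0, "gpu_util_pct": 6.8}

for name, before, after in regressions(baseline, candidate):
    print(f"{name}: {before} -> {after}")
```

In this example the gate flags step time, input time, and GPU utilization, but not compute, which points at the input path as the first place to look.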

Start with the GitHub repo for examples. Use the docs for compare output fields and artifact paths.
