See GPU efficiency inside your training loop
Training-aware observability that lines up with steps, memory, and time
TraceOpt builds TraceML, a lightweight tool that captures system signals from inside your training loop. It helps you spot memory growth and timing regressions while a run is still happening.
```python
# Add a small wrapper around each training step
from traceml.decorators import trace_model_instance, trace_step

trace_model_instance(model)

for batch in dataloader:
    with trace_step(model):
        outputs = model(batch)
        loss = loss_fn(outputs, batch)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
```
Training breakdowns are expensive
When runs slow down or crash, teams often debug by guessing and rerunning. General dashboards rarely tell you which step or which part of the loop caused the issue.
Out of memory without context
Memory issues can appear late in training. Without step-aligned signals, it is hard to know what changed.
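To make "step-aligned" concrete, here is a minimal generic PyTorch sketch (not TraceML's implementation) that records allocated GPU memory once per step, reusing the `model`, `dataloader`, `loss_fn`, and `optimizer` objects from the snippet above. A log like this ties a memory climb to the step where it started:

```python
import torch

# Minimal sketch: one memory reading per step, aligned to the step index,
# so a later spike or steady climb can be traced back to when it began.
memory_log = []  # list of (step, bytes_allocated_on_gpu)

for step, batch in enumerate(dataloader):
    outputs = model(batch)
    loss = loss_fn(outputs, batch)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)

    if torch.cuda.is_available():
        memory_log.append((step, torch.cuda.memory_allocated()))
```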
Slow steps without a clear cause
A step can jump from two seconds to eight seconds. Is it the input pipeline, the model, or the optimizer?
Signals are scattered
GPU metrics, logs, and training metrics live in different places. Correlating them manually costs time.
TraceML: Training-Native Observability
Map system efficiency to model behavior in real time. See exactly where time and memory go during training.
Per-layer memory tracking
Parameters, activations, and gradients broken down by module. See which layers consume memory and catch growth before OOM.
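As a rough illustration of the idea (plain PyTorch hooks, not how TraceML does it internally), per-module memory can be approximated by summing parameter sizes and recording activation sizes from forward hooks on the `model` used above:

```python
import torch
import torch.nn as nn

def param_bytes(module: nn.Module) -> int:
    # Bytes held by this module's own parameters (children counted separately).
    return sum(p.numel() * p.element_size() for p in module.parameters(recurse=False))

activation_bytes = {}  # module name -> bytes of its most recent output

def make_hook(name):
    def hook(module, inputs, output):
        if isinstance(output, torch.Tensor):
            activation_bytes[name] = output.numel() * output.element_size()
    return hook

# Register a forward hook on every leaf module so each forward pass
# refreshes the per-layer activation sizes.
for name, module in model.named_modules():
    if not list(module.children()):
        module.register_forward_hook(make_hook(name))

# After a forward pass, the largest consumers are easy to rank, e.g.:
# sorted(activation_bytes.items(), key=lambda kv: kv[1], reverse=True)[:5]
```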
Step timing breakdown
Time spent in forward pass, backward pass, optimizer step, and dataloader separately. Identify bottlenecks instantly.
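As a sketch of what a phase breakdown means (generic PyTorch timing with explicit GPU synchronization; TraceML's own measurement may differ), each phase can be timed separately around the same loop shown at the top of the page:

```python
import time
import torch

def timed(fn):
    # Run fn and return (result, seconds), synchronizing so that
    # asynchronous CUDA work is included in the measurement.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    result = fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return result, time.perf_counter() - start

for batch in dataloader:
    outputs, t_fwd = timed(lambda: model(batch))
    loss = loss_fn(outputs, batch)
    _, t_bwd = timed(loss.backward)
    _, t_opt = timed(optimizer.step)
    optimizer.zero_grad(set_to_none=True)
    print(f"forward={t_fwd:.3f}s backward={t_bwd:.3f}s optimizer={t_opt:.3f}s")
```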
Dataloader monitoring
Track data fetch time per step. Know immediately if your dataloader is starving your GPU.
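The same idea applies to data fetch time. A hedged sketch (not TraceML's API): time each `next()` call on the dataloader iterator and compare it with the compute time of the step.

```python
import time
import torch

data_iter = iter(dataloader)
step = 0
while True:
    fetch_start = time.perf_counter()
    try:
        batch = next(data_iter)          # time spent waiting on the input pipeline
    except StopIteration:
        break
    t_fetch = time.perf_counter() - fetch_start

    compute_start = time.perf_counter()
    outputs = model(batch)
    loss = loss_fn(outputs, batch)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    if torch.cuda.is_available():
        torch.cuda.synchronize()         # include queued GPU work in compute time
    t_compute = time.perf_counter() - compute_start

    # A large fetch share means the GPU is starved, waiting on data.
    fetch_share = t_fetch / (t_fetch + t_compute)
    print(f"step {step}: fetch={t_fetch:.3f}s compute={t_compute:.3f}s ({fetch_share:.0%} data wait)")
    step += 1
```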
Terminal dashboard
Live CLI dashboard for SSH sessions. See training metrics update in real time without leaving your terminal.
Local web UI
Web dashboard at localhost:8765 with live charts and detailed breakdowns when you want visual analysis.
Lightweight overhead
Minimal performance impact on training. Designed to run during real jobs, not special profiling runs.
What TraceML is not
It is not a replacement for Nsight or deep kernel tracing. It is a practical view of training loops and step-aligned signals.
Status
Available now
Single-GPU usage with basic live signals for memory and timing, depending on your setup and hardware support.
In progress
Single-node multi-GPU support and richer step-phase breakdowns. These are active areas of work.
Planned
An offline replay tool for post-run log analysis, shaped by what early users ask for most.
How it compares
TraceML is training loop focused. Other tools still matter for their strengths.
| Capability | TraceML | PyTorch Profiler | NVIDIA Nsight | TensorBoard | W&B / Neptune |
|---|---|---|---|---|---|
| Live step-aligned view | ✓ | ✗ | ✗ | Depends | Metrics view |
| Training-loop-first design | ✓ | Profiler workflow | GPU expert tool | Logging first | Experiment tracking |
| Memory signals during run | ✓ | Possible | ✓ | If logged | If logged |
| Kernel-level traces | ✗ | ✓ | ✓ | ✗ | ✗ |
| Local first | ✓ | ✓ | ✓ | ✓ | Often hosted |
| Best fit | Daily training debugging | Performance deep dives | GPU profiling experts | Tracking and charts | Run management |
Get started
Try TraceML, report issues, or reach out