Open source • TraceML for PyTorch

See GPU efficiency inside your training loop

Training-aware observability that lines up signals with steps, memory, and time

TraceOpt builds TraceML, a lightweight tool that captures system signals from inside your training loop. It helps you spot memory growth and timing regressions while a run is still happening.

Quick start
Minimal snippet; the exact API may evolve
# Add a small wrapper around each training step
from traceml.decorators import trace_model_instance, trace_step

# Register the model instance with TraceML once, before the loop
trace_model_instance(model)

for inputs, targets in dataloader:
    with trace_step(model):
        outputs = model(inputs)
        loss = loss_fn(outputs, targets)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
TraceML is at an early stage. If something is missing for your workflow, feedback helps a lot.

Training breakdowns are expensive

When runs slow down or crash, teams often debug by guessing and rerunning. General dashboards rarely tell you which step or which part of the loop caused the issue.

💥

Out of memory without context

Memory issues can appear late in training. Without step-aligned signals, it is hard to know what changed.

🐌

Slow steps without a clear cause

A step can jump from two seconds to eight seconds. Is it the input pipeline, the model, or the optimizer?

🔀

Signals are scattered

GPU metrics, logs, and training metrics live in different places. Correlating them manually costs time.

TraceML: Training-Native Observability

Map system efficiency to model behavior in real time. See exactly where time and memory go during training.

🧠

Per-layer memory tracking

Parameters, activations, and gradients broken down by module. See which layers consume memory and catch growth before OOM.
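
TraceML's own instrumentation is not shown here, but as a rough sketch of the underlying idea, plain PyTorch forward hooks can attribute parameter and activation memory to individual modules:

# Illustrative sketch only, not TraceML's implementation: forward hooks
# record parameter and activation sizes per module.
import torch
import torch.nn as nn

per_layer_mb = {}

def make_hook(name):
    def hook(module, inputs, output):
        # Memory held by this module's own parameters
        params = sum(p.numel() * p.element_size()
                     for p in module.parameters(recurse=False)) / 1e6
        # Rough activation cost: the tensor(s) this module just produced
        outs = output if isinstance(output, (tuple, list)) else (output,)
        acts = sum(t.numel() * t.element_size()
                   for t in outs if torch.is_tensor(t)) / 1e6
        per_layer_mb[name] = (params, acts)
    return hook

model = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 10))
for name, module in model.named_modules():
    if name:  # skip the root container
        module.register_forward_hook(make_hook(name))

model(torch.randn(64, 512))
for name, (params, acts) in per_layer_mb.items():
    print(f"layer {name}: params {params:.2f} MB, activations {acts:.2f} MB")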

⏱️

Step timing breakdown

Time spent in the forward pass, backward pass, optimizer step, and dataloader, each measured separately. Identify bottlenecks instantly; a hand-rolled sketch of the idea follows.
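
As a rough plain-PyTorch illustration of what a step timing breakdown measures (not TraceML's API), you can time each phase by hand, synchronizing the GPU so wall-clock numbers are meaningful:

# Illustrative sketch, not TraceML's API: time each phase of one step.
import time
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(512, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
inputs = torch.randn(64, 512, device=device)
targets = torch.randint(0, 10, (64,), device=device)

def timed(fn):
    # GPU kernels run asynchronously; synchronize so wall-clock time is real
    if device.type == "cuda":
        torch.cuda.synchronize(device)
    start = time.perf_counter()
    result = fn()
    if device.type == "cuda":
        torch.cuda.synchronize(device)
    return result, time.perf_counter() - start

outputs, fwd_s = timed(lambda: model(inputs))
loss = loss_fn(outputs, targets)
_, bwd_s = timed(loss.backward)
_, opt_s = timed(optimizer.step)
print(f"forward {fwd_s:.3f}s  backward {bwd_s:.3f}s  optimizer {opt_s:.3f}s")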

📥

Dataloader monitoring

Track data fetch time per step. Know immediately if your dataloader is starving your GPU.
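
Again as a plain-PyTorch illustration rather than TraceML's API, per-step fetch time is simply the wall-clock time spent waiting on the dataloader iterator:

# Illustrative sketch, not TraceML's API: time how long each batch takes to
# arrive. Long fetch times mean the GPU is waiting on data.
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1024, 512), torch.randint(0, 10, (1024,)))
loader = DataLoader(dataset, batch_size=64, num_workers=0)

it = iter(loader)
step = 0
while True:
    start = time.perf_counter()
    try:
        batch = next(it)  # blocks while the next batch is prepared
    except StopIteration:
        break
    fetch_s = time.perf_counter() - start
    step += 1
    print(f"step {step}: data fetch {fetch_s * 1000:.1f} ms")
    # ... forward/backward/optimizer would run here ...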

🖥️

Terminal dashboard

Live CLI dashboard for SSH sessions. See training metrics update in real time without leaving your terminal.

🌐

Local web UI

Web dashboard at localhost:8765 with live charts and detailed breakdowns when you want visual analysis.

Lightweight overhead

Minimal performance impact on training. Designed to run during real jobs, not special profiling runs.

What TraceML is not

It is not a replacement for Nsight or deep kernel tracing. It is a practical view of the training loop, built on step-aligned signals.

Status

Available now

Single-GPU usage and basic live signals for memory and timing, depending on your setup and hardware support.

In progress

Single-node multi-GPU support and richer step-phase breakdowns. These are active areas of work.

Planned

An offline replay tool for analyzing logged runs after the fact, shaped by what early users ask for most.

How it compares

TraceML is training-loop focused. Other tools still matter for their strengths.

Capability | TraceML | PyTorch Profiler | NVIDIA Nsight | TensorBoard | W&B / Neptune
Live step-aligned view | Yes | No | No | Depends | Metrics view
Primary focus | Training-loop-first design | Profiler workflow | GPU expert tool | Logging first | Experiment tracking
Memory signals during run | Yes | Possible | Yes | If logged | If logged
Kernel-level traces | No | Yes | Yes | No | No
Local first | Yes | Yes | Yes | Yes | Often hosted
Best fit | Daily training debugging | Performance deep dives | GPU profiling experts | Tracking and charts | Run management

Get started

Try TraceML, report issues, or reach out

If you can share a small repro script, that helps us investigate quickly.