See GPU efficiency inside your training loop
Training-aware observability that lines up with steps, memory, and time
TraceOpt builds TraceML, a lightweight tool that captures system signals from inside your training loop. It helps you spot memory growth and timing regressions while a run is still happening.
```python
# Add a small wrapper around each training step
from traceml.decorators import trace_model_instance, trace_step

trace_model_instance(model)

for batch in dataloader:
    with trace_step(model):
        outputs = model(batch)
        loss = loss_fn(outputs, batch)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
```
Training breakdowns are expensive
When runs slow down or crash, teams often debug by guessing and rerunning. General dashboards rarely tell you which step or which part of the loop caused the issue.
Out of memory without context
Memory issues can appear late in training. Without step-aligned signals, it is hard to know what changed.
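To make "step-aligned" concrete, here is a minimal generic PyTorch sketch (not TraceML's implementation) that records allocated GPU memory once per step, reusing the `model`, `dataloader`, `loss_fn`, and `optimizer` objects from the snippet above. A log like this ties a memory climb to the step where it started:

```python
import torch

# Minimal sketch: one memory reading per step, aligned to the step index,
# so a later spike or steady climb can be traced back to when it began.
memory_log = []  # list of (step, bytes_allocated_on_gpu)

for step, batch in enumerate(dataloader):
    outputs = model(batch)
    loss = loss_fn(outputs, batch)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)

    if torch.cuda.is_available():
        memory_log.append((step, torch.cuda.memory_allocated()))
```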
Slow steps without a clear cause
A step can jump from two seconds to eight seconds. Is it the input pipeline, the model, or the optimizer?
Signals are scattered
GPU metrics, logs, and training metrics live in different places. Correlating them manually costs time.
TraceML: Training-Native Observability
Map system efficiency to model behavior in real time. See exactly where time and memory go during training.
Per-layer memory tracking
Parameters, activations, and gradients broken down by module. See which layers consume memory and catch growth before OOM.
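As a rough illustration of the idea (plain PyTorch hooks, not how TraceML does it internally), per-module memory can be approximated by summing parameter sizes and recording activation sizes from forward hooks on the `model` used above:

```python
import torch
import torch.nn as nn

def param_bytes(module: nn.Module) -> int:
    # Bytes held by this module's own parameters (children counted separately).
    return sum(p.numel() * p.element_size() for p in module.parameters(recurse=False))

activation_bytes = {}  # module name -> bytes of its most recent output

def make_hook(name):
    def hook(module, inputs, output):
        if isinstance(output, torch.Tensor):
            activation_bytes[name] = output.numel() * output.element_size()
    return hook

# Register a forward hook on every leaf module so each forward pass
# refreshes the per-layer activation sizes.
for name, module in model.named_modules():
    if not list(module.children()):
        module.register_forward_hook(make_hook(name))

# After a forward pass, the largest consumers are easy to rank, e.g.:
# sorted(activation_bytes.items(), key=lambda kv: kv[1], reverse=True)[:5]
```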
Step timing breakdown
Time spent in forward pass, backward pass, optimizer step, and dataloader separately. Identify bottlenecks instantly.
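As a sketch of what a phase breakdown means (generic PyTorch timing with explicit GPU synchronization; TraceML's own measurement may differ), each phase can be timed separately around the same loop shown at the top of the page:

```python
import time
import torch

def timed(fn):
    # Run fn and return (result, seconds), synchronizing so that
    # asynchronous CUDA work is included in the measurement.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    result = fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return result, time.perf_counter() - start

for batch in dataloader:
    outputs, t_fwd = timed(lambda: model(batch))
    loss = loss_fn(outputs, batch)
    _, t_bwd = timed(loss.backward)
    _, t_opt = timed(optimizer.step)
    optimizer.zero_grad(set_to_none=True)
    print(f"forward={t_fwd:.3f}s backward={t_bwd:.3f}s optimizer={t_opt:.3f}s")
```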
Dataloader monitoring
Track data fetch time per step. Know immediately if your dataloader is starving your GPU.
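The same idea applies to data fetch time. A hedged sketch (not TraceML's API): time each `next()` call on the dataloader iterator and compare it with the compute time of the step.

```python
import time
import torch

data_iter = iter(dataloader)
step = 0
while True:
    fetch_start = time.perf_counter()
    try:
        batch = next(data_iter)          # time spent waiting on the input pipeline
    except StopIteration:
        break
    t_fetch = time.perf_counter() - fetch_start

    compute_start = time.perf_counter()
    outputs = model(batch)
    loss = loss_fn(outputs, batch)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    if torch.cuda.is_available():
        torch.cuda.synchronize()         # include queued GPU work in compute time
    t_compute = time.perf_counter() - compute_start

    # A large fetch share means the GPU is starved, waiting on data.
    fetch_share = t_fetch / (t_fetch + t_compute)
    print(f"step {step}: fetch={t_fetch:.3f}s compute={t_compute:.3f}s ({fetch_share:.0%} data wait)")
    step += 1
```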
Terminal dashboard
Live CLI dashboard for SSH sessions. See training metrics update in real time without leaving your terminal.
Local web UI
Web dashboard at localhost:8765 with live charts and detailed breakdowns when you want visual analysis.
Lightweight overhead
Minimal performance impact on training. Designed to run during real jobs, not special profiling runs.
What TraceML is not
It is not a replacement for Nsight or deep kernel tracing. It is a practical view of training loops and step-aligned signals.
Status
Available now
Single-GPU usage with basic live signals for memory and timing, depending on your setup and hardware support.
In progress
Single-node multi-GPU support and richer step-phase breakdowns. These are active areas of work.
Planned
An offline replay tool for post-run log analysis, shaped by what early users ask for most.
How it compares
TraceML is training loop focused. Other tools still matter for their strengths.
| Capability | TraceML | PyTorch Profiler | NVIDIA Nsight | TensorBoard | W&B / Neptune |
|---|---|---|---|---|---|
| Live step-aligned view | ✓ | ✗ | ✗ | Depends | Metrics view |
| Training-loop-first design | ✓ | Profiler workflow | GPU expert tool | Logging first | Experiment tracking |
| Memory signals during run | ✓ | Possible | ✓ | If logged | If logged |
| Kernel-level traces | ✗ | ✓ | ✓ | ✗ | ✗ |
| Local first | ✓ | ✓ | ✓ | ✓ | Often hosted |
| Best fit | Daily training debugging | Performance deep dives | GPU profiling experts | Tracking and charts | Run management |
Get started
Try TraceML, report issues, or reach out