Profiler System¶
TeleFuser provides a three-layer progressive profiling system for performance analysis and debugging, built on PyTorch Profiler with additional features for distributed environments.
Features¶
- Three-layer progressive profiling - Stage timing → Kernel analysis → NCU deep dive
- Context manager and decorator - Flexible usage patterns
- Sync and async support - Works with both synchronous and asynchronous functions
- Distributed aware - Automatically handles multi-rank profiling
- Memory tracking - Peak memory allocation monitoring
- Stage I/O Signature capture - Record tensor shapes for isolated profiling
- Stage Bench Harness - Profile individual stages with mock inputs (avoids running 40+ DiT iterations)
- Auto-generated test scripts - Reproducible profiling without harness infrastructure
- Organized output directory - work_dirs/profiler_output/{pipeline_name}/{YYYYMMDD_HHMM}
Stage I/O Signature (Layer 1)¶
When profiling with TELEFUSER_PROFILE_DEBUG=true, the profiler automatically captures input and output tensor signatures for each stage. This enables Layer 2 isolated profiling without running the full pipeline.
Captured Information¶
For each stage, the signature includes:
- Input tensor shapes, dtypes, and devices
- Output tensor shapes (if applicable)
- Non-tensor parameters (int, float, str) as metadata
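As a rough illustration of what such a capture step might look like, the sketch below describes tensor-like inputs by shape, dtype, and device and keeps plain scalars as they are. The names `FakeTensor` and `capture_signature` are hypothetical and are not part of TeleFuser; the real profiler may record things differently.

```python
from dataclasses import dataclass
from typing import Any, Dict, Tuple


@dataclass
class FakeTensor:
    """Stand-in for a tensor: only the attributes this sketch reads."""
    shape: Tuple[int, ...]
    dtype: str
    device: str


def capture_signature(inputs: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: describe tensors, keep int/float/str values."""
    signature: Dict[str, Any] = {}
    for name, value in inputs.items():
        if hasattr(value, "shape") and hasattr(value, "dtype"):
            signature[name] = {
                "shape": list(value.shape),
                "dtype": str(value.dtype),
                "device": str(value.device),
            }
        elif isinstance(value, (int, float, str)):
            signature[name] = value
        # Anything else (callables, modules, ...) is skipped in this sketch.
    return signature


example = capture_signature({
    "latents": FakeTensor((1, 16, 21, 60, 104), "bfloat16", "cuda:0"),
    "cfg_scale": 5.0,
    "callback": print,  # not a tensor or scalar, so it is dropped
})
print(example)
```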
Output Location¶
Default output directory structure:
work_dirs/profiler_output/{TELEFUSER_PIPELINE_NAME}/{YYYYMMDD_HHMM}/
├── timing.json # Layer 1 stage timing report
├── timing_io_signature.json # I/O signatures for harness
├── denoise_trace.json.gz # Layer 2 Chrome trace (single iteration)
├── denoise_breakdown.json # Layer 2 top 50 kernels
└── profile_denoise.py # Auto-generated test script
Set the TELEFUSER_PIPELINE_NAME environment variable to organize outputs by pipeline.
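The directory layout above can be reproduced with a few lines of standard-library Python. This is only a sketch: `default_output_dir` is a hypothetical helper, and the precedence of TELEFUSER_PROFILER_OUTPUT_DIR over the default path is an assumption based on its description as an override.

```python
import os
from datetime import datetime
from pathlib import Path
from typing import Optional


def default_output_dir(now: Optional[datetime] = None) -> Path:
    """Hypothetical helper mirroring the documented directory layout."""
    override = os.environ.get("TELEFUSER_PROFILER_OUTPUT_DIR")
    if override:
        # Assumption: an explicit override replaces the whole default path.
        return Path(override)
    name = os.environ.get("TELEFUSER_PIPELINE_NAME", "default")
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M")
    return Path("work_dirs") / "profiler_output" / name / stamp


print(default_output_dir())
```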
Signature Format¶
{
  "request_id": "req_20260402_...",
  "timestamp": "2026-04-02T...",
  "stages": {
    "denoise": {
      "stage_name": "denoise",
      "input_signatures": {
        "latents": {"shape": [1, 16, 21, 60, 104], "dtype": "bfloat16", "device": "cuda:0"},
        "prompt_emb_posi": {"shape": [1, 512, 4096], "dtype": "bfloat16", "device": "cuda:0"},
        "cfg_scale": 5.0
      },
      "output_signature": {...},
      "metadata": {"num_inference_steps": 40, "sigma_shift": 8.0}
    }
  }
}
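A signature file in this format can be inspected with the standard `json` module. The sketch below uses an inline copy of the example (with the elided fields left out) and separates tensor descriptions from scalar inputs; `split_inputs` is a hypothetical helper, not a TeleFuser function.

```python
import json
import math

SIGNATURE_JSON = """
{
  "stages": {
    "denoise": {
      "stage_name": "denoise",
      "input_signatures": {
        "latents": {"shape": [1, 16, 21, 60, 104], "dtype": "bfloat16", "device": "cuda:0"},
        "prompt_emb_posi": {"shape": [1, 512, 4096], "dtype": "bfloat16", "device": "cuda:0"},
        "cfg_scale": 5.0
      },
      "metadata": {"num_inference_steps": 40, "sigma_shift": 8.0}
    }
  }
}
"""


def split_inputs(stage: dict):
    """Separate tensor descriptions (dicts with a shape) from scalar inputs."""
    tensors, scalars = {}, {}
    for name, spec in stage["input_signatures"].items():
        if isinstance(spec, dict) and "shape" in spec:
            tensors[name] = spec
        else:
            scalars[name] = spec
    return tensors, scalars


stage = json.loads(SIGNATURE_JSON)["stages"]["denoise"]
tensors, scalars = split_inputs(stage)
for name, spec in tensors.items():
    # math.prod gives the element count implied by the recorded shape.
    print(name, spec["shape"], spec["dtype"], math.prod(spec["shape"]), "elements")
print("scalars:", scalars)
```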
Stage Bench Harness (Layer 2)¶
The StageBenchHarness enables isolated profiling of individual stages using captured I/O signatures. This is especially useful for DiT models that iterate 40+ times, where full pipeline profiling creates massive trace files.
Benefits¶
| Aspect | Full Pipeline | Isolated Harness |
|---|---|---|
| Trace size | 100MB+ | <10MB |
| Iterations | 40+ (redundant) | 1 (clean) |
| Memory | Full pipeline | Stage only |
| Analysis | Hard to isolate | Clear view |
| Reproducibility | Manual setup | Auto-generated script |
Usage¶
from telefuser.utils.stage_bench_harness import StageBenchHarness, HarnessConfig
# Create harness from signature file
config = HarnessConfig(
    warmup=1,
    profile_steps=1,
    # output_dir defaults to work_dirs/profiler_output/{pipeline_name}/{date}
)
harness = StageBenchHarness.from_signature_file(
    signature_path="work_dirs/profiler_output/wan21_t2v/20260402/timing_io_signature.json",
    stage_name="denoise",
    stage_instance=pipeline.denoise_stage,  # Pass loaded stage
    config=config,
)
# Setup and profile
harness.setup()
results = harness.profile()
# Output files:
# - denoise_trace.json.gz (Chrome trace, single iteration)
# - denoise_breakdown.json (top 50 kernels)
# - profile_denoise.py (auto-generated test script)
Dynamic Single-Step Execution¶
For DiT stages with internal loops, the harness automatically creates a single-step function by detecting the dit and scheduler attributes. No modification to stage code is required.
The single-step logic extracts one iteration from the denoising loop:
1. Set up the scheduler with the minimum number of steps (2)
2. Take the first timestep
3. Run a single forward pass followed by a scheduler step
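The three steps can be sketched with toy stand-ins for the stage, the DiT, and the scheduler. Everything here (`ToyScheduler`, `ToyStage`, `make_single_step`, and the method names `set_timesteps` and `step`) is an assumption made for illustration; the harness's actual detection logic and the real scheduler interface may differ.

```python
class ToyScheduler:
    """Stand-in scheduler; real schedulers expose a richer interface."""

    def set_timesteps(self, num_steps):
        # Evenly spaced timesteps starting at 1000.
        self.timesteps = [1000 - i * (1000 // num_steps) for i in range(num_steps)]

    def step(self, noise_pred, timestep, latents):
        # Toy update rule: move the latents against the predicted noise.
        return [x - 0.1 * n for x, n in zip(latents, noise_pred)]


class ToyStage:
    """Stand-in stage exposing the two attributes the text mentions."""

    def __init__(self):
        self.scheduler = ToyScheduler()
        self.calls = 0

    def dit(self, latents, timestep):
        self.calls += 1
        return [0.5 * x for x in latents]


def make_single_step(stage):
    """Build a one-iteration function if the stage has dit and scheduler."""
    if not (hasattr(stage, "dit") and hasattr(stage, "scheduler")):
        return None

    def single_step(latents):
        stage.scheduler.set_timesteps(2)             # 1. minimal schedule
        timestep = stage.scheduler.timesteps[0]      # 2. first timestep
        noise_pred = stage.dit(latents, timestep)    # 3. one forward pass ...
        return stage.scheduler.step(noise_pred, timestep, latents)  # ... one scheduler step

    return single_step


stage = ToyStage()
step_fn = make_single_step(stage)
out = step_fn([1.0, 2.0])
print(out, "forward passes:", stage.calls)
```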
Kernel Breakdown Output¶
The breakdown JSON contains the top 50 kernels sorted by time, with no categorization:
{
  "name": "denoise",
  "total_kernel_time_ms": 150.0,
  "num_kernels": 200,
  "top_kernels": [
    {"name": "flash_attn_fwd", "ms": 75.0, "cuda_ms": 75.0, "cpu_ms": 0.5},
    {"name": "ampere_fp16_s1688gemm", "ms": 50.0, "cuda_ms": 50.0, "cpu_ms": 0.3},
    {"name": "fused_add_rms_norm", "ms": 10.0, "cuda_ms": 10.0, "cpu_ms": 0.1}
  ]
}
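Because the breakdown is plain JSON, a short script is enough to turn it into per-kernel shares of total kernel time. The sketch below works on an inline copy of the example above; `kernel_shares` is a hypothetical helper.

```python
import json

BREAKDOWN_JSON = """
{
  "name": "denoise",
  "total_kernel_time_ms": 150.0,
  "num_kernels": 200,
  "top_kernels": [
    {"name": "flash_attn_fwd", "ms": 75.0, "cuda_ms": 75.0, "cpu_ms": 0.5},
    {"name": "ampere_fp16_s1688gemm", "ms": 50.0, "cuda_ms": 50.0, "cpu_ms": 0.3},
    {"name": "fused_add_rms_norm", "ms": 10.0, "cuda_ms": 10.0, "cpu_ms": 0.1}
  ]
}
"""


def kernel_shares(breakdown: dict):
    """Return (name, cuda_ms, percent of total kernel time) per listed kernel."""
    total = breakdown["total_kernel_time_ms"]
    return [
        (k["name"], k["cuda_ms"], round(100.0 * k["cuda_ms"] / total, 1))
        for k in breakdown["top_kernels"]
    ]


shares = kernel_shares(json.loads(BREAKDOWN_JSON))
for name, cuda_ms, pct in shares:
    print(f"{name:28s} {cuda_ms:8.1f} ms {pct:5.1f} %")
```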
Generated Test Script¶
The harness generates a standalone Python script for reproducible profiling:
# profile_denoise.py - Generated by harness
# Contains:
# - Input tensor creation based on I/O signature
# - Single-step execution logic for DiT stages
# - Profiling function with warmup and trace export
Run the generated script with a loaded stage instance for standalone profiling.
Quick Start¶
Basic Usage¶
from telefuser.utils.profiler import ProfilingContext
# As context manager
with ProfilingContext("my_operation"):
    # Your code here
    result = model(input_data)
# As decorator
@ProfilingContext("my_function")
def process_data(data):
    return model(data)
# As async decorator
@ProfilingContext("async_operation")
async def process_async(data):
    return await model(data)
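For readers curious how one object can serve as a context manager, a decorator, and an async decorator, the following is a minimal standard-library sketch. `TimingContext` is a hypothetical stand-in that only measures wall-clock time; it is not the TeleFuser implementation, which also tracks memory and handles ranks.

```python
import asyncio
import functools
import inspect
import time


class TimingContext:
    """Hypothetical stand-in: times a block, a function, or a coroutine function."""

    def __init__(self, name):
        self.name = name
        self.elapsed_ms = None

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.elapsed_ms = (time.perf_counter() - self._start) * 1000.0
        return False  # never swallow exceptions

    def __call__(self, func):
        # Note: the instance is shared across calls, so this sketch is not re-entrant.
        if inspect.iscoroutinefunction(func):
            @functools.wraps(func)
            async def async_wrapper(*args, **kwargs):
                with self:
                    return await func(*args, **kwargs)
            return async_wrapper

        @functools.wraps(func)
        def sync_wrapper(*args, **kwargs):
            with self:
                return func(*args, **kwargs)
        return sync_wrapper


@TimingContext("double")
def double(x):
    return 2 * x


@TimingContext("async_double")
async def async_double(x):
    await asyncio.sleep(0)
    return 2 * x


sync_result = double(21)
async_result = asyncio.run(async_double(21))
```

Checking `inspect.iscoroutinefunction` at decoration time keeps the wrapped coroutine awaitable, so the timing covers the awaited body rather than just the creation of the coroutine object.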
Enable PyTorch Profiler¶
Set environment variables to enable detailed profiling:
# Enable profiler for specific names
export ENABLE_PROFILER_NAMES="vae_decode,text_encoding,dit_denoising"
# Set output directory for trace files
export TELEFUSER_PROFILER_OUTPUT_DIR="./profiler_output"
# Run your application
python your_script.py
Environment Variables¶
| Variable | Description | Default |
|---|---|---|
| TELEFUSER_PROFILE_DEBUG | Enable all debug profiling contexts (Layer 1+) | "false" |
| TELEFUSER_PIPELINE_NAME | Pipeline name for output directory | "default" |
| TELEFUSER_PROFILER_OUTPUT_DIR | Override output directory (default: work_dirs/profiler_output/{name}/{date}) | None |
| ENABLE_PROFILER_NAMES | Comma-separated stage names for torch.profiler (deprecated, use harness) | "" |
Default Output Directory: work_dirs/profiler_output/{TELEFUSER_PIPELINE_NAME}/{YYYYMMDD_HHMM}/
Quick Reference:
# Layer 1: Stage timing + I/O signature capture
export TELEFUSER_PROFILE_DEBUG=true
export TELEFUSER_PIPELINE_NAME="wan21_t2v" # Optional, for organized output
python examples/wan_video/wan21_1_3b_text_to_video_h100.py
# Output: work_dirs/profiler_output/wan21_t2v/20260402/timing.json
# Layer 2: Isolated stage profiling (recommended)
# Use StageBenchHarness programmatically with loaded stage
Controlling Profiler Activation¶
from telefuser.utils.profiler import (
enable_profiler_for_names,
set_profiler_output_dir,
set_pipeline_name,
get_profiler_output_dir,
)
# Programmatically set pipeline name
set_pipeline_name("wan21_t2v")
# Override output directory
set_profiler_output_dir("/path/to/traces")
# Get current output directory
output_dir = get_profiler_output_dir()
ProfilingContext vs ProfilingContext4Debug¶
ProfilingContext¶
Always active profiling context:
from telefuser.utils.profiler import ProfilingContext
@ProfilingContext("operation_name")
def process():
    # Always logs execution time and peak memory
    pass
ProfilingContext4Debug¶
Conditionally active based on TELEFUSER_PROFILE_DEBUG:
from telefuser.utils.profiler import ProfilingContext4Debug
@ProfilingContext4Debug("debug_operation")
def process():
    # Only profiles when TELEFUSER_PROFILE_DEBUG=true
    # Otherwise, no overhead
    pass
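One way to get "no overhead" behaviour is to decide at decoration time whether to wrap the function at all. The sketch below shows that idea with hypothetical names (`debug_profiling_enabled`, `debug_only`); whether TeleFuser checks the variable at decoration time or on each call, and which values it accepts besides "true", are assumptions here.

```python
import functools
import os


def debug_profiling_enabled():
    # Assumption: only the value "true" (case-insensitive) enables debug mode.
    return os.environ.get("TELEFUSER_PROFILE_DEBUG", "false").lower() == "true"


def debug_only(wrapper_factory):
    """Apply wrapper_factory(func) only when debug profiling is enabled."""
    def decorator(func):
        if not debug_profiling_enabled():
            return func  # unchanged function: nothing added per call
        return wrapper_factory(func)
    return decorator


def count_calls(func):
    """Toy wrapper standing in for real profiling work."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)
    wrapper.calls = 0
    return wrapper


def work(x):
    return x + 1


os.environ["TELEFUSER_PROFILE_DEBUG"] = "false"
plain = debug_only(count_calls)(work)   # returns work itself

os.environ["TELEFUSER_PROFILE_DEBUG"] = "true"
profiled = debug_only(count_calls)(work)  # returns the counting wrapper
profiled(1)
```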
Recommended usage in Stage:
from telefuser.core.base_stage import BaseStage, with_model_offload
from telefuser.utils.profiler import ProfilingContext4Debug
import torch
class MyStage(BaseStage):
    @with_model_offload(["model"])
    @ProfilingContext4Debug("my_stage_process")
    @torch.inference_mode()
    def process(self, input_data):
        # Profiling only active in debug mode
        return self.model(input_data)
Output¶
Console Logs¶
When using ProfilingContext, the execution time and peak memory of the wrapped code are logged.
When Layer 1 profiling is enabled:
[Profiler] Timing report saved to: work_dirs/profiler_output/wan21_t2v/20260402/timing.json
[Profiler] I/O signature saved to: work_dirs/profiler_output/wan21_t2v/20260402/timing_io_signature.json
When Layer 2 harness profiling is active:
[Harness] Setup complete for stage 'denoise'
[Harness] Running 1 warmup iteration(s)...
[Harness] Running 1 profile iteration(s)...
[Harness] Chrome trace saved to: work_dirs/.../denoise_trace.json.gz
[Harness] Kernel breakdown saved to: work_dirs/.../denoise_breakdown.json
[Harness] Test script saved to: work_dirs/.../profile_denoise.py
[Harness] Average iteration time: 150.00 ms
Chrome Trace Files¶
Chrome trace files can be visualized with:
- Chrome browser: chrome://tracing → Load the .json.gz file
- TensorBoard: tensorboard --logdir work_dirs/profiler_output
- Perfetto: https://ui.perfetto.dev/
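Traces can also be inspected without a viewer, since a gzipped Chrome trace is JSON. The sketch below writes a tiny synthetic trace and sums the duration of complete ("X") events per name. The sample events are made up for illustration, and `total_duration_by_name` is a hypothetical helper; real traces contain many more event types and fields.

```python
import gzip
import json
import os
import tempfile
from collections import defaultdict


def total_duration_by_name(trace_path):
    """Sum 'dur' (microseconds) of complete ('X') events per event name."""
    with gzip.open(trace_path, "rt", encoding="utf-8") as f:
        trace = json.load(f)
    # The Chrome trace format allows either an object with traceEvents or a bare array.
    events = trace["traceEvents"] if isinstance(trace, dict) else trace
    totals = defaultdict(float)
    for event in events:
        if event.get("ph") == "X":
            totals[event["name"]] += event.get("dur", 0)
    return dict(totals)


# Synthetic trace for demonstration only.
sample = {"traceEvents": [
    {"ph": "X", "name": "flash_attn_fwd", "ts": 0, "dur": 750},
    {"ph": "X", "name": "flash_attn_fwd", "ts": 1000, "dur": 250},
    {"ph": "M", "name": "process_name"},  # metadata event, ignored
]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample_trace.json.gz")
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(sample, f)
    totals = total_duration_by_name(path)

print(totals)
```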
Parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| name | str | Required | Profiler name for identification |
| reset_peak_memory | bool | True | Reset peak memory stats before profiling |
# Custom memory tracking behavior
with ProfilingContext("operation", reset_peak_memory=False):
    # Peak memory not reset - captures accumulated peak
    pass
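The effect of resetting the peak can be seen with a CPU-only analogy using Python's `tracemalloc` (Python 3.9+ for `reset_peak`). This is not how the profiler measures device memory; it only illustrates why a reset isolates one operation's peak while skipping the reset reports the accumulated peak. `peak_bytes` is a hypothetical helper.

```python
import tracemalloc


def peak_bytes(reset_peak):
    """Traced peak around a small allocation that follows a large one."""
    tracemalloc.start()
    try:
        big = bytearray(10_000_000)    # earlier operation with a high peak
        del big
        if reset_peak:
            tracemalloc.reset_peak()   # analogous to reset_peak_memory=True
        small = bytearray(1_000_000)   # the operation being measured
        _, peak = tracemalloc.get_traced_memory()
        del small
        return peak
    finally:
        tracemalloc.stop()


isolated = peak_bytes(reset_peak=True)      # reflects only the small allocation
accumulated = peak_bytes(reset_peak=False)  # still includes the earlier large one
print(isolated, accumulated)
```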
Distributed Support¶
The profiler automatically handles distributed environments:
# In distributed setting (e.g., 2 GPUs)
with ProfilingContext("distributed_op"):
    # Rank 0 logs: "Rank 0 - Function 'distributed_op' Peak Memory: 4.50 GB"
    # Rank 1 logs: "Rank 1 - Function 'distributed_op' Peak Memory: 4.50 GB"
    pass
Trace files include rank information.
Platform Support¶
The profiler supports multiple hardware platforms:
| Platform | Profiler Activity |
|---|---|
| CUDA (NVIDIA) | torch.profiler.ProfilerActivity.CUDA |
| XPU (Intel) | torch.profiler.ProfilerActivity.XPU |
| NPU (Huawei) | torch.profiler.ProfilerActivity.PrivateUse1 |
| CPU | torch.profiler.ProfilerActivity.CPU (always) |
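The table above can be expressed as a small lookup. The helper names below are hypothetical, and the device-type strings ("cuda", "xpu", "npu") are taken from the troubleshooting section. The `resolve_activities` function needs PyTorch and assumes a version whose `ProfilerActivity` enum has the listed members.

```python
# Device type -> ProfilerActivity member name, following the table above.
DEVICE_ACTIVITY_NAMES = {"cuda": "CUDA", "xpu": "XPU", "npu": "PrivateUse1"}


def activity_names(device_type):
    """CPU is always included; the accelerator activity is added if known."""
    names = ["CPU"]
    accelerator = DEVICE_ACTIVITY_NAMES.get(device_type)
    if accelerator is not None:
        names.append(accelerator)
    return names


def resolve_activities(device_type):
    """Turn the names into torch.profiler.ProfilerActivity members."""
    from torch.profiler import ProfilerActivity  # imported lazily; requires PyTorch
    return [getattr(ProfilerActivity, name) for name in activity_names(device_type)]


print(activity_names("cuda"))
print(activity_names("npu"))
```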
Integration in Stages¶
Typical Usage Pattern¶
from telefuser.core.base_stage import BaseStage, with_model_offload
from telefuser.utils.profiler import ProfilingContext4Debug
import torch
class VAEDecodeStage(BaseStage):
    def __init__(self, name, module_manager, model_runtime_config):
        super().__init__(name, model_runtime_config)
        self.vae = module_manager.fetch_module("vae")
        self.model_names = ["vae"]
    @with_model_offload(["vae"])
    @ProfilingContext4Debug("vae_decode")
    @torch.inference_mode()
    def process(self, latents):
        with torch.autocast(device_type=self.device_type, dtype=self.torch_dtype):
            return self.vae.decode(latents)
Profiling Multiple Operations¶
class TextEncodingStage(BaseStage):
    @with_model_offload(["text_encoder"])
    @ProfilingContext4Debug("text_encoding")
    @torch.inference_mode()
    def encode_text(self, prompts):
        # Overall encoding profiled
        with ProfilingContext4Debug("tokenization"):
            tokens = self.tokenizer(prompts)
        with ProfilingContext4Debug("embedding"):
            embeddings = self.text_encoder(tokens)
        return embeddings
Best Practices¶
1. Use Meaningful Names¶
# Good - descriptive and unique
@ProfilingContext4Debug("vae_decode_video")
@ProfilingContext4Debug("dit_denoising_step_0")
# Avoid - generic or duplicate
@ProfilingContext4Debug("process")
@ProfilingContext4Debug("model")
2. Use ProfilingContext4Debug in Stages¶
# Recommended - no overhead in production
@ProfilingContext4Debug("stage_name")
def process(self, data):
    pass
# Avoid in production code - always active
@ProfilingContext("stage_name")
def process(self, data):
    pass
3. Combine with Other Decorators¶
Order matters - profiler should wrap the actual computation:
@with_model_offload(["model"]) # Outer: handles model loading
@ProfilingContext4Debug("process") # Middle: profiles computation
@torch.inference_mode() # Inner: disables gradients
def process(self, data):
    return self.model(data)
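To see why the order matters, the sketch below stacks three toy decorators in the same positions and records when each one enters and exits. The `tracer` decorator is purely illustrative; it stands in for the offload, profiling, and inference-mode wrappers without reproducing what they do.

```python
import functools

calls = []


def tracer(label):
    """Toy decorator that records entry and exit around the wrapped call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            calls.append(f"{label}:enter")
            try:
                return func(*args, **kwargs)
            finally:
                calls.append(f"{label}:exit")
        return wrapper
    return decorator


@tracer("offload")         # outer: runs first, finishes last
@tracer("profile")         # middle: measured span excludes the outer wrapper's work
@tracer("inference_mode")  # inner: closest to the computation
def process(data):
    calls.append("body")
    return data * 2


result = process(21)
print(calls)
```

With this stacking, the profiling wrapper starts after the outer wrapper has entered and stops before it exits, so whatever the outer wrapper does before and after the call falls outside the measured span.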
4. Enable Specifically¶
# Enable only what you need
export ENABLE_PROFILER_NAMES="vae_decode"
# Avoid enabling everything (large trace files)
export ENABLE_PROFILER_NAMES="*" # Not recommended
5. Use reset_peak_memory Appropriately¶
# Reset for each independent operation
with ProfilingContext("independent_op", reset_peak_memory=True):
    pass
# Don't reset when tracking accumulated memory
with ProfilingContext("sequence_op", reset_peak_memory=False):
    pass
Troubleshooting¶
Large Trace Files¶
Use the isolated Stage Bench Harness instead of full pipeline profiling:
# Instead of full pipeline (creates 100MB+ traces)
# Use harness for single iteration profiling
from telefuser.utils.stage_bench_harness import StageBenchHarness
harness = StageBenchHarness.from_signature_file(
    signature_path="timing_io_signature.json",
    stage_name="denoise",
    stage_instance=pipeline.denoise_stage,
)
harness.setup()
harness.profile()  # Creates <10MB trace
Missing GPU Activity¶
If GPU activity is not recorded:
- Verify platform is supported (CUDA, XPU, NPU)
- Check CUDA synchronization is working
from telefuser.platforms import current_platform
print(current_platform.device_type) # Should be "cuda", "xpu", or "npu"
Memory Stats Not Accurate¶
Ensure synchronization before profiling:
# Profiler automatically syncs, but manual sync for custom timing
from telefuser.platforms import current_platform
current_platform.synchronize()
with ProfilingContext("operation"):
    pass
Stage Instance Not Available for CLI¶
CLI mode cannot execute stage profiling without a loaded model. Use the programmatic approach with a loaded stage instance:
# Load pipeline first
from my_pipeline import get_pipeline
from telefuser.utils.stage_bench_harness import StageBenchHarness
pipeline = get_pipeline()
# Then use harness
harness = StageBenchHarness.from_signature_file(
    signature_path="timing_io_signature.json",
    stage_name="denoise",
    stage_instance=pipeline.denoise_stage,
)
Related Documentation¶
- Adding New Stage - Stage development with profiler integration
- Metrics - Production monitoring and observability
- Logging - Logging configuration and usage
- Configuration - Runtime configuration options