Profiler System¶

TeleFuser provides a three-layer progressive profiling system for performance analysis and debugging, built on PyTorch Profiler with additional features for distributed environments.

Features¶

Three-layer progressive profiling - Stage timing → Kernel analysis → NCU deep dive
Context manager and decorator - Flexible usage patterns
Sync and async support - Works with both synchronous and asynchronous functions
Distributed aware - Automatically handles multi-rank profiling
Memory tracking - Peak memory allocation monitoring
Stage I/O Signature capture - Record tensor shapes for isolated profiling
Stage Bench Harness - Profile individual stages with mock inputs, avoiding the 40+ DiT iterations of a full run
Auto-generated test scripts - Reproducible profiling without harness infrastructure
Organized output directory - work_dirs/profiler_output/{pipeline_name}/{YYYYMMDD_HHMM}

Stage I/O Signature (Layer 1)¶

When profiling with TELEFUSER_PROFILE_DEBUG=true, the profiler automatically captures input and output tensor signatures for each stage. This enables Layer 2 isolated profiling without running the full pipeline.

Captured Information¶

For each stage, the signature includes:

- Input tensor shapes, dtypes, and devices
- Output tensor shapes (if applicable)
- Non-tensor parameters (int, float, str) as metadata

Output Location¶

Default output directory structure:

work_dirs/profiler_output/{TELEFUSER_PIPELINE_NAME}/{YYYYMMDD_HHMM}/
├── timing.json                    # Layer 1 stage timing report
├── timing_io_signature.json       # I/O signatures for harness
├── denoise_trace.json.gz          # Layer 2 Chrome trace (single iteration)
├── denoise_breakdown.json         # Layer 2 top 50 kernels
└── profile_denoise.py             # Auto-generated test script

Set the TELEFUSER_PIPELINE_NAME environment variable to organize outputs by pipeline.

Signature Format¶

{
  "request_id": "req_20260402_...",
  "timestamp": "2026-04-02T...",
  "stages": {
    "denoise": {
      "stage_name": "denoise",
      "input_signatures": {
        "latents": {"shape": [1, 16, 21, 60, 104], "dtype": "bfloat16", "device": "cuda:0"},
        "prompt_emb_posi": {"shape": [1, 512, 4096], "dtype": "bfloat16", "device": "cuda:0"},
        "cfg_scale": 5.0
      },
      "output_signature": {...},
      "metadata": {"num_inference_steps": 40, "sigma_shift": 8.0}
    }
  }
}
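
A signature file can be inspected before it is handed to the harness. The following is a minimal sketch that assumes the JSON layout shown above; `summarize_inputs` and the `DTYPE_BYTES` table are hypothetical helpers for illustration, not part of TeleFuser. It separates tensor entries from scalar parameters and estimates how much memory the mock inputs would need.

```python
import json
from math import prod

# Sample mirroring the documented format (output_signature left out).
SIGNATURE_JSON = """
{
  "stages": {
    "denoise": {
      "stage_name": "denoise",
      "input_signatures": {
        "latents": {"shape": [1, 16, 21, 60, 104], "dtype": "bfloat16", "device": "cuda:0"},
        "prompt_emb_posi": {"shape": [1, 512, 4096], "dtype": "bfloat16", "device": "cuda:0"},
        "cfg_scale": 5.0
      },
      "metadata": {"num_inference_steps": 40, "sigma_shift": 8.0}
    }
  }
}
"""

# Bytes per element for a few common dtypes; extend as needed (assumption).
DTYPE_BYTES = {"bfloat16": 2, "float16": 2, "float32": 4}


def summarize_inputs(signature: dict, stage_name: str) -> dict:
    """Split a stage's inputs into tensor descriptions and scalar parameters."""
    inputs = signature["stages"][stage_name]["input_signatures"]
    tensors, scalars = {}, {}
    for name, value in inputs.items():
        if isinstance(value, dict) and "shape" in value:
            numel = prod(value["shape"])
            tensors[name] = {
                "numel": numel,
                "bytes": numel * DTYPE_BYTES[value["dtype"]],
            }
        else:
            # Plain int/float/str values are kept as-is.
            scalars[name] = value
    return {"tensors": tensors, "scalars": scalars}


summary = summarize_inputs(json.loads(SIGNATURE_JSON), "denoise")
for name, info in summary["tensors"].items():
    print(f"{name}: {info['numel']} elements, {info['bytes'] / 2**20:.2f} MiB")
print("scalars:", summary["scalars"])
```

For the sample above, both tensors come to roughly 4 MiB each in bfloat16, which gives a quick sense of why a single isolated stage is cheap to set up.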

Stage Bench Harness (Layer 2)¶

The StageBenchHarness enables isolated profiling of individual stages using captured I/O signatures. This is especially useful for DiT models that iterate 40+ times, where full pipeline profiling creates massive trace files.

Benefits¶

Aspect	Full Pipeline	Isolated Harness
Trace size	100MB+	<10MB
Iterations	40+ (redundant)	1 (clean)
Memory	Full pipeline	Stage only
Analysis	Hard to isolate	Clear view
Reproducibility	Manual setup	Auto-generated script

Usage¶

from telefuser.utils.stage_bench_harness import StageBenchHarness, HarnessConfig

# Create harness from signature file
config = HarnessConfig(
    warmup=1,
    profile_steps=1,
    # output_dir defaults to work_dirs/profiler_output/{pipeline_name}/{YYYYMMDD_HHMM}
)

harness = StageBenchHarness.from_signature_file(
    signature_path="work_dirs/profiler_output/wan21_t2v/20260402/timing_io_signature.json",
    stage_name="denoise",
    stage_instance=pipeline.denoise_stage,  # Pass loaded stage
    config=config,
)

# Setup and profile
harness.setup()
results = harness.profile()

# Output files:
# - denoise_trace.json.gz (Chrome trace, single iteration)
# - denoise_breakdown.json (top 50 kernels)
# - profile_denoise.py (auto-generated test script)

Dynamic Single-Step Execution¶

For DiT stages with internal loops, the harness automatically creates a single-step function by detecting the dit and scheduler attributes. No modification to stage code is required.

The single-step logic extracts one iteration from the denoising loop:

1. Set up the scheduler with minimal steps (2)
2. Take the first timestep
3. Run a single forward pass + scheduler step
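
The single-step idea can be sketched with toy stand-ins. Everything below is hypothetical: `ToyScheduler`, `ToyStage`, `make_single_step`, and the `set_timesteps` / `timesteps` / `step` method names are assumptions chosen for illustration, and the harness's real detection and call signatures may differ. The sketch only shows the shape of the logic: detect `dit` and `scheduler`, then run exactly one forward pass and one scheduler step.

```python
class ToyScheduler:
    """Stand-in scheduler; real schedulers carry much richer state."""

    def __init__(self):
        self.timesteps = []
        self.step_calls = 0

    def set_timesteps(self, num_steps):
        # Evenly spaced timesteps counting down from 1000.
        self.timesteps = [1000 - i * (1000 // num_steps) for i in range(num_steps)]

    def step(self, model_output, timestep, sample):
        self.step_calls += 1
        return [s - 0.1 * m for s, m in zip(sample, model_output)]


class ToyStage:
    """Stand-in stage exposing the two attributes the harness looks for."""

    def __init__(self):
        self.scheduler = ToyScheduler()
        self.dit_calls = 0

    def dit(self, latents, timestep):
        self.dit_calls += 1
        return [x * 0.5 for x in latents]


def make_single_step(stage):
    """Return a one-iteration callable, or None if the stage is not DiT-like."""
    if not (hasattr(stage, "dit") and hasattr(stage, "scheduler")):
        return None

    def single_step(latents):
        stage.scheduler.set_timesteps(2)           # 1. minimal schedule
        t = stage.scheduler.timesteps[0]           # 2. first timestep
        noise_pred = stage.dit(latents, t)         # 3a. single forward pass
        return stage.scheduler.step(noise_pred, t, latents)  # 3b. scheduler step

    return single_step


stage = ToyStage()
step_fn = make_single_step(stage)
out = step_fn([1.0, 2.0])
print(out, stage.dit_calls, stage.scheduler.step_calls)
```

Because the wrapper is built from the outside, the stage class itself stays untouched, which matches the "no modification to stage code" goal.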

Kernel Breakdown Output¶

The breakdown JSON lists the top 50 kernels sorted by time, without any categorization:

{
  "name": "denoise",
  "total_kernel_time_ms": 150.0,
  "num_kernels": 200,
  "top_kernels": [
    {"name": "flash_attn_fwd", "ms": 75.0, "cuda_ms": 75.0, "cpu_ms": 0.5},
    {"name": "ampere_fp16_s1688gemm", "ms": 50.0, "cuda_ms": 50.0, "cpu_ms": 0.3},
    {"name": "fused_add_rms_norm", "ms": 10.0, "cuda_ms": 10.0, "cpu_ms": 0.1}
  ]
}
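
Since the breakdown is plain JSON, a few lines of Python are enough to turn it into percentages. This is a minimal sketch that assumes the field names shown above; `kernel_shares` is a hypothetical helper, not something the harness provides.

```python
import json

# Sample copied from the documented breakdown format.
BREAKDOWN_JSON = """
{
  "name": "denoise",
  "total_kernel_time_ms": 150.0,
  "num_kernels": 200,
  "top_kernels": [
    {"name": "flash_attn_fwd", "ms": 75.0, "cuda_ms": 75.0, "cpu_ms": 0.5},
    {"name": "ampere_fp16_s1688gemm", "ms": 50.0, "cuda_ms": 50.0, "cpu_ms": 0.3},
    {"name": "fused_add_rms_norm", "ms": 10.0, "cuda_ms": 10.0, "cpu_ms": 0.1}
  ]
}
"""


def kernel_shares(breakdown: dict) -> list:
    """Return (name, cuda_ms, percent_of_total) for each listed kernel."""
    total = breakdown["total_kernel_time_ms"]
    return [
        (k["name"], k["cuda_ms"], 100.0 * k["cuda_ms"] / total)
        for k in breakdown["top_kernels"]
    ]


shares = kernel_shares(json.loads(BREAKDOWN_JSON))
for name, cuda_ms, pct in shares:
    print(f"{name:<24} {cuda_ms:7.1f} ms  {pct:5.1f}%")

# Share of total kernel time explained by the listed kernels.
covered = sum(pct for _, _, pct in shares)
print(f"listed kernels cover {covered:.1f}% of kernel time")
```

A low coverage figure would suggest that time is spread across many small kernels rather than dominated by a few.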

Generated Test Script¶

The harness generates a standalone Python script for reproducible profiling:

# profile_denoise.py - Generated by harness
# Contains:
# - Input tensor creation based on I/O signature
# - Single-step execution logic for DiT stages
# - Profiling function with warmup and trace export

Run the generated script with a loaded stage instance for standalone profiling.

Quick Start¶

Basic Usage¶

from telefuser.utils.profiler import ProfilingContext

# As context manager
with ProfilingContext("my_operation"):
    # Your code here
    result = model(input_data)

# As decorator
@ProfilingContext("my_function")
def process_data(data):
    return model(data)

# As async decorator
@ProfilingContext("async_operation")
async def process_async(data):
    return await model(data)
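
For readers curious how one object can serve as a context manager, a decorator, and an async decorator, the pattern can be sketched with the standard library alone. `TimedContext` below is an illustrative stand-in written for this page; it is not TeleFuser's `ProfilingContext` and records only wall-clock time.

```python
import asyncio
import functools
import inspect
import time


class TimedContext:
    """Illustrative timer usable as a context manager or a decorator."""

    records = []  # (name, elapsed_seconds) pairs, shared for the demo

    def __init__(self, name):
        self.name = name

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, tb):
        TimedContext.records.append((self.name, time.perf_counter() - self._start))
        return False  # never swallow exceptions

    def __call__(self, func):
        # Pick the wrapper flavor that matches the decorated function.
        if inspect.iscoroutinefunction(func):
            @functools.wraps(func)
            async def async_wrapper(*args, **kwargs):
                with TimedContext(self.name):
                    return await func(*args, **kwargs)
            return async_wrapper

        @functools.wraps(func)
        def sync_wrapper(*args, **kwargs):
            with TimedContext(self.name):
                return func(*args, **kwargs)
        return sync_wrapper


@TimedContext("sync_op")
def double(x):
    return 2 * x


@TimedContext("async_op")
async def double_async(x):
    await asyncio.sleep(0)
    return 2 * x


with TimedContext("block"):
    total = sum(range(1000))

results = (double(21), asyncio.run(double_async(21)))
names = [name for name, _ in TimedContext.records]
print(results, names)
```

Each call creates a fresh context inside the wrapper, so overlapping or recursive calls do not share a start time.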

Enable PyTorch Profiler¶

Set environment variables to enable detailed profiling:

# Enable torch.profiler for specific names (deprecated; prefer the Stage Bench Harness)
export ENABLE_PROFILER_NAMES="vae_decode,text_encoding,dit_denoising"

# Override the output directory for trace files
export TELEFUSER_PROFILER_OUTPUT_DIR="./profiler_output"

# Run your application
python your_script.py

Environment Variables¶

Variable	Description	Default
`TELEFUSER_PROFILE_DEBUG`	Enable all debug profiling contexts (Layer 1+)	"false"
`TELEFUSER_PIPELINE_NAME`	Pipeline name for output directory	"default"
`TELEFUSER_PROFILER_OUTPUT_DIR`	Override output directory (default: work_dirs/profiler_output/{TELEFUSER_PIPELINE_NAME}/{YYYYMMDD_HHMM})	None
`ENABLE_PROFILER_NAMES`	Comma-separated stage names for torch.profiler (deprecated, use harness)	""

Default Output Directory:

work_dirs/profiler_output/{TELEFUSER_PIPELINE_NAME}/{YYYYMMDD_HHMM}/
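
The way these variables combine into a path can be sketched as follows. This is an illustration of the documented layout only: `resolve_output_dir` is a hypothetical helper, and the assumption that the override takes precedence over the default layout is inferred from the table above rather than from the library source.

```python
import os
from datetime import datetime
from pathlib import Path
from typing import Mapping, Optional


def resolve_output_dir(
    env: Mapping[str, str] = os.environ,
    now: Optional[datetime] = None,
) -> Path:
    """Hypothetical helper mirroring the documented directory layout."""
    # Assumption: an explicit override wins over the default layout.
    override = env.get("TELEFUSER_PROFILER_OUTPUT_DIR")
    if override:
        return Path(override)
    # The documented default pipeline name is "default".
    pipeline_name = env.get("TELEFUSER_PIPELINE_NAME", "default")
    # {YYYYMMDD_HHMM} timestamp component.
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M")
    return Path("work_dirs") / "profiler_output" / pipeline_name / stamp


example = resolve_output_dir(
    {"TELEFUSER_PIPELINE_NAME": "wan21_t2v"},
    now=datetime(2026, 4, 2, 14, 30),
)
print(example.as_posix())
```

Passing the environment and clock in as arguments keeps the sketch easy to test without touching real environment variables.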

Quick Reference:

# Layer 1: Stage timing + I/O signature capture
export TELEFUSER_PROFILE_DEBUG=true
export TELEFUSER_PIPELINE_NAME="wan21_t2v"  # Optional, for organized output
python examples/wan_video/wan21_1_3b_text_to_video_h100.py
# Output: work_dirs/profiler_output/wan21_t2v/20260402/timing.json

# Layer 2: Isolated stage profiling (recommended)
# Use StageBenchHarness programmatically with loaded stage

Controlling Profiler Activation¶

from telefuser.utils.profiler import (
    enable_profiler_for_names,
    set_profiler_output_dir,
    set_pipeline_name,
    get_profiler_output_dir,
)

# Programmatically set pipeline name
set_pipeline_name("wan21_t2v")

# Override output directory
set_profiler_output_dir("/path/to/traces")

# Get current output directory
output_dir = get_profiler_output_dir()

ProfilingContext vs ProfilingContext4Debug¶

ProfilingContext¶

Always-active profiling context:

from telefuser.utils.profiler import ProfilingContext

@ProfilingContext("operation_name")
def process():
    # Always logs execution time and peak memory
    pass

ProfilingContext4Debug¶

Conditionally active based on TELEFUSER_PROFILE_DEBUG:

from telefuser.utils.profiler import ProfilingContext4Debug

@ProfilingContext4Debug("debug_operation")
def process():
    # Only profiles when TELEFUSER_PROFILE_DEBUG=true
    # Otherwise, no overhead
    pass
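
The gating idea behind a debug-only context can be sketched without TeleFuser. The helpers below (`debug_profiling_enabled`, `maybe_profile`) are hypothetical, and the rule that only a case-insensitive "true" enables profiling is an assumption; the library may accept other spellings.

```python
import os
from contextlib import contextmanager, nullcontext
from typing import Callable, Mapping


def debug_profiling_enabled(env: Mapping[str, str] = os.environ) -> bool:
    # Assumption: only a case-insensitive "true" turns debug profiling on.
    return env.get("TELEFUSER_PROFILE_DEBUG", "false").strip().lower() == "true"


def maybe_profile(name: str, factory: Callable, env: Mapping[str, str] = os.environ):
    """Return factory(name) in debug mode, otherwise a do-nothing context."""
    return factory(name) if debug_profiling_enabled(env) else nullcontext()


entered = []


@contextmanager
def recording_context(name):
    # Stand-in for a real profiling context; it just records that it ran.
    entered.append(name)
    yield


with maybe_profile("stage_a", recording_context, env={"TELEFUSER_PROFILE_DEBUG": "true"}):
    pass
with maybe_profile("stage_b", recording_context, env={}):
    pass

print(entered)
```

When the flag is off, the profiling context is never constructed, which is what keeps the disabled path close to free.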

Recommended usage in Stage:

import torch

from telefuser.core.base_stage import BaseStage, with_model_offload
from telefuser.utils.profiler import ProfilingContext4Debug

class MyStage(BaseStage):
    @with_model_offload(["model"])
    @ProfilingContext4Debug("my_stage_process")
    @torch.inference_mode()
    def process(self, input_data):
        # Profiling only active in debug mode
        return self.model(input_data)

Output¶

Console Logs¶

When using ProfilingContext, the following information is logged:

[Profile] my_operation cost 0.123456 seconds
Rank 0 - Function 'my_operation' Peak Memory: 4.50 GB

When Layer 1 profiling is enabled:

[Profiler] Timing report saved to: work_dirs/profiler_output/wan21_t2v/20260402/timing.json
[Profiler] I/O signature saved to: work_dirs/profiler_output/wan21_t2v/20260402/timing_io_signature.json

When Layer 2 harness profiling is active:

[Harness] Setup complete for stage 'denoise'
[Harness] Running 1 warmup iteration(s)...
[Harness] Running 1 profile iteration(s)...
[Harness] Chrome trace saved to: work_dirs/.../denoise_trace.json.gz
[Harness] Kernel breakdown saved to: work_dirs/.../denoise_breakdown.json
[Harness] Test script saved to: work_dirs/.../profile_denoise.py
[Harness] Average iteration time: 150.00 ms

Chrome Trace Files¶

Chrome trace files can be visualized with any of the following tools:

Chrome browser: chrome://tracing → Load the .json.gz file
TensorBoard: tensorboard --logdir work_dirs/profiler_output
Perfetto: https://ui.perfetto.dev/
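
Traces can also be inspected programmatically. The sketch below assumes the file follows the Chrome Trace Event Format, in which complete events carry `"ph": "X"` and a `dur` in microseconds; `total_duration_by_name` is a hypothetical helper, and the trace it reads is synthetic so the example runs without a real profile.

```python
import gzip
import json
import os
import tempfile
from collections import defaultdict


def total_duration_by_name(trace_path: str) -> dict:
    """Sum 'dur' (microseconds) of complete events ('ph' == 'X') per name."""
    with gzip.open(trace_path, "rt", encoding="utf-8") as f:
        trace = json.load(f)
    # The format allows an object with "traceEvents" or a bare event list.
    events = trace["traceEvents"] if isinstance(trace, dict) else trace
    totals = defaultdict(float)
    for event in events:
        if event.get("ph") == "X":
            totals[event["name"]] += event.get("dur", 0)
    return dict(totals)


# Tiny synthetic trace standing in for denoise_trace.json.gz.
sample = {"traceEvents": [
    {"ph": "X", "name": "flash_attn_fwd", "ts": 0, "dur": 700},
    {"ph": "X", "name": "flash_attn_fwd", "ts": 900, "dur": 300},
    {"ph": "X", "name": "gemm", "ts": 1300, "dur": 500},
    {"ph": "M", "name": "process_name"},  # metadata event, ignored
]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "denoise_trace.json.gz")
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(sample, f)
    totals = total_duration_by_name(path)

print(totals)
```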

Parameters¶

Parameter	Type	Default	Description
`name`	str	Required	Profiler name for identification
`reset_peak_memory`	bool	True	Reset peak memory stats before profiling

# Custom memory tracking behavior
with ProfilingContext("operation", reset_peak_memory=False):
    # Peak memory not reset - captures accumulated peak
    pass
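
The effect of resetting the peak can be demonstrated on the CPU with Python's `tracemalloc`, used here purely as a stand-in for GPU memory statistics (on CUDA, the analogous PyTorch calls are `torch.cuda.reset_peak_memory_stats()` and `torch.cuda.max_memory_allocated()`). `measure_peak` is a hypothetical helper, and `tracemalloc.reset_peak()` requires Python 3.9 or newer.

```python
import tracemalloc


def measure_peak(reset_peak: bool) -> int:
    """Return the peak traced bytes reported after a small allocation."""
    tracemalloc.start()
    try:
        big = bytearray(8_000_000)    # earlier, larger allocation
        del big                       # freed before the measured region
        if reset_peak:
            tracemalloc.reset_peak()  # forget the earlier high-water mark
        small = bytearray(1_000_000)  # the region being measured
        _, peak = tracemalloc.get_traced_memory()
        del small
        return peak
    finally:
        tracemalloc.stop()


isolated = measure_peak(reset_peak=True)
accumulated = measure_peak(reset_peak=False)
print(f"with reset:    {isolated / 1e6:.1f} MB")
print(f"without reset: {accumulated / 1e6:.1f} MB")
```

With the reset, the reported peak reflects only the measured region; without it, the earlier 8 MB allocation still dominates the figure even though it has been freed.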

Distributed Support¶

The profiler automatically handles distributed environments:

# In distributed setting (e.g., 2 GPUs)
with ProfilingContext("distributed_op"):
    # Rank 0 logs: "Rank 0 - Function 'distributed_op' Peak Memory: 4.50 GB"
    # Rank 1 logs: "Rank 1 - Function 'distributed_op' Peak Memory: 4.50 GB"
    pass

Trace files include rank information:

profiler_output/
├── operation_rank0_run1.json
└── operation_rank1_run1.json
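
A naming scheme like this can be reproduced in a few lines. The sketch below is an assumption about how such names could be built, not the profiler's actual code: `current_rank` and `trace_filename` are hypothetical helpers, and the rank is read from the `RANK` environment variable that `torchrun` sets for each worker.

```python
import os
from typing import Mapping, Optional


def current_rank(env: Mapping[str, str] = os.environ) -> int:
    # torchrun exports RANK per worker; fall back to 0 for single-process runs.
    return int(env.get("RANK", "0"))


def trace_filename(
    name: str,
    run: int,
    rank: Optional[int] = None,
    env: Mapping[str, str] = os.environ,
) -> str:
    """Build a per-rank trace file name in the {name}_rank{R}_run{N}.json style."""
    rank = current_rank(env) if rank is None else rank
    return f"{name}_rank{rank}_run{run}.json"


print(trace_filename("operation", run=1, env={"RANK": "0"}))
print(trace_filename("operation", run=1, env={"RANK": "1"}))
```

Embedding the rank in the file name means workers never overwrite each other's traces when they share an output directory.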

Platform Support¶

The profiler supports multiple hardware platforms:

Platform	Profiler Activity
CUDA (NVIDIA)	`torch.profiler.ProfilerActivity.CUDA`
XPU (Intel)	`torch.profiler.ProfilerActivity.XPU`
NPU (Huawei)	`torch.profiler.ProfilerActivity.PrivateUse1`
CPU	`torch.profiler.ProfilerActivity.CPU` (always)

Integration in Stages¶

Typical Usage Pattern¶

from telefuser.core.base_stage import BaseStage, with_model_offload
from telefuser.utils.profiler import ProfilingContext4Debug
import torch

class VAEDecodeStage(BaseStage):
    def __init__(self, name, module_manager, model_runtime_config):
        super().__init__(name, model_runtime_config)
        self.vae = module_manager.fetch_module("vae")
        self.model_names = ["vae"]

    @with_model_offload(["vae"])
    @ProfilingContext4Debug("vae_decode")
    @torch.inference_mode()
    def process(self, latents):
        with torch.autocast(device_type=self.device_type, dtype=self.torch_dtype):
            return self.vae.decode(latents)

Profiling Multiple Operations¶

import torch

from telefuser.core.base_stage import BaseStage, with_model_offload
from telefuser.utils.profiler import ProfilingContext4Debug

class TextEncodingStage(BaseStage):
    @with_model_offload(["text_encoder"])
    @ProfilingContext4Debug("text_encoding")
    @torch.inference_mode()
    def encode_text(self, prompts):
        # The decorator profiles the whole call; nested contexts time each phase
        with ProfilingContext4Debug("tokenization"):
            tokens = self.tokenizer(prompts)
        with ProfilingContext4Debug("embedding"):
            embeddings = self.text_encoder(tokens)
        return embeddings

Best Practices¶

1. Use Meaningful Names¶

# Good - descriptive and unique
@ProfilingContext4Debug("vae_decode_video")
@ProfilingContext4Debug("dit_denoising_step_0")

# Avoid - generic or duplicate
@ProfilingContext4Debug("process")
@ProfilingContext4Debug("model")

2. Use ProfilingContext4Debug in Stages¶

# Recommended - no overhead in production
@ProfilingContext4Debug("stage_name")
def process(self, data):
    pass

# Avoid in production code - always active
@ProfilingContext("stage_name")
def process(self, data):
    pass

3. Combine with Other Decorators¶

Order matters. Decorators apply bottom-up, so the topmost one runs first at call time, and the profiler should wrap the actual computation:

@with_model_offload(["model"])      # Outer: handles model loading
@ProfilingContext4Debug("process")  # Middle: profiles computation
@torch.inference_mode()             # Inner: disables gradients
def process(self, data):
    return self.model(data)
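
The nesting order can be checked with plain Python. In this sketch, `layer` is a hypothetical decorator factory that only logs entry and exit; the three labels stand in for `with_model_offload`, `ProfilingContext4Debug`, and `torch.inference_mode`, whose real behavior is not reproduced here.

```python
import functools

events = []


def layer(label):
    """Decorator factory that records when its wrapper is entered and left."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            events.append(f"{label}:enter")
            try:
                return func(*args, **kwargs)
            finally:
                events.append(f"{label}:exit")
        return wrapper
    return decorator


@layer("offload")         # outer: runs first, finishes last
@layer("profile")         # middle: sees everything below it
@layer("inference_mode")  # inner: closest to the computation
def process(data):
    events.append("compute")
    return data * 2


result = process(3)
print(result)
print(events)
```

The log shows the profiling layer entering after the outer layer and exiting before it, so work done by the outer wrapper before or after calling the function falls outside the profiled span.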

4. Enable Specifically¶

# Enable only what you need
export ENABLE_PROFILER_NAMES="vae_decode"

# Avoid enabling everything (large trace files)
export ENABLE_PROFILER_NAMES="*"  # Not recommended

5. Use reset_peak_memory Appropriately¶

# Reset for each independent operation
with ProfilingContext("independent_op", reset_peak_memory=True):
    pass

# Don't reset when tracking accumulated memory
with ProfilingContext("sequence_op", reset_peak_memory=False):
    pass

Troubleshooting¶

Large Trace Files¶

Use isolated Stage Bench Harness instead of full pipeline profiling:

# Instead of full pipeline (creates 100MB+ traces)
# Use harness for single iteration profiling
from telefuser.utils.stage_bench_harness import StageBenchHarness

harness = StageBenchHarness.from_signature_file(
    signature_path="timing_io_signature.json",
    stage_name="denoise",
    stage_instance=pipeline.denoise_stage,
)
harness.setup()
harness.profile()  # Creates <10MB trace

Missing GPU Activity¶

If GPU activity is not recorded:

Verify platform is supported (CUDA, XPU, NPU)
Check CUDA synchronization is working

from telefuser.platforms import current_platform
print(current_platform.device_type)  # Should be "cuda", "xpu", or "npu"

Memory Stats Not Accurate¶

Ensure synchronization before profiling:

from telefuser.platforms import current_platform
from telefuser.utils.profiler import ProfilingContext

# ProfilingContext synchronizes automatically; sync manually only around custom timing code
current_platform.synchronize()
with ProfilingContext("operation"):
    pass

Stage Instance Not Available for CLI¶

CLI mode cannot execute stage profiling without a loaded model. Use the programmatic approach with a loaded stage instance:

from telefuser.utils.stage_bench_harness import StageBenchHarness

# Load pipeline first (my_pipeline is a placeholder for your own module)
from my_pipeline import get_pipeline
pipeline = get_pipeline()

# Then use harness
harness = StageBenchHarness.from_signature_file(
    signature_path="timing_io_signature.json",
    stage_name="denoise",
    stage_instance=pipeline.denoise_stage,
)

Related Documentation¶

Adding New Stage - Stage development with profiler integration
Metrics - Production monitoring and observability
Logging - Logging configuration and usage
Configuration - Runtime configuration options