We tackle the dual challenges of video understanding and controllable video generation within a unified diffusion framework. Our key insights are two-fold. First, geometry-only cues (e.g., depth, edges) are insufficient: they specify layout but under-constrain appearance, materials, and illumination, limiting physically meaningful edits such as relighting or material swaps and often causing temporal drift. Second, enriching the model with additional graphics-based modalities (intrinsics and semantics) provides complementary constraints that both disambiguate understanding and enable precise, predictable control during generation.
However, building a single model that uses many heterogeneous cues introduces two core difficulties. Architecturally, the model must accept any subset of modalities, remain robust to missing inputs, and inject control signals without sacrificing temporal consistency. Data-wise, training demands large-scale, temporally aligned supervision that ties real videos to per-pixel multimodal annotations.
To address these challenges, we propose CtrlVDiff, a unified diffusion model trained with a Hybrid Modality Control Strategy (HMCS) that routes and fuses features from depth, normals, segmentation, edges, and graphics-based intrinsics (albedo, roughness, metallic), and re-renders videos from any chosen subset with strong temporal coherence. To enable this, we build MMVideo, a hybrid real-and-synthetic dataset aligned across modalities and captions. Across understanding and generation benchmarks, CtrlVDiff delivers superior controllability and fidelity, enabling layer-wise edits (relighting, material adjustment, object insertion) and surpassing state-of-the-art baselines while remaining robust when some modalities are unavailable.
Framework overview of CtrlVDiff. Given a video with eight paired modalities, we first encode all modalities into latent representations using a pretrained shared 3D-VAE encoder. For each sample in a batch, the latent features of its modalities are concatenated along the channel dimension. We then apply the Hybrid Modality Control Strategy (HMCS) to each batch (as illustrated in the box on the right), which enables robust handling of all possible modality combinations. The outputs of the Diffusion Transformer are then passed through modality-specific projection layers, where each modality is assigned an independent projection head to encourage effective modality disentanglement.
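To make the data path concrete, here is a minimal PyTorch sketch of the HMCS idea described above: per-modality latents are concatenated along the channel dimension, modalities outside the active subset are zeroed so any combination is accepted, and the backbone output is split through one projection head per modality. All module names, dimensions, and the zero-masking choice are illustrative assumptions, not the released implementation.

```python
# Minimal PyTorch sketch of the HMCS data path described above. Module names,
# dimensions, and the zero-masking of dropped modalities are illustrative
# assumptions, not the authors' released code.
import torch
import torch.nn as nn

MODALITIES = ["rgb", "depth", "normal", "segmentation", "edge",
              "albedo", "roughness", "metallic"]

class HMCSBlock(nn.Module):
    def __init__(self, latent_dim=16, model_dim=256):
        super().__init__()
        # Latents of all modalities are concatenated along the channel dimension.
        self.in_proj = nn.Linear(latent_dim * len(MODALITIES), model_dim)
        # Stand-in for the diffusion transformer backbone.
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(model_dim, nhead=8, batch_first=True),
            num_layers=2)
        # One independent projection head per modality to encourage disentanglement.
        self.heads = nn.ModuleDict({m: nn.Linear(model_dim, latent_dim)
                                    for m in MODALITIES})

    def forward(self, latents, active):
        # latents[m]: (B, N, latent_dim) tokens from a shared 3D-VAE encoder.
        # Modalities outside `active` are zeroed so any subset is accepted.
        parts = [latents[m] if m in active else torch.zeros_like(latents[m])
                 for m in MODALITIES]
        h = self.backbone(self.in_proj(torch.cat(parts, dim=-1)))
        return {m: head(h) for m, head in self.heads.items()}

# Toy usage: sample a random modality subset, mimicking hybrid modality control.
latents = {m: torch.randn(2, 64, 16) for m in MODALITIES}
active = {m for m in MODALITIES if torch.rand(()) > 0.5}
outputs = HMCSBlock()(latents, active)
print(outputs["depth"].shape)  # torch.Size([2, 64, 16])
```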
CtrlVDiff supports a wide range of conditional combinations. Below, we present results across various tasks, including single-condition generation, multi-condition generation, text-to-multimodal video generation, and video understanding.
CtrlVDiff is capable of flexibly generating videos from a single condition, such as normal, albedo, roughness, or metallic. We observe that each of these modalities, when used as a guiding signal, can lead to high-quality video generation.
CtrlVDiff is capable of flexibly generating videos from multiple combinations of conditions, and below we present results for several representative examples. The results demonstrate that our multi-condition generation approach can produce high-quality video content.
After decomposing the input video and using all decomposed modalities as conditions, CtrlVDiff can effectively reconstruct the video content.
Moreover, even in the absence of any conditioning inputs, CtrlVDiff can still generate multimodal video sequences guided solely by text prompts.
When conditioned on RGB, CtrlVDiff can efficiently interpret the video content and simultaneously generate multimodal outputs, all of which maintain high consistency and strong detail fidelity.
We conduct a comprehensive evaluation of CtrlVDiff on both multi-condition and single-condition generation tasks.
We compare CtrlVDiff with current state-of-the-art multi-condition video generation methods. In this evaluation, each method uses all available conditioning modalities, making the task equivalent to video reconstruction. Under this setting, our method demonstrates superior consistency with the input video as well as higher visual realism.
CtrlVDiff is compared with state-of-the-art methods for depth-guided and Canny-guided video generation. In both tasks, our method delivers competitive results relative to the latest approaches.
We conduct a comprehensive evaluation of CtrlVDiff on video understanding tasks, assessing the prediction quality of depth, segmentation, normal, and material properties (albedo, roughness, and metallic).
CtrlVDiff produces more detailed geometric predictions in the depth modality, capturing fine structures such as thin wires. For segmentation, CtrlVDiff demonstrates stronger spatial consistency, for example by maintaining coherent table segmentation and correctly grouping the person’s hand into a single category.
DiffusionRenderer (DR) provides two versions, one built on SVD and one built on Cosmos; we treat the Cosmos-based version as the expert model. CtrlVDiff achieves performance close to this expert model, and compared with the single-modality expert NormalCrafter, our method recovers even finer details.
Compared with other methods, CtrlVDiff demonstrates more physically plausible qualitative results in material property prediction on the InteriorVerse Test Dataset. In particular, CtrlVDiff shows notably stronger performance in predicting roughness and metallic.
We further demonstrate the application potential of CtrlVDiff in video editing through layer-wise control. In tasks such as scene relighting, material editing, and object insertion, CtrlVDiff consistently delivers high-quality video editing results.
For an input video, CtrlVDiff first decomposes it into all underlying modalities. These modalities are then used as conditioning inputs, and together with a new lighting description prompt, the model generates a video with the desired relighting effect while preserving the original content and structure.
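As a concrete illustration of this relighting workflow, the hypothetical script below strings the two passes together. The pipeline class, method names, checkpoint id, and I/O helpers are placeholders introduced for illustration only; no public CtrlVDiff API is implied.

```python
# Hypothetical relighting script following the decompose-then-recondition recipe
# above. `CtrlVDiffPipeline`, its methods, and the I/O helpers are placeholder
# names introduced here for illustration; they are not a released API.
from ctrlvdiff import CtrlVDiffPipeline          # hypothetical package
from video_io import load_video, save_video      # hypothetical I/O helpers

pipe = CtrlVDiffPipeline.from_pretrained("ctrlvdiff-base")  # placeholder checkpoint id
frames = load_video("input.mp4")

# 1) Understanding pass: condition on RGB and predict every other modality.
mods = pipe.decompose(frames)
# mods keys: depth, normal, segmentation, edge, albedo, roughness, metallic

# 2) Generation pass: re-render from the decomposed modalities together with a
#    new lighting description. Geometry and materials are held fixed by the
#    conditions, so only the illumination should change.
relit = pipe.generate(
    conditions=mods,
    prompt="the same scene at sunset, warm low-angle light through the window",
)
save_video(relit, "relit.mp4")
```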
For a given input video, CtrlVDiff first decomposes it to obtain all relevant modalities. By modifying the albedo properties and using the edited modality together with the remaining original modalities as conditioning inputs, the model generates a video that reflects the desired albedo edits while preserving the original content and structure.
When using two modalities (albedo and normal) as conditioning inputs, CtrlVDiff can generate videos as shown in the top row. By modifying the albedo and normal modalities to insert new objects (e.g., a bowl or a bottle), the model is able to produce videos that include the inserted objects, as illustrated in the bottom row, while preserving the original content and structure of the scene.
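The material-editing and object-insertion examples above follow the same decompose, edit, and re-render pattern. The sketch below continues the hypothetical relighting script (reusing its pipeline and I/O helpers), with paste_object standing in for any per-frame compositing of the inserted object's albedo and normal renders.

```python
# Hypothetical continuation of the relighting sketch (reuses `pipe`, `load_video`,
# `save_video`). `paste_object` and `load_object_renders` are placeholders for
# per-frame compositing of the inserted object's rendered maps under a mask.
mods = pipe.decompose(load_video("tabletop.mp4"))

obj_albedo, obj_normal, mask = load_object_renders("bowl")   # hypothetical helper

# Edit only the conditioning maps: composite the bowl into albedo and normal.
mods["albedo"] = paste_object(mods["albedo"], obj_albedo, mask)
mods["normal"] = paste_object(mods["normal"], obj_normal, mask)

# Re-render from the edited albedo + normal subset alone; HMCS tolerates the
# missing modalities, and the model synthesizes shading for the new object.
edited = pipe.generate(conditions={"albedo": mods["albedo"],
                                   "normal": mods["normal"]})
save_video(edited, "tabletop_with_bowl.mp4")
```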
@misc{xdb2025ctrlvdiff,
      title={CtrlVDiff: Controllable Video Generation via Unified Multimodal Video Diffusion},
      author={Dianbing Xi and Jiepeng Wang and Yuanzhi Liang and Xi Qiu and Jialun Liu and Hao Pan and Yuchi Huo and Rui Wang and Haibin Huang and Chi Zhang and Xuelong Li},
      year={2025},
      eprint={2511.21129},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.21129},
}