In this paper, we propose OmniVDiff, a novel framework for controllable video diffusion that aims to synthesize and comprehend multiple visual video modalities within a single diffusion model.
To achieve this, OmniVDiff represents all video visual modalities in the color space to learn a joint distribution, while employing an adaptive control strategy that dynamically adjusts the role of each visual modality during the diffusion process, treating it either as a generation modality or a conditioning modality. This flexible role assignment enables support for a wide range of tasks. Consequently, our model supports three key functionalities: (1) Text-conditioned video generation: multi-modal visual video sequences (i.e., RGB, depth, canny, segmentation) are generated from text conditions in a single diffusion process; (2) Video understanding: OmniVDiff estimates depth, canny maps, and semantic segmentation for the input RGB frames while ensuring coherence with the RGB input; and (3) X-conditioned video generation: OmniVDiff generates videos conditioned on fine-grained attributes (e.g., depth maps or segmentation maps).
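The adaptive control strategy can be pictured as a small routing step before denoising: each modality's latent is either noised (generation role) or kept clean and passed through as a condition (conditioning role). Below is a minimal PyTorch-style sketch of this idea, assuming a scheduler with a standard add_noise method; the function and argument names are illustrative, not the released implementation.

import torch

def prepare_latents(latents, roles, timestep, scheduler):
    """latents: dict modality -> tensor [B, C, T, H, W];
    roles: dict modality -> 'generate' or 'condition'."""
    routed = {}
    for name, z in latents.items():
        if roles[name] == "generate":
            noise = torch.randn_like(z)
            routed[name] = scheduler.add_noise(z, noise, timestep)  # will be denoised
        else:
            routed[name] = z  # kept clean: acts as the condition
    # all modalities are concatenated along the channel dimension for the backbone
    return torch.cat([routed[k] for k in sorted(routed)], dim=1)

Because the roles are just per-modality flags, the same trained backbone can serve text-to-video generation, X-conditioned generation, and video understanding by changing which latents are noised at inference time.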
By integrating these diverse tasks into a unified video diffusion framework, OmniVDiff enhances the flexibility and scalability of controllable video diffusion, making it an effective tool for a variety of downstream applications, such as video-to-video translation. Extensive experiments demonstrate the effectiveness of our approach, highlighting its potential for various video-related applications.
Given a video with four paired modalities, (a) we first encode each modality into latents using a shared 3D-VAE encoder; (b) we then concatenate the latents along the channel dimension and apply noise for video diffusion, after which the denoised latents are decoded into their respective modalities via modality-specific decoding heads; (c) finally, each modality can be reconstructed into color space by the 3D-VAE decoder. During inference, the model enables various tasks by dynamically adjusting the role of each modality: (d) text-to-video generation, where all modalities are denoised from pure noise, and (e) X-conditioned generation, where the condition X is given and the other modalities are denoised from pure noise. If X is the RGB modality, the model performs generative understanding.
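The training path (a)-(c) can be summarized in a few lines. The sketch below uses assumed names (vae.encode, backbone, heads) and a standard noise-prediction objective for concreteness; it illustrates the pipeline described in the caption rather than reproducing the released code.

import torch
import torch.nn.functional as F

def training_step(videos, vae, backbone, heads, scheduler, text_emb, num_train_steps=1000):
    # videos: {'rgb': ..., 'depth': ..., 'canny': ..., 'seg': ...}, each [B, 3, T, H, W],
    # rendered in color space so the shared 3D-VAE can encode every modality.
    names = ["rgb", "depth", "canny", "seg"]
    z0 = torch.cat([vae.encode(videos[n]) for n in names], dim=1)   # (a) shared encoder, (b) channel concat
    t = torch.randint(0, num_train_steps, (z0.shape[0],), device=z0.device)
    noise = torch.randn_like(z0)
    zt = scheduler.add_noise(z0, noise, t)                          # forward diffusion
    feats = backbone(zt, t, text_emb)                               # joint denoising over all modalities
    pred = torch.cat([heads[n](feats) for n in names], dim=1)       # modality-specific decoding heads
    return F.mse_loss(pred, noise)                                  # simple noise-prediction loss

At inference, the predicted latents of each modality are passed back through the 3D-VAE decoder to recover color-space videos, as in step (c).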
Using OmniVDiff, you can generate coherent and temporally consistent video sequences from text prompts alone.
Using OmniVDiff, you can generate coherent and temporally consistent video sequences conditioned on various inputs such as depth, canny edges, and segmentation maps.
Given a reference video, OmniVDiff can jointly estimate multiple aligned visual understanding outputs within a single diffusion process.
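A hypothetical inference call for this understanding mode is shown below; the API names (model.sample, condition, generate) are assumptions for illustration, not the released interface. The key point is that RGB is fixed as the conditioning modality while the remaining modalities are denoised jointly, which keeps depth, canny, and segmentation mutually aligned with the input video.

outputs = model.sample(
    prompt="",                                # understanding requires no text prompt here
    condition={"rgb": reference_video},       # kept clean throughout denoising
    generate=["depth", "canny", "seg"],       # sampled jointly from pure noise
)
depth_video, canny_video, seg_video = (outputs[k] for k in ("depth", "canny", "seg"))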
Given a reference video, OmniVDiff first estimates the corresponding depth, which serves as a structural prior to guide scene composition. This enables the generation of videos with diverse scene styles (e.g., winter, autumn, summer, and sunset) through text-based control.
Row 1 – Video Super-Resolution: Given a low-resolution video as input, OmniVDiff generates a corresponding high-resolution output with enhanced visual details. Row 2 – Video Deblurring: Similarly, OmniVDiff can be fine-tuned for the video deblurring task, producing a sharp and clear video from a blurred input.
Given a reference video, OmniVDiff estimates the corresponding depth and segmentation sequences. These outputs can be reprojected into a 3D scene and rendered from novel viewpoints.
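The reprojection step can be illustrated with standard pinhole back-projection, shown in the sketch below under assumed intrinsics K (segmentation labels are simply carried along with each pixel's 3D point); this is a generic illustration of depth unprojection, not the paper's exact rendering pipeline.

import numpy as np

def unproject_depth(depth, K):
    """depth: [H, W] depth map; K: 3x3 pinhole intrinsics. Returns [H*W, 3] camera-space points."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))                    # pixel grid
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)   # homogeneous pixel coords
    rays = pix @ np.linalg.inv(K).T                                    # back-projected rays
    return rays * depth.reshape(-1, 1)                                 # scale each ray by its depth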
@article{xdb2025OmniVDiff,
author = {Xi, Dianbing and Wang, Jiepeng and Liang, Yuanzhi and Qi, Xi and Huo, Yuchi and Wang, Rui and Zhang, Chi and Li, Xuelong},
title = {OmniVDiff: Omni Controllable Video Diffusion for Generation and Understanding},
journal = {arXiv preprint arXiv:2504.10825},
year = {2025},
}