OmniVDiff: Omni Controllable Video Diffusion for Generation and Understanding

Dianbing Xi1,2,*, Jiepeng Wang2,*,‡, Yuanzhi Liang2, Xi Qiu2, Yuchi Huo1, Rui Wang1,†, Chi Zhang2,†, Xuelong Li2,†

* Equal contribution. † Corresponding author. ‡ Project leader.

1Zhejiang University, 2Institute of Artificial Intelligence, China Telecom (TeleAI)
Teaser Image

OmniVDiff enables controllable video generation and understanding in a unified video diffusion framework.

Abstract

In this paper, we propose a novel framework for controllable video diffusion, OmniVDiff, aiming to synthesize and comprehend multiple video visual modalities within a single diffusion model.

To achieve this, OmniVDiff treats all video visual modalities in the color space to learn a joint distribution, while employing an adaptive control strategy that dynamically adjusts the role of each visual modality during the diffusion process, either as a generation modality or a conditioning modality. This allows flexible manipulation of each modality's role, enabling support for a wide range of tasks. Consequently, our model supports three key functionalities: (1) Text-conditioned video generation: multi-modal visual video sequences (i.e., RGB, depth, Canny edges, and segmentation) are generated from text conditions in a single diffusion process; (2) Video understanding: OmniVDiff estimates the depth, Canny map, and semantic segmentation of the input RGB frames while ensuring coherence with the RGB input; and (3) X-conditioned video generation: OmniVDiff generates videos conditioned on fine-grained attributes (e.g., depth maps or segmentation maps).

By integrating these diverse tasks into a unified video diffusion framework, OmniVDiff enhances the flexibility and scalability for controllable video diffusion, making it an effective tool for a variety of downstream applications, such as video-to-video translation. Extensive experiments demonstrate the effectiveness of our approach, highlighting its potential for various video-related applications.

Method overview

Method overview figure

Given a video with four paired modalities, (a) we first encode it into latents using a shared 3D-VAE encoder; (b) then we concatenate the latents along the channel dimension and apply noise for video diffusion, where the denoised latents are decoded into their respective modalities via modality-specific decoding heads; (c) finally, each modality can be reconstructed into color space by the 3D-VAE decoder. During inference, the model enables various tasks by dynamically adjusting the role of each modality: (d) text-to-video generation, where all modalities are denoised from pure noise, and (e) X-conditioned generation, where the condition X is given and the other modalities are denoised from pure noise. If X is the RGB modality, the model performs generative understanding.
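Below is a minimal, PyTorch-style sketch of stages (a)-(e). It only illustrates the idea under assumed interfaces (the shared 3D-VAE, a joint denoiser with modality-specific heads, and a diffusion scheduler); none of the names correspond to the released implementation.

import torch

MODALITIES = ["rgb", "depth", "canny", "seg"]
LATENT_SHAPE = (1, 16, 13, 60, 90)  # (B, C, T, H, W); placeholder values for illustration

def encode(videos, shared_vae):
    """(a) Encode each modality video into a latent with the shared 3D-VAE encoder."""
    return {m: shared_vae.encode(v) for m, v in videos.items()}

def sample(cond_videos, roles, shared_vae, denoiser, scheduler, text_emb):
    """(b)+(c) Joint denoising with adaptive role control.

    roles[m] is "condition" (latent stays clean throughout) or "generate"
    (latent starts from pure noise and is denoised by the model).
    """
    cond_lat = encode(cond_videos, shared_vae)
    lat = {m: (cond_lat[m] if roles[m] == "condition" else torch.randn(LATENT_SHAPE))
           for m in MODALITIES}
    for t in scheduler.timesteps:                              # reverse diffusion loop
        x = torch.cat([lat[m] for m in MODALITIES], dim=1)     # concat along channel dim
        preds = denoiser(x, t, text_emb)                       # dict: one prediction per modality head
        for m in MODALITIES:
            if roles[m] == "generate":
                lat[m] = scheduler.step(preds[m], t, lat[m])   # one denoising update (API assumed)
    # (c) Decode every modality back to color space with the 3D-VAE decoder.
    return {m: shared_vae.decode(lat[m]) for m in MODALITIES}

# (d) Text-to-video generation: all four modalities are denoised from pure noise.
T2V_ROLES = {m: "generate" for m in MODALITIES}

# (e) X-conditioned generation, e.g. depth as the condition; with X = rgb the same
#     mechanism performs generative video understanding.
DEPTH_COND_ROLES = {"rgb": "generate", "depth": "condition",
                    "canny": "generate", "seg": "generate"}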

Results: Generation and Understanding

Text-conditioned multi-modality video generation

Using OmniVDiff, you can generate coherent and temporally consistent multi-modality video sequences from text prompts alone.

Controllable video generation

Using OmniVDiff, you can generate coherent and temporally consistent video sequences conditioned on various inputs such as depth, Canny edges, and segmentation.

Depth-conditioned video generation

Canny-conditioned video generation

Segmentation-conditioned video generation

Video understanding

Given a reference video, OmniVDiff can jointly estimate multiple aligned visual understanding outputs within a single diffusion process.
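Continuing the hypothetical sketch from the method overview, video understanding is X-conditioned generation with X set to RGB: only the RGB latent is kept clean, while depth, Canny, and segmentation are denoised jointly.

UNDERSTANDING_ROLES = {"rgb": "condition", "depth": "generate",
                       "canny": "generate", "seg": "generate"}

# reference_video, shared_vae, denoiser, scheduler, and text_emb are assumed to be
# loaded elsewhere; sample() and the role convention come from the sketch above.
outputs = sample({"rgb": reference_video}, UNDERSTANDING_ROLES,
                 shared_vae, denoiser, scheduler, text_emb)
depth_video = outputs["depth"]
canny_video = outputs["canny"]
seg_video   = outputs["seg"]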

Applications

Video-to-video translation

Given a reference video, OmniVDiff first estimates the corresponding depth, which serves as a structural prior to guide scene composition. This enables the generation of videos with diverse scene styles (e.g., winter, autumn, summer, and sunset) through text-based control.
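The same hypothetical sketch extends to this two-stage pipeline (all names remain illustrative assumptions): first run understanding to estimate depth from the reference video, then feed that depth back in as the condition together with a new style prompt.

# Stage 1: generative understanding of the reference video to get a depth sequence.
depth_video = sample({"rgb": reference_video}, UNDERSTANDING_ROLES,
                     shared_vae, denoiser, scheduler, text_encoder(""))["depth"]

# Stage 2: depth-conditioned generation with a new style prompt (text_encoder is an
# assumed helper that returns the prompt embedding expected by the denoiser).
styles = ["winter, heavy snow", "autumn leaves", "summer noon", "sunset glow"]
stylized = [sample({"depth": depth_video}, DEPTH_COND_ROLES,
                   shared_vae, denoiser, scheduler, text_encoder(p))["rgb"]
            for p in styles]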

Adaptation to new applications: Deblur and Super-resolution

Row 1 – Video Super-Resolution: Given a low-resolution video as input, OmniVDiff generates a corresponding high-resolution output with enhanced visual details. Row 2 – Video Deblurring: Similarly, OmniVDiff can be fine-tuned for the video deblurring task, producing a sharp and clear video from a blurred input.
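One plausible way to set up such an adaptation, sketched under the assumption that the degraded RGB video is treated as an extra conditioning modality and the model is fine-tuned on paired degraded/clean clips; the paper does not prescribe this exact recipe, and all names are assumptions.

import torch
import torch.nn.functional as F

def restoration_finetune_step(clean_video, degraded_video, shared_vae, denoiser,
                              scheduler, optimizer, text_emb):
    """One fine-tuning step: the degraded RGB latent is the clean condition,
    the clean RGB latent is noised and must be recovered."""
    cond_lat   = shared_vae.encode(degraded_video)            # condition, kept clean
    target_lat = shared_vae.encode(clean_video)               # generation target
    t     = torch.randint(0, scheduler.num_train_timesteps, (1,))
    noise = torch.randn_like(target_lat)
    noisy = scheduler.add_noise(target_lat, noise, t)         # forward diffusion
    x = torch.cat([noisy, cond_lat], dim=1)                   # only the two relevant modalities here
    pred = denoiser(x, t, text_emb)["rgb"]
    loss = F.mse_loss(pred, noise)                            # epsilon-prediction loss (one common choice)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()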

Scene Reconstruction

Given a reference video, OmniVDiff estimates the corresponding depth and segmentation sequences. These outputs can be reprojected into a 3D scene and rendered from novel viewpoints.
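For reference, lifting an estimated depth map to a 3D point cloud is standard pinhole back-projection; the snippet below is a generic illustration with assumed camera intrinsics (fx, fy, cx, cy), not code from the paper.

import numpy as np

def backproject_depth(depth, fx, fy, cx, cy):
    """Lift an (H, W) metric depth map to an (H*W, 3) point cloud in camera coordinates."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))            # pixel coordinates
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

# Example: lift every frame of the estimated depth sequence, then render the fused
# points (optionally colored by the segmentation labels) from novel viewpoints.
# points_per_frame = [backproject_depth(d, fx, fy, cx, cy) for d in depth_sequence]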

BibTeX

@article{xdb2025OmniVDiff,
  author    = {Xi, Dianbing and Wang, Jiepeng and Liang, Yuanzhi and Qiu, Xi and Huo, Yuchi and Wang, Rui and Zhang, Chi and Li, Xuelong},
  title     = {OmniVDiff: Omni Controllable Video Diffusion for Generation and Understanding},
  journal   = {arXiv preprint arXiv:2504.10825},
  year      = {2025},
}