VideoDirector: Precise Video Editing via Text-to-Video Models

1Sun Yat-sen University 2Tsinghua University 3National University of Defense Technology

Abstract

Although the typical inversion-then-editing paradigm built on text-to-image (T2I) models has demonstrated promising results, directly extending it to text-to-video (T2V) models still suffers from severe artifacts such as color flickering and content distortion. Consequently, current video editing methods primarily rely on T2I models, which inherently lack temporal-coherence generative ability and often produce inferior editing results. In this paper, we attribute the failure of the typical editing paradigm to: 1) Tight Spatial-Temporal Coupling. The vanilla pivotal-based inversion strategy struggles to disentangle spatial-temporal information in the video diffusion model; 2) Complicated Spatial-Temporal Layout. The vanilla cross-attention control is deficient in preserving the unedited content. To address these limitations, we propose spatial-temporal decoupled guidance (STDG) and a multi-frame null-text optimization strategy to provide pivotal temporal cues for more precise pivotal inversion. Furthermore, we introduce a self-attention control strategy to maintain higher fidelity for precise partial content editing. Experimental results demonstrate that our method (termed VideoDirector) effectively harnesses the powerful temporal generation capabilities of T2V models, producing edited videos with state-of-the-art performance in accuracy, motion smoothness, realism, and fidelity to unedited content.

Motivation

(a) Prompt-to-prompt and null-text optimization are integrated directly into the T2V generation model to reconstruct the input videos. The results reveal that the typical editing paradigm struggles to accurately reconstruct the original videos.
(b) Our method achieves accurate reconstruction of input videos by incorporating multi-frame null-text optimization and spatial-temporal decoupled guidance.

Figure 1. Principle visualization of our approach. Comparison of diffusion pivotal inversion using a T2V generation model integrated with vanilla null-text optimization (a) and our proposed guidance (b). Our approach constrains the reverse diffusion trajectory during video generation to align with DDIM inversion, enabling precise reconstruction of the input video.
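To make the idea in Figure 1(b) concrete, the sketch below illustrates how per-frame null-text embeddings can be optimized so that the classifier-free-guided reverse trajectory stays aligned with the DDIM-inversion pivots. This is a minimal, illustrative sketch rather than the paper's implementation: the `unet` interface, tensor shapes, and hyperparameters (`inner_steps`, `lr`, `guidance`) are assumptions, and the spatial-temporal decoupled guidance term is omitted.

```python
# Minimal sketch of multi-frame null-text optimization for video pivotal inversion.
# Assumptions (not from the paper): a frozen T2V denoiser `unet(latents, t, text_emb)`
# operating on [B, C, F, H, W] latents with per-frame text embeddings [F, seq, dim],
# integer `timesteps` ordered T -> 0, and `inv_latents` = [z_0, ..., z_T] produced by
# a prior DDIM inversion (len(timesteps) + 1 entries).
import torch
import torch.nn.functional as F


def ddim_step(latents, noise_pred, t, t_prev, alphas_cumprod):
    """One deterministic DDIM reverse step (eta = 0)."""
    a_t = alphas_cumprod[t]
    a_prev = alphas_cumprod[t_prev] if t_prev >= 0 else torch.tensor(1.0)
    pred_x0 = (latents - (1 - a_t).sqrt() * noise_pred) / a_t.sqrt()
    return a_prev.sqrt() * pred_x0 + (1 - a_prev).sqrt() * noise_pred


def multi_frame_null_text_opt(unet, inv_latents, text_emb, null_emb, timesteps,
                              alphas_cumprod, num_frames, guidance=7.5,
                              inner_steps=10, lr=1e-2):
    """Optimize one null-text embedding per frame and per timestep so that the
    classifier-free-guided reverse trajectory matches the DDIM-inversion pivots."""
    latents = inv_latents[-1].clone()                  # start from the inverted noise z_T
    optimized = []
    for i, t in enumerate(timesteps):
        t_prev = timesteps[i + 1] if i + 1 < len(timesteps) else -1
        target = inv_latents[-(i + 2)]                 # pivot z_{t-1} from DDIM inversion
        # independent (multi-frame) null embedding: one copy per video frame
        null = null_emb.repeat(num_frames, 1, 1).clone().requires_grad_(True)
        opt = torch.optim.Adam([null], lr=lr)
        for _ in range(inner_steps):
            eps_uncond = unet(latents, t, null)
            eps_text = unet(latents, t, text_emb)
            eps = eps_uncond + guidance * (eps_text - eps_uncond)
            prev = ddim_step(latents, eps, t, t_prev, alphas_cumprod)
            loss = F.mse_loss(prev, target)            # pull the trajectory onto the pivot
            opt.zero_grad()
            loss.backward()
            opt.step()
        with torch.no_grad():                          # take the step with the optimized null
            eps_uncond = unet(latents, t, null)
            eps_text = unet(latents, t, text_emb)
            eps = eps_uncond + guidance * (eps_text - eps_uncond)
            latents = ddim_step(latents, eps, t, t_prev, alphas_cumprod)
        optimized.append(null.detach())
    return optimized                                   # per-timestep, per-frame embeddings
```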

Method

(a) Video pivotal inversion pipeline. Our pipeline comprises two key components, multi-frame null-text optimization and spatial-temporal decoupled guidance, which are integrated into the standard pivotal inversion pipeline.
(b) Our video editing pipeline. The self-attention controls SA-I and SA-II maintain the complicated spatial-temporal layout and enhance fidelity, while the cross-attention control introduces editing guidance based on the editing prompts.

Figure 2. The two stages of our VideoDirector: video pivotal inversion (a) and attention control for video editing (b).
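As a rough illustration of the attention control in Figure 2(b), the sketch below shows one common way such control can be implemented: self-attention keys and values from the source-reconstruction branch are cached and re-injected into the editing branch to preserve the spatial-temporal layout, while cross-attention maps for tokens shared by both prompts are copied from the source branch, prompt-to-prompt style. The class and function names, the `edited_token_mask` argument, and the exact injection rule are illustrative assumptions, not the paper's exact SA-I/SA-II design.

```python
# Illustrative sketch of attention control for editing (not the authors' exact SA-I /
# SA-II implementation). Self-attention K/V from the source branch are reused by the
# editing branch; cross-attention maps of shared tokens are copied from the source.
import torch
import torch.nn.functional as F


class AttentionStore:
    """Caches source-branch self-attention K/V and cross-attention maps per layer."""
    def __init__(self):
        self.self_kv = {}     # layer_name -> (key, value)
        self.cross_map = {}   # layer_name -> attention probabilities


def attention(q, k, v):
    """Plain scaled dot-product attention on per-head tensors [B*, seq_q, dim]."""
    probs = F.softmax(q @ k.transpose(-1, -2) * q.shape[-1] ** -0.5, dim=-1)
    return probs @ v, probs


def controlled_self_attention(q, k, v, layer, store, branch):
    """Self-attention control: the editing branch reuses the source branch's keys and
    values, so the complicated spatial-temporal layout of unedited content is kept."""
    if branch == "source":
        store.self_kv[layer] = (k.detach(), v.detach())
    else:                                     # editing branch
        k, v = store.self_kv[layer]
    out, _ = attention(q, k, v)
    return out


def controlled_cross_attention(q, k, v, layer, store, branch, edited_token_mask):
    """Cross-attention control: maps of tokens shared by both prompts are copied from
    the source branch; maps of edited tokens keep the editing branch's own values."""
    out, probs = attention(q, k, v)
    if branch == "source":
        store.cross_map[layer] = probs.detach()
    else:
        src_probs = store.cross_map[layer]
        # edited_token_mask: bool [seq_k], True where the prompt token was changed
        probs = torch.where(edited_token_mask, probs, src_probs)
        out = probs @ v
    return out
```

In practice, functions like these would be registered as attention processors on the T2V model's self- and cross-attention layers, with the source (reconstruction) branch and the editing branch run in parallel during the reverse diffusion process.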

More Editing Results

Input Video

"Iron man"

"Wonder Woman"

"Captain America"

"Spider Man"

Input Video

"Aquaman"

"Wonder Woman"

Input Video

"A blue swan"

"A white swan"

Input Video

"Fox"

"Husky"

Input Video

"Cat"

"Cheetah"

"Lion"

"Tiger"

"Wolf"

Input Video

"Cat"

"Cheetah"

"Lion"

"Fox"

"Husky"

"Tiger"

Input Video

"Porsche"

"LEGO car"

"Armored Humvee"

"Forest during autumn"

"Forest during a storm"

Input Video

"Cayenne"

"Ferrari"

"Porsche"

"Tesla"

Input Video

"Armored Humvee"

"LEGO car"

"Porsche Cayenne"

"Silver Tesla"

Comparisons

We compare VideoDirector with recent video editing methods, including FLATTEN [1], RAVE [2], TokenFlow [3], and Video-P2P [4].

More Results

Input Video

"Red Porsche"

"Red Tesla"

"Silver Porsche"

"Lamborghini"

Input Video

"Red Porsche"

"Silver Porsche"

"Red Tesla"

"Silver Tesla"

Input Video

"Tiger"

"White polar bear"

"Black bear"

"Panda"

Input Video

"A blue flamingo"

"A green flamingo"

"An orange flamingo"

"A pink flamingo"

"A purple flamingo"

"A red flamingo"

"A yellow flamingo"

Input Video

"Hippopotamus"

"Lion"

"Tiger"

Input Video

"A male mallard duck"

Input Video

"A zebra"

Input Video

"Messi"

"Cristiano Ronaldo"

"A Worldcup soccer stadium with spectators"

Input Video

"Aircraft Carrier"

"Royal Cruise"

Input Video

"Snowy mountains"

"Volcano"

BibTeX

@misc{wang2024videodirectorprecisevideoediting,
      title={VideoDirector: Precise Video Editing via Text-to-Video Models}, 
      author={Yukun Wang and Longguang Wang and Zhiyuan Ma and Qibin Hu and Kai Xu and Yulan Guo},
      year={2024},
      eprint={2411.17592},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.17592}, 
}

[1] Yuren Cong, Mengmeng Xu, Christian Simon, Shoufa Chen, Jiawei Ren, Yanping Xie, Juan-Manuel Perez-Rua, Bodo Rosenhahn, Tao Xiang, and Sen He. FLATTEN: Optical flow-guided attention for consistent text-to-video editing. In International Conference on Learning Representations (ICLR), 2024.

[2] Ozgur Kara, Bariscan Kurtkaya, Hidir Yesiltepe, James M. Rehg, and Pinar Yanardag. RAVE: Randomized noise shuffling for fast and consistent video editing with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6507-6516, 2024.

[3] Michal Geyer, Omer Bar-Tal, Shai Bagon, and Tali Dekel. TokenFlow: Consistent diffusion features for consistent video editing. In International Conference on Learning Representations (ICLR), 2024.

[4] Shaoteng Liu, Yuechen Zhang, Wenbo Li, Zhe Lin, and Jiaya Jia. Video-P2P: Video editing with cross-attention control. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8599-8608, 2024.