ControlVideo: Training-free Controllable Text-to-Video Generation

Jongmin Lim 2025. 12. 2. 16:34

Abstract

Text-driven diffusion models perform well at image generation, but they lag in video generation because temporal modeling is prohibitively expensive to train.
Beyond the training burden, generated videos also suffer from appearance inconsistency and structural flickers.
To address this, the paper proposes ControlVideo, which adapts ControlNet and preserves structural consistency from the input motion sequence.
- Adds fully cross-frame attention interaction to the self-attention modules

[Background notes]

1. DDIM vs DDPM

2. Why is noise added in the DDPM reverse process?

I had assumed that noise is added in the forward process and simply removed in the reverse process.
First, look at the forward process.
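
In the standard DDPM notation (an assumption here, since the post itself defines no symbols), with noise schedule $\beta_t$, $\alpha_t = 1 - \beta_t$ and $\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$, the forward process is usually written as:

```latex
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right),
\qquad
x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,
\quad \epsilon \sim \mathcal{N}(0, I)
```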

The reverse process is then as follows.
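
A single DDPM reverse step is commonly written as below. The notation is assumed, not taken from the post: $\alpha_t = 1 - \beta_t$, $\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$, $\epsilon_\theta$ is the UNet noise prediction, and $\sigma_t z$ is the freshly sampled noise term.

```latex
x_{t-1} = \frac{1}{\sqrt{\alpha_t}}
\left( x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t, t) \right)
+ \sigma_t z,
\qquad z \sim \mathcal{N}(0, I)
```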

So why remove noise with the UNet and then add random noise back in?

3. Why does the random noise disappear in the DDIM reverse process?

Eq. (4) of the paper can then be split as follows.
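
Assuming Eq. (4) is the deterministic DDIM update ($\eta = 0$), the split can be written in standard notation as below. The symbols $\bar\alpha_t$ (cumulative noise schedule) and $\epsilon_\theta$ (UNet noise prediction) are assumptions, and the paper's own symbols may differ.

```latex
x_{t-1} =
\underbrace{\sqrt{\bar\alpha_{t-1}}\;
\frac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}}}_{\text{clean component (predicted } x_0\text{)}}
+
\underbrace{\sqrt{1-\bar\alpha_{t-1}}\;\epsilon_\theta(x_t, t)}_{\text{noise prediction component}}
```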

Unlike DDPM, there is no random noise sampling term (σₜ z) here.
DDIM (deterministic) = clean component + noise prediction
The clean component holds the structural information of the image.
- object shape
- boundary
- structure (low frequency)
- coarse layout
- overall geometry
The noise prediction component holds the high-frequency information.
- fine detail
- material texture
- small patterns
- shading
- high-frequency structure
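
The contrast between the two samplers can be sketched in a few lines of NumPy. This is a minimal illustration under assumed notation, not code from the paper: `alpha_bar_t` is the cumulative noise schedule, `eps_pred` stands in for the UNet output, and all function names are hypothetical.

```python
import numpy as np

def ddpm_step(x_t, eps_pred, alpha_t, alpha_bar_t, sigma_t, rng):
    """One stochastic DDPM reverse step: posterior mean plus fresh noise sigma_t * z."""
    mean = (x_t - (1.0 - alpha_t) / np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_t)
    z = rng.standard_normal(x_t.shape)  # random noise re-injected at every step
    return mean + sigma_t * z

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM step (eta = 0): no random noise is sampled."""
    # Predicted clean sample x_0 from the current noisy sample.
    x0_pred = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)
    clean = np.sqrt(alpha_bar_prev) * x0_pred              # clean component
    noise_dir = np.sqrt(1.0 - alpha_bar_prev) * eps_pred   # noise prediction component
    return clean + noise_dir, clean, noise_dir
```

Calling `ddim_step` twice with the same inputs returns the same result, while `ddpm_step` depends on the state of `rng`.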

1 Introduction

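Rather than learning the video distribution from scratch, this work
- leverages the generative capability of pre-trained text-to-image models, and
- keeps the motion sequence temporally consistent to generate vivid videos
Prior work (Text2Video-Zero and Tune-A-Video) replaces the original self-attention with a sparser cross-frame attention, so frames are no longer processed independently and appearance becomes more coherent
- Problem 1) Appearance is still inconsistent between some frames (see Fig. 4(a))
- Problem 2) Noticeable artifacts appear in videos with large motion (see Fig. 4(b))
- Problem 3) Structural flickers remain during transitions between frames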

3 ControlVideo

To address the problems of prior work, the paper proposes ControlVideo, a training-free method for controllable text-to-video generation
- fully cross-frame interaction
- interleaved-frame smoother
- hierarchical sampler

Fully cross-frame interaction.

The main challenge in adapting a text-to-image model to its video counterpart is ensuring temporal consistency
ControlVideo uses the controllability of ControlNet so that the motion sequence enforces coarse-level structural consistency
However, generating every frame independently with ControlNet yields inconsistent appearance (see Fig. 6, individual)

Therefore, following Text2Video-Zero and Tune-A-Video, the 2D convolution layers are replaced with 3D convolution layers
- each 3x3 kernel becomes a 1x3x3 kernel
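
A 1x3x3 kernel has temporal extent 1, so the inflated layer still processes each frame separately and can reuse the pre-trained 2D weights. The sketch below illustrates this with a naive NumPy convolution; the function names and tensor layouts are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def conv2d_same(x, w):
    """Naive zero-padded 2D cross-correlation. x: (C_in, H, W), w: (C_out, C_in, 3, 3)."""
    c_out, _, kh, kw = w.shape
    _, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros((c_out, h, wd))
    for i in range(h):
        for j in range(wd):
            patch = xp[:, i:i + kh, j:j + kw]  # (C_in, kh, kw)
            out[:, i, j] = np.tensordot(w, patch, axes=([1, 2, 3], [0, 1, 2]))
    return out

def inflate_kernel(w2d):
    """(C_out, C_in, 3, 3) -> (C_out, C_in, 1, 3, 3): add a temporal axis of size 1."""
    return w2d[:, :, None, :, :]

def conv3d_temporal_size_one(video, w3d):
    """video: (C_in, T, H, W). With temporal kernel size 1, frames do not mix."""
    assert w3d.shape[2] == 1
    frames = [conv2d_same(video[:, t], w3d[:, :, 0]) for t in range(video.shape[1])]
    return np.stack(frames, axis=1)  # (C_out, T, H, W)
```

Because the convolution mixes nothing across time, any temporal interaction has to come from the attention layers.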

However, prior work lets every frame attend only to the first frame
In contrast, this paper concatenates all frames into one 'large image' and computes cross-frame attention over it
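
The three attention patterns discussed here can be compared in a small NumPy sketch. It assumes single-head attention on already-projected tokens of shape `(T, N, D)` (frames, tokens per frame, channels), and every function name is hypothetical rather than taken from the paper's code.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention. q: (Nq, D), k and v: (Nk, D)."""
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def per_frame_attention(q, k, v):
    """Plain self-attention: each frame attends only to its own tokens."""
    return np.stack([attention(q[t], k[t], v[t]) for t in range(q.shape[0])])

def first_frame_attention(q, k, v):
    """Sparse cross-frame attention: each frame attends to the first frame's tokens."""
    return np.stack([attention(q[t], k[0], v[0]) for t in range(q.shape[0])])

def fully_cross_frame_attention(q, k, v):
    """All frames form one 'large image': each frame attends to the tokens of every frame."""
    t, n, d = q.shape
    k_all = k.reshape(t * n, d)
    v_all = v.reshape(t * n, d)
    return np.stack([attention(q[i], k_all, v_all) for i in range(t)])
```

The fully cross-frame variant builds a `(T*N) x (T*N)` attention map in total, so its cost grows quadratically with the number of frames.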

Interleaved-frame smoother.

To reduce flickering in the video,
- frames are interpolated within each three-frame clip, and this is repeated in an interleaved manner so that the whole video becomes smooth
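
The interleaving idea can be sketched as follows. Two things are assumptions here: that the middle frame of each three-frame clip is the one being replaced, and that a plain average of its two neighbours stands in for the interpolation, which in practice would be a learned frame-interpolation model. The function names are hypothetical.

```python
import numpy as np

def smooth_once(frames, offset):
    """Replace the middle frame of each 3-frame clip starting at offset, offset+2, ...

    frames: array of shape (T, ...). The neighbour average is only a stand-in
    for a real frame-interpolation model.
    """
    frames = np.asarray(frames, dtype=float)
    out = frames.copy()
    for start in range(offset, len(frames) - 2, 2):
        mid = start + 1
        out[mid] = 0.5 * (frames[start] + frames[start + 2])
    return out

def interleaved_smooth(frames):
    """Two passes with shifted clips, so that every interior frame gets smoothed."""
    first_pass = smooth_once(frames, offset=0)    # middle frames 1, 3, 5, ...
    return smooth_once(first_pass, offset=1)      # middle frames 2, 4, 6, ...
```

With offset 0 the clips are (0,1,2), (2,3,4), and so on; with offset 1 they are (1,2,3), (3,4,5), and so on, so alternating the two passes covers every interior frame.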

Limitation

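Every frame must follow the motion constraint supplied through the Pose ControlNet or Flow ControlNet
- Example: the user provides a pose sequence of Michael Jackson's moonwalk.
- ControlVideo can then keep the same motion trajectory as the moonwalk while changing only the appearance, for example to Iron Man ("a video of Iron Man doing the moonwalk")
- The following, however, is not possible:
- "Iron Man runs on the street." → because the input motion sequence is a moonwalk.
- In other words, the motion cannot be changed through the text prompt.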
