Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators

Jongmin Lim 2025. 11. 26. 19:14

Abstract

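This paper introduces a new task, zero-shot text-to-video generation, and tackles it with existing text-to-image methods such as Stable Diffusion.
The key modifications to the text-to-image model are:
- enriching the latent codes of the generated frames with motion dynamics so that the global scene and the background stay consistent over time
- reprogramming frame-level self-attention as cross-frame attention of each frame on the first frame, which preserves the context, appearance, and identity of the foreground object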

1. Introduction

The paper proposes a zero-shot, training-free method for generating video from textual prompts.
The core of the approach is to make a pre-trained text-to-image model (e.g., Stable Diffusion) generate temporally consistent frames.
Two lightweight modifications achieve this:
- First, the latent codes of the generated frames are enriched with motion information to keep the global scene and the background consistent over time.
- Second, cross-frame attention between each frame and the first frame preserves the context, appearance, and identity of the foreground object across the whole sequence.

3. Method

[Note] How diffusion-based methods evolved

1. Diffusion Model (DDPM: Denoising Diffusion Probabilistic Model)

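Concept
- A model built on a fixed forward process that gradually adds Gaussian noise and a learned reverse process that removes it step by step to generate data.
Characteristics
- Based on a Markov chain (hundreds to thousands of steps)
- Removes a small amount of noise at each step
- Stochastic sampling
- Very stable but slow (many sampling steps)
- The original diffusion model does not use image latents → diffusion runs directly in pixel space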
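
The forward process described above can be sketched in a few lines. This is a minimal illustration, assuming a linear noise schedule (a common choice, not something stated in these notes); the helper names `make_alpha_bar` and `forward_noise` are hypothetical. It uses the closed form $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ to jump directly to step $t$.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def forward_noise(x0, t, alpha_bar, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form; t is a 0-based step index."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = rng.standard_normal((3, 8, 8))                   # toy "image" in pixel space
x_early, _ = forward_noise(x0, 10, alpha_bar, rng)    # still close to x0
x_late, _ = forward_noise(x0, 999, alpha_bar, rng)    # almost pure noise
```

Early steps keep most of the signal, while the last step leaves almost none, which is why the reverse process needs many small denoising steps.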

2. DDIM (Denoising Diffusion Implicit Models)

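Concept
- DDPM sampling is slow because its reverse process is a stochastic Markov chain that has to visit every timestep.
  → DDIM replaces it with a deterministic update (which can be read as discretizing an ODE), so sampling becomes much faster.
- In short, it is a way to greatly speed up DDPM sampling.
- Note that the forward process is the same as in DDPM: noise is added gradually.
- In the reverse process, DDIM can skip timesteps, removing the noise of several steps in a single update.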
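
As a rough sketch of the step-skipping idea, the deterministic DDIM update (the $\eta = 0$ case) can be written as below. The names `ddim_step`, `ddim_sample`, and `eps_model` are hypothetical, the linear schedule is an assumption, and the `oracle` noise predictor is a stand-in for a trained network used only to make the example runnable.

```python
import numpy as np

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0) from step t to an earlier step."""
    # Estimate the clean sample implied by the predicted noise.
    x0_pred = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)
    # Re-noise that estimate to the (lower) noise level of the earlier step.
    return np.sqrt(alpha_bar_prev) * x0_pred + np.sqrt(1.0 - alpha_bar_prev) * eps_pred

def ddim_sample(x_T, eps_model, alpha_bar, num_steps):
    """Run DDIM over a strided subset of the training timesteps."""
    timesteps = np.linspace(len(alpha_bar) - 1, 0, num_steps).round().astype(int)
    x = x_T
    for i, t in enumerate(timesteps):
        # After the last step there is no noise left, i.e. alpha_bar = 1.
        ab_prev = alpha_bar[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
        x = ddim_step(x, eps_model(x, t), alpha_bar[t], ab_prev)
    return x

betas = np.linspace(1e-4, 0.02, 1000)          # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)
rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 4))
eps = rng.standard_normal((4, 4))
x_T = np.sqrt(alpha_bar[-1]) * x0 + np.sqrt(1.0 - alpha_bar[-1]) * eps
oracle = lambda x, t: eps                      # stand-in for a trained noise predictor
x_rec = ddim_sample(x_T, oracle, alpha_bar, num_steps=50)   # 50 steps instead of 1000
```

With a perfect noise predictor, 50 strided steps recover `x0`; a real model only approximates this, so fewer steps trade quality for speed.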

3. Stable Diffusion = Latent Diffusion Model (LDM)

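Diffusion runs in latent space
- Noise is removed in the latent space produced by a VAE encoder instead of on the image (pixel space)
  - → even 512×512 images can be trained on quickly
  - → computation drops dramatically
Text conditioning (CLIP text encoder)
- Text embeddings are injected into the U-Net through cross-attention → enables T2I
Supports fast DDIM-based sampling
- DDIM is one of the samplers used with Stable Diffusion
- (There are many samplers such as Euler, Heun, and DPM++, and DDIM is available by default as well)

Stable Diffusion is a "model", whereas DDPM/DDIM are "sampling methods / diffusion formulations".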

3.1. Stable Diffusion

SD operates in the latent space of an autoencoder $\mathcal{D}(\xi(\cdot))$
- $\xi$ is the encoder and $\mathcal{D}$ is the decoder
An input image $\text{Im}$ is encoded into a latent tensor $x_0 = \xi(\text{Im}) \in \mathbb{R}^{h \times w \times c}$

The diffusion forward process iteratively adds Gaussian noise to $x_0$.
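
In standard DDPM notation (the variance schedule $\beta_t$ is an assumed symbol, not defined elsewhere in these notes), one forward step can be written as:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right), \quad t = 1, \dots, T$$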

SD then learns the backward process, which reverses this noising.
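
In the usual formulation, the learned backward process is a Gaussian whose mean and covariance are predicted by a network with parameters $\theta$ (an assumed symbol here):

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$$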

SD is trained with a noise-prediction loss. Two points to note:
- The same U-Net is shared across all noise steps $t$.
- Stable Diffusion can additionally be conditioned on a textual prompt.
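
For reference, the standard latent-diffusion form of this loss is sketched below, written with the U-Net notation $\epsilon^t(x_t, \tau)$ used later in these notes. Here $\bar{\alpha}_t$ is an assumed symbol for the cumulative fraction of signal retained by the forward process up to step $t$, and $\tau$ is the text prompt.

$$x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \qquad \mathcal{L} = \mathbb{E}_{x_0,\ \epsilon \sim \mathcal{N}(0, I),\ t}\left[\left\| \epsilon - \epsilon^t(x_t, \tau) \right\|_2^2\right]$$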

3.2. Zero-Shot Text-to-Video Problem Formulation

Formally,
- text description: $\tau$
- function: $\mathcal{F}$
- output video frames: $\mathcal{V} \in \mathbb{R}^{m \times H \times W \times 3}$
The function $\mathcal{F}$ maps $\tau$ to $\mathcal{V}$, and determining it must not involve any training or fine-tuning on a video dataset.

3.3. Method

The most naive approach to zero-shot text-to-video generation is to sample the latent codes independently from a standard Gaussian distribution
- $x_T^1,...,x_T^m \sim \mathcal{N}(0,I)$
However, because each frame is then generated from $\tau$ completely at random, neither object appearance nor motion is coherent across frames.
To address this, the method
- introduces motion dynamics between $x_T^1,...,x_T^m$ to keep the global scene consistent, and
- uses a cross-frame attention mechanism to preserve the appearance and identity of the foreground object
For notational simplicity,
- $x_T^{1:m} = [x_T^1,...,x_T^m]$

3.3.1 Motion Dynamics in Latent Codes

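Motion is injected directly in latent space so that the frame latents share a common, physically plausible movement.
To do so, instead of simply sampling independent noise for every frame, the method
1. samples only the first latent at random
2. denoises it slightly
3. sets a direction of movement
4. increases the amount of movement gradually from frame to frame
5. generates motion by warping the latent
6. adds noise again to bring the latents back in line with the diffusion model's timestep
This yields a latent sequence with consistent camera and scene movement.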
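
The six steps above can be sketched roughly as follows. This is an illustration under several assumptions, not the authors' implementation: `denoise_fn` is a hypothetical stand-in for a few DDIM backward steps with Stable Diffusion, the warp is a wrap-around integer translation via `np.roll`, the schedule is linear, and keeping the original first latent for frame 1 is a choice made for this sketch.

```python
import numpy as np

def ddpm_forward(x, ab_from, ab_to, rng):
    """Re-noise x from noise level ab_from up to the noisier level ab_to."""
    ratio = ab_to / ab_from
    return np.sqrt(ratio) * x + np.sqrt(1.0 - ratio) * rng.standard_normal(x.shape)

def motion_latents(x_T1, denoise_fn, ab_T, ab_Tprime, m, delta=(1, 1), lam=1, rng=None):
    """Build m frame latents of shape (h, w, c) from one random latent."""
    rng = np.random.default_rng() if rng is None else rng
    x_Tprime1 = denoise_fn(x_T1)                  # step 2: denoise slightly
    frames = [x_T1]                               # frame 1 keeps its latent (sketch choice)
    for k in range(2, m + 1):
        # steps 3-4: fixed direction delta, shift grows linearly with the frame index
        shift = (lam * (k - 1) * delta[0], lam * (k - 1) * delta[1])
        warped = np.roll(x_Tprime1, shift=shift, axis=(0, 1))       # step 5: warp
        frames.append(ddpm_forward(warped, ab_Tprime, ab_T, rng))   # step 6: re-noise
    return np.stack(frames)

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)             # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)
x_T1 = rng.standard_normal((8, 8, 4))             # step 1: random first latent
toy_denoise = lambda x: 0.9 * x                   # placeholder, NOT a real denoiser
latents = motion_latents(x_T1, toy_denoise, alpha_bar[999], alpha_bar[899], m=4, rng=rng)
```

Because every later frame is a shifted copy of the same partially denoised latent, the frames share scene structure, while the fresh noise added in step 6 leaves the model room to vary details.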

3.3.2 Reprogramming Cross-Frame Attention

Standard self-attention
- In the original Stable Diffusion U-Net architecture $\epsilon^t(x_t,\tau)$, each self-attention layer takes a feature map $x\in\mathbb{R}^{h \times w \times c}$ and linearly projects it into query, key, and value features $Q, K, V \in \mathbb{R}^{h \times w \times c}$.
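
With the $h \times w$ positions flattened into rows (an assumption about layout made for readability), the output of such a layer follows the usual scaled dot-product form:

$$\text{Self-Attn}(Q, K, V) = \text{Softmax}\left(\frac{Q K^{T}}{\sqrt{c}}\right) V$$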

Proposed cross-frame attention
- Each attention layer in this paper receives $m$ inputs $x^{1:m} = [x^1,...,x^m]$
- The projections therefore yield $m$ queries, keys, and values $Q^{1:m}, K^{1:m}, V^{1:m}$, with each $Q^k, K^k, V^k \in \mathbb{R}^{h \times w \times c}$
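
A minimal NumPy sketch of the idea, in which every frame attends to the first frame: each frame keeps its own queries, while keys and values are taken from frame 1. The function names and the flattened `(m, n, c)` layout with `n = h * w` are assumptions made for this illustration.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def cross_frame_attention(Q, K, V):
    """Q, K, V: arrays of shape (m, n, c), with n = h * w flattened positions."""
    c = Q.shape[-1]
    K1, V1 = K[0], V[0]                       # keys/values of the first frame, (n, c)
    scores = Q @ K1.T / np.sqrt(c)            # (m, n, n): each frame queries frame 1
    return softmax(scores, axis=-1) @ V1      # (m, n, c)

rng = np.random.default_rng(0)
m, n, c = 3, 16, 8
Q, K, V = (rng.standard_normal((m, n, c)) for _ in range(3))
out = cross_frame_attention(Q, K, V)
```

For the first frame this reduces to ordinary self-attention, and since all frames read appearance information from the same keys and values, the foreground object tends to stay consistent across the sequence.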

3.4. Conditional and Specialized Text-to-Video

3.5. Video Instruct-Pix2Pix

This variant builds on Instruct-Pix2Pix.
The self-attention layers of Instruct-Pix2Pix are replaced with the cross-frame attention proposed in this paper.
