Notes on Test Time Training for Video
Apr 23, 2025
Background reading: Understanding RNNs, test-time training, video diffusion transformers (DiTs)
Paper: One-Minute Video Generation with Test-Time Training
Overview.
The demo video was generated entirely by their approach: one minute long, reasonably photorealistic, and largely coherent across scenes.
Model.
- base: CogVideo-X 5B diffusion transformer (DiT) which generates 3s videos
- modifications:
- add TTT layers
- fine-tune
- extend context from 3s to 1min using long-context scaling
- dataset:
- 7 hours of Tom and Jerry clips
- hand-curated textual descriptions of scenes, with an LLM assisting in generation
- data engineering - slicing videos
- training:
- 50 hours on 256 H100s
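The core idea of a TTT layer is that the recurrent hidden state is itself the weights of a small model, updated by one gradient step of a self-supervised loss per token. Below is a minimal NumPy sketch of a TTT-Linear-style layer under that idea; the projection names (`theta_K`, `theta_V`, `theta_Q`), the zero-initialized fast weights, and the fixed step size `eta` are simplifying assumptions, not the paper's exact implementation.

```python
import numpy as np

def ttt_linear(tokens, d, eta=0.1, seed=0):
    """Sketch of a TTT-Linear layer: the hidden state is a weight
    matrix W, updated by one SGD step per token at test time.
    (Hypothetical simplification of the paper's layer.)"""
    rng = np.random.default_rng(seed)
    # Learned projections (random here; trained in the real model).
    theta_K = rng.standard_normal((d, d)) / np.sqrt(d)
    theta_V = rng.standard_normal((d, d)) / np.sqrt(d)
    theta_Q = rng.standard_normal((d, d)) / np.sqrt(d)
    W = np.zeros((d, d))  # fast weights = the layer's hidden state
    outputs = []
    for x in tokens:
        k, v, q = x @ theta_K, x @ theta_V, x @ theta_Q
        # One gradient step on the self-supervised reconstruction
        # loss 0.5 * ||k @ W - v||^2 with respect to W.
        grad = np.outer(k, k @ W - v)
        W = W - eta * grad
        outputs.append(q @ W)  # read out with the *updated* state
    return np.stack(outputs)
```

Because the state is fixed-size and each token costs one gradient step, the layer runs in time linear in sequence length, which is what makes the 3s → 1min context extension tractable compared with quadratic self-attention.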
Data pipeline.
- scenes and segments
- videos contain scenes
- scenes contain one to many segments
- 3-second segment as the atomic unit of text-to-video pairing
- video split into 3 second segments
- scenes described in text:
- Each 3-second segment is described by a paragraph of 3-5 sentences covering details such as background colors and camera movements. Groups of one or more paragraphs are explicitly marked as belonging to a scene with the keywords <scene start> and <scene end>
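The pipeline above can be sketched in a few lines: split a video into 3-second atomic segments, then wrap each scene's segment paragraphs with the scene keywords. The frame rate, function names, and exact text layout are my assumptions for illustration, not the paper's code.

```python
def slice_segments(num_frames, fps=16, seg_seconds=3):
    """Split a video into 3-second atomic segments, returned as
    (start, end) frame-index ranges. fps=16 is an assumed default."""
    seg_len = fps * seg_seconds
    return [(s, min(s + seg_len, num_frames))
            for s in range(0, num_frames, seg_len)]

def format_scene(paragraphs):
    """Wrap one scene's per-segment paragraphs with the scene
    keywords (exact formatting here is a guess)."""
    return "<scene start>\n" + "\n".join(paragraphs) + "\n<scene end>"
```

For example, a 100-frame clip at 16 fps yields segments `(0, 48)`, `(48, 96)`, and a short tail `(96, 100)`; each segment's paragraph then gets grouped under its scene.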
Notes.
- first stage: fine-tune the entire pretrained model on 3-second Tom and Jerry segments.
- over the next four stages, they fine-tune on videos of 9, 18, 30, and eventually 63 seconds
- only fine-tune the TTT layers, gates, and self-attention layers, using a lower learning rate during these four stages