Notes on Test Time Training for Video
Apr 23, 2025
Background reading: Understanding RNNs, test-time training, video diffusion transformers (DiTs)
Paper: One-Minute Video Generation with Test-Time Training
Overview.
The demo video was generated entirely by their approach: one minute long, reasonably photorealistic, and largely coherent across scenes.
Model.
- base: CogVideo-X 5B diffusion transformer (DiT) which generates 3s videos
- modifications:
- add TTT layers
- fine-tune
- extend context from 3s to 1min using long-context scaling
- dataset:
- 7 hours of Tom and Jerry clips
- hand-curated textual descriptions of scenes, with an LLM assisting in generation
- data engineering - slicing videos
- training:
- 50 hours on 256 H100s
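The core idea of a TTT layer is that the recurrent hidden state is itself the weights of a small model, updated by one gradient step of a self-supervised loss per token. Below is a minimal NumPy sketch of a TTT-Linear-style layer under that idea; the projection names (`theta_K`, `theta_V`, `theta_Q`), the zero-initialized fast weights, and the fixed step size `eta` are simplifying assumptions, not the paper's exact implementation.

```python
import numpy as np

def ttt_linear(tokens, d, eta=0.1, seed=0):
    """Sketch of a TTT-Linear layer: the hidden state is a weight
    matrix W, updated by one SGD step per token at test time.
    (Hypothetical simplification of the paper's layer.)"""
    rng = np.random.default_rng(seed)
    # Learned projections (random here; trained in the real model).
    theta_K = rng.standard_normal((d, d)) / np.sqrt(d)
    theta_V = rng.standard_normal((d, d)) / np.sqrt(d)
    theta_Q = rng.standard_normal((d, d)) / np.sqrt(d)
    W = np.zeros((d, d))  # fast weights = the layer's hidden state
    outputs = []
    for x in tokens:
        k, v, q = x @ theta_K, x @ theta_V, x @ theta_Q
        # One gradient step on the self-supervised reconstruction
        # loss 0.5 * ||k @ W - v||^2 with respect to W.
        grad = np.outer(k, k @ W - v)
        W = W - eta * grad
        outputs.append(q @ W)  # read out with the *updated* state
    return np.stack(outputs)
```

Because the state is fixed-size and each token costs one gradient step, the layer runs in time linear in sequence length, which is what makes the 3s → 1min context extension tractable compared with quadratic self-attention.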
Data pipeline.
- scenes and segments
- videos contain scenes
- scenes contain one to many segments
- 3-second segment as the atomic unit of text-to-video pairing
- video split into 3 second segments
- scenes described in text:
- Each 3-second segment is described by a paragraph of 3-5 sentences covering details such as background colors and camera movements. Groups of one or more paragraphs are explicitly marked as belonging to a scene with the keywords <scene start> and <scene end>
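The pipeline above can be sketched in a few lines: split a video into 3-second atomic segments, then wrap each scene's segment paragraphs with the scene keywords. The frame rate, function names, and exact text layout are my assumptions for illustration, not the paper's code.

```python
def slice_segments(num_frames, fps=16, seg_seconds=3):
    """Split a video into 3-second atomic segments, returned as
    (start, end) frame-index ranges. fps=16 is an assumed default."""
    seg_len = fps * seg_seconds
    return [(s, min(s + seg_len, num_frames))
            for s in range(0, num_frames, seg_len)]

def format_scene(paragraphs):
    """Wrap one scene's per-segment paragraphs with the scene
    keywords (exact formatting here is a guess)."""
    return "<scene start>\n" + "\n".join(paragraphs) + "\n<scene end>"
```

For example, a 100-frame clip at 16 fps yields segments `(0, 48)`, `(48, 96)`, and a short tail `(96, 100)`; each segment's paragraph then gets grouped under its scene.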
Notes.
- first stage: fine-tune the entire pretrained model on 3-second Tom and Jerry segments.
- over the next four stages, they fine-tune on videos of 9, 18, 30, and eventually 63 seconds
- only fine-tune the TTT layers, gates, and self-attention layers, using a lower learning rate during these four stages