Notes on Test Time Training for Video

Apr 23, 2025

Background reading: Understanding RNNs, test-time training, video diffusion transformers (DiTs)

Links: website, code

Paper: One-Minute Video Generation with Test-Time Training

Overview.

The demo video was generated entirely by their approach: one minute long, reasonably photorealistic, with strong coherence.

Model.

  • base: CogVideo-X 5B diffusion transformer (DiT) which generates 3s videos
  • modifications:
  • dataset:
    • 7 hours of Tom and Jerry clips
    • hand-curated textual descriptions of scenes, with LLM to assist in generation
    • data engineering - slicing videos
  • training:
    • 50 hours on 256 H100s

Data pipeline.

  • scenes and segments
    • videos contain scenes
    • scenes contain one to many segments
    • 3-second segment as the atomic unit of text-to-video pairing
  • video split into 3-second segments
  • scenes described in text:
    • Each 3-second segment is described by a paragraph of 3-5 sentences, containing details such as background colors and camera movements. Groups of one or more paragraphs are strictly enforced as belonging to certain scenes with the keywords <scene start> and <scene end>
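The scene/segment structure above can be sketched in a few lines of Python. This is only an illustration of the notes, not the authors' pipeline: the `Scene` class, `segment_bounds`, and `format_prompt` are hypothetical names, and the exact text layout (keywords on their own lines, newline-separated paragraphs) is an assumption. Only the 3-second segment length and the `<scene start>` / `<scene end>` keywords come from the notes.

```python
from dataclasses import dataclass
from typing import List, Tuple

SEGMENT_SECONDS = 3  # atomic unit of text-to-video pairing, per the notes


@dataclass
class Scene:
    # One paragraph (3-5 sentences) for each 3-second segment in the scene.
    segment_paragraphs: List[str]


def segment_bounds(duration_s: int, segment_s: int = SEGMENT_SECONDS) -> List[Tuple[int, int]]:
    """Return (start, end) offsets in seconds for each full segment.

    Assumption: a trailing remainder shorter than one segment is dropped.
    """
    return [(t, t + segment_s) for t in range(0, duration_s - segment_s + 1, segment_s)]


def format_prompt(scenes: List[Scene]) -> str:
    """Wrap each scene's paragraphs in the scene keywords (layout is assumed)."""
    lines: List[str] = []
    for scene in scenes:
        lines.append("<scene start>")
        lines.extend(scene.segment_paragraphs)
        lines.append("<scene end>")
    return "\n".join(lines)


if __name__ == "__main__":
    scenes = [
        Scene(["Paragraph for segment 1.", "Paragraph for segment 2."]),
        Scene(["Paragraph for segment 3."]),
    ]
    print(segment_bounds(9))
    print(format_prompt(scenes))
```

With this layout, a video's text annotation is the concatenation of its scenes, and the number of paragraphs equals the number of 3-second segments.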

Notes.

  • first stage: fine-tune the entire pretrained model on 3-second Tom and Jerry segments.
  • over the next four stages, fine-tune on videos of 9, 18, 30, and eventually 63 seconds
    • during these four stages, only fine-tune the TTT layers, gates, and self-attention layers, using a lower learning rate
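The staged schedule can be written down as a small config table. This is a sketch under stated assumptions: the `Stage` class and the parameter-group names (`"ttt_layers"`, `"gates"`, `"self_attention"`, `"all"`) are hypothetical labels, not identifiers from the released code, and no learning-rate values are given because the notes do not state them.

```python
from dataclasses import dataclass
from typing import Tuple

SEGMENT_SECONDS = 3  # segment length from the data pipeline notes


@dataclass(frozen=True)
class Stage:
    video_seconds: int           # length of training videos in this stage
    trainable: Tuple[str, ...]   # hypothetical parameter-group names
    lower_lr: bool               # True if the stage uses the lower learning rate


ALL_PARAMS = ("all",)
PARTIAL = ("ttt_layers", "gates", "self_attention")

# Stage 1 adapts the whole model on 3-second segments; the next four
# stages extend the context and train only a subset of parameters.
STAGES = [Stage(3, ALL_PARAMS, False)] + [
    Stage(seconds, PARTIAL, True) for seconds in (9, 18, 30, 63)
]


def segments_per_video(stage: Stage) -> int:
    """Number of 3-second segments in one training video of this stage."""
    return stage.video_seconds // SEGMENT_SECONDS


def is_trainable(group: str, stage: Stage) -> bool:
    """Whether a (hypothetical) parameter group is updated in this stage."""
    return "all" in stage.trainable or group in stage.trainable


if __name__ == "__main__":
    for i, stage in enumerate(STAGES, start=1):
        print(i, stage.video_seconds, segments_per_video(stage), stage.trainable)
```

Under the 3-second segment length, the five stages correspond to 1, 3, 6, 10, and 21 segments per video.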