Machine learning
In 2025, I’m going deep into ML and want a job. There is a large cache of notes here.
Sub-pages.
- Heuristics
- Natural ML
- Notes on RNNs
- Notes on Test Time Training
- Notes on Test Time Training in Video
- Notes on the move to foundation models
Talks.
- TikTok’s recommendation system (2025). Presented to a distributed systems study group. (slides)
- Summarising these papers:
- Monolith: Real Time Recommendation System With Collisionless Embedding Table (2022)
- Deep Retrieval: Learning A Retrievable Structure for Large-Scale Recommendations (2020)
- Deep Neural Networks for YouTube Recommendations (2016)
Papers.
- Techniques
- LLMs
- Recsys
- RNNs and test-time training
- Video
- Compression
- Language modelling is compression
Distillations.
AI is statistics (science) applied to big data (engineering).
- Foundations: scientific method
- Prediction = compression (Hutter) - sketch below.
- ML = data (x, y) -> optimizer -> learned function f(x; P)
- Embeddings: word2vec for anything
- Attention: sequence compression O(N^2), probabilistic weight sharing - sketch below.
- Test-time-training: sequence compression O(N)
- Feature engineering
- YouTube recsys.
- OpenAI tiktoken: n-grams and BPE - example below.
- Rank factorization, LoRAs - sketch below.
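A minimal sketch of the prediction = compression bullet, assuming nothing beyond the standard arithmetic-coding argument: a model’s summed -log2 probabilities over a sequence is the number of bits an ideal arithmetic coder driven by that model would need, so a better predictor is literally a better compressor. The toy distribution is made up to keep the arithmetic visible.

```python
import math

# Toy next-symbol model: a fixed distribution (a real LM would condition on context).
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

def ideal_code_length(sequence: str) -> float:
    """Bits an ideal arithmetic coder paired with this model would emit: sum of -log2 p(symbol)."""
    return sum(-math.log2(probs[ch]) for ch in sequence)

print(ideal_code_length("aaab"))  # 5.0 bits: the model predicts this sequence well
print(ideal_code_length("dddd"))  # 12.0 bits: worse predictions -> longer code
```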
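The attention and TTT bullets, made concrete with a plain NumPy sketch (single head, no masking, not any particular paper’s implementation): the score matrix is N x N, which is where attention’s O(N^2) comes from; TTT-style layers instead keep a fixed-size state updated once per token, hence the O(N) claim.

```python
import numpy as np

def attention(Q, K, V):
    """Single-head scaled dot-product attention; `scores` is (N, N) -- the O(N^2) term."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # every token scored against every token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # (N, d)

N, d = 8, 16
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
print(attention(Q, K, V).shape)  # (8, 16)
```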
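The tiktoken bullet as a usage example; I’m assuming the cl100k_base encoding here, but any of the library’s encodings behaves the same way.

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("Language modelling is compression")
print(tokens)              # integer token ids
print(enc.decode(tokens))  # round-trips back to the original string
print(len(tokens))         # BPE merges frequent byte sequences, so far fewer tokens than characters
```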
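And the rank-factorization bullet: keep the pretrained W frozen and learn a low-rank delta BA with r much smaller than the matrix dimensions, scaled by alpha/r. Shapes and init follow the usual LoRA recipe, but this is a NumPy illustration of the forward pass, not a training loop.

```python
import numpy as np

d, k, r, alpha = 512, 512, 8, 16
rng = np.random.default_rng(0)

W = rng.standard_normal((d, k))          # frozen pretrained weight
A = rng.standard_normal((r, k)) * 0.01   # trainable, r x k (small random init)
B = np.zeros((d, r))                     # trainable, d x r (zero init, so the delta starts at 0)

def lora_forward(x):
    """y = x W^T + (alpha / r) * x (B A)^T; only A and B would be trained."""
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = rng.standard_normal((2, k))
print(lora_forward(x).shape)    # (2, 512)
print(W.size, A.size + B.size)  # 262144 full parameters vs 8192 trainable low-rank parameters
```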
My story of the field.
I’ve been interested in ML since high school, ever since DeepDream came out. But I chose to go into crypto so I could travel the world. Now I’m back into ML.
Modern AI has existed since 2010, when we combined big data (ImageNet) with big compute (GPUs). Capabilities have improved in a very steady, linear way ever since:
- Google, web crawl dataset, eigenvectors (2000)
- …
- GPU parallelism + large datasets (imagenet)
- RNNs, CNNs
- batchnorm / dropout / simplifying CNNs
- relu/swiglu
- deepdream
- resnets, highway nets, information bottleneck thesis
- GANs
- Adam
- transformers
- scaling (chinchilla) / gpt2 / commoncrawl
- bitter lesson (2019)
- gpt3.5/RLHF
- diffusion models (sd)
- quantization
- LoRAs
- recsys
- TikTok Monolith - online learning.
- cut-cross entropy / logit materialisation
- P2P training: Nous DisTrO
- inference-time compute / reasoning models / GRPO / DeepSeek / o1
- AI game engines (gamengen)
- video models - Sora, Veo
- realtime multimodal AI - text, image, voice
- test-time training
- 1min video coherency
I believe the progress will continue, linearly. The major drivers are:
- Energy (power grids).
- Add more compute, get more intelligence.
- Statistics.
- At its core, the ChatGPT unlock was about four things: attention, scaling compute, a good dataset, and RLHF.
- Core unlocks like attention and TTT.
- Software.
- Cut-Cross Entropy is one example.
- Quantization is another.
- Hardware.
- GPUs, tensor cores, TPUs.
- Optimizing for hardware layout.
- Data.
- TikTok gets this: online learning makes the system better, which drives more usage, which yields more training data.
- Product.
- This is probably the most counterintuitive one here. But hear me out.
- DeepSeek is interesting because the reward signal comes from an external tool - Python evaluating math equations (see the sketch after this list).
- OpenAI is the best-in-class consumer product, and their next iteration as of March 2025 is building tooling integrations.
- Tooling is the cheapest way to get more signal, and thus more data.
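What the DeepSeek point above looks like in code, roughly: the RL reward comes from a cheap external verifier rather than a learned reward model. The checker below is a hypothetical rule-based sketch of my own, not DeepSeek’s actual reward code; a real GRPO-style pipeline would also add formatting rewards and score many samples per prompt.

```python
import re

def math_reward(model_output: str, ground_truth: float) -> float:
    """Rule-based reward: 1.0 if the last number in the model's answer matches the ground truth.
    Hypothetical sketch of a tool/verifier-based reward signal, not DeepSeek's implementation."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output)
    if not numbers:
        return 0.0
    return 1.0 if abs(float(numbers[-1]) - ground_truth) < 1e-6 else 0.0

print(math_reward("The answer is 42.", 42))  # 1.0
print(math_reward("I think it is 41.", 42))  # 0.0
```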
Questions.
- When will we get full AI generated TV shows like Seinfeld?
- If AI is prediction, and prediction is compression, will video codecs be replaced by AI embeddings / TV tokens? e.g. ts_zip for TV
- When will intelligence become like Docker containers?
- Base image (alpine) for English language, base image for reasoning (logic), and then other layers for domain-specific knowledge (Matrix-style hot patched)?
Interesting ideas.
- pentesting the law with LLMs and RL
- natural things as ML systems
- scifi
- Humans are a biological bootloader for digital intelligence.
- jailbreaking the simulation.
- media
- Media is programming. Genres, themes, motifs, plots, character descriptions, arcs, recurrent bits, one-off features - these are all as much primitives as HTML, React views, react-query, useState, useEffect, CSS modules, API routes are. Atomization.
- The internet made the cost of distributing content marginal. Now due to AI, the cost of producing content falls to zero. Attention is still scarce. Taste is still scarce.
- intelligence
- how much intelligence do we need to do <X>? we can estimate how much wood we need to get light for 1hr. but we don’t even have a unit for intelligence (tokens?). interesting post on this.
- “neural <X>”
- discrete neural networks that emulate a digital circuit (see: Jane St problem), interacting with continuous neural networks (transformers). What could you build here?
- neural BitTorrent: RL to train agents that maximize utility (download speed, swarm health)
- neural Bitcoin: learned hash functions (embeddings) instead of sha256, learned difficulty approximation instead of moving average.
- neural DHTs: use embeddings instead of cryptographic hash functions; nodes store content related to topics (i.e. embedding clusters) rather than uniformly distributed keys. Toy sketch below.
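A toy sketch of the neural DHT idea: route content to the node whose advertised topic embedding is nearest, instead of hashing keys onto a uniform ring. Everything here is illustrative - `embed` is a stand-in with no real semantics (it just hashes to a pseudo-random unit vector), so a real embedding model would be needed to see actual topical clustering.

```python
import hashlib
import numpy as np

DIM = 32

def embed(text: str) -> np.ndarray:
    """Stand-in for a real text-embedding model: a deterministic pseudo-random unit vector."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
    vec = np.random.default_rng(seed).standard_normal(DIM)
    return vec / np.linalg.norm(vec)

# Each node advertises a topic centroid instead of owning a uniform slice of a hash ring.
nodes = {name: embed(topic) for name, topic in
         [("node-a", "machine learning"), ("node-b", "cooking"), ("node-c", "football")]}

def route(content: str) -> str:
    """Route content to the node with the most similar topic embedding (dot product on unit vectors)."""
    c = embed(content)
    return max(nodes, key=lambda n: float(nodes[n] @ c))

print(route("test-time training for RNNs"))  # with a real embedding model this should land on node-a
```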
On Sama
I like this position - https://ia.samaltman.com/