Machine learning
In 2025, I’m going deep into ML and want a job. There is a large cache of notes here.
Sub-pages.
- Heuristics
- Natural ML
- Notes on RNNs
- Notes on Test Time Training
- Notes on Test Time Training in Video
- Notes on the move to foundation models
Talks.
- TikTok’s recommendation system (2025). Presented to a distributed systems study group. (slides)
- Summarising these papers:
- Monolith: Real Time Recommendation System With Collisionless Embedding Table (2022)
- Deep Retrieval: Learning A Retrievable Structure for Large-Scale Recommendations (2020)
- Deep Neural Networks for YouTube Recommendations (2016)
Papers.
- Techniques
- LLMs
- Recsys
- RNNs and test-time training
- Video
- Compression
- Language modelling is compression
Distillations.
AI is statistics (science) applied to big data (engineering).
- Foundations: scientific method
- Prediction = compression (Hutter) - sketch below.
- ML = data (x, y) -> optimizer -> learned function f(x; P)
- Embeddings: word2vec for anything
- Attention: sequence compression O(N^2), probabilistic weight sharing - sketch below.
- Test-time-training: sequence compression O(N)
- Feature engineering
- YouTube recsys.
- OpenAI tiktoken: n-grams and BPE - example below.
- Rank factorization, LoRAs - sketch below.
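A minimal sketch of the prediction = compression bullet, assuming nothing beyond the standard arithmetic-coding argument: a model’s summed -log2 probabilities over a sequence is the number of bits an ideal arithmetic coder driven by that model would need, so a better predictor is literally a better compressor. The toy distribution is made up to keep the arithmetic visible.

```python
import math

# Toy next-symbol model: a fixed distribution (a real LM would condition on context).
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

def ideal_code_length(sequence: str) -> float:
    """Bits an ideal arithmetic coder paired with this model would emit: sum of -log2 p(symbol)."""
    return sum(-math.log2(probs[ch]) for ch in sequence)

print(ideal_code_length("aaab"))  # 5.0 bits: the model predicts this sequence well
print(ideal_code_length("dddd"))  # 12.0 bits: worse predictions -> longer code
```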
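The attention and TTT bullets, made concrete with a plain NumPy sketch (single head, no masking, not any particular paper’s implementation): the score matrix is N x N, which is where attention’s O(N^2) comes from; TTT-style layers instead keep a fixed-size state updated once per token, hence the O(N) claim.

```python
import numpy as np

def attention(Q, K, V):
    """Single-head scaled dot-product attention; `scores` is (N, N) -- the O(N^2) term."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # every token scored against every token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # (N, d)

N, d = 8, 16
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
print(attention(Q, K, V).shape)  # (8, 16)
```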
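The tiktoken bullet as a usage example; I’m assuming the cl100k_base encoding here, but any of the library’s encodings behaves the same way.

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("Language modelling is compression")
print(tokens)              # integer token ids
print(enc.decode(tokens))  # round-trips back to the original string
print(len(tokens))         # BPE merges frequent byte sequences, so far fewer tokens than characters
```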
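And the rank-factorization bullet: keep the pretrained W frozen and learn a low-rank delta BA with r much smaller than the matrix dimensions, scaled by alpha/r. Shapes and init follow the usual LoRA recipe, but this is a NumPy illustration of the forward pass, not a training loop.

```python
import numpy as np

d, k, r, alpha = 512, 512, 8, 16
rng = np.random.default_rng(0)

W = rng.standard_normal((d, k))          # frozen pretrained weight
A = rng.standard_normal((r, k)) * 0.01   # trainable, r x k (small random init)
B = np.zeros((d, r))                     # trainable, d x r (zero init, so the delta starts at 0)

def lora_forward(x):
    """y = x W^T + (alpha / r) * x (B A)^T; only A and B would be trained."""
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = rng.standard_normal((2, k))
print(lora_forward(x).shape)    # (2, 512)
print(W.size, A.size + B.size)  # 262144 full parameters vs 8192 trainable low-rank parameters
```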
My story of the field.
I’ve been interested in ML since high school, ever since DeepDream came out. But I chose to go into crypto so I could travel the world. Now I’m back into ML.
Modern AI has existed since 2010, when we combined big data (ImageNet) with big compute (GPUs). Capabilities have improved in a very steady, linear way ever since:
- Google, web crawl dataset, eigenvectors (2000)
- …
- GPU parallelism + large datasets (imagenet)
- RNNs, CNNs
- batchnorm / dropout / simplifying CNNs
- relu/swiglu
- deepdream
- resnets, highway nets, information bottleneck thesis
- GANs
- Adam
- transformers
- scaling (chinchilla) / gpt2 / commoncrawl
- bitter lesson (2019)
- gpt3.5/RLHF
- diffusion models (sd)
- quantization
- LoRAs
- recsys
- TikTok Monolith - online learning.
- cut-cross entropy / logit materialisation
- P2P training: Nous DisTrO
- inference-time compute / reasoning models / GRPO / DeepSeek / o1
- AI game engines (gamengen)
- video models - Sora, Veo
- realtime multimodal AI - text, image, voice
- test-time training
- 1min video coherency
I believe the progress will continue, linearly. The major drivers are:
- Energy (power grids).
- Add more compute, get more intelligence.
- Statistics.
- At its core, the ChatGPT unlock was about four things: attention, scaling compute, a good dataset, and RLHF.
- Core unlocks like attention and TTT.
- Software.
- Cut-Cross Entropy is one example.
- Quantization is another.
- Hardware.
- GPUs, tensor cores, TPUs.
- Optimizing for hardware layout.
- Data.
- TikTok gets this: online learning makes the system better, which drives more usage, which yields more training data.
- Product.
- This is probably the most counterintuitive one here. But hear me out.
- DeepSeek is interesting because the reward signal comes from an external tool - Python evaluating math equations (see the sketch after this list).
- OpenAI is the best-in-class consumer product, and their next iteration as of March 2025 is building tooling integrations.
- Tooling is the cheapest way to get more signal, and thus more data.
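What the DeepSeek point above looks like in code, roughly: the RL reward comes from a cheap external verifier rather than a learned reward model. The checker below is a hypothetical rule-based sketch of my own, not DeepSeek’s actual reward code; a real GRPO-style pipeline would also add formatting rewards and score many samples per prompt.

```python
import re

def math_reward(model_output: str, ground_truth: float) -> float:
    """Rule-based reward: 1.0 if the last number in the model's answer matches the ground truth.
    Hypothetical sketch of a tool/verifier-based reward signal, not DeepSeek's implementation."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output)
    if not numbers:
        return 0.0
    return 1.0 if abs(float(numbers[-1]) - ground_truth) < 1e-6 else 0.0

print(math_reward("The answer is 42.", 42))  # 1.0
print(math_reward("I think it is 41.", 42))  # 0.0
```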
Questions.
- When will we get full AI generated TV shows like Seinfeld?
- If AI is prediction, and prediction is compression, will video codecs be replaced by AI embeddings / TV tokens? e.g. ts_zip for TV
- When will intelligence become like Docker containers?
- Base image (alpine) for English language, base image for reasoning (logic), and then other layers for domain-specific knowledge (Matrix-style hot patched)?
Interesting ideas.
- pentesting the law with LLMs and RL
- natural things as ML systems
- scifi
- Humans are a biological bootloader for digital intelligence.
- jailbreaking the simulation.
- media
- Media is programming. Genres, themes, motifs, plots, character descriptions, arcs, recurrent bits, one-off features - these are all as much primitives as HTML, React views, react-query, useState, useEffect, CSS modules, API routes are. Atomization.
- The internet made the cost of distributing content marginal. Now due to AI, the cost of producing content falls to zero. Attention is still scarce. Taste is still scarce.
- intelligence
- how much intelligence do we need to do <X>? we can estimate how much wood we need to get light for 1hr. but we don’t even have a unit for intelligence (tokens?). interesting post on this.
- “neural <X>”
- discrete neural networks that emulate a digital circuit (see: Jane St problem), interacting with continuous neural networks (transformers). What could you build here?
- neural BitTorrent: RL to train agents that maximize utility (download speed, swarm health)
- neural Bitcoin: learned hash functions (embeddings) instead of sha256, learned difficulty approximation instead of moving average.
- neural DHTs: use embeddings instead of cryptographic hash functions; nodes store content related to topics (i.e. embedding clusters) rather than uniformly distributed keys. Toy sketch below.
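A toy sketch of the neural DHT idea: route content to the node whose advertised topic embedding is nearest, instead of hashing keys onto a uniform ring. Everything here is illustrative - `embed` is a stand-in with no real semantics (it just hashes to a pseudo-random unit vector), so a real embedding model would be needed to see actual topical clustering.

```python
import hashlib
import numpy as np

DIM = 32

def embed(text: str) -> np.ndarray:
    """Stand-in for a real text-embedding model: a deterministic pseudo-random unit vector."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
    vec = np.random.default_rng(seed).standard_normal(DIM)
    return vec / np.linalg.norm(vec)

# Each node advertises a topic centroid instead of owning a uniform slice of a hash ring.
nodes = {name: embed(topic) for name, topic in
         [("node-a", "machine learning"), ("node-b", "cooking"), ("node-c", "football")]}

def route(content: str) -> str:
    """Route content to the node with the most similar topic embedding (dot product on unit vectors)."""
    c = embed(content)
    return max(nodes, key=lambda n: float(nodes[n] @ c))

print(route("test-time training for RNNs"))  # with a real embedding model this should land on node-a
```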
On Sama
I like this position - https://ia.samaltman.com/