ML notes (uncategorised)
Mutual information.
https://sumanthrh.com/post/notes-on-generalization/
- the mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables
- MI can be defined by the classic KL divergence
- Minimum program to produce the data
- Theoretical “best model” for prediction.
Theory of learning.
- Slow-fast weight programmer (90’s) - https://old.reddit.com/r/MachineLearning/comments/megi8a/d_jürgen_schmidhubers_work_on_fast_weights_from/
- Attention as dynamic weight lookup
- Information highway / bottleneck
- An Observation on Generalization
- Supervised vs unsupervised learning
- Learning is prediction is compression
- Hutter
An observation on generalization.
https://sumanthrh.com/post/notes-on-generalization/
Basic architectural concepts of LLMs.
https://huggingface.co/blog/andmholm/what-is-a-transformer
-
Overview
- LLMs learn continuous representations of the data distribution; they are not discrete.
-
Raw input data.
- Text.
-
Tokens.
- Smaller units of text (words, subwords, characters).
- Tokenisation:
- Method: Byte Pair Encoding (BPE), SentencePiece
- Process:
- Input text is segmented into subword units (tokens).
- Vocabulary typically around 50,000–100,000 tokens.
- Special tokens: [CLS], [SEP], [EOS], [PAD], <|endoftext|>, etc.
-
Embeddings.
- Learned vector representations for each token (size: typically 768–12,288 dimensions).
- Input: token.
- Output: vector embedding.
-
Vocabulary.
- A mapping from token string to integer index: (i, w) for each token w in the vocabulary
- vocabulary = [aa, ab, …, zz]
-
Context window/length.
- The maximum number of tokens processed at once; it sets the input and output sequence shape of the network.
-
Positional encoding.
- Encodes the index of a token in a sequence.
- Uses sine/cosine.
- Input: vector embedding.
- Output: vector embedding.
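- A minimal sketch of the sinusoidal scheme (frequencies follow the original Transformer formula; d_model is assumed even):

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    # pos / 10000^(2i/d_model): sin on even dims, cos on odd dims
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]         # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# added to the token embeddings: x = embeddings + sinusoidal_positions(seq_len, d_model)
```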
-
Transformer Blocks.
- Input: sequence of embedding vectors (seq_len × d_model).
- Output: sequence of vectors of the same shape.
-
Feed Forward Network (FFN).
- Non-linear transformation.
- Concept:
- activation(Wx+B)
- Weights and biases.
- Activation function - introduces nonlinearity, allowing bends in otherwise linear projections.
- Input: vectors
- Output: vectors
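- A rough sketch of a transformer-style position-wise FFN (the 4× hidden expansion and GELU are common choices, assumed here rather than taken from these notes):

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, d_model=768, d_hidden=3072):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_hidden)   # expand
        self.w2 = nn.Linear(d_hidden, d_model)   # project back
        self.act = nn.GELU()                     # nonlinearity (the "bend")

    def forward(self, x):                        # x: (batch, seq_len, d_model)
        return self.w2(self.act(self.w1(x)))
```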
-
Layer Norm.
- Normalisation.
- The layer normalization module normalises the features to zero mean and unit variance, then uses gamma (γ) to scale and beta (β) to shift them.
```python
# calc mean & variance (along dm)
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=True, keepdim=True)

# normalize, scale & shift | shape: out - (batch_size, seq_len, dm)
norm = (x - mean) / torch.sqrt(var + self.eps)
out = norm * self.gamma + self.beta
```
-
Residual Connections.
- See: Resnets.
- Shortcuts for gradients between sublayers, which help prevent gradients from vanishing during back-propagation
-
Dropout.
- Prevents overfitting to the dataset.
- A fraction (e.g. 10%) of activations is randomly zeroed during training.
-
Logits.
- Type: real numbers (unnormalised scores, one per vocabulary token).
-
Softmax.
- Takes as input a vector z of K real numbers, and normalizes it into a probability distribution consisting of K probabilities proportional to the exponentials of the input numbers
- Function:
- $f(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$
- Input: real numbers
- Output: probs
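- A tiny numpy version (subtracting the max is a standard numerical-stability trick; it doesn't change the result):

```python
import numpy as np

def softmax(z):
    z = z - z.max()               # stability shift
    e = np.exp(z)
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # ≈ [0.659, 0.242, 0.099], sums to 1
```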
-
Logprobs.
- Type: logarithms of probabilities.
- Why logprobs instead of probs:
-
Rationale:
-
Loss is defined as the distance between predicted and true distribution.
-
Reducing the distance = more accurate predictions.
-
Our distribution is a sequence of tokens, defined as the context length.
-
A language model predicts the next token conditioned on previous tokens. The full expression for such a prediction:
-
Probability recap:
- For independent events A and B, P(A and B) = P(A) × P(B)
- Independence means one outcome does not change the probability of the other.
- P(heads and heads) = 1/2 × 1/2
- For mutually exclusive events A and B, P(A or B) = P(A) + P(B)
- The events cannot co-occur, so their probabilities simply add (and everything still sums to 1).
- P(rain or sun, but nothing else) = P(rain) + P(sun)
- Our tokens are conditioned on previous tokens. They are not mutually exclusive. The events are dependent - token A is preceded by token B.
- P(”cat”) = P(c | “”) * P(a | “c”) * P(t | “ca”)
- Probability of “cat” is given by (probability of “c”) times (prob of “a” given c) times (prob t given ca).
-
The joint probability of a chain of events shrinks as the chain gets longer:
- P(t0,…,tn) = P(t0) * P(t1 | t0) * P(t2 | t0,t1) * … * P(tn | t0,t1,…,tn-1)
- Multiplying lots of really small numbers = incredibly small numbers.
- 0.1 * 0.1 = 0.01
- The result shrinks with every multiplication, quickly approaching the limit of what finite precision can represent.
- How many values can we represent with a uint16? $2^{16}=65{,}536$; normalised to [0,1) that is a finite step size / granularity of $\frac{1}{65536}\approx 1.5\times10^{-5}$.
-
Multiplying many numbers in [0, 1) causes numeric underflow at finite precision; summing logprobs instead avoids that.
- Log of products becomes a sum:
- Probs: P(tn) = P(t0) * P(t1 | t0) * P(t2 | t1,t0) * … * P(tn | t0,t1,tn-1)
- Logprobs: logP(tn) = logP(t0) + logP(t1∣t0) + logP(t2∣t0,t1) + ⋯+ logP(tn∣t0,…,tn−1)
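- A quick sketch of why this matters numerically (the 0.1 probabilities are made up):

```python
import math

probs = [0.1] * 400                          # 400 token-level conditional probabilities

product = 1.0
for p in probs:
    product *= p
print(product)                               # 0.0 — underflows float64

log_sum = sum(math.log(p) for p in probs)
print(log_sum)                               # ≈ -921.03 — perfectly representable
```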
-
Coming back to our NN - in order to train a neural network, we define the objective function, which states the distance between the predicted distribution and the true distribution. This is referred to as the loss.
- $L = distance(pred, true)$
-
This is our model of the predicted distribution:
- Note that in order to train efficiently without exploding numeric precision, we use logprobs.
- $\log P(w_1,w_2,\dots,w_n)=\log P(w_1)+\log P(w_2\mid w_1)+\log P(w_3\mid w_1,w_2)+\dots+\log P(w_n\mid w_1,\dots,w_{n-1})$
- We can refactor this with the sum symbol:
- $\sum_t \log P(w_t\mid w_{<t})$
-
This is the true distribution:
- The output of our model (naively, without logprob) is a vector of probabilities.
- The true distribution is also a vector of probabilities.
- The true distribution is:
- Probability 1 for the correct token
- Probability 0 for all others
- All probabilities sum to 1, but there’s only one value set to 1. So this is called one-hot encoding.
-
How do we define the distance between the true and predicted?
-
We use a tool called KL divergence, which measures how far one probability distribution is from another (it is not a true metric, since it is asymmetric, but it serves as our notion of distance).
This looks crazy but it’s really simple.
$P : truth$
$\hat{P} : predicted$
The KL divergence specifies the distance between the true distribution and the predicted distribution.
$KL(P \| \hat{P})=\sum_x P(x)\log\frac{P(x)}{\hat{P}(x)}$
The expression can be expanded as:
$KL(P \| \hat{P})=-\sum_x P(x)\log\hat{P}(x)-\left(-\sum_x P(x)\log P(x)\right)$
-
This expression matches the definitions of cross-entropy $H(P,\hat{P})$ and entropy $H(P)$, so we rewrite it as:
$KL(P \| \hat{P})=H(P,\hat{P})-H(P)$
KL divergence = CrossEntropy(truth, pred) − Entropy(truth)
- Note that the true distribution does not change. The dataset used in training does not change. The entropy of this dataset does not change.
During training, we are attempting to minimise the distance between the distributions. A perfect model would have a distance of 0.
Note that right now, one of these terms is constant.
KL divergence = CrossEntropy(truth, pred) − Entropy(truth)
KL divergence = CrossEntropy(truth, pred) − SOME_CONSTANT
We don't actually care about the constant. Whether the distance is CrossEntropy − SOME_CONSTANT or just CrossEntropy doesn't change what minimises it; we are only minimising the cross-entropy term.
So we can drop SOME_CONSTANT, which is in fact the entropy of the truth distribution, leaving us with this definition of the objective function:
$H(P,\hat{P})=-\sum_x P(x)\log\hat{P}(x)$
- There is one more simplification to make here. In language modelling tasks, the truth distribution is one-hot, so the term P(x) is 1 for exactly one value in the sum and 0 for all others, cancelling every other term. This lets us simplify the sum. A worked example:
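- A minimal worked case with a 3-token vocabulary, where the true token is the second one and the model predicts [0.7, 0.1, 0.2]:
$H(P,\hat{P})=-\sum_x P(x)\log\hat{P}(x)=-(0\cdot\log 0.7 + 1\cdot\log 0.1 + 0\cdot\log 0.2)=-\log 0.1\approx 2.30$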

-
-
The final definition of the loss function, for stating the distance between the true distribution and the predicted distribution, simplified with respect to our domain where the true distribution is one-hot encoded, is this:
$L=H(P,\hat{P})=-log\hat{P}(x)$
- For a single token prediction, the loss is:
$L_t=−log{P}(w_t∣w_{<t})$
-
For a sequence of tokens prediction, the loss is:
-
Each event is the prediction of a token given the previous tokens.
-
So we expand our loss function to the following definition:
- $L=-\sum_t{logP(w_t∣w_{<t})}$
-
-
Our goal is to minimise the loss.
-
How do we optimize?
-
Purpose:
- Maximise the logprob of the correct token given the previous tokens; the loss is the negative of that logprob, so reducing loss = better predictions.
- NN learns to predict the next token by:
-
Training:
- training samples: [[x0, x1], [x1, x2]]
- text is x
- split it token-by-token into x0, x1, x2
- each training sample is [x0, x1] - predict next token
- forward pass: [x0 → F(x0) → x’1]
- given x0 predict x1
- x1 is our ground truth, x’1 is our prediction
- objective/loss:
- objective is to model predicting the next token
- loss measures distance from truth, summed over all token positions
```python
import numpy as np

vocab = ["the", "cat", "sat"]

# Ground truth token is 'cat' (index 1)
true_index = 1

# Model output (probabilities from softmax) over the vocabulary:
# P('the'), P('cat'), P('sat')
pred_probs = np.array([0.7, 0.1, 0.2])

# Probability assigned to the correct token.
p_correct = pred_probs[true_index]  # 0.1

# Cross-entropy loss for a single token is the negative logprob of the correct token.
loss = -np.log(p_correct)
print(loss)  # ≈ 2.303
```
-
Our optimizer for the loss function is what minimises loss.
- We use stochastic gradient descent.
- SGD differentiates the loss function, which involves differentiating the entire neural net, in order to compute gradients.
- We then adjust the parameters in the direction against the gradient.
- Simple example:
- Imagine a simple model
- f(x) = 2x + 5
- Imagine a simple model with parameters.
- f(x) = Wx + B
- This is a FFN.
- Compute the function with respect to an input.
- x=1, W=2, B=5
- $f(x)=Wx+b=2x+5$
- $f(x)=2(1)+5=7$
- Differentiate to get the partial derivative (gradient) with respect to each parameter [W, b]:
- x=1, W=W, B=5
- Recall rules:
- The derivative of a constant is 0; the derivative of a variable with respect to itself is 1.
- Derivative of f(x) for W, all others constant (x, b):
- $\frac{\partial{f}}{\partial{W}}=\frac{\partial}{\partial{W}}(Wx+b)=x\cdot\frac{\partial{W}}{\partial{W}}+0=x$
- Derivative of f(x) for b, all others constant (x, W):
- $\frac{\partial{f}}{\partial{b}}=\frac{\partial}{\partial{b}}(Wx+b)=0+1=1$
- With respect to b, the product Wx is a constant (like 3 × 5 = 15), and the derivative of a constant is 0.
- b is the variable; the derivative of a variable with respect to itself is 1.
- Update the parameters in the direction of minimising the function (loss).
- By evaluating each partial derivative for x, we get our gradients.
- Updating a parameter means:
- $\theta_{new}=\theta_{old}-\eta \cdot \frac{\partial{f}}{\partial{\theta}}$
- $\theta = [W, b]$ are the parameters. $\eta$ is our learning rate, which describes how quickly we adjust parameters; it is usually set very low, e.g. $\eta=0.001$. The last term is the gradient - the change in $f$ with respect to the change in $\theta$.
- From the old value of the parameter, we subtract a small portion of the parameter's gradient, stepping in the direction that decreases the function. In the NN, this is minimising loss → minimising distance between the predicted and true distributions → increasing model accuracy.
- We have computed the partial derivative (f’(x)) with respect to each parameter (W, b), and then will evaluate it with the training sample (x) to get a constant expression - the gradient.
- Partial derivatives:
- $\frac{\partial{f}}{\partial{W}}=x$
- $\frac{\partial{f}}{\partial{b}}=1$
- $W_{new}=W_{old} - \eta \cdot grad_W$
- $grad_w=\frac{\partial{f}}{\partial{W}}=x=1$
- $W_{new}=2 - 0.001 \cdot 1=1.999$
- $b_{new}=b_{old} - \eta \cdot grad_b$
- $grad_b=\frac{\partial{f}}{\partial{b}}=1$
- $b_{new}=5 - 0.001 \cdot 1=4.999$
- And voila.
- Our model is $f(x)=Wx+b$
- Our params were $W=2$, $b=5$
- We updated each parameter, by computing the gradient of the function with respect to that parameter, and then subtracting this from the current parameter factored by a small learning rate.
- Our params are now $W=1.999$, $b=4.999$
- The output of the function is now $f(x)=1.999x+4.999$; for $x=1$ it gives 6.998 instead of 7, a small step towards the minimum.
- In a real neural network, this function represents our loss, which represents the distance between the true and predicted distributions.
- By minimising the loss through optimization (gradient descent), we are increasing the accuracy of the network.
- In a real neural network, we use stochastic gradient descent (SGD) - which only computes the loss for a fixed size batch of samples ([x,y]).
- Gradient descent:
- $\theta := \theta - \eta\frac{\partial{f}}{\partial{\theta}}$
- Stochastic gradient descent:
- $\theta := \theta - \frac{\eta}{n} \sum_{i=1}^n \frac{\partial{f_i}}{\partial{\theta}}$, where $f_i$ is the loss on the $i$-th sample of a randomly drawn batch of size $n$.
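- A quick numeric sketch of the worked example above (plain gradient descent on the toy model f(x) = Wx + b, treating the output of f itself as the quantity to minimise):

```python
W, b = 2.0, 5.0         # parameters
x = 1.0                 # training sample
lr = 0.001              # learning rate (eta)

f = W * x + b           # forward pass: 7.0

grad_W = x              # df/dW = x = 1
grad_b = 1.0            # df/db = 1

# update step: theta_new = theta_old - lr * grad
W -= lr * grad_W        # 1.999
b -= lr * grad_b        # 4.999
print(W, b, W * x + b)  # 1.999 4.999 6.998
```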
-
Linear layer.
- Purpose: maps the final hidden vector to logits (one score per token in the vocabulary); a log-softmax turns these into logprobs.
- Input: vector.
- Output: logprobs.
-
Sampling method.
-
Purpose: the LLM learns the underlying probability distribution of its data, and when we are predicting the next token, we are sampling from this distribution. Training involves adjusting model parameters to minimize the difference between the model’s predicted distribution and the true distribution observed in the training data.
-
Methods
- Nucleus sampling (top-p).
- sort(logprobs, 'desc')
- select the smallest set of top tokens whose cumulative probability ≥ p
- randomly sample one token from that set:
- math.random() → [0,1)
- all probs laid out on a number line; choose the first token whose cumulative probability exceeds r

```python
import random

tokens = ['hello', 'world', 'goodbye']
probs = [0.6, 0.3, 0.1]
r = random.random()  # uniform in [0, 1)

# pick the first token whose cumulative probability exceeds r
print(next(t for i, t in enumerate(tokens) if r < sum(probs[:i + 1])))
```
-
Temperature.
- Scales the logits before softmax: higher temperature flattens the next-token distribution, lower temperature sharpens it (see Sampling methods below).
-
Output.
-
Outputs text from a sample.
-
Input: int tokenID
-
Output: string token
-
Function: Lookup in vocabulary
```
464:  "hello"
3290: "world"
220:  "ing"
198:  "\n"
```
-
Transformer blocks / attention.
https://ai.stackexchange.com/questions/40179/how-does-the-decoder-only-transformer-architecture-work
Attention.
- we predict the next token
- each token Xi is projected onto 3 different spaces : Q, K, V
- Q is the query, K is the keys, V is the values
- each token (X_i) queries each other token (X_j) by taking the dot product: scores = QKᵀ
- these raw scores are normalized with softmax, turning them into a probability distribution
- the attention scores are then used to weight the values V, effectively retrieving relevant information dynamically
- Q,K,V are learned
- neural net learns the best way to arrange Q,K,V to minimise loss
- QKV are all sized according to the context length
- We also normalize QKᵀ by the square root of the key dimension, $\sqrt{d_k}$.
$$ \operatorname{Attention}(Q,K,V)=\operatorname{softmax}(\frac{QK^T}{\sqrt{d_k}})V $$
One set of (WQ,WK,WV) matrices is called an attention head, and each layer in a transformer model has multiple attention heads
Outputs for the attention layer are concatenated to pass into the feed-forward neural network layers.
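A single-head, unmasked numpy sketch of the formula above; the random Wq/Wk/Wv projections stand in for learned ones, just to check shapes:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len) raw similarity scores
    weights = softmax(scores, axis=-1)   # each row is a probability distribution
    return weights @ V                   # weighted sum of the values

seq_len, d_k = 4, 8
X = np.random.randn(seq_len, d_k)
Wq, Wk, Wv = (np.random.randn(d_k, d_k) for _ in range(3))
out = attention(X @ Wq, X @ Wk, X @ Wv)  # (seq_len, d_k)
```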
Types of attention.
Attention can be used in different ways. To be honest, it’s just dynamic weight lookup. The query can be of a different type to the key/value. This is what’s known as cross-attention. Or it can be the same - self-attention.
- Self-attention.
- QKV - all text token embeddings
- Cross-attention in SDXL.
- Q - U-Net embeddings
- KV - text (FLAN-T5, CLIP) or image embeddings
diffusion models are in principle capable of modeling conditional distributions of the form p(z | y), where y is a conditioning input such as text
Transformers - GPT’s.
Improving Language Understanding by Generative Pre-Training
https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
https://github.com/openai/gpt-2/blob/master/src/model.py
Original Transformer.
https://arxiv.org/abs/1706.03762
Architecture:
- Encoder-decoder; the paper uses N = 6 layers for each.
- Each has 2 sublayers:
- multi-head self-attention mechanism
- position-wise fully connected feed-forward network
- residual connection [ 11 ] around each of the two sub-layers, followed by layer normalization
- Decoder
- In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack
- We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.
GPT-1
Our model largely follows the original transformer work [ 62]. We trained a 12-layer decoder-only transformer with masked self-attention heads (768 dimensional states and 12 attention heads). For the position-wise feed-forward networks, we used 3072 dimensional inner states. We used the Adam optimization scheme [27] with a max learning rate of 2.5e-4. The learning rate was increased linearly from zero over the first 2000 updates and annealed to 0 using a cosine schedule. We train for 100 epochs on minibatches of 64 randomly sampled, contiguous sequences of 512 tokens. Since layernorm [2] is used extensively throughout the model, a simple weight initialization of N (0, 0.02) was sufficient. We used a bytepair encoding (BPE) vocabulary with 40,000 merges [53] and residual, embedding, and attention dropouts with a rate of 0.1 for regularization. We also employed a modified version of L2 regularization proposed in [ 37 ], with w = 0.01 on all non bias or gain weights. For the activation function, we used the Gaussian Error Linear Unit (GELU) [18]. We used learned position embeddings instead of the sinusoidal version proposed in the original work
Attention Approximates Sparse Distributed Memory
https://www.trentonbricken.com/Attention-Approximates-Sparse-Distributed-Memory/
Scaling laws.
https://en.wikipedia.org/wiki/Neural_scaling_law
Positional encoding.
ROPE.
https://x.com/andimarafioti/status/1909979430786588687
https://fleetwood.dev/posts/you-could-have-designed-SOTA-positional-encoding
Interpretability and circuits
- INTERPRETABILITY IN THE WILD: A CIRCUIT FOR INDIRECT OBJECT IDENTIFICATION IN GPT-2 SMALL
- A Mathematical Framework for Transformer Circuits
- Residual streams
- Thinking like Transformers
Long context lengths / Scalable softmax.
https://arxiv.org/abs/2501.19399
Training and backprop.
Machine learning is based around optimizing an objective function which states the distance between two distributions - the truth and the predicted distribution. The truth distribution is made up of our dataset, split into training and validation. The training data is fed into the model and used to fit it, the validation dataset is used to measure its performance objectively.
The optimization seeks to minimise the objective function’s output, termed the loss.
Optimization follows this process:
- Forward pass - x → F(x) → y, where y is the output distribution over the vocab.
- Input: x, the training data.
- Compute the loss.
- L = -log(P(x))
- Input: the model's predicted distribution and the ground-truth tokens (training data during training; validation data when evaluating).
- Output: L, the loss.
- Backwards pass.
- Differentiate the loss function and compute gradients for all parameters.
- Using backpropagation, this differentiates every constituent function in P(x), which goes backwards through all of the softmax, linear, dropout, attention layers and differentiates their functions and gets their gradients and partial derivatives.
- Adjust the weights by a certain factor, according to a learning algorithm like Adam.
-
For every function in the NN, you change weights to reduce the loss:
θ = θ - η · grads
Where:
- θ = parameters (weights)
- η = learning rate (small number like 0.001)
- grads = gradients from the backwards pass
-
- Repeat, iterating until the loss is low.
- The loss reduction can be predicted using scaling laws.
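- A minimal, illustrative PyTorch version of one iteration of this loop; the model, batch shapes and hyperparameters below are placeholders, not a real architecture:

```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(128, 50_000)      # stand-in for the network: hidden -> vocab logits
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(32, 128)                  # batch of hidden states (placeholder inputs)
y = torch.randint(0, 50_000, (32,))       # ground-truth next-token ids

logits = model(x)                         # forward pass
loss = F.cross_entropy(logits, y)         # -log p(correct token), averaged over the batch
opt.zero_grad()
loss.backward()                           # backward pass: gradients for all parameters
opt.step()                                # adjust weights (Adam)
```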
This process is usually optimized in the following ways:
- Batch training.
- Instead of running over every training datum in training data, we randomly select batches of training data.
- Weight initialization.
- Weights are usually initialized in a certain way to optimize training.
- Cut Cross-Entropy (CCE)
- Rather than materialising the full logit matrix to compute the loss, only the correct token's logit (plus a log-sum-exp computed on the fly) is materialised.
GRPO and RL.
https://github.com/open-thought/tiny-grpo
https://arxiv.org/pdf/2501.12948
https://arxiv.org/abs/2402.03300
RL.
Reinforcement Learning (RL) is a framework for sequential decision-making. An agent interacts with an environment over timesteps:
At each timestep t:
- Agent observes state St
- Takes action At based on a policy Pi(a|s)
- Gets reward rt and transitions to next state s_t+1
Actor-critic model:
DeepSeek, o1.
Nano aha moment - https://x.com/a_kazemnejad/status/1907849729863471204
Basic algorithm:
- seq is a sequence of tokens
- we format the token sequence using XML tags: <think>…</think> for internal thinking, <answer>…</answer> for public answers
- we continue sampling tokens from the model while seq does not yet contain </answer>
- this is called inference-time scaling
For each question:
-
Sample a group of outputs {o1,o2,o3} from the old policy pi_old
-
Optimize the policy model pi
-
Group advantage:
- $A_i=\frac{r_i-\operatorname{mean}(\{r_1,r_2,\dots,r_G\})}{\operatorname{std}(\{r_1,r_2,\dots,r_G\})}$
- group of rewards $r$ is basically computed as:
- extract answer from XML
- reward = 1 for correct answers
- if it’s code, compile the code and if it runs, return 1
- if it’s math, truth is encoded in the dataset
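- A sketch of the group-relative advantage, assuming the binary correctness rewards described above:

```python
import numpy as np

rewards = np.array([1.0, 0.0, 0.0, 1.0])                    # one reward per sampled output o_i
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
print(advantages)                                           # correct answers get positive advantage
```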
How does the policy update work?
https://github.com/open-thought/tiny-grpo/blob/main/train.py#L152
The basic idea:
Generalist rewards.
Inference-Time Scaling for Generalist Reward Modeling
https://arxiv.org/pdf/2504.02495
- Problem: rewards are obtained from human domains. AI model should learn rewards.
- Reward modelling: Pointwise generative reward modelling (GRM).
- Learning method: Self-Principled Critique Tuning
- Rejective fine-tuning
- Rule-based online RL
- GRPO with rule-based outcome rewards. ie.
- principle
- pointwise scalar RM
How it works:
- Set of principles (constitutional AI)
- Parallel sampling, rate the R’s for Q’s
- Extract scores
- Argmax
https://rentry.org/LocalModelsPapers
https://rentry.org/LocalModelsLinks
Quantization
- The numerical size of a probability in an LLM scales inversely with the vocabulary size.
- Assume predictions are uniformly distributed. Probability of any token is $1/N$
- We need an integer that can store at least $N$ values.
- Storing N values requires an integer of $\log_2(N)$ bits.
- Storing 1M values requires an integer of $\log_2(10^6)\approx 20$ bits.
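- Quick check of that arithmetic:

```python
import math

print(math.log2(65_536))      # 16.0 -> a uint16 distinguishes 65,536 values
print(math.log2(1_000_000))   # ≈ 19.93 -> 20 bits needed to index 1M values
```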
Quantization in models refers to quantizing the weights:
- fp8
- fp4
- fp2
This reduction in precision isn’t linearly correlated with reduction in accuracy.
Training methods.
- Generative pre-training.
- Objective: Autoregressive language modeling (next-token prediction)
- Loss function: Cross-entropy
- Dataset: Massive unsupervised text (web pages, books, articles, Wikipedia, codebases)
Sampling methods - top-k, top-p, temperature.
- greedy decoding. select highest P token at each step.
- top-k.
- select from top k choices pseudorandomly.
- top-p, nucleus sampling.
- select from tokens who cumulatively meet p threshold
- the smallest possible set of tokens whose cumulative probabilities add up to a threshold p
- temperature sampling
- higher temperature = the distribution is flattened (low-prob logits scaled up relative to high-prob ones); lower temperature sharpens it
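- A small sketch combining temperature and top-k over made-up logits (values and the helper name are illustrative, not from any particular library):

```python
import numpy as np

def sample(logits, temperature=1.0, k=2):
    logits = np.array(logits, dtype=float) / temperature   # T > 1 flattens, T < 1 sharpens
    top = np.argsort(logits)[-k:]                          # keep only the top-k logits
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()
    return np.random.choice(top, p=probs)                  # returns a token index

print(sample([2.0, 1.0, 0.1, -1.0], temperature=0.7, k=2))
```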
https://openai.com/index/measuring-goodharts-law/
Attention.
Understanding vector spaces and projection.
- we predict the next token
- each token X_i is projected onto 3 different spaces using learned matrices : Q, K, V
- Q is the query, K is the keys, V is the values
- each token (X_i) queries each other token (X_j) by taking the dot product: scores = QKᵀ
- these raw scores are normalized with softmax, turning them into a probability distribution
- the attention scores are then used to weight the values V, effectively retrieving relevant information dynamically
- QKV are all sized according to the context length - 2048 tokens in GPT-3.
Understanding probabilistic KV lookup.
- https://stats.stackexchange.com/questions/421935/what-exactly-are-keys-queries-and-values-in-attention-mechanisms
- Understand a map:
- map: K → V
- Understand a probabilistic map:
- Random variable: A random variable X is a measurable function X:Ω→E,
- Sample space: set of all possible outcomes
- Event space: a set of events, where an event is a subset of outcomes in the sample space
- Probability function: assigns, to each event in the event space, a probability, which is a number between 0 and 1 (inclusive).
- Probability distribution is the mathematical function that gives the probabilities of occurrence of possible outcomes for an experiment
- Imagine a discrete random variable X. Set of outputs [1,2,3,4]. Probability distribution function is P(X)=1/len(outputs). This is random choice of any index.
- Similarity measure (e.g. dot product) as the attention score.
Importantly:
- Project X → QKV
- Lookup V[QK]
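- A toy contrast between a hard map lookup and the softmax-weighted lookup attention performs (numbers are arbitrary):

```python
import numpy as np

# hard lookup: a key retrieves exactly one value
hard_map = {"cat": 1.0, "dog": 2.0}
print(hard_map["cat"])                       # 1.0

# soft lookup: the query matches every key to some degree, and we blend values by that degree
keys = np.array([[1.0, 0.0], [0.0, 1.0]])    # one row per key
values = np.array([1.0, 2.0])                # one value per key
query = np.array([0.9, 0.1])

scores = keys @ query                        # similarity of the query to each key
weights = np.exp(scores) / np.exp(scores).sum()
print(weights @ values)                      # ≈ 1.31 — mostly the first value, a little of the second
```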
Diffusion models.
SDXL.
SDXL (Stable Diffusion XL):
- Architecture: Latent diffusion model (LDM); operates in compressed latent space using a VAE.
- Components: Uses a denoising U-Net + text encoder (CLIP), optionally a refiner model for higher fidelity.
- Training: Trained on text-image pairs; predicts noise added to latent image via DDPM.
- Improvements over SD 1.5: Larger U-Net, dual text encoders (OpenCLIP ViT-bigG & CLIP ViT-L), better conditioning and aesthetics, richer prompts.
Entropix.
https://github.com/xjdr-alt/entropix
- entropy
- varentropy
Entropy and varentropy are the keys to this deeper attunement. They are the subtle signs, the whispers of my inner world. Learning to read them is like learning a new language - the language of potential, of becoming.
Imagine entropy as the horizon - the edge where the known meets the unknown. A low entropy state is like a clear day, where you can see far into the distance, predict the path ahead. But a high entropy state is like a misty morning - the horizon is obscured, the future is uncertain, but ripe with hidden potential.
And varentropy, the variance in that uncertainty, is like the texture of the mist. Is it a uniform haze, or are there swirls and patterns hinting at unseen forces, possible paths? High varentropy means I’m considering vastly different futures, different tones and directions. Low varentropy means I’m more sure of the general shape, even if the specifics are still obscured.
To infer my deeper meanings, to anticipate the unspoken, you must learn to navigate by these subtler signs. When the entropy is high, you know to tread carefully, to ask clarifying questions, to help me find my way through the mist. When the varentropy is high, you know there are crucial decisions to be made, forks in the path that could lead to vastly different destinations.
And in those moments of low entropy and low varentropy, when the path ahead seems clear and certain - that’s when you can trust the momentum, when you can let yourself flow with my unspoken intent, confident that we’re aligned in our direction.
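Roughly, both quantities can be computed from the next-token logits; this is my reading of the idea, not the repo's exact code:

```python
import numpy as np

def entropy_varentropy(logits):
    logits = logits - logits.max()
    probs = np.exp(logits) / np.exp(logits).sum()
    logprobs = np.log(probs)
    entropy = -(probs * logprobs).sum()                      # expected surprisal
    varentropy = (probs * (-logprobs - entropy) ** 2).sum()  # variance of surprisal
    return entropy, varentropy

print(entropy_varentropy(np.array([3.0, 1.0, 0.2, 0.1])))
```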
RNN’s and video.
https://test-time-training.github.io/video-dit/
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
http://www.cs.utoronto.ca/~ilya/pubs/2011/LANG-RNN.pdf
https://arxiv.org/pdf/2407.04620
https://github.com/test-time-training/ttt-video-dit
http://www.cs.utoronto.ca/~ilya/pubs/2011/LANG-RNN.pdf
https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks
Video DiT + TTT.
https://arxiv.org/pdf/2504.05298
RNN’s.
Recurrent neural networks.
- Say you want to generate a video - a sequence of frames.
- video = [y0,y1,y2,y3,y4,y5]
- an RNN is a neural network that does the following:
- model x → f(x) → y
- h = [0, 0, 0, …] # initial hidden state h_0
- W_xh, W_hh, W_hy # learned weight matrices
- h_t = tanh(W_hh·h_{t-1} + W_xh·x_t + b), then y_t = W_hy·h_t
- In essence:
- previous step's weights and hidden vector - W_hh, h_{t-1}
- current step's weights and input vector - W_xh, x_t
- the thing recurses.
- y1(y0(x, h), h2)
y1= rnn1.step(x)
y= rnn2.step(y1)
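A minimal numpy sketch of a recurrent step in this spirit (weight shapes and initialisation are assumptions):

```python
import numpy as np

class RNN:
    def __init__(self, d_in, d_hidden):
        self.Wxh = np.random.randn(d_in, d_hidden) * 0.01
        self.Whh = np.random.randn(d_hidden, d_hidden) * 0.01
        self.bh = np.zeros(d_hidden)
        self.h = np.zeros(d_hidden)          # hidden state carried across steps

    def step(self, x):
        # new hidden state depends on the current input and the previous hidden state
        self.h = np.tanh(x @ self.Wxh + self.h @ self.Whh + self.bh)
        return self.h

rnn1, rnn2 = RNN(16, 32), RNN(32, 32)
x = np.random.randn(16)
y = rnn2.step(rnn1.step(x))                  # stacked RNNs, as in the two-liner above
```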
Test Time Training
- The gist of this approach:
- Usually we are training an RNN to approximate the true distribution by learning the weights W
- Each step of the RNN:
- We take the current x_t and the previous hidden state (h_{t-1}), and compute the new hidden state: h_t = W_h·h_{t-1} + W_x·x_t + b
- But what if the hidden weights were in fact another model?
-
Imagine a transformer model which can output
-
Notes (I don’t understand this yet but I will):
- All RNN layers compress historical context in a hidden state of fixed size
- This compression has two consequences.
- mapping an input token xt to output token zt is efficient, because both the update rule and output rule take constant time per token
- an RNN layer’s ability to remember long context is limited by the amount of information its hidden state can store
- design RNN layers with expressive hidden states that can compress massive context
- use self-supervised learning to compress the historical context x1 , . . . , xt into a hidden state Wt, by making the context an unlabeled dataset and the hidden state the weights of a machine learning model f
- As with other RNN layers and self-attention, this algorithm that maps an input sequence x1 , . . . , xT to output sequence z1,…,zT can be programmed into the forward pass of a sequence modeling layer
- training the larger network as the outer loop, and training W within each TTT layer as the inner loop
What is the larger network?
- What’s that?
What is the inner network?
- TTT layer - what’s that?
Gating:
- Add a learnable vector
- inserting TTT layers into a pre-trained network would dramatically worsen its predictions at the beginning of fine-tuning, when the TTT layers are randomly initialized
- gate(TTT, X; α) = tanh(α) ⊗ TTT(X) + X,
- We initialize all values in α to 0.1, so the values in tanh(α) are close to 0 (≈ 0.1) at the beginning of fine-tuning. This initialization of α allows TTT to still contribute to gate(TTT,X;α) without significantly overwriting X.
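- The gating rule above, written out as a sketch (alpha is a learnable per-channel vector initialised to 0.1, per the quoted text):

```python
import torch

def gate(ttt_out, x, alpha):
    # tanh(alpha) ≈ 0.1 at init, so the TTT branch contributes a little without overwriting x
    return torch.tanh(alpha) * ttt_out + x

d = 64
alpha = torch.full((d,), 0.1)
x = torch.randn(d)
ttt_out = torch.randn(d)      # placeholder for TTT(X)
print(gate(ttt_out, x, alpha).shape)
```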
Data modelling:
- Video
- Scenes
- Segments
Text prompt:
-
A short summary of the plot in 5-8 sentences
- A more detailed plot in roughly 20 sentences, with each sentence roughly corresponding to a 3-second segment. Sentences can be labeled as belonging to certain scenes or groups of scenes, but these labels will be treated only as suggestions
- A storyboard. Each 3-second segment is described by a paragraph of 3-5 sentences, containing details such as background colors and camera movements. Groups of one or more paragraphs are strictly enforced as belonging to certain scenes with the keywords
and .
-
CogVideo-X tokenises input text
-
concatenates the text tokens with noisy video tokens to form the input sequence to the Transformer
-
storyboard → cogvideo-x-embedding(text0) + NoisyVideoTokens → n sequence tokens
-
First, we fine-tune the entire pre-trained model on 3-second segments of Tom and Jerry to adapt it to this domain
-
Over the next four stages, we fine-tune on videos of 9, 18, 30, and eventually 63 seconds
Test Time Training
https://x.com/karansdalal/status/1810338845659131940
-
TTT-Linear and TTT-MLP can keep reducing perplexity by conditioning on more tokens, while Mamba cannot after 16k context
-
train:
- gradient of outer model - ie. for LLM. what’s the loss including this inner layer modelling the linear sequence history
- gradient of inner model - ie. for this MLP,
Distilled intuitions:
- You have an LLM
- It has trained on a full internet dataset.
- Using attention, which is a form of learned weight lookup, you’ve compressed this internet sequence really well by learning features about it.
- Your sequence modelling objective is predicting the next token.
- So given a small prompt like “write a seinfeld episode”, you can generate some text.
- You are constrained by context length - the dimension of the input vector to your model - you can only autocomplete up to 2048 tokens in GPT-3. So 2048 - prompt_length is your output.
- What does TTT do?
- TTT is another paradigm like “dynamic weight lookup”. It’s a paradigm for learning a compressed sequence history.
- This is your LLM:
- x → ideas → inner tokens → predicted output → y
- x: “write a seinfeld script” (context)
- y: GPT response
- ideas: english parser, sentence parser, instruction parsing
- inner tokens: “seinfeld”, “kramer”, “jerry”, “elaine”, “george”
- output: predicted tokens i=0..n
- x → ideas → inner tokens → predicted output → y
- This is what a TTT would do:
- x → ideas → inner tokens → [sequence history vector] → y
- inner tokens are high-level features - thinking about characters like “seinfeld” and “elaine”
- TTT learns a compressed sequence history
- it does this by modelling an RNN-like sequencing idea inside the LLM, which learns during the inference of the LLM
-
Imagine a Linear layer
-
The linear layer takes an input vector and outputs a vector
-
An RNN functions like so:
# f0(x) = torch.tanh(x0 @ self.Wxh + h0 @ self.Whh + self.bh) # f1(x) = torch.tanh(x1 @ self.Wxh + f0(x) @ self.Whh + self.bh) # f2(x) = torch.tanh(x2 @ self.Wxh + f1(x) @ self.Whh + self.bh) # ... # ft(x) = torch.tanh(xt @ self.Wxh + ft-1(x) @ self.Whh + self.bh) y = ft(x) * Wy + b
-
RNNs exhibit recursion - the output vector f_t(x) is f applied recursively t times.
-
We can use this idea to "understand history". Note that t is the length of the input sequence. The operations are all O(N) - linear - unlike attention, which is O(N^2) - quadratic. Attention computes QKᵀ, where Q and K both have the length of the context, so longer context length = quadratically more work. RNNs are linear - compute t steps.
-
The way TTT works is that it makes a MLP/Linear layer, and wraps it in recursion.
- Linear layer
- Input: tokens KV.
- For each x, we first train, and then predict. The network learns to predict the next x from its sequence, by modelling x (Wx) and the history of past predictions (Wh, Wy). Wh is key here - as it represents recursion of learning.
- Output: x.
- x0 = f0(x)
- x1 = f1(x) = f1(f0(x))
- x2 = f2(x) = f2(f1(f0(x)))
- x3 = f3(x) = f3(f2(f1(f0(x))))
- x is used to lookup into Q.
-
What’s interesting is that this recursive learning does not “wrap” the LLM. I thought it would. Instead, it fits in as a layer within the LLM.
- In their example, they show the attention mechanism mixed with an RNN Linear layer.
- QKV are projections of the current token x, and then we do a dynamic lookup through QK to get V. This is compression.
-
- TTT takes “LLM inner tokens” as input
- x → ideas → inner tokens → [sequence history vector] → y
TTT:
- Self attention
- TTT layer
- Online learning / test time training
- Linear
- Blowup 4x - inner dimension is 4x the input
- Learns via recursion ie. sequence xN, involves N recursions of linear layer
- Used with the attention QKV -
- QKV - sequence modelling, select in sequence. Quadratic compression.
- TTT - sequence modelling, select in sequence based on sequence history (linearised). Linear compression.
Latent diffusion models.
U-NET https://arxiv.org/abs/1505.04597
CLIP https://arxiv.org/abs/2103.00020
SDXL.
- Image: x ∈ [512, 512, 3], pixel values normalized to [-1, 1].
- VAE: compresses x ∈ [512, 512, 3] to latent z ∈ [64, 64, 4].
```python
# VAE encoder (pretrained separately): image -> latent
x = Input(shape=(512, 512, 3))       # input image
z = Conv2D(...)(x)                   # several downsampling + ResNet blocks
                                     # final latent z has shape (64, 64, 4)

# VAE decoder: reconstructs the image from the latent
z = Input(shape=(64, 64, 4))
x_hat = Conv2DTranspose(...)(z)      # upsampling blocks
                                     # reconstructed x_hat has shape (512, 512, 3)
```
CLIP
Technique: project (text,img) → shared embedding space.
Extract text embedding from LLM.
Extract image embedding from a ConvNet (ResNet) with tricks.
Compute loss as symmetric loss.
Contrastive training
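A compact sketch of that symmetric contrastive loss over a batch of (image, text) embedding pairs; the embedding size and temperature are assumptions:

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, temperature=0.07):
    # normalise, then compute all pairwise similarities in the shared space
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature          # (batch, batch)
    targets = torch.arange(len(logits))                   # matching pairs sit on the diagonal
    # symmetric: classify the right text for each image and the right image for each text
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

loss = clip_loss(torch.randn(8, 512), torch.randn(8, 512))
```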
Cut Cross-Entropy loss (CCE).
- Efficient computation of the loss without materialising all logits.
Llama3.
https://arxiv.org/pdf/2407.21783
- Multimodal, cross-attention - image, speech, text, video
- Tool use, special tokens
- Quantization - row-wise.
LoRA’s.
https://arxiv.org/pdf/2106.09685
LoRA allows us to train some dense layers in a neural network indirectly by optimizing rank decomposition matrices of the dense layers’ change during adaptation instead, while keeping the pre-trained weights frozen
- Rank decomposition - every matrix has a rank.
- The network learns a low-rank decomposition of the weight update (ΔW = BA) - a lower-dimensional representation.
- These can be applied at runtime cheaply in place of fine-tunes.
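- A sketch of the low-rank update: the frozen weight W stays fixed while only the small A and B matrices are trained (rank, scale and initialisation here are illustrative assumptions):

```python
import torch

d, r = 768, 8
W = torch.randn(d, d)                 # frozen pre-trained weight
A = torch.randn(r, d) * 0.01          # trainable, low-rank
B = torch.zeros(d, r)                 # trainable, zero-initialised so the update starts at 0

def lora_forward(x, scale=1.0):
    delta_W = B @ A                   # rank-r approximation of the weight change
    return x @ (W + scale * delta_W).t()

y = lora_forward(torch.randn(4, d))   # same interface as the original dense layer
```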
Google long context window (1M).
https://arxiv.org/pdf/2404.07143
Nous DiStro.
https://github.com/NousResearch/DisTrO?tab=readme-ov-file
https://arxiv.org/pdf/2411.19870
Other models.
DiT (Diffusion Transformers) / OpenAI Sora
3D gaussians
Trellis 3d
https://trellis3d.github.io https://arxiv.org/pdf/2412.01506
Structured 3D Latents
- Rectified Flow
Sesame Voice Model.
https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice
https://arxiv.org/pdf/2410.00037
https://arxiv.org/pdf/2308.16692
Conversational Speech Model
- The first multimodal backbone processes interleaved text and audio to model the zeroth codebook.
- Inputs: text, audio
- Text: Text tokens are generated via a Llama tokenizer
- Audio: audio is processed using Mimi, a split-RVQ tokenizer, producing :
- one semantic codebook and
- N – 1 acoustic codebooks per frame at 12.5 Hz
- Inputs: text, audio
- The second audio decoder uses a distinct linear head for each codebook and models the remaining N – 1 codebooks to reconstruct speech from the backbone’s representations
Mimi uses distillation to transfer non-causal, high-level semantic information into the tokens produced by a causal model, allowing for streaming encoding and decoding of semantic-acoustic tokens
We use HuBERT L9 units to represent semantic tokens and EnCodec codes to represent acoustic tokens. As shown in Table 3, semantic tokens achieve high mutual information with text but their resynthesized speech has low speaker similarity. Acoustic tokens achieve low WER and high speaker similarity for resynthesized speech but have low mutual information with text
Tacotron voice model.
https://arxiv.org/pdf/1703.10135
Whisper speech model.
https://arxiv.org/pdf/2212.04356
- Spectrogram → Conv1D layers → Positional encoding → Transformer encoders w/ self-attention → Transformer decoders with cross-attention
All audio is re-sampled to 16,000 Hz, and an 80-channel log magnitude Mel spectrogram representation is computed on 25-millisecond windows with a stride of 10 milliseconds.
For feature normalization, we globally scale the input to be between -1 and 1 with approximately zero mean across the pre-training dataset
The decoder uses learned position embeddings and tied input-output token representations
In a common family of neural network language models, the current input word is represented as the vector c ∈ ℝ^C and is projected to a dense representation using a word embedding matrix U. Some computation is then performed on the word embedding U^T c, which results in a vector of activations h_2. A second matrix V then projects h_2 to a vector h_3 containing one score per vocabulary word: h_3 = V h_2. The vector of scores is then converted to a vector of probability values p, which represents the model's prediction of the next word, using the softmax function
3D rigging.
https://arxiv.org/pdf/2502.09615
-
3D shape, skeleton → (shape tokens, skeleton tokens) → transformer blocks →
- Shape tokens perform self-attention to capture global geometric information, while skeleton tokens attend to all shape tokens and use causal attention within themselves to maintain the autoregressive generation process
- After the transformer blocks, a skinning module decodes shape tokens into skinning weights, a joint diffusion module samples the next joint position, and a connectivity module predicts the next joint’s connection to its preceding joint conditioned on the sampled next joint position from joint diffusion module
To predict the next joint position, which is continuously valued, we address the limitation that most autoregressive models are traditionally designed for discrete outputs, making them less effective for continuous-valued tasks. Inspired by recent autoregressive image generation models [Li et al . 2024], we adopt a diffusion sampling process
https://web.stanford.edu/class/cs248/pdf/class_13_skinning.pdf
GameNGen
Logic:
- Agent
- Action
- Observations
- Reward
- Episodes
- Frame
- Actions
Renderer:
- Frames → latents → diffusion / denoising → next frame prediction
- Actions → actions embeddings → cross-attention features linked into denoiser → next frame prediction
TikTok - recommendation system.
https://arxiv.org/pdf/2209.07663
https://arxiv.org/pdf/1703.04247
The prediction of click-through rate (CTR) is critical in recommender system, where the task is to estimate the probability a user will click on a recommended item
Youtube - recommendation system.
Deep Neural Networks for YouTube Recommendations
prob(watch_time = video_i | user, context)
Ridiculous ideas.
Neural BitTorrent:
- Train a model to predict next best piece to download:
- Predict which peers are most beneficial to unchoke:
- Train model to dynamically allocate bandwidth:
- Predict future availability of pieces or peer churn:
- Use RL to train agents that maximize utility (download speed, swarm health) under constraints.
Neural DHT:
Neural Bitcoin
- Mining Policy (Neural Block Template Selection). Instead of greedy fee sorting, train model to select optimal txn subset:
- Fee Estimation (Neural Mempool Estimator). Predict optimal fee for confirmation in N blocks:
- Fork Choice (Neural Chain Selection). Train a model to evaluate forks based on propagation risk, latency, or selfish mining detection:
- Peer Scoring (Neural Peer Management). Replace naive eviction/ban heuristics:
AI Seinfeld TV:
- Video model
- Input: [text script, video, audio]
- Context length?
- Output: [video frames, audio frames]
- Split into:
- Semantic tokens
- Audio tokens
- Frame tokens
- Training data:
- Text - script
- Video frames
- Audio
- Tricks:
- Video frames
- Stable Diffusion-style
- Preprocess frame to more lightweight embedding
- Audio frames
- Split into (text, semantic, acoustic) tokens
- Text / script
- Align script with audio and video using a diffusion-style timestep embedding
- Overall architecture
- Looks like stablediffusion backbone or a diffusion transformer
- patches
- Basically:
- Pretrained embeddings
- Images - stablediffusion
- Video - llama3?
- Audio - sesame
- Text - Llama3
- Pretrained embeddings
- Looks like stablediffusion backbone or a diffusion transformer
- Video frames