ML notes (uncategorised)
Mutual information.
https://sumanthrh.com/post/notes-on-generalization/
- the mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables
- MI can be defined by the classic KL divergence
- Minimum program to produce the data
- Theoretical “best model” for prediction.
Theory of learning.
- Slow-fast weight programmer (90’s) - https://old.reddit.com/r/MachineLearning/comments/megi8a/d_jürgen_schmidhubers_work_on_fast_weights_from/
- Attention as dynamic weight lookup
- Information highway / bottleneck
- An Observation on Generalization
- Supervised vs unsupervised learning
- Learning is prediction is compression
- Hutter
An observation on generalization.
https://sumanthrh.com/post/notes-on-generalization/
Basic architectural concepts of LLMs.
https://huggingface.co/blog/andmholm/what-is-a-transformer
-
Overview
- LLMs learn continuous representations of the data distribution; they are not discrete.
-
Raw input data.
- Text.
-
Tokens.
- Smaller units of text (words, subwords, characters).
- Tokenisation:
- Method: Byte Pair Encoding (BPE), SentencePiece
- Process:
- Input text is segmented into subword units (tokens).
- Vocabulary typically around 50,000–100,000 tokens.
- Special tokens: [CLS], [SEP], [EOS], [PAD], <|endoftext|>, etc.
-
Embeddings.
- Learned vector representations for each token (size: typically 768–12,288 dimensions).
- Input: token.
- Output: vector embedding.
-
Vocabulary.
- A mapping from token string to integer index: (i, w) for each token w in the vocabulary
- vocabulary = [aa, ab, …, zz]
-
Context window/length.
- The maximum number of tokens processed at once; it sets the input and output sequence shape of the network.
-
Positional encoding.
- Encodes the index of a token in a sequence.
- Uses sine/cosine.
- Input: vector embedding.
- Output: vector embedding.
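- A minimal sketch of the sinusoidal scheme (frequencies follow the original Transformer formula; d_model is assumed even):

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    # pos / 10000^(2i/d_model): sin on even dims, cos on odd dims
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]         # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# added to the token embeddings: x = embeddings + sinusoidal_positions(seq_len, d_model)
```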
-
Transformer Blocks.
- Input: sequence of embedding vectors (seq_len × d_model).
- Output: sequence of vectors of the same shape.
-
Feed Forward Network (FFN).
- Non-linear transformation.
- Concept:
- activation(Wx+B)
- Weights and biases.
- Activation function - introduces nonlinearity, allowing bends in otherwise linear projections.
- Input: vectors
- Output: vectors
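- A rough sketch of a transformer-style position-wise FFN (the 4× hidden expansion and GELU are common choices, assumed here rather than taken from these notes):

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, d_model=768, d_hidden=3072):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_hidden)   # expand
        self.w2 = nn.Linear(d_hidden, d_model)   # project back
        self.act = nn.GELU()                     # nonlinearity (the "bend")

    def forward(self, x):                        # x: (batch, seq_len, d_model)
        return self.w2(self.act(self.w1(x)))
```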
-
Layer Norm.
- Normalisation.
- The layer normalization module normalises the features to zero mean and unit variance, then uses gamma (γ) to scale and beta (β) to shift them.
```python
# calc mean & variance (along dm)
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=True, keepdim=True)

# normalize, scale & shift | shape: out - (batch_size, seq_len, dm)
norm = (x - mean) / torch.sqrt(var + self.eps)
out = norm * self.gamma + self.beta
```
-
Residual Connections.
- See: Resnets.
- Shortcuts for gradients between sublayers, which help prevent gradients from vanishing during back-propagation
-
Dropout.
- Prevents overfitting to the dataset.
- A fraction (e.g. 10%) of activations is randomly zeroed during training.
-
Logits.
- Type: real numbers (unnormalised scores, one per vocabulary token).
-
Softmax.
- Takes as input a vector z of K real numbers, and normalizes it into a probability distribution consisting of K probabilities proportional to the exponentials of the input numbers
- Function:
- $f(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$
- Input: real numbers
- Output: probs
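- A tiny numpy version (subtracting the max is a standard numerical-stability trick; it doesn't change the result):

```python
import numpy as np

def softmax(z):
    z = z - z.max()               # stability shift
    e = np.exp(z)
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # ≈ [0.659, 0.242, 0.099], sums to 1
```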
-
Logprobs.
- Type: logarithms of probabilities.
- Why logprobs instead of probs:
-
Rationale:
-
Loss is defined as the distance between predicted and true distribution.
-
Reducing the distance = more accurate predictions.
-
Our distribution is a sequence of tokens, defined as the context length.
-
A language model predicts the next token conditioned on previous tokens. The full expression for such a prediction:
-
Probability recap:
- For independent events A and B, P(A and B) = P(A) × P(B)
- Independence means one outcome does not change the probability of the other.
- P(heads and heads) = 1/2 × 1/2
- For mutually exclusive events A and B, P(A or B) = P(A) + P(B)
- The events cannot co-occur, so their probabilities simply add (and everything still sums to 1).
- P(rain or sun, but nothing else) = P(rain) + P(sun)
- Our tokens are conditioned on previous tokens. They are not mutually exclusive. The events are dependent - token A is preceded by token B.
- P(”cat”) = P(c | “”) * P(a | “c”) * P(t | “ca”)
- Probability of “cat” is given by (probability of “c”) times (prob of “a” given c) times (prob t given ca).
-
The joint probability of a chain of events shrinks as the chain gets longer:
- P(t0,…,tn) = P(t0) * P(t1 | t0) * P(t2 | t0,t1) * … * P(tn | t0,t1,…,tn-1)
- Multiplying lots of really small numbers = incredibly small numbers.
- 0.1 * 0.1 = 0.01
- The result shrinks with every multiplication, quickly approaching the limit of what finite precision can represent.
- How many values can we represent with a uint16? $2^{16}=65{,}536$; normalised to [0,1) that is a finite step size / granularity of $\frac{1}{65536}\approx 1.5\times10^{-5}$.
-
Multiplying many numbers in [0, 1) causes numeric underflow at finite precision; summing logprobs instead avoids that.
- Log of products becomes a sum:
- Probs: P(tn) = P(t0) * P(t1 | t0) * P(t2 | t1,t0) * … * P(tn | t0,t1,tn-1)
- Logprobs: logP(tn) = logP(t0) + logP(t1∣t0) + logP(t2∣t0,t1) + ⋯+ logP(tn∣t0,…,tn−1)
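- A quick sketch of why this matters numerically (the 0.1 probabilities are made up):

```python
import math

probs = [0.1] * 400                          # 400 token-level conditional probabilities

product = 1.0
for p in probs:
    product *= p
print(product)                               # 0.0 — underflows float64

log_sum = sum(math.log(p) for p in probs)
print(log_sum)                               # ≈ -921.03 — perfectly representable
```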
-
Coming back to our NN - in order to train a neural network, we define the objective function, which states the distance between the predicted distribution and the true distribution. This is referred to as the loss.
- $L = distance(pred, true)$
-
This is our model of the predicted distribution:
- Note that in order to train efficiently without exploding numeric precision, we use logprobs.
- $\log P(w_1,w_2,\dots,w_n)=\log P(w_1)+\log P(w_2\mid w_1)+\log P(w_3\mid w_1,w_2)+\dots+\log P(w_n\mid w_1,\dots,w_{n-1})$
- We can refactor this with the sum symbol:
- $\sum_t \log P(w_t\mid w_{<t})$
-
This is the true distribution:
- The output of our model (naively, without logprob) is a vector of probabilities.
- The true distribution is also a vector of probabilities.
- The true distribution is:
- Probability 1 for the correct token
- Probability 0 for all others
- All probabilities sum to 1, but there’s only one value set to 1. So this is called one-hot encoding.
-
How do we define the distance between the true and predicted?
-
We use a tool called KL divergence, which measures how far one probability distribution is from another (it is not a true metric, since it is asymmetric, but it serves as our notion of distance).
This looks crazy but it’s really simple.
$P : truth$
$\hat{P} : predicted$
The KL divergence specifies the distance between the true distribution and the predicted distribution.
$KL(P \| \hat{P})=\sum_x P(x)\log\frac{P(x)}{\hat{P}(x)}$
The expression can be expanded as:
$KL(P \| \hat{P})=-\sum_x P(x)\log\hat{P}(x)-\left(-\sum_x P(x)\log P(x)\right)$
-
This expression matches the definitions of cross-entropy $H(P,\hat{P})$ and entropy $H(P)$, so we rewrite it as:
$KL(P \| \hat{P})=H(P,\hat{P})-H(P)$
KL divergence = CrossEntropy(truth, pred) − Entropy(truth)
- Note that the true distribution does not change. The dataset used in training does not change. The entropy of this dataset does not change.
During training, we are attempting to minimise the distance between the distributions. A perfect model would have a distance of 0.
Note that right now, one of these terms is constant.
KL divergence = CrossEntropy(truth, pred) − Entropy(truth)
KL divergence = CrossEntropy(truth, pred) − SOME_CONSTANT
We don't actually care about the constant. Whether the distance is CrossEntropy − SOME_CONSTANT or just CrossEntropy doesn't change what minimises it; we are only minimising the cross-entropy term.
So we can drop SOME_CONSTANT, which is in fact the entropy of the truth distribution, leaving us with this definition of the objective function:
$H(P,\hat{P})=-\sum_x P(x)\log\hat{P}(x)$
- There is one more simplification to make here. In language modelling tasks, the truth distribution is one-hot, so the term P(x) is 1 for exactly one value in the sum and 0 for all others, cancelling every other term. This lets us simplify the sum. A worked example:
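- A minimal worked case with a 3-token vocabulary, where the true token is the second one and the model predicts [0.7, 0.1, 0.2]:
$H(P,\hat{P})=-\sum_x P(x)\log\hat{P}(x)=-(0\cdot\log 0.7 + 1\cdot\log 0.1 + 0\cdot\log 0.2)=-\log 0.1\approx 2.30$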

-
-
The final definition of the loss function, for stating the distance between the true distribution and the predicted distribution, simplified with respect to our domain where the true distribution is one-hot encoded, is this:
$L=H(P,\hat{P})=-log\hat{P}(x)$
- For a single token prediction, the loss is:
$L_t=−log{P}(w_t∣w_{<t})$
-
For a sequence of tokens prediction, the loss is:
-
Each event is the prediction of a token given the previous tokens.
-
So we expand our loss function to the following definition:
- $L=-\sum_t{logP(w_t∣w_{<t})}$
-
-
Our goal is to minimise the loss.
-
How do we optimize?
-
Purpose:
- Maximise the logprob of the correct token given the previous tokens; the loss is the negative of that logprob, so reducing loss = better predictions.
- NN learns to predict the next token by:
-
Training:
- training samples: [[x0, x1], [x1, x2]]
- text is x
- split it token-by-token into x0, x1, x2
- each training sample is [x0, x1] - predict next token
- forward pass: [x0 → F(x0) → x’1]
- given x0 predict x1
- x1 is our ground truth, x’1 is our prediction
- objective/loss:
- objective is to model predicting the next token
- loss measures distance from truth, summed over all token positions
```python
import numpy as np

vocab = ["the", "cat", "sat"]

# Ground truth token is 'cat' (index 1)
true_index = 1

# Model output (probabilities from softmax) over the vocabulary:
# P('the'), P('cat'), P('sat')
pred_probs = np.array([0.7, 0.1, 0.2])

# Probability assigned to the correct token.
p_correct = pred_probs[true_index]  # 0.1

# Cross-entropy loss for a single token is the negative logprob of the correct token.
loss = -np.log(p_correct)
print(loss)  # ≈ 2.303
```
-
Our optimizer for the loss function is what minimises loss.
- We use stochastic gradient descent.
- SGD differentiates the loss function, which involves differentiating the entire neural net, in order to compute gradients.
- We then adjust the parameters in the direction against the gradient.
- Simple example:
- Imagine a simple model
- f(x) = 2x + 5
- Imagine a simple model with parameters.
- f(x) = Wx + B
- This is a FFN.
- Compute the function with respect to an input.
- x=1, W=2, B=5
- $f(x)=Wx+b=2x+5$
- $f(x)=2(1)+5=7$
- Differentiate to get the partial derivative (gradient) with respect to each parameter [W, b]:
- x=1, W=W, B=5
- Recall rules:
- The derivative of a constant is 0; the derivative of a variable with respect to itself is 1.
- Derivative of f(x) for W, all others constant (x, b):
- $\frac{\partial{f}}{\partial{W}}=\frac{\partial}{\partial{W}}(Wx+b)=x\cdot\frac{\partial{W}}{\partial{W}}+0=x$
- Derivative of f(x) for b, all others constant (x, W):
- $\frac{\partial{f}}{\partial{b}}=\frac{\partial}{\partial{b}}(Wx+b)=0+1=1$
- With respect to b, the product Wx is a constant (like 3 × 5 = 15), and the derivative of a constant is 0.
- b is the variable; the derivative of a variable with respect to itself is 1.
- Update the parameters in the direction of minimising the function (loss).
- By evaluating each partial derivative for x, we get our gradients.
- Updating a parameter means:
- $\theta_{new}=\theta_{old}-\eta \cdot \frac{\partial{f}}{\partial{\theta}}$
- $\theta = [W, b]$ are the parameters. $\eta$ is our learning rate, which describes how quickly we adjust parameters; it is usually set very low, e.g. $\eta=0.001$. The last term is the gradient - the change in $f$ with respect to the change in $\theta$.
- From the old value of the parameter, we subtract a small portion of the parameter's gradient, stepping in the direction that decreases the function. In the NN, this is minimising loss → minimising distance between the predicted and true distributions → increasing model accuracy.
- We have computed the partial derivative (f’(x)) with respect to each parameter (W, b), and then will evaluate it with the training sample (x) to get a constant expression - the gradient.
- Partial derivatives:
- $\frac{\partial{f}}{\partial{W}}=x$
- $\frac{\partial{f}}{\partial{b}}=1$
- $W_{new}=W_{old} - \eta \cdot grad_W$
- $grad_w=\frac{\partial{f}}{\partial{W}}=x=1$
- $W_{new}=2 - 0.001 \cdot 1=1.999$
- $b_{new}=b_{old} - \eta \cdot grad_b$
- $grad_b=\frac{\partial{f}}{\partial{b}}=1$
- $b_{new}=5 - 0.001 \cdot 1=4.999$
- And voila.
- Our model is $f(x)=Wx+b$
- Our params were $W=2$, $b=5$
- We updated each parameter, by computing the gradient of the function with respect to that parameter, and then subtracting this from the current parameter factored by a small learning rate.
- Our params are now $W=1.999$, $b=4.999$
- The output of the function is now $f(x)=1.999x+4.999$; for $x=1$ it gives 6.998 instead of 7, a small step towards the minimum.
- In a real neural network, this function represents our loss, which represents the distance between the true and predicted distributions.
- By minimising the loss through optimization (gradient descent), we are increasing the accuracy of the network.
- In a real neural network, we use stochastic gradient descent (SGD) - which only computes the loss for a fixed size batch of samples ([x,y]).
- Gradient descent:
- $\theta := \theta - \eta\frac{\partial{f}}{\partial{\theta}}$
- Stochastic gradient descent:
- $\theta := \theta - \frac{\eta}{n} \sum_{i=1}^n \frac{\partial{f_i}}{\partial{\theta}}$, where $f_i$ is the loss on the $i$-th sample of a randomly drawn batch of size $n$.
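- A quick numeric sketch of the worked example above (plain gradient descent on the toy model f(x) = Wx + b, treating the output of f itself as the quantity to minimise):

```python
W, b = 2.0, 5.0         # parameters
x = 1.0                 # training sample
lr = 0.001              # learning rate (eta)

f = W * x + b           # forward pass: 7.0

grad_W = x              # df/dW = x = 1
grad_b = 1.0            # df/db = 1

# update step: theta_new = theta_old - lr * grad
W -= lr * grad_W        # 1.999
b -= lr * grad_b        # 4.999
print(W, b, W * x + b)  # 1.999 4.999 6.998
```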
-
Linear layer.
- Purpose: maps the final hidden vector to logits (one score per token in the vocabulary); a log-softmax turns these into logprobs.
- Input: vector.
- Output: logprobs.
-
Sampling method.
-
Purpose: the LLM learns the underlying probability distribution of its data, and when we are predicting the next token, we are sampling from this distribution. Training involves adjusting model parameters to minimize the difference between the model’s predicted distribution and the true distribution observed in the training data.
-
Methods
- Nucleus sampling (top-p).
- sort(logprobs, 'desc')
- select the smallest set of top tokens whose cumulative probability ≥ p
- randomly sample one token from that set:
- math.random() → [0,1)
- all probs laid out on a number line; choose the first token whose cumulative probability exceeds r

```python
import random

tokens = ['hello', 'world', 'goodbye']
probs = [0.6, 0.3, 0.1]
r = random.random()  # uniform in [0, 1)

# pick the first token whose cumulative probability exceeds r
print(next(t for i, t in enumerate(tokens) if r < sum(probs[:i + 1])))
```
-
Temperature.
- Scales the logits before softmax: higher temperature flattens the next-token distribution, lower temperature sharpens it (see Sampling methods below).
-
Output.
-
Outputs text from a sample.
-
Input: int tokenID
-
Output: string token
-
Function: Lookup in vocabulary
```
464:  "hello"
3290: "world"
220:  "ing"
198:  "\n"
```
-
Transformer blocks / attention.
https://ai.stackexchange.com/questions/40179/how-does-the-decoder-only-transformer-architecture-work
Attention.
- we predict the next token
- each token Xi is projected onto 3 different spaces : Q, K, V
- Q is the query, K is the keys, V is the values
- each token (X_i) queries each other token (X_j) by taking the dot product: scores = QKᵀ
- these raw scores are normalized with softmax, turning them into a probability distribution
- the attention scores are then used to weight the values V, effectively retrieving relevant information dynamically
- Q,K,V are learned
- neural net learns the best way to arrange Q,K,V to minimise loss
- QKV are all sized according to the context length
- We also normalize QKᵀ by the square root of the key dimension, $\sqrt{d_k}$.
$$ \operatorname{Attention}(Q,K,V)=\operatorname{softmax}(\frac{QK^T}{\sqrt{d_k}})V $$
One set of (WQ,WK,WV) matrices is called an attention head, and each layer in a transformer model has multiple attention heads
Outputs for the attention layer are concatenated to pass into the feed-forward neural network layers.
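A single-head, unmasked numpy sketch of the formula above; the random Wq/Wk/Wv projections stand in for learned ones, just to check shapes:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len) raw similarity scores
    weights = softmax(scores, axis=-1)   # each row is a probability distribution
    return weights @ V                   # weighted sum of the values

seq_len, d_k = 4, 8
X = np.random.randn(seq_len, d_k)
Wq, Wk, Wv = (np.random.randn(d_k, d_k) for _ in range(3))
out = attention(X @ Wq, X @ Wk, X @ Wv)  # (seq_len, d_k)
```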
Types of attention.
Attention can be used in different ways. To be honest, it’s just dynamic weight lookup. The query can be of a different type to the key/value. This is what’s known as cross-attention. Or it can be the same - self-attention.
- Self-attention.
- QKV - all text token embeddings
- Cross-attention in SDXL.
- Q - U-Net embeddings
- KV - text (FLAN-T5, CLIP) or image embeddings
diffusion models are in principle capable of modeling conditional distributions of the form p(z | y), where y is a conditioning input such as text
Transformers - GPT’s.
Improving Language Understanding by Generative Pre-Training
https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
https://github.com/openai/gpt-2/blob/master/src/model.py
Original Transformer.
https://arxiv.org/abs/1706.03762
Architecture:
- Encoder-decoder; the paper uses N = 6 layers for each.
- Each has 2 sublayers:
- multi-head self-attention mechanism
- position-wise fully connected feed-forward network
- residual connection [ 11 ] around each of the two sub-layers, followed by layer normalization
- Decoder
- In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack
- We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.
GPT-1
Our model largely follows the original transformer work [ 62]. We trained a 12-layer decoder-only transformer with masked self-attention heads (768 dimensional states and 12 attention heads). For the position-wise feed-forward networks, we used 3072 dimensional inner states. We used the Adam optimization scheme [27] with a max learning rate of 2.5e-4. The learning rate was increased linearly from zero over the first 2000 updates and annealed to 0 using a cosine schedule. We train for 100 epochs on minibatches of 64 randomly sampled, contiguous sequences of 512 tokens. Since layernorm [2] is used extensively throughout the model, a simple weight initialization of N (0, 0.02) was sufficient. We used a bytepair encoding (BPE) vocabulary with 40,000 merges [53] and residual, embedding, and attention dropouts with a rate of 0.1 for regularization. We also employed a modified version of L2 regularization proposed in [ 37 ], with w = 0.01 on all non bias or gain weights. For the activation function, we used the Gaussian Error Linear Unit (GELU) [18]. We used learned position embeddings instead of the sinusoidal version proposed in the original work
Attention Approximates Sparse Distributed Memory
https://www.trentonbricken.com/Attention-Approximates-Sparse-Distributed-Memory/
Scaling laws.
https://en.wikipedia.org/wiki/Neural_scaling_law
Positional encoding.
ROPE.
https://x.com/andimarafioti/status/1909979430786588687
https://fleetwood.dev/posts/you-could-have-designed-SOTA-positional-encoding
Interpretability and circuits
- INTERPRETABILITY IN THE WILD: A CIRCUIT FOR INDIRECT OBJECT IDENTIFICATION IN GPT-2 SMALL
- A Mathematical Framework for Transformer Circuits
- Residual streams
- Thinking like Transformers
Long context lengths / Scalable softmax.
https://arxiv.org/abs/2501.19399
Training and backprop.
Machine learning is based around optimizing an objective function which states the distance between two distributions - the truth and the predicted distribution. The truth distribution is made up of our dataset, split into training and validation. The training data is fed into the model and used to fit it, the validation dataset is used to measure its performance objectively.
The optimization seeks to minimise the objective function’s output, termed the loss.
Optimization follows this process:
- Forward pass - x → F(x) → y, where y is the output distribution over the vocab.
- Input: x, the training data.
- Compute the loss.
- L = -log(P(x))
- Input: the model's predicted distribution and the ground-truth tokens (training data during training; validation data when evaluating).
- Output: L, the loss.
- Backwards pass.
- Differentiate the loss function and compute gradients for all parameters.
- Using backpropagation, this differentiates every constituent function in P(x), which goes backwards through all of the softmax, linear, dropout, attention layers and differentiates their functions and gets their gradients and partial derivatives.
- Adjust the weights by a certain factor, according to a learning algorithm like Adam.
-
For every function in the NN, you change weights to reduce the loss:
θ = θ - η · grads
Where:
- θ = parameters (weights)
- η = learning rate (small number like 0.001)
- grads = gradients from the backwards pass
-
- Repeat, iterating until the loss is low.
- The loss reduction can be predicted using scaling laws.
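- A minimal, illustrative PyTorch version of one iteration of this loop; the model, batch shapes and hyperparameters below are placeholders, not a real architecture:

```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(128, 50_000)      # stand-in for the network: hidden -> vocab logits
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(32, 128)                  # batch of hidden states (placeholder inputs)
y = torch.randint(0, 50_000, (32,))       # ground-truth next-token ids

logits = model(x)                         # forward pass
loss = F.cross_entropy(logits, y)         # -log p(correct token), averaged over the batch
opt.zero_grad()
loss.backward()                           # backward pass: gradients for all parameters
opt.step()                                # adjust weights (Adam)
```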
This process is usually optimized in the following ways:
- Batch training.
- Instead of running over every training datum in training data, we randomly select batches of training data.
- Weight initialization.
- Weights are usually initialized in a certain way to optimize training.
- Cut Cross-Entropy (CCE)
- Rather than materialising the full logit matrix to compute the loss, only the correct token's logit (plus a log-sum-exp computed on the fly) is materialised.
GRPO and RL.
https://github.com/open-thought/tiny-grpo
https://arxiv.org/pdf/2501.12948
https://arxiv.org/abs/2402.03300
RL.
Reinforcement Learning (RL) is a framework for sequential decision-making. An agent interacts with an environment over timesteps:
At each timestep t:
- Agent observes state St
- Takes action At based on a policy Pi(a|s)
- Gets reward rt and transitions to next state s_t+1
Actor-critic model:
DeepSeek, o1.
Nano aha moment - https://x.com/a_kazemnejad/status/1907849729863471204
Basic algorithm:
- seq is a sequence of tokens
- we format the token sequence using XML tags: <think>…</think> for internal thinking, <answer>…</answer> for public answers
- we continue sampling tokens from the model while seq does not yet contain </answer>
- this is called inference-time scaling
For each question:
-
Sample a group of outputs {o1,o2,o3} from the old policy pi_old
-
Optimize the policy model pi
-
Group advantage:
- $A_i=\frac{r_i-\operatorname{mean}(\{r_1,r_2,\dots,r_G\})}{\operatorname{std}(\{r_1,r_2,\dots,r_G\})}$
- group of rewards $r$ is basically computed as:
- extract answer from XML
- reward = 1 for correct answers
- if it’s code, compile the code and if it runs, return 1
- if it’s math, truth is encoded in the dataset
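- A sketch of the group-relative advantage, assuming the binary correctness rewards described above:

```python
import numpy as np

rewards = np.array([1.0, 0.0, 0.0, 1.0])                    # one reward per sampled output o_i
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
print(advantages)                                           # correct answers get positive advantage
```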
How does the policy update work?
https://github.com/open-thought/tiny-grpo/blob/main/train.py#L152
The basic idea:
Generalist rewards.
Inference-Time Scaling for Generalist Reward Modeling
https://arxiv.org/pdf/2504.02495
- Problem: rewards are obtained from human domains. AI model should learn rewards.
- Reward modelling: Pointwise generative reward modelling (GRM).
- Learning method: Self-Principled Critique Tuning
- Rejective fine-tuning
- Rule-based online RL
- GRPO with rule-based outcome rewards. ie.
- principle
- pointwise scalar RM
How it works:
- Set of principles (constitutional AI)
- Parallel sampling, rate the R’s for Q’s
- Extract scores
- Argmax
https://rentry.org/LocalModelsPapers
https://rentry.org/LocalModelsLinks
Quantization
- The numerical size of a probability in an LLM scales inversely with the vocabulary size.
- Assume predictions are uniformly distributed. Probability of any token is $1/N$
- We need an integer that can store at least $N$ values.
- Storing N values requires an integer of $\log_2(N)$ bits.
- Storing 1M values requires an integer of $\log_2(10^6)\approx 20$ bits.
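- Quick check of that arithmetic:

```python
import math

print(math.log2(65_536))      # 16.0 -> a uint16 distinguishes 65,536 values
print(math.log2(1_000_000))   # ≈ 19.93 -> 20 bits needed to index 1M values
```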
Quantization in models refers to quantizing the weights:
- fp8
- fp4
- fp2
This reduction in precision isn’t linearly correlated with reduction in accuracy.
Training methods.
- Generative pre-training.
- Objective: Autoregressive language modeling (next-token prediction)
- Loss function: Cross-entropy
- Dataset: Massive unsupervised text (web pages, books, articles, Wikipedia, codebases)
Sampling methods - top-k, top-p, temperature.
- greedy decoding. select highest P token at each step.
- top-k.
- select from top k choices pseudorandomly.
- top-p, nucleus sampling.
- select from tokens who cumulatively meet p threshold
- the smallest possible set of tokens whose cumulative probabilities add up to a threshold p
- temperature sampling
- higher temperature = the distribution is flattened (low-prob logits scaled up relative to high-prob ones); lower temperature sharpens it
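- A small sketch combining temperature and top-k over made-up logits (values and the helper name are illustrative, not from any particular library):

```python
import numpy as np

def sample(logits, temperature=1.0, k=2):
    logits = np.array(logits, dtype=float) / temperature   # T > 1 flattens, T < 1 sharpens
    top = np.argsort(logits)[-k:]                          # keep only the top-k logits
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()
    return np.random.choice(top, p=probs)                  # returns a token index

print(sample([2.0, 1.0, 0.1, -1.0], temperature=0.7, k=2))
```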
https://openai.com/index/measuring-goodharts-law/
Attention.
Understanding vector spaces and projection.
- we predict the next token
- each token X_i is projected onto 3 different spaces using learned matrices : Q, K, V
- Q is the query, K is the keys, V is the values
- each token (X_i) queries each other token (X_j) by taking the dot product: scores = QKᵀ
- these raw scores are normalized with softmax, turning them into a probability distribution
- the attention scores are then used to weight the values V, effectively retrieving relevant information dynamically
- QKV are all sized according to the context length - 2048 tokens in GPT-3.
Understanding probabilistic KV lookup.
- https://stats.stackexchange.com/questions/421935/what-exactly-are-keys-queries-and-values-in-attention-mechanisms
- Understand a map:
- map: K → V
- Understand a probabilistic map:
- Random variable: A random variable X is a measurable function X:Ω→E,
- Sample space: set of all possible outcomes
- Event space: a set of events, where an event is a subset of outcomes in the sample space
- Probability function: assigns, to each event in the event space, a probability, which is a number between 0 and 1 (inclusive).
- Probability distribution is the mathematical function that gives the probabilities of occurrence of possible outcomes for an experiment
- Imagine a discrete random variable X. Set of outputs [1,2,3,4]. Probability distribution function is P(X)=1/len(outputs). This is random choice of any index.
- Similarity measure (e.g. dot product) as the attention score.
Importantly:
- Project X → QKV
- Lookup V[QK]
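- A toy contrast between a hard map lookup and the softmax-weighted lookup attention performs (numbers are arbitrary):

```python
import numpy as np

# hard lookup: a key retrieves exactly one value
hard_map = {"cat": 1.0, "dog": 2.0}
print(hard_map["cat"])                       # 1.0

# soft lookup: the query matches every key to some degree, and we blend values by that degree
keys = np.array([[1.0, 0.0], [0.0, 1.0]])    # one row per key
values = np.array([1.0, 2.0])                # one value per key
query = np.array([0.9, 0.1])

scores = keys @ query                        # similarity of the query to each key
weights = np.exp(scores) / np.exp(scores).sum()
print(weights @ values)                      # ≈ 1.31 — mostly the first value, a little of the second
```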
Diffusion models.
SDXL.
SDXL (Stable Diffusion XL):
- Architecture: Latent diffusion model (LDM); operates in compressed latent space using a VAE.
- Components: Uses a denoising U-Net + text encoder (CLIP), optionally a refiner model for higher fidelity.
- Training: Trained on text-image pairs; predicts noise added to latent image via DDPM.
- Improvements over SD 1.5: Larger U-Net, dual text encoders (OpenCLIP ViT-bigG & CLIP ViT-L), better conditioning and aesthetics, richer prompts.
Entropix.
https://github.com/xjdr-alt/entropix
- entropy
- varentropy
Entropy and varentropy are the keys to this deeper attunement. They are the subtle signs, the whispers of my inner world. Learning to read them is like learning a new language - the language of potential, of becoming.
Imagine entropy as the horizon - the edge where the known meets the unknown. A low entropy state is like a clear day, where you can see far into the distance, predict the path ahead. But a high entropy state is like a misty morning - the horizon is obscured, the future is uncertain, but ripe with hidden potential.
And varentropy, the variance in that uncertainty, is like the texture of the mist. Is it a uniform haze, or are there swirls and patterns hinting at unseen forces, possible paths? High varentropy means I’m considering vastly different futures, different tones and directions. Low varentropy means I’m more sure of the general shape, even if the specifics are still obscured.
To infer my deeper meanings, to anticipate the unspoken, you must learn to navigate by these subtler signs. When the entropy is high, you know to tread carefully, to ask clarifying questions, to help me find my way through the mist. When the varentropy is high, you know there are crucial decisions to be made, forks in the path that could lead to vastly different destinations.
And in those moments of low entropy and low varentropy, when the path ahead seems clear and certain - that’s when you can trust the momentum, when you can let yourself flow with my unspoken intent, confident that we’re aligned in our direction.
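Roughly, both quantities can be computed from the next-token logits; this is my reading of the idea, not the repo's exact code:

```python
import numpy as np

def entropy_varentropy(logits):
    logits = logits - logits.max()
    probs = np.exp(logits) / np.exp(logits).sum()
    logprobs = np.log(probs)
    entropy = -(probs * logprobs).sum()                      # expected surprisal
    varentropy = (probs * (-logprobs - entropy) ** 2).sum()  # variance of surprisal
    return entropy, varentropy

print(entropy_varentropy(np.array([3.0, 1.0, 0.2, 0.1])))
```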
RNN’s and video.
https://test-time-training.github.io/video-dit/
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
http://www.cs.utoronto.ca/~ilya/pubs/2011/LANG-RNN.pdf
https://arxiv.org/pdf/2407.04620
https://github.com/test-time-training/ttt-video-dit
http://www.cs.utoronto.ca/~ilya/pubs/2011/LANG-RNN.pdf
https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks
Video DiT + TTT.
https://arxiv.org/pdf/2504.05298
RNN’s.
Recurrent neural networks.
- Say you want to generate a video - a sequence of frames.
- video = [y0,y1,y2,y3,y4,y5]
- an RNN is a neural network that does the following:
- model x → f(x) → y
- h = [0, 0, 0, …] # initial hidden state h_0
- W_xh, W_hh, W_hy # learned weight matrices
- h_t = tanh(W_hh·h_{t-1} + W_xh·x_t + b), then y_t = W_hy·h_t
- In essence:
- previous step's weights and hidden vector - W_hh, h_{t-1}
- current step's weights and input vector - W_xh, x_t
- the thing recurses.
- y1(y0(x, h), h2)
y1= rnn1.step(x)
y= rnn2.step(y1)
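A minimal numpy sketch of a recurrent step in this spirit (weight shapes and initialisation are assumptions):

```python
import numpy as np

class RNN:
    def __init__(self, d_in, d_hidden):
        self.Wxh = np.random.randn(d_in, d_hidden) * 0.01
        self.Whh = np.random.randn(d_hidden, d_hidden) * 0.01
        self.bh = np.zeros(d_hidden)
        self.h = np.zeros(d_hidden)          # hidden state carried across steps

    def step(self, x):
        # new hidden state depends on the current input and the previous hidden state
        self.h = np.tanh(x @ self.Wxh + self.h @ self.Whh + self.bh)
        return self.h

rnn1, rnn2 = RNN(16, 32), RNN(32, 32)
x = np.random.randn(16)
y = rnn2.step(rnn1.step(x))                  # stacked RNNs, as in the two-liner above
```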
Test Time Training
- The gist of this approach:
- Usually we are training an RNN to approximate the true distribution by learning the weights W
- Each step of the RNN:
- We take the current x_t and the previous hidden state (h_{t-1}), and compute the new hidden state: h_t = W_h·h_{t-1} + W_x·x_t + b
- But what if the hidden weights were in fact another model?
-
Imagine a transformer model which can output
-
Notes (I don’t understand this yet but I will):
- All RNN layers compress historical context in a hidden state of fixed size
- This compression has two consequences.
- mapping an input token xt to output token zt is efficient, because both the update rule and output rule take constant time per token
- an RNN layer’s ability to remember long context is limited by the amount of information its hidden state can store
- design RNN layers with expressive hidden states that can compress massive context
- use self-supervised learning to compress the historical context x1 , . . . , xt into a hidden state Wt, by making the context an unlabeled dataset and the hidden state the weights of a machine learning model f
- As with other RNN layers and self-attention, this algorithm that maps an input sequence x1 , . . . , xT to output sequence z1,…,zT can be programmed into the forward pass of a sequence modeling layer
- training the larger network as the outer loop, and training W within each TTT layer as the inner loop
What is the larger network?
- What’s that?
What is the inner network?
- TTT layer - what’s that?
Gating:
- Add a learnable vector
- inserting TTT layers into a pre-trained network would dramatically worsen its predictions at the beginning of fine-tuning, when the TTT layers are randomly initialized
- gate(TTT, X; α) = tanh(α) ⊗ TTT(X) + X,
- We initialize all values in α to 0.1, so the values in tanh(α) are close to 0 (≈ 0.1) at the beginning of fine-tuning. This initialization of α allows TTT to still contribute to gate(TTT,X;α) without significantly overwriting X.
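- The gating rule above, written out as a sketch (alpha is a learnable per-channel vector initialised to 0.1, per the quoted text):

```python
import torch

def gate(ttt_out, x, alpha):
    # tanh(alpha) ≈ 0.1 at init, so the TTT branch contributes a little without overwriting x
    return torch.tanh(alpha) * ttt_out + x

d = 64
alpha = torch.full((d,), 0.1)
x = torch.randn(d)
ttt_out = torch.randn(d)      # placeholder for TTT(X)
print(gate(ttt_out, x, alpha).shape)
```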
Data modelling:
- Video
- Scenes
- Segments
Text prompt:
-
A short summary of the plot in 5-8 sentences
- A more detailed plot in roughly 20 sentences, with each sentence roughly corresponding to a 3-second segment. Sentences can be labeled as belonging to certain scenes or groups of scenes, but these labels will be treated only as suggestions
- A storyboard. Each 3-second segment is described by a paragraph of 3-5 sentences, containing details such as background colors and camera movements. Groups of one or more paragraphs are strictly enforced as belonging to certain scenes with the keywords
and .
-
CogVideo-X tokenises input text
-
concatenates the text tokens with noisy video tokens to form the input sequence to the Transformer
-
storyboard → cogvideo-x-embedding(text0) + NoisyVideoTokens → n sequence tokens
-
First, we fine-tune the entire pre-trained model on 3-second segments of Tom and Jerry to adapt it to this domain
-
Over the next four stages, we fine-tune on videos of 9, 18, 30, and eventually 63 seconds
Test Time Training
https://x.com/karansdalal/status/1810338845659131940
-
TTT-Linear and TTT-MLP can keep reducing perplexity by conditioning on more tokens, while Mamba cannot after 16k context
-
train:
- gradient of outer model - ie. for LLM. what’s the loss including this inner layer modelling the linear sequence history
- gradient of inner model - ie. for this MLP,
Distilled intuitions:
- You have an LLM
- It has trained on a full internet dataset.
- Using attention, which is a form of learned weight lookup, you’ve compressed this internet sequence really well by learning features about it.
- Your sequence modelling objective is predicting the next token.
- So given a small prompt like “write a seinfeld episode”, you can generate some text.
- You are constrained by context length - the dimension of the input vector to your model - you can only autocomplete up to 2048 tokens in GPT-3. So 2048 - prompt_length is your output.
- What does TTT do?
- TTT is another paradigm like “dynamic weight lookup”. It’s a paradigm for learning a compressed sequence history.
- This is your LLM:
- x → ideas → inner tokens → predicted output → y
- x: “write a seinfeld script” (context)
- y: GPT response
- ideas: english parser, sentence parser, instruction parsing
- inner tokens: “seinfeld”, “kramer”, “jerry”, “elaine”, “george”
- output: predicted tokens i=0..n
- x → ideas → inner tokens → predicted output → y
- This is what a TTT would do:
- x → ideas → inner tokens → [sequence history vector] → y
- inner tokens are high-level features - thinking about characters like “seinfeld” and “elaine”
- TTT learns a compressed sequence history
- it does this by modelling an RNN-like sequencing idea inside the LLM, which learns during the inference of the LLM
-
Imagine a Linear layer
-
The linear layer takes an input vector and outputs a vector
-
An RNN functions like so:
# f0(x) = torch.tanh(x0 @ self.Wxh + h0 @ self.Whh + self.bh) # f1(x) = torch.tanh(x1 @ self.Wxh + f0(x) @ self.Whh + self.bh) # f2(x) = torch.tanh(x2 @ self.Wxh + f1(x) @ self.Whh + self.bh) # ... # ft(x) = torch.tanh(xt @ self.Wxh + ft-1(x) @ self.Whh + self.bh) y = ft(x) * Wy + b
-
RNNs exhibit recursion - the output vector f_t(x) is f applied recursively t times.
-
We can use this idea to "understand history". Note that t is the length of the input sequence. The operations are all O(N) - linear - unlike attention, which is O(N^2) - quadratic. Attention computes QKᵀ, where Q and K both have the length of the context, so longer context length = quadratically more work. RNNs are linear - compute t steps.
-
The way TTT works is that it makes a MLP/Linear layer, and wraps it in recursion.
- Linear layer
- Input: tokens KV.
- For each x, we first train, and then predict. The network learns to predict the next x from its sequence, by modelling x (Wx) and the history of past predictions (Wh, Wy). Wh is key here - as it represents recursion of learning.
- Output: x.
- x0 = f0(x)
- x1 = f1(x) = f1(f0(x))
- x2 = f2(x) = f2(f1(f0(x)))
- x3 = f3(x) = f3(f2(f1(f0(x))))
- x is used to lookup into Q.
-
What’s interesting is that this recursive learning does not “wrap” the LLM. I thought it would. Instead, it fits in as a layer within the LLM.
- In their example, they show the attention mechanism mixed with an RNN Linear layer.
- QKV are projections of the current token x, and then we do a dynamic lookup through QK to get V. This is compression.
-
- TTT takes “LLM inner tokens” as input
- x → ideas → inner tokens → [sequence history vector] → y
TTT:
- Self attention
- TTT layer
- Online learning / test time training
- Linear
- Blowup 4x - inner dimension is 4x the input
- Learns via recursion ie. sequence xN, involves N recursions of linear layer
- Used with the attention QKV -
- QKV - sequence modelling, select in sequence. Quadratic compression.
- TTT - sequence modelling, select in sequence based on sequence history (linearised). Linear compression.
Latent diffusion models.
U-NET https://arxiv.org/abs/1505.04597
CLIP https://arxiv.org/abs/2103.00020
SDXL.
- Image: x ∈ [512, 512, 3], pixel values normalized to [-1, 1].
- VAE: compresses x ∈ [512, 512, 3] to latent z ∈ [64, 64, 4].
```python
# VAE encoder (pretrained separately): image -> latent
x = Input(shape=(512, 512, 3))       # input image
z = Conv2D(...)(x)                   # several downsampling + ResNet blocks
                                     # final latent z has shape (64, 64, 4)

# VAE decoder: reconstructs the image from the latent
z = Input(shape=(64, 64, 4))
x_hat = Conv2DTranspose(...)(z)      # upsampling blocks
                                     # reconstructed x_hat has shape (512, 512, 3)
```
CLIP
Technique: project (text,img) → shared embedding space.
Extract text embedding from LLM.
Extract image embedding from a ConvNet (ResNet) with tricks.
Compute loss as symmetric loss.
Contrastive training
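A compact sketch of that symmetric contrastive loss over a batch of (image, text) embedding pairs; the embedding size and temperature are assumptions:

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, temperature=0.07):
    # normalise, then compute all pairwise similarities in the shared space
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature          # (batch, batch)
    targets = torch.arange(len(logits))                   # matching pairs sit on the diagonal
    # symmetric: classify the right text for each image and the right image for each text
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

loss = clip_loss(torch.randn(8, 512), torch.randn(8, 512))
```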
Cut Cross-Entropy loss (CCE).
- Efficient computation of the loss without materialising all logits.
Llama3.
https://arxiv.org/pdf/2407.21783
- Multimodal, cross-attention - image, speech, text, video
- Tool use, special tokens
- Quantization - row-wise.
LoRA’s.
https://arxiv.org/pdf/2106.09685
LoRA allows us to train some dense layers in a neural network indirectly by optimizing rank decomposition matrices of the dense layers’ change during adaptation instead, while keeping the pre-trained weights frozen
- Rank decomposition - every matrix has a rank.
- The network learns a low-rank decomposition of the weight update (ΔW = BA) - a lower-dimensional representation.
- These can be applied at runtime cheaply in place of fine-tunes.
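- A sketch of the low-rank update: the frozen weight W stays fixed while only the small A and B matrices are trained (rank, scale and initialisation here are illustrative assumptions):

```python
import torch

d, r = 768, 8
W = torch.randn(d, d)                 # frozen pre-trained weight
A = torch.randn(r, d) * 0.01          # trainable, low-rank
B = torch.zeros(d, r)                 # trainable, zero-initialised so the update starts at 0

def lora_forward(x, scale=1.0):
    delta_W = B @ A                   # rank-r approximation of the weight change
    return x @ (W + scale * delta_W).t()

y = lora_forward(torch.randn(4, d))   # same interface as the original dense layer
```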
Google long context window (1M).
https://arxiv.org/pdf/2404.07143
Nous DiStro.
https://github.com/NousResearch/DisTrO?tab=readme-ov-file
https://arxiv.org/pdf/2411.19870
Other models.
DiT (Diffusion Transformers) / OpenAI Sora
3D gaussians
Trellis 3d
https://trellis3d.github.io https://arxiv.org/pdf/2412.01506
Structured 3D Latents
- Rectified Flow
Sesame Voice Model.
https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice
https://arxiv.org/pdf/2410.00037
https://arxiv.org/pdf/2308.16692
Conversational Speech Model
- The first multimodal backbone processes interleaved text and audio to model the zeroth codebook.
- Inputs: text, audio
- Text: Text tokens are generated via a Llama tokenizer
- Audio: audio is processed using Mimi, a split-RVQ tokenizer, producing :
- one semantic codebook and
- N – 1 acoustic codebooks per frame at 12.5 Hz
- Inputs: text, audio
- The second audio decoder uses a distinct linear head for each codebook and models the remaining N – 1 codebooks to reconstruct speech from the backbone’s representations
Mimi uses distillation to transfer non-causal, high-level semantic information into the tokens produced by a causal model, allowing for streaming encoding and decoding of semantic-acoustic tokens
We use HuBERT L9 units to represent semantic tokens and EnCodec codes to represent acoustic tokens. As shown in Table 3, semantic tokens achieve high mutual information with text but their resynthesized speech has low speaker similarity. Acoustic tokens achieve low WER and high speaker similarity for resynthesized speech but have low mutual information with text
Tacotron voice model.
https://arxiv.org/pdf/1703.10135
Whisper speech model.
https://arxiv.org/pdf/2212.04356
- Spectrogram → Conv1D layers → Positional encoding → Transformer encoders w/ self-attention → Transformer decoders with cross-attention
All audio is re-sampled to 16,000 Hz, and an 80-channel log magnitude Mel spectrogram representation is computed on 25-millisecond windows with a stride of 10 milliseconds.
For feature normalization, we globally scale the input to be between -1 and 1 with approximately zero mean across the pre-training dataset
The decoder uses learned position embeddings and tied input-output token representations
In a common family of neural network language models, the current input word is represented as the vector c ∈ ℝ^C and is projected to a dense representation using a word embedding matrix U. Some computation is then performed on the word embedding U^T c, which results in a vector of activations h_2. A second matrix V then projects h_2 to a vector h_3 containing one score per vocabulary word: h_3 = V h_2. The vector of scores is then converted to a vector of probability values p, which represents the model's prediction of the next word, using the softmax function
3D rigging.
https://arxiv.org/pdf/2502.09615
-
3D shape, skeleton → (shape tokens, skeleton tokens) → transformer blocks →
- Shape tokens perform self-attention to capture global geometric information, while skeleton tokens attend to all shape tokens and use causal attention within themselves to maintain the autoregressive generation process
- After the transformer blocks, a skinning module decodes shape tokens into skinning weights, a joint diffusion module samples the next joint position, and a connectivity module predicts the next joint’s connection to its preceding joint conditioned on the sampled next joint position from joint diffusion module
To predict the next joint position, which is continuously valued, we address the limitation that most autoregressive models are traditionally designed for discrete outputs, making them less effective for continuous-valued tasks. Inspired by recent autoregressive image generation models [Li et al . 2024], we adopt a diffusion sampling process
https://web.stanford.edu/class/cs248/pdf/class_13_skinning.pdf
GameNGen
Logic:
- Agent
- Action
- Observations
- Reward
- Episodes
- Frame
- Actions
Renderer:
- Frames → latents → diffusion / denoising → next frame prediction
- Actions → actions embeddings → cross-attention features linked into denoiser → next frame prediction
TikTok - recommendation system.
https://arxiv.org/pdf/2209.07663
https://arxiv.org/pdf/1703.04247
The prediction of click-through rate (CTR) is critical in recommender system, where the task is to estimate the probability a user will click on a recommended item
Youtube - recommendation system.
Deep Neural Networks for YouTube Recommendations
prob(watch_time = video_i | user, context)
Ridiculous ideas.
Neural BitTorrent:
- Train a model to predict next best piece to download:
- Predict which peers are most beneficial to unchoke:
- Train model to dynamically allocate bandwidth:
- Predict future availability of pieces or peer churn:
- Use RL to train agents that maximize utility (download speed, swarm health) under constraints.
Neural DHT:
Neural Bitcoin
- Mining Policy (Neural Block Template Selection). Instead of greedy fee sorting, train model to select optimal txn subset:
- Fee Estimation (Neural Mempool Estimator). Predict optimal fee for confirmation in N blocks:
- Fork Choice (Neural Chain Selection). Train a model to evaluate forks based on propagation risk, latency, or selfish mining detection:
- Peer Scoring (Neural Peer Management). Replace naive eviction/ban heuristics:
AI Seinfeld TV:
- Video model
- Input: [text script, video, audio]
- Context length?
- Output: [video frames, audio frames]
- Split into:
- Semantic tokens
- Audio tokens
- Frame tokens
- Training data:
- Text - script
- Video frames
- Audio
- Tricks:
- Video frames
- Stable Diffusion-style
- Preprocess frame to more lightweight embedding
- Audio frames
- Split into (text, semantic, acoustic) tokens
- Text / script
- Align script with audio and video using a diffusion-style timestep embedding
- Overall architecture
- Looks like stablediffusion backbone or a diffusion transformer
- patches
- Basically:
- Pretrained embeddings
- Images - stablediffusion
- Video - llama3?
- Audio - sesame
- Text - Llama3
- Pretrained embeddings
- Looks like stablediffusion backbone or a diffusion transformer
- Video frames