Purpose and Approach
The speaker explains that understanding how transformers work mechanistically is difficult when starting directly with large modern language models. Instead, the plan is to begin with highly simplified versions of transformers and gradually build up to more complex ones. This video starts that process by examining the zero-layer transformer, the simplest model that still meaningfully resembles a transformer.
What Is a Zero-Layer Transformer?
A zero-layer transformer has no attention layers and no nonlinear depth. It consists of only two operations and directly predicts the next token from the previous one.
The two steps are:

- Token Embedding
  - Each input token is represented as a one-hot vector.
  - This vector is multiplied by the word embedding matrix (W_E).
  - The result is a dense vector representation (the token embedding).
- Unembedding
  - The embedding is multiplied by the unembedding matrix (W_U).
  - This produces logits, which are passed through a softmax to obtain probabilities for the next token.
These two matrix multiplications constitute the entire model.
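The two steps above can be sketched in a few lines of NumPy. The vocabulary size, embedding dimension, and random weights here are toy values chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumptions for illustration): tiny vocabulary, tiny embedding.
vocab_size, d_model = 10, 4

W_E = rng.normal(size=(d_model, vocab_size))   # embedding matrix
W_U = rng.normal(size=(vocab_size, d_model))   # unembedding matrix

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def zero_layer_forward(token_id):
    one_hot = np.zeros(vocab_size)
    one_hot[token_id] = 1.0
    embedding = W_E @ one_hot    # step 1: token embedding
    logits = W_U @ embedding     # step 2: unembedding
    return softmax(logits)       # distribution over the next token

probs = zero_layer_forward(3)
print(probs.shape)
```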
Mathematical Interpretation
The whole process can be written as a single matrix product. If t is the one-hot vector for the input token, the logits are W_U W_E t, so the model is fully described by the combined matrix W_U W_E. This combined matrix maps one token directly to a distribution over the next token.
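This collapse into a single matrix can be checked numerically. The sizes and random weights below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab_size, d_model = 10, 4  # toy sizes (assumptions for illustration)
W_E = rng.normal(size=(d_model, vocab_size))
W_U = rng.normal(size=(vocab_size, d_model))

# The combined matrix: column t holds the next-token logits for input token t.
T = W_U @ W_E  # shape (vocab_size, vocab_size)

# Multiplying T by a one-hot token is identical to embedding then unembedding it.
one_hot = np.zeros(vocab_size)
one_hot[3] = 1.0
assert np.allclose(T @ one_hot, W_U @ (W_E @ one_hot))
print(T.shape)
```

Because the input is one-hot, applying T just selects one of its columns, which is why the model reduces to a lookup table over previous tokens.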
Connection to Bigram Statistics
The speaker emphasizes that the product effectively represents bigram statistics:

- It encodes how frequently one token follows another in the training data.
- Because the output is fed into a softmax, the matrix should represent bigram log-likelihoods.
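As a concrete reference point, the bigram log-likelihoods that such a model is trained to match can be computed directly from a corpus by counting adjacent token pairs. The toy corpus and smoothing constant below are assumptions for illustration:

```python
import numpy as np

# Toy corpus of token ids (an assumption for illustration).
corpus = [0, 1, 2, 1, 0, 1, 2, 2, 1, 0]
vocab_size = 3

# Count bigrams: counts[nxt, prev] += 1 for each adjacent pair.
counts = np.zeros((vocab_size, vocab_size))
for prev, nxt in zip(corpus, corpus[1:]):
    counts[nxt, prev] += 1

# Conditional probabilities P(next | prev), with a small smoothing term
# so the log is defined even for unseen bigrams.
smoothed = counts + 1e-6
probs = smoothed / smoothed.sum(axis=0, keepdims=True)
log_likelihoods = np.log(probs)
print(log_likelihoods.shape)
```

Because softmax is invariant to adding a constant to all logits, the model's combined matrix only needs to match these log-likelihoods up to a per-column shift.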
However, this representation is only an approximation:

- The embedding dimension is much smaller than the vocabulary size.
- This forces the matrix to be low-rank, limiting how precisely it can represent true bigram frequencies.
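The low-rank constraint is easy to see numerically: the product of the two factors can never have rank above the embedding dimension. The sizes below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
vocab_size, d_model = 50, 8  # d_model << vocab_size (illustrative sizes)

W_E = rng.normal(size=(d_model, vocab_size))
W_U = rng.normal(size=(vocab_size, d_model))
T = W_U @ W_E  # combined (vocab_size x vocab_size) matrix

# rank(T) <= d_model, so T cannot reproduce an arbitrary
# (vocab_size x vocab_size) table of bigram log-likelihoods exactly.
print(np.linalg.matrix_rank(T))
```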
Why This Matters
Although extremely simple, the zero-layer transformer provides an important conceptual insight:

- All larger transformers contain a term of the form W_U W_E.
- When this term appears in more complex models, it should be interpreted as implementing bigram-like behavior.
- Understanding this simplest case helps ground intuition when analyzing deeper transformers with attention and multiple layers.
Conclusion and Next Steps
The speaker concludes that while the zero-layer transformer is trivial, it is genuinely useful for building intuition. The next step in the series will be to study a one-layer, attention-only transformer, adding complexity on top of this foundational idea.