Although I’ve been repeatedly advised it’s not a good social strategy, a glorious way to start a research paper is with specific, righteous criticism of your anonymous colleagues:^{a }

Transformers are deep feed-forward artificial neural networks with a (self)attention mechanism. They have been tremendously successful in natural language processing tasks and other domains. Since their inception 5 years ago, many variants have been suggested. Descriptions are usually graphical, verbal, partial, or incremental. Despite their popularity, it seems no pseudocode has ever been published for any variant. Contrast this to other fields of computer science, even to “cousin” discipline reinforcement learning.

So begin Phuong & Hutter in a great, rant-filled paper that “covers what Transformers are, how they are trained, what they’re used for, their key architectural components, tokenization, and a preview of practical considerations, and the most prominent models.” As an exercise, in this post I’m dig into the first item by writing down an even more compact definition of a transformer than theirs, in the form of a mathematical function rather than pseudocode, while avoiding the ambiguities rampant in the rest of the literature. I will consider only what a single forward-pass of a transformer does, considered as a map from token sequences to probability distributions over the token vocabulary. I do not try to *explain* the transformer, nor do I address other important aspects like motivation, training, and computational.

(This post also draws on a nice introduction by Turner. If you are interested in understanding and interpretation, you might check out — in descending order of sophistication — Elhage et al., Molina, and Lee & Trott. Corrections to this post are welcome. EDIT: See also the recent pseudocode from Carpenter.)

As detailed well by Phuong & Hutter, the basic transformer structure can be used in several overall “configurations” (e.g., sequence classification, or sequence-to-sequence prediction for translation). We will present transformers as used for next-token prediction. There are many other smaller ways of varying the architecture (see a sample at the end of this post), so I have tried to pick a “standard” version, but I don’t have enough experience in the literature or industry to know confidently which are standard/promising vs speculative/unpromising.

## Preliminaries

First, our basic notation: The matrix transpose is denoted “” and the positive integers are . The set of positive integers up to is and we use to express a vector of length with elements .

Our token vocabulary (e.g., ascii characters, or English words) we denote by . A sequence is a vector of tokens, , and a probability distribution over the vocabulary is . Our goal is to construct a next-token predictor: .

Finally, we will use these common nonlinear functions:

- Rectified linear unit,
(1)

When is applied to vectors, we take it to act element-wise.

- Layer normalization (aka z-scoring),
(2)

with and , as usual.

- Softmax (aka Boltzmann distribution),
(3)

## Hyperparameters, parameters, and activations

#### Model hyperparameters (i.e., architectural choices)

- Number of layers:
- Number of attention heads:
- Embedding, query-key, value-output, and feed-forward dimensions:

#### Model parameters (i.e., trainable weights)

- Token embedding matrix: , which is composed of the row vectors indexed by token .
- Position embedding matrix: , with for .
- Token unembedding matrix:
- For each transformer layer :
- Two feed-forward weight matrices: ,
- Two feed-forward bias vectors: ,
- For each head :
- Query and key matrices:
- Value and output matrices:

#### Activations (i.e., neuron outputs)

- For each transformer layer :
- Pre- and post-attention hidden state: , which are composed of the row vectors indexed by position .

## Definition of a transformer function

Our definition makes use of two workhorses

- Multi-head attention,

(4)

- Feed-forward network (aka multilayer perceptron, MLP),
(5)

which has just one hidden layer.

We can then define our transformer, , recursively with

(6)

Given a token sequence , this defines the transformer next-token predictor, . We can of course probabilistically extend any sequence to arbitrary length by computing , sampling an element according to the probability distribution , concatenating to the end of the sequence , and iterating.

(Fin. The rest of this post is optional.)

## Some comments

- Using and , one often sees attention presented with the highly symmetric looking softmax expression
(7)

but as far as I can tell the apparent elegance is misleading. The softmax function naturally acts on a

*vector*. When applied to a*matrix*it is by convention taken to act column-wise or row-rise, i.e.,*collectively*on the one index but*independently*on the other index, breaking the symmetry. We have kept the broken symmetry explicit in our above expression (4) for the attention function. - As the weight matrices and only appear together in the product , it is tempting to combine them into a single matrix. However, that matrix could have rank , whereas the original product has maximum rank . Typically one chooses for reasons of computational complexity. This all applies similarly to the pair and . See here and here for more.

## Variations

This list of a few variations may help you match up this definition with others you’ve seen. (There are many more variations not listed here.) Below I will drop the superscript indices “” and ““.

- Our layer normalization follows Phuong & Hutter and the original work by Vaswani et al., where is applied to the complete hidden state after being modified by multi-headed attention and again after being modified by the feed-forward net . In contrast, Xioang et al. argue for the superiority of LayerNorm-ing only the
*changes*(“residuals”) from these two mechanisms before they are added to the hidden layer, and this choice is used by Turner. - The layer normalization is also often generalized from to , where “” is element-wise multiplication and where are learned vectors in the embedding space for multiplicative and additive scaling, respectively.
- The use of attention in Eq. (6) is known as bidirectional unmasked self-attention, but unidirectional, masked, and non-self versions are defined and explored elsewhere.
- The linear transformations , , , and can be generalized to non-homogeneous linear function by including additive biases, e.g., replacing with where the bias is now an additional (trainable) vector of weights. (This makes them more closely resemble feed-forward networks.) However, some of these biases can be non-functional.
- Rather than learned, the unembedding matrix is often taken to be simply the transpose of the token embedding matrix: .
- Rather than learned, is often taken to simply be fixed as a discrete Fourier transform of position: and for . This allows the model to handle arbitrary length sequences even if the number of long sequences in the training set is small.
- One can modify the feed-forward network to use a larger number of layers, or a different nonlinearity than .

*I thank Dan Ranard and Steve Howard for suggestions and feedback.*]

### Footnotes

(↵ returns to text)

- For read-ability, I have dropped the citations and section references from these quotes without marking the ellipses.↵