Although I’ve been repeatedly advised it’s not a good social strategy, a glorious way to start a research paper is with specific, righteous criticism of your anonymous colleagues:[a]
Transformers are deep feed-forward artificial neural networks with a (self)attention mechanism. They have been tremendously successful in natural language processing tasks and other domains. Since their inception 5 years ago, many variants have been suggested. Descriptions are usually graphical, verbal, partial, or incremental. Despite their popularity, it seems no pseudocode has ever been published for any variant. Contrast this to other fields of computer science, even to “cousin” discipline reinforcement learning.
So begin Phuong & Hutter in a great, rant-filled paper that “covers what Transformers are, how they are trained, what they’re used for, their key architectural components, tokenization, and a preview of practical considerations, and the most prominent models.” As an exercise, in this post I dig into the first item by writing down an even more compact definition of a transformer than theirs, in the form of a mathematical function rather than pseudocode, while avoiding the ambiguities rampant in the rest of the literature. I consider only what a single forward pass of a transformer does, treated as a map from token sequences to probability distributions over the token vocabulary. I do not try to explain the transformer, nor do I address other important aspects like motivation, training, and computational efficiency.
(This post also draws on a nice introduction by Turner. If you are interested in understanding and interpretation, you might check out — in descending order of sophistication — Elhage et al., Molina, and Lee & Trott. Corrections to this post are welcome. EDIT: See also the recent pseudocode from Carpenter.)
As detailed well by Phuong & Hutter, the basic transformer structure can be used in several overall “configurations” (e.g., sequence classification, or sequence-to-sequence prediction for translation). We will present transformers as used for next-token prediction. There are many other smaller ways of varying the architecture (see a sample at the end of this post), so I have tried to pick a “standard” version, but I don’t have enough experience in the literature or industry to know confidently which are standard/promising vs speculative/unpromising.
Preliminaries
First, our basic notation: The matrix transpose is denoted “$\cdot^\intercal$” and the positive integers are $\mathbb{N} = \{1, 2, 3, \ldots\}$. The set of positive integers up to $N \in \mathbb{N}$ is $[N] := \{1, \ldots, N\}$, and we use $x = (x_i)_{i \in [N]}$ to express a vector $x$ of length $N$ with elements $x_i$.
Our token vocabulary (e.g., ASCII characters, or English words) we denote by $V$. A sequence is a vector of tokens, $s = (s_n)_{n \in [N]} \in V^N$, and a probability distribution over the vocabulary is a vector $p \in \Delta(V) := \{p \in [0,1]^{|V|} : \sum_{t \in V} p_t = 1\}$. Our goal is to construct a next-token predictor: $\mathrm{TRANS}: V^N \to \Delta(V)$.
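In code terms, the object we are after has roughly the following signature. This is just a sketch of the types; the names and the choice to represent tokens by integer ids are my own conventions, not part of the definitions above.

```python
import numpy as np

V_size = 1000  # |V|: vocabulary size, with tokens represented by ids 0, ..., |V|-1

def trans_stub(s: np.ndarray) -> np.ndarray:
    """Next-token predictor V^N -> Delta(V): maps a length-N sequence of
    token ids to a probability distribution over the vocabulary."""
    # Placeholder body: the uniform distribution (nonnegative entries summing to 1).
    return np.full(V_size, 1.0 / V_size)
```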
Finally, we will use these common nonlinear functions:
- Rectified linear unit,
$$\mathrm{ReLU}(x) = \max(0, x) \tag{1}$$
When $\mathrm{ReLU}$ is applied to vectors, we take it to act element-wise.
- Layer normalization (aka z-scoring),
$$\mathrm{LayerNorm}(x) = \frac{x - \bar{x}}{\sigma_x} \tag{2}$$
with mean $\bar{x} = \frac{1}{d}\sum_{i=1}^{d} x_i$ and standard deviation $\sigma_x = \sqrt{\frac{1}{d}\sum_{i=1}^{d}(x_i - \bar{x})^2}$ for $x \in \mathbb{R}^d$, as usual.
- Softmax (aka Boltzmann distribution),
$$\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{d} e^{x_j}} \tag{3}$$
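For concreteness, here is a minimal NumPy sketch of these three functions. The function names and the small `eps` safeguard in the layer norm are my additions; Eqs. (1)–(3) themselves do not include them.

```python
import numpy as np

def relu(x):
    # Eq. (1), applied element-wise to arrays.
    return np.maximum(0.0, x)

def layer_norm(x, eps=1e-5):
    # Eq. (2): z-score a vector (subtract its mean, divide by its standard deviation).
    # eps guards against division by zero; it is not part of Eq. (2).
    return (x - x.mean()) / (x.std() + eps)

def softmax(x):
    # Eq. (3): Boltzmann distribution over the entries of a vector.
    # Subtracting the max is a standard numerical-stability trick.
    z = np.exp(x - x.max())
    return z / z.sum()
```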
Hyperparameters, parameters, and activations
Model hyperparameters (i.e., architectural choices)
- Number of layers: $L \in \mathbb{N}$
- Number of attention heads: $H \in \mathbb{N}$
- Embedding, query-key, value-output, and feed-forward dimensions: $d_{\mathrm{e}}, d_{\mathrm{qk}}, d_{\mathrm{vo}}, d_{\mathrm{ff}} \in \mathbb{N}$
Model parameters (i.e., trainable weights)
- Token embedding matrix: $W_{\mathrm{E}} \in \mathbb{R}^{|V| \times d_{\mathrm{e}}}$, which is composed of the row vectors $w_t \in \mathbb{R}^{d_{\mathrm{e}}}$ indexed by token $t \in V$.
- Position embedding matrix: $W_{\mathrm{P}} \in \mathbb{R}^{N_{\max} \times d_{\mathrm{e}}}$, with row vectors $w^{\mathrm{P}}_n \in \mathbb{R}^{d_{\mathrm{e}}}$ for $n \in [N_{\max}]$, where $N_{\max} \ge N$ is the maximum sequence length.
- Token unembedding matrix: $W_{\mathrm{U}} \in \mathbb{R}^{d_{\mathrm{e}} \times |V|}$
- For each transformer layer $\ell \in [L]$:
  - Two feed-forward weight matrices: $W^{(\ell)}_{\mathrm{ff},1} \in \mathbb{R}^{d_{\mathrm{ff}} \times d_{\mathrm{e}}}$, $W^{(\ell)}_{\mathrm{ff},2} \in \mathbb{R}^{d_{\mathrm{e}} \times d_{\mathrm{ff}}}$
  - Two feed-forward bias vectors: $b^{(\ell)}_1 \in \mathbb{R}^{d_{\mathrm{ff}}}$, $b^{(\ell)}_2 \in \mathbb{R}^{d_{\mathrm{e}}}$
  - For each head $h \in [H]$:
    - Query and key matrices: $W^{(\ell,h)}_{\mathrm{Q}}, W^{(\ell,h)}_{\mathrm{K}} \in \mathbb{R}^{d_{\mathrm{qk}} \times d_{\mathrm{e}}}$
    - Value and output matrices: $W^{(\ell,h)}_{\mathrm{V}} \in \mathbb{R}^{d_{\mathrm{vo}} \times d_{\mathrm{e}}}$, $W^{(\ell,h)}_{\mathrm{O}} \in \mathbb{R}^{d_{\mathrm{e}} \times d_{\mathrm{vo}}}$
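As a shape check, here is a hedged NumPy sketch that allocates randomly initialized parameters with exactly the dimensions listed above. The container layout, variable names, illustrative hyperparameter values, and Gaussian initialization are my own choices, not part of the definition.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative hyperparameter values.
L, H = 2, 4                               # number of layers and heads
d_e, d_qk, d_vo, d_ff = 64, 16, 16, 256   # embedding, query-key, value-output, feed-forward dims
V_size, N_max = 1000, 512                 # vocabulary size |V| and maximum sequence length

params = {
    "W_E": rng.normal(size=(V_size, d_e)),   # token embedding, rows w_t
    "W_P": rng.normal(size=(N_max, d_e)),    # position embedding, rows w^P_n
    "W_U": rng.normal(size=(d_e, V_size)),   # token unembedding
    "layers": [
        {
            "W_ff1": rng.normal(size=(d_ff, d_e)),
            "W_ff2": rng.normal(size=(d_e, d_ff)),
            "b1": np.zeros(d_ff),
            "b2": np.zeros(d_e),
            "heads": [
                {
                    "W_Q": rng.normal(size=(d_qk, d_e)),
                    "W_K": rng.normal(size=(d_qk, d_e)),
                    "W_V": rng.normal(size=(d_vo, d_e)),
                    "W_O": rng.normal(size=(d_e, d_vo)),
                }
                for _ in range(H)
            ],
        }
        for _ in range(L)
    ],
}
```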
Activations (i.e., neuron outputs)
- For each transformer layer $\ell \in [L]$:
  - Pre- and post-attention hidden states: $X^{(\ell)}, Y^{(\ell)} \in \mathbb{R}^{N \times d_{\mathrm{e}}}$, which are composed of the row vectors $x^{(\ell)}_n, y^{(\ell)}_n \in \mathbb{R}^{d_{\mathrm{e}}}$ indexed by position $n \in [N]$ (with $X^{(0)}$ denoting the initial embedding of the sequence).
Definition of a transformer function
Our definition makes use of two workhorses:
- Multi-head attention,
$$\mathrm{Attn}^{(\ell)}(X)_n = \sum_{h=1}^{H} W^{(\ell,h)}_{\mathrm{O}} \sum_{m=1}^{N} \left[\mathrm{softmax}\!\left(\left(\frac{\left(W^{(\ell,h)}_{\mathrm{Q}} x_n\right)^{\!\intercal} W^{(\ell,h)}_{\mathrm{K}} x_{m'}}{\sqrt{d_{\mathrm{qk}}}}\right)_{\!m' \in [N]}\right)\right]_m W^{(\ell,h)}_{\mathrm{V}} x_m \tag{4}$$
- Feed-forward network (aka multilayer perceptron, MLP),
$$\mathrm{FF}^{(\ell)}(x) = W^{(\ell)}_{\mathrm{ff},2}\,\mathrm{ReLU}\!\left(W^{(\ell)}_{\mathrm{ff},1}\, x + b^{(\ell)}_1\right) + b^{(\ell)}_2 \tag{5}$$
which has just one hidden layer.
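Here is one possible NumPy rendering of Eqs. (4) and (5), operating on the rows $x_n$ of the hidden state $X$ and on the per-layer/per-head weight dictionaries sketched earlier. The helper names and the explicit per-position loop (chosen for clarity over speed) are mine.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def attention(X, heads, d_qk):
    # Eq. (4): bidirectional, unmasked multi-head self-attention.
    # X has shape (N, d_e); row n is the hidden vector x_n at position n.
    N, _ = X.shape
    out = np.zeros_like(X)
    for h in heads:
        Q = X @ h["W_Q"].T            # rows are W_Q x_n, shape (N, d_qk)
        K = X @ h["W_K"].T            # rows are W_K x_m, shape (N, d_qk)
        Vh = X @ h["W_V"].T           # rows are W_V x_m, shape (N, d_vo)
        for n in range(N):
            # Softmax over the key index m', independently for each query position n.
            weights = softmax(Q[n] @ K.T / np.sqrt(d_qk))    # shape (N,)
            out[n] += h["W_O"] @ (weights @ Vh)              # back to shape (d_e,)
    return out

def feed_forward(x, layer):
    # Eq. (5): a one-hidden-layer MLP applied to a single vector x.
    hidden = np.maximum(0.0, layer["W_ff1"] @ x + layer["b1"])
    return layer["W_ff2"] @ hidden + layer["b2"]
```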
We can then define our transformer, $\mathrm{TRANS}$, recursively with
$$\begin{aligned} x^{(0)}_n &= w_{s_n} + w^{\mathrm{P}}_n \\ y^{(\ell)}_n &= \mathrm{LayerNorm}\!\left(x^{(\ell-1)}_n + \mathrm{Attn}^{(\ell)}\!\left(X^{(\ell-1)}\right)_n\right) \\ x^{(\ell)}_n &= \mathrm{LayerNorm}\!\left(y^{(\ell)}_n + \mathrm{FF}^{(\ell)}\!\left(y^{(\ell)}_n\right)\right) \end{aligned} \tag{6}$$
for all positions $n \in [N]$ and layers $\ell \in [L]$. Given a token sequence $s \in V^N$, this defines the transformer next-token predictor, $\mathrm{TRANS}(s) = \mathrm{softmax}\!\left(W^{\intercal}_{\mathrm{U}}\, x^{(L)}_N\right) \in \Delta(V)$. We can of course probabilistically extend any sequence $s$ to arbitrary length by computing $p = \mathrm{TRANS}(s)$, sampling an element $t \in V$ according to the probability distribution $p$, concatenating $t$ to the end of the sequence $s$, and iterating.
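Putting the pieces together, Eq. (6) plus the sampling loop might look like the following sketch. It reuses the `params` dictionary and the `layer_norm`, `softmax`, `attention`, and `feed_forward` helpers sketched above; all of those names, and the use of integer token ids, are my own conventions rather than part of the definition.

```python
import numpy as np

def transformer(tokens, params, d_qk):
    # Eq. (6): map a length-N array of token ids to a distribution over V.
    N = len(tokens)
    X = params["W_E"][tokens] + params["W_P"][:N]    # x^(0)_n = w_{s_n} + w^P_n
    for layer in params["layers"]:
        A = attention(X, layer["heads"], d_qk)
        Y = np.stack([layer_norm(X[n] + A[n]) for n in range(N)])
        X = np.stack([layer_norm(Y[n] + feed_forward(Y[n], layer)) for n in range(N)])
    return softmax(params["W_U"].T @ X[-1])          # TRANS(s), a distribution over V

def extend(tokens, params, d_qk, num_new, rng=None):
    # Probabilistic extension: compute p = TRANS(s), sample a token, append, iterate.
    rng = rng or np.random.default_rng(0)
    tokens = list(tokens)
    for _ in range(num_new):
        p = transformer(np.array(tokens), params, d_qk)
        tokens.append(int(rng.choice(len(p), p=p)))
    return tokens
```

For example, `extend([3, 1, 4], params, d_qk, num_new=10)` appends ten sampled token ids to the sequence, although with the randomly initialized toy parameters above the output is of course meaningless.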
(Fin. The rest of this post is optional.)
Some comments
- Using $Q = X W^{\intercal}_{\mathrm{Q}}$ and $K = X W^{\intercal}_{\mathrm{K}}$, one often sees attention presented with the highly symmetric-looking softmax expression
$$\mathrm{softmax}\!\left(\frac{Q K^{\intercal}}{\sqrt{d_{\mathrm{qk}}}}\right) X W^{\intercal}_{\mathrm{V}} \tag{7}$$
but as far as I can tell the apparent elegance is misleading. The softmax function naturally acts on a vector. When applied to a matrix it is by convention taken to act column-wise or row-wise, i.e., collectively on the one index but independently on the other index, breaking the symmetry. We have kept the broken symmetry explicit in our above expression (4) for the attention function. (A numerical illustration appears in the sketch after this list.)
- As the $d_{\mathrm{qk}} \times d_{\mathrm{e}}$ weight matrices $W^{(\ell,h)}_{\mathrm{Q}}$ and $W^{(\ell,h)}_{\mathrm{K}}$ only appear together in the product $W^{(\ell,h)\intercal}_{\mathrm{Q}} W^{(\ell,h)}_{\mathrm{K}}$, it is tempting to combine them into a single $d_{\mathrm{e}} \times d_{\mathrm{e}}$ matrix. However, that matrix could have rank $d_{\mathrm{e}}$, whereas the original product has maximum rank $d_{\mathrm{qk}}$. Typically one chooses $d_{\mathrm{qk}} < d_{\mathrm{e}}$ for reasons of computational complexity. This all applies similarly to the pair $W^{(\ell,h)}_{\mathrm{V}}$ and $W^{(\ell,h)}_{\mathrm{O}}$ (with $d_{\mathrm{vo}}$ in place of $d_{\mathrm{qk}}$). See here and here for more.
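Both of these points are easy to check numerically. In the hedged sketch below (arbitrary small dimensions and helper names of my own choosing), the first print shows that the row-wise and column-wise softmax conventions genuinely disagree, and the second shows that the combined query-key matrix has rank at most $d_{\mathrm{qk}}$ rather than $d_{\mathrm{e}}$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_e, d_qk = 5, 64, 16
X = rng.normal(size=(N, d_e))
W_Q = rng.normal(size=(d_qk, d_e))
W_K = rng.normal(size=(d_qk, d_e))

# Point 1: softmax on a matrix must pick an index; the two choices differ.
S = (X @ W_Q.T) @ (X @ W_K.T).T / np.sqrt(d_qk)        # (N, N) attention scores

def softmax_rows(M):
    Z = np.exp(M - M.max(axis=1, keepdims=True))
    return Z / Z.sum(axis=1, keepdims=True)

row_wise = softmax_rows(S)        # one distribution per query n, as in Eq. (4)
col_wise = softmax_rows(S.T).T    # the other convention: normalized over queries
print(np.allclose(row_wise, col_wise))                  # False in general

# Point 2: the combined d_e x d_e matrix is rank-limited by d_qk.
print(np.linalg.matrix_rank(W_Q.T @ W_K), d_qk, d_e)    # 16 16 64
```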
Variations
This list of a few variations may help you match up this definition with others you’ve seen. (There are many more variations not listed here.) Below I will drop the superscript indices “$(\ell)$” and “$(h)$”.
- Our layer normalization follows Phuong & Hutter and the original work by Vaswani et al., where $\mathrm{LayerNorm}$ is applied to the complete hidden state after it is modified by multi-head attention $\mathrm{Attn}$ and again after it is modified by the feed-forward net $\mathrm{FF}$, as in Eq. (6). In contrast, Xiong et al. argue for the superiority of LayerNorm-ing only the changes (“residuals”) from these two mechanisms before they are added to the hidden layer, and this choice is used by Turner.
- The layer normalization is also often generalized from $\mathrm{LayerNorm}(x)$ to $\gamma \odot \mathrm{LayerNorm}(x) + \beta$, where “$\odot$” is element-wise multiplication and where $\gamma, \beta \in \mathbb{R}^{d_{\mathrm{e}}}$ are learned vectors in the embedding space for multiplicative and additive scaling, respectively. (This generalization is sketched in code after this list.)
- The use of attention $\mathrm{Attn}$ in Eq. (6) is known as bidirectional unmasked self-attention, but unidirectional, masked, and non-self versions are defined and explored elsewhere.
- The linear transformations $W_{\mathrm{Q}}$, $W_{\mathrm{K}}$, $W_{\mathrm{V}}$, and $W_{\mathrm{O}}$ can be generalized to non-homogeneous linear functions by including additive biases, e.g., replacing $W_{\mathrm{Q}} x_n$ with $W_{\mathrm{Q}} x_n + b_{\mathrm{Q}}$, where the bias $b_{\mathrm{Q}} \in \mathbb{R}^{d_{\mathrm{qk}}}$ is now an additional (trainable) vector of weights. (This makes them more closely resemble feed-forward networks.) However, some of these biases can be non-functional; for example, a bias $b_{\mathrm{K}}$ added to the keys only shifts each query’s attention logits by a constant, which the softmax cancels.
- Rather than learned, the unembedding matrix is often taken to be simply the transpose of the token embedding matrix: $W_{\mathrm{U}} = W^{\intercal}_{\mathrm{E}}$.
- Rather than learned, $W_{\mathrm{P}}$ is often taken to simply be fixed as a discrete Fourier transform of position: $[W_{\mathrm{P}}]_{n,2i-1} = \sin\!\left(n / N_{\max}^{2i/d_{\mathrm{e}}}\right)$ and $[W_{\mathrm{P}}]_{n,2i} = \cos\!\left(n / N_{\max}^{2i/d_{\mathrm{e}}}\right)$ for $i \in [d_{\mathrm{e}}/2]$. This allows the model to handle arbitrary length sequences even if the number of long sequences in the training set is small. (Also sketched in code after this list.)
- One can modify the feed-forward network $\mathrm{FF}$ to use a larger number of layers, or a different nonlinearity than $\mathrm{ReLU}$.
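Two of the variations above are simple enough to sketch directly in NumPy: the layer normalization with learned gain and bias, and the fixed sinusoidal position embedding. As before, the function names and the `eps` guard are my own conventions, and note that Vaswani et al. use $10^4$ as the base constant.

```python
import numpy as np

def layer_norm_affine(x, gamma, beta, eps=1e-5):
    # Generalized LayerNorm: element-wise scale (gamma) and shift (beta),
    # both learned vectors living in the embedding space.
    z = (x - x.mean()) / (x.std() + eps)
    return gamma * z + beta

def sinusoidal_positions(N, d_e, N_max=10_000):
    # Fixed, non-learned position embeddings: row n is w^P_n, with columns
    # alternating sin/cos at geometrically spaced frequencies.
    W_P = np.zeros((N, d_e))
    for n in range(1, N + 1):                       # positions are 1-indexed in the text
        for i in range(1, d_e // 2 + 1):
            angle = n / N_max ** (2 * i / d_e)
            W_P[n - 1, 2 * i - 2] = np.sin(angle)   # column 2i - 1 in 1-indexed notation
            W_P[n - 1, 2 * i - 1] = np.cos(angle)   # column 2i
    return W_P
```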
Footnotes
- [a] For readability, I have dropped the citations and section references from these quotes without marking the ellipses.