During World War II, governments invested heavily in the nascent field of electronic programmable computers. Key examples include the Colossus, developed by the British for codebreaking; the ENIAC, developed by the Americans for calculating artillery firing tables; and the Z3, developed by the Germans but never put into use by the German government, which deemed it not vital to the war effort.

Off the back of the research spawned by this investment, commercial electronic programmable computers began to appear in the early 1950s, starting with the Ferranti Mark 1 (also known as the Manchester Electronic Computer) in the UK and the UNIVAC I in the USA, both of which became available in 1951.

By the 1950s, then, it was well understood in academia that electronic computers could perform arithmetic calculations far faster than any human.
With this computational speedup in mind, many academics began investigating the problem of automatic translation of natural languages.

However, researchers quickly found that machine translation was a much harder problem than they had anticipated. The insights of traditional structural linguistics proved insufficient for producing an accurate machine translator; to make real progress, a more statistical approach would be required. With this, the field of computational linguistics was born.

With the necessity of probabilistic models established, the question becomes: how can we use probabilistic models to build an accurate machine translation system?

Suppose we have some source language \(S\) and some destination language \(D\), and we're trying to translate some particular phrase from language \(S\) to language \(D\). That is, given the particular phrase in language \(S\), we want to generate the most likely phrase in language \(D\). In mathematical terms, what we want to find is: $$ \tilde{d} = \arg\max_{d \in D} P(d|s), \quad s \in S $$

One way to decompose that term is via Bayes' rule: $$ \arg\max_{d \in D} P(d|s) = \arg\max_{d \in D} \frac{P(s|d) P(d)}{P(s)} $$

Since \(P(s)\) is constant over all \(d \in D\), we can say: $$ \tilde{d} = \arg\max_{d \in D} P(d|s) = \arg\max_{d \in D} \frac{P(s|d) P(d)}{P(s)} = \arg\max_{d \in D} P(s|d) P(d) $$
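To make this concrete, here is a minimal Python sketch of noisy-channel decoding over a toy candidate set. The probability tables are made up purely for illustration:

```python
# Toy translation-model and language-model tables (numbers invented for
# illustration): P(s|d) scores how well d explains the source phrase s,
# P(d) scores how fluent d is in the destination language.
translation_model = {  # P(s|d) for some fixed source phrase s
    "the blue house": 0.30,
    "the house blue": 0.35,  # a literal gloss explains s slightly better
    "blue the house": 0.35,
}
language_model = {  # P(d)
    "the blue house": 0.60,
    "the house blue": 0.05,
    "blue the house": 0.01,
}

def decode(candidates):
    """Return argmax over d of P(s|d) * P(d) for a candidate set."""
    return max(candidates, key=lambda d: translation_model[d] * language_model[d])

print(decode(translation_model.keys()))  # -> "the blue house"
```

Note how the language model term rescues the fluent candidate even though the translation model slightly prefers the literal glosses.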

The term \(P(d)\) is the absolute likelihood of the phrase \(d\) in language \(D\). A model that generates this likelihood is called a language model.

There are, broadly speaking, two main ways that language models are built:

- n-gram methods
- Neural methods, also called continuous space language models because words and other linguistic symbols are mapped to vectors in a continuous space

Let's say we're trying to predict the probability of a sequence of words \(w_1, w_2, ..., w_k\). We can decompose the probability of such a sequence by using the probabilistic chain rule as follows: $$ P(w_{1} w_{2} ... w_{k}) = \prod_{1 \leq i \leq k} P(w_{i} | w_{1} w_{2} ... w_{i-1}) $$

While this is a step in the right direction towards being able to calculate \(P(w_1 w_2 ... w_k)\), computing the chain rule product gets unwieldy very quickly for increasing values of \(k\).
In order to deal with this, n-gram methods use the simplifying assumption that the probability of the next word in a sequence of words is only dependent on the previous \(n\) words.

Using this assumption, the complexity of computing the probability of a sequence of \(k\) words is dependent only on some constant value \(n\) and not on the length of the sequence \(k\).
All conditional probabilities can then be estimated by simply counting the occurrences of sequences of words in a training corpus.
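As a minimal sketch, bigram probabilities (one word of context) can be estimated from raw counts like this; the corpus is a toy example:

```python
from collections import Counter

corpus = "the cat sat on the mat . the cat ate .".split()

# Bigram model: estimate P(w_i | w_{i-1}) from counts.
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def p_next(word, prev):
    """P(word | prev) = count(prev, word) / count(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(p_next("cat", "the"))  # 2 of the 3 occurrences of "the" are followed by "cat"
```

Real n-gram models add smoothing to handle unseen sequences, which this sketch omits.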

While the count-based approach of n-grams is very appealing due to its conceptual simplicity, the Markov assumption proves to be very limiting. In Exploring the Limits of Language Modeling, Jozefowicz et al. showed that neural network models significantly outperform n-gram-based models at the language modeling task.

The most common types of neural networks used for language modeling can be grouped into two categories:

- Recurrent models
- Attentional models

Before diving into the details of specific types of neural network models, we first need to briefly discuss how exactly these neural network models process text data in the first place. Since neural network models work on numerical input only, we need to develop some sort of scheme for numericalizing text input. The way that this is done is by leveraging what are called embedding matrices.

An embedding matrix is simply a matrix that maps an input token to some real-valued vector. Every token in the vocabulary is assigned an integer index, and the embedding vector for a token is the row of the embedding matrix at that index. The entries of the embedding matrix are learned along with the rest of the model's parameters.

From this point forward, whenever we refer to any input into a neural network model, we are referring to these embedding vectors.
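A minimal NumPy sketch of the embedding lookup; the vocabulary, dimensions and variable names are illustrative, and in practice the matrix is learned rather than random:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = {"the": 0, "cat": 1, "sat": 2}  # token -> integer index
d_model = 4                             # embedding dimensionality
E = rng.standard_normal((len(vocab), d_model))  # embedding matrix

def embed(tokens):
    """Map a list of tokens to a (len(tokens), d_model) matrix of embedding vectors."""
    return E[[vocab[t] for t in tokens]]

x = embed(["the", "cat", "sat"])
print(x.shape)  # (3, 4)
```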

Recurrent models are models that compute a function of the form: $$ h(t) = f(h(t-1), x(t); \theta) $$

where \(h(t)\) is the hidden state of the model at time \(t\), \(x(t)\) is the input into the model at time \(t\) and \(\theta\) represents the parameters of the model.

Since the model is defined in terms of the transition from one timestep to the next, it allows us to naturally generalize to sequence lengths not seen in the training data and it allows us to share parameters across all timesteps, which simplifies things greatly. Such a general form admits many specific implementations which we now discuss.

The simplest form of recurrent neural model in general use is the recurrent neural network (RNN), otherwise referred to as a vanilla RNN.
In the RNN architecture, the model is defined by three groups of parameters:

- \(U\): the matrix that controls the input-to-state connection
- \(V\): the matrix that controls the state-to-output connection
- \(W\): the matrix that controls the state-to-state connection

The equations that define the RNN model are as follows: $$\begin{eqnarray} &a(t) = b + Wh(t-1) + Ux(t) \\ &h(t) = \sigma_{h} (a(t)) \\ &o(t) = c + Vh(t) \end{eqnarray}$$

where \(x(t)\) is the input at time \(t\), \(o(t)\) is the output at time \(t\), \(\sigma_h\) is some non-linear activation function and \(a(t)\) and \(h(t)\) represent the hidden state of the model at time \(t\).

In the case of language models, in order to calculate a probability distribution over the next predicted word in the sequence, we then do: $$ \hat{y} (t) = softmax(o(t)) $$
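The full forward pass can be sketched in NumPy as follows. The sizes, initialization scale and the choice of tanh for \(\sigma_h\) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, d_out = 4, 8, 10  # toy sizes: input, hidden, vocabulary

# Parameters matching the equations above: U (input-to-state),
# W (state-to-state), V (state-to-output), plus biases b and c.
U = rng.standard_normal((d_h, d_in)) * 0.1
W = rng.standard_normal((d_h, d_h)) * 0.1
V = rng.standard_normal((d_out, d_h)) * 0.1
b, c = np.zeros(d_h), np.zeros(d_out)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def rnn_step(h_prev, x):
    a = b + W @ h_prev + U @ x  # a(t)
    h = np.tanh(a)              # h(t), with sigma_h = tanh
    o = c + V @ h               # o(t)
    return h, softmax(o)        # next state and distribution over next word

h = np.zeros(d_h)
for x in rng.standard_normal((5, d_in)):  # run over a length-5 input sequence
    h, y_hat = rnn_step(h, x)
print(y_hat.sum())  # a probability distribution over a 10-word vocabulary
```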

In graphical form, the RNN model looks like this:

In order to visualize how we might train this model via backpropagation, it is useful to introduce the concept of an unrolled computational graph. In an unrolled computational graph, the recurrent connection between the state at one timestep to the state at the next timestep is drawn explicitly for all timesteps in the sequence. An unrolled computational graph for an RNN would look something like this:

where the internal calculation of each \(h(t)\) is not shown for simplicity.

We then just execute the familiar backpropagation algorithm over this unrolled computation graph, accumulating the gradients of the parameters through each timestep. This procedure is called the backpropagation-through-time algorithm.

An interesting thing happens if you train an RNN model over sequences of even moderate length; the gradient either quickly blows up to infinity or quickly decays to zero when backpropagating through time.

To gain some mathematical intuition for why this is, let's simplify the state recurrence relation of the model to be: $$\begin{eqnarray} &h(t) = Wh(t-1) \\ &\text{or, equivalently,} \\ &h(t+1) = Wh(t) \end{eqnarray}$$

Furthermore, we'll say that only in the last timestep is the hidden state connected to the output and only in the initial timestep is the hidden state connected to the input. The computational graph is thus:

Let L be the loss of the model and let: $$ \nu (t) = \frac{dL}{dh(t)} $$

be the error signal at time \(t\).

Via the chain rule, we have: $$ \nu (t) = \frac{dL}{dh(t)} = \frac{dL}{dh(t+1)} \frac{dh(t+1)}{dh(t)} = \frac{dh(t+1)}{dh(t)} \nu (t+1) $$

Using our expression for \(h(t+1)\), we then have: $$ \nu (t) = W \nu (t+1) $$

The solution of this recurrence relation is: $$ \nu (t) = W^{N-t} \nu (N) $$

where \(N\) is the length of the sequence.

Via eigendecomposition, we can write \(W\) as: $$ W = Q \Lambda Q^{-1} $$

where \(Q\) is a matrix containing the eigenvectors of \(W\) and \(\Lambda\) is a diagonal matrix containing the corresponding eigenvalues.

We can then say: $$ W^{N-t} = Q \Lambda^{N-t} Q^{-1} $$

and therefore: $$ \nu (t) = Q \Lambda^{N-t} Q^{-1} \nu (N) $$

We can then see that \(\nu (t)\) would explode exponentially when backpropagating through time if \(W\) has a spectral radius greater than 1 and would decay exponentially when backpropagating through time if \(W\) has a spectral radius less than 1. For a more rigorous analysis of the vanishing/exploding gradient problem, see Hochreiter.
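A quick numerical sketch makes this concrete. Taking \(W\) to be a scaled identity matrix (so its spectral radius is just the scale factor), the error norm grows or shrinks exponentially in the number of backpropagation steps:

```python
import numpy as np

def error_norms(spectral_radius, steps=50):
    """Backpropagate nu(t) = W nu(t+1) with W = spectral_radius * I."""
    W = spectral_radius * np.eye(4)
    nu = np.ones(4)
    norms = []
    for _ in range(steps):
        nu = W @ nu
        norms.append(np.linalg.norm(nu))
    return norms

print(error_norms(1.1)[-1])  # grows exponentially: the gradient explodes
print(error_norms(0.9)[-1])  # decays toward zero: the gradient vanishes
```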

While it has been shown empirically that simply clipping the gradient to some maximum value is an effective remedy for the exploding gradient problem, clipping does nothing to address the vanishing gradient problem. Dealing with vanishing gradients requires changing the architecture itself, which brings us to the LSTM.

An LSTM is a variant of an RNN that is able to learn much longer dependencies than a vanilla RNN. It does so through a few different techniques, which we now discuss.

The most important task to accomplish in order to learn long-term dependencies is to reckon with the vanishing gradient problem. The constant error carousel is a technique used by LSTMs in order to do just that.

Let us follow the toy setup from the vanishing gradient analysis above. Using this setup, we can immediately see that the error flow neither decays nor explodes if and only if the absolute values of all of the eigenvalues of the \(W\) matrix are exactly equal to 1. One such matrix that meets this criterion is the identity matrix.

Therefore, this yields the simple recurrence relation: $$ h(t) = h(t-1) $$

which guarantees constant error flow when backpropagating through the network. This is referred to as the constant error carousel or CEC.

At a high level, we can think of an LSTM cell (or any type of RNN cell) as being responsible for 3 main functions:

- Storing "memory" in the hidden state
- "Reading memory" from the previous cell
- "Writing memory" to the next cell

The problem is that a single set of recurrent weights must serve all of these functions at once, which creates conflicting update signals during training. One way to alleviate this problem is to add trainable filters to the input and the output that are responsible for controlling how memory is read and written. These trainable filters are referred to as input and output gates.

Note that from this point we'll change notation and let \(o(t)\) represent the output gate at time \(t\), let \(h(t)\) represent the output of the RNN cell at time \(t\) and let \(c(t)\) represent the hidden state of the RNN cell at time t. What we have now is this: $$\begin{eqnarray} &i_t = \sigma_g (W_i x_t + U_i h_{t-1} + b_i) \\ &o_t = \sigma_g (W_o x_t + U_o h_{t-1} + b_o) \\ &c_t = c_{t-1} + i_{t} \circ \sigma_c (W_c x_t + U_c h_{t-1} + b_c) \\ &h_t = o_t \circ \sigma_h (c_t) \end{eqnarray}$$

where \(W_i\), \(W_o\), \(W_c\), \(U_i\), \(U_o\), \(U_c\), \(b_i\), \(b_o\) and \(b_c\) are learned parameters, \(\sigma_g\), \(\sigma_c\) and \(\sigma_h\) are non-linear activation functions and \(\circ\) is point-wise multiplication.

Then, in a similar way to the vanilla RNN, if we were to produce a probability distribution over the next token for a language model we have: $$ \hat{y} (t) = softmax(W_y h(t) + b_y) $$

With this, we now have the original LSTM as introduced in the seminal paper Long Short-Term Memory by Hochreiter and Schmidhuber. However, there is one extra flourish that goes into what we know as an LSTM today.

One thing you may notice about the LSTM as defined up to this point is that the cell state: $$ c_t = c_{t-1} + i_t \circ \sigma_c (W_c x_t + U_c h_{t-1} + b_c) $$

will grow unbounded. This obviously poses a problem for long sequences.

In order to deal with this, we gate the self-connection in the cell state in a similar fashion to how we gate the input and the output. This gate is referred to as the "forget gate" because it regulates how much information to keep or forget from the previous cell state.

Our final LSTM architecture is now this: $$\begin{eqnarray} &f_t = \sigma_g (W_f x_t + U_f h_{t-1} + b_f) \\ &i_t = \sigma_g (W_i x_t + U_i h_{t-1} + b_i) \\ &o_t = \sigma_g (W_o x_t + U_o h_{t-1} + b_o) \\ &c_t = f_t \circ c_{t-1} + i_t \circ \sigma_c (W_c x_t + U_c h_{t-1} + b_c) \\ &h_t = o_t \circ \sigma_h (c_t) \end{eqnarray}$$
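A single step of this cell can be sketched in NumPy as follows; sizes, initialization and the standard sigmoid/tanh choices for the activations are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 4, 8  # toy input and hidden sizes

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One (W, U, b) triple per gate, as in the equations above: f, i, o, and
# the cell-input transformation c.
params = {g: (rng.standard_normal((d_h, d_in)) * 0.1,
              rng.standard_normal((d_h, d_h)) * 0.1,
              np.zeros(d_h))
          for g in "fioc"}

def lstm_step(c_prev, h_prev, x):
    def lin(g):
        W, U, b = params[g]
        return W @ x + U @ h_prev + b
    f = sigmoid(lin("f"))                   # forget gate
    i = sigmoid(lin("i"))                   # input gate
    o = sigmoid(lin("o"))                   # output gate
    c = f * c_prev + i * np.tanh(lin("c"))  # gated cell state
    h = o * np.tanh(c)                      # cell output
    return c, h

c, h = np.zeros(d_h), np.zeros(d_h)
for x in rng.standard_normal((5, d_in)):
    c, h = lstm_step(c, h, x)
print(h.shape)  # (8,)
```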

In summary, the LSTM architecture contains the following components:

- A passthrough connection from the state at one timestep to the state at the next timestep. This is called the constant error carousel and it is needed to keep the error propagated through time constant in order to alleviate the vanishing gradient problem.
- Multiplicative input and output filters (also known as gates) in order to control how information is read into the cell and how information is written from the cell.
- Multiplicative forget gate on the state self-connection. This is needed in order to prevent the cell state from growing unbounded.

One major problem with the models that have been introduced up to this point is that the computation for each timestep can't start until the computation from the previous timestep has finished. This severely limits the amount of parallelism that we can utilize when training vanilla RNNs and LSTMs.

In order to deal with this problem, the QRNN architecture calculates the cell input, the input gate output, the output gate output and the forget gate output in terms of convolutions, which are highly parallelizable: $$\begin{eqnarray} &Z = \tanh (W_z * X) \\ &F = \sigma (W_f * X) \\ &O = \sigma (W_o * X) \\ &I = \sigma (W_i * X) \end{eqnarray}$$

where \(W_z\), \(W_f\), \(W_o\) and \(W_i\) are learned convolutional kernels, \(X\) is input data, \(\sigma\) is some non-linear activation function and \(*\) is the convolution operator.

After this convolutional layer, we apply a pooling layer which calculates familiar-looking recurrent LSTM-like relations: $$\begin{eqnarray} &c_t = f_t \circ c_{t-1} + i_t \circ z_t \\ &h_t = o_t \circ c_t \end{eqnarray}$$

where \(f_t\), \(o_t\), \(i_t\) and \(z_t\) represent the row vectors of \(F\), \(O\), \(I\) and \(Z\) indexed by \(t\).
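A NumPy sketch of the whole QRNN layer, with a hand-rolled causal (masked) convolution so that timestep \(t\) only sees inputs up to \(t\); the filter width and sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_in, d_h, k = 6, 4, 8, 2  # sequence length, input dim, hidden dim, filter width

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One (k, d_h, d_in) kernel per feature map Z, F, O, I.
kernels = {g: rng.standard_normal((k, d_h, d_in)) * 0.1 for g in "zfoi"}

def conv(Wk, X):
    """Causal width-k convolution over time: out[t] = sum_j Wk[j] @ X[t-j]."""
    Xp = np.vstack([np.zeros((k - 1, d_in)), X])  # left-pad with zeros
    return np.stack([sum(Wk[j] @ Xp[t + k - 1 - j] for j in range(k))
                     for t in range(len(X))])

X = rng.standard_normal((T, d_in))
Z = np.tanh(conv(kernels["z"], X))  # all four maps are computable in parallel
F = sigmoid(conv(kernels["f"], X))
O = sigmoid(conv(kernels["o"], X))
I = sigmoid(conv(kernels["i"], X))

# Sequential (but purely elementwise, hence cheap) pooling layer.
c = np.zeros(d_h)
H = []
for t in range(T):
    c = F[t] * c + I[t] * Z[t]
    H.append(O[t] * c)
print(np.stack(H).shape)  # (6, 8)
```

The expensive matrix multiplications all happen inside the convolutions, which have no timestep-to-timestep dependency; only the cheap pooling recurrence remains sequential.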

While most work in sequence processing for neural networks has historically focused on recurrent neural networks of some form or fashion, recently there has been a shift away from recurrent models towards what are called attentional models. As opposed to recurrent models which are based on the principle of recurrence, attentional models are based on a principle called attention.

The concept of attention is quite general. We say that we compute attention if we perform any sort of calculation that takes some multi-valued input (i.e. a vector or a tensor) and computes a multi-valued output of the same dimension as the input, where the output corresponds to how much weight the model should assign to each value in the input.

While it isn't obvious that attention is in and of itself sufficient for producing powerful sequence processing models, in the paper Attention Is All You Need, Vaswani et al. found that this is indeed the case. In the paper, they defined a new type of model called the Transformer that only uses attentional mechanisms and achieved state-of-the-art results across a variety of natural language understanding tasks at a fraction of the total training time of previous state-of-the-art models.

There are many different ways that one might compute attention over a sequence of inputs. The way that the Transformer model computes attention is by what is called scaled dot-product attention.

In order to understand how exactly scaled dot-product attention works, we take a detour into differentiable memory networks. In a differentiable memory network, we have a sequence of memory cells. The "address" of any given memory cell is specified by a vector which is referred to as a key. The "contents" of any given memory cell are specified by a vector which is referred to as a value. In order to read from this memory, we need a vector (of the same dimensionality as the key vectors) referred to as a query. Given a query vector, the value that is read from memory is a weighted sum of the values of the memory cells where the weight is computed via some similarity function between the query vector and each key vector.

Scaled dot-product attention uses the exact same query-key-value analogy as differentiable memory networks. The input to a scaled dot-product attention layer is a matrix \(Q\) that contains many queries, a matrix \(K\) that contains many keys and a matrix \(V\) that contains many values. We have: $$ Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}}) V $$

where \(d_k\) is the dimensionality of each query and key.
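In NumPy, scaled dot-product attention is only a few lines; the shapes below are illustrative:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # one weight row per query
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 8))   # 3 queries of dimension d_k = 8
K = rng.standard_normal((5, 8))   # 5 key-value memory "cells"
V = rng.standard_normal((5, 16))
print(attention(Q, K, V).shape)   # (3, 16): one weighted sum of values per query
```

Each row of the softmax output sums to 1, matching the memory-network picture of a weighted read over the values.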

In the Transformer network, scaled dot-product attention is combined with what is called a multi-head attention mechanism.

In multi-head attention, the keys, the values and the queries are first projected into numerous lower-dimensional spaces. For each of these different projections, attention is calculated over these projected queries, keys and values. These calculated attention vectors are then concatenated together and projected back into the original vector space to get the final result. Therefore, we have: $$\begin{eqnarray} &MultiHead(Q, K, V) = concat(head_{1}, ..., head_{h}) W^O \\ &head_i = Attention(Q W_{i}^{Q}, K W_{i}^{K}, V W_{i}^{V}) \end{eqnarray}$$

where \(h\) is the total number of attention heads and \(W^O\) and each \(W_i^Q\), \(W_i^K\) and \(W_i^V\) are learned projection matrices.
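A NumPy sketch of multi-head self-attention; the model size, number of heads and initialization are illustrative assumptions:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

rng = np.random.default_rng(0)
d_model, h = 16, 4
d_k = d_model // h  # each head projects into a lower-dimensional space

# Learned projections: one (W_Q, W_K, W_V) triple per head, plus W_O.
heads = [tuple(rng.standard_normal((d_model, d_k)) * 0.1 for _ in range(3))
         for _ in range(h)]
W_O = rng.standard_normal((h * d_k, d_model)) * 0.1

def multi_head(Q, K, V):
    outs = [attention(Q @ W_q, K @ W_k, V @ W_v) for W_q, W_k, W_v in heads]
    return np.concatenate(outs, axis=-1) @ W_O  # concat heads, project back

X = rng.standard_normal((5, d_model))  # self-attention: Q = K = V = X
print(multi_head(X, X, X).shape)       # (5, 16)
```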

Multi-head attention is useful because each different attention head can learn to attend over different aspects of the input sequence. To use a contrived example, one of the attention heads could learn to attend over nouns in a sentence while another of the attention heads could learn to attend over verbs. The combination of these two attention heads would surely yield a lot more information useful for downstream natural language understanding tasks than either one of them individually.

The last main novel contribution of the Transformer network is its positional encoding scheme. Since the Transformer model does not contain any recurrent connections or convolutions, we need to explicitly inject information about the relative positions of tokens in the input. We do that by creating positional encoding vectors and adding them to the input before feeding it into the model.

While it's possible to learn positional encoding vectors via backpropagation, it was found that learned positional encoding vectors yielded no performance gain over fixed encoding vectors. Thus, we define the positional encoding vectors to be: $$\begin{eqnarray} &PE_{(pos, 2i)} = \sin (pos / 10000^{2i / d_{model}}) \\ &PE_{(pos, 2i+1)} = \cos (pos / 10000^{2i / d_{model}}) \end{eqnarray}$$

where \(pos\) is the position of the token in the sequence, \(2i\) and \(2i+1\) represent the particular dimension of the encoding vector and \(d_{model}\) is the dimensionality of the input embeddings. Using sinusoidal functions in this way allows us to generalize to sequence lengths not seen in the training data.
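The encoding is easy to compute directly; a NumPy sketch (assuming an even \(d_{model}\)):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings as defined above, shape (seq_len, d_model)."""
    pos = np.arange(seq_len)[:, None]     # token positions
    i = np.arange(d_model // 2)[None, :]  # dimension pairs (2i, 2i+1)
    angles = pos / 10000 ** (2 * i / d_model)
    pe = np.empty((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)          # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)          # odd dimensions get cosine
    return pe

pe = positional_encoding(50, 16)
print(pe.shape)   # (50, 16): add this to the input embeddings
print(pe[0, :4])  # position 0: sin(0) and cos(0) alternate -> [0, 1, 0, 1]
```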

With all of these pieces laid out, we can now describe the overall architecture of the Transformer model.
The Transformer architecture follows the pattern shown in this diagram:

The model is made up of a stack of \(N\) layers with identical architecture where the output of the previous layer is fed into the next layer as input.
Each of these layers is made up of two sublayers, each of which is followed by a residual connection and a layer norm.

While the Transformer architecture allows us to model the interactions between words at different timesteps directly, those interactions are confined to a single fixed-length sequence: a word cannot attend to words outside the sequence currently being processed. The Transformer-XL architecture was introduced to address this limitation.

The main idea of the Transformer-XL architecture is to process consecutive sequences and have each layer of the model cache the output it generated from the previous sequence to help generate the output for the current sequence. The initial multi-head self-attention sublayer of each layer uses the output of the previous layer to generate the queries, but it uses the output of the previous layer concatenated with the cached output of the previous layer from the last sequence in order to generate the keys and values. We then have: $$\begin{eqnarray} &\tilde{h}_{\tau + 1}^{n-1} = concat(SG(h_{\tau}^{n-1}), h_{\tau + 1}^{n-1}) \\ &q_{\tau + 1}^{n}, k_{\tau + 1}^{n}, v_{\tau + 1}^{n} = h_{\tau + 1}^{n-1} W_{q}, \tilde{h}_{\tau + 1}^{n-1} W_{k}, \tilde{h}_{\tau + 1}^{n-1} W_v \\ &h_{\tau + 1}^{n} = Layer(q_{\tau + 1}^{n}, k_{\tau + 1}^{n}, v_{\tau + 1}^{n}) \end{eqnarray}$$

where \(h_{\tau}^n\) is the output of the \(n\)^{th} layer for the \(\tau\)^{th} sequence, \(SG\) is the stop-gradient function (gradients are not propagated into the cached states) and \(W_q\), \(W_k\) and \(W_v\) are learned projection matrices.

Additionally, the Transformer-XL architecture utilizes a relative positional encoding scheme. This is needed to account for the fact that since context is now carried from one sequence to the next, the model needs some way of determining the positional difference between a word in the current sequence and a word in a previous sequence.

As described before, the distance of interactions that a Transformer can handle is limited by the largest sequence that it can process. The Transformer-XL architecture aims to improve that by adding recurrence between consecutive sequences. The Sparse Transformer tries an orthogonal approach to improving the Transformer: increasing the size of the largest sequence the model can process.

The main contribution of the Sparse Transformer architecture is to reformulate attention in terms of connectivity patterns. A connectivity pattern is a set of indices that corresponds to the indices of the input that are being attended over. For a given connectivity pattern, we only attend over inputs whose indices appear in the connectivity pattern. Concretely, we have this: $$\begin{eqnarray} &Q = W_q X \\ &K_{S_i} = (W_k x_j)_{j \in S_i} \\ &V_{S_i} = (W_v x_j)_{j \in S_i} \\ &SelfAttention_{S_i} (x) = Attention(Q, K_{S_i}, V_{S_i}) \end{eqnarray}$$

where \(S_i\) is some connectivity pattern.

We can then define multiple connectivity patterns to attend over.
We can either attend over all the connectivity patterns sequentially (i.e. one application of attention after the other) or we can attend over all of them in parallel via multi-head attention.
The idea is that if we choose our connectivity patterns carefully, we can reduce the total computational cost by having each connectivity pattern cover only a small subset of all possible indices, while still allowing information to propagate between any two positions across repeated applications of attention.
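A minimal NumPy sketch of attending over a single connectivity pattern. The strided pattern below is a hypothetical illustration of the idea, not the exact patterns from the Sparse Transformer paper:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
T, d = 8, 4
X = rng.standard_normal((T, d))
W_q, W_k, W_v = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

def attend(i, S_i):
    """Output for position i, attending only over the indices in S_i."""
    q = X[i] @ W_q
    K = X[list(S_i)] @ W_k  # keys/values only for indices in the pattern
    V = X[list(S_i)] @ W_v
    w = softmax(q @ K.T / np.sqrt(d))
    return w @ V

# Hypothetical pattern: position i attends to a small local window plus
# every stride-th earlier position, so |S_i| grows slowly with T.
stride = 4
S = [sorted(set(range(max(0, i - 2), i + 1)) | set(range(i % stride, i + 1, stride)))
     for i in range(T)]
print(attend(7, S[7]).shape)  # (4,)
```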

In the landmark paper Learning Representations By Back-Propagating Errors by Rumelhart, Hinton and Williams that first introduced the term "backpropagation", it was found that neural network models trained via backpropagation learn intermediate representations of the input data that are useful for the task that the model is being trained for. It has since been found that these learned intermediate representations are useful input features for neural network models that are trained on related tasks. Training a neural network model using a fully-trained neural network's intermediate representations as input can greatly boost overall performance of a model and can greatly decrease training time. This process is known as transfer learning.

Transfer learning has been effectively applied in the field of computer vision since the beginning of the deep learning revolution. Its impact in the field of natural language processing, however, has historically been comparatively minor. The general wisdom had largely been that transfer learning in NLP wasn't very effective outside of pre-trained word embeddings and related techniques. While convolutional neural networks trained on ImageNet classification were found to be amenable to transfer learning on a variety of computer vision tasks, it seemed as if no such general proxy task existed for NLP.

Recently, however, it has been found that language modeling is a useful proxy task that can allow us to effectively utilize transfer learning in the NLP domain in a way that is similar to how transfer learning has been utilized in the computer vision domain. We now discuss several transfer learning approaches based on language modeling that have recently emerged.

One of the first major breakthroughs in transfer learning in NLP via language modeling came in the paper Universal Language Model Fine-tuning (or ULMFiT), which introduced a general training procedure for fine-tuning a language model. The authors then showed that this fine-tuned language model can be used as a basis to train models via transfer learning for NLP tasks other than language modeling. Using the ULMFiT method, the authors established state-of-the-art results by a wide margin on a number of text classification benchmarks.

Previous work on fine-tuning language models ran into the problem of catastrophic forgetting where a pre-trained neural network quickly loses all information about the task it was previously trained on when it starts training on a new task. In order to overcome this issue, the authors proposed several techniques that can be thought of as controlling the rate at which new information is propagated through certain parts of the model when training on a new task. This was done in order to prevent new information from overwhelming certain parts of the model and causing catastrophic forgetting.

The three main techniques introduced were:

- Discriminative fine-tuning
- Gradual unfreezing
- Slanted triangular learning rates

Discriminative fine-tuning works by assigning different learning rates to different layers of the model. The idea is that earlier layers of the model tend to learn general features of the data while later layers learn increasingly task-specific features. Therefore, when training the model on a new task, it stands to reason that the layers that learn more task-specific features (i.e. the later layers) should change more quickly than the layers that learn more general features (i.e. the earlier layers).
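A sketch of the resulting per-layer learning rates, using the decay factor of 2.6 suggested in the ULMFiT paper (the function name and base rate are illustrative):

```python
# Discriminative fine-tuning: each layer gets its own learning rate, decaying
# geometrically from the last layer backwards (ULMFiT: eta^{l-1} = eta^l / 2.6).
def layer_learning_rates(base_lr, n_layers, decay=2.6):
    """Learning rate per layer, highest for the last (most task-specific) layer."""
    return [base_lr / decay ** (n_layers - 1 - l) for l in range(n_layers)]

lrs = layer_learning_rates(0.01, 4)
print(lrs)  # earliest layer gets the smallest rate, last layer gets base_lr
```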

Gradual unfreezing works by "unfreezing" the layers of the model one at a time, starting from the last layer and moving backwards, rather than fine-tuning all layers at once. Since the last layers contain the least general knowledge, unfreezing them first lets the model adapt to the new task while the general-purpose features in the earlier layers are protected from catastrophic forgetting.

Slanted triangular learning rates are a technique where you sharply increase the learning rate in the beginning of training and then slowly decrease the learning rate for the rest of training. The justification for this technique is that the period of sharp learning rate increase allows the model to quickly find a good general region in parameter space while the rest of training allows the model to then optimize its parameters in a more granular way.
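A sketch of the schedule, following the formula given in the ULMFiT paper; the default hyperparameters (cut_frac = 0.1, ratio = 32) are the paper's suggested values:

```python
def slanted_triangular_lr(t, T, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Learning rate at iteration t of T: a sharp linear warm-up for the first
    cut_frac of training, then a long linear decay back down."""
    cut = int(T * cut_frac)
    if t < cut:
        p = t / cut                                   # warm-up fraction
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))  # decay fraction
    return lr_max * (1 + p * (ratio - 1)) / ratio

T = 1000
lrs = [slanted_triangular_lr(t, T) for t in range(T)]
print(max(lrs))  # the peak lr_max is reached right at the end of the warm-up
```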

Shortly after the ULMFiT paper came the paper Improving Language Understanding by Generative Pre-Training, which introduced what is now known as the GPT model. In it, the authors pre-trained a Transformer-based language model on a large corpus of books.

This pre-trained language model was then used as a basis for training models on 12 diverse NLP tasks via transfer learning. Of these 12 models, 9 of them were found to be state-of-the-art in their domain. The previous state-of-the-art models were often ensembles of models whose architecture was highly domain-specific. Overall, the GPT approach achieved state-of-the-art performance on the NLP multi-task benchmark GLUE.

Shortly after the GPT paper came yet another impactful paper concerning transfer learning in NLP, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In the paper, the authors trained a language model based on the Transformer architecture, similar to the GPT model.

The major difference between the BERT model and the GPT model is that the BERT model was trained to predict the probability of a word given context from both the left and the right of the word whereas the GPT model was only given context from the left of the word. This bidirectionality allowed the BERT model to learn more powerful intermediate representations than the GPT model. Using the pre-trained BERT model as a basis for training models for other NLP tasks, the authors achieved state-of-the-art results in 11 different domains. Overall, the BERT approach surpassed the GPT approach in achieving state-of-the-art performance in the GLUE benchmark.