An Overview of Language Modeling

Computational Linguistics and the Language Modeling Problem

A Very Brief History of Computational Linguistics

During World War II, there was a significant investment by governments towards the nascent field of electronic programmable computers. Some key examples of this include the Colossus computer developed by the British for the purpose of codebreaking, the ENIAC developed by the Americans for the purpose of calculating artillery firing tables and the Z3 developed by the Germans which was never actually put into use by the German government because it was deemed not vital to the war efforts.

Off the back of the research spawned by this investment, commercial electronic programmable computers began to be developed in the early 1950s, starting with the Ferranti Mark 1 (or the Manchester Electronic Computer) and the UNIVAC I which were developed concurrently in the UK and USA respectively and made available in 1951.

Therefore, in the 1950s, it became well-understood by academics that electronic computers were capable of automatically performing arithmetic calculations much faster than any human could. With this computational speedup in mind, many academics began investigating the problem of automatic translation of natural languages.Of particular interest during this nascent Cold War era were the problems of Russian-to-English (in the USA) and English-to-Russian (in the Soviet Union) translation systems. See Hutchins for more detail.

However, researchers quickly found that the problem of machine translation was much harder than they had anticipated. The learnings of traditional structural linguistics were insufficient for producing an accurate machine translator. It was determined that in order to make progress, a more statistical approach would be required. With this, the field of computational linguistics was born.

Motivation of the Language Modeling Problem

With the necessity of probabilistic models in mind, the question then becomes: How could we use probabilistic models to generate an accurate machine translation system?

Suppose we have some source language \(S\) and some destination language \(D\) and we're trying to translate some particular phrase from language \(S\) to language \(D\). That is, given the particular phrase in language \(S\), we want to generate the most likely phrase in language \(D\). In mathematical terms, what we want to find is: $$ \tilde{d} = arg\max_{d \in D} P(d|s), s \in S $$

One way to decompose that term is via Bayes' rule: $$ arg\max_{d \in D} P(d|s) = arg\max_{d \in D} \frac{P(s|d) P(d)}{P(s)} $$

Since \(P(s)\) is constant over all \(d \in D\), we can say: $$ \tilde{d} = arg\max_{d \in D} P(d|s) = arg\max_{d \in D} \frac{P(s|d) P(d)}{P(s)} = arg\max_{d \in D} P(s|d) P(d) $$

The term \(P(d)\) is the absolute likelihood of the phrase \(d\) in language \(D\). A model that generates this likelihood is called a language model.

Types of Language Models

There are, broadly speaking, two main ways that language models are built:

N-Gram Methods

Let's say we're trying to predict the probability of a sequence of words \(w_1, w_2, ..., w_k\). We can decompose the probability of such a sequence by using the probabilistic chain rule as follows: $$ P(w_{1} w_{2} ... w_{k}) = \prod_{0 \leq i \leq k} P(w_{i} | w_{1} w_{2} ... w_{i-1}) $$

While this is a step in the right direction towards being able to calculate \(P(w_1 w_2 ... w_k)\), computing the chain rule product gets unwieldy very quickly for increasing values of \(k\). In order to deal with this, n-gram methods use the simplifying assumption that the probability of the next word in a sequence of words is only dependent on the previous \(n\) words.Such an assumption where the next value in a sequence of values is not dependent on the entire history of the sequence, but only on some particular current state (in this case, the state would be only the previous \(n\) words) is called a Markov assumption. As such, we can rewrite the chain rule as follows: $$ P(w_{1} w_{2} ... w_{k}) = \prod_{0 \leq i \leq k} P(w_{i} | w_{i-n} ... w_{i-1}) $$

Using this assumption, the complexity of computing the probability of a sequence of \(k\) words is dependent only on some constant value \(n\) and not on the length of the sequence \(k\). All conditional probabilities can then be estimated by simply counting the occurrences of sequences of wordsSuch a sequence of \(n\) words is called an n-gram. in some corpus.

Neural Methods

While the count-based approach of n-grams is very appealing due its conceptual simplicity, the Markov assumption proves to be a very limiting assumption. In Exploring the Limits of Language Modeling, Jozefowicz et al. showed that neural network modelsFor an overview of neural networks and the backpropagation algorithm, see this lecture from Andrej Karpathy. that make no such assumption can dominate state-of-the-art n-gram models in language modeling performance. Because of this, most research in language modeling in recent years has focused on different types of neural network models.

The most common types of neural networks used for language modeling can be grouped into two categories:

Input Representation

Before diving into the details of specific types of neural network models, we first need to briefly discuss how exactly these neural network models process text data in the first place. Since neural network models work on numerical input only, we need to develop some sort of scheme for numericalizing text input. The way that this is done is by leveraging what are called embedding matrices.

An embedding matrix is simply a matrix that maps an input to some real-valued vector. How it works is that every tokenThere are many different granularity levels at which you can choose to tokenize. You can tokenize at the word level or the character level or even at the byte-pair level. is mapped to some number that represents the index of that token in some list of valid tokens.Such a list of all valid tokens is called a vocabulary. We then use the row vector indexed by that number in the embedding matrix to represent the token. These vectors can then be trained from scratch via backpropagation or they can be pre-trained or they can be initialized with pre-trained values and then fine-tuned via backpropagation.

From this point forward, whenever we refer to any input into a neural network model, we are referring to these embedding vectors.

Recurrent Models

Recurrent models are models that compute a function of the form: $$ h(t) = f(h(t-1), x(t); \theta) $$

where \(h(t)\) is the hidden state of the model at time \(t\), \(x(t)\) is the input into the model at time \(t\) and \(\theta\) represents the parameters of the model.Such models are called recurrent because there is a recurrence relation defined over the states of the model over different timesteps. The output of the model \(o(t)\) is then computed via some function of the hidden state: $$ o(t) = g(h(t); \theta) $$

Since the model is defined in terms of the transition from one timestep to the next, it allows us to naturally generalize to sequence lengths not seen in the training data and it allows us to share parameters across all timesteps, which simplifies things greatly. Such a general form admits many specific implementations which we now discuss.

Recurrent Neural Network (RNN)

The simplest form of recurrent neural model in general use is the recurrent neural network (RNN), otherwise referred to as a vanilla RNN. In the RNN architecture, the model is defined by three groups of parameters:Disregarding the bias parameters \(b\) and \(c\) used in the calculation of the state and the output, for simplicity.

The equations that define the RNN model are as follows: $$\begin{eqnarray} &a(t) = b + Wh(t-1) + Ux(t) \\ &h(t) = \sigma_{h} (a(t)) \\ &o(t) = c + Vh(t) \end{eqnarray}$$

where \(x(t)\) is the input at time \(t\), \(o(t)\) is the output at time \(t\), \(\sigma_h\) is some non-linear activation function and \(a(t)\) and \(h(t)\) represent the hidden state of the model at time \(t\).

In the case of language models, in order to calculate a probability distribution over the next predicted word in the sequence, we then do: $$ \hat{y} (t) = softmax(o(t)) $$

In graphical form, the RNN model looks like this:An architecture where the RNN cell output \(o(t)\) is fed as input into another RNN cell is called a multi-layer RNN.

Diagram of RNN architecture


In order to visualize how we might train this model via backpropagation, it is useful to introduce the concept of an unrolled computational graph. In an unrolled computational graph, the recurrent connection between the state at one timestep to the state at the next timestep is drawn explicitly for all timesteps in the sequence. An unrolled computational graph for an RNN would look something like this:

Diagram of an unrolled RNN

where the internal calculation of each \(h(t)\) is not shown for simplicity.

We then just execute the familiar backpropagation algorithm over this unrolled computation graph, accumulating the gradients of the parameters through each timestep. This procedure is called the backpropagation-through-time algorithm.

The Vanishing/Exploding Gradient Problem

An interesting thing happens if you train an RNN model over sequences of even moderate length; the gradient either quickly blows up to infinity or quickly decays to zero when backpropagating through time.

To gain some mathematical intuition for why this is, let's simplify the state recurrence relation of the model to be: $$\begin{eqnarray} &h(t) = Wh(t-1) \\ &or \\ &h(t+1) = Wh(t) \end{eqnarray}$$

Furthermore, we'll say that only in the last timestep is the hidden state connected to the output and only in the initial timestep is the hidden state connected to the input. The computational graph is thusly this:

Diagram of simplified RNN

Let L be the loss of the model and let: $$ \nu (t) = \frac{dL}{dh(t)} $$

be the error signal at time \(t\).

Via the chain rule, we have: $$ \nu (t) = \frac{dL}{dh(t)} = \frac{dL}{dh(t+1)} \frac{dh(t+1)}{dh(t)} = \frac{dh(t+1)}{dh(t)} \nu (t+1) $$

Using our expression for \(h(t+1)\), we then have: $$ \nu (t) = W \nu (t+1) $$

The solution of this recurrence relation is: $$ \nu (t) = W^{N-t} \nu (N) $$

where \(N\) is the length of the sequence.

Via eigendecompositionOne way to think about eigenanalysis is to view the eigenvectors as the basis vectors in which you can write a matrix in diagonal form. The particular values of this diagonal matrix are the eigenvalues. You can then view the eigendecomposition \(Q \Lambda Q^{-1}\) as firstly mapping the diagonal matrix \(\Lambda\) from the standard basis to the basis formed by the eigenvectors (i.e. the eigenbasis) by left-multiplying by \(Q\) and then mapping that result from the eigenbasis back to the standard basis by right-multiplying by \(Q^{-1}\). For more on eigenvectors and eigenvalues, see this blog post by Dhruv Parthasarathy., we can say that: $$ W = Q \Lambda Q^{-1} $$

where \(Q\) is a matrix containing the eigenvectors of \(W\) and \(\Lambda\) is a diagonal matrix containing the corresponding eigenvalues.

We can then say: $$ W^{N-t} = Q \Lambda^{N-t} Q^{-1} $$

and therefore: $$ \nu (t) = Q \Lambda^{N-t} Q^{-1} \nu (N) $$

We can then see that \(\nu (t)\) would explode exponentially when backpropagating through time if \(W\) has a spectral radius greater than 1 and would decay exponentially when backpropagating through time if \(W\) has a spectral radius less than 1. For a more rigorous analysis of the vanishing/exploding gradient problem, see Hochreiter.

While it has been shown empirically that simply clipping the gradient to some maximum valueMeaning that if \(g\) is the gradient and \(x\) is some predetermined max value, then you backpropagate the value \(min(x, g)\) instead of \(g\). alleviates the exploding gradient problem, no such simple solution exists for solving the vanishing gradient problem. In order to address that issue, a fundamentally different approach is required.

Long Short-Term Memory (LSTM)

An LSTM is a variant of an RNN that is able learn much longer dependencies than a vanilla RNN. It is able to do that through a few different techniques which we now discuss.

Constant Error Carousel

The most important task to accomplish in order to learn long-term dependencies is to reckon with the vanishing gradient problem. The constant error carousel is a technique used by LSTMs in order to do just that.

Let us follow the toy setup from the vanishing gradient analysis above. Using this setup, we can immediately see that the error flow neither decays nor explodes if and only if the absolute value of all of the eigenvalues of the \(W\) matrix are exactly equal to 1. One such matrix that meets this criterion is the identity matrix.

Therefore, this yields the simple recurrence relation: $$ h(t) = h(t-1) $$

which guarantees constant error flow when backpropagating through the network. This is referred to as the constant error carousel or CEC.

Input and Output Gates

At a high level, we can think of an LSTM cell (or any type of RNN cell) as being responsible for 3 main functions:

  1. Storing "memory" in the hidden state
  2. "Reading memory" from the previous cell
  3. "Writing memory" to the next cell
The CEC handles point 1 above, but not points 2 or 3. In the CEC setup, the storage of memory is tightly coupled to the reading and writing of memory.

One way to alleviate this problem is to add trainable filters to the input and the output that are responsible for controlling how memory is read and written. These trainable filters are referred to as input and output gates.

Note that from this point we'll change notation and let \(o(t)\) represent the output gate at time \(t\), let \(h(t)\) represent the output of the RNN cell at time \(t\) and let \(c(t)\) represent the hidden state of the RNN cell at time t. What we have now is this: $$\begin{eqnarray} &i_t = \sigma_g (W_i x_t + U_i h_{t-1} + b_i) \\ &o_t = \sigma_g (W_o x_t + U_o h_{t-1} + b_o) \\ &c_t = c_{t-1} + i_{t} \circ \sigma_c (W_c x_t + U_c h_{t-1} + b_c) \\ &h_t = o_t \circ \sigma_h (c_t) \end{eqnarray}$$

where \(W_i\), \(W_o\), \(W_c\), \(U_i\), \(U_o\), \(U_c\), \(b_i\), \(b_o\) and \(b_c\) are learned parameters, \(\sigma_g\), \(\sigma_c\) and \(\sigma_h\) are non-linear activation functions and ⚪ is point-wise multiplication.

Then, in a similar way to the vanilla RNN, if we were to produce a probability distribution over the next token for a language model we have: $$ \hat{y} (t) = softmax(W_y h(t) + b_y) $$

With this, we now have the original LSTM as introduced in the seminal paper Long Short-Term Memory by Hochreiter and Schmidhuber. However, there is one extra flourish that goes into what we know as an LSTM today.

Forget Gate

One thing you may notice about the LSTM as defined up to this point is that the cell state: $$ c_t = c_{t-1} + i_t \circ \sigma_c (W_c x_t + u_c h_{t-1} + b_c) $$

will grow unbounded. This obviously poses a problem for long sequences.

In order to deal with this, we gate the self-connection in the cell state in a similar fashion to how we gate the input and the output. This gate is referred to as the "forget gate" because it regulates how much information to keep or forget from the previous cell state.

Our final LSTM architecture is now this: $$\begin{eqnarray} &f_t = \sigma_g (W_f x_t + U_f h_{t-1} + b_f) \\ &i_t = \sigma_g (W_i x_t + U_i h_{t-1} + b_i) \\ &o_t = \sigma_g (W_o x_t + U_o h_{t-1} + b_o) \\ &c_t = f_t \circ c_{t-1} + i_t \circ \sigma_c (W_c x_t + U_c h_{t-1} + b_c) \\ &h_t = o_t \circ \sigma_h (c_t) \end{eqnarray}$$


In summary, the LSTM architecture contains the following components:

  1. A passthrough connection from the state at one timestep to the state at the next timestep. This is called the constant error carousel and it is needed to keep the error propagated through time constant in order to alleviate the vanishing gradient problem.
  2. Multiplicative input and output filters (also known as gates) in order to control how information is read into the cell and how information is written from the cell.
  3. Multiplicative forget gate on the state self-connection. This is needed in order to prevent the cell state from growing unbounded.

Quasi-Recurrent Neural Network (QRNN)

One major problem with the models that have been introduced up to this point is that the computation for each timestep can't start until the computation from the previous timestep has finished. This severely limits the amount of parallelism that we can utilize when training vanilla RNNs and LSTMs.

In order to deal with this problem, the QRNN architecture calculates the cell input, the input gate output, the output gate output and the forget gate output in terms of convolutions, which are highly parallelizable: $$\begin{eqnarray} &Z = \tanh (W_z * X) \\ &F = \sigma (W_f * X) \\ &O = \sigma (W_o * X) \\ &I = \sigma (W_i * X) \end{eqnarray}$$

where \(W_z\), \(W_f\), \(W_o\) and \(W_i\) are learned convolutional kernels, \(X\) is input data, \(\sigma\) is some non-linear activation function and \(*\) is the convolution operator.

After this convolutional layer, we apply a pooling layer which calculates familiar-looking recurrent LSTM-like relations: $$\begin{eqnarray} &c_t = f_t \circ c_{t-1} + i_t \circ z_t \\ &h_t = o_t \circ c_t \end{eqnarray}$$

where \(f_t\), \(o_t\), \(i_t\) and \(z_t\) represent the row vectors of \(F\), \(O\), \(I\) and \(Z\) indexed by \(t\).

Attentional Models

While most work in sequence processing for neural networks has historically focused on recurrent neural networks of some form or fashion, recently there has been a shift away from recurrent models towards what are called attentional models. As opposed to recurrent models which are based on the principle of recurrence, attentional models are based on a principle called attention.

The concept of attention is quite general. We say that we compute attention if we perform any sort of calculation that takes some multi-valued input (i.e. a vector or a tensor) and computes a multi-valued output (the same dimension as the input) that corresponds to how much weight the model should assign to each value in the input.


While it isn't obvious that attention is in and of itself sufficient for producing powerful sequence processing models, in the paper Attention Is All You Need, Vaswani et al. found that this is indeed the case. In the paper, they defined a new type of model called the Transformer that only uses attentional mechanisms and achieved state-of-the-art results across a variety of natural language understanding tasks at a fraction of the total training time of previous state-of-the-art models.

Scaled Dot-Product Attention

There are many different ways that one might compute attention over a sequence of inputs. The way that the Transformer model computes attention is by what is called scaled dot-product attention.

In order to understand how exactly scaled dot-product attention works, we take a detour into differentiable memory networks. In a differentiable memory network, we have a sequence of memory cells. The "address" of any given memory cell is specified by a vector which is referred to as a key. The "contents" of any given memory cell are specified by a vector which is referred to as a value. In order to read from this memory, we need a vector (of the same dimensionality as the key vectors) referred to as a query. Given a query vector, the value that is read from memory is a weighted sum of the values of the memory cells where the weight is computed via some similarity function between the query vector and each key vector.

Scaled dot-product attention uses the exact same query-key-value analogy as differentiable memory networks. The input to a scaled dot-product attention layer is a matrix \(Q\) that contains many queries, a matrix \(K\) that contains many keys and a matrix \(V\) that contains many values. We have: $$ Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}}) V $$

where \(d_k\) is the dimensionality of each query and key.The justification of the \(\frac{1}{\sqrt{d_k}}\) scaling factor is that if \(q\) is some query vector, \(k\) is some key vector and both \(q\) and \(k\) are unit Gaussian vectors (i.e. each value in the vector is \(N(0, 1)\) distributed), then \(q \boldsymbol{\cdot} k\) is \(N(0, d_k)\) distributed. Dividing by \(\sqrt{d_k}\) converts it back to a unit Gaussian random variable.

Multi-Head Attention

In the Transformer network, scaled dot-product attention is combined with what is called a multi-head attention mechanism.

In multi-head attention, the keys, the values and the queries are first projected into numerous lower-dimensional spaces. For each of these different projections, attention is calculated over these projected queries, keys and values. These calculated attention vectors are then concatenated together and projected back into the original vector space to get the final result. Therefore, we have: $$\begin{eqnarray} &MultiHead(Q, K, V) = concat(head_{1}, ..., head_{h}) W^O \\ &head_i = Attention(Q W_{i}^{Q}, K W_{i}^{K}, V W_{i}^{V} \end{eqnarray}$$

where \(h\) is the total number of attention heads and \(W^O\) and each \(W_i^Q\), \(W_i^K\) and \(W_i^V\) are learned projection matrices.

Multi-head attention is useful because each different attention head can learn to attend over different aspects of the input sequence. To use a contrived example, one of the attention heads could learn to attend over nouns in a sentence while another of the attention heads could learn to attend over verbs. The combination of these two attention heads would surely yield a lot more information useful for downstream natural language understanding tasks than either one of them individually.

Positional Encoding

The last main novel contribution of the Transformer network is its positional encoding scheme. Since the Transformer model does not contain any recurrent connections or convolutions, we need to explicitly inject information about the relative positions of tokens in the input. We do that by creating positional encoding vectors and adding them to the input before feeding it into the model.

While it's possible to learn positional encoding vectors via backpropagation, it was found that learned position encoding vectors yielded no performance gain over fixed encoding vectors. Thusly, we define the positional encoding vectors to be: $$\begin{eqnarray} &PE_{(pos, 2i)} = \sin (pos / 10000^{2i / d_{model}}) \\ &PE_{(pos, 2i+1)} = \cos (pos / 10000^{2i / d_{model}}) \end{eqnarray}$$

where \(pos\) is the position of the token in the sequence, \(2i\) and \(2i+1\) represent the particular dimension of the encoding vector and \(d_{model}\) is the dimensionality of the input embeddings. Using sinusoidal functions in this way allows us to generalize to sequence lengths not seen in the training data.


With all of these pieces laid out, we can now describe the overall architecture of the Transformer model. The Transformer architecture follows the pattern shown in this diagram:In the Attention Is All You Need paper which introduced the Transformer, the model was used for machine translation and, as such, followed an encoder-decoder pattern where the source language text was fed into an encoder and the target language text was fed into a decoder. Subsequent research has suggested that the encoder and decoder learn redundant information in monolingual text-to-text tasks. Therefore, Transformer-based language models tend to use an encoder-only architecture as described here.

Diagram of Transformer architecture

The model is made up of a stack of \(N\) layers with identical architecture where the output of the previous layer is fed into the next layer as input. Each of these layers is made up of two sublayers, each of which is followed by a residual connection and a layer norm.Layer normalization is a normalization scheme similar to batch normalization, but instead of normalizing with respect to features across a minibatch, it normalizes each sample in a minibatch across all of that sample's features. The normalization of each sample in the minibatch is then independent of all other samples. The first sublayer is a multi-head attention layer where \(Q = K = V\).This is referred to as self-attention. The second sublayer is a simple feed-forward network.


While the Transformer architecture allows us to model the interactions between words at different timesteps directlyAs opposed to RNNs where the model needs to step through \(N\) timesteps in order for a word to interact with another word \(N\) timesteps apart., there are still practical limitations to the distance of interactions that a Transformer can handle. In particular, since there is no recurrence built into the Transformer architecture, every sequence that is processed by the model is processed entirely independently of every other sequence. That means that the Transformer is incapable of learning dependencies longer than the maximum sequence length. The Transformer-XL architecture builds in a recurrence relation between adjacent sequences in order to alleviate this problem and learn longer-term dependencies.

The main idea of the Transformer-XL architecture is to process consecutive sequences and have each layer of both the encoder and the decoder cache the output it generated from the previous sequence to help generate the output from the current sequence. The initial multi-head self-attention sublayer of each layer uses the output of the previous layer to generate the queries, but it uses the output of the previous layer concatenated with the output of the previous layer from the last sequence in order to generate the keys and values. We then have this: $$\begin{eqnarray} &\tilde{h}_{\tau + 1}^{n-1} = concat(SG(h_{\tau}^{n-1}), h_{\tau + 1}^{n-1}) \\ &q_{\tau + 1}^{n}, k_{\tau + 1}^{n}, v_{\tau + 1}^{n} = h_{\tau + 1}^{n-1} W_{q}, \tilde{h}_{\tau + 1}^{n-1} W_{k}, \tilde{h}_{\tau + 1}^{n-1} W_v \\ &h_{\tau + 1}^{n} = Layer(q_{\tau + 1}^{n}, k_{\tau + 1}^{n}, v_{\tau + 1}^{n}) \end{eqnarray}$$

where \(h_{\tau}^n\) is the output of the \(n\)th layer for the \(\tau\)th sequence, \(SG\) is the stop-gradient functionIn order to prevent backpropagating through time. and \(Layer\) is the function computed by the layer.

Additionally, the Transformer-XL architecture utilizes a relative positional encoding scheme. This is needed to account for the fact that since context is now carried from one sequence to the next, the model needs some way of determining the positional difference between a word in the current sequence and a word in a previous sequence.

Sparse Transformer

As described before, the distance of interactions that a Transformer can handle is limited by the largest sequence that it can process. The Transformer-XL architecture aims to improve that by adding recurrence between consecutive sequences. The Sparse Transformer tries an orthogonal approach to improving the Transformer: increasing the size of the largest sequence the model can process.

The main contribution of the Sparse Transformer architecture is to reformulate attention in terms of connectivity patterns. A connectivity pattern is a set of indices that corresponds to the indices of the input that are being attended over. For a given connectivity pattern, we only attend over inputs whose indices appear in the connectivity pattern. Concretely, we have this: $$\begin{eqnarray} &Q = W_q X \\ &K_{S_i} = (W_k x_j)_{j \in S_i} \\ &V_{S_i} = (W_v x_j)_{j \in S_i} \\ &SelfAttention_{S_i} (x) = Attention(Q, K_{S_i}, V_{S_i}) \end{eqnarray}$$

where \(S_i\) is some connectivity pattern.

We can then define multiple connectivity patterns to attend over. We can either attend over all the connectivity patterns sequentially (i.e. one application of attention after the other) or we can attend over all of them in parallel via multi-head attention. The idea is that if we choose our connectivity patterns carefully, we can reduce total computational cost by having each connectivity pattern have a small subset of all possible indicesSpecifically, it's possible to choose connectivity patterns whose cardinality is roughly equal to \(\sqrt[\leftroot{-2}\uproot{2}p]{n}\) where \(n\) is the length of the sequence and \(p\) is the number of connectivity patterns. while ensuring that the composition of all of the connectivity patterns is equivalent to all indices being attended over.

Transfer Learning

In the landmark paper Learning Representations By Back-Propagating Errors by Rumelhart, Hinton and Williams that first introduced the term "backpropagation", it was found that neural network models trained via backpropagation learn intermediate representations of the input data that are useful for the task that the model is being trained for. It has since been found that these learned intermediate representations are useful input features for neural network models that are trained on related tasks. Training a neural network model using a fully-trained neural network's intermediate representations as input can greatly boost overall performance of a model and can greatly decrease training time. This process is known as transfer learning.

Transfer learning has been effectively applied in the field of computer vision since the beginning of the deep learning revolution. Its impact in the field of natural language processing, however, has historically been comparatively minor. The general wisdom had largely been that transfer learning in NLP wasn't very effective outside of pre-trained word embeddings and related techniques. While convolutional neural networks trained on ImageNet classification were found to be amenable to transfer learning on a variety of computer vision tasks, it seemed as if no such general proxy task existed for NLP.

Recently, however, it has been found that language modeling is a useful proxy task that can allow us to effectively utilize transfer learning in the NLP domain in a way that is similar to how transfer learning has been utilized in the computer vision domain. We now discuss several transfer learning approaches based on language modeling that have recently emerged.


One of first major breakthroughs in transfer learning in NLP via language modeling was in the paper Universal Language Model Fine-tuning (or ULMFiT) which introduced a general training procedure for fine-tuning a language model. The authors then showed that this fine-tuned language model can be used as a basis to train models via transfer learning for NLP tasks other than language modeling. Using the ULMFiT method, the authors established state-of-the-art results by a wide margin on a number of text classification benchmarks.

Previous work on fine-tuning language models ran into the problem of catastrophic forgetting where a pre-trained neural network quickly loses all information about the task it was previously trained on when it starts training on a new task. In order to overcome this issue, the authors proposed several techniques that can be thought of as controlling the rate at which new information is propagated through certain parts of the model when training on a new task. This was done in order to prevent new information from overwhelming certain parts of the model and causing catastrophic forgetting.

The three main techniques introduced were:

  1. Discriminative fine-tuning
  2. Gradual unfreezing
  3. Slanted triangular learning rates

Discriminative fine-tuning works by assigning different learning rates to different layers of the model. The idea is that earlier layers of the model tend to learn general features of the data while later layers learn increasingly more task-specific features. Therefore, when training the model on a new task, it stands to reason that the layers that learn more task specific features (i.e. the later layers) should change more quickly than the layers that learn more general features (i.e. the earlier layers).

Gradual unfreezing works by "unfreezing""Unfreezing" means to allow backpropagation to change the parameters of the layer. If we keep the parameters of a certain layer fixed over a training period, then we say that we have "frozen" that layer. the model one layer at a time when training, starting from the last layer and working towards the first layer. The justification for this technique is similar to that of discriminative fine-tuning: we should be more careful about changing features in the earlier layers because they contain more general features. Changing these features quickly would likely result in catastrophic forgetting.

Slanted triangular learning rates are a technique where you sharply increase the learning rate in the beginning of training and then slowly decrease the learning rate for the rest of training. The justification for this technique is that the period of sharp learning rate increase allows the model to quickly find a good general region in parameter space while the rest of training allows the model to then optimize its parameters in a more granular way.


Shortly after the ULMFiT paper came the paper Improving Language Understanding by Generative Pre-Training."Generative pre-training" meaning to pre-train a network to learn a generative model. A generative model is a model that learns the underlying probability distribution of data over some domain (which is exactly what a language model does over the domain of text). Such a model can then use this learned probability distribution to generate synthetic data samples. In this paper, the authors trained a powerful language model based on the Transformer architecture. They called this model the Generative Pre-training Transformer or GPT.

This pre-trained language model was then used as a basis for training models on 12 diverse NLP tasks via transfer learning. Of these 12 models, 9 of them were found to be state-of-the-art in their domain. The previous state-of-the-art models were often ensembles of models whose architecture was highly domain-specific. Overall, the GPT approach achieved state-of-the-art performance on the NLP multi-task benchmark GLUE.


Shortly after the GPT paper came yet another impactful paper concerning transfer learning in NLP, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In the paper, the authors trained a language model based on the Transformer architecture, similar to the GPT model.

The major difference between the BERT model and the GPT model is that the BERT model was trained to predict the probability of a word given context from both the left and the right of the word whereas the GPT model was only given context from the left of the word. This bidirectionality allowed the BERT model to learn more powerful intermediate representations than the GPT model. Using the pre-trained BERT model as a basis for training models for other NLP tasks, the authors achieved state-of-the-art results in 11 different domains. Overall, the BERT approach surpassed the GPT approach in achieving state-of-the-art performance in the GLUE benchmark.