An in-depth introduction to using Keras for language modeling: word embeddings, recurrent and convolutional neural networks, attentional RNNs, and similarity metrics for vector embeddings.

Question answering has received more focus as large search engines have basically mastered general information retrieval and are starting to cover more edge cases. Question answering happens to be one of those edge cases, because it can involve a lot of syntactic nuance that doesn’t get captured by standard information retrieval models like LDA or LSI. Hypothetically, deep learning models would be better suited to this type of task because of their ability to capture higher-order syntax. Two papers, “Applying deep learning to answer selection: a study and an open task” (Feng et al. 2015) and “LSTM-based deep learning models for non-factoid answer selection” (Tan et al. 2016), are recent examples which have applied deep learning to question-answering tasks with good results.

Feng et al. used an in-house Java framework for their work, and Tan et al. built their model entirely in Theano. Personally, I am a lot lazier than them, and I don’t understand CNNs very well, so I would like to use an existing framework to build one of their models and see if I can get similar results. Keras is a really popular one that has support for everything we might need to put the model together.

See the instructions here on how to install Keras. The simple route is to install using `pip`, e.g. `pip install keras`.

There are some important features that might not be available without the most recent version. I’m not sure if doing `pip install` gets the most recent version, so it might be helpful to install from source. This is actually pretty straightforward! Just change to the directory where you want your source code to be, clone the Keras repository from GitHub, and run its `setup.py` install.

One benefit of this is that if you want to add a custom layer, you can add it to the Keras installation and be able to use it across different projects. Even better, you could fork the project and clone your own fork, although this gets into areas of Git beyond my understanding.

There are actually a couple of language models in the Keras examples:

- `imdb_lstm.py`: Using an LSTM recurrent neural network to do sentiment analysis on the IMDB dataset
- `imdb_cnn_lstm.py`: The same task, but this time using a CNN layer beneath the LSTM layer
- `babi_rnn.py`: Recurrent neural networks for modeling Facebook’s bAbI dataset, “a mixture of 20 tasks for testing text understanding and reasoning”

These are pretty interesting to play around with. It is really cool how easy it is to get one of these set up! With Keras, a high-level model design can be quickly implemented.

Ok! Let’s dive in. The first challenge that you might think of when designing a language model is what the units of the language might be. A reasonable dataset might have around 20000 distinct words, after lemmatizing them. If each word is one-hot encoded and the average sentence is 40 words long, then you’re left with a `20000 x 40` matrix just to represent one sentence, which is 3.2 megabytes if each entry is stored in 32 bits. This obviously doesn’t work, so the first step in developing a good language model is to figure out how to reduce the number of dimensions required to represent a word.

One popular method of doing this is using `word2vec`. `word2vec` is a way of embedding words in a vector space so that words that are semantically similar are near each other. There are some interesting consequences of doing this, like being able to do word addition and subtraction:
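A toy illustration of this kind of word arithmetic, using hand-made two-dimensional vectors rather than real `word2vec` output. The words and the vector values are made up for the example (the first axis loosely stands for "royalty", the second for "gender"); real embeddings are learned and only approximately behave this way.

```python
import math

# Hypothetical, hand-made vectors (not learned by word2vec).
vectors = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
}


def cosine(u, v):
    """Cosine similarity between two equal-length tuples."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# king - man + woman, computed element by element.
target = tuple(
    k - m + w
    for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])
)

# The word whose vector points in the direction closest to the result.
nearest = max(vectors, key=lambda word: cosine(vectors[word], target))
```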

In Keras, this is available as an `Embedding` layer. This layer takes as input a `(n_batches, sentence_length)` dimensional matrix of integers representing each word in the corpus, and outputs a `(n_batches, sentence_length, n_embedding_dims)` dimensional matrix, where the last dimension is the word embedding.

There are two advantages to this. The first is space: Instead of 3.2 megabytes, a 40 word sentence embedded in 100 dimensions would only take 16 kilobytes, which is much more reasonable. More importantly, word embeddings give the model a hint at the meaning of each word, so it will converge more quickly. There are significantly fewer parameters which have to be jostled around, and parameters are sort of tied together in a sensible way so that they jostle in the right direction.
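The two sizes quoted above can be checked with a few lines of arithmetic. This assumes, as the text does, 32-bit (4-byte) values and decimal units (1 megabyte = 1,000,000 bytes); the variable names are only for illustration.

```python
vocab_size = 20000       # distinct words in the corpus
sentence_length = 40     # words per sentence
embedding_dims = 100     # size of each word embedding
bytes_per_value = 4      # 32-bit values

# One-hot encoding: one vocabulary-sized column per word in the sentence.
one_hot_bytes = vocab_size * sentence_length * bytes_per_value

# Embedded encoding: one 100-dimensional vector per word in the sentence.
embedded_bytes = sentence_length * embedding_dims * bytes_per_value

one_hot_megabytes = one_hot_bytes / 1_000_000
embedded_kilobytes = embedded_bytes / 1_000
```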

Here’s how you would go about writing something like this:
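A framework-free sketch of the lookup that an embedding layer performs, assuming NumPy. This is not the Keras `Embedding` API; the function `embed`, the random initialization, and the batch size of 8 are assumptions made for the example. The point is only the shapes: integer indices go in, one vector per index comes out.

```python
import numpy as np

rng = np.random.default_rng(0)

n_words = 20000          # vocabulary size
n_embedding_dims = 100   # length of each word vector
n_batches = 8            # hypothetical batch size
sentence_length = 40

# One row per word in the vocabulary. In a real model these rows are
# trainable weights; here they are just small random numbers.
embedding_matrix = rng.normal(scale=0.05, size=(n_words, n_embedding_dims))


def embed(word_indices):
    """Replace every integer index with the matching row of the matrix."""
    return embedding_matrix[word_indices]


# A batch of sentences, each word represented by its integer index.
batch = rng.integers(0, n_words, size=(n_batches, sentence_length))
embedded = embed(batch)  # (n_batches, sentence_length, n_embedding_dims)
```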

Let’s try this out! We can train a recurrent neural network to predict some dummy data and examine the embedding layer for each vector. This model takes a sentence like “sam is red” or “sarah not green” and predicts what color the person is. It is a very simple example, but it will illustrate what the Embedding layer is doing, and also illustrate how we can turn a bunch of sentences into vectors of indices by building a dictionary.
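The dictionary-building step can be sketched in plain Python. The first two sentences come from the description above; the other two are hypothetical additions that are consistent with it. Each new word gets the next free integer index, and each sentence becomes a list of those indices.

```python
sentences = [
    "sam is red",
    "sarah not green",
    "bob is green",    # hypothetical extra example
    "hannah not red",  # hypothetical extra example
]

# Assign each distinct word the next unused index, in order of appearance.
dictionary = {}
for sentence in sentences:
    for word in sentence.split():
        if word not in dictionary:
            dictionary[word] = len(dictionary)

# Turn every sentence into a vector of word indices.
index_vectors = [
    [dictionary[word] for word in sentence.split()]
    for sentence in sentences
]
```

These index vectors are the kind of integer input that an embedding layer expects.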

The embedding layer embeds the words into 3 dimensions. A sample of the vectors it produces is seen below. As predicted, the model learns useful word embeddings.

Each category is grouped in the 3-dimensional vector space. The network learned each of these categories from how each word was used; Sarah and Sam are the red people, while Bob and Hannah are the green people. However, it did not differentiate well between `not`, `is`, `red`, and `green`, because those weren’t immediately obvious for the decision task.

Word distributions in vector space. The word distributions are learned so that the red people, Sarah and Sam, are in one part of the space, and the green people, Bob and Hannah, are in another part.

As the Keras examples illustrate, there are different philosophies on deep language modeling. Feng et al. did a bunch of benchmarks with convolutional networks, and ended up with some impressive results. Tan et al. used recurrent networks with some different parameters. I’ll focus on recurrent neural networks first (What do pirates call neural networks? *Arrrgh*NNs). I’ll assume some familiarity with both recurrent and convolutional neural networks. Andrej Karpathy’s blog discusses recurrent neural networks in detail. Here is an image from that post which explains the core concept:

Recurrent neural network architectures, from Andrej Karpathy's blog.

The basic RNN architecture is essentially a feed-forward neural network that is stretched out over a bunch of time steps and has its intermediate output added to the next input step. This idea can be expressed as an update equation for each input step:

Note that `dot` indicates vector-matrix multiplication. Multiplying a vector of dimensions `<m>` by a matrix of dimensions `<m, n>` can be done with `dot(<m>, <m, n>)` and yields a vector of dimensions `<n>`. This is consistent with its usage in Theano and Keras. In the update equation, we multiply each `input_vector` by our input weights `W`, multiply the `prev_hidden` vector by our hidden weights `U`, and add a bias, before passing the sum to the activation function `sigmoid`. To get the **many to one** behavior in the image, we can grab the last hidden state and use that as our output. To get the **one to many** behavior, we can pass one input vector and then just pass a bunch of zero vectors to get as many hidden states as we want.
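The update described in this paragraph can be sketched in NumPy. The names `input_vector`, `prev_hidden`, `W`, `U` and `sigmoid` follow the text; the bias name `b`, the helper `run_rnn`, and the layer sizes are assumptions for the example.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def rnn_step(input_vector, prev_hidden, W, U, b):
    """One update: input and previous hidden state are mixed, then squashed."""
    return sigmoid(np.dot(input_vector, W) + np.dot(prev_hidden, U) + b)


def run_rnn(inputs, W, U, b):
    """Apply rnn_step over a sequence, starting from a zero hidden state."""
    hidden = np.zeros(U.shape[0])
    states = []
    for input_vector in inputs:
        hidden = rnn_step(input_vector, hidden, W, U, b)
        states.append(hidden)
    return np.array(states)


rng = np.random.default_rng(0)
n_in, n_hidden, n_steps = 3, 4, 5
W = rng.normal(size=(n_in, n_hidden))      # input-to-hidden weights
U = rng.normal(size=(n_hidden, n_hidden))  # hidden-to-hidden weights
b = np.zeros(n_hidden)
inputs = rng.normal(size=(n_steps, n_in))

states = run_rnn(inputs, W, U, b)

# Many to one: keep only the last hidden state.
many_to_one = states[-1]

# One to many: one real input followed by zero vectors.
padded = np.vstack([inputs[:1], np.zeros((n_steps - 1, n_in))])
one_to_many = run_rnn(padded, W, U, b)
```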

If the RNN gets really long, then we run into a lot of difficulty training the model. The effect of something early in the sequence on the end result is very small relative to later components, so it is hard to use that information in updating the weights. To solve this, several methods have been proposed, and two have been implemented in Keras. The first is the Long Short-Term Memory (LSTM) unit, which was proposed by Hochreiter and Schmidhuber (1997). This model uses a second hidden state which stores information from further back in the sequence, allowing that information to have a stronger effect on the end result. The update equations for this model are:

Note that `*` indicates element-wise multiplication. This is consistent with its usage in Theano and Keras. There are a bunch more parameters to train; not only do we have weights for the input-to-hidden and hidden-to-hidden matrices, but we also have an accompanying `candidate_state`. The candidate state is like a second hidden state that transfers information to and from the hidden state. It is like a safety deposit box for putting things in and taking things out.
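A NumPy sketch of one LSTM step in its commonly used formulation, which may differ in detail from the equations this post was written around. The names here (`cell`, `proposal`, and the parameter keys such as `W_i`) are my own assumptions; `cell` plays the role of the second hidden state described above.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x, prev_hidden, prev_cell, p):
    """One LSTM update. `p` is a dict of weights keyed by gate name."""
    def gate(name, activation):
        return activation(
            np.dot(x, p["W_" + name])
            + np.dot(prev_hidden, p["U_" + name])
            + p["b_" + name]
        )

    input_gate = gate("i", sigmoid)    # how much new information to store
    forget_gate = gate("f", sigmoid)   # how much old memory to keep
    output_gate = gate("o", sigmoid)   # how much memory to expose
    proposal = gate("c", np.tanh)      # new content offered to the memory

    cell = forget_gate * prev_cell + input_gate * proposal
    hidden = output_gate * np.tanh(cell)
    return hidden, cell


def init_params(n_in, n_hidden, rng):
    """Small random weights for the three gates and the proposal."""
    p = {}
    for name in "ifoc":
        p["W_" + name] = rng.normal(scale=0.1, size=(n_in, n_hidden))
        p["U_" + name] = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
        p["b_" + name] = np.zeros(n_hidden)
    return p


rng = np.random.default_rng(0)
params = init_params(n_in=3, n_hidden=4, rng=rng)
hidden, cell = np.zeros(4), np.zeros(4)
for x in rng.normal(size=(6, 3)):
    hidden, cell = lstm_step(x, hidden, cell, params)
```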

The second model is the Gated Recurrent Unit (GRU), which was proposed by Cho et al. (2014). The equations for this model are as follows:

In this model, there is an `update_gate` which controls how much of the previous hidden state to carry over to the new hidden state and a `reset_gate` which controls how much of the previous hidden state is used when computing the new candidate state. This allows potentially long-term dependencies to be propagated through the network.
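A NumPy sketch of one GRU step in a commonly used formulation. `update_gate` and `reset_gate` follow the text; the parameter keys, the name `proposal`, and the convention that an update gate of 1 keeps the previous state unchanged are assumptions (some references flip that convention).

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gru_step(x, prev_hidden, p):
    """One GRU update. `p` is a dict of weights keyed by gate name."""
    update_gate = sigmoid(x @ p["W_z"] + prev_hidden @ p["U_z"] + p["b_z"])
    reset_gate = sigmoid(x @ p["W_r"] + prev_hidden @ p["U_r"] + p["b_r"])
    # The reset gate decides how much of the old state feeds the proposal.
    proposal = np.tanh(
        x @ p["W_h"] + (reset_gate * prev_hidden) @ p["U_h"] + p["b_h"]
    )
    # The update gate decides how much of the old state is carried over.
    return update_gate * prev_hidden + (1.0 - update_gate) * proposal


rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4
params = {}
for name in "zrh":
    params["W_" + name] = rng.normal(scale=0.1, size=(n_in, n_hidden))
    params["U_" + name] = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
    params["b_" + name] = np.zeros(n_hidden)

x = rng.normal(size=n_in)
prev_hidden = rng.normal(size=n_hidden)
new_hidden = gru_step(x, prev_hidden, params)

# With a very large update-gate bias the gate saturates at 1, so the
# previous hidden state passes through essentially untouched.
carry_params = dict(params, b_z=np.full(n_hidden, 50.0))
carried = gru_step(x, prev_hidden, carry_params)
```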

My implementations of these models in Theano, as well as optimizers for training them, can be found in this GitHub repository.

Now that we’ve seen the equations, let’s see how Keras implementations compare on some sample data.

The results will vary from trial to trial. RNNs are exceptionally difficult to train. However, in general, a model that can take advantage of long-term dependencies will have a much easier time seeing how two sequences are different.

It isn’t strictly important to understand the RNN part before looking at this part, but it will help everything make more sense. The next component of language modeling, which was the focus of the Tan paper, is the Attentional RNN. The essential components of this model are described in “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention” (Xu et al. 2016). I’ll try to hash it out in this blog post a little bit and look at how to build it in Keras.

First, let’s look at how to make a custom layer in Keras. There are a couple options. One is the `Lambda` layer, which does a specified operation. An example of this could be a layer that doubles the value it is passed:

This doubles our input data. Note that there are no trainable weights anywhere in this model, so it couldn’t actually learn anything. What if we wanted to multiply our input vector by some trainable scalar that predicts the output vector? In this case, we will have to write our own layer.

Let’s jump right in and write a layer that learns to multiply an input by a scalar value and produce an output.

There we go! We have a complete model. We could change it around to make it fancier, like adding a *broadcastable dimension* to the `multiplicand` so that the layer could be passed a vector of numbers instead of just a scalar. Let’s look closer at how we built the multiplication layer:

First, we make a weight initializer that we can use later to get weights. `glorot_uniform` is just a particular way to initialize weights. We then call the `__init__` method of the super class.

This method specifies the components of the model, for when we build it. The only component we need is the scalar to multiply by, so we initialize a new tensor by calling `self.init`, the initializer we created in the `__init__` method.

This method tells the builder what the output shape of this layer will be given its input shape. Since our layer just does a scalar multiply, it doesn’t change the output shape from the input shape. For example, scalar multiplying the input `[1, 2, 3]` of dimensions `<3, 1>` by a scalar factor of 2 gives the output `[2, 4, 6]`, which has the same dimensions `<3, 1>`.

This is the bread and butter of the layer, where we actually perform the operation. We specify that the output of this layer is the input `x` matrix multiplied by our multiplicand tensor. Note that this method takes a while to run, because whatever backend we use (for example, Theano) has to put together the tensors in the right way. To make your layer fail quickly when something is wrong, it is good practice to add `assert` checks in the `build` and `get_output_shape_for` methods.
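To make the moving parts concrete, here is a plain-NumPy stand-in for the multiplication layer. It is not a Keras class: it only borrows the method names `build`, `get_output_shape_for` and `call` discussed above. The `train_step` method, the uniform initialization, the learning rate, and the target factor of 3 are assumptions made for the example.

```python
import numpy as np


class MultiplicationLayer:
    """Learns a single scalar `multiplicand` (framework-free sketch)."""

    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)
        self.multiplicand = None

    def build(self, input_shape):
        # Fail early on unexpected shapes, then create the one weight.
        assert len(input_shape) == 2, "expected (n_samples, n_features)"
        self.multiplicand = self.rng.uniform(-0.1, 0.1)

    def get_output_shape_for(self, input_shape):
        # Scalar multiplication leaves the shape unchanged.
        return input_shape

    def call(self, x):
        return x * self.multiplicand

    def train_step(self, x, y, learning_rate=0.1):
        """One gradient-descent step on mean squared error."""
        error = self.call(x) - y
        gradient = 2.0 * np.mean(error * x)
        self.multiplicand -= learning_rate * gradient
        return np.mean(error ** 2)


rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=(256, 1))
y = 3.0 * x  # the layer should discover the factor 3

layer = MultiplicationLayer()
layer.build(x.shape)
for _ in range(500):
    loss = layer.train_step(x, y)
```

In Keras itself the gradient step is handled by the optimizer, so a real layer only needs the first four methods.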

Now that we’ve got an idea of how to build a custom layer, let’s look at the specifications for an attentional LSTM. Following Tan et al., we can augment our LSTM equations from earlier to include an attentional component. The attentional component requires some attention vector `attention_vec`.

The new equations are the last three, which correspond to equations 9, 10 and 11 from the paper (approximately reproduced below, using different notation).

The attention parameter is a function of the current hidden state and the attention vector mixed together. Each is first put through a matrix, summed and put through an activation function to get an attention state, which is then put through another transformation to get an attention parameter. The attention parameter then re-updates the hidden state. Supposedly, this is conceptually similar to TF-IDF weighting, where the model learns to weight particular states at particular times.
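The three steps in this paragraph can be sketched in NumPy as follows. Only `attention_vec` is a name from the text; the weight names `W_h`, `W_a`, `w_s`, the choice of `tanh` for the attention state, and the use of a sigmoid to squash the attention parameter at a single time step are assumptions (the paper normalizes over time steps instead), so treat this as an approximation of the idea rather than the paper's exact equations.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def attend(hidden, attention_vec, W_h, W_a, w_s):
    """Re-weight a hidden state using an attention vector."""
    # 1. Put each through a matrix, sum, and apply an activation.
    attention_state = np.tanh(hidden @ W_h + attention_vec @ W_a)
    # 2. Transform the attention state into one weight per example.
    attention_param = sigmoid(attention_state @ w_s)
    # 3. Re-update the hidden state with that weight.
    return hidden * attention_param


rng = np.random.default_rng(0)
batch, n_hidden, n_attention, n_state = 2, 4, 5, 3
hidden = rng.normal(size=(batch, n_hidden))
attention_vec = rng.normal(size=(batch, n_attention))
W_h = rng.normal(size=(n_hidden, n_state))
W_a = rng.normal(size=(n_attention, n_state))
w_s = rng.normal(size=(n_state, 1))

attended = attend(hidden, attention_vec, W_h, W_a, w_s)
```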

Now that we have all the components for an Attentional LSTM, let’s see the code for how we could implement this. The attentional component can be tacked onto the LSTM code that already exists.

Let’s look at what each function is doing individually. Note that this builds heavily upon the already-existing LSTM implementation.

We will create a subclass (does Python even do subclasses?) of the LSTM implementation that Keras already provides. The Keras `backend` is either Theano or TensorFlow, depending on the settings specified in `~/.keras/keras.json` (the default is Theano). This backend lets us use Theano-type functions such as `K.zeros`, which specifies a matrix of zeros, to initialize our model.

We initialize the layer by passing it the number of hidden units `output_dim` and the layer to use as the attention vector `attention_vec`. The `__init__` function is identical to the `__init__` function for the `LSTM` layer except for the attention vector, so we just reuse it here.

I won’t reproduce everything here, but essentially this method initializes all of the weight matrices we need for the attentional component, after calling the `LSTM.build` method to initialize the LSTM weight matrices.

This method is used by the `RNN` superclass, and tells the function what to do on each timestep. It mirrors the equations given earlier, and adds the attentional component on top of the LSTM equations.

This method is used by the LSTM superclass to define components outside of the step function, so that they don’t need to be recomputed every time step. In our case, the attentional vector doesn’t need to be recomputed every time step, so we define it as a constant (we then grab it in the `step` function using `attention = states[4]`).

Convolutional networks are better explained elsewhere, and all of the functions required for making a good CNN language model are already supported in Keras. Basically, with language modeling, a common strategy is to apply a ton of convolutional filters (on the order of 1000) to the embedding layer, followed by a 1-max pooling function, and call it a day. It actually works stupidly well for question answering (see Feng et al. for benchmarks). This approach can be done fairly easily in Keras. One thing that may not be intuitive, however, is how to combine several filter lengths. This can be done as follows:
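A NumPy sketch of the idea, independent of Keras: run one bank of filters per filter length over the embedded sentence, take the maximum over time for every filter, and concatenate the pooled vectors. The filter lengths `[2, 3, 5]`, the 16 filters per length, the `tanh` activation, and the function name `conv1d_max_pool` are all assumptions chosen to keep the example small.

```python
import numpy as np


def conv1d_max_pool(embedded, filters):
    """Convolve over time, then keep each filter's strongest response.

    embedded: (sentence_length, n_dims)
    filters:  (n_filters, filter_length, n_dims)
    returns:  (n_filters,)
    """
    filter_length = filters.shape[1]
    n_windows = embedded.shape[0] - filter_length + 1
    # Every contiguous window of `filter_length` word vectors.
    windows = np.stack(
        [embedded[i:i + filter_length] for i in range(n_windows)]
    )  # (n_windows, filter_length, n_dims)
    # Dot each window with each filter.
    activations = np.tanh(np.einsum("wld,fld->wf", windows, filters))
    return activations.max(axis=0)


rng = np.random.default_rng(0)
sentence_length, n_dims, n_filters = 40, 100, 16
embedded = rng.normal(size=(sentence_length, n_dims))

filter_lengths = [2, 3, 5]
pooled = [
    conv1d_max_pool(
        embedded,
        rng.normal(scale=0.1, size=(n_filters, length, n_dims)),
    )
    for length in filter_lengths
]

# One fixed-size sentence vector, regardless of filter length.
sentence_vector = np.concatenate(pooled)
```

Because max pooling removes the time axis, filter banks of different lengths all produce vectors of the same size, which is what makes the concatenation straightforward.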

The basic idea with question answering is to embed questions and answers as vectors, so that the question vector is close in vector space to the answer vector. For example, with the Attentional RNN, we take the question vector and use it as an input for generating the answer vector. A common approach is to then rank answer vectors according to their cosine similarities with the question vector. This doesn’t follow the conventional neural network architecture, and takes some manipulation to achieve in Keras. To use equations, what we would like to do is:
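In code, the ranking step amounts to scoring every candidate answer vector against the question vector and sorting. A small NumPy sketch with made-up three-dimensional vectors (the numbers are purely illustrative):

```python
import numpy as np


def cosine_similarity(question, answers):
    """Cosine similarity between one question and each answer row."""
    question = question / np.linalg.norm(question)
    answers = answers / np.linalg.norm(answers, axis=1, keepdims=True)
    return answers @ question


question = np.array([1.0, 0.0, 1.0])
answers = np.array([
    [0.9, 0.1, 1.1],   # points almost the same way as the question
    [-1.0, 0.5, 0.0],  # points roughly the opposite way
    [0.0, 1.0, 0.2],   # nearly orthogonal to the question
])

scores = cosine_similarity(question, answers)
ranking = np.argsort(-scores)  # best answer first
```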

Training is generally done by minimizing hinge loss. In this case, we want the cosine similarity for the correct answer to go up, and the cosine similarity for an incorrect answer to go down. The loss function can be formulated as:

Note that for some implementations, having a loss of zero can be troublesome, so a small value like `1e-6` is preferable instead. The loss is zero when the difference between the cosine similarities of the good and bad answers is greater than the constant margin we defined. In practice, the margins generally range from 0.001 to 0.2. If we want to use something besides cosine similarity, we can reformulate this as

where `sim` is our similarity metric. Hinge loss works well for this application, as opposed to something like mean squared error, because we don’t want our question vectors to be orthogonal with the bad answer vectors, we just want the bad answer vectors to be a good distance away.
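Written out as plain Python, the hinge loss described above looks like this. The function takes already-computed similarity scores, so it works for cosine similarity or any other `sim`; the default margin of 0.1 and the `floor` parameter (for implementations that prefer `1e-6` over an exact zero) are assumptions for the example.

```python
def hinge_loss(sim_good, sim_bad, margin=0.1, floor=0.0):
    """Loss for one (question, good answer, bad answer) triple.

    sim_good: similarity between the question and the correct answer
    sim_bad:  similarity between the question and an incorrect answer
    """
    return max(floor, margin - sim_good + sim_bad)


# Good answer is far ahead of the bad one: nothing left to learn.
separated = hinge_loss(sim_good=0.9, sim_bad=0.2)

# Good answer is ahead by less than the margin: still penalized.
too_close = hinge_loss(sim_good=0.5, sim_bad=0.45)

# Same separated case, but with a small floor instead of exact zero.
floored = hinge_loss(sim_good=0.9, sim_bad=0.2, floor=1e-6)
```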

First, let’s look at how to do cosine similarity within the constraints of Keras. Fortunately, Keras has an implementation of cosine similarity, as a `mode` argument to the `merge` layer. This is done with:

If we pass it two inputs of dimensions `(a, b, c)`, it will calculate the cosine similarity of the `c` dimension (specified using `dot_axes`) and give an output of dimensions `(a, b)`. However, because we might eventually want to implement other types of similarities besides cosine similarity, let’s look at how this can be done by passing a lambda function to `merge`.

We define a function `similarity` which we will use to compute the similarity of the inputs passed to the `merge` layer. Note that when we do this, we also have to pass an `output_shape` which tells Keras what shape the output will be after we do this operation (hopefully in the future this shape will be inferred, but it is still an open issue in the GitHub group).
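As a backend-independent illustration of what such a `similarity` function and its `output_shape` companion compute, here is a NumPy sketch. It is not the Keras `merge` API; the helper name `similarity_output_shape` and the small `eps` guard against division by zero are assumptions.

```python
import numpy as np


def similarity(x, y, eps=1e-8):
    """Cosine similarity along the last axis: (a, b, c) -> (a, b)."""
    dot = np.sum(x * y, axis=-1)
    norms = np.linalg.norm(x, axis=-1) * np.linalg.norm(y, axis=-1)
    return dot / np.maximum(norms, eps)


def similarity_output_shape(input_shapes):
    """The shape the similarity produces: the last axis is reduced away."""
    shape_x, shape_y = input_shapes
    assert shape_x == shape_y, "both inputs must have the same shape"
    return shape_x[:-1]


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 5, 6))
y = rng.normal(size=(4, 5, 6))

out = similarity(x, y)
expected_shape = similarity_output_shape([x.shape, y.shape])
```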

A cool example might be to see if we can learn a rotation matrix. A rotation matrix in Euclidean space is a matrix which rotates a vector by a certain angle around the origin. It is defined as a function of `theta`, the angle to rotate by:
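For two dimensions, the standard counter-clockwise rotation matrix can be written in NumPy as below, along with the kind of dataset one would train on: random points `a` and their rotated versions `b`. The function name `rotation_matrix`, the sample count, and the row-vector convention (`b = a @ R.T`) are choices made for this sketch.

```python
import numpy as np


def rotation_matrix(theta):
    """2D rotation by `theta` radians, counter-clockwise."""
    return np.array([
        [np.cos(theta), -np.sin(theta)],
        [np.sin(theta), np.cos(theta)],
    ])


theta = np.pi / 2  # 90 degrees
R = rotation_matrix(theta)

# Rotating the x unit vector by 90 degrees gives the y unit vector.
rotated_unit = R @ np.array([1.0, 0.0])

# A training set: random points and their rotated counterparts.
rng = np.random.default_rng(0)
a = rng.normal(size=(100, 2))
b = a @ R.T  # each row of `a`, rotated
```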

We can learn this matrix really simply with the right dataset and one dense layer, that is:

A `Dense` layer with `linear` activation is the same as a matrix multiplication (plus a bias term, if one is used). We give it two input dimensions and two output dimensions. After training this model, the printed weight matrix is:

which is close to the rotation matrix for an angle of 90 degrees. Let’s try this again, but with cosine similarity. This will require some manipulation. In the previous example, we had a clearly defined input, `a`, and output, `b`, and our model was designed to perform a transformation on `a` to predict `b`. In this example, we have two inputs, `a` and `b`, and we will perform a transformation on `a` to make it close to `b`. As an output, we get the similarity of the two vectors, so we need to train our model to make this similarity high by providing it a bunch of 1’s as the target values, since a similarity of 1 indicates perfect similarity.

Running this, we end up with a weight matrix that looks like

This looks a bit like a rotation matrix, but the scaling seems off. Cosine similarity is indifferent to the magnitude of vectors, so the weight matrix ends up not being a rotation matrix so much as a rotation-and-skew matrix. It is interesting to think about why this network learned this particular matrix.

Below, a unit square (blue) is multiplied by the first matrix to get the orange square, and by the second matrix to get the yellow square.

Matrix transformations on a square. The orange square is the result of training a 90 degree rotation transformation by minimizing mean squared error. The yellow square is the result of training the same transformation by minimizing cosine distance.

Feng et al. provided a list of similarities along with their benchmarks for a CNN architecture. Some of these similarities, along with their implementations in Keras, are reproduced below. They rely on these helper functions:

If the function requires extra parameters, they are usually supplied as arguments in a dictionary.

Values for `gamma` used in the paper were `[0.5, 1.0, 1.5]`. The value for `c` was usually `1`. Values for `d` were `[2, 3]`.

Values for `gamma` used in the paper were `[0.5, 1.0, 1.5]`, and `c` was `1`.

RBF stands for radial basis function.

Values for `gamma` used in the paper were `[0.5, 1.0, 1.5]`.

This was a custom metric developed by the authors which stands for Geometric mean of Euclidean and Sigmoid Dot product. It performed well for their benchmarks.

Values for `gamma` used were `[0.5, 1.0, 1.5]` and `c` was `1`.

This was a custom metric developed by the authors which stands for Arithmetic mean of Euclidean and Sigmoid Dot product. It performed well for their benchmarks.

Values for `gamma` used were `[0.5, 1.0, 1.5]` and `c` was `1`.
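For reference, here is a NumPy sketch of several of these similarities, written from the usual definitions of the polynomial, sigmoid and RBF kernels and from my reading of how GESD and AESD are defined in Feng et al. The exact forms are worth checking against the paper. The helper names `dot` and `l2_norm` are assumptions, and parameters are passed as keyword arguments here rather than in a dictionary.

```python
import numpy as np


def dot(x, y):
    """Dot product along the last axis."""
    return np.sum(x * y, axis=-1)


def l2_norm(x, y):
    """Euclidean distance along the last axis."""
    return np.sqrt(np.sum((x - y) ** 2, axis=-1))


def cosine(x, y):
    return dot(x, y) / (np.linalg.norm(x, axis=-1) * np.linalg.norm(y, axis=-1))


def polynomial(x, y, gamma=1.0, c=1.0, d=2):
    return (gamma * dot(x, y) + c) ** d


def sigmoid_kernel(x, y, gamma=1.0, c=1.0):
    return np.tanh(gamma * dot(x, y) + c)


def rbf(x, y, gamma=1.0):
    return np.exp(-gamma * l2_norm(x, y) ** 2)


def gesd(x, y, gamma=1.0, c=1.0):
    """Product of a Euclidean term and a sigmoid-of-dot-product term."""
    euclidean = 1.0 / (1.0 + l2_norm(x, y))
    sigmoid_dot = 1.0 / (1.0 + np.exp(-gamma * (dot(x, y) + c)))
    return euclidean * sigmoid_dot


def aesd(x, y, gamma=1.0, c=1.0):
    """Average of the same two terms used in GESD."""
    euclidean = 1.0 / (1.0 + l2_norm(x, y))
    sigmoid_dot = 1.0 / (1.0 + np.exp(-gamma * (dot(x, y) + c)))
    return 0.5 * euclidean + 0.5 * sigmoid_dot


x = np.array([1.0, 0.0])
z = np.array([0.0, 1.0])
```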

This surprisingly simple model performed very well on the task.

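This model achieved relatively good marks for Top-1 Accuracy (how often the model ranked a ground truth answer the highest out of 500 results) and Mean Reciprocal Rank (MRR), which is defined as the average, over all questions, of the reciprocal of the rank of the first ground truth answer.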
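Both metrics are easy to compute from ranked scores. A plain-Python sketch, with three tiny made-up questions (three candidates each instead of 500) so the numbers can be followed by hand; the function names and the data are hypothetical.

```python
def reciprocal_rank(scores, ground_truth):
    """1 / rank of the best-ranked correct answer (0 if none is found).

    scores:       one similarity score per candidate answer
    ground_truth: set of indices of the correct answers
    """
    ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    for position, index in enumerate(ranking, start=1):
        if index in ground_truth:
            return 1.0 / position
    return 0.0


def evaluate(examples):
    """Return (top-1 accuracy, mean reciprocal rank) over all questions."""
    ranks = [reciprocal_rank(scores, truth) for scores, truth in examples]
    top1 = sum(1 for r in ranks if r == 1.0) / len(ranks)
    mrr = sum(ranks) / len(ranks)
    return top1, mrr


examples = [
    ([0.9, 0.1, 0.3], {0}),  # correct answer ranked 1st
    ([0.2, 0.8, 0.5], {2}),  # correct answer ranked 2nd
    ([0.1, 0.2, 0.7], {0}),  # correct answer ranked 3rd
]
top1, mrr = evaluate(examples)
```

Here the reciprocal ranks are 1, 1/2 and 1/3, so top-1 accuracy is 1/3 and MRR is 11/18.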

The results after learning the training set are summarized in the following table.

| | Top-1 Accuracy | Mean Reciprocal Rank |
|---|---|---|
| Test 1 | 0.4933 | 0.6189 |
| Test 2 | 0.4606 | 0.5968 |
| Dev | 0.4700 | 0.6088 |

For comparison, the best model from Feng et al. achieved an accuracy of 0.653 on Test 1, and the model in Tan et al. achieved an accuracy of 0.681 on Test 1. This model isn’t exceptional, but it works pretty well for how simple it is. It outperforms the baseline bag of words model, and performs on par with the Metzler-Bendersky IR model introduced in “Learning concept importance using a weighted dependence model” (Bendersky and Metzler, 2010). Here’s how we build it in Keras:

The code is kind of awkward without the context, so I would recommend checking out the repository to see how it works. The repository contains the necessary code for building a question answering model using Keras and evaluating it on the Insurance QA dataset.