Question Answering using Keras

An in-depth introduction to using Keras for language modeling; word embedding, recurrent and convolutional neural networks, attentional RNNs, and similarity metrics for vector embeddings.

Apr 27, 2016

Introduction

Question answering has received more focus as large search engines have basically mastered general information retrieval and are starting to cover more edge cases. Question answering happens to be one of those edge cases, because it could involve a lot of syntatic nuance that doesn’t get captured by standard information retrieval models, like LDA or LSI. Hypothetically, deep learning models would be better suited to this type of task because of their ability to capture higher-order syntax. Two papers, “Applying deep learning to answer selection: a study and an open task” (Feng et. al. 2015) and “LSTM-based deep learning models for non-factoid answer selection” (Tan et. al. 2016), are recent examples which have applied deep learning to question-answering tasks with good results.

Feng et. al. used an in-house Java framework for their work, and Tan et. al. built their model entirely from Theano. Personally, I am a lot lazier than them, and I don’t understand CNNs very well, so I would like to use an existing framework to build one of their models to see if I could get similar results. Keras is a really popular one that has support for everything we might need to put the model together.

Installing Keras

See the instructions here on how to install Keras. The simple route is to install using pip, e.g.

sudo pip install --upgrade keras

There are some important features that might not be available without the most recent version. I’m not sure if doing pip install gets the most recent version, so it might be helpful to install from binary. This is actually pretty straightforward! Just change to the directory where you want your source code to be and do:

git clone https://github.com/fchollet/keras.git
cd keras
sudo python setup.py install

One benefit of this is that if you want to add a custom layer, you can add it to the Keras installation and be able to use it across different projects. Even better, you could fork the project and clone your own fork, although this gets into areas of Git beyond my understanding.

Preliminaries

There are actually a couple language models in the Keras examples:

imdb_lstm.py: Using a LSTM recurrent neural network to do sentiment analysis on the IMDB dataset
imdb_cnn_lstm.py The same task, but this time using a CNN layer beneath the LSTM layer
babi_rnn.py: Recurrent neural networks for modeling Facebook’s bAbi dataset, “a mixture of 20 tasks for testing text understanding and reasoning”

These are pretty interesting to play around with. It is really cool how easy it is to get one of these set up! With Keras, a high-level model design can be quickly implemented.

Word Embeddings

Ok! Let’s dive in. The first challenge that you might think of when designing a language model is what the units of the language might be. A reasonable dataset might have around 20000 distinct words, after lemmatizing them. If the average sentence is 40 words long, then you’re left with a 20000 x 40 matrix just to represent one sentence, which is 3.2 megabytes if each word is represented in 32 bits. This obviously doesn’t work, so the first step in developing a good language model is to figure out how to reduce the number of dimensions required to represent a word.

One popular method of doing this is using word2vec. word2vec is a way of embedding words in a vector space so that words that are semantically similar are near each other. There are some interesting consequences of doing this, like being able to do word addition and subtraction:

king - man + woman = queen

In Keras, this is available as an Embedding layer. This layer takes as input a (n_batches, sentence_length) dimensional matrix of integers representing each word in the corpus, and outputs a (n_batches, sentence_length, n_embedding_dims) dimensional matrix, where the last dimension is the word embedding.

There are two advantages to this. The first is space: Instead of 3.2 megabytes, a 40 word sentence embedded in 100 dimensions would only take 16 kilobytes, which is much more reasonable. More importantly, word embeddings give the model a hint at the meaning of each word, so it will converge more quickly. There are significantly fewer parameters which have to be jostled around, and parameters are sort of tied together in a sensible way so that they jostle in the right direction.

Here’s how you would go about writing something like this:

from keras.layers import Input, Embedding

input_sentence = Input(shape=(sentence_maxlen,), dtype='int32')
embedding = Embedding(n_words, n_embed_dims)(input_sentence)

Let’s try this out! We can train a recurrent neural network to predict some dummy data and examine the embedding layer for each vector. This model takes a sentence like “sam is red” or “sarah not green” and predicts what color the person is. It is a very simple example, but it will illustrate what the Embedding layer is doing, and also illustrate how we can turn a bunch of sentences into vectors of indices by building a dictionary.

import itertools
import numpy as np

sentences = '''
sam is red
hannah not red
hannah is green
bob is green
bob not red
sam not green
sarah is red
sarah not green'''.strip().split('\n')
is_green = np.asarray([[0, 1, 1, 1, 1, 0, 0, 0]], dtype='int32').T

lemma = lambda x: x.strip().lower().split(' ')
sentences_lemmatized = [lemma(sentence) for sentence in sentences]
words = set(itertools.chain(*sentences_lemmatized))
# set(['boy', 'fed', 'ate', 'cat', 'kicked', 'hat'])

# dictionaries for converting words to integers and vice versa
word2idx = dict((v, i) for i, v in enumerate(words))
idx2word = list(words)

# convert the sentences a numpy array
to_idx = lambda x: [word2idx[word] for word in x]
sentences_idx = [to_idx(sentence) for sentence in sentences_lemmatized]
sentences_array = np.asarray(sentences_idx, dtype='int32')

# parameters for the model
sentence_maxlen = 3
n_words = len(words)
n_embed_dims = 3

# put together a model to predict
from keras.layers import Input, Embedding, merge, Flatten, SimpleRNN
from keras.models import Model

input_sentence = Input(shape=(sentence_maxlen,), dtype='int32')
input_embedding = Embedding(n_words, n_embed_dims)(input_sentence)
color_prediction = SimpleRNN(1)(input_embedding)

predict_green = Model(input=[input_sentence], output=[color_prediction])
predict_green.compile(optimizer='sgd', loss='binary_crossentropy')

# fit the model to predict what color each person is
predict_green.fit([sentences_array], [is_green], nb_epoch=5000, verbose=1)
embeddings = predict_green.layers[1].W.get_value()

# print out the embedding vector associated with each word
for i in range(n_words):
	print('{}: {}'.format(idx2word[i], embeddings[i]))

The embedding layer embeds the words into 3 dimensions. A sample of the vectors it produces is seen below. As predicted, the model learns useful word embeddings.

sarah:	[-0.5835458  -0.2772688   0.01127077]
sam:	[-0.57449967 -0.26132962  0.04002968]

bob:	[ 1.10480607  0.97720605  0.10953052]
hannah:	[ 1.12466967  0.95199704  0.13520472]

not:	[-0.17611612 -0.2958962  -0.06028322]
is:	[-0.10752882 -0.34842652 -0.06909169]

red:	[-0.10381682 -0.31055665 -0.0975003 ]
green:	[-0.05930901 -0.33241618 -0.06948926]

These embeddings are visualized in the chart below.

Each category is grouped in the 3-dimensional vector space. The network learned each of these categories from how each word was used; Sarah and Sam are the red people, while Bob and Hannah are the green people. However, it did not differentiate well between not, is, red, and green, because those weren’t immediately obvious for the decision task.

Recurrent Neural Networks

As the Keras examples illustrate, there are different philosophies on deep language modeling. Feng et. al. did a bunch of benchmarks with convolutional networks, and ended up with some impressive results. Tan et. al. used recurrent networks with some different parameters. I’ll focus on recurrent neural networks first (What do pirates call neural networks? ArrrghNNs). I’ll assume some familiarity with both recurrent and convolutional neural networks. Andrej Karpathy’s blog discusses recurrent neural networks in detail. Here is an image from that post which illustrates the core concept:

Vanilla

The basic RNN architecture is essentially a feed-forward neural network that is stretched out over a bunch of time steps and has it’s intermediate output added to the next input step. This idea can be expressed as an update equation for each input step:

new_hidden_state = tanh(dot(input_vector, W) + dot(prev_hidden, U) + b)

Note that dot indicates vector-matrix multiplication. Multiplying a vector of dimensions <m> by a matrix of dimensions <m, n> can be done with dot(<m>, <m, n>) and yields a vector of dimensions <n>. This is consistent with its usage in Theano and Keras. In the update equation, we multiply each input_vector by our input weights W, multiply the prev_hidden vector by our hidden weights U, and add a bias, before passing the sum to the activation function sigmoid. To get the many to one behavior, we can grab the last hidden state and use that as our output. To get the one to many behavior, we can pass one input vector and then just pass a bunch of zero vectors to get as many hidden states as we want.

LSTM

If the RNN gets really long, then we run into a lot of difficulty training the model. The effect of something a early in the sequence on the end result is very small relative to later components, so it is hard to use that information in updating the weights. To solve this, several methods have been proposed, and two have been implemented in Keras. The first is the Long Short-Term Memory (LSTM) unit, which was proposed by Hochreiter and Schmidhuber 1997. This model uses a second hidden state which stores information from further back in the model, allowing that information to have a stronger effect on the end result. The update equations for this model are:

input_gate = tanh(dot(input_vector, W_input) + dot(prev_hidden, U_input) + b_input)
forget_gate = tanh(dot(input_vector, W_forget) + dot(prev_hidden, U_forget) + b_forget)
output_gate = tanh(dot(input_vector, W_output) + dot(prev_hidden, U_output) + b_output)

candidate_state = tanh(dot(x, W_hidden) + dot(prev_hidden, U_hidden) + b_hidden)
memory_unit = prev_candidate_state * forget_gate + candidate_state * input_gate

new_hidden_state = tanh(memory_unit) * output_gate

Note that * indicates element-wise multiplication. This is consistent with its usage in Theano and Keras. First, there are a bunch more parameters to train; not only do we have weights for the input-to-hidden and hidden-to-hidden matrices, but also we have an accompanying candidate_state. The candidate state is like a second hidden state that transfers information to and from the hidden state. It is like a safety deposit box for putting things in and taking things out.

GRU

The second model is the Gated Recurrent Unit (GRU), which was proposed by Cho et. al. 2014. The equations for this model are as follows:

update_gate = tanh(dot(input_vector, W_update) + dot(prev_hidden, U_update) + b_update)
reset_gate = tanh(dot(input_vector, W_reset) + dot(prev_hidden, U_reset) + b_reset)

reset_hidden = prev_hidden * reset_gate
temp_state = tanh(dot(input_vector, W_hidden) + dot(reset_hidden, U_reset) + b_hidden)
new_hidden_state = (1 - update_gate) * temp_state + update_gate * prev_hidden

In this model, there is an update_gate which controls how much of the previous hidden state to carry over to the new hidden state and a reset_gate which controls how much the previous hidden state changes. This allows potentially long-term dependencies to be propagated through the network.

My implementations of these models in Theano, as well as optimizers for training them, can be found in this Github repository.

RNN Example: Predicting Dummy Data

Now that we’ve seen the equations, let’s see how Keras implementations compare on some sample data.

import numpy as np
rng = np.random.RandomState(42)

# parameters
input_dims, output_dims = 10, 1
sequence_length = 20
n_test = 10

# generate some random data to train on
get_rand = lambda *shape: np.asarray(rng.rand(*shape) > 0.5, dtype='float32')
X_data = np.asarray([get_rand(sequence_length, input_dims) for _ in range(n_test)])
y_data = np.asarray([get_rand(output_dims,) for _ in range(n_test)])

# put together rnn models
from keras.layers import Input
from keras.layers.recurrent import SimpleRNN, LSTM, GRU
from keras.models import Model
import theano

input_sequence = Input(shape=(sequence_length, input_dims,), dtype='float32')

vanilla = SimpleRNN(output_dims, return_sequences=False)(input_sequence)
lstm = LSTM(output_dims, return_sequences=False)(input_sequence)
gru = GRU(output_dims, return_sequences=False)(input_sequence)
rnns = [vanilla, lstm, gru]

# train the models
for rnn in rnns:
    model = Model(input=[input_sequence], output=rnn)
    model.compile(optimizer='rmsprop', loss='binary_crossentropy')
    model.fit([X_data], [y_data], nb_epoch=1000)

The results will vary from trial to trial. RNNs are exceptionally difficult to train. However, in general, a model that can take advantage of long-term dependencies will have a much easier time seeing how two sequences are different.

Attentional RNNs

It isn’t strictly important to understand the RNN part before looking at this part, but it will help everything make more sense. The next component of language modeling, which was the focus of the Tan paper, is the Attentional RNN. This essential components of model are described in “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention” (Xu et. al. 2016). I’ll try to hash it out in this blog post a little bit and look at how to build it in Keras.

Lambda Layer

First, let’s look at how to make a custom layer in Keras. There are a couple options. One is the Lambda layer, which does a specified operation. An example of this could be a layer that doubles the value it is passed:

from keras.layers import Lambda, Input
from keras.models import Model

import numpy as np

input = Input(shape=(1,), dtype='int32')
double = Lambda(lambda x: 2 * x)(input)

model = Model(input=[input], output=[double])
model.compile(optimizer='sgd', loss='mse')

data = np.arange(5)
print(model.predict(data))

This doubles our input data. Note that there are no trainable weights anywhere in this model, so it couldn’t actually learn anything. What if we wanted to multiply our input vector by some trainable scalar that predicts the output vector? In this case, we will have to write our own layer.

Building a Custom Layer Example

Let’s jump right in and write a layer that learns to multiply an input by a scalar value and produce an output.

from keras.engine import Layer
from keras import initializations

# our layer will take input shape (nb_samples, 1)
class MultiplicationLayer(Layer):
	def __init__(self, **kwargs):
		self.init = initializations.get('glorot_uniform')
		super(MultiplicationLayer, self).__init__(**kwargs)

	def build(self, input_shape):
		# each sample should be a scalar
		assert len(input_shape) == 2 and input_shape[1] == 1
		self.multiplicand = self.init(input_shape[1:], name='multiplicand')

		# let Keras know that we want to train the multiplicand
		self.trainable_weights = [self.multiplicand]

	def get_output_shape_for(self, input_shape):
		# we're doing a scalar multiply, so we don't change the input shape
		assert input_shape and len(input_shape) == 2 and input_shape[1] == 1
		return input_shape

	def call(self, x, mask=None):
		# this is called during MultiplicationLayer()(input)
		return x * self.multiplicand

# test the model
from keras.layers import Input
from keras.models import Model

# input is a single scalar
input = Input(shape=(1,), dtype='int32')
multiply = MultiplicationLayer()(input)

model = Model(input=[input], output=[multiply])
model.compile(optimizer='sgd', loss='mse')

import numpy as np
input_data = np.arange(10)
output_data = 3 * input_data

model.fit([input_data], [output_data], nb_epoch=10)
print(model.layers[1].multiplicand.get_value())
# should be close to 3

There we go! We have a complete model. We could change it around to make it fancier, like adding a broadcastable dimension to the multiplicand so that the layer could be passed a vector of numbers instead of just a scalar. Let’s look closer at how we built the multiplication layer:

def __init__(self, **kwargs):
	self.init = initializations.get('glorot_uniform')
	super(MultiplicationLayer, self).__init__(**kwargs)

First, we make a weight initializer that we can use later to get weights. glorot_uniform is just a particular way to initialize weights. We then call the __init__ method of the super class.

def build(self, input_shape):
	# each sample should be a scalar
	assert len(input_shape) == 2 and input_shape[1] == 1
	self.multiplicand = self.init(input_shape[1:], name='multiplicand')

	# let Keras know that we want to train the multiplicand
	self.trainable_weights = [self.multiplicand]

This method specifies the components of the model, for when we build it. The only component we need is the scalar to multiply by, so we initialize a new tensor by calling self.init, the initializer we created in the __init__ method.

def get_output_shape_for(self, input_shape):
	# we're doing a scalar multiply, so we don't change the input shape
	assert input_shape and len(input_shape) == 2 and input_shape[1] == 1
	return input_shape

This method tells the builder what the output shape of this layer will be given its input shape. Since our layer just does a scalar multiply, it doesn’t change the output shape from the input shape. For example, scalar multiplying the input [1, 2, 3] of dimensions <3, 1> by a scalar factor of 2 gives the output [2, 4, 6], which has the same dimensions <3, 1>.

def call(self, x, mask=None):
	# this is called during MultiplicationLayer()(input)
	return x * self.multiplicand

This is the bread and butter of the the layer, where we actually perform the operation. We specify that the output of this layer is the input x matrix multiplied by our multiplicand tensor. Note that this method takes a while to run, because whatever backend we use (for example, Theano) has to put together the tensors in the right way. To make your layer run quickly, it is good practice to add assert checks in the build and get_output_shape_for methods.

Characterizing the Attentional LSTM

Now that we’ve got an idea of how to build a custom layer, let’s look at the specifications for an attentional LSTM. Following Tan et. al., we can augment our LSTM equations from earlier to include an attentional component. The attentional component requires some attention vector attention_vec.

input_gate = tanh(dot(input_vector, W_input) + dot(prev_hidden, U_input) + b_input)
forget_gate = tanh(dot(input_vector, W_forget) + dot(prev_hidden, U_forget) + b_forget)
output_gate = tanh(dot(input_vector, W_output) + dot(prev_hidden, U_output) + b_output)

candidate_state = tanh(dot(input_vector, W_hidden) + dot(prev_hidden, U_hidden) + b_hidden)
memory_unit = prev_candidate_state * forget_gate + candidate_state * input_gate

new_hidden_state = tanh(memory_unit) * output_gate

attention_state = tanh(dot(attention_vec, W_attn) + dot(new_hidden_state, U_attn))
attention_param = exp(dot(attention_state, W_param))
new_hidden_state = new_hidden_state * attention_param

The new equations are the last three, which correspond to equations 9, 10 and 11 from the paper (approximately reproduced below, using different notation).

$\begin{aligned} {\bf s}_{a}(t) & = \tanh({\bf h}(t) {\bf W}_{a} + {\bf v}_a {\bf U}_{a})\\ {\bf p}_{a}(t) & = \exp({\bf s}_{a}(t) {\bf W}_{p})\\ {\bf \tilde{h}}(t) & = {\bf h}(t) * {\bf p}_{a} (t) \end{aligned}$

The attention parameter is a function of the current hidden state and the attention vector mixed together. Each is first put through a matrix, summed and put through an activation function to get an attention state, which is then put through another transformation to get an attention parameter. The attention parameter then re-updates the hidden state. Supposedly, this is conceptually similar to TF-IDF weighting, where the model learns to weight particular states at particular times.

Building an Attentional LSTM Example

Now that we have all the components for an Attentional LSTM, let’s see the code for how we could implement this. The attentional component can be tacked onto the LSTM code that already exists.

from keras import backend as K
from keras.layers import LSTM

class AttentionLSTM(LSTM):
    def __init__(self, output_dim, attention_vec, **kwargs):
        self.attention_vec = attention_vec
        super(AttentionLSTM, self).__init__(output_dim, **kwargs)

    def build(self, input_shape):
        super(AttentionLSTM, self).build(input_shape)

        assert hasattr(self.attention_vec, '_keras_shape')
        attention_dim = self.attention_vec._keras_shape[1]

        self.U_a = self.inner_init((self.output_dim, self.output_dim),
                                   name='{}_U_a'.format(self.name))
        self.b_a = K.zeros((self.output_dim,), name='{}_b_a'.format(self.name))

        self.U_m = self.inner_init((attention_dim, self.output_dim),
                                   name='{}_U_m'.format(self.name))
        self.b_m = K.zeros((self.output_dim,), name='{}_b_m'.format(self.name))

        self.U_s = self.inner_init((self.output_dim, self.output_dim),
                                   name='{}_U_s'.format(self.name))
        self.b_s = K.zeros((self.output_dim,), name='{}_b_s'.format(self.name))

        self.trainable_weights += [self.U_a, self.U_m, self.U_s,
                                   self.b_a, self.b_m, self.b_s]

        if self.initial_weights is not None:
            self.set_weights(self.initial_weights)
            del self.initial_weights

    def step(self, x, states):
        h, [h, c] = super(AttentionLSTM, self).step(x, states)
        attention = states[4]

        m = K.tanh(K.dot(h, self.U_a) + attention + self.b_a)
        s = K.exp(K.dot(m, self.U_s) + self.b_s)
        h = h * s

        return h, [h, c]

    def get_constants(self, x):
        constants = super(AttentionLSTM, self).get_constants(x)
        constants.append(K.dot(self.attention_vec, self.U_m) + self.b_m)
        return constants

Let’s look at what each function is doing individually. Note that this builds heavily upon the already-existing LSTM implementation.

from keras import backend as K
from keras.layers import LSTM

We will create a subclass (does python even do subclasses?) of the LSTM implementation that Keras already provides. The Keras backend is either Theano or Tensorflow, depending on the settings specified in ~/.keras/keras.json (the default is Theano). This backend lets us use Theano-type functions such as K.zeros, which specifies a matrix of zeros, to initialize our model.

def __init__(self, output_dim, attention_vec, **kwargs):
    self.attention_vec = attention_vec
    super(AttentionLSTM, self).__init__(output_dim, **kwargs)

We initialize the layer by passing it the out number of hidden layers output_dim and the layer to use as the attention vector attention_vec. The __init__ function is identical to the __init__ function for the LSTM layer except for the attention vector, so we just reuse it here.

def build(self, input_shape):

I won’t reproduce everything here, but essentially this method initializes all of the weight matrices we need for the attentional component, after calling the LSTM.build method to initialize the LSTM weight matrices.

def step(self, x, states):
    h, [h, c] = super(AttentionLSTM, self).step(x, states)
    attention = states[4]

    m = K.tanh(K.dot(h, self.U_a) + attention + self.b_a)
    s = K.exp(K.dot(m, self.U_s) + self.b_s)
    h = h * s

    return h, [h, c]

This method is used by the RNN superclass, and tells the function what to do on each timestep. It mirrors the equations given earlier, and adds the attentional component on top of the LSTM equations.

def get_constants(self, x):
    constants = super(AttentionLSTM, self).get_constants(x)
    constants.append(K.dot(self.attention_vec, self.U_m) + self.b_m)
    return constants

This method is used by the LSTM superclass to define components outside of the step function, so that they don’t need to be recomputed every time step. In our case, the attentional vector doesn’t need to be recomputed every time step, so we define it as a constant (we then grab it in the step function using attention = states[4]).

Convolutional Neural Networks

Convolutional networks are better explained elsewhere, and all of the functions required for making a good CNN language model are already supported in Keras. Basically, with language modeling, a common strategy is to apply a ton (on the order of 1000) convolutional filters to the embedding layer followed by a max-1 pooling function and call it a day. It actually works stupidly well for question answering (see Feng et. al. for benchmarks). This approach can be done fairly easily in Keras. One thing that may not be intuitive, however, is how to combine several filter lengths. This can be done as follows:

from keras.layers import Convolution1D

cnns = [Convolution1D(filter_length=filt, nb_filter=1000, border_mode='same')
        for filt in [2, 3, 5, 7]]
question = merge([cnn(question) for cnn in cnns], mode='concat')
answer = merge([cnn(answer) for cnn in cnns], mode='concat')

Similarity Metrics

The basic idea with question answering is to embed questions and answers as vectors, so that the question vector is close in vector space to the answer vector. For example, with the Attentional RNN, we take the question vector and use it as an input for generating the answer vector. A common approach is to then rank answer vectors according to their cosine similarities with the question vector. This doesn’t follow the conventional neural network architecture, and takes some manipulation to achieve in Keras. To use equations, what we would like to do is:

best answer = argmax(cos(question, answers))

Training is generally done by minimizing hinge loss. In this case, we want the cosine similarity for the correct answer to go up, and the cosine similarity for an incorrect answer to go down. The loss function can be formulated as:

loss = max(0, constant margin - cos(question, good answer) + cos(question, bad answer))

Note that for some implementations, having a loss of zero can be troublesome, so a small value like 1e-6 is preferable instead. The loss is zero when the difference between the cosine similarities of the good and bad answers is greater than the constant margin we defined. In practice, the margins generally range from 0.001 to 0.2. If we want to use something besides cosine similarity, we can reformulate this as

loss = max(0, constant margin - sim(question, good answer) + sim(question, bad answer))

where sim is our similarity metric. Hinge loss works well for this application, as opposed to something like mean squared error, because we don’t want our question vectors to be orthogonal with the bad answer vectors, we just want the bad answer vectors to be a good distance away.

Cosine Similarity Example: Rotational Matrix

First, let’s look at how to do cosine similarity within the constraints of Keras. Fortunately, Keras has an implementation of cosine similarity, as a mode argument to the merge layer. This is done with:

from keras.layers import merge
cosine_sim = merge([a, b], mode='cos', dot_axes=-1)

If we pass it two inputs of dimensions (a, b, c), it will calculate the cosine simliarity of the c dimension (specified using dot_axes) and give an output of dimensions (a, b). However, because we might eventually want to implement other types of similarities besides cosine similarity, let’s look at how this can be done by passing a lambda function to merge.

def similarity(x):
    return (x[0] * x[1]).sum() / ((x[0] * x[0]).sum() * (x[1] * x[1]).sum())
cosine_sim = merge([a, b], mode=similarity, output_shape=lambda x: x[0])

We define a function similarity which we will use to compute the similarity of the inputs passed to the merge layer. Note that when we do this, we also have to pass an output_shape which tells Keras what shape the output will be after we do this operation (hopefully in the future this shape will be inferred, but it is still an open issue in the Github group).

A cool example might be to see if we can learn a rotation matrix. A rotation matrix in Euclidean space is a matrix which rotates a vector by a certain angle around the origin. It is defined as a function of theta, the angle to rotate by:

$R = \begin{bmatrix}\cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta)\end{bmatrix}$

We can learn this matrix really simply with the right dataset and one dense layer, that is:

from keras.layers import Input, Dense
from keras.models import Model

a = Input(shape=(2,), name='a')
b = Input(shape=(2,), name='b')

a_rotated = Dense(2, activation='linear')(a)

model = Model(input=[a], output=[a_rotated])
model.compile(optimizer='sgd', loss='mse')

import numpy as np

a_data = np.asarray([[0, 1], [1, 0], [0, -1], [-1, 0]])
b_data = np.asarray([[1, 0], [0, -1], [-1, 0], [0, 1]])

model.fit([a_data], [b_data], nb_epoch=1000)
print(model.layers[1].W.get_value())

A Dense layer with linear activation is the exact same as a matrix multiplication. We give it two input dimensions and two output dimensions. After training this model, the printed weight matrix is:

[[-0.00603954, -0.99370766]
 [ 0.99173903,  0.0078686 ]]

which is close to the rotation matrix for an angle of 90 degrees. Let’s try this again, but with cosine similarity. This will require some manipulation. In the previous example, we had a clearly defined input, a, and output, b, and our model was designed to perform a transformation on a to predict b. In this example, we have two inputs, a and b, and we will perform a transformation on a to make it close to b. As an output, we get the similarity of the two vectors, so we need to train our model to make this similarity high by providing it a bunch of 1’s as the target values, since a similarity of 1 indicates perfect similarity.

from keras.layers import Input, Dense, merge
from keras.models import Model
from keras import backend as K

a = Input(shape=(2,), name='a')
b = Input(shape=(2,), name='b')

a_rotated = Dense(2, activation='linear')(a)

def cosine(x):
    axis = len(x[0]._keras_shape)-1
    dot = lambda a, b: K.batch_dot(a, b, axes=axis)
    return dot(x[0], x[1]) / K.sqrt(dot(x[0], x[0]) * dot(x[1], x[1]))

cosine_sim = merge([a_rotated, b], mode=cosine, output_shape=lambda x: x[:-1])

model = Model(input=[a, b], output=[cosine_sim])
model.compile(optimizer='sgd', loss='mse')

import numpy as np

a_data = np.asarray([[0, 1], [1, 0], [0, -1], [-1, 0]])
b_data = np.asarray([[1, 0], [0, -1], [-1,  0], [0, 1]])
targets = np.asarray([1, 1, 1, 1])

model.fit([a_data, b_data], [targets], nb_epoch=1000)
print(model.layers[2].W.get_value())

Running this, we end up with a weight matrix that looks like

[[-0.16537911 -1.26961863]
 [ 1.06261277  0.1144496 ]]

This looks a bit like cosine similarity, but the scaling seems off. Cosine similarity is ambivalent about the magnitude of vectors, so the weight matrix ends up not being a rotation matrix so much as a rotation-and-skew matrix. It is interesting to think about why this network learned this particular matrix.

Other Similarity Metrics

Feng et. al. provided a list of similarities along with their benchmarks for a CNN architecture. Some of these similarities, along with their implementations in Keras, are reproduced below. They rely on these helper functions:

from keras import backend as K
axis = lambda a: len(a._keras_shape) - 1
dot = lambda a, b: K.batch_dot(a, b, axes=axis(a))
l2_norm = lambda a, b: K.sqrt(((a - b) ** 2).sum())

If the function requires extra parameters, they are usually supplied as arguments in a dictionary.

Cosine

$\frac{x y^T}{||x|| ||y||}$

def cosine(x):
    return dot(x[0], x[1]) / K.sqrt(dot(x[0], x[0]) * dot(x[1], x[1]))

Polynomial

$(\gamma x y^T + c)^d$

def polynomial(x):
    return (params['gamma'] * dot(x[0], x[1]) + params['c']) ** params['d']

Values for gamma used in the paper were [0.5, 1.0, 1.5]. The value for c was usually 1. Values for d were [2, 3].

Sigmoid

$\tanh(\gamma x y^T + c)$

def sigmoid(x):
    return K.tanh(params['gamma'] * dot(x[0], x[1]) + params['c'])

Values for gamma used in the paper were [0.5, 1.0, 1.5], and c was 1.

RBF

RBF stands for radial basis function.

$\exp(-\gamma ||x - y||^2)$

def rbf(x):
    return K.exp(-1 * params['gamma'] * l2_norm(x[0], x[1]) ** 2)

Values for gamma used in the paper were [0.5, 1.0, 1.5].

Euclidean

$\frac{1}{1 + ||x - y||}$

def euclidean(x):
    return 1 / (1 + l2_norm(x[0], x[1]))

Exponential

$\exp(-\gamma ||x - y||)$

def exponential(x):
    return K.exp(-1 * params['gamma'] * l2_norm(x[0], x[1]))

GESD

This was a custom metric developed by the authors which stands for Geometric mean of Euclidean and Sigmoid Dot product. It performed well for their benchmarks.

$\frac{1}{1 + ||x - y||} * \frac{1}{1 + \exp(-\gamma (x y^T + c))}$

def gesd(x):
    euclidean = 1 / (1 + l2_norm(x[0], x[1]))
    sigmoid = 1 / (1 + K.exp(-1 * params['gamma'] * (dot(x[0], x[1]) + params['c'])))
    return euclidean * sigmoid

Values for gamma used were [0.5, 1.0, 1.5] and c was 1.

AESD

This was a custom metric developed by the authors which stands for Arithmetic mean of Euclidean and Sigmoid Dot product. It performed well for their benchmarks.

$\frac{0.5}{1 + ||x - y||} + \frac{0.5}{1 + \exp(-\gamma (x y^T + c))}$

def gesd(x):
    euclidean = 0.5 / (1 + l2_norm(x[0], x[1]))
    sigmoid = 0.5 / (1 + K.exp(-1 * params['gamma'] * (dot(x[0], x[1]) + params['c'])))
    return euclidean + sigmoid

Values for gamma used were [0.5, 1.0, 1.5] and c was 1.

InsuranceQA Model Example

For demonstrating the results, I used this model.

This model achieved relatively good marks for Top-1 Accuracy (how often did the model rank a ground truth the highest out of 500 results) and Mean Reciprocal Rank (MRR), which is defined as

$MRR = \frac{1}{|Q|} \sum_{i=1}^{|Q|}{\frac{1}{rank_i}}$

The results after learning the training set are summaraized in the following table.

	Top-1 Accuracy	Mean Reciprocal Rank
Test 1	0.4933	0.6189
Test 2	0.4606	0.5968
Dev	0.4700	0.6088

For comparison, the best model from Feng et. al. achieved an accuracy of 0.653 on Test 1, and the model in Tan et. al. achieved an accuracy of 0.681 on Test 1. This model isn’t exceptional, but it works pretty well for how simple it is. It outperforms the baseline bag of words model, and performs on par with the Metzler-Bendersky IR model introduced in “Learning concept importance using a weighted dependence model” (Bendersky and Metzler, 2010). Here’s how we build it in Keras:

def build():
    input = Input(shape=(sentence_length,))

    # embedding
    embedding = Embedding(n_words, n_embed_dims)
    input_embedding = embedding(input)

    # dropout
    dropout = Dropout(0.5)
    input_dropout = dropout(input_embedding)

    # maxpooling
    maxpool = Lambda(lambda x: K.max(x, axis=1, keepdims=False),
                     output_shape=lambda x: (x[0], x[2]))
    input_pool = maxpool(input_dropout)

    # activation
    activation = Activation('tanh')
    output = activation(input_pool)

    model = Model(input=[input], output=[output])

model = build()
question, answer = Input(shape=(q_len,)), Input(shape=(a_len,))

question_output = model(question)
answer_output = model(answer)

similarity = merge([question_output, answer_output], mode='cos', dot_axes=-1)

model = Merge([question_output, answer_output], [similarity])

The code is kind of awkward without the context, so I would recommend checking out the repository to see how it works. The repository contains the necessary code for building a question answering model using Keras and evaluating it on the Insurance QA dataset.