For my Master’s thesis, I’m working on modeling some time-dependent sequences. There is a pretty rich set of literature associated with doing this, much of it related to addressing the unique challenges posed in voice recognition.
In order to understand the details of this post, it would be good to familiarize yourself with the following concepts, which will be touched on throughout the post:
Some references which explore this topic in greater detail can be found here:
Most of this post will rely on using Theano. The general concepts can probably be ported over to another framework pretty easily (if you do this, I would be interested in hearing about it). It probably also helps to have a GPU, if you want to do more than try toy examples. You can follow the installation instructions here, although getting a GPU working with your system can be a bit painful.
The question which RBMs are often used to answer is, “What do we do when we don’t have enough labeled data?” Approaching this question from a neural network perspective would probably lead you to the autoencoder, where instead of training a model to produce some output given an input, you train a model to reproduce the input. Autoencoders are easy to think about, because they build on the knowledge that most people have about conventional neural networks. However, in practice, RBMs tend to outperform autoencoders for important tasks. In the example below, both models are trained on the MNIST dataset and learn some filters (the weights connecting the visible vector to a single hidden unit).
Boulanger-Lewandowski, Bengio, and Vincent (2012) suggest that RBMs are better than a regular discriminative neural network at modeling multi-modal data. This is evident when comparing the features learned by the RBM on the MNIST task with those learned by the autoencoder; even though the autoencoder did learn some spatially localized features, there aren’t very many multi-modal features. In contrast, the majority of the features learned by the RBM are multi-modal; they actually look like pen strokes, and preserve a lot of the correlated structure in the dataset.
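By definition, the connection weights of an RBM define a probability distribution over visible vectors $v$, obtained by summing over all hidden vectors $h$ under the joint energy $E(v, h)$ and normalizing by the partition function $Z$:
$$P(v) = \frac{1}{Z} \sum_h e^{-E(v, h)}$$
Given a piece of data $x$, parameters $\theta$ are updated to increase the probability of the training data and decrease the probability of samples $\tilde{x}$ generated by the model:
$$\frac{\partial \left(-\log P(x)\right)}{\partial \theta} = \frac{\partial F(x)}{\partial \theta} - \sum_{\tilde{x}} P(\tilde{x}) \frac{\partial F(\tilde{x})}{\partial \theta}$$
where $F$ indicates the free energy of a visible vector, or the negative log of the sum of the exponentiated negative joint energies of that visible vector and all possible hidden vectors:
$$F(x) = -\log \sum_h e^{-E(x, h)}$$
The explicit derivatives used to update the visible-hidden connections are
$$-\frac{\partial \log P(x)}{\partial W_{ij}} = \mathbb{E}_{\tilde{x}}\left[\tilde{x}_i \, P(h_j = 1 \mid \tilde{x})\right] - x_i \, P(h_j = 1 \mid x)$$
where $W_{ij}$ connects visible unit $i$ to hidden unit $j$ and the expectation is taken over samples $\tilde{x}$ from the model.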
In words, the connection between a visible unit and a hidden unit is changed so that the expected activation of that hidden unit goes down in general, but goes up when the data vector is presented, if the visible unit is on in that data vector.
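To make the weight derivative concrete, here is a minimal sketch that checks it numerically on a tiny RBM. It assumes the standard binary RBM energy E(v, h) = -b.v - c.h - v^T W h, with W shaped (n_visible, n_hidden). The function names and toy parameters are made up for illustration, and everything is computed by brute-force enumeration, which only works for a handful of units.

```python
import itertools
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def energy(v, h, W, b, c):
    """Joint energy E(v, h) = -b.v - c.h - sum_ij v_i W_ij h_j."""
    e = -sum(bi * vi for bi, vi in zip(b, v))
    e -= sum(cj * hj for cj, hj in zip(c, h))
    e -= sum(v[i] * W[i][j] * h[j] for i in range(len(v)) for j in range(len(h)))
    return e

def free_energy(v, W, b, c):
    """F(v) = -log sum_h exp(-E(v, h)), enumerating every hidden vector."""
    states = itertools.product([0, 1], repeat=len(c))
    return -math.log(sum(math.exp(-energy(v, h, W, b, c)) for h in states))

def all_visible(n):
    return list(itertools.product([0, 1], repeat=n))

def neg_log_prob(x, W, b, c):
    """-log P(x) = F(x) + log Z, where Z sums exp(-F(v)) over all v."""
    z = sum(math.exp(-free_energy(v, W, b, c)) for v in all_visible(len(b)))
    return free_energy(x, W, b, c) + math.log(z)

def hidden_prob(v, W, c, j):
    """P(h_j = 1 | v) = sigmoid(c_j + sum_i W_ij v_i)."""
    return sigmoid(c[j] + sum(W[i][j] * v[i] for i in range(len(v))))

def analytic_grad(x, W, b, c, i, j):
    """d(-log P(x)) / dW_ij = E_model[v_i P(h_j=1|v)] - x_i P(h_j=1|x)."""
    vs = all_visible(len(b))
    weights = [math.exp(-free_energy(v, W, b, c)) for v in vs]
    z = sum(weights)
    model_term = sum(w / z * v[i] * hidden_prob(v, W, c, j)
                     for v, w in zip(vs, weights))
    data_term = x[i] * hidden_prob(x, W, c, j)
    return model_term - data_term

def numeric_grad(x, W, b, c, i, j, eps=1e-5):
    """Central finite difference of -log P(x) with respect to W_ij."""
    up = [row[:] for row in W]
    down = [row[:] for row in W]
    up[i][j] += eps
    down[i][j] -= eps
    return (neg_log_prob(x, up, b, c) - neg_log_prob(x, down, b, c)) / (2 * eps)

# Arbitrary toy parameters: 3 visible units, 2 hidden units.
W = [[0.5, -0.3], [0.2, 0.8], [-0.6, 0.1]]
b = [0.1, -0.2, 0.3]
c = [0.0, 0.4]
x = (1, 0, 1)

for i in range(3):
    for j in range(2):
        print(i, j, analytic_grad(x, W, b, c, i, j), numeric_grad(x, W, b, c, i, j))
```

The two printed columns should agree to several decimal places. In a real model the sum over all visible vectors is intractable, which is why the model term is estimated from samples instead.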
Samples are “generated by the model” by repeatedly jumping back and forth from visible to hidden units. It is not evident why this gives a probability distribution. Suppose you choose a random hidden vector; given the connections between layers, that vector maps to a visible vector. The probability distribution of visible vectors is therefore generated from the hidden distribution. We would like to mold the model so that our random hidden vector will be more likely to map to a visible vector in our dataset. If that doesn’t work, we would like to tweak the model so that a random visible vector will map to a hidden vector that maps to our dataset. And so on. After training, running the model on a random probability distribution twists it around to give us a probability distribution of visible vectors that is close to our dataset.
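The back-and-forth jumping can be sketched as block Gibbs sampling. This is a minimal illustration, assuming binary units with sigmoid conditionals and a weight matrix shaped (n_visible, n_hidden). The function names and the weights are hand-picked for the example, not learned.

```python
import math
import random
from collections import Counter

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def sample_hidden(v, W, c, rng):
    """Sample each h_j from P(h_j = 1 | v) = sigmoid(c_j + sum_i W_ij v_i)."""
    return [1 if rng.random() < sigmoid(c[j] + sum(W[i][j] * v[i] for i in range(len(v)))) else 0
            for j in range(len(c))]

def sample_visible(h, W, b, rng):
    """Sample each v_i from P(v_i = 1 | h) = sigmoid(b_i + sum_j W_ij h_j)."""
    return [1 if rng.random() < sigmoid(b[i] + sum(W[i][j] * h[j] for j in range(len(h)))) else 0
            for i in range(len(b))]

def gibbs_chain(v0, W, b, c, n_steps, rng):
    """Alternate visible -> hidden -> visible for n_steps and return the final visible vector."""
    v = list(v0)
    for _ in range(n_steps):
        h = sample_hidden(v, W, c, rng)
        v = sample_visible(h, W, b, rng)
    return v

# Hand-picked weights: hidden unit 0 favours visibles 0 and 1, hidden unit 1 favours 2 and 3.
W = [[4.0, -4.0], [4.0, -4.0], [-4.0, 4.0], [-4.0, 4.0]]
b = [0.0, 0.0, 0.0, 0.0]
c = [0.0, 0.0]

rng = random.Random(0)
counts = Counter()
for _ in range(500):
    start = [rng.randint(0, 1) for _ in range(4)]  # random starting visible vector
    counts[tuple(gibbs_chain(start, W, b, c, n_steps=20, rng=rng))] += 1

for vector, n in counts.most_common(4):
    print(vector, n)
```

Whatever the starting vector, the chain ends up visiting visible vectors in proportion to the distribution the weights define, so the counts give a rough picture of which patterns the model considers likely.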
The best way to think about what an RBM is doing during learning is that it is increasing the probability of a good datapoint, then running for a bit to get a bad datapoint, and decreasing its probability. It is changing the probabilities by updating the connections so that the bad datapoint is more likely to map to the good datapoint than the other way around. So when you have a cluster of good datapoints, their probabilities will be increased together (since they are close to each other, they are unlikely to be selected as the “bad” point of another sample in the cluster), and the probability of all the points around that cluster will be decreased. This illustrates the importance of increasing the number of steps of Gibbs sampling as training goes on, in order to get out of the cluster. This also gives some intuition on why RBMs learn multi-modal representations that autoencoders can’t; RBMs find clusters of correlated points, while autoencoders only learn representations which minimize the amount of information required to represent some vectors.
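The "raise the good point, lower the bad point" picture corresponds to contrastive divergence (CD-k). Below is a rough sketch on a toy dataset with two patterns. The dataset, the learning rate, and the schedule that switches from 1 to 3 Gibbs steps halfway through are all illustrative assumptions, and the exact probability is computed by enumeration only because the model is tiny.

```python
import itertools
import math
import random

def sigmoid(a):
    # Numerically stable logistic function.
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def hidden_probs(v, W, c):
    return [sigmoid(c[j] + sum(W[i][j] * v[i] for i in range(len(v)))) for j in range(len(c))]

def visible_probs(h, W, b):
    return [sigmoid(b[i] + sum(W[i][j] * h[j] for j in range(len(h)))) for i in range(len(b))]

def sample(probs, rng):
    return [1 if rng.random() < p else 0 for p in probs]

def cd_update(x, W, b, c, k, lr, rng):
    """One CD-k step: push up the data vector x, push down the sample reached after k Gibbs steps."""
    ph_data = hidden_probs(x, W, c)
    v = list(x)
    for _ in range(k):
        h = sample(hidden_probs(v, W, c), rng)
        v = sample(visible_probs(h, W, b), rng)
    ph_model = hidden_probs(v, W, c)
    for i in range(len(b)):
        for j in range(len(c)):
            W[i][j] += lr * (x[i] * ph_data[j] - v[i] * ph_model[j])
        b[i] += lr * (x[i] - v[i])
    for j in range(len(c)):
        c[j] += lr * (ph_data[j] - ph_model[j])

def data_probability(data, W, b, c):
    """Exact total probability the model assigns to the data vectors (tiny models only)."""
    def unnormalized(v):
        # exp(-F(v)) = exp(b.v) * prod_j (1 + exp(c_j + sum_i W_ij v_i))
        value = math.exp(sum(bi * vi for bi, vi in zip(b, v)))
        for j in range(len(c)):
            value *= 1.0 + math.exp(c[j] + sum(W[i][j] * v[i] for i in range(len(v))))
        return value
    z = sum(unnormalized(v) for v in itertools.product([0, 1], repeat=len(b)))
    return sum(unnormalized(v) for v in data) / z

rng = random.Random(0)
data = [(1, 1, 0, 0), (0, 0, 1, 1)]  # two "clusters" of one point each
n_visible, n_hidden = 4, 3
W = [[rng.gauss(0.0, 0.01) for _ in range(n_hidden)] for _ in range(n_visible)]
b = [0.0] * n_visible
c = [0.0] * n_hidden

before = data_probability(data, W, b, c)
n_epochs = 2000
for epoch in range(n_epochs):
    k = 1 if epoch < n_epochs // 2 else 3  # more Gibbs steps later in training
    for x in data:
        cd_update(x, W, b, c, k, lr=0.1, rng=rng)
after = data_probability(data, W, b, c)
print("P(data) before:", before, "after:", after)
```

Before training, the two patterns share roughly 2/16 of the probability mass; after training they should hold noticeably more, with the mass taken from the vectors around them.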
As described above, there is some reason to think that an RBM model may learn higher-order correlations better than a traditional neural network. However, as they are conventionally described, they can’t model time-varying statistics very well. For many applications this presents a serious drawback. The top answer on Quora for the question Are Deep Belief Networks useful for Time Series Forecasting? is by Yoshua Bengio, who suggests looking at the work of his Ph.D. student, Nicolas Boulanger-Lewandowski, who wrote the tutorial that much of this blog post is modeled around. In particular, the paper Modeling Temporal Dependencies in High-Dimensional Sequences: Application to Polyphonic Music Generation and Transcription and its corresponding tutorial provide a good demonstration of doing almost exactly what Andrej Karpathy’s blog post does, although instead of using RNNs to continually predict the next element of a sequence, it does something a little different.
The RNN-RBM uses an RNN to generate a visible and hidden bias vector for an RBM, and then trains the RBM normally (to reduce the energy of the model when initialized with those bias vectors and the visible vector at the first time step). Then the next visible vector is fed into the RNN and RBM, the RNN generates another set of bias vectors, and the RBM reduces the energy of that new configuration. This is repeated for the whole sequence.
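The bias-generation step can be sketched as follows, using the recurrence from the Boulanger-Lewandowski et al. paper: the biases at time t are the static biases plus a linear function of the previous RNN state u, and u is then updated with a tanh layer that sees the current visible vector. The parameter names, the zero initial state, and the random untrained weights are assumptions made for this illustration.

```python
import math
import random

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def vec_add(*vectors):
    """Element-wise sum of any number of equal-length vectors."""
    return [sum(values) for values in zip(*vectors)]

def rand_matrix(n_rows, n_cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(n_cols)] for _ in range(n_rows)]

def init_params(n_visible, n_hidden, n_recurrent, rng):
    """Untrained parameters; each matrix has one row per output unit."""
    return {
        "bv": [0.0] * n_visible,    # static visible bias
        "bh": [0.0] * n_hidden,     # static hidden bias
        "bu": [0.0] * n_recurrent,  # RNN bias
        "Wuv": rand_matrix(n_visible, n_recurrent, rng),   # RNN state -> visible bias
        "Wuh": rand_matrix(n_hidden, n_recurrent, rng),    # RNN state -> hidden bias
        "Wuu": rand_matrix(n_recurrent, n_recurrent, rng), # RNN state -> RNN state
        "Wvu": rand_matrix(n_recurrent, n_visible, rng),   # visible -> RNN state
    }

def rnn_rbm_biases(sequence, p):
    """Return the (visible bias, hidden bias) pair the RBM would use at each time step."""
    u = [0.0] * len(p["bu"])  # initial RNN state, assumed zero here
    biases = []
    for v in sequence:
        # Biases for step t depend only on the RNN state from step t - 1.
        bv_t = vec_add(p["bv"], matvec(p["Wuv"], u))
        bh_t = vec_add(p["bh"], matvec(p["Wuh"], u))
        biases.append((bv_t, bh_t))
        # The RBM would be trained on v with these biases; then the RNN state advances.
        pre = vec_add(p["bu"], matvec(p["Wuu"], u), matvec(p["Wvu"], v))
        u = [math.tanh(a) for a in pre]
    return biases

rng = random.Random(0)
params = init_params(n_visible=4, n_hidden=3, n_recurrent=2, rng=rng)
sequence = [[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 0, 0]]
biases = rnn_rbm_biases(sequence, params)
for t, (bv_t, bh_t) in enumerate(biases):
    print(t, bv_t, bh_t)
```

At the first time step the RNN state is zero, so the RBM sees only its static biases; from the second step on, the biases carry information about everything seen so far. Training the RBM weights and backpropagating through the RNN is left out of this sketch.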
What exactly does this training process do? Let’s consider the application that is described in both the paper and tutorial, generating polyphonic music (polyphonic here just means there may be multiple notes at the same time step). The weight matrix of the RBM, which has dimensions <n_visible, n_hidden>, provides features that activate individual hidden units in response to a particular pattern of visible units. For music, these features are chords, which are played at each time step to generate a song; for video, these features are individual frames, which are very similar to the features learned on the MNIST dataset.
The RNN part is trained to generate biases that activate the right features of the RBM in the right order; in other words, the RNN tries to predict the next set of features given a past set. When we switch the RBM from learning a probability distribution to generating one, the RNN is used to generate biases for the RBM, defining a pattern of activating filters. The stochasticity of the RBM is what gives the model its nondeterminism.
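Generation can be sketched by combining the two pieces: at each time step the RNN state produces the biases, a short Gibbs chain in the RBM samples a visible vector, and that sample is fed back into the RNN. This is an illustrative sketch with untrained random parameters. The parameter names, the zero initial state, starting each chain from an all-zero visible vector, and k=25 Gibbs steps are assumptions rather than values taken from this post.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def rand_matrix(n_rows, n_cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(n_cols)] for _ in range(n_rows)]

def bernoulli(probs, rng):
    return [1 if rng.random() < p else 0 for p in probs]

def generate(n_steps, p, k, rng):
    """Sample a sequence of binary visible vectors from an RNN-RBM."""
    n_visible = len(p["bv"])
    W = p["W"]                            # shaped (n_visible, n_hidden)
    W_T = [list(col) for col in zip(*W)]  # shaped (n_hidden, n_visible)
    u = [0.0] * len(p["bu"])              # initial RNN state, assumed zero
    sequence = []
    for _ in range(n_steps):
        # The RNN state sets the RBM biases for this time step.
        bv_t = [b + d for b, d in zip(p["bv"], matvec(p["Wuv"], u))]
        bh_t = [b + d for b, d in zip(p["bh"], matvec(p["Wuh"], u))]
        # k steps of block Gibbs sampling; this is where the randomness comes in.
        v = [0] * n_visible
        for _ in range(k):
            h = bernoulli([sigmoid(b + a) for b, a in zip(bh_t, matvec(W_T, v))], rng)
            v = bernoulli([sigmoid(b + a) for b, a in zip(bv_t, matvec(W, h))], rng)
        sequence.append(v)
        # Feed the sampled vector back into the RNN.
        rec = matvec(p["Wuu"], u)
        inp = matvec(p["Wvu"], v)
        u = [math.tanh(b + r + i) for b, r, i in zip(p["bu"], rec, inp)]
    return sequence

rng = random.Random(0)
n_visible, n_hidden, n_recurrent = 4, 3, 2
params = {
    "W": rand_matrix(n_visible, n_hidden, rng),
    "bv": [0.0] * n_visible,
    "bh": [0.0] * n_hidden,
    "bu": [0.0] * n_recurrent,
    "Wuv": rand_matrix(n_visible, n_recurrent, rng),
    "Wuh": rand_matrix(n_hidden, n_recurrent, rng),
    "Wuu": rand_matrix(n_recurrent, n_recurrent, rng),
    "Wvu": rand_matrix(n_recurrent, n_visible, rng),
}

song = generate(n_steps=8, p=params, k=25, rng=random.Random(1))
for t, v in enumerate(song):
    print(t, v)
```

With untrained weights the output is essentially noise. With trained weights, the same loop would produce a different plausible sequence for each random seed, since the RNN part is deterministic and only the Gibbs sampling is random.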