Residual networks are one of the hot new ways of thinking about neural networks, ever since they were used to win the ImageNet competition in 2015. ResNets were originally introduced in the paper Deep Residual Learning for Image Recognition by He et al. The remarkable thing about the ResNet architecture is just how crazy deep it is. For comparison, the Oxford Visual Geometry Group released Very Deep Convolutional Networks for Large-Scale Image Recognition, which even has “Very Deep” in the name, and it had either 16 or 19 layers. ResNet architectures were demonstrated with 50, 101, and even 152 layers. More surprising than a nearly tenfold increase in layer count over another architecture, the deeper ResNet got, the more its performance grew. It did very well in the 2015 ImageNet competition, and seems to be the best single model out there for object recognition, with most of the 2016 ImageNet models being ensembles of other models.
First, here is a graphic illustrating the concept of a “residual network”.
Note that this “network” is really just a building block. In the ResNet architecture, a whole bunch of these were stuck together.
Mathematically, we can express this network with input $x$, some transformation $F$ (the complicated network in the diagram), some merge function $M$, some activation function $\sigma$ and output $y$ as

$$y = \sigma(M(x, F(x)))$$

Characteristically, $F$ is a convolution or series of convolutions, $M$ is simply addition, and $\sigma$ is the rectified linear unit activation function, giving us the equation

$$y = \mathrm{ReLU}(x + F(x))$$
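To make the building block concrete, here is a minimal NumPy sketch of a single residual block: the output is the ReLU of the input plus a transformation of the input. The transformation here is an assumption made for brevity (two small dense layers standing in for the convolutions), and the names `transform` and `residual_block` are made up for this illustration.

```python
import numpy as np

def relu(z):
    """Rectified linear unit: max(0, z), elementwise."""
    return np.maximum(z, 0.0)

def transform(x, w1, w2):
    """Hypothetical stand-in for the block's transformation.
    Two dense layers are used here instead of convolutions."""
    return relu(x @ w1) @ w2

def residual_block(x, w1, w2):
    """Merge by addition, then apply the activation: relu(x + F(x))."""
    return relu(x + transform(x, w1, w2))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))            # batch of 4 inputs, 8 features each
w1 = rng.normal(size=(8, 16)) * 0.1
w2 = rng.normal(size=(16, 8)) * 0.1    # maps back to 8 features so x + F(x) is valid
y = residual_block(x, w1, w2)
```

Note that the transformation has to return something with the same shape as its input, otherwise the addition is not defined.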
The residual is the network error that we want to correct at a particular layer.
When people talk about “ResNet” as an abbreviation for “Residual Network”, it is usually to refer to convolutional neural networks. But in principle, the idea behind them can be applied to any type of neural network. For example, two researchers at Google recently used residuals applied to a Gated Recurrent Unit for image compression.
The central idea of the ResNet paper is that it is a good idea, when adding more layers to a network, to keep the representation more or less the same. In other words, extra layers shouldn’t warp the representation very much. Suppose a shallow network perfectly represents the data, and more layers are added. Since the shallow network works perfectly, the best thing for the new layers to do would be to learn the identity function. If the shallow network made a few errors, we would want the new layers to learn to correct the errors, but otherwise not affect the output very much.
Phrased another way, it is an easier learning problem if the network learns to correct the residual error. Once a good representation is learned, the network shouldn’t mess with it too much. The other side to this problem is that we want the shallow network to be able to learn a good solution without its gradient signal having to pass through all of the higher-level layers first.
Phrased yet another way, the residual part should ensure that the representation learned is at least as good as whatever we can get without the residual part.
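The "learn the identity" argument can be checked with a toy example. The sketch below stacks ten extra layers on top of a representation `h` from a hypothetical shallow network, with all of the new weights set to zero. This is an illustration under simplifying assumptions (dense layers, no biases, a non-negative representation), not a statement about how real ResNets are initialized.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def plain_layer(h, w):
    """Ordinary layer: the output depends on h only through the weights."""
    return relu(h @ w)

def residual_layer(h, w):
    """Residual layer: the weights only have to model a correction to h."""
    return relu(h + h @ w)

rng = np.random.default_rng(0)
h = relu(rng.normal(size=(5, 6)))   # non-negative representation from the shallow network
w_zero = np.zeros((6, 6))           # "do nothing" weights for the new layers

plain_out = h
resid_out = h
for _ in range(10):
    plain_out = plain_layer(plain_out, w_zero)
    resid_out = residual_layer(resid_out, w_zero)
```

With zero weights the residual stack hands `h` through untouched, while the plain stack wipes it out. For a plain layer to leave `h` alone it would have to learn an identity weight matrix, which is a much more specific target than "weights near zero".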
This idea has been phrased differently as information flow, and shows up in LSTM and GRU networks and Highway networks. The key difference between residual and highway networks is the absence of gating. In a highway network, the merge function $M$, which for the residual network was simply addition, would instead be expressed as

$$M(x, F(x)) = T(x) \odot F(x) + (1 - T(x)) \odot x$$

where $x$ is the block input, $F(x)$ is the output of the transformation, and $T$ is a gating function dependent on $x$. Depending on how the problem is formulated, the gating function can significantly increase the number of parameters. Aside from that, this formulation might get in the way of the idea discussed earlier; we want the lower layers to learn a near-perfect representation, so we should avoid modifying this representation at all in upper layers.
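As a rough comparison of the two merge functions, the sketch below implements plain addition next to a highway-style gated merge. The gate is assumed to be a sigmoid of an affine function of the input, which is one common formulation; `w_gate` and `b_gate` are hypothetical names for its parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def residual_merge(x, fx):
    """Residual network: merge by plain addition, no parameters."""
    return x + fx

def highway_merge(x, fx, w_gate, b_gate):
    """Highway network: a learned gate blends the transformation output with x."""
    t = sigmoid(x @ w_gate + b_gate)   # gate values in (0, 1), dependent on x
    return t * fx + (1.0 - t) * x

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(3, d))
fx = rng.normal(size=(3, d))           # stand-in for the transformation output F(x)
w_gate = rng.normal(size=(d, d))
b_gate = rng.normal(size=d)

res_out = residual_merge(x, fx)
hw_out = highway_merge(x, fx, w_gate, b_gate)

# Parameters the gate adds on top of the transformation itself: d*d + d
extra_params = w_gate.size + b_gate.size
```

With this dense gate the overhead is `d*d + d` parameters per block, which is where the extra parameter cost comes from. The residual merge adds none.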
The graphic above illustrates how the residual network achieves better information flow by passing more information through identity mapping to avoid going through the residual function.
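As a case study, let’s consider a two-layer fully-connected network. Given input vector $x$ and output vector $y$, a feedforward network can be expressed as

$$h = \sigma(W_1 x + b_1)$$

$$y = \sigma(W_2 h + b_2)$$

where $h$ is the output of the first layer, $W_1$ and $W_2$ are weight matrices, $b_1$ and $b_2$ are biases, and $\sigma$ is a sigmoid activation. We can instead express the second layer as a residual:

$$y = \sigma(W_2 h + b_2 + h)$$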
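Here is a small NumPy sketch of the two networks in this case study. It assumes the plain network computes `h = sigmoid(W1 x + b1)` followed by `y = sigmoid(W2 h + b2)`, and that the residual variant adds `h` inside the second activation. The layer sizes are arbitrary choices for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def plain_forward(x, w1, b1, w2, b2):
    """Ordinary two-layer network."""
    h = sigmoid(w1 @ x + b1)
    y = sigmoid(w2 @ h + b2)
    return h, y

def residual_forward(x, w1, b1, w2, b2):
    """Same first layer; the second layer adds h before the activation."""
    h = sigmoid(w1 @ x + b1)
    y = sigmoid(w2 @ h + b2 + h)
    return h, y

rng = np.random.default_rng(0)
x = rng.normal(size=3)                  # input vector with 3 features
w1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
w2, b2 = rng.normal(size=(4, 4)), np.zeros(4)   # square, so h and y have the same size

h_plain, y_plain = plain_forward(x, w1, b1, w2, b2)
h_resid, y_resid = residual_forward(x, w1, b1, w2, b2)
```

The second weight matrix has to be square here, since the residual form only makes sense when the layer's input and output have the same dimension.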
Here’s a visual representation of these equations:
Allegedly, the key advantage of doing this is so that more information can flow from the output of the first layer to the end result. Let’s take advantage of some loose mathematical notation to characterize what we mean by “information flow”. Intuitively, high information flow between two parameters means that changing one parameter significantly affects the other parameter. In other words, the partial derivative (usually called a “gradient”) is relatively large.
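Writing $h$ for the output of the first layer and $W_2$, $b_2$ for the weights and bias of the second layer, for both networks:

$$\frac{\partial y}{\partial x} = \frac{\partial y}{\partial h} \frac{\partial h}{\partial x}$$

For the first network, the term $\partial y / \partial h$ has to go through the weight matrix $W_2$. When the output of the first layer is dotted with the weight matrix to get the output, it means that the gradient suddenly depends on all of the parameters in the weight matrix. Mathematically:

$$\frac{\partial y}{\partial h} = \mathrm{diag}\left(\sigma'(W_2 h + b_2)\right) W_2$$

However, for the second network, there are two routes it can go, with one of them avoiding the weight matrix completely:

$$\frac{\partial y}{\partial h} = \mathrm{diag}\left(\sigma'(W_2 h + b_2 + h)\right) (W_2 + I)$$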
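The two routes can be seen numerically. The sketch below computes the Jacobian of the output with respect to the first-layer output `h` for both variants, assuming a sigmoid activation and a second layer of the form `sigmoid(W2 h + b2)` (plain) or `sigmoid(W2 h + b2 + h)` (residual), and checks the closed forms against finite differences. The tiny weight scale is an assumption meant to mimic a poorly trained second layer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def plain_jacobian(h, w2, b2):
    """dy/dh for y = sigmoid(w2 @ h + b2): everything passes through w2."""
    z = w2 @ h + b2
    return np.diag(sigmoid_grad(z)) @ w2

def residual_jacobian(h, w2, b2):
    """dy/dh for y = sigmoid(w2 @ h + b2 + h): the identity adds a second route."""
    z = w2 @ h + b2 + h
    return np.diag(sigmoid_grad(z)) @ (w2 + np.eye(len(h)))

def numeric_jacobian(f, h, eps=1e-6):
    """Central finite differences, one column per component of h."""
    cols = []
    for i in range(len(h)):
        step = np.zeros_like(h)
        step[i] = eps
        cols.append((f(h + step) - f(h - step)) / (2 * eps))
    return np.stack(cols, axis=1)

rng = np.random.default_rng(0)
n = 4
h = rng.uniform(size=n)
w2 = rng.normal(size=(n, n)) * 0.01    # tiny weights
b2 = np.zeros(n)

j_plain = plain_jacobian(h, w2, b2)
j_resid = residual_jacobian(h, w2, b2)
```

When the weights are small, the plain Jacobian is small too, so very little gradient reaches the first layer. The residual Jacobian keeps a diagonal term that does not depend on the weight matrix at all.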
To take the concept of information flow a step further, we can use a rectified linear unit instead of a sigmoid for our activation function. If we use ReLUs and stack multiple layers together, any positive outputs of any layer are passed along completely, which is really good information flow (mathematicians, I look forward to your hatemail).
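A quick way to see why ReLUs help is to compare the per-unit derivative of each activation. The sketch below is a deliberately loose calculation: it ignores the weight matrices and only tracks the activation's own derivative factor through a stack of `depth` layers, for a few hypothetical positive pre-activations.

```python
import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)               # never larger than 0.25

def relu_grad(z):
    return (z > 0).astype(float)       # exactly 1 for positive inputs, 0 otherwise

z = np.array([0.5, 1.0, 2.0])          # hypothetical positive pre-activations
depth = 10

# Activation-only gradient factor after passing through `depth` layers
sigmoid_factor = sigmoid_grad(z) ** depth
relu_factor = relu_grad(z) ** depth
```

The sigmoid factor shrinks at least as fast as `0.25 ** depth`, while the ReLU factor stays at 1 for positive units, which is the "passed along completely" behavior described above.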
I wrote a Keras wrapper which can be applied to any layer with the same input and output dimensions to add residual connections. I provided a link to that repository at the top of this post, which also has some examples of how to use them in the context of RNNs and ConvNets. If you use it for something interesting, let me know!
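To give a feel for what such a wrapper does, here is a framework-agnostic sketch in NumPy. It is not the Keras wrapper from the repository and does not use the Keras API; the `Residual` class is a hypothetical illustration of the general idea, including the requirement that the wrapped layer preserves the shape of its input.

```python
import numpy as np

class Residual:
    """Hypothetical wrapper: adds a layer's input to its output."""

    def __init__(self, layer):
        self.layer = layer             # any callable mapping an array to an array

    def __call__(self, x):
        out = self.layer(x)
        if out.shape != x.shape:
            raise ValueError(
                f"layer changed shape from {x.shape} to {out.shape}; "
                "a residual connection needs matching dimensions"
            )
        return x + out

rng = np.random.default_rng(0)
w = rng.normal(size=(6, 6)) * 0.1

def dense(x):
    """Toy layer with the same input and output dimensions."""
    return np.tanh(x @ w)

x = rng.normal(size=(2, 6))
y = Residual(dense)(x)
```

Wrapping a layer that changes the feature dimension raises a `ValueError` in this sketch, since there is nothing sensible to add in that case.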
I generated some visualizations of the activations of some convolutional filters at each layer of a ResNet model trained on MNIST data, which can be seen below. The code for generating these images is also available in the repository.