Review of Deep Learning Architectures for Image Classification Problem (Part 1)
Disclaimer: Live Document
This post is a collection of the study notes I have taken over the years on the papers I read to keep myself up to date. It covers several popular models. I am sharing these notes here in the hope that they will be as valuable and helpful to others as they are to me.
These notes reflect my understanding of the original papers and their implementations in libraries such as torchvision and mmclassification. If any information here is incorrect, it is probably due to my misunderstanding of the papers or to my explanation skills. Feel free to comment and let me know so that I can correct it. I appreciate any help you can provide.
AlexNet (2012)
ImageNet Classification with Deep Convolutional Neural Networks
AlexNet won the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) in 2012. The competition dataset is a subset of the much larger ImageNet, with roughly 1000 images in each of 1000 categories. It contains 1.2 million training images, 50,000 validation images, and 150,000 testing images.
In short, the network consists of five convolution layers and three dense layers.
Training the network took 5–6 days on two GTX 580 3GB GPUs.
The authors noted that faster GPUs could improve the results by making it feasible to train deeper networks on larger datasets.
As a preprocessing step, they down-sampled the images to a fixed resolution of 256 × 256: each image was first rescaled so that its shorter side had length 256, and the central 256×256 patch was then cropped from the result. The images were also normalized by subtracting the mean activity over the training set from each pixel.
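As a rough sketch of this preprocessing (my own illustration, not code from the paper): the helper names below are hypothetical, the image is assumed to be a NumPy array in height × width × channel layout, and the interpolation step of the rescaling is left out, so only the target shape is computed.

```python
import numpy as np

def resized_shape(height, width, target=256):
    """Shape after rescaling so that the shorter side equals `target`."""
    scale = target / min(height, width)
    return round(height * scale), round(width * scale)

def center_crop(image, size=256):
    """Cut the central size x size patch out of an H x W x C array."""
    h, w = image.shape[:2]
    top = (h - size) // 2
    left = (w - size) // 2
    return image[top:top + size, left:left + size]

def subtract_mean(image, mean_image):
    """Subtract the per-pixel mean computed over the training set."""
    return image.astype(np.float32) - mean_image
```

For example, a 480×640 image would be rescaled to 256×341 before the central 256×256 patch is cut out.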
The ReLU activation function is used instead of the tanh or sigmoid function. With this change, training with gradient descent reaches a given accuracy faster. ReLU is applied after each convolution and dense layer.
Two GPUs are used to train the model in parallel. Half of the neurons are on one GPU, and the other half are on the other GPU. The GPUs communicate only in particular layers.
Local response normalization is added after the first and the second convolution layers. The authors note that it would be more accurately called brightness normalization. It reduces the top-1 error rate by 1.4%.
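A minimal NumPy sketch of local response normalization as I read it from the paper: each activation is divided by a term that sums the squared activations of neighbouring channels at the same spatial position. The default values (k=2, n=5, alpha=1e-4, beta=0.75) are the ones I recall from the paper, and the (channels, height, width) layout and function name are my own assumptions.

```python
import numpy as np

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Normalize each channel by the squared activity of its n neighbours.

    a has shape (channels, height, width).
    """
    num_channels = a.shape[0]
    squared = a ** 2
    out = np.empty_like(a, dtype=np.float64)
    for i in range(num_channels):
        lo = max(0, i - n // 2)                 # first neighbouring channel
        hi = min(num_channels - 1, i + n // 2)  # last neighbouring channel
        denom = (k + alpha * squared[lo:hi + 1].sum(axis=0)) ** beta
        out[i] = a[i] / denom
    return out
```

Channels near the edges simply have fewer neighbours, which is why the window is clipped with `max` and `min`.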
Overlapping pooling reduces the top-1 error rate by 0.4% and also makes the network slightly harder to overfit. The max-pooling layers come after the normalization layers and after the fifth convolution layer.
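To make "overlapping" concrete, here is a small sketch, assuming a 3-wide window with stride 2 for the overlapping case (the setting I recall from the paper) and a 2-wide window with stride 2 for the traditional case. The one-dimensional helper and the feature-map size of 55 are only illustrative.

```python
def pool_output_size(size, window, stride):
    """Number of pooling positions along one dimension (no padding)."""
    return (size - window) // stride + 1

def max_pool_1d(values, window, stride):
    """Max-pool a list; windows overlap whenever stride < window."""
    last_start = len(values) - window
    return [max(values[i:i + window]) for i in range(0, last_start + 1, stride)]

signal = [1, 5, 2, 4, 3, 0, 6]
overlapping = max_pool_1d(signal, window=3, stride=2)      # [5, 4, 6]
non_overlapping = max_pool_1d(signal, window=2, stride=2)  # [5, 4, 3]
```

Both settings produce an output of the same size (27 for an input of 55), so overlapping pooling changes what each output sees rather than how many outputs there are.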
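The network is trained to maximize the multinomial logistic regression objective, which is equivalent to maximizing the average log-probability of the correct label over the training examples.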
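As a small illustration of that objective, the sketch below computes the average log-probability of the correct label from raw class scores using a softmax. The function names are hypothetical and plain Python lists are used for simplicity.

```python
import math

def log_softmax(logits):
    """Numerically stable log of the softmax probabilities."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]

def average_log_probability(batch_logits, labels):
    """Mean log-probability assigned to the correct label of each example."""
    total = sum(log_softmax(logits)[y] for logits, y in zip(batch_logits, labels))
    return total / len(labels)
```

Maximizing this quantity is the same as minimizing the cross-entropy loss that most libraries expose.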
Because the size of the network made overfitting a significant problem, the authors used data augmentation and dropout to counter it. Dropout layers are added after the first two dense layers. Two different data augmentation techniques were applied.
The first technique generates horizontal reflections and random 224x224 crops from the resized 256x256 images at training time. At test time, five patches are extracted from each image (the four corner patches and the center patch) together with their horizontal reflections, and the predictions on these ten patches are averaged.
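The test-time procedure can be sketched as follows. This is my own NumPy illustration, assuming a height × width × channel image array, and `predict` is a hypothetical stand-in for the trained network.

```python
import numpy as np

def ten_crop(image, size=224):
    """Four corner crops, the center crop, and their horizontal reflections."""
    h, w = image.shape[:2]
    offsets = [
        (0, 0), (0, w - size),               # top-left, top-right
        (h - size, 0), (h - size, w - size), # bottom-left, bottom-right
        ((h - size) // 2, (w - size) // 2),  # center
    ]
    crops = [image[t:t + size, l:l + size] for t, l in offsets]
    crops += [c[:, ::-1] for c in crops]     # flip along the width axis
    return np.stack(crops)

def average_predictions(predict, image):
    """Average the model output over the ten patches."""
    return np.mean([predict(c) for c in ten_crop(image)], axis=0)
```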
The second data augmentation technique alters the intensities of the RGB channels. Each pixel is shifted along the principal components found by PCA on the RGB pixel values of the training set. This reduced the top-1 error rate by over 1%. This technique approximately captures an essential property of natural images, namely, that object identity is invariant to changes in the intensity and colour of the illumination.
AlexNet achieves top-1 and top-5 error rates of 37.5% and 17.0% on the ILSVRC-2010 test set.
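Here is a hedged sketch of that PCA-based colour augmentation as I understand it: the eigenvectors and eigenvalues of the 3×3 covariance of RGB values are computed once, and each image then receives one random shift along those directions. The standard deviation of 0.1 is the value I recall from the paper, and the function names are my own.

```python
import numpy as np

def rgb_pca(pixels):
    """Eigen-decomposition of the 3x3 covariance of an (N, 3) array of RGB values."""
    cov = np.cov(pixels, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvectors are the columns
    return eigvals, eigvecs

def pca_color_shift(eigvals, eigvecs, rng, sigma=0.1):
    """One random RGB offset: sum_i p_i * alpha_i * lambda_i."""
    alphas = rng.normal(0.0, sigma, size=3)
    return eigvecs @ (alphas * eigvals)

def augment(image, eigvals, eigvecs, rng):
    """Add the same offset to every pixel of an H x W x 3 image."""
    return image + pca_color_shift(eigvals, eigvecs, rng)
```

Because the offset is scaled by the eigenvalues, colours move mostly along the directions in which the dataset itself varies the most.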
ZFNet (2013)
Visualizing and Understanding Convolutional Networks
The paper’s primary goals are to understand why CNN architectures work well and to improve them by analyzing visualizations of the layers. The authors focus on the bottlenecks of AlexNet and modify it to achieve better accuracy on ImageNet.
As the visualization technique, they used a multi-layered deconvolutional network to project the feature activations back to the input pixel space. Unpooling and rectification are applied along the way to map the activations back through each layer.
Because max-pooling is not invertible, unpooling is only an approximate inverse of it. After unpooling, the maps are rectified, and transposed versions of the convolution filters are applied to the rectified maps.
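A minimal sketch of the approximate inverse, assuming the approach I recall from the paper of recording the position of each maximum (the "switches") during pooling and putting the value back at that position during unpooling. Non-overlapping 2x2 windows on a single 2-D map are used here for simplicity, and the function names are hypothetical.

```python
import numpy as np

def max_pool_with_switches(x, size=2):
    """Max-pool a 2-D map and remember where each maximum came from."""
    rows, cols = x.shape[0] // size, x.shape[1] // size
    pooled = np.zeros((rows, cols), dtype=x.dtype)
    switches = np.zeros((rows, cols), dtype=np.int64)
    for i in range(rows):
        for j in range(cols):
            window = x[i * size:(i + 1) * size, j * size:(j + 1) * size]
            switches[i, j] = window.argmax()  # flat index inside the window
            pooled[i, j] = window.max()
    return pooled, switches

def unpool(pooled, switches, size=2):
    """Place each pooled value back at its recorded position; the rest stays zero."""
    out = np.zeros((pooled.shape[0] * size, pooled.shape[1] * size), dtype=pooled.dtype)
    for i in range(pooled.shape[0]):
        for j in range(pooled.shape[1]):
            di, dj = divmod(int(switches[i, j]), size)
            out[i * size + di, j * size + dj] = pooled[i, j]
    return out
```

Everything that was not a maximum is lost, which is why the result is only an approximation of the original map.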
The filter size of the first convolution layer was reduced from 11x11 to 7x7 and its stride from 4 to 2, because the 11x11 filters resulted in dead features. To add capacity, they also increased the number of filters in each layer.
They also tried adding more layers, but this led to overfitting. Conversely, the experiments that reduced the network’s depth ended with a higher error at test time.
The deconvolution step, which reverses the convolution, maps the features from the activations back to the image pixels.
First, they reproduced AlexNet, analyzed its bottlenecks, and then modified the network to improve performance.
For that reason, they applied the augmentation and preprocessing methods of the AlexNet study, except for the PCA-based one.
Instead of two GPUs, a single GTX 580 GPU was enough to train the model, which took 12 days.
The paper has excellent and informative visualizations that are worth taking the time to study.
In addition, they ran experiments that occlude particular objects in the images to reveal how the feature vectors of different layers change.
VGG (2014)
Very Deep Convolutional Networks for Large-Scale Image Recognition
The study aimed to decrease the error in the ImageNet competition by increasing the depth of the network without adding too much computational cost.
To achieve this, they used smaller filters than AlexNet and ZFNet, namely 3x3 and 1x1 convolutions, with smaller strides. More specifically, the stride is 1 and the padding is 1 pixel for the 3x3 convolution layers. Max-pooling is applied five times, though not after every convolution layer, and it is performed over 2x2 pixel windows with stride 2. The width of the convolution layers is also kept small, starting from 64 in the first layer and doubling after each max-pooling layer until it reaches 512.
Unlike AlexNet, local response normalization is not used, as it does not improve performance but increases memory consumption and computation cost.
The larger convolution filters of the earlier networks (11x11 and 5x5 in AlexNet, 7x7 in ZFNet) are replaced with stacks of fixed-size 3x3 filters, which need fewer parameters. For example, a 5x5 convolution has 25 weights per pair of input and output channels, whereas imitating it with two stacked 3x3 convolutions needs only 18 (3*3*2) while covering the same 5x5 receptive field.
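The arithmetic behind this comparison can be checked with a few lines. The sketch below assumes stride-1 convolutions, ignores biases, and uses hypothetical function names.

```python
def stacked_receptive_field(kernel, layers):
    """Receptive field of `layers` stacked stride-1 convolutions of size `kernel`."""
    return 1 + layers * (kernel - 1)

def conv_weights(kernel, channels_in, channels_out):
    """Number of weights in one convolution layer (biases ignored)."""
    return kernel * kernel * channels_in * channels_out

# Two 3x3 layers see a 5x5 region, three see a 7x7 region.
two_layers = stacked_receptive_field(3, 2)    # 5
three_layers = stacked_receptive_field(3, 3)  # 7

# With C channels everywhere: 25*C*C weights versus 18*C*C weights.
c = 64
single_5x5 = conv_weights(5, c, c)      # 102400
double_3x3 = 2 * conv_weights(3, c, c)  # 73728
```

The stacked version also applies a non-linearity between the two layers, which a single 5x5 layer does not have.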
They created several architectures with different configurations; in all of them, the number of convolutional blocks is fixed at 5.
VGG-11: contains 8 convolution layers (1, 1, 2, 2, 2 convolutions) and 3 dense layers.
VGG-13: contains 10 convolution layers (2, 2, 2, 2, 2 convolutions) and 3 dense layers.
VGG-16: contains 13 convolution layers (2, 2, 3, 3, 3 convolutions) and 3 dense layers.
VGG-19: contains 16 convolution layers (2, 2, 4, 4, 4 convolutions) and 3 dense layers.
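The configurations above can be written down as data. The sketch below is my own bookkeeping, not code from any library: it derives the number of weight layers and the width of every convolution layer, assuming the width starts at 64 and doubles after each block up to 512.

```python
VGG_BLOCKS = {
    "VGG-11": [1, 1, 2, 2, 2],
    "VGG-13": [2, 2, 2, 2, 2],
    "VGG-16": [2, 2, 3, 3, 3],
    "VGG-19": [2, 2, 4, 4, 4],
}

def weight_layers(blocks, dense=3):
    """Convolution layers plus dense layers, which gives the number in the name."""
    return sum(blocks) + dense

def conv_widths(blocks, first=64, cap=512):
    """Output channels of every convolution layer, block by block."""
    widths, width = [], first
    for convs_in_block in blocks:
        widths.extend([width] * convs_in_block)
        width = min(width * 2, cap)  # doubles after each max-pooling layer
    return widths

depths = {name: weight_layers(blocks) for name, blocks in VGG_BLOCKS.items()}
# {'VGG-11': 11, 'VGG-13': 13, 'VGG-16': 16, 'VGG-19': 19}
```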
When they increased the network’s depth further, the error rate saturated instead of continuing to improve, and they noted that even deeper models might be beneficial for larger datasets.
Training was done on four NVIDIA Titan Black GPUs and took 2–3 weeks for a single net, depending on the architecture. Their multi-GPU approach also differs from the one used in AlexNet: instead of splitting the network across the GPUs, they keep a full copy of the network on each GPU and distribute each batch across them. This data-parallel approach remains popular for multi-GPU training today and is supported by many deep learning frameworks.
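To illustrate why splitting the batch works, here is a toy sketch in which a linear model with a mean-squared-error loss stands in for the network (an assumption made purely for brevity). Each "device" computes the gradient on its own shard, and averaging those gradients reproduces the full-batch gradient when the shards have equal size.

```python
import numpy as np

def mse_gradient(w, x, y):
    """Gradient of the mean squared error of a linear model x @ w."""
    return 2.0 * x.T @ (x @ w - y) / len(x)

def data_parallel_gradient(w, x, y, num_devices=4):
    """Split the batch, compute one gradient per shard, then average them."""
    x_shards = np.array_split(x, num_devices)
    y_shards = np.array_split(y, num_devices)
    grads = [mse_gradient(w, xs, ys) for xs, ys in zip(x_shards, y_shards)]
    return np.mean(grads, axis=0)
```

In a real framework the averaging step is a communication between GPUs, while every GPU holds an identical copy of the weights.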
They improved on the previous ILSVRC results by a large margin and finished second in the classification task of the 2014 competition.
They also published the results on other recognition tasks (action detection, object detection, and semantic segmentation) and datasets (VOC-2007, VOC-2012, Caltech-101, and Caltech-256).
GoogLeNet (2014)
Going deeper with convolutions
One design goal of this network was to make it usable on devices with limited computational power without sacrificing accuracy. It contains 5 million parameters, about 12x fewer than AlexNet, and with 22 layers it is much deeper than AlexNet.
The authors were highly influenced by the Network-in-Network paper, which proposes an approach for increasing the representational power of neural networks.
1x1 convolutions are used as dimension reduction modules. They also allowed the authors to increase the width and depth of the network without running into overfitting. (Overfitting is a problem that previous methods faced when using deeper or wider networks.)
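A 1x1 convolution is just a linear map applied across the channels at every pixel, which the NumPy sketch below makes explicit. The function names are my own, biases are ignored, and the channel counts (192 inputs reduced to 16 before a 5x5 convolution with 32 outputs) are illustrative values that I believe match the first Inception module but that should be checked against the paper.

```python
import numpy as np

def conv1x1(x, weights):
    """Apply a 1x1 convolution. x: (C_in, H, W), weights: (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weights, x)

def weight_count(kernel, channels_in, channels_out):
    """Number of weights in one convolution layer (biases ignored)."""
    return kernel * kernel * channels_in * channels_out

# 5x5 convolution applied directly to 192 channels:
direct = weight_count(5, 192, 32)                            # 153600
# 1x1 reduction to 16 channels first, then the 5x5 convolution:
reduced = weight_count(1, 192, 16) + weight_count(5, 16, 32) # 15872
```

In this example the reduction cuts the number of weights by roughly a factor of ten while the spatial size of the feature map stays unchanged.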
In general, larger networks have two issues: they are more prone to overfitting, and they require more computational power. The authors argue that moving from fully connected to sparsely connected architectures, in the dense layers and even inside the convolutions, should solve both issues.
To apply this idea, they investigated how an optimal local sparse structure in a convolutional vision network can be approximated and covered by readily available dense components. The result is the Inception module.
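As I understand the module, it runs four branches in parallel on the same input (a 1x1 convolution, a 1x1 reduction followed by a 3x3 convolution, a 1x1 reduction followed by a 5x5 convolution, and a 3x3 max-pooling followed by a 1x1 projection) and concatenates their outputs along the channel axis. The sketch below only does the bookkeeping for such a module; the function names are hypothetical, biases are ignored, and the example channel counts are my recollection of the first module in the paper's table.

```python
def inception_output_channels(c1x1, c3x3, c5x5, pool_proj):
    """Channels after concatenating the four branches."""
    return c1x1 + c3x3 + c5x5 + pool_proj

def inception_weight_count(c_in, c1x1, c3x3_reduce, c3x3,
                           c5x5_reduce, c5x5, pool_proj):
    """Weights of all convolutions in the module (biases ignored)."""
    branch1 = c_in * c1x1
    branch2 = c_in * c3x3_reduce + 9 * c3x3_reduce * c3x3
    branch3 = c_in * c5x5_reduce + 25 * c5x5_reduce * c5x5
    branch4 = c_in * pool_proj  # the pooling itself has no weights
    return branch1 + branch2 + branch3 + branch4

out_channels = inception_output_channels(64, 128, 32, 32)       # 256
weights = inception_weight_count(192, 64, 96, 128, 16, 32, 32)  # 163328
```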
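In addition to the Inception module, the end of the network is modified: average pooling is applied before the classifier layer, replacing the fully connected layers used by earlier networks. Training used asynchronous stochastic gradient descent with 0.9 momentum together with image distortions for augmentation; batch normalization and RMSprop belong to later Inception versions rather than to the original GoogLeNet.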
The authors trained the network with a CPU-based implementation. By their rough estimate, it could be trained to convergence on a few high-end GPUs within a week, with memory usage being the main limitation. It achieved a top-5 error rate of 6.67%!
To mitigate possible gradient propagation problems in such a deep network, they attached auxiliary classifiers to intermediate layers (4a, 4d) during training. Their losses are multiplied by 0.3 and added to the main loss. At test time, these auxiliary classifiers are removed from the network.
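The weighting of the auxiliary losses amounts to a one-line formula. The sketch below assumes the individual losses have already been computed as plain numbers, and the function name is my own.

```python
def training_loss(main_loss, aux_losses, aux_weight=0.3):
    """Main loss plus the down-weighted losses of the auxiliary classifiers."""
    return main_loss + aux_weight * sum(aux_losses)

# Training time: both auxiliary classifiers contribute.
total = training_loss(1.0, [2.0, 3.0])  # 1.0 + 0.3 * (2.0 + 3.0)
# Test time: the auxiliary classifiers are gone, so only the main output is used.
inference = training_loss(1.0, [])
```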