Review of Deep Learning Architectures for Image Classification Problem (Part 3)
In this blog post, I share my notes on the Wide-ResNet, ShuffleNet, ResNeXt, and Xception networks.
Wide-ResNet (2017)
In deep residual networks, the number of layers must be nearly doubled to obtain even a minor performance improvement, which points to a problem of diminishing feature reuse. To address this, the authors proposed a modified version of residual networks in which they decrease the depth and increase the width.
The authors argue that the strength of residual networks, the identity mappings, is also a weakness, as it can create the issue called diminishing feature reuse. They presented wider deep residual networks that significantly improved the results while having 50 times fewer layers and being more than two times faster.
There are two main changes to the original ResNets:
- they increased the width of the layers, balancing it against the depth of the network
- they employed dropout between the convolutional layers of the residual blocks (as shown in the visualization), which leads to consistent gains
In addition to the changes listed above, they also changed the original order of batch normalization, activation, and convolution in the residual block from convolution-BN-ReLU to BN-ReLU-convolution.
Bottleneck blocks make the network thinner so that it can go deeper, which is the opposite of what Wide-ResNet aims for; the Wide-ResNet architecture therefore relies on basic blocks instead of bottleneck blocks.
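To make these changes concrete, here is a minimal PyTorch sketch of a wide basic block, assuming the pre-activation BN-ReLU-convolution order, a widening factor k applied to the channel count, and dropout between the two 3×3 convolutions. The class and parameter names (WideBasicBlock, base_channels, dropout_rate) are my own illustration, not the authors' code.

```python
import torch
import torch.nn as nn


class WideBasicBlock(nn.Module):
    def __init__(self, in_channels, base_channels, k=2, dropout_rate=0.3, stride=1):
        super().__init__()
        out_channels = base_channels * k  # widening factor k

        # Pre-activation order: BN -> ReLU -> conv
        self.bn1 = nn.BatchNorm2d(in_channels)
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.dropout = nn.Dropout(p=dropout_rate)  # dropout between the two convolutions
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3,
                               stride=1, padding=1, bias=False)

        # 1x1 projection on the shortcut when the shape changes
        self.shortcut = nn.Identity()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Conv2d(in_channels, out_channels, kernel_size=1,
                                      stride=stride, bias=False)

    def forward(self, x):
        out = self.conv1(torch.relu(self.bn1(x)))
        out = self.dropout(out)
        out = self.conv2(torch.relu(self.bn2(out)))
        return out + self.shortcut(x)
```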
They ran experiments with the following block types:
1. B(3,3) — original «basic» block
2. B(3,1,3) — with one extra 1×1 layer
3. B(1,3,1) — with the same dimensionality of all convolutions, «straightened» bottleneck
4. B(1,3) — the network has alternating 1×1–3×3 convolutions everywhere
5. B(3,1) — similar idea to the previous block
6. B(3,1,1) — Network-in-Network style block
The authors noted that batch normalization provides a regularization effect in residual networks but requires heavy data augmentation, which they wanted to avoid. As they did not apply heavy augmentation in their experiments, this decision probably added some bias to the final network, which is interesting to think about (it is an open question for me, and I did not investigate it). Instead, they added dropout into each residual block between the convolutions (as mentioned above) and after ReLU, to perturb batch normalization in the next residual block and prevent it from overfitting.
Because of the computation power requirements, they ran all the experiments within these configurations:
- The depth of the networks is between 16 and 40
- The widening factor (k value in the visualization) is kept between 2 and 11
They reported that the model performed better when the depth, the k value, or both were increased, with one exception: Wide-ResNet-40-8 (meaning depth=40, k=8) lost to WRN-22-8.
Also, when comparing Wide-ResNet with ResNet, Wide-ResNet-40-4 compares favorably to the thin ResNet-1001: it has better performance and, at the same time, trains eight times faster.
ShuffleNet (2017)
ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices
The network is designed to be used on mobile devices with limited computing power.
It adds two new operations to standard networks, pointwise group convolution and channel shuffle, to reduce computation cost while maintaining accuracy.
First, the authors considered stacking multiple group convolutions, but this has a side effect: the outputs of a particular channel are derived from only a small fraction of the input channels. They solve this problem by adding a channel shuffle operation after the group convolution.
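A minimal sketch of the channel shuffle operation, under the common reshape-transpose-flatten reading: the channels are split into groups, the group and channel axes are swapped, and the result is flattened back, so each output group mixes channels coming from every input group. The (N, C, H, W) tensor layout and the function name are assumptions of this sketch.

```python
import torch


def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    n, c, h, w = x.size()
    channels_per_group = c // groups
    # (N, C, H, W) -> (N, groups, C/groups, H, W)
    x = x.view(n, groups, channels_per_group, h, w)
    # swap the group and channel axes, then flatten back to (N, C, H, W)
    x = x.transpose(1, 2).contiguous()
    return x.view(n, c, h, w)
```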
Given a computational budget, the proposed network can use wider feature maps.
In ShuffleNet, depthwise convolution is performed only on the bottleneck feature maps, as it can be challenging to implement efficiently and may perform worse in practice than other dense operations.
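Here is a rough sketch of how such a unit could look, based on my reading of the paper: a 1×1 pointwise group convolution reduces the channels to a narrow bottleneck, the channels are shuffled, a 3×3 depthwise convolution runs only on that bottleneck feature map, a second 1×1 group convolution expands the channels back, and a residual addition closes the block. Layer names and the bottleneck ratio are illustrative assumptions, not the authors' implementation, and only the stride-1 case is covered.

```python
import torch
import torch.nn as nn


class ShuffleNetUnit(nn.Module):
    def __init__(self, channels, groups=3, bottleneck_ratio=4):
        super().__init__()
        mid = channels // bottleneck_ratio  # narrow bottleneck feature map
        self.groups = groups
        self.gconv1 = nn.Sequential(
            nn.Conv2d(channels, mid, 1, groups=groups, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True))
        # depthwise 3x3: one filter per channel, applied only on the bottleneck
        self.dwconv = nn.Sequential(
            nn.Conv2d(mid, mid, 3, padding=1, groups=mid, bias=False),
            nn.BatchNorm2d(mid))
        self.gconv2 = nn.Sequential(
            nn.Conv2d(mid, channels, 1, groups=groups, bias=False),
            nn.BatchNorm2d(channels))

    def forward(self, x):
        out = self.gconv1(x)
        # channel shuffle (same reshape-transpose trick as in the sketch above)
        n, c, h, w = out.size()
        out = out.view(n, self.groups, c // self.groups, h, w)
        out = out.transpose(1, 2).contiguous().view(n, c, h, w)
        out = self.dwconv(out)
        out = self.gconv2(out)
        return torch.relu(out + x)  # residual addition for the stride-1 case
```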
The training took 1 to 2 days for 3×10⁵ iterations on 4 GPUs, with the batch size kept at 1024.
In their experiments, the authors introduced a parameter for the number of groups in the group convolutions. If this parameter is set to 1, the ShuffleNet unit becomes an "Xception-like" structure, which means no pointwise group convolution is involved. They also reported that as the group number increases, the performance consistently rises without adding more parameters to the network.
One outcome of the experiments is that group convolution allows more feature map channels for a given complexity constraint; hence, the performance gain comes from wider feature maps, which help encode more information.
The proposed network performs better than MobileNet, and it is ∼13 times faster than AlexNet (theoretically 18 times) while maintaining comparable accuracy.
ResNeXt (2017)
Aggregated Residual Transformations for Deep Neural Networks
It is a simple architecture that adopts VGG/ResNets’ strategy of repeating layers while exploiting the split-transform-merge approach of Inception models.
Previously, modifications to obtain more powerful networks were made primarily to the depth or width of the network, or both. The authors add another dimension to network engineering, called cardinality, which can be described as the size of the set of transformations. In this study, these transformations are applied inside the residual blocks.
The experiments demonstrate that increasing cardinality is a more effective way of gaining accuracy than going deeper or wider, especially when depth and width start to give diminishing returns for existing models.
Also, the proposed network has a more straightforward design than Inception models and performs better than the Inception, ResNet, and Inception-ResNet models.
They evaluated ResNeXt on the ImageNet-5K and COCO object detection datasets, and it performs better than the ResNet variant that has the same depth and width values.
In theory, they formulate aggregated residual transformations to introduce this cardinality. Still, instead of implementing them directly, they turned to grouped convolutions, as these are easier to implement and realize the same idea.
Grouped convolutions were first used in the AlexNet paper because of GPU restrictions: a single GPU's memory was not enough to hold AlexNet. They are mathematically equivalent to what the authors propose (adding cardinality to the blocks), so to make the implementation of the blocks easier, they use grouped convolutions instead of aggregated residual transformations.
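Below is a minimal PyTorch sketch of a ResNeXt-style bottleneck block in which cardinality is realized through the groups argument of the middle 3×3 convolution, e.g. the 32×4d setting (cardinality 32, bottleneck width 4 per path). The class name, channel counts, and shortcut handling are illustrative assumptions, not the reference implementation.

```python
import torch
import torch.nn as nn


class ResNeXtBottleneck(nn.Module):
    def __init__(self, in_channels, out_channels, cardinality=32, bottleneck_width=4):
        super().__init__()
        mid = cardinality * bottleneck_width  # e.g. 32 * 4 = 128 for the 32x4d setting
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, mid, 1, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            # grouped 3x3 convolution: groups=cardinality splits the
            # transformation into `cardinality` parallel paths
            nn.Conv2d(mid, mid, 3, padding=1, groups=cardinality, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels))
        self.shortcut = (nn.Identity() if in_channels == out_channels else
                         nn.Conv2d(in_channels, out_channels, 1, bias=False))

    def forward(self, x):
        return torch.relu(self.block(x) + self.shortcut(x))
```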
The experiments are done with different cardinality values (1, 2, 4, 8, 32) to evaluate the trade-off between cardinality and bottleneck width under preserved complexity (the configurations are given in the table).
With the cardinality C increasing from 1 to 32 while maintaining complexity, the error rate decreases. Furthermore, the 32×4d ResNeXt also has a much lower training error than its ResNet counterpart, suggesting that the gains come not from regularization but from stronger representations.
Xception (2016)
Xception: Deep Learning with Depthwise Separable Convolutions
The Inception model aims to map cross-channel correlations and spatial correlations separately.
The Xception model aims to achieve the same in an extreme form. Instead of the Inception module, Xception employs the depthwise separable convolution, which differs slightly from an Inception module: the order of the channel-wise spatial convolution and the 1×1 convolution is reversed, and, unlike in the Inception module, there is no non-linearity between the convolution layers of a depthwise separable convolution.
Simply put, the Xception model can be described as a convolutional neural network architecture built entirely on depthwise separable convolutions; in short, it is a linear stack of depthwise separable convolution layers with residual connections.
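For illustration, here is a minimal PyTorch sketch of such a depthwise separable convolution building block: a channel-wise (depthwise) 3×3 spatial convolution followed by a 1×1 pointwise convolution, with no non-linearity in between, matching the differences from the Inception module noted above. The kernel size, padding, and naming are assumptions of this sketch.

```python
import torch.nn as nn


class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=3, padding=1):
        super().__init__()
        # depthwise: one spatial filter per input channel (groups=in_channels)
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size,
                                   padding=padding, groups=in_channels, bias=False)
        # pointwise: 1x1 convolution mapping the cross-channel correlations
        self.pointwise = nn.Conv2d(in_channels, out_channels, 1, bias=False)

    def forward(self, x):
        # no ReLU between the depthwise and pointwise steps
        return self.pointwise(self.depthwise(x))
```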
Xception contains 36 convolutional layers (within 14 modules) in the proposed version. It also includes a dropout layer with rate 0.5 before the logistic regression layer.
Xception is compared with the Inception V3 model on the ImageNet and JFT datasets. The implementation is done in TensorFlow, and each experiment is run on 60 NVIDIA K80 GPUs. The ImageNet training took approximately three days, while the JFT training took over one month.
On ImageNet, Xception outperforms the results reported for ResNet-50, ResNet-101, and ResNet-152 and shows marginally better results than Inception V3.
Also, Xception is slightly slower than Inception during training (28 steps/second versus 31 steps/second). However, as both architectures have almost the same number of parameters, the performance improvements come from more efficient use of the model parameters.
The Xception model with residual connections converges more quickly. However, the author remarked that residual connections are not required to build models that are stacks of depthwise separable convolutions.