Hi guys! In this blog, I will share what I took away from reading this research paper and what it is all about. Before I proceed, I want you to know that I didn't study it very extensively; my only aim was to understand:
- What is this research paper all about?
- How is it different from the previous state-of-the-art models?
- What were the results of this novel approach compared to the previous ones?
So, all of these are written here as key points.
Abstract
- What is the purpose? To set a new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC2014).
- What is the main hallmark? The improved utilization of the computing resources inside the network.
- What is the achievement? A carefully crafted design that allows for increasing the depth and width of the network while keeping the computational budget constant.
- What is the novel architectural idea? The architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing.
Introduction
How does it compare to AlexNet in terms of parameters? The GoogLeNet submission to ILSVRC 2014 actually used 12x fewer parameters than AlexNet, the winning architecture from two years earlier, while being significantly more accurate.
Motivation and High Level Considerations
The most straightforward way of improving the performance of deep networks is to increase their size, both depth and width. However, this has two drawbacks.
Drawback 1: A bigger size typically means a larger number of parameters, which makes the enlarged network more prone to overfitting.
Drawback 2: Uniformly increasing the network size dramatically increases the use of computational resources.
The fundamental way of solving both issues would be to move from fully connected to sparsely connected architectures, even inside the convolutions.
Sparse matrix → fewer calculations in theory (due to the many 0's present in it). However, the paper notes that today's computing infrastructure is very inefficient at numerical calculation on non-uniform sparse data structures, which is exactly what motivates approximating the sparse structure with dense building blocks.
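As a tiny illustration of the sparse-vs-dense point (my own example with arbitrary sizes and sparsity, not anything from the paper), PyTorch can store a mostly-zero matrix in sparse COO form and multiply it while skipping the zeros. Keep in mind that, as the paper stresses, this arithmetic saving does not always translate into faster wall-clock time on current hardware.

```python
import torch

dense_w = torch.randn(1024, 1024)
dense_w[torch.rand_like(dense_w) < 0.9] = 0.0   # zero out ~90% of the entries

sparse_w = dense_w.to_sparse()                   # COO form: only indices + non-zero values
x = torch.randn(1024, 256)

out_dense = dense_w @ x                          # dense matmul touches every entry
out_sparse = torch.sparse.mm(sparse_w, x)        # sparse matmul uses only the non-zeros

print(torch.allclose(out_dense, out_sparse, atol=1e-3))  # True: same result, fewer multiplications
```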
Architecture Details
- What is the main idea? It is based on finding out how an optimal local sparse structure in a convolutional vision network can be approximated and covered by readily available dense components.
- How is this done? The filter sizes are restricted to 1x1, 3x3 and 5x5 (this decision was based more on convenience than on necessity). The suggested architecture is therefore a combination of all those layers, with their output filter banks concatenated into a single output vector that forms the input of the next stage. Adding an alternative parallel pooling path in each such stage should have an additional beneficial effect.
- There is one problem. What is it? Even a modest number of 5x5 convolutions can be prohibitively expensive on top of a convolutional layer with a large number of filters.
To resolve this drawback, dimension reductions and projections are applied wherever the computational requirements would otherwise increase too much. That is, a 1x1 convolution (followed by rectified linear activation) is applied before the expensive 3x3 and 5x5 convolutions; a sketch of such a module follows after the next point.
- What is "Inception"? Inception is a network consisting of modules of the above type stacked upon each other, with occasional max-pooling layers with stride 2 to halve the resolution of the grid (a toy stacking example appears after the advantages below).
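Below is my own minimal PyTorch sketch of one such module with dimension reduction (argument names like `ch3x3red` are mine; the paper's actual per-stage filter counts are given in its Table 1):

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Minimal Inception module with dimension reduction (a sketch, not the reference code)."""
    def __init__(self, in_ch, ch1x1, ch3x3red, ch3x3, ch5x5red, ch5x5, pool_proj):
        super().__init__()
        # Branch 1: plain 1x1 convolution
        self.branch1 = nn.Sequential(
            nn.Conv2d(in_ch, ch1x1, kernel_size=1), nn.ReLU(inplace=True))
        # Branch 2: 1x1 reduction, then 3x3 convolution
        self.branch2 = nn.Sequential(
            nn.Conv2d(in_ch, ch3x3red, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch3x3red, ch3x3, kernel_size=3, padding=1), nn.ReLU(inplace=True))
        # Branch 3: 1x1 reduction, then 5x5 convolution
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_ch, ch5x5red, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch5x5red, ch5x5, kernel_size=5, padding=2), nn.ReLU(inplace=True))
        # Branch 4: 3x3 max pooling, then 1x1 projection
        self.branch4 = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, kernel_size=1), nn.ReLU(inplace=True))

    def forward(self, x):
        # Concatenate the four branch outputs along the channel dimension
        return torch.cat([self.branch1(x), self.branch2(x),
                          self.branch3(x), self.branch4(x)], dim=1)
```

Note how every branch keeps the spatial resolution (via padding), so the four filter banks can be concatenated channel-wise into a single output.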
Advantages of this network:
Advantage 1: It allows for increasing the number of units at each stage significantly without an uncontrolled blow-up in computational complexity.
Advantage 2: It aligns with the intuition that visual information should be processed at various scales and then aggregated so that the next stage can abstract features from different scales simultaneously.
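As a follow-up to the stacking idea above, here is a toy example reusing the `InceptionModule` sketch from the previous block (the channel counts here are arbitrary placeholders, not the paper's):

```python
import torch
import torch.nn as nn

# Two Inception modules back to back, then a stride-2 max pooling that halves the grid.
blocks = nn.Sequential(
    InceptionModule(64, 32, 48, 64, 8, 16, 16),        # -> 32+64+16+16 = 128 channels
    InceptionModule(128, 64, 64, 96, 16, 32, 32),      # -> 64+96+32+32 = 224 channels
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),  # 28x28 -> 14x14
)

x = torch.randn(1, 64, 28, 28)   # a dummy 64-channel 28x28 feature map
print(blocks(x).shape)           # torch.Size([1, 224, 14, 14])
```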
GoogLeNet
- All the convolutions, including those inside the Inception modules, used rectified linear activation.
- The size of the receptive field in this network is 224x224, taking RGB color channels with mean subtraction.
- "#3x3 reduce" and "#5x5 reduce" stand for the number of 1x1 filters in the reduction layer used before the 3x3 and 5x5 convolutions.
- The number of 1x1 filters in the projection layer after the built-in max-pooling can be seen in the "pool proj" column.
- All these reduction/projection layers used ReLU activation as well.
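To make these column names concrete, here is how the Inception (3a) row of the paper's table maps onto the `InceptionModule` sketch from earlier (this assumes that class is defined above; the 192-channel 28x28 input is what the stem produces right before (3a)):

```python
# Inception (3a) row: #1x1=64, #3x3 reduce=96, #3x3=128, #5x5 reduce=16, #5x5=32, pool proj=32
inception_3a = InceptionModule(in_ch=192, ch1x1=64, ch3x3red=96, ch3x3=128,
                               ch5x5red=16, ch5x5=32, pool_proj=32)
# Output depth = 64 + 128 + 32 + 32 = 256 channels, i.e. the 28x28x256 output size listed in the table.
```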
Why was an average pooling layer used instead of fully connected (FC) layers? It was found that a move from FC layers to average pooling improved the top-1 accuracy by about 0.6%; however, the use of dropout remained essential even after removing the FC layers.
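A small sketch of that classifier head as I understand it (my own wiring; it assumes the final feature map is 7x7 with 1024 channels and uses the 40% dropout ratio mentioned in the paper):

```python
import torch.nn as nn

head = nn.Sequential(
    nn.AdaptiveAvgPool2d((1, 1)),  # global average pooling: 7x7x1024 -> 1x1x1024
    nn.Flatten(),                  # -> 1024-dim vector
    nn.Dropout(p=0.4),             # dropout is kept even after dropping the FC stack
    nn.Linear(1024, 1000),         # linear layer; softmax loss applied during training
)
```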
What about gradients? Given the relatively large depth of the network, the ability to propagate gradients back through all the layers in an effective manner was a concern.
Solution: By adding auxiliary classifiers connected to intermediate layers, one would expect to encourage discrimination in the lower stages of the classifier, increase the gradient signal that gets propagated back, and provide additional regularization.
Where are these auxiliary classifiers placed, and what happens at inference time? They take the form of smaller convolutional networks put on top of the outputs of the Inception (4a) and (4d) modules. During training, their losses get added to the total loss of the network with a discount weight (the losses of the auxiliary classifiers were weighted by 0.3). At inference time, these auxiliary classifiers are discarded.
Total Loss = Loss(main classifier) + 0.3 * [Loss(auxiliary classifier 1) + Loss(auxiliary classifier 2)]
The exact structure of the auxiliary classifiers, as given in the research paper:
- An average pooling layer with 5x5 filter size and stride 3, resulting in a 4x4x512 output for the (4a) stage and 4x4x528 for the (4d) stage
- A 1x1 convolution with 128 filters for dimension reduction and ReLU
- An FC layer with 1024 units and ReLU
- A dropout with p=0.7
- A linear layer with softmax loss as the classifier (predicting the same number of classes as the main classifier)
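Putting that list together, here is a sketch of an auxiliary classifier in PyTorch (my own wiring of the layers listed above; `nn.CrossEntropyLoss` would stand in for the softmax loss):

```python
import torch
import torch.nn as nn

class AuxClassifier(nn.Module):
    """Auxiliary classifier sketch, following the structure listed above."""
    def __init__(self, in_ch, num_classes=1000):
        super().__init__()
        self.pool = nn.AvgPool2d(kernel_size=5, stride=3)   # 14x14 -> 4x4
        self.conv = nn.Conv2d(in_ch, 128, kernel_size=1)    # 1x1 conv with 128 filters
        self.relu = nn.ReLU(inplace=True)
        self.fc1 = nn.Linear(128 * 4 * 4, 1024)             # FC layer with 1024 units
        self.dropout = nn.Dropout(p=0.7)                     # 70% dropout
        self.fc2 = nn.Linear(1024, num_classes)              # linear layer (softmax via the loss)

    def forward(self, x):
        x = self.pool(x)
        x = self.relu(self.conv(x))
        x = torch.flatten(x, 1)
        x = self.relu(self.fc1(x))
        x = self.dropout(x)
        return self.fc2(x)   # raw logits; use nn.CrossEntropyLoss for the softmax loss

# Usage: one head on the Inception (4a) output (14x14x512), one on (4d) (14x14x528)
aux1, aux2 = AuxClassifier(512), AuxClassifier(528)
```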
Training Methodology
- SGD with momentum=0.9
- A fixed learning rate schedule, decreasing the learning rate by 4% every 8 epochs
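A rough training-loop sketch of the above: `model`, `train_loader` and `num_epochs` are placeholders, the initial learning rate is an arbitrary choice of mine, and "decrease by 4% every 8 epochs" is expressed as a StepLR with gamma=0.96. It assumes the model returns the main and both auxiliary outputs during training, so the 0.3-weighted auxiliary losses from earlier can be added in.

```python
import torch
import torch.nn as nn

# `model`, `train_loader`, `num_epochs` and the initial lr are placeholders for illustration.
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# "Decrease the learning rate by 4% every 8 epochs" -> multiply by 0.96 every 8 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=8, gamma=0.96)

for epoch in range(num_epochs):
    for images, labels in train_loader:
        main_out, aux1_out, aux2_out = model(images)
        # Total loss = main loss + 0.3 * (auxiliary losses); aux heads are dropped at inference
        loss = (criterion(main_out, labels)
                + 0.3 * criterion(aux1_out, labels)
                + 0.3 * criterion(aux2_out, labels))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```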
Result
GoogLeNet achieved a top-5 error of 6.67% on both the validation and test data of the ILSVRC 2014 classification challenge, ranking first among all participants.
Code Implementation Using PyTorch
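The sketches above can be assembled into the full 22-layer network, but for a complete, working implementation the quickest route (assuming a reasonably recent torchvision) is the library's own GoogLeNet, which follows the same design: Inception modules with dimension reduction, global average pooling, and the two auxiliary classifiers that are active only in training mode.

```python
import torch
import torchvision

# torchvision's GoogLeNet, randomly initialized, with the auxiliary classifiers enabled
model = torchvision.models.googlenet(num_classes=1000, aux_logits=True, init_weights=True)

model.train()
out = model(torch.randn(2, 3, 224, 224))      # 224x224 RGB input, as in the paper
print(out.logits.shape, out.aux_logits1.shape, out.aux_logits2.shape)
# torch.Size([2, 1000]) for the main head and both auxiliary heads

model.eval()                                   # auxiliary classifiers are unused at inference
print(model(torch.randn(2, 3, 224, 224)).shape)   # torch.Size([2, 1000])
```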
That's all the key points from this research paper. Hope you got it.
Thank you for reading, and have a nice day! :D
Here is my LinkedIn profile.