Understanding the VGG Model and Its Implementation Using PyTorch

Sahil · Mar 30, 2021

Hi guys! In this blog, I will share my notes after going through the VGG paper. Before I proceed, I want you to know that I didn’t study it very extensively. My aim was only to understand:

  • What is this research paper all about?
  • How is it different from the previous state-of-the-art models?
  • What were the results of this novel approach compared to the previous ones?

So, all of these are written here as key points.

Abstract

  1. What was the contribution of this paper? “Increasing depth” using an architecture with very small (3x3) convolution filters.
  2. What did it prove? A significant improvement over the prior-art configurations can be achieved by pushing the depth to 16–19 weight layers.

Introduction

  1. What was the role of the challenge? An important role in the advance of deep visual recognition architectures has been played by the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC).
  2. What were the historic results?
    The winner of ILSVRC 2011 used high-dimensional shallow feature encodings, whereas the winner of ILSVRC 2012 achieved its result using a deep ConvNet (AlexNet).
  3. Attempting more…? With ConvNets becoming more of a commodity in the computer vision field, a number of attempts have been made to improve the original AlexNet architecture in a bid to achieve better accuracy.
  4. Convey…? The authors came up with significantly more accurate ConvNet architectures, which not only achieve state-of-the-art accuracy on the ILSVRC classification and localization tasks, but are also applicable to other image recognition datasets.

ConvNet Configuration

Architecture

  1. Input (224x224) RGB: During training, the input to their ConvNet is a fixed-size 224x224 RGB image.
  2. Preprocessing: The only preprocessing they do is subtracting the mean RGB value, computed on the training set, from each pixel (a form of normalization).
  3. Stack of conv layers: The image is passed through a stack of convolutional (conv.) layers, where they use filters with a very small receptive field: 3x3.
    In one configuration they also utilise 1x1 convolution filters, which can be seen as a linear transformation of the input channels followed by a non-linearity.
    (See configuration C in the ConvNet configuration table.)

Note: AlexNet used 11x11, 5x5 and 3x3 receptive fields. Here, however, they use the same 3x3 receptive field throughout the whole network.

4. Stride=1: The convolution stride is fixed to 1.

5. Padding=1: The padding is 1 pixel for 3x3 convolution layers.

6. Maxpooling: Spatial pooling is carried out by 5 max-pooling layers, which follow some of the conv layers. It is performed over a 2x2 pixel window, with stride 2.

7. ReLU: All the hidden layers are equipped with the rectification (ReLU) non-linearity.

Note: Local Response Normalization (LRN) was also tried, but it did not improve performance on the ILSVRC dataset and led to increased memory consumption and computation time. (A sketch of the conv-ReLU-pool pattern described above follows.)
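To make points 3–7 concrete, here is a minimal PyTorch sketch of one such stage (the helper name vgg_stage is mine, not from the paper): 3x3 convolutions with stride 1 and padding 1, each followed by ReLU, closed by a 2x2 max-pool with stride 2.

```python
import torch
import torch.nn as nn

# One VGG "stage": a few 3x3 convs (stride 1, padding 1), each followed by
# ReLU, closed by a 2x2 max-pool with stride 2
def vgg_stage(in_channels, out_channels, num_convs):
    layers = []
    for _ in range(num_convs):
        layers += [nn.Conv2d(in_channels, out_channels, kernel_size=3,
                             stride=1, padding=1),
                   nn.ReLU(inplace=True)]
        in_channels = out_channels
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

x = torch.randn(1, 3, 224, 224)      # a fixed-size 224x224 RGB input
print(vgg_stage(3, 64, 2)(x).shape)  # torch.Size([1, 64, 112, 112])
# The 3x3/padding-1 convs preserve the spatial size; only the pool halves it.
```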

Configuration

  1. Table explained: The ConvNet configurations evaluated in the paper are shown one per column.
    The nets are referred to by their names (A-E) and differ only in depth: from 11 weight layers in network A (8 conv and 3 FC layers) to 19 weight layers in network E (16 conv and 3 FC layers).
  2. Configuration of width: The width of the conv layers (the number of channels) is rather small, starting from 64 in the first layer and then increasing by a factor of 2 after each max-pooling layer, until it reaches 512.

As noted in Table 2, in spite of the large depth, the number of weights in these networks is not greater than the number of weights in a shallower net with larger widths and larger receptive fields (144 million weights in Sermanet et al.). These per-column configurations are often written compactly as channel lists, as in the sketch below.
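A hedged reconstruction of the two endpoint columns from the paper's table, written in the channel-list style that torchvision also uses ('M' marks a 2x2 max-pool; the dictionary name cfgs is mine):

```python
# Conv widths per column; 'M' = 2x2 max-pool. A has 8 convs and E has 16,
# matching the 11- and 19-weight-layer counts above (each net adds 3 FC layers).
cfgs = {
    'A': [64, 'M', 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],
    'E': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 256, 'M',
          512, 512, 512, 512, 'M', 512, 512, 512, 512, 'M'],
}
```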

Discussion

What have they gained by using a stack of three 3x3 conv layers instead of a single 7x7 layer? (The stack has the same 7x7 effective receptive field.)

  • Firstly, it makes the decision function more discriminative: three non-linearities instead of one.
  • Secondly, it decreases the number of parameters, as the check below shows.
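The paper's arithmetic: with C channels in and out, three stacked 3x3 layers cost 3 x (3^2 x C^2) = 27C^2 weights, while a single 7x7 layer costs 7^2 x C^2 = 49C^2. A quick check in PyTorch (C = 64 is an arbitrary choice of mine):

```python
import torch.nn as nn

C = 64  # arbitrary channel count, just for the check

# Three stacked 3x3 convs: 3 * (3*3*C*C) = 27*C^2 weights (biases ignored)
stack = nn.Sequential(*[nn.Conv2d(C, C, kernel_size=3, padding=1) for _ in range(3)])
# One 7x7 conv with the same effective receptive field: 7*7*C*C = 49*C^2 weights
single = nn.Conv2d(C, C, kernel_size=7, padding=3)

weights = lambda m: sum(p.numel() for p in m.parameters() if p.dim() > 1)
print(weights(stack), weights(single))  # 110592 vs 200704, i.e. 27*C^2 vs 49*C^2
```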

Classification Framework

Training

  • Mini-batch gradient descent with momentum = 0.9
  • batch size = 256
  • Regularized by weight decay = 5x10^(-4)
  • The first two FC layers use dropout with p=0.5
  • The initial learning rate was set to 10^(-2) and decreased by a factor of 10 when the validation set accuracy stopped improving. In total, the learning rate was decreased 3 times, and learning was stopped after 370K iterations (74 epochs)
  • The weights were initialized by sampling from a normal distribution with zero mean and 10^(-2) variance. The biases were initialized to zero.
  • To obtain the fixed-size 224x224 ConvNet input images, they were randomly cropped from rescaled training images (one crop per image per SGD iteration). To further augment the training set, the crops underwent random horizontal flipping and random RGB colour shift. (A sketch of this training setup follows the list.)
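These hyper-parameters map almost one-to-one onto PyTorch built-ins. A minimal sketch under my own assumptions: a placeholder model stands in for the VGG net, and ReduceLROnPlateau approximates the paper's manual "drop the learning rate when validation accuracy plateaus" schedule.

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 2)  # placeholder; in practice this is the VGG net built below

# Mini-batch SGD: momentum 0.9, weight decay 5e-4, initial learning rate 1e-2
optimizer = optim.SGD(model.parameters(), lr=1e-2,
                      momentum=0.9, weight_decay=5e-4)
# Divide the learning rate by 10 whenever validation accuracy stops improving
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max', factor=0.1)

# Paper-style initialization: weights ~ N(0, 1e-2), i.e. std = 0.1; biases = 0
def init_weights(m):
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.normal_(m.weight, mean=0.0, std=0.1)
        nn.init.zeros_(m.bias)

model.apply(init_weights)
# After each validation pass: scheduler.step(val_accuracy)
```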

Implementation details

They achieved a speed-up of 3.75 times on an off-the-shelf 4-GPU system as compared to using a single GPU. On a system equipped with four NVIDIA Titan Black GPUs, training a single net took 2–3 weeks depending on the architecture.
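Their multi-GPU training is batch-level data parallelism: each batch is split across the GPUs, and the resulting gradients are averaged. In current PyTorch the same idea can be sketched with nn.DataParallel (DistributedDataParallel is the preferred modern route; the model here is again a placeholder):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # placeholder for a VGG net
if torch.cuda.device_count() > 1:
    # Splits each input batch across the available GPUs and gathers the
    # outputs; gradients are accumulated back on the default device
    model = nn.DataParallel(model).cuda()
```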

Classification Experiments

The dataset includes images of 1000 classes and is split into three sets: training (1.3M images), validation (50K images) and testing (100K images with held-out class labels).

Code Implementation Using PyTorch

Below is a PyTorch implementation of the VGG architecture. You can also cross-check the number of parameters of each VGG model.
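Since the original code embed is not reproduced here, the following is a minimal sketch in the spirit of the paper (the class and variable names are mine; configuration C is omitted because it mixes in 1x1 convolutions, and torchvision.models provides the canonical versions):

```python
import torch
import torch.nn as nn

# Conv widths per configuration; 'M' = 2x2 max-pool with stride 2
cfgs = {
    'A': [64, 'M', 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],
    'B': [64, 64, 'M', 128, 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],
    'D': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
          512, 512, 512, 'M', 512, 512, 512, 'M'],
    'E': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 256, 'M',
          512, 512, 512, 512, 'M', 512, 512, 512, 512, 'M'],
}

class VGG(nn.Module):
    def __init__(self, cfg, num_classes=1000):
        super().__init__()
        layers, in_ch = [], 3
        for v in cfg:
            if v == 'M':
                layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
            else:
                layers += [nn.Conv2d(in_ch, v, kernel_size=3, padding=1),
                           nn.ReLU(inplace=True)]
                in_ch = v
        self.features = nn.Sequential(*layers)
        # Five poolings shrink a 224x224 input to 7x7 with 512 channels
        self.classifier = nn.Sequential(
            nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)  # (N, 512, 7, 7) -> (N, 25088)
        return self.classifier(x)

# Cross-check the parameter counts (VGG-16, config D, should be around 138M)
for name, cfg in cfgs.items():
    n_params = sum(p.numel() for p in VGG(cfg).parameters())
    print(f'VGG-{name}: {n_params / 1e6:.1f}M parameters')
```

Running the loop should print roughly 133M, 133M, 138M and 144M parameters for A, B, D and E, which matches the figures in the paper's Table 2.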

That’s all for the key points I have put together. Hope you got it! :D

Thank you for reading!

Here’s my LinkedIn profile.
