Know About MobileNetV2 & Its Implementation from Scratch Using PyTorch

Sahil · Aug 24, 2022


Hi guys! In this blog, I will share what I learned after reading the MobileNetV2 research paper and what it is all about!

1. Abstract

  • This model is based on an inverted residual structure.
  • An inverted residual structure is one where the shortcut connections are between the thin bottleneck layers.
  • Inside each block, the intermediate expansion layer uses lightweight depthwise convolutions to filter features as a source of non-linearity.

2. Introduction

  • What did neural networks bring? Neural networks have revolutionised many areas of machine intelligence, enabling superhuman accuracy on challenging image recognition and classification tasks.
  • However? The drive to improve accuracy often comes at a cost: modern state-of-the-art networks require computational resources beyond the capabilities of many mobile and embedded applications.
  • Objective: This network pushes the state of the art for mobile applications by significantly decreasing the number of operations and the memory needed, while retaining the same accuracy.
  • Contribution: The Inverted Residual with linear bottleneck.

“The inverted residual with linear bottleneck” can be understood intuitively as follows (a small PyTorch sketch follows this list):
a. The module takes as input a low-dimensional compressed representation.
b. This input is first expanded to a high-dimensional space.
c. It is then filtered with a lightweight depthwise convolution.
d. Finally, it is projected back to a low-dimensional representation with a linear convolution.
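To make steps (a) to (d) concrete, here is a toy shape walkthrough in PyTorch. The 24 input/output channels are an assumed example; the expansion to 144 channels (factor 6) matches the expansion factor the paper settles on later.

```python
import torch
import torch.nn as nn

# Toy walkthrough of steps (a)-(d); 24 channels and the factor 6 are illustrative choices.
x = torch.randn(1, 24, 56, 56)                            # (a) low-dimensional compressed input

expand  = nn.Conv2d(24, 144, kernel_size=1)               # (b) 1x1 conv expands 24 -> 144 channels
dwise   = nn.Conv2d(144, 144, 3, padding=1, groups=144)   # (c) lightweight depthwise filtering
project = nn.Conv2d(144, 24, kernel_size=1)               # (d) 1x1 linear projection back down

y = project(dwise(torch.relu(expand(x))))                 # no non-linearity after the projection
print(y.shape)                                            # torch.Size([1, 24, 56, 56])
```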

Note: The whole research paper explains this contribution in more detail. If you understand the intuition behind it, you have understood the core of this paper.

  • This convolutional module is particularly suitable for mobile designs because it reduces the memory footprint needed during inference by never fully materialising large intermediate tensors.

3. Preliminaries

3.1 Depthwise Separable Convolution:

  • I have already explained depthwise separable convolutions in my MobileNetV1 blog; kindly visit it if you are not familiar with them.
  • Here, the kernel size is 3, so the computational cost is roughly 8 to 9 times lower than that of a standard convolution (see the sketch below).
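As a quick sanity check (my own sketch, not from the post), the snippet below compares the parameter counts of a standard 3x3 convolution and a depthwise separable one for some assumed channel sizes; the ratio works out to roughly 8x, in line with the figure above.

```python
import torch
import torch.nn as nn

in_ch, out_ch, k = 32, 64, 3          # assumed example sizes
x = torch.randn(1, in_ch, 56, 56)

# Standard convolution: every output channel looks at every input channel
standard = nn.Conv2d(in_ch, out_ch, k, padding=1, bias=False)

# Depthwise separable = depthwise (per-channel 3x3) + pointwise (1x1) convolution
separable = nn.Sequential(
    nn.Conv2d(in_ch, in_ch, k, padding=1, groups=in_ch, bias=False),   # depthwise
    nn.Conv2d(in_ch, out_ch, 1, bias=False),                           # pointwise
)

params = lambda m: sum(p.numel() for p in m.parameters())
print(standard(x).shape, separable(x).shape)          # same output shape
print(params(standard) / params(separable))           # ~7.9x fewer weights (and multiply-adds)
```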

3.2 Linear Bottleneck

  • ReLU is used as the non-linear activation.
  • The paper summarises two properties which indicate that the manifold of interest should lie in a low-dimensional subspace of the higher-dimensional activation space (a small numerical sketch follows this list):
  1. If the manifold of interest remains of non-zero volume after the ReLU transformation, then ReLU acts on it as a linear transformation.
  2. ReLU is capable of preserving complete information about the input manifold, but only if the input manifold lies in a low-dimensional subspace of the input space.
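Property 2 can be illustrated with a small numerical experiment (my own sketch in the spirit of the paper's Figure 1, not its actual code): embed a 2-D input "manifold" into an n-dimensional space with a random matrix, apply ReLU there, and project back with the pseudo-inverse. The sizes 3 and 30 are arbitrary example choices.

```python
import torch

torch.manual_seed(0)

x = torch.randn(1000, 2)                       # low-dimensional input "manifold"

for n in (3, 30):
    T = torch.randn(2, n)                      # random expansion into n dimensions
    y = torch.relu(x @ T)                      # ReLU applied in the higher-dimensional space
    x_rec = y @ torch.linalg.pinv(T)           # project back with the pseudo-inverse of T
    mse = (x - x_rec).pow(2).mean().item()
    print(f"n = {n:2d}  reconstruction error: {mse:.4f}")

# The reconstruction error is typically much smaller for the larger n: when the manifold
# lies in a low-dimensional subspace of a much bigger space, ReLU destroys little information.
```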

3.3 Inverted Residual

  • The bottleneck blocks appear similar to residual blocks, where each block contains an input followed by several bottlenecks and then an expansion.
  • They use shortcuts directly between the bottlenecks. A PyTorch sketch of this block is given after the figures below.
Fig 1: Illustration of Table 1 and Fig 2
Table 1
Fig 2: Diagram as per stride value in DW Convolution
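Putting Sections 3.1 to 3.3 together, here is a minimal PyTorch sketch of the bottleneck block from Table 1 and Fig 2. This is my reading of the structure (1x1 expansion, 3x3 depthwise, 1x1 linear projection), not the paper's reference code; the shortcut is added only for stride-1 blocks whose input and output channel counts match.

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Inverted residual block: expand (1x1) -> depthwise (3x3) -> linear project (1x1)."""

    def __init__(self, in_ch, out_ch, stride, expand_ratio):
        super().__init__()
        hidden = in_ch * expand_ratio
        # shortcut only when spatial size and channel count are preserved
        self.use_res_connect = stride == 1 and in_ch == out_ch

        layers = []
        if expand_ratio != 1:
            # 1x1 expansion (skipped when t = 1, as noted in Section 4)
            layers += [nn.Conv2d(in_ch, hidden, 1, bias=False),
                       nn.BatchNorm2d(hidden),
                       nn.ReLU6(inplace=True)]
        layers += [
            # 3x3 depthwise convolution, stride 1 or 2
            nn.Conv2d(hidden, hidden, 3, stride, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 1x1 linear projection back to a low-dimensional representation (no ReLU here)
            nn.Conv2d(hidden, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        ]
        self.conv = nn.Sequential(*layers)

    def forward(self, x):
        out = self.conv(x)
        return x + out if self.use_res_connect else out

block = InvertedResidual(24, 24, stride=1, expand_ratio=6)
print(block(torch.randn(1, 24, 56, 56)).shape)    # torch.Size([1, 24, 56, 56])
```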

4. Model Architecture

  • The basic building block is a bottleneck depthwise-separable convolution with residuals. (The detailed structure is shown in Table 1 and Fig 1.)
Table 2: Architecture of MobileNetV2
  • Initially, there is a fully convolutional layer with 32 filters, followed by 19 residual bottleneck layers.
  • ReLU6 is used as the non-linearity because of its robustness with low-precision computation.
  • It uses a 3x3 kernel size, as is standard for modern networks, and utilises dropout and batch normalisation during training.
  • The expansion factor (t) is 6.

In Table 2,

  • each line describes a sequence of 1 or more identical layers, repeated n times.
  • All layers in the same sequence have the same number c of output channels.
  • The first layer of each sequence has a stride s and all others use stride 1.
  • Expansion (t) is always applied to the input size. One point to note: when the expansion factor (t) is 1, the first 1x1 expansion convolution of the bottleneck (in Fig 2) is not required.
  • They applied the width multiplier to all layers except the very last convolutional layer, which improves performance for smaller models.

5. Coding
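The title promises an implementation from scratch, so here is a minimal sketch of the full network, assuming the `InvertedResidual` class from the Section 3.3 sketch above is already in scope. The (t, c, n, s) rows follow Table 2 of the paper; the width multiplier mentioned earlier is omitted for brevity, and hyperparameters such as the dropout rate are my own assumptions.

```python
import torch
import torch.nn as nn

class MobileNetV2(nn.Module):
    # (t, c, n, s) per Table 2: expansion factor, output channels, repeats, stride of first layer
    cfg = [
        (1,  16, 1, 1),
        (6,  24, 2, 2),
        (6,  32, 3, 2),
        (6,  64, 4, 2),
        (6,  96, 3, 1),
        (6, 160, 3, 2),
        (6, 320, 1, 1),
    ]

    def __init__(self, num_classes=1000):
        super().__init__()
        # stem: fully convolutional layer with 32 filters, stride 2
        layers = [nn.Conv2d(3, 32, 3, stride=2, padding=1, bias=False),
                  nn.BatchNorm2d(32),
                  nn.ReLU6(inplace=True)]
        in_ch = 32
        for t, c, n, s in self.cfg:
            for i in range(n):
                stride = s if i == 0 else 1      # only the first layer of a sequence uses stride s
                layers.append(InvertedResidual(in_ch, c, stride, expand_ratio=t))
                in_ch = c
        # final 1x1 convolution up to 1280 channels before the classifier
        layers += [nn.Conv2d(in_ch, 1280, 1, bias=False),
                   nn.BatchNorm2d(1280),
                   nn.ReLU6(inplace=True)]
        self.features = nn.Sequential(*layers)
        self.classifier = nn.Sequential(nn.Dropout(0.2),          # dropout rate assumed
                                        nn.Linear(1280, num_classes))

    def forward(self, x):
        x = self.features(x)
        x = x.mean(dim=[2, 3])                   # global average pooling
        return self.classifier(x)

model = MobileNetV2()
print(model(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 1000])
```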

6. Experiment and Result

6.1 Performance on ImageNet and Comparison for Different Networks

6.2 Performance comparison of MobileNetV2 with SSDLite against other real-time detectors on COCO object detection

That’s all folks!

Thanks for reading it. Happy Learning! :D

Here is my LinkedIn profile.
