MobileNet v1: Implementation From Scratch Using PyTorch

Sahil · Mar 31, 2022

Hi guys! In this blog, I will share what I learned from reading the MobileNet v1 research paper and what it is all about!

1. Abstract

  • This network was designed for mobile and embedded vision applications.
  • It uses depth-wise separable convolutions to build lightweight deep neural networks.
  • The paper introduces two global hyper-parameters that trade off latency against accuracy. They allow the model builder to choose the right-sized model for their application based on the constraints of the problem.

2. Introduction

  • It is true that making networks deeper and more complicated tends to achieve higher accuracy.
  • However, such networks are not necessarily more efficient with respect to speed and size. Real-world applications such as robotics and self-driving cars need to run on computationally limited platforms.

3. Prior Work

  • MobileNets primarily focus on optimizing for latency while also yielding small networks. Most prior papers only considered size and never looked at speed.
  • As stated in the abstract, MobileNet uses depth-wise separable convolutions.

4. Computational cost of standard convolution vs. depth-wise separable convolution

Let's say the input feature map has M channels, each kernel has size D_K × D_K × M, there are N such filters, and each filter slides vertically and horizontally D_F times (with stride 1 and "same" padding, D_F is simply the spatial size of the feature map).

In standard convolution,

Input * Kernel = Output
  • Number of multiplications in one convolution operation (one kernel position): the same as the kernel size, i.e. D_K × D_K × M.
  • Since there are N filters and each filter slides vertically and horizontally D_F times, the total number of multiplications is:

Total cost of standard convolution = D_K × D_K × M × D_F × D_F × N

A depth-wise separable convolution performs two operations: a depth-wise convolution followed by a point-wise convolution.

In the depth-wise convolution, the convolution is applied to a single channel at a time, i.e. each input channel gets its own D_K × D_K filter. This corresponds to setting N = 1; the rest of the reasoning above stays the same.

So, the total number of multiplications in the depth-wise convolution (put N = 1 in the standard convolution cost):

Cost of depth-wise convolution = D_K × D_K × M × D_F × D_F

In the point-wise convolution, a 1×1 convolution is applied across all M channels, with N filters.

Total number of multiplications in the point-wise convolution (put D_K = 1 in the standard convolution cost):

Cost of point-wise convolution = M × D_F × D_F × N

So, the total cost of multiplications in a depth-wise separable convolution:

Total cost of depth-wise separable convolution = D_K × D_K × M × D_F × D_F + M × D_F × D_F × N

The above explanation is the same as the derivation given in the research paper.

Figure: definition of depth-wise separable convolution (from the paper)
Figure: illustration of standard convolution vs. depth-wise separable convolution (from the paper)

You can also see the cost reduction directly: dividing the depth-wise separable cost by the standard cost gives 1/N + 1/D_K², so with 3×3 kernels MobileNet uses roughly 8 to 9 times fewer multiplications, as proved mathematically above.
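As a quick numeric sanity check of these formulas, here is a small Python snippet. The specific layer sizes are my own illustrative choice, roughly matching a typical inner MobileNet layer:

```python
# Sanity-check the cost formulas on a typical inner layer:
# feature map 14x14, M = 512 input channels, N = 512 filters, 3x3 kernels.
D_F, M, N, D_K = 14, 512, 512, 3

standard = D_K * D_K * M * N * D_F * D_F                    # standard conv
separable = D_K * D_K * M * D_F * D_F + M * N * D_F * D_F   # dw + pw

print(standard, separable, round(standard / separable, 2))
# 462422016 52283392 8.84  -> the ratio equals 1 / (1/N + 1/D_K**2)
```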

5. Network Architecture and Training

Table 1: MobileNet Architecture
  • All layers are followed by BatchNorm and a ReLU activation, except the final fully connected (FC) layer.
  • The final layer has no activation function and feeds into a softmax layer for classification.
  • Since we are using depth-wise separable convolutions, each block follows the right-side structure in the figure below.
Left side: standard convolution block. Right side: depth-wise separable convolution block
  • Down-sampling is handled with strided convolution in the depth-wise convolutions, as well as in the first layer.
  • A final average pooling reduces the spatial resolution to 1 before the FC layer.

Counting the depth-wise and point-wise convolutions as separate layers, MobileNet has 28 layers.

If you observe the MobileNet architecture, most of the computation (and most of the parameters) sits in the 1×1 convolutions; the snippet below the table confirms this for the parameter counts.

Table: resource utilization per layer type (from the paper)
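To see this concretely, here is a small sketch that tallies parameters by layer type. It uses the MobileNetV1 class defined in Section 5.1 below, and the grouping labels are my own:

```python
import torch
from collections import Counter

model = MobileNetV1()  # class from Section 5.1 below
counts = Counter()
for m in model.modules():
    if isinstance(m, torch.nn.Conv2d):
        if m.kernel_size == (1, 1):
            kind = "conv 1x1 (point-wise)"
        elif m.groups > 1:
            kind = "conv 3x3 depth-wise"
        else:
            kind = "conv 3x3 standard"
        counts[kind] += sum(p.numel() for p in m.parameters())
    elif isinstance(m, torch.nn.Linear):
        counts["fc"] += sum(p.numel() for p in m.parameters())

total = sum(counts.values())
for kind, n in counts.items():
    print(f"{kind}: {100 * n / total:.1f}% of parameters")
# The 1x1 point-wise convolutions hold roughly three quarters of all parameters.
```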

During training…

  • They used less regularization and data augmentation, because small models have less trouble with overfitting.
  • No label smoothing.
  • Very little or no weight decay (L2 regularization) on the depth-wise filters, since there are so few parameters in them; a sketch of this is shown below.
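Here is a minimal sketch of how one might exclude the depth-wise filters from weight decay in PyTorch. The paper trained with RMSprop, but the learning rate and weight decay values below are illustrative assumptions, not the paper's; model is the MobileNetV1 instance from the previous snippet:

```python
import torch

# Split parameters: depth-wise convs (groups == in_channels) get no
# weight decay; everything else gets a small L2 penalty.
decay, no_decay = [], []
for m in model.modules():
    is_dw = (isinstance(m, torch.nn.Conv2d)
             and m.groups == m.in_channels and m.groups > 1)
    target = no_decay if is_dw else decay
    target += [p for p in m.parameters(recurse=False)]

optimizer = torch.optim.RMSprop(
    [{"params": decay, "weight_decay": 4e-5},     # small L2 elsewhere
     {"params": no_decay, "weight_decay": 0.0}],  # none on dw filters
    lr=0.045,  # illustrative value only
)
```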

5.1 Code For MobileNet Architecture (as per Table 1)
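Here is a minimal PyTorch sketch of the Table 1 architecture. The names nlayer_filter and layer_construct are the ones referenced in Section 5.3 below; the class name and overall layout are my own choices, so treat this as one possible implementation:

```python
import torch
import torch.nn as nn


class MobileNetV1(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        # (in_channels, out_channels, stride) for every depth-wise
        # separable block in Table 1; the five repeated 512 -> 512
        # blocks are expanded via list multiplication.
        self.nlayer_filter = [
            (32, 64, 1), (64, 128, 2), (128, 128, 1), (128, 256, 2),
            (256, 256, 1), (256, 512, 2),
        ] + [(512, 512, 1)] * 5 + [
            (512, 1024, 2),
            (1024, 1024, 1),  # Table 1 lists s2 here; s1 keeps the 7x7 map
        ]

        # First layer: a standard 3x3 convolution with stride 2.
        layers = [
            nn.Conv2d(3, 32, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
        ]
        for in_c, out_c, stride in self.nlayer_filter:
            layers += self.layer_construct(in_c, out_c, stride)
        self.features = nn.Sequential(*layers)

        self.pool = nn.AdaptiveAvgPool2d(1)     # global average pooling to 1x1
        self.fc = nn.Linear(1024, num_classes)  # final FC, no activation

    def layer_construct(self, in_c, out_c, stride):
        # One depth-wise separable block: 3x3 depth-wise conv
        # (groups=in_c) then 1x1 point-wise conv, each followed by
        # BatchNorm + ReLU, as in the right side of the figure above.
        return [
            nn.Conv2d(in_c, in_c, 3, stride=stride, padding=1,
                      groups=in_c, bias=False),
            nn.BatchNorm2d(in_c),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_c, out_c, 1, bias=False),
            nn.BatchNorm2d(out_c),
            nn.ReLU(inplace=True),
        ]

    def forward(self, x):
        x = self.pool(self.features(x))
        return self.fc(torch.flatten(x, 1))  # logits; softmax lives in the loss


# Quick shape check:
# MobileNetV1()(torch.randn(1, 3, 224, 224)).shape -> torch.Size([1, 1000])
```

Counting the 13 depth-wise and 13 point-wise convolutions separately, plus the first convolution and the FC layer, gives the 28 layers mentioned above.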

5.2 Thinner Model: Width Multiplier

  • To construct smaller and less computationally expensive models, they introduce a simple parameter α called the width multiplier.
  • For a given α, we simply multiply the input and output channel counts by α. The total computational cost of a depth-wise separable convolution with the width multiplier becomes:

Cost with width multiplier = D_K × D_K × αM × D_F × D_F + αM × αN × D_F × D_F

Note: just replace M → αM and N → αN.

  • The width multiplier satisfies α ∈ (0, 1]. α = 1 is the baseline MobileNet model (Table 1), and α < 1 gives reduced MobileNets. Typically, α is set to 1, 0.75, 0.5 or 0.25.
  • Reducing α cuts the computational cost and the number of parameters by roughly α², as the quick check below shows.
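A quick arithmetic check of the α² claim, reusing the illustrative layer sizes from Section 4:

```python
# Same example layer as in Section 4, now with width multiplier 0.5.
D_F, M, N, D_K, alpha = 14, 512, 512, 3, 0.5

def cost(m, n):
    # depth-wise + point-wise multiplications
    return D_K * D_K * m * D_F * D_F + m * n * D_F * D_F

print(cost(alpha * M, alpha * N) / cost(M, N))
# ~0.254, close to alpha**2 = 0.25
```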

5.3 Code for MobileNets with width multiplier ‘α’

  • Just apply the width multiplier to the first convolution layer and to self.nlayer_filter.
  • In the layer_construct function, cast the channel counts with int(), because multiplying by a decimal returns decimals, while the Conv module only takes int() values for its input and output channel arguments.
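Here is one possible sketch of those two changes applied to the Section 5.1 class; the class name MobileNetV1Width is my own:

```python
import torch
import torch.nn as nn


class MobileNetV1Width(nn.Module):
    def __init__(self, num_classes=1000, alpha=1.0):
        super().__init__()
        cfg = [(32, 64, 1), (64, 128, 2), (128, 128, 1), (128, 256, 2),
               (256, 256, 1), (256, 512, 2)] + [(512, 512, 1)] * 5 \
            + [(512, 1024, 2), (1024, 1024, 1)]
        # Width multiplier applied to every channel count.
        self.nlayer_filter = [(alpha * i, alpha * o, s) for i, o, s in cfg]

        first_out = int(alpha * 32)  # width multiplier on the first conv too
        layers = [nn.Conv2d(3, first_out, 3, stride=2, padding=1, bias=False),
                  nn.BatchNorm2d(first_out), nn.ReLU(inplace=True)]
        for in_c, out_c, stride in self.nlayer_filter:
            layers += self.layer_construct(in_c, out_c, stride)
        self.features = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(int(alpha * 1024), num_classes)

    def layer_construct(self, in_c, out_c, stride):
        # int() here: alpha-scaled channel counts are floats, but Conv2d
        # requires integer channel arguments.
        in_c, out_c = int(in_c), int(out_c)
        return [nn.Conv2d(in_c, in_c, 3, stride=stride, padding=1,
                          groups=in_c, bias=False),
                nn.BatchNorm2d(in_c), nn.ReLU(inplace=True),
                nn.Conv2d(in_c, out_c, 1, bias=False),
                nn.BatchNorm2d(out_c), nn.ReLU(inplace=True)]

    def forward(self, x):
        x = self.pool(self.features(x))
        return self.fc(torch.flatten(x, 1))


# e.g. MobileNetV1Width(alpha=0.5) builds the 0.5-width reduced model.
```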

5.4 Reduced Representation: Resolution Multiplier

  • The second hyper-parameter is the resolution multiplier ρ, which reduces the input image size.
  • When the input image size is reduced, the feature map of every subsequent layer shrinks by the same multiplier.
  • The total cost of a depth-wise separable convolution with both the width multiplier and the resolution multiplier:

Cost with both multipliers = D_K × D_K × αM × ρD_F × ρD_F + αM × αN × ρD_F × ρD_F

(Note: ρ multiplies the feature map size D_F in the equation above.)

  • The resolution multiplier satisfies ρ ∈ (0, 1]. ρ = 1 is the baseline MobileNet model, and ρ < 1 gives computationally reduced models. In practice, ρ is set implicitly by choosing the input resolution as 224, 192, 160 or 128.

(Note: when the resolution multiplier is applied, only the computation (millions of multiply-add operations) is reduced, not the number of parameters!)

In code, you simply resize the input image before feeding it to the model, as in the sketch below.
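A minimal sketch using torchvision's Resize, assuming the MobileNetV1Width class from Section 5.3:

```python
import torch
import torchvision.transforms as T

# The resolution multiplier is applied implicitly by resizing the
# input, e.g. rho = 160/224.
model = MobileNetV1Width(alpha=0.5)
x = torch.randn(1, 3, 224, 224)
out = model(T.Resize((160, 160))(x))  # rho * 224 = 160
print(out.shape)  # torch.Size([1, 1000])
```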

6. Experiment

6.1 Comparison of parameters: standard convolution vs. depth-wise separable convolution

From the observation above, the depth-wise separable model reaches nearly 70% accuracy, only about 1% below the standard convolution version, with about 7× fewer parameters.

6.2 Comparison of different width multipliers at the same image resolution (224)

6.3 Comparison of different input resolutions at the same width multiplier (α = 1)

6.4 Comparison with previously known models

  • MobileNet is nearly as accurate as VGG16 while being 32 times smaller and 27 times less compute-intensive.
  • It is more accurate than GoogLeNet while being smaller and requiring more than 2.5 times less computation.

6.5 Comparison with smaller neural networks

  • They also evaluated a reduced MobileNet with width multiplier α = 0.5 and reduced resolution 160 × 160.
  • This reduced MobileNet is 4% better than AlexNet while being 45× smaller and using 9.4× less compute.
  • It is also 4% better than SqueezeNet at about the same size and with 22× less computation.

That’s all folks!

Thanks for reading it. Happy Learning! :D

Here is my LinkedIn profile.
