Hi guys! In this blog, I will share what I learned from reading this research paper and what it is all about. Before I proceed, I want you to know that I didn’t study it very extensively. My goal was only to understand:
- What is this research paper all about?
- How is it different from the previous state-of-the-art model?
- What were the results of this novel approach compared to the previous ones?
So, all of these are written here as key points.
**It is an extension of GoogLeNet (Inception) that introduces Batch Normalization**
- What was the difficulty until now? Training deep neural networks is complicated by the fact that the distribution of each layer’s inputs changes during training as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. This phenomenon is referred to as internal covariate shift.
- How to address it? By normalizing layer inputs, which is called Batch Normalization (BN). BN allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout.
- What does it achieve? Applied to a state-of-the-art image classification model, BN achieves the same accuracy with 14 times fewer training steps and beats the original model by a significant margin.
- SGD is good but...? While stochastic gradient descent (SGD) is simple and effective, it requires careful tuning of the model hyperparameters, specifically the learning rate used in optimization as well as the initial values for the model parameters.
- How is training affected? The inputs to each layer are affected by the parameters of all preceding layers, so small changes to the network parameters amplify as the network becomes deeper.
- We already know why the ReLU activation is used: to avoid vanishing gradients (according to this paper). However, if we could ensure that the distribution of nonlinearity inputs remains more stable as the network trains, then the optimizer would be less likely to get stuck in the saturated regime and the training would accelerate.
- How is internal covariate shift defined? The change in the distributions of internal nodes of a deep network, in the course of training, is called Internal Covariate Shift.
Towards Reducing Internal Covariate Shift
The first-cut approach was:
- By fixing the distribution of the layer input x as the training progresses, we expect to improve the training speed.
- It has long been known that network training converges faster if its inputs are whitened (that is, linearly transformed to have zero means and unit variances, and decorrelated).
- As each layer observes the input produced by the layers below, it would be advantageous to achieve the same whitening of the inputs of each layer.
- By whitening the inputs to each layer, we would take a step towards achieving fixed distributions of inputs, which would remove the ill effects of internal covariate shift.
Drawback of this first-cut approach
Whitening the layer inputs is expensive, as it requires computing:
- the covariance matrix,
- its inverse square root, to produce the whitened activations,
- as well as the derivatives of these transforms for backpropagation.
What were we looking for? An alternative that performs input normalization in a way that is differentiable and does not require an analysis of the entire training set after every parameter update.
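To make the cost concrete, here is a minimal NumPy sketch of full whitening for one layer's inputs. The data, the mixing matrix and the `whiten` helper are my own illustration (not from the paper); the point is only that it needs a d x d covariance matrix and its inverse square root.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical layer inputs: 500 examples, 4 correlated features.
mixing = np.array([[1.0, 0.5, 0.0, 0.0],
                   [0.0, 1.0, 0.5, 0.0],
                   [0.0, 0.0, 1.0, 0.5],
                   [0.0, 0.0, 0.0, 1.0]])
x = rng.normal(size=(500, 4)) @ mixing + 3.0

def whiten(x):
    """Full (ZCA) whitening: zero mean, unit variance, decorrelated features."""
    centered = x - x.mean(axis=0)
    cov = centered.T @ centered / x.shape[0]      # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigendecomposition of the covariance
    inv_sqrt = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T  # inverse square root
    return centered @ inv_sqrt

x_white = whiten(x)
cov_white = x_white.T @ x_white / x_white.shape[0]  # should be close to the identity
```

Doing this for every layer, after every parameter update, and also backpropagating through the eigendecomposition is what makes the first-cut approach impractical.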
Normalization via Mini-Batch Statistics
- Because of the above drawback, they make two necessary simplifications.
- First, instead of whitening the features in layer inputs and outputs jointly, they normalize each scalar feature independently, so that it has a mean of zero and a variance of one. For a layer with d-dimensional input x = (x(1), ..., x(d)), each dimension is normalized as x̂(k) = (x(k) − E[x(k)]) / √Var[x(k)].
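Can this change what the layer can represent? Note that simply normalizing each input of a layer may change what the layer can represent. For instance, normalizing the inputs of a sigmoid would constrain them to the linear regime of the nonlinearity.
Objective: To make sure that the transformation inserted in the network can represent the identity transform.
To accomplish this, they introduce, for each activation x(k), a pair of learnable parameters γ(k) and β(k) that scale and shift the normalized value: y(k) = γ(k) · x̂(k) + β(k). Setting γ(k) = √Var[x(k)] and β(k) = E[x(k)] recovers the original activations.
Normalizing with statistics of the entire training set would work if each training step used the whole training set, but it is impractical when using stochastic optimization.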
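- Second, since we use mini-batches in stochastic gradient training, each mini-batch produces estimates of the mean and variance of each activation.
- Consider a mini-batch B of size m. Since each activation is normalized independently, we focus on a single activation x: B = {x1, ..., xm}.
- Normalized values: x̂1, ..., x̂m, with their linear transformations y1, ..., ym. The transform that maps x1, ..., xm to y1, ..., ym is called the Batch Normalizing Transform.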
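BN Transform Algorithm, applied to activation x over a mini-batch (Algorithm 1 in the paper):
- Mini-batch mean: μB = (1/m) · Σ xi
- Mini-batch variance: σB² = (1/m) · Σ (xi − μB)²
- Normalize: x̂i = (xi − μB) / √(σB² + ε)
- Scale and shift: yi = γ · x̂i + β
The parameters γ and β need to be learned during training, and the rest of the input values are fixed (the input dataset obviously remains unchanged). The constant ε is added to the variance for numerical stability.
The values of any x̂ have an expected value of 0 and a variance of 1, as long as the elements of each mini-batch are sampled from the same distribution and we neglect ε.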
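The BN transform over a mini-batch can be sketched in a few lines of NumPy. This is a minimal illustration, assuming activations are stored as an array of shape (m, number of features); the function name `batch_norm_forward` and the sample data are my own choices.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """BN transform for a mini-batch x of shape (m, num_features)."""
    mu = x.mean(axis=0)                     # mini-batch mean, per feature
    var = x.var(axis=0)                     # mini-batch (biased) variance, per feature
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize
    y = gamma * x_hat + beta                # scale and shift with learned parameters
    return y, x_hat, mu, var

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 3))   # hypothetical activations
gamma, beta = np.ones(3), np.zeros(3)
y, x_hat, mu, var = batch_norm_forward(x, gamma, beta)

# Choosing gamma = sqrt(var + eps) and beta = mu undoes the normalization,
# so the transform can represent the identity.
y_identity, _, _, _ = batch_norm_forward(x, np.sqrt(var + 1e-5), mu)
```

Each column of `x_hat` has (almost exactly) zero mean and unit variance, while `y_identity` reproduces `x`.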
What about the gradient during backpropagation?
With full whitening, this was expensive and not everywhere differentiable.
In this case: during training, we need to backpropagate the gradient of the loss l through the BN transformation, as well as compute the gradients with respect to the parameters γ and β of the BN transform. Both follow from the chain rule.
Thus, the BN transform is a differentiable transformation that introduces normalized activations into the network. This ensures that as the model is training, layers can continue learning on input distributions that exhibit less internal covariate shift, thus accelerating the training.
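As a sanity check of the differentiability claim, here is a small NumPy sketch of the backward pass obtained by applying the chain rule to the BN transform, compared against a numerical gradient. The toy loss `sum(y * w)` and all variable names are assumptions for illustration only.

```python
import numpy as np

def bn_forward(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta, (x, x_hat, mu, var, gamma, eps)

def bn_backward(dy, cache):
    """Chain rule through the BN transform; dy is dl/dy with shape (m, d)."""
    x, x_hat, mu, var, gamma, eps = cache
    m = x.shape[0]
    inv_std = 1.0 / np.sqrt(var + eps)
    dx_hat = dy * gamma
    dvar = np.sum(dx_hat * (x - mu), axis=0) * -0.5 * inv_std ** 3
    dmu = np.sum(-dx_hat * inv_std, axis=0) + dvar * np.mean(-2.0 * (x - mu), axis=0)
    dx = dx_hat * inv_std + dvar * 2.0 * (x - mu) / m + dmu / m
    dgamma = np.sum(dy * x_hat, axis=0)
    dbeta = np.sum(dy, axis=0)
    return dx, dgamma, dbeta

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))
gamma, beta = rng.normal(size=3), rng.normal(size=3)
w = rng.normal(size=(8, 3))   # hypothetical toy loss: l = sum(y * w), so dl/dy = w

def loss_fn(x_in):
    y, _ = bn_forward(x_in, gamma, beta)
    return np.sum(y * w)

_, cache = bn_forward(x, gamma, beta)
dx, dgamma, dbeta = bn_backward(w, cache)

# Numerical gradient with respect to x by central differences
h = 1e-5
dx_num = np.zeros_like(x)
for i in range(x.shape[0]):
    for j in range(x.shape[1]):
        step = np.zeros_like(x)
        step[i, j] = h
        dx_num[i, j] = (loss_fn(x + step) - loss_fn(x - step)) / (2 * h)
```

The analytic `dx` should agree with `dx_num` up to small numerical error.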
Training and Inference with Batch-Normalized Networks
Since the means and variances are fixed during inference, the normalization is simply a linear transform applied to each activation.
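Lines 1–6 of the training algorithm (Algorithm 2 in the paper) cover the training part. While training, the weights are updated on the basis of the loss l. As we have seen above, the BN transform is differentiable, so γ and β are also updated w.r.t. the loss l.
Lines 7–12 cover the inference part. The mean and variance are no longer computed per mini-batch. Instead, they are estimated by averaging over multiple training mini-batches of size m: E[x] = E_B[μB] and Var[x] = m/(m − 1) · E_B[σB²]. Each BN transform is then replaced by the fixed linear transform y = γ/√(Var[x] + ε) · x + (β − γ · E[x]/√(Var[x] + ε)), using the trained γ and β values.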
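Here is a rough NumPy sketch of the inference side, assuming the population statistics are obtained by averaging mini-batch statistics as in the paper. The values of γ and β and the synthetic mini-batches are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
eps, m = 1e-5, 32
gamma, beta = np.array([1.5, 0.5]), np.array([0.1, -0.2])   # pretend these are trained

# Average mini-batch statistics over several (synthetic) training mini-batches
batches = [rng.normal(loc=[2.0, -1.0], scale=[3.0, 0.5], size=(m, 2)) for _ in range(200)]
pop_mean = np.mean([b.mean(axis=0) for b in batches], axis=0)               # E[x]
pop_var = m / (m - 1) * np.mean([b.var(axis=0) for b in batches], axis=0)   # unbiased Var[x]

# Fold BN into one fixed linear transform: y = scale * x + shift
scale = gamma / np.sqrt(pop_var + eps)
shift = beta - scale * pop_mean

def bn_inference(x):
    return scale * x + shift

x_single = np.array([[2.0, -1.0]])   # works even for a single example
y_single = bn_inference(x_single)
```

Because `scale` and `shift` are constants, the output for an example no longer depends on which other examples are in the batch.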
Batch-Normalized Convolutional Networks
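A layer in a traditional neural network generally consists of an affine transformation followed by an element-wise nonlinearity: z = g(Wu + b), where W and b are learned parameters and g is a nonlinearity such as sigmoid or ReLU.
We add the BN transform immediately before the nonlinearity, by normalizing x = Wu + b. Since the bias b is cancelled by the mean subtraction (its role is taken over by β), the layer becomes z = g(BN(Wu)).
For convolutional layers, we normalize all the activations in a mini-batch over all locations, so that different elements of the same feature map are normalized in the same way. That is, instead of a single value x per example, we take all values of a 2D feature map of size p x q. For a mini-batch of size m, the effective mini-batch size becomes |B| = m · pq. We learn a pair of parameters γ and β per feature map, rather than per activation.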
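For the convolutional case, a minimal NumPy sketch could look like this. I am assuming the common (batch, channels, height, width) layout, which is my choice and not something the paper prescribes.

```python
import numpy as np

def batch_norm_conv(x, gamma, beta, eps=1e-5):
    """x: (m, channels, p, q). Statistics are shared over batch and spatial axes."""
    mu = x.mean(axis=(0, 2, 3), keepdims=True)    # one mean per feature map
    var = x.var(axis=(0, 2, 3), keepdims=True)    # one variance per feature map
    x_hat = (x - mu) / np.sqrt(var + eps)
    # one (gamma, beta) pair per feature map, broadcast over batch and locations
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

rng = np.random.default_rng(0)
m, channels, p, q = 4, 3, 5, 5
x = rng.normal(loc=2.0, scale=4.0, size=(m, channels, p, q))
gamma, beta = np.ones(channels), np.zeros(channels)
y = batch_norm_conv(x, gamma, beta)
effective_batch = m * p * q   # number of values each statistic is computed from
```

With these sizes, each feature map is normalized using 4 · 5 · 5 = 100 values, and there are only 3 pairs of γ and β.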
Batch Normalization enables higher learning rates
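In traditional deep networks, a learning rate that is too high may result in gradients that explode or vanish, as well as in getting stuck in poor local minima.
Normally, large learning rates may increase the scale of layer parameters, which then amplifies the gradient during backpropagation and leads to model explosion.
However, with BN, backpropagation through a layer is unaffected by the scale of its parameters. Let's say we scale the weights by a scalar ‘a’. Then BN(Wu) = BN((aW)u).
The gradients satisfy ∂BN((aW)u)/∂u = ∂BN(Wu)/∂u and ∂BN((aW)u)/∂(aW) = (1/a) · ∂BN(Wu)/∂W. The scale does not affect the gradient propagated to the layer below, and larger weights lead to smaller gradients, so BN will stabilize the parameter growth.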
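This scale invariance is easy to check numerically. The sketch below uses normalization only (γ = 1, β = 0) and sets ε to 0 so the invariance is exact; the batch is stored row-wise, so Wu is written as `u @ W`. The helper names and the toy loss are assumptions for illustration.

```python
import numpy as np

def bn(x, eps=0.0):
    """Normalization only (gamma = 1, beta = 0); eps = 0 makes the invariance exact."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def grad_wrt_W(W, u, upstream, h=1e-6):
    """Numerical gradient of sum(bn(u @ W) * upstream) with respect to W."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        step = np.zeros_like(W)
        step[idx] = h
        plus = np.sum(bn(u @ (W + step)) * upstream)
        minus = np.sum(bn(u @ (W - step)) * upstream)
        grad[idx] = (plus - minus) / (2 * h)
    return grad

rng = np.random.default_rng(0)
u = rng.normal(size=(16, 4))          # layer inputs, one example per row
W = rng.normal(size=(4, 3))           # layer weights
upstream = rng.normal(size=(16, 3))   # hypothetical gradient arriving from above
a = 10.0

out = bn(u @ W)
out_scaled = bn(u @ (a * W))               # same output after scaling W by a
g = grad_wrt_W(W, u, upstream)
g_scaled = grad_wrt_W(a * W, u, upstream)  # gradient shrinks by a factor of a
```

So making the weights 10 times larger leaves the output unchanged and makes the weight gradient 10 times smaller.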
Batch Normalization regularizes the model
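When training with BN, a training example is seen together with the other examples in its mini-batch, so the network no longer produces deterministic values for a given training example. They found this effect advantageous to the generalization of the network. Whereas Dropout is typically used to reduce overfitting, in a batch-normalized network they found that it can be either removed or reduced in strength.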
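A tiny NumPy sketch of where this regularizing noise comes from: in training mode, the normalized value of one and the same example depends on the other examples that happen to be in its mini-batch. The example values and batch size below are made up.

```python
import numpy as np

def bn_train(x, eps=1e-5):
    """Training-mode normalization using the statistics of the given mini-batch."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

rng = np.random.default_rng(0)
example = np.array([[1.0, 2.0]])                         # the same training example
batch_a = np.vstack([example, rng.normal(size=(7, 2))])  # placed in one mini-batch
batch_b = np.vstack([example, rng.normal(size=(7, 2))])  # and in another one

out_a = bn_train(batch_a)[0]   # normalized example, given batch A
out_b = bn_train(batch_b)[0]   # normalized example, given batch B
```

`out_a` and `out_b` differ even though the input example is identical, which acts a bit like noise injected during training.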
Accelerating BN networks
- Increase learning rate: achieve a training speedup from higher learning rates with no ill effects.
- Remove Dropout: removing dropout from Modified BN-Inception speeds up training, without increasing overfitting
- Reduce the L2 weight regularization: in Modified BN-Inception, the weight of the L2 loss is reduced by a factor of 5.
- Accelerate the learning rate decay: because the BN network trains faster than Inception (without BN), they lower the learning rate 6 times faster.
- Remove Local Response Normalization: while Inception and other networks benefit from it, with BN it is not necessary.
- Shuffle training examples more thoroughly.
- Reduce the photometric distortions.
The models are described as follows:
- Inception: the network without BN, trained with an initial learning rate of 15e-4 (0.0015).
- BN-Baseline: Inception with BN.
- BN-x5: Inception with BN, with the initial learning rate increased by a factor of 5, i.e. 75e-4 (0.0075).
- BN-x30: Inception with BN, with the initial learning rate increased by a factor of 30, i.e. 45e-3 (0.045).
- BN-x5-Sigmoid: Like BN-x5 but with sigmoid activation function rather than ReLU.
Interestingly, increasing the learning rate further (BN-x30) causes the model to train somewhat slower initially, but allows it to reach higher accuracy.
Code Implementation Using Pytorch
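Below is a minimal PyTorch sketch of a batch-normalized convolutional layer inside a toy network. This is only an illustration and not the BN-Inception model from the paper: the `SmallBNNet` class, the layer sizes, the random data and the learning rate are all assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class SmallBNNet(nn.Module):
    """Hypothetical toy network: conv -> BN -> ReLU -> pooling -> linear."""
    def __init__(self, num_classes=10):
        super().__init__()
        # bias=False: the mean subtraction in BN would cancel a conv bias anyway
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(8)          # one (gamma, beta) pair per feature map
        self.relu = nn.ReLU()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(8, num_classes)

    def forward(self, x):
        x = self.relu(self.bn(self.conv(x)))  # BN right before the nonlinearity
        x = self.pool(x).flatten(1)
        return self.fc(x)

model = SmallBNNet()
images = torch.randn(16, 3, 32, 32)           # random stand-in for a mini-batch
labels = torch.randint(0, 10, (16,))

# Training mode: BN uses mini-batch statistics and updates its running averages
model.train()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss = nn.functional.cross_entropy(model(images), labels)
optimizer.zero_grad()
loss.backward()                               # gradients also flow to gamma and beta
optimizer.step()

# Inference mode: BN uses the stored running mean and variance
model.eval()
with torch.no_grad():
    single = model(images[:1])                # one example on its own
    batched = model(images)[:1]               # the same example inside a batch
```

In `eval()` mode the prediction for an example does not depend on the rest of the batch, so `single` and `batched` match. In PyTorch, `model.bn.weight` and `model.bn.bias` play the roles of γ and β.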
Those are all the key points of this research paper. I hope you got it.
Thank you for reading, and have a nice day! :D
Here is my LinkedIn profile.