Object Detection (R-CNN)

Sahil · 8 min read · Aug 19, 2020

Hi guys! In this blog, I will explain the intuition behind how object detection actually works using R-CNN, backed by well-documented code written from scratch.

Straightforward Meaning…

R-CNN stands for **Region-Based Convolutional Neural Network**. That means:

“First, each image is cropped into regions of interest (ROIs) using a region-based algorithm, and then these cropped images are fed to a CNN.”

In this case, the Selective Search algorithm has been chosen as the region-based algorithm. If you want to read about it in more detail, kindly see my separate post on Selective Search, where I've explained it in detail and as simply as possible.

Things you will need…

Before going through the walk-through, you need three things when dealing with an object detection task:

  1. Image
  2. Object Label
  3. Bounding Box

Note: Each image can have multiple objects, and each object has its own bounding box values. See Figure 1 below.

Figure 1. Illustration of objects and bounding box of an image

Remember, in order to draw a box (a rectangle-like structure), you only need two diagonal coordinates, i.e. (x_left, y_bottom) and (x_right, y_top). For the same image above, we also display the bounding box values, and to each bounding box we assign an object label.

Bounding boxes of Figure 1 in the format (x_left, y_bottom, x_right, y_top, object_name)

Step 0: Pre-processing

You need to store all of these values in variables in a structured manner, which is why the pre-processing step is so important. For example, see the image below, which shows the annotation (i.e. the bounding boxes and their corresponding object names) for the image in Figure 1.

You need to extract the useful information from this text: in our case, the bounding box values and the object names. See the image above (‘Bounding Box in format (x_left, y_bottom, x_right, y_top, object_name) of Figure 1.’), which I’ve pre-processed into a structured form (a Python list). A minimal parsing sketch follows.
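To make the idea concrete, here is a minimal parsing sketch. The plain-text format ("x_left y_bottom x_right y_top label") and the file name are hypothetical; adapt the parsing to whatever format your dataset actually ships with.

```python
def parse_annotation(line):
    # Split a hypothetical "x_left y_bottom x_right y_top label" line into typed values.
    parts = line.strip().split()
    x_left, y_bottom, x_right, y_top = map(int, parts[:4])
    return [x_left, y_bottom, x_right, y_top, parts[4]]

with open("annotation.txt") as f:  # hypothetical file name
    annotations = [parse_annotation(line) for line in f if line.strip()]
# e.g. [[48, 240, 195, 371, 'dog'], [8, 12, 352, 498, 'cat']]
```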

Now, this depends on how the dataset is provided by the organization. I used the dataset described in the image below.

VOC 2005 Dataset Used in Code

Workflow Overview

Step 1: Image to ROI images

In the first step, each image is passed to the Selective Search algorithm, which generates a list of bounding boxes (region proposals, i.e. ROIs).

Figure 2. Generated list of Bounding Boxes Using Selective Search Algorithm
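As a concrete starting point, here is a minimal sketch of running Selective Search through OpenCV's contrib module (this assumes the opencv-contrib-python package; the file name is hypothetical, and my actual code may differ):

```python
import cv2  # requires the opencv-contrib-python package for cv2.ximgproc

image = cv2.imread("input.jpg")  # hypothetical file name

ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(image)
ss.switchToSelectiveSearchFast()  # faster but coarser; Quality() is slower

rects = ss.process()  # array of proposals, each as (x, y, w, h)
print(len(rects), "proposals generated")
```

Note that OpenCV returns proposals as (x, y, w, h); convert them to corner format before comparing them with the ground truth.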

Next, the IoU (Intersection over Union) between each generated bounding box and the ground truth is calculated.

Note: If more than one object is present in an image, then each generated bounding box is compared (by IoU) against every object's ground-truth bounding box.

  • If the IoU is greater than 70% (0.7), we select the generated bounding box and assign it the label of the corresponding ground-truth box. (See Figure 1: each bounding box is assigned its corresponding label, e.g. ‘cat’.)
  • If the IoU is less than 30% (0.3), we select the generated bounding box and assign it the label ‘background’.

Note: We need to create an extra class label called ‘background’ because we don't want to detect every object present in the surroundings, such as roads, clouds, grass, etc. Those are not important to detect, but the model needs to see and learn what background regions look like.

So, for each image, we take positive samples (which contain an object) and negative samples (which contain background), simply by applying the two conditions above: IoU greater than 0.7 means positive sample; IoU less than 0.3 means negative sample. As simple as that. (A minimal IoU sketch follows.)
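Here is a minimal IoU sketch for boxes in the (x_left, y_bottom, x_right, y_top) format used above (the example boxes are hypothetical):

```python
def iou(box_a, box_b):
    # Corners of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    inter = max(0, x2 - x1) * max(0, y2 - y1)
    if inter == 0:
        return 0.0

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

proposal = [50, 60, 200, 220]      # hypothetical generated box
ground_truth = [55, 70, 210, 230]  # hypothetical ground-truth box

score = iou(proposal, ground_truth)
if score > 0.7:
    label = "cat"           # positive sample: take the ground-truth label
elif score < 0.3:
    label = "background"    # negative sample
```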

Now, once you have selected a generated bounding box based on its IoU score:

  • Crop the image to the selected bounding box
  • Resize the cropped image to 227×227, because the generated bounding boxes come in all shapes, so we need to bring every crop to one fixed input size (see the sketch after this list)
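A minimal crop-and-resize sketch, reusing `image` from the Selective Search sketch and the `proposal` selected above:

```python
import cv2

x_left, y_bottom, x_right, y_top = proposal
crop = image[y_bottom:y_top, x_left:x_right]  # NumPy slicing is rows (y) first
roi = cv2.resize(crop, (227, 227), interpolation=cv2.INTER_AREA)
print(roi.shape)  # (227, 227, 3)
```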

In the code:

  • I've taken five positive samples and five negative samples per image. (You can play around with a different number of samples.)
  • The output shape I got is (10838, 227, 227, 3), which means we have 10838 cropped images (both positive and negative samples), each in color (3 channels) with input shape 227×227.
  • We save 4 things: the list of cropped (ROI) images, shape (10838, 227, 227, 3); their corresponding labels, shape (10838, 1); the selected proposal bounding boxes P (from Selective Search), shape (10838, 4); and the ground-truth bounding boxes G, shape (10838, 4).
  • P and G will be useful when we get to bounding box regression. Just remember that for now!

Step 2: ROI to AlexNet

Each cropped image is fed to the AlexNet model, with the last layer changed so that its size matches the number of classes present. AlexNet then tries to learn parameters that map input images to class labels (just like a usual classification problem).

AlexNet Architecture

Remember, we are not detecting objects yet; at this stage we just want to extract features using AlexNet.

Once AlexNet is trained, drop the last layer; the remaining layers are used as a feature extractor, taking the output of the second-to-last layer (the FC2 layer, which contains 4096 units).

So, each ROI image fed to the trained AlexNet yields a 4096-dimensional feature vector from the FC2 layer.

In the code:

  • The list of cropped images goes from (10838, 227, 227, 3) to (10838, 4096). That's the only change. (A minimal sketch of the idea follows.)
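For reference, here is a minimal Keras sketch of the idea. The layer sizes follow the original AlexNet (minus local response normalization and grouped convolutions), but this is a simplified stand-in, not necessarily the exact model from my code:

```python
from tensorflow.keras import layers, models

def build_alexnet(num_classes):
    return models.Sequential([
        layers.Conv2D(96, 11, strides=4, activation="relu", input_shape=(227, 227, 3)),
        layers.MaxPooling2D(3, strides=2),
        layers.Conv2D(256, 5, padding="same", activation="relu"),
        layers.MaxPooling2D(3, strides=2),
        layers.Conv2D(384, 3, padding="same", activation="relu"),
        layers.Conv2D(384, 3, padding="same", activation="relu"),
        layers.Conv2D(256, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(3, strides=2),
        layers.Flatten(),
        layers.Dense(4096, activation="relu"),              # FC1
        layers.Dense(4096, activation="relu", name="fc2"),  # FC2: the feature layer
        layers.Dense(num_classes, activation="softmax"),    # dropped after training
    ])

model = build_alexnet(num_classes=2)  # e.g. one object class + background
# ...compile and fit on the (ROI images, labels) pairs from Step 1...

# After training, read features from FC2 instead of class probabilities.
extractor = models.Model(inputs=model.input, outputs=model.get_layer("fc2").output)
# features = extractor.predict(rois)  # (10838, 227, 227, 3) -> (10838, 4096)
```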

Step 3: Features to SVM Model

This step is straightforward. We have the feature vector for each image from the previous step, and the corresponding class labels from Step 1. Just solve the classification problem as usual.

(trainX, trainY) → hyper-parameter search (SVM) → train with best params (SVM)

Then save the model in .pkl format after fitting the training data with the best parameters. (A minimal sketch follows.)
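A minimal sketch of this step with scikit-learn (the parameter grid is illustrative, not the one from my code):

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
import joblib

# trainX: (N, 4096) FC2 features from Step 2; trainY: (N,) labels from Step 1.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=3, n_jobs=-1)
search.fit(trainX, trainY)

joblib.dump(search.best_estimator_, "svm_model.pkl")  # save in .pkl format
```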

Step 4: Predict Bounding Box using Ridge Regression

Here we will use the P and G arrays that we saved in Step 1.

Equation 1. Regression targets:
t_x = (G_x - P_x) / P_w
t_y = (G_y - P_y) / P_h
t_w = log(G_w / P_w)
t_h = log(G_h / P_h)

In Equation 1 above, we need P and G in the format [x, y, w, h], but the coordinates we saved in Step 1 are in the format [x_left, y_bottom, x_right, y_top]. We can find the width w as the difference between x_right and x_left, and similarly the height h as the difference between y_top and y_bottom.

So we update P and G to the format [x, y, w, h], where x = x_left and y = y_bottom.

(We can always recover x_right by adding w to x_left, and similarly y_top by adding h to y_bottom. It's just simple math; transforming between these formats is not hard.)

Now that you have all the suitable values, just plug them into Equation 1 to get the target t. A minimal sketch of this transformation follows.
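A minimal NumPy sketch of the format conversion and target computation, where P and G are the (N, 4) corner-format arrays saved in Step 1:

```python
import numpy as np

def to_xywh(boxes):
    # [x_left, y_bottom, x_right, y_top] -> [x, y, w, h]
    x, y = boxes[:, 0], boxes[:, 1]
    w = boxes[:, 2] - boxes[:, 0]
    h = boxes[:, 3] - boxes[:, 1]
    return np.stack([x, y, w, h], axis=1)

P_xywh, G_xywh = to_xywh(P), to_xywh(G)

# Equation 1: the regression targets t.
t_x = (G_xywh[:, 0] - P_xywh[:, 0]) / P_xywh[:, 2]
t_y = (G_xywh[:, 1] - P_xywh[:, 1]) / P_xywh[:, 3]
t_w = np.log(G_xywh[:, 2] / P_xywh[:, 2])
t_h = np.log(G_xywh[:, 3] / P_xywh[:, 3])
t = np.stack([t_x, t_y, t_w, t_h], axis=1)  # (N, 4)
```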

Equation 2. Predicted bounding box:
Ĝ_x = P_w · d_x(P) + P_x
Ĝ_y = P_h · d_y(P) + P_y
Ĝ_w = P_w · exp(d_w(P))
Ĝ_h = P_h · exp(d_h(P))

Here a new term appears: d(P), which is equal to dot_product(w, P) (one such function per coordinate). So the regression equation used to find w is:

Equation 3. Ridge Regression:
w* = argmin_w Σ_i (t_i - dot_product(w, P_i))² + λ · ||w||²

Observing Equation 3:

  • t and P are already available and fixed.
  • w is the weight vector (the variable), which we have to find such that the loss between t and dot_product(w, P) is low.
  • λ (lambda) is the regularizer, which we can tune as a hyper-parameter to find the right value.

Once this is trained, we get the optimal w*. We can then compute d(P) as dot_product(w*, P).

Now we have everything we need. Just plug these values into Equation 2 and you will get the predicted bounding box in the format [x, y, w, h]. A minimal sketch follows.
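A minimal scikit-learn sketch, assuming (as in the R-CNN paper) that the regressor input is the 4096-dimensional FC2 feature of each proposal; alpha plays the role of λ:

```python
from sklearn.linear_model import Ridge
import numpy as np

reg = Ridge(alpha=1.0)       # tune alpha (lambda) as a hyper-parameter
reg.fit(features, t)         # features: (N, 4096) from Step 2; t from the sketch above
d = reg.predict(features)    # d(P): (N, 4) predicted offsets

# Equation 2: decode offsets back into a box in [x, y, w, h] format.
pred_x = P_xywh[:, 2] * d[:, 0] + P_xywh[:, 0]
pred_y = P_xywh[:, 3] * d[:, 1] + P_xywh[:, 1]
pred_w = P_xywh[:, 2] * np.exp(d[:, 2])
pred_h = P_xywh[:, 3] * np.exp(d[:, 3])
pred_boxes = np.stack([pred_x, pred_y, pred_w, pred_h], axis=1)
```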

Step 5: Non-max Suppression

Now, there is one problem. If you run everything up to Step 4, the output looks like this:

Multiple bounding boxes are generated over a single object. One method that can reduce them is Non-Max Suppression.

The algorithm is as follows (originally illustrated by an answer on Quora): sort all predicted boxes by their confidence score; pick the highest-scoring box and keep it; discard every remaining box whose IoU with the kept box exceeds an overlap threshold; repeat until no boxes are left. A minimal sketch is shown below.
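A minimal greedy implementation, reusing the iou() helper from Step 1 (the threshold is one you can tune):

```python
import numpy as np

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    # boxes in (x_left, y_bottom, x_right, y_top) format, scores from the SVM.
    order = np.argsort(scores)[::-1]  # highest-scoring boxes first
    keep = []
    while len(order) > 0:
        best = order[0]
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much.
        order = np.array([i for i in order[1:]
                          if iou(boxes[best], boxes[i]) <= iou_threshold])
    return keep  # indices of the surviving boxes
```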

After applying this, we get a single box per object.

Try random images from the Internet and see the results:

Try out Test 1
Try out Test 2
Try out Test 3
Try out Test 4

Points to note:

  • Guys! I could not fully train this to match the performance of the research paper, due to the limited resources available on my system.
  • From the results' point of view, the class labels are not predicted very well; however, object localization works quite decently. (Again, as mentioned above, I didn't have the resources to reach the paper's performance.)

Key takeaway:

  • This blog was meant to help you not only understand the research paper but also apply it in a practical scenario, so that you can get a sense of how it actually works.

If you have more resources available on your system, you can play around with the following:

  • I used 5 positive and 5 negative samples per image. You can increase the number of samples and see the results.
  • I used the VOC 2005 dataset, whereas the research paper uses PASCAL VOC 2010 and ILSVRC2013. You can try those out; please make sure you pre-process them first, as explained in Step 0.
  • I used AlexNet. Try using the VGG16 network instead.
  • Try various overlap thresholds in non-max suppression and compare the results.

Drawbacks

As summarized in the Fast R-CNN research paper, R-CNN has three notable drawbacks: training is a multi-stage pipeline (fine-tune the CNN, then fit the SVMs, then fit the bounding box regressors); training is expensive in space and time (features are extracted and written to disk for every proposal); and detection is slow, because the CNN forward pass is run separately on every single proposal.

Here are my LinkedIn profile and the code for R-CNN.

Thank you for reading it.
