Selective Search for Object Recognition

Sahil -
9 min readJul 29, 2020

--

Hi Guys! In this blog, I’m going to explain how selective search algorithm works (in this research paper) with easiest way step by step as possible.

Definition of Segmentation

I hope you are aware of what is Segmentation of an image? It is a process of breaking down all pixels and group them which have similar pixel value to make multiple regions.

Source: https://medium.com/cogitotech/what-is-the-difference-between-image-segmentation-and-classification-in-image-processing-303d1f660626

You can see:

  • Bed have grouped into single pixel.
  • Pillow have grouped into single pixel. etc

Objective: Selective search algorithm used to perform segmentation to get the object localization.

Selective Search

Before proceeding, let us understand some concepts behinds this proposal. First, see the Fig 1. below and tell me how do you segment it with one similarity strategy?

Fig 1.
  • For (b), to be able to separate for cats, color is different but texture are same. Here color similarity is preferred
  • For (c), here with color similarity wouldn’t helpful as all are same color. So, texture would be preferred as chameleon and grass are different.
  • For (d), here neither be color or texture would be helpful to group that red car into one region because wheel and car body have different color as well as different texture.

So, what did you observe? Did you find anyone similarity strategy to able distinguish for all images?

No. So, this paper says that doing with one-based similarity doesn’t give all results segmentation well. The final, this paper, similarity measure is a combination of four strategy — color, texture, size and fill:

Equation 1.

where (rᵢ , rⱼ) are the Neighboring region pair of the image, aᵢ belong {0,1} (i.e. denotes if the similarity measure is used or not) and s_*(rᵢ, rⱼ) is a similarity metric.

Remember, It is not a weight-age of all similarity metric.

Let us go and understand each of similarity metrics.

I. Color similarity:

You need to know how images are formed. (This comes under the Digital Image Processing prior knowledge) See the below table:

Book recommend: Digital Image Processing by Rafael C. Gonzalez. Chapter name — Color Image Processing. Under the chapter, read only section name-Color Models.

It depends on book editions. This one is from 2/E

Or you can read this blog: https://www.wigglepixel.nl/en/blog/what-are-color-models/

Of course there are other color space possible which is not fully covered there, that why I said you need to have advance prior knowledge about image processing. Reading this much will get you the insight what and how exactly going on that images.

Now back to the approach:

  • For each region rᵢ, we create 1-D vector which contain histogram for color channel values with 25 bins. That is, for each regions it will calculate all the color channels value. Then, create the histogram for that channels value using group of 25 channels (or bins).
Representation 1-D vector for r_i region.

where n=75. (i.e. each rᵢ have 1-D vector of 75 dimensions value). Similarly, we calculate for rⱼ region also (Cⱼ).

  • Now, normalize the 1-D vector of rᵢ using L1 norm. Similarly for 1-D vector of rⱼ region.
  • Color similarity can be expressed as:
Equation 2.

Just take the minimum corresponding k values between Cᵢ and Cⱼ and sum to all minimum result vector.

II. Texture Similarity:

To find the texture similarity, SIFT works very well. If you are not familiar with SIFT, here is the very nicely blog explained in simple words. I like to recommend to go this blog just for revision and come back here to continue.

  • Take Gaussian derivatives in ‘eight orientations’ using σ = 1 for each color channel.
  • Out of all eight orientation; for each orientation for each color channel, we create histogram using 10 bins. So, for rᵢ regions, we get 1-D vector for this texture similarity as

where n=240 (i.e. each rᵢ have 1-D vector of 240 dimensions value). Similarly, we calculate for rⱼ region also (Tⱼ).

  • Now, normalize the 1-D vector of rᵢ using L1 norm. Similarly for 1-D vector of rⱼ region.
  • Texture similarity can be expressed as:
Equation 3.

Just take the minimum corresponding k value between Tᵢ and Tⱼ and sum to all minimum result vector.

(Exactly same as Color Similarity approach with minor modification, right?)

III. Size of region:

This is used to encourage smaller regions to merge early. Those regions, who have not yet merge, are forces to be merge which have similar sizes.

The reason is that: ‘After going through color or texture similarity, there is a chance that particular region (specially smaller region ones) rᵢ doesn’t satisfy (or have too low similarity) which lead to left out in the participation during segmentation.’

So, to avoid this, the paper says those whosoever region rᵢ left out can merge into other region rⱼ which have similar size.

Size of region (or similarity) can be expressed as:

Equation 4.

It is defined as fraction of the image that rᵢ and rⱼ jointly occupy where size(im) is the size of the image in pixels. Here size(x) denotes the number of pixel present in the ‘x’ region.

IV. Fit between regions:

This measure how well regions rᵢ and rⱼ fit to each other. The idea is to fill gap:

  • if rᵢ region lie within inside rⱼ, then it obvious that to merge these two regions to avoid holes.
  • if rᵢ region hardly touch with rⱼ region, then that means they likely to form some strange region and should not be merge.

Now, we can’t compare every pixel-wise from region rᵢ to region rⱼ, which make them slow. So, this paper says that we will define bounding box. In bounding box, you will have a two coordinates values: left top (x₁,y₁) and bottom right (x₂,y₂). With the given coordinates, we can find the number of pixel contains.

So we will take just find the coordinates of region rᵢ , rⱼ and BBᵢⱼ (which is tight bound around r and rⱼ). Once we got these coordinates of these three regions, we will calculate number of pixels, which is denote by ‘size()’.

size(r): number of pixels in region rⱼ; size(r): number of pixels in region rᵢ; size(BBᵢⱼ): number of pixels in tight bound around rᵢ and rⱼ.

Fit between region can be expressed as:

Equation 5.

It defined as fraction of the image contained BBᵢⱼ which is not covered by the region rᵢ and rⱼ.

Algorithm

Now, we got the understanding the different similarity measure. let us see the algorithm step by step:

1.

Initially you feed color image as the input, and you will get the set of object location present in the image.

2.

Let say, we assigned ’n’ initial regions. These ’n’ initial should be initialized in such a way that we get high recall. So, you can think of it as every pixel in an image are initial regions (just for simplicity understanding).

3.

Create the variable S which contain neighborhood pair wise region rᵢ and rⱼ and their above combination of four similarity value (See equation 1.).

4.

After computing S for all region pair wise with similarity, now its time to merge based on similarity.

  • First, find the pair of r and r which contain maximum similarity value.
  • Merge these r and r regions called a set r
  • Now, remove all the pairs which paired with r region as well as rⱼ.
  • Now, based on new region that got merged, will calculate again combination of four similarity value (See equation 1.)

Remember, after merging regions which having maximum similarity, they don’t need to go again calculate s_color(rₜ, neighbors(rₜ)) as it computation expensive (like again compute color channels value, then create 25 bins histogram, etc…). This was only need to perform at first time only.

So, for the next stage (i.e. after merging rᵢ and rⱼ having maximum regions), all they need to compute is given as below:

Equation 6.

You already got rᵢ and rⱼ computed at first stage. Right? Just put these into into above equation 6.

Cᵢ and Cⱼ is the 1-D vector of normalized color channels with 75 dims of region rᵢ and rⱼ. size(rᵢ) or size(rⱼ) is the number of pixels present in the region rᵢ and rⱼ.

Please remember… This algorithm has explained in abstract way which give you the insight how selective search actually working. Precomputed similarity has been stored somewhere as the number of stages increased (or iteration), but its not mentioned in this algorithm.

So, this algorithm where you are seeking are explained very abstract way. There has to be stored some precomputed value at the implementation stage. I hope you got this clear.

Similarity for texture, same formula as in equation 6. Just replace C with T and everything is same.

  • Store new similarity value in S variable after computing neighborhood pair-wise region of rₜ and store merged region rₜ sets into new variable R.

Overall Algorithm:

Now, let us understand in next section, how object locations extracted with given all merged pair stored in variable R.

Extract object location boxes from final merged regions

Figure 2. Source: https://ivi.fnwi.uva.nl/isis/publications/bibtexbrowser.php?key=UijlingsIJCV2013&bib=all.bib
  • Let say, At the top of hierarchy, there are two region R₁ (dark green) and R₂ (light green).
  • Each of the Rᵢ will have all list of region which are merged regions comes from previous stage (lower stage). Again, its go on to the lower stage one after another until it reach initial region stage (lowest part).
  • So, for each Rᵢ, we got the list of regions of the lowest stage rᵢ. Now, for each region rᵢ, at the initial stage (lowest part), it already have precomputed tight bounding box, rights? So, from the lowest to top, each boundary box get increased as regions is also merging rₜ .
  • At the times of merging the regions, please remember that its calculate the probability (or score) of object present in such a way that the locations which are most likely to be an object come first.
  • Now, if the score of object present is high, it will merge the bounding box of two region rₜ and rⱼ in previous layer to the new layer. If not, it will retain as whatever the bounding box was. (Like for example, Face detection. We got eye, nose, ears, lips as bounding boxes and Face will get as bounding box also, but its doesn’t means that all these eyes, lips, ears, will get merged to make Face as new bounding box. They are in different region. So, different region will get some bounding box also.)
  • Ultimately, we get the final bounding box at the top stage.

To represent in mathematics:

From research paper (page no. 5)

In first paragraph, its the mathematics formulations as I explained intuitive behind in above mentioned.

In second paragraph, it explains that there is a chance of getting duplicate bounding box, so its rank all the locations (bounding boxes) first. The remaining part will be filtered out.

I illustrate this research paper and share it with my knowledge. I hope you like it. I know its was the hard part to understand ,however, if you read research paper and blogs/websites regarding this several times, you will get the good understanding/institution behind it.

Here my LinkedIn profile.

Thank you for reading. Have a nice day :D

--

--