Sunday, October 6, 2024
Google search engine
HomeLanguagesMask R-CNN | ML

Mask R-CNN | ML

Faster R-CNN and YOLO are good at detecting the objects in the input image. They also have very low detection time and can be used in real-time systems. However, there is a challenge that can’t be dealt with object detection, the bounding box generated by YOLO and Faster R-CNN does not give any indication about the shape of the object.

Instance Segmentation

This segmentation identifies each instance (occurrence of each object present in the image and colors them with different pixels). It basically works to classify each pixel location and generate the segmentation mask for each of the objects in the image. This approach gives more idea about the objects in the image because it preserves the safety of those objects while recognizing them.

Instance Segmentation

Mask R-CNN architecture

Mask R-CNN was proposed by Kaiming He et al. in 2017. It is very similar to Faster R-CNN except there is another layer to predict segmented. The stage of region proposal generation is the same in both the architecture the second stage which works in parallel predicts the class generates a bounding box as well as outputs a binary mask for each RoI.

Mask R-CNN Architecture

Mask R-CNN Architecture

It comprises of –

  • Backbone Network
  • Region Proposal Network
  • Mask Representation
  • RoI Align

Backbone Network

The authors of Mask R-CNN experimented with two kinds of backbone networks. The first is standard ResNet architecture (ResNet-C4) and another is ResNet with a feature pyramid network. The standard ResNet architecture was similar to that of Faster R-CNN but the ResNet-FPN has proposed some modification. This consists of a multi-layer RoI generation. This multi-layer feature pyramid network generates RoI of different scale which improves the accuracy of previous ResNet architecture.

Mask R-CNN backbone architecture

Mask R-CNN backbone architecture

At every layer, the feature map size is reduced by half and the number of feature maps is doubled. We took output from four layers (layers – 1, 2, 3, and 4). To generate final feature maps, we use an approach called the top-bottom pathway. We start from the top feature map(w/32, h/32, 256) and work our way down to bigger ones, by upscale operations. Before sampling, we also apply the 1*1 convolution to bring down the number of channels to 256. This is then added element-wise to the up-sampled output from the previous iteration. All the outputs are subjected to 3 X 3 convolution layers to create the final 4 feature maps(P2, P3, P4, P5). The 5th feature map (P6) is generated from a max pooling operation from P5.

Region Proposal Network

All the convolution feature map that is generated by the previous layer is passed through a 3*3 convolution layer. The output of this is then passed into two parallel branches that determine the objectness score and regress the bounding box coordinates.

Anchor Generation Mask R-CNN

Anchor Generation Mask R-CNN

Here, we only use only one anchor stride and 3 anchor ratios for a feature pyramid (because we already have feature maps of different sizes to check for objects of different sizes).

Mask Representation

A mask contains spatial information about the object. Thus, unlike the classification and bounding box regression layers, we could not collapse the output to a fully connected layer to improve since it requires pixel-to-pixel correspondence from the above layer. Mask R-CNN uses a fully connected network to predict the mask. This ConvNet takes an RoI as input and outputs the m*m mask representation. We also upscale this mask for inference on the input image and reduce the channels to 256 using 1*1 convolution. In order to generate input for this fully connected network that predicts mask, we use RoIAlign. The purpose of RoIAlign is to use convert different-size feature maps generated by the region proposal network into a fixed-size feature map. Mask R-CNN paper suggested two variants of architecture. In one variant, the input of mask generation CNN is passed after RoIAlign is applied (ResNet C4), but in another variant, the input is passed just before the fully connected layer (FPN Network).

This mask generation branch is a full convolution network and it output a K * (m*m), where K is the number of classes (one for each class) and m=14 for ResNet-C4 and 28 for ResNet_FPN.

RoI Align

RoI align has the same motive as of RoI pool, to generate the fixed size regions of interest from region proposals. It works in the following steps: 

ROI Align

ROI Align

Given the feature map of the previous Convolution layer of size h*w, divide this feature map into M * N grids of equal size (we will NOT just take integer value).

 

The mask R-CNN inference speed is around 2 fps, which is good considering the addition of a segmentation branch in the architecture.

 Applications of Mask R-CNN

Due to its additional capability to generate segmented masks, it is used in many computer vision applications such as:

  • Human Pose Estimation
  • Self Driving Car
  • Drone Image Mapping etc.
RELATED ARTICLES

Most Popular

Recent Comments