Image segmentation is considered one of the most important tasks in image processing. It is a technique of dividing an image into different parts, called segments. It is primarily beneficial for applications such as object recognition and image compression, where processing the whole image is expensive.
This article is an excerpt from the book Advanced Deep Learning with TensorFlow 2 and Keras, Second Edition by Rowel Atienza, a revised edition of the bestselling guide to exploring and mastering deep learning with Keras, updated to include TensorFlow 2.x, with new chapters on object detection, semantic segmentation, and unsupervised learning using mutual information. This article will help you understand the basic concepts of segmentation.
How does segmentation work?
Segmentation algorithms partition an image into sets of pixels or regions. The purpose of partitioning is to understand better what the image represents. The sets of pixels may represent objects in the image that are of interest for a specific application. How we partition distinguishes the different segmentation algorithms.
How does segmentation differ across applications?
In some applications, we are interested in specific countable objects in a given image. For example, in autonomous navigation, we are interested in instances of vehicles, traffic signs, pedestrians, and other objects on the roads. Collectively, these countable objects are called things. All other pixels are lumped together as background. This type of segmentation is called instance segmentation.
In other applications, we are not interested in countable objects but in amorphous uncountable regions, such as the sky, forests, vegetation, roads, grass, buildings, and bodies of water. These objects are collectively called stuff. This type of segmentation is called semantic segmentation.
Roughly, things and stuff together compose the entire image. If an algorithm can identify both things and stuff pixels, it is called panoptic segmentation.
However, the distinction between things and stuff is not rigid. An application may consider countable objects collectively as stuff. For example, in a department store, it is impossible to identify individual instances of clothing on racks, so they can be collectively lumped together as cloth stuff.
How do we distinguish between the various types of segmentation?
The figures below show the distinction between the different types of segmentation. The input image shows two soda cans and two juice cans on top of a table, against a cluttered background. Assuming that we are only interested in soda and juice cans, instance segmentation assigns a unique colour to each object instance to distinguish the four objects individually. For semantic segmentation, we lump together all soda cans as one stuff class, juice cans as another stuff class, and the background as the last, and assign a unique colour to each stuff class. Finally, in panoptic segmentation, we assume that only the background is stuff and we are interested only in instances of soda and juice cans.
Following the example in the figures, we assign a unique stuff category to each of the objects we used:
- Water bottle
- Soda can
- Juice can
- Background
Four images showing the different segmentation algorithms
Semantic Segmentation Network
Earlier we learned that the semantic segmentation network is a pixel-wise classifier. The network block diagram is shown below. However, unlike a simple classifier (for example, the MNIST classifier), where there is only one classifier generating a one-hot vector as output, in semantic segmentation we have parallel classifiers running simultaneously, each generating its own one-hot vector prediction. The number of classifiers is equal to the number of pixels in the input image, or the product of the image width and height. The dimension of each one-hot vector prediction is equal to the number of stuff object categories of interest.
The semantic segmentation network can be viewed as a pixel-wise classifier. Best viewed in colour
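To make the pixel-wise classifier view concrete, here is a minimal sketch of the tensor shapes involved, assuming a hypothetical 640 x 480 input and four categories. Random logits stand in for a real model's prediction; the dimensions are illustrative only:

```python
import tensorflow as tf

# Hypothetical dimensions for illustration: a 640 x 480 image
# and four stuff categories. Keras tensors use (height, width, channels).
height, width, n_classes = 480, 640, 4

# Random logits stand in for a real network's raw output:
# one score vector per pixel, i.e. width * height parallel classifiers.
logits = tf.random.normal((1, height, width, n_classes))

# A softmax over the last axis turns each pixel's scores into
# class probabilities (a soft version of the one-hot prediction).
probs = tf.nn.softmax(logits, axis=-1)

# argmax picks the most likely category per pixel,
# producing a (height, width) map of class indices.
label_map = tf.argmax(probs, axis=-1)

print(probs.shape)      # (1, 480, 640, 4)
print(label_map.shape)  # (1, 480, 640)
```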
For example, assuming we are interested in four of the categories:
- Background
- Water bottle
- Soda can
- Juice can
We can see in the figure that there are four sample pixels, one from each object category. Each pixel is classified using a 4-dim one-hot vector, with colour shading indicating its class category. Using this knowledge, we can imagine that a semantic segmentation network predicts image_width x image_height 4-dim one-hot vectors as output, one 4-dim one-hot vector per pixel:
Four different sample pixels. Using a 4-dim one-hot vector, each pixel is classified according to its category. Best viewed in colour.
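As an illustration of how per-pixel one-hot vectors and colour shading relate, here is a small sketch; the 2 x 2 label map and the RGB palette are made up purely for demonstration and are not the book's data:

```python
import numpy as np
import tensorflow as tf

# Hypothetical RGB palette for the four categories:
# background, water bottle, soda can, juice can.
palette = np.array([[0, 0, 0],      # background
                    [255, 0, 0],    # water bottle
                    [0, 255, 0],    # soda can
                    [0, 0, 255]],   # juice can
                   dtype=np.uint8)

# A toy 2 x 2 label map with one pixel from each category.
label_map = np.array([[0, 1],
                      [2, 3]])

# Each pixel becomes a 4-dim one-hot vector.
one_hot = tf.one_hot(label_map, depth=4)
print(one_hot.shape)   # (2, 2, 4)

# Mapping class indices back to colours reproduces the shaded
# visualization described above.
colored = palette[label_map]
print(colored.shape)   # (2, 2, 3)
```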
Having understood the concept of semantic segmentation, we can now introduce a neural network pixel-wise classifier. Our semantic segmentation network architecture is inspired by the Fully Convolutional Network (FCN). The key idea of FCN is to use feature maps at multiple scales in generating the final prediction. Our semantic segmentation network is shown in the figure below. Its input is an RGB image (for example, 640 x 480 x 3) and it outputs a tensor with similar dimensions, except that the last dimension is the number of stuff categories (for example, 640 x 480 x 4 for four stuff categories). For visualization purposes, we map the output into RGB by assigning a colour to each category.
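The sketch below illustrates the key FCN idea of fusing feature maps from multiple scales into a pixel-wise prediction. It is not the book's network; the layer sizes, two-scale encoder, and bilinear upsampling are assumptions chosen to keep the example short:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_fcn_sketch(height=480, width=640, n_classes=4):
    """A toy FCN-style model, not the book's exact architecture."""
    inputs = layers.Input(shape=(height, width, 3))

    # Encoder: feature maps at two scales (1/2 and 1/4 resolution).
    f1 = layers.Conv2D(32, 3, strides=2, padding='same', activation='relu')(inputs)
    f2 = layers.Conv2D(64, 3, strides=2, padding='same', activation='relu')(f1)

    # Decoder: upsample both scales back to full resolution and fuse them.
    u1 = layers.UpSampling2D(2, interpolation='bilinear')(f1)
    u2 = layers.UpSampling2D(4, interpolation='bilinear')(f2)
    fused = layers.Concatenate()([u1, u2])

    # Pixel-wise classifier: a softmax over n_classes for every pixel.
    outputs = layers.Conv2D(n_classes, 1, activation='softmax')(fused)
    return tf.keras.Model(inputs, outputs)

model = build_fcn_sketch()
print(model.output_shape)  # (None, 480, 640, 4)
```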
Our semantic segmentation network was inspired by FCN, which has been the basis of many modern, state-of-the-art segmentation algorithms, such as Mask R-CNN. Our network was further enhanced by ideas from PSPNet, which won first place in the ImageNet Scene Parsing Challenge 2016.
In this article, we explored different image segmentation techniques with the help of real-world examples, and learned how they are applied across different fields for object recognition and detection. Advanced Deep Learning with TensorFlow 2 and Keras, Second Edition by Rowel Atienza covers many more cutting-edge techniques that require advanced deep learning knowledge for their efficient execution, including unsupervised learning using mutual information, object detection (SSD), and semantic segmentation (FCN and PSPNet).
About the author
Rowel Atienza is an Associate Professor at the Electrical and Electronics Engineering Institute of the University of the Philippines, Diliman. He holds the Dado and Maria Banatao Institute Professorial Chair in Artificial Intelligence and received his MEng from the National University of Singapore for his work on an AI-enhanced four-legged robot. He finished his PhD at The Australian National University for his contributions to the field of active gaze tracking for human-robot interaction. His current research work focuses on AI and computer vision.