Easily Implement Image Segmentation with EasyCV Mask2Former
Introduction
Image segmentation refers to pixel-level classification of an image. Depending on the classification granularity, it falls into three categories: semantic segmentation, instance segmentation, and panoptic segmentation. Image segmentation is one of the main research directions in computer vision, with important applications in medical image analysis, autonomous driving, video surveillance, augmented reality, image compression, and other fields. We have integrated SOTA algorithms for all three types of segmentation into the EasyCV framework and provide the corresponding model weights. With EasyCV, you can easily predict segmentation maps for images and train customized segmentation models. This article introduces how to use EasyCV for instance, panoptic, and semantic segmentation, and explains the ideas behind the underlying algorithm.
Use EasyCV to predict segmentation maps
EasyCV provides instance and panoptic segmentation models trained on the COCO dataset, as well as a semantic segmentation model trained on ADE20K. After configuring the dependencies as described in the EasyCV quick start (https://github.com/alibaba/EasyCV/blob/master/docs/source/quick_start.md), these models can be used directly to predict segmentation maps for an image. The relevant model links are given in Reference.
Instance segmentation prediction
Since the Mask2Former algorithm in this example uses deformable attention (using this operator in DETR-series algorithms can effectively improve convergence speed and computational efficiency), the operator must be compiled separately:
cd thirdparty/deformable_attention
python setup.py build install
Predict an instance segmentation map with Mask2formerPredictor:
import cv2
from easycv.predictors.segmentation import Mask2formerPredictor
predictor = Mask2formerPredictor(model_path='mask2former_instance_export.pth', task_mode='instance')
img = cv2.imread('000000123213.jpg')
predict_out = predictor(['000000123213.jpg'])
instance_img = predictor.show_instance(img, **predict_out[0])
cv2.imwrite('instance_out.jpg', instance_img)
The output is shown below:
Panoptic segmentation prediction
Predict a panoptic segmentation map with Mask2formerPredictor:
import cv2
from easycv.predictors.segmentation import Mask2formerPredictor
predictor = Mask2formerPredictor(model_path='mask2former_pan_export.pth', task_mode='panoptic')
img = cv2.imread('000000123213.jpg')
predict_out = predictor(['000000123213.jpg'])
pan_img = predictor.show_panoptic(img, **predict_out[0])
cv2.imwrite('pan_out.jpg', pan_img)
The output is shown below:
Semantic segmentation prediction
Predict a semantic segmentation map with Mask2formerPredictor:
import cv2
from easycv.predictors.segmentation import Mask2formerPredictor
predictor = Mask2formerPredictor(model_path='mask2former_semantic_export.pth', task_mode='semantic')
img = cv2.imread('000000123213.jpg')
predict_out = predictor(['000000123213.jpg'])
semantic_img = predictor.show_panoptic(img, **predict_out[0])
cv2.imwrite('semantic_out.jpg', semantic_img)
Example image source: COCO dataset
Use the Mask2Former model on Alibaba Cloud's machine learning platform PAI
PAI-DSW (Data Science Workshop) is a cloud IDE developed by Alibaba Cloud's machine learning platform PAI, providing an interactive programming environment for developers of all kinds. The DSW Gallery (link) offers a variety of Notebook examples so users can easily learn DSW and build machine learning applications. We have also published a sample Notebook for image segmentation with Mask2Former in the DSW Gallery (see the figure below). You are welcome to try it out!
Interpretation of the Mask2Former algorithm
The models used in the examples above are implemented based on Mask2Former. Mask2Former is a unified segmentation architecture that performs semantic, instance, and panoptic segmentation at the same time and achieves SOTA results: 57.8 PQ for panoptic segmentation and 50.1 AP for instance segmentation on the COCO dataset, and 57.7 mIoU for semantic segmentation on ADE20K.
Core ideas
Mask2Former performs segmentation via mask classification: the model predicts a set of binary masks, each paired with a class prediction, and combines them into the final segmentation map. Since each binary mask can represent either a category or an instance, this formulation supports different segmentation tasks such as semantic and instance segmentation.
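As a toy illustration of the mask-classification idea, per-query binary masks and class scores can be combined into a semantic map by scoring each class at each pixel and taking the argmax. All shapes and names below are illustrative, not EasyCV's actual API:

```python
import numpy as np

# Toy mask classification: the model predicts num_queries binary masks plus a
# class distribution for each mask; the semantic map is the per-pixel argmax
# over classes of the combined scores.
num_queries, num_classes, H, W = 4, 3, 8, 8
rng = np.random.default_rng(0)

mask_logits = rng.normal(size=(num_queries, H, W))        # per-query mask logits
class_logits = rng.normal(size=(num_queries, num_classes))

mask_probs = 1 / (1 + np.exp(-mask_logits))               # sigmoid over mask logits
class_probs = np.exp(class_logits) / np.exp(class_logits).sum(-1, keepdims=True)

# Per-pixel class score: sum over queries of P(class | query) * P(pixel in mask)
semantic_scores = np.einsum('qc,qhw->chw', class_probs, mask_probs)
semantic_map = semantic_scores.argmax(axis=0)             # (H, W) class-id map
```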
A core problem in the mask classification task is finding a good way to learn the binary masks. In earlier work, for example, Mask R-CNN uses bounding boxes to limit the feature regions and predicts a segmentation mask within each region; this design is also why Mask R-CNN can only segment instances. Mask2Former instead follows the approach of DETR: it uses a fixed number of object queries to represent the binary masks and decodes this set of masks with a Transformer Decoder. (For an interpretation of DETR, please refer to the EasyCV articles on DETR, DAB-DETR, and Object Query.)
One of the more significant defects of DETR-series algorithms is that the cross-attention in the Transformer Decoder attends over global features, which makes it hard for the model to focus on the regions it actually cares about and reduces both the convergence speed and the final accuracy. To address this, Mask2Former proposes a Transformer Decoder with masked attention: each Transformer Decoder block predicts an attention mask, binarizes it with a threshold of 0.5, and passes it as input to the next block, so that the attention module only attends to the foreground of the mask.
Model structure
Mask2Former consists of three parts:
1. A backbone (ResNet, Swin Transformer) extracts low-resolution features from the image.
2. The Pixel Decoder gradually upsamples and decodes the low-resolution features into a feature pyramid from low to high resolution, which serves cyclically as the K and V inputs of the Transformer Decoder. The multi-scale features help the model predict targets of different scales accurately.
The code of one Transformer layer is as follows (note: to further accelerate convergence, a deformable attention module is used in the Pixel Decoder):
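A simplified sketch of one such encoder layer is given below. To keep the example self-contained, ordinary multi-head self-attention stands in for the multi-scale deformable attention operator; all class and variable names are illustrative, not EasyCV's actual code:

```python
import torch
import torch.nn as nn

# Simplified sketch of one Pixel Decoder encoder layer. The real implementation
# uses multi-scale deformable attention; plain self-attention substitutes for it
# here. Positional (and scale-level) embeddings are added to Q and K only.
class PixelDecoderLayer(nn.Module):
    def __init__(self, d_model=256, nhead=8, dim_ffn=1024):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, dim_ffn), nn.ReLU(), nn.Linear(dim_ffn, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, src, pos):
        q = k = src + pos                                   # add embeddings to Q, K
        src = self.norm1(src + self.self_attn(q, k, src)[0])
        src = self.norm2(src + self.ffn(src))
        return src

layer = PixelDecoderLayer()
tokens = torch.randn(2, 100, 256)   # flattened multi-scale feature tokens
pos = torch.randn(2, 100, 256)      # positional + scale-level embeddings
out = layer(tokens, pos)            # same shape as the input tokens
```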
3. The Transformer Decoder with masked attention uses the object queries and the multi-scale features from the Pixel Decoder to refine the binary mask maps layer by layer and obtain the final result.
At its core, the masked cross-attention uses the mask predicted by the previous layer as the attn_mask input of MultiheadAttention, limiting the attention computation to each query's foreground. The implementation code is as follows:
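A minimal sketch of this masked cross-attention, with illustrative shapes rather than the exact EasyCV implementation: the previous layer's mask logits are binarized at 0.5, and background positions are blocked from attention via attn_mask.

```python
import torch
import torch.nn as nn

# Sketch of masked cross-attention: the mask predicted by the previous decoder
# layer is binarized at threshold 0.5 and passed as attn_mask, so each object
# query only attends to its predicted foreground pixels. Illustrative shapes.
num_queries, num_pixels, d_model, num_heads = 100, 64 * 64, 256, 8
attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

query = torch.randn(1, num_queries, d_model)           # object queries
memory = torch.randn(1, num_pixels, d_model)           # pixel features (K and V)
prev_mask_logits = torch.randn(1, num_queries, num_pixels)

# True entries in attn_mask are NOT allowed to attend, so background
# (sigmoid < 0.5) is masked out for each query.
attn_mask = prev_mask_logits.sigmoid() < 0.5
attn_mask = attn_mask.repeat_interleave(num_heads, dim=0)  # (heads, Q, K)

# If a predicted mask is entirely background, fall back to global attention
# for that query (avoids NaNs from an all-masked attention row).
attn_mask[attn_mask.all(dim=-1)] = False

out, _ = attn(query, memory, memory, attn_mask=attn_mask)
```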
Tricks
1. Efficient multi-scale strategy
In the Pixel Decoder, feature pyramids at 1/32, 1/16, and 1/8 of the original image scale are decoded and used as the K and V inputs of the corresponding Transformer Decoder blocks. Following deformable DETR, a sinusoidal positional embedding and a learnable scale-level embedding are added to each input. The inputs are fed in order of resolution from low to high, and this cycle repeats L times.
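The embedding bookkeeping can be sketched as follows; token counts and names are illustrative:

```python
import torch
import torch.nn as nn

# Sketch of the scale-level embedding: each of the three feature scales gets a
# learnable d_model-dim embedding added on top of its positional embedding
# before the tokens are fed to the decoder. Token counts are illustrative.
d_model, num_levels = 256, 3
level_embed = nn.Embedding(num_levels, d_model)

# flattened tokens for the 1/32, 1/16, 1/8 scales (random stand-ins here)
tokens_per_level = [torch.randn(1, n, d_model) for n in (100, 400, 1600)]
pos_per_level = [torch.randn(1, n, d_model) for n in (100, 400, 1600)]

inputs = [
    t + p + level_embed.weight[i]   # level embedding broadcasts over tokens
    for i, (t, p) in enumerate(zip(tokens_per_level, pos_per_level))
]
```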
2. PointRend
Memory consumption during training is reduced via PointRend, in two places:
a. When the Hungarian algorithm matches predicted masks against ground-truth labels, the matching cost is computed over a set of K uniformly sampled points instead of the complete mask map.
b. When computing the losses, the loss is computed over a set of K points sampled with an importance-sampling strategy instead of the complete mask map. (Experiments show that computing losses with the PointRend method effectively improves model accuracy.)
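A minimal sketch of such a point-sampled mask loss, using uniform sampling for simplicity (the loss in Mask2Former uses importance sampling; names and shapes are illustrative):

```python
import torch
import torch.nn.functional as F

# Sketch of a point-sampled mask loss: instead of computing BCE over every
# pixel of the full mask, K point locations are sampled and both prediction
# and target are evaluated only at those points via grid_sample.
def point_sample_loss(pred_logits, target, num_points=112 * 112):
    # pred_logits, target: (N, 1, H, W)
    n = pred_logits.shape[0]
    # uniform point coordinates in [-1, 1], the range grid_sample expects
    coords = torch.rand(n, num_points, 1, 2) * 2 - 1
    sampled_pred = F.grid_sample(pred_logits, coords, align_corners=False)
    sampled_tgt = F.grid_sample(target, coords, align_corners=False)
    return F.binary_cross_entropy_with_logits(sampled_pred, sampled_tgt)

pred = torch.randn(2, 1, 64, 64)                    # predicted mask logits
tgt = (torch.rand(2, 1, 64, 64) > 0.5).float()      # binary ground-truth masks
loss = point_sample_loss(pred, tgt)                 # scalar loss tensor
```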
3. Optimization improvements
a. The order of self-attention and cross-attention is swapped: self-attention -> cross-attention becomes cross-attention -> self-attention.
b. The queries are made learnable parameters. Supervising the learnable queries lets them play a role similar to region proposals; experiments show that learnable queries can generate mask proposals.
c. The dropout operation in the Transformer Decoder is removed, as experiments show it reduces accuracy.