Real-time instance segmentation of maize ears using SwinT-YOLACT

ZHU Deli; YU Maosheng; LIANG Mingfei

doi:10.11975/j.issn.1002-6819.202302172

ZHU Deli, YU Maosheng, LIANG Mingfei. Real-time instance segmentation of maize ears using SwinT-YOLACT[J]. Transactions of the Chinese Society of Agricultural Engineering (Transactions of the CSAE), 2023, 39(14): 164-172. DOI: 10.11975/j.issn.1002-6819.202302172

Abstract

Maize is one of the most important food crops for agricultural development and food security in China. The maize ear directly determines the yield and quality of maize, and its phenotypic parameters (such as size and shape) are crucial indicators of the growth state of the plant. With the application of artificial intelligence in agricultural production, machine vision is well suited to acquiring maize phenotypic parameters and analyzing traits, because it is objective, accurate, and fast. In large-scale planting, field inspection robots can be used to monitor maize growth status. This study aims to realize high-throughput, automated acquisition of maize phenotypic parameters by unmanned inspection robots. A maize ear segmentation model that balances accuracy and speed, SwinT-YOLACT, was proposed on the basis of the YOLACT (You Only Look At CoefficienTs) algorithm. Three optimization strategies were designed according to the characteristics of the maize ear segmentation task. Firstly, Swin-Transformer was used as the backbone feature extraction network of the improved model, so that the self-attention mechanism of the Transformer structure enhances global feature extraction. Secondly, a 3-layer efficient channel attention mechanism was introduced before the feature pyramid network to suppress redundant feature information and strengthen the fusion of key features, thereby improving accuracy. Finally, the smoother Mish activation function replaced the original ReLU activation function to further improve segmentation accuracy while maintaining the original inference speed.

Images of maize plants were collected in the field against different environmental backgrounds and at various maturity stages of the ears. The images were manually labeled with the Labelme annotation software in the COCO dataset format, and the number of samples was expanded by data augmentation. The resulting self-built maize ear segmentation dataset was used to train and test the improved model. The experimental results show that introducing Swin-Transformer as the backbone feature extraction network improved the mask mean average precision by 2.11 percentage points over the original YOLACT model, with no influence on segmentation speed. On this basis, introducing efficient channel attention before the feature pyramid network improved the mask mean average precision by a further 0.65 percentage points, with the inference speed basically unchanged. Building on these two modifications, replacing the original ReLU activation function with Mish improved the mask mean average precision by another 0.75 percentage points and increased the inference speed by 2.74 frames per second.

SwinT-YOLACT was then compared with the YOLACT, YOLACT++, YOLACT-Edge, and Mask R-CNN segmentation models under the same experimental environment and training strategy. The mask mean average precision of SwinT-YOLACT reached 79.43%, which was 3.51, 3.38, and 7.88 percentage points higher than those of the original YOLACT, YOLACT++, and YOLACT-Edge, respectively, and only slightly lower than that of Mask R-CNN. In terms of segmentation speed, SwinT-YOLACT ran at 35.44 frames per second, far faster than Mask R-CNN at 6.80 frames per second, and 3.39 and 2.58 frames per second faster than YOLACT and YOLACT++, respectively. In summary, SwinT-YOLACT is suitable for maize ear segmentation in the vision system of an unmanned inspection robot, and the findings can provide technical support for monitoring maize growth status.
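
The replacement of ReLU by Mish described in the abstract can be illustrated with a minimal sketch. Mish is commonly defined as x * tanh(softplus(x)). The abstract does not state the formula, so this definition and the helper names below (`softplus`, `mish`, `relu`) are assumptions for illustration. The sketch uses only the Python standard library and is not the authors' training code.

```python
import math


def softplus(x: float) -> float:
    # Numerically stable form of log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x: float) -> float:
    # Commonly used definition of Mish (assumed here): x * tanh(softplus(x)).
    return x * math.tanh(softplus(x))


def relu(x: float) -> float:
    # ReLU clips every negative input to exactly zero.
    return max(x, 0.0)


if __name__ == "__main__":
    # Compare the two activations on a few sample inputs.
    for value in (-3.0, -1.0, 0.0, 1.0, 3.0):
        print(f"x={value:+.1f}  relu={relu(value):+.4f}  mish={mish(value):+.4f}")
```

Unlike ReLU, Mish is smooth at zero and passes small negative values instead of clipping them. This matches the abstract's description of Mish as a smoother activation, but the sketch says nothing about the reported accuracy or speed gains.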
