Real-time instance segmentation of maize ears using SwinT-YOLACT

    • Abstract: Phenotypic parameters of the maize ear are important indicators of maize growth status, which directly determines maize yield and quality. To enable the vision systems of unmanned inspection robots to acquire maize phenotypic parameters automatically and at high throughput, this study proposes SwinT-YOLACT, a maize ear segmentation model based on YOLACT (you only look at coefficients) that balances high accuracy with speed. First, Swin-Transformer is adopted as the backbone feature extraction network to strengthen the model's feature extraction capability. Second, an efficient channel attention mechanism is introduced before the feature pyramid network to suppress redundant feature information and strengthen the fusion of key features. Finally, the smoother Mish activation function replaces the original ReLU activation, further improving accuracy while maintaining the original speed. The model was trained and tested on a self-built maize ear dataset. Experimental results show that SwinT-YOLACT achieves a mask mean average precision of 79.43% at an inference speed of 35.44 frames/s. Its mask mean average precision is 3.51 and 3.38 percentage points higher than those of the original YOLACT and its improved variant YOLACT++, respectively, and its inference speed exceeds those of YOLACT, YOLACT++, and Mask R-CNN by 3.39, 2.58, and 28.64 frames/s, respectively. The model segments maize ears well and is suitable for deployment on the vision systems of unmanned inspection robots, providing technical support for maize growth status monitoring.

       

      Abstract: Maize is one of the most important food crops for agricultural development and food security in China, and the maize ear directly determines maize yield and quality. Phenotypic parameters of the ear, such as size and shape, are crucial indicators of plant growth status. Machine vision, particularly when combined with artificial intelligence, offers an objective, accurate, and fast means of acquiring maize phenotypic parameters and analyzing traits, and field inspection robots equipped with such vision systems can monitor maize growth status in large-scale planting. To realize high-throughput, automated acquisition of maize phenotypic parameters by unmanned inspection robots, this study proposes SwinT-YOLACT, a maize ear segmentation model based on the YOLACT (you only look at coefficients) algorithm that balances high accuracy with speed. Three optimization strategies were designed according to the characteristics of the maize ear segmentation task. First, Swin-Transformer was used as the backbone feature extraction network of the improved model, where the self-attention mechanism of the Transformer structure enhances global feature extraction. Second, an efficient channel attention mechanism was introduced before the three-layer feature pyramid network to eliminate redundant feature information and strengthen the fusion of key features, improving accuracy. Finally, the smoother Mish activation function replaced the original ReLU activation function to further improve segmentation accuracy at the original inference speed. In addition, maize plant images were collected in the field under different environmental backgrounds and at various maturity stages of the maize ear.
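The paper names efficient channel attention (ECA) but does not give its exact configuration; a minimal NumPy sketch of the ECA idea (global average pooling, a shared 1-D convolution across neighboring channels, and a sigmoid gate), using the adaptive kernel-size rule from the ECA-Net paper, might look like the following. The function names and the uniform kernel used in the demo are illustrative, not the authors' implementation.

```python
import numpy as np

def eca_kernel_size(channels: int, gamma: int = 2, b: int = 1) -> int:
    # Adaptive 1-D kernel size from ECA-Net:
    # k = |log2(C)/gamma + b/gamma|, rounded to the nearest odd number.
    t = int(abs((np.log2(channels) + b) / gamma))
    return t if t % 2 == 1 else t + 1

def eca(x: np.ndarray, conv_weight: np.ndarray) -> np.ndarray:
    """Efficient channel attention over a (C, H, W) feature map.

    conv_weight is a 1-D kernel of odd length shared across all channels.
    """
    c = x.shape[0]
    # 1) Global average pooling -> one descriptor per channel.
    y = x.mean(axis=(1, 2))                      # shape (C,)
    # 2) 1-D convolution across neighboring channels (same padding),
    #    capturing local cross-channel interaction without full FC layers.
    k = conv_weight.size
    pad = k // 2
    y_pad = np.pad(y, pad, mode="constant")
    z = np.array([np.dot(y_pad[i:i + k], conv_weight) for i in range(c)])
    # 3) Sigmoid gate, then rescale each input channel by its weight.
    w = 1.0 / (1.0 + np.exp(-z))                 # shape (C,)
    return x * w[:, None, None]

if __name__ == "__main__":
    feat = np.random.rand(64, 8, 8).astype(np.float32)
    k = eca_kernel_size(64)            # C=64 gives k=3
    weight = np.full(k, 1.0 / k)       # toy kernel; learned in practice
    out = eca(feat, weight)
    print(out.shape)
```

In the real model the 1-D kernel is learned; the key design point is that ECA attends across channels with only k parameters, so inserting it before the feature pyramid network adds almost no inference cost.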
Labelme annotation software was then used to label the data manually in the COCO dataset format, and the number of samples was expanded through data augmentation, yielding a maize ear segmentation dataset for training the deep learning network. Ablation experiments on this self-built dataset show that introducing Swin-Transformer as the backbone feature extraction network improved the mask mean average precision by 2.11 percentage points over the original YOLACT model without affecting segmentation speed. On this basis, introducing efficient channel attention before the feature pyramid network improved the mask mean average precision by a further 0.65 percentage points with essentially unchanged inference speed. Finally, replacing the original ReLU activation function with Mish improved the mask mean average precision by another 0.75 percentage points while also raising inference speed by 2.74 frames per second. SwinT-YOLACT was then compared with the YOLACT, YOLACT++, YOLACT-Edge, and Mask R-CNN segmentation models under the same experimental environment and training strategy. The results show that the mask mean average precision of SwinT-YOLACT reached 79.43%, which is 3.51, 3.38, and 7.88 percentage points higher than those of the original YOLACT, YOLACT++, and YOLACT-Edge, respectively, and only slightly lower than that of Mask R-CNN, demonstrating the improved model's strong segmentation performance.
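The Mish activation cited in the ablation above is a published closed-form function, Mish(x) = x · tanh(softplus(x)); a small self-contained sketch contrasting it with ReLU (the helper names are illustrative):

```python
import math

def softplus(x: float) -> float:
    # Numerically stable softplus: ln(1 + e^x).
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)

def mish(x: float) -> float:
    # Mish(x) = x * tanh(softplus(x)): smooth everywhere and
    # non-monotonic near zero, unlike ReLU's hard kink.
    return x * math.tanh(softplus(x))

def relu(x: float) -> float:
    return max(0.0, x)

if __name__ == "__main__":
    for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
        print(f"x={x:+.1f}  relu={relu(x):+.4f}  mish={mish(x):+.4f}")
```

Unlike ReLU, Mish lets small negative activations through (e.g. Mish(-0.5) ≈ -0.22), which is the smoothness property the paper credits for the extra accuracy.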
In terms of segmentation speed, SwinT-YOLACT runs at an inference speed of 35.44 frames per second, far faster than Mask R-CNN at 6.80 frames per second, and 3.39 and 2.58 frames per second faster than YOLACT and YOLACT++, respectively. In summary, SwinT-YOLACT segments maize ears well and is suited to the vision systems of unmanned inspection robots, providing technical support for maize growth status monitoring.

       
