Real-time semantic segmentation method for field grapes based on channel feature pyramid

孙俊; 宫东见; 姚坤杉; 芦兵; 戴春霞; 武小红

doi:10.11975/j.issn.1002-6819.2022.17.016

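Summary: Rapid detection and recognition of grapes in complex environments is a key step in intelligent harvesting. To address the low accuracy and poor real-time performance of current grape recognition, this study proposes a lightweight Grape Real-time Semantic Segmentation Model (GRSM). First, a Channel-wise Feature Pyramid (CFP) module is used for feature extraction; through skip connections of 1×3 and 3×1 dilated convolutions, the module extracts multi-scale features and contextual information from grape images while reducing the number of model parameters. Next, a pooling-convolution fusion structure performs down-sampling, adding trainable parameters to reduce information loss. Finally, skip connections fuse multiple features to recover image details. The experimental results show that the proposed model achieves a mean intersection over union of 78.8% and a mean pixel accuracy of 90.3% on the field grape test set, with a processing speed of 68.56 frames/s and a network size of only 4.88 MB. The model offers high segmentation accuracy and good real-time performance, meets the requirements of grape-picking robots for visual recognition systems, and provides a theoretical basis for intelligent grape harvesting.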

Abstract: Automated and intelligent harvesting is one of the most urgent tasks in the grape industry. However, current fruit recognition models struggle to balance accuracy against real-time performance. In this study, a lightweight real-time semantic segmentation model based on a channel feature pyramid was proposed for field grape harvesting. Firstly, a publicly available dataset for field grape instance segmentation was used as the experimental object. A total of 300 grape images were collected, covering different pruning periods, lighting conditions, and maturity levels. The LabelMe annotation tool was used to build the field grape dataset, with four types of objects annotated: background, leaves, grapes, and stems. The dataset was then expanded to a total of 1200 images using random data augmentation. Since the original images were too large to be trained on directly, the image resolution was uniformly compressed to 512×512 pixels to improve the training efficiency of the network model. Secondly, because grapes vary greatly in size and location, convolutional kernels with receptive fields of different sizes were used. The channel feature pyramid module was therefore utilized for feature extraction. Within a single channel, skip connections between 1×3 and 3×1 dilated convolutions provided multi-scale feature extraction at the 3×3, 5×5, and 7×7 scales. As such, multi-scale and contextual features were effectively extracted from the grape images while the number of model parameters was reduced. A pooling-convolution fusion structure was used for down-sampling instead of the traditional max pooling structure, adding trainable parameters to reduce information loss. Skip connections were employed in the decoding part to fuse information from different feature layers and recover image details.
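The multi-scale effect attributed to the channel feature pyramid can be illustrated with simple arithmetic. The sketch below is not the authors' implementation: it assumes each stage is a 1×3 convolution followed by a 3×1 convolution, that outputs are tapped after one, two, and three stages, and that input and output channel counts are equal. The function names and the channel width of 32 are hypothetical. Under those assumptions it compares the per-axis receptive field and weight count of the stacked asymmetric convolutions with those of a single full k×k kernel.

```python
def stacked_receptive_field(num_stages, dilation=1, taps=3):
    """Per-axis receptive field after `num_stages` stages, where each stage
    applies one (1 x taps) and one (taps x 1) convolution with `dilation`."""
    return 1 + num_stages * (taps - 1) * dilation


def weights_full_kernel(k, channels):
    """Weights in a single k x k convolution (channels in == channels out, no bias)."""
    return k * k * channels * channels


def weights_factorized(num_stages, channels, taps=3):
    """Weights in `num_stages` pairs of (1 x taps) + (taps x 1) convolutions."""
    return num_stages * 2 * taps * channels * channels


channels = 32  # hypothetical channel width, chosen only for illustration
for stages in (1, 2, 3):
    rf = stacked_receptive_field(stages)
    full = weights_full_kernel(rf, channels)
    fact = weights_factorized(stages, channels)
    print(f"{rf}x{rf} field: full kernel {full} weights, factorized {fact} weights")
```

With a dilation rate above 1 the same stack covers a wider field at no extra weight cost, which is one plausible reading of how dilated 1×3 and 3×1 convolutions gather context while keeping the model small.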
Finally, the improved model was tested on a grape test set. The experimental results showed that the Mean Intersection over Union (MIoU) was 78.8%, the Mean Pixel Accuracy (MPA) was 90.3%, and the real-time processing speed was 68.56 frames/s. The model size was only 4.88 MB. Compared with the real-time semantic segmentation networks BiSeNet, ENet, and DFANet, the MIoU was higher by 7.9, 5.7, and 10.5 percentage points, respectively. Compared with lightweight networks using MobileNetV3 and Inception as encoders, the accuracy of the improved model was higher by 1.2 and 8.8 percentage points, respectively. Therefore, the proposed network showed a clear advantage in segmentation accuracy over the real-time and lightweight networks. Compared with the classical semantic segmentation networks DeepLabv3+, SegNet, and UNet, the MIoU was lower by 2.3, 2.0, and 3.7 percentage points, respectively, but the model size was only 12.3%, 4.1%, and 7.4% of theirs, respectively. The model thus met the real-time requirement while balancing speed and accuracy. The improved model can be expected to serve the segmentation and recognition of field grapes in smart agriculture. The findings can also provide technical support for the visual recognition systems of grape-picking robots.
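For readers unfamiliar with the two reported metrics, the following minimal sketch computes MIoU and MPA from a confusion matrix using their commonly used definitions (per-class IoU and per-class pixel accuracy, each averaged over classes). The exact formulas used in the study are not given in the abstract, so this is an assumption. The four class indices follow the annotated categories (background, leaves, grapes, stems), and the pixel labels are made-up toy data, not results from the paper.

```python
def confusion_matrix(truth, pred, num_classes):
    """cm[t][p] counts pixels whose true class is t and predicted class is p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(truth, pred):
        cm[t][p] += 1
    return cm


def mean_iou(cm):
    """Average over classes of TP / (TP + FP + FN); absent classes are skipped."""
    n = len(cm)
    ious = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp
        fp = sum(cm[r][c] for r in range(n)) - tp
        denom = tp + fp + fn
        if denom:
            ious.append(tp / denom)
    return sum(ious) / len(ious)


def mean_pixel_accuracy(cm):
    """Average over classes of correctly labeled pixels / true pixels of that class."""
    accs = []
    for c, row in enumerate(cm):
        total = sum(row)
        if total:
            accs.append(row[c] / total)
    return sum(accs) / len(accs)


# Toy example: 0 = background, 1 = leaves, 2 = grapes, 3 = stems
truth = [0, 0, 1, 1, 2, 2, 3, 3]
pred = [0, 0, 1, 2, 2, 2, 3, 0]
cm = confusion_matrix(truth, pred, num_classes=4)
print(f"MIoU = {mean_iou(cm):.4f}, MPA = {mean_pixel_accuracy(cm):.4f}")
```

In practice the flattened label maps of every test image would be accumulated into one confusion matrix before the two averages are taken.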
