Fusion of the lightweight network and visual attention mechanism to detect apples in orchard environment

    • Abstract: To improve the comprehensive performance of apple detection in complex orchard environments and to reduce the size of the detection model, the topology of the single-stage detection network YOLOX-Tiny was optimized and improved, and a lightweight apple detection model for complex orchard environments (Lightweight Apple Detection YOLOX-Tiny Network, Lad-YXNet) was proposed. The model introduces two lightweight visual attention modules, Efficient Channel Attention (ECA) and Shuffle Attention (SA), and constructs a Shuffle Attention and Double Convolution Layer (SDCLayer) module, improving the model's ability to extract background and fruit features; testing identified Swish and the Leaky Rectified Linear Unit (Leaky-ReLU) as the activation functions for the backbone and the feature-fusion network, respectively. Ablation experiments examined the effectiveness of the Mosaic augmentation method for model training. The results show that random distortion of image length and width contributed substantially to the model's comprehensive detection performance, whereas random color-gamut transformation degraded it because it altered the color of the apples in the training set. To improve the interpretability of apple detection, feature-visualization techniques were used to extract the main feature maps of Lad-YXNet's backbone, feature-fusion network, and detection network, revealing how Lad-YXNet detects apples in complex natural environments. After training, Lad-YXNet achieved an average precision of 94.88% on the test set, which is 3.10, 2.02, 2.00, and 0.51 percentage points higher than the SSD, YOLOV4-Tiny, YOLOV5-Lite, and YOLOX-Tiny models, respectively. Lad-YXNet detects one image in 10.06 ms with a model size of 16.6 MB, which are 20.03% and 18.23% less than YOLOX-Tiny, respectively. This study provides a theoretical basis for apple-harvesting robots to detect apples accurately and quickly in complex orchard environments.
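As a concrete illustration of the channel-attention idea behind the ECA module named above, the following NumPy sketch follows the published ECA design: per-channel global average pooling, a 1D convolution across neighbouring channels with an adaptively sized kernel, and a sigmoid gate. The fixed averaging weights stand in for the learned 1D-convolution weights, so the numbers are illustrative only, not Lad-YXNet's trained parameters.

```python
import numpy as np

def eca(x, gamma=2, b=1):
    """Efficient Channel Attention (ECA) on a single (C, H, W) feature map.

    Sketch only: the 1D-conv weights are fixed to a moving average here,
    whereas in the real module they are learned end-to-end.
    """
    C = x.shape[0]
    # Adaptive kernel size: nearest odd number to log2(C)/gamma + b/gamma.
    t = int(abs((np.log2(C) + b) / gamma))
    k = t if t % 2 else t + 1
    # Squeeze: global average pooling over spatial dims -> one value per channel.
    y = x.mean(axis=(1, 2))
    # Excite: 1D convolution across neighbouring channels (no dimension reduction).
    w = np.full(k, 1.0 / k)                      # stand-in for learned weights
    y = np.convolve(np.pad(y, k // 2, mode="edge"), w, mode="valid")
    # Gate: one sigmoid weight per channel, then rescale the feature map.
    a = 1.0 / (1.0 + np.exp(-y))
    return x * a[:, None, None]
```

The appeal of this design for a lightweight detector is that it adds only O(C·k) work per feature map, avoiding the fully connected bottleneck of squeeze-and-excitation attention.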

       

      Abstract: Apple harvesting is a highly seasonal and labor-intensive activity in modern agriculture. A harvesting robot is therefore of great significance to improve the productivity and quality of apples and to alleviate the labor shortage in orchards. The detection model of such a robot is required to detect target apples accurately and rapidly in the complex and changing orchard environment, and a small model size is in high demand for deployment on embedded devices. This study aims to improve the speed and comprehensive performance of apple detection in a complex orchard environment. A Lightweight Apple Detection YOLOX-Tiny Network (Lad-YXNet) model was proposed to reduce the size of the original model. Images of "Yanfu" and "Micui" apples were collected during the 2021 apple harvest season and uniformly cropped to 1024×1024 pixels. A total of 1 200 images were selected to build the dataset, covering fruits occluded by branches and leaves, fruit clusters, varying degrees of illumination, motion blur, and high fruit density. The model was obtained by optimizing the topology of the single-stage detection network YOLOX-Tiny. Two lightweight visual attention modules were added to the model: Efficient Channel Attention (ECA) and Shuffle Attention (SA). A Shuffle Attention and Double Convolution Layer (SDCLayer) module was constructed to extract background and fruit features. Swish and the Leaky Rectified Linear Unit (Leaky-ReLU) were identified as the activation functions for the backbone and the feature-fusion network, respectively. A series of ablation experiments was carried out to evaluate the effectiveness of Mosaic enhancement in model training. The average precision of the Lad-YXNet model decreased by 0.89 and 3.81 percentage points, respectively, after removing random image flipping and random distortion of image length and width. The F1-score also decreased by 0.91 and 1.95 percentage points, respectively, and the precision decreased by 2.21 and 2.99 percentage points, respectively. A similar pattern was observed for the YOLOX-Tiny model. After removing random image combination, the average precision of the Lad-YXNet and YOLOX-Tiny models decreased by 0.56 and 0.07 percentage points, the F1-score decreased by 0.68 and 1.15 percentage points, and the recall decreased by 2.35 and 4.49 percentage points, respectively. The results showed that random distortion of image length and width contributed greatly to the detection performance of the model, whereas random color-gamut transformation of the image decreased detection performance because it changed the color of the apples in the training set. Two specific tests were conducted to explore the effectiveness of the visual attention mechanisms in the convolution network: one removed the visual attention modules from the Lad-YXNet, and the other exchanged the positions of the visual attention modules in the Lad-YXNet. Compared with the Lad-YXNet, the precision of the model with exchanged attention-module positions increased by only 0.04 percentage points, while the recall, F1-score, and average precision decreased by 0.78, 0.39, and 0.13 percentage points, respectively. The precision, recall, F1-score, and average precision of the model without the attention modules were reduced by 1.15, 0.64, 0.89, and 0.46 percentage points, respectively, compared with the Lad-YXNet. Consequently, SA and ECA enhanced the ability of the Lad-YXNet to extract apple features, improving the comprehensive detection accuracy of the model. The main feature maps of Lad-YXNet's backbone, feature-fusion, and detection networks were extracted by feature-visualization technology. A systematic investigation determined the process by which the Lad-YXNet detects apples in the complex natural environment, particularly from the viewpoint of feature extraction, thereby improving the interpretability of apple detection with the Lad-YXNet model. After training, the Lad-YXNet achieved an average precision of 94.88% on the test set, which is 3.10, 2.02, 2.00, and 0.51 percentage points higher than the SSD, YOLOV4-Tiny, YOLOV5-Lite, and YOLOX-Tiny models, respectively. The detection time for one image was 10.06 ms with a model size of 16.6 MB, which are 20.03% and 18.23% less than YOLOX-Tiny, respectively. The Lad-YXNet thus balances the size, precision, and speed of an apple detection model. These findings provide a theoretical basis for harvesting robots to detect apples accurately and quickly in the complex orchard environment.
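The two activation functions named in the abstract are simple element-wise maps. The NumPy definitions below show the difference: Swish is smooth and non-monotonic near zero, while Leaky-ReLU keeps a small constant slope for negative inputs. The slope alpha = 0.1 is an assumed value for illustration; the abstract does not state the one used in Lad-YXNet.

```python
import numpy as np

def swish(x, beta=1.0):
    """Swish: x * sigmoid(beta * x); the backbone activation."""
    return x / (1.0 + np.exp(-beta * x))

def leaky_relu(x, alpha=0.1):
    """Leaky-ReLU: identity for x >= 0, slope alpha for x < 0.

    alpha = 0.1 is an assumed value, not taken from the paper.
    """
    return np.where(x >= 0, x, alpha * x)
```

Compared with plain ReLU, both variants avoid zero gradients for negative inputs, which is one common motivation for choosing them in small detection networks.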

       
