Abstract:
Tomato is one of the most widely cultivated greenhouse crops. Autonomous picking vehicles have been in high demand in recent years, owing to the shortage of hand-picking labor under an aging population and the harsh working environment of tomato harvesting. Real-time recognition of tomato maturity is one of the key technologies for such autonomous picking vehicles. However, current state-of-the-art target recognition models suffer from low accuracy in real-time applications when detecting tomatoes that are severely occluded or small in size. In this study, a tomato maturity recognition model was proposed using an improved YOLOv4-tiny. Firstly, a detection head with a 76×76 feature map was added to the head network, using higher-resolution feature maps to improve the recognition of small tomatoes. Secondly, a Convolutional Block Attention Module (CBAM) was integrated into the backbone network of YOLOv4-tiny. This module emphasizes significant features along both the channel and spatial dimensions, thereby improving the recognition of occluded tomatoes. Thirdly, since the feature-extraction accuracy of the ReLU activation function decreased significantly as the convolutional layers deepened, ReLU was replaced with the Mish activation function in the backbone network to maintain accuracy in the deep layers. Finally, Densely Connected Convolutional Networks (DCCN) were adopted to enhance global feature fusion, alleviating the loss of features and the vanishing of gradients during convolution. Dense connections strengthen feature propagation and encourage feature reuse, thereby substantially reducing the number of parameters.
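As an illustration only (not the authors' implementation), the two activation- and attention-related modifications above can be sketched in NumPy. The tensor shape (C, H, W), the shared-MLP weight shapes, and the replacement of CBAM's 7×7 spatial convolution with a plain average are simplifying assumptions for brevity:

```python
import numpy as np

def mish(x):
    # Mish(x) = x * tanh(softplus(x)); smooth and non-monotonic,
    # preserving small negative activations in deep layers.
    return x * np.tanh(np.log1p(np.exp(x)))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbam(feature_map, w1, w2):
    """Simplified CBAM on a (C, H, W) feature map.

    w1, w2: weights of the shared two-layer MLP of the channel-attention
    branch (hypothetical shapes (C, C//r) and (C//r, C)); the 7x7 conv of
    the spatial branch is replaced here by a plain mean for brevity.
    """
    # --- channel attention: squeeze H and W by avg- and max-pooling ---
    avg_pool = feature_map.mean(axis=(1, 2))              # (C,)
    max_pool = feature_map.max(axis=(1, 2))               # (C,)
    mlp = lambda v: np.maximum(v @ w1, 0.0) @ w2          # shared MLP
    channel_att = sigmoid(mlp(avg_pool) + mlp(max_pool))  # (C,)
    x = feature_map * channel_att[:, None, None]
    # --- spatial attention: squeeze the channel axis ---
    avg_map = x.mean(axis=0, keepdims=True)               # (1, H, W)
    max_map = x.max(axis=0, keepdims=True)                # (1, H, W)
    spatial_att = sigmoid((avg_map + max_map) / 2.0)      # conv omitted
    return x * spatial_att
```

Both attention maps are sigmoid-gated, so the module only rescales features; in the paper's setting such a block would sit inside the YOLOv4-tiny backbone rather than run standalone.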
An image dataset was also built to detect the maturity of cherry tomatoes, consisting of 1300 pictures of greenhouse tomatoes at a resolution of 1280×720, covering a variety of complex scenes with severe occlusion, changeable lighting conditions, and various fruit sizes. All tomatoes were labelled with either ripe or underripe ground-truth bounding boxes using the image annotation tool LabelImg. The images were divided into training and test sets at a ratio of 8:2 using the holdout method, with one-fifth of the test set used as a validation set during training. All models were trained for 300 iterations, and the learning rate was attenuated to 10% of its value at 50%, 80%, and 90% of the total iterations. The improved YOLOv4-tiny-X model was compared with YOLOv3, YOLOv4, YOLOv5m, and YOLOv5l. Both the Intersection over Union (IoU) and non-maximum suppression (NMS) thresholds were set to 0.5. Experimental results show that the mean Average Precision (mAP) of the improved model was 97.9%, which was 30.9, 0.2, 5.4, and 4.9 percentage points higher than those of the other models, respectively. The improved model ran at 111 frames per second on an NVIDIA RTX 2060 GPU. The Precision, Recall, F1 score, and mAP50 of the improved YOLOv4-tiny model increased by 0.4, 3.4, 2, and 0.7 percentage points, respectively, compared with the original model. Compared with YOLOv4, the weight file of the improved model was less than one-sixth the size, the detection speed was twice as fast, the network had 339 fewer layers, and the computational cost was only 20.2% of YOLOv4's in GFLOPs. In conclusion, YOLOv4-tiny-X achieved the highest target recognition accuracy and an extremely high detection speed among the compared state-of-the-art models. The improved model can therefore be expected to identify occluded and small tomatoes with high recognition accuracy and robust performance in complex picking environments.
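The IoU and NMS computations behind these 0.5 thresholds can be sketched in plain Python. The (x1, y1, x2, y2) box format and the greedy suppression loop below are illustrative assumptions, not the exact evaluation code used in the study:

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, then drop any remaining
    box whose IoU with it exceeds the threshold; repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep
```

For example, two heavily overlapping detections of the same tomato collapse to the higher-scoring one, while a distant detection survives: `nms([(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)], [0.9, 0.8, 0.7])` keeps indices 0 and 2.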
These findings can provide a strong reference for real-time tomato maturity recognition systems in autonomous picking vehicles.