Synergistic recognition of tomato flowers and fruits in greenhouse using combination enhancement of YOLOX-ViT

吕志远; 张付杰; 魏晓明; 黄媛; 李晶晶; 张钟莉莉

doi:10.11975/j.issn.1002-6819.202211246

Abstract

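Abstract: Synergistic recognition of tomato flowers and fruits provides an important basis for decisions in greenhouse production management and regulation. Existing recognition algorithms lack accuracy in greenhouses because tomatoes are planted densely and plants occlude and overlap one another. To address this problem, this study proposes a synergistic recognition method for tomato flowers and fruits based on cascaded deep learning, introducing image combination enhancement and a front-end ViT classification network to improve the model's detection performance on small targets and dense images. In addition, the cascaded network classifies first and then performs object detection, which resolves the blurring of small targets and the loss of useful information caused by image compression in conventional detection models. Finally, datasets of different tomato types, including large-fruit and string-fruit varieties, were introduced to verify the feasibility and effectiveness of the method. In testing, the proposed object detection model achieved a mean average precision (mAP) of 92.30% and a detection speed of 28.46 frames/s, with average precisions of 87.92%, 92.35%, and 96.62% for small flowers, mature tomatoes, and immature tomatoes, respectively. Ablation experiments showed that, compared with YOLOX and YOLOX with combination enhancement, the improved model raised mAP by 2.38 to 6.11 percentage points; compared with the mainstream detection models YOLOV3, YOLOV4, and YOLOV5, mAP increased by 16.56 to 23.30 percentage points. Visualization results show that the improved model achieved zero missed detections of small targets and no false detections of dense objects, thereby attaining high-precision synergistic detection. The results provide a reference for recognizing tomato growth in greenhouse cultivation environments.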
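The reported mAP can be checked against the three per-class average precision values quoted in the abstract. The sketch below assumes mAP is the unweighted mean over the three classes, which is the usual definition but is not stated explicitly here; the variable and function names are illustrative only.

```python
# Per-class average precision (AP) values quoted in the abstract, in percent.
ap_percent = {
    "small flower": 87.92,
    "mature tomato": 92.35,
    "immature tomato": 96.62,
}


def mean_average_precision(ap_by_class):
    """Unweighted mean of the per-class AP values (assumed definition of mAP)."""
    return sum(ap_by_class.values()) / len(ap_by_class)


map_percent = mean_average_precision(ap_percent)
print(f"mAP = {map_percent:.2f}%")  # prints "mAP = 92.30%"
```

Under that assumption, the three AP values are consistent with the reported mAP of 92.30%.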

Abstract: Collaborative identification of tomato flowers and fruits is an important basis for decision-making in greenhouse production management and regulation. However, current identification algorithms are not accurate enough for modern agriculture, because greenhouse tomatoes are planted densely and plants obstruct and overlap one another. Processing high-resolution images effectively also remains challenging as the resolution of available images increases. There is therefore a strong demand for accurate and rapid detection of tomato flowers and fruits in greenhouse production. In this study, a cascade deep learning approach was proposed for the collaborative recognition of tomato flowers and fruits. A cascade convolutional neural network (CNN) first classified the images and then detected the targets. This approach effectively reduced the small-target blurring and the loss of useful information caused by image compression in traditional detection models. The cascade architecture was expected to improve the accuracy and efficiency of tomato flower and fruit detection in greenhouse production. The experiments and performance analysis are summarized as follows. Firstly, a total of 1 474 high-resolution images (5 184 × 3 456 pixels) of several greenhouse-grown tomatoes were collected at fixed time intervals. These images were annotated with 4 686 tomato floret labels, 6 254 immature tomato labels, and 3 506 mature tomato labels. The labeled objects were used to construct the target detection and image classification datasets through equalization image enhancement. Secondly, the model was optimized to improve its performance on small targets and dense images. The YOLOX target detection backbone network was improved by introducing image combination enhancement and a front-end Vision Transformer (ViT) classification network. 
Image combination enhancement was used to improve the visibility of small targets in dense images, while the front-end ViT classification network was used to improve the generalization of the model when classifying small targets. Finally, the feasibility and validity of the improved model were verified on datasets of different tomato types, such as large-fruit and string-fruit varieties. The results demonstrate that the improved model performed better at detecting and classifying tomato flowers and fruits in high-resolution images, even under challenging conditions such as small targets and dense images. In the image classification test, the front-end ViT classification network on 3×3 image slices achieved an accuracy of 99.71%, a recall of 99.71%, a precision of 99.72%, and an mF1 score of 0.99. The model weight was 335.2 MB, and the detection speed was 78.39 frames/s. This was an improvement of 5.8 to 7.7 percentage points in accuracy over existing classification networks such as MobileNet, ResNet50, and VGG. The proposed object detection model achieved a mean average precision (mAP) of 92.30% and a detection speed of 28.46 frames/s, with average precision (AP) values of 87.92%, 92.35%, and 96.62% for small flowers, ripe tomatoes, and unripe tomatoes, respectively. Ablation experiments further demonstrated that the improved model increased mAP by 3.75 to 6.11 percentage points compared with YOLOX and combination-enhanced YOLOX, and by 16.56 to 23.30 percentage points compared with mainstream detection models such as YOLOV3, YOLOV4, and YOLOV5. Because the improved YOLOX-ViT model added a front-end classification network to the original model, its detection speed was slower, at 28.46 frames/s. 
Visualization results show that the improved model achieved zero missed detections of small targets and accurate detection of dense objects, indicating high-precision collaborative detection. Overall, the YOLOX-ViT model can be expected to detect small targets effectively and efficiently by eliminating the background and reducing redundancy. At the same time, it maintained high accuracy and robustness by reducing the loss of useful information caused by image compression in dense images. Therefore, the cascaded and combined target detection model is suitable for real-world greenhouse environments. These findings provide a strong reference for recognizing tomato growth in greenhouse environments.
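The classify-then-detect cascade on 3×3 image slices can be illustrated with a minimal sketch. It assumes equal, non-overlapping tiles and assumes the front-end classifier is used to skip tiles without targets; neither detail is confirmed by the abstract. The names `grid_tiles`, `cascade_detect`, `has_target`, and `detect` are hypothetical stand-ins, not the authors' code, and the two callables stand in for the ViT classifier and the YOLOX detector.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels
Detection = Tuple[str, Box]      # (class label, box)


def grid_tiles(width: int, height: int, rows: int = 3, cols: int = 3) -> List[Box]:
    """Split a width x height image into rows x cols non-overlapping tiles."""
    tiles = []
    for r in range(rows):
        for c in range(cols):
            tiles.append((c * width // cols, r * height // rows,
                          (c + 1) * width // cols, (r + 1) * height // rows))
    return tiles


def cascade_detect(width: int, height: int,
                   has_target: Callable[[Box], bool],
                   detect: Callable[[Box], List[Detection]]) -> List[Detection]:
    """Classify each tile first; run the detector only on tiles flagged as
    containing targets, then shift tile-local boxes to full-image coordinates."""
    results = []
    for tile in grid_tiles(width, height):
        if not has_target(tile):  # hypothetical stand-in for the ViT classifier
            continue
        x0, y0 = tile[0], tile[1]
        for label, (bx0, by0, bx1, by1) in detect(tile):  # stand-in for YOLOX
            results.append((label, (bx0 + x0, by0 + y0, bx1 + x0, by1 + y0)))
    return results


# Stub models for illustration: only the centre tile "contains" a flower.
tiles = grid_tiles(5184, 3456)
found = cascade_detect(
    5184, 3456,
    has_target=lambda tile: tile == tiles[4],
    detect=lambda tile: [("flower", (10, 20, 50, 60))],
)
print(tiles[0], found)
```

With a 5 184 × 3 456 image, each of the nine tiles is 1 728 × 1 152 pixels, so a detector that resizes its input discards far less detail per tile than it would on the full frame.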
