Abstract: Collaborative identification of tomato flowers and fruits underpins key decisions in greenhouse production management and regulation. However, current identification algorithms are not accurate enough for modern agriculture, owing to the high planting density of greenhouse tomatoes and to factors such as plant occlusion and overlap. As the resolution of available images increases, processing high-resolution images effectively also remains challenging. There is therefore strong demand for accurate and rapid detection of tomato flowers and fruits in greenhouse production. In this study, a cascade deep learning approach was proposed for the collaborative recognition of tomato flowers and fruits. A cascade convolutional neural network (CNN) was used to classify and then detect the targets. This approach reduced the blurring of small targets and the loss of effective information caused by image compression in traditional detection models. The cascade CNN architecture was expected to improve the accuracy and efficiency of tomato flower and fruit detection in greenhouse production. The experiments and performance analysis were as follows. Firstly, a total of 1 474 high-resolution images (5 184 × 3 456 pixels) of several greenhouse-grown tomato varieties were collected at fixed time intervals. These images were annotated with 4 686 tomato floret labels, 6 254 immature tomato labels, and 3 506 mature tomato labels. The labeled objects were used to construct the target detection and image classification datasets, with equalization image enhancement applied. Secondly, the model was optimized for small-target and dense-image detection. The YOLOX target detection network was improved by introducing image combination enhancement and a front-end Vision Transformer (ViT) classification network.
Of these, the image combination enhancement improved the visibility of small targets in dense images, while the front-end ViT classification network improved the generality of the model when classifying small targets. Finally, the feasibility and validity of the improved models were verified on datasets of different tomato varieties, such as large-fruit and string-fruit types. The results show that the improved model detected and classified tomato flowers and fruits in high-resolution images more accurately, even under challenging conditions such as small targets and dense images. In the image classification test, the front-end ViT classification network operating on 3×3 image slices achieved an accuracy of 99.71%, a recall of 99.71%, a precision of 99.72%, and an mF1 score of 0.99. The model weight was 335.2 MB, and the detection speed was 78.39 frames/s. This accuracy was 5.8 to 7.7 percentage points higher than that of existing classification networks such as MobileNet, ResNet-50, and VGG. The proposed object detection model achieved a mean average precision (mAP) of 92.30% and a detection speed of 28.46 frames/s, with average precision (AP) values of 87.92%, 92.35%, and 96.62% for small flowers, ripe tomatoes, and unripe tomatoes, respectively. Ablation experiments further showed that the improved model raised mAP by 3.75 to 6.11 percentage points over the YOLOX and combined-enhancement YOLOX models, and by 16.56 to 23.30 percentage points over mainstream detection models such as YOLOv3, YOLOv4, and YOLOv5. Because the improved YOLOX-ViT model adds a front-end classification network to the original model, its detection speed is lower, at 28.46 frames/s.
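As a quick consistency check on the reported figures, the mAP can be recomputed as the unweighted mean of the three per-class AP values. The snippet below is only an arithmetic sketch using the numbers quoted above; the variable names are illustrative and not taken from the study.

```python
# Per-class AP values (percent) as reported in the abstract.
ap_percent = {
    "small flowers": 87.92,
    "ripe tomatoes": 92.35,
    "unripe tomatoes": 96.62,
}

# mAP is the unweighted mean of the per-class AP values.
map_percent = sum(ap_percent.values()) / len(ap_percent)
print(f"mAP = {map_percent:.2f}%")  # prints: mAP = 92.30%
```

The result matches the reported mAP of 92.30%.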
Visual results show that the improved model achieved zero missed detections for small targets and accurate detection of dense objects, indicating high-precision collaborative detection. Overall, the YOLOX-ViT model detects small targets effectively and efficiently by eliminating background regions and reducing redundancy. At the same time, it maintains high accuracy and robustness in dense images by reducing the loss of effective information caused by image compression. Therefore, the cascaded and combined target detection model is suitable for the real-world environment of greenhouses. These findings can serve as a strong reference for recognizing tomato growth in greenhouse environments.
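The cascade idea summarized above (slice the high-resolution image, let a front-end classifier discard background slices, then run the detector only on the remaining slices) can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's implementation: `split_into_tiles` and `cascade_detect` are hypothetical names, `has_target` stands in for the ViT classifier and `detect` for the YOLOX detector, and objects that straddle slice boundaries are not handled.

```python
import numpy as np


def split_into_tiles(image, rows=3, cols=3):
    """Split an H x W x C array into rows * cols tiles.

    Returns a list of (tile, (y0, x0)) pairs, where (y0, x0) is the
    top-left corner of the tile in full-image coordinates.
    """
    height, width = image.shape[:2]
    y_edges = np.linspace(0, height, rows + 1, dtype=int)
    x_edges = np.linspace(0, width, cols + 1, dtype=int)
    tiles = []
    for r in range(rows):
        for c in range(cols):
            y0, y1 = int(y_edges[r]), int(y_edges[r + 1])
            x0, x1 = int(x_edges[c]), int(x_edges[c + 1])
            tiles.append((image[y0:y1, x0:x1], (y0, x0)))
    return tiles


def cascade_detect(image, has_target, detect, rows=3, cols=3):
    """Run `detect` only on tiles that `has_target` keeps.

    has_target(tile) -> bool            (assumed stand-in for the classifier)
    detect(tile) -> iterable of (x_min, y_min, x_max, y_max, label)
                                         (assumed stand-in for the detector)
    Boxes are shifted back into full-image coordinates.
    """
    boxes = []
    for tile, (y0, x0) in split_into_tiles(image, rows, cols):
        if not has_target(tile):
            continue  # background tile: skip the more expensive detector
        for x_min, y_min, x_max, y_max, label in detect(tile):
            boxes.append((x_min + x0, y_min + y0, x_max + x0, y_max + y0, label))
    return boxes


# Usage with stub models on a tiny synthetic image.
demo = np.zeros((6, 9, 3), dtype=np.uint8)
demo[4, 7] = 255  # one bright pixel in the bottom-right tile
demo_boxes = cascade_detect(
    demo,
    has_target=lambda tile: bool(tile.any()),
    detect=lambda tile: [(0, 0, 1, 1, "flower")],
)
```

With a 5 184 × 3 456 pixel image and a 3×3 grid, each slice produced by this sketch is 1 728 × 1 152 pixels, so the detector sees far less downscaling than it would on the full frame.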