Abstract:
Tomato is one of the most widely cultivated greenhouse crops. Autonomous picking vehicles have been in high demand in recent years, owing to the shortage of hand-picking labor under an aging population and the harsh working environment of tomato harvesting. Real-time recognition of tomato maturity is one of the key technologies for such autonomous picking vehicles. However, current state-of-the-art target recognition models suffer from low accuracy in real-time applications when detecting tomatoes that are severely occluded or small in size. In this study, a tomato maturity recognition model was proposed using an improved YOLOv4-tiny. Firstly, a detection head with a 76×76 feature map was added to the head network, using higher-resolution feature maps to improve the recognition of small tomatoes. Secondly, a Convolutional Block Attention Module (CBAM) was integrated into the backbone network of YOLOv4-tiny. This module emphasizes significant features along both the channel and spatial dimensions, thereby improving the recognition of occluded tomatoes. Thirdly, since the feature-extraction accuracy of the ReLU activation function decreased significantly as the convolutional layers deepened, ReLU was replaced with the Mish activation function in the backbone network to maintain accuracy in the deep layers. Finally, Densely Connected Convolutional Networks (DCCN) were adopted to enhance global feature fusion, alleviating the loss of features and the vanishing of gradients during convolution. Dense connections strengthen feature propagation and encourage feature reuse, thereby substantially reducing the number of parameters.
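As an illustration only (not the authors' implementation), the two activation- and attention-related modifications above can be sketched in NumPy. The tensor shape (C, H, W), the shared-MLP weight shapes, and the replacement of CBAM's 7×7 spatial convolution with a plain average are simplifying assumptions for brevity:

```python
import numpy as np

def mish(x):
    # Mish(x) = x * tanh(softplus(x)); smooth and non-monotonic,
    # preserving small negative activations in deep layers.
    return x * np.tanh(np.log1p(np.exp(x)))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbam(feature_map, w1, w2):
    """Simplified CBAM on a (C, H, W) feature map.

    w1, w2: weights of the shared two-layer MLP of the channel-attention
    branch (hypothetical shapes (C, C//r) and (C//r, C)); the 7x7 conv of
    the spatial branch is replaced here by a plain mean for brevity.
    """
    # --- channel attention: squeeze H and W by avg- and max-pooling ---
    avg_pool = feature_map.mean(axis=(1, 2))              # (C,)
    max_pool = feature_map.max(axis=(1, 2))               # (C,)
    mlp = lambda v: np.maximum(v @ w1, 0.0) @ w2          # shared MLP
    channel_att = sigmoid(mlp(avg_pool) + mlp(max_pool))  # (C,)
    x = feature_map * channel_att[:, None, None]
    # --- spatial attention: squeeze the channel axis ---
    avg_map = x.mean(axis=0, keepdims=True)               # (1, H, W)
    max_map = x.max(axis=0, keepdims=True)                # (1, H, W)
    spatial_att = sigmoid((avg_map + max_map) / 2.0)      # conv omitted
    return x * spatial_att
```

Both attention maps are sigmoid-gated, so the module only rescales features; in the paper's setting such a block would sit inside the YOLOv4-tiny backbone rather than run standalone.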
An image dataset was also built to detect the maturity of cherry tomatoes, consisting of 1300 pictures of greenhouse tomatoes at a resolution of 1280×720, covering a variety of complex scenes with severe occlusion, changeable lighting conditions, and various fruit sizes. All tomatoes were labelled with either ripe or underripe ground-truth bounding boxes using the image annotation tool LabelImg. The images were divided into training and test sets at a ratio of 8:2 using the holdout method, with one-fifth of the test set used as a validation set during training. All models were trained for 300 iterations, and the learning rate was attenuated to 10% of its value at 50%, 80%, and 90% of the total iterations. The improved YOLOv4-tiny-X model was compared with YOLOv3, YOLOv4, YOLOv5m, and YOLOv5l. Both the Intersection over Union (IoU) and non-maximum suppression (NMS) thresholds were set to 0.5. Experimental results show that the mean Average Precision (mAP) of the improved model was 97.9%, which was 30.9, 0.2, 5.4, and 4.9 percentage points higher than those of the other models, respectively. The improved model ran at 111 frames per second on an NVIDIA RTX 2060 GPU. The Precision, Recall, F1 score, and mAP50 of the improved YOLOv4-tiny model increased by 0.4, 3.4, 2, and 0.7 percentage points, respectively, compared with the original model. Compared with YOLOv4, the weight file of the improved model was less than one-sixth the size, the detection speed was twice as fast, the network had 339 fewer layers, and the computational cost was only 20.2% of YOLOv4's in GFLOPs. In conclusion, YOLOv4-tiny-X achieved the highest target recognition accuracy and an extremely high detection speed among the compared state-of-the-art models. The improved model can therefore be expected to identify occluded and small tomatoes with high recognition accuracy and robust performance in complex picking environments.
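The IoU and NMS computations behind these 0.5 thresholds can be sketched in plain Python. The (x1, y1, x2, y2) box format and the greedy suppression loop below are illustrative assumptions, not the exact evaluation code used in the study:

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, then drop any remaining
    box whose IoU with it exceeds the threshold; repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep
```

For example, two heavily overlapping detections of the same tomato collapse to the higher-scoring one, while a distant detection survives: `nms([(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)], [0.9, 0.8, 0.7])` keeps indices 0 and 2.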
These findings can provide a strong reference for real-time tomato maturity recognition systems in autonomous picking vehicles.