Abstract:
Continuous and stable tracking of fruit objects is required for modern tomato production. However, counting accuracy in greenhouse production remains limited by the constraints of the crop planting mode. In this study, a counting method was proposed for tomatoes of varying maturity, combining ultra-depth masking with an improved YOLOv8 model. Multi-head self-attention (MHSA) was introduced to construct a spatially heterogeneous convolution kernel, and global features were integrated using the lightweight convolution operator Involution. The resulting convolution operator, termed Global Attention-based Involution (GAInvolution), formed the backbone network of the tomato detector YOLOT, an improved YOLOv8 model. The model also incorporated the WIoU (wise intersection over union) loss to improve the robustness of the object labeling process. In addition, depth information was predicted by the monocular depth estimation model Depth Anything, and distant fruit objects were dynamically filtered out to avoid tracking loss and duplicate tracking; this processing is referred to as ultra-depth masking. A tomato counter was also optimized using the BoT-SORT algorithm. By combining the tomato detector, depth estimation model, ultra-depth masking, and object counter, a comprehensive framework was constructed to identify and count tomatoes of different maturity levels. The experimental results showed that, compared with the original YOLOv8, the improved tomato detector increased the mean average precision at an IoU threshold of 0.5 (mAP50) by 3.2 percentage points and the recall rate by 3.7 percentage points, indicating the effectiveness of the improved model. The GAInvolution operator significantly improved the tomato detector: the detector with GAInvolution as the main operator achieved increases of 2.7 and 2.8 percentage points in mAP50 and recall rate, respectively. Introducing the WIoU loss function further improved detection accuracy, with mAP50 increasing by 1.1 percentage points compared with the original. Ultra-depth masking greatly contributed to the accuracy of tomato fruit counting. The depth threshold was calculated dynamically from the depth map predicted by the Depth Anything model. The highest counting accuracy was achieved when the threshold was set to the average depth value of the depth map minus 0.5 times the depth standard deviation. With ultra-depth masking, the average counting precision (ACP) of tomato fruits increased by 12.63 percentage points. The method achieved an ACP of 93.80% on the tomato counting task at a calculation speed of 23 frames per second. The accuracy of fruit counting was closely related to the performance of the object detector. Counting accuracy also varied with the inspection viewport: the vertical viewport achieved 2.59 percentage points higher accuracy than the parallel one. Tomatoes of different maturity levels were counted with an ACP of at least 91%. These findings offer valuable technical insights for predicting fruit and vegetable yields using visual technology.
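
The threshold rule described above (average depth minus 0.5 times the depth standard deviation) can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the (x1, y1, x2, y2) box format, the use of the mean depth inside a box to summarise a detection, and the convention that larger depth-map values correspond to nearer pixels are all hypothetical choices made here for clarity.

```python
from statistics import fmean, pstdev


def ultra_depth_threshold(depth_map, k=0.5):
    """Return mean(depth) - k * std(depth) over all pixels of a 2D depth map.

    depth_map is a list of rows of floats; k = 0.5 is the factor reported
    in the abstract as giving the highest counting accuracy.
    """
    values = [v for row in depth_map for v in row]
    return fmean(values) - k * pstdev(values)


def keep_detection(depth_map, box, threshold, larger_is_nearer=True):
    """Decide whether a detection survives the mask (hypothetical helper).

    box is (x1, y1, x2, y2) in pixel coordinates, end-exclusive.
    Assumption: a detection is summarised by the mean depth inside its box.
    Assumption: with larger_is_nearer=True the map behaves like a relative
    (inverse) depth map, so boxes at or above the threshold are kept.
    """
    x1, y1, x2, y2 = box
    patch = [v for row in depth_map[y1:y2] for v in row[x1:x2]]
    box_depth = fmean(patch)
    if larger_is_nearer:
        return box_depth >= threshold
    return box_depth <= threshold


# Toy 2 x 4 depth map: left half far (0.0), right half near (1.0).
toy_map = [
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]
toy_threshold = ultra_depth_threshold(toy_map)
near_kept = keep_detection(toy_map, (2, 0, 4, 2), toy_threshold)
far_kept = keep_detection(toy_map, (0, 0, 2, 2), toy_threshold)
```

Because the threshold is recomputed from each frame's own depth statistics, it adapts as the camera moves along the row. Whether "distant" corresponds to values below or above the threshold depends on how the depth model encodes depth, which is why the comparison direction is exposed as a parameter in this sketch.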