Abstract:
Accurately identifying the ripeness of colorful tomato fruits is often required for efficient robotic classification and harvesting. This study aimed to detect the ripeness of colorful tomatoes of diverse colors against complex backgrounds. Following multi-granularity theory, a "segment first, then detect" approach was proposed to detect the fine-grained colorful tomatoes using U-YOLOv8n. Firstly, a U-Net model was used to segment the colorful tomato clusters, with pre-trained weights obtained from training on the public VOC2007 dataset. VGG16 was selected as the backbone network for feature extraction; its network structure and hierarchical feature extraction enabled the model to handle more complex and diverse image data. A global attention mechanism (GAM) was introduced to process image information in both the channel and spatial dimensions, further improving segmentation performance. The region of interest (ROI) was then segmented out to reduce interference from the complex background on the second-stage detection task. Secondly, the Bottleneck layer in the C2f module of YOLOv8n was replaced with the MS-Block module, yielding a C2f_MS module that extracts multi-scale features. Finally, some Conv layers were replaced with the SCDown module to reduce computational redundancy in the spatial information, and the small-object detection head and part of the Neck structure were removed to reduce the computational load, optimizing overall accuracy and efficiency. The results indicated that the mean pixel accuracy (MPA), mean intersection over union (MIoU), and Dice coefficient of the U-Net increased by 1.13, 1.37, and 0.67 percentage points, respectively, when the pre-trained weights were used, and by a further 0.21, 0.60, and 0.36 percentage points, respectively, after VGG16 was adopted as the backbone network.
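As a minimal illustration of the "segment first, then detect" idea, the sketch below masks an image with a binary segmentation mask so that only the segmented tomato-cluster ROI reaches the second-stage detector. This is a simplified stand-in under stated assumptions: the U-Net segmentation model itself is not reproduced, and the function and variable names are hypothetical.

```python
import numpy as np

def mask_roi(image, seg_mask):
    """Zero out background pixels so the second-stage detector sees
    only the segmented region of interest.

    image    : H x W x 3 array (RGB image)
    seg_mask : H x W binary array (1 = tomato cluster, 0 = background)

    Illustrative only -- in the paper this mask would come from the
    improved U-Net; here it is supplied directly.
    """
    # Broadcast the 2-D mask over the 3 color channels
    return image * seg_mask[..., None]

# Toy example: 4x4 gray image; the mask keeps only the 2x2 centre block
img = np.full((4, 4, 3), 200, dtype=np.uint8)
mask = np.zeros((4, 4), dtype=np.uint8)
mask[1:3, 1:3] = 1
roi = mask_roi(img, mask)
```

In this sketch, background pixels become zero while pixels inside the mask are unchanged, which mirrors how segmenting the ROI first removes background clutter before detection.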
The deeper architecture more effectively captured multi-scale and high-level semantic features. With the GAM attention module introduced into the backbone network, the segmentation performance of U-Net was further enhanced: the MPA, MIoU, and Dice of the improved U-Net reached 94.35%, 88.98%, and 94.39%, respectively, improvements of 1.55, 2.38, and 1.31 percentage points over the original. In YOLOv8n, the C2f_MS module improved the precision (P), recall (R), and mean average precision (mAP0.5) by 1.5, 0.3, and 0.4 percentage points, respectively, indicating stronger feature extraction on tomatoes of different sizes in the same image. Replacing some Conv layers with the SCDown module reduced the parameters and the weight size by 0.7 M and 2.21 MB, respectively, compared with the original. A more lightweight model was obtained by removing the small-object detection head and part of the Neck structure: the floating-point operations (FLOPs) were reduced by 2.5 G, whereas the precision and mean average precision increased by 1.6 and 0.3 percentage points, respectively, an overall improvement in both accuracy and efficiency. The precision and mean average precision of the improved YOLOv8n reached 93.5% and 95.6%, respectively, increases of 3.1 and 0.5 percentage points over the original. In the harder-to-recognize half-ripe stage, the precision and mean average precision reached 91.8% and 92.4%, respectively, increases of 5.7 and 1.3 percentage points. In terms of lightweight performance, the FLOPs and parameter volume were reduced by 2.5 G and 0.43 M, respectively, compared with the original YOLOv8n, while the frames per second (FPS) increased by 41%. Thus, the ripeness detection of colorful tomatoes was effectively realized in complex backgrounds. These findings can provide technical support for decision-making on ripeness grading and harvesting.
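The multi-scale idea behind the C2f_MS module can be sketched as follows: extract responses at several receptive-field sizes and stack them along the channel axis, so that tomatoes of different sizes in the same image activate at least one scale strongly. This is a hedged simplification, not the MS-Block itself: the actual module uses learned convolutions, whereas simple box averaging stands in here, and the function name and scale values are hypothetical.

```python
import numpy as np

def multiscale_features(x, scales=(1, 2, 4)):
    """Stack box-averaged responses at several scales along the
    channel axis (a crude stand-in for learned multi-scale convs).

    x : C x H x W feature array; H and W are assumed divisible by
        every scale in `scales` to keep the sketch short.
    """
    c, h, w = x.shape
    outs = []
    for s in scales:
        if s == 1:
            outs.append(x)  # identity branch: full resolution
        else:
            # Average-pool with an s x s window, then upsample back
            # to H x W by repetition so all branches can be stacked
            pooled = x.reshape(c, h // s, s, w // s, s).mean(axis=(2, 4))
            outs.append(np.repeat(np.repeat(pooled, s, axis=1), s, axis=2))
    return np.concatenate(outs, axis=0)

# Toy example: 2-channel 4x4 feature map, three scales -> 6 channels
x = np.arange(32, dtype=float).reshape(2, 4, 4)
feats = multiscale_features(x)
```

The coarsest branch collapses toward the per-channel global mean, while the finest branch preserves local detail; a learned block would weight and mix these branches rather than simply concatenating them.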