A fine-grained ripeness detection algorithm for colorful cherry tomatoes based on U-YOLOv8n

    A U-YOLOv8n-based algorithm for fine-grained ripeness detection of colorful tomatoes

    • Abstract: Accurate identification of the ripeness of colorful tomato fruits is the basis for efficient robotic sorting and harvesting. To address the diverse colors of colorful tomatoes, complex backgrounds, and limited ripeness-detection accuracy, this study proposes U-YOLOv8n, a fine-grained "segment first, then detect" method for colorful tomato detection grounded in multi-granularity theory. First, U-Net is used to segment the colorful tomato clusters, with VGG16 as its backbone network to strengthen feature extraction; a global attention mechanism (GAM) is introduced to improve segmentation performance. The region of interest (ROI) is segmented to reduce the interference of the complex background with the second-stage detection task. Second, with YOLOv8n as the baseline model, a C2f_MS module is constructed to enhance multi-scale feature extraction; SCDown modules replace some Conv layers, reducing computational redundancy while preserving fine spatial information; and the small-object detection head and part of the Neck structure are removed to reduce the computational load. The results show that the improved U-Net reaches a mean pixel accuracy (MPA) of 94.35% and a mean intersection over union (MIoU) of 88.98%. The improved YOLOv8n reaches a precision of 93.5% and a mean average precision (mAP0.5) of 95.6%, 3.1 and 0.5 percentage points higher than YOLOv8n; for the hard-to-recognize half-ripe stage, precision and mAP0.5 reach 91.8% and 92.4%, 5.7 and 1.3 percentage points above YOLOv8n. In terms of lightweighting, computation and parameter counts drop by 31% and 14% relative to YOLOv8n, while the frame rate (frames per second, FPS) rises by 41%. The method effectively accomplishes colorful tomato ripeness detection in complex backgrounds and provides technical support for ripeness grading and intelligent harvesting of colorful tomatoes.
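The "segment first, then detect" workflow summarized above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `segmenter`, `detector`, and the ROI helper below are hypothetical stand-ins for the U-Net and YOLOv8n stages.

```python
import numpy as np

def largest_roi(mask):
    """Bounding box (y0, y1, x0, x1) of the foreground pixels
    in a binary segmentation mask."""
    ys, xs = np.nonzero(mask)
    return ys.min(), ys.max() + 1, xs.min(), xs.max() + 1

def segment_then_detect(image, segmenter, detector):
    """Two-stage "segment first, then detect" pipeline:
    1) segment the tomato cluster to get a binary mask (U-Net stage),
    2) crop the ROI to suppress the complex background,
    3) run the ripeness detector on the crop only (YOLOv8n stage)."""
    mask = segmenter(image)                      # stage 1: cluster mask
    y0, y1, x0, x1 = largest_roi(mask)
    roi = image[y0:y1, x0:x1]                    # background removed
    boxes = detector(roi)                        # stage 2: ripeness boxes
    # map detections back to the full-image coordinate frame
    return [(x + x0, y + y0, w, h, cls) for (x, y, w, h, cls) in boxes]
```

With mock stages, a detection at (5, 5) inside an ROI whose top-left corner is (30, 20) maps back to (35, 25) in full-image coordinates.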

       

      Abstract: Accurate identification of the ripeness of colorful tomato fruits is the basis for efficient robotic sorting and harvesting. This study addresses ripeness detection of colorful tomatoes with diverse colors under complex backgrounds. Based on multi-granularity theory, a "segment first, then detect" method, U-YOLOv8n, was proposed for fine-grained colorful tomato detection. First, a U-Net model was used to segment the colorful tomato clusters, initialized with weights pre-trained on the VOC2007 public dataset. VGG16 was selected as the backbone network for feature extraction; its network structure and hierarchical feature extraction were better suited to complex and diverse image data. A global attention mechanism (GAM) was introduced to process image information in both the channel and spatial dimensions, further improving segmentation performance. The region of interest (ROI) was segmented to reduce interference from the complex background in the second-stage detection task. Second, the Bottleneck layers in the C2f module of YOLOv8n were replaced with MS-Block modules, forming the C2f_MS module for multi-scale feature extraction. Some Conv layers were then replaced with SCDown modules to reduce computational redundancy while retaining spatial detail. Finally, the small-object detection head and part of the Neck structure were removed to reduce the computational load and jointly optimize accuracy and efficiency. The results indicated that the pre-trained weights increased the mean pixel accuracy (MPA), mean intersection over union (MIoU), and Dice coefficient of U-Net by 1.13, 1.37, and 0.67 percentage points, respectively; adopting VGG16 as the backbone added a further 0.21, 0.6, and 0.36 percentage points, as the deeper architecture captured multi-scale and high-level semantic features more effectively. With the GAM module added to the backbone, the improved U-Net reached an MPA, MIoU, and Dice of 94.35%, 88.98%, and 94.39%, respectively, improvements of 1.55, 2.38, and 1.31 percentage points over the original. In YOLOv8n, the C2f_MS module improved the precision (P), recall (R), and mean average precision (mAP0.5) by 1.5, 0.3, and 0.4 percentage points, indicating stronger feature extraction for tomatoes of different sizes within the same image. Replacing some Conv layers with SCDown modules reduced the parameters and the weight size by 0.7 and 2.21 M, respectively, compared with the original. Removing the small-object detection head and the associated Neck structures yielded a lighter model: the floating-point operations (FLOPs) fell by 2.5 G, while the precision and mAP0.5 rose by 1.6 and 0.3 percentage points, respectively, an overall gain in both accuracy and efficiency. The improved YOLOv8n reached a precision of 93.5% and an mAP0.5 of 95.6%, increases of 3.1 and 0.5 percentage points over the original. For the hard-to-recognize half-ripe stage, the precision and mAP0.5 reached 91.8% and 92.4%, increases of 5.7 and 1.3 percentage points. In terms of lightweighting, the FLOPs and parameters were reduced by 2.5 G and 0.43 M, respectively, compared with YOLOv8n, and the frame rate (frames per second, FPS) increased by 41%. The proposed method effectively detects the ripeness of colorful tomatoes in complex backgrounds and can provide technical support for ripeness grading and harvesting decisions.
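The GAM attention mentioned in the abstract gates features in both the channel and spatial dimensions. The sketch below is a simplified numpy illustration of that idea only, under stated assumptions: the real GAM uses a permuted-MLP channel submodule and 7×7 convolutions for spatial attention, whereas here the channel gate is a per-pixel two-layer MLP and the spatial gate is derived from channel-wise statistics.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gam_attention(x, w1, w2):
    """Simplified GAM-style attention (illustrative, not the exact module).
    x: feature map of shape (H, W, C); w1: (C, C//r), w2: (C//r, C)
    form a bottleneck MLP over the channel dimension."""
    # channel attention: two-layer MLP across channels, sigmoid gate
    ca = sigmoid(np.maximum(x @ w1, 0.0) @ w2)    # (H, W, C)
    x = x * ca
    # spatial attention: gate each location by its channel-wise mean
    # (stand-in for GAM's convolutional spatial submodule)
    sa = sigmoid(x.mean(axis=-1, keepdims=True))  # (H, W, 1)
    return x * sa
```

Both gates are sigmoid-bounded, so the module rescales features without changing the tensor shape, which is why it can be dropped into an existing backbone.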
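The SCDown replacement described above decouples channel adjustment from spatial downsampling. The following is a rough sketch of that decoupling, not the actual module: the real SCDown (from YOLOv10) uses a 1×1 pointwise convolution followed by a strided 3×3 depthwise convolution with learned weights, which is approximated here by strided 2×2 mean pooling.

```python
import numpy as np

def scdown(x, w_pw, stride=2):
    """SCDown-style downsampling sketch: channel mixing and spatial
    reduction are performed as two separate, cheaper steps.
    x: (H, W, C_in); w_pw: (C_in, C_out) pointwise weights."""
    x = x @ w_pw                                  # pointwise: channels only
    h, w, c = x.shape
    h2, w2 = h - h % stride, w - w % stride       # crop to a multiple of stride
    x = x[:h2, :w2]
    # strided depthwise stage, approximated by 2x2 mean pooling per channel
    x = x.reshape(h2 // stride, stride, w2 // stride, stride, c).mean(axis=(1, 3))
    return x
```

Splitting the two operations is what saves computation relative to a single strided dense convolution, while the pointwise stage preserves per-location detail before the resolution drops.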

       

