A real-time 3D pose fusion estimation method for tomato based on instance segmentation and spatial analysis

    • Abstract: To address the low accuracy and poor real-time performance of existing fruit pose estimation methods in complex cultivation environments, this study proposes a real-time 3D pose estimation method for tomatoes that fuses lightweight instance segmentation with spatial analysis. An improved lightweight YOLOv7-M1 network was constructed to extract high-precision fruit masks and rapidly locate keypoint regions of interest; an HRNet-ECA model embedding an efficient channel attention mechanism was designed to improve detection accuracy; and a multi-modal data fusion framework was built that combines depth maps with the regions of interest, obtaining the 3D pose parameters of fruits in real time through point cloud filtering and spatial geometric computation. Experimental results showed that the improved YOLOv7-M1 achieved a mask segmentation average precision of 95.56%, a recall of 93.52%, and an accuracy of 96.17%; the improved HRNet-ECA achieved a keypoint similarity of 96.61%, a pose estimation accuracy of 95.0%, a mean 3D orientation angle error of 9.40°, a mean keypoint localization error of 4.13 mm, and mean keypoint errors of 3.41, 2.95, and 1.02 mm in the X, Y, and Z directions, respectively. The average processing time per fruit was 0.063 s. The method cascades a lightweight instance segmentation network with an improved keypoint detection model and combines point cloud spatial analysis, balancing accuracy with real-time efficiency to achieve high-precision, real-time pose estimation of tomato fruits. It can provide an efficient solution for precise, automated harvesting of fruits and vegetables in complex agricultural scenes.
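The point cloud filtering described above (down-sampling plus outlier removal before geometric computation) can be sketched in plain numpy. This is a minimal illustration, not the paper's implementation; the function names, neighborhood size `k`, and thresholds are assumptions chosen for clarity:

```python
import numpy as np

def voxel_downsample(points, voxel_size):
    """Down-sample an (N, 3) cloud by keeping one centroid per occupied voxel."""
    keys = np.floor(points / voxel_size).astype(np.int64)
    _, inverse = np.unique(keys, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # guard against numpy-version shape differences
    n_voxels = inverse.max() + 1
    sums = np.zeros((n_voxels, 3))
    np.add.at(sums, inverse, points)               # accumulate points per voxel
    counts = np.bincount(inverse, minlength=n_voxels)[:, None]
    return sums / counts                           # voxel centroids

def remove_statistical_outliers(points, k=8, std_ratio=2.0):
    """Drop points whose mean k-nearest-neighbor distance exceeds
    mean + std_ratio * std over the whole cloud (statistical outlier removal)."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    d.sort(axis=1)
    mean_knn = d[:, 1:k + 1].mean(axis=1)          # column 0 is the self-distance
    thresh = mean_knn.mean() + std_ratio * mean_knn.std()
    return points[mean_knn <= thresh]
```

The brute-force distance matrix keeps the sketch short; a production pipeline would use a KD-tree (or a library such as Open3D) for the neighbor search.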

       

      Abstract: Tomato harvesting is a typical labor-intensive task, and the development of mechanized harvesting has become an inevitable trend for promoting agricultural development and alleviating labor shortages. For tomato harvesting robots operating in complex greenhouse environments, accurate three-dimensional (3D) pose information of target fruits is a key prerequisite for selective harvesting and has a direct impact on the success rate of grasping and picking. To provide real-time 3D pose information of target fruits under such challenging conditions, this study proposed a real-time tomato 3D pose fusion estimation method based on instance segmentation and spatial analysis. A cascaded architecture was constructed that combined a lightweight instance segmentation network with an improved keypoint detection model, and integrated a point cloud–based spatial analysis module to achieve high-precision, real-time pose estimation. First, a lightweight instance segmentation model, YOLOv7-M1, was developed by improving the original YOLOv7-seg framework. The CBS backbone of YOLOv7-seg was reconstructed using the MobileOne module, which substantially reduced computational cost while maintaining mask segmentation accuracy. The instance segmentation stage generated pixel-level fruit masks and regions of interest (ROIs), thereby providing both the keypoint ROIs and prior knowledge of the fruit geometry and surrounding environment. Second, keypoints within the ROIs were detected using an enhanced HRNet model, denoted HRNet-ECA. In this network, an Efficient Channel Attention (ECA) mechanism was embedded after the parallel output branches of all four stages of HRNet to strengthen channel-wise feature selection and improve the robustness of keypoint localization under fruit clustering, occlusion, and non-uniform illumination. On this basis, a multi-modal data fusion framework was constructed by combining the depth map with the ROIs to generate target fruit point clouds.
The point clouds were then processed via color filtering, outlier removal, down-sampling, and least-squares sphere fitting. Real-time geometric computation on the filtered point clouds yielded the 3D pose of each tomato, including spatial position and orientation. Experimental results showed that the lightweight instance segmentation design successfully improved real-time performance without sacrificing accuracy. Compared with the classical instance segmentation models Mask-RCNN and Solov2, the proposed YOLOv7-M1 model achieved significant gains: average precision increased by 14.35% and 15.21%, accuracy by 14.06% and 14.47%, recall by 13.25% and 11.10%, and mean Average Precision (mAP50) by 14.30% and 14.51%, respectively. Relative to YOLOv7-seg, YOLOv8l-seg, and YOLOv11l-seg, the GFLOPs of YOLOv7-M1 were reduced by 30.23%, 54.75%, and 29.99%, while the frame rate was increased by 10.45%, 14.16%, and 8.34%, respectively. In the keypoint detection stage, three attention mechanisms (SE, CBAM, and ECA) were embedded into HRNet for comparison. The keypoint similarity was improved by 0.87%, 1.68%, and 2.23%, respectively, with ECA providing the largest accuracy gain while increasing GFLOPs by less than 1%, thus meeting real-time constraints. The overall pose estimation framework was further validated on 100 groups of tomato pose data. The average pose estimation accuracy reached 95.0%, the mean 3D orientation error was 9.4°, and the mean keypoint localization error was 4.13 mm. The average position errors in the X, Y, and Z directions were 3.41, 2.95, and 1.02 mm, respectively, and the average processing time per fruit was 0.063 s. Finally, the method was deployed on a tomato harvesting robot and tested in a real greenhouse. Among 50 harvesting trials, 44 were successful, corresponding to an overall success rate of 88%.
These results indicated that the proposed method effectively provided reliable fruit pose information to guide the end effector, achieving a favorable balance between accuracy and real-time performance, and offering an efficient solution for precise, automated harvesting in complex agricultural environments.
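The least-squares sphere fitting used on the filtered point clouds can be sketched with the standard linearization: expanding (p - c)·(p - c) = r² gives the linear system |p|² = 2c·p + (r² - |c|²), so the center c and radius r follow from a single `lstsq` solve. This is a minimal numpy sketch under that standard formulation; the paper's exact solver and parameters are not specified here, and the function name is illustrative:

```python
import numpy as np

def fit_sphere_least_squares(points):
    """Fit a sphere to an (N, 3) point array via linear least squares.

    Solves |p|^2 = 2*c.p + t for unknowns (c, t), where t = r^2 - |c|^2,
    then recovers the radius as r = sqrt(t + |c|^2).
    """
    A = np.hstack([2 * points, np.ones((points.shape[0], 1))])  # [2x 2y 2z 1]
    b = np.sum(points ** 2, axis=1)                             # x^2 + y^2 + z^2
    sol, *_ = np.linalg.lstsq(A, b, rcond=None)
    center = sol[:3]
    radius = np.sqrt(sol[3] + center @ center)
    return center, radius
```

With the fitted center as the fruit position, the orientation can then be derived from the detected keypoints (e.g., the calyx-to-center direction), which is the kind of spatial geometric computation the pipeline performs per fruit.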

       
