Abstract:
Tomato harvesting is a typical labor-intensive task, and mechanized harvesting has become an inevitable trend for promoting agricultural development and alleviating labor shortages. For tomato harvesting robots operating in complex greenhouse environments, accurate three-dimensional (3D) pose information of target fruits is a key prerequisite for selective harvesting and has a direct impact on the success rate of grasping and picking. To provide real-time 3D pose information of target fruits under such challenging conditions, this study proposed a real-time tomato 3D pose fusion estimation method based on instance segmentation and spatial analysis. A cascaded architecture was constructed that combined a lightweight instance segmentation network with an improved keypoint detection model, and integrated a point cloud–based spatial parsing module to achieve high-precision, real-time pose estimation. First, a lightweight instance segmentation model, YOLOv7-M1, was developed by improving the original YOLOv7-seg framework. The CBS backbone of YOLOv7-seg was reconstructed using the MobileOne module, which substantially reduced computational cost while maintaining mask segmentation accuracy. The instance segmentation stage generated pixel-level fruit masks and regions of interest (ROIs), thereby providing both the keypoint ROIs and prior knowledge of the fruit geometry and surrounding environment. Second, keypoints within the ROIs were detected using an enhanced HRNet model, denoted HRNet-ECA. In this network, an Efficient Channel Attention (ECA) mechanism was embedded after the parallel output branches of all four stages of HRNet to strengthen channel-wise feature selection and improve the robustness of keypoint localization under fruit clustering, occlusion, and non-uniform illumination. On this basis, a multi-modal data fusion framework was constructed by combining the depth map with the ROIs to generate target fruit point clouds.
The point clouds were then processed via color filtering, outlier removal, down-sampling, and least-squares sphere fitting. Real-time geometric computation on the filtered point clouds yielded the 3D pose of each tomato, including spatial position and orientation. Experimental results showed that the lightweight instance segmentation design improved real-time performance without sacrificing accuracy. Compared with the classical instance segmentation models Mask R-CNN and SOLOv2, the proposed YOLOv7-M1 model achieved significant gains: average precision increased by 14.35% and 15.21%, accuracy by 14.06% and 14.47%, recall by 13.25% and 11.10%, and mean Average Precision (mAP50) by 14.30% and 14.51%, respectively. Relative to YOLOv7-seg, YOLOv8l-seg, and YOLOv11l-seg, the GFLOPs of YOLOv7-M1 were reduced by 30.23%, 54.75%, and 29.99%, while the frame rate was increased by 10.45%, 14.16%, and 8.34%, respectively. In the keypoint detection stage, three attention mechanisms (SE, CBAM, and ECA) were embedded into HRNet for comparison. The keypoint similarity was improved by 0.87%, 1.68%, and 2.23%, respectively, with ECA providing the largest accuracy gain while increasing GFLOPs by less than 1%, thus meeting real-time constraints. The overall pose estimation framework was further validated on 100 groups of tomato pose data. The average pose estimation accuracy reached 95.0%, the mean 3D orientation error was 9.4°, and the mean keypoint localization error was 4.13 mm. The average position errors in the X, Y, and Z directions were 3.41, 2.95, and 1.02 mm, respectively, and the average processing time per fruit was 0.063 s. Finally, the method was deployed on a tomato harvesting robot and tested in a real greenhouse. Among 50 harvesting trials, 44 were successful, corresponding to an overall success rate of 88%. These results indicated that the proposed method effectively provided reliable fruit pose information to guide the end effector, achieving a favorable balance between accuracy and real-time performance, and offering an efficient solution for precise, automated harvesting in complex agricultural environments.
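
As an illustration of the least-squares sphere-fitting step named in the abstract, the following NumPy sketch fits a sphere to a partial fruit point cloud using the standard algebraic (linear) formulation. The abstract does not state which variant of sphere fitting was used, so the formulation, the function names `fit_sphere` and `synthetic_hemisphere`, and the synthetic data (a 35 mm radius fruit about 0.6 m from the camera) are all assumptions made for illustration only.

```python
import numpy as np


def fit_sphere(points):
    """Algebraic least-squares sphere fit (illustrative, not the authors' code).

    Uses the identity x^2 + y^2 + z^2 = 2ax + 2by + 2cz + d,
    where (a, b, c) is the center and d = r^2 - (a^2 + b^2 + c^2),
    which is linear in the unknowns (a, b, c, d).
    """
    pts = np.asarray(points, dtype=float)
    if pts.ndim != 2 or pts.shape[1] != 3 or pts.shape[0] < 4:
        raise ValueError("expected an (N, 3) array with N >= 4")
    # Design matrix [2x, 2y, 2z, 1] and right-hand side x^2 + y^2 + z^2.
    A = np.hstack([2.0 * pts, np.ones((pts.shape[0], 1))])
    b = np.sum(pts ** 2, axis=1)
    sol, *_ = np.linalg.lstsq(A, b, rcond=None)
    center = sol[:3]
    radius = float(np.sqrt(sol[3] + center @ center))
    return center, radius


def synthetic_hemisphere(center, radius, n=500, noise=0.0, seed=0):
    """Sample the camera-facing half of a sphere (hypothetical test data)."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, 2.0 * np.pi, n)
    cos_phi = rng.uniform(0.0, 1.0, n)
    sin_phi = np.sqrt(1.0 - cos_phi ** 2)
    # Camera looks along +Z, so visible points have smaller Z than the center.
    dirs = np.stack(
        [sin_phi * np.cos(theta), sin_phi * np.sin(theta), -cos_phi], axis=1
    )
    pts = np.asarray(center, dtype=float) + radius * dirs
    return pts + rng.normal(0.0, noise, pts.shape)


true_center = np.array([0.05, -0.02, 0.60])  # metres, assumed values
true_radius = 0.035
cloud = synthetic_hemisphere(true_center, true_radius, noise=0.0005)
est_center, est_radius = fit_sphere(cloud)
print("center error (mm):", 1000.0 * np.linalg.norm(est_center - true_center))
print("radius error (mm):", 1000.0 * abs(est_radius - true_radius))
```

The linear formulation is chosen here because it needs no initial guess and reduces to one small least-squares solve per fruit, which suits a real-time setting. It is sensitive to outliers, which is consistent with the abstract placing filtering and outlier removal before the fit.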