Abstract:
With the rapid development of machine vision and artificial intelligence, grape-picking robots offer an effective way to resolve the conflict between limited manual labor efficiency and the short harvesting window. However, the varying sizes and shapes of the key structures of grapes constrain the robot's working space during harvesting, and improper localization of picking points can cause collisions between the robot's end-effector and the grapes, leading to damage or even fruit drop. Collisions between the end-effector and the fruit-bearing branches must also be avoided, since they can result in failed picking, branch damage, and a risk of fungal infection in the vines. In this study, a localization algorithm for the picking points on grape key structures was proposed using deep learning and multi-object recognition, with the aim of reducing grape damage and the failure rate during harvesting. Firstly, the YOLACT++ model was optimized by incorporating the SimAM attention module and the Mish activation function, yielding the G-YOLACT++ model. This model detected the key structures of grapes (grape-bearing branches, grape peduncles, and grape clusters) and segmented the structures of multiple adjacent clusters into separate masks within the field of view. The membership of the key structures within the same cluster was then determined from their intersections and relative positions, the structures belonging to the same cluster were merged, a Region of Interest (ROI) with a low collision risk was selected on the grape peduncle, and a re-selection range was designed to locate the picking point. The experimental results demonstrated that incorporating the SimAM attention mechanism into YOLACT++ improved the mean average precision (mAP) of the mask, while replacing ReLU with Mish in the backbone network increased the mask and bounding-box mAP by 0.3 and 2.23 percentage points, respectively; both modifications contributed substantially to the performance gain. Compared with YOLACT++, the average mAP values of the bounding box and mask in G-YOLACT++ improved by 0.83 and 0.88 percentage points, respectively, and the mAP values of the improved model for the bounding box and mask increased by 2.36 and 2.13 percentage points, respectively, over the original model. Furthermore, the model size remained unchanged while the inference speed improved slightly, confirming the positive effect of the modifications on model performance. For the membership judgment and fusion of key structures, the correctness rates on single-fruit and multiple-fruit samples were 88% and 90%, respectively, and the correctness rate for removing grape clusters whose key structures were incompletely recognized was 92.3%. Compared with two baseline methods that take the center of the bounding rectangle enclosing the grape peduncle within the ROI and the centroid of the peduncle mask identified by the model as the picking point, the success rate of the proposed picking-point localization method improved by 10.95 and 81.75 percentage points, respectively. These results demonstrate that this study can support the optimization of grape-picking robots and lays a foundation for the low-damage harvesting of clustered fruits in unstructured environments.
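As a concrete illustration of the model changes summarized above, the following is a minimal PyTorch sketch of the parameter-free SimAM attention block and of swapping ReLU for Mish in a backbone. The abstract does not state where the attention module is inserted or which layers are swapped, so the helper `replace_relu_with_mish` and its placement are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class SimAM(nn.Module):
    """Parameter-free SimAM attention (Yang et al., 2021): scales each
    neuron by a sigmoid of its inverse energy, so neurons that deviate
    from the channel mean receive larger weights."""
    def __init__(self, e_lambda: float = 1e-4):
        super().__init__()
        self.e_lambda = e_lambda

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W); n = number of other neurons in each channel
        n = x.shape[2] * x.shape[3] - 1
        d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)
        v = d.sum(dim=(2, 3), keepdim=True) / n        # per-channel variance
        e_inv = d / (4 * (v + self.e_lambda)) + 0.5    # inverse energy
        return x * torch.sigmoid(e_inv)

def replace_relu_with_mish(module: nn.Module) -> None:
    """Recursively swap every ReLU in a backbone for Mish,
    where Mish(x) = x * tanh(softplus(x))."""
    for name, child in module.named_children():
        if isinstance(child, nn.ReLU):
            setattr(module, name, nn.Mish(inplace=True))
        else:
            replace_relu_with_mish(child)
```

For example, `replace_relu_with_mish(torchvision.models.resnet50())` would convert a standard ResNet-50 backbone in place; SimAM adds no learnable parameters, which is consistent with the model size staying unchanged.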
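The cluster-membership step (grouping each detected peduncle and bearing branch with the grape cluster it belongs to by intersection and relative position, then discarding incompletely recognized clusters) could look roughly like the sketch below. The overlap rule, the "peduncle above the cluster midpoint" test, and the required label set are assumptions chosen for illustration; the abstract does not give the paper's exact criteria.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str   # "cluster", "peduncle", or "branch"
    box: tuple   # (x1, y1, x2, y2) in image coordinates, y grows downward

def box_intersection(a, b):
    """Overlap area of two axis-aligned boxes."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)

def group_by_cluster(clusters, parts, required=("cluster", "peduncle", "branch")):
    """Attach each peduncle/branch to the cluster whose box it overlaps
    most, requiring it to start above the cluster's vertical midpoint;
    then drop groups missing any required key structure."""
    groups = {id(c): [c] for c in clusters}
    for p in parts:
        best, best_inter = None, 0.0
        for c in clusters:
            inter = box_intersection(p.box, c.box)          # intersection cue
            above = p.box[1] < (c.box[1] + c.box[3]) / 2    # relative position cue
            if above and inter > best_inter:
                best, best_inter = c, inter
        if best is not None:
            groups[id(best)].append(p)
    # Removal of clusters with incompletely recognized key structures
    return [g for g in groups.values() if {d.label for d in g} >= set(required)]
```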
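Finally, a hedged sketch of locating a picking point on a segmented peduncle. The paper selects a low-collision ROI on the peduncle and designs a re-selection range, whose exact definitions the abstract does not provide; the mid-height band used here is a stand-in assumption. The two baselines the paper compares against would correspond to the center of the mask's bounding rectangle within the ROI and the centroid of the full mask.

```python
import numpy as np

def picking_point(peduncle_mask: np.ndarray, band=(0.3, 0.6)):
    """Return an (x, y) picking point on a binary peduncle mask by taking
    the centroid of the mask restricted to a mid-height band; the band is
    an assumed proxy for the paper's low-collision ROI / re-selection range."""
    ys, xs = np.nonzero(peduncle_mask)
    if ys.size == 0:
        return None                         # nothing detected
    y0, y1 = ys.min(), ys.max()
    lo = y0 + band[0] * (y1 - y0)
    hi = y0 + band[1] * (y1 - y0)
    sel = (ys >= lo) & (ys <= hi)           # assumed re-selection range
    if not sel.any():
        sel = np.ones_like(ys, dtype=bool)  # fall back to the whole mask
    return int(xs[sel].mean()), int(ys[sel].mean())
```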