Abstract:
Apples are popular fruits, rich in vitamins and fiber. Traditional apple picking requires a large amount of manual labor, so automatic picking is in high demand; it depends, in turn, on the accurate localization of ripe apples. Manual labelling or rule-based computer vision has been used to identify fruit locations, but traditional counting and positioning cannot fully meet the requirements of complex and varying orchard environments across seasons, particularly under intertwined branches and leaves, fruit occlusion, and changing light conditions. Accurate and efficient algorithms are therefore needed to localize picking points during apple picking. In this study, an efficient positioning method was proposed for apple picking points using target-region segmentation. The specific procedures were as follows. Firstly, the target detection frame was acquired to crop the apple target region, and the LabelMe annotation tool was used to manually annotate the outer contour of each target point by point, yielding a semantic segmentation dataset of apple target regions. A total of 1503 images were obtained after these operations, of which 1352 were used for training and the remaining 151 for validation. Secondly, MobileViT-Seg, a semantic segmentation network for the apple target region, was proposed, combining a lightweight encoder with a hierarchical pooling decoder. The encoder adopted the pre-trained MobileViT structure, which down-sampled the input image step by step to extract high-level feature information. The decoder used a Pyramid Pooling Module (PPM) followed by Softmax processing to gradually recover the spatial resolution of the image for accurate segmentation. Effective extraction of global contextual information was thus maintained at a small model size and low computational cost.
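The decoder described above can be sketched as a generic PSPNet-style PPM head: the feature map is pooled at several scales, each pooled branch is reduced and upsampled back, the branches are concatenated with the input, and per-pixel Softmax is applied at full resolution. This is a minimal sketch under assumed layer sizes, not the paper's exact configuration; `PPMHead` and all channel counts are illustrative.

```python
# Minimal sketch of a PPM (Pyramid Pooling Module) decoder head.
# Layer sizes and the class name PPMHead are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PPMHead(nn.Module):
    def __init__(self, in_ch, num_classes, pool_sizes=(1, 2, 3, 6)):
        super().__init__()
        branch_ch = in_ch // len(pool_sizes)
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(s),         # pool features to an s x s grid
                nn.Conv2d(in_ch, branch_ch, 1),  # reduce channels per branch
                nn.ReLU(inplace=True),
            ) for s in pool_sizes
        ])
        self.classifier = nn.Conv2d(in_ch + branch_ch * len(pool_sizes),
                                    num_classes, 1)

    def forward(self, x, out_size):
        h, w = x.shape[-2:]
        feats = [x]
        for branch in self.branches:
            y = branch(x)
            # upsample each pooled branch back to the feature-map size
            feats.append(F.interpolate(y, size=(h, w), mode="bilinear",
                                       align_corners=False))
        logits = self.classifier(torch.cat(feats, dim=1))
        # gradually recover the input spatial resolution, then per-pixel Softmax
        logits = F.interpolate(logits, size=out_size, mode="bilinear",
                               align_corners=False)
        return logits.softmax(dim=1)
```

In practice the encoder (here, a pre-trained MobileViT backbone) would supply `x` as its deepest feature map, and `out_size` would be the original image size.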
Finally, the segmented apple region could be incomplete due to branch and leaf occlusion and fruit overlap. The target mask region after segmentation was therefore fitted with a circle using least-squares shape fitting, and the center of the circle was taken as the picking point. RGB-D information was then fused to localize the picking point in space. The experimental results show that the MobileViT-Seg model was highly robust in locating picking points across multiple scenes. Compared with several mainstream segmentation methods (UNet, PSPNet, MobileNetV3-DeepLabV3+, and DeepLabV3+), MobileViT-Seg performed the best at low computational cost. With 7.18 M parameters and 29.93 G FLOPs, the mean Intersection over Union (mIoU) reached 89.79%, the mean Pixel Accuracy (mPA) reached 94.46%, the Accuracy (Ac) reached 94.73%, and the detection speed reached 100.06 fps. The average accuracy of picking-point localization reached 90.80% on 200 raw apple images captured by the camera in real time, fully meeting the positioning-accuracy requirements. In summary, an efficient technical solution was provided for automated apple picking: an advanced segmentation network was combined with precise spatial localization, achieving accurate picking-point localization under complex orchard conditions. The improved model can lay a foundation for picking-point localization by apple-picking robots.
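The picking-point computation above can be sketched in two steps: a least-squares circle fit to the mask boundary (the standard Kåsa algebraic fit is used here as one common choice), and pinhole back-projection of the circle center using the aligned depth value. The function names and the intrinsics parameters (`fx`, `fy`, `cx0`, `cy0`) are hypothetical, and the camera model is an assumption rather than the paper's stated calibration.

```python
# Sketch: circle fitting on mask boundary pixels, then RGB-D back-projection.
# Kasa fit and pinhole model are assumed; names are illustrative.
import numpy as np

def fit_circle(xs, ys):
    # Kasa least-squares fit: x^2 + y^2 + a*x + b*y + c = 0,
    # solved as the linear system [x y 1] [a b c]^T = -(x^2 + y^2).
    A = np.column_stack([xs, ys, np.ones_like(xs)])
    rhs = -(xs**2 + ys**2)
    a, b, c = np.linalg.lstsq(A, rhs, rcond=None)[0]
    cx, cy = -a / 2.0, -b / 2.0
    r = np.sqrt(cx**2 + cy**2 - c)
    return cx, cy, r

def pixel_to_camera(u, v, depth, fx, fy, cx0, cy0):
    # Pinhole back-projection of pixel (u, v) with aligned depth (same units
    # as the returned coordinates); fx, fy, cx0, cy0 are camera intrinsics.
    z = depth
    x = (u - cx0) * z / fx
    y = (v - cy0) * z / fy
    return np.array([x, y, z])
```

Given boundary pixels of the segmented (possibly occluded) apple mask, `fit_circle` recovers the full circular outline, and the fitted center `(cx, cy)` combined with its depth reading gives the 3-D picking point.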