Real-time reconstruction of orchard obstacles by fusing semantic segmentation and 3D point cloud analysis

    • Abstract: To address the poor real-time performance of orchard obstacle recognition and reconstruction, this study took a guava orchard as an example and proposed a real-time orchard obstacle reconstruction method that fuses semantic segmentation and 3D point cloud analysis. First, based on DeepLabV3+, MobileNetV3-large was used as the backbone network to extract feature maps in real time, a contextual Transformer (CoT) attention module was incorporated to enrich the detail and semantic information of the feature maps, and an edge-assisted loss function based on the Sobel operator was further proposed to improve obstacle segmentation accuracy. On this basis, voxelization was used to reconstruct the branch 3D point cloud as a set of cubes, a fruit 3D reconstruction method based on Euclidean clustering was constructed, and a reconstruction evaluation metric fusing inverse projection transformation and intersection over union (IoU) was designed. Experimental results show that the mean IoU of obstacle semantic segmentation was 77.2%, 7.0 percentage points higher than that of Xception-DeepLabV3+, at an average frame rate of 78.8 frames/s; the reconstruction IoU values for branches and fruits were 56.6% and 65.7%, with average reconstruction times of 0.84 s and 0.34 s, respectively. The method offers high real-time performance and accuracy, meeting the real-time obstacle avoidance requirements of fruit-picking robots.

       

      Abstract: Mechanized fruit harvesting is an ever-increasing trend in smart agriculture. However, picking quality is constrained by the unstructured planting commonly used for fruit trees: robots can easily collide with dense branches and leaves during harvesting. The random spatial distribution of orchard obstacles, combined with variation in lighting conditions, makes real-time obstacle reconstruction highly challenging. In this study, a real-time reconstruction method for orchard obstacles was proposed that combines semantic segmentation and 3D point cloud analysis. Firstly, taking a guava orchard as an example, images were acquired with an Intel RealSense D435i depth camera at a resolution of 640×480 and a distance of about 15-50 cm from the guava fruit trees, in the orchard of Haiou Island Jiashuo Farm, Guangzhou City, Guangdong Province, China. Data augmentation was performed to improve the generalization and robustness of the model, using mirroring, luminance enhancement, luminance attenuation, and the addition of Gaussian and salt-and-pepper noise. A total of 1 250 samples were obtained and randomly divided into training, validation, and test sets in a ratio of 6:3:1, giving 750 samples in the training set, 375 in the validation set, and 125 in the test set. Then, a real-time semantic segmentation model was built on DeepLabV3+, with MobileNetV3-large as the backbone network and a CoT module introduced on the low-level and high-level feature maps of the encoder. This improved segmentation accuracy on thin obstacles such as twig clusters, strengthening the detail information of the feature maps while suppressing useless information. Furthermore, an edge-assisted loss function was proposed using the Sobel operator.
This edge loss was used to further increase the accuracy of semantic segmentation, particularly at object edges in the image. Next, the intrinsic parameters of the depth camera were used to transform the semantic obstacle map into a 3D point cloud, and statistical outlier removal was applied. Voxelization was used to approximate the reconstruction of tree trunks and branches, while Euclidean clustering was employed to identify individual 3D fruits, with the minimum bounding box of each fruit serving as its reconstruction. Finally, the pose transformation matrix between the camera and the robot was obtained by hand-eye calibration with a chessboard pattern, and the minimum bounding boxes of the fruits and the voxelized cubes of the branches and trunks were transformed into the picking robot's workspace. The experimental results show that the MIoU (mean intersection over union) of obstacles in MobileNetV3-large-DeepLabV3+ improved from 69.7% to 74.3% after the introduction of the CoT block, an increase of 4.6 percentage points. The Sobel operator considers both horizontal and vertical gradient information and therefore offers some robustness to noise; with the edge-assisted loss, the MIoU of obstacles further improved from 74.3% to 77.2%, an increase of 2.9 percentage points. The overall MIoU of the improved model was 7.0, 0.6, and 1.0 percentage points higher than those of Xception-DeepLabV3+, BranchNet, and LRASPP, respectively.
It was, however, 1.7 percentage points lower than that of BiSeNetV3. The accuracy (Acc) was 3.1, 0.5, and 0.6 percentage points higher than those of Xception-DeepLabV3+, BranchNet, and LRASPP, respectively, and 2.2 percentage points lower than that of BiSeNetV3. The frame rate of the improved model was as high as 78.8 frames per second, faster than all the compared models. The number of parameters was 4.4 M, which is 8% and 31% of those of Xception-DeepLabV3+ and BiSeNetV3, respectively, while BranchNet and LRASPP have 68% and 73% of the parameters of the proposed model, respectively. Besides, the IoU (intersection over union) values of the reconstructions of the fruits and branches were 65.7% and 56.6%, with reconstruction times of 0.34 s and 0.84 s, respectively. The method's robustness and real-time performance fully meet the obstacle avoidance requirements of guava-picking robots. In future work, more sophisticated attention modules or backbone networks could be utilized to further optimize the real-time semantic segmentation of irregular obstacles and enhance the effect of obstacle reconstruction.
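The edge-assisted loss described above weights the pixel-wise loss toward object boundaries detected by the Sobel operator. The following NumPy sketch illustrates the idea only; the function names, binary-label simplification, and the particular weighting scheme are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def sobel_edge_map(label: np.ndarray) -> np.ndarray:
    """Binary edge map of a 2-D label mask from Sobel gradients (zero-padded)."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    p = np.pad(label.astype(float), 1)
    h, w = label.shape
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            win = p[i:i + 3, j:j + 3]      # 3x3 window centred on pixel (i, j)
            gx[i, j] = (win * kx).sum()    # horizontal gradient
            gy[i, j] = (win * ky).sum()    # vertical gradient
    return (np.hypot(gx, gy) > 0).astype(float)

def edge_assisted_loss(pred: np.ndarray, target: np.ndarray,
                       edge_weight: float = 2.0) -> float:
    """Binary cross-entropy, up-weighted on the Sobel edges of the target mask."""
    eps = 1e-7
    ce = -(target * np.log(pred + eps) + (1 - target) * np.log(1 - pred + eps))
    w = 1.0 + (edge_weight - 1.0) * sobel_edge_map(target)
    return float((w * ce).mean())
```

Because the Sobel kernels respond to both horizontal and vertical gradients, the weight map highlights boundaries in all orientations, which is the property the abstract credits for the loss's noise robustness.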
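Transforming the semantic obstacle map into a 3D point cloud with the camera intrinsics follows the standard pinhole back-projection X=(u-cx)·Z/fx, Y=(v-cy)·Z/fy. A minimal sketch, assuming depth in metres, a boolean obstacle mask, and illustrative intrinsic values rather than the D435i's calibrated ones:

```python
import numpy as np

def depth_to_points(depth: np.ndarray, mask: np.ndarray,
                    fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """Back-project masked depth pixels into camera-frame 3-D points (N x 3)."""
    v, u = np.nonzero(mask & (depth > 0))  # valid obstacle pixels only
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)
```

In practice the resulting cloud would then pass through statistical outlier removal before voxelization, as the abstract describes.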
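Voxelization of the trunk/branch cloud can be sketched as snapping points to a regular grid and keeping one cube per occupied cell; the 2 cm voxel size below is an illustrative assumption, not the paper's parameter:

```python
import numpy as np

def voxelize(points: np.ndarray, voxel: float = 0.02) -> np.ndarray:
    """Approximate a point cloud with cubes: return the centre of every
    occupied voxel-grid cell (one cube per cell)."""
    idx = np.floor(points / voxel).astype(int)  # integer cell index per point
    cells = np.unique(idx, axis=0)              # deduplicate occupied cells
    return (cells + 0.5) * voxel                # cell centres
```

Each returned centre stands for one axis-aligned cube of side `voxel`, giving the compact obstacle representation handed to the picking planner.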
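Fruit instances are separated by Euclidean clustering and each cluster is then summarized by its minimum bounding box (axis-aligned here for simplicity). A naive O(n²) sketch with illustrative tolerance and size thresholds:

```python
import numpy as np

def euclidean_cluster(points: np.ndarray, tol: float = 0.02,
                      min_size: int = 3) -> list:
    """Group points whose pairwise chains stay within `tol`; drop tiny groups."""
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        queue = [unvisited.pop()]
        members = list(queue)
        while queue:                       # breadth-first region growing
            i = queue.pop()
            near = [j for j in list(unvisited)
                    if np.linalg.norm(points[i] - points[j]) < tol]
            for j in near:
                unvisited.remove(j)
                queue.append(j)
                members.append(j)
        if len(members) >= min_size:
            clusters.append(np.array(members))
    return clusters

def aabb(points: np.ndarray):
    """Axis-aligned bounding box (min corner, max corner) of one fruit cluster."""
    return points.min(axis=0), points.max(axis=0)
```

A production system would use a k-d tree (as in PCL's Euclidean cluster extraction) instead of the O(n²) neighbor scan, but the grouping logic is the same.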

       
