Transformer-based semantic segmentation of grapes in sparse point clouds

谢元澄; 高宇阳; 李添天; 戴倩; 姜海燕

doi:10.11975/j.issn.1002-6819.202403220

Semantic segmentation of sparse point clouds in grapes using a transformer approach

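Abstract (Chinese-language version): In intelligent agricultural harvesting, point cloud semantic segmentation is an important step toward fruit localization, accurate cutting, and unmanned picking. Most existing semantic segmentation studies in agricultural scenes rely on dense point cloud data, which is difficult and costly to acquire. Sparse point cloud data is easier and cheaper to acquire, but segmentation on it is usually poor. To address data sparsity, this study builds on the PointNet algorithm and introduces the Transformer multi-head self-attention mechanism to construct a point cloud semantic segmentation method, SP-Transformer. The point cloud is divided into multi-level windows so that the attention mechanism focuses on local features within each window. A multi-scale fusion strategy of dense keys and sparse keys enlarges the receptive field to capture long-range contextual dependencies, and high-level feature embedding is adopted in the attention mechanism to improve the segmentation of sparse point clouds. Experimental results show an average accuracy of 89.9% on the grape test set and a segmentation accuracy of 81.1% for grapes. SP-Transformer maintains good segmentation performance on low-density point clouds.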

Abstract: Grapes are one of the major fruit crops in agricultural production. Manual cultivation is labor-intensive, costly, and inefficient, and can no longer meet the demands of large-scale production, so mechanized and intelligent harvesting is needed. In intelligent agricultural harvesting, point cloud semantic segmentation is an important step toward accurate fruit localization, precise cutting, and unmanned harvesting. However, most semantic segmentation studies in agricultural scenarios rely on dense point cloud data, which is difficult and costly to acquire, whereas sparse point cloud data usually yields poorer segmentation in practice. To address data sparsity, this study proposed a point cloud semantic segmentation method, SP-Transformer, which builds on the PointNet algorithm and introduces the Transformer multi-head self-attention mechanism. SP-Transformer divided the point cloud into multi-level windows so that the attention mechanism focused on the local features within each window. A multi-scale fusion strategy combined dense and sparse keys to enlarge the receptive field and capture long-range contextual dependencies. In addition, high-level feature embedding was applied at the initial stage of the attention to improve the segmentation of sparse point clouds. Specifically, the three-dimensional space was first partitioned into non-overlapping cubic windows, and the points were distributed across these windows. Each query point attended only to the neighboring points within the same cubic window, and multi-head self-attention was performed independently within each window.
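The cubic window partition described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `window_ids`, the use of the point cloud's minimum corner as the grid origin, and the NumPy-based grouping are assumptions made only for illustration.

```python
import numpy as np

def window_ids(points, window_size):
    """Assign each point to a non-overlapping cubic window.

    points: (N, 3) array of xyz coordinates.
    window_size: edge length of one cubic window (an assumed parameter).
    Returns an (N,) integer array; points with the same value share a window.
    """
    points = np.asarray(points, dtype=np.float64)
    # Assumption: the grid is anchored at the minimum corner of the cloud.
    origin = points.min(axis=0)
    # Integer cell coordinates of each point along x, y, z.
    cells = np.floor((points - origin) / window_size).astype(np.int64)
    # Map each distinct cell to a compact integer id.
    _, ids = np.unique(cells, axis=0, return_inverse=True)
    return ids.reshape(-1)

pts = np.array([[0.0, 0.0, 0.0],
                [0.4, 0.4, 0.4],
                [1.2, 0.0, 0.0],
                [0.1, 0.9, 0.2]])
ids = window_ids(pts, window_size=1.0)
# Attention would then be computed separately over the points of each window.
groups = [np.flatnonzero(ids == w) for w in np.unique(ids)]
```

In this toy input, three points fall in one window and the third point falls in a neighboring window, so a query point in the first window would attend only to the other two points there.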
Because the point cloud was divided into small windows, the receptive field of each query point was inherently limited. To introduce cross-window communication, the windows were shifted by half the window size between two consecutive Transformer blocks, which added contextual connections across windows. For key sampling, the input point cloud was divided into non-overlapping cubic windows of a predefined window size (WS), and for each query point (QI) the set of points within the same cubic window (KDEN) was identified. At the same time, the input point cloud was downsampled, and the downsampled space was divided into non-overlapping cubic windows of twice the WS. The downsampled points in the larger window containing the QI formed the sparse set (KSPA). The two sets were then combined into the final sampling point set used for further processing, so that both local and global features were captured even in sparse point cloud data. Experiments on a grape dataset showed that SP-Transformer achieved an average accuracy of 89.9%, which was 4.4 percentage points higher than PointNet++ and 1.5 percentage points higher than Point Transformer. The segmentation accuracy for grapes reached 81.1%, with an average intersection over union (IoU) of 82.8%. To validate the robustness of the model, the point cloud completion algorithm PENet was used to densify the point clouds, increasing the density from approximately 60 000 points to around 500 000 points. SP-Transformer was then trained on the dense and sparse data, and real-time performance and segmentation accuracy were compared. The results indicated that SP-Transformer maintained robust segmentation performance even on low-density point clouds and outperformed PointNet++ in both accuracy and efficiency.
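The construction of the dense set (KDEN) and the sparse set (KSPA) for one query point can be sketched as below. This is an illustrative sketch under stated assumptions, not the method as implemented in the study: the function names are hypothetical, taking every `stride`-th point is only a stand-in for the downsampling step (which the abstract does not specify), and the window shift between blocks is left out.

```python
import numpy as np

def cell_index(points, origin, size):
    """Integer cubic-cell coordinates of each point for a given edge length."""
    return np.floor((points - origin) / size).astype(np.int64)

def dense_sparse_keys(points, query_idx, ws, stride=4):
    """Return (KDEN, KSPA, combined) key indices for one query point.

    ws: edge length of the small window (WS).
    stride: keep every stride-th point as the downsampled cloud (assumption).
    """
    points = np.asarray(points, dtype=np.float64)
    origin = points.min(axis=0)
    # Dense keys: every point in the query's window of edge ws.
    small = cell_index(points, origin, ws)
    k_den = np.flatnonzero((small == small[query_idx]).all(axis=1))
    # Sparse keys: downsampled points in the query's window of edge 2 * ws.
    sub = np.arange(0, len(points), stride)
    large = cell_index(points, origin, 2.0 * ws)
    k_spa = sub[(large[sub] == large[query_idx]).all(axis=1)]
    # Final key set: union of both sets, duplicates removed.
    return k_den, k_spa, np.union1d(k_den, k_spa)

pts = np.array([[0.0, 0.0, 0.0],
                [0.5, 0.5, 0.5],
                [1.5, 0.5, 0.5],
                [1.9, 1.9, 1.9],
                [2.5, 0.0, 0.0],
                [0.2, 0.2, 0.2]])
k_den, k_spa, keys = dense_sparse_keys(pts, query_idx=0, ws=1.0, stride=2)
```

Here the dense set holds the query's close neighbors, while the sparse set adds a more distant downsampled point from the larger window, which is how the combined set widens the receptive field at low cost.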
SP-Transformer advances point cloud semantic segmentation, particularly in agricultural applications where sparse point cloud data is prevalent. The multi-head self-attention mechanism and the multi-scale fusion strategy addressed the challenges posed by sparse data and enabled more accurate and efficient segmentation. Intelligent harvesting systems can therefore locate and segment fruits more accurately, which reduces labor costs and improves operational efficiency. The approach may also apply beyond agriculture to other settings with sparse point cloud data, such as autonomous driving, robotics, and environmental monitoring, because cross-window communication and high-level feature embedding provide a general framework for capturing both local and global features. In conclusion, SP-Transformer offers an efficient solution for point cloud semantic segmentation in intelligent agricultural harvesting. It uses sparse data while still capturing long-range contextual dependencies, and its accuracy on sparse point clouds can contribute to computer vision applications in precision agriculture.
