Abstract:
Grape is one of the major fruit crops in agricultural production. In recent years, manual cultivation has been unable to meet the demands of large-scale production because of its labor intensity, high cost, and low efficiency, so mechanical harvesting is increasingly required for a sustainable grape industry. Mechanization and intelligent automation are therefore crucial to the development of grape cultivation. Point cloud semantic segmentation can support unmanned harvesting, accurate fruit localization, and precise cutting in intelligent agricultural production. However, semantic segmentation in agricultural scenarios has mostly relied on dense point cloud data, which is difficult and costly to acquire, whereas sparse point cloud data often leads to suboptimal segmentation in practical intelligent harvesting. In this study, a point cloud semantic segmentation model for sparse data, called SP-Transformer, was proposed by combining the PointNet algorithm with the multi-head self-attention mechanism of the Transformer. The SP-Transformer divided the point cloud into multi-level windows, so that the attention mechanism focused on the local features within each window. A multi-scale fusion strategy combined the dense and sparse features within the receptive field to capture long-range contextual dependencies. Additionally, high-level feature embedding was applied at the initial stage of the attention to enhance segmentation performance on sparse point clouds. Specifically, the three-dimensional space was first partitioned into non-overlapping cubic windows, and the points were distributed across these windows. Each query point attended only to the neighboring points within the same cubic window, and multi-head self-attention was performed independently within each window.
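The window partitioning and per-window attention described above can be illustrated with a minimal NumPy sketch. It is not the implementation used in this study: the function names `assign_windows` and `windowed_self_attention` are hypothetical, and identity projections stand in for the learned query, key, and value weights, which the abstract does not specify.

```python
import numpy as np


def assign_windows(points, window_size):
    """Return one integer window label per point for a grid of cubic windows."""
    # Integer grid coordinates of the cube that contains each point.
    cells = np.floor((points - points.min(axis=0)) / window_size).astype(np.int64)
    # Collapse the 3D cell coordinates into one label per occupied window.
    _, labels = np.unique(cells, axis=0, return_inverse=True)
    return labels.reshape(-1)


def windowed_self_attention(features, labels, num_heads=2):
    """Multi-head self-attention computed independently inside each window.

    Assumption: identity Q/K/V projections replace the learned weights.
    """
    n, c = features.shape
    assert c % num_heads == 0, "channels must split evenly across heads"
    d = c // num_heads
    out = np.zeros_like(features, dtype=float)
    for w in np.unique(labels):
        idx = np.where(labels == w)[0]
        # Shape (heads, points_in_window, d).
        x = features[idx].reshape(len(idx), num_heads, d).transpose(1, 0, 2)
        # Scaled dot-product scores between points of the same window only.
        scores = x @ x.transpose(0, 2, 1) / np.sqrt(d)
        scores = scores - scores.max(axis=-1, keepdims=True)
        weights = np.exp(scores)
        weights = weights / weights.sum(axis=-1, keepdims=True)
        # Merge the heads back into one feature vector per point.
        out[idx] = (weights @ x).transpose(1, 0, 2).reshape(len(idx), c)
    return out
```

Because the attention is restricted to each window, changing the features of points in one window leaves the outputs of all other windows unchanged, which is the locality that limits the receptive field of each query point.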
Because the point cloud was divided into small windows, the receptive field of each query point was inherently limited. Cross-window communication was therefore introduced: the windows were shifted by half the window size between two consecutive Transformer blocks, which increased the contextual connections available to each query point. For the multi-scale sampling, the input point cloud was divided into non-overlapping cubic windows using a predefined window size (WS), and for each query point (QI) the set of points within the same cubic window (KDEN) was identified. Simultaneously, the input point cloud was downsampled, and the downsampled space was divided into non-overlapping cubic windows of twice the WS to form a sparse sampling window. The set of downsampled points in the larger cubic window containing QI was denoted KSPA. Finally, the two sets were combined to form the final sampling point set for further processing. In this way, both local and global features were captured effectively, even in sparse point cloud data. Experimental results on a grape dataset demonstrated the effectiveness of the SP-Transformer, which achieved an average accuracy of 89.9%, 4.4 percentage points higher than PointNet++ and 1.5 percentage points higher than Point Transformer. The segmentation accuracy for the grape classes reached 81.1%, and the average intersection over union (IoU) was 82.8%. To validate the robustness of the model, the point cloud completion algorithm PENet was employed to densify the point cloud from approximately 60 000 points to around 500 000 points. The SP-Transformer was then trained on both the dense and the sparse data to compare real-time performance and segmentation accuracy. The results indicated that the SP-Transformer maintained robust segmentation performance even on low-density point clouds, and outperformed the traditional PointNet++ in both accuracy and efficiency.
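The sampling of KDEN and KSPA and the half-window shift described in this paragraph can be sketched as follows. This is an illustrative approximation under stated assumptions, not the authors' code: the names `cell_coords` and `multiscale_keys` are hypothetical, and keeping every `stride`-th point is only a stand-in for the downsampling step, whose method the abstract does not specify.

```python
import numpy as np


def cell_coords(points, origin, window_size, shift=0.0):
    """Integer coordinates of the cubic window containing each point.

    A shift of window_size / 2 moves the grid by half a window, as used
    between two consecutive Transformer blocks.
    """
    return np.floor((points - origin + shift) / window_size).astype(np.int64)


def multiscale_keys(points, query_index, window_size, stride=2):
    """Indices of KDEN, KSPA, and their union for one query point."""
    origin = points.min(axis=0)
    # KDEN: every point that shares the query's window of size WS.
    fine = cell_coords(points, origin, window_size)
    k_den = np.where((fine == fine[query_index]).all(axis=1))[0]
    # KSPA: downsampled points that share the query's window of size 2 * WS.
    # Assumption: regular subsampling replaces the unspecified downsampling.
    coarse = cell_coords(points, origin, 2.0 * window_size)
    sampled = np.arange(0, len(points), stride)
    in_coarse = (coarse[sampled] == coarse[query_index]).all(axis=1)
    k_spa = sampled[in_coarse]
    # Final sampling point set: both sets combined without duplicates.
    return k_den, k_spa, np.union1d(k_den, k_spa)
```

In this sketch the dense set covers only the query's own window, while the sparse set reaches points in a window twice as large at a lower sampling rate, so the combined set widens the context of the query without a proportional increase in the number of points attended to.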
The SP-Transformer represents an advance in point cloud semantic segmentation, particularly for agricultural applications in which sparse point cloud data is prevalent. The multi-head self-attention mechanism and the multi-scale fusion strategy effectively addressed the challenges posed by sparse data, enabling more accurate and efficient segmentation. As a result, intelligent harvesting systems can locate and segment the fruits more accurately, thereby reducing labor costs and improving operational efficiency. Moreover, the SP-Transformer has broad implications beyond agriculture, particularly for sparse point cloud data in autonomous driving, robotics, and environmental monitoring. The cross-window communication and the high-level feature embedding provide a robust framework for capturing both local and global features, which makes the model a versatile tool for various computer vision tasks. In conclusion, the SP-Transformer can serve as a powerful and efficient solution for point cloud semantic segmentation in intelligent agricultural harvesting. It exploits sparse data while capturing long-range contextual dependencies, which makes intelligent harvesting more accurate and reliable. Its high segmentation accuracy on sparse point cloud data can contribute to the advancement of computer vision in precision agriculture.