Greenhouse visual place recognition method based on optimal transport feature aggregation

侯玉涵; 周云成; 刘泽钰; 张润池; 周金桥

doi:10.11975/j.issn.1002-6819.202407155

Recognizing visual position in the greenhouse using optimal transport feature aggregation

Abstract

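Abstract: To achieve vision-based place recognition in greenhouse scenes, and to address both the strong dependence of the local feature aggregation paradigm in existing visual place recognition models on the inductive bias of training samples and the redundant information present during aggregation, a greenhouse visual place recognition method based on optimal-transport local feature aggregation was constructed. The aggregation of local features from greenhouse scene images was treated as an optimal transport problem: the assignment matrix was generated dynamically from the local feature set, which decoupled the model from its strong dependence on inductive bias, and a 'dustbin' cluster was introduced into the assignment to handle feature redundancy. A local feature extraction network for greenhouse scene images was designed and optimized by combining the advantages of the convolutional neural network (CNN) and the Transformer. Experimental results showed that, in a greenhouse planted with tomatoes, the proposed method achieved a top-1 recall (R_@1) of 88.96% for place recognition, which was 29.67, 2.97, and 2.89 percentage points higher than NetVLAD, MixVPR, and EigenPlaces, respectively. Compared with the aggregators of NetVLAD and MixVPR, optimal-transport local feature aggregation improved R_@1 by 21.65 and 1.09 percentage points, respectively. Compared with a CNN, the proposed local feature extraction network improved R_@1 by 5.45 percentage points. The place recognition rate of the proposed method in actual greenhouse scenes was no lower than 81.94%, indicating a degree of practical applicability. Optimal-transport local feature aggregation and global descriptor generation are effective for place recognition, and the scene image local feature extraction network improves place recognition performance. The results can provide a technical reference for the design of vision systems for intelligent greenhouse agricultural machinery.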

Abstract: As the foundation of loop closure detection in visual SLAM (simultaneous localization and mapping), visual place recognition (VPR) has great potential in greenhouse robot navigation and related applications. However, existing VPR methods cannot fully meet the requirements of greenhouse scenes, because the greenhouse environment is complex and constantly changing. In particular, the local feature aggregation paradigm in VPR models depends strongly on the inductive bias of the training samples, and the aggregation process retains redundant information. In this study, a greenhouse VPR method based on optimal-transport local feature aggregation was presented. The aggregation of local features into a global descriptor was framed as an optimal transport problem in which the cost matrix was predicted by an MLP (multi-layer perceptron), so that the cost matrix was generated dynamically from the local features extracted from the greenhouse scene images. Additionally, a 'dustbin' cluster was introduced into the cost matrix to absorb redundant features. Taking the cost matrix as input, the Sinkhorn algorithm was employed to solve for the optimal assignment matrix, which gave a soft assignment of local features to the clusters. Ultimately, the features aggregated under this assignment were concatenated to form a global descriptor of the scene image, which was used for place recognition. A deep neural network (DNN) that combines the advantages of the CNN (convolutional neural network) and the Transformer was designed and optimized as the backbone for extracting local features from greenhouse scene images. Furthermore, cosine similarity was used as the metric for matching the global descriptors of scene images.
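The aggregation step described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the MLP that predicts the scores is omitted and the score matrix is taken as given (higher score meaning lower cost). The function names `sinkhorn_with_dustbin` and `aggregate`, the fixed dustbin score, and the choice of marginals (unit mass per feature, unit mass per cluster, the remaining `n - m` mass to the dustbin) are assumptions made for illustration only.

```python
import math


def _logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def sinkhorn_with_dustbin(scores, dustbin_score=1.0, n_iters=50):
    """Soft-assign n local features to m clusters plus one dustbin.

    scores[i][j] is the affinity of feature i to cluster j (n x m, n > m).
    Returns an n x (m + 1) matrix; the last column is the dustbin.
    Marginals are an assumption: mass 1 per feature, mass 1 per cluster,
    and mass n - m for the dustbin.
    """
    n, m = len(scores), len(scores[0])
    if n <= m:
        raise ValueError("this sketch assumes more features than clusters")
    # Append the dustbin column to the score matrix.
    log_k = [list(row) + [dustbin_score] for row in scores]
    log_col = [0.0] * m + [math.log(n - m)]  # log target mass per column
    u = [0.0] * n          # row scaling factors (log domain)
    v = [0.0] * (m + 1)    # column scaling factors (log domain)
    for _ in range(n_iters):
        for j in range(m + 1):
            v[j] = log_col[j] - _logsumexp([log_k[i][j] + u[i] for i in range(n)])
        # Rows are normalized last, so each feature's assignment sums to 1.
        for i in range(n):
            u[i] = -_logsumexp([log_k[i][j] + v[j] for j in range(m + 1)])
    return [[math.exp(log_k[i][j] + u[i] + v[j]) for j in range(m + 1)]
            for i in range(n)]


def aggregate(features, assignment):
    """Build a global descriptor from n x d features and an n x (m + 1) assignment.

    The dustbin column is discarded; each cluster gets the assignment-weighted
    sum of the features, and the cluster vectors are concatenated and
    L2-normalized.
    """
    n, d = len(features), len(features[0])
    m = len(assignment[0]) - 1
    descriptor = []
    for j in range(m):
        for k in range(d):
            descriptor.append(sum(assignment[i][j] * features[i][k] for i in range(n)))
    norm = math.sqrt(sum(x * x for x in descriptor)) or 1.0
    return [x / norm for x in descriptor]
```

Calling `aggregate(features, sinkhorn_with_dustbin(scores))` yields a descriptor of length `m * d`. Features whose mass flows mostly to the dustbin contribute little to it, which is how the dustbin suppresses redundant features in this sketch.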
A series of experiments was conducted in a tomato greenhouse. The proposed model reached a top-1 recall rate (R_@1) of 88.96% for place recognition, which was 29.67, 2.97, and 2.89 percentage points higher than those of the NetVLAD, MixVPR, and EigenPlaces models, respectively. Compared with the aggregators employed in MixVPR and NetVLAD, the proposed aggregator improved R_@1 by 1.09 and 21.65 percentage points, respectively. Compared with a CNN, the proposed network increased R_@1 by 5.45 percentage points. The improvement over a Transformer network was even more pronounced, reaching 10.48 percentage points, and the proposed network also delivered a 1.6-fold increase in computation speed relative to that Transformer. In addition, the experiments showed that the proposed model maintained strong place recognition performance and robustness under small shifts in sampling distance, small viewpoint shifts, and different sunlight intensities. The greenhouse VPR method achieved a place recognition rate of no less than 81.94% in actual greenhouses, indicating its potential for practical application. Optimal-transport local feature aggregation and global descriptor generation were effective for place recognition, and the local feature extraction network improved place recognition performance. These findings can provide technical support for the visual systems of intelligent agricultural machinery in greenhouses.
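The R_@1 metric reported above, together with the cosine-similarity descriptor matching, can be sketched as follows. This is an illustrative sketch rather than the evaluation code used in the study: the function names and the `positives` argument (the set of database images counted as the same place as each query) are hypothetical, and the abstract does not state the criterion that defines a correct match.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two descriptors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recall_at_1(query_descs, db_descs, positives):
    """Fraction of queries whose most similar database image is a true match.

    positives[q] is the set of database indices treated as the same place
    as query q (an assumption of this sketch).
    """
    hits = 0
    for q, query in enumerate(query_descs):
        # Retrieve the database descriptor with the highest cosine similarity.
        best = max(range(len(db_descs)),
                   key=lambda i: cosine_similarity(query, db_descs[i]))
        if best in positives[q]:
            hits += 1
    return hits / len(query_descs)
```

Multiplying the returned fraction by 100 gives a percentage, so a difference between two methods is expressed in percentage points, as in the comparisons above.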
