侯玉涵,周云成,刘泽钰,等. 基于最优传输特征聚合的温室视觉位置识别方法[J]. 农业工程学报,2024,40(22):1-13. DOI: 10.11975/j.issn.1002-6819.202407155
    HOU Yuhan, ZHOU Yuncheng, LIU Zeyu, et al. Greenhouse visual place recognition method based on optimal transport feature aggregation[J]. Transactions of the Chinese Society of Agricultural Engineering (Transactions of the CSAE), 2024, 40(22): 1-13. DOI: 10.11975/j.issn.1002-6819.202407155

    Greenhouse visual place recognition method based on optimal transport feature aggregation

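    • Abstract (translated from the Chinese): To achieve vision-based place recognition in greenhouse scenes, and to address both the strong dependence of the local feature aggregation paradigm in existing visual place recognition models on the inductive bias of the training samples and the redundant information present in the aggregation process, a greenhouse visual place recognition method based on optimal transport local feature aggregation was constructed. The aggregation of local features from greenhouse scene images was treated as an optimal transport problem: the assignment matrix was generated dynamically from the local feature set, which decoupled the model from its strong dependence on inductive bias, and a "dustbin" cluster was introduced into the assignment to handle feature redundancy. Combining the strengths of the convolutional neural network (CNN) and the Transformer, a local feature extraction network for greenhouse scene images was designed and optimized. Experimental results showed that, in a greenhouse planted with tomatoes, the proposed method achieved a top-1 recall (R@1) of 88.96% for place recognition, which was 29.67, 2.97, and 2.89 percentage points higher than NetVLAD, MixVPR, and EigenPlaces, respectively. The optimal transport-based local feature aggregation and global descriptor generation method was effective: R@1 improved by 1.09 percentage points over the MixVPR aggregator and by 21.65 percentage points over the NetVLAD aggregator. The constructed local feature extraction network effectively improved place recognition performance, raising R@1 by 5.45 percentage points over a CNN network. The place recognition rate of the proposed method in actual greenhouse scenes was no lower than 81.94%, indicating a degree of practical applicability. The study can provide a technical reference for designing the vision systems of intelligent greenhouse agricultural machinery.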

       

      Abstract: Visual place recognition (VPR) underpins loop closure detection in visual SLAM (simultaneous localization and mapping) and has great potential for greenhouse robot navigation and related applications. However, because the greenhouse environment is complex and constantly changing, existing VPR methods struggle to meet the practical requirements of greenhouse scenes. Two problems stand out: the local feature aggregation paradigm in VPR models depends strongly on the inductive bias of the training samples, and the feature aggregation process carries redundant information. This study presents a greenhouse VPR method based on optimal transport local feature aggregation. Aggregating the local feature set into a global descriptor was framed as an optimal transport problem in which the cost matrix was predicted by an MLP (multi-layer perceptron), so that the cost matrix was generated dynamically from the local feature set extracted from each greenhouse scene image. A 'dustbin' cluster was added to the cost matrix to absorb redundant features. With the cost matrix as input, the Sinkhorn algorithm was used to solve for the optimal assignment matrix, which then softly assigned the local features to the clusters. The assignment results were concatenated to form the global descriptor of the scene image, which was used for place recognition. Combining the advantages of the CNN (convolutional neural network) and the Transformer, a deep neural network (DNN) was designed and optimized as the backbone for extracting local features from greenhouse scene images. Cosine similarity was used as the similarity metric between the global descriptors of scene images, so that descriptor matching identified the place of the current scene. Experiments were conducted in a tomato greenhouse. The proposed model reached a top-1 recall (R@1) of 88.96% for place recognition, which was 29.67, 23.23, 12.12, 2.97, and 2.89 percentage points higher than the NetVLAD, GeM, CosPlace, MixVPR, and EigenPlaces models, respectively. The optimal transport-based local feature aggregation and global descriptor generation effectively improved the accuracy of greenhouse place recognition: compared with the aggregators of MixVPR, GeM, and NetVLAD, the proposed aggregator improved R@1 by 1.09, 2.92, and 21.65 percentage points, respectively. The local feature extraction network constructed in this study also improved place recognition performance, raising R@1 by 5.45 percentage points over a CNN network. Compared with a Transformer network, the improvement was larger, 10.48 percentage points, and it was achieved together with a 1.6-fold increase in computation speed. The experiments further showed that the model recognized places well and remained robust under small shifts in sampling distance, small shifts in viewpoint, and different sunlight intensities. The proposed method had a place recognition rate of no less than 81.94% in actual greenhouses, indicating a degree of practical applicability. These findings can provide technical support for designing the vision systems of intelligent greenhouse agricultural machinery.
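
The pipeline described in the abstract (cost matrix, dustbin cluster, Sinkhorn assignment, concatenated global descriptor, cosine matching) can be illustrated with a small dependency-free Python sketch. This is not the authors' implementation. The function names, the uniform marginals, the constant dustbin cost, the regularization strength `eps`, and the iteration count are all assumptions made for illustration. The cost matrix is written by hand here, whereas the abstract states that an MLP predicts it.

```python
import math


def sinkhorn(cost, eps=0.5, n_iters=100):
    """Entropy-regularised optimal transport with uniform marginals (assumed).

    cost is an n x m list of lists. The returned n x m plan has rows that
    each sum to 1/n and columns that each sum to approximately 1/m.
    """
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Rescale columns so that every cluster receives mass 1/m.
        v = [(1.0 / m) / sum(kernel[i][j] * u[i] for i in range(n))
             for j in range(m)]
        # Rescale rows so that every local feature sends out mass 1/n.
        u = [(1.0 / n) / sum(kernel[i][j] * v[j] for j in range(m))
             for i in range(n)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def aggregate(features, cost, dustbin_cost=0.5, eps=0.5):
    """Soft-assign local features to clusters and build a global descriptor."""
    # Append the dustbin column: a constant-cost sink for redundant features
    # (the constant value is a hypothetical choice).
    plan = sinkhorn([row + [dustbin_cost] for row in cost], eps)
    n_clusters, dim = len(cost[0]), len(features[0])
    descriptor = []
    for j in range(n_clusters):  # the dustbin column is deliberately skipped
        for d in range(dim):
            descriptor.append(sum(plan[i][j] * features[i][d]
                                  for i in range(len(features))))
    # L2-normalise the concatenated per-cluster vectors.
    norm = math.sqrt(sum(x * x for x in descriptor)) or 1.0
    return [x / norm for x in descriptor]


def cosine_similarity(p, q):
    dot = sum(x * y for x, y in zip(p, q))
    return dot / (math.sqrt(sum(x * x for x in p)) *
                  math.sqrt(sum(y * y for y in q)))


def best_match(query, database):
    """Index of the database descriptor most similar to the query."""
    sims = [cosine_similarity(query, d) for d in database]
    return max(range(len(sims)), key=sims.__getitem__)


# Toy data: four 2-D local features and a hand-written 4 x 2 cost matrix.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
cost = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.6, 0.6]]
other = [[0.0, 1.0], [0.1, 0.9], [1.0, 0.0], [0.5, 0.5]]

query = aggregate(features, cost)
database = [aggregate(other, cost), aggregate(features, cost)]
print(best_match(query, database))
```

In this sketch the dustbin column takes part in the transport problem but is dropped before concatenation, so mass sent to it does not contribute to the descriptor. A real system would replace the toy `features` and `cost` with backbone outputs and learned costs, and would likely use a batched tensor library rather than nested lists.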

       
