Greenhouse visual place recognition method based on optimal transport feature aggregation
Abstract
Visual place recognition (VPR) underpins loop closure detection in visual SLAM (simultaneous localization and mapping) and has considerable potential for greenhouse robot navigation and related applications. However, because greenhouse environments are complex and constantly changing, existing VPR methods do not fully meet the practical requirements of greenhouse scenes. Two problems stand out: the local feature aggregation paradigm used in VPR models depends strongly on the inductive bias of the training samples, and the aggregation process introduces redundant information. This study presents a greenhouse VPR method based on optimal transport aggregation of local features. The aggregation of a local feature set into a global descriptor was formulated as an optimal transport problem whose cost matrix was predicted by an MLP (multi-layer perceptron), so that the cost matrix was generated dynamically from the local feature set extracted from each greenhouse scene image. A 'dustbin' cluster was added to the cost matrix to absorb redundant features. With the cost matrix as input, the Sinkhorn algorithm was used to solve for the optimal assignment matrix, which then softly assigned the local features to the clusters. The assignment results were concatenated to form the global descriptor of the scene image, which was used for place recognition. To extract local features from greenhouse scene images, a deep neural network (DNN) backbone was designed that combines the advantages of CNN (convolutional neural network) and Transformer architectures.
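The aggregation step described above can be illustrated with a minimal sketch in plain Python. It is not the implementation used in this study: the cost matrix here is a hand-written stand-in for the MLP output, the uniform row and column marginals are an assumption, and the names `sinkhorn`, `aggregate`, `eps`, and `iters` are hypothetical. The last column of the cost matrix plays the role of the 'dustbin' cluster and is dropped before the per-cluster results are concatenated.

```python
import math

def sinkhorn(cost, row_mass, col_mass, eps=0.5, iters=100):
    """Entropy-regularised optimal transport solved by Sinkhorn iterations.

    cost is an n x k list of lists. The returned n x k plan has column sums
    equal to col_mass and row sums that converge to row_mass.
    """
    n, k = len(cost), len(cost[0])
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * k
    for _ in range(iters):
        # Alternately rescale rows and columns to match the target marginals.
        u = [row_mass[i] / sum(kernel[i][j] * v[j] for j in range(k)) for i in range(n)]
        v = [col_mass[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(k)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(k)] for i in range(n)]

def aggregate(features, plan):
    """Softly assign features to clusters and concatenate the weighted sums.

    The last column of plan is treated as the dustbin and is discarded.
    The concatenated vector is L2-normalised (an assumption of this sketch).
    """
    clusters = len(plan[0]) - 1
    dim = len(features[0])
    desc = []
    for j in range(clusters):
        for t in range(dim):
            desc.append(sum(plan[i][j] * features[i][t] for i in range(len(features))))
    norm = math.sqrt(sum(x * x for x in desc)) or 1.0
    return [x / norm for x in desc]

# Toy input: 4 local features of dimension 2, 2 clusters plus 1 dustbin column.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
cost = [  # stand-in for the MLP-predicted cost matrix (assumed values)
    [0.1, 0.9, 0.8],
    [0.2, 0.8, 0.7],
    [0.9, 0.1, 0.8],
    [0.6, 0.6, 0.2],
]
n, k = len(cost), len(cost[0])
plan = sinkhorn(cost, [1.0 / n] * n, [1.0 / k] * k)  # uniform marginals (assumption)
descriptor = aggregate(features, plan)
```

In this sketch the descriptor length is the number of real clusters times the feature dimension, so features routed mostly to the dustbin contribute little to the final vector.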
Cosine similarity was used as the similarity metric between the global descriptors of scene images, so that descriptor matching could identify the place of the current scene. Experiments were conducted in a tomato greenhouse. The proposed model reached a top-1 recall rate (R@1) of 88.96%, which was 29.67, 23.23, 12.12, 2.97, and 2.89 percentage points higher than the NetVLAD, GeM, CosPlace, MixVPR, and EigenPlaces models, respectively. Local feature aggregation and global descriptor generation based on optimal transport effectively improved the accuracy of greenhouse place recognition: compared with the aggregators of MixVPR, GeM, and NetVLAD, the proposed aggregator improved R@1 by 1.09, 2.92, and 21.65 percentage points, respectively. The local feature extraction network constructed in this study also improved place recognition performance. Compared with a CNN network, it increased R@1 by 5.45 percentage points. Compared with a Transformer network, the R@1 improvement was larger, at 10.48 percentage points, and was accompanied by a 1.6-fold increase in computation speed. In addition, the experiments showed that the model maintained strong place recognition performance and robustness under small shifts in sampling distance, small shifts in viewpoint, and varying sunlight intensity. In actual greenhouses, the place recognition rate of the proposed method was no lower than 81.94%, which indicates a degree of practical applicability. These findings can provide technical support for designing the vision systems of intelligent greenhouse agricultural machinery.
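The matching and evaluation steps can be sketched in the same way. The following plain-Python example ranks database descriptors by cosine similarity to a query descriptor and computes R@K as the percentage of queries whose top K matches contain a correct place. The toy descriptors, the ground-truth sets, and the names `cosine_similarity`, `top_k`, and `recall_at_k` are illustrative assumptions, not data or code from the experiments.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two descriptors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query, database, k=1):
    """Indices of the k database descriptors most similar to the query."""
    ranked = sorted(
        range(len(database)),
        key=lambda i: cosine_similarity(query, database[i]),
        reverse=True,
    )
    return ranked[:k]

def recall_at_k(queries, database, ground_truth, k=1):
    """Percentage of queries with at least one correct place among the top k.

    ground_truth[i] is the set of database indices that count as the same
    place as queries[i].
    """
    hits = sum(
        1 for q, gt in zip(queries, ground_truth) if gt & set(top_k(q, database, k))
    )
    return 100.0 * hits / len(queries)

# Toy data (assumed values): three database places and three queries.
database = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[0.9, 0.1], [0.1, 0.8], [1.0, 0.2]]
ground_truth = [{0}, {1}, {2}]

r_at_1 = recall_at_k(queries, database, ground_truth, k=1)
r_at_2 = recall_at_k(queries, database, ground_truth, k=2)
```

Differences between two recall values computed this way are expressed in percentage points, which is the unit used for the model comparisons above.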