Recognizing visual position in the greenhouse using optimal transport feature aggregation
Abstract
As the foundation for loop closure detection in visual SLAM (simultaneous localization and mapping), visual place recognition (VPR) has great potential in greenhouse robot navigation and other fields. However, existing VPR methods cannot fully meet the requirements of greenhouse scenes, because the greenhouse environment is complex and constantly changing. In particular, the local feature aggregation paradigm in VPR models depends strongly on the inductive bias learned from the training samples, which leads to information redundancy during feature aggregation. In this study, a greenhouse VPR method was presented based on optimal transport aggregation of local features. The aggregation of local features into a global descriptor was framed as an optimal transport problem, in which the cost matrix was predicted by an MLP (multi-layer perceptron). Thus, the cost matrix was generated dynamically from the local features that were extracted from the greenhouse scene images. Additionally, a 'dustbin' cluster was introduced into the cost matrix to absorb redundant features. Taking the cost matrix as the input, the Sinkhorn algorithm was employed to solve for the optimal assignment matrix, which gave the soft assignment of local features to the clusters. Finally, the assignment results were concatenated to form a global descriptor of the scene image, which was used for place recognition. A deep neural network (DNN) that combined the advantages of the CNN (convolutional neural network) and the Transformer was designed as the backbone for local feature extraction from greenhouse scene images. Furthermore, cosine similarity was used as the metric for matching the global descriptors of scene images. A series of experiments were conducted in a tomato greenhouse.
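The aggregation pipeline described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the tiny two-layer MLP with random weights (`mlp_scores`, `w1`, `w2`), the uniform marginals over features and clusters, the iteration count, and the weighted-sum construction of each cluster vector are all hypothetical choices made here for illustration. Only the overall flow (MLP-predicted score matrix with a dustbin column, Sinkhorn normalization, soft assignment, concatenation into a global descriptor) follows the text.

```python
import math
import random


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def mlp_scores(features, w1, w2):
    """Hypothetical two-layer MLP: one score per cluster for each local feature.

    w1 holds one weight vector per hidden unit, w2 one per output column.
    The last output column plays the role of the 'dustbin' cluster.
    """
    scores = []
    for f in features:
        hidden = [max(0.0, sum(x * w for x, w in zip(f, unit))) for unit in w1]
        scores.append([sum(h * w for h, w in zip(hidden, unit)) for unit in w2])
    return scores


def sinkhorn_log(scores, log_mu, log_nu, n_iters=50):
    """Entropy-regularized optimal transport solved in the log domain.

    scores[i][j] is the affinity (negative cost) of feature i to cluster j.
    Returns the log transport plan with row marginals ~ exp(log_mu) and
    column marginals ~ exp(log_nu).
    """
    n, k = len(scores), len(scores[0])
    u, v = [0.0] * n, [0.0] * k
    for _ in range(n_iters):
        u = [log_mu[i] - logsumexp([scores[i][j] + v[j] for j in range(k)])
             for i in range(n)]
        v = [log_nu[j] - logsumexp([scores[i][j] + u[i] for i in range(n)])
             for j in range(k)]
    return [[scores[i][j] + u[i] + v[j] for j in range(k)] for i in range(n)]


def aggregate(features, w1, w2, n_iters=50):
    """Aggregate local features into one L2-normalized global descriptor."""
    scores = mlp_scores(features, w1, w2)      # n x (m + 1), last column = dustbin
    n, k = len(scores), len(scores[0])
    log_mu = [-math.log(n)] * n                # assumption: uniform mass per feature
    log_nu = [-math.log(k)] * k                # assumption: uniform mass per cluster
    log_plan = sinkhorn_log(scores, log_mu, log_nu, n_iters)
    dim = len(features[0])
    descriptor = []
    for j in range(k - 1):                     # mass sent to the dustbin is discarded
        cluster = [0.0] * dim
        for i in range(n):
            p = math.exp(log_plan[i][j])
            for d in range(dim):
                cluster[d] += p * features[i][d]
        descriptor.extend(cluster)             # concatenate per-cluster vectors
    norm = math.sqrt(sum(x * x for x in descriptor)) or 1.0
    return [x / norm for x in descriptor], log_plan


# Toy run with random "local features" standing in for backbone outputs.
random.seed(0)
N, D, H, M = 12, 4, 8, 3                       # features, feature dim, hidden units, clusters
features = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]
w1 = [[random.gauss(0, 0.5) for _ in range(D)] for _ in range(H)]
w2 = [[random.gauss(0, 0.5) for _ in range(H)] for _ in range(M + 1)]
descriptor, log_plan = aggregate(features, w1, w2)
print(len(descriptor))                         # M * D values in the global descriptor
```

Working in the log domain keeps the Sinkhorn iterations numerically stable when scores are large. The dustbin column receives mass like any other cluster but contributes nothing to the descriptor, which is how redundant features are set aside in this sketch.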
The experimental results showed that the improved model achieved better performance. The top-1 recall rate (R@1) for place recognition reached 88.96%, which was 29.67, 2.97, and 2.89 percentage points higher than those of the NetVLAD, MixVPR, and EigenPlaces models, respectively. Compared with the aggregators employed in MixVPR and NetVLAD, the proposed aggregator improved R@1 by 1.09 and 21.65 percentage points, respectively, demonstrating its effectiveness. Compared with the CNN, the improved network increased R@1 by 5.45 percentage points. The improvement in R@1 was even more pronounced (10.48 percentage points) compared with a Transformer network. At the same time, the improved network achieved a 1.6-fold increase in computation speed over the Transformer. In addition, the experiments demonstrated that the improved model performed place recognition well and remained robust under small sampling distance shifts, small viewpoint shifts, and different sunlight intensities. The greenhouse VPR method achieved a place recognition rate of no less than 81.94% in actual greenhouses, indicating its potential for practical application. The method based on optimal transport aggregation of local features and global descriptor generation was effective for place recognition, and the local feature extraction network further improved place recognition performance. These findings can provide technical support for the visual systems of intelligent agricultural machinery in greenhouses.
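For readers unfamiliar with the R@1 metric reported above, the following minimal sketch shows how descriptor matching by cosine similarity and top-1 recall could be computed. The function names, the toy descriptors, and the `ground_truth` sets are hypothetical and chosen only for illustration; they are not data from the experiments.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two descriptors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def recall_at_1(queries, database, ground_truth):
    """Fraction of queries whose most similar database descriptor is a true match.

    ground_truth[q] is the set of database indices showing the same place as query q.
    """
    hits = 0
    for q, query in enumerate(queries):
        sims = [cosine_similarity(query, ref) for ref in database]
        best = max(range(len(database)), key=sims.__getitem__)
        hits += best in ground_truth[q]
    return hits / len(queries)


# Toy descriptors: the third query is closer to the wrong database entry.
database = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
queries = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.2], [0.6, 0.0, 0.5]]
ground_truth = [{0}, {1}, {2}]
r1 = recall_at_1(queries, database, ground_truth)
print(f"R@1 = {r1:.2%}")
```

In this toy case two of the three queries retrieve a correct place, so R@1 is two thirds. Differences between models are then reported in percentage points, as in the comparisons above.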