Abstract:
Simultaneous Localization and Mapping (SLAM) is one of the most crucial capabilities for autonomous navigation in mobile robots, and depth perception and pose tracking are core components of a SLAM system. However, existing unsupervised learning-based visual odometry frameworks cannot fully meet the practical requirements of visual odometry, particularly because of the scale uncertainty in visual odometry estimation and the lack of geometric constraints during the autonomous operation of greenhouse mobile robots. In this study, an unsupervised optical flow-based visual odometry method was presented. An optical flow estimation network was trained in an unsupervised manner using image warping. The optical flow between stereo images (disparity) was used to calculate the absolute depth of the scene. The optical flow between adjacent frames of the left images was then combined with the scene depth to solve the frame-to-frame pose transformation matrix using the Perspective-n-Point (PnP) algorithm. Reliable correspondences were selected during this solving process using forward-backward flow consistency checking, so that the absolute pose could be recovered. A compact deep neural network was built from convolutional modules to serve as the backbone of the flow model. This improved network was designed according to the well-established principles of pyramidal processing, warping, and the use of a cost volume. At the same time, cost volume normalization was applied in the network to alleviate the vanishing feature activations at the higher pyramid levels. Furthermore, local geometric consistency constraints were designed for the objective function of the flow model. Meanwhile, a pyramid distillation loss was introduced to supervise the intermediate optical flows by distilling the finest final flow field into pseudo labels. A series of experiments was conducted using a wheeled mobile robot in a tomato greenhouse. The results showed that the improved model achieved better performance. The local geometric consistency constraints improved the optical flow estimation accuracy, reducing the endpoint error (EPE) of the inter-frame and stereo optical flow by 8.89% and 8.96%, respectively. The pyramid distillation loss significantly reduced the optical flow estimation error of the flow model, with the EPEs of the inter-frame and stereo optical flow decreasing by 11.76% and 11.45%, respectively. Cost volume normalization reduced the EPEs of the inter-frame and stereo optical flow by 12.50% and 7.25%, respectively, at the cost of only a 1.28% decrease in the calculation speed of the optical flow network. The improved model showed a 9.52% and 9.80% decrease in the root mean square error (RMSE) and mean absolute error (MAE) of the relative translation error (RTE), respectively, compared with an existing unsupervised flow model, and decreases of 43.0% and 43.21%, respectively, compared with Monodepth2. The pose tracking accuracy of the improved model was lower than that of ORB-SLAM3; however, unlike pure multi-view geometry approaches, the improved model also predicted dense depth maps of the scene. The relative error of its depth estimation was 5.28% lower than that of the existing state-of-the-art self-supervised joint depth-pose learning. The accuracy of pose tracking depended mainly on the motion speed of the robot: the pose tracking performance at a low speed of 0.2 m/s and a high speed of 0.8 m/s was significantly lower than that at 0.4-0.6 m/s.
The resolution of the input image also greatly affected the pose tracking accuracy, with the errors decreasing gradually as the resolution increased. With an input image resolution of 832×512 pixels and a motion range of 1 m, the MAE of the RTE was no higher than 3.6 cm, and the MAE of the relative rotation error (RRE) was no higher than 1.3°. These findings can provide technical support for designing the vision systems of greenhouse mobile robots.
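As a supplement to the pipeline summarized above, the sketches below illustrate its main computational steps; they are minimal reconstructions under stated assumptions, not the paper's implementation. The unsupervised training signal comes from image warping: the source image is warped by the estimated flow and compared photometrically with the reference image. A minimal PyTorch sketch, assuming flow in pixels with a (B, 2, H, W) layout and a (B, 1, H, W) occlusion mask (all names here are illustrative):

```python
import torch
import torch.nn.functional as F

def warp(img, flow):
    """Warp img (B, C, H, W) toward the reference view using flow (B, 2, H, W)."""
    b, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack([xs, ys], dim=0).float().to(img.device)  # (2, H, W)
    coords = grid.unsqueeze(0) + flow                            # sample locations
    # Normalize coordinates to [-1, 1] as required by grid_sample
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid_norm = torch.stack([coords_x, coords_y], dim=-1)        # (B, H, W, 2)
    return F.grid_sample(img, grid_norm, align_corners=True)

def photometric_loss(img_ref, img_src, flow, mask):
    """L1 photometric loss between the reference image and the source image
    warped by the estimated flow; unreliable pixels are excluded via mask."""
    warped = warp(img_src, flow)
    return (torch.abs(img_ref - warped) * mask).sum() / (mask.sum() + 1e-6)
```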
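For a rectified stereo pair, the horizontal stereo flow is the disparity, and the absolute depth follows from the standard pinhole relation Z = f·B/d. A minimal sketch, assuming the focal length is given in pixels and the baseline in meters (the example values are made up):

```python
import numpy as np

def disparity_to_depth(disparity, focal_length_px, baseline_m, eps=1e-6):
    """Convert horizontal stereo flow (disparity, in pixels) to metric depth.

    For a rectified stereo pair, Z = f * B / d, where f is the focal length
    in pixels, B the baseline in meters, and d the disparity in pixels.
    """
    return focal_length_px * baseline_m / np.maximum(disparity, eps)

# Example: 720-pixel focal length, 12 cm baseline, 20-pixel disparity
depth = disparity_to_depth(np.array([20.0]), focal_length_px=720.0, baseline_m=0.12)
print(depth)  # ~4.32 m
```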
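Forward-backward consistency checking keeps only pixels where the forward flow and the backward flow, sampled at the forward-warped location, roughly cancel out. One common formulation uses a magnitude-dependent threshold; the constants below follow that convention and are illustrative, not taken from the paper:

```python
import numpy as np

def fb_consistency_mask(flow_fwd, flow_bwd, alpha1=0.01, alpha2=0.5):
    """Boolean mask of reliable correspondences from forward-backward checking.

    flow_fwd, flow_bwd: (H, W, 2) flow fields. A pixel passes when the sum of
    the forward flow and the backward flow at the forward-warped location is
    small relative to the flow magnitudes.
    """
    h, w = flow_fwd.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    # Forward-warped coordinates, rounded to the nearest pixel for simplicity
    xw = np.clip(np.rint(xs + flow_fwd[..., 0]).astype(int), 0, w - 1)
    yw = np.clip(np.rint(ys + flow_fwd[..., 1]).astype(int), 0, h - 1)
    flow_bwd_warped = flow_bwd[yw, xw]
    diff = np.sum((flow_fwd + flow_bwd_warped) ** 2, axis=-1)
    mag = np.sum(flow_fwd ** 2, axis=-1) + np.sum(flow_bwd_warped ** 2, axis=-1)
    return diff < alpha1 * mag + alpha2
```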
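Given the absolute depth of frame t and the inter-frame flow to frame t+1, the frame-to-frame pose can be solved as a PnP problem: back-project the masked pixels of frame t into 3-D, and match them to their flow-displaced 2-D locations in frame t+1. A sketch using OpenCV's RANSAC-based PnP solver (the exact solver and parameters used in the paper are not specified here):

```python
import cv2
import numpy as np

def pose_from_flow_and_depth(flow_t_to_t1, depth_t, K, mask):
    """Recover the frame-to-frame pose with PnP from flow and depth.

    flow_t_to_t1: (H, W, 2) inter-frame optical flow of the left camera.
    depth_t:      (H, W) absolute depth of frame t (from stereo flow).
    K:            (3, 3) camera intrinsic matrix.
    mask:         (H, W) boolean mask of reliable correspondences.
    Returns (R, t) mapping points from frame t into frame t+1.
    """
    h, w = depth_t.shape
    ys, xs = np.mgrid[0:h, 0:w]
    valid = mask & (depth_t > 0)
    # Back-project frame-t pixels into 3-D with the pinhole model
    z = depth_t[valid]
    x3 = (xs[valid] - K[0, 2]) * z / K[0, 0]
    y3 = (ys[valid] - K[1, 2]) * z / K[1, 1]
    pts3d = np.stack([x3, y3, z], axis=-1).astype(np.float64)
    # Corresponding 2-D locations in frame t+1, given by the inter-frame flow
    pts2d = np.stack([xs[valid] + flow_t_to_t1[..., 0][valid],
                      ys[valid] + flow_t_to_t1[..., 1][valid]],
                     axis=-1).astype(np.float64)
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(pts3d, pts2d,
                                                 K.astype(np.float64), None)
    R, _ = cv2.Rodrigues(rvec)
    return R, tvec
```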
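The abstract does not detail the cost volume normalization; one common reading, in the spirit of prior unsupervised-flow work, is to normalize the feature activations entering the correlation layer so that their magnitudes do not shrink at the coarse pyramid levels. A sketch of that interpretation:

```python
import torch

def normalize_features(feat, eps=1e-6):
    """Channel-wise feature normalization before building the cost volume.

    Normalizing the activations to zero mean and unit variance across the
    channel dimension counteracts the diminishing feature magnitudes at the
    higher (coarser) pyramid levels; this is an assumed reading of the
    cost volume normalization mentioned in the abstract.
    """
    mean = feat.mean(dim=1, keepdim=True)
    std = feat.std(dim=1, keepdim=True)
    return (feat - mean) / (std + eps)
```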
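The pyramid distillation loss supervises the intermediate flow predictions with the finest flow field, detached and resized to each level as a pseudo label. One plausible form, where the loss weighting and the resizing details are assumptions:

```python
import torch
import torch.nn.functional as F

def pyramid_distillation_loss(flows_pyramid, flow_finest):
    """Supervise intermediate flow fields with the detached finest flow.

    flows_pyramid: list of (B, 2, Hi, Wi) intermediate predictions.
    flow_finest:   (B, 2, H, W) finest prediction, used as a pseudo label.
    Flow values are rescaled together with the spatial resizing.
    """
    loss = 0.0
    target = flow_finest.detach()  # pseudo label carries no gradient
    for flow_i in flows_pyramid:
        scale = flow_i.shape[-1] / target.shape[-1]
        pseudo = F.interpolate(target, size=flow_i.shape[-2:],
                               mode="bilinear", align_corners=True) * scale
        loss = loss + torch.abs(flow_i - pseudo).mean()
    return loss
```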
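Finally, the endpoint error (EPE) reported in the results is the mean Euclidean distance between predicted and reference flow vectors:

```python
import numpy as np

def endpoint_error(flow_pred, flow_gt):
    """Mean endpoint error (EPE) between two (H, W, 2) flow fields."""
    return float(np.mean(np.linalg.norm(flow_pred - flow_gt, axis=-1)))
```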