Unsupervised visual odometry method for greenhouse mobile robots

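    • Summary: To meet the practical demand for visual odometry information during the autonomous operation of greenhouse mobile robots, and to address the scale uncertainty that arises in visual odometry estimation when geometric constraints are lacking, a visual odometry estimation method based on unsupervised optical flow is proposed. Based on the geometric relationships among local images of the stereo video, a local geometric consistency constraint and a corresponding optical flow model were constructed, and the structure of the optical flow estimation network was optimized. During network training, a knowledge self-distillation loss between pyramid levels was adopted to address the lack of supervision signals for the level-wise optical flow fields. Experiments were carried out in a tomato greenhouse with a wheeled mobile robot as the test platform. The results showed that, compared with the model without the local geometric consistency constraint, the endpoint errors of the inter-frame and stereo optical flow decreased by 8.89% and 8.96%, respectively. Compared with the model without inter-level knowledge self-distillation, the two errors decreased by 11.76% and 11.45%, respectively. Compared with visual odometry estimation based on an existing optical flow model, the relative translation error in pose tracking decreased by 9.80%; compared with a pose estimation method based on joint training of multiple networks, this error decreased by 43.21%. The method can also recover dense scene depth, with a relative depth estimation error of 5.28%. Within a motion range of 1 m, the mean absolute error of translation was 3.6 cm and the mean absolute error of orientation was 1.3°. Compared with existing baseline methods, the proposed method improved the accuracy of visual odometry estimation. The results can provide a technical reference for the design of vision systems for greenhouse mobile robots.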

       

      Abstract: Simultaneous Localization and Mapping (SLAM) is crucial to the autonomous navigation of mobile robots, and its core components are depth perception and pose tracking. However, existing unsupervised learning frameworks for visual odometry cannot fully meet the practical requirements for visual odometry information during the autonomous operation of greenhouse mobile robots. In particular, visual odometry estimation is prone to scale uncertainty because it lacks geometric constraints. In this study, a visual odometry method based on unsupervised optical flow was presented. An optical flow estimation network was trained in an unsupervised manner using image warping. The optical flow between stereo images (disparity) was used to calculate the absolute depth of the scene. The optical flow between adjacent frames of the left images was combined with the scene depth to solve the frame-to-frame pose transformation matrix using the Perspective-n-Point (PnP) algorithm. Reliable correspondences were selected for the solving process by forward-backward flow consistency checking, so that the absolute pose could be recovered. A compact deep neural network built from convolutional modules served as the backbone of the flow model. The network was designed according to well-established principles: pyramidal processing, warping, and the use of a cost volume. Cost volume normalization was applied in the network to alleviate the high feature activation values at higher levels. Furthermore, local geometric consistency constraints were added to the objective function of the flow model. A pyramid distillation loss was also introduced to supervise the intermediate optical flows by distilling the finest, final flow field as pseudo labels. A series of experiments was conducted using a wheeled mobile robot in a tomato greenhouse.
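The geometric steps described above (depth from stereo disparity, then a forward-backward flow check that selects correspondences for PnP) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a rectified stereo pair, a pinhole camera with square pixels, and flow fields stored as dictionaries keyed by integer pixel coordinates. All function names, parameter names, and the threshold `max_err_px` are hypothetical.

```python
import math


def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth Z = f * B / d for a rectified stereo pair (assumed setup).

    Returns None when the disparity is not positive, since depth is undefined there.
    """
    if disparity_px <= 0:
        return None
    return focal_px * baseline_m / disparity_px


def forward_backward_consistent(flow_fwd, flow_bwd, max_err_px=1.0):
    """Keep pixels whose forward flow is approximately undone by the backward flow.

    flow_fwd maps (x, y) in frame t to (dx, dy); flow_bwd does the same for frame t+1.
    The target is rounded to the nearest pixel, a simplification of the bilinear
    sampling a dense implementation would use. max_err_px is a hypothetical threshold.
    """
    kept = []
    for (x, y), (dx, dy) in flow_fwd.items():
        target = (round(x + dx), round(y + dy))
        if target not in flow_bwd:
            continue  # flow leaves the image or has no backward estimate
        bx, by = flow_bwd[target]
        # For a consistent pair, forward + backward flow should sum to about zero.
        if math.hypot(dx + bx, dy + by) <= max_err_px:
            kept.append(((x, y), target))
    return kept


def back_project(pixel, depth_m, focal_px, cx, cy):
    """Lift a pixel with known depth to a 3D point in the camera frame (pinhole model)."""
    u, v = pixel
    return ((u - cx) * depth_m / focal_px, (v - cy) * depth_m / focal_px, depth_m)
```

The 3D points of frame t obtained with `back_project`, paired with their matched 2D pixels in frame t+1, are the kind of input a PnP solver expects, for example OpenCV's `cv2.solvePnPRansac`. Because the depth comes from a stereo baseline given in metres, the recovered translation is in metric scale.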
The results showed that the improved model achieved better performance. The local geometric consistency constraints improved the accuracy of optical flow estimation: the endpoint errors (EPEs) of the inter-frame and stereo optical flow were reduced by 8.89% and 8.96%, respectively. The pyramid distillation loss significantly reduced the estimation error of the flow model, with the EPEs of the inter-frame and stereo optical flow decreasing by 11.76% and 11.45%, respectively. Cost volume normalization reduced the EPEs of the inter-frame and stereo optical flow by 12.50% and 7.25%, respectively, at the cost of a 1.28% decrease in the computation speed of the optical flow network. Compared with an existing unsupervised flow model, the improved model reduced the root mean square error (RMSE) and mean absolute error (MAE) of the relative translation error (RTE) by 9.52% and 9.80%, respectively. Compared with Monodepth2, the reductions were 43.0% and 43.21%, respectively. The pose tracking accuracy of the improved model was lower than that of ORB-SLAM3, a pure multi-view geometry method, but the model could also predict dense depth maps of the scene. The relative error of depth estimation was 5.28%, which was more accurate than the existing state-of-the-art self-supervised joint depth-pose learning method. The accuracy of pose tracking depended on the motion speed of the robot: performance at the low speed of 0.2 m/s and the high speed of 0.8 m/s was significantly lower than at 0.4-0.6 m/s. The resolution of the input image also strongly affected the pose tracking accuracy, with the errors decreasing gradually as the resolution increased. With an input image resolution of 832×512 pixels and a motion range of 1 m, the MAE of the RTE did not exceed 3.6 cm and the MAE of the relative rotation error (RRE) did not exceed 1.3°. These findings can provide technical support for the design of vision systems for greenhouse mobile robots.
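The endpoint error and the percentage reductions quoted above follow standard definitions, which can be sketched as below. This is an illustrative sketch only: the function names are hypothetical, flow fields are represented as plain lists of (u, v) vectors, and the numbers in the usage note are toy values, not data from the study.

```python
import math


def endpoint_error(flow_pred, flow_ref):
    """Mean Euclidean distance between predicted and reference flow vectors (EPE)."""
    if not flow_pred or len(flow_pred) != len(flow_ref):
        raise ValueError("flow fields must be non-empty and of equal length")
    total = sum(
        math.hypot(pu - ru, pv - rv)
        for (pu, pv), (ru, rv) in zip(flow_pred, flow_ref)
    )
    return total / len(flow_pred)


def relative_reduction_percent(baseline_err, improved_err):
    """Percentage by which improved_err is lower than baseline_err."""
    if baseline_err <= 0:
        raise ValueError("baseline error must be positive")
    return 100.0 * (baseline_err - improved_err) / baseline_err
```

For instance, with toy values, a baseline EPE of 2.0 pixels and an improved EPE of 1.5 pixels correspond to `relative_reduction_percent(2.0, 1.5)`, a 25% reduction. The reductions reported in the abstract are presumably computed in this relative sense.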

       
