Unsupervised depth estimation model for tomato plant images based on a dense convolutional auto-encoder

    • Abstract: Acquiring depth information is key to autonomous operation of greenhouse mobile robots. This study proposes an unsupervised plant-image depth estimation model based on a dense convolutional auto-encoder. To address pixels that vanish between views owing to viewpoint differences and occlusion, disparity confidence prediction is introduced to suppress the problem gradients produced by the image reconstruction loss, and a dense auto-encoder built on separable convolutions is designed as the model's deep neural network. Training and testing experiments were carried out on binocular images of tomato plants, with depth estimation error and threshold accuracy as the criteria. The results show that suppressing backpropagation of the problem gradients significantly improves depth estimation accuracy: compared with the model before gradient suppression, the mean absolute error and root mean square error of the estimated depth decreased by 55.2% and 33.0%, respectively. Feeding the multi-scale disparity maps predicted by the network back into the encoder, and up-sampling them to the input image size before they take part in image reconstruction and loss computation, is also effective, reducing the two errors by a further 23.7% and 27.5%. The depth estimation error decreases markedly as the depth of the spatial point decreases: within 9 m, the mean absolute error of the estimated depth is below 14.1 cm, and within 3 m it is below 7 cm. Compared with existing studies, the mean relative error and mean absolute error of the estimated depth are reduced by 46.0% and 26.0%, respectively. This work can serve as a reference for the design of vision systems for greenhouse mobile robots.
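    The distance-dependent errors quoted above follow the standard binocular geometry (a textbook relation, not an equation reproduced from the paper): with focal length $f$, baseline $B$, and predicted disparity $d$,

    $$Z = \frac{fB}{d}, \qquad \left|\frac{\partial Z}{\partial d}\right| = \frac{fB}{d^{2}} = \frac{Z^{2}}{fB}$$

    so a fixed disparity error maps to a depth error that grows roughly quadratically with range, consistent with the reported mean absolute error of under 7 cm within 3 m but under 14.1 cm within 9 m.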

       

      Abstract: Depth information acquisition is the key for mobile robots to achieve autonomous operation in the greenhouse. This study proposed an unsupervised model, based on a dense convolutional auto-encoder, that used binocular images for training and testing. The model enabled the neural network to perform plant image depth estimation, and defined a loss function for depth estimation built from convolutional feature comparison and regularization constraints. To address the problem of pixels vanishing due to viewpoint differences and occlusion, disparity confidence prediction was introduced to suppress the problem gradients caused by the image reconstruction loss. Meanwhile, a dense block based on separable convolution was designed, and a convolutional auto-encoder built from it served as the backbone network of the model. A large number of binocular images were collected in a tomato greenhouse during the growing period, on overcast, cloudy, and sunny days. The unsupervised plant image depth estimation network was implemented with the Python application interface of Microsoft Cognitive Toolkit (CNTK) v2.7, a deep learning computing framework. Training and testing experiments, with image feature similarity, depth estimation error, and threshold accuracy as the criteria, were carried out on the binocular images of tomato plants using an NVIDIA Tesla K40c graphics device. The results showed that, compared with regular convolution, the auto-encoder based on the separable-convolution dense block effectively reduced the number of network weight parameters. Compared with other activations, including ReLU (Rectified Linear Unit), Parametric ReLU, ELU (Exponential Linear Unit), and SELU (Scaled ELU), the network model with Leaky-ReLU as the nonlinear transformation had the minimum depth error and the maximum threshold accuracy. The results also showed that the network structure had a significant impact on the accuracy of the predicted disparity, and that introducing the separable-convolution dense block into the skip connections between the encoder and decoder of the auto-encoder had a measurable effect on improving depth estimation accuracy. Meanwhile, by making the depth estimation model predict a disparity confidence that was used to restrain backpropagation of the problem gradients, the depth estimation error decreased remarkably: the Mean Absolute Error (MAE) and the Root Mean Square Error (RMSE) were reduced by 55.2% and 33.0%, respectively. Depth estimation accuracy was further improved by up-sampling the disparity maps to the input image scale before image reconstruction and loss calculation, and by splicing the multi-scale disparity maps predicted by the network onto the feature maps of the encoder and sending the combined feature maps to the next prediction module. The performance of depth estimation also improved as the depth and width of the convolutional auto-encoder increased. The depth estimation error decreased significantly with the reduction of the spatial point depth: when the spatial point depth was within 9 m, the MAE of the estimated depth was less than 14.1 cm, and when the depth was within 3 m, the MAE was less than 7 cm. The influence of illumination conditions on the accuracy of the depth estimation model was not significant, indicating that the method is robust to changes in the luminous environment. The highest test speed of the model was 14.2 frames per second (FPS), which is close to real time. Compared with existing studies, the mean relative error, MAE, and Mean Range Error (MRE) of depth estimation in this study were reduced by 46.0%, 26.0%, and 25.5%, respectively. This research can provide a reference for the design of the vision systems of greenhouse mobile robots.
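      To make the network design concrete, below is a minimal sketch of a separable-convolution dense block of the kind the abstract describes. The paper implemented its model with CNTK v2.7; this illustration is written in PyTorch for brevity, and the names (SepConv, SepDenseBlock), the growth rate, and the layer count are assumptions, not the authors' configuration. Leaky-ReLU is used as the activation, matching the abstract's finding.

import torch
import torch.nn as nn

class SepConv(nn.Module):
    """Depthwise-separable convolution: depthwise 3x3, then pointwise 1x1."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.norm = nn.BatchNorm2d(out_ch)
        # Leaky-ReLU matches the activation the abstract reports as best
        self.act = nn.LeakyReLU(0.1, inplace=True)

    def forward(self, x):
        return self.act(self.norm(self.pointwise(self.depthwise(x))))

class SepDenseBlock(nn.Module):
    """Dense block: every layer receives the concatenation of all earlier outputs."""
    def __init__(self, in_ch, growth=16, n_layers=4):  # growth/n_layers are illustrative
        super().__init__()
        self.layers = nn.ModuleList(
            SepConv(in_ch + i * growth, growth) for i in range(n_layers)
        )

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return torch.cat(feats, dim=1)  # channels: in_ch + n_layers * growth

      Stacking such blocks in the encoder, decoder, and skip connections yields the parameter saving the abstract reports for separable versus regular convolution: a 3x3 separable layer costs roughly in_ch*9 + in_ch*out_ch weights instead of in_ch*out_ch*9.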
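      Likewise, a hedged sketch of the confidence-weighted reconstruction step: the left view is rebuilt by warping the right image with the predicted disparity, and the per-pixel L1 error is gated by the predicted confidence so that gradients from vanished (occluded) pixels are suppressed. The exact loss form and the log-confidence regularizer, which keeps the confidence from collapsing to zero, are plausible reconstructions under stated assumptions, not the paper's published equations.

import torch
import torch.nn.functional as F

def warp_right_to_left(right, disp):
    """Reconstruct the left view by sampling the right image at x - d(x)."""
    b, _, h, w = right.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=right.device, dtype=right.dtype),
        torch.arange(w, device=right.device, dtype=right.dtype),
        indexing="ij",
    )
    xs = xs.expand(b, h, w) - disp.squeeze(1)      # shift by left-view disparity
    gx = 2.0 * xs / (w - 1) - 1.0                  # normalize to [-1, 1]
    gy = 2.0 * ys.expand(b, h, w) / (h - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)           # (b, h, w, 2)
    return F.grid_sample(right, grid, padding_mode="border", align_corners=True)

def confident_recon_loss(left, right, disp, conf, lam=0.1):
    """L1 reconstruction error gated by a confidence map conf in (0, 1).

    The multi-scale disparity and confidence maps are first up-sampled to the
    input image size, as the abstract describes; the -log(conf) term keeps the
    network from zeroing out every pixel.
    """
    disp = F.interpolate(disp, size=left.shape[-2:], mode="bilinear", align_corners=True)
    conf = F.interpolate(conf, size=left.shape[-2:], mode="bilinear", align_corners=True)
    left_hat = warp_right_to_left(right, disp)
    l1 = (left - left_hat).abs().mean(dim=1, keepdim=True)
    return (conf * l1).mean() - lam * torch.log(conf + 1e-6).mean()

      In training, a loss of this shape would be summed over scales; down-weighting pixels with no valid correspondence is the "problem gradient" suppression to which the abstract attributes the 55.2% and 33.0% reductions in MAE and RMSE.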

       
