Abstract:
Depth information acquisition is key for mobile robots to operate autonomously in greenhouses. This study proposed an unsupervised model, based on a dense convolutional auto-encoder, that used binocular images for both training and testing. The model enabled a neural network to estimate depth from plant images, and a loss function for depth estimation was defined with convolutional feature comparison and regularization constraints. To address pixels that vanish between views because of perspective differences and occlusion, a disparity confidence prediction was introduced to suppress the problematic gradients caused by the image reconstruction loss. In addition, a dense block was designed based on separable convolution, and a convolutional auto-encoder was built on it as the backbone network of the model. A large number of binocular images of growing tomato plants were collected in a tomato greenhouse on overcast, cloudy, and sunny days. The unsupervised plant image depth estimation network was implemented, with a Python application interface, using Microsoft Cognitive Toolkit (CNTK) v2.7, a deep learning framework. Training and testing experiments, using image feature similarity, depth estimation error, and threshold accuracy as criteria, were carried out on the binocular tomato images with a Tesla K40c graphics device. The results showed that, compared with regular convolution, the auto-encoder based on separable-convolution dense blocks effectively reduced the number of network weight parameters. Compared with other activation functions, including ReLU (Rectified Linear Unit), Param-ReLU, ELU (Exponential Linear Unit), and SELU (Scaled ELU), the network model with Leaky-ReLU as the nonlinear transformation achieved the minimum depth error and the maximum threshold accuracy.
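The parameter reduction from separable convolution can be illustrated with a quick count (a minimal sketch; the 3x3 kernel size and 64-channel widths below are illustrative assumptions, not the paper's actual network configuration):

```python
def conv_params(k, c_in, c_out):
    # Regular 2-D convolution: one k x k kernel per (input, output) channel pair
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    # Depthwise step: one k x k kernel per input channel,
    # followed by a pointwise 1x1 convolution that mixes channels
    return k * k * c_in + c_in * c_out

regular = conv_params(3, 64, 64)              # 3*3*64*64 = 36864 weights
separable = separable_conv_params(3, 64, 64)  # 3*3*64 + 64*64 = 4672 weights
print(f"separable uses {separable / regular:.1%} of the regular weights")
```

For these illustrative sizes the separable layer needs roughly an eighth of the weights, which is the source of the parameter savings the abstract reports.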
The results also showed that the network structure had a significant impact on the accuracy of the predicted disparity. Introducing separable-convolution dense blocks into the skip connections between the encoder and decoder of the auto-encoder improved the accuracy of depth estimation to a certain extent. Meanwhile, by making the depth estimation model predict a disparity confidence used to suppress backpropagation of problematic gradients, the error of depth estimation decreased remarkably: the Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) were reduced by 55.2% and 33.0%, respectively. Accuracy was further improved by processing strategies such as up-sampling the predicted disparity map to the input image scale before image reconstruction and loss calculation, and concatenating each multi-scale disparity map predicted by the network with the corresponding encoder feature map before feeding the combined features to the next prediction module. Depth estimation performance also improved with greater depth and width of the convolutional auto-encoder. The depth estimation error decreased significantly as the depth of spatial points decreased: for points within 9 m, the MAE of the estimated depth was less than 14.1 cm, and within 3 m it was less than 7 cm. Illumination conditions had no significant influence on the accuracy of the depth estimation model, indicating that the method was robust to changes in the luminous environment. The highest test speed of the model was 14.2 FPS (Frames Per Second), which is near real-time. Compared with existing research, the mean relative error, MAE, and Mean Range Error (MRE) of depth estimation in this study were reduced by 46.0%, 26.0%, and 25.5%, respectively. This research can provide a reference for the design of vision systems for greenhouse mobile robots.
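The confidence-weighted reconstruction loss described above can be sketched as follows (a minimal NumPy illustration, not the paper's CNTK implementation; the function name, L1 error choice, and toy values are assumptions for demonstration):

```python
import numpy as np

def confidence_weighted_l1(target, reconstructed, confidence):
    """Per-pixel L1 reconstruction error, down-weighted by the predicted
    disparity confidence so that occluded or out-of-view pixels (where
    confidence is low) contribute little or no gradient."""
    err = np.abs(target - reconstructed)
    return float(np.mean(confidence * err))

# Toy example: one pixel is badly reconstructed (e.g. occluded in the
# other view); zero confidence there removes its contribution.
target = np.array([[1.0, 2.0], [3.0, 4.0]])
recon = np.array([[1.0, 0.0], [3.0, 4.0]])
full_conf = np.ones_like(target)                  # loss = 0.5
masked_conf = np.array([[1.0, 0.0], [1.0, 1.0]])  # loss = 0.0
```

Masking the occluded pixel drives its loss contribution to zero, which is the mechanism by which problematic gradients from the image reconstruction loss are suppressed.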