Abstract: Supervised deep learning has become one of the most important approaches for extracting plant phenotypic features in recent years. However, the cost and quality of manual labeling have become a bottleneck restricting the development of the technology, mainly due to the complexity of plant structure and detail. In this study, a Depth Mask Convolutional Neural Network (DM-CNN) was proposed to realize automatic training and segmentation for maize plants. Firstly, the original depth and color images of maize plants were collected in an indoor scene using a Kinect sensor, and the parallax between the depth and color cameras was reduced by aligning the display ranges of the depth and color images. Secondly, the depth and color images were cropped to the same size to maintain spatial and content consistency. A depth density function and nearest-neighbor pixel filling were used to remove the background of the depth images while retaining the maize plant pixels. In this way, a binary image of the maize plant was obtained, and the depth mask annotation was extracted as the maximum connected area. Finally, the depth mask annotations and color images were paired and used to train the DM-CNN, realizing automatic image labeling and segmentation for maize plants indoors. A field experiment was also designed to verify the trained DM-CNN. It was found that the training loss with depth mask annotations converged faster than that with manual annotations. Furthermore, the performance of the DM-CNN trained on depth mask annotations was slightly better than that of the network trained on manual annotations: the former achieved a mean Intersection over Union (mIoU) of 59.13% and a mean Recall Accuracy (mRA) of 65.78%, while the latter achieved an mIoU of 58.49% and an mRA of 65.78%. In addition, 10% of the depth mask samples in the dataset were replaced with manually annotated images taken in an outdoor scene, in order to verify the generalization ability of the DM-CNN.
After fine-tuning, the model achieved excellent segmentation performance on top-view images of outdoor maize seedlings; in particular, the mean pixel accuracy reached 84.54%. Therefore, the DM-CNN can be expected to automatically generate depth mask annotations from depth images in indoor scenes, thereby realizing supervised network training. More importantly, the model trained with depth mask annotations also outperformed the one trained with manual annotations in mean Intersection over Union while matching it in mean Recall Accuracy. The segmentation was also suitable for the different plant height ranges of the maize seedling stage, indicating an excellent generalization ability of the model. Moreover, the improved model could be transferred to complex outdoor scenes for better segmentation of top-view maize images when only 10% of the depth-mask training samples were replaced with manual annotations. Therefore, it is feasible to realize automatic annotation and training of deep learning models using depth mask annotations instead of manual labels. These findings can provide low-cost solutions and technical support for high-throughput, high-precision acquisition of maize seedling phenotypes.
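The depth-mask annotation step described above (background removal on the depth image, nearest-neighbor filling of invalid pixels, and selection of the maximum connected area) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `near`/`far` depth bounds are hypothetical placeholders standing in for the paper's depth density function, and the function name is invented for this example.

```python
import numpy as np
from scipy import ndimage

def depth_mask_annotation(depth, near=0.5, far=1.5):
    """Sketch of depth-mask generation: fill invalid depth pixels from
    their nearest valid neighbor, threshold to a foreground depth range
    (a stand-in for the depth density function), and keep only the
    largest connected region as the plant mask."""
    # Fill invalid depth readings (encoded as 0) with the nearest valid value.
    invalid = depth == 0
    if invalid.any():
        idx = ndimage.distance_transform_edt(
            invalid, return_distances=False, return_indices=True)
        depth = depth[tuple(idx)]
    # Keep pixels whose depth falls in the assumed foreground (plant) range.
    fg = (depth >= near) & (depth <= far)
    # Retain only the maximum connected area as the binary plant mask.
    labels, n = ndimage.label(fg)
    if n == 0:
        return np.zeros_like(fg)
    sizes = ndimage.sum(fg, labels, range(1, n + 1))
    return labels == (np.argmax(sizes) + 1)
```

The resulting binary mask would then be paired with the aligned color image as a training sample for the segmentation network.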