Abstract:
Image semantic segmentation has been widely used in applications such as plant phenotyping, robotic harvesting, and facility scene analysis. Tomato is one of the most important vegetable crops grown in greenhouses, and its phenotypic information, such as fruit shape and color, must be monitored periodically. However, manual sampling and inspection cannot meet the requirements of high throughput and high precision, because they are time-consuming, labor-intensive, and inefficient. In recent years, computer-vision-based semantic segmentation has been widely applied to separate crop fruits (foreground) from the growing environment (background) in complex scenes. Nevertheless, segmentation accuracy still needs to be improved under the complex conditions of the greenhouse, including uneven illumination, overlapping and occlusion between fruits and leaves, and the similarity in texture and color between immature fruits and leaves. Traditional deep convolutional networks for semantic segmentation have been trained on the RGB modality alone; as deep learning models continue to evolve, RGB-only training has become a bottleneck for further accuracy gains. In this study, an "RGB + Depth" multimodal semantic segmentation model (DFST, Depth-Fusion Semantic Transformer) was proposed, built on a Mix Transformer encoder (MiT). MiT served as the main feature-extraction backbone of the DFST model; it is a Transformer encoder designed specifically for semantic segmentation. Compared with ordinary Vision Transformers (ViTs), MiT offers the following advantages: 1) A hierarchical encoder outputs multi-scale features, which the decoder combines to exploit both high-resolution coarse features and low-resolution fine-grained features for refined segmentation; 2) Computational complexity is reduced by sequence reduction in place of the ordinary self-attention structure; 3) Positional embedding is removed and replaced by a Mix-FFN, in which a 3×3 depthwise convolution inside the feed-forward network conveys positional information. Depth images were acquired under real greenhouse lighting, encoded into the HHA format (horizontal disparity, height above ground, angle of the local surface normal with gravity) before training, and then fused with the RGB images as an auxiliary modality for feature extraction. A lightweight multi-layer perceptron (MLP) decoder was used to decode the feature maps and produce the segmentation.
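To make the MiT design points above concrete, the following minimal PyTorch sketch shows sequence-reduction self-attention and a Mix-FFN with a 3×3 depthwise convolution. It illustrates the published SegFormer-style design rather than the authors' released code; the class names, head count, channel width, and reduction ratio here are assumptions for demonstration only.

```python
import torch
import torch.nn as nn

class EfficientSelfAttention(nn.Module):
    """Self-attention whose keys/values are shortened by a strided conv
    (sequence reduction), cutting complexity from O(N^2) to O(N^2 / r^2)."""
    def __init__(self, dim, num_heads=2, sr_ratio=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, h, w):                      # x: (B, N, C), N = h * w
        b, n, c = x.shape
        kv = x.transpose(1, 2).reshape(b, c, h, w)   # back to a 2-D feature map
        kv = self.sr(kv).flatten(2).transpose(1, 2)  # (B, N / r^2, C)
        kv = self.norm(kv)
        return self.attn(x, kv, kv)[0]

class MixFFN(nn.Module):
    """Feed-forward block whose 3x3 depthwise conv leaks zero-padding into
    the features, conveying position without positional embeddings."""
    def __init__(self, dim, expansion=4):
        super().__init__()
        hidden = dim * expansion
        self.fc1 = nn.Linear(dim, hidden)
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x, h, w):                      # x: (B, N, C)
        x = self.fc1(x)
        b, n, c = x.shape
        x = x.transpose(1, 2).reshape(b, c, h, w)
        x = self.dwconv(x).flatten(2).transpose(1, 2)
        return self.fc2(self.act(x))

# Smoke test on one 32x32 feature map with 64 channels.
x = torch.randn(1, 32 * 32, 64)
x = x + EfficientSelfAttention(64)(x, 32, 32)        # residual attention block
x = x + MixFFN(64)(x, 32, 32)                        # residual Mix-FFN block
print(x.shape)                                       # torch.Size([1, 1024, 64])
```

With a reduction ratio of 4, the 1 024 query tokens attend to only 64 key/value tokens, which is the source of the complexity saving noted above.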
The experimental results showed that: 1) Introducing depth as an auxiliary modality alongside RGB improved the segmentation accuracy of crops in greenhouse environments, raising mIoU by 1.37 percentage points over the RGB-only baseline. 2) Encoding the depth images into three-channel HHA images improved training accuracy by 1.21 percentage points compared with non-encoded depth images, provided that the acquisition equipment, environment, and lighting conditions yielded high-quality depth images. 3) Using a Transformer (MiT) instead of a traditional convolutional neural network as the feature-extraction backbone, motivated by the weak global modeling and tendency to overfit of convolutional networks, improved mIoU by 2.43 percentage points over ShapeConv. In summary, the DFST model is well suited to the semantic segmentation task of tomato images in greenhouse environments, achieving rapid and accurate segmentation under complex conditions such as varying illumination. These findings can provide theoretical support for crop detection and intelligent harvesting robots in greenhouses.
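For concreteness, the HHA depth encoding applied before training can be sketched as follows. This is a simplified illustration of the standard HHA idea (Gupta et al., 2014), not the authors' pipeline: it assumes known camera intrinsics (fx, fy, cx, cy), a known camera height, and a camera y-axis aligned with gravity, whereas full HHA estimates the gravity direction iteratively.

```python
import numpy as np

def hha_encode(depth, fx, fy, cx, cy, cam_height=1.0):
    """Encode a metric depth map (meters) into a 3-channel HHA image:
    horizontal disparity, height above ground, and the angle between the
    local surface normal and the (assumed) gravity direction."""
    h, w = depth.shape
    z = np.maximum(depth, 1e-6)                  # guard against missing depth
    v = np.arange(h, dtype=np.float32)[:, None]  # pixel rows, broadcast over columns
    y = (v - cy) * z / fy                        # camera y-axis points downward

    disparity = 1.0 / z                          # channel 1: horizontal disparity
    height = cam_height - y                      # channel 2: height above ground

    # Channel 3: approximate surface normals from metric depth gradients,
    # then take the angle against the "up" direction (0, -1, 0).
    dzdx = np.gradient(z, axis=1) * fx / z
    dzdy = np.gradient(z, axis=0) * fy / z
    normal = np.dstack([-dzdx, -dzdy, np.ones_like(z)])
    normal /= np.linalg.norm(normal, axis=2, keepdims=True)
    angle = np.degrees(np.arccos(np.clip(-normal[..., 1], -1.0, 1.0)))

    def to_u8(c):                                # rescale each channel to 0..255
        c = (c - c.min()) / max(c.max() - c.min(), 1e-6)
        return (255.0 * c).astype(np.uint8)

    return np.dstack([to_u8(disparity), to_u8(height), to_u8(angle)])

# Example: a synthetic 480x640 depth map between 0.5 m and 3 m.
depth = np.random.uniform(0.5, 3.0, size=(480, 640)).astype(np.float32)
hha = hha_encode(depth, fx=525.0, fy=525.0, cx=319.5, cy=239.5)
print(hha.shape, hha.dtype)                      # (480, 640, 3) uint8
```

The resulting three-channel image has the same shape as an RGB frame, which is what allows the HHA-encoded depth to be fed through a standard image backbone as an auxiliary modality.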