Abstract:
Image semantic segmentation has been widely used in applications such as plant phenotyping, robotic harvesting, and facility scene analysis. Tomato is one of the most important vegetable crops grown in greenhouses, and its phenotypic information, such as fruit shape and color, must be monitored periodically. However, manual sampling and inspection cannot meet the requirements of high throughput and high precision, because they are time-consuming, labor-intensive, and inefficient. In recent years, computer-vision-based semantic segmentation has been widely applied to separate crop fruits (foreground) from the growing environment (background) in complex scenes. Nevertheless, segmentation accuracy still needs to be improved under the complex conditions of the greenhouse, including uneven illumination, overlapping and occlusion between fruits and leaves, and the similarity in texture and color between immature fruits and leaves. Traditional deep convolutional networks for semantic segmentation have been trained on the RGB modality alone; as deep learning models continue to evolve, RGB-only training has become a bottleneck for further accuracy gains. In this study, an "RGB + Depth" multimodal semantic segmentation model (DFST, Depth-Fusion Semantic Transformer) was proposed, built on a Mix Transformer encoder (MiT). MiT served as the main feature-extraction backbone of the DFST model; it is a Transformer encoder designed specifically for semantic segmentation. Compared with ordinary Vision Transformers (ViTs), MiT offers the following advantages: 1) A hierarchical encoder outputs multi-scale features, which the decoder combines to exploit both high-resolution coarse features and low-resolution fine-grained features for refined segmentation; 2) Computational complexity is reduced by sequence reduction in place of the ordinary self-attention structure; 3) Positional embedding is removed and replaced by a Mix-FFN, in which a 3×3 depthwise convolution inside the feed-forward network conveys positional information. Depth images were acquired under real greenhouse lighting, encoded into the HHA format (horizontal disparity, height above ground, angle of the local surface normal with gravity) before training, and then fused with the RGB images as an auxiliary modality for feature extraction. A lightweight multi-layer perceptron (MLP) decoder was used to decode the feature maps and produce the segmentation.
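To make the MiT design points above concrete, the following minimal PyTorch sketch shows sequence-reduction self-attention and a Mix-FFN with a 3×3 depthwise convolution. It illustrates the published SegFormer-style design rather than the authors' released code; the class names, head count, channel width, and reduction ratio here are assumptions for demonstration only.

```python
import torch
import torch.nn as nn

class EfficientSelfAttention(nn.Module):
    """Self-attention whose keys/values are shortened by a strided conv
    (sequence reduction), cutting complexity from O(N^2) to O(N^2 / r^2)."""
    def __init__(self, dim, num_heads=2, sr_ratio=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, h, w):                      # x: (B, N, C), N = h * w
        b, n, c = x.shape
        kv = x.transpose(1, 2).reshape(b, c, h, w)   # back to a 2-D feature map
        kv = self.sr(kv).flatten(2).transpose(1, 2)  # (B, N / r^2, C)
        kv = self.norm(kv)
        return self.attn(x, kv, kv)[0]

class MixFFN(nn.Module):
    """Feed-forward block whose 3x3 depthwise conv leaks zero-padding into
    the features, conveying position without positional embeddings."""
    def __init__(self, dim, expansion=4):
        super().__init__()
        hidden = dim * expansion
        self.fc1 = nn.Linear(dim, hidden)
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x, h, w):                      # x: (B, N, C)
        x = self.fc1(x)
        b, n, c = x.shape
        x = x.transpose(1, 2).reshape(b, c, h, w)
        x = self.dwconv(x).flatten(2).transpose(1, 2)
        return self.fc2(self.act(x))

# Smoke test on one 32x32 feature map with 64 channels.
x = torch.randn(1, 32 * 32, 64)
x = x + EfficientSelfAttention(64)(x, 32, 32)        # residual attention block
x = x + MixFFN(64)(x, 32, 32)                        # residual Mix-FFN block
print(x.shape)                                       # torch.Size([1, 1024, 64])
```

With a reduction ratio of 4, the 1 024 query tokens attend to only 64 key/value tokens, which is the source of the complexity saving noted above.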
The experimental results showed that: 1) Introducing depth as an auxiliary modality alongside RGB improved the segmentation accuracy of crops in greenhouse environments, raising mIoU by 1.37 percentage points over the RGB-only baseline. 2) Encoding the depth images into three-channel HHA images improved training accuracy by 1.21 percentage points compared with non-encoded depth images, provided that the acquisition equipment, environment, and lighting conditions yielded high-quality depth images. 3) Using a Transformer (MiT) instead of a traditional convolutional neural network as the feature-extraction backbone, motivated by the weak global modeling and tendency to overfit of convolutional networks, improved mIoU by 2.43 percentage points over ShapeConv. In summary, the DFST model is well suited to the semantic segmentation task of tomato images in greenhouse environments, achieving rapid and accurate segmentation under complex conditions such as varying illumination. These findings can provide theoretical support for crop detection and intelligent harvesting robots in greenhouses.
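For concreteness, the HHA depth encoding applied before training can be sketched as follows. This is a simplified illustration of the standard HHA idea (Gupta et al., 2014), not the authors' pipeline: it assumes known camera intrinsics (fx, fy, cx, cy), a known camera height, and a camera y-axis aligned with gravity, whereas full HHA estimates the gravity direction iteratively.

```python
import numpy as np

def hha_encode(depth, fx, fy, cx, cy, cam_height=1.0):
    """Encode a metric depth map (meters) into a 3-channel HHA image:
    horizontal disparity, height above ground, and the angle between the
    local surface normal and the (assumed) gravity direction."""
    h, w = depth.shape
    z = np.maximum(depth, 1e-6)                  # guard against missing depth
    v = np.arange(h, dtype=np.float32)[:, None]  # pixel rows, broadcast over columns
    y = (v - cy) * z / fy                        # camera y-axis points downward

    disparity = 1.0 / z                          # channel 1: horizontal disparity
    height = cam_height - y                      # channel 2: height above ground

    # Channel 3: approximate surface normals from metric depth gradients,
    # then take the angle against the "up" direction (0, -1, 0).
    dzdx = np.gradient(z, axis=1) * fx / z
    dzdy = np.gradient(z, axis=0) * fy / z
    normal = np.dstack([-dzdx, -dzdy, np.ones_like(z)])
    normal /= np.linalg.norm(normal, axis=2, keepdims=True)
    angle = np.degrees(np.arccos(np.clip(-normal[..., 1], -1.0, 1.0)))

    def to_u8(c):                                # rescale each channel to 0..255
        c = (c - c.min()) / max(c.max() - c.min(), 1e-6)
        return (255.0 * c).astype(np.uint8)

    return np.dstack([to_u8(disparity), to_u8(height), to_u8(angle)])

# Example: a synthetic 480x640 depth map between 0.5 m and 3 m.
depth = np.random.uniform(0.5, 3.0, size=(480, 640)).astype(np.float32)
hha = hha_encode(depth, fx=525.0, fy=525.0, cx=319.5, cy=239.5)
print(hha.shape, hha.dtype)                      # (480, 640, 3) uint8
```

The resulting three-channel image has the same shape as an RGB frame, which is what allows the HHA-encoded depth to be fed through a standard image backbone as an auxiliary modality.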