Recognition of unstructured field road scenes based on a semantic segmentation model

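    • Abstract: Environmental perception is one of the key technologies for autonomous navigation of intelligent agricultural equipment. Field roads are complex and changeable; identifying passable areas quickly and accurately and distinguishing obstacle categories provides the basis for efficient and safe path planning and decision control. This study takes unstructured agricultural field road scenes as its subject, divides environmental objects into categories according to their dynamic and static attributes, and proposes a lightweight semantic segmentation model that combines channel attention with multi-scale feature fusion. First, the lightweight convolutional neural network MobileNetV2 extracts image features, and hybrid dilated convolution is integrated into the last two stages of the feature extraction network, which enlarges the receptive field while preserving feature-map resolution and keeps information continuous and complete. Next, a channel attention module recalibrates the feature channels of each stage of the feature extraction network according to their importance. Finally, a spatial pyramid pooling module fuses multi-scale pooled features to obtain more effective global scene context and improve recognition accuracy in complex road scenes. Semantic segmentation experiments show that the model effectively recognizes and parses scene objects under different road conditions, with pixel accuracy and mean pixel accuracy of 94.85% and 90.38%, respectively, demonstrating high accuracy and strong robustness. In comparative experiments on the same test set against FCN-8s, SegNet, DeepLabV3+, and BiSeNet, the model achieved a mean intersection over union of 85.51%, a detection speed of 8.19 frames/s, and 2.41×10⁶ parameters. Compared with the other models, it offers high accuracy, fast inference, and a small parameter count, and achieves a good balance between accuracy and speed. The results can serve as a technical reference for the safe and reliable operation of intelligent agricultural equipment in unstructured road environments.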

       

      Abstract: Environmental perception is one of the most important technologies for automatic navigation tasks in agriculture, such as fertilization, crop disease detection, automatic harvesting, and cultivation. Field roads present a complex environment characterized by fuzzy road edges, uneven surfaces, and irregular shapes. Agricultural machinery must therefore identify passable areas and obstacles accurately and rapidly to support path planning and decision control. In this study, a lightweight semantic segmentation model was proposed to recognize unstructured field roads using a channel attention mechanism combined with multi-scale feature fusion. Environmental objects were classified into 12 categories according to their static and dynamic properties: building, person, vehicles, sky, waters, plants, road, soil, pole, sign, coverings, and background. The mobile architecture MobileNetV2 was adopted to extract image features, reducing the number of model parameters and increasing inference speed. Specifically, an inverted residual structure with lightweight depth-wise convolutions was used to filter the features in the intermediate expansion layer. In addition, Hybrid Dilated Convolution (HDC) was incorporated into the last two stages of the backbone network to enlarge the receptive field while maintaining the resolution of the feature map. Dilation rates of 1, 2, and 3 were used to expand the receptive field effectively while alleviating the "gridding problem" caused by standard dilated convolution. A Channel Attention Block (CAB) was also introduced to reweight the features of each stage in order to enhance class consistency. The block strengthened both the higher- and lower-level features of each stage for better prediction.
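To illustrate why mixed dilation rates alleviate the gridding problem, the following minimal sketch counts which input positions contribute to one output position along a single axis. It assumes three stacked 3-tap, stride-1 dilated convolutions, which the abstract does not specify; the fixed-rate stack (2, 2, 2) is a hypothetical baseline added only for comparison, and the function names are illustrative.

```python
from itertools import product


def covered_offsets(rates):
    """Return the 1-D input offsets reached by stacked 3-tap dilated convolutions."""
    # A stride-1 layer with dilation r samples offsets -r, 0 and +r around its centre.
    taps = [(-r, 0, r) for r in rates]
    return sorted({sum(combo) for combo in product(*taps)})


def receptive_field(rates, kernel=3):
    """Return the receptive field width of stacked stride-1 dilated convolutions."""
    return 1 + sum((kernel - 1) * r for r in rates)


hdc = covered_offsets([1, 2, 3])    # dilation rates stated in the abstract
plain = covered_offsets([2, 2, 2])  # hypothetical fixed-rate baseline

# Both stacks span 13 positions, but only the mixed rates touch every one of them.
print(receptive_field([1, 2, 3]), len(hdc))    # 13 13
print(receptive_field([2, 2, 2]), len(plain))  # 13 7
```

Under these assumptions both stacks have the same receptive field, yet the fixed-rate stack samples only the even offsets, leaving the holes that produce the gridding pattern.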
Some semantic segmentation errors are partially or completely attributable to contextual relationships. A pyramid pooling module was therefore adopted to fuse feature maps at three scales as a global contextual prior. The first, image-level pyramid level captured global context information, with its feature vector produced by global average pooling. The remaining pyramid levels divided the feature maps into sub-regions and generated pooled representations for different locations. The outputs of the pyramid levels therefore had different sizes; they were upsampled and concatenated to form the final output. The results showed that objects in complex road scenes were effectively segmented, with a Pixel Accuracy (PA) of 94.85% and a Mean Pixel Accuracy (MPA) of 90.38%. Furthermore, the single-category pixel accuracy exceeded 90% for road, plants, building, waters, sky, and soil, indicating high accuracy, strong robustness, and good generalization. The efficiency of the model was also evaluated using the mean intersection over union (MIoU), segmentation speed, and parameter count as indexes. The FCN-8s, SegNet, DeepLabV3+, and BiSeNet networks were trained and tested on the same datasets for comparison. The MIoU of the model was 85.51%, higher than that of the other networks. The model had 2.41×10⁶ parameters, fewer than FCN-8s, SegNet, DeepLabV3+, and BiSeNet. For images with a resolution of 512×512 pixels, the inference speed of the model reached 8.19 frames per second, indicating a good balance between speed and accuracy. Consequently, the lightweight semantic segmentation model segmented multiple road scenes in the field environment accurately and rapidly.
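The pyramid pooling step described above can be sketched on a single-channel feature map in plain Python. This is only an illustration under stated assumptions: the bin sizes (1, 2, 4) are hypothetical since the abstract says only that three scales are fused, the map is assumed square, nearest-neighbour upsampling stands in for whatever interpolation the authors used, and the function names are not from the paper.

```python
def adaptive_avg_pool(grid, bins):
    """Average-pool a 2-D list into a bins x bins grid of sub-region means."""
    h, w = len(grid), len(grid[0])
    pooled = []
    for i in range(bins):
        r0 = (i * h) // bins                   # first row of the sub-region
        r1 = ((i + 1) * h + bins - 1) // bins  # one past its last row
        row = []
        for j in range(bins):
            c0 = (j * w) // bins
            c1 = ((j + 1) * w + bins - 1) // bins
            cells = [grid[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled


def upsample_nearest(grid, size):
    """Nearest-neighbour upsampling of a square pooled map to size x size."""
    n = len(grid)
    return [[grid[r * n // size][c * n // size] for c in range(size)]
            for r in range(size)]


def pyramid_pool(feature, bin_sizes=(1, 2, 4)):
    """Return the input map plus one upsampled map per pyramid level."""
    size = len(feature)  # assumes a square feature map
    levels = [feature]
    for bins in bin_sizes:  # hypothetical bin sizes
        levels.append(upsample_nearest(adaptive_avg_pool(feature, bins), size))
    return levels  # "concatenation" is represented as a list of channels


feature = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
levels = pyramid_pool(feature)
print(len(levels))      # 4
print(levels[1][0][0])  # 7.5
```

The first pooled level (bin size 1) is the global average broadcast to every location, which is the image-level context the text refers to; finer levels keep progressively more spatial detail.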
The findings can provide a technical reference for the safe and reliable operation of intelligent agricultural machinery on unstructured roads.
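For readers unfamiliar with the three reported accuracy indexes, the sketch below computes PA, MPA, and MIoU from a confusion matrix using their commonly used definitions. The abstract does not give the authors' exact formulas, so these definitions are an assumption, and the two-class matrix is made-up toy data rather than results from the study.

```python
def segmentation_metrics(conf):
    """Compute (PA, MPA, MIoU) from a confusion matrix.

    conf[i][j] is the number of pixels of true class i predicted as class j.
    """
    k = len(conf)
    total = sum(sum(row) for row in conf)
    correct = sum(conf[i][i] for i in range(k))
    pa = correct / total  # pixel accuracy over all pixels

    class_acc, ious = [], []
    for i in range(k):
        truth = sum(conf[i])                      # pixels labelled class i
        pred = sum(conf[r][i] for r in range(k))  # pixels predicted as class i
        union = truth + pred - conf[i][i]
        if truth:  # skip classes absent from the ground truth
            class_acc.append(conf[i][i] / truth)
        if union:  # skip classes absent from both truth and prediction
            ious.append(conf[i][i] / union)

    mpa = sum(class_acc) / len(class_acc)
    miou = sum(ious) / len(ious)
    return pa, mpa, miou


# Toy two-class confusion matrix (illustrative values only).
conf = [[8, 2],
        [1, 9]]
pa, mpa, miou = segmentation_metrics(conf)
print(round(pa, 4), round(mpa, 4), round(miou, 4))  # 0.85 0.85 0.7386
```

MIoU penalizes both false positives and false negatives for each class, which is why it is typically lower than PA and MPA on the same predictions, consistent with the ordering of the values reported in the abstract.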

       
