Semantic segmentation of terrace image regions based on lightweight CNN-Transformer hybrid networks

刘茜; 易诗; 李立; 程兴豪; 王铖

doi:10.11975/j.issn.1002-6819.202304025

Abstract

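Abstract: Terracing is a traditional form of agricultural cultivation that stabilizes crop production and conserves soil and water. Rapid and accurate collection of information on the distribution of terraced areas is important for raising grain yields, controlling soil erosion, and planning regional ecology. In unmanned aerial vehicle (UAV) images, terrace roads have blurred boundaries and long, strip-like structures. To capture terrace edge information more accurately, and inspired by MobileViT, this study introduced an axial attention mechanism into the MobileViT block and, adopting an encoder-decoder structure, proposed a network model based on a lightweight CNN-Transformer hybrid architecture. The encoder consists of the improved MobileViT block, inverted residual modules incorporating strip pooling, and an atrous spatial pyramid pooling module; the modules are ordered so that local and global visual representations interact, yielding a complete global feature representation. The decoder applies sampling and convolution operations to the multi-scale feature maps extracted by the encoder to produce the semantic segmentation map. PSPNet, LiteSeg, BiSeNetV2, Deeplabv3Plus, and MobileViT were selected for comparison experiments on the same test set. The results show that the proposed model has certain advantages in both accuracy and speed: its pixel accuracy reaches 95.79%, its frequency-weighted intersection over union reaches 94.86%, and it has 8.32 M parameters. The model thus segments complex, irregular terrace regions in UAV images fairly accurately with few parameters and a simple method. Deployed on a UAV, it can further obtain the shape, location, and outline of terraces, providing an important basis for preventing damage to terraces and for repairing and reinforcing them. It also supports statistics on the planted area and extent of terraced regions, and offers a reference for agricultural development in terraced and dry-farming areas.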
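
The strip pooling mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single-channel feature map stored as a list of lists, and it omits the learned 1D convolutions and the 1×1 convolution that strip pooling modules normally apply around the fusion step. The function name `strip_pool_gate` and the toy data are hypothetical. The sketch only shows the core idea: averaging along full rows and full columns gives each pixel access to context along a long, narrow band, which suits strip-like structures such as terrace edges.

```python
import math


def strip_pool_gate(feature):
    """Simplified strip pooling on a single-channel H x W map (list of lists)."""
    h, w = len(feature), len(feature[0])
    # Horizontal strips: average each row over its full width -> H values.
    row_mean = [sum(row) / w for row in feature]
    # Vertical strips: average each column over its full height -> W values.
    col_mean = [sum(feature[r][c] for r in range(h)) / h for c in range(w)]
    out = []
    for r in range(h):
        out_row = []
        for c in range(w):
            # Broadcast both strips back to H x W, fuse them by addition,
            # and turn the result into a gate in (0, 1) with a sigmoid.
            gate = 1.0 / (1.0 + math.exp(-(row_mean[r] + col_mean[c])))
            out_row.append(feature[r][c] * gate)
        out.append(out_row)
    return out


# Toy map with one bright horizontal band, loosely resembling a terrace edge.
feature = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 0.0],
]
gated = strip_pool_gate(feature)
```

Every pixel in the bright row receives the same gate value, because each of them shares the same row average and the same column average. This is the sense in which a strip captures a long-range dependency along one axis.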

Abstract: Terracing has long been used in traditional cultivation to stabilize crop production and to conserve soil and water, and terrace construction is one of the most important measures for developing agricultural production. However, some terraces are at risk of being destroyed, owing to construction quality and to problems arising during management and maintenance. It is therefore important to detect the distribution of terraced areas quickly and accurately in order to raise food production, control soil erosion, and plan regional ecology. Unmanned aerial vehicle (UAV) camera systems are now widely used in intelligent agriculture to obtain high-resolution remote sensing images, and semantic segmentation based on deep learning has advanced several fields alongside the rapid development of information technology. Inspired by MobileViT, this study introduced an axial attention mechanism into the MobileViT block and proposed an encoder-decoder network model based on a lightweight CNN-Transformer hybrid architecture. The encoder consists of the improved MobileViT block, inverted residual modules that incorporate strip pooling, and an atrous spatial pyramid pooling module. The modules were ordered so that local and global visual representations interact, yielding a complete global feature representation. Strip pooling was introduced to capture long-range dependencies effectively, so that high-level semantic feature maps could be extracted efficiently from a large amount of semantic information.
The atrous spatial pyramid pooling module was introduced to capture contextual information at multiple scales, enlarging the receptive field of the model and producing denser semantic feature maps. PSPNet, LiteSeg, BiSeNetV2, Deeplabv3Plus, and MobileViT were selected for comparison experiments on the same test set. The results show that the improved model performed best in terms of accuracy and speed, and that it recognized and delineated complex, irregular terrace regions in UAV images more accurately. Specifically, the lightweight CNN-Transformer hybrid model achieved a pixel accuracy of 95.79%, a mean pixel accuracy of 87.82%, a mean intersection over union of 80.91%, and a frequency-weighted intersection over union of 94.86%. Furthermore, the improved model has only 8.32 M parameters, a small size, and low computational complexity, and it runs at 51.91 frames per second, making it both lightweight and real-time. A comprehensive analysis of the performance indexes of each segmentation model found that the lightweight CNN-Transformer hybrid model offered higher segmentation accuracy and speed together with a small model size and low computational complexity. The improved model can therefore be expected to be deployable on UAVs, meeting the requirements of lightweight design, high accuracy, and low latency for mobile vision tasks. Semantic segmentation of the terrace area can be used to obtain the shape, location, and outline of terraces. Timely and accurate detection of terrace edge information supports the prevention of damage to terraces and their reinforcement. At the same time, statistics on the cultivated area and extent of terraced regions can be expected to promote agricultural development in terraced and dry-farming areas.
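
The four accuracy figures reported above are usually all derived from one confusion matrix. The sketch below uses the standard definitions of pixel accuracy (PA), mean pixel accuracy (MPA), mean intersection over union (MIoU), and frequency-weighted intersection over union (FWIoU); it is an assumption that the paper uses exactly these definitions. The two-class labelling (0 = background, 1 = terrace), the function names, and the toy data are hypothetical.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Count pixels: rows are true classes, columns are predicted classes."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def segmentation_metrics(cm):
    """Return PA, MPA, MIoU and FWIoU computed from a confusion matrix."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    tp = [cm[i][i] for i in range(n)]
    true_count = [sum(cm[i]) for i in range(n)]  # pixels per true class
    pred_count = [sum(cm[r][i] for r in range(n)) for i in range(n)]  # per predicted class
    # Classes absent from the ground truth are skipped to avoid division by zero.
    present = [i for i in range(n) if true_count[i] > 0]

    iou = {i: tp[i] / (true_count[i] + pred_count[i] - tp[i]) for i in present}
    pa = sum(tp) / total
    mpa = sum(tp[i] / true_count[i] for i in present) / len(present)
    miou = sum(iou.values()) / len(present)
    # FWIoU weights each class IoU by that class's share of all pixels.
    fwiou = sum(true_count[i] / total * iou[i] for i in present)
    return {"PA": pa, "MPA": mpa, "MIoU": miou, "FWIoU": fwiou}


# Toy example with flattened pixel labels: 0 = background, 1 = terrace.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]
metrics = segmentation_metrics(confusion_matrix(y_true, y_pred, 2))
```

Because FWIoU weights each class by its pixel frequency, it can sit well above MIoU when one well-segmented class dominates the image, which is consistent with the gap between the 94.86% and 80.91% figures reported here.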
