Semantic segmentation of terrace image regions based on lightweight CNN-Transformer hybrid networks
-
Abstract
Terracing is widely used in conventional cultivation to stabilize crop production and to conserve soil and water. Terrace construction is one of the most important measures for developing agricultural production. However, some terraces are at risk of being damaged, depending on their construction quality and on how they are managed and maintained. Rapid and accurate mapping of terraced areas is therefore in high demand for food production, soil erosion control, and regional ecological planning. Unmanned aerial vehicle (UAV) camera systems are widely used in intelligent agriculture to acquire high-resolution remote sensing images. With the rapid development of information technology, deep-learning-based semantic segmentation has advanced a number of fields. In this study, a lightweight network model with an encoder-decoder structure was proposed, based on a CNN-Transformer hybrid architecture. Inspired by MobileViT, an axial attention mechanism was introduced into the MobileViT block. The encoder was built on this improved MobileViT block, combined with an inverted residual module, a strip pooling module, and an atrous spatial pyramid pooling (ASPP) module. The placement order of these modules was designed so that local and global visual representations interact, yielding a complete global feature representation. Strip pooling was introduced to capture long-range dependencies, so that high-level semantic feature maps could be extracted efficiently from a large amount of semantic information.
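The way strip pooling captures long-range dependencies can be illustrated with a minimal NumPy sketch. It is not the model described here: it assumes a feature map of shape (C, H, W), omits the learned 1D and 1x1 convolutions of the published strip pooling module, and the function names `strip_context` and `strip_pooling_gate` are hypothetical.

```python
import numpy as np

def strip_context(x):
    """Sum of row-strip and column-strip averages, broadcast to x's shape.

    x is assumed to be a feature map of shape (C, H, W).
    """
    x = np.asarray(x, dtype=float)
    row_strip = x.mean(axis=2, keepdims=True)  # (C, H, 1): each row averaged over its full width
    col_strip = x.mean(axis=1, keepdims=True)  # (C, 1, W): each column averaged over its full height
    return row_strip + col_strip               # broadcasts to (C, H, W)

def strip_pooling_gate(x):
    """Re-weight x with a sigmoid gate built from the strip context.

    Simplification (assumption): no learned convolutions are applied.
    """
    x = np.asarray(x, dtype=float)
    gate = 1.0 / (1.0 + np.exp(-strip_context(x)))
    return x * gate
```

Because each strip spans a full row or column, a single activation influences the gate at every position that shares its row or column, however far apart they are. This suits long, narrow structures such as terrace ridges.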
The atrous spatial pyramid pooling module was introduced to capture contextual information at multiple scales; it enlarged the receptive field of the model and produced denser semantic feature maps. PSPNet, LiteSeg, BiSeNetV2, DeepLabv3+, and MobileViT were selected for comparison experiments on the same test set. The results show that the improved model performed best in terms of both accuracy and speed. In particular, it recognized and delineated complex, irregular terraces in UAV images more accurately. Specifically, the proposed lightweight CNN-Transformer hybrid network achieved a pixel accuracy (PA) of 95.79%, a mean pixel accuracy (mPA) of 87.82%, a mean intersection over union (mIoU) of 80.91%, and a frequency-weighted intersection over union (FWIoU) of 94.86%. Furthermore, the improved model had only 8.32 M parameters and low computational complexity, and it ran at 51.91 frames per second, indicating that it is both lightweight and capable of real-time operation. A comprehensive comparison of the performance indexes of all segmentation models confirmed that the proposed model segmented more accurately and faster, while remaining small and computationally inexpensive. The improved model is therefore expected to be deployable on UAVs and to meet the requirements of mobile vision tasks for a lightweight design, high accuracy, and low latency. Semantic segmentation of terraced areas provides the shape, location, and outline of terraces. Timely and accurate detection of terrace edges supports damage prevention and terrace reinforcement. In addition, statistics on the cultivated area and extent of terraces can be expected to promote terrace development and agricultural construction in dry-farming areas.
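For readers unfamiliar with the four accuracy measures reported above, the following sketch shows how they are conventionally derived from a confusion matrix. It uses the standard definitions rather than the authors' evaluation code, and the function names and the two-class toy labels are illustrative assumptions.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are ground-truth classes, columns are predicted classes."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    idx = num_classes * y_true + y_pred
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def segmentation_metrics(cm):
    """Return PA, mPA, mIoU and FWIoU from a confusion matrix."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)            # correctly labelled pixels per class
    gt = cm.sum(axis=1)         # ground-truth pixels per class
    pred = cm.sum(axis=0)       # predicted pixels per class
    total = cm.sum()
    with np.errstate(divide="ignore", invalid="ignore"):
        class_acc = tp / gt             # NaN for classes absent from the ground truth
        iou = tp / (gt + pred - tp)     # NaN for classes absent from both
    return {
        "PA": tp.sum() / total,
        "mPA": np.nanmean(class_acc),
        "mIoU": np.nanmean(iou),
        "FWIoU": np.nansum((gt / total) * iou),
    }

# Toy example (assumed labels): 0 = background, 1 = terrace
cm = confusion_matrix([0, 0, 0, 1], [0, 0, 1, 1], num_classes=2)
metrics = segmentation_metrics(cm)
```

FWIoU weights each class IoU by its share of ground-truth pixels, so it can exceed mIoU when a frequent class is segmented well. The reported 94.86% against an mIoU of 80.91% fits that pattern.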
-
-