Wang Can, Wu Xinhui, Zhang Yanqing, Wang Wenjun. Recognizing weeds in maize fields using shifted window Transformer network[J]. Transactions of the Chinese Society of Agricultural Engineering (Transactions of the CSAE), 2022, 38(15): 133-142. DOI: 10.11975/j.issn.1002-6819.2022.15.014

    Recognizing weeds in maize fields using shifted window Transformer network

Abstract: Weeds are one of the main factors affecting crop growth at the seedling stage, and timely weeding is a necessary measure to ensure crop yield. Intelligent field weeding equipment is also a promising component of unmanned farm systems at the current stage of intelligent agriculture, so effective recognition of crops and weeds is in high demand for its development. Previous research focused mainly on deep-learning-based object detection and semantic segmentation. Object detection still faces a great challenge when crops and weeds overlap in complex field images, because heavily overlapping anchor boxes cannot further separate the different target regions. Semantic segmentation, in turn, requires pixel-level annotation, for which data samples are laborious to obtain, and its weak real-time performance hinders practical application. In this study, an improved model was proposed using the shifted window Transformer (Swin Transformer) network to enhance the accuracy and real-time performance of crop and weed recognition. The specific procedure was as follows.

1) A semantic segmentation model of corn was established for real, complex field scenes. The backbone of the model was the Swin Transformer architecture, denoted Swin-Base. The shifted window partitioning configuration significantly enhanced the modeling ability of the Swin Transformer: self-attention was computed locally within the non-overlapping windows of the partitioned image patches, while the shift between successive layers allowed cross-window connections (see the first sketch below). The computational complexity of the backbone therefore scales linearly with image size, raising the inference speed of the model. The Swin Transformer also constructed a hierarchical feature representation suitable for dense prediction at the pixel level.

2) The Unified Perceptual Parsing Network (UperNet) was used as an efficient semantic segmentation framework. Its feature extractor was a Feature Pyramid Network (FPN) built on the Swin Transformer backbone, with the multi-level features produced by the Swin Transformer serving as the corresponding pyramid levels. A Pyramid Pooling Module (PPM) added an effective global prior representation (see the second sketch below), and fusing the hierarchical semantic information yielded better segmentation performance. The Swin Transformer backbone and the UperNet framework were combined into one model through an encoder-decoder structure, denoted Swin-Base-UN.

3) The structure of the Swin-Base backbone was varied to improve inference speed: the numbers of hidden-layer channels, attention heads, and Swin Transformer blocks were reduced to cut network parameters and computational cost (representative configurations are listed in the third sketch below). Four variant models were generated in this way: Swin-Large-UN, Swin-Small-UN, Swin-Tiny-UN, and Swin-Nano-UN, whose model size and computational complexity were about 2, 1/2, 1/4, and 1/8 times those of Swin-Base-UN, respectively.

4) Taking the segmentation of the corn morphological region as the starting point, an improved combination of image morphological operations was established to recognize and segment all weed regions in real time; that is, the corn segmentation itself was used to derive the weed segmentation (see the final sketch below).
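The core mechanism of step 1) can be illustrated with a minimal sketch of window partitioning plus a cyclic shift, assuming a PyTorch environment. The attention mask that full Swin blocks apply to shifted windows and the relative position bias are omitted for brevity, and the window size, channel count, and head count are illustrative rather than the paper's exact values.

```python
import torch
import torch.nn as nn

def window_partition(x, ws):
    """Split a (B, H, W, C) feature map into (num_windows*B, ws*ws, C) tokens."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

def window_reverse(win, ws, H, W):
    """Inverse of window_partition."""
    B = win.shape[0] // ((H // ws) * (W // ws))
    x = win.view(B, H // ws, W // ws, ws, ws, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)

class ShiftedWindowAttention(nn.Module):
    def __init__(self, dim=96, heads=3, ws=7, shift=3):
        super().__init__()
        self.ws, self.shift = ws, shift
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                       # x: (B, H, W, C)
        if self.shift:                          # cyclic shift => cross-window links
            x = torch.roll(x, (-self.shift, -self.shift), dims=(1, 2))
        win = window_partition(x, self.ws)      # attention stays local to windows,
        win, _ = self.attn(win, win, win)       # so cost is linear in image size
        x = window_reverse(win, self.ws, x.shape[1], x.shape[2])
        if self.shift:
            x = torch.roll(x, (self.shift, self.shift), dims=(1, 2))
        return x

# e.g. y = ShiftedWindowAttention()(torch.randn(1, 56, 56, 96))  # -> (1, 56, 56, 96)
```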
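For step 2), the global prior in UperNet comes from a PSPNet-style Pyramid Pooling Module attached to the deepest backbone feature. The sketch below is a minimal PPM, again assuming PyTorch; the pooling scales (1, 2, 3, 6) and branch width follow the common PSPNet/UperNet convention rather than values stated in the abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PPM(nn.Module):
    def __init__(self, in_ch, branch_ch=128, scales=(1, 2, 3, 6)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(s),           # regional-to-global context
                          nn.Conv2d(in_ch, branch_ch, 1),
                          nn.ReLU(inplace=True))
            for s in scales)
        self.fuse = nn.Conv2d(in_ch + branch_ch * len(scales), in_ch, 3, padding=1)

    def forward(self, x):                                    # x: (B, C, H, W)
        h, w = x.shape[2:]
        pooled = [F.interpolate(b(x), (h, w), mode='bilinear', align_corners=False)
                  for b in self.branches]
        return self.fuse(torch.cat([x] + pooled, dim=1))     # fused global prior
```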
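For step 3), the variants differ only in backbone width and depth. The configuration sketch below assumes the published Swin-Large/Base/Small/Tiny settings (embedding channels C, per-stage depths, per-stage attention heads); "Nano" is not a standard Swin size, so its row is a hypothetical further reduction for illustration, not the paper's exact definition.

```python
# (C, depths per stage, heads per stage); Nano row is an assumption, see note above.
SWIN_VARIANTS = {
    "Large": (192, (2, 2, 18, 2), (6, 12, 24, 48)),
    "Base":  (128, (2, 2, 18, 2), (4,  8, 16, 32)),
    "Small": ( 96, (2, 2, 18, 2), (3,  6, 12, 24)),
    "Tiny":  ( 96, (2, 2,  6, 2), (3,  6, 12, 24)),
    "Nano":  ( 64, (2, 2,  4, 2), (2,  4,  8, 16)),
}
```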
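For step 4), one plausible reading is that vegetation is first isolated with a color index, the predicted corn mask is subtracted, and morphological opening and closing then clean the remaining weed regions. This sketch assumes OpenCV/NumPy; the excess-green threshold and kernel size are illustrative, and the paper's "improved morphological processing combination" may differ in detail.

```python
import cv2
import numpy as np

def weed_mask(bgr, corn_mask, kernel=5):
    """bgr: field image; corn_mask: uint8 {0,1} mask from the segmentation model."""
    b, g, r = cv2.split(bgr.astype(np.float32))
    exg = 2 * g - r - b                                   # excess-green vegetation index
    veg = (exg > 20).astype(np.uint8)                     # all green vegetation
    weeds = veg & (1 - corn_mask)                         # vegetation that is not corn
    k = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (kernel, kernel))
    weeds = cv2.morphologyEx(weeds, cv2.MORPH_OPEN, k)    # drop speckle noise
    weeds = cv2.morphologyEx(weeds, cv2.MORPH_CLOSE, k)   # fill small holes
    return weeds
```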
Because the weed regions were derived from the corn segmentation, pixel-level weed annotation could be omitted from the training data, saving a large amount of pixel-level labeling compared with conventional semantic segmentation. The performance of all models was compared in training, validation, and testing, and Swin-Tiny-UN was determined to be the best model, achieving the optimal balance between accuracy and speed. Specifically, its mean Intersection over Union (mIoU) and mean Pixel Accuracy (mPA) on the test set were 94.83% and 97.18%, respectively, 3.27 and 4.71 percentage points higher than those of ResNet-101-UN, which uses a traditional Convolutional Neural Network (CNN) backbone. The inference speed of the model reached 18.94 frames/s. The best semantic segmentation model was therefore superior to the traditional one in region segmentation accuracy, pixel recognition accuracy, and inference speed. The image segmentation results showed that the improved model can accurately recognize and segment maize and weeds in complex field scenes. On video stream data recorded during field work, the average correct detection rate of the improved model was 95.04%, and the average detection time per frame was 5.51×10⁻² s (about 18.1 frames/s). Consequently, the improved model can detect corn and weeds during field work with high accuracy and real-time performance under practical application conditions. The findings can provide a strong reference for the development of intelligent weeding equipment.