Abstract:
Semantic segmentation has been one of the most powerful image-processing technologies for complex agricultural environments in the field. In this study, a UNet model with improved spatial-coordinate attention (SPCA) was proposed to improve the performance of the network for multi-crop semantic segmentation. A new SPCA module was designed to combine a spatial attention mechanism with coordinate attention (CA). Specifically, the maximum and average values were first computed over all the feature points of each feature map along the same direction. The resulting maximum and average maps were stacked along the channel dimension and adjusted, and a spatial weight map was then obtained through activation. Secondly, after multiplying the weight map with the feature map of the main branch, location information was extracted along the two spatial directions using a CA module. At the same time, the channel-reduction coefficient r was removed to retain the full channel information without any extra computational cost. Finally, the SPCA module was embedded into the UNet model to verify its semantic segmentation performance against other attention modules. The results show clear differences in segmentation accuracy among the three semantic segmentation models at different input scales. First, accuracy was much lower at the 512×512 scale for all three models, while accuracy at the 1500×1500 scale was about four percentage points higher than at the other two scales for the three networks. The best combination was achieved with the UNet model. Second, among the five attention modules, SPCA achieved the highest accuracy. Its mean intersection over union (MIoU), mean pixel accuracy (MPA), mean precision (MPrecision), and mean recall (MRecall) all exceeded those of the other four modules: squeeze-and-excitation networks (SENet), the convolutional block attention module (CBAM), efficient channel attention (ECA), and CA. The MIoU and MPA reached 92.20% and 95.97%, which were 1.16 and 0.76 percentage points higher, respectively, than the best MIoU and MPA among the four baseline modules. The CA module ranked second after SPCA, and the ECA module performed worst. Furthermore, ECA produced only a very small accuracy gain for crop segmentation, because its 1D convolution was inferior to the fully connected layer. CBAM, which adds spatial attention on top of channel attention, achieved better segmentation accuracy, and CA captured location information much better than spatial attention alone. Third, when the crops were classified using the five modules, segmentation accuracy was highest for flue-cured tobacco, followed by buildings, with corn last. The intersection over union (IoU) and pixel accuracy (PA) of SPCA were both superior to those of the other four modules. SPCA was particularly beneficial in retaining crop boundary information and in reducing fragmented segmentation and fuzzy island artifacts; such errors were largely removed in crop identification and classification, owing to the module's strong robustness. Lastly, the UNet with SPCA performed best on all four accuracy evaluation indicators. Its MIoU was 1.3 percentage points higher than that of the other models. The plain UNet ranked second in precision, and PSPNet obtained the lowest values on all four evaluation indicators.
SPCA-UNet consistently achieved the highest segmentation accuracy across the categories, each of which remained above about 93%. DeepLab v3+ had the lowest accuracy for flue-cured tobacco, while PSPNet had the lowest accuracy for the other three categories. Therefore, the multi-level feature connections between the encoder and the decoder contributed the most important information to the task of extracting multiple crops. This finding can serve as a strong reference for accurate category extraction in multiple crops, given the excellent performance of the UNet with improved spatial-coordinate attention. Since training can be deployed on a mobile terminal, the mobile platform can be expected to reduce training time without a significant loss of accuracy.
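For a concrete picture of the mechanism described above, the following is a minimal PyTorch sketch of an SPCA-style block. It assumes the spatial branch follows CBAM-style channel-wise max/average pooling and the coordinate-attention branch follows the standard CA design with the reduction ratio r removed; all names and details are illustrative, not the authors' implementation.

import torch
import torch.nn as nn

class SPCA(nn.Module):
    """Sketch: spatial attention followed by coordinate attention without channel reduction."""
    def __init__(self, channels: int):
        super().__init__()
        # Spatial branch: 2-channel (max, mean) map -> 1-channel weight map.
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)
        # CA branch: the reduction ratio r is removed, so the mid conv keeps all channels.
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(channels, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.size()
        # 1) Spatial attention: max and mean over the channel axis, stacked, convolved, activated.
        max_map, _ = x.max(dim=1, keepdim=True)
        mean_map = x.mean(dim=1, keepdim=True)
        spatial_w = torch.sigmoid(self.spatial_conv(torch.cat([max_map, mean_map], dim=1)))
        x = x * spatial_w  # multiply the spatial weight map onto the main branch
        # 2) Coordinate attention: pool along the two spatial directions to encode position.
        x_h = x.mean(dim=3, keepdim=True)                       # (n, c, h, 1)
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)   # (n, c, w, 1)
        y = self.act(self.bn1(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                        # (n, c, h, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))    # (n, c, 1, w)
        return x * a_h * a_w  # direction-wise weights broadcast over the feature map

A block like this would be applied to the UNet feature maps, e.g. SPCA(64)(torch.randn(2, 64, 128, 128)), which preserves the input shape so it can be dropped into the encoder-decoder connections without other changes.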