Abstract:
Automatic, rapid, and accurate detection is required to monitor pests over large areas in the field. In this study, the YOLOv5 (you only look once version five) model was used to detect crop pests. The existing YOLOv5 was extended with a global response normalization attention mechanism (YOLOv5-GRNS), enabling accurate detection of targets in images with complex backgrounds and high pest density; the improved model also converged rapidly during training. Firstly, the Global Response Normalization (GRN) operation was introduced into the Convolution Three (C3) encoder module to form a GRN attention mechanism. The modified C3 module exchanged channel information to reduce background interference at the channel level, thereby improving the detection accuracy for dense targets. Secondly, the Shape Intersection over Union (SIoU) loss function was utilized to improve the convergence speed and detection accuracy of the improved model. In addition, eight types of pests that harm major crops in Shaanxi Province were selected from the public IP102 (insect pests 102) dataset; the data were then revised and expanded to obtain a new dataset, named IP8-CW (insect pests eight for corn and wheat). Extensive experiments with the YOLOv5-GRNS model were conducted on both the new IP8-CW dataset and the existing IP102 dataset. On IP8-CW, the model achieved a mean average precision (mAP) of 72.3% at mAP@0.5 and 47.0% at mAP@0.5:0.95, an increase of 1.3% and 1.6%, respectively, compared with the standard YOLOv5. YOLOv5-GRNS also achieved the best performance on the larger IP102 dataset with its 96-class classification task, while having lower complexity and fewer parameters. Ablation experiments were then conducted on the IP8-CW dataset to explore the influence of different factors on YOLOv5-GRNS performance. The results showed that, with the SIoU loss function, the prediction box fitted the ground-truth box along a more regular path, so convergence was reached about 30 epochs earlier than with the other two loss functions. Performance improved significantly with the sandwich structure in which the GRN operation served as both the normalization layer and the channel attention layer, and was also higher than that of structures in which GRN played only one of these roles. The ablation experiments further showed that the standard YOLOv5 models equipped with the other attention mechanisms gained only marginal improvements, whereas the model with the GRN operation achieved the best detection performance with the lowest model complexity and the fewest parameters. Class Activation Maps (CAM), which use heat maps to mark in red the key locations the model focuses on, were used to verify the effectiveness of the improved attention mechanism in the YOLOv5-GRNS model. Results on three datasets showed that YOLOv5-GRNS focused accurately and compactly on the target areas rather than on the complex background or dense neighboring targets. In summary, YOLOv5-GRNS can be expected to serve as a robust solution for pest detection. Moreover, the excellent performance of the YOLOv5-GRNS model in detecting small and dense targets was also verified on different datasets, indicating better generalization and interpretability with fewer parameters and lower computational complexity.
The improved model can also be deployed on embedded and mobile devices.
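To make the GRN attention mechanism concrete, the following is a minimal PyTorch sketch of a GRN layer and a simplified C3-style block that applies it after channel fusion. It assumes the ConvNeXt V2 formulation of Global Response Normalization; the class names (GRN, C3GRN) and the exact placement of the layer inside the C3 module are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class GRN(nn.Module):
    """Global Response Normalization over the channel dimension (NCHW tensors).

    Assumed ConvNeXt V2 formulation: aggregate a per-channel global response
    (L2 norm over H, W), divisively normalize it across channels, and
    re-calibrate the input with learnable gamma/beta plus a residual path.
    """

    def __init__(self, channels: int, eps: float = 1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # G(x): per-channel global response, shape (N, C, 1, 1)
        gx = torch.norm(x, p=2, dim=(2, 3), keepdim=True)
        # N(gx): divisive normalization across channels
        nx = gx / (gx.mean(dim=1, keepdim=True) + self.eps)
        # Channel re-calibration with a residual connection
        return self.gamma * (x * nx) + self.beta + x


class C3GRN(nn.Module):
    """Illustrative, simplified C3-style block with GRN after channel fusion.

    The conv layout loosely mirrors YOLOv5's C3 module (two 1x1 branches, a
    small 3x3 stack, concatenation, 1x1 fusion); where the paper inserts GRN
    is an assumption here.
    """

    def __init__(self, c_in: int, c_out: int, n: int = 1):
        super().__init__()
        c_hidden = c_out // 2
        self.cv1 = nn.Sequential(nn.Conv2d(c_in, c_hidden, 1, bias=False),
                                 nn.BatchNorm2d(c_hidden), nn.SiLU())
        self.cv2 = nn.Sequential(nn.Conv2d(c_in, c_hidden, 1, bias=False),
                                 nn.BatchNorm2d(c_hidden), nn.SiLU())
        self.m = nn.Sequential(*[nn.Sequential(
            nn.Conv2d(c_hidden, c_hidden, 3, padding=1, bias=False),
            nn.BatchNorm2d(c_hidden), nn.SiLU()) for _ in range(n)])
        self.grn = GRN(2 * c_hidden)   # channel-level re-calibration
        self.cv3 = nn.Sequential(nn.Conv2d(2 * c_hidden, c_out, 1, bias=False),
                                 nn.BatchNorm2d(c_out), nn.SiLU())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1)
        return self.cv3(self.grn(y))
```

Because gamma and beta are initialized to zero, the GRN layer starts as an identity mapping and learns the channel re-calibration during training, which is consistent with using it as a lightweight, low-parameter attention layer.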
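A sketch of the SIoU bounding-box loss is given below for reference. The abstract does not specify the exact formulation used, so this follows the commonly used open-source SIoU implementation (angle, distance, and shape costs added to the IoU term); the function name and the 0.5 weighting are illustrative assumptions.

```python
import math
import torch


def siou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """SIoU regression loss for boxes in (x1, y1, x2, y2) format, shape (..., 4)."""
    # Widths, heights, and box centers
    w1, h1 = pred[..., 2] - pred[..., 0], pred[..., 3] - pred[..., 1]
    w2, h2 = target[..., 2] - target[..., 0], target[..., 3] - target[..., 1]
    cx1, cy1 = (pred[..., 0] + pred[..., 2]) / 2, (pred[..., 1] + pred[..., 3]) / 2
    cx2, cy2 = (target[..., 0] + target[..., 2]) / 2, (target[..., 1] + target[..., 3]) / 2

    # Plain IoU term
    inter_w = (torch.min(pred[..., 2], target[..., 2]) - torch.max(pred[..., 0], target[..., 0])).clamp(0)
    inter_h = (torch.min(pred[..., 3], target[..., 3]) - torch.max(pred[..., 1], target[..., 1])).clamp(0)
    inter = inter_w * inter_h
    union = w1 * h1 + w2 * h2 - inter + eps
    iou = inter / union

    # Smallest enclosing box (used to normalize center offsets)
    cw = torch.max(pred[..., 2], target[..., 2]) - torch.min(pred[..., 0], target[..., 0]) + eps
    ch = torch.max(pred[..., 3], target[..., 3]) - torch.min(pred[..., 1], target[..., 1]) + eps

    # Angle cost: favors center offsets aligned with the x- or y-axis
    s_cw, s_ch = cx2 - cx1, cy2 - cy1
    sigma = torch.sqrt(s_cw ** 2 + s_ch ** 2) + eps
    sin_alpha = torch.min(torch.abs(s_cw), torch.abs(s_ch)) / sigma
    angle_cost = torch.cos(2 * torch.arcsin(sin_alpha) - math.pi / 2)

    # Distance cost, attenuated by the angle cost
    rho_x, rho_y = (s_cw / cw) ** 2, (s_ch / ch) ** 2
    gamma = angle_cost - 2
    distance_cost = 2 - torch.exp(gamma * rho_x) - torch.exp(gamma * rho_y)

    # Shape cost: penalizes width/height mismatch between boxes
    omega_w = torch.abs(w1 - w2) / torch.max(w1, w2).clamp(min=eps)
    omega_h = torch.abs(h1 - h2) / torch.max(h1, h2).clamp(min=eps)
    shape_cost = (1 - torch.exp(-omega_w)) ** 4 + (1 - torch.exp(-omega_h)) ** 4

    return 1.0 - iou + 0.5 * (distance_cost + shape_cost)
```

The angle cost steers the predicted center toward the nearest coordinate axis of the ground-truth center before the distance and shape terms dominate, which is the mechanism behind the more regular fitting path and faster convergence described above.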