Abstract:
Pests pose a severe threat to crop yield in agricultural production, and frequent pest outbreaks have limited the development of agriculture in China. Artificial intelligence (AI) monitoring is therefore needed in modern pest control. Among AI approaches, transformer-based DETR models have shown strong potential for multi-scale detection of densely distributed pests in images captured in natural scenes. However, existing quantitative analyses do not verify the core mechanisms of these detection models. In this study, a DINO-based model was proposed to detect and count dense, tiny insects. The model contained four modules. The multi-scale deformable attention module (MDAM) was used for multi-scale feature mapping; the contrastive denoising training strategy (CDTS) was used to alleviate duplicate detections of a single pest object; and mixed query selection and look forward twice (MQS-LFT) were used to initialize and iteratively refine the decoder queries more efficiently. A total of 4 019 images of planthoppers, aphids, and wheat mites were collected from real scenes to build a pest dataset named MTC-PAWPD (multi-scale tiny crowded planthopper-aphid-wheat mite pest detection). The dataset was divided into training, validation, and test sets at a ratio of 5:2:3. According to object overlap and relative scale, the dataset was further partitioned into four scenes, namely the intensive-distribution, large-scale, normal-scale, and tiny-scale scenes, and the performance of the model was compared across them. Ablation and validation experiments were carried out, in which the features and query anchors were visualized, to validate the correlation between performance and the query-related modules. The experimental results demonstrated that the DINO-based model performed better in recognizing pest objects.
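The 5:2:3 division of the 4 019 images described above can be illustrated with a minimal sketch. The abstract does not state how images were assigned to each subset, so the seeded random shuffle, the rounding rule (remainder goes to the test set), the function name `split_dataset`, and the file names below are all assumptions made for illustration only.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(
    items: Sequence[str],
    ratios: Tuple[int, int, int] = (5, 2, 3),
    seed: int = 0,
) -> Tuple[List[str], List[str], List[str]]:
    """Shuffle items reproducibly and split them into train/val/test subsets.

    Assumption: the split is random; any leftover items after integer
    division are assigned to the test subset.
    """
    total = sum(ratios)
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable

    n = len(shuffled)
    n_train = n * ratios[0] // total
    n_val = n * ratios[1] // total

    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # receives the rounding remainder
    return train, val, test


# Hypothetical file names standing in for the 4 019 collected images.
image_names = [f"img_{i:04d}.jpg" for i in range(4019)]
train_set, val_set, test_set = split_dataset(image_names)
```

Under these assumptions the three subsets are disjoint and together cover every image; the exact subset sizes used by the authors may differ from what this rounding rule produces.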
On the MTC-PAWPD benchmark, the model achieved the highest mean average precision at an intersection-over-union threshold of 50% (mAP@50), 70.0%, an improvement of 2.3 percentage points over mainstream object detection models such as Faster R-CNN, YOLOv5x, ATSS, YOLOX, and Deformable DETR. In addition, the model converged with only 1/10 of the training required by the YOLO-based models. The mAP@50 reached 42.5%, 79.4%, 75.7%, and 62.4% in the four scenes, respectively, showing strong detection performance, particularly in the scale-related scenes. All four improvement modules of DINO enhanced the performance of pest detection in real scenes, and the performance differences were correlated with the Transformer modules: the four modules improved mAP@50 by 0.2, 1.0, 0.9, and 0.8 percentage points, respectively. Specifically, the mixed query selection strategy used the encoder features to generate queries, so that the query points in the first decoder layer lay closer to the pest objects in the image. More features were thus identified, which enhanced the expressive power of the model. Look forward twice moved the query points in the later decoder layers closer to the objects. The contrastive denoising training strategy introduced high-quality negative samples, which suppressed overlapping predictions of a single pest object. In summary, the dense queries and components of DINO can be expected to facilitate pest detection and counting in multi-scale, complex, and densely distributed natural scenes. The DINO model can meet the requirements of feature extraction from multi-scale pest images taken in natural scenes and thus detect the objects accurately. The transformer-based model can also provide strong generalization and practical value for multi-scale pest detection against complex field backgrounds.
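To make the 50% intersection-over-union criterion behind mAP@50 concrete, the sketch below computes IoU for axis-aligned boxes and greedily matches score-ranked predictions to ground-truth boxes. It is an illustrative simplification, not the evaluation code used in the study: the box format `(x1, y1, x2, y2)`, the function names, and the example boxes are assumptions, and a full mAP computation would additionally build precision-recall curves per class.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_at_iou(
    preds: List[Tuple[Box, float]], gts: List[Box], thr: float = 0.5
) -> Tuple[int, int, int]:
    """Return (true positives, false positives, false negatives).

    Predictions are visited in descending score order; each ground-truth
    box can be matched at most once, so a duplicate detection of the same
    pest is counted as a false positive.
    """
    matched = set()
    tp = fp = 0
    for box, _score in sorted(preds, key=lambda p: p[1], reverse=True):
        best_j, best_iou = -1, thr
        for j, gt in enumerate(gts):
            if j in matched:
                continue
            value = iou(box, gt)
            if value >= best_iou:
                best_j, best_iou = j, value
        if best_j >= 0:
            matched.add(best_j)
            tp += 1
        else:
            fp += 1
    fn = len(gts) - len(matched)
    return tp, fp, fn


# Hypothetical example: two pests, one duplicate and one stray prediction.
ground_truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
predictions = [
    ((0, 0, 10, 10), 0.9),    # matches the first pest
    ((1, 1, 10, 10), 0.8),    # duplicate of the first pest
    ((50, 50, 60, 60), 0.7),  # background
]
tp, fp, fn = match_at_iou(predictions, ground_truth)
```

In this example the duplicate box is penalized as a false positive, which is the kind of error the contrastive denoising training strategy is described as reducing.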