Abstract:
Wheat is one of the most widely cultivated staple food crops in the world, and predicting its yield has profound implications for food security. Deep learning-based wheat spike detection and counting algorithms enable rapid yield prediction. To address the low detection accuracy and large parameter counts of existing methods in complex agricultural environments, this paper proposes a lightweight wheat ear detection model, RT-WEDT (Real-Time Wheat Ear Detection Transformer), based on RT-DETR. First, EfficientFormerV2 is selected as the backbone network of RT-WEDT to capture both the long-range and local features of wheat ear images while improving computational efficiency. Second, a multi-scale enhanced hybrid encoder (MSEHE) is introduced, which takes as input the feature maps at four scales produced by the four downsampling stages of the backbone network. The MSEHE consists of three sub-modules. The attention-based intra-scale feature interaction (AIFI) module acts on the smallest feature maps to extract global features of the image. The scale sequence feature fusion (SSFF) module performs multi-scale fusion, using 3D convolution to extract information about wheat ear targets at different scales. The outputs of these two modules are fed into the enhanced feature fusion module (EFFM), which further integrates the global and local information of the wheat ear image. Third, to improve the model's localization accuracy for wheat ear targets, this paper adopts WIoUv3 as the bounding box loss function to improve the quality of the anchor boxes. Experimental results on the Global Wheat Head Detection dataset show that RT-WEDT has 12M parameters, requires 33.1 G floating-point operations (FLOPs), and achieves an average precision of 90.2% at a detection speed of 79.7 frames per second.
Compared to RT-DETR, the RT-WEDT model has 62.5% fewer parameters and 68% fewer floating-point operations, while AP50-95 increases by 0.6%, AP50 increases by 0.5%, and detection speed increases by 22.4%. Compared with YOLOv5, YOLOv8, and YOLOX models of similar parameter count, AP50-95 improves by 8.2%, 2.4%, and 1.7%, respectively, and AP50 improves by 4.6%, 1.1%, and 0.7%, respectively. Furthermore, this paper groups samples from the Global Wheat Head Detection dataset by scenario and analyzes the model's detection performance on wheat ear targets in each scenario. The results indicate that dense and overlapping wheat ears are the most significant factor affecting model performance, followed by image blur. Light intensity at capture time has minimal effect on detection performance. To verify the robustness of RT-WEDT, this paper constructs a drone perspective wheat spike dataset (DPWSD) covering two growth stages and tests RT-WEDT on it directly. At the filling stage, RT-WEDT achieves 60.2% AP50-95 and 97.4% AP50; at the maturity stage, it achieves 61.0% AP50-95 and 96.1% AP50. To validate the counting performance of RT-WEDT, counting experiments were conducted on the test set of the Global Wheat Head Detection dataset and on the self-built DPWSD. The R² value of RT-WEDT was 0.94 on the Global Wheat Head Detection dataset and 0.95 on the DPWSD, indicating a close fit between predicted and actual counts and showing that RT-WEDT is accurate for wheat ear detection and counting. These results indicate that the improved model substantially reduces complexity while maintaining high average precision, achieving the goal of real-time wheat ear detection. This study provides technical support for efficient and rapid estimation of wheat yields and contributes to the development of smart agriculture.
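
As a rough illustration of the counting evaluation described above, the following Python sketch turns per-image detection scores into counts and computes R² between predicted and actual counts. It assumes that R² denotes the coefficient of determination (1 - SS_res / SS_tot); the paper may instead derive it from a fitted regression line. The function names `count_ears` and `r_squared`, the 0.5 confidence threshold, and all numeric values are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def count_ears(scores: Sequence[float], conf_threshold: float = 0.5) -> int:
    """Count detections whose confidence reaches the threshold.

    The 0.5 default is an assumed cutoff, not a value from the paper.
    """
    return sum(1 for s in scores if s >= conf_threshold)


def r_squared(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_actual = sum(actual) / len(actual)
    # Residual sum of squares between actual and predicted counts.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total sum of squares of the actual counts around their mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all actual counts are equal")
    return 1.0 - ss_res / ss_tot


# Illustrative values only, not data from the paper.
example_scores = [0.9, 0.8, 0.3]        # detection confidences for one image
actual_counts = [10, 20, 30, 40]        # manually annotated ears per image
predicted_counts = [11, 19, 30, 42]     # detector counts per image

print(count_ears(example_scores))
print(round(r_squared(actual_counts, predicted_counts), 3))
```

An R² close to 1 means the per-image predicted counts track the annotated counts closely. With this definition, a systematic over- or under-count lowers the score even when the two series are perfectly correlated.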