LI Jie, YANG Zihao, ZHENG Quan, et al. Research on wheat ear detection and counting method based on RT-WEDT[J]. Transactions of the Chinese Society of Agricultural Engineering (Transactions of the CSAE), 2024, 40(21): 1-11. DOI: 10.11975/j.issn.1002-6819.202405200

    Research on wheat ear detection and counting method based on RT-WEDT

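    • Abstract (Chinese version): Wheat is one of the most important food crops, and wheat ear counting is essential for predicting wheat yield. To address the insufficient detection accuracy and large parameter counts of existing detection and counting methods in complex field environments, this study proposes a lightweight wheat ear detection model, RT-WEDT (real-time wheat ear detection transformer). First, EfficientFormerV2, a lightweight transformer-based network, is chosen as the backbone of RT-WEDT to improve feature extraction efficiency while learning long-range features of wheat ear images. Second, a triple feature fusion (TFF) module is designed and a scale sequence feature fusion (SSFF) module is introduced to build a multi-scale enhanced hybrid encoder (MSEHE), which fully fuses shallow and deep features and improves detection accuracy across scales. Finally, the WIoUv3 loss is adopted as the bounding box loss to improve the model's localization accuracy for wheat ear targets. Experiments on the global wheat head detection dataset show that RT-WEDT achieves an average precision at an intersection-over-union (IoU) threshold of 0.50 (AP50) of 90.2%, higher than that of conventional object detection models. On the self-built drone perspective wheat spike dataset (DPWSD), AP50 reaches 96.8%, which verifies that the model generalizes well. In addition, the model has 12 M parameters and a detection speed of 79.7 frames/s, sufficient for high-throughput real-time wheat ear detection. This study provides technical support for efficient and rapid wheat yield estimation and is of significant importance for advancing smart agriculture.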
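
The AP50 figures above rest on an IoU threshold of 0.50: a predicted box counts as a correct detection only if it overlaps a ground-truth box by at least that much. The sketch below illustrates that matching criterion only; it is not the paper's evaluation code. The box format `(x_min, y_min, x_max, y_max)`, the function names, and the greedy one-to-one matching are assumptions made for illustration.

```python
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def count_true_positives(preds: List[Box], truths: List[Box],
                         thr: float = 0.50) -> int:
    """Greedy one-to-one matching at an IoU threshold.

    `preds` is assumed to be sorted by descending confidence, so that
    higher-scoring boxes claim ground-truth boxes first.
    """
    unmatched = list(truths)
    tp = 0
    for p in preds:
        best = max(unmatched, key=lambda t: iou(p, t), default=None)
        if best is not None and iou(p, best) >= thr:
            unmatched.remove(best)  # each ground-truth box is matched once
            tp += 1
    return tp
```

AP50 itself is then obtained by sweeping the confidence threshold, building the precision-recall curve from such matches, and taking the area under it; that step is omitted here.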

       

      Abstract: Wheat is one of the most widely cultivated staple food crops in the world, and predicting its yield has profound implications for food security. Deep learning-based wheat ear detection and counting algorithms make rapid yield prediction possible. To address the low detection accuracy and large parameter counts of existing methods in complex field environments, this paper proposes a lightweight wheat ear detection model based on RT-DETR, RT-WEDT (Real-Time Wheat Ear Detection Transformer). First, EfficientFormerV2 is selected as the backbone of RT-WEDT to capture both long-range and local features of wheat ear images while improving computational efficiency. Second, a multi-scale enhanced hybrid encoder (MSEHE) is introduced, which takes as input the feature maps at four scales produced by the four downsampling stages of the backbone. The MSEHE consists of three sub-modules. The attention-based intra-scale feature interaction (AIFI) module acts on the smallest feature map to extract global image features. The scale sequence feature fusion (SSFF) module performs multi-scale fusion, using 3D convolution to extract information about wheat ear targets at different scales. The outputs of these two modules are fed into the enhanced feature fusion module (EFFM), which further integrates the global and local information of the wheat ear image. In addition, to improve localization accuracy for wheat ear targets, this paper adopts the WIoUv3 loss as the bounding box regression loss to improve the quality of the predicted boxes. Experimental results on the global wheat head detection dataset show that RT-WEDT has 12 M parameters, requires 33.1 G floating-point operations, and achieves an AP50 of 90.2% at a detection speed of 79.7 frames/s.
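
For orientation, the WIoUv3 loss named above can be written out as follows. This is a sketch of the Wise-IoU formulation as commonly published, not a formula taken from this abstract, and every symbol is an assumption here: $(x, y)$ and $(x_{gt}, y_{gt})$ are the centers of the predicted and ground-truth boxes, $W_g$ and $H_g$ are the width and height of the smallest box enclosing both, the superscript $*$ marks a quantity treated as a constant during backpropagation, $\overline{\mathcal{L}_{\mathrm{IoU}}}$ is a running mean of the IoU loss, and $\alpha$, $\delta$ are hyperparameters whose values the abstract does not state.

```latex
\begin{aligned}
\mathcal{L}_{\mathrm{IoU}} &= 1 - \mathrm{IoU}, \\
\mathcal{L}_{\mathrm{WIoUv1}} &= \exp\!\left(\frac{(x - x_{gt})^2 + (y - y_{gt})^2}{\left(W_g^2 + H_g^2\right)^{*}}\right)\mathcal{L}_{\mathrm{IoU}}, \\
\beta &= \frac{\mathcal{L}_{\mathrm{IoU}}^{*}}{\overline{\mathcal{L}_{\mathrm{IoU}}}}, \qquad
r = \frac{\beta}{\delta\,\alpha^{\beta - \delta}}, \\
\mathcal{L}_{\mathrm{WIoUv3}} &= r\,\mathcal{L}_{\mathrm{WIoUv1}}.
\end{aligned}
```

Under this formulation, the factor $r$ equals 1 when $\beta = \delta$ and, for $\alpha > 1$, shrinks for boxes with very large $\beta$ (outliers), so that low-quality boxes contribute smaller gradients.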
Compared with RT-DETR, RT-WEDT has 62.5% fewer parameters and 68% fewer floating-point operations, while AP50-95 increases by 0.6%, AP50 by 0.5%, and detection speed by 22.4%. Compared with YOLOv5, YOLOv8, and YOLOX models of similar parameter count, AP50-95 improves by 8.2%, 2.4%, and 1.7%, respectively, and AP50 improves by 4.6%, 1.1%, and 0.7%, respectively. Furthermore, this paper groups samples from the global wheat head detection dataset by scenario and analyzes the model's detection performance in each. The results indicate that dense, overlapping wheat ears affect model performance most, followed by image blur, whereas light intensity at capture time has minimal effect. To verify the robustness of RT-WEDT, this paper constructs a drone perspective wheat spike dataset (DPWSD) covering two growth periods and tests RT-WEDT on it directly. RT-WEDT achieves 60.2% AP50-95 and 97.4% AP50 at the filling stage, and 61.0% AP50-95 and 96.1% AP50 at the maturity stage. To validate counting performance, counting experiments were conducted on the test set of the global wheat head detection dataset and on the self-built DPWSD. The R² of RT-WEDT was 0.94 on the global wheat head detection dataset and 0.95 on DPWSD, indicating a close fit between predicted and actual counts and showing that RT-WEDT is accurate for wheat ear detection and counting. These results indicate that the improved model substantially reduces complexity while maintaining high average precision, achieving the goal of real-time wheat ear detection.
This study provides technical support for efficient and rapid estimation of wheat yields and is of significant importance for advancing the development of smart agriculture.
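
The counting evaluation reported in the abstract compares per-image predicted counts with actual counts via R². A minimal sketch of how such a figure could be computed follows. It assumes R² is the coefficient of determination, 1 - SS_res / SS_tot, which the abstract does not specify, and that the predicted count is the number of detections whose confidence reaches a threshold. The function names and the default threshold of 0.5 are hypothetical.

```python
from typing import Sequence


def count_ears(confidences: Sequence[float], conf_thr: float = 0.5) -> int:
    """Predicted count for one image: detections at or above the threshold.

    The threshold value is a hypothetical choice, not taken from the paper.
    """
    return sum(1 for c in confidences if c >= conf_thr)


def r_squared(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination between actual and predicted counts."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all actual counts are equal")
    return 1.0 - ss_res / ss_tot
```

A value close to 1 means the predicted counts track the manual counts closely across images; the squared Pearson correlation of a fitted line is a different, though often similar, quantity.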

       
