    ARKIN·HAMDULLA, HOU Yanlin. Detecting cistanche tubulosa using YOLOv5s-ESTC[J]. Transactions of the Chinese Society of Agricultural Engineering (Transactions of the CSAE), 2024, 40(6): 267-276. DOI: 10.11975/j.issn.1002-6819.202310143


    Detecting cistanche tubulosa using YOLOv5s-ESTC

    • Abstract: To address the low detection accuracy of cultivated Cistanche tubulosa caused by occlusion from host plants such as Haloxylon ammodendron and red willow, dense sample distribution, and unbalanced sample sizes, as well as the difficulty of porting over-parameterized models to embedded devices, this study proposes a lightweight detection method for cultivated Cistanche tubulosa based on an improved YOLOv5s. First, the YOLOv5s backbone was replaced with the EfficientNetv2 network to reduce the parameter count and computational complexity, making the model lightweight and easier to deploy on embedded devices later on. Second, to strengthen the extraction of feature information from small Cistanche tubulosa targets, the C3 module was integrated with the Swin Transformer Block to obtain the C3STR module, which was introduced into the backbone; its output feature maps were fused with those of the Neck network. Finally, a coordinate attention (CA) mechanism was added between the front of the detection heads and the neck enhancement network to suppress background information, focus on target features, and improve detection performance. Experimental results show that the proposed model achieved a detection accuracy of 89.4% and a precision of 92.3% for Cistanche tubulosa, with 5.69×10⁶ parameters, a computational cost of 6.8 GFLOPs, a weight file of 11.9 MB, and a single-image inference time of 8.9 ms, enabling real-time detection. Compared with other mainstream models, the improved model's detection accuracy exceeded that of SSD, Faster R-CNN, YOLOv3, YOLOv4, YOLOv5s, YOLOv6s, YOLOv7s, and YOLOv8s by 1.1, 3.0, 10.8, 10.3, 8.5, 1.6, 1.2, 1.4, and 0.5 percentage points, respectively; it accurately identifies cultivated Cistanche tubulosa targets in different forms while having the smallest parameter count, computational cost, and weight file. The model was also deployed on a mobile device. Test results show good detection performance: the detection time on the mobile device was 110.81 ms, a 37.8% improvement over the YOLOv5s model, and the average detection accuracy was 4.96 percentage points higher than YOLOv5s. The method can serve as a reference for developing intelligent Cistanche tubulosa harvesting equipment.
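The coordinate attention step described above pools the feature map separately along height and width, then gates the input with two direction-aware weight maps. A minimal NumPy sketch of that idea follows; it is an illustrative re-implementation, not the authors' code, and the projection matrices `w1`, `w2_h`, `w2_w` stand in for the 1×1 convolutions (with BN and non-linearity) used in real coordinate attention:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def coordinate_attention(x, w1, w2_h, w2_w):
    """Illustrative coordinate attention on a (C, H, W) feature map.

    w1: (C_mid, C) shared projection; w2_h, w2_w: (C, C_mid) per-direction
    projections. These dense matrices are stand-ins for 1x1 convolutions.
    """
    C, H, W = x.shape
    # Direction-aware pooling: average over width, and over height.
    pooled_h = x.mean(axis=2)                 # (C, H)
    pooled_w = x.mean(axis=1)                 # (C, W)
    # Shared transform on the concatenated positional descriptors.
    y = np.maximum(w1 @ np.concatenate([pooled_h, pooled_w], axis=1), 0.0)
    y_h, y_w = y[:, :H], y[:, H:]             # split back per direction
    # Per-direction attention weights in (0, 1).
    a_h = sigmoid(w2_h @ y_h)                 # (C, H)
    a_w = sigmoid(w2_w @ y_w)                 # (C, W)
    # Re-weight the input along height and width (broadcasted product).
    return x * a_h[:, :, None] * a_w[:, None, :]

rng = np.random.default_rng(0)
C, H, W, C_mid = 8, 4, 4, 2
x = rng.standard_normal((C, H, W))
out = coordinate_attention(x,
                           rng.standard_normal((C_mid, C)),
                           rng.standard_normal((C, C_mid)),
                           rng.standard_normal((C, C_mid)))
print(out.shape)  # (8, 4, 4)
```

Because both gates lie in (0, 1), the module can only attenuate activations, which is how background responses are suppressed while the spatial shape of the feature map is preserved.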

       

      Abstract: Cultivated Cistanche tubulosa (known as Rou Cong-Rong) is an herbaceous crop grown in the arid lands and warm deserts of northwestern China, such as Xinjiang, Inner Mongolia, and Qinghai Province. It is a cash crop of high medicinal and economic value. However, manual picking cannot fully meet the demands of large-scale production at present, owing to its time and labor costs. Intelligent picking has become the major trend for improving efficiency and saving labor in industrial planting. Yet it is difficult to accurately identify and locate Cistanche tubulosa during picking because of complex environmental factors, such as illumination, occlusion, and dense targets. In this study, an improved YOLOv5s model (termed YOLOv5s-ESTC) was proposed to increase the recognition accuracy of cultivated Cistanche tubulosa in complex environments. The lightweight object-detection model is also easier to deploy on mobile terminals and embedded devices. Firstly, the backbone of YOLOv5s was replaced with the lightweight EfficientNetv2, which uses the Fused-MBConv module; this optimized the model by reducing the parameter count, computational complexity, and memory usage. Secondly, the C3STR block, obtained by integrating the C3 module with the Swin Transformer block, was embedded in the backbone, and its output feature maps were fused in the neck network. The shifted-window multi-head self-attention of the Swin Transformer layers expands the receptive field layer by layer, allowing the improved model to capture subtle features in the image and the interaction of texture features between pixel neighborhoods, particularly for smaller targets; the detection performance for Cistanche tubulosa was thus significantly enhanced. Lastly, the coordinate attention mechanism was integrated into the network, strengthening the relevant features of the target channels while suppressing invalid background. An ablation experiment was designed to validate the performance of the improved model. The results show that the detection accuracy and precision of YOLOv5s-ESTC reached 89.4% and 92.3%, improvements of 1.6 and 2.9 percentage points over the original YOLOv5s. The parameter count and computational complexity were 5.69×10⁶ and 6.8 GFLOPs, reductions of 1.33×10⁶ parameters and 9.1 GFLOPs relative to the original model. The weight file of the improved model was 11.9 MB, and the inference time per image was 8.9 ms, fully meeting real-time detection requirements. A comparative test was designed to verify the detection performance of the improved model. The experimental results show that the detection accuracy of YOLOv5s-ESTC was higher by 1.1, 3.0, 10.8, 10.3, 8.5, 1.6, 1.2, 1.4, and 0.5 percentage points than the mainstream SSD, Faster R-CNN, YOLOv3, YOLOv4, YOLOv5s, YOLOv6s, YOLOv7s, and YOLOv8s models under the same conditions, respectively. The improved model accurately detected Cistanche tubulosa targets without misses or false detections, even in complex scenes containing small targets, soil backgrounds similar in appearance to Cistanche tubulosa, and densely distributed samples, and it achieved higher recognition accuracy than the other deep-learning models. The models before and after improvement were also deployed on a mobile terminal device. YOLOv5s-ESTC exhibited excellent detection capability on the mobile platform: its detection time was 110.81 ms, a notable improvement of 37.8% over the original model, while its detection accuracy remained at a high level. In conclusion, the improved model can be expected to effectively detect Cistanche tubulosa targets in natural environments. The findings provide a valuable reference for developing intelligent harvesting equipment for Cistanche cultivation.
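The shifted-window self-attention mentioned above relies on partitioning the feature map into non-overlapping windows, with a cyclic shift between successive layers so that the next layer's windows straddle the previous window borders. A minimal NumPy sketch of that partitioning mechanism follows; the window size and tensor shapes are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

def window_partition(x, ws):
    """Split a (H, W, C) feature map into (num_windows, ws, ws, C) tiles."""
    H, W, C = x.shape
    assert H % ws == 0 and W % ws == 0, "H and W must be divisible by ws"
    # Group rows and columns into ws-sized blocks, then flatten the blocks.
    x = x.reshape(H // ws, ws, W // ws, ws, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, ws, ws, C)

def cyclic_shift(x, ws):
    """Roll by half a window so new windows span old window boundaries."""
    return np.roll(x, shift=(-(ws // 2), -(ws // 2)), axis=(0, 1))

H, W, C, ws = 8, 8, 3, 4
x = np.arange(H * W * C, dtype=float).reshape(H, W, C)
windows = window_partition(x, ws)                     # regular windows
shifted = window_partition(cyclic_shift(x, ws), ws)   # shifted windows
print(windows.shape)  # (4, 4, 4, 3)
```

Self-attention is then computed independently inside each window, which keeps the cost linear in image size while the alternation between regular and shifted windows propagates information across window borders.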

       
