Abstract
Cultivated Cistanche deserticola (known as Rou Cong-Rong) is a herbaceous crop grown in the arid lands and warm deserts of northwestern China, such as Xinjiang, Inner Mongolia, and Qinghai. This cash crop is characterized by its high medicinal and economic value. However, manual picking cannot fully meet the demands of large-scale production at present, due to its time and labor costs. Intelligent picking has become the major trend for improving efficiency and saving labor in industrial planting. However, it is difficult to accurately identify and locate Cistanche deserticola during picking, due to complex environmental factors, such as variable illumination, occlusion, and densely distributed targets. In this study, an improved YOLOv5s model (called YOLOv5s-ESTC) was proposed to increase the recognition accuracy of planted Cistanche deserticola in complex environments. The lightweight object detection model can be more easily deployed on mobile terminals and embedded devices. Firstly, the backbone network of YOLOv5s was replaced with the lightweight EfficientNetV2 using the Fused-MBConv module, which reduced the parameter count, computational complexity, and memory usage of the model. Secondly, the C3STR block, which integrates the C3 module with the Swin Transformer block, was embedded into the backbone, and its output feature maps were fused in the neck network. The shifted-window multi-head self-attention of the Swin Transformer expanded the receptive field for Cistanche deserticola layer by layer, so the improved model effectively captured subtle image features and the interaction of texture features between neighboring pixels, particularly for small targets, significantly enhancing the detection performance for Cistanche deserticola. Lastly, the coordinate attention mechanism was integrated into the network to strengthen the target-relevant channel features and suppress the invalid background. An ablation experiment was designed to validate the performance of the improved model. The results show that the detection accuracy and precision of the improved YOLOv5s-ESTC reached 89.8% and 92.3%, respectively, improvements of 1.6 and 2.9 percentage points over the original YOLOv5s. The number of parameters and the computational complexity were 5.69×10^6 and 6.8 GFLOPs, respectively, reductions of 1.33×10^6 parameters and 9.1 GFLOPs compared with the original model. The weight file of the improved model was 11.9 MB, and the inference time per image was 8.9 ms, which fully met the requirements of real-time detection. A comparative test was designed to verify the detection performance of the improved model. The experimental results show that the detection accuracy of the YOLOv5s-ESTC model was higher by 1.1, 3.0, 10.8, 10.3, 8.5, 1.6, 1.2, 1.4, and 0.5 percentage points than the mainstream SSD, Faster R-CNN, YOLOv3, YOLOv4, YOLOX, YOLOv5s, YOLOv6s, YOLOv7s, and YOLOv8s models under the same conditions, respectively. The improved model accurately detected Cistanche deserticola targets without missed or false detections, particularly in complex environments, including small targets, soil backgrounds similar in appearance to Cistanche deserticola, and densely distributed samples. The improved model achieved higher recognition accuracy than the other deep learning models. The models before and after improvement were also deployed on a mobile terminal device.
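To relate the first modification to code, the following is a minimal PyTorch sketch of a Fused-MBConv block in the style of EfficientNetV2; the expansion ratio, activation, and omission of squeeze-and-excitation are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class FusedMBConv(nn.Module):
    """Fused-MBConv as used in EfficientNetV2: the 1x1 expansion conv and
    3x3 depthwise conv of MBConv are fused into a single regular 3x3 conv,
    which runs faster on many accelerators. Squeeze-and-excitation is
    omitted here, as in the shallow stages of EfficientNetV2."""
    def __init__(self, c_in: int, c_out: int, expand: int = 4, stride: int = 1):
        super().__init__()
        c_mid = c_in * expand
        self.use_residual = stride == 1 and c_in == c_out
        self.fused = nn.Sequential(
            nn.Conv2d(c_in, c_mid, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(c_mid),
            nn.SiLU(),
        )
        self.project = nn.Sequential(
            nn.Conv2d(c_mid, c_out, 1, bias=False),
            nn.BatchNorm2d(c_out),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.project(self.fused(x))
        return x + y if self.use_residual else y
```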
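Similarly, the core of the Swin Transformer layer inside C3STR is window-based multi-head self-attention. The sketch below shows a simplified, non-shifted variant; the relative position bias and shifted-window masking used by the full block are omitted for brevity, and the window size and head count are assumptions.

```python
import torch
import torch.nn as nn

class WindowAttention(nn.Module):
    """Simplified (non-shifted) window multi-head self-attention in the
    spirit of the Swin Transformer block inside C3STR: the feature map is
    split into non-overlapping windows and self-attention is computed
    within each window, keeping the cost linear in image size."""
    def __init__(self, dim: int, window: int = 7, heads: int = 4):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W), with H and W divisible by the window size
        b, c, h, w = x.shape
        ws = self.window
        # partition into non-overlapping windows -> (B * n_windows, ws*ws, C)
        x = x.view(b, c, h // ws, ws, w // ws, ws)
        x = x.permute(0, 2, 4, 3, 5, 1).reshape(-1, ws * ws, c)
        y, _ = self.attn(x, x, x)  # self-attention within each window
        # merge windows back to the feature-map layout -> (B, C, H, W)
        y = y.reshape(b, h // ws, w // ws, ws, ws, c)
        y = y.permute(0, 5, 1, 3, 2, 4).reshape(b, c, h, w)
        return y
```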
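Finally, a compact sketch of a coordinate attention block of the kind integrated into the network; the reduction ratio and activation follow the common published formulation (Hou et al., CVPR 2021) and are not necessarily the authors' exact settings.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Coordinate attention: channel attention is computed along the height
    and width axes separately, so positional information is preserved and
    target-relevant channels can be strengthened while background responses
    are suppressed."""
    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # pool over width  -> (B, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # pool over height -> (B, C, 1, W)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.Hardswish()
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        x_h = self.pool_h(x)                      # (B, C, H, 1)
        x_w = self.pool_w(x).permute(0, 1, 3, 2)  # (B, C, W, 1)
        y = torch.cat([x_h, x_w], dim=2)          # joint encoding along both axes
        y = self.act(self.bn1(self.conv1(y)))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                      # (B, C, H, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # (B, C, 1, W)
        return x * a_h * a_w  # reweight features along both spatial axes
```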
The YOLOv5s-ESTC model exhibited excellent detection capability on the mobile platform. The detection time was 110.81 ms per image, a reduction of 37.8% compared with the original model, while the detection accuracy of the YOLOv5s-ESTC model remained at a high level. In conclusion, the improved model can be expected to effectively detect Cistanche deserticola targets in natural environments. These findings can provide a valuable reference for the development of intelligent harvesting equipment for Cistanche cultivation.
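The abstract does not specify the mobile deployment toolchain. As an illustration only, one common route for porting a YOLOv5-family model to a mobile runtime is to export it to ONNX first; the checkpoint name, input resolution, and opset below are assumptions, not the authors' actual pipeline.

```python
import torch

# Illustrative export of a trained detector for mobile deployment.
# Assumes a YOLOv5-style checkpoint that stores the model object under
# the "model" key; path, input size, and opset are hypothetical.
ckpt = torch.load("yolov5s_estc.pt", map_location="cpu")
model = ckpt["model"].float().eval()
dummy = torch.zeros(1, 3, 640, 640)  # YOLOv5 default input size
torch.onnx.export(
    model, dummy, "yolov5s_estc.onnx",
    opset_version=12,
    input_names=["images"], output_names=["outputs"],
)
# The ONNX graph can then be converted to a mobile inference engine
# such as NCNN or TensorFlow Lite for on-device detection.
```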