Abstract
Grapes are among the most popular fruits, offering high nutritional value and considerable economic benefits. Manual picking of mature grapes can no longer fully meet the demands of large-scale production, particularly as planting areas have expanded in recent years. Picking robots can monitor the growth of grapes in orchards in real time, and automatic grape picking can further promote intelligent agricultural production. In this study, an improved YOLOv5s model (MRW-YOLOv5s) was proposed to rapidly and accurately identify grapes in orchards. First, the lightweight MobileNetv3 was adopted as the feature extraction network to reduce the number of model parameters, and a coordinate attention (CA) module was embedded into the bneck structure of MobileNetv3 to strengthen the feature extraction capability of the network. Second, the RepVGG Block was introduced into the neck network, where its multi-branch structure integrates features to improve detection accuracy; structural reparameterization of the RepVGG Block was then applied to further accelerate the inference speed of the model. Finally, the Wise Intersection over Union (WIoU) loss with a dynamic non-monotonic focusing mechanism was taken as the bounding box regression loss function to accelerate network convergence and improve detection accuracy. Gradient-weighted class activation mapping (Grad-CAM) showed that, with the CA module embedded in the backbone, the improved model captured grape targets better than counterparts embedded with Efficient Channel Attention (ECA) or the Convolutional Block Attention Module (CBAM). In the convergence curves of the loss functions, EIoU as the bounding box loss yielded the slowest bounding box regression and the highest loss value after convergence; CIoU and Wise-IoU v1 produced similar convergence speeds and loss values, both slightly lower than those of EIoU; and Wise-IoU v3 achieved the fastest convergence and the lowest loss value, indicating that Wise-IoU v3 accelerates model convergence and improves accuracy. The results showed that the improved MRW-YOLOv5s model had only 7.56 M parameters, its mean Average Precision (mAP) on the test set reached 97.74%, and its average detection time per image was 10.03 ms, with the mAP 2.32 percentage points higher and the detection time 6 ms shorter than those of the original YOLOv5s model. Compared with mainstream object detection models, including SSD, RetinaNet, YOLOv4, YOLOv7, and YOLOX, the mAP of the MRW-YOLOv5s model was 9.89, 7.53, 2.12, 0.91, and 2.42 percentage points higher, respectively; its parameter count of 7.56 M was 68.2%, 79.2%, 88.2%, 79.7%, and 15.4% smaller, respectively; and its average detection time of 10.03 ms was 2.64, 13.19, 10.59, 4.14, and 5.46 ms faster, respectively. Furthermore, the weight file of the improved model was only 26.97 MB, which is more conducive to deployment. Therefore, the MRW-YOLOv5s model achieves a strong balance of detection accuracy, parameter size, and detection speed, and the findings can provide technical support for intelligent orchards and the mechanization of picking.
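For concreteness, the coordinate attention mechanism cited above can be illustrated with a minimal PyTorch sketch following the CA design of Hou et al. (2021). This is illustrative only, not the authors' implementation: the reduction ratio of 32 is that paper's default, and the wiring of the module into MobileNetv3's bneck blocks is omitted.

import torch
import torch.nn as nn

class CoordAtt(nn.Module):
    """Coordinate attention: factorizes global pooling into two
    direction-aware 1D poolings, encodes them jointly, then reweights
    the input with per-row and per-column attention maps."""
    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # pool along width  -> (N, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # pool along height -> (N, C, 1, W)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.Hardswish()
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        x_h = self.pool_h(x)                      # (N, C, H, 1)
        x_w = self.pool_w(x).permute(0, 1, 3, 2)  # (N, C, W, 1)
        # Encode both directions jointly, then split them apart again.
        y = self.act(self.bn1(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                      # (N, C, H, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # (N, C, 1, W)
        return x * a_h * a_w  # direction-aware attention reweighting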
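The structural reparameterization mentioned above merges the RepVGG Block's three training-time branches (3x3 convolution, 1x1 convolution, and identity, each followed by BatchNorm) into a single 3x3 convolution for inference. Below is a minimal sketch of the standard technique, assuming a stride-1 block with equal input and output channels; layer and function names are illustrative and do not come from the paper's code.

import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d):
    """Fold a BatchNorm layer into the preceding bias-free convolution."""
    std = (bn.running_var + bn.eps).sqrt()
    scale = bn.weight / std  # per-channel gamma / sigma
    return conv.weight * scale.reshape(-1, 1, 1, 1), bn.bias - bn.running_mean * scale

class RepVGGBlock(nn.Module):
    """Training-time block: 3x3 conv + 1x1 conv + identity, each with BN."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv3 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn3 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.bn_id = nn.BatchNorm2d(channels)  # identity branch is BN only
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(self.bn3(self.conv3(x)) + self.bn1(self.conv1(x)) + self.bn_id(x))

    @torch.no_grad()
    def reparameterize(self) -> nn.Sequential:
        """Collapse the three branches into one 3x3 conv for inference."""
        c = self.conv3.out_channels
        w3, b3 = fuse_conv_bn(self.conv3, self.bn3)
        w1, b1 = fuse_conv_bn(self.conv1, self.bn1)
        w1 = nn.functional.pad(w1, [1, 1, 1, 1])  # center the 1x1 kernel in a 3x3
        # Express the identity branch as a 3x3 conv with a Dirac kernel,
        # then fold its BN in the same way.
        w_id = torch.zeros(c, c, 3, 3)
        for i in range(c):
            w_id[i, i, 1, 1] = 1.0
        std = (self.bn_id.running_var + self.bn_id.eps).sqrt()
        scale = self.bn_id.weight / std
        w_id = w_id * scale.reshape(-1, 1, 1, 1)
        b_id = self.bn_id.bias - self.bn_id.running_mean * scale
        fused = nn.Conv2d(c, c, 3, padding=1, bias=True)
        fused.weight.copy_(w3 + w1 + w_id)
        fused.bias.copy_(b3 + b1 + b_id)
        return nn.Sequential(fused, self.act)

# Sanity check: in eval mode the fused block matches the multi-branch one,
# which is why reparameterization speeds up inference at no accuracy cost.
block = RepVGGBlock(64).eval()
x = torch.randn(1, 64, 32, 32)
assert torch.allclose(block(x), block.reparameterize()(x), atol=1e-4)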
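Likewise, the Wise-IoU v3 loss referenced above scales WIoU v1's distance-attention term by a non-monotonic focusing coefficient r derived from an outlier degree beta. The sketch below assumes axis-aligned boxes in (x1, y1, x2, y2) format; alpha = 1.9 and delta = 3 are the defaults reported by Tong et al. (2023) for WIoU v3, while the running-mean update of the mean IoU loss (its momentum and initial value) is an implementation assumption.

import torch

class WiseIoUv3:
    """L_WIoUv3 = r * R_WIoU * (1 - IoU), where R_WIoU is a center-distance
    attention term and r = beta / (delta * alpha ** (beta - delta))."""
    def __init__(self, alpha: float = 1.9, delta: float = 3.0, momentum: float = 0.01):
        self.alpha, self.delta, self.momentum = alpha, delta, momentum
        self.mean_iou_loss = 1.0  # running mean of L_IoU; initial value assumed

    def __call__(self, pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # pred, target: (N, 4) boxes as (x1, y1, x2, y2).
        ix1 = torch.max(pred[:, 0], target[:, 0])
        iy1 = torch.max(pred[:, 1], target[:, 1])
        ix2 = torch.min(pred[:, 2], target[:, 2])
        iy2 = torch.min(pred[:, 3], target[:, 3])
        inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
        area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
        area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
        iou = inter / (area_p + area_t - inter + 1e-7)
        l_iou = 1.0 - iou
        # Smallest enclosing box (width W_g, height H_g) and center distances.
        wg = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
        hg = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
        dx = (pred[:, 0] + pred[:, 2] - target[:, 0] - target[:, 2]) / 2
        dy = (pred[:, 1] + pred[:, 3] - target[:, 1] - target[:, 3]) / 2
        # WIoU v1: distance attention with a detached denominator.
        r_wiou = torch.exp((dx ** 2 + dy ** 2) / (wg ** 2 + hg ** 2).detach())
        l_v1 = r_wiou * l_iou
        # WIoU v3: the outlier degree beta drives a non-monotonic focusing
        # factor, down-weighting both very easy and very low-quality examples.
        beta = l_iou.detach() / self.mean_iou_loss
        r = beta / (self.delta * self.alpha ** (beta - self.delta))
        self.mean_iou_loss = (1 - self.momentum) * self.mean_iou_loss \
            + self.momentum * l_iou.detach().mean().item()
        return (r * l_v1).mean()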