Abstract
Automatic apple recognition is essential for apple-harvesting robots, and fast recognition improves the efficiency of picking robots. In real orchard scenes, recognition conditions are complex, including daytime and nighttime illumination, overlapping apples, occlusion, bagged fruit, backlighting, reflected light, and dense apple clusters, so a fast and highly robust visual recognition scheme is required. In this paper, a fast and stable apple recognition scheme based on an improved YOLOv3 was proposed. A single convolutional neural network (one-stage) traversed the entire image, dividing it into sub-regions of equal size and predicting the class and bounding box of the target in each sub-region; non-maximum suppression then merged the overlapping predictions into one outer bounding box per target and returned the category and position of each target. To improve detection efficiency, a VGG-like network was used to replace the original residual backbone of YOLOv3, compressing the 53-layer network into a 13-layer network and reducing the model size without affecting the detection performance. Taking into account the size of the smallest apples in dense-apple images, the anchors of 3 different sizes were reduced to 2, which shrank the final prediction tensor while ensuring that the smallest anchor could still cover the smallest target. The procedure was as follows. Firstly, the data set was manually annotated: 400 images for the training set and 115 images for the validation set, containing a total of 1 158 apple samples. In addition, to increase the generalization ability of the model, the data set was augmented by adjusting the hue, saturation, and exposure of the images, generating a total of 51 500 images; the initial anchor dimensions were then calculated by K-means clustering. Secondly, the network was trained on the data set, outputting a model every 100 iterations. The mean average precision (mAP) of each saved weight file was calculated in batches on the validation set, the model with the highest mAP was selected, and an appropriate confidence threshold was chosen to obtain the best trade-off among precision, recall, and intersection over union (IOU). The trained model reached a mAP of 87.71%, a precision of 97%, a recall of 90%, and an IOU of 83.61%. Thirdly, the performance of the model was verified on an additional experimental data set covering different fruit numbers, illumination angles, fruit growth stages, and shooting times; this set consisted of 336 images containing 1 410 apple samples. The method was compared with HOG+SVM, Faster R-CNN, YOLOv2, and YOLOv3, using the F1 score as the evaluation index. The experimental results showed that YOLOv3 performed significantly better than YOLOv2 on dense-apple images, and better than Faster R-CNN and HOG+SVM in the other environments. Finally, the detection speed of the algorithm was verified in different hardware environments: the detection time for one image was 16.69 ms on the GPU (60 frame/s on live video) and 105.21 ms on the CPU (15 frame/s on live video). Since the target is located only at the beginning of the picking process and does not need to be refreshed frequently during picking, the detection time achieved in this paper was sufficient.
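To make the anchor-initialization step concrete, the following is a minimal sketch of K-means clustering over annotated box widths and heights. It uses the 1 − IOU distance that is customary for YOLO anchor computation; the IOU-based distance, the function names, and the synthetic example data are illustrative assumptions, since the abstract does not give implementation details.

```python
import numpy as np

def iou_wh(boxes, anchors):
    """IOU between boxes and anchors, comparing (w, h) only,
    as if both were centered at the origin."""
    inter = (np.minimum(boxes[:, None, 0], anchors[None, :, 0]) *
             np.minimum(boxes[:, None, 1], anchors[None, :, 1]))
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            (anchors[:, 0] * anchors[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k=2, iters=100, seed=0):
    """Cluster (w, h) pairs with K-means under the 1 - IOU distance.
    boxes: (N, 2) array of ground-truth box widths and heights."""
    rng = np.random.default_rng(seed)
    anchors = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        # Assign each box to the anchor it overlaps most.
        assign = np.argmax(iou_wh(boxes, anchors), axis=1)
        new = np.array([boxes[assign == i].mean(axis=0)
                        if np.any(assign == i) else anchors[i]
                        for i in range(k)])
        if np.allclose(new, anchors):
            break
        anchors = new
    return anchors

# Example: k=2 matches the reduced anchor count in this paper; in practice
# the (w, h) pairs would come from the manually annotated apple boxes.
wh = np.abs(np.random.default_rng(1).normal([40, 45], 12, size=(1158, 2)))
print(kmeans_anchors(wh, k=2))
```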
This research provides a reference for harvesting robots to locate apples rapidly and with sustained high efficiency in complex environments.
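For completeness, the non-maximum suppression step summarized in the abstract can be sketched as follows. This is the standard greedy formulation with an illustrative 0.5 IOU threshold; the signature and threshold are assumptions, not details taken from the paper's implementation.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression.
    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidences.
    Returns the indices of the boxes kept."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IOU of the top-scoring box with all remaining boxes.
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # Discard boxes that overlap the kept box too much.
        order = order[1:][iou <= iou_thresh]
    return keep
```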