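Efficient detection method for young apples based on the fusion of convolutional neural network and visual attention mechanism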

Song Huaibo; Jiang Mei; Wang Yunfei; Song Lei

doi:10.11975/j.issn.1002-6819.2021.09.034

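Summary: High-throughput, automatic acquisition of fruit phenotypic data is the basis of breeding research on new fruit tree varieties, and accurate detection of young fruits is the key to obtaining growth data. Young fruits are tiny and close in color to the leaves, which makes them difficult to detect. To detect young apples efficiently in natural environments, this study fused two visual attention mechanisms, the Squeeze-and-Excitation block (SE block) and the Non-Local block (NL block), and proposed an improved YOLOv4 network model (YOLOv4-SENL). After the backbone network of YOLOv4 extracts high-level visual features, the SE block integrates them along the channel dimension to strengthen channel information. NL blocks are added to the 3 paths of the model's improved Path Aggregation Network (PAN), combining non-local and local information to enhance the features. Together, the two attention mechanisms re-integrate high-level features from the channel and non-local perspectives, emphasizing channel information and long-range dependencies and improving the network's ability to capture the features of background and fruit. Finally, feature maps of different sizes are used to compute the coordinates and classes of young fruits of different sizes. After training on 1 920 training images, the network achieved an average precision of 96.9% on 600 test images, which is 6.9, 1.5, and 0.2 percentage points higher than SSD, Faster R-CNN, and YOLOv4, respectively, indicating that the algorithm can accurately detect apples at the young-fruit stage. Ablation experiments on 480 validation images showed that keeping only the SE block of YOLOv4-SENL improved precision by 3.8 percentage points over YOLOv4; keeping only the 3 NL blocks of YOLOv4-SENL improved precision by 2.7 percentage points over YOLOv4; and exchanging the SE block with the NL blocks in YOLOv4-SENL improved precision by 4.1 percentage points over YOLOv4. These results indicate that the two visual attention mechanisms significantly improve the network's perception of young apples while adding only a small number of parameters. The results can serve as a reference for obtaining fruit information in fruit tree breeding research.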
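The channel-wise reweighting performed by an SE block can be illustrated with a minimal sketch. It follows the generic squeeze, excitation, and scale steps of an SE block and is not the authors' implementation: the function name `se_block`, the weight matrices `w1` and `w2`, the omission of biases, and the toy input are all assumptions made for illustration.

```python
import math


def se_block(feature, w1, w2):
    """Generic SE block on a C x H x W feature map stored as nested lists.

    w1 has shape (C/r) x C and w2 has shape C x (C/r); both are hypothetical
    weights, and biases are omitted to keep the sketch short.
    """
    # Squeeze: global average pooling yields one scalar per channel.
    squeezed = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature
    ]
    # Excitation: fully connected layer, ReLU, fully connected layer, sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale: every value in a channel is multiplied by that channel's gate.
    scaled = [
        [[value * gate for value in row] for row in channel]
        for channel, gate in zip(feature, gates)
    ]
    return scaled, gates


# Toy example: 2 channels of size 2 x 2, reduced to 1 hidden unit.
toy_feature = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 8.0]]]
toy_scaled, toy_gates = se_block(toy_feature, [[0.5, 0.5]], [[1.0], [-1.0]])
```

Each gate lies between 0 and 1, so the block can only attenuate or preserve a channel relative to the others; the spatial layout of the feature map is unchanged.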

Abstract: Accurate detection of young fruits is critical for obtaining growth data, and the high-throughput, automatic acquisition of phenotypic information is the basis of fruit tree breeding. Because young fruits are small and similar in color to the leaves, they are difficult to detect. In this study, an improved YOLOv4 network model (YOLOv4-SENL) was proposed to achieve highly efficient detection of young apples in a natural environment. The model combines two visual attention mechanisms, the Squeeze-and-Excitation (SE) block and the Non-Local (NL) block. The backbone network of YOLOv4 extracts high-level features, and the SE block then reorganizes and consolidates these features in the channel dimension to enhance the channel information. NL blocks were added to the three paths of the improved path aggregation network (PAN), combining non-local information with the local information obtained by convolution operations to enhance the features. The two visual attention mechanisms re-integrate high-level features from both the channel and non-local aspects, emphasizing channel information and long-range dependencies, which improves the network's ability to capture the characteristics of background and fruit. Finally, feature maps of different sizes were used to compute the coordinates and classes of young apples of different sizes. During training, the backbone network was initialized with weights pre-trained on the MS COCO dataset, and stochastic gradient descent was used to update the parameters. The training settings were as follows: the initial learning rate was 0.01, the number of training epochs was 350, the weight decay rate was 0.000 484, and the momentum factor was 0.937.
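The long-range dependencies captured by an NL block can be sketched as follows. The abstract does not state which non-local variant was used, so this sketch assumes the common embedded-Gaussian form with a residual connection; the function names `matvec` and `non_local_block` and the weight matrices `w_theta`, `w_phi`, `w_g`, and `w_z` are hypothetical, and the feature map is assumed to be flattened into a list of per-position channel vectors.

```python
import math


def matvec(weights, vector):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in weights]


def non_local_block(x, w_theta, w_phi, w_g, w_z):
    """Embedded-Gaussian non-local operation with a residual connection.

    x is a list of N positions, each a C-dimensional feature vector.
    All four weight matrices are assumptions made for this sketch.
    """
    theta = [matvec(w_theta, v) for v in x]
    phi = [matvec(w_phi, v) for v in x]
    g = [matvec(w_g, v) for v in x]
    output = []
    for i, query in enumerate(theta):
        # Similarity between position i and every position j.
        scores = [sum(a * b for a, b in zip(query, key)) for key in phi]
        # Softmax over all positions (shifted by the maximum for stability).
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        attention = [e / total for e in exps]
        # Weighted sum of the transformed features from all positions.
        y = [
            sum(a * g_j[k] for a, g_j in zip(attention, g))
            for k in range(len(g[0]))
        ]
        # Project back to C channels and add the input (residual path).
        z = matvec(w_z, y)
        output.append([z_k + x_k for z_k, x_k in zip(z, x[i])])
    return output


# Toy example: 2 positions with 2 channels each, identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
toy_x = [[1.0, 0.0], [3.0, 2.0]]
toy_out = non_local_block(toy_x, identity, identity, identity, identity)
```

Because every position attends to every other position, the output at one location can draw on features from anywhere in the map, whereas a convolution only sees its local neighborhood. The residual path means that a zero `w_z` leaves the input unchanged.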
A total of 3 000 images were collected in the natural environment, covering young fruits at different growth periods and under different interference factors. Four indexes were selected to evaluate the models in the experiments: precision, recall, F1 score, and average precision. After training on 1 920 images, the network achieved an average precision of 96.9% on the 600 test set images, which was 6.9, 1.5, and 0.2 percentage points higher than that of the SSD, Faster R-CNN, and YOLOv4 models, respectively. The YOLOv4-SENL model was 69 M larger than the SSD model, 59 M smaller than the Faster R-CNN model, and 11 M larger than the YOLOv4 model. These results indicated that young apple objects were detected accurately. The ablation experiment on the 480 validation set images showed that retaining only the SE block in YOLOv4-SENL improved the precision by 3.8 percentage points compared with the YOLOv4 model, and retaining only the three NL blocks in YOLOv4-SENL improved the precision by 2.7 percentage points compared with the YOLOv4 model. Exchanging the SE block and the NL blocks in YOLOv4-SENL improved the precision by 4.1 percentage points compared with the YOLOv4 model. These results indicated that the two visual attention mechanisms significantly improved the network's perception of young apples with only a small increase in parameters. The findings can provide a reference for obtaining growth information in fruit tree breeding.
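The four evaluation indexes named above can be computed as in the following minimal sketch. The abstract does not specify how average precision was calculated, so the all-point interpolation used here is an assumption, as are the function names and the toy detection list; matching detections to ground-truth boxes (for example by an IoU threshold) is assumed to have been done beforehand.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 score from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def average_precision(detections, num_ground_truth):
    """Area under the precision-recall curve (all-point interpolation).

    detections is a list of (confidence, is_true_positive) pairs; this
    input format is a hypothetical convention chosen for the sketch.
    """
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    points = []  # (recall, precision) after each ranked detection
    for _, is_hit in ranked:
        if is_hit:
            tp += 1
        else:
            fp += 1
        points.append((tp / num_ground_truth, tp / (tp + fp)))
    ap = 0.0
    previous_recall = 0.0
    for index, (recall, _) in enumerate(points):
        # Interpolated precision: best precision at this recall or higher.
        best_precision = max(p for _, p in points[index:])
        ap += (recall - previous_recall) * best_precision
        previous_recall = recall
    return ap


# Toy example: 3 detections evaluated against 2 ground-truth fruits.
toy_detections = [(0.9, True), (0.8, False), (0.7, True)]
toy_ap = average_precision(toy_detections, num_ground_truth=2)
toy_p, toy_r, toy_f1 = precision_recall_f1(tp=2, fp=1, fn=0)
```

Precision and recall depend on the confidence threshold chosen for counting detections, whereas average precision summarizes the whole precision-recall curve, which is why it is reported separately when comparing models.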

Efficient detection method for young apples based on the fusion of convolutional neural network and visual attention mechanism