Lightweight object detection method for Lingwu long jujube images based on improved SSD

Wang Yutan (王昱潭); Xue Junrui (薛君蕊)

doi:10.11975/j.issn.1002-6819.2021.19.020

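Summary: A conventional SSD (Single Shot MultiBox Detector) model that loads a pre-trained model cannot have its network structure changed, so it cannot be used when device memory is limited. To address this problem, this study proposes a lightweight object detection method for Lingwu long jujube images that reaches high detection accuracy without a pre-trained model. First, a Lingwu long jujube object detection dataset was built. Second, the proposed improved DenseNet was used as the backbone network, the first three extra layers of the SSD model were replaced with Inception modules, and a multi-level fusion structure was incorporated, yielding the improved SSD model. Comparative experiments then verified the effectiveness of the improved DenseNet and of the improved SSD model. On the Lingwu long jujube dataset, without loading a pre-trained model, the improved SSD model achieved a mean Average Precision (mAP) of 96.60%, a detection speed of 28.05 frames/s, and 1.99×10⁶ parameters. Its mAP was 2.02 and 0.05 percentage points higher than that of the SSD model and the SSD model (pre-trained), respectively, and it had 11.14×10⁶ fewer parameters than the SSD model, which meets the requirements of a lightweight network. Even without a pre-trained model, the improved SSD model performs object detection on Lingwu long jujube images well, and the results may offer a new approach for other object detection tasks that cannot load a pre-trained model.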
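The reported differences imply baseline figures that are not stated directly. The short sketch below only redoes that arithmetic; the derived baseline values are inferred from the reported differences and are not quoted from the paper, and all variable names are illustrative.

```python
# Figures reported for the improved SSD model (no pre-trained weights).
improved_map = 96.60          # mAP, percent
improved_params = 1.99        # parameters, in units of 10^6

# Reported differences relative to the two SSD baselines.
gain_over_ssd = 2.02              # percentage points
gain_over_ssd_pretrained = 0.05   # percentage points
param_saving = 11.14              # fewer parameters, in units of 10^6

# Derived (not quoted) baseline figures.
ssd_map = round(improved_map - gain_over_ssd, 2)
ssd_pretrained_map = round(improved_map - gain_over_ssd_pretrained, 2)
ssd_params = round(improved_params + param_saving, 2)

# Fraction of the SSD parameter count that the improved model removes.
param_reduction = param_saving / ssd_params

print(f"SSD mAP (derived): {ssd_map:.2f}%")
print(f"SSD (pre-trained) mAP (derived): {ssd_pretrained_map:.2f}%")
print(f"SSD parameters (derived): {ssd_params:.2f}e6")
print(f"Parameter reduction: {param_reduction:.1%}")
```

Under these figures the improved model keeps roughly 15% of the SSD parameter count while slightly exceeding the pre-trained baseline in mAP.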

Abstract: In the intelligent harvesting of Lingwu long jujubes, the complex working environment of picking robots limits both picking speed and the memory available on the device. The visual recognition system therefore needs a lighter network structure together with high detection accuracy. Most current object detection methods load a pre-trained model because it provides good initialization and fast convergence. However, two problems remain: 1) the network structure cannot be changed, so the model cannot be used when device memory is limited; 2) the ImageNet dataset may differ greatly from the dataset to be trained, which degrades training. Taking the SSD model as the basic framework, this research proposes a lightweight object detection method for images of Lingwu long jujubes that performs well without loading a pre-trained model.

Firstly, the Lingwu long jujube dataset was established. The 1 000 collected images were augmented to obtain 5 000 images, which were divided into 3 500 training images and 1 500 test images. The augmentation operations were random cropping, random vertical or horizontal flipping, and random adjustment of brightness, contrast, and saturation. The images were acquired with HUAWEI TRT-AL00A, Vivo Y79A, and Xiaomi 2014501 smartphones at resolutions of 3 016×4 032, 4 068×3 456, and 2 448×3 264. They were uniformly scaled to 300×300 to meet the input size requirement of SSD object detection. The dataset followed the PASCAL VOC format: labelling software was used to annotate the images, and the annotations were stored in the label folder in XML format.

Secondly, an improved DenseNet was built from Convolutional Block Attention Modules and two dense blocks containing 6 and 8 convolution groups, respectively. With the improved DenseNet as the backbone network, the improved SSD model was obtained by replacing the first three additional layers of the SSD model with Inception modules and combining them with a multi-level fusion structure. Without loading a pre-trained model, the improved SSD model achieved an mAP of 96.60%, a detection speed of 28.05 frames/s, and 1.99×10⁶ parameters. Its mAP was 2.02 and 0.05 percentage points higher than that of the SSD model and the SSD model (pre-trained), respectively, and it had 11.14×10⁶ fewer parameters than the SSD model, which meets the requirements of a lightweight network. These results can support the visual systems used in the intelligent harvesting of Lingwu long jujubes, and the method may also apply to detection tasks on medical and multispectral images.
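The augmentation operations named in the abstract can be sketched in a few lines. The example below is a dependency-free illustration, not the authors' pipeline: the image representation (rows of RGB tuples), the function names, the jitter range, the crop scale, and the choice of five augmented copies per image (one way to go from 1 000 to 5 000 images) are all assumptions. It also ignores bounding boxes, which a real detection pipeline would have to crop and flip together with the image.

```python
import random

def clamp(value):
    """Round to the nearest integer and keep it in the 0..255 range."""
    return max(0, min(255, int(round(value))))

def luma(px):
    """Perceptual gray level of an (r, g, b) pixel."""
    r, g, b = px
    return 0.299 * r + 0.587 * g + 0.114 * b

def _blend(img, anchor_of, factor):
    """Move every channel away from (factor > 1) or toward (factor < 1) an anchor."""
    out = []
    for row in img:
        new_row = []
        for px in row:
            a = anchor_of(px)
            new_row.append(tuple(clamp(a + (c - a) * factor) for c in px))
        out.append(new_row)
    return out

def adjust_brightness(img, factor):
    return _blend(img, lambda px: 0.0, factor)      # anchor: black

def adjust_contrast(img, factor):
    pixels = [px for row in img for px in row]
    mean = sum(luma(px) for px in pixels) / len(pixels)
    return _blend(img, lambda px: mean, factor)     # anchor: mean gray

def adjust_saturation(img, factor):
    return _blend(img, luma, factor)                # anchor: per-pixel gray

def random_crop(img, rng, min_scale=0.6):
    """Cut a random window covering at least min_scale of each side."""
    h, w = len(img), len(img[0])
    ch = rng.randint(max(1, int(h * min_scale)), h)
    cw = rng.randint(max(1, int(w * min_scale)), w)
    top, left = rng.randint(0, h - ch), rng.randint(0, w - cw)
    return [row[left:left + cw] for row in img[top:top + ch]]

def random_flip(img, rng):
    """Flip horizontally or vertically with equal probability."""
    if rng.random() < 0.5:
        return [row[::-1] for row in img]
    return [row[:] for row in img[::-1]]

def augment_once(img, rng, jitter=0.3):
    """Apply one randomly chosen operation (assumed policy)."""
    factor = rng.uniform(1 - jitter, 1 + jitter)
    ops = [
        lambda im: random_crop(im, rng),
        lambda im: random_flip(im, rng),
        lambda im: adjust_brightness(im, factor),
        lambda im: adjust_contrast(im, factor),
        lambda im: adjust_saturation(im, factor),
    ]
    return rng.choice(ops)(img)

def expand_dataset(images, rng, copies_per_image=5):
    """Assumed scheme: five augmented copies per source image."""
    return [augment_once(img, rng) for img in images for _ in range(copies_per_image)]
```

In practice an image library would replace these hand-written loops, and each augmented image would still be resized to 300×300 before being fed to the detector.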

Lightweight object detection method for Lingwu long jujube images based on improved SSD