Recognizing and locating apples using an improved BlendMask model
-
Abstract: Manual picking cannot fully meet the demand of large-scale apple harvesting at present, owing to its high labor intensity, high cost, and low efficiency. Fruit-picking robots have therefore drawn much attention in recent years for automatic picking and yield estimation. Among their components, the vision system largely determines the efficiency and stability of a picking robot: it must recognize fruits on the tree quickly and accurately under complex natural environments. This study aimed to identify and locate apples in natural environments under interference factors such as occluded fruits, varying ambient light, and viewing distance, where traditional vision methods struggle to segment fruit contours accurately. Taking apple as the test object, an instance segmentation and localization method based on an improved BlendMask model was proposed. The original backbone network was replaced with the high-resolution network HRNet (High-Resolution Net) to alleviate the loss of feature-map resolution in deep networks, and a convolutional block attention module (CBAM) was introduced into the fusion mask layer to improve the quality of the instance masks and thus the instance segmentation. Ablation experiments over a variety of popular backbone networks confirmed HRNet as the backbone, and the improved BlendMask model achieved a good balance between real-time performance and segmentation accuracy, making it suitable for fruit recognition and localization. An efficient algorithm was also designed to extract the surface point cloud of each instance: the instance mask was matched with the depth map to obtain the 3D surface point cloud of the apple instance; the tangential and outlier noises in the point cloud were removed by uniform downsampling and statistical filtering; and the center coordinates of the apple in 3D space were then estimated by the least-squares method (LSM) applied to the linearized form of the sphere equation, achieving center localization of the target. Other geometric indicators can also be incorporated into this localization framework to locate different kinds of fruit. The experimental results show that the average segmentation precision of the improved BlendMask model was 96.65%, with a detection speed of 34.51 frames/s. Compared with the original BlendMask model, the precision, recall, and average precision were improved by 5.48, 1.25, and 6.59 percentage points, respectively. Compared with the recent instance segmentation models SparseInst, FastInst, and PatchDCT, the average precision of the model lagged slightly behind by 0.29, 0.04, and 1.94 percentage points, respectively, whereas the detection speed was 6.11, 3.84, and 20.08 frames/s higher, respectively. The improved BlendMask model thus combines high segmentation accuracy with real-time speed. The findings can provide a technical reference for the vision system of apple-picking robots.
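The center-localization step above admits a compact formulation. As a minimal sketch of the linearized sphere fit described in the abstract (the matrix notation here is ours, not the paper's): for surface points $ ({x}_{i},{y}_{i},{z}_{i}) $, center $ (a,b,c) $, and radius $ r $, expanding the sphere equation

$ {({x}_{i}-a)}^{2}+{({y}_{i}-b)}^{2}+{({z}_{i}-c)}^{2}={r}^{2} $

gives

$ {x}_{i}^{2}+{y}_{i}^{2}+{z}_{i}^{2}=2a{x}_{i}+2b{y}_{i}+2c{z}_{i}+({r}^{2}-{a}^{2}-{b}^{2}-{c}^{2}) $,

which is linear in the unknowns $ \boldsymbol{w}={(2a,\;2b,\;2c,\;{r}^{2}-{a}^{2}-{b}^{2}-{c}^{2})}^{\mathrm{T}} $. Stacking one row $ ({x}_{i},{y}_{i},{z}_{i},1) $ per point into a matrix $ \boldsymbol{A} $ and the values $ {x}_{i}^{2}+{y}_{i}^{2}+{z}_{i}^{2} $ into a vector $ \boldsymbol{b} $, the ordinary least-squares solution $ \boldsymbol{w}={({\boldsymbol{A}}^{\mathrm{T}}\boldsymbol{A})}^{-1}{\boldsymbol{A}}^{\mathrm{T}}\boldsymbol{b} $ recovers the center as $ (a,b,c)=\frac{1}{2}({w}_{1},{w}_{2},{w}_{3}) $ and the radius as $ r=\sqrt{{w}_{4}+{a}^{2}+{b}^{2}+{c}^{2}} $.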
-
Keywords: image recognition / image segmentation / deep learning / apple / HRNet / CBAM / BlendMask
-
Figure 1. Schematic diagram of the inference pipeline of the improved BlendMask model
Note: Interpolate denotes the bilinear interpolation algorithm; ConvA and ConvB both denote convolutional layers. The dashed box encloses the improved part of the model, in which the CBAM module is connected to ConvA through a residual structure. Base is the result of the model's preliminary semantic segmentation, Atten is a tensor describing the approximate distribution of the instances, and Rb is the instance tensor; Rb and Atten are fused in the Blender module to generate the final instance mask.
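Figure 1 specifies only that CBAM is attached to ConvA through a residual connection. Below is a minimal PyTorch sketch of that wiring, assuming the standard CBAM design (channel attention followed by spatial attention); the channel sizes, reduction ratio, and kernel size are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Convolutional block attention module: channel attention, then spatial attention."""
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        # Channel attention: a shared MLP over average- and max-pooled descriptors
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        # Spatial attention: a conv over channel-wise average and max maps
        self.spatial = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        ca = torch.sigmoid(self.mlp(x.mean((2, 3), keepdim=True))
                           + self.mlp(x.amax((2, 3), keepdim=True)))
        x = x * ca
        sa = torch.sigmoid(self.spatial(
            torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)))
        return x * sa

class ConvAWithCBAM(nn.Module):
    """ConvA followed by a CBAM branch that is added back residually."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv_a = nn.Conv2d(in_channels, out_channels, 3, padding=1)
        self.cbam = CBAM(out_channels)

    def forward(self, x):
        y = self.conv_a(x)
        return y + self.cbam(y)  # residual connection around the attention branch

out = ConvAWithCBAM(256, 128)(torch.randn(1, 256, 64, 64))  # -> (1, 128, 64, 64)
```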
Figure 5. Diagram of the index relation between a depth map of width 3 and the point cloud output array
Note: The numbers 1 to 8 in the depth map denote pixel indices, which correspond to the one-dimensional index d of the point cloud array. $ {{\boldsymbol{v}}}_{i} $ in the point cloud array denotes the three-dimensional spatial vector of the point with index $ i $.
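The index relation in Figure 5 is the usual row-major flattening, so a pixel (row, col) and the point index d are interchangeable via d = row × W + col. Below is a short sketch of the mask-to-point-cloud matching step, assuming a pinhole camera model; fx, fy, cx, cy stand in for the actual depth-camera intrinsics.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth map of shape (H, W) into a flat (H*W, 3) point array.

    Row-major flattening keeps the correspondence d = row * W + col between
    the pixel index and the one-dimensional point-cloud index d.
    """
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]      # v: row indices, u: column indices
    z = depth.astype(np.float64)
    x = (u - cx) * z / fx          # pinhole back-projection
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# Recover the pixel from a point index with: row, col = divmod(d, W).
# For a width-3 depth map, pixel (2, 1) corresponds to d = 2 * 3 + 1 = 7.
```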
Figure 10. Segmentation results of different models in different environments
Note: The figure shows how different models segment apple instances pixel-wise in an RGB image. Pixels of the same color belong to the surface of the same apple, while different colors indicate different apples.
Figure 11. Comparison of the running time of two filtering algorithms
Note: k denotes the number of nearest neighbors, r denotes the filter radius parameter of the radius filtering algorithm, and α denotes the standard deviation ratio parameter of the statistical filtering algorithm.
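For reference, a minimal Open3D sketch of the comparison behind Figure 11, using the library's built-in statistical and radius outlier removal; the synthetic point cloud and the parameter values (k, r, α) are illustrative only.

```python
import time
import numpy as np
import open3d as o3d

# Synthetic stand-in for an apple surface cloud (in practice it comes from the depth map)
pcd = o3d.geometry.PointCloud()
pcd.points = o3d.utility.Vector3dVector(np.random.rand(20000, 3))
pcd = pcd.uniform_down_sample(every_k_points=2)   # uniform downsampling first

t0 = time.perf_counter()
# Statistical filter: drop points whose mean neighbor distance exceeds
# mean + alpha * std over the whole cloud (k = nb_neighbors, alpha = std_ratio)
stat_pcd, _ = pcd.remove_statistical_outlier(nb_neighbors=20, std_ratio=1.0)
t_stat = time.perf_counter() - t0

t0 = time.perf_counter()
# Radius filter: drop points with fewer than k neighbors inside radius r
rad_pcd, _ = pcd.remove_radius_outlier(nb_points=20, radius=0.05)
t_rad = time.perf_counter() - t0

print(f"statistical: {t_stat*1000:.1f} ms, radius: {t_rad*1000:.1f} ms")
```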
Table 1 Performance comparison of BlendMask models with different numbers of Base channels

Number of Base channels | Parameters/MB | Detection speed v/(frames·s⁻¹) | Mean average precision mAP/%
1 | 39.3 | 42.2 | 24.8
2 | 47.9 | 36.8 | 27.3
3 | 55.4 | 35.7 | 30.2
4 | 61.7 | 34.6 | 33.5
5 | 70.8 | 25.1 | 35.9
6 | 79.1 | 18.4 | 37.7
Table 2 Performance comparison of BlendMask models with different backbone networks

Backbone | Total parameters/MB | Precision P/% | Recall R/% | F1 score | Average precision AP/% | v/(frames·s⁻¹)
ResNet50 | 61.7 | 92.93 | 87.94 | 90.36 | 91.45 | 31.52
VGG16 | 69.4 | 93.07 | 88.41 | 90.68 | 92.30 | 29.02
EfficientNet | 41.8 | 87.32 | 83.83 | 85.54 | 86.59 | 43.91
MobileNetV2 | 40.9 | 89.68 | 84.09 | 86.79 | 87.78 | 45.72
Vision Transformer | 78.1 | 97.11 | 92.56 | 94.64 | 96.89 | 20.28
Swin Transformer | 77.5 | 97.36 | 93.73 | 95.46 | 96.91 | 21.69
HRNet | 63.6 | 96.75 | 91.10 | 94.62 | 95.83 | 33.98
Table 3 Performance comparison of BlendMask models with different structures

Model structure form | Total parameters/MB | P/% | R/% | F1 | AP/%
ResNet101 | 134.3 | 94.78 | 89.81 | 92.22 | 91.21
ResNet50 | 61.7 | 92.06 | 88.76 | 90.37 | 90.06
HRNet | 63.6 | 96.42 | 90.82 | 93.53 | 95.44
HRNet+CBAM1 | 67.9 | 97.54 | 91.06 | 94.18 | 96.65
HRNet+CBAM2 | 64.5 | 96.32 | 90.34 | 93.23 | 96.01
HRNet+CBAM1+CBAM2 | 68.6 | 97.69 | 92.38 | 94.96 | 96.98

Note: CBAM1 denotes the CBAM module embedded in the Bottom module network, and CBAM2 denotes the CBAM module embedded into the ConvA network as a residual connection.
Table 4 Performance comparison of different instance segmentation models

Models | P/% | R/% | F1 | AP/% | v/(frames·s⁻¹)
Mask R-CNN | 87.89 | 84.01 | 85.90 | 86.48 | 11.32
SOLOv2 | 92.57 | 86.44 | 89.40 | 89.21 | 26.56
YOLACT | 96.21 | 89.74 | 92.86 | 91.84 | 35.45
SparseInst | 97.82 | 92.45 | 95.06 | 96.94 | 28.40
FastInst | 97.76 | 90.31 | 93.69 | 96.69 | 30.67
PatchDCT | 99.67 | 92.42 | 95.43 | 98.59 | 14.43
Improved BlendMask | 97.54 | 91.06 | 94.18 | 96.65 | 34.51
Table 5 Localization performance of the improved BlendMask model at different distances

Distance/m | VMSE/cm³ | DMSE/cm | Vs/ms
0.5 | 5.32 | 1.63 | 31.23
1.0 | 13.53 | 4.64 | 31.78
1.5 | 17.98 | 5.76 | 32.12
2.0 | 19.12 | 7.39 | 33.32
2.5 | 25.41 | 8.45 | 32.45
3.0 | 30.61 | 7.89 | 32.29
Average | 18.66 | 5.96 | 32.19

Note: VMSE denotes the volume mean square error, DMSE denotes the ranging mean square error, and Vs denotes the computation speed of recognition and localization.
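To make the localization step evaluated in Table 5 concrete, here is a NumPy sketch of the linearized least-squares sphere fit (the function name and the synthetic test points are ours, not the paper's).

```python
import numpy as np

def fit_sphere(points):
    """Least-squares sphere fit via the linearized sphere equation.

    Solves A w = b with rows A_i = (x_i, y_i, z_i, 1) and b_i = x_i^2 + y_i^2 + z_i^2,
    where w = (2a, 2b, 2c, r^2 - a^2 - b^2 - c^2); returns center (a, b, c) and radius r.
    """
    pts = np.asarray(points, dtype=np.float64)
    A = np.column_stack([pts, np.ones(len(pts))])
    b = (pts ** 2).sum(axis=1)
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    center = w[:3] / 2.0
    radius = np.sqrt(w[3] + center @ center)
    return center, radius

# Quick check on a noiseless synthetic apple: center (1, 2, 3) m, radius 0.04 m
rng = np.random.default_rng(0)
d = rng.normal(size=(500, 3))
d /= np.linalg.norm(d, axis=1, keepdims=True)
center, radius = fit_sphere([1.0, 2.0, 3.0] + 0.04 * d)
print(center, radius)   # ~ [1. 2. 3.] and ~0.04
```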