Livestock 3D Posture Estimation Method Based on Attention Mechanism for Binocular Stereo Matching

    • Abstract: When monitoring the individual behavior of group-housed livestock, accurate estimation of spatial posture is essential for behavior analysis. Compared with traditional 2D methods, 3D posture estimation offers clear advantages in handling occlusion and providing precise spatial information. Current 3D posture estimation techniques are applied mainly to human bodies and autonomous driving, and typically rely on expensive measurement equipment and large datasets, which makes them difficult to adopt quickly in animal behavior research and production management; a low-cost and efficient method for measuring animal posture is therefore urgently needed. This study proposes a general method for livestock 3D posture estimation based on binocular stereo matching. First, an improved deep-learning stereo matching model is used to obtain depth information; then, a top-down 2D posture estimation model extracts target bounding boxes and detects key points; finally, the key-point locations are mapped back into image space and fused with the stereo matching results to obtain the 3D posture. Because matching accuracy depends on precise depth information, and the main difficulties in stereo matching lie in thin structures and weakly textured regions, the ACLNet stereo matching model is built around an attention mechanism and a convolutional gated recurrent unit (ConvGRU) iterative refinement scheme: by encoding the relative depth ordering of image textures, the model's attention is restricted to the neighborhood of the true disparity, and high-precision depth information is recovered step by step through residual updates. The effectiveness of the proposed model is verified through ablation experiments on the Scene Flow dataset and generalization tests on the Middlebury dataset. The results show that ACLNet achieves an end-point error (EPE) of 0.45 on Scene Flow, close to the current state of the art and a reduction of 0.38 compared with the baseline model without the attention and ConvGRU mechanisms; it also generalizes well to real-world datasets such as Middlebury, and reaches an EPE of 0.56 on a goat depth dataset. On the goat 3D posture test set, the improved model attains a mean per-joint position error (MPJPE) of 45.7 mm, 21.1 mm lower than before the improvement. In 3D posture estimation experiments with goats as test subjects, accurate 3D postures are obtained without additional training, demonstrating the strong generalization ability and versatility of the algorithm. The method recovers accurate 3D posture from binocular images alone, verifying the feasibility of high-precision livestock 3D posture estimation with a simple binocular vision system and providing a viable solution for 3D posture estimation with low-cost stereo cameras.
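For reference, the fusion step at the end of the pipeline reduces to standard rectified-stereo geometry: a keypoint at pixel (u, v) with disparity d lies at depth Z = f·B/d, from which X and Y follow by back-projection through the camera intrinsics. The sketch below illustrates this; the function and parameter names are assumptions for illustration, not the paper's code, and sampling a single disparity pixel per keypoint is the simplest possible choice (a robust implementation might aggregate a small window around each keypoint).

```python
import numpy as np

def keypoints_to_3d(keypoints_2d, disparity, fx, fy, cx, cy, baseline):
    """Back-project 2D keypoints into 3D camera coordinates.

    keypoints_2d : (N, 2) array of (u, v) pixel coordinates from the
                   top-down 2D posture model.
    disparity    : (H, W) disparity map predicted by the stereo model.
    fx, fy       : focal lengths in pixels; cx, cy: principal point.
    baseline     : distance between the two camera centers, in meters.
    Returns an (N, 3) array of (X, Y, Z) joint positions in meters.
    """
    joints_3d = np.zeros((len(keypoints_2d), 3))
    for i, (u, v) in enumerate(keypoints_2d):
        d = disparity[int(round(v)), int(round(u))]  # disparity at the keypoint
        z = fx * baseline / max(d, 1e-6)             # stereo relation Z = f*B/d
        joints_3d[i] = [(u - cx) * z / fx,           # X
                        (v - cy) * z / fy,           # Y
                        z]                           # Z
    return joints_3d
```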

       

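The abstract does not spell out ACLNet's internals, so the following is only a generic sketch of ConvGRU-based iterative residual refinement in the spirit of RAFT-style updaters: a convolutional GRU maintains a spatially structured hidden state, and a small head predicts a disparity residual at each iteration. All module names and channel counts here are assumptions.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """A convolutional GRU cell: the gates are 3x3 convolutions rather than
    fully connected layers, so the hidden state keeps its spatial layout."""
    def __init__(self, hidden_dim, input_dim):
        super().__init__()
        self.convz = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)
        self.convr = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)
        self.convq = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)

    def forward(self, h, x):
        hx = torch.cat([h, x], dim=1)
        z = torch.sigmoid(self.convz(hx))   # update gate
        r = torch.sigmoid(self.convr(hx))   # reset gate
        q = torch.tanh(self.convq(torch.cat([r * h, x], dim=1)))
        return (1 - z) * h + z * q

def refine_disparity(disp, hidden, features, gru, delta_head, iters=8):
    """Iteratively add predicted residuals to an initial disparity map,
    recovering fine detail step by step as described in the abstract."""
    for _ in range(iters):
        x = torch.cat([features, disp], dim=1)  # condition on context + estimate
        hidden = gru(hidden, x)
        disp = disp + delta_head(hidden)        # residual update
    return disp

# Example wiring (channel counts are illustrative):
# gru = ConvGRUCell(hidden_dim=64, input_dim=32 + 1)
# delta_head = nn.Conv2d(64, 1, 3, padding=1)
```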

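Both reported metrics have one-line definitions: EPE is the mean disparity error in pixels, and MPJPE is the mean Euclidean distance between predicted and ground-truth 3D joint positions (here in mm). A minimal sketch, with the optional validity mask as an assumption:

```python
import numpy as np

def epe(disp_pred, disp_gt, valid=None):
    """End-point error: mean absolute disparity difference, optionally
    restricted to pixels with valid ground truth."""
    err = np.abs(disp_pred - disp_gt)
    return err[valid].mean() if valid is not None else err.mean()

def mpjpe(joints_pred, joints_gt):
    """Mean per-joint position error: average Euclidean distance (same
    unit as the inputs, e.g. mm) between predicted and true 3D joints."""
    return np.linalg.norm(joints_pred - joints_gt, axis=-1).mean()
```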