Livestock 3D Posture Estimation Method Based on Attention Mechanism for Binocular Stereo Matching

Abstract: When monitoring the individual behaviors of livestock in a group, accurate estimation of the spatial postures of livestock is of great significance for behavioral analysis. In contrast to traditional 2D methods, 3D pose estimation holds prominent advantages in resolving occlusion issues and providing precise spatial information. Currently, 3D pose estimation techniques are predominantly applied in the domains of the human body and autonomous driving. However, these applications generally rely on costly measuring equipment and extensive datasets, impeding their rapid adoption in animal behavior research and production management. Consequently, there is an urgent demand for a low-cost and efficient approach to measuring animal postures. To this end, this study proposes a universal method for livestock 3D pose estimation based on binocular stereo matching. First, an enhanced binocular stereo matching deep learning model is used to acquire depth information. Next, a 2D pose estimation model based on the TopDown approach is adopted to extract target detection boxes and detect keypoints. Finally, the positional information of the keypoints is mapped back to image space and fused with the results of the stereo matching model to obtain the 3D pose. Since matching accuracy hinges on accurate depth information, and the main difficulties in stereo matching lie in thin structures and weakly textured regions, the ACLNet stereo matching model is constructed using an attention mechanism and a Convolutional Gated Recurrent Unit (ConvGRU) iterative refinement mechanism. The model encodes the relative depth levels of image textures, confines attention to the vicinity of the true disparity, and progressively recovers high-precision depth information in a residual fashion. The efficacy of the proposed model is validated through ablation experiments on the Scene Flow dataset and generalization experiments on the Middlebury dataset. The results show that the end-point error (EPE) of ACLNet on the Scene Flow dataset is 0.45, close to the state of the art in the field; compared with the baseline model without the attention and ConvGRU mechanisms, the EPE decreases by 0.38. Favorable generalization results are also obtained on real datasets such as Middlebury, and the EPE on the goat depth dataset is 0.56. The mean per-joint position error (MPJPE) of the improved model on the goat 3D pose test set reaches 45.7 mm, a reduction of 21.1 mm over the model before improvement. In the 3D pose estimation experiment using goats as test samples, accurate 3D poses are obtained without additional training, demonstrating the strong generalization ability and versatility of the algorithm. The method accurately obtains 3D postures from binocular images alone, verifying the feasibility of achieving high-precision livestock 3D pose estimation with a simple binocular vision system and providing a viable solution for 3D pose estimation with low-cost binocular cameras.
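The fusion step and the evaluation metrics described above can be sketched with the standard pinhole stereo model. For a rectified binocular rig with focal length f (in pixels), baseline B, and principal point (cx, cy), a pixel (u, v) with disparity d lifts to depth Z = f·B/d, X = (u − cx)·Z/f, Y = (v − cy)·Z/f. Below is a minimal sketch under these standard assumptions; the function names and parameters are illustrative and not the paper's actual implementation:

```python
import numpy as np

def lift_keypoints_to_3d(keypoints_2d, disparity, f, baseline, cx, cy):
    """Lift 2D keypoints (u, v) to 3D camera coordinates using a disparity
    map from stereo matching (standard rectified pinhole stereo model).
    Hypothetical helper; camera parameters are assumed known from calibration."""
    pts3d = []
    for u, v in keypoints_2d:
        d = disparity[int(round(v)), int(round(u))]  # disparity at the keypoint
        Z = f * baseline / d                         # depth from disparity
        X = (u - cx) * Z / f
        Y = (v - cy) * Z / f
        pts3d.append((X, Y, Z))
    return np.array(pts3d)

def epe(pred_disp, gt_disp):
    """End-point error: mean absolute disparity error over all pixels."""
    return float(np.mean(np.abs(pred_disp - gt_disp)))

def mpjpe(pred_joints, gt_joints):
    """Mean per-joint position error: average Euclidean distance between
    predicted and ground-truth 3D joint positions."""
    return float(np.mean(np.linalg.norm(pred_joints - gt_joints, axis=1)))
```

For example, with f = 1000 px, B = 0.1 m, and a disparity of 50 px, a keypoint lifts to a depth of Z = 1000 × 0.1 / 50 = 2.0 m.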

       
