Abstract:
When monitoring the individual behaviors of livestock kept in groups, accurate estimation of their spatial postures is of great significance for behavioral analysis. In contrast to traditional 2D methods, 3D pose estimation offers clear advantages in resolving occlusions and providing precise spatial information. At present, 3D pose estimation techniques are applied mainly to the human body and to autonomous driving, and these applications generally rely on costly measuring equipment and large datasets, which hinders their rapid adoption in animal behavior research and production management. There is therefore an urgent need for a low-cost, efficient approach to measuring animal postures.

To this end, this study proposes a universal method for livestock 3D pose estimation based on binocular stereo matching. First, an improved deep-learning stereo matching model is used to acquire depth information. Next, a 2D pose estimation model based on the top-down approach extracts target detection boxes and detects key points. Finally, the key-point positions are mapped back into image space and combined with the output of the stereo matching model to obtain the 3D pose.

Because the accuracy of the recovered 3D pose hinges on accurate depth information, and the main difficulties in stereo matching lie in thin structures and weakly textured regions, the ACLNet stereo matching model is constructed from an attention mechanism and a Convolutional Gated Recurrent Unit (ConvGRU) iterative refinement mechanism. The model encodes the relative depth levels of image textures, confines its attention to the neighborhood of the true disparity, and progressively recovers high-precision depth information in a residual fashion.

The efficacy of the proposed model is validated through ablation experiments on the Scene Flow dataset and generalization experiments on the Middlebury dataset. ACLNet achieves an end-point error (EPE) of 0.45 on Scene Flow, approaching the current state of the art, and reduces the EPE by 0.38 compared with a baseline that uses neither the attention mechanism nor the ConvGRU mechanism. Good generalization is also obtained on real datasets such as Middlebury, and the EPE on the goat depth dataset is 0.56. The mean per-joint position error (MPJPE) of the refined model on the goat 3D pose test set reaches 45.7 mm, a reduction of 21.1 mm relative to the pre-refinement result. In the 3D pose estimation experiment with goats as test samples, accurate 3D poses are obtained without additional training, demonstrating the strong generalization ability and versatility of the algorithm.

The method obtains precise 3D postures from binocular images alone, verifying the feasibility of high-precision livestock 3D pose estimation with a simple binocular vision system and offering a viable solution for 3D pose estimation with low-cost binocular cameras.
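As a concrete illustration of the final step of the pipeline (mapping detected 2D key points into 3D space using the predicted disparity), a minimal sketch is given below. It assumes a rectified binocular rig with known focal length, principal point, and baseline; the function and variable names are illustrative and are not taken from the paper's implementation.

```python
import numpy as np

def lift_keypoints_to_3d(keypoints_2d, disparity_map, fx, fy, cx, cy, baseline):
    """Back-project 2D key points into 3D camera coordinates (illustrative sketch).

    keypoints_2d  : (N, 2) array of (u, v) pixel coordinates from the 2D pose model.
    disparity_map : (H, W) disparity predicted by the stereo matching network (pixels).
    fx, fy, cx, cy: intrinsics of the rectified left camera (pixels).
    baseline      : distance between the two camera centers (meters).
    Returns an (N, 3) array of (X, Y, Z) in meters; rows stay NaN where disparity is invalid.
    """
    points_3d = np.full((len(keypoints_2d), 3), np.nan)
    for i, (u, v) in enumerate(keypoints_2d):
        d = disparity_map[int(round(v)), int(round(u))]
        if d <= 0:                      # no valid stereo match at this pixel
            continue
        Z = fx * baseline / d           # depth from rectified stereo geometry
        X = (u - cx) * Z / fx           # back-project along the image x axis
        Y = (v - cy) * Z / fy           # back-project along the image y axis
        points_3d[i] = (X, Y, Z)
    return points_3d
```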
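Likewise, the ConvGRU-based residual refinement mentioned above can be summarized in spirit by the following minimal PyTorch sketch: a convolutional GRU cell updates a hidden feature map, and a small head predicts a disparity increment at each iteration. The module names, feature dimensions, and the toy loop are assumptions for illustration and do not reproduce ACLNet itself.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Minimal convolutional GRU cell; hidden state and input are 2D feature maps."""
    def __init__(self, hidden_dim=64, input_dim=64, k=3):
        super().__init__()
        p = k // 2
        self.conv_z = nn.Conv2d(hidden_dim + input_dim, hidden_dim, k, padding=p)
        self.conv_r = nn.Conv2d(hidden_dim + input_dim, hidden_dim, k, padding=p)
        self.conv_q = nn.Conv2d(hidden_dim + input_dim, hidden_dim, k, padding=p)

    def forward(self, h, x):
        hx = torch.cat([h, x], dim=1)
        z = torch.sigmoid(self.conv_z(hx))                          # update gate
        r = torch.sigmoid(self.conv_r(hx))                          # reset gate
        q = torch.tanh(self.conv_q(torch.cat([r * h, x], dim=1)))   # candidate hidden state
        return (1 - z) * h + z * q

# Toy refinement loop: the hidden state is driven by matching/context features
# (random placeholders here), and a small head predicts a residual disparity update.
gru = ConvGRUCell()
delta_head = nn.Conv2d(64, 1, 3, padding=1)        # maps hidden state to a disparity increment
h = torch.zeros(1, 64, 60, 80)                     # hidden state at reduced resolution (example)
disparity = torch.zeros(1, 1, 60, 80)              # initial disparity estimate
for _ in range(4):
    features = torch.randn(1, 64, 60, 80)          # placeholder for cost-volume / context features
    h = gru(h, features)
    disparity = disparity + delta_head(h)          # residual update toward the true disparity
```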