Abstract:
With the rapid development of intelligent agriculture and animal husbandry, accurate and rapid recognition of animal posture and abnormal behavior has become essential for effective disease prevention in large-scale breeding. The success of human pose estimation can be attributed to large-scale datasets and complex network models trained with deep learning. However, compared with human pose estimation, only a few studies have addressed the estimation of animal posture. In this study, an improved keypoint detection model for animal skeletons was proposed to improve accuracy and robustness using a Transformer encoder and scale fusion. First, an improved Transformer encoder was introduced into the feature extraction layer of the HRNet network to capture the spatial constraint relationships between keypoints, yielding better detection on the small-scale sheep dataset. In the Transformer encoder, a sine position embedding module was introduced to improve the utilization of spatial position relationships, and the Hardswish activation function was adopted to accelerate the convergence of training. Secondly, a multi-scale information fusion module was introduced to improve the model's ability to learn features of different dimensions, so that the improved model could also be applied in more practical scenarios. A distribution-aware coordinate representation strategy was adopted to reduce the quantization error introduced when encoding and decoding coordinates from the small-scale heat map, with the mean square error used as the loss function. Furthermore, a keypoint dataset of sheep skeletons was collected and annotated to verify the effectiveness and generalization of the model, and the Siberian tiger dataset ATRW was also added as a training set.
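The sine position embedding and Hardswish activation mentioned above are standard components; a minimal plain-Python sketch of both (illustrative only, not the authors' implementation, with `num_positions` and `dim` as hypothetical parameters) is:

```python
import math

def sine_position_embedding(num_positions, dim):
    # Standard sinusoidal position embedding: even channels use sine,
    # odd channels use cosine, with geometrically spaced wavelengths.
    pe = [[0.0] * dim for _ in range(num_positions)]
    for pos in range(num_positions):
        for i in range(0, dim, 2):
            angle = pos / (10000.0 ** (i / dim))
            pe[pos][i] = math.sin(angle)
            if i + 1 < dim:
                pe[pos][i + 1] = math.cos(angle)
    return pe

def hardswish(x):
    # Hardswish(x) = x * ReLU6(x + 3) / 6; piecewise-linear and cheap,
    # which helps training converge faster than smoother alternatives.
    return x * min(max(x + 3.0, 0.0), 6.0) / 6.0
```

In a real model the embedding would be added to the flattened feature tokens before the Transformer encoder's attention layers.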
The experimental results showed that accuracies of 77.1% and 89.7% were achieved on the sheep and Siberian tiger keypoint datasets, respectively, indicating better performance with a smaller amount of computation than the other models. The detection time for a single image was 14 ms, fully meeting the demand for real-time detection. Cross-domain tests demonstrated better detection of skeletal keypoints on datasets of other animals, such as cattle and horses, indicating the excellent interpretability of the Transformer encoder. The network obtained global constraint relationships at higher resolution together with fine-grained local image features. The overall performance of the model surpassed that of the other models, owing to the reduced number of parameters and computations at better accuracy. Consequently, the improved model performed better on small-scale datasets and small-resolution inputs, making it particularly suitable for practical applications. Experiments on a variety of animals were implemented to prove the cross-domain and generalization ability of the model. These findings can provide effective technical support for accurately detecting the keypoints of animal skeletons for animal behavior analysis in intelligent animal husbandry.
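The distribution-aware coordinate representation reduces the quantization error of reading keypoint coordinates off a small heat map by refining the integer argmax with a second-order Taylor expansion of the log-heatmap. A minimal 1-D sketch of this decoding idea (illustrative only, not the authors' implementation) is:

```python
import math

def dark_decode_1d(heatmap):
    # Integer peak location of the predicted heat map.
    m = max(range(len(heatmap)), key=heatmap.__getitem__)
    if 0 < m < len(heatmap) - 1:
        # For a Gaussian heat map the log is quadratic, so a second-order
        # Taylor expansion around the peak gives the sub-pixel offset
        # -f'(m) / f''(m) in closed form (the distribution-aware idea).
        logh = [math.log(max(v, 1e-10)) for v in heatmap]
        d1 = (logh[m + 1] - logh[m - 1]) / 2.0        # first derivative
        d2 = logh[m + 1] - 2.0 * logh[m] + logh[m - 1]  # second derivative
        if d2 != 0.0:
            return m - d1 / d2
    return float(m)
```

For an ideal Gaussian heat map this recovers the true sub-pixel center exactly, whereas the plain argmax is off by up to half a heat-map cell, which matters at small heat-map resolutions.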