Abstract:
Automatic behavior recognition for lactating sows can greatly benefit pig farming. However, recognition accuracy has remained limited for behaviors with similar visual characteristics. In this study, an audio-visual fusion model was proposed for classifying the behaviors of lactating sows. A three-branch deep neural network (AVSlowFast) was employed as the backbone, and the Gaussian context transformer (GCT) attention mechanism was introduced to optimize the model without increasing the number of parameters. The experiment was conducted at Lihua Pig Farm in Changzhou City, Jiangsu Province, China, from August 1 to September 10, 2023. Ten Landrace sows, all within three days postpartum, were randomly selected as research subjects; they differed markedly in litter environment and farrowing house. Video and audio data were collected with cameras and sound recorders, respectively, and a dataset was constructed from these recordings. Sow behaviors were manually labeled into six classes: breastfeeding, eating, drinking, sleeping, fence-hitting, and daily activities. Three recognition models with different input features were compared to verify the audio-visual fusion: an MFCC-Vision Transformer using audio features, SlowFast using visual features, and AVSlowFast using fused audio-visual features. The results showed that the multimodal model (AVSlowFast) identified the six sow behaviors with markedly higher accuracy than the two single-modal models, Vision Transformer and SlowFast. Notably, AVSlowFast performed best on behaviors with similar visual features, such as eating, drinking, and fence-hitting. Nevertheless, the recognition accuracy for sleeping decreased slightly under the multimodal approach compared with the vision-only model, because sleeping lacks distinctive audio features and the added audio information therefore introduced interference. Attention mechanisms (SENet and GCT) were then introduced to improve recognition performance, particularly for sleeping, and the accuracy of sleeping-behavior recognition increased accordingly. The attention mechanisms effectively adjusted the weights of feature channels during iterative training, mitigating the interference caused by the audio signals. GCT-AVSlowFast achieved 94.3% precision and 94.6% recall, outperforming SENet-AVSlowFast, and its average F1-score was 12.7 percentage points higher than that of the single-modal SlowFast. Finally, because GCT-AVSlowFast delivers this performance without additional model parameters, it is suitable for deployment in resource-limited pig farm environments. The findings also provide an effective approach for implementing multimodal behavior monitoring in livestock and poultry.
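For illustration only (this sketch is not taken from the paper), the following minimal PyTorch module shows how a Gaussian-context channel attention of the GCT type can reweight feature channels without adding trainable parameters, consistent with the abstract's claim; the class name GCTAttention, the Gaussian width c = 2.0, and the pooling layout are illustrative assumptions.

    import torch
    import torch.nn as nn

    class GCTAttention(nn.Module):
        # Parameter-free Gaussian-context channel attention (GCT-style sketch).
        # The Gaussian width `c` is a fixed hyperparameter, not a learned weight,
        # so the block adds no trainable parameters to the backbone.
        def __init__(self, c: float = 2.0, eps: float = 1e-5):
            super().__init__()
            self.c = c
            self.eps = eps

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (N, C, ...) feature map, e.g. (N, C, T, H, W) for a video branch.
            spatial_dims = tuple(range(2, x.dim()))
            z = x.mean(dim=spatial_dims, keepdim=True)  # global average pooling -> (N, C, 1, ...)
            z = (z - z.mean(dim=1, keepdim=True)) / (z.std(dim=1, keepdim=True) + self.eps)  # normalize across channels
            a = torch.exp(-z.pow(2) / (2.0 * self.c ** 2))  # Gaussian excitation: channel weights in (0, 1]
            return x * a  # reweight channels, damping outlier (e.g. noise-dominated) responses

    # Example: reweighting a (batch, channel, time, height, width) feature tensor.
    feats = torch.randn(2, 64, 8, 14, 14)
    out = GCTAttention()(feats)
    print(out.shape)  # torch.Size([2, 64, 8, 14, 14])

Because the channel weights are computed from the feature statistics themselves rather than from learned parameters, such a block can be inserted into an existing backbone without increasing model size, which is the property the abstract highlights for deployment on resource-limited farms.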