Predicting the respiratory rate of dairy cows using hyperparameter-optimized random forest models

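    • Summary: Respiratory rate is one of the key indicators of how severely the environment is heat-stressing dairy cows. This study proposes a random forest (RF) model that accurately predicts the respiratory rate of individual dairy cows under production conditions. To balance model accuracy against computational cost, the model hyperparameters were optimized with a genetic algorithm (GA), differential evolution (DE), particle swarm optimization (PSO), and Bayesian optimization (BO), and the resulting models were compared with artificial neural network (ANN) and extreme gradient boosting (XGBoost) models tuned by grid search (GS). The baseline RF model performed best when its inputs were the adjusted temperature-humidity index (ATHI), which integrates the environmental parameters, together with time block, milk production, days of lactation, body posture, and number of calvings. On that basis, the RF models tuned by the four intelligent optimization algorithms outperformed GS-ANN and GS-XGBoost. BO-RF gave the best overall performance, with a coefficient of determination, mean absolute error, mean absolute percentage error, and root mean square error of 0.614, 7.723 breaths/min, 14.4%, and 9.737 breaths/min, respectively, and its hyperparameter optimization took about 1/220 of the time required by DE-RF. Feature importance analysis showed that the inputs affect respiratory rate to different degrees: ATHI was the most influential factor, with a relative importance (RI) of 0.73, followed by time block (RI = 0.09) and milk production (RI = 0.07). The study provides an effective method and a basis for dairy production, health assessment, and precise control of the barn environment.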

       

      Abstract: Heat stress threatens dairy cows during periods of high temperature. Dairy farms can monitor the respiratory rate (breaths per minute, bpm) of their cows and then apply timely environmental control measures to mitigate heat stress, improving both animal welfare and production performance. However, previous empirical or numerical models cannot accurately predict the respiratory rate of individual dairy cows in real environments. Data-driven machine learning models are expected to better represent the relationship between the relevant variables and the actual respiratory rate. In this study, a random forest-based model was developed to predict cow respiratory rate. Five hyperparameters were tuned using a genetic algorithm (GA), differential evolution (DE), particle swarm optimization (PSO), and Bayesian optimization (BO). These random forest-based models were compared with artificial neural network (ANN) and extreme gradient boosting (XGBoost) models tuned by grid search (GS). The data were obtained from a commercial dairy farm in North China. Thermal environment parameters (air temperature, relative humidity, wind speed, and solar radiation) were combined with sampling time blocks and cattle-related variables. The thermal environment variables were also used to construct four heat stress indices: the temperature-humidity index (THI), adjusted temperature-humidity index (ATHI), equivalent temperature index for cattle (ETIC), and skin temperature index (STIC). The dataset comprised 3005 records, with 80% allocated for training and 20% for testing. Hyperparameter optimization was conducted on the training set using 5-fold cross-validation.
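The data protocol described above (3005 records, an 80/20 hold-out split, then 5-fold cross-validation on the training portion) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the shuffling seed, and the interleaved fold assignment are all hypothetical choices made for the example.

```python
import random

def split_train_test(n_records, test_fraction=0.2, seed=0):
    """Shuffle record indices and hold out a test fraction (seed is an assumption)."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    n_test = round(n_records * test_fraction)
    return indices[n_test:], indices[:n_test]

def k_fold_splits(train_indices, k=5):
    """Yield (fit, validation) index lists; each training record is validated once."""
    folds = [train_indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, validation

train_idx, test_idx = split_train_test(3005)
cv_splits = list(k_fold_splits(train_idx, k=5))

print(len(train_idx), len(test_idx))              # 2404 601
print([len(val) for _, val in cv_splits])         # [481, 481, 481, 481, 480]
```

In a tuning loop, each candidate hyperparameter set would typically be scored by its mean validation error over the five splits, with the held-out test indices used only once for the final evaluation.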
The correlation analysis showed that all thermal environment variables and heat stress indices were highly significantly correlated (P<0.01) with the respiratory rate of cows. The thermal environment variables were also significantly correlated (P<0.01) with the heat stress indices. To account for this feature collinearity, the overall dataset was partitioned into five sub-datasets: EP, THI, ATHI, ETIC, and STIC. The sub-datasets differed in their environmental inputs: the EP dataset contained the four environmental parameters, while each of the others contained the corresponding index instead. The baseline random forest model achieved its highest prediction accuracy when the adjusted temperature-humidity index, time block, milk production, days of lactation, body posture, and number of calvings were used as inputs. On the test set of the ATHI dataset, the RF models tuned by the four intelligent optimization algorithms outperformed GS-ANN and GS-XGBoost. The random forest model optimized by differential evolution (DE-RF) had the highest accuracy, with a coefficient of determination (R²), mean absolute error (MAE), mean absolute percentage error (MAPE), and root mean square error (RMSE) on the test set of 0.614, 7.708 bpm, 14.4%, and 9.730 bpm, respectively. The Bayesian-optimized random forest model (BO-RF) followed closely, with R², MAE, MAPE, and RMSE on the test set of 0.614, 7.723 bpm, 14.4%, and 9.737 bpm, respectively. Notably, BO-RF required only about 1/200 of the hyperparameter optimization time taken by DE-RF. BO-RF therefore offered the best overall performance in terms of prediction accuracy and computational time.
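For readers less familiar with the four evaluation metrics, the sketch below computes them from paired observed and predicted respiratory rates. It assumes the standard textbook definitions, since the abstract does not state the exact formulas used; the function name and the three sample values are hypothetical and are not data from the study.

```python
import math

def regression_metrics(observed, predicted):
    """Return R2, MAE (bpm), MAPE (%), and RMSE (bpm) using standard definitions."""
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]
    mean_observed = sum(observed) / n
    ss_res = sum(e * e for e in errors)                        # residual sum of squares
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)   # total sum of squares
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "MAE": sum(abs(e) for e in errors) / n,
        "MAPE": 100.0 * sum(abs(e) / abs(o) for e, o in zip(errors, observed)) / n,
        "RMSE": math.sqrt(ss_res / n),
    }

# Hypothetical respiratory rates in breaths per minute, for illustration only.
observed_bpm = [40.0, 50.0, 60.0]
predicted_bpm = [42.0, 50.0, 57.0]
metrics = regression_metrics(observed_bpm, predicted_bpm)
print(metrics)
```

MAE and RMSE carry the unit of the target (bpm), MAPE is a percentage, and R² is dimensionless, which is why only two of the four reported values have units.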
Subsequently, the relative importance (RI) of the input features was determined using BO-RF. The adjusted temperature-humidity index was the most important feature for model prediction (RI = 0.73), followed by time block (RI = 0.09), milk production (RI = 0.07), and days of lactation (RI = 0.06). Body posture (RI = 0.03) and number of calvings (RI = 0.02) had only a marginal impact on the predictions. These findings can inform the design of precise, intelligent environmental control systems in dairy barns.
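The reported RI values add up to 1.00, which suggests they are normalized shares of total importance (an inference from the numbers, not something the abstract states). The short sketch below ranks the six reported values and accumulates them; the variable names are illustrative.

```python
# Relative importance values as reported for the BO-RF model.
relative_importance = {
    "ATHI": 0.73,
    "time block": 0.09,
    "milk production": 0.07,
    "days of lactation": 0.06,
    "body posture": 0.03,
    "number of calvings": 0.02,
}

# Rank features from most to least important.
ranked = sorted(relative_importance.items(), key=lambda item: item[1], reverse=True)
total = sum(relative_importance.values())

# Running share of importance covered by the top-ranked features.
cumulative = []
running = 0.0
for name, ri in ranked:
    running += ri
    cumulative.append((name, round(running, 2)))

for name, share in cumulative:
    print(f"{name:>20s}  cumulative RI = {share:.2f}")
```

Read this way, ATHI, time block, and milk production together account for 0.89 of the total importance.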

       
