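Optimal selection of environmentally sensitive variables and prediction of oasis soil salinity with machine learning algorithms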

Wang Fei (王飞); Yang Shengtian (杨胜天); Ding Jianli (丁建丽); Wei Yang (魏阳); Ge Xiangyu (葛翔宇); Liang Jing (梁静)

doi:10.11975/j.issn.1002-6819.2018.22.013

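Summary: Few studies have used machine learning to predict soil salinity in arid regions such as Xinjiang, and the screening of sensitive variables needs further investigation. This study compares the performance of five machine learning algorithms (the Least Absolute Shrinkage and Selection Operator, LASSO; Multivariate Adaptive Regression Splines, MARS; Classification and Regression Trees, CART; Random Forest, RF; and Stochastic Gradient Treeboost, SGT) in three geographic regions (the Qitai, Weigan-Kuqa and Yutian oases). The input variables are divided into six groups: spectral bands, vegetation-related variables, soil-related variables, digital elevation model (DEM) derived variables, the full variable group, and the optimized variable group (the set of variables retained after algorithmic screening of the full variable group). Algorithmic screening identifies the salinity-sensitive variables in each study area, and the results for the six groups are used to judge algorithm performance. The results show that, judged by the R² and RMSE of the six variable groups, prediction accuracy ranks as follows: optimized variable group > vegetation index variable group > soil-related variable group > bands > DEM-derived variable group. The full variable group was excluded from the ranking because its results were unstable. Among all variables, the vegetation indices (EEVI, ENDVI, EVI2, CSRI, GDVI) and the soil salinity indices (SIT, SI2 and SAIO) correlate more strongly with soil salinity than the other variables. Across the five algorithms, the LASSO and MARS predictions contain extreme outliers, although they broadly reproduce the spatial pattern of soil salinity. CART clearly separates the salinity distributions of irrigated and non-irrigated areas, but shows little variation within either area and is less stable. RF and SGT produce similar soil salinity ranges and spatial patterns in the three oases, with richer texture information than the other three algorithms. More importantly, both algorithms give relatively stable results in every region. Of the two, SGT achieves the highest validation accuracy, followed by RF.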

Abstract: Salt-affected cultivated land in Xinjiang accounts for about 37.72% of the irrigated area, which seriously restricts local economic development and ecological stability. To evaluate the distribution and severity of soil salinization, many researchers have built soil salinity prediction models from ground sampling data and environmental variables. Few studies, however, have used machine learning to predict soil salinity in arid areas such as Xinjiang, and the screening of sensitive variables needs further exploration. Sensitive variables help reduce the uncertainty of machine learning algorithms and thus improve prediction accuracy. This study compares the performance of five machine learning algorithms (the Least Absolute Shrinkage and Selection Operator, LASSO; Multivariate Adaptive Regression Splines, MARS; Classification and Regression Trees, CART; Random Forest, RF; and Stochastic Gradient Treeboost, SGT) in three geographic regions (the Qitai, Weigan-Kuqa and Yutian oases). The variables involved are divided into six groups: bands, vegetation-related variables, soil-related variables, digital elevation model (DEM) derived variables, the full variable group, and the optimized variable group (obtained by algorithmic screening of the full variable group to reveal the salinity-sensitive variables in each study area). The performance of each algorithm is then judged by its results on each group. According to R² and RMSE, prediction accuracy ranks as follows: optimized variable group > vegetation index variable group > soil-related variable group > bands > DEM-derived variable group; the full variable group was excluded from the ranking because its results were unstable. Among all variables, the vegetation indices (EEVI, ENDVI, EVI2, CSRI, GDVI) and the soil salinity indices (SIT, SI2 and SAIO) are more strongly correlated with soil salinity than the other variables.
When few variables are involved, the validation accuracy of the algorithms differs little. When the number of variables increases and their correlation with soil salinity is low, as in the DEM-derived variable group, SGT and RF extract useful information from complex environments better than the other algorithms. Among the algorithms compared, the LASSO and MARS predictions contain extreme outliers, although they broadly reproduce the distribution of soil salinity. CART clearly distinguishes the soil salinity distributions of irrigated and non-irrigated areas, but shows little variation within either area. RF and SGT produce similar soil salinity ranges and spatial distributions in the three oases, with richer texture information than the other three algorithms. More importantly, the results of these two algorithms are relatively stable in every region. Among the five algorithms, SGT has the highest validation accuracy, followed by RF.
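
The ranking of variable groups by R² and RMSE can be illustrated with a minimal sketch. It assumes that each variable group yields one list of predicted salinity values for the same validation samples. The function names, the group labels (`group_a`, `group_b`, `group_c`) and all numbers are hypothetical and are not taken from the paper. The tie-breaking rule (higher R² first, then lower RMSE) is also an assumption, since the abstract does not say how the two metrics were combined.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def rank_groups(observed, predictions_by_group):
    """Sort groups best to worst: higher R2 first, lower RMSE breaks ties."""
    scores = {
        name: (r_squared(observed, pred), rmse(observed, pred))
        for name, pred in predictions_by_group.items()
    }
    return sorted(scores.items(), key=lambda item: (-item[1][0], item[1][1]))


# Hypothetical validation data (made-up numbers, not results from the paper).
observed = [2.0, 4.0, 6.0, 8.0, 10.0]
predictions_by_group = {
    "group_a": [2.5, 3.5, 6.5, 7.5, 10.5],
    "group_b": [4.0, 4.0, 6.0, 8.0, 8.0],
    "group_c": [6.0, 6.0, 6.0, 6.0, 6.0],  # always predicts the observed mean
}

ranking = rank_groups(observed, predictions_by_group)
for name, (r2, err) in ranking:
    print(f"{name}: R2={r2:.3f}, RMSE={err:.3f}")
```

A group whose model always predicts the observed mean, like `group_c` here, gets an R² of 0, which gives a useful baseline when reading such a ranking.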

Environmental sensitive variable optimization and machine learning algorithm using in soil salt prediction at oasis