Abstract:
Salt-affected cultivated land in Xinjiang accounts for about 37.72% of the irrigated area, which seriously restricts local economic development and ecological stability. To evaluate the distribution and severity of soil salinization, many researchers have built soil salinity prediction models from ground sampling data and environmental variables. However, machine learning has rarely been applied to soil salinity prediction in arid areas such as Xinjiang, and the screening of sensitive variables needs further exploration. Sensitive variables help reduce the uncertainty of machine learning algorithms and thus improve prediction accuracy. This study compares 1) the performance of five machine learning algorithms (Least Absolute Shrinkage and Selection Operator, LASSO; Multivariate Adaptive Regression Splines, MARS; Classification and Regression Trees, CART; Random Forest, RF; Stochastic Gradient Treeboost, SGT) in three geographic regions (the Qitai, Kuqa and Yutian oases); and 2) the performance of different variable groups. The variables are divided into six groups: bands, vegetation-related variables, soil-related variables, digital elevation model (DEM) derived variables, the full variable group, and an optimized variable group (screened from the full variable group by algorithm to identify the salinity-sensitive variables in each study area). The performance of each algorithm is then judged by its results on each dataset. According to R² and RMSE, the prediction accuracy of the variable groups ranks as follows: optimized variable group > vegetation-related variable group > soil-related variable group > bands > DEM derived variable group. Among all variables, the vegetation indices (EEVI, ENDVI, EVI2, CSRI, GDVI) and soil salinity indices (SIT, SI2 and SAIO) are more strongly correlated with soil salinity than the other variables.
When few variables are involved, the validation accuracy of the algorithms differs little. When the number of variables increases and their correlation with soil salinity is low, as in the DEM derived variable group, SGT and RF are better able than the other algorithms to extract useful information from complex environments. Looking at the individual algorithms, the predictions of LASSO and MARS contain extreme outliers, although they broadly reproduce the distribution of soil salinity. The CART results clearly distinguish soil salinity in irrigated and non-irrigated areas but show little variation within each area. The RF and SGT results show similar soil salinity ranges and spatial distributions across the three oases, with richer texture detail than the other three algorithms. More importantly, the results of these two algorithms are relatively stable in each region. Among the five algorithms, SGT achieves the highest validation accuracy, followed by RF.
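The rankings above rest on two accuracy measures, R² and RMSE. The abstract does not state which formulas were used, so the following is only a minimal sketch of their common definitions, assuming R² is the coefficient of determination (some studies report the squared Pearson correlation instead). The function names and sample values are hypothetical and are not taken from the study.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    n = len(observed)
    squared_errors = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared_errors / n)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical salinity values for illustration only (not data from the study).
observed = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.5]

print("RMSE:", rmse(observed, predicted))
print("R2:", r_squared(observed, predicted))
```

A higher R² and a lower RMSE on the validation samples indicate a better-performing algorithm or variable group.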