Abstract:
Crop yield estimation provides data support for decision-making throughout the crop growing season. Previous winter wheat yield estimation models often show low accuracy and poor generalization because of overfitting during machine learning. Well-chosen feature variables can reduce the high collinearity between the spectral bands of surface reflectance, and spectral indices can lessen the influence of individual wavelengths on the iterative calculation. In this study, an improved winter wheat yield estimation model was established for Henan Province, China, using Random Forest (RF) with surface reflectance data and spectral indices from the growth period. The procedure was as follows. 1) Regional mean values and regional frequency histograms of the surface reflectance bands were compared as feature variables. 2) The individual spectral bands of surface reflectance were then compared. 3) The sample data generated from the selected feature variables were used to train the yield estimation model. 4) The model was applied to estimate winter wheat yields for 2013-2015 using the surface reflectance data and spectral indices. The estimated yields were verified against survey statistics. The results showed that the frequency histogram performed better than the regional mean values, with a Mean Absolute Error (MAE) of 660 kg/hm², a Root Mean Square Error (RMSE) of 860 kg/hm², and a coefficient of determination of 0.83. The dimension of the input feature variables and the computational complexity were greatly reduced, which improved the estimation ability of the model. The frequency histogram was therefore suitable for extracting feature variables and for optimizing the sample structure.
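The contrast between the two feature types in step 1 can be illustrated with a minimal sketch. It assumes that a region is represented by a flat list of pixel reflectances for one band and that the histogram uses equal-width bins normalized to pixel fractions. The function names, bin count, value range, and sample values are hypothetical and are not taken from the study.

```python
def regional_mean(reflectances):
    """Collapse all pixel reflectances of a region into a single mean value."""
    return sum(reflectances) / len(reflectances)


def regional_histogram(reflectances, n_bins=8, lo=0.0, hi=1.0):
    """Return the fraction of pixels in each of n_bins equal-width bins on [lo, hi]."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for r in reflectances:
        idx = int((r - lo) / width)
        # Clamp so values at or beyond the range edges land in the first/last bin.
        idx = min(max(idx, 0), n_bins - 1)
        counts[idx] += 1
    total = len(reflectances)
    return [c / total for c in counts]


# Two hypothetical regions (illustrative values only) that share the same mean
# reflectance but have different pixel distributions.
region_a = [0.30, 0.30, 0.30, 0.30]
region_b = [0.10, 0.10, 0.50, 0.50]

mean_a, mean_b = regional_mean(region_a), regional_mean(region_b)
hist_a = regional_histogram(region_a, n_bins=4)
hist_b = regional_histogram(region_b, n_bins=4)
```

In this toy case the two regions are indistinguishable by their mean but distinguishable by their histograms. This is one plausible reason a histogram feature could carry more information than a regional mean, although the abstract itself does not state the mechanism.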
Among the surface reflectance bands, near-infrared band 1 performed best, with an MAE of 636 kg/hm², an RMSE of 784 kg/hm², and a coefficient of determination of 0.76, followed by near-infrared band 2. The blue band gave the lowest estimation accuracy, with a coefficient of determination of only 0.58 and an MAE and RMSE of 885 and 1040 kg/hm², respectively. Among the spectral indices, the coefficient of determination was 0.82 for the Normalized Difference Water Index (NDWI) and 0.73 for the Normalized Difference Vegetation Index (NDVI). The combination of NDVI and NDWI gave a higher coefficient of determination of 0.89, with an MAE and RMSE of 444 and 527 kg/hm², respectively. NDVI and NDWI at the heading stage had a crucial influence on winter wheat yield. Specifically, the accuracy of the combined NDVI and NDWI was highest on April 15th, indicating that the period from April 15th to 22nd had the greatest impact. In addition, the estimated and observed yields showed consistent inter-annual variation and spatial distribution from 2013 to 2015, where the relative errors were within ±20% and gradually decreased from the northwest to the southeast. This spatial pattern can be attributed to the combined effects of topography, temperature, and precipitation, acting through the surface microclimate and material redistribution. Consequently, accurate winter wheat yield estimation over large areas can be achieved in the main winter wheat planting regions of Henan Province using the random forest model combined with regional frequency histogram samples. Future work should increase the number of samples and introduce high-resolution satellite images to further improve the accuracy of the winter wheat yield estimation model.
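For readers unfamiliar with the quantities reported above, the following sketch shows how the spectral indices and error metrics are commonly computed. It rests on several assumptions not confirmed by the abstract: NDVI uses the standard red and near-infrared definition; NDWI is taken in a form that contrasts a near-infrared band with a longer-wavelength water-absorption band (the abstract does not say which NDWI variant was used); and the coefficient of determination is computed as 1 minus the ratio of residual to total sum of squares. All function names and numeric values are illustrative.

```python
import math


def ndvi(nir, red):
    """Normalized Difference Vegetation Index (standard definition)."""
    return (nir - red) / (nir + red)


def ndwi(nir, longer_band):
    """NDWI, assuming a near-infrared vs. longer-wavelength band formulation."""
    return (nir - longer_band) / (nir + longer_band)


def mae(observed, estimated):
    """Mean Absolute Error."""
    return sum(abs(e - o) for o, e in zip(observed, estimated)) / len(observed)


def rmse(observed, estimated):
    """Root Mean Square Error."""
    return math.sqrt(
        sum((e - o) ** 2 for o, e in zip(observed, estimated)) / len(observed)
    )


def r_squared(observed, estimated):
    """Coefficient of determination, assumed here to be 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - e) ** 2 for o, e in zip(observed, estimated))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def relative_errors(observed, estimated):
    """Signed relative error of each estimate with respect to the observation."""
    return [(e - o) / o for o, e in zip(observed, estimated)]


# Hypothetical yields in kg/hm² (illustrative only, not data from the study).
observed = [6000.0, 6500.0, 7000.0]
estimated = [6300.0, 6500.0, 6700.0]

scores = {
    "MAE": mae(observed, estimated),
    "RMSE": rmse(observed, estimated),
    "R2": r_squared(observed, estimated),
}
within_20_percent = all(abs(x) <= 0.20 for x in relative_errors(observed, estimated))
```

MAE and RMSE keep the units of yield (kg/hm²), whereas the coefficient of determination and the relative error are dimensionless, so the ±20% bound can be checked directly on the output of `relative_errors`.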