Abstract:
The precise estimation of crop yields is essential for global food security, particularly in the face of challenges such as climate change, population growth, and unequal food distribution. Although machine learning techniques combined with remote sensing data are widely used for large-scale yield prediction, the integration of crop spatial position information and local models remains underexplored. This gap matters because crop yield prediction is inherently spatial and spatial factors are highly influential. Previous studies, predominantly conducted on an annual or full-growing-season basis, have not provided predictions for each phenological stage of maize growth. Consequently, they fall short in pinpointing the most effective prediction time for maize yield and in understanding the impact of environmental factors at each stage. This research addresses two key questions: (1) Does the inclusion of spatial location information in the geographically weighted random forest regression (GWRFR) model improve yield prediction accuracy over the traditional random forest model? (2) Among the phenological stages of maize, which stage provides the optimal window for yield prediction? To address these questions, this study combined multi-source remote sensing data with machine learning algorithms to predict maize yield at the county level in the United States. The study investigated the relationship between yield prediction and the spatial location of sample points, assessing the relevance of including latitude and longitude as independent variables. It then introduced the local GWRFR model for maize yield prediction and compared its performance with that of the global random forest (RF) model. In addition, the study examined two methodological approaches for determining the best prediction time.
The first approach, referred to as the accumulated environmental variables (AEV) approach, accumulated data across phenological stages up to the stage under analysis. The second approach, known as the current stage variables (CSV) approach, used data exclusively from the specific growth stage under analysis. The seven key growth stages of maize were planted, emerged, silking, dough, dent, mature, and harvest, covering the crop's full lifecycle. By evaluating the results from both approaches, this study identified the optimal prediction time for maize yield. The findings indicate that incorporating latitude and longitude into the model enhanced yield prediction accuracy. Without these spatial factors, the RF model achieved a coefficient of determination (R²) of 0.83 and a root mean squared error (RMSE) of 994.75 kg/hm², while including them improved these metrics to an R² of 0.85 and an RMSE of 890.88 kg/hm². This provides preliminary evidence that including spatial factors can enhance maize yield prediction accuracy. Moreover, the local GWRFR model further improved prediction accuracy (R² = 0.87, RMSE = 864.21 kg/hm²), outperforming the traditional RF model and effectively addressing the non-stationarity of spatial data. In terms of optimal prediction time, the AEV approach showed increasing accuracy from the first stage (planted) up to the fourth stage (dough), peaking at R² = 0.90 and RMSE = 748.39 kg/hm², and then stabilized. In contrast, the CSV approach improved in accuracy from the first stage up to the third stage (silking), reaching its peak (R² = 0.88, RMSE = 827.85 kg/hm²) before decreasing. This suggests that the best prediction time was around the dough stage, approximately 2-3 months before harvest. Additionally, the strong correlation observed between early prediction results and those covering the entire growing season underscores the reliability of maize yield predictions made during the dough stage. In conclusion, this study introduces a novel method for large-scale crop yield prediction that integrates spatial data and phenological stages with advanced modeling techniques. The findings can contribute to enhancing food security and stabilizing the global food supply chain. This research provides insights for agricultural practice and lays a foundation for future studies in crop yield prediction, potentially extending to other crops and regions and incorporating a broader range of environmental factors.