Calculation of causal strength of stored grain pest events augmented by counterfactual data
Abstract
Stored grain pests are an important factor affecting food security, so it is critical to study how grain storage pest events develop and how they are causally related. Quantifying the causal strength between grain storage pest events allows potential risks to be assessed more accurately and helps staff formulate prevention and control measures that reduce unnecessary losses. Data in the grain storage pest domain are biased, which causes models to rely too heavily on surface features of the dataset and to perform poorly on generalized data. To address this problem, this study proposes a counterfactual data augmented method for computing causal strength, aiming to accurately quantify the causal strength between events. A counterfactual data augmentation-event causal strength computation framework (CDA-ECS) is designed: a large language model (LLM) generates counterfactual instances to extend the original data, and the resulting debiased causal knowledge is integrated into a pre-trained language model to help it understand and learn the causal relationships in sentences more deeply and to improve its generalization ability.
Specifically, the method is divided into three stages. In the first stage, the premise sentences of the event pairs are input to a retriever to obtain the top k sentences that are similar in style but opposite in semantics to the original sentences. In the second stage, a rule-based prompt template is built from the retrieved sentences, which enables the LLM to generate compliant sentences and to change the labels of the original event pair sentences based on the samples. In the third stage, the original training instances and the newly generated instances are merged into a new corpus, on which the pre-trained language model is trained so that it learns the causal features of events, reasons more accurately on generalized data, and outputs the causal strength score. Experiments on public and domain datasets show that the proposed method trains more robust models, with 2.4 percentage points higher accuracy on the inference task on generalized data, and that it can be effectively applied to the causal strength calculation of grain storage pest events. In the field of grain storage pests, counterfactual data augmentation offers a new perspective on addressing data bias, and the diversity and complexity of the augmented data enable the model to understand more deeply the complex links between pest behavior and environmental factors, which further supports risk analysis of grain storage pest events. At present, the proposed method generates counterfactual data with an LLM without human intervention, so the generated data may be inaccurate when labels are flipped, which may affect the model's understanding and quantification of causal relationships.
In the future, the counterfactual data generation method will be further optimized to improve the quality of the generated counterfactual data and to provide a reliable basis for quantifying the causal strength of events. In conclusion, this study not only provides an effective solution for improving the performance of causal analysis models in the field of grain storage pests, but is also expected to extend to other fields with similar data problems, contributing to more accurate decision-making in risk assessment and management.
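
The three stages summarized in the abstract can be illustrated with a minimal Python sketch. It is not the CDA-ECS implementation; every name in it (`Instance`, `retrieve_top_k`, `build_prompt`, `augment`) is hypothetical. It assumes binary causal labels, uses token-level Jaccard overlap among opposite-label sentences as a crude stand-in for the "similar style, opposite semantics" retriever, and takes the LLM as an arbitrary `generate` callable supplied by the caller. Training the pre-trained language model on the merged corpus is left out.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Instance:
    premise: str     # premise sentence of the event pair
    hypothesis: str  # second sentence of the event pair (assumed field)
    label: int       # assumed binary label: 1 = causal, 0 = not causal


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard overlap, used here as a rough style-similarity proxy."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve_top_k(inst: Instance, pool: List[Instance], k: int) -> List[str]:
    """Stage 1 (proxy): keep opposite-label premises, rank them by lexical overlap."""
    candidates = [p for p in pool if p.label != inst.label]
    ranked = sorted(candidates,
                    key=lambda p: jaccard(inst.premise, p.premise),
                    reverse=True)
    return [p.premise for p in ranked[:k]]


def build_prompt(inst: Instance, examples: List[str]) -> str:
    """Stage 2: fill a rule-based template with the retrieved sentences."""
    shots = "\n".join(f"- {s}" for s in examples)
    return (
        "Rewrite the premise so that its relation to the hypothesis is reversed.\n"
        "Keep the style of these example sentences:\n"
        f"{shots}\n"
        f"Premise: {inst.premise}\n"
        f"Hypothesis: {inst.hypothesis}\n"
        "Rewritten premise:"
    )


def augment(train: List[Instance], pool: List[Instance],
            generate: Callable[[str], str], k: int = 2) -> List[Instance]:
    """Stages 2-3: generate counterfactuals, flip labels, merge with the originals."""
    corpus = list(train)
    for inst in train:
        examples = retrieve_top_k(inst, pool, k)
        if not examples:
            continue  # nothing retrieved, so no counterfactual for this instance
        new_premise = generate(build_prompt(inst, examples)).strip()
        if new_premise:
            corpus.append(Instance(new_premise, inst.hypothesis, 1 - inst.label))
    return corpus
```

In this sketch, `augment(train, pool, generate)` returns the original instances followed by their label-flipped counterfactuals. Because the label is flipped without any check on what the LLM actually produced, the sketch shares the limitation noted above: a filtering or human review step would have to be added before the merged corpus is used for training.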