Calculating the causal strength of stored grain pest events using counterfactual data augmentation

肖乐; 赵婧; 徐云飞

doi:10.11975/j.issn.1002-6819.202408083

Calculating the causal strength of stored grain pest events using counterfactual data augmentation

Abstract

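Summary: Stored grain pests are a major factor affecting food security, so studying how stored grain pest events develop and how they are causally related is critical. Quantifying the causal strength between stored grain pest events allows potential risks to be assessed more accurately, helps staff formulate prevention and control measures, and reduces unnecessary losses. Data bias in stored grain pest datasets causes models to rely too heavily on surface features of the dataset and to perform poorly on generalization data. To address this problem, this study proposes a causal strength calculation method based on counterfactual data augmentation, aimed at accurately quantifying the causal strength between events. A counterfactual data augmentation-event causal strength framework (CDA-ECS) was designed: a large language model (LLM) generates counterfactual instances that extend the original data, and the debiased causal knowledge is integrated into a pre-trained language model, helping it understand and learn the causal relationships in sentences more deeply and improving its generalization ability. Experiments on public and domain datasets show that the proposed method trains more robust models, improving accuracy on the inference task over domain generalization data by 2.4 percentage points, and can be applied effectively to calculating the causal strength of stored grain pest events. In the stored grain pest domain, counterfactual data augmentation offers a new perspective on addressing data bias; the diversity and complexity of the augmented data enable the model to better understand the complex links between pest behavior and environmental factors, further supporting risk analysis of stored grain pest events. The study demonstrates the feasibility and effectiveness of counterfactual data augmentation and provides a reference for calculating the causal strength of stored grain pest events.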

Abstract: Stored grain pests are a major factor affecting food security, so it is critical to study how stored grain pest events develop and how they are causally related. Quantifying the causal strength among these events allows potential risks to be assessed more accurately and helps in formulating prevention and control measures. However, data bias in the stored grain pest domain can cause models to rely too heavily on surface features of the dataset, leading to poor performance on generalization data. In this study, a method based on counterfactual data augmentation was proposed to quantify the causal strength among events accurately. A counterfactual data augmentation-event causal strength framework (CDA-ECS) was designed, in which a large language model (LLM) generates counterfactual instances. The original data was thereby extended, and the debiased causal knowledge was integrated into a pre-trained language model, which learned the causal relationships in sentences more deeply and generalized better. The framework has three stages. In the first stage, the premise sentences of the event pairs were fed into a retriever to obtain the top k sentences that were similar in style but opposite in semantics to the original sentences. In the second stage, a rule-based prompt template was built from the retrieved sentences; the LLM was used to generate compliant sentences, and the labels of the original event-pair sentences were then adjusted for the generated samples. In the third stage, the original training instances and the newly generated instances were merged into a new corpus used to train the pre-trained language model. The model thus learned the causal features of the events, improving its reasoning accuracy on generalization data and yielding the causal strength score.
Experiments on public and domain datasets showed that the proposed method trained more robust models, with accuracy on the inference task over domain generalization data improved by 2.4 percentage points, and that it can be applied effectively to calculating the causal strength of stored grain pest events. Counterfactual data augmentation offers a new way to address data bias in the stored grain pest domain. The diversity and complexity of the augmented data help the model understand the complex links between pest behavior and environmental factors more deeply, supporting risk analysis of stored grain pest events. Nevertheless, generating counterfactual data with an LLM still lacked human intervention, particularly in the label-flipping step. Future work can refine the quantification of causal relationships and optimize counterfactual generation to further improve the quality of the generated data. The findings provide a reliable basis for quantifying the causal strength of events and offer an effective way to improve causal analysis models in the stored grain pest domain. They are also expected to support more accurate decision-making in risk assessment and management.
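
The three stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: every name in it (`EventPair`, `retrieve_top_k`, `build_prompt`, `augment`, `fake_llm`) is hypothetical, the retriever is a toy that uses token overlap as a stand-in for stylistic similarity and a negation cue as a stand-in for opposite semantics, and the LLM is replaced by a stub. Training the pre-trained language model on the merged corpus is not shown.

```python
from dataclasses import dataclass
from typing import Callable, List

# Assumption: negation words serve as a crude proxy for "opposite in semantics".
NEGATION_CUES = {"not", "no", "never", "without"}


@dataclass
class EventPair:
    premise: str     # cause-side sentence
    hypothesis: str  # effect-side sentence
    label: int       # 1 = causal, 0 = not causal


def tokens(text: str) -> set:
    """Lowercased word set with periods removed."""
    return set(text.lower().replace(".", "").split())


def retrieve_top_k(premise: str, pool: List[str], k: int = 2) -> List[str]:
    """Stage 1 (toy): keep pool sentences whose polarity differs from the
    premise, ranked by Jaccard token overlap (stand-in for style similarity)."""
    p = tokens(premise)
    p_negated = bool(p & NEGATION_CUES)
    scored = []
    for sentence in pool:
        t = tokens(sentence)
        if bool(t & NEGATION_CUES) != p_negated:
            scored.append((len(p & t) / len(p | t), sentence))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [sentence for _, sentence in scored[:k]]


def build_prompt(pair: EventPair, examples: List[str]) -> str:
    """Stage 2a: rule-based prompt template built from the retrieved sentences."""
    shots = "\n".join(f"- {s}" for s in examples)
    return (
        "Rewrite the premise so that it no longer causes the hypothesis.\n"
        f"Style examples:\n{shots}\n"
        f"Premise: {pair.premise}\n"
        f"Hypothesis: {pair.hypothesis}\n"
        "Rewritten premise:"
    )


def augment(data: List[EventPair], pool: List[str],
            llm: Callable[[str], str], k: int = 2) -> List[EventPair]:
    """Stages 2b and 3: generate counterfactuals, flip labels, merge corpora."""
    generated = []
    for pair in data:
        prompt = build_prompt(pair, retrieve_top_k(pair.premise, pool, k))
        new_premise = llm(prompt).strip()
        generated.append(EventPair(new_premise, pair.hypothesis, 1 - pair.label))
    return data + generated


def fake_llm(prompt: str) -> str:
    """Stub standing in for a real LLM call; returns a fixed counterfactual."""
    return "The grain moisture content did not rise."


data = [EventPair("The grain moisture content rose.",
                  "The pest population increased.", 1)]
pool = ["The warehouse temperature did not rise.",
        "The grain moisture content dropped.",
        "The grain was not fumigated."]
corpus = augment(data, pool, fake_llm)
for pair in corpus:
    print(pair)
```

Passing the LLM in as a plain callable keeps the sketch runnable offline; a real system would substitute an actual model call and, as the abstract notes, would benefit from human review of the flipped labels.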
