Joint entity-relation extraction model for the apple cultivation domain in low-resource scenarios

    Joint entity-relation extraction model using reinforcement learning in low-resource scenarios

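    • Abstract: Because joint entity-relation extraction in the apple cultivation domain is costly to annotate and depends heavily on domain expertise, improving extraction performance in low-resource scenarios is essential. To address this problem, this study proposes a joint entity-relation extraction model based on reinforcement learning. The model consists of an entity recognition module and a reinforcement-learning-based relation extraction module. Within the reinforcement learning training framework, a relation generator produces pseudo labels, and a policy network is trained to maximize the similarity between the gradient directions of the pseudo-labeled data and the labeled data while encouraging the model to optimize on the pseudo-labeled data, which improves generalization to unlabeled data. To verify its effectiveness, the model was compared with mainstream low-resource relation extraction models on an apple cultivation corpus. With 30% of the data labeled, the model reached an F1 score of 88.71%, a clear improvement over the other baselines and 2.8 percentage points higher than MetaSRE. On the public dataset TACRED, the model also extracted entity relations effectively in low-resource settings, reaching an F1 score of 59.93%. Through the reward feedback of the gradient simulation algorithm, the model obtains a generalizable explicit signal, which gives more guidance than the implicit signal obtained by training directly on labeled data and does not cause gradual drift. The model thus enables fast entity-relation extraction in low-resource scenarios and offers a solution for building knowledge graphs for the apple cultivation domain quickly and efficiently.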

       

      Abstract: Annotating entities and relations for joint entity-relation extraction is costly and depends strongly on domain expertise, so improving extraction performance in low-resource scenarios is crucial. This study proposes a reinforcement-learning-based joint entity-relation extraction model with two modules, entity recognition and relation extraction, designed to improve extraction efficiency and generalization in low-resource scenarios. A feature extractor converts the text into semantically rich feature representations that are shared by both modules. Entity recognition uses a CRF for sequence labeling; the output label sequence is traversed to locate entity boundaries, and the entity features are then concatenated to form the entity embedding vectors. Because low-resource scenarios offer limited labeled data but abundant unlabeled data, the relation extraction module is trained with reinforcement learning. Its input consists of sentence features, entity embeddings, and relations for labeled data, and of sentence features and entity embeddings for unlabeled data. The model is trained so that the pseudo labels generated for unlabeled data follow the gradient direction of the labeled data, maximizing the similarity between the average gradients of the two. This increases the diversity and richness of the training data, which improves generalization and reduces the risk of overfitting. The generated pseudo labels also reduce the dependence on large amounts of labeled data and lower annotation costs.
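The boundary-location step, in which the entity label sequence is traversed to find where each entity starts and ends, can be illustrated with a minimal sketch. It assumes a BIO tagging scheme (`B-X` opens an entity of type `X`, `I-X` continues it, `O` is outside); the abstract does not state which scheme the CRF uses, and the function name `decode_entity_spans` and the label types in the usage note are hypothetical.

```python
from typing import List, Optional, Tuple


def decode_entity_spans(labels: List[str]) -> List[Tuple[int, int, str]]:
    """Traverse a BIO label sequence and return (start, end_exclusive, type) spans.

    Assumption: labels look like "B-TYPE", "I-TYPE", or "O". A stray "I-TYPE"
    with no preceding "B-TYPE" is treated leniently as the start of an entity.
    """
    spans: List[Tuple[int, int, str]] = []
    start: Optional[int] = None
    ent_type: Optional[str] = None

    for i, label in enumerate(labels):
        # The current entity continues only on an I- tag of the same type.
        continues = label.startswith("I-") and ent_type == label[2:]

        # Close the open entity as soon as it stops continuing.
        if start is not None and not continues:
            spans.append((start, i, ent_type))
            start, ent_type = None, None

        # Open a new entity on B-, or on an I- tag with nothing open.
        if label.startswith("B-") or (label.startswith("I-") and start is None):
            start, ent_type = i, label[2:]

    # Close an entity that runs to the end of the sentence.
    if start is not None:
        spans.append((start, len(labels), ent_type))
    return spans
```

For example, `decode_entity_spans(["B-PEST", "I-PEST", "O", "B-CROP", "O"])` yields one two-token span and one single-token span. The token features inside each span could then be combined into an entity embedding; how the authors combine them beyond "connecting" the features is not specified here.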
Gradient simulation also balances the sample distribution across relation categories, which matters most when the categories are imbalanced. To verify its effectiveness, the model was compared with mainstream low-resource relation extraction models on the apple cultivation corpus (ATC). With 30% of the data labeled, the model reached an F1 score of 88.71%, a significant improvement over the other baselines. The model also extracted entity relations effectively on the public dataset TACRED in low-resource scenarios. The proportion of unlabeled data was then varied on both ATC and TACRED: with labeled data fixed at 10%, F1 was measured with 10%, 30%, 50%, 70%, and 90% unlabeled data. Adding unlabeled data for training improved performance, and the model consistently achieved the best F1 across the different proportions of unlabeled data. Ablation experiments verified the effectiveness of the gradient simulation module. Without this module, the relation extraction model is essentially the same as the Self-Trained BERT model, which showed an average F1 decrease of 6.12% across the different proportions of labeled data. The performance gain of the relation extraction module is therefore attributed to the gradient simulation module, which improves the quality of the pseudo labels. Finally, principal component analysis was used to visualize the gradient descent directions of the relation extraction module on labeled and pseudo-labeled data, as an indicator of pseudo-label quality. With the gradient simulation module, the optimization direction on pseudo-labeled data fluctuated greatly at first but gradually approached the ideal local minimum.
These results further confirm that the gradient simulation module generates high-quality pseudo labels. The proposed model can therefore extract entity relations effectively in low-resource scenarios, with strong generalization and performance.
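The gradient simulation idea described above, rewarding pseudo labels whose average gradient points in the same direction as the average gradient of the labeled data, can be sketched as follows. This is only an illustration under stated assumptions: cosine similarity is assumed as the similarity measure (the abstract says only that the similarity of the average gradients is maximized), gradients are represented as plain lists of floats, and the names `mean_gradient`, `cosine_similarity`, and `gradient_reward` are hypothetical.

```python
import math
from typing import List, Sequence


def mean_gradient(grads: Sequence[Sequence[float]]) -> List[float]:
    """Average a batch of per-example gradient vectors coordinate by coordinate."""
    n = len(grads)
    return [sum(column) / n for column in zip(*grads)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def gradient_reward(
    labeled_grads: Sequence[Sequence[float]],
    pseudo_grads: Sequence[Sequence[float]],
) -> float:
    """Reward in [-1, 1]: high when the pseudo-labeled batch pushes the model
    in the same direction as the labeled batch (assumed cosine measure)."""
    return cosine_similarity(mean_gradient(labeled_grads), mean_gradient(pseudo_grads))
```

In a setup like this, a policy that assigns pseudo labels would receive a reward near 1 when its labels produce gradients aligned with the labeled data and a negative reward when they point the opposite way, which is one way to read the "explicit signal" the abstract contrasts with plain self-training.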

       
