Citation: Zhao Liang, Zhang Zhaoyue, Liao Ziyi, Wang Ling. Relationship extraction in the field of food safety based on BERT and improved PCNN model[J]. Transactions of the Chinese Society of Agricultural Engineering (Transactions of the CSAE), 2022, 38(8): 263-270. DOI: 10.11975/j.issn.1002-6819.2022.08.030

    Relationship extraction in the field of food safety based on BERT and improved PCNN model

    • Abstract: To improve the efficiency and accuracy of relationship extraction in the food safety field, this study collected a corpus of food safety texts, annotated the corresponding entities and relationships, and constructed a professional dataset for relationship extraction in this field. At the same time, a relationship extraction model for the food safety domain, BERT-PCNN-ATT-Jieba, is proposed. The model uses the Bidirectional Encoder Representations from Transformers (BERT) pre-trained model to generate the input word vectors and exploits the piecewise max pooling layer of the Piecewise Convolutional Neural Network (PCNN), which captures local sentence information to a large extent; an attention mechanism is added between the piecewise max pooling layer and the classification layer to further extract high-level semantics. In addition, considering the characteristics of Chinese corpora, Jieba word segmentation is applied to the Chinese text before the BERT model performs random mask splitting, so that whole words rather than single characters are masked when the Masked Language Model (MLM) is executed; this minimizes the semantic loss of the sentences fed into the training model and enables more efficient relationship extraction. On the dataset constructed in this study, the BERT-PCNN-ATT-Jieba model was compared with six models: the classical Convolutional Neural Network (CNN), PCNN, and the BERT-based CNN, PCNN, PCNN-ATT, and PCNN-Jieba. The proposed model achieved the best performance, with a precision of 84.72%, a recall of 81.78%, and an F1 value of 83.22%. The model provides a reference for knowledge extraction in the food safety field, reduces the cost of automatically constructing a knowledge graph of this field, and supports applications such as knowledge question answering, knowledge retrieval, data sharing, and intelligent food safety supervision based on the knowledge graph.
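    The Jieba-guided whole-word masking described above can be illustrated with a minimal sketch: Jieba segments the Chinese sentence first, and every sub-token of a selected word is then masked instead of a single character. This is an illustrative sketch only; the bert-base-chinese tokenizer, the 15% mask rate, and the example sentence are assumptions, not the authors' implementation, and the [CLS]/[SEP] special tokens and MLM label bookkeeping are omitted for brevity.

    # Minimal sketch: whole-word masking guided by Jieba segmentation (assumed settings).
    import random
    import jieba
    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

    def whole_word_mask(sentence: str, mask_rate: float = 0.15):
        """Segment with Jieba, then mask whole words rather than single characters."""
        words = jieba.lcut(sentence)              # e.g. ["食品", "安全", "抽检", ...]
        tokens = []
        for word in words:
            pieces = tokenizer.tokenize(word)     # BERT sub-tokens of this word
            if pieces and random.random() < mask_rate:
                # Replace every sub-token of the chosen word with [MASK]
                tokens.extend([tokenizer.mask_token] * len(pieces))
            else:
                tokens.extend(pieces)
        return tokenizer.convert_tokens_to_ids(tokens)

    input_ids = whole_word_mask("某品牌乳制品检出三聚氰胺超标")

    In practice the masked positions would also be recorded as MLM labels; masking by word keeps multi-character terms such as product or hazard names intact during pre-training, which is the motivation given in the abstract.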

       

      Abstract: A knowledge graph (semantic network) organizes real-world entities and the relationships between them in a graph database. Relationship extraction is one of the most important steps in the automatic construction of knowledge graphs. However, there is currently no public dataset related to knowledge graphs in the food safety field, and most existing relationship extraction models are confined to open standard datasets and cannot extract data in a specific domain. In this study, a professional dataset was constructed for relationship extraction in the food safety field using the Bidirectional Encoder Representations from Transformers (BERT) and an improved Piecewise Convolutional Neural Network (PCNN) model. A corpus was first collected, and the corresponding entities and relation categories were annotated. At the same time, a relationship extraction model, BERT-PCNN-ATT-Jieba (where ATT denotes the attention mechanism), was proposed for the field of food safety. The BERT pre-trained model was used to generate the input word vectors. The piecewise max pooling layer of the PCNN model was then used to capture the local information of sentences, and an attention mechanism was added between the piecewise max pooling layer and the classification layer to further extract high-level semantics. In addition, Jieba word segmentation was applied to the Chinese corpus before the random mask splitting of the BERT model, so that whole words rather than single characters were masked when executing the Masked Language Model (MLM). As a result, the semantic loss of the sentences fed into the training model was reduced, enabling more efficient relationship extraction. The performance of the BERT-PCNN-ATT-Jieba model was compared with the classical CNN and PCNN models, as well as the CNN, PCNN, PCNN-ATT, and PCNN-Jieba models combined with BERT, on the same dataset and with consistent experimental parameters. Comparing PCNN with BERT-PCNN, the precision, recall, and F1 value of BERT-PCNN were slightly higher, indicating that the vectors generated by the BERT model better capture the semantic features of the data. Comparing BERT-PCNN-ATT with BERT-PCNN, the pooled high-level semantic features received higher weights after the attention mechanism was added between the pooling layer and the classification layer, indicating that the attention mechanism improves the performance of the model. The F1 value of BERT-PCNN-Jieba was better than that of BERT-PCNN, because word segmentation weakened the influence of word length during sentence preprocessing of the food safety training set and allowed the positional and logical information between words to be analyzed more effectively. Consequently, the BERT-PCNN-ATT-Jieba model achieved the highest precision of 84.72%, recall of 81.78%, and F1 value of 83.22%, showing the best performance on the relationship extraction dataset for the food safety field. The findings can provide a strong reference for knowledge extraction and for the cost-saving, automatic construction of knowledge graphs in the field of food safety. The improved model can also lay a foundation for applications such as knowledge question answering, knowledge retrieval, data sharing, and intelligent supervision of food safety based on knowledge graphs.
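      As a rough illustration of the PCNN-ATT stage described above, the sketch below applies a convolution to BERT token embeddings, splits the feature map into three pieces around the two entity positions, max-pools each piece, and weights the pooled pieces with a small attention layer before classification. Layer sizes, the kernel width, and the exact placement of the attention are assumptions for illustration; this is not the authors' released implementation.

    # Minimal sketch of piecewise max pooling followed by attention (assumed shapes/sizes).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PCNNAttHead(nn.Module):
        def __init__(self, hidden_size=768, num_filters=230, num_relations=10):
            super().__init__()
            self.conv = nn.Conv1d(hidden_size, num_filters, kernel_size=3, padding=1)
            self.att = nn.Linear(num_filters, 1)      # scores each pooled segment
            self.cls = nn.Linear(num_filters, num_relations)

        def forward(self, bert_out, e1_pos, e2_pos):
            # bert_out: (seq_len, hidden_size) token embeddings from BERT
            feats = torch.relu(self.conv(bert_out.t().unsqueeze(0))).squeeze(0).t()
            # Split the sentence into three pieces around the two entity mentions
            pieces = [feats[:e1_pos + 1], feats[e1_pos + 1:e2_pos + 1], feats[e2_pos + 1:]]
            pooled = torch.stack([p.max(dim=0).values for p in pieces if len(p) > 0])
            # Attention over the pooled segments, placed between pooling and classification
            weights = F.softmax(self.att(pooled), dim=0)   # (num_pieces, 1)
            sentence_vec = (weights * pooled).sum(dim=0)   # (num_filters,)
            return self.cls(sentence_vec)                  # relation logits

      Classic PCNN simply concatenates the three pooled vectors; attention-weighting the segments is one plausible way to realize "an attention mechanism between the pooling layer and the classification layer", and is shown here only as an example of that design choice.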

       
