Zhao Pengfei, Zhao Chunjiang, Wu Huarui, Wang Wei. Recognition of the agricultural named entities with multi-feature fusion based on BERT[J]. Transactions of the Chinese Society of Agricultural Engineering (Transactions of the CSAE), 2022, 38(3): 112-118. DOI: 10.11975/j.issn.1002-6819.2022.03.013

    Recognition of the agricultural named entities with multi-feature fusion based on BERT

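    • Summary: Named entity recognition is an important step in information extraction from agricultural text. To address missing local context features, limited character-vector representations, and low recognition rates for rare entities, this paper proposes a named entity recognition method that fuses character-level features from BERT (Bidirectional Encoder Representations from Transformers) with features from an external dictionary. The BERT pre-trained model integrates left and right context to enrich the semantic representation of each character and to alleviate polysemy. A self-built agricultural dictionary, combined with a bidirectional maximum matching strategy, yields distributed dictionary feature representations that improve recognition accuracy for rare or unknown entities. A bidirectional long short-term memory (Bi-directional Long-short Term Memory, BiLSTM) network produces the sequence feature matrix, and a conditional random field (Conditional Random Field, CRF) model generates the globally optimal label sequence. An agricultural corpus of 5 295 annotated entries covering 5 categories of agricultural entities was built with the help of domain experts. On this corpus the model achieved a precision of 94.84%, a recall of 95.23%, and an F1 score of 95.03%. The results show that the method effectively recognizes named entities in the agricultural domain and is more accurate than the other models compared.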
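
The reported F1 score can be cross-checked against the reported precision and recall, since F1 is their harmonic mean. The following is a minimal Python sketch of that check; the function name `f1_score` is illustrative and does not come from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in the same units)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


reported_precision = 94.84  # percent, as stated in the abstract
reported_recall = 95.23     # percent, as stated in the abstract

f1 = f1_score(reported_precision, reported_recall)
print(round(f1, 2))  # 95.03
```

The computed value agrees with the 95.03% stated in the abstract.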

       

      Abstract: Agricultural named entity recognition is a fundamental task for information extraction in the agricultural domain. To address missing local context features, the inability to resolve polysemy, and the low recognition rate of rare entities, a model that combines character-level features with dictionary features was proposed to automatically identify entities in text; the character-level features were obtained from the BERT (Bidirectional Encoder Representations from Transformers) model. Firstly, the BERT pre-trained language model was used to integrate left and right contextual information into character-level features, enhancing the semantic representation of words and alleviating the problem of polysemy. Secondly, an agricultural dictionary was built, and external dictionary information was introduced through a feature extraction strategy to improve the recognition accuracy of the model for rare or unknown entities. Two feature extraction strategies were designed to capture the dictionary features: an N-gram feature template algorithm and a bidirectional maximum matching algorithm. Then, the character-level features and dictionary features were fused as the input of the next neural network layer. Finally, the fused features were encoded by a BiLSTM (Bi-directional Long-short Term Memory) neural network layer to obtain the sequence feature matrix, and the optimal label sequence was obtained by a CRF (Conditional Random Field). Based on the knowledge of domain experts, a labeling strategy for agricultural named entities was proposed to resolve their fuzzy boundaries and preserve entity integrity. The experiments were carried out on an agricultural corpus containing 5 295 labeled entries and 5 categories of agricultural entities.
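
To make the bidirectional maximum matching idea concrete, the sketch below segments a sentence with forward and backward greedy matching, keeps the segmentation with fewer words (then fewer single-character words, with ties going to the backward result), and turns it into per-character dictionary tags. This is a hedged illustration only: the tie-breaking heuristic is a common convention, and the toy `LEXICON`, its labels, the B/I/O tag scheme, and all function names are assumptions, not details taken from the paper.

```python
from typing import Dict, List

# Hypothetical toy dictionary: surface form -> entity label (assumed labels).
LEXICON: Dict[str, str] = {
    "小麦": "CROP",
    "小麦条锈病": "DISEASE",
    "三唑酮": "PESTICIDE",
}


def forward_max_match(text: str, lexicon: Dict[str, str], max_len: int) -> List[str]:
    """Greedy left-to-right segmentation, trying the longest candidate first."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:  # single characters always match
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text: str, lexicon: Dict[str, str], max_len: int) -> List[str]:
    """Greedy right-to-left segmentation, trying the longest candidate first."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                words.append(piece)
                j -= size
                break
    return words[::-1]


def bidirectional_max_match(text: str, lexicon: Dict[str, str]) -> List[str]:
    """Pick the better of the forward and backward segmentations."""
    max_len = max((len(w) for w in lexicon), default=1)
    fwd = forward_max_match(text, lexicon, max_len)
    bwd = backward_max_match(text, lexicon, max_len)

    def cost(seg: List[str]):
        # Fewer words first, then fewer single-character words.
        return (len(seg), sum(len(w) == 1 for w in seg))

    return fwd if cost(fwd) < cost(bwd) else bwd


def dictionary_tags(text: str, lexicon: Dict[str, str]) -> List[str]:
    """Map each character to a B-/I- tag for matched entries, or O otherwise."""
    tags: List[str] = []
    for word in bidirectional_max_match(text, lexicon):
        label = lexicon.get(word)
        if label is None:
            tags.extend(["O"] * len(word))
        else:
            tags.extend([f"B-{label}"] + [f"I-{label}"] * (len(word) - 1))
    return tags


print(dictionary_tags("三唑酮防治小麦条锈病", LEXICON))
```

For this sentence, the first three characters are tagged as a pesticide, the two middle characters as O, and the last five as a disease. In a model such as the one described, tags like these would then be encoded (for example as one-hot vectors or learned embeddings) before being fused with the character-level features.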
The results showed that the model achieved strong overall performance on the corpus, with a recognition precision, recall, and F1-score of 94.84%, 95.23%, and 95.03%, respectively. In terms of specific categories, because crop diseases and pesticides have distinct boundary characteristics, the model recognized them with higher precision than the remaining three entity categories: machinery, pests, and crop varieties. In the comparison of dictionary feature extraction strategies, the model based on the bidirectional maximum matching algorithm outperformed the one based on the N-gram feature template algorithm. With 10 templates, the N-gram-based model performed best, with a recognition precision of 93.95% and an F1-score of 94.03%. For the bidirectional maximum matching algorithm, feature embedding captured more latent information than one-hot encoding, improving the precision and F1-score of the model by 0.49 and 0.91 percentage points, respectively. Compared with the BiLSTM-CRF and BERT-BiLSTM-CRF models, the BERT-Dic-BiLSTM-CRF model proposed in this paper had a clear advantage, with the highest recognition precision of 94.84%. Compared with the BERT-BiLSTM-CRF model, the BERT-Dic-BiLSTM-CRF model improved recognition precision by 5.93 and 6.44 percentage points for rare and unknown entities, respectively, further verifying that integrating dictionary features into the model improves recognition accuracy for such entities.
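
For comparison with the matching-based strategy, an N-gram feature template approach can be sketched as follows. The abstract does not state which templates were used, so the choice here is purely an assumption: for each character, one binary flag per n-gram ending at that character and one per n-gram starting at it, for n from 2 to 6, which happens to give 10 flags per character. The function name and toy lexicon are likewise hypothetical.

```python
from typing import List, Set


def ngram_template_features(text: str, lexicon: Set[str], max_n: int = 6) -> List[List[int]]:
    """Return, for each character, binary flags marking whether the n-gram
    ending at / starting at that character is a dictionary entry (n = 2..max_n)."""
    features: List[List[int]] = []
    for i in range(len(text)):
        row: List[int] = []
        for n in range(2, max_n + 1):
            # n-gram whose last character is text[i]; empty if it would run off the left edge
            ending = text[i - n + 1:i + 1] if i - n + 1 >= 0 else ""
            # n-gram whose first character is text[i]; empty if it would run off the right edge
            starting = text[i:i + n] if i + n <= len(text) else ""
            row.append(int(ending in lexicon))
            row.append(int(starting in lexicon))
        features.append(row)
    return features


# Hypothetical toy dictionary entries.
lexicon = {"小麦", "小麦条锈病", "三唑酮"}
for char, row in zip("小麦条锈病", ngram_template_features("小麦条锈病", lexicon)):
    print(char, row)
```

Unlike maximum matching, which commits to a single segmentation, template features of this kind expose every dictionary hit around a character and leave the model to weigh them.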

       
