郑丽敏, 任乐乐. 采用融合规则与BERT-FLAT模型对营养健康领域命名实体识别[J]. 农业工程学报, 2021, 37(20): 211-218. DOI: 10.11975/j.issn.1002-6819.2021.20.024
    引用本文: 郑丽敏, 任乐乐. 采用融合规则与BERT-FLAT模型对营养健康领域命名实体识别[J]. 农业工程学报, 2021, 37(20): 211-218. DOI: 10.11975/j.issn.1002-6819.2021.20.024
    Zheng Limin, Ren Lele. Named entity recognition in human nutrition and health domain using rule and BERT-FLAT[J]. Transactions of the Chinese Society of Agricultural Engineering (Transactions of the CSAE), 2021, 37(20): 211-218. DOI: 10.11975/j.issn.1002-6819.2021.20.024
    Citation: Zheng Limin, Ren Lele. Named entity recognition in human nutrition and health domain using rule and BERT-FLAT[J]. Transactions of the Chinese Society of Agricultural Engineering (Transactions of the CSAE), 2021, 37(20): 211-218. DOI: 10.11975/j.issn.1002-6819.2021.20.024

    采用融合规则与BERT-FLAT模型对营养健康领域命名实体识别

    Named entity recognition in human nutrition and health domain using rule and BERT-FLAT

    • 摘要: 人类营养健康命名实体识别旨在检测营养健康文本中的营养实体,是进一步挖掘营养健康信息的关键步骤。虽然深度学习模型广泛应用在人类营养健康命名实体识别中,但没有充分考虑到营养健康文本中含有大量的复杂实体而出现长距离依赖的特点,且未能充分考虑词汇信息和位置信息。针对人类营养健康文本的特点,该研究提出了融合规则与BERT-FLAT(Bidirectional Encoder Representations from Transfromers-Flat Lattice Transformer,转换器的双向编码器表征量-平格变压器)模型的营养健康文本命名实体识别方法,识别了营养健康领域中食物、营养物质、人群、部位、病症和功效作用6类实体。首先通BERT模型将字符信息和词汇信息进行嵌入以提高模型对实体类别的识别能力,再通过位置编码与词汇边界信息结合的Transformer模型进行编码以提高模型对实体边界的识别效果,利用CRF(Coditional Random Field,条件随机场)获取字符预测序列,最后通过规则对预测序列进行修正。试验结果表明,融合规则与BERT-FLAT模型的人类营养健康领域识别的准确率为95.00%,召回率为88.88%,F1分数为91.81%。研究表明,该方法是一种有效的人类营养健康领域实体识别方法,可以为农业、医疗、食品安全等其他领域复杂命名实体识别提供新思路。

       

      Abstract: Abstract: A nutritious and healthy diet can be widely expected to reduce the incidence of disease, while improving body health after the disease occurs. The nutritional diet knowledge can be acquired mostly through the Internet in recent years. However, reliable and integrated information is highly difficult to discern using time-consuming searching of the huge amount of Internet data. It is an urgent need to integrate the complicated data, and then construct the knowledge graph of nutrition and health, particularly with timely and accurate feedback. Among them, a key step is to accurately identify entities in nutritional health texts, providing effective location data support to the construction of knowledge graphs. In this study, a BRET+BiLSTM+CRF (Bidirectional Encoder Representations from Transformers + Bi-directional Long Short-Term Memory + Conditional Random Field) model was first used with location information. It was found that the precision of the model was 86.56%, the recall rate was 91.01%, and the F1 score was 88.72%, compared with the model without location information, indicating improved by 1.55, 0.20, and 0.32 percentage points. A named entity recognition was also proposed to accurately obtain six types of entities in text: food, nutrients, population, location, disease, and efficacy in the field of human nutritional health, combining rules with BERT-FLAT (Bidirectional Encoder Representations from Transformers-Flat Lattice Transformer) model. Firstly, the character and vocabulary information were stitched together and pre-trained in the BERT model to improve the recognition ability of the model to entity categories. Then, a position code was created for the head and tail position of each character and vocabulary, where the entity position was located with the help of a position vector, in order to improve the recognition of entity boundary. A long-distance dependency was also captured using the Transformer model. Specifically, the output of the BERT model was embedded into the Transformer as a character-embedding conjunction word, thus for the character-vocabulary fusion. Then the text prediction sequence was obtained from the CRF layer. Finally, seven rules were formulated, according to the text characteristics in the field of nutrition and health, where the prediction sequence was modified according to the rules. The experimental results showed that the F1 score of the BERT-FLAT model was 88.99%. The BERT model combined with the word fusion performed the best, compared with that without the Bert model, indicating an effective recognition performance. Correspondingly, the named entity recognition model in the field of nutrition and health using fusion rules and the BERT-FLAT model presented an accuracy rate of 95.00%, a recall rate of 88.88%, and an F1 score of 91.81%. The F1 score increased by 2.82 percentage points than before. The finding can provide an effective entity recognition in the field of human nutrition and health.

       

    /

    返回文章
    返回