Multi-feature fusion named entity recognition method for grape knowledge graph construction

    • Abstract: To address the low recognition rate of domain-specific entities during knowledge graph construction, caused by complex contexts and the relatively homogeneous semantic representation of character vectors in existing models, this study proposes a named entity recognition model (BERT-based named entity recognition with residual structure, BBNER-MRS) that fuses the Bi-directional Encoder Representation from Transformer (BERT) with a Residual Structure (RS). The text is first mapped into character vectors by BERT; a Bi-directional Long-Short Term Memory network (BiLSTM) then extracts local character-vector features, while the RS preserves the global character-vector features provided by BERT, enriching the semantics of the character vectors; finally, a Conditional Random Field (CRF) model decodes the feature vectors to obtain the globally optimal label sequence. Compared with other named entity recognition models, the proposed BBNER-MRS model performs well, achieving F1 values of 89.89%, 95.02%, 83.21%, 96.15%, and 72.51% on the Grape, People's Daily, BOSON, RESUME, and Weibo datasets, respectively. Finally, building on BBNER-MRS, this study proposes a two-stage deep-learning-based method for constructing domain knowledge graphs and successfully constructs a grape knowledge graph. The results can provide technical and data support for practitioners in the field.
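The CRF decoding step described above searches for the globally optimal tag sequence over the whole sentence rather than picking each tag independently. A minimal sketch of that search, Viterbi decoding over per-token emission scores and a tag-transition matrix (the scores here are illustrative values, not the paper's trained parameters):

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag sequence for one sentence.

    emissions:   list of [n_tags] score lists, one per token (from the encoder).
    transitions: transitions[i][j] is the score of moving from tag i to tag j.
    """
    n_tags = len(emissions[0])
    # score[j] = best score of any path ending in tag j at the current position
    score = list(emissions[0])
    backptr = []
    for emit in emissions[1:]:
        new_score, ptrs = [], []
        for j in range(n_tags):
            # Best previous tag for a path arriving at tag j
            best_i = max(range(n_tags), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
            ptrs.append(best_i)
        score = new_score
        backptr.append(ptrs)
    # Backtrack from the best final tag to recover the global optimum
    best_last = max(range(n_tags), key=lambda j: score[j])
    path = [best_last]
    for ptrs in reversed(backptr):
        path.append(ptrs[path[-1]])
    path.reverse()
    return path
```

With two tags, emissions favoring tag 0 at positions 1 and 3, and transitions rewarding staying on the same tag, the decoder prefers the consistent path `[0, 0, 0]` even though position 2 locally favors tag 1.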

       

      Abstract: A domain knowledge graph stores data in a structured, fine-grained form and models the real world as triples, so that dispersed knowledge can be organized effectively and widely used in fields such as healthcare, finance, and the Internet. Meanwhile, the grape is one of the most important economic fruits in agriculture. However, a large amount of knowledge in the grape domain is unstructured, which limits its use in downstream data-driven tasks, and knowledge graphs remain rare in the agricultural domain. It is therefore necessary to construct a knowledge graph for the grape domain, particularly for knowledge storage and sharing. Furthermore, when constructing domain knowledge graphs, key information is often implicit in a complex contextual environment. The character-vector semantic representations of existing named entity recognition (NER) models are relatively homogeneous, leading to a low recognition rate for domain-specific entities and ultimately affecting the efficiency and quality of knowledge graph construction. In this study, a named entity recognition model was proposed that fuses the Bi-directional Encoder Representation from Transformer (BERT) with a Residual Structure (RS). Firstly, the raw text was mapped into character vectors using BERT: the input sentences were embedded with token, segment, and position embeddings, and a Multi-head Attention mechanism was applied to the embedded vectors to calculate the correlation between the current character and the other characters in the sentence, adjusting their weights and thereby endowing the character vectors provided by BERT with global characteristics. A Bi-directional Long-Short Term Memory network (BiLSTM) then extracted deep local features from the BERT character vectors in both the forward and backward directions. Two simple but effective residual structures were designed to fuse the global features provided by BERT with the deep local features provided by BiLSTM: the mapping residual structure mapped the BERT feature vectors into a reduced dimension to preserve as much of BERT's original information as possible, while the convolution residual structure convolved the feature vectors twice to extract additional information. Finally, the feature vectors were decoded by a Conditional Random Field (CRF) model. Compared with other NER models, the proposed BBNER-MRS model performs better overall, with F1 values of 89.89%, 95.02%, 83.21%, 96.15%, and 72.51% on the Grape, People's Daily, BOSON, RESUME, and Weibo datasets, respectively. A two-stage deep-learning-based method for domain knowledge graph construction was also proposed: in the first stage, a domain ontology is constructed; in the second stage, a deep learning model extracts knowledge under the constraints of the ontology and builds triples. BBNER-MRS performed best when constructing triples from unstructured text, with an F1 value of 86.44%. Finally, BBNER-MRS was used to successfully construct a grape knowledge graph. This research can provide technical and data support for the standardization and sharing of domain data.
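The mapping residual structure described above can be pictured as a dimension-reducing projection of BERT's global features added onto the BiLSTM's local features as a residual. A minimal sketch under that reading (the shapes and the projection matrix `w_proj` are illustrative assumptions, not the paper's actual configuration):

```python
import numpy as np

def mapping_residual_fuse(bert_feats, bilstm_feats, w_proj):
    """Fuse global BERT features with local BiLSTM features via a
    mapped (dimension-reduced) residual connection.

    bert_feats:   (seq_len, d_bert)   global character features from BERT
    bilstm_feats: (seq_len, d_hidden) local features from the BiLSTM
    w_proj:       (d_bert, d_hidden)  projection mapping BERT features
                                      down to the BiLSTM output dimension
    """
    projected = bert_feats @ w_proj   # preserve BERT's original information
    return bilstm_feats + projected   # residual addition keeps both views

# Toy shapes: 4 characters, BERT dim 8, BiLSTM output dim 4
rng = np.random.default_rng(0)
fused = mapping_residual_fuse(rng.normal(size=(4, 8)),
                              rng.normal(size=(4, 4)),
                              rng.normal(size=(8, 4)))
```

The residual addition means the fused vector carries the BiLSTM's local view unchanged, with the projected global view layered on top, which matches the abstract's stated goal of enriching the semantics of the character vectors.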

       
