Abstract:
Answer selection is one of the most important tasks in natural language processing, underpinning downstream applications such as question-answering systems and search ranking. It aims to select the most relevant answer to a given question from a pool of candidate answers, and is usually treated as a relevance ranking task. However, current answer selection models cannot discover the deep semantic relationships between questions and answers from the limited information in the text of the question-answer pairs alone. A knowledge graph can serve as background knowledge to enrich the deep semantics of an answer selection model; nevertheless, existing models rely solely on their own textual information and still lack the support of multi-modal background knowledge. In this study, an answer selection model enhanced by a multi-modal knowledge graph was proposed, consisting of an embedding layer, a representation learning layer, a knowledge graph enhancement layer, and an output layer. In the embedding layer, the GloVe model was used to obtain word embeddings for the question and answer texts, a ComplEx-based method (complex embedding) was designed to learn entity embeddings for the multi-modal knowledge graph, and image feature representations of image entities were extracted using the Vision Transformer (ViT). In the representation learning layer, a bi-directional long short-term memory (Bi-LSTM) network was used to learn representations of the question and answer texts, and context-guided vector representations of the multi-modal knowledge graph for the question and answer were obtained using a context-guided attention mechanism. In the knowledge graph enhancement layer, an interactive attention mechanism was used to fuse the semantic representations of the question and answer texts with the background knowledge features provided by the multi-modal knowledge graph, yielding knowledge-enhanced feature representations of the question and answer. In the output layer, these knowledge-enhanced feature representations were concatenated with additional semantic features, and the softmax function was used to predict the probability distribution of answer labels for a given question. Taking grape planting as an example, multi-modal entity linking was realized using the longest common subsequence algorithm, and entity recognition for knowledge extraction was implemented using a BERT-LSTM-CRF framework built on the BERT pre-trained model. Reference knowledge was collected from the literature and domain experts, and a multi-modal knowledge graph for the grape planting domain was constructed. A grape planting question-answer dataset was also built using grape forums, smart agriculture platforms, Agricultural Manager, and Agricultural Benefit Network as data sources, followed by text cleaning and dataset expansion. Experimental results show that the model achieved better performance by drawing additional information from the multi-modal knowledge graph. Specifically, the mean reciprocal rank and mean average precision reached 85.02% and 84.21%, respectively, on the grape question-answering dataset, an increase of 2.57 and 3.96 percentage points, respectively. Enhancing answer selection with knowledge from a multi-modal knowledge graph can therefore be expected to improve model performance.
Embedding representations combined with attention mechanisms can be utilized to incorporate background knowledge from the multi-modal knowledge graph. These findings can provide a technical basis for downstream applications of multi-modal knowledge graphs, such as search and question answering.
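
For reference, the entity embeddings described above build on the standard ComplEx scoring function, in which each head entity h, relation r, and tail entity t is assigned a complex-valued vector and a triple is scored by the real part of a trilinear product (written here in LaTeX; the exact training objective used in this study is not stated in the abstract):

    \phi(h, r, t) = \mathrm{Re}\Big( \sum_{k=1}^{d} e_{h,k} \, w_{r,k} \, \overline{e_{t,k}} \Big),
    \qquad \mathbf{e}_h, \mathbf{w}_r, \mathbf{e}_t \in \mathbb{C}^{d},

where \overline{e_{t,k}} denotes the complex conjugate of the k-th component of the tail-entity embedding. The conjugate makes the score asymmetric in h and t, which allows ComplEx to model both symmetric and antisymmetric relations.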
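To make the layered architecture concrete, the following is a minimal PyTorch sketch of how the pieces fit together. All class names, dimensions, the scaled dot-product form of the attention, and the mean pooling are illustrative assumptions rather than the authors' implementation; the pre-trained GloVe, ComplEx, and ViT features are stood in for by plain tensors.

    import torch
    import torch.nn as nn


    class MKGAnswerSelector(nn.Module):
        """Illustrative sketch of the four-layer model: embedding, Bi-LSTM
        representation learning, knowledge graph enhancement via attention,
        and a softmax output layer. Names and sizes are assumptions."""

        def __init__(self, vocab_size, emb_dim=300, hidden=128,
                     kg_dim=256, extra_dim=4, n_labels=2):
            super().__init__()
            # Embedding layer: stands in for pre-trained GloVe vectors.
            self.word_emb = nn.Embedding(vocab_size, emb_dim)
            # Representation learning layer: Bi-LSTM over token sequences.
            self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                                  bidirectional=True)
            # Project multi-modal KG entity features (ComplEx/ViT) into
            # the text hidden space.
            self.kg_proj = nn.Linear(kg_dim, 2 * hidden)
            # Output layer over concatenated question, answer, and
            # additional semantic features.
            self.out = nn.Linear(4 * hidden + extra_dim, n_labels)

        @staticmethod
        def attend(query, memory):
            # Scaled dot-product attention: query attends over memory.
            scores = query @ memory.transpose(1, 2) / memory.size(-1) ** 0.5
            return torch.softmax(scores, dim=-1) @ memory

        def encode(self, tokens, kg_entities):
            h, _ = self.bilstm(self.word_emb(tokens))   # (B, T, 2*hidden)
            kg = self.kg_proj(kg_entities)              # (B, E, 2*hidden)
            kg_ctx = self.attend(h, kg)       # context-guided attention over KG
            fused = h + self.attend(h, kg_ctx)  # interactive fusion with text
            return fused.mean(dim=1)          # pooled knowledge-enhanced vector

        def forward(self, q_tokens, a_tokens, q_kg, a_kg, extra_feats):
            q = self.encode(q_tokens, q_kg)
            a = self.encode(a_tokens, a_kg)
            logits = self.out(torch.cat([q, a, extra_feats], dim=-1))
            return torch.softmax(logits, dim=-1)  # probabilities over labels

With model = MKGAnswerSelector(vocab_size=5000), a forward pass on random inputs, for example model(torch.randint(0, 5000, (2, 12)), torch.randint(0, 5000, (2, 30)), torch.randn(2, 5, 256), torch.randn(2, 5, 256), torch.randn(2, 4)), returns a 2-by-2 matrix of label probabilities, one row per question-answer pair.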
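The longest common subsequence (LCS) matching used for multi-modal entity linking can likewise be sketched briefly. Normalizing the LCS length by the longer of the two strings is an assumption here, since the abstract does not specify how raw LCS scores are turned into a linking decision.

    def lcs_length(a: str, b: str) -> int:
        """Dynamic-programming length of the longest common subsequence."""
        dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i, ca in enumerate(a, 1):
            for j, cb in enumerate(b, 1):
                dp[i][j] = (dp[i - 1][j - 1] + 1 if ca == cb
                            else max(dp[i - 1][j], dp[i][j - 1]))
        return dp[len(a)][len(b)]


    def link_entity(mention: str, kg_entities: list[str]) -> str:
        """Return the KG entity name with the highest length-normalized
        LCS overlap with the text mention (normalization is assumed)."""
        return max(kg_entities,
                   key=lambda e: lcs_length(mention, e)
                   / max(len(mention), len(e)))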