A pesticide molecule generation model based on deep learning and the scaffold-structured MHA-RNN

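    • Abstract: In recent years, deep learning models have made notable progress in pesticide discovery and de novo molecular design. However, few of the deep generative models currently used for pesticide molecular design are scaffold-based, and scaffold-based generation methods still struggle with the quality and diversity of the molecules they produce. This study therefore proposes a scaffold-based recurrent neural network model, the multi-head attention recurrent neural network (MHA-RNN), which first generates molecular scaffolds in simplified molecular input line entry system (SMILES) format and then decorates the scaffolds to produce new molecules. Experimental results show that the validity, novelty, and uniqueness of the generated molecules reached 97.18%, 99.87%, and 100.00%, respectively. In addition, the distributions of the generated molecules' properties, namely the logarithm of the partition coefficient (LogP), topological polar surface area (TPSA), molecular weight (MW), quantitative estimate of drug-likeness (QED), hydrogen bond acceptors (HBA), hydrogen bond donors (HBD), and rotatable bonds (RotB), were highly similar to those of existing molecules. The results provide technical support and a reference for the research and development of new pesticides.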
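
The abstract reports validity, novelty, and uniqueness but does not define them. The sketch below assumes commonly used definitions: validity is the fraction of generated SMILES that parse, uniqueness is the fraction of valid molecules that are distinct, and novelty is the fraction of distinct valid molecules absent from the training set. The function names and the `toy_canonicalize` stand-in are hypothetical; a real evaluation would plug in a cheminformatics toolkit's parser and canonical SMILES writer instead.

```python
from typing import Callable, Iterable, Optional


def generation_metrics(
    generated: Iterable[str],
    training: Iterable[str],
    canonicalize: Callable[[str], Optional[str]],
) -> dict:
    """Return validity, uniqueness and novelty as fractions in [0, 1].

    `canonicalize` maps a SMILES string to a canonical form, or None if
    the string is not a valid molecule (assumed interface).
    """
    generated = list(generated)
    valid = [c for c in map(canonicalize, generated) if c is not None]
    unique = set(valid)
    known = {c for c in map(canonicalize, training) if c is not None}
    novel = unique - known
    return {
        "validity": len(valid) / len(generated) if generated else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }


def toy_canonicalize(smiles: str) -> Optional[str]:
    """Placeholder only: accepts any non-empty string with balanced parentheses.

    This is NOT a real SMILES validity check or canonicalizer.
    """
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return None
    if depth != 0 or not smiles.strip():
        return None
    return smiles.strip()


if __name__ == "__main__":
    generated = ["CCO", "CCO", "c1ccccc1", "CC(=O", "CCN"]
    training = ["CCO"]
    print(generation_metrics(generated, training, toy_canonicalize))
```

Passing the canonicalizer as a parameter keeps the metric logic independent of any particular chemistry library.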

       

      Abstract: Pesticides that control pests and diseases play an important role in sustaining crop yields in modern agriculture. However, pesticide development is a long and expensive process with a low success rate. In recent years, deep learning has shown promise for significantly improving the efficiency of pesticide research and application. Molecular generation methods can be built on atoms, fragments, reactions, or scaffolds, each with its own characteristics. Among them, scaffold-based approaches show significant potential in drug discovery and compound design, because existing chemical knowledge can be used to adjust molecular structures during generation to meet the requirements of different targets and biological activities. However, the quality and validity of the generated molecules still need to improve if new compounds with the desired molecular characteristics are to be found, and existing models cannot fully capture the complex structural features of molecules during generation. In this study, a scaffold-based pesticide molecule generation model was proposed, called the multi-head attention recurrent neural network (MHA-RNN). Building generation on the molecular scaffold allowed the structural features of the molecules to be captured better, which maintained the rationality and validity of the generated molecules and significantly enhanced their uniqueness. The MHA-RNN model first generated molecular scaffolds in SMILES format and then decorated these scaffolds to obtain new molecules. Data preprocessing involved two steps. In the first step, the molecules were sliced into combinations of scaffolds and decorations. In the second step, data augmentation was applied to expand the dataset of scaffolds and decorations, so that the model learned multiple representations of a single molecule, which enhanced its robustness and generalization. 
The model consisted of three parts: an encoder, a multi-head attention layer, and a decoder. The encoder was a simple bidirectional RNN that encoded the input sequence. The multi-head attention layer computed attention weights and context vectors over the encoder's output during decoding, so that the decoder could dynamically select the most relevant encoder information for each output, which improved the accuracy and validity of generation. The decoder was a unidirectional RNN combined with the attention mechanism and linear layers, and it generated the output sequence. Five hyperparameters were tuned: num-heads (2, 4, 8), num-layers (2, 3, 4), learning-rate-start (1E-3, 1E-4, 1E-5), layer-size (512, 1024), and embedding-layer-size (256, 512). The generated molecules performed best when num-heads=2, num-layers=3, learning-rate-start=1E-3, layer-size=512, and embedding-layer-size=256, with validity, novelty, and uniqueness reaching 97.18%, 99.87%, and 100.00%, respectively. The model was compared with four commonly used molecular generation models. On the same dataset and training equipment, it was more effective and innovative in generating pesticide molecules. Additionally, the parameter tuning experiments indicated that the properties of the generated molecules (such as LogP, TPSA, MW, QED, HBA, HBD, and RotB) were highly similar to those of existing molecules when num-heads=2, num-layers=3, learning-rate-start=1E-5, layer-size=512, and embedding-layer-size=256. 
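
To illustrate how an attention layer can turn encoder outputs into attention weights and a context vector at one decoding step, here is a minimal pure-Python sketch. It assumes scaled dot-product scoring, reuses the encoder outputs as both keys and values, and omits the learned projection matrices; the abstract does not state which scoring function or projections the authors used, so the function names and toy vectors are illustrative only.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Vector, keys: List[Vector]) -> Tuple[Vector, Vector]:
    """Single head: return (attention weights, context vector).

    Assumption: keys double as values, and scores are scaled dot products.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    context = [
        sum(w * key[i] for w, key in zip(weights, keys))
        for i in range(len(query))
    ]
    return weights, context


def multi_head_attend(
    query: Vector, keys: List[Vector], num_heads: int
) -> Tuple[List[Vector], Vector]:
    """Split each vector into `num_heads` slices, attend per slice,
    and concatenate the per-head context vectors."""
    dim = len(query)
    if dim % num_heads != 0:
        raise ValueError("vector size must be divisible by num_heads")
    head_dim = dim // num_heads
    all_weights: List[Vector] = []
    context: Vector = []
    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        w, c = attend(query[lo:hi], [k[lo:hi] for k in keys])
        all_weights.append(w)
        context.extend(c)
    return all_weights, context


if __name__ == "__main__":
    decoder_state = [1.0, 0.0, 0.0, 1.0]          # toy decoder hidden state
    encoder_outputs = [                            # one toy vector per input token
        [1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 1.0],
    ]
    weights, context = multi_head_attend(decoder_state, encoder_outputs, num_heads=2)
    print(weights)
    print(context)
```

In this toy input each head favors a different encoder position, which is the behavior that lets a decoder draw on several parts of the input at once.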
The generalization of the model was verified by generating 100 new molecules from a specific scaffold and comparing them with 100 molecules with the same scaffold selected from the training set. Morgan fingerprints and similarities were calculated for the two sets of molecules, and a similarity matrix was constructed to analyze the differences between them. Next, the model was validated on the molecules corresponding to the ALS enzyme. The generated molecules performed well in terms of validity, novelty, and uniqueness, as well as physicochemical properties. Finally, the newly generated molecules were docked to the ALS enzyme to determine their binding properties, and they bound strongly. In summary, the MHA-RNN model demonstrated outstanding performance in pesticide molecular generation. The findings can provide new ideas for pesticide research and development and may support efficient and sustainable production in modern agriculture.
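
The similarity matrix between generated and training-set molecules can be sketched as follows. The abstract does not name the similarity measure, so Tanimoto similarity is assumed here because it is a common choice for Morgan fingerprints. Each fingerprint is represented as a set of on-bit indices, and the example sets are made-up toy values rather than real fingerprints; in practice they would come from a cheminformatics toolkit.

```python
from typing import List, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similarity_matrix(rows: List[Set[int]], cols: List[Set[int]]) -> List[List[float]]:
    """Entry [i][j] is the similarity between rows[i] and cols[j]."""
    return [[tanimoto(r, c) for c in cols] for r in rows]


if __name__ == "__main__":
    # Toy fingerprints (hypothetical bit indices, for illustration only).
    generated = [{1, 2, 3}, {2, 3, 4, 5}]
    training = [{1, 2, 3}, {7, 8}]
    matrix = similarity_matrix(generated, training)
    for row in matrix:
        print(row)
    # The mean of all entries gives one summary of how close the two sets are.
    flat = [value for row in matrix for value in row]
    print(sum(flat) / len(flat))
```

With 100 generated and 100 training molecules, the same function would yield a 100 × 100 matrix whose rows correspond to generated molecules and whose columns correspond to training molecules.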

       
