Abstract
Pesticides that control pests and diseases play an important role in securing crop yields in modern agriculture. However, pesticide development is a long and expensive process with a low success rate. In recent years, deep learning has shown promise for significantly improving the efficiency of pesticide research and application. Molecular generation can operate on atoms, fragments, reactions, or scaffolds, and each of these approaches has its own characteristics. Among them, scaffold-based approaches show significant potential in drug discovery and compound design, because existing chemical knowledge can be used to adjust molecular structures during generation to meet the requirements of different targets and biological activities. However, the quality and validity of the generated molecules still need to improve before such models can be used to explore new compounds with the desired molecular characteristics. Furthermore, existing models cannot fully capture the complex structural features of molecules during generation. In this study, a scaffold-based generation model for pesticide molecules was proposed, which combines multi-head attention with a recurrent neural network (MHA-RNN). By building on the molecular scaffold, the model better captured the structural features of molecules and maintained the rationality and validity of the generated molecules. The uniqueness of the generated molecules was also enhanced significantly. The MHA-RNN model first generated molecular scaffolds in SMILES format and then decorated these scaffolds to obtain new molecules. Data preprocessing involved two steps. In the first step, the molecules were sliced into combinations of scaffolds and decorations. In the second step, data augmentation was applied to expand the dataset of scaffolds and decorations, so that the model learned multiple representations of a single molecule, which enhanced its robustness and generalization.
The model consisted of three parts: an encoder, a multi-head attention layer, and a decoder. The encoder was a simple bidirectional RNN that encoded the input sequence. The multi-head attention layer attended over the encoder's output during decoding, computing the attention weights and context vectors so that the decoder could focus on important information. The decoder was a unidirectional RNN with attention and linear layers that generated the output sequence. At each output step it dynamically selected the most relevant encoder information, thereby improving the accuracy and validity of generation. Five hyperparameters were then tuned: num-heads (2, 4, 8), num-layers (2, 3, 4), learning-rate-start (1E-3, 1E-4, 1E-5), layer-size (512, 1024), and embedding-layer-size (256, 512). The generated molecules performed best when num-heads=2, num-layers=3, learning-rate-start=1E-3, layer-size=512, and embedding-layer-size=256, reaching a validity, novelty, and uniqueness of 97.18%, 99.87%, and 100.00%, respectively. The model was compared with four commonly used molecule generation models. On the same dataset and training equipment, the proposed model generated pesticide molecules more effectively and with greater novelty than the four baselines. Additionally, the parameter tuning experiments indicated that the properties of the generated molecules (such as LogP, TPSA, MW, QED, HBA, HBD, and RotB) were highly similar to those of existing molecules when num-heads=2, num-layers=3, learning-rate-start=1E-5, layer-size=512, and embedding-layer-size=256.
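The abstract does not state how the three generation metrics are computed. The following is a minimal sketch, assuming the definitions commonly used for generative models: validity is the fraction of generated strings that pass a validity check, uniqueness is the fraction of valid strings that are distinct, and novelty is the fraction of distinct valid strings absent from the training set. The function names, the toy validity check, and the example SMILES are all hypothetical. In practice the check would be a cheminformatics parser, and all strings would be canonicalized before comparison.

```python
from typing import Callable, Iterable


def generation_metrics(
    generated: Iterable[str],
    training_set: Iterable[str],
    is_valid: Callable[[str], bool],
) -> dict[str, float]:
    """Return validity, uniqueness, and novelty as fractions in [0, 1].

    Assumed definitions (not taken from the paper):
      validity   = valid / generated
      uniqueness = distinct valid / valid
      novelty    = distinct valid not in training set / distinct valid
    """
    generated = list(generated)
    valid = [s for s in generated if is_valid(s)]
    unique = set(valid)
    novel = unique - set(training_set)
    return {
        "validity": len(valid) / len(generated) if generated else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }


def toy_is_valid(smiles: str) -> bool:
    """Placeholder check: non-empty with balanced parentheses.

    This is NOT real chemistry; it only stands in for a SMILES parser.
    """
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return bool(smiles) and depth == 0


# Hypothetical example: five generated strings, one training molecule.
metrics = generation_metrics(
    generated=["CCO", "CCO", "CC(=O)O", "CC(C", "c1ccccc1"],
    training_set=["CCO"],
    is_valid=toy_is_valid,
)
print(metrics)
```

Passing the validity check as a callable keeps the metric logic independent of any particular toolkit, so the placeholder can be swapped for a real parser without changing the rest of the sketch.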
To verify the generalization of the model, 100 new molecules were generated from a specific scaffold and compared with 100 molecules with the same scaffold selected from the training set. Morgan fingerprints were computed for both sets of molecules, and a similarity matrix was constructed from their pairwise similarities to analyze the differences between the two sets. Next, the model was validated on molecules corresponding to the ALS enzyme. The generated molecules performed well in terms of validity, novelty, and uniqueness, as well as physicochemical properties. Finally, the newly generated molecules were docked to the ALS enzyme to determine the binding properties of the small molecules, and they bound strongly to the enzyme. In summary, the MHA-RNN model demonstrated outstanding performance in pesticide molecular generation. The findings can also provide new ideas for pesticide research and development. The model is expected to support highly efficient and sustainable production in modern agriculture.
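The fingerprint comparison between the generated and training sets can be sketched as follows. The abstract does not name the similarity measure, so this sketch assumes the Tanimoto coefficient, a common choice for Morgan fingerprints. Each fingerprint is represented as a set of on-bit indices. The bit sets below are hypothetical stand-ins, and real fingerprints would be computed from the molecules with a cheminformatics toolkit.

```python
def tanimoto(a: frozenset[int], b: frozenset[int]) -> float:
    """Tanimoto similarity between two fingerprints given as on-bit index sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similarity_matrix(
    rows: list[frozenset[int]], cols: list[frozenset[int]]
) -> list[list[float]]:
    """Pairwise similarities: entry [i][j] compares rows[i] with cols[j]."""
    return [[tanimoto(r, c) for c in cols] for r in rows]


# Hypothetical on-bit sets standing in for Morgan fingerprints.
generated_fps = [frozenset({1, 5, 9, 12}), frozenset({2, 5, 7})]
training_fps = [frozenset({1, 5, 9}), frozenset({2, 5, 7}), frozenset({3, 4})]

matrix = similarity_matrix(generated_fps, training_fps)

# Highest similarity of each generated molecule to any training molecule.
nearest = [max(row) for row in matrix]
print(matrix)
print(nearest)
```

In the setting described above, the matrix would have 100 rows and 100 columns. A row maximum of 1.0 would indicate a generated molecule whose fingerprint is identical to that of a training molecule, while lower maxima would indicate structurally new decorations on the shared scaffold.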