
    Multimodal-guided visual Transformer for few-shot crop disease recognition

• Abstract: To address the problems of single-modality information and low recognition accuracy in existing few-shot-learning-based crop disease recognition, this study proposes a multimodal few-shot learning (MMFSL) model and applies it to crop disease recognition in low-data scenarios. First, the model introduces a Vision Transformer (ViT) into the few-shot learning image branch in place of the conventional convolutional neural network encoder, exploiting ViT's global perception to strengthen feature extraction from few-shot images. Second, a text branch based on a pretrained language model is designed: class labels are embedded into a hand-crafted prompt template, and the hidden vector at a specific position in the template is extracted as the text embedding, guiding the model to select visual features more precisely. Finally, an image-text contrastive module is constructed to align the semantic information of images and text, and the model-agnostic meta-learning (MAML) algorithm is used to optimize the network parameters, enabling efficient fusion of multimodal information. Experimental results show that, under the 1-shot setting, the MMFSL model achieves average accuracies of 86.97% and 56.78% on the PlantVillage dataset and a self-built field disease dataset, respectively; under the 5-shot setting, it achieves average accuracies of 96.33% and 74.49% on the two datasets, outperforming the compared few-shot learning models in all cases. Moreover, compared with the unimodal few-shot learning model, MMFSL improves the average accuracy by 2.77 and 0.80 percentage points under the 1-shot and 5-shot settings, respectively. The results indicate that incorporating textual information improves the generalization of few-shot learning models and can serve as a technical reference for reducing the cost of disease data collection in deep learning applications.
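The text branch described above can be illustrated with a minimal sketch (not the authors' code): a class label is inserted into a hand-crafted prompt template, encoded with a pretrained language model, and the hidden vector at one position of the sequence is taken as the text embedding. The model name (bert-base-uncased), the template wording, and the use of the [CLS] position are illustrative assumptions, not details from the paper.

```python
# Sketch of a prompt-based text branch: label -> prompt template -> pretrained
# language model -> one hidden vector used as the text embedding.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")   # assumed backbone
encoder = AutoModel.from_pretrained("bert-base-uncased")

PROMPT = "A photo of a leaf with {}."   # hand-crafted prompt template (assumed wording)

@torch.no_grad()
def text_embedding(class_label: str) -> torch.Tensor:
    """Encode a prompted class label and return one hidden vector as the text feature."""
    text = PROMPT.format(class_label.replace("_", " "))
    inputs = tokenizer(text, return_tensors="pt")
    hidden = encoder(**inputs).last_hidden_state     # (1, seq_len, hidden_dim)
    return hidden[:, 0, :]                           # [CLS] position as embedding (assumption)

emb = text_embedding("tomato early blight")
print(emb.shape)   # torch.Size([1, 768])
```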

       

Abstract: Accurate identification of plant diseases plays a crucial role in plant protection and in the intelligent development of agricultural production. Few-shot learning (FSL) offers a potential solution to the scarcity of disease data in crop disease identification. However, existing FSL methods rely only on low-level image features for disease recognition and ignore the correlations between multimodal data, which limits recognition performance when samples are scarce. To address these issues, this study proposes a novel multimodal few-shot learning (MMFSL) model and applies it to crop disease identification in low-data scenarios. MMFSL consists of three components: the FSL image branch, the text branch, and the image-text contrastive learning module. First, a Vision Transformer (ViT) was introduced into the FSL image branch of MMFSL in place of the conventional convolutional neural network (CNN) encoder. By splitting each input sample into small patches and establishing semantic correspondences between local regions of the image, ViT effectively enhanced feature extraction in few-shot image tasks. Second, a text branch based on a pretrained language model was developed to extract textual label information related to the image categories and thereby guide the FSL branch in selecting image features. To bridge the gap between the pretrained model and the downstream task, a hand-crafted prompt template was created to embed the class labels as the input text of the text branch. Finally, the image-text contrastive module was built on a bilinear metric function to align the semantic information of images and text. The network parameters were updated with model-agnostic meta-learning (MAML) to facilitate cross-modal information learning and fusion. A series of comparative experiments was conducted on MMFSL under identical conditions using a self-constructed field-scenario dataset and the PlantVillage dataset. The experimental results show that, under the 1-shot and 5-shot settings, MMFSL achieved average accuracies of 86.97% and 96.33%, respectively, on PlantVillage. When transferred to complex field scenarios, the model still maintained average accuracies of 56.78% and 74.49% on the 1-shot and 5-shot tasks, respectively. Compared with mainstream FSL models, including MAML, Matching Net, Prototypical Network, DeepEMD, DeepBDC, and FewTURE, MMFSL achieved the highest classification accuracy, with particularly strong performance on 5-way 1-shot tasks. A comparison of four encoders (ViT-Tiny, Swin-Tiny, DeiT-Small, and DeiT-Tiny) revealed that Swin-Tiny was the most effective at extracting image feature information, with average accuracies of 84.20% and 95.53% under the 1-shot and 5-shot settings, respectively. An experiment was also carried out to select the image-text metric function: the bilinear metric outperformed the two conventional metrics, cosine similarity and dot product. The ablation study further showed that the average accuracy of the MMFSL model increased by 2.77 and 0.80 percentage points under the 1-shot and 5-shot settings, respectively, compared with the unimodal FSL model. In summary, MMFSL demonstrates excellent disease recognition accuracy and robustness in both laboratory and field scenarios. Incorporating textual information into the FSL model effectively alleviates the limitations in feature representation caused by the scarcity of image samples. In low-data scenarios, MMFSL presents a viable and promising approach for plant disease recognition.
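As a rough illustration of the image-text alignment step described above, the sketch below implements a learnable bilinear metric s(v, t) = vᵀ W t between image features and class-text embeddings, with a cross-entropy alignment loss over the resulting similarity matrix. The feature dimensions and the exact loss form are assumptions for illustration, not the paper's implementation.

```python
# Sketch of a bilinear image-text metric and a simple alignment loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BilinearImageTextMetric(nn.Module):
    def __init__(self, img_dim: int = 384, txt_dim: int = 768):
        super().__init__()
        # Learnable bilinear form W so that s(v, t) = v^T W t
        self.W = nn.Parameter(torch.empty(img_dim, txt_dim))
        nn.init.xavier_uniform_(self.W)

    def forward(self, img_feats: torch.Tensor, txt_feats: torch.Tensor) -> torch.Tensor:
        # img_feats: (B, img_dim), txt_feats: (C, txt_dim) -> similarity matrix (B, C)
        return img_feats @ self.W @ txt_feats.t()

def image_text_alignment_loss(sim: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # Cross-entropy over class-text similarities pulls each image toward its label text.
    return F.cross_entropy(sim, labels)

# Toy usage: 10 query images in a 5-way episode
metric = BilinearImageTextMetric()
sim = metric(torch.randn(10, 384), torch.randn(5, 768))
loss = image_text_alignment_loss(sim, torch.randint(0, 5, (10,)))
```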
