Abstract:
Accurate identification of plant diseases plays a crucial role in plant protection and in the intelligent development of agricultural production. In response to the scarcity of disease data, few-shot learning (FSL) offers a potential solution for crop disease identification. However, existing FSL methods rely only on low-level image features for disease recognition and ignore the correlations among multimodal data, which limits recognition performance when samples are scarce. To address these issues, this study proposes a novel multimodal few-shot learning (MMFSL) model and applies it to crop disease identification in low-data scenarios. MMFSL consists of three components: an FSL image branch, a text branch, and an image-text contrastive learning module. First, a Vision Transformer (ViT) was introduced into the FSL image branch in place of the conventional convolutional neural network (CNN) encoder; by segmenting input samples into small patches and establishing semantic correspondences between local regions of the image, the ViT effectively enhances feature extraction in few-shot image tasks. Second, a text branch based on a pre-trained language model was developed to extract label text information related to image categories and thereby guide feature selection in the image branch. To bridge the gap between the pre-trained model and the target task, a hand-crafted prompt template was created that incorporates the class labels as the input text for the text branch. Finally, the image-text contrastive learning module was built on a bilinear metric function to align the semantic information of images and text. The network parameters were updated with Model-Agnostic Meta-Learning (MAML) to facilitate cross-modal information learning and fusion.
A series of comparative experiments was conducted on MMFSL under identical conditions using a self-constructed field-scenario dataset and the PlantVillage dataset. The experimental results show that, under the 1-shot and 5-shot settings, MMFSL achieved average accuracies of 86.97% and 96.33%, respectively, on PlantVillage. When transferred to complex field scenarios, the model still maintained average accuracies of 56.78% and 74.49% on the 1-shot and 5-shot tasks, respectively. Compared with mainstream FSL models, including MAML, Matching Network, Prototypical Network, DeepEMD, DeepBDC, and FewTURE, MMFSL achieved the highest classification accuracy, with particularly strong performance on 5-way 1-shot tasks. A comparison of four encoders (ViT-Tiny, Swin-Tiny, DeiT-Small, and DeiT-Tiny) revealed that Swin-Tiny was the most effective at extracting image features, with average accuracies of 84.20% and 95.53% under the 1-shot and 5-shot settings, respectively. An experiment was also carried out to select the image-text metric function: the bilinear metric outperformed two conventional metrics, cosine similarity and the dot product. The ablation study further shows that the average accuracy of the MMFSL model increased by 2.77 and 0.80 percentage points, respectively, over the unimodal FSL model. In summary, MMFSL demonstrates excellent disease recognition accuracy and robustness in both laboratory and field scenarios. Incorporating textual information into the FSL model effectively alleviates the limitations in feature representation caused by the scarcity of image samples. In low-data scenarios, MMFSL offers a viable and promising approach to plant disease recognition.
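The bilinear image-text metric referred to above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the embedding dimensions, variable names, and the contrast with cosine similarity are assumptions chosen for clarity. It shows the form s(v, t) = vᵀ W t, where W is a learnable alignment matrix; unlike cosine or dot-product similarity, the bilinear form can compare embeddings of different dimensions.

```python
import numpy as np

# Hypothetical sketch of a bilinear image-text metric: s(v, t) = v^T W t.
# Dimensions below are illustrative (e.g. a ViT image embedding and a
# language-model text embedding of different sizes); W would be learned
# jointly with the rest of the network in practice.

rng = np.random.default_rng(0)
d_img, d_txt = 384, 512                          # assumed embedding sizes

W = rng.standard_normal((d_img, d_txt)) * 0.01   # learnable alignment matrix

def bilinear_score(v: np.ndarray, t: np.ndarray, W: np.ndarray) -> float:
    """Bilinear similarity s(v, t) = v^T W t; v and t may differ in dimension."""
    return float(v @ W @ t)

def cosine_score(v: np.ndarray, t: np.ndarray) -> float:
    """Cosine similarity; requires v and t to have the same dimension."""
    return float(v @ t / (np.linalg.norm(v) * np.linalg.norm(t)))

img = rng.standard_normal(d_img)   # image embedding from the image branch
txt = rng.standard_normal(d_txt)   # text embedding from the text branch

print(bilinear_score(img, txt, W))
```

One practical appeal of the bilinear form is visible here: the image and text branches need not project to a shared dimension, because W itself mediates between the two spaces, whereas cosine and dot-product metrics require matched dimensions.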