Recognizing few-shot crop diseases using multimodal-guided visual Transformer
-
Graphical Abstract
-
Abstract
Accurate identification of plant diseases plays a crucial role in plant protection and intelligent agriculture. When disease data are scarce, few-shot learning (FSL) offers a potential solution for identifying crop diseases. However, existing FSL methods rely only on low-level image features for disease recognition and cannot exploit the correlations among multimodal data under small-sample conditions. In this study, a multimodal few-shot learning (MMFSL) model was proposed and applied to crop disease identification in low-data scenarios. The model consists of three components: an FSL image branch, a text branch, and an image-text contrastive learning module. First, a Vision Transformer (ViT) was introduced into the FSL image branch of the MMFSL in place of the conventional convolutional neural network (CNN) encoder and was enhanced to extract features from few-shot images. The input samples were segmented into small patches to establish semantic correspondences between local regions of the image. Second, the text branch was built on a pre-trained language model, from which label text information was extracted to guide the FSL branch. A hand-crafted prompt template incorporated the class labels as the input text of the text branch, thereby bridging the gap between the pre-trained model and the downstream task. Finally, the image-text comparison module was implemented with a bilinear metric function to align image and text semantics, and the network parameters were updated with Model-Agnostic Meta-Learning (MAML) to facilitate cross-modal information learning and fusion. A series of comparative experiments were conducted on the PlantVillage dataset and a self-constructed dataset collected in field scenarios. The results show that the MMFSL model achieved average accuracies of 86.97% and 96.33% on PlantVillage under the 5-way 1-shot and 5-way 5-shot settings, respectively. When the model trained on PlantVillage was transferred to complex field scenarios, average accuracies of 56.78% and 74.49% were still maintained on the 5-way 1-shot and 5-way 5-shot tasks, respectively. Compared with mainstream FSL models, including MAML, Matching Networks, Prototypical Networks, DeepEMD, DeepBDC, and FewTURE, the MMFSL model achieved the highest classification accuracy, with particularly superior performance on 5-way 1-shot tasks. Four encoders (ViT-Tiny, Swin-Tiny, DeiT-Small, and DeiT-Tiny) were also compared; Swin-Tiny extracted image features most effectively, with average accuracies of 84.20% and 95.53% under the 5-way 1-shot and 5-way 5-shot settings, respectively. The image-text metric function was likewise examined experimentally: the bilinear metric outperformed two conventional metrics, cosine similarity and dot product. An ablation study further showed that the MMFSL model improved average accuracy by 2.77 and 0.80 percentage points over its unimodal counterpart under the two settings, respectively. In summary, the MMFSL model achieved high disease-recognition accuracy and excellent robustness in both laboratory and field scenarios.
Incorporating textual information effectively compensated for the weak feature representations caused by the scarcity of image samples. The MMFSL model is therefore expected to serve as a viable and promising approach for plant disease recognition in low-data scenarios.
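To make the text branch concrete, the following is a minimal sketch of how class labels can be wrapped in a hand-crafted prompt template and encoded with a pre-trained language model. The template wording, the BERT checkpoint, the example disease labels, and the use of the [CLS] embedding are illustrative assumptions, not the exact configuration of the MMFSL model.

```python
# Sketch of the text branch: prompt construction + pre-trained text encoder.
# "bert-base-uncased" and the template wording are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text_encoder = AutoModel.from_pretrained("bert-base-uncased")

def encode_labels(class_labels):
    # Hand-crafted prompt template wrapping each class label.
    prompts = [f"a photo of a leaf with {label}" for label in class_labels]
    tokens = tokenizer(prompts, padding=True, return_tensors="pt")
    with torch.no_grad():
        out = text_encoder(**tokens)
    # Use the [CLS] token embedding as the class-level text feature.
    return out.last_hidden_state[:, 0, :]  # shape: (n_classes, hidden_dim)

# Example 5-way episode labels (hypothetical, for illustration only).
text_features = encode_labels(["tomato early blight", "apple scab",
                               "grape black rot", "corn rust",
                               "potato late blight"])
```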
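The bilinear image-text metric also admits a compact sketch: a learnable matrix W scores each query-image feature against each class-level text feature as s(v, t) = vᵀWt, and the resulting similarity matrix serves as classification logits. The feature dimensions below are placeholders, not the paper's settings.

```python
# Sketch of a bilinear metric for image-text alignment (dimensions assumed).
import torch
import torch.nn as nn

class BilinearMetric(nn.Module):
    def __init__(self, img_dim=384, txt_dim=768):
        super().__init__()
        # Learnable bilinear form W; small init for stable early training.
        self.W = nn.Parameter(torch.randn(img_dim, txt_dim) * 0.02)

    def forward(self, img_feats, txt_feats):
        # img_feats: (n_query, img_dim); txt_feats: (n_classes, txt_dim).
        # Returns an (n_query, n_classes) similarity matrix used as logits.
        return img_feats @ self.W @ txt_feats.t()

metric = BilinearMetric()
logits = metric(torch.randn(10, 384), torch.randn(5, 768))  # 10 queries, 5 classes
pred = logits.argmax(dim=1)
```

Unlike the parameter-free cosine and dot-product metrics the paper compares against, the bilinear form learns a cross-modal mapping, which is one plausible reason for its superior performance.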
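Finally, the MAML-style optimization can be outlined as an inner loop that adapts parameters on each task's support set and an outer loop driven by the query-set loss. The sketch below assumes the `higher` library for differentiable inner-loop updates; the learning rates and step counts are illustrative, not the paper's hyperparameters.

```python
# Sketch of one MAML meta-update over a batch of few-shot tasks.
# Uses `higher` for differentiable inner-loop optimization (an assumption).
import torch
import torch.nn.functional as F
import higher

def maml_step(model, meta_opt, tasks, inner_lr=0.01, inner_steps=5):
    meta_opt.zero_grad()
    for support_x, support_y, query_x, query_y in tasks:
        inner_opt = torch.optim.SGD(model.parameters(), lr=inner_lr)
        with higher.innerloop_ctx(model, inner_opt,
                                  copy_initial_weights=False) as (fmodel, diffopt):
            # Inner loop: adapt to this task's support set.
            for _ in range(inner_steps):
                diffopt.step(F.cross_entropy(fmodel(support_x), support_y))
            # Outer loop: query loss backpropagates to the initial parameters.
            F.cross_entropy(fmodel(query_x), query_y).backward()
    meta_opt.step()
```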