Abstract:
Wolfberry (Lycium barbarum), valued for its nutritional and medicinal properties, has seen rising demand, yet its cultivation is challenged by pest infestations that compromise crop health and yield. Traditional pest management relies on agricultural experts and therefore scales poorly, which motivates solutions based on artificial intelligence (AI). Existing intelligent pest identification systems have progressed in uni-modal visual classification, but they often fail to deliver comprehensive pest profiles, including morphological characteristics, life cycle details, and science-based control strategies, which limits their practical utility. To address this, we propose the Cross-modal Wolfberry Pest Retrieval (CWPR) model, a novel framework that establishes precise semantic correspondences between pest images and textual descriptions to enable holistic retrieval of pest information. This approach preserves the species identification capability of conventional classification methods and additionally provides a comprehensive biological characterization of pests, shifting the task from mere identification to holistic cognition. Concretely, the CWPR model employs a dual-branch architecture: a Vision Transformer (ViT) extracts high-dimensional visual features from pest images, capturing intricate morphological details, and Bidirectional Encoder Representations from Transformers (BERT) generates contextual text embeddings from descriptions covering morphological characteristics, living habits, and prevention and control measures. A key innovation is the two-tiered feature fusion mechanism, which tackles cross-modal heterogeneity in two steps: learnable weighted projection matrices first align visual and textual features in a shared latent space, and an orthogonal rotation is then optimized to minimize quantization loss while preserving modality-specific characteristics.
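The orthogonal rotation step can be illustrated with a minimal sketch. The abstract does not give the exact objective, so this example assumes an ITQ-style formulation: given features `V` already projected into the shared space, alternate between binarizing `V @ R` and solving an orthogonal Procrustes problem for `R`. The function names, the synthetic data, and the 16-bit code length are hypothetical and only serve the illustration.

```python
import numpy as np

def quantization_loss(V, R):
    """Squared Frobenius distance between rotated features and their signs."""
    Z = V @ R
    B = np.where(Z >= 0, 1.0, -1.0)
    return float(np.sum((B - Z) ** 2))

def learn_rotation(V, n_iter=50, R0=None, seed=0):
    """Alternating minimization of ||B - V R||_F^2 over binary B and orthogonal R.

    Assumed ITQ-style procedure, not necessarily the paper's exact method.
    """
    k = V.shape[1]
    if R0 is None:
        # Random orthogonal initialization via QR decomposition.
        rng = np.random.default_rng(seed)
        R0, _ = np.linalg.qr(rng.standard_normal((k, k)))
    R = R0
    for _ in range(n_iter):
        # Step 1: with R fixed, the optimal binary codes are the signs of V R.
        B = np.where(V @ R >= 0, 1.0, -1.0)
        # Step 2: with B fixed, the optimal R solves an orthogonal Procrustes
        # problem: take the SVD of V^T B = U S W^T and set R = U W^T.
        U, _, Wt = np.linalg.svd(V.T @ B)
        R = U @ Wt
    return R

# Synthetic stand-in for fused features in the shared latent space.
rng = np.random.default_rng(0)
V = rng.standard_normal((200, 16))
V -= V.mean(axis=0)  # zero-center before binarization

R = learn_rotation(V, R0=np.eye(16))
print("loss before rotation:", quantization_loss(V, np.eye(16)))
print("loss after rotation: ", quantization_loss(V, R))
```

Because each of the two steps is optimal with the other variable held fixed, the loss cannot increase from one iteration to the next, which is why starting from the identity gives a rotation that is no worse than leaving the features unrotated.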
To mitigate data imbalance, a pervasive issue in agricultural datasets, a frequency-aware label enhancement technique is introduced to reduce bias toward dominant pest species and ensure equitable representation. The model is optimized with an iterative alternating strategy, and the hash mapping function for each modality is derived via Ridge regression, which ensures efficient convergence. The CWPR model was evaluated on a curated wolfberry pest dataset through two cross-modal retrieval tasks, image-to-text and text-to-image retrieval, using mean Average Precision (mAP) as the performance metric. During preprocessing, images were resized to 224×224 pixels and text was tokenized with BERT’s tokenizer. Experimental results demonstrate the model’s superior performance: it achieves 96.49% mAP for image-to-text retrieval and 98.55% for text-to-image retrieval, surpassing state-of-the-art methods by 1.89 percentage points. Ablation studies indicate that the two-tiered feature fusion yields a gain of 1.21 percentage points over single-fusion schemes, while label enhancement contributes an additional 0.8 percentage points. These results highlight the model’s ability to integrate visual and textual modalities effectively while addressing cross-modal heterogeneity and data imbalance. By providing accurate and comprehensive pest information, including identification, life cycle details, and control strategies, the CWPR model supports informed decision-making in wolfberry pest management. Its scalability and robustness position it as a promising tool for precision agriculture. By giving farmers access to actionable insights for pest prevention and control, it can support sustainable wolfberry production and contribute to improved crop health and yield in the face of growing global demand.
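Two of the steps named above, the Ridge-regression hash mapping and the mAP metric, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the hash function is assumed to be linear, `sign(X W)`, with `W` given by the standard closed-form ridge solution, and retrieval is assumed to rank database items by Hamming distance. All function names and the regularization weight `lam` are hypothetical.

```python
import numpy as np

def fit_ridge_hash(X, B, lam=1.0):
    """Closed-form ridge solution W = (X^T X + lam I)^{-1} X^T B.

    X: (n, d) modality features; B: (n, k) target codes.
    """
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ B)

def hash_codes(X, W):
    """Map features to {-1, +1} codes with the learned linear function."""
    return np.where(X @ W >= 0, 1, -1)

def average_precision(relevant):
    """AP for one query; `relevant` is a boolean sequence in ranked order."""
    relevant = np.asarray(relevant, dtype=bool)
    if not relevant.any():
        return 0.0
    hits = np.cumsum(relevant)                 # relevant items seen so far
    ranks = np.arange(1, len(relevant) + 1)    # 1-based rank positions
    return float(np.mean(hits[relevant] / ranks[relevant]))

def mean_average_precision(query_codes, db_codes, query_labels, db_labels):
    """mAP under Hamming ranking; an item is relevant if labels match."""
    k = query_codes.shape[1]
    aps = []
    for q, y in zip(query_codes, query_labels):
        # For codes in {-1, +1}, Hamming distance = (k - inner product) / 2.
        hamming = (k - db_codes @ q) / 2
        order = np.argsort(hamming, kind="stable")
        aps.append(average_precision(db_labels[order] == y))
    return float(np.mean(aps))

# Toy example: two classes whose database codes are perfectly separated.
db_codes = np.array([[1, 1], [-1, -1], [1, 1], [-1, -1]])
db_labels = np.array([0, 1, 0, 1])
query_codes = np.array([[1, 1], [-1, -1]])
query_labels = np.array([0, 1])
toy_map = mean_average_precision(query_codes, db_codes, query_labels, db_labels)
print("toy mAP:", toy_map)
```

For image-to-text retrieval, the query codes would come from the image hash function and the database codes from the text hash function; text-to-image retrieval swaps the two roles.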