Article
Open Access | Editor's Choice Article

Application of Multimodal Transformer Model in Intelligent Agricultural Disease Detection and Question-Answering Systems

by Yuchun Lu, Xiaoyi Lu, Liping Zheng, Min Sun, Siyu Chen, Baiyan Chen, Tong Wang, Jiming Yang and Chunli Lv *
China Agricultural University, Beijing 100083, China
*
Author to whom correspondence should be addressed.
Plants 2024, 13(7), 972; https://doi.org/10.3390/plants13070972
Submission received: 30 January 2024 / Revised: 22 March 2024 / Accepted: 24 March 2024 / Published: 28 March 2024
(This article belongs to the Special Issue The Future of Artificial Intelligence and Sensor Systems in Agriculture)

Abstract

In this study, an innovative approach based on multimodal data and the transformer model was proposed to address challenges in agricultural disease detection and question-answering systems. This method effectively integrates image, text, and sensor data, utilizing deep learning technologies to profoundly analyze and process complex agriculture-related issues. The study achieved technical breakthroughs and provides new perspectives and tools for the development of intelligent agriculture. In the task of agricultural disease detection, the proposed method demonstrated outstanding performance, achieving a precision, recall, and accuracy of 0.95, 0.92, and 0.94, respectively, significantly outperforming the other conventional deep learning models. These results indicate the method’s effectiveness in identifying and accurately classifying various agricultural diseases, particularly excelling in handling subtle features and complex data. In the task of generating descriptive text from agricultural images, the method also exhibited impressive performance, with a precision, recall, and accuracy of 0.92, 0.88, and 0.91, respectively. This demonstrates that the method can not only deeply understand the content of agricultural images but also generate accurate and rich descriptive texts. The object detection experiment further validated the effectiveness of our approach, where the method achieved a precision, recall, and accuracy of 0.96, 0.91, and 0.94. This achievement highlights the method’s capability for accurately locating and identifying agricultural targets, especially in complex environments. Overall, the approach in this study not only demonstrated exceptional performance in multiple tasks such as agricultural disease detection, image captioning, and object detection but also showcased the immense potential of multimodal data and deep learning technologies in the application of intelligent agriculture.
Keywords:
agricultural large model; deep learning; smart agriculture; transformer model; agricultural disease detection

1. Introduction

The rapid development of information technology and artificial intelligence has become a significant driving force in advancing modern agriculture [1], particularly in plant disease detection and management, where technological innovation and applications are key to ensuring agricultural production efficiency and food safety [2,3]. Traditional methods of plant disease detection, reliant on the experience and judgment of agricultural experts [4], are not only time-consuming and labor-intensive, but their accuracy and efficiency are limited by the constraints of the experts’ knowledge and experience [5].
The detection of plant fungal pathogens, discussed by Ray, Monalisa et al. [6], necessitates expertise in microbiology and is invariably influenced by individual experience. Vadamalai Ganesan et al. [7] employed plant genetics and physiology for disease detection, analyzing the impact of pathogens on host plants using proteomics; however, the accuracy of their method could not be guaranteed. To enhance precision, Das Debasish et al. [8] utilized various feature extraction techniques to classify different types of leaf diseases. They experimented with support vector machine (SVM), random forest, and logistic regression methods, finding SVM to be the most effective. However, their model, limited to binary classification of tomato leaves as healthy or diseased, failed to meet practical needs.
In response to these challenges, the urgency of incorporating intelligent technologies for accurate and rapid detection of plant diseases is evident [9]. In this context, this study introduces a disease detection and agricultural question-answering system based on multimodal and large language model technologies [10], aimed at enhancing the level of agricultural production intelligence and providing effective decision support for agricultural workers.
Several researchers have made significant contributions. For instance, Deepalakshmi P et al. [11] used CNN to extract features from input images to identify the diseased and healthy leaves of different plants, with their model taking an average of 3.8 s for disease detection and achieving a 94.5% accuracy rate. Sharma, Parul et al. [12] applied CNN to plant disease detection, reaching a 98.6% accuracy rate, though their method could fail in areas with multiple disease symptoms. Bedi Punam et al. [13] used a hybrid model of a convolutional autoencoder (CAE) network and CNN for peach tree disease detection, achieving a 98.38% accuracy rate in testing, but the small dataset size limited the model’s robustness. Given the potential loss of important information with CNN models, De Silva Malithi et al. [14] combined a CNN with ViT, achieving an 83.3% accuracy rate. To enhance accuracy, Parez Sana et al. [15] proposed the green vision transformer technique, employing ViT to reduce model parameters and improve accuracy, demonstrating real-time processing capability. Thai Huy-Tan et al. [16] designed the FormerLeaf model based on ViT for plant disease detection. They also proposed the LeIAP and SPMM algorithms for model optimization. Their experimental results showed a 15% improvement in inference speed, but they noted reduced model accuracy for complex background images and the dataset used in the experiments was unbalanced.
This study employs advanced computer vision models such as convolutional neural networks (CNN) [17] and YOLO (you only look once) [18], along with large language models like GPT [19] and BERT [20], to effectively detect plant diseases and accurately answer agriculture-related questions. The core of this research lies in proposing and implementing an innovative multimodal data processing approach and corresponding system architecture. A multi-transformer-based architecture was designed, capable of efficiently processing and integrating different modalities of data, such as images, text, and knowledge graphs, thereby achieving higher accuracy and efficiency in the identification and classification of plant diseases compared to traditional methods. This is of significant importance for the rapid identification and handling of agricultural diseases and for reducing crop losses. Moreover, a specialized question-answering system for the agricultural domain was constructed, combining large language models and expert knowledge graphs to understand complex agricultural questions and provide accurate, fact- and data-based answers. To train and validate our models, a comprehensive multimodal dataset, including rich image and textual data, was collected and constructed. This not only provided strong support for this study but also offers valuable resources for future research in related fields.

2. Related Works

2.1. Application of Multimodal Data in Agriculture

Multimodal technologies have recently seen widespread application in the agricultural field, especially in disease detection and agricultural question-answering systems [21]. Multimodal technology refers to the integration and analysis of data from various modalities, such as images, text, and sound. In agriculture, this primarily involves a combination of image and text data. Image data typically originate from field photographs or satellite imagery, while text data may consist of professional literature or agricultural databases detailing crop cultivation and disease descriptions. Structurally, multimodal models generally encompass two main components: feature extraction from different modalities and multimodal fusion. For image data, convolutional neural networks (CNN) [22,23] or more advanced models like YOLO [24,25] are commonly used for spatial feature extraction. Text data, on the other hand, are processed using natural language processing techniques, such as the transformer model [26], to extract semantic features. Following feature extraction, multimodal fusion technology effectively combines the features from different modalities to facilitate more accurate classification, prediction, or generation. Matrix-based methods are a core technology in this process, involving the mathematical fusion of data from different modalities. Matrix factorization is a common technique in multimodal fusion that decomposes feature matrices from each modality to uncover shared latent features. Assuming the presence of data in two modalities, represented by matrices $X_1$ and $X_2$, matrix factorization aims to identify two low-rank matrices $U_1$, $U_2$ and a shared latent feature matrix $V$, satisfying the relation:
$$X_1 \approx U_1 V, \qquad X_2 \approx U_2 V$$
Here, $U_1$ and $U_2$ represent the feature spaces of the two modalities, while $V$ is the shared latent feature representation. Another method, canonical correlation analysis (CCA), seeks to maximize the correlation between feature vectors of two modalities. Given features $X$ and $Y$ from two modalities, CCA aims to find vectors $w_x$ and $w_y$ that maximize the projection correlation between $X$ and $Y$:
$$\max_{w_x, w_y} \frac{w_x^T X Y^T w_y}{\sqrt{w_x^T X X^T w_x}\,\sqrt{w_y^T Y Y^T w_y}}$$
Here, the terms $w_x^T X X^T w_x$ and $w_y^T Y Y^T w_y$ are included for normalization, ensuring that the results are not influenced by data scale. Joint Factor Analysis (JFA) is a matrix decomposition technique that simultaneously analyzes multiple data sources. Assuming $n$ modalities with data matrices $X_1, X_2, \ldots, X_n$, JFA aims to find a set of factor matrices $U_1, U_2, \ldots, U_n$ and a shared latent feature matrix $V$, such that
$$X_i \approx U_i V, \qquad i = 1, 2, \ldots, n$$
In this expression, each $U_i$ represents the feature space of the $i$-th modality, while $V$ is the cross-modal shared feature representation.
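For illustration, a minimal NumPy sketch of this joint factorization idea is given below; it alternately solves least-squares problems for the modality-specific factors and the shared representation. The function name, toy feature dimensions, and alternating-least-squares scheme are illustrative assumptions, not the fusion procedure implemented in this study.

```python
import numpy as np

def joint_factorization(X_list, k, n_iters=100, seed=0):
    """Jointly factorize modality matrices X_i (d_i x n) as X_i ~ U_i V
    with a shared latent matrix V (k x n), by alternating least squares."""
    rng = np.random.default_rng(seed)
    n = X_list[0].shape[1]
    U_list = [rng.standard_normal((X.shape[0], k)) for X in X_list]
    V = rng.standard_normal((k, n))
    for _ in range(n_iters):
        # Update shared latent features V: (sum_i U_i^T U_i) V = sum_i U_i^T X_i
        A = sum(U.T @ U for U in U_list)
        B = sum(U.T @ X for U, X in zip(U_list, X_list))
        V = np.linalg.solve(A, B)
        # Update each modality-specific factor: U_i = X_i V^T (V V^T)^{-1}
        G = V @ V.T
        U_list = [np.linalg.solve(G, V @ X.T).T for X in X_list]
    return U_list, V

# Toy usage: 128-dim image features and 64-dim text features for 200 samples.
X_img, X_txt = np.random.rand(128, 200), np.random.rand(64, 200)
(U_img, U_txt), V_shared = joint_factorization([X_img, X_txt], k=16)
print(V_shared.shape)  # (16, 200): shared cross-modal representation
```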
In applications such as disease detection in rice, wheat, potatoes, and cotton, multimodal technologies play a significant role [27,28]. For instance, in rice disease detection, a combination of field image data and literature descriptions of diseases [29] enables multimodal models to more accurately identify and classify different types of diseases. This not only enhances the precision of disease detection but also assists farmers in taking timely and effective measures to reduce losses. For agricultural question-answering systems, multimodal technologies have also demonstrated their robust capabilities. By integrating image recognition and natural language processing, such systems can provide more accurate and comprehensive answers. For example, farmers can upload images of crops and inquire about diseases. The system, by analyzing the images and consulting relevant agricultural knowledge bases, can provide specific disease information and prevention recommendations. Additionally, the unique advantage of multimodal technologies is exhibited when handling complex data. In agriculture, where environmental conditions are diverse and complex, a single modality often fails to provide sufficient information for accurate judgment [30]. Multimodal technology, by combining different types of data, offers a more comprehensive perspective, enhancing the model’s generalization capabilities and robustness. In practical applications, the challenges faced by multimodal technology include effectively integrating data from different modalities and designing universal models adaptable to various crops and disease types.

2.2. Application of Large Language Models in Agriculture

Large language models, such as GPT and BERT [20], have made significant strides in various fields, including applications in agricultural disease detection and question-answering systems [31,32]. These models are renowned for their powerful semantic understanding and generation capabilities, providing effective tools for natural language processing. First, the structural features of large language models warrant attention. These models are typically based on deep learning, particularly the transformer architecture [33], learning rich language representations and knowledge through extensive data pre-training. Within the model, the multi-layer transformer network effectively captures long-distance dependencies in text through self-attention mechanisms, enabling the understanding and generation of complex language structures. The key components of the transformer are as follows: Self-attention, the core of the transformer, allows the model to focus on different positions in the input sequence. For a given input sequence, the self-attention mechanism calculates attention scores for each element with respect to the other elements in the sequence, as shown in Figure 1.
Figure 1. Structural diagram of the BERT model, demonstrating how input passes through an embedding layer and is processed through a multi-layer transformer network structure. This includes multi-head attention mechanisms, feedforward neural networks, and the addition of positional encoding.
This can be expressed as [34]
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^T}{\sqrt{d_k}}\right) V$$
Here, $Q$, $K$, and $V$ represent the query, key, and value matrices, respectively, derived from the input matrix through different weight matrix transformations, and $d_k$ is the dimension of the keys; dividing by $\sqrt{d_k}$ prevents excessively large values after the dot product. To enable the model to focus on information from different subspaces simultaneously, the transformer introduces a multi-head attention mechanism [34]. In this mechanism, the attention operation is divided into multiple heads, each independently calculating attention scores, which are then concatenated. This can be expressed as
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_h)\, W^O$$
Here, each $\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$ represents an independent attention mechanism, with $W_i^Q$, $W_i^K$, $W_i^V$, and $W^O$ being the parameters learned by the model. As transformers inherently lack a sequential order processing capability like RNNs, positional encoding is added to provide position information for elements in the sequence [34]. Positional encoding, which may be learned or defined by a fixed function, is added to the input sequence embedding, furnishing the model with positional information [34]. A common fixed form of positional encoding is
$$PE_{(pos,\,2i)} = \sin\!\left(pos / 10000^{2i/d_{model}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(pos / 10000^{2i/d_{model}}\right)$$
Here, $pos$ is the position, $i$ is the dimension, and $d_{model}$ is the dimension of the model. Each encoder and decoder layer of the transformer contains a feed-forward network. This network, applying the same operations at each position [34], typically includes two linear transformations and an activation function, represented as
$$\mathrm{FFN}(x) = \max(0,\, x W_1 + b_1)\, W_2 + b_2$$
Here, $W_1$, $W_2$, $b_1$, and $b_2$ are network parameters [34].
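To make the above components concrete, the following NumPy sketch implements the scaled dot-product attention and sinusoidal positional encoding formulas given above. It is a didactic re-implementation of the standard transformer equations, not code from the system described in this paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (seq_q, seq_k) attention scores
    weights = softmax(scores, axis=-1)  # row-wise attention distribution
    return weights @ V                  # weighted sum of values

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding: sin on even dims, cos on odd dims."""
    pos = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)  # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Toy usage: 6 tokens with 16-dimensional embeddings.
x = np.random.rand(6, 16) + positional_encoding(6, 16)
out = scaled_dot_product_attention(x, x, x)  # self-attention over the sequence
print(out.shape)  # (6, 16)
```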
In specific applications such as disease detection in rice, wheat, potatoes, and cotton, as well as agricultural question-answering systems, the transformer model leverages its strong semantic understanding capabilities to analyze text information [34], such as disease descriptions and agricultural practice guidelines [35]. This analytical capability is crucial for enhancing the accuracy of disease diagnostics and answering agriculture-related questions. In the agricultural field, especially in disease detection and question-answering systems for crops like rice, wheat, potatoes, and cotton, the application of large language models is particularly important. For instance, in disease detection, the model can provide in-depth understanding and suggestions regarding diseases by analyzing agricultural text materials, such as disease descriptions and treatment methods [27]. Additionally, large language models can be combined with image recognition technology to provide more accurate disease diagnostics by analyzing images related to diseases and their descriptions. In agricultural question-answering systems, the role of large language models is indispensable. They can not only understand user queries but also generate information-rich, accurate responses. This is especially crucial for agriculture-related queries requiring expert knowledge. For example, farmers may inquire about methods for identifying or treating specific crop diseases, and large language models can provide professional and specific answers based on their extensive knowledge base [36].

2.3. Application of Computer Vision Techniques in Agriculture

The application of computer vision models, particularly convolutional neural networks (CNN) and YOLO (you only look once), as shown in Figure 2, has been increasingly observed in the agricultural sector, especially in disease detection and agricultural question-answering systems [12,37].
Figure 2. Structure diagram of the YOLOv5 object detection model, detailing the data flow from the input layer to the prediction layer, including input processing, backbone network, feature pyramid network (neck), and the different types of neural network modules used in each stage of prediction.
CNNs, designed for processing data with a grid-like structure such as images, are deep neural networks centered around convolutional layers. These layers extract local features from images through convolution operations, which can be mathematically expressed as
$$F_{ij} = \sum_{m} \sum_{n} I_{(i+m)(j+n)}\, K_{mn}$$
Here, $I$ represents the input image, $K$ the convolutional kernel, and $F_{ij}$ the convolutional output. This formula indicates that convolutional layers slide the kernel over the image, computing the dot product between the kernel and local regions of the image to extract features. In addition to convolutional layers, CNNs typically include activation layers and pooling layers. Activation layers, such as the ReLU function, introduce non-linearity, enabling the network to capture more complex features. Pooling layers, on the other hand, reduce the spatial dimensions of the features, enhancing the model’s generalization capabilities. A common pooling operation, max pooling, is mathematically expressed as
$$P_{ij} = \max_{a, b \in [0,\, k-1]} F_{i+a,\, j+b}$$
Here, $P_{ij}$ denotes the pooling output, with $k$ representing the size of the pooling window. The core advantage of these models lies in their ability to efficiently process and analyze large volumes of image data, thereby identifying specific patterns and objects. For instance, in the detection of diseases in crops like rice, wheat, potatoes, and cotton, CNNs initially perform feature extraction on the input crop images. Using these features, CNNs can identify different types of crop diseases. For example, in rice disease detection, extracted features might include the size, shape, and color of spots on the leaves [38].
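As an illustration of the two operations just defined, a direct (unoptimized) NumPy implementation of single-channel valid convolution and non-overlapping max pooling might look as follows; the kernel and input sizes are placeholders.

```python
import numpy as np

def conv2d(image, kernel):
    """F_ij = sum_m sum_n I_(i+m)(j+n) * K_mn (valid convolution, single channel)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2d(feature_map, k=2):
    """P_ij = max over a k x k window of the feature map (stride = k)."""
    h, w = feature_map.shape
    out = np.zeros((h // k, w // k))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = feature_map[i * k:(i + 1) * k, j * k:(j + 1) * k].max()
    return out

# Toy usage: a 3x3 edge-detection kernel on a random 8x8 "leaf patch".
leaf = np.random.rand(8, 8)
kernel = np.array([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]], dtype=float)
features = conv2d(leaf, kernel)     # (6, 6) feature map
pooled = max_pool2d(features, k=2)  # (3, 3) pooled map
print(features.shape, pooled.shape)
```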
YOLO is a popular single-stage object detection model that conceptualizes object detection as a regression problem. Unlike traditional step-by-step methods (such as first generating candidate regions and then classifying), YOLO directly predicts both the categories and positions of targets in a single network. In the YOLO model, the input image is divided into an S×S grid, with each grid cell responsible for predicting targets within that area. The output of YOLO can be represented as a vector, containing class probabilities, bounding box coordinates, and confidence scores. The mathematical representation of each bounding box is
$$(box_x,\ box_y,\ box_w,\ box_h,\ C)$$
where $(box_x, box_y)$ are the coordinates of the center of the bounding box, $(box_w, box_h)$ its width and height, and $C$ the confidence score of the bounding box containing a target. The loss function of the YOLO model, comprising category loss, localization loss, and confidence loss, is a crucial component. The loss function can be represented as
$$\begin{aligned}
L ={} & \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} l_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
      & + \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} l_{ij}^{obj} \left[ (w_i - \hat{w}_i)^2 + (h_i - \hat{h}_i)^2 \right] \\
      & + \sum_{i=0}^{S^2} l_{i}^{obj} (C_i - \hat{C}_i)^2 + \lambda_{noobj} \sum_{i=0}^{S^2} l_{i}^{noobj} (C_i - \hat{C}_i)^2 \\
      & + \sum_{i=0}^{S^2} \sum_{c \in classes} p_i(c) \log(\hat{p}_i(c))
\end{aligned}$$
Here, $\lambda_{coord}$ and $\lambda_{noobj}$ are weight coefficients, $l_{ij}^{obj}$ indicates the presence of a target, $(x_i, y_i, w_i, h_i)$ are the predicted bounding box parameters, $(\hat{x}_i, \hat{y}_i, \hat{w}_i, \hat{h}_i)$ the actual bounding box parameters, $C_i$ the predicted confidence score, $\hat{C}_i$ the actual confidence score, and $p_i(c)$ the probability of class $c$. The YOLO model excels in real-time disease detection, swiftly locating and classifying diseases within images. In cotton disease detection, YOLO can rapidly identify affected areas, assisting farmers in timely interventions [39]. Agricultural question-answering systems can utilize CNN or YOLO models to analyze crop images uploaded by users, then combine the analysis results with historical data to provide professional advice. Initially, the system analyzes uploaded wheat leaf images through visual models, subsequently integrating the analysis with agricultural knowledge bases to suggest possible disease causes and recommended treatment methods.
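The following sketch illustrates, under simplified assumptions, how a YOLO-style $S \times S$ grid output of $(box_x, box_y, box_w, box_h, C)$ vectors plus class probabilities can be decoded into detections above a confidence threshold. The coordinate conventions, threshold, and class count are illustrative and do not reproduce any specific YOLO version.

```python
import numpy as np

def decode_yolo_grid(pred, conf_thresh=0.5, img_size=640):
    """Decode a simplified YOLO output of shape (S, S, 5 + num_classes).

    Each cell holds (box_x, box_y, box_w, box_h, C, class probs...), where
    box_x/box_y are offsets within the cell and box_w/box_h are relative to
    the whole image (a common convention; details vary between YOLO versions).
    """
    S = pred.shape[0]
    cell = img_size / S
    boxes = []
    for gy in range(S):
        for gx in range(S):
            bx, by, bw, bh, conf = pred[gy, gx, :5]
            class_probs = pred[gy, gx, 5:]
            cls_id = int(np.argmax(class_probs))
            score = conf * class_probs[cls_id]  # objectness times class probability
            if score < conf_thresh:
                continue
            cx = (gx + bx) * cell               # box centre in pixels
            cy = (gy + by) * cell
            w, h = bw * img_size, bh * img_size
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2, cls_id, score))
    return boxes

# Toy usage: a 7x7 grid with 3 disease classes.
pred = np.random.rand(7, 7, 5 + 3)
print(len(decode_yolo_grid(pred, conf_thresh=0.9)))
```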

3. Results and Discussion

3.1. Disease Detection Results

The primary aim of this experiment was to compare and analyze the performance of various deep learning models in agricultural disease detection tasks, including AlexNet [40], GoogLeNet [41], VGG [22], ResNet [23], and the method proposed in this study. The experimental results are presented using the evaluation metrics of precision, recall, and accuracy to showcase the performance of each model. The experimental results are shown in Table 1.
Table 1. Comparison of disease detection performance.
As an early landmark model in deep learning, AlexNet has a relatively simple structure, comprising five convolutional layers and three fully connected layers. Although it achieved significant breakthroughs in early image processing tasks, its performance is relatively weaker when handling more complex agricultural disease detection tasks. This is primarily due to the limited feature extraction capability of AlexNet, especially in capturing subtle features, such as early signs of disease. Consequently, AlexNet showed the most modest performance in terms of precision, recall, and accuracy. GoogLeNet, introducing the Inception module, uses convolutional kernels of different sizes in the same layer, enabling it to capture features at different scales. This design makes GoogLeNet more powerful for feature extraction than AlexNet, especially in processing agricultural images with multi-scale features. Therefore, in the experiment, GoogLeNet’s performance showed an improvement, but it still had limitations in handling extremely complex agricultural data, due to its relatively simple network structure. VGG significantly enhances the model’s feature extraction capability with a deeper network structure (up to 19 layers) and small convolutional kernels. In agricultural disease detection tasks, VGG can better capture complex disease features, such as minute spots or discoloration. However, a major drawback of VGG is its bulky network structure with numerous parameters, leading to a lower computational efficiency in training and inference. ResNet solves the problem of vanishing gradients in deep networks by introducing residual connections, allowing the network to deepen (versions of ResNet reach up to 152 layers) without losing training efficiency. This combination of depth and residual structure enables ResNet to excel in capturing complex, hierarchical features. Thus, in agricultural disease detection tasks, ResNet significantly outperforms preceding models in terms of precision, recall, and accuracy. For example, the MAF-ResNet50 proposed in [4] enhances the model expressiveness of ResNet50 by designing parallel activation function layers to improve the accuracy of corn disease recognition. However, it could only achieve a recognition accuracy higher than 95% on four types of corn disease samples, which still lagged behind the generalization ability of the method proposed in this paper. The method proposed in this paper builds upon the foundation of these models with further innovations and optimizations. Specific details might include more complex network structures, more effective feature fusion mechanisms, and algorithms specifically optimized for agricultural disease detection tasks. These innovations enabled the proposed method to more effectively integrate information from different sources and capture more detailed disease features when processing the multimodal data. Consequently, in the experiment, the proposed method demonstrated the optimal performance for all evaluation metrics.

3.2. Agricultural Image Captioning Experiment Results

The design of this experiment aimed to evaluate and compare the performance of various deep learning models in the task of generating descriptive text from agricultural images. This task involved the automatic generation of descriptive text from agricultural images, which is significant for enhancing the level of automation and intelligence in agricultural management. By comparing the precision, recall, and accuracy of the different models, insights were gained into each model’s ability to understand and describe agricultural images. In the following experimental results, the performance of BLIP [42], mPLUG-Owl [43], InstructBLIP [44], CLIP [45], BLIP2 [46], and the method proposed in this paper on the agricultural image captioning task is demonstrated. The experimental results are shown in Table 2.
Table 2. Comparison of performance in agricultural image captioning.
BLIP (Bootstrapping Language-Image Pre-training) is an earlier deep learning model with certain capabilities in handling the fusion of image and text, but due to its relatively simple network structure and training strategy, it exhibits average performance in complex agricultural image captioning tasks. This was reflected in its lower precision, recall, and accuracy. mPLUG-Owl, an improved version of the multimodal learning model, shows enhancements in processing image and language fusion, particularly in understanding image content and generating relevant text. However, due to limitations in feature extraction and associative learning, mPLUG-Owl’s performance in agricultural image captioning tasks remains limited. The InstructBLIP model introduces more advanced training strategies and network structures, particularly excelling in understanding image content and generating accurate descriptions in image and text fusion tasks. This improvement can be attributed to its enhanced feature extraction capability and text generation strategy, leading to significant improvements in performance for agricultural image captioning tasks. The CLIP (contrastive language-image pretraining) model is pre-trained on large-scale datasets through contrastive learning, strengthening the model’s ability to understand image content and related text. This training approach endows CLIP with advantages in understanding complex agricultural images and generating accurate descriptions, thereby performing well in all evaluation metrics. BLIP2, as an advanced version of BLIP, incorporates further optimizations in network structure and training strategy. These improvements make BLIP2 more efficient in handling complex image and text fusion tasks, particularly excelling in understanding the details of agricultural images and generating precise descriptions.

3.3. Results for Object Detection

The object detection experiment conducted in this study was designed to evaluate and compare the performance of various deep learning models in agricultural disease detection tasks. This task holds significant importance for precision agriculture and intelligent agricultural management. By comparing the precision, recall, and accuracy of the different models, the performance of SSD [47], RetinaNet [48], CenterNet [49], YOLOv8 [50], and the method proposed in this paper in agricultural disease detection tasks was examined. The experimental results are displayed in Table 3 and Table 4.
Table 3. Comparison of object detection performance.
Table 4. Object detection result details for our method.
SSD (single shot multibox detector) is a one-stage object detection model known for its speed and simplicity of implementation. SSD directly predicts the categories and locations of objects in a feature map, without the need for additional candidate region generation steps. However, due to its relatively simple network structure, SSD’s precision and recall are relatively lower when processing complex and small-scale targets, as reflected in its lower experimental scores. RetinaNet introduced focal loss to address the issue of class imbalance, which is particularly effective in scenarios with a large number of negative samples. Its performance improved compared to SSD, as evidenced by its increased precision and recall. However, the computational complexity of RetinaNet is relatively high, which may limit its practical application. CenterNet utilizes a keypoint-based object detection approach, detecting the center points of objects and regressing their sizes for target localization. This method is more direct and efficient compared to traditional bounding box prediction approaches. CenterNet outperformed both SSD and RetinaNet in precision and recall, indicating its better capability for locating small targets and handling complex scenes. The YOLO series, known for its speed and good performance, with YOLOv8 as its latest version, introduced multiple innovations in network structure and algorithms, further enhancing the model’s detection capabilities. YOLOv8 excelled in the agricultural disease detection task, with a high precision and recall demonstrating its excellent target localization and classification abilities. For instance, the YOLO-based object detection model discussed in [51] achieved a mAP of 0.6991 on the wheat dataset provided by Kaggle. This performance was significantly different from that of the method proposed in our paper. The reason for this was that the method in the cited paper only fine-tuned the loss function on the basis of a single-stage object detection network and did not leverage attention mechanisms or data from other modalities for model enhancement. On the other hand, our method, by integrating the latest deep learning technologies and specifically optimizing for the task of agricultural disease detection, achieved the best performance across all evaluation metrics. This was due to the more effective feature extraction mechanism, more refined target localization strategy, and more efficient classification algorithms. The high precision, recall, and accuracy of our method demonstrate its significant advantages in recognizing a variety of agricultural diseases.

3.4. Multimodal Dataset Ablation Experiment

In the multimodal dataset ablation experiment section of this paper, the goal was to explore the impact of different data modalities (image, text, sensor data) on the performance of the model. This experiment, by comparing the precision, recall, and accuracy of the model with different combinations of data modalities, aimed to reveal the contribution of the various data modalities to model performance and their interactions. The experimental results are presented in Table 5.
Table 5. Multimodal dataset ablation experiment.
When the model used all modal data (image, text, and sensor data) simultaneously, it exhibited optimal performance, with a precision, recall, and accuracy of 0.96, 0.93, and 0.94, respectively. This indicates that the combination of these three data modalities provided the model with the richest and most comprehensive information, greatly enhancing the model’s accuracy in identifying and classifying diseases. Image data offer intuitive visual information, text data provide descriptive and background information, and sensor data contribute additional environmental and condition information. The integration of these data allows the model to more comprehensively understand and analyze agricultural diseases. When only sensor data were used, all performance indicators significantly decreased, with a precision, recall, and accuracy of 0.24, 0.21, and 0.23, respectively. This suggests that relying solely on sensor data is insufficient for complex agricultural disease detection tasks. In [52], this phenomenon was also reflected; the multimodal dataset proposed there demonstrated how the use of sensor data can enhance the application of artificial intelligence technologies in agricultural automation scenarios. Although sensor data can provide environmental condition information, they lack direct descriptive features of specific diseases, limiting the model’s performance in identifying specific diseases. When only text data were used, the model’s performance improved but was still inferior to the full modal data, with a precision, recall, and accuracy of 0.78, 0.73, and 0.75, respectively. This indicates that the text data aided the model by providing disease descriptions and background information but lacked the intuitiveness of image data and the environmental information of sensor data. When only image data were used, there was a significant improvement in performance, with a precision, recall, and accuracy of 0.92, 0.90, and 0.91, respectively. This demonstrates the crucial role of image data in agricultural disease detection, where visual information is highly effective for disease identification and classification. However, without the assistance of text and sensor data, the model still lacked comprehensiveness and contextual understanding of the disease. The results of the multimodal dataset ablation experiment lead to the conclusion that different data modalities contribute differently to agricultural disease detection tasks. Image data play a central role in providing intuitive visual information, text data contribute significantly in providing background and descriptive information, and sensor data offer valuable environmental and condition information. The integration of these data enabled the model to fully understand and accurately identify agricultural diseases, exhibiting outstanding performance. Therefore, the fusion of multimodal data is crucial for enhancing the accuracy of agricultural disease detection.

3.5. Different Loss Function Ablation Experiment

The ablation experiment on different loss functions in this study aimed to investigate the impact of various types of loss functions on the performance of tasks in disease detection, agricultural image captioning, and object detection. Loss functions play a crucial role in the training process of deep learning models, determining how the model evaluates the difference between predicted and actual results. Different types of loss functions may lead models to focus on different aspects, thereby affecting the final performance. The experiment examined three loss functions: hinge loss, mean squared error (MSE) loss, and multimodal loss, and their performance in three different tasks. The experimental results are presented in Table 6.
Table 6. Different loss function ablation experiment.
In the disease detection task, the multimodal loss achieved the best performance (precision: 0.95, recall: 0.92, accuracy: 0.94), followed by MSE Loss, with hinge loss performing the worst. Hinge loss, a loss function for classification tasks, aims to maximize the margin between correct and incorrect classifications. While effective in some classification tasks, it may not be sufficient for complex disease detection tasks, especially in cases involving multiple categories and subtle features. MSE Loss, which calculates the squared difference between predicted and actual values, is commonly used in regression tasks. In disease detection, it may capture subtle differences better than hinge loss, thereby improving model precision and recall. The multimodal loss, specifically designed for this study, considers the characteristics of different modal data, enabling the model to learn more effectively from multimodal data. This design resulted in the multimodal loss outperforming the other approaches in the disease detection task, reflecting its effectiveness in handling complex data. In the agricultural image captioning task, the multimodal loss also showed the best performance, followed by MSE loss, with hinge loss being the weakest. This result further confirms the effectiveness of multimodal loss in handling complex, diverse data. Agricultural image captioning involves not only image understanding but also the generation of semantically accurate descriptions, requiring the loss function to consider both the image content and the quality of text generation. In the object detection task, the multimodal loss outperformed the other two loss functions, demonstrating a precision of 0.96, recall of 0.92, and accuracy of 0.94. Object detection requires not only accurate identification of targets but also precise localization, demanding a loss function that can handle both aspects. The multimodal loss likely better balances these requirements, thereby improving the overall performance.

3.6. Limitations and Future Work

In this study, a comprehensive approach based on multimodal data and the transformer model was proposed to address key challenges in agricultural disease detection and question-answering systems. Despite the experimental results showing the excellent performance of our method across multiple tasks, there remain some limitations that require further improvement and expansion in future work.
Firstly, regarding data limitations. Although multimodal datasets, including image, text, and sensor data, were used in the experiments, the diversity and coverage of these data were still limited. For instance, in terms of the image data, while various disease images of multiple crops were collected, they might not have covered all types of crops and all possible disease conditions. Additionally, the text data mainly being sourced from existing agricultural literature and reports may have limited the model’s ability to handle non-standard texts or colloquial descriptions. As for the sensor data, the experiments primarily relied on data from specific environments, which may not have sufficiently represented the complexity and diversity of all agricultural environments. Second, regarding the model limitations. Although our multimodal transformer model excelled in processing multimodal data, there are still potential issues. For example, while the transformer model has advantages in processing long sequence data, its computational complexity is high, which may not be suitable for resource-limited environments. Additionally, although the multimodal alignment module can effectively integrate data from different modalities, its alignment mechanism may still need improvement to better handle heterogeneity and complex interactions between different modalities. In terms of the loss function, although the multimodal loss designed by us performed well in multiple tasks, its design and optimization require further research. Particularly in balancing the contributions of different modal data and adapting to the requirements of different tasks, more experiments and theoretical analysis might be needed for guidance.
Future work directions mainly include the following aspects. First, further expansion in data diversity and coverage is planned. More diverse agricultural data will be collected, including a wider range of crop types, data from different regions and environmental conditions, as well as richer textual descriptions and sensor data. This will help improve the model’s generalizability and practicality. Second, further optimization of model performance is planned. Future work will focus on reducing the computational complexity of the transformer model, to make it more suitable for resource-limited environments. Additionally, more efficient multimodal data alignment and fusion mechanisms will be explored to better handle heterogeneity and complex relationships between different modal data. In-depth research on multimodal loss is planned, including exploring the impact of different tasks on the loss function, how to better balance the contributions of different modal data, and how to adapt to the characteristics and requirements of different tasks. Efforts will also be made to apply the model in real agricultural environments for large-scale field testing and validation. This will help assess the practicality and effectiveness of the model and provide real feedback for further optimization.

4. Materials and Methods

4.1. Dataset Collection

4.1.1. Corpus Construction

In the construction of the corpus for this study, the sources, quantity, and methods of data acquisition were first clarified. The primary goal was to collect text data encompassing a broad range of agricultural knowledge, especially concerning diseases related to crops such as rice, wheat, potatoes, and cotton. Sources included the databases of various agricultural research institutions, agricultural technology websites, professional forums, and scientific research paper libraries. National agricultural information platforms and the databases of agricultural science and technology journals were accessed. Through these channels, over one hundred thousand records were collected, covering topics such as crop cultivation, disease identification, treatment methods, and prevention strategies. The rationale for using these data was multifaceted. First, they encompass a wide range of information from basic agricultural knowledge to advanced technical expertise, crucial for building a comprehensive agricultural question-answering system. Second, the accuracy and professionalism of these data were ensured, as they originated from authoritative and reliable platforms. Lastly, the diversity in language expression and structure of these text data aided in enhancing the generalization ability of our model. In terms of data annotation, a semi-automated approach was adopted. Initially, natural language processing technology was used for data preprocessing, including tokenization, part-of-speech tagging, syntactic analysis, etc. These steps helped to better understand the text’s structure and semantics. Subsequently, key information in the data, such as crop names, disease types, and symptom descriptions, were automatically annotated using rule-based methods. This process can be represented by the following equation:
$$T = \mathrm{Tag}(W; \rho)$$
Here, W represents the preprocessed text words, T the annotated tags, and ρ the set of annotation rules. However, automatic annotation cannot fully replace the accuracy of manual annotation. Therefore, a professional annotation team was organized to review and correct the results of automatic annotation. In the corpus construction process, text vectorization technology was employed, converting text into a format processable by machine learning models. Specifically, word embedding techniques such as Word2Vec and BERT were used to transform each word in the text into a vector in a high-dimensional space. This process can be expressed as follows:
$$V = \mathrm{Embed}(W; \eta)$$
Here, V represents word vectors, W is the word, and η the parameters of the word embedding model. Through this method, semantic relationships and contextual information between words were captured, crucial for the subsequent model training and knowledge extraction. The final corpus not only contained a large amount of annotated text but was also converted into a format suitable for machine learning through word embedding technology. This corpus served as the foundational dataset for training our agricultural question-answering system and disease identification models, with its comprehensiveness and accuracy having a decisive impact on the improvement in system performance.
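The two formalized steps above, $T = \mathrm{Tag}(W; \rho)$ and $V = \mathrm{Embed}(W; \eta)$, can be illustrated with the following sketch, which uses a toy rule set and gensim's Word2Vec (assuming gensim 4.x is available); the rules, corpus, and hyperparameters are placeholders rather than those used to build the actual corpus.

```python
from gensim.models import Word2Vec  # assuming gensim 4.x

# Toy annotation rules (rho): map surface words to entity tags.
RULES = {
    "rice": "CROP", "wheat": "CROP", "cotton": "CROP", "potato": "CROP",
    "blast": "DISEASE", "rust": "DISEASE", "blight": "DISEASE",
    "spots": "SYMPTOM", "wilting": "SYMPTOM",
}

def tag(words, rules=RULES):
    """T = Tag(W; rho): attach an entity tag to each preprocessed token."""
    return [(w, rules.get(w.lower(), "O")) for w in words]

# Toy corpus of preprocessed (tokenized, lower-cased) sentences.
sentences = [
    ["rice", "blast", "causes", "gray", "spots", "on", "leaves"],
    ["wheat", "rust", "leads", "to", "orange", "spots", "and", "wilting"],
]

print(tag(sentences[0]))
# [('rice', 'CROP'), ('blast', 'DISEASE'), ..., ('spots', 'SYMPTOM'), ...]

# V = Embed(W; eta): train a small Word2Vec model over the toy corpus.
model = Word2Vec(sentences=sentences, vector_size=50, window=3, min_count=1, seed=1)
print(model.wv["rice"].shape)  # (50,) word vector for "rice"
```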

4.1.2. Knowledge Graph Construction

The construction of the knowledge graph was a core component of this study, involving meticulous and systematic work aimed at providing a solid and comprehensive knowledge base for the agricultural question-answering system and disease detection models. The knowledge graph construction process focused not only on collecting and organizing data but also on its in-depth processing and intelligent application. Initially, data sources included the aforementioned annotated corpus dataset, and additional data were gathered from agricultural technology forums and communities, involving actual questions and discussions about crop diseases by farmers. The goal was to collect over one million independent data records to form a comprehensive knowledge system, as shown in Figure 3.
Figure 3. Knowledge graph of the relationship between cotton growth and diseases, showing typical symptoms during the cotton growth process, possible diseases, related pests, and corresponding treatment methods.
During the data annotation process, a combination of natural language processing technology and artificial intelligence was employed. Text analysis tools automatically identified key entities and concepts in the text, such as disease names, symptom descriptions, and management methods. However, considering the specificity and complexity of the agricultural field, experts in agriculture were also invited for manual reviewing and supplementary annotation. This process ensured the accuracy and professionalism of the data annotation. The annotation process can be represented by the following equation:
R=Annotate(E,β)
Here, R represents the relationships between entities, E the entities, and β the set of annotation rules. Using this method, information in the knowledge graph was ensured to be both accurate and comprehensive. Next came the construction process of the knowledge graph. After defining the types of entities and relationships, graph database technology was used to store and organize this information. This step, central to constructing the knowledge graph, involved not only data storage but also efficient organization and retrieval of these data. The construction process can be summarized by the following equation:
G=BuildGraph(E,R)
Here, G represents the knowledge graph, and E and R respectively are the sets of entities and relationships. The aim was to build a knowledge graph that not only reflects actual agricultural knowledge but also supports efficient querying and analysis. Through the aforementioned steps, a knowledge graph covering a broad range of agricultural knowledge, with clear structure and dynamic updating, was constructed. This graph not only provided strong knowledge support for the agricultural question-answering system but also the necessary background information for the disease detection model. Its construction greatly enhanced the performance of these systems, enabling them to serve agricultural production and research more accurately and efficiently.
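As a minimal sketch of G = BuildGraph(E, R), the snippet below stores entity–relationship triples in a directed graph with networkx; in the deployed system a graph database fills this role, and the triples shown are illustrative examples rather than records from the actual graph.
```python
# Minimal sketch of G = BuildGraph(E, R) with networkx; the deployed system stores the
# graph in a graph database, and the triples below are illustrative examples only.
import networkx as nx

triples = [
    ("Cotton", "has_symptom", "yellowing leaves"),
    ("yellowing leaves", "suggests", "Verticillium wilt"),
    ("Verticillium wilt", "managed_by", "crop rotation"),
]

G = nx.DiGraph()
for head, relation, tail in triples:
    G.add_edge(head, tail, relation=relation)   # store the relation type on the edge

# Simple query: follow outgoing edges from an observed symptom.
for _, tail, data in G.out_edges("yellowing leaves", data=True):
    print(f"yellowing leaves --{data['relation']}--> {tail}")
```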

4.1.3. Sensor Data Collection

In this study, the collection of sensor data was crucial for building a comprehensive agricultural knowledge graph and enhancing the accuracy of the disease detection system. The sensor data we collected originated from smart agricultural monitoring equipment, such as soil testing sensors and plant growth monitoring devices, which can provide detailed data on soil pH, electrical conductivity, nutrient content, and plant physiological indicators. These devices are deployed at key locations in fields and regularly collect data to monitor and assess crop growth condition and potential disease risks. All collected sensor data underwent strict data cleaning and preprocessing to ensure the data quality met the requirements for the subsequent analysis. During the data preprocessing process, we eliminated outliers, filled in missing values, and normalized the data to facilitate their use in later data analysis.
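A minimal sketch of this cleaning pipeline with pandas is shown below: implausible readings are treated as missing, gaps are filled by interpolation, and each channel is min–max normalized. The column names, sample values, and validity thresholds are assumptions for illustration.
```python
# Sketch of sensor-data cleaning: outlier removal, missing-value filling,
# and min-max normalization. Column names, values, and thresholds are illustrative.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "soil_ph": [6.5, 6.6, np.nan, 6.4, 12.0],
    "conductivity_ds_m": [1.2, 1.3, 1.1, np.nan, 1.2],
})

# 1) Treat physically implausible readings as missing (example: soil pH outside 3-10).
df.loc[~df["soil_ph"].between(3, 10), "soil_ph"] = np.nan

# 2) Fill missing values by linear interpolation over the time-ordered records.
df = df.interpolate(limit_direction="both")

# 3) Min-max normalize each channel to [0, 1] for downstream models.
df = (df - df.min()) / (df.max() - df.min())
print(df)
```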

4.1.4. Image Data Collection

In this study, the collection of image data was crucial for establishing an efficient disease detection model and enhancing the performance of the agricultural question-answering system. The content, sources, quantity, and methods of acquiring image data were fundamental to building accurate and comprehensive models. First, the content of the image data primarily included images of healthy and diseased plants of major crops such as rice, wheat, potatoes, and cotton, as shown in Table 7.
Table 7. Image dataset details.
These images included different growth stages of the crops and manifestations of various common and rare diseases. For rice, images of healthy rice, rice affected by blast disease, yellow leaf disease, and other disease conditions were collected. For each type of crop, efforts were made to ensure that the images covered stages from early symptoms to severe infection. Additionally, the sources of the data were diverse, mainly collected from the West District Botanical Garden of China Agricultural University. Furthermore, a certain number of images were obtained from public resources on the internet, representing crops from different regions and under various climatic conditions. The rationale for using these image data was that a rich and diversified image dataset is key to building an efficient disease detection model. The diversity in crop types, disease types, and stages of disease development significantly enhanced the model’s generalization capability and accuracy. Moreover, images under different lighting conditions and shooting angles helped train the model to better adapt to various situations in practical applications. The annotation process involved identifying disease areas in the images and assigning correct disease category labels. For some complex images, expert knowledge was used for precise annotation, as shown in Figure 4.
Figure 4. Screenshot of the dataset labeling interface, demonstrating the precise labeling of disease lesions on individual plant leaves in an agricultural disease detection dataset using annotation tools. This was carried out to create a labeled dataset for machine learning model training.

4.2. Data Preprocessing

4.2.1. Preprocessing of Corpus Data

In this study, the preprocessing of corpus data was a key step in building an efficient agricultural question-answering system and disease detection model. The preprocessing involved transforming raw text data into a format more amenable to computer processing for subsequent machine learning and natural language processing tasks. A series of preprocessing techniques were employed to optimize the corpus data, ensuring data quality and processing efficiency. Initially, the raw corpus data, sourced from various channels including agricultural papers, technical reports, online forums, and Q&A, exhibited significant structural and format differences. To enable effective processing by the machine learning models, basic data cleaning was performed. This included removing irrelevant information (such as advertisements, meaningless characters), standardizing data formats (like dates, units), and correcting obvious errors. Subsequently, the text data underwent tokenization. Tokenization is the process of splitting long sentences or paragraphs in the text into individual words, which is particularly crucial for Chinese texts, due to the absence of clear delimiters between words. Efficient tokenization was carried out using statistical and machine-learning-based tools. In this paper, the text serialization operation utilized the existing tokenization tool Jieba for word segmentation, along with the Word2Vec model. Furthermore, the text data were subjected to part-of-speech tagging and syntactic analysis. Part-of-speech tagging involves assigning a grammatical role (such as noun, verb, etc.) to each word in the text, while syntactic analysis explores the dependency relationships between words in a sentence. These steps were vital for understanding the semantic structure of the text. To enhance the model performance and accuracy, the text data were further subjected to vectorization. Text vectorization involved converting words in the text into numerical vectors processable by computers. Word embedding technology [20] was employed for this transformation. Word embedding technology captures semantic relationships between words and translates these into vectors in a high-dimensional space, as shown in Figure 5.
Figure 5. Schematic diagram of text embedding in a three-dimensional space, displaying how text data are mapped onto points in an embedding space formed by three base vectors x1, x2, and x3.
The word vectorization process can be represented as
V=Embed(W;η)
Here, V represents the word vectors, Embed is the word embedding function, and η indicates the parameters of the word embedding model. This step was crucial, as it directly impacted the performance of the subsequent models. Through these preprocessing steps, raw text data were transformed into a format suitable for machine learning and natural language processing. These preprocessing techniques not only enhanced the quality and consistency of the data but also laid a solid foundation for subsequent model training and analysis.
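A minimal sketch of the tokenization and embedding steps described above, using Jieba and gensim's Word2Vec, is given below; the toy sentences and hyperparameters are illustrative and do not reflect the actual corpus or training settings.
```python
# Sketch of V = Embed(W; eta): Jieba segmentation followed by Word2Vec training.
# The toy corpus and hyperparameters below are illustrative only.
import jieba
from gensim.models import Word2Vec

raw_sentences = [
    "小麦锈病会导致叶片出现黄色斑点",
    "水稻稻瘟病在高湿环境下传播迅速",
]

# 1) Tokenize each sentence into a word list.
tokenized = [list(jieba.cut(s)) for s in raw_sentences]

# 2) Train a small Word2Vec model on the tokenized corpus.
model = Word2Vec(sentences=tokenized, vector_size=100, window=5, min_count=1, workers=2)

# 3) Look up the embedding vector of a word from the vocabulary.
word = tokenized[0][0]
vector = model.wv[word]
print(word, vector.shape)  # a 100-dimensional vector
```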

4.2.2. Preprocessing of Image Data

This paper discusses three methods of image data preprocessing: basic augmentation, Cutout, and Cutmix, as shown in Figure 6.
Figure 6. Example of the application of image enhancement techniques in agricultural disease detection: (A) plant images augmented with the Cutout technique; (B) plant images augmented with the Cutmix technique (red boxes mark the inserted regions); (C) plant images augmented with color and brightness adjustments.
These methods are widely used to augment image data and improve the performance of deep learning models. Cutout [53] is an image data augmentation technique that introduces randomness by obscuring part of the image, thereby mitigating overfitting and enhancing the model’s generalizability. The concept involves drawing a black square at a random location on the image, randomly eliminating part of the image information, which aids the model in learning more robust features. Each application of Cutout obscures a different part of the image, further improving generalization. An image I is defined, with its pixel values represented by a matrix P of dimensions W×H×C, where W and H are the width and height of the image and C is the number of channels. A binary mask M represents the Cutout operation, having the same dimensions as P, where Mi,j,c ∈ {0,1} indicates whether channel c at position (i,j) is obscured: a value of 1 in M retains the pixel value, and 0 obscures it. The Cutout operation can be represented as
Icutout=I⊙M
Here, ⊙ denotes element-wise multiplication. By multiplying the image’s pixel values by the mask, the obscured part of the image Icutout is obtained.
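The operation above can be sketched in a few lines of NumPy: a random square region of the mask M is set to 0 and the image is multiplied element-wise by M. The patch size is an illustrative choice.
```python
# Minimal sketch of Cutout: I_cutout = I * M, with a random square of M set to 0.
import numpy as np

def cutout(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Zero out a random patch x patch square; image has shape (H, W, C)."""
    h, w, _ = image.shape
    mask = np.ones_like(image)
    y = np.random.randint(0, h - patch + 1)
    x = np.random.randint(0, w - patch + 1)
    mask[y:y + patch, x:x + patch, :] = 0     # obscured region of the mask
    return image * mask                       # element-wise multiplication I * M

augmented = cutout(np.random.rand(64, 64, 3))
```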
Cutmix [54] is another method of image data augmentation that differs from Cutout by merging two images, introducing more diversity. The concept of Cutmix is to randomly select an image block and insert it into another image, simultaneously generating a mask corresponding to the inserted image block, thereby achieving image synthesis. Key features of Cutmix include diversity and the introduction of categorical labels. Cutmix introduces more diversity by merging features of different images, aiding the model in learning richer features. It also mixes the categorical labels of the two images, increasing the diversity of labels. Assuming two input images I1 and I2, their pixel values are represented by matrices P1 and P2, respectively. There are also two corresponding labels y1 and y2, representing the categories of these two images. The following equation can generate the Cutmix image Icutmix:
Icutmix=I1⊙M+I2⊙(1−M)
Here, M is a mask with the same dimensions as the image, with values between 0 and 1, indicating the selection of the image block. M is generated by randomly selecting a rectangular area, setting the corresponding pixel values to 1 and others to 0. This mask can also be used to mix the categorical labels:
ycutmix=λ·y1+(1−λ)·y2
Here, λ is a randomly generated weight controlling the degree of label mixing.
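A corresponding NumPy sketch of Cutmix is given below: a rectangular patch of I2 replaces the matching region of I1, and λ is taken as the fraction of retained I1 pixels. The box-sampling scheme is simplified relative to the original method.
```python
# Minimal sketch of CutMix: I_cutmix = I1 * M + I2 * (1 - M), labels mixed by lambda.
import numpy as np

def cutmix(img1, img2, label1, label2, patch=24):
    """Paste a random patch of img2 into img1; images have shape (H, W, C)."""
    h, w, _ = img1.shape
    mask = np.ones_like(img1)
    y = np.random.randint(0, h - patch + 1)
    x = np.random.randint(0, w - patch + 1)
    mask[y:y + patch, x:x + patch, :] = 0
    mixed_img = img1 * mask + img2 * (1 - mask)
    lam = 1.0 - (patch * patch) / (h * w)              # fraction of img1 that is kept
    mixed_label = lam * label1 + (1 - lam) * label2    # mix the categorical labels
    return mixed_img, mixed_label

img_a, img_b = np.random.rand(64, 64, 3), np.random.rand(64, 64, 3)
y_a, y_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])  # one-hot labels
mixed, y_mixed = cutmix(img_a, img_b, y_a, y_b)
```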

4.3. Proposed Method

4.3.1. Multi-Transformer Overview

In this study, a comprehensive approach based on multimodal and large transformer models was proposed to effectively handle disease detection and question-answering tasks in the agricultural domain. This method integrated advanced multimodal technologies, deep learning, and natural language processing, forming a comprehensive and efficient agricultural intelligence system. Our approach primarily relies on a multi-transformer architecture capable of processing and analyzing data from different modalities (such as images, text). At the core of this architecture is the transformation of various types of data input into a unified format, facilitating effective learning and inference.
Under this framework, data from different modalities are first fused and synchronized through a multimodal alignment module, followed by in-depth analysis of these fused data using a transformer-based inference model, and finally optimizing the overall model performance with a specially designed multimodal loss function. The method flow design includes several key steps. Initially, in the multimodal alignment module, data from various sources are processed and unified. For image data, convolutional neural networks (CNN) are employed to extract features; for text data, NLP techniques are used for word embedding and semantic analysis. Then, features from different modalities are integrated into a unified framework, ensuring effective combination of different types of data in the subsequent processing. Next, the powerful capabilities of the transformer model are utilized in the transformer inference model to process the fused multimodal data. The transformer model, known for its efficient parallel processing and long-distance dependency capturing, excels in handling complex sequential data. In this step, the model not only learns the internal features of the data but also explores the relationships between features from different modalities. Finally, a special multimodal loss function was designed to effectively train this complex system and optimize its performance. This loss function comprehensively considers the characteristics and importance of different modal data and their roles in the final task, ensuring that the model fully considers the characteristics of multimodal data during learning. Theoretically, our method is based on the view that different modalities of data (such as images and text) provide complementary information in agricultural disease detection and question-answering systems. By combining these different data sources, our system can gain a richer and more comprehensive understanding than single modality systems. For example, in disease detection, images provide intuitive disease features, while text offers detailed descriptions and background information about the disease. The combination of this information enables the system to identify and classify diseases more accurately. The adoption of a multi-transformer architecture was due to the advantages of the transformer model in processing sequential data, especially in capturing long-distance dependencies.

4.3.2. Multimodal Alignment Module

In this study, the multimodal alignment module was one of the core components responsible for effectively fusing data from different modalities (including images, text, and sensor data) to enhance the performance of the agricultural disease detection and question-answering system. The design of the multimodal alignment module aimed to address the differences in feature space and semantic level between the different modal data, providing a unified and coordinated data representation for the subsequent processing and analysis. Inputs to the multimodal alignment module primarily included image and text data. Image data are typically processed by convolutional neural networks (CNN) to extract visual features, while text data are processed using natural language processing technologies (such as BERT) to extract linguistic features. The goal of the multimodal alignment module was to transform these two different modal data features into a unified feature representation for effective integration in the subsequent processing, as shown in Figure 7.
Figure 7. Schematic diagram of a multimodal data processing framework, showing how temperature sensor data and text data are encoded through specific encoders, and how image data are processed through an image encoder. It also illustrates how the encoded data from each source are combined to generate a comprehensive feature representation.
In the processing flow, preliminary feature extraction was first performed on image and text data. For image data I, visual features Fv were extracted using a CNN model (ResNet50 [23]):
Fv=CNN(I;θv)
Here, θv represents the parameters of ResNet50. For text data T, linguistic features Ft were extracted using a BERT model:
Ft=BERT(T;θt)
Here, θt represents the parameters of the BERT model. The key step was feature fusion, where visual features Fv and linguistic features Ft were combined to generate a unified multimodal feature Fm. This process could be accomplished using a fusion function Fusion:
Fm=Fusion(Fv,Ft;θf)
Here, θf denotes the parameters of the fusion function. In the multimodal alignment module, the key to feature fusion was finding an effective method to integrate features from different modalities. A weighted fusion method was adopted, where fusion weights were data-driven and learned automatically during model training. Weighted fusion can be represented as
Fm=αFv+(1−α)Ft
Here, α is a learned weight used to balance the importance of features from different modalities. The advantage of this method is its ability to automatically adjust the contribution of visual and linguistic features according to the requirements of different tasks. The application of the multimodal alignment module in the agricultural disease detection and question-answering system brought significant advantages. First, it enabled the system to utilize both visual information from images and semantic information from text, enhancing the accuracy of disease detection and the relevance of question answering. Second, the flexibility of the multimodal alignment module allowed the system to adjust the contribution of data from different modalities according to the characteristics of different tasks, precisely meeting the needs of various tasks. Lastly, this method has a strong generalization capability, adapting to different types and sources of data, enhancing the stability and reliability of the system in practical applications.
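A minimal PyTorch sketch of this weighted fusion is shown below, with α implemented as a learnable scalar constrained to (0, 1) by a sigmoid and the two feature streams first projected to a common dimension; the dimensions and projection layers are illustrative assumptions rather than the exact implementation.
```python
# Sketch of learned weighted fusion F_m = alpha * F_v + (1 - alpha) * F_t.
# Projection sizes are illustrative; alpha is learned during training.
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    def __init__(self, visual_dim=2048, text_dim=768, fused_dim=512):
        super().__init__()
        self.proj_v = nn.Linear(visual_dim, fused_dim)   # project ResNet50 features
        self.proj_t = nn.Linear(text_dim, fused_dim)     # project BERT features
        self.alpha_logit = nn.Parameter(torch.zeros(1))  # learnable fusion weight

    def forward(self, f_v, f_t):
        alpha = torch.sigmoid(self.alpha_logit)          # keep alpha in (0, 1)
        return alpha * self.proj_v(f_v) + (1 - alpha) * self.proj_t(f_t)

fusion = WeightedFusion()
f_m = fusion(torch.randn(4, 2048), torch.randn(4, 768))  # batch of 4 samples
print(f_m.shape)  # torch.Size([4, 512])
```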

4.3.3. Transformer Inference Model

In this research, the transformer inference model, as one of the core components, undertook the critical task of processing and analyzing the fused multimodal data. The transformer model, with its outstanding performance and flexibility, has become the preferred choice in the field of natural language processing for handling complex sequential data. In our study, the transformer model was utilized to extract deep features from fused multimodal data and to conduct effective inference, as shown in Figure 8. The core of the transformer model is its self-attention mechanism, which allows the model to consider all positions in a sequence simultaneously while processing it, thereby capturing complex contextual relationships.
Figure 8. Schematic diagram of a multimodal data low-rank fusion model, depicting how image data, sensor data, and text data (knowledge graphs) are transformed into a low-rank space through specific functional mappings. These low-rank representations are then jointly used for prediction tasks to generate the final task output.
In the inference model of this study, the inputs are the fused features from the multimodal alignment module. The fused features first pass through a series of transformer encoding layers, each containing a self-attention mechanism and a feed-forward neural network. The working principle of the self-attention mechanism can be represented by Equation (4). This mechanism enables the model to focus on the associations between different parts of the input sequence. After passing through the self-attention mechanism, the data enter a feed-forward neural network for further processing. The entire process can be represented as
Transformer(Fm)=FFN(Attention(Fm))
Here, Fm is the output from the multimodal alignment module, and FFN denotes the feed-forward neural network. The transformer model has significant advantages in processing sequential data. In particular, its self-attention mechanism can effectively handle long-distance dependency issues, which is crucial for understanding and analyzing complex multimodal data. Additionally, the parallel processing capability of the transformer model makes it more efficient in handling large-scale data. Mathematically, the advantage of the transformer model lies in its self-attention mechanism’s ability to dynamically weight different parts of the sequence. By adjusting weights, the model can more flexibly capture important features in the sequence, thereby improving the accuracy of inference. The application of the transformer inference model in the agricultural disease detection and question-answering system brought several advantages: The transformer model can extract rich and deep features from fused multimodal data, crucial for understanding complex agricultural issues. In processing long sequential data such as descriptive text and image labels, the Transformer model can effectively capture long-distance dependencies. The parallel processing ability of the Transformer model makes it more efficient in handling a large amount of multimodal data, which is particularly important for building practical agricultural intelligence systems.
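A minimal PyTorch sketch of this inference stage is given below, using the standard transformer encoder over a sequence of fused multimodal tokens followed by a classification head; the layer sizes, pooling choice, and number of classes are illustrative assumptions.
```python
# Sketch of the transformer inference stage over fused multimodal features.
# Layer sizes, pooling, and the classification head are illustrative choices.
import torch
import torch.nn as nn

class MultimodalTransformer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_layers=4, n_classes=25):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=2048, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, fused_tokens):           # (batch, seq_len, d_model)
        encoded = self.encoder(fused_tokens)   # self-attention + feed-forward layers
        pooled = encoded.mean(dim=1)           # average-pool the token sequence
        return self.head(pooled)               # class logits

model = MultimodalTransformer()
logits = model(torch.randn(4, 16, 512))        # 4 samples, 16 fused tokens each
print(logits.shape)  # torch.Size([4, 25])
```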

4.3.4. Multimodal Loss Function

In this research, a specialized multimodal loss function has been designed to optimize and evaluate the transformer inference model based on multimodal data. This multimodal loss function takes into consideration the characteristics of different modal data and their significance in the model, ensuring optimal learning outcomes when the model processes multimodal data. The design of the multimodal loss function acknowledges the distinct roles played by different modal data in the model within multimodal learning tasks. By introducing modality-specific loss functions, the model is guaranteed to fully consider the characteristics of each modality during learning, thereby enhancing its capability to handle multimodal data. The design principle of the multimodal loss function is based on the notion that different modalities contribute differently to the model, and these contributions may vary with the task. For instance, in some scenarios, image data might provide more intuitive information than text data, while in others, semantic information from a text may be more crucial. Therefore, our loss function design aimed to dynamically balance these different modal contributions to enhance the overall performance of the model. The multimodal loss function combined traditional classification loss (such as cross-entropy loss) with modality-specific loss. Its mathematical expression can be represented as
Ltotal=αLclassification+βLmodal1+γLmodal2
Here, Ltotal denotes the total loss, Lclassification is the cross-entropy loss for classification tasks, and Lmodal1 and Lmodal2 represent losses related to different modalities (for example, a specific loss for images and a specific loss for text). α, β, and γ are weight coefficients used to balance the loss from the different parts. Cross-entropy loss is a common loss function in classification tasks and is used to measure the difference between the probability distribution predicted by the model and the actual label distribution. Its mathematical formula is as follows:
Lclassification = −∑i yi log(pi)
Here, yi is the actual label’s probability distribution, and pi is the model’s predicted probability distribution. For modality-specific losses, different loss functions can be designed according to the task. For instance, for the image modality, loss functions related to image reconstruction or feature matching might be used; for the text modality, loss functions related to semantic similarity or sentence generation quality might be employed. The application of the multimodal loss function in agricultural disease detection and question-answering system offers several advantages. By balancing the contributions of different modal data, the multimodal loss function can improve the model’s accuracy in processing multimodal data. Different tasks may require varying degrees of attention to different modal data. The design of the multimodal loss function allows the model to automatically adjust the importance of different modal data based on the task’s characteristics. By combining classification loss and modality-specific loss, the multimodal loss function can optimize the model performance.
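A minimal PyTorch sketch of this combined loss is shown below, pairing cross-entropy with two illustrative modality-specific terms (an image-reconstruction MSE and a text-embedding cosine loss); the weights and the particular auxiliary losses are assumptions for illustration, not the exact terms used in the study.
```python
# Sketch of L_total = alpha * L_classification + beta * L_modal1 + gamma * L_modal2.
# The auxiliary losses and weight values are illustrative assumptions.
import torch
import torch.nn.functional as F

def multimodal_loss(logits, labels,
                    img_recon, img_target,
                    text_emb, text_target,
                    alpha=1.0, beta=0.5, gamma=0.5):
    l_cls = F.cross_entropy(logits, labels)                          # classification loss
    l_img = F.mse_loss(img_recon, img_target)                        # image-specific loss
    l_txt = 1.0 - F.cosine_similarity(text_emb, text_target).mean()  # text-specific loss
    return alpha * l_cls + beta * l_img + gamma * l_txt

loss = multimodal_loss(
    torch.randn(4, 25), torch.randint(0, 25, (4,)),
    torch.randn(4, 3, 64, 64), torch.randn(4, 3, 64, 64),
    torch.randn(4, 768), torch.randn(4, 768),
)
print(loss.item())
```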

4.4. Experimental Configuration

4.4.1. Hardware Platform

The hardware platform forms the foundation for deep learning experiments and is crucial for research on multimodal disease detection and agricultural question-answering systems. This section details the configuration of the hardware platform, including GPUs, CPUs, memory, and other aspects. In our hardware platform, an NVIDIA GeForce RTX 3090 was selected as the primary GPU. This GPU, based on NVIDIA’s Ampere architecture, boasts numerous CUDA cores and substantial memory capacity, making it well-suited for processing multimodal data and large-scale models. On the other hand, the CPU (central processing unit) plays a significant role in data preprocessing, model deployment, and certain computation-intensive tasks. A server equipped with a 32-core CPU was chosen for our hardware platform. This CPU, with multiple physical and logical cores, is capable of handling multi-threaded tasks and supports high-performance computing. In multimodal tasks, datasets are often large, requiring ample memory for data loading and processing. Hence, 128 GB of RAM was configured to ensure sufficient memory for model training and inference. Large-scale datasets necessitate high-speed storage devices to accelerate data loading and saving. Therefore, a high-performance solid state drive (SSD) was chosen as the primary storage device to provide rapid data access.

4.4.2. Software Configuration and Hyperparameter Settings

In deep learning research, appropriate software configuration and hyperparameter settings are vital for training models in multimodal disease detection and agricultural question-answering systems. This section details the software configuration and various hyperparameter settings, including the deep learning framework, operating system, learning rate, batch size, and more. In multimodal tasks, choosing the right deep learning framework is critical for model training and performance. One of the current most popular deep learning frameworks is PyTorch, known for its extensive library support, dynamic computation graph, and user-friendly API. PyTorch was selected as the main deep learning framework for its excellent performance in multimodal tasks and substantial community support. Selecting an appropriate operating system is also a crucial decision. Linux operating systems are widely used in deep learning research and development, due to their good support for deep learning tools and libraries. In our experiments, a popular Linux distribution, Ubuntu, was chosen to ensure compatibility with deep learning tools. The learning rate is a key hyperparameter in deep learning that determines the step size of the model during each parameter update. The choice of learning rate directly affects the model’s convergence speed and performance. An initial learning rate of 0.001 was used in our experiments. Different learning rate settings were tried, and the best one was chosen based on the performance on the validation set. Batch size refers to the number of training samples input to the model at once. Batch training helps speed up the training process and improve memory efficiency. The choice of batch size depends on the model’s architecture and hardware resources. Larger batch sizes can accelerate training but also require more memory. With limited hardware resources, a smaller batch size may need to be chosen. Therefore, in this case, a batch size of 128 was set. Adjustments and optimizations to batch size were made during experiments to achieve optimal performance. To prevent model overfitting, regularization techniques, including L2 regularization and dropout, were applied. Regularization helped the model generalize to new data. Additionally, an appropriate optimizer, Adam, was chosen to update the model parameters. Model parameter initialization is also an important aspect. We used pretrained model weights for initialization. Pretrained models, usually trained on large-scale datasets, have better initial feature representations. For the multimodal tasks, we chose pretrained text and image models and combined them into a multimodal model. Hyperparameter search methods were used in our experiments to find the best combination of hyperparameters, including searching for the best learning rate, batch size, regularization parameters, etc. Techniques such as grid search, random search, and Bayesian optimization were employed to find the best hyperparameter settings. Hyperparameter search is an iterative process, requiring repeated trials of different hyperparameter combinations, guided by the performance on the validation set.
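The reported settings can be assembled into a short PyTorch training sketch as follows; the weight-decay and dropout values, the toy model, and the placeholder data are illustrative, while the optimizer, learning rate, and batch size follow the values given above.
```python
# Sketch of the reported training setup: Adam, lr=0.001, batch size 128,
# L2 regularization via weight decay, and dropout. Model and data are placeholders.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Dropout(p=0.3), nn.Linear(256, 25))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss()

dataset = TensorDataset(torch.randn(1024, 512), torch.randint(0, 25, (1024,)))
loader = DataLoader(dataset, batch_size=128, shuffle=True)

for features, labels in loader:               # one epoch of the training loop
    optimizer.zero_grad()
    loss = criterion(model(features), labels)
    loss.backward()
    optimizer.step()
```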

4.4.3. Dataset Training

In deep learning tasks, appropriately partitioning the dataset for training, validation, and testing is of paramount importance. The methods of dataset splitting and cross-validation directly impact the performance assessment and generalizability of the model. In this paper, the details of dataset splitting, K-Fold cross-validation, and other training-related aspects are discussed. Dataset splitting is one of the primary steps in machine learning experiments. An appropriate method of dataset splitting ensures that the model can fully utilize the data during training, validation, and testing processes. In the tasks of multimodal disease detection and agricultural question-answering systems, there is a comprehensive dataset containing a large amount of data, which need to be divided into three key parts. The training set, constituting 70% of the total dataset, is the foundation for model training, wherein the model learns to capture patterns and features of the data. The validation set, making up 15% of the total dataset, is used for hyperparameter tuning and performance evaluation of the model. Multiple validations are conducted on the validation set to choose the optimal model hyperparameter settings, such as learning rate and regularization parameters. The test set, comprising the remaining 15% of the dataset, is utilized for the final evaluation of the model’s performance. The performance assessment of the model on the test set serves as the ultimate metric to measure the model’s performance on real-world data. When splitting the dataset, it is crucial to ensure that each part contains data from different categories or samples, to guarantee the generalizability of the model. Random sampling is employed for splitting to maintain an even distribution of data. Additionally, K-Fold (k = 10) cross-validation is used, allowing for fuller use of data and providing reliable performance evaluation. This approach involves dividing the dataset into K equally sized subsets, where K − 1 subsets are used for training and the remaining one for validation. This process is repeated K times, with a different subset serving as the validation set each time, and the average of the K validation scores is taken as the final performance metric. The benefits of K-Fold cross-validation include obtaining a more accurate performance estimate through multiple validations, reducing the impact of randomness, and the ability to try different hyperparameter settings on each validation fold to select the best settings.
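A minimal scikit-learn sketch of this splitting and cross-validation scheme is given below; the random feature matrix and the logistic-regression stand-in model are illustrative placeholders.
```python
# Sketch of the 70/15/15 split and 10-fold cross-validation described above.
# The random feature matrix and the logistic-regression stand-in are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, train_test_split

X, y = np.random.rand(1000, 32), np.random.randint(0, 5, 1000)

# 70% training, 15% validation, 15% test.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.30, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.50, random_state=42)

# 10-fold cross-validation over the training portion.
kfold = KFold(n_splits=10, shuffle=True, random_state=42)
fold_scores = []
for tr_idx, va_idx in kfold.split(X_train):
    clf = LogisticRegression(max_iter=1000).fit(X_train[tr_idx], y_train[tr_idx])
    fold_scores.append(clf.score(X_train[va_idx], y_train[va_idx]))
print("mean 10-fold accuracy:", np.mean(fold_scores))
```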

4.4.4. Model Evaluation Metrics

To assess the effectiveness of our disease detection and agricultural question-answering system, we relied on three principal metrics for evaluation.
Accuracy is a metric frequently used in classification tasks: it is the fraction of samples that the model classifies correctly out of the total number of samples examined. Precision measures the accuracy of the model in identifying positive samples: it is the fraction of samples predicted as positive that are truly positive, i.e., the number of true positives divided by the number of all samples the model labeled as positive. Recall, also known as sensitivity, measures the model’s ability to find all positive samples: it is the fraction of actual positive samples that the model correctly predicts as positive, i.e., the number of true positives divided by the total number of actually positive samples.
These metrics served to gauge the model’s accuracy in identifying diseases and providing answers to agricultural queries. Accuracy provided a broad view of the model’s overall performance, while precision and recall offered insights into its effectiveness in scenarios where the distribution of data might have been skewed.
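As a worked illustration, the snippet below computes the three metrics from toy predictions with scikit-learn; the label vectors are invented for the example only.
```python
# Toy illustration of accuracy, precision, and recall with scikit-learn.
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("accuracy :", accuracy_score(y_true, y_pred))   # correct / total
print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)
```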

5. Conclusions

In this study, a comprehensive approach based on multimodal data and the transformer model was proposed to address key challenges in agricultural disease detection and question-answering systems. First, in the disease detection experiments, various models including AlexNet, GoogLeNet, VGG, ResNet, and the method proposed in this paper were compared. The results demonstrated that the proposed method achieved the highest values in precision, recall, and accuracy, with respective scores of 0.95, 0.92, and 0.94, significantly outperforming the other comparative models. This indicated that the proposed method has a significant advantage in identifying various agricultural diseases, particularly in processing complex data and subtle features. Second, in the agricultural image captioning experiment, the performance of BLIP, mPLUG-Owl, InstructBLIP, CLIP, BLIP2, and the method proposed in this paper was examined. In this task, the proposed method also displayed the best performance, with precision, recall, and accuracy scores of 0.92, 0.88, and 0.91, respectively. These results suggest that the proposed method can effectively understand the content of agricultural images and generate accurate and rich descriptive texts, which is important for enhancing the level of intelligence and automation in agricultural production. In the object detection experiment, SSD, RetinaNet, CenterNet, YOLOv8, and the method proposed in this paper were compared. The experimental results showed that the proposed method performed best in terms of precision, recall, and accuracy, achieving scores of 0.96, 0.91, and 0.94, respectively. This result reaffirms the efficiency and accuracy of the proposed method in processing complex agricultural data, especially in accurately identifying and locating agricultural diseases. Additionally, multimodal dataset ablation experiments and different loss function ablation experiments were conducted. In the multimodal dataset ablation experiment, it was found that the model performed optimally when using full modal data (image, text, and sensor data), and the absence of any modality led to a decrease in performance. This emphasized the importance of multimodal data in enhancing model performance. In the different loss function ablation experiments, it was found that the multimodal loss function performed best in all tasks, proving its effectiveness in handling multimodal data.

Author Contributions

Conceptualization, Y.L., X.L. and C.L.; Methodology, Y.L. and S.C.; Software, Y.L. and L.Z.; Validation, L.Z.; Formal analysis, M.S., S.C. and B.C.; Investigation, X.L.; Resources, M.S. and T.W.; Data curation, X.L., L.Z., M.S., B.C. and T.W.; Writing—original draft, Y.L., X.L., L.Z., M.S., S.C., B.C., T.W., J.Y. and C.L.; Writing—review & editing, J.Y. and C.L.; Visualization, S.C., B.C., T.W. and J.Y.; Supervision, C.L.; Project administration, J.Y. and C.L.; Funding acquisition, C.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Natural Science Foundation of China grant number 61202479.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhang, Y.; Wa, S.; Sun, P.; Wang, Y. Pear defect detection method based on resnet and dcgan. Information 2021, 12, 397. [Google Scholar] [CrossRef]
  2. Saleem, M.H.; Potgieter, J.; Arif, K.M. Automation in agriculture by machine and deep learning techniques: A review of recent developments. Precis. Agric. 2021, 22, 2053–2091. [Google Scholar] [CrossRef]
  3. Sujatha, R.; Chatterjee, J.M.; Jhanjhi, N.; Brohi, S.N. Performance of deep learning vs machine learning in plant leaf disease detection. Microprocess. Microsyst. 2021, 80, 103615. [Google Scholar] [CrossRef]
  4. Zhang, Y.; Wa, S.; Liu, Y.; Zhou, X.; Sun, P.; Ma, Q. High-accuracy detection of maize leaf diseases CNN based on multi-pathway activation function module. Remote Sens. 2021, 13, 4218. [Google Scholar] [CrossRef]
  5. Li, L.; Zhang, S.; Wang, B. Plant disease detection and classification by deep learning—A review. IEEE Access 2021, 9, 56683–56698. [Google Scholar] [CrossRef]
  6. Ray, M.; Ray, A.; Dash, S.; Mishra, A.; Achary, K.G.; Nayak, S.; Singh, S. Fungal disease detection in plants: Traditional assays, novel diagnostic techniques and biosensors. Biosens. Bioelectron. 2017, 87, 708–723. [Google Scholar] [CrossRef]
  7. Vadamalai, G.; Kong, L.L.; Iftikhar, Y. Plant Genetics and Physiology in Disease Prognosis. In Plant Disease Management Strategies for Sustainable Agriculture through Traditional and Modern Approaches; Springer: Berlin/Heidelberg, Germany, 2020; pp. 15–25. [Google Scholar]
  8. Das, D.; Singh, M.; Mohanty, S.S.; Chakravarty, S. Leaf disease detection using support vector machine. In Proceedings of the 2020 International Conference on Communication and Signal Processing (ICCSP), Chennai, India, 28–30 July 2020; pp. 1036–1040. [Google Scholar]
  9. Lin, X.; Wa, S.; Zhang, Y.; Ma, Q. A dilated segmentation network with the morphological correction method in farming area image Series. Remote Sens. 2022, 14, 1771. [Google Scholar] [CrossRef]
  10. Zhang, Y.; Yang, X.; Liu, Y.; Zhou, J.; Huang, Y.; Li, J.; Zhang, L.; Ma, Q. A time-series neural network for pig feeding behavior recognition and dangerous detection from videos. Comput. Electron. Agric. 2024, 218, 108710. [Google Scholar] [CrossRef]
  11. Deepalakshmi, P.; Lavanya, K.; Srinivasu, P.N. Plant leaf disease detection using CNN algorithm. Int. J. Inf. Syst. Model. Des. (IJISMD) 2021, 12, 1–21. [Google Scholar] [CrossRef]
  12. Sharma, P.; Berwal, Y.P.S.; Ghai, W. Performance analysis of deep learning CNN models for disease detection in plants using image segmentation. Inf. Process. Agric. 2020, 7, 566–574. [Google Scholar] [CrossRef]
  13. Bedi, P.; Gole, P. Plant disease detection using hybrid model based on convolutional autoencoder and convolutional neural network. Artif. Intell. Agric. 2021, 5, 90–101. [Google Scholar] [CrossRef]
  14. De Silva, M.; Brown, D. Multispectral Plant Disease Detection with Vision Transformer–Convolutional Neural Network Hybrid Approaches. Sensors 2023, 23, 8531. [Google Scholar] [CrossRef]
  15. Parez, S.; Dilshad, N.; Alghamdi, N.S.; Alanazi, T.M.; Lee, J.W. Visual intelligence in precision agriculture: Exploring plant disease detection via efficient vision transformers. Sensors 2023, 23, 6949. [Google Scholar] [CrossRef]
  16. Thai, H.T.; Le, K.H.; Nguyen, N.L.T. FormerLeaf: An efficient vision transformer for Cassava Leaf Disease detection. Comput. Electron. Agric. 2023, 204, 107518. [Google Scholar] [CrossRef]
  17. Xie, L.; Yuille, A. Genetic cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1379–1388. [Google Scholar]
  18. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  19. Hu, Z.; Dong, Y.; Wang, K.; Chang, K.W.; Sun, Y. Gpt-gnn: Generative pre-training of graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Virtual Event, 6–10 July 2020; pp. 1857–1867. [Google Scholar]
  20. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  21. Trong, V.H.; Gwang-hyun, Y.; Vu, D.T.; Jin-young, K. Late fusion of multimodal deep neural networks for weeds classification. Comput. Electron. Agric. 2020, 175, 105506. [Google Scholar] [CrossRef]
  22. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  23. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  24. Singh, S.; Ahuja, U.; Kumar, M.; Kumar, K.; Sachdeva, M. Face mask detection using YOLOv3 and faster R-CNN models: COVID-19 environment. Multimed. Tools Appl. 2021, 80, 19753–19768. [Google Scholar] [CrossRef]
  25. Wu, W.; Liu, H.; Li, L.; Long, Y.; Wang, X.; Wang, Z.; Li, J.; Chang, Y. Application of local fully Convolutional Neural Network combined with YOLO v5 algorithm in small target detection of remote sensing image. PloS ONE 2021, 16, e0259283. [Google Scholar] [CrossRef]
  26. Bertasius, G.; Wang, H.; Torresani, L. Is space-time attention all you need for video understanding? In Proceedings of the 38th International Conference on Machine Learning, ICML, Virtual Event, 18–24 July 2021; Volume 2, p. 4. [Google Scholar]
  27. Patil, R.R.; Kumar, S. Rice-fusion: A multimodality data fusion framework for rice disease diagnosis. IEEE Access 2022, 10, 5207–5222. [Google Scholar] [CrossRef]
  28. Dandrifosse, S.; Carlier, A.; Dumont, B.; Mercatoris, B. Registration and fusion of close-range multimodal wheat images in field conditions. Remote Sens. 2021, 13, 1380. [Google Scholar] [CrossRef]
  29. Anandhi, D.R.F.R.; Sathiamoorthy, S. Enhanced Sea Horse Optimization with Deep Learning-based Multimodal Fusion Technique for Rice Plant Disease Segmentation and Classification. Eng. Technol. Appl. Sci. Res. 2023, 13, 11959–11964. [Google Scholar] [CrossRef]
  30. Gadiraju, K.K.; Ramachandra, B.; Chen, Z.; Vatsavai, R.R. Multimodal deep learning based crop classification using multispectral and multitemporal satellite imagery. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Virtual Event, 6–10 July 2020; pp. 3234–3242. [Google Scholar]
  31. Qing, J.; Deng, X.; Lan, Y.; Li, Z. GPT-aided diagnosis on agricultural image based on a new light YOLOPC. Comput. Electron. Agric. 2023, 213, 108168. [Google Scholar] [CrossRef]
  32. Cao, Y.; Sun, Z.; Li, L.; Mo, W. A study of sentiment analysis algorithms for agricultural product reviews based on improved bert model. Symmetry 2022, 14, 1604. [Google Scholar] [CrossRef]
  33. Zhang, Y.; Lv, C. TinySegformer: A lightweight visual segmentation model for real-time agricultural pest detection. Comput. Electron. Agric. 2024, 218, 108740. [Google Scholar] [CrossRef]
  34. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 198–214. [Google Scholar]
  35. Shen, Y.; Wang, L.; Jin, Y. AAFormer: A multi-modal transformer network for aerial agricultural images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1705–1711. [Google Scholar]
  36. Fountas, S.; Espejo-Garcia, B.; Kasimati, A.; Mylonas, N.; Darra, N. The future of digital agriculture: Technologies and opportunities. IT Prof. 2020, 22, 24–28. [Google Scholar] [CrossRef]
  37. Lippi, M.; Bonucci, N.; Carpio, R.F.; Contarini, M.; Speranza, S.; Gasparri, A. A yolo-based pest detection system for precision agriculture. In Proceedings of the 2021 29th Mediterranean Conference on Control and Automation (MED), Puglia, Italy, 22–25 June 2021; pp. 342–347. [Google Scholar]
  38. Lu, J.; Tan, L.; Jiang, H. Review on convolutional neural network (CNN) applied to plant leaf disease classification. Agriculture 2021, 11, 707. [Google Scholar] [CrossRef]
  39. Zhang, Y.; Yang, G.; Liu, Y.; Wang, C.; Yin, Y. An improved YOLO network for unopened cotton boll detection in the field. J. Intell. Fuzzy Syst. 2022, 42, 2193–2206. [Google Scholar] [CrossRef]
  40. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; Volume 25. [Google Scholar]
  41. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
  42. Li, J.; Li, D.; Xiong, C.; Hoi, S. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the International Conference on Machine Learning, PMLR, Baltimore, MD, USA, 17–23 July 2022; pp. 12888–12900. [Google Scholar]
  43. Ye, Q.; Xu, H.; Xu, G.; Ye, J.; Yan, M.; Zhou, Y.; Wang, J.; Hu, A.; Shi, P.; Shi, Y.; et al. mplug-owl: Modularization empowers large language models with multimodality. arXiv 2023, arXiv:2304.14178. [Google Scholar]
  44. Dai, W.; Li, J.; Li, D.; Tiong, A.; Zhao, J.; Wang, W.; Li, B.; Fung, P.; Hoi, S. InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning. arXiv 2023, arXiv:2305.06500. [Google Scholar]
  45. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning. PMLR, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  46. Li, J.; Li, D.; Savarese, S.; Hoi, S. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv 2023, arXiv:2301.12597. [Google Scholar]
  47. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar]
  48. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  49. Zhou, X.; Wang, D.; Krähenbühl, P. Objects as points. arXiv 2019, arXiv:1904.07850. [Google Scholar]
  50. Zhang, L.; Ding, G.; Li, C.; Li, D. DCF-Yolov8: An Improved Algorithm for Aggregating Low-Level Features to Detect Agricultural Pests and Diseases. Agronomy 2023, 13, 2012. [Google Scholar] [CrossRef]
  51. Zhang, Y.; Wang, Y. High-precision wheat head detection model based on one-stage network and GAN model. Front. Plant Sci. 2022, 13, 787852. [Google Scholar] [CrossRef]
  52. Bender, A.; Whelan, B.; Sukkarieh, S. A high-resolution, multimodal data set for agricultural robotics: A Ladybird’s-eye view of Brassica. J. Field Robot. 2020, 37, 73–96. [Google Scholar] [CrossRef]
  53. DeVries, T.; Taylor, G.W. Improved regularization of convolutional neural networks with cutout. arXiv 2017, arXiv:1708.04552. [Google Scholar]
  54. Yun, S.; Han, D.; Oh, S.J.; Chun, S.; Choe, J.; Yoo, Y. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6023–6032. [Google Scholar]
Table 1. Comparison of disease detection performance.
Model              Precision  Recall  Accuracy
AlexNet            0.83       0.81    0.82
GoogLeNet          0.86       0.84    0.85
VGG                0.89       0.87    0.88
ResNet             0.92       0.90    0.91
Proposed Method    0.95       0.92    0.94
Table 2. Comparison of performance in agricultural image captioning.
Model              Precision  Recall  Accuracy
BLIP               0.78       0.74    0.75
mPLUG-Owl          0.80       0.76    0.77
InstructBLIP       0.84       0.80    0.82
CLIP               0.86       0.82    0.85
BLIP2              0.89       0.85    0.88
Proposed Method    0.92       0.88    0.91
Table 3. Comparison of object detection performance.
Model              Precision  Recall  Accuracy
SSD                0.82       0.80    0.81
RetinaNet          0.85       0.83    0.84
CenterNet          0.89       0.87    0.88
YOLOv8             0.93       0.90    0.92
Proposed Method    0.96       0.91    0.94
Table 4. Object detection result details for our method.
Crop     Disease                       Precision  Recall  Accuracy
Rice     Rice Blast                    0.97       0.92    0.95
         Sheath Blight                 0.95       0.93    0.94
         Rice False Smut               0.92       0.90    0.91
         Bacterial Leaf Blight         0.87       0.85    0.86
         Downy Mildew                  0.97       0.94    0.96
Wheat    Rust                          0.98       0.93    0.95
         Powdery Mildew                0.96       0.94    0.95
         Fusarium Head Blight          0.95       0.91    0.93
         Loose Smut                    0.78       0.76    0.77
         Sheath Blight                 0.97       0.92    0.94
Potato   Early Blight                  0.96       0.91    0.93
         Late Blight                   0.95       0.92    0.94
         Leafroll Disease              0.94       0.90    0.92
         Wilt Disease                  0.96       0.94    0.95
         Black Scurf                   0.97       0.93    0.95
Cotton   Wilt Disease                  0.95       0.92    0.94
         Yellow Wilt                   0.93       0.90    0.92
         Verticillium Wilt             0.96       0.94    0.95
         Blight                        0.94       0.91    0.93
         Anthracnose                   0.97       0.95    0.96
Corn     Rust                          0.95       0.93    0.94
         Northern Corn Leaf Blight     0.96       0.92    0.94
         Common Smut                   0.97       0.94    0.95
         Southern Corn Leaf Blight     0.74       0.70    0.72
         Leaf Spot Disease             0.98       0.96    0.97
Table 5. Multimodal dataset ablation experiment.
Image Data  Text Data  Sensor Data  Precision  Recall  Accuracy
                                    0.96       0.93    0.94
                                    0.24       0.21    0.23
                                    0.78       0.73    0.75
                                    0.92       0.90    0.91
Table 6. Different loss function ablation experiment.
Task                            Loss Function    Precision  Recall  Accuracy
Disease Detection               Hinge Loss       0.90       0.85    0.86
                                MSE Loss         0.93       0.87    0.91
                                Multimodal Loss  0.95       0.92    0.94
Agricultural Image Captioning   Hinge Loss       0.84       0.79    0.82
                                MSE Loss         0.89       0.83    0.86
                                Multimodal Loss  0.92       0.8     0.89
Object Detection                Hinge Loss       0.88       0.84    0.85
                                MSE Loss         0.91       0.87    0.89
                                Multimodal Loss  0.96       0.92    0.94
Table 7. Image dataset details.
Crop     Disease                       Number
Rice     Rice Blast                    768
         Sheath Blight                 1095
         Rice False Smut               677
         Bacterial Leaf Blight         1135
         Downy Mildew                  983
Wheat    Rust                          690
         Powdery Mildew                734
         Fusarium Head Blight          918
         Loose Smut                    1129
         Sheath Blight                 885
Potato   Early Blight                  921
         Late Blight                   1079
         Leafroll Disease              776
         Wilt Disease                  698
         Black Scurf                   993
Cotton   Wilt Disease                  874
         Yellow Wilt                   903
         Verticillium Wilt             1005
         Blight                        1297
         Anthracnose                   793
Corn     Rust                          754
         Northern Corn Leaf Blight     913
         Common Smut                   952
         Southern Corn Leaf Blight     1045
         Leaf Spot Disease             1176
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
