Early detection of tomato spotted wilt virus infection in tobacco using the hyperspectral imaging technique and machine learning algorithms


Computers and Electronics in Agriculture 167 (2019) 105066



Contents lists available at ScienceDirect

Computers and Electronics in Agriculture journal homepage: www.elsevier.com/locate/compag




Qing Gu a,b, Li Sheng a,b, Tianhao Zhang c, Yuwen Lu d, Zhijun Zhang e, Kefeng Zheng a,b, Hao Hu a,b,*, Hongkui Zhou a

a Institute of Digital Agriculture, Zhejiang Academy of Agricultural Sciences, Hangzhou 310021, China
b Key Laboratory of Information Traceability for Agricultural Products, Ministry of Agriculture and Rural Affairs of China, Hangzhou 310021, China
c College of Plant Protection, Shenyang Agriculture University, Shenyang 110161, China
d Institute of Plant Virology, Ningbo University, Ningbo 315211, China
e Institute of Plant Protection and Microbiology, Zhejiang Academy of Agricultural Sciences, Hangzhou 310021, China

ARTICLE INFO

Keywords: Tobacco plants; Tomato spotted wilt virus; Presymptomatic detection; Hyperspectral imaging; Machine learning

ABSTRACT

The hyperspectral imaging technique was used for the non-destructive detection of tomato spotted wilt virus (TSWV) infection in tobacco at an early stage. Spectra ranging from 400 to 1000 nm with 128 bands from inoculated and healthy tobacco plants were analyzed using three wavelength selection methods (successive projections algorithm (SPA), boosted regression tree (BRT), and genetic algorithm (GA)) and four machine learning (ML) techniques (boosted regression tree (BRT), support vector machine (SVM), random forest (RF), and classification and regression trees (CART)). The results indicated that the models built by the BRT algorithm using the wavelengths selected by SPA as the input variables obtained the best outcome for the 10-fold cross-validation, with a mean overall accuracy of 85.2% and an area under the receiver operating curve (AUC) of 0.932. The band selection results and the variable contribution analysis in BRT modeling jointly showed that the near-infrared (NIR) spectral region is informative and important for differentiating infected from healthy tobacco leaves. The post-inoculation period was divided into stages according to molecular identification and visual observation. The classification results at the different stages indicated that hyperspectral imaging data combined with ML methods and wavelength selection algorithms can be used for the early detection of TSWV in tobacco, both at the presymptomatic stage and during the period before the systemic infection can be detected by the molecular identification approach.

* Corresponding author at: Institute of Digital Agriculture, Zhejiang Academy of Agricultural Sciences, Hangzhou 310021, China. E-mail address: [email protected] (H. Hu).

https://doi.org/10.1016/j.compag.2019.105066 Received 4 March 2019; Received in revised form 20 August 2019; Accepted 18 October 2019 0168-1699/ © 2019 Published by Elsevier B.V.

1. Introduction

Tobacco is an important agricultural and economic crop, both in China and around the world. Notably, China grows approximately one-third of the world's tobacco crop (Hu et al., 2010). However, the quality as well as the output of tobacco can be strongly impacted by plant diseases and insect pests throughout the growing season (Zhu et al., 2017). Tomato spotted wilt virus (TSWV) is one of the most widespread and damaging plant viral pathogens, and can systemically infect many crops such as tomatoes, peppers, tobacco, zinnia, and lettuce (Krezhova et al., 2014). TSWV has become one of the most dangerous diseases for tobacco, affecting the cultivation of a wide range of tobacco crops and seriously constraining tobacco quality and yield worldwide (Mandal et al., 2007; McPherson et al., 2002). For example, in Georgia, the incidence of TSWV in flue-cured tobacco caused an average reduction in crop value of 41%, with an estimated annual economic loss of up to $19.4 million in 2007 (Mandal et al., 2007). The disease has been reported in most provinces of China, and is widely distributed in Yunnan Province, one of the most important tobacco producing regions in China.

Plant health monitoring and timely disease detection are crucial for effective disease control and crop management (Martinelli et al., 2015). The traditional crop disease detection and monitoring approaches mainly consist of empirical evaluation (i.e., visual surveys) and DNA-based and serological methods, such as polymerase chain reaction (PCR), flow cytometry (FCM), immunofluorescence (IF), and double antibody sandwich enzyme-linked immunosorbent assay (DAS-ELISA) (Fang and Ramasamy, 2015; Madufor et al., 2017). The empirical method is inefficient and unreliable, while the laboratory-based detection techniques are destructive, time-consuming, labor-intensive and costly (Martinelli et al., 2015). DNA-based and serological methods lack the capability to detect infection at the asymptomatic stage, especially with regard to systemically diffused pathogens (Martinelli et al., 2015). In addition, the above approaches, especially the laboratory-based methods, require highly trained professionals to carry out the sophisticated techniques. These deficiencies have directed research towards more effective alternative methods for detecting crop diseases at an early stage and over large field areas (Sankaran et al., 2010).

Hyperspectral imaging technology has grown significantly in the past two decades (Bioucas-Dias et al., 2013) and has been widely employed for non-destructively investigating biotic and abiotic stresses in crop plants across various spatial and temporal scales (Berger et al., 2018; Galvao et al., 2011; Mananze et al., 2018). Hyperspectral imaging integrates imaging and spectroscopy, so that the spatial and spectral information on an object can be acquired simultaneously (Bauriegel et al., 2011; Li et al., 2011). Disease infection leads to variations in the biophysical and biochemical characteristics of plants, e.g., tissue structure, intercellular space, transpiration rate, pigment content, and water content (Rumpf et al., 2010; Slaton et al., 2001). These changes may affect the spectral characteristics of plants, which can be captured by the hyperspectral platform (Zhu et al., 2016).

Similar research has been conducted by Zhu et al. (2017) and Krezhova et al. (2014). Krezhova et al. (2014) collected hyperspectral reflectance in the visible and near-infrared ranges to discriminate TSWV-infected tobacco leaves from healthy ones. The spectral data were acquired at 14 and 20 days post-inoculation (DPI), and statistical analysis methods were used to detect the development of TSWV infection in tobacco plants. The presence of TSWV was established at 14 DPI. Zhu et al. (2017) demonstrated that it is possible to detect tobacco mosaic virus (TMV) infection in tobacco plants at the presymptomatic stage using hyperspectral imaging. They compared different machine learning algorithms for classifying disease stages. In this study, hyperspectral image data were acquired at a very early stage (beginning from the first day after inoculation) and processed to demonstrate their potential for the early detection of TSWV infection in tobacco plants. We investigated the classification performances of different combinations of wavelength selection methods and machine learning classifiers. As a novel contribution, by employing the real-time polymerase chain reaction (RT-PCR), we compared the timeliness of the hyperspectral imaging technique with that of the molecular identification approach for TSWV infection detection.

High dimensionality and multi-collinearity frequently occur in hyperspectral data due to the large number of highly correlated spectral values within the dataset (Ng et al., 2019; Wei et al., 2017). Therefore, the selection of effective wavelengths (EWs) is essential in hyperspectral analysis to maximize the efficiency of data use and reduce computational complexity (Delalieux et al., 2007; Ng et al., 2019). Various approaches have been used to solve the multi-collinearity problem, such as principal component regression (Surhone et al., 2013), genetic algorithm (GA), successive projections algorithm (SPA) (Araújo et al., 2001; Xie et al., 2015; Zhu et al., 2017), and partial least squares regression (PLSR) models (Ng et al., 2019). GA has been used for feature selection in spectral data in many studies (Dou et al., 2015; Li et al., 2011; Ma et al., 2003). Xie et al. (2015) used SPA to select the most important wavelengths for identifying different diseases on tomato leaves using hyperspectral imaging. The wavelengths selected by SPA contained most of the valid information and played significant roles in the detection of diseases. Zhu et al. (2017) also adopted SPA for EW identification in their study of the presymptomatic detection of TMV infection using hyperspectral imaging. Both papers showed that SPA was an effective method for EW selection. ElMasry et al. (2007) identified optimal wavelengths using PLS models, and the prediction models based on the wavelengths selected by PLS achieved accuracies close to those of the models built on the full spectral range. In order to compare different feature selection methods for plant disease detection, we used three algorithms (GA, SPA, and BRT) to select wavebands from hyperspectral image data.

Numerous data mining techniques have been employed in previous studies for classification and prediction based on remote sensing data, including statistical analysis methods such as principal components analysis (PCA) and discriminant analysis (DA), and machine learning (ML) algorithms such as artificial neural networks (ANNs) (Were et al., 2015), support vector machine (SVM) (Were et al., 2015), classification and regression trees (CART) (Razi and Athappilly, 2005), and boosted regression tree (BRT) (Yang et al., 2016). Rumpf et al. (2010) presented an automatic method for the early detection of sugar beet diseases using SVM and hyperspectral reflectance. They correctly discriminated between diseased and healthy sugar beet plants with a classification accuracy of 97%. They also explored the potential of presymptomatic detection of different kinds of diseases on sugar beet, obtaining classification accuracies between 65% and 90%. Wang et al. (2008) adopted ANNs to predict late blight (LB) disease on tomatoes based on spectral reflectance. By comparing different network structures, they successfully predicted healthy and diseased tomato canopies, with correlation coefficients between predicted and measured values of 0.99 and 0.82 for field experiments and remotely sensed images, respectively, suggesting that an ANN with back-propagation training could be employed for the spectral detection of LB infections on tomato. Random forest (RF) and BRT are relatively new machine learning algorithms. Michez et al. (2016) used RF for forest health condition classification based on imagery from an unmanned aerial vehicle (UAV), and obtained good overall accuracies (over 90%).

Machine learning techniques have been successfully used in prior identification and classification studies and are promising as modeling tools for identifying disease in plants using hyperspectral image data. In order to compare different ML algorithms for plant disease identification, we selected four methods (CART, SVM, RF and BRT) for the early detection of TSWV infection in tobacco plants. More specifically, this paper has the following purposes: (1) to test the applicability of hyperspectral imaging to detect TSWV infection in tobacco plants at an early stage; (2) to identify the optimal predictive wavebands by using different wavelength selection methods, including GA, SPA and BRT; (3) to develop prediction models based on different machine learning techniques, including CART, SVM, RF and BRT; (4) to determine the best combination of band selection method and prediction model technique for the early detection of TSWV in tobacco; and (5) to compare the timeliness of the hyperspectral imaging technique and the molecular identification approach for TSWV infection detection.

2. Materials and methods

2.1. Experimental design

The experiment was performed at the Zhejiang Academy of Agricultural Sciences. A total of 80 tobacco plants (Nicotiana benthamiana) were grown in a climate chamber under environmentally controlled conditions (temperature 20–25 °C, humidity 50–70%) with a 12/12 h photoperiod. Among them, 40 plants were inoculated with TSWV at the 4–6 leaf stage, and the remaining 40 plants were used as controls. TSWV was inoculated on tobacco plants following previous studies (Krezhova et al., 2014; Zhu et al., 2016). The TSWV inoculum was prepared by grinding TSWV-infected leaves from diseased tobacco plants. One gram of infected tissue was ground with abrasive powder in 10 ml phosphate-buffered saline (pH 7.2). One expanded leaf per plant was then inoculated by rubbing the homogeneous inoculum on it with a brush. Twenty healthy plants and 20 inoculated plants were randomly selected per day and used for data collection over a period of eight consecutive days after inoculation. The selected tobacco plants were rapidly taken to the laboratory for hyperspectral image acquisition. Primers designed from the nucleotide sequence of the coat protein (CP) of TSWV were used for quantitative real-time polymerase chain reaction (RT-PCR) to detect TSWV RNAs accumulated within the inoculated leaves, and to determine on which day the infection can be detected by the molecular identification approach. Three repeat experiments were performed. The RT-PCR process was conducted as described in previous studies (Peng et al., 2011; Yan et al., 2012).

correction and spectral and spatial radiometric calibration. Wavelength calibration of the SOC710VP system is performed at the factory using monochromatic light sources and the results are used for spectral calibration. The calibrated reflectance (R) is calculated using the following equation:

$$R = \frac{R_{\mathrm{sample}} - R_{\mathrm{dark}}}{R_{\mathrm{background}} - R_{\mathrm{dark}}} \tag{1}$$

where R_sample is the measured reflectance, R_dark is the camera response obtained by turning off the light source and covering the lens with the opaque black cap, and R_background is the reflectance acquired from a high-reflectance white background. The associated electronics of the imaging system produce a level of electronic noise in the image, which needs to be removed. The calibration process was conducted in the SRAnalysis Toolkit software that came with the hardware, which can process the original data using the calibration files to produce spectrally and radiometrically calibrated images. All the corrected images were then processed using the Environment for Visualizing Images (ENVI) software (version 5.3, Research Systems Inc., Boulder, CO, USA) to extract the spectral information. The infected leaves inoculated with TSWV, as well as the control leaves for the healthy samples, were regarded as the regions of interest (ROIs) and were manually selected from the corrected images in ENVI. The average spectral reflectance in the ROI was calculated to represent the plant. The veins in the leaves were not excluded from the subsequent analyses because they made up very small portions of the leaves and were not expected to affect the results.
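As an illustration of Eq. (1) and of the ROI averaging step, the following is a minimal sketch in plain Python. It is not the SRAnalysis Toolkit or ENVI workflow used in the study; the function names and the toy numbers are assumptions made only for this example.

```python
from typing import List, Sequence


def calibrate_spectrum(sample: Sequence[float],
                       dark: Sequence[float],
                       white: Sequence[float]) -> List[float]:
    """Apply Eq. (1) band by band: R = (sample - dark) / (white - dark)."""
    if not (len(sample) == len(dark) == len(white)):
        raise ValueError("all spectra must have the same number of bands")
    calibrated = []
    for s, d, w in zip(sample, dark, white):
        span = w - d
        # Guard against a dead band where white and dark responses coincide.
        calibrated.append((s - d) / span if span != 0 else float("nan"))
    return calibrated


def mean_roi_spectrum(pixel_spectra: Sequence[Sequence[float]]) -> List[float]:
    """Average calibrated spectra over all ROI pixels, one value per band."""
    n_pixels = len(pixel_spectra)
    n_bands = len(pixel_spectra[0])
    return [sum(p[b] for p in pixel_spectra) / n_pixels for b in range(n_bands)]


# Toy example with three bands and two ROI pixels (made-up numbers).
dark = [10.0, 10.0, 10.0]
white = [210.0, 410.0, 110.0]
roi_raw = [[60.0, 210.0, 60.0], [110.0, 110.0, 35.0]]
roi_cal = [calibrate_spectrum(p, dark, white) for p in roi_raw]
leaf_spectrum = mean_roi_spectrum(roi_cal)
print(leaf_spectrum)
```

In this sketch each ROI pixel is calibrated first and the mean is taken afterwards; averaging the raw responses first and then calibrating gives the same result when the dark and white references are the same for all pixels.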

2.2. Hyperspectral image acquisition

A hyperspectral push broom spectral camera SOC710VP (Surface Optics Corporation, San Diego, CA) was employed to acquire the hyperspectral images. The camera collects 128 bands in the range from 380 to 1040 nm. The optimized focal length of the objective lens was 35 mm with a maximum aperture of F1.4. The primary specifications of the SOC710VP include lens type (C-Mount), dynamic range (12 bit), 128 bands (spectral) by 696 pixels (spatial), and sweep speed (30 row/s, 23.2 s/cube). The camera system was operated in a dark room, and two 150 W tungsten halogen lamps were mounted on the two sides of the camera to illuminate the sample stage. The camera was fixed 50 cm above the sample stage where the tobacco plant was placed for image acquisition. Images were acquired in a dark room with ambient conditions of 20 °C and 50–60% humidity for a period of seven days after inoculation. The hyperspectral imaging system was operated from a PC running the dedicated software HyperScanner for spectral image scanning, binning, and motor control. Fig. 1 shows the configuration of the hyperspectral imaging system. The quality of hyperspectral imaging data is closely associated with the measuring setup (Behmann et al., 2016). The angles and distances of the light sources, camera and measured object have a considerable impact on the gathered spectral imaging data. The observation angle and the distance of the light sources were kept consistent to reduce measurement inaccuracies. The angles of the infected and control leaves were adjusted to avoid directly facing the light sources, in order to reduce specular reflection, which can cause variations in the spectral data.

2.4. Effective wavelength (EW) selection

Hyperspectral images are high-dimensional data with a large amount of redundant information, which causes instability when they are directly used for classification (Elmasry et al., 2012; Peng et al., 2011). Selecting a smaller number of effective bands to represent the full wavelength range is widely used to improve model stability and computational efficiency (Zhang et al., 2013). The EWs carry the most important information for discriminating infected tobacco leaves from healthy samples, and can be used directly as inputs to machine learning models (Liu et al., 2014). Three feature selection methods (GA, SPA, and BRT) were employed to obtain effective wavelengths. The GA and SPA algorithms were run in the MATLAB software (version 8.0, The MathWorks, Natick, USA). The BRT algorithm was carried out in R statistical software (version 3.4.2, R Development Core Team, 2017)

2.3. Hyperspectral image processing

The hyperspectral camera system was fully calibrated at the factory and calibration files were provided along with the system installation. Hyperspectral image cubes saved from the imager are calibrated in a three-step process, including spectral calibration, dark level offset

Fig. 1. Configuration of the hyperspectral imaging system.

model accuracy were used in the final model.

using the gbm.simplify function which is an extended function for the gbm library. The theory and algorithm of BRT are presented in Section 2.5.4.

2.5.3. Random forest (RF)

RF is an ensemble classification algorithm based on decision trees, first proposed by Breiman (2001). RF introduces randomness and diversity by randomly selecting subsets of variables and using each subset to create a tree. Unlike the CART algorithm, which searches for the best threshold across all variables, RF draws the variables used for splitting each node from a random subset of the variables (Rodrigues and de la Riva, 2014). The RF models were built with the R package randomForest (version 4.6-12). The optimal settings for ntree (the number of trees) and mtry (the number of random predictors) were determined by repeated tests, while node size (the minimum number of observations at end nodes) was kept at its default setting. All combinations of nine ntree values (from 1000 to 5000 at intervals of 500) and seven mtry values (from 1 to 7) were tested, and the settings yielding the best predictive accuracy were used in the final model.
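The exhaustive search over ntree and mtry described above can be sketched as follows. This is only an outline in Python, not the R code used in the study: `toy_score` is a hypothetical stand-in for the cross-validated accuracy that a real run would compute by fitting a forest for each setting.

```python
from itertools import product
from typing import Callable, List, Tuple

# Candidate values as described in the text above.
NTREE_VALUES = list(range(1000, 5001, 500))  # 1000, 1500, ..., 5000 (nine values)
MTRY_VALUES = list(range(1, 8))              # 1, 2, ..., 7 (seven values)


def build_grid() -> List[Tuple[int, int]]:
    """All (ntree, mtry) combinations to be evaluated."""
    return list(product(NTREE_VALUES, MTRY_VALUES))


def pick_best(grid: List[Tuple[int, int]],
              score: Callable[[int, int], float]) -> Tuple[int, int]:
    """Return the setting with the highest score (e.g. mean CV accuracy)."""
    return max(grid, key=lambda params: score(*params))


def toy_score(ntree: int, mtry: int) -> float:
    # Hypothetical placeholder: peaks at ntree=2500, mtry=3.
    # A real score would fit and cross-validate a model for each setting.
    return -(abs(ntree - 2500) / 500 + abs(mtry - 3))


grid = build_grid()
best_ntree, best_mtry = pick_best(grid, toy_score)
print(len(grid), best_ntree, best_mtry)
```

The same pattern applies to the other classifiers in this section; only the parameter names and candidate values change.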

2.4.1. Successive projections algorithm (SPA)

SPA is a forward selection technique that can be used to minimize collinearity and eliminate redundant information in the original spectral matrix (Araújo et al., 2001). The algorithm operates in a vector space to find a set of variables that is minimally redundant (Wu et al., 2013). SPA is a deterministic search algorithm, so its results are reproducible. The detailed SPA process is described by Araújo et al. (2001).

2.4.2. Genetic algorithm (GA)

GA was developed based on natural selection theory and genetics (Dou et al., 2015), and has been successfully used in many variable optimization studies. GA is robust for handling large data without a priori knowledge of how to deal with the problem (Roghanian and Pazhoheshfar, 2014). The basic principle of the genetic algorithm is described by Konak et al. (2006). The Genetic Algorithm and Direct Search Toolbox was employed to develop the optimization model for spectral band simplification in this study. In the GA process, a Gaussian distribution was used for mutation, with the scale and shrink parameters set to 0.5 and 0.75, respectively.
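The projection step at the core of SPA (Section 2.4.1) can be sketched in a few lines. The code below is a simplified Python illustration, not the MATLAB implementation used in the study: it assumes the starting band and the number of bands are given, whereas the full algorithm of Araújo et al. (2001) also chooses them by validation. The function name and the toy data are assumptions.

```python
from typing import List


def spa_select(columns: List[List[float]], start: int, n_select: int) -> List[int]:
    """Return indices of n_select minimally collinear columns, starting at `start`.

    columns[j] holds the values of variable (band) j over all samples.
    """
    if n_select > len(columns):
        raise ValueError("cannot select more variables than are available")
    # Work on copies because the columns are overwritten by their projections.
    work = [list(col) for col in columns]
    selected = [start]
    while len(selected) < n_select:
        ref = work[selected[-1]]
        ref_norm2 = sum(v * v for v in ref)
        if ref_norm2 == 0:
            break  # remaining columns carry no independent information
        best_j, best_norm2 = -1, -1.0
        for j, col in enumerate(work):
            if j in selected:
                continue
            # Project column j onto the subspace orthogonal to the reference.
            coef = sum(a * b for a, b in zip(col, ref)) / ref_norm2
            work[j] = [a - coef * b for a, b in zip(col, ref)]
            norm2 = sum(v * v for v in work[j])
            if norm2 > best_norm2:
                best_j, best_norm2 = j, norm2
        selected.append(best_j)
    return selected


# Toy data: band 1 is an exact multiple of band 0 and should not be picked next.
bands = [
    [1.0, 0.0, 0.0],  # band 0 (starting band)
    [2.0, 0.0, 0.0],  # band 1, collinear with band 0
    [0.0, 1.0, 0.0],  # band 2
    [1.0, 2.0, 0.0],  # band 3, largest component orthogonal to band 0
]
chosen = spa_select(bands, start=0, n_select=2)
print(chosen)
```

At each step the band with the largest norm after projection is the one least explained by the bands already chosen, which is how the method keeps redundancy low.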

2.5.4. Boosted regression tree (BRT)

BRT is a self-learning method developed by integrating two important statistical algorithms: CART and the boosting algorithm (Elith et al., 2008). BRT aims to improve on the stability and accuracy of a single model by fitting a number of models and combining them for a better prediction. By combining many simple models in a stage-wise manner, the boosting algorithm markedly improves the predictive performance of regression trees (Rodrigues and de la Riva, 2014). BRT is invariant to monotonic transformations of the variables and insensitive to outliers, because outliers are isolated in a node and do not influence the splitting of the succeeding trees (Salazar et al., 2015). BRT models were built with the gbm package (version 2.1.3), facilitated by a set of extended functions (Elith et al., 2008; Ridgeway, 2007). The most effective settings for learning rate (LR, the contribution of every single tree) and bag fraction (BF, the proportion of data used for model fitting) were determined by trial and error, namely by testing all combinations of LR (0.01, 0.005, 0.001, and 0.0005) and BF (0.60, 0.65, 0.70, and 0.75). The combination of LR and BF generating the highest predictive accuracy was used as the optimal setting. The tree complexity (the number of nodes in each tree) was set to 3. The gbm.simplify function was used to simplify the fitted models by progressively removing variables from the original models, using predictive error to determine the number of variables that could be removed without reducing the model accuracy. After the simplification, the most effective predictors were used to build the models, and the contribution of each predictor was calculated to identify the most important variables.
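To make the roles of the learning rate and bag fraction concrete, here is a minimal stage-wise boosting sketch in Python. It is a deliberate simplification and not the gbm implementation: it uses one-split trees (stumps) on a single feature and a squared-error loss, and the data are made up for the example.

```python
import random
from typing import List, Tuple

Stump = Tuple[float, float, float]  # (threshold, value_left, value_right)


def fit_stump(x: List[float], residual: List[float]) -> Stump:
    """Least-squares regression stump on a single feature."""
    best, best_sse = None, float("inf")
    for threshold in sorted(set(x)):
        left = [r for xi, r in zip(x, residual) if xi <= threshold]
        right = [r for xi, r in zip(x, residual) if xi > threshold]
        if not left or not right:
            continue
        mean_l, mean_r = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - mean_l) ** 2 for r in left)
               + sum((r - mean_r) ** 2 for r in right))
        if sse < best_sse:
            best_sse, best = sse, (threshold, mean_l, mean_r)
    if best is None:  # all x identical: fall back to a constant prediction
        mean = sum(residual) / len(residual)
        best = (x[0], mean, mean)
    return best


def predict_stump(stump: Stump, xi: float) -> float:
    threshold, value_left, value_right = stump
    return value_left if xi <= threshold else value_right


def boost(x: List[float], y: List[float], n_trees: int,
          learning_rate: float, bag_fraction: float, seed: int = 0) -> List[float]:
    """Return fitted values after n_trees boosting stages."""
    rng = random.Random(seed)
    n = len(y)
    n_bag = min(n, max(1, int(bag_fraction * n)))
    pred = [sum(y) / n] * n  # start from the overall mean
    for _ in range(n_trees):
        # Each stage is fitted to the residuals of a random subsample (the bag).
        idx = rng.sample(range(n), n_bag)
        stump = fit_stump([x[i] for i in idx], [y[i] - pred[i] for i in idx])
        # The learning rate shrinks the contribution of every single tree.
        pred = [p + learning_rate * predict_stump(stump, xi)
                for p, xi in zip(pred, x)]
    return pred


def mse(y: List[float], pred: List[float]) -> float:
    return sum((a - b) ** 2 for a, b in zip(y, pred)) / len(y)


# Made-up data: one reflectance-like feature, label 1 = inoculated, 0 = healthy.
x = [0.30, 0.32, 0.35, 0.36, 0.50, 0.52, 0.55, 0.58]
y = [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
fitted = boost(x, y, n_trees=100, learning_rate=0.1, bag_fraction=0.75)
print(round(mse(y, fitted), 4))
```

A smaller learning rate needs more trees to reach the same fit, which is why the two settings are usually tuned together.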

2.5. Modeling techniques and performance

Four machine learning techniques (SVM, CART, RF, and BRT) were applied and compared for the early detection of TSWV in tobacco. The models were fitted and validated in R statistical software (version 3.4.2; R Development Core Team, 2017). A total of 320 samples were used for modeling, of which 160 were healthy samples and 160 were inoculated samples.

2.5.1. Support vector machine (SVM)

Developed by combining statistical learning theory and the structural risk minimization principle, SVM aims to classify samples into groups by creating an optimal hyperplane. It is suited to small-sample learning problems as well as nonlinear and high-dimensional datasets (Hu et al., 2013). SVM has been widely used in classification and regression studies; its detailed theory and description can be found in previous studies (Liu et al., 2006; Rumpf et al., 2010). In this study, SVM models were built with the R package e1071 (version 1.6-8). The radial basis function (RBF), which has been successfully employed in many studies, was used as the kernel function. Two parameters (cost and gamma) were set according to the model performances obtained by testing different combinations of the settings. The settings generating the highest validation accuracy were used as the optimal parameters.
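The effect of the gamma parameter can be seen directly from the RBF kernel, k(x, z) = exp(-gamma * ||x - z||^2). The short Python sketch below is illustrative only: the reflectance vectors and the candidate values for cost and gamma are hypothetical, since the text does not list the values that were tested.

```python
import math
from typing import Sequence


def rbf_kernel(x: Sequence[float], z: Sequence[float], gamma: float) -> float:
    """RBF kernel k(x, z) = exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


# Hypothetical reflectance vectors at three selected wavelengths.
healthy = [0.05, 0.45, 0.50]
infected = [0.06, 0.40, 0.42]

# Hypothetical candidate grid (assumed values, not taken from the study).
cost_grid = [0.1, 1.0, 10.0, 100.0]
gamma_grid = [0.01, 0.1, 1.0, 10.0]
candidates = [(c, g) for c in cost_grid for g in gamma_grid]

for gamma in gamma_grid:
    similarity = rbf_kernel(healthy, infected, gamma)
    print(f"gamma={gamma:<5} k(healthy, infected)={similarity:.4f}")
```

A larger gamma makes the kernel value fall off faster with distance, so the decision boundary becomes more local; cost controls how strongly misclassified training samples are penalized.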

2.5.5. Model evaluation and validation

The model performance was evaluated using 10-fold cross-validation (Rumpf et al., 2010) with several metrics: overall accuracy (OA), Kappa, correlation coefficient (r), and area under the receiver operating curve (AUC) (Zhi et al., 2017). Due to the limited number of samples, we did not split the dataset into a training set and a test set for model fitting and validation, respectively. Instead, we chose cross-validation for model evaluation. It splits all samples into a pre-set number (N) of groups, of which N-1 groups are used to fit a model and the remaining group is used for validation. This fitting and validation process is repeated N times, so that each group serves once as the validation group. The averaged values of the evaluation metrics were recorded to describe the model performances. N is frequently set to 10, giving 10-fold cross-validation. AUC is used to evaluate the predictive performance of a binary prediction model (Park et al., 2004). An AUC value of 0.5 indicates that the prediction is completely random. The closer AUC is to 1, the better the performance of the model, and a value of 1 indicates perfect prediction. Generally, the model can be regarded as useful if AUC is greater than 0.7, and AUC

2.5.2. Classification and regression trees (CART)

The CART algorithm is an effective non-parametric classification and regression technique that has been widely applied in various fields as a data mining tool (Gu et al., 2014). CART creates predictive criteria by recursively generating a binary tree, which can be easily displayed, interpreted and applied. The dataset is split into a number of groups according to the maximal values of some measure, e.g., the Gini index, reflecting the homogeneity of the two nodes. All variables are tested to search for the optimal threshold for splitting the nodes. CART was implemented in R using the rpart package (version 4.1-12). The parameters cost-complexity (a pruning factor reflecting the tradeoff between prediction accuracy and tree size) and minsplit (the minimum sample size assigned to a terminal node) considerably affect the model performance. Different combinations of cost-complexity levels (0.001, 0.005, 0.01, 0.05, 0.1) and minsplit levels (10, 20, 30, 40, 50) were tested and the settings generating the highest


Fig. 2. Flowchart describing the main steps of the early detection of TSWV infection in tobacco using hyperspectral imaging technique and machine learning algorithms.

larger than 0.9 indicates excellent model performance. The models were fitted and evaluated on a computer with the following specifications: operating system, Microsoft Windows 10 (64 bit); CPU, Intel(R) Core(TM) i5-5250U (1.60 GHz, dual-core); RAM, 8.00 GB. The training time (TT) of the models was also recorded to assess the model performance. The main steps of this study are shown in the flowchart in Fig. 2.

3. Results and discussion

3.1. Disease development

The tobacco plants without inoculation kept growing healthily during the experiment period. For the infected plants, after a latent period of five days, the disease symptoms of TSWV started to appear at 6 DPI (Fig. 3). The small spots visible on the inoculated leaves expanded rapidly from 7 DPI and formed large necrotic areas at 8 DPI. Molecular identification results for the TSWV-infected tobacco leaves are presented in Fig. 4. TSWV coat protein was detected by RT-PCR, indicating systemic infection with TSWV. TSWV RNAs started to accumulate in the inoculated tobacco leaves and could be detected at 5 DPI. This indicates that the visible disease symptoms of TSWV tend to appear shortly after the TSWV RNAs have accumulated within the infected plant to a level that can be detected by RT-PCR.

3.2. Spectral characteristics of healthy and diseased tobacco leaves

The reflectance spectral curves of all samples are shown in Fig. 5. There was considerable noise at the beginning and end of the curves, so only the range between 400 and 1000 nm (117 bands) was used for further analysis. Fig. 5 shows the mean reflectance spectra for tobacco leaves collected from the ROIs of healthy and diseased tobacco plants. The average spectra of the ROIs representing healthy and inoculated leaves at different ages (1–8 days after inoculation) are also illustrated. The general trend of the spectral curves was close to that of other green plants, such as wheat (Tanaka et al., 2015), soybean (Mercante et al., 2009), maize (Zhang, 2013) and grapevine (Naidu et al., 2009). There was a reflectance peak near 560 nm and two absorption valleys centered near 450 nm and 670 nm. The reflectance between 755 nm and 900 nm was relatively high compared with the other parts of the spectral curve. According to many plant spectroscopy studies, the wavelength range between 495 and 680 nm is closely related to the photosynthetic capacity, and may contain the pigment information of the tobacco leaves (Zhu et al., 2017). The peak near 560 nm corresponded to the green reflection region, and the valley near 690 nm was related to red absorption by carotenoids and chlorophyll pigments (Zhu et al., 2016).

At 4 and 6 DPI, the reduction in the spectrum near the red edge at 720–780 nm hints at damage to the plant cellular structure, which is expected during virus infection. However, this trend cannot be observed in the data for 5, 7 and 8 DPI. This may be explained by individual differences between the plants randomly chosen on different days. Furthermore, the sharp drop-off of the spectral signature between 900 and 1000 nm was uncharacteristic of the spectral signatures of healthy plants. This might be due to low illumination from the halogen lamps at these wavelengths on angled leaves compared with the background reference. No obvious distinctions were observed between the two kinds of leaves at 1 and 2 DPI, indicating that it is difficult to distinguish the diseased leaves at very early stages. The red curve of 3 DPI displayed a higher level in the region around 750–1000 nm. This likely resulted from the stress response to the plant–virus interactions. However, from 4 to 8 DPI, the reflectance of inoculated leaves over the entire range decreased day by day, which was contrary to the expectation that there might be an increase in reflection at 690 nm due to discoloration and tissue damage. This disparity may be explained by the differences between the randomly chosen plants. On the other hand, the reflectance of healthy leaves changed little during 1–4 DPI, and appeared to increase slightly over the spectral region of 720–1000 nm from 5 DPI.

Fig. 3. Development of the symptoms of TSWV disease on infected tobacco leaves. The visual symptom started to appear on the inoculated leaves at 6 DPI, and expanded rapidly at 7 and 8 DPI.

Fig. 4. (a) RT-PCR detection of TSWV in the infected tobacco leaves (lanes #2, 3 and 4) and control samples (lane #1) from 3 DPI to 8 DPI. Three samples from the inoculated leaves and one sample from the non-inoculated leaves were collected for detection per day. The TSWV infection was detected starting from 5 DPI, as shown by the white bands. (b) The expressions of reference genes (actin) used as control for quantitative calibration of target gene expression. The normal expressions of actin indicate that the experiment was properly conducted and the results in (a) are reliable. Lane M displays the molecular weight markers showing the sizes of some bands.

Fig. 5. Spectral curves extracted from the ROI pixels of the hyperspectral image representing healthy and diseased tobacco leaves.

3.4. Identification of healthy and diseased tobacco leaves using machine learning algorithms

Four machine learning algorithms were used to classify the inoculated and normal tobacco samples with the information of EWs selected by GA, SPA and BRT, respectively. It is significant for machine learning modeling to preliminarily set the optimal parameters, especially for small datasets, because the model performance can be influenced by parameter setting (Zhi et al., 2017). Table 1 shows the model performances based on different combinations of parameters as well as the optimal parameters settings. The mean values of overall accuracy (%) over 50 iterations of 10-fold cross-validations for each combination of parameters were calculated for analysis. The optimal parameter setting was determined by the maximum mean value of classification accuracy. Low coefficient of variation (CV) of overall accuracy indicated that the models were stable when subjected to different parameter settings. The results suggested that the RF and BRT models with lower CV values were more robust than the other two algorithms. A total of 12 models were produced using the optimal parameter settings. The overall accuracy, Kappa coefficient, correlation coefficient, AUC, and TT of developed models are shown in Table 2. The different combinations of machine learning algorithms and band selection methods were assessed and compared. The mean overall accuracies of the models for different combinations ranged from 65.3% to 85.2%, with the Kappa from 0.5 to 0.679 and AUC from 0.629 to 0.932. BRT and RF models showed the best performances, followed by the SVM models, and CART models achieved the worst performances. Comparing the model performances for different sets of EWs, models based on the EWs selected by SPA achieved the best performance for BRT and CART models, while models developed using EWs from BRT achieved the best performance for the RF and SVM models. 
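The stability criterion described above, the coefficient of variation (CV) of overall accuracy across parameter settings, can be sketched as follows. This is a minimal illustration only: the accuracy lists are made-up placeholders rather than values from Table 1, the function name is hypothetical, and the sample standard deviation is an assumption because the text does not state which estimator was used.

```python
from statistics import mean, stdev


def coefficient_of_variation(accuracies):
    """Return the CV in percent: 100 * sample standard deviation / mean."""
    return 100.0 * stdev(accuracies) / mean(accuracies)


# Hypothetical mean accuracies (%) obtained with five parameter settings
# for two models (placeholder numbers, not taken from the paper).
stable_model = [83.2, 84.0, 84.3, 84.6, 85.2]
unstable_model = [65.4, 67.9, 68.8, 70.1, 72.8]

# The optimal setting is the one with the maximum mean accuracy.
best_setting = max(range(len(stable_model)), key=stable_model.__getitem__)

print(f"CV stable:   {coefficient_of_variation(stable_model):.2f}%")
print(f"CV unstable: {coefficient_of_variation(unstable_model):.2f}%")
print(f"index of best setting for the stable model: {best_setting}")
```

A lower CV means that accuracy changes little when the parameters change, which is the sense in which RF and BRT are described as more robust.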
Overall, the models built by the BRT method using the EWs selected by SPA as the input variables obtained the best performance for the 10-fold cross-validation, with a mean overall accuracy of 85.2% and an AUC of 0.932. In addition, these models can be regarded as quick models, because their training can be finished in seconds. Fig. 7 shows the contributions of variables for the BRT models. The wavelengths of 839.55 nm, 828.90 nm and 855.56 nm were the most important contributors among the variables selected by SPA, GA and BRT, respectively. The two most important EWs in each of the three BRT models were all from the 780–1000 nm region, indicating that the NIR wavelengths had a strong impact on model performance. This suggests that the NIR bands from these three feature selection algorithms are significant for the classification of healthy and infected tobacco leaves by the BRT model. The results indicate that TSWV infection can be identified at an early stage using machine learning techniques with the reflectance information extracted from hyperspectral imaging. The comparison of the different combinations of EW selection algorithms and machine learning approaches showed that the BRT model using EWs selected through the SPA method (BRT-SPA model) achieved the highest classification accuracy, indicating that this combination is feasible and promising for the detection of TSWV-infected tobacco leaves. Fig. 8 shows the response curves of the selected variables in the BRT-SPA model.
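For readers less familiar with the metrics reported in Table 2, the sketch below computes overall accuracy, Cohen's Kappa and AUC for a binary healthy/infected problem. It is a generic illustration under stated assumptions: the labels and scores are invented, the function names are hypothetical, and the AUC uses the pairwise-ranking (Mann–Whitney) formulation, which may differ from the software the authors used.

```python
def overall_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e).

    Undefined (division by zero) when chance agreement p_e equals 1.
    """
    n = len(y_true)
    p_observed = overall_accuracy(y_true, y_pred)
    p_expected = sum(
        (y_true.count(c) / n) * (y_pred.count(c) / n)
        for c in set(y_true) | set(y_pred)
    )
    return (p_observed - p_expected) / (1.0 - p_expected)


def auc(y_true, scores):
    """Probability that a random positive scores higher than a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Invented example: 1 = infected, 0 = healthy; scores are predicted probabilities.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print(overall_accuracy(y_true, y_pred), cohen_kappa(y_true, y_pred), auc(y_true, scores))
```

Accuracy and Kappa depend on the 0.5 decision threshold chosen here, whereas AUC depends only on how the scores rank the samples.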

3.3. Selection of effective wavelength

Six bands (554.07, 574.31, 760.16, 791.78, 839.55, 936.33 nm) were selected by SPA, six bands (620.16, 666.41, 718.29, 754.91, 828.90, 920.08 nm) by GA, and eight bands (449.13, 549.02, 564.18, 615.04, 812.96, 855.56, 882.36, 952.62 nm) by the BRT algorithm (Fig. 6). The EW at 449.13 nm might reflect the influence of carotenoids, which absorb strongly between 400 and 500 nm. The EWs between 500 and 600 nm were selected because of the reduced absorption in the green band. The EWs between 600 and 670 nm are related to the influence of chlorophyll, because these pigments strongly absorb red light. The remaining EWs may be assigned to the near-infrared (NIR) spectra of the first and second overtones of the O-H stretching region. The number of variables was reduced by more than 93% through the wavelength selection, indicating that the methods used were quite effective for selecting the relevant wavelengths.
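The "more than 93%" reduction can be checked with simple arithmetic. The sketch below uses the band lists given above and the total of 128 bands stated in the abstract; the variable and function names are illustrative only.

```python
TOTAL_BANDS = 128  # bands in the 400–1000 nm spectra, as stated in the abstract

# Effective wavelengths (nm) reported for each selection method.
selected = {
    "SPA": [554.07, 574.31, 760.16, 791.78, 839.55, 936.33],
    "GA": [620.16, 666.41, 718.29, 754.91, 828.90, 920.08],
    "BRT": [449.13, 549.02, 564.18, 615.04, 812.96, 855.56, 882.36, 952.62],
}


def reduction_percent(n_selected, n_total=TOTAL_BANDS):
    """Percentage of the original bands removed by wavelength selection."""
    return 100.0 * (1.0 - n_selected / n_total)


reductions = {method: reduction_percent(len(bands)) for method, bands in selected.items()}

for method, value in reductions.items():
    print(f"{method}: {len(selected[method])} of {TOTAL_BANDS} bands kept, {value:.2f}% reduction")
```

The smallest reduction belongs to the eight-band BRT selection, which is where the lower bound of 93% comes from.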

3.5. TSWV detection at different stages of post-inoculation

To discriminate between healthy and infected samples at different periods, the stages of post-inoculation first need to be defined. Molecular identification was used to split the study period at the day on which TSWV could first be detected in the leaf. Visual observation was used to split the period at the day on which the symptom could first be seen on the leaf. TSWV RNAs started to accumulate in the inoculated tobacco leaves at 5 DPI (Fig. 4). Through the visual

Fig. 6. Effective wavelengths selected by GA, SPA and BRT.


Table 1 Descriptive statistics of the averaged overall accuracy (%) of cross-validation over 50 iterations for each combination of parameters.

| Model | EW selection method | Optimal parameters | Minimum | Mean | Median | Maximum | CV (%) |
|---|---|---|---|---|---|---|---|
| SVM | GA | cost: 30, gamma: 0.02 | 65.4 | 68.8 | 68.9 | 72.8 | 4.3 |
| SVM | SPA | cost: 10, gamma: 0.05 | 67.1 | 71.3 | 71.1 | 75.3 | 4.5 |
| SVM | BRT | cost: 45, gamma: 0.008 | 69.2 | 73.3 | 73.5 | 77.6 | 4.6 |
| CART | GA | cost-complexity: 0.01, minsplit levels: 30 | 55.5 | 59.6 | 59.3 | 66.5 | 5.6 |
| CART | SPA | cost-complexity: 0.01, minsplit levels: 20 | 62.3 | 68.8 | 69.3 | 72.4 | 4.8 |
| CART | BRT | cost-complexity: 0.005, minsplit levels: 40 | 54.7 | 60.3 | 60.8 | 65.3 | 5.2 |
| RF | GA | ntree: 3000, mtry: 4 | 74.7 | 76.5 | 76.6 | 78.2 | 1.3 |
| RF | SPA | ntree: 4500, mtry: 6 | 78.1 | 79.3 | 79.5 | 80.5 | 1.0 |
| RF | BRT | ntree: 2500, mtry: 3 | 77.5 | 79.7 | 79.9 | 81.9 | 1.2 |
| BRT | GA | BF: 0.65, LR: 0.05 | 77.1 | 78.1 | 78.2 | 79.3 | 0.9 |
| BRT | SPA | BF: 0.7, LR: 0.005 | 83.2 | 84.1 | 84.2 | 85.2 | 0.7 |
| BRT | BRT | BF: 0.65, LR: 0.001 | 78.9 | 79.8 | 79.7 | 80.7 | 0.6 |

BRT method using EWs selected by SPA. The high mean classification accuracies for Stage 2 and Stage B can be explained by the large differences in spectral characteristics between healthy samples and inoculated leaves in the symptomatic period. The symptom that appeared on the inoculated leaves led to significant variations in reflectance from 6 DPI, and thus the accuracy for Stage B was much higher than that for the other three cases. The model performances for Stage 1 and Stage A are acceptable, with mean overall accuracies of 68.2% and 74.3%, respectively. Although these are lower than those for Stage 2 and Stage B, the results showed that hyperspectral imaging combined with machine learning algorithms and feature selection methods is applicable and promising for detecting the tobacco disease at different stages of post-infection.
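The two ways of splitting the eight-day study period into stages can be written down explicitly. The sketch below encodes the stage boundaries given in this section (Stage 1: 1–4 DPI, Stage 2: 5–8 DPI, Stage A: 1–5 DPI, Stage B: 6–8 DPI); the function names are hypothetical and only serve the illustration.

```python
def stage_by_molecular_detection(dpi):
    """Split at the first day TSWV RNAs were detected (5 DPI)."""
    if not 1 <= dpi <= 8:
        raise ValueError("the study period covers 1 to 8 DPI")
    return "Stage 1" if dpi <= 4 else "Stage 2"


def stage_by_visual_symptoms(dpi):
    """Split at the first day symptoms were visible (6 DPI)."""
    if not 1 <= dpi <= 8:
        raise ValueError("the study period covers 1 to 8 DPI")
    return "Stage A" if dpi <= 5 else "Stage B"


for dpi in range(1, 9):
    print(dpi, stage_by_molecular_detection(dpi), stage_by_visual_symptoms(dpi))

# Days on which the virus is detectable by RT-PCR but not yet visible.
presymptomatic_detectable = [
    d for d in range(1, 9)
    if stage_by_molecular_detection(d) == "Stage 2" and stage_by_visual_symptoms(d) == "Stage A"
]
```

Under these boundaries only 5 DPI falls into both Stage 2 and Stage A, which is the single day on which RT-PCR detects the virus before symptoms are visible.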

Table 2 Model performance averaged over 50 iterations of 10-fold cross-validations of the models for the identification of healthy and diseased tobacco leaves.

| EWs | Model performance | SVM | CART | RF | BRT |
|---|---|---|---|---|---|
| GA | Overall accuracy (%) | 72.8 | 66.5 | 78.2 | 79.3 |
| GA | Kappa | 0.567 | 0.511 | 0.616 | 0.653 |
| GA | r | 0.529 | 0.416 | 0.627 | 0.700 |
| GA | AUC | 0.712 | 0.642 | 0.813 | 0.893 |
| GA | TT (s) | 0.2 | 3.4 | 2.5 | 3 |
| SPA | Overall accuracy (%) | 75.3 | 72.4 | 80.5 | 85.2 |
| SPA | Kappa | 0.590 | 0.564 | 0.664 | 0.679 |
| SPA | r | 0.574 | 0.522 | 0.722 | 0.753 |
| SPA | AUC | 0.757 | 0.708 | 0.902 | 0.932 |
| SPA | TT (s) | 0.2 | 3.6 | 2.5 | 3.2 |
| BRT | Overall accuracy (%) | 77.6 | 65.3 | 81.9 | 80.7 |
| BRT | Kappa | 0.610 | 0.500 | 0.646 | 0.638 |
| BRT | r | 0.616 | 0.394 | 0.688 | 0.672 |
| BRT | AUC | 0.787 | 0.629 | 0.885 | 0.853 |
| BRT | TT (s) | 0.2 | 3.6 | 2.6 | 3.1 |

3.6. Methodological prospects

This study demonstrated the feasibility of the hyperspectral imaging technique combined with feature selection methods and machine learning algorithms for non-destructively detecting TSWV infection in tobacco plants at an early stage (within 8 DPI). The wavelength selection methods successfully selected the EWs which carried important discriminating information and can be directly used by classifiers, reducing the total number of bands by more than 93%. The selected EWs can not only decrease the data redundancy and simplify the models but also be applied to design instrumental sensors for plant disease detection (Xie et al., 2015). It is noteworthy that most of the EWs selected were located in the NIR region (780–1000 nm) of the reflectance spectra, and the NIR EWs made the most important contributions in the BRT models, suggesting that the NIR region carried more significant information for the differentiation of TSWV-infected and healthy tobacco leaves than the visible range. This is in accordance with some previous studies, in which NIR wavelengths played a prominent role in classifying diseased and healthy plants. Zhang et al.

observation, the typical symptom of TSWV appeared on the inoculated leaves of tobacco plants at 6 DPI (Fig. 3). Based on the results of molecular detection and visual survey, we segmented the eight consecutive days into two stages in two ways. According to the day on which the TSWV RNAs were detected, the study period was divided into Stage 1 (1–4 DPI) and Stage 2 (5–8 DPI). According to the day on which the symptom occurred, the eight days were divided into Stage A (1–5 DPI) and Stage B (6–8 DPI). Then, the BRT technique was used to discriminate between healthy samples and samples that had undergone different durations of inoculation, using the EWs selected by SPA. The combination of the BRT algorithm and SPA was shown above to be the most effective approach for the detection of TSWV. Table 3 shows the performances of the models for different stages (mean values of 50 runs). The mean overall classification accuracies for Stage 2 and Stage B are 90.6% and 95.8%, respectively, based on the

Fig. 7. Relative importance of predictive bands selected by SPA, GA and BRT methods in BRT models (normalized to sum to 100%).
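The normalization mentioned in the caption of Fig. 7 amounts to rescaling the raw importance scores so that they sum to 100%. The sketch below shows this step only; the raw scores are invented placeholders keyed by the SPA wavelengths, not the values plotted in the figure, and the function name is hypothetical.

```python
def normalize_importance(raw_scores):
    """Rescale non-negative importance scores so that they sum to 100%."""
    total = sum(raw_scores.values())
    return {band: 100.0 * score / total for band, score in raw_scores.items()}


# Placeholder raw scores for the six SPA wavelengths (nm); not measured values.
raw_scores = {554.07: 2.0, 574.31: 1.0, 760.16: 3.0, 791.78: 4.0, 839.55: 6.0, 936.33: 4.0}

relative = normalize_importance(raw_scores)
top_band = max(relative, key=relative.get)

for band, share in sorted(relative.items(), key=lambda item: -item[1]):
    print(f"{band:.2f} nm: {share:.1f}%")
```

Because the shares are relative, they can be compared within one model but not directly between models built on different band sets.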


Fig. 8. Response curves of the selected variables in the BRT-SPA model. The curves show the changes in dependent variables (y-axis) along with a predictor variable range (x-axis), while keeping the remaining predictors at mean values.
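A response curve of the kind described in this caption can be sketched by sweeping one predictor over a grid while the remaining predictors are held at their mean values. The code below is an illustration under that reading of the caption: `toy_predict` is a hypothetical stand-in for a fitted BRT model, the sample values are invented, and the BRT software used by the authors may compute its curves differently.

```python
from statistics import mean


def response_curve(predict, samples, feature_index, grid):
    """Evaluate `predict` along `grid` for one feature, others fixed at their means."""
    baseline = [mean(column) for column in zip(*samples)]
    curve = []
    for value in grid:
        point = list(baseline)          # copy so the baseline stays unchanged
        point[feature_index] = value    # vary only the feature of interest
        curve.append((value, predict(point)))
    return curve


def toy_predict(x):
    """Hypothetical model output for two reflectance predictors."""
    return 0.2 + 0.5 * x[0] - 0.1 * x[1]


# Invented reflectance samples: one row per leaf, one column per wavelength.
samples = [[0.30, 0.10], [0.40, 0.20], [0.50, 0.30]]

curve = response_curve(toy_predict, samples, feature_index=0, grid=[0.3, 0.4, 0.5])
for value, response in curve:
    print(f"reflectance {value:.2f} -> response {response:.3f}")
```

Holding the other predictors at their means is simple, but it ignores interactions between bands, so such curves are best read as a qualitative guide.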

parameter changes (see Table 1), which may restrict its stability and applicability. It has been previously mentioned that the calibration of SVM can be a very time-consuming process (Rodrigues and de la Riva, 2014). Additionally, in comparison with SVM, RF and BRT have the advantage of providing the relative importance of input variables, which helps to select the optimal variables for modeling by eliminating the negative predictors. Given the above considerations, BRT and RF would be suitable options for the early detection of plant diseases. Due to their inherent limitations, traditional plant disease detection approaches are basically employed after the typical symptoms visually appear, without taking into account diagnosis at the presymptomatic time. In some cases, the diseases do not show any visual symptoms at all; for instance, TSWV-infected cucumbers are symptomless (Krezhova et al., 2014). It has been shown that virus-infected plants may undergo measurable variations in the physiological characteristics associated with the diseases before the visual symptoms occur (Bertamini and Muthuchelian, 2010). The hyperspectral imaging technology provides an alternative method, which has the advantage of detecting tiny variations in the reflectance that may happen at an asymptomatic stage (Martinelli et al., 2015). Previous studies have demonstrated the potential of using hyperspectral approaches to detect plant diseases at the asymptomatic stage for different plants, e.g., tobacco (Zhu et al., 2017), sugar beet (Rumpf et al., 2010), grapevine (Naidu et al., 2009), and pear tree (Bagheri et al., 2018). To detect the TSWV infection before it can be identified by visual observation, we split the study period into different stages based on the visual survey.
The results indicated that the comprehensive approach developed in this study can be used for the early detection of TSWV in tobacco at the presymptomatic stage after infection. Several studies on the spectral diagnosis of plant diseases used molecular identification techniques to verify the presence of the pathogens associated with the disease. For example, Naidu et al. (2009) used RT-PCR to ascertain the infection of grapevine leafroll-associated viruses (GLRaVs) within the leaf tissue of grapevines. Krezhova et al. (2014) adopted the serological method DAS-ELISA to determine the presence of TSWV in tobacco leaves. Although the spectral detection results were confirmed by the molecular approaches, these papers did not compare the timeliness of detection between the two means. To our knowledge, this study is the first to compare the timeliness of the hyperspectral imaging technique with the molecular

Table 3 Classification results averaged over 50 iterations of 10-fold cross-validations of BRT models for the discrimination of healthy leaves from leaves at different stages of post-inoculation based on the EWs selected by SPA.

| Model performance | Stage 1 (1–4 DPI) | Stage 2 (5–8 DPI) | Stage A (1–5 DPI) | Stage B (6–8 DPI) |
|---|---|---|---|---|
| Overall accuracy (%) | 68.2 | 90.6 | 74.3 | 95.8 |
| Kappa | 0.529 | 0.727 | 0.585 | 0.774 |
| R | 0.482 | 0.857 | 0.563 | 0.923 |
| AUC | 0.704 | 0.961 | 0.748 | 0.986 |

(2003) showed that the NIR range was much more informative than the visible region for late blight disease detection in tomato crops. The reflectance of the tomato canopies in the NIR range (700–1300 nm) showed a significant difference between different Phytophthora infestans-infected stages (Wang et al., 2008). In the research of Zhu et al. (2017), more wavelengths were selected by SPA in the NIR range than in the visible region to detect TMV infection in tobacco plants. The models using the EWs from SPA performed better than those based on the EWs selected by GA and the BRT process. This indicated that SPA is very effective for the wavelength selection and optimization of hyperspectral data. In recent years, there has been growing interest in investigating the application of ML algorithms for the identification of plant diseases (Rumpf et al., 2010). Various ML techniques have been employed and discussed, but it appears that there is no universal technique for all kinds of circumstances. In our study, the BRT models achieved the best performances, followed by RF models which were comparable to the BRT models, and the CART and SVM models performed relatively poorly. In accordance with several recent studies (Dobarco et al., 2017; Rodrigues and de la Riva, 2014; Salazar et al., 2015; Zhi et al., 2017), the ML methods using the resampling approach and ensemble algorithm outperform the other kinds of ML techniques in terms of both model accuracy and stability. In a soil mattic horizon prediction study, RF and BRT obtained better performances than the other two commonly used ML methods (CART and SVM) (Zhi et al., 2017). In the study conducted by Salazar et al. (2015), BRT stood out as the most accurate model, surpassing RF, SVM, neural networks (NN) and multivariate adaptive regression splines (MARS). Although SVM has been widely used for classification and prediction, it is sensitive to the


identification approach for plant disease detection at an early stage. We split the study period into different stages based on the results of molecular identification. The RT-PCR technique detected TSWV RNAs within the inoculated tobacco leaves, and the systemic infection by TSWV was established at 5 DPI. The result provided evidence for the potential feasibility of the hyperspectral imaging technique for TSWV detection before the infection can be identified by RT-PCR. This study implies that early detection using the hyperspectral technique could serve as a preliminary diagnostic tool for TSWV occurrence, with a laboratory-based method then used to confirm the infection. Moreover, the obvious difference in model performance between the stages after inoculation demonstrated a significant variation in the reflectance characteristics of tobacco leaves within a short period of time after inoculation, indicating that it is crucial to detect TSWV infection at an early stage so as to effectively control the spread of the disease. The early detection of crop diseases has considerable potential to facilitate the timely control of the diseases, and to decrease the economic losses and environmental pollution. This study focused on young tobacco plants and TSWV in a laboratory environment, and provided evidence of the feasibility and effectiveness of the hyperspectral imaging technique for detecting the tobacco disease at an early stage. The main findings may be used in the field to support the timely control of the disease, by providing basic knowledge for the development of specific sensors such as airborne and spaceborne imaging instruments. The developed approach shows promise for extension to the detection of diseases in other plants based on hyperspectral technology.
The results showed that the BRT approach and the spectral reflectance in the visible and NIR regions of hyperspectral images can be effectively used to identify TSWV-infected tobacco leaves. With the aim of improving the early detection of TSWV infection in tobacco plants, further investigation into the use of vegetation indices (VIs) as predictor variables in BRT models is needed.

Acknowledgement

We gratefully acknowledge the financial support from the National Natural Science Foundation of China (Grant Nos. 41601024 and 31501220). We also thank the editor and three reviewers for their valuable comments and suggestions that improved this paper.

References

Araújo, M.C.U., Saldanha, T.C.B., Galvao, R.K.H., Yoneyama, T., Chame, H.C., Visani, V., 2001. The successive projections algorithm for variable selection in spectroscopic multicomponent analysis. Chemometr. Intell. Lab. 57, 65–73. https://doi.org/10.1016/S0169-7439(01)00119-8. Bagheri, N., Mohamadi-Monavar, H., Azizi, A., Ghasemi, A., 2018. Detection of Fire Blight disease in pear trees by hyperspectral data. Eur. J. Remote Sens. 51, 1–10. https://doi.org/10.1080/22797254.2017.1391054. Bauriegel, E., Giebel, A., Geyer, M., Schmidt, U., Herppich, W.B., 2011. Early detection of Fusarium infection in wheat using hyper-spectral imaging. Comput. Electron. Agr. 75, 304–312. https://doi.org/10.1016/j.compag.2010.12.006. Behmann, J., Mahlein, A.K., Paulus, S., Dupuis, J., Kuhlmann, H., Oerke, E.C., Plümer, L., 2016. Generation and application of hyperspectral 3D plant models: methods and challenges. Mach. Vision Appl. 27, 611–624. https://doi.org/10.1007/s00138-015-0716-8. Berger, K., Atzberger, C., Danner, M., D’Urso, G., Mauser, W., Vuolo, F., Hank, T., 2018. Evaluation of the PROSAIL model capabilities for future hyperspectral model environments: a review study. Remote Sens. 10, 85. https://doi.org/10.3390/rs10010085. Bertamini, M., Muthuchelian, K.N., 2010. Effect of grapevine leafroll on the photosynthesis of field grown grapevine plants (Vitis vinifera L. Cv. Lagrein). J. Phytopathol. 152, 145–152. https://doi.org/10.1111/j.1439-0434.2004.00815.x. Bioucas-Dias, J.M., Plaza, A., Camps-Valls, G., Scheunders, P., Nasrabadi, N., Chanussot, J., 2013. Hyperspectral remote sensing data analysis and future challenges. IEEE Geosc. Rem. Sen. M. 1, 6–36. https://doi.org/10.1109/MGRS.2013.2244672. Breiman, L., 2001.
Random forests. Mach. Learn. 45, 5–32. https://doi.org/10.1023/ A:1010933404324. Delalieux, S., Van Aardt, J., Keulemans, W., Schrevens, E., Coppin, P., 2007. Detection of biotic stress (Venturia inaequalis) in apple trees using hyperspectral data: nonparametric statistical approaches and physiological implications. Eur. J. Agron. 27, 130–143. https://doi.org/10.1016/j.eja.2007.02.005. Dobarco, M.R., Arrouays, D., Lagacherie, P., Ciampalini, R., Saby, N.P., 2017. Prediction of topsoil texture for Region Centre (France) applying model ensemble methods. Geoderma 298, 67–77. https://doi.org/10.1016/j.geoderma.2017.03.015. Dou, J., Chang, K.T., Chen, S., Yunus, A.P., Liu, J.K., Xia, H., Zhu, Z., 2015. Automatic case-based reasoning approach for landslide detection: integration of object-oriented image analysis and a genetic algorithm. Remote Sens. 7, 4318–4342. https://doi.org/ 10.3390/rs70404318. Elith, J., Leathwick, J.R., Hastie, T., 2008. A working guide to boosted regression trees. J. Anim. Ecol. 77, 802–813. https://doi.org/10.1111/j.1365-2656.2008.01390.x. Elmasry, G., Sun, D.W., Allen, P., 2012. Near-infrared hyperspectral imaging for predicting colour, pH and tenderness of fresh beef. J. Food Eng. 110, 127–140. https:// doi.org/10.1016/j.jfoodeng.2011.11.028. ElMasry, G., Wang, N., ElSayed, A., Ngadi, M., 2007. Hyperspectral imaging for nondestructive determination of some quality attributes for strawberry. J. Food Eng. 81, 98–107. https://doi.org/10.1016/j.jfoodeng.2006.10.016. Fang, Y., Ramasamy, R.P., 2015. Current and prospective methods for plant disease detection. Biosensors 5, 537–561. https://doi.org/10.3390/bios5030537. Galvao, L.S., dos Santos, J.R., Roberts, D.A., Breunig, F.M., Toomey, M., de Moura, Y.M., 2011. On intra-annual EVI variability in the dry season of tropical forest: a case study with MODIS and hyperspectral data. Remote Sens. Environ. 115, 2350–2359. https:// doi.org/10.1016/j.rse.2011.04.035. 
Gu, Q., Deng, J., Wang, K., Lin, Y., Li, J., Gan, M., Ma, L., Hong, Y., 2014. Identification and assessment of potential water quality impact factors for drinking-water reservoirs. Int. J. Env. Res. Pub. He. 11, 6069–6084. https://doi.org/10.3390/ ijerph110606069. Hu, J., Wang, J., Zeng, G., 2013. A hybrid forecasting approach applied to wind speed time series. Renew. Energ. 60, 185–194. https://doi.org/10.1016/j.renene.2013.05. 012. Hu, T., Mao, Z., Shi, J., Chen, W., 2010. The role of taxation in tobacco control and its potential economic impact in China. Tob. Control 19, 58–64. https://doi.org/10. 1136/tc.2009.031799. Konak, A., Coit, D.W., Smith, A.E., 2006. Multi-objective optimization using genetic algorithms: a tutorial. Reliab. Eng. Syst. Safe. 91, 992–1007. https://doi.org/10.1016/ j.ress.2005.11.018. Krezhova, D., Dikova, B., Maneva, S., 2014. Ground based hyperspectral remote sensing for disease detection of tobacco plants. Bulg. J. Agric. Sci 20, 1142–1150. Li, S., Wu, H., Wan, D., Zhu, J., 2011. An effective feature selection method for hyperspectral image classification based on genetic algorithm and support vector machine. Knowl.-Based Syst. 24, 40–48. https://doi.org/10.1016/j.knosys.2010.07.003. Liu, D., Kelly, M., Peng, G., 2006. A spatial–temporal approach to monitoring forest disease spread using multi-temporal high spatial resolution imagery. Remote Sens. Environ. 101, 167–180. https://doi.org/10.1016/j.rse.2005.12.012. Liu, D., Sun, D., Zeng, X., 2014. Recent advances in wavelength selection techniques for hyperspectral image processing in the food industry. Food Bioprocess Tech. 7,

4. Conclusions

This study investigated the potential of using the changes in spectral reflectance in the VIS/NIR region (400–1000 nm) to identify tobacco plants infected with TSWV at an early stage. A comprehensive method was developed by using the hyperspectral imaging platform in conjunction with GA, SPA and BRT to define several optimal wavelengths, and four machine learning algorithms (CART, BRT, SVM and RF) for classification. Six bands were selected by SPA, six bands by GA, and eight bands by the BRT algorithm for developing predictive models discriminating between healthy and infected samples. Among the selected bands, most were located in the NIR region (780–1000 nm), suggesting that the NIR region is informative and important for the differentiation of infected and healthy tobacco leaves. The selected spectral bands could potentially be implemented in future operational remote sensing applications. BRT models based on the predictive variables (EWs) selected by SPA successfully discriminated the infected tobacco leaves from healthy ones with an overall accuracy of 85.2% and an AUC of 0.932, indicating that this method can be used to effectively monitor the early changes in TSWV-infected tobacco leaves. The post-inoculation period was split into stages according to molecular identification and visual observation. The EWs selected by SPA and the BRT modeling method were used for the identification of inoculated leaves. The results indicated that hyperspectral image data integrated with the machine learning method and wavelength selection algorithm can be used for the early detection of TSWV in tobacco at the asymptomatic stage after infection as well as during the period before the systemic infection can be detected by RT-PCR. The research findings will facilitate the timely prevention of TSWV infection in tobacco crops, and offer knowledge for the processing and analysis of hyperspectral image data for the early detection of plant diseases.


https://doi.org/10.1016/j.strusafe.2015.05.001. Sankaran, S., Mishra, A., Ehsani, R., Davis, C., 2010. A review of advanced techniques for detecting plant diseases. Comput. Electron. Agr. 72, 1–13. https://doi.org/10.1016/j. compag.2010.02.007. Slaton, M.R., Hunt, E.R., Smith, W.K., 2001. Estimating near-infrared leaf reflectance from leaf structural characteristics. Am. J. Bot. 88, 278–284. https://doi.org/10. 2307/2657019. Surhone, L.M., Timpledon, M.T., Marseken, S.F., Correlation, C., Squares, T.S.O., Analysis, R., 2013. Principal Component Regression. Betascript Publishing, pp. 1954–1954. https://doi.org/10.1007/978-3-642-16712-6_100781. Tanaka, S., Kawamura, K., Maki, M., Muramoto, Y., Yoshida, K., Akiyama, T., 2015. Spectral index for quantifying leaf area index of winter wheat by field hyperspectral measurements: a case study in Gifu prefecture, Central Japan. Remote Sens. 7, 5329–5346. https://doi.org/10.3390/rs70505329. Wang, X., Zhang, M., Zhu, J., Geng, S., 2008. Spectral prediction of Phytophthora infestans infection on tomatoes using artificial neural network (ANN). Int. J. Remote Sens. 29, 1693–1706. https://doi.org/10.1080/01431160701281007. Wei, C., Huang, J., Wang, X., Blackburn, G.A., Zhang, Y., Wang, S., Mansaray, L.R., 2017. Hyperspectral characterization of freezing injury and its biochemical impacts in oilseed rape leaves. Remote Sens. Environ. 195, 56–66. https://doi.org/10.1016/j. rse.2017.03.042. Were, K., Bui, D.T., Dick, Y.B., Singh, B.R., 2015. A comparative assessment of support vector regression, artificial neural networks, and random forests for predicting and mapping soil organic carbon stocks across an Afromontane landscape. Ecol. Ind. 52, 394–403. https://doi.org/10.1016/j.ecolind.2014.12.028. Wu, D., Wang, S., Wang, N., Nie, P., He, Y., Sun, D.-W., Yao, J., 2013. Application of time series hyperspectral imaging (TS-HSI) for determining water distribution within beef and spectral kinetic analysis during dehydration. 
Food Bioprocess Tech. 6, 2943–2958. https://doi.org/10.1007/s11947-012-0928-0. Xie, C., Shao, Y., Li, X., He, Y., 2015. Detection of early blight and late blight diseases on tomato leaves using hyperspectral imaging. Sci. Rep. 5, 16564. https://doi.org/10. 1038/srep16564. Yan, F., Lu, Y., Wu, G., Peng, J., Zheng, H., Lin, L., Chen, J., 2012. A simplified method for constructing artificial microRNAs based on the osa-MIR528 precursor. J. Biotechnol. 160, 146–150. https://doi.org/10.1016/j.jbiotec.2012.02.015. Yang, R., Zhang, G., Liu, F., Lu, Y., Yang, F., Yang, F., Yang, M., Zhao, Y., Li, D., 2016. Comparison of boosted regression tree and random forest models for mapping topsoil organic carbon concentration in an alpine ecosystem. Ecol. Ind. 60, 870–878. https:// doi.org/10.1016/j.ecolind.2015.08.036. Zhang, D., 2013. Comparison of spectral indices and wavelet transform for estimating chlorophyll content of maize from hyperspectral reflectance. J. Appl. Remote Sens. 7, 3575. https://doi.org/10.1117/1.JRS.7.073575. Zhang, M., Qin, Z., Liu, X., Ustin, S.L., 2003. Detection of stress in tomatoes induced by late blight disease in California, USA, using hyperspectral remote sensing. Int. J. Appl. Earth Obs. 4, 295–310. https://doi.org/10.1016/s0303-2434(03)00008-4. Zhang, X., Liu, F., He, Y., Gong, X., 2013. Detecting macronutrients content and distribution in oilseed rape leaves based on hyperspectral imaging. Biosyst. Eng. 115, 56–65. https://doi.org/10.1016/j.biosystemseng.2013.02.007. Zhi, J., Zhang, G., Yang, R., Yang, F., Jin, C., Liu, F., Song, X., Zhao, Y., Li, D., 2017. An insight into machine learning algorithms to map the occurrence of soil mattic horizon in the northeastern Qinghai-Tibetan Plateau. Pedosphere. https://doi.org/10.1016/ S1002-0160(17)60481-8. Zhu, H., Cen, H., Zhang, C., He, Y., 2016. Early Detection and Classification of Tobacco Leaves Inoculated with Tobacco Mosaic Virus Based on Hyperspectral Imaging Technique. 
2016 ASABE Annual International Meeting. American Society of Agricultural and Biological Engineers, pp. 1. https://doi.org/10.13031/aim. 20162460422. Zhu, H., Chu, B., Zhang, C., Liu, F., Jiang, L., He, Y., 2017. Hyperspectral imaging for presymptomatic detection of tobacco disease with successive projections algorithm and machine-learning classifiers. Sci. Rep. 7, 4125. https://doi.org/10.1038/s41598017-04501-2.

307–323. https://doi.org/10.1007/s11947-013-1193-6. Ma, J., Zheng, Z., Tong, Q., Zheng, L., 2003. An application of genetic algorithms on band selection for hyperspectral image classification. In: International Conference on Machine Learning and Cybernetics, vol. 5, pp. 2810–2813. https://doi.org/10.1109/ ICMLC.2003.1260030. Madufor, N., Perold, W., Opara, U., 2017. Detection of plant diseases using biosensors: a review. VII International Conference on Managing Quality in Chains (MQUIC2017) and II International Symposium on Ornamentals in 1201, pp. 83–90. https://doi.org/ 10.17660/ActaHortic.2018.1201.12. Mananze, S., Pôças, I., Cunha, M., 2018. Retrieval of maize leaf area index using hyperspectral and multispectral data. Remote Sens. 10, 1942. https://doi.org/10.3390/ rs10121942. Mandal, B., Wells, M., Martinez, O.N., Csinos, A., Pappu, H., 2007. Symptom development and distribution of Tomato spotted wilt virus in flue-cured tobacco. Ann. Appl. Biol. 151, 67–75. https://doi.org/10.1111/j.1744-7348.2007.00153.x. Martinelli, F., Scalenghe, R., Davino, S., Panno, S., Scuderi, G., Ruisi, P., Villa, P., Stroppiana, D., Boschetti, M., Goulart, L.R., 2015. Advanced methods of plant disease detection. A review. Agron. Sustain. Dev. 35, 1–25. https://doi.org/10.1007/s13593014-0246-1. McPherson, R.M., Jones, D.C., Bertrand, P.F., Csinos, A.S., 2002. Impact of thrips (Thysanoptera: Thripidae) management practices on suppression of tomato spotted wilt virus and aphid (Homoptera: Aphididae) control in flue-cured tobacco. J. Entomol. Sci. 37, 143–153. https://doi.org/10.18474/0749-8004-37.2.143. Mercante, E., Lamparelli, R.A.C., Uribeopazo, M.A., Rocha, J.V., 2009. Spectral characteristics of soybean during the vegetative cycle with Landsat 5/TM images in the Western Paraná. Brazil. Eng. Agr. 29, 307–310. https://doi.org/10.5772/14446. Michez, A., Piégay, H., Lisein, J., Claessens, H., Lejeune, P., 2016. 
Classification of riparian forest species and health condition using multi-temporal and hyperspatial imagery from unmanned aerial system. Environ. Monit. Assess. 188, 1–19. https:// doi.org/10.1007/s10661-015-4996-2. Naidu, R.A., Perry, E.M., Pierce, F.J., Mekuria, T., 2009. The potential of spectral reflectance technique for the detection of Grapevine leafroll-associated virus-3 in two red-berried wine grape cultivars. Comput. Electron. Agr. 66, 38–45. https://doi.org/ 10.1016/j.compag.2008.11.007. Ng, W., Minasny, B., Malone, B.P., Sarathjith, M., Das, B.S., 2019. Optimizing wavelength selection by using informative vectors for parsimonious infrared spectra modelling. Comput. Electron. Agr. 158, 201–210. https://doi.org/10.1016/j.compag.2019.02. 003. Park, S.H., Goo, J.M., Jo, C.H., 2004. Receiver operating characteristic (ROC) curve: practical review for radiologists. Korean J. Radiol. 5, 11. https://doi.org/10.3348/ kjr.2004.5.1.11. Peng, J., Yang, J., Yan, F., Lu, Y., Jiang, S., Lin, L., Zheng, H., Chen, H., Chen, J., 2011. Silencing of NbXrn4 facilitates the systemic infection of Tobacco mosaic virus in Nicotiana benthamiana. Virus Res. 158, 268–270. https://doi.org/10.1016/j. virusres.2011.03.004. Razi, M.A., Athappilly, K., 2005. A comparative predictive analysis of neural networks (NNs), nonlinear regression and classification and regression tree (CART) models. Expert Syst. Appl. 29, 65–74. https://doi.org/10.1016/j.eswa.2005.01.006. Ridgeway, G., 2007. Generalized Boosted Models: A guide to the gbm package. Update 1, 2007. Rodrigues, M., de la Riva, J., 2014. An insight into machine-learning algorithms to model human-caused wildfire occurrence. Environ. Modell. Softw. 57, 192–201. https://doi. org/10.1016/j.envsoft.2014.03.003. Roghanian, E., Pazhoheshfar, P., 2014. An optimization model for reverse logistics network under stochastic environment by using genetic algorithm. J. Manuf. Syst. 33, 348–356. https://doi.org/10.1016/j.jmsy.2014.02.007. 
Rumpf, T., Mahlein, A.K., Steiner, U., Oerke, E.C., Dehne, H.W., Plümer, L., 2010. Early detection and classification of plant diseases with Support Vector Machines based on hyperspectral reflectance. Comput. Electron. Agr. 74, 91–99. https://doi.org/10. 1016/j.compag.2010.06.009. Salazar, F., Toledo, M.A., Oñate, E., Morán, R., 2015. An empirical comparison of machine learning techniques for dam behaviour modelling. Struct. Saf. 56, 9–17.
