CHAPTER FIVE
Classification and Authentication of Plants by Chemometric Analysis of Spectral Data
Daniela de Carvalho Lopes1, Antonio Jose Steidle Neto
Federal University of São João del-Rei, Sete Lagoas, Brazil
1 Corresponding author: e-mail address:
[email protected]
Contents
1. Introduction
2. Chemometric Classification and Authentication
3. Spectral Features of Plants
4. Plant Spectra Preprocessing
5. Chemometric Pattern Recognition Applied to Plant Discrimination
6. Conclusion
References
Comprehensive Analytical Chemistry, Volume 80, ISSN 0166-526X
https://doi.org/10.1016/bs.coac.2018.03.009
© 2018 Elsevier B.V. All rights reserved.
1. INTRODUCTION
The evolution of electronics and the expansion of the agrofood industry have enabled access to technologies and tools that were previously available only in well-equipped laboratories and research centers (Domingues et al., 2012). In this context, the use of spectral analysis combined with chemometrics as a rapid and nondestructive technique to classify and authenticate agricultural products has increased recently (Makky and Soni, 2014). This is particularly important since the authentication and classification of food and plants are of great value to consumers, producers, and industry. Spectroscopy is the study of how electromagnetic radiation interacts with a product at different wavelengths (Diago et al., 2013). Spectroscopic techniques can be applied in a variety of wavelength ranges, and their measurement modes include transmission, reflection, and absorbance. Transmittance can be defined as the ratio between the intensity of light passing through a sample and the intensity of the incident light, that is, the amount of light transmitted by a product relative to the amount of light incident on it.
On the other hand, reflectance is related to the amount of light reflected by a product, representing a fraction of the incident light. Finally, absorbance is a measure of the amount of light absorbed by a product as light passes through it, expressed as the negative logarithm of the transmittance. Porker et al. (2017) reported that applications of visible (VIS: 400–700 nm), near-infrared (NIR: 700–2500 nm), mid-infrared (MIR: 2500–25,000 nm), and far-infrared (FIR: 25,000–1000 × 10³ nm) spectral analysis for classification and authentication purposes in agricultural products are the most common and are based on the fact that samples with similar spectral responses are akin in physical, chemical, and biochemical properties. Moros (2010) reported Raman spectroscopy as a promising alternative to traditional analytical methods. In Raman spectroscopy, samples are irradiated with laser light, which results in light scattering and allows the Raman spectrum to be used for constructing complex chemical models, such as those capable of characterizing the molecular structure of a sample. Finally, terahertz spectroscopy concerns electromagnetic radiation in a frequency range between infrared and microwaves. The relationship between plant spectra and properties can be better exploited when spectral analysis is combined with chemometrics, which is the science of relating measurements made on a chemical system or process to the state of the system via the application of mathematical or statistical methods (Moros, 2010). Chemometrics is well established; it uses multivariate statistical methods for the analysis of complex spectra, allowing maximum information to be extracted from spectral data. Diago et al. (2013) affirmed that chemometric methods combine the reduction of the spectroscopic data dimension with the creation of models that provide speed in analyzing data, resulting in higher discrimination power.
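The definitions of transmittance and absorbance given in this section can be written as a short calculation. The sketch below is illustrative only: the function names are hypothetical, and the base-10 logarithm is assumed because it is the usual convention, although the text only says "negative logarithm".

```python
from math import log10


def transmittance(transmitted_intensity, incident_intensity):
    """Ratio between the light passing through the sample and the incident light."""
    return transmitted_intensity / incident_intensity


def absorbance(transmittance_value):
    """Negative logarithm of the transmittance (base 10 assumed here)."""
    return -log10(transmittance_value)


# Hypothetical intensities in arbitrary units: a quarter of the light is transmitted.
t = transmittance(25.0, 100.0)
a = absorbance(t)
```

A sample that transmits all incident light has an absorbance of zero, and absorbance grows as the transmitted fraction shrinks.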
Based on the above, the purpose of this review was to summarize the chemometric methods most widely used in the authentication and classification of plants based on spectral data, and to present recent applications of these techniques.
2. CHEMOMETRIC CLASSIFICATION AND AUTHENTICATION
The terms classification and authentication are very often used when finding mathematical relationships between descriptive and qualitative variables. These procedures are gaining importance in plant science, since there is a need in the agrofood industry to improve the online monitoring
procedures in order to optimize system functioning, while also protecting and assuring product credibility (Ballabio and Todeschini, 2009). Both classification and authentication depend on the identification of potential physical, chemical, and biochemical differences between samples. However, there are fundamental differences between these two procedures (Rodionova et al., 2016). There are still challenges in introducing effective analytical techniques for classifying and authenticating plants, since different products and research objectives imply different responses when the discriminant methods are applied (Sánchez et al., 2013). On the other hand, there are a number of applications of automatic classification and authentication of plants in the literature that confirm the cost-effectiveness of spectroscopy combined with chemometrics for this purpose (Ballabio and Todeschini, 2009; Berrueta et al., 2007; Moros, 2010). Classification, discrimination, and authentication are commonly used as synonyms because these procedures assign objects to different classes or categories. However, they can be theoretically differentiated according to the objective behind their applications. Authentication determines whether an object is, in fact, what it is declared to be. This procedure comprises a different type of pattern recognition, called one-class classification (Rodionova et al., 2016). Discrimination and classification are used to determine whether a sample or new object belongs to a predefined category, as well as to separate distinct sets of objects, in which case knowledge of predefined categories is not mandatory (Misaki et al., 2010). Zhang et al. (2015) used NIR spectroscopy combined with chemometrics and identified the 1000–2500 nm range as optimal for distinguishing transgenic rice (Oryza sativa) from wild-type rice, with a correct classification rate of 100% in the validation test. The same accuracy was achieved by Aouidi et al.
(2012) when using MIR spectroscopy and chemometric methods to discriminate five Tunisian cultivars (Chemlali, Sayali, Meski, Zarrazi, and Chetoui) of olive (Olea europaea) according to their leaves. Bona et al. (2017), when developing a methodology for the geographical classification of different genotypes of green coffee (Coffea arabica) using IR spectroscopy, also confirmed that spectroscopy is a fast and reliable method for plant classification that produces no chemical waste and requires little sample preparation. Studies that address plant authentication based on spectroscopy and chemometrics are scarcer than those on classification and discrimination, despite their clear importance in the agrofood industry. Mir-Marques et al. (2016) correctly discriminated between artichokes (Cynara scolymus) with the protected designation of origin
“Alcachofa de Benicarló” and those produced in other regions using their mineral profile information, X-ray fluorescence, and NIR spectroscopy combined with multivariate analysis. In another study, Wong et al. (2016) used direct ionization mass spectrometry and multivariate methods for rapid authentication of the popular herbal medicine Tianma (Gastrodiae rhizoma) from different provinces of China.
3. SPECTRAL FEATURES OF PLANTS
The basic similarity in the physical, chemical, and biochemical properties of plants restricts the range of spectral variability when compared with that in other products such as fruits, meat, beverages, processed food, soils, and minerals. However, for samples of the same plant, spectra frequently vary with, for example, species, variety, age, internal cellular structure, environmental conditions, and nutritional level (Steidle Neto et al., 2009). The spectral features of plants differ mainly in the amount of reflectance (or absorbance) and in spectrum shape, with pigments and water producing the most evident differences. Plant spectral signatures also vary depending on whether data are obtained from petals, leaves, stalks, or canopies. This behavior favors classification and authentication procedures based on plant spectral signatures. Jurado-Expósito et al. (2003) evaluated the potential of using NIR spectroscopy to discriminate between seven broadleaf weed species (Malva parviflora, Ecballium elaterium, Convolvulus arvensis, Anagallis arvensis, Galium aparine, Trifolium repens, and Polygonum aviculare), sunflower (Helianthus annuus) leaves, and wheat (Triticum aestivum) stubble. The statistical analysis performed on the spectral data revealed consistent significant differences between the plant reflectance values, with the spectral window between 750 and 950 nm being the best one for discriminating among wheat stubble, sunflower leaves, and weeds. The authors indicated that the results were promising for further work on real-time remote sensing identification of weed patches in sunflower fields. Fig. 1 shows the average petal spectral signatures of five gladiolus (Gladiolus grandiflora) cultivars (white, light pink, salmon red, red, and light blue-white) in the VIS and NIR regions. Steidle Neto et al.
(2009) distinguished these cultivars based on the spectral reflectance of their petals, verifying that reflectance can be used as a reliable and precise criterion for developing automatic systems for the classification of gladiolus, mainly in the VIS region of the spectrum. The proposed methodology can also be applied to other flowers or plants.
Fig. 1 Spectral reflectance curves of gladiolus cultivars considering wavelengths in the visible and near-infrared spectrum.
The significant differences between the spectral reflectance curves of the different gladiolus cultivars can be associated with the characteristic pigmentation of each cultivar, since the presence of pigment compounds, such as chlorophylls, carotenoids, and anthocyanins, in the petals of flowers directly influences spectral data. According to Lopes et al. (2017) and Steidle Neto et al. (2017a), reflectance curves tend to increase in the blue and purple wavelengths (400–500 nm) due to the high presence of anthocyanins in some plant samples, while the orange and red ranges (600–700 nm) are sensitive to carotenoid pigments, and the green range (500–600 nm) is affected by the occurrence of chlorophylls. Variations introduced by water stress on plant spectra can also contribute to authentication or classification tasks, as shown by Moshou et al. (2014) when discriminating between healthy and water-stressed wheat (T. aestivum) canopies. Prospere et al. (2014) investigated the use of foliar spectral reflectance to discriminate 46 plant species (herbs, trees, shrubs, grasses, palms, graminoids, hydrophytes, epiphytes, climbers, and emergents) in a tropical wetland in Jamaica, reporting the limitations of their analyses and reinforcing that water absorption bands should be considered in this kind of work. The progressive dehydration of plants generally results in an increase in spectral reflectance at NIR wavelengths (700–1000 nm), as shown in Fig. 2, which presents the average spectral reflectance of sunflower (H. annuus)
Fig. 2 Spectral reflectance curves of sunflower leaves submitted to different water stress levels.
leaves submitted to different water stress levels. This probably occurs due to changes in the internal leaf structure caused by interrupting the irrigation events, which make the leaf less capable of absorbing electromagnetic radiation and cause greater deviations in the radiation path than occur in turgid leaves (Ponzoni et al., 2012; Steidle Neto et al., 2017b). In the VIS region, leaves under mild water stress tend to present greater reflectance at the wavelengths associated with green than leaves under moderate and severe water stress, probably due to leaf pigmentation. Chaves et al. (2009) reported that under water stress the photosynthetic metabolism is affected by pigment photooxidation and biochemical damage to enzymes, with a consequent decrease of chlorophyll in the leaves. Numerous studies involving spectral analysis for authenticating or classifying plants have also been carried out considering other plant properties. Rubert et al. (2016) proposed a metabolic fingerprint for saffron (Crocus sativus) authentication based on spectrometry and multivariate methods, identifying clear differences between samples cultivated and packaged in Spain, those with protected designation of origin, and saffron packaged in Spain of unknown origin. The authors emphasized that glycerophospholipids and their oxidized lipids were significant markers of saffron origin. Symonds et al. (2015) developed a real-time plant discrimination system based on discrete spectral reflectance measurements and achieved accuracies greater than 99% when testing a system for discrimination
of anthurium (Anthurium andraeanum) from dandelion (Taraxacum officinale) and sunkisses (Ipomoea batatas). In another study, Atlabachew et al. (2015) tested different spectroscopic methods in combination with multivariate data analysis to identify khat (Catha edulis) samples based on their level of leaf maturity.
4. PLANT SPECTRA PREPROCESSING
Sometimes plant spectra contain undesirable components (called noise) or do not visibly differ among themselves, which can reduce the computing efficiency of the proposed models. Thus, spectral transformation techniques are frequently applied to remove irrelevant information that cannot be handled by the chemometric pattern recognition methods (Rinnan et al., 2009). Several spectral preprocessing methods have been developed for this purpose, ranging from averaging over spectra to derivative transformations. Fig. 3 presents an original reflectance spectrum of green lettuce (Lactuca sativa) and the same spectrum after applying the most widely used preprocessing methods. The centered spectrum was obtained by mean-centering, in which an average spectrum was calculated from a number of lettuce spectra and then subtracted from the original spectrum first presented in Fig. 3. The preprocessing methods can be classified as scatter correction, baseline correction, signal enhancement, and statistical filtering of signal noise. More than one method can be combined, in any order, although scatter correction techniques should be performed before differentiation (Agelet, 2011). Centering and normalization (standardization) are preprocessing techniques for scatter correction. The most widely employed centering technique is mean-centering, which is a mandatory step before some chemometric methods, such as principal component analysis (PCA) and partial least squares (PLS) regression. This procedure ensures that results will be interpretable in terms of variation around the mean (Cozzolino et al., 2011), allowing the data to enter the variability space after another preprocessing technique has been applied. Regarding normalization, two methods are most often applied for reducing the variability between spectra due to scatter: standard normal variate (SNV) and object-wise standardization.
SNV is obtained by subtracting the spectrum average from the spectrum value at every wavelength and dividing the result by the spectrum standard deviation. In object-wise standardization, the spectrum value at every wavelength is directly divided by the spectrum standard deviation (Rinnan et al., 2009).
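The centering and normalization steps described above can be sketched in a few lines. This is a minimal illustration, not the procedure used in any of the cited studies: the function names are hypothetical, each spectrum is assumed to be a plain list of reflectance values, and the sample standard deviation (n - 1 denominator) is assumed.

```python
from statistics import mean, stdev


def mean_center(spectra):
    """Subtract the average spectrum (computed across samples) from every spectrum."""
    average = [mean(column) for column in zip(*spectra)]
    return [[v - a for v, a in zip(spectrum, average)] for spectrum in spectra]


def snv(spectrum):
    """Standard normal variate: subtract the spectrum mean, divide by its standard deviation."""
    m, s = mean(spectrum), stdev(spectrum)  # stdev uses n - 1 (an assumption)
    return [(v - m) / s for v in spectrum]


def objectwise_standardize(spectrum):
    """Object-wise standardization: divide by the spectrum standard deviation only."""
    s = stdev(spectrum)
    return [v / s for v in spectrum]


# Two made-up three-point "spectra", used only to exercise the functions.
example_spectra = [[1.0, 2.0, 3.0], [3.0, 4.0, 8.0]]
centered = mean_center(example_spectra)
snv_spectra = [snv(s) for s in example_spectra]
```

Note that mean-centering works across samples (one average per wavelength), whereas SNV and object-wise standardization work within each spectrum.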
Fig. 3 Original spectral reflectance and pretreated spectra for lettuces.
Detrend is a baseline correction algorithm that corrects simple deformations of the spectral baseline, such as vertical shift and slope (Steidle Neto et al., 2017a). When this pretreatment is applied, an orthogonal projection of the original spectra is performed in which a polynomial fit of each spectrum
is subtracted from it. Detrend is usually used to remove specific trending variations, such as constant or variable offsets. This technique is not recommended for datasets in which these trends are important for modeling. That is, detrend should be used only when the offsets of the spectra are not related to the chemical or physical properties of interest for the chemometric modeling. Li and He (2008) selected the detrend pretreatment as the optimum one for discriminating varieties of the tea (Camellia sinensis) plant based on their VIS/NIR spectral characteristics. Ten types of preprocessing methods were evaluated for their tea plant discrimination accuracies, including normalization and derivatives. Loewe et al. (2007) applied the normalization and detrend algorithms before developing mathematical models and successfully evaluated the suitability of NIR spectroscopy for authenticating the geographical origins of Mediterranean pine (Pinus pinea) nuts in Chile. The potential of VIS/NIR spectroscopy to discriminate three bamboo species (Bashania fargesii, Fargesia qinlingensis, and Phyllostachys glauca) was investigated by Wang et al. (2016), and different techniques were tested to preprocess the original spectra, including normalization with detrend, as well as first and second derivatives. The results showed that normalization with detrend was the most efficient, reaching 100% correct classifications. Another baseline correction method is multiplicative scatter correction (MSC), which was originally developed for applications of NIR spectroscopy. When applying MSC, an average spectrum is calculated and used as the reference for estimating a regression equation for each spectrum. Then, each spectrum is corrected by subtracting an offset, which is the regression intercept, and dividing the result by the regression slope (Agelet, 2011; Rinnan et al., 2009).
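The detrend and MSC corrections just described can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, detrend is limited to a first-degree polynomial (enough for a vertical shift and a slope), and the MSC reference is taken to be the average of the spectra supplied.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least-squares fit y ~ intercept + slope * x; returns (intercept, slope)."""
    x_mean, y_mean = mean(x), mean(y)
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


def detrend(wavelengths, spectrum):
    """Subtract a straight line (first-degree polynomial, assumed here) fitted over wavelength."""
    intercept, slope = fit_line(wavelengths, spectrum)
    return [v - (intercept + slope * w) for w, v in zip(wavelengths, spectrum)]


def msc(spectra):
    """Regress each spectrum on the average spectrum, then remove intercept and slope."""
    reference = [mean(column) for column in zip(*spectra)]
    corrected = []
    for spectrum in spectra:
        intercept, slope = fit_line(reference, spectrum)
        corrected.append([(v - intercept) / slope for v in spectrum])
    # The reference is returned because it must be stored to correct future spectra.
    return corrected, reference


# Made-up data: the second spectrum is an offset and scaled copy of the first.
base = [1.0, 2.0, 4.0, 3.0]
corrected, reference = msc([base, [2.0 * b + 0.5 for b in base]])
flat = detrend([400.0, 500.0, 600.0, 700.0], [1.0, 2.0, 3.0, 4.0])
```

In this toy case both corrected spectra collapse onto the reference, because they differ only by an additive offset and a multiplicative factor, which is exactly what MSC is meant to remove.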
The results after the MSC transformation are similar to normalization, but MSC has the disadvantages of needing a reference spectrum and requiring the regression coefficients to be saved. Nevertheless, Chen et al. (2014) demonstrated that NIR spectroscopy with pattern recognition could be successfully applied to discriminate white and albino teas (C. sinensis) quickly and nondestructively, reaching an identification accuracy greater than 98% when using MSC and first-order derivative spectral pretreatments. Smoothing is a low-pass filter, often used to correct additive and multiplicative effects in the spectral data, while preserving the important features of information and removing random variations in spectra (Makky and Soni, 2014). Several smoothing algorithms have been proposed, including moving average filters and the Savitzky–Golay method, which has been widely used in chemometric applications (Cozzolino et al., 2011; Savitzky and Golay, 1964). Shao et al. (2015) developed a mathematical model using spectral data preprocessed by smoothing and MSC, aiming at using VIS/NIR to discriminate tomato (Lycopersicon esculentum) bred by spaceflight mutagenesis. Spectrum derivatives allow baseline shifts to be removed and overlapped peaks to be minimized, also correcting both additive and multiplicative effects. The models resulting from applying derivatives are usually more robust since they tend to require fewer factors. First and second derivatives are the most common and are generally calculated by using the Savitzky–Golay algorithm (Savitzky and Golay, 1964), which optimally fits a set of data points to a polynomial in the least squares sense (Luo et al., 2005). That is, to find the derivative at a given point, a polynomial is fitted in a symmetric window of data comprising the point and n other points on either side of it. The procedure also involves the calculation of convolution coefficients (or filter coefficients) associated with the polynomials. Depending on the equations used for obtaining the convolution coefficients, polynomials from degree zero to five can be used as a mean filter over the window of data, resulting in a smoothing procedure. In a like manner, polynomials of degrees from one to four can be used to obtain first- and second-order derivatives (Luo et al., 2005; Shao et al., 2015). Liu et al. (2010) demonstrated that it is feasible to use VIS/NIR spectroscopy to discriminate health conditions of rice (O. sativa) panicles, reaching 99% correct classifications when pretreating the spectra with the first derivative method. Gutierrez et al. (2015) evaluated the combined use of different data mining techniques along with a nondestructive portable NIR sensor for grapevine (Vitis vinifera) variety discrimination, pretreating the spectra with normalization, detrend, and second-order derivative methods. In this work, the second-order derivative was considered the best preprocessing method. In another study, Nisgoski et al.
(2016) evaluated the potential use of NIR spectrometry to determine the origin of Japanese cedar (Cryptomeria japonica) varieties based on needles and wood. Original spectra were analyzed and also preprocessed by the second-order derivative, normalization, and MSC. In this case, the second-order derivative was the most efficient method.
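The Savitzky–Golay procedure described in this section (a least-squares polynomial fitted in a symmetric window, expressed as convolution coefficients) can be sketched as below. This is a minimal illustration under stated assumptions: the function names are hypothetical, the wavelengths are assumed to be equally spaced, and the points at the two ends of the spectrum, where the window does not fit, are simply dropped.

```python
from fractions import Fraction
from math import factorial


def savgol_coefficients(half_window, poly_order, deriv=0):
    """Convolution weights for a window of 2 * half_window + 1 equally spaced points."""
    offsets = range(-half_window, half_window + 1)
    # Design matrix: powers of the offset from the window center.
    A = [[Fraction(k) ** j for j in range(poly_order + 1)] for k in offsets]
    n = poly_order + 1
    # Normal matrix (A^T A) augmented with the identity, for Gauss-Jordan inversion.
    M = [[sum(row[i] * row[j] for row in A) for j in range(n)]
         + [Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    inverse_row = M[deriv][n:]  # row `deriv` of the inverse of A^T A
    # The deriv-th derivative of the fitted polynomial at the center is deriv! times
    # its deriv-th coefficient, which is a weighted sum of the window values.
    return [factorial(deriv) * sum(c * a for c, a in zip(inverse_row, row)) for row in A]


def savgol_apply(values, half_window, poly_order, deriv=0, spacing=1.0):
    """Smooth (deriv=0) or differentiate a spectrum; edge points are dropped."""
    weights = [float(w) for w in savgol_coefficients(half_window, poly_order, deriv)]
    result = []
    for center in range(half_window, len(values) - half_window):
        window = values[center - half_window:center + half_window + 1]
        result.append(sum(w * v for w, v in zip(weights, window)) / spacing ** deriv)
    return result


# Made-up "spectrum" following y = x^2, whose derivatives are known exactly.
quadratic = [float(x * x) for x in range(10)]
first_derivative = savgol_apply(quadratic, half_window=2, poly_order=2, deriv=1)
second_derivative = savgol_apply(quadratic, half_window=2, poly_order=2, deriv=2)
```

With a five-point window and a quadratic polynomial, the smoothing weights computed this way are the classical (-3, 12, 17, 12, -3)/35. Exact fractions are used only to keep the small matrix inversion free of rounding error.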
5. CHEMOMETRIC PATTERN RECOGNITION APPLIED TO PLANT DISCRIMINATION
All chemometric pattern recognition methods have in common the need for a sufficiently large training set of samples to define a decision boundary in the space of response patterns and, consequently, to build a robust mathematical model (Misaki et al., 2010). In this context,
an important aspect is determining how effective the proposed model is. Thus, in chemometrics, including classification and authentication procedures, there is an awareness of the relevance of validating the models with external samples, which must be independent of the calibration with cross-validation dataset (Westad and Marini, 2015). Calibration with cross-validation, followed by an external validation, is the most recommended procedure when developing chemometric models based on latent variables or principal components (PCs). In this case, samples are divided into a calibration with cross-validation set (normally comprising 70% of all samples) and an external validation set (normally comprising the remaining 30%) (Steidle Neto et al., 2017a; Westad and Marini, 2015). Both datasets should contain representative samples. The calibration with cross-validation in PCA is performed in such a way that, after determining the number of PCs to be used (e.g., based on the amount of total variance captured in the first n PCs), the dataset is divided into groups, and models are developed from the reduced data with one of the groups omitted and used to test the model (Berrueta et al., 2007). The prediction residuals are calculated for each developed model, and the process is repeated omitting another group of the calibration with cross-validation dataset, until every group has been left out once. This procedure gives every sample the chance to be used for both training and testing models based on the same preprocessing and pattern recognition techniques. A mean prediction residual is then calculated, yielding the cross-validation error. The model may then be used with the external validation dataset in order to perform independent discriminations, confirming its accuracy (Steidle Neto et al., 2017a). The pattern recognition techniques can be divided into unsupervised and supervised methods.
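The calibration with cross-validation and external validation scheme described above can be sketched generically. The sketch is illustrative only: all names are hypothetical, the random 70/30 split is an assumption (it does not by itself guarantee that both sets are representative), and `fit` and `residual` stand for whatever modeling and error functions a given study uses.

```python
import random


def split_calibration_validation(n_samples, validation_fraction=0.3, seed=0):
    """Randomly split sample indices into calibration and external validation sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_validation = round(n_samples * validation_fraction)
    return indices[n_validation:], indices[:n_validation]


def cross_validation_groups(calibration_indices, n_groups):
    """Divide the calibration samples into nonoverlapping groups."""
    return [calibration_indices[g::n_groups] for g in range(n_groups)]


def cross_validation_error(calibration_indices, n_groups, fit, residual):
    """Leave each group out once, fit on the rest, and average the prediction residuals."""
    groups = cross_validation_groups(calibration_indices, n_groups)
    residuals = []
    for omitted in groups:
        training = [i for g in groups if g is not omitted for i in g]
        model = fit(training)
        residuals.extend(residual(model, i) for i in omitted)
    return sum(residuals) / len(residuals)


# Toy usage: the "model" is just the mean of made-up values for the training samples.
values = [float(i % 4) for i in range(20)]
calibration, validation = split_calibration_validation(len(values))
error = cross_validation_error(
    calibration, 5,
    fit=lambda idx: sum(values[i] for i in idx) / len(idx),
    residual=lambda model, i: abs(values[i] - model),
)
```

The external validation indices are never passed to `cross_validation_error`, which keeps them independent of the calibration step, as the text requires.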
The unsupervised procedures do not require a priori knowledge about the training set samples. That is, samples are grouped into a number of classes without initial qualification of the samples and their class assignment. Supervised pattern recognition requires a priori knowledge about which sample belongs to which class in order to develop a mathematical model. Thus, unsupervised pattern recognition methods comprise exploratory procedures, seeking inherent similarities in the data. They are the first step of the data analysis, used to detect patterns in the measured data. On the other hand, supervised methods group data into predefined classes during the training procedures (Wang and Mizaikoff, 2008). The most widely used unsupervised pattern recognition method is PCA, which allows visualization of the required information, retaining as much as
possible of the information present in the original dataset while reducing the data dimensionality (Vitale, 2017). PCA transforms the original measured variables, by orthogonal linear combination, into new, uncorrelated ones called PCs, which are ordered by the amount of variance explained in each component direction. In classical PCA algorithms the original data matrix is replaced by an estimated one, which is the product of two smaller matrices, called scores and loadings, that recover the variance of the data captured in the first n PCs (Cordella, 2012). The new variables are calculated by matrix transformations performed under orthogonality constraints, which provide unique solutions for the developed model. Thus, the first PC represents the largest amount of variability in the original dataset, and each succeeding component accounts for as much of the remaining variability as possible. The loading matrix represents the correlation of the original variables with the PCs (Vitale, 2017). When the dataset is composed of spectroscopic data, the loadings can be represented in a graph with the wavelengths on the X-axis and the loadings of each PC on the Y-axis. Thus, loading plots are frequently used to identify which wavelengths are more relevant to a given study, since wavelengths or spectral bands that contribute most to distinguishing the groups tend to present more evident peaks and valleys (Cordella, 2012). Fig. 4 presents an example of a loading plot of the first and second PCs of a PCA model built for distinguishing two of the most
Fig. 4 Loading plot of the first two principal components of a PCA model to distinguish two Brazilian sugarcane varieties.
relevant Brazilian sugarcane varieties (RB867515 and RB92579) for the production of sugar, ethanol, and forage. Measurements were performed on the sugarcane stalks by using a spectrometer preconfigured to acquire and store reflectance data in the 450 to 1000 nm range. Spectra were centered and normalized prior to PCA. The score matrix represents the relationship between the PCs and the observed values associated with the samples. That is, it is related to the original regressors and can be used for data exploration or model predictions (Lee et al., 2010). For this reason, PCA results are normally visualized with the score plot of the first two or three PCs, aiming at providing the most efficient representation of the discrimination information (Liu et al., 2010). The score plot is a scatter graph in which each axis represents one of the most important PCs, while the points represent the scores and indicate an arrangement of the data into groups with similar features (Cordella, 2012). Fig. 5 shows an example of a score plot of the above-mentioned PCA model for discriminating Brazilian sugarcane varieties. Results presented by Moura et al. (2016) showed that spectral reflectance in the VIS region has great potential to distinguish lettuce (L. sativa) cultivars with different colors, with PCA reaching an accuracy of 100%. This technique was also successfully used by Rohaeti et al. (2015) for the classification of java turmeric (Curcuma xanthorrhiza), turmeric (Curcuma longa), and cassumunar ginger (Zingiber cassumunar), as well as by Chen et al. (2017) for the identification of medicinal honeysuckle flower (Lonicera japonica) buds of two varieties. Another unsupervised pattern recognition method is cluster analysis (CA), whose goal is to detect similarities between objects and find groups in the data on the basis of calculated distances between objects.
There are a number of clustering algorithms, which can be divided into partitional, hierarchical, graph-based, or density-based approaches (Gautam et al., 2015). When working with spectroscopy, partitional and hierarchical clustering algorithms are the most common. Partitional clustering optimizes a particular criterion function to divide the dataset into nonoverlapping subsets without repeated objects. In hierarchical clustering, a set of nested clusters is organized as a tree, acting as a sequence of partitional clusterings (Wolf and Kirschner, 2013). The results obtained by Cao et al. (2017) showed that IR spectroscopy with PCA and CA can be used to discriminate moss (Rhodobryum roseum) from adulterants. Another study was conducted by Wei et al. (2015) to authenticate the orchid (Dendrobium officinale) based on NIR spectroscopy, PCA, and CA, revealing significant variations between the studied orchid and other species.
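Hierarchical clustering based on calculated distances between objects, as described above, can be sketched with a simple agglomerative procedure. The sketch is illustrative: the function names are hypothetical, and Euclidean distance with average linkage is only one of several possible choices, assumed here for simplicity.

```python
from math import dist


def average_linkage(cluster_a, cluster_b, points):
    """Mean Euclidean distance between all pairs of members of two clusters."""
    total = sum(dist(points[i], points[j]) for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def agglomerative_clustering(points, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # start with one cluster per object
    merges = []  # the sequence of merges defines the tree of nested clusters
    while len(clusters) > n_clusters:
        pairs = [(a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))]
        a, b = min(pairs, key=lambda p: average_linkage(clusters[p[0]], clusters[p[1]], points))
        merges.append((clusters[a], clusters[b]))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters, merges


# Made-up two-dimensional objects (for example, scores on two PCs) forming two groups.
objects = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0), (5.1, 5.0)]
clusters, merges = agglomerative_clustering(objects, n_clusters=2)
```

Stopping the merging at different numbers of clusters yields the nested partitions that a dendrogram would display.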
Fig. 5 Score plot of the first two principal components of a PCA model to distinguish two Brazilian sugarcane varieties.
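The PCA quantities discussed in this section (loadings, scores, and the variance explained by each PC) can be illustrated with a small sketch. It is not the algorithm used to produce Figs. 4 and 5: the function name is hypothetical, the PCs are found by power iteration with deflation (chosen here only because it needs no external library), and the loadings are returned as unit-length eigenvectors of the covariance matrix.

```python
from math import sqrt


def pca(spectra, n_components=2, n_iterations=500):
    """Return (scores, loadings, explained_variance_ratio) after mean-centering."""
    n, p = len(spectra), len(spectra[0])
    means = [sum(column) / n for column in zip(*spectra)]
    X = [[v - m for v, m in zip(row, means)] for row in spectra]
    # Covariance matrix of the variables (wavelengths), size p x p.
    cov = [[sum(r[i] * r[j] for r in X) / (n - 1) for j in range(p)] for i in range(p)]
    total_variance = sum(cov[i][i] for i in range(p))
    loadings, ratios = [], []
    for _component in range(n_components):
        # Power iteration; assumes the start vector is not orthogonal to the PC sought.
        v = [1.0 / sqrt(p)] * p
        for _step in range(n_iterations):
            w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
            norm = sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(p)) for i in range(p))
        loadings.append(v)
        ratios.append(eigenvalue / total_variance)
        # Deflation: remove the variance already captured by this PC.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(p)] for i in range(p)]
    # Scores: projection of each centered sample onto each loading vector.
    scores = [[sum(x * l for x, l in zip(row, loading)) for loading in loadings] for row in X]
    return scores, loadings, ratios


# Made-up data with most of its variance along the first variable.
data = [(2.0, 0.0), (-2.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
scores, loadings, ratios = pca(data)
```

Plotting each row of `loadings` against wavelength gives a loading plot, and plotting the first column of `scores` against the second gives a score plot. For real spectra with hundreds of wavelengths, a linear algebra library would normally replace this loop-based sketch.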
Considering the supervised pattern recognition methods, several kinds have been applied in plant classification and authentication. Each of these methods has its own algorithm for determining how best to discriminate the different groups (Ballabio and Todeschini, 2009). Among the supervised methods, linear discriminant analysis (LDA) (Gautam et al., 2015), partial least squares discriminant analysis (PLS-DA) (Wei et al., 2015), soft independent modeling of class analogy (SIMCA) (Brereton, 2015), support vector machine (SVM) (Li et al., 2015), and artificial neural networks (ANNs) (Jain et al., 2014) are highlighted. The performances of the supervised models are normally evaluated by confusion matrices, which represent the number of observations attributed to each class during the discrimination process (represented by the matrix rows) compared to the real class membership of the observations (represented by the matrix columns) (Gutierrez et al., 2015; Mir-Marques et al., 2016). The diagonals of the confusion matrices contain the correct classification percentages and should ideally be equal to 100%. Misclassifications between the classes are shown by the off-diagonal cells. Confusion matrices should be
computed for both the calibration/cross-validation and external validation datasets. Results of supervised methods can also be analyzed in terms of score or loading plots, depending on the technique used during the analysis (Atlabachew et al., 2015; Loewe et al., 2007). LDA is based on the determination of linear discriminant functions that maximize the ratio of between-class variance to within-class variance. LDA is a probabilistic parametric classifier, which assumes that each class has a multivariate normal distribution and that the dispersion is the same for all the classes (Berrueta et al., 2007). The LDA and PLS-DA techniques were applied by Giovenzana et al. (2014) for developing models to classify fresh-cut corn salad (Valerianella locusta) samples into four classes of freshness based on VIS/NIR data. The average proportion of samples correctly classified using LDA was 95.5%. The objective of Acquah et al. (2016) was to qualitatively classify forest logging residue made up of different loblolly pine (Pinus taeda) parts using NIR spectroscopy and Fourier transform IR spectroscopy combined with PCA and LDA. The models reached classification accuracies greater than 96%, and the study demonstrated that spectroscopy coupled with PCA and LDA has the potential to be used as an efficient tool for classifying the plant part makeup of a batch of forest logging residue feedstock. PLS-DA is another popular supervised method in plant classification and authentication applications because it can deal with highly correlated spectral variables. Similar to PLS regression for quantification, data reduction is conducted by seeking latent variables, which are a representation of the original variables and are calculated in a way that maximizes the covariance with the response (Wei et al., 2015). Thus, samples are always classified into one of the available classes (Ballabio and Consonni, 2013). Manfredi et al.
(2017) investigated the use of a portable Fourier transform IR spectrometer coupled with PCA, LDA, and PLS-DA methods for discriminating raw hazelnuts (Corylus avellana) from different origins and cultivars. The multivariate classification methods allowed accurate discrimination among the groups, with PLS-DA coupled with variable selection providing the best results. Sánchez et al. (2013) evaluated the ability of NIR spectroscopy to classify intact green asparagus (Asparagus officinalis) as a function of organic or conventional growing system, and as a function of harvest month and postharvest cold storage duration. The models were constructed using the PLS-DA method, which correctly classified 91% of samples by growing system, 100% of samples by harvest month, and 97% of samples by postharvest storage time. The authors reported the good performance of the
prediction models, particularly for growing method, which is of considerable importance for the authentication of organic asparagus at the industrial level.
When applying the SIMCA method, classification rules are defined for each group of samples separately by constructing a confidence region that is limited by orthogonal and Mahalanobis distances for the considered training class (Brereton, 2015). If a sample is located within this limited space, it is considered to belong to the class; otherwise, it is considered an outlier (Deconinck et al., 2017). Thus, once the optimal dimensionality of a PCA model has been found, the statistical procedures are applied to set the critical limits. Classifications based on the Mahalanobis distance are generally performed by calculating probabilistic confidence levels from a Fisher F test. Gad et al. (2013) established a model for authentication and quality assessment of thyme species (Thymus vulgaris, Thymus Satureja, Thymus Origanum, Thymus Plectranthus and Thymus Eriocephalus) using ultraviolet spectroscopy together with PCA, CA, and SIMCA methods. The model was also able to classify 12 commercial thyme varieties into clusters of species incorporated in the model as thyme or nonthyme. A methodology for yerba mate (Ilex paraguariensis) authentication was developed by Marcelo et al. (2014) using multivariate analysis of the NIR reflectance spectrum; identification by SIMCA was 100% correct for wavelengths from 2255 to 2316 nm.
The SVM is a supervised method, originally developed for binary classification problems, that uses a portion of the data to train the system and then forms a learning model that can predict the category of samples. This method is based on statistical learning theory and structural risk minimization principles. That is, the SVM algorithm searches for an optimal separating hyperplane that maximizes the distance between the hyperplane and the points lying on the separation margin (Li et al., 2015).
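The idea of an optimal separating hyperplane can be illustrated with a small sketch. It does not train an SVM; it only computes the geometric margin of a few candidate hyperplanes and picks the one with the largest margin, which is the quantity an SVM maximizes. The two-feature samples, the class labels, and the three candidate hyperplanes are hypothetical values chosen for illustration, not data from any of the cited studies.

```python
import math

def geometric_margin(w, b, X, y):
    """Smallest signed distance from any sample to the hyperplane w.x + b = 0.

    The value is positive only when every sample lies on the side
    indicated by its label (+1 or -1).
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        yi * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, yi in zip(X, y)
    )

# Hypothetical samples described by two spectral features, two classes
X = [(1.0, 1.0), (2.0, 1.0), (4.0, 3.0), (5.0, 3.0)]
y = [-1, -1, 1, 1]

# Hypothetical candidate hyperplanes given as (w, b)
candidates = {
    "A": ((1.0, 0.0), -3.0),  # x1 = 3
    "B": ((1.0, 1.0), -5.0),  # x1 + x2 = 5
    "C": ((0.0, 1.0), -2.0),  # x2 = 2
}

margins = {name: geometric_margin(w, b, X, y) for name, (w, b) in candidates.items()}
best = max(margins, key=margins.get)  # candidate with the widest margin
```

All three candidates separate the two classes, but they differ in how far the closest samples lie from the boundary. A trained SVM would search over all possible hyperplanes rather than a fixed list of candidates.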
SVM has been used to handle tasks ranging from simple, linear classification to more complex, nonlinear classification problems (Luts et al., 2010). A classification model was built by Ying et al. (2016) to determine and classify the geographical origin of ginseng (Panax ginseng) samples using their chemical elemental compositions. For this, SVM and Gaussian process classification models, together with microwave plasma torch-atomic emission spectroscopy, were evaluated regarding their prediction accuracy for unknown ginseng samples. SVM proved valid in the classification of individual types of ginseng with 99.81% accuracy, compared to Gaussian process classification with 71.67%. SVM together with Fourier transform NIR spectroscopy was also used by Teye et al. (2014) for cocoa (Theobroma cacao) bean authentication.
In this study, the SVM model accurately discriminated the cocoa bean samples between fermented, unfermented, and adulterated, reaching an optimal identification rate of 100% in both the training and prediction sets.
Finally, ANNs are increasingly used in several chemical applications and can now be considered an important emerging tool in chemometrics (Ballabio and Todeschini, 2009). An ANN is defined as a data processing system consisting of mathematical units called neurons, which work in a manner similar to the human brain. A neuron is a nonlinear, parameterized function that produces an output from variables called inputs. Computationally, ANNs are capable of performing nonlinear mappings, acting as black boxes with multiple inputs and producing several outputs. This implies that ANNs can operate with a large number of neurons. There are several classes of ANN, depending on their learning mechanisms (Jain et al., 2014). Gutierrez et al. (2015) presented SVM and ANN modeling for grapevine (V. vinifera) varietal classification from in-field leaf NIR spectroscopy. The authors applied a sequential minimal optimization algorithm working with a polynomial kernel to train the SVM for varietal classification, and a multilayer perceptron as the ANN classification method. The accuracy shown by both SVM and ANN models, especially when the number of classes was as high as 20, along with the ability to properly train the models from heterogeneous sources, suggests that the NIR range is suitable for in-field grapevine varietal discrimination. Wang et al. (2015) developed a classification model based on ANN and hyperspectral data from 350 to 2500 nm, capable of differentiating five coniferous species (Pinus nigra, Pinus massoniana, Pinus elliottii, Cedrus, and Pseudolarix amabilis) endemic to China, and reached classification accuracy greater than 86%.
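The description of a neuron as a nonlinear, parameterized function of its inputs can be made concrete with a minimal sketch of a forward pass through a one-hidden-layer perceptron. The tanh activation, the softmax output, the layer sizes, the weights, and the three "reflectance" inputs are all assumptions chosen for illustration; the weights are untrained, and nothing here reproduces the networks used in the studies cited above.

```python
import math

def neuron(inputs, weights, bias):
    """One neuron: a nonlinear (tanh) function of a weighted sum of its inputs."""
    return math.tanh(sum(w * x for w, x in zip(weights, inputs)) + bias)

def softmax(values):
    """Convert raw output scores into class probabilities that sum to 1."""
    shift = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def mlp_forward(spectrum, hidden_layer, output_layer):
    """Forward pass of a one-hidden-layer perceptron.

    Each layer is a list of (weights, bias) pairs, one pair per neuron.
    """
    hidden = [neuron(spectrum, w, b) for w, b in hidden_layer]
    scores = [sum(wi * h for wi, h in zip(w, hidden)) + b for w, b in output_layer]
    return softmax(scores)

# Hypothetical, untrained parameters: 3 inputs, 2 hidden neurons, 2 classes
hidden_layer = [([0.5, -0.2, 0.1], 0.0), ([-0.3, 0.8, 0.4], 0.1)]
output_layer = [([1.0, -1.0], 0.0), ([-1.0, 1.0], 0.0)]
spectrum = [0.42, 0.37, 0.55]  # made-up reflectance values at three wavelengths

probabilities = mlp_forward(spectrum, hidden_layer, output_layer)
predicted_class = probabilities.index(max(probabilities))
```

In practice the weights and biases would be fitted to labeled spectra by a learning algorithm, and the input vector would hold many more wavelengths, usually after preprocessing.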
6. CONCLUSION
The use of spectroscopy combined with multivariate methods and statistics is an efficient way to differentiate plant varieties and cultivars, as well as to authenticate plant materials. One limitation of using spectroscopy combined with chemometrics is that the resulting model is sometimes calibrated and validated according to a reference method, which requires conventional laboratory analysis during model development and results in destructive, time-consuming, and labor-demanding procedures. However, once calibrated and validated, the chemometric spectral model can improve agricultural practices, optimize production lines in
agrofood industries and crop fields, as well as guarantee to consumers the origin and credibility of the plants without the risks associated with chemical preparation of the samples, quickly and with reduced costs of analysis. Thus, in recent years, chemometrics associated with spectral data has been the focus of several research groups in a broad range of applications for plant classification and authentication, and its use tends to expand in the future as an alternative to the traditional analytical methods.
REFERENCES
Acquah, G.E., Via, B.K., Billor, N., Fasina, O.O., Eckhardt, L.G., 2016. Identifying plant part composition of forest logging residue using infrared spectral data and linear discriminant analysis. Sensors 16 (9), 1375–1380.
Agelet, L.E., 2011. Single Seed Discriminative Applications Using Near Infrared Technologies. Graduate Theses and Dissertations, Paper 12023, Iowa.
Aouidi, F., Dupuy, N., Artaud, J., Roussos, S., Msallem, M., Perraud-Gaime, I., Hamdi, M., 2012. Discrimination of five Tunisian cultivars by Mid InfraRed spectroscopy combined with chemometric analyses of olive Olea europaea leaves. Food Chem. 131 (1), 360–366.
Atlabachew, M., Combrink, S., Sandasi, M., Chen, W., Viljoen, A., 2015. Rapid differentiation of Khat (Catha edulis Vahl. Endl.) using single point and imaging vibrational spectroscopy. Vib. Spectrosc. 81, 96–105.
Ballabio, D., Consonni, V., 2013. Classification tools in chemistry. Part 1: linear models. PLS-DA. Anal. Methods 5 (16), 3790–3798.
Ballabio, D., Todeschini, R., 2009. Multivariate Classification for Qualitative Analysis, Infrared Spectroscopy for Food Quality Analysis and Control. Elsevier, New York.
Berrueta, L.A., Alonso-Salces, R.M., Heberger, K., 2007. Supervised pattern recognition in food analysis. J. Chromatogr. A 1158, 196–214.
Bona, E., Marquettim, I., Link, J.V., Makimori, G.Y.F., Arca, V.C., Lemes, A.L.G., Ferreira, J.M.F., Scholz, M.B.S., Valderrama, P., Poppi, R.J., 2017. Support vector machines in tandem with infrared spectroscopy for geographical classification of green arabica coffee. Food Sci. Technol. 76, 330–336.
Brereton, R.G., 2015. Pattern recognition in chemometrics. Chemom. Intell. Lab. Syst. 149, 90–96.
Cao, Z., Wang, Z., Shang, Z., Zhao, J., 2017. Classification and identification of Rhodobryum roseum Limpr. and its adulterants based on Fourier-transform infrared spectroscopy (FTIR) and chemometrics. PLoS One 12 (2).
Chaves, M.M., Flexas, J., Pinheiro, J.C., 2009.
Photosynthesis under drought and salt stress: regulation mechanisms from whole plant to cell. Ann. Bot. 103, 551–560.
Chen, Y., Deng, J., Wang, Y., Liu, B., Ding, J., Mao, X., Zhang, J., Hu, H., Li, J., 2014. Study on discrimination of white tea and albino tea based on near-infrared spectroscopy and chemometrics. J. Sci. Food Agric. 94 (5), 1026–1033.
Chen, J., Guo, B., Yan, R., Sun, S., Zhou, Q., 2017. Rapid and automatic chemical identification of the medicinal flower buds of lonicera plants by the benchtop and hand-held Fourier transform infrared spectroscopy. Spectrochim. Acta A Mol. Biomol. Spectrosc. 182, 81–86.
Cordella, C.B.Y., 2012. PCA: the basic building block of chemometrics. In: Analytical Chemistry. InTech.
Cozzolino, D., Cynkar, W.U., Shah, N., Smith, P., 2011. Multivariate data analysis applied to spectroscopy: potential application to juice and fruit quality. Food Res. Int. 44, 1888–1896.
Deconinck, E., Sokeng Djiogo, C.A., Courselle, P., 2017. Chemometrics and chromatographic fingerprints to classify plant food supplements according to the content of regulated plants. J. Pharm. Biomed. Anal. 143, 48–55.
Diago, M.P., Fernandes, A.M., Millan, B., Tardaguila, J., Melo-Pinto, P., 2013. Identification of grapevine varieties using leaf spectroscopy and partial least squares. Comput. Electron. Agric. 99, 7–13.
Domingues, D.S., Takahashi, H.W., Camara, C.A.P., Nixdorf, S.L., 2012. Automated system developed to control pH and concentration of nutrient solution evaluated in hydroponic lettuce production. Comput. Electron. Agric. 84, 53–61.
Gad, H., El-Ahmady, S.H., Abou-Shoer, M.I., Al-Azizi, M.M., 2013. A modern approach to the authentication and quality assessment of thyme using UV spectroscopy and chemometric analysis. Phytochem. Anal. 24, 520–526.
Gautam, R., Vanga, S., Ariese, F., Umapathy, S., 2015. Review of multidimensional data processing approaches for Raman and infrared spectroscopy. EPJ Tech. Instrum. 28, 2–38.
Giovenzana, V., Beghi, R., Buratti, S., Civelli, R., Guidetti, R., 2014. Monitoring of fresh-cut Valerianella locusta Laterr. shelf life by electronic nose and Vis-NIR spectroscopy. Talanta 120, 368–375.
Gutierrez, S., Tardaguila, J., Fernández-Novales, J., Diago, M.P., 2015. Support vector machine and artificial neural network models for the classification of grapevine varieties using a portable NIR spectrophotometer. PLoS One 10 (11), 1–15.
Jain, S., Goel, Y., Singhal, D., 2014. Study of artificial neural network. Int. J. Innov. Adv. Comput. Sci. 2 (4), 9–12.
Jurado-Expósito, M., López-Granados, F., Atenciano, S., García-Torres, L., González-Andújar, J.L., 2003. Discrimination of weed seedlings, wheat (Triticum aestivum) stubble and sunflower (Helianthus annuus) by near-infrared reflectance spectroscopy (NIRS). Crop Prot. 22, 1177–1180.
Lee, S., Zou, F., Wright, F.A., 2010.
Convergence and prediction of principal component scores in high-dimensional settings. Ann. Stat. 38 (6), 3605–3629.
Li, X., He, Y., 2008. Discriminating varieties of tea plant based on Vis/NIR spectral characteristics and using artificial neural networks. Biosyst. Eng. 99, 313–321.
Li, H., Chung, F., Wang, S., 2015. A SVM based classification method for homogeneous data. Appl. Soft Comput. 36, 228–235.
Liu, Z.Y., Shi, J.J., Zhang, L.W., Huang, J.F., 2010. Discrimination of rice panicles by hyperspectral reflectance data based on principal component analysis and support vector classification. J. Zhejiang Univ. Sci. B 11 (1), 71–78.
Loewe, V., Navarro-Cerrilo, R.M., Garcia-Olmo, J., Riccioli, C., Sánchez-Cuesta, R., 2007. Discriminant analysis of Mediterranean pine nuts (Pinus pinea L.) from Chilean plantations by near infrared spectroscopy (NIRS). Food Control 73, 634–643.
Lopes, D.C., Moura, L.O., Neto, A.J., Ferraz, L.C.L., Carlos, L.A., Martins, L.M., 2017. Spectral indices for non-destructive determination of lettuce pigments. Food Anal. Methods, 1–8.
Luo, J., Ying, K., Bai, J., 2005. Savitzky–Golay smoothing and differentiation filter for even number data. Signal Process. 85, 1429–1434.
Luts, J., Ojeda, F., Van de Plas, R., Moor, B., Van Huffel, S., Suykens, J.A.K., 2010. A tutorial on support vector machine-based methods for classification problems in chemometrics. Anal. Chim. Acta 665 (2), 129–145.
Makky, M., Soni, P., 2014. In situ quality assessment of intact oil palm fresh fruit bunches using rapid portable non-contact and non-destructive approach. J. Food Eng. 120, 248–259.
Manfredi, M., Robotti, E., Quasso, F., Mazzucco, E., Calabrese, G., Marengo, E., 2017. Fast classification of hazelnut cultivars through portable infrared spectroscopy and chemometrics. Spectrochim. Acta A Mol. Biomol. Spectrosc. 189, 427–435.
Marcelo, M.C.A., Martins, C.A., Pozebon, D., Ferrão, M.F., 2014. Methods of multivariate analysis of NIR reflectance spectra for classification of yerba mate. Anal. Methods 6, 7621–7627.
Mir-Marques, A., Elvira-Sáez, C., Cervera, M.L., Garrigues, S., Guardia, M., 2016. Authentication of protected designation of origin artichokes by spectroscopy methods. Food Control 59, 74–81.
Misaki, M., Kim, Y., Bandettini, P.A., Kriegeskorte, N., 2010. Comparison of multivariate classifiers and response normalizations for pattern-information fMRI. NeuroImage 53 (1), 103–118.
Moros, J., 2010. Vibrational spectroscopy provides a green tool for multi-component analysis. TrAC Trends Anal. Chem. 29 (7), 578–591.
Moshou, D., Pantazi, X.E., Kateris, D., Gravalos, I., 2014. Water stress detection based on optical multisensor fusion with a least squares support vector machine classifier. Biosyst. Eng. 117, 15–22.
Moura, L.O., Lopes, D.C., Steidle Neto, A.J., Ferraz, L.C.L., Carlos, L.A., Martins, L.M., 2016. Evaluation of techniques for automatic classification of lettuce based on spectral reflectance. Food Anal. Methods 9, 1799–1806.
Nisgoski, S., Schardosin, F.Z., Batista, F.R.R., Muniz, G.I.B., 2016. Potential use of NIR spectroscopy to identify Cryptomeria japonica varieties from southern Brazil. Wood Sci. Technol. 50 (1), 71–80.
Ponzoni, F.J., Shimabukuro, Y.E., Kuplich, T.M., 2012. Sensoriamento Remoto da Vegetação. Oficina de Textos, São Paulo.
Porker, K., Zerner, M., Cozzolino, D., 2017. Classification and authentication of barley (Hordeum vulgare) malt varieties: combining attenuated total reflectance mid-infrared spectroscopy with chemometrics. Food Anal. Methods 10, 675–682.
Prospere, K., McLaren, K., Wilson, B., 2014. Plant species discrimination in a tropical wetland using in situ hyperspectral data. Remote Sens. (Basel) 6 (9), 8494–8523.
Rinnan, A., van den Berg, F.W.J., Engelsen, S.B., 2009.
Review of the most common pre-processing techniques for near-infrared spectra. Trends Anal. Chem. 28 (10), 1201–1222.
Rodionova, O.Y., Titova, A.V., Pomerantsev, A.L., 2016. Discriminant analysis is an inappropriate method of authentication. Trends Anal. Chem. 78, 17–22.
Rohaeti, E., Rafim, M., Syafitri, U.D., Heryanto, R., 2015. Fourier transform infrared spectroscopy combined with chemometrics for discrimination of Curcuma longa, Curcuma xanthorrhiza and Zingiber cassumunar. Spectrochim. Acta A Mol. Biomol. Spectrosc. 137, 1244–1249.
Rubert, J., Lacina, O., Zachariasova, M., Hajslova, J., 2016. Saffron authentication based on liquid chromatography high resolution tandem mass spectrometry and multivariate analysis. Food Chem. 204, 201–209.
Sánchez, M.T., Garrido-Varo, A., Guerrero, J.E., Perez-Marín, D., 2013. NIRS technology for fast authentication of green asparagus grown under organic and conventional production systems. Postharvest Biol. Technol. 85, 116–123.
Savitzky, A., Golay, M.J.E., 1964. Smoothing and differentiation of data by simplified least squares procedures. Anal. Chem. 36, 1627–1639.
Shao, Y., Xie, C., Jiang, L., Shi, J., Zhu, J., He, Y., 2015. Discrimination of tomatoes bred by spaceflight mutagenesis using visible/near infrared spectroscopy and chemometrics. Spectrochim. Acta A Mol. Biomol. Spectrosc. 140, 431–436.
Steidle Neto, A.J., Grossi, J.A.S., Lopes, D.C., Anastácio, E.A., 2009. Potential of spectral reflectance as postharvest classification tool for flower development of calla lily (Zantedeschia aethiopica (L.) Spreng.). Chil. J. Agric. Res. 69 (4), 588–592.
Steidle Neto, A.J., Moura, L.O., Lopes, D.C., Carlos, L.A., Martins, L.M., Ferraz, L.C.L., 2017a. Non destructive prediction of pigment content in lettuce based on visible–NIR spectroscopy. J. Sci. Food Agric. 97 (7), 2015–2022.
Steidle Neto, A.J., Lopes, D.C., Silva, T.G.F., Ferreira, S.O., Grossi, J.A.S., 2017b. Estimation of leaf water content in sunflower under drought conditions by means of spectral reflectance. Eng. Agric. Environ. Food 10, 104–108.
Steidle, A.J., Neto, J.A.S.G., Finger, F.L., Ferreira, S.O., 2009. Identification of gladiolus cultivars based on spectral reflectance of petals. Acta Hortic. (847), 309–312.
Symonds, P., Paap, A., Alameh, K., Rowe, J., Miller, C., 2015. A real-time plant discrimination system utilising discrete reflectance spectroscopy. Comput. Electron. Agric. 117, 57–69.
Teye, E., Huang, X.Y., Lei, W., Dai, H., 2014. Feasibility study on the use of Fourier transform near-infrared spectroscopy together with chemometrics to discriminate and quantify adulteration in cocoa beans. Food Res. Int. 55, 288–293.
Vitale, R., 2017. Novel Chemometric Proposals for Advanced Multivariate Data Analysis, Processing and Interpretation. Universitat Politècnica de València.
Wang, L., Mizaikoff, B., 2008. Application of multivariate data-analysis techniques to biomedical diagnostics based on mid-infrared spectroscopy. Anal. Bioanal. Chem. 391 (5), 1641–1654.
Wang, X., Zeng, Y., Wang, S., Zhao, T., 2015. Identification of conifer species based on laboratory spectroscopy and an artificial neural network. Int. J. Softw. Eng. 9 (2), 362–372.
Wang, Y.Z., Dong, W.Y., Kouba, A.J., 2016. Fast discrimination of bamboo species using Vis/NIR spectroscopy. J. Appl. Spectrosc. 83 (5), 826–831.
Wei, Y., Fan, W., Zao, X., Wu, W., Lu, H., 2015. Rapid authentication of Dendrobium officinale by near-infrared reflectance spectroscopy and chemometrics. Anal. Lett. 49 (5), 817–829.
Westad, F., Marini, F., 2015. Validation of chemometric models: a tutorial. Anal. Chim. Acta 893, 14–24.
Wolf, A., Kirschner, K.N., 2013. Principal component and clustering analysis on molecular dynamics data of the ribosomal L11 23S subdomain. J. Mol. Model. 19, 539–549.
Wong, H.Y., Hu, B., So, P.K., Chan, C.O., Mok, D.K.H., Xin, G.Z., Li, P., Yao, Z.P., 2016. Rapid authentication of Gastrodiae rhizoma by direct ionization mass spectrometry. Anal. Chim. Acta 938, 90–97.
Ying, Y., Jin, W., Yu, B., Lv, S., Wu, X., Yu, H., Shan, J., Zhu, S., Jin, Q., Mu, Y., 2016. Support vector machine classification for determination of geographical origin of Chinese ginseng using microwave plasma torch-atomic emission spectrometry. Anal. Methods 8, 5079–5086.
Zhang, L., Wang, S.S., Ding, Y.F., Pan, J.R., Zhu, C., 2015. Discrimination of transgenic rice based on near infrared reflectance spectroscopy and partial least squares regression discriminant analysis. Rice Sci. 22 (5), 245–249.