RANDOM FOREST AS ONE-CLASS CLASSIFIER AND INFRARED SPECTROSCOPY FOR FOOD ADULTERATION DETECTION
Felipe Bachion de Santana a, Waldomiro Borges Neto b, Ronei J. Poppi a,*

a Institute of Chemistry, University of Campinas, 13084-971 Campinas, SP, Brazil
b Institute of Chemistry, Federal University of Uberlândia, 38408-100, Brazil
Abstract
This paper proposes the use of random forest for adulteration detection purposes, combining the random forest algorithm with the artificial generation of outliers from the authentic samples. This proposal was applied in two food adulteration studies: evening primrose oils using ATR-FTIR spectroscopy and ground nutmeg using NIR diffuse reflectance spectroscopy. The primrose oil was adulterated with soybean, corn and sunflower oils, and the model was validated using these adulterated oils and other different oils, such as rosehip and andiroba, in pure and adulterated forms. The ground nutmeg was adulterated with cumin, commercial monosodium glutamate, soil, roasted coffee husks and wood sawdust. For the primrose oil, the proposed method presented performance superior to PLS-DA and similar to SIMCA, while for the ground nutmeg the random forest was superior to both PLS-DA and SIMCA. In addition, in both applications using the random forest, no sample was excluded from the external validation set.
Keywords: one-class; random forest; artificial outliers; food adulteration; infrared spectroscopy
*Corresponding author. Email: [email protected]
1. Introduction

Adulteration is a known problem in several areas, such as food, spices, and cosmetics. Adulterations are usually performed by adding an adulterant to the authentic (target) product, thereby generating a mixture of worse quality. Among the reasons for the growth of adulteration, we can highlight the increase in world trade and emerging novel markets, as well as the steady increase in food, spice and cosmetic prices worldwide. Additionally, there is a need to supply the demand for the authentic/target product due to low production (Manning & Soon, 2014). The detection of adulteration is a crucial factor for commerce and industry due to economic reasons, the necessity of ensuring the use of pure (non-adulterated/authentic) and safe ingredients at all levels of production, and legal compliance. The major agrofood commodities subjected to adulteration are edible oils, spices, honey, meat and milk products (Lohumi et al., 2017; Lohumi, Lee, Lee & Cho, 2015). Of these commodities, edible oils are adulterated with oils of lower economic value than the authentic ones, while spices are adulterated by adding spices of low economic value or remnants of branches, soil, coffee husks and other materials (September, 2011). Spices are particularly susceptible to adulteration because they are often sold in powdered form, and they have long and complicated supply chains, making their quality control difficult (Lohumi et al., 2015). In the case of oil adulteration, the Codex Alimentarius Specification for Fats and Oils lists the range of fatty acids and physical parameters for various oils and fats. However, due to the similarity of the authentic oil and its respective blend (adulterated oil), the comparison of fatty acid composition and physical parameters may not be sufficient to identify adulterations in the oil (FAO, 2001; Gliszczyńska-Świgło & Chmielewski, 2017; Jiménez-Carvelo, González-Casado, Pérez-Castaño & Cuadros-Rodríguez, 2017). For spice adulteration, the American Spice Trade Association (ASTA) and the Indian Standards Institution (ISI) have recommended several methods of analysis to identify adulterated spices. These methods are based on density, microscopic probing and specific chemical analyses (American Spice Trade Association - ASTA, 2016). However, density methods may not be sufficient to identify adulterations in specific spices, and methods based on microscopic examination can be extremely time-consuming for screening large numbers of samples. Due to the necessity of detecting adulteration, several papers in the literature employ different analytical techniques, mainly chromatography and spectroscopy, to quantify or identify specific adulterants in oils and spices (Merás, Manzano, Rodríguez & de la Peña, 2018; Hong et al., 2017; Shi et al., 2018). Among the spectroscopic techniques, we can highlight work using Fourier transform near infrared (FT-NIR) (Haughey, Galvin-King, Ho, Bell & Elliott, 2015), Fourier transform infrared spectroscopy coupled with attenuated total reflectance (FTIR-ATR) (Dupuy, Molinet, Mehl, Nanlohy, Le Dréau & Kister, 2013; Varliklioz Er, Eksi-Kocak, Yetim & Boyaci, 2017), Raman (Haughey et al., 2015; Varliklioz Er et al., 2017) and fluorescence spectroscopy (Merás et al., 2018). These spectroscopic techniques in combination with different chemometric algorithms have become powerful tools for quality control of oil and spice samples.

Most of the chemometric methods applied in adulteration detection use binary or multiclass classification methods, such as partial least squares for discriminant analysis (PLS-DA), k-nearest neighbors (k-NN), linear discriminant analysis (LDA) and support vector machines (SVM) (Oliveri & Downey, 2012). These methods define a delimiter between two or more classes, requiring information regarding the authentic/target and non-authentic/non-target samples in the training step (Oliveri et al., 2014). However, in many food adulteration problems, the interest lies in determining whether the sample is adulterated, independent of the adulterant used. In this case, chemometric methods based on binary or multiclass algorithms are not suitable, since the adulterant may not be known or there is a great possibility of many different adulterants, making it impossible to define all classes. In this sense, in the last several years, there has been a discussion regarding the use of binary or multiclass classification methods in authentication and adulteration problems (Oliveri, 2016; Oliveri & Downey, 2012; Tax & Duin, 2001a). For this type of problem, the use of one-class classification models is recommended, mainly applied to detect whether a new sample resembles the target class (Tax & Duin, 2001b). In this case, we need to distinguish between a set of target samples (authentic/pure/control samples) and all other samples (adulterated or different samples), which are not available during the training step. Pure one-class classifiers use only the target class to build the model. These class-modelling methods can be based on principal component analysis (PCA) or related multivariate data reduction techniques, including soft independent modeling of class analogy (SIMCA), unequal class spaces (UNEQ), potential function techniques (POTFUN) and multivariate statistical process control (MSPC) (Brereton, 2011; Oliveri, Di Egidio, Woodcock, & Downey, 2011). In addition, we can consider modified multivariate calibration methods, such as partial least squares density modeling (PLS-DM) (Oliveri et al., 2014) and one-class partial least squares (OC-PLS) (Pieszczek, Czarnik-Matusewicz & Daszykowski, 2018). Other methods not based on PCA, such as one-class support vector machines (OC-SVM), k-nearest neighbors (k-NN), and one-class random forest, can also be considered (Désir, Bernard, Petitjean & Heutte, 2013; Zhang et al., 2017).

The random forest algorithm emerged in the beginning of the 21st century as a simple and powerful machine learning algorithm (Breiman, 2001), specifically designed to address large and complex nonlinear systems. However, the random forest algorithm cannot be used directly in one-class problems, necessitating the artificial generation of outliers to build the model. In this work, we investigate the application of random forest to one-class problems using a new methodology for artificial generation of outliers. The efficiency of this method was investigated in two food adulteration cases, both using infrared spectroscopy as the analytical technique. In the first case, FTIR-ATR spectroscopy was used to distinguish non-adulterated evening primrose oil from evening primrose oil adulterated with soybean, corn and sunflower oils in several proportions, and from other different vegetable oils. Evening primrose oil (EPO) is a natural product extracted by cold pressing of Oenothera biennis L. seeds. This oil is popularly used as a dietary supplement to minimize the effects of dermatitis, psoriasis, premenstrual and menopausal syndrome, and diabetic neuropathy (Montserrat-de la Paz, Fernández-Arche, Ángel-Martín & García-Giménez, 2014). These beneficial effects are associated with high levels of γ-linolenic acid (Montserrat-de la Paz et al., 2014). Due to the high cost of this oil, adulteration is normally performed by the addition of cheaper edible oils. In the second case, NIR spectroscopy was used to distinguish non-adulterated ground nutmeg from samples adulterated with cumin, commercial monosodium glutamate, soil, roasted coffee husks and wood sawdust in several proportions. Ground nutmeg (GN) is a natural product obtained mainly from the nuts of Myristica fragrans and Myristica argentea. Ground nutmeg is used in numerous recipes, including desserts (fruit cakes, muffins and pies), meats, sauces, potato dishes, beverages (teas and mulled wine) and others (Calliste, Kozlowsky, Duroux, Champavier, Chulia & Trouillas, 2010). In this case, due to the high cost of this spice, adulterations are performed by adding other cheaper spices or products with no added value, such as soil, roasted coffee husks and wood sawdust.
1.1. Random forest

Random forest (RF) is a method based on decision trees, which use rules to split the data into two branches at each node. In classification problems, the main splitting rules are the Gini index, deviance and the twoing rule (Breiman, 2001; MathWorks, 2017). Among these rules, the Gini index (Eq. 1), which measures the node impurity, is the most commonly used:

$\text{Gini} = 1 - \sum_{A} p_A^2$   (1)

where the sum runs over the classes A (here, the target and outlier classes) and $p_A$ is the sample proportion of class A in the node. A small value of the Gini index indicates that a node predominantly contains observations from a single class, or in other words, it is a pure node with good separation between the classes (Cao, Xu, Zhang & Huang, 2012).
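As a minimal numerical illustration of Eq. 1 (our sketch, in Python, with made-up class proportions):

```python
# Minimal sketch of the Gini node impurity in Eq. 1 (class proportions are hypothetical).
def gini(proportions):
    """Gini impurity of a node, given the proportion of each class in that node."""
    return 1.0 - sum(p ** 2 for p in proportions)

print(gini([1.0, 0.0]))  # 0.0 -> pure node (only target samples)
print(gini([0.5, 0.5]))  # 0.5 -> maximally impure node for two classes
```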
Decision tree methods can produce good predictions on the training set, but a tree with many splits will probably overfit the model and consequently lead to poor test set performance. In this case, pruning of the trees must be performed, generating a smaller tree with fewer splits, leading to lower variance and better interpretation and consequently generating feasible results (Breiman, Friedman, Olshen & Stone, 1984; Casella, Fienberg & Olkin, 2006). The main problem is that the correct pruning process in many situations does not produce suitable models. In contrast, random forest uses an ensemble of decision trees without pruning and two powerful randomization processes, bagging (Breiman, 1996) and random feature selection, providing more accurate results and making the model more resistant to overfitting (Breiman, 2001). This machine learning algorithm became popular due to its simplicity of training and tuning parameters, the possibility of fitting nonlinear models and the production of excellent classification results (Breiman, 2001; Cao et al., 2012). In the tree structure, leaves represent class labels, in this case the target or outlier classes, and nodes represent the decision rules that lead the samples to a specific class. The random forest algorithm can be briefly summarized in the following steps:

1. Draw ntrees bootstrap sets from the original training dataset with replacement (bagging). Approximately two-thirds of the training samples are used to grow each tree, and the other one-third is used to perform a cross-validation in parallel with the training step. These samples are called out-of-bag (OOB) samples and can be used to obtain an estimate of the model performance (Breiman, 2001).

2. For each bootstrap set, grow an unpruned tree. In each node, randomly select mtry variables and choose the best split, i.e., the one that provides the lowest Gini index. The tree is grown until no further splits are possible and is not pruned back.

3. Repeat steps 1 and 2 until the number of trees (ntrees) defined by the user is grown.

The number of trees (ntrees) must be sufficiently large for the OOB error to stabilize. In general, 500 trees are sufficient; if a larger number of trees is chosen, the predictions will not differ statistically, but more time will be necessary to build the model (Breiman, 2001). The mtry value can range from 1 to p (the total number of variables). The default value of mtry in classification problems is √p, which contributes to reducing the dimensionality of the data and the calculation time required to build the model (de Santana, Mazivilla, Gontijo, Borges Neto & Poppi, 2018). Finally, each tree will predict a response for each sample, generating a set of answers, and the prediction for each sample will be given by the majority vote of the ensemble of trees (Breiman, 2001).

It is important to emphasize that the random forest algorithm presents interesting properties, such as a high capability for handling mixed or badly unbalanced datasets, flexibility with no formal assumption on the data structure and the ability to address complex non-linear systems. The bootstrapping of samples and the random selection of variables, in conjunction with the considerable number of trees, dampen the influence of noise and anomalous samples. Nevertheless, it is always recommended to evaluate the anomalous samples, which can be identified based on the proximity matrix (Breiman, 2001; de Santana et al., 2018). The evaluation of anomalous samples was performed only in the target class, since there is no reason to evaluate anomalous samples in the outlier class for authentication purposes. To measure the anomaly of each sample, the proximity matrix is first computed (Breiman, 2001) by counting the total number of times that two samples of the same class run through the trees and arrive at the same terminal node; this count is divided by the total number of trees. The resulting measure ranges from 0 to 1, where 1 indicates very similar samples and 0 non-similar samples (Breiman, 2002; Cao et al., 2012). The anomaly measure is obtained by taking the inverse of the average squared proximity; this measure is normalized by subtracting the median of its distribution, taking the absolute value of this difference and dividing by the median absolute deviation. Samples are considered anomalous if they show a much higher value of this measure than the other samples of their respective class, and several studies consider that values above 10 indicate candidate anomalous samples that must be excluded from the model (Breiman, 2002; Cao et al., 2012).

As commented above, RF requires the artificial generation of outliers to be used in authentication problems. In the literature, there are diverse methods for artificial outlier generation (Tax & Duin, 2001b), formulated from different hypotheses about the possible outlier class. In this case, for one-class modeling purposes, the outliers must be uniformly distributed around the target class in all directions and cover the whole domain of variations, resulting in a huge number of artificial outliers (Désir et al., 2013).
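The models in this work were built with MATLAB toolboxes (Section 2.4); purely as an illustration, the sketch below uses Python with scikit-learn and NumPy (our choice of stand-in libraries, not the authors' code) to show the forest settings named above and the proximity-based anomaly measure just described:

```python
# Sketch (not the authors' MATLAB code): random forest training plus the
# proximity-based anomaly measure, using scikit-learn/NumPy as stand-ins.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fit_forest(X, y, ntrees=500):
    """Grow ntrees unpruned trees; mtry defaults to sqrt(p) for classification."""
    rf = RandomForestClassifier(n_estimators=ntrees, max_features="sqrt",
                                oob_score=True, n_jobs=-1, random_state=0)
    return rf.fit(X, y)

def proximity_matrix(rf, X):
    """Fraction of trees in which two samples end up in the same terminal node.
    Typically computed only for the target-class samples."""
    leaves = rf.apply(X)                         # (n_samples, n_trees) leaf indices
    n = X.shape[0]
    prox = np.zeros((n, n))
    for t in range(leaves.shape[1]):
        prox += leaves[:, t][:, None] == leaves[:, t][None, :]
    return prox / leaves.shape[1]

def anomaly_measure(prox):
    """Inverse of the average squared proximity, normalized by median and MAD."""
    n = prox.shape[0]
    raw = n / (prox ** 2).sum(axis=1)            # out(i) = n / sum_j prox(i, j)^2
    med = np.median(raw)
    mad = np.median(np.abs(raw - med))
    return np.abs(raw - med) / mad               # values above ~10 flag anomalous samples
```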
1.2. Artificial outlier generation
In this paper, the outliers were generated with a uniform hyperspherical distribution of radius R around the target class, using a d-dimensional Gaussian distribution as suggested by Tax & Duin (2001b). The generation of the artificial outliers can be briefly summarized as follows:

1. Firstly, a matrix X (number of outliers × number of variables d) is generated from a Gaussian distribution (mean equal to zero and unit variance), and the squared Euclidean distance from each sample $x_i$ to the origin is calculated:

$r_i^2 = \sum_{j=1}^{d} x_{ij}^2$   (2)

2. The values of $r_i^2$ follow a chi-squared distribution with d degrees of freedom (d being the number of variables). Then, $r_i^2$ is transformed into a value $u_i$ between 0 and 1 through the cumulative chi-squared distribution $F_{\chi^2_d}$:

$u_i = F_{\chi^2_d}(r_i^2)$   (3)

3. The value of $u_i$ is rescaled by the power 1/d, such that the resulting radius is distributed as required for a uniform hypersphere for values between 0 and 1, thereby creating the following relationship:

$r_i' = u_i^{1/d}$   (4)

4. Finally, all variables are rescaled:

$z_{ij} = x_{ij} \, r_i' / r_i$   (5)

The matrix Z contains the new values uniformly distributed in a unit hypersphere in d dimensions. Z can be used to generate uniform outliers in any d-dimensional hypersphere by rescaling and shifting the data through multiplication by the radius R (Tax & Duin, 2001b). In other words, we can multiply Z by R and add the result to the mean spectrum to generate the artificial outliers. The R value must be tuned to accommodate the expected outliers.

In the generation of the artificial outliers, it is necessary to pay attention to some points: first, the value of R must be tuned for the correct classification of the target samples in the training and validation sets (Tax & Duin, 2001b). Second, it is necessary to generate enough artificial data to have samples around and near the target class in all feature directions. Finally, the classification method should be robust enough to handle mixed or badly unbalanced datasets, since this procedure generates many more outlier samples than target samples. The random forest algorithm presents properties that are useful in this type of problem, since it can work with unbalanced datasets, besides the possibility of executing the algorithm in parallel, thereby reducing the time required to identify the best radius R of the hypersphere.
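A minimal sketch of these four steps (ours, written in Python with NumPy/SciPy rather than the Data Description Toolbox actually used in this work; variable names are illustrative):

```python
# Sketch of the uniform-hypersphere outlier generation of Tax & Duin (2001b), Eqs. 2-5.
import numpy as np
from scipy.stats import chi2

def uniform_hypersphere(n_outliers, d, rng=None):
    """Points uniformly distributed inside the d-dimensional unit hypersphere."""
    rng = np.random.default_rng(rng)
    x = rng.standard_normal((n_outliers, d))         # Eq. 2: Gaussian matrix X
    r2 = np.sum(x ** 2, axis=1)                      # squared distance to the origin
    u = chi2.cdf(r2, df=d)                           # Eq. 3: values between 0 and 1
    radius = u ** (1.0 / d)                          # Eq. 4: radii for a uniform ball
    return x * (radius / np.sqrt(r2))[:, None]       # Eq. 5: rescale each sample

def artificial_outliers(Z, R, mean_spectrum):
    """Scale the unit-sphere points by R and shift them to the target-class center."""
    return R * Z + mean_spectrum

# Hypothetical usage: 1000 outliers around the mean of pre-processed target spectra X_target.
# Z = uniform_hypersphere(1000, X_target.shape[1])
# outliers = artificial_outliers(Z, R=2.0, mean_spectrum=X_target.mean(axis=0))
```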
2. Methods
2.1. Materials
2.1.1 Oil samples
Forty samples each of evening primrose oil (Oenothera biennis L.) and rosehip oil (Rosa canina) were acquired from local commercial establishments. The andiroba oil (Carapa guianensis Aubl.) samples were acquired from three different producers in the state of Amazonas (n = 20) and two different producers in the state of Roraima (n = 20), for a total of 40 andiroba oil samples. Four distinct commercial brands of each adulterant (soybean, corn and sunflower oils) were obtained in local supermarkets.
2.1.2 Spice samples and their adulterants

Thirty-nine samples of nutmeg (Myristica fragrans) were acquired from local commerce. The adulterants, three samples of cumin (Cuminum cyminum) and one sample of commercial monosodium glutamate, were also acquired from a local commercial establishment. The other adulterants, roasted coffee husks and wood sawdust, were obtained from a farm and a timber industry, respectively.
2.2 Sample preparation
2.2.1 Adulteration of oils

Adulterations of the authentic evening primrose, rosehip and andiroba oils were performed by the addition of edible vegetable oils of lower economic value. For each adulterant and proportion (5, 10, 15 or 20% w/w), 24 samples were randomly selected from the 40 authentic oil samples and adulterated, resulting in 264, 288 and 192 adulterated samples for evening primrose, rosehip and andiroba oils, respectively, as shown in Table 1. The blends were prepared in amber glass flasks and stored in a cool, dry place, away from direct sunlight.
2.2.2 Adulteration of ground nutmeg

Thirty-nine samples of nutmeg were grated and subsequently macerated in a crucible. From this macerated material, 5 nutmeg samples were selected for each adulteration with cumin, commercial monosodium glutamate, soil, roasted coffee husks or wood sawdust. The adulteration was performed by increasing additions of each adulterant in different proportions (3, 5, 10, 30, 40 or 50% (w/w), depending on the adulterant). The proportions of each adulterant are shown in Table 1.

INSERT TABLE 1
2.3. Acquisition of spectra

2.3.1 FT-HATR spectra

The spectra were obtained using a PerkinElmer Spectrum Two spectrometer equipped with a horizontal attenuated total reflectance (HATR) sampling accessory containing a ZnSe crystal, in the range of 680–3100 cm-1, with 4 cm-1 resolution, 16 scans and measurements repeated in quintuplicate. Between spectra, the HATR crystal was cleaned with isopropyl alcohol (QUIMEX, P.A. purity). At every insertion of a new sample, the cleanliness was monitored using PerkinElmer Spectrum software version 3.10. The baseline of each spectrum was corrected, and the initial data matrix consisted of 864 samples, 40 samples of each authentic oil (120 samples in total) and 744 adulterated samples, with 2421 variables per spectrum.
2.3.2 FT-NIR spectra The NIR spectra were collected by a Perkin Elmer Spectrum 100 NIR spectrometer in the range of 1150 to 2500 nm, 0.5 nm spectral resolution, 32 scans and repeated in quintuplicate. The initial data set consisted of 129 samples (39 samples of authentic ground nutmeg and 90 adulterated samples) with 4650 variables per spectrum.
2.4. Data analysis

All calculations were performed in the MATLAB R2016b environment (MathWorks) for Windows 10, using an Intel Core i7-6700K CPU with four cores and eight threads and 64 GB of memory. Random forest models were built using the "Statistics and Machine Learning Toolbox" version 11.0 coupled with the "Parallel Computing Toolbox" version 6.9, both from MathWorks. PLS-DA and SIMCA models were built using the "PLS Toolbox 8.1" from Eigenvector Research. For the generation of the artificial outliers, we used the "Data Description Toolbox" version 2.1.3 (Tax, 2013).

The PLS-DA algorithm requires at least two well-established classes, in this case, authentic samples and adulterated samples. However, since it is not viable to use all possible adulterants in the training set, the most common adulterants can be used to make adulterations in different proportions. In this sense, the training set of the PLS-DA model for EPO was composed of 27 authentic samples of EPO and 175 adulterated EPO samples containing soybean, corn and sunflower oils in different proportions (5 – 20% w/w). The external validation set was composed of 13 authentic samples of EPO and 89 adulterated EPO samples containing soybean, corn and sunflower oils. We also used 560 samples of different authentic oils (andiroba and rosehip) and blends of these oils with soybean, corn and sunflower oils. Eighteen samples of diesel/biodiesel blends were also inserted. Samples of different oils and biodiesels were used to analyze the ability of the model to predict samples of different matrices.

The training set of the RF model was composed of only 20 authentic samples of EPO and 1000 artificial outliers generated from these authentic samples. The external validation set was the same as for PLS-DA; however, the 176 samples that were used in the calibration step of the PLS-DA model were also inserted in the external validation set. For the SIMCA model, the training set was composed of only 20 authentic samples of EPO, and the external validation set was the same as that used for RF.

The training set of the PLS-DA model for ground nutmeg (GN) was composed of 25 authentic samples of GN and 20 samples of GN adulterated with roasted coffee husks and wood sawdust, which are the most common adulterants (September, 2011), in different proportions (3 – 10% w/w). The external validation set was composed of 14 authentic samples of GN, 10 GN samples adulterated with roasted coffee husks and wood sawdust (3 – 10% w/w), 20 GN samples adulterated with cumin (3 – 50% w/w), 20 GN samples adulterated with soil (3 – 40% w/w) and 20 GN samples adulterated with commercial monosodium glutamate (3 – 30% w/w).

The training set of the RF model was composed of 15 authentic samples of GN and 1000 artificial outliers generated from the GN samples. The external validation set was also the same as for PLS-DA; however, similar to the previous case, the 20 adulterated samples that were used in the training step of the PLS-DA model were inserted in the external validation set. The same idea used for the RF model of EPO was applied here: samples containing adulterants different from those present in the PLS-DA training set were used to analyze the ability of the model. The SIMCA model was built with only 15 authentic samples of GN in the training step, and the external validation set was the same as that used for RF.

To use random forest in one-class problems, it is necessary to give the same weight to both classes, target (50%) and artificial outliers (50%), to avoid the problem of unbalanced datasets. Since numerous outliers were generated, we used a high weight for the target samples and a small weight for the outlier samples. However, using this strategy, few of the target samples will constitute the OOB set, generating OOB errors that are not representative (MathWorks, 2017). To overcome this problem, an internal validation set composed only of target samples was used to evaluate the performance of the calibration model. To select the most representative samples of the target class, the Kennard-Stone algorithm was used to select samples for the training, internal validation and external validation sets.
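The Kennard-Stone selection mentioned above can be sketched as follows (our illustration, assuming Euclidean distances and initialization with the two most distant samples; implementations may differ in these details):

```python
# Sketch of Kennard-Stone sample selection (Euclidean distances assumed).
import numpy as np

def kennard_stone(X, n_select):
    """Return indices of n_select samples chosen to span the data space uniformly."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Start with the two samples farthest apart.
    selected = list(np.unravel_index(np.argmax(dist), dist.shape))
    remaining = [i for i in range(X.shape[0]) if i not in selected]
    while len(selected) < n_select:
        # Pick the remaining sample whose nearest selected neighbour is farthest away.
        d_min = dist[np.ix_(remaining, selected)].min(axis=1)
        selected.append(remaining.pop(int(np.argmax(d_min))))
    return selected

# Hypothetical usage: 20 training samples out of the 40 authentic EPO spectra.
# train_idx = kennard_stone(X_epo, 20)
```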
To simplify the step of artificial outlier generation and to reduce computational costs, the PCA scores of the target samples were used instead of the whole spectra. The number of principal components was chosen so that the explained variance was greater than 99%. Using this strategy, the artificial outliers were generated in the score space and then, using the PCA loadings, the full spectra were recovered to be used in the RF algorithm.

The R value was varied from 1.0 until a maximum value established by the user was reached. For each value of R, an RF model was developed using the default parameters (ntrees = 500, mtry = √p and node size = 1), and the optimum value of R was selected using the internal validation set. After choosing the optimal value of R, the RF model was built, and the presence of anomalous samples was evaluated. If samples with high anomaly values exist, these samples must be excluded from the model, and a new optimum R value must be obtained. Afterwards, it is necessary to evaluate the representativeness of the artificial outliers used in the training set. This evaluation was accomplished by the analysis of the PCA scores, where it is possible to observe the distribution and the directions of all generated outliers. If the artificial outliers are not representative, then larger values of R must be tested. A summary of the steps used to build the random forest in tandem with artificial generation of outliers as a one-class classifier is shown in Figure 1.
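A compact sketch of this optimization loop (ours, in Python; the helper `unit_ball` reimplements the generation of Section 1.2, and the 50/50 class weighting is approximated here with scikit-learn's balanced class weights):

```python
# Sketch of the R-tuning loop: outliers generated in PCA score space, recovered as
# spectra via the loadings, weighted forest trained, internal-validation sensitivity tracked.
import numpy as np
from scipy.stats import chi2
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

def unit_ball(n, d, rng):
    """Uniform points inside the d-dimensional unit hypersphere (Eqs. 2-5)."""
    x = rng.standard_normal((n, d))
    r = np.sqrt(np.sum(x ** 2, axis=1))
    return x * (chi2.cdf(r ** 2, df=d) ** (1.0 / d) / r)[:, None]

def tune_radius(X_train, X_val, radii, n_out=1000, seed=0):
    """Internal-validation sensitivity (target samples only) for each candidate R."""
    rng = np.random.default_rng(seed)
    pca = PCA(n_components=0.99).fit(X_train)        # scores explaining > 99% of variance
    scores = pca.transform(X_train)
    sensitivities = {}
    for R in radii:
        out_scores = R * unit_ball(n_out, scores.shape[1], rng) + scores.mean(axis=0)
        outliers = pca.inverse_transform(out_scores)  # recover full spectra via loadings
        X = np.vstack([X_train, outliers])
        y = np.array([1] * len(X_train) + [0] * n_out)
        rf = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                                    class_weight="balanced",  # equal total weight per class
                                    n_jobs=-1, random_state=seed).fit(X, y)
        sensitivities[R] = float(np.mean(rf.predict(X_val) == 1))
    return sensitivities

# Hypothetical usage:
# sens = tune_radius(X_train_target, X_internal_val, radii=np.arange(1.0, 10.2, 0.2))
```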
INSERT FIGURE 1

Following these steps, different numbers of artificial outliers were tested: 100, 250, 500, 750, 1000, 1500, 2000 and 3000. For 100 and 250 artificial outliers, the sensitivity did not stabilize at 100% for the OOB and internal validation samples, as presented in Fig. S1. In the range of 500 to 1500 artificial outliers, the sensitivity stabilized and the classification results were similar in the external validation set. For higher numbers of artificial outliers (2000 and 3000), the sensitivity stabilized, but the results were not similar in the external validation set. In this way, the RF models were built using 1000 artificial outlier samples.
3. Results and Discussion
3.1. MIR and NIR Spectra

The MIR spectra were pre-processed using first derivative and smoothing (Savitzky-Golay). The pre-processed spectra of the authentic samples of EPO, the artificial outlier spectra used to build the RF model and the difference between the average of the authentic spectra and the outlier spectra are shown in Figures 2a-c. The MIR spectra present absorption bands common to most vegetable oils, already reported in several studies: at 1500–500 cm-1, at 1720–1750 cm-1 (carbonyl C=O from esters) and at 2800–3050 cm-1 due to C-H, CH2 and =C-H groups (de Santana, Gontijo, Mitsutake, Mazivila, de Souza & Borges Neto, 2016). The artificial outliers generated are very similar to the original MIR spectra, with major variations in the fingerprint region, in the ranges of 1500 to 500 cm-1 and of 3000 to 2800 cm-1.

The NIR spectra were pre-processed using multiplicative scatter correction (MSC) followed by the first derivative. The pre-processed spectra of the authentic GN, the artificial outlier spectra and the difference between the average of the authentic spectra and the outlier spectra are shown in Figs. 2d-f. NIR spectra contain fewer resolved absorption bands than MIR spectra due to broad and overlapping bands and, consequently, are difficult to interpret (Pieszczek et al., 2018). The main absorption bands are at 1400 and 1900 nm due to the O-H group, at 1740 nm due to first overtones of C-H bonds, and at 2200 nm due to phenolic O-H, amide CONH2, amine N-H and aliphatic C-H (Pieszczek et al., 2018; September, 2011). In this case, the spectra of the artificial outliers show large variations across the whole spectral range.
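A brief sketch of these pre-processing steps (ours, in Python with SciPy/NumPy; the Savitzky-Golay window and polynomial order shown are illustrative, not necessarily the values used in this work):

```python
# Sketch of the pre-processing named above: Savitzky-Golay smoothed first derivative
# for the MIR spectra, and MSC followed by a first derivative for the NIR spectra.
import numpy as np
from scipy.signal import savgol_filter

def first_derivative(spectra, window=15, polyorder=2):
    """Savitzky-Golay smoothed first derivative, applied spectrum by spectrum."""
    return savgol_filter(spectra, window_length=window, polyorder=polyorder,
                         deriv=1, axis=1)

def msc(spectra, reference=None):
    """Multiplicative scatter correction against the mean spectrum (or a reference)."""
    ref = spectra.mean(axis=0) if reference is None else reference
    corrected = np.empty_like(spectra)
    for i, spec in enumerate(spectra):
        slope, intercept = np.polyfit(ref, spec, deg=1)  # fit each spectrum to the reference
        corrected[i] = (spec - intercept) / slope
    return corrected

# mir_processed = first_derivative(mir_spectra)
# nir_processed = first_derivative(msc(nir_spectra))
```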
INSERT FIGURE 2
3.2. Optimization of the radius (R) value

The value of R was tuned by analyzing the sensitivity of the developed models. The R value was initially varied from 1.0 to 10.0, in intervals of 0.2. The sensitivity results of the RF models as a function of R are shown in the Supplementary Material, Fig. S1. Small values of R tend to produce many false negative results (target samples classified as outlier samples), while high values of R tend to produce many false positive results (outlier samples classified as target samples). The explanation is that, using small values of R, authentic samples in the validation set can be located outside the hypersphere, while for higher values of R, outlier samples can be located inside the hypersphere. The choice of the optimum R value was performed using the following criterion: first, R must lie in the range in which the sensitivity of the internal validation set presents little or no variation and is approximately 1; then, the second smallest R value in this range is chosen. By minimizing the R value, we also minimize the chance of accepting outlier samples in the target class. Following these criteria, the R values for the EPO and GN samples were 2.0 and 4.8, respectively. As already noted, with higher values of R the distance of the artificial outlier samples from the center of the hypersphere will be greater and, consequently, the spectral variation of the artificial outlier spectra will be greater.
3.3. Evaluation of anomalous samples

After the choice of an optimum R value in the training set, it is necessary to evaluate the presence of anomalous samples. There is no problem in this evaluation for the authentic samples in the training set, since we know that these samples belong to the target class. However, in the external validation set, there is no information about the samples. In this case, the RF model first classifies the unknown sample as target (authentic) or non-target (non-authentic); then, the anomaly measure is computed only for the samples classified as authentic. Fig. 3 shows the anomaly measure for all samples classified as target/authentic in the EPO (Fig. 3a, b) and GN (Fig. 3c, d) models.
INSERT FIGURE 3
Values in Fig. 3 higher than 10 indicate that the corresponding samples are anomalous. There are anomalous EPO samples, which should be excluded from the training model to obtain reliable results (Breiman, 2002). Due to the presence of anomalous samples in the training set, several adulterated samples were misclassified as target/authentic samples in the external validation set (there are only 13 target samples in this set). For the GN samples, there are no anomalous samples in the training and external validation sets, as shown in Figs. 3c and 3d. All samples with anomaly values higher than 10 (3 samples) were excluded from the RF model of EPO, and a new optimum value of R was obtained. In this new model, no anomalous samples were detected, and the optimized R value was 3.6, as shown in the Supplementary Material, Fig. S2.
3.4. Evaluation of artificial outliers generated The last step to build the random forest in tandem with artificial generation of outliers as one-class classifier is the evaluation of the representativeness of the artificial outliers generated. This evaluation was performed through the analysis of PCA scores. Fig. 4 presents the plot of the first two PCA scores of the target, artificial outliers and external validation sets.
INSERT FIGURE 4
We can note in Fig. 4a that, for EPO, the artificial outliers are evenly distributed around and near the target samples in all directions and represent the adulterations carried out on the EPO samples, except for the biodiesel samples. Biodiesel samples are very different, and they are distant from the center of the data. The artificial outlier samples of ground nutmeg (GN) in Fig. 4b are also evenly distributed around and near the target samples in all directions. The artificial outlier samples represent, in general, the adulterations carried out on the GN samples. Analyzing the amplified region in this figure (represented by the black rectangle), we can observe that the target/authentic samples are contained near the center of the data and that there are many artificial outliers near them, indicating that the artificial generation of outliers was also able to represent spectral regions close to the target/authentic samples.

3.5. Evaluation parameters

The performance of the random forest model was compared with that of SIMCA, which is a one-class classifier normally used in chemometrics, and PLS-DA, which is the most widely used binary supervised method for classification in chemometrics (Oliveri & Downey, 2012; Yi et al., 2016). The parameters used to compare the performance of the models were true positives (TP) and negatives (TN), false positives (FP) and negatives (FN), sensitivity and specificity. True positives are target samples that are correctly classified by the model, and false negatives are target samples that are erroneously classified. True negatives are non-target (non-authentic) samples that are classified as non-target samples, and false positives are non-target samples erroneously classified as target samples (Oliveri & Downey, 2012). The sensitivity (Eq. 6) is the ability of the model to correctly classify target samples, in other words, the fraction of target samples classified as target samples, while the specificity (Eq. 7) is the fraction of non-target samples classified as non-target samples by the model:

$\text{Sensitivity} = \mathrm{TP} / (\mathrm{TP} + \mathrm{FN})$   (6)

$\text{Specificity} = \mathrm{TN} / (\mathrm{TN} + \mathrm{FP})$   (7)
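A small sketch of these figures of merit (Eqs. 6 and 7), using the RF external validation counts for EPO reported in Table 2 as an example:

```python
# Sketch of the figures of merit in Eqs. 6 and 7.
def sensitivity(tp, fn):
    """Fraction of target samples classified as target (Eq. 6)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Fraction of non-target samples classified as non-target (Eq. 7)."""
    return tn / (tn + fp)

# RF, EPO external validation (Table 2): 13 true positives, 0 false negatives,
# 841 true negatives and 1 false positive.
print(sensitivity(13, 0))   # 1.0000
print(specificity(841, 1))  # 0.9988...
```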
3.6. Evaluation of the model results
There is no default value for the number of latent variables or principal components to be used in PLS-DA and SIMCA, respectively. For the PLS-DA models, this choice was accomplished through the number of samples correctly classified by cross validation, in a joint analysis with the percentage of variance explained in the X (spectra) and y (class) blocks, while for SIMCA this choice was accomplished by analyzing the percentage of variance explained in the X block and the number of target samples correctly classified in the internal validation set. The identification of anomalous samples in PLS-DA and SIMCA was based on high values of Hotelling T² and Q residuals for the training and test sets at a significance level of 5%. Hotelling T² is related to the distance of the sample from the center of the data, and the Q residuals represent the unmodeled part of the X block (American Society for Testing and Materials - ASTM, 2012). The figures of merit used to compare the RF, PLS-DA and SIMCA models are presented in Table 2.
INSERT TABLE 2
The results of the RF model for EPO were excellent in all three data sets (training, internal and external validation): only one adulterated sample in the external validation set was misclassified as a target/authentic sample, and all the other 854 samples were correctly classified. Similar to RF, SIMCA obtained excellent results in the training, internal and external validation sets, where all target and non-target samples were correctly classified. The PLS-DA model presented excellent results in the training and cross-validation sets. However, in the external validation set, 258 samples were considered anomalous, and 50 adulterated samples were misclassified as authentic samples. In summary, this model presented a false positive rate of 12.22% and could not decide whether the 258 anomalous samples (~38% of the external validation set) were authentic or adulterated. All andiroba oil and biodiesel samples were considered anomalous by the PLS-DA model. Certain samples of rosehip oil and adulterated EPO were also considered anomalous. The 50 misclassified samples were samples of authentic rosehip oil and some rosehip oils adulterated with soybean, corn and sunflower oils (see Supplementary Material, Fig. S3).

The RF model results for GN were also excellent for all three sets, and only one authentic sample in the external validation set was misclassified as an adulterated sample; all the other 103 samples were correctly classified. The SIMCA model presented excellent results in the training and internal validation sets; however, in the external validation set, 9 non-target/non-authentic samples were misclassified as target samples. Among these 9 erroneously classified samples, 6 were ground nutmeg samples adulterated with coffee husks. Similar to SIMCA, the PLS-DA model presented excellent results in the training and cross-validation sets; however, in the external validation set, 4 samples (5.7% of the external validation set) were considered anomalous by the model, and 8 non-target/non-authentic samples were classified as target samples. In short, the PLS-DA model presented a false positive rate of 12.12% and, in addition, could not decide whether the 4 anomalous samples were authentic or non-authentic. The samples considered anomalous by the PLS-DA model were one sample of GN adulterated with soil and 3 samples adulterated with monosodium glutamate. All 8 samples misclassified as authentic were samples of ground nutmeg adulterated with cumin (see Supplementary Material, Fig. S3).

When an authentic sample is classified as non-authentic, it is quite likely that the producer/seller will appeal against the result; therefore, other techniques can be applied to determine whether this sample is authentic. However, when a non-authentic sample is classified as authentic, the producer/seller will not complain about the result of the model. In this sense, the false positive rate must be minimized. Another crucial point concerns the anomalous samples. When a sample is considered anomalous by the classification model, this is an indication that the model is not able to provide information about the sample. In other words, the model is not able to predict whether the sample is authentic or non-authentic, and it is necessary to employ another analytical technique to obtain a result.

From our results, it is possible to confirm that the direct use of the binary/multiclass classification method PLS-DA in authentication problems undoubtedly yields results inferior to SIMCA and to random forest in tandem with artificial generation of outliers. For the PLS-DA models, there is a considerable number of false positive samples (50 for EPO and 8 for GN). In addition, they present a high number of anomalous samples, mainly for the EPO model. It is important to note that the PLS-DA model predicts the samples of the external validation set very well when the adulterants are present in the training set. However, when another adulterant or another similar product is present in the external validation set, the model tends to classify these samples as anomalous or, when this does not happen, they can be misclassified. The RF and SIMCA models present similar results in identifying authentic samples of EPO using FTIR-HATR spectroscopy, with sensitivity equal to 100% and specificity > 99% in the external validation set for both models. For the GN samples, the RF misclassified 1 target sample as a non-target sample, while the SIMCA model misclassified 9 non-target samples as target samples. In addition, no sample was excluded from the external validation set using the SIMCA and RF models. Although PLS-DA does not present the properties necessary to use the artificial outlier generation proposed in this work, a PLS-DA model was also built using the artificial outliers. However, the results were not satisfactory, presenting errors near 50% in the training set and near 70% in the external validation set.
4. Conclusions

This study presented a new methodology for using the random forest algorithm in tandem with artificial generation of outliers as a one-class classifier to verify the authenticity of evening primrose oil (EPO) using FTIR-HATR spectroscopy and of ground nutmeg (GN) samples based on near infrared (NIR) spectroscopy. The proposed methodology showed performance superior to the PLS-DA method, with values of sensitivity and specificity equal to 1.0 and 0.9988 for EPO and 0.9286 and 1.0 for GN, respectively. Compared to SIMCA, the proposed methodology presents similar performance in verifying the authenticity of EPO using FTIR-HATR spectroscopy, while for the GN samples the RF misclassified 1 target sample as a non-target sample and the SIMCA model misclassified 9 non-target samples as target samples. Similar to SIMCA, the proposed methodology had no sample exclusions in the external validation set, and it was developed without any information regarding the adulterants in the training set. In short, the proposed methodology presents results superior to PLS-DA and SIMCA, which are, respectively, the most widely used supervised methods for classification and authentication in chemometrics.

For the application of the random forest as a one-class classifier, the generation of artificial outliers from the target samples was necessary, whereby all possible adulterations were simulated. The algorithm adopted successfully produced artificial samples uniformly distributed around the target class in all directions, covering the whole domain of variations and enabling the development of an adequate random forest model.
Acknowledgments
The authors thank Instituto Nacional de Ciência e Tecnologia de Bioanalítica - INCTBio (proc. FAPESP No. 2014/508673 and proc. CNPq No. 465389/2014), Conselho Nacional de Desenvolvimento Científico e Tecnológico (proc. 303994/2017-7) and Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES, Brazil, Finance Code 001) for financial support.
References

American Spice Trade Association - ASTA. (2016). Guidance from the American Spice Trade Association - Identification and Prevention of Adulteration. Washington: The American Spice Trade Association.
American Society for Testing and Materials - ASTM. (2012). ASTM E1655-05 Standard Practices for Infrared Multivariate Quantitative Analysis. ASTM International: West Conshohocken.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123–140.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
Breiman, L. (2002). Looking inside the black box, 1–35. Part II of a lecture presented at the 277th meeting of the Institute of Mathematical Statistics, Alberta, Canada. http://doi.org/10.1080/10508610802471096
Breiman, L., Friedman, J., Stone, C. J., & Olshen, R. A. (1984). Classification and Regression Trees. New York: Chapman & Hall/CRC.
Brereton, R. G. (2011). One-class classifiers. Journal of Chemometrics, 25(5), 225–246.
Cao, D., Xu, Q., Zhang, L., & Huang, J. (2012). Tree-based ensemble methods and their applications in analytical chemistry. Trends in Analytical Chemistry, 40(2), 158–167.
Calliste, C. A., Kozlowsky, D., Duroux, J. L., Champavier, Y., Chulia, A. J., & Trouillas, P. (2010). A new antioxidant from wild nutmeg. Food Chemistry, 118, 489–496.
Casella, G., Fienberg, S., & Olkin, I. (2006). An Introduction to Statistical Learning (Vol. 102). New York: Springer.
De Santana, F. B., Gontijo, L. C., Mitsutake, H., Mazivila, S. J., de Souza, M. L., & Borges Neto, W. (2016). Non-destructive fraud detection in rosehip oil by MIR spectroscopy and chemometrics. Food Chemistry, 209, 228–233.
De Santana, F. B., Mazivilla, S. J., Gontijo, L. C., Borges Neto, W., & Poppi, R. J. (2018). Rapid discrimination between authentic and adulterated andiroba oil using FTIR-HATR spectroscopy and random forest. Food Analytical Methods, 11(7), 1927–1935.
Désir, C., Bernard, S., Petitjean, C., & Heutte, L. (2013). One class random forests. Pattern Recognition, 46(12), 3490–3506.
Dupuy, N., Molinet, J., Mehl, F., Nanlohy, F., Le Dréau, Y., & Kister, J. (2013). Chemometric analysis of mid infrared and gas chromatography data of Indonesian nutmeg essential oils. Industrial Crops and Products, 43(1), 596–601.
FAO, Food and Agriculture Organization of the United Nations (2001). Codex alimentarius: Section 2. Codex standards for fats and oils from vegetable sources. http://www.fao.org/docrep/004/y2774e/y2774e04.htm. Accessed 5 Jun 2018.
Gliszczyńska-Świgło, A., & Chmielewski, J. (2017). Electronic nose as a tool for monitoring the authenticity of food. A review. Food Analytical Methods, 10(6), 1800–1816.
Haughey, S. A., Galvin-King, P., Ho, Y. C., Bell, S. E. J., & Elliott, C. T. (2015). The feasibility of using near infrared and Raman spectroscopic techniques to detect fraudulent adulteration of chili powders with Sudan dye. Food Control, 48, 75–83.
Hong, E., Lee, S. Y., Jeong, J. Y., Park, J. M., Kim, B. H., Kwon, K., & Chun, H. S. (2017). Modern analytical methods for the detection of food fraud and adulteration by food category. Journal of the Science of Food and Agriculture, 97(12), 3877–3896.
Jiménez-Carvelo, A. M., González-Casado, A., Pérez-Castaño, E., & Cuadros-Rodríguez, L. (2017). Fast-HPLC fingerprinting to discriminate olive oil from other edible vegetable oils by multivariate classification methods. Journal of AOAC International, 100(2), 345–350.
Lohumi, S., Joshi, R., Kandpal, L. M., Lee, H., Kim, M. S., Cho, H., Mo, C., Seo, Y.-K., Rahman, A., & Cho, B. K. (2017). Quantitative analysis of Sudan dye adulteration in paprika powder using FTIR spectroscopy. Food Additives and Contaminants - Part A Chemistry, Analysis, Control, Exposure and Risk Assessment, 34(5), 678–686.
Lohumi, S., Lee, S., Lee, H., & Cho, B. K. (2015). A review of vibrational spectroscopic techniques for the detection of food authenticity and adulteration. Trends in Food Science and Technology, 46(1), 85–98.
Manning, L., & Soon, J. M. (2014). Developing systems to control food adulteration. Food Policy, 49, 23–32.
MathWorks (2017). Statistics and Machine Learning Toolbox user's guide R2017a. The MathWorks, Inc., Apple Hill Drive, pp 4523.
Merás, I. D., Manzano, J. D., Rodríguez, D. A., & de la Peña, A. M. (2018). Detection and quantification of extra virgin olive oil adulteration by means of autofluorescence excitation-emission profiles combined with multi-way classification. Talanta, 178, 751–762.
Montserrat-de la Paz, S., Fernández-Arche, M. A., Ángel-Martín, M., & García-Giménez, M. D. (2014). Phytochemical characterization of potential nutraceutical ingredients from Evening Primrose oil (Oenothera biennis L.). Phytochemistry Letters, 8, 158–162.
Oliveri, P. (2016). Class-modelling in food analytical chemistry: Development, sampling, optimisation and validation issues - A tutorial. Analytica Chimica Acta, 982, 9–19.
Oliveri, P., Di Egidio, V., Woodcock, T., & Downey, G. (2011). Application of class-modelling techniques to near infrared data for food authentication purposes. Food Chemistry, 125(4), 1450–1456.
Oliveri, P., & Downey, G. (2012). Multivariate class modeling for the verification of food-authenticity claims. Trends in Analytical Chemistry, 35, 74–86.
Oliveri, P., López, M. I., Casolino, M. C., Ruisánchez, I., Callao, M. P., Medini, L., & Lanteri, S. (2014). Partial least squares density modeling (PLS-DM) - A new class-modeling strategy applied to the authentication of olives in brine by near-infrared spectroscopy. Analytica Chimica Acta, 851, 30–36.
Pieszczek, L., Czarnik-Matusewicz, H., & Daszykowski, M. (2018). Identification of ground meat species using near-infrared spectroscopy and class modeling techniques – Aspects of optimization and validation using a one-class classification model. Meat Science, 139, 15–24.
Shi, T., Zhu, M. T., Chen, Y., Yan, X. L., Chen, Q., Wu, X. L., Lin, J., & Xie, M. (2018). 1H NMR combined with chemometrics for the rapid detection of adulteration in camellia oils. Food Chemistry, 242, 308–315.
September, D. J. F. (2011). Detection and quantification of spice adulteration by near infrared hyperspectral imaging. Stellenbosch University, Department of Food Science. http://scholar.sun.ac.za/handle/10019.1/6624. Accessed 5 Jun 2018.
Tax, D. M. J., & Duin, R. P. W. (1999). Support vector domain description. Pattern Recognition Letters, 20, 1191–1199.
Tax, D. M. J., & Duin, R. P. W. (2001a). Combining one-class classifiers. Lecture Notes in Computer Science, 1032, 299–308.
Tax, D. M. J., & Duin, R. P. W. (2001b). Uniform object generation for optimizing one-class classifiers. Journal of Machine Learning Research, 2, 155–173.
Tax, D. M. J. (2013). DDtools, the Data Description Toolbox for Matlab. http://homepage.tudelft.nl/n9d04/dd_manual.pdf. Accessed 5 Jun 2018.
Varliklioz Er, S., Eksi-Kocak, H., Yetim, H., & Boyaci, I. H. (2017). Novel spectroscopic method for determination and quantification of saffron adulteration. Food Analytical Methods, 10(5), 1547–1555.
Yi, L., Dong, N., Yun, Y., Deng, B., Ren, D., Liu, S., & Liang, Y. (2016). Chemometric methods in data processing of mass spectrometry-based metabolomics: A review. Analytica Chimica Acta, 914, 17–34.
Zhang, L., Huang, X., Li, P., Na, W., Jiang, J., Mao, J., Ding, X., & Zhang, Q. (2017). Multivariate adulteration detection for sesame oil. Chemometrics and Intelligent Laboratory Systems, 161, 147–150.
Zontov, Y. V., Rodionova, O. Y., Kucheryavskiy, S. V., & Pomerantsev, A. L. (2017). DD-SIMCA – A MATLAB GUI tool for data driven SIMCA approach. Chemometrics and Intelligent Laboratory Systems, 167, 23–28.
FIGURE CAPTIONS

Figure 1. Summary of the steps used to build the random forest in tandem with artificial generation of outliers as one-class classifier.
Figure 2. Evening primrose oil MIR spectra of (a) authentic samples, (b) artificial outliers and (c) difference between the average of the authentic spectra and the outlier spectra. Ground nutmeg NIR spectra of (d) authentic samples, (e) artificial outliers and (f) difference between the average of the authentic spectra and the outlier spectra.
Figure 3. Anomalous measurement for authentic samples in random forest model. Training and external validation sets for EPO (a, b) and GN (c, d).
Figure 4. Plot of first two PCA scores. (a) Evening primrose oil (EPO) and (b) ground nutmeg (GN).
Table 1. List of adulterants and their respective proportions.

Samples | Adulterants and their respective proportions, % (w/w) | Number of samples
Evening primrose oil | Soybean (5; 10; 15; 20), corn (5; 10; 15; 20) and sunflower (5; 10; 15) | 264
Rosehip oil | Soybean (5; 10; 15; 20), corn (5; 10; 15; 20) and sunflower (5; 10; 15; 20) | 288
Andiroba oil | Soybean (5; 10; 15; 20) and corn (5; 10; 15; 20) | 192
Ground nutmeg | Cumin (3; 5; 10; 50), monosodium glutamate (3; 5; 10; 30), soil (3; 5; 10; 40), roasted coffee husks (3; 5; 10) and wood sawdust (3; 5; 10) | 90
Table 2. Figures of merit of RF, PLS-DA and SIMCA models of EPO and GN samples.

Evening Primrose Oil (EPO)

Training Set
Model | Predicted class | Original class: Target | Original class: Non-Target | Sensitivity | Specificity
RF | Target | 16 | 0 | 1.0000 | 1.0000
RF | Non-Target | 0 | 1000
PLS-DA | Target | 26 | 0 | 1.0000 | 1.0000
PLS-DA | Non-Target | 0 | 175
SIMCA | Target | 19 | - | 0.9500 | -
SIMCA | Non-Target | 1 | -

Internal Validation Set
Model | Predicted class | Original class: Target | Original class: Non-Target | Sensitivity | Specificity
RF | Target | 7 | - | 1.0000 | -
RF | Non-Target | 0 | -
PLS-DA | Target | 26 | 0 | 1.0000 | 1.0000
PLS-DA | Non-Target | 0 | 175
SIMCA | Target | 7 | - | 1.0000 | -
SIMCA | Non-Target | 0 | -

External Validation Set
Model | Predicted class | Original class: Target | Original class: Non-Target | Sensitivity | Specificity
RF | Target | 13 | 1 | 1.0000 | 0.9988
RF | Non-Target | 0 | 841
RF | No. Excl. Samples | 0 | 0
PLS-DA | Target | 13 | 50 | 1.0000 | 0.8777
PLS-DA | Non-Target | 0 | 359
PLS-DA | No. Excl. Samples | 0 | 258
SIMCA | Target | 13 | 0 | 1.0000 | 1.0000
SIMCA | Non-Target | 0 | 842
SIMCA | No. Excl. Samples | 0 | 0

Ground Nutmeg (GN)

Training Set
Model | Predicted class | Original class: Target | Original class: Non-Target | Sensitivity | Specificity
RF | Target | 15 | 0 | 1.0000 | 1.0000
RF | Non-Target | 0 | 1000
PLS-DA | Target | 25 | 0 | 1.0000 | 1.0000
PLS-DA | Non-Target | 0 | 20
SIMCA | Target | 15 | - | 1.0000 | -
SIMCA | Non-Target | 0 | -

Internal Validation Set
Model | Predicted class | Original class: Target | Original class: Non-Target | Sensitivity | Specificity
RF | Target | 9 | - | 0.9000 | -
RF | Non-Target | 1 | -
PLS-DA | Target | 25 | 0 | 1.0000 | 1.0000
PLS-DA | Non-Target | 0 | 20
SIMCA | Target | 10 | - | 1.0000 | -
SIMCA | Non-Target | 0 | -

External Validation Set
Model | Predicted class | Original class: Target | Original class: Non-Target | Sensitivity | Specificity
RF | Target | 13 | 0 | 0.9286 | 1.0000
RF | Non-Target | 1 | 90
RF | No. Excl. Samples | 0 | 0
PLS-DA | Target | 14 | 8 | 1.0000 | 0.8788
PLS-DA | Non-Target | 0 | 58
PLS-DA | No. Excl. Samples | 0 | 4
SIMCA | Target | 14 | 9 | 1.0000 | 0.9000
SIMCA | Non-Target | 0 | 81
SIMCA | No. Excl. Samples | 0 | 0
Figure 1
Figure 2
Figure 3
Figure 4
Highlights

- One-class random forest for adulteration detection purposes
- Artificial generation of outliers from the target samples
- Attenuated total reflectance Fourier transform infrared spectroscopy for adulteration detection in evening primrose oils
- Near infrared diffuse reflectance spectroscopy for adulteration detection in ground nutmeg
- Excellent performance with high values of sensitivity and specificity