CHAPTER 12

Data Fusion of Nonoptimized Models: Applications to Outlier Detection, Classification, and Image Library Searching

John H. Kalivas
Department of Chemistry, Idaho State University, Pocatello, ID, United States

Data fusion is considered a "multilevel, multifaceted process dealing with the detection, association, correlation, estimation, and combination of data and information from multiple sources" [1]. As attested to in this book, there are multitudes of data fusion approaches and applications. Presented in this chapter is a unique approach applicable to many situations. For example, outlier detection is a step often used for a variety of purposes, primarily to remove a single sample or a group of samples uniquely different from the bulk of a data set. Typically, a single outlier measure is used, and the measure requires optimization of a tuning parameter value, such as the number of eigenvectors for the Mahalanobis distance (MD). Thus, decisions on whether samples are outliers depend on the measure used and the tuning parameter value. A similar problem occurs in classification studies, in which the class a sample is assigned to depends on the classifier and the corresponding optimized tuning parameter value(s). For example, the tuning parameter for partial least squares discriminant analysis (PLS-DA) is the number of latent variables (LVs). Rather than using one outlier measure or classifier, a multitude of measures or classifiers are combined. Such an ensemble approach is not uncommon with classifiers. However, instead of requiring optimization of classifiers (and outlier measures), this chapter describes the use of a collection (window) of tuning parameter values, thereby avoiding model optimization.
The fusion approach is general and applicable to many situations. Presented in this chapter is an overview of recent work using this fusion strategy. Examined first is outlier detection [2], followed by classification [3]. Application of the fusion approach to number recognition using thermal images of defaced serial numbers is also briefly overviewed [4]. The fusion approach has additionally been utilized for optimizing models (selection of model tuning parameter(s)) [5,6], but that work is not overviewed here. Instead, fusion is used in this chapter to avoid model selection.
1. OUTLIER DETECTION

An often-used step before forming a multivariate calibration model for quantitative analysis or a classification model is outlier detection. The goal is to remove samples so distinctly different from the other training samples that they could degrade the quality of the calibration or classifier. The outlier problem is not new and has been well reviewed. There are three general tactics for dealing with outliers. The first involves detection followed by removal of samples deemed to be outliers [7–10] and is the tactic highlighted with fusion in this chapter. The second uses robust modeling methods [11–14] to minimize outlier influences and is not discussed further. The third takes the point of view of retaining all measured samples, as even peculiar samples may be part of the sample population. The understanding presented in this chapter is that a sample is an outlier because there is a small likelihood of measuring equivalent samples. If several samples are deemed different from the rest, then these samples could possibly be kept as extreme samples [15]. This situation is also not considered in this chapter.

Two measures commonly used to identify spectral outliers (x-outliers) in analytical chemistry are Hotelling's t-squared (linked with the MD) and the Q residual. These measures can be used separately, but most often, simultaneous evaluation at optimized tuning parameter values is used. Outlier measures assessing analyte prediction values (y-outliers) are available, with studentized residuals being the most frequently used measure; here too, optimization is required. Optimization of the respective tuning parameter values for x- and y-outlier measures is a difficult problem with no established solution. Confounding the difficulty of tuning parameter selection is deciding which outlier merit to use. A comparison of six outlier measures concluded that the ability of a particular measure to identify samples as outliers was influenced by several factors [8]. It was thus resolved that a variety of outlier measures should be used, but how to simultaneously assess such measures was not addressed.
The data fusion process sum of ranking differences (SRD) [16–18] is used in this chapter to simultaneously evaluate several nonoptimized x- and y-outlier measures, followed by removal of identified outlier samples. Model optimization (tuning parameter selection) is avoided by using windows of the respective tuning parameter values. For example, optimization of the MD as an x-outlier measure requires selecting an appropriate number of eigenvectors (the tuning parameter). Instead of one MD at the optimized number of eigenvectors for assessment of all samples, a group of MDs is used in the fusion approach. This group stems from a window of eigenvectors, such as MDs at one eigenvector, one and two eigenvectors, one through three eigenvectors, and so on until a stop point. With fusion across a window of nonoptimized MDs (and other outlier measures), a consensus of whether or not a sample is an outlier is obtained, decreasing the risk of missing an irregular sample.

The SRD process provides the rank of each sample relative to the degree of its dissimilarity to the rest of the sample set. Actual SRD ranking values are not evaluated to ascertain whether a sample from the calibration or class sample set is an outlier or extreme or whether conditions have changed, requiring model updating. Instead, presented in this chapter is the application of SRD to determine whether a sample is abnormal and, hence, deemed an outlier. With SRD, it is also possible to verify whether a suspect sample is abnormal as well as whether swamping and/or masking are occurring [14,19,20]. Swamping arises when outliers in the calibration or classification sample set cause the outlier measure to make normal samples appear as outliers. Conversely, masking results when outliers cause other outliers to appear as normal samples relative to the particular outlier measure being used. It should be noted that several approaches exist that attempt to resolve swamping and masking, such as robust processes [11,12,21] and cross-validation (CV) strategies [19,22,23].

The general procedure for computing outlier measures for SRD consists of sequentially removing each sample from the sample set and computing all outlier measures for the removed sample relative to the remaining samples. Values for the outlier measures are used in the SRD input fusion matrix. This process is further explained in the SRD section and is sketched in code below.
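As a concrete illustration of this leave-one-out procedure, below is a minimal Python sketch that builds such an input matrix from two representative measures, the Euclidean distance to the mean and MDs over an eigenvector window. The measure set and window size are illustrative assumptions, not the chapter's full list of 17 measures.

```python
import numpy as np

def outlier_measure_matrix(X, k_max=5):
    """Build an SRD input matrix: rows are outlier measures, columns are
    samples.  Each sample is removed in turn and every measure is computed
    for it relative to the remaining samples (leave-one-out)."""
    n, p = X.shape
    M = np.zeros((1 + k_max, n))      # Euclidean distance + MDs at k = 1..k_max
    for i in range(n):
        xi = X[i]
        Xr = np.delete(X, i, axis=0)  # X_{n-1}, sample i removed
        xbar = Xr.mean(axis=0)
        M[0, i] = np.linalg.norm(xi - xbar)          # distance to mean
        # MDs over a window of eigenvectors from the SVD of X_{n-1}
        U, s, Vt = np.linalg.svd(Xr - xbar, full_matrices=False)
        for k in range(1, k_max + 1):
            scores = (xi - xbar) @ Vt[:k].T          # project removed sample
            var = s[:k] ** 2 / (Xr.shape[0] - 1)     # variance per eigenvector
            M[k, i] = np.sqrt(np.sum(scores ** 2 / var))
    # unit-length scale each row so differently sized measures fuse fairly
    # (the row scaling described in Section 1.3.1)
    return M / np.linalg.norm(M, axis=1, keepdims=True)
```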
1.1 Outlier Mathematical Notation

Vectors are bold lowercase and are column vectors unless otherwise indicated by the transpose operation superscript ᵀ. Scalars are italic. The L2, or Euclidean, norm of a vector is written as ‖·‖, the symbol |·| denotes the absolute value, and ‖·‖F denotes the Frobenius norm of a matrix. The matrix X defines a collection of n observations in rows (samples) and p variables in columns (wavelengths for the spectral data used in this chapter).
An X with xi removed (the ith sample) is written as Xn−1. The column-wise mean vector of Xn−1 is denoted by x̄. Outer vector product arrays are computed as xi xiᵀ and x̄ x̄ᵀ.
1.2 Outlier Measures

All outlier measures are the same as those used in Ref. [2] and characterize the degree of uniqueness of each sample relative to the remaining samples in various ways. In total, 17 outlier measures are used. Eleven x-outlier measures are directly computed by comparing the removed sample spectrum xi to the mean of the remaining samples x̄ (cosine, Euclidean distance, determinant, inner product correlation, Procrustes analysis [PA] with three different measures, and extended inverted signal correction differences [EISCD] with four measures); four nonoptimized x-outlier measures compare xi to the space spanned by Xn−1 and vary with the number of eigenvectors used from the singular value decomposition (SVD) of Xn−1 (MD, Q residual, sine, and divergence criterion); and two nonoptimized y-outlier measures depend on the PLS LVs used (externally studentized residuals and matrix matching). Rather than optimizing the four x- and two y-outlier measures to respective tuning parameter values, windows of these tuning parameters are used.

Two of the 11 x-outlier measures not requiring tuning parameter values are uncharacteristic of outlier detection. One is PA, commonly used to transform a collection of samples measured with one set of variables to appear as if the same samples (as well as new samples) were measured with different variables or in new conditions [24–26]. The other measure used with a similar novel twist for outlier detection is EISCD, a modified version of the inverted form of multiplicative signal correction (MSC) commonly used for signal preprocessing [27,28]. These two x-outlier measures are further described owing to their unusual use; the other outlier measures are listed and detailed in Ref. [2]. One of the two y-outlier measures is also uncharacteristic of outlier detection. It is based on a new matrix matching strategy [29] and is also briefly described.

1.2.1 Procrustes Analysis (X-Outlier)

There are two general forms of PA, one termed unconstrained and the other constrained; both are used as outlier measures. As previously noted, PA is used to transform one data set X1 to mimic another data set X2 using transformation processes, where both data sets have the same samples (rows or objects). For spectral data with unconstrained PA, a transformation matrix T (rotation and dilation operations) is estimated by solving X1 = X2T. Translation is another PA step, carried out by mean centering each data set to its respective mean. This step is not included here. The quality of T is assessed by computing the sum of the squares of the differences between X1 and the transformed set X2T, easily calculated from ‖X1 − X2T‖F².

For outlier detection, PA is used to transform the removed sample xi to match the mean spectrum x̄ of the remaining samples. However, the quality of the transformation is not assessed. Instead, the difficulty in carrying out the transformation is determined, i.e., how hard is it to perform the transformation? The greater the difficulty (a large amount of rotation and dilation is needed), the more likely a sample is an outlier. The degree of difficulty in carrying out the transformation is contained in T. To characterize this information, a reference point is needed. Thus, two transformation matrices are computed, and their difference is used as the outlier measure. The first transformation is Ti for x̄x̄ᵀ = xi xiᵀ Ti, and the second one is from x̄x̄ᵀ = x̄x̄ᵀ T, with ΔTi = ‖Ti − T‖F being the outlier measure.

For constrained PA as an outlier measure, the rotation and dilation steps are isolated and defined as the orthogonal constrained rotation matrix H and dilation constant ρ. Again, differences are used to assess the difficulty of performing the transformations. Specifically, x̄x̄ᵀ = ρi xi xiᵀ Hi and, for reference, x̄x̄ᵀ = ρ x̄x̄ᵀ H. The outlier measures are ΔHi = ‖Hi − H‖F and Δρi = |ρi − ρ|.

1.2.2 Extended Inverted Signal Correction Differences (X-Outlier)

MSC is commonly used for spectral preprocessing to correct for light scattering and other nonanalyte spectral effects [28]. The method has been extended (EMSC) to include additional spectral correction terms [30]. Inverted versions, ISC and EISC, respectively, have been formulated [27]. An adaptation of EISC is applied as an outlier detection measure based on the spectral differences between x̄ and xi (d = x̄ − xi) [31,32] and is denoted EISC differences (EISCD). Similar to PA, corrected spectra are not used. Instead, the degree of difficulty in correcting spectra is assessed in two ways. The first is the magnitude of ‖b‖, where b represents the regression vector from solving

x̄ = xi + Xcb → d = Xcb   (12.1)

where

Xc = [1, xi, xi², dxi/dλ, d²xi/dλ², λ, λ², ln(λ)]   (12.2)
and λ is the vector of actual wavelengths for the full spectrum. The greater the value of ‖b‖, the larger some or all of the regression vector coefficients and, hence, the greater the amount of spectral correction necessary to map xi to x̄, and the more likely the sample is an outlier.
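As a concrete illustration of Eqs. (12.1) and (12.2), here is a minimal Python sketch of the two EISCD difficulty measures, ‖b‖ and the companion ‖Xcb‖ discussed next. Approximating the wavelength derivatives with numpy.gradient is an implementation assumption, not a prescription from the chapter.

```python
import numpy as np

def eiscd_measures(xi, xbar, lam):
    """EISCD x-outlier measures: solve d = Xc b by least squares and return
    ||b|| (difficulty of the correction) and ||Xc b|| (amount of correction)."""
    d1 = np.gradient(xi, lam)                 # dxi/d(lambda)
    d2 = np.gradient(d1, lam)                 # d2xi/d(lambda)2
    Xc = np.column_stack([np.ones_like(lam), xi, xi**2,
                          d1, d2, lam, lam**2, np.log(lam)])
    d = xbar - xi                             # spectral difference to correct
    b, *_ = np.linalg.lstsq(Xc, d, rcond=None)
    return np.linalg.norm(b), np.linalg.norm(Xc @ b)
```

Two more values follow by switching xi and x̄, giving the four EISCD measures.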
The second measure is ‖Xcb‖ and denotes the amount of correction applied. Two more measures are, respectively, obtained by switching xi and x̄.

1.2.3 Matrix Matching (Y-Outlier)

The y-outlier measures are used only for multivariate calibration and depend on the multivariate calibration method used to solve for the analyte calibration regression vector b in the model y = Xb. The matrix matching outlier measure depends on how well the physical and chemical matrix effects of xi agree with those of the remaining sample spectra [29]. The degree of matrix matching is assessed from the interaction between b and the sample spectra. The more matrix matched samples are, the more similar this interaction will be and the less likely a sample is an outlier. To determine the degree of matrix matching, a plot of |yj − al ŷj| against al is made, where al is the lth scalar value for the jth sample and ŷj = xjᵀb for a given calibration regression vector. A V-shaped plot results, with zero prediction error at the bottom. Matrix-matched samples will have similar a values at the bottom of the V-curves (zero prediction error). Respective shapes of the V-curves will also be similar for matrix-matched samples, but this component was not used in this study [29]. The outlier measure used is Δai = |ān−1 − ai|, where ān−1 denotes the mean of the aj values at the minima of the V-shaped curves for the samples making up Xn−1 and ai represents the minimum of the V-shaped curve for xi.

1.2.4 Tuning Parameter Windows (X- and Y-Outliers)

Some of the x-outlier measures and both y-outlier measures are tuning parameter dependent. Values for these x-outlier measures depend on the number of eigenvectors retained from the SVD of Xn−1. Values for the y-outlier measures depend on the calibration method and the respective tuning parameter value. Only PLS is used, but additional calibration methods could be included, such as ridge regression and principal component regression. As previously noted, rather than using an optimized tuning parameter value, windows are used. The window size for the x-outlier measures starts with the first eigenvector and ends when approximately 99% of Xn−1 is spanned and the retained eigenvectors are not unreasonably noisy. The window of LVs used with PLS is determined by performing a quick leave-multiple-out CV process on the full calibration sample set being outlier checked. The root mean square error of CV is evaluated, and several LVs before and after the minimum are used for the window. Obviously under- and overfitted regression vectors should be avoided [2].
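The chapter does not prescribe code for setting these windows; the following Python sketch is one plausible reading of the two criteria above (approximately 99% of Xn−1 spanned for the eigenvector window; several LVs on either side of the RMSECV minimum for the LV window), using scikit-learn for PLS and CV. The fold count, random seed, and ±3 LV spread are assumptions.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold, cross_val_predict

def eigenvector_window_end(X, spanned=0.99):
    """x-outlier window: eigenvectors 1 through k, where k is the first
    component at which ~99% of the mean-centered X variance is spanned."""
    s = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
    frac = np.cumsum(s**2) / np.sum(s**2)
    return 1 + int(np.searchsorted(frac, spanned))

def lv_window(X, y, max_lv=20, spread=3, n_splits=5):
    """y-outlier window: several LVs before and after the RMSECV minimum
    from a quick leave-multiple-out CV on the full calibration set."""
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    rmsecv = []
    for k in range(1, max_lv + 1):
        yhat = cross_val_predict(PLSRegression(n_components=k), X, y, cv=cv)
        rmsecv.append(np.sqrt(np.mean((y - yhat.ravel())**2)))
    best = int(np.argmin(rmsecv)) + 1          # LV at the RMSECV minimum
    return max(1, best - spread), min(max_lv, best + spread)
```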
1.3 Sum of Ranking Differences

The SRD process is next described in the framework of outlier detection.

1.3.1 Outlier Detection

For outlier detection, the outlier measures are the rows of the SRD input matrix, and the columns correspond to the respective samples removed and tested as possible outliers. With m outlier measures, the SRD input matrix is m × n. Each row of the SRD input matrix is scaled to unit length to remove scaling issues between the magnitudes of the values from the multiple outlier measures being calculated. The outlier measures included in the SRD input matrix can be the spectral x-outlier measures, the y-outlier measures, or both.

The SRD process requires a target value for each row as a reference point to rank against. For outlier detection, the target value is set to the row maximum. With this target, SRD ranks the samples such that the removed sample (column) with the lowest rank is the sample most consistently dissimilar, across all the outlier measures, to the remaining samples. To determine whether the lowest ranked sample is an outlier, the SRD comparison of ranks by random numbers (CRRN) is used [17]. The CRRN assesses the probability that an SRD ranking is no different from a random ranking. To decide whether a sample is a potential outlier, a threshold is set at a number of standard deviation (σ) units (and hence a confidence level) from the mean of the random ranking distribution in the smaller ranking direction. A collection of samples with rankings less than the σ threshold ranking could be labeled outliers, but only the lowest ranked sample beyond the σ threshold is considered an outlier. This sample is further assessed with the SRD verification process to avoid swamping and masking effects. Generally, a threshold of 3σ is used [2]. If a sample is confirmed an outlier, it is removed, and the SRD process is repeated until no more samples can be sequentially removed.

An internal SRD CV can be used to additionally reduce the chance of a random rank. The SRD CV is obtained by using a random leave-multiple-out CV process on the rows of the SRD input matrix. The objective of SRD CV is to obtain a boxplot of the rankings, allowing assessment of the SRD rank consistency of the samples removed.
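The ranking mechanics can be made concrete with a short Python sketch following the description above: measures (rows) are ranked within each sample's column and compared with the ranking of the row-wise maximum target, and the CRRN threshold comes from simulated random rankings. The simulation count is an arbitrary assumption.

```python
import numpy as np
from scipy.stats import rankdata

def srd_values(M):
    """SRD against a row-wise maximum target.  M is the m x n input matrix
    (rows = unit-scaled outlier measures, columns = removed samples).
    Lower SRD = sample most consistently carrying the most outlying values."""
    ref = rankdata(M.max(axis=1))             # ranking of the target column
    return np.array([np.abs(rankdata(col) - ref).sum() for col in M.T])

def crrn_threshold(m, n_sigma=3, n_sim=10_000, seed=0):
    """CRRN: simulate SRD values for random rankings of m measures and
    return mean - n_sigma*std, i.e., the cutoff in the smaller ranking
    direction below which an SRD value is unlikely to be random."""
    rng = np.random.default_rng(seed)
    ref = np.arange(1, m + 1)
    sims = np.array([np.abs(rng.permutation(ref) - ref).sum()
                     for _ in range(n_sim)])
    return sims.mean() - n_sigma * sims.std()
```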
The SRD method can also be used to identify whether any samples in a sample set are similar to a particular sample. The particular sample can be selected at random, an outlier or suspect sample, a swamped sample, or a masked sample. The specific sample is removed, and the outlier measures are computed using the usual sample removal procedure. The outlier measures are also computed for the particular sample removed relative to the remaining samples. Instead of using the maximum as the SRD ranking target, the outlier values for the particular sample being evaluated are used. With this target, the samples ranked lowest are those with outlier values closest to those of the sample being assessed.

1.3.2 Outlier Verification

As previously noted, if a sample is detected as a possible outlier, SRD is used again for verification. This evaluation is accomplished by removing the suspect sample from the sample set. The outlier measures are calculated again for the remaining samples using the usual sample removal process. The outlier measures are also calculated for the suspect sample relative to the remaining sample set (these values are the same as those from the original SRD assessment). The SRD procedure is performed, and if the suspect sample is indeed an outlier, its rank will remain less than the σ threshold. If instead it ranks above the σ threshold, then swamping is probably occurring and the sample is not an outlier.

1.3.3 Sample Swamping Assessment

The degree of closeness of sample SRD rankings identifies samples behaving similarly relative to the outlier measures. Samples within a cluster of similarly ranked samples below the σ threshold may be normal samples swamped as outliers. Any sample in such a cluster can be evaluated with the verification procedure and assessed for swamping.

1.3.4 Sample Masking Assessment

Whenever the verification or swamping assessment procedures are performed, if new samples not previously ranked below the σ threshold now rank below it, then masking is probably occurring. In this case, the new samples ranking below the σ threshold are possibly outlier samples that were originally identified as similar to the normal samples. Such samples should be further tested as outliers using the verification process.
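Tying the pieces together, below is a hedged sketch of the sequential detect-and-remove loop, reusing the outlier_measure_matrix, srd_values, and crrn_threshold helpers sketched earlier; the verification, swamping, and masking checks of Sections 1.3.2–1.3.4 are omitted for brevity.

```python
import numpy as np

def detect_outliers(X, max_removals=10):
    """Flag the lowest ranked sample below the CRRN threshold, remove it,
    and repeat until no sample ranks below the threshold."""
    remaining = list(range(X.shape[0]))
    outliers = []
    for _ in range(max_removals):
        M = outlier_measure_matrix(X[remaining])
        srd = srd_values(M)
        j = int(np.argmin(srd))
        if srd[j] >= crrn_threshold(M.shape[0]):
            break                              # nothing below the 3-sigma cutoff
        outliers.append(remaining.pop(j))      # remove flagged sample, repeat
    return outliers
```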
1.4 Outlier Detection Example

To overview the SRD outlier detection method, an example data set is evaluated. This data set, well used in the literature, is composed of the same 80 corn samples measured on three NIR instruments [33]. Values are provided for the moisture, oil, protein, and starch contents of each sample. Spectra measured on instrument mp6 are used here. An eigenvector window from 1 to 15 is used for the x-outlier measures. Accordingly, respective outlier measures are computed with the first eigenvector, then with the first two eigenvectors, and so on up to the first 15 eigenvectors.
All 15 values for each of the corresponding x-outlier measures are used in SRD. The PLS LV windows for the y-outlier measures are specific to each prediction property. For moisture, oil, protein, and starch, the respective LV windows are LVs 3–20, 3–17, 3–18, and 4–17 [2]. Shown in Fig. 12.1 is the SRD input matrix with only the x-outlier measures. Note that with the eigenvector window, there are 71 outlier measures being assessed for each of the 80 samples, i.e., the SRD input matrix is 71 by 80. Using multiple outlier measures enhances the robustness of the outlier detection process because each measure provides a unique evaluation of each sample. Also pictured in Fig. 12.1 are the SRD results. It is obvious that samples 75 and 77 are potential outliers, as these two samples distinctly stand out from the random distribution and are ranked below the 3σ threshold. The boxplot shown in Fig. 12.1 from using SRD CV indicates that these two samples are consistently ranked below the 3σ threshold. These consistent low rankings provide further evidence that samples 75 and 77 are uniquely different from the rest of the samples.
FIGURE 12.1 The SRD input matrix (upper left) with unit length normalized rows for the x-outlier measures, SRD CV box plots with the 3σ threshold (upper right), and SRD normalized ranks with the random ranking distribution and 3σ limit (lower left). Reprinted with permission from Anal. Chem. 89 (2017) 5087–5094. Copyright 2017 American Chemical Society.
As previously noted, the lowest ranked sample is removed and the SRD process is repeated until no more samples are ranked below the 3σ threshold. This procedure is depicted in Fig. 12.2. After removal of sample 75, sample 77 continues to be marked as an outlier, with no other samples being identified as potential outliers. When sample 77 is removed, no new samples are identified as potential outliers. The lack of new samples indicates no masking. Not shown, but when the verification processes are performed individually for samples 75 and 77, these samples remain at SRD ranks lower than the 3σ threshold, implying that these samples are true outliers and were not being swamped. If samples 75 and 77 had ranked closer to the 3σ threshold, then these samples may have been normal samples being swamped.

Shown in Fig. 12.3 are the SRD input matrix and results using only the two y-outlier measures. Because LV windows are used, the two outlier measures become 36 measures. The SRD rankings indicate that samples 71 and 1 are potential outliers. Removing sample 71 and running SRD again, it was found that sample 1 and the remaining samples are all now above the 3σ threshold. This SRD evaluation is shown in Fig. 12.3. Thus, sample 1 was probably being swamped and is not a y-outlier. Using only the y-outlier measures, the calibration set was also checked for oil, protein, and starch.
FIGURE 12.2 The SRD normalized ranks with the random ranking distribution and 3σ limit for sequential respective removal of suspect outlier samples 75 and 77. Reprinted with permission from Anal. Chem. 89 (2017) 5087–5094. Copyright 2017 American Chemical Society.
FIGURE 12.3 The SRD input matrix (upper left) with unit length normalized rows for the y-outlier measures, SRD CV box plots with the 3σ threshold (upper right), SRD normalized ranks with the random ranking distribution and 3σ limit (lower left), and the SRD normalized ranks with the random ranking distribution and 3σ limit after removal of sample 71 (lower right). Reprinted with permission from Anal. Chem. 89 (2017) 5087–5094. Copyright 2017 American Chemical Society.
Samples sequentially removed for each prediction property were samples 77, 27, and 7 for oil; sample 77 for protein; and samples 75, 2, and 11 for starch. Besides using the x- and y-outlier measures independently, the two sets of measures can be combined in one SRD. Combining the measures with sequential removal identifies samples 75, 77, 71, 72, and 80 as outliers for moisture. Note that by combining the x- and y-outlier measures, two additional samples are identified as outliers compared with using the measures separately. Outlier samples for the other prediction properties are samples 75, 77, 27, and 7 for oil; samples 75 and 77 for protein; and samples 75, 77, 2, and 11 for starch.
1.5 Outlier Detection Summary

Presented was the application of SRD as a method for fusing nonoptimized outlier detection measures. The process can be used for outlier cleaning of a calibration or classification data set, for verification, and for checking for swamping and masking. There is no limit to the number of outlier measures that can be included for simultaneous evaluation, such as other MD-related measures.
With spectral data (and other data measured over continuous variables), the spectra can be broken into wavelength windows. Each wavelength window would be used with all the outlier measures and included as additional rows of the SRD input matrix. Several wavelength window sizes can be used, as well as several schemes for sliding a wavelength window across a spectrum. Presented in Ref. [2] are results from another data set and studies on the number of outliers that can be identified. Results from varying the σ threshold are also discussed.
2. CLASSIFICATION

Classifying samples into categories is a common problem in many fields, and numerous stand-alone methods are available. Each field often characterizes classification differently. In this chapter, as in previous work [3], classification is broadly interpreted as the problem of establishing which class (category) a sample belongs to, regardless of the classifier being used (e.g., regression, distance classifier, discriminant analysis). In addition to deciding which classifier to use for a particular problem, most classification processes require optimization of one or more tuning parameters for an appropriate level of classification accuracy, sensitivity, and specificity. For example, with PLS (and PLS2, as the case may be) discriminant analysis (DA), the number of LVs is optimized, and for k nearest neighbors (kNN), the distance measure and number of nearest neighbors need to be determined. Similar to outlier detection, optimizing a classifier to its "optimal" tuning parameter value(s) is a confounding problem with no commonly accepted solution.

To circumvent classifier method selection, ensemble (fusion) approaches combining multiple classifiers have been developed [34–40]. Analogous to fusion of outlier measures, fusion of classifiers provides a more robust classification beyond the capability of a single classifier, i.e., a consensus classification is obtained, decreasing chance misclassifications of samples. It can be said that with fusion, classification is less risky [41]. Two approaches to ensemble classifiers are often used. One involves developing a group of classifiers using a stand-alone (single) algorithm such as random forests or other bagging processes [37,42–44]. Each stand-alone method uses one algorithm to form dissimilar classifiers by randomly varying the training sample set. The objective is to ensemble a group of weak classifiers into a strong classifier. However, there are multiple stand-alone ensemble classifiers to choose from, and thus the problem of deciding which single classification algorithm to use persists.
The other approach ensembles classification results from a group of different optimized single classifiers such as PLS-DA (and/or PLS2-DA), random forest, and kNN [36–40,45,46]. However, in addition to the task of optimizing several classifiers, the best way to weight a linear or nonlinear combination of the classifiers to form the final classification rule needs to be determined.

Instead of optimizing each single classifier, greater diversity in a collection of stand-alone classifiers can be obtained by using a group of single classifiers formed with a window of the respective tuning parameter values. This is the same approach taken with the outlier detection method previously discussed in this chapter. For example, as used in this chapter, a window of, say, k single PLS2-DA classifiers would be composed of predicted class pseudovalues based on the first LV, the first two LVs, and so on up to the first k LVs. As another example, for k kNN classifiers, the NN counts at 1 NN, 2 NN, and so on up to k NNs are included for each class. This classification process obtains a consensus classification without optimizing the classifiers and, hence, avoids the training phase. Although SRD proved useful for the fusion process with outlier detection, the sum rule [37,42] is used for classifier fusion without any weighting process. Using the sum rule is simple and allows raw values to be used instead of preclassifying and using a majority vote rule. For example, if the MD is used as a classifier, then the actual MD values are used in the fusion, as sketched below.
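To make the sum rule concrete, here is a minimal Python sketch of fusing windows of raw, minimum-type classifier values for one test sample. It anticipates the row-wise unit-length scaling described in Section 2.2.3, and the block contents (e.g., MDs to each class at 1, 2, ..., k eigenvectors) are assumed to be computed elsewhere.

```python
import numpy as np

def sum_rule_classify(blocks):
    """Sum-rule fusion over nonoptimized classifiers.  Each entry of
    `blocks` is a (window size x n classes) array of raw values for one
    classifier; stacking them gives the m x c fusion input matrix."""
    M = np.vstack(blocks)                               # m x c
    M = M / np.linalg.norm(M, axis=1, keepdims=True)    # row-wise unit scaling
    return int(np.argmin(M.sum(axis=0)))                # smallest column sum wins

# e.g., fuse hypothetical MD and Q-residual windows for a two-class problem:
# predicted_class = sum_rule_classify([md_block, q_block])
```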
2.1 Classification Mathematical Notation

The matrix X specifies a collection of samples representing a particular class, composed of n observations (samples) in the rows and p variables (wavelengths for spectra) in the columns. The variable k denotes the number of eigenvectors, LVs, or NNs used in respective calculations. Otherwise, the same mathematical notation used for outlier detection is used here.
2.2 Classifiers and Fusion Rule

All classifiers are the same as those used in Ref. [3]. In total, 17 independent single classifiers are used. Essentially, the outlier detection measures are used as classifiers with the addition of PLS2-DA and kNN, and hence, brief descriptions of how PLS2-DA and kNN are used for fusion and as stand-alone classifiers are provided in the next sections. Summarizing the measures, 11 measures are directly computed by comparing the sample spectrum being classified xi with the mean of a class sample set x̄ (cosine, Euclidean distance, determinant, inner product correlation, PA with three different measures, and EISCD with four measures), and six nonoptimized classifiers compare xi with the space spanned by the class sample set X (PLS2-DA, kNN, MD, Q residual, sine, and divergence criterion). Rather than optimizing the six tunable classifiers to respective tuning parameter values, windows of these tuning parameters are used, forming a group for each particular classifier. For example, if the tuning parameter window size is 10 (k = 10), then instead of six stand-alone optimized classifiers, 60 nonoptimized single classifiers are combined with the 11 other single classifiers for a total of 71 classifiers to classify a new sample.

As previously noted, the sum rule is used on the fusion input matrix, where the fusion input matrix is m × c, with m denoting the number of classifiers and c representing the number of classes. The column (class) with the smallest sum identifies class membership. Thus, all values in the fusion input matrix must be minimum-type values for class membership. Only two of the classifiers produce maximum-type values, PLS2-DA and kNN. These values are rescaled to minimum-type values as described in the following two respective sections. Classification rules are needed when PLS2-DA, kNN, MD, and Q residual are used as stand-alone classifiers at optimized tuning parameter values. The rules for PLS2-DA and kNN used in this chapter are noted in the respective following sections. For MD and Q residual, the simple rule of the smallest value is used for class membership assignment.

2.2.1 Partial Least Squares-2 Discriminant Analysis

The method of PLS2-DA requires pseudovalues for the matrix Y in Y = XB, where Y has a column for each class. The value in the row and column corresponding to a sample's class membership is 1; the remaining column values in the same row are −1. The predicted pseudovalues are used in the fusion input matrix with a row for each LV combination, up to k rows. Because the smallest sum for a class is the predicted class for the particular sample being classified, each predicted PLS2-DA pseudovalue is transformed to a minimum-type value for class membership. Let ŷt signify the transformed predicted pseudovalue of ŷ for a particular LV predicted value; then ŷt = ŷmax − ŷ, where ŷmax represents the maximum predicted pseudovalue over all rows and columns of predicted values in the current window of LVs being evaluated. When PLS2-DA is used as a stand-alone classifier, a classification rule is needed. The rule used here is that, at a specific LV model, the most positive predicted value is taken as class membership.

2.2.2 k Nearest Neighbors

With kNN, two tuning decisions are needed: the number of nearest neighbors (NN) to form the classification rule and the distance measure. Only the Euclidean distance is used, but other distances could be included. The number of NN is not optimized; instead, a window of NN values is used up to size k.
Actual NN values are used in the fusion input matrix. Because the column sum minimum of the fusion input matrix identifies class membership for a new sample, each NN value is transformed to a minimum-type value similar to PLS2-DA. Let NNt denote a transformed NN value for a particular value of k; then NNt = NNmax − NN, where NNmax designates the maximum NN value over all rows and columns of NN values at the current NN window size being evaluated. When kNN is used as a stand-alone classifier, the classification rule is the standard rule: the class with the greatest number of NNs is class membership.

2.2.3 Fusion

As with the input matrix for outlier detection, each row of the fusion input matrix is normalized to unit length to remove scaling differences between the classifiers. No weights are used on the classifiers, and as previously noted, the sum rule is used instead of SRD. The reader is referred to Ref. [3] for further details. An example fusion input matrix for a new sample being classified to one of three classes is shown in Fig. 12.4. In this case, the window size is set to k = 23, turning the 17 classifiers into 149 classifiers. Note that raw values are used in the left image. If threshold values are applied to each classifier (the usual process, requiring training at each tuning parameter value), then the image on the right is obtained. In this case, the sum rule is the majority vote.
FIGURE 12.4 An example of the fusion input matrix for a sample being classified to one of three classes. The row-wise normalized raw values are shown on the left, and on the right are the classification results based on chosen threshold values. The window size is k = 23, and the rows are 1–23 PLS2-DA, 24–46 kNN, 47–69 MD, 70–92 sine, 93–115 Q residual, 116–138 divergence criterion, and 139–149 the nontuning-parameter-based classifiers cosine, Euclidean distance, determinant, inner product correlation, PA (three), and EISCD (four).
In the example presented, the PLS2-DA window of values occupies rows 1 through 23 and shows that PLS2-DA essentially misclassifies this sample. The kNN window of values occupies rows 24 through 46 and shows that kNN also misclassifies this sample. Because several other nonoptimized classifiers are included, the consensus from the sum fusion rule correctly classifies the sample as belonging to class 1 (the class with the smallest sum is class membership).
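A small Python sketch of the maximum-to-minimum rescaling used for the PLS2-DA pseudovalues and the kNN counts, following ŷt = ŷmax − ŷ and NNt = NNmax − NN; the toy numbers are illustrative only.

```python
import numpy as np

def to_minimum_type(V):
    """Rescale a window of maximum-type outputs, shape (window size,
    n classes), so that small values indicate class membership:
    v_t = v_max - v, with v_max over all rows and columns of the window."""
    return V.max() - V

# Example: kNN neighbor counts for k = 1..3 over two classes
nn = np.array([[1, 0],
               [1, 1],
               [2, 1]])
nn_t = to_minimum_type(nn)   # column sums: class 1 = 2, class 2 = 4 -> class 1
```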
2.3 Classification Example

To assess the significance of the fusion classification strategy, the standard measures of classification quality are used (accuracy, sensitivity, and specificity). These values are calculated by

accuracy = (TP + TN)/(TP + TN + FP + FN)   (12.3)

sensitivity = TP/(TP + FN)   (12.4)

specificity = TN/(TN + FP)   (12.5)
where TP, FP, TN, and FN are, respectively, the numbers of true positives, false positives, true negatives, and false negatives. In addition to assessing the fusion process, the four methods PLS2-DA, MD, Q residual, and kNN are evaluated as stand-alone optimized classifiers. A leave-one-out CV procedure is used on each class set of samples to calculate the accuracy, sensitivity, and specificity values.

An example is a beer data set [47] composed of 19 Birra del Borgo ReAle samples and 41 other craft beer samples, made up of 12 non-ReAle samples from the same manufacturer, Birra del Borgo, and 29 samples from producers in Italy and other parts of Europe. All beer samples were measured on five instruments: mid-infrared, near-infrared (NIR), ultraviolet (UV), visible, and thermogravimetric analysis. These data were originally analyzed as a binary classification problem and are used the same way here to illustrate the usefulness of multiinstrument fusion in a nontypical way made possible by the nonoptimized fusion approach. The maximum k value across all five instruments for the eigenvector, LV, and NN windows is 16. For example, PLS2-DA models and subsequent predictions are obtained with the first LV, then the first two LVs, and so on up to LVs 1–16. If a window of size four is used for the fusion input matrix, then only the first four raw values from PLS2-DA (pseudoprediction values), kNN (numbers of NNs), MD (actual MD values), and so on for the other classifiers are used. The maximum window size is the full rank, set at 16.

Displayed in Fig. 12.5 are the results using the UV instrument. Shown are the classification measures of quality for PLS2-DA, MD, Q residual, and kNN when these classifiers are used as optimized stand-alone classifiers. Specifically, the accuracy, sensitivity, and specificity values at a respective tuning parameter value on the x-axis should be read as the classification quality that would result had the stand-alone classifier been optimized to that tuning parameter value.
FIGURE 12.5 Classification results from the UV instrument for classifying the beer samples using PLS2-DA, MD, Q residual, and kNN as stand-alone classifiers at each respective tuning parameter value. Also shown are the fusion classification results from all 17 classifiers at each window size. Accuracy (red (gray in print version)), sensitivity (blue (dark gray in print version)), and specificity (green (light gray in print version)). Reprinted with permission from Anal. Chem. 90 (2018) 4429–4437. Copyright 2018 American Chemical Society.
For example, highlighted in Fig. 12.5 are the kNN results if kNN were optimized to classify at 7 NNs using the Euclidean distance. However, for fusion at tuning parameter window size 7, each of the six tunable classifiers is evaluated as seven single classifiers, forming 42 raw classification values for each sample. These 42 raw classification values are combined with the classification values from the remaining 11 nontunable single classifiers, totaling 53 raw classification values for the sum rule.

From Fig. 12.5, classification quality is seen to depend on the classification method and the corresponding tuning parameter value for the stand-alone classifiers. This observation is demonstrated by the irregular behavior of the results across the respective classifier tuning parameter values. Hence, depending on the tuning parameter value chosen (the optimized value), classifier quality fluctuates. Similar results are observed for the other four instruments [3]. Conversely, fusion of the nonoptimized classifiers shows consistency in the final classification results regardless of the instrument; the fusion process smooths out the irregular behavior. At the maximum window size of 16, the 17 single classifiers become 107 classifiers. The pattern observed for fusion with the UV instrument was present for the other four instruments.

When more than one spectral instrument is used to measure the same samples, a common data fusion practice is to augment the spectra column-wise to form a single multiinstrument array.
This situation is depicted in Fig. 12.6 and is commonly referred to as low-level fusion. The low-level fusion data set was classified by the same stand-alone classifiers and by the 17 classifiers using fusion of tuning parameter windows. The classification results are presented in Fig. 12.6 and are observed to be similar to the UV instrument results shown in Fig. 12.5 and to those of the other four instruments (not shown). Specifically, fusion of the 17 classifiers gives greater consistency in the results. Some of the single classifiers did improve over using only one instrument.

Using the fusion process of nonoptimized models allows a different fusion method for combining multiple instruments. Instead of low-level fusion, the classification values for all 17 classifiers on each instrument can be vertically stacked. Essentially, each instrument acts as a block of classification values, and the blocks are stacked. This stacking is illustrated in Fig. 12.7 for one ReAle beer sample. The fusion results for all samples are also shown in Fig. 12.7. These fusion results are superior to those obtained from the low-level fusion seen in Fig. 12.6. The improvement is probably due to the increased number of classifiers and, hence, classification values. For example, at a window size of 5 tuning parameters, low-level fusion uses 41 classifiers (Fig. 12.6), whereas when the instruments are stacked, a total of 205 classifiers are summed at window size 5.
FIGURE 12.6 Classification results from augmenting the five instruments (low-level fusion) into one data array (top). Shown are the results from single classifiers at respective tuning parameter values and fusion of the 17 classifiers at each tuning parameter window size. MIR, mid-infrared; TG, thermogravimetric analysis; VIS, visible. Reprinted with permission from Anal. Chem. 90 (2018) 4429–4437. Copyright 2018 American Chemical Society.
FIGURE 12.7 Fusion input matrix for one ReAle beer sample formed by vertically stacking the classification raw values from each of the five instruments (top). Class 1 is ReAle beer, and class 2 is non-ReAle beer. Each instrument block is 107 classifiers on the y-axis for a total of 535 classifiers at the maximum tuning parameter window size. Also shown are the fusion results at each tuning parameter window size (bottom). MIR, mid-infrared; TG, thermogravimetric analysis; VIS, visible.
The classification values from the instrument stacking strategy can also be combined with the classification values from the conventional low-level fusion. This is accomplished by appending the block of low-level classification values to the instrument stack. Illustrated in Fig. 12.8 is an example for a ReAle beer sample.
FIGURE 12.8 Fusion input matrix for one ReAle beer sample formed by vertically stacking the classification raw values from each of the five instruments (same as Fig. 12.7) augmented (or stacked) with the raw classifier values from the low-level fusion block, for a total of 642 classifiers at the maximum tuning parameter window size. MIR, mid-infrared; TG, thermogravimetric analysis; VIS, visible.
The classification results from this approach are slightly improved compared with the results shown in Fig. 12.7. No new instrumental analyses are needed to gain the small classification improvements, as the extra block of classification values is easy to calculate.
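Below is a hedged sketch of this block stacking (with the optional appended low-level block), assuming each instrument's m × c matrix of raw classifier values has already been assembled as in the earlier fusion sketch.

```python
import numpy as np

def stacked_instrument_fusion(instrument_blocks, low_level_block=None):
    """High-level fusion by stacking: each entry of instrument_blocks is an
    (m x c) fusion input matrix of raw classifier values for one instrument;
    optionally append the block from the column-wise augmented (low-level)
    data, e.g., 5 x 107 + 107 = 642 rows at the maximum window size."""
    blocks = list(instrument_blocks)
    if low_level_block is not None:
        blocks.append(low_level_block)
    M = np.vstack(blocks)
    M = M / np.linalg.norm(M, axis=1, keepdims=True)   # row-wise unit scaling
    return int(np.argmin(M.sum(axis=0)))               # smallest class sum wins
```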
2.4 Classification Summary

For the beer data, it is observed that the trends in the classification quality measures using the fusion method do not degrade if the full rank value of X is used for k. However, this statement is not true for the stand-alone classifiers, and thus, optimization of stand-alone classifiers is critical. These conflicting trends in accuracy, sensitivity, and specificity between fusion and stand-alone classifiers have been observed for other data sets [3,48]. Other classifiers could be included as additional blocks of classification values to stack in the fusion input matrix, for example, kNN based on other distance measures. A variety of support vector machine strategies and neural networks could also be included as additional blocks of classification values. As with outlier detection by fusion, spectral data (and other data measured over continuous variables) can be broken into wavelength windows using the same suggested processes to form multiple blocks for the fusion input matrix. Lastly, multiple data preprocessing methods can be included in the fusion process rather than determining the "best" spectral preprocessing method [49–51]. An example of combining two preprocessing methods with the fusion approach described in this chapter is provided in Ref. [3]. A similar tactic was used to avoid identifying an appropriate preprocessing method for multivariate calibration; in that case, stacking was used to fuse a collection of calibration models based on different data preprocessing methods [51].
3. THERMAL IMAGE ANALYSIS

Serial numbers are stamped into firearms and other metallic objects for identification purposes. Laser etching is another common method of branding an object with a series of numbers and/or letters for future reference. However, when firearms are used in criminal activity or a commodity of interest is stolen, serial numbers are commonly defaced in an attempt to prevent future identification of the object. For example, a stolen vehicle will often have the vehicle identification number defaced to hamper return of the vehicle to its rightful owner.
Chemical etching is often used to recover nonreadable serial numbers. However, this method is destructive to the object and requires an expert for interpretation purposes. Infrared (IR) thermography is a useful nondestructive method for evaluating structural integrity [52–54]. When an object is heated, subsurface defects produce nonuniform heat dissipation, and this phenomenon can be captured by an IR camera. Described in this chapter is a data fusion approach to such thermal images, based on tuning parameter windows, for the recovery of defaced serial numbers. The specific thermal approach used is lock-in thermography (LIT), whereby the metal surface is heated with a brief pulse of heat from a laser and surface temperatures are monitored with the IR camera [52,55]. The underlying assumption is that residual subsurface defects (a zone of plastic strain) from the stamping or laser etching process remain in the metallic structure after defacing [56–58]. The residual strain alters the temperature gradient, producing the nonuniform heat dissipation recorded by the IR camera. With LIT, a pulsed heat wave is induced in the sample, and the IR camera images the thermal wave propagation over a short time interval. From the collection of recorded LIT-generated IR thermal images, amplitude and phase images are calculated that capture the underlying metallic strain structure. Currently, only the phase images have been useful [55,59]. Details of this study are in Ref. [4], and only brief highlights are presented. A collection of 15 LIT phase images is sequentially obtained; each phase image is calculated from 32 thermal images over a 40-s cycle. Shown in Fig. 12.9 are the mean thermal images from one cycle of recorded IR images for a nondefaced number (6) and a defaced number (5).
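The chapter does not give the amplitude and phase formulas; the sketch below is the standard lock-in correlation commonly used in LIT (cf. Ref. [55]), under the assumptions that the frames are sampled evenly over one heating modulation cycle and that one common sign convention is acceptable.

```python
import numpy as np

def lockin_amplitude_phase(frames):
    """frames: array of shape (n_frames, h, w) recorded evenly over one
    modulation cycle.  Correlating with sine and cosine at the modulation
    frequency gives per-pixel amplitude and phase images."""
    n = frames.shape[0]
    t = 2 * np.pi * np.arange(n) / n
    S = np.tensordot(np.sin(t), frames, axes=1) * 2 / n   # in-phase image
    C = np.tensordot(np.cos(t), frames, axes=1) * 2 / n   # quadrature image
    return np.hypot(S, C), np.arctan2(C, S)               # amplitude, phase
```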
FIGURE 12.9 Pictures of a metal plate with the numbers 2, 6, 2, 5, and 0 stamped in (top left) and the same plate defaced (top right). Also shown are the mean thermal images recorded for the nondefaced 6 and the defaced 5 and the corresponding mean phase images.
Also presented in Fig. 12.9 are the corresponding mean phase images. Clearly, the nondefaced 6 is readable, whereas the defaced 5 is not recognizable. The 15 phase images are unfolded (as is common with hyperspectral imaging), and principal component analysis (PCA) is performed on the unfolded data. From the PCA, a sequence of 15 score images is formed. The 15 score images for the nondefaced 6 and the defaced 5 are pictured in Fig. 12.10. As expected, the nondefaced 6 appears in the first score image, with 97.20% of the image variance captured. Most of the score images for the defaced 5 contain no recognizable information correlated to the number 5. Score image 10 does contain some recognizable information relative to the defaced 5, but the percent variance captured is only 6.3 × 10⁻⁴%. For other defaced numbers, a similar trend is observed, with one score image at a small fraction of the variance loosely resembling the corresponding number. The less residual strain remaining, the further down the PC list a semiqualitative (at best) score image appears. To avoid selecting specific score images for further examination, all images are used in a data fusion approach for a consensus recognition of a defaced number. To identify a defaced number, multiple numerical library images are matched to each score image by a collection of similarity measures.
FIGURE 12.10 Images of the 15 respective score images for the nondefaced 6 (left) and the defaced 5 (middle). Also displayed are the percent variances captured by each score image (right).
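A minimal Python sketch of the unfold → PCA → score-image step, assuming the common hyperspectral convention of pixels as rows; the SVD-based PCA and refolding are one straightforward reading of the text, not code from Ref. [4].

```python
import numpy as np

def phase_score_images(phase_stack):
    """Unfold a (15, h, w) stack of LIT phase images into a (h*w, 15)
    matrix, run PCA via SVD, and refold the scores into 15 score images;
    also return the percent variance captured by each score image."""
    n, h, w = phase_stack.shape
    D = phase_stack.reshape(n, h * w).T        # pixels x phase images
    D = D - D.mean(axis=0)                     # mean center each column
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    scores = U * s                             # pixel scores for each PC
    pct_var = 100 * s**2 / np.sum(s**2)
    return scores.T.reshape(n, h, w), pct_var
```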
Before computing similarity measures between a library number image and a defaced score image, all images are decomposed into Zernike moments using respective Zernike polynomials [60,61]. The Zernike polynomials form an orthogonal basis set allowing extraction of image features describing the shape characteristics of an imaged object. Ultimately, each score image is transformed to seven vectors (a window) of Zernike moments; seven are used to avoid deciding which polynomial order to use. Each vector is then matched with 15 similarity measures to 10 sets of Microsoft font library images also transformed to the same seven vectors of Zernike moments. Using 15 similarity measures and 10 numerical libraries avoids deciding which similarity measure and library to use. Over the 15 score images, this collection of match indicators produces 15,750 values. As with outlier detection and classification, the values are tabulated into a fusion input matrix; in this case, the fusion input matrix is 15,750 by 10. Rather than one fusion rule, 14 rules are used, and a majority vote of the fusion rules is needed to identify which number the defaced number is most similar to. Displayed in Fig. 12.11 is the fusion input matrix for the defaced 5, where the values in each row have been ranked from 1 to 10. The results, also presented in Fig. 12.11, reveal that the defaced 5 is correctly identified as a 5. The sums of the rankings from the 14 fusion rules show that the defaced 5 is not always identified as a 5 by every fusion rule; perfect agreement among the 14 fusion rules would produce a rank sum of 14. However, a majority of the fusion rules identify the defaced number correctly.
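A sketch of the majority vote over fusion rules: the chapter's 14 rules are detailed in Ref. [4], so the two rules below (a row-wise rank sum and a column median) are hypothetical stand-ins, and smaller values are assumed to indicate better matches.

```python
import numpy as np
from scipy.stats import rankdata

# Hypothetical stand-ins for the 14 fusion rules of Ref. [4]; each reduces
# the match matrix to one score per candidate digit (smaller = better).
rank_sum = lambda M: rankdata(M, axis=1).sum(axis=0)   # sum of within-row ranks
median_rule = lambda M: np.median(M, axis=0)           # median value per digit

def identify_digit(match_matrix, fusion_rules=(rank_sum, median_rule)):
    """match_matrix: the 15,750 x 10 array of similarity values (rows =
    score image x Zernike window x measure x library combinations, columns
    = candidate digits 0-9).  Each rule votes for the digit with the
    smallest fused score; the majority vote identifies the number."""
    votes = [int(np.argmin(rule(match_matrix))) for rule in fusion_rules]
    return int(np.bincount(votes, minlength=10).argmax()), votes
```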
FIGURE 12.11 The fusion input matrix for the defaced 5 with the values in each row ranked from 1 to 10 (top). The final results from the majority vote over the 14 fusion rules and the sum of the corresponding rankings from each fusion rule (bottom).
3.1 Thermal Image Summary

All defaced numbers on the metal sample in Fig. 12.9 were correctly identified, as were a series of defaced laser-etched numbers and numbers in other situations [4]. Fusion of multiple similarity measures matched to multiple number libraries and computed over a window of Zernike moments provides consensus identification. The method is nondestructive, can be automated, and does not require human expertise.
Acknowledgments The outlier and classification material is based upon work supported by the National Science Foundation under Grant No. CHE-1506417 (cofunded by CDS&E) and is gratefully acknowledged by the author. The thermal image analysis material is based upon work supported by the National Institute of Justice Grant No. NIJ 2013-R2-CX-K012 and is gratefully acknowledged by the author. The author is thankful to Brett Brownfield, Tony Lemos, and Ikwulono Unobe for performing the data analysis presented.
References

[1] L.A. Klein, Sensor and Data Fusion Concepts and Applications, second ed., SPIE Optical Engineering Press, Bellingham, WA, 1999.
[2] B. Brownfield, J.H. Kalivas, Consensus outlier detection using sum of ranking differences of common and new outlier measures without tuning parameter selections, Anal. Chem. 89 (2017) 5087–5094.
[3] B. Brownfield, T. Lemos, J.H. Kalivas, Consensus classification using non-optimized classifiers, Anal. Chem. 90 (2018) 4429–4437.
[4] I. Unobe, L. Lau, J.H. Kalivas, R. Rodriguez, A. Sorensen, Restoration of defaced serial numbers using lock-in infrared thermography, in preparation.
[5] J.H. Kalivas, K. Héberger, E. Andries, Sum of ranking differences (SRD) to ensemble multivariate calibration model merits for tuning parameter selection and comparing calibration methods, Anal. Chim. Acta 869 (2015) 21–33.
[6] A.J. Tencate, J.H. Kalivas, A.J. White, Fusion strategies for selecting multiple tuning parameters for multivariate calibration and other penalty based processes: a model updating application for pharmaceutical analysis, Anal. Chim. Acta 921 (2016) 28–37.
[7] T. Næs, T. Isaksson, T. Fearn, T. Davies, A User Friendly Guide to Multivariate Calibration and Classification, NIR Publications, Chichester, UK, 2002.
[8] K.I. Penny, I.T. Jolliffe, A comparison of multivariate outlier detection methods for clinical laboratory safety data, J. R. Stat. Soc. Ser. D (Stat.) 50 (2001) 295–307.
[9] D.M. Hawkins, Identification of Outliers, Chapman and Hall, London, 1980.
[10] B. Walczak, D.L. Massart, Multiple outlier detection revisited, Chemom. Intell. Lab. Syst. 41 (1998) 1–15.
[11] P.J. Rousseeuw, A.M. Leroy, Robust Regression and Outlier Detection, Wiley, Hoboken, NJ, 1987.
[12] M. Hubert, Robust calibration, in: S.D. Brown, R. Tauler, B. Walczak (Eds.), Comprehensive Chemometrics: Chemical and Biochemical Data Analysis, vol. 3, Elsevier, Amsterdam, 2009, pp. 315–343.
[13] P.J. Rousseeuw, Least median of squares regression, J. Am. Stat. Assoc. 79 (1984) 871–880.
[14] R. Pell, Multiple outlier detection for multivariate calibration using robust statistical techniques, Chemom. Intell. Lab. Syst. 52 (2000) 87–104.
[15] A.L. Pomerantsev, O.Y. Rodionova, Concept and role of extreme objects in PCA/SIMCA, J. Chemom. 28 (2014) 429–438.
[16] K. Héberger, Sum of ranking differences compares methods or models fairly, Trends Anal. Chem. 29 (2010) 101–109.
[17] K. Héberger, K. Kollár-Hunek, Sum of ranking differences for method discrimination and its validation: comparison of ranks with random numbers, J. Chemom. 25 (2011) 151–158.
[18] K. Kollár-Hunek, K. Héberger, Method and model comparison by sum of ranking differences in cases of repeated observations (ties), Chemom. Intell. Lab. Syst. 127 (2013) 139–146.
[19] S. Roberts, M. Martin, L. Zheng, An adaptive, automatic multiple-case deletion technique for detecting influence in regression, Technometrics 57 (2015) 408–417.
[20] C. Becker, U. Gather, The masking breakdown point of multivariate outlier identification rules, J. Am. Stat. Assoc. 94 (1999) 947–955.
[21] P.J. Rousseeuw, B.C. van Zomeren, Unmasking multivariate outliers and leverage points, J. Am. Stat. Assoc. 85 (1990) 633–651.
[22] H.D. Li, Y.Z. Liang, D.S. Cao, Q.S. Xu, Model-population analysis and its applications in chemical and biological modeling, Trends Anal. Chem. 38 (2012) 154–162.
[23] L. Zhang, D. Wang, R. Gao, P. Li, W. Zhang, J. Mao, L. Yu, X. Ding, Q. Zhang, Improvement on enhanced Monte-Carlo outlier detection method, Chemom. Intell. Lab. Syst. 151 (2016) 89–94.
[24] J.M. Andrade, M.P. Gómez-Carracedo, W. Krzanowski, M. Kubista, Procrustes rotation in analytical chemistry, a tutorial, Chemom. Intell. Lab. Syst. 72 (2004) 123–132.
[25] J.H. Kalivas, Learning from Procrustes analysis to improve multivariate calibration, J. Chemom. 22 (2008) 227–234.
[26] C.E. Anderson, J.H. Kalivas, Fundamentals of calibration transfer through Procrustes analysis, Appl. Spectrosc. 53 (1999) 1268–1276.
[27] I.S. Helland, T. Næs, T. Isaksson, Related versions of the multiplicative scatter correction method for preprocessing spectroscopic data, Chemom. Intell. Lab. Syst. 29 (1995) 233–241.
[28] P. Geladi, D. MacDougall, H. Martens, Linearization and scatter-correction for near-infrared reflectance spectra of meat, Appl. Spectrosc. 39 (1985) 491–500.
[29] R. Emerson, J.H. Kalivas, in preparation for submission, 2018.
[30] H. Martens, E. Stark, Extended multiplicative signal correction and spectral interference subtraction: new preprocessing methods for near infrared spectroscopy, J. Pharm. Biomed. Anal. 9 (1991) 625–635.
[31] R.S. Berns, K.H. Petersen, Empirical modeling of systematic spectrophotometric errors, Color Res. Appl. 13 (1988) 243–256.
[32] J. Ottaway, J.H. Kalivas, Feasibility study to transform spectral and instrumental artifacts for multivariate calibration maintenance, Appl. Spectrosc. 69 (2015) 407–416.
[33] B.M. Wise, N.B. Gallagher, Eigenvector Research, Manson, WA.
[34] F.M. Campos, L. Correia, J.M.F. Calado, Robust visual localization through local feature fusion: an evaluation of multiple classifier systems, J. Intell. Robotic Syst. 77 (2015) 377–390.
[35] M. Woźniak, M. Graña, E. Corchado, A survey of multiple classifier systems as hybrid systems, Inf. Fusion 16 (2014) 3–17.
[36] T.K. Ho, J.J. Hull, S.N. Srihari, Decision combination in multiple classifier systems, IEEE Trans. Pattern Anal. Mach. Intell. 16 (1994) 66–75.
[37] J. Kittler, M. Hatef, R.P.W. Duin, J. Matas, On combining classifiers, IEEE Trans. Pattern Anal. Mach. Intell. 20 (1998) 226–239.
[38] D. Ruta, B. Gabrys, Classifier selection for majority voting, Inf. Fusion 6 (2005) 63–81.
[39] L. Xu, A. Krzyzak, C.Y. Suen, Methods of combining multiple classifiers and their applications to handwriting recognition, IEEE Trans. Syst. Man Cybern. 22 (1992) 418–435.
[40] S. Džeroski, B. Ženko, Is combining classifiers with stacking better than selecting the best one? Mach. Learn. 54 (2004) 255–273.
[41] M. Hibon, T. Evgeniou, To combine or not to combine: selecting among forecasts and their combinations, Int. J. Forecast. 21 (2005) 15–24.
[42] M. Fernández-Delgado, E. Cernadas, S. Barro, D. Amorim, Do we need hundreds of classifiers to solve real world classification problems? J. Mach. Learn. Res. 15 (2014) 3133–3181.
[43] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, second ed., Springer, New York, 2009.
[44] L. Breiman, Bagging predictors, Mach. Learn. 24 (1996) 123–140.
[45] S.-B. Cho, J.H. Kim, Combining multiple neural networks by fuzzy integral for robust classification, IEEE Trans. Syst. Man Cybern. 25 (1995) 380–384.
[46] S. Hashem, B. Schmeiser, Improving model accuracy using optimal linear combinations of trained neural networks, IEEE Trans. Neural Netw. 6 (1995) 792–794.
[47] A. Biancolillo, R. Bucci, A.L. Magrì, A.D. Magrì, F. Marini, Data-fusion for multiplatform characterization of an Italian craft beer aimed at its authentication, Anal. Chim. Acta 820 (2014) 23–31.
[48] T.D. Stokes, M. Foteini, B. Brownfield, J.H. Kalivas, G. Mousdis, A. Amine, C. Georgiou, Feasibility assessment of synchronous fluorescence spectral fusion by application to argan oil for adulteration analysis, Appl. Spectrosc. (2018), in press.
[49] J. Gerretzen, E. Szymańska, J.J. Jansen, J. Bart, H.-J. van Manen, E.R. van den Heuvel, L.M.C. Buydens, Simple and effective way for data preprocessing selection based on design of experiments, Anal. Chem. 87 (2015) 12096–12103.
[50] J. Engel, J. Gerretzen, E. Szymańska, J.J. Jansen, G. Downey, L. Blanchet, L.M.C. Buydens, Breaking with trends in pre-processing? Trends Anal. Chem. 50 (2013) 96–106.
[51] L. Xu, Y.-P. Zhou, L.-J. Tang, H.-L. Wu, J.-H. Jiang, G.-L. Shen, R.-Q. Yu, Ensemble preprocessing of near-infrared (NIR) spectra for multivariate calibration, Anal. Chim. Acta 616 (2008).
[52] M. Choi, K. Kang, J. Park, W. Kim, K. Kim, Quantitative determination of a subsurface defect of reference specimen by lock-in infrared thermography, NDT E Int. 41 (2008) 119–124.
[53] A. Killey, J.P. Sargent, Analysis of thermal nondestructive testing, J. Phys. D: Appl. Phys. 22 (1989) 216–224.
[54] C.M. Sayers, Detectability of defects by thermal non-destructive testing, Br. J. Non-Destruct. Test. 26 (1984) 28–33.
[55] O. Breitenstein, W. Warta, M. Langenkamp, Lock-in Thermography: Basics and Use for Evaluating Electronic Devices and Materials, Springer Science & Business Media, 2010.
[56] P.B. Wilson, The restoration of erased serial identification marks, Police J.: Theory, Practice, Principles 52 (1979) 233–242.
[57] D.E. Polk, B.C. Giessen, Metallurgical aspects of serial number recovery, AFTE J. 21 (1989) 174–181.
[58] G. Peeler, A.J. Gutowski, A.J. Wrobel, G. Dower, The restoration of impressed characters on aluminum alloy motorcycle frames, J. Forensic Ident. 58 (2008) 27–32.
[59] W. Bai, B.S. Wong, Evaluation of defects in composite plates under convective environments using lock-in thermography, Meas. Sci. Technol. 12 (2001) 142–150.
[60] A. Khotanzad, Y.H. Hong, Invariant image recognition by Zernike moments, IEEE Trans. Pattern Anal. Mach. Intell. 12 (1990) 489–497.
[61] C. Singh, E. Walia, R. Upneja, Accurate calculation of Zernike moments, Inf. Sci. 233 (2013) 255–275.