Accepted Manuscript
Leveraging multiple linear regression for wavelength selection
Tony Lemos, John H. Kalivas
PII: S0169-7439(17)30373-8
DOI: 10.1016/j.chemolab.2017.07.011
Reference: CHEMOM 3474
To appear in: Chemometrics and Intelligent Laboratory Systems
Received Date: 31 May 2017
Revised Date: 18 July 2017
Accepted Date: 24 July 2017
Please cite this article as: T. Lemos, J.H. Kalivas, Leveraging multiple linear regression for wavelength selection, Chemometrics and Intelligent Laboratory Systems (2017), doi: 10.1016/j.chemolab.2017.07.011.
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Leveraging Multiple Linear Regression for Wavelength Selection
Tony Lemos a, John H. Kalivas a,*
a Department of Chemistry, Idaho State University, Pocatello, ID 83209, USA
*Corresponding Author. Email Address: [email protected]
ABSTRACT
Wavelength selection is often used with multivariate calibration methods to lower prediction error for the calibrated sample properties. As a result, there is a plethora of wavelength selection methods to choose from, each with unique advantages and disadvantages. All wavelength selection methods involve a range of tuning parameters, making the methods cumbersome or complex and hence difficult to work with. The goal of this study is to provide a simple process to select wavelengths for multivariate calibration methods while trying to standardize values of the five adjustable algorithm tuning parameters across data sets. The proposed method uses multiple linear regression (MLR) as an indicator of which wavelengths should be used further to form a multivariate calibration model by a process such as partial least squares (PLS). From a collection of MLR models formed from randomly selected wavelengths, those models within thresholds of the bias indicator root mean square error of calibration (RMSEC) and the variance indicator model vector L2 norm are evaluated to ascertain the most frequently selected wavelengths. A portion of the most frequent wavelengths is retained and used to produce a calibration model by PLS. This proposed wavelength selection method is compared to PLS models based on full spectra. Several near infrared data sets are evaluated, showing that PLS models based on MLR selected wavelengths provide improved prediction errors. Of the five adjustable parameters for the wavelength selection method, three could be standardized across the data sets while the other two required minor tuning. Recommendations are provided as to alternate wavelength selection algorithms.
Keywords: Multivariate calibration; Wavelength selection; Multiple linear regression
1. Introduction
In analytical chemistry, there are different multivariate regression techniques to characterize chemical and physical properties as analytes. Some of these techniques include multiple linear regression (MLR) and partial least squares (PLS) [1,2]. The method of MLR requires selection of r wavelengths from p full wavelength spectra such that the X matrix of row-wise spectra for n samples and r columns is full rank for the inverse operation. Full rank requires the rank of X to be r with n ≥ r. Partial least squares and other methods can be used in full spectral mode at p wavelengths or with a set of r selected wavelengths. With wavelength selection there is generally a tradeoff between bias and variance. Specifically, the potential for prediction variance inflation increases with selected wavelengths, as characterized by the size of the Euclidean vector L2 norm (2-norm) of the regression vector. Compared to a full wavelength PLS regression vector L2 norm, a wavelength selected L2 norm is larger, indicating greater variance. Nevertheless, wavelength selection is popular because of documented improvement in prediction errors compared to a full wavelength method. Similar results are noted in other fields [3].
Over the past several decades, many wavelength selection methods have been created to obtain optimal regression vectors for MLR as well as full spectral methods. Some of these methods include interval PLS [4], genetic algorithms [5-7], selectivity ratios [8], simulated annealing [9,10], methods based on Tikhonov regularization (TR) variants that simultaneously select wavelengths and form the model [11-16] (penalty regression methods, one of which is based on the L1 norm and is commonly referred to as the least absolute shrinkage and selection operator (LASSO) [17]), and methods based on Monte Carlo approaches [18-24]. This list is a small fraction of the algorithms that can be used. It is interesting to note that many wavelength selection algorithms base selection on merits assessing the signal to noise ratio of regression vector coefficients, as proposed in a wavelength selection algorithm from 1974 [25].
Professor Yi-Zeng Liang spent a considerable amount of his career working on wavelength selection ([16,21-24,26-33], to mention a few). Most notable are his competitive adaptive reweighted sampling (CARS) [33] and his recent work involving Monte Carlo variations [21-24].
Regardless of the method used to select wavelengths, each method has respective tuning parameters that must be adjusted to obtain effective models. The number of tuning parameters ranges from as few as one, as with some TR variants, to several, making these methods cumbersome to work with. The goal of this study is to provide a simple process to select wavelengths using standardized algorithm tuning parameter values across multiple near infrared (NIR) data sets that is, hopefully, extendable to other spectral data sets.
A Monte Carlo approach is proposed that leverages MLR to select appropriate wavelengths. Once a set of wavelengths is selected, PLS is used to form the final calibration model. Results from wavelength selected PLS models are compared to results obtained from full wavelength PLS models to determine if the MLR selected wavelengths improve model performance. The effectiveness of this proposed method is tested with several NIR data sets.
2. Methods
The proposed MLR wavelength selection procedure is described using general variables to represent the tuning parameters. Exact values are stated in section 4. Results and discussion.
Before describing the new approach, measures of model quality used in the selection process are
first defined.
2.1. MLR and PLS Model quality measures
Measures of MLR model quality are needed to ascertain which wavelengths should ultimately be chosen from the multitude of models formed by the Monte Carlo process of
random wavelength selection. Bias and variance measures are both used to prevent over or under fitting of a calibration model. This allows one to determine which MLR models are the better models and hence, which wavelengths. In this study, only one bias measure and one variance measure are used to keep the method as simple as possible. The L2 norm of an estimated MLR model vector $\hat{b}$, symbolized by $\|\hat{b}\|_2$, is used as the variance measure. The bias measure is the root mean square error of calibration (RMSEC) for a calibration set with n samples calculated by
$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n}} \qquad (1)$$
where $y_i$ represents the reference value for the ith sample and $\hat{y}_i$ denotes the corresponding estimated value from the MLR model.
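As an illustration of the two quality measures, a minimal Python sketch follows. It is not the authors' MATLAB code: the function names `rmse` and `l2_norm` are hypothetical, and plain Python lists are assumed for the reference values, the estimated values, and the model vector.

```python
import math


def rmse(y_ref, y_hat):
    """Root mean square error of equation (1): sqrt(sum((y_i - yhat_i)^2) / n)."""
    if not y_ref or len(y_ref) != len(y_hat):
        raise ValueError("y_ref and y_hat must be non-empty and of equal length")
    squared_error = sum((a - b) ** 2 for a, b in zip(y_ref, y_hat))
    return math.sqrt(squared_error / len(y_ref))


def l2_norm(b):
    """Euclidean (L2) norm of a model vector, used here as the variance indicator."""
    return math.sqrt(sum(v * v for v in b))
```

Applied to calibration samples, `rmse` gives the RMSEC; applied to prediction samples it gives the RMSEP, since both use the form of equation (1).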
The performance of the final selected wavelengths used with PLS as well as the full spectrum PLS models are assessed using the same bias and variance measures of model quality used for MLR. Specifically, equation (1) is used as a bias measure formatted to the root mean
square error of prediction (RMSEP) for the prediction sample set not used in calibration, and the PLS model L2 norms are used to characterize variance. Respective R2 values from plotting predicted values against reference values are also evaluated.
2.2. MLR wavelength selection and PLS prediction
A given data set is split into calibration and prediction sets. In this study, 80% of the samples are used for the calibration set and 20% for the prediction set, but other proportions can be used. The percentage split is not a tuning parameter for the MLR wavelength selection algorithm.
Using the calibration samples, an MLR model is formed with r randomly selected wavelengths from the full set of p wavelengths. This process is repeated m times for m MLR models. The resulting RMSEC values are plotted against the corresponding L2 norms. A percentage, h, of the total number of MLR models with the lowest RMSEC values is retained and the same percentage of MLR models with the lowest L2 norm values is also retained. The MLR models common to the separate lowest valued RMSEC and L2 norm collections (the intersection) are then isolated. It is important that the MLR models satisfy both criteria at the intersection in order to remove over and under fitted models.
The MLR wavelength selection process is repeated t times for the same calibration samples, producing t sets of models from respective intersections. The t intersected sets are combined and a histogram of the selected wavelengths is formed. A percentage, w, of the calibration sample set rank (rank ≤ min(n,p)) is used to determine how many wavelengths are selected from the histogram. This number of wavelengths is then selected from the respective most frequently used wavelengths in the histogram. The final selected wavelength set is assessed by predicting the prediction sample set using PLS to form the RMSEP and corresponding R2
values. These values are compared to the respective RMSEP and R2 values using full spectrum PLS models. Figure 1 provides a representation of the MLR wavelength selection steps for a particular calibration set.
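The Monte Carlo step of forming m MLR models from r randomly selected wavelengths can be sketched as below. This is an illustrative sketch rather than the authors' MATLAB implementation. It assumes mean-centered data held as lists of lists, fits each MLR model through the normal equations (a simplification chosen to stay within the Python standard library; a QR or pseudoinverse solver may be numerically preferable), and uses hypothetical names (`solve_linear`, `fit_mlr`, `monte_carlo_models`).

```python
import math
import random


def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    k = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda i: abs(aug[i][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("selected wavelengths do not give a full rank X")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for i in range(col + 1, k):
            factor = aug[i][col] / aug[col][col]
            for j in range(col, k + 1):
                aug[i][j] -= factor * aug[col][j]
    x = [0.0] * k
    for i in reversed(range(k)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, k))
        x[i] = (aug[i][k] - tail) / aug[i][i]
    return x


def fit_mlr(X, y, cols):
    """Fit MLR on the chosen columns; return (model vector, RMSEC, L2 norm)."""
    Xs = [[row[c] for c in cols] for row in X]
    r = len(cols)
    xtx = [[sum(row[i] * row[j] for row in Xs) for j in range(r)] for i in range(r)]
    xty = [sum(row[i] * yi for row, yi in zip(Xs, y)) for i in range(r)]
    b = solve_linear(xtx, xty)
    resid = [yi - sum(bj * xj for bj, xj in zip(b, row)) for row, yi in zip(Xs, y)]
    rmsec = math.sqrt(sum(e * e for e in resid) / len(y))
    norm = math.sqrt(sum(bj * bj for bj in b))
    return b, rmsec, norm


def monte_carlo_models(X, y, r, m, seed=None):
    """Form m MLR models, each from r wavelengths drawn at random without replacement."""
    rng = random.Random(seed)
    p = len(X[0])
    models = []
    for _ in range(m):
        cols = sorted(rng.sample(range(p), r))
        _, rmsec, norm = fit_mlr(X, y, cols)
        models.append((cols, rmsec, norm))
    return models
```

Each entry of the returned list holds the selected wavelength indices with the model's RMSEC and L2 norm, which is the information needed for the RMSEC versus L2 norm plot and the subsequent retention by h.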
From the previous description of the method and Figure 1, it is apparent there are five tuning parameters: r (number of random wavelengths to select), m (the number of times r random wavelengths are selected to form the corresponding m MLR models), h (percentage of the m models to retain), t (number of times to run an r, m, h combination), and w (final number of wavelengths to retain from the combined t sets, expressed as a percentage of the calibration set rank). Values for each tuning parameter are discussed in respective sections of section 4. Results and discussion. From the presented discussions, it is shown that of the five tuning parameters, only two actually need to be tuned (r and w), and the suggested values for these two exhibit similar trends across data sets. Processes are also suggested for determining optimal values.
Each data set is split into calibration/prediction sets 55 times in this study to evaluate the consistency of selected wavelengths. More and fewer splits were evaluated, but 55 was found to adequately show the wavelength variation between splits.
3. Experimental
As noted in section 2.2. MLR wavelength selection and PLS prediction, each data set is
split such that 80 percent of the samples become the calibration set and the remainder are placed in the prediction set. The calibration set is used for MLR wavelength selection and the prediction
set is used to assess the performance of PLS models formed from the selected wavelengths and the full spectrum PLS models. The calibration/prediction split is performed 55 times to characterize the selected wavelength variations. Mean centering is used where the calibration set
is column-wise centered and the prediction set is centered to the calibration set.
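The split and centering steps can be sketched as follows. This is a hedged illustration, not the authors' code: the helper names `split_indices` and `center_to_calibration` are hypothetical, samples are assumed to be assigned at random, and rounding the calibration set size to the nearest integer is an assumption (it reproduces the 64/16 corn split stated in section 4.5).

```python
import random


def split_indices(n, cal_fraction=0.8, seed=None):
    """Randomly assign n sample indices to calibration and prediction sets."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    n_cal = round(cal_fraction * n)  # assumed rounding rule
    return sorted(idx[:n_cal]), sorted(idx[n_cal:])


def center_to_calibration(X_cal, X_pred):
    """Column-wise mean center X_cal and apply the same means to X_pred."""
    n = len(X_cal)
    means = [sum(col) / n for col in zip(*X_cal)]

    def center(X):
        return [[v - mu for v, mu in zip(row, means)] for row in X]

    return center(X_cal), center(X_pred), means
```

Calling `split_indices` with 55 different seeds would give the 55 calibration/prediction splits; the prediction spectra are always shifted by the calibration means so that no prediction information enters the calibration.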
3.1. Algorithms
All algorithms for MLR, wavelength selection by MLR, and PLS were written by the
authors using MATLAB (version 9.0.0 (R2016a), The MathWorks Inc., Natick, MA, USA).
3.2 Data sets
A total of five data sets were assessed with up to 19 data situations (multiple instruments and multiple prediction properties). Only three data sets with four of the possible data situations
are shown. These results are representative of the general results obtained.
3.2.1 Corn
This data set consists of 80 samples measured on three instruments designated m5, mp5 and mp6. Samples are measured in the wavelength range of 1100-2498 nm at 2 nm intervals,
resulting in 700 wavelengths. Each sample has four prediction properties: moisture, oil, protein, and starch. Presented are m5 and prediction properties oil and moisture. The data set is freely available from the Eigenvector Research website [34].
3.2.2 Sugar
7
ACCEPTED MANUSCRIPT
The data set consists of three sugars; sucrose, glucose and fructose in aqueous solutions for 125 samples [35]. Spectra are in the wavelength range 1100-2498 nm at 2 nm intervals for
3.2.3 Gasoline
RI PT
700 wavelengths. The prediction property shown is sucrose.
The data set consists of 55 samples measured over 401 wavelengths from 900-1700 nm in
2 nm increments [36]. The prediction property is octane number.
4. Results and discussion
There are five adjustable parameters (r, m, h, t, and w) to obtain an effective set of wavelengths. See the discussion in section 2.2. MLR wavelength selection and PLS prediction and Figure 1 for respective definitions. The following discussion describes the effect each
parameter has and recommends standard values for three of the parameters (m, h, and t) that should be useful for data sets other than those presented. Recommendations are made for the other two parameters (r and w). Following the parameter discussion, the selected wavelengths based on the calibration data are used to form PLS models to predict the prediction sets for the three data sets.
4.1. Number of wavelengths for each MLR model (r)
The first step in the MLR wavelength selection process is to choose a value for r, the number of wavelengths to randomly select from the full p wavelength set available. To determine the effect this parameter has on the RMSEC and L2 norms, various values were used for the data sets. Shown in Figure 2 are typical results using the corn data predicting moisture content with m set to 10,000 for three values of r: an r value close to the rank of the calibration spectra (60 in this case with a rank 64 calibration set from the 80/20% calibration/prediction split), a midrange value (20), and a small value (10). From Figure 2 it is observed that the number of wavelengths alters how the MLR models perform. The more wavelengths used, the lower the RMSEC values tend to be.
Because a future step in the MLR wavelength selection process is to retain an equal
number of models with the lowest RMSEC and L2 values and at the same time, guard against over and under fitted models, the value of r is critical. For example, from Figure 2 it is observed
that when r is too large, many models are generally over fitted to the calibration set and these models will probably not predict the prediction set well. The better outcome is shown with r at 20 wavelengths in Figure 2. In this case, a cone shape is obtained such that the set of models from the intersection of the models with lowest RMSEC and L2 norm values will be fairly well
represented. In analyzing several data sets, a value of r just under the midrange of the calibration sample rank worked well. Alternatively, by empirical observation, the value of r can be adjusted until the cone shape plot shown in Figure 2 is obtained. For all the data sets in this study, r was
set to 20.
4.2. Number of MLR models (m)
Deciding a value for m is not as critical as for r. The number of MLR models should be sufficient to obtain a representative range of models. As shown in Figure 2 with m set to 10,000 models, this value clearly spans the space of possible models and a greater number is not needed. With a smaller value, it was found that the models tended to be more sparse (spread out) in Figure 2 and hence, when a subset of the models is retained (set by h, described next), there are
too many poor models included. In general, the more models formed, the more likely the wavelengths useful for prediction will be retained. For all the data sets, it was decided to fix m to
10,000.
4.3. Percentage of MLR models with low RMSEC and L2 norm values (h)
A subset of the m MLR models is needed since only a portion of all the models formed are of any use. Thus, only a percentage of the total MLR models, h, is retained. As noted previously, two subsets of models are formed, each with the same value of h. One subset has the
h percentage MLR models with the lowest RMSEC values and the other subset has the h percentage MLR models with the lowest L2 norm values. If just the models with the lowest RMSEC values are inspected for wavelengths selected, then many of these wavelengths are part of over fitted models and would not be useful for predicting new samples. Similarly, if just the
models with the lowest L2 norms are inspected, then many of these wavelengths are part of under fitted models and would also not be useful for predicting new samples. Thus, models in both subsets (the intersection) are retained and analyzed for common wavelengths. The value for h
cannot be too large or too many poor models are included; the same is true when the value is too small. With m set to 10,000 models for all data sets, it was found that a value of h set to 30%
provided a good selection of balanced models in the intersection with generally about 1,000 models retained for wavelength evaluation.
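The retention of models by h and the intersection of the two subsets can be sketched as below. The function name `intersect_best` is hypothetical, and rounding the number of retained models to the nearest integer is an assumption not stated in the text.

```python
def intersect_best(rmsec, norms, h_percent):
    """Indices of models in both the lowest-RMSEC and lowest-L2-norm h% subsets."""
    m = len(rmsec)
    if len(norms) != m:
        raise ValueError("rmsec and norms must describe the same m models")
    keep = round(m * h_percent / 100)  # models retained per criterion (assumed rounding)
    low_rmsec = set(sorted(range(m), key=lambda i: rmsec[i])[:keep])
    low_norm = set(sorted(range(m), key=lambda i: norms[i])[:keep])
    return sorted(low_rmsec & low_norm)
```

As a rough check under an assumption of our own, if the RMSEC and L2 norm rankings were statistically independent, h = 30% with m = 10,000 would leave about 0.3 x 0.3 x 10,000 = 900 models in the intersection, which is of the same order as the roughly 1,000 models reported.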
4.4. Number of intersection sets (t)
It was found that using only one set of intersected models did not provide a useful histogram. Specifically, dominant wavelengths that should ultimately be selected for use by a multivariate calibration method were not obvious. In essence, too many spurious wavelengths would be selected. Thus, a histogram is formed from the t intersection sets, and dominant wavelengths become much more obvious, allowing a straightforward selection. The value for t needs to be large enough to obtain a discernible histogram. From trial and error, it was found that after t = 50, the histogram typically converges in terms of wavelength selection trends. Thus, t was set
to 50 for all the data sets.
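Combining the t intersection sets into a single wavelength histogram can be sketched as below. The function name `wavelength_histogram` and the input layout (t collections of models, each model given as its list of wavelength indices) are assumptions made for illustration.

```python
from collections import Counter


def wavelength_histogram(intersection_sets):
    """Count how often each wavelength index appears across all retained models.

    intersection_sets: t collections; each holds the retained models of one run,
    and each model is an iterable of the wavelength indices it used.
    """
    counts = Counter()
    for models in intersection_sets:
        for cols in models:
            counts.update(cols)
    return counts
```

With a single run (t = 1) the counts are dominated by chance co-selection; accumulating over many runs is what lets the dominant wavelengths stand out.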
4.5. Percentage of selected wavelengths (w)
The final tuning parameter is how many wavelengths to select from the histogram of the wavelengths used in the MLR models present in each of the t intersected low RMSEC and L2 norm regions. As noted previously, the number of wavelengths is a percentage, w, of the original n x p calibration set rank (rank ≤ min(n,p)). The corresponding most frequently used wavelengths from the histogram form the final wavelength set. The number of wavelengths is ultimately user defined because different values for w produce multivariate calibration models (PLS in this study) with varying degrees of performance per data set. However, the general trend over the range of w is that the smaller the value, the worse the model performance, and the larger the value, the more similar the performance is to full wavelength models.
Some of these trends are demonstrated in Figure 3 for corn moisture on instrument m5. In this example, with the 80/20% calibration/prediction split, there are 64 samples in the calibration set leaving 16 samples in the prediction set. With 64 samples in the calibration set, the rank is 64 and shown are w values of 20%, 40%, 80%, and 120% of the calibration rank giving 13, 26, 51, and 77 wavelengths, respectively. These wavelengths would be selected from the histogram as the respective 13, 26, 51, and 77 most frequently used wavelengths. From the images, as the number of wavelengths increases, more wavelengths are coming from the two dominant wavelength regions. This is the general trend for most data sets with intermediate wavelengths sometimes being selected. Using PLS with final selected wavelengths, the mean RMSEP plots across the 55 calibration/prediction splits show that as a greater percentage of the calibration rank is used for w, the more an RMSEP plot resembles the full spectrum PLS plot. The RMSEP values are plotted against the L2 norm to demonstrate that while a wavelength subset may predict better than the full spectrum model, the L2 norm is larger.
After exploring different w values for all the respective data situations, 80% is used in
this study. An approach not evaluated for determining w is to use some form of cross validation on the calibration set with different values of w. A value for w that “behaves” close to ideal
would be chosen to form the final model and predict the prediction set.
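The conversion from w to a number of wavelengths, and the final pick from the histogram, can be sketched as below. Rounding to the nearest integer is an assumption, although it reproduces the counts quoted in this study (13, 26, 51, and 77 for rank 64; 80 for rank 100; 35 for rank 44). The names `n_selected` and `top_wavelengths`, and breaking frequency ties in favor of the lower wavelength index, are likewise hypothetical choices.

```python
def n_selected(rank, w_percent):
    """Number of wavelengths to retain: w percent of the calibration set rank."""
    return round(rank * w_percent / 100)  # assumed rounding to nearest integer


def top_wavelengths(counts, k):
    """The k most frequently used wavelength indices, returned in ascending order.

    counts maps wavelength index -> frequency; ties go to the lower index.
    """
    ranked = sorted(counts, key=lambda wl: (-counts[wl], wl))
    return sorted(ranked[:k])
```

Note that Python's built-in `round` resolves exact .5 ties to the nearest even integer, so a different rounding rule may be preferred when such ties can occur.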
4.6. Corn
Shown in Figure 4 are a few more details from Figure 3 of the overall trends and
performance of the MLR wavelength selected models for corn moisture on instrument m5. In Figure 4, the mean RMSEP values are also plotted against PLS latent variable (LV) to better see the improvement in prediction error over the full wavelength models. From Figure 4, it is also
observed that when the RMSEP values are plotted against the L2 norm, the tradeoff between bias and variance is more obvious. Over the 55 splits, the wavelengths selected are consistently the same.
These wavelengths are similar to wavelengths that other wavelength selection algorithms have chosen [12,16,19,33,37-39].
The corn situation predicting oil content on instrument m5 is shown in Figure 5. Similar
to moisture, the selected wavelengths predict better than full wavelengths and the L2 norm is much larger. From Figure 5e, it is observed that more variation in the selected wavelengths occurs compared to moisture. Regardless, there are still dominant bands of wavelengths and the small deviations do not seem to affect the prediction error variations across the 55 splits any
more than the full spectra PLS models. The wavelength bands are similar to those selected by other algorithms [16,37].
4.7. Sugar
The results and trends for predicting sucrose in sugar are shown in Figure 6. The MLR chosen wavelengths for each calibration set are consistent, with some of the same wavelengths
being present or absent depending on the calibration split. In each situation, 80 wavelengths are selected (80% of the calibration rank for the full spectrum calibration set size 100 x 700 on each
split for rank 100). The RMSEP and R2 values for selected wavelengths show improvement when compared to the full spectrum PLS models, especially in terms of consistency across the data splits. As expected, the L2 norms are larger for the selected wavelengths.
4.8. Gasoline
Presented in Figure 7 are the results for predicting gasoline octane number. In this case, calibration sets are 44 x 401 and hence, at 80% of the calibration rank, there are 35 wavelengths
selected. From the image in Figure 7e, there is one primary band, a few dominant wavelengths, and other key wavelengths that are present or absent depending on the split. With the larger
number of individual wavelengths being selected over mostly bands, the variation in the RMSEP and R2 values across the 55 splits is greater than with the other data sets where wavelength bands dominate. The behavior of the wavelength selected PLS models is similar to that of the full spectra PLS models. It may be that different wavelength selection parameters would provide better performing models. However, this data set has been studied by others and the bands and other general wavelengths selected are similar [10,11,16].
5. Conclusions
The goal of this study was to develop a simple wavelength selection method that could be standardized with respect to algorithm operational parameters across data sets thereby
eliminating the need to tune parameters to each particular data set. From the work presented, it is clear that total elimination was not possible. For the method to work appropriately, it was found that of the five algorithm parameters (r (number of random wavelengths to select), m (the
number of times r random wavelengths are selected), h (percentage of the m models to retain), t (number of times to run the r, m, h combination), and w (final number of wavelengths to retain
from the combined t sets)), two required adjustment per data situation (r and w) and three could be set to the same values for the 19 data situations studied (four of which were presented). Specifically, for the three fixed value parameters, m can be set to 10,000 to ensure enough models are formed with useful wavelengths; h should be set to 30% for m = 10,000 to provide a
good selection of balanced models in the intersection with generally about 1,000 models ultimately retained for wavelength evaluation; and a value of t set to 50 was found to provide adequate histogram convergence in terms of wavelength selection trends. For the two adjustable
parameters, a value for r just under the midrange of the calibration set rank (or by empirical observation, the value of r can be adjusted until the cone shape plot shown in Figure 2 is
obtained) and setting w to 80% provided acceptable prediction errors from PLS modeling with this final selected wavelength set.
The process presented is based on using random wavelengths to create an
ensemble of MLR models; the wavelengths of the best bias/variance balanced MLR models are analyzed to identify common wavelengths. In actuality, probably the easiest wavelength selection algorithms to operate are the penalty methods. Two of the simplest require one tuning
parameter each. These two are TR in an L1 norm format (also known as LASSO) and an iterative process based on a full wavelength regression vector [11-15]. In both methods, wavelengths are selected simultaneously with forming the multivariate calibration model.
In wavelength selection studies, selected wavelengths can sometimes be correlated to vibrations related to analyte properties. Useful tables are available [40], but such assignments were not attempted in this study. Some caution is warranted, though: just because selected wavelengths are useful for prediction, the wavelengths are not always interpretable relative to the final regression vector [41].
Acknowledgments
This material is based upon work supported by the National Science Foundation under
Grant No. CHE-1506417 (co-funded by CDS&E) and is gratefully acknowledged by the authors.
References
1. T. Næs, T. Isaksson, T. Fearn, T. Davies, A user friendly guide to multivariate calibration
and classification, NIR Publications: Chichester UK, 2002.
2. J. H. Kalivas, Calibration methodologies, in: S. D. Brown, R. Tauler, B. Walczak (Eds.-in-Chief), and J. H. Kalivas (Second ed.), Comprehensive Chemometrics: Chemical and Biochemical Data Analysis, Elsevier, Amsterdam, 2009, Vol. 3. Chap. 1, pp. 1-32.
3. I. Guyon, A. Elisseeff, An introduction to variable and feature selection, J. Mach. Learn. Res. 3 (2003) 1157-1182.
4. L. Nørgaard, A. Saudland, J. Wagner, J.P. Nielsen, L. Munck, S.B. Engelsen, Interval partial least-squares regression (iPLS): A comparative chemometric study with an example from near-infrared spectroscopy, Appl. Spectrosc. 54 (2000) 413-419.
5. D. Jouan-Rimbaud, D.L. Massart, R. Leardi, O.E. de Noord, Genetic algorithms as a tool for wavelength selection in multivariate calibration, Anal. Chem. 67 (1995) 4295-4304.
6. M. Arakawa, Y. Yamashita, K. Funatsu, Genetic algorithm-based wavelength selection method for spectral calibration, J. Chemom. 25 (2011) 10-19.
7. C.B. Lucasius, M.L.M. Beckers, G. Kateman, Genetic algorithms in wavelength selection: a comparative study, Anal. Chim. Acta 286 (1994) 135-153.
8. T. Rajalahti, R. Arneberg, A.C. Kroksveen, M. Berle, K.-M. Myhr, O.M. Kvalheim, Discriminating variable test and selectivity ratio plot: Quantitative tools for interpretation and variable (biomarker) selection in complex spectral or chromatographic profiles, Anal. Chem. 81 (2009) 2581-2590.
9. J.H. Kalivas, N. Roberts, J. Sutter, Global optimization by simulated annealing with wavelength selection for ultraviolet-visible spectrophotometry, Anal. Chem. 61 (1989)
2024-2030.
10. J.M. Brenchley, U. Hörchner, J.H. Kalivas, Wavelength selection characterization for NIR spectra, Appl. Spectrosc. 51 (1997) 689-699.
11. F. Stout, J.H. Kalivas, K. Héberger, Wavelength selection for multivariate calibration using Tikhonov regularization, Appl. Spectrosc. 61 (2007) 85-95.
12. J. Ottaway, J.H. Kalivas, E. Andries, Spectral multivariate calibration with wavelength selection using variants of Tikhonov regularization, Appl. Spectrosc. 64 (2010) 1388-1395.
13. J.H. Kalivas, Overview of two-norm (L2) and One-norm (L1) regularization variants for full wavelength or sparse spectral multivariate calibration models or maintenance, J. Chemom. 26 (2012) 218-230.
14. E. Andries, S. Martin, Sparse methods in spectroscopy: an introduction, overview, and perspective, Appl. Spectrosc. 67 (2013) 579-593.
15. E. Andries, Sparse models by iteratively reweighted feature scaling: a framework for wavelength and sample selection, J. Chemom. 27 (2013) 50-62.
16. G.-H. Fu, Q.-S. Xu, H.-D. Li, D.-S. Cao, Y.-Z. Liang, Elastic net grouping variable
selection combined with partial least squares regression (EN-PLSR) for the analysis of
strongly multi-collinear spectroscopic data, Appl. Spectrosc. 65 (2011) 402-408.
17. R.J. Tibshirani, Regression shrinkage and selection via the LASSO, J. Royal Statist. Soc. Ser. B. 58 (1996) 267-288.
18. W.S. Cai, Y.K. Li, X.G. Shao, A variable selection method based on uninformative variable elimination for multivariate calibration of near-infrared spectra, Chemom. Intell. Lab. Syst. 90 (2008) 188-194.
19. C. Esquerre, A.A. Gowen, G. Downey, C.P. O’Donnell, Selection of variables based on most stable normalized partial least squares regression coefficients in an ensemble Monte Carlo procedure, J. Near Infrared Spectrosc. 19 (2011) 343-350.
20. C.A. Esquerre, A.A. Gowen, A. O’Gorman, G. Downey, C.P. O’Donnell, Evaluation of ensemble Monte Carlo variable selection for identification of metabolite markers in NMR
data, Anal. Chim. Acta 964 (2017) 45-54.
21. H.-D. Li, Y.-Z. Liang, Q.-S. Xu, D.-S. Cao, Model-population analysis and its applications in chemical and biological modeling, Trends Anal. Chem. 38 (2012) 154-162.
22. H.-D. Li, Y.-Z. Liang, Q.-S. Xu, D.-S. Cao, Model population analysis for variable selection, J. Chemom. 24 (2010) 418-423.
23. Y.-H. Yun, W.-T. Wang, B.-C. Deng, G.-B. Lai, X.-B. Liu, D.-B. Ren, Y.-Z. Liang, W. Fan, Q.-S. Xu, Using variable combination population analysis for variable selection in multivariate calibration, Anal. Chim. Acta 862 (2015) 14-23.
24. J. Bin, F. Ali, W. Fan, J. Zhou, X. Li, W. Tang, Y. Liang, An efficient variable selection method based on variable permutation and model population analysis for multivariate calibration of NIR spectra, Chemom. Intell. Lab. Syst. 158 (2016) 1-13.
25. J. Ŝustek, Method for the choice of optimal analytical positions in spectrophotometric analysis of multicomponent systems, Anal. Chem. 46 (1974) 1676-1679.
26. B.C. Deng, Y.H. Yun, Y.Z. Liang, L.Z. Yi, A novel variable selection approach that iteratively optimizes variable space using weighted binary matrix sampling, Analyst 139 (2014) 4836–4845.
27. Y.-H. Yun, Y.-Z. Liang, G.-X. Xie, H.-D. Li, D.-S. Cao, Q.-S. Xu, A perspective demonstration on the importance of variable selection in inverse calibration for complex analytical systems, Analyst 138 (2013) 6412-6421.
28. Y.-H. Yun, D.-S. Cao, M.-L. Tan, J. Yan, D.-B. Ren, Q.-S. Xu, L. Yu, Y.-Z. Liang, A simple idea on applying large regression coefficient to improve the genetic algorithm-PLS for variable selection in multivariate calibration, Chemom. Intell. Lab. Syst. 130 (2014) 76-83.
29. H.-D. Li, Q.-S. Xu, Y.-Z. Liang, Random frog: an efficient reversible jump Markov Chain Monte Carlo-like approach for variable selection with applications to gene selection and disease classification, Anal. Chim. Acta 740 (2012) 20-26.
30. Y.-H. Yun, H.-D. Li, L.R.E. Wood, W. Fan, J.-J. Wang, D.-S. Cao, Q.-S. Xu, Y.-Z. Liang, An efficient method of wavelength interval selection based on random frog for multivariate spectral calibration, Spectrochim. Acta A 111 (2013) 31-36.
31. Y.-H. Yun, W.-T. Wang, M.-L. Tan, Y.-Z. Liang, H.-D. Li, D.-S. Cao, H.-M. Lu, Q.-S. Xu, A strategy that iteratively retains informative variables for selecting optimal variable subset in multivariate calibration, Anal. Chim. Acta 807 (2014) 36-43.
32. B.-C. Deng, Y.-H. Yun, Y.-Z. Liang, L.-Z. Yi, A novel variable selection approach that iteratively optimizes variable space using weighted binary matrix sampling, Analyst 139 (2014) 4836-4845.
33. H. Li, Y. Liang, Q. Xu, D. Cao, Key wavelengths screening using competitive adaptive reweighted sampling method for multivariate calibration, Anal. Chim. Acta 648 (2009) 77-84.
34. http://www.eigenvector.com/data/Corn/index.html, last accessed May 25, 2017.
35. P.J. Brown, M. Vannucci, T. Fearn, Bayes model averaging with selection of regressors, J. Roy. Stat. Soc. Ser. B 27 (2002) 519-536.
36. J.H. Kalivas, Two data sets of near infrared spectra, Chemom. Intell. Lab. Syst. 37 (1997) 255-259.
37. M. Shariati-Rad, M. Hasani, Selection of individual variables versus intervals of variables in PLSR, J. Chemom. 24 (2010) 45-56.
38. J. Gerretzen, E. Szymańska, J. Bart, A.N. Davies, H.-J. van Manen, E.R. van den Heuvel, J.J. Jansen, L.M.C. Buydens, Boosting model performance and interpretation by entangling preprocessing selection and variable selection, Anal. Chim. Acta 938 (2016) 44-52.
39. J.P.M. Andries, Y.V. Heyden, L.M.C. Buydens, Improved variable reduction in partial least squares modelling based on Predictive-Property-Ranked Variables and adaptation of partial least squares complexity, Anal. Chim. Acta 705 (2011) 292-305.
40. B.G. Osborne, T. Fearn, P.H. Hindle, Practical NIR spectroscopy with applications in food and beverage analysis, 2nd ed., Longman Scientific & Technical, Essex, UK, 1993, pp. 20-33.
41. C.D. Brown, R.L. Green, Critical factors limiting the interpretation of regression vectors in multivariate calibration, Trends Anal. Chem. 28 (2009) 506-514.
Figure Captions
Fig. 1. Flow chart illustrating the procedure for MLR wavelength selection.
Fig. 2. Effects of changing r, the number of random wavelengths in the first step of the MLR wavelength selection process. Shown is the corn data predicting moisture with m = 10,000 MLR models. Numbers in the key refer to the number of random wavelengths (r). The rank of the calibration set is 64 from the 80/20% calibration/prediction split giving 64 samples in the calibration set.
Fig. 3. Effects on wavelengths selected and the RMSEP from varying values for w for corn moisture on instrument m5. Images show wavelengths selected (yellow) on each of the 55 calibration splits for w at (a) 20%, (b) 40%, (c) 80%, and (d) 120% of the calibration rank (13, 26, 51, and 77 wavelengths, respectively). Plots in (e) are the corresponding mean PLS RMSEP values at each w and the full spectrum PLS models. The rank of the calibration set is 64 from the 80/20% calibration/prediction split giving 64 samples in the calibration set.
Fig. 4. Mean PLS results for corn moisture on instrument m5 (a) RMSEP (blue solid) and R² (orange dash) for full wavelengths, (b) RMSEP (blue solid) and R² (orange dash) for selected wavelengths at 80% of calibration rank (51 wavelengths, see Figure 3c), (c) RMSEP with error bars across the 55 splits at full wavelengths (black solid) and selected wavelengths (blue dash), and (d) R² with error bars at full wavelengths (black solid) and selected wavelengths (blue dash).
Fig. 5. Results for predicting oil content of corn using instrument m5 (a) mean RMSEP (blue solid) and R² (orange dash) for full wavelengths, (b) RMSEP (blue solid) and R² (orange dash) for selected wavelengths, (c) RMSEP with error bars across the 55 splits at full wavelengths (black solid) and selected wavelengths (blue dash), (d) R² with error bars at full wavelengths (black solid) and selected wavelengths (blue dash), and (e) selected wavelengths on each split (yellow).
Fig. 6. Results for the sugar data predicting sucrose content (a) mean RMSEP (blue solid) and R² (orange dash) for full wavelengths, (b) RMSEP (blue solid) and R² (orange dash) for selected wavelengths, (c) RMSEP with error bars across the 55 splits at full wavelengths (black solid) and selected wavelengths (blue dash), (d) R² with error bars at full wavelengths (black solid) and selected wavelengths (blue dash), and (e) selected wavelengths on each split (yellow).
Fig. 7. Results for the gasoline data predicting octane number (a) mean RMSEP (blue solid) and R² (orange dash) for full wavelengths, (b) RMSEP (blue solid) and R² (orange dash) for selected wavelengths, (c) RMSEP with error bars across the 55 splits at full wavelengths (black solid) and selected wavelengths (blue dash), (d) R² with error bars at full wavelengths (black solid) and selected wavelengths (blue dash), and (e) selected wavelengths on each split (yellow).
Fig. 1.
Fig. 2.
Fig. 3.
Fig. 4.
Fig. 5.
Fig. 6.
Fig. 7.
Highlights
Simple wavelength selection algorithm
Wavelengths selected are bias/variance balanced
An ensemble of MLR models is used to characterize wavelengths