Leveraging multiple linear regression for wavelength selection

Accepted Manuscript

Tony Lemos, John H. Kalivas

PII: S0169-7439(17)30373-8
DOI: 10.1016/j.chemolab.2017.07.011
Reference: CHEMOM 3474
To appear in: Chemometrics and Intelligent Laboratory Systems
Received Date: 31 May 2017
Revised Date: 18 July 2017
Accepted Date: 24 July 2017

Please cite this article as: T. Lemos, J.H. Kalivas, Leveraging multiple linear regression for wavelength selection, Chemometrics and Intelligent Laboratory Systems (2017), doi: 10.1016/j.chemolab.2017.07.011. This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Leveraging Multiple Linear Regression for Wavelength Selection

Tony Lemos^a, John H. Kalivas^a,*

^a Department of Chemistry, Idaho State University, Pocatello, ID 83209, USA

*Corresponding Author. Email Address: [email protected]

ABSTRACT

Wavelength selection is often used with multivariate calibration methods to lower the prediction error for the calibrated sample properties. As a result, there is a plethora of wavelength selection methods to choose from, each with unique advantages and disadvantages. All wavelength selection methods involve a range of tuning parameters, making the methods cumbersome or complex and hence difficult to work with. The goal of this study is to provide a simple process to select wavelengths for multivariate calibration methods while trying to standardize values of the five adjustable algorithm tuning parameters across data sets. The proposed method uses multiple linear regression (MLR) as an indicator of which wavelengths should be used further to form a multivariate calibration model by a process such as partial least squares (PLS). From a collection of MLR models formed from randomly selected wavelengths, those models within thresholds of the bias indicator root mean square error of calibration (RMSEC) and the variance indicator model vector L2 norm are evaluated to ascertain the most frequently selected wavelengths. Portions of the most frequent wavelengths are retained and used to produce a calibration model by PLS. This proposed wavelength selection method is compared to PLS models based on full spectra. Several near infrared data sets are evaluated, showing that PLS models based on MLR selected wavelengths provide improved prediction errors. Of the five adjustable parameters for the wavelength selection method, three could be standardized across the data sets while the other two required minor tuning. Recommendations are provided as to alternate wavelength selection algorithms.

Keywords: Multivariate calibration; Wavelength selection; Multiple linear regression

1. Introduction

In analytical chemistry, there are different multivariate regression techniques to characterize chemical and physical properties as analytes. Some of these techniques include multiple linear regression (MLR) and partial least squares (PLS) [1,2]. The method of MLR requires selection of r wavelengths from p full wavelength spectra such that the X matrix of row-wise spectra for n samples and r columns is full rank for the inverse operation. Full rank requires the rank of X to be r with n ≥ r. Partial least squares and other methods can be used in full spectral mode at p wavelengths or with a set of r selected wavelengths. With wavelength selection there is generally a tradeoff between bias and variance. Specifically, the potential for prediction variance inflation increases with selected wavelengths, as characterized by the size of the Euclidean vector L2 norm (2-norm) of the regression vector. Compared to a full wavelength PLS regression vector L2 norm, a wavelength selected L2 norm is larger, indicating greater variance. Nevertheless, wavelength selection is popular because of documented improvement in prediction errors compared to a full wavelength method. Similar results are noted in other fields [3].

Over the past several decades, many wavelength selection methods have been created to obtain optimal regression vectors for MLR as well as full spectral methods. Some of these methods include interval PLS [4], genetic algorithms [5-7], selectivity ratios [8], simulated annealing [9,10], methods based on Tikhonov regularization (TR) variants that simultaneously select wavelengths and form the model [11-16] (penalty regression methods, one of which is based on the L1 norm and is commonly referred to as the least absolute shrinkage and selection operator (LASSO) [17]), and methods based on Monte Carlo approaches [18-24]. This list is a small fraction of the algorithms that can be used. It is interesting to note that many wavelength selection algorithms base selection on merits assessing the signal to noise ratio of regression vector coefficients, as proposed in a wavelength selection algorithm from 1974 [25]. Professor Yi-Zeng Liang spent a considerable amount of his career working with wavelength selection ([16,21-24,26-33], to mention a few). Most notable are his competitive adaptive reweighted sampling (CARS) [33] and his recent work involving Monte Carlo variations [21-24].

Regardless of the method used to select wavelengths, each method has respective tuning parameters that must be adjusted to obtain effective models. The number of tuning parameters ranges from as few as one, as in some TR variants, to several, making these methods cumbersome to work with. The goal of this study is to provide a simple process to select wavelengths using standardized algorithm tuning parameter values across multiple near infrared (NIR) data sets that is, hopefully, extendable to other spectral data sets.

A Monte Carlo approach is proposed that leverages MLR to select appropriate wavelengths. Once a set of wavelengths is selected, PLS is used to form the final calibration model. Results from wavelength selected PLS models are compared to results obtained from full wavelength PLS models to determine if the MLR selected wavelengths improve model performance. The effectiveness of this proposed method is tested with several NIR data sets.

2. Methods

The proposed MLR wavelength selection procedure is described using general variables to represent the tuning parameters. Exact values are stated in section 4. Results and discussion. Before describing the new approach, measures of model quality used in the selection process are first defined.

2.1. MLR and PLS Model quality measures

Measures of MLR model quality are needed to ascertain which wavelengths should ultimately be chosen from the multitude of models formed by the Monte Carlo process of random wavelength selection. Bias and variance measures are both used to prevent over or under fitting of a calibration model. This allows one to determine which MLR models are the better models and hence, which wavelengths are the better wavelengths. In this study, only one bias measure and one variance measure are used to keep the method as simple as possible. The L2 norm of an estimated MLR model vector $\hat{b}$, symbolized by $\|\hat{b}\|_2$, is used as the variance measure. The bias measure is the root mean square error of calibration (RMSEC) for a calibration set with n samples calculated by

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}{n}} \qquad (1)$$

where $y_i$ represents the reference value for the ith sample and $\hat{y}_i$ denotes the corresponding estimated value from the MLR model.
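As a small numerical illustration of the two quality measures, the following Python sketch evaluates equation (1) and the L2 norm on toy numbers. The function names `rmse` and `l2_norm` and the values are illustrative assumptions and are not taken from the authors' MATLAB code.

```python
import math


def rmse(y_ref, y_hat):
    """Root mean square error as in equation (1): sqrt(sum((y_i - yhat_i)^2) / n)."""
    n = len(y_ref)
    return math.sqrt(sum((y - yh) ** 2 for y, yh in zip(y_ref, y_hat)) / n)


def l2_norm(b):
    """Euclidean (L2) norm of a model vector, used here as the variance indicator."""
    return math.sqrt(sum(c * c for c in b))


# Toy values for illustration only (not data from the paper).
y_ref = [10.0, 12.0, 14.0, 16.0]
y_hat = [10.5, 11.5, 14.5, 15.5]   # every residual has magnitude 0.5
print(rmse(y_ref, y_hat))          # -> 0.5
print(l2_norm([3.0, 4.0]))         # -> 5.0
```

The same `rmse` function gives RMSEC when applied to calibration samples and RMSEP when applied to prediction samples.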

The performance of the final selected wavelengths used with PLS, as well as that of the full spectrum PLS models, is assessed using the same bias and variance measures of model quality used for MLR. Specifically, equation (1) is used as a bias measure in the form of the root mean square error of prediction (RMSEP) for the prediction sample set not used in calibration, and the PLS model L2 norms are used to characterize variance. Respective R2 values from plotting predicted values against reference values are also evaluated.

2.2. MLR wavelength selection and PLS prediction

A given data set is split into calibration and prediction sets. In this study, 80% of the samples are used for the calibration set and 20% for the prediction set, but other proportions can be used. The percentage split is not a tuning parameter for the MLR wavelength selection algorithm.

Using the calibration samples, an MLR model is formed with r randomly selected wavelengths from the full set of p wavelengths. This process is repeated m times for m MLR models. The resulting RMSEC values are plotted against the corresponding L2 norms. A percentage, h, of the total number of MLR models with the lowest RMSEC values is retained and the same percentage of MLR models with the lowest L2 norm values is also retained. The MLR models common to the separate lowest valued RMSEC and L2 norm collections (the intersection) are then isolated. It is important that the MLR models satisfy both criteria at the intersection in order to remove over and under fitted models.

The MLR wavelength selection process is repeated t times for the same calibration samples, producing t sets of models from respective intersections. The t intersected sets are combined and a histogram of the selected wavelengths is formed. A percentage, w, of the calibration sample set rank (rank ≤ min(n,p)) is used to determine how many wavelengths are selected from the histogram. This number of wavelengths is then selected as the most frequently used wavelengths in the histogram. The final selected wavelength set is assessed by predicting the prediction sample set using PLS to form the RMSEP and corresponding R2 values. These values are compared to the respective RMSEP and R2 values using full spectrum PLS models. Figure 1 provides a representation of the MLR wavelength selection steps for a particular calibration set.
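The selection steps described above can be sketched in Python with NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' MATLAB implementation: the function name `mlr_wavelength_selection`, the `seed` and `rank` arguments, the use of a least squares solver in place of an explicit inverse, rounding to the nearest integer, and taking the rank as min(n, p) by default are all choices made for the sketch.

```python
import numpy as np


def mlr_wavelength_selection(X, y, r=20, m=10000, h=30.0, t=50, w=80.0,
                             rank=None, seed=0):
    """Return (selected wavelength indices, wavelength histogram).

    X : (n, p) mean-centered calibration spectra, y : (n,) mean-centered references.
    r, m, h, t, w follow the text; `rank` and `seed` are added for this sketch.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    n_keep = int(round(m * h / 100.0))       # models retained per criterion
    counts = np.zeros(p, dtype=int)          # histogram over wavelengths

    for _ in range(t):                       # t repetitions on the same samples
        subsets = np.empty((m, r), dtype=int)
        rmsec = np.empty(m)
        norms = np.empty(m)
        for k in range(m):                   # m random MLR models
            idx = rng.choice(p, size=r, replace=False)
            # Least squares solution; equals the MLR inverse solution at full rank.
            b, *_ = np.linalg.lstsq(X[:, idx], y, rcond=None)
            resid = y - X[:, idx] @ b
            subsets[k] = idx
            rmsec[k] = np.sqrt(np.mean(resid ** 2))   # bias indicator
            norms[k] = np.linalg.norm(b)              # variance indicator
        low_rmsec = np.argsort(rmsec)[:n_keep]
        low_norm = np.argsort(norms)[:n_keep]
        both = np.intersect1d(low_rmsec, low_norm)    # models meeting both criteria
        np.add.at(counts, subsets[both].ravel(), 1)   # accumulate their wavelengths

    if rank is None:
        rank = min(n, p)                     # assumption: upper bound used as the rank
    n_select = min(p, int(round(rank * w / 100.0)))
    selected = np.argsort(counts)[::-1][:n_select]    # most frequent wavelengths
    return np.sort(selected), counts


if __name__ == "__main__":
    # Tiny synthetic demo with reduced m and t so that it runs quickly.
    demo_rng = np.random.default_rng(1)
    X_demo = demo_rng.normal(size=(30, 40))
    y_demo = demo_rng.normal(size=30)
    X_demo -= X_demo.mean(axis=0)
    y_demo -= y_demo.mean()
    sel, hist = mlr_wavelength_selection(X_demo, y_demo, r=5, m=200, t=2, w=20.0)
    print(len(sel), hist.shape)
```

With the values used in this study (m = 10,000 and t = 50), the double loop forms 500,000 MLR models per calibration split, so a plain Python loop of this kind would likely be slow; smaller values are convenient for experimentation.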

From the previous description of the method and Figure 1, it is apparent there are five tuning parameters: r (number of random wavelengths to select), m (the number of times r random wavelengths are selected to form the corresponding m MLR models), h (percentage of the m models to retain), t (number of times to run an r, m, h combination), and w (percentage of the calibration set rank that sets the final number of wavelengths to retain from the combined t sets). Values for each tuning parameter are discussed in respective sections of section 4. Results and discussion. From the presented discussions, it is shown that of the five tuning parameters, only two actually need to be tuned (r and w), and even for these, suggested values exhibit similar trends across data sets. Processes are also suggested for determining optimal values.

Each data set is split into calibration/prediction sets 55 times in this study to evaluate the consistency of selected wavelengths. More and fewer splits were evaluated, but 55 was found to adequately show the wavelength variation between splits.

3. Experimental

As noted in section 2.2. MLR wavelength selection and PLS prediction, each data set is split such that 80 percent of the samples become the calibration set and the remainder are placed in the prediction set. The calibration set is used for MLR wavelength selection and the prediction set is used to assess the performance of PLS models formed from the selected wavelengths and the full spectrum PLS models. The calibration/prediction split is performed 55 times to characterize the selected wavelength variations. Mean centering is used where the calibration set is column-wise centered and the prediction set is centered to the calibration set.
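The evaluation step can be illustrated with a short NumPy sketch: the prediction set is centered with the calibration means, a PLS model is formed on the calibration set, and RMSEP and the regression vector L2 norm are computed. The PLS routine below is a generic textbook PLS1 (NIPALS) variant chosen for the sketch; the authors' own MATLAB PLS code is not given in the text, so all function names here are hypothetical.

```python
import numpy as np


def center_to_calibration(X_cal, y_cal, X_pred, y_pred):
    """Center calibration data column-wise; center prediction data with the calibration means."""
    x_mean = X_cal.mean(axis=0)
    y_mean = y_cal.mean()
    return X_cal - x_mean, y_cal - y_mean, X_pred - x_mean, y_pred - y_mean


def pls1_vector(X, y, n_lv):
    """PLS1 regression vector by NIPALS for centered X and y (generic variant, an assumption)."""
    X = np.array(X, dtype=float)           # work on copies; inputs are not modified
    y = np.array(y, dtype=float)
    W, P, q = [], [], []
    for _ in range(n_lv):
        wt = X.T @ y
        wt /= np.linalg.norm(wt)           # weight vector
        s = X @ wt                         # scores
        ss = s @ s
        p = X.T @ s / ss                   # X loadings
        c = (y @ s) / ss                   # y loading
        X -= np.outer(s, p)                # deflate X
        y -= c * s                         # deflate y
        W.append(wt)
        P.append(p)
        q.append(c)
    W = np.array(W).T
    P = np.array(P).T
    return W @ np.linalg.solve(P.T @ W, np.array(q))


def evaluate(X_cal, y_cal, X_pred, y_pred, columns, n_lv):
    """RMSEP and regression vector L2 norm for a PLS model on the given wavelength columns."""
    Xc, yc, Xp, yp = center_to_calibration(
        X_cal[:, columns], y_cal, X_pred[:, columns], y_pred)
    b = pls1_vector(Xc, yc, n_lv)
    rmsep = np.sqrt(np.mean((yp - Xp @ b) ** 2))
    return rmsep, np.linalg.norm(b)
```

Passing all column indices to `evaluate` corresponds to the full spectrum PLS model, while passing the MLR selected indices corresponds to the wavelength selected model, so the two can be compared at each number of latent variables.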

3.1. Algorithms

All algorithms for MLR, wavelength selection by MLR, and PLS were written by the authors using MATLAB (version 9.0.0 (R2016a), The MathWorks Inc., Natick, MA, USA).

3.2 Data sets

A total of five data sets were assessed with up to 19 data situations (multiple instruments and multiple prediction properties). Only three data sets with four of the possible data situations are shown. These results are representative of the general results obtained.

3.2.1 Corn

This data set consists of 80 samples measured on three instruments designated m5, mp5 and mp6. Samples are measured in the wavelength range of 1100-2498 nm at 2 nm intervals, resulting in 700 wavelengths. Each sample has four prediction properties: moisture, oil, protein, and starch. Results are presented for instrument m5 with the prediction properties oil and moisture. The data set is freely available from the Eigenvector Research website [34].

3.2.2 Sugar

The data set consists of three sugars (sucrose, glucose and fructose) in aqueous solutions for 125 samples [35]. Spectra are in the wavelength range 1100-2498 nm at 2 nm intervals for 700 wavelengths. The prediction property shown is sucrose.

3.2.3 Gasoline

The data set consists of 55 samples measured over 401 wavelengths from 900-1700 nm in 2 nm increments [36]. The prediction property is octane number.
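As a quick consistency check on the data set descriptions, the number of wavelengths implied by each range and step, and the calibration set sizes implied by the 80% split, can be computed directly. The helper names are hypothetical and the check is only a sketch.

```python
def n_wavelengths(start_nm, stop_nm, step_nm):
    """Number of grid points from start to stop inclusive at a fixed step."""
    return (stop_nm - start_nm) // step_nm + 1


def calibration_size(n_samples, cal_percent=80):
    """Number of calibration samples for a given percentage split."""
    return round(n_samples * cal_percent / 100)


print(n_wavelengths(1100, 2498, 2))   # corn and sugar -> 700
print(n_wavelengths(900, 1700, 2))    # gasoline -> 401
print([calibration_size(n) for n in (80, 125, 55)])   # -> [64, 100, 44]
```

These values agree with the 700 and 401 wavelengths stated above and with the calibration set sizes of 64, 100, and 44 samples quoted in section 4.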

4. Results and discussion

There are five adjustable parameters (r, m, h, t, and w) to obtain an effective set of wavelengths. See the discussion in section 2.2. MLR wavelength selection and PLS prediction and Figure 1 for respective definitions. The following discussion describes the effect each parameter has and recommends standard values for three of the parameters (m, h, and t) that should be useful for data sets other than those presented. Recommendations are made for the other two parameters (r and w). Following the parameter discussion, the selected wavelengths based on the calibration data are used to form PLS models to predict the prediction sets for the three data sets.

4.1. Number of wavelengths for each MLR model (r)

The first step in the MLR wavelength selection process is to choose a value for r, the number of wavelengths to randomly select from the full p wavelength set available. To determine the effect this parameter has on the RMSEC and L2 norms, various values were used for the data sets. Shown in Figure 2 are typical results using the corn data predicting moisture content with m set to 10,000 for three values of r: a value close to the rank of the calibration spectra (60 in this case with a rank 64 calibration set from the 80/20% calibration/prediction split), a midrange value (20), and a small value (10). From Figure 2 it is observed that the number of wavelengths alters how the MLR models perform. The more wavelengths used, the lower the RMSEC values tend to be.

Because a future step in the MLR wavelength selection process is to retain an equal number of models with the lowest RMSEC and L2 norm values and, at the same time, guard against over and under fitted models, the value of r is critical. For example, from Figure 2 it is observed that at too large an r value, many models are generally over fitted to the calibration set and these models will probably not predict the prediction set well. The better outcome is shown with r at 20 wavelengths in Figure 2. In this case, a cone shape is obtained such that the set of models from the intersection of the models with lowest RMSEC and L2 norm values will be fairly well represented. In analyzing several data sets, a value of r just under the midrange of the calibration sample rank worked well. Alternatively, by empirical observation, the value of r can be adjusted until the cone shape plot shown in Figure 2 is obtained. For all the data sets in this study, r was set to 20.

4.2. Number of MLR models (m)

Deciding on a value for m is not as critical as for r. The number of MLR models should be sufficient to obtain a representative range of models. As shown in Figure 2 with m set to 10,000 models, this value clearly spans the space of possible models and a greater number is not needed. With a smaller value, it was found that the models tended to be more sparse (spread out) in Figure 2 and hence, when a subset of the models is retained (set by h, described next), too many poor models are included. In general, the more models formed, the more likely the wavelengths useful for prediction will be retained. For all the data sets, it was decided to fix m to 10,000.

4.3. Percentage of MLR models with low RMSEC and L2 norm values (h)

A subset of the m MLR models is needed since only a portion of all the models formed are of any use. Thus, only a percentage of the total MLR models, h, is retained. As noted previously, two subsets of models are formed, each with the same value of h. One subset has the h percent of MLR models with the lowest RMSEC values and the other subset has the h percent of MLR models with the lowest L2 norm values. If just the models with the lowest RMSEC values are inspected for wavelengths selected, then many of these wavelengths are part of over fitted models and would not be useful for predicting new samples. Similarly, if just the models with the lowest L2 norms are inspected, then many of these wavelengths are part of under fitted models and would also not be useful for predicting new samples. Thus, models in both subsets (the intersection) are retained and analyzed for common wavelengths. The value for h cannot be too large or too many poor models are included, and the same is true for too small a value. With m set to 10,000 models for all data sets, it was found that a value of h set to 30% provided a good selection of balanced models in the intersection with generally about 1,000 models retained for wavelength evaluation.

4.4. Number of intersection sets (t)

It was found that using only one set of intersected models did not provide a useful histogram. Specifically, dominant wavelengths that should ultimately be selected for use by a multivariate calibration method were not obvious. In essence, too many spurious wavelengths would be selected. Thus, a histogram is formed from the t intersection sets, in which dominant wavelengths are much more obvious, allowing a straightforward selection. The value for t needs to be large enough to obtain a discernible histogram. From trial and error, it was found that after t = 50, the histogram typically converges in terms of wavelength selection trends. Thus, t was set to 50 for all the data sets.

4.5. Percentage of selected wavelengths (w)

The final tuning parameter is how many wavelengths to select from the histogram of the wavelengths used in the MLR models present in each of the t intersected low RMSEC and L2 norm regions. As noted previously, the number of wavelengths is a percentage, w, of the original n x p calibration set rank (rank ≤ min(n,p)). The corresponding most frequently used wavelengths from the histogram form the final wavelength set. The number of wavelengths is ultimately user defined because different values for w produce multivariate calibration models (PLS in this study) with varying degrees of performance per data set. However, the general trend over the range of w is that the smaller the value, the worse the model performance, and the larger the value, the more similar the performance is to that of full wavelength models.

Some of these trends are demonstrated in Figure 3 for corn moisture on instrument m5. In this example, with the 80/20% calibration/prediction split, there are 64 samples in the calibration set, leaving 16 samples in the prediction set. With 64 samples in the calibration set, the rank is 64, and shown are w values of 20%, 40%, 80%, and 120% of the calibration rank, giving 13, 26, 51, and 77 wavelengths, respectively. These wavelengths would be selected from the histogram as the respective 13, 26, 51, and 77 most frequently used wavelengths. From the images, as the number of wavelengths increases, more wavelengths come from the two dominant wavelength regions. This is the general trend for most data sets, with intermediate wavelengths sometimes being selected. Using PLS with the final selected wavelengths, the mean RMSEP plots across the 55 calibration/prediction splits show that as a greater percentage of the calibration rank is used for w, the more an RMSEP plot resembles the full spectrum PLS plot. The RMSEP values are plotted against the L2 norm to demonstrate that while a wavelength subset may predict better than the full spectrum model, the L2 norm is larger.
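The wavelength counts quoted above are consistent with rounding w percent of the calibration rank to the nearest integer. The text does not state the rounding rule, so that rule is an assumption of the following Python sketch, as are the function names.

```python
def n_selected(rank, w_percent):
    """Number of wavelengths to keep: w percent of the rank, rounded to nearest (assumed rule)."""
    return round(rank * w_percent / 100)


def top_wavelengths(counts, k):
    """Indices of the k most frequent wavelengths in a histogram (ties go to the lower index)."""
    order = sorted(range(len(counts)), key=lambda j: (-counts[j], j))
    return sorted(order[:k])


print([n_selected(64, w) for w in (20, 40, 80, 120)])   # -> [13, 26, 51, 77]
print(top_wavelengths([0, 5, 2, 5, 1], 2))              # toy histogram -> [1, 3]
```

The same rule reproduces the 80 and 35 wavelengths reported for the sugar and gasoline data in sections 4.7 and 4.8 (`n_selected(100, 80)` and `n_selected(44, 80)`).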

After exploring different w values for all the respective data situations, 80% is used in this study. An approach not evaluated for determining w is to use some form of cross validation on the calibration set with different values of w. A value for w that “behaves” close to ideal would be chosen to form the final model and predict the prediction set.

4.6. Corn

Shown in Figure 4 are a few more details from Figure 3 of the overall trends and performance of the MLR wavelength selected models for corn moisture on instrument m5. In Figure 4, the mean RMSEP values are also plotted against the number of PLS latent variables (LVs) to better see the improvement in prediction error over the full wavelength models. From Figure 4, it is also observed that when the RMSEP values are plotted against the L2 norm, the tradeoff between bias and variance is more obvious. Over the 55 splits, the wavelengths selected are consistently the same. These wavelengths are similar to wavelengths that other wavelength selection algorithms have chosen [12,16,19,33,37-39].

The corn situation predicting oil content on instrument m5 is shown in Figure 5. Similar to moisture, the selected wavelengths predict better than full wavelengths and the L2 norm is much larger. From Figure 5e, it is observed that more variation in the selected wavelengths occurs compared to moisture. Regardless, there are still dominant bands of wavelengths and the small deviations do not seem to affect the prediction error variations across the 55 splits any more than for the full spectra PLS models. The wavelength bands are similar to those selected by other algorithms [16,37].

4.7. Sugar

The results and trends for predicting sucrose in sugar are shown in Figure 6. The MLR chosen wavelengths for each calibration set are consistent, with some of the same wavelengths being present or absent depending on the calibration split. In each situation, 80 wavelengths are selected (80% of the calibration rank, which is 100 for the full spectrum calibration set of size 100 x 700 on each split). The RMSEP and R2 values for selected wavelengths show improvement when compared to the full spectrum PLS models, especially in terms of consistency across the data splits. As expected, the L2 norms are larger for the selected wavelengths.

4.8. Gasoline

Presented in Figure 7 are the results for predicting gasoline octane number. In this case, calibration sets are 44 x 401 and hence, at 80% of the calibration rank, there are 35 wavelengths selected. From the image in Figure 7e, there is one primary band, a few dominant wavelengths, and other key wavelengths that are present or absent depending on the split. With the larger number of individual wavelengths being selected rather than mostly bands, the variation in the RMSEP and R2 values across the 55 splits is greater than with the other data sets where wavelength bands dominate. The behavior of the wavelength selected PLS models is similar to that of the full spectra PLS models. It may be that different wavelength selection parameters would provide better performing models. However, this data set has been studied by others and the bands and other general wavelengths selected are similar [10,11,16].

5. Conclusions

The goal of this study was to develop a simple wavelength selection method that could be standardized with respect to algorithm operational parameters across data sets, thereby eliminating the need to tune parameters to each particular data set. From the work presented, it is clear that total elimination was not possible. For the method to work appropriately, it was found that of the five algorithm parameters (r (number of random wavelengths to select), m (the number of times r random wavelengths are selected), h (percentage of the m models to retain), t (number of times to run the r, m, h combination), and w (final number of wavelengths to retain from the combined t sets)), two required adjustment per data situation (r and w) and three could be set to the same values for the 19 data situations studied (four of which were presented). Specifically, for the three fixed value parameters, m can be set to 10,000 to ensure enough models are formed with useful wavelengths; h should be set to 30% for m = 10,000 to provide a good selection of balanced models in the intersection with generally about 1,000 models ultimately retained for wavelength evaluation; and a value of t set to 50 was found to provide adequate histogram convergence in terms of wavelength selection trends. For the two adjustable parameters, a value for r just under the midrange of the calibration set rank (or, by empirical observation, a value of r adjusted until the cone shape plot shown in Figure 2 is obtained) and setting w to 80% provided acceptable prediction errors from PLS modeling with this final selected wavelength set.

The process presented is based on using random wavelengths to create an ensemble of MLR models; the wavelengths of the best bias/variance balanced MLR models are analyzed to identify common wavelengths. In actuality, probably the easiest wavelength selection algorithms to operate are the penalty methods. Two of the simplest require one tuning parameter each. These two are TR in an L1 norm format (also known as LASSO) and an iterative process based on a full wavelength regression vector [11-15]. In both methods, wavelengths are selected simultaneously with forming the multivariate calibration model.

In wavelength selection studies, selected wavelengths can sometimes be correlated to vibrations related to analyte properties. Useful tables are available [40] but such assignments were not attempted in this study. A note of caution, though: just because selected wavelengths are useful for prediction, the wavelengths are not always interpretable relative to the final regression vector [41].

Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant No. CHE-1506417 (co-funded by CDS&E), which is gratefully acknowledged by the authors.

References

1. T. Næs, T. Isaksson, T. Fearn, T. Davies, A user friendly guide to multivariate calibration and classification, NIR Publications: Chichester UK, 2002.

2. J. H. Kalivas, Calibration methodologies, in: S. D. Brown, R. Tauler, B. Walczak (Eds.-in-Chief), and J. H. Kalivas (Second ed.), Comprehensive Chemometrics: Chemical and Biochemical Data Analysis, Elsevier, Amsterdam, 2009, Vol. 3. Chap. 1, pp. 1-32.

3. I. Guyon, A. Elisseeff, An introduction to variable and feature selection, J. Mach. Learn. Res. 3 (2003) 1157-1182.

4. L. Nørgaard, A. Saudland, J. Wagner, J.P. Nielsen, L. Munck, S.B. Engelsen, Interval partial least-squares regression (iPLS): A comparative chemometric study with an example from near-infrared spectroscopy, Appl. Spectrosc. 54 (2000) 413-419.

5. D. Jouan-Rimbaud, D.L. Massart, R. Leardi, O.E. de Noord, Genetic algorithms as a tool for wavelength selection in multivariate calibration, Anal. Chem. 67 (1995) 4295-4304.

6. M. Arakawa, Y. Yamashita, K. Funatsu, Genetic algorithm-based wavelength selection method for spectral calibration, J. Chemom. 25 (2011) 10-19.

7. C.B. Lucasius, M.L.M. Beckers, G. Kateman, Genetic algorithms in wavelength selection: a comparative study, Anal. Chim. Acta 286 (1994) 135-153.

8. T. Rajalahti, R. Arneberg, A.C. Kroksveen, M. Berle, K.-M. Myhr, O.M. Kvalheim, Discriminating variable test and selectivity ratio plot: Quantitative tools for interpretation and variable (biomarker) selection in complex spectral or chromatographic profiles, Anal. Chem. 81 (2009) 2581-2590.

9. J.H. Kalivas, N. Roberts, J. Sutter, Global optimization by simulated annealing with wavelength selection for ultraviolet-visible spectrophotometry, Anal. Chem. 61 (1989) 2024-2030.

10. J.M. Brenchley, U. Hörchner, J.H. Kalivas, Wavelength selection characterization for NIR spectra, Appl. Spectrosc. 51 (1997) 689-699.

11. F. Stout, J.H. Kalivas, K. Héberger, Wavelength selection for multivariate calibration using Tikhonov regularization, Appl. Spectrosc. 61 (2007) 85-95.

12. J. Ottaway, J.H. Kalivas, E. Andries, Spectral multivariate calibration with wavelength selection using variants of Tikhonov regularization, Appl. Spectrosc. 64 (2010) 1388-1395.

13. J.H. Kalivas, Overview of two-norm (L2) and one-norm (L1) regularization variants for full wavelength or sparse spectral multivariate calibration models or maintenance, J. Chemom. 26 (2012) 218-230.

14. E. Andries, S. Martin, Sparse methods in spectroscopy: an introduction, overview, and perspective, Appl. Spectrosc. 67 (2013) 579-593.

15. E. Andries, Sparse models by iteratively reweighted feature scaling: a framework for wavelength and sample selection, J. Chemom. 27 (2013) 50-62.

16. G.-H. Fu, Q.-S. Xu, H.-D. Li, D.-S. Cao, Y.-Z. Liang, Elastic net grouping variable selection combined with partial least squares regression (EN-PLSR) for the analysis of strongly multi-collinear spectroscopic data, Appl. Spectrosc. 65 (2011) 402-408.

17. R.J. Tibshirani, Regression shrinkage and selection via the lasso, J. Royal Statist. Soc. Ser. B 58 (1996) 267-288.

18. W.S. Cai, Y.K. Li, X.G. Shao, A variable selection method based on uninformative variable elimination for multivariate calibration of near-infrared spectra, Chemom. Intell. Lab. Syst. 90 (2008) 188-194.

19. C. Esquerre, A.A. Gowen, G. Downey, C.P. O’Donnell, Selection of variables based on most stable normalized partial least squares regression coefficients in an ensemble Monte Carlo procedure, J. Near Infrared Spectrosc. 19 (2011) 343-350.

20. C.A. Esquerre, A.A. Gowen, A. O’Gorman, G. Downey, C.P. O’Donnell, Evaluation of ensemble Monte Carlo variable selection for identification of metabolite markers in NMR data, Anal. Chim. Acta 964 (2017) 45-54.

21. H.-D. Li, Y.-Z. Liang, Q.-S. Xu, D.-S. Cao, Model-population analysis and its applications in chemical and biological modeling, Trends Anal. Chem. 38 (2012) 154-162.

22. H.-D. Li, Y.-Z. Liang, Q.-S. Xu, D.-S. Cao, Model population analysis for variable selection, J. Chemom. 24 (2010) 418-423.

17

ACCEPTED MANUSCRIPT

23. Y.-H. Yun, W.-T. Wang, B.-C. Deng, G.-B. Lai, X.-B. Liu, D.-B. Ren, Y.-Z. Liang, W. Fan, Q.-S. Xu, Using variable combination population analysis for variable selection in multivariate calibration, Anal. Chim. Acta 862 (2015) 14-23.

24. J. Bin, F. Ali, W. Fan, J. Zhou, X. Li, W. Tang, Y. Liang, An efficient variable selection method based on variable permutation and model population analysis for multivariate calibration of NIR spectra, Chemom. Intell. Lab. Syst. 158 (2016) 1-13.

25. J. Šustek, Method for the choice of optimal analytical positions in spectrophotometric analysis of multicomponent systems, Anal. Chem. 46 (1974) 1676-1679.

26. B.C. Deng, Y.H. Yun, Y.Z. Liang, L.Z. Yi, A novel variable selection approach that iteratively optimizes variable space using weighted binary matrix sampling, Analyst 139 (2014) 4836-4845.

27. Y.-H. Yun, Y.-Z. Liang, G.-X. Xie, H.-D. Li, D.-S. Cao, Q.-S. Xu, A perspective demonstration on the importance of variable selection in inverse calibration for complex analytical systems, Analyst 138 (2013) 6412-6421.

28. Y.-H. Yun, D.-S. Cao, M.-L. Tan, J. Yan, D.-B. Ren, Q.-S. Xu, L. Yu, Y.-Z. Liang, A simple idea on applying large regression coefficient to improve the genetic algorithm-PLS for variable selection in multivariate calibration, Chemom. Intell. Lab. Syst. 130 (2014) 76-83.

29. H.-D. Li, Q.-S. Xu, Y.-Z. Liang, Random frog: an efficient reversible jump Markov Chain Monte Carlo-like approach for variable selection with applications to gene selection and disease classification, Anal. Chim. Acta 740 (2012) 20-26.

30. Y.-H. Yun, H.-D. Li, L.R.E. Wood, W. Fan, J.-J. Wang, D.-S. Cao, Q.-S. Xu, Y.-Z. Liang, An efficient method of wavelength interval selection based on random frog for multivariate spectral calibration, Spectrochim. Acta A 111 (2013) 31-36.

31. Y.-H. Yun, W.-T. Wang, M.-L. Tan, Y.-Z. Liang, H.-D. Li, D.-S. Cao, H.-M. Lu, Q.-S. Xu, A strategy that iteratively retains informative variables for selecting optimal variable subset in multivariate calibration, Anal. Chim. Acta 807 (2014) 36-43.

32. B.-C. Deng, Y.-H. Yun, Y.-Z. Liang, L.-Z. Yi, A novel variable selection approach that iteratively optimizes variable space using weighted binary matrix sampling, Analyst 139 (2014) 4836-4845.

33. H. Li, Y. Liang, Q. Xu, D. Cao, Key wavelengths screening using competitive adaptive reweighted sampling method for multivariate calibration, Anal. Chim. Acta 648 (2009) 77-84.

34. http://www.eigenvector.com/data/Corn/index.html, last accessed May 25, 2017.

35. P.J. Brown, M. Vannucci, T. Fearn, Bayes model averaging with selection of regressors, J. Roy. Stat. Soc. Ser. B, 27 (2002) 519-536.

36. J.H. Kalivas, Two data sets of near infrared spectra, Chemom. Intell. Lab. Syst. 37 (1997) 255-259.

37. M. Shariati-Rad, M. Hasani, Selection of individual variables versus intervals of variables in PLSR, J. Chemom. 24 (2010) 45-56.

38. J. Gerretzen, E. Szymańska, J. Bart, A.N. Davies, H.-J. van Manen, E.R. van den Heuvel, J.J. Jansen, L.M.C. Buydens, Boosting model performance and interpretation by entangling preprocessing selection and variable selection, Anal. Chim. Acta 938 (2016) 44-52.

39. J.P.M. Andries, Y.V. Heyden, L.M.C. Buydens, Improved variable selection reduction in partial least squares modeling based on Predictive-Property-Ranked Variables and adaption of partial least squares complexity, Anal. Chim. Acta 705 (2011) 292-305.

40. B.G. Osborne, T. Fearn, P.H. Hindle, Practical NIR spectroscopy with applications in food and beverage analysis, 2nd ed., Longman Scientific & Technical, Essex, UK, 1993, pp. 20-33.

41. C.D. Brown, R.L. Green, Critical factors limiting the interpretation of regression vectors in multivariate calibration, Trends Anal. Chem. 28 (2009) 506-514.

Figure Captions

Fig. 1. Flow chart illustrating the procedure for MLR wavelength selection.

Fig. 2. Effects of changing r, the number of random wavelengths in the first step of the MLR wavelength selection process. Shown is the corn data predicting moisture with m = 10,000 MLR models. Numbers in the key refer to the number of random wavelengths (r). The rank of the calibration set is 64 because the 80/20% calibration/prediction split gives 64 samples in the calibration set.

Fig. 3. Effects of varying w on the wavelengths selected and the RMSEP for corn moisture on instrument m5. Images show the wavelengths selected (yellow) on each of the 55 calibration splits for w at (a) 20%, (b) 40%, (c) 80%, and (d) 120% of the calibration rank (13, 26, 51, and 77 wavelengths, respectively). Plots in (e) are the corresponding mean PLS RMSEP values at each w and for the full spectrum PLS models. The rank of the calibration set is 64 because the 80/20% calibration/prediction split gives 64 samples in the calibration set.

Fig. 4. Mean PLS results for corn moisture on instrument m5: (a) RMSEP (blue solid) and R² (orange dash) for full wavelengths, (b) RMSEP (blue solid) and R² (orange dash) for selected wavelengths at 80% of calibration rank (51 wavelengths, see Figure 3c), (c) RMSEP with error bars across the 55 splits at full wavelengths (black solid) and selected wavelengths (blue dash), and (d) R² with error bars at full wavelengths (black solid) and selected wavelengths (blue dash).

Fig. 5. Results for predicting oil content of corn using instrument m5: (a) mean RMSEP (blue solid) and R² (orange dash) for full wavelengths, (b) RMSEP (blue solid) and R² (orange dash) for selected wavelengths, (c) RMSEP with error bars across the 55 splits at full wavelengths (black solid) and selected wavelengths (blue dash), (d) R² with error bars at full wavelengths (black solid) and selected wavelengths (blue dash), and (e) selected wavelengths on each split (yellow).

Fig. 6. Results for the sugar data predicting sucrose content: (a) mean RMSEP (blue solid) and R² (orange dash) for full wavelengths, (b) RMSEP (blue solid) and R² (orange dash) for selected wavelengths, (c) RMSEP with error bars across the 55 splits at full wavelengths (black solid) and selected wavelengths (blue dash), (d) R² with error bars at full wavelengths (black solid) and selected wavelengths (blue dash), and (e) selected wavelengths on each split (yellow).

Fig. 7. Results for the gasoline data predicting octane number: (a) mean RMSEP (blue solid) and R² (orange dash) for full wavelengths, (b) RMSEP (blue solid) and R² (orange dash) for selected wavelengths, (c) RMSEP with error bars across the 55 splits at full wavelengths (black solid) and selected wavelengths (blue dash), (d) R² with error bars at full wavelengths (black solid) and selected wavelengths (blue dash), and (e) selected wavelengths on each split (yellow).
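
The wavelength counts quoted in the Fig. 3 caption (13, 26, 51, and 77 wavelengths at 20%, 40%, 80%, and 120% of a calibration rank of 64) can be reproduced with simple arithmetic. The sketch below is only an illustration: the function names are hypothetical, the total of 80 samples is inferred from the caption's statement that an 80/20% split gives 64 calibration samples, and rounding to the nearest integer is an assumption that happens to match the quoted values.

```python
def calibration_size(n_samples: int, calibration_percent: int) -> int:
    """Number of samples placed in the calibration set (hypothetical helper)."""
    return round(n_samples * calibration_percent / 100)


def wavelengths_at_percent_of_rank(rank: int, percent: int) -> int:
    """Number of wavelengths w as a percentage of the calibration rank.

    Assumption: w is rounded to the nearest integer.
    """
    return round(rank * percent / 100)


# Assumption: 80 samples in total, so an 80/20% split leaves 64 for calibration.
rank = calibration_size(80, 80)

# w at each percentage of the calibration rank listed in the Fig. 3 caption.
w_values = {p: wavelengths_at_percent_of_rank(rank, p) for p in (20, 40, 80, 120)}

for percent, w in w_values.items():
    print(f"{percent}% of rank {rank}: w = {w}")
```

Under these assumptions the loop prints w = 13, 26, 51, and 77, in agreement with the caption.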

[Fig. 1: image not reproduced]

[Fig. 2: image not reproduced]

[Fig. 3: image not reproduced; panels a) to e)]

[Fig. 4: image not reproduced; panels a) to d)]

[Fig. 5: image not reproduced; panels a) to e)]

[Fig. 6: image not reproduced; panels a) to e)]

[Fig. 7: image not reproduced; panels a) to e)]

Highlights

Simple wavelength selection algorithm

Wavelengths selected are bias/variance balanced

An ensemble of MLR models is used to characterize wavelengths