Chemometrics and Intelligent Laboratory Systems 149 (2015) 53–65
Fused stagewise regression — A waveband selection algorithm for spectroscopy☆,☆☆
B. Malli ⁎, T. Natschläger
Software Competence Center Hagenberg, Softwarepark 21, 4232 Hagenberg, Austria
Article history: Received 5 November 2014; Received in revised form 29 July 2015; Accepted 4 September 2015; Available online 11 September 2015.
Keywords: Robust variable selection; Wavelength selection; Fused Lasso; Double repeated cross validation; Spectroscopy
Abstract

While partial least squares (PLS) and principal component regression (PCR), the most popular regression techniques in chemometrics, may theoretically be able to deal with large numbers of possibly correlated variables, as they occur in the analysis of spectroscopic data, the importance of performing some form of variable selection in practical applications has been widely discussed and acknowledged. In this work we address this problem by proposing a sparse regression algorithm, referred to as fused stagewise regression (FSR), which iteratively performs a selection of connected regions of variables (wavelengths) while being easy to implement and interpret, due to its resemblance to typical steps in iterative manual feature selection procedures. We evaluate the proposed variable selection technique on a publicly available benchmark data set and compare the performance of PLS models built on the determined selection to that of models obtained with state-of-the-art feature selection methods from the fields of chemometrics and machine learning. In order to ensure robust feature selection, we integrate the individual selection methods into an extensive repeated cross validation procedure. For the data set under investigation, it is shown that FSR performs at least as well as state-of-the-art approaches and well within the range of variable selections provided by experts. © 2015 Elsevier B.V. All rights reserved.
1. Introduction

A critical step in the analysis of spectroscopic data is the choice of an appropriate variable/wavelength selection used for chemometric modelling. Projection-based regression methods, such as PLS or PCR, are designed to deal with a very large number of possibly correlated variables. In practice, though, and particularly for near-infrared (NIR) and mid-infrared (MIR) applications, it is known that a removal of irrelevant and noisy wavelengths can improve prediction performance, yield simpler models and allow for better model interpretation (see e.g. [1,2] and references therein). Ideally, a determination of relevant variables should be based upon expert knowledge about the chemical properties of the substance under analysis. In practice, though, such a selection can only rarely be considered an objective process, as it is frequently heavily influenced by the experience of the expert carrying out the analysis. Thus, different experts may select significantly different wavelengths, possibly also resulting in highly deviating models and corresponding
☆ The research reported in this article has been partly supported by the Austrian Ministry for Transport, Innovation and Technology, the Federal Ministry of Science, Research and Economy, and the Province of Upper Austria in the frame of the COMET center SCCH. ☆☆ Selected papers presented at the 3rd European Conference on Process Analytics and Control Technology, 6–9 May 2014, Barcelona, Spain. ⁎ Corresponding author. Tel.: +43 7236 3343 832. E-mail addresses:
[email protected] (B. Malli),
[email protected] (T. Natschläger).
http://dx.doi.org/10.1016/j.chemolab.2015.09.004 0169-7439/© 2015 Elsevier B.V. All rights reserved.
performances. This issue becomes particularly obvious when analysing highly complex target substances or properties, where a selection of relevant spectral regions solely based on chemical insight may be extremely difficult, if possible at all. Because of this, a universally applicable algorithm working independently of expert knowledge can often be advantageous. This has been known for a long time and, as an exhaustive search for the perfect combination of variables is generally impossible, has resulted in the use and development of a huge variety of different variable selection algorithms: starting with forward and backward selection methods [1], through genetic algorithms [3,4], up to, among others, interval PLS (iPLS) [5,6] or moving window PLS (MWPLS) approaches (see e.g. [2] and references therein). In contrast to techniques that pick combinations of single variables, iPLS and MWPLS select connected regions of wavelengths (= wavebands) and are thus of particular interest in spectroscopy, a field where adjacent variables are usually highly correlated. In this contribution, we introduce a new waveband selection algorithm that is based on the idea of the Fused Lasso [7], a regression method developed in the field of machine learning. In an iterative manner, the newly developed approach seeks connected regions of variables relevant for predicting the target property at hand and weights them in an appropriate way. This results in a vector of regression coefficients that can either be used for prediction, as for any other conventional regression technique, but with implicit waveband selection, or serve in a pre-step to regression, purely to determine important input variables. In order to allow for a fair comparison with state-of-the-art variable selection methods, only the latter
Table 1
Method specific parameters to be determined in the inner CV loop.

FusedStage: τ ∈ [0, 1]: minimum correlation parameter; η ∈ [0, 2]: relative step size; ϕ ∈ [0, 1]: relative shrinkage; δ ∈ [0, 1]: cutoff parameter (FusedStageCUT only).
GAPLS: None. The number of evaluations could be decided within the inner CV loop, but experiments have shown that a prior selection of a good value once in advance is possible. The same was found true for l, the number of variables to combine in mean building.
FusedSLEP: λ1: l1-norm regularization parameter; λ2: regularization parameter controlling the influence of the l1 norm on differences of successive regression coefficients.
Lasso: λ1: l1-norm regularization parameter.
ElNet: λ1: l1-norm regularization parameter; λ2el: l2-norm regularization parameter.
MWPLS: w: window width; ρ: importance thresholding parameter. Additionally, the maximum number of PLS components m could be chosen within the inner CV loop, but was found possible to be set suitably (10) in advance.
Expert: None.
NoSel: None.
approach¹ is studied and PLS is employed in the regression step.² Results for a benchmark data set from the literature are given in Section 7. The complete algorithm, its basic ideas and recommendations for parameter selection, possibly incorporating expert knowledge, are presented in the following sections.

2. Background — Lasso and the Fused Lasso

Our proposed fused stagewise regression algorithm is based on the idea of a regression technique referred to as the Fused Lasso [7]. This technique represents a generalization of the popular Lasso method [8], which solves the optimization problem stated in Eq. (1), where X denotes the mean centred n × p design matrix (rows corresponding to samples and columns to variables/wavelengths; in a spectroscopic application each row of X corresponds to a spectrum), y the mean centred n × 1 vector of target observations (reference vector), β the p × 1 vector of regression coefficients and λ1 a non-negative regularization parameter determining the influence of the l1 regularization term:

βL = argmin_{β ∈ ℝᵖ} ‖y − Xβ‖₂² + λ₁‖β‖₁.    (1)
Thus, the Lasso aims for a solution βL minimizing the sum of squared residuals while attempting to keep λ₁‖β‖₁ = λ₁ ∑_{j=1,…,p} |βj| small. It has been shown [8] that the use of the l1 norm in the penalty term λ₁‖β‖₁ introduces sparsity, i.e. it enforces solutions βL with βj = 0 for many coordinates j, thereby resulting in an implicit variable selection. A closer look at the solution reveals, however, that in the case of correlated adjacent variables, as found in spectroscopy, the Lasso tends to select a set of single wavelengths instead of connected regions. From a chemical (thinking in terms of functional groups) and statistical (use of redundancies) point of view, the selection of wavebands, i.e. connected regions of wavelengths, would be favourable.
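To make the role of λ1 concrete, the following Python sketch fits a Lasso to simulated, spectra-like data with correlated adjacent columns. scikit-learn is used here purely for illustration (its objective equals Eq. (1) up to the rescaling λ1 = 2n·alpha); the simulated data and parameter values are our own choices, not part of the original study.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p = 40, 900
X = rng.normal(size=(n, p)).cumsum(axis=1)   # random-walk columns: adjacent variables are highly correlated
X -= X.mean(axis=0)                          # mean centre, as assumed in Eq. (1)
beta_true = np.zeros(p)
beta_true[300:320] = 0.05                    # one "relevant" waveband
y = X @ beta_true + 0.01 * rng.normal(size=n)
y -= y.mean()

# scikit-learn minimizes (1/(2n))*||y - Xb||_2^2 + alpha*||b||_1, i.e. Eq. (1) with lambda_1 = 2*n*alpha
beta_lasso = Lasso(alpha=0.05, fit_intercept=False, max_iter=50_000).fit(X, y).coef_
print("selected variables:", np.flatnonzero(beta_lasso))   # typically a few isolated indices rather than a band
```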
¹ The former approach is intended to be studied in a follow-up publication.
² The same approach is adopted for all other discussed regression techniques performing implicit variable selection (i.e. for the Lasso, Elastic Net and Fused Lasso).
This is actually the idea behind a method referred to as the Fused Lasso [7], which incorporates an additional regularization term and solves

βFL = argmin_{β ∈ ℝᵖ} ‖y − Xβ‖₂² + λ₁‖β‖₁ + λ₂ ∑_{j=1}^{p−1} |β_{j+1} − β_j|.    (2)
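For reference, the two penalty terms of Eq. (2) can be written down directly; the following small Python helper (our own, purely illustrative) evaluates the Fused Lasso objective for a given coefficient vector.

```python
import numpy as np

def fused_lasso_objective(beta, X, y, lam1, lam2):
    """Eq. (2): squared residuals plus an l1 penalty on the coefficients and an
    l1 penalty on the differences of successive coefficients."""
    resid = y - X @ beta
    return resid @ resid + lam1 * np.abs(beta).sum() + lam2 * np.abs(np.diff(beta)).sum()
```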
As was already the case for the Lasso solution, the term λ₁‖β‖₁ introduces sparsity in the regression coefficients, resulting in βj = 0 for certain j. In the same manner, the use of the l1 norm in the additional penalty term λ₂ ∑_{j=1}^{p−1} |β_{j+1} − β_j| introduces sparsity in the differences (β_{j+1} − β_j) of consecutive coordinates, resulting in β_{j+1} = β_j for many j (i.e. many βj are equal to their immediate neighbouring entries). As such, the combination of these two penalty terms leads to a piecewise constant solution βFL containing connected regions of zero and non-zero entries, where the latter exhibit only a limited number of different values and only few significant jumps between neighbouring coefficients [7]. Due to this property, the Fused Lasso appears particularly suitable for spectroscopic analysis. The problem of solving the optimization objective in Eq. (2), though, was found to be considerably more challenging than the solution of the Lasso problem [9–11]. An excellent review of existing algorithms, incorporating information on applicability, limitations and computational considerations, was provided by Yu et al. [11]: for a long time only special cases, demanding a certain structure or properties of X, could be solved efficiently [12–14]. Solutions for a more general setting, including the one occurring in spectroscopy, could only be provided via the introduction of numerous auxiliary variables and constraints [7] in order to enable the application of a general purpose quadratic programming solver. As has been noted by Tibshirani et al. as well as others, e.g. [15,11], this computational approach does not scale well with the number of considered variables, p. Concretely, it is stated that for values of p > 2000 and n > 200, speed could become a practical limitation, particularly if five- or tenfold cross validation is carried out. In 2010, Liu et al. published an algorithm for the efficient solution of this problem [15]. The complexity of the proposed algorithm, though, makes it difficult to follow the performed steps and to interpret the finally selected bands. Recently developed algorithms [16–18,11] bear the advantage of providing solutions to an even more general optimization problem, but are, in our opinion, equally complex and difficult to follow. This led us to develop a fused stagewise regression (FSR) algorithm that is based on the idea of the Fused Lasso but designed in a way that allows for considerably easier interpretation and implementation. A comparison of the finally selected wavebands will be carried out for the analysed data set in Section 7.

Table 2
Criteria used to determine best parameters/parameter combinations.

FusedStage: Minimization of the FSR MSECV (i.e. the MSECV obtained if the FSR predictions are compared against the measured reference values; the same notation is used for the methods below).
GAPLS: None.
FusedSLEP: Minimization of the (SLEP) Fused Lasso MSECV.
Lasso: SLEP: minimization of the (SLEP) Lasso MSECV. MATLAB: determine the largest λ1 for which the corresponding (Matlab) Lasso MSECV is within one standard error of the minimum (see also http://www.mathworks.de/help/stats/lasso.html); this value is referred to as λ1,1SE below.
ElNet: SLEP: minimization of the (SLEP) Elastic Net MSECV. MATLAB: determine λ1,1SE for all available λ2el and compute (Matlab) Elastic Net MSECVs for all (λ1,1SE, λ2el) pairs; use the combination with the minimal obtained MSECV.
MWPLS: Minimization of the PLS MSECV obtained using the selected variables only and a fixed number (set to 10 in the results below) of PLS components.
[Fig. 1 panels: number of iterations until convergence (colour scale from 200 to 1000) over the (τ, ϕ) grid, one panel per η ∈ {0.1, 0.2, 0.3, 0.4}.]
Fig. 1. Exemplary convergence map. Convergence maps summarize information regarding the convergence behaviour of the FSR algorithm using certain parameter combinations. Parameter triples are only suitable for use in the RDCV procedure, if they cause the algorithm to converge in a reasonable number of iterations.
3. The fused stagewise regression algorithm

The proposed fused stagewise regression (FSR) algorithm is shown in Algorithm 1, where xk denotes column k of X, corr(v,w) refers to the correlation coefficient between vectors v and w, 1[I] ∈ ℝp is defined as the p-dimensional vector with entry k equal to 1 if k ∈ I and 0 otherwise (I defined in line 7 of Algorithm 1), and the required inputs τ, η and ϕ are the parameters of the algorithm. Additionally, a stopping criterion is required (an example is given in Section 6.3).
In short, the proposed method performs the following steps until a convergence criterion is met (line 12), where the given line numbers refer to the pseudocode in Algorithm 1:
A) Look for the variable best correlated with the current residual and build an interval around it (lines 5–7).
B) Walk a small step in the best direction by increasing or decreasing the regression coefficients corresponding to the determined interval of important variables to a certain extent (lines 8–9).
Fig. 2. Absorbance spectra for beer samples (absorbance versus wavelength, 400–2400 nm). In the left plot, spectral data for calibration samples are shown, while the right plot depicts absorbance spectra for test samples.
C) Shrink all coefficients (in absolute magnitude) in a way that should prevent ending up in a local minimum, while still decreasing the error in each step (lines 10–11).
The detailed behaviour of each of the steps A to C depends on three corresponding parameters, which are discussed in the following subsections.
Table 3
Applied internal preprocessing of data for different variable selection algorithms.

FusedStage. X: mean centre (=default). y: autoscale (=default).
GAPLS. X: mean centre (default: autoscale). y: coded such that the preprocessing of y equals the one for X.
FusedSLEP. X: none (=default). y: none (=default).
Lasso. X: SLEP: none (=default); MATLAB: mean centre (default: autoscale). y: SLEP: none (=default); MATLAB: mean centre (=default).
ElNet. X: SLEP: none (=default); MATLAB: mean centre (default: autoscale). y: SLEP: none (=default); MATLAB: mean centre (=default).
MWPLS. X: mean centre (=default). y: autoscale (=default).

Algorithm 1. The fused stagewise regression algorithm
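The pseudocode of Algorithm 1 is referenced throughout this section by its line numbers (line 1, lines 5–7, 8–9, 10–11 and 12). As a rough illustration only, the following Python sketch reproduces the procedure as it is described in the text: the way the interval of neighbours is grown around the best correlated variable, the use of absolute correlations and all default values are our assumptions, not the authors' reference implementation; the step size, the shrinkage bounds of Eq. (3) and the stopping rule of Eq. (4) follow the formulas given in Sections 3.2, 3.3 and 6.3.

```python
import numpy as np

def _corr(a, b):
    """Pearson correlation between two vectors."""
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def _alpha_bounds(X, y, beta_old, beta_star):
    """alpha_l and alpha_r such that Eq. (3) holds for every alpha in [alpha_l, alpha_r]."""
    v = X @ beta_star
    a, b = float(v @ v), float(v @ y)
    err_old = float(np.sum((y - X @ beta_old) ** 2))
    # smaller root alpha' of ||y - alpha*v||^2 = err_old  ->  alpha_l = max(0, alpha')
    disc = max(b * b - a * (float(y @ y) - err_old), 0.0)
    alpha_l = max(0.0, (b - np.sqrt(disc)) / a)
    # ||y - alpha*v||^2 = ||y - v||^2 has roots 1 and 2b/a - 1  ->  alpha_r = min(alpha'', 1)
    alpha_r = min(2.0 * b / a - 1.0, 1.0)
    return alpha_l, alpha_r

def fused_stagewise_regression(X, y, tau, eta, phi, max_iter=2000, history=200, tol=1e-3):
    """Sketch of the FSR iteration: X is the (n, p) mean-centred design matrix, y the centred reference."""
    n, p = X.shape
    beta = np.zeros(p)                                             # line 1: start from the zero solution
    rmse = [np.sqrt(np.mean((y - X @ beta) ** 2))]
    for it in range(max_iter):
        r = y - X @ beta
        # Step A (lines 5-7): best correlated variable and an interval of neighbours whose
        # correlation with it is at least tau (this interval rule is our assumption).
        j = int(np.argmax([abs(_corr(X[:, k], r)) for k in range(p)]))
        lo, hi = j, j
        while lo > 0 and abs(_corr(X[:, lo - 1], X[:, j])) >= tau:
            lo -= 1
        while hi < p - 1 and abs(_corr(X[:, hi + 1], X[:, j])) >= tau:
            hi += 1
        ind = np.zeros(p)
        ind[lo:hi + 1] = 1.0                                       # indicator vector 1_[I]
        # Step B (lines 8-9): optimal common step eps*, damped or boosted by eta.
        d = X @ ind
        eps_star = float(d @ r) / float(d @ d)
        beta_star = beta + eta * eps_star * ind
        # Step C (lines 10-11): shrink towards zero while respecting Eq. (3).
        alpha_l, alpha_r = _alpha_bounds(X, y, beta, beta_star)
        beta = (alpha_r - phi * (alpha_r - alpha_l)) * beta_star
        # Line 12: stop once the relative RMSE decrease over `history` iterations is tiny, cf. Eq. (4).
        rmse.append(np.sqrt(np.mean((y - X @ beta) ** 2)))
        if it + 1 >= history:
            old, new = rmse[-history - 1], rmse[-1]
            if (old - new) / np.mean([old, new]) <= tol:
                break
    return beta, it + 1
```

A call such as fused_stagewise_regression(X, y, tau=0.9, eta=0.3, phi=0.5) returns the coefficient vector together with the number of iterations used, which is also convenient for the convergence maps discussed in Section 6.3.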
3.1. Step A: minimum correlation τ

τ represents a lower limit for the required correlation between a variable and its close neighbours; hence the domain [0, 1] arises naturally. For applications in spectroscopy, however, values smaller than 0.5 will rarely be used. Interpretation: if τ is close to 1, the interval built around the currently best variable j will be small. If it is far from 1 (closer to 0), intervals will contain a larger number of variables. It is clear that the optimal value of τ depends on the measurement setup and the underlying chemical substances to be measured.

3.2. Step B: relative step size η

The optimal step size ε* (line 8) is chosen so as to minimize ‖y − X(βold + ε1[I])‖₂². If we used this step size to calculate the new regression coefficients βold + ε*1[I], we might take too big steps and get stuck in a local minimum. Thus, we propose not to take the optimal step ε*, but only a fraction η ∈ [0, 1] of it (or to walk a little further, η ∈ [1, 2]); see line 9. Interpretation: if η is chosen close to 0, the steps taken will be small and the algorithm might take very long to converge, but promising areas are unlikely to be missed. If η is close to 1, convergence may be faster, but important areas might be missed and the algorithm may end up in a local optimum. If η is chosen close to 2, large regions of the solution space can be explored very fast, but areas in between might be missed. Thus, a reasonable compromise needs to be found (see Section 6.3). The boundaries of the interval [0, 2] arise from the constraint ‖y − Xβ*‖₂² ≤ ‖y − Xβold‖₂² imposed by us.

3.3. Step C: relative shrinkage ϕ

It might happen that coefficients in a certain interval start dominating the algorithm, forcing the computed coefficient vectors to remain in a very limited area of the solution space; the algorithm might thus get stuck in a locally optimal region. In order to allow the algorithm to move out of this area, we propose to decrease the importance of all currently selected variables by shrinking the corresponding absolute regression coefficients. This is performed by multiplying the potential new solution β* with a factor α ∈ [αl, αr], α = αr − ϕ ⋅ (αr − αl), to obtain the new solution βnew = α ⋅ β*. While shrinkage appears to be a sensible idea, we do not intend to decrease coefficient values too much. Thus, we compute αl and αr such that for each α ∈ [αl, αr] the following inequalities hold:

‖y − Xβ*‖₂² ≤ ‖y − Xβnew‖₂² ≤ ‖y − Xβold‖₂².    (3)

To be more precise, αl = max(0, α′) with α′ defined as the smaller solution of ‖y − X(β* ⋅ α′)‖₂² = ‖y − Xβold‖₂², and αr = min(α″, 1) with α″ ≠ 1 solving ‖y − X(β* ⋅ α″)‖₂² = ‖y − Xβ*‖₂². The smaller ϕ is chosen, the closer α is to αr and the corresponding error will be close to the one obtained using β* as regression vector. If, on the other hand, ϕ is close to its upper bound 1, α will be in the area of αl and the corresponding error remains similar to the one obtained in the previous iteration. As can be seen from the inequalities in Eq. (3), the algorithm produces solutions with a monotone decrease of the RMSE in every iteration. It might be argued that this approach could lead to the choice of a local optimum. Via the use of coefficient shrinkage (ϕ) and reduced step sizes (η), however, the algorithm is designed not to follow the direct path to a minimum but to explore sensible regions of the solution space.

3.4. Modifications of the fused stagewise regression algorithm

Within the study of the application performance of Algorithm 1, slight modifications have proven useful for certain settings:

(a) If the data to be analysed are likely to contain noisy variables, it may be a good idea to prohibit their selection. In such a case, the additional implementation of a noise detection criterion might be useful. In the case of spectroscopic data, we would expect neighbouring variables to be highly correlated. As such, a possible implementation of a noise criterion could be to check at the beginning (right after line 1) of Algorithm 1 whether there are variables v1,…,vh whose immediate neighbourhood (e.g. two variables to the left and right of vk) does not contain mainly (e.g. at least half of the time) highly correlated variables (e.g. with a correlation coefficient of at least 0.7 with vk). If so, these variables can be assumed to contain mainly noise and their selection should be prohibited.
Fig. 3. Selected variables for the WholeReg case. Variables chosen by different variable selection techniques in at least 50% of all RDCV runs; compare to Fig. 5.
(b) If our main purpose is to perform waveband selection, we might not only be interested in prohibiting the use of noisy variables but may also want to neglect variables which the algorithm determines to be of minor importance only. We may, e.g., consider a variable k as important if the k-th entry of the final regression coefficient vector βnew, reweighted by the standard deviation of the corresponding variable, is large. As such, we might neglect variables for which |βnew,k| ⋅ std(xk) ≤ δ ⋅ max(|βnew| ⋅ std(X)), where std(X) denotes the vector with entries std(xj) and δ ∈ [0, 1] defines an additional parameter determining when to cut off a variable and when to use it in the final selection. The combination of (a) and (b) will be referred to as FusedStageCUT in the following; a minimal sketch of both modifications is given below.
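The following Python sketch illustrates both the neighbourhood-correlation noise screen of (a) and the std-reweighted cutoff of (b); the function names and the vectorization are ours, and the numerical thresholds in (a) are simply the example values mentioned in the text.

```python
import numpy as np

def noisy_variable_mask(X, width=2, min_corr=0.7, min_frac=0.5):
    """Modification (a): flag variables whose immediate neighbourhood (width columns to the left
    and right) does not mainly consist of highly correlated variables; their selection can then
    be prohibited before Algorithm 1 starts."""
    n, p = X.shape
    noisy = np.zeros(p, dtype=bool)
    for k in range(p):
        nbrs = [j for j in range(max(0, k - width), min(p, k + width + 1)) if j != k]
        corrs = [abs(np.corrcoef(X[:, k], X[:, j])[0, 1]) for j in nbrs]
        noisy[k] = np.mean([c >= min_corr for c in corrs]) < min_frac
    return noisy

def cutoff_selection(X, beta_new, delta):
    """Modification (b): keep only variables whose std-reweighted final coefficient exceeds a
    delta-fraction of the largest such value."""
    importance = np.abs(beta_new) * X.std(axis=0)
    return importance > delta * importance.max()
```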
Fig. 4. Selected variables for the PropReg case. Variables chosen by different variable selection techniques in at least 50% of all RDCV runs; compare to Fig. 6.
4. Robust variable and parameter selection via repeated double cross validation

Repeated double cross validation (RDCV) was proposed in [19] in order to provide an optimized strategy for the determination of parameters of regression models and a realistic estimation of prediction errors when applying the model to new data. The procedure uses two nested loops of cross validation (CV): within the inner CV loop model parameters are determined, and the outer loop is used to provide information on the generalization error of the method. The procedure is completed by a number of repetitions of these two CV loops, which, among other advantages, provide a more realistic determination of the expected prediction error for new data. The modified RDCV strategy for variable selection proposed in this contribution aims at the identification of robust wavebands to be selected for a subsequent regression step.
Fig. 5. Consistency of selected variables for the WholeReg case. Variables selected by considered methods shown for each individual RDCV run.
Fig. 6. Consistency of selected variables for the PropReg case. Variables selected by considered methods shown for each individual RDCV run.
To this end, the calibration data are split into K training and test folds (outer CV loop). On the training data we use CV to determine the best parameter combination of the variable selection method under investigation. We output the correspondingly selected variables and repeat this step for the next training fold. The whole procedure is then repeated R times. All in all, we thereby obtain R · K possibly different variable selections.
Fig. 7. Overall performance on TEST data based on the WholeReg variable selections. The performance of PLS models of varying complexities when using different variable selection techniques to determine the set of variables to be used for PLS is displayed. Shown are means and standard deviations (error bars) of RMSE values computed for the separate test data (right plot in Fig. 2).
Fig. 8. Detailed performance on TEST data based on the WholeReg variable selections. Here we see a close-up of results shown in Fig. 7 on best performance areas of reasonable complexity (2–5 PLS components).
With the intention of choosing the most important variables, we propose to select the ones present in a large fraction F (e.g. F = 0.5) of all selections. The performance of the final variable selection in a regression model can be determined using the repetitions of the outer test folds. Pseudo-code for this RDCV procedure and the performance estimation can be found in Algorithms 2 and 3, where X(tr(r, f), S) denotes the matrix of samples/rows tr(r, f) contained in training fold f for repetition r using the selected wavelengths/columns S. Results presented in Section 7 make use of PLS in the regression step of Algorithm 3 and compare results, namely means and standard deviations of the R · K
root mean squared error (RMSE) values for different numbers of latent variables using various variable selection techniques.
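Algorithms 2 and 3 are given as pseudocode in the original article; as a rough paraphrase, the following Python sketch shows the structure of the RDCV variable selection and of the subsequent performance assessment. The generic select_fn/score_fn interface, the use of plain random (rather than grouped) folds and all function names are our simplifying assumptions.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def rdcv_select(X, y, select_fn, score_fn, param_grid, R=10, K=10, F=0.5, seed=0):
    """Algorithm 2 (sketch): R repetitions of K-fold outer CV; in every outer training fold the
    best parameters are chosen via score_fn (the method-specific inner-CV MSE, cf. Table 2), the
    resulting selection is recorded, and variables selected in at least a fraction F of the R*K
    runs form the final robust selection. select_fn(X, y, params) must return a boolean mask."""
    n, p = X.shape
    counts = np.zeros(p)
    rng = np.random.default_rng(seed)
    folds_per_run = []
    for _ in range(R):
        folds = np.array_split(rng.permutation(n), K)                 # outer CV folds
        folds_per_run.append(folds)
        for test_idx in folds:
            tr = np.setdiff1d(np.arange(n), test_idx)
            Xtr = X[tr] - X[tr].mean(axis=0)                          # mean centre spectra (line 5)
            ytr = (y[tr] - y[tr].mean()) / y[tr].std()                # autoscale reference values (line 6)
            best = min(param_grid, key=lambda prm: score_fn(Xtr, ytr, prm))
            counts += select_fn(Xtr, ytr, best)
    return counts >= F * R * K, folds_per_run

def assess_selection(X, y, selection, folds_per_run, n_comp):
    """Algorithm 3 (sketch): RMSE of PLS models trained on each outer training fold, restricted
    to the fixed selection, and evaluated on the corresponding left-out test fold."""
    cols = np.flatnonzero(selection)
    rmses = []
    for folds in folds_per_run:
        for test_idx in folds:
            tr = np.setdiff1d(np.arange(X.shape[0]), test_idx)
            pls = PLSRegression(n_components=n_comp).fit(X[np.ix_(tr, cols)], y[tr])
            pred = pls.predict(X[np.ix_(test_idx, cols)]).ravel()
            rmses.append(np.sqrt(np.mean((pred - y[test_idx]) ** 2)))
    return np.mean(rmses), np.std(rmses)
```

For the results in Section 7, R = K = 10 and F = 0.5 are used; in the sketch, scikit-learn's PLSRegression merely stands in for the PLS regression step.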
5. Evaluated variable selection techniques

In addition to our fused stagewise regression (FSR) approaches (FusedStage and FusedStageCUT), results will also be given for different state-of-the-art variable selection techniques.
Fig. 9. Overall performance on left out CV test folds based on the WholeReg variable selections. Same as in Fig. 7 but with CV test folds used instead of the separate test data.
Fig. 10. Detailed performance on left out CV test folds based on the WholeReg variable selections. Same as in Fig. 8 but with CV test folds instead of the separate test data.
Algorithm 2. RDCV for robust variable selection
As the FSR is a sparse regression method, we compare it to its relatives: the Fused Lasso [7] using the SLEP solver (FusedSLEP), the Lasso [8], and the Elastic Net (ElNet) [20]. In addition, we investigate differences from variable selection algorithms especially designed for chemometric applications: genetic algorithm PLS (GAPLS) [21] and a modified moving window PLS version (MWPLS) [2]. For general comparison purposes we also compare against expert-based variable selections. In the following, all these methods are briefly described.
5.1. Fused Lasso — SLEP (FusedSLEP)
Algorithm 3. Performance assessment of variable selection
The authors of [15] provide a publicly available Sparse Learning with Efficient Projections (SLEP) Toolbox for Matlab³ [22]. The function fusedLeastR provided in this package was used to obtain variable selections solving the Fused Lasso optimization problem from Eq. (2). Naturally, it is of particular interest to compare FusedSLEP selections to the ones obtained by our proposed FSR algorithm. Remark: other software packages for the Fused Lasso exist, but as shown in [16] and [11], the SLEP Toolbox is generally a very good choice from a computational and practical (e.g. no limitations w.r.t. X) point of view.
5.2. Lasso (Lasso)

The optimization problem solved by the Lasso [8] is defined in Eq. (1). Two different solvers were investigated: the function lasso available in the Matlab Statistics Toolbox⁴ and the LeastR function available in the SLEP Toolbox. Parameter handling differs slightly between these methods, as does our choice of the best parameter. Some details on the differences can be found in Tables 3 and 2. Solutions are only shown for the version providing the better performance.
³ Toolbox and corresponding manual available at http://www.public.asu.edu/jye02/Software/SLEP.
⁴ Details on http://www.mathworks.de/help/stats/lasso.html.
Fig. 11. Overall performance on TEST data based on the PropReg variable selections. The performance of PLS models of varying complexities when using different variable selection techniques to determine the set of variables to be used for PLS is displayed. Shown are means and standard deviations (error bars) of RMSE values computed for the separate test data (right plot in Fig. 2).
Fig. 12. Detailed performance on TEST data based on the PropReg variable selections. Here we see a close-up of results shown in Fig. 11 on best performance areas of reasonable complexity (2–5 PLS components).
5.3. Elastic net (ElNet)
The elastic net [20] represents a compromise between Ridge Regression [23], also known as Tikhonov regularization, and the Lasso, solving an optimization problem incorporating both l1-norm and l2-norm regularization. Computations were again carried out using the Matlab Statistics Toolbox and the SLEP package. In addition to a deviating parameterization and parameter handling within the two applied functions, the choice of the best parameter combination was performed in a different way (details again in Tables 3 and 2). Again, only the best results are shown.

5.4. Genetic algorithm PLS (GAPLS)
Genetic algorithm PLS [4,21] evaluation was based on the function gaplssp available in the PLS-genetic algorithm toolbox for Matlab.⁵ The code was modified to allow for 5-fold grouped CV, i.e. CV based on groups of samples as they naturally occur in the case of repeated measurements.
⁵ Toolbox and corresponding manual provided by R. Leardi, available at http://www.models.life.ku.dk/GAPLS.
Fig. 13. Overall performance on left out CV test folds based on the PropReg variable selections. Same as in Fig. 11 but with CV test folds used instead of the separate test data.
Fig. 14. Detailed performance on left out CV test folds based on the PropReg variable selections. Same as in Fig. 12 but with CV test folds instead of the separate test data.
The author's recommendation to build means of successive variables if the total number of available variables exceeds 200 was followed, and a parameter, l, was introduced determining the number of variables to use in the mean-building process (if l is selected manually, it has to be large enough to ensure that the number of corresponding mean variables is smaller than 200). All parameters except the number of evaluations were kept at the values proposed in the toolbox manual, and the number of evaluations was determined application-specifically in advance (using the function gaplsopt and the same kind of modifications as outlined above).
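The mean-building step can be sketched as follows (a minimal Python illustration with our own function name): groups of l successive variables are replaced by their means so that GAPLS operates on fewer than 200 (mean) variables.

```python
import numpy as np

def mean_successive_variables(X, l):
    """Average groups of l successive variables; a trailing incomplete group is averaged as well."""
    p = X.shape[1]
    return np.column_stack([X[:, s:s + l].mean(axis=1) for s in range(0, p, l)])
```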
5.5. A modified moving window PLS version (MWPLS)

In order to incorporate results for another, more "traditional" chemometric variable selection method, we made use of the function mwpls [24] provided in the iToolbox for Matlab.⁶ In addition to the common parameter window width, w, an additional parameter, ρ, was introduced to perform some form of importance thresholding to identify potentially useful combinations of windows. To this end, we define variable k as important if mean(E(3 : m, k)) ≤ f(ρ) with f(ρ) := min(E) + ρ · (max(E) − min(E)), where E denotes the m × p generalization RMSE matrix (e.g. from 5-fold grouped CV) over 1, 2, …, m latent variables in PLS regression and E(i, k) is the RMSE obtained if only a window of width w around variable k is used for PLS regression.
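A minimal Python sketch of this thresholding rule is given below; the matrix E is assumed to be precomputed (one window-PLS cross validation per variable), and the zero-based row slice corresponds to the Matlab expression E(3 : m, k).

```python
import numpy as np

def mwpls_selection(E, rho):
    """E: (m, p) matrix of CV-RMSEs, where row i-1 holds the RMSEs of window-PLS models with i
    latent variables. Variable k is kept if its mean RMSE over 3..m latent variables stays below
    f(rho) = min(E) + rho * (max(E) - min(E))."""
    f_rho = E.min() + rho * (E.max() - E.min())
    return np.flatnonzero(E[2:, :].mean(axis=0) <= f_rho)
```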
⁶ Toolbox and corresponding manual available at http://www.models.life.ku.dk/itoolbox.
5.6. Expert selection (Expert)

Expert selection is provided by chemical experts using information on functional groups. For the benchmark data set used for method evaluation and comparison in Section 7, we use two (slightly) different expert selections: Expert1, found in [24], and Expert2, found in the Eigenvector University course notes on variable selection (generously provided by Eigenvector Research president B.M. Wise).

5.7. No selection (NoSel)

For comparison, results are also provided using PLS on the full set of variables.

6. Parameter selection

6.1. General parameter selection

The majority of the variable selection techniques listed in Section 5 involves one or more method specific parameters to be chosen; see Table 1. These choices were all made in the same manner: start with a coarse grid of possible parameter values covering the whole parameter space, and determine the best (see Section 6.2) parameters or parameter combinations using 5-fold grouped CV in the inner loop of the RDCV routine described in Section 4. RDCV provides R · K (possibly different) parameter selections; use these to refine the first grid and iterate the process if necessary. Usually, no more than two refinements were required. A repetition of this procedure, starting with different coarse grids, was nevertheless performed to reduce the risk of getting stuck in a poor local optimum. The results displayed in Section 7 represent the grids showing the best observed performance. Care was taken to treat every method equally, i.e. to spend approximately equal time tuning parameter grids for each algorithm.

6.2. Criteria used within parameter selection

Different criteria are used to determine the method specific best parameters or parameter combinations. Information on these criteria can be found in Table 2 (obviously not required for Expert and NoSel). It can be seen in Table 2 that the only method optimizing parameters with respect to the PLS MSECV computed on the selected variables is MWPLS. All other methods performing parameter optimization use MSECV values based on a different regression method. As our aim in this contribution is the determination of a variable selection performing well in a subsequent use within PLS regression, MWPLS, and naturally also GAPLS and the expert selections, might have an advantage over the other methods. The results shown in Section 7 are in line with this observation.

6.3. Parameter selection for FSR

As described in Section 3, there are three parameters and a stopping criterion to be defined for the FSR algorithm. A simple choice for the latter is to test whether the root mean squared error has stopped decreasing, i.e. we check whether the root mean squared error rmse(i) := √((1/n)‖y − Xβ(i)‖₂²) of the current solution β(i), with i the number of the current iteration, fulfils

(rmse(i − h) − rmse(i)) / mean(rmse(i − h), rmse(i)) ≤ 0.001    (4)
for a certain history h, e.g. h = 200. Regarding a reasonable selection of a parameter grid to be used in the RDCV procedure, we propose to start by investigating the
convergence behaviour of a particular parameter combination (τ, η, ϕ). Our general approach to choose an appropriate grid is as follows:
• Start with crude grids of the parameters τ ∈ [0, 1], η ∈ [0, 2] and ϕ ∈ [0, 1]. In our experience, 5–10 values per parameter are generally enough to give a first overview.
• For each parameter triple (τ, η, ϕ) record the number of iterations until convergence (terminate if a maximum number of iterations is reached); a sketch of such a scan is given below. Graphical tools can be applied to summarize the recorded convergence information, as was for example done in the convergence map shown in Fig. 1 (a colour version of this figure is shown in the online publication), where the minimum and maximum number of iterations was set to 200 and 1000 iterations, respectively, and Eq. (4) was used as stopping criterion.
• Convergence maps can be used to determine (in an automatic way) a subspace of parameter triples suitable as inputs for the RDCV procedure.
• Optional: by evaluating the waveband selections produced by the FSR for these triples, it may already become apparent at this stage that certain parameter triples (τ, η, ϕ) are more favourable than others in terms of chemical expert knowledge. If so, one may simply neglect parameter triples considered chemically less useful. Thus, there is a possibility to incorporate chemical information, if intended.
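A convergence scan over a crude grid can be sketched as follows (in Python, using the FSR sketch given after Algorithm 1; grid values and the cap on the number of iterations are examples only).

```python
import itertools
import numpy as np

def convergence_map(X, y, taus, etas, phis, max_iter=1000):
    """For every (tau, eta, phi) triple record the number of FSR iterations needed until the
    stopping criterion of Eq. (4) is met (capped at max_iter); cf. the maps in Fig. 1."""
    n_iter = {}
    for tau, eta, phi in itertools.product(taus, etas, phis):
        _, k = fused_stagewise_regression(X, y, tau, eta, phi, max_iter=max_iter)
        n_iter[(tau, eta, phi)] = k
    return n_iter

# Triples that converge well before the cap are candidates for the RDCV grid, e.g.:
# grid = [t for t, k in convergence_map(X, y, taus, etas, phis).items() if k < 1000]
```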
7. Evaluation and comparison

7.1. Data

We evaluate our waveband selection algorithm and compare its performance to the techniques described in Section 5 on a publicly available data set⁷ known to be particularly suited for variable selection, i.e. enabling a considerable performance boost by incorporating variable selection in the modelling process. The data set comprises real extract measurements taken for 60 undiluted degassed beer samples, 40 for calibration and 20 for testing, and the corresponding VIS-NIR transmission spectroscopic data recorded in the range of 400–2250 nm (every second nm recorded) measured via the use of a 30 mm quartz cell. Transmission spectra were converted to absorbance units, shown in Fig. 2. The experienced spectroscopist can immediately see that the spectral data for higher wavelength regions contain a considerable amount of noise and would discard the corresponding areas before further analysis. As stated in the introductory section, though, we aim for a universally applicable algorithm working independently of such prior expert knowledge. In order to cover both aspects, we decided to analyse two settings: one, later referred to as the WholeReg case, where we apply the variable selection techniques introduced above directly to the whole spectral region, and another one, denoted as the PropReg case, where results are provided for the case that prior knowledge on the noisy regions is available and the analysis is limited to the noise-free range of 400–1350 nm.⁸ In this way, we can also obtain information on the different methods' ability to cope with noisy data. For the generation of results, we use no sample-wise preprocessing, but the column-wise mean centering of spectra and autoscaling of reference values described below, parameter selection as outlined in Sections 6.1 and 6.3, K = 10 folds in the outer loop of the RDCV procedure and R = 10 repetitions (resulting in 100 runs with corresponding variable selections and performance estimates for each variable selection method), and use variables selected in at least 50% of all runs for the subsequent PLS regression step. As performance measure in line 10 of Algorithm 3 we compute root mean squared error (RMSE) values and use their
⁷ The data set is e.g. included in the iToolbox installation and described in [5].
⁸ This spectral region was proposed by one of the reviewers.
means and standard deviations for visual comparison. The comparison is carried out for a range of different numbers of latent PLS components. Mean centering and/or autoscaling: for numerical reasons, the majority of the variable selection algorithms considered include some form of centering and/or scaling of the input data before further processing.⁹ For certain functions/toolboxes used within this contribution, it is possible to change the default centering/scaling settings, which was e.g. done for the Matlab lasso function. For better comparability and to reduce effects caused by different internal data handling, we compute the results shown in this section performing mean centering for spectra (line 5 in Algorithm 2) and autoscaling for reference values (line 6 in Algorithm 2) before the application of a variable selection method. In this way, a possible internal centering/scaling of reference values should only have a marginal effect and the methods ought to remain comparable as long as no scaling is used for spectroscopic data. The latter was ensured, as seen in Table 3.
7.2. Selected variables

Using Algorithm 2 and all aforementioned conventions and instructions, the variables selected by the compared methods in at least 50% of all RDCV runs are shown in Figs. 3 and 4 for the WholeReg and PropReg case, respectively. It should be noted that the methods FusedSLEP and Lasso select numerous variables in the noisy region of the spectra. Moreover, the regions selected by these methods as well as the ElNet are generally smaller and more widely spread than the selections obtained by the remaining techniques. All selections contain certain variables in the range of 1200–1400 nm, which is also considered particularly important by the experts. As reported above, different parameter grids have been investigated and optimized for each method. The selected variable regions shown here correspond to the ones performing best in the subsequent PLS regression. This may lead to a clear difference between the variables selected in the WholeReg and PropReg case, as can e.g. be observed for the Lasso (here the difference is also enhanced by the analysis of the two different Lasso algorithms described above). The consistency of the variable selections between different RDCV runs may give information about the robustness of the underlying feature selection technique and is shown in Figs. 5 and 6. It can be seen that, with few exceptions, the selections performed by a certain method are generally quite consistent between runs.
7.3. Performance of PLS models built on selected variables

Regarding the performance of PLS models built on the selected wavelength regions, we present results separately for the performance on test data (RMSEP) and on test folds left out during cross validation (RMSECV). Performance information obtained when comparing RMSE values for the separate test set is shown in Figs. 7 and 8 for the WholeReg case and Figs. 11 and 12 for the PropReg case, while cross validation performance is shown in Figs. 9 and 10 for the WholeReg case and Figs. 13 and 14 for the PropReg case (please refer to the online version of this article for colour figures). Results for the separate test data are obtained by training PLS on the individual train sets tr(r, f) of the RDCV procedure while using the single fixed variable selection shown in Figs. 3 and 4. Hence, the shown variance in RMSE is only due to the varying training data for each run, which allows assessing the robustness of a given variable selection with respect to the training samples.
⁹ For variable selection via sparse regression methods, the obtained regression coefficients and possibly intercepts incorporate the centering/scaling information, i.e. they have been rescaled so as to be applicable directly to data on the input data scale. This information is relevant to understand how the predictions used to derive the MSECVs referred to in Table 2 were computed.
It can be seen in Figs. 7, 9, 11 and 13 that the prediction error of PLS models is considerably decreased by means of variable selection. In the PropReg case, however, the improvement is certainly smaller (e.g. at 4 PLS components), as the disturbing noise variables have been eliminated beforehand. The overall view regarding the error of models based on variable selections from the different techniques reveals that, in addition to the option of performing no variable selection at all, the methods ElNet and FusedSLEP are not particularly beneficial for this data set. In the WholeReg case, the same is also true for the Lasso. For a PLS model of reasonable complexity (2–5 latent variables), the more detailed views provided in Figs. 8, 10, 12 and 14 exhibit the clear superiority of FusedStageCUT if PLS models with two latent variables are considered, while an approximately comparable performance of FusedStage(CUT), GAPLS, MWPLS and Expert2 can be observed for 3–5 PLS components, where their exact ranking differs between test and cross validation results. All in all, it can be seen that for this data set FusedStage and FusedStageCUT outperform the other sparse regression techniques (Lasso, ElNet, FusedSLEP) considerably and represent valid alternatives to common chemometric variable selection techniques like MWPLS, GAPLS and even expert-based waveband selection.

8. Discussion and outlook

In this work we propose a variable selection method inspired by the sparse regression methods Lasso [8] and Fused Lasso [7]. The algorithm is conceptually simple and follows the design of an additive regression approach [25] (finding the variable showing the highest correlation with the target), which is extended by the heuristics a) that updates of neighbouring regression coefficients βj and βj+1 should be similar, b) a weight shrinkage/decay (commonly encountered in neural network optimization algorithms [26,27]) which is intended to better explore the solution space, and c) clipping small regression coefficients to zero to improve robustness. From this point of view the algorithm resembles the activities in an iterative manual feature selection approach (looking for wavelength regions with high correlations, running PLS and neglecting wavelengths with non-significant regression coefficients) and is therefore easy to comprehend. There is an inherent difference between the proposed algorithm (and its relatives like the Lasso and Fused Lasso) and MWPLS or iPLS [5,6]: the latter search over a predefined set of possible intervals and/or combinations thereof, while sparse regression algorithms like the FSR optimize a well defined objective function, see Eq. (2), into which the feature selection is encoded in a more general way. Hence there is in principle no need to specify parameters like the number of intervals and the width of the intervals (although in our implementation the parameter τ determines the width of the selected regions to a certain extent). The basic algorithm is integrated within the RDCV procedure to ensure robustness with respect to noise and potential outliers in the data. This approach also provides the means of automatically selecting proper values for the parameters τ (minimal correlation), η (relative step width), ϕ (relative shrinkage) and δ (relative cut off of regression parameters) of the algorithm. Up to now we have not established an analytical proof of convergence of our algorithm.
Therefore we propose to integrate the convergence analysis into the parameter selection process performed in the inner loop of the RDCV. Based on a publicly available data set, it has been shown that the proposed algorithm is capable of selecting relevant wavelength regions providing the basis for well-performing chemometric models (via PLS). The FSR, in its original version or using a modification, is, despite its conceptual simplicity, generally comparable to or even better than a range of state-of-the-art variable selection techniques and could outperform the other sparse regression techniques (Lasso, ElNet, FusedSLEP) considerably. Within the research project Process Analytical Chemistry (PAC) we applied the algorithm to data sets from different company partners
and could also observe promising results. Unfortunately, these data cannot be made public at the moment; we therefore do not report details about the obtained results. A more comprehensive evaluation using further data sets from both application and simulation, and a comparison to additional state-of-the-art methods like uninformative variable elimination (UVE) [28] or interval PLS [5,6], is planned. In particular, we intend to analyse the selected wavebands with respect to chemical information. A further interest is the investigation of the strengths and limits exhibited by the proposed method, aiming at the determination of possible break-down criteria. Moreover, the incorporation of further available information, e.g. regarding measurement technique, present noise or spectral resolution, should be enabled, if possible.

Acknowledgements

This work was carried out and financially supported in the frame of the COMET programme within the research network Process Analytical Chemistry (PAC) (contract Nr. 825340) of the Austrian research funding association (FFG).

References

[1] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning, Springer Series in Statistics, Springer New York Inc., New York, NY, USA, 2001.
[2] R.M. Balabin, S.V. Smirnov, Variable selection in near-infrared spectroscopy: benchmarking of feature selection methods on biodiesel data, Anal. Chim. Acta 692 (12) (2011) 63–72.
[3] C. Cernuda, E. Lughofer, P. Hintenaus, W. Maerzinger, Enhanced genetic operators design for waveband selection in multivariate calibration based on NIR spectroscopy, J. Chemom. 28 (2) (2014) 123–136.
[4] R. Leardi, A.L. Gonzalez, Genetic algorithms applied to feature selection in PLS regression: how and when to use them, Chemom. Intell. Lab. Syst. 41 (2) (1998) 195–207.
[5] L. Norgaard, A. Saudland, J. Wagner, J.P. Nielsen, L. Munck, S.B. Engelsen, Interval partial least-squares regression (iPLS): a comparative chemometric study with an example from near-infrared spectroscopy, Appl. Spectrosc. 54 (2000) 413–419.
[6] R. Leardi, L. Nørgaard, Sequential application of backward interval partial least squares and genetic algorithms for the selection of relevant spectral regions, J. Chemom. 18 (11) (2004) 486–497.
[7] R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, K. Knight, Sparsity and smoothness via the fused lasso, J. R. Stat. Soc. Ser. B (2005) 91–108.
[8] R. Tibshirani, Regression shrinkage and selection via the lasso, J. R. Stat. Soc. Ser. B 58 (1994) 267–288.
[9] J. Friedman, T. Hastie, H. Hoefling, R. Tibshirani, Pathwise coordinate optimization, Ann. Appl. Stat. 1 (2) (2007) 302–332.
[10] G.-B. Ye, X. Xie, Split Bregman method for large scale fused lasso, Comput. Stat. Data Anal. 55 (4) (2011) 1552–1569.
[11] D. Yu, J.-H. Won, T. Lee, J. Lim, S. Yoon, High-dimensional Fused Lasso Regression using Majorization-Minimization and Parallel Processing, ArXiv e-prints, arXiv:1306.1970.
[12] H. Hoefling, A path algorithm for the fused lasso signal approximator, J. Comput. Graph. Stat. 19 (4) (2010) 984–1006.
[13] A. Rinaldo, Properties and refinements of the fused lasso, Ann. Stat. 37 (5B) (2009) 2922–2952.
[14] R.J. Tibshirani, J. Taylor, The solution path of the generalized lasso, Ann. Stat. 39 (3), http://dx.doi.org/10.1214/11-AOS878SUPP, URL http://projecteuclid.org/euclid.aos/1304514656.
[15] J. Liu, L. Yuan, J. Ye, An efficient algorithm for a class of fused lasso problems, in: B. Rao, B. Krishnapuram, A. Tomkins, Q. Yang (Eds.), KDD, ACM, 2010, pp. 323–332.
[16] B. Xin, Y. Kawahara, Y. Wang, W. Gao, Efficient generalized fused lasso and its application to the diagnosis of Alzheimer's disease, Proceedings of the Twenty-Eighth Conference on Artificial Intelligence, AAAI, 2014, pp. 2163–2169.
[17] X. Chen, Q. Lin, S. Kim, J.G. Carbonell, E.P. Xing, Smoothing proximal gradient method for general structured sparse learning.
[18] X. Lin, M. Pham, A. Ruszczyński, Alternating linearization for structured regularization problems, J. Mach. Learn. Res. 15 (1) (2014) 3447–3481, URL http://dl.acm.org/citation.cfm?id=2627435.2697075.
[19] P. Filzmoser, B. Liebmann, K. Varmuza, Repeated double cross validation, J. Chemom. 23 (4) (2009) 160–171.
[20] H. Zou, T. Hastie, Regularization and variable selection via the elastic net, J. R. Stat. Soc. Ser. B 67 (2005) 301–320.
[21] R. Leardi, Application of genetic algorithm PLS for feature selection in spectral data sets, J. Chemom. 14 (5–6) (2000) 643–655.
[22] J. Liu, S. Ji, J. Ye, SLEP: Sparse Learning with Efficient Projections, Arizona State University, http://www.public.asu.edu/jye02/Software/SLEP, 2009.
[23] A.E. Hoerl, R.W. Kennard, Ridge regression: biased estimation for nonorthogonal problems, Technometrics 12 (1970) 55–67.
[24] L. Norgaard, The iToolbox for MATLAB, KVL, Denmark, http://www.models.life.ku.dk/itoolbox, 2004.
[25] J. Friedman, T. Hastie, R. Tibshirani, Additive logistic regression: a statistical view of boosting, Ann. Stat. 28 (2) (1998) 337–407.
[26] C.M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, 1996.
[27] H. Lohninger, Evaluation of neural networks based on radial basis functions and their application to the prediction of boiling points from structural parameters, J. Chem. Inf. Comput. Sci. 33 (5) (1993) 736–744.
[28] V. Centner, D.L. Massart, O.E. de Noord, S. de Jong, B.M. Vandeginste, C. Sterna, Elimination of uninformative variables for multivariate calibration, Anal. Chem. 68 (21) (1996) 3851–3858.