3.05 Variable Selection
R. K. H. Galvão, Instituto Tecnológico de Aeronáutica, São José dos Campos, SP, Brazil
M. C. U. Araújo, Universidade Federal da Paraíba, João Pessoa, PB, Brazil
© 2009 Elsevier B.V. All rights reserved.
3.05.1 Introduction
3.05.2 Criteria for Evaluating Variable Subsets without the Explicit Construction of a Model
3.05.2.1 A Priori Information
3.05.2.2 Correlation with the Dependent Variable
3.05.2.3 Multicollinearity Criteria
3.05.2.3.1 Variance inflation factor
3.05.2.3.2 Condition number
3.05.2.3.3 Illustrative example
3.05.3 Criteria for Evaluating Variable Subsets on the Basis of the Regression Results
3.05.3.1 Use of a Validation Set
3.05.3.2 Analysis of Regression Residuals in the Calibration Set
3.05.3.3 Analysis of Significance for the Inclusion of Additional Variables
3.05.3.4 Analysis of the Regression Coefficients
3.05.3.5 Variance of the Prediction
3.05.4 Variable Selection Techniques
3.05.4.1 Exhaustive Search
3.05.4.2 Iterative MLR Methods
3.05.4.2.1 Forward selection
3.05.4.2.2 Backward elimination
3.05.4.2.3 Stepwise regression
3.05.4.3 Polytope (Simplex) Algorithm
3.05.4.4 Generalized Simulated Annealing
3.05.4.5 Genetic Algorithm
3.05.4.6 Clustering Methods
3.05.4.7 Successive Projections Algorithm
3.05.4.8 Interval Partial Least Squares
3.05.4.9 Multiobjective Optimization
3.05.4.10 Regularization Methods
3.05.5 Case Studies
3.05.5.1 Diesel Analysis by NIR Spectrometry
3.05.5.1.1 Stepwise regression
3.05.5.1.2 Genetic algorithm
3.05.5.1.3 Successive projections algorithm
3.05.5.1.4 Comparison of results
3.05.5.2 QSAR Study
3.05.6 Limitations of Variable Selection
References
Symbols

1  column vector with all elements equal to one
b  column vector of regression coefficients
bk  regression coefficient associated to variable xk
Cp  Mallows statistic
E{·}  expected value
e  residual
f  fitness value in a genetic algorithm
F(·)  cost function
I  identity matrix
I(k)  relevance index for variable xk
K  total number of predictor (independent) variables
m  number of selected variables
mmax  maximum number of variables to be selected
mmin  minimum number of variables to be selected
N  number of observations (objects, samples)
P  population size in a genetic algorithm
pc  crossover probability in a genetic algorithm
pm  mutation probability in a genetic algorithm
R  set of real numbers
r  correlation coefficient
S  empirical covariance matrix of the independent variables
s  empirical standard deviation
t  target vector in a multiobjective optimization problem
t(k)  t-statistic for variable xk
u  vector of real-valued input arguments of a cost function
uk  kth real-valued input argument of a cost function
X  matrix of x data
X̃  X matrix with mean-centered columns
xk  kth column of matrix X
xi,k  value of xk in the ith object (element in row i, column k of matrix X)
xk  kth predictor (independent) variable
y  column vector of y data
y  response (dependent) variable
yi  value of y in the ith object (ith element of vector y)
α  significance level for a statistical hypothesis test
κ  condition number
ρ  coefficient of multiple correlation employed in the calculation of VIF
¯  (bar symbol) empirical mean
^  (hat symbol) estimated value
| |  absolute value
|| ||q  q-norm of a vector

Glossary

a priori information  Information gathered from previous experience, theoretical analysis, or technical details concerning the measurement apparatus, which may be used to select relevant variables.
backward elimination  Iterative variable selection algorithm that starts from a model with all available predictors and successively removes variables until a stopping condition is satisfied.
clustering  Grouping of elements in a given set that are similar according to some metric.
condition number  Ratio between the largest and smallest singular values of a matrix, often used to assess the degree of multicollinearity between variables associated with the columns of the matrix.
cross-validation  Resampling method in which elements of the modeling set are alternately removed and re-inserted for validation purposes.
exhaustive search  Evaluation of all possible solutions to a given optimization problem.
forward selection  Iterative variable selection algorithm that starts from a single predictor and successively adds more variables to the regression until a stopping condition is satisfied.
genetic algorithm  Optimization technique inspired by the mechanisms of evolution by natural selection, in which the possible solutions are represented as the chromosomes of individuals competing for survival in a population.
interval PLS  Range selection algorithm that optimizes the center and end points of an interval of adjacent variables for the purpose of partial least squares regression.
multicollinearity  Linear dependence among predictor variables.
polytope algorithm  Deterministic search method for the minimization of multivariate functions with real-valued input arguments.
quantitative structure–activity relationships  Investigation concerned with the modeling of relationships between biological activities of molecules and physicochemical properties obtained either by experimentation or by computational means.
simulated annealing  Optimization method that introduces a degree of randomness on the search procedure (which may be interpreted as thermal agitation) to avoid capture by local minima.
stepwise regression  Iterative variable selection algorithm that combines the inclusion and elimination mechanisms of the forward and backward selection procedures.
successive projections algorithm  Technique designed to select subsets of variables with small collinearity and appropriate prediction power for use in multiple linear regression models.
variable selection  Process of selecting appropriate predictors from an overall set of independent variables for model-building purposes.
variance inflation factor  Degree by which the variance of a given regression coefficient is increased with respect to the variance that would result if the predictors were uncorrelated.
3.05.1 Introduction

The construction of empirical models by linear regression often entails the problem of selecting the most relevant predictors from the overall set of x variables. In fact, there are several reasons why the use of a reduced subset of variables may be preferred over the use of all available data:

1. A reduction in the number of variables may be useful to decrease the cost and time involved in the measurements, either for the calibration procedures or for subsequent routine use of the model. For instance, in applications involving the spectroscopic determination of physical or chemical parameters, the calibration data may come from a scanning spectrophotometer, which monitors hundreds of different wavelengths. However, if the goal is to deploy a cheap instrument for field use, such as those based on filters or light-emitting diodes (LEDs), a small number of informative wavelengths will have to be selected from the full spectrum acquired by the spectrophotometer.

2. Models with a smaller number of predictor variables tend to be more amenable to physical interpretation. In spectroscopic problems, a model with fewer wavelengths is easier to understand in terms of spectral band attributions. In studies of quantitative structure–activity relationships (QSARs), a model may be developed to predict biological activity from physicochemical descriptors of a given compound. In this case, the selection of a small set of predictor variables (descriptors) is important to better understand the effect of structural changes in the molecule on its biological properties. Such an understanding is vital for the design of new and more effective drugs. Examples can also be found in industrial process control applications, in which the quality of the final product may be affected by several variables, such as temperatures, flow rates, and addition of chemicals. Finding the variables with the largest influence on the product quality is very important to optimize the process.1

3. If the model is constructed by multiple linear regression (MLR), the uncertainty of the predictions tends to increase with the ratio K/N, where K and N are the number of predictor variables and observations (calibration objects) used in the regression, respectively.2 Therefore, the use of a large number of variables may lead to a model of poor prediction ability. This problem can be aggravated by the presence of near multicollinearity between the predictors, which deteriorates the conditioning of the regression. If K > N, it is not possible to obtain a model by MLR because the empirical covariance matrix of the predictors becomes singular.

Regularization methods such as ridge regression2,3 or regression techniques based on latent variables,4,5 namely principal component regression (PCR) and partial least squares (PLS), can be employed to circumvent the third problem described above. However, such methods are not directly concerned with the first two aspects (cost/time constraints and physical interpretation). In addition, several authors have presented theoretical and empirical evidence supporting the use of variable selection to improve the predictive ability of PCR/PLS models.6–8 In this context, variable selection can be regarded as a limit case of the variable weighting technique,9 which consists of multiplying noisy or uninformative variables by an attenuating factor to decrease their influence on the latent variables to be included in the model. Variable selection amounts to assigning a weight of one to the selected variables and zero to the remaining ones.
There is no unique approach to variable selection. The best methodology will depend on many factors such as the dimension of the problem, the statistical and physical nature of the data, and the computational resources available. Some techniques do not consider the particular order in which the variables are disposed, such as algorithms for the selection of physicochemical descriptors in QSAR, isolated wavelengths in spectroscopy, or
parameters in an industrial process. However, in some applications, the order of the variables may be considered in the selection procedure, which can then be formulated as a window selection strategy. This is the case, for example, of the selection of wavelength ranges in a spectrum or time intervals in a chromatogram.
Roughly speaking, a variable selection strategy can be regarded as the combination of a suitable criterion to evaluate subsets of variables and a search algorithm to optimize such a criterion. The evaluation of a subset of variables can be accomplished with or without the explicit construction of a regression model. For instance, the variables can be chosen on the basis of their individual signal-to-noise ratio or other a priori information available to the analyst. In this case, the regression equations are not invoked. Alternatively, given a subset of variables, the regression can be carried out and diagnostic procedures can be used to assess the quality of the resulting model. These two alternatives are known in the machine learning literature as the 'filter' and 'wrapper' approaches, respectively.10
Throughout this chapter, the predictor (independent) variables will be denoted by x1, x2, . . ., xK, the response (dependent) variable by y, and the regression coefficients by b1, b2, . . ., bK as usual. The linear regression equation will be written as

y = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_K x_K + e    (1)
where b0 and e are the intercept and residual terms, respectively. Moreover, the term 'variables' (without additional designations) will refer to the predictors whenever there is no risk of misunderstanding. Estimated or predicted values will be indicated by a hat symbol (^). Whenever matrix–vector formulations are employed, the available x and y data will be arranged as follows:

X = \begin{bmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,K} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,K} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N,1} & x_{N,2} & \cdots & x_{N,K} \end{bmatrix}, \qquad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}    (2)
where x_{i,k} and y_i are the values of xk and y in the ith calibration object, respectively.
The remaining sections of this chapter are organized as follows. In Section 3.05.2, criteria for selecting variables without the explicit construction of a model (filter approach) are discussed. In Section 3.05.3, criteria for comparing models built from different subsets of variables (wrapper approach) are presented. Section 3.05.4 describes different algorithms that were developed or adapted for use in the variable selection problem. Section 3.05.5 presents some examples to illustrate the concepts and algorithms discussed in this chapter. For this purpose, two case studies concerning the selection of wavelengths in a multivariate calibration problem and physicochemical descriptors in QSAR are discussed. Finally, Section 3.05.6 discusses possible drawbacks associated to variable selection. Throughout the text, illustrative computational routines employing Matlab 6.5 software are given.11
3.05.2 Criteria for Evaluating Variable Subsets without the Explicit Construction of a Model

3.05.2.1 A Priori Information
In some cases, the analyst may have a priori information concerning the relevance of the variables under consideration. Such information may come from previous experience, theoretical analysis, or technical details concerning the measurement apparatus. In spectroscopy, for instance, one should exclude wavelength regions in which the signal saturates or the detector response is unreliable. A typical example arises in measurements accomplished by using a dispersive ultraviolet–visible (UV–VIS) spectrophotometer fitted with a diode array, in which the signal from the initial and final elements of the array may lack reproducibility. Conversely, some wavelengths may be deemed appropriate because of chemical knowledge regarding the spectral bands of specific species or functional groups of interest. A priori information can be useful in preselection procedures before the application of statistical data-driven selection algorithms.
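As a minimal sketch of such a preselection (not part of the routines given later in this chapter), the Matlab lines below restrict a spectral data matrix to a wavelength window deemed informative on a priori grounds. The variable names (wavelength, X) and the 1000–1600 nm limits, taken from the diesel example discussed in the following paragraphs, are assumptions for illustration only.

% A priori preselection of a wavelength window (illustrative sketch)
% 'wavelength' is assumed to be a vector with the wavelength axis (nm) and
% 'X' a matrix of spectra with one column per wavelength.
window = (wavelength >= 1000) & (wavelength <= 1600); % logical mask for the preselected range
Xpre = X(:,window);                                   % spectra restricted to the informative region
wavelength_pre = wavelength(window);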
For illustration, Figure 1(a) presents the near-infrared (NIR) spectrum of a diesel sample in the 666–1730 nm range acquired by using a Fourier transform spectrometer with an optical path length of 1.0 cm. As can be seen, the absorbance values for wavelengths above 1700 nm, which are associated to the first overtone of CH vibrational transitions, are very intense and lead to saturation of the detector. The NIR spectrum at longer wavelengths (data not shown) is mainly related to vibrational transitions of bonds involving sulfur, oxygen, and nitrogen, which are not found in large quantities in the usual composition of diesel oil. The region below 1000 nm (short NIR), which is associated to higher order (third) overtones of CH vibrational transitions, has very small absorbance values. In addition, the NIR FR-DTGS (fast recovery deuterated triglycine sulfate) detector used in the spectrometer has a low signal-to-noise ratio in this range, which results in a noisy spectrum as shown in Figure 1(b). On the contrary, the range of wavelengths from 1000 to 1600 nm is associated to useful vibrational transitions (second overtone of CH bonds and first overtone of combination bands). On the basis of these a priori considerations, the 1000–1600 nm range (Figure 1(c)) could be preselected for an investigation concerning the properties of diesel samples.
In analytical problems involving the spectrophotometric determination of chemical species in mixtures, a priori information may be used to select wavelengths for which the instrumental response has maximum sensitivity for the analyte and is not overlapped by interferent bands. For this purpose, the analyst may attempt to choose appropriate wavelengths by visually inspecting the pure spectrum of each mixture component. Unfortunately, some overlapping is almost always present, especially in complex chemical matrices. For this reason, some quantitative criteria have been proposed to select the wavelengths with the smallest degree of spectral overlapping for the determination of a given analyte. A simple criterion, which was presented elsewhere,12 employs a ratio between absorptivities. If only two absorbing species 1 and 2 are present, the ratio of absorptivities a1(λ)/a2(λ) is considered. Two wavelengths λ1 and λ2 are selected such that a1(λ1)/a2(λ1) is maximum (best conditions for analyte 1) and a1(λ2)/a2(λ2) is minimum (best conditions for analyte 2).
Figure 1 (a) NIR spectrum of a diesel sample in the 666–1730 nm range. (b) Details of the short NIR region. (c) Spectral region selected according to a priori considerations.
If A analytes are involved, the best wavelength λi for the ith analyte is the one that maximizes the function Gi(λ) defined as

G_i(\lambda) = \frac{[a_i(\lambda)]^{A-1}}{a_1(\lambda)\, a_2(\lambda) \cdots a_{i-1}(\lambda)\, a_{i+1}(\lambda) \cdots a_A(\lambda)}    (3)
The main shortcoming of this criterion is the need for knowing all components of the mixture and the corresponding pure spectra. Such a requirement may hamper the application of this approach to complex matrices in which the mixture composition cannot be easily characterized and/or pure spectra are not available.
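For illustration, the criterion of Equation (3) can be evaluated with a few lines of Matlab code. The sketch below assumes that the pure-component absorptivities are stored in a matrix a with one row per analyte and one column per wavelength; this matrix and the variable names are hypothetical and not part of the chapter's routines.

% Absorptivity-ratio criterion of Equation (3) (illustrative sketch)
[A,L] = size(a);                                 % A analytes, L wavelengths
G = zeros(A,L);
for i = 1:A
    others = setdiff(1:A,i);                     % indices of the interfering analytes
    G(i,:) = a(i,:).^(A-1)./prod(a(others,:),1); % Equation (3) evaluated at every wavelength
end
[Gmax,best_wavelength_index] = max(G,[],2);      % best wavelength for each analyte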
3.05.2.2 Correlation with the Dependent Variable
The empirical correlation coefficient r of the kth predictor variable xk with the dependent variable y can be calculated as13

r(k) = \frac{1}{N-1} \sum_{i=1}^{N} \left( \frac{x_{i,k} - \bar{x}_k}{s_{x_k}} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)    (4)

where x̄k, ȳ, s_{x_k}, s_y denote the empirical means and standard deviations corresponding to xk and y. A large correlation coefficient indicates that changes in the predictor value are good indicators of changes in the dependent variable being modeled. In fact, the square value r2(k) indicates the proportion of y variance explained by xk. Therefore, a possible selection scheme consists of using only those predictors with the largest correlation values. For this purpose, either a correlation threshold can be set or a desired number of variables can be established. Such parameters (threshold value/number of variables) may be adjusted to optimize one of the model-based criteria presented in Section 3.05.3. Alternatively, the correlation can be used as a preselection criterion before the use of a more elaborate variable selection method.
Some shortcomings associated to the use of correlation in this context are worth pointing out. Two predictors having large correlation with y may also present large correlation between themselves. Therefore, this criterion does not deal with the multicollinearity problem. Moreover, if the pool of available predictors contains a large number of uninformative variables, there is always the possibility that one or more of these variables will have large empirical correlation with y by chance.9 For this reason, it is generally useful to carry out a preliminary elimination of uninformative variables by using a priori information, if possible. Finally, the correlation coefficient may not be appropriate to measure the predictive power of variables when the data distribution is not Gaussian or the model to be constructed is nonlinear. In this case, using the more elaborate mutual information index,14,15 which evaluates the dependency between two variables on the basis of their joint probability distribution, could be a better option.
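As a minimal illustration of this ranking scheme (a sketch, not one of the chapter's routines), the following Matlab lines compute |r(k)| for every column of a calibration matrix Xcal with respect to ycal and retain the m most correlated predictors; the names Xcal, ycal, and m are assumptions.

% Correlation-based preselection (illustrative sketch)
[N,K] = size(Xcal);
y_as = (ycal - mean(ycal))/std(ycal);                       % autoscaled y
r = zeros(1,K);
for k = 1:K
    xk_as = (Xcal(:,k) - mean(Xcal(:,k)))/std(Xcal(:,k));   % autoscaled xk
    r(k) = abs(xk_as'*y_as)/(N-1);                          % |r(k)| as in Equation (4)
end
[r_sorted,order] = sort(-r);                                % decreasing order of |r(k)|
var_sel = order(1:m);                                       % indices of the m most correlated predictors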
3.05.2.3 Multicollinearity Criteria
A subset of variables is said to have near multicollinearity if some of the variables can be approximated by linear combinations of the others.13 Multicollinearity in spectral data often occurs because of interanalyte spectral overlapping. In this case, the regression may be unstable, leading to a model of poor prediction performance.4 For instance, consider a two-variable model of the form

y = b_0 + b_1 x_1 + b_2 x_2 + e    (5)

If the variables follow an exact relation x_2 = \gamma x_1 for a given constant γ, the model can be rewritten as

y = b_0 + b_1 x_1 + b_2 \gamma x_1 + e = b_0 + (b_1 + \gamma b_2) x_1 + e = b_0 + b'_1 x_1 + e    (6)

where b′1 = b1 + γb2. By using linear regression, unique least squares estimates for b0 and b′1 can be obtained. However, it is not possible to establish unique values for b1 and b2 because there is an infinite variety of (b1, b2) pairs for which b1 + γb2 = b′1. The regression is then said to be ill-conditioned. If x2 is only approximately equal to γx1 (near collinearity), the least squares equations will provide a unique estimate for (b1, b2). However, the
result will be unstable, in that small perturbations in the x-values will cause large changes in the resulting estimates of b1 and b2.
To illustrate the connection between spectral overlapping and multicollinearity, let us consider a simulated spectroscopic data set generated by assuming a linear relation between the matrix X of instrumental responses for 1000 wavelengths and the matrix Y of concentrations for two analytes:

X = YW    (7)
The element at line i and column k of matrix W corresponds to the proportionality coefficient for the ith analyte (i = 1, 2) at the kth spectral bin (k = 1, . . ., 1000). The adopted W-values for the two analytes, termed A and B, are presented in Figure 2(a). The concentration values in matrix Y were set according to a full factorial design with three levels (1.0, 5.5, and 10.0), which resulted in nine samples. The resulting spectra are presented in Figure 2(b). The solid vertical lines in Figures 2(a) and 2(b) indicate two wavelengths (variables x270 and x330) in which the spectral responses of the analytes are strongly overlapping. As shown in Figure 2(c), the corresponding data points in matrix X are mainly distributed along a single direction, which denotes a near-collinearity condition. In contrast, the dotted vertical lines in Figures 2(a) and 2(b) indicate two other wavelengths (variables x100 and x500) in which there is almost no spectral overlapping between the analyte responses. In this case, the degree of collinearity is much smaller, as can be seen in Figure 2(d). This example is also useful to illustrate the relation between multicollinearity and selectivity of the analytical method.16,17
Figure 2 (a) Proportionality coefficient values for analytes A and B. (b) Spectra of the nine samples resulting from the factorial design. (c) Data points associated to variables x270 and x330 (strong spectral overlapping). (d) Data points associated to variables x100 and x500 (weak spectral overlapping).
If the method has good selectivity, in the sense that each analyte can be determined with little interference from other species, then overlapping features and the associated multicollinearity effects will be small.
Two criteria often used to evaluate the degree of multicollinearity in a given subset of variables are the variance inflation factor (VIF) and the condition number (CN).

3.05.2.3.1 Variance inflation factor

The VIF for the kth variable can be expressed as

VIF(k) = \frac{1}{1 - \rho^{2}(k)}    (8)
where ρ(k) is the correlation coefficient between xk and the linear regression estimate of xk obtained from the remaining x variables.13 The value of VIF(k) indicates the degree by which the variance of b̂k is increased ('inflated') with respect to the variance that would result if the x variables were uncorrelated (orthogonal design). For an orthogonal design, all variables have VIF = 1. In contrast, if variable xk can be exactly written as a linear combination of the others (exact multicollinearity), the corresponding ρ(k) value will be equal to one and VIF(k) will be infinite. In intermediate situations between these two extreme cases, the VIF will have finite values larger than one and can be used to measure the degree of redundancy of each variable with respect to the others. An illustration of the relation between VIF and multicollinearity is presented in Section 3.05.2.3.3 together with a discussion of the CN. A possible selection algorithm based on the VIF may be structured as follows:

Step 1: Calculate VIF(k) for k = 1, . . ., K.
Step 2: Eliminate the variable with the largest VIF.
Step 3: Renumber the remaining variables and let K = K – 1.
Step 4: Check a suitable stopping criterion. If the criterion is not satisfied, return to Step 1.
It is worth noting that the VIF has to be recalculated at each iteration because the VIF values change after the removal of a variable. The stopping criterion can be set to attain a desired number of variables or to optimize one of the model-based metrics presented in Section 3.05.3. A limitation of this algorithm is the fact that the VIF will usually be infinite for all variables if K ≥ N. In this case, it would not be possible to decide which variable should be initially eliminated in Step 2.
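A minimal Matlab sketch of Steps 1–4 is given below, assuming that the data are stored in a matrix X and that the procedure stops when a desired number of variables (m_desired, a hypothetical parameter) is reached. The VIF of each remaining variable is obtained by regressing its column on the others.

% VIF-based backward elimination (illustrative sketch, not from the chapter)
var_sel = 1:size(X,2);                          % start from all variables
while length(var_sel) > m_desired
    Xs = X(:,var_sel);
    [N,K] = size(Xs);
    VIF = zeros(1,K);
    for k = 1:K
        xk = Xs(:,k);
        Xothers = [ones(N,1) Xs(:,[1:k-1 k+1:K])];
        xk_hat = Xothers*(Xothers\xk);          % regression of xk on the remaining variables
        rho2 = 1 - sum((xk - xk_hat).^2)/sum((xk - mean(xk)).^2); % squared multiple correlation
        VIF(k) = 1/(1 - rho2);                  % Equation (8)
    end
    [VIFmax,k_worst] = max(VIF);                % Step 2: variable with the largest VIF
    var_sel(k_worst) = [];                      % eliminate it and repeat (Steps 3 and 4)
end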
3.05.2.3.2 Condition number

The condition number κ associated to a given subset of variables is defined as

\kappa = \sqrt{\frac{\lambda_{max}}{\lambda_{min}}}    (9)

where λmax and λmin are the largest and smallest eigenvalues, respectively, of the empirical covariance matrix S of the variables under consideration.13 If the available x data are arranged in a matrix X of dimensions N × K, then S is calculated as

S = \frac{1}{N-1} \tilde{X}^{T} \tilde{X}    (10)
where X̃ is obtained by mean centering the columns of X. In the Matlab software, κ can be calculated by applying the function cond.m to matrix X̃. The CN may be interpreted as the ratio of the empirical standard deviations observed along the two directions in space with the largest and smallest data variability. It is worth noting that κ is a single metric for the entire subset of variables, unlike the VIF value, which is defined for each variable individually.
If two variables x1 and x2 are perfectly collinear, the corresponding columns in X̃ are linearly dependent and therefore λmin = 0, which results in an infinite CN. In geometrical terms, all empirical points in the plane x1 × x2 will fall on a straight line, that is, the entire data variability will be associated to a single axis. Such reasoning can be extended to multiple dimensions. In general, larger κ-values are associated to stronger multicollinearity effects. Conversely, if κ = 1, there is no empirical correlation between the variables. However, the reciprocal statement is not true. In fact, a large CN may occur, even if the variables are not correlated, simply because one of the variables has a small standard deviation in comparison with the others. The data set presented in Figure 3, for example, was randomly generated by using uncorrelated distributions for x1 and x2. However, the resulting CN is relatively large (κ = 101). To circumvent this problem, a normalization procedure can be used by dividing the elements in each variable vector by the corresponding standard deviation. If such normalization is applied to the data in Figure 3, the CN is reduced to κ = 1.7. Additional details concerning multicollinearity-related criteria can be found elsewhere.18
3.05.2.3.3 Illustrative example

To illustrate the role of the VIF and the CN in diagnosing collinearity, two synthetic data sets comprising a pair of variables (x1, x2) were generated, as shown in Table 1. No y variable is considered because this example is simply concerned with the relation between the two predictors.
Figure 3 Random data set generated by using uncorrelated, zero-mean Gaussian distributions for x1 and x2.
Table 1 Synthetic data sets used to illustrate the multicollinearity-related metrics

            Strong collinearity        Weak collinearity
Object      x1        x2               x1        x2
1           1.4       5.0              1.7       9.7
2           3.0       7.6              0.4       9.6
3           4.6       9.7              3.1       6.9
4           7.4       11.6             4.5       7.7
5           8.8       14.6             9.4       10.7
6           10.8      15.7             12.2      18.1
7           12.4      17.9             12.7      20.4
8           15.2      19.9             14.2      19.6
9           16.5      21.2             16.1      24.2
10          19.1      23.1             18.2      18.2

ρ           0.8946                     0.7697
VIF         5.007                      2.453
λmax        71.88                      80.12
λmin        0.217                      6.17
κ           18.2                       3.6

VIF, variance inflation factor.
Figure 4 Synthetic data sets with (a) strong and (b) weak collinearity between predictors x1 and x2.
In the first data set, strong collinearity was imposed. As can be seen in Figure 4(a), the data points are mainly distributed along a single direction, which concentrates most of the variability. As a result, the correlation coefficient is close to one (ρ(1) = ρ(2) = ρ = 0.8946), resulting in a VIF value of 5.007. In addition, the CN associated to the data set is κ = 18.2, which indicates that the standard deviation along the main axis of data variability is 18.2 times larger than the standard deviation along the axis of smallest variability. It is worth noting that x1 and x2 have the same values of ρ and VIF because there are only two variables involved in this example. Evidently, the degree of collinearity of x1 with respect to x2 is equal to the degree of collinearity of x2 with respect to x1. If three or more x variables were considered, different values of ρ and VIF would be associated to each variable.
In the second data set, weaker collinearity was imposed, as can be seen in Figure 4(b). Such a smaller degree of collinearity results in smaller values for the correlation coefficient (ρ = 0.7697), VIF (2.453), and CN (κ = 3.6), as compared with the first data set.
Finally, it should be noted that multicollinearity criteria such as VIF and CN do not address the predictive power of the variables, that is, their relation with the dependent variable y. Therefore, a selection algorithm with the sole purpose of minimizing multicollinearity may fail to deliver an informative subset of predictors. For this reason, additional criteria related to modeling accuracy should be used in the selection procedure.
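The metrics reported in Table 1 can be reproduced with a few lines of Matlab code. The sketch below assumes that the two variables are stored as the columns of a matrix X2 (a hypothetical name); it evaluates the correlation coefficient, the VIF of Equation (8), the eigenvalues of the empirical covariance matrix, and the CN of Equation (9).

% Multicollinearity metrics for a pair of variables (illustrative sketch)
x1 = X2(:,1); x2 = X2(:,2);
N = size(X2,1);
rho = ((x1-mean(x1))'*(x2-mean(x2)))/((N-1)*std(x1)*std(x2)); % correlation coefficient
VIF = 1/(1 - rho^2);                        % Equation (8)
Xc = X2 - ones(N,1)*mean(X2);               % mean-centered data
S = (Xc'*Xc)/(N-1);                         % empirical covariance matrix, Equation (10)
lambda = eig(S);                            % eigenvalues (lambda_min and lambda_max)
kappa = sqrt(max(lambda)/min(lambda));      % condition number, Equation (9)
% kappa could also be obtained directly as cond(Xc)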
3.05.3 Criteria for Evaluating Variable Subsets on the Basis of the Regression Results

3.05.3.1 Use of a Validation Set
If the number N of objects available for the model-building procedures is sufficiently large, NV objects may be separated to form an independent validation (also termed ‘test’) set. In this case, the remaining (N – NV) objects are employed to obtain the regression coefficients and the validation set is used to evaluate the prediction ability of the resulting model. For this purpose, the model is applied to predict the y-values for the validation objects and the predictions are compared against the known values. Usual metrics for such a comparison are presented below.
Prediction error sum of squares in the validation set (PRESSV), defined as

PRESS_V = \sum_{i=1}^{N_V} (y_{V,i} - \hat{y}_{V,i})^{2}    (11)

where y_{V,i} and ŷ_{V,i} are the known and predicted values of the y variable in the ith validation object.

Root mean square error of prediction in the validation set (RMSEPV), defined as

RMSEP_V = \sqrt{\frac{1}{N_V} \sum_{i=1}^{N_V} (y_{V,i} - \hat{y}_{V,i})^{2}} = \sqrt{\frac{PRESS_V}{N_V}}    (12)

Standard deviation of validation errors (SDV), defined by the American Society for Testing and Materials (ASTM)19 as

SDV = \sqrt{\frac{1}{N_V - 1} \sum_{i=1}^{N_V} (y_{V,i} - \hat{y}_{V,i} - BIAS)^{2}}    (13)

where BIAS is the average value of the validation errors calculated as

BIAS = \frac{1}{N_V} \sum_{i=1}^{N_V} (y_{V,i} - \hat{y}_{V,i})    (14)

Some authors denote the SDV quantity by SEP,13 which stands for 'standard error of prediction'.

Correlation coefficient (rV) between the known and predicted y-values:

r_V = \frac{1}{N_V - 1} \sum_{i=1}^{N_V} \left( \frac{y_{V,i} - \bar{y}_V}{s_{y_V}} \right) \left( \frac{\hat{y}_{V,i} - \bar{\hat{y}}_V}{s_{\hat{y}_V}} \right)    (15)

where ȳ_V, the mean of the predictions, and the corresponding standard deviations s_{y_V}, s_{ŷ_V} are the empirical mean and standard deviation values for y and ŷ in the validation set.

The metrics PRESSV and RMSEPV evaluate the prediction accuracy by taking into account both the systematic (bias) and random parts of the validation error. The use of PRESSV instead of RMSEPV may be more convenient for the application of F-tests, which are based on ratios of variances, rather than on standard deviations. However, RMSEPV has the advantage of being expressed in the same measurement units of the y variable, which facilitates the interpretation of results. The metrics SDV and rV only assess the random part of the validation error. Therefore, they should always be reported together with the BIAS value. The calculated BIAS could be used to correct future predictions of the model, leaving only the random error related to SDV and rV. However, it is worth noting that the BIAS value may not be a reliable estimate of the true prediction bias if the validation set is not representative, and therefore such correction in future predictions may actually be detrimental.
It should be emphasized that the performance indexes described above are meaningful only if the validation set is a good representation of the objects to which the model will be applied in the future. Algorithms have been proposed to select representative samples from a given set to cover the range of x variability in a uniform manner.20 In addition, the distribution of the y-values can also be taken into account.21
If the number N of available objects is not large enough for the extraction of a validation set, cross-validation procedures may be employed as an alternative. In cross-validation, a subset of samples is removed from the data set and a model is constructed with the remaining objects.9 The model is then used to predict the y-values in the removed samples. The subset of samples is returned to the data set and the process is repeated for different subsets until each sample has been subjected to this procedure once. The predicted values obtained in the course of this process can be used to calculate cross-validation metrics PRESSCV, RMSEPCV, SDCV, and rCV by using the corresponding equations presented above. A usual cross-validation approach is the so-called 'leave-one-out' procedure in which only a single sample is excluded from the data set at each iteration.
A limitation of the cross-validation approach is the fact that a careful experimental design may be compromised if a sample is removed. For example, in a 2^3 factorial design for a problem with three dependent variables y1, y2, and y3, each calibration object corresponds to one vertex of a cube in the (y1 × y2 × y3) space. Whenever one sample is removed from the data set in the cross-validation procedure, the cube loses a vertex and the experimental design becomes incomplete. As a result, the model obtained with the remaining objects will be used to predict the (y1, y2, y3) values in a point that is out of the multidimensional space covered by the incomplete experimental design.
The Matlab routine validation.m presented below illustrates the calculation of the predicted y-values (yhat) and corresponding prediction errors (e). Depending on whether a separate validation set (Xval, yval) is provided or empty matrices are passed in its place, the routine automatically switches between validation with a separate data set and leave-one-out cross-validation. The results are used in a second routine (validation_metrics.m) to calculate the performance metrics described above (PRESS, RMSEP, SDV, BIAS, and r).

function [yhat,e] = validation(Xcal,ycal,Xval,yval,var_sel)

% [yhat,e] = validation(Xcal,ycal,Xval,yval,var_sel) --> Validation with a separate set
% [yhat,e] = validation(Xcal,ycal,[],[],var_sel)     --> Cross-validation

N = size(Xcal,1);  % Number of objects in the calibration set
NV = size(Xval,1); % Number of objects in the validation set

if NV > 0 % Validation with a separate set
    Xcal_ones = [ones(N,1) Xcal(:,var_sel)];
    b = Xcal_ones\ycal;                       % MLR with offset term (b0)
    yhat = [ones(NV,1) Xval(:,var_sel)]*b;    % Prediction over the validation set
    e = yval - yhat;                          % Validation error
else % Cross-validation
    yhat = zeros(N,1); % Setting the proper dimensions of yhat
    for i = 1:N
        % Removing the ith object from the calibration set
        cal = [[1:i-1] [i+1:N]];
        X = Xcal(cal,var_sel); y = ycal(cal);
        xtest = Xcal(i,var_sel); ytest = ycal(i);
        X_ones = [ones(N-1,1) X];
        b = X_ones\y;                         % MLR with offset term (b0)
        yhat(i) = [1 xtest]*b;                % Prediction for the ith object
    end
    e = ycal - yhat;                          % Cross-validation error
end

function [PRESS,RMSEP,SDV,BIAS,r] = validation_metrics(Xcal,ycal,Xval,yval,var_sel)

% [PRESSV,RMSEPV,SDV,BIASV,rV] = validation_metrics(Xcal,ycal,Xval,yval,var_sel)
%     --> Validation with a separate set
% [PRESSCV,RMSEPCV,SDCV,BIASCV,rCV] = validation_metrics(Xcal,ycal,[],[],var_sel)
%     --> Cross-validation

if size(Xval,1) > 0 % Validation with a separate set
    y = yval;
else % Cross-validation
    y = ycal;
end

[yhat,e] = validation(Xcal,ycal,Xval,yval,var_sel);

PRESS = e'*e;
N = length(e);
RMSEP = sqrt(PRESS/N);
BIAS = mean(e);
ec = e - BIAS;                                % Mean-centered error values
SDV = sqrt(ec'*ec/(N - 1));
yhat_as = (yhat - mean(yhat))/std(yhat);      % Autoscaling
y_as = (y - mean(y))/std(y);                  % Autoscaling
r = (yhat_as'*y_as)/(N-1);
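An example call of these routines is given below. The data matrices (Xcal, ycal, Xval, yval) and the index set var_sel are assumed to be available in the workspace; the particular indices shown are merely illustrative.

% Example call (illustrative; data and variable indices are assumed)
var_sel = [10 25 42];   % hypothetical subset of three selected variables
% Validation with a separate set:
[PRESSV,RMSEPV,SDV,BIASV,rV] = validation_metrics(Xcal,ycal,Xval,yval,var_sel);
% Leave-one-out cross-validation over the calibration set:
[PRESSCV,RMSEPCV,SDCV,BIASCV,rCV] = validation_metrics(Xcal,ycal,[],[],var_sel);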
3.05.3.2 Analysis of Regression Residuals in the Calibration Set
Two MLR models with the same number m of variables (with m < N) can be compared in terms of the residual sum of squares (RSS) or correlation between estimated and known y-values (r) obtained in the calibration set itself.13 Smaller RSS or larger r values indicate a better fit. However, such criteria cannot be directly used to compare models with a different number of variables because the fitting of the calibration data always improves as more variables are added. One possibility consists of inspecting the scree plots of RSS or r versus m. The point after which the RSS or r curves tend to stabilize (i.e., the incremental improvements become minor) may be taken as an indication of the appropriate number of variables. However, this procedure is not straightforward because usually the localization of such a point in the scree plots is not clearly defined.
An alternative consists of adopting a criterion that somehow penalizes the model complexity, as related to the number m of variables included in the regression. One such criterion is the residual mean square (RMS), defined as RMS = RSS/(N – m – 1). However, although both the numerator and denominator decrease as m increases, the RMS value typically tends to decrease initially and then stabilize around the variance of the pure error in y. Therefore, the choice of an appropriate point in the plot of RMS versus m may again be unclear.
Another option is the use of the Mallows Cp statistic,2,22 which is defined as

C_p = \frac{RSS_m}{\hat{\sigma}^{2}} - N + 2p    (16)

where RSSm is the RSS for a model with p = (m + 1) coefficients (i.e., m variables plus the offset term) and σ̂² is an empirical estimate for the variance σ² of the pure error in y. A low Cp value is achieved for parsimonious models (small p) with a good fit (small RSSm). Moreover, it is desirable to have Cp approximately equal to p. In fact, if a model with p coefficients is an adequate representation of the x–y relation, that is, if the model does not suffer from lack of fit, the expected value of RSSm can be shown to be (N – p)σ². Therefore, assuming that σ̂² is a good estimate for σ², it follows that

C_p \approx \frac{(N - p)\sigma^{2}}{\sigma^{2}} - N + 2p = p    (17)

that is, Cp ≈ p for an adequate model. The same line of reasoning shows that Cp >> p is an indication of a poor fit. To summarize, it is advisable to choose the model with the smallest Cp such that Cp ≈ p. Figure 5 illustrates the choice of an appropriate model on the basis of the Cp statistic. In this hypothetical case, each point corresponds to a model built from a different subset of variables. The best choice, according to the criterion discussed above, is indicated by an arrow.
The main difficulty involved in the calculation of the Cp statistic is the determination of a suitable estimate σ̂². If replicate data are available, σ̂² can be calculated from the dispersion of y-values for the same object. Otherwise, a standard procedure consists of using the RMS of the model including all K predictors, that is

\hat{\sigma}^{2} = \frac{RSS_K}{N - K - 1}    (18)
Figure 5 Illustrative example for the use of the Mallows Cp statistic. The best model (red circle) is indicated by an arrow.
However, this approach is possible only if the number of calibration samples is larger than the number of variables. Alternatively, σ̂² may be taken as the asymptotic value of the curve of RMS versus the number of variables. The discussion presented in this section could also be adapted for use with PCR or PLS models by replacing the number of coefficients p with the effective rank (ER). The ER is a measure of the number of degrees of freedom consumed to calibrate a given model, as described elsewhere.23
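As a minimal sketch (not one of the chapter's routines), the Cp value of a candidate subset can be computed as follows. It is assumed that the calibration data are stored in Xcal and ycal, that the selected variable indices are in var_sel, and that N > K + 1 so that Equation (18) can be used to estimate the pure error variance.

% Mallows Cp for a candidate subset var_sel (illustrative sketch)
[N,K] = size(Xcal);
X_full = [ones(N,1) Xcal];                      % full model with all K predictors
RSS_K = sum((ycal - X_full*(X_full\ycal)).^2);
sigma2_hat = RSS_K/(N - K - 1);                 % Equation (18)
Xm = [ones(N,1) Xcal(:,var_sel)];               % model with the m selected variables
RSS_m = sum((ycal - Xm*(Xm\ycal)).^2);
p = length(var_sel) + 1;                        % m variables plus the offset term
Cp = RSS_m/sigma2_hat - N + 2*p;                % Equation (16)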
3.05.3.3 Analysis of Significance for the Inclusion of Additional Variables
A special case of model comparison consists of evaluating the benefit of adding extra variables to a given model. In this case, the two models (with and without the extra variables) can be compared using a partial F-test.2 Such a test considers the sum of squares due to regression (SSreg) for each of the two models and decides whether the extra sum of squares is large enough to justify the inclusion of the additional variables, given the existing degrees of freedom. The SSreg value for a model of the form given by Equation (1) is defined as

SS_{reg}(b_0, b_1, \ldots, b_K) = \sum_{i=1}^{N} (\hat{y}_i - \bar{y})^{2}    (19)

where ŷi is the model estimate for the dependent variable corresponding to the ith calibration object and ȳ = (1/N) Σ_{i=1}^{N} y_i. For instance, consider the following two models:

Model 1: y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + b_4 x_4 + e    (20)

Model 2: y = b_0 + b_1 x_1 + b_2 x_2 + e    (21)

It is worth noting that the empirical values obtained for b0, b1, and b2 in Model 1 will usually be different from those in Model 2, unless the x variables are mutually orthogonal. In the same manner, the residual e will usually be different in both models. Such distinctions are not made explicit in the equations above for brevity of notation. Let S4 = SSreg(b0, b1, b2, b3, b4) and S2 = SSreg(b0, b1, b2) denote the regression sum of squares for Models 1 and 2, respectively. The extra sum of squares due to the inclusion of x3 and x4 in the regression is defined as

SS_{reg}(b_3, b_4 \,|\, b_0, b_1, b_2) = S_4 - S_2    (22)

Alternatively, it can be easily shown that (S4 – S2) is equal to RSS2 – RSS4, which is the difference between the RSS values in the reverse order. Either expression can be used at the convenience of the user. The inclusion of x3 and x4 in Model 1, as compared with Model 2, is justifiable if (S4 – S2) is significantly large. Such a decision is taken in view of the estimated variance σ̂² = RSS4/(N – 5) calculated from the residuals of the larger model (Model 1). It can be shown2 that the extra sum of squares (S4 – S2) has (5 – 3) degrees of freedom. Therefore, the ratio between (S4 – S2)/(5 – 3) and σ̂² can be subjected to an F-test with (5 – 3) and (N – 5) degrees of freedom in the numerator and denominator, respectively.
In the general case, if a model with q variables is to be evaluated against a larger model with (m – q) additional variables, the quantity (Sm – Sq)/(m – q) can be compared with σ̂² by an F(m – q, N – m – 1) test. It is worth noting that the estimated variance σ̂² is calculated from the residuals of the larger model. With a slight abuse of language, the resulting F-statistic is termed the partial F-value for the additional variables. A particularly useful application of this partial F-test consists of analyzing the result of adding one variable to or removing one variable from a given model. In this case (m – q) = 1. Such a procedure is used in the forward selection, backward elimination, and stepwise regression methods, which are described in Section 3.05.4.
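A minimal Matlab sketch of this test is given below, assuming that the calibration data are stored in Xcal and ycal and that var_base and var_extra hold the indices of the variables already in the model and of the candidate additions, respectively; these names are assumptions made for illustration.

% Partial F-value for adding variables var_extra to a model containing var_base
N = size(Xcal,1);
X_small = [ones(N,1) Xcal(:,var_base)];
X_large = [ones(N,1) Xcal(:,[var_base var_extra])];
yc = ycal - mean(ycal);
SSreg_small = sum((X_small*(X_small\ycal) - mean(ycal)).^2);  % Equation (19)
SSreg_large = sum((X_large*(X_large\ycal) - mean(ycal)).^2);
RSS_large = yc'*yc - SSreg_large;                 % residual sum of squares of the larger model
q = length(var_base); m = q + length(var_extra);
sigma2_hat = RSS_large/(N - m - 1);               % residual variance of the larger model
F_partial = ((SSreg_large - SSreg_small)/(m - q))/sigma2_hat;
F_crit = finv(1 - 0.05, m - q, N - m - 1);        % critical value at the 0.05 level
significant = (F_partial > F_crit);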
3.05.3.4 Analysis of the Regression Coefficients

The relative relevance of each variable xk (k = 1, . . . , K) included in a given model can be roughly assessed by inspection of the estimated regression coefficients b̂1, . . ., b̂K. If the variables have been autoscaled, the absolute values |b̂1|, . . ., |b̂K| may be used as a relevance measure (a larger regression coefficient indicates that the variable under consideration has a larger impact on the model output). If autoscaling has not been carried out, a relevance index I(k) for each variable xk can be defined as the product of |b̂k| and the empirical standard deviation of xk, that is

I(k) = |\hat{b}_k| \, s_{x_k}    (23)
An extension of this concept is exploited in the so-called recursive weighted regression method for variable selection in PLS models.24 This method employs a recursive re-weighting of the x variables based on the b coefficients calculated from a PLS regression. The recursion is carried out as follows. Let x_k ∈ R^N be the vector of available observations for variable xk and b_k^0 be the corresponding regression coefficient obtained after carrying out the PLS calculations. The weighted vector of xk values is defined as x_k^1 = b_k^0 x_k for k = 1, . . . , K. New regression coefficients, denoted by b_1^1, . . ., b_K^1, are then obtained by re-applying PLS to the weighted data. This sequence of operations defines a recursion of the form x_k^{i+1} = b_k^i x_k^i, i = 0, 1, . . ., for each variable xk with x_k^0 = x_k. Given a sufficiently large number of iterations, one expects the b-values to converge to either zero or one. At the end, the selected variables will be those associated to nonzero limit values of b.
In addition to analyzing the magnitude of the regression coefficients, one can also consider their estimated associated uncertainties. The variances s²(b̂0), s²(b̂1), . . ., s²(b̂K) are given by the diagonal elements of the (K + 1) × (K + 1) matrix V defined as

V = \left( [\,1 \;\; X\,]^{T} [\,1 \;\; X\,] \right)^{-1} \hat{\sigma}^{2}    (24)
where X is the matrix of x data with dimensions (N × K), 1 is a column vector of N ones (associated to the intercept coefficient), and σ̂² is obtained as described in Section 3.05.3.2. A t-ratio t(k) can be obtained for the kth regression coefficient by dividing its estimated value by the corresponding standard deviation, that is

t(k) = \frac{\hat{b}_k}{s(\hat{b}_k)}    (25)
If σ̂² is calculated from the RSS value as in Equation (18), each t-ratio will have a Student's t-distribution with (N – K – 1) degrees of freedom.13 Therefore, a t-test may be used to determine which coefficients can be said to be different from zero at a given significance level. Alternatively, it can be shown that t²(k) has an F-distribution with one and (N – K – 1) degrees of freedom, which corresponds to the partial F-statistic for xk, as defined in Section 3.05.3.3. By using the t-test or the equivalent F-test, the variables found to be insignificant for the regression model could be discarded. The significance level for the tests can be used to adjust this selection procedure.
In latent variable models, the variance of the regression coefficients for x variables cannot be evaluated according to the procedure described above, which prevents the direct use of t- or F-tests. In this case, an alternative is the jack-knifing procedure proposed by Centner et al.25 Jack-knifing26 is similar to leave-one-out cross-validation, in that different models are obtained by removing a single object at a time from the data set. The estimated value of the regression coefficient bk will be slightly different for each model and thus an associated variance can be calculated. The variance estimated in this manner may then be used in a t- or F-test.27 In this context, an appropriate t-value threshold for discarding variables can be obtained by using the uninformative variable elimination method proposed elsewhere.25 For this purpose, K artificial random variables are added to the data set. These artificial variables should have values small enough so that the latent variables to be included in the model are not significantly disturbed. The jack-knifing procedure is then carried out and the t-ratios are calculated for each of the original and artificial variables. The original variables with a t-ratio smaller (in absolute value) than the maximum t-value obtained for the artificial variables are considered uninformative.
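For the MLR case, the t-ratios of Equations (24) and (25) can be computed with the short Matlab sketch below, assuming that the calibration data are stored in Xcal and ycal (hypothetical names) and that N > K + 1.

% t-ratios for the coefficients of an MLR model (illustrative sketch)
[N,K] = size(Xcal);
X1 = [ones(N,1) Xcal];                    % design matrix with intercept column
b_hat = X1\ycal;                          % least squares coefficients b0, b1, ..., bK
RSS = sum((ycal - X1*b_hat).^2);
sigma2_hat = RSS/(N - K - 1);             % Equation (18)
V = inv(X1'*X1)*sigma2_hat;               % Equation (24)
s_b = sqrt(diag(V));                      % standard deviations of the coefficients
t_ratio = b_hat./s_b;                     % Equation (25)
t_crit = tinv(1 - 0.05/2, N - K - 1);     % two-sided test at the 0.05 level
insignificant = find(abs(t_ratio(2:end)) < t_crit); % candidate variables to discard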
3.05.3.5 Variance of the Prediction
In addition to the magnitude and uncertainty of each regression coefficient, the 2-norm of the regression vector b = [b_1 \; b_2 \; \cdots \; b_K]^T can be used as a measure of the variance in the predictions of the model. In fact, consider that a model of the form

\hat{y} = b_0 + \sum_{k=1}^{K} b_k x_k    (26)

will be used to calculate the predicted values ŷ from the measured values of x = [x_1 \; x_2 \; \cdots \; x_K]^T. Moreover, assume that the x-measurements are contaminated by noise as

x = \mu + h    (27)
such that μ = E{x}, E{h} = 0, and E{hh^T} = S_h, where E{·} denotes the expected value of a random variable. If the regression coefficients are assumed to be deterministic values, for simplicity, it follows that

E\{\hat{y}\} = b_0 + E\left\{ \sum_{k=1}^{K} b_k x_k \right\} = b_0 + \sum_{k=1}^{K} b_k E\{x_k\} = b_0 + \sum_{k=1}^{K} b_k \mu_k    (28)

Thus, the variance of ŷ can be obtained as

\sigma_{\hat{y}}^{2} = E\{(\hat{y} - E\{\hat{y}\})^{2}\} = E\left\{ \left( \sum_{k=1}^{K} b_k (x_k - \mu_k) \right)^{2} \right\} = E\{(b^T h)^{2}\} = E\{b^T h h^T b\} = b^T E\{h h^T\} b = b^T S_h b    (29)
If the noise h is uncorrelated and homoscedastic, the variance–covariance matrix S_h is of the form σ²I, where I is a (K × K) identity matrix. In this case, Equation (29) becomes

\sigma_{\hat{y}}^{2} = \sigma^{2} b^T b = \sigma^{2} ||b||_{2}^{2}    (30)

where ||b||_2 is the 2-norm of the regression vector. Of course, this is a simplified result, as the uncertainties in the regression coefficients were not considered. More detailed variance expressions can be found elsewhere.28 However, the expression in Equation (30) can be used as a guideline, showing that larger regression vectors (in the sense of the 2-norm) tend to generate predictions that are more sensitive to measurement noise in the x variables.
3.05.4 Variable Selection Techniques

3.05.4.1 Exhaustive Search
An exhaustive (also called 'enumerative') search consists of evaluating all possible combinations of the variables available for regression. The main drawback of this approach lies in the computational effort required. If K variables are available, there are (2^K – 1) possible subsets that can be formed with 1 up to K variables. If one wishes to evaluate combinations of mmin up to mmax variables, the number S of possible combinations is given by the following sum of binomials:

S = \binom{K}{m_{min}} + \binom{K}{m_{min}+1} + \cdots + \binom{K}{m_{max}}    (31)
Such a number may get considerably large even for problems of relatively small dimension. For instance, consider the problem of building a model for the simultaneous determination of three analytes with overlapping spectral features. Suppose that a total of K = 60 wavelengths are available and that a 2^3 factorial design was used to generate N = 8 calibration samples. In this case, at least three wavelengths are required to solve the overlapping problem, that is mmin = 3. Moreover, if MLR is used to build the model, the number of model variables cannot be larger than the number of calibration samples and thus mmax = 7 (if the offset term b0 is included in the model). In this case, there are more than 400 million possible combinations (S ≈ 4.4 × 10^8).
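As a quick check of this figure, the sum of binomials in Equation (31) can be evaluated directly in Matlab:

% Number of subsets to be evaluated in the example above (Equation (31))
K = 60; m_min = 3; m_max = 7;
S = 0;
for m = m_min:m_max
    S = S + nchoosek(K,m);   % binomial coefficient "K choose m"
end
S                            % approximately 4.4e8 combinations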
The implementation of an exhaustive search algorithm in the Matlab software is presented below (exhaustive_search.m). In this case, the Matlab function dec2bin.m, which converts decimal integer values to binary strings, is used to facilitate the implementation. In this example, either validation with an independent set or leave-one-out cross-validation can be used to evaluate all possible variable subsets. The PRESS criterion (either PRESSV or PRESSCV, depending on the input arguments) is employed.

function var_sel = exhaustive_search(Xcal,ycal,Xval,yval)

% var_sel = exhaustive_search(Xcal,ycal,Xval,yval) --> Validation with a separate set
% var_sel = exhaustive_search(Xcal,ycal,[],[])     --> Cross-validation

K = size(Xcal,2); % Total number of variables available for selection
NumberPossibleSubsets = (2^K)-1;
aux = blanks(2*K);  % Auxiliary string of 2K blank characters
index = 2*(1:K);    % Positions in aux to be filled with 0/1 (1 = selected variable)
for i = 1:NumberPossibleSubsets
    binary_selection = dec2bin(i,K); % Encodes the ith subset in binary format
    aux(index) = binary_selection;   % Places the binary encoding in the auxiliary string
    var_sel = find(str2num(aux));    % Extracts the selected variables from the binary encoding
    [yhat,e] = validation(Xcal,ycal,Xval,yval,var_sel);
    F(i) = e'*e;                     % Evaluates the PRESS value for the ith subset
end
% Determines the minimum cost value and the index of the associated subset of variables
[F_min,i_opt] = min(F);
binary_selection_opt = dec2bin(i_opt,K);
aux(index) = binary_selection_opt;
var_sel = find(str2num(aux));        % Stores the selected variables in var_sel
3.05.4.2 Iterative MLR Methods

3.05.4.2.1 Forward selection

The forward selection method uses an iterative algorithm that starts with a single x variable and successively adds more variables to the regression until a stopping condition is satisfied. The initial variable is the one having the largest empirical correlation with the dependent variable y. At each subsequent iteration, the effect of including each of the remaining variables is evaluated according to a partial F-test. Each test involves including the variable under consideration, building the new model, calculating the corresponding values of σ̂² and SSreg (or RSS), and evaluating the partial F-statistic. The variable with the largest partial F-value is then included in the model. This selection process continues until none of the remaining variables has a partial F-value larger than a given critical value.
The limitation of the forward selection method is the fact that, once a variable has been included in the model, it cannot be removed in subsequent iterations. Therefore, some combinations in which this variable is not present are never tested during the selection process.
3.05.4.2.2 Backward elimination

The backward elimination method starts by building a model with all available variables. Subsequently, the effect of removing each of the variables in the model is evaluated according to a partial F-test, as in the forward selection algorithm. The variable with the smallest partial F-value is then removed. This selection process continues until all variables in the model have partial F-values larger than a given critical value.
The limitation of the backward elimination method is the fact that, once a variable has been removed from the model, it cannot be re-inserted in subsequent iterations. Therefore, some combinations in which this variable is present are never tested during the selection process. As can be seen, this drawback is exactly the reverse of the limitation seen in the forward selection method. In addition, if the overall number of variables (K) is larger than or equal to the number of calibration objects (N), it will not be possible to build an MLR model with all available variables, as required in the initial step of the algorithm.
3.05.4.2.3 Stepwise regression

The stepwise regression method can be regarded as a combination of the forward and backward procedures described above. The resulting algorithm progressively adds new variables to the model, starting from the x variable with the largest empirical correlation with the dependent variable y, as in the forward selection method. In addition, the algorithm incorporates a mechanism for eliminating variables as in the backward elimination method.
Each iteration of the selection procedure comprises an inclusion phase followed by an exclusion phase. In the inclusion phase, each of the remaining variables is subjected to a partial F-test as in the forward selection method. If the largest F-value obtained in this manner is larger than a critical 'F-to-enter' value, the corresponding variable is inserted in the model. In the exclusion phase, each of the variables in the model is subjected to a partial F-test as in the backward elimination method. If the smallest F-value thus obtained is smaller than a critical 'F-to-remove' value, the corresponding variable is eliminated from the model and returned to the pool of variables still available for selection. The selection procedure stops when no variables can be added to or removed from the model according to the partial F-test criteria.
The algorithm will usually achieve convergence if the significance levels for the entry and exit tests have been sensibly chosen to avoid a cyclic behavior. Some authors2 recommend using the same significance level (α-value associated to the partial F-test) for both the entry and exit tests. If a smaller α is used for the exit test than for the entry test, the algorithm will be biased toward removing variables. On the contrary, if the α-value is smaller for the entry test, the algorithm will be biased toward retaining variables. A Matlab implementation of this algorithm is presented below (stepwise_regression.m). If the α-values for the entry and exit tests are not specified by the user, a default value of 0.05 is adopted.
function var_sel = stepwise_regression(X,y,alpha_entry,alpha_exit)
% Default values for alpha_entry, alpha_exit
if nargin == 2
    alpha_entry = 0.05;
    alpha_exit = 0.05;
end
N = size(X,1); % Number of objects
K = size(X,2); % Total number of variables
% Step 0: Initialization
% The algorithm starts from the variable with largest empirical correlation with y
y_as = ( y - mean(y) ) / std(y); % Autoscaling
for k = 1:K
    xk = X(:,k);
    xk_as = ( xk - mean(xk) ) / std(xk); % Autoscaling
    r(k) = abs(xk_as'*y_as)/(N-1); % Absolute correlation coefficient between xk and y
end
[dummy,k0] = max(r); % k0 is the index of the variable with the largest r
var_sel = k0;        % var_sel = index set of variables selected at present
K_selected = 1;      % Number of variables selected at present
var_available = setdiff([1:K],k0); % Index set of variables still available for selection
K_available = K - 1; % Number of variables still available for selection
% Main routine with entry and exit tests
% The flag_continue variable will be set to zero if no more variables can be
% either added to or removed from the model
flag_continue = 1;
yc = y - mean(y);
SS_about_mean = yc'*yc; % Sum of squares about the mean
while (flag_continue == 1)
    %%%%%%%%%%%%%%%
    % Entry phase %
    %%%%%%%%%%%%%%%
    %%% 1 - Before adding the candidate variable
    Xbefore = [ones(N,1) X(:,var_sel)];
    SSreg_before = sum_squares_regression(Xbefore,y);
    %%% 2 - Testing the inclusion of each candidate variable
    F = zeros(1,K_available);
    % Degrees of freedom for estimation of the pure error variance in the augmented models
    dof = N - ( K_selected + 1 ) - 1;
    for index = 1:K_available
        % Index of the candidate variable to be included
        var_included = var_available(index);
        Xafter = [Xbefore X(:,var_included)];
        SSreg_after = sum_squares_regression(Xafter,y);
        % Estimate of pure error variance
        RSS = SS_about_mean - SSreg_after;
        variance_hat = RSS/dof;
        F(index) = (SSreg_after - SSreg_before)/variance_hat;
    end
    [Fmax,index_Fmax] = max(F);
    Fcrit = finv( (1-alpha_entry), 1, dof);
    if Fmax > Fcrit
        % The variable with the largest F-value is added
        var_sel = [var_sel var_available(index_Fmax)];
        var_available = setdiff([1:K],var_sel);
        K_selected = K_selected + 1;
        K_available = K_available - 1;
        flag_entry = 1; % Indicates that a variable has been added to the model
    else
        flag_entry = 0; % Indicates that no variable has been added to the model
    end
    % End of entry phase
    %%%%%%%%%%%%%%
    % Exit phase %
    %%%%%%%%%%%%%%
    %%% 1 - Before removing a candidate variable
    Xbefore = [ones(N,1) X(:,var_sel)];
    SSreg_before = sum_squares_regression(Xbefore,y);
    % Estimate of pure error variance
    RSS = SS_about_mean - SSreg_before;
    dof = N - K_selected - 1;
    variance_hat = RSS/dof;
    %%% 2 - Testing the removal of each candidate variable
    F = zeros(1,K_selected);
    for index = 1:K_selected
        var_excluded = var_sel(index); % Index of the candidate variable to be excluded
        % Index set of the predictor variables after excluding the variable under test
        var_reduced = setdiff(var_sel,var_excluded);
        Xafter = [ones(N,1) X(:,var_reduced)];
        SSreg_after = sum_squares_regression(Xafter,y);
        F(index) = (SSreg_before - SSreg_after)/variance_hat;
    end
    [Fmin,index_Fmin] = min(F);
    Fcrit = finv( (1-alpha_exit), 1, dof);
    if Fmin < Fcrit
        % The variable with the smallest F-value is removed
        var_excluded = var_sel(index_Fmin); % Index of the candidate variable to be excluded
        var_sel = setdiff(var_sel,var_excluded);
        var_available = [var_available var_excluded];
        K_selected = K_selected - 1;
        K_available = K_available + 1;
        flag_exit = 1; % Indicates that a variable has been removed from the model
    else
        flag_exit = 0; % Indicates that no variable has been removed from the model
    end
    % End of exit phase
    flag_continue = max(flag_entry,flag_exit);
end

function SSreg = sum_squares_regression(X,y)
% Regression
b = X\y;
% y estimation
yhat = X*b;
% Evaluation of sum of squares due to regression
yhat_c = yhat - mean(y);
SSreg = yhat_c'*yhat_c;
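For instance, assuming that the calibration data are stored in Xcal (N × K) and ycal (N × 1), the routine above could be used as follows to build an MLR model restricted to the selected variables (the α-values shown are merely illustrative):

var_sel = stepwise_regression(Xcal,ycal,0.05,0.05); % Indices of the selected variables
Xmlr = [ones(size(Xcal,1),1) Xcal(:,var_sel)];      % Regression matrix with intercept term
b = Xmlr\ycal;                                      % Least squares regression coefficients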
The stepwise regression algorithm has two main drawbacks. First, selecting appropriate α-values for the entry and exit tests may not be straightforward. The best α-values will generally depend on the nature of the data set. It is possible to test different α-values and make the final decision according to one of the validation or cross-validation metrics presented in Section 3.05.3.1. However, such a procedure may be considerably demanding in terms of computation time. Second, the partial F-test is based on statistical assumptions (such as Gaussian distribution of the data) that may not be valid for a particular data set.4
3.05.4.3
Polytope (Simplex) Algorithm
The polytope algorithm is a method proposed by Nelder and Mead29 for the minimization of functions F(u1, u2, . . . , um) with m real-valued input arguments. This algorithm is also termed the 'flexible polyhedron' or 'simplex' method, but the latter name will not be used here to avoid confusion with the well-known simplex method for linear programming. It is worth noting that variable selection is a combinatorial optimization task rather than a problem of minimization in a space of real-valued arguments. However, if the selection is constrained to choosing a fixed number of m variables, search methods in R^m, such as the polytope algorithm, can be used. For this purpose, an m-uple u = (u1, u2, . . . , um) ∈ R^m can be employed to encode a subset of m variables by rounding each ui to the nearest integer and using this integer as the index of a variable. For instance, suppose that two variables are to be selected from a pool of seven variables {x1, x2, x3, x4, x5, x6, x7}. In this case, given a pair (u1, u2) ∈ R^2, the indices of the two corresponding variables are obtained by rounding u1 and u2 to the nearest integers. For instance, point A in Figure 6 is associated to variables x6 and x4. If u1 and u2 indicate the same variable, one of them must be rounded to the next nearest integer. According to this rule, point B in Figure 6 would be associated to variables x2 and x3. Alternatively, one may consider that only one variable was selected (x3 in this example). In this manner, a given m-uple u = (u1, u2, . . . , um) ∈ R^m could possibly encode fewer than m variables.
A cost function F(u1, u2, . . . , um) for the variable selection problem can be defined by using one (or a combination) of the criteria presented in Sections 3.05.2 and 3.05.3, which is then applied to the variables encoded in (u1, u2, . . . , um). For instance, F(u1, u2, . . . , um) can be taken as the RSS when the model is built with the selected variables. Alternatively, the correlation coefficient between estimated and known y-values could be employed, but in this case a minus sign would need to be added to the correlation value to define a suitable cost function for the minimization problem.
At each iteration of the search algorithm, m + 1 points u1, u2, . . . , um+1 are considered, together with the corresponding cost values F1 = F(u1), F2 = F(u2), . . . , Fm+1 = F(um+1). The boldface notation indicates that each of the points under consideration is an m-uple (u1, u2, . . . , um), which can be regarded as the vertex of a polytope in R^m. For illustration, a polytope in R^2 with vertices u1, u2, and u3 (i.e., a triangle) is depicted in Figure 7(a). The additional points indicated in this figure will be defined in the description of the operations carried out by the algorithm. The m + 1 vertices of the polytope are assumed to be ordered according to the cost value so that F1 ≤ F2 ≤ . . . ≤ Fm ≤ Fm+1. For example, in Figure 7(a), the best and worst points of the polytope are u1 and u3, respectively.

Figure 6 Example of variable selection based on real-valued encoding.
Figure 7 Illustration of the points involved in the polytope algorithm. (a) Reflection, expansion, and contraction operations. (b) Shrinking operation.
At each iteration, the polytope is updated by replacing the worst vertex um+1 with a new point generated according to the sequence of operations described below.30

1. Reflection
Let uC denote the centroid of the best m vertices u1, u2, . . ., um, that is,

uC = (1/m) Σ_{i=1}^{m} ui    (32)

Point uC corresponds to the center of the face opposite to the worst point um+1, as illustrated in Figure 7(a). A point uREF is then generated by reflecting um+1 about the opposite face as

uREF = uC + α(uC – um+1)    (33)

where α > 0 is the reflection coefficient. This operation is illustrated in Figure 7(a) for α = 1. Let FREF = F(uREF). If F1 ≤ FREF ≤ Fm (i.e., uREF is neither a new best nor worst point), um+1 is replaced with uREF, the remaining operations are skipped and the algorithm proceeds to the next iteration.

2. Expansion
If FREF < F1, uREF is a new best point, which suggests that the line of reflection is a good search direction. In this case, an expansion along this direction is attempted by defining a point uEXP as

uEXP = uC + γ(uREF – uC)    (34)

where γ > 1 is the expansion coefficient. The geometrical interpretation of this operation is presented in Figure 7(a). Let FEXP = F(uEXP). If FEXP < FREF, the expansion was successful and um+1 is replaced with uEXP. Otherwise, the expansion failed and um+1 is replaced with uREF. The algorithm then proceeds to the next iteration.

3. Contraction
If FREF > Fm, the reflection line does not seem to be a promising search direction. Therefore, the polytope is contracted by bringing the new vertex back along the reflection line. The contracted vertex uCON is defined as

uCON = uCON,A = uC + β(uREF – uC)   if FREF < Fm+1
uCON = uCON,B = uC + β(um+1 – uC)   if FREF ≥ Fm+1    (35)

where 0 < β < 1 is the contraction coefficient. Both possible results for the contraction (uCON,A and uCON,B) are depicted in Figure 7(a). Let FCON = F(uCON). If FCON < min{FREF, Fm+1}, the contraction was successful and um+1 is replaced with uCON. Otherwise, a shrinking operation is carried out as detailed below.

4. Shrinking
If the contraction step was not successful, the polytope is shrunk toward the best point by replacing each vertex ui (i = 2, 3, . . ., m + 1) by a point ui,SHR defined as

ui,SHR = u1 + δ(ui – u1)    (36)
where 0 < δ < 1 is the shrinking coefficient. This operation is illustrated in Figure 7(b) for δ = 0.5. The shrinking operation is applied to the vertices in the order from u2 to um+1. It is worth noting that the best point may change during this process, in which case u1 has to be updated in the previous equation.
The polytope algorithm is implemented in the standard Matlab function fminsearch.m. An example of the use of such a function for variable selection in MLR is presented below (routine polytope.m). The cost function needs to be specified in a separate file, which is named cost.m in this example. The cost can be either PRESSV or PRESSCV, depending on the input arguments given by the user.
function var_sel = polytope(Xcal,ycal,Xval,yval,m)
% var_sel = polytope(Xcal,ycal,Xval,yval,m) --> Validation with a separate set
% var_sel = polytope(Xcal,ycal,[],[],m)     --> Cross-validation
% m = Maximum desired number of variables
K = size(Xcal,2); % Total number of variables
aux = linspace(1,K,m+2);
u0 = round(aux(2:m+1)); % Initial point
u_opt = fminsearch('cost',u0,[],Xcal,ycal,Xval,yval);
u_round = round(u_opt);
% Indexes larger than K are replaced with K
over = find(u_round > K); u_round(over) = K;
% Indexes smaller than 1 are replaced with 1
under = find(u_round < 1); u_round(under) = 1;
var_sel = unique(u_round); % Repeated variables are excluded

function F = cost(u,Xcal,ycal,Xval,yval)
% Cost function for the polytope example.
K = size(Xcal,2); % Total number of variables
u_round = round(u);
% Indexes larger than K are replaced with K
over = find(u_round > K); u_round(over) = K;
% Indexes smaller than 1 are replaced with 1
under = find(u_round < 1); u_round(under) = 1;
var_sel = unique(u_round); % Repeated variables are excluded
[yhat,e] = validation(Xcal,ycal,Xval,yval,var_sel);
F = e'*e; % Calculation of the PRESS value
Figure 8 (a) Example of a function with two extrema. (b) Contour plot showing the local (square symbol) and global (asterisk) minima.
The main limitation of the polytope algorithm is the possibility of convergence to a local minimum if the cost function to be minimized has more than one extremum, as illustrated in Figure 8(a). In this example, the function F(u1, u2) has a local minimum at (30, 70) and a global minimum at (70, 30), as can be seen in the contour plot presented in Figure 8(b). If the polytope is initialized in the vicinity of (30, 70), convergence to the local minimum is expected to occur. Global search algorithms have been proposed in an attempt to overcome this problem. Two of these algorithms (generalized simulated annealing and the genetic algorithm (GA)) are presented in the next sections. However, it should be emphasized that there is no guarantee that any algorithm will find the global minimum of the cost function, unless an exhaustive search is carried out.
3.05.4.4
Generalized Simulated Annealing
Global search algorithms attempt to find the global minimum of a given cost function while avoiding capture by local minima. For this purpose, some degree of randomness is imposed on the search procedure to allow the exploration of regions away from the current tendency of cost minimization. One such method is the so-called simulated annealing, which is named after the annealing process involved in the cooling of metals, through which different final crystalline configurations (i.e., different energy states) can be achieved. By analogy, the cost function could be regarded as the potential energy level of a moving particle. If the particle is moving inside a well around a local minimum in the cost surface, it may not have enough energy to leave the well. However, if thermal agitation is present, the particle may eventually jump to a different well, possibly associated to the desired global minimum. The probability of the particle leaving a shallow energy well (poor solution in terms of cost value) is larger than the probability of leaving a deep energy well (good solution in terms of cost value). Therefore, there is a greater chance that the particle will be captured by the global minimum than by a worse local minimum.
A simulated annealing algorithm for function minimization can be described as follows.31 An initial point u0 ∈ R^m is chosen, either randomly or by using a priori information, and the corresponding cost value F0 is calculated. Then, a random search vector Δu ∈ R^m of fixed magnitude is generated, a new tentative point u1 = u0 + Δu is defined, and the corresponding cost F1 is evaluated. Finally, the new point is accepted with probability p given by

p = 1              if ΔF ≤ 0
p = exp(–β ΔF)     if ΔF > 0    (37)

where ΔF = F1 – F0 and β is a positive parameter. These steps are repeated for a specified number of iterations or until a target cost value is attained. According to Equation (37), beneficial steps (reductions in the cost) are always accepted, whereas detrimental steps (increments in the cost) may be accepted with a probability that
decreases with the size of the cost increment. In the work of Bohachevsky et al.,31 values of β such that 0.5 < exp(–β ΔF) < 0.9 are recommended. That paper also suggests that the magnitude of Δu should be such that the vicinity of a local minimum can be escaped in 2–3 steps. Finally, a modification can be used to ensure that the probability of accepting detrimental steps tends to zero as the search approaches the global minimum of the cost. This is accomplished by setting

p = exp( –β ΔF / (F – F̂min) )   if ΔF > 0    (38)
where F is the cost for the current point and F̂min is an estimate of the minimum value for the cost function. The algorithm with this modification is termed generalized simulated annealing31 and was first used for variable selection by Kalivas et al.32 The principles of simulated annealing may also be applied to the polytope algorithm. In this case, 'bad' points (associated to large cost values) generated in one of the iterations could be accepted with a certain probability, instead of being discarded.
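A minimal Matlab sketch of this search scheme is given below, purely as an illustration of Equations (37) and (38). The interface and all settings (the cost function handle, the step magnitude, the parameter β, the estimate F̂min, and the number of iterations) are assumptions to be adjusted by the analyst; for variable selection, the real-valued point u can be decoded into variable indices exactly as in the cost.m routine of the polytope example.

function [u_best,F_best] = gsa(costfun,u0,step,beta,Fmin_hat,max_iter)
% Illustrative sketch of generalized simulated annealing (Equations (37) and (38)).
% costfun  --> handle to the cost function, e.g. @(u) cost(u,Xcal,ycal,Xval,yval)
% u0       --> initial point in R^m
% step     --> fixed magnitude of the random search vector delta_u
% beta     --> positive parameter of the acceptance probability
% Fmin_hat --> estimate of the minimum value of the cost function
u = u0; F = costfun(u);
u_best = u; F_best = F;
for iter = 1:max_iter
    delta_u = randn(size(u));
    delta_u = step*delta_u/norm(delta_u); % Random direction with fixed magnitude
    u_new = u + delta_u;
    F_new = costfun(u_new);
    delta_F = F_new - F;
    if delta_F <= 0
        accept = 1; % Beneficial steps are always accepted
    else
        % Acceptance probability of Equation (38); the max() guards against
        % a non-positive denominator when F approaches Fmin_hat
        p = exp(-beta*delta_F/max(F - Fmin_hat,eps));
        accept = (rand < p);
    end
    if accept
        u = u_new; F = F_new;
        if F < F_best
            u_best = u; F_best = F;
        end
    end
end

With the polytope example above, a call such as gsa(@(u) cost(u,Xcal,ycal,Xval,yval), u0, 1, 5, 0, 500) would perform the search starting from u0; these numerical settings are arbitrary and would normally be tuned along the lines recommended by Bohachevsky et al.31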
3.05.4.5
Genetic Algorithm
The GA is a global search technique inspired by the biological mechanisms of evolution by natural selection.33,34 In this algorithm, each possible solution for an optimization problem is represented by an individual belonging to an evolving population. Using a population of individuals to explore the space of possible solutions (parallel search) is, in many cases, more efficient and less prone to capture by local extrema than guiding a single individual across the search space. The personal traits of each individual (which correspond to the values to be adjusted in the optimization problem) are encoded in a chromosome, which is a string of elements termed 'genes.' The fitness of the individual, regarded as the ability to survive and generate descendants, is measured by an objective function, which should reflect the performance index to be maximized in the optimization problem. At each iteration of the algorithm, a mating pool is formed by taking couples of individuals from the population. Individuals with better fitness are given a larger probability of being selected for the mating pool. The chromosomes of each couple are then combined using genetic operators to generate descendants. Finally, a new population is formed by replacing the previous generation with its offspring. This evolutionary cycle is repeated until a given stopping criterion is satisfied. The entire process is illustrated in Figure 9 for a typical GA formulation. Each of the steps described above will now be explained in more detail with reference to the variable selection problem. Further details can be found elsewhere.35,36

Genetic encoding of solutions
The genes in a chromosome can be either real- or binary-valued, depending on the GA formulation being adopted. Real-valued genes can be used in a variable selection application by resorting to the encoding procedure described in Section 3.05.4.3. However, the usual GA approach for variable selection employs chromosomes with the number of binary genes equal to the total number of available variables. Genes with a 1 value indicate the variables that are to be included in the model. For instance, the two chromosomes depicted in Figure 10 represent two possible selections from an overall set of seven variables. The first chromosome indicates the selection of variables x2, x4, and x5, whereas the second chromosome points to x4 and x7. In the discussion that follows, the use of binary chromosomes will be assumed.

Generation of the initial population
The initial population is usually generated by creating chromosomes with random 1/0 gene values. If the analyst has reason to believe that particular variables would be important for the model, the inclusion of such variables can be enforced by setting the corresponding genes to 1 in some chromosomes. The number P of individuals in the population is set by the analyst. The use of more individuals increases the chance of finding a good solution for the problem at the cost of a larger computational workload.
Figure 9 Typical formulation of a genetic algorithm (initialization, fitness evaluation, natural selection, offspring generation, and population update, repeated until the stopping criterion is satisfied).
Figure 10 Example of chromosomes with binary genes associated to seven variables.
Fitness evaluation
The fitness value assigned to a given individual is calculated on the basis of the variables indicated in its chromosome. In principle, any (or a combination) of the criteria presented in Sections 3.05.2 and 3.05.3 can be adopted as a basis to define the fitness index. Ideally, the fitness value should increase with the quality of the variable subset indicated in the chromosome (better selections should have a larger fitness value). In some cases, the performance metric may have to be modified to suit this requirement. For instance, if the goal is to minimize the resulting RMSEPV value for the model, the fitness index can be defined as the inverse of RMSEPV.

Natural selection
The mating pool is formed by selecting couples from the population. According to the principles of natural selection, this process should favor the individuals with larger fitness values. Therefore, useful features of these individuals (i.e., variable subsets that lead to a good model) are passed on to their descendants. In this manner, after successive generations, the search narrows down to combinations of variables that have been found to be suitable, whereas inappropriate combinations are discarded in the process. Such a concept can be implemented by the 'tournament' method, in which the population is sorted in decreasing order of fitness and a specified number of individuals with the largest fitness values (the 'tournament winners') are selected. However, this method may cause the population to lose genetic diversity, which could result in a premature convergence to a local maximum of the fitness function. Alternatively, individuals with small fitness may be given a competitive chance to enter the mating pool. A procedure that can be used for this purpose is the so-called 'Roulette' method. In this method, the probability of the jth individual with fitness f(j) being selected for the mating pool is given by its relative fitness fr(j) defined as

fr(j) = f(j) / Σ_{i=1}^{P} f(i)    (39)
where P is the population size. For example, consider a population with P = 4 individuals and suppose that the individual fitness values are those indicated in Figure 11. As can be seen, a fraction of a spinning roulette wheel is assigned to each individual, according to its relative fitness. When the wheel stops, the triangle indicates the selected individual. Each pair of chromosomes chosen in this manner is combined to generate two descendants, as described below. This procedure is repeated until there are P descendants to replace the previous generation.

Offspring generation
Each pair of individuals in the mating pool generates two descendants. In this process, the chromosomes of the parents can be combined by crossover with probability pc. If crossover does not occur, the descendants are simply copies of their parents. Figure 12(a) illustrates a one-point crossover operation between the two chromosomes presented in Figure 10. The crossover point is randomly chosen. Variants of this operation include two- or multiple-point crossover procedures, as illustrated in Figure 12(b). In addition to the crossover operation, each descendant can also be subjected to mutation with probability pm. The simplest mutation procedure consists of changing the value of a randomly chosen gene (from 1 to 0 or vice versa), as illustrated in Figure 13. The purpose of mutation is to increase the genetic diversity in the population, by possibly changing features (i.e., inserting or removing a variable) that are common to both parent chromosomes. Mutation at more than one gene can also be implemented to further increase diversity. However, the mutation probability should be small enough not to hamper the convergence of the algorithm. Usually, values smaller than 10% are adopted for pm.
Figure 11 Illustration of the Roulette method.

Individual   Fitness f   Relative fitness fr
1            10          50%
2            5           25%
3            3           15%
4            2           10%
Overall      20          100%
Figure 12 Illustration of (a) one-point and (b) two-point crossover operations.
Figure 13 Mutation example.
Some GA implementations employ a varying mutation probability. In this case, pm is made large at the initial generations to maintain a high level of genetic diversity and thus widen the exploration of the search space. As the generations pass, the value of pm is then decreased to allow the algorithm to converge. Alternatively, an adaptive mutation policy can be used. The idea consists of temporarily increasing pm whenever the search seems to be stuck in a local maximum of the fitness function, as indicated, for instance, by a small difference between the largest and smallest fitness values in the population.

Population update
The GA iteration is finished by replacing the population with its descendants. To prevent good solutions from being lost in this process, the Q best individuals (or only the single best individual) of the previous population can be kept in the new one. This practice is termed 'elitism'. In this case, only P – Q descendants need to be created in the offspring generation process.

Stopping criterion
Several stopping criteria can be adopted to terminate the evolutionary cycle. For instance, the algorithm may stop when
1. a maximum number of iterations (or 'generations') is reached; or
2. the fitness value for the best individual in the population is larger than a given threshold; or
3. the difference between the largest and smallest fitness values in the population is smaller than a given threshold.
A Matlab implementation of a GA for variable selection in MLR is presented below (ga_varsel.m). The fitness function is the inverse of either PRESSV or PRESSCV, depending on the input arguments given by the user. In this implementation, the user is given the choice of specifying lower and upper limits for the number of variables to include in the model. Such limits are considered in the routines for initialization of the chromosomes, crossover, and mutation, which were designed to prevent the creation of invalid chromosomes. Elitism is employed, with the two best individuals (Q = 2) of a generation being automatically transferred to the next generation.
function [var_sel,fitlog,mlog] = ga_varsel(P,MaxGenerations,m_min,m_max,Xcal,ycal,Xval,yval)
% [var_sel,fitlog,mlog] = ga_varsel(P,MaxGenerations,m_min,m_max,Xcal,ycal,Xval,yval)
%   --> Validation with a separate set
% [var_sel,fitlog,mlog] = ga_varsel(P,MaxGenerations,m_min,m_max,Xcal,ycal,[],[])
%   --> Cross-validation
%
% Input arguments:
% P --> Constant number of individuals in the population (must be even)
% MaxGenerations --> Maximum number of generations
% m_min,m_max --> Lower and upper limits for the number of variables to include in the MLR model
%   If m_min = [], the default value m_min = 1 is employed
%   If m_max = [], the default values m_max = min(N-1,K) (validation with a separate set)
%   or min(N-2,K) (cross-validation) are employed
% Xcal,ycal,Xval,yval --> Calibration and validation data
% Xcal,ycal --> Calibration data for use with cross-validation
%
% Output arguments:
% var_sel --> Index set of selected variables
% fitlog --> (3 x MaxGenerations) array containing the maximum, average, and minimum
%   values of fitness for each generation
% mlog --> (1 x MaxGenerations) array containing the number of selected variables
%   associated to the best individual in each generation

if rem(P,2) ~= 0
    error('P must be even !')
end
pc = 0.6;  % Crossover probability
pm = 0.05; % Mutation probability
N = size(Xcal,1); % Number of calibration objects
K = size(Xcal,2); % Chromosome length = Total number of variables
if length(m_min) == 0, m_min = 1; end
if length(m_max) == 0,
    if size(Xval,1) > 0
        m_max = min(N-1,K);
    else
        m_max = min(N-2,K);
    end
end
if m_max > min(N-1,K)
    error('m_max is too large !');
end
% Random generation of the initial population
NEWGEN = zeros(P,K);
for i = 1:P
    % Chooses a number m between m_min and m_max at random
    m = m_min + sum(round(rand(1,m_max-m_min)));
    dummy = randperm(K);
    var_sel = dummy(1:m);
    NEWGEN(i,var_sel) = ones(1,m);
end
% Evolutionary cycle
fitlog = zeros(3,MaxGenerations);
for n = 1:MaxGenerations
    % Fitness evaluation
    if n == 1
        imax = P;
    else
        % After the first generation, elitism has been applied
        imax = P - 2;
        f([P-1,P]) = f(index([P-1,P])); % Fitness values of the two best individuals in the previous generation
    end
    for i = 1:imax
        var_sel = find(NEWGEN(i,:));
        [yhat,e] = validation(Xcal,ycal,Xval,yval,var_sel);
        f(i) = 1/(e'*e); % Fitness = 1/PRESS
    end
    [fitlog(1,n),index_best_individual] = max(f);
    fitlog(2,n) = mean(f);
    fitlog(3,n) = min(f);
    mlog(n) = length(find(NEWGEN(index_best_individual,:)));
    % Roulette selection
    fr = f/sum(f);
    [fsort,index] = sort(fr);
    ORD = NEWGEN(index,:);
    Roulette = [0 cumsum(fsort)];
    winners = zeros(P-2,1);
    for i = 1:(P-2)
        roll = rand;
        winners(i) = min(find(Roulette > roll)) - 1;
    end
    MATE = ORD(winners,:);
    % Generation of new individuals
    % Crossover
    for i = 1:(P/2 - 1) % ith couple
        father = MATE(2*i-1,:);
        mother = MATE(2*i,:);
        son = father;      % Default
        daughter = mother; % Default
        if rand < pc % Crossover takes place
            left_father = cumsum(father);
            right_father = sum(father) - left_father;
            left_mother = cumsum(mother);
            right_mother = sum(mother) - left_mother;
            BP1 = left_father(1:K-1) + right_mother(1:K-1); % Breakpoint 1 result
            BP2 = left_mother(1:K-1) + right_father(1:K-1); % Breakpoint 2 result
            valid1 = find(BP1 >= m_min & BP1 <= m_max); % Valid breakpoints for son
            valid2 = find(BP2 >= m_min & BP2 <= m_max); % Valid breakpoints for daughter
            L1 = length(valid1);
            L2 = length(valid2);
            if L1 > 0
                j1 = valid1(ceil(L1*rand)); % Select one of the valid breakpoints at random
                son = [father(1:j1) mother(j1+1:K)];
            end % Else (no valid breakpoint) -> Default: son = father
            if L2 > 0
                j2 = valid2(ceil(L2*rand)); % Select one of the valid breakpoints at random
                daughter = [mother(1:j2) father(j2+1:K)];
            end % Else (no valid breakpoint) -> Default: daughter = mother
        end
        NEWGEN(2*i-1,:) = son;
        NEWGEN(2*i,:) = daughter;
    end
    % Mutation
    for i = 1:P-2 % ith descendant
        if rand < pm % Mutation takes place
            flip = ceil(K*rand);
            NEWGEN(i,flip) = 1 - NEWGEN(i,flip);
            m = sum(NEWGEN(i,:)); % Number of selected variables
            if (m < m_min)
                % A new wavelength must be included
                j = find(NEWGEN(i,:) == 0);
                include = ceil((K-m)*rand); % A random wavelength is included
                NEWGEN(i,j(include)) = 1;
            elseif (m > m_max)
                % A wavelength must be excluded
                j = find(NEWGEN(i,:) == 1);
                exclude = ceil(m*rand); % A random wavelength is excluded
                NEWGEN(i,j(exclude)) = 0;
            end
        end
    end
    % Elitism (two best individuals)
    NEWGEN([P-1,P],:) = ORD([P-1,P],:);
end
% Presentation of the best solution
var_sel = find(ORD(P,:));
3.05.4.6
Clustering Methods
The term 'clustering' refers to the operation of grouping together elements of a given set that are similar according to some metric. In a variable selection context, clustering algorithms can be used to form groups of variables with similar information content, according to the empirical data available for model building. After the clustering process has been performed, a single representative variable can be extracted from each group. Suppose that the available x data are disposed in a matrix X of dimensions (N × K) such that the kth variable xk is associated to the kth column vector xk ∈ R^N. This vector contains empirical information concerning the statistical distribution of xk. Therefore, similar vectors in the R^N space are expected to be associated to variables with similar information content. In this context, the degree of similarity could be measured, for instance, by
Euclidean distance. If the set x1, x2, . . . , xK is grouped in clusters, the vector closest to the centroid of each cluster may be taken as a representative element of that group. The selected x variables would be those associated to the representative vectors defined in this manner. Todeschini et al.37 exploited this concept for wavelength selection by using a Kohonen artificial neural network as a clustering method based on Euclidean distances. As a result, an improvement in the predictive ability of a PLS model was reported with respect to the use of full-spectrum data. The main disadvantage of clustering methods lies in the fact that y information is not exploited. However, such a limitation may not be severe if clustering is only used as a preliminary variable reduction procedure prior to the use of other selection methods.
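A minimal Matlab sketch of this strategy is given below, purely for illustration: the reference above employed a Kohonen network, whereas the sketch uses the kmeans function of the Statistics Toolbox instead, the function name cluster_selection is not part of any toolbox, and the number of clusters c (i.e., the number of variables to be retained) is a choice of the analyst.

function var_sel = cluster_selection(X,c)
% Illustrative sketch: groups the K column vectors of X into c clusters and
% returns, for each cluster, the index of the variable closest to the centroid.
% Requires the kmeans function (Statistics Toolbox).
[idx,centroids] = kmeans(X',c); % Each row of X' is one variable (a vector in R^N)
var_sel = zeros(1,c);
for j = 1:c
    members = find(idx == j); % Variables assigned to the jth cluster
    d = sum((X(:,members) - repmat(centroids(j,:)',1,length(members))).^2,1);
    [dummy,imin] = min(d);
    var_sel(j) = members(imin); % Representative variable of the jth cluster
end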
3.05.4.7
Successive Projections Algorithm
As discussed in Section 3.05.2.3, multicollinearity among predictor variables is undesirable in least squares regression because it may lead to numerical problems and increased noise propagation. The successive projections algorithm (SPA) is a technique specifically designed to select subsets of variables with small collinearity and appropriate prediction power for use in MLR models. This algorithm was originally proposed for wavelength selection in spectroscopic data sets, especially under conditions of strong spectral overlapping.38 MLR models obtained by using SPA have been shown to be superior, in terms of prediction ability, to full-spectrum PLS models in a variety of applications, including UV–VIS,38 ICP-OES (inductively coupled plasma optical emission spectroscopy),39 and NIR40 spectrometry.
The SPA comprises three phases.41 Initially, the algorithm builds candidate subsets of variables on the basis of a collinearity minimization criterion. Such subsets are constructed according to a sequence of vector projection operations applied to the columns of the (N × K) matrix X of available predictor data. In this phase, the dependent variable y is not used, as multicollinearity is concerned with only the x variables. In the second phase, the best candidate subset is chosen according to a criterion that evaluates the prediction ability of the resulting MLR model, such as PRESSV, if a separate validation set is available, or PRESSCV, if cross-validation is used.42 Finally, in a third optional phase, the selected subset is subjected to an elimination procedure to determine whether any variables can be removed without significant loss of prediction ability. Each of these phases is explained in detail below.
Phase 1 – Construction of candidate subsets of variables
Suppose that the available x data are disposed in a matrix X of dimensions (N × K) such that the kth variable xk is associated to the kth column vector xk ∈ R^N. Let M = min(N – 1, K) be the maximum number of variables that can be included in an MLR model with the intercept term. Starting from each variable xk, k = 1, . . . , K, a chain containing M variables is constructed according to the following sequence of projection operations:
Step 1 (Initialization): Let
z^1 = xk (vector that defines the initial projection operations)
x_j^1 = xj, j = 1, . . . , K
SEL(1, k) = k
i = 1 (iteration counter)
Step 2: Calculate the matrix P^i of projection onto the subspace orthogonal to z^i as

P^i = I – z^i (z^i)^T / [(z^i)^T z^i]    (40)

where I is an identity matrix of appropriate dimensions.
Step 3: Calculate the projected vectors x_j^{i+1} as

x_j^{i+1} = P^i x_j^i    (41)

for all j = 1, . . . , K.
Step 4: Determine the index j* of the largest projected vector and store this index in matrix SEL:

j* = arg max_{j=1,...,K} ||x_j^{i+1}||    (42)

SEL(i + 1, k) = j*    (43)

Step 5: Let z^{i+1} = x_{j*}^{i+1} (vector that defines the projection operations for the next iteration).
Step 6: Let i = i + 1. If i < M, return to Step 2.
At each iteration of the procedure presented above, the selected vector is the one with the smallest degree of collinearity with respect to the vectors selected in the previous iterations. Steps 1–5 are illustrated in Figure 14 for a simple case in which N = 3, K = 5. In this example, the sequence of projections starting from x3 is depicted. It is worth noting that each coordinate axis corresponds to the measured value for a given sample. To simplify the drawing, x3 is supposed to be aligned with one of the coordinate axes, but this is not a requirement.
Figure 14 Illustration of the sequence of projections carried out in SPA. (a) First iteration. (b) Second iteration. In this example, the chain of variables starting from x3 would be {x3, x1, x5}.
After this procedure has been repeated for k = 1, . . ., K, the candidate subsets of variables will be stored in the index matrix SEL of dimensions (M × K). According to the notation defined above, the indexes of the variables in the chain starting from xk will be {SEL(1, k), SEL(2, k), . . ., SEL(M, k)}, where SEL(1, k) = k. Some readers familiar with the well-known Gram–Schmidt orthogonalization procedure will note a resemblance to the projection operations carried out in this first phase of SPA. However, the Gram–Schmidt algorithm manipulates the data to generate a new set of orthogonal vectors, which, in the general case, do not have physical meaning. In contrast, SPA only uses the projection operations to determine suitable candidate subsets of variables for MLR modeling. A Matlab routine (projections.m) that implements the projection operations starting from a given column index k is presented below.
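A minimal sketch of such a routine, following Steps 1–6 above, is shown below; the interface projections(X,k,M), returning the column indices of the chain started at the kth variable, is an assumption chosen to mirror the qr-based variant used later in spa.m.

function chain = projections(X,k,M)
% Illustrative sketch of the projection operations of Phase 1 (Steps 1-6).
% Returns the indices of the chain of M variables starting from the kth column of X.
chain = zeros(M,1);
chain(1) = k;
Xproj = X;      % Projected columns (initially the original columns)
z = Xproj(:,k); % Vector that defines the current projection (Step 1)
for i = 1:(M-1)
    % Steps 2/3: project every column onto the subspace orthogonal to z
    Xproj = Xproj - z*(z'*Xproj)/(z'*z);
    % Step 4: select the column with the largest projected norm
    % (columns already selected have numerically zero norm and are not chosen again)
    norms = sqrt(sum(Xproj.^2,1));
    [dummy,jstar] = max(norms);
    chain(i+1) = jstar;
    % Step 5: the selected projected vector defines the next projection
    z = Xproj(:,jstar);
end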
An alternative implementation employs the Matlab built-in function qr, which generates the chain of variables in a computationally efficient manner.42 By default, the qr function adopts the column with the largest norm as the starting vector. However, a scaling procedure can be used to force the algorithm to start from a given kth column, as implemented in the routine projections_qr.m presented below. The computational advantage of this routine over the previous one can be demonstrated by a simple example involving a matrix X of dimensions (100 × 1000) with random values. By using projections.m, the construction of a chain with M = 99 variables starting from x1 takes approximately 1.50 s on a Centrino Duo computer with 2 GB RAM. In contrast, projections_qr.m takes only 0.05 s.
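A sketch of this qr-based variant could be written as follows; the interface matches the call projections_qr(Xcaln,k,m_max) used in spa.m below, and the particular scaling factor is merely one possible choice that makes the kth column the first pivot.

function chain = projections_qr(X,k,M)
% Illustrative sketch: generates the chain of M variables starting from the kth
% column of X by using qr with column pivoting. The kth column is scaled up so
% that it becomes the first pivot; the subsequent pivoting order is not
% affected by this scaling.
Xscaled = X;
norms = sum(X.^2,1); % Squared norms of the columns
Xscaled(:,k) = X(:,k)*2*sqrt(max(norms)/norms(k));
[Q,R,order] = qr(Xscaled,0); % Economy-size QR with column pivoting
chain = order(1:M)';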
Phase 2 – Evaluation of the candidate subsets
The candidate subsets defined in Phase 1 are evaluated according to a suitable criterion that takes into account their relation with the y variable, which was not employed in the projection operations. For this purpose, metrics associated to the prediction ability of the resulting MLR model can be used. For instance, if a separate validation set is available, PRESSV can be calculated for each candidate subset as follows:
For k = 1 to K do
  For m = 1 to M do
    Use the variables with indexes SEL(1, k), SEL(2, k), . . . , SEL(m, k) to build an MLR model.
    Apply the resulting model to the validation set and calculate the prediction error sum of squares.
    Store the result in PRESSV(m, k)
  Next m
Next k
At the end, the results will be stored in a matrix PRESSV of dimensions (M × K) such that PRESSV(m, k) is associated to the chain of m variables starting from xk. The best chain of variables is the one corresponding to the minimum PRESSV value obtained in this way. Other performance metrics (such as SDV, rV, and PRESSCV) could also be used in a similar manner.42
If there is reason to establish a lower limit mmin on the number of variables for the model, computational time can be saved by restricting the calculation of PRESSV(m, k) to m ranging from mmin to M. This would be the case, for instance, in the spectrometric determination of A analytes with overlapping spectral features, as the number of wavelengths will have to be at least equal to A. Moreover, an upper limit mmax on the number of variables may also be established by the analyst to save computational time.
To further clarify this phase of SPA, suppose that a model for the determination of A = 2 analytes is to be constructed by using absorbance measurements at K = 5 wavelengths, such as 200, 300, 400, 500, and 600 nm. In this case, x1, x2, x3, x4, and x5 correspond to the absorbance values at each of the five wavelengths under consideration. Moreover, assume that N = 5 calibration mixtures are available. It follows that mmin = 2 and M = min(N – 1, K) = 4. For illustration, suppose that the chains of variables generated in Phase 1 are those presented in Table 2. Consider that a separate validation set is available so that PRESSV values can be calculated for each analyte.
Table 2 Hypothetical example of results obtained in Phases 1 and 2 of SPA

Chains of variables
x1            x2            x3            x4            x5
x1, x3        x2, x3        x3, x1        x4, x1        x5, x2
x1, x3, x4    x2, x3, x1    x3, x1, x4    x4, x1, x2    x5, x2, x1

PRESSV for analyte 1
1.0           2.1           1.5           1.7           2.5
0.9           1.8           0.9           0.7*          2.3
1.1           1.7           1.1           1.3           1.9

PRESSV for analyte 2
1.5           2.4           1.7           1.5           2.9
0.8**         2.1           0.8           0.9           2.2
1.3           2.2           1.3           1.4           1.7

The best outcome for analytes 1 and 2 is indicated by * and **, respectively. PRESSV, prediction error sum of squares in the validation set; SPA, successive projections algorithm.
According to the hypothetical results in Table 2, the best variables for the determination of analytes 1 and 2 would be x1 and x4, and x1 and x3, respectively, which correspond to absorbance measurements at 200 and 500 nm, and 200 and 400 nm, respectively. It is worth noting that the maximum number (M × K) of variable subsets evaluated by SPA is usually considerably smaller than the number S of possible combinations that would need to be tested by exhaustive search. For instance, if N = 8 samples and K = 60 variables, the maximum number of variables that can be included in an MLR model (with the intercept term) is M = 7. In this case, M × K = 420 subsets of variables would be tested in SPA. If an exhaustive search was carried out, S would be given by
S = C(60, 1) + C(60, 2) + . . . + C(60, 7) ≈ 4.4 × 10^8    (44)
that is, more than 400 million combinations. Moreover, a simulation study presented elsewhere39 revealed that SPA may have advantages over exhaustive search in addition to saving computational time. In that study, in which a separate validation set was used to guide the variable selection procedure, SPA led to an RMSEPV value larger than that obtained by the exhaustive search. However, when the resulting models were tested on a third independent data set, the SPA solution actually yielded smaller prediction errors. Such a better generalization ability of SPA-generated models may be ascribed to the fact that SPA takes into account not only the minimization of prediction errors, but also the explicit elimination of collinearity between the variables. This last requirement introduces constraints in the search process that make the selection outcome less dependent on the particularities of the validation set.

Phase 3 – Final elimination of variables
The final phase of SPA consists of a variable elimination procedure aimed at reducing the number of selected variables without significantly compromising the prediction ability of the resulting MLR model.41 For this purpose, let {v1, v2, . . . , vL} be the subset of L variables that were selected in Phase 2 described above and let
b̂1, b̂2, . . ., b̂L be the corresponding regression coefficients obtained by building an MLR model. A relevance index I(j) is then defined for each variable vj as

I(j) = s_vj |b̂j|,   j = 1, 2, . . . , L    (45)
where s_vj is the empirical standard deviation of variable vj calculated from the available data. The variables are sorted in descending order of the relevance index and a scree plot of PRESS versus number of variables included in the regression is generated. Let PRESSmin be the minimum value of PRESS thus obtained. The
algorithm adopts the minimum number of variables in the scree plot for which PRESS is not significantly larger than PRESSmin according to an F-test. Such a criterion is similar to the method of Haaland and Thomas,43 which was developed to choose an appropriate number of latent variables for PLS. A significance level α = 0.25 for the F-test can be adopted as suggested elsewhere.44 A Matlab implementation of SPA is presented below. The search criterion for Phase 2 is either PRESSV or PRESSCV, depending on the input arguments given by the user. As in the GA implementation, the user is given the choice of specifying lower and upper limits for the number of variables to include in the model. The columns of the X matrix are mean-centered prior to the projection operations. The user is given the choice of normalizing the x variables with respect to their empirical standard deviation (autoscaling) or not.
function [var_sel,var_sel_phase2] = spa(Xcal,ycal,Xval,yval,m_min,m_max,autoscaling)
% [var_sel,var_sel_phase2] = spa(Xcal,ycal,Xval,yval,m_min,m_max,autoscaling)
%   --> Validation with a separate set
% [var_sel,var_sel_phase2] = spa(Xcal,ycal,[],[],m_min,m_max,autoscaling)
%   --> Cross-validation
% If m_min = [], the default value m_min = 1 is employed
% If m_max = [], the default values m_max = min(N-1,K) (validation with a separate set)
%   or min(N-2,K) (cross-validation) are employed
% autoscaling --> 1 (yes) or 0 (no)

if ((autoscaling ~= 1) & (autoscaling ~= 0))
    error('Please choose whether or not to use autoscaling.')
end
N = size(Xcal,1); % Number of calibration objects
K = size(Xcal,2); % Total number of variables
if length(m_min) == 0, m_min = 1; end
if length(m_max) == 0,
    if size(Xval,1) > 0
        m_max = min(N-1,K);
    else
        m_max = min(N-2,K);
    end
end
if m_max > min(N-1,K)
    error('m_max is too large !');
end
% Phase 1: Projection operations for the selection of candidate subsets
% The projections are applied to the columns of Xcal after
% mean-centering and (optional) autoscaling
if autoscaling == 1
    normalization_factor = std(Xcal);
else
    normalization_factor = ones(1,K);
end
for k = 1:K
    x = Xcal(:,k);
    Xcaln(:,k) = (x - mean(x)) / normalization_factor(k);
end
SEL = zeros(m_max,K);
for k = 1:K
    SEL(:,k) = projections_qr(Xcaln,k,m_max);
end
disp('Phase 1 completed !')
% Phase 2: Evaluation of the candidate subsets according to the PRESS criterion
PRESS = Inf*ones(m_max,K);
for m = m_min:m_max
    for k = 1:K
        var_sel = SEL(1:m,k);
        [yhat,e] = validation(Xcal,ycal,Xval,yval,var_sel);
        PRESS(m,k) = e'*e;
    end
end
[PRESSmin,m_sel] = min(PRESS);
[dummy,k_sel] = min(PRESSmin);
var_sel_phase2 = SEL(1:m_sel(k_sel),k_sel);
disp('Phase 2 completed !')
% Phase 3: Final elimination of variables
% Step 3.1: Calculation of the relevance index
Xcal2 = [ones(N,1) Xcal(:,var_sel_phase2)];
b = Xcal2\ycal; % MLR with intercept term
std_deviation = std(Xcal2);
relev = abs(b.*std_deviation');
relev = relev(2:end); % The intercept term is always included
% Sorts the selected variables in decreasing order of "relevance"
[dummy,index_increasing_relev] = sort(relev); % Increasing order
index_decreasing_relev = index_increasing_relev(end:-1:1); % Decreasing order
% Step 3.2: Calculation of PRESS values
for i = 1:length(var_sel_phase2)
    [yhat,e] = validation(Xcal,ycal,Xval,yval,var_sel_phase2(index_decreasing_relev(1:i)));
    PRESS_scree(i) = e'*e;
end
figure, grid, hold on
plot(PRESS_scree)
title('Scree plot'), xlabel('Number of variables included in the model'), ylabel('PRESS')
% Step 3.3: F-test criterion
PRESS_scree_min = min(PRESS_scree);
alpha = 0.25;
dof = length(e); % Number of degrees of freedom
fcrit = finv(1-alpha,dof,dof); % Critical F-value
PRESS_crit = PRESS_scree_min*fcrit;
% Finds the minimum number of variables for which PRESS_scree
% is not significantly larger than PRESS_scree_min
i_crit = min(find(PRESS_scree < PRESS_crit));
var_sel = var_sel_phase2(index_decreasing_relev(1:i_crit));
% Indicates the selected point on the scree plot
h = gca;
YLim = get(h,'YLim');
line([i_crit i_crit],YLim)
Regarding the shortcomings of SPA, three aspects can be pointed out. First, there is no guarantee that the best subset of predictors will be obtained, because the search in Phase 2 is limited to the chains of variables constructed in Phase 1. This is a common limitation of variable selection techniques that do not resort to exhaustive search. Second, the emphasis on multicollinearity avoidance may cause the selection of noisy variables, because random noise is uncorrelated (save for chance correlation) with informative features. The elimination procedure used in Phase 3 is aimed at removing such artifacts, but some uninformative variables may occasionally remain. Third, even though Phase 1 can be carried out in a computationally efficient manner, the entire process may be time demanding because of the regression operations involved in Phase 2. The overall computation time is usually similar to that demanded by stepwise regression. A graphical user interface for SPA is available from the authors upon request.
3.05.4.8
Interval Partial Least Squares
If the independent variables under consideration are naturally arranged in a sequential order, such as wavelengths in a spectrum or time intervals in a chromatogram, an interval (or 'window')-based procedure may be used as an alternative to the selection of individual variables. Such a procedure is usually not advisable for MLR, as variables associated to adjacent points in a signal are likely to be highly correlated, which leads to multicollinearity problems. Therefore, interval selection is best suited for use with latent variable methods such as PLS. This concept was exploited in the interval partial least squares (iPLS) procedure proposed by Nørgaard et al.24 for use with spectral data.
The iPLS procedure comprises two steps. First, the spectrum is divided into intervals of equal width and local PLS models are built for each interval. Second, the center position and width of the interval that yielded the best PLS model (in terms of RMSEPCV, for instance) are adjusted to optimize the results. For this purpose, the interval is initially shifted to the left and to the right up to a pre-established maximum number of points. Finally, after the interval has been translated to the best position, the width is optimized by testing both symmetric (two-sided) and asymmetric (one-sided) shifts in the interval limits. It is worth noting that this procedure does not test all possible intervals in an exhaustive manner and therefore the result may be a local minimum of the adopted cost function.
Alternatively, global search procedures such as generalized simulated annealing or the GA could be used to optimize the center point and the width of the interval. In this case, the search could also be extended to consider the simultaneous use of several intervals in the model. It is also possible to select individual variables within the interval obtained by iPLS. In this case, the overall procedure would comprise two steps: (1) choice of the interval followed by (2) selection of individual variables.45
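A minimal Matlab sketch of the first step (equal-width intervals, one local model per interval) is given below. The function rmsecv_pls, assumed to return the cross-validated RMSE of a PLS model built on the supplied columns, is a placeholder for whatever PLS and cross-validation routines the analyst prefers.

function [best_interval,rmse] = ipls_step1(X,y,n_intervals,n_lv)
% Illustrative sketch of the first iPLS step: the K variables are divided into
% n_intervals contiguous intervals of (approximately) equal width and a local
% PLS model with n_lv latent variables is evaluated for each interval.
% rmsecv_pls(X,y,n_lv) is assumed to return the cross-validated RMSE.
K = size(X,2);
edges = round(linspace(0,K,n_intervals+1)); % Interval limits
rmse = zeros(1,n_intervals);
for i = 1:n_intervals
    columns = (edges(i)+1):edges(i+1); % Variables in the ith interval
    rmse(i) = rmsecv_pls(X(:,columns),y,n_lv);
end
[dummy,ibest] = min(rmse);
best_interval = (edges(ibest)+1):edges(ibest+1);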
3.05.4.9
Multiobjective Optimization
Optimization techniques such as exhaustive search, the polytope algorithm, generalized simulated annealing, and GAs are aimed at minimizing or maximizing a given criterion. In the context of variable selection, some possible criteria are described in Sections 3.05.2 and 3.05.3. However, each criterion has its own limitations and disadvantages. For example, multicollinearity metrics do not take into account the relationship between the x and y variables.
This relationship can be taken into account by minimizing the RSS or RMS values associated to the calibration residuals, but such an approach may lead to overfitting problems. A better approach may be the use of a separate validation set to evaluate the prediction ability of the model in terms of PRESSV or rV. However, there is no guarantee that a good performance in a particular validation set will translate into the suitable prediction of new, previously unseen samples. Finally, the 2-norm of the vector of regression coefficients could be used to evaluate the sensitivity of the model predictions with respect to measurement noise. However, such a metric does not reflect the lack of fit for the model. In fact, if all regression coefficients are made equal to zero, the 2-norm is minimized, but evidently the predictions will be meaningless. Therefore, it may be advantageous to consider two or more criteria simultaneously, instead of a single performance metric. For this purpose, multiobjective optimization techniques may be used.
In a multiobjective optimization problem,46–49 multiple objective functions F1(u), F2(u), . . ., Fn(u) must be simultaneously maximized or minimized (or a combination of both) with respect to vector u. In the variable selection context, u may be a real-valued m-uple encoding the indices of m variables, as discussed in Section 3.05.4.3. Alternatively, u could be a binary string of K bits, as in the case of GAs. One approach to deal with such a problem consists of defining a single objective function F(u) as a weighted sum of F1(u), F2(u), . . ., Fn(u), that is,

F(u) = w1 F1(u) + w2 F2(u) + . . . + wn Fn(u)    (46)

and then applying an optimization technique to minimize F(u) to obtain a compromise solution for u. However, the selection of appropriate weights w1, w2, . . ., wn may not be a trivial task. In general, changes in the weights will lead to different results and the association between the weight values and the optimization outcome may not be clear to the analyst.
An alternative to the use of a weighted sum of objectives is the target vector approach.47 In this case, the optimization is carried out to minimize the distance between the vector of objectives f(u) = [F1(u) F2(u) . . . Fn(u)]^T and a given target vector t = [t1 t2 . . . tn]^T ∈ R^n. A common choice for the distance measure is the Euclidean distance defined as

d(f(u), t) = sqrt( Σ_{j=1}^{n} (Fj(u) – tj)^2 )    (47)
If the jth objective function Fj(u) is to be minimized/maximized, the corresponding target tj should be set to a small/large value accordingly. It is worth noting that the resulting solution will depend on the choice of the target values.
A better insight into the relation between the multiple objectives of the optimization problem can be obtained by using the concepts of domination and Pareto optimality.48,49 Given two points (or 'solutions') uA and uB, uA is said to dominate uB if the values of all objective criteria are better for uA in comparison with uB. In this case, uB is called a dominated solution. If the problem consists of minimizing the objective functions, this property can be stated as Fj(uA) < Fj(uB), j = 1, 2, . . ., n. If at least one of the objective criteria for uA is equal or worse in comparison with uB, then uA does not dominate uB. If uA does not dominate uB and vice versa, then both solutions are said to be nondominated with respect to each other. The set of all nondominated (also called 'efficient' or 'noninferior') solutions of a multiobjective optimization problem is called the Pareto optimal set. In the general case, where some of the objectives are conflicting, the Pareto optimal set will be formed by several different compromise solutions. This will be the case, for instance, of a variable selection problem in which RSS and PRESSV are to be minimized for the calibration and validation sets, respectively. In fact, reductions in RSS may cause overfitting problems, with a resulting increase in PRESSV.
For illustration, consider a hypothetical problem with n = 2 conflicting objectives to be minimized. Figure 15 presents a graph with points in the plane F1(u) × F2(u) for different values of u. In this case, point A dominates point B. However, all points that lie on the curved line are nondominated with respect to one another. This line would be the Pareto optimal set for the problem.
Figure 15 Hypothetical optimization example with n = 2 criteria.
Different nondominated solutions may be obtained by repeatedly minimizing the weighted combination of objectives in Equation (46) or the distance measure in Equation (47) for different sets of weights or target values, respectively. An alternative to this approach consists of using evolutionary algorithms.48,49 In the work of Dietz et al.,48 for instance, a multiobjective GA was developed by applying the Roulette method to each objective separately in the process of selecting the fittest individuals. According to this process, the mating pool will comprise good individuals for each individual objective. Compromise solutions may then be achieved by using crossover operations to combine the individuals. The nondominated solutions obtained at the end of several iterations are then used to form an estimate of the Pareto optimal set.
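As an illustration of the domination test described above, the sketch below identifies the nondominated solutions among a set of candidates, assuming that all objectives are to be minimized and that the objective values are stored row-wise in a matrix F (one row per candidate solution, one column per objective); the function name pareto_filter is merely illustrative.

function nondominated = pareto_filter(F)
% Illustrative sketch: returns the indices of the nondominated rows of F.
% F(i,j) is the value of the jth objective (to be minimized) for the ith solution.
n_solutions = size(F,1);
nondominated = [];
for i = 1:n_solutions
    dominated = 0;
    for j = 1:n_solutions
        % Solution j dominates solution i if it is better in all objectives
        if all(F(j,:) < F(i,:))
            dominated = 1;
            break
        end
    end
    if ~dominated
        nondominated = [nondominated i];
    end
end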
Regularization Methods
Regularization methods may be used to improve the conditioning of a regression problem by forcing the estimate of the regression vector to be well-behaved in some sense.50 For example, the regression coefficients may be forced to be small by minimizing the norm of vector b in addition to the magnitude of the calibration residuals or of the errors in a validation set. At the end, the variables associated to the smallest (or least significant) coefficients may be removed, as discussed in Section 3.05.3.4. For this purpose, multiobjective optimization techniques could be used for the joint minimization of ||b||2 and RSS, RMSEPV, or similar metrics. Such an approach is discussed, for example, by Kalivas,51 in the light of Pareto optimal sets. Another possibility is the use of Tikhonov regularization,50,52 which involves the minimization of an expression of the form

\| Xb - y \|_a^a + \lambda \| Lb \|_b^b     (48)
where ||·||q denotes the q-norm of a vector, λ is a positive weight parameter, and L represents a regularization operator that forces the estimate of b to be well-behaved in some sense. Usual choices for ||·||q are the 1- and 2-norms, which are defined as

\| b \|_1 = \sum_{k=1}^{K} | b_k |     (49)

\| b \|_2 = \sqrt{ \sum_{k=1}^{K} b_k^2 }     (50)
for a given vector b ∈ R^K. Options for L include the identity matrix I, derivative operators (to enforce smoothness of the estimated vector of coefficients), instrumental differences (in calibration transfer problems), among others.52 If a = b = 2 and L = I, the minimization of Equation (48) is equivalent to the well-known ridge regression method.2,52 As discussed in Section 3.05.3.5, the 2-norm of the regression vector is associated to the variance of
the model predictions. Therefore, Tikhonov regularization in this case provides a trade-off between a good fit of the calibration set and a reduced variance in the prediction of new samples. Choosing a = 2, b = 1, and L = I in Equation (48) leads to the so-called LASSO (least absolute shrinkage and selection operator) method.52,53 Using ||b||1 instead of ||b||2 may be useful for the selection of variables according to the magnitude of the resulting regression coefficients. The reason can be illustrated by the following example. Consider the hypothetical regression coefficients depicted in Figure 16. In Figure 16(b), a small number of coefficients are markedly larger than the others, which facilitates the process of variable selection. This is not the case in Figure 16(a), where there is no clear distinction between small and large coefficients. Such a difference between Figures 16(a) and 16(b) is not reflected in the 2-norm, which equals 9.6 in both cases. However, the 1-norm has a value of 43 in case (b), in contrast to 78 in case (a). In general, as illustrated by this particular example, the minimization of the 1-norm tends to favor solutions with a larger proportion of small regression coefficients, as compared to the use of the 2-norm.52

Figure 16 Hypothetical vectors of regression coefficients with the same 2-norm value (9.6) and different 1-norm values: (a) 78 and (b) 43.
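For the particular case a = b = 2 and L = I, the minimizer of Equation (48) has the familiar ridge regression closed form, illustrated by the Matlab sketch below on simulated data. The matrix X, the vector y, and the value of the weight parameter are arbitrary choices made only for this example, and the sketch is not the software used elsewhere in this chapter.

% Ridge-type Tikhonov regularization (a = b = 2, L = I) on simulated data.
% Illustrative sketch only; X, y, and lambda are synthetic/arbitrary choices.
N = 50;  K = 20;                       % observations and predictor variables
X = randn(N, K);
bTrue = zeros(K, 1);  bTrue(3) = 2;  bTrue(11) = -1.5;   % only two informative variables
y = X * bTrue + 0.1 * randn(N, 1);

lambda = 1.0;                          % regularization weight in Equation (48)
bLS    = X \ y;                        % ordinary least-squares estimate
bRidge = (X' * X + lambda * eye(K)) \ (X' * y);   % minimizer of ||Xb-y||^2 + lambda*||b||^2

% Comparing the norms of the two estimates (cf. the discussion of Figure 16)
fprintf('2-norm: LS = %.2f, ridge = %.2f\n', norm(bLS, 2), norm(bRidge, 2));
fprintf('1-norm: LS = %.2f, ridge = %.2f\n', norm(bLS, 1), norm(bRidge, 1));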
3.05.5 Case Studies

3.05.5.1 Diesel Analysis by NIR Spectrometry
This case study concerns the determination of the T10% temperature in diesel samples by NIR spectrometry (see data files available at the book website). This parameter corresponds to the temperature at which 10% of the sample has evaporated and is useful to characterize the most volatile fractions of the fuel. The fuel should have an adequate volatile content to facilitate engine starting. However, if the volatile fraction is too large, bubbles may form in the fuel line, hampering engine performance. The data set used in this study consisted of 170 diesel samples, which were collected from gas stations in the city of Recife (Pernambuco State, Brazil) and stored in amber glass flasks. The reference values for T10% were obtained according to the ASTM D86 method, by using a Herzog HDA 628 automatic distiller. The spectra were acquired with a Perkin Elmer GX Fourier transform (FT)-NIR/MIR (mid-infrared) spectrometer, using a spectral resolution of 2 cm–1, 16 scans, and an optical path length of 1.0 cm. On the basis of a priori considerations, the NIR region in the range of 1000–1600 nm was employed, as discussed in Section 3.05.2.1. The resulting absorbance spectra are presented in Figure 17(a). To circumvent the problem of systematic variations in the baseline, derivative spectra were calculated with a Savitzky–Golay filter using a second-order polynomial and an 11-point window. Each of the resulting derivative spectra (Figure 17(b)) had 1191 variables.

Figure 17 (a) Absorbance spectra of the 170 diesel samples used in this study. (b) Corresponding Savitzky–Golay derivative spectra.
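As an illustration of this preprocessing step, the following Matlab sketch applies a Savitzky–Golay first-derivative filter with a second-order polynomial and an 11-point window by fitting a local polynomial in each window. The spectra used here are random placeholders, and the sketch is not the commercial software employed in this case study.

% Savitzky-Golay first-derivative filtering with a second-order polynomial
% and an 11-point window (illustrative sketch; 'spectra' is a hypothetical
% matrix with one spectrum per row and 'wl' a uniformly spaced wavelength axis).
spectra = rand(5, 200);            % placeholder data: 5 spectra, 200 wavelengths
wl = linspace(1000, 1600, 200);    % wavelength axis (nm)

win  = 11;  half = (win - 1) / 2;  % window size and half-width
V = zeros(win, 3);                 % design matrix for a 2nd-order polynomial fit
for j = -half:half
    V(j + half + 1, :) = [1 j j^2];
end
C = pinv(V);                       % least-squares polynomial fit coefficients
d1 = C(2, :);                      % row giving the 1st-derivative (slope) coefficient

step = wl(2) - wl(1);              % wavelength increment (assumed uniform)
nWl  = size(spectra, 2);
deriv = zeros(size(spectra, 1), nWl - win + 1);   % edge points of each spectrum are lost
for i = 1:size(spectra, 1)
    for k = 1:(nWl - win + 1)
        window = spectra(i, k:k + win - 1);
        deriv(i, k) = (d1 * window') / step;      % dA/dlambda at the window centre
    end
end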
The overall pool of samples was divided into calibration, validation, and prediction sets with 80, 40, and 50 objects, respectively. The validation set was used to determine an appropriate number of latent variables in PLS regression, as well as to evaluate performance metrics in the variable selection procedures. The prediction set was used for the final comparison of the PLS and MLR models under consideration. The three sample sets were defined by applying the Kennard–Stone (KS) uniform sampling algorithm20 to the y (T10%) vector of 170 elements. The two T10% values selected first in this manner correspond to the lowest and highest extremes of the interval; the two corresponding samples were therefore included in the calibration set to avoid extrapolation problems. The samples corresponding to the next 50 y-values in the KS selection were taken as the prediction set. Because the KS algorithm spreads the selected y-values uniformly across the T10% range, the resulting prediction set is representative and ensures that the predictive ability of the model is evaluated in a fair manner along the entire calibration range. The next 40 selected samples were taken as the validation set. Finally, the remaining 78 samples, in addition to the two samples with extreme T10% values, formed the calibration set. Savitzky–Golay differentiation and PLS modeling were performed with The Unscrambler 9.6 software (CAMO). The number of latent variables in the PLS model was determined by testing on the validation set, according to the default settings of the software package. Variable selection for MLR was carried out by using the Matlab routines for stepwise regression, GA, and SPA presented in the previous sections.
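The sample partitioning described above can be reproduced in outline by a generic Kennard–Stone selection applied to a one-dimensional y vector, as in the Matlab sketch below. The y values are simulated and this is not the specific routine used in the case study.

% Kennard-Stone selection applied to a one-dimensional y vector
% (illustrative sketch; y is simulated and nSel is arbitrary).
y = 180 + 60 * rand(170, 1);       % hypothetical T10% values (degrees C)
nSel = 52;                         % number of samples to select

[~, iMin] = min(y);  [~, iMax] = max(y);
selected = [iMin; iMax];           % start from the two extreme values
remaining = setdiff((1:length(y))', selected);

while length(selected) < nSel
    % distance from each remaining sample to its closest selected sample
    dMin = zeros(length(remaining), 1);
    for k = 1:length(remaining)
        dMin(k) = min(abs(y(remaining(k)) - y(selected)));
    end
    [~, kBest] = max(dMin);                    % most 'isolated' remaining sample
    selected  = [selected; remaining(kBest)];
    remaining(kBest) = [];
end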
3.05.5.1.1 Stepwise regression

The stepwise regression procedure was applied to the calibration data set. The same α-value for the F-test was used in both the entry and exit phases. Five different α-values were tested, as shown in Table 3. In each case, the RMSEPV value obtained by applying the resulting MLR model to the validation set was calculated. As can be seen, the number of selected variables tends to increase with the α-value. In fact, a larger value of α makes the partial F-test less selective, in that variables with a small F-value are more easily accepted for inclusion in the model. The best result in terms of RMSEPV is attained for α = 0.02.

Table 3 Results of the stepwise regression procedure for different α-values
α        RMSEPV (°C)    Number of selected variables
0.01     6.9            13
0.02     5.8            11
0.05     6.9            38
0.10     7.0            44
0.25     7.1            48

RMSEPV, root mean square error of prediction in the validation set.
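For reference, the partial F statistic that governs the entry of a candidate variable can be computed as in the following Matlab sketch. The data are simulated, and the critical value shown is only an approximate tabulated value for α = 0.05.

% Partial F-test for the inclusion of one additional variable in an MLR model
% (illustrative sketch with simulated data; in practice the acceptance threshold
% is the critical F value for the chosen alpha).
N = 80;
x1 = randn(N, 1);  x2 = randn(N, 1);
y  = 2 * x1 + 0.3 * x2 + 0.5 * randn(N, 1);

Xred  = [ones(N, 1) x1];          % reduced model (intercept + x1)
Xfull = [Xred x2];                % full model (adds candidate variable x2)

rssRed  = sum((y - Xred  * (Xred  \ y)).^2);
rssFull = sum((y - Xfull * (Xfull \ y)).^2);

p = size(Xfull, 2);               % number of coefficients in the full model
Fpartial = (rssRed - rssFull) / (rssFull / (N - p));

Fcrit = 3.97;                     % approximate F(1, 77) critical value for alpha = 0.05
if Fpartial > Fcrit
    disp('x2 is accepted for inclusion in the model');
else
    disp('x2 is rejected');
end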
3.05.5.1.2 Genetic algorithm

In the GA formulation adopted in this example, the fitness criterion was defined as the inverse of PRESSV for an MLR model employing the variable subset under evaluation. The population size was set to P = 1000 individuals, the number of generations was set to 100, and the crossover and mutation probabilities were set to pc = 60% and pm = 5%, respectively. The algorithm was run five times, starting from different initial populations. The fitness value of the best individual obtained in each case is presented in Table 4. As can be seen, the best fitness value was attained in both the second and fourth GA realizations. On the basis of parsimony considerations, the second realization can be chosen because the resulting model comprises a smaller number of variables.
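The fitness evaluation used in this GA formulation (the inverse of PRESSV for an MLR model built on the selected variables) can be sketched in Matlab as follows; the calibration and validation data and the chromosome are simulated placeholders.

% Fitness evaluation for a GA chromosome coding a subset of variables
% (illustrative sketch; the data are simulated and the chromosome is random).
Ncal = 80;  Nval = 40;  K = 100;
Xcal = randn(Ncal, K);  ycal = randn(Ncal, 1);       % calibration set
Xval = randn(Nval, K);  yval = randn(Nval, 1);       % validation set

chrom = rand(1, K) < 0.1;            % binary chromosome: 1 = variable selected
sel = find(chrom);

% MLR model (with intercept) built on the calibration set
b = [ones(Ncal, 1) Xcal(:, sel)] \ ycal;
yhat = [ones(Nval, 1) Xval(:, sel)] * b;

PRESSV  = sum((yval - yhat).^2);     % prediction error sum of squares (validation set)
fitness = 1 / PRESSV;                % fitness criterion adopted in this case study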
3.05.5.1.3 Successive projections algorithm

The SPA was applied by using PRESSV to assess the candidate subsets of variables in Phase 2. As a result, a chain of 26 variables was selected. Figure 18 presents the scree plot of PRESSV used in Phase 3 to reduce the number of variables included in the model. According to the F-test criterion with α = 0.25, only 11 of the 26 variables were retained, resulting in a PRESSV value of 5.0 × 10² °C² (RMSEPV = 3.5 °C).
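The core of SPA Phase 1 is a sequence of orthogonal projections in which, at each step, the variable with the largest projected norm is appended to the chain. A minimal Matlab sketch of this projection loop is given below; the calibration matrix, the starting variable, and the chain length are arbitrary, and Phases 2 and 3 are omitted.

% Core projection operation of SPA (Phase 1): starting from a given column,
% successively pick the column with the largest norm after projection onto
% the subspace orthogonal to the columns already selected.
% Illustrative sketch; Xcal is simulated and kStart/mMax are arbitrary.
Xcal = randn(80, 200);              % calibration matrix (samples x wavelengths)
kStart = 17;                        % index of the starting variable
mMax = 10;                          % number of variables in the chain

chain = kStart;
Xproj = Xcal;
for step = 2:mMax
    xk = Xproj(:, chain(end));
    % project all columns onto the orthogonal complement of the last selected column
    P = eye(size(Xcal, 1)) - (xk * xk') / (xk' * xk);
    Xproj = P * Xproj;
    norms = sqrt(sum(Xproj.^2, 1));
    norms(chain) = -1;              % exclude columns already in the chain
    [~, kNew] = max(norms);
    chain = [chain kNew];
end
disp(chain)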
3.05.5.1.4 Comparison of results

Figures 19 and 20 show the variables selected by the stepwise regression, GA, and SPA procedures. The raw and derivative NIR spectra of a typical diesel sample are shown in Figures 19 and 20, respectively, to facilitate visualization of the results. As can be seen, the three algorithms provided different solutions to the variable selection problem. In particular, SPA focused on the regions of largest absorbance values, whereas stepwise regression and GA selected wavelengths distributed more broadly along the entire spectral region.

Table 4 Results for the MLR models obtained with five different GA realizations

GA realization    Best fitness value (10⁻³ °C⁻²)    Corresponding RMSEPV value (°C)    Number of selected variables
1                 7.9                               1.8                                40
2                 9.8                               1.6                                40
3                 9.1                               1.7                                46
4                 9.8                               1.6                                44
5                 9.0                               1.7                                43

GA, genetic algorithm; MLR, multiple linear regression; RMSEPV, root mean square error of prediction in the validation set.
Figure 18 Scree plot used for variable elimination in the last phase of SPA. The arrow indicates the number of variables (11) selected according to the F-test criterion.
Figure 19 NIR spectrum of a typical diesel sample. The circles indicate the spectral variables selected by (a) stepwise regression, (b) GA, and (c) SPA.
Table 5 presents the results for each of the models considered in this case study in terms of RMSEPP and correlation coefficient (rP) in the prediction set. The best results were obtained by MLR-GA and MLR-SPA, in this order. As can be seen, the use of variable selection may provide better prediction accuracy than the application of PLS to the entire spectral range. In this example, the worst MLR results were obtained with stepwise regression, which may indicate that the data set does not conform to the statistical assumptions adopted by this method. To evaluate the robustness of the predictions, the calculations were repeated after degrading the derivative spectra in the prediction set with artificial zero-mean white Gaussian noise with a standard deviation of 10⁻⁵. The results in Table 5 show that the detrimental effects of this disturbance were larger in the MLR models than in PLS. Such a drawback of variable selection is described in more detail in Section 3.05.6. In relative terms, the MLR-GA model was the most affected by the introduction of noise. This finding suggests that the GA optimization incurred in an overfitting problem, that is, the solution may have been excessively fine-tuned for a particular set of experimental conditions. Among the three MLR models considered in this study, MLR-SPA seems to be the least sensitive to the introduction of noise in the spectra of the prediction set. Such an advantage may be ascribed to two factors: (1) the collinearity avoidance mechanism embedded in SPA, which reduces the propagation of noise in MLR; and (2) the selection of wavelengths associated to spectral bands with good signal-to-noise ratio, as shown in Figures 19(c) and 20(c).
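The robustness check described above can be reproduced in outline with the Matlab sketch below, in which zero-mean Gaussian noise is added to the prediction spectra and the RMSEP is recomputed. All data, the selected variables, the regression coefficients, and the noise level are hypothetical placeholders.

% Robustness check: RMSEP in the prediction set before and after adding
% zero-mean white Gaussian noise to the spectra (illustrative sketch with
% simulated data; 'sel' and 'b' stand for a previously selected variable
% subset and the corresponding MLR coefficients).
Npred = 50;  K = 1191;
Xpred = 0.01 * randn(Npred, K);          % hypothetical derivative spectra
ypred = randn(Npred, 1);                 % hypothetical reference values
sel = [10 250 700 900];                  % hypothetical selected variables
b = randn(length(sel) + 1, 1);           % hypothetical MLR coefficients (with intercept)

rmsep = @(X) sqrt(mean((ypred - [ones(Npred, 1) X(:, sel)] * b).^2));

sigmaNoise = 1e-5;                       % noise standard deviation (illustrative value)
Xnoisy = Xpred + sigmaNoise * randn(Npred, K);
fprintf('RMSEP without noise: %.3f, with noise: %.3f\n', rmsep(Xpred), rmsep(Xnoisy));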
Figure 20 Derivative NIR spectrum of a typical diesel sample. The circles indicate the spectral variables selected by (a) stepwise regression, (b) GA, and (c) SPA.
Table 5 Results obtained in the prediction set

                                              No added noise                  Artificial noise added
Model           Number of variables (a)       RMSEPP (°C)   Correlation rP    RMSEPP (°C)   Correlation rP
PLS             5                             7.4           0.958             7.4           0.958
Stepwise MLR    11                            7.4           0.957             9.2           0.928
MLR-GA          40                            4.4           0.984             10.4          0.930
MLR-SPA         11                            6.1           0.970             6.5           0.965

(a) Latent variables in PLS and wavelengths in the MLR models.
GA, genetic algorithm; MLR, multiple linear regression; PLS, partial least squares; RMSEPP, root mean square error of prediction in the prediction set; SPA, successive projections algorithm.
3.05.5.2 QSAR Study
QSAR studies are concerned with the modeling of relationships between biological activities of molecules and physicochemical properties (descriptors) obtained either by experimentation or by computational means (e.g., semiempirical quantum chemistry calculations). The present case study involves the Selwood data set54 (http://www.qsar.org), which has been extensively used in feature selection investigations. This data set comprises 31 molecules (antimycin analogs) and 53 descriptor variables, namely
ATCH1–ATCH10: partial atomic charge for each atom (x1–x10)
DIPV_X, DIPV_Y, DIPV_Z: dipole vector (x11–x13)
DIPMOM: dipole moment (x14)
ESDL1–ESDL10: electrophilic superdelocalizability for each atom (x15–x24)
NSDL1–NSDL10: nucleophilic superdelocalizability for each atom (x25–x34)
VDWVOL: van der Waals volume (x35)
SURF_A: surface area (x36)
MOFI_X, MOFI_Y, MOFI_Z: principal moments of inertia (x37–x39)
PEAX_X, PEAX_Y, PEAX_Z: principal ellipsoid axes (x40–x42)
MOL_WT: molecular weight (x43)
S8_1DX, S8_1DY, S8_1DZ: substituent dimensions (x44–x46)
S8_1CX, S8_1CY, S8_1CZ: substituent centers (x47–x49)
LOGP: partition coefficient (x50)
M_PNT: melting point (x51)
SUM_F: sum of F substituent constants (x52)
SUM_R: sum of R substituent constants (x53)
The biological property of interest (dependent variable y) is the antifilarial activity of the compound. Because the number of objects in the data set is relatively small (as compared, for instance, with the spectroscopic data set used in the previous example), a cross-validation study involving all 31 molecules was carried out. In this case, MLR models obtained by using stepwise regression and GA were compared with a PLS model using all descriptors. Different α-values were tested in the stepwise regression procedure to obtain MLR models with one, two, four, and five descriptors. For comparison, the GA was used to obtain models with the same numbers of descriptors. The GA parameters were the same as those adopted in the previous case study; in this case, the fitness criterion was the inverse of PRESSCV. The results for PLS, stepwise MLR, and MLR-GA are presented in Table 6. As can be seen, there is a general tendency of improvement for the MLR models as the number of descriptors is increased. However, the use of too many descriptors may hamper the interpretation of the QSAR model. It is worth noting that the selection of four descriptors is already sufficient to outperform the PLS result. Moreover, the GA results are consistently better than the corresponding results for stepwise regression, as in the previous case study.
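The leave-one-out cross-validation figure of merit used in this study (PRESSCV, whose inverse served as the GA fitness) can be computed for a given subset of descriptors as in the Matlab sketch below. The descriptor matrix, activity values, and selected subset are simulated placeholders.

% Leave-one-out cross-validation PRESS for an MLR model using a given
% subset of descriptors (illustrative sketch; X and y are simulated stand-ins
% for the descriptor matrix and the antifilarial activity values).
N = 31;  K = 53;
X = randn(N, K);  y = randn(N, 1);
sel = [4 39 50];                    % hypothetical subset of descriptors

pressCV = 0;
for i = 1:N
    cal = setdiff(1:N, i);          % leave sample i out
    b = [ones(N - 1, 1) X(cal, sel)] \ y(cal);
    yhat = [1 X(i, sel)] * b;
    pressCV = pressCV + (y(i) - yhat)^2;
end
rmsepCV = sqrt(pressCV / N);
fitness = 1 / pressCV;              % GA fitness criterion used in this case study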
Table 6 Cross-validation results

Method                          Descriptors                              RMSEPCV   rCV
PLS (10 latent variables)       All                                      0.59      0.738
Stepwise MLR (α = 10⁻³)         ATCH6                                    0.68      0.558
Stepwise MLR (α = 10⁻²)         ATCH6, ATCH4                             0.61      0.671
Stepwise MLR (α = 5 × 10⁻²)     ATCH6, ATCH4, LOGP, MOFI_Z               0.54      0.760
Stepwise MLR (α = 7 × 10⁻²)     ATCH1, ATCH4, DIPV_Y, MOFI_Z, LOGP       0.47      0.819
MLR-GA                          MOFI_Z, LOGP                             0.56      0.732
MLR-GA                          DIPV_Y, MOFI_Y, LOGP, SUM_F              0.47      0.817
MLR-GA                          ATCH4, DIPV_X, MOFI_Z, LOGP, SUM_F       0.45      0.842

GA, genetic algorithm; MLR, multiple linear regression; PLS, partial least squares.
3.05.6 Limitations of Variable Selection

When compared with latent variable methods, models based on a few selected variables have advantages in terms of simplicity, ease of interpretation, and, in some cases, prediction accuracy. However, two possible drawbacks should be pointed out in this context.9

First, the use of a small number of variables may hinder the detection of outliers, that is, samples that are unusual as compared with those in the calibration set. For example, consider that the blue and red lines in Figure 21 depict the absorbance spectrum of a typical calibration sample and that of an outlier, respectively. The difference between the two spectra could be ascribed, for instance, to the existence of additional absorbing species in the abnormal sample. Flagging such an abnormality is important because the presence of interferents not considered in the calibration phase may render the model predictions invalid. In the present case, the abnormal sample can be flagged as an outlier by simple inspection of the spectrum. However, if the analysis were restricted, for example, to variables x50, x150, and x250 (which could be selected on the basis of peak location for typical spectra), the abnormality would not be detected.

Figure 21 Spectral profiles of two hypothetical samples. The blue and red lines represent a normal spectrum and an outlier, respectively. If the analysis was restricted to variables x50, x150, and x250, no distinctions between the two samples would be detected.

Second, predictions based on a reduced subset of variables may be more sensitive to measurement noise. In fact, when many redundant variables are used, as in a full-spectrum PCR/PLS model, an averaging process takes place, which attenuates the effects of instrumental noise. For instance, consider two prediction models I and II of the form

\hat{y}_I = x_1     (51)

\hat{y}_{II} = 0.5 x_1 + 0.5 x_2     (52)

Suppose that x1 and x2 have the same expected (or 'true') values for all samples, but different, uncorrelated noise terms, that is,

x_{i,1} = \bar{x}_i + n_{i,1}     (53)

x_{i,2} = \bar{x}_i + n_{i,2}     (54)

where i is the sample index. It will be assumed that E\{x_{i,1}\} = E\{x_{i,2}\} = \bar{x}_i, E\{n_{i,1}\} = E\{n_{i,2}\} = 0, E\{n_{i,1}^2\} = E\{n_{i,2}^2\} = \sigma^2, and E\{n_{i,1} n_{i,2}\} = 0 for all i. Dropping the sample index, for simplicity of notation, it follows that

E\{\hat{y}_I\} = E\{x_1\} = \bar{x}     (55)

E\{[\hat{y}_I - E\{\hat{y}_I\}]^2\} = E\{[x_1 - \bar{x}]^2\} = E\{n_1^2\} = \sigma^2     (56)

E\{\hat{y}_{II}\} = 0.5 E\{x_1\} + 0.5 E\{x_2\} = \bar{x}     (57)

E\{[\hat{y}_{II} - E\{\hat{y}_{II}\}]^2\} = E\{[0.5 x_1 + 0.5 x_2 - \bar{x}]^2\} = E\{[0.5(x_1 - \bar{x}) + 0.5(x_2 - \bar{x})]^2\}
    = 0.25 E\{[n_1 + n_2]^2\} = 0.25 ( E\{n_1^2\} + 2 E\{n_1 n_2\} + E\{n_2^2\} ) = 0.5 \sigma^2     (58)
which shows that the prediction has the same expected value in both models, but the prediction variance of model II (two variables) is half that of model I (single variable). For this reason, it may be advisable to combine variable selection methods with de-noising techniques, such as Savitzky–Golay smoothing, Fourier transform, or wavelet decompositions.55
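The conclusion of Equations (55)–(58) can be verified numerically with a short Monte Carlo sketch in Matlab; the mean value and noise level used here are arbitrary.

% Monte Carlo check of Equations (55)-(58): averaging two noisy measurements
% with the same expected value halves the prediction variance.
% Illustrative sketch; the mean xBar and noise level sigma are arbitrary.
nSamples = 1e5;  xBar = 10;  sigma = 0.2;
x1 = xBar + sigma * randn(nSamples, 1);
x2 = xBar + sigma * randn(nSamples, 1);

yI  = x1;                    % model I: single variable
yII = 0.5 * x1 + 0.5 * x2;   % model II: average of two variables

fprintf('var(yI)  = %.4f (expected %.4f)\n', var(yI),  sigma^2);
fprintf('var(yII) = %.4f (expected %.4f)\n', var(yII), 0.5 * sigma^2);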
References

1. Wold, S.; Sjostrom, M.; Eriksson, L. PLS-Regression: A Basic Tool of Chemometrics. Chemom. Intell. Lab. Syst. 2001, 58, 109–130.
2. Draper, N. R.; Smith, H. Applied Regression Analysis, 3rd ed.; Wiley & Sons: New York, 1998.
3. Gusnanto, A.; Pawitan, Y.; Huang, J.; Lane, B. Variable Selection in Random Calibration of Near-Infrared Instruments: Ridge Regression and Partial Least Squares Regression Settings. J. Chemom. 2003, 17, 174–185.
4. Martens, H.; Naes, T. Multivariate Calibration; Wiley & Sons: London, 1993.
5. Massart, D. L.; Vandeginste, B. G. M.; Buydens, L. M. C.; De Jong, S.; Lewi, P. J.; Smeyers-Verbeke, J. Handbook of Chemometrics and Qualimetrics; Elsevier: Amsterdam, 1997.
6. Spiegelman, C. H.; McShane, M. J.; Goetz, M. J.; Motamedi, M.; Yue, Q. L.; Côté, G. L. Theoretical Justification of Wavelength Selection in PLS Calibration: Development of a New Algorithm. Anal. Chem. 1998, 70, 35–44.
7. Goicoechea, H. C.; Olivieri, A. C. A New Family of Genetic Algorithms for Wavelength Intervals Selection in Multivariate Analytical Spectroscopy. J. Chemom. 2003, 17, 338–345.
8. Leardi, R.; Seasholtz, M. B.; Pell, R. J. Variable Selection for Multivariate Calibration Using a Genetic Algorithm: Prediction of Additive Concentrations in Polymer Films from Fourier Transform-Infrared Spectral Data. Anal. Chim. Acta 2002, 461, 189–200.
9. Beebe, K. R.; Pell, R. J.; Seasholtz, M. B. Chemometrics: A Practical Guide; Wiley & Sons: New York, 1998.
10. Kohavi, R.; John, G. H. Wrappers for Feature Subset Selection. Artif. Intell. 1997, 97, 273–324.
11. Matlab 6.5 User's Guide; The Mathworks: Natick, MA, USA.
12. Ditusa, M. R.; Schilt, A. A. Selection of Wavelengths for Optimum Precision in Simultaneous Spectrophotometric Determinations. J. Chem. Educ. 1985, 62, 541–542.
13. Naes, T.; Isaksson, T.; Fearn, T.; Davies, T. A User-Friendly Guide to Multivariate Calibration and Classification; NIR Publications: Chichester, 2002.
14. Rossi, F.; François, D.; Wertz, V.; Meurens, M.; Verleysen, M. Fast Selection of Spectral Variables with B-Spline Compression. Chemom. Intell. Lab. Syst. 2007, 86, 208–218.
15. Rossi, F.; Lendasse, A.; François, D.; Wertz, V.; Verleysen, M. Mutual Information for the Selection of Relevant Variables in Spectrometric Nonlinear Modelling. Chemom. Intell. Lab. Syst. 2006, 80, 215–226.
16. Lorber, A. Error Propagation and Figures of Merit for Quantification by Solving Matrix Equations. Anal. Chem. 1986, 58, 1167–1172.
17. Kalivas, J. H.; Lang, P. M. Interrelationships Between Sensitivity and Selectivity Measures for Spectroscopic Analysis. Chemom. Intell. Lab. Syst. 1996, 32, 135–149.
18. Kalivas, J. H.; Lang, P. M. Mathematical Analysis of Spectral Orthogonality; Marcel Dekker: New York, 1994.
19. ASTM E 1655-00 (Standard Practices for Infrared Multivariate Quantitative Analysis).
20. Kennard, R. W.; Stone, L. A. Computer Aided Design of Experiments. Technometrics 1969, 11, 137–148.
21. Galvão, R. K. H.; Araújo, M. C. U.; José, G. E.; Pontes, M. J. C.; Silva, E. C.; Saldanha, T. C. B. A Method for Calibration and Validation Subset Partitioning. Talanta 2005, 67, 736–740.
22. Mallows, C. L. Some Comments on Cp. Technometrics 1973, 15, 661–675.
23. Stout, F.; Baines, M. R.; Kalivas, J. H. Impartial Graphical Comparison of Multivariate Calibration Methods and the Harmony/Parsimony Tradeoff. J. Chemom. 2006, 20, 464–475.
24. Norgaard, L.; Saudland, A.; Wagner, J.; Nielsen, J. P.; Munck, L.; Engelsen, S. B. Interval Partial Least-Squares Regression (iPLS): A Comparative Chemometric Study with an Example from Near-Infrared Spectroscopy. Appl. Spectrosc. 2000, 54, 413–419.
25. Centner, V.; Massart, D.; Noord, O.; Jong, S.; Vandeginste, B.; Sterna, C. Elimination of Uninformative Variables for Multivariate Calibration. Anal. Chem. 1996, 68, 3851–3858.
26. Efron, B. The Jack-Knife, the Bootstrap and Other Resampling Plans; Society for Industrial and Applied Mathematics: Philadelphia, PA, 1982.
27. Martens, H.; Martens, M. Modified Jack-Knife Estimation of Parameter Uncertainty in Bilinear Modelling by Partial Least Squares Regression (PLSR). Food Qual. Prefer. 2000, 11, 5–16.
28. Faber, K.; Kowalski, B. R. Propagation of Measurement Errors for the Validation of Predictions Obtained by Principal Component Regression and Partial Least Squares. J. Chemom. 1997, 11, 181–238.
29. Nelder, J. A.; Mead, R. A Simplex Method for Function Minimization. Comput. J. 1965, 7, 308–313.
30. Gill, P. E.; Murray, W.; Wright, M. H. Practical Optimization; Academic Press: London, 1981.
31. Bohachevsky, O.; Johnson, M. E.; Stein, M. L. Generalized Simulated Annealing for Function Optimization. Technometrics 1986, 28, 209–217.
32. Kalivas, J. H.; Roberts, N.; Sutter, J. M. Global Optimization by Simulated Annealing with Wavelength Selection for Ultraviolet–Visible Spectrophotometry. Anal. Chem. 1989, 61, 2024–2030.
33. Goldberg, D. E. Genetic Algorithms in Search, Optimization, and Machine Learning; Addison-Wesley: Reading, MA, 1989.
34. Fogel, D. B. Evolutionary Computation; IEEE Press: New York, 1995.
35. Jouan-Rimbaud, D.; Massart, D. L.; Leardi, R.; Noord, O. E. Genetic Algorithms as a Tool for Wavelength Selection in Multivariate Calibration. Anal. Chem. 1995, 67, 4295–4301.
36. Leardi, R. Genetic Algorithms in Chemometrics and Chemistry: A Review. J. Chemom. 2001, 15, 559–569.
37. Todeschini, R.; Galvagni, D.; Vílchez, J. L.; del Olmo, M.; Navas, N. Kohonen Artificial Neural Networks as a Tool for Wavelength Selection in Multicomponent Spectrofluorimetric PLS Modelling: Application to Phenol, o-Cresol, m-Cresol and p-Cresol Mixtures. Trends Anal. Chem. 1999, 18, 93–98.
38. Araújo, M. C. U.; Saldanha, T. C. B.; Galvão, R. K. H.; Yoneyama, T.; Chame, H. C.; Visani, V. The Successive Projections Algorithm for Variable Selection in Spectroscopic Multicomponent Analysis. Chemom. Intell. Lab. Syst. 2001, 57, 65–73.
39. Galvão, R. K. H.; Pimentel, M. F.; Araújo, M. C. U.; Yoneyama, T.; Visani, V. Aspects of the Successive Projections Algorithm for Variable Selection in Multivariate Calibration Applied to Plasma Emission Spectrometry. Anal. Chim. Acta 2001, 443, 107–115.
40. Breitkreitz, M. C.; Raimundo, I. M., Jr.; Rohwedder, J. J. R.; Pasquini, C.; Dantas Filho, H. A.; José, G. E.; Araújo, M. C. U. Determination of Total Sulphur in Diesel Fuel Employing NIR Spectroscopy and Multivariate Calibration. Analyst 2003, 128, 1204–1207.
41. Galvão, R. K. H.; Araújo, M. C. U.; Fragoso, W. D.; Silva, E. C.; José, G. E.; Soares, S. F. C.; Paiva, H. M. A Variable Elimination Method to Improve the Parsimony of MLR Models Using the Successive Projections Algorithm. Chemom. Intell. Lab. Syst. 2008, 92, 83–91.
42. Galvão, R. K. H.; Araújo, M. C. U.; Silva, E. C.; José, G. E.; Soares, S. F. C.; Paiva, H. M. Cross-Validation for the Selection of Spectral Variables Using the Successive Projections Algorithm. J. Braz. Chem. Soc. 2007, 18, 1580–1584.
43. Haaland, D. M.; Thomas, E. V. Partial Least-Squares Methods for Spectral Analyses. 1. Relation to Other Quantitative Calibration Methods and the Extraction of Qualitative Information. Anal. Chem. 1988, 60, 1193–1202.
44. Li, B. X.; Wang, D. L.; Xu, C. L.; Zhang, Z. J. Flow-Injection Simultaneous Chemiluminescence Determination of Ascorbic Acid and L-Cysteine with Partial Least Squares Calibration. Mikrochim. Acta 2005, 149, 205–212.
45. Pereira, A. F. C.; Pontes, M. J. C.; Gambarra Neto, F. F.; Santos, S. R. B.; Galvão, R. K. H.; Araújo, M. C. U. NIR Spectrometric Determination of Quality Parameters in Vegetable Oils Using iPLS and Variable Selection. Food Res. Int. 2008, 41, 341–348.
46. Hendriks, M. M. W. B.; de Boer, J. H.; Smilde, A. K.; Doornbos, D. A. Multicriteria Decision Making. Chemom. Intell. Lab. Syst. 1992, 16, 175–191.
47. Wienke, D.; Lucasius, C.; Kateman, G. Multicriteria Target Vector Optimization of Analytical Procedures Using a Genetic Algorithm. Part I. Theory, Numerical Simulations and Application to Atomic Emission Spectroscopy. Anal. Chim. Acta 1992, 265, 211–225.
48. Dietz, A.; Azzaro-Pantel, C.; Pibouleau, L.; Domenech, S. Multiobjective Optimization for Multiproduct Batch Plant Design Under Economic and Environmental Considerations. Comput. Chem. Eng. 2006, 30, 599–613.
49. Halsall-Whitney, H.; Thibault, J. Multi-Objective Optimization for Chemical Processes and Controller Design: Approximating and Classifying the Pareto Domain. Comput. Chem. Eng. 2006, 30, 1155–1168.
50. Hansen, C. Rank-Deficient and Discrete Ill-Posed Problems; Society for Industrial and Applied Mathematics: Philadelphia, PA, 1998.
51. Kalivas, J. H. Pareto Calibration with Built-In Wavelength Selection. Anal. Chim. Acta 2004, 505, 9–14.
52. Stout, F.; Kalivas, J. H.; Héberger, K. Wavelength Selection for Multivariate Calibration Using Tikhonov Regularization. Appl. Spectrosc. 2007, 61, 85–95.
53. Tibshirani, R. Regression Shrinkage and Selection via the Lasso. J. R. Stat. Soc. Ser. B 1996, 58, 267–288.
54. Selwood, D. L.; Livingstone, D. J.; Comley, J. C. W.; O'Dowd, A. B.; Hudson, A. T.; Jackson, P.; Jandu, K. S.; Rose, V. S.; Stables, J. N. Structure–Activity Relationships of Antifilarial Antimycin Analogs: A Multivariate Pattern Recognition Study. J. Med. Chem. 1990, 33, 136–142.
55. Walczak, B. Wavelets in Chemistry; Elsevier Science: New York, 2000.
Biographical Sketches
Roberto Kawakami Harrop Galvão received the B.S. degree in Electronics Engineering (1995, with Summa cum Laude honors) and the Doctorate degree in Systems and Control (1999) from the Instituto Tecnológico de Aeronáutica (ITA, São José dos Campos, SP, Brazil). In 2001, he spent a postdoctoral period at the Cybernetics Department of the University of Reading, UK. Since 1998, he has been a full-time academic staff member in the Electronics Engineering Department of ITA. His main areas of interest are multivariate analysis, wavelet-based signal processing, and predictive control. Dr. Galvão is a senior member of the Institute of Electrical and Electronics Engineers (IEEE) and also a member of the Brazilian Chemistry Society (SBQ) and Brazilian Automatic Control Society (SBA).
Mário César Ugulino de Araújo received the B.S. degree in Industrial Chemistry from Universidade Federal da Paraíba (UFPB, João Pessoa, PB, Brazil) in 1979 and the Doctorate degree in Analytical Chemistry from Universidade Estadual de Campinas (Unicamp, Campinas, SP, Brazil) in 1987. Since 1982, he has been a full-time academic staff member at the Chemistry Department of UFPB, where he is currently Head of the Laboratório de Instrumentação e Automação em Química Analítica/Quimiometria (LAQA). His main areas of interest are chemometrics, flow automatic systems, and analytical instrumentation. Dr. Araújo is a member of the Brazilian Chemistry Society (SBQ).