Chemometrics and Intelligent Laboratory Systems 107 (2011) 106–115
Exploring nonlinear relationships in chemical data using kernel-based methods

Dong-Sheng Cao a, Yi-Zeng Liang a,⁎, Qing-Song Xu b,⁎, Qian-Nan Hu c, Liang-Xiao Zhang a, Guang-Hui Fu b
a Research Center of Modernization of Traditional Chinese Medicines, Central South University, Changsha 410083, PR China
b School of Mathematical Sciences and Computing Technology, Central South University, Changsha 410083, PR China
c Systems Drug Design Laboratory, College of Pharmacy, Wuhan University, Wuhan 430071, PR China
Article history: Received 9 October 2010; Received in revised form 11 February 2011; Accepted 15 February 2011; Available online 19 February 2011

Keywords: Kernel methods; Dual solution; Support vector machines (SVMs); Kernel principal component analysis (KPCA); Kernel partial least squares (KPLS); Kernel Fisher discriminant analysis (KFDA)
Abstract

Kernel methods, in particular support vector machines, have been further extended into a new class of methods that can effectively solve nonlinear problems in chemistry by means of a simple linear transformation. In fact, the kernel function used in kernel methods can be regarded as a general protocol for dealing with nonlinear data in chemistry. In this paper, the basic idea and modularity of kernel methods, together with some simple examples, are discussed in detail to give an in-depth understanding of kernel methods. Three key ingredients of kernel methods, namely the dual form, nonlinear mapping and the kernel function, provide a consistent framework for kernel-based algorithms. The modularity of kernel methods allows linear algorithms to be combined with any kernel function. Thus, some commonly used chemometric algorithms are easily extended to their kernel versions.
1. Introduction

In chemistry, several multivariate linear techniques, such as multiple linear regression (MLR), ridge regression (RR), principal component regression (PCR) and partial least squares regression (PLSR), are very popular. They are usually used to construct a mathematical model that relates multi-wavelength spectral responses to analyte concentrations, molecular descriptors to biological activity, multivariate process conditions/states to final product attributes, etc. Such a model can in general be used to efficiently predict properties of new samples. In practice, however, many chemical systems and problems in chemistry are nonlinear. In such situations, linear methods may be inappropriate for describing the underlying data structure, because these systems can exhibit significant nonlinear characteristics while the linear models assume that the chemical data are linear. Of course, if the chemical data do exhibit linear or nearly linear relationships, data analysis is greatly simplified and the results obtained are more reliable. Modeling nonlinear chemical systems is a difficult task, since the analytical form of the relation between the measured properties and the properties of interest is usually unknown. How to effectively cope with this issue has been a major concern of chemists for a long time. In the early period, several nonlinear techniques were initially
proposed to approximate the properties of interest in chemistry. These methods can generally be divided into four groups: (1) methods based on smoothers [1], e.g., alternating conditional expectations (ACE) and the smooth multiple additive regression technique (SMART); (2) methods based on splines, e.g., classification and regression trees (CART) [2] and multivariate adaptive regression splines (MARS) [3]; (3) nonlinear PLS methods [4–9], e.g., Spline-PLS, RBF-PLS and Poly-PLS; and (4) neural networks (NN), e.g., the back-propagation network (BPN) and the radial basis function network (RBFN). An excellent tutorial about the first two groups of approaches can be found in Refs. [1,2,10], and about neural networks in Refs. [11–13]. These nonlinear techniques made it possible, for the first time, to cope with a large number of nonlinear chemical problems. However, they still suffer from some practical difficulties. For example, the approaches in the first group often need too many terms and too many adjustable parameters, so building such a model is usually a time-consuming and tedious task. Well-known nonlinear algorithms such as CART, MARS and NN are based on gradient descent or greedy heuristics and thereby suffer from local minima. They also frequently suffer from overfitting, since their statistical behavior is not well understood [10]. Moreover, the learned nonlinear functions are usually not stable, i.e., the reliability of the prediction results is low. Kernel methods [14–18], in particular support vector machines (SVMs) [19–24], have been further developed into a new class of methods that can effectively solve nonlinear problems in chemistry by means of a simple linear transformation. Advances in their statistical analysis made it possible to do so in high dimensional feature spaces while avoiding the dangers of overfitting, because they are strictly established on the basis of statistical learning theory, a widely recognized
theory of statistical science. From the computational, statistical and conceptual points of view, the kernel-based nonlinear algorithms are very efficient and as well founded as linear ones. The problems of local minima and overfitting that are typical of NN and CART have been overcome. Most importantly, the kernel function used in kernel methods can be regarded as a general protocol for dealing with nonlinear data in chemistry. This is the main topic of this paper.

The present paper mainly introduces the main idea of kernel algorithms. We intend to present a rather basic point of view illustrating the key features of kernel algorithms, so there is not much theoretical derivation in this tutorial. The paper is organized as follows. In Section 2 some essential notations are introduced. In Section 3 the three key ingredients of kernel methods are described in detail, together with some simple examples. In Section 4 the modularity of kernel methods is introduced to establish a consistent framework. In Section 5 the most commonly used chemometric algorithms, namely principal component analysis (PCA), partial least squares (PLS) and Fisher discriminant analysis (FDA), are extended to their kernel versions. Section 6 discusses the relationship between kernel functions and support vector machines (SVMs). In Section 7 a simple comparison of kernel PLS, kernel FDA and SVMs is presented to demonstrate the ability of the kernel function. In Section 8 some attention is devoted to questions of model selection, i.e., how to choose the parameters in kernel approaches. Finally, Section 9 provides a short discussion and concludes the paper.
2. Notations Let us first introduce some notations used in this paper. The data matrix X has n samples in the rows and p variables in the columns. In addition, matrices and vectors are denoted by bold capital and lowercase characters, respectively, while scalar variables are shown in italic letters.
3. Kernel Methods: Three Key Ingredients

Briefly speaking, kernel methods can be performed in two successive steps. The first step is to embed the training data from the input space into a much higher dimensional feature space. In the second step, a linear algorithm is designed to discover the linear relationships in that feature space (see Fig. 1). In this process, three key ingredients of kernel methods, namely the dual form of linear algorithms, the nonlinear mapping and the kernel function, should be highlighted.
3.1. Primal and Dual Forms

Many linear models for regression and classification can be reformulated in terms of a dual representation where the kernel function arises naturally. This concept plays a very important role when we consider kernel methods. To elucidate the dual form, let us consider a simple linear regression example: ridge regression (RR). In simple cases, the function f can have a linear form, and the relation between the response vector y and the calibration matrix X can be represented as follows:

y = f(X) + e = Xb + e    (1)
where b is the regression coefficient vector and e are the residuals. To estimate b, RR essentially corresponds to solving the following optimization problem:

minimize: L(b) = Σ_{i=1}^{n} (y_i − f(x_i))^2 + λ||b||^2 = (y − Xb)^t(y − Xb) + λ||b||^2    (2)

where ||b|| is the norm of b and λ is the positive regularization parameter that determines the trade-off between minimizing the residual sum of squares and minimizing ||b||^2. The optimal b can be sought by taking the derivative of Eq. (2) with respect to b and setting it equal to zero, which gives:

∂L(b)/∂b = 2X^tXb − 2X^ty + 2λb = 0.    (3)
Thus, the solution of RR can be further expressed as:

b̂ = (X^tX + λI_p)^{−1} X^ty    (4)

where I_p is a p-dimensional identity matrix. Eq. (4) is the commonly known solution of RR and is thus taken as the primal form of the solution. Correspondingly, the prediction function involved is given by:

f(x) = ⟨b̂, x⟩ = y^tX(X^tX + λI_p)^{−1} x    (5)
where ⟨·,·⟩ denotes the inner product. Alternatively, we can also rewrite Eq. (3) in terms of b to obtain

b̂ = λ^{−1} X^t(y − Xb̂) = X^tα    (6)

with α = λ^{−1}(y − Xb̂). Note that Eq. (6) shows that the estimated regression coefficient b̂ can be written as a linear combination of the training data, say b̂ = Σ_{i=1}^{n} α_i x_i. If we substitute X^tα for b̂ in the equation α = λ^{−1}(y − Xb̂) and then simplify Eq. (6), α can be further expressed as:

α = (G + λI_n)^{−1} y    (7)

where G = XX^t (n × n), or component-wise G_ij = ⟨x_i, x_j⟩, represents the so-called Gram matrix, defined by the inner products of the coordinates of the training data in the p-dimensional space. The Gram matrix is also referred to as the kernel matrix, which will be introduced in detail soon. Substituting Eq. (7) for α in Eq. (6), we obtain

b̂ = X^t(G + λI_n)^{−1} y.    (8)

Fig. 1. The mapping φ embeds the data points into a feature space where the nonlinear relationship now appears linear.

It is surprising that b̂ has another expression differing from Eq. (4): it is given as a linear combination of the training data with weights α (Eq. (8)). This form is known as the dual solution of RR. The parameter α is known
as the dual variable. Correspondingly, the prediction function involved is given by

f(x) = ⟨b̂, x⟩ = ⟨Σ_{i=1}^{n} α_i x_i, x⟩ = y^t(G + λI_n)^{−1} k    (9)
where k, with k_i = ⟨x_i, x⟩, is the vector of inner products between a new test sample x and every training sample x_i. So far, we have obtained two distinct ways to solve the RR problem, namely the primal form and the dual form. The primal form directly computes the regression coefficient b; this is the commonly known solution of RR. However, RR can also be solved in a dual form that only requires inner products between data points, contained in G and k. That is, all the information RR needs in the dual form is the inner products ⟨x_i, x_j⟩ between data points. In the following sections, we will see that the inner products between data points actually construct the so-called kernel matrix, another key component of kernel methods. Thus, the dual form makes the kernel matrix arise naturally in many linear methods. This is the reason why the dual form can be considered a key component of kernel methods.
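To make the primal/dual equivalence concrete, the short numpy sketch below (an illustration added here, not code from the paper) computes the ridge regression coefficients both ways, via Eq. (4) in the primal form and via Eqs. (7) and (8) in the dual form, and checks that the two solutions and their predictions coincide.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 5                        # n samples, p variables
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)
lam = 1.0                           # regularization parameter lambda

# Primal form, Eq. (4): b = (X^t X + lambda*I_p)^(-1) X^t y
b_primal = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Dual form, Eqs. (7)-(8): alpha = (G + lambda*I_n)^(-1) y,  b = X^t alpha
G = X @ X.T                         # Gram matrix of inner products
alpha = np.linalg.solve(G + lam * np.eye(n), y)
b_dual = X.T @ alpha

print(np.allclose(b_primal, b_dual))            # True: identical coefficients

# Prediction for a new sample, Eq. (9): f(x) = y^t (G + lambda*I_n)^(-1) k
x_new = rng.normal(size=p)
k = X @ x_new                       # inner products with the training samples
print(np.isclose(b_primal @ x_new, alpha @ k))  # True: identical prediction
```

Note that the dual route never forms X^tX; it only touches the n × n matrix of inner products, which is exactly what allows the kernel substitution introduced below.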
3.2. Nonlinear Mapping

It is known that RR can only model linear relationships between X and y. In practice, however, their relation is often nonlinear, that is, y can only be accurately estimated as a nonlinear function of X. This raises the following question: can we still cope with a nonlinear relation using RR, and if so, how should we do it? An effective and feasible strategy is exactly the nonlinear mapping we want to highlight. The nonlinear mapping embeds X into a new feature space where a linear relation can be discovered, so that RR becomes applicable. Given a training set D = {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)}, we consider a nonlinear mapping:

φ: R^p → F ⊆ R^T, x ↦ φ(x)

where φ aims to convert nonlinear relations into linear relations and T is the dimensionality of the mapping space. The data points in the training set are thus mapped into a potentially much higher dimensional feature space F. The training set in the feature space F can be expressed as:

D = {(φ(x_1), y_1), (φ(x_2), y_2), …, (φ(x_n), y_n)}.

To understand the nonlinear mapping intuitively, let us consider a simple example. Each sample is denoted by x_i = [x_{i1}, x_{i2}]. All samples lie in the 2-D input space on the ellipse x_{i1}^2 + 2x_{i2}^2 = 4. Fig. 2A shows this situation, where the relation between the data points is nonlinear. Now let us consider the following mapping:

φ: x = [x_{i1}, x_{i2}] ∈ R^2 ↦ φ(x) = [x_{i1}^2, x_{i2}^2, √2 x_{i1}x_{i2}] ∈ R^3.

The mapping φ transforms the 2-D data points into a new 3-D space. When describing the data points in the 3-D space, we obtain a linear relation using just the two features x_{i1}^2 and x_{i2}^2 (when all three features are used, a plane is obtained; for simplicity, we only use two features to illustrate the action of the nonlinear mapping). Fig. 2B shows the data points in the 3-D space. As can be seen from Fig. 2B, an ellipse in the input space (i.e., a nonlinear shape in the input space) is neatly transformed into a straight line in the feature space.

With the aforementioned example now used as a decision boundary, we consider a nonlinear classification case. Fig. 3A shows the data points in the 2-D input space. From Fig. 3A, we can clearly see that the data points are linearly inseparable unless a nonlinear decision function (e.g., an ellipse) is used. To correctly classify the data points, the adopted strategy is to embed them into a new feature space where a linear function can be sought. Here we continue to use the above nonlinear mapping. All data points in the new feature space are plotted in Fig. 3B. From Fig. 3B, it can be seen that the data points, which are nonlinearly distributed in the original 2-D input space, have remarkably become linearly separable in the 3-D feature space.
Fig. 2. A nonlinear mapping example. An ellipse, i.e., a nonlinear shape in the input space (A), becomes a straight line in the feature space (B) under a certain nonlinear mapping.
Fig. 3. The linearly inseparable data points in the original input space become linearly separable in the feature space after the nonlinear mapping.
Accordingly, the quadratic decision boundary in the 2-D input space is transformed into a linear decision boundary in the feature space. That is, the nonlinear mapping takes the data points from a 2-D input space to a 3-D feature space in such a way that a linear relation in the feature space corresponds to a quadratic relation in the input space.

As mentioned above, changing the data representation via a nonlinear mapping can significantly simplify data analysis. Generally speaking, we can always seek a linear representation of the data in this way. So, instead of the original input space, we can consider the same algorithms in the new feature space F, and learn nonlinear relations with linear methods in the feature space:

f(x) = Σ_{i=1}^{T} b_i φ_i(x) + b_0    (10)

where T is the dimensionality of the feature space. In other words, nonlinear methods can be constructed in two successive steps: first a nonlinear mapping transforms the data into a feature space F, and then a linear learning algorithm is designed to discover the linear relations of the data in that space. In this process, the nonlinear mapping aims to convert nonlinear relations into linear relations and hence reflects our expectations about the relation to be learned.

Although the primal form of the solution could be used in the feature space, the computational complexity of Eq. (5) will be very high if T is very large, since it requires solving the large T × T system for the inverse of φ(X)^tφ(X). However, in the dual representation, the prediction function f(x) = y^t(G + λI_n)^{−1}k of RR involves the Gram matrix G with entries G_ij = ⟨φ(x_i), φ(x_j)⟩ after a nonlinear mapping. That is, all the information RR needs is the inner products ⟨φ(x_i), φ(x_j)⟩ between data points in the feature space F. To obtain the final model, we only need to compute the Gram matrix G and the inner product vector k in the dual representation. Thus the dual solution in the feature space only requires solving an n × n system rather than a T × T system. This is very effective when T is far larger than n, a situation quite common in chemometrics. The use of the dual form can therefore greatly improve the computational efficiency of kernel methods. This is the other reason why we use the dual form in kernel methods.

3.3. Kernel Function and Kernel Matrix

In the above analysis, no matter whether the primal form or the dual form is used, we only require an implicit mapping of the data into the feature space F when dealing with nonlinear relations with linear methods. Generally speaking, such an implicit mapping actually reflects our expectations about the relation to be learned and is therefore usually very difficult to obtain. Furthermore, the adjustment of the parameters involved in the nonlinear mapping may be a time-consuming and tedious task. There is, however, a simple way of calculating the inner product in the feature space directly as a function of the original input variables. This direct computation approach is called the kernel function approach. More formally, the kernel function is defined in such a way that for all x_i, x_j ∈ X:

κ(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩.    (11)

To understand how the kernel function works, let us continue to consider the mapping φ: x = [x_{i1}, x_{i2}] ∈ R^2 ↦ φ(x) = [x_{i1}^2, x_{i2}^2, √2 x_{i1}x_{i2}]. The composition of the feature mapping with the inner product in the feature space F can be evaluated as follows:
⟨φ(x_i), φ(x_j)⟩ = ⟨[x_{i1}^2, x_{i2}^2, √2 x_{i1}x_{i2}], [x_{j1}^2, x_{j2}^2, √2 x_{j1}x_{j2}]⟩
= x_{i1}^2 x_{j1}^2 + x_{i2}^2 x_{j2}^2 + 2 x_{i1}x_{i2}x_{j1}x_{j2}
= ⟨[x_{i1}, x_{i2}], [x_{j1}, x_{j2}]⟩^2
= ⟨x_i, x_j⟩^2.

Hence, the function κ(x_i, x_j) = ⟨x_i, x_j⟩^2 is a kernel function with its corresponding feature space F. Thus we can directly compute the inner product ⟨φ(x_i), φ(x_j)⟩ between the projections of two data points into the feature space without explicitly evaluating their coordinates φ(x_i) and φ(x_j): the inner product in the feature space is directly calculated as the quadratic function ⟨x_i, x_j⟩^2 of the inner product in the input space. In this process, we do not even need to know the underlying feature mapping φ: x = [x_{i1}, x_{i2}] ∈ R^2 ↦ φ(x) = [x_{i1}^2, x_{i2}^2, √2 x_{i1}x_{i2}] ∈ R^3. It is worth noting that the kernel function is an inner product between two data points in the new feature space. Moreover, it is also a quite simple function of the data points in the original input space. This means that the inner products of the mapped data points in the feature space can be calculated, via the kernel function, from a simple function of the data points in the input space.

The use of the kernel function is an attractive computational short-cut. A curious fact about using a kernel is that we do not need to know the underlying feature mapping. In practice, the approach adopted is to define a kernel function directly, and hence to define the feature space implicitly. In this way, we avoid the feature space not only in the computation of inner products, but also in the design of the learning algorithm itself. So kernels allow us to compute inner products in a feature space where one could otherwise hardly perform any computations. The use of the kernel function provides a very effective and feasible way to combine the nonlinear mapping with the dual form of linear algorithms in nonlinear chemical systems. To ensure that the kernel function is a valid kernel for a certain feature space, it must satisfy Mercer's theorem [19]; that is, the kernel matrix must be positive semi-definite (i.e., have nonnegative eigenvalues).

Instead of the above kernel function, other kernel functions can be used to cope with the data at hand. Table 1 lists four commonly used kernel functions in chemometrics. If the chemical data are linear or nearly linear, the linear kernel function can be selected first because it usually exhibits good prediction performance. For nonlinear chemical data, the radial basis kernel function should be tried first. With the aid of these kernel functions, we can reconsider the solution of RR (Eq. (9)). When we replace G by a kernel matrix K defined by one of the kernels in Table 1, a kernel RR is naturally established to tackle nonlinear chemical problems. In fact, kernel RR is equivalent to the least squares support vector machine [25]: they have the same loss function (Eq. (2)).
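As a quick numerical check of Eq. (11) for this particular mapping, the numpy snippet below (an illustration added here, not code from the paper) confirms that the explicit feature map and the quadratic kernel κ(x_i, x_j) = ⟨x_i, x_j⟩^2 give exactly the same inner product.

```python
import numpy as np

def phi(x):
    """Explicit feature map [x1^2, x2^2, sqrt(2)*x1*x2] for a 2-D sample."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2.0) * x1 * x2])

def kappa(xi, xj):
    """Quadratic kernel: the same inner product computed in the input space."""
    return np.dot(xi, xj) ** 2

xi = np.array([1.2, -0.7])
xj = np.array([0.3, 2.5])

lhs = np.dot(phi(xi), phi(xj))   # inner product in the 3-D feature space
rhs = kappa(xi, xj)              # kernel evaluated in the 2-D input space
print(np.isclose(lhs, rhs))      # True
```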
Table 1
Commonly used kernel functions.

Kernel                 K(x_1, x_2)
Linear kernel          x_1^t x_2
Polynomial kernel      [scale·(x_1^t x_2) + offset]^degree
Radial basis kernel    exp(−δ ||x_1 − x_2||^2)
Sigmoidal kernel       tanh[scale·(x_1^t x_2) + offset]

Scale, offset, degree and δ are the parameters of the kernels, which should be predefined by the user. Among the different types of kernels, the radial basis kernel is the most common. It should be noted that the sigmoidal kernel does not fulfill Mercer's condition for all parameter settings.
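The four kernels of Table 1 are straightforward to implement as functions returning a full kernel matrix; the sketch below is an illustration added for this tutorial (the parameter names are my own choices, not taken from the paper).

```python
import numpy as np

def linear_kernel(X1, X2):
    return X1 @ X2.T

def polynomial_kernel(X1, X2, scale=1.0, offset=1.0, degree=2):
    return (scale * (X1 @ X2.T) + offset) ** degree

def rbf_kernel(X1, X2, delta=1.0):
    # squared Euclidean distances between all pairs of rows
    d2 = (np.sum(X1**2, axis=1)[:, None]
          + np.sum(X2**2, axis=1)[None, :]
          - 2.0 * X1 @ X2.T)
    return np.exp(-delta * d2)

def sigmoid_kernel(X1, X2, scale=1.0, offset=0.0):
    # not positive semi-definite for all parameter settings (see text)
    return np.tanh(scale * (X1 @ X2.T) + offset)

X = np.random.default_rng(1).normal(size=(10, 4))
K = rbf_kernel(X, X)                                     # 10 x 10 kernel matrix
print(K.shape, np.all(np.linalg.eigvalsh(K) > -1e-10))   # PSD check
```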
4. The Modularity of Kernel Methods

So far, we have highlighted the key components of kernel methods in detail. The dual form makes linear algorithms solvable using only the information of inner products (i.e., the kernel function). In addition, the nonlinear mapping makes linear algorithms applicable to nonlinear problems, while the kernel function effectively bridges the gap between the dual form and the nonlinear mapping. In this way, we have clearly seen how kernel RR can naturally be established. A direct consequence of this finding is: every (linear) algorithm that only uses inner products can implicitly be executed in the feature space F by using kernels, i.e., one can very elegantly construct a nonlinear version of a linear algorithm [26]. This is the so-called modularity of kernel methods. The modularity of kernel methods indicates that all kernel-based methods share a consistent framework. Clearly, we can use any suitable kernel to represent the data being considered. Similarly, we can use any linear algorithm that only uses inner products between samples to establish a model while retaining the chosen kernel. This illustrates that we can consider the modeling methods separately from the choice of kernel functions. Fig. 4 shows the steps involved in the implementation of kernel methods. The training samples are first represented using a kernel function to create a kernel matrix, and subsequently the kernel matrix obtained is modeled by a linear algorithm to produce a complex model. The model is finally used to predict the unseen samples. In this process, the linear algorithms are naturally combined with the specific kernel function to produce a more complex algorithm in a high dimensional feature space.

Fig. 4. The general steps used to implement kernel methods.

5. Nonlinear Extension of Linear Algorithms Using Kernels

In the following sections we will use this philosophy to show how one can deal with nonlinear problems using some commonly used linear modeling algorithms. We first reformulate the linear, inner product-based algorithms in the input space (i.e., the dual form). The dual representation in the input space makes the linear kernel matrix arise naturally. Secondly, the original data are mapped into an implicit feature space; correspondingly, the linear kernel matrix is transformed into a nonlinear kernel matrix by the specific kernel function. Finally one is able to generate powerful nonlinear algorithms. In this process, the pretreatment of the data in the feature space is very necessary for kernel methods; it is summarized in a separate section.

5.1. Kernel Principal Component Analysis

Principal component analysis (PCA) is a linear algorithm [27,28]. It can clearly extract linear structure in the data. For p-dimensional data, a set of orthogonal directions, usually fewer than p, capturing most of the variance in the data can be obtained. In practice, one typically wants to describe the data with reduced dimensionality by extracting a few meaningful components, while at the same time retaining most of the existing structure in the data. That is, the first k projections are usually used to reconstruct the data with minimal squared error. An eigenvalue problem is usually solved to fulfill this task (see Table 2). Unfortunately, for data with nonlinear structure, PCA may not be appropriate for this purpose. Kernel PCA (KPCA) is therefore the natural version of PCA in a kernel feature space for dealing with nonlinearity [26,29–31]. The basic idea of KPCA is to map the original data set into some higher dimensional feature space where PCA is used to establish a linear relationship which, however, is nonlinear in the original input space (see Table 2).

Table 2
Comparison of principal component analysis and kernel principal component analysis.

     PCA                              KPCA
1    X = X − ones(n,1) × mean(X)      K = κ(x,z) = φ(X)φ(X)^t
2    C = X^tX/n                       K = K − (1/n)jj^tK − (1/n)Kjj^t + (1/n^2)(j^tKj)jj^t
3    [V,Λ] = eig(C)                   [U,Λ] = eig(K)
4                                     V_k = X^tU_kΛ_k^{−1/2}  (α_k = U_kΛ_k^{−1/2})
5    T_new = X_new V_k                T_new = X_new V_k  (t_new = κ(x_i, x_new) α_k)

j represents an n × 1 vector with all elements equal to 1.

For PCA, one can first compute the covariance matrix C:

C = (1/n) Σ_{i=1}^{n} x_i x_i^t = (1/n) X^tX.    (12)

A principal component v is computed by solving the following eigenvalue problem:

λv = Cv = (1/n) X^tXv.    (13)
Here λ > 0 and v ≠ 0. All eigenvectors with nonzero eigenvalues must lie in the span of the data X (i.e., the dual form). Thus, the eigenvectors can be written as:

v = Σ_{i=1}^{n} α_i x_i = X^tα    (14)

where α = [α_1, …, α_n]^t. Substituting Eq. (14) for v in Eq. (13) gives

λX^tα = (1/n) X^tXX^tα.    (15)
Then, the eigenvalue problem can be represented in the following simple form:

λα = (1/n) Kα    (16)
where K_ij = x_i^tx_j, or K = XX^t ∈ R^{n×n}, is a linear kernel matrix. To ensure the normality of the principal component, i.e., ||v||^2 = 1, the calculated α must be scaled such that ||α||^2 = 1/(nλ). To derive KPCA, one first needs to map the data X = [x_1, x_2, …, x_n]^t ∈ R^p into a feature space F (i.e., M = [φ(x_1), φ(x_2), …, φ(x_n)]^t). Hence, we can easily obtain K = MM^t ∈ R^{n×n}. Such a nonlinear kernel matrix K can be directly generated by means of a specific kernel function (see Table 1). For extracting features of a new sample x with KPCA, one simply projects the mapped sample φ(x) onto the first k projections V_k:

V_k · φ(x) = Σ_{i=1}^{n} α_i^k ⟨φ(x_i), φ(x)⟩ = Σ_{i=1}^{n} α_i^k κ(x_i, x) = k_test · α^k    (17)
where k_test = Mφ(x) = [κ(x_1,x), κ(x_2,x), …, κ(x_n,x)].

At this point a simple example helps to show what PCA and KPCA mean and how they work. Fig. 5 shows a nonlinear example in which points on a sphere with radius equal to 1, denoted by black circles, are regarded as one class, and randomly generated points ranging from −0.5 to 0.5, denoted by red circles, are regarded as the other class. To begin with, PCA is employed to visualize the trends of the data in two-dimensional space, as shown in the left panel of Fig. 6. Clearly, no separation can be observed in the first two scores. Linear PCA is not capable of differentiating the two classes with this specific nonlinear structure because it is inherently a linear algorithm. The right panel of Fig. 6 shows the results of applying KPCA to the data. KPCA maps the original data, with nonlinear features in the input space, into the kernel feature space, in which the linear PCA algorithm is then performed. As can be seen from Fig. 6, the first two principal components of KPCA, extracted with a Gaussian kernel function, are able to nicely distinguish the two classes. From this example, one can conclude that KPCA is more suitable than PCA for describing nonlinear structure.

Fig. 6. Score plots of PCA and KPCA. PCA is not capable of distinguishing the two classes (left plot), whereas a clear separation between the two classes is obtained using KPCA (right plot). The use of the Gaussian kernel function makes the nonlinear data linearly separable in KPCA.
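A compact way to reproduce this kind of experiment is sketched below (illustrative code added here; the data generation only mimics the spirit of Fig. 5 and is not the authors' original script). It builds a Gaussian kernel matrix, centers it in the feature space, and extracts the first two kernel principal components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two classes with a nonlinear (radial) structure
theta = rng.uniform(0, 2 * np.pi, 50)
class1 = np.c_[np.cos(theta), np.sin(theta)]     # points on the unit circle
class2 = rng.uniform(-0.5, 0.5, size=(50, 2))    # points inside it
X = np.vstack([class1, class2])
n = X.shape[0]

# Gaussian (RBF) kernel matrix
delta = 2.0
d2 = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
K = np.exp(-delta * d2)

# Center the kernel matrix in the feature space
One = np.full((n, n), 1.0 / n)
Kc = K - One @ K - K @ One + One @ K @ One

# Eigendecomposition; columns of alpha scaled so that ||v|| = 1, i.e. ||alpha||^2 = 1/(n*lambda)
eigvals, U = np.linalg.eigh(Kc)
idx = np.argsort(eigvals)[::-1][:2]              # two largest eigenvalues
alpha = U[:, idx] / np.sqrt(eigvals[idx])

# Kernel principal component scores of the training data
scores = Kc @ alpha
print(scores.shape)   # (100, 2): the two classes separate in these scores
```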
5.2. Kernel Partial Least Squares

Kernel partial least squares (KPLS) [32] is a nonlinear extension of linear PLS in which the training samples are transformed into a feature space via a nonlinear mapping; the PLS algorithm is then carried out in the feature space. In chemometrics, the nonlinear iterative partial least squares (NIPALS) algorithm is the commonly used PLS version [33,34]. However, the modified NIPALS algorithm [35], proposed by Lewi, normalizes the latent vectors t and u rather than the weight vectors w and c. The modified PLS algorithm is shown in the left column of Table 3. By connecting step 2 and step 3 and using the matrix φ(X) of the mapped samples, we can derive the algorithm for the nonlinear KPLS model, which is listed in the right column of Table 3.
Fig. 5. A linearly inseparable example used for PCA and KPCA. The data points denoted by black circles are generated on a sphere with radius equal to 1, and the data points denoted by red circles are generated using random numbers from −0.5 to 0.5.

Table 3
Comparison of partial least squares and kernel partial least squares.

     PLS                                KPLS
1    Randomly initiate u, or u = y      Randomly initiate u, or u = y
2    w = X^tu                           t = φ(X)φ(X)^tu = Ku
3    t = Xw, t = t/||t||                t = t/||t||
4    c = y^tt                           c = y^tt
5    u = yc, u = u/||u||                u = yc, u = u/||u||
6    Repeat 2–5 until convergence       Repeat 2–5 until convergence
7    X = X − tt^tX, y = y − tt^ty       K = (I − tt^t)K(I − tt^t), y = y − tt^ty
Applying the so-called "kernel trick", i.e., Eq. (11), we can clearly see that MM^t represents the kernel matrix of the inner products between all mapped samples. Thus, instead of an explicit nonlinear mapping, the kernel function can be used, which modifies line 7 in Table 3: instead of the deflation of the X matrix, the deflation of the kernel matrix is given by:

K ← (I − tt^t)K(I − tt^t) = K − tt^tK − Ktt^t + tt^tKtt^t    (18)

where I is an n-dimensional identity matrix. The estimated regression coefficient b in KPLS is given by:

b = M^tU(T^tKU)^{−1}T^ty    (19)

where U and T consist of the vectors u and t as columns, respectively. For new test data consisting of n_test samples, one can use the following equations to predict the training set and the test data, respectively:

ŷ = Mb = KU(T^tKU)^{−1}T^ty    (20)

y_test = M_test b = K_test U(T^tKU)^{−1}T^ty    (21)
where M_test is the matrix of the mapped test points and K_test is the (n_test × n) test kernel matrix whose elements are κ(x_i, x_j), where x_i is the ith test sample and x_j is the jth training sample.
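The KPLS column of Table 3, together with Eqs. (18)–(21), translates almost line by line into numpy. The sketch below is an illustrative implementation added for this tutorial (variable names and the convergence test are my own choices); it extracts a fixed number of latent variables from a precomputed, centered kernel matrix.

```python
import numpy as np

def kpls_fit(K, y, n_components, tol=1e-10, max_iter=500):
    """Kernel PLS following the modified NIPALS steps of Table 3 and Eq. (18)."""
    n = K.shape[0]
    Kd, yd = K.copy(), y.astype(float).copy()    # deflated copies
    T, U = [], []
    for _ in range(n_components):
        u = yd.copy()                            # step 1: initiate u = y
        for _ in range(max_iter):
            t = Kd @ u                           # step 2: t = K u
            t /= np.linalg.norm(t)               # step 3: normalize t
            c = yd @ t                           # step 4: c = y^t t
            u_new = yd * c                       # step 5: u = y c
            u_new /= np.linalg.norm(u_new)
            if np.linalg.norm(u_new - u) < tol:  # step 6: repeat until convergence
                u = u_new
                break
            u = u_new
        T.append(t); U.append(u)
        D = np.eye(n) - np.outer(t, t)           # step 7: deflation, Eq. (18)
        Kd = D @ Kd @ D
        yd = yd - t * (t @ yd)
    return np.column_stack(T), np.column_stack(U)

def kpls_predict(K, K_test, y, T, U):
    """Fitted values and test predictions in the dual form, Eqs. (20) and (21)."""
    B = U @ np.linalg.solve(T.T @ K @ U, T.T @ y)   # U (T^t K U)^(-1) T^t y
    return K @ B, K_test @ B
```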
5.3. Kernel Fisher Discriminant Analysis

Fisher discriminant analysis (FDA) finds a linear projection of the data onto a one-dimensional subspace such that the classes are well separated according to a certain measure of separability (see Table 4) [36–38]. However, FDA is rather limited in that separating data with nonlinear structure is clearly beyond its capabilities. Fortunately, it is possible to keep the linear model while obtaining a very rich set of nonlinear decision functions via the kernel trick [39–42]. The idea of kernel FDA (KFDA) is to solve the FDA problem in a kernel feature space, thereby yielding a nonlinear discriminant function in the input space (see Table 4). FDA can be achieved by maximizing the following Fisher criterion:

J(w) = (w^tS_Bw) / (w^tS_Ww)    (22)

where S_B and S_W are the between-class and the total scatter matrices defined in the original input space, respectively. They have the following expressions:

S_B = (m_2 − m_1)(m_2 − m_1)^t and S_W = Σ_{k=1}^{2} Σ_{i=1}^{n_k} (x_i − m_k)(x_i − m_k)^t.

Here m_k and n_k denote the sample mean and the number of samples for class k, respectively. According to the dual form, the optimal discriminant vector w can be expressed as a linear combination of the training samples in the original input space:

w = Σ_{i=1}^{n} α_i x_i = X^tα    (23)

where X = [x_1, x_2, …, x_n]^t and α = [α_1, …, α_n]^t. Substituting Eq. (23) into Eq. (22), we have

J(α) = (α^tKWKα) / (α^tKKα)    (24)

where K = XX^t ∈ R^{n×n}, or K_ij = x_i^tx_j, is again a linear kernel matrix, W = diag(W_1, W_2), and W_k is an n_k × n_k matrix with all entries equal to 1/n_k. To maximize Eq. (24) one solves the generalized eigenvalue problem KWKα = λKKα and then selects the eigenvector α with maximal eigenvalue λ. Now, to derive KFDA, one first needs to map the data X = [x_1, x_2, …, x_n]^t ∈ R^p into the kernel feature space F, i.e., M = [φ(x_1), φ(x_2), …, φ(x_n)]^t. Thus, substituting φ(x) for all x in Eq. (22), we can similarly obtain the optimal solution for KFDA in the feature space; here K = MM^t. Such a nonlinear kernel matrix K can also be directly generated by means of a specific kernel function (see Table 1). Correspondingly, given a test sample x, the projection of this sample onto the discriminant vector can be computed by:

w^t · φ(x) = Σ_{i=1}^{n} α_i κ(x_i, x) = k_test^tα    (25)

where k_test = Mφ(x) = [κ(x_1,x), κ(x_2,x), …, κ(x_n,x)].

Table 4
Comparison of Fisher discriminant analysis and kernel Fisher discriminant analysis.

     FDA                              KFDA
1    X = X − ones(n,1) × mean(X)      K = κ(x,z) = XX^t or φ(X)φ(X)^t
2    Compute S_B and S_W              K = K − (1/n)jj^tK − (1/n)Kjj^t + (1/n^2)(j^tKj)jj^t
3    [V,Λ] = eig(S_B, S_W)            Compute KWK and KK
4                                     [V,Λ] = eig(KWK, KK)
5    y = X_new V_1                    y = k_test α  (α = V_1, k_test = Mφ(x_test))

j represents an n × 1 vector with all elements equal to 1.

5.4. The Pretreatment of the Kernel Matrix

To obtain better prediction performance, mean centering of the data should be carried out in the feature space for every algorithm [26]. Given an implicit mapping φ: X ∈ R^p → φ(X) ∈ F, mean centering of the data in the feature space can be accomplished conveniently as follows:

H = I − (1/n) l_n l_n^t,  φ(X) ← Hφ(X)    (26)

φ(X_test) ← φ(X_test) − (1/n) l_{ntest} l_n^t φ(X)

with I an n-dimensional identity matrix, and l_n and l_{ntest} vectors of ones of dimension n and n_test, respectively. Correspondingly, the kernel matrices can be centered in the following way:

K ← (Hφ(X))(Hφ(X))^t = HKH    (27)

K_test ← (φ(X_test) − (1/n) l_{ntest} l_n^t φ(X))(Hφ(X))^t = (K_test − (1/n) l_{ntest} l_n^t K)H    (28)

Thus, Eqs. (27) and (28) can be further expressed as:

K ← K − I_nK − KI_n + I_nKI_n    (29)

K_test ← K_test − I_testK − K_testI_n + I_testKI_n    (30)

where I_n is an n × n matrix and I_test an n_test × n matrix, both with all elements equal to 1/n.
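In code, the double-centering of Eqs. (29) and (30) is only a few lines; the sketch below is an illustration added here (not the authors' implementation) that centers a training kernel matrix and the corresponding test kernel matrix.

```python
import numpy as np

def center_kernel(K, K_test=None):
    """Center kernel matrices in the feature space, Eqs. (29) and (30)."""
    n = K.shape[0]
    In = np.full((n, n), 1.0 / n)                      # I_n: n x n, all entries 1/n
    Kc = K - In @ K - K @ In + In @ K @ In             # Eq. (29)
    if K_test is None:
        return Kc
    It = np.full((K_test.shape[0], n), 1.0 / n)        # I_test: n_test x n, all 1/n
    Kt = K_test - It @ K - K_test @ In + It @ K @ In   # Eq. (30)
    return Kc, Kt

# Example: an RBF kernel on random data, centered for use in KPCA/KPLS/KFDA
X = np.random.default_rng(0).normal(size=(20, 3))
d2 = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
K = np.exp(-0.5 * d2)
Kc = center_kernel(K)
print(np.allclose(Kc.sum(axis=0), 0))   # True: a centered kernel has zero column sums
```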
6. The Relationship between Kernel Functions and SVMs

SVMs were originally developed by Vapnik and his coworkers, and they have shown promising capability for solving both linear and nonlinear problems. A detailed description of the theory of SVMs can easily be
found in several excellent books and papers [21,43–45]. Functionally, SVMs can be divided into two classes: support vector classification (SVC) and support vector regression (SVR). Herein, we only focus on SVC; a detailed description of SVR can be found in Ref. [20]. For linearly separable cases, the decision function of SVC can be expressed in the following way:

f(x_i) = sign(w^tx_i + b)    (31)

where w is a vector of weights and b is the constant coefficient. In the original input space, the conditions for perfect classification can be described as:

y_i(w^tx_i + b) ≥ 0,  i = 1, …, n.    (32)
SVC performs the classification task by constructing a hyperplane in the input space that differentiates the two classes with maximum margin (Fig. 7a). Therefore, the aim of SVC is to find a vector w and a parameter b, which can be estimated by minimizing ||w||^2. This can be formulated as a quadratic optimization problem:

minimize: (1/2)||w||^2    (33)
subject to y_i(w^tx_i + b) ≥ 1, i = 1, …, n. The SVM thus obtained is referred to as the linear SVC (LSVC). Note that the significant feature of LSVC is that it seeks the "safest" hyperplane, i.e., the one maximizing the margin between the hyperplane and the closest data points of each class (Fig. 8). That is, LSVC can define the "safest" hyperplane to give the best prediction performance. In order to deal with nonlinearity embedded in the data, the kernel function was introduced into SVMs. In most cases, the data are linearly inseparable for one of two reasons: noise contamination, unknown background or man-made mistakes on the one hand, and intrinsic nonlinearity on the other. To deal with these two situations, slack variables and kernel functions were introduced by Cortes, Boser and Vapnik. To allow for a certain training error, slack variables are introduced to account, to some extent, for the inevitable measurement errors in the data. These slack variables are generally associated with the misclassified samples (Fig. 7b). Even though erroneous classification cannot be avoided, the effect of the misclassified samples can be reduced by these slack variables. Thus, the objective of SVMs can be re-expressed in the following form with slack variables ξ_i introduced:

minimize: (1/2)||w||^2 + C Σ_{i=1}^{n} ξ_i
subject to: y_i(w^tx_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, 2, …, n    (34)
Fig. 8. Even when the training set is linearly separable, there is no unique hyperplane differentiating the two classes. However, the support vector machine can define the "safest" hyperplane to give the best generalization performance (see the red line).
where C is a penalizing factor which controls the trade-off between the training error and the margin. As indicated in Section 3.1, Eq. (34) can be re-written in the dual form given by:

maximize: Σ_{i=1}^{n} α_i − (1/2) Σ_{i,j=1}^{n} α_iα_jy_iy_jx_i^tx_j
subject to: Σ_{i=1}^{n} α_iy_i = 0,  0 ≤ α_i ≤ C,  i = 1, …, n    (35)
where α_i and α_j are the optimized Lagrange multipliers. Eq. (35) can be solved by quadratic optimization methods. Thus, the decision function of SVC can be written as

f(x) = sign(w^tx + b) = sign(Σ_{i=1}^{n} y_iα_i⟨x_i, x⟩ + b)    (36)
where α_i (0 < α_i ≤ C) is the solution of the quadratic programming problem in Eq. (35), and b can be computed using the following equation:

b = y_j − Σ_{i=1}^{n} y_iα_i⟨x_i, x_j⟩,  j ∈ {j | 0 < α_j < C}.    (37)

When the data structure is intrinsically nonlinear, the kernel function must be used to extend the linear problem to the nonlinear problem. Thus, when we replace the ⟨x_i, x⟩ term in Eq. (36) with κ(x_i, x), a nonlinear version of SVC can easily be constructed. For SVR, the same tools (i.e., slack variables and the kernel function) can be used.
Fig. 7. Support vector machines for linearly separable (a) and non-separable (b) classification problems. Support vectors and margins are marked by red circles and dotted lines, respectively. In the linearly inseparable case, negative margins are also encountered and are associated with the misclassified samples.
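Eqs. (36) and (37) can be checked directly with any SVM implementation that exposes its dual solution. The sketch below (an illustration added here using scikit-learn, which the paper does not mention) reconstructs the decision values of a linear SVC from its support vectors, dual coefficients y_iα_i and intercept b, and compares them with the library's own output.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)       # a linearly separable toy problem

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_ holds y_i * alpha_i for the support vectors (Eq. (36))
sv = clf.support_vectors_
ya = clf.dual_coef_.ravel()
b = clf.intercept_[0]

x_new = rng.normal(size=(5, 2))
manual = x_new @ sv.T @ ya + b                   # sum_i y_i alpha_i <x_i, x> + b
print(np.allclose(manual, clf.decision_function(x_new)))   # True
```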
As mentioned above, SVMs can also be composed according to the three key components of kernel methods. They are constructed in a similar way to kernel RR except for the loss function (i.e., Eqs. (35) and (2)). The use of the kernel function is one of the chief factors that make the SVM an effective machine learning tool.

7. Simple Comparison of KPLS, KFDA and SVC

It seems quite necessary to show how to solve nonlinear classification problems using the aforementioned three kernel methods, namely SVM, KPLS and KFDA. A simple nonlinear classification example with 50 samples in each class is shown in Fig. 9. As can be seen from Fig. 9, the data points are linearly inseparable unless a nonlinear decision function, such as a conic or an ellipse, is used. So SVM, KPLS and KFDA are used to establish nonlinear models. The results of the three kernel methods with polynomial and Gaussian kernel functions are presented in Fig. 10. Fig. 10 shows that all three methods can classify the data points correctly, although there are slight differences among them. Furthermore, we can also see that two different decision boundaries are obtained with the polynomial and Gaussian kernel functions. The choice of kernel function is very important for generating a correct decision boundary because it closely relates to the underlying structure of the data. Here, for the polynomial kernel function, the degree is chosen as 2 for all three kernel methods. For the Gaussian kernel function, its width is set to 1, 0.6 and 0.8 for SVM, KPLS and KFDA, respectively. In addition, KPLS can also project the data points in the feature space onto a reduced latent variable space where we can directly observe the underlying data structure. The score plot of KPLS is shown in Fig. 11. We can clearly see from Fig. 11 that the data points, which are linearly inseparable in the original input space, have become linearly separable in the latent variable space, which indicates that the PLS model can indeed model the nonlinear relationship in the data by means of kernel functions.
Fig. 10. The decision boundaries of three kernel methods with polynomial and Gaussian kernel functions. SVM: green, KPLS: purple, KFDA: black.
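For readers who want to reproduce a comparison in this spirit, the following sketch (illustrative only; the simulated data and the kernel settings are my own choices, not the exact ones behind Figs. 9 and 10) fits SVC with a degree-2 polynomial kernel and with a Gaussian kernel on a nonlinear two-class problem and reports their accuracies.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Two classes separated by an elliptical (quadratic) boundary, 50 samples each
inner = rng.normal(scale=0.5, size=(50, 2))
angle = rng.uniform(0, 2 * np.pi, 50)
outer = np.c_[2.0 * np.cos(angle), 1.5 * np.sin(angle)] + rng.normal(scale=0.1, size=(50, 2))
X = np.vstack([inner, outer])
y = np.r_[np.zeros(50), np.ones(50)]

for name, clf in [("polynomial, degree 2", SVC(kernel="poly", degree=2, coef0=1.0)),
                  ("Gaussian (RBF)",       SVC(kernel="rbf", gamma=1.0))]:
    clf.fit(X, y)
    print(name, "training accuracy:", clf.score(X, y))
```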
Fig. 9. A simulated classification example.

Fig. 11. The score plot of KPLS.

8. Model Selection

The use of the kernel function means that the kernel algorithms under consideration only need to focus on the selection of a few kernel parameters, without considering the many parameters involved in the mapping process. The choice of kernel parameters therefore has a crucial effect on performance: if one does not choose them properly, one will not achieve the excellent performance reported in many papers. Model selection techniques provide principled ways to select proper model parameters [46–54]. Usually, the candidates for the optimal kernel are prepared using some heuristic rules. In the field of chemometrics, cross validation is usually used to select an optimal model according to different criteria. For cross validation, the training samples are divided into k subsets, each of which has the same number of samples. The kernel algorithm is then trained k times; in the ith iteration, it is trained on all subsets except the ith subset, and the prediction error is then computed for the ith subset. The average of the k errors gives a good estimate of the prediction error [46,49]. The extreme case is called leave-one-out cross validation, where k is equal to the number of training samples. Moreover, the bootstrap is also a principled resampling method which is often used for model selection [53]. Currently, double cross validation is increasingly applied to the estimation of several kinds of chemical models [55,56]. Double cross validation is a much better choice for simultaneously performing model selection and error estimation, compared to "common" cross validation. Double cross validation mainly consists of two nested loops. In the outer loop the total set of samples is split randomly into test sets and calibration sets (one segment as test set, and the others as calibration set). The test set is used to estimate the prediction performance of models made solely from the calibration set. After finishing the outer loop once, the inner loop works with a calibration set as defined in the outer loop. The inner loop again consists of a cross-validation process for finding the
optimum complexity of the model (e.g., the kernel parameters). Once the model is established, one can predict the samples in the test set defined in the outer loop. Different random splits into test sets and calibration sets are repeated many times in the outer loop, so that the prediction performance, as well as its variability, can be better estimated. Furthermore, the variability of the kernel parameters can be assessed from the results, and a final optimum set of kernel parameters can be derived for a final model built from all samples.
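A minimal nested (double) cross-validation loop of this kind can be written with scikit-learn's model selection utilities; the sketch below is an illustration added here (the estimator, the Gaussian-kernel width grid and the fold counts are arbitrary choices, not values from the paper).

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 6))
y = np.sin(X[:, 0]) + 0.3 * X[:, 1] ** 2 + 0.1 * rng.normal(size=120)   # nonlinear toy data

# Inner loop: choose the kernel parameters (here gamma and C of an RBF-kernel SVR)
inner_cv = KFold(n_splits=5, shuffle=True, random_state=1)
param_grid = {"gamma": [0.01, 0.1, 1.0], "C": [1.0, 10.0, 100.0]}
model = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=inner_cv)

# Outer loop: estimate the prediction performance of the whole selection procedure
outer_cv = KFold(n_splits=5, shuffle=True, random_state=2)
scores = cross_val_score(model, X, y, cv=outer_cv, scoring="r2")
print("outer-loop R2: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```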
9. Conclusions

The goal of the present article was to give a detailed introduction to the exciting field of kernel methods. We only briefly touched on kernel theory, omitting many theoretical details, and instead focused on introducing the kernel algorithms. Kernel methods provide an effective approach to the quantitative modeling of complicated nonlinear relationships between X and y. In this paper, three key components of kernel methods, namely the dual form, the nonlinear feature mapping and the kernel function, together with some simple examples, were introduced in detail to give an in-depth understanding of kernel methods. The modularity of kernel methods allows linear algorithms to be combined with any kernel function, so several commonly used chemometric methods can be directly extended to their corresponding kernel versions. It can confidently be expected that these kernel methods, together with support vector machines, will become very promising modeling approaches for nonlinear chemical problems.

Acknowledgements

We would like to thank the reviewers for their useful discussions, comments and suggestions throughout this entire work. This work is financially supported by the National Nature Foundation Committee of P.R. China (Grants No. 20875104, No. 10771217 and No. 20975115), the international cooperation project on traditional Chinese medicines of the Ministry of Science and Technology of China (Grant No. 2007DFA40680), the Hunan Provincial Science and Technology Department (No. 2009GK3095), and the Advanced Research Plan of Central South University (Grant No. 201021200011). The studies meet with the approval of the university's review board.

References

[1] I.E. Frank, Modern nonlinear regression methods, Chemometrics and Intelligent Laboratory Systems 27 (1995) 1–19.
[2] L. Breiman, J.H. Friedman, R.A. Olsen, C.J. Stone, Classification and Regression Trees, Wadsworth International, California, 1984.
[3] J.H. Friedman, Multivariate adaptive regression splines, Annals of Statistics 19 (1991) 1–67.
[4] M. Andersson, A comparison of nine PLS1 algorithms, Journal of Chemometrics 23 (2009) 518–529.
[5] I.A.-R. Araby, J.L. Gino, A nonlinear partial least squares algorithm using quadratic fuzzy inference system, Journal of Chemometrics 23 (2009) 530–537.
[6] B. Walczak, D.L. Massart, The radial basis functions — partial least squares approach as a flexible non-linear regression technique, Analytica Chimica Acta 331 (1996) 177–185.
[7] B. Walczak, D.L. Massart, Application of radial basis functions — partial least squares to non-linear pattern recognition problems: diagnosis of process faults, Analytica Chimica Acta 331 (1996) 187–193.
[8] S. Wold, Nonlinear partial least squares modelling II. Spline inner relation, Chemometrics and Intelligent Laboratory Systems 14 (1992) 71–84.
[9] R. Lombardo, J.F. Durand, R.D. Veaux, Model building in multivariate additive partial least squares splines via the GCV criterion, Journal of Chemometrics 23 (2009) 605–617.
[10] J.H. Friedman, T. Hastie, R. Tibshirani, The Elements of Statistical Learning: Data Mining, Inference and Prediction, Springer-Verlag, New York, 2008.
[11] B.J. Wythoff, Backpropagation neural networks: a tutorial, Chemometrics and Intelligent Laboratory Systems 18 (1993) 115–155.
[12] G. Kateman, Neural networks in analytical chemistry? Chemometrics and Intelligent Laboratory Systems 19 (1993) 135–142.
[13] D. Svozil, V. Kvasnicka, J. Pospichal, Introduction to multi-layer feed-forward neural networks, Chemometrics and Intelligent Laboratory Systems 39 (1997) 43–62.
[14] B. Schölkopf, A.J. Smola, Learning with Kernels, MIT Press, Cambridge, 2002.
[15] J. Shawe-Taylor, N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, Cambridge, 2004.
[16] K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, B. Schölkopf, An introduction to kernel-based learning algorithms, IEEE Transactions on Neural Networks 12 (2001) 181–202.
[17] B. Schölkopf, C.J.C. Burges, A.J. Smola (Eds.), Advances in Kernel Methods: Support Vector Learning, MIT Press, 1999.
[18] T. Czekaj, W. Wu, B. Walczak, About kernel latent variable approaches and SVM, Journal of Chemometrics 19 (2005) 341–354.
[19] V.N. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, New York, 2000.
[20] A.J. Smola, B. Schölkopf, A tutorial on support vector regression, Statistics and Computing 14 (2004) 199–222.
[21] C.J.C. Burges, A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery 2 (1998) 121–167.
[22] E. Mayoraz, E. Alpaydin, Engineering Applications of Bio-Inspired Artificial Neural Networks, Springer, 1999, pp. 833–842.
[23] T. Evgeniou, M. Pontil, T. Poggio, Regularization networks and support vector machines, Advances in Computational Mathematics 13 (2000) 1–50.
[24] O. Ivanciuc, Applications of support vector machines in chemistry, Reviews in Computational Chemistry 23 (2007) 291–400.
[25] J.A.K. Suykens, T.V. Gestel, J.D. Brabanter, B.D. Moor, J. Vandewalle, Least Squares Support Vector Machines, World Scientific, Singapore, 2002.
[26] B. Schölkopf, A.J. Smola, K.-R. Müller, Nonlinear component analysis as a kernel eigenvalue problem, Neural Computation 10 (1998) 1299–1319.
[27] S. Wold, Principal component analysis, Chemometrics and Intelligent Laboratory Systems 2 (1987) 37–52.
[28] I.T. Jolliffe, Principal Component Analysis, Springer, Berlin, 1986.
[29] W. Wu, D.L. Massart, S. de Jong, The kernel PCA algorithms for wide data. Part I: theory and algorithms, Chemometrics and Intelligent Laboratory Systems 36 (1997) 165–172.
[30] W. Wu, D.L. Massart, S. de Jong, Kernel-PCA algorithms for wide data. Part II: fast cross-validation and application in classification of NIR data, Chemometrics and Intelligent Laboratory Systems 37 (1997) 271–280.
[31] S. Mika, B. Schölkopf, A.J. Smola, K.-R. Müller, M. Scholz, G. Rätsch, Kernel PCA and de-noising in feature spaces, Proceedings of the 1998 Conference on Advances in Neural Information Processing Systems II, MIT Press, 1999.
[32] R. Rosipal, L.J. Trejo, Kernel partial least squares regression in reproducing kernel Hilbert space, Journal of Machine Learning Research 2 (2002) 97–123.
[33] S. de Jong, SIMPLS: an alternative approach to partial least squares regression, Chemometrics and Intelligent Laboratory Systems 18 (1993) 251–263.
[34] R. Rosipal, N. Krämer, Overview and recent advances in partial least squares, in: C. Saunders, M. Grobelnik, S. Gunn, J. Shawe-Taylor (Eds.), Subspace, Latent Structure and Feature Selection, Springer, 2006, pp. 34–51.
[35] P.J. Lewi, Pattern recognition, reflections from a chemometric point of view, Chemometrics and Intelligent Laboratory Systems 28 (1995) 23–33.
[36] K. Fukunaga, Handbook of Pattern Recognition and Computer Vision, World Scientific Publishing Co., Inc., 2010, pp. 33–60.
[37] W. Wu, Y. Mallet, B. Walczak, W. Penninckx, D.L. Massart, S. Heuerding, F. Erni, Comparison of regularized discriminant analysis, linear discriminant analysis and quadratic discriminant analysis applied to NIR data, Analytica Chimica Acta 329 (1996) 257–265.
[38] T. Hastie, A. Buja, R. Tibshirani, Penalized discriminant analysis, The Annals of Statistics 23 (1995) 73–102.
[39] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, K.-R. Müller, Fisher discriminant analysis with kernels, 1999, pp. 41–48.
[40] J. Yang, Z. Jin, J.-Y. Yang, D. Zhang, A.F. Frangi, Essence of kernel Fisher discriminant: KPCA plus LDA, Pattern Recognition 37 (2004) 2097–2100.
[41] J. Yang, A.F. Frangi, J.-Y. Yang, D. Zhang, Z. Jin, KPCA plus LDA: a complete kernel Fisher discriminant framework for feature extraction and recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (2005) 230–244.
[42] G. Baudat, F. Anouar, Generalized discriminant analysis using a kernel approach, Neural Computation 12 (2000) 2385–2404.
[43] R. Herbrich, Learning Kernel Classifiers, MIT Press, Cambridge, 2002.
[44] V. Vapnik, A. Chervonenkis, Theory of Pattern Recognition, Nauka, Moscow, 1974.
[45] V. Kecman, Learning and Soft Computing, MIT Press, Cambridge, 2001.
[46] M. Stone, Cross-validatory choice and assessment of statistical predictions, Journal of the Royal Statistical Society. Series B (Methodological) 36 (1974) 111–147.
[47] B. Efron, R. Tibshirani, Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy, Statistical Science 1 (1986) 54–75.
[48] B. Efron, R. Tibshirani, Improvements on cross-validation: the .632+ bootstrap method, Journal of the American Statistical Association 92 (1997) 548–560.
[49] J. Shao, Linear model selection by cross-validation, Journal of the American Statistical Association 88 (1993) 486–494.
[50] J. Shao, Bootstrap model selection, Journal of the American Statistical Association 91 (1996) 655–665.
[51] Q.S. Xu, Y.Z. Liang, Monte Carlo cross validation, Chemometrics and Intelligent Laboratory Systems 56 (2001) 1–11.
[52] Q.S. Xu, Y.Z. Liang, Y.P. Du, Monte Carlo cross-validation for selecting a model and estimating the prediction error in multivariate calibration, Journal of Chemometrics 18 (2004) 112–120.
[53] B. Efron, R. Tibshirani, An Introduction to the Bootstrap, Chapman & Hall/CRC, New York, 1993, 436 pp.
[54] D.J.C. MacKay, Information Theory, Inference and Learning Algorithms, Cambridge University Press, 2002.
[55] P. Filzmoser, B. Liebmann, K. Varmuza, Repeated double cross validation, Journal of Chemometrics 23 (2009) 160–171.
[56] V.E. De Monte, G.M. Geffen, C. May, K.A. McFarland, Double cross-validation and improved sensitivity of the rapid screen of mild traumatic brain injury, Journal of Clinical and Experimental Neuropsychology 24 (2004) 628–644.