A novel kernel Fisher discriminant analysis: Constructing informative kernel by decision tree ensemble for metabolomics data analysis


Analytica Chimica Acta 706 (2011) 97–104


Dong-Sheng Cao a, Mao-Mao Zeng b, Lun-Zhao Yi a, Bing Wang a, Qing-Song Xu c, Qian-Nan Hu d, Liang-Xiao Zhang a, Hong-Mei Lu a, Yi-Zeng Liang a,∗

a Research Center of Modernization of Traditional Chinese Medicines, Central South University, Changsha 410083, PR China
b State Key Laboratory of Food Science and Technology, School of Food Science and Technology, Jiangnan University, Wuxi 214122, PR China
c School of Mathematical Sciences and Computing Technology, Central South University, Changsha 410083, PR China
d Systems Drug Design Laboratory, College of Pharmacy, Wuhan University, Wuhan 430071, PR China

∗ Corresponding author. E-mail address: yizeng [email protected] (Y.-Z. Liang).

Article history: Received 25 March 2011; received in revised form 12 June 2011; accepted 16 August 2011; available online 12 September 2011.

Keywords: Metabolomics; kernel methods; Fisher discriminant analysis (FDA); biomarker discovery; decision tree; classification and regression tree (CART)

Abstract

Data from high-throughput metabolomics experiments are becoming increasingly large and complex, which poses great challenges to existing statistical modeling. There is therefore a need to develop statistically efficient approaches for mining the underlying metabolite information contained in such metabolomics data. In this work, we developed a novel kernel Fisher discriminant analysis (KFDA) algorithm by constructing an informative kernel based on a decision tree ensemble. The constructed kernel can effectively encode the similarities of metabolomics samples between informative metabolites/biomarkers in specific parts of the measurement space. Simultaneously, informative metabolites or potential biomarkers can be discovered by variable importance ranking in the process of building the kernel. Moreover, KFDA can also, to some extent, deal with nonlinear relationships in the metabolomics data through such a kernel. Finally, two real metabolomics datasets, together with a simulated dataset, were used to demonstrate the performance of the proposed approach through comparison with different approaches. © 2011 Published by Elsevier B.V.

1. Introduction

The development of robust, high-throughput analytical techniques, such as 1H nuclear magnetic resonance (NMR) spectroscopy and mass spectrometry (MS), allows the simultaneous measurement of hundreds to thousands of metabolites in a single biological sample [1]. The data generated in metabolomics studies are thus strongly multivariate in character; that is, the included samples are described by a large number of highly correlated variables. This brings considerable difficulty to the analysis of metabolomics data. In fact, the predictive structures or patterns of metabolomics data usually hold between informative metabolites/biomarkers in specific parts of the measurement space. Effectively extracting the patterns hidden in such complex multivariate metabolomics data is therefore necessary for understanding and interpreting dynamic metabolite changes in biological systems [2]. To date, several chemometric approaches, including principal component analysis (PCA) [3,4], partial least squares discriminant analysis (PLSDA) [5], and more recently the orthogonal PLS methodology (OPLS) [6], have been widely used to cope, to a certain extent, with such high-dimensional metabolomics data.

In recent years, kernel methods [7–11] have emerged as an important class of machine learning methods suitable for certain types of problems under investigation. A famous example is support vector machines (SVMs) [12–14], which require kernels to implicitly perform a potentially nonlinear mapping. If we have two objects, such as two human plasma samples measured by NMR, the basic idea behind kernel methods is to construct a kernel to represent the relationship between the samples. Such a kernel inherently measures the similarity between these samples. Some tasks in metabolomics studies, such as classification and cluster analysis, can then be effectively tackled using linear methods based solely on inner products computed via the kernel in the embedding space, rather than in the original input space. The kernel mainly acts as an interface between the input data module and the learning algorithms. In fact, the kernel function used in kernel methods might be regarded as a general protocol for dealing with certain types of tasks in metabolomics studies. The selection of the kernel can be shown to correspond, in a very tight sense, to the encoding of our prior knowledge about the data and the types of patterns we can expect to identify [15]. Selecting the best kernel among this extensive range of possibilities thereby becomes the most critical stage in applying kernel-based algorithms in practice. To ensure that a kernel function is a valid kernel for a certain embedding space, it must be mathematically positive semi-definite [16], i.e., the kernel matrix needs to be symmetric and must not have negative eigenvalues. To date, a large number of kernels [17,18], including graph kernels, tree kernels, string kernels, spectrum kernels, etc., have been skillfully devised to meet different scientific tasks at hand.

In metabolomics studies, researchers may be interested in the proximity measure determined by the subset of metabolites that are the most informative for classification. In the present study, we attempt to obtain such a proximity measure hidden in complex multivariate metabolomics data by constructing an informative kernel based on a decision tree ensemble. The constructed kernel can effectively encode the similarities of metabolomics samples between informative metabolites/biomarkers in specific parts of the measurement space. Simultaneously, informative metabolites or potential biomarkers can be discovered by means of variable importance ranking in the process of building the kernel. With the help of this kernel, a novel kernel Fisher discriminant analysis (KFDA) algorithm is then developed strictly under the theoretical framework of kernel methods. Finally, two real metabolomics datasets together with a simulated dataset are used to test the performance of the proposed KFDA approach. The results obtained show that the KFDA approach is an attractive alternative technique for complex multivariate metabolomics data analysis.

2. Theory and calculations

2.1. Kernel Fisher discriminant analysis

Suppose that we are given a sample matrix X of size n × p. Each row denotes a sample and each column a variable. The corresponding class labels are recorded in a vector y of size n × 1 with elements equal to −1 and +1 for binary classification. For binary classification, the Fisher discriminant function can be expressed as:

$$ y = \operatorname{sgn}(\mathbf{w}^{t}\mathbf{x} + b) \qquad (1) $$

where w is a weight vector and b an appropriate offset; sgn(·) is the function returning the sign of its argument. The aim of Fisher discriminant analysis (FDA) [19,20] is to find a linear projection of the data onto a one-dimensional subspace such that the classes are well separated according to a certain measure of separability. This can be achieved by maximizing the following regularized Fisher criterion:

$$ J(\mathbf{w}) = \frac{(\mu_{w}^{+} - \mu_{w}^{-})^{2}}{(\sigma_{w}^{+})^{2} + (\sigma_{w}^{-})^{2} + \lambda\,\|\mathbf{w}\|^{2}} \qquad (2) $$

where μ+_w is the mean of the projection of the positive samples onto the direction w, μ−_w the mean for the negative samples, and σ+_w, σ−_w the corresponding standard deviations; λ is the regularization parameter. First observe that Eq. (2) is invariant under rescalings of the weight vector w, so that we can constrain the denominator to have a fixed value r. Thus, using a Lagrange multiplier ν, the regularized Fisher criterion can be rewritten as:

$$ J(\mathbf{w}) = \mathbf{y}^{t}X\mathbf{w} - \frac{\nu}{2}\,\mathbf{w}^{t}X^{t}BX\mathbf{w} + \frac{\nu}{2}\left(r - \lambda\,\mathbf{w}^{t}\mathbf{w}\right) \qquad (3) $$

Here, B can be computed by the following formula:

$$ B = D - C^{+} - C^{-} \qquad (4) $$

where D is a diagonal matrix with entries

$$ D_{ii} = \begin{cases} 2n_{-}/n & \text{if } y_{i} = +1 \\ 2n_{+}/n & \text{if } y_{i} = -1 \end{cases} \qquad (5) $$

and C+ and C− are given by

$$ C^{+}_{ij} = \begin{cases} 2n_{-}/(n\,n_{+}) & \text{if } y_{i} = +1 = y_{j} \\ 0 & \text{otherwise} \end{cases} \qquad (6) $$

and

$$ C^{-}_{ij} = \begin{cases} 2n_{+}/(n\,n_{-}) & \text{if } y_{i} = -1 = y_{j} \\ 0 & \text{otherwise} \end{cases} \qquad (7) $$

where n+ and n− are the numbers of positive and negative samples, respectively. Taking derivatives with respect to w, we obtain:

$$ X^{t}\mathbf{y} - \nu X^{t}BX\mathbf{w} - \nu\lambda\,\mathbf{w} = 0 \qquad (8) $$

According to the dual form [21], the optimal discriminant vector w can be expressed as a linear combination of the training samples in the original input space:

$$ \mathbf{w} = \sum_{i=1}^{n}\alpha_{i}\mathbf{x}_{i} = X^{t}\boldsymbol{\alpha} \qquad (9) $$

where α is called the dual variable. Substituting Eq. (9) for w in Eq. (8) and rescaling the dual variable α by ν, we obtain the solution for the dual variable as:

$$ \boldsymbol{\alpha} = (BK + \lambda I)^{-1}\mathbf{y} \qquad (10) $$

where K = XX^t is the linear kernel matrix. The corresponding classification function can be expressed as:

$$ f(\mathbf{x}) = \operatorname{sgn}\!\left(\mathbf{k}^{t}(BK + \lambda I)^{-1}\mathbf{y} + b\right) \qquad (11) $$

where k is the vector of kernel evaluations between the new sample x and the training samples. The value of b is chosen so that $\mathbf{w}^{t}\mu_{w}^{+} - b = b - \mathbf{w}^{t}\mu_{w}^{-}$, that is, so that the decision boundary bisects the line joining the two centers of mass. Taking the weight vector w = X^t α, we obtain:

$$ b = \frac{1}{2}\,\boldsymbol{\alpha}^{t}K\mathbf{c} \qquad (12) $$

where c is the vector with entries

$$ c_{i} = \begin{cases} 1/n_{+} & \text{if } y_{i} = +1 \\ 1/n_{-} & \text{if } y_{i} = -1 \end{cases} \qquad (13) $$
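To make the above derivation concrete, the following minimal NumPy sketch (our own illustration, not the authors' Matlab implementation) solves the dual problem of Eqs. (4)–(13) for an arbitrary precomputed kernel matrix. The function name kfda_dual_solve and the sign convention used for the offset are our own choices; the linear kernel K = XX^t of this subsection appears only in the usage comments.

```python
import numpy as np

def kfda_dual_solve(K, y, lam=1.0):
    """Dual regularized FDA for a precomputed n x n kernel K and labels y in {-1, +1}.
    Returns the dual vector alpha (Eq. (10)) and the offset b (Eqs. (12)-(13))."""
    n = len(y)
    n_pos, n_neg = np.sum(y == 1), np.sum(y == -1)
    pos, neg = (y == 1).astype(float), (y == -1).astype(float)

    # Diagonal matrix D, Eq. (5), and the class blocks C+ and C-, Eqs. (6)-(7)
    D = np.diag(np.where(y == 1, 2.0 * n_neg / n, 2.0 * n_pos / n))
    B = D - 2.0 * n_neg / (n * n_pos) * np.outer(pos, pos) \
          - 2.0 * n_pos / (n * n_neg) * np.outer(neg, neg)        # Eq. (4)

    alpha = np.linalg.solve(B @ K + lam * np.eye(n), y.astype(float))  # Eq. (10)

    # Offset placing the boundary midway between the projected class means
    c = np.where(y == 1, 1.0 / n_pos, 1.0 / n_neg)                 # Eq. (13)
    b = 0.5 * alpha @ (K @ c)                                      # Eq. (12)
    return alpha, b

# Usage with the linear kernel of Section 2.1 (X, y, X_new assumed given as
# NumPy arrays). The row X_new @ X.T holds the kernel evaluations k of Eq. (11);
# the offset is subtracted so that the boundary bisects the projected means.
# K = X @ X.T
# alpha, b = kfda_dual_solve(K, y, lam=1.0)
# y_pred = np.sign(X_new @ X.T @ alpha - b)
```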

So far, we have obtained the general solution form of KFDA based on the linear kernel. In fact, the modularity of kernel methods [7,22] indicates that we can consider the modeling method separately from the choice of kernel function. Clearly, we can use any suitable kernel to represent the data being considered. Thus, different KFDA models can easily be established by constructing a different kernel matrix in place of the linear kernel matrix K = XX^t.

2.2. Constructing informative kernel by decision tree ensemble

The decision tree, or classification and regression tree (CART), proposed by Breiman et al. [23], is a nonparametric statistical technique that has been widely used in data mining. For classification tasks, the goal of a decision tree is to divide the total data space into several high class-purity segments by selecting some useful variables from a large pool of variables. That is, on the one hand, it can easily select informative metabolites/biomarkers from the hundreds to thousands of measured metabolites. On the other hand, it can also recursively divide the given total sample space into several rectangular areas with specific similarity (e.g., the samples under the same terminal node). In fact, a decision tree carries out a thorough but not optimal search in two spaces (i.e., sample and variable) to seek the variable subsets most relevant to classification and the sample subsets with specific similarity under different variable subspaces [24,25]. Inspired by these features of the decision tree, we constructed an informative kernel matrix based on the random selection of small sample populations via Monte-Carlo selection, followed by the application of the decision tree algorithm.

A kernel matrix K of size n × n with all elements equal to 0 is initially generated. The kernel matrix is then constructed as follows. To begin with, with the help of the Monte-Carlo procedure, the samples of each class in the training set are randomly divided into two parts, training samples and validation samples, respectively. The training samples of each class are combined to obtain the final training set, e.g., (Xtrain, ytrain). Accordingly, the validation samples of each class are combined to obtain the final validation set, e.g., (Xval, yval). Note that the total training set is further divided into a new training set and a validation set in this step, which effectively guarantees that the samples of each class are evenly divided into the two parts. Generally speaking, the size of the training set varies from 40% to 80% of all the samples. Then, the obtained training samples are first used to grow a classification tree, and the validation samples are used to prune the overgrown classification tree to obtain the optimal pruning level (e.g., Lbest). Instead of obtaining an optimal tree, a suboptimal tree, which lies between the optimal tree and the overgrown tree, is constructed using a fuzzy pruning strategy: randomly generate a pruning level L between 1 and Lbest, and then prune the overgrown tree by L. The fuzzy pruning strategy helps to effectively exploit the information of internal nodes without totally destroying the structure of the tree. After that, all samples, e.g., X, are predicted by the suboptimal tree, and each sample thereby falls into one of the terminal nodes. It may be worth noting that the samples under the same terminal node may possess some specific similarity to some extent, rather than only class similarity. If two samples i and j turn up in the same terminal node, the sample proximity measure K(i, j) is increased by one. Herein, K can be considered a classification tree-based similarity measure and reflects the degree of similarity among the training samples. We can repeat the above process many times (e.g., ntree) so that a large number of tree models are established, and the kernel matrix is updated accordingly by the results of each tree. At the end of the KFDA construction, the kernel matrix is normalized by dividing by the number of trees. Note that the proximity between a sample and itself is always set to one (i.e., K(i, i) = 1).

Likewise, when s unknown samples arrive, we can obtain the proximity measure between training samples and new samples by constructing a predictive kernel matrix as follows. A predictive kernel matrix Kt of size n × s with all elements equal to 0 is first generated. Then these new samples are predicted by the ntree established trees, and each new sample falls into one of the terminal nodes in each tree. Likewise, if the new sample i and some training sample j turn up in the same terminal node, the predictive kernel matrix Kt(i, j) is increased by one. Kt(i, j) reflects the degree of similarity between new sample i and training sample j: the bigger Kt(i, j) is, the more similar new sample i and training sample j are.
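The following scikit-learn/NumPy sketch illustrates, under simplifying assumptions, how such a tree-ensemble kernel could be assembled; it is not the authors' implementation. In particular, CART growing with validation-based optimal and fuzzy pruning is approximated here by cost-complexity pruning (ccp_alpha), with the pruning strength drawn at random between zero and the value preferred by the validation split; all function names are ours.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def build_tree_kernel(X, y, ntree=400, ratio=0.5, rng=None):
    """Return the n x n training kernel K and the list of fitted trees."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    K = np.zeros((n, n))
    trees = []
    for _ in range(ntree):
        # Monte-Carlo split, stratified by class, into training/validation parts
        train_idx, val_idx = [], []
        for label in np.unique(y):
            members = np.flatnonzero(y == label)
            rng.shuffle(members)
            cut = max(1, int(ratio * members.size))
            train_idx.extend(members[:cut]); val_idx.extend(members[cut:])
        train_idx, val_idx = np.array(train_idx), np.array(val_idx)

        # Grow an overgrown tree, find the pruning strength preferred by the
        # validation samples, then draw a "fuzzy" strength at random below it
        path = DecisionTreeClassifier().cost_complexity_pruning_path(X[train_idx], y[train_idx])
        scores = [DecisionTreeClassifier(ccp_alpha=a).fit(X[train_idx], y[train_idx])
                  .score(X[val_idx], y[val_idx]) for a in path.ccp_alphas]
        best = path.ccp_alphas[int(np.argmax(scores))]
        fuzzy = rng.uniform(0.0, best) if best > 0 else 0.0
        tree = DecisionTreeClassifier(ccp_alpha=fuzzy).fit(X[train_idx], y[train_idx])
        trees.append(tree)

        # All samples sharing a terminal node are counted as similar
        leaves = tree.apply(X)
        K += (leaves[:, None] == leaves[None, :]).astype(float)

    K /= ntree
    np.fill_diagonal(K, 1.0)     # proximity of a sample with itself is one
    return K, trees

def tree_kernel_between(trees, X_train, X_new):
    """Predictive kernel Kt of size n x s between training and new samples."""
    Kt = np.zeros((X_train.shape[0], X_new.shape[0]))
    for tree in trees:
        a, b = tree.apply(X_train), tree.apply(X_new)
        Kt += (a[:, None] == b[None, :]).astype(float)
    return Kt / len(trees)
```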
2.3. Variable importance ranking

Based on the kernel constructed above, KFDA also allows the measurement of variable importance. KFDA mainly incorporates the idea of ensemble variable selection. Since the kernel based on the decision tree ensemble is constructed from the combination of a set of decision trees, the variable importance in KFDA can be obtained by averaging the variable importance over all decision trees:

$$ J(i) = \frac{1}{ntree}\sum_{b=1}^{ntree} J_{M_{b}}(i) \qquad (14) $$

where ntree is the number of trees in the process of building the kernel, M_b denotes the bth decision tree, and J_{M_b}(i) is the importance of the ith variable in the bth decision tree. As each tree model is established, the importance of a variable in this tree is determined by the decrease of impurity across the tree for all non-terminal nodes that use this variable as a splitter. All importance values of a variable in KFDA are then averaged to obtain its final variable importance measure. Ensemble variable selection effectively guarantees that the variables selected by KFDA reflect a real contribution to classification rather than chance (e.g., chance correlations). The variable importance given by KFDA is thereby more stable and reliable than that of a single decision tree.

Finally, the KFDA procedure is summarized in the following steps (a schematic sketch follows the list):

1. Compute the matrices K and Kt.
2. Compute the matrix B using Eqs. (4)–(7).
3. Compute b using Eqs. (12) and (13).
4. Given an appropriate regularized parameter λ, predict the new test samples using Eq. (11) and obtain the variable importance ranking using Eq. (14).
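The sketch below strings these four steps together. It reuses the hypothetical helpers kfda_dual_solve (Section 2.1 sketch) and build_tree_kernel/tree_kernel_between (Section 2.2 sketch), so it is an illustration of the procedure rather than the authors' code; scikit-learn's normalized impurity-decrease importances stand in for J_Mb(i) in Eq. (14).

```python
import numpy as np

def kfda_fit_predict(X_train, y_train, X_test, lam=1.0, ntree=400, ratio=0.5):
    # Step 1: training kernel K (n x n) and predictive kernel Kt (n x s)
    K, trees = build_tree_kernel(X_train, y_train, ntree=ntree, ratio=ratio)
    Kt = tree_kernel_between(trees, X_train, X_test)

    # Steps 2-3: matrix B, dual vector alpha (Eq. (10)) and offset b (Eq. (12)),
    # all handled inside the dual solver sketched in Section 2.1
    alpha, b = kfda_dual_solve(K, y_train, lam=lam)

    # Step 4: predict the test samples (Eq. (11)) and rank the variables (Eq. (14))
    y_pred = np.sign(Kt.T @ alpha - b)
    importance = np.mean([t.feature_importances_ for t in trees], axis=0)
    return y_pred, importance
```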

In KFDA, two parameters related to the Monte-Carlo procedure need to be set. Generally, the best choice of these parameters greatly depends on the data at hand; however, a rough guideline is also very useful for the practical application of KFDA. (1) The number (ntree) of Monte Carlo experiments (i.e., the number of trees grown). Theoretically, the fewer samples are selected randomly from the entire sample set, the more repeats are needed; however, it has been proven that ntree = n^2 is generally enough for the Monte Carlo strategy to perform well, according to Zhang [26]. In our experience, the similarity of samples is dominated by the inherent features of the data when ntree is about (5–10) × n. (2) The ratio R of the training samples to the total samples. The value of R regulates the prediction reliability of each tree model and the diversity of tree structures in KFDA. A small value of R may decrease the reliability of the tree models in KFDA, which then cannot reflect the true rules of the data. Generally speaking, a moderate value of R (i.e., 0.4–0.8) can simultaneously ensure the diversity and the prediction reliability of the KFDA approach. The ratio R can be determined together with the regularized parameter λ by cross validation and grid-based search. The KFDA algorithm used in the present study, together with the other programs, was written in Matlab 7.0 and run on a personal computer (Intel Pentium 4 processor, 2.66 GHz, 2024 MB RAM).

3. Datasets

To illustrate the performance of the proposed approach, a simulated dataset, derived from two classes (i.e., iris virginica and iris versicolor) of the iris data, is first designed to test the KFDA method. In the original iris data, 150 iris samples are divided into three classes, each of which includes 50 samples, and 4 informative variables are used to characterize these samples. In our study, we selected only two classes of the iris data as our research object (i.e., Xo). Additionally, another 200 random noise variables (i.e., Xn) are added to this data as uninformative variables; each uninformative variable is generated by the rand function in Matlab. Thus, the new iris data is X = [Xo, Xn].

Additionally, two real metabolomics datasets are used to demonstrate the proposed approach. The first metabolomics dataset is the impaired fasting glucose data. Blood samples were all collected in the Diabetes Center of the Second Xiangya Hospital of Central South University in Changsha. 34 male patients with fasting blood sugar concentrations in the range of 5.6–7.0 mmol L−1 were enrolled in the impaired fasting glucose group, while 24 male people whose fasting blood sugar concentrations ranged from


Fig. 1. The score plots of simulated data. (A) the original iris data; (B) the new iris data.

3.9 to 5.6 mmol L−1 were considered as healthy controls. Subsequently, all the plasma samples were analyzed by a Shimadzu GC2010A (Kyoto, Japan) gas chromatography instrument coupled to a GC/MS-QP2010 mass spectrometer (Compaq-Pro Linear data system, class 5K software). Finally, 26 compounds are considered as endogenous metabolites to represent each plasma sample. A detailed description of the data pretreatment can be found in our previous work [27]. The second metabolomics dataset is the human diabetes data. Human plasma samples were collected from 45 healthy adults (age range 24–78) and 78 patients with type 2 diabetes (age range 26–67). All type 2 diabetic patients were from the Xiangya Hospital of Hunan, China, with a fasting plasma glucose concentration above 7.0 mmol L−1. All clinical experiments were approved by the Xiangya Institutional Human Subjects Committee. GC–MS analyses of the plasma samples were performed on a Shimadzu GC2010A (Kyoto, Japan) gas chromatography instrument coupled to a GCMS-QP2010 mass spectrometer (Compaq-Pro Linear data system, class 5K software). Finally, 21 fatty acids were identified and quantified by chemometric methods, including heuristic evolving latent projection (HELP) [28], selective ion analysis (SIA), etc. The dataset hence contains 123 samples measured on 21 variables (20 fatty acids and the total fatty acid computed as the sum of all 20 fatty acids). For more details on the data, please refer to the previous work [29].

4. Results and discussion

4.1. Simulated data

For the simulated datasets, PCA is first used to observe clustering trends in the two iris datasets. Fig. 1 shows the PCA score plots for the original data and the new data, respectively. From Fig. 1A, one can see that the two classes show a good separation trend for the original iris data without uninformative variables, although there is some overlap between the two classes. From Fig. 1B one can find that when 200 random noise variables are added to the iris data, the PCA results deteriorate markedly. We cannot see a very significant

separation trend: in the new iris data, the two classes overlap relatively severely. This seems to indicate that the inclusion of uninformative variables severely destroys the patterns hidden in the data. For the new iris data, the patterns of the data are hidden in the subspace of the 4 informative variables. Now, we attempt to compare the performance of different algorithms with our proposed KFDA approach. To check the validity of the newly proposed KFDA approach in metabolomics studies, linear Fisher discriminant analysis (LFDA) and partial least squares discriminant analysis (PLSDA), as well as their kernel versions induced by the radial basis kernel function (KFDA-RBF and PLSDA-RBF), were also used to classify the two datasets. All parameters in each algorithm are determined by a grid-based search strategy within a specific range (see Supporting Information). Five-fold cross validation is used to evaluate the prediction performance of each algorithm. Additionally, different pretreatment methods are applied to these algorithms. For KFDA, no pretreatment step is needed, mainly owing to the use of the decision tree algorithm. For LFDA and PLSDA, autoscaling is used as the pretreatment method; for KFDA-RBF and PLSDA-RBF, centering is used. The predictive results for the five different classification approaches applied to the simulated data are listed in Table 1, from which one can see that all five methods obtain satisfactory prediction results for the original iris data. When the data are not contaminated by noisy variables, all five methods are able to find the patterns hidden in the data. For the new iris data including uninformative variables, one can see from Table 1 that LFDA gives very poor prediction results, which seem to be no better than random guessing. A main reason is that LFDA is an averaging technique, and the informative variables have been averaged with the irrelevant variables when noisy variables are not removed. Thus, the inclusion of many irrelevant or uninformative variables makes LFDA unable to find the patterns of the data. Compared to LFDA, PLSDA gives better predictions because PLSDA can extract useful information for classification by selecting only the initial few latent variables. In this process, PLSDA has in fact performed a denoising step analogous to PCA by selecting those initial few latent variables. However, those new latent variables are still linear combinations of all


Table 1. The predictive results of simulated data sets based on five-fold cross validation.

Original iris data:
Methods      Acc             Se             Sp             MCC
LFDA         97.00(3/100)    98.00(2/50)    96.00(3/50)    94.02
PLSDA        97.00(3/100)    98.00(2/50)    96.00(3/50)    94.02
KFDA-RBF     97.00(3/100)    98.00(2/50)    96.00(3/50)    94.02
PLSDA-RBF    97.00(3/100)    98.00(2/50)    96.00(3/50)    94.02
KFDA         97.00(3/100)    98.00(2/50)    96.00(3/50)    94.02

New iris data (with 200 uninformative variables):
Methods      Acc             Se             Sp             MCC
LFDA         54.00(46/100)   52.00(24/50)   56.00(22/50)   0.080
PLSDA        77.00(23/100)   74.00(13/50)   80.00(10/50)   54.10
KFDA-RBF     –               –              –              –
PLSDA-RBF    –               –              –              –
KFDA         91.00(9/100)    92.00(4/50)    90.00(5/50)    82.02

Acc = accuracy, Se = sensitivity, Sp = specificity, MCC = Matthews correlation coefficient. All results are in percents.
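For reference, the four figures of merit reported in Tables 1–4 follow the standard definitions for binary classification; a minimal sketch of their computation (our own illustration, assuming the positive class is coded as +1 and the negative class as −1) is given below.

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity and Matthews correlation coefficient."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == -1) & (y_pred == -1))
    fp = np.sum((y_true == -1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == -1))
    acc = (tp + tn) / (tp + tn + fp + fn)
    se = tp / (tp + fn)                       # sensitivity
    sp = tn / (tn + fp)                       # specificity
    mcc = (tp * tn - fp * fn) / np.sqrt(
        float(tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return acc, se, sp, mcc
```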

variables, including uninformative ones. Thus, PLSDA is inevitably affected by those uninformative variables to a certain degree. For KFDA-RBF and PLSDA-RBF, the kernel matrix based on the RBF has been totally destroyed by the inclusion of the uninformative variables; we therefore failed to obtain KFDA-RBF and PLSDA-RBF models with any predictive ability, so variable selection approaches are very necessary for KFDA-RBF. However, when our proposed KFDA method is used to classify the new iris data, very satisfactory prediction results are obtained: prediction accuracy of 91.00%, sensitivity of 92.00%, specificity of 90.00% and Matthews correlation coefficient of 82.02%, respectively. On the whole, these results are in accordance with those from the original iris data. This sufficiently indicates that our proposed KFDA method is largely unaffected by the presence of a large number of uninformative variables. A main reason is that only the variables useful for classification are used, and the useless variables are neglected, through the establishment of a large number of decision tree models in the process of building the kernel. As mentioned in Section 2.2, a decision tree carries out a thorough but not optimal search in two spaces (i.e., sample and variable) to seek the variable subsets most relevant to classification and the sample subsets with specific similarity under different variable subspaces. Thus, a large number of tree models can effectively summarize this type of similarity between samples and simultaneously provide the importance of each variable. The kernel matrix thus obtained can thereby capture the patterns hidden in the subspace of the data. That is to say, our kernel matrix based on the decision tree ensemble is constructed in the subspace of informative variables, rather than of all variables. Finally, the variable importance for the new iris data is shown in Fig. 2. From this plot, one can clearly see that most of the uninformative variables have very small variable importance values (even equal to zero), and only the second and third variables give very high variable importance values. This sufficiently demonstrates that the predictions of our proposed

KFDA mainly make use of the information in variables 2 and 3 and neglect (even totally discard) the information in the noisy variables.

4.2. Impaired fasting glucose data

Two different patterns of samples, namely impaired fasting glucose patients and healthy adults, exist in the impaired fasting glucose data. The newly developed KFDA algorithm is employed to classify these data. For this dataset, the number of decision trees is manually set to ntree = 400, and the ratio R and the regularized parameter λ are optimized by grid search (R = 0.5, λ = 11). No pretreatment steps (e.g., normalization) of the variables are needed before KFDA is run. With these settings, 400 decision trees are independently grown and randomly pruned by the Monte-Carlo sampling strategy. For each tree, fifty percent of the training data are used to grow the tree and the remaining samples are used for pruning based on the fuzzy pruning strategy. The fuzzy pruning strategy not only ensures the diversity of the classification trees constructed, but also effectively explores the information of the inner nodes of the trees. Thus, 400 classification tree models with different tree structures, which sufficiently explore the inner relationships of the impaired fasting glucose data, are finally constructed. Based on these trees, the kernel matrix K is generated to collect the training sample similarity information, and the predictive matrix Kt collects the similarities between training samples and test samples. Based on the kernel matrix constructed, KFDA gives prediction accuracy of 94.83%, sensitivity of 87.50%, specificity of 100%, and Matthews correlation coefficient of 89.67%, respectively. Likewise, as a comparison, the prediction results for the five different approaches applied to the impaired fasting glucose data are listed in Table 2. One can see from Table 2 that the five classification approaches all obtain the same prediction accuracy, and LFDA, PLSDA, KFDA-RBF and PLSDA-RBF give identical results. However, the sensitivity and specificity of KFDA differ somewhat from those of the other approaches, which may indicate that the proposed KFDA approach relies on a classification mechanism different from that of the other approaches. Additionally, it is worth noting that our predictions are based on the information of only two or three variables: KFDA mainly uses its built-in variable selection to lower the influence of irrelevant variables, and thereby obtains satisfactory predictions analogous to those of the other classification approaches.
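Purely as an illustration of how these settings map onto the sketches of Section 2, the snippet below passes them to the hypothetical kfda_fit_predict helper; the variables X_train, y_train and X_test are assumed to hold the 26 measured metabolites and the class labels, and nothing here is the authors' actual code.

```python
# Hypothetical usage with the settings reported above for the impaired fasting
# glucose data (ntree = 400, R = 0.5, lambda = 11); no pretreatment of the
# variables is applied beforehand. X_train, y_train, X_test are assumed given.
import numpy as np

y_pred, importance = kfda_fit_predict(X_train, y_train, X_test,
                                      lam=11.0, ntree=400, ratio=0.5)
ranking = np.argsort(importance)[::-1]   # most important metabolites first
```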

Fig. 2. Variable importance obtained by KFDA for the new iris data.

Table 2. The predictive results of impaired fasting glucose data based on five-fold cross validation.

Methods      Acc            Se             Sp             MCC
LFDA         94.83(3/58)    95.83(1/24)    94.12(2/34)    89.70
PLSDA        94.83(3/58)    95.83(1/24)    94.12(2/34)    89.70
KFDA-RBF     94.83(3/58)    95.83(1/24)    94.12(2/34)    89.70
PLSDA-RBF    94.83(3/58)    95.83(1/24)    94.12(2/34)    89.70
KFDA         94.83(3/58)    87.50(3/24)    100(0/34)      89.67

Acc = accuracy, Se = sensitivity, Sp = specificity, MCC = Matthews correlation coefficient. All results are in percents.


Fig. 3. Variable importance obtained by KFDA for the impaired fasting glucose data.
Fig. 4. Scatter plot of the impaired fasting glucose data with two selected variables.

In the comparison process, all parameters in each algorithm are determined by a grid-based search strategy within a specific range; they are all listed in the Supporting Information. Fig. 3 shows the variable importance calculated by KFDA. From the plot, we can clearly see that only two metabolites (1-monostearin and phosphoric acid) are very predictive of the impaired fasting glucose data. That is to say, we used these metabolites to obtain the prediction accuracy of 94.83%. 1-Monostearin can be formed biochemically via release of a fatty acid from diacylglycerol. A univariate t-test gives a p value of 0 (<1 × 10−6) for this metabolite. It has been reported that 1-monostearin is negatively correlated with blood glucose, with a large Pearson's correlation [27], and could thereby be considered a possible biomarker. Phosphoric acid is an essential component of life, especially for ATP, and thereby plays important roles in some biochemical processes [30]. A univariate t-test gives a p value of 2.5 × 10−3 for phosphoric acid, indicating that the concentrations of phosphoric acid are statistically significantly different between healthy controls and impaired fasting glucose patients. Several papers [31–33] have reported the important role of phosphoric acid in the saccharine disease process. When only the two selected metabolites are used to visualize the data, we obtain the scatter plot of Fig. 4. One can clearly see from Fig. 4 that a significant separation trend can be observed with only these two metabolites. This sufficiently demonstrates that the variables selected by KFDA are indeed informative.

4.3. Human diabetes data

For the human diabetes data, the number of decision trees is manually set to ntree = 500, and the ratio R and the regularized parameter λ are optimized as R = 0.8 and λ = 100. With these settings, 500 classification trees with different structures are finally constructed to obtain the kernel matrix and the variable importance of the metabolites. The predictive results of the five different classification approaches are listed in Table 3; the optimization of the parameters of the five classification approaches can be found in the Supporting Information. From Table 3, one can see that LFDA gives the worst prediction results. This is mainly caused by the fact that the LFDA model is built on the full dataset including all metabolites; some irrelevant variables very probably introduce so much noise that a good classification of the data cannot be obtained. PLSDA gives better prediction results compared to LFDA. KFDA-RBF and PLSDA-RBF, as nonlinear modeling approaches, can capture the nonlinear structure of

the data and thereby improve the prediction accuracy compared to the two linear modeling approaches. Their good results seem to indicate the existence of nonlinear structure in the human diabetes data. However, PLSDA-RBF gives better results than KFDA-RBF; a probable reason is that PLSDA-RBF performs a denoising step by selecting the first few latent variables in the kernel mapping space. Among all these modeling approaches, one can clearly see that KFDA outperforms the other classification approaches and gives prediction accuracy of 85.37%, sensitivity of 82.22%, specificity of 87.18%, and Matthews correlation coefficient of 68.80%, respectively. Such results are not surprising because KFDA can, to a large extent, not only deal with the nonlinear structure of the metabolomics data through the constructed kernel matrix, but also select the more informative metabolites from the pool of variables by variable importance ranking. In fact, the kernel matrix in KFDA provides a very effective proximity measure, determined by the subset of metabolites that are most informative for classification. Intuitively, this kernel matrix has two advantages: it is "supervised", because the class information dictates the structure of the trees in the process of building them; and, because irrelevant metabolites contribute little to the tree ensemble, they have little influence on the kernel matrix. The kernel matrix thus constructed in KFDA can thereby give a more intrinsic measure of the similarities between samples. Moreover, this kernel can also, to a large extent, deal with nonlinear problems in the metabolomics data by virtue of the nonlinear ability of the decision tree. The variable importance calculated by KFDA is shown in Fig. 5, from which one can see that nine fatty acids (C12:0, C14:0, C16:0, C16:1n-7, C18:0, C18:1n-9, C20:4, C20:5, C22:4) are very predictive of the human diabetes data. The potential biomarkers, especially C16:0 (palmitic acid), C18:0 (stearic acid) and C18:1n-9 (oleic acid), are all very important bioactive molecules. They are not only the main energy source as nutrients, but also signaling molecules in various cellular processes [29]. The long-term high level of these

Table 3. The predictive results of human diabetes data based on five-fold cross validation.

Methods      Acc              Se             Sp             MCC
LFDA         76.42(29/123)    91.11(4/45)    67.95(25/78)   55.99
PLSDA        79.67(25/123)    95.56(2/45)    70.51(29/78)   63.82
KFDA-RBF     82.93(21/123)    75.56(11/45)   87.18(10/78)   63.04
PLSDA-RBF    84.55(19/123)    73.33(12/45)   91.03(7/78)    66.17
KFDA         85.37(18/123)    82.22(8/45)    87.18(10/78)   68.80

Acc = accuracy, Se = sensitivity, Sp = specificity, MCC = Matthews correlation coefficient. All results are in percents.


Table 4. The predictive results of two real metabolomics data sets based on five-fold cross validation and individual test set.

Five-fold cross validation (CV):
Methods      Impaired fasting glucose data   Human diabetes
LFDA         97.37(1/38)                     75.00(22/88)
PLSDA        97.37(1/38)                     78.41(19/88)
KFDA-RBF     97.37(1/38)                     82.95(15/88)
PLSDA-RBF    97.37(1/38)                     84.09(14/88)
KFDA         97.37(1/38)                     85.23(13/88)

Independent test set:
Methods      Impaired fasting glucose data   Human diabetes
LFDA         95.00(1/20)                     74.29(9/35)
PLSDA        95.00(1/20)                     77.14(8/35)
KFDA-RBF     95.00(1/20)                     80.00(7/35)
PLSDA-RBF    95.00(1/20)                     82.86(6/35)
KFDA         95.00(1/20)                     85.70(5/35)

All results are in percents.

fatty acids in plasma may contribute more to lipotoxicity than other nonesterified fatty acids, given their great influence on the discrimination between diabetes and controls. Several reports [34–36] have described their relationship with insulin sensitivity, substrate oxidation, and so on. After the nine potential biomarkers are selected, PCA is employed to visualize the separation trend of the samples. Before performing PCA, all data are first standardized to zero mean and unit variance. The score plots for these data with all variables and with the selected variables are shown in Fig. 6, from which one can see that no separation can be observed between the two groups when all variables are used (see Fig. 6A). The first two scores can only explain up to 27.95% of the total variance of the data, and from this score plot we cannot obtain useful information to explain the metabolite changes in the two patterns. When the nine selected metabolites are used to represent the data, a significant separation can easily be observed in Fig. 6B. This seems to indicate that these selected variables are more informative than the other ones.

4.4. Predictivity of external validation set

To further demonstrate the prediction capability of the proposed approach for the two metabolomics datasets, our models should also be validated by predicting the labels of samples not used in the training set, but whose labels are known (i.e., an independent validation set). The major difference between cross-validation and an independent validation set is that the samples selected in the latter case are in a sense random, which provides a more reliable evaluation of the model's prediction capability for untested samples than cross-validation. Thus, the two real metabolomics datasets are further split into a training set of about 2/3 and an independent validation set of about 1/3 of the samples, according to their distribution in the metabolite space. Table 4 lists the prediction accuracies of the different

Fig. 6. Score plots of the human diabetes data with (A) all variables and (B) selected variables.

modeling approaches using this independent validation set and five-fold cross validation. It can be seen from Table 4 that the five modeling methods give the same prediction results for the impaired fasting glucose data, indicating that all five are able to obtain satisfactory predictions for simple metabolomics data. However, this is not the case for the human diabetes data, for which KFDA based on the decision tree ensemble obtained the best prediction results among the five modeling methods. This again indicates that the proposed KFDA is not only able to cope with the nonlinear structure hidden in the human diabetes data through the nonlinear ability of the decision tree, but also extracts those metabolites most useful for classification by variable importance ranking. Together with the results of five-fold cross-validation, these results on the independent validation set strongly indicate that our proposed approach is predictive and robust.

Fig. 5. Variable importance obtained by KFDA for the human diabetes data.

5. Concluding remarks

The present paper aims to design an informative kernel that measures the similarities of metabolomics samples in the subspace built by some potential biomarkers, rather than by all variables. By means of this kernel matrix, a novel KFDA algorithm is developed strictly under the theoretical framework of kernel methods. The kernel thus obtained can effectively decrease the influence of uninformative variables and thereby focus on an intrinsic measure of sample similarity. Uninformative features can usually introduce so much noise that a good classification of the data cannot be obtained; when these variables are removed, a clear and well-separated class structure can be found. Our examples have sufficiently demonstrated this situation. Moreover, for averaging techniques such as Fisher linear discriminant analysis, variable selection is vital, since the signal is averaged


with noise over a large number of variables, with a loss of discernible signal amplitude, when noisy variables are not removed from the data. The kernel matrix based on the decision tree ensemble improves the reliability of the developed KFDA classifier precisely by variable ranking in the process of building the kernel. A series of problems raised by irrelevant variables can thereby be well addressed by this kernel. More importantly, removal of uninformative variables can also lead to an understanding of the essential variables that play an important role in governing the behavior of the system under investigation: it can identify those measurements that are informative and those that are not. The constructed kernel was applied to FDA here, but it can be considered equally useful for other approaches such as PLS and SVM.

Acknowledgements

We would like to thank the reviewers for their useful discussions, comments and suggestions throughout this entire work. This work is financially supported by the National Nature Foundation Committee of P.R. China (Grant Nos. 20875104, 20975115 and 10771217), the international cooperation project on traditional Chinese medicines of the Ministry of Science and Technology of China (Grant No. 2007DFA40680), and Central South University (special support of the basic scientific research project No. 2010QZZD007). The studies meet with the approval of the University's review board.

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.aca.2011.08.025.

References

[1] J. Trygg, J. Gullberg, A. Johansson, P. Jonsson, T. Moritz, Plant Metabolomics (2006) 117–128.
[2] J.C. Lindon, J.K. Nicholson, E. Holmes, The Handbook of Metabonomics and Metabolomics, Elsevier Science & Technology, 2008.
[3] S. Wold, Chemometrics and Intelligent Laboratory Systems 2 (1987) 37–52.
[4] P. Geladi, H. Isaksson, L. Lindqvist, S. Wold, K. Esbensen, Chemometrics and Intelligent Laboratory Systems 5 (1989) 209–220.
[5] B. Matthew, R. William, Journal of Chemometrics 17 (2003) 166–173.
[6] T. Johan, W. Svante, Journal of Chemometrics 16 (2002) 119–128.
[7] K. Müller, S. Mika, G. Rätsch, K. Tsuda, B. Schölkopf, IEEE Transactions on Neural Networks 12 (2001) 181–202.
[8] H. Noslen, T. Isneri, D. Angel, J.B. Rolando, M.C.F. Marcia, P. Diana, Journal of Chemometrics 22 (2008) 686–694.
[9] T. Czekaj, W. Wu, B. Walczak, Journal of Chemometrics 19 (2005) 341–354.
[10] W. Wu, D.L. Massart, S. de Jong, Chemometrics and Intelligent Laboratory Systems 36 (1997) 165–172.
[11] J. Shawe-Taylor, N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, 2004.
[12] V.N. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, New York, 2000.
[13] C.J.C. Burges, Data Mining and Knowledge Discovery 2 (1998) 121–167.
[14] H. Li, Y. Liang, Q. Xu, Chemometrics and Intelligent Laboratory Systems 95 (2009) 188–198.
[15] R. Herbrich, Learning Kernel Classifiers, MIT Press, Cambridge, 2002.
[16] B. Schölkopf, A.J. Smola, Learning with Kernels, MIT Press, Cambridge, 2002.
[17] S.J. Swamidass, J. Chen, P. Phung, L. Ralaivola, P. Baldi, Bioinformatics 21 (2005) I359–I368.
[18] P. Anderson, N. Reo, N. DelRaso, T. Doom, M. Raymer, Metabolomics 4 (2008) 261–272.
[19] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, K.R. Müllers, Fisher discriminant analysis with kernels, in: Neural Networks for Signal Processing IX, Proceedings of the 1999 IEEE Signal Processing Society Workshop, 1999, pp. 41–48.
[20] G. Baudat, F. Anouar, Neural Computation 12 (2000) 2385–2404.
[21] B. Schölkopf, R. Herbrich, A.J. Smola, A Generalized Representer Theorem, Springer, Berlin, 2001.
[22] D.S. Cao, Y.Z. Liang, Q.S. Xu, Q.N. Hu, L.X. Zhang, G.H. Fu, Chemometrics and Intelligent Laboratory Systems 107 (2011) 106–115.
[23] L. Breiman, J.H. Friedman, R.A. Olshen, C.J. Stone, Classification and Regression Trees, Wadsworth International, California, 1984.
[24] J.H. Friedman, T. Hastie, R. Tibshirani, The Elements of Statistical Learning: Data Mining, Inference and Prediction, Springer-Verlag, New York, 2008.
[25] D.S. Cao, B. Wang, M.M. Zeng, Y.Z. Liang, Q.S. Xu, L.X. Zhang, H.D. Li, Q.N. Hu, Analyst 136 (2011) 947–954.
[26] P. Zhang, Annals of Statistics 21 (1993) 299–313.
[27] M.M. Zeng, Y. Xiao, Y.Z. Liang, B. Wang, X. Chen, D.S. Cao, H.D. Li, M. Wang, Z.G. Zhou, Metabolomics 6 (2010) 303–311.
[28] O.M. Kvalheim, Y.Z. Liang, Analytical Chemistry 64 (1992) 936–946.
[29] L.Z. Yi, J. He, Y.Z. Liang, D.L. Yuan, F.T. Chau, FEBS Letters 580 (2006) 6837–6845.
[30] M.M. Zeng, Y. Liang, H.D. Li, M. Wang, B. Wang, X. Chen, N. Zhou, D.S. Cao, J. Wu, Journal of Pharmaceutical and Biomedical Analysis 52 (2010) 265–272.
[31] A.B. Borle, Annual Review of Physiology 36 (1974) 361–390.
[32] M. Igarashi, H. Wakasaki, N. Takahara, H. Ishii, Y.Z. Jiang, T. Yamauchi, K. Kuboki, M. Meier, J.C. Rhodes, L.G. King, Glucose or Diabetes Activates p38 Mitogen-Activated Protein Kinase via Different Pathways, American Society for Clinical Investigation, Ann Arbor, MI, 1999.
[33] H.J. Adrogue, H. Wilson, A.E. Boyd, W.N. Suki, G. Eknoyan, New England Journal of Medicine 307 (1982) 1603–1610.
[34] W. Xing Li, Z. Lin, K. Youker, X. Zhang, J.W. Ming, S.A. Lemaire, J.S. Coselli, Y.H. Shen, Free Fatty Acids Inhibit Insulin Signaling-Stimulated Endothelial Nitric Oxide Synthase Activation through Upregulating PTEN or Inhibiting Akt Kinase, American Diabetes Association, Alexandria, VA, 2006.
[35] J.C. Lovejoy, S.R. Smith, C.M. Champagne, M.M. Most, M. Lefevre, J.P. DeLany, Y.M. Denkins, J.C. Rood, J. Veldhuis, G.A. Bray, Effects of Diets Enriched in Saturated (Palmitic), Monounsaturated (Oleic), or Trans (Elaidic) Fatty Acids on Insulin Sensitivity and Substrate Oxidation in Healthy Adults, American Diabetes Association, Alexandria, VA, 2002.
[36] S. Lilian, G.S. Karin, K. Preben, S. Peter, S. Mats, H.N. Fredrik, Nutrition (Burbank, Los Angeles County, Calif.) 22 (2006) 60–68.