Chemometrics and Intelligent Laboratory Systems 108 (2011) 123–132
Group aggregating normalization method for the preprocessing of NMR-based metabolomic data

Jiyang Dong a,b,⁎, Kian-Kai Cheng b,c,d, Jingjing Xu a, Zhong Chen a, Julian L. Griffin b

a Department of Electronic Science, Xiamen University, Xiamen 361005, China
b Department of Biochemistry, University of Cambridge, Cambridge CB2 1GA, UK
c Department of Bioprocess Engineering, Universiti Teknologi Malaysia, 81310 Skudai, Johor, Malaysia
d Institute of Bioproduct Development, Universiti Teknologi Malaysia, 81310 Skudai, Johor, Malaysia
Article history: Received 14 March 2011; received in revised form 18 May 2011; accepted 13 June 2011; available online 7 July 2011.

Keywords: Data normalization; NMR-based metabolomics; Principal component analysis; Dilution effect; Group aggregating normalization
Abstract

Data normalization plays a crucial role in metabolomics in accounting for the inevitable variation in sample concentration and in the efficiency of the sample preparation procedure. Conventional methods such as constant sum normalization (CSN) and probabilistic quotient normalization (PQN) are widely used, but both have shortcomings. In the current study, a new data normalization method called group aggregating normalization (GAN) is proposed, by which samples are normalized so that they aggregate close to their group centers in a principal component analysis (PCA) subspace. This is in contrast to CSN and PQN, which rely on a constant reference for all samples. Evaluation of the GAN method using both simulated and experimental metabolomic data demonstrated that GAN produces a more robust model in the subsequent multivariate data analysis than either the CSN or the PQN method. The current study also demonstrated that some of the differential metabolites identified using the CSN or PQN method could be false positives due to improper data normalization. © 2011 Elsevier B.V. All rights reserved.
1. Introduction

In metabolomics, nuclear magnetic resonance (NMR) spectral peak areas are assumed to be proportional to the concentrations of metabolites. Nevertheless, the peak intensities in NMR spectra are affected by both biological and technical variation. For example, urine samples collected from animals following food deprivation or drug intervention can have metabolite concentrations differing by more than a factor of 10 compared with normal subjects [1,2]. In addition, variation in the efficiency of the sample preparation process can also introduce variation in sample concentrations. Therefore, data preprocessing methods such as normalization have been widely used prior to statistical analysis to make the data comparable with each other [3]. The most commonly used normalization methods include constant sum normalization (CSN) [4,5] and probabilistic quotient normalization (PQN) [1]. Other normalization methods [6] have also been used, including but not limited to histogram matching (HM) normalization [7], normalization to creatinine concentration [8], vector length normalization [9], and a number of methods proposed by Romano et al. [10–12].
⁎ Corresponding author. E-mail address: [email protected] (J. Dong).
doi:10.1016/j.chemolab.2011.06.002
In the CSN method, it is assumed that the total peak area of a spectrum remains constant across samples. This method normalizes the spectra by dividing each signal intensity, or each bin (following a binning or bucketing operation), of a spectrum by the total peak area. However, it is well known that this assumption is not universally valid, and it can lead to inaccurate data interpretation. The limitations of the CSN method have been demonstrated in recent publications [1,13].

The PQN method overcomes some limitations of the CSN method. Basically, PQN assumes that the majority of metabolites keep their concentrations unchanged across samples. Normalization is therefore carried out based on the most probable dilution, which is estimated by calculating the most probable quotient between a target spectrum and a reference spectrum. The PQN method takes the median of many estimates, rather than a single sum, as the standard of normalization, which makes it less susceptible to outliers. However, the PQN method is insufficient for datasets in which the concentrations of a large number of metabolites change simultaneously.

In the current study, a supervised normalization method referred to as group aggregating normalization (GAN) is proposed, by which information about the group classification is incorporated into the normalization procedure. GAN normalizes the data so that they aggregate to the group centers in a subspace of principal component analysis (PCA) [14]. Both simulated data and experimental data were used to evaluate the performance of the GAN
method, and the results obtained were compared with the conventional CSN and PQN methods.

2. Methods

2.1. Group aggregating normalization

An NMR spectrum xk obtained from a biological sample can be expressed as:

xk = αk·x0,k + ek    (1)

where x0,k represents the clean spectrum (i.e., the noise-free and dilution-free spectrum), ek represents technical noise, and αk is a multiplicative constant defined as the dilution factor. In the current work, ek is assumed to be Gaussian white noise that is relatively small compared with x0,k. The clean spectrum x0,k can be further expressed as:

x0,k = m0 + (mi − m0) + (x0,k − mi)    (2)

where m0 = (1/N)·Σ_{j=1}^{N} x0,j represents the averaged spectrum of the overall dataset, and mi = (1/Ni)·Σ_{j=1}^{Ni} x0,j, x0,j ∈ Gi, represents the averaged clean spectrum for the ith group (Gi), to which sample xk belongs. N represents the number of samples in the overall dataset, and Ni is the number of samples in Gi. In the current work, gi is defined as the difference between the group mean and the averaged spectrum of all data (gi = mi − m0), while uk is defined as the inter-individual difference within a group (uk = x0,k − mi). Then, Eq. (1) can be written as:

xk = αk·(m0 + gi + uk + αk⁻¹·ek)    (3)

If the noise ek can be considered small, the spectrum xk can be approximated as:

xk ≈ αk·(m0 + gi + uk + ek)    (4)

Since gi, uk and ek are unknown and sample-dependent, the determination of the dilution factor using Eq. (4) is not straightforward. Therefore, a transformation function T̂ can be used instead to estimate the dilution factor, where T̂ satisfies the following equations for the kth sample:

T̂(α·xk) = α·T̂(xk), ∀α ∈ R⁺    (5)

and

T̂(m0 + gi + uk + ek) ≈ const.    (6)

Then the dilution factor αk can be calculated by

αk ≈ T̂(xk)/T̂(m0 + gi + uk + ek) ∝ T̂(xk)    (7)

For example, the CSN method assumes the total spectral integral of a dilution-free spectrum to be constant. In this case, the transformation T̂ is spectral integration, but spectral integration is affected by mi, gi, uk, and ek. Therefore, Eq. (6) is not satisfied for the CSN method, and it is not suitable for all cases.

In the GAN method, PCA subspace estimation is used as the transformation T̂ for data normalization. Firstly, the samples are projected onto the PCA subspace one experimental group at a time, so that mi remains constant for all samples within the group. Secondly, ek is assumed to be homoscedastic Gaussian white noise [15,16], and the inter-individual difference uk is assumed to be normally distributed; thus the variance due to ek and uk can be assumed to be near-uniformly distributed across the PCA components [17]. It follows that T̂(mi + uk + ek) is approximately equal for any sample xk from the same group in the PCA subspace (using the first few components). The correct dilution factor can then be obtained, because the PCA subspace estimation satisfies both Eqs. (5) and (6).

The overall GAN method can be summarized as follows:

Step 1 Determination of group centers. The averaged spectrum of each experimental group (Gi) is evaluated and used as the group center (mi). Where outliers exist, they can be removed from further analysis, or the median spectrum can be used as the group center.

Step 2 Determination of dilution factors in the PCA subspace. For each experimental group (Gi), the subset of data is mean-centered and subjected to PCA, yielding a PCA subspace for each group. The original data (before mean-centering) of Gi are projected onto the obtained PCA subspace. The dilution factor αk for each spectrum xk (xk ∈ Gi) is estimated by minimizing the Euclidean distance between the spectrum xk and the group center mi in the PCA subspace.

Step 3 Determination of the normalized data. For each spectrum xk in the dataset, the normalized spectrum is x̂k = αk⁻¹·xk, where αk is the estimated dilution factor of xk obtained in Step 2. The normalized data X̂ = {x̂k} are then ready for subsequent multivariate analysis.

Step 4 Prediction analysis. Normalizing a new test spectrum with known group classification is straightforward for GAN. However, a classification estimate is needed before normalizing a new test spectrum whose group classification is unknown. A simple approach is to determine the classification of the new spectrum (x) by the minimal distance between x and the group centers (m̃i) in the PCA subspace of the whole training data.

2.2. Evaluation of normalization method

2.2.1. Recovering accuracy

A new term, recovering accuracy, is introduced to assess the efficiency of a normalization method in determining the dilution factor. Let αk be the dilution factor of the kth sample in a dataset, so that the dilution matrix of the dataset can be written as A = diag(α1, α2, ⋯, αM). In addition, Â = diag(α̂1, α̂2, ⋯, α̂M) is defined as the estimate of A obtained by a normalization method. The values in A are relative to each other, i.e., the two dilution matrices A and γA (γ ∈ R⁺, a positive real value) are equivalent; therefore an equivalent analysis is obtained whether the data are normalized with A⁻¹ or (γA)⁻¹. In the current study, we define the recovering accuracy of Â as

fA = (1/M)·Σ_{k=1}^{M} (1 − |αk − γ·α̂k|/αk) × 100%    (8)

where γ is the optimal matching coefficient subject to min_{γ∈R⁺} D(A, γÂ), and the recovering accuracy of α̂k can be written as

fαk = (1 − |αk − γ·α̂k|/αk) × 100%    (9)
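Steps 1–3 of the GAN procedure admit a compact implementation: inside each group's PCA subspace, minimizing the Euclidean distance between αk⁻¹·xk and mi over αk has a closed-form least-squares solution. The following sketch (plain numpy, PCA via SVD) illustrates the idea; the closed-form update, the choice of three components, and the synthetic demo data are illustrative assumptions, not prescriptions from the paper.

```python
import numpy as np

def gan_normalize(X, groups, n_components=3):
    """Sketch of GAN (Steps 1-3): per-group PCA subspace, closed-form
    dilution factor, and rescaling of each spectrum."""
    X = np.asarray(X, dtype=float)
    alpha = np.empty(X.shape[0])
    for g in np.unique(groups):
        idx = np.where(groups == g)[0]
        Xg = X[idx]
        m_i = Xg.mean(axis=0)                       # Step 1: group center
        # Step 2: PCA loadings of the mean-centered group data (via SVD)
        _, _, Vt = np.linalg.svd(Xg - m_i, full_matrices=False)
        V = Vt[:min(n_components, Vt.shape[0])].T   # (N, r) loading matrix
        c = m_i @ V                                 # scores of the center
        for k in idx:
            s = X[k] @ V                            # scores of the sample
            # argmin_alpha || alpha^-1 * s - c ||^2 has the closed form:
            alpha[k] = (s @ s) / (s @ c)
    X_hat = X / alpha[:, None]                      # Step 3: rescale spectra
    return X_hat, alpha

# demo: two groups of dilution-scaled copies of one synthetic spectrum
rng = np.random.default_rng(0)
base = np.abs(rng.normal(size=50)) + 1.0
alphas = np.array([0.5, 1.0, 2.0, 4.0, 0.8, 1.5])
X_hat, a = gan_normalize(alphas[:, None] * base, np.array([0, 0, 0, 1, 1, 1]))
```

For such dilution-only data the recovered factors are exact up to a group-wise constant (here a[1]/a[0] = 2), consistent with the relative definition of the dilution matrix A in Section 2.2.1.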
2.2.2. Dataset I: evaluation of recovering accuracy of GAN

A published dataset was used to evaluate the recovering accuracy of the GAN method. The dataset consisted of urinary 1H NMR spectra obtained from a vegetarian diet study (samples from vegetarian or omnivorous males and females) [18]. The spectra were phased, baseline-corrected and bucketed into 205 bins (N = 205), as described in Ref. [18]. A range of simulated noise intensities, discriminant variables, and dilution factors were then introduced artificially into the data; the procedure can be summarized as follows:

(1) Generation of base data matrix. A single spectrum was selected from the vegetarian male group. The averaged signal intensity of the spectrum was normalized to 100. A dataset (X0) was then generated, consisting of M replicates (rows) of the selected spectrum. The first half (M/2) of the spectra were labeled as group 1 (G1), and the second half as group 2 (G2).
Fig. 1. Average recovering accuracy obtained using CSN, PQN and GAN methods. (a) Recovering accuracy with different numbers of discriminant variables, H; (b) recovering accuracy with different amplification factors, λ; (c) recovering accuracy with different densities of inter-individual difference, β; (d) recovering accuracy with different standard deviations of inter-individual difference intensity, σu; (e) recovering accuracy with different standard deviations of noise, σn; (f) recovering accuracy with different numbers of samples in the dataset, M.
(2) Addition of artificial between-group difference. H variables (columns, where H ≤ N) were randomly selected as the discriminant variables that discriminate G1 and G2. For each spectrum in G2, the intensities of the discriminant variables were multiplied by a positive factor λ. This step produced a dataset (Xg) with artificial between-group difference.

(3) Addition of artificial inter-individual difference. A random matrix (Eu) was simulated with normally distributed entries drawn from N(0, σu). Eu was then modified to simulate different densities of inter-individual difference (β, between 0 and 100%), whereby the values of (100 − β)% of the variables (columns) in Eu were set to zero. A new dataset (Xu) was then obtained, i.e., Xu = Xg + Eu.

(4) Addition of simulated dilution effect. For each spectrum xm in the simulated dataset, a random positive value αm was generated and multiplied with xm to obtain the diluted spectrum αm·xm. The random dilution factor αm was drawn from a normal distribution N(10, 3). A dataset (Xd) with dilution effect was thereby generated, Xd = diag(α1, α2, ⋯, αM)·Xu.

(5) Addition of simulated additive noise. A random noise matrix (En) was simulated with normally distributed entries of zero mean and standard deviation σn. Finally, the simulated dataset (X = Xd + En) was obtained, ready for further analysis.

(6) Analysis of recovering accuracy. The simulated dataset was normalized using the GAN, CSN, or PQN method to estimate the dilution matrix. The recovering accuracy was then calculated based on Eq. (8).

(7) Evaluation of GAN method. Steps 2 to 6 were repeated 200 times, and the averaged recovering accuracies for GAN, CSN and PQN were compared.
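A stripped-down version of this simulation can be sketched as follows. The stand-in base spectrum, the parameter values, the uniform (rather than normal) draw of dilution factors, and the least-squares choice of γ are all illustrative assumptions; the inter-individual-difference and additive-noise steps are omitted, and CSN's implicit dilution estimate is taken as the total peak area.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, H, lam = 40, 205, 20, 2.0

# (1) base data matrix: M replicates of one spectrum, mean intensity 100
base = np.abs(rng.normal(size=N)) + 1.0     # stand-in spectrum (assumption)
base *= 100.0 / base.mean()
X0 = np.tile(base, (M, 1))

# (2) between-group difference: H random variables amplified by lam in G2
disc = rng.choice(N, size=H, replace=False)
Xg = X0.copy()
Xg[M // 2:, disc] *= lam

# (4) dilution effect (steps (3) and (5) omitted in this sketch)
alpha = rng.uniform(5.0, 15.0, size=M)      # the paper draws from N(10, 3)
X = alpha[:, None] * Xg

# (6) recovering accuracy, reading Eq. (8) with a least-squares gamma
def recovering_accuracy(a, a_hat):
    a, a_hat = np.asarray(a, float), np.asarray(a_hat, float)
    gamma = (a @ a_hat) / (a_hat @ a_hat)   # argmin_gamma ||a - gamma*a_hat||
    return np.mean(1.0 - np.abs(a - gamma * a_hat) / a) * 100.0

alpha_csn = X.sum(axis=1)                   # CSN's implicit dilution estimate
print(round(recovering_accuracy(alpha, alpha_csn), 1))
```

Because group 2's total peak area is inflated by the amplified variables, CSN's estimate is biased and its recovering accuracy falls below 100%, matching the behavior reported for Fig. 1(a).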
2.2.3. Dataset II: evaluation of GAN on improving the subsequent multivariate analysis

Using the same experimental dataset as in Section 2.2.2, 36 urinary 1H NMR spectra from vegetarian females were selected to build a data matrix X0; these samples fall within the 95% confidence ellipse of the PCA score plot. This dataset contains actual experimental inter-individual difference and noise, so only artificial between-group difference and dilution effect were added to generate a simulated dataset for evaluating GAN using multivariate analysis.

(1) Sample permutation. First, the order of rows in X0 was permuted randomly. The first 18 samples were labeled as G1, and the latter 18 samples as G2.

(2) Addition of between-group difference. H variables (columns) were randomly selected as the discriminant variables between G1 and G2. To simulate the between-group difference, the intensities of the discriminant variables were multiplied by a positive factor λ for every spectrum in G2; the resulting dataset is denoted Xg.

(3) Addition of dilution effect. Random positive dilution factors for all 36 samples were generated as described in Section 2.2.2 to produce a dataset with dilution effect, Xd.

(4) Data normalization. Xd was normalized using CSN, PQN, or GAN. The normalized dataset was then Pareto-scaled [19] prior to partial least squares discriminant analysis (PLS-DA) [20,21].

(5) Evaluation of GAN method. Steps 1 to 4 were repeated 500 times, and the averaged results for the GAN, CSN and PQN methods were compared. Several parameters were studied, including R², which shows the fraction of variation explained by a component, and Q², which shows the predictability of the model through cross-validation. In addition, a retrieval rate defined as n(v)/H was introduced, where n(v) is the number of simulated discriminant variables that appear among the first H loadings (absolute values in descending order) of the first PLS component. Lastly, for all normalization methods, the p-value comparing the scores of both groups in the first PLS-DA component was examined.

2.2.4. Dataset III: evaluation of GAN using an experimental dataset

A published high-resolution magic-angle spinning (HR-MAS) dataset was used to evaluate the efficiency of the GAN method in the identification of differential variables. The spectra were acquired from liver tissues obtained from 7 male C57BLKS/J-db/db mice (diabetic group) and 6 male C57BLKS/J-db/m mice (control group). The spectra were phased and baseline-corrected manually, and the peaks were aligned manually to correct for minor peak shifts following automated spectral alignment using Alm's method [22]. After that, the spectra were bucketed into 200 bins of fixed width, as described in Ref. [23]. As group separation could be achieved using an unsupervised method, the data matrix was normalized using CSN, PQN or GAN and then subjected to a PCA model using the SIMCA-P+ software [24].

Fig. 2. Leave-one-out cross-validation for GAN, PQN and CSN. (a) Recovering accuracy with different numbers of group discriminant variables, H; (b) recovering accuracy with different standard deviations of noise, σn.

3. Results and discussion

3.1. Evaluation of GAN using dataset I

Dataset I was designed to examine the efficacy of GAN in the estimation of the dilution factor under different conditions, i.e., the range of between-group difference (H), the scale of between-group difference
(λ), the range of inter-individual difference (β), the scale of inter-individual difference (σu) and the scale of noise (σn). The averaged recovering accuracy obtained using CSN, PQN, and GAN is shown in Fig. 1.

Fig. 1(a) and (b) show the effect of between-group difference on the CSN, PQN and GAN methods, obtained by increasing the number of group discriminant variables (H) and the amplification factor of the group discriminant variables (λ) in a dataset that is noise-free (σn = 0) and has no inter-individual difference (β = 0 or σu = 0). PQN performs better than the other methods when H is less than half the number of variables (N); nevertheless, when H is higher than N/2, the most probable quotient obtained using PQN is not adequate to represent the data, which leads to low accuracy in the determination of the dilution factor. The recovering accuracy of CSN decreases greatly with increasing H or λ, as the variation in the 'actual' total peak area is no longer negligible at high H or λ. The GAN method overcomes this problem by including class information in the normalization procedure, and its performance was found to be relatively consistent, with an average recovering accuracy fA > 97% for all cases. Notably, GAN is particularly suitable for datasets with a large number of discriminant variables.

Fig. 1(c) and (d) illustrate the impact of inter-individual difference on the performance of the CSN, PQN and GAN methods, obtained by increasing the density (β) and intensity (σu) of the inter-individual difference in the noise-free (σn = 0) dataset with small between-group difference (H = 50, λ = 2). PQN performs better than the other methods for datasets with a low density of inter-individual difference, which is consistent with the results shown in Fig. 1(a). However, when the dataset contains high levels of inter-individual difference (β > 0.5), the recovering accuracy of PQN decreases at high σu and β. The recovering accuracy of CSN is lower than that of the other methods but increases slightly with increasing β. The performance of GAN was found to be relatively consistent (fA > 98% for all cases), which indicates that GAN is not susceptible to the level of inter-individual difference.

Fig. 1(e) demonstrates the influence of Gaussian white noise on the performance of the CSN, PQN and GAN methods, obtained by varying the standard deviation of the artificial noise (σn) in the dataset with small between-group difference (H = 50, λ = 2). Unlike the other methods, the recovering accuracy of the GAN method was found to be consistent at different noise levels (Fig. 1(e)), which implies that GAN works equally well for noisy datasets. In addition, the average recovering accuracy of GAN increases with an increasing number of samples, as a better estimate of the group center, and hence a more accurate reference for the normalization procedure, can then be obtained (Fig. 1(f)).
Fig. 3. PLS-DA results of the simulated dataset with H = 20. (a) R2 of the first PLS-DA component; (b) Q2 of the first PLS-DA component; (c) retrieval rate of the group discriminators; (d) logarithm of p-value comparing scores from the first PLS component.
As both CSN and PQN are unsupervised methods, their recovering accuracies are not affected by samples with unknown class membership. Nevertheless, this is not the case for the GAN method. Therefore, a leave-one-out study was carried out to assess the performance of the GAN method on unknown samples, with the normalization procedure performed according to Step 4 in Section 2.1. Comparing Fig. 2(a) with Fig. 1(a), the average recovering accuracy of GAN is reduced when the number of group discriminant variables H is high, since a correct classification of unknown samples cannot then be easily achieved. Nevertheless, the results in Fig. 2(a) show that GAN gave a better overall performance than both the CSN and PQN methods. Since the GAN method is less affected by noise level (Fig. 1(d) and (e)), it works well for datasets with apparent between-group difference (H = 50, λ = 2), as shown in Fig. 2(b).

Results from this study indicated that: (1) CSN is sensitive to the between-group difference (both H and λ) but relatively robust to inter-individual difference (both β and σu) and noise (σn). (2) PQN works well on datasets with a small range of perturbations (low H and β); when the dataset is affected by a high level of perturbation (high H or β), the performance of PQN is sensitive to the scale of the perturbation (λ, σu and σn). (3) GAN offers an advantage in terms of recovering accuracy not only for datasets with between-group difference (both H and λ), but also for datasets with inter-individual difference (both β and σu) and noise (σn), which indicates that GAN is more suitable for metabolomic studies where perturbation of a large number of metabolites is expected.

3.2. Evaluation of GAN using dataset II

The effect of normalization methods on the subsequent PLS-DA model was studied using a simulated dataset covering a range of H and λ. The dataset can be grouped into three cases:

Case 1. Large-scale perturbations occur in a small number of variables (high λ, low H);
Case 2. Small-scale perturbations occur in a large number of variables (small λ, high H);
Case 3. Large-scale perturbations occur in a large number of variables (high λ, high H).
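The retrieval rate n(v)/H used to score the PLS-DA models (Section 2.2.3) is simple to compute once the loadings of the first PLS component are available; the helper below is a sketch of that metric, with illustrative loading values.

```python
import numpy as np

def retrieval_rate(loadings, disc_idx):
    """Fraction n(v)/H of the simulated discriminant variables that appear
    among the H largest |loadings| of the first PLS component."""
    H = len(disc_idx)
    top_H = np.argsort(np.abs(loadings))[::-1][:H]   # indices of top-H loadings
    return len(set(top_H) & set(disc_idx)) / H

# illustrative loadings: variables 1 and 3 were the simulated discriminators
loadings = np.array([0.05, 0.90, -0.10, 0.80, 0.02])
retrieval_rate(loadings, [1, 3])  # -> 1.0, both ranked in the top 2
```

A retrieval rate of 1.0 means every artificially perturbed variable was recovered by the model; values near 0 indicate the normalization step has redistributed variance onto unperturbed bins.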
Fig. 4. PLS-DA results of the simulated dataset with λ = 2. (a) R2 of the first PLS-DA component; (b) Q2 of the first PLS-DA component; (c) retrieval rate of the discriminant variables; (d) logarithm of p-value of the first PLS-DA component.
From Fig. 3(a) and (b), GAN was found to perform better than CSN and PQN, particularly when λ is low, and it is comparable to CSN at high λ. The R² and Q² for the PQN method are generally lower than for both the GAN and CSN methods. In addition, GAN also produced a better overall retrieval rate at low λ, and it is comparable to PQN at high λ, as shown in Fig. 3(c). Furthermore, the result in Fig. 3(d) indicates that the groups are more separable (in terms of the p-value comparing scores from both groups) in the first PLS component when the data were normalized by GAN.

The advantage of the GAN method is more evident when the number of discriminant variables (H) is high. Fig. 4(a) and (b) demonstrate that PQN and CSN are not suitable for Case 2 datasets (high H), while GAN handles these datasets well. This could be due to the low recovering accuracy of both PQN and CSN at high H, as shown in Fig. 1(a). In addition, GAN also performs better for this dataset, as experimental datasets normally have a high noise density owing to the existence of inter-individual differences in all variables (Fig. 4(c)). Furthermore, Fig. 4(c) also shows GAN to be a more robust method in retrieving the discriminant variables. It also leads to a better group separation in the subsequent PLS-DA model (i.e., lower p-values), notably at high H (Fig. 4(d)).

3.3. Evaluation of GAN using dataset III

The developed GAN method was also studied using an experimental HR-MAS dataset [23]. The dataset consisted of two groups, the diabetic group (db/db mice) and the control group (db/m mice). The
PCA score plots obtained following the CSN, PQN, and GAN methods are shown in Fig. 5(b), (c) and (d), respectively. The PCA score plot of the original data is also shown in Fig. 5(a). Group separation could be achieved with all three normalization methods, owing to the significant metabolic differences between the diabetic and control groups. Nevertheless, GAN produced the best group separation, possibly due to its ability to give a better estimate of the dilution factor.

Since GAN is a supervised method, a validation study was carried out to examine the "GAN + PCA" model. In this study, the classification information of the samples was permuted randomly in 50 experiments; the new data were then normalized using GAN, and the −log10(p-value) comparing the PC1 scores of the two assigned groups was evaluated. Fig. 6 shows the p-value of PC1 against the correlation coefficient between the assigned classification and the true classification. The p-value is lowest when the correlation coefficient is 1, i.e., when the classification of the samples is correct. This result suggests that the group separation obtained using the "GAN + PCA" model is robust and does not arise by chance.

Fig. 7 shows a section of the differential spectra (2.46–4.10 ppm) between the center (i.e., mean spectrum) of the diabetic group and that of the control group, following the different normalization procedures. The differential spectrum obtained following the GAN method resembles the differential spectrum of the original data, as shown in Fig. 7(c) and (d). In fact, the correlation coefficient between the differential spectra of the GAN-normalized and original data was found to be 0.998, as compared with the CSN (0.875) and PQN (0.947) methods. These results suggest that both the CSN and PQN methods might introduce more bias than the GAN
Fig. 5. PCA score plots of the data normalized by CSN, PQN and GAN. (a) Score plot of the original data, without any normalization procedure; (b) score plot of CSN normalized data; (c) score plot of PQN normalized data; (d) score plot of GAN normalized data (symbols: "●" for samples from db/db mice; "▲" for samples from db/m mice; numbers on the score plots represent the sample index).
method into the dataset. In fact, the spectral region between 3 and 4 ppm was scaled up by both the CSN and PQN methods (Fig. 8), which might lead to false-positive results. On the other hand, the GAN method is less susceptible to variation in group mean, noise and inter-individual difference. It produces more compact plots for each of the experimental groups, probably due to a better estimation of the dilution factor (Fig. 8(c)).

Fig. 6. PCA model validation for GAN normalized data. The horizontal axis represents the correlation coefficient of the classification vectors between the re-assigned data and the original data.

4. Conclusions

Normalization is a crucial data preprocessing step in metabolomics. In the conventional CSN and PQN methods, the reference for normalization is assumed to be constant for all spectra. This is not necessarily true in all cases because of the complexity of biological samples. Therefore, in the current study, a supervised sample-dependent normalization method called GAN was proposed. GAN operates by first projecting the high-dimensional data onto a PCA subspace; each sample within each group is then scaled by a coefficient (αk⁻¹) so that the distance between the PCA scores of the sample and its group-mean scores is minimized. Both simulated and experimental data were used to examine the efficiency of the developed GAN method. Our study showed that GAN produced better class discrimination and improved the identification of differential metabolites in the subsequent multivariate analysis, as compared to the conventional CSN and PQN methods.
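The label-permutation validation described in Section 3.3 can be sketched as follows. The PCA/t-test helper is a generic stand-in (the paper used SIMCA-P+ for the PCA models), the GAN step inside the loop is elided, and the sample counts mirror the 7 db/db and 6 db/m mice; everything else here is an illustrative assumption.

```python
import numpy as np
from scipy import stats

def pc1_neglog_pvalue(X, labels):
    """-log10 p-value comparing the two groups' scores on the first
    principal component (plain SVD-based PCA on column-centered data)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    pc1 = Xc @ Vt[0]                         # scores on PC1
    p = stats.ttest_ind(pc1[labels == 0], pc1[labels == 1]).pvalue
    return -np.log10(p)

rng = np.random.default_rng(7)
y_true = np.array([0] * 7 + [1] * 6)         # 7 db/db and 6 db/m samples

for _ in range(5):                           # the paper uses 50 permutations
    y_perm = rng.permutation(y_true)
    # similarity between the re-assigned and the true classification
    r = abs(np.corrcoef(y_true, y_perm)[0, 1])
    # ...here the data would be GAN-normalized under y_perm, and
    # pc1_neglog_pvalue evaluated on the normalized matrix...
```

Plotting r against the resulting −log10(p-value) over all permutations reproduces the shape of Fig. 6: the separation statistic should peak only when the assigned labels match the true ones.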
Fig. 7. A section of differential spectrum between the mean spectra of the diabetic group and the control group. (a) CSN normalized data; (b) PQN normalized data; (c) GAN normalized data; (d) original data.
Fig. 8. Stacked plots of the binned NMR data from db/m mice (red solid line) and db/db mice (black dotted line): (a) CSN normalized data; (b) PQN normalized data; (c) GAN normalized data; (d) original data.
Acknowledgements

This work was supported by the Science Research Foundation of the Ministry of Health and the United Fujian Provincial Health and Education Project for Tackling the Key Research (WKJ2008-2-36), and the NSF of Fujian Province of China (2009J05087 and 2009J01299).

References

[1] F. Dieterle, A. Ross, G. Schlotterbeck, H. Senn, Probabilistic quotient normalization as robust method to account for dilution of complex biological mixtures: application in 1H NMR metabonomics, Anal. Chem. 78 (2006) 4281–4290.
[2] Ciba-Geigy AG, Wissenschaftliche Tabellen Geigy 1, Körperflüssigkeiten, 8th ed., Basel, Switzerland, 1977.
[3] A. Craig, O. Cloarec, E. Holmes, J.K. Nicholson, J.C. Lindon, Scaling and normalization effects in NMR spectroscopic metabonomic data sets, Anal. Chem. 78 (2006) 2262–2267.
[4] M.E. Bollard, E.G. Stanley, J.C. Lindon, J.K. Nicholson, E. Holmes, NMR-based metabonomic approaches for evaluating physiological influences on biofluid composition, NMR Biomed. 18 (2005) 143–162.
[5] J.T. Brindle, H. Antti, E. Holmes, G. Tranter, J.K. Nicholson, H.W. Bethell, S. Clarke, P.M. Schofield, E. McKilligin, D.E. Mosedale, D.J. Grainger, Rapid and noninvasive diagnosis of the presence and severity of coronary heart disease using 1H-NMR-based metabonomics, Nat. Med. 8 (2002) 1439–1444.
[6] M. Sysi-Aho, M. Katajamaa, L. Yetukuri, M. Orešič, Normalization method for metabolomics data using optimal selection of multiple internal standards, BMC Bioinforma. 8 (2007) 93.
[7] R.O. Torgrip, K.M. Aberg, E.I. Schuppe-Koistinen, J. Lindberg, A note on normalization of biofluid 1D 1H-NMR data, Metabolomics 4 (2008) 114–121.
[8] P. Jatlow, S. McKee, S.S. O'Malley, Correction of urine creatinine concentrations for creatinine excretion: is it useful? Clin. Chem. 49 (2003) 1932–1934.
[9] M. Scholz, S. Gatzek, A. Sterling, O. Fiehn, J. Selbig, Metabolite fingerprinting: detecting biological features by independent component analysis, Bioinformatics 20 (2004) 2447–2454. [10] P. Lemmerling, L. Vanhamme, R. Romano, S. van Huffel, A subspace time-domain algorithm for automated NMR spectral normalization, J. Magn. Reson. 157 (2002) 190–199. [11] R. Romano, R. Lamanna, M.T. Santini, P.L. Indovina, A new algorithm for NMR spectral normalization, J. Magn. Reson. 138 (1999) 115–122. [12] R. Romano, M.T. Santini, P.L. Indovina, A time-domain algorithm for NMR spectral normalization, J. Magn. Reson. 146 (2000) 89–99. [13] S. Wold, E. Johansson, M. Cocchi, PLS—partial least-squares projections to latent structures, in: H. Kubinnyi, G. Folkers, Y.C. Martin (Eds.), 3D QSAR in drug design— theory, methods and applications, ESCOM Science, Leiden, 1993. [14] M. Ringnér, What is principal component analysis? Nat. Biotechnol. 26 (2008) 303–304. [15] O.M. Kvalheim, F. Brakstad, Y.Z. Liang, Preprocessing of analytical profiles in the presence of homoscedastic or heteroscedastic noise, Anal. Chem. 66 (1994) 43–51. [16] T.K. Karakach, P.D. Wentzell, J.A. Walter, Characterization of the measurement error structure in 1D 1H NMR data for metabolomics studies, Anal. Chim. Acta 636 (2009) 163–174. [17] M. Statheropoulos, A. Pappa, P. Karamertzanis, H.L.C. Meuzelaar, Noise reduction of fast, repetitive GC/MS measurements using principal component analysis (PCA), Anal. Chim. Acta 401 (1999) 35–43.
[18] J.J. Xu, S.Y. Yang, S.H. Cai, J.Y. Dong, X.J. Li, Z. Chen, Identification of a metabolic signature in urine for lacto-vegetarians using 1H NMR spectroscopy and pattern recognition, Anal. Bioanal. Chem. 396 (2010) 1451–1463.
[19] L. Eriksson, E. Johansson, N. Kettaneh-Wold, S. Wold, Scaling, in: Introduction to Multi- and Megavariate Data Analysis Using Projection Methods (PCA & PLS), Umetrics, 1999, pp. 213–225.
[20] H. Wold, Soft modeling: the basic design and some extensions, in: K.G. Jöreskog, H. Wold (Eds.), Systems Under Indirect Observation, vols. I and II, North-Holland, Amsterdam, 1982.
[21] S. Wold, M. Sjöström, L. Eriksson, PLS-regression: a basic tool of chemometrics, Chemom. Intell. Lab. Syst. 58 (2001) 109–130.
[22] E. Alm, R.J.O. Torgrip, K.M. Åberg, I. Schuppe-Koistinen, J. Lindberg, A solution to the 1D NMR alignment problem using an extended generalized fuzzy Hough transform and mode support, Anal. Bioanal. Chem. 395 (2009) 213–223.
[23] J.J. Xu, J. Zhang, J.Y. Dong, S.H. Cai, J.Y. Yang, Z. Chen, Metabonomics studies of intact hepatic and renal cortical tissues from diabetic db/db mice using high-resolution magic-angle spinning 1H NMR spectroscopy, Anal. Bioanal. Chem. 393 (2009) 1657–1668.
[24] L. Eriksson, E. Johansson, N. Kettaneh-Wold, S. Wold, Multi- and Megavariate Data Analysis: Principles and Applications, Umetrics Academy, Umeå, 2001.