Chemometrics and Intelligent Laboratory Systems 108 (2011) 123–132
Group aggregating normalization method for the preprocessing of NMR-based metabolomic data

Jiyang Dong a,b,⁎, Kian-Kai Cheng b,c,d, Jingjing Xu a, Zhong Chen a, Julian L. Griffin b

a Department of Electronic Science, Xiamen University, Xiamen 361005, China
b Department of Biochemistry, University of Cambridge, Cambridge CB2 1GA, UK
c Department of Bioprocess Engineering, Universiti Teknologi Malaysia, 81310 Skudai, Johor, Malaysia
d Institute of Bioproduct Development, Universiti Teknologi Malaysia, 81310 Skudai, Johor, Malaysia
Article history: Received 14 March 2011; received in revised form 18 May 2011; accepted 13 June 2011; available online 7 July 2011.

Keywords: Data normalization; NMR-based metabolomics; Principal component analysis; Dilution effect; Group aggregating normalization
Abstract

Data normalization plays a crucial role in metabolomics in accounting for the inevitable variation in sample concentration and in the efficiency of the sample preparation procedure. Conventional methods such as constant sum normalization (CSN) and probabilistic quotient normalization (PQN) are widely used, but both have shortcomings. In the current study, a new data normalization method called group aggregating normalization (GAN) is proposed, by which samples are normalized so that they aggregate close to their group centers in a principal component analysis (PCA) subspace. This is in contrast to CSN and PQN, which rely on a constant reference for all samples. Evaluation of the GAN method using both simulated and experimental metabolomic data demonstrated that GAN produces a more robust model in the subsequent multivariate data analysis than either the CSN or the PQN method. The current study also demonstrated that some of the differential metabolites identified using the CSN or PQN method could be false positives due to improper data normalization. © 2011 Elsevier B.V. All rights reserved.
1. Introduction

In metabolomics, nuclear magnetic resonance (NMR) spectral peak areas are assumed to be proportional to the concentrations of metabolites. Nevertheless, the peak intensities in NMR spectra are affected by both biological and technical variation. For example, urine samples collected from animals following food deprivation or drug intervention can have metabolite concentrations differing by more than a factor of 10 compared with normal subjects [1,2]. In addition, variation in the efficiency of the sample preparation process can also introduce variation in sample concentrations. Therefore, data preprocessing methods such as normalization have been widely used prior to statistical analysis to make the data comparable with each other [3]. The most commonly used normalization methods include constant sum normalization (CSN) [4,5] and probabilistic quotient normalization (PQN) [1]. Other normalization methods [6] have also been used, including but not limited to histogram matching (HM) normalization [7], normalization to creatinine concentration [8], vector length normalization [9], and a number of methods proposed by Romano et al. [10–12].
⁎ Corresponding author. E-mail address: [email protected] (J. Dong).
doi:10.1016/j.chemolab.2011.06.002
In the CSN method, it is assumed that the total peak area of a spectrum remains constant across samples. This method normalizes the spectra by dividing each signal intensity, or each bin (following a binning or bucketing operation), of a spectrum by the total peak area. However, it is well known that this assumption is not universally valid, and it can lead to inaccurate data interpretation. The limitations of the CSN method have been demonstrated in recent publications [1,13].

The PQN method overcomes some limitations of the CSN method. Basically, PQN assumes that the majority of metabolites keep their concentrations unchanged across samples. Normalization is therefore carried out based on the most probable dilution, which is estimated by calculating the most probable quotient between a target spectrum and a reference spectrum. The PQN method takes the median of many estimates, rather than a single sum, as the standard of normalization, which makes it less susceptible to outliers. However, the PQN method is insufficient for datasets in which the concentrations of a large number of metabolites change simultaneously.

In the current study, a supervised normalization method referred to as group aggregating normalization (GAN) is proposed, by which information about the group classification is incorporated into the normalization procedure. GAN normalizes the data so that they aggregate to the group centers in a subspace of principal component analysis (PCA) [14]. Both simulated data and experimental data were used to evaluate the performance of the GAN
method, and the results obtained were compared with the conventional CSN and PQN methods.

2. Methods

2.1. Group aggregating normalization

An NMR spectrum xk obtained from a biological sample can be expressed as:

xk = αk·x0,k + ek    (1)

where x0,k represents the clean spectrum (i.e., the noise-free and dilution-free spectrum), ek represents technical noise, and αk is a multiplicative constant defined as the dilution factor. In the current work, ek is assumed to be Gaussian white noise that is relatively small compared with x0,k. The clean spectrum x0,k can be further expressed as:

x0,k = m0 + (mi − m0) + (x0,k − mi)    (2)

where m0 = (1/N)·Σ_{j=1}^{N} x0,j represents the averaged spectrum of the overall dataset, and mi = (1/Ni)·Σ_{j=1}^{Ni} x0,j, x0,j ∈ Gi, represents the averaged clean spectrum for the ith group (Gi), to which sample xk belongs. N represents the number of samples in the overall dataset, and Ni is the number of samples in Gi. In the current work, gi is defined as the difference between the group mean and the averaged spectrum of all data (gi = mi − m0), while uk is defined as the inter-individual difference within a group (uk = x0,k − mi). Then, Eq. (1) can be written as:

xk = αk·(m0 + gi + uk + αk⁻¹·ek)    (3)

If the noise ek can be considered small, the spectrum xk can be approximated as:

xk ≈ αk·(m0 + gi + uk + ek)    (4)

Since gi, uk and ek are unknown and sample-dependent, the determination of the dilution factor using Eq. (4) is not straightforward. Therefore, a transformation function T̂ can be used instead to estimate the dilution factor, where T̂ satisfies the following equations for the kth sample:

T̂(α·xk) = α·T̂(xk), ∀α ∈ R⁺    (5)

and

T̂(m0 + gi + uk + ek) ≈ const.    (6)

Then the dilution factor αk can be calculated by

αk ≈ T̂(xk)/T̂(m0 + gi + uk + ek) ∝ T̂(xk)    (7)

For example, the CSN method assumes the total spectral integral of a dilution-free spectrum to be constant. In this case, the transformation T̂ is spectral integration, but spectral integration is affected by mi, gi, uk, and ek. Therefore, Eq. (6) is not satisfied for the CSN method, and it is not suitable for all cases.

In the GAN method, PCA subspace estimation is used as the transformation T̂ for data normalization. Firstly, the samples are projected onto the PCA subspace one experimental group at a time, so that mi remains constant for all samples within the group. Secondly, ek is assumed to be homoscedastic Gaussian white noise [15,16], and the inter-individual difference uk is assumed to be normally distributed; thus the variance due to ek and uk can be assumed to be near-uniformly distributed across the PCA components [17]. It follows that T̂(mi + uk + ek) is approximately equal for any sample xk from the same group in the PCA subspace (using the first few components). The correct dilution factor can then be obtained, because the PCA subspace estimation satisfies both Eqs. (5) and (6).

The overall GAN method can be summarized as follows:

Step 1 Determination of group centers. The averaged spectrum of each experimental group (Gi) is evaluated and used as the group center (mi). Where outliers exist, they can be removed from further analysis, or the median spectrum can be used as the group center.

Step 2 Determination of dilution factors in the PCA subspace. For each experimental group (Gi), the subset of data is mean-centered and subjected to PCA, yielding a PCA subspace for each group. The original data (before mean-centering) of Gi are projected onto the obtained PCA subspace. The dilution factor αk for each spectrum xk (xk ∈ Gi) is estimated by minimizing the Euclidean distance between the spectrum xk and the group center mi in the PCA subspace.

Step 3 Determination of the normalized data. For each spectrum xk in the dataset, the normalized spectrum is x̂k = αk⁻¹·xk, where αk is the estimated dilution factor of xk obtained in Step 2. The normalized data X̂ = {x̂k} are then ready for subsequent multivariate analysis.

Step 4 Prediction analysis. Normalizing a new test spectrum with known group classification is straightforward for GAN. However, a classification estimate is needed before normalizing a new test spectrum whose group classification is unknown. A simple approach is to determine the classification of the new spectrum (x) by the minimal distance between x and the group centers (m̃i) in the PCA subspace of the whole training data.

2.2. Evaluation of normalization method

2.2.1. Recovering accuracy

A new term, recovering accuracy, is introduced to assess the efficiency of a normalization method in determining the dilution factor. Let αk be the dilution factor of the kth sample in a dataset, so that the dilution matrix of the dataset can be written as A = diag(α1, α2, ⋯, αM). In addition, Â = diag(α̂1, α̂2, ⋯, α̂M) is defined as the estimate of A obtained by a normalization method. The values in A are relative to each other, i.e., the two dilution matrices A and γA (γ ∈ R⁺, a positive real value) are equivalent; therefore an equivalent analysis is obtained whether the data are normalized with A⁻¹ or (γA)⁻¹. In the current study, we define the recovering accuracy of Â as

fA = (1/M)·Σ_{k=1}^{M} (1 − |αk − γ·α̂k|/αk) × 100%    (8)

where γ is the optimal matching coefficient subject to min_{γ∈R⁺} D(A, γÂ), and the recovering accuracy of α̂k can be written as

fαk = (1 − |αk − γ·α̂k|/αk) × 100%    (9)
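Steps 1–3 of the GAN procedure admit a compact implementation: inside each group's PCA subspace, minimizing the Euclidean distance between αk⁻¹·xk and mi over αk has a closed-form least-squares solution. The following sketch (plain numpy, PCA via SVD) illustrates the idea; the closed-form update, the choice of three components, and the synthetic demo data are illustrative assumptions, not prescriptions from the paper.

```python
import numpy as np

def gan_normalize(X, groups, n_components=3):
    """Sketch of GAN (Steps 1-3): per-group PCA subspace, closed-form
    dilution factor, and rescaling of each spectrum."""
    X = np.asarray(X, dtype=float)
    alpha = np.empty(X.shape[0])
    for g in np.unique(groups):
        idx = np.where(groups == g)[0]
        Xg = X[idx]
        m_i = Xg.mean(axis=0)                       # Step 1: group center
        # Step 2: PCA loadings of the mean-centered group data (via SVD)
        _, _, Vt = np.linalg.svd(Xg - m_i, full_matrices=False)
        V = Vt[:min(n_components, Vt.shape[0])].T   # (N, r) loading matrix
        c = m_i @ V                                 # scores of the center
        for k in idx:
            s = X[k] @ V                            # scores of the sample
            # argmin_alpha || alpha^-1 * s - c ||^2 has the closed form:
            alpha[k] = (s @ s) / (s @ c)
    X_hat = X / alpha[:, None]                      # Step 3: rescale spectra
    return X_hat, alpha

# demo: two groups of dilution-scaled copies of one synthetic spectrum
rng = np.random.default_rng(0)
base = np.abs(rng.normal(size=50)) + 1.0
alphas = np.array([0.5, 1.0, 2.0, 4.0, 0.8, 1.5])
X_hat, a = gan_normalize(alphas[:, None] * base, np.array([0, 0, 0, 1, 1, 1]))
```

For such dilution-only data the recovered factors are exact up to a group-wise constant (here a[1]/a[0] = 2), consistent with the relative definition of the dilution matrix A in Section 2.2.1.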
2.2.2. Dataset I: evaluation of recovering accuracy of GAN

A published dataset was used to evaluate the recovering accuracy of the GAN method. The dataset consisted of urinary 1H NMR spectra obtained from a vegetarian diet study (samples from vegetarian or omnivorous males and females) [18]. The spectra were phased, baseline-corrected and bucketed into 205 bins (N = 205), as described in Ref. [18]. A range of simulated noise intensities, discriminant variables, and dilution factors were then introduced artificially into the data; the procedure can be summarized as follows:

(1) Generation of base data matrix. A single spectrum was selected from the vegetarian male group. The averaged signal intensity of the spectrum was normalized to 100. A dataset (X0) was then generated, consisting of M replicates (rows) of the selected spectrum. The first half (M/2) of the spectra were labeled as group 1 (G1), and the second half as group 2 (G2).
Fig. 1. Average recovering accuracy obtained using CSN, PQN and GAN methods. (a) Recovering accuracy with different numbers of discriminant variables, H; (b) recovering accuracy with different amplification factors, λ; (c) recovering accuracy with different densities of inter-individual difference, β; (d) recovering accuracy with different standard deviations of inter-individual difference intensity, σu; (e) recovering accuracy with different standard deviations of noise, σn; (f) recovering accuracy with different numbers of samples in the dataset, M.
(2) Addition of artificial between-group difference. H variables (columns, where H ≤ N) were randomly selected as the discriminant variables that discriminate G1 and G2. For each spectrum in G2, the intensities of the discriminant variables were multiplied by a positive factor λ. This step produced a dataset (Xg) with artificial between-group difference.

(3) Addition of artificial inter-individual difference. A random matrix (Eu) was simulated with normally distributed entries drawn from N(0, σu). Eu was then modified to simulate different densities of inter-individual difference (β, between 0 and 100%), whereby the values of (100 − β)% of the variables (columns) in Eu were set to zero. A new dataset (Xu) was then obtained, i.e., Xu = Xg + Eu.

(4) Addition of simulated dilution effect. For each spectrum xm in the simulated dataset, a random positive value αm was generated and multiplied with xm to obtain the diluted spectrum αm·xm. The random dilution factor αm was drawn from a normal distribution N(10, 3). A dataset (Xd) with dilution effect was thereby generated, Xd = diag(α1, α2, ⋯, αM)·Xu.

(5) Addition of simulated additive noise. A random noise matrix (En) was simulated with normally distributed entries of zero mean and standard deviation σn. Finally, the simulated dataset (X = Xd + En) was obtained, ready for further analysis.

(6) Analysis of recovering accuracy. The simulated dataset was normalized using the GAN, CSN, or PQN method to estimate the dilution matrix. The recovering accuracy was then calculated based on Eq. (8).

(7) Evaluation of GAN method. Steps 2 to 6 were repeated 200 times, and the averaged recovering accuracies for GAN, CSN and PQN were compared.
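A stripped-down version of this simulation can be sketched as follows. The stand-in base spectrum, the parameter values, the uniform (rather than normal) draw of dilution factors, and the least-squares choice of γ are all illustrative assumptions; the inter-individual-difference and additive-noise steps are omitted, and CSN's implicit dilution estimate is taken as the total peak area.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, H, lam = 40, 205, 20, 2.0

# (1) base data matrix: M replicates of one spectrum, mean intensity 100
base = np.abs(rng.normal(size=N)) + 1.0     # stand-in spectrum (assumption)
base *= 100.0 / base.mean()
X0 = np.tile(base, (M, 1))

# (2) between-group difference: H random variables amplified by lam in G2
disc = rng.choice(N, size=H, replace=False)
Xg = X0.copy()
Xg[M // 2:, disc] *= lam

# (4) dilution effect (steps (3) and (5) omitted in this sketch)
alpha = rng.uniform(5.0, 15.0, size=M)      # the paper draws from N(10, 3)
X = alpha[:, None] * Xg

# (6) recovering accuracy, reading Eq. (8) with a least-squares gamma
def recovering_accuracy(a, a_hat):
    a, a_hat = np.asarray(a, float), np.asarray(a_hat, float)
    gamma = (a @ a_hat) / (a_hat @ a_hat)   # argmin_gamma ||a - gamma*a_hat||
    return np.mean(1.0 - np.abs(a - gamma * a_hat) / a) * 100.0

alpha_csn = X.sum(axis=1)                   # CSN's implicit dilution estimate
print(round(recovering_accuracy(alpha, alpha_csn), 1))
```

Because group 2's total peak area is inflated by the amplified variables, CSN's estimate is biased and its recovering accuracy falls below 100%, matching the behavior reported for Fig. 1(a).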
2.2.3. Dataset II: evaluation of GAN on improving the subsequent multivariate analysis

Using the same experimental dataset as in Section 2.2.2, 36 urinary 1H NMR spectra from vegetarian females were selected to build a data matrix X0; these samples fall within the 95% confidence ellipse of the PCA score plot. This dataset contains actual experimental inter-individual difference and noise, so only artificial between-group difference and dilution effect were added to generate a simulated dataset for evaluating GAN using multivariate analysis.

(1) Sample permutation. First, the order of rows in X0 was permuted randomly. The first 18 samples were labeled as G1, and the latter 18 samples as G2.

(2) Addition of between-group difference. H variables (columns) were randomly selected as the discriminant variables between G1 and G2. To simulate the between-group difference, the intensities of the discriminant variables were multiplied by a positive factor λ for every spectrum in G2; the resulting dataset is denoted Xg.

(3) Addition of dilution effect. Random positive dilution factors for all 36 samples were generated as described in Section 2.2.2 to produce a dataset with dilution effect, Xd.

(4) Data normalization. Xd was normalized using CSN, PQN, or GAN. The normalized dataset was then Pareto-scaled [19] prior to partial least squares discriminant analysis (PLS-DA) [20,21].

(5) Evaluation of GAN method. Steps 1 to 4 were repeated 500 times, and the averaged results for the GAN, CSN and PQN methods were compared. Several parameters were studied, including R², which shows the fraction of variation explained by a component, and Q², which shows the predictability of the model through cross-validation. In addition, a retrieval rate defined as n(v)/H was introduced, where n(v) is the number of simulated discriminant variables that appear among the first H loadings (absolute values in descending order) of the first PLS component. Lastly, for all normalization methods, the p-value comparing the scores of both groups in the first PLS-DA component was examined.

2.2.4. Dataset III: evaluation of GAN using an experimental dataset

A published high-resolution magic-angle spinning (HR-MAS) dataset was used to evaluate the efficiency of the GAN method in the identification of differential variables. The spectra were acquired from liver tissues obtained from 7 male C57BLKS/J-db/db mice (diabetic group) and 6 male C57BLKS/J-db/m mice (control group). The spectra were phased and baseline-corrected manually, and the peaks were aligned manually to correct for minor peak shifts following automated spectral alignment using Alm's method [22]. After that, the spectra were bucketed into 200 bins of fixed width, as described in Ref. [23]. As group separation could be achieved using an unsupervised method, the data matrix was normalized using CSN, PQN or GAN and then subjected to a PCA model using the SIMCA-P+ software [24].

Fig. 2. Leave-one-out cross-validation for GAN, PQN and CSN. (a) Recovering accuracy with different numbers of group discriminant variables, H; (b) recovering accuracy with different standard deviations of noise, σn.

3. Results and discussion

3.1. Evaluation of GAN using dataset I

Dataset I was designed to examine the efficacy of GAN in the estimation of the dilution factor under different conditions, i.e., the range of between-group difference (H), the scale of between-group difference
(λ), the range of inter-individual difference (β), the scale of inter-individual difference (σu) and the scale of noise (σn). The averaged recovering accuracy obtained using CSN, PQN, and GAN is shown in Fig. 1.

Fig. 1(a) and (b) show the effect of between-group difference on the CSN, PQN and GAN methods, obtained by increasing the number of group discriminant variables (H) and the amplification factor of the group discriminant variables (λ) in a dataset that is noise-free (σn = 0) and has no inter-individual difference (β = 0 or σu = 0). PQN performs better than the other methods when H is less than half the number of variables (N); nevertheless, when H is higher than N/2, the most probable quotient obtained using PQN is not adequate to represent the data, which leads to low accuracy in the determination of the dilution factor. The recovering accuracy of CSN decreases greatly with increasing H or λ, as the variation in the 'actual' total peak area is no longer negligible at high H or λ. The GAN method overcomes this problem by including class information in the normalization procedure, and its performance was found to be relatively consistent, with an average recovering accuracy fA > 97% for all cases. Notably, GAN is particularly suitable for datasets with a large number of discriminant variables.

Fig. 1(c) and (d) illustrate the impact of inter-individual difference on the performance of the CSN, PQN and GAN methods, obtained by increasing the density (β) and intensity (σu) of the inter-individual difference in the noise-free (σn = 0) dataset with small between-group difference (H = 50, λ = 2). PQN performs better than the other methods for datasets with a low density of inter-individual difference, which is consistent with the results shown in Fig. 1(a). However, when the dataset contains high levels of inter-individual difference (β > 0.5), the recovering accuracy of PQN decreases at high σu and β. The recovering accuracy of CSN is lower than that of the other methods but increases slightly with increasing β. The performance of GAN was found to be relatively consistent (fA > 98% for all cases), which indicates that GAN is not susceptible to the level of inter-individual difference.

Fig. 1(e) demonstrates the influence of Gaussian white noise on the performance of the CSN, PQN and GAN methods, obtained by varying the standard deviation of the artificial noise (σn) in the dataset with small between-group difference (H = 50, λ = 2). Unlike the other methods, the recovering accuracy of the GAN method was found to be consistent at different noise levels (Fig. 1(e)), which implies that GAN works equally well for noisy datasets. In addition, the average recovering accuracy of GAN increases with an increasing number of samples, as a better estimate of the group center, and hence a more accurate reference for the normalization procedure, can then be obtained (Fig. 1(f)).
Fig. 3. PLS-DA results of the simulated dataset with H = 20. (a) R2 of the first PLS-DA component; (b) Q2 of the first PLS-DA component; (c) retrieval rate of the group discriminators; (d) logarithm of p-value comparing scores from the first PLS component.
As both CSN and PQN are unsupervised methods, their recovering accuracies are not affected by samples with unknown class membership. Nevertheless, this is not the case for the GAN method. Therefore, a leave-one-out study was carried out to assess the performance of the GAN method on unknown samples, with the normalization procedure performed according to Step 4 in Section 2.1. Comparing Fig. 2(a) with Fig. 1(a), the average recovering accuracy of GAN is reduced when the number of group discriminant variables H is high, since a correct classification of unknown samples cannot then be easily achieved. Nevertheless, the results in Fig. 2(a) show that GAN gave a better overall performance than both the CSN and PQN methods. Since the GAN method is less affected by noise level (Fig. 1(d) and (e)), it works well for datasets with apparent between-group difference (H = 50, λ = 2), as shown in Fig. 2(b).

Results from this study indicated that: (1) CSN is sensitive to the between-group difference (both H and λ) but relatively robust to inter-individual difference (both β and σu) and noise (σn). (2) PQN works well on datasets with a small range of perturbations (low H and β); when the dataset is affected by a high level of perturbation (high H or β), the performance of PQN is sensitive to the scale of the perturbation (λ, σu and σn). (3) GAN offers an advantage in terms of recovering accuracy not only for datasets with between-group difference (both H and λ), but also for datasets with inter-individual difference (both β and σu) and noise (σn), which indicates that GAN is more suitable for metabolomic studies where perturbation of a large number of metabolites is expected.

3.2. Evaluation of GAN using dataset II

The effect of normalization methods on the subsequent PLS-DA model was studied using a simulated dataset covering a range of H and λ. The dataset can be grouped into three cases:

Case 1. Large-scale perturbations occur in a small number of variables (high λ, low H);
Case 2. Small-scale perturbations occur in a large number of variables (small λ, high H);
Case 3. Large-scale perturbations occur in a large number of variables (high λ, high H).
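The retrieval rate n(v)/H used to score the PLS-DA models (Section 2.2.3) is simple to compute once the loadings of the first PLS component are available; the helper below is a sketch of that metric, with illustrative loading values.

```python
import numpy as np

def retrieval_rate(loadings, disc_idx):
    """Fraction n(v)/H of the simulated discriminant variables that appear
    among the H largest |loadings| of the first PLS component."""
    H = len(disc_idx)
    top_H = np.argsort(np.abs(loadings))[::-1][:H]   # indices of top-H loadings
    return len(set(top_H) & set(disc_idx)) / H

# illustrative loadings: variables 1 and 3 were the simulated discriminators
loadings = np.array([0.05, 0.90, -0.10, 0.80, 0.02])
retrieval_rate(loadings, [1, 3])  # -> 1.0, both ranked in the top 2
```

A retrieval rate of 1.0 means every artificially perturbed variable was recovered by the model; values near 0 indicate the normalization step has redistributed variance onto unperturbed bins.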
Fig. 4. PLS-DA results of the simulated dataset with λ = 2. (a) R2 of the first PLS-DA component; (b) Q2 of the first PLS-DA component; (c) retrieval rate of the discriminant variables; (d) logarithm of p-value of the first PLS-DA component.
From Fig. 3(a) and (b), GAN was found to perform better than CSN and PQN, particularly when λ is low, and it is comparable to CSN at high λ. The R² and Q² for the PQN method are generally lower than for both the GAN and CSN methods. In addition, GAN also produced a better overall retrieval rate at low λ, and it is comparable to PQN at high λ, as shown in Fig. 3(c). Furthermore, the result in Fig. 3(d) indicates that the groups are more separable (in terms of the p-value comparing scores from both groups) in the first PLS component when the data were normalized by GAN.

The advantage of the GAN method is more evident when the number of discriminant variables (H) is high. Fig. 4(a) and (b) demonstrate that PQN and CSN are not suitable for Case 2 datasets (high H), while GAN handles these datasets well. This could be due to the low recovering accuracy of both PQN and CSN at high H, as shown in Fig. 1(a). In addition, GAN also performs better for this dataset, as experimental datasets normally have a high noise density owing to the existence of inter-individual differences in all variables (Fig. 4(c)). Furthermore, Fig. 4(c) also shows GAN to be a more robust method in retrieving the discriminant variables. It also leads to a better group separation in the subsequent PLS-DA model (i.e., lower p-values), notably at high H (Fig. 4(d)).

3.3. Evaluation of GAN using dataset III

The developed GAN method was also studied using an experimental HR-MAS dataset [23]. The dataset consisted of two groups, the diabetic group (db/db mice) and the control group (db/m mice). The
PCA score plots obtained following the CSN, PQN, and GAN methods are shown in Fig. 5(b), (c) and (d), respectively. The PCA score plot of the original data is also shown in Fig. 5(a). Group separation could be achieved with all three normalization methods, owing to the significant metabolic differences between the diabetic and control groups. Nevertheless, GAN produced the best group separation, possibly due to its ability to give a better estimate of the dilution factor.

Since GAN is a supervised method, a validation study was carried out to examine the "GAN + PCA" model. In this study, the classification information of the samples was permuted randomly in 50 experiments; the new data were then normalized using GAN, and the −log10(p-value) comparing the PC1 scores of the two assigned groups was evaluated. Fig. 6 shows the p-value of PC1 against the correlation coefficient between the assigned classification and the true classification. The p-value is lowest when the correlation coefficient is 1, i.e., when the classification of the samples is correct. This result suggests that the group separation obtained using the "GAN + PCA" model is robust and does not arise by chance.

Fig. 7 shows a section of the differential spectra (2.46–4.10 ppm) between the center (i.e., mean spectrum) of the diabetic group and that of the control group, following the different normalization procedures. The differential spectrum obtained following the GAN method resembles the differential spectrum of the original data, as shown in Fig. 7(c) and (d). In fact, the correlation coefficient between the differential spectra of the GAN-normalized and original data was found to be 0.998, as compared with the CSN (0.875) and PQN (0.947) methods. These results suggest that both the CSN and PQN methods might introduce more bias than the GAN
Fig. 5. PCA score plots of the data normalized by CSN, PQN and GAN. (a) Score plot of the original data, without any normalization procedure; (b) score plot of CSN normalized data; (c) score plot of PQN normalized data; (d) score plot of GAN normalized data (symbols: "●" for samples from db/db mice; "▲" for samples from db/m mice; numbers on the score plots represent the sample index).
method into the dataset. In fact, the spectral region between 3 and 4 ppm was scaled up by both the CSN and PQN methods (Fig. 8), which might lead to false-positive results. On the other hand, the GAN method is less susceptible to variation in group mean, noise and inter-individual difference. It produces more compact plots for each of the experimental groups, probably due to a better estimation of the dilution factor (Fig. 8(c)).

Fig. 6. PCA model validation for GAN normalized data. The horizontal axis represents the correlation coefficient of the classification vectors between the re-assigned data and the original data.

4. Conclusions

Normalization is a crucial data preprocessing step in metabolomics. In the conventional CSN and PQN methods, the reference for normalization is assumed to be constant for all spectra. This is not necessarily true in all cases because of the complexity of biological samples. Therefore, in the current study, a supervised sample-dependent normalization method called GAN was proposed. GAN operates by first projecting the high-dimensional data onto a PCA subspace; each sample within each group is then scaled by a coefficient (αk⁻¹) so that the distance between the PCA scores of the sample and its group-mean scores is minimized. Both simulated and experimental data were used to examine the efficiency of the developed GAN method. Our study showed that GAN produced better class discrimination and improved the identification of differential metabolites in the subsequent multivariate analysis, as compared to the conventional CSN and PQN methods.
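The label-permutation validation described in Section 3.3 can be sketched as follows. The PCA/t-test helper is a generic stand-in (the paper used SIMCA-P+ for the PCA models), the GAN step inside the loop is elided, and the sample counts mirror the 7 db/db and 6 db/m mice; everything else here is an illustrative assumption.

```python
import numpy as np
from scipy import stats

def pc1_neglog_pvalue(X, labels):
    """-log10 p-value comparing the two groups' scores on the first
    principal component (plain SVD-based PCA on column-centered data)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    pc1 = Xc @ Vt[0]                         # scores on PC1
    p = stats.ttest_ind(pc1[labels == 0], pc1[labels == 1]).pvalue
    return -np.log10(p)

rng = np.random.default_rng(7)
y_true = np.array([0] * 7 + [1] * 6)         # 7 db/db and 6 db/m samples

for _ in range(5):                           # the paper uses 50 permutations
    y_perm = rng.permutation(y_true)
    # similarity between the re-assigned and the true classification
    r = abs(np.corrcoef(y_true, y_perm)[0, 1])
    # ...here the data would be GAN-normalized under y_perm, and
    # pc1_neglog_pvalue evaluated on the normalized matrix...
```

Plotting r against the resulting −log10(p-value) over all permutations reproduces the shape of Fig. 6: the separation statistic should peak only when the assigned labels match the true ones.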
Fig. 7. A section of differential spectrum between the mean spectra of the diabetic group and the control group. (a) CSN normalized data; (b) PQN normalized data; (c) GAN normalized data; (d) original data.
Fig. 8. Stacked plots of the binned NMR data from db/m mice (red solid line) and db/db mice (black dotted line): (a) CSN normalized data; (b) PQN normalized data; (c) GAN normalized data; (d) original data.
Acknowledgements

This work was supported by the Science Research Foundation of the Ministry of Health and the United Fujian Provincial Health and Education Project for Tackling the Key Research (WKJ2008-2-36), and the NSF of Fujian Province of China (2009J05087 and 2009J01299).

References

[1] F. Dieterle, A. Ross, G. Schlotterbeck, H. Senn, Probabilistic quotient normalization as robust method to account for dilution of complex biological mixtures: application in 1H NMR metabonomics, Anal. Chem. 78 (2006) 4281–4290.
[2] Ciba-Geigy AG, Wissenschaftliche Tabellen Geigy 1, Körperflüssigkeiten, 8th ed., Basel, Switzerland, 1977.
[3] A. Craig, O. Cloarec, E. Holmes, J.K. Nicholson, J.C. Lindon, Scaling and normalization effects in NMR spectroscopic metabonomic data sets, Anal. Chem. 78 (2006) 2262–2267.
[4] M.E. Bollard, E.G. Stanley, J.C. Lindon, J.K. Nicholson, E. Holmes, NMR-based metabonomic approaches for evaluating physiological influences on biofluid composition, NMR Biomed. 18 (2005) 143–162.
[5] J.T. Brindle, H. Antti, E. Holmes, G. Tranter, J.K. Nicholson, H.W. Bethell, S. Clarke, P.M. Schofield, E. McKilligin, D.E. Mosedale, D.J. Grainger, Rapid and noninvasive diagnosis of the presence and severity of coronary heart disease using 1H-NMR-based metabonomics, Nat. Med. 8 (2002) 1439–1444.
[6] M. Sysi-Aho, M. Katajamaa, L. Yetukuri, M. Orešič, Normalization method for metabolomics data using optimal selection of multiple internal standards, BMC Bioinforma. 8 (2007) 93.
[7] R.O. Torgrip, K.M. Aberg, E.I. Schuppe-Koistinen, J. Lindberg, A note on normalization of biofluid 1D 1H-NMR data, Metabolomics 4 (2008) 114–121.
[8] P. Jatlow, S. McKee, S.S. O'Malley, Correction of urine creatinine concentrations for creatinine excretion: is it useful? Clin. Chem. 49 (2003) 1932–1934.
[9] M. Scholz, S. Gatzek, A. Sterling, O. Fiehn, J. Selbig, Metabolite fingerprinting: detecting biological features by independent component analysis, Bioinformatics 20 (2004) 2447–2454. [10] P. Lemmerling, L. Vanhamme, R. Romano, S. van Huffel, A subspace time-domain algorithm for automated NMR spectral normalization, J. Magn. Reson. 157 (2002) 190–199. [11] R. Romano, R. Lamanna, M.T. Santini, P.L. Indovina, A new algorithm for NMR spectral normalization, J. Magn. Reson. 138 (1999) 115–122. [12] R. Romano, M.T. Santini, P.L. Indovina, A time-domain algorithm for NMR spectral normalization, J. Magn. Reson. 146 (2000) 89–99. [13] S. Wold, E. Johansson, M. Cocchi, PLS—partial least-squares projections to latent structures, in: H. Kubinnyi, G. Folkers, Y.C. Martin (Eds.), 3D QSAR in drug design— theory, methods and applications, ESCOM Science, Leiden, 1993. [14] M. Ringnér, What is principal component analysis? Nat. Biotechnol. 26 (2008) 303–304. [15] O.M. Kvalheim, F. Brakstad, Y.Z. Liang, Preprocessing of analytical profiles in the presence of homoscedastic or heteroscedastic noise, Anal. Chem. 66 (1994) 43–51. [16] T.K. Karakach, P.D. Wentzell, J.A. Walter, Characterization of the measurement error structure in 1D 1H NMR data for metabolomics studies, Anal. Chim. Acta 636 (2009) 163–174. [17] M. Statheropoulos, A. Pappa, P. Karamertzanis, H.L.C. Meuzelaar, Noise reduction of fast, repetitive GC/MS measurements using principal component analysis (PCA), Anal. Chim. Acta 401 (1999) 35–43.
[18] J.J. Xu, S.Y. Yang, S.H. Cai, J.Y. Dong, X.J. Li, Z. Chen, Identification of a metabolic signature in urine for lacto-vegetarians using 1H NMR spectroscopy and pattern recognition, Anal. Bioanal. Chem. 396 (2010) 1451–1463.
[19] L. Eriksson, E. Johansson, N. Kettaneh-Wold, S. Wold, Scaling, in: Introduction to Multi- and Megavariate Data Analysis Using Projection Methods (PCA & PLS), Umetrics, 1999, pp. 213–225.
[20] H. Wold, Soft modeling: the basic design and some extensions, in: K.G. Jöreskog, H. Wold (Eds.), Systems Under Indirect Observation, vols. I and II, North-Holland, Amsterdam, 1982.
[21] S. Wold, M. Sjöström, L. Eriksson, PLS-regression: a basic tool of chemometrics, Chemom. Intell. Lab. Syst. 58 (2001) 109–130.
[22] E. Alm, R.J.O. Torgrip, K.M. Åberg, I. Schuppe-Koistinen, J. Lindberg, A solution to the 1D NMR alignment problem using an extended generalized fuzzy Hough transform and mode support, Anal. Bioanal. Chem. 395 (2009) 213–223.
[23] J.J. Xu, J. Zhang, J.Y. Dong, S.H. Cai, J.Y. Yang, Z. Chen, Metabonomics studies of intact hepatic and renal cortical tissues from diabetic db/db mice using high-resolution magic-angle spinning 1H NMR spectroscopy, Anal. Bioanal. Chem. 393 (2009) 1657–1668.
[24] L. Eriksson, E. Johansson, N. Kettaneh-Wold, S. Wold, Multi- and Megavariate Data Analysis: Principles and Applications, Umetrics Academy, Umeå, 2001.