The use of environmental metabolomics to determine glyphosate level of exposure in rapeseed (Brassica napus L.) seedlings

The use of environmental metabolomics to determine glyphosate level of exposure in rapeseed (Brassica napus L.) seedlings

Environmental Pollution 159 (2011) 3071e3077 Contents lists available at ScienceDirect Environmental Pollution journal homepage: www.elsevier.com/lo...

684KB Sizes 2 Downloads 48 Views

Environmental Pollution 159 (2011) 3071e3077

Contents lists available at ScienceDirect

Environmental Pollution journal homepage: www.elsevier.com/locate/envpol

The use of environmental metabolomics to determine glyphosate level of exposure in rapeseed (Brassica napus L.) seedlings Iben Lykke Petersen 1, Giorgio Tomasi, Hilmer Sørensen, Esther S. Boll, Hans Christian Bruun Hansen, Jan H. Christensen* Department of Basic Sciences and Environment, Faculty of Life Sciences, University of Copenhagen, Thorvaldsensvej 40, DK-1871 Frederiksberg C, Denmark

a r t i c l e i n f o

a b s t r a c t

Article history: Received 15 September 2010 Received in revised form 29 March 2011 Accepted 7 April 2011

Metabolic profiling in plants can be used to differentiate between treatments and to search for biomarkers for exposure. A methodology for processing Ultra-High-Performance Liquid ChromatographyeDiode-Array-Detection data is devised. This methodology includes a scheme for selecting informative wavelengths, baseline removal, retention time alignment, selection of relevant retention times, and principal component analysis (PCA). Plant crude extracts from rapeseed seedling exposed to sublethal concentrations of glyphosate are used as a study case. Through this approach, plants exposed to concentrations down to 5 mM could be distinguished from the controls. The compounds responsible for this differentiation were partially identified and were different from those specific for high exposure samples, which suggests that two different responses to glyphosate are elicited in rapeseed depending on the level of exposure. The PCA loadings indicate that a combination of other metabolites could be more sensitive than the response of shikimate to detect glyphosate exposure. Ó 2011 Elsevier Ltd. All rights reserved.

Keywords: Environmental metabolomics Metabolic profiling Glyphosate Plant biomarker Ultra performance-liquid chromatographyediode array detection

1. Introduction The determination of the level of exposure of organisms to a contaminant is central to environmental risk assessment procedures (Covello and Merkhoher, 1993). Biomarkers, which are biochemical or physiological responses specific to the exposure to a compound or class of compounds, can be used to this end (Walker, 1992). Biomarkers can be proteins as well as low molecular weight endogenous metabolites and, in the latter case, they are a part of the so-called metabolome. The traditional approach often focuses on single biomarkers and thus ignores large portions of the metabolome and potential biomarkers may be neglected (Hendriks et al., 2005). A more comprehensive characterization of a relevant portion of the metabolome can correct for this, and extends the concept of biomarkers to metabolite profiles. These profiles can include many compounds, whose change in concentration, if taken singularly, may not be sufficient to ascertain and quantify exposure, but in combination with other compounds may be directly related to exposure. Metabolomics is becoming increasingly popular for * Corresponding author. E-mail address: [email protected] (J.H. Christensen). 1 Present address: Carlsberg Research Center, Gamle Carlsberg Vej 10, DK-2500 Valby, Denmark. 0269-7491/$ e see front matter Ó 2011 Elsevier Ltd. All rights reserved. doi:10.1016/j.envpol.2011.04.005

human health studies as well as for drug development (Covello and Merkhoher, 1993). The use of metabolomics in the environmental sciences would allow a comprehensive characterization of the causal relation between exposure to chemicals and toxic effects on organisms. However, its application for environmental studies is still in its infancy and has been employed thus far almost exclusively to animal species (Viant, 2008). Plant metabolomics has been used to classify samples (e.g., according to origin, biological relevance and growth conditions) (Fiehn, 2002) and more generally terrestrial plant phenotypes (Fiehn, 2002; Messerli et al., 2007); to evaluate the response to abiotic changes or environmental stressors (Bailey et al., 2003; Gidman et al., 2006) and to screen new/unknown herbicides for their modes of action (Aranibar et al., 2001; Ott et al., 2003). However, to the authors’ knowledge plant metabolite profiling has not been used to study the actual level of exposure of the environment to organic contaminants. In general, such level is difficult and costly to determine (Walker, 1992) and risk assessment procedures are most often based on total concentrations of contaminants, although only a fraction is bioavailable and may constitute a risk. Furthermore, the estimation of environmental effects is complicated by the fact that organic contaminants often demonstrate synergistic or antagonistic adverse effects on vegetation, animals, and humans (Carpenter et al., 2002). The plant

3072

I.L. Petersen et al. / Environmental Pollution 159 (2011) 3071e3077

metabolome, on the other hand, only responds to the bioavailable portion of the contaminants and thus is representative of the actual level of exposure in the field. Moreover, plants are stationary, and are therefore more indicative of local contamination compared to animals. In this work, plant metabolite profiling is used to detect features (namely, patterns of plant biomarkers) that give a stable and robust response even at low levels of exposure to glyphosate (N-(Phosphonomethyl)glycine e also known as Round-upÒ). Glyphosate affects the shikimic acid pathway resulting in an accumulation of shikimate, which is therefore often used as a biomarker (Petersen et al., 2006). However, other endogenous metabolites (e.g., downstream from shikimate in the shikimic acid pathway) are affected due to metabolic regulations (Petersen et al., 2007), which could be revealed by a less targeted approach. The objectives of this study are therefore to determine whether reversed phase Ultra-HighPerformance Liquid Chromatography with Diode-Array-Detection (RP-UHPLCeDAD) analysis of crude plant extracts can be used to differentiate between lower levels of glyphosate exposure and subsequently suggest a protocol for data preprocessing and chemometric data analysis for plant metabolomics based on this type of chemical data. 2. Methodology Crude extracts of glyphosate treated and control plants obtained through pressurized liquid extraction (PLE) were analyzed by RP-UHPLCeDAD. The analytical method was targeted for amphiphilic low molecular weight (LMW) compounds rather than the entire metabolome. Namely, the PLE method extracts both hydrophilic and amphiphilic LMW compounds, while the strongly hydrophilic compounds are not adequately resolved in RP-UHPLC. The use of UVeVISeDAD allowed the detection of important aromatic compounds characteristic for Brassica plants (e.g., sinapoyl compounds and flavonoids), which are expected to undergo changes following glyphosate effects on the shikimate pathway (Petersen et al., 2007, 2006). The analysis by RPeUHPLCeDAD produced for each sample n a chromatogram that can be stored in a data matrix Xn of dimensions I  T, where I is the number of wavelengths and T is the number of scan times; the i-th row of Xn contains the single wavelength chromatogram (SWC) recorded at the i-th wavelength and t-th column contains the absorbance spectrum measured at in the t-th scan time. Such chromatograms were then preprocessed to cut them to a more manageable size (i.e., by wavelength selection) and to reduce the effects of instrumental factors unrelated to glyphosate exposure (e.g., signal shift due to instrument variation and biological variability). Finally, the preprocessed signals were stacked into a matrix that was analyzed by PCA and Linear Discriminant Analysis (LDA) was employed on the PCA scores to distinguish controls from plants exposed to low levels of glyphosate. A flowchart of these processing steps is shown in Fig. S1.

2.1.1. Wavelength selection The Xn’s are generally large but they are characterized by high collinearity between many SWC’s (i.e., by the fact that several wavelengths carry approximately the same information) and by the presence of SWC’s in which no significant peaks can be detected and carry hardly any useful information. In order to reduce the size of the data set, and thus the computation time, a preliminary selection scheme for the wavelengths was devised with the idea of retaining those with the largest variance and least correlated to the others. This selection was performed through a procedure based on two sets of parameters ri and si for i ¼ 1.I (Equation (1)).

ri h

P P

 xi Þ2 t ðxitP

isi0

t ðxit

  X  xi Þ xi0 t  xi0 rii0 h pffiffiffiffiffiffiffiffi si si0 isi0

2.1.2. Baseline removal Chromatographic baselines need to be removed since they can negatively affect wavelength selection, alignment, normalization, and ultimately the results of the data analysis (Christensen and Tomasi, 2007). The signals were therefore numerically differentiated along the rows of Xn to remove baselines without their explicit estimation. Given the high signal-to-noise ratio, no smoothing was employed (Christensen et al., 2005; Christensen and Tomasi, 2007). 2.1.3. Retention time alignment The differentiated SWC’s were rigidly shifted within a pre-selected range by maximizing the cross-correlation (calculated using the fast Fourier transform (Press et al., 2002) between the samples and a target chromatogram e see below). Only integer shift values were considered. The alignment was further improved by means of the correlation optimized warping (COW) algorithm (Nielsen et al., 1998). The optimal warping parameters m* (segment length) and c* (the slack parameter) were obtained using a simplex based procedure (Skov et al., 2006). The initial search space was [50, 500] for m with 25 points increments, and [1, 25] for c with two point increments. The target chromatogram was the sample with the highest sum of correlation with other chromatograms. The correlation was calculated using the vectorized chromatograms (i.e., stringing out the rows of Xn in a vector xn), therefore treating the SWC’s for the same sample as instances of the same random variable (Nielsen et al., 1998). A leading and a trailing section replicating the first and the last point in each SWC were attached prior to the application of COW to guarantee the necessary flexibility at the endpoints (Nielsen et al., 1998). The length of these sections, which were removed right after the alignment, was automatically determined based on the two warping parameters and on the maximum shift required by the data (w), estimated as 1.5 time the maximum rigid shift required for smaller sections at the beginning, half-way and at the end of the SWCs. The optimal m*, c* and w* were also used to align the selected SWC separately (as an alternative to the standard COW algorithm for multiple channel signals, which was also tested), because it was considered that the intrinsic multi-way structure of the data would not be exploited in the PCA after all and that better alignments could be obtained in this fashion at a relatively small additional computational cost. 2.1.4. Normalization The signals were normalized to unitary Euclidean norm (Equation (2)) both before and after variable selection (see below) as the interest was mostly on compositional changes in the metabolome rather than the absolute values of peak areas. xN nt ¼

2.1. Data preprocessing

si h

those, in a particular spectral region, that are least correlated to all the other wavelengths in the matrix. The values of ri and si can vary depending on the sample, but by concatenating several Xn’s (e.g., by forming the matrix ½ X1 / XN , where N is the number of samples) it is possible to determine the position of the critical points in a subset of samples or in all of them at once. It is worth mentioning that for this concatenation, retention time shift (i.e., a shift along the t index between samples) has no influence on either ri or si. Here it was decided to group samples by exposure level (thereby obtaining seven sets of ri and si values) and a high occurrence of local minima/maxima at a certain wavelength for either si or ri across exposure level was taken as an indication that such wavelength was more informative than the others in a certain spectral region.

(1)

where xit denote the absorbance (differentiated along retention time to remove the baselines e cf. below) for the i-th wavelength at the t-th time interval, xi is the mean differentiated absorbance for the i-th wavelength, and rii0 is the Pearson’s correlation coefficient between the differentiated SWC’s measured at the i-th and the i’-th wavelengths. The absolute values of ri and si are of little interest and there is no rationale for setting an absolute cutoff value for choosing which variables to retain. However, absorbance spectra in liquid media are smooth and present relatively large features (peaks) that span several adjacent wavelengths; hence, si and ri are also expected to vary smoothly and local maxima of si denote wavelengths which carry the most information in a certain spectral interval whereas the local minima in ri denote

xnt an

for an ¼

rffiffiffiffiffiffiffiffiffiffiffiffiffiffi X x2nt

(2)

t

where xnt is the first derivative of the n-th signal at the t-th retention time, and xN nt is the normalized data. The SWC’s were normalized separately before variable selection whereas no such distinction was applied in the second normalization; moreover the latter was applied on the selected variables from the preprocessed (and normalized, although numerically it would not make any difference) signals. 2.1.5. Variable selection Variables for which exposure could be a significant factor compared to biological variation were detected using a linear regression model having the absorbance at the t-th retention time as the response and the exposure level as predictor. The p-values for the model significance were then used to decide which variables to retain for the PCA. Note that this is essentially equivalent to ANOVA filtering, which has previously been used for variable selection in metabolic profiling (Boccard et al., 2007; Pierce et al., 2005; Quemener et al., 2007), although a continuous quantitative factor (predictor) was used rather than a categorical one. The rationale for using a selection scheme that is basically univariate as a preliminary step before multivariate analysis was not the identification of all the potential biomarkers, but rather the removal of irrelevant or confounding variables (e.g., those with random and/or little variability between exposure levels) to find a smaller set of variables that could reliably separate the different levels of exposure. For the same reason, the risk of not selecting variables that have a non-linear correlation with exposure level was deemed acceptable. To avoid the drawbacks of choosing an uncertain (and somewhat arbitrary) threshold on the p-value based on few samples per exposure level, and in order to further reduce the number of retained variables (v), it was decided to

I.L. Petersen et al. / Environmental Pollution 159 (2011) 3071e3077 fix their number starting from those with the lowest p and choose the model with the lowest Root Mean Square Error in Cross Validation (RMSECV) which also had a small variance within each exposure level and the best separation between exposures. The models for 240 values of v (i.e., from 25 to 6000 variables with 25 variables increments) were thus calculated and the RMSECV from the cross validation with 11 random segments repeated 5 times of all the combinations of K and v were visually inspected to determine the optimal values of these two metaparameters. 2.2. Chemometric data analysis PCA condenses the most salient patterns of variation in the data set in few orthogonal variables (the principal components e PCs). For the N  V data matrix X, the PCA model can be expressed as in Equation (3), where T denotes the N  K score matrix, P the V  K loading matrix (P), “’” the transpose, and E the residuals. X ¼ TP’ þ E

(3)

The idea is that few PC’s are sufficient to describe most of the systematic variation in the set; however, given the complexity of the metabolome and the overall biological variability, this is often unrealistic. The optimal number of PCs was determined by inspection of the score and loading plots and through cross validation with 11 random segments repeated 5 times using the Eigenvector algorithm (Bro et al., 2008). Sample outliers were detected by inspecting the influence plot (sample residuals vs. leverages), followed by the inspection of the chromatograms for suspected samples. Furthermore, validation was performed by projecting an independent test set of preprocessed signals on to the model (Krzanowski, 2000). The preprocessing of this independent set included: differentiation, alignment toward the same target and using the same warping parameters as determined for the training set, selection of variables and normalization according to Equation (2). Linear Discriminant Analysis in the PC space (Krzanowski, 2000; Hoogerbrugge et al., 1983) was employed to ascertain whether it were possible to separate lowexposure samples from controls when this separation was not apparent in the score plot.

3073

3.2.2. UHPLC analysis The crude extracts were passed through a 0.2 mm filter and analyzed on an UPLC (Ultra Performance Liquid Chromatography) Waters Acquity System (Milford, MA) interfaced to a diode array detector. The system was equipped with an Acquity UPLCÔ BEH C18 column (100 mm length, 2.1 mm inner diameter, particle size 1.7 mm). Five microliters were injected using the partial loop with needle overfill option. The instrument was located in a room with temperature control set at 20  C and the column temperature was kept at 25  C. The mobile phases were A: 2% Acetonitrile in Milli-Q Water þ 0.1% formic acid and B: 98% Acetonitrile in Milli-Q Water þ 0.1% formic acid. The flow rate was 0.2 mL/min, which resulted in a maximum system pressure of 6500 psi. The UPLC program was comprised of an isocratic step using 99% A for 3 min followed by a gradient elution to 70% A from 3 to 26 min. The column was then cleaned with 70% B for 3 min and reconditioned for 12 min in 99% A, leading to a total analysis time of 41 min. Detection was performed at 20 scan/sec, with a resolution of 2.4 nm from 191 to 400 nm with 1 nm increments. 3.3. Data

Glyphosate (N-(phosphonomethyl)glycine) (95% purity), shikimic acid (99% purity), acetonitrile (LCeMS grade,  99.9% purity e Chromasolv) and Formic acid (LCeMS grade, 98% purity, Fluka) was obtained from SigmaeAldrich. All other chemicals were of proanalysis grade.

The data set is comprised of SWC’s of 28 samples (viz. seven exposures times four biological replicates); 8 replicate analyses of a 1:1 mixture of the controls and of the highest dose samples (the ‘reference set’); four replicate analyses of three selected samples of exposures 0, 10 and 50 (12 samples in total). Four samples of the 28 were assigned to the test set (one for each of the exposures 0, 5, 30 and 50) and the remaining 24 to the training set. The reference samples were analyzed between every six samples followed by a blank sample (Milli-Q water). The analytical sequence, which was comprised in total of 56 samples, was randomized. The data for each sample consisted of a 210 wavelengths  48000 retention times, which were reduced to 70  4951 by considering one every five sampling points between 2.1 and 22.7 min and one every three wavelengths.

3.1. Growth and glyphosate treatment of rapeseed

4. Results and discussion

Seeds of Brassica napus L. cv. Pollen (purchased from Cetiom, Pessac, France) were germinated and grown in a hydroponic system as described previously by Petersen et al. (2007). In short, the glyphosate treatment was started by adding glyphosate to achieve initial concentrations of 1.0, 5.0, 10, 20, 30 and 50 mM. Four pots were used for each level of exposure (biological replicates). The nutrient solutions were replaced every three days with fresh solutions of the same glyphosate concentrations. A set of control experiments with zero concentration (again four replicates) were performed. After nine days of glyphosate treatment the plants were removed, separated into root and top, and freeze-dried.

4.1. Initial comparison of UHPLCeDAD chromatograms

3. Experimental

3.2. Analytical 3.2.1. Crude extract Crude extracts were prepared using a Pressurized Liquid Extraction system (ASE200, Dionex Corporation, Sunnyvale, CA). 250 mg of ground freeze-dried plant top material from all exposures and biological replicates was weighed and transferred to a 1 mL extraction cell. The extraction was performed using 100% methanol and subject to the following conditions. Preheat time: 5 min; static time: 5 min; flush volume: 30%; purge time: 100 s; cycles: 3; pressure: 1000 psi; extraction temperature: 90  C. The crude extracts were then evaporated to dryness using compressed air, and the residues redissolved in 4.0 mL Milli-Q water.

Several regions with potential biomarkers can already be identified from a visual inspection of the SWC’s. For example, the retention time region between 3 and 4 min (Fig. 1) contains at least two compounds that increase in concentration at higher glyphosate exposures (above 10 mM). Conversely, the concentration of a compound that elutes between 20 and 21 min visibly decreases with exposure above 10 mM, although some of the replicates suggest that this decrease is seen already at lower exposures (not shown). No single compound or cluster of compounds with a clear monotonic decrease or increase in concentration at low exposures could be identified. This identification was hampered by relatively large biological variability (and to some extent also instrumental) within exposures (not shown), which highlights the need for proper data processing and selection of informative retention times before chemometric data analysis. 4.2. Data preprocessing 4.2.1. Baseline removal Of the preprocessing steps, this was the most straightforward; its effectiveness was determined through visual inspection of the chromatograms and hardly any decrease in the signal-to-noise ratio was observed (not shown).

3074

I.L. Petersen et al. / Environmental Pollution 159 (2011) 3071e3077

Fig. 1. Comparison of raw 265.7 SWC of crude extracts from plants exposed to glyphosate and controls. All the analytical replicates at 0, 10 and 50 mM have all been included.

4.2.2. Wavelength selection The local maxima and minima of s (the SWC’s’ variances) and r (the measures of correlation between SWC’s) respectively appears to be in good accord and four to six groups of highly correlated wavelengths could be observed (Fig. S2A and C). The selection is consistent between exposures although a clear change can be observed for the 50 mM samples (Fig. S2B). However, at this concentration range the plants were visibly affected (Fig. S3) and a definite change in the metabolite profile was expected. The first wavelength (191 nm) was always selected for both diagnostics whereas the last (400 nm) was always selected for the lowest correlation with the others (Fig. S2C). However, the 191 nm SWC contains many unresolved peaks and irregular patterns of retention time shifts that could not be entirely corrected using COW. Few peaks in this spectral region were not present in the other selected SWC’s (e.g., 217.7 nm); therefore, this wavelength was not considered in the analysis. Conversely, the 400 nm SWC contained hardly any peaks and the low r was largely determined by noise. Hence, it was also left out and four wavelengths (i.e., 217.7, 238.7, 265.7 and 328.7 nm) were retained (Fig. 2). Although, wavelength selection was not based on a priori knowledge, the selected wavelengths are highly relevant for the detection of the aromatic compounds of special interest in this study (e.g., sinapoyl compounds and flavonoids). In particular, 217.7 nm is a sensitive but unspecific wavelength for the determination of aromatic compounds; sinapoyl derivatives have almost equal absorption at 238.7 and 328.7 nm; and the relative absorptions at 265.7 and 328.7 nm give important information on flavonoids typical for vegetative parts of Brassica plants. 4.2.3. Retention time alignment The preliminary rigid shift on the entire signal did not lead to appreciable improvements, but allowed to estimate w* to be around 40. The simplex optimization procedure performed solely on the sample set (i.e., without the analytical replicates or the reference samples) and the four retained wavelengths indicated that m* ¼ 73 points, i.e., approximately 50% larger than the single peak at its base, and a c* ¼ 22 points. Although this level of local correction appears to be extreme (w30% local corrections are allowed), it is consistent with the fact that shift patterns tend to be

Fig. 2. Contour plot of a reference sample, with lines indicating the selected wavelengths: 217.7, 238.7, 265.7 and 328.7 nm.

more irregular in liquid chromatography data and with larger retention time shifts than gas chromatographic data (Christensen and Tomasi, 2007). The goodness of the choice of the warping parameters is supported by the relatively high value of the peak factor (0.99), which indicates that the areas under the peaks remained largely unaffected after warping and of the simplicity (0.97) of the reference set (Skov et al., 2006). Moreover, the chromatograms did not show any particular artifact (e.g., extreme peak widening) after the alignment. Due to extreme analytical shifts (data not shown), one of the five biological replicates of the 20 mM glyphosate treatment could not be properly aligned and it was therefore removed from the data set. The separate alignment of the SWCs using m*, c* and w* led to slightly better results than those obtained using the standard COW (visually determined, as peak factor and simplicity for the two cases are not directly comparable) and was thus preferred (not shown). The effects of the baseline removal and retention time alignment on the reference set are shown in Fig. 3. 4.2.4. Variable selection and normalization Finally, the p-values for the significance of a linear regression model were used to filter the variables by selecting those for which the exposure was a significant factor. The 19804 (4 wavelengths  4951 retention times) p-values ranged between 1012 and near-unity. With a significance level a of 0.05 (or 0.01), v ¼ 11828 variables (9393) were selected. The corresponding PCA models were characterized by a good separation for the high exposure samples (20 mM) and poor resolution of the lowest ones, in particular of the controls (not shown). Moreover, for both, the variance was also larger for the biological replicates than for the analytical ones and a marked drift in the second PC can be observed for the low-exposure samples. This may be indicative, among others, of the inclusion of noisy variables which affect the normalization term: because of the squaring of the elements in the Euclidean norm, noise does not cancel out upon summation (cf. Equation (2)), and constitutes a relatively small perturbation of an which can cause a drift in otherwise identical samples in the PC space (Christensen and Tomasi, 2007; Sjoedin, 1984). Therefore, lower significance levels (i.e., p-value thresholds) were tested to further cut on the number of variables. Disregarding the high collinearity between variables in chromatographic data, a further reduction of a could be achieved by the Bonferroni

I.L. Petersen et al. / Environmental Pollution 159 (2011) 3071e3077

3075

scarce number of samples at these concentrations. On the other hand, all the score plots indicate that it could be possible to distinguish the 0/1 mM from the 5/10 mM samples in Fig. 4. LDA was thus employed to find the best separation between these two groups in the 2 PC space (which explains up to 99% of the variation for v ¼ 50). The best results in prediction were obtained with the 500 variable model (Fig. S5), with a misclassification rate of 21% on the training set (3 samples out of 14) and 0% on the monitoring set (the replicates). The two samples of the independent test set are also correctly assigned. It is worth mentioning that smaller data sets excluding the higher exposures were also investigated without visible improvements for the separation at the lower concentrations. The 500 selected variables with lowest p-value were distributed between the four wavelengths (195 at 217.7 nm, 115 at 238.7 nm, 162 at 265.7 nm, and 28 at 328.7 nm). As expected the highest wavelength is less informative and fewer variables were picked. The two-component model explained 97.2% of the variance. Fig. 4 shows the PC1 vs. PC2 score plot with and without ANOVA filtering. As can be seen, the problem of separating low exposures could not be solved, but from Fig. 4B, the variance due to the analytical procedure is small compared to the biological one, which is particularly large for the 20 and 30 mM samples. One 30 mM

Fig. 3. Effects of baseline correction and retention time alignment on the 217.7 nm SWC for the reference set. A: original data, B: differentiated and aligned data.

correction aV1, where V is the number of variables in the data matrix (here 19804), which accounts for the large number of independent tests. This produced a further five-fold reduction in the number of variables to 2906 (2324 for a ¼ 0.01), but hardly any significant improvement (or degradation) of the clustering in the score plot could be observed. The incremental procedure described in the methods section in which variables with increasing p-values are progressively included was thus used to investigate further reductions in the number of retained variables v which did not rely on a specific threshold. The RMSECV values did not vary much with respect to K for v larger than 250 and the curves for constant v flatten out after two, at most three, PCs (Fig. S4). With respect to v, the lowest values were obtained around 50 variables (RMSECV ¼ 0.06 for 2 PC), and relative minima were observed around 800 and 3400 (0.09 in both cases). However, the function is essentially stable between 300 and 1500 and between 2300 and 4800 variables. The number of components was thus fixed to two and the several interesting values of v were further investigated (e.g., 50, 500, 800 and 3400) for lowest within-exposure variance and best separation.

4.3. Chemometric data analysis For none of the four models a clear separation between the controls and the 1 mM samples (not shown) was achieved, which suggests that the variations in the metabolome may be too subtle at this level of exposure to be detected with the proposed combination of analytical and chemometric methods and the relatively

Fig. 4. Score plots of PC1 vs. PC2 on all concentrations levels. A: without variable selection and B: after variable selection for v ¼ 500. The circles represent the analytical replicates and lozenges the independent test set.

3076

I.L. Petersen et al. / Environmental Pollution 159 (2011) 3071e3077

sample was less affected by the glyphosate exposure than the other biological replicates (see Fig. S3), which can explain that this sample generally lie in-between the 20 mM and 30 mM sample clusters (Fig. 4b). PC1, which accounts for 93.1% of the variance, mostly describes the degree of exposure, with scores increasing with the concentration of glyphosate. The second PC also captures the effect at higher exposures, although the direction is changed and the scores become more negative as one moves from 20 to 50 mM. These results are compatible with the existence of two different responses of the plants to the exposure, one for lower concentrations of glyphosate and one for higher ones. However, further studies will be required to further substantiate this observation, as horseshoe configurations such as the one observed may also result from preprocessing (e.g., normalization or retention time alignment e (Christensen and Tomasi, 2007)). The results also suggest that glyphosate exposed plants could be detected and distinguished from control plants down to concentrations in water as low as 5 mM, which is a better sensitivity than observed for the biomarker shikimate in a previous study (Petersen et al., 2007). The chemical identity of the compounds picked by the variable selection scheme and corresponding to the largest loading coefficients (Fig. 5) was tentatively investigated and the findings of a previous study on the same data set were confirmed (Petersen et al., 2007). The loading coefficients are relatively large for 16 peak regions distributed amongst the wavelengths. Five of the selected compounds elute between 3.4 and 5.4 min are therefore highly hydrophilic. Their absorption spectra suggest that they are aliphatic (Fig. S6). The central part of the chromatogram (i.e., retention time between 11 and 14 min) is quantitatively dominated by structurally different flavonoids, of which some also contain sinapoyl groups as revealed from the UV spectra (Fig. S7). However, most of these compounds seem not to be systematically affected by glyphosate exposure, and only few peaks seem to show a nonlinear response that was responsible for the rejection by the filtering procedure. Structural different LMW compounds with the sinapoyl groups bound to aliphatic compounds as glucose, sucrose and malic acid are seen both in the central and in the last part of the chromatograms as dominating but non-relevant peaks. The effects of glyphosate on plant metabolic regulations appear to be more complex than previously expected (Petersen et al., 2007) and to affect also hydrophilic aliphatic compounds in addition to

Fig. 5. Variables retained after ANOVA filtering.

LMW aromatic ones. However, the data set is relatively small and a larger study also involving structure elucidation with mass spectrometry detection and nuclear magnetic resonance spectroscopy is necessary to validate the results and to identify the individual compounds of relevance. Acknowledgments The authors gratefully acknowledge the financial support from the Information Centre on Contaminated Sites, Copenhagen, from the Danish Research Council, grant number 643-01-0101, the Lundbeck foundation, the COWI foundation, Ib Henriksens foundation and the RECETO consortium. G. Tomasi gratefully acknowledges the financial support of the EU FP6 ToK project “Food Informatics” (EC Contract No: MTKI-CT-2005-030056). Appendix. Supplementary material Additional information on the method protocol, wavelength and variable selection schemes, linear discriminant analysis, loading plots, and UV spectra of selected and unselected peak regions. This material is available free of charge via the Internet at http://www. sciencedirect.com. Supplementary material related to this article can be found at doi:10.1016/j.envpol.2011.04.005. References Aranibar, N., Singh, B.K., Stockton, G.W., Ott, K.H., 2001. Automated mode-of-action detection by metabolic profiling. Biochemical and Biophysical Research Communications 286, 150e155. Bailey, N.J.C., Oven, M., Holmes, E., Nicholson, J.K., Zenk, M.H., 2003. Metabolomic analysis of the consequences of cadmium exposure in Silene cucubalus cell cultures via H-1 NMR spectroscopy and chemometrics. Phytochemistry 62, 851e858. Boccard, J., Grata, E., Thiocone, A., Gauvrit, J.Y., Lanteri, P., Carrupt, P.A., Wolfender, J.L., Rudaz, S., 2007. Multivariate data analysis of rapid LCeTOF/MS experiments from Arabidopsis thaliana stressed by wounding. Chemometrics and Intelligent Laboratory Systems 86, 189e197. Bro, R., Kjeldahl, K., Smilde, A.K., Kiers, H.A.L., 2008. Cross-validation of component models: a critical look at current methods. Analytical and Bioanalytical Chemistry 390, 1241e1251. Carpenter, D.O., Arcaro, K., Spink, D.C., 2002. Understanding the human health effects of chemical mixtures. Environmental Health Perspectives 110, 25e42. Christensen, J.H., Tomasi, G., 2007. Practical aspects of chemometrics for oil spill fingerprinting. Journal of Chromatography A 1169, 1e22. Christensen, J.H., Tomasi, G., Hansen, A.B., 2005. Chemical fingerprinting of petroleum biomarkers using time warping and PCA. Environmental Science & Technology 39, 255e260. Covello, V.T., Merkhoher, M.W., 1993. Risk Assessment Methods: Approaches for Assessing Health and Environmental Risks. Plenum Press, New York, NY, USA. Fiehn, O., 2002. Metabolomics e the link between genotypes and phenotypes. Plant Molecular Biology 48, 155e171. Gidman, E.A., Stevens, C.J., Goodacre, R., Broadhurst, D., Emmett, B., GwynnJones, D., 2006. Using metabolic fingerprinting of plants for evaluating nitrogen deposition impacts on the landscape level. Global Change Biology 12, 1460e1465. Hendriks, M.M.W.B., Cruz-Juarez, L., De Bont, D., Hall, R.D., 2005. Preprocessing and exploratory analysis of chromatographic profiles of plant extracts. Analytica Chimica Acta 545, 53e64. Hoogerbrugge, R., Willig, S.J., Kistemaker, P.G., 1983. Discriminant analysis by double stage principal component analysis. Analytical Chemistry 55, 1710e1712. Krzanowski, W.J., 2000. Principles of Multivariate Analysis e A User’s Perspective, second ed. Oxford University Press, Oxford, UK. Messerli, G., Nia, V.P., Trevisan, M., Kolbe, A., Schauer, N., Geigenberger, P., Chen, J.C., Davison, A.C., Fernie, A.R., Zeeman, S.C., 2007. Rapid classification of phenotypic mutants of Arabidopsis via metabolite fingerprinting. Plant Physiology 143, 1484e1492. Nielsen, N.P.V., Carstensen, J.M., Smedsgaard, J., 1998. Aligning of single and multiple wavelength chromatographic profiles for chemometric data analysis using correlation optimised warping. Journal of Chromatography A 805, 17e35. Ott, K.H., Aranibar, N., Singh, B.J., Stockton, G.W., 2003. Metabonomics classifies pathways affected by bioactive compounds. Artificial neural network classification of NMR spectra of plant extracts. Phytochemistry 62, 971e985.

I.L. Petersen et al. / Environmental Pollution 159 (2011) 3071e3077 Petersen, I.L., Andersen, K.E., Sørensen, J.C., Sørensen, H., 2006. Determination of shikimate in crude plant extracts by micellar electrokinetic capillary chromatography. Journal of Chromatography A 1130, 253e258. Petersen, I.L., Hansen, H.C.B., Ravn, H.W., Sørensen, J.C., Sørensen, H., 2007. Metabolic effects in rapeseed (Brassica napus L.) seedlings after root exposure to glyphosate. Pesticide Biochemistry and Physiology. Pierce, K.M., Hope, J.L., Johnson, K.J., Wright, B.W., Synovec, R.E., 2005. Classification of gasoline data obtained by gas chromatography using a piecewise alignment algorithm combined with feature selection and principal component analysis. Journal of Chromatography A 1096, 101e110. Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.P., 2002. Numerical Recipes in C, second ed. Cambridge University Press, Cambridge, UK.

3077

Quemener, B., Bertrand, D., Marty, I., Causse, M., Lahaye, M., 2007. Fast data preprocessing for chromatographic fingerprints of tomato cell wall polysaccharides using chemometric methods. Journal of Chromatography A 1141, 41e49. Sjoedin, K., 1984. Minimizing effects of closure on analytical data. Analytical Chemistry 56, 1685e1688. Skov, T., van den Berg, F., Tomasi, G., Bro, R., 2006. Automated alignment of chromatographic data. Journal of Chemometrics 20, 484e497. Viant, M.R., 2008. Recent developments in environmental metabolomics. Molecular Biosystems 4, 980e986. Walker, C.H., 1992. Biochemical responses as indicators of toxic effects of chemicals in ecosystems. Toxicology Letters 64-5, 527e533.