Bioinformatic methods in NMR-based metabolic profiling


Progress in Nuclear Magnetic Resonance Spectroscopy 55 (2009) 361–374


Progress in Nuclear Magnetic Resonance Spectroscopy journal homepage: www.elsevier.com/locate/pnmrs

Timothy M.D. Ebbels *, Rachel Cavill

Biomolecular Medicine, Division of Surgery, Oncology, Reproductive Biology and Anaesthetics, Faculty of Medicine, Imperial College London, Sir Alexander Fleming Building, South Kensington, London SW7 2AZ, UK

Article info

Article history: Received 17 June 2009; accepted 16 July 2009; available online 22 July 2009

© 2009 Elsevier B.V. All rights reserved.

Keywords: Metabonomics; Metabolomics; Metabolic profiling; Bioinformatics; Statistical methods; Modelling; Machine learning; Pattern recognition

Contents

1. Introduction
2. Data types and preprocessing
3. Classification methods
3.1. Classical methods
3.2. Orthogonal PLS (O-PLS and O2-PLS)
3.3. Genetic algorithms and genetic programming
3.4. Kernel methods
3.5. Random forests
4. Statistical spectroscopy and biomarker identification
5. Statistical integration of NMR metabolic profiles with other post-genomic data
6. Conclusions and future prospects
Acknowledgements
References

1. Introduction

Abbreviations: PCA, principal components analysis; PLS, partial least squares; PLS-DA, partial least squares-discriminant analysis; O-PLS or O2-PLS, orthogonal partial least squares; GA, genetic algorithm; GP, genetic programming; SVM, support vector machine; kNN, k-nearest neighbour; STOCSY, statistical total correlation spectroscopy; SHY, statistical heterospectroscopy; LC, liquid chromatography; JRES, J-resolved; HSQC, heteronuclear single quantum coherence; MS, mass spectrometry; DIGE, differential in gel electrophoresis.

* Corresponding author. Tel.: +44 20 75943160; fax: +44 20 75943226. E-mail address: [email protected] (T.M.D. Ebbels).

0079-6565/$ - see front matter © 2009 Elsevier B.V. All rights reserved. doi:10.1016/j.pnmrs.2009.07.003

Fig. 1. A schematic diagram of the steps involved and information recovered in the statistical analysis of NMR metabolic profiles. [Flow diagram; steps and the information recovered at each: experimental design; pre-processing (e.g. phasing, deconvolution), giving processed data; pre-treatment (e.g. unit variance scaling), giving pre-treated data; exploratory analysis (e.g. PCA, HCA), giving a data overview, outliers, clusters and important metabolites; supervised modelling (classification / regression), giving discriminating metabolites; statistical spectroscopy, giving structural identities; pathway representation, giving mechanistic interpretation.]

The last decade has seen a revolution in the practice of biological research, primarily due to the rise of techniques that are able to profile levels of molecular organisation at a global level. This wide coverage of the so-called ‘omics’ techniques has been inspired, and is often made possible by, the completion of genome sequences which allow interpretation of events in terms of the full complement of biomolecules present in a cell or organism. The field of metabolic profiling (also known as metabonomics [1] or metabolomics [2]) studies the myriad small molecular weight
molecules involved in metabolism, with a particular emphasis on how the levels of these molecules change in response to different biological conditions. The ultimate goal of this field is to achieve improvements in understanding of metabolic processes that may lead to advances in many areas including clinical diagnosis, therapeutics, functional genomics and toxicology. NMR spectroscopy has been employed as an analytical tool in studies of metabolism for many years [3–6] principally because it is non-selective in the analytes studied and because of the high degree of structural information obtainable in a short time. It is also favoured because it is non-destructive and highly reproducible with minimal sample preparation. The complex biological samples analysed in metabolic profiling studies result in correspondingly complex NMR spectra; consequently the extraction of useful information from such metabolic profiles is a difficult task and this is the subject of this review. Early studies employing NMR to profile biofluid mixtures [5,7] did not attempt to model the spectra statistically and relied on visual analysis of the spectra for their interpretation. In the early 1990s the potential of statistical pattern recognition techniques to extract useful information from the data became apparent and began to be applied [8–11]. With the advent of other omics techniques in the late 1990s, the statistical analysis of metabolic profiles has become one of several specialities within the field of post-genomic statistical modelling and has received wider attention [12–19]. A distinction is usually made between global profiling, in which as many metabolites as possible are assayed in a single step, and targeted profiling in which a subset of interesting metabolites are identified a priori and the analytical methods optimised for measurement of this group. 
We are also particularly interested in screening approaches in which large numbers of samples are rapidly profiled, necessitating the use of automatic processing algorithms.

The goals of statistical modelling in metabolic profiling can be briefly summarised as (1) visualisation of the overall similarities and differences between samples and variables, (2) determination of whether or not there is a significant difference between groups or trends related to the effect of interest, (3) discovery of which metabolic signals are responsible for these patterns, (4) structural characterisation of the metabolites involved, and (5) analysis of metabolic effects at a pathway level. Until recently, aim 4 was addressed exclusively by further experimental analytical procedures, but recent developments in correlation analysis have yielded useful statistical tools [20–24] which can aid this process.

The analysis of NMR metabolic profiles typically involves a number of stages, conceptualised in the flow diagram of Fig. 1. One of the most important stages, but one which is often overlooked, is the initial design of the experiment. Here, an interaction of the data analyst with the other scientists involved (analytical chemists, biological scientists etc.) is essential to ensure that the desired goals are met. For example, it is important to ensure that any extraneous variables are not confounded with the effect of interest (e.g. having systematic differences in the ages of control and treated groups in studies of age related diseases). Once the biological experiment has been run and the NMR spectra obtained, standard data processing techniques are applied to the raw free induction decays (FIDs) to obtain correctly phased, baseline corrected and chemical shift referenced spectra. This process itself is non-trivial when the samples are complex and numerous, but software for automatically processing metabolic profile spectra has matured greatly over the past 10 years and is now generally sufficient for this task.

Once the basic spectra are obtained, they must undergo a preprocessing step which transforms the data to a table of N samples (rows) by M variables (columns). Each row represents the metabolic profile of a particular biological sample and each column represents a given metabolite or metabolic signal. There follows an optional pre-treatment step in which the rows or columns of the table may be rescaled. For example, normalisation procedures scale each row by a factor, e.g. to account for overall variation in sample dilution. An example column operation is scaling each variable to unit variance. The resulting table forms the data input to the next two stages of analysis – exploratory and predictive modelling.
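
As a rough illustration of the row and column operations just described, the following Python/NumPy sketch applies one row operation (scaling each sample to a constant total) and one column operation (unit variance scaling) to a small N × M table. The function names and the toy matrix are hypothetical, and mean-centring each variable before dividing by its standard deviation is an assumption made here, not something the text prescribes.

```python
import numpy as np

def normalise_rows_to_constant_sum(X, target=1.0):
    """Row operation: scale each sample so that its intensities sum to `target`."""
    X = np.asarray(X, dtype=float)
    row_sums = X.sum(axis=1, keepdims=True)
    if np.any(row_sums == 0):
        raise ValueError("a sample with zero total intensity cannot be normalised")
    return target * X / row_sums

def scale_columns_to_unit_variance(X):
    """Column operation: mean-centre each variable, then divide by its standard deviation."""
    X = np.asarray(X, dtype=float)
    centred = X - X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # constant variables stay at zero instead of dividing by zero
    return centred / std

# Hypothetical table of 3 samples (rows) by 4 variables (columns).
# The second sample is simply a 2-fold dilution of the first.
X = np.array([[4.0, 2.0, 1.0, 1.0],
              [2.0, 1.0, 0.5, 0.5],
              [1.0, 3.0, 2.0, 2.0]])

X_norm = normalise_rows_to_constant_sum(X)   # removes the dilution difference
X_uv = scale_columns_to_unit_variance(X_norm)  # gives every variable equal weight
```

After the row operation the first two samples become identical, which is the intended effect of correcting for dilution; the column operation then puts small and large signals on the same footing.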
The exploratory stage is characterised by so-called unsupervised methods, those in which the algorithm takes no account of a priori information about the structure of the data (e.g. clusters, trends etc.) This stage addresses the first goal mentioned above and typically involves visualisation of the overall distribution of samples and variables, and assessment of data quality including detection and removal of outliers. Once the global structure of the data has been assessed in such a way, the analyst may wish to proceed to a predictive modelling stage in which one attempts to build mathematical rules that use the metabolic profile to predict an external response variable, such as biological class. At this stage, supervised methods are typically used in which a part of the data (the training set) is used to fit the model, and a separate part (the test set) is used to estimate the predictive accuracy of the model. The importance of this latter step cannot be overstated. As with all high dimensional data, it is very easy to obtain models appearing to explain the data well, yet which are over-fitted or otherwise not predictive outside the particular sample set for which they have been developed. Thus the process of model validation is a critical step in the whole modelling process. The nature of the metabolome and the NMR methods used in its interrogation give particular characteristics to the data obtained, thus influencing the types of modelling methods used. In contrast to the genome or even the proteome, a large proportion of the metabolome is still unknown, even for model organisms. The degree of incompleteness varies between organisms, biofluids/tissues and conditions and is hard to estimate. This is one reason why it is not easy to convert raw spectral data to concentrations of known metabolites; in a typical NMR profile, a large number of the resonances may be unassigned, particularly for low level or partially resolved signals. 
Even when resonances are assigned, it is difficult to obtain truly quantitative concentration information in high throughput profiling applications. Some important reasons for this include peak crowding and overlap, changes in chemical shift position of resonances due to differential matrix effects, complexation
of analytes with internal standards or protein components of the mixture, and the difficulty of observing fully relaxed nuclei coupled with an associated lack of knowledge of T1 relaxation times in the biological matrix of the sample. While many methods, both analytical and computational exist to help resolve these issues, as yet they are difficult to apply successfully in an automated way across diverse sample types and conditions. Due to the difficulties mentioned above, the typical approach to the analysis of 1-dimensional (1-D) NMR metabolic profiles has been as follows [12]. First, model the spectra directly in the chemical shift domain (in which statistical variables are identified by chemical shift). Next extract the most ‘important’ variables from the statistical models, and finally, assign the resonances corresponding to the important variables, possibly using further statistical or analytical techniques. This reliance on determination of important variables from statistical modelling, or model interpretation, is a key feature of metabolic profiling and has greatly influenced the type of modelling techniques used. For example, principal components analysis (PCA) is preferred to multidimensional scaling (MDS) as a dimension reduction technique, mainly because it is easy to determine influential variables with the former technique but not the latter. Traditionally 1-D 1H experiments have been the method of choice [5,6] for initial screening studies, due to the ubiquity of protons in metabolic analytes and the speed of the basic 1-D acquisition. These would typically be followed up by more in-depth analysis of key samples using 2-dimensional (2-D) NMR experiments and other analytical techniques. However, peak crowding and overlap are acute in 1-D proton metabolic profiles, particularly for lower abundance metabolites, making automated quantification of individual species difficult. 
Recently 2-dimensional experiments such as 1H JRES [25] and 1H–13C HSQC [26] have been advocated for such routine screening. The improved dispersion of resonances given by the 2-dimensional techniques partially alleviates overlap problems and can make assignment easier. While such methods are gaining popularity, they have yet to replace the conventional 1-D 1H experiment in the majority of applications, and we therefore restrict this review to the analysis of 1-D 1H NMR profiles. The aim of this review is to provide a contemporary overview of bioinformatic approaches to the modelling of NMR metabolic profiles. For excellent reviews of the conventional techniques we refer the reader to refs. [12,27,28]. Here, we seek to update such resources by covering methods and topics which have developed more recently. We specifically exclude a number of topics from the scope of this review. We do not cover the use of NMR data to generate metabolic network models which attempt to provide mechanistic explanations for metabolic phenomena. We also exclude surveying the myriad of software and database tools for mapping metabolic data to biological pathways. A short description of the organisation of the paper follows. In Section 2, we briefly discuss preprocessing approaches which are applied to NMR metabolic data prior to statistical analysis, and contrast the resulting types of data matrix from the modelling point of view. Section 3 reviews recent developments in methods for building classification models of NMR metabolic profiles. In Section 4 we look at the recent development of statistical methods for obtaining structural information from NMR profiles. Section 5 addresses the question of how to integrate NMR-derived metabolic profiles with data from other analytical techniques (e.g. liquid chromatography mass spectrometry data) and other omics (e.g. gene expression microarray data). Finally Section 6 summarises some conclusions and future prospects in this area.
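
The warning earlier in this section, that with high dimensional data it is easy to obtain models which appear to explain the data yet are not predictive, can be made concrete with a small Python/NumPy sketch. Everything in it is an illustrative assumption rather than a method from this review: the "spectra" are pure random noise, the class labels are assigned at random, and the classifier is a plain minimum-norm least-squares fit. Because there are far more variables than training samples, the model reproduces the training labels perfectly, while its accuracy on an independent test set stays close to chance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Far more variables than training samples, as is typical for spectral data.
n_train, n_test, n_vars = 20, 200, 500

# Pure noise "spectra" with randomly assigned class labels (+1 / -1):
# by construction there is nothing real to learn.
X_train = rng.standard_normal((n_train, n_vars))
y_train = rng.choice([-1.0, 1.0], size=n_train)
X_test = rng.standard_normal((n_test, n_vars))
y_test = rng.choice([-1.0, 1.0], size=n_test)

# Minimum-norm least-squares linear classifier, fitted to the training set only.
weights, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

def accuracy(X, y, w):
    """Fraction of samples whose predicted sign matches the class label."""
    return float(np.mean(np.sign(X @ w) == y))

train_accuracy = accuracy(X_train, y_train, weights)
test_accuracy = accuracy(X_test, y_test, weights)

print(f"training accuracy: {train_accuracy:.2f}")
print(f"test accuracy:     {test_accuracy:.2f}")
```

The gap between the two numbers is the reason a separate test set (or cross-validation) is needed before any interpretation of the model is attempted.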

2. Data types and preprocessing

The goal of preprocessing is to transform the data into a form that can be meaningfully analysed by statistical procedures [29]. Here, we consider that the FIDs have already been Fourier transformed to the frequency domain, correctly phased and baseline corrected (though the automation of the latter operations for complex biological mixtures is certainly non-trivial). The data are usually arranged into a table or matrix in which each row corresponds to a separate analytical sample and each column to a metabolic signal. The metabolic signal typically corresponds to the spectral intensity at a particular chemical shift or to the estimated level of a particular metabolite. Both representations require an algorithm for estimating the metabolic signal from the spectral data. In the former case, the most common approach has been to integrate the spectral intensity in adjacent regions of equal width (typically 0.04 ppm wide, sometimes termed bins or ‘buckets’) [30,31], thus providing a low resolution or ‘reduced’ representation of the spectrum. The same approach can easily be used for the analysis of 2-dimensional spectra, though in this case a large proportion of the bins may contain only background noise. This approach has largely been superseded by the use of narrower regions or analysis of the spectra at their full native resolution [32].

The second approach of estimating individual metabolite levels is a particularly difficult one for 1-D 1H spectra of complex mixtures due to the significant degree of overlap between resonances. This is further complicated by small but significant sample-to-sample variations in the chemical shift position of signals produced by effects such as differences in pH and ionic strength. Such ‘positional noise’ remains a problem even with buffered samples. However, several attempts have been made to produce automatic [33] or semi-automatic [34] software for quantifying metabolites.
These approaches are far from perfect, however, because they do not fit the whole metabolite spectral signature simultaneously, cannot deconvolve overlapped resonances, and/or require significant manual intervention. Attempts have also been made to accomplish this task semi-automatically in 2-D spectra, where the degree of overlap is reduced, with some success [26]. Probably the largest disadvantage of this ‘targeted’ approach, however, is that only known metabolites are included in the fit, thus ignoring any novel or unexpected molecules present – a particularly important disadvantage in biomarker discovery applications. A further risk is that unknown peaks may be mistaken for signals from known metabolites, resulting in erroneous quantitation; this becomes ever more likely as the ratio of unknowns to knowns increases. Notwithstanding these shortcomings, the targeted profiling approach does lead to more straightforward biological interpretation of statistical models, since each variable corresponds to an identified metabolite. One may suppose that a hybrid approach employing targeted profiling of known metabolites, combined with modelling of residual unidentified signals, would therefore be an interesting and productive direction for the future.

Fig. 2. Effect of preprocessing/representation on PCA analysis of caloric restriction data (600 MHz 1H NMR spectra of rat urine). Panels A–F show scores (left) and loadings (right) of the first two PCs for three representations of the NMR metabolic profiles. (A,B) Identified and integrated peaks; (C,D) 0.04 ppm width bins; (E,F) 0.001 ppm width bins. Key: filled circles – time points up to 24 h, open squares – time points after 24 h. Variables in panel B are identified by ppm of the centre of the bin. Adjacent bins in panel E are connected with lines. Panels G and H show back-scaled coloured loading plots (PC1) for the 0.001 ppm width binned data. (G) Aromatic region; (H) aliphatic region. Data courtesy of the COMET project [88].

Fig. 2 illustrates the effects that the different preprocessing methods for representing the metabolic profiles have on a PCA analysis. Urine was sampled over 7 days from animals whose diet was restricted to 50% of that of a control group, and 1-D 1H NMR spectra were acquired at 600 MHz [35]. Scores and loadings plots are shown in panels A–F for each of three representations. The scores plots for all representations indicate that early (≤24 h) and late (>24 h) time points are clearly differentiated on principal component 1 (PC1). (The worse separation in panel E results from the influence of small shifts in positions of some peaks.) However, the process of interpreting models can be quite different for each representation. For the peak-integrated loadings of panel B, it is easy to make an immediate interpretation since each variable corresponds to an
identified metabolite. However, many peaks are not integrated and are thus not modelled, meaning that information is missing. For the conventional 0.04 ppm resolution loadings of panel D, each bin is labelled with its ppm and the analyst must return to the original spectra to identify the metabolite(s) contributing to each bin. The 2-D loading plot for the high resolution (0.001 ppm) data is much more difficult to interpret because of the large number of variables. Nonetheless, an intuitive interpretation of the PC1 loadings may be accomplished by the use of back scaling [32] in which the loadings are depicted as a 1-D line plot. The colour corresponds to the loading from a model where each variable has been scaled to unit-variance. The y-coordinate corresponds to the unit-variance scaled loading divided by the unit-variance scale factor for each variable. The resulting plot combines the strengths of both scaling techniques. The unscaled model preserves line shapes and multiplet patterns similar to those of the original spectra. However, small, low variance signals that would normally be lost in an unscaled model can still be picked out by colour if they are influential within the unit-variance scaled model.

Most 1H NMR techniques used in metabolic profiling incorporate some suppression of the water resonance, which otherwise dominates the spectral profile for biological samples. The suppression is usually imperfect, however, and the residual signal is not easily dealt with by automated phasing and baseline correction algorithms. Most authors take the approach of removing the region of 1H chemical shift worst affected by the residual signal before statistical analysis, though other techniques such as time–frequency or time-scale filtering are also used [36]. Other resonances can also be regarded as interfering or ‘nuisance’ signals in 1H NMR metabolic profiling. For example, in drug toxicity studies, the administered compound and its metabolites are often seen, mainly
in urine. Yet when building mathematical models describing toxicity these signals can skew the model, obscuring endogenous changes and therefore they must be removed [37]. Other examples of interfering signals are those of solvents such as DMSO or acetonitrile.

One of the most important steps in the preprocessing of NMR spectra for statistical analysis is that of normalisation, by which we mean any mathematical operation intended to remove unwanted variation between profiles. Normalisation usually involves the multiplication of each row of the data matrix by a constant [38], and is typically employed with the intent of removing variation due to uncertainty in the ‘total amount’ of sample. This is particularly true of urine studies in which overall urinary concentration can vary by orders of magnitude [6], for example due to variation in water intake which may or may not be linked to the condition or treatment of interest. It is also important for in vitro and tissue extract studies in which the total number/volume of cells or tissue is difficult to control. It should be noted that normalisation can also remove sample-to-sample differences due to other effects, such as differential relaxation or small variations in the calibration of the 90° pulse. There are many different normalisation methods, a full discussion of which is beyond the scope of this review. Selection of the most appropriate method will depend on the objective of the analysis, for example, whether it is more important to estimate absolute levels of a given compound, or whether (possibly more accurate) relative concentrations will suffice. Methods for achieving the former include normalisation to the known concentration of an internal standard (e.g. TSP), or to the level of an endogenous metabolite whose concentration can be determined by an independent method (e.g. creatinine assayed by the Jaffe reaction). If relative changes are more important,
then statistical approaches can be used, such as normalising to constant total integrated intensity (TII) across the whole spectrum. The latter technique is a standard that has been successfully employed for many years [31]. While it has many advantages, TII normalisation has one key disadvantage: when a treatment causes large changes in the spectral profile, TII normalisation can introduce spurious correlations between metabolic variables [38]. If a high concentration metabolite changes, the levels of other signals must change to retain a constant TII. In the past few years, alternative methods have been proposed to remedy this problem, many of which require selection of a ‘reference’ spectrum against which to normalise the remaining ‘test’ spectra. One method is ‘probabilistic quotient normalisation’ [39] which ensures that the median fold change between the test and reference spectra variables is constant for all spectra. An alternative is ‘histogram normalisation’ [40] which picks the normalisation constant which obtains the best fit of the intensity histograms of the test and the reference.

Metabolic profiling has become a key element of many research endeavours in the post-genomic era. However, to ensure effective communication of results and inter-lab collaboration, it is important that data can be reported and exchanged in standard formats. Such standards also have an important role to play in quality control and encouraging wider use of published data. The wide array of technologies employed, coupled with the rapid development of the field, makes construction of usable standards difficult, yet much progress has already been made. Early initiatives included the Standard Metabolic Reporting Structures (SMRS) group who compiled a vision of the minimum information that should be required in reporting metabolic profile data (derived from NMR and other techniques) [41] akin to that required for publication of transcriptomics and proteomics data [42,43].
This was complemented by a similar minimum information proposal in the plant metabolomics community, MIAMET [44], and a more detailed data model known as ArMET [45]. Ongoing standards efforts are currently coordinated by the Metabolomics Society’s Metabolomics Standards Initiative (MSI) [46,47] which has separate working groups for each area of metabolic profiling. Initial recommendations have been published, including one for data analysis [29], though these are generally aimed at minimum reporting requirements, rather than standard formats or data models. Currently community-wide standard data exchange formats for NMR metabolic profile data (which include reporting of biological and analytical meta-data) are only just emerging; consequently it is not yet possible for metabolomics software to be completely standards compliant. Clearly, this is a developing area and the reader is encouraged to monitor the MSI at http://msi-workgroups.sourceforge.net/ for up-to-date information.
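
To make the normalisation discussion earlier in this section more concrete, the Python/NumPy sketch below contrasts TII normalisation with a quotient-based normalisation in the spirit of probabilistic quotient normalisation [39]. It is a simplified illustration under stated assumptions, not the published procedure: the function names, the six-variable toy spectra and the choice of reference spectrum are all hypothetical, and any additional steps of the published method are omitted.

```python
import numpy as np

def tii_normalise(X):
    """Scale each spectrum (row) to unit total integrated intensity."""
    X = np.asarray(X, dtype=float)
    return X / X.sum(axis=1, keepdims=True)

def pqn_normalise(X, reference):
    """Divide each spectrum by the median of its variable-wise quotients to the reference."""
    X = np.asarray(X, dtype=float)
    reference = np.asarray(reference, dtype=float)
    usable = reference > 0  # skip empty reference regions to avoid division by zero
    quotients = X[:, usable] / reference[usable]
    factors = np.median(quotients, axis=1, keepdims=True)
    return X / factors

# Hypothetical reference profile with six variables.
reference = np.array([10.0, 5.0, 5.0, 2.0, 2.0, 1.0])

# Hypothetical treated sample: only the first (high concentration) metabolite
# truly changes (4-fold increase); the whole sample is then diluted 2-fold.
treated_true = reference.copy()
treated_true[0] *= 4.0
X = np.vstack([reference, 0.5 * treated_true])

X_tii = tii_normalise(X)
X_pqn = pqn_normalise(X, reference)
```

In this toy case TII normalisation makes the five unchanged variables appear to decrease in the treated sample, because the large first signal inflates the total. The median quotient is unaffected by that single signal, so it recovers the 2-fold dilution and leaves the unchanged variables equal to the reference.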

3. Classification methods

Classification methods are widely used in the analysis of complex biological datasets such as those generated through NMR-based metabolic profiling. However, unlike other domains where the principal aim is to predict the class of unknown samples, in metabolic profiling these methods are often used as exploratory tools, to aid the identification of important differences between the groups. In particular, where class differences are formed from a combination of many spectral features, each of which on their own is insufficient for good discrimination, these methods can give meaningful information about the differences between groups which would otherwise be hidden. Therefore, determination of variable importance – and thus biomarker discovery – is a crucial aspect of the analysis. In addition to biomarker discovery, the identification of the metabolic differences between classes can aid hypothesis generation regarding mechanisms of action and thus inform the design of future experiments on the system.

3.1. Classical methods

PCA and partial least squares (PLS) regression have become two of the most popular techniques in metabolic profiling for exploring class differences and highlighting explanatory metabolites. These methods allow the analyst to project the NMR data into a low dimensional space for easier interpretation and visualisation. This space is spanned by latent variables or components, each of which consists of a linear combination of the original variables. These components are orthogonal to each other. In PCA, the model describes the space corresponding to the highest variance of the data, while in PLS the space corresponds to that with the highest covariance between the NMR data and the response variable. PLS is often used in discriminant analysis mode (PLS-DA), in which the response variable indicates class membership (0 or 1), thus producing a model in which class separation is emphasised. These methods are also favoured since the loadings give easy access to information about which metabolites are influencing the variation seen between the samples. For further discussion of these conventional techniques we refer the reader to reviews such as Lindon et al. [12].

PCA is a very useful tool for data exploration. Beyond visualisation of groups and discriminatory variables, it is helpful in outlier detection and aiding understanding of other sources of variation within the dataset (for instance time courses or experimental drift). When using PCA one must remember that it optimises explanation of variance rather than class distinction. For instance, if there is much more inter-individual or time related variation than treatment variation, then these confounding factors may dominate the first few components, potentially obscuring differences due to treatment. In contrast, PLS-DA is a supervised technique which takes class membership into account. Since the PLS-DA projection maximises the covariance to the class variable, the influence of confounding factors on the projection is less than in PCA (though these may still be significant). Perhaps surprisingly for such a ‘conventional’ method, there are many methods for using the model to select variables important for discrimination [48,49], and it is not clear which is optimal. This problem is particularly acute when full resolution data is used and many variables (often from the same metabolite) have high loadings. Further, PLS is fundamentally a linear model, and there may be cases where classification boundaries or biological responses do not follow this linearity (e.g. where a treated individual has either high or low levels of a given metabolite compared to controls).

3.2. Orthogonal PLS (O-PLS and O2-PLS)

O-PLS and O2-PLS [50,51] are extensions to the PLS algorithm, which work by splitting the variation of the predictor variables into two parts: variation orthogonal (uncorrelated) to the response and variation correlated to the response. While the predictive accuracy remains the same as for conventional PLS, by separating the variation in this way, interpretation of the model can be improved. This is most easily explained by an example. Suppose a PLS-DA model requires five components to discriminate two classes. The equivalent O-PLS-DA model will contain one so-called ‘predictive’ component and four orthogonal components. Whereas in the former case, the loadings of all five components need to be inspected to determine which variables are important for the discrimination, in the latter, only the loadings of the predictive component need be assessed for this purpose. (In addition, one should always inspect the scores and loadings from the orthogonal components as these give valuable information on the sources of confounding variation or ‘structured noise’ in the data).

Following the initial study [32] (see Fig. 3), the O-PLS method has been applied in numerous NMR metabolic profiling studies. The O-PLS loadings are often visualised using the back scaling technique mentioned in Section 2. For O-PLS-DA models, the y-coordinate depicts the covariance and the colour the correlation to the class variable. Thus small peaks that discriminate the classes well, but show low variance, are highlighted. Conversely, large peaks exhibiting high variance but with low correlation to the class discrimination can be rejected. The introduction of O-PLS and the coloured loadings visualisation has also coincided with the use of the full resolution spectrum, instead of the reduced spectrum that was previously common. Cloarec et al. [32] showed that the detrimental influence of positional noise (peak shifts caused by changes in sample pH or ionic strength) could be mitigated using this approach. However, readers should be careful not to assume that all coloured spectral visualisations of this kind correspond to the results of an O-PLS analysis; see for example ref. [52]. Finally, validation of models to avoid over-fitting is always of paramount importance when dealing with metabolic profiles. For O-PLS, statistics such as the proportion of variance in the response variable predictable in cross-validation (Q2Y) should be quoted with each model.

Fig. 3. Back scaled O2-PLS coefficients plots indicating the resonances important for discriminating the urinary profiles of rats dosed with mercury chloride from those of control animals. The lower panel shows an expansion of the region 0.5–3.5. In each plot, the y coordinate and colour scale depict the covariance and correlation between each intensity variable and the class (0 or 1) respectively. 1H NMR spectra were acquired at 600 MHz. Reproduced with permission from Ref. [32]. © Copyright (2005) American Chemical Society.

3.3. Genetic algorithms and genetic programming

Genetic algorithms (GAs) are a common machine learning technique for solving complex multidimensional optimisation problems and have been used on a wide variety of optimisation problems, across many domains (see [53] for an excellent introduction). GAs work by simulating evolutionary processes. Solutions are represented by fixed-length strings of binary digits called chromosomes. At the start of a run, all strings are randomly initialized. For each ‘generation’ of the genetic algorithm all the current candidate solutions are evaluated to see how well they solve the problem – their ‘fitness’; those which perform best are more likely to be used to generate the next generation of candidate solutions. Two operators are generally used to form the next generation – crossover, which takes two solutions and combines them, and mutation, which makes small random changes to an individual solution. This generational process is repeated until a fixed number of generations have passed or the best solutions have surpassed a fixed quality level in the fitness test.

Genetic programming (GP) (see [54] for a good introduction) is an extension to this technique in which the restriction on working with fixed-length strings is lifted, and instead each solution is represented as a computer program. The most common representation is the parse tree, a structuring of potential solutions according to a hierarchical system of rules and objects. The parse tree leads to a crossover operator involving the swapping of sub-trees between parents, and a mutation operator which generates a new random subtree to place at a designated point. However, the range of operators and representations is very extensive [54].

Most work in NMR-based metabolic profiling with GAs or GP has been directed at variable selection [56]. GAs are naturally suited to this task, since a simple binary chromosome with one bit per variable can indicate if that variable is to be present or absent in each solution.
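The generational loop and the one-bit-per-variable chromosome can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than a method taken from the studies cited: the synthetic data (`make_data`), the nearest-centroid classifier used to score each chromosome, the tournament selection, the elitism step and the small per-variable penalty are all hypothetical choices.

```python
import random

def make_data(rng, n_per_class=20, n_vars=12, informative=(2, 7), shift=4.0):
    """Synthetic two-class data; only the `informative` columns differ between classes."""
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_vars)]
            for j in informative:
                row[j] += shift * label
            X.append(row)
            y.append(label)
    return X, y

def fitness(mask, train, valid, penalty=0.01):
    """Validation accuracy of a nearest-centroid classifier restricted to the
    variables switched on in `mask`, minus a small penalty per selected variable."""
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return 0.0
    (Xt, yt), (Xv, yv) = train, valid
    centroids = {}
    for label in (0, 1):
        rows = [x for x, lab in zip(Xt, yt) if lab == label]
        centroids[label] = [sum(r[j] for r in rows) / len(rows) for j in cols]
    correct = 0
    for x, lab in zip(Xv, yv):
        dist = {c: sum((x[j] - m) ** 2 for j, m in zip(cols, centroids[c]))
                for c in (0, 1)}
        correct += (min(dist, key=dist.get) == lab)
    return correct / len(yv) - penalty * len(cols)

def run_ga(train, valid, n_vars, rng, pop_size=20, generations=15, p_mut=0.05):
    """Evolve binary chromosomes (one bit per variable); returns best mask and fitness history."""
    cache = {}

    def fit(mask):
        key = tuple(mask)
        if key not in cache:
            cache[key] = fitness(mask, train, valid)
        return cache[key]

    # Random initialisation of all chromosomes
    pop = [[rng.randint(0, 1) for _ in range(n_vars)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        elite = max(pop, key=fit)
        history.append(fit(elite))
        nxt = [elite[:]]                              # elitism: keep the current best
        while len(nxt) < pop_size:
            p1 = max(rng.sample(pop, 3), key=fit)     # tournament selection
            p2 = max(rng.sample(pop, 3), key=fit)
            cut = rng.randint(1, n_vars - 1)          # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [1 - b if rng.random() < p_mut else b for b in child]  # mutation
            nxt.append(child)
        pop = nxt
    best = max(pop, key=fit)
    history.append(fit(best))
    return best, history
```

A run might look like `rng = random.Random(0); train = make_data(rng); valid = make_data(rng); best, history = run_ga(train, valid, 12, rng)`. Note that the GA only proposes variable subsets; the classifier inside `fitness` does the prediction, and in practice the fitness would be estimated with proper cross-validation rather than a single validation split.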
By studying the frequency with which variables are present in the best solutions across many runs and comparing this with the appropriate binomial distribution, a measure of the importance of each variable can be obtained. A key point here is that the GA is not a prediction algorithm itself – it merely determines lists of variables to be tried in candidate solutions. The GA must be coupled with a separate classification/regression algorithm in order to evaluate the fitness (here prediction success) of each solution. For instance, Ramadan et al. [56] used their GA to select variables for a PLS-DA model, when classifying male and female biofluids. Cavill et al. [57] used a GA to select both variables and samples to be used for classification of liver and kidney toxicity with a nearest neighbour classifier. GP overcomes this problem by evolving classifiers themselves. In this case the problems arise when trying to extract important variables from the evolved classifiers. GP classifiers can become very complex, due to a tendency for the solutions to bloat [58]. Bloat can be limited by imposing a constraint on the maximum size of each solution and/or including a penalty for complexity in the fitness function. Nonetheless, understanding how the GP solution performs and which variables are most important is non-trivial. Within metabolic profiling it is common practice to count the number of times each variable is used in the best solutions across many runs, as for a GA. Whilst this is often a good approximation of variable importance, there are drawbacks. Firstly, variables used in GP solutions are context dependent. The solutions evolved generally take the form of complex formulae; simply adding up the number of times a variable occurs ignores this context, which may be crucial to understanding why the evolved classifier performs well. 
Secondly, it is often found that many parts of solutions are seldom evaluated; these areas of the tree are termed ‘introns’ or ‘junk’ (by analogy to non-coding regions of DNA). Variables which are selected in these regions will not contribute to the fitness of the


individual but will contribute to the frequency of variable selection. This problem is aggravated by the fact that sub-trees which have little or no impact on fitness can propagate throughout the population, being copied many times through repeated applications of the crossover operator [59]. Therefore, a highly selected variable may have very little (or even no) impact on the ‘fitness’ of the solutions where it is found. Examples of GP being used in this way include Davis et al. [60] who employed a two stage process, first narrowing down the set of selected variables, then further reducing them by repeating the process with only those selected in the first stage. The advantages of splitting the evolutionary process in this way are unclear, especially as GP is often accused of losing diversity of solutions prematurely in a run [59]. The application of GP to metabolic profiling can be taken beyond variable selection to classification. Gray et al. [61] used GP to classify and select variables using 1H NMR spectra of human brain tumours. They used preprocessed spectral variables obtained from principal component scores as the inputs to their GP and were able to predict the type of tumour using only a small number of generations (fewer than 50). This fast convergence suggests that this problem was particularly easy for the GP system and it is reasonable to ask whether GP was necessary for this task or whether a simpler classification algorithm might have had the same success. One other way in which GAs have been applied to metabolic profiles is clustering [62]. Defining clusters, particularly 2-way clusters or biclusters, which divide both the samples and the variables into groups, is a difficult problem [63]. The use of GAs to find good solutions for this problem, for which exact techniques are complex, slow, and in some cases non-existent, appears to be a good way to use the technique’s strengths. 
There is no reason why this analysis could not be applied to NMR metabolic profiles, excepting that overlapping signals might render the problem of finding biologically meaningful clusters more complex.
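Returning to the selection-frequency measure of variable importance described earlier in this section, the comparison with a binomial distribution can be sketched as follows. The null model is an assumption made here for illustration: each variable is taken to be selected independently in each run with probability `q`, estimated as the overall fraction of bits switched on across all best solutions. The function names are hypothetical, and other choices of null model are possible.

```python
from math import comb

def binomial_tail(count, n_runs, q):
    """P(X >= count) for X ~ Binomial(n_runs, q)."""
    return sum(comb(n_runs, k) * q ** k * (1 - q) ** (n_runs - k)
               for k in range(count, n_runs + 1))

def selection_pvalues(best_masks):
    """For the best binary chromosome of each run, count how often each variable
    was selected and compute an upper-tail binomial p-value per variable."""
    n_runs = len(best_masks)
    n_vars = len(best_masks[0])
    # Assumed null: every variable is selected with the same probability q
    q = sum(sum(mask) for mask in best_masks) / (n_runs * n_vars)
    counts = [sum(mask[j] for mask in best_masks) for j in range(n_vars)]
    pvalues = [binomial_tail(c, n_runs, q) for c in counts]
    return counts, pvalues
```

Variables with small p-values are selected more often than expected by chance under this null. Because one test is made per spectral variable, some correction for multiple testing would normally be needed before interpreting the values, and the caveats about GP introns discussed above apply equally to any frequency-based measure.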

3.4. Kernel methods

Kernel methods have become very popular in the machine learning community in recent years, partly due to their excellent performance in classification/regression problems but also because of a solid underpinning of relevant theoretical work from statistical learning theory, allowing for example, upper bounds to be placed on classification error rates. These techniques work by mapping the data to a new space, which may be of (much) higher dimensionality than the original space. This may be useful if, for example, classes become more separable in the new space. Crucially, if an algorithm can be written in terms of vector inner products, the kernel function can be used to compute these without explicitly mapping the data to the new space at all. This so-called ‘kernel trick’ is applicable to many conventional linear algorithms, but has also spawned a new class of techniques known as Support Vector Machines (SVMs) [64]. Within metabolic profiling several approaches have been used: SVMs [65,66] and kernel versions of the standard PCA, PLS and O-PLS algorithms [67–72].

SVMs apply a kernel to map the data to a space in which the data is hoped to be linearly separable. They are very powerful as they seek to maximise the separation between the classes. There are a wide variety of kernels appropriate for metabolomics data, including linear, polynomial and radial basis functions. In [66] the performance of SVMs with a range of kernels is compared to that of linear discriminant analysis and a k-nearest neighbour (kNN) classifier using an NMR data set comparing human urine samples from healthy subjects and those with pneumonia. The authors find that the SVM’s performance is much more accurate on their test set, recording error rates around 2/3 lower than the other methods (the SVM had mean error rates of 7–8%, whereas linear discriminant analysis and kNN had rates of 24.5% and 21% respectively). A radial basis function kernel proved to be optimal, performing slightly better than linear or polynomial kernels. In order to use their SVM for variable selection, they use recursive variable elimination, repeatedly removing variables with the lowest weights. In another NMR-based metabolic profiling study [73], the focus was on visualisation and interpretation in order to better understand the SVM model. The authors put forward the idea of a “correlation image” displaying the correlation of each variable with each row of the kernel matrix, which gives information about the importance of the input variables; this can then be visualised using heatmaps. The method was applied to NMR spectra from brain tissue of healthy and cancerous patients, and the spectra used to predict metabolite concentrations across the different samples.

In addition to SVMs, there are other kernel methods which act as extensions to the standard algorithms discussed previously: KPCA [68,74], K-PLS [67,70] and K-OPLS [71,72]. By using the kernel trick, these linear models can be made non-linear (for example defining non-linear class boundaries). The main application to metabolic NMR has been through the use of K-OPLS. K-OPLS was developed to combine the classification strength of kernel-based methods with a high level of interpretability of the model, using the O-PLS method to model structured noise separately from the class variation. It has shown its utility on several NMR metabolic data sets [72], studying differences in biochemical composition between wild type and mutant hybrid aspen. Since the kernel in K-OPLS does not have to be linear, with the right choice of kernel the algorithm can model data for which linear regression is insufficient [71].

3.5. Random forests

Random forests [75] are an extension of the well known tree-based classifiers [76]. Binary tree-based classifiers are built from data by finding criteria which divide the data into two groups which are as pure in class as possible. Then new criteria are found which optimally split the new groups with high purity, and this process is repeated recursively until each group is pure or a minimum number of samples per group is reached. There are many variations on the exact way in which splitting criteria are chosen [77]. Tree-based classifiers alone, if used for biomarker discovery, have been shown to be unsuitable for metabolic profiling data [78]. They perform badly on noisy and/or collinear data, due to their greedy selection of the single most important factor at each stage, and therefore fail to take account of collinear variables which would have also performed well, affecting the interpretability of the solution. By building multiple trees with different subsets of the variables and then allowing these trees to vote to determine the overall classification of a sample, random forests minimise the effect of the issues mentioned above. However, to date, this technique has been restricted to mass spectrometry data [65,79]. In Ref. [65] random forests were used to classify diseased and non-diseased blood plasma samples based on measurements on 317 metabolites. Random forests can also be used to examine the important variables for classification, either through a simple measurement of average accuracy of all trees containing a particular variable, or through permuting the values for the variable of interest and measuring the negative impact this has on the prediction ability [65]. Enot et al. [79] applied the former approach and proposed a strategy for determining significance thresholds based on the classifier margin in their study of transgenic Arabidopsis and potato plants.
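The voting and permutation ideas can be illustrated with a deliberately simplified sketch. It is not a full random forest: as an assumption made to keep the example short, each ‘tree’ is a single-split stump grown on a bootstrap sample using a random subset of the variables. The synthetic data and all function names are hypothetical.

```python
import random

def make_data(rng, n_per_class=20, n_vars=6, informative=0, shift=4.0):
    """Synthetic two-class data in which only one column separates the classes."""
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_vars)]
            row[informative] += shift * label
            X.append(row)
            y.append(label)
    return X, y

def fit_stump(X, y, candidates):
    """Greedy single split: best (variable, threshold, label above threshold)."""
    best = None
    for j in candidates:
        values = sorted({row[j] for row in X})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            for high_label in (0, 1):
                hits = sum((high_label if row[j] > thr else 1 - high_label) == lab
                           for row, lab in zip(X, y))
                if best is None or hits > best[0]:
                    best = (hits, j, thr, high_label)
    return best[1:]

def fit_forest(X, y, n_trees, n_try, rng):
    """Each stump sees a bootstrap sample and a random subset of n_try variables."""
    n, p = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        candidates = rng.sample(range(p), n_try)
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], candidates))
    return forest

def predict(forest, row):
    """Majority vote over all stumps."""
    votes = sum(hl if row[j] > thr else 1 - hl for j, thr, hl in forest)
    return 1 if 2 * votes > len(forest) else 0

def accuracy(forest, X, y):
    return sum(predict(forest, row) == lab for row, lab in zip(X, y)) / len(y)

def permutation_importance(forest, X, y, rng, n_repeats=10):
    """Mean drop in accuracy when the values of one variable are shuffled."""
    base = accuracy(forest, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            col = [row[j] for row in X]
            rng.shuffle(col)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
            drops.append(base - accuracy(forest, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances
```

Fitting with, say, `fit_forest(X_train, y_train, n_trees=31, n_try=4, rng=rng)` and calling `permutation_importance` on held-out samples should give the informative variable a much larger accuracy drop than the noise variables. Full implementations grow deeper trees and usually evaluate the permuted accuracy on out-of-bag samples.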


4. Statistical spectroscopy and biomarker identification

The incomplete knowledge of the metabolome mentioned in the introduction has encouraged the development of statistical methods which can aid in structural identification of metabolites. While these methods can never replace the powerful array of experimental methodologies for structural characterisation of unknown molecules, they are beginning to have significant impact by streamlining the process and allowing time consuming experimental strategies to be more effectively targeted. The ‘statistical spectroscopy’ methods developed so far rely primarily on the fact that in many analytical techniques, some molecules result in more than one signal for each measurement. In NMR, peak integrals are proportional to the number of nuclei contributing and therefore the ratios between peak intensities (as measured by signal heights) from the same molecule will be constant as long as the line shape does not vary. This means that peak levels from the same molecule will exhibit a linear relationship which can be detected via statistical correlation.

The idea of correlating signals is a familiar one to NMR spectroscopists, yet the use of statistical correlation has only recently become popular in NMR metabolic profiling. First used in near infra-red spectroscopy [80], the idea was recently applied to NMR metabolic profiling as Statistical Total Correlation Spectroscopy (STOCSY) [20–23,81–84]. Fig. 4 illustrates a typical 2-D STOCSY map for the area around the aromatic resonances of hippuric acid. Each element in the figure represents the Pearson product-moment correlation [85,86] between two distinct NMR intensity variables. The correlation matrix was generated from 1050 1-D 1H NMR spectra of urine from normal laboratory rats from the COMET project [35,87,88]. Strong Pearson correlations between the three aromatic hippurate resonances can clearly be observed, revealing their structural relationship.
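The correlation calculation underlying such a map can be sketched in a few lines. The example below is an illustration under stated assumptions, not the published STOCSY implementation: the ‘spectra’ are synthetic three-variable vectors in which variables 0 and 1 stand for two resonances of one molecule (fixed 2:1 intensity ratio) and variable 2 for a resonance of an unrelated molecule. The function names are hypothetical.

```python
import math
import random

def pearson(a, b):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def correlation_matrix(spectra):
    """spectra: list of samples, each a list of intensity variables.
    Returns the variables x variables matrix of Pearson correlations."""
    columns = list(zip(*spectra))
    return [[pearson(ci, cj) for cj in columns] for ci in columns]

def simulate_spectra(rng, n_samples=200, noise_sd=0.05):
    """Toy data: variables 0 and 1 scale with the concentration of molecule A,
    variable 2 with the independent concentration of molecule B."""
    spectra = []
    for _ in range(n_samples):
        conc_a = rng.uniform(1.0, 5.0)
        conc_b = rng.uniform(1.0, 5.0)
        spectra.append([
            2.0 * conc_a + rng.gauss(0.0, noise_sd),
            1.0 * conc_a + rng.gauss(0.0, noise_sd),
            1.5 * conc_b + rng.gauss(0.0, noise_sd),
        ])
    return spectra
```

With `r = correlation_matrix(simulate_spectra(random.Random(0)))`, the entry `r[0][1]` should be close to 1 while `r[0][2]` should be close to 0, mirroring the structural versus non-structural distinction. A single row of the matrix gives the correlation of every variable to one chosen variable. Real spectra contain many thousands of variables, so a vectorised numerical library would be used in practice.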
The technique is also often used in 1-D mode, by nominating a ‘driver’ peak and visualising the correlation of all other spectral variables to the driver as a colour scale superimposed on the mean or covariance spectrum [20]. STOCSY has been applied widely to many problems in 1-dimensional NMR metabolic profiling. The approach has been applied several times to aid structural assignment, such as in the deconvolution of liquid chromatography (LC)-NMR data [22], following

drug metabolites in molecular epidemiology [82] and using diffusion-edited experiments to induce molecule specific correlations [21]. Using multiple observed nuclei, a similar approach (HET-STOCSY) [23,83,84], has allowed cross-assignment of signals between 1 H, 31P and 19F spectra, as well as editing of the homonuclear STOCSY according to heteronuclear correlations. STOCSY has some similarities to the technique of covariance NMR [89,90] which has also been applied to the analysis of complex mixtures [91– 93]. In STOCSY, correlations derive from variation in resonance amplitude across many 1-d spectra, usually of different biological samples. Covariance NMR is used in 2-D NMR of a single sample and dramatically improves the spectral resolution and symmetry of conventional 2-D approaches. It thus improves observation of NMR connectivities, which will usually derive from spin–spin couplings. In order to assign two resonances to the same molecule via STOCSY, one must discriminate between structural correlations (those due to true structural relationship) and non-structural correlations. Non-structural correlations may derive from biological effects (e.g. two metabolites responding similarly to a treatment) or from analytical procedures (e.g. two metabolites similarly affected by extraction procedures). Although on average, structural correlations are expected to exceed non-structural correlations [94], the extent to which this is true depends on the data set at hand. Data exhibiting strong heterogeneity or clustering, for example due to treatment effects or time evolution, can show very strong biological correlations which can mask true structural relationships. The source of variation used to drive the correlation analysis will also have an effect. For example, sample-to-sample variation in metabolite concentrations will have different statistical properties to the elution profile of a chromatographic peak. 
Additionally, peak positional variation will reduce the mean level of structural correlation; however it should not induce spurious high non-structural correlations and therefore should not lead to false assignments. In the crowded spectra typical of 1-D 1H NMR profiles, peak overlap also reduces the level of structural correlation and therefore the STOCSY approach will benefit from the analysis of well separated peaks or the use of pulse sequences which reduce overlap (e.g. JRES projections). As mentioned above, the type of normalisation will affect the correlation structure,

Fig. 4. A 2-dimensional STOCSY map for the aromatic resonances of hippuric acid. Pearson correlation (r) level is shown by the colour and mean 1-D spectra are indicated on the axes. 1050 1-D 1H NMR spectra of urine from normal laboratory rats were analysed at 600 MHz. Data courtesy of the COMET project [88].


especially when one studies data sets showing very strong changes in large peaks. Finally, relatively large numbers of spectra are required to confidently attribute a correlation to a structural relationship, although a few tens of spectra are often enough for practical purposes [94]. Therefore, when applying STOCSY for structural assignment one should aim to use data sets which are highly homogeneous, of size larger than around 10 spectra, and minimise peak overlap and positional variation. Just as heteronuclear NMR experiments can be useful for metabolite identification, so heteronuclear STOCSY can yield important structural information. The HET-STOCSY approach analyses the cross-correlation, across a number of samples, of spectra acquired from different observed nuclei. Coen et al. [23] found that HET-STOCSY enabled assignment of peaks from phosphorus containing metabolites in liver tissue; these assignments could not have been made with conventional 2d NMR methods due to magnetization transfer and bond-distance constraints. Another application of HET-STOCSY to 31P–1H spectra of human gut biopsies further demonstrated its utility for resonance assignment and also suggested the existence of different microenvironments for phospholipids [84]. Keun et al. [83] applied the approach to follow the biotransformation of the antibiotic Flucloxacillin, using the 19 F–1H correlations to filter the 1H–1H STOCSY maps. By editing the correlations in this way, they were able to greatly simplify the complex homonuclear maps and derive highly metabolite specific patterns of cross peaks (see Fig. 5). The concept of using statistical correlations to link signals deriving from the same molecule was taken further by addressing the problem of peak deconvolution in LC-NMR spectra [22]. In contrast to other STOCSY approaches, the correlations in this applica-


tion derive from the common elution profile of structurally related resonances in a single sample, rather than variations in metabolite concentration across an array of samples. Partially co-eluting compounds, whose peaks were highly overlapped in retention time, could be distinguished based on their distinct chromatographic profiles. The utility of the technique was demonstrated by the assignment of three metabolites of the drug thiabendazole in rat urine. This application along with other single sample methods [21] illustrates the potential of STOCSY-like approaches to make use of correlations deriving from a wide variety of processes for structural assignment and pathway analysis. Multiple analytical methods have been used in structure elucidation studies for decades and the statistical correlation approach can be of help here also. Mass spectrometry is one of the most important technologies in this area, providing highly complementary information to that from NMR owing to higher sensitivity, and an ability to detect fragmentation, adduct, dimer and isotope patterns. Crockford et al. [24,95] showed that significant structural relationships could be derived by cross-correlating NMR and MS data sets, terming their method Statistical HeteroSpectroscopy (SHY). The SHY approach indicated strong correlations between the NMR and MS signals for many molecules. The simultaneous recognition of NMR multiplet patterns and MS fragmentation/adduct patterns increased the confidence of structural assignments in many cases. Interestingly, the correlation analysis also revealed the presence of hitherto unrecognised NMR resonances purely by their correlation to MS data. The SHY technique thus increased the information available for structural assignment in several complementary ways. Despite its advantages, SHY and similar approaches have not been widely applied in the literature to date

Fig. 5. HET-STOCSY editing of 600 MHz 1H NMR spectra of urine from humans dosed with flucloxacillin. Each panel shows a 1H–1H STOCSY map. (A) Unnormalised data, (B) normalised data, (C) edited by 19F peak at 110.36 ppm (parent) and (D) edited by peak at 110.07 ppm (metabolite IV). In editing the 1H–1H STOCSY, only signals with correlation of r > 0.5 to the 19F peak are retained. Reproduced with permission from Ref. [83].


save for a few examples [96]. One reason for this is perhaps the technical difficulty of combining very large data sets (>10^10 correlations, even when combining reduced data sets), together with the choice of normalisation and noise rejection strategies.

Databases of standard compounds are a very important tool aiding the assignment of complex mixture NMR spectra. Although there has been much progress in recent years, users still face several important challenges in this area, including the rather poor coverage of the metabolome and the relevance of the acquisition conditions (such as the solvent or pH) to those of the experiment at hand. An example of the latter is the ubiquity of standard spectra using organic solvents, while most metabolic profiling experiments will analyse aqueous samples. Two databases stand out – the Biological Magnetic Resonance Data Bank (BMRB) [97] and the Human Metabolome Database (HMDB) [98]. The BMRB has augmented its role as a primary repository of macromolecular NMR data with a library of small molecule information. As of May 2009, NMR data on 557 standard compounds were available, along with standard information such as molecular formula, structure and links to other databases. The HMDB targets metabolites found in humans, including chemical, clinical and biochemical data. As of May 2009, 793 compounds had assigned 1H NMR spectra, although the database listed over 6500 compounds in total. While these numbers represent a huge effort, it is clear that the numbers of compounds with experimental NMR spectra still represent a small fraction of the metabolome (thought to comprise thousands of metabolites). The HMDB also lists literature-reported metabolite concentrations for various tissues of the body, in both normal and pathological states. Although coverage is patchy, this represents a valuable additional resource. 
In addition, the HMDB stores detailed information on enzymes related to each metabolite, along with cross-references to other databases. Many other online resources are also of value when interpreting the results of NMR metabolic profiling experiments. Detailed descriptions are beyond our scope, but we refer the reader to some useful examples such as databases aimed at pathways (KEGG [99], metaCyc [100], ConsensusPathDB [101]) and compounds (ChEBI [102], PubChem (pubchem.ncbi.nlm.nih.gov), The Dictionary of Natural Products (dnp.chemnetbase.com)). As with data standards, this is a continually evolving area, and it seems likely that the utility of database resources will greatly improve in the coming years.

5. Statistical integration of NMR metabolic profiles with other post-genomic data

Data from multiple omics techniques from the same experiment – ‘multi-omic’ data – is rapidly becoming ubiquitous in many areas of biology. NMR-based metabolic profiling is often complemented by transcriptomics, proteomics and other post-genomic techniques to generate a multilevel overview of the biological problem of interest. This is a systems biology approach in which observations are global, multilevel and often complemented by mathematical models which can be directly compared to the experimental data. How can researchers extract useful information from these complex, multivariate, multi-omic data? This problem of data integration is a formidable challenge, whose solutions are still in their infancy, but which seems set to have a large impact on future biological research. A key requirement, but one which we do not discuss here, is the integration of omics data with highly annotated bioinformatics resources (such as databases of gene functions, protein interactions, metabolic pathways etc.). Automatic linking of experimental data to such resources would significantly ease the interpretation of many experiments. Metabolic profiling, particularly using NMR, is currently a little behind some omics fields (e.g. transcriptomics) in this area, partly because of the

difficulty of automatically identifying compounds in spectra of highly complex and variable biological mixtures. We acknowledge three broad types of data integration, each differing in the amount of mathematical modelling, hypotheses required, and ease of implementation. The simplest is conceptual integration, in which each block of omics data is analysed independently, and conclusions/inferences about the combined behaviour of different biological levels are synthesised by the researcher. At a deeper level, we have statistical integration in which we search for statistical associations between entities from each omic data set in a largely hypothesis-free or hypothesis-generating manner. Finally, we have what we might call model-based integration, in which we have a mathematical description of the system that can model and predict each level of biological organisation separately. For example, a fully parameterised description of a metabolic network might be able to model both the levels and activities of key enzymes and metabolites. Each of these types of integration could be applied to the integration of data from different platforms (e.g. NMR vs. MS), tissue types (e.g. blood vs. liver) and/or biomolecular level (e.g. transcriptome vs. metabolome). At the conceptual level, many authors have published integrated studies combining NMR metabolic profiling with other types of data. Weeks et al. [103] examined the proteomic and NMR-metabolomic response of the yeast Schizosaccharomyces pombe to oxidative stress induced by hydrogen peroxide treatment. The global effects of treatment on both a mutant and wild type strain were synthesised manually and illustrated via a diagram of connected pathways. A similar approach was employed by Vilasi et al. [104] in their analysis of three genetic forms of renal Fanconi syndrome. 
2d-gel/MS plasma proteomics was combined with urinary NMR-metabolomics at the conceptual level, suggesting that one genetic variant showed a molecular profile quite distinct from the other two. Hirai et al. [105,106] used transcriptomics and MS-derived metabolomics to study responses of Arabidopsis thaliana to nutritional stress. PCA, self organising maps and pathway diagrams were used to draw conclusions regarding the pathways involved in sulphur and nitrogen deficiency.

The statistical level of integration has been attempted in several different ways. For example, Griffin et al. [107] studied the transcriptional and NMR-derived metabolic signature of fatty liver induced by orotic acid treatment. Initially, the metabolic and transcriptional data sets were analysed separately and their combination visualised via a colour coded diagram of the relevant metabolic pathways, see Fig. 6. Subsequently, the data were statistically integrated via a 2-block PLS regression, allowing the metabolites and transcripts which covaried to be identified. The method also enabled the relationship of different biological samples to be studied – e.g. different rat strains showed different patterns of transcriptional-metabolic associations on exposure to the orotic acid diet. A proportion of the transcript-metabolite pairs were additionally subject to a bootstrap correlation analysis to confirm the robustness of their association. A similar approach was later applied to the study of programmed cell death in rat glioma tumours [108], though in this case, the transcriptional part of this study was rather small, with just four pro-apoptotic genes assayed. The direct correlation and PLS based approaches were also taken by Rantalainen et al. [109] in an integrated analysis of NMR metabolic profiles and 2-D Differential in Gel Electrophoresis (DIGE) proteomic data from a murine cancer xenograft model. Fig. 
7 shows the protein-metabolite correlation map resulting from computation of Pearson correlation coefficients between each DIGE spot and each NMR chemical shift intensity. Many pairs of DIGE spots and NMR chemical shifts show strong positive or negative correlation. In particular, several proteins were correlated/anticorrelated to tyrosine intensity as shown by the inset, for example serotransferrin precursor/fibrinogen A alpha polypeptide. While direct correlation offers an intuitive, visual approach to data integration, it is difficult to validate on a large scale, does not generate a predictive model and outliers are difficult to identify. To circumvent some of these drawbacks, these authors applied Orthogonal PLS (O2-PLS) to partition the overall variance of the multi-omic data into three parts: variance shared by both proteomic and metabolic data, variance specific to each platform and residual variance. This allowed identification of protein/metabolite signatures which were predictive of the metabolite/protein data, as well as a separate analysis of the variation specific to each block. The O2-PLS approach has been successfully applied to the integration of multi-omic data in several other instances [110,111], though with MS-derived metabolic profiles rather than NMR. Other approaches such as hierarchical latent variable models [112] have also been applied but have yet to gain popularity in this area. Direct correlation has also been applied to integration of NMR metabolic profiles with genotype information, generating so-called metabolic Quantitative Trait Loci (mQTLs) [113] in which the variation of individual metabolites is directly linked to specific genetic sequence markers. Whatever the method, one should always be aware that identification of pairs of related molecules by statistical integration seldom reveals directly causal relationships, so results of such models should be interpreted with care.

Fig. 6. Conceptual integration of metabolic and transcriptional data. Pathways associated with triglyceride/phospholipid synthesis, choline metabolism, and methyl donor metabolism are depicted and metabolite/transcript names coloured according to their response to orotic acid exposure (red, increased; green, decreased). Reproduced with permission of American Physiological Society in the format Journal via Copyright Clearance Center from Ref. [107]. © 2004 American Physiological Society.

Fig. 7. Statistical integration of NMR-metabolomic and proteomic data via direct correlation. The main panel shows the Pearson cross-correlation matrix between protein DIGE spots (rows) and 600 MHz 1H NMR intensities (thresholded for significance; red, positive; blue, negative correlation). The side panels illustrate the DIGE spot levels and 600 MHz 1H NMR spectrum. The DIGE spot with the strongest expression ratio near 300 is serotransferrin precursor/fibrinogen A alpha polypeptide, which is highly negatively correlated to the tyrosine resonance at δ 7.20. Reproduced with permission from [109]. © 2006 American Chemical Society.

Model-based integration of metabolic and other omics data is still in its infancy. This is largely due to the high degree of information missing from our understanding of complex biological systems. While many of the system parts (e.g. genes, metabolites) may be known, the way they interact to form a fully functioning system is often unknown. Even for metabolic networks, which benefit from more than a century of study in classical biochemistry, key parameters such as rate constants are only sparsely determined, and then mainly for model organisms in specific conditions. Therefore, building accurate system-wide models which are predictive of multiple levels of biomolecular organisation is exceptionally difficult. Nonetheless, this field is the focus of intense activity and many techniques and modelling approaches are available [114,115], although there has been little work using NMR-derived metabolic profiles.
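As an illustration of the direct correlation approach to statistical integration discussed above, the following minimal sketch computes a thresholded Pearson cross-correlation matrix between two data blocks and a percentile bootstrap interval for a single pair. It is not the implementation used in any of the cited studies: the function names (`pearson`, `cross_correlation`, `bootstrap_interval`), the toy data, and the simple cutoff on |r| (standing in for a proper significance threshold) are all assumptions made for illustration.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant variable: correlation undefined, report 0
    return sxy / math.sqrt(sxx * syy)

def cross_correlation(block_a, block_b, threshold):
    """Correlate every variable in block_a with every variable in block_b.

    Each block is a list of variables, each variable a list of values over
    the same samples. Entries with |r| below `threshold` are set to 0
    (an assumed, simplistic stand-in for significance testing).
    """
    matrix = []
    for a in block_a:
        row = []
        for b in block_b:
            r = pearson(a, b)
            row.append(r if abs(r) >= threshold else 0.0)
        matrix.append(row)
    return matrix

def bootstrap_interval(x, y, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the correlation of one pair."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample samples with replacement
        stats.append(pearson([x[i] for i in idx], [y[i] for i in idx]))
    stats.sort()
    lo_idx = int(round(alpha / 2 * n_boot))
    hi_idx = n_boot - 1 - lo_idx
    return stats[lo_idx], stats[hi_idx]

# Hypothetical toy data: 2 protein spots and 3 NMR variables over 6 samples.
proteins = [[1, 2, 3, 4, 5, 6],
            [2, 1, 4, 3, 6, 5]]
nmr = [[8, 6, 4, 2, 0, -2],      # falls linearly as spot 0 rises
       [1, -1, 1, -1, 1, -1],    # weakly related to both spots
       [6, 3, 12, 9, 18, 15]]    # proportional to spot 1

corr = cross_correlation(proteins, nmr, threshold=0.8)
for row in corr:
    print(["%+.2f" % r for r in row])

# Robustness check for one pair, using a noisy copy of the first NMR variable.
noisy = [7.9, 6.3, 3.8, 2.2, 0.1, -2.1]
lo, hi = bootstrap_interval(proteins[0], noisy)
print("95%% bootstrap interval: [%.3f, %.3f]" % (lo, hi))
```

A pair whose bootstrap interval excludes zero is more plausibly a robust association, but, as noted above, such an association does not by itself imply a causal relationship. With real data the number of pairs is very large, so a correction for multiple testing would be needed in place of the fixed cutoff used here.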

6. Conclusions and future prospects

NMR metabolic profiles are inherently complex and information rich, with the potential to yield fundamental insights into the molecular mechanisms underlying health and disease. The difficulty is how to extract this information efficiently, reliably and in a way which is interpretable to chemists and biologists without in-depth statistical training. Bioinformatic and statistical analysis cannot be divorced from the biological goals and analytical challenges of an experiment. Our experience is that success is achieved by close communication between individuals from the three stakeholder disciplines: biological sciences, analytical chemistry and data analysis. Effective collaboration all the way from experimental design to biological interpretation is the key to successful application of the technology.

As with all types of statistical modelling, the methods employed must be selected both to fit the characteristics of the data and also to address the specific objectives of the study in question. NMR data possess key attributes which set them apart both from other technologies used in metabolic profiling and also from other ‘omics’ sciences such as transcriptomics. This is particularly important to bear in mind when considering the transfer of a statistical technique from one omics field to another, as the validity of assumptions may change from field to field. For example, a spot on a microarray slide represents a known sequence designed to target a known gene; however, a chemical shift variable from an NMR profile may represent anything from zero to several different known and/or unknown metabolites.

Despite considerable progress in the statistical methodology for analysing NMR metabolic profiles in the last 10 years, many important challenges remain. One of the most prominent is the problem of automatically assigning and quantifying the levels of metabolites in the spectra of complex biofluids such as urine using a database of pure compound spectra. Some published methods ignore the relationship between multiple resonances from the same metabolite, while none make use of sample-to-sample correlations between metabolite levels; both aspects are typically taken into account in the manual assignment of resonances. Nonetheless, it is our opinion that the automated assignment/quantification problem is less severe for NMR than it is for some other technologies used in metabolic profiling such as liquid chromatography-mass spectrometry. Finally, we note that a database of pure spectra is only useful if the compound of interest is represented; there will always be a need for the detection of novel biomarkers and de novo assignment of resonances.

Structural characterisation is one area into which statistical methods have recently expanded [20–23,82–84,94] and this looks set to continue in the near future. For example, is it possible to determine which statistical correlations observed in a STOCSY or SHY analysis derive from structural relationships and which from other sources [94]? Are there higher order patterns which could be diagnostic of structural relationships among resonances? Is it possible to improve the effective detection limit for NMR profiling by cross-correlating with other platforms (such as MS as in SHY [24])?

The work on correlations has highlighted the prospect of using inter-metabolite correlations as a new set of descriptors to describe the metabolic phenotype [116,117], targeting a higher level of system structure than the metabolite levels themselves. To estimate the correlations accurately, such studies will necessarily be reliant on a large number of replicates per condition. One can envisage studies in which biological states are distinguished not by differences in metabolite levels, but by alterations in inter-metabolite correlation structure. An exciting further possibility is the detection of non-linear correlations between metabolites which may reveal relationships otherwise hidden to typical linear methods [118,119].

Integration of data from multiple platforms is another area where we expect to see developments in the next few years, especially in multi-omic integration as more biological studies begin to use several profiling technologies in parallel.
As already discussed, data integration faces extreme challenges at the mechanistic model level, so we expect to see most immediate progress at the statistical level, perhaps addressing the problems of false discoveries, noise removal, simultaneous integration of more than two data sets and detection of ‘functional modules’ containing entities from several biological levels.

It is clear that statistical and bioinformatic methods are an integral part of any study using NMR metabolic profiling. Both the analytical and bioinformatic state-of-the-art are continuously evolving and we look forward to new techniques which enable the maximum useful information to be gleaned from this rich and complex data.

Acknowledgements

R.C. acknowledges financial support from the EU carcinoGENOMICS project (contract No. PL037712). The members of the Consortium for Metabonomic Toxicology are acknowledged for the data depicted in Fig. 2 and Fig. 4.

References

[1] J.K. Nicholson, J.C. Lindon, E. Holmes, Xenobiotica 29 (1999) 1181.
[2] L.M. Raamsdonk, B. Teusink, D. Broadhurst, N. Zhang, A. Hayes, M.C. Walsh, J.A. Berden, K.M. Brindle, D.B. Kell, J.J. Rowland, H.V. Westerhoff, K. van Dam, S.G. Oliver, Nat. Biotechnol. 19 (2001) 45.
[3] A. Daniels, R.J. Williams, P.E. Wright, Nature 261 (1976) 321.
[4] F.F. Brown, I.D. Campbell, P.W. Kuchel, D.C. Rabenstein, FEBS Lett. 82 (1977) 12.
[5] J.R. Bales, D.P. Higham, I. Howe, J.K. Nicholson, P.J. Sadler, Clin. Chem. 30 (1984) 426.
[6] J.K. Nicholson, I.D. Wilson, Prog. Nucl. Mag. Res. Sp. 21 (1989) 449.
[7] J.K. Nicholson, J.A. Timbrell, J.R. Bales, P.J. Sadler, Mol. Pharmacol. 27 (1985) 634.
[8] K.P. Gartland, S.M. Sanins, J.K. Nicholson, B.C. Sweatman, C.R. Beddell, J.C. Lindon, NMR Biomed. 3 (1990) 166.
[9] K.P. Gartland, C.R. Beddell, J.C. Lindon, J.K. Nicholson, Mol. Pharmacol. 39 (1991) 629.
[10] E. Holmes, F.W. Bonner, B.C. Sweatman, J.C. Lindon, C.R. Beddell, E. Rahr, J.K. Nicholson, Mol. Pharmacol. 42 (1992) 922.
[11] E. Holmes, J.K. Nicholson, F.W. Bonner, B.C. Sweatman, C.R. Beddell, J.C. Lindon, E. Rahr, NMR Biomed. 5 (1992) 368.
[12] J.C. Lindon, E. Holmes, J.K. Nicholson, Prog. Nucl. Mag. Res. Sp. 39 (2001) 1.
[13] H.E. Johnson, R.J. Gilbert, M.K. Winson, R. Goodacre, A.R. Smith, J.J. Rowland, M.A. Hall, D.B. Kell, Genet. Program. Evol. M. 1 (2000) 243.
[14] P. Mendes, Brief Bioinform. 3 (2002) 134.
[15] R. Goodacre, S. Vaidyanathan, W.B. Dunn, G.G. Harrigan, D.B. Kell, Trends Biotechnol. 22 (2004) 245.
[16] T.M.D. Ebbels, Nonlinear chemometric methods for the analysis of metabolic profiles, in: J.C. Lindon, J.K. Nicholson, E. Holmes (Eds.), The Handbook of Metabonomics and Metabolomics, Elsevier, Amsterdam, 2006, p. 201.
[17] D. Broadhurst, D. Kell, Metabolomics 2 (2006) 171.
[18] M. De Iorio, T.M.D. Ebbels, D.A. Stephens, Statistical techniques in metabolic profiling, in: D.J. Balding, C. Cannings, M. Bishop (Eds.), Handbook of Statistical Genetics, third ed., vol. 1, John Wiley & Sons Ltd., Chichester, 2007, pp. 347.
[19] R. Steuer, K. Morgenthal, W. Weckwerth, J. Selbig, Methods Mol. Biol. 358 (2007) 105.
[20] O. Cloarec, M.E. Dumas, A. Craig, R.H. Barton, J. Trygg, J. Hudson, C. Blancher, D. Gauguier, J.C. Lindon, E. Holmes, J. Nicholson, Anal. Chem. 77 (2005) 1282.
[21] L.M. Smith, A.D. Maher, O. Cloarec, M. Rantalainen, H. Tang, P. Elliott, J. Stamler, J.C. Lindon, E. Holmes, J.K. Nicholson, Anal. Chem. 79 (2007) 5682.
[22] O. Cloarec, A. Campbell, L.H. Tseng, U. Braumann, M. Spraul, G. Scarfe, R. Weaver, J.K. Nicholson, Anal. Chem. 79 (2007) 3304.
[23] M. Coen, Y.S. Hong, O. Cloarec, C.M. Rhode, M.D. Reily, D.G. Robertson, E. Holmes, J.C. Lindon, J.K. Nicholson, Anal. Chem. 79 (2007) 8956.
[24] D.J. Crockford, E. Holmes, J.C. Lindon, R.S. Plumb, S. Zirah, S.J. Bruce, P. Rainville, C.L. Stumpf, J.K. Nicholson, Anal. Chem. 78 (2006) 363.
[25] M.R. Viant, Biochem. Biophys. Res. Commun. 310 (2003) 943.
[26] I.A. Lewis, S.C. Schommer, B. Hodis, K.A. Robb, M. Tonelli, W.M. Westler, M.R. Suissman, J.L. Markley, Anal. Chem. 79 (2007) 9385.
[27] W. el-Deredy, NMR Biomed. 10 (1997) 99.
[28] G. Hagberg, NMR Biomed. 11 (1998) 148.
[29] R. Goodacre, D. Broadhurst, A. Smilde, B. Kristal, J. Baker, R. Beger, C. Bessant, S. Connor, G. Capuani, A. Craig, T. Ebbels, D. Kell, C. Manetti, J. Newton, G. Paternostro, R. Somorjai, M. Sjöström, J. Trygg, F. Wulfert, Metabolomics 3 (2007) 231.
[30] E. Holmes, P.J.D. Foxall, J.K. Nicholson, G.H. Neild, S.M. Brown, C.R. Beddell, B.C. Sweatman, E. Rahr, J.C. Lindon, M. Spraul, P. Neidig, Anal. Biochem. 220 (1994) 284.
[31] M. Spraul, P. Neidig, U. Klauck, P. Kessler, E. Holmes, J.K. Nicholson, B.C. Sweatman, S.R. Salman, R.D. Farrant, E. Rahr, C.R. Beddell, J.C. Lindon, J. Pharm. Biomed. Anal. 12 (1994) 1215.
[32] O. Cloarec, M.E. Dumas, J. Trygg, A. Craig, R.H. Barton, J.C. Lindon, J.K. Nicholson, E. Holmes, Anal. Chem. 77 (2005) 517.
[33] D.J. Crockford, H.C. Keun, L.M. Smith, E. Holmes, J.K. Nicholson, Anal. Chem. 77 (2005) 4556.
[34] A.M. Weljie, J. Newton, P. Mercier, E. Carlson, C.M. Slupsky, Anal. Chem. 78 (2006) 4430.
[35] T.M.D. Ebbels, H.C. Keun, O. Beckonert, E. Bollard, J.C. Lindon, E. Holmes, J.K. Nicholson, J. Proteome Res. 6 (2007) 4407.
[36] J.-P. Antoine, A. Coron, J.-M. Dereppe, J. Magn. Reson. 144 (2000) 189.
[37] T.M.D. Ebbels, J.C. Lindon, J.K. Nicholson, E.C. Holmes, United States Patent 6683455 (2004).
[38] A. Craig, O. Cloarec, E. Holmes, J.K. Nicholson, J.C. Lindon, Anal. Chem. 78 (2006) 2262.
[39] F. Dieterle, A. Ross, G. Schlotterbeck, H. Senn, Anal. Chem. 78 (2006) 4281.
[40] R.J.O. Torgrip, K.M. Aberg, E. Alm, I. Schuppe-Koistinen, J. Lindberg, Metabolomics 4 (2008) 114.
[41] J.C. Lindon, J.K. Nicholson, E. Holmes, H.C. Keun, A. Craig, J.T. Pearce, S.J. Bruce, N. Hardy, S.A. Sansone, H. Antti, P. Jonsson, C. Daykin, M. Navarange, R.D. Beger, E.R. Verheij, A. Amberg, D. Baunsgaard, G.H. Cantor, L. Lehman-McKeeman, M. Earll, S. Wold, E. Johansson, J.N. Haselden, K. Kramer, C. Thomas, J. Lindberg, I. Schuppe-Koistinen, I.D. Wilson, M.D. Reily, D.G. Robertson, H. Senn, A. Krotzky, S. Kochhar, J. Powell, F. van der Ouderaa, R. Plumb, H. Schaefer, M. Spraul, Nat. Biotechnol. 23 (2005) 833.
[42] A. Brazma, P. Hingamp, J. Quackenbush, G. Sherlock, P. Spellman, C. Stoeckert, J. Aach, W. Ansorge, C.A. Ball, H.C. Causton, T. Gaasterland, P. Glenisson, F.C. Holstege, I.F. Kim, V. Markowitz, J.C. Matese, H. Parkinson, A. Robinson, U. Sarkans, S. Schulze-Kremer, J. Stewart, R. Taylor, J. Vilo, M. Vingron, Nat. Genet. 29 (2001) 365.
[43] C.F. Taylor, N.W. Paton, K.S. Lilley, P.A. Binz, R.K. Julian Jr., A.R. Jones, W. Zhu, R. Apweiler, R. Aebersold, E.W. Deutsch, M.J. Dunn, A.J. Heck, A. Leitner, M. Macht, M. Mann, L. Martens, T.A. Neubert, S.D. Patterson, P. Ping, S.L. Seymour, P. Souda, A. Tsugita, J. Vandekerckhove, T.M. Vondriska, J.P. Whitelegge, M.R. Wilkins, I. Xenarios, J.R. Yates 3rd, H. Hermjakob, Nat. Biotechnol. 25 (2007) 887.
[44] R.J. Bino, R.D. Hall, O. Fiehn, J. Kopka, K. Saito, J. Draper, B.J. Nikolau, P. Mendes, U. Roessner-Tunali, M.H. Beale, R.N. Trethewey, B.M. Lange, E.S. Wurtele, L.W. Sumner, Trends Plant. Sci. 9 (2004) 418.
[45] H. Jenkins, N. Hardy, M. Beckmann, J. Draper, A.R. Smith, J. Taylor, O. Fiehn, R. Goodacre, R.J. Bino, R. Hall, J. Kopka, G.A. Lane, B.M. Lange, J.R. Liu, P. Mendes, B.J. Nikolau, S.G. Oliver, N.W. Paton, S. Rhee, U. Roessner-Tunali, K. Saito, J. Smedsgaard, L.W. Sumner, T. Wang, S. Walsh, E.S. Wurtele, D.B. Kell, Nat. Biotechnol. 22 (2004) 1601.
[46] S.A. Sansone, T. Fan, R. Goodacre, J.L. Griffin, N.W. Hardy, R. Kaddurah-Daouk, B.S. Kristal, J.C. Lindon, P. Mendes, N. Morrison, B.J. Nikolau, D. Robertson, L.W. Sumner, C. Taylor, M. van der Werf, B. van Ommen, O. Fiehn, Nat. Biotechnol. 25 (2007) 846.
[47] O. Fiehn, D. Robertson, J. Griffin, M. van der Werf, B.J. Nikolau, N. Morrison, L. Sumner, R. Goodacre, N. Hardy, C. Taylor, J. Fostel, B. Kristal, R. Kaddurah-Daouk, P. Mendes, B. van Ommen, J.C. Lindon, S.-A. Sansone, Metabolomics 3 (2007) 175.
[48] S. Wold, E. Johansson, M. Cocchi, PLS – partial least squares projections to latent structures, in: H. Kubinyi (Ed.), 3D QSAR in Drug Design, Theory, Methods, and Applications, ESCOM Science Publishers, Leiden, 1993.
[49] L. Eriksson, E. Johansson, N. Kettaneh-Wold, S. Wold, Multi- and Megavariate Data Analysis, Umetrics AB, Umea, Sweden, 2001.
[50] J. Trygg, S. Wold, J. Chemometr. 16 (2002) 119.
[51] J. Trygg, S. Wold, J. Chemometr. 17 (2003) 53.
[52] O. Teahan, S. Gamble, E. Holmes, J. Waxman, J.K. Nicholson, C. Bevan, H.C. Keun, Anal. Chem. 78 (2006) 4307.
[53] M. Mitchell, An Introduction to Genetic Algorithms, MIT Press, 1996.
[54] W. Banzhaf, P. Nordin, R.E. Keller, F.D. Francone, Genetic Programming: An Introduction: On the Automatic Evolution of Computer Programs and its Applications, first ed., Morgan Kaufmann, 1998.
[56] Z. Ramadan, D. Jacobs, M. Grigorov, S. Kochhar, Talanta 68 (2006) 1683.
[57] R. Cavill, H.C. Keun, E. Holmes, J.C. Lindon, J.K. Nicholson, T.M.D. Ebbels, Bioinformatics 25 (2009) 112.
[58] W.B. Langdon, R. Poli, in: Genetic Programming, Morgan Kauffman, Stanford University, 1997.
[59] N.F. McPhee, J.D. Miller, in: Proceedings of the 6th International Conference on Genetic Algorithms, Morgan Kaufmann Publishers Inc., 1995.
[60] R.A. Davis, A.J. Charlton, S. Oehlschlager, J.C. Wilson, Chemometr. Intell. Lab. 81 (2006) 50.
[61] H.F. Gray, R.J. Maxwell, I. Martinez-Perez, C. Arus, S. Cerdan, NMR Biomed. 11 (1998) 217.
[62] J. Hageman, R. van den Berg, J. Westerhuis, M. van der Werf, A. Smilde, Metabolomics 4 (2008) 141.
[63] S.C. Madeira, A.L. Oliveira, IEEE/ACM Transactions on Computational Biology and Bioinformatics 01 (2004) 24.
[64] N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines, Cambridge University Press, 2000.
[65] Y. Truong, X. Lin, C. Beecher, in: Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; ACM: Seattle, WA, USA, 2004, pp. 835.
[66] S. Mahadevan, S.L. Shah, C.M. Slupsky, T.J. Marrie, E. Saude, D.J. Adamko, in: 10th International IFAC Symposium on Computer Applications in Biotechnology: Cancun, Mexico, 2007.
[67] F. Lindgren, P. Geladi, S. Wold, J. Chemometr. 7 (1993) 45.
[68] B. Scholkopf, A. Smola, K.R. Muller, Neural Comput. 10 (1998) 1299.
[69] B. Scholkopf, A.J. Smola, K.-R. Muller, Kernel principal component analysis, in: B. Scholkopf, C.J.C. Burges, A.J. Smola (Eds.), Advances in Kernel Methods, MIT Press, 1999, pp. 327.
[70] R. Roman, J.T. Leonard, J. Mach. Learn. Res. 2 (2002) 97.
[71] M. Rantalainen, M. Bylesjo, O. Cloarec, J.K. Nicholson, E. Holmes, J. Trygg, J. Chemometr. 21 (2007) 376.
[72] M. Bylesjo, M. Rantalainen, J. Nicholson, E. Holmes, J. Trygg, BMC Bioinformatics 9 (2008) 106.
[73] B. Ustun, W.J. Melssen, L.M.C. Buydens, Anal. Chim. Acta 595 (2007) 299.
[74] S. Bernhard, J.S. Alexander, M. Klaus-Robert, ller, Kernel Principal Component Analysis Advances in Kernel Methods: Support Vector Learning, MIT Press, 1999, pp. 327.
[75] L. Breiman, Mach. Learn. 45 (2001) 5.
[76] S.R. Safavian, D. Landgrebe, IEEE Trans. Syst. Man Cybern. 21 (1991) 660.
[77] T.M. Mitchell, Machine Learning, McGraw-Hill, 1997.
[78] R. Rousseau, B. Govaerts, M. Verleysen, B. Boulanger, Chemometr. Intell. Lab. 91 (2008) 54.
[79] D.P. Enot, M. Beckmann, D. Overy, J. Draper, Proc. Natl. Acad. Sci. USA 103 (2006) 14865.
[80] I. Noda, Appl. Spectrosc. 44 (1990) 550.
[81] E. Holmes, O. Cloarec, J.K. Nicholson, J. Proteome Res. 5 (2006) 1313.
[82] E. Holmes, R.L. Loo, O. Cloarec, M. Coen, H. Tang, E. Maibaum, S. Bruce, Q. Chan, P. Elliott, J. Stamler, I.D. Wilson, J.C. Lindon, J.K. Nicholson, Anal. Chem. 79 (2007) 2629.
[83] H.C. Keun, T.J. Athersuch, O. Beckonert, Y. Wang, J. Saric, J.P. Shockcor, J.C. Lindon, I.D. Wilson, E. Holmes, J.K. Nicholson, Anal. Chem. 80 (2008) 1073.
[84] Y. Wang, O. Cloarec, H. Tang, J.C. Lindon, E. Holmes, S. Kochhar, J.K. Nicholson, Anal. Chem. 80 (2008) 1058.
[85] K. Pearson, Biometrika 13 (1920) 25.
[86] J.L. Rodgers, W.A. Nicewander, Am. Stat. 42 (1988) 59.
[87] J.C. Lindon, J.K. Nicholson, E. Holmes, H. Antti, M.E. Bollard, H. Keun, O. Beckonert, T.M.D. Ebbels, M.D. Reily, D. Robertson, G.J. Stevens, P. Luke, A.P. Breau, G.H. Cantor, R.H. Bible, U. Niederhauser, H. Senn, G. Schlotterbeck, U.G. Sidelmann, S.M. Laursen, A. Tymiak, B.D. Car, L. Lehman-McKeeman, J.M. Colet, A. Loukaci, C. Thomas, Toxicol. Appl. Pharmacol. 187 (2003) 137.
[88] J.C. Lindon, H.C. Keun, T.M. Ebbels, J.M. Pearce, E. Holmes, J.K. Nicholson, Pharmacogenomics 6 (2005) 691.
[89] R. Bruschweiler, F. Zhang, J. Chem. Phys. 120 (2004) 5253.
[90] R. Bruschweiler, J. Chem. Phys. 121 (2004) 409.
[91] F. Zhang, R. Bruschweiler, Angew. Chem. Int. Ed. Engl. 46 (2007) 2639.
[92] F. Zhang, A.T. Dossey, C. Zachariah, A.S. Edison, R. Bruschweiler, Anal. Chem. 79 (2007) 7748.
[93] F. Zhang, L. Bruschweiler-Li, S.L. Robinette, R. Bruschweiler, Anal. Chem. 80 (2008) 7549.
[94] A. Couto Alves, M. Rantalainen, E. Holmes, J.K. Nicholson, T.M.D. Ebbels, Anal. Chem. 81 (2009) 2075.
[95] D.J. Crockford, A.D. Maher, K.R. Ahmadi, A. Barrett, R.S. Plumb, I.D. Wilson, J.K. Nicholson, Anal. Chem. 80 (2008) 6835.
[96] S. Moco, J. Forshed, R.C.H. De Vos, R.J. Bino, J. Vervoort, Metabolomics 4 (2008) 202.
[97] E.L. Ulrich, H. Akutsu, J.F. Doreleijers, Y. Harano, Y.E. Ioannidis, J. Lin, M. Livny, S. Mading, D. Maziuk, Z. Miller, E. Nakatani, C.F. Schulte, D.E. Tolmie, R. Kent Wenger, H. Yao, J.L. Markley, Nucleic Acids Res. 36 (2008) D402.
[98] D.S. Wishart, D. Tzur, C. Knox, R. Eisner, A.C. Guo, N. Young, D. Cheng, K. Jewell, D. Arndt, S. Sawhney, C. Fung, L. Nikolai, M. Lewis, M.A. Coutouly, I. Forsythe, P. Tang, S. Shrivastava, K. Jeroncic, P. Stothard, G. Amegbey, D. Block, D.D. Hau, J. Wagner, J. Miniaci, M. Clements, M. Gebremedhin, N. Guo, Y. Zhang, G.E. Duggan, G.D. Macinnis, A.M. Weljie, R. Dowlatabadi, F. Bamforth, D. Clive, R. Greiner, L. Li, T. Marrie, B.D. Sykes, H.J. Vogel, L. Querengesser, Nucleic Acids Res. 35 (2007) D521.
[99] M. Kanehisa, S. Goto, Nucleic Acids Res. 28 (2000) 27.
[100] P.D. Karp, M. Riley, S.M. Paley, A. Pellegrini-Toole, Nucleic Acids Res. 30 (2002) 59.
[101] A. Kamburov, C. Wierling, H. Lehrach, R. Herwig, Nucleic Acids Res. 37 (2009) D623.
[102] K. Degtyarenko, P. de Matos, M. Ennis, J. Hastings, M. Zbinden, A. McNaught, R. Alcantara, M. Darsow, M. Guedj, M. Ashburner, Nucleic Acids Res. 36 (2008) D344.
[103] M.E. Weeks, J. Sinclair, A. Butt, Y.L. Chung, J.L. Worthington, C.R. Wilkinson, J. Griffiths, N. Jones, M.D. Waterfield, J.F. Timms, Proteomics 6 (2006) 2772.
[104] A. Vilasi, P.R. Cutillas, A.D. Maher, S.F. Zirah, G. Capasso, A.W. Norden, E. Holmes, J.K. Nicholson, R.J. Unwin, Am. J. Physiol. Renal. 293 (2007) F456.
[105] M.Y. Hirai, M. Yano, D.B. Goodenowe, S. Kanaya, T. Kimura, M. Awazuhara, M. Arita, T. Fujiwara, K. Saito, Proc. Natl. Acad. Sci. USA 101 (2004) 10205.
[106] M.Y. Hirai, M. Klein, Y. Fujikawa, M. Yano, D.B. Goodenowe, Y. Yamazaki, S. Kanaya, Y. Nakamura, M. Kitayama, H. Suzuki, N. Sakurai, D. Shibata, J. Tokuhisa, M. Reichelt, J. Gershenzon, J. Papenbrock, K. Saito, J. Biol. Chem. 280 (2005) 25590.
[107] J.L. Griffin, S.A. Bonney, C. Mann, A.M. Hebbachi, G.F. Gibbons, J.K. Nicholson, C.C. Shoulders, J. Scott, Physiol. Genomics 17 (2004) 140.
[108] J.L. Griffin, C. Blenkiron, P.K. Valonen, C. Caldas, R.A. Kauppinen, Anal. Chem. 78 (2006) 1546.
[109] M. Rantalainen, O. Cloarec, O. Beckonert, I.D. Wilson, D. Jackson, R. Tonge, R. Rowlinson, S. Rayner, J. Nickson, R.W. Wilkinson, J.D. Mills, J. Trygg, J.K. Nicholson, E. Holmes, J. Proteome Res. 5 (2006) 2642.
[110] M. Bylesjo, D. Eriksson, M. Kusano, T. Moritz, J. Trygg, Plant J. 52 (2007) 1181.
[111] M. Bylesjo, R. Nilsson, V. Srivastava, A. Gronlund, A.I. Johansson, S. Jansson, J. Karlsson, T. Moritz, G. Wingsle, J. Trygg, J. Proteome Res. 2008.
[112] J.S. Spicker, S. Brunak, K.S. Frederiksen, H. Toft, Toxicol. Sci. 102 (2008) 444.
[113] M.E. Dumas, S.P. Wilder, M.T. Bihoreau, R.H. Barton, J.F. Fearnside, K. Argoud, L. D’Amato, R.H. Wallis, C. Blancher, H.C. Keun, D. Baunsgaard, J. Scott, U.G. Sidelmann, J.K. Nicholson, D. Gauguier, Nat. Genet. 39 (2007) 666.
[114] A.R. Joyce, B.O. Palsson, Nat. Rev. Mol. Cell Biol. 7 (2006) 198.
[115] E. Klipp, R. Herwig, A. Kowald, C. Wierling, H. Lehrach, Modeling Tools Systems Biology in Practice, 2005, pp. 419.
[116] R. Steuer, J. Kurths, O. Fiehn, W. Weckwerth, Biochem. Soc. Trans. 31 (2003) 1476.
[117] R. Steuer, J. Kurths, O. Fiehn, W. Weckwerth, Bioinformatics 19 (2003) 1019.
[118] R. Steuer, J. Kurths, C.O. Daub, J. Weise, J. Selbig, Bioinformatics 18 Suppl. 2 (2002) S231.
[119] Q. Guo, J.K. Sidhu, T.M.D. Ebbels, F. Rana, D. Spurgeon, C. Svendsen, S.R. Sturzenbaum, P. Kille, A.J. Morgan, J.G. Bundy, Metabolomics 5 (2009) 72.