Chemometrics and Intelligent Laboratory Systems 84 (2006) 82 – 87 www.elsevier.com/locate/chemolab
Detection of ovarian cancer using chemometric analysis of proteomic profiles
Oliver P. Whelehan a,⁎, Mark E. Earll a, Erik Johansson b, Marianne Toft c, Lennart Eriksson b
a Umetrics UK Ltd, Woodside House, Winkfield, Windsor, SL4 2DX, UK
b Umetrics AB, Umeå, Sweden
c Umetrics AB, Malmö, Sweden
Received 23 November 2005; received in revised form 27 February 2006; accepted 8 March 2006 Available online 21 July 2006
Abstract
Chemometric analysis is used to discriminate between ovarian cancer patients and unaffected controls. In particular, Partial Least Squares–Discriminant Analysis (PLS–DA), and its more recent extension, OPLS–DA, are applied to 100 biopsy-proven cancer patients and 91 controls selected from the Ovarian Dataset 8-7-02 (http://home.ccr.cancer.gov/ncifdaproteomics/ppatterns.asp). Diagnostic models built on a representative training set of approximately 50% of the samples yield both 100% sensitivity and specificity when applied to a blind test set containing the remaining samples. The OPLS–DA model is particularly impressive. The approach presented here, which is widely used in the related field of metabonomics, has the advantage that the entire proteomic profile of 15,154 m/z values is analysed simultaneously in a single step. There is no requirement for prior variable selection or stepwise regression techniques and the results are easily interpretable in terms of simple plots. The most important biomarkers for distinguishing the control and cancer groups have m/z values less than 700.
© 2006 Elsevier B.V. All rights reserved.
Keywords: Partial least squares–discriminant analysis; Ovarian cancer
1. Introduction
The early diagnosis of ovarian cancer could significantly reduce mortality rates among women. The disease usually presents few symptoms until it has spread beyond the ovaries, which precludes surgical removal and contributes to the poor 5-year survival rate of just 35% for late stage presentation. Conversely, early stage diagnosis is associated with a 5-year survival rate of over 90% [1].
In a landmark paper, Petricoin et al. [2] used a data analytical tool based on a combination of genetic algorithms and cluster analysis to discriminate between biopsy-proven ovarian cancer patients and unaffected controls. The discrimination was based on proteomic analysis of blood serum samples using surface-enhanced laser desorption and ionisation time-of-flight (SELDI-TOF) mass spectroscopy. Importantly, their work indicates that the disease can be correctly diagnosed even during its early stages, raising the possibility of a rapid, cheap and non-invasive technique for the accurate and timely diagnosis of cancer. The data from Petricoin's study, and other related studies, are available on the Internet via the Food and Drug Administration's National Institutes of Health Clinical Proteomics Program Databank [3]. We are indebted to these workers for making their data openly available to the scientific community in this way.
2. Materials and methods
2.1. Data
Abbreviations: DOE, design of experiments; LAD, logical analysis of data; OPLS, orthogonal partial least squares; OPLS–DA, orthogonal partial least squares–discriminant analysis; PCA, principal components analysis; PLS, partial least squares; PLS–DA, partial least squares–discriminant analysis.
⁎ Corresponding author. Fax: +44 1344 885410. E-mail address:
[email protected] (O.P. Whelehan). 0169-7439/$ - see front matter © 2006 Elsevier B.V. All rights reserved. doi:10.1016/j.chemolab.2006.03.008
The Clinical Proteomics Program Databank contains three sets of ovarian cancer data. The first (2-16-02) was used in the study described above [2]. In this report, we analyse the third ovarian dataset (8-7-02), produced using low resolution SELDI-TOF
mass spectroscopy. The complete dataset contains proteomic profiles of blood serum for 162 biopsy-proven ovarian cancer patients and 91 unaffected women (controls). Of the cancer group, only the first 100 numbered patients were analysed here to allow for a more balanced study. Each proteomic profile, or spectrum, consists of 15,154 distinct m/z values ranging from 0 to 20,000.
2.2. Analysis
This dataset has been analysed previously by other workers using various approaches [1,4–6]. Zhu et al. [1] used Student's t-tests to identify 563 m/z values (protein biomarkers) that showed significant univariate differences between the two groups of women after correcting for the large number of simultaneous tests. These variables were then entered into a stepwise discriminant analysis which identified 18 m/z values that separated their test set samples with 100% sensitivity and specificity. Sorace and Zhan [4] followed a similar two-step approach using Wilcoxon signed rank tests to highlight 685 significant m/z values that were reduced to just seven by stepwise discriminant analysis, again yielding 100% sensitivity and specificity when applied to a blind test set. Alexe et al. [5] applied logical analysis of data (LAD), a method based on combinatorial optimisation, to produce a range of models with cross-validated sensitivities and specificities of up to 100%. Baggerly et al. [6] looked at the reproducibility of the three ovarian cancer datasets, highlighting potential deficiencies in the approach and suggesting that the differences found in Petricoin's original paper [2] could be due to artifacts of sample processing rather than physiological changes induced by cancer. Petricoin et al. addressed these concerns in their comments on the article by Sorace and Zhan [7].
In this paper, we re-analyse the 8-7-02 data using a straightforward chemometric approach called PLS–Discriminant Analysis (PLS–DA).
Unlike the methods described above, PLS–DA is a single step approach that analyses all 15,154 m/z values simultaneously without the need for prior variable selection or stepwise regression and without the rigid constraint of having more samples than variables.
PLS (Partial Least Squares) regression is a chemometric projection method for relating two blocks of variables, X and Y, to each other via a linear multivariate model [8]. It has many advantages over more traditional regression methods because it can handle, among other things: noise in both blocks of variables, missing data, co-linearity among variables and more variables than samples. In practice, the predictive ability of a PLS model typically increases with the number of X-variables, because many variables tend to carry more information than few. Analysing the entire spectrum also tends to result in more robust models with the added advantage that no data are thrown away. PLS has become the standard method for multivariate calibration in optical spectroscopy because it readily handles the large number of correlated variables inherent with such methods [9].
PLS–DA [10] is a variant of standard PLS regression in which the block of Y-variables consists of a set of binary indicator variables (one for each class) denoting class membership. A sample has a 1 if it is a member of a given class and 0 otherwise. The objective of PLS–DA, therefore, is to find directions in the X-space that separate the classes based on a training set of samples with known class membership. In a clinical proteomics setting, Gottfries et al. used PLS–DA to distinguish between different types of dementia [11], while Karp et al. used the approach to measure changes in protein expression [12]. PLS–DA is already an established classification method in the related fields of metabonomics [13,14] and genomics [15], while a more theoretical discussion of the technique is provided in Ref. [16].
Orthogonal PLS (OPLS) and OPLS–DA are, respectively, extensions of PLS and PLS–DA based on a minor modification of the original algorithm [17,18]. The objective of OPLS and OPLS–DA is to divide the systematic variation in the X-block into two model parts: one part which models the co-variation between X and Y, and a second part which captures systematic variation in X that is unrelated (orthogonal) to Y. The practical upshot is cleaner models that are much easier to interpret. Model components that are related to Y are called predictive while those that are unrelated to Y are called orthogonal. Note that with a single Y-variable, as in the current study, there is only one predictive component but several orthogonal components. Note also that an OPLS/OPLS–DA model has the same total number of components (predictive + orthogonal) as the corresponding PLS/PLS–DA model.
In addition to PLS–DA and OPLS–DA, principal components analysis (PCA) was used to provide an overview of the data and as a means of selecting the training and test sets, which is discussed in more detail below. All analyses were performed using SIMCA-P 11 [19].
2.3. Scaling
PCA, PLS–DA and OPLS–DA are scale-dependent methods.
Within optical spectroscopy, spectral variables are routinely centred but not scaled prior to analysis. This ensures that wavelength regions with the largest signal amplitude variation exhibit most influence on the analysis. The usual alternative is unit variance scaling, also known as autoscaling, in which each variable is centred and scaled by dividing by its standard deviation. The disadvantage of this approach in optical spectroscopy is that the importance of noisy wavelength regions can become inflated, which may mask the effects of interest. With NMR and MS methods, a third option, known as Pareto scaling, has become prevalent. This represents a compromise between the extremes of no scaling and unit variance scaling and involves dividing each spectral variable by the square root of its standard deviation after first centring. Pareto scaling was applied to the SELDI-TOF m/z values prior to the analyses reported below.
3. Results
3.1. PCA overview
To get an overview of the data, the full dataset of 100 cancer patients and 91 controls was analysed using PCA. A plot of the first two principal component scores, accounting for 61% of the original variation, is shown in Fig. 1 (the ellipse represents a 95% tolerance region for the two scores based on Hotelling's T²). There is some evidence of separation between the two classes along the second principal component and this is indeed statistically significant (p < 0.05) according to a t-test. However, the large degree of overlap means that the model has no practical value and suggests that much of the spectral variation is unrelated to class differences. There are no major outliers.
Fig. 1. PCA overview of all 191 serum samples (● = controls, ○ = ovarian cancer).
Fig. 3. Actual versus fitted group values (● = controls, ○ = ovarian cancer). On each axis, 1 = control, 0 = ovarian cancer.
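The Pareto scaling described in Section 2.3, which was applied before the PCA reported here, can be illustrated with a minimal Python sketch. The function names and the toy intensities are hypothetical, and the use of the sample standard deviation (n − 1 denominator) is an assumption, since the text does not state which estimator the software uses.

```python
import math


def column_sd(rows, j):
    """Sample standard deviation (n - 1 denominator) of column j."""
    n = len(rows)
    mean = sum(row[j] for row in rows) / n
    return math.sqrt(sum((row[j] - mean) ** 2 for row in rows) / (n - 1))


def pareto_scale(rows):
    """Centre each column, then divide it by the square root of its standard deviation."""
    n, n_vars = len(rows), len(rows[0])
    scaled = [[0.0] * n_vars for _ in range(n)]
    for j in range(n_vars):
        mean = sum(row[j] for row in rows) / n
        sd = column_sd(rows, j)
        divisor = math.sqrt(sd) if sd > 0 else 1.0  # constant columns simply become zero
        for i in range(n):
            scaled[i][j] = (rows[i][j] - mean) / divisor
    return scaled


# Hypothetical data: 4 spectra x 2 m/z channels, one with large and one with small variation.
spectra = [[100.0, 1.0], [140.0, 1.2], [60.0, 0.8], [100.0, 1.0]]
scaled = pareto_scale(spectra)
```

After Pareto scaling, the standard deviation of each variable equals the square root of its original standard deviation. In this toy example the spread of the two channels differs by a factor of 200 before scaling and by roughly 14 afterwards, whereas unit variance scaling would make them equal and centring alone would leave the factor of 200 untouched.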
3.2. Training set selection
Members of the training set were selected from the control and ovarian cancer groups using multivariate characterisation. Nordahl and Carlson [20] provide an example of this approach in synthetic organic chemistry, although the method is generic and can essentially be applied to any type of multivariate data. There are two steps to the procedure. First, PCA is used to calculate a number of scores (principal properties) and then the principles of design of experiments (DOE) are applied to these scores. The use of DOE ensures the selection of a diverse and representative training set that embraces the test set.
Thus, separate PCA models were computed for the control and cancer groups. In both cases, it was found that three principal components gave a good summary of the data (77% and 75% of variation accounted for respectively). Score plots of each group model (not shown) again did not reveal any strong outliers. A 4³ full factorial experimental design defining 64 combinations of the first three principal components for each class was used to select the training set. In other words, each score vector was split into four levels which, with three principal components, creates 64 possible score combinations. Only samples (individuals) corresponding to design points were selected with a limit of one sample per design point. This resulted in a training set of 43 controls and 54 ovarian cancer patients. The remaining 48 controls and 46 ovarian cancer patients form the test set.
Fig. 2. First two PLS–DA components of the training set (● = controls, ○ = ovarian cancer).
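The training set selection of Section 3.2 (four levels on each of three score vectors, at most one sample per design point) might be sketched as follows. This is an illustration under stated assumptions rather than the procedure implemented in SIMCA-P: the text does not say how the four levels were defined or which sample was kept when several fell on the same design point, so equal-width levels and first-come selection are assumed here, and the scores are random placeholders, not the real data.

```python
import random


def select_training_set(scores, levels=4):
    """Pick at most one sample per cell of a levels**k grid laid over k score vectors."""
    n_components = len(scores[0])
    lows = [min(s[k] for s in scores) for k in range(n_components)]
    highs = [max(s[k] for s in scores) for k in range(n_components)]

    def level(value, k):
        # Equal-width levels between the observed minimum and maximum (an assumption).
        span = highs[k] - lows[k]
        if span == 0:
            return 0
        return min(int((value - lows[k]) / span * levels), levels - 1)

    chosen = {}
    for index, sample in enumerate(scores):
        cell = tuple(level(sample[k], k) for k in range(n_components))
        chosen.setdefault(cell, index)  # keep the first sample that lands in each cell
    picked = set(chosen.values())
    training = sorted(picked)
    test = [i for i in range(len(scores)) if i not in picked]
    return training, test


# Placeholder scores for one class: 91 samples x 3 principal components.
rng = random.Random(0)
scores = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(91)]
training, test = select_training_set(scores)
```

With three components and four levels the grid has 64 cells, so no class can contribute more than 64 training samples; cells that contain no sample are simply skipped. The function would be run once per class, mirroring the separate PCA models computed for the control and cancer groups.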
3.3. PLS–DA
PLS–DA was applied to the 43 controls and 54 ovarian cancer samples comprising the training set. The number of significant components was determined using cross-validation with 1/7th of the data being excluded during each round (software default setting). This yielded six PLS components with an R² of 0.95 and a cross-validated R² (Q²) of 0.91. A plot of the first two components is shown in Fig. 2. There is good separation between the two classes. In fact, using all six components, all 97 samples are correctly classified. This is confirmed by Fig. 3 where the actual class value (1 = control, 0 = ovarian cancer) is plotted against the fitted class value. All samples are correctly located either to the left (ovarian cancer) or to the right (control) of 0.5 on the x-axis.
The m/z values responsible for the separation of the control and ovarian cancer groups are summarised by the PLS regression coefficients, one for each m/z value. These coefficients are plotted in Fig. 4 as a function of the m/z value. Many of the largest coefficients are associated with m/z values less than 500, which are considered by some workers to represent noise [4]. Indeed, the ten m/z values with the largest absolute regression coefficient peaks are, in descending order of importance, 245.245, 224.092, 435.075, 221.862, 25.3074, 555.303, 675.547, 680.894, 193.890 and 516.847, which all lie below 700.
Fig. 4. PLS–DA regression coefficients after six components.
3.4. Prediction of test set samples
The true test of the model lies in its ability to classify samples when applied to a blind test set. Accordingly, the PLS–DA model described above was applied to the 94 samples (48 controls and 46 ovarian cancer patients) from the test set. All 94 samples were correctly classified yielding 100% sensitivity and specificity. The projection of the test set samples into the PLS–DA space defined by the training set is shown in Fig. 5. This plot is for illustration only as there are six PLS components in the full model but only two shown in the plot.
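The workflow of Sections 3.3 and 3.4 (fit a PLS model to a 0/1 class variable on the training set, classify at a fitted value of 0.5, then apply the model to a blind test set) can be sketched in plain Python. The analyses in this paper were run in SIMCA-P; the code below is only a minimal single-response PLS fitted by successive deflation, with hypothetical function names and a hypothetical four-variable toy dataset. It omits scaling and the cross-validation used to choose the number of components.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def fit_pls1(X, y, n_components):
    """Fit single-response PLS by successive deflation (NIPALS-style)."""
    n, m = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(m)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(m)] for row in X]  # centred X residuals
    f = [v - y_mean for v in y]                                # centred y residuals
    components = []
    for _ in range(n_components):
        w = [dot([E[i][j] for i in range(n)], f) for j in range(m)]  # weights, proportional to E'f
        norm = dot(w, w) ** 0.5
        if norm < 1e-12:
            break  # no covariance with y left to model
        w = [v / norm for v in w]
        t = [dot(row, w) for row in E]                               # sample scores
        tt = dot(t, t)
        p = [dot([E[i][j] for i in range(n)], t) / tt for j in range(m)]  # X loadings
        q = dot(f, t) / tt                                           # y loading
        E = [[E[i][j] - t[i] * p[j] for j in range(m)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        components.append((w, p, q))
    return x_mean, y_mean, components


def predict_pls1(model, x):
    """Project one new sample through each component in turn and sum the y contributions."""
    x_mean, y_mean, components = model
    e = [v - mu for v, mu in zip(x, x_mean)]
    y_hat = y_mean
    for w, p, q in components:
        t = dot(e, w)
        y_hat += q * t
        e = [ev - t * pv for ev, pv in zip(e, p)]
    return y_hat


def sensitivity_specificity(actual, fitted, threshold=0.5):
    """Class coding follows the paper: 1 = control, 0 = ovarian cancer."""
    predicted = [1 if v > threshold else 0 for v in fitted]
    cancer = [p for a, p in zip(actual, predicted) if a == 0]
    control = [p for a, p in zip(actual, predicted) if a == 1]
    return cancer.count(0) / len(cancer), control.count(1) / len(control)


# Hypothetical toy spectra (4 channels); not taken from the 8-7-02 dataset.
train_X = [[5, 1, 2, 0], [6, 1, 3, 1], [5, 2, 2, 1],
           [1, 5, 2, 1], [2, 6, 3, 0], [1, 5, 3, 1]]
train_y = [1, 1, 1, 0, 0, 0]
test_X = [[6, 2, 3, 0], [5, 1, 1, 1], [2, 5, 2, 0], [1, 6, 2, 1]]
test_y = [1, 1, 0, 0]

model = fit_pls1(train_X, train_y, n_components=2)
train_sens, train_spec = sensitivity_specificity(
    train_y, [predict_pls1(model, x) for x in train_X])
test_sens, test_spec = sensitivity_specificity(
    test_y, [predict_pls1(model, x) for x in test_X])
```

Here sensitivity is the fraction of cancer samples whose fitted value falls below 0.5 and specificity is the fraction of controls whose fitted value falls above it, matching the reading of Fig. 3. Prediction deflates each new sample component by component, which avoids forming the regression coefficient vector explicitly; a production implementation would normally compute those coefficients, since they are what Fig. 4 displays.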
Fig. 6. First two OPLS–DA components of the training set (● = controls, ○ = ovarian cancer).
3.5. OPLS–DA for enhanced interpretation and class separation
In order to visualise the class separation further, the dataset was re-analysed using OPLS–DA. The OPLS–DA model has the same number of components (six) as the previous PLS–DA model, but concentrates all the discriminating information into the first component (i.e. with OPLS–DA there is always one predictive component less than the number of classes). In this case, just 10% of the original mass spectral variation is responsible for the class separation. As shown by Figs. 6 (training set) and 7 (test set), which are analogous to Figs. 2 and 5 above, class separation is greatly clarified by OPLS–DA. The classes are completely separated along the first predictive component; the second (orthogonal) component is shown for visualisation purposes only as it offers no extra discriminatory power. There is no risk of overfit here because only one predictive component is possible with two classes and we have deliberately split the samples into training and test sets for validation purposes. 100% sensitivity and specificity were obtained for both sets.
Since OPLS–DA concentrates all discriminating information into the first component, it is sufficient to plot the regression coefficients for this component only (see Fig. 8, which is analogous to Fig. 4 above). It can be seen that there are three main mass regions responsible for class separation, namely 220–950, 3500–5275 and 7610–8750.
Fig. 5. First two PLS–DA components with test set samples projected in (● = controls, ○ = ovarian cancer).
Fig. 7. First two OPLS–DA components with test set samples projected in (● = controls, ○ = ovarian cancer).
Fig. 8. OPLS–DA regression coefficients for predictive component.
4. Discussion
The principal aim of this paper is to present chemometrics as a serious alternative to more complex analytical tools for the analysis of “-omics” data for medical diagnosis and, ultimately, clinical implementation. Using the chemometric techniques of PLS–DA and OPLS–DA, we were able to predict straightforwardly and correctly the class membership of 48 control and 46 ovarian cancer samples in a blind test set with 100% sensitivity and specificity. The separation provided by OPLS–DA is particularly impressive and warrants further investigation in other proteomic studies.
The success of these techniques on SELDI-TOF mass spectra comes as no surprise given that they are already routinely applied to NMR and MS spectra in the parallel field of metabonomics [13,14]. Indeed, Brindle et al. [21] used PLS–DA to diagnose the presence and severity of coronary heart disease while Odunsi et al. [22] used related chemometric techniques to detect epithelial ovarian cancer, both studies using NMR-based metabonomics data. As far back as 1981, similar techniques were used to classify cancer cells based on gas chromatography [23].
In contrast to the multi-step approaches reported previously [1,2,4], PLS–DA and OPLS–DA are single step techniques that require no prior variable selection: the entire spectrum is utilised. Variable selection should always be undertaken with care as there is a serious risk of discarding valuable diagnostic information.
The danger of using univariate t-tests (and related nonparametric techniques) as a means of variable selection is that such tests do not take account of how variables combine together to form diagnostic patterns. As pointed out by Alexe et al. [5],
the combined discriminating power of a group of biomarkers cannot be inferred from their individual p-values. The results of the chemometric analyses reported here are transparent and easily interpretable using a few intuitive plots. The degree of class separation is readily apparent from score plots while the most important biomarkers are clearly identified by inspecting the regression coefficients. In this case, three main mass regions were identified as being responsible for separating the diseased and control classes. The most influential region lay in the m/z range 220–950, which some workers consider represents matrix noise and would exclude [6]. However, as pointed out by Petricoin et al., low mass regions that can correctly classify blind samples cannot be noise by definition [7]. This low mass region could indicate the presence of small molecule biomarkers, which might be better resolved by a metabonomic analysis. The 8-7-02 dataset was chosen for this study largely for practical reasons because it was low resolution and therefore involved many fewer masses than the high resolution sets. A secondary motivating factor was whether low resolution data would show the same discriminatory ability as high resolution. There has been much recent dialogue in the literature regarding the validity of proteomic profiling of blood serum for diagnosing cancer [24–26]. In particular, questions have been raised about the 8-7-02 dataset because the case and control samples were run in separate batches on separate days to investigate sources of variance according to an experimental design. However, this does not invalidate the methods presented here which have been used successfully for many years in the related field of metabonomics, where data analysis has moved on from the mindset of having more samples than variables. 
It also seems highly improbable that the same extraneous sources of variation could be introduced to the various batches of case and control samples in a consistent enough manner to allow 100% specificity and sensitivity to be achieved on a blind test set.
References
[1] W. Zhu, X. Wang, Y. Ma, M. Rao, et al., Detection of cancer-specific markers amid massive mass spectral data, Proceedings of the National Academy of Sciences of the United States of America 100 (25) (Dec 9 2003) 14666–14671, doi:10.1073/pnas.2532248100.
[2] E.F. Petricoin III, A.M. Ardekani, B.A. Hitt, P.J. Levine, et al., Use of proteomic patterns in serum to identify ovarian cancer, Lancet 359 (2002) 572–577.
[3] http://home.ccr.cancer.gov/ncifdaproteomics/ppatterns.asp.
[4] J.M. Sorace, M. Zhan, A data review and re-assessment of ovarian cancer serum proteomic profiling, BMC Bioinformatics 4 (2003) 24, doi:10.1186/1471-2105-4-24.
[5] G. Alexe, S. Alexe, L.A. Liotta, E. Petricoin, et al., Ovarian cancer detection by logical analysis of proteomic data, Proteomics 4 (2004) 766–783, doi:10.1002/pmic.200300574.
[6] K.A. Baggerly, J.S. Morris, K.R. Coombes, Reproducibility of SELDI-TOF protein patterns in serum: comparing datasets from different experiments, Bioinformatics 20 (5) (Jan 29 2004) 777–785, doi:10.1093/bioinformatics/btg484.
[7] E.F. Petricoin III, D.A. Fishman, T.P. Conrads, T.D. Veenstra, L.A. Liotta, Proteomic pattern diagnostics: producers and consumers in the era of correlative science, BMC Bioinformatics 4 (2003) (comments on paper by Sorace and Zhan).
[8] S. Wold, C. Albano, W.J. Dunn III, U. Edlund, et al., Multivariate data analysis in chemistry, in: B.R. Kowalski (Ed.), Chemometrics: mathematics and statistics in chemistry, D. Reidel Publishing Company, Dordrecht, Holland, 1984.
[9] S. Wold, M. Josefsson, Multivariate calibration of analytical data, Encyclopedia of Analytical Chemistry, Wiley, 2000, pp. 1–27.
[10] L. Ståhle, S. Wold, Partial least squares analysis with cross-validation for the two-class problem: a Monte Carlo study, Journal of Chemometrics 1 (1987) 185–196.
[11] J. Gottfries, K. Blennow, A. Wallin, C.G. Gottfries, Diagnosis of dementias using partial least squares discriminant analysis, Dementia 6 (1995) 83–88.
[12] N.A. Karp, J.L. Griffin, K.S. Lilley, Application of partial least squares discriminant analysis to two-dimensional difference gel studies in expression proteomics, Proteomics 5 (1) (Jan 2005) 81–90.
[13] E. Holmes, H. Antti, Chemometric contributions to the evolution of metabonomics: mathematical solutions to characterising and interpreting complex biological NMR spectra, Analyst 127 (2002) 1549–1557, doi:10.1039/b208254n.
[14] H.C. Keun, T.M.D. Ebbels, H. Antti, M.E. Bollard, et al., Improved analysis of multivariate data by variable stability scaling: application to NMR-based metabolic profiling, Analytica Chimica Acta 490 (2003) 265–276.
[15] U. Atif, M.E. Earll, L. Eriksson, E. Johansson, et al., Analysis of gene expression datasets using partial least squares discriminant analysis and principal component analysis, in: Martyn Ford, David Livingstone, John Dearden, Han Van de Waterbeemd (Eds.), Euro QSAR 2002 Designing Drugs and Crop Protectants: processes, problems and solutions, Blackwell Publishing, ISBN: 1-4051-2516-0, 2003, pp. 369–373.
[16] M. Barker, W. Rayens, Partial least squares for discrimination, Journal of Chemometrics 17 (2003) 166–173.
[17] J. Trygg, S. Wold, Orthogonal projections to latent structures (OPLS), Journal of Chemometrics 16 (2002) 119–128.
[18] J. Trygg, Prediction and spectral profile estimation in multivariate calibration, Journal of Chemometrics 18 (2004) 166–172.
[19] SIMCA-P 11 User Guide, Umetrics AB, Umeå, Sweden.
[20] Å. Nordahl, R. Carlsson, Exploring organic synthetic procedures, Topics in Current Chemistry 166 (1993) 1–64.
[21] J.T. Brindle, H. Antti, E. Holmes, G. Tranter, et al., Rapid and noninvasive diagnosis of the presence and severity of coronary heart disease using 1H-NMR metabonomics, Nature Medicine 8 (2002) 1439–1444.
[22] K. Odunsi, R.M. Wollman, C.B. Ambrosone, A. Hutson, et al., Detection of epithelial ovarian cancer using 1H-NMR-based metabonomics, International Journal of Cancer 113 (2005) 782–788, doi:10.1002/ijc.20651.
[23] E. Jellum, I. Björnsson, R. Nesbakken, E. Johansson, S. Wold, Classification of human cancer cells by means of capillary gas chromatography and pattern recognition analysis, Journal of Chromatography 217 (1981) 231–237.
[24] K.A. Baggerly, J.S. Morris, S.R. Edmondson, K.R. Coombes, Signal in noise: evaluating reported reproducibility of serum proteomic tests for ovarian cancer, Journal of the National Cancer Institute 97 (2005) 307–309.
[25] L.A. Liotta, M. Lowenthal, A. Mehta, T.P. Conrads, et al., Importance of communication between producers and consumers of publicly available experimental data, Journal of the National Cancer Institute 97 (2005) 310–314.
[26] D.F. Ransohoff, Lessons from controversy: ovarian cancer screening and serum proteomics, Journal of the National Cancer Institute 97 (2005) 315–319.