Artificial Intelligence in Medicine 53 (2011) 57–71
Contents lists available at ScienceDirect
Artificial Intelligence in Medicine journal homepage: www.elsevier.com/locate/aiim
Integration of gene signatures using biological knowledge Michalis E. Blazadonakis a,∗ , Michalis E. Zervakis a , Dimitrios Kafetzopoulos b a b
Department of Electronic and Computer Engineering, University Campus, Technical University of Crete, GR 731 00, Chania Crete, Greece Institute of Molecular Biology and Biotechnology, Foundation for Research and Technology - Hellas, N. Plastira 100, GR 700 13, Heraklion Crete, Greece
a r t i c l e
i n f o
Article history: Received 1 November 2009 Received in revised form 14 May 2011 Accepted 16 June 2011 Keywords: Marker gene selection Gene signature integration Cross-platform integration Biological knowledge Gene ontology Pathways Breast cancer
a b s t r a c t Objective: Gene expression patterns that distinguish clinically significant disease subclasses may not only play a prominent role in diagnosis, but also lead to the therapeutic strategies tailoring the treatment to the particular biology of each disease. Nevertheless, gene expression signatures derived through statistical feature-extraction procedures on population datasets have received rightful criticism, since they share few genes in common, even when derived from the same dataset. We focus on knowledge complementarities conveyed by two or more gene-expression signatures by means of embedded biological processes and pathways, which alternatively form a meta-knowledge platform of analysis towards a more global, robust and powerful solution. Methods: The main contribution of this work is the introduction and study of an approach for integrating different gene signatures based on the underlying biological knowledge, in an attempt to derive a unified global solution. It is further recognized that one group’s signature does not perform well on another group’s data, due to incompatibilities of microarray technologies and the experimental design. We assess this cross-platform aspect, showing that a unified solution derived on the basis of both statistical and biological validation may also help in overcoming such inconsistencies. Results: Based on the proposed approach we derived a unified 69-gene signature, which outperforms significantly the performance of the initial signatures succeeding a 0.73 accuracy metric on 234 new patients with 81% sensitivity and 64% specificity. The same signature manages to reveal the two prognostic groups on an additional dataset of 286 new patients obtained through a different experimental protocol and microarray platform. Furthermore, it manages to derive two clusters in a dataset from a different platform, showing remarkable difference on both gene-expression and survival-prediction levels. © 2011 Elsevier B.V. All rights reserved.
1. Introduction Microarray technology has become a valuable tool for classifying breast tumors according to their prognosis, subtype, or response to treatment. An open problem in studies of breast cancer, as well as other types of cancer, is to determine the most appropriate treatment protocol for a specific patient. Even though chemotherapy or hormonal therapy reduces the risk of distant metastasis by approximately 1/3, 70–80% of the patients receiving adjuvant treatment would have survived without it [1,2]. Along with the treatment plan, there are also variations in the evaluation of diagnostic means. Existing inconsistencies in histological grading forced the American Joint Committee on Cancer to exclude histological tumor grading from its staging criteria [3]. Hence, attempts to increase the prognostic value through the use of stable and robust markers
∗ Corresponding author. Tel.: +30 2821037206; fax: +30 2821037542. E-mail addresses:
[email protected] (M.E. Blazadonakis),
[email protected] (M.E. Zervakis),
[email protected] (D. Kafetzopoulos). 0933-3657/$ – see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.artmed.2011.06.003
become more than a necessity and this is a direction towards which microarray technology is expected to contribute. The mass information conveyed by microarray data must be robustly processed in order to enable the extraction of meaningful and reproducible results necessitating close collaboration among many scientific fields, such as medicine, biology, statistics and computer science. Besides the need of statistical validation, “an understanding of both the biology and the computational methods is essential for tackling the associated data mining task without being distracted by the abundant fool’s gold” [4]. We express these issues by adopting the position that simply generating statistically significant results is not enough in genomic analysis; any result should also be evaluated in terms of its biological significance, which in any case is clinically more important than its statistical relevance on a limited dataset. Nevertheless, since the recorded knowledge may not be complete, the biological significance is used in our approach to enrich the statistical result rather than to filter it out of potentially noisy and biologically irrelevant genes. To further support this position we refer to the study of Van’t Veer et al. [5], which has received criticism [6,7] from a statistical point of view when considering stringent statistical criteria. Neverthe-
58
M.E. Blazadonakis et al. / Artificial Intelligence in Medicine 53 (2011) 57–71
Fig. 1. The proposed biological-knowledge integration process that derives the S1S2-BK gene signature; red-green part represents independent focus on a single signature; blue part represents biological knowledge integration. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of the article.)
less, taking into account additional biological and medical criteria, the Food and Drug Administration (FDA) [8] approved the result of Van’t Veer et al. for further clinical tests. Specifically, the 70-gene derived by the study has been approved by FDA under the product name ‘MammaPrint’ [9] as the first clear product that profiles gene expression correlated to the likelihood of tumor recurrence. Furthermore, the European Organization for Research and Treatment of Cancer (EORTC) [10] launched the MINDACT [11] (microarray in node negative disease may avoid chemotherapy) project opening the road for the evaluation of gene expression signatures in randomized clinical trials. In this study we demonstrate that, by appropriately adopting biological knowledge, the statistical results can be significantly improved. Published gene signatures have few or perhaps no genes in common. Recent studies, however, prove that even if there is no significant gene overlap between two different gene signatures, there might be significant overlap in terms of the biology conveyed by them [12]. Building on these steps, we search for and exploit significant biological information by means of gene ontology biological processes (GOBPs) and pathways hidden behind a 57-gene signature (referred to as S1), the derivation of which is discussed extensively in [13]. In parallel we consider one widely discussed signature, i.e. the 70-gene signature (referred to as S2) published by Van’t Veer et al. in [5]. First, we show that the two signatures indeed demonstrate significant biological overlap when GOBPs and pathways are considered, in complete agreement to [12]. Secondly, we demonstrate that statistical results are substantially improved when integrating the biological knowledge conveyed by the individual signatures S1 and S2, which have been derived from the same dataset. The result of such knowledge integration is evaluated on the 234 new cases published in [15], as well as on the 286 cases published in [14] as to assess its efficiency on a different microarray platform and experimental design. The proposed approach actually addresses two major problems in cancer gene-selection: (a) the issue of minor or no overlap between different gene signatures at the gene level [16],
and (b) the cross platform evaluation of results by testing a predictor that is derived using a specific microarray platform and experimental design on another group’s data derived with a different experimental platform and protocol [18]. As stated in the abstract, the main contribution and novelty of this work is that it proposes an approach for integrating different gene signatures, in an attempt to derive a unified and more global solution based on the common biological aspects of the individual signatures. Even though several integration schemes have been proposed in the area of microarray analysis (selectively refer to [12,19–21]), little has been done towards assessing the concept of integrating results from different group. Our motive behind this approach is the meta-analysis view that different gene signatures, which may be proposed by different research groups, form the parts of a more global solution with each individual approach addressing only a small part of the whole. In turn, we treat such individual (and seemingly different) solutions as pools of valuable knowledge for integration and unification of solutions. We emphasize that this process is a dynamic one, which can systematically evolve to include more individual gene-signatures related to the same pathology. The concept of gene-signature overlap can be considered at four levels of abstraction as follows: 1. Gene-level overlap assesses the number of common genes between two signatures. This issue has been addressed before showing minimal or no overlap among the various signatures published in breast cancer [16,18]. 2. Pathway-level overlap assesses the common pathways that exist between two gene signatures [16,18]. We should clarify here that the term “pathway” encompasses all processes associated with both GOBPs and pathways. 3. Overlap of significant pathways is measured by HGPD discussed in Section 2. After specifying all pathways induced by the genes of a gene signature, we can use the value of HGPD in order to produce a rank-order list of significance. We then define significant
M.E. Blazadonakis et al. / Artificial Intelligence in Medicine 53 (2011) 57–71
59
Fig. 2. Derived pathway signatures with the corresponding survival prediction curves based on our 57 gene signature (panel A) and on the 70-gene Van’t Veer’s signature (panel B). Panel C presents the derived integrated signature, along with the corresponding Kaplan–Meier curve.
60
M.E. Blazadonakis et al. / Artificial Intelligence in Medicine 53 (2011) 57–71
Table 1 The 38 significant GOBPs pointed to by the 57-gene-signature (S1). GOBP – ID
GOBP – Name
GOBP – Num of genes
GOBP – Genes in data
Genes in signature
HGPD
GO:0051084 GO:0007165 GO:0051649 GO:0006886 GO:0006573 GO:0019859 GO:0030949 GO:0051000 GO:0030036 GO:0009966 GO:0006101 GO:0045908 GO:0045909 GO:0007632 GO:0031532 GO:0045429 GO:0007049 GO:0009615 GO:0006091 GO:0042177 GO:0050715 GO:0006006 GO:0050930 GO:0007021 GO:0051056 GO:0007051 GO:0007159 GO:0007212 GO:0006892 GO:0042994 GO:0001570 GO:0016310 GO:0000070 GO:0050708 GO:0006607 GO:0045664 GO:0048167 GO:0046631
De novo posttranslational protein folding Signal transduction Establishment of cellular localization Intracellular protein transport Valine metabolic process Thymine metabolic process Positive regulation of vascular endothelial growth Positive regulation of nitric-oxide synthase activity Actin cytoskeleton organization and biogenesis Regulation of signal transduction Citrate metabolic process Negative regulation of vasodilation Positive regulation of vasodilation Visual behavior Actin cytoskeleton reorganization Positive regulation of nitric oxide process Cell cycle Response to virus Generation of precursor metabolites and energy Negative regulation of protein catabolic process Positive regulation of cytokine secretion Glucose metabolic process Induction of positive chemotaxis Tubulin folding Regulation of small GTPase mediated signal transduction Spindle organization and biogenesis Leukocyte adhesion Dopamine receptor signaling pathway Post-Golgi vesicle-mediated transport Cytoplasmic sequestering of transcription factor Vasculogenesis Phosphorylation Mitotic sister chromatid segregation Regulation of protein secretion NLS-bearing substrate import into nucleus Regulation of neuron differentiation Regulation of synaptic plasticity Alpha-beta T cell activation
10 3788 1027 556 2 1 1 1 216 543 2 2 3 6 8 8 915 98 602 7 14 114 9 8 215 22 11 12 26 12 23 791 30 25 14 25 20 18
6 2115 542 314 1 1 1 1 135 293 2 2 3 4 4 4 588 65 387 5 5 84 7 8 92 9 9 9 10 10 11 516 12 12 14 15 15 18
2 14 7 5 1 1 1 1 3 4 1 1 1 1 1 1 5 2 4 1 1 2 1 1 2 1 1 1 1 1 1 4 1 1 1 1 1 1
0.000081 0.000226 0.000236 0.000766 0.002400 0.002400 0.002400 0.002400 0.003700 0.004400 0.004700 0.004700 0.007000 0.009400 0.009400 0.009400 0.009800 0.009800 0.010900 0.011700 0.011700 0.015800 0.016300 0.018500 0.018600 0.020800 0.020800 0.020800 0.023100 0.023100 0.025300 0.026000 0.027600 0.027600 0.032000 0.034200 0.034200 0.040800
pathway overlap by means of the top ranked pathways indicated by two or more signatures under consideration. In this study, as in [12], we label a pathway as significant when its HGPD is less than 0.05. 4. We proceed one step further and define gene-overlap within significant pathways of two or more signatures. At this level we focus on the significant pathways of the signatures as defined above and assess the overlap of the ensemble of genes they define. In this form, we essentially consider overlap of all genes that are functionally related with the ones involved in the original gene signature. Thus, we consider overlap among groups of genes instead of genes themselves. We clarify here that we do not aim at detecting complex gene interactions, but rather exploiting them as they appear in the biology-oriented GOBPs or pathways, in order to enrich the set of initially selected genes with all those that fall under the same biological processes.
2. Methods In this section we briefly describe the proposed integration approach and provide the essential background on biological processes and pathways. After augmenting each signature with its associated biological extensions, we refine the enhanced signatures through the same gene selection approach. The method used in all conducted experiments is a representative filter approach employing a variation of Fisher’s ratio [22], which is briefly described. Furthermore, this section provides information on the set of criteria used to assess (a) the significance of the biological knowledge hidden behind a signature, (b) the significance of the derived gene signature(s) on the available dataset and (c) the classification approach that is used to estimate the prediction accuracy. The proposed integration scenario was implemented in its entity in Matlab. 2.1. Proposed integration scheme
In the following sections we focus on the 4th-level overlap, but we also consider overlap at the remaining three levels.
Table 2 Significant pathways pointed to by the 57-gene-signature (S1). Pathway
Genes in pathway
Genes in data
Genes in signature
HGPD
Tcell Hedg
359 26
318 19
3 1
0.0324 0.0429
Based on the principles of biological significance and overlap discussed above, the proposed approach for gene-signature integration proceeds in steps as illustrated in Fig. 1. Notice that we refer to two original signatures S1 and S2 derived from the same dataset, but the process can be readily extended to a larger number of signatures. The dashed lines in the upper right corner imply that the process may repeatedly evolve adopting signatures from different sources, which in turn contribute additional complementary knowledge.
M.E. Blazadonakis et al. / Artificial Intelligence in Medicine 53 (2011) 57–71
61
Table 3 The 24 significant GOBPs pointed to by the 70-gene-signature (S2). GOBP – ID
GOBP – Name
GOBP – Num of genes
GOBP – Genes in data
Genes in signature
HGPD
GO:0009653 GO:0006260 GO:0043627 GO:0007165 GO:0001501 GO:0031333 GO:0031274 GO:0000022 GO:0030198 GO:0008065 GO:0006271 GO:0009887 GO:0006562 GO:0009113 GO:0015012 GO:0007586 GO:0006024 GO:0030225 GO:0006890 GO:0006940 GO:0001836 GO:0001837 GO:0016525 GO:0006631
Anatomical structure morphogenesis DNA replication Response to estrogen stimulus Signal transduction Skeletal development Negative regulation of protein complex assembly Positive regulation of pseudopodium formation Mitotic spindle elongation Extracellular matrix organization and biogenesis Establishment of blood-nerve barrier DNA strand elongation during DNA replication Organ morphogenesis Proline catabolic process Purine base biosynthetic process Heparan sulfate proteoglycan biosynthetic process Digestion Glycosaminoglycan biosynthetic process Macrophage differentiation Retrograde vesicle-mediated transport, Golgi to ER Regulation of smooth muscle contraction Release of cytochrome c from mitochondria Epithelial to mesenchymal transition Negative regulation of angiogenesis Fatty acid metabolic process
1068 219 25 3788 235 1 5 2 50 2 3 390 4 5 10 97 21 11 16 17 11 12 18 185
740 168 20 2115 182 1 1 1 39 2 2 290 3 3 6 76 10 10 12 13 14 14 14 123
8 4 2 14 4 1 1 1 2 1 1 4 1 1 1 2 1 1 1 1 1 1 1 2
0.0010 0.0013 0.0015 0.0017 0.0017 0.0029 0.0029 0.0029 0.0055 0.0058 0.0058 0.0085 0.0086 0.0086 0.0171 0.0191 0.0282 0.0282 0.0337 0.0364 0.0390 0.0390 0.0390 0.0440
In the first part, the original signatures are enhanced with biological knowledge resulting in S1-BK and S2-BK, respectively (red-green paths of Fig. 1). Considering all genes involved in S1 (or S2), the associated biological processes (gene ontology and pathways) are identified and ranked based on the likelihood of being selected at random (HGPD). By gathering all genes involved in the most significant biological processes, we obtain an augmented set of genes with specific biological meaning. By further assessing the statistical significance of these genes on the population database, we proceed with a gene selection process (recursive filter method) and end up with the enhanced signature S1-BK (or S2-BK), which encodes both biological and statistical information. Note that S1-BK and S2-BK are expected to differ from S1 and S2, respectively, since their derivation is based on different algorithmic schemes; S1 is derived through a recursive feature-elimination method, whereas S2 is derived by a different filter scheme than the recursive filter considered here. At the second part (blue path in Fig. 1), the two signatures are integrated into S1S2-BK (for interpretation of the references to color in the text legend, the reader is referred to the web version of the article). This is achieved by merging the biologically augmented sets of genes from the two approaches and following the same statistical process of gene selection. The derived signature integrates the two (or more) original ones at the level of biological knowledge and further assesses the statistical significance of selected genes on the population dataset. 2.2. Gene ontology and pathways The Gene Ontology (GO) project [23] provides a controlled vocabulary to describe gene and gene-product attributes in any organisms. The GOBP constitutes basic background knowledge Table 4 Significant pathways pointed to by the 70-gene-signature (S2). Pathway Wnt TGF AR IL6 IL4
Genes in pathway 114 1556 603 105 307
Genes in data 112 1182 468 100 267
organizing genes into ensembles according to the biological processes they participate in [23]. Using GOBPs we can find a collection of genes that are involved in a specific biological process or find the biological processes that a specific gene is involved in. Notice that while there are uniquely identified biological processes, a specific gene may be involved in more than one of them. Hence, GOBP provides background biological knowledge on the genes that constitute specific biological ‘mechanisms’ but also the mechanisms associated with a specific gene. A pathway, on the other hand, is a series of biochemical reactions occurring within a cell. Such processes are usually rapid, lasting on the order of milliseconds in the case of ion flux, or minutes for the activation of protein, but some can take hours and even days (as is the case with gene expression) to be completed. A set of such chemical reactions serving a specific purpose constitutes a pathway. Hence, GOBPs constitute a conceptual network of biological processes and interactions, dynamically adapted and updated with evolving knowledge, while a pathway refers to specific and rather strict biological functions accomplished through a series of biochemical reactions. Biochemical pathways refer to rather strict rules and apply mostly to small molecules while GOBPs refer to more complex interactions. In this study we focus on the 20 pathways published by NetPath [24], 10 of which are related to immune system (immune signaling pathways) and 10 related to cancer (cancer signaling pathways). In subsequent sections we use background knowledge provided through GOBPs and NetPath to either validate or enhance derived results.
Table 5 Metrics on the individual signatures (S1 and S2), biologically refined signatures (S1BK and S2-BK) and the combined signature S1S2-BK. Area under the ROC curve (AUC) sensitivity (SEN) and specificity (SPE) measures. AUC
Genes in signature
HGPD
3 9 5 2 3
0.0039 0.0048 0.0091 0.0310 0.0349
SEN
SPE
S1 S1-BK (70-GS)
66% 69%
74% 80%
59% 59%
S2 S2-BK (70-GS)
69% 71%
76% 78%
62% 63%
S1S2-BK (69-GS)
73%
81%
64%
62
M.E. Blazadonakis et al. / Artificial Intelligence in Medicine 53 (2011) 57–71
Table 6 The 69 genes of the S1S2-BK gene signature. Gene ID
Accession
Symbol
Gene ID
Accession
Symbol
21256 23724 706 14957 10930 10325 18891 5358 13130 14796 8782 11123 15594 497 2131 6214 9274 19928 1686 22267 23978 659 9033 21356 12572 24333 14063 1201 23996 19840 21818 7081 6598 5019 11428
NM 000419 NM 000788 D25328 AJ224741 U17327 NM 007057 NM 000127 NM 020386 NM 005915 Contig34634 RC NM 006117 NM 007111 NM 006931 NM 002358 X05610 NM 012429 NM 004798 NM 000291 AF201951 NM 001282 NM 002208 NM 001667 NM 005444 NM 001168 AF055033 NM 000849 NM 006763 NM 003163 NM 002217 NM 000272 AB023216 NM 004504 NM 003766 NM 012310 NM 005721
ITGA2B DCK PFKP MATN3 NOS1 ZWINT EXT1 HRASLS MCM6 GCN1L1 PECI TFDP1 SLC2A3 MAD2L1 COL4A2 SEC14L2 KIF3B PGK1 MS4A7 AP2B1 ITGAE ARL2 RQCD1 BIRC5 IGFBP5 GSTM3 BTG2 STX1B ITIH3 NPHP1 KIAA0999 HRB BECN1 KIF4A ACTR3
7489 12359 9663 9646 8976 7126 1831 5509 24169 19933 19549 6494 18109 13800 9156 3524 8508 23744 12680 17881 10827 24016 6404 13490 19891 6541 20891 7459 13309 6463 23198 3224 7797 21944
NM 003875 NM 014584 NM 006265 NM 006260 NM 004701 NM 005243 NM 003239 NM 002914 NM 002268 NM 000296 NM 000207 NM 005192 NM 000017 Contig47544 RC AF257175 NM 004163 NM 003981 NM 000797 U82987 D13540 NM 004994 NM 002224 NM 005176 AK000365 NM 000286 NM 003748 BE739817 RC NM 003862 NM 006681 NM 012479 Contig25991 NM 020120 NM 013306 NM 001216
GMPS ERO1L RAD21 DNAJC3 CCNB2 EWSR1 TGFB3 RFC2 KPNA4 PKD1 INS CDKN3 ACADS ATP5E PECI RAB27B PRC1 DRD4 BBC3 PTPN11 MMP9 ITPR3 ATP5G2 ACO2 PEX12 ALDH4A1 IFNAR1 FGF18 NMU YWHAG ECT2 UGCGL1 SNX15 CA9
2.3. The hyper-geometric probability distribution The HGPD is a discrete probability distribution that describes the number of successes (x) in a sequence of N draws, corresponding to the N genes constituting a specific pathway or GOBP, from a finite population of M total genes without replacement:
p = f (x|M, K, N) =
K x
M−K N−x
M N
(1)
where x is the number of genes in the signature belonging in the pathway, M is the total number of genes in data, K the number of the pathway genes that exist in the data and N is the number of genes in the examined gene signature. In the application considered, the HGPD models the probability of drawing x among M total genes, such that they belong to a specific pathway of K genes, given that those x genes are also part of the N gene signature under consideration. In essence, the HGPD models the probability of the null hypothesis that the selection of x genes is a random choice; small p-values lead to rejecting the hypothesis in favor to the alternative one, that the choice is unlikely to be a random guess. Thus, the lower the value of p the most significant the examined pathway is,
Fig. 3. Kaplan–Meier survival analysis of initial signatures (S1 – left, S2 – right) on the 286 cases of Wang’s dataset.
M.E. Blazadonakis et al. / Artificial Intelligence in Medicine 53 (2011) 57–71
63
Fig. 4. Class comparison (left) and Kaplan–Meier analysis (right) of the integrated signature S1S2-BK on Wang’s dataset.
since it has a low probability of being selected at random. Note that for simplicity purposes we use the term pathway to designate both the pathway and the GOBP concepts. Elaborating on Eq. (1), we can highlight some interesting aspects. Suppose that a pathway contains only one gene (K = 1) and that gene (x = 1) has been selected in a signature consisting of only that one gene (N = 1). Then, for large M we get a p-value that is
approximately zero, from HGPD. This implies that the specific pathway is an important one, since the probability of selecting genes of that pathway at random is close to zero. Consider now the reverse case, where we let the number of genes constituting a pathway equal the total number of genes (K = M), restricting our selection to only one pathway, and a signature of N genes all selected from the same pathway (x = K). In this case we get a p value of one, imply-
Fig. 5. Gene interactions within S1S2-BK signature; the majority of genes interact through a co-expression relationship, implying that they have similar expression profiles throughout multiple experiments, while there exists both physical and protein interactions for some genes.
64
M.E. Blazadonakis et al. / Artificial Intelligence in Medicine 53 (2011) 57–71
Fig. 6. Impact of the number of surviving genes in classification performance and survival analysis on the 19 and 234 independent patient sets. The arrow points to the derived 69-gene signature, while the dots represent snapshots of Kaplan–Meier survival analysis along the gene elimination process, depicted at the lower panel of the graph.
ing that the specific pathway is not a significant one, since it gives a high probability of being a random guess. Focusing on a more realistic case, consider our gene signature of 57-genes, which has been selected from a set of 24,188 genes. Let, for instance, the signal transduction pathway pointed to by 14 different genes (x = 14) in our 57-gene signature (N = 57), which contains 3788 genes (during the period of our research), 2115 of which exist in our 24,188 gene set (K = 2115). This case yields a p-value of 2.26E−04, ranking the specific pathway among the most important ones. Hence, a pathway could be important even if it consists of a large number of genes, provided that it contributes a large proportion of genes in
the final gene signature; the fact that about 25% (14/57) genes of our gene signature are part of the signal transduction GOBP makes the specific pathway very important. Using such a metric we can assess the statistical significance of a pathway, as to whether its selection in the final gene signature is a random result or not. Thus, ranking the pathways according to their corresponding p values from Eq. (1), leads to a ranked list of pathways according to their significance. Based on the 95% significance level, we consider as significant those pathways with p <0.05, in the sense that there is at least 95% confidence that they do not result as random guesses. The same cut off point was used in the work of Yu et al. in [12].
Table 7 Clinical study based on sample signatures (S1-BK and S1S2-BK), positive, negative predictions and accuracies.
2.4. The filter selection method
Class assignment S1-BK AUC 0.69 Positive Negative Missed Total S1S2-BK AUC 0.73 Positive Negative Missed Total
Given
Predicted
Accuracy
54 180
43 106
79.63% 58.89% 36.32% 63.68%
85 234
149
54 180
44 115 75
234
159
81.48% 63.80% 32.05% 69.75%
According to the proposed integration approach, each original signature is enhanced by sets of “biologically similar” genes. Then, each enhanced super-set is refined through a gene-selection approach. In order to treat similarly the enhanced signatures, the gene-selection method at this level is a common one, based on a recursive filter method (RFM). To derive a gene signature through the RFM approach, genes are ranked in decreasing order according to their Fisher’s score as given: f (gi ) =
+ (gi ) − − (gi ) + (gi ) + − (gi )
(2)
Table 8 Gene sets of S1, S2, S1-BK and S2-BK signatures. S1
S2
S1BK
S2BK
Accession
Symbol
SeqID
Accession
Symbol
SeqID
Accession
Symbol
SeqID
Accession
Symbol
659 719 831 838 1505 1623 1626 2819 3224 3232 3742 3786 3851 4952 4966 5293 6463 7012 7126 7797 8071 8162 8910 8976 8982 9235 9646 9874 10325 10538 10643 12259 12416 12553 12572 13270 13343 13490 13800 13917 14762 15157 15813 15874 16474 17778 18425 19549 19840 19928 20891 23161
NM 001667 NM 001685 Contig43684 Contig13548 RC AF148505 Contig17273 RC Contig35229 RC NM 003376 NM 020120 NM 020123 Contig44713 NM 002779 Contig6238 RC NM 005007 AB018337 NM 003607 NM 012479 NM 005219 NM 005243 NM 013306 NM 013360 NM 013376 NM 013438 NM 004701 NM 004703 Contig33814 RC NM 006260 NM 005594 NM 007057 NM 004911 NM 020974 NM 006544 U90904 Contig64861 RC AF055033 Contig5456 RC AL355708 AK000365 Contig47544 RC Contig65439 AF160213 Contig11065 RC Contig50013 RC Contig31312 RC Contig14882 RC NM 015984 Contig63102 RC NM 000207 NM 000272 NM 000291 BE739817 RC Contig41716 RC
ARL2 ATP5J FLJ23312
1686 1831 1929 2100 2131 3461 5293 5358 5514 5835 6541 7459 7489 7509 8508 8731 8782 8979 9156 9730 10075 10228 10245 10643 10741 10827 10889 10921 11994 12427 12572 12680 13130 13309 14097 14173 14892 15303 15390 15594 16038 16223 16516 16523 16822 16951 17209 17233 17337 17778 18425 18891
AF201951 NM 003239 NM 001809 Contig32185 RC X05610 NM 020188 NM 003607 NM 020386 NM 002916 AF052162 NM 003748 NM 003862 NM 003875 NM 003882 NM 003981 NM 006101 NM 006117 NM 004702 AF257175 Contig48328 RC AA555029 RC NM 014321 NM 007036 NM 020974 Contig51464 RC NM 004994 AL080059 AL080079 NM 007203 Contig55377 RC AF055033 U82987 NM 005915 NM 006681 NM 014791 Contig38288 RC NM 014889 Contig55725 RC NM 016359 NM 006931 Contig46223 RC NM 016448 AK000745 Contig20217 RC AB037863 NM 016577 Contig28552 RC Contig40831 RC AL137718 NM 015984 Contig63102 RC NM 000127
MS4A7 TGFB3 CENPA
497 659 706 1055 1201 1505 1686 1831 2543 3224 3524 3670 3743 4346 4584 5019 5509 5514 6214 6404 6463 6494 6506 6598 7012 7081 7126 7459 7797 8508 8976 9274 9445 9536 9646 9663 10325 10930 11123 11428 12359 12572 12680 13130 13309 13490 13800 14705 16951 17881 18109 18891
NM 002358 NM 001667 D25328 Contig35634 RC NM 003163 AF148505 AF201951 NM 003239 NM 003319 NM 020120 NM 004163 M26880 NM 002760 NM 002808 NM 002875 NM 012310 NM 002914 NM 002916 NM 012429 NM 005176 NM 012479 NM 005192 NM 005196 NM 003766 NM 005219 NM 004504 NM 005243 NM 003862 NM 013306 NM 003981 NM 004701 NM 004798 NM 006201 NM 014251 NM 006260 NM 006265 NM 007057 U17327 NM 007111 NM 005721 NM 014584 AF055033 U82987 NM 005915 NM 006681 AK000365 Contig47544 RC Contig45917 RC NM 016577 D13540 NM 000017 NM 000127
MAD2L1 ARL2 PFKP ADCYAP1R1 STX1B ALDH6A1 MS4A7 TGFB3 TTN UGCGL1 RAB27B UBC PRKY PSMD2 RAD51 KIF4A RFC2 RFC4 SEC14L2 ATP5G2 YWHAG CDKN3 CENPF BECN1 DIAPH1 HRB EWSR1 FGF18 SNX15 PRC1 CCNB2 KIF3B PCTK1 SLC25A13 DNAJC3 RAD21 ZWINT NOS1 TFDP1 ACTR3 ERO1L IGFBP5 BBC3 MCM6 NMU ACO2 ATP5E FOXL2 RAB6B PTPN11 ACADS EXT1
497 659 706 1055 1505 1686 1831 2042 2131 2713 2761 2819 3309 3352 3524 3670 4209 5019 5358 5514 6214 6463 6541 7081 7095 7459 7489 7797 8508 8782 8976 9033 9156 9445 9646 10325 10827 11123 11428 12359 12433 12572 12680 13130 13309 14063 14705 14796 14957 15594 16951 17881
NM 002358 NM 001667 D25328 Contig35634 RC AF148505 AF201951 NM 003239 NM 002563 X05610 NM 002624 NM 001905 NM 003376 NM 012129 NM 020156 NM 004163 M26880 NM 003504 NM 012310 NM 020386 NM 002916 NM 012429 NM 012479 NM 003748 NM 004504 NM 005238 NM 003862 NM 003875 NM 013306 NM 003981 NM 006117 NM 004701 NM 005444 AF257175 NM 006201 NM 006260 NM 007057 NM 004994 NM 007111 NM 005721 NM 014584 NM 006582 AF055033 U82987 NM 005915 NM 006681 NM 006763 Contig45917 RC Contig34634 RC AJ224741 NM 006931 NM 016577 D13540
MAD2L1 ARL2 PFKP ADCYAP1R1 ALDH6A1 MS4A7 TGFB3 P2RY1 COL4A2 PFDN5 CTPS VEGF CLDN12 C1GALT1 RAB27B UBC CDC45L KIF4A HRASLS RFC4 SEC14L2 YWHAG ALDH4A1 HRB ETS1 FGF18 GMPS SNX15 PRC1 PECI CCNB2 RQCD1 PECI PCTK1 DNAJC3 ZWINT MMP9 TFDP1 ACTR3 ERO1L GMEB1 IGFBP5 BBC3 MCM6 NMU BTG2 FOXL2 GCN1L1 MATN3 SLC2A3 RAB6B PTPN11
ALDH6A1
VEGF UGCGL1 LOC56889 PSD NFKBIL1 KIAA0794 PK428 YWHAG DIAPH1 EWSR1 SNX15 ZNF222 SEI1 UBQLN1 CCNB2 RAB5EP DNAJC3 NACA ZWINT ERP70 CEGP1 SEC10L1 FLJ23033 IGFBP5
ACO2 ATP5E FLJ21939 LOC56889
UCH37 FLJ11354 INS NPHP1 PGK1 IFNAR1 HUGT1
COL4A2 DC13 PK428 HRASLS RFC4 FLJ12443 ALDH4A1 FGF18 GMPS WISP1 PRC1 HEC PECI CCNE2 PECI
ORC6L ESM1 CEGP1 FLJ22477 MMP9 DKFZP564D0462 AKAP2 IGFBP5 BBC3 MCM6 NMU KIAA0175 MP1 LOC51203 SLC2A3 L2DTL KIAA1067 KIAA1442 RAB6B
DIAPH3 UCH37 FLJ11354 EXT1
M.E. Blazadonakis et al. / Artificial Intelligence in Medicine 53 (2011) 57–71
SeqID
65
NM 000017 NM 000127 NM 000207 NM 000272 NM 000296 BE739817 RC NM 001124 NM 000419 NM 001211 NM 001216 NM 000599 Contig25991 NM 000788 NM 000797 NM 002208 NM 002217 NM 002224 NM 000849
SeqID Symbol
INS NPHP1 PEX12 PGK1 PKD1 IFNAR1 ADM ITGA2B BIRC5 KIAA0999 BUB1B AP2B1 IGFBP5 ECT2 DRD4 ITGAE ITPR3 KPNA4
Accession
NM 000207 NM 000272 NM 000286 NM 000291 NM 000296 BE739817 RC NM 001124 NM 000419 NM 001168 AB023216 NM 001211 NM 001282 NM 000599 Contig25991 NM 000797 NM 002208 NM 002224 NM 002268
SeqID
19549 19840 19891 19928 19933 20891 21172 21256 21356 21818 21938 22267 22432 23198 23744 23978 24016 24169
18109 18891 19549 19840 19933 20891 21172 21256 21938 21944 22432 23198 23724 23744 23978 23996 24016 24333
S2BK S1BK
Accession
ACADS EXT1 INS NPHP1 PKD1 IFNAR1 ADM ITGA2B BUB1B CA9 IGFBP5 ECT2 DCK DRD4 ITGAE ITIH3 ITPR3 GSTM3
M.E. Blazadonakis et al. / Artificial Intelligence in Medicine 53 (2011) 57–71
Symbol
66
Table 9 Count of common genes among all signatures considered. S1 S1 S2 S1-BK S2-BK S1S2-BK
S2 5
5 21 15 19
14 23 21
S1-BK
S2-BK
S1S2-BK
21 14
15 23 45
19 21 53 51
45 53
51
where + (gi ), − (gi ), + (gi ) and − (gi ) are the means and standard deviations of the expressions of gene gi in positive and negative class, respectively. Variations of Fisher’s ratio have been used in a number of studies, from which we selectively refer to [5,22]. According to this metric, genes that differentiate more their expression in the two situations are assigned higher weights than those differentiating less between the two classes. At each step, the lower-rank genes are eliminated and the process continues recursively until the gene list is empty. In each recursion we proceed by cutting-off as many genes so that the remaining ones form a power of two, up to the point of 900 remaining genes. From that point on we cut sets of 100 genes up to 100 surviving ones and, finally, we proceed by eliminating one gene at a time up to the end of the process. For training and validation purposes we use the 78 and 19 patient sets of the Van’t Veer dataset (see Section 3.1), respectively. At the end of the process, the set of genes with size close to 70 achieving the highest classification accuracy on the independent test set of 19 samples are selected as the final gene signature. We use the criterion of 70 genes to be compliant with the 70-gene cut off point in Van’t Veer’s signature [5].
ECT2 DCK GSTM3
GNAZ
TMEFF1 AP2B1 IGFBP5 FLT1
SERF1A OXCT
FLJ11190 HSA250839 SM-20
Contig63649 RC NM 018354 NM 018401 Contig2399 RC Contig24252 RC AF073519 NM 000436 Contig35251 RC Contig56457 RC NM 001282 NM 000599 NM 002019 Contig46218 RC NM 002073 Contig32125 RC Contig25991 NM 000788 NM 000849
Accession
DRD4 FLJ21062 ITGAE ITPR3 KPNA4 NM 000797 Contig22253 RC NM 002208 NM 002224 NM 002268
SeqID Symbol Accession
S2
The global test is a score test that measures the significance of a gene signature [25]. It elaborates on the relation between the expressions of genes in the specific signature and the clinical outcome of samples. Suppose we are given gene expression measurements of N genes in L samples and we want to test if there is a close connection between gene-expression patterns and clinical outcome. The rationale behind the test is that the patterns of expression must resemble or differ across samples in accordance to the corresponding clinical outcomes [25]; i.e. samples of similar outcome should have similar gene-expression patterns. Let X = [xij ] be the L × N data matrix containing the N genes for L samples and Y = [yi ] define the corresponding L × 1 clinical outcome vector. The gene expression values xij often receive positive and negative values in the range of [−3, +3] indicating over or under-expression, respectively. In the usual case of the two-class consideration the class labels yi are presented as +1 or −1. The data correlation matrix R = (1/N)XX defines the expression similarity of pairs of samples across all genes. More specifically, the ijth entry of R defines the expression correlation of sample i to sample j and obtains positive values if these samples show similar gene expression patterns and negative values otherwise. On the other hand, let and 2 denote the expectation and variance of Y. Then, the matrix Y¯ Y¯ = [(yi − )(yj − )] expresses the outcome covariance matrix with its ijth element [(yi − )(yj − )] obtaining positive value if the samples i and j have similar outcomes and negative values otherwise. We can now define the score test Q that can test the predictive effect of the gene expression on the clinical outcome by: 1 Rij (yi − )(yj − ) 2 L
Q =
L
(3)
23744 23889 23978 24016 24169
SeqID
i=1 j=1
S1
Table 8 (Continued)
19044 19694 20320 20437 20711 20866 21350 21682 21923 22267 22432 22691 22724 22864 23179 23198 23724 24333
Symbol
2.5. The global test
In essence, the global test obtains large positive values if the samples of similar outcomes in {(yi − )(yj − )} also show simi-
M.E. Blazadonakis et al. / Artificial Intelligence in Medicine 53 (2011) 57–71
67
Fig. 7. Randomness analysis of the derived result in 100 runs; three conditions are tested in each run AUC19 ≥ 0.89, AUC234 ≥ 0.73, #genes ≤ 69. None of the runs succeeded to satisfy all three conditions simultaneously, indicating that the derived result is very unlikely to be a random guess. This finding indicates also that the set of 4852 genes extracted through the integration of biological knowledge is also very unlikely to have happened by chance.
lar expression patterns through {Rij }. It is used in the experimental section as a measure to assess the statistical significance of gene expressions within a gene signature on the outcome (label) of samples. 2.6. The nearest centroid classifier For classification purposes we use the nearest-centroid prediction rule [26]. Each sample (patient) is classified according to the distance between his/her profile (the expression of his/her genes) and the two average profiles corresponding to the centroids of the two classes (positive and negative). The predicted class is the one whose distance is closer to the examined patient profile. Such a classifier can be formulated for a sample x as follows:
f (x) = where c+ =
c− =
1 L+ 1 L−
+1 if ||c + − x|| < ||c − − x|| −1 if ||c + − x|| > ||c − − x|| 0 if ||c + − x|| = ||c − − x||
xi
(4)
(5) 3.2. Associating gene signatures and pathways
{i:Ci =+1}
of which are characterized negative and correspond to patients that remain disease free for a period of at least 5 years, whereas the remaining 34 are characterized positively and correspond to patients that relapse within a period of 5 years. 293 genes expressing missing information for all 78 patients were removed and the remaining 13,604 missing values were substituted using Expectation-Maximization (EM) imputation [17]. The independent test set consists of 19 samples, 7 negative and 12 positive. The 78-sample set is used for training purposes, the additional 19 samples are used for independent testing, while the cohort of 234 new patients published in [15] is used as an extra evaluation test-set. In addition, the data published by Wang et al. [14] on breast cancer is employed for supplementary validation purposes. It consists of 286 patients and 22,283 genes and considers disease recurrence; this entire dataset is used for validating and testing the cross-platform prediction ability of a signature derived from a different experimental design and application platform. Accordingly, patients for which no recurrence occurred are characterized as the good prognosis group and form the negative class, while those for which a reoccurrence occurred belong to the positive class.
xi
(6)
{i:Ci =−1}
are the two centriods and L+ , L− correspond to the number of cases in positive and negative classes respectively, while Ci reflects the class label (+1 or −1). The choice of a rather simple classifier is made so as to keep the evaluation emphasis around the properties of the gene set rather than the classifier. Since we are explicitly searching for differentially expressed genes, we employ a classifier structure able to quantify even small levels of differential expression. The nearest centroid classifier is a good candidate for such an approach and we use the Euclidean distance as a correlation metric that evolves symmetrically around the class centroids. This approach is used in a number of benchmark studies; we selectively refer to Van’t Veer et al. [5], Van De Vijver et al. [15], Michiels et al. [7], Niijima et al. [27]. 3. Results 3.1. Datasets Three publicly available breast-cancer datasets are used to assess and validate results. The dataset of Van’t Veer et al. [5] contains 24,481 genes and 78 patients on the training set, 44
For each gene in S1 and S2 we locate the associated GOBPs [23] and pathways [24]; notice that the term pathway refers to both GOBPs and pathways. This step results to an initial collection of pathways associated with the two signatures under consideration. We continue by ranking in ascending order the derived pathways of each individual signature, according to their p values given by Eq. (1). We preserve only those with a p value less than 0.05 [12]. Tables 1–4 present the rank-order list of the survived (p < 0.05) GOBPs and pathways of S1 and S2, respectively. We now turn our attention to the overlap between the two signatures. Even though there exists minor overlap when absolute gene identifiers are taken into account (only 5 genes in common), the overlap is significantly increased when we focus on the biological knowledge provided through the significant pathways associated with those signatures. By assembling the genes involved in these significant pathways we end up with a collection of 5899 genes for S1, and 5860 genes for S2. The interesting result is that the two signatures show a significant overlap of 52% (4031/(5860 + 5899 − 4031)), with 4031 genes in common at this level. This result demonstrates that the two signatures share significant biological knowledge even though they reflect very small overlap at the gene level (1st-level overlap in Section 1). We also point out that even though there is only one significant GOBP in common between the two signatures (3rd-level overlap in Section 1), i.e. the Signal Transduction (GO:0007165), it reflects a domi-
68
M.E. Blazadonakis et al. / Artificial Intelligence in Medicine 53 (2011) 57–71
nating overlap by contributing 3788 genes in the final ensemble while the remaining GOBPs are contributing significantly less. This fact highlights yet another aspect of overlap associated with biological knowledge. Even though two signatures may seem to share an insignificant number of common pathways, this overlap may prove to be significant when the associated genes are also taken into account. 3.3. Building a unified gene signature Focusing first on S1, the signature succeeded an 89.47% success rate on the independent test-set of 19 samples while it demonstrated a significant distinction on survival prediction [13]. Classification on the 234 new cases published in [15] yields a measure of the area under the receiver-operating-characteristic curve (AUC) of 0.66, which induces 74% sensitivity (true positive) and 59% specificity (true negative), as illustrated in Table 5. Forming now the collection of genes indicated by the underlying biological knowledge, i.e. taking all genes in the 38 GOBPs and 2 pathways (Tables 1 and 2) induced by the genes of S1, we derive a collection of 3535 genes. By applying the RFM selection method described in Section 2.4, a 70-gene Biological Knowledge Signature (S1-BK) is extracted. For the derivation of S1-BK, the initial 78 samples were used for training and the additional 19-samples were used for testing, onto which S1-BK achieved a success rate of 84.47%. Evaluating now the performance of S1-BK on the 234-sample cohort, we notice a significant improvement over its predecessor S1. More specifically, the AUC measure improved by 3 units (from 66% to 69%) while sensitivity achieved an 80% success rate, improving substantially by 6 units, with the specificity rate remaining at the same level. We emphasize the increase in sensitivity, since the percentage of the success rate in the true positive cases has more significant clinical implications; the cost of misclassifying a positive patient is much higher than the cost of misclassifying a negative one. In addition, high sensitivity enables doctors to decide more accurately on the therapeutic protocol, which is very important in cancer since patients may avoid unnecessary toxic treatment. More detail on clinical implications is provided in Section A.4 of the appendix. Returning to S1-BK signature depicted in Fig. 2 (panel A), we perform a class comparison of the classification result derived on the 234 new cases [15], in order to demonstrate differences at geneexpression level between the two classes. The class comparison is performed through a labeled clustering. The class label derived through the classification process is given as input to the clustering algorithm, in order to avoid inter-mingling of samples between the two classes. The result of clustering reveals significant expression differences of marker genes between the two classes (green and red part of the tree), as demonstrated in Fig. 2 (panel A) where rows correspond to genes and columns to patients. Additionally, the Kaplan-Meier survival-prediction curves distinguish the two prognostic groups with significant distance, with the good prognosis (green curve) associated to the green cluster and the poor prognosis (red curve) associated with the red cluster. Focusing now on S2, it initially achieves a success rate of 89.47% on the independent test set [5] while it yields 0.69 AUC (Table 5), with 76% sensitivity and 62% specificity on the 234 new patients dataset [15]. In the next step we follow the same scenario as in the case of S1, to exploit significant biological knowledge. By gathering the genes pointed by the top 24 GOBPs (Table 3) and the top 5 pathways (Table 4) that also exist in the dataset, we derive a collection of 3854 genes, which are again candidates for the derivation of a new biologically-significant signature. Applying the RFM selection scheme, the derived signature S2-BK consists of 70 genes that give an 89.47% success rate on the independent test set. When tested on the 234 cases, S2-BK improves its performance over its predecessor S2, with AUC improving from 0.69 to 0.71, sensitivity
from 0.76 to 0.78 and specificity from 0.62 to 0.63, as illustrated in Table 5. Even though the performance increase is not as significant as in the case of S1, the clinical implications discussed in Section A.4 of the appendix are still substantial, indicating that biological knowledge is indeed useful for signature refinement and prediction improvement. The corresponding class comparison and Kaplan-Meier analysis are depicted in panel B of Fig. 2, leading to similar conclusions as in the case of S1 signature. The integration of biological knowledge reflected by both signatures generates a total ensemble of 4852 biologically-significant genes. Proceeding in a similar manner as before through the application of RFM, a 69-gene signature is derived by reaching an 89.47% success rate on the independent test set of the 19 samples. The progression of the learning scheme is presented in detail in Section A.2 of the appendix, emphasizing also the impact of the number of genes in the classification result. The derived signature (S1S2-BK) achieves an AUC rate of 0.73 with sensitivity and specificity rates of 81% and 64%, respectively, on the dataset of 234 new cases (Fig. 6 in the appendix). Notice that the proposed integration scheme, with the appropriate use of biological knowledge hidden behind the initial gene signatures (S1 and S2), results in a significant increase of all measures compared to the ones derived by the original signatures. A class comparison and Kaplan–Meier study of the S1S2-BK signature is presented in Fig. 2 (panel C), indicating significant differentiation on the expression level of selected genes on the 234cases cohort and a clear difference on the survival-prediction curves of the two groups. We observe that the probability of survival for the good-prognosis profile stays above 95% for a time interval of 12 years. The 69 genes constituting the S1S2-BK signature are listed in Table 6, while the set of genes involved in S1, S2, S1-BK and S2-BK are presented in Table 8. An interesting aspect concerns the number of common genes between pairs of signatures involved in this study. Table 9 addresses this issue and provides an overall view of gene overlap in the signatures. We notice that even though there initially exists minor overlap (only 5 genes) between S1 and S2, when supplied by background knowledge (BK) in the derivation of S1-BK, S2-BK and S1S2-BK, the overlap among the derived signatures is substantially increased. More specifically, S1-BK and S2-BK share 45 genes in common (48% overlap), while S1S2-BK shares an average overlap of 75% with S1-BK and S2-BK. To measure now the significance of the derived gene signature in terms of clinical outcome, we use the global-score (Q) test [25]. The higher the score of this test, the more tightly the expression patterns of the underlined gene signature are correlated with the class labels. The integrated gene signature is (expression-wise) more significantly correlated with the outcome (Q = 451.98), while S1 gives a score of 367.41 and S2 a score of 416.96. Another relevant issue concerns the confidence associated with the derived result; in other words, how likely is the case that the derived solution could be the product of a random guess. This concern is addressed in the appendix Section A.3 by an appropriately designed computational experiment. The conclusion implies that the derived signature is unlikely to have happened by chance. 3.4. Cross platform validation of integrated gene signature In this section we proceed by validating the S1S2-BK signature on the dataset of 286 patients published by Wang et al. [14]. Through this test we examine the performance of our proposed integration scheme across a multi-center study that addresses the problem of transferring results to different platforms, datasets and experimental designs. Before however proceeding in such a validation, we test the survival prediction capability of the original signatures S1 and S2 on these 286 new patients. We first searched for the genes of those signatures in Wang’s dataset. In the case of S1,
M.E. Blazadonakis et al. / Artificial Intelligence in Medicine 53 (2011) 57–71
we managed to locate 44 genes in the dataset out of the 57 signature genes, while in the case of S2 we located 52 out of 70 genes. Applying SOM clustering [28] on these two sets of common genes, we then attempted to discover two prognostic patient-groups within the Wang’s dataset and performed Kaplan-Meier survival analysis on the derived groups, in order to test the significance of the individual signatures on the new dataset. The results of these tests are depicted in Fig. 3, where we notice that the examined signatures are not able to differentiate significantly between the two prognostic groups. This outcome shows that statistical results are tightly bound to the array platform and design of the experiment [18]; recall that Van’t Veer’s group performed a double-array experiment on Agilent chips, while Wang’s group applied a single-array experiment using Affymetrix chips. Performing the exact same test using the integrated gene signature (S1S2-BK), we managed to locate 116 genes in Wang’s data. Examining the integrated signature in the same manner as before, SOM clustering revealed two clusters of patients corresponding to the two prognostic groups. The result of SOM clustering is depicted in Fig. 4 (left panel), where the two clusters reveal significant difference on the gene expression profiles. These two clusters reflect a good (green) and a poor (red) prognostic group as verified by the Kaplan-Meier survival analysis in Fig. 4 (right panel). Notice the substantial improvement of the integrated gene signature over the initial predecessor signatures S1 and S2 (Fig. 3). Also notice that, even though the difference on the survival prediction may not be as significant as in the case of Van De Vijver data in Fig. 2, the difference at the crucial period of 5 years is still substantial with the good prognosis group achieving approximately an 80% probability of survival, while for the poor prognosis group the probability is below 60%. 3.5. Gene interactions Regarding the source of biological knowledge, we use GO [23] and NetPath [24] as reliable and valid databases supported by experimental data; KEGG can also be used as yet another source. The gene relations involved in these sources are well established, even though the actual structure among genes may be quite incomplete; these databases are systematically updated on a frequent basis. Nevertheless, there exist many more tools, stemming from holistic genomic analysis (e.g. involving microarrays, ChIP-on-Chip, proteomics, etc.), as well as literature-mining tools. Such tools can derive additional gene relations, which could be (loosely) used for generating new research hypotheses. By studying the interactions among the genes in the S1S2-BK signature using such a software tool, namely GNCpro provided by SABioscience.com, we discover strong relationships among the majority of genes involved. These relations are depicted in Fig. 5. More specifically, a large proportion of genes demonstrate co-expression interaction, implying that they have similar expression profiles throughout multiple experiments. Furthermore, 6 genes demonstrate protein interaction relationships, while 7 demonstrate physical interactions. These results further support the biological significance of the derived signature.
69
perspective of assessing the value of marker genes may lead to a variety of different gene signatures for the same pathological condition with negligible or even no overlap among them. The derivation of a unified solution able to overcome such methodological limitations and uncertainties is still an open problem. Towards this direction, we propose an approach where signatures are not competitive to each other but rather form niches for the unification of different results. Different solutions (signatures) lead to different pools of biological knowledge, which in turn reveal an overlap of biological functions and relations that may be more meaningful than the gene overlap itself at the original signature level. Unifying or integrating such pools of biological knowledge for further investigation leads to a significant improvement of the original signature characteristics. Furthermore, it re-orients the perspective of approaching complex biological states and diseases from genecentric lists to inferred molecular and regulatory pathways. It may in turn facilitate an insight into the genetic mutation and epigenetic alterations that have been shown to underpin the fate of complex biological systems [30,31]. In this study we address the problem of limited overlap between gene signatures and propose an approach for signature integration. We begin with the assumption that even though different gene-expression signatures may share minor or no overlap, they are probably parts of a more global solution, each one focusing on a different, complementary or overlapping segment of the complex biological structure of the problem. Our results demonstrate that commonalities may be revealed when taking into account the biological knowledge behind gene signatures, even though this may not be visible at gene level. Focusing on two gene signatures, i.e. our S1 signature consisting of 57 genes and the Van’t Veer’s signature (S2) of 70 genes, we show that they share significant gene overlap (52%) at the biological level. The proposed integration of the derived biological knowledge produces a statistically more significant signature, improving significantly the performance of the original ones. In addition, the integrated signature demonstrates reliable outcome of clinical prediction applied on another dataset from different experimental design and microarray platform. This further enhances the scope of signature integration based on biological grounds, since the power of the result does not entirely depend on the platform or experimental design, but it is most likely associated directly with the biological processes underling the disease itself. Finally, we point out that the proposed approach is not competitive to other solutions and should not be used to reveal strengths or weaknesses of individual signatures. It rather searches for common pieces of information conveyed by these signatures, targeting at a unification procedure that hopefully leads to a more stable and powerful solution. Acknowledgements Present work was supported by “OncoSeed Diagnostics” and “OASYS” projects funded by the National Strategic Reference Franework 2007-13 of the Greek Ministry of Development, as well as the Hellenic Ministry of Education.
4. Discussion and conclusion
Appendix A.
Several developmental transitions and pathological transformations, such as cancer, are due to variations in the expression of certain genes orchestrated by core regulatory pathways. Different pervasive aspects of these transcriptional programs are manifested by the different molecular signatures that distinguish the various diseases or developmental stages [29]. Nevertheless, incompatibilities related to the experimental platform, study design, and
A.1. Statistical and clinical considerations of the integrated signature The analysis of massive genomic data suffers from the small sample size and the large feature dimensionality, often leading to algorithmic overtraining, large bias in the prediction accuracy, errors in the recording of expression values and random correla-
70
M.E. Blazadonakis et al. / Artificial Intelligence in Medicine 53 (2011) 57–71
tions among data samples [26]. Since the statistical outcome of a specific experiment is affected by a limited group of differentially expressed genes, the analysis of microarray data is often preceded by the elimination of irrelevant genes and the selection of a minimal number of marker genes with sufficient predictive power for either the classification of unknown samples, or the prediction of cancer state, or the survival analysis of patients [7]. Thus, the selection of a small number of relevant genes increases the algorithmic accuracy and the generalization of results, also in accordance to the minimum description length principle [26]. Furthermore, the use of a small-size signature increases the algorithmic speed and improves the ability of the experts to interpret the analysis results [27]. Accordingly, the biological relevance of genes in pathways, networks and functional groups is much easier to be specified if the results include small-size gene signatures cleaned up from irrelevant genes. Alternatively, the selection of a few marker genes in a large-dimensionality space often increases the chance of random correlations among genes and data samples. Finally, the clinical merits obtained by a gene signature should be clearly demonstrated, in order to substantiate its use in clinical practice. A.2. Impact of the size of gene-signature on classification and survival analysis We first study the statistical properties of the gene-elimination process deriving the S1S2-BK signature. More specifically, we examine the prediction ability of the gene set throughout the process of gene elimination (Section 2.4) in the last 100 algorithmic iterations. Results are presented in Fig. 6, which is divided into two parts. At the top part we assess variations in the prediction ability of the filter selection method on both the Van’t Veer test set of 19 samples and the Van De Vijver 234-sample independent test set, while at the bottom part we asses the survival prediction ability (on the 234 patient set) at three representative snapshots along the process of gene elimination. Regarding the prediction ability on the 19-samples set, we observe that it remains almost at a stable level from 100 to 27 surviving (missing between 2 and 3 cases), varying between 0.89 and 0.87 AUC. Examining the 234 new cases, the prediction performance is smaller and drops below 0.70 AUC at the last 50 steps. Recall that we are searching for the minimal set of size closer to 70 genes (compliant with the 70-gene of the Van’t Veer’s et al. signature [6]) that achieve the highest classification accuracy on the independent test set of 19 samples; this benchmark is highlighted by the arrow pointer in Fig. 6, yielding 0.89 and 0.73 AUC measures for the 19 and 234-cases datasets, respectively. We also observe that both AUC measures decrease significantly at the last 35 step of the process, indicating that lower numbers of genes do not form reliable predictors as gene signatures. Similar conclusions are reached by the survival analysis process. More specifically, three snapshots are taken along the process, as highlighted by the three red dots in upper part of the graph, for which the results are presented at the lower part of Fig. 3. We observe that the survival analysis of the two populations deteriorates drastically as the selection process progresses, indicating that signatures with less numbers of genes are not reliable survival predictors. To further assess the randomness of the derived result, a special computational experiment is conducted and presented in the next section. A.3. Randomness of the derived results A question of particular interest concerns the likelihood of the derived result being a random guess. This type of question is a direct consequence of the nature of the problem, in which a large-dimensionality space is sparsely covered, often leading to random correlations among the involved features [26]. To assess
this question as objectively as possible, we performed the following computational experiment 100 times: Among the 24,188 genes that are initially given, we randomly select 4852 each time; recall that this is the number of genes collected after the integration of S1-BK and S2-BK (Section 3.3). In turn, we apply the filter selection method and derive a new gene signature according to the rule presented in Section 2.4 (i.e. the set of genes whose count is less than 100 but closer to 70 and achieve the highest classification accuracy on the independent test set of 19 samples). After completion of the experiment we are left with 100 different signatures, each derived from an initial random set of 4852 genes. Recall that S1S2-BK consists of 69 genes and succeeds an AUC of 0.89 on the independent test set of 19 samples (AUC19) and 0.73 on the independent test set of 234 patients (AUC234). We can now examine the frequency of the following joint condition: (AUC19 ≥ 0.89) & (AUC234 ≥ 0.73) & (#Genes ≤ 69)
(7)
as to assess the likelihood of the above null hypothesis, which is stated as: “what is the likelihood of deriving the same or even better result starting from an initial random set of 4852 genes?” Among the 100 runs conducted, none satisfied the null hypothesis with a p-value <0.01, indicating that not only the derived result (signature), but also the initial set of 4852 genes are all unlikely to have happened by chance. We present the result of this computational experiment in Fig. 7, which depicts all 100 runs in more detail. More specifically, different color bars of length 1 represent the three conditions assessed; runs with no bars indicate that none of the three conditions are met. Notice that none of these runs meets all three conditions at once, while the AUC234 ≥ 0.73 criterion (stand alone) is satisfied in only 3/100 at the limit of 0.73. We also point out that AUC19 and AUC234 criteria are never met simultaneously. A.4. Clinical importance of the results As a final issue of this study, we assess the clinical merits of the derived result in Section 3.3. Without loss of generality, we selectively focus on the S1-BK and S1S2-BK signatures. Details on the classification measures of these signatures are displayed in Table 7. Focusing on the positive class for S1-BK, there are 117 (=43 TP1 + 74 FP2 ) cases assigned as positives of which only 43 are indeed true positives. Turning it into a relational probability, this implies that a doctor will decide with only a 36.8% (43/117) likelihood of taking the right decision on assigning a patient to a chemotherapy program. Alternatively, focusing on the negative class the S1-BK assigned a total of 117 patients as negatives, from which 11 are false negative cases. Accordingly, a doctor can decide with 90.6% (106/117) confidence on not assigning a patient to a chemotherapy program. Notice the significant difference on the certainty of the decisions between the two prognostic groups; the doctor decides with only a 36.8% certainty on assigning a patient to a chemotherapy program, implying that a significant number of patients (74 in this case) will receive unnecessary toxic treatment. The same doctor on the other hand, decides with a significant high confidence rate of 90.6% on not assigning a patient to a chemotherapy program. Based on such a decision, a significantly less but not negligible number of patients (11 in this case) will not receive the necessary appropriate treatment. Taking into account the relative costs of misclassification in the two classes, any improvement over the previously presented figures is of great social impact since more patients in need of the treatment will receive it and more patients that do not need the treatment will avoid it.
1 2
True Positives. False Positives.
M.E. Blazadonakis et al. / Artificial Intelligence in Medicine 53 (2011) 57–71
Examining the potential improvement achieved, the S1S2-BK signature assigns to the positive class 109 (=44 TP + 65 FP) cases and 125 total cases in the negative class from which 10 are false negatives. Hence, based on this signature the doctor can decide with a 40.4% (44/109) certainty for the positive class and with a 92% (115/125) certainty for the negative class. In essence, the integrated signature through its increased prediction power succeeds to protect 9 patients from receiving unnecessary toxic treatment, while it assigns 1 more patient to the correct treatment protocol as compared to the S1-BK signature. References [1] Early Breast Cancer Trialist’s Collabortive Group. Polychemotherapy for early breast cancer: an overview of the randomized trials. Lancet 1998;352: 930–42. [2] Early Breast Cancer Trialist’s Collabortive Group. Tamoxifen for early breast cancer: an overview of the randomized trials. Lancet 1998;352:1451–67. [3] Sotiriou C, Wirapati P, Loi S, Harris A, Fox S, Smeds J, et al. Gene expression profiling in breast cancer: understanding the molecular basis of histologic grade to improve prognosis. J Natl Cancer Inst 2006;98:262–72. [4] Slonin DK. From patterns to pathways: gene expression data analysis comes of age. Nat Genet Suppl 2002;32:502–8. [5] Van’t Veer LJ, Dai H, Van De Vijver MJ, He YD, Augustinus AM, Mao M, et al. Gene expression profiling predicts clinical outcome of breast cancer. Lett Nat 2002;415:530–6. [6] Simon R, Radmacher MD, Dobbin K, McShane LM. Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. J Natl Cancer Inst 2003;95:14–8. [7] Michiels S, Koscielny S, Hill C. Prediction of cancer outcome with microarrays: a multiple random validation strategy. Lancet 2005;365:488–92. [8] http://www.fda.gov/ [accessed 28 June 2010]. [9] http://medgadget.com/archives/2007/02/mammaprint a br.html [accessed 28 June 2010]. [10] http://www.eortc.be [accessed 28 June 2010]. [11] http://www.eortc.be/services/unit/mindact/MINDACT websiteii.asp [accessed 28 June 2010]. [12] Yu JX, Sieuwerts AM, Zhang Y, John WM, Smid M, Klijin JGM, et al. Pathway analysis of gene signatures predicting metastasis of node-negative primary breast cancer. BMC Cancer 2007;7(182).
71
[13] Blazadonakis ME, Zervakis M. The linear neuron as marker selector and clinical predictor. Comput Methods Programs Biomed 2008;91(1):22–35. [14] Wang Y, Klijn JGM, Zhang Y, Sieuerts AM, Look MP, Yang F, et al. Geneexpression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet 2005;365:671–9. [15] Van De Vijver MJ, He YD, Van’t Veer LJ, Hongyue D, Augustinus AM, Hart M, et al. A gene expression signature as a predictor of survival in breast cancer. N Engl J Med 2002;347:1999–2009. [16] Ein-Dor L, Kela I, Getz G, Givol D, Domany E. Outcome signature genes in breast cancer: is there a unique set? Bioinformatics 2005;21:171–8. [17] Little A, Rubin D. Statistical Analysis with Missing Data. New York, USA: Wiley’s Series in Probability and Mathematical Statistics; 1987. [18] Tuma RS. Multiple gene signatures aim to qualify risk in breast cancer. J Natl Cancer Inst 2005;97:332. [19] Liu Z, Gartenhaus RB, Tan M, Jiang F, Jiao X. Gene and pathway identification with Lp penalized Bayesian logistic regression. BMC-Bioinformatics 2008;9:412. [20] Dinu V, Zhao H, Miller HPL. Integrating domain knowledge with statistical and data mining methods for high density genomic SNP disease association analysis. J Biomed Inform 2007;40(6):750–60. [21] Rosa GJM, Vazquez A. Integrating biological information into the statistical analysis and design of microarray experiments. Animal 2010;4(2): 165–72. [22] Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Masirov JP, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999;286:531–6. [23] http://www.geneontology.org/ [accessed 4 April 2009]. [24] http://www.netpath.org/ [accessed 4 April 2009]. [25] Goeman JJ, Van de Geer SA, de Kort F, Van Houwelingen HC. A global test for groups of genes: testing association with a clinical outcome. Bionformatics 2004;20:93–9. [26] Simon R. Diagnostic and prognostic prediction rule using gene expression profiles in high dimensional microarray data. Br J Cancer 2003;89:1599–604. [27] Niijima S, Kuhara S. Recursive gene selection based on maximum margin criterion. BMC-Bioinformatics 2006;7(543). [28] Kohonen T. Self Organizing Maps. Berlin: Springer-Verlag; 2001. [29] Perou CM, Sorlie T, Eisen MB, Van De Rijn M, Jeffrey SS, Pollack JR, et al. Molecular portraits of human breast tumours. Nature 2000;406:747. [30] The Cancer Genome Atlas Research Network. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 2008;455:1061. [31] Jones S, Zhang X, Parsons DW, Cheng-Ho Lin J, Leary RJ, Angenendt P, et al. Core signalling pathway in human pancreatic cancers revealed by global genomic analyses. Science 2008;321:1801–6.