Accepted Manuscript

An Ensemble Predictive Modeling Framework for Breast Cancer Classification
Radhakrishnan Nagarajan, Meenakshi Upreti

PII: S1046-2023(17)30056-7
DOI: http://dx.doi.org/10.1016/j.ymeth.2017.07.011
Reference: YMETH 4277

To appear in: Methods

Received Date: 30 April 2017
Revised Date: 11 July 2017
Accepted Date: 12 July 2017
Please cite this article as: R. Nagarajan, M. Upreti, An Ensemble Predictive Modeling Framework for Breast Cancer Classification, Methods (2017), doi: http://dx.doi.org/10.1016/j.ymeth.2017.07.011
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
An Ensemble Predictive Modeling Framework for Breast Cancer Classification

Radhakrishnan Nagarajan, Ph.D.¹
Division of Biomedical Informatics, College of Medicine, University of Kentucky, KY, USA.

Meenakshi Upreti, Ph.D.
Department of Pharmaceutical Sciences, Markey Cancer Center, University of Kentucky, KY, USA.

¹ Corresponding Author
Radhakrishnan Nagarajan, Ph.D.
Associate Professor
Division of Biomedical Informatics, College of Medicine, University of Kentucky
725 Rose Street, Multidisciplinary Science MDS 230F
Lexington, KY 40536-0082
Phone: 859 218 0109
Email: [email protected]
Keywords Predictive Modeling, Ensemble Classification, Molecular Profiling.
ABSTRACT

Molecular changes often precede the clinical presentation of diseases and can be useful surrogates with potential to assist in informed clinical decision making. Recent studies have demonstrated the usefulness of modeling approaches such as classification for predicting clinical outcomes from molecular expression profiles. While useful, a majority of these approaches implicitly use all molecular markers as features in the classification process, resulting in a sparse, high-dimensional projection of the samples whose dimensionality is often comparable to the sample size. In this study, a variant of the recently proposed ensemble classification approach is used for predicting good- and poor-prognosis breast cancer samples from their molecular expression profiles. In contrast to traditional single and ensemble classifiers, the proposed approach uses multiple base classifiers with varying feature sets obtained from two-dimensional projections of the samples, in conjunction with a majority voting strategy for predicting the class labels. In contrast to our earlier implementation, base classifiers in the ensembles are chosen for maximal sensitivity and minimal redundancy by retaining only those with low average cosine distance. The resulting ensemble sets are subsequently modeled as undirected graphs. Performance of four different classification algorithms is shown to be better within the proposed ensemble framework than when they are used as traditional single classifier systems. Significance of a subset of genes with high degree centrality in the network abstractions across the poor-prognosis samples is also discussed.
INTRODUCTION

Molecular profiling in conjunction with predictive models has been useful in discerning distinct disease groups, with potential to assist in clinical decision making [1-12]. Unsupervised learning techniques such as clustering have been successfully used for identifying inherent groupings and visualization of patient samples, as well as partitioning genes into functionally related categories using a chosen similarity measure [2]. On the other hand, supervised learning techniques such as classification [1], which explicitly accommodate the class labels (e.g. clinical outcomes) of the samples, learn a model on training data and predict the class labels of new incoming samples using this model. Several classification techniques have been proposed in the past that broadly fall under single and ensemble classifiers. Many of these techniques, while helpful, implicitly use all the genes as features in the classification, often resulting in a sparse, high-dimensional projection of the samples. Such representations in turn can challenge the generalizability of the findings and affect the overall performance of the classifiers, especially at small sample sizes. Popular single classifier approaches used widely in molecular expression studies include support-vector machines (SVM), the Naïve Bayes classifier (NB) and linear discriminant analysis (LDA). Ensemble classifiers, or multiple classifier systems, implicitly subscribe to the wisdom of crowds and combine the predictions across multiple base classifiers. Studies have elucidated the merits of ensemble classifiers from statistical, computational and representational standpoints, and demonstrated improved performance over single classifier systems under certain assumptions [13].
Several studies have also successfully applied ensemble classification techniques to molecular data [14-17]. Two popular ensemble approaches are bagging [18] and boosting [19-22]. Bagging essentially uses bootstrapped realizations generated by resampling with replacement from the given empirical training sample. Subsequently, the class label is predicted by combining or aggregating the individual predictions using voting strategies (e.g. majority voting) [23]. One of the popular bagging approaches is random forest (random decision forests) [24-26]. On the other hand, boosting techniques [19-22] combine the predictions across multiple weak learners, whose accuracy is better than that of a random guess, to form a strong learner from weighted training samples. As with any classification technique, prudent choice of hyperparameters is critical for optimal performance of these ensemble classification algorithms. While single and ensemble classifiers such as those discussed above have been successfully used across a number of applications, they do not necessarily provide sufficient insights into potential variations between the samples within the groups in their out-of-the-box or native form. Recently, we proposed an ensemble classification approach (SVA) [27-29] that uses bootstrapping in conjunction with a majority voting strategy and two-dimensional feature sets, providing insights into potential patient-specific variations within the disease groups while discerning them using molecular expression profiles. The advantages of SVA include its ability to reveal potential interactions between the features across multiple scales [29] and superior performance over some of the established algorithms as well as clinical criteria [27]. In SVA, the base classifiers were enriched for sensitivity about an optimal sensitivity threshold, and bootstrap realizations along with varying feature sets were used to promote classifier diversity. The present study overcomes some of the limitations of SVA as (a) it obviates the need for estimating an optimal sensitivity threshold and (b) it screens for base classifiers based on ranked sensitivity as well as ranked average cosine distance, resulting in a subset of base classifiers with high sensitivity and low average cosine distance. Explicitly screening for base classifiers with low average cosine distance is expected to minimize redundancy and enhance the overall diversity of the base classifiers in the ensemble, resulting in better performance and generalizability.
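As an illustration of the redundancy measure described above, the average cosine distance of each base classifier's prediction vector to all others can be computed as in the following sketch. This is illustrative Python rather than the paper's R implementation, and encoding predictions as ±1 vectors is our assumption:

```python
# Illustrative sketch (not the authors' code): average cosine distance
# between each base classifier's prediction vector and all others, used
# as a redundancy measure. Predictions are encoded as +/-1 labels.
import numpy as np

def average_cosine_distance(predictions):
    """predictions: (n_classifiers, n_samples) array of +/-1 predicted labels.
    Returns the average cosine distance of each classifier to the others."""
    P = np.asarray(predictions, dtype=float)
    U = P / np.linalg.norm(P, axis=1, keepdims=True)  # unit-normalize rows
    cos_dist = 1.0 - U @ U.T                          # pairwise cosine distances
    n = P.shape[0]
    # exclude the zero self-distance from each classifier's average
    return cos_dist.sum(axis=1) / (n - 1)

preds = np.array([[ 1,  1, -1, -1],
                  [ 1,  1, -1, -1],    # identical to classifier 0: redundant
                  [-1, -1,  1,  1]])   # opposite predictions: maximally distant
d = average_cosine_distance(preds)
```

Redundant classifiers (identical prediction vectors) contribute zero pairwise distance, so they fall to the bottom of the ascending ranking and are screened out.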
Performance (sensitivity, specificity, accuracy, positive predictive value, negative predictive value) of four different classification algorithms with widely varying assumptions is investigated within the proposed ensemble classification framework and as single classifiers.
Molecular taxonomy of breast tumors has been shown to provide unprecedented insights into inherent variations within breast tumors and response to treatments [30], with potential to assist in traditional clinical assessment. Molecular subtyping of breast cancer broadly groups the subjects based on (a) Estrogen Receptor (ER), (b) Progesterone Receptor (PR), and (c) Human Epidermal Growth Factor Receptor 2 (HER2/neu) status. Based on these criteria, breast cancers are classified as hormone receptor positive (ER+, PR+), HER2/neu positive and Triple Negative (ER-, PR-, HER2/neu-). Commercial molecular assays (e.g. MammaPrint, Oncotype DX) that interrogate breast cancer samples for the expression of a subset of transcripts have also been shown to be useful in prognosis and in assessing the benefit of chemotherapy in breast cancer subjects. Of the several commercial molecular assays, data related to the FDA-approved MammaPrint multi-gene assay (70 genes across 78+307 = 385 samples) was recently made publicly available [3, 6, 9, 11] with adequate documentation. The present study aims to predict good-prognosis and poor-prognosis (i.e. metastasis within 5 years) samples using the proposed ensemble predictive modeling framework on these publicly available molecular expression data.
METHODS

Ensemble Predictive Modeling Framework

Given: Training data X_tr representing the expression of m molecular markers across n_tr subjects and test data X_te representing the expression of m molecular markers across n_te subjects, with columns normalized to zero mean and unit variance. Clinical labels of the good-prognosis and poor-prognosis samples across the training data (y_tr) and the test data (y_te) were provided.

Step 1: Set the bootstrap realization index b = 1.

Step 2: Set the base classifier index k = 1.

Step 3: Generate a bootstrap realization X_b by resampling the training data X_tr with replacement and store the corresponding labels in y_b. Normalize the columns of X_b to zero mean and unit variance. Choose a pair of features from the m molecular markers.

Step 4: Learn the base classifier C_k using the chosen pair of molecular markers in X_b as features.

Step 5: Predict the labels ŷ_tr of the training samples X_tr using C_k.

Step 6: Set k = k + 1. Repeat Steps 3-5 for each pair of features.

Step 7: Determine the sensitivity of each base classifier from ŷ_tr and y_tr. Store the feature pairs corresponding to the Top 50% of base classifiers ranked by decreasing sensitivity in F_sen. Determine the average cosine distance of each base classifier as the mean cosine distance between its vector of predicted training labels and those of the remaining base classifiers. Store the feature pairs corresponding to the Top 50% of the base classifiers ranked by increasing average cosine distance in F_cos.

Step 8: Determine the reduced set of feature pairs F* given by the intersection F* = F_sen ∩ F_cos.

Step 9: Choose a pair of features in F* and their corresponding expression profile from the test sample. Predict the label of the test sample. Store the feature pair along with its predicted label as follows: if the predicted label is good-prognosis (G), add the feature pair to the ensemble set E_G; if the predicted label is poor-prognosis (P), add the feature pair to the ensemble set E_P.

Step 10: Repeat Step 9 for each pair of features in F*.

Step 11: Majority Voting: Combine the labels across the base classifiers as follows: if |E_G| > |E_P|, assign Good-Prognosis; if |E_G| < |E_P|, assign Poor-Prognosis; in the event of a tie, randomly assign G or P.

Step 12: Repeat Steps 9-11 for each sample in the test data X_te.

Step 13: Determine the performance measures (sensitivity, specificity, accuracy, positive predictive value, negative predictive value) by comparing the predicted labels to y_te.

Step 14: Set b = b + 1. Repeat Steps 2-13 till b = B. Determine the average performance measures on the test sample by averaging over the B realizations.

Step 15: Determine the confidence of the classifiers in the ensemble sets, given by the proportion of times their feature pairs were in F* out of the B realizations.
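The screening portion of the procedure (bootstrap resampling, one base classifier per feature pair, sensitivity ranking, average-cosine-distance ranking, and intersection of the two Top 50% lists) can be sketched as follows. This is an illustrative Python sketch under simplifying assumptions, not the authors' R implementation: the nearest-centroid base learner stands in for a generic classification algorithm, and all function and variable names are ours.

```python
# Hedged sketch of the per-realization screening (roughly Steps 3-8).
from itertools import combinations
import numpy as np

rng = np.random.default_rng(0)

def nearest_centroid_predict(Xtr, ytr, X):
    """Toy base learner: assign each row of X to the class (0/1) of the
    nearest class centroid computed from (Xtr, ytr)."""
    c0 = Xtr[ytr == 0].mean(axis=0)
    c1 = Xtr[ytr == 1].mean(axis=0)
    d0 = ((X - c0) ** 2).sum(axis=1)
    d1 = ((X - c1) ** 2).sum(axis=1)
    return (d1 < d0).astype(int)

def screen_feature_pairs(X_tr, y_tr, top_frac=0.5, positive=1):
    n, m = X_tr.shape
    boot = rng.integers(0, n, size=n)              # bootstrap with replacement
    Xb, yb = X_tr[boot], y_tr[boot]
    Xb = (Xb - Xb.mean(0)) / Xb.std(0)             # zero mean, unit variance
    pairs, sens, preds = [], [], []
    for i, j in combinations(range(m), 2):         # one base classifier per pair
        yhat = nearest_centroid_predict(Xb[:, [i, j]], yb, X_tr[:, [i, j]])
        pos = y_tr == positive
        pairs.append((i, j))
        sens.append((yhat[pos] == positive).mean())        # sensitivity
        preds.append(np.where(yhat == positive, 1.0, -1.0))
    P = np.array(preds)                            # prediction vectors
    U = P / np.linalg.norm(P, axis=1, keepdims=True)
    avg_dist = (1.0 - U @ U.T).sum(axis=1) / (len(pairs) - 1)
    k = int(np.ceil(top_frac * len(pairs)))
    top_sen = set(np.argsort(sens)[::-1][:k])      # highest sensitivity
    top_div = set(np.argsort(avg_dist)[:k])        # lowest avg cosine distance
    return [pairs[i] for i in sorted(top_sen & top_div)]   # intersection
```

The returned feature pairs correspond to the reduced set of base classifiers that are both highly sensitive and mutually diverse; the outer loops over test samples and bootstrap realizations wrap around this routine.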
The working principle of the proposed ensemble predictive modeling framework is shown in Fig. 1 for convenience. Since the approach uses pairs of features, for four molecular markers (M1, M2, M3, M4) we get 4C2 = 6 base classifiers with feature pairs (M1M2, M1M3, M1M4, M2M3, M2M4, M3M4) respectively. The procedure begins by learning the base classifiers using bootstrapped realizations with pairs of features and a classification algorithm (e.g. SVM). Subsequently, these base classifiers are used to predict the labels of the empirical training sample. The base classifiers are ranked by their sensitivity in descending order such that the highest values are at the top, resulting in F_sen, Fig. 1. Base classifiers are also ranked by their average cosine distance such that the lowest values are at the top (i.e. ascending order), resulting in F_cos, Fig. 1. The reduced subset F* comprises those classifiers that are common to the Top 50% of the base classifiers with the highest sensitivity (i.e. F_sen) and the lowest average cosine distance (i.e. F_cos). Pairs of features corresponding to the reduced set F*, i.e. {M1M2, M1M3}, along with the training sample are subsequently used to predict the labels of the test sample. The ensemble set in turn can be represented as an undirected graph with nodes represented by (M1, M2, M3) and edges by M1-M2 and M1-M3. Performance (SEN: Sensitivity, SPC: Specificity, ACC: Accuracy, PPV: Positive Predictive Value, NPV: Negative Predictive Value) is estimated by comparing the predicted and true class labels of the test sample.
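The majority-voting rule used to combine the screened base classifiers' predictions for a test sample can be sketched as follows. This is illustrative Python rather than the paper's R code; the "G"/"P" vote encoding and the function name are our assumptions.

```python
# Hedged sketch of the majority-voting rule with random tie-breaking.
import random

def majority_vote(votes, rng=random.Random(0)):
    """votes: list of 'G'/'P' labels, one per screened base classifier."""
    g, p = votes.count("G"), votes.count("P")
    if g > p:
        return "Good-Prognosis"
    if p > g:
        return "Poor-Prognosis"
    return rng.choice(["Good-Prognosis", "Poor-Prognosis"])  # random tie-break
```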
Figure 1 Ensemble predictive modeling framework using molecular markers (M1, M2, M3, M4) in pairs, resulting in six base classifiers with feature sets (M1M2, M1M3, M1M4, M2M3, M2M4, M3M4) respectively. Base classifiers trained on the bootstrapped realizations are used to predict the labels of the empirical training sample. The reduced subset generated from the Top 50% of the base classifiers ranked by their sensitivity and average cosine distance is subsequently used to predict the labels of the test sample. Network abstraction of the reduced subset is also shown.
RESULTS

Data Description and Implementation

A recent study [9] identified putative breast cancer samples used in the development of a multi-gene assay for predicting breast cancer recurrence (MammaPrint, ©Agendia). An R package was also made available through Bioconductor (mammaPrintData) [9] with detailed information on these putative samples and the 70 genes of interest. The 70 genes were identified in the very first study and shown to include those spanning critical biological processes such as cell cycle, angiogenesis and metastasis related to breast cancer recurrence [3]. The training data used in this study consisted of 78 sporadic lymph node-negative breast cancer samples (44 good-prognosis and 34 poor-prognosis) [3, 6]. In order to prevent any discrepancy in performance due to unequal sample sizes in the training data, the good-prognosis samples (N = 44) were down-sampled to the same size as the poor-prognosis samples (N = 34). Subsequently, performance of the ensemble classification approach was tested on an independent validation cohort (test data) consisting of 307 node-negative breast cancer samples (260 good-prognosis and 47 poor-prognosis) [11]. Performance measures reported in this study include sensitivity and specificity, with sensitivity representing the proportion of correctly classified poor-prognosis samples and specificity representing the proportion of correctly classified good-prognosis samples. Clinical labels of the test and training samples were based on 5-year metastasis status and used as the benchmark for performance evaluation in the present study. Performance of four different classification algorithms (SVML: SVM with Linear Kernel; NB: Naïve Bayes; LDA: Linear Discriminant Analysis; and SVMR: SVM with Radial Basis Kernel) was investigated as traditional single classifier systems and within the proposed ensemble classification approach.
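The class-balancing step described above (down-sampling the majority good-prognosis class to the size of the minority poor-prognosis class) can be sketched as follows. This is an illustrative Python sketch rather than the authors' R code; the function name is ours.

```python
# Hedged sketch of balancing the training data by down-sampling the
# majority class without replacement.
import numpy as np

rng = np.random.default_rng(0)

def downsample_majority(X, y, majority_label):
    """Randomly keep only as many majority-class rows as there are
    minority-class rows; return the balanced (X, y)."""
    maj = np.flatnonzero(y == majority_label)
    mino = np.flatnonzero(y != majority_label)
    keep = rng.choice(maj, size=len(mino), replace=False)
    idx = np.sort(np.concatenate([keep, mino]))
    return X[idx], y[idx]
```

With the paper's training data (44 good-prognosis, 34 poor-prognosis), this yields a balanced 34 + 34 training set.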
The ensemble classification approach was implemented in open-source R and used existing implementations of Support Vector Machines (SVM), the Naïve Bayes classifier (NB), and Linear Discriminant Analysis (LDA) from the R packages e1071 and MASS. Parallelization of the algorithms across a multi-core environment was accomplished using the readily available R wrappers parallel and doSNOW. Weighted undirected graph representations in .GDF (Graph Definition Format) of the ensemble sets were generated in R and subsequently displayed using a force-directed layout (ForceAtlas) in Gephi [31].
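For illustration, a minimal writer for the GDF format described above might look like the following sketch. The paper's implementation was in R; this Python function, its name, and the edge encoding are our assumptions, kept to the basic nodedef/edgedef structure that Gephi reads.

```python
# Hedged sketch: write a weighted undirected edge list (the ensemble set)
# to Gephi's GDF format, with one node per molecular marker.
def write_gdf(path, edges):
    """edges: iterable of (node1, node2, weight) tuples."""
    nodes = sorted({n for u, v, _ in edges for n in (u, v)})
    lines = ["nodedef>name VARCHAR,label VARCHAR"]
    lines += [f"{n},{n}" for n in nodes]
    lines.append("edgedef>node1 VARCHAR,node2 VARCHAR,weight DOUBLE")
    lines += [f"{u},{v},{w}" for u, v, w in edges]
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")
```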
Figure 2 Performance measures sensitivity (SEN), specificity (SPC), accuracy (ACC), positive predictive value (PPV) and negative predictive value (NPV) obtained on the test sample (N = 307) using LDA, SVML, SVMR and NB as traditional single classifiers (dotted lines) and within the proposed ensemble classification framework using the overlap between the Top 25% (solid line), Top 50% (dashed line), and Top 75% (dashed-dotted line) of the base classifiers ranked by their sensitivity and average cosine distance, shown in (a-d). Vertical lines represent the standard deviation about the average estimates across the bootstrap realizations.
Performance of the Ensemble Predictive Modeling Framework

Performance measures (sensitivity, specificity, accuracy, positive predictive value, negative predictive value) of the four classification algorithms, averaged over the bootstrap realizations across the independent test data (260 good-prognosis and 47 poor-prognosis subjects), are shown in Fig. 2. Sensitivity of the four classification techniques within the proposed ensemble framework using the Top 50% of the base classifiers was consistently better than that of their traditional single classifier counterparts that use all the features simultaneously. Standard deviation about the average estimates was also considerably lower for the proposed ensemble approach in contrast to the single classifier approach. Specificity, PPV and NPV of the four classifiers within the ensemble classification framework were similar to those obtained using them as traditional single classifiers, Fig. 2. Since the proposed approach was enriched for sensitivity, it resulted in low PPV and high NPV, as also observed across independent studies on the MammaPrint-related data [32]. Recent studies on re-analysis of the MammaPrint-related data reported performance of (sensitivity = 91%, ~42 subjects; specificity = 47%, ~122 subjects) [9], and an implementation of the MammaPrint approach achieved (sensitivity = 89%, ~41 subjects; specificity = 42%, ~109 subjects) [9]. Our earlier study [27] on the MammaPrint-related data using (SVM, NB) within a different ensemble classification framework (SVA) resulted in (sensitivity = 0.93, i.e. 0.93 × 47 ≈ 43 subjects; specificity = 0.48, i.e. 0.48 × 260 ≈ 124 subjects) for SVA-SVM and (sensitivity = 0.92, i.e. 0.92 × 47 ≈ 43 subjects; specificity = 0.39, i.e. 0.39 × 260 ≈ 101 subjects) for SVA-NB. In the present study, sensitivity and specificity of the classifiers were comparable, with similar trends across Top K with K = 25%, 50%, 75%, Fig. 2. In general, the Top K can be treated as a hyperparameter in the proposed approach and tuned using a validation data set. Cross-validation can also be used as a suitable alternative but may suffer from the drawback that the samples in the validation set are not independent of those in the training set. SVMR using the Top 50% of the base classifiers resulted in (sensitivity = 0.86, i.e. 0.86 × 47 ≈ 40 subjects; specificity = 0.60, i.e. 0.60 × 260 ≈ 156 subjects). While the sensitivity of the proposed approach is slightly lower than what we had reported earlier (3 additional misclassified poor-prognosis subjects), the specificity is considerably higher, correctly classifying 34 additional good-prognosis subjects compared to what was reported earlier [27], potentially preventing unnecessary treatment for these subjects.
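The five reported measures follow directly from the confusion matrix, with poor-prognosis treated as the positive class (as in the paper). A minimal sketch (our function, not from the paper's implementation):

```python
# Hedged sketch of the five performance measures from confusion-matrix
# counts: tp = true positives (poor-prognosis correctly flagged),
# tn = true negatives (good-prognosis correctly cleared), etc.
def performance(tp, fp, tn, fn):
    return {
        "SEN": tp / (tp + fn),                 # sensitivity
        "SPC": tn / (tn + fp),                 # specificity
        "ACC": (tp + tn) / (tp + fp + tn + fn),
        "PPV": tp / (tp + fp),                 # positive predictive value
        "NPV": tn / (tn + fn),                 # negative predictive value
    }
```

For instance, the SVMR result above (40 of 47 poor-prognosis and 156 of 260 good-prognosis samples correctly classified) corresponds to performance(tp=40, fp=104, tn=156, fn=7).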
Figure 3 Time complexity of the four algorithms (LDA, SVML, SVMR, NB) within the ensemble classification framework as a function of Top 25%, 50% and 75% of the base classifiers.
Time complexity with varying K: The for-loops in the ensemble classification approach were parallelized using R packages (parallel, doSNOW) on an Intel Xeon CPU E3-1240, 3.4 GHz, 32 GB RAM, quad-core processor running Windows 7, 64-bit OS. The R function proc.time() was used to determine the running time of the SVML, SVMR, LDA and NB algorithms within the ensemble classification framework. The elapsed time in seconds as a function of the Top K% (K = 25, 50, 75) of the base classifiers for the four ensemble algorithms exhibited a linear trend, Fig. 3. The slope was particularly steep for the NB classifier in contrast to SVML, SVMR and LDA within the ensemble classification framework. It is important to note that a more detailed investigation using training data sets of varying sizes across distinct experimental paradigms and parallel implementations of the classification algorithms (LDA, SVM, NB) may be required to determine the time complexity of the proposed ensemble approach.
Network Abstractions of Ensemble Sets
Figure 4 Network abstractions of the ensemble sets comprising only those classifiers with confidence greater than 80% across the bootstrap realizations and across the four different classification algorithms SVML, NB, LDA and SVMR, using the Top 50% of the base classifiers in the ensemble classification approach. The size of each node in the force-directed layout is proportional to its degree centrality.

Confidence of the classifiers in the ensemble sets was estimated across the bootstrap realizations for each of the four classification techniques. Those classifiers that had at least 80% confidence consistently across the four classification techniques were subsequently retrieved and their network abstraction generated. Force-directed layout of the network generated using Gephi [31], Fig. 4, revealed non-uniform degree centrality of the nodes, with
certain nodes more highly connected than others. Nodes with relatively high degree centrality corresponded to the molecules (KIAA0175, GNAZ, FLJ11354, RFC4, Contig32185, EXT1).
KIAA0175 (MELK): Homo sapiens maternal embryonic leucine zipper kinase (MELK) is an oncogenic kinase grouped under the category insensitivity to antigrowth signals in the MammaPrint profile [33]. Independent studies have also implicated over-expression of MELK and its correlation to poor survival outcomes as well as metastatic recurrence in basal-like breast cancer (e.g. triple-negative breast cancer) [34], an aggressive form of breast cancer unresponsive to targeted therapies. On the other hand, recent studies have shown that inhibition of MELK using small-molecule inhibitors (MELK-T1, OTS167) [35, 36] reduces tumor growth and lowers the threshold for DNA damage in cancer cell lines (e.g. MCF-7, MDA-MB-231, A549), potentially sensitizing the tumors to DNA-damaging agents as well as radiotherapies [37].
RFC4: Replication factor C4, 37 kDa, transcript variant 2, is grouped under the category limitless replicative potential in the MammaPrint profile [33]. Several independent studies have demonstrated the critical role of RFC4 in breast cancer. siRNA knockdown of RFC4 in ER-negative breast cancer cell lines (e.g. BT549, MDA-MB-157) was shown to result in significantly reduced proliferation [38]. RFC4 has also been shown to be considerably upregulated in poor-prognosis breast cancer subjects [39]. Genistein-treated breast cancer cell lines (e.g. MCF-7) have been shown to exhibit accompanying down-regulation of DNA replication genes such as RFC4 [40]. Other studies on the effect of investigational therapy using the HSP90 inhibitor 17AAG, currently in Phase II clinical trials, reported constitutive down-regulation of RFC4 following treatment across sensitive as well as resistant breast cancer cell lines [41].
GNAZ and EXT1: GNAZ (guanine nucleotide binding protein, alpha z polypeptide) has been implicated under self-sufficiency in growth signals, whereas the tumor suppressor EXT1 (exostosin 1) is grouped under the category insensitivity to anti-growth signals, with possible involvement in epithelial-mesenchymal transition, in the MammaPrint profile [33]. Germline mutations in EXT1 have been associated with different forms of sarcomas, and hypermethylation of the CpG island in EXT1 was also observed in certain types of leukemia [42]. A recent study on EXT1 revealed its decreased expression in metastatic breast cancer and also its potential to serve as a marker for assessing the risk of metastasis [43].
CONCLUSION

Characteristic changes in molecular mechanisms often precede the clinical presentation of disease. Modeling approaches can be useful in predicting clinical outcomes from molecular expression profiles so as to enable timely treatment decisions. The present study investigated an ensemble predictive modeling framework for predicting good-prognosis and poor-prognosis breast cancer samples from their 70-gene signatures using publicly available data. The ensemble model was enriched for sensitivity and for diversity using the average cosine distance. Its performance was shown to be better than that of traditional single classifiers that use all features simultaneously. Network abstraction of the ensemble sets was also shown to reveal critical molecules involved in classifying good- and poor-prognosis breast cancer samples.
REFERENCES
1. Golub TR, Slonim DK, Tamayo P et al., Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 1999; 286(5439): 531-7.
2. Perou CM, Sorlie T, Eisen MB et al., Molecular portraits of human breast tumours. Nature, 2000; 406(6797): 747-52.
3. van 't Veer LJ, Dai H, van de Vijver MJ et al., Gene expression profiling predicts clinical outcome of breast cancer. Nature, 2002; 415(6871): 530-6.
4. Reis-Filho JS and Pusztai L, Gene expression profiling in breast cancer: classification, prognostication, and prediction. Lancet, 2011; 378(9805): 1812-23.
5. Butte AJ, Sigdel TK, Wadia PP et al., Protein microarrays discover angiotensinogen and PRKRIP1 as novel targets for autoantibodies in chronic renal disease. Mol Cell Proteomics, 2011; 10(3): M110.000497.
6. Glas AM, Floore A, Delahaye LJ et al., Converting a breast cancer microarray signature into a high-throughput diagnostic test. BMC Genomics, 2006; 7: 278.
7. Marchionni L, Wilson RF, Wolff AC et al., Systematic review: gene expression profiling assays in early-stage breast cancer. Ann Intern Med, 2008; 148(5): 358-69.
8. Marchionni L, Wilson RF, Marinopoulos SS et al., Impact of gene expression profiling tests on breast cancer outcomes. Evid Rep Technol Assess (Full Rep), 2007(160): 1-105.
9. Marchionni L, Afsari B, Geman D et al., A simple and reproducible breast cancer prognostic test. BMC Genomics, 2013; 14: 336.
10. Paik S, Is gene array testing to be considered routine now? Breast, 2011; 20 Suppl 3: S87-91.
11. Buyse M, Loi S, van't Veer L et al., Validation and clinical utility of a 70-gene prognostic signature for women with node-negative breast cancer. J Natl Cancer Inst, 2006; 98(17): 1183-92.
12. Modlich O and Bojar H, Can a 70-gene signature provide useful prognostic information in patients with node-negative breast cancer? Nat Clin Pract Oncol, 2007; 4(4): 216-7.
13. Dietterich T, Ensemble methods in machine learning, in Multiple Classifier Systems. Springer Berlin Heidelberg, 2000; p. 1-15.
14. Hödar C, Assar R, Colombres M et al., Genome-wide identification of new Wnt/β-catenin target genes in the human genome using CART method. BMC Genomics, 2010; 11: 348.
15. Dettling M and Bühlmann P, Boosting for tumor classification with gene expression data. Bioinformatics, 2003; 19(9): 1061-1069.
16. Ben-Dor A, Bruhn L, Friedman N et al., Tissue classification with gene expression profiles. Journal of Computational Biology, 2000; 7(3-4): 559-583.
17. Sung CO and Sohn I, The expression pattern of 19 genes predicts the histology of endometrial carcinoma. Scientific Reports, 2014; 4.
18. Breiman L, Bagging predictors. Machine Learning, 1996; 24(2): 123-140.
19. Freund Y and Schapire RE, A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 1997; 55(1): 119-139.
20. Freund Y, Boosting a weak learning algorithm by majority. Information and Computation, 1995; 121(2): 256-285.
21. Friedman J, Hastie T and Tibshirani R, Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). The Annals of Statistics, 2000; 28(2): 337-407.
22. Friedman JH, Greedy function approximation: a gradient boosting machine. Annals of Statistics, 2001: 1189-1232.
23. Kuncheva LI, Combining Pattern Classifiers: Methods and Algorithms. John Wiley & Sons, 2004.
24. Ho TK, Random decision forests, in Proceedings of the Third International Conference on Document Analysis and Recognition. IEEE, 1995; p. 278-282.
25. Ho TK, The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1998; 20(8): 832-844.
26. Breiman L, Random forests. Machine Learning, 2001; 45(1): 5-32.
27. Nagarajan R and Upreti M, An approach for deciphering patient-specific variations with application to breast cancer molecular expression profiles. Journal of Biomedical Informatics, 2016; 63: 120-130.
28. Nagarajan R, Miller CS, Dawson III D et al., Patient-specific variations in biomarkers across gingivitis and periodontitis. PLoS One, 2015; 10(9): e0136792.
29. Nagarajan R, Al-Sabbagh M, Dawson D et al., Integrated biomarker profiling of smokers with periodontitis. Journal of Clinical Periodontology, 2016.
30. Comprehensive molecular portraits of human breast tumours. Nature, 2012; 490(7418): 61-70.
31. Bastian M, Heymann S and Jacomy M, Gephi: an open source software for exploring and manipulating networks. ICWSM, 2009; 8: 361-362.
32. Wittner BS, Sgroi DC, Ryan PD et al., Analysis of the MammaPrint breast cancer assay in a predominantly postmenopausal cohort. Clin Cancer Res, 2008; 14(10): 2988-93.
33. Tian S, Roepman P, van't Veer LJ et al., Biological functions of the genes in the mammaprint breast cancer profile reflect the hallmarks of cancer. Biomarker Insights, 2010; 5: 129.
34. Wang Y, Lee YM, Baitsch L et al., MELK is an oncogenic kinase essential for mitotic progression in basal-like breast cancer cells. Elife, 2014; 3: e01763.
35. Beke L, Kig C, Linders JT et al., MELK-T1, a small-molecule inhibitor of protein kinase MELK, decreases DNA-damage tolerance in proliferating cancer cells. Bioscience Reports, 2015; 35(6): e00267.
36. Chung S, Kijima K, Kudo A et al., Preclinical evaluation of biomarkers associated with antitumor activity of MELK inhibitor. Oncotarget, 2016; 7(14): 18171.
37. Speers C, Zhao SG, Kothari V et al., Maternal embryonic leucine zipper kinase (MELK) as a novel mediator and biomarker of radioresistance in human breast cancer. Clinical Cancer Research, 2016; 22(23): 5864-5875.
38. Srihari S, Kalimutho M, Lal S et al., Understanding the functional impact of copy number alterations in breast cancer using a network modeling approach. Mol Biosyst, 2016; 12(3): 963-72.
39. Sotiriou C, Neo S-Y, McShane LM et al., Breast cancer classification and prognosis based on gene expression profiles from a population-based study. Proceedings of the National Academy of Sciences, 2003; 100(18): 10393-10398.
40. Chen W-F, Huang M-H, Tzang C-H et al., Inhibitory actions of genistein in human breast cancer (MCF-7) cells. Biochimica et Biophysica Acta (BBA) - Molecular Basis of Disease, 2003; 1638(2): 187-196.
41. Zajac M, Gomez G, Benitez J et al., Molecular signature of response and potential pathways related to resistance to the HSP90 inhibitor, 17AAG, in breast cancer. BMC Med Genomics, 2010; 3: 44.
42. Ropero S, Setien F, Espada J et al., Epigenetic loss of the familial tumor-suppressor gene exostosin-1 (EXT1) disrupts heparan sulfate synthesis in cancer cells. Hum Mol Genet, 2004; 13(22): 2753-65.
43. Taghavi A, Akbari ME, Hashemi-Bahremani M et al., Gene expression profiling of the 8q22-24 position in human breast cancer: TSPYL5, MTDH, ATAD2 and CCNE2 genes are implicated in oncogenesis, while WISP1 and EXT1 genes may predict a risk of metastasis. Oncol Lett, 2016; 12(5): 3845-3855.
Highlights
- Ensemble Predictive Modeling Approaches
- Breast Cancer Classification from Molecular Expression Profiles
- Network Abstractions of Ensemble Sets