Computers in Biology and Medicine 86 (2017) 98–106
A CBR framework with gradient boosting based feature selection for lung cancer subtype classification

Juan Ramos-González*, Daniel López-Sánchez, José A. Castellanos-Garzón, Juan F. de Paz, Juan M. Corchado

Department of Computer Science and Automation, Faculty of Science, University of Salamanca, Plaza de los Caídos, s/n, 37008 Salamanca, Spain

* Corresponding author. E-mail addresses: [email protected] (J. Ramos-González), [email protected] (D. López-Sánchez).
Keywords: Case-based reasoning, Gradient boosting, NSCLC, Biomarker, Microarray

Abstract
Molecular subtype classification represents a challenging field in lung cancer diagnosis. Although different methods have been proposed for biomarker selection, efficient discrimination between adenocarcinoma and squamous cell carcinoma in clinical practice presents several difficulties, especially when the latter is poorly differentiated. This is an area of growing importance, since certain treatments and other medical decisions are based on molecular and histological features. An urgent need exists for a system and a set of biomarkers that provide an accurate diagnosis. In this paper, a novel Case Based Reasoning framework with gradient boosting based feature selection is proposed and applied to the task of squamous cell carcinoma and adenocarcinoma discrimination, aiming to provide accurate diagnosis with a reduced set of genes. The proposed method was trained and evaluated on two independent datasets to validate its generalization capability. Furthermore, it achieved accuracy rates greater than those of traditional microarray analysis techniques, incorporating the advantages inherent to the Case Based Reasoning methodology (e.g. learning over time, adaptability, interpretability of solutions, etc.).
1. Introduction
Lung cancer is the leading cause of cancer deaths worldwide [1]. It involves a very complex and heterogeneous group of pathologies that we are only beginning to understand, with non-small cell lung cancer (NSCLC) being the most prevalent type [2,3], as it represents close to 90% of all lung cancers [4]. The ability to distinguish between lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC) is critical for diagnosis and treatment [5]. However, the specific biomarkers which should be used in clinical practice remain unclear [6], since different candidate genes have been proposed whose classification potential varies depending on their combination with others. Moreover, there are also several methodological differences in biomarker research [5], and usually few samples from patients are available. Advances in understanding the molecular processes occurring in lung cancer help researchers develop new treatment strategies and accelerate diagnosis. Unfortunately, early detection is often not possible, and approximately 40% of patients are diagnosed at stage IV [7]. This is one of the many reasons to increase our knowledge about molecular processes in lung cancer and to improve biomarker detection.

Case-based reasoning (CBR) has been used in medical research due to its learning capability and self-adaptability to solve diagnosis problems by recalling information obtained from previous experiences [8]. Moreover, CBR also has other qualities that make it an appropriate approach for this diagnosis problem, such as its ease of application [8] and flexibility in retrieving previous experiences [9]. In this paper, a novel CBR system with gradient boosting based feature selection is proposed and applied to the task of squamous cell carcinoma and adenocarcinoma discrimination, aiming to provide a precise diagnosis of lung cancer subtypes. As opposed to previous gene expression-based CBR systems, our proposal uses a sophisticated feature selection procedure based on Gradient Boosted Regression Trees which has never been employed before in the context of CBR. This feature selection method, combined with basic preliminary filtering, enables our system to identify a small set of genes and achieve high predictive accuracy. Therefore, our proposal aims to use this group of biomarkers with high classification potential to discriminate the mentioned molecular subtypes of NSCLC. A major goal is to achieve a classification approach capable of learning and adapting to new data in an automatic manner, since few data samples are usually available and the database will grow over time. This approach aims to support rapid and accurate classification. In this regard, some CBR characteristics are critical for the present case study, such as the capability of incorporating corrections made by other diagnosis methods (revise and retain stages) and of achieving high accuracy with a reduced dataset, with accuracy increasing as new data samples become available. Moreover, there is natural variability in gene expression as well as methodological differences between datasets; when classifying new data, accuracy, and thus diagnosis support capability, could be negatively affected. Nevertheless, the CBR system addresses this problem, and we show that this model is particularly applicable to the transcriptomic domain, since it reduces the error rate with the incorporation of new cases. Other advantages of this feature selection model include its resistance to noise and non-informative features and its ease of parallelization (i.e. this method can be easily implemented to take advantage of the multiprocessing capabilities of most modern computing hardware). Given that this selection method does not mix or combine gene expression values from different genes, the selected genes can be interpreted from a biological point of view and contrasted with biomarkers previously identified in the literature.

The rest of this article is structured as follows: section 2 summarizes the relevant works present in the available literature; the proposed framework is described in detail in section 3; section 4 includes the experimental results obtained by the proposed system and a comparison with other methods. Finally, in section 5, we discuss the experimental results and outline some promising future lines of research.

2. Related work

Case Based Reasoning (CBR) is a problem solving approach that goes beyond general knowledge and makes use of specific solutions obtained in past experiences. CBR solves new problems by searching and reusing similar cases from the past. Thus, this approach applies incremental learning, retaining each new case in the case-base for future problems. Such characteristics constitute the main difference between CBR and other artificial intelligence methodologies, making it an increasingly used approach in the scientific domain [10]. These systems have been applied to solve different diagnostic problems in the medical domain [8,11]. However, there are few CBR approaches in microarray data classification, and they do not tend to focus on specific problems and are only tested on simple data sets [12]. In this area, classification is decisive for diagnosis and treatment, but high dimensionality and data noise affect the performance of many algorithms [12]. In microarray datasets, reliable knowledge extraction is still challenging due to the small number of training samples available in databases [12]. Such datasets present a great number of attributes, much higher than the number of training samples [13], making it necessary to reduce the number of attributes in the cases stored in the case base [14]. Thus, feature selection has received more attention in recent years as a way of improving CBR based classifiers [13].

In this regard, Arshadi and Jurisica [14] included feature selection techniques in a CBR system as a way of managing high dimensional data and tested it over a lung cancer dataset, showing an increase in accuracy from 60% to 70% when compared with a similar system without a selection stage. Motivated by this and other approaches, such as that of Huang et al. [15], Anaissi et al. [13] successfully combined feature selection methods with feature weighting, improving the retrieval process of a CBR system and obtaining high accuracies in cancer classification. Fuzzy logic has also been applied to CBR. Fdez-Riverola et al. [16] proposed a method involving a fuzzy representation of the gene expression levels of each case, based on the discretization of expression data into a small number of fuzzy membership functions. Their method was shown to be able to generalize over all the samples, diminishing the number of genes needed for classification. Redundancy reduction has received great attention in previous approaches [17,18], and several works from statistics and machine learning have focused on dimensionality reduction in the field of microarray data. Such methods have been divided into four categories: filters, wrappers, embedded and ensemble [19]. Since CBR allows for the combination of various methods, it is common to integrate it with feature selection and other techniques in order to improve classification tasks. In this regard, Glez-Peña et al. [20] developed a freely available software tool that allows for the use of combined techniques in gene selection, clustering, knowledge extraction, and prediction to aid cancer diagnosis.

The search for biomarkers constitutes a very active field, since it contributes to improving diagnosis and provides knowledge about molecular processes in cancer. Several works focus on studying the differences between adenocarcinoma and squamous cell carcinoma of the lung; most authors employ immunohistochemical techniques for validation, and few bioinformatic approaches to this problem exist. Genes such as DSC3, KRT5 and KRT6 for LUSC, and TTF-1 and Napsin-A for LUAD, among others, have been reported by different authors [5,21–23] as high-potential biomarkers. However, in the most recent proposals, new markers such as MLPH, TMC5 and SFTA3 [5], or ST6GALNAC1 and SPATS2 [23], are proposed from time to time for NSCLC discrimination. Recently, Takamochi et al. [23] presented the first Cap Analysis of Gene Expression (CAGE) survey of primary tumors for this specific problem, also focusing on clustering analysis and validating the extracted knowledge on a different dataset. They achieved high accuracy by means of immunohistochemical techniques, but the validation set was small and required a better evaluation stage. In summary, several biomarkers have been proposed and new advances have broadened our knowledge about the cellular processes occurring in lung cancer. However, two main problems still need to be addressed: (1) the difficulty of applying new knowledge in clinical practice, since validation datasets are usually not large enough, and (2) the wide range of methodologies that are difficult to automate, which also makes it difficult to design protocols and choose specific methods and biomarkers for an efficient diagnosis. The next section describes the proposed framework.
3. Proposed framework

The proposed CBR framework and the feature selection procedures adopted for this study are presented in this section. The proposed system applies two successive feature selection stages during the training process. Most genes uncorrelated with the desired system output are discarded during the first feature selection stage. A second, more sophisticated feature selection stage allows the system to attain high accuracy rates while considering only a small set of gene expression levels. Once the most informative genes have been identified (i.e. the case representation has been determined), the case-base of the CBR system is built with the selected genes of all the available microarray samples. A simple, distance-based approach is then taken during the retrieve stage. The reuse stage is performed by means of a weighted variant of the k-Nearest Neighbours (kNN) algorithm, which provides a solid, probabilistically interpretable output. The system is able to learn from revised cases through the revision-retaining procedure. The four CBR stages are formally described in sections 3.3 and 3.4. Experimental results that validate the learning capabilities of the designed system are presented in section 4. In addition, the effectiveness of the proposed framework is compared to other classic microarray feature selection and classification techniques. Fig. 1 presents a high-level overview of the proposed framework. The information flows during the training and test stages are depicted separately. As shown in the figure, the number of genes in the training microarray dataset is first reduced by the
preliminary feature selection procedure described in section 3.1. Next, a Gradient Boosting-based feature selection procedure is applied (as described in section 3.2). The training phase ends with the building of the case-base, which stores only the selected genes' expression levels for the available cases. Both the preliminary and the Gradient Boosting feature selection modules store the names of the selected genes. In this manner, when a new case is presented to the CBR system, the two feature selection modules provide the information needed to discard uninformative genes before the actual retrieval stage begins. The following subsections describe the building blocks of the framework.

Fig. 1. Overview of the proposed CBR system.
3.1. Preliminary feature selection

This section describes the statistical prefiltering stage. Its goal is to reduce dimensionality and select a set of relevant genes that are good candidates for biomarkers, providing the CBR system with a basic criterion on which to build the classifier. Genes whose expression level variation is not statistically related to the difference between subtypes must be discarded. To this end, the non-parametric Mann-Whitney test is used, with the null hypothesis that the samples come from the same population and the alternative hypothesis that they come from different populations. Once the test is applied, the genes with a resulting p-value < 0.05 are selected. At this point, we have a subset of significant genes. Nevertheless, for a more restrictive selection we need genes that are not just related to the difference between subtypes, but that also present important variations in their expression levels. Thus, the expression level variance is computed for each gene with a p-value < 0.05 in the Mann-Whitney test. To reduce the number of genes, the midpoint of the range of variance values of these genes is used as a threshold, and the genes whose variance exceeds this cutoff are selected. After these two filtering stages (Mann-Whitney and variance cutoff), a small subset of high-variance, statistically significant genes (p-value < 0.05) is obtained. Note that, once the Mann-Whitney test has been applied, we use the variance, rather than the p-value order, as the subsequent filter criterion: given that a low number of genes is required, it is important that these genes present a large expression (and probably phenotypic) change, whose repercussion on the molecular processes is more easily distinguished. The complete process of preliminary feature selection is formalized in Algorithm 1.
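To make the two filtering steps concrete, the following is a minimal sketch of this stage in Python, assuming an expression matrix X of shape (samples, genes) and binary subtype labels y; the function and variable names are illustrative and not taken from the authors' implementation.

```python
# Hedged sketch of the preliminary feature selection of section 3.1:
# Mann-Whitney U filtering (p < 0.05) followed by a variance cutoff at the
# midpoint of the variance range of the significant genes.
import numpy as np
from scipy.stats import mannwhitneyu

def preliminary_selection(X, y, alpha=0.05):
    """X: (n_samples, n_genes) expression matrix; y: binary labels (0/1).
    Returns the indices of the genes that survive both filters."""
    group_a, group_b = X[y == 0], X[y == 1]

    # 1) Keep genes whose expression differs significantly between subtypes.
    pvals = np.array([
        mannwhitneyu(group_a[:, j], group_b[:, j],
                     alternative="two-sided").pvalue
        for j in range(X.shape[1])
    ])
    significant = np.where(pvals < alpha)[0]

    # 2) Among significant genes, keep those whose variance exceeds the
    #    midpoint of the observed variance range.
    variances = X[:, significant].var(axis=0)
    threshold = (variances.min() + variances.max()) / 2.0
    return significant[variances > threshold]
```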
3.2. Feature selection with gradient boosting

This section provides a detailed description of the second feature selection stage that the proposed CBR system applies. This second feature selection stage is based on a well-known method called Gradient Boosted Regression Trees (GBRT) [24]. In spite of its name, the GBRT algorithm is often used for classification rather than regression. However, in the context of the proposed CBR framework, GBRT is used solely as a feature selection method. Although the number of genes under consideration is greatly reduced after the first selection stage, an additional refinement of the selected features (i.e. the case representation in the CBR system) enables higher accuracy rates. Furthermore, reducing the number of genes that need to be considered for case analysis can drastically increase the applicability of the proposed system. The key idea of GBRT consists of building a predictive model as an ensemble of simple models. The output of the ensemble is computed as the weighted sum of M weak models. Formally, the output of GBRT for a test sample x is computed as:
$$F(x) = \sum_{m=1}^{M} \gamma_m h_m(x) \qquad (1)$$
where h_i(x) is the output of the i-th weak learning model and γ_i is a constant known as the step length, which controls the contribution of the i-th model to the overall output. As opposed to other ensemble methods, these base models are not all trained to produce the same desired output. Instead, the models are built sequentially, refining the output of the previous stage:

$$F_m(x) = F_{m-1}(x) + \gamma_m h_m(x) \qquad (2)$$
Therefore, at each learning stage the current weak model h_m(·) and the step length γ_m are trained to approximate the difference between the output of the previous stage and the desired overall output (i.e. the residuals). In the case of GBRT, the base models consist of regression trees. For more details on the training process of GBRT see Refs. [24,25].

When used for feature selection, the GBRT model is built as usual. After the M base models have been built, they are analyzed in order to determine the relative importance of each input feature. This process depends on the specific split criterion used when the trees were built. The Mean Squared Error (MSE), one of the simplest split criteria, was used in this case. The MSE of a node t in a regression tree is defined as a function of the desired outputs of the training samples in that node:

$$\mathrm{MSE}(t) = \frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2 \qquad (3)$$

where {y_1, y_2, ⋯, y_n} are the desired outputs for the n samples at node t and ȳ is their average value. In the case of the first base model of the ensemble, these desired outputs are the class labels of the samples (i.e. either zero or one). As mentioned before, the successive regression trees are trained on the residuals at each stage. If a given node t defines a good split criterion, the MSE of its children will be lower than that of their father. It is possible to measure the effectiveness of the split in a node as the obtained decrease in the MSE, formally:

$$\mathrm{MSEdecrease}(t) = \mathrm{samples}(t)\,\mathrm{MSE}(t) - \mathrm{samples}(\mathrm{leftChild}(t))\,\mathrm{MSE}(\mathrm{leftChild}(t)) - \mathrm{samples}(\mathrm{rightChild}(t))\,\mathrm{MSE}(\mathrm{rightChild}(t)) \qquad (4)$$

Note that the MSE of each child node is weighted by the number of samples in that node. Once a measure for the quality of a given split has been defined, it becomes possible to obtain an estimate of the importance of each feature (i.e. each gene) in a single decision tree. To this end, we use a formula based on the work by Friedman [24], but modified to work with the MSE-based measure defined above. Formally, the importance of feature j according to the regression tree h_m is:

$$I_j(h_m) = \frac{1}{n}\sum_{t} \mathrm{MSEdecrease}(t)\cdot \mathbb{1}(v_t = j) \qquad (5)$$

where the summation is over the non-terminal nodes of tree h_m, v_t is the splitting variable evaluated at node t, and n is the number of samples at the root node of h_m. Consider, for example, the first tree h_0 built during one of our experiments (see Fig. 2). According to this tree, only two features have an importance greater than zero, namely the expression levels of genes CLCA2 and AKR1B10. Applying eq. (5), the respective importances are I(h_0)_CLCA2 = 0.1659 and I(h_0)_AKR1B10 = 0.0198.

Fig. 2. Example of a simple tree with depth two. Sample distribution and MSE measure are shown at each node.

To determine the importance of each feature according to the complete ensemble of trees obtained by boosting, {h_1, h_2, ⋯, h_M}, the importances are averaged over the individual models:

$$I_j = \frac{1}{M}\sum_{m=1}^{M} I_j(h_m) \qquad (6)$$

The feature importance estimates are then normalized so they add up to exactly one, and arranged in a vector I⃗ with a length equal to the number of original features. We use this vector of feature importances to refine the case representation described in the previous section. At this point, two different feature selection approaches can be taken:

• Retain all the features with non-zero importance. In this case, the refined representation x′ of a sample x with features [x_j], j = 1, …, d, is computed as follows:

$$x' = \left( I_j \cdot x_j \;\middle|\; I_j \neq 0 \right) \qquad (7)$$

Note that the features are not only selected, but also weighted according to their importance.

• Retain only the k most important features. In this case, the number of genes used by the system can be set manually, thus increasing its applicability in real-world scenarios. Formally, the refined representation x′ of a sample x is computed as follows:

$$x' = \left( I_j \cdot x_j \;\middle|\; I_j \geq th \right) \qquad (8)$$

where th is an adaptive threshold equal to the k-th greatest value in I⃗. However, manually reducing the number of genes that the system must consider may reduce the accuracy of the framework; this trade-off between gene number and accuracy is empirically studied in the experimental results section. The resulting accuracies obtained with these two approaches are compared in section 4. In addition, a biological interpretation of the selected genes is provided. Note that this is possible because the feature selection methods applied do not merge the expression levels of different genes to produce the final representation (i.e. each feature of the final representation corresponds to the expression level of a single gene). The feature selection procedure explained in this section is summarized in Algorithm 2.
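As an illustration of how this selection could be reproduced, the sketch below uses scikit-learn's gradient boosting implementation, whose feature_importances_ attribute computes an impurity-decrease measure analogous (though not identical) to eqs. (5) and (6); the helper names and parameter values are assumptions for illustration, not the authors' code.

```python
# Hedged sketch of the GBRT-based secondary selection of section 3.2, with
# the two strategies of eqs. (7) and (8): keep all non-zero importances, or
# keep only the top-k importances.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def gbrt_selection(X, y, k=None, n_estimators=100):
    """Returns (selected_indices, weights), where weights are the
    (normalized) feature importances of the fitted ensemble."""
    gbrt = GradientBoostingClassifier(n_estimators=n_estimators)
    gbrt.fit(X, y)
    importances = gbrt.feature_importances_  # normalized to sum to one

    if k is None:                        # eq. (7): all non-zero importances
        selected = np.where(importances > 0)[0]
    else:                                # eq. (8): top-k importances
        selected = np.argsort(importances)[::-1][:k]
    return selected, importances[selected]

def refine(X, selected, weights):
    """Refined case representation x' = I_j * x_j over the selected genes."""
    return X[:, selected] * weights
```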
3.3. Retrieve and reuse

Once the feature selection procedures have been completed, the case-base is built. Formally, the case-base consists of a set of n pairs CB = {(x_1, y_1), ⋯, (x_n, y_n)}, where x_i is the case representation of a sample and y_i its corresponding solution. When the system deals with binary classification problems (e.g. cancer subtype discrimination), each solution y_i is a class label with a value of either zero or one. As mentioned before, a simple distance-based approach is adopted to perform the task of case retrieval. When a new, unlabeled sample is presented to the system, it is first pre-processed by the feature selection modules to obtain a case representation x. After that, the k-nearest neighbors of x in CB are determined. Although usually only a small number of initial cases is present in the case-base, this step will scale poorly as new cases become available. Specifically, finding the k-nearest neighbors of x in CB requires a time of O(ndk), where d is the number of features a case consists of, and n is the number of cases stored in CB. A variety of techniques has been proposed in the literature to overcome this limitation. In the proposed framework, a space partitioning method known as KD-tree [26] is used to guarantee a sub-linear query time as the number of cases stored in CB increases. The reuse stage begins after the neighbors have been found. Here, the distances from the k-nearest neighbors to the unlabeled sample are used to compute what is called the un-normalized weight vector [27]:

$$\vec{w}_j = \frac{1}{\lVert x - x_j \rVert_2}, \quad j = 1, 2, \cdots, k \qquad (9)$$

where ||x − x_j||_2 is the Euclidean distance between the unlabeled case x and its j-th nearest neighbor (in the special case where ||x − x_j||_2 = 0, the following steps are skipped and the corresponding class of x_j is directly predicted for the unlabeled sample x). Then, the probability of x belonging to the class with label l is computed as follows:

$$P(y = l \mid x, CB) = \frac{\sum_{j=1}^{k} \vec{w}_j \cdot \mathbb{1}(y_j = l)}{\sum_{j=1}^{k} \vec{w}_j} \qquad (10)$$

Note that, conveniently:

$$\sum_{l} P(y = l \mid x, CB) = 1 \qquad (11)$$

where the summation is over the set of all possible class labels. At the end of the reuse stage, the label of the unclassified sample x is predicted to be the one with the highest probability:

$$\hat{y} = \arg\max_{l} P(y = l \mid x, CB) \qquad (12)$$

3.4. Revise and retain

The revise and retain stages are in charge of enabling the learning process of the CBR system. When an unlabeled case is presented to the CBR system, the sample goes through the two consecutive feature selection stages (which were trained before the case-base was built), and the retrieve and reuse stages emit a prediction concerning its class label. In some cases, a revision stage can be performed. Here, alternative diagnosis methods are applied to determine whether the original prediction of the CBR system was correct or not. The diagnosis information required during the revision stage is obtained by means of different medical tests which may depend on the case, but usually include a combination of imaging and molecular techniques. Regarding image-based techniques, computed tomography, magnetic resonance, and bronchoscopy are commonly used [28,29]. A biopsy, a sputum cytology and a series of molecular marker studies are also required in order to better characterize the tumor and the presence of specific proteins and growth factors [28]. Once the revision stage has been completed for a given case, it is stored in the case-base along with its revised solution. The accuracy of the CBR system is expected to grow as new revised cases become available in the case-base. Experimental results concerning the ability of the proposed CBR framework to learn are reported in section 4.
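The following is a minimal sketch of the retrieve and reuse stages of section 3.3 together with the retain step described above, using a KD-tree index as in the paper [26]; the class and method names are illustrative, not the authors' code.

```python
# Hedged sketch of retrieve/reuse (eqs. (9)-(12)) and retain. A KD-tree
# accelerates neighbor retrieval; probabilities follow inverse-distance
# weighting with LUSC/LUAD encoded as binary labels.
import numpy as np
from scipy.spatial import cKDTree

class CaseBase:
    def __init__(self, X, y, k=5):
        self.X, self.y, self.k = X, np.asarray(y), k
        self.tree = cKDTree(X)          # retrieval index over case features

    def reuse(self, x):
        """Returns (P(y=0), P(y=1)) for case representation x, eq. (10)."""
        dist, idx = self.tree.query(x, k=self.k)
        if np.any(dist == 0):           # exact match: predict its class (footnote 1)
            label = self.y[idx[dist == 0][0]]
            return np.array([1.0 - label, float(label)])
        w = 1.0 / dist                  # eq. (9): un-normalized weights
        p1 = w[self.y[idx] == 1].sum() / w.sum()
        return np.array([1.0 - p1, p1])  # sums to one, eq. (11)

    def retain(self, x, revised_label):
        """Store a revised case and rebuild the index (section 3.4)."""
        self.X = np.vstack([self.X, x])
        self.y = np.append(self.y, revised_label)
        self.tree = cKDTree(self.X)

# Usage: the predicted label (eq. (12)) is the arg-max over probabilities,
# e.g. y_hat = CaseBase(X_train, y_train).reuse(x_new).argmax()
```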
4. Case study

In this section, we evaluate the effectiveness of the proposed CBR system in the task of diagnosing the non-small cell lung cancer subtype. We also evaluate the effectiveness of the proposed system with and without the two feature selection stages described in sections 3.1 and 3.2.

4.1. Non-small cell lung cancer datasets

As described in the Introduction, the CBR system uses a set of biomarkers as the basic criterion for classifying tissue samples. Thus, the gene selection process is a critical stage for the good performance of our approach. Two different datasets of lung adenocarcinoma (LUAD) and squamous cell carcinoma (LUSC) gene expression profiles obtained by array were used in our experiments. More specifically, the chip model is the Affymetrix U133 Plus 2.0 whole-genome microarray. Both NSCLC tissue sample datasets come from the public repository of the National Center for Biotechnology Information (NCBI). One of the datasets also contained a small number of large cell carcinoma tissue samples, but they were removed as they were not needed for our experimental purposes. In order to properly evaluate the generalization capabilities of the proposed CBR system and the alternative approaches, we used the first dataset for training, while the second one was employed solely during the evaluation stage. The datasets are publicly available in Ref. [30] (training) and Ref. [31] (evaluation). The training dataset contains 58 tissue samples, 40 from adenocarcinoma and 18 from squamous cell carcinoma. The evaluation dataset contains 172 samples (after removing the large cell carcinoma samples), of which 66 correspond to squamous cell carcinoma.
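For readers who wish to retrieve these datasets programmatically, the sketch below assumes the third-party GEOparse package; the accession IDs are those cited in Refs. [30,31], but the exact preprocessing applied by the authors is not specified beyond the chip model.

```python
# Hedged sketch of dataset retrieval from NCBI GEO (accessions from
# refs. [30] and [31]); requires the GEOparse package.
import GEOparse

train = GEOparse.get_GEO(geo="GDS3627", destdir="./data")   # training set [30]
test = GEOparse.get_GEO(geo="GSE37745", destdir="./data")   # evaluation set [31]

# For a GDS record, expression values are exposed as a pandas DataFrame with
# one column per sample; subtype annotations come from the subset metadata.
expression = train.table
print(expression.shape)
```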
Table 1. Experimental results for the proposed CBR system with various combinations of the feature selection methods. Results on the learning capabilities of the CBR system after a number of revision stages have been executed are also reported.

| Preliminary F. Selection | Secondary F. Selection | Classifier | Acc. (initial case-base) | κ (initial) | Prec/rec (initial) | Acc. (40 cases retained) | κ (40 retained) | Prec/rec (40 retained) |
|---|---|---|---|---|---|---|---|---|
| Var & MW (53 feat.) | – | CBR (wkNN) | 94.1% | 0.875 | 89.3/95.1% | 95.45% | 0.900 | 87.7/100.0% |
| Var & MW (53 feat.) | GBRT (non-zero), 24 features | CBR (wkNN) | 97.0% | 0.938 | 95.4/96.9% | 97.7% | 0.950 | 93.8/100.0% |
| Var & MW (53 feat.) | GBRT (top-5), 5 features | CBR (wkNN) | 96.5% | 0.926 | 95.4/95.4% | 97.7% | 0.950 | 93.8/100.0% |
| – (54,613 feat.) | – | CBR (wkNN) | 95.5% | 0.901 | 93.9/93.9% | 96.2% | 0.919 | 97.9/92.1% |
| – (54,613 feat.) | GBRT (non-zero), 20 features | CBR (wkNN) | 91.2% | 0.817 | 90.0/86.9% | 92.4% | 0.837 | 91.6/88.0% |
| – (54,613 feat.) | GBRT (top-5), 5 features | CBR (wkNN) | 90.6% | 0.807 | 93.9/83.7% | 90.9% | 0.807 | 91.6/84.6% |
Table 2. Experimental results on CBR classification with various gene selection methods from the literature. Results on the learning capabilities of the CBR system after a number of revision stages have been executed are also reported.

| Gene Selection | Classifier | Acc. (initial case-base) | κ (initial) | Prec/rec (initial) | Acc. (40 cases retained) | κ (40 retained) | Prec/rec (40 retained) |
|---|---|---|---|---|---|---|---|
| Boruta [32], 27 features | CBR (wkNN) | 96.5% | 0.925 | 92.4/98.3% | 96.2% | 0.917 | 93.7/95.7% |
| Spikeslab [33], 54 features | CBR (wkNN) | 87.2% | 0.711 | 66.6/100.0% | 95.4% | 0.902 | 95.8/92.0% |
| SDA [34], 24 features | CBR (wkNN) | 96.5% | 0.926 | 95.4/95.4% | 93.9% | 0.867 | 89.5/93.4% |
| Alpha-Investing [35], 24 features | CBR (wkNN) | 88.95% | 0.774 | 95.4/79.7% | 93.1% | 0.849 | 85.4/95.3% |
| RFE [36], 24 features | CBR (wkNN) | 96.5% | 0.925 | 93.9/96.8% | 95.4% | 0.900 | 91.6/95.6% |
4.2. Experimental results

The first set of experiments was carried out to validate the effectiveness of the proposed feature selection stages. In addition, the ability of the proposed CBR framework to learn as new revised cases become available was evaluated. To this end, the CBR system was trained on the samples from the first dataset [30] and its classification accuracy was evaluated on the second dataset [31]. In addition to the accuracy measure, the precision/recall and Cohen's kappa [37] measures were computed, taking LUSC as the positive class. We decided to do so because the LUSC class is under-represented in the accuracy measure due to its smaller presence in both the training and test sets. This imbalance in the data is explained mainly by the lower occurrence of LUSC with respect to LUAD in the global population. Additionally, we simulated a revise and retain process by incorporating 40 samples from the second dataset, together with their corresponding class labels, into the case-base. Then, the classification accuracy of the system was measured on the remaining evaluation samples. This protocol was repeated for different combinations of the proposed methods (e.g. enabling and disabling the preliminary feature selection). The results of these experiments are reported in Table 1. As expected, the combination of both the preliminary (Var & MW) and the GBRT-based feature selection stages achieved the best accuracy. In addition, the experimental results suggest that the system is able to attain very high accuracy rates even when it considers only the top-5 most important genes (according to GBRT). This increases the applicability of the proposed system to a great extent.

The second set of experiments was carried out to compare the effectiveness of the proposed feature selection stages with some representative feature selection methods from the literature. To this end, the feature selection stages described in sections 3.1 and 3.2 were replaced by alternative methods, and the resulting CBR systems were evaluated as in the first set of experiments. The results of those experiments are listed in Table 2.

A third set of experiments was performed to compare alternative feature extraction and classification methods present in the microarray data analysis literature to our CBR framework. Various combinations of feature extraction/selection algorithms and classifiers were evaluated. The selected feature extraction methods were Principal Component Analysis (PCA), Kernel Principal Component Analysis (KPCA), Independent Component Analysis (ICA), Locally Linear Embedding (LLE) and Random Forest-based feature selection, all of which have been used in the literature on gene feature selection [38,39]. Concerning classification, the evaluated methods are k-Nearest Neighbours (kNN), the Naive Bayes classifier (NB) and the Support Vector Machine (SVM) [40]. The training and test sets were the same as in the first set of experiments, thus making the accuracy results directly comparable. To correctly parametrize the different algorithms, a model selection process was executed using 10-fold cross-validation over the training set (to ensure a fair comparison, the output dimension of the feature extraction algorithms was set, when possible, to 24, the number of features selected by our method). Again, the precision and recall measures are provided in addition to the accuracy measure. Table 3 shows the results of those experiments. The results concerning KPCA are not shown because no significant improvement with respect to its linear counterpart (PCA) was registered.
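A minimal sketch of how these evaluation measures can be computed with scikit-learn is shown below; the convention of encoding LUSC as label 1 is an assumption for illustration.

```python
# Hedged sketch of the evaluation protocol of section 4.2: accuracy,
# Cohen's kappa, and precision/recall with LUSC as the positive class.
from sklearn.metrics import (accuracy_score, cohen_kappa_score,
                             precision_score, recall_score)

def evaluate(y_true, y_pred, positive_label=1):  # 1 = LUSC by convention here
    return {
        "acc": accuracy_score(y_true, y_pred),
        "kappa": cohen_kappa_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, pos_label=positive_label),
        "recall": recall_score(y_true, y_pred, pos_label=positive_label),
    }
```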
4.3. Selected genes

Finally, we analyzed the gene selection performed by our CBR framework and reviewed other works and biological evidence to verify that the selection was coherent with other authors and with the genes' roles. The CBR system with both the preliminary and GBRT-based feature selection stages assigned a non-zero importance to a total of 24 genes. Those importance estimates, normalized as explained in section 3.2, are shown in Fig. 4.
Table 3. Experimental results on the effectiveness of alternative feature extraction and classification methods applied in the literature.

| F. Ext. | Classifier | Acc. | κ | Prec/rec |
|---|---|---|---|---|
| PCA (k = 24) | SVM (linear), C = 0.001 | 94.7% | 0.887 | 89.3/96.7% |
| PCA (k = 24) | NaiveBayes | 95.9% | 0.913 | 92.4/96.8% |
| PCA (k = 24) | kNN (k = 3) | 95.3% | 0.901 | 92.4/95.3% |
| FastICA (k = 24) | SVM (linear), C = 0.5 | 92.44% | 0.835 | 81.8/98.1% |
| FastICA (k = 24) | NaiveBayes | 95.3% | 0.901 | 93.9/93.9% |
| FastICA (k = 24) | kNN (k = 4) | 92.4% | 0.835 | 83.3/96.4% |
| LLE (k = 24) | SVM (linear), C = 0.001 | 94.7% | 0.888 | 90.9/95.2% |
| LLE (k = 24) | NaiveBayes | 91.2% | 0.820 | 95.4/84.0% |
| LLE (k = 24) | kNN (k = 3) | 94.7% | 0.890 | 96.9/90.1% |
| RF (300), 774 feats | SVM (linear), C = 0.001 | 94.7% | 0.890 | 93.9/92.5% |
| RF (300), 765 feats | kNN (k = 3) | 94.7% | 0.889 | 92.4/93.8% |
Fig. 4. 24 genes with non-zero importances according to the ensemble of gradient boosted trees. The genes are sorted in decreasing order according to their score in the first feature selection stage.
As explained in section 3.2, it is possible to manually set the maximum number of genes to be considered by the CBR system. The results presented in the previous subsection suggest that the number of selected genes may be reduced to five without a dramatic loss in accuracy. Furthermore, the experimental results show that once a number of revised cases becomes available, the system with the gene-number restriction performs as well as its non-restricted counterpart (which uses 19 more genes). The five genes used by the restricted system, along with their expression level distribution for the LUAD and LUSC test samples, are shown in Fig. 3. In addition, Fig. 5 provides a visualization of the expression levels of the selected genes in the test set samples by means of parallel coordinates. Although genes like KRT5 and TRIM29 have been widely used as biomarkers for LUAD and LUSC discrimination, PKP1 and AKR1B10 have been used less frequently, and SFTA2 is proposed here as a biomarker with great potential, successfully employed for classification with highly accurate results. Moreover, it is the combination of genes, rather than any individual biomarker, that determines the resulting effectiveness of the classification system.

Fig. 3. Violin plots showing the distribution of the five genes with the highest importance according to our experiments. Distributions are computed separately for LUAD and LUSC samples. Genes are sorted from left to right in decreasing importance order.

Fig. 5. Parallel coordinates visualization of the top-5 most important genes (according to our feature selection procedure) in the test set. This set contains a total of 172 samples (66 LUSC, 106 LUAD).

The TRIM protein family plays different roles in cellular processes that are typically altered in cancer, such as migration, proliferation, apoptosis and differentiation, among others [41]. TRIM29 has been reported to be involved in different cancer types, affecting proliferation and tumor progression [42]. It may play either an oncogenic or a tumor suppressor role, depending on the cellular context [42]. Zhou et al. [43] reported a significantly different expression of TRIM29 between poorly differentiated LUSC and LUAD, advocating the combination of this biomarker with KRT5, also selected by our framework. This gene is highly expressed in squamous cell carcinoma [21], as can also be observed in Fig. 3. KRT5 plays an important role in NSCLC and is one of the most used and explored biomarkers for NSCLC subtype diagnosis. Keratins show a heterogeneous pattern of expression levels in NSCLCs; their sensitivity varies across studies, but always presents high values [5,22]. Consequently, many authors have proposed this gene as a highly specific biomarker for LUSC detection. SFTA2 is a gene that is highly expressed in the lung [44]. There are very few works focusing on SFTA2 [44], but there is evidence that it plays a role in LUAD. To the best of our knowledge, it has not been commonly used as a lung subtype marker, but in their study Zhan et al. [5] provide a group of genes in which SFTA2 figures as a highly expressed gene in LUAD. However, they did not include it in the final biomarker group they propose, because an appropriate primary antibody for human SFTA2 could not be obtained; SFTA3, a protein of the same family, has been used as a biomarker in some studies [5]. In this regard, we should consider that genes encoding surfactant proteins are markers for the alveolar type 2 (AT2) lineage, which shares many molecular characteristics with adenocarcinoma cells [45]. In fact, Xu et al. [46] more recently concluded that AT2 cells are the predominant cancer-initiating cells of K-RasG12D-induced lung adenocarcinoma. Nevertheless, AT2 cells are not always the initiating cells of LUAD [46], and it would be interesting to further study the expression patterns of SFTA2 in LUAD. AKR1B10 has been found to be overexpressed in NSCLC, but more frequently in squamous cell carcinoma than in adenocarcinoma [47]. The alteration of this gene has commonly been associated with tobacco-related LUSC [48]. Although it is not among the most powerful biomarkers, its combination with the other genes in the proposed subset results in a high potential for subtype discrimination. There are few studies regarding the desmosomal plaque protein PKP1, but there have recently been increasing efforts to study the role of junction proteins, and it is well known that they participate in cell proliferation and metastasis [49]. In addition, [21] and [50] report a high upregulation of this gene in LUSC, although its use has not become widespread.

5. Discussion and future work

This paper has proposed a novel CBR framework for microarray-based lung cancer subtype classification. The proposed system relies on two successive feature selection procedures to reduce the number of genes the system must consider in order to evaluate and predict the class of an unseen case. To the best of our knowledge, this is the first time a feature selection method based on Gradient Boosted Regression Trees has been integrated into the CBR methodology. The experimental results presented in section 4 show that the proposed CBR framework outperforms some of the most common embedding and classification methods typically used for microarray data interpretation.

Furthermore, in the case representation used by the proposed CBR framework, each feature is computed from a single gene expression level. Therefore, our case representation is interpretable to a greater extent than the output of other methods, which compute each output feature as a combination (linear or non-linear) of all the input genes (e.g. PCA, K-PCA, LLE). In addition, our comparison with alternative feature selection methods shows that the proposed feature selection pipeline produces a more suitable case representation than the alternatives. In particular, the proposed GBRT-based feature selection method outperforms its competitors by a large margin in our experiments after the retention of a number of unseen cases.

Another strength of the proposed system is that, thanks to GBRT-based feature selection, genes are not selected individually but according to their combined relevance. In other words, even if a specific gene is not highly informative on its own, it might enable high accuracy rates when combined with a certain set of other genes. The GBRT-based feature selection module is able to identify non-linear feature interactions [24], thus selecting a set of genes that work well together. For instance, looking at Fig. 5 we see that a significant overlap between classes exists in the expression levels of AKR1B10. However, its high importance value (as estimated by GBRT) indicates that the trees in the ensemble often found this feature useful for differentiating between lung cancer subtypes. This suggests that, although AKR1B10 might seem individually uninformative, it provides complementary information that other genes do not contain. This is supported by the high classification accuracy achieved by our system when using only the top-5 most important genes. It is also worth noting that, if an efficient implementation of GBRT is used [51], the feature selection stage of our system enjoys a linear computational complexity with respect to the number of samples and features.

An interesting result was obtained when evaluating the accuracy of the proposed CBR with the preliminary feature selection disabled. In this case, the CBR working with all the available genes outperformed its counterpart with GBRT-based feature selection. This supports the necessity of a preliminary feature selection stage that provides GBRT with a reduced list of genes. In our case, the preliminary feature selection lowered the number of genes from 54,613 to 53, thus significantly reducing the number of non-informative features the second feature selection stage had to handle. The poor accuracies obtained when using GBRT feature selection without the preliminary stage can be explained by considering how the regression trees that form the GBRT ensemble are built. At each node, a number of splits on a random set of features are evaluated. When the number of input features is too large (perhaps even larger than the number of nodes in the complete ensemble), some features might not be evaluated even once, thus reducing the significance of the importance measure (as defined in section 3.2). As a consequence, the set of genes selected in such a manner is not optimal, and the CBR using the expression levels of those genes as a case representation performs poorly.

In addition, the experimental results confirm the ability of the proposed CBR system to learn as new revised cases are incorporated into the case-base, with no need to re-train the feature selection modules. Furthermore, we have seen that when taking into consideration only the top-5 most important genes according to the selection criterion described in section 3.2, the system is able to achieve accuracy rates very close to those obtained when considering all genes with non-zero importance. This approach also aims to address the current lack of uniformity in biomarker choice for lung cancer subtype diagnosis. In this regard, the system trained on the first dataset achieved a high accuracy rate when evaluated on an independent dataset produced by different laboratories and researchers. This suggests that good generalization can be achieved with the only requirement being that the training and test data come from the same chip model.

The importance that SFTA2 has shown in classification tasks will motivate further exploration of the expression patterns of NSCLC-initiating cells and their potential utility for classification in our system. Moreover, pathways, rather than individual genes, provide an understanding of the processes taking place in tumor cells. For this reason, the inclusion of a pathway analysis stage in the proposed framework shall be explored in the future. In addition, to increase the significance of our experimental results, we intend to collect a larger database of gene expression data to further validate the efficacy of the proposed method.
Conflict of interest statement

None declared.
Acknowledgements
The research of Daniel López-Sánchez has been financed by the Ministry of Education, Culture and Sports of the Spanish Government (University Faculty Training (FPU) program, reference number FPU15/02339).

References

[1] R.L. Siegel, K.D. Miller, A. Jemal, Cancer statistics, 2015, CA Cancer J. Clin. 65 (1) (2015) 5–29.
[2] J. Fujimoto, I.I. Wistuba, Current concepts on the molecular pathology of non-small cell lung carcinoma, in: Seminars in Diagnostic Pathology, vol. 31, Elsevier, 2014, pp. 306–313.
[3] B.A. Chan, B.G. Hughes, Targeted therapy for non-small cell lung cancer: current standards and the promise of the future, Transl. Lung Cancer Res. 4 (1) (2014) 36–54.
[4] A.V. Nascimento, H. Bousbaa, D. Ferreira, B. Sarmento, Non-small cell lung carcinoma: an overview on targeted therapy, Curr. Drug Targets 16 (13) (2015) 1448–1463.
[5] C. Zhan, L. Yan, L. Wang, Y. Sun, X. Wang, Z. Lin, Y. Zhang, Y. Shi, W. Jiang, Q. Wang, Identification of immunohistochemical markers for distinguishing lung adenocarcinoma from squamous cell carcinoma, J. Thorac. Dis. 7 (8) (2015) 1398.
[6] N. Rekhtman, D.C. Ang, C.S. Sima, W.D. Travis, A.L. Moreira, Immunohistochemical algorithm for differentiation of lung adenocarcinoma and squamous cell carcinoma based on large series of whole-tissue sections with validation in small specimens, Mod. Pathol. 24 (10) (2011) 1348–1359.
[7] C. Zappa, S.A. Mousa, Non-small cell lung cancer: current treatment and future advances, Transl. Lung Cancer Res. 5 (3) (2016) 288.
[8] B. Pandey, R. Mishra, Knowledge and intelligent computing system in medicine, Comput. Biol. Med. 39 (3) (2009) 215–230.
[9] I. Jurisica, J. Glasgow, Applications of case-based reasoning in molecular biology, AI Mag. 25 (1) (2004) 85.
[10] A. Aamodt, E. Plaza, Case-based reasoning: foundational issues, methodological variations, and system approaches, AI Commun. 7 (1) (1994) 39–59.
[11] W.H. Chan, M.S. Mohamad, S. Deris, N. Zaki, S. Kasim, S. Omatu, J.M. Corchado, H. Al Ashwal, Identification of informative genes and pathways using an improved penalized support vector machine with a weighting scheme, Comput. Biol. Med. 77 (2016) 102–115.
[12] B. Yao, S. Li, ANMM4CBR: a case-based reasoning method for gene expression data classification, Algorithms Mol. Biol. 5 (1) (2010) 1.
[13] A. Anaissi, M. Goyal, D.R. Catchpoole, A. Braytee, P.J. Kennedy, Case-based retrieval framework for gene expression data, Cancer Inf. 14 (2015) 21.
[14] N. Arshadi, I. Jurisica, Maintaining case-based reasoning systems: a machine learning approach, in: European Conference on Case-based Reasoning, Springer, 2004, pp. 17–31.
[15] M.-L. Huang, Y.-H. Hung, W.-M. Lee, R. Li, T.-H. Wang, Usage of case-based reasoning, neural network and adaptive neuro-fuzzy inference system classification techniques in breast cancer dataset classification diagnosis, J. Med. Syst. 36 (2) (2012) 407–414.
[16] F. Fdez-Riverola, F. Díaz, M.L. Borrajo, J.C. Yáñez, J.M. Corchado, Improving gene selection in microarray data analysis using fuzzy patterns inside a CBR system, in: International Conference on Case-based Reasoning, Springer, 2005, pp. 191–205.
[17] J. Jäger, R. Sengupta, W.L. Ruzzo, Improved gene selection for classification of microarrays, in: Proceedings of the Eighth Pacific Symposium on Biocomputing: 3–7 January 2003; Lihue, Hawaii, 2002, pp. 53–64.
[18] H. Qi, Feature selection and kNN fusion in molecular classification of multiple tumor types, in: Proceedings of the Mathematics and Engineering Techniques in Medicine and Biological Sciences: 24–27 June 2002; Las Vegas, Nevada, USA.
[19] R. Duangsoithong, T. Windeatt, Relevant and redundant feature analysis with ensemble classification, in: Advances in Pattern Recognition, 2009. ICAPR'09. Seventh International Conference on, IEEE, 2009, pp. 247–250.
[20] D. Glez-Peña, M. Glez-Bedia, F. Díaz, F. Fdez-Riverola, Multiple-microarray analysis and internet gathering information with application for aiding medical diagnosis in cancer research, in: 2nd International Workshop on Practical Applications of Computational Biology and Bioinformatics (IWPACBB 2008), Springer, 2009, pp. 112–117.
[21] A. Sanchez-Palencia, M. Gomez-Morales, J.A. Gomez-Capilla, V. Pedraza, L. Boyero, R. Rosell, M. Fárez-Vidal, Gene expression profiling reveals novel biomarkers in nonsmall cell lung cancer, Int. J. Cancer 129 (2) (2011) 355–364.
[22] A. Argon, D. Nart, A. Veral, The value of cytokeratin 5/6, p63 and thyroid transcription factor-1 in adenocarcinoma, squamous cell carcinoma and non-small-cell lung cancer of the lung, Turkish J. Pathol. 31 (2) (2015) 81–88.
[23] K. Takamochi, H. Ohmiya, M. Itoh, K. Mogushi, T. Saito, K. Hara, K. Mitani, Y. Kogo, Y. Yamanaka, J. Kawai, et al., Novel biomarkers that assist in accurate discrimination of squamous cell carcinoma from adenocarcinoma of the lung, BMC Cancer 16 (1) (2016) 760.
[24] J.H. Friedman, Greedy function approximation: a gradient boosting machine, Ann. Stat. (2001) 1189–1232.
[25] J. Friedman, T. Hastie, R. Tibshirani, The Elements of Statistical Learning, second ed., vol. 1, Springer Series in Statistics, Springer, Berlin, 2009.
[26] M.R. Abbasifard, B. Ghahremani, H. Naderi, A survey on nearest neighbor search methods, Int. J. Comput. Appl. 95 (25).
[27] K. Hechenbichler, K. Schliep, Weighted k-nearest-neighbor techniques and ordinal classification, Collaborative Research Center 386, Tech. rep., Discussion Paper 399, University of Munich, 2004.
[28] M.P. Rivera, A.C. Mehta, M.M. Wahidi, Establishing the diagnosis of lung cancer: diagnosis and management of lung cancer: American College of Chest Physicians evidence-based clinical practice guidelines, CHEST J. 143 (5_suppl) (2013) e142S–e165S.
[29] C. Plathow, P. Aschoff, M.P. Lichy, S. Eschmann, T. Hehr, I. Brink, C.D. Claussen, C. Pfannenberg, H.-P. Schlemmer, Positron emission tomography/computed tomography and whole-body magnetic resonance imaging in staging of advanced non-small cell lung cancer: initial results, Investig. Radiol. 43 (5) (2008) 290–297.
[30] K. Ruprecht, T. Muley, M. Meister, M. Ruschhaupt, B. Andreas, E.C. Xu, P. Schnabel, A. Warth, A. Poustka, H. Sültmann, Non-small Lung Cancer Subtypes: Adenocarcinoma and Squamous Cell Carcinoma, 2009. https://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS3627 [Online; accessed 28-October-2016].
[31] J. Botling, K. Edlund, M. Lohr, B. Hellwig, L. Holmberg, M. Lambe, A. Berglund, S. Ekman, M. Bergqvist, F. Pontén, A. König, O. Fernandes, M. Karlsson, G. Helenius, C. Karlsson, J. Rahnenführer, J. Hengstler, P. Micke, Biomarker Discovery in Non-small Cell Lung Cancer: Integrating Gene Expression Profiling, Meta-analysis and Tissue Microarray Validation, 2012. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE37745 [Online; accessed 28-October-2016].
[32] M.B. Kursa, W.R. Rudnicki, et al., Feature selection with the Boruta package, 2010.
[33] H. Ishwaran, J.S. Rao, Spike and slab variable selection: frequentist and Bayesian strategies, Ann. Stat. (2005) 730–773.
[34] M. Ahdesmäki, K. Strimmer, et al., Feature selection in omics prediction problems using cat scores and false nondiscovery rate control, Ann. Appl. Stat. 4 (1) (2010) 503–519.
[35] J. Zhou, D. Foster, R. Stine, L. Ungar, Streaming feature selection using alpha-investing, in: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, ACM, 2005, pp. 384–393.
[36] I. Guyon, J. Weston, S. Barnhill, V. Vapnik, Gene selection for cancer classification using support vector machines, Mach. Learn. 46 (1) (2002) 389–422.
[37] R. Artstein, M. Poesio, Inter-coder agreement for computational linguistics, Comput. Linguist. 34 (4) (2008) 555–596.
[38] R. Díaz-Uriarte, S.A. De Andres, Gene selection and classification of microarray data using random forest, BMC Bioinforma. 7 (1) (2006) 1.
[39] Z.M. Hira, D.F. Gillies, A review of feature selection and feature extraction methods applied on microarray data, Adv. Bioinforma. 2015 (2015). Article ID 198363. https://www.hindawi.com/journals/abi/2015/198363/abs/.
[40] S. Bind, A.K. Tiwari, A.K. Sahani, A survey of machine learning based approaches for Parkinson disease prediction, Int. J. Comput. Sci. Inf. Technol. 6 (2) (2015) 1648–1655.
[41] R. Xu, J. Hu, T. Zhang, C. Jiang, H. Wang, TRIM29 overexpression is associated with poor prognosis and promotes tumor progression by activating Wnt/β-catenin pathway in cervical cancer, Oncotarget.
[42] C. Liu, X. Huang, S. Hou, B. Hu, H. Li, Silencing of tripartite motif (TRIM) 29 inhibits proliferation and invasion and increases chemosensitivity to cisplatin in human lung squamous cancer NCI-H520 cells, Thorac. Cancer 6 (1) (2015) 31–37.
[43] Z.-Y. Zhou, G.-Y. Yang, J. Zhou, M.-H. Yu, Significance of TRIM29 and β-catenin expression in non-small-cell lung cancer, J. Chin. Med. Assoc. 75 (6) (2012) 269–274.
[44] R.A. Mittal, M. Hammel, J. Schwarz, K.M. Heschl, N. Bretschneider, A.W. Flemmer, S. Herber-Jonat, M. Königshoff, O. Eickelberg, A. Holzinger, SFTA2, a novel secretory peptide highly expressed in the lung, is modulated by lipopolysaccharide but not hyperoxia, PLoS One 7 (6) (2012) e40011.
[45] I.-J. Kim, D. Quigley, M.D. To, P. Pham, K. Lin, B. Jo, K.-Y. Jen, D. Raz, J. Kim, J.H. Mao, et al., Rewiring of human lung cell lineage and mitotic networks in lung adenocarcinomas, Nat. Commun. 4 (2013) 1701.
[46] X. Xu, J.R. Rock, Y. Lu, C. Futtner, B. Schwab, J. Guinney, B.L. Hogan, M.W. Onaitis, Evidence for type II cells as cells of origin of K-Ras-induced distal lung adenocarcinoma, Proc. Natl. Acad. Sci. 109 (13) (2012) 4910–4915.
[47] M. Kang, E. Lee, S. Yoon, J. Jo, J. Lee, H. Kim, Y. Choi, K. Kim, Y. Shim, J. Kim, et al., AKR1B10 is associated with smoking and smoking-related non-small-cell lung cancer, J. Int. Med. Res. 39 (1) (2011) 78–85.
[48] S.-i. Fukumoto, N. Yamauchi, H. Moriguchi, Y. Hippo, A. Watanabe, J. Shibahara, H. Taniguchi, S. Ishikawa, H. Ito, S. Yamamoto, et al., Overexpression of the aldo-keto reductase family protein AKR1B10 is highly correlated with smokers' non-small cell lung carcinomas, Clin. Cancer Res. 11 (5) (2005) 1776–1785.
[49] E.A. Runkle, D. Mu, Tight junction proteins: from barrier to tumorigenesis, Cancer Lett. 337 (1) (2013) 41–48.
[50] R. Kuner, T. Muley, M. Meister, M. Ruschhaupt, A. Buness, E.C. Xu, P. Schnabel, A. Warth, A. Poustka, H. Sültmann, et al., Global gene expression analysis reveals specific patterns of cell junctions in non-small cell lung cancer subtypes, Lung Cancer 63 (1) (2009) 32–38.
[51] Z. Xu, G. Huang, K.Q. Weinberger, A.X. Zheng, Gradient boosted feature selection, in: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2014, pp. 522–531.