Biocybernetics and Biomedical Engineering 39 (2019) 765–774
Original Research Article
Machine learning methods for MRI biomarkers analysis of pediatric posterior fossa tumors

Mengmeng Li a,b,1, Zhigang Shang a,b,c,1,*, Zhongliang Yang a,b, Yong Zhang d, Hong Wan a,b,c,*
a School of Electrical Engineering, Zhengzhou University, Zhengzhou, Henan, China
b Industrial Technology Research Institute, Zhengzhou University, Zhengzhou, Henan, China
c Henan Key Laboratory of Brain Science and Brain-Computer Interface Technology, Zhengzhou, Henan, China
d Magnetic Resonance Department, the First Affiliated Hospital of Zhengzhou University, Zhengzhou, Henan, China
article info

Article history:
Received 28 September 2018
Received in revised form 18 June 2019
Accepted 9 July 2019
Available online 20 July 2019

Keywords:
Pediatric posterior fossa tumor
Magnetic resonance imaging
Biomarker
Machine learning
Feature selection
Classifier

abstract

Medical imaging technologies provide an increasing number of opportunities for disease prediction and prognosis. Specifically, imaging biomarkers can quantify entire tumor phenotypes to enhance prediction. Machine learning can be used to mine and analyze these biomarkers and to establish predictive models for clinical applications. Several studies have applied various machine learning methods to imaging biomarkers based clinical predictions of different diseases. Here we seek to evaluate different machine learning methods for pediatric posterior fossa tumor prediction. We present a machine learning based magnetic resonance imaging biomarkers analysis framework for two kinds of pediatric posterior fossa tumors. In detail, three feature extraction methods are used to obtain 300 imaging biomarkers. 10 feature selection methods and 11 classifiers are evaluated in terms of quantified predictive performance and stability, and the importance consistency of features and the influence of the experimental factors are also analyzed. Our results demonstrate that the CFS feature selection method (accuracy: 83.85 ± 5.51%, stability: [0.84, 0.06]) and the SVM classifier (accuracy: 85.38 ± 3.47%, RSD: 4.77%) show relatively better performance than the others and should be preferred. Among all the biomarkers, 17 texture features appear to be the most important. Multifactor analysis indicates that the choice of classifier accounts for the largest contribution to the variability in performance (37.25%). The machine learning based framework is efficient for pediatric posterior fossa tumor biomarkers analysis and could provide valuable references and decision support for assisted clinical diagnosis.

© 2019 Published by Elsevier B.V. on behalf of Nalecz Institute of Biocybernetics and Biomedical Engineering of the Polish Academy of Sciences.
* Corresponding author at: School of Electrical Engineering, Zhengzhou University, Zhengzhou, Henan, China.
E-mail addresses: [email protected] (Z. Shang), [email protected] (H. Wan).
1 Mengmeng Li and Zhigang Shang contributed equally to this work.
https://doi.org/10.1016/j.bbe.2019.07.004
0208-5216/© 2019 Published by Elsevier B.V. on behalf of Nalecz Institute of Biocybernetics and Biomedical Engineering of the Polish Academy of Sciences.
1. Introduction
Nowadays, with advances in technology, studies related to medical imaging have been carried out widely. Benefiting from its non-invasive nature, image-based diagnosis of human disease has attracted growing interest in computer-assisted interventions [1–4]. Imaging biomarkers provide comprehensive information on entire tumor phenotypes and can be used to enhance tumor prediction. Hence, in the prevalent imaging data based diagnosis framework, feature extraction of imaging biomarkers is an essential step [3]. More and more studies focus on imaging biomarkers in terms of their prognostic or predictive abilities and their reliability across different clinical settings [1,5–9]. Different studies have reported the discriminating capabilities of imaging biomarkers for tumor classification [10], tumor stages [11], and clinical outcomes [1], indicating that imaging biomarkers could provide essential assistive information and a positive impact on accurate diagnosis, individualized treatment selection, and post-treatment surveillance.

Predictive models are usually involved in biomarkers analysis [12–14]. Machine learning can be used to establish highly accurate and reliable models that improve decision support in clinical applications [15,16]. More importantly, machine learning can also help to overcome the uncertainty present in all medical disciplines, from diagnosis to treatment [17]. However, the applicability of different machine learning methods usually differs considerably across fields, problems and kinds of data sets [18]. Hence, one often needs to explore as many types of methods as possible, which imposes a heavy burden in time and cost, especially in this era of exploding medical data. Therefore, a basic contrastive study comparing different machine learning methods is very valuable for precision oncology and imaging-based clinical biomarkers [19,20].

As a result of advances in medical image acquisition technologies and further studies in feature extraction, the feature dimensionality of imaging biomarkers is increasing rapidly. This enlarges the source of effective information, but it is accompanied by the ''curse of dimensionality'' [21]. Dimensionality reduction technologies have therefore emerged and attracted growing attention as a way to address this problem. Furthermore, the choice of classifier also has a great impact on the final performance of biomarkers based classification and prediction [22–24]. Therefore, to reduce the dimensionality of features and enhance predictive performance, different dimensionality reduction methods and classifiers should be investigated thoroughly.

In this paper, a machine learning based biomarkers analysis framework is presented for pediatric posterior fossa tumor classification. 174 MRI slices from 58 patients are acquired and a total of 300 biomarker features are extracted from the segmented tumor regions, which form two independent data sets for training and validation. In the framework, 10 feature selection methods and 11 classifiers are evaluated in terms of predictive performance and stability. Consistency analysis of the biomarkers importance and multifactor ANOVA of the experimental factors on accuracy are also conducted.
2. Related work
Several studies have indicated that feature selection methods and classifiers are effective for imaging biomarkers based clinical predictions [10,25–28]. Luts et al. examined the effect of feature extraction methods prior to automated pattern recognition based on magnetic resonance spectroscopy for brain tumor diagnosis; different feature extraction/selection methods and classifiers were assessed [10]. However, only two classifiers were involved in that study, and there was no further distinction between feature extraction and feature selection methods. Parmar et al. presented a study of machine learning methods for quantitative radiomic biomarkers of lung cancer patients [27], in which 14 feature selection methods and 12 classification methods were examined in terms of performance and stability. In a following similar study [28], 13 feature selection methods and 11 machine learning classifiers were evaluated for head and neck cancer prediction. The results of these two studies indicated that the methods show performance differences across applications.

Pediatric brain tumors, the most common solid neoplasms in children [29], cause more deaths than any other childhood malignancy [30]. Pediatric posterior fossa tumors are common intracranial brain tumors occurring in children aged 4–11 years. Medulloblastoma (MB) and ependymoma (EP) are the more representative types, but they are not easy to distinguish [31]. Medical imaging technology can provide unprecedented opportunities in this field, and advanced magnetic resonance imaging (MRI) techniques may improve the specific diagnosis of pediatric posterior fossa tumors [29]. For this kind of tumor it is difficult to collect a large number of patient samples, so comparative diagnosis is challenging. Therefore, computer-aided diagnosis based on machine learning technology is of great significance.

Some machine learning based studies on MRI biomarkers for pediatric posterior fossa tumor analysis and classification have been reported in recent years [5,32,33]. Orphanidou-Vlachou et al. deployed texture analysis to produce 279 features [5], which were reduced with principal component analysis (PCA) to the principal components explaining 95% of the variance; linear discriminant analysis (LDA) and a probabilistic neural network (PNN) were used to classify the cases. Rodriguez et al. computed shape, histogram, and textural features of MRIs to train SVM classifiers for posterior fossa tumor classification [32]. Fetit et al. carried out 3D and 2D texture analysis to extract MRI features and used Entropy-MDL discretization to select the most influential features [33]; six supervised classification algorithms were trained and compared in the classification of tumor types. These studies extract MRI features and classify tumor types with machine learning methods such as dimensionality reduction and classifiers. However, only one or a few methods are used in each and, according to our investigation, there is a lack of comprehensive comparative studies of multiple feature selection and classification methods for biomarkers analysis of pediatric posterior fossa tumors.
3. Materials and methods

3.1. Patients and datasets
In this paper, we consider a real preoperative MRI dataset of 58 patients (mean age: 7 years; age range: 0–14 years) from the First Affiliated Hospital of Zhengzhou University, collected between May 2008 and August 2015. All of the patients are diagnosed with posterior fossa tumors with histopathological evidence by professional physicians. 174 MRI slices are collected and evaluated retrospectively: 93 slices have histologically verified MB and 81 slices have EP. This research has been conducted with the consent of the guardians of the patients.

Image acquisition is carried out using the same imaging acquisition protocol on a 3.0 T MRI system (Siemens Medical Solutions, Erlangen, Germany). The MRI protocol consists of a T1-weighted sequence and a T2-weighted sequence. FOV = 240 × 240 mm, matrix size is 128 × 128, slice thickness is 6 mm and slice gap is 1 mm. The contrast medium is Gd-DTPA at a dosage of 0.2 mmol/kg.

The image preprocessing stage includes brain tissue segmentation and intensity normalization. All the MRIs are collected and marked by clinical specialists to acquire the regions of interest (ROI), supported by tumor pathological results. Manual segmentation guided by two radiologists through a double-blind method is used to extract all ROIs, based on ImageJ software (National Institutes of Health, Bethesda, USA). Min-Max scaling is used as the intensity normalization method for each ROI to mitigate intensity differences.
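For illustration, a minimal sketch of the Min-Max intensity normalization step is given below, assuming each ROI is available as a NumPy array together with a boolean mask of the segmented region; the function and variable names are our own and not part of the original pipeline.

```python
import numpy as np

def normalize_roi(roi, mask):
    """Min-Max scale the intensities inside a segmented ROI to [0, 1].

    roi  : 2D array holding the MRI slice intensities (hypothetical input)
    mask : boolean array of the same shape, True inside the marked ROI
    """
    values = roi[mask].astype(np.float64)
    lo, hi = values.min(), values.max()
    scaled = roi.astype(np.float64).copy()
    # Guard against a constant-intensity ROI to avoid division by zero.
    scaled[mask] = 0.0 if hi == lo else (values - lo) / (hi - lo)
    return scaled
```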
3.2. Biomarkers feature extraction
Biomarkers are digitized through feature extraction. To improve the prediction precision, an efficient feature extraction method should reduce the feature value differences of samples within a class and increase the differences between classes. Calculations are carried out on the ROIs and a total of 300 biomarkers are extracted for analysis in this study. These features, which quantify tumor phenotypic characteristics on MRI data, can be divided into three groups: (1) Gabor transform based features, (2) texture based features and (3) wavelet transform based features. More details on the three kinds of biomarker features are described in Supplementary A.
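As an illustration of the texture group, the sketch below computes a few gray-level co-occurrence matrix (GLCM) descriptors for one ROI with scikit-image; it is only a simplified stand-in for the full 300-feature extraction described in Supplementary A, and the function name is ours.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops  # 'greycomatrix' in older scikit-image

def glcm_texture_features(roi_uint8):
    """Compute a handful of GLCM texture descriptors for one ROI.

    roi_uint8 : 2D uint8 array (intensities already rescaled to 0-255).
    Returns a dict mapping feature name to value.
    """
    # Co-occurrence matrices for four directions at distance 1.
    glcm = graycomatrix(roi_uint8, distances=[1],
                        angles=[0, np.pi / 4, np.pi / 2, 3 * np.pi / 4],
                        levels=256, symmetric=True, normed=True)
    props = ("contrast", "homogeneity", "energy", "correlation")
    # Average each property over the four directions.
    return {p: graycoprops(glcm, p).mean() for p in props}
```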
3.3. Machine learning methods

3.3.1. Feature selection methods
Not all features are beneficial for classification. To achieve efficient and accurate results in machine learning, feature selection is very important: it not only reduces data processing time and storage requirements, but also improves the stability and generalization of the model. In this paper, 10 different feature selection methods are used for the analysis, namely Fisher Score (FiS), Gini Index (GINI), Information Gain (IG), Correlation-based Filter Selection (CFS), Chi-square Score (ChiS), Fast Correlation based Feature Selection (FCBF), Kruskal–Wallis Test Score (KWT), ReliefF (REF), Sparse Multinomial Logistic Regression algorithm with Bayesian regularization (SBMLR) and t-test score (TtS). These methods are widely applied in the literature for their simplicity, universality and computational efficiency, so their reusability remains high across studies and applications. In Supplementary B, we show the abbreviations of the feature selection method names in Table S2 and also list their details.
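To make the filter-style ranking concrete, here is a small self-contained version of one of the ten criteria, the Fisher Score; it is a simplified sketch assuming a binary MB/EP label and is not the exact implementation used in this study.

```python
import numpy as np

def fisher_score(X, y):
    """Fisher Score (FiS) for each feature of a labelled matrix.

    X : (n_samples, n_features) feature matrix
    y : (n_samples,) class labels (e.g. 0 = MB, 1 = EP)
    Returns one score per feature; higher means more discriminative.
    """
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])   # between-class scatter
    within = np.zeros(X.shape[1])    # within-class scatter
    for c in classes:
        Xc = X[y == c]
        n_c = Xc.shape[0]
        between += n_c * (Xc.mean(axis=0) - overall_mean) ** 2
        within += n_c * Xc.var(axis=0)
    return between / (within + 1e-12)

# Rank features from most to least discriminative:
# ranking = np.argsort(-fisher_score(X_train, y_train))
```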
3.3.2. Classifiers
Supervised learning algorithms aim at learning general rules or models from existing training data; when the output labels are discrete, the task is classification. With the development of machine learning, many classification algorithms belonging to different classifier families have been proposed. There is no best algorithm in an absolute sense, but the one that works best for a specific research problem and data set can be found according to prediction performance, stability, robustness, etc.

The classifiers used in this paper are k-Nearest Neighbors (kNN), Support Vector Machines (SVM), Bagging (BAG), Boosting (BST), Neural Networks (NNet), Classification and Regression Trees (CART), Random Subspace Method (RSM), Extreme Learning Machine (ELM), Naive Bayesian (NB), Random Forests (RF) and Partial Least Square Regression (PLSR). These methods are the most representative ones among the various classifier families and have been widely used in different fields. The names and corresponding abbreviations of these classifiers are shown in Table S2. Further descriptions, along with their parameter settings, can be found in Supplementary C.
4. Biomarkers analysis
To investigate the machine learning methods for the biomarkers, 300 features are extracted from the MRI ROIs belonging to the two kinds of tumors. Feature selection and classifier training are conducted using repeated hold-out validation, in which the training set (70% of all data) is sampled randomly from the data set each time, whereas the validation set (the remaining 30%) is used to evaluate the predictive performance. The machine learning based biomarkers analysis framework is shown in Fig. 1.
4.1. Predictive performance of the machine learning methods

In this study, the predictive performances of the different methods are comprehensively compared. For each of the 10 feature selection methods, the dimensionality of the selected feature subset is specified and the corresponding optimal feature subset is obtained. The subset is then used as input to each of the 11 classifiers and evaluated in terms of accuracy on the validation data. Thus, each combination of a feature selection method and a classifier obtains a corresponding accuracy reflecting its predictive performance; a sketch of this evaluation loop is given below.
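The following is a minimal sketch of the repeated 70/30 hold-out evaluation of one selector/classifier combination, assuming the 300 features are already assembled in a matrix X with labels y; the univariate ANOVA-F selector and the standardization step are stand-ins of ours, not the exact methods compared in the paper, and the helper name is illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

def repeated_holdout_accuracy(X, y, classifier, k=150, n_repeats=100, seed=0):
    """Mean and std of validation accuracy for one selector/classifier pair.

    Each repeat draws a random 70/30 train/validation split, selects the
    k top-ranked features on the training part only, and scores the
    classifier on the held-out part.
    """
    rng = np.random.RandomState(seed)
    accuracies = []
    for _ in range(n_repeats):
        X_tr, X_va, y_tr, y_va = train_test_split(
            X, y, test_size=0.3, stratify=y,
            random_state=rng.randint(1 << 30))
        model = make_pipeline(StandardScaler(),
                              SelectKBest(f_classif, k=k),
                              classifier)
        model.fit(X_tr, y_tr)
        accuracies.append(model.score(X_va, y_va))
    return np.mean(accuracies), np.std(accuracies)

# Example usage comparing two of the classifier families used in the paper:
# print(repeated_holdout_accuracy(X, y, SVC(kernel="rbf")))
# print(repeated_holdout_accuracy(X, y, KNeighborsClassifier(n_neighbors=5)))
```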
4.2. Stability analysis of the machine learning methods
In order to evaluate the stability of the feature selection methods, an effective stability measure [34] is used. Given a database with N samples and D features, it is divided into two non-overlapping subsets (of size N/2).
Fig. 1 – Machine learning based biomarkers analysis framework. A total of 300 biomarkers are extracted from 174 MRI data. The regions outlined by the red lines are the marked ROIs of each MRI.
The same feature selection method is then applied to both subsets to obtain the corresponding feature rankings. The similarity between the two ranking results is used to quantify the stability of the method, and is calculated as in (1):

sim(R_1, R_2) = \frac{2 \, |R_1 \cap R_2|}{|R_1| + |R_2|}    (1)

where R_1 and R_2 are the feature rankings of a specific dimensionality obtained from the two non-overlapping subsets, and |R_1| and |R_2| denote their lengths. A feature selected in both subsets simultaneously belongs to R_1 \cap R_2, whose size is |R_1 \cap R_2|. For each method, the median and standard deviation (std) of the stability are computed over 100 bootstrap repetitions.

The stability of a classifier is quantified by the relative standard deviation (RSD), also by means of a bootstrap approach. For each classification method, the classifier model is trained on a subsampled training set (of size N/2) and its performance on the validation set is evaluated; subsampling of the training set is repeated 100 times. The RSD value reflecting the stability of a classifier is defined by (2):

RSD = \frac{\sigma_{accuracy}}{\mu_{accuracy}} \times 100    (2)

where \sigma_{accuracy} and \mu_{accuracy} are the standard deviation and mean of the 100 accuracy values, respectively. Note that lower RSD values correspond to higher stability of the classifiers.
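A compact sketch of both stability measures follows; it assumes a generic ranking function for the feature selection side and a list of bootstrap accuracies for the classifier side, and it uses random half-splits as an approximation of the bootstrap procedure described above. The helper names are illustrative.

```python
import numpy as np

def ranking_similarity(r1, r2):
    """Stability measure of Eq. (1): overlap of two selected feature sets."""
    s1, s2 = set(r1), set(r2)
    return 2 * len(s1 & s2) / (len(s1) + len(s2))

def selection_stability(rank_features, X, y, n_select=150, n_boot=100, seed=0):
    """Median and std of Eq. (1) over repeated splits into two halves.

    rank_features : callable (X, y) -> feature indices ordered by importance.
    """
    rng = np.random.RandomState(seed)
    n = X.shape[0]
    sims = []
    for _ in range(n_boot):
        idx = rng.permutation(n)
        half1, half2 = idx[: n // 2], idx[n // 2: 2 * (n // 2)]
        r1 = rank_features(X[half1], y[half1])[:n_select]
        r2 = rank_features(X[half2], y[half2])[:n_select]
        sims.append(ranking_similarity(r1, r2))
    return np.median(sims), np.std(sims)

def classifier_rsd(accuracies):
    """Relative standard deviation of Eq. (2); lower means more stable."""
    acc = np.asarray(accuracies, dtype=float)
    return acc.std() / acc.mean() * 100
```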
4.3. Predictive performance and stability
To identify the machine learning methods with high predictive performance and stability, the median values of accuracy and stability are used to categorize them into different groups. With the median accuracy (80.00%) and stability (0.84) as thresholds, all feature selection methods are evaluated. Similarly, the median accuracy (80.58%) and RSD value (4.84%) are used as thresholds to compare all classifiers. Feature selection methods with higher accuracy (≥80.00%) and stability (≥0.84), and classifiers with higher accuracy (≥80.58%) and lower RSD (≤4.84%) are considered as highly reliable and accurate methods.
4.4. Consistency analysis of the biomarkers importance
The importance of each feature for supervised learning is determined by its location in the feature ranking sequence. In this study, for every feature sequence corresponding to each feature selection method, all 300 features are divided into 10 levels, so that each level contains 30 features. For example, the first 30 features at the front of the sequence constitute the first level and obtain an importance index value of 10; the second level includes the thirty-first to the sixtieth features and their importance index value is 9, and so on. In this way every feature obtains 10 importance index values corresponding to the 10 methods, and their mean value is calculated as the importance consistency (IC) of the feature. A larger IC value indicates that the importance and stability of the feature are more significant. In this paper, features with an IC higher than 8.5 are regarded as important ones of good stability, while those with an IC below 4.0 are regarded as inconsequential ones of poor stability. According to the obtained feature ranking sequence, the above classification experiments are repeated with a varying number of selected features to compare the performance. A sketch of the IC computation is given below.
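The following minimal sketch computes the IC values from the ten rankings, assuming each ranking is an array of feature indices ordered from most to least important; the function name is ours.

```python
import numpy as np

def importance_consistency(rankings, n_features=300, n_levels=10):
    """Importance consistency (IC) of each feature across selection methods.

    rankings : list of arrays; each array holds the feature indices of one
               selection method ordered from most to least important.
    Returns an array of length n_features with the mean importance index.
    """
    level_size = n_features // n_levels          # 30 features per level
    scores = np.zeros((len(rankings), n_features))
    for m, ranking in enumerate(rankings):
        for position, feature in enumerate(ranking):
            # Level 1 (top 30 features) gets index 10, level 2 gets 9, ...
            scores[m, feature] = n_levels - position // level_size
    return scores.mean(axis=0)

# Features of good stability, following the IC > 8.5 criterion:
# important = np.where(importance_consistency(rankings) > 8.5)[0]
```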
4.5. Multifactor analysis of variance affecting the prediction

In our experiments, three main factors potentially affecting the predictive performance are considered, namely the feature selection method, the classifier, and the dimension of the selected features. Multifactor analysis of variance (ANOVA) is used to assess the effects of these factors on the prediction. The variability in accuracy contributed by each factor and their interactions is quantified through the corresponding estimated variance component, which is divided by the total variance for comparison. A simplified sketch of such an analysis is shown below.
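A minimal sketch of this analysis with statsmodels is given below; it assumes the accuracy results are collected in a DataFrame with one row per evaluated combination, and it apportions the ANOVA sums of squares rather than estimating variance components, which is a simplification of the procedure used in the paper. Column names are ours.

```python
import statsmodels.api as sm
from statsmodels.formula.api import ols

def variance_shares(results):
    """Share of accuracy variability attributed to each experimental factor.

    results : DataFrame with columns 'accuracy', 'classifier', 'fs_method'
              and 'dimension' (one row per evaluated combination).
    """
    # Main effects plus the three two-way interactions, as in Fig. 6.
    model = ols("accuracy ~ C(classifier) + C(fs_method) + C(dimension)"
                " + C(classifier):C(fs_method)"
                " + C(classifier):C(dimension)"
                " + C(fs_method):C(dimension)",
                data=results).fit()
    table = sm.stats.anova_lm(model, typ=2)
    # Fraction of the total sum of squares per term (including residuals).
    return table["sum_sq"] / table["sum_sq"].sum()
```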
5. Results

5.1. Predictive performance of the machine learning methods

Validation accuracy is used to assess the predictive performance of the different methods in our study.
Fig. 2a depicts the predictive performance heatmap of feature selection methods (rows) and classifiers (columns) using the 150 top-ranked features selected by each method. It can be observed that the feature selection methods CFS and FCBF and the classifiers SVM and kNN show relatively high predictive performance in the majority of cases.

For each feature selection method, there are 11 accuracy values corresponding to the 11 classifiers, and their mean is taken as the representative accuracy of that feature selection method. Similarly, for each classifier, the mean of the 10 accuracy values (corresponding to the 10 feature selection methods) is used as its representative accuracy. These representative accuracy values are given in Table 1. For feature selection methods, the correlation based method CFS shows the highest predictive performance (accuracy: 83.85 ± 5.51%, mean ± std). FCBF (accuracy: 81.54 ± 3.49%) and ChiS (accuracy: 81.15 ± 5.51%) also perform better than the other methods, whereas FiS (accuracy: 78.08 ± 4.43%) has the lowest mean value. In terms of classifiers, SVM displays the highest predictive performance (accuracy: 85.38 ± 3.47%), followed by kNN (accuracy: 83.08 ± 5.46%), while RSM (accuracy: 74.62 ± 3.26%) shows the worst performance.
5.2. Stability of the machine learning methods
These machine learning methods are assessed in terms of stability using the measures and bootstrap approach described above. The stability values of the 10 feature selection methods and the RSD values of the 11 classifiers are reported in Table 1. The results suggest that, among the feature selection methods, REF shows the most stable performance (stability: [0.90, 0.03], [median, std]), closely followed by FiS (stability: [0.86, 0.06]) and TtS (stability: [0.86, 0.06]), whereas IG (stability: [0.75, 0.04]) and GINI (stability: [0.74, 0.05]) show relatively low stability and ChiS (stability: [0.73, 0.05]) is the least stable. As far as classifiers are concerned, RSM is the most stable (RSD: 3.61%), followed by NB (RSD: 3.67%); NNet has the highest RSD (7.47%) and hence the lowest stability among the classifiers.
5.3. Predictive performance and stability
To compare the performance and stability of all methods for the biomarker analysis of the tumors, the values are displayed on scatterplots: Fig. 3a assesses the prediction performance and stability of the feature selection methods, and Fig. 3b those of the classifiers. According to the thresholds we set for the evaluation, the two-dimensional coordinate space is divided into four parts; the methods located in the shaded area are the ones with better performance and stability. It can be observed from Fig. 3a that the predictive performance and stability of the feature selection methods CFS (accuracy: 83.85 ± 5.51%, stability: 0.84 ± 0.06), FCBF (accuracy: 81.54 ± 3.49%, stability: 0.85 ± 0.06), TtS (accuracy: 80.88 ± 3.70%, stability: 0.86 ± 0.06) and SBMLR (accuracy: 80.77 ± 4.71%, stability: 0.85 ± 0.06) are higher than the median values (accuracy: 80.00%, stability: 0.84). Thus CFS should be preferred because of its higher accuracy and stability.
Fig. 2 – Predictive performance heatmap of (a) machine learning methods, and (b) classifiers and the dimension of selected features. The colorbar depicts the accuracy (%): higher accuracy appears more red, lower accuracy more blue. It can be observed from (a) that the feature selection methods CFS and FCBF and the classifiers SVM and kNN show relatively high predictive performance in many cases. Results in (b) show that the performance of most classifiers increases with the number of selected features; SVM and kNN also show relatively high predictive performance in the majority of cases. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Similarly, for the classification methods, the results in Fig. 3b suggest that SVM (accuracy: 85.38 ± 3.47%, RSD: 4.77%), kNN (accuracy: 83.08 ± 5.46%, RSD: 4.72%) and BST (accuracy: 81.73 ± 3.66%, RSD: 4.84%) should be preferred, as their performance and stability are higher than the median values (accuracy: 80.58%, RSD: 4.84%) among all classifiers. Furthermore, compared with the other two methods, SVM appears to be the optimal choice due to its highest accuracy.

As shown above, the machine learning methods predict the two kinds of tumors at rates above the accuracy level expected by chance alone (i.e., 50.00% correct), with the lowest and highest classification accuracies being 74.62% and 85.38%. Note that all of these accuracy results are average estimates across methods; an appropriate pairing of feature selection method and classifier (CFS and SVM/kNN here) will achieve relatively higher accuracy. Combining CFS with SVM/kNN, hold-out validated classification is carried out and the confusion matrices summarizing the prediction accuracy are reported in Fig. 4. As illustrated in the confusion matrices of Fig. 4, for both tumor types the prediction results are much better than chance accuracy (50.00%), whether the classifier is SVM or kNN. Moreover, the accuracy for MB is significantly higher than that for EP.
5.4. Consistency of the biomarkers importance
Different feature ranking sequences are obtained by the 10 feature selection methods, and the importance and stability of the features are assessed using the IC measure designed above. Fig. 5 depicts an IC heatmap of all biomarker features. We further analyze the feature composition of the sequences with different dimensions (50–300, with an increment step of 50) and the 22 features with good stability (IC > 8.5); the results are shown in Tables S3 and S4 in Supplementary D. From Fig. 5 and Tables S3 and S4, we find that 22 features with high IC values show significant stability across different feature selection methods, whereas 97 features show low stability (IC < 4.0). Among the 22 features, texture features are the most numerous (17) and are selected by almost all feature selection methods. The significance analysis results of the biomarker features used are reported in Supplementary D. The 14 features ranked at the front of the sequence have larger IC values (IC ≥ 9.0) and their box plots are provided in Fig. S1.
Table 1 – The accuracy and stability of the machine learning methods.

Feature selection method | Accuracy/% (mean ± std) | Stability ([median, std]) | Classifier | Accuracy/% (mean ± std) | RSD/% (median)
FiS   | 78.08 ± 4.43 | [0.86, 0.06] | kNN  | 83.08 ± 5.46 | 4.72
GINI  | 79.23 ± 5.77 | [0.74, 0.05] | SVM  | 85.38 ± 3.47 | 4.77
IG    | 79.62 ± 3.44 | [0.75, 0.04] | BAG  | 79.81 ± 4.61 | 5.00
CFS   | 83.85 ± 5.51 | [0.84, 0.06] | BST  | 81.73 ± 3.66 | 4.84
ChiS  | 81.15 ± 5.51 | [0.73, 0.05] | RF   | 80.77 ± 5.28 | 5.05
FCBF  | 81.54 ± 3.49 | [0.85, 0.06] | RSM  | 74.62 ± 3.26 | 3.61
KWT   | 78.46 ± 3.04 | [0.77, 0.05] | CART | 76.15 ± 5.67 | 6.30
REF   | 79.47 ± 3.75 | [0.90, 0.03] | NNet | 80.58 ± 4.88 | 7.47
SBMLR | 80.77 ± 4.71 | [0.85, 0.06] | ELM  | 80.58 ± 4.19 | 6.19
TtS   | 80.88 ± 3.70 | [0.86, 0.06] | NB   | 75.19 ± 3.13 | 3.67
–     | –            | –            | PLSR | 79.42 ± 3.93 | 4.56
Fig. 3 – Scatterplots of the predictive performance versus stability of (a) feature selection methods, and (b) classifiers. The methods displayed in the gray square region are considered highly reliable and accurate. It can be observed that the feature selection methods with accuracy ≥80.00% and stability ≥0.84 (CFS, FCBF, TtS, and SBMLR) and the classifiers with accuracy ≥80.58% and RSD ≤4.84% (SVM, kNN and BST) are considered highly reliable.
It can be observed that there are indeed differences in the feature distributions between the two kinds of tumors. The 15 features listed at the end of the sequence have smaller IC values (IC ≤ 2.4); their corresponding box plots are shown in Fig. S2 and they show poor discriminability.

We repeat the experiments, varying the number of selected features according to the feature ranking sequence obtained from the IC analysis. Predictive performance results corresponding to 30–300 (with an increment step of 30) ranked features are reported in Fig. 2b, where heatmaps depict the mean accuracy over classifiers against the number of selected features. The results suggest that the predictive performance of most classifiers improves as the selected feature dimension increases. Here too, the classifiers SVM and kNN show higher accuracy in the majority of cases.

Furthermore, predictive performance results corresponding to 30–300 (with an increment step of 30) ranked features, 2–22 (with an increment step of 2) top important features and the 17 top texture features, based on the optimal feature selection method CFS and the classifiers SVM and kNN, are reported in Figs. S3, S4, and S6 respectively in Supplementary E. Confusion matrices for the 22 top ranked features and the 17 top ranked texture features are shown in Figs. S5 and S7. The results show that a small number of features (22) already guarantees high accuracy (mean accuracy: 84.40% and 84.70%). Moreover, the performance of some important texture features alone is very close to this (mean accuracy: 83.06% and 84.79%), which further illustrates the value of texture features for tumor classification.
5.5. Experimental factors affecting the biomarkers based prediction

Multifactor ANOVA on the accuracy values is performed to quantify the effects of the three experimental factors (feature selection method, classifier, and number of selected features) on the biomarkers based prediction. The estimated variance components of these factors and their interactions are shown in Fig. 6. We observe that the most dominant source of variability is the classifier, which explains 37.25% of the total variance in accuracy values. The feature selection method and the dimension of the selected features only account for 5.84% and 3.82%, respectively.
Fig. 4 – Confusion matrices summarizing hold-out validated prediction accuracy using the combined methods of CFS and (a) SVM and (b) kNN. Numbers and cell colors indicate the prediction accuracy (chance = 50.00%): higher accuracy appears darker (toward black), lower accuracy lighter (toward white). It can be observed that the prediction of MB and EP is much better than chance accuracy (mean accuracy of SVM and kNN: 88.01% and 88.10%).
Fig. 5 – Heatmap showing the importance consistency of different kinds of features. The color key depicts the IC of each feature: higher values appear more red, lower values more blue. Three colors (red, blue and green) indicate the three kinds of features (Gabor, Texture, and Wavelet) and their positions are marked by the corresponding colors. It can be observed that texture features (17) are the most numerous among the important features with higher IC. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Fig. 6 – Variation of accuracy explained by the experimental factors and their interactions. (C: Classifier, F: Feature selection method, D: Dimension of selected features, C*F: interaction of Classifier and Feature selection method, C*D: interaction of Classifier and Dimension of selected features, F*D: interaction of Feature selection method and Dimension of selected features, R: Residuals). It can be observed that classifier is the most dominant source of variability, whereas the dimension of the selected feature shares the least of the total variance.
For the interactions of the experimental factors, the interaction of classifier and dimension and the interaction of feature selection method and dimension account for 15.37% and 11.00% of the variance respectively, whereas the interaction of classifier and feature selection method constitutes only 5.77% of the total variation.
6. Conclusion and discussion
In conclusion, a machine learning based imaging biomarkers analysis framework for pediatric posterior fossa tumor prediction has been presented in this paper. 10 feature selection methods and 11 classifiers are evaluated and their predictive performance and stability are quantified. Our results suggest that the feature selection method CFS and the classifier SVM should be preferred for pediatric posterior fossa tumor biomarkers analyses, as they display relatively higher predictive performance and stability than the other methods. Moreover, consistency analysis of the biomarkers importance indicates that some significant texture features are of great value for this task and should be preferred. Furthermore, multifactor ANOVA results show that the classifier accounts for most of the variability in prediction performance, and hence it should be chosen carefully, compared with the feature selection method and the dimension of the selected features.

In this study, the analysis is limited by the sample size (174 images from 58 patients) and the feature dimension (300 features extracted by 3 kinds of strategies). In follow-up work, we expect better performance by expanding the sample size and the feature dimension. We hope to obtain more shared medical image data of different modalities (such as CT, PET, and so forth) to promote imaging biomarkers based studies and overcome the variability of features from different imaging methods [35]. In addition, combining standard clinical variables in a comprehensive study would further improve and validate the performance of the model.

Of the 10 feature selection methods used in this paper, most are filter based. As a classifier independent approach, CFS should be preferred in this study. CFS is a feature ranking based method which only takes into account the relationship between the features and the response variables without considering the redundancy among the selected features, so the selected feature subset may still contain redundant information. We will therefore consider combining other strategies to further reduce redundancy.

In this study, a series of mature classifiers is used, and the parameter settings, which follow the literature [18,36], have been previously validated on many datasets in different fields. The predictive results show that SVM performs best in many cases, and the multifactor ANOVA results show that the classifier is the most dominant source of variation in prediction.
Hence, the classifier used in similar studies should be chosen very carefully. However, there is great diversity across application situations, and the difficulty of understanding all assumptions and of adjusting the parameters in an unbiased manner makes the choice even more challenging. On the other hand, different algorithms have different time complexities; details of the time complexities of the algorithms used in this paper are given in Supplementary C. The tradeoff between performance and time complexity is an important issue in algorithm choice, and we will make a more comprehensive comparative study of it in the future.

Although the dimension of the selected features contributes the least to the accuracy variation according to the multifactor ANOVA, we still consider that the selection and analysis of biomarker features play an important role in tumor prediction and diagnosis. In the future, we will continue to analyze the prediction performance under different conditions of both single-type and combined multiple-type feature subsets. In addition, besides feature selection, feature extraction based dimensionality reduction is another approach [37]; what role it can play in the analysis of biomarker features is also a significant research direction.

Predictive performance and stability of different feature selection methods and classifiers for pediatric posterior fossa tumors are compared in the machine learning based framework. The framework should also be applied to other tumor types to evaluate the methods further. Overall, the analysis in this study is a step toward enhancing imaging biomarkers based clinical predictions, and our results can provide valuable references for future research. Depending on the requirements, one may prefer higher predictive power or higher stability and choose the methods accordingly in similar studies or applications.
Acknowledgements
This work is supported by the National Natural Science Foundation of China, Grants U1304602 and 61673353. The MRIs used in this paper are provided and marked by clinical imaging specialists of the Magnetic Resonance Department, the First Affiliated Hospital of Zhengzhou University, and the authors express their gratitude to them. The authors also would like to thank Prof. Yu Jia, Dr. Mingming Chen, Dr. Kunjie Yu, Dr. Renping Yu, Caitong Yue and Bofei Lang for all of their assistance. In addition, Mengmeng Li especially wants to thank Jay Chou and his excellent music for the companionship, inspiration and encouragement over the past 14 years.
References
[1] Aerts HJWL, Velazquez ER, Leijenaar RTH, Parmar C, Grossmann P, Cavalho S, et al. Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach. Nat Commun 2014;5:4006.
[2] Gong J, Liu JY, Sun XW, Zheng B, Nie SD. Computer-aided diagnosis of lung cancer: the effect of training data sets on classification accuracy of lung nodules. Phys Med Biol 2018;63:035036.
[3] Suk HI, Lee SW, Shen D. Deep ensemble learning of sparse regression models for brain disease diagnosis. Med Image Anal 2017;37:101–3.
[4] Rajpurkar P, Irvin J, Zhu K, Yang B, Mehta H, Duan T, et al. CheXNet: radiologist-level pneumonia detection on chest X-rays with deep learning. arXiv preprint arXiv 2017:1711.05225.
[5] Orphanidou-Vlachou E, Vlachos N, Davies NP, Arvanitis TN, Grundy RG, Peet AC. Texture analysis of T1- and T2-weighted MR images and use of probabilistic neural network to discriminate posterior fossa tumours in children. NMR Biomed 2014;27:632–9.
[6] Lee SJ, Zea R, Kim DH, Lubner MG, Deming DA, Pickhardt PJ. CT texture features of liver parenchyma for predicting development of metastatic disease and overall survival in patients with colorectal cancer. Eur Radiol 2017;28:1–9.
[7] Abidin AZ, Jameson J, Molthen R, Wismüller A. Classification of micro-CT images using 3D characterization of bone canal patterns in human osteogenesis imperfecta. In: Armato SG, Petrick NA, editors. Proc SPIE 10134, Medical Imaging 2017: Computer-Aided Diagnosis. Orlando, FL, USA: SPIE; 2017. p. 1013413.
[8] Sabin ND, Merchant TE, Li X, Li Y, Klimo Jr P, Boop FA, et al. Quantitative imaging analysis of posterior fossa ependymoma location in children. Childs Nerv Syst 2016;32:1441–7.
[9] Gevaert O, Mitchell LA, Achrol AS, Xu J, Echegaray S, Steinberg GK, et al. Glioblastoma multiforme: exploratory radiogenomic analysis by using quantitative image features. Radiology 2015;276:313.
[10] Luts J, Poullet JB, Garciagomez JM, Heerschap A, Robles M, Suykens JA, et al. Effect of feature extraction for brain tumor classification based on short echo time 1H MR spectra. Magn Reson Med 2008;60:288–98.
[11] Ganeshan B, Abaleke S, Young RCD, Chatwin CR, Miles KA. Texture analysis of non-small cell lung cancer on unenhanced computed tomography: initial evidence for a relationship with tumour glucose metabolism and stage. Cancer Imaging 2010;10:137.
[12] Holdenrieder S, Nagel D, Heinemann V, Pawel JV, Raith H, Feldmann K, et al. Predictive and prognostic biomarker models in advanced lung cancer. J Clin Oncol 2008;26:431–6.
[13] Anderson B, Hardin JM, Alexander DD, Meleth S, Grizzle WE, Manne U. Comparison of the predictive qualities of three prognostic models of colorectal cancer. Front Biosci 2010;2:849–56.
[14] Su XW, Simmons Z, Mitchell RM, Kong L, Stephens HE, Connor JR. Biomarker-based predictive models for prognosis in amyotrophic lateral sclerosis. JAMA Neurol 2013;70:1505–11.
[15] Nilashi M, Ibrahim OB, Ahmadi H, Shahmoradi L. An analytical method for diseases prediction using machine learning techniques. Comput Chem Eng 2017;106:212–23.
[16] Nilashi M, Ahmadi H, Shahmoradi L, Ibrahim O, Akbari E. A predictive method for hepatitis disease diagnosis using ensembles of neuro-fuzzy technique. J Infect Public Health 2019;12:13–20.
[17] Ahmadi H, Gholamzadeh M, Shahmoradi L, Nilashi M, Rashvand P. Diseases diagnosis using fuzzy logic methods: a systematic and meta-analysis review. Comput Methods Programs Biomed 2018;161:145–72.
[18] Fernández-Delgado M, Cernadas E, Barro S, Amorim D. Do we need hundreds of classifiers to solve real world classification problems? J Mach Learn Res 2014;15:3133–81.
[19] Davatzikos C, Fan Y, Wu X, Shen D, Resnick SM. Detection of prodromal Alzheimer's disease via pattern classification of magnetic resonance imaging. Neurobiol Aging 2008;29:514–23.
[20] Gao L, Pan H, Li Q, Xie X, Zhang Z, Han J, et al. Brain medical image diagnosis based on corners with importance-values. BMC Bioinformatics 2017;18:505–17.
[21] Verleysen M, François D. The curse of dimensionality in data mining and time series prediction. In: Cabestany J, Prieto A, Hernández FS, editors. Proc of the 8th Int Work-Conf Artif Neural Netw. Vilanova i la Geltrú, Barcelona, Spain: Springer Berlin; 2005. p. 758–70.
[22] Cho SB, Won HH. Machine learning in DNA microarray analysis for cancer classification. In: Chen Y-P, editor. Proc of the First Asia-Pacific Bioinformatics Conf. 2003. p. 189–98.
[23] Hijazi H, Chan C. A classification framework applied to cancer gene expression profiles. J Healthcare Eng 2013;4:255–83.
[24] Manchon U, Talevich E, Katiyar S, Rasheed K, Kannan N. Prediction and prioritization of rare oncogenic mutations in the cancer kinome using novel features and multiple classifiers. PLoS Comput Biol 2014;10:e1003545.
[25] Keyvanfard F, Shoorehdeli MA, Teshnehlab M. Feature selection and classification of breast cancer on dynamic magnetic resonance imaging using ANN and SVM. Am J Biomed Eng 2011;1:20–5.
[26] Hawkins SH, Korecki JN, Balagurunathan Y, Gu Y, Kumar V, Basu S, et al. Predicting outcomes of nonsmall cell lung cancer using CT image features. IEEE Access 2014;2:1418–26.
[27] Parmar C, Grossmann P, Bussink J, Lambin P, Aerts HJ. Machine learning methods for quantitative radiomic biomarkers. Sci Rep 2015;5:13087.
[28] Parmar C, Grossmann P, Rietveld D, Rietbergen MM, Lambin P, Aerts HJWL. Radiomic machine-learning classifiers for prognostic biomarkers of head and neck cancer. Front Oncol 2015;5:272.
[29] Poretti A, Meoded A, Huisman TAGM. Neuroimaging of pediatric posterior fossa tumors including review of the literature. J Magn Reson Imaging 2012;35:32–47.
[30] Cole BL, Pritchard CC, Anderson M, Leary SE. Targeted sequencing of malignant supratentorial pediatric brain tumors demonstrates a high frequency of clinically relevant mutations. Pediatr Dev Pathol 2017;25. 109352661774390.
[31] Jansen JF, Backes WH, Nicolay K, Kooi ME. 1H MR spectroscopy of the brain: absolute quantification of metabolites. Radiology 2006;240:318–32.
[32] Rodriguez GD, Awwad A, Meijer L, Manita M, Jaspan T, Dineen RA, et al. Metrics and textural features of MRI diffusion to improve classification of pediatric posterior fossa tumors. Am J Neuroradiol 2014;35:1009–15.
[33] Fetit AE, Jan N, Peet AC, Arvanitis TN. Three-dimensional textural features of conventional MRI improve diagnostic classification of childhood brain tumours. NMR Biomed 2015;28:1174–84.
[34] Haury AC, Gestraud P, Vert JP. The influence of feature selection methods on accuracy, stability and interpretability of molecular signatures. PLoS One 2011;6:e28210.
[35] Mackin D, Fave X, Zhang L, Fried D, Yang J, Taylor B, et al. Measuring computed tomography scanner variability of radiomics features. Invest Radiol 2015;50:757–65.
[36] Kotsiantis SB. Supervised machine learning: a review of classification techniques. Informatica J 2007;31:249–68.
[37] Coroller TP, Bi WL, Huynh E, Abedalthagafi M, Aizer AA, Greenwald NF, et al. Radiographic prediction of meningioma grade by semantic and radiomic features. PLoS One 2017;12:e0187908.