Available online at www.sciencedirect.com
Expert Systems with Applications 36 (2009) 2017–2026
www.elsevier.com/locate/eswa

Using Kaplan–Meier analysis together with decision tree methods (C&RT, CHAID, QUEST, C4.5 and ID3) in determining recurrence-free survival of breast cancer patients

Mevlut Ture a,*, Fusun Tokatli b, Imran Kurt c

a Trakya University, Medical Faculty, Department of Biostatistics, 22030 Edirne, Turkey
b Trakya University, Medical Faculty, Department of Radiation Oncology, Edirne, Turkey
c Eskisehir Osmangazi University, Medical Faculty, Department of Biostatistics, Eskisehir, Turkey
Abstract

Current evidence supports a clear association between clinical and pathologic factors and recurrence-free survival (RFS) in breast cancer patients. The Cox regression model is the most common tool for simultaneously investigating the influence of several factors on patient survival time, but it gives no estimate of the degree of separation between subgroups. We analyze several decision tree methods (C&RT, CHAID, QUEST, C4.5 and ID3) and use them in addition to the well-known Kaplan–Meier estimates to investigate their predictive power. Five hundred patients were included in the study; 279 of them had complete data for the prognostic factors, and the median follow-up was about 40.5 months. First, the decision tree methods were applied to the prognostic factors. Then, according to multidimensional scaling, C4.5 (error rate 0.2258 for the training set and 0.3259 for cross-validation) performed slightly better than the other methods in predicting risk factors for recurrence. Tumor size, age of menarche, hormonal therapy, histological grade and axillary nodal status were found to be important risk factors for recurrence. Eight terminal nodes were found and stratified by Kaplan–Meier survival curves. Larger tumor size (≥4.4 cm) and receiving no hormonal therapy, found in a small subgroup of patients, were associated with worse prognosis. The five-year RFS was 71.3% in the whole patient population. The sensitivity, specificity and predictive rate of the C4.5 method were 43.8%, 91% and 77.4%, respectively. In this study, C4.5 showed a better degree of separation. We therefore recommend using decision tree methods together with Kaplan–Meier analysis to determine risk factors and their effect on survival.

© 2008 Elsevier Ltd. All rights reserved.

Keywords: Decision tree; C&RT; CHAID; QUEST; C4.5; ID3; Kaplan–Meier; Breast cancer; Recurrence-free survival
* Corresponding author. Tel.: +90 284 2357641/1631; fax: +90 284 2357652. E-mail address: [email protected] (M. Ture).
0957-4174/$ - see front matter © 2008 Elsevier Ltd. All rights reserved. doi:10.1016/j.eswa.2007.12.002

1. Introduction

The clinicopathologic characteristics of breast cancer patients are heterogeneous. Consequently, survival times differ among subgroups of patients. In general, five-year recurrence-free survival ranges from 65% to 80% across all breast cancer patients (Buchholz, Strom, & McNeese, 2003). The purposes of this study were to apply a novel analytical method to breast cancer patients to identify prognostic factors, and to explore the interactions between clinical variables and their impact on survival. Decision tree algorithms allow for non-linear relations between predictive factors and outcomes and for mixed data types (numerical and categorical), isolate outliers, and incorporate a pruning process that uses cross-validation as an alternative to testing for unbiasedness with a second data set (Faderl et al., 2002).

In the literature, there are several reports on separating patients into subgroups with different survival prognoses (Aligayer et al., 2002; Kenneth, Abbruzzese,
Lenzi, Raber, & Abbruzzese, 1999; Sauerbrei, Hübner, Schmoor, & Schumacher, 1997; Ture, Memis, Kurt, & Pamukcu, 2005). Decision trees use recursive partitioning to assess the effect of specific variables on survival, thereby ultimately generating groups of patients with similar clinical features and survival times. The partitioning of patients into groups with differing survival times using clinical variables generates a tree-structured model that can be analyzed to assess its clinical utility. For this purpose, decision tree methods such as classification and regression tree (C&RT), Chi-squared automatic interaction detector (CHAID), quick, unbiased, efficient statistical tree (QUEST), Commercial version 4.5 (C4.5) and Interactive Dichotomizer version 3 (ID3) are more suitable than classical statistical methods.

We analyzed the simultaneous relationship among risk factors for breast cancer using five decision tree algorithms. This study compared the relative effects of each risk factor for breast cancer in the multivariate analysis models. We sought to discover significant patterns and relationships among the risk factors and to derive decision rules for the management of breast cancer.

2. Patients and methods

2.1. Patients

A retrospective analysis was performed in 500 breast cancer patients diagnosed between 1997 and 2006.
Complete data were available for 279 patients, who form the basis of this study, on the following prognostic factors: age, menopausal status, age of menarche, age of first delivery, presence of abortus, hormone replacement therapy, family history of cancer, histologic tumor type, quadrant of tumor, tumor size, estrogen and progesterone receptor status, histologic and nuclear grading according to the Scarff–Bloom–Richardson criteria (Bloom & Richardson, 1957), type of surgery, axillary nodal status, pericapsular involvement of lymph nodes, stage of disease according to the AJCC (American Joint Committee on Cancer, 1997), lymphovascular and perineural invasion, radiotherapy, chemotherapy and hormonal therapy. Primary local treatment was surgery (modified radical mastectomy or breast conserving surgery). The median age was 48 years (range, 28–84) in the whole patient population. Tumors were considered positive for estrogen and progesterone receptors if more than 10% of tumor cells showed nuclear staining (Zhang, Salto-Tellez, Putti, Do, & Koay, 2003). Descriptive statistics of the clinical and pathologic data for the entire patient population are listed in Table 1. We performed classical statistical analyses to examine differences in the distribution of variables between patients with and without recurrence. The Kolmogorov–Smirnov test was used to assess the normality of numeric variables. For all numeric variables that were non-normally distributed, the two groups were compared by the Mann–Whitney U-test, and results were expressed as median and interquartile range.
Association of recurrence with nominal variables was assessed using the χ²-test. Adjuvant radiotherapy was given to 230 patients (82.4%) and chemotherapy was administered to 245 patients (87.8%). Chemotherapy was delivered prior to radiotherapy. Hormonal therapy was initiated after the completion of radiotherapy, and in hormone receptor positive patients it typically continued for 5 years or until recurrence of disease. Follow-up consisted of a clinical assessment every three months for the first three years, every six months for the next two years, and annually after five years.

2.2. Statistical analysis

For these variables, decision tree algorithms were used to identify optimal cut points in the data. A 10-fold cross-validation analysis was performed as an initial evaluation of the test error of the algorithms. Briefly, this process involves splitting the dataset into 10 random segments and, in turn, using 9 of them for training and the 10th as a test set for the algorithm. Survival analysis was performed for disease-free survival, the time from initial diagnosis to the first recurrence of disease (local–regional or distant). For the terminal nodes of the best decision tree method, survival curves were estimated by the Kaplan–Meier method, and the difference between the curves was evaluated by the Log-Rank test (Mantel–Cox). Follow-up time for each patient was calculated in months from the last day of the initial treatment to the date of death or the date of the last visit. For all statistical tests, p-values less than 0.05 were considered significant.

2.3. Decision tree algorithms

2.3.1. Classification and regression tree

C&RT is a recursive partitioning method that can be used for both regression and classification. A C&RT tree is constructed by repeatedly splitting subsets of the data set, using all predictor variables, to create two child nodes, beginning with the entire data set.
The best predictor is chosen using one of several impurity or diversity measures (Gini, twoing, ordered twoing and least-squared deviation). The goal is to produce subsets of the data that are as homogeneous as possible with respect to the target variable (Breiman, Friedman, Olshen, & Stone, 1984). In this study, we used the Gini impurity measure, which is intended for categorical target variables.

Gini impurity measure: The Gini index at node t, g(t), is defined as

$$g(t) = \sum_{j \neq i} p(j \mid t)\, p(i \mid t)$$

where i and j are categories of the target variable. The equation for the Gini index can also be written as

$$g(t) = 1 - \sum_{j} p^{2}(j \mid t)$$
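As a minimal sketch of the Gini index in the second form above, the Python snippet below computes g(t) from the class labels in a node, together with the decrease in impurity for a binary split, g(t) − pL·g(tL) − pR·g(tR), which is the quantity C&RT maximizes when choosing a split. The function names and the toy labels are illustrative assumptions and are not taken from the study data.

```python
from collections import Counter


def gini_index(labels):
    """Gini index g(t) = 1 - sum_j p(j|t)^2 for the class labels in one node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_improvement(parent, left, right):
    """Decrease in impurity g(t) - pL*g(tL) - pR*g(tR) for a binary split."""
    n = len(parent)
    p_left, p_right = len(left) / n, len(right) / n
    return gini_index(parent) - p_left * gini_index(left) - p_right * gini_index(right)


# Hypothetical toy node: two evenly distributed classes (k = 2).
node = ["recurrence"] * 4 + ["no recurrence"] * 4
left, right = node[:4], node[4:]  # a split that separates the classes perfectly

print(gini_index(node))                     # maximum for k = 2, i.e. 1 - 1/k
print(gini_index(left))                     # pure node
print(gini_improvement(node, left, right))  # improvement of the perfect split
```

With two evenly distributed classes the index reaches 1 − 1/k = 0.5, and a pure node gives 0, which matches the limiting cases described in the text.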
Thus, when the cases in a node are evenly distributed across the categories, the Gini index takes its maximum value of 1 − (1/k), where k is the number of categories of the target variable. When all cases in the node belong to the same category, the Gini index equals 0. If costs of misclassification are specified, the Gini index is computed as

$$g(t) = \sum_{j \neq i} C(i \mid j)\, p(j \mid t)\, p(i \mid t)$$

where C(i|j) is the cost of misclassifying a category j case as category i. The Gini criterion function for split s at node t is defined as

$$\Phi(s, t) = g(t) - p_L\, g(t_L) - p_R\, g(t_R)$$

where $p_L$ is the proportion of cases in t sent to the left child node, and $p_R$ is the proportion sent to the right child node. The split s is chosen to maximize the value of Φ(s, t). This value is reported as the improvement in the tree (Breiman et al., 1984).

Table 1. Clinical and laboratory characteristics of the study groups

| Independent variable | Category | Recurrence absent (n = 199) | Recurrence present (n = 80) | p |
|---|---|---|---|---|
| Age (years), median (IQR) | | 48 (15) | 48 (20) | 0.308 |
| Age of menarche (years), median (IQR) | | 13 (2) | 13 (1) | 0.884 |
| Tumor size (cm), median (IQR) | | 3 (2) | 4 (4) | <0.001 |
| | | n (%) | n (%) | |
| Hormone replacement therapy | Present | 45 (22.6) | 10 (12.5) | 0.079 |
| Age of first delivery (years) | ≥30 | 11 (5.5) | 6 (7.5) | 0.582 |
| Menopausal status | Post | 79 (39.7) | 38 (47.5) | 0.454 |
| | Pre | 109 (54.8) | 39 (48.8) | |
| | Peri | 11 (5.5) | 3 (3.8) | |
| Presence of abortus | Present | 16 (8.0) | 13 (16.3) | 0.690 |
| Stage of disease | In situ cancer | 6 (3.0) | 0 (0.0) | <0.001 |
| | Early stage cancer | 157 (78.9) | 47 (58.8) | |
| | Locally advanced cancer | 36 (18.1) | 33 (41.3) | |
| Nuclear grade | I | 35 (17.6) | 10 (12.5) | 0.054 |
| | II | 111 (55.8) | 37 (46.3) | |
| | III | 53 (26.6) | 33 (41.3) | |
| Estrogen receptor status | Positive | 152 (76.4) | 52 (65.0) | 0.073 |
| Progesterone receptor status | Positive | 151 (75.9) | 50 (62.5) | 0.035 |
| Type of surgery | Modified radical mastectomy | 138 (69.3) | 67 (83.8) | 0.021 |
| | Breast conserving surgery | 61 (30.7) | 13 (16.3) | |
| Radiotherapy | Present | 161 (80.9) | 69 (86.3) | 0.375 |
| Chemotherapy | Present | 172 (86.4) | 73 (91.3) | 0.363 |
| Hormonal therapy | Present | 159 (79.9) | 48 (60.0) | 0.001 |
| Family history of cancer | Absent | 139 (69.8) | 52 (65.0) | 0.631 |
| | Other cancers | 44 (22.1) | 19 (23.8) | |
| | Breast cancer | 16 (8.0) | 9 (11.3) | |
| Perineural invasion | Present | 63 (31.7) | 35 (43.8) | 0.056 |
| Lymphovascular invasion | Present | 114 (57.3) | 58 (72.5) | 0.018 |
| Axillary nodal status | Negative | 86 (43.2) | 16 (20.0) | <0.001 |
| | 1–3 lymph nodes positive | 59 (29.6) | 20 (25.0) | |
| | ≥4 lymph nodes positive | 54 (27.1) | 44 (55.0) | |
| Histologic grade | I | 36 (18.1) | 10 (12.5) | 0.061 |
| | II | 89 (44.7) | 28 (35.0) | |
| | III | 74 (37.2) | 42 (52.5) | |
| Histologic tumor type | Ductal | 164 (82.4) | 69 (86.3) | 0.547 |
| | Other | 35 (17.6) | 11 (13.8) | |
| Quadrant of tumor | Unicentric | 180 (90.5) | 62 (77.5) | 0.007 |
| | Multicentric | 19 (9.5) | 18 (22.5) | |
| Pericapsular involvement of lymph nodes | Positive | 59 (29.6) | 41 (51.3) | 0.001 |

2.3.2. Chi-squared automatic interaction detection

The CHAID method is based on the χ²-test of association. A CHAID tree is a decision tree that is constructed by repeatedly splitting subsets of the space into two or more child nodes, beginning with the entire data set (Michael & Gordon, 1997). To determine the best split at any node, any allowable pair of categories of the predictor variables is merged until there is no statistically significant difference within the pair with respect to the target variable. The CHAID method naturally deals with interactions between
the independent variables that are directly available from an examination of the tree. The final nodes identify subgroups defined by different sets of independent variables (Magidson & SPSS Inc., 1993). The CHAID algorithm only accepts nominal or ordinal categorical predictors. When predictors are continuous, they are transformed into ordinal predictors before the following algorithm is applied.

For each predictor variable X, non-significant categories are merged. Each final category of X will result in one child node if X is used to split the node. The merging step also calculates the adjusted p-value that is used in the splitting step.

1. If X has only 1 category, stop and set the adjusted p-value to 1.
2. If X has 2 categories, go to step 8.
3. Otherwise, find the allowable pair of categories of X that is least significantly different, i.e., most similar (an allowable pair is two adjacent categories for an ordinal predictor, and any two categories for a nominal predictor). The most similar pair is the pair whose test statistic gives the largest p-value with respect to the dependent variable Y. How to calculate the p-value under various situations will be described in later sections.
4. For the pair having the largest p-value, check whether its p-value is larger than a user-specified alpha-level α_merge. If it is, this pair is merged into a single compound category, and a new set of categories of X is formed. If it is not, go to step 7.
5. (Optional) If the newly formed compound category consists of three or more original categories, find the best binary split within the compound category, i.e., the one whose p-value is the smallest. Perform this binary split if its p-value is not larger than an alpha-level α_split-merge.
6. Go to step 2.
7. (Optional) Any category having too few observations (as compared with a user-specified minimum segment size) is merged with the most similar other category, as measured by the largest of the p-values.
8. The adjusted p-value is computed for the merged categories by applying Bonferroni adjustments that are to be discussed later (Biggs & Suen, 1991; Goodman, 1979; Kass, 1980; Magidson & SPSS Inc., 1993).

2.3.3. Quick, unbiased, efficient statistical tree

QUEST is a binary-split decision tree algorithm for classification and data mining. QUEST can be used with univariate or linear combination splits. A unique feature is that its attribute selection method has negligible bias: if all the attributes are uninformative with respect to the class attribute, then each has approximately the same chance of being selected to split a node (Loh & Shih, 1997). The QUEST tree growing process consists of the selection of a split predictor, the selection of a split point for the selected predictor, and stopping. In this algorithm, only univariate splits are considered. For the selection of the split predictor, it uses the following algorithm.
1. For each continuous predictor X, perform an ANOVA F-test of whether all the classes of the dependent variable Y have the same mean of X, and calculate the p-value from the F statistic. For each categorical predictor, perform Pearson's χ²-test of the independence of Y and X, and calculate the p-value from the χ² statistic.
2. Find the predictor with the smallest p-value and denote it X*.
3. If this smallest p-value is less than α/M, where α ∈ (0, 1) is a user-specified level of significance and M is the total number of predictor variables, predictor X* is selected as the split predictor for the node. If not, go to step 4.
4. For each continuous predictor X, compute Levene's F statistic based on the absolute deviation of X from its class mean to test whether the variances of X for the different classes of Y are the same, and calculate the p-value for the test.
5. Find the predictor with the smallest p-value and denote it X**.
6. If this smallest p-value is less than α/(M + M1), where M1 is the number of continuous predictors, X** is selected as the split predictor for the node. Otherwise, this node is not split (Loh & Shih, 1997).

2.3.4. Commercial version 4.5

C4.5 is a supervised learning classification algorithm used to construct decision trees from data (Quinlan, 1993). Most empirical learning systems are given a set of pre-classified cases, each described by a vector of attribute values, and construct from them a mapping from attribute values to classes. C4.5 is one such system that learns decision tree classifiers. It uses a divide-and-conquer approach to growing decision trees (Benjamin, Tom, Samuel, Weijun, & Xuegang, 2000). The main difference between C4.5 and other similar decision tree building algorithms lies in the test selection and evaluation process. Let the attributes be denoted A = {a1, a2, ..., am}, the cases D = {d1, d2, ..., dn}, and the classes C = {c1, c2, ..., ck}.
For a set of cases D, a test Ti is a split of D based on attribute at. It splits D into mutually exclusive subsets D1, D2, ..., Dp. These subsets of cases are single-class collections of cases. If a test T is chosen, the decision tree for D consists of a node identifying the test T, and one branch for each possible subset Di. For each subset Di, a new test is then chosen for a further split. If Di satisfies a stopping criterion, the tree for Di is a leaf associated with the most frequent class in Di. One reason for stopping is that all cases in Di belong to one class. The C4.5 decision tree algorithm uses a modified splitting criterion, called the gain ratio. It uses arg max(Gain(D, T)) or arg max(GainRatio(D, T)) to choose the test for a split. These criteria are built from the following quantities:

$$\mathrm{Info}(D) = -\sum_{i=1}^{k} p(c_i, D)\, \log_2 p(c_i, D)$$

$$\mathrm{Split}(D, T) = -\sum_{i=1}^{p} \frac{|D_i|}{|D|}\, \log_2 \frac{|D_i|}{|D|}$$

$$\mathrm{Gain}(D, T) = \mathrm{Info}(D) - \sum_{i=1}^{p} \frac{|D_i|}{|D|}\, \mathrm{Info}(D_i)$$
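The quantities Info, Split and Gain can be illustrated with a short Python sketch, which also forms the gain ratio as Gain divided by Split, the criterion C4.5 maximizes according to the text. The function names and the toy labels are illustrative assumptions; this is not the C4.5 implementation used in the study, and it omits pruning and the handling of continuous attributes.

```python
import math
from collections import Counter


def info(labels):
    """Info(D) = -sum_i p(c_i, D) * log2 p(c_i, D); an empty set gives 0."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_and_ratio(labels, subsets):
    """Return (Gain, Split, GainRatio) for a test that partitions `labels`
    into the mutually exclusive `subsets` D_1, ..., D_p."""
    n = len(labels)
    weights = [len(s) / n for s in subsets]
    gain = info(labels) - sum(w * info(s) for w, s in zip(weights, subsets))
    split = -sum(w * math.log2(w) for w in weights if w > 0)
    ratio = gain / split if split > 0 else 0.0
    return gain, split, ratio


# Hypothetical toy data: a test that separates the two classes perfectly.
cases = ["R", "R", "N", "N"]
gain, split, ratio = gain_and_ratio(cases, [["R", "R"], ["N", "N"]])
print(gain, split, ratio)
```

Dividing by Split penalizes tests that scatter the cases over many small subsets, which is the usual motivation for preferring the gain ratio to the raw gain.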
The gain ratio is then

$$\mathrm{GainRatio}(D, T) = \mathrm{Gain}(D, T) / \mathrm{Split}(D, T)$$

where p(ci, D) denotes the proportion of cases in D that belong to the ith class. C4.5 selects the test that maximizes the gain ratio (Benjamin et al., 2000). Once the initial decision tree is constructed, a pruning procedure is initiated to decrease the overall tree size and the estimated error rate of the tree (Quinlan, 1993).

2.3.5. Interactive Dichotomizer version 3

ID3 is a simple decision tree learning algorithm developed by Quinlan (1993). The ID3 algorithm builds decision trees using a top-down, greedy search procedure and represents the core of Quinlan's highly successful C4.5 decision tree algorithm. The basic idea of ID3 is to construct the decision tree by a top-down, greedy search through the given sets, testing each attribute at every tree node. To select the attribute that is most useful for classifying a given set, a metric called information gain is introduced. The ID3 algorithm works as follows (Cheng & Maghsoodloo, 1995; Shao, Zhang, Li, & Chen, 2001).

Suppose T = PE ∪ NE, where PE is the set of positive examples and NE is the set of negative examples, with p = |PE| and n = |NE|. An example will be determined to belong to PE with probability p/(p + n) and to NE with probability n/(p + n). By employing the information theoretic heuristic, a decision tree is considered as a source of a message, PE or NE, with the expected information needed to generate this message given by

$$I(p, n) = \begin{cases} -\dfrac{p}{p+n}\log_2\dfrac{p}{p+n} - \dfrac{n}{p+n}\log_2\dfrac{n}{p+n} & \text{when } p, n \neq 0 \\ 0 & \text{otherwise} \end{cases}$$

If attribute X with value domain {v1, ..., vN} is used for the root of the decision tree, it will partition T into {T1, ..., TN}, where Ti contains those examples in T that have value vi of X. Let Ti contain pi examples of PE and ni of NE. The expected information required for the sub-tree for Ti is I(pi, ni). The expected information required for the tree with X as the root, EI(X), is then obtained as a weighted average

$$EI(X) = \sum_{i=1}^{N} \frac{p_i + n_i}{p + n}\, I(p_i, n_i)$$

where the weight for the ith branch is the proportion of the examples in T that belong to Ti. The information gained by branching on X, G(X), is therefore

$$G(X) = I(p, n) - EI(X)$$

ID3 examines all candidate attributes, chooses X to maximize G(X), constructs the tree, and then uses the same process recursively to construct decision trees for the residual subsets T1, ..., TN. For each Ti (i = 1, 2, ..., N): if all the examples in Ti are positive, it creates a "yes" node and halts; if all the examples in Ti are negative, it creates a "no" node and halts; otherwise it selects another attribute in the same way as given earlier.

2.4. Multidimensional scaling (MDS)

MDS is a method that represents measurements of similarity (or dissimilarity) among pairs of objects as distances between points in a low-dimensional space. It helps us to represent (dis)similarities between objects as distances in a Euclidean space. In effect, the more dissimilar two objects are, the larger the distance between them in the Euclidean space should be. The objects in our study are the different classification techniques, described by their characteristics in terms of classification performance measurements. The location of the techniques on the map is based on their position in the m-dimensional variable space. Similar to the R² measure in regression analysis, a pseudo-R² is calculated in MDS. The pseudo-R² is equal to the percentage of the sum of squared dissimilarities explained by the model. Another goodness-of-fit measure of the projection is the so-called stress factor. A stress factor <0.05 is considered to be good (Borg & Groenen, 1997; Kruskal, 1964).

2.5. Kaplan–Meier survival analysis

The Kaplan–Meier analysis is a non-parametric technique for estimating time-related events (Kaplan & Meier, 1958). It can be used to test the statistical significance of differences between the survival curves associated with two different circumstances. It is applied by analyzing the distribution of patient survival times following their recruitment to a study. The analysis expresses these in terms of the proportion of patients still alive up to a given time following recruitment. In graphical terms, a plot of the proportion of patients surviving against time has a characteristic decline (often exponential), with the steepness of the curve indicating the efficacy of the treatment being investigated: the shallower the survival curve, the more effective the treatment (Utley et al., 2000). A variety of tests may be used to compare two or more Kaplan–Meier curves under certain well-defined circumstances. Median remission time (the time when 50% of the cohort has reached remission), as well as quantities such as the three-, five-, and ten-year probability of remission, can also be generated from the Kaplan–Meier analysis, provided there has been sufficient follow-up of patients.

3. Results
3.1. Characteristics of study subjects

After a median follow-up of 40.5 months (25th–75th percentiles: 21.4–59.7 months), 80 (28.7%) patients had experienced at least one of the events for recurrence-free survival (locoregional recurrence, distant metastases, or second cancer). Tumor size, axillary nodal status, stage of disease, lymphovascular invasion, quadrant of tumor, progesterone receptor status, pericapsular involvement of lymph nodes, type of surgery, and hormonal therapy were found to be statistically significant prognostic factors for recurrence (Table 1).
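To make the χ²-test behind these comparisons concrete, the sketch below works through one 2 × 2 example using the type-of-surgery counts in Table 1 (modified radical mastectomy versus breast conserving surgery: 138 and 61 patients without recurrence, 67 and 13 with recurrence). The use of Yates' continuity correction is an assumption, since the paper does not state which variant of the test was applied; the function name is also illustrative.

```python
import math


def chi_square_2x2(a, b, c, d, yates=True):
    """Pearson chi-square test for the 2x2 table [[a, b], [c, d]].

    Returns (chi2, p_value) with 1 degree of freedom. The continuity
    correction (yates=True) is an assumption, not taken from the paper.
    """
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        diff = max(diff - n / 2, 0.0)
    chi2 = n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Rows: recurrence absent / present; columns: mastectomy / breast conserving.
chi2, p = chi_square_2x2(138, 61, 67, 13)
print(round(chi2, 2), round(p, 3))
```

Under this assumption the p-value rounds to 0.021, which appears to agree with the value listed for type of surgery in Table 1.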
3.2. Comparison of decision tree methods

We used the error rate for the training set and for cross-validation to monitor the prediction performance of the methods. Table 2 gives the performance measures of the C&RT, CHAID, QUEST, C4.5 and ID3 algorithms. The error rates for the training set ranged from 0.2150 to 0.2510, and the error rates for cross-validation ranged from 0.2760 to 0.3259. As can be seen from Table 2, C&RT had the smallest error rate for the training set. C4.5 and ID3 ranked second, CHAID ranked third, followed by QUEST. A comparison of the predictive values (sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV) and predictive rate (PR)) of the decision tree methods for the training set is shown in Table 3. Across all models, the sensitivity, specificity, PPV, NPV and PR for the training set were in the ranges 21.3–43.8%, 89.9–99.5%, 61.5–95.5%, 75.3–80.1% and 74.9–78.5%, respectively. The predictive values for the training set were used as input variables in MDS in order to identify homogeneous groups of classification techniques. The stress factor in MDS was 0.000000009. The two dimensions are plotted against each other in Fig. 1. As can be seen from Fig. 1, C4.5 performed better than CHAID, QUEST, C&RT and ID3 in predicting breast cancer recurrence.

Table 2. Comparison of the error rates of models

| Model | Error rate, training set | Error rate, cross-validation |
|---|---|---|
| C&RT | 0.2150 | 0.2830 |
| CHAID | 0.2440 | 0.3230 |
| QUEST | 0.2510 | 0.2760 |
| C4.5 | 0.2258 | 0.3259 |
| ID3 | 0.2258 | 0.3111 |

[Figure 1: two-dimensional MDS map of the five methods (C&RT, CHAID, QUEST, C4.5 and ID3); axes: Dimension 1 and Dimension 2.]

Fig. 1. C4.5 was found to be the best method by multidimensional scaling.
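The predictive values compared in this section all follow from a 2 × 2 confusion matrix, as the sketch below illustrates. The paper does not report the confusion matrices, so the C4.5 counts used here (35 true positives, 45 false negatives, 181 true negatives, 18 false positives) are an assumption: they are the integer counts consistent with the reported percentages and with the group sizes of 80 patients with recurrence and 199 without.

```python
def predictive_values(tp, fn, tn, fp):
    """Performance measures (in %) from confusion-matrix counts,
    plus the error rate as a proportion."""
    total = tp + fn + tn + fp
    return {
        "sensitivity": 100 * tp / (tp + fn),
        "specificity": 100 * tn / (tn + fp),
        "ppv": 100 * tp / (tp + fp),
        "npv": 100 * tn / (tn + fn),
        "predictive_rate": 100 * (tp + tn) / total,
        "error_rate": (fp + fn) / total,
    }


# Counts inferred from the reported C4.5 percentages (an assumption).
c45 = predictive_values(tp=35, fn=45, tn=181, fp=18)
for name, value in c45.items():
    print(f"{name}: {value:.4f}")
```

With these counts the measures come out close to the C4.5 values reported in Tables 2 and 3 (sensitivity 43.8%, specificity 91.0%, PPV 66.0%, NPV 80.1%, PR 77.4%, training error rate 0.2258).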
3.3. Classification tree and rules of C4.5 for the prediction of RFS in breast cancer

In the C4.5 analysis, we identified the variables that play important roles in explaining recurrences (Table 4). Tumor size was the most important determining factor for recurrence. This first-level split produced the two initial branches of the classification tree: <4.4 cm versus ≥4.4 cm. The two subtrees differed. For tumor size <4.4 cm, age of menarche proved the best predicting variable. Within the branch with age of menarche <15 years, hormonal therapy was the most prominent variable. For tumor size <4.4 cm, age of menarche <15 years and absence of hormonal therapy, axillary nodal status (≥4 lymph nodes positive) was the most prominent. For tumor size ≥4.4 cm with hormonal therapy present, histological grade was the most prominent.

Classification trees are charts that illustrate decision rules. The decision rules provide specific information about risk factors based on rule induction. They begin with one root node that contains all of the observations in the sample. The C4.5 tree has 13 nodes, of which 8 are terminal nodes.

3.4. Survival analysis for breast cancer patients
Table 3. Comparison of the performance of models for training set

| Model | Sensitivity (%) | Specificity (%) | PPV (%) | NPV (%) | PR (%) |
|---|---|---|---|---|---|
| C&RT | 26.3 | 99.5 | 95.5 | 77.0 | 78.5 |
| CHAID | 40.0 | 89.9 | 61.5 | 78.8 | 75.6 |
| QUEST | 21.3 | 96.5 | 70.8 | 75.3 | 74.9 |
| C4.5 | 43.8 | 91.0 | 66.0 | 80.1 | 77.4 |
| ID3 | 27.5 | 97.5 | 81.5 | 77.0 | 77.4 |
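Because the survival analysis in this section rests on Kaplan–Meier estimates for each terminal node, a minimal sketch of the product-limit estimator may help. It multiplies, at each distinct event time, the running survival estimate by (1 − d/r), where d is the number of events at that time and r the number of patients still at risk. The function name and the follow-up times below are hypothetical toy values, not patient data from the study.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the survival function.

    times  : follow-up time for each patient (e.g. in months)
    events : 1 if the event (recurrence) was observed, 0 if censored
    Returns a list of (time, survival) pairs at each distinct event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0   # events observed exactly at time t
        n_leaving = 0  # patients leaving the risk set at t (events + censored)
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            survival *= 1 - n_events / at_risk
            curve.append((t, survival))
        at_risk -= n_leaving
    return curve


# Hypothetical toy cohort: five patients, two recurrences, three censored.
months = [6, 12, 12, 20, 30]
recurred = [1, 1, 0, 0, 0]
for t, s in kaplan_meier(months, recurred):
    print(t, round(s, 3))
```

In this toy cohort the estimate drops to about 0.8 at 6 months (one event among five at risk) and to about 0.6 at 12 months (one event among four at risk); censored patients lower the number at risk without producing a step.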
The tree of C4.5 had an initial split on breast cancer, and eight terminal nodes were formed. The variables determining the structure of the tree included tumor size, age of menarche, hormonal therapy, histological grade and axillary nodal status. The longest surviving terminal node (node 1) included only three events in 41 patients with tumor size <4.4 cm
M. Ture et al. / Expert Systems with Applications 36 (2009) 2017–2026
2023
Table 4 Terminal nodes Model
Terminal nodes
Recurrence (%)
C&RT
Node 1: Tumor size (64.3 cm) Node 2: Tumor size (>4.3 cm) + Age (>64 years-old) Node 3: Tumor size (>4.3 cm) + Age (664 years-old) + Axillary nodal status (negative, 1–3 lymph nodes positive) Node 4: Tumor size (>4.3 cm) + Age (664 years-old) + Axillary nodal status (P4 lymph nodes positive) + Hormonal therapy (absent) Node 5: Tumor size (>4.3 cm) + Age (664 years-old) + Axillary nodal status (P4 lymph nodes positive) + Hormonal therapy (present) Node 1: Axillary nodal status (negative, 1–3 lymph nodes positive) + Quadrant of tumor (multicentric) Node 2: Axillary nodal status (P4 lymph nodes positive) + Radiotherapy (absent) Node 3: Axillary nodal status (negative, 1–3 lymph nodes positive) + Quadrant of tumor (unicentric) + Hormonal therapy (absent) Node 4: Axillary nodal status (negative, 1–3 lymph nodes positive) + Quadrant of tumor (unicentric) + Hormonal therapy (present) Node 5: Axillary nodal status (P4 lymph nodes positive) + Radiotherapy (present) + Progesterone receptor status (negative) Node 6: Axillary nodal status (P4 lymph nodes positive) + Radiotherapy (present) + Progesterone receptor status (positive) Node 1: Tumor size (>6 cm) Node 2: Tumor size (66 cm) + Presence of abortus (present) Node 3: Tumor size (66 cm) + Presence of abortus (present) +Axillary nodal status (P4 lymph nodes positive) Node 4: Tumor size (66 cm) + Presence of abortus (present) + Axillary nodal status (negative, 1–3 lymph nodes positive) + Age of menarche (>13 years-old) Node 5: Tumor size (66 cm) + Presence of abortus (present) + Axillary nodal status (negative, 1–3 lymph nodes positive) + Age of menarche (613 years-old) +Hystologic tumor type (ductal) Node 6: Tumor size (66 cm) + Presence of abortus (present) + Axillary nodal status (negative, 1–3 lymph nodes positive) + Age of menarche (613 years-old) + Hystologic tumor type (other) Node 6: Tumor size (<4.4 cm) + Age of menarche (<15 years-old) + Hormonal therapy (absent) + Axillary nodal status (P4 lymph nodes positive) 
Node 7: Tumor size (<4.4 cm) + Age of menarche (<15 years-old) + Hormonal therapy (absent) + Axillary nodal status (negative) Node 8: Tumor size (<4.4 cm) + Age of menarche (<15 years-old) + Hormonal therapy (absent) + Axillary nodal status (1–3 lymph nodes positive) Node 3: Tumor size (<4.4 cm) + Age of menarche (<15 years-old) + Hormonal therapy (present) Node 1: Tumor size (<4.4 cm) + Age of menarche (P15 years-old) Node 2: Tumor size (P4.4 cm) + Hormonal therapy (absent) Node 4: Tumor size (P4.4 cm) + Hormonal therapy (present) + Hystologic grade (I–II) Node 5: Tumor size (P4.4 cm) + Hormonal therapy (present) + Hystologic grade (III) Node 1: Tumor size (<4.4 cm) Node 4: Tumor size (P4.4 cm) + Axillary nodal status (P4 lymph nodes positive) + Hormonal therapy (absent) Node 5: Tumor size (P4.4 cm) + Axillary nodal status (P4 lymph nodes positive) + Hormonal therapy (present) + Pericapsular Involvement of Lymph Nodes (positive) Node 6: Tumor size (P4.4 cm) + Axillary nodal status (P4 lymph nodes positive) + Hormonal therapy (present) + Pericapsular Involvement of Lymph Nodes (negative) Node 2: Tumor size (P4.4 cm) + Axillary nodal status (negative) Node 3: Tumor size (P4.4 cm) + Axillary nodal status (1–3 lymph nodes positive)
21.6 100.0 13.3 92.3
CHAID
QUEST
C4.5
ID3
and age of menarche P15 years-old. Such patients had a 52-month median survival. A second terminal node (node 7) with a relatively long median survival of 49-month included only two events in 15 patients with tumor size <4.4 cm, age of menarche <15 years-old, hormonal therapy was not given and axillary nodal status was negative. A third terminal node (node 8) with median survival of 45 months included nine events in 15 patients with tumor size <4.4 cm and age of menarche <15years-old, hormonal therapy was not given and axillary nodal status was 1–3 lymph nodes positive. One of the shortest surviving terminal node (node 2) included 15 events in 19 patients with tumor size P4.4 cm and hormonal therapy was not given.
45.0 52.9 100.0 30.0 12.1 58.6 33.3 70.8 44.0 37.0 7.8 24.4 6.7 42.9 13.3 60.0 20.4 7.3 79.0 31.6 57.9 21.6 92.9 69.2 36.4 20.0 35.7
These patients had a median survival of only 14.6 months (Table 5). The five-year Kaplan–Meier estimate of recurrence-free survival was 71.3% in the whole patient population. Fig. 2 shows the estimated recurrence-free survival rates for the terminal nodes of the decision tree produced by the C4.5 analysis. We tested the statistical significance of the difference between the survival curves of each pair of terminal nodes using the log-rank test (Table 6). The survival curve of node 1 was statistically different from those of nodes 2, 4, 5, 6 and 8. Node 2 was statistically different from all nodes except node 5. The survival curve of node 3 was statistically different from those of nodes 2, 5 and 8. The survival curve of node 4 was statistically different from those of nodes 2 and 5. The survival curve of node 5 was statistically different from that of node 7 and, finally, node 7 was statistically different from node 8.

Table 5 Descriptive statistics and five-year recurrence-free survival (RFS) for each node

| Node | Terminal node | Median (months) | Mean (months) | Standard deviation (months) | Number of recurrences | n | Five-year RFS (%) |
|---|---|---|---|---|---|---|---|
| 1 | Tumor size (<4.4 cm) + Age of menarche (≥15 years old) | 52.1 | 47.5 | 25.2 | 3 | 41 | 97.7 |
| 2 | Tumor size (≥4.4 cm) + Hormonal therapy (absent) | 14.6 | 22.3 | 22.2 | 15 | 19 | 21.1 |
| 3 | Tumor size (<4.4 cm) + Age of menarche (<15 years old) + Hormonal therapy (present) | 40.9 | 44.7 | 27.8 | 28 | 137 | 79.6 |
| 4 | Tumor size (≥4.4 cm) + Hormonal therapy (present) + Histologic grade (I–II) | 33.5 | 46.9 | 30.3 | 6 | 19 | 68.4 |
| 5 | Tumor size (≥4.4 cm) + Hormonal therapy (present) + Histologic grade (III) | 20.5 | 23.7 | 15.1 | 11 | 19 | 42.1 |
| 6 | Tumor size (<4.4 cm) + Age of menarche (<15 years old) + Hormonal therapy (absent) + Axillary nodal status (≥4 lymph nodes positive) | 41.8 | 40.3 | 23.4 | 6 | 14 | 57.1 |
| 7 | Tumor size (<4.4 cm) + Age of menarche (<15 years old) + Hormonal therapy (absent) + Axillary nodal status (negative) | 48.9 | 48.0 | 22.7 | 2 | 15 | 86.7 |
| 8 | Tumor size (<4.4 cm) + Age of menarche (<15 years old) + Hormonal therapy (absent) + Axillary nodal status (1–3 lymph nodes positive) | 44.8 | 50.2 | 44.5 | 9 | 15 | 40.0 |
| Total | | 40.5 | 42.6 | 28.1 | 80 | 279 | 71.3 |

4. Discussion

Fig. 2. Kaplan–Meier survival curves of the eight terminal nodes generated from C4.5 (x-axis: follow-up time in months, 0–200; y-axis: disease-free survival, 0.0–1.0; one curve per node, nodes 1–8).
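The curves in Fig. 2 are Kaplan–Meier (product-limit) estimates computed separately for each terminal node. The following Python sketch illustrates that estimator under simple assumptions: the function names (`kaplan_meier`, `survival_at`, `median_survival`) and the toy follow-up data are hypothetical and are not taken from the study data set or from the software used by the authors.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Product-limit estimate of the survival function.

    times  -- follow-up time of each patient (months)
    events -- 1 if recurrence was observed at that time, 0 if censored
    Returns a list of (time, survival) pairs, one per distinct event time.
    """
    recurrences = Counter(t for t, e in zip(times, events) if e)
    leaving = Counter(times)  # patients leaving the risk set at each time
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = recurrences.get(t, 0)
        if d:
            # Multiply by the conditional probability of no recurrence at t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= leaving[t]
    return curve


def survival_at(curve, t):
    """Estimated probability of remaining recurrence-free beyond time t."""
    s = 1.0
    for time, surv in curve:
        if time > t:
            break
        s = surv
    return s


def median_survival(curve):
    """First time at which the estimate is 0.5 or below (None if never)."""
    for time, surv in curve:
        if surv <= 0.5:
            return time
    return None


# Hypothetical toy data for two terminal nodes (NOT the study data).
toy_nodes = {
    "node A": ([6, 6, 6, 7, 10], [1, 0, 1, 1, 0]),
    "node B": ([12, 30, 45, 60, 72], [0, 1, 0, 0, 0]),
}
for name, (times, events) in toy_nodes.items():
    curve = kaplan_meier(times, events)
    print(name, "five-year RFS:", round(survival_at(curve, 60), 3),
          "median:", median_survival(curve))
```

Applied to each terminal node in turn, `survival_at(curve, 60)` corresponds to a five-year RFS estimate and `median_survival(curve)` to a median recurrence-free time, the kinds of quantities summarized in Table 5.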
In this study, we developed several models for predicting risk factors for recurrence in breast cancer. Specifically, we used five decision tree methods and evaluated the performance of the models according to their predictive values. Multidimensional scaling (MDS) was used to identify homogeneous groups of classification techniques. We then estimated RFS rates for the terminal nodes of the decision tree produced by the C4.5 analysis.
Current evidence supports a clear association between clinical and pathologic factors and reduced RFS in breast cancer. The prognostic factors influencing recurrence and survival can be divided into intrinsic factors, which are related to the characteristics of the tumor (histologic features, axillary lymph node metastases, tumor size, hormonal receptor status, histologic and nuclear grade, stage, lymphovascular invasion, pericapsular involvement of lymph nodes and perineural invasion), and extrinsic factors (host factors such as age, menopausal status, age of menarche and family history of cancer, and the type and adequacy of treatment: surgery, radiotherapy, chemotherapy and hormonal therapy). The incidence of recurrence is greater, and survival is shorter, with larger tumor size, higher histologic grade, presence of lymphovascular invasion, involvement of axillary nodes by tumor, negative estrogen receptor status and young age (Carter, Allen, & Henson, 1989; Henson, Ries, Freedman, & Carriaga, 1991). In general, five-year recurrence-free survival ranges from 65% to 80% across all breast cancer patients (Buchholz et al., 2003).
The Cox regression model is the most common tool for simultaneously investigating the influence of several factors on the survival time of patients, but it gives no estimate of the degree of separation of the different subgroups. Several reports in the literature separate patients into subgroups with different survival prognoses. Kenneth et al. (1999) reported that, because of the complex presentations of patients with unknown primary carcinoma, clinicians often have difficulty applying standard statistical methods to assess the interactions between clinical variables, determining the cumulative effect of these variables on survival, and translating this information into appropriate management. They therefore used Kaplan–Meier analysis together with C&RT in patients with unknown primary carcinoma. Aligayer et al. (2002) examined whether Src activity is a marker for poor clinical prognosis in colon carcinoma patients and, using Kaplan–Meier analysis, found a significant association between elevated Src activity and shorter overall survival.

Table 6 Pairwise comparisons by log-rank (Mantel–Cox) test

| | Node 2 | Node 3 | Node 4 | Node 5 | Node 6 | Node 7 | Node 8 |
|---|---|---|---|---|---|---|---|
| Node 1 | χ² = 38.805, p < 0.001 | χ² = 3.558, p = 0.059 | χ² = 5.574, p = 0.018 | χ² = 24.869, p < 0.001 | χ² = 8.957, p = 0.003 | χ² = 0.473, p = 0.492 | χ² = 14.153, p < 0.001 |
| Node 2 | | χ² = 50.900, p < 0.001 | χ² = 12.764, p < 0.001 | χ² = 1.141, p = 0.285 | χ² = 4.593, p = 0.032 | χ² = 16.327, p < 0.001 | χ² = 5.266, p = 0.022 |
| Node 3 | | | χ² = 0.788, p = 0.375 | χ² = 26.697, p < 0.001 | χ² = 3.737, p = 0.053 | χ² = 0.510, p = 0.475 | χ² = 8.432, p = 0.004 |
| Node 4 | | | | χ² = 7.131, p = 0.008 | χ² = 0.428, p = 0.513 | χ² = 1.486, p = 0.223 | χ² = 1.865, p = 0.172 |
| Node 5 | | | | | χ² = 1.507, p = 0.220 | χ² = 9.790, p = 0.002 | χ² = 1.804, p = 0.179 |
| Node 6 | | | | | | χ² = 3.152, p = 0.076 | χ² = 0.071, p = 0.790 |
| Node 7 | | | | | | | χ² = 4.322, p = 0.038 |
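The pairwise comparisons reported in Table 6 rest on the log-rank (Mantel–Cox) test, which compares the observed and expected numbers of recurrences in two groups at every event time. The sketch below is a minimal two-group version in Python, given only as an illustration: the function names (`logrank_test`, `chi2_1df_pvalue`) and the toy data are hypothetical, and the p-value is taken from the chi-square distribution with one degree of freedom via the complementary error function.

```python
import math


def chi2_1df_pvalue(chi2):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square, p-value).

    times_*  -- follow-up times (months)
    events_* -- 1 if recurrence was observed, 0 if censored
    """
    data = [(t, e, "a") for t, e in zip(times_a, events_a)]
    data += [(t, e, "b") for t, e in zip(times_b, events_b)]
    observed_a = expected_a = variance = 0.0
    for t in sorted({t for t, e, _ in data if e}):
        n_a = sum(1 for u, _, g in data if u >= t and g == "a")  # at risk in A
        n = sum(1 for u, _, _ in data if u >= t)                 # at risk overall
        d_a = sum(1 for u, e, g in data if u == t and e and g == "a")
        d = sum(1 for u, e, _ in data if u == t and e)
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            # Hypergeometric variance of the event count in group A at time t.
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    chi2 = (observed_a - expected_a) ** 2 / variance
    return chi2, chi2_1df_pvalue(chi2)


# Hypothetical toy groups (NOT the study data).
chi2, p = logrank_test([1, 2], [1, 1], [3, 4], [1, 1])
print(round(chi2, 3), round(p, 3))
```

As a consistency check on the conversion from statistic to p-value, `chi2_1df_pvalue(7.131)` is about 0.008, which agrees with the node 4 versus node 5 entry of Table 6.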
Stark and Pfeiffer (1999) reported that classification trees (ID3, C4.5, CHAID and C&RT) were well suited for exploratory data analysis in complex data sets in veterinary epidemiology. Sauerbrei et al. (1997) reported new prognostic classification schemes for node-negative breast cancer patients. According to their C&RT analysis for RFS, tumor size and grade were the most important prognostic factors in this group of patients, followed by age and estrogen receptor status. In our study, we found similar prognostic factors for RFS in breast cancer patients with the C4.5 method. Node 2 (tumor size ≥4.4 cm and no hormonal treatment) consists of 19 patients with a bad prognosis (recurrence in 79% of patients, median RFS of 14.6 months). Patients with low-grade (HG I) breast cancers have a better prognosis than those with high-grade carcinoma (Henson et al., 1991). The present study showed a statistically significantly shorter RFS in patients with high-grade (HG III) tumors than in those with lower-grade (HG I–II) tumors (node 5 versus node 4, p = 0.008). In breast cancer patients, tumor metastasis to axillary lymph nodes is a significant risk factor for survival outcome and for the development of metastatic disease (Carter et al., 1989). In the present study, the survival of patients with ≥4 positive axillary lymph nodes (node 6) differed significantly from that of nodes 1 and 2 (p = 0.003 and p = 0.032, respectively).
Age at menarche has been shown to be a risk factor for the development of primary breast cancer. Evidence indicates that lifetime estrogen exposure may be a critical factor in breast carcinogenesis. However, its prognostic influence on breast cancer once it has presented is uncertain. Some studies found no association between age at menarche and outcome in patients with primary breast cancer (Tsutsui et al., 2003). In contrast, Trivers et al. (2007) reported that early age at menarche modestly increased mortality. In our study, age of menarche was the second most important risk factor for survival.
In this study, we found that C4.5 performed better than the CHAID, QUEST, ID3 and C&RT techniques. Furthermore, we estimated RFS rates using Kaplan–Meier analysis for the terminal nodes of the C4.5 tree. As a result, we recommend using decision tree methods together with Kaplan–Meier analysis to determine risk factors and the effect of these factors on survival. We compared the methods on a real data set in order to provide information on the general tendency of data structures, to assess the effect of specific variables on survival, and to help researchers select the best method for solving classification problems. There are limited data on the sufficiency of classification by only one method. On the basis of these considerations, we suggest that data should be explored and processed with high-performance modelling methods. Researchers should avoid assessing data with only one method in future studies of breast cancer or any other clinical condition.
References
Aligayer, H., Boyd, D. D., Heiss, M. M., Abdalla, E. K., Curley, S. A., & Gallick, G. E. (2002). Activation of Src kinase in primary colorectal carcinoma: An indicator of poor clinical prognosis. Cancer, 94(2), 344–351.
American Joint Committee on Cancer (1997). AJCC cancer staging manual. Philadelphia, Pa: Lippincott-Raven Publishers.
Benjamin, K. T., Tom, B. Y. L., Samuel, W. K. C., Weijun, G., & Xuegang, Z. (2000). Enhancement of a Chinese discourse marker tagger with C4.5. In Annual Meeting of the ACL (Proceedings of the second workshop on Chinese language processing: Held in conjunction with the 38th Annual Meeting of the Association for Computational Linguistics) (Vol. 12, pp. 38–45). Morristown, NJ, USA: Association for Computational Linguistics.
Biggs, D. B. V., & Suen, E. (1991). A method of choosing multiway partitions for classification and decision trees. Journal of Applied Statistics, 18, 49–62.
Bloom, H. J. G., & Richardson, W. W. (1957). Histological grading and prognosis in breast cancer; a study of 1409 cases of which 359 have been followed for 15 years. British Journal of Cancer, 11(3), 359–377.
Borg, I., & Groenen, P. (1997). Modern multidimensional scaling theory and applications. New York: Springer-Verlag.
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Monterey: Wadsworth and Brooks/Cole.
Buchholz, T. A., Strom, E. A., & McNeese, M. D. (2003). The breast. In J. D. Cox & K. K. Ang (Eds.), Radiation oncology: Rationale, technique, results (pp. 333–386). St. Louis, Missouri: Mosby.
Carter, C. L., Allen, C., & Henson, D. E. (1989). Relation of tumour size, lymph node status and survival in 24740 breast cancer cases. Cancer, 63, 181–187.
Cheng, B., & Maghsoodloo, S. (1995). Optimization of mechanical assembly tolerances by incorporating Taguchi’s quality loss function. Journal of Manufacturing Systems, 14(4), 264–276.
Faderl, S., Keating, M. J., Do, K.-A., Liang, S.-Y., Kantarjian, H. M., O’Brien, S., et al. (2002). Expression profile of 11 proteins and their prognostic significance in patients with chronic lymphocytic leukemia (CLL). Leukemia, 16, 1045–1052.
Goodman, L. A. (1979). Simple models for the analysis of association in cross-classifications having ordered categories. Journal of the American Statistical Association, 74, 537–552.
Henson, D. E., Ries, L., Freedman, L. S., & Carriaga, M. (1991). Relationship among outcome, stage of disease, and histologic grade for 22616 cases of breast cancer. The basis for a prognostic index. Cancer, 68, 2142–2149.
Kaplan, E. L., & Meier, P. (1958). Nonparametric estimation from incomplete observations. Journal of the American Statistical Association, 53, 457–481.
Kass, G. (1980). An exploratory technique for investigating large quantities of categorical data. Applied Statistics, 29(2), 119–127.
Kenneth, R. H., Abbruzzese, M. C., Lenzi, R., Raber, M. N., & Abbruzzese, J. L. (1999). Classification and regression tree analysis of 1000 consecutive patients with unknown primary carcinoma. Clinical Cancer Research, 5, 3403–3410.
Kruskal, J. B. (1964). Multidimensional scaling by optimizing goodness of fit to nonmetric hypothesis. Psychometrika, 29, 1–27.
Loh, W.-Y., & Shih, Y.-S. (1997). Split selection methods for classification trees. Statistica Sinica, 7, 815–840.
Magidson, J., & SPSS Inc. (1993). SPSS for Windows CHAID Release 6.0. Chicago: SPSS Inc.
Michael, J. A., & Gordon, S. L. (1997). Data mining technique: For marketing, sales and customer support. New York: Wiley.
Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Francisco, CA: Morgan Kaufmann.
Sauerbrei, W., Hübner, K., Schmoor, C., & Schumacher, M. (1997). Validation of existing and development of new prognostic classification schemes in node negative breast cancer. Breast Cancer Research and Treatment, 42, 149–163.
Shao, X., Zhang, G., Li, P., & Chen, Y. (2001). Application of ID3 algorithm in knowledge acquisition for tolerance design. Journal of Materials Processing Technology, 117(1–2), 66–74.
Stark, K. D. C., & Pfeiffer, D. U. (1999). The application of nonparametric techniques to solve classification problems in complex data sets in veterinary epidemiology – an example. Intelligent Data Analysis, 3, 23–35.
Trivers, K. F., Gammon, M. D., Abrahamson, P. E., Lund, M. J., Flagg, E. W., Kaufman, J. S., et al. (2007). Association between reproductive factors and breast cancer survival in younger women. Breast Cancer Research and Treatment, 103(1), 93–102.
Tsutsui, S., Ohno, S., Murakami, S., Kataoka, A., Kinoshita, J., & Hachitanda, Y. (2003). Histological classification of invasive ductal carcinoma and the biological parameters in breast cancer. Breast Cancer, 10, 149–152.
Ture, M., Memis, D., Kurt, I., & Pamukcu, Z. (2005). Predictive value of thyroid hormones on the first day in adult respiratory distress syndrome patients admitted to ICU: Comparison with SOFA and APACHE II scores. Annals of Saudi Medicine, 25(6), 466–472.
Utley, M., Gallivan, S., Young, A., Cox, N., Davies, P., Dixey, J., et al. (2000). Potential bias in Kaplan–Meier survival analysis applied to rheumatology drug studies. Rheumatology, 39, 1–2.
Zhang, D., Salto-Tellez, M., Putti, T. C., Do, E., & Koay, E. S. (2003). Reliability of tissue microarrays in detecting protein expression and gene amplification in breast cancer. Modern Pathology, 16(1), 79–85.