Using Kaplan–Meier analysis together with decision tree methods (C&RT, CHAID, QUEST, C4.5 and ID3) in determining recurrence-free survival of breast cancer patients



Expert Systems with Applications Expert Systems with Applications 36 (2009) 2017–2026 www.elsevier.com/locate/eswa

Mevlut Ture a,*, Fusun Tokatli b, Imran Kurt c

a Trakya University, Medical Faculty, Department of Biostatistics, 22030 Edirne, Turkey
b Trakya University, Medical Faculty, Department of Radiation Oncology, Edirne, Turkey
c Eskisehir Osmangazi University, Medical Faculty, Department of Biostatistics, Eskisehir, Turkey

Abstract

Current evidence supports a clear association between clinical and pathologic factors and recurrence-free survival (RFS) in breast cancer patients. The Cox regression model is the most common tool for investigating the simultaneous influence of several factors on patient survival time, but it gives no estimate of the degree of separation between subgroups. We analyze several decision tree methods (C&RT, CHAID, QUEST, C4.5 and ID3) and use them alongside the well-known Kaplan–Meier estimates to investigate their predictive power. Five hundred patients were included in the study; 279 of them had complete data for the prognostic factors, and the median follow-up was about 40.5 months. First, the decision tree methods were applied to the prognostic factors. Then, according to multidimensional scaling, C4.5 (error rate 0.2258 for the training set and 0.3259 for cross-validation) performed slightly better than the other methods in predicting risk factors for recurrence. Tumor size, age of menarche, hormonal therapy, histological grade and axillary nodal status were found to be important risk factors for recurrence. Eight terminal nodes were found and stratified by Kaplan–Meier survival curves. Larger tumor size (≥4.4 cm) and receiving no hormonal therapy were associated with worse prognosis in a small subgroup of patients. The five-year RFS was 71.3% in the whole patient population. The sensitivity, specificity and predictive rate of the C4.5 method were 43.8%, 91% and 77.4%, respectively. In this study, C4.5 showed a better degree of separation. We therefore recommend using decision tree methods together with Kaplan–Meier analysis to determine risk factors and the effect of these factors on survival.

© 2008 Elsevier Ltd. All rights reserved.

Keywords: Decision tree; C&RT; CHAID; QUEST; C4.5; ID3; Kaplan–Meier; Breast cancer; Recurrence-free survival

1. Introduction

The clinicopathologic characteristics of breast cancer patients are heterogeneous. Consequently, survival times differ between subgroups of patients. Generally, five-year recurrence-free survival ranges from 65% to 80% across breast cancer patients (Buchholz, Strom, & McNeese, 2003). The purposes of this study were to apply a novel analytical method to breast cancer patients to identify prognostic factors, and to explore the interactions between clinical variables and their impact on survival. Decision tree algorithms allow for non-linear relations between predictive factors and outcomes and for mixed data types (numerical and categorical), isolate outliers, and incorporate a pruning process that uses cross-validation as an alternative to testing for unbiasedness with a second data set (Faderl et al., 2002).

Several reports in the literature describe the separation of patients into subgroups with different survival prognoses (Aligayer et al., 2002; Kenneth, Abbruzzese, Lenzi, Raber, & Abbruzzese, 1999; Sauerbrei, Hübner, Schmoor, & Schumacher, 1997; Ture, Memis, Kurt, & Pamukcu, 2005). Decision trees use recursive partitioning to assess the effect of specific variables on survival, ultimately generating groups of patients with similar clinical features and survival times. Partitioning patients into groups with differing survival times using clinical variables generates a tree-structured model that can be analyzed to assess its clinical utility. Therefore decision tree methods such as classification and regression tree (C&RT), Chi-squared automatic interaction detector (CHAID), quick, unbiased, efficient statistical tree (QUEST), Commercial version 4.5 (C4.5) and Interactive Dichotomizer version 3 (ID3) are more suitable than classical statistical methods.

We analyzed the simultaneous relationship among risk factors for breast cancer with five decision tree algorithms. This study compared the relative effects of each risk factor for breast cancer in the multivariate analysis models. We tried to discover significant patterns and relationships among the risk factors and to derive decision rules for the management of breast cancer.

* Corresponding author. Tel.: +90 284 2357641/1631; fax: +90 284 2357652. E-mail address: [email protected] (M. Ture).
0957-4174/$ - see front matter © 2008 Elsevier Ltd. All rights reserved. doi:10.1016/j.eswa.2007.12.002

2. Patients and methods

2.1. Patients

A retrospective analysis was performed in 500 breast cancer patients diagnosed between 1997 and 2006.
The prognostic factors investigated were age, menopausal status, age of menarche, age of first delivery, presence of abortus, hormone replacement therapy, family history of cancer, histologic tumor type, quadrant of tumor, tumor size, estrogen and progesterone receptor status, histologic and nuclear grading according to the Scarff–Bloom–Richardson criteria (Bloom & Richardson, 1957), type of surgery, axillary nodal status, pericapsular involvement of lymph nodes, stage of disease according to AJCC (American Joint Committee on Cancer, 1997), lymphovascular and perineural invasion, radiotherapy, chemotherapy and hormonal therapy. Complete data for all of these factors were available for 279 patients, who form the basis of this study. Primary local treatment was surgery (modified radical mastectomy or breast conserving surgery). The median age was 48 years (range, 28–84) in the whole patient population. Tumors were considered positive for estrogen and progesterone receptors if more than 10% of tumor cells showed nuclear staining (Zhang, Salto-Tellez, Putti, Do, & Koay, 2003). Descriptive statistics of the clinical and pathologic data for the entire patient population are listed in Table 1.

We performed classical statistical analysis to examine differences in the distribution of variables between patients with and without recurrence. The Kolmogorov–Smirnov test was used to assess the normality of numeric variables. For numeric variables that were not normally distributed, the two groups were compared with the Mann–Whitney U-test and results were expressed as median and interquartile range. Association of recurrence with nominal variables was assessed using the χ²-test.

Adjuvant radiotherapy was given to 230 patients (82.4%) and chemotherapy was administered to 245 patients (87.8%). Chemotherapy was delivered prior to radiotherapy. Hormonal therapy was initiated after the completion of radiotherapy, and in hormone receptor positive patients it was typically continued for 5 years or until recurrence of disease. Follow-up consisted of a clinical assessment every three months for the first three years, every six months for the next two years, and annually after five years.

2.2. Statistical analysis

For these variables, decision tree algorithms were used to identify optimal cut points in the data. A 10-fold cross-validation analysis was performed as an initial evaluation of the test error of the algorithms. Briefly, this process involves splitting the dataset into 10 random segments, using 9 of them for training and the 10th as a test set for the algorithm. Survival analysis was performed for disease-free survival, defined as the time from initial diagnosis to the first recurrence of disease (local–regional or distant). For the terminal nodes of the best decision tree method, survival curves were estimated by the Kaplan–Meier method and the differences between the curves were evaluated by the log-rank test (Mantel–Cox). Follow-up time for each patient was calculated in months from the last day of the initial treatment to the date of death or the date of last visit. For all statistical tests, p-values less than 0.05 were considered significant.

2.3. Decision tree algorithms

2.3.1. Classification and regression tree

C&RT is a recursive partitioning method that can be used for both regression and classification. A C&RT tree is constructed by repeatedly splitting subsets of the data set into two child nodes, considering all predictor variables and beginning with the entire data set.
The best predictor is chosen using one of several impurity or diversity measures (Gini, twoing, ordered twoing and least-squared deviation). The goal is to produce subsets of the data that are as homogeneous as possible with respect to the target variable (Breiman, Friedman, Olshen, & Stone, 1984). In this study, we used the Gini impurity measure, which is used for categorical target variables.

Gini impurity measure: the Gini index at node t, g(t), is defined as

$$g(t) = \sum_{j \neq i} p(j \mid t)\, p(i \mid t)$$

where i and j are categories of the target variable. The equation for the Gini index can also be written as

$$g(t) = 1 - \sum_{j} p^{2}(j \mid t)$$
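As a minimal sketch (not the software used in the study), the two forms of the Gini index above can be checked numerically. The function names are illustrative assumptions; the example node uses the study's overall split of 80 patients with recurrence and 199 without.

```python
from collections import Counter


def gini_pairwise(labels):
    """Gini index as the sum of p(j|t) * p(i|t) over all ordered pairs j != i."""
    n = len(labels)
    props = [count / n for count in Counter(labels).values()]
    return sum(
        p_j * p_i
        for a, p_j in enumerate(props)
        for b, p_i in enumerate(props)
        if a != b
    )


def gini_index(labels):
    """Equivalent form: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


# Root node of the study population: 80 recurrences, 199 recurrence-free patients.
root = ["recurrence"] * 80 + ["no recurrence"] * 199
print(round(gini_pairwise(root), 4), round(gini_index(root), 4))
```

Both functions return the same value for any node, a node containing a single class gives 0, and two evenly split classes give 0.5.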

Thus, when the cases in a node are evenly distributed across the categories, the Gini index takes its maximum value of 1 − 1/k, where k is the number of categories of the target variable. When all cases in the node belong to the same category, the Gini index equals 0. If costs of misclassification are specified, the Gini index is computed as

$$g(t) = \sum_{j \neq i} C(i \mid j)\, p(j \mid t)\, p(i \mid t)$$

where C(i | j) is the cost of misclassifying a category j case as category i. The Gini criterion function for split s at node t is defined as

$$\Phi(s, t) = g(t) - p_L\, g(t_L) - p_R\, g(t_R)$$

where p_L is the proportion of cases in t sent to the left child node, and p_R is the proportion sent to the right child node. The split s is chosen to maximize the value of Φ(s, t). This value is reported as the improvement in the tree (Breiman et al., 1984).

Table 1. Clinical and laboratory characteristics of the study groups

| Independent variable | Category | Recurrence absent (n = 199) | Recurrence present (n = 80) | p |
|---|---|---|---|---|
| Age (years), median (IQR) | | 48 (15) | 48 (20) | 0.308 |
| Age of menarche (years), median (IQR) | | 13 (2) | 13 (1) | 0.884 |
| Tumor size (cm), median (IQR) | | 3 (2) | 4 (4) | <0.001 |
| Hormone replacement therapy, n (%) | Present | 45 (22.6) | 10 (12.5) | 0.079 |
| Age of first delivery (years), n (%) | ≥30 | 11 (5.5) | 6 (7.5) | 0.582 |
| Menopausal status, n (%) | Post | 79 (39.7) | 38 (47.5) | 0.454 |
| | Pre | 109 (54.8) | 39 (48.8) | |
| | Peri | 11 (5.5) | 3 (3.8) | |
| Presence of abortus, n (%) | Present | 16 (8.0) | 13 (16.3) | 0.690 |
| Stage of disease, n (%) | In situ cancer | 6 (3.0) | 0 (0.0) | <0.001 |
| | Early stage cancer | 157 (78.9) | 47 (58.8) | |
| | Locally advanced cancer | 36 (18.1) | 33 (41.3) | |
| Nuclear grade, n (%) | I | 35 (17.6) | 10 (12.5) | 0.054 |
| | II | 111 (55.8) | 37 (46.3) | |
| | III | 53 (26.6) | 33 (41.3) | |
| Estrogen receptor status, n (%) | Positive | 152 (76.4) | 52 (65.0) | 0.073 |
| Progesterone receptor status, n (%) | Positive | 151 (75.9) | 50 (62.5) | 0.035 |
| Type of surgery, n (%) | Modified radical mastectomy | 138 (69.3) | 67 (83.8) | 0.021 |
| | Breast conserving surgery | 61 (30.7) | 13 (16.3) | |
| Radiotherapy, n (%) | Present | 161 (80.9) | 69 (86.3) | 0.375 |
| Chemotherapy, n (%) | Present | 172 (86.4) | 73 (91.3) | 0.363 |
| Hormonal therapy, n (%) | Present | 159 (79.9) | 48 (60.0) | 0.001 |
| Family history of cancer, n (%) | Absent | 139 (69.8) | 52 (65.0) | 0.631 |
| | Other cancers | 44 (22.1) | 19 (23.8) | |
| | Breast cancer | 16 (8.0) | 9 (11.3) | |
| Perineural invasion, n (%) | Present | 63 (31.7) | 35 (43.8) | 0.056 |
| Lymphovascular invasion, n (%) | Present | 114 (57.3) | 58 (72.5) | 0.018 |
| Axillary nodal status, n (%) | Negative | 86 (43.2) | 16 (20.0) | <0.001 |
| | 1–3 lymph nodes positive | 59 (29.6) | 20 (25.0) | |
| | ≥4 lymph nodes positive | 54 (27.1) | 44 (55.0) | |
| Histologic grade, n (%) | I | 36 (18.1) | 10 (12.5) | 0.061 |
| | II | 89 (44.7) | 28 (35.0) | |
| | III | 74 (37.2) | 42 (52.5) | |
| Histologic tumor type, n (%) | Ductal | 164 (82.4) | 69 (86.3) | 0.547 |
| | Other | 35 (17.6) | 11 (13.8) | |
| Quadrant of tumor, n (%) | Unicentric | 180 (90.5) | 62 (77.5) | 0.007 |
| | Multicentric | 19 (9.5) | 18 (22.5) | |
| Pericapsular involvement of lymph nodes, n (%) | Positive | 59 (29.6) | 41 (51.3) | 0.001 |

2.3.2. Chi-squared automatic interaction detection

The CHAID method is based on the χ²-test of association. A CHAID tree is a decision tree that is constructed by repeatedly splitting subsets of the space into two or more child nodes, beginning with the entire data set (Michael & Gordon, 1997). To determine the best split at any node, allowable pairs of categories of a predictor variable are merged as long as they do not differ significantly with respect to the target variable. The CHAID method naturally deals with interactions between the independent variables, which are directly available from an examination of the tree. The final nodes identify subgroups defined by different sets of independent variables (Magidson & SPSS Inc., 1993).

The CHAID algorithm only accepts nominal or ordinal categorical predictors. When predictors are continuous, they are transformed into ordinal predictors before the following algorithm is applied. For each predictor variable X, non-significant categories are merged. Each final category of X will result in one child node if X is used to split the node. The merging step also calculates the adjusted p-value that is to be used in the splitting step.

1. If X has only 1 category, stop and set the adjusted p-value to 1.
2. If X has 2 categories, go to step 8.
3. Otherwise, find the allowable pair of categories of X that is least significantly different, i.e. most similar (an allowable pair is two adjacent categories for an ordinal predictor and any two categories for a nominal predictor). The most similar pair is the pair whose test statistic gives the largest p-value with respect to the dependent variable Y. How to calculate the p-value under various situations will be described in later sections.
4. For the pair having the largest p-value, check whether its p-value is larger than a user-specified alpha-level α_merge. If it is, this pair is merged into a single compound category and a new set of categories of X is formed. If it is not, go to step 7.
5. (Optional) If the newly formed compound category consists of three or more original categories, find the best binary split within the compound category, i.e. the one whose p-value is the smallest. Perform this binary split if its p-value is not larger than an alpha-level α_split-merge.
6. Go to step 2.
7. (Optional) Any category having too few observations (as compared with a user-specified minimum segment size) is merged with the most similar other category, as measured by the largest of the p-values.
8. The adjusted p-value is computed for the merged categories by applying Bonferroni adjustments that are to be discussed later.

(Biggs & Suen, 1991; Goodman, 1979; Kass, 1980; Magidson & SPSS Inc., 1993).

2.3.3. Quick, unbiased, efficient statistical tree

QUEST is a binary-split decision tree algorithm for classification and data mining. QUEST can be used with univariate or linear combination splits. A unique feature is that its attribute selection method has negligible bias: if all the attributes are uninformative with respect to the class attribute, then each has approximately the same chance of being selected to split a node (Loh & Shih, 1997). The QUEST tree growing process consists of the selection of a split predictor, selection of a split point for the selected predictor, and stopping. In this algorithm, only univariate splits are considered. The split predictor is selected with the following algorithm.

1. For each continuous predictor X, perform an ANOVA F-test of whether all the classes of the dependent variable Y have the same mean of X, and calculate the p-value from the F statistic. For each categorical predictor, perform Pearson's χ²-test of the independence of Y and X, and calculate the p-value from the χ² statistic.
2. Find the predictor with the smallest p-value and denote it X*.
3. If this smallest p-value is less than α/M, where α ∈ (0, 1) is a user-specified level of significance and M is the total number of predictor variables, predictor X* is selected as the split predictor for the node. If not, go to step 4.
4. For each continuous predictor X, compute Levene's F statistic based on the absolute deviation of X from its class mean to test whether the variances of X for the different classes of Y are the same, and calculate the p-value for the test.
5. Find the predictor with the smallest p-value and denote it X**.
6. If this smallest p-value is less than α/(M + M1), where M1 is the number of continuous predictors, X** is selected as the split predictor for the node. Otherwise, this node is not split (Loh & Shih, 1997).

2.3.4. Commercial version 4.5

C4.5 is a supervised learning classification algorithm used to construct decision trees from data (Quinlan, 1993). Most empirical learning systems are given a set of pre-classified cases, each described by a vector of attribute values, and construct from them a mapping from attribute values to classes. C4.5 is one such system that learns decision tree classifiers. It uses a divide-and-conquer approach to growing decision trees (Benjamin, Tom, Samuel, Weijun, & Xuegang, 2000). The main difference between C4.5 and other similar decision tree building algorithms is in the test selection and evaluation process. Let attributes be denoted A = {a1, a2, ..., am}, cases be denoted D = {d1, d2, ..., dn}, and classes be denoted C = {c1, c2, ..., ck}.
For a set of cases D, a test Ti is a split of D based on attribute ai. It splits D into mutually exclusive subsets D1, D2, ..., Dp. These subsets of cases are single-class collections of cases. If a test T is chosen, the decision tree for D consists of a node identifying the test T and one branch for each possible subset Di. For each subset Di, a new test is then chosen for further splitting. If Di satisfies a stopping criterion, the tree for Di is a leaf associated with the most frequent class in Di. One reason for stopping is that all cases in Di belong to one class. The C4.5 decision tree algorithm uses a modified splitting criterion, called gain ratio. It uses arg max(Gain(D, T)) or arg max(GainRatio(D, T)) to choose tests for splitting:

$$\mathrm{Info}(D) = -\sum_{i=1}^{k} p(c_i, D)\, \log_2 p(c_i, D)$$

$$\mathrm{Split}(D, T) = -\sum_{i=1}^{p} \frac{|D_i|}{|D|}\, \log_2 \frac{|D_i|}{|D|}$$

$$\mathrm{Gain}(D, T) = \mathrm{Info}(D) - \sum_{i=1}^{p} \frac{|D_i|}{|D|}\, \mathrm{Info}(D_i)$$

$$\mathrm{GainRatio}(D, T) = \mathrm{Gain}(D, T) / \mathrm{Split}(D, T)$$

where p(ci, D) denotes the proportion of cases in D that belong to the ith class. C4.5 selects the test that maximizes the gain ratio (Benjamin et al., 2000). Once the initial decision tree is constructed, a pruning procedure is initiated to decrease the overall tree size and the estimated error rate of the tree (Quinlan, 1993).

2.3.5. Interactive Dichotomizer version 3

ID3 is a simple decision tree learning algorithm developed by Quinlan (1993). The ID3 algorithm builds decision trees using a top-down, greedy search procedure and represents the core of Quinlan's highly successful C4.5 decision tree algorithm. The basic idea of ID3 is to construct the decision tree by a top-down, greedy search through the given sets, testing each attribute at every tree node. To select the attribute that is most useful for classifying a given set, a metric called information gain is introduced. The ID3 algorithm works as follows (Cheng & Maghsoodloo, 1995; Shao, Zhang, Li, & Chen, 2001).

Suppose T = PE ∪ NE, where PE is the set of positive examples and NE is the set of negative examples, with p = |PE| and n = |NE|. An example is determined to belong to PE with probability p/(p + n) and to NE with probability n/(p + n). Using the information-theoretic heuristic, a decision tree is considered as a source of a message, PE or NE, with the expected information needed to generate this message given by

$$I(p, n) = \begin{cases} -\dfrac{p}{p+n} \log_2 \dfrac{p}{p+n} - \dfrac{n}{p+n} \log_2 \dfrac{n}{p+n} & \text{when } p, n \neq 0 \\ 0 & \text{otherwise} \end{cases}$$

If attribute X with value domain {v1, ..., vN} is used for the root of the decision tree, it will partition T into {T1, ..., TN}, where Ti contains those examples in T that have value vi of X. Let Ti contain pi examples of PE and ni of NE. The expected information required for the sub-tree for Ti is I(pi, ni). The expected information required for the tree with X as the root, EI(X), is then obtained as a weighted average

$$EI(X) = \sum_{i=1}^{N} \frac{p_i + n_i}{p + n}\, I(p_i, n_i)$$

where the weight for the ith branch is the proportion of the examples in T that belong to Ti. The information gained by branching on X, G(X), is therefore

$$G(X) = I(p, n) - EI(X)$$

ID3 examines all candidate attributes, chooses X to maximize G(X), constructs the tree, and then uses the same process recursively to construct decision trees for the residual subsets T1, ..., TN. For each Ti (i = 1, 2, ..., N): if all the examples in Ti are positive, it creates a "yes" node and halts; if all the examples in Ti are negative, it creates a "no" node and halts; otherwise it selects another attribute in the same way as given earlier.

2.4. Multidimensional scaling (MDS)

MDS is a method that represents measurements of similarity (or dissimilarity) among pairs of objects as distances between points in a low-dimensional Euclidean space. The more dissimilar two objects are, the larger the distance between them in that space should be. The objects in our study are the different classification techniques, described by their classification performance measurements. The location of the techniques on the map is based on their position in the m-dimensional variable space. Similar to the R² measure in regression analysis, a pseudo-R² is calculated in MDS. The pseudo-R² is equal to the percentage of the sum of squared dissimilarities explained by the model. Another goodness-of-fit measure of the projection is the so-called stress factor. A stress factor <0.05 is considered to be good (Borg & Groenen, 1997; Kruskal, 1964).

2.5. Kaplan–Meier survival analysis

The Kaplan–Meier analysis is a non-parametric technique for estimating time-related events (Kaplan & Meier, 1958). The resulting survival curves can be used to test the statistical significance of differences between two different circumstances. The method is applied by analyzing the distribution of patient survival times following recruitment to a study, expressed as the proportion of patients still alive up to a given time after recruitment. In graphical terms, a plot of the proportion of patients surviving against time has a characteristic decline (often exponential), with the steepness of the curve indicating the efficacy of the treatment being investigated. The shallower the survival curve, the more effective the treatment (Utley et al., 2000). A variety of tests may be used to compare two or more Kaplan–Meier curves under certain well-defined circumstances. Median remission time (the time when 50% of the cohort has reached remission), as well as quantities such as the three-, five-, and ten-year probability of remission, can also be generated from the Kaplan–Meier analysis, provided there has been sufficient follow-up of patients.

3. Results

3.1. Characteristics of study subjects

After a median follow-up of 40.5 months (25th and 75th percentiles: 21.4 and 59.7 months), 80 (28.7%) patients had experienced at least one of the events for recurrence-free survival (locoregional recurrence, distant metastases, or second cancer). Tumor size, axillary nodal status, stage of disease, lymphovascular invasion, quadrant of tumor, progesterone receptor status, pericapsular involvement of lymph nodes, type of surgery, and hormonal therapy were found to be statistically significant prognostic factors for recurrence (Table 1).
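To illustrate the χ²-test of association behind these comparisons, the following sketch recomputes the test for type of surgery from the Table 1 counts (138 mastectomy and 61 breast conserving cases without recurrence; 67 and 13 with recurrence). The function name is hypothetical, and the use of Yates' continuity correction is an assumption, since the paper does not state whether a correction was applied. For one degree of freedom, the χ² tail probability equals erfc(sqrt(x/2)), so only the standard library is needed.

```python
import math


def chi2_2x2(table, yates=True):
    """Pearson chi-square test for a 2x2 table; returns (statistic, p-value)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            diff = abs(observed - expected)
            if yates:
                # Continuity correction (assumed here, not stated in the paper).
                diff = max(diff - 0.5, 0.0)
            stat += diff ** 2 / expected
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Rows: recurrence absent / present; columns: mastectomy / breast conserving.
surgery = [[138, 61], [67, 13]]
stat, p = chi2_2x2(surgery)
print(f"chi-square = {stat:.2f}, p = {p:.3f}")
```

With the correction, the p-value comes out at about 0.021, in line with the value reported for type of surgery; without it the p-value is smaller.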

3.2. Comparison of decision tree methods

We used the error rate for the training set and for cross-validation to monitor the prediction performance of the methods. Table 2 gives the performance measures of the C&RT, CHAID, QUEST, C4.5 and ID3 algorithms. The error rates for the training set ranged from 0.2150 to 0.2510, and the error rates for cross-validation ranged from 0.2760 to 0.3259. As can be seen from Table 2, C&RT had the smallest error rate for the training set. C4.5 and ID3 ranked second, CHAID ranked third, followed by QUEST. The predictive values (sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV) and predictive rate (PR)) of the decision tree methods for the training set are compared in Table 3. Across all models, sensitivity, specificity, PPV, NPV and PR for the training set were in the 21.3–43.8%, 89.9–99.5%, 61.5–95.5%, 75.3–80.1% and 74.9–78.5% ranges, respectively. The predictive values for the training set were used as input variables in MDS in order to identify homogeneous groups of classification techniques. The stress factor in MDS was 0.000000009. The two dimensions are plotted against each other in Fig. 1. As can be seen from Fig. 1, C4.5 performed better than CHAID, QUEST, C&RT and ID3 in predicting breast cancer recurrence.

Table 2. Comparison of the error rates of models

| Model | Error rate, training set | Error rate, cross-validation |
|---|---|---|
| C&RT | 0.2150 | 0.2830 |
| CHAID | 0.2440 | 0.3230 |
| QUEST | 0.2510 | 0.2760 |
| C4.5 | 0.2258 | 0.3259 |
| ID3 | 0.2258 | 0.3111 |

Fig. 1. C4.5 was found to be the best method by multidimensional scaling (MDS map of C&RT, CHAID, QUEST, C4.5 and ID3; axes: Dimension 1 and Dimension 2).
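The predictive values compared in this subsection all derive from a 2×2 confusion matrix. The sketch below shows the arithmetic for C4.5. The cell counts (35 true positives, 45 false negatives, 181 true negatives, 18 false positives) are an assumption: they are back-calculated from the percentages reported in Table 3 and the group sizes of 80 and 199, and are not given in the paper. The function name is illustrative.

```python
def predictive_values(tp, fn, tn, fp):
    """Return sensitivity, specificity, PPV, NPV and PR (all in percent)."""
    total = tp + fn + tn + fp
    return {
        "sensitivity": 100.0 * tp / (tp + fn),  # recurrences correctly flagged
        "specificity": 100.0 * tn / (tn + fp),  # non-recurrences correctly cleared
        "ppv": 100.0 * tp / (tp + fp),          # flagged cases that did recur
        "npv": 100.0 * tn / (tn + fn),          # cleared cases that did not recur
        "pr": 100.0 * (tp + tn) / total,        # overall predictive rate
    }


# Assumed counts for C4.5 on the training set (back-calculated, see lead-in).
c45 = predictive_values(tp=35, fn=45, tn=181, fp=18)
for name, value in c45.items():
    print(f"{name}: {value:.1f}%")

# The training-set error rate is the complement of the predictive rate.
error_rate = 1.0 - c45["pr"] / 100.0
```

Under these assumed counts the five values agree with the C4.5 row of Table 3 to one decimal place, and the complement of PR, 63/279 ≈ 0.2258, agrees with the training-set error rate in Table 2.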

3.3. Classiﬁcation tree and rules of C4.5 for the prediction of RFS in breast cancer In the C4.5 analysis, we identiﬁed the variables that play important roles in explaining recurrences (Table 4). This indicated that the tumor size was the most important determining factor for recurrence. This ﬁrst-level split produced the two initial branches of the classiﬁcation tree: <4.4 cm versus P4.4 cm. We could see diﬀerences in two subtrees. For the tumor size <4.4 cm, age of menarche proved the best predicting variable. For the age of menarche branch which included <15 years-old, hormonal therapy was the most prominent. For the tumor size <4.4 cm, age of menarche (<15yearsold) and absent of hormonal therapy, axillary nodal status (P4 lymph nones positive) was the most prominent. For the tumor size P4.4 cm, present of hormonal therapy, histological grade was the most prominent. Classiﬁcation trees are charts that illustrate decision rules. The decision rules provide speciﬁc information about risk factors based on the rule induction. They begin with one root node that contains all of the observations in the sample. The C4.5 has 13 leaf nodes, of which 8 are terminal nodes. 3.4. Survival analysis for breast cancer patients

Table 3 Comparison of the performance of models for training set Model

Sensitivity (%)

Speciﬁcity (%)

PPV (%)

NPV (%)

PR (%)

C&RT CHAID QUEST C4.5 ID3

26.3 40.0 21.3 43.8 27.5

99.5 89.9 96.5 91.0 97.5

95.5 61.5 70.8 66.0 81.5

77.0 78.8 75.3 80.1 77.0

78.5 75.6 74.9 77.4 77.4

The tree of C4.5 had an initial split on breast cancer, and eight terminal nodes were formed. The variables determining the structure of the tree included tumor size, age of menarche, hormonal therapy, histological grade and axillary nodal status. The longest surviving terminal node (node 1) included only three events in 41 patients with tumor size <4.4 cm

M. Ture et al. / Expert Systems with Applications 36 (2009) 2017–2026

2023

Table 4 Terminal nodes Model

Terminal nodes

Recurrence (%)

C&RT

Node 1: Tumor size (64.3 cm) Node 2: Tumor size (>4.3 cm) + Age (>64 years-old) Node 3: Tumor size (>4.3 cm) + Age (664 years-old) + Axillary nodal status (negative, 1–3 lymph nodes positive) Node 4: Tumor size (>4.3 cm) + Age (664 years-old) + Axillary nodal status (P4 lymph nodes positive) + Hormonal therapy (absent) Node 5: Tumor size (>4.3 cm) + Age (664 years-old) + Axillary nodal status (P4 lymph nodes positive) + Hormonal therapy (present) Node 1: Axillary nodal status (negative, 1–3 lymph nodes positive) + Quadrant of tumor (multicentric) Node 2: Axillary nodal status (P4 lymph nodes positive) + Radiotherapy (absent) Node 3: Axillary nodal status (negative, 1–3 lymph nodes positive) + Quadrant of tumor (unicentric) + Hormonal therapy (absent) Node 4: Axillary nodal status (negative, 1–3 lymph nodes positive) + Quadrant of tumor (unicentric) + Hormonal therapy (present) Node 5: Axillary nodal status (P4 lymph nodes positive) + Radiotherapy (present) + Progesterone receptor status (negative) Node 6: Axillary nodal status (P4 lymph nodes positive) + Radiotherapy (present) + Progesterone receptor status (positive) Node 1: Tumor size (>6 cm) Node 2: Tumor size (66 cm) + Presence of abortus (present) Node 3: Tumor size (66 cm) + Presence of abortus (present) +Axillary nodal status (P4 lymph nodes positive) Node 4: Tumor size (66 cm) + Presence of abortus (present) + Axillary nodal status (negative, 1–3 lymph nodes positive) + Age of menarche (>13 years-old) Node 5: Tumor size (66 cm) + Presence of abortus (present) + Axillary nodal status (negative, 1–3 lymph nodes positive) + Age of menarche (613 years-old) +Hystologic tumor type (ductal) Node 6: Tumor size (66 cm) + Presence of abortus (present) + Axillary nodal status (negative, 1–3 lymph nodes positive) + Age of menarche (613 years-old) + Hystologic tumor type (other) Node 6: Tumor size (<4.4 cm) + Age of menarche (<15 years-old) + Hormonal therapy (absent) + Axillary nodal status (P4 lymph nodes positive) 
Node 7: Tumor size (<4.4 cm) + Age of menarche (<15 years-old) + Hormonal therapy (absent) + Axillary nodal status (negative) Node 8: Tumor size (<4.4 cm) + Age of menarche (<15 years-old) + Hormonal therapy (absent) + Axillary nodal status (1–3 lymph nodes positive) Node 3: Tumor size (<4.4 cm) + Age of menarche (<15 years-old) + Hormonal therapy (present) Node 1: Tumor size (<4.4 cm) + Age of menarche (P15 years-old) Node 2: Tumor size (P4.4 cm) + Hormonal therapy (absent) Node 4: Tumor size (P4.4 cm) + Hormonal therapy (present) + Hystologic grade (I–II) Node 5: Tumor size (P4.4 cm) + Hormonal therapy (present) + Hystologic grade (III) Node 1: Tumor size (<4.4 cm) Node 4: Tumor size (P4.4 cm) + Axillary nodal status (P4 lymph nodes positive) + Hormonal therapy (absent) Node 5: Tumor size (P4.4 cm) + Axillary nodal status (P4 lymph nodes positive) + Hormonal therapy (present) + Pericapsular Involvement of Lymph Nodes (positive) Node 6: Tumor size (P4.4 cm) + Axillary nodal status (P4 lymph nodes positive) + Hormonal therapy (present) + Pericapsular Involvement of Lymph Nodes (negative) Node 2: Tumor size (P4.4 cm) + Axillary nodal status (negative) Node 3: Tumor size (P4.4 cm) + Axillary nodal status (1–3 lymph nodes positive)

21.6 100.0 13.3 92.3

CHAID

QUEST

C4.5

ID3

and age of menarche P15 years-old. Such patients had a 52-month median survival. A second terminal node (node 7) with a relatively long median survival of 49-month included only two events in 15 patients with tumor size <4.4 cm, age of menarche <15 years-old, hormonal therapy was not given and axillary nodal status was negative. A third terminal node (node 8) with median survival of 45 months included nine events in 15 patients with tumor size <4.4 cm and age of menarche <15years-old, hormonal therapy was not given and axillary nodal status was 1–3 lymph nodes positive. One of the shortest surviving terminal node (node 2) included 15 events in 19 patients with tumor size P4.4 cm and hormonal therapy was not given.

45.0 52.9 100.0 30.0 12.1 58.6 33.3 70.8 44.0 37.0 7.8 24.4 6.7 42.9 13.3 60.0 20.4 7.3 79.0 31.6 57.9 21.6 92.9 69.2 36.4 20.0 35.7

These patients had a median survival of only 14.6 months (Table 5). The five-year Kaplan–Meier estimate of recurrence-free survival was 71.3% in the whole patient population. Fig. 2 shows the estimated recurrence-free survival rates for the terminal nodes of the decision tree generated by the C4.5 analysis. We tested the statistical significance of the difference between the survival curves of each pair of terminal nodes using the log-rank test (Table 6). The survival curve of node 1 was statistically different from those of nodes 2, 4, 5, 6 and 8. Node 2 was statistically different from all nodes except node 5. The survival curve of node 3 was statistically different from those of nodes 2, 5 and 8. The survival curve of node



Table 5
Descriptive statistics and five-year recurrence-free survival (RFS) for each node

Node    Median   Mean   Standard deviation   Number of recurrences   n     Five-year RFS (%)
1       52.1     47.5   25.2                 3                       41    97.7
2       14.6     22.3   22.2                 15                      19    21.1
3       40.9     44.7   27.8                 28                      137   79.6
4       33.5     46.9   30.3                 6                       19    68.4
5       20.5     23.7   15.1                 11                      19    42.1
6       41.8     40.3   23.4                 6                       14    57.1
7       48.9     48.0   22.7                 2                       15    86.7
8       44.8     50.2   44.5                 9                       15    40.0
Total   40.5     42.6   28.1                 80                      279   71.3

Median, mean and standard deviation are in months.

Terminal node definitions:
Node 1: Tumor size (<4.4 cm) + Age of menarche (≥15 years old)
Node 2: Tumor size (≥4.4 cm) + Hormonal therapy (absent)
Node 3: Tumor size (<4.4 cm) + Age of menarche (<15 years old) + Hormonal therapy (present)
Node 4: Tumor size (≥4.4 cm) + Hormonal therapy (present) + Histologic grade (I–II)
Node 5: Tumor size (≥4.4 cm) + Hormonal therapy (present) + Histologic grade (III)
Node 6: Tumor size (<4.4 cm) + Age of menarche (<15 years old) + Hormonal therapy (absent) + Axillary nodal status (≥4 lymph nodes positive)
Node 7: Tumor size (<4.4 cm) + Age of menarche (<15 years old) + Hormonal therapy (absent) + Axillary nodal status (negative)
Node 8: Tumor size (<4.4 cm) + Age of menarche (<15 years old) + Hormonal therapy (absent) + Axillary nodal status (1–3 lymph nodes positive)

4 was statistically different from those of nodes 2 and 5. The survival curve of node 5 was statistically different from that of node 7, and finally, node 7 was statistically different from node 8.

4. Discussion

[Figure 2: one Kaplan–Meier curve per terminal node (Nodes 1–8). y-axis: disease-free survival (0.0–1.0); x-axis: follow-up time (months, 0–200).]

Fig. 2. Kaplan–Meier survival curves of the eight terminal nodes generated from C4.5.
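Each curve in Fig. 2 is a Kaplan–Meier estimate computed within one terminal node. The following is a minimal sketch of the product-limit estimator in plain Python, given only as an illustration. The function names and the toy data are hypothetical; they are not the study's patients or the software the authors used.

```python
from collections import Counter

def kaplan_meier(times, events):
    """Product-limit estimate: list of (t, S(t)) at each distinct event time.

    times  -- follow-up time per patient (e.g. months)
    events -- 1 if recurrence was observed at that time, 0 if censored
    """
    recurrences = Counter(t for t, e in zip(times, events) if e)
    exits = Counter(times)          # every patient leaves the risk set at their own time
    at_risk = len(times)
    surv = 1.0
    curve = []
    for t in sorted(exits):
        d = recurrences.get(t, 0)
        if d:                       # the estimate only steps down at event times
            surv *= 1.0 - d / at_risk
            curve.append((t, surv))
        at_risk -= exits[t]         # remove events and censored cases at t
    return curve

def survival_at(curve, t):
    """Step-function lookup: the last estimate at or before time t."""
    s = 1.0
    for time, surv in curve:
        if time > t:
            break
        s = surv
    return s

# Hypothetical toy node (months, event flag); NOT data from the study.
times = [6, 12, 12, 20, 30, 45, 60, 60, 72, 80]
events = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]

curve = kaplan_meier(times, events)
five_year_rfs = survival_at(curve, 60)   # estimated RFS at 60 months
print(curve)
print(round(five_year_rfs, 4))
```

Applying such a function separately to the patients falling into each terminal node would give one curve per node, which is the layout Fig. 2 describes.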

In this study, we developed several models for predicting risk factors for breast cancer recurrence. Specifically, we used five decision tree methods and evaluated the performance of the models according to their predictive values. MDS was performed to identify homogeneous groups of classification techniques. We estimated RFS rates for the terminal nodes of the decision tree generated by the C4.5 analysis.

Current evidence supports a clear association between clinical and pathologic factors and reduced RFS in breast cancer. The prognostic factors influencing recurrence and survival can be divided into intrinsic factors, which are related to the characteristics of the tumor (histologic features, axillary lymph node metastases, tumor size, hormonal receptor status, histologic and nuclear grade, stage, lymphovascular invasion, pericapsular involvement of lymph nodes and perineural invasion), and extrinsic factors (host factors such as age, menopausal status, age of menarche and family history of cancer, and the type and adequacy of treatment: surgery, radiotherapy, chemotherapy and hormonal therapy). The incidence of recurrence is greater and survival is decreased with larger tumor size, higher histologic grade, presence of lymphovascular invasion, involvement of axillary nodes by tumor, negative estrogen receptor status and young age (Carter, Allen, & Henson, 1989; Henson, Ries, Freedman, & Carriaga, 1991). Generally, five-year recurrence-free survival ranges from 65% to 80% across the whole population of breast cancer patients (Buchholz et al., 2003).

The Cox regression model is the most common tool for investigating simultaneously the influence of several factors on the survival time of patients, but it gives no estimate of the degree of separation of the different subgroups. In the literature, there are several reports on separating patients into subgroups with different survival prognoses. Kenneth et al. (1999) reported that clinicians often experience difficulty applying standard statistical methods to assess the interactions between clinical variables, determining the cumulative effect of these variables on survival, and translating this information into appropriate management, because of the complex presentations of patients with unknown primary carcinoma. Hence, they used Kaplan–Meier analysis together with C&RT in patients with unknown primary carcinoma. Aligayer et al. (2002) investigated whether Src activity is a marker for poor clinical prognosis in colon carcinoma patients and, using Kaplan–Meier analysis, found a significant association between elevated Src activity and shorter overall survival.

Table 6
Pairwise comparisons by log-rank (Mantel–Cox) test

Comparison          χ²       p
Node 1 vs. Node 2   38.805   <0.001
Node 1 vs. Node 3   3.558    0.059
Node 1 vs. Node 4   5.574    0.018
Node 1 vs. Node 5   24.869   <0.001
Node 1 vs. Node 6   8.957    0.003
Node 1 vs. Node 7   0.473    0.492
Node 1 vs. Node 8   14.153   <0.001
Node 2 vs. Node 3   50.900   <0.001
Node 2 vs. Node 4   12.764   <0.001
Node 2 vs. Node 5   1.141    0.285
Node 2 vs. Node 6   4.593    0.032
Node 2 vs. Node 7   16.327   <0.001
Node 2 vs. Node 8   5.266    0.022
Node 3 vs. Node 4   0.788    0.375
Node 3 vs. Node 5   26.697   <0.001
Node 3 vs. Node 6   3.737    0.053
Node 3 vs. Node 7   0.510    0.475
Node 3 vs. Node 8   8.432    0.004
Node 4 vs. Node 5   7.131    0.008
Node 4 vs. Node 6   0.428    0.513
Node 4 vs. Node 7   1.486    0.223
Node 4 vs. Node 8   1.865    0.172
Node 5 vs. Node 6   1.507    0.220
Node 5 vs. Node 7   9.790    0.002
Node 5 vs. Node 8   1.804    0.179
Node 6 vs. Node 7   3.152    0.076
Node 6 vs. Node 8   0.071    0.790
Node 7 vs. Node 8   4.322    0.038
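The pairwise log-rank (Mantel–Cox) comparisons reported in Table 6 can be illustrated with the textbook two-group form of the statistic. The sketch below is an assumption-laden illustration in plain Python, not the authors' software: the function names and the toy groups are hypothetical, and the p-value uses the closed form for a chi-square distribution with one degree of freedom.

```python
import math

def chi2_p_value(chi2):
    """Upper-tail probability of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))

def logrank_chi2(times_a, events_a, times_b, events_b):
    """Two-group log-rank statistic (expects lists; events: 1 = recurrence, 0 = censored)."""
    all_times = times_a + times_b
    all_events = events_a + events_b
    event_times = sorted({t for t, e in zip(all_times, all_events) if e})
    obs_minus_exp = 0.0   # observed minus expected recurrences in group A
    variance = 0.0        # hypergeometric variance summed over event times
    for t in event_times:
        n_a = sum(1 for x in times_a if x >= t)    # at risk in A just before t
        n_b = sum(1 for x in times_b if x >= t)    # at risk in B just before t
        d_a = sum(1 for x, e in zip(times_a, events_a) if x == t and e)
        d_b = sum(1 for x, e in zip(times_b, events_b) if x == t and e)
        n, d = n_a + n_b, d_a + d_b
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    return obs_minus_exp ** 2 / variance

# Hypothetical toy groups (months, all recurrences observed); NOT study data.
early = [5, 8, 12]
late = [20, 25, 30]
chi2 = logrank_chi2(early, [1, 1, 1], late, [1, 1, 1])
print(round(chi2, 3), round(chi2_p_value(chi2), 3))

# Converting chi-square values listed in Table 6 back to p-values.
print(round(chi2_p_value(3.558), 3))   # Node 1 vs. Node 3
print(round(chi2_p_value(5.574), 3))   # Node 1 vs. Node 4
```

The last two lines only check that the one-degree-of-freedom conversion is in line with the p-values listed in Table 6; the statistics themselves cannot be recomputed without the patient-level data.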
Stark and Pfeiffer (1999) reported that classification trees (ID3, C4.5, CHAID and C&RT) were well suited to exploratory data analysis in complex data sets in veterinary epidemiology. Sauerbrei et al. (1997) reported new prognostic classification schemes for node-negative breast cancer patients. According to their C&RT analysis for RFS, tumor size and grade were the most important prognostic factors for this group of patients, followed by age and estrogen receptor status. In our study, we found similar prognostic factors for RFS in breast cancer patients with the C4.5 method. Node 2 (tumor size ≥4.4 cm and no hormonal treatment) consists of 19 patients with a poor prognosis (79% recurrences, median RFS 14.6 months). Patients with low-grade (HG I) breast cancers have a better prognosis than those with high-grade carcinoma (Henson et al., 1991). The present study showed a statistically significantly shorter RFS in patients with high-grade (HG III) tumors than in those with lower-grade (HG I–II) tumors (node 5 vs. node 4, p = 0.008). In breast cancer patients, tumor metastasis to axillary lymph nodes is a significant risk factor for survival outcome and for the development of metastatic disease (Carter et al., 1989). In the present study, patients with ≥4 positive axillary lymph nodes (node 6) had significantly lower RFS than node 1 (p = 0.003); their RFS also differed significantly from that of node 2 (p = 0.032), although Table 5 shows node 2 with the poorer outcome.

Age at menarche has been shown to be a risk factor for the development of primary breast cancer, and evidence indicates that lifetime estrogen exposure may be a critical factor in breast carcinogenesis. However, its prognostic influence on breast cancer once the disease has presented is uncertain. Some studies found no association between age at menarche and outcome in patients with primary breast cancer (Tsutsui et al., 2003). In contrast, Trivers et al. (2007) reported that early age at menarche modestly increased mortality. In our study, age of menarche was a second important risk factor for survival.

In this study, C4.5 performed better than the CHAID, QUEST, ID3 and C&RT techniques. Furthermore, we estimated RFS rates using Kaplan–Meier analysis for the terminal nodes of the C4.5 tree. As a result, we recommend using decision tree methods together with Kaplan–Meier analysis to determine risk factors and the effect of these factors on survival. We compared the methods on a real data set in order to provide information on the general tendency of data structures, to assess the effect of specific variables on survival and to help researchers select the best method for solving classification problems. There are limited data on the sufficiency of classification by a single method. On the basis of these considerations, we suggest that data should be explored and processed with high-performance modelling methods, and that researchers avoid assessing data with only one method in future studies of breast cancer or any other clinical condition.

References

Aligayer, H., Boyd, D. D., Heiss, M. M., Abdalla, E. K., Curley, S. A., & Gallick, G. E. (2002). Activation of Src kinase in primary colorectal carcinoma: An indicator of poor clinical prognosis. Cancer, 94(2), 344–351.
American Joint Committee on Cancer (1997). AJCC cancer staging manual. Philadelphia, Pa: Lippincott-Raven Publishers.
Benjamin, K. T., Tom, B. Y. L., Samuel, W. K. C., Weijun, G., & Xuegang, Z. (2000). Enhancement of a Chinese discourse marker tagger with C4.5. In Annual Meeting of the ACL (Proceedings of the second workshop on Chinese language processing: Held in conjunction with the 38th Annual Meeting of the Association for Computational Linguistics) (Vol. 12, pp. 38–45). Morristown, NJ, USA: Association for Computational Linguistics.
Biggs, D. B. V., & Suen, E. (1991). A method of choosing multiway partitions for classification and decision trees. Journal of Applied Statistics, 18, 49–62.
Bloom, H. J. G., & Richardson, W. W. (1957). Histological grading and prognosis in breast cancer; a study of 1409 cases of which 359 have been followed for 15 years. British Journal of Cancer, 11(3), 359–377.
Borg, I., & Groenen, P. (1997). Modern multidimensional scaling theory and applications. New York: Springer-Verlag.
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Monterey: Wadsworth and Brooks/Cole.


Buchholz, T. A., Strom, E. A., & McNeese, M. D. (2003). The breast. In J. D. Cox & K. K. Ang (Eds.), Radiation oncology: Rationale, technique, results (pp. 333–386). St. Louis, Missouri: Mosby.
Carter, C. L., Allen, C., & Henson, D. E. (1989). Relation of tumour size, lymph node status and survival in 24740 breast cancer cases. Cancer, 63, 181–187.
Cheng, B., & Maghsoodloo, S. (1995). Optimization of mechanical assembly tolerances by incorporating Taguchi's quality loss function. Journal of Manufacturing Systems, 14(4), 264–276.
Faderl, S., Keating, M. J., Do, K.-A., Liang, S.-Y., Kantarjian, H. M., O'Brien, S., et al. (2002). Expression profile of 11 proteins and their prognostic significance in patients with chronic lymphocytic leukemia (CLL). Leukemia, 16, 1045–1052.
Goodman, L. A. (1979). Simple models for the analysis of association in cross-classifications having ordered categories. Journal of the American Statistical Association, 74, 537–552.
Henson, D. E., Ries, L., Freedman, L. S., & Carriaga, M. (1991). Relationship among outcome, stage of disease, and histologic grade for 22616 cases of breast cancer. The basis for a prognostic index. Cancer, 68, 2142–2149.
Kaplan, E. L., & Meier, P. (1958). Nonparametric estimation from incomplete observations. Journal of the American Statistical Association, 53, 457–481.
Kass, G. (1980). An exploratory technique for investigating large quantities of categorical data. Applied Statistics, 29(2), 119–127.
Kenneth, R. H., Abbruzzese, M. C., Lenzi, R., Raber, M. N., & Abbruzzese, J. L. (1999). Classification and regression tree analysis of 1000 consecutive patients with unknown primary carcinoma. Clinical Cancer Research, 5, 3403–3410.
Kruskal, J. B. (1964). Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29, 1–27.
Loh, W.-Y., & Shih, Y.-S. (1997). Split selection methods for classification trees. Statistica Sinica, 7, 815–840.
Magidson, J., & SPSS Inc. (1993). SPSS for Windows CHAID Release 6.0. Chicago: SPSS Inc.
Michael, J. A., & Gordon, S. L. (1997). Data mining techniques: For marketing, sales and customer support. New York: Wiley.
Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Francisco, CA: Morgan Kaufmann.
Sauerbrei, W., Hübner, K., Schmoor, C., & Schumacher, M. (1997). Validation of existing and development of new prognostic classification schemes in node negative breast cancer. Breast Cancer Research and Treatment, 42, 149–163.
Shao, X., Zhang, G., Li, P., & Chen, Y. (2001). Application of ID3 algorithm in knowledge acquisition for tolerance design. Journal of Materials Processing Technology, 117(1–2), 66–74.
Stark, K. D. C., & Pfeiffer, D. U. (1999). The application of nonparametric techniques to solve classification problems in complex data sets in veterinary epidemiology – an example. Intelligent Data Analysis, 3, 23–35.
Trivers, K. F., Gammon, M. D., Abrahamson, P. E., Lund, M. J., Flagg, E. W., Kaufman, J. S., et al. (2007). Association between reproductive factors and breast cancer survival in younger women. Breast Cancer Research and Treatment, 103(1), 93–102.
Tsutsui, S., Ohno, S., Murakami, S., Kataoka, A., Kinoshita, J., & Hachitanda, Y. (2003). Histological classification of invasive ductal carcinoma and the biological parameters in breast cancer. Breast Cancer, 10, 149–152.
Ture, M., Memis, D., Kurt, I., & Pamukcu, Z. (2005). Predictive value of thyroid hormones on the first day in adult respiratory distress syndrome patients admitted to ICU: Comparison with SOFA and APACHE II scores. Annals of Saudi Medicine, 25(6), 466–472.
Utley, M., Gallivan, S., Young, A., Cox, N., Davies, P., Dixey, J., et al. (2000). Potential bias in Kaplan–Meier survival analysis applied to rheumatology drug studies. Rheumatology, 39, 1–2.
Zhang, D., Salto-Tellez, M., Putti, T. C., Do, E., & Koay, E. S. (2003). Reliability of tissue microarrays in detecting protein expression and gene amplification in breast cancer. Modern Pathology, 16(1), 79–85.
