Variable selection bias in regression trees with constant fits


Computational Statistics & Data Analysis 45 (2004) 595 – 607 www.elsevier.com/locate/csda

Yu-Shan Shih*, Hsin-Wen Tsai
Institute of Statistical Science, National Chung Cheng University, Minghsiung, Chiayi 62117, Taiwan
Received 15 July 2002; received in revised form 11 January 2003

Abstract

The greedy search approach to variable selection in regression trees with constant fits is considered. At each node, the method usually compares the maximally selected statistic associated with each variable and selects the variable with the largest value to form the split. This method is shown to have selection bias if predictor variables have different numbers of missing values, and the bias can be corrected by comparing the corresponding P-values instead. Methods related to some change-point problems are used to compute the P-values and their performances are studied.

Keywords: Change-point; Maximally selected statistic; Missing values; P-values

1. Introduction

Regression trees are exploratory tools for prediction in Data Mining. A regression tree is constructed by recursively partitioning the data into homogeneous regions within which constant or linear estimates are generally fitted. It gives a piecewise constant or piecewise linear estimate of a regression function (Witten and Frank, 2000). The AID algorithm (Morgan and Sonquist, 1963) is known as the first method on this subject. A binary split is usually chosen to partition the sample at each node into two subnodes. A common approach to the split selection is to search over all axis-orthogonal

* Corresponding author. E-mail addresses: [email protected] (Y.-S. Shih), [email protected] (H.-W. Tsai).
doi:10.1016/S0167-9473(03)00036-7


partitions, and the split with the maximal decrease in a selected statistic is chosen to actually divide the sample. This approach is adopted in methods like AID (Morgan and Sonquist, 1963), CART (Breiman et al., 1984), RT (Torgo, 1999), and Weka (Witten and Frank, 2000). However, Doyle (1973) warns that this exhaustive search approach has selection bias toward variables which provide more split points, if data sets have no missing values. To address this problem, Bonferroni-adjusted significance tests are used to correct the bias (Kass, 1975; Scott and Knott, 1976; Hawkins, 1997). But, as Hawkins (1997, p. 17) points out, the Bonferroni adjustment can over-correct and it favors predictors with fewer splits. Loh (2002) corrects the bias by separating the issue of variable selection from that of point selection.

A key advantage of the recursive binary tree is its interpretability (Hastie et al., 2001, p. 267). Selection bias will weaken our confidence in explaining the results of regression trees. In this paper, we provide a solution based on P-values to the bias problem when a constant fit is considered at each node and data sets may contain missing observations. Similar approaches were proposed to correct the bias in other tree-structured methods. For example, Jensen and Cohen (2000) point out the existence of the selection bias and give references to possible adjustment methods for classification trees. Schlittgen (1999) and the references therein show possible improvement of survival regression tree methods by comparing appropriate P-values.

The rest of the paper is organized as follows. The selection criteria based on impurity and variance are discussed and the reason for the selection bias is provided in Section 2. We study the methods of computing the corresponding P-value of the maximally selected statistic in Section 3. Simulation experiments are conducted to compare these methods and the results are shown in Section 4. Conclusions are given in Section 5.

2. Selection criteria

Let $Y$ be the response variable and $X$ be a continuous predictor variable. A common approach to selecting the best binary split based on $X$ in regression trees with a constant fit at each node is as follows. Suppose there are $n$ paired complete observations $(Y_1, X_1), (Y_2, X_2), \ldots, (Y_n, X_n)$ at node $t$. Define the impurity associated with $t$ as
$$R(t) = \sum_{X_i \in t} (Y_i - \bar{Y}(t))^2,$$

where $\bar{Y}(t)$ is the complete-sample $Y$-mean at node $t$. Given a binary split of the form $X \le c$, the sample is partitioned into two parts $t_L$ and $t_R$, where $t_L$ contains all the cases with $X \le c$ and $t_R$ contains the other cases. Let the decrease in the impurity be
$$\Delta(c, t) = R(t) - R(t_L) - R(t_R).$$
The best split is chosen to be the one which induces the maximum decrease in the impurity. That is, the best split $X \le c^*$ satisfies
$$\Delta(c^*, t) = \max_c \Delta(c, t).$$


Define $\Delta_X(t) = \Delta(c^*, t)$. Let $k = \sum_{i=1}^{n} I\{X_i \le c\} > 0$, $\bar{Y}_L = k^{-1} \sum_{i=1}^{n} I\{X_i \le c\} Y_i$, and $\bar{Y}_R = (n-k)^{-1} \sum_{i=1}^{n} I\{X_i > c\} Y_i$. We find that $\Delta(c, t)$ can be rewritten as
$$\Delta(c, t) = (\bar{Y}_L - \bar{Y}_R)^2 / [k^{-1} + (n-k)^{-1}].$$
Thus,
$$\Delta_X(t) = \max_c \, (\bar{Y}_L - \bar{Y}_R)^2 / [k^{-1} + (n-k)^{-1}].$$
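To make the search concrete, the following is a minimal Python sketch (illustrative, not from the original paper; the function names impurity_decrease and best_split are made up) of the exhaustive split search for one continuous predictor. It evaluates the two-means form of $\Delta(c, t)$ at every observed cutpoint and checks it against the direct definition $R(t) - R(t_L) - R(t_R)$.

```python
import numpy as np

def impurity_decrease(y, x, c):
    """Delta(c, t) for the split X <= c, using the two-means form."""
    left = x <= c
    k = left.sum()
    n = len(y)
    if k == 0 or k == n:            # a valid split leaves both child nodes non-empty
        return 0.0
    ybar_l, ybar_r = y[left].mean(), y[~left].mean()
    return (ybar_l - ybar_r) ** 2 / (1.0 / k + 1.0 / (n - k))

def best_split(y, x):
    """Return (Delta_X(t), c*) by exhaustive search over observed cutpoints."""
    cuts = np.unique(x)[:-1]        # X <= max(x) would put every case in t_L
    deltas = np.array([impurity_decrease(y, x, c) for c in cuts])
    j = int(deltas.argmax())
    return deltas[j], cuts[j]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=50)
    y = rng.normal(size=50)
    delta, cstar = best_split(y, x)
    # Check the two-means form against R(t) - R(t_L) - R(t_R) at c*.
    r = lambda v: ((v - v.mean()) ** 2).sum()
    direct = r(y) - r(y[x <= cstar]) - r(y[x > cstar])
    print(delta, direct)            # the two values agree up to rounding
```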

For another continuous predictor variable, say $W$, the best binary split with value $\Delta_W(t)$ is determined in the same way. Variable selection between $X$ and $W$ is carried out by comparing $\Delta_X(t)$ with $\Delta_W(t)$ and choosing the variable whose value is larger. That is, if $\Delta_X(t) > \Delta_W(t)$, $X$ is selected; otherwise, $W$ is selected. This exhaustive approach is used in CART (Breiman et al., 1984, p. 230), among others.

Define the sample variance of the $Y_i$ values for the complete sample at node $t$ as
$$S^2(t) = n^{-1} \sum_{X_i \in t} (Y_i - \bar{Y}(t))^2.$$

The reduction of the sample variance due to the split $X \le c$ is defined as
$$\Lambda(c, t) = S^2(t) - (k/n)\, S^2(t_L) - ((n-k)/n)\, S^2(t_R).$$
Another approach to split selection is to select the split which maximizes the reduction of the sample variance for the complete sample. Suppose the best split associated with $X$ has the reduction value $\Lambda_X(t) = \max_c \Lambda(c, t)$, and for variable $W$ its best split with value $\Lambda_W(t)$ is determined. The corresponding scheme of variable selection between $X$ and $W$ is: if $\Lambda_X(t) > \Lambda_W(t)$, $X$ is selected; otherwise, $W$ is selected. This method is used in RT (Torgo, 1999, p. 65) and a similar criterion is adopted in Weka as well (Witten and Frank, 2000, p. 203).

Since $n\Lambda(c, t) = \Delta(c, t)$, the impurity criterion and the variance criterion will choose the same splitting variable if each predictor variable has the same number of nonmissing values. Otherwise, these two criteria may select different variables. We observe that the two criteria depend on the number of nonmissing observations. The criterion value may not necessarily represent the associated power of the split. Thus, a direct comparison between $\Delta_X(t)$ and $\Delta_W(t)$, or between their variance counterparts, may not be able to distinguish two splits. Instead of the original criterion itself, we propose to use its corresponding P-value as the new criterion. That is, we select the variable with the minimum corresponding P-value of the maximally selected statistic as the splitting variable.
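The dependence on the number of nonmissing observations can be seen in a small experiment. The sketch below is illustrative only (not from the paper; the function name criteria is made up): each criterion is computed per predictor on the cases where that predictor is observed. Under the null hypothesis the maximal $\Delta$ changes only mildly with the number of available cases $m$ (through the number of candidate split points), while $\Lambda = \Delta/m$ is inflated when $m$ is small, so the two criteria can prefer different variables.

```python
import numpy as np

def criteria(y, x):
    """Best-split impurity decrease Delta_X(t) and variance reduction
    Lambda_X(t), computed on the cases where x is observed."""
    obs = ~np.isnan(x)
    y, x = y[obs], x[obs]
    m = len(y)
    best_delta = 0.0
    for c in np.unique(x)[:-1]:
        k = (x <= c).sum()
        d = (y[x <= c].mean() - y[x > c].mean()) ** 2 / (1.0 / k + 1.0 / (m - k))
        best_delta = max(best_delta, d)
    return best_delta, best_delta / m       # m * Lambda = Delta on the observed cases

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n = 200
    y = rng.normal(size=n)                  # Y independent of both predictors
    x = rng.normal(size=n)
    w = rng.normal(size=n)
    w[rng.random(n) < 0.6] = np.nan         # W has 60% randomly missing values
    dx, lx = criteria(y, x)
    dw, lw = criteria(y, w)
    print("impurity picks:", "X" if dx > dw else "W")   # tends to favour X (more splits)
    print("variance picks:", "X" if lx > lw else "W")   # tends to favour W (fewer cases)
```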

3. Exact and asymptotic distributions

The exact method and four approximation methods of computing the P-value are briefly described in this section. The methods are related to some change-point problems and detailed derivations can be found in the referenced papers. The location shift model proposed in Lausen and Schumacher (1992) is used. That is, we assess the predictive power of $X$ for the response variable $Y$ by testing the hypothesis that $X$ and $Y$ are independent ($H_0$) versus the alternative that the distribution of $Y$ differs in location between $X \le \mu$ and $X > \mu$, where $\mu$ is an unknown cutpoint. Denote by $\Phi(x)$ and $\phi(x)$ the standard normal distribution and density functions, respectively. Under the assumption that $Y$ follows a normal distribution with known variance, Hawkins (1977) derives the likelihood ratio test statistic
$$T_n = \max_c |\bar{Y}_L - \bar{Y}_R| / [k^{-1} + (n-k)^{-1}]^{1/2}$$

and gives a recursive algorithm to obtain the exact distribution of $T_n$ under $H_0$. The algorithm is implemented in the R package maxstat (Hothorn and Lausen, 2003). It can compute the exact distribution for sample sizes up to one hundred. It is interesting to observe that $T_n^2 = \Delta_X(t)$. Thus, in principle, we can obtain the exact distribution of $\Delta_X(t)$. However, since the method is computationally demanding, asymptotic methods may be desired. When $n \to \infty$, Yao and Davis (1986) find that
$$P(T_n \le a_n x + b_n) - [\Phi(a_n x + \tilde{b}_n)]^{2 \ln(n/2)} \to 0,$$
where $a_n = (2 \ln\ln n)^{-1/2}$, $b_n = a_n^{-1} + \tfrac{1}{2} a_n \ln\ln\ln n$, and $\tilde{b}_n = b_n - a_n (\ln\ln\ln n + \ln 2)$. Consequently,

$$P(T_n > u) \approx 1 - [\Phi(K_u)]^{2 \ln(n/2)}, \qquad \text{(YD)}$$

where $K_u = u - [(2 \ln\ln n)^{-1/2} (\ln\ln\ln n + \ln 2)]$ and $u > 0$. They claim that it performs reasonably well for $20 \le n \le 50$ under $H_0$.
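A minimal sketch of the (YD) approximation follows; it is written directly from the formula above, not taken from the authors' implementation, and the function name yd_pvalue is hypothetical. Given the maximally selected statistic $\Delta_X(t)$ computed from the $n$ nonmissing cases of a predictor, it returns the approximate P-value that serves as the selection criterion.

```python
import math

def yd_pvalue(delta_x, n):
    """Approximate P(T_n > u) with u = sqrt(Delta_X(t)), following (YD).
    Here n is the number of nonmissing cases used for this predictor."""
    u = math.sqrt(delta_x)
    lln = math.log(math.log(n))                           # ln ln n, requires n > e
    k_u = u - (2.0 * lln) ** -0.5 * (math.log(lln) + math.log(2.0))
    phi = 0.5 * (1.0 + math.erf(k_u / math.sqrt(2.0)))    # standard normal CDF at K_u
    return 1.0 - phi ** (2.0 * math.log(n / 2.0))

# Example: the variable with the smaller P-value is preferred as the splitter,
# regardless of how many nonmissing cases contributed to its statistic.
print(yd_pvalue(9.0, 100), yd_pvalue(9.0, 40))
```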

Define
$$M_{nc} = A^{-1} [n F_X(c)(1 - F_X(c))]^{-1/2} \left| \sum_{i=1}^{n} \{ I\{X_i \le c\}\, a(R_i) - F_X(c)\, \bar{a} \} \right|,$$
where $R_1, \ldots, R_n$ denote the ranks of $Y_1, \ldots, Y_n$, $a(i)$ denote some scores, $F_X(c) = n^{-1} \sum_{i=1}^{n} I\{X_i \le c\}$, and $A^2 = (n-1)^{-1} \sum_{i=1}^{n} (a(i) - \bar{a})^2$ with $\bar{a} = \sum_i a(i)/n$. Lausen and Schumacher (1992) propose the maximally selected rank statistic
$$M_n = \max_c M_{nc}$$

for the location shift problem. If $a(i) = Y_i$, we find that $M_n = A^{-1} T_n = A^{-1} \{\Delta_X(t)\}^{1/2}$. Thus, the approximate P-value of $\Delta_X(t)$ is obtainable if we are able to obtain the approximate P-value of $M_n$. In order to do so, Lausen and Schumacher (1992) restrict the hypothetical split point $c$ to an interval and define $M_n(\varepsilon_1, \varepsilon_2) = \max_{c \in [x_1, x_2]} M_{nc}$, where $x_1 = F_X^{-1}(\varepsilon_1)$, $x_2 = F_X^{-1}(\varepsilon_2)$, and $0 < \varepsilon_1 < \varepsilon_2 < 1$. They obtain
$$P(M_n(\varepsilon_1, \varepsilon_2) > b) \approx 4 \phi(b) b^{-1} + \phi(b)(b - b^{-1}) \log(K_{\varepsilon_1, \varepsilon_2}), \qquad \text{(LS)}$$
where $K_{\varepsilon_1, \varepsilon_2} = \varepsilon_2 (1 - \varepsilon_2)^{-1} \varepsilon_1^{-1} (1 - \varepsilon_1)$ and $b > 0$.

Table 1
Acronym list

Acronym   Method
JJS       James et al. (1987)
LS        Lausen and Schumacher (1992)
SC        Schlittgen (1999)
YD        Yao and Davis (1986)

Another approximation, given by James et al. (1987) for a standard normal response, is
$$P(M_n(\varepsilon_1, \varepsilon_2) > b) \approx 2 \left[ 1 - \Phi(b) + b \phi(b) \int_{L}^{U} K(x)\, dx \right], \qquad \text{(JJS)}$$

where $k_1 = \varepsilon_1 n > 0$, $k_2 = \varepsilon_2 n > 0$, $L = b(k_2^{-1} - n^{-1})^{1/2}$, $U = b(k_1^{-1} - n^{-1})^{1/2}$, and $K(x) = x^{-1} \exp\{-0.583 [x + b^2 (nx)^{-1}]\}$. Furthermore, Schlittgen (1999) gives
$$P(M_n > b) \approx 1 - P(|S_1| \le b) \prod_{i=2}^{n} P(|S_i| \le b \mid |S_{i-1}| \le b), \qquad \text{(SC)}$$

where the $S_i$ follow the standard normal distribution and the correlation coefficient between $S_i$ and $S_{i-1}$ is $[n_{i-1}(n - n_i)]^{1/2} [n_i (n - n_{i-1})]^{-1/2}$, where $n_i$ is the number of complete observations less than or equal to the $i$th ordered split point. This approximation is claimed to be able to deal with small $n$ (Schlittgen, 1999, p. 944) and is close to the improved Bonferroni procedure in Hothorn and Lausen (2003). The acronyms used to represent the approximation methods discussed above are given in Table 1.
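As one worked example of these approximations, the sketch below implements the (LS) formula; it is an illustrative reading of the formula, not the maxstat implementation, and the function name ls_pvalue is made up. With $a(i) = Y_i$, the argument $b$ is taken to be $M_n(\varepsilon_1, \varepsilon_2) = A^{-1}\{\Delta_X(t)\}^{1/2}$ computed over split points whose left-node proportion lies in $[\varepsilon_1, \varepsilon_2]$.

```python
import math

def ls_pvalue(b, eps1=0.1, eps2=0.9):
    """Approximate P(M_n(eps1, eps2) > b) by the (LS) formula."""
    if b <= 0.0:
        return 1.0
    phi = math.exp(-0.5 * b * b) / math.sqrt(2.0 * math.pi)   # standard normal density
    k = eps2 * (1.0 - eps1) / ((1.0 - eps2) * eps1)           # K_{eps1, eps2}
    p = 4.0 * phi / b + phi * (b - 1.0 / b) * math.log(k)
    return min(1.0, p)          # the asymptotic expression can exceed 1 for small b

# Example: approximate P-values for two candidate variables; the smaller one
# would be chosen as the splitting variable.
print(ls_pvalue(3.1), ls_pvalue(2.4))
```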

4. Simulation studies

The exact method and the four asymptotic approaches can be used to obtain the P-values of $\Delta_X(t)$, and their performances in variable selection are studied in this section. Simulation experiments were carried out to investigate the variable selection problems. The distribution of $Y$ is independent standard normal in the experiments unless mentioned otherwise. Denote by $E$, $U$, $Z_1$ and $Z_2$ independent random variables, where $E$ follows the exponential distribution with mean 1 and $Z_1$ and $Z_2$ are two standard normal random variables. $U$ takes uniform values on $(0, 1)$ and the number of ties is equal to two at each distinct point. The distributions of the four predictor variables used in the simulation are listed in Table 2. The distribution of $X_4$ is chosen to be correlated with $X_2$, and four dependence structures between these two variables are considered.


Table 2
Distributions of the predictor variables

Variable   Distribution
X1         E
X2         Z1
X3         U
X4         Dependence between X4 and X2:
             Independence   Z2
             Weak           X2 + 2.8 Z2
             Medial         X2 + 1.5 Z2
             Strong         X2 + 0.75 Z2

X1, X2 and X3 are independent and four dependence structures between X2 and X4 are simulated. The correlation coefficients are 0, 0.336, 0.555 and 0.800, respectively.
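For concreteness, the following sketch (illustrative, not the authors' code; one_replicate is a made-up name) generates one replicate under the Table 2 design: X1 exponential, X2 standard normal, X3 uniform with pairs of tied values, X4 = X2 + sigma Z2 for a chosen dependence level, Y independent standard normal, and randomly missing values imposed on X1.

```python
import numpy as np

SIGMA = {"independence": None, "weak": 2.8, "medial": 1.5, "strong": 0.75}

def one_replicate(n=1000, missing=0.4, dependence="weak", rng=None):
    """One simulated data set under the Table 2 design (sample size n even)."""
    rng = rng or np.random.default_rng()
    y = rng.normal(size=n)                       # Y independent of the X's
    x1 = rng.exponential(scale=1.0, size=n)      # E
    x2 = rng.normal(size=n)                      # Z1
    u = rng.uniform(size=n // 2)
    x3 = rng.permutation(np.repeat(u, 2))        # two ties at each distinct value
    z2 = rng.normal(size=n)
    sigma = SIGMA[dependence]
    x4 = z2 if sigma is None else x2 + sigma * z2
    x1[rng.random(n) < missing] = np.nan         # only X1 has missing values
    return y, {"X1": x1, "X2": x2, "X3": x3, "X4": x4}

y, xs = one_replicate()
print({name: int(np.isnan(v).sum()) for name, v in xs.items()})
```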

4.1. Comparison of asymptotic distributions

First, a simulation experiment was conducted to compare the four asymptotic distributions which can be used to obtain the P-values of $\Delta_X(t)$. The first three predictor variables in Table 2 are generated and only the first variable has randomly missing observations. The estimated probabilities of variable selection for the four methods, based on one thousand iterations with one thousand samples in each iteration, are given in Table 3. For the JJS and LS methods, the boundary values are taken to be $\varepsilon_1 = 1 - \varepsilon_2 = 0.1$. The results for the impurity and the variance methods are given for comparison.

We find that if the impurity method is used, the chance of $X_1$ being selected decreases as the proportion of its missing values increases. On the other hand, if the variance method is used, the chance of $X_1$ being selected increases as the missing proportion increases. Some of the results coincide with those in Kim and Loh (2001), who show that the usual exhaustive method has selection bias toward variables with more missing values in classification trees. Thus, these methods have selection bias. If the P-value is used as the criterion instead, we find it is relatively unbiased in the sense that its estimates are within three standard errors of 1/3. However, as the missing percentage on $X_1$ increases, some methods are unsatisfactory. Among the approximation methods, the YD method yields the best results because all of its estimates of variable selection are within not only three but also two standard errors of 1/3.

We also find that the selection probability of $X_3$ is always less than the corresponding selection probability of $X_2$. Since the number of available split points of $X_2$ is twice that of $X_3$, the results here again demonstrate that the greedy method based on impurity or variance tends to favor variables which provide more splits. These results also suggest that the asymptotic methods do not work quite well for ordered variables with ties. We have increased the number of ties to ten for $X_3$ and found that the evidence becomes even stronger.

We also study the performance of these approximations when the distribution of $Y$ is not normal. A normal mixture distribution and an exponential distribution with mean 1 are used for $Y$ while all the other settings remain the same. The results are given in Tables 4


Table 3
Estimated probabilities of variable selection when the response variable Y is independent of the predictor variables X's

                     Missing percentage on X1
Method      Xi       0        20       40       60       80
Impurity    X1       0.367    0.340    0.339    0.309    0.285
            X2       0.337    0.349    0.366    0.352    0.385
            X3       0.296    0.311    0.295    0.339    0.330
Variance    X1       0.367    0.462    0.641    0.818    0.963
            X2       0.337    0.278    0.199    0.101    0.022
            X3       0.296    0.260    0.160    0.081    0.015
JJS         X1       0.367    0.338    0.335    0.317    0.303
            X2       0.338    0.348    0.367    0.354    0.374
            X3       0.295    0.314    0.298    0.329    0.323
LS          X1       0.367    0.339    0.334    0.314    0.291
            X2       0.337    0.348    0.368    0.355    0.381
            X3       0.296    0.313    0.298    0.331    0.328
SC          X1       0.364    0.341    0.332    0.316    0.293
            X2       0.339    0.347    0.372    0.354    0.380
            X3       0.297    0.312    0.296    0.330    0.327
YD          X1       0.363    0.342    0.349    0.335    0.334
            X2       0.340    0.348    0.360    0.338    0.353
            X3       0.297    0.310    0.291    0.327    0.313

Y follows the standard normal distribution and the distributions for the X's are listed in Table 2. Only X1 has randomly missing values. Estimates are based on one thousand Monte Carlo iterations and one thousand samples in each iteration. A method is unbiased if it selects each variable with probability 1/3. Simulation standard errors are about 0.015. Estimates which are not within three standard errors of 1/3 are emphasized.

and 5, respectively. We find that the results in Table 4 are similar to those in the previous experiment. The results in Table 5 are less satisfactory. Nevertheless, the YD method still yields better results than the other asymptotic methods. In summary, the YD method is the best and we use it to obtain the approximate P-values in the later studies.

4.2. Correction of the bias

We then carried out an experiment to study the performance of the YD method and the exact method. We use the maxstat package (Hothorn and Lausen, 2003) to obtain the exact P-values. One more predictor variable is added and its distribution is given in Table 2. Four dependence scenarios between $X_2$ and $X_4$ are considered. According to the correlation coefficients, the four dependence models are independence,


Table 4
Estimated probabilities of variable selection when the response variable Y is independent of the predictor variables X's

                     Missing percentage on X1
Method      Xi       0        20       40       60       80
Impurity    X1       0.339    0.337    0.328    0.313    0.279
            X2       0.370    0.343    0.340    0.377    0.367
            X3       0.291    0.320    0.332    0.310    0.354
Variance    X1       0.339    0.482    0.660    0.834    0.967
            X2       0.370    0.270    0.169    0.100    0.016
            X3       0.291    0.248    0.171    0.066    0.017
JJS         X1       0.339    0.341    0.331    0.313    0.279
            X2       0.370    0.339    0.339    0.380    0.371
            X3       0.291    0.320    0.330    0.307    0.350
LS          X1       0.339    0.339    0.331    0.311    0.277
            X2       0.370    0.339    0.339    0.381    0.370
            X3       0.291    0.322    0.330    0.308    0.353
SC          X1       0.341    0.343    0.331    0.309    0.276
            X2       0.369    0.337    0.340    0.380    0.373
            X3       0.290    0.320    0.329    0.311    0.351
YD          X1       0.337    0.352    0.360    0.351    0.351
            X2       0.372    0.335    0.325    0.358    0.342
            X3       0.291    0.313    0.315    0.291    0.307

Y follows a normal mixture distribution: 0.4 N(1, 1) + 0.6 N(0, 1). The distributions for the X's are listed in Table 2. Only X1 has randomly missing values. Estimates are based on one thousand Monte Carlo iterations and one thousand samples in each iteration. A method is unbiased if it selects each variable with probability 1/3. Simulation standard errors are about 0.015. Estimates which are not within three standard errors of 1/3 are emphasized.

weak dependence, medial dependence, and strong dependence. Similar structures are used in Loh (2002) to study the bias problem. The estimated probabilities of variable selection for the methods, based on one thousand iterations with one hundred samples in each iteration, are given in Table 6.

The results in Table 6 again show that the criterion based on impurity or variance has selection bias. In contrast, the exact P-value method gives relatively unbiased results in that most of its estimates are within three standard errors of 0.25. Furthermore, we also conducted the experiment using the approximate P-value method and found that the results are similar to those of the exact method. Due to space limitations, we do not report the approximate results here. Since the exact method is limited to sample sizes of at most one hundred, it is interesting to see how the YD method performs for larger sample sizes. We therefore changed the sample size to two hundred with all the other settings remaining the same. The estimated probabilities of variable selection


Table 5
Estimated probabilities of variable selection when the response variable Y is independent of the predictor variables X's

                     Missing percentage on X1
Method      Xi       0        20       40       60       80
Impurity    X1       0.323    0.338    0.300    0.311    0.268
            X2       0.385    0.350    0.404    0.406    0.401
            X3       0.292    0.312    0.296    0.283    0.331
Variance    X1       0.323    0.468    0.593    0.766    0.943
            X2       0.385    0.287    0.234    0.148    0.033
            X3       0.292    0.245    0.173    0.086    0.024
JJS         X1       0.322    0.332    0.289    0.309    0.282
            X2       0.386    0.353    0.406    0.409    0.390
            X3       0.292    0.315    0.305    0.282    0.328
LS          X1       0.323    0.334    0.287    0.309    0.272
            X2       0.385    0.352    0.407    0.409    0.399
            X3       0.292    0.314    0.306    0.282    0.329
SC          X1       0.319    0.330    0.288    0.307    0.271
            X2       0.387    0.356    0.407    0.409    0.402
            X3       0.294    0.314    0.305    0.284    0.327
YD          X1       0.321    0.350    0.314    0.347    0.346
            X2       0.386    0.347    0.396    0.386    0.361
            X3       0.293    0.303    0.290    0.267    0.293

Y follows the exponential distribution with mean 1 and the distributions for the X's are listed in Table 2. Only X1 has randomly missing values. Estimates are based on one thousand Monte Carlo iterations and one thousand samples in each iteration. A method is unbiased if it selects each variable with probability 1/3. Simulation standard errors are about 0.015. Estimates which are not within three standard errors of 1/3 are emphasized.

are given in Table 7. Again, these results show that the P-value method is relatively unbiased.

4.3. Power studies

The three selection criteria are investigated in this section when one predictor variable is informative. A jump regression model is considered:
$$Y = 0.5\, I(X_1 > 0) + \epsilon,$$
where $X_1$ and $\epsilon$ are independent standard normal random variables and $I(\cdot)$ is the indicator function. A noise variable $X_2$, which has the exponential distribution with mean 1, is also generated. Two hundred samples of $(Y, X_1, X_2)$ from the model are simulated under two schemes: (a) $X_1$ is complete and $X_2$ has randomly missing observations,


Table 6
Estimated probabilities of variable selection when the response variable Y is independent of the predictor variables X's

                           Missing percentage on X1
Method          Xi         0        20       40       60       80

Independence
Impurity        X1         0.264    0.243    0.231    0.201    0.169
                X2         0.279    0.310    0.292    0.309    0.299
                X3         0.208    0.188    0.214    0.232    0.234
                X4         0.249    0.259    0.263    0.258    0.298
Variance        X1         0.264    0.361    0.481    0.624    0.812
                X2         0.279    0.259    0.193    0.145    0.070
                X3         0.208    0.157    0.146    0.108    0.041
                X4         0.249    0.223    0.180    0.123    0.077
Exact P-value   X1         0.264    0.242    0.249    0.229    0.200
                X2         0.280    0.293    0.275    0.291    0.272
                X3         0.208    0.212    0.226    0.243    0.248
                X4         0.248    0.253    0.250    0.237    0.280

Weak dependence
Impurity        X1         0.258    0.241    0.226    0.196    0.164
                X2         0.275    0.304    0.287    0.309    0.294
                X3         0.210    0.191    0.213    0.238    0.236
                X4         0.257    0.264    0.274    0.257    0.306
Variance        X1         0.258    0.357    0.488    0.628    0.813
                X2         0.275    0.255    0.182    0.146    0.069
                X3         0.210    0.165    0.144    0.106    0.044
                X4         0.257    0.223    0.186    0.120    0.074
Exact P-value   X1         0.249    0.241    0.240    0.222    0.206
                X2         0.268    0.290    0.271    0.292    0.269
                X3         0.225    0.218    0.228    0.238    0.248
                X4         0.258    0.251    0.261    0.248    0.277

Medial dependence
Impurity        X1         0.264    0.242    0.229    0.202    0.168
                X2         0.271    0.296    0.286    0.299    0.298
                X3         0.213    0.199    0.209    0.239    0.232
                X4         0.252    0.263    0.276    0.260    0.302
Variance        X1         0.264    0.358    0.488    0.630    0.821
                X2         0.271    0.249    0.185    0.141    0.071
                X3         0.213    0.171    0.144    0.108    0.045
                X4         0.252    0.222    0.183    0.121    0.063
Exact P-value   X1         0.252    0.242    0.242    0.226    0.211
                X2         0.263    0.281    0.267    0.277    0.270
                X3         0.234    0.222    0.231    0.244    0.241
                X4         0.251    0.255    0.260    0.253    0.278

Strong dependence
Impurity        X1         0.269    0.255    0.243    0.209    0.176
                X2         0.266    0.293    0.281    0.293    0.296
                X3         0.223    0.217    0.219    0.260    0.248
                X4         0.242    0.235    0.257    0.238    0.280
Variance        X1         0.269    0.370    0.495    0.646    0.823
                X2         0.266    0.249    0.182    0.135    0.071
                X3         0.223    0.186    0.152    0.112    0.047
                X4         0.242    0.195    0.171    0.107    0.059
Exact P-value   X1         0.257    0.245    0.256    0.236    0.214
                X2         0.256    0.268    0.259    0.272    0.268
                X3         0.243    0.246    0.235    0.263    0.263
                X4         0.244    0.241    0.250    0.229    0.255

Y follows the standard normal distribution and the distributions for the X's are listed in Table 2. Only X1 has randomly missing values. Estimates are based on one thousand Monte Carlo iterations and one hundred samples in each iteration. A method is unbiased if it selects each variable with probability 0.25. Simulation standard errors are about 0.014. Estimates which are not within three standard errors of 0.25 are emphasized.

and (b) $X_1$ has randomly missing observations and $X_2$ is complete. In one thousand Monte Carlo iterations, the estimated probabilities that the informative variable $X_1$ is selected by the various selection criteria are recorded. The results are given in Tables 8 and 9.

From Table 8 we find that the selection probability of $X_1$ increases as the missing percentage on $X_2$ increases when the variable is selected by the impurity criterion. On the other hand, the trend reverses when the variable is selected by the variance criterion. But these properties are undesirable because the selection probability of $X_1$ should not be affected by the missing percentage on $X_2$. In contrast, the selection probability remains relatively unchanged when the P-value method is used. If the missing values are observed on $X_1$ instead, we learn from Table 9 that the P-value criterion is more powerful than the impurity criterion in selecting $X_1$. Under this missing scheme, we would expect the selection probability to decrease as the missing percentage on $X_1$ increases, because $X_1$ provides less information about the model when its samples


Table 7
Estimated probabilities of variable selection when the response variable Y is independent of the predictor variables X's

                           Missing percentage on X1
Method          Xi         0        20       40       60       80

Independence
Impurity        X1         0.251    0.223    0.250    0.198    0.155
                X2         0.277    0.273    0.276    0.284    0.303
                X3         0.205    0.227    0.216    0.231    0.242
                X4         0.267    0.277    0.258    0.287    0.300
Variance        X1         0.251    0.362    0.516    0.685    0.885
                X2         0.277    0.230    0.175    0.099    0.045
                X3         0.205    0.183    0.142    0.100    0.030
                X4         0.267    0.225    0.167    0.116    0.040
Approximated    X1         0.253    0.234    0.265    0.225    0.210
P-value         X2         0.274    0.273    0.274    0.276    0.287
                X3         0.207    0.225    0.213    0.221    0.220
                X4         0.266    0.268    0.248    0.278    0.283

Weak dependence
Impurity        X1         0.259    0.223    0.254    0.193    0.158
                X2         0.274    0.274    0.275    0.290    0.294
                X3         0.208    0.219    0.210    0.234    0.241
                X4         0.259    0.284    0.261    0.283    0.307
Variance        X1         0.259    0.358    0.505    0.685    0.887
                X2         0.274    0.228    0.173    0.102    0.043
                X3         0.208    0.179    0.143    0.101    0.030
                X4         0.259    0.235    0.179    0.112    0.040
Approximated    X1         0.260    0.228    0.269    0.220    0.212
P-value         X2         0.274    0.274    0.270    0.284    0.281
                X3         0.207    0.218    0.206    0.224    0.222
                X4         0.259    0.280    0.255    0.272    0.285

Medial dependence
Impurity        X1         0.257    0.231    0.248    0.202    0.161
                X2         0.262    0.264    0.276    0.291    0.290
                X3         0.216    0.217    0.210    0.238    0.249
                X4         0.265    0.288    0.266    0.269    0.300
Variance        X1         0.257    0.355    0.506    0.696    0.885
                X2         0.262    0.220    0.173    0.100    0.042
                X3         0.216    0.173    0.142    0.100    0.029
                X4         0.265    0.252    0.179    0.104    0.044
Approximated    X1         0.258    0.237    0.261    0.230    0.207
P-value         X2         0.261    0.264    0.270    0.288    0.275
                X3         0.217    0.213    0.206    0.225    0.230
                X4         0.264    0.286    0.263    0.257    0.288

Strong dependence
Impurity        X1         0.268    0.245    0.251    0.207    0.167
                X2         0.270    0.267    0.266    0.277    0.283
                X3         0.231    0.238    0.227    0.254    0.263
                X4         0.231    0.250    0.256    0.262    0.287
Variance        X1         0.268    0.384    0.513    0.710    0.884
                X2         0.270    0.214    0.168    0.094    0.041
                X3         0.231    0.188    0.151    0.100    0.029
                X4         0.231    0.214    0.168    0.096    0.046
Approximated    X1         0.269    0.252    0.266    0.241    0.229
P-value         X2         0.268    0.267    0.258    0.270    0.267
                X3         0.232    0.232    0.221    0.240    0.239
                X4         0.231    0.249    0.255    0.249    0.265

Y follows the standard normal distribution and the distributions for the X's are listed in Table 2. Only X1 has randomly missing values. The approximated P-values are obtained using the YD method. Estimates are based on one thousand Monte Carlo iterations and two hundred samples in each iteration. A method is unbiased if it selects each variable with probability 0.25. Simulation standard errors are about 0.014. Estimates which are not within three standard errors of 0.25 are emphasized.

Table 8
Estimated probabilities that variable X1 is selected under the jump model where X1 is complete and the noise variable X2 has randomly missing values

            Missing percentage on X2
Method      0        20       40       60       80
Impurity    0.832    0.809    0.833    0.843    0.872
Variance    0.832    0.719    0.620    0.437    0.191
P-value     0.832    0.803    0.821    0.817    0.833

Estimates are based on one thousand Monte Carlo iterations and two hundred samples in each iteration.

are less available. In this respect, the variance criterion is unrealistic, because its selection probability reaches its maximum value of 0.962 when $X_1$ has 80 percent of its values randomly missing.
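A sketch of one iteration of this power study, written from the description above (not the authors' code; jump_replicate is an illustrative name): it draws two hundred samples from the jump model, imposes random missingness on one of the predictors, and leaves the per-variable statistics and P-values to be computed as in the earlier sketches.

```python
import numpy as np

def jump_replicate(n=200, missing=0.4, missing_on="X2", rng=None):
    """One draw from the jump model with random missingness on one predictor."""
    rng = rng or np.random.default_rng()
    x1 = rng.normal(size=n)
    x2 = rng.exponential(scale=1.0, size=n)             # noise predictor
    y = 0.5 * (x1 > 0) + rng.normal(size=n)             # Y = 0.5 I(X1 > 0) + eps
    x = {"X1": x1, "X2": x2}
    x[missing_on][rng.random(n) < missing] = np.nan     # scheme (a) or (b)
    return y, x

# Scheme (a): X1 complete, X2 has 40% randomly missing values.
y, xs = jump_replicate(missing_on="X2", missing=0.4)
print({k: int(np.isnan(v).sum()) for k, v in xs.items()})
# Each criterion (impurity, variance, or P-value) is then evaluated per variable
# on its nonmissing cases, and the selected variable is recorded.
```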


Table 9
Estimated probabilities that variable X1 is selected under the jump model where X1 has randomly missing values and the noise variable X2 is complete

            Missing percentage on X1
Method      0        20       40       60       80
Impurity    0.822    0.775    0.705    0.605    0.462
Variance    0.821    0.857    0.871    0.908    0.962
P-value     0.820    0.781    0.716    0.634    0.526

Estimates are based on one thousand Monte Carlo iterations and two hundred samples in each iteration.

We changed the jump size from 0.5 to 0.3, 0.7 and 1.0. We find that the results are similar, and only the results for jump size 0.5 are reported here.

5. Conclusions

We demonstrate that the usual exhaustive search method for variable selection in regression trees with constant fits has selection bias if either the impurity or the variance criterion is used. We provide the reason for its existence and a correction method based on P-values when predictor variables possess different numbers of missing values. Methods related to change-point problems are used to compute the P-values. We find that the exact P-values can be obtained by the method given in Hawkins (1977) and Hothorn and Lausen (2003). The method of Yao and Davis (1986) performs reasonably well in approximating the P-values. We suggest that the exact method be used when the node size is small, say less than one hundred, and the approximation method be used when the node size is large. This strategy can be useful in eliminating variable selection bias in regression trees with constant fits.

Acknowledgements

The authors are very grateful to the co-editor and two referees for their constructive comments on earlier versions of the article. This research is supported by a grant from NSC, Taiwan.

References

Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J., 1984. Classification and Regression Trees. Chapman & Hall, New York.
Doyle, P., 1973. The use of automatic interaction detector and similar search procedures. Oper. Res. Quart. 24, 465–467.
Hastie, T., Tibshirani, R., Friedman, J., 2001. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York.


Hawkins, D.M., 1997. FIRM: Formal inference-based recursive modeling, PC version, release 2.1. Technical Report 546, School of Statistics, University of Minnesota.
Hawkins, D.M., 1977. Testing a sequence of observations for a shift in location. J. Amer. Statist. Assoc. 72, 180–186.
Hothorn, T., Lausen, B., 2003. On the exact distribution of maximally selected rank statistics. Comput. Statist. Data Anal., in press; preprint available from http://www.mathpreprints.com/math/preprint/hothorn/20020227/2/.
James, B., James, K.L., Siegmund, D., 1987. Tests for a change-point. Biometrika 74, 71–83.
Jensen, D., Cohen, P., 2000. Multiple comparisons in induction algorithms. Mach. Learning 38, 309–338.
Kass, G.V., 1975. Significance testing in automatic interaction detection (AID). Appl. Statist. 24, 178–189.
Kim, H., Loh, W.-Y., 2001. Classification trees with unbiased multiway splits. J. Amer. Statist. Assoc. 96, 589–604.
Lausen, B., Schumacher, M., 1992. Maximally selected rank statistics. Biometrics 48, 73–85.
Loh, W.-Y., 2002. Regression trees with unbiased variable selection and interaction detection. Statist. Sinica 12, 361–386.
Morgan, J.N., Sonquist, J.A., 1963. Problems in the analysis of survey data, and a proposal. J. Amer. Statist. Assoc. 58, 415–434.
Schlittgen, R., 1999. Regression trees for survival data—an approach to select discontinuous split points by rank statistics. Biometrical J. 8, 943–954.
Scott, A.J., Knott, M., 1976. An approximate test for use with AID. Appl. Statist. 25, 103–106.
Torgo, L., 1999. Inductive learning of tree-based regression models. Ph.D. Thesis, Faculty of Science, University of Porto.
Witten, I., Frank, E., 2000. Data Mining. Morgan Kaufmann, San Francisco.
Yao, Y.-C., Davis, R.A., 1986. The asymptotic behavior of the likelihood ratio statistic for testing a shift in mean in a sequence of independent normal variables. Sankhyā A 48, 339–353.