Journal of Statistical Planning and Inference 70 (1998) 241–254
Calibration and refinement for classification trees

Stephen E. Fienberg a,*, Sung-Ho Kim b

a Carnegie Mellon University, Pittsburgh, PA 15213, USA
b KAIST, Daejon, 305-701, South Korea

* Corresponding author. Tel.: +1 412 268 2717; fax: +1 412 268 7828; e-mail: [email protected].
Received 31 August 1993; received in revised form 1 November 1997; accepted 21 November 1997
Abstract

The calibration of forecasts for a sequence of events has an extensive literature. Since calibration does not ensure 'good' forecasts, the notion of refinement was introduced to provide a structure into which methods for comparing well-calibrated forecasters could be embedded. In this paper we apply these two concepts, calibration and refinement, to tree-structured statistical probability prediction systems by viewing predictions in terms of the expected value of a response variable given the values of a set of explanatory variables. When all of the variables are categorical, we show that, under suitable conditions, branching at the terminal node of a tree by adding another explanatory variable yields a tree with more refined predictions. © 1998 Elsevier Science B.V. All rights reserved.

Keywords: Calibration; Categorical forecasts; Classification trees; Prediction systems; Probability forecasters; Refinement
1. Introduction

In a typical sequential classification procedure, we observe explanatory or predictor variables, one after another, deciding after each observation whether or not to continue adding variables. In selecting the next predictor variable, we usually attempt to maximize the expected utility, which involves the total cost of variable observations and the loss from the decision. The predictions in this paper are made in terms of probabilities, and thus the loss is a function of probabilistic predictions and the corresponding outcomes.

It is common to depict the above sequential procedure by a directed acyclic graph, called a tree. In this paper, however, we refer to a tree-structured statistical probability prediction system as a tree. Further, we assume that all of the predictor variables are categorical, and if we observe a categorical variable with k levels at a node of a tree, then the node branches into k child-nodes. Finally, we refer to a node where we
observe a predictor variable as a non-terminal node, and a node where we make a prediction as a terminal node (TN). Fig. 1 illustrates a tree with these characteristics. A sequential variable selection procedure (i.e., a sequential branching process in a tree) is equivalent to a procedure of getting more refined partitions of the sample space generated by the set of the predictor variables. This equivalence leads us to the concept of the comparison of experiments described in Blackwell and Girshick (1954), which is cast in terms of the comparison of partitions of the sample space (see Section 3 below). The Blackwell–Girshick notion of sufficiency in the comparison of experiments has been linked by DeGroot and Fienberg (1982, 1986) to their concept of refinement for comparing well-calibrated forecasters and provides an alternative language and methodology for the problem considered here.

In Section 2 we briefly review the concepts of calibration and refinement for forecasters and then we apply them to trees in Section 4. Section 3 describes how trees and partitions of a sample space of a set of predictor variables are related. In Section 4, we show that, under suitable conditions, branching at a TN of a tree improves the tree in the sense of refinement. Finally, in Section 5, we focus on some concerns expressed in the literature on calibration as they relate to the problem of tree-construction.

2. Overview of calibration and refinement

For the evaluation of probability forecasters, Murphy and Epstein (1967) suggest several criteria. Consider a long sequence of weather forecasts (rain/dry), look at those days for which the forecast probability of rain equals x, and determine the long-run proportion, ρ(x), of such days on which the forecast event (rain) in fact occurred. The plot of ρ(x) against x is the forecaster's empirical calibration curve. If the curve is diagonal, i.e., if ρ(x) = x, we say that the forecaster is empirically well-calibrated. In a more general sense, we may view ρ(x) as our conditional probability of the event (e.g. rain tomorrow) to which the forecaster assigns the value x. Then the forecaster is well-calibrated (WC) if

    ρ(x) = x,   for each value x the forecaster used.    (2.1)
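As a concrete illustration (ours, not part of the original development), the empirical calibration curve can be tabulated directly from a record of binary forecasts and outcomes; the short Python sketch below bins forecasts by the value issued and compares each value x with the observed relative frequency, an estimate of ρ(x). The data are invented.

```python
from collections import defaultdict

def calibration_curve(forecasts, outcomes):
    """Empirical calibration: for each distinct forecast value x, return the
    observed relative frequency of the event among the occasions on which x
    was issued (an estimate of rho(x))."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for x, y in zip(forecasts, outcomes):
        totals[x] += 1
        hits[x] += y          # y is 1 if the event occurred, else 0
    return {x: hits[x] / totals[x] for x in totals}

# A forecaster is empirically well-calibrated when rho_hat(x) is close to x
# for every value x actually used.
rho_hat = calibration_curve([0.3, 0.3, 0.7, 0.7, 0.7], [0, 1, 1, 1, 0])
print(rho_hat)   # {0.3: 0.5, 0.7: 0.666...}
```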
On average, a WC forecaster gets his forecasts correct for each value x. Murphy and Winkler (1977) showed that experienced weather forecasters are, on the whole, WC. Further empirical studies reported by Lichtenstein et al. (1982) yielded some poorly calibrated forecasts.

The calibration criterion we have just described seems to have something to do with the frequency definition of probability, but it does not require independent and identically distributed repeated trials. Roberts (1968), for example, attempted to interpret the calibration criterion by supposing that one could select a subset of all such days that could be regarded, at the time of forecast, as identical in all relevant respects, and considered the limiting relative frequency of rain on such days as the true probability for any one of them. Dawid (1982) proved that a coherent Bayesian expects to be
WC. He considered the meaning that might be attached to a sequence of probability predictions and introduced calibration as a criterion which can be used to test the empirical validity of such a sequence in the light of the outcomes of the events being predicted. Kadane and Lichtenstein (1982) made a related comment to the effect that if the probability assessor knows the outcomes of all the previous events when making assessments, calibration is always expected.

DeGroot and Eriksson (1985) suggest three different interpretations of ρ(x) in expression (2.1):
1. the limiting or theoretical values for an infinite sequence of events,
2. the actual relative frequencies for a finite sequence of events, and
3. the subjective probabilities of an observer who is comparing the forecasters.
If a forecaster gives predictions on a sequence of events in terms of unbiased estimates of the conditional probabilities given all the data available, then the forecaster is WC. This corresponds to viewpoint (1). Empirical calibration using forecasts based on finite sequences corresponds to viewpoint (2). Only viewpoint (3) provides a way to extricate the interpretation of ρ(x) from empirical or long-run frequencies.

In the literature on statistical diagnostic aids, calibration is rarely mentioned (a notable exception is Knill-Jones et al., 1973) and it is even explicitly rejected as unnecessary by some authors (for example, see Ben-Bassat et al., 1980). Dawid (1985), on the other hand, argues that the calibration criterion is an appealing one in evaluating systems for medical prognosis (or diagnosis), insurance portfolios, and even criminal trials. In his comment on Dawid (1985), Schervish (1985) notes that the existence of well-calibrated predictors is not guaranteed in general, but that so long as we keep in mind that the most a probability forecast can be is a measure of how strongly we believe that an event will occur (based on evidence currently available, but not on evidence yet to be observed), we may yet be able to find a useful method for comparing and evaluating forecasters from the calibration perspective.

Calibration alone is not suitable as a criterion for evaluating forecasts. To address the evaluation problem, DeGroot and Fienberg (1982) introduced the concept of refinement, which can be used to induce a partial ordering on the class of all WC forecasters for the same sequence of events. Let Ω_A be the set of the probability predictions by forecaster A and let ν_A be the distribution function over Ω_A. Similarly, let Ω_B be the set of forecasts by B and let ν_B be the distribution function over Ω_B. As for ρ(x), ν_A and ν_B can be interpreted as limiting distributions, the forecasters' actual relative frequency distributions for a finite sequence, or an observer's subjective probabilities. A stochastic transformation h(y|x) is a non-negative function defined on Ω_A × Ω_B such that

    ∑_{y∈Ω_B} h(y|x) = 1,   for every x ∈ Ω_A.    (2.2)
Definition 2.1. Suppose that both forecasters A and B are WC. Forecaster A is at least as refined as forecaster B if there exists a stochastic transformation h(y|x) such that

    ∑_{x∈Ω_A} h(y|x) ν_A(x) = ν_B(y),   for y ∈ Ω_B,    (2.3)

and

    ∑_{x∈Ω_A} h(y|x) x ν_A(x) = y ν_B(y),   for y ∈ Ω_B.    (2.4)
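Conditions (2.2)–(2.4) are easy to check numerically once the forecast distributions are tabulated. The sketch below (ours; the distributions, the candidate transformation, and the implicit base rate of 0.5 are invented, and well-calibration of both forecasters is simply assumed) verifies that a forecaster who uses the values 0.2 and 0.8 is at least as refined as one who always forecasts 0.5.

```python
# Omega_A, Omega_B: forecast values used; nu_A, nu_B: how often each is used.
nu_A = {0.2: 0.5, 0.8: 0.5}              # forecaster A (more refined)
nu_B = {0.5: 1.0}                        # forecaster B (always the base rate)
h = {(0.5, 0.2): 1.0, (0.5, 0.8): 1.0}   # candidate h(y|x)

def at_least_as_refined(nu_A, nu_B, h, tol=1e-12):
    # (2.2): h(.|x) must be a probability distribution over Omega_B
    for x in nu_A:
        assert abs(sum(h.get((y, x), 0.0) for y in nu_B) - 1.0) < tol
    # (2.3): h carries the distribution of A's forecasts onto B's
    cond_23 = all(abs(sum(h.get((y, x), 0.0) * nu_A[x] for x in nu_A) - nu_B[y]) < tol
                  for y in nu_B)
    # (2.4): h also preserves the (calibrated) event rates
    cond_24 = all(abs(sum(h.get((y, x), 0.0) * x * nu_A[x] for x in nu_A) - y * nu_B[y]) < tol
                  for y in nu_B)
    return cond_23 and cond_24

print(at_least_as_refined(nu_A, nu_B, h))   # True: A is at least as refined as B
```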
If ν_A and ν_B are not identically equal in expressions (2.3) and (2.4), then A is more refined than B. The refinement relationship is both reflexive and transitive, and induces a partial ordering among WC forecasters but not a total ordering (see DeGroot and Fienberg, 1983). Several equivalent conditions for the refinement relationship between a pair of WC forecasters are discussed by DeGroot and Eriksson (1985) and DeGroot and Fienberg (1986). For example, Theorem 15 of DeGroot and Fienberg (1986) re-expresses refinement without explicit reference to the stochastic transformation h(y|x) and is linked to the notion of Blackwell sufficiency in the comparison of experiments:

Theorem 2.2. Suppose that both forecasters A and B are WC for the forecasting of events with k ≥ 2 outcomes. Then, the condition that A is at least as refined as B is equivalent to the condition that, for every continuous convex function g(x) defined on the (k − 1)-dimensional simplex S^{k−1},

    ∑_x ν_A(x) g(x) ≥ ∑_x ν_B(x) g(x),
where the summation goes over the set of the probability predictions by A or B.

3. Partitions of a sample space and trees

Let X_1, X_2, …, X_L be discrete random variables, each of whose support sets is finite, and let X be the L-component vector of X_1, X_2, …, X_L. A partition of the sample space generated by X is defined as a collection of mutually disjoint and exhaustive subsets of the sample space. We denote such a partition by Par(X) and the support of X_j by S(X_j) = {x: P(X_j = x) > 0}.

For any partition Par(X), there exists at least one X variable which is involved in all the component sets of the partition. Otherwise, there would exist a component set which overlaps with some other, contradicting the definition of a partition. We may select one such X variable and label it X_{sel(1)}. Pick a value, c_1 say, so that the partition is divided into two exhaustive subgroups according to whether X_{sel(1)} > c_1 or not. For each of the subgroups, we apply the same argument for another split of the subgroup. After a sequence of such splits, we get a partition that is at least as fine as the partition Par(X). The new partition is expressed in terms of the inequalities X_{sel(i)} ≤ c_i, X_{sel(i)} > c_i, for i = 1, 2, …, D, where D is the total number of the splits. By following the sequence of the inequalities in each component set of the new partition, we can draw a connected, acyclic directed graph, namely, a tree (see Fig. 1, neglecting the P̂_j's there, for an example). We refer to such a partition as a tree-partition.
Fig. 1. A tree with 4 terminal nodes where predictions are made in terms of probabilities in the box. The number, 1 or 2, on an arrow is the value of X at the corresponding arrow-tail. The variable in a circle-node is observed at the node.
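To make the tree–partition correspondence concrete, the following sketch (ours; the nested-tuple representation and variable names are only illustrative) encodes a tree of the kind shown in Fig. 1 and enumerates the cells of the induced partition, one cell per terminal node.

```python
# A non-terminal node is (variable_name, {value: subtree, ...}); a terminal node is None.
# This mirrors the tree of Fig. 1: observe X1; if X1 = 1 observe X2; if X2 = 1 observe X3.
fig1_tree = ("X1", {2: None,
                    1: ("X2", {2: None,
                               1: ("X3", {1: None, 2: None})})})

def partition(node, conditions=()):
    """Enumerate the cells of the partition induced by a tree:
    one cell (a conjunction of 'Xj = value' conditions) per terminal node."""
    if node is None:
        return [conditions]
    var, children = node
    cells = []
    for value, child in children.items():
        cells.extend(partition(child, conditions + ((var, value),)))
    return cells

for cell in partition(fig1_tree):
    print(cell)
# (('X1', 2),), (('X1', 1), ('X2', 2)), (('X1', 1), ('X2', 1), ('X3', 1)), ...
```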
Given a tree, we can always find a unique corresponding partition. For example, the tree in Fig. 1 (neglecting the P̂'s there) corresponds to the partition

    Par(X_1, X_2, X_3) = {(X_1 = 2), (X_1 = 1, X_2 = 2), (X_1 = 1, X_2 = 1, X_3 = 1), (X_1 = 1, X_2 = 1, X_3 = 2)}.

Given a partition Par(X), we may wish to find the smallest tree-partition of Par(X). The smallest tree-partition would correspond to the smallest tree. Suppose that the relation between a predicted variable and the vector X gives rise to a partition Par(X). Then, if we make predictions by sequentially observing the X variables, the tree corresponding to the smallest tree-partition of Par(X) would be the most desirable.

In constructing a sequential prediction system such as a tree, we apply a decision-theoretic approach (cf. Breiman et al., 1984). As we have seen, we can link tree-growing to the refinement of a partition Par(X). In principle, we grow a tree until the 'expected loss' of the predictions by the tree is minimized. The expected loss must include the cost of growing the tree; otherwise we will surely opt for the largest possible tree. The expected loss is a measure of the performance of the predictions by the tree and is a convenient tool for tree-construction. We may be able to make better use of the tool, however, if we better understand the probabilistic mechanism behind it, as we suggest in the following section.

4. Calibration and refinement for trees

In this section we define the concept of calibration for trees, and then show that the refinement relationship (Definition 2.1) holds for trees, under suitable conditions. We use a circle and a box in a tree to denote a non-terminal node and a TN, respectively. We denote the response variable by Y and the predictor variables by X or by X_i when
we need the subscript. Y is multinomial with parameters 1 and probabilities p_1, …, p_k, which sum to 1. Suppose we are given the tree in Fig. 1, where P̂_j is the probability prediction for Y at the jth terminal node. The probabilities assigned to the terminal nodes are as follows:

    P̂_1 = Ê(Y | X_1 = 2),
    P̂_2 = Ê(Y | X_1 = 1, X_2 = 2),
    P̂_3 = Ê(Y | X_1 = 1, X_2 = 1, X_3 = 2),
    P̂_4 = Ê(Y | X_1 = 1, X_2 = 1, X_3 = 1).

At each node of a tree the probability predictions are made by estimating the probability of the predicted event based on the data and other appropriate information (e.g., the prior on the set of the probabilities for Y). We refer to a tree which has T terminal nodes as a T-tree. For convenience, we denote the probability prediction at the jth TN by P_j. We say that a T-tree is well-calibrated (WC) if, for all j, 1 ≤ j ≤ T, at the jth TN with probability prediction P_j′ = (P_{j1}, P_{j2}, …, P_{jk}),

    lim_{n_{j+}→∞} (P_j′ − n_{j+}^{−1} (n_{j1}, n_{j2}, …, n_{jk})) = 0,    (4.1)

where n_{ji} is the number of cases arriving at the jth TN with response i and ∑_{i=1}^{k} n_{ji} = n_{j+}. If the prediction probabilities are updated based on all the data available at the time of estimation, the predicting system is expected to be WC (Kadane and Lichtenstein, 1982). In this sense, trees are expected to be WC. The definition of refinement is now immediately applicable to trees (for a related observation, see Section 12 of Dawid, 1985).

Suppose that a WC T-tree τ_1 branches at the Tth TN into a WC (T − 1 + u)-tree τ_2, i.e., there are u new TNs at the Tth TN. Let P_j be the probability prediction at the jth TN of τ_1, and π_j the true arrival rate at that TN (i.e., the probability that a case passes through the tree to the TN), and similarly let Q_j and λ_j be the probability prediction and the true arrival rate at the jth TN of τ_2, where

    the jth TN of τ_2 = the jth TN of τ_1,   if 1 ≤ j ≤ T − 1,
                      = the (j − T + 1)th among the new TNs branched at the Tth TN of τ_1,   if T ≤ j ≤ T − 1 + u.    (4.2)

Expression (4.2) implies that

    π_T = ∑_{j=T}^{T−1+u} λ_j    (4.3)

and that

    P_j = Q_j,   for j = 1, 2, …, T − 1.    (4.4)
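Condition (4.1) has a direct empirical analogue: at each terminal node, compare the prediction vector with the relative frequencies of the response categories among the cases arriving there. The sketch below (ours; the predictions and counts are invented, with k = 2 and T = 3) computes these discrepancies.

```python
import numpy as np

# Predictions P_j (rows) and response counts n_{ji} (rows) at T = 3 terminal nodes,
# for a response with k = 2 categories; the numbers are purely illustrative.
P = np.array([[0.9, 0.1],
              [0.4, 0.6],
              [0.2, 0.8]])
counts = np.array([[88, 12],
                   [41, 59],
                   [19, 81]])

# Empirical analogue of (4.1): P_j - n_{j+}^{-1} (n_{j1}, ..., n_{jk}) should be
# small at every terminal node once n_{j+} is large.
freq = counts / counts.sum(axis=1, keepdims=True)
print(np.abs(P - freq).max(axis=1))   # per-TN calibration discrepancies
```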
Note that the same probability can be used for prediction at multiple TNs of a tree. Further, in order to prove the refinement relation between trees, using Definition 2.1,
Table 1
True use rates of prediction probabilities with τ_1 and τ_2

(a) Use rates for τ_1

Prediction       P_1   P_2   …   P_v   P_{v+1}   …   P_{T−1}   P_T
True use rate    π_1   π_2   …   π_v   π_{v+1}   …   π_{T−1}   π_T

(b) Use rates for τ_2

Prediction       Q_1                Q_2                  …   Q_v                Q_{v+1}   …   Q_{T−1}
True use rate    π_1 + λ_{T+u−v}    π_2 + λ_{T+u−v+1}    …   π_v + λ_{T−1+u}    π_{v+1}   …   π_{T−1}

Prediction       Q_T   …   Q_{T−1+u−v}
True use rate    λ_T   …   λ_{T−1+u−v}
we need only deal with a set of prediction probabilities and their distribution (which we shall refer to as 'use rates') but not necessarily with the index set of the TNs. In other words, the refinement relationship has nothing to do with the location of prediction. Thus, we may group the TNs into sets consisting of the TNs with the same prediction probability. Then the prediction probabilities, one selected from each set of TNs, are distinct from one another. The use rate of each prediction probability is the sum of the arrival rates at the TNs where the prediction probability is used. Therefore, without loss of generality, we may consider the following two situations only: (1) that all of the P_j are distinct, and (2) that there are at least two TNs where the same prediction probability is used and one of the TNs is branched. We prove the refinement relationship between trees for each of the situations.

For situation (1), some of the Q_j's, for T ≤ j, may be equal to some of the P's, but adding terminal nodes may change the probabilities. For simplicity, we let, for a non-negative integer v, v ≤ u, the first v P's be equal to the last v Q's in the following way:

    Q_{T−1+u−v+i} = P_i,   for 1 ≤ i ≤ v ≤ u.    (4.5)

Then the distributions of the P_j and Q_j are as in Tables 1a and b, respectively. Note that we have used the term 'use rate' in Table 1 instead of 'arrival rate'. While we used the term true arrival rate for TNs, we cannot use it for prediction probabilities since the same probability can be used at different TNs. The true use rate for the prediction Q_j, 1 ≤ j ≤ v, is the sum of the true arrival rate π_j for tree τ_1 and λ_{T−1+u−v+j} for tree τ_2, i.e., the sum of the true arrival rates at the TNs where the prediction probability is Q_j. Table 1b summarizes this phenomenon.

For the multivariate probability prediction problem, x and y in expression (2.4) are column vectors, unless otherwise specified. If we multiply each side of expression (2.4) by the transpose of the 1-vector, whose components are all 1s, then we get expression
(2.3). Thus, the refinement between any two multivariate probability predictors can be defined by expression (2.4) only.

To show that τ_2 is more refined than τ_1, we must find a stochastic transformation h(P|Q) satisfying the following equation, for j = 1, 2, …, T:

    ∑_{i=1}^{v} h(P_j|Q_i)(π_i + λ_{T−1+u−v+i})Q_i + ∑_{i=v+1}^{T−1} h(P_j|Q_i) π_i Q_i + ∑_{i=T}^{T−1+u−v} h(P_j|Q_i) λ_i Q_i = π_j P_j.    (4.6)

From expression (4.2), we know that P_j = Q_j, for j = 1, 2, …, T − 1. Furthermore, we can see that whenever τ_1 uses the prediction probability P_i, for i = v + 1, …, T − 1, so does τ_2, where Q_i = P_i. Thus, we have

    h(P_{v+1}|Q_{v+1}) = ⋯ = h(P_{T−1}|Q_{T−1}) = 1.    (4.7)

The stochastic transformation property then implies that

    h(P_{j′}|Q_j) = 0,   for j′ ≠ j with v + 1 ≤ j ≤ T − 1.    (4.8)

Hence, for j ∈ {1, 2, …, v − 1, v, T},

    ∑_{i=1}^{v} h(P_j|Q_i)(π_i + λ_{T−1+u−v+i})Q_i + ∑_{i=T}^{T−1+u−v} h(P_j|Q_i) λ_i Q_i = π_j P_j.    (4.9)
Let h_{ij} = h(P_j|Q_i). We can show that the (T − 1 + u − v) × T stochastic transformation matrix (STM), (h_{ij}), given by expression (4.10) satisfies Eq. (4.6) for j = 1, 2, …, T (see Appendix A for a proof):
    H =
    ⎛ π_1/(π_1+λ_{T+u−v})   0                       ⋯   0                     0   0   ⋯   0   λ_{T+u−v}/(π_1+λ_{T+u−v})     ⎞
    ⎜ 0                     π_2/(π_2+λ_{T+u−v+1})   ⋯   0                     0   0   ⋯   0   λ_{T+u−v+1}/(π_2+λ_{T+u−v+1}) ⎟
    ⎜ ⋮                     ⋮                       ⋱   ⋮                     ⋮   ⋮       ⋮   ⋮                             ⎟
    ⎜ 0                     0                       ⋯   π_v/(π_v+λ_{T−1+u})   0   0   ⋯   0   λ_{T−1+u}/(π_v+λ_{T−1+u})     ⎟
    ⎜ 0                     0                       ⋯   0                     1   0   ⋯   0   0                             ⎟
    ⎜ 0                     0                       ⋯   0                     0   1   ⋯   0   0                             ⎟
    ⎜ ⋮                     ⋮                           ⋮                     ⋮   ⋮   ⋱   ⋮   ⋮                             ⎟
    ⎜ 0                     0                       ⋯   0                     0   0   ⋯   1   0                             ⎟
    ⎜ 0                     0                       ⋯   0                     0   0   ⋯   0   1                             ⎟
    ⎜ ⋮                     ⋮                           ⋮                     ⋮   ⋮       ⋮   ⋮                             ⎟
    ⎝ 0                     0                       ⋯   0                     0   0   ⋯   0   1                             ⎠
    (4.10)
We can represent the STM H in expression (4.10) in a partitioned form. Let H_11 be a (T − 1) × (T − 1) diagonal matrix whose first v diagonal elements are

    π_1/(π_1 + λ_{T+u−v}), π_2/(π_2 + λ_{T+u−v+1}), …, π_v/(π_v + λ_{T−1+u})

and whose remaining elements are 1s. Let H_12 be a (T − 1)-component column vector whose first v components are

    λ_{T+u−v}/(π_1 + λ_{T+u−v}), λ_{T+u−v+1}/(π_2 + λ_{T+u−v+1}), …, λ_{T−1+u}/(π_v + λ_{T−1+u})

and 0s for the rest, and denote by 0_{(u−v)×(T−1)} the (u − v) × (T − 1) zero matrix. Then, if we denote by 1_{u−v} the (u − v)-component column vector of 1s, we can express H in a partitioned form as

    H = ( H_11               H_12    )
        ( 0_{(u−v)×(T−1)}    1_{u−v} ).    (4.11)
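The construction of H can be checked numerically on a small instance. In the sketch below (ours, not part of the paper; all rates and predictions are invented, with a binary response, T = 3, u = 3 and v = 1, and with π and λ denoting the arrival rates of τ_1 and τ_2 as above), the new-node predictions are chosen so that the calibration constraint of expression (A.2) holds, H is built as in (4.10)/(4.11), and conditions (2.2)–(2.4), i.e. Eq. (4.6), are verified.

```python
import numpy as np

# Small worked instance of the branching construction (all numbers invented):
# tau_1 has T = 3 terminal nodes; the 3rd TN is branched into u = 3 new TNs,
# of which v = 1 reuses the prediction P_1 (situation (1), expression (4.5)).
P   = np.array([[0.2, 0.8], [0.5, 0.5], [0.525, 0.475]])   # predictions of tau_1
pi  = np.array([0.3, 0.3, 0.4])                            # true arrival/use rates of tau_1
lam = {3: 0.1, 4: 0.1, 5: 0.2}                             # arrival rates at the new TNs (sum = pi[2])
Q_new = {3: np.array([0.8, 0.2]), 4: np.array([0.9, 0.1]), 5: P[0]}  # new-TN predictions; Q_5 = P_1

# Well-calibration at the branched node forces P_3 to be the lam-weighted
# average of the new predictions (expression (A.2)); the numbers above obey this.

# Distinct predictions of tau_2 and their use rates (Table 1b), with v = 1:
Q  = np.array([P[0], P[1], Q_new[3], Q_new[4]])
xi = np.array([pi[0] + lam[5], pi[1], lam[3], lam[4]])

# The stochastic transformation matrix H of (4.10)/(4.11); rows = Q's, columns = P's.
H = np.array([[pi[0] / (pi[0] + lam[5]), 0.0, lam[5] / (pi[0] + lam[5])],
              [0.0,                      1.0, 0.0],
              [0.0,                      0.0, 1.0],
              [0.0,                      0.0, 1.0]])

assert np.allclose(H.sum(axis=1), 1.0)                           # each h(.|Q_i) is a distribution (2.2)
assert np.allclose(H.T @ xi, pi)                                 # condition (2.3)
assert np.allclose(H.T @ (xi[:, None] * Q), pi[:, None] * P)     # condition (2.4), i.e. Eq. (4.6)
print("tau_2 is at least as refined as tau_1 for this instance")
```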
Thus far, we have assumed that all the P_j's are distinct. The stochastic transformation deals only with the distinct prediction values and the corresponding probabilities of use (e.g., if only P_1 and P_2 are the same out of {P_j}_{j=1}^{T}, then π_1 + π_2 is the corresponding use rate of that prediction probability). Thus, if some of {P_j}_{j=1}^{T} are the same, then we may regard them as a single prediction value assigned at a TN (recall that τ_1 is branched at the TN of P_T). But when

    P_T = P_{j′},   for some 1 ≤ j′ ≤ T − 1,    (4.12)

matters are not so simple. A formal expression of situation (2) is expression (4.12). Appendix B discusses this situation in detail and shows that the resultant STM is obtained by deleting the j′th column from the STM (4.10) and replacing the (j′, T)th entry of the matrix by 1.

In this section and Appendix B, we have considered every possible situation at each branching, and we have determined the corresponding STMs. Therefore, by Definition 2.1, we have proved:

Theorem 4.1. If predictions by a tree are given in terms of Ê(Y|·), then a branched tree is at least as refined as the tree that precedes it (before branching).

By definition, we can easily see that the refinement relation is transitive. From Theorem 4.1, we can see that a sequence of trees, where each tree (except the first one) is obtained by branching at a TN of the preceding tree, is totally ordered in the sense of refinement.
5. Summary observations

Underlying the definition of refinement is the assumption that the forecaster is well-calibrated. As Dawid (1985) notes, the intuitive concept of calibration is that, for all suitably specified subsequences, probability forecasts should be right 'on average' in comparison with relative frequencies, at least asymptotically. Moreover, he notes in his Section 12 that these concepts carry over to prognostic systems such as the tree-structured ones considered in this paper. Earlier, Dawid (1982) had shown that a coherent Bayesian forecaster expects (with probability 1 according to his/her prior distribution) to be well calibrated. Unfortunately, for any forecasting system, there exists a sequence of forecasts and outcomes for which the system will not be well calibrated (Oakes, 1985). Thus, one needs to approach with caution a prediction approach which assumes calibration in order to achieve improvement in the refinement sense, as we have done in this paper.

In the context of well-calibrated classification tree prediction systems, we have shown that, if predictions are given in terms of estimates of E(Y|·), then branching improves the refinement of predictions, because there exists a stochastic transformation for each branching. This suggests that, when we have a large number of explanatory variables, we would do well to continue branching until we exhaust them all. Leaving aside the issue of sparseness of the observed data, this suggestion is intuitively problematic, and various authors have adopted loss functions as ways to limit the growth of their trees. For example, Breiman et al. (1984) and Kim (1994) use one loss function to grow a tree and a second to control the tree size in estimating the prediction risk of the tree. They then used estimated prediction risks in searching for the optimal tree. Alternatively, one can implicitly think in terms of a tradeoff between the cost of an added branch and the benefit of an improved prediction.

DeGroot and Fienberg (1982, 1983, 1986) have shown that strictly proper scoring rules can be decomposed as the sum of two components. The first component represents a measure of the departure of the prediction system from the state of calibration. The second component is a measure of the degree of refinement of a prediction system given calibration and is used in Theorem 2.2, i.e.,

    ∑_x ν(x) g(x).
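For binary events this decomposition has a familiar empirical form: the mean Brier score splits exactly into a calibration term, ∑_x ν(x)(x − ρ(x))², and a refinement term, ∑_x ν(x)ρ(x)(1 − ρ(x)), when ν and ρ are taken to be the observed use rates and within-bin frequencies. The sketch below (ours; the data are invented) computes both components and confirms that they sum to the mean Brier score.

```python
from collections import defaultdict

def brier_decomposition(forecasts, outcomes):
    """Empirical calibration/refinement split of the mean Brier score for
    binary events: mean (x - y)^2 = sum_x nu(x)(x - rho(x))^2   [calibration]
                                  + sum_x nu(x) rho(x)(1 - rho(x))  [refinement]."""
    n = len(forecasts)
    hits, totals = defaultdict(int), defaultdict(int)
    for x, y in zip(forecasts, outcomes):
        totals[x] += 1
        hits[x] += y
    calib = sum((totals[x] / n) * (x - hits[x] / totals[x]) ** 2 for x in totals)
    refine = sum((totals[x] / n) * (hits[x] / totals[x]) * (1 - hits[x] / totals[x])
                 for x in totals)
    brier = sum((x - y) ** 2 for x, y in zip(forecasts, outcomes)) / n
    return brier, calib, refine   # brier == calib + refine (up to rounding)

print(brier_decomposition([0.2, 0.2, 0.8, 0.8, 0.8], [0, 0, 1, 1, 0]))
```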
To the extent that a tree-construction approach is well calibrated, it will carry out branching. Thus it is the lack of calibration that limits the growth of trees. In essence, the dual loss function approach of Breiman et al. (1984) and Kim (1994) can be viewed as a heuristic attempt to balance calibration and refinement, in the same sense that squared error loss balances the 'bias' of an estimate with the variance which measures the variability of an unbiased predictor. Sparseness of data represents a situation where actual predictions will not be well-calibrated, and thus one needs some mechanism to prevent the growth of a tree beyond the point warranted by the data.
The framework of calibration and refinement for the construction of classification trees points the way to a reconsideration of existing methods. By exploring the class of strictly proper scoring rules, statisticians may be able to develop a formal probabilistic justification for the two-stage approach of first growing a tree and then pruning its branches.

Acknowledgements

We are grateful to the editor and two referees for their valuable and insightful comments and suggestions. Morrie DeGroot provided early reactions and comments, and we wish he had been able to see the final version of the paper.

Appendix A

Here we will derive the STM in expression (4.10) under situation (1). Let j = T in expression (4.9). Dividing expression (4.9) by π_T gives

    ∑_{i=1}^{v} h(P_T|Q_i) ((π_i + λ_{T−1+u−v+i})/π_T) Q_i + ∑_{i=T}^{T−1+u−v} h(P_T|Q_i) (λ_i/π_T) Q_i = P_T.    (A.1)
Recall that the predictions are given in terms of an estimate of E(Y|·), Ê(Y|·), where the dot (·) symbolizes the outcome of the predictor variables observed along a path from the root node of the tree through to the current node where E(Y|·) is estimated. According to the framework of Section 3, the predictor variable added at the Tth TN of τ_1 is a u-level categorical variable. We denote this predictor variable by X′. In practice, P_j, Q_i, π_j, and λ_i may be replaced by estimated values. Under the setup leading to expression (4.2), we have

    P_T = ∑_{i=T}^{T−1+u} (λ_i/π_T) Q_i.    (A.2)

We know that P_T = Ê(Y|·) and Q_{T−1+i} = Ê(Y|·, X′ = x_i), for i = 1, 2, …, u. Note that X′ takes on u possible values, x_1, …, x_u. If we rewrite Q_{T−1+i} and P_T as

    Q_{T−1+i} = ∑_y y P̂(y|·, X′ = x_i) = (1/P̂(x_i|·)) ∑_y y P̂(y, x_i|·)    (A.3)

and

    P_T = ∑_y y P̂(y|·) = Ê(Y|·),    (A.4)

where y is a k-vector consisting of k − 1 zeros and one 1, then we can see that

    λ_{T−1+i}/π_T = P̂(x_i|·).    (A.5)
If we substitute the Q_i in expression (A.2) by expression (A.3) and compare the result with expression (A.4), then we have expression (A.5). Thus, we can always obtain {λ_i} that satisfy Eq. (A.1). Now, from expression (A.2), we can see that the last column of the matrix (4.10) satisfies expressions (A.1) or (4.9). The values of h(P_j|Q_i), for j = 1, …, v, are straightforward consequences of result (4.9), the last column of the matrix (4.10), and the fact that the last v Q-values are the same as the first v P-values. Recall that results (4.7) and (4.8) follow from this fact.
Appendix B

Here we will derive an analogue of the STM in Eq. (4.10) under situation (2) or, equivalently, under condition (4.12). Two cases are possible under condition (4.12): (i) P_T = P_{j′} for some j′, 1 ≤ j′ ≤ v, and (ii) P_T = P_{j′} for some j′, v + 1 ≤ j′ ≤ T − 1.

Case (i): In this case, Eq. (4.8) is valid except that, for j = T,

    ∑_{i=1}^{v} h(P_T|Q_i)(π_i + λ_{T−1+u−v+i})Q_i + ∑_{i=T}^{T−1+u−v} h(P_T|Q_i) λ_i Q_i = (π_{j′} + π_T) P_T.    (B.1)

Noting that Q_{j′} = P_{j′} = P_T in Case (i), we can see that the following satisfies (B.1):

    h(P_T|Q_i) = 1,                                            if i = j′ or T ≤ i ≤ T − 1 + u − v,
               = λ_{T−1+u−v+i}/(π_i + λ_{T−1+u−v+i}),          if 1 ≤ i ≤ v and i ≠ j′,
               = 0,                                            otherwise.    (B.2)

For 1 ≤ j ≤ v and j ≠ j′, we can see that the expression

    h(P_j|Q_i) = π_j/(π_j + λ_{T−1+u−v+j}),   if i = j,
               = 0,                            if i ≠ j,    (B.3)

satisfies Eq. (4.9). Consequently, the corresponding STM of expression (B.2) is obtained by deleting the j′th column, 1 ≤ j′ ≤ v, and replacing the (j′, T)th entry of the STM in expression (4.10) by 1. This yields the (T − 1 + u − v) × (T − 1) STM H′ as in expression (B.4), where the (T − 1) × (T − 2) matrix H′_11 is obtained by deleting the j′th column from H_11, and the (T − 1)-component column vector H′_12 is obtained by replacing the j′th component of H_12 by 1:

    H′ = ( H′_11              H′_12   )
         ( 0_{(u−v)×(T−2)}    1_{u−v} ).    (B.4)
Case (ii): In this case, we can see that, for j ∈ {v + 1, …, T − 1} and j ≠ j′,

    h(P_j|Q_j) = 1    (B.5)

satisfies expression (4.6) for such j's as above (note the difference between expressions (4.7) and (B.5)). For j = T, note that expression (4.9) becomes

    ∑_{i=1}^{v} h(P_T|Q_i)(π_i + λ_{T−1+u−v+i})Q_i + h(P_T|Q_{j′}) π_{j′} Q_{j′} + ∑_{i=T}^{T−1+u−v} h(P_T|Q_i) λ_i Q_i = (π_{j′} + π_T) P_T.    (B.6)

Thus, if we let

    h(P_T|Q_i) = 1,                                            if i = j′ or T ≤ i ≤ T − 1 + u − v,
               = λ_{T−1+u−v+i}/(π_i + λ_{T−1+u−v+i}),          if 1 ≤ i ≤ v,
               = 0,                                            otherwise,    (B.7)

then h(P_T|Q_i) satisfies Eq. (B.6). For j ∈ {1, 2, …, v}, Eq. (4.6) now becomes expression (4.9) by expressions (B.5) and (B.7). Hence, for j ∈ {1, 2, …, v},

    h(P_j|Q_i) = π_i/(π_i + λ_{T−1+u−v+i}),   if i = j,
               = 0,                            otherwise,    (B.8)

since Q_j = P_j for such j. As indicated in Eqs. (B.5), (B.7), and (B.8), we can see that the j′th column, v + 1 ≤ j′ ≤ T − 1, is deleted from the STM (4.10), and the (j′, T)th entry of the matrix is replaced by 1. The resultant (T − 1 + u − v) × (T − 1) matrix H″ as in expression (B.9) is similar to H′ except that j′ now ranges over v + 1 through T − 1. Thus, the first v diagonal elements of H″_11 and the first v elements of H″_12 are the same as the corresponding elements of H_11 and H_12, respectively. Note the difference between H′_12 and H″_12: the element λ_{T−1+u−v+j′}/(π_{j′} + λ_{T−1+u−v+j′}) of H_12 is replaced by 1 to obtain H′_12, while the zero at the j′th entry, v + 1 ≤ j′ ≤ T − 1, of H_12 is replaced by 1 to obtain H″_12.

    H″ = ( H″_11              H″_12   )
         ( 0_{(u−v)×(T−2)}    1_{u−v} ).    (B.9)

References

Ben-Bassat, M., Carlson, W.C., Puri, V.K., Davenport, M.D., Schriver, J.A., Latif, M., Smith, R., Portigal, L.D., Lipnick, E.H., Weil, M.H., 1980. Pattern-based interactive diagnosis of multiple disorders: the MEDAS system. IEEE Trans. Pattern Anal. Mach. Intell. 2, 148–159.
Blackwell, D., Girshick, M.A., 1954. Theory of Games and Statistical Decisions. Wiley, New York.
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J., 1984. Classification and Regression Trees. Wadsworth, Belmont, CA.
Dawid, A.P., 1982. The well-calibrated Bayesian. J. Amer. Statist. Assoc. 77, 605–613.
Dawid, A.P., 1985. Calibration-based empirical probability (with discussion). Ann. Statist. 13, 1251–1285.
DeGroot, M.H., Eriksson, E.A., 1985. Probability forecasting, stochastic dominance, and the Lorenz curve. In: Bernardo, J.M. et al. (Eds.), Bayesian Statistics 2. North-Holland, Amsterdam, pp. 99–118.
DeGroot, M.H., Fienberg, S.E., 1982. Assessing probability assessors: calibration and refinement. In: Gupta, S.S., Berger, J.O. (Eds.), Statistical Decision Theory and Related Topics III, vol. 1. Academic Press, New York, pp. 291–314.
DeGroot, M.H., Fienberg, S.E., 1983. The comparison and evaluation of forecasters. Statistician 32, 12–22.
DeGroot, M.H., Fienberg, S.E., 1986. Comparing probability forecasters: basic binary concepts and multivariate extensions. In: Goel, P., Zellner, A. (Eds.), Bayesian Inference and Decision Techniques. North-Holland, Amsterdam, pp. 247–264.
Kadane, J.B., Lichtenstein, S., 1982. A subjective view of calibration. Technical Report No. 233, Department of Statistics, Carnegie Mellon University, Pittsburgh, PA.
Kim, S.-H., 1994. A general property among nested, pruned subtrees of a decision-support tree. Comm. Statist. Theory Methods 23, 1227–1238.
Knill-Jones, R.P., Stern, R.B., Girmes, D.H., Maxwell, J.D., Thompson, R.P.H., Williams, R., 1973. The use of a sequential Bayesian model in the diagnosis of jaundice by computer. Brit. Med. J. 1, 530–534.
Lichtenstein, S., Fischhoff, B., Phillips, L.D., 1982. Calibration of probabilities: the state of the art to 1980. In: Kahneman, D., Slovic, P., Tversky, A. (Eds.), Judgment under Uncertainty: Heuristics and Biases. Cambridge University Press, New York, pp. 306–334.
Murphy, A.H., Epstein, E.S., 1967. Verification of probabilistic predictions: a brief review. J. Appl. Meteor. 6, 748–755.
Murphy, A.H., Winkler, R.L., 1977. Can weather forecasters formulate reliable probability forecasts of precipitation and temperature? National Weather Digest 2, 2–9.
Oakes, D., 1985. Self-calibrating priors do not exist (with discussion). J. Amer. Statist. Assoc. 80, 339–342.
Roberts, H.V., 1968. On the meaning of the probability of rain. Paper presented to the 1st National Conf. on Statistical Meteorology.
Schervish, M.J., 1985. Comment on "Calibration-based empirical probability," by A.P. Dawid. Ann. Statist. 13, 1274–1282.