Maximum likelihood estimation of a binary choice model with random coefficients of unknown distribution


Journal of Econometrics 86 (1998) 269-295

Hidehiko Ichimura (a,*), T. Scott Thompson (b)

(a) Department of Economics, University of Pittsburgh, Pittsburgh, PA 15260, USA
(b) Antitrust Division, U.S. Department of Justice, Washington, DC 20530, USA

Received 1 March 1993; received in revised form 1 September 1997; accepted 22 September 1997

Abstract

We consider a binary response model $y_i = 1\{x_i'\beta_i + \varepsilon_i \ge 0\}$ with $x_i$ independent of the unobservables $(\beta_i, \varepsilon_i)$. No finite-dimensional parametric restrictions are imposed on $F_0$, the joint distribution of $(\beta_i, \varepsilon_i)$. A nonparametric maximum likelihood estimator for $F_0$ is shown to be consistent. We analyze some conditions under which $F_0$ is or is not identified. In particular, we show that if the support of $F_0$ is a subset of any half of the unit hypersphere, then $F_0$ is identified relative to all distributions on the unit hypersphere. We also provide some Monte Carlo evidence on the small sample performance of our estimator. © 1998 Elsevier Science S.A. All rights reserved.

JEL classification: C25; C14; C13

Keywords: Binary response; Discrete choice; Random coefficients; Nonparametric estimation; Identification

* Corresponding author. E-mail: ichimura+@pitt.edu
0304-4076/98/$19.00 © 1998 Published by Elsevier Science S.A. All rights reserved. PII S0304-4076(97)00117-6

1. Introduction

The simplest econometric model of binary choice postulates that for each individual $i$ (chosen randomly from a large population) an observed choice variable $y_i$ is related to a $K \times 1$ vector of observables $x_i$ by the equation
$$y_i = 1\{x_i'\beta + \varepsilon_i \ge 0\}, \tag{1}$$
where $\beta$ (a $K \times 1$ vector) is an unknown parameter and $\varepsilon_i$ is an unobserved variable. Here $1\{\cdot\}$ denotes the zero-one binary indicator function. This model
is sometimes motivated by assuming that individual $i$ receives an indirect utility $U_{ij}$ when alternative $j$ is chosen, and therefore chooses $y_i = 1$ if and only if $U_{i1} \ge U_{i0}$. If there are versions of indirect utility that satisfy the separable specification $U_{ij} = z_{ij}'\beta + \nu_{ij}$, then we obtain Eq. (1) by setting $x_i = z_{i1} - z_{i0}$ and $\varepsilon_i = \nu_{i1} - \nu_{i0}$. In this context $\beta$ is a vector of the marginal (indirect) utilities associated with the observed variables $x_i$ (e.g. prices), and $\varepsilon_i$ collects all other unobserved determinants of (indirect) utility.

A natural way to introduce heterogeneity in the model above is to treat $\beta$ as a random variable; see Quandt (1966) and McFadden (1976). Denoting by $\beta_i$ the realization of $\beta$ for individual $i$, and subsuming $\varepsilon_i$ into the first element of $\beta_i$, the random coefficients model is
$$y_i = 1\{x_i'\beta_i \ge 0\}. \tag{2}$$
A conventional approach is to specify the distribution of $\beta_i$ except for a finite number of parameters. We call this the parametric random coefficients binary choice model.

In this paper we consider the problem of estimating the distribution of the unobserved terms $\beta_i$ in the model of Eq. (2) without imposing finite-dimensional parametric restrictions on this distribution. We assume that $\beta_i$ and $x_i$ are independent random vectors. We propose a nonparametric maximum likelihood estimator for the distribution function $F_0$ of $\beta_i$ under the assumption that $F_0$ is an element of $\mathcal{F}$, some space of distribution functions. We analyze some conditions under which $F_0$ is or is not identified relative to $\mathcal{F}$. We find that $F_0$ is not identified when $\mathcal{F}$ is large enough. This remains true even when $\beta_i$ is normalized to lie on the unit hypersphere. Our Theorem 1 provides some simple, sufficient conditions for identification. Given identification, we prove consistency of the maximum likelihood estimator. We also provide some Monte Carlo evidence on the small sample performance of our estimator.

In the remainder of this introduction we motivate our interest in $F_0$ and relate our approach to other literature on binary choice. Section 2 considers identification issues for our model. Section 3 introduces the maximum likelihood estimator and discusses its geometric structure. Section 4 establishes consistency of the estimator. Section 5 presents some example calculations of the maximum likelihood estimator and summarizes the results of some Monte Carlo experiments. We conclude with some interpretation of our results and some suggestions for further research. All proofs are gathered in the appendix.

1.1. Prior restrictions on binary choice models

This paper is distinguished from earlier work by a focus on estimation of $F_0$ without requiring parametric assumptions about the form of this distribution. By adopting the model of Eq. (2) and assuming independence of $x_i$ and $\beta_i$,
we impose sufficient structure on the distribution of $y_i$ conditional on $x_i$ to permit estimation of $F_0$, without imposing much else. Here we discuss the rationale for this modeling decision.

Applied econometricians often adopt the models of Eq. (1) or Eq. (2) and assume that $x_i$ is independent of the unobservable terms. A normal or logistic distribution typically is assumed for the unobservable terms, leading to the probit and logit models, respectively. Estimation of these models is straightforward using maximum-likelihood methods; see Quandt (1966), McFadden (1976), Albright et al. (1977), and Hausman and Wise (1978).

However, economic theory rarely (if ever) provides guidance about the form of the distribution of unobservables. Since imposition of incorrect restrictions on this distribution in the models of Eq. (1) or Eq. (2) typically leads to inconsistent estimates of parameters of interest, econometricians have sought estimation methods that do not impose parametric functional forms. Over the past few decades, many 'semiparametric' methods have emerged for estimation of $\beta$ in the model of Eq. (1) that do not require specification of a parametric model for the distribution of $\varepsilon_i$; see, e.g., Cosslett (1983), Manski (1985), Ruud (1986), Stoker (1986), Han (1987), Ichimura (1993), Klein and Spady (1993) and Thompson (1989). Since $\beta$ is constant in the model of Eq. (1), these methods do not allow for the range of heterogeneity considered here.¹

¹ Some of these methods allow for limited dependence between $x_i$ and the unobservables.

Alternatively, nonparametric methods permit estimation of the distribution of $y_i$ conditional on $x_i$ without imposing any structure, apart from some smoothness. For example, the function $p_0(x) = \Pr(y_i = 1 \mid x_i = x)$ can be estimated using kernel or local linear regression methods. However, since no structure is imposed by these methods, they fail to recover the distribution $F_0$, which is unidentified in the absence of further restrictions.

The methods described here permit estimation of $F_0$ without imposing finite-dimensional functional form restrictions. This has several potential advantages over a conventional nonparametric approach. First, our assumptions impose global restrictions on $p_0$, which means that observations with $x_i$ far from $x$ may be quite informative about $p_0(x)$. Since this structure constrains our estimates, we conjecture that our method produces improved statistical efficiency (relative to less structured methods) when estimating $p_0(x)$ at points where the regressor distribution is sparse. This conjecture is supported by the simulation results in Section 5.

Second, when $\beta_i$ is invariant over repeated decisions of individual $i$, considerably more inferences are available from knowledge of $F_0$ than are available from knowledge of $p_0$ alone. For example, suppose that the decision of individual $i$ will be repeated and that we are concerned with forecasting changes in
observed behavior. Suppose further that $\beta_i$ is determined by tastes specific to individual $i$, but invariant over repeated decisions made by individual $i$. Then for any hypothesized realization $x$ of the remaining determinants of choice, the associated decision of individual $i$ will be $y_i(x) = 1\{x'\beta_i \ge 0\}$. Under these conditions, knowledge of $F_0$ permits calculation of $\Pr\{y_i(b) = 1 \mid y_i(a) = 0\}$, the proportion of those choosing 0 at $x = a$ who would switch their choice to 1 when $x$ is changed from $a$ to $b$. In contrast, knowledge of $p_0$ identifies both of the probabilities $\Pr\{y_i(a) = 0\}$ and $\Pr\{y_i(b) = 1\}$, but reveals nothing about the joint distribution of $(y_i(a), y_i(b))$, absent further restrictions.

More generally, knowledge of $F_0$ permits inferences about the joint distribution of $(y_i(a_1), \ldots, y_i(a_m))$ for any set $\{a_1, \ldots, a_m\}$ of candidate values for $x$. This motivates our interest in estimating $F_0$.

2. Model identification

We assume that $x_i$ includes a constant term: $x_i' = (1, \tilde{x}_i')$, and that the cumulative distribution function of $\beta_i$ is $F_0 \in \mathcal{F}$, where $\mathcal{F}$ is some space of distribution functions on $\mathbb{R}^K$. Let $H(x) = \{\beta \mid x'\beta \ge 0\}$. Given that $x_i$ and $\beta_i$ are independent, the model of Eq. (2) requires that $\Pr(y_i = 1 \mid x_i = x) \equiv p(x, F_0)$, where we write
$$p(x, F) \equiv \int_{H(x)} dF \quad \text{for any } F \in \mathcal{F}. \tag{3}$$
The distribution $F_0$ is identified relative to $\mathcal{F}$ if and only if for each $F \in \mathcal{F}$,
$$\Pr\{p(x_i, F) = p(x_i, F_0)\} = 1 \ \text{implies} \ F = F_0. \tag{4}$$
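To make Eq. (3) concrete, the following sketch evaluates $p(x, F)$ for a discrete $F$ and compares it with the frequency of simulated choices from Eq. (2). The support points, weights, evaluation point `x`, and function name are hypothetical values chosen for illustration; none of them comes from the paper.

```python
import numpy as np

def choice_prob(x, support, weights):
    """p(x, F) for a discrete F: total mass on support points b with x'b >= 0."""
    in_half_space = (support @ x >= 0).astype(float)
    return float(weights @ in_half_space)

# Hypothetical discrete distribution on the unit circle (K = 2).
support = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
weights = np.array([0.5, 0.3, 0.2])
x = np.array([1.0, -0.5])  # the first entry plays the role of the constant term

# Simulate y_i = 1{x'beta_i >= 0} with beta_i drawn from F, holding x fixed.
rng = np.random.default_rng(0)
beta = support[rng.choice(len(weights), size=100_000, p=weights)]
y = (beta @ x >= 0).astype(int)

p_exact = choice_prob(x, support, weights)
p_simulated = y.mean()
```

With these illustrative values only the first support point lies in $H(x)$, so `p_exact` equals its weight, and `p_simulated` should be close to it up to sampling error.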

When $\Pr\{\|\beta_i\| = 0\} = 0$, $y_i = 1\{x_i'\beta_i \ge 0\} = 1\{x_i'\beta_i/\|\beta_i\| \ge 0\}$. Thus with little loss of generality, we may restrict $\mathcal{F}$ to the class of distributions on the unit hypersphere in $\mathbb{R}^K$ centered at the origin. The following example shows that additional restrictions are needed to identify $F_0$. Let $B = \{\beta \mid \|\beta\| = 1\}$ and let $\mu$ be the uniform distribution on $B$.

Example. Suppose that $\mathcal{F}$ includes all distributions that are absolutely continuous with respect to $\mu$ and that $f_0$ is the density of $F_0$ with respect to $\mu$. If $f_0 \ge \delta > 0$ everywhere on $B$ then $F_0$ is not identified.

A proof is straightforward: Pick an arbitrary point $x_0 \ne 0$ and any function $h$ on the half-hypersphere $H(x_0) \cap B$ such that $0 \le |h| \le \delta$ and $\int_{H(x_0)} h \, d\mu = 0$. Obviously there are many such functions. Now define $h$ on $B \setminus H(x_0)$ by requiring $h(-\beta) = h(\beta)$. The symmetry ensures that $\int_{H(x)} h \, d\mu = 0$ for any $x$. Let $f = f_0 + h$
be a new density function on $B$ with corresponding distribution function $F$. Then
$$p(x, F) - p(x, F_0) = \int_{H(x)} h \, d\mu = 0. \tag{5}$$
Since $F \in \mathcal{F}$, but $F \ne F_0$, $F_0$ is not identified.

A trivial example occurs in two dimensions when $f_0 = 1/(2\pi)$ on $B$, while $f = 1/\pi$ on the portions of $B$ that intersect the first and third quadrants, with $f = 0$ elsewhere. Then $p(x, F) = p(x, F_0) = 1/2$ for all $x$.

In these examples, the function $h$ is necessarily negative somewhere in any given half of $B$, unless $h = 0$ everywhere. The requirement $f_0 \ge \delta > 0$ is used to ensure that $f_0 + h$ is everywhere nonnegative nevertheless. If instead $f_0$ is zero on some half of $B$, then $f_0 + h$ must be negative somewhere, and therefore is not a density. This suggests that $F_0$ can be identified when its support is contained in some half of $B$. The following theorem shows that this is true more generally.

Theorem 1. If the following four assumptions hold then $F_0$ is identified relative to $\mathcal{F}$.
(i) The model of Eq. (2) holds with $x_i = (1, \tilde{x}_i')'$.
(ii) The random vectors $\tilde{x}_i$ and $\beta_i$ are independent.
(iii) $\mathcal{F}$ is a set of distribution functions on $B$. There exists $c \in B$ such that $\Pr\{c'\beta_i > 0\} = 1$.
(iv) $\Pr\{\tilde{x}_i \in E\} > 0$ for each open $E \subset \mathbb{R}^{K-1}$.

The first assumption restricts us to linear random coefficients models. The proof uses the Cramér-Wold device to reconstruct the joint distribution of $\beta_i$ from these linear combinations. To the extent that the Cramér-Wold device can be extended to the nonlinear case our result can be extended to nonlinear random coefficients models.

The second assumption is fairly stringent, but commonly used in applications, including applications of the parametric random coefficients model. The joint distribution of $x_i$ and $y_i$ is unrestricted if $x_i$ and $\beta_i$ are arbitrarily dependent. The assumption can, in principle, be relaxed by explicitly modelling the dependence.

The third assumption restricts attention to distributions on $B$. As we have seen, this is not restrictive when $\Pr\{\|\beta_i\| = 0\} = 0$. The second part of the assumption is restrictive since it implies that $p(c, F_0) = 1$. Without this assumption $F_0$ may not be identified, as in the previous example. The assumption is satisfied whenever it is known that one of the random coefficients is strictly positive. However, the theorem does not demand that $c$ is known in advance.
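The two-dimensional non-identification example (a uniform density versus a density concentrated on the first and third quadrants) can be checked numerically. The sketch below integrates each density, taken with respect to arc length, over the half-circle $\{\beta \in B \mid x'\beta \ge 0\}$ on a midpoint grid. The grid size, random directions, and function names are illustrative assumptions.

```python
import numpy as np

def half_circle_prob(x, density, n_grid=200_000):
    """Approximate p(x, F): integrate the density over {beta in B : x'beta >= 0}."""
    step = 2 * np.pi / n_grid
    theta = (np.arange(n_grid) + 0.5) * step            # midpoints on [0, 2*pi)
    beta = np.column_stack([np.cos(theta), np.sin(theta)])
    inside = beta @ x >= 0
    return float(np.sum(density(theta)[inside]) * step)

def f0(theta):
    """Uniform density 1/(2*pi) on the whole circle."""
    return np.full_like(theta, 1 / (2 * np.pi))

def f_alt(theta):
    """Density 1/pi on the first and third quadrants, zero elsewhere."""
    quadrant = np.floor(theta / (np.pi / 2)).astype(int)  # 0, 1, 2, 3
    return np.where(quadrant % 2 == 0, 1 / np.pi, 0.0)

rng = np.random.default_rng(1)
directions = rng.standard_normal((5, 2))
probs = [(half_circle_prob(x, f0), half_circle_prob(x, f_alt)) for x in directions]
```

For every direction tried, both entries of each pair in `probs` should be close to 1/2, so the choice probabilities alone cannot separate the two distributions.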

The fourth assumption requires that the distribution of $\tilde{x}_i$ has an absolutely continuous component with everywhere nonzero density. This assumption, together with the first assumption, allows us to obtain all of the linear combinations needed to reconstruct the joint distribution, using the Cramér-Wold device. The assumption rules out discrete or bounded random variables from $\tilde{x}_i$, and also prevents $\tilde{x}_i$ from including terms that are functionally related (such as interaction terms together with main effects). Perhaps this condition can be weakened. Identification of $F_0$ under weaker conditions on $x_i$ will most likely require stronger restrictions on $\mathcal{F}$.

3. Maximum likelihood estimation

We now define the maximum likelihood estimator of $F_0$ and prove that it exists by describing the geometry of the maximization problem. The conditional average log likelihood function given a sample of $N$ independent and identically distributed (i.i.d.) observations is
$$L(F) = \frac{1}{N} \sum_{i=1}^{N} \big\{ y_i \log[p(x_i, F)] + (1 - y_i) \log[1 - p(x_i, F)] \big\}. \tag{6}$$

We set $L(F) = -\infty$ when $y_i = 0$ and $p(x_i, F) = 1$, or $y_i = 1$ and $p(x_i, F) = 0$ for any $i$. We also adopt the convention that $0 \log(0) = 0$. The i.i.d. sampling assumption is discussed further in Section 4.

The maximum likelihood estimator $\hat{F}_N$ is defined to be any (measurable) solution to the equation
$$L(\hat{F}_N) = \max_{F \in \mathcal{F}} L(F). \tag{7}$$

Since $\mathcal{F}$ is generally an infinite-dimensional space, it is perhaps not immediately clear that any solutions to Eq. (7) exist. In the remainder of this section we characterize the geometry of the problem. Using this characterization, we show that a solution always exists that is a discrete distribution with at most $N$ points of support.

Let $B$ be a subset of $\mathbb{R}^K$ known to contain the support of every $F \in \mathcal{F}$. Each datum, $i$, restricts $\beta_i$ to lie in the region $G_i \subset B$, where $G_i = H(x_i) \cap B$ if $y_i = 1$ and $G_i = B \setminus H(x_i)$ otherwise. Moreover, $H(x_i)$ and $B \setminus H(x_i)$, for $i \le N$, partition $B$ into a finite collection of sets $\mathcal{A}_N = \{A_1, \ldots, A_M\}$.

Clearly $G_i$ is a union of sets in $\mathcal{A}_N$. Let $c_i$ be the $M \times 1$ vector of zeros and ones that identifies the sets of $\mathcal{A}_N$ that constitute $G_i$. That is, let
$$c_i = (1\{A_1 \subset G_i\}, \ldots, 1\{A_M \subset G_i\})'.$$
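As an illustration of the partition and the indicator vectors $c_i$, the sketch below treats the simplest case, assumed here for clarity: $K = 2$ with $B$ the unit circle. Each line $x_i'\beta = 0$ cuts the circle at two antipodal points, so $N$ lines in general position give $M = 2N$ arcs. The function name `build_cells`, the midpoint representatives `t`, and the simulated data are hypothetical.

```python
import numpy as np

def build_cells(x, y):
    """Return (C, t): C[i, j] = 1{A_j subset of G_i}, t[j] = a point inside arc A_j."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    # Each line x_i'b = 0 meets the unit circle at two antipodal angles.
    normal = np.arctan2(x[:, 1], x[:, 0])
    cuts = np.sort(np.concatenate([normal + np.pi / 2, normal - np.pi / 2]) % (2 * np.pi))
    # Arc A_j runs from cuts[j] to the next cut (wrapping around); use its midpoint.
    next_cut = np.append(cuts[1:], cuts[0] + 2 * np.pi)
    mid = (cuts + next_cut) / 2
    t = np.column_stack([np.cos(mid), np.sin(mid)])
    in_H = x @ t.T >= 0                       # N x M: does A_j lie inside H(x_i)?
    C = (in_H == (y[:, None] == 1)).astype(int)
    return C, t

rng = np.random.default_rng(2)
N = 6
x = rng.standard_normal((N, 2))
beta = rng.standard_normal((N, 2))            # only the direction of beta matters
y = (np.sum(x * beta, axis=1) >= 0).astype(int)
C, t = build_cells(x, y)                      # row i of C is c_i'
```

In this planar case every row of `C` contains exactly $N$ ones, because each half-circle covers $N$ of the $2N$ arcs, and no two columns coincide.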

Also, for each $F \in \mathcal{F}$, let $q(F)$ be the $M \times 1$ vector of probabilities that $F$ assigns to the sets of $\mathcal{A}_N$:
$$q_j(F) = \int_{A_j} dF. \tag{8}$$
Obviously $q(F)$ lies in $S_{M-1}$, the unit simplex in $\mathbb{R}^M$. Now we can restate the problem of Eq. (7) as
$$L(\hat{F}_N) = \max_{q \in S_{M-1}} \frac{1}{N} \sum_{i=1}^{N} \log c_i'q. \tag{9}$$
The maximand on the right-hand side of Eq. (9) is concave and $S_{M-1}$ is convex. There must exist at least one solution to the maximization problem, which can be found using standard algorithms for concave programming subject to linear inequality constraints. Fletcher (1987) describes various algorithms for computing a solution.

The preceding discussion suggests a method for constructing a particular solution to Eq. (7). Let $\hat{q}$ be any solution to Eq. (9), and let $t_j$ denote any point chosen arbitrarily from $A_j$, $j \le M$. Finally let $\hat{F}_N(t) = \sum_{j=1}^{M} \hat{q}_j 1\{t_j \le t\}$. Substitution of $\hat{F}_N(t)$ into Eq. (9) will verify that this is a solution since we have $q(\hat{F}_N) = \hat{q}$. Clearly this solution is not unique.

3.1. Computational issues

Although the existence of a solution is shown, the computation of a solution is still demanding. In fact the dimension of each vector $q(F)$ is $M$, where $M = O(N^{K-1})$.² We consider two ways to reduce the computational burden. First, we show that only $N$-dimensional subproblems need to be considered. Second, we show a method for determining, a priori, a large subset of the $q_j$'s that must be zero at any solution.

² This bound can be obtained through application of Theorem II.16 and Lemma II.18 in Pollard (1984). A precise bound is given by $M \le \sum_{k=1}^{K} \binom{N}{k-1}$. This bound is attained if every set of $x_i$ vectors taken $K$ at a time is linearly independent.

Let $C$ be the $N \times M$ matrix of zeros and ones obtained by stacking the row vectors $c_i'$, and let $h_j$ be the $j$th column of $C$. Then Eq. (9) can be rewritten
$$L(\hat{F}_N) = \max_{g \in G} \frac{1}{N} \sum_{i=1}^{N} \log g_i, \tag{10}$$
where $G$ is the image of $S_{M-1}$ under the linear transformation $C$.
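For small problems, the concave program in Eq. (9) can be explored with a few lines of code. The sketch below does not use a general concave-programming routine of the kind cited above. It substitutes the standard EM-type multiplicative update for mixture weights, which keeps $q$ on the simplex and never decreases the objective. The function names and the toy matrix are hypothetical, and every row of `C` is assumed to contain at least one 1.

```python
import numpy as np

def solve_weights(C, max_iter=10_000, tol=1e-12):
    """Maximize (1/N) * sum_i log(c_i'q) over the unit simplex by EM-type updates."""
    C = np.asarray(C, dtype=float)
    N, M = C.shape
    q = np.full(M, 1.0 / M)                  # interior start, so every c_i'q > 0
    for _ in range(max_iter):
        g = C @ q                            # g_i = c_i'q
        q_new = q * (C.T @ (1.0 / g)) / N    # update sums to one by construction
        converged = np.max(np.abs(q_new - q)) < tol
        q = q_new
        if converged:
            break
    return q

def avg_loglik(C, q):
    """Objective of Eq. (9) at q."""
    return float(np.mean(np.log(np.asarray(C, dtype=float) @ q)))

# Toy problem: two observations are predicted by cell 1 only, one by cell 2 only.
C_toy = np.array([[1, 0],
                  [1, 0],
                  [0, 1]])
q_hat = solve_weights(C_toy)
```

In the toy problem the maximizer puts weight 2/3 on the first cell and 1/3 on the second, matching the sample frequencies. The update can converge slowly when the solution has many zero weights, so it is meant for illustration rather than for problems of realistic size.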

The dimension of $G$ is $N$, which suggests that a solution to our problem exists that involves at most $N$ columns of $C$. The following theorem establishes this result:³

Theorem 2. A solution $\hat{q}$ to Eq. (9) exists with at most $N$ non-zero weights on $\{h_1, \ldots, h_M\}$.

³ The theorem can be viewed as a particular case of a result due to Lindsay (1983).

The theorem does not tell us which $N$ elements of $\{h_1, \ldots, h_M\}$ receive non-zero weights. However the result is extremely useful in computing a solution because it implies that we can restrict attention to the $N$-dimensional subproblems. There is no need to allocate storage for a Hessian matrix bigger than $N \times N$, for example.

Our second result provides a way to find columns of $C$ with zero weights at any solution. Say that $A_j$ correctly predicts observation $i$ if $A_j \subset G_i$ and that $h_k$ dominates $h_j$ if $h_j < h_k$, which we take to mean that $h_j \le h_k$ (coordinate-by-coordinate) and $h_j \ne h_k$. Note that we cannot have $h_j = h_k$ except when $j = k$.⁴ Since the $i$th coordinate of $h_j$ is a binary indicator for whether or not $A_j$ correctly predicts observation $i$, $h_k$ dominates $h_j$ if and only if the observations correctly predicted by $A_j$ form a proper subset of those correctly predicted by $A_k$. When this happens we should not place any positive probability on $A_j$ because we can make the likelihood strictly larger by shifting the probability to $A_k$. Thus:

Theorem 3. Suppose that two columns of $C$ satisfy $h_j < h_k$. Then $\hat{q}_j = 0$ for every solution $\hat{q}$ to Eq. (9).

⁴ If $j \ne k$ then the distinct sets $A_j$ and $A_k$ are separated by at least one of the hyperplanes $\{\beta \in \mathbb{R}^K \mid x_i'\beta = 0\}$ for some $i \le N$. So exactly one of the sets $A_j$ or $A_k$ correctly predicts observation $i$. Thus $h_j$ and $h_k$ must differ in (at least) their $i$th coordinate.

So we can eliminate the dominated columns of $C$ at the outset. At first glance this appears to require $O(M^2)$ comparisons, since each column must be compared with every other. However this bound ignores the geometry of the problem. Theorem 4 implies a tighter bound for most problems:

Theorem 4. Suppose that a column $h_j$ of $C$ is dominated by another column $h_k$ and that $B$ is convex. Then $h_j$ is dominated by a column $h_r$ such that the boundaries of the sets $A_j$ and $A_r$ meet.

Thus generally⁵ a column $h_j$ is dominated if and only if it is dominated by a column corresponding to one of the sets immediately adjacent to $A_j$. It follows that one can learn whether a column $h_j$ is dominated by comparing it to only those columns whose corresponding sets in $\mathcal{A}_N$ are the ones that 'touch' $A_j$. Thus 'only' $O(M) = O(N^{K-1})$ comparisons are needed to identify the dominated columns. Our experiences to date suggest that 95% or more of the columns of $C$ are dominated and can be eliminated a priori.

⁵ To apply the theorem in the case where $B$ is the unit hypersphere, one must temporarily redefine $B$ to be the unit ball in $\mathbb{R}^K$, so that it is convex. This does not affect $M$ or $C$, or the set of pairs $(j, r)$ for which the boundaries of $A_j$ and $A_r$ meet.

4. Consistency of the ML estimator

In order to prove consistency of the maximum likelihood estimator of $F_0$ we will adopt a topology for $\mathcal{F}$ and make certain assumptions. In this section we equate $\mathcal{F}$ with the corresponding set of probability measures and endow it with the weak topology. This may be metrized by any number of functions. Let $d_{\mathcal{F}}$ be any such metric. Consistency of $\hat{F}_N$ under this topology implies consistency for any functional on $\mathcal{F}$ that is continuous at $F_0$ with respect to the weak topology. For example, consistency of $\hat{F}_N$ implies that the predicted probability $p(x, \hat{F}_N)$ is a consistent estimator of the true probability $p(x, F_0)$ if $F_0$ is absolutely continuous. More generally, for any set $C \subset \mathbb{R}^K$, we may consistently estimate $\Pr\{\beta_i \in C\}$ using the probability assigned to $C$ by $\hat{F}_N$, provided $\beta_i$ lies on the boundary of $C$ with probability zero. We assume

Assumption 1. $(y_i, x_i)$, $i = 1, 2, 3, \ldots$ are i.i.d.

The assumption can be relaxed partially, as described in the proof of Theorem 5. Our proof requires that $\mathcal{F}$ is compact under the chosen topology, and that the continuous distributions are dense in $\mathcal{F}$. This follows immediately if $\mathcal{F}$ is the set of all distributions on a suitably chosen space. The following assumption will suffice.⁶ Given the need for a scale normalization, one natural choice for $B$ is the unit hypersphere in $\mathbb{R}^K$.

Assumption 2. $\mathcal{F}$ is the set of all distributions on a compact set $B \subset \mathbb{R}^K$.

⁶ See Parthasarathy (1967), Theorems II.6.2 and 6.4.

For consistency we need identification of $F_0$ with respect to $\mathcal{F}$:

Assumption 3. $\Pr\{p(x_i, F) = p(x_i, F_0)\} = 1$ implies $F = F_0$ for each $F \in \mathcal{F}$.

Finally, we require that the expectation of the log-likelihood is finite, at least at the true distribution, so that we may invoke a law of large numbers. The log likelihood is nonpositive, since $0 \le p(x_i, F_0) \le 1$. So we assume

Assumption 4. $E\{y_i \log[p(x_i, F_0)] + (1 - y_i) \log[1 - p(x_i, F_0)]\} \ne -\infty$.

Consistency of our estimator follows from these four assumptions:

Theorem 5. Under Assumptions 1-4, for any $\varepsilon > 0$,
$$\lim_{N \to \infty} \Pr\{d_{\mathcal{F}}(\hat{F}_N, F_0) > \varepsilon\} = 0. \tag{11}$$

5. Computational examples and Monte Carlo evidence

We have already discussed some computational aspects of our estimator. Here we elaborate further and present some example calculations. We also summarize results from some Monte Carlo experiments investigating the performance of our estimator and of parametric and nonparametric alternatives in some small sample problems.

5.1. Computational algorithms

We have written software that implements our estimator for certain specialized cases, described below. Our program first solves the geometry problem described in Section 3 and eliminates the parameters corresponding to dominated columns of $C$ as described in that section. This process eliminates roughly 95% of all columns.

Eq. (9) is solved using the BFGS concave maximization algorithm with exact line searches. The inequality constraints defining $S_{M-1}$ are handled using an active sets method. These techniques are discussed extensively in Fletcher (1987). Since the constraint set is convex and the objective function concave, the algorithm converges monotonically to a solution with at most finitely many changes in the active set of constraints, and with no cycling possible.

We choose the starting value for $q$ as follows. First, let $w_j$ be the number of coordinates of $h_j$ equal to one. This is the number of observations $i$ that are correctly predicted by the region $A_j$. We sort the regions $A_j$ into decreasing order by $w_j$. Second, we select regions $A_j$ from this list, in order, if they satisfy the following criterion: each newly selected region must correctly predict at least one observation $i$ that is not correctly predicted by any of the regions previously selected. Selection terminates when a set is obtained such that every observation $i \le N$ is correctly predicted by at least one of the selected regions. Finally, the initial $q$ vector is obtained by weighting the columns $h_j$ associated with the selected regions proportionally to the corresponding value of $w_j$, with zero weight assigned to the other coordinates of $q$.
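The two preprocessing steps described above (removing dominated columns, then choosing a starting value greedily) can be sketched as follows. This is an illustration under stated assumptions, not the authors' program: the dominance check is the brute-force $O(M^2)$ comparison rather than the adjacency-based search suggested by Theorem 4, and the function names and toy matrix are hypothetical.

```python
import numpy as np

def undominated_columns(C):
    """Indices j such that no other column k has h_j <= h_k with h_j != h_k."""
    C = np.asarray(C)
    M = C.shape[1]
    keep = []
    for j in range(M):
        dominated = any(
            k != j and np.all(C[:, j] <= C[:, k]) and np.any(C[:, j] < C[:, k])
            for k in range(M)
        )
        if not dominated:
            keep.append(j)
    return np.array(keep)

def starting_value(C):
    """Greedy starting weights: cover every observation, weight by w_j."""
    C = np.asarray(C)
    N, M = C.shape
    w = C.sum(axis=0)                         # observations predicted by each region
    order = np.argsort(-w, kind="stable")     # decreasing w_j
    covered = np.zeros(N, dtype=bool)
    selected = []
    for j in order:
        predicts = C[:, j] == 1
        if np.any(predicts & ~covered):       # must predict something new
            selected.append(j)
            covered |= predicts
        if covered.all():
            break
    q = np.zeros(M)
    q[selected] = w[selected]
    return q / q.sum()

# Toy matrix: column 0 is dominated by column 1.
C_toy = np.array([[1, 1, 0],
                  [0, 1, 0],
                  [0, 0, 1]])
keep = undominated_columns(C_toy)
q0 = starting_value(C_toy)
```

In practice one would apply `starting_value` to the reduced matrix `C[:, keep]`. Because the selected columns jointly predict every observation, `C @ q0` has no zero entries and the log-likelihood is finite at the start.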

The method selects at most $N$ undominated columns, but does not guarantee that those are the only ones that receive positive probability at a solution. The method ensures that any strictly convex combination of the selected columns will not have any zero coordinates, which in turn ensures that the log-likelihood is finite at the starting value. Since columns are selected and weighted according to the numbers of observations correctly predicted by their corresponding regions $A_j$, the method reflects a preliminary assessment of the importance of placing probability mass on each region.

The choice of a starting value is very important due to the high dimensionality of Eq. (9). A preliminary version of our program simply started at the vector $q$ whose coordinates were all equal to $1/M$. We found that the maximization algorithm spent enormous amounts of time successively activating constraints until it found a much lower dimensional problem close to the solution. The algorithm for choosing starting values described here produces several orders of magnitude improvement in the performance relative to this first attempt.

The estimation code was written largely in Cray Pascal and was compiled and run on the Cray-2 supercomputer.

5.2. Model specifications

In previous sections we worked with distributions supported on the unit hypersphere in $\mathbb{R}^K$. Here we choose $K = 3$ and, for expositional convenience, adopt a slightly less flexible normalization on $\mathcal{F}$ by requiring the third coordinate of $\beta_i$ to be one. This permits us to describe estimates for the $K = 3$ case in terms of the two-dimensional joint distribution for the remaining coordinates. In the remainder of this section we redefine $\beta_i$ and $x_i$ to mean the $i$th coordinate of the original vectors $\beta_i$ and $x_i$, respectively (so the subscript now indexes coordinates rather than individuals), and let $F_0$ and $\hat{F}_N$ denote the corresponding two-dimensional distribution functions for $(\beta_1, \beta_2)$.

We consider three different specifications for $F_0$, summarized in Table 1. Model 1 takes $F_0$ as the bivariate standard normal distribution function. We consider this model an important test case, since the probit maximum likelihood estimator of the mean and covariance matrix for this model figures prominently in the applied literature. Models 2 and 3 represent two different kinds of departures from normality. Contour plots of their density functions are depicted in Figs. 1 and 2. Model 2 is characterized by a long, curved ridge, with two modes separated by a slightly lower saddle point. The iso-density contours are shaped roughly like a boomerang, leading to a nonlinear dependency between $\beta_1$ and $\beta_2$. Model 3, like Model 1, preserves independence of $\beta_1$ and $\beta_2$, but has a distinct bimodality. We feel that detection of features like those present in Models 2 and 3 provides an important motivation for our new estimator.

Table 1
Simulated distributions for $\beta = (\beta_1, \beta_2)'$

Model 1: $\beta \sim N(\mu, \Sigma)$, $\mu = 0$, $\Sigma = I$

Model 2: $\beta$ an equally weighted mixture of $c_1$ and $c_2$, with
$$c_1 \sim N\left( \begin{bmatrix} \mu \\ -\mu \end{bmatrix}, \begin{bmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{bmatrix} \right), \quad c_2 \sim N\left( \begin{bmatrix} -\mu \\ \mu \end{bmatrix}, \begin{bmatrix} \sigma_2^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_1^2 \end{bmatrix} \right),$$
$\mu = 0.3587$, $\sigma_1^2 = 0.26271$, $\sigma_2^2 = 0.06568$, and $\rho = -0.1$

Model 3: $\beta_1$ and $\beta_2$ independently distributed; $\beta_1 \sim N(0, \sigma^2)$; $\beta_2$ an equally weighted mixture of $c_1$ and $c_2$, with $c_1 \sim N(0.2806, \sigma^2)$ and $c_2 \sim N(-1.6806, \sigma^2)$; $\sigma^2 = 0.038462$

Fig. 1. True density of $F_0$ in Model 2.

Each model assumes i.i.d. sampling of observations, using independent standard normal distributions for $x_2$ and $x_3$. Sample size in our computations is $N = 1000$. The corresponding value of $M$ depends on the random configuration of the $x_i$ vectors. An upper bound is $M \le 500{,}501$. This bound is achieved (with probability one) whenever $\tilde{x}_i$ has a continuous distribution, as in our examples.
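A simulated data set of the kind described here could be generated roughly as follows. The sketch rests on two readings of the text that should be treated as assumptions: that, with the third coefficient normalized to one, the choice rule is $y = 1\{\beta_1 + \beta_2 x_2 + x_3 \ge 0\}$, and that Model 2 is an equal mixture of two bivariate normals with means $(\mu, -\mu)$ and $(-\mu, \mu)$ and mirrored covariance matrices, using the parameter values quoted in Table 1. The function names are hypothetical. The last function counts the regions cut out of the plane by $N$ lines in general position, $1 + N + N(N-1)/2$, which reproduces the stated bound on $M$.

```python
import numpy as np
from math import comb

def draw_beta_model2(n, rng, mu=0.3587, s1=0.26271, s2=0.06568, rho=-0.1):
    """Draw (beta_1, beta_2) from the assumed two-component normal mixture."""
    off = rho * np.sqrt(s1 * s2)
    comp1 = rng.multivariate_normal([mu, -mu], [[s1, off], [off, s2]], size=n)
    comp2 = rng.multivariate_normal([-mu, mu], [[s2, off], [off, s1]], size=n)
    use_first = rng.random(n) < 0.5           # equal mixing weights
    return np.where(use_first[:, None], comp1, comp2)

def simulate_sample(n, rng):
    """Simulate (y, x2, x3) with the third coefficient normalized to one."""
    beta = draw_beta_model2(n, rng)
    x2 = rng.standard_normal(n)
    x3 = rng.standard_normal(n)
    y = (beta[:, 0] + beta[:, 1] * x2 + x3 >= 0).astype(int)
    return y, x2, x3

def max_cells(n, k=3):
    """Regions cut out of R^(k-1) by n hyperplanes in general position."""
    return sum(comb(n, j) for j in range(k))

rng = np.random.default_rng(3)
y, x2, x3 = simulate_sample(1000, rng)
upper_bound = max_cells(1000)
```

With `n = 1000` and `k = 3`, `max_cells` returns 500501, matching the bound on $M$ quoted in the text.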

Fig. 2. True density of $F_0$ in Model 3.

5.3. Example calculations

Here we present some examples of estimates obtained from our new procedure. For comparison purposes, we also calculated the parametric maximum likelihood estimates corresponding to Model 1 with unknown parameters $\mu$ and $\Sigma$. The probit ML estimator was calculated by choosing starting values of $\mu = 0$ and $\Sigma = I$ (the true values for Model 1). We used commercial software to maximize the random coefficients probit likelihood function using analytical first and second derivatives. The estimated value of $\Sigma$ was constrained slightly away from its region of singularity to avoid numerical instability of these calculations.

The flavor of the estimates is revealed in Figs. 3-5. These figures give estimates obtained from three data sets, each with 1000 observations simulated from one of the models of Table 1. Panel (a) of each figure displays a pseudo-density for each nonparametric ML estimate, obtained by convolving the (discrete) estimated distribution with a small amount of continuously distributed noise. Panel (b) of each figure displays the corresponding probit density estimate. Although the asymptotic theory presented in the last section does not support any formal interpretation for the nonparametric density estimates, we feel that these views of the estimates capture their important features quite well.

Examining Fig. 3, one finds (as expected) that the parametric ML estimate is closer to the true distribution than is the nonparametric estimate in the sample
obtained from Model 1. After all, the parametric ML estimate is efficient for this model. While the nonparametric estimate has too much fine variation, it does pick up the gross features of the bivariate standard normal distribution. The distribution estimated by our nonparametric procedure is roughly unimodal and fairly symmetric.

Fig. 3. Example estimates from Model 1.

Fig. 4. Example estimates from Model 2. (a) Nonparametric MLE. (b) Probit MLE.

The probit estimate has relatively less advantage when applied to data simulated from the other two models. See Figs. 4 and 5. The probit likelihood
H. Ichimura, ¹.S. ¹hompson /Journal of Econometrics 86 (1998) 269—295

283

Fig. 5. Example estimates from Model 3. (a) Nonparametric MLE. (b) Probit MLE.

does not allow any adaptation to the boomerang shape of Model 2 or to the bimodality of Model 3. In contrast, the nonparametric estimate does pick up these features to some extent. Again, the nonparametric estimate displays a great deal of roughness at finer scales. 5.4. Monte Carlo evidence The remainder of this section summarizes results from a Monte Carlo experiment designed to evaluate relative performances of three estimators in estimating p(x, F ): our nonparametric maximum likelihood estimator (NPMLE), the 0 random coefficients probit maximum likelihood estimator (MLE), and the local linear regression estimator (LLR) described in Fan (1993). For each of the models in Table 1 we generated 500 simulated data sets of 1000 observations each, and used the three estimators to estimate p(x, F ) at 0 a set of points along the line where x "x . Figs. 6—8 summarize the Monte 2 3 Carlo distributions of the estimators at these points. Each figure displays the estimated root-mean-squared error (RMSE) and absolute bias of the three estimators for one of the models.7 The local linear regressions were computed using the kernel K(u)" (1!EuE2)2 with bandwidth of 2, which is close to the optimal value for this

7 A more comprehensive set of Monte Carlo distributions for the NPMLE and MLE estimators is tabulated in Ichimura and Thompson (1993). The characterization given here for these estimators is representative of those results.


Fig. 6. Monte Carlo sampling properties for estimation of $p(x, F_0)$ in Model 1, along the line where $x_2 = x_3$. There are two lines of the same type for each estimator. The upper line in each pair displays the RMSE. The lower line displays the absolute bias.

kernel in Model 1 (see Ruppert and Wand, 1994, Theorem 2.1). With $K = 3$, the LLR estimator is well defined at the point $x$ only if there are at least three data points within a radius of $x$ equal to the bandwidth. To ensure that this condition always held, we occasionally adjusted the bandwidth upwards to equal the distance to the fourth nearest neighbor of $x$. The simulation results are not especially sensitive to the bandwidth within a fairly wide range (see footnote 8).

Not surprisingly, Fig. 6 shows that the parametric MLE is the best estimator for Model 1, where it is known to be asymptotically efficient. The NPMLE performs respectably on this model too. The RMSE of this estimator is uniform across the range of $x$ values. The LLR performs erratically, outperforming the NPMLE at the origin, where $p(x, F_0)$ is locally linear, but doing substantially worse in the tails of the regressor distribution. All three estimators are essentially unbiased at the origin, a consequence of the symmetric structure of this model.

To our surprise, the parametric MLE continues to perform well in Model 2, as shown by Fig. 7, although the relative performance of the NPMLE is certainly better here than in Model 1. The MLE has a low variance, but performance suffers near the origin where a large bias increases its RMSE. The NPMLE

8 At each point x chosen for evaluation, we examined six bandwidths spaced evenly on the interval [0.5, 3.0]. The RMSE of the LLR estimator varied little over this range, although the relative proportions of bias and variance did vary moderately, as expected.


Fig. 7. Monte Carlo sampling properties for estimation of $p(x, F_0)$ in Model 2, along the line where $x_2 = x_3$. There are two lines of the same type for each estimator. The upper line in each pair displays the RMSE. The lower line displays the absolute bias.

performs almost as well as in Model 1. It displays reasonably low RMSE at all points, and dominates the parametric MLE near the origin due to the large bias of the MLE in that neighborhood. The LLR is erratic. Its RMSE is off the scale of the figure, except at the largest values of $x$. The LLR displays considerable bias at most points.

The NPMLE is the preferred estimator for Model 3. Fig. 8 shows that NPMLE has the lowest RMSE, except in a small neighborhood of the origin, where LLR is best, and in the remainder of a slightly larger neighborhood of the origin, where MLE is best. Although the LLR is best very close to the origin, its RMSE increases off the scale of the figure at points a little further away. The LLR is highly biased away from the origin. As in Model 1, all three estimators are unbiased at the origin due to symmetry. The parametric MLE and NPMLE both show increased variance near the origin.

These results warrant several general conclusions. First, the NPMLE performs respectably and uniformly over the entire range of values for $x$ and over the different model specifications. While it tends to have a greater variance than the MLE, it rarely has a substantial bias. The performance of the NPMLE does not appear to depend on the local density of the regressors.

Second, the parametric MLE expectedly performs best when the parametric model is correctly specified, as in Model 1, and least well when deviations from this model are most extreme. Due to its low variance, it performs well whenever


Fig. 8. Monte Carlo sampling properties for estimation of $p(x, F_0)$ in Model 3, along the line where $x_2 = x_3$. There are two lines of the same type for each estimator. The upper line in each pair displays the RMSE. The lower line displays the absolute bias.

it has a low bias. This occurs at some isolated points, even when the parametric model is misspecified. However, performance of the parametric MLE breaks down wherever misspecification induces a large bias.

Third, the local linear regression estimator has a low variance where the density of the regressor distribution is high (so that there are many local data points), and a low bias at points where $p(x, F_0)$ is approximately linear (so that a linear regression fits well). Both conditions are satisfied near the origin in Models 1 and 3. However, LLR performance suffers when one or both conditions are not met. The performance of this estimator is especially poor in the tails of the regressor distribution. Of course, the estimator also suffers when the bandwidth is chosen poorly.

In contrast, the global shape restrictions implied by the choice model cause the NPMLE estimate of $p(x, F_0)$ to be determined by all of the data points, not just by the ones that are close to $x$. So the NPMLE estimates are far less sensitive to the local features of the regressor distribution than are nonparametric estimates based on local averaging. The advantage of the NPMLE over other nonparametric estimators is greatest in the tails of the regressor distribution and where there is substantial nonlinearity in $p(x, F_0)$.

Table 2 displays some summary statistics on the computational requirements for the nonparametric estimator encountered in our Monte Carlo simulations. Clearly this estimator requires substantial resources to calculate, and would


Table 2
Computational requirements for the nonparametric estimator

                                                  Model 1             Model 2             Model 3
                                              Median  Maximum     Median  Maximum     Median  Maximum
Undominated regions (out of 500,501)          24,230   27,816     13,668   17,674     13,073   16,172
Maximum active parameters during iterations      138      174        102      128         74      109
Active parameters at the solution                 37       54         32       43         24       34
Processor time required (seconds)              108.7    160.3       56.8     84.6       45.8     63.5

Note: These statistics are from 500 Monte Carlo trials on each model.

have been infeasible on most research budgets available to economists until recently. However, the calculations are feasible on the current generation of mainframe computers and advanced workstations.

The remainder of the table demonstrates the effectiveness and importance of our strategies for dimension reduction. Recall that we have $M = 500{,}501$ in each of these simulations. Elimination of dominated regions lowered the dimension of the calculations by roughly 95%. The maximum number of active (i.e. non-zero) parameters is typically encountered at the starting value for each estimation. Our algorithm for choosing starting values kept this number below 200 in every simulation. The final solutions never had more than 54 active parameters, which is roughly 0.01% of $M$, and considerably lower than the worst-case upper bound of $N = 1000$.
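The elimination of dominated regions can be illustrated with a toy sketch. It assumes, as the proofs of Theorems 3 and 4 in the Appendix suggest, that each region is summarized by a 0/1 vector recording which observations it predicts correctly, and that one vector dominates another when it is at least as large in every coordinate and strictly larger in at least one. The function name `undominated_columns` and the example matrix are hypothetical, and the brute-force pairwise comparison is only practical for small problems; it is not the algorithm used to produce Table 2.

```python
import numpy as np

def undominated_columns(C):
    """Return indices of columns of C that no other column dominates.

    Column k dominates column j when C[:, k] >= C[:, j] in every row
    and C[:, k] > C[:, j] in at least one row.
    """
    C = np.asarray(C)
    keep = []
    for j in range(C.shape[1]):
        col_j = C[:, [j]]                      # shape (N, 1), broadcasts against C
        weakly_above = np.all(C >= col_j, axis=0)
        strictly_above = np.any(C > col_j, axis=0)
        if not np.any(weakly_above & strictly_above):
            keep.append(j)
    return keep

# Hypothetical example: rows are 4 observations, columns are 5 regions.
# Entry 1 means the region predicts that observation correctly.
C = np.array([
    [1, 1, 0, 0, 1],
    [1, 0, 1, 0, 1],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1],
])
kept = undominated_columns(C)
print(kept)
```

In this example columns 0 and 1 are dominated by column 4, so only columns 2, 3 and 4 would need a weight in the maximization.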

6. Conclusions

This paper presents a maximum likelihood estimator of a binary choice model with random coefficients. The model does not require parametric specification of the distribution of the coefficients. We discussed identification and proved consistency under suitable conditions. We showed that the estimator does a reasonable job of recovering the unknown coefficient distribution in some small example problems. In our Monte Carlo experiments the new estimator performed well. The nonparametric estimator has superior predictive performance relative to the probit estimator when the true coefficient distribution is significantly non-normal, due to the bias displayed by the probit estimator. Also, unlike the nonparametric regression estimator, the estimator seems insensitive


to the regressor distribution. The new estimator requires substantial computer resources to calculate, but is feasible using current technology.

We have left a number of important questions for future research. First, the asymptotic theory for our estimator is incomplete. We present no rates of convergence for functionals of the estimated distribution, nor do we develop a formal theory for testing parametric against nonparametric specifications for the random coefficients model.

We are aware of one result pertinent to the convergence rate of our estimator. Suppose that $K = 2$ and that we know that $\Pr\{\beta_{2i} > 0\} = 1$. If we normalize scale by treating $F$ as a distribution for $-\beta_{1i}/\beta_{2i}$, then $p((1, \tilde{x})', F) = F(\tilde{x})$. In this case our estimator for $F$ is an isotonic regression estimator; see Robertson et al. (1988), Example 1.5.1. Wright (1981) established that under suitable conditions $\hat{F}_N(\tilde{x}) = F_0(\tilde{x}) + O_p(N^{-\alpha/(2\alpha+1)})$, provided $F_0$ is $\alpha$ times differentiable and $F_0^{(1)}(\tilde{x}) = \cdots = F_0^{(\alpha-1)}(\tilde{x}) = 0 < |F_0^{(\alpha)}(\tilde{x})|$. So the NPMLE converges at rate $N^{-1/3}$ or better under these conditions, provided that $F_0$ has a density at $\tilde{x}$. It remains to be seen if a comparable result holds when $K > 2$.

A rate (or a better rate) of convergence for our estimator may be achievable, while preserving other desirable features of our approach, by imposing some smoothness on $\hat{F}_N$ via a sequence of sieves. For example, one could require $\hat{F}_N$ to be a mixture of smooth parameterized distributions, with the number of distributions in the mixture dependent on sample size. While it is straightforward to modify our consistency proof to accommodate this kind of estimation strategy, we foresee substantial computational difficulties associated with this approach. For example, if the means of the smooth distributions in a mixture are parameterized, then the likelihood function generally is not concave. It remains to be seen if these difficulties can be overcome.
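The isotonic regression special case mentioned above (two regressors, second coefficient known to be positive) can be sketched with the pool-adjacent-violators algorithm. This is a minimal illustration, not the estimation code used in the paper: the function name `isotonic_fit` and the data are hypothetical, and the sketch assumes the values of the scalar regressor are distinct, so no special handling of ties is included.

```python
import numpy as np

def isotonic_fit(x, y):
    """Nondecreasing least-squares fit of y on x by pool-adjacent-violators.

    Returns fitted values in the original order of the observations.
    Assumes the entries of x are distinct.
    """
    order = np.argsort(x, kind="stable")
    y_sorted = np.asarray(y, dtype=float)[order]
    means, counts = [], []          # one entry per pooled block
    for value in y_sorted:
        means.append(float(value))
        counts.append(1)
        # Merge the last two blocks while they violate monotonicity.
        while len(means) > 1 and means[-2] > means[-1]:
            n = counts[-2] + counts[-1]
            pooled = (counts[-2] * means[-2] + counts[-1] * means[-1]) / n
            means[-2:] = [pooled]
            counts[-2:] = [n]
    fitted_sorted = np.repeat(means, counts)
    fitted = np.empty(len(y_sorted))
    fitted[order] = fitted_sorted
    return fitted

# Hypothetical data: binary choices y observed at scalar regressor values x.
x = [0.1, 0.4, 0.2, 0.9, 0.6, 0.7]
y = [0, 0, 1, 1, 1, 0]
F_hat = isotonic_fit(x, y)
print(F_hat)
```

The fitted values are a step function that is nondecreasing in the regressor and lies between 0 and 1 for binary outcomes, which is what an estimate of a distribution function evaluated at the sample points requires.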
Second, we feel that stronger identification results can be achieved when there is stronger a priori information about the coefficient distribution than we make use of here. For example, it seems likely that our assumption on the regressors could be weakened if it were known a priori that certain of the coefficients were fixed rather than random. Pursuit of this extension would yield a synthesis of our estimator, which allows all coefficients to be random, with the estimator proposed by Cosslett (1983), which is a special case of our estimator requiring all coefficients (except for the intercept) to be fixed. Finally, the model and method described here are generalizable to settings involving polytomous choice. Extensions to panel data are also possible in which full independence between the regressors and coefficients can be partially relaxed.

Acknowledgements

An earlier version of this paper was presented at the 1993 Midwestern Econometric Group conference, the 1994 annual meeting of the American Statistical Association, a 1994 conference at Northwestern University sponsored by the National Science Foundation, and the 1994 Latin America meeting of the Econometric Society. We have benefited from communications with A. Ron Gallant, two anonymous referees, and seminar participants in the economics or statistics departments of the following institutions: The University of Minnesota, The University of Chicago, The University of Rochester, North Carolina State University, The University of North Carolina, Queen's University, and the University of British Columbia. We also thank Charles Geyer, John Geweke, Rosa Matzkin and Herman Rubin. We thank the Minnesota Supercomputer Institute for providing computer support. Thompson thanks the National Science Foundation for support provided by grant SES-9110419. The views expressed herein are not purported to reflect those of the US Department of Justice.

Appendix

Proof of Theorem 1. Suppose that $F \in \mathcal{F}$ and that $p(x_i, F) = p(x_i, F_0)$ almost surely. By (i) and (iv), $p(x, F) = p(x, F_0)$ holds for almost every $x \in \mathbb{R}_{++} \times \mathbb{R}^{K-1}$. Furthermore, provided $\Pr\{\beta_i = 0\} = 0$, we must have $\Pr\{x'\beta_i = 0\} = 0$ for almost every $x$, hence $\Pr\{-x'\beta_i \ge 0\} = 1 - \Pr\{x'\beta_i \ge 0\}$ for almost every $x$. By (iii), these conditions hold for $F$ and $F_0$. Thus $p(-x, F) = p(-x, F_0)$ for almost every $x \in \mathbb{R}_{++} \times \mathbb{R}^{K-1}$. Conclude that $p(x, F) = p(x, F_0)$ for almost every $x \in \mathbb{R}^K$.

Let $A$ be an orthonormal matrix (i.e. $A'A = AA' = I$) whose first row is $c'$. Let $\theta_i = (\theta_{i1}, \theta_{i2}')' = A\beta_i$ and let $t = (t_1, t_2')' = Ax$, where $\theta_{i1}$ and $t_1$ are scalar. Let $G$ and $G_0$ denote the distributions for $\theta_i$ when $\beta_i$ has distributions $F$ and $F_0$, respectively. Note that $G, G_0 \in \mathcal{F}$ since $\|\theta_i\| = \|\beta_i\|$, and that $\Pr\{\theta_{i1} > 0\} = 1$ by (iii) and construction of $\theta_i$.

Since $A$ is nonsingular and $x'\beta_i = t'\theta_i$, $p(t, G) = p(t, G_0)$ for almost every $t$. Furthermore, $G = G_0$ if and only if $F = F_0$. Therefore it suffices to establish that $G = G_0$. First, we establish that $\Pr\{\theta_{i1} > 0\} = 1$ when $\theta_i$ has distribution $G$. Rewrite the condition $p(t, G) = p(t, G_0)$ as follows:

$$\int_{\{t'\theta_i \ge 0,\ \theta_{i1} < 0\}} dG + \int_{\{t'\theta_i \ge 0,\ \theta_{i1} = 0\}} dG + \int_{\{t'\theta_i \ge 0,\ \theta_{i1} > 0\}} dG = \int_{\{t'\theta_i \ge 0,\ \theta_{i1} > 0\}} dG_0. \tag{A.1}$$

Fix $t_2$ and consider a sequence of values for $t_1$ diverging to $-\infty$. For almost every value of $t_2$, without loss of generality, we may restrict attention to sequences for which Eq. (A.1) holds. The right side, hence each term on the left side, converges to zero. Since the second term on the left side does not depend on $t_1$, it must equal zero exactly. Thus for almost every value of $t_2$ we have

$$\int_{\{t'\theta_i \ge 0,\ \theta_{i1} > 0\}} dG_0 - \int_{\{t'\theta_i \ge 0,\ \theta_{i1} < 0\}} dG = \int_{\{t'\theta_i \ge 0,\ \theta_{i1} > 0\}} dG \le \int_{\{\theta_{i1} > 0\}} dG. \tag{A.2}$$

Now consider a sequence of values for $t_1$ diverging to $+\infty$, again holding $t_2$ fixed and restricting attention to sequences where Eq. (A.2) holds. The left side of Eq. (A.2) converges to one, and the final term is a probability, so the final term equals one. This establishes that $\Pr\{\theta_{i1} > 0\} = 1$ when $\theta_i$ has distribution $G$.

To complete the proof, rewrite the condition $p(t, G) = p(t, G_0)$ again, this time as

$$\int_{\{t_2'\theta_{i2}/\theta_{i1} \ge -t_1\}} dG = \int_{\{t_2'\theta_{i2}/\theta_{i1} \ge -t_1\}} dG_0. \tag{A.3}$$

Since this holds for almost every $t_2$ and $t_1$, apply the Cramér-Wold device to conclude that the distribution of $\theta_{i2}/\theta_{i1}$ is the same under $G$ and $G_0$. For each possible value of $\theta_{i2}/\theta_{i1}$ there is a unique point $\theta_i$ on $B$ that satisfies $\theta_{i1} > 0$. Hence a distribution for $\theta_{i2}/\theta_{i1}$ together with the condition $\Pr\{\theta_{i1} > 0\} = 1$ induces a unique distribution on $B$. Thus $G = G_0$, hence $F = F_0$. □
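The last step of the proof rests on a one-to-one correspondence between the ratio of the trailing coordinates to the first coordinate and points on the unit sphere whose first coordinate is positive. A small numerical sketch of that correspondence follows; the function names `ratio_to_sphere` and `sphere_to_ratio` are hypothetical, and the coordinates are written `theta` purely for illustration.

```python
import numpy as np

def ratio_to_sphere(r):
    """Map r = theta_2 / theta_1 to the unique unit vector with theta_1 > 0."""
    r = np.atleast_1d(np.asarray(r, dtype=float))
    theta = np.concatenate(([1.0], r))    # any positive first coordinate works
    return theta / np.linalg.norm(theta)  # rescale onto the unit sphere

def sphere_to_ratio(theta):
    """Recover the ratio theta_2 / theta_1 from a unit vector with theta_1 > 0."""
    theta = np.asarray(theta, dtype=float)
    return theta[1:] / theta[0]

ratio = np.array([0.5, -2.0])
theta = ratio_to_sphere(ratio)
recovered = sphere_to_ratio(theta)
print(theta, recovered)
```

Because the two maps invert each other, a distribution for the ratio pins down a distribution on the half of the sphere where the first coordinate is positive, which is the fact the proof uses.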

Proof of Theorem 2. The set $G$ is a convex polytope in $\mathbb{R}^N$ because it is the convex hull of $\{h_1, \ldots, h_M\}$. Furthermore, the maximand in Eq. (10) is strictly concave in $g$. So the solution vector $\hat{g}$ to Eq. (10) is unique, and lies on an exterior face of $G$. Each exterior face of $G$ is in turn the convex hull of a subset of $\{h_1, \ldots, h_M\}$ whose affine dimension is at most $N - 1$. By Carathéodory's Theorem (see Rockafellar, 1970; Theorem 17.1) the vector $\hat{g}$ can be written as a convex combination of $N$ of the vectors in $\{h_1, \ldots, h_M\}$. So there exists a vector $\hat{q} \in S_{M-1}$ with at most $N$ nonzero coordinates satisfying $\hat{g} = C\hat{q}$. □

Proof of Theorem 3. The maximand in Eq. (10) is strictly increasing on $G$. Let $g = Cq$ be a candidate solution to Eq. (10) for some $q \in S_{M-1}$ for which $q_j > 0$. Let $\tilde{q}$ coincide with $q$ except that $\tilde{q}_j = 0$ and $\tilde{q}_k = q_j + q_k$. Let $\tilde{g} = C\tilde{q}$. Then $g < \tilde{g}$ since at least one coordinate of $h_k$ exceeds the corresponding coordinate of $h_j$ and all other coordinates of $h_k$ are no less than the corresponding coordinates of $h_j$. But $\tilde{g} \in G$ since $\tilde{q} \in S_{M-1}$. Conclude that the original $q$ cannot solve Eq. (9). □

Proof of Theorem 4. Let $\beta_j$ and $\beta_k$ be arbitrarily chosen points in $A_j$ and $A_k$ respectively. Let $J = \{i \mid y_i = 1\{x_i'\beta_j \ge 0\}\}$. Let $K = \{i \mid y_i \ne 1\{x_i'\beta_k \ge 0\}\}$. Then $J$ is the set of observations correctly predicted by $A_j$ and $K$ is the set of observations not correctly predicted by $A_k$. Now let

$$B_{jk} = \bigcap_{i \in J} G_i \ \cap \bigcap_{i \in K} (B \setminus G_i). \tag{A.4}$$

Since $h_k$ dominates $h_j$, $A_j$ and $A_k$ are subsets of $B_{jk}$, which is (by construction) a union of some of the sets in $\mathcal{A}_N$. Since $B$ is convex, so is each of the sets $G_i$ and its complement with respect to $B$. Thus $B_{jk}$ is convex. In particular, the line segment connecting $\beta_j$ and $\beta_k$ is in $B_{jk}$. Since $\mathcal{A}_N$ is a partition of $B$, this line segment must intersect a set $A_r \subset B_{jk}$ whose boundary meets the boundary of $A_j$. By construction of $B_{jk}$, $A_r$ correctly predicts all of the observations in $J$, plus at least one more. Thus $h_r$ dominates $h_j$. □

Proof of Theorem 5. Our proof loosely follows the arguments of Wald (1949) as modified by Wolfowitz (1949). We first establish some notation. Let $Z_i = (y_i, x_i)$, $u(Z_i, F) = p(x_i, F)^{y_i}[1 - p(x_i, F)]^{1 - y_i}$, $D(F, \rho) = \{F' \in \mathcal{F} \mid d_{\mathcal{F}}(F, F') < \rho\}$ and $u(Z_i, F, \rho) = \sup_{D(F, \rho)} u(Z_i, \cdot)$. Let $\mathcal{F}_0$ denote all absolutely continuous distributions on $B$. $\mathcal{F}_0$ is a dense subset of $\mathcal{F}$ with respect to the weak topology. See Corollary II.8.1 of Parthasarathy (1967). For any $\delta > 0$, let $\mathcal{F}(\delta) = \{F \in \mathcal{F} \mid d_{\mathcal{F}}(F, F_0) \ge \delta\}$.

The proof relies on several properties of the log-likelihood function and its expectation. These are established by several lemmas. Lemma A.1 establishes that the expectation of the log-likelihood is maximized uniquely at $F_0$. Lemma A.2 establishes that the log-likelihood is almost surely upper-continuous. These properties are used in Lemma A.3 to prove that the expectation of the log-likelihood is upper-continuous. These properties are needed in the proof of Lemma A.4, which proves that in the area outside any neighborhood of $F_0$ the objective function eventually lies strictly below the value attained at $F_0$ with probability approaching one. The consistency of our estimator follows immediately from this result.

We allow $E[\log u(Z_i, F)] = -\infty$ for $F \ne F_0$ in the lemma below.

Lemma A.1. For each $F \in \mathcal{F}$, $F \ne F_0$ implies

$$E[\log u(Z_i, F)] < E[\log u(Z_i, F_0)]. \tag{A.5}$$

Proof. Note that Jensen's inequality implies

$$E\{\log[u(Z_i, F)/u(Z_i, F_0)]\} \le E\{\log E[u(Z_i, F)/u(Z_i, F_0) \mid x_i]\}, \tag{A.6}$$

where the equality holds only if $u(Z_i, F)/u(Z_i, F_0)$ is a degenerate random variable given $x_i$ almost everywhere with respect to the distribution of $x_i$. However, when this holds Assumption 3 implies $F = F_0$, which is a contradiction to the premise. Hence the strict inequality must hold in Eq. (A.6). The conclusion follows by computing the right-hand side and noting that it equals zero: whenever $0 < p(x_i, F_0) < 1$, $y_i$ equals one with probability $p(x_i, F_0)$ given $x_i$, so $E[u(Z_i, F)/u(Z_i, F_0) \mid x_i] = p(x_i, F) + [1 - p(x_i, F)] = 1$. □

Lemma A.2. For each $F \in \mathcal{F}_0$, $\lim_{\rho \to 0} \log u(Z_i, F, \rho) = \log u(Z_i, F)$ almost surely.

Proof. Given the assumptions, $\Pr(x_i'\beta_i = 0 \mid x_i) = 0$, almost surely, when $\beta_i$ has the distribution $F$. Condition on this probability one event. Suppose $\lim_{\rho \to 0} \log u(Z_i, F, \rho) \ne \log u(Z_i, F)$. Then there exists a positive sequence $\{\rho_n\}$ such that $\rho_n \to 0$ and a sequence $\{F_n\} \subset \mathcal{F}$ such that $d_{\mathcal{F}}(F_n, F) < \rho_n$ and such that $\liminf_{n \to \infty} \log u(Z_i, F_n) > \log u(Z_i, F)$. But since $F_n$ converges weakly to $F$ we must have $p(x_i, F_n) \to p(x_i, F)$ because $H(x_i)$ is a continuity set with respect to $F \in \mathcal{F}_0$; see Billingsley, 1968, Theorem 2.1.(v). Then, since $\log u(Z_i, F_n)$ is a continuous function of $p(x_i, F_n)$, we must have $\lim_{n \to \infty} \log u(Z_i, F_n) = \log u(Z_i, F)$, a contradiction. □

Both sides may be minus infinity in the next lemma.

Lemma A.3. For each $F \in \mathcal{F}_0$, $\lim_{\rho \to 0} E[\log u(Z_i, F, \rho)] = E[\log u(Z_i, F)]$.

The proof follows the argument given in Halmos (1950, pp. 112-113).

Proof. Suppose first that $E[\log u(Z_i, F)]$ is finite. Then since

$$0 \le -\log u(Z_i, F, \rho) \le -\log u(Z_i, F), \tag{A.7}$$

the Lebesgue dominated convergence theorem implies the result.

Next consider the case where $E[\log u(Z_i, F)] = -\infty$. We show that this implies $\lim_{\rho \to 0} E[\log u(Z_i, F, \rho)] = -\infty$. Suppose not. Then since $E[\log u(Z_i, F, \rho)]$ is monotone decreasing as $\rho \downarrow 0$, $\lim_{\rho \to 0} E[\log u(Z_i, F, \rho)]$ exists and is finite. Let $\rho_n = 1/n$. Then for every $m, n > 0$, $\log u(Z_i, F, \rho_m) - \log u(Z_i, F, \rho_{m+n}) \ge 0$, which implies

$$E|\log u(Z_i, F, \rho_m) - \log u(Z_i, F, \rho_{m+n})| = E[\log u(Z_i, F, \rho_m)] - E[\log u(Z_i, F, \rho_{m+n})] \to 0 \tag{A.8}$$

as $m, n \to \infty$. Thus $\{\log u(Z_i, F, \rho_n)\}$ is a Cauchy sequence with respect to the $L_1$ norm on $\mathbb{R}$. It must converge in $L_1$ to an integrable function, and must have a subsequence that converges almost surely. By Lemma A.2, the limit must be $\log u(Z_i, F)$ almost everywhere. Thus $\log u(Z_i, F)$ is integrable, contradicting the hypothesis $E[\log u(Z_i, F)] = -\infty$. □

Lemma A.4. For any $\delta > 0$, there exists $g(\delta)$ with $0 < g(\delta) < 1$ such that

$$\Pr\left\{\sup_{F \in \mathcal{F}(\delta)} \prod_{i=1}^{N} [u(Z_i, F)/u(Z_i, F_0)] > g(\delta)^N\right\} \to 0 \quad \text{as } N \to \infty. \tag{A.9}$$


Proof. Lemmas A.1 and A.3 imply that for each $F \in \mathcal{F}_0 \cap \mathcal{F}(\delta)$, there exists $\rho_F > 0$ such that

$$E[\log u(Z_i, F, \rho_F)] < E[\log u(Z_i, F_0)]. \tag{A.10}$$

Since $\mathcal{F}_0$ is dense in $\mathcal{F}$ there is a covering of $\mathcal{F}(\delta)$ consisting of open balls centered at points $F \in \mathcal{F}_0$ and radii $\rho_F$. Since $\mathcal{F}(\delta)$ is compact there exists a finite subcover. Let $(F_1, \rho_1), \ldots, (F_m, \rho_m)$ index a finite subcover. Note that

$$\begin{aligned}
\Pr\left\{\sup_{F \in \mathcal{F}(\delta)} \prod_{i=1}^{N} [u(Z_i, F)/u(Z_i, F_0)] > g(\delta)^N\right\}
&\le \Pr\left\{\sum_{j=1}^{m} \prod_{i=1}^{N} [u(Z_i, F_j, \rho_j)/u(Z_i, F_0)] > g(\delta)^N\right\} \\
&\le \sum_{j=1}^{m} \Pr\left\{\prod_{i=1}^{N} [u(Z_i, F_j, \rho_j)/u(Z_i, F_0)] > g(\delta)^N/m\right\} \\
&\le \sum_{j=1}^{m} \Pr\left\{\frac{1}{N}\sum_{i=1}^{N} t_j(Z_i) > \log g(\delta) - (\log m)/N\right\},
\end{aligned} \tag{A.11}$$

where $t_j(Z_i) = \log u(Z_i, F_j, \rho_j) - \log u(Z_i, F_0)$.

Since we allow $E[\log u(Z_i, F_j, \rho_j)]$ to be minus infinity we cannot directly apply a law of large numbers to the sum (over $N$) in the last expression. Instead we make the following trimming argument. Let $C > 0$ be a constant to be determined momentarily. Define

$$\underline{t}_j(Z_i, C) = t_j(Z_i)\, 1\{t_j(Z_i) \le -C\}, \tag{A.12}$$

$$\overline{t}_j(Z_i, C) = t_j(Z_i)\, 1\{t_j(Z_i) > -C\}. \tag{A.13}$$

Then

$$\begin{aligned}
\sum_{j=1}^{m} \Pr\left\{\frac{1}{N}\sum_{i=1}^{N} t_j(Z_i) > \log g(\delta) - (\log m)/N\right\}
&\le \sum_{j=1}^{m} \Pr\left\{\frac{1}{N}\sum_{i=1}^{N} \underline{t}_j(Z_i, C) > [\log g(\delta) - (\log m)/N]/2\right\} \\
&\quad + \sum_{j=1}^{m} \Pr\left\{\frac{1}{N}\sum_{i=1}^{N} \overline{t}_j(Z_i, C) > [\log g(\delta) - (\log m)/N]/2\right\} \\
&\le \sum_{j=1}^{m} \Pr\left\{\frac{1}{N}\sum_{i=1}^{N} (-C) > [\log g(\delta) - (\log m)/N]/2\right\} \\
&\quad + \sum_{j=1}^{m} \Pr\left\{\frac{1}{N}\sum_{i=1}^{N} \left(\overline{t}_j(Z_i, C) - E[\overline{t}_j(Z_i, C)]\right) > -E[\overline{t}_j(Z_i, C)] + [\log g(\delta) - (\log m)/N]/2\right\}.
\end{aligned} \tag{A.14}$$

By Lemma A.1 and Assumption 4, $-E[\overline{t}_j(Z_i, C)] > 0$ when $C$ is chosen large enough. Thus for any $\epsilon > 0$, we can choose $C$ so large and $g(\delta)$ close enough to one so that $-E[\overline{t}_j(Z_i, C)] + [\log g(\delta) - (\log m)/N]/2 > \epsilon$ holds for each $j = 1, \ldots, m$ for large $N$. By increasing $C$ further, if necessary, we can make $-C < [\log g(\delta) - (\log m)/N]/2$ for all large $N$. Thus the first probability in the last expression is zero, and the second probability converges to zero by a law of large numbers. □

As mentioned above, consistency of our estimator follows immediately from Lemma A.4. The i.i.d. sampling Assumption 1 is used only to apply a weak law of large numbers in the last step of the proof of this lemma.
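The objects in the consistency proof can be made concrete for a distribution with finitely many support points. The sketch below assumes that the choice probability is the probability that the index is nonnegative, so that for a discrete distribution it is a weighted count of support points with a nonnegative index. It then evaluates the expected log-likelihood at the true weights and at an alternative, as a numerical illustration of the uniqueness-of-the-maximum property in the first lemma. The function names, support points, weights and regressor values are all hypothetical and chosen so that every choice probability lies strictly between 0 and 1.

```python
import numpy as np

def choice_prob(X, support, weights):
    """p(x, F) for discrete F: sum_j weights[j] * 1{x' support[j] >= 0}, per row of X."""
    indicators = (X @ support.T >= 0).astype(float)  # shape (n, J)
    return indicators @ weights

def expected_log_lik(X, support, weights, true_weights):
    """Average over rows of X of E[log u(Z, F) | x] when y is drawn under true_weights.

    Requires all choice probabilities under `weights` to lie strictly in (0, 1).
    """
    p0 = choice_prob(X, support, true_weights)
    p = choice_prob(X, support, weights)
    return np.mean(p0 * np.log(p) + (1.0 - p0) * np.log(1.0 - p))

# Hypothetical support points on the unit circle and regressor values.
support = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, -0.8]])
X = np.array([[1.0, -1.0], [-1.0, 1.0], [1.0, 2.0], [-1.0, -2.0]])
w_true = np.array([0.5, 0.3, 0.2])
w_alt = np.array([0.2, 0.3, 0.5])

p_true = choice_prob(X, support, w_true)
ll_true = expected_log_lik(X, support, w_true, w_true)
ll_alt = expected_log_lik(X, support, w_alt, w_true)
print(p_true, ll_true, ll_alt)
```

In this example the two weight vectors imply different choice probabilities at two of the four regressor values, and the expected log-likelihood is strictly larger at the true weights.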

References

Albright, R.L., Lerman, S.R., Manski, C.F., 1977. Report on the development of an estimation program for the multinomial probit model. Report for the Federal Highway Administration, Cambridge Systematics, Inc., Cambridge, Massachusetts.
Billingsley, P., 1968. Convergence of Probability Measures. Wiley, New York.
Cosslett, S.R., 1983. Distribution-free maximum likelihood estimator of the binary choice model. Econometrica 51, 765-782.
Fan, J., 1993. Local linear regression smoothers and their minimax efficiencies. The Annals of Statistics 21, 196-216.
Fletcher, R., 1987. Practical Methods of Optimization, 2nd ed. Wiley, Chichester.
Halmos, P.R., 1950. Measure Theory. D. Van Nostrand, New York.
Han, A.K., 1987. A non-parametric analysis of transformations. Journal of Econometrics 35, 191-209.
Hausman, J.A., Wise, D.A., 1978. A conditional probit model for qualitative choice: discrete decisions recognizing interdependence and heterogeneous preferences. Econometrica 46, 403-426.
Ichimura, H., 1993. Semiparametric least squares (SLS) and weighted SLS estimation of single-index models. Journal of Econometrics 58, 71-120.
Ichimura, H., Thompson, T.S., 1993. Maximum likelihood estimation of a binary choice model with random coefficients of unknown distribution. Discussion paper 268, Center for Economic Research, Department of Economics, University of Minnesota, Minneapolis, Minnesota.
Klein, R.W., Spady, R.H., 1993. An efficient semiparametric estimator for discrete choice models. Econometrica 61, 387-421.
Lindsay, B.G., 1983. The geometry of mixture likelihoods: a general theory. The Annals of Statistics 11, 86-94.
Manski, C.F., 1985. Semiparametric analysis of discrete response: asymptotic properties of the maximum score estimator. Journal of Econometrics 27, 313-333.
McFadden, D.L., 1976. Quantal choice analysis: a survey. Annals of Economic and Social Measurement 5, 363-390.
Parthasarathy, K.R., 1967. Probability Measures on Metric Spaces. Academic Press, New York.
Pollard, D., 1984. Convergence of Stochastic Processes. Springer, New York.
Quandt, R., 1966. Probabilistic Theory of Consumer Behavior. Quarterly Journal of Economics 70, 507-536.
Robertson, T., Wright, F., Dykstra, R., 1988. Order Restricted Statistical Inference. Wiley, Chichester.
Rockafellar, R.T., 1970. Convex Analysis. Princeton University Press, Princeton, NJ.
Ruppert, D., Wand, M., 1994. Multivariate locally weighted least squares regression. The Annals of Statistics 22, 1346-1370.
Ruud, P.A., 1986. Consistent estimation of limited dependent variable models despite misspecification of distribution. Journal of Econometrics 32, 157-187.
Stoker, T.M., 1986. Consistent estimation of scaled coefficients. Econometrica 54, 1461-1481.
Thompson, T.S., 1989. Least squares estimation of semiparametric discrete choice models. Department of Economics, University of Minnesota, Minneapolis, Minnesota.
Wald, A., 1949. Note on the consistency of the maximum likelihood estimate. The Annals of Mathematical Statistics 20, 595-601.
Wolfowitz, J., 1949. On Wald's proof of the consistency of the maximum likelihood estimate. The Annals of Mathematical Statistics 20, 601-602.
Wright, F., 1981. The asymptotic behavior of monotone regression estimates. The Annals of Statistics 9, 443-448.