Journal of Econometrics 42 (1989) 337-349. North-Holland

MODE REGRESSION

Myoung-jae LEE*

University of Wisconsin, Madison, WI 53706, USA

Received December 1986, final version received January 1989

* I am very grateful to Charles Manski and James Powell for their helpful comments and encouragement. I am also indebted to the Associate Editor and referees for their many valuable comments.
There exists a loss function whose expectation is minimized at mode(y|x). Adding the assumption mode(y|x) = x'β, the mode regression estimator is derived. The mode regression finds its major application in the case of a truncated dependent variable, particularly with an asymmetric density under homogeneity. The identification of the population parameter β and the strong consistency of the mode regression estimator are proved. Since no distribution theory is available, a small-scale Monte Carlo study is given at the end.
1. Introduction
There exists a loss function whose expectation is minimized at the conditional mode of y given x. This loss function, long known in the statistical decision theory literature [for instance, Ferguson (1967)], is

$$L(y, p) = 1[\,|y - p| > w\,],$$

where 1[·] is the indicator function taking the value 1 if the condition inside [·] is satisfied and 0 otherwise, p is a predictor, and w is a positive number. The expectation of the loss function is
$$E_x E_{y|x} 1[\,|y - p| > w\,] = 1 - E_x E_{y|x} 1[\,|y - p| \le w\,] = 1 - E_x\{ F_{y|x}(p + w) - F_{y|x}(p - w) \}. \tag{1}$$
This is minimized at p = mode(y|x) if f_{y|x} is symmetric, where f_{y|x} is the density of F_{y|x}. Assuming mode(y|x) = x'β, we can estimate β. If f_{y|x} is asymmetric, (1) is minimized at p equal to the middle value of the interval of length 2w capturing the most probability under f_{y|x}. If the middle value for a given x is different from x'β by the same constant for all x, we can still estimate β up to a constant. This point will be made clear later.
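This claim can be checked numerically. The following is a minimal illustrative sketch (not part of the paper; it assumes Python with numpy and scipy): it maximizes the captured probability F(p + w) − F(p − w) over p for a symmetric and for a skewed density.

```python
import numpy as np
from scipy.stats import norm, gamma

def capture(dist, p, w):
    """Probability F(p + w) - F(p - w) captured by an interval of length 2w
    centered at p; by (1), minimizing the expected loss maximizes this."""
    return dist.cdf(p + w) - dist.cdf(p - w)

grid = np.linspace(-3.0, 6.0, 2001)
w = 0.5
for dist, true_mode in [(norm(), 0.0), (gamma(2.0), 1.0)]:
    p_star = grid[np.argmax(capture(dist, grid, w))]
    # symmetric normal: p_star equals the mode (0); skewed gamma(2, 1):
    # p_star is offset from the mode 1 (about 1.08 for w = 0.5)
    print(true_mode, round(float(p_star), 2))
```

For the normal density the maximizer coincides with the mode; for the gamma density it is offset from the mode by the amount determined by the first-order condition f(p − w) = f(p + w).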
In this paper, we derive the mode regression estimator and prove its strong consistency, focusing on the case of a truncated dependent variable, where mode regression appears most promising. The estimator's asymptotic distribution has not been found, but there are convincing reasons to believe that the mode estimator will be N^{1/3}-consistent.

There appears to be no direct literature for the mode regression as pursued in this paper except Sager and Thisted (1982), who used the zero-one loss function for MLE of isotonic mode regression. Sager and Thisted dealt with a considerably more difficult problem, as the title suggests, without being motivated by the truncated case. They minimized a weighted loss with its weight related to a nonparametric likelihood. The literature on estimation of the mode in the one-sample problem is large; see Parzen (1962), Chernoff (1964), Grenander (1965), Venter (1967), Eddy (1980), Muller (1984), and Romano (1988). The mode regression studied here reduces to the Chernoff case if the regression function has no regressors other than the intercept.

There are also several papers related to the mode regression in other aspects. Rousseeuw's (1984) generalization of the 'shorth' [see Andrews et al. (1972)] is similar in its estimation procedure, and Powell's (1986) symmetrically trimmed least squares is closely related in its applicability to truncated data. While Powell's method requires symmetry of the conditional density, the mode regression can do without the symmetry. Manski (1975, 1985) has a similar sample objective function consisting of indicator functions. Other related papers will be referred to in due places.

This paper has six sections: (1) introduction, (2) basic model, (3) identification, (4) strong consistency, (5) choice of band width, and (6) simulation study.
2. Basic model

The basic model for the paper is set out in the following several assumptions. Since the mode regression finds its main application in the case of a truncated dependent variable, we start with the truncated model, denoting the original nontruncated variable by y* and y* truncated from below at c by y. The nontruncated case can be treated as a special case with c = −∞. The basic model assumptions are:

Assumption 1. Linear model and truncated random sample.
y_t* = x_t'β + u_t, where y_t* is observed only when y_t* > c, x_t' = (1, x_{t2}, ..., x_{tk}), β = (β_1, β_2, ..., β_k)', and (x_t, y_t), t = 1, 2, ..., T, is a random sample, where y_t is y_t* truncated from below by c.
Assumption 2. Monotonicity and unimodality of the u|x distribution.
u|x has a density which is unimodal and strictly increasing up to the mode and strictly decreasing after the mode. The mode of a density is defined to be the argmax of the density.

Assumption 3. Conditional mode restriction.
mode(y*|x) = x'β ⇔ mode(u|x) = 0, due to Assumption 1.

Assumption 4. Wide support of the u|x distribution.
The support of u|x is wider than [−w, w] for all x and for a w suitably chosen.

Assumption 5. Trade-off between asymmetry and heterogeneity of the u|x distribution.
The density f of u|x is either asymmetric and homogeneous, or symmetric and heterogeneous, where homogeneity means that f(u_t|x_t) does not vary across x_t. If f(u_t|x_t) varies, f(u_t|x_t) is defined to be heterogeneous.

Assumption 6. Full rank condition.
E(xx'·1[x'β > c + w]) is a positive definite matrix.
Assumption 7. Compact parameter space.
The parameter space B is a compact space.

The assumptions are self-explanatory except for Assumptions 4, 5, and 6, which are used for identification and strong consistency. Assumption 5 is explained next. Ignore the truncation for a while and write (1) in its maximizing version with y* and x'b replacing y and p, respectively. Then we have
$$E_x E_{y^*|x} 1[\,|y^* - x'b| \le w\,] = E_x E_{y^*|x} 1[\,x'b - w \le y^* \le x'b + w\,]. \tag{2}$$
In order to maximize (2), we capture as much probability of y*|x as possible using a band of length 2w around x'b. Hence, if mode(y*|x) = x'β, the interval with x'β at its center will capture the most probability whatever w is, so long as u|x is symmetric. If we drop the assumption of symmetry, mode(y*|x) will not be the middle point of the maximizing interval. But the middle point will differ from mode(y*|x) only by a constant, so all of β but the intercept are still estimable. In regression, dropping the symmetry assumption requires bringing in the homogeneity assumption, because if the distribution of u_t|x_t varies across x_t, the difference between the middle point and mode(y*|x) will not be the same for each datum. Hence, dropping the assumption of symmetry of u|x can be done at the expense of adding a new assumption of homogeneity.

Fig. 1. (Density of y*|x, with the points x'β − w, x'β, and x'β + w marked on the horizontal axis.)

For the truncated model with y, (2) should be modified to
$$Q(b) = E\,1[\,|y - \max(x'b,\, c + w)| \le w\,]. \tag{3}$$
The reason for the change can be understood by examining fig. 1. Suppose c = A; then x'β − w ≥ c ⇔ x'β ≥ c + w. We still capture the most probability with the interval around x'β, as if no truncation were done. If c is equal to any other point B, C, or D, then x'β − w < c ⇔ x'β < c + w. In this case we capture the most probability by putting the interval of length 2w on [c, c + 2w], and the middle of the interval is c + w. Hence, combining the two cases, the middle of the optimizing interval is max(x'β, c + w). The sample analog of (3) is
$$Q_T(b) = (1/T) \sum_{t=1}^{T} 1[\,|y_t - \max(x_t'b,\, c + w)| \le w\,]. \tag{4}$$
Formally, the mode estimator b_T is obtained by maximizing (4) with respect to b over B. Graphically, in the simple case of one regressor without an intercept, we give each datum an upper and a lower interval of length w. The slope is then estimated such that the straight line bx pierces as many intervals as possible. Each time bx pierces an interval, it 'scores' a point of 1, and a b scoring the most points is the slope estimate. Since in a finite sample it is possible for the estimate to change a little without changing the score, the mode estimate is set-valued in a finite sample.
This estimation procedure contrasts with the Least Median of Squares regression (LMS, hereafter) of Rousseeuw (1984), which minimizes the median of the squared residuals with respect to b. For the simple case of one regressor only, LMS minimizes the strip covering half of the observations. In mode regression, the strip size 2w is prefixed and the number of data covered is determined by the choice of the estimate, while in LMS the number of data covered is prefixed at 0.5T but the strip size is determined by the choice of the estimate. More on LMS can be found in Rousseeuw and Leroy (1987).
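To make the scoring description concrete, here is a minimal computational sketch of (4) for the one-regressor case. It is illustrative only (the names mode_score and mode_slope are ours), and a crude grid search stands in for the simplex algorithm used later in the paper.

```python
import numpy as np

def mode_score(b, y, x, c, w):
    """Sample objective (4): the fraction of data whose residual from
    max(x*b, c + w) lies inside the band of half-width w."""
    return np.mean(np.abs(y - np.maximum(x * b, c + w)) <= w)

def mode_slope(y, x, c, w, grid):
    """Score every candidate slope on a grid; since the maximizer is
    set-valued in a finite sample, all top-scoring b's are returned."""
    scores = np.array([mode_score(b, y, x, c, w) for b in grid])
    return grid[scores == scores.max()]
```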
3. Identification

The identification of the mode estimator will be shown by proving that Q(b) achieves a unique global maximum at β.
Theorem 1. Identification of β. The parameter β is identified if f_{u|x} is symmetric. If f_{u|x} is asymmetric but homogeneous, β is identified up to an additive constant.

Proof. First we show that if f_{u|x} is symmetric, x'β uniquely maximizes

$$E_{y|x} 1[\,|y - \max(x'b,\, c + w)| \le w\,] \tag{5}$$

for a given x, under the assumption P(x'β > c + w) > 0, which is implied by Assumption 6. Rewrite (5) as

$$E_{y|x}\{ 1[\,|y - x'b| \le w\,] 1[\,x'b \ge c + w\,] + 1[\,|y - c - w| \le w\,] 1[\,x'b < c + w\,] \} = \{ F_{y|x}(x'b + w) - F_{y|x}(x'b - w) \} 1[\,x'b \ge c + w\,] + \{ F_{y|x}(c + 2w) - F_{y|x}(c) \} 1[\,x'b < c + w\,]. \tag{6}$$

Since the second term is not a function of x'b, ignore it for a while. As for the first term, if x'b is chosen such that x'b < c + w, (6) equals zero. Hence such an x'b does not maximize (6). Among the x'b's such that x'b ≥ c + w, x'β maximizes (6), as shown in the paragraph following (3). Due to Assumptions 2 and 4, which rule out any flat portion of f_{y|x} near ±w, any other choice of interval loses more probability mass than it gains. If P(x'β > c + w) is zero, (5) becomes E_{y|x} 1[|y − (c + w)| ≤ w], which is not a function of β; then β is not estimable. A requirement similar to P(x'β > c + w) > 0 also appears in Powell's (1984) censored LAD.

The unique maximum of (5) at x'b = x'β alone is not sufficient for the unique maximum of Q(b), since if P(x'β = x'γ) = 1 though β is not equal to γ, then γ as well as β maximizes Q(b). But this case is excluded by the full rank condition of Assumption 6, for, letting d = β − γ,

$$E\{ (x'd)^2 1[\,x'\beta > c + w\,] \} > 0 \quad \text{for any } d \ne 0,$$

since E(xx'·1[x'β > c + w]) is positive definite; this indicates that

$$P(x'd \ne 0,\; x'\beta > c + w) > 0 \quad \text{for any } d \ne 0,$$

that is,

$$P(x'\beta \ne x'\gamma,\; x'\beta > c + w) > 0 \quad \text{for any } \gamma \ne \beta.$$

Hence Q(b) achieves its unique maximum at b = β.

If f_{u|x} is asymmetric, the x'β* which maximizes (5) is the middle value of the optimal interval of length 2w fitted under f_{y|x}, with f_{y|x}(x'β* − w) = f_{y|x}(x'β* + w). In general, x'β* is not equal to x'β, but assuming homogeneity of f_{u|x}, the difference between x'β* and x'β becomes a constant, not a function of x. Using Assumption 6 again, it can be shown that Q(b) attains its unique maximum at b = β*, which is different from β only in the intercept. ∎
4. Strong consistency

In this section, we prove the strong consistency of b_T.

Theorem 2. Strong consistency of the mode estimator b_T.

Proof. The proof of the strong consistency of b_T is done in a three-step procedure which is fairly standard. In the first step, we use the combinatorial method to prove the uniform convergence, as expounded in Pollard (1984, ch. 2).

Step 1. Almost sure uniform convergence of Q_T(b) to Q(b). Q_T(b) converges a.s. and uniformly in b to Q(b).

Proof. Decompose 1[|y − max(x'b, c + w)| ≤ w] into
l[lv - x’bl I w, x’b r c + w] + l[lv - (c + w)l 5 w, x’b < c + w]. (7) Then, proving the uniform convergence for each indicator function is sufficient. Let Z be the class of the indicator functions 1[ ly - x’b I I w], indexed by b. First we claim that the class of the ‘graphs’ of the functions in Z have ‘polynomial discrimination’, where the graph of a real-valued function h(x, y) is defined as (see Pollard, p. 27)
in R^{k+2}. Saying that a class of sets in R^{k+2} has polynomial discrimination means that the class has the property of not being so variable as to pick up all 2^n subsets of a finite subset of R^{k+2} with n elements (see Pollard, p. 17, for the precise definition). For instance, on the real line R^1, the class of sets composed of the connected closed intervals cannot pick up some subsets of a set with n elements: ordering the elements by 1, 2, ..., n, the class cannot pick up 2 and 4 without picking up 3 (any interval including 2 and 4 must also include 3).

The claim that the class of G_h has polynomial discrimination is proved next. Note that G_h is the intersection of two half spaces, {y − x'b − w ≤ 0} and {y − x'b + w ≥ 0}. The class of half spaces has polynomial discrimination, and so does the class formed by intersections of the half spaces (see Pollard, p. 18, lemma 15). Hence the class of G_h with h(x, y) = 1[|y − x'b| ≤ w] has polynomial discrimination. The first term in (7) has one more half space, {x'b ≥ c + w}, in addition to {y − x'b − w ≤ 0} and {y − x'b + w ≥ 0}, and the second term has {x'b < c + w}. Hence the class of the graphs of the two functions in (7) has polynomial discrimination.

The facts that the class of the graphs has polynomial discrimination and that the indicator functions have a uniform upper bound of 1 imply that the 'covering number' for the class is bounded (Pollard, p. 27, lemma 25). The covering number is an index of how many 'distinct' members are in Z (see
Pollard, p. 25, for the precise definition). The main proof of the a.s. uniform convergence of Q_T(b) to Q(b) comes from Pollard (p. 25, theorem 24), using the boundedness of the covering number.

Step 2. Continuity of Q(b). Q(b) can be rewritten as

$$Q(b) = \int \left[ F_{y|x}\{\max(x'b,\, c + w) + w\} - F_{y|x}\{\max(x'b,\, c + w) - w\} \right] dF_x. \tag{9}$$

The integrand is a bounded and continuous function of b, for F_{y|x} is bounded and continuous in b and max(x'b, c + w) is continuous in b. Then, due to the bounded convergence theorem, Q(b) is continuous in b.

Step 3. Maximum of Q(b). Due to Assumption 7 and the continuity of Q(b), Q(b) attains its maximum on B. Hence there exists at least one b which maximizes Q(b).

Therefore, by the three steps and Theorem 1, b_T is strongly consistent for β. ∎
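Although no rate is derived here, the strong consistency can be checked informally by simulation. A rough sketch (ours, not from the paper), reusing the mode_slope function from the earlier sketch on 50%-truncated normal data with true slope 1:

```python
import numpy as np

# y* = x + u with x, u standard normal, truncated from below at c = 0,
# which gives about 50% truncation since y* ~ N(0, 2); true slope is 1.
rng = np.random.default_rng(2)
grid = np.linspace(0.0, 2.0, 201)
for T in (200, 2000, 20000):
    x = rng.standard_normal(4 * T)
    y = x + rng.standard_normal(4 * T)
    keep = y > 0.0                       # truncated sampling
    x, y = x[keep][:T], y[keep][:T]
    est = mode_slope(y, x, c=0.0, w=0.7, grid=grid)
    print(T, est.mean())                 # midpoint of the set-valued maximizer
```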
A natural question to arise at this stage is the asymptotic distribution of b_T. Chernoff (1964) showed it in a one-sample problem. If one estimates the population mode p with the center value of an interval of fixed length 2w which contains the most data, then as T goes to infinity, the normalized estimator follows the distribution of argmax Z(z), where Z is a Gaussian process with independent increments from the origin, mean 0, and variance 1 per unit z. The basic idea is that, if we set up intervals to the right and to the left of the mode and count the numbers of observations in small nonoverlapping intervals, the counted numbers are approximately uncorrelated to each other and normally distributed, with their variances depending on the size of the intervals. Taking the difference of the counted numbers and subtracting the difference from its expected value will give a Gaussian process; Z(z) is the limiting form of the process. The distribution of the argmax is known in terms of a solution to a heat equation, but not in a closed form of practical use. As a matter of fact, this is not surprising, because the maximand is a summation of RV's that behaves as
an asymmetric random walk process, so that the sum moves like a somewhat 'drifted' Brownian motion.

Kim and Pollard (1987) reveal new information about 'nonconventional' estimation methods such as bivariate mode estimation with indicator functions (of which the one-sample mode estimation problem is a special case), the LMS of Rousseeuw, and the maximum score estimation of Manski (1975, 1985). Their major finding is that those estimators are N^{1/3}-consistent and that their population objective functions can be represented by a Brownian motion process indexed by z plus a deterministic term quadratic in z. They also prove that the estimators converge in distribution to the argmax of the population process. However, they did not derive the argmax in a closed form. Kim and Pollard give reasons to believe that the mode regression may have similar properties, such as N^{1/3} consistency and a population process of a Brownian process plus a quadratic drift.
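The one-sample estimator discussed above is easy to state in code. A sketch (ours; the name interval_mode is illustrative):

```python
import numpy as np

def interval_mode(sample, w, grid_size=500):
    """One-sample analog of the mode estimator: the center p of a fixed
    interval [p - w, p + w] covering the most observations."""
    grid = np.linspace(sample.min(), sample.max(), grid_size)
    counts = np.array([(np.abs(sample - p) <= w).sum() for p in grid])
    return float(grid[counts.argmax()])

rng = np.random.default_rng(0)
# gamma(2, 1) has mode 1; for this skewed density the optimal interval
# center sits slightly to the right of the mode (near 1.08 for w = 0.5,
# by the f(p - w) = f(p + w) first-order condition)
print(interval_mode(rng.gamma(2.0, 1.0, size=5000), w=0.5))
```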
5. Choice of band width w

One lingering question that has been avoided so far is the choice of the band width w. The problem of choosing w seems to be related to the bias and efficiency of the mode estimator. It would be nice if we could choose an optimal w based on those criteria, as is done in the density estimation literature. But in the mode regression, where there exists no closed form for the estimator, this does not seem to be possible.

However, there always exist sensible bounds for w with a given data set, considering the 'scoring' feature of the mode regression. If the score {the number of data covered by the modal interval, that is, the number of data with 1[|y − max(x'b, c + w)| ≤ w] = 1} is too small, there will be many estimates with the same score. If w is too large, many estimates can easily score the maximum, T. In both cases, the estimation results are not good. The sensible bounds can be found in practice by trying many w's on the data set.

In addition to the bounds, there is another version of the mode estimator, obtained by trying many w's and taking the average of the estimates over the w's. In the context of the generalized jackknife method, it can be said that we add up positively correlated estimators, hoping to gain more in efficiency at the cost of bias. There is no theory behind this idea, but as shown in the section on the simulation study, more often than not, the mode estimator appears to benefit from the trade-off between the bias and efficiency.

Problems such as the choice of a location parameter to estimate in an asymmetric density, particularly among the quantiles, and the choice of the window in density estimation are not really 'solved'. Considering these, it may be too much to ask to choose w optimally at this stage of development of the mode regression. One may have to proceed by trial and error.
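A rough implementation of this trial-and-error screening and of the averaged estimator (AMODE in the tables below) could look as follows; it reuses mode_score and mode_slope from the earlier sketch, and the 70-80% target score fraction anticipates the choice described in the simulation section.

```python
import numpy as np

def max_score(y, x, c, w, grid):
    """Maximized score fraction for a given w: fractions near 1 signal too
    wide a band, while very low fractions produce many tied maximizers."""
    return max(mode_score(b, y, x, c, w) for b in grid)

def amode_slope(y, x, c, grid, w_list):
    """AMODE: average the mode estimates over several band widths,
    summarizing each set-valued maximizer by its midpoint."""
    return float(np.mean([mode_slope(y, x, c, w, grid).mean()
                          for w in w_list]))
```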
Table 1
T = 200.

              BIAS     STD    RMSE      LQ     MED      UQ     MAE

Design 1: 25% truncation, std. normal
STLS         0.021   0.209   0.210   0.878   0.992   1.135   0.125
MODE         0.258   0.387   0.466   0.973   1.217   1.483   0.267
AMODE        0.314   0.263   0.410   1.133   1.274   1.494   0.279

Design 2: 50% truncation, std. normal
STLS         0.201   0.503   0.541   0.914   1.091   1.353   0.199
MODE         0.536   1.134   1.254   1.049   1.414   1.993   0.542
AMODE        0.523   0.590   0.789   1.179   1.517   1.877   0.545

Design 3: 75% truncation, std. normal
STLS         1.139   1.653   1.659   0.767   1.122   1.670   0.403
MODE         0.277   1.852   1.873   0.795   1.412   2.136   0.737
AMODE       -0.135   1.137   1.145   0.798   1.006   1.569   0.687

Design 4: 50% truncation, std. logistic
STLS         0.498   3.764   3.797   0.777   1.198   1.966   0.495
MODE         0.472   1.956   2.012   0.658   1.305   2.156   0.649
AMODE        0.536   1.150   1.269   0.983   1.494   1.957   0.631

Design 5: 50% truncation, std. Cauchy
STLS        -0.397   6.839   6.851   0.824   1.111   1.607   0.303
MODE         0.292   2.308   2.327   0.765   1.375   2.215   0.765
AMODE        0.416   1.051   1.131   1.136   1.467   1.946   0.672

Design 6: 50% truncation, gamma(2,1)-mode
STLS         0.566   3.401   3.448   1.044   1.479   2.182   0.540
MODE         0.549   0.810   0.979   1.031   1.455   1.830   0.460
AMODE        0.658   0.980   1.180   1.200   1.695   2.151   0.763

Design 7: 50% truncation, gamma(3,1)-mode
STLS         0.707   3.714   3.780   0.733   1.331   2.488   0.846
MODE         0.631   1.437   1.569   0.932   1.519   2.235   0.608
AMODE        0.359   1.308   1.356   0.696   1.299   2.015   0.719

6. Simulation study
No asymptotic theory is given, so a simulation study will be helpful for understanding the behaviour of the mode estimator. A small-scale simulation study comparing the mode estimator (MODE, hereafter) with Powell's (1986) symmetrically trimmed least squares (STLS, hereafter) and AMODE (mode estimation averaging over many w's) for truncated data is shown in table 1. In table 2, the estimator suggested by Bhattacharya et al. (1983), which is another semiparametric method for the truncated case, is compared to the above three methods. Random numbers are generated by a multiplicative-congruential method. The simplex algorithm [see Himmelblau (1972)] is used for the simulation study; it is not to be confused with the simplex method for linear programming. The algorithm does not use the gradient and is known to work fairly well.
Table 2
T = 30.

              BIAS     STD    RMSE      LQ     MED      UQ     MAE

Design 1: 25% truncation, std. normal
BHAT        -0.175   0.401   0.437   0.717   0.873   0.991   0.182
STLS        -0.001   0.251   0.251   0.846   1.005   1.169   0.157
MODE         0.011   0.574   0.574   0.678   1.010   1.279   0.306
AMODE        0.027   0.412   0.413   0.783   1.024   1.264   0.244

Design 2: 50% truncation, std. normal
BHAT         0.644   2.031   2.131   0.476   0.620   0.798   0.386
STLS         0.165   0.620   0.642   0.768   0.995   1.172   0.186
MODE         0.305   0.862   0.914   0.352   0.926   1.226   0.327
AMODE        0.475   0.688   0.837   0.213   0.623   1.008   0.438

Design 3: 75% truncation, std. normal
BHAT        -0.513  10.546  10.468   0.239   0.355   0.517   0.687
STLS        -1.367   1.880   2.325  -1.511   0.594   1.044   0.464
MODE        -1.254   1.375   1.861  -0.992  -0.151   0.915   1.151
AMODE       -1.255   0.849   1.515  -0.636  -0.180   0.317   1.180
In all designs, only the result for the slope coefficient is reported. As for the choice of w in MODE, w is chosen such that the score lies between 70% and 80% of the data; that is, for T = 200, the score should be 140 to 160 (T of table 2 is 30 for a reason given below). Such a w usually falls between 0.5 and 1.5, depending on the degree of truncation and the error distribution. There is no particularly good reason to choose such a w, but it seems that, if anything, a relatively large w gives better results than a small w.

Bhattacharya et al.'s method (BHAT, hereafter) does not require the symmetry of the conditional density, while STLS does. Both methods have distribution theory, but BHAT is good only for the case of one regressor without an intercept. Hence it is not a viable alternative to STLS and MODE. Computationally, BHAT is very burdensome, since the evaluation of the objective function involves T(T − 1)/2 terms; for instance, if the number of data is 200, there are 100 · 199 = 19,900 terms in the objective function. Because of this, T is set at 30 in table 2.

The model for table 1 is y_t = x_t + u_t, where x and u are standard normal RV's, T = 200, and the number of replications is 200.
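For reference, a sketch (ours) of this data-generating process with truncated sampling:

```python
import numpy as np

def truncated_sample(T, c=0.0, seed=1):
    """Base design of table 1: y* = x + u with x, u standard normal, and
    y* <= c discarded entirely; c = 0 gives roughly 50% truncation since
    y* ~ N(0, 2). The other designs swap the distribution of u."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(8 * T)       # oversample, then keep T survivors
    y = x + rng.standard_normal(8 * T)
    keep = y > c
    return x[keep][:T], y[keep][:T]
```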
In table 2 we use the model with T = 30 and estimate only the slope parameter, fixing the intercept at 0. In all designs, we provide LQ (lower quantile), MED (median), UQ (upper quantile), and MAE (median absolute error) in addition to BIAS, STD, and RMSE.

We examine table 1 first. In designs 1, 2, and 3, we change the degree of truncation to note several points. First, STLS does better than MODE in RMSE as well as in MAE, and AMODE does better than MODE in RMSE. Second, judging from LQ, MED, and UQ, the distribution of all the estimators looks asymmetric, and the degree of asymmetry becomes larger as that of truncation goes up. Third, there exists a small-sample bias for STLS and MODE, and it is more pronounced for MODE. The reason for the bias is not clear. In censored STLS, Powell (1986) gave the reasoning that doing the two tasks of including data with x'β ≥ c and minimizing the objective function at the same time affects the number of data included in each replication, and this causes bias. But his argument is based on the number of observations with y_t = 0 in an essential way, so it is not certain whether this argument carries over to the truncation case.

Comparing designs 2, 4, and 5, where the truncation is 50% and the error distribution is standard normal, logistic, and Cauchy, respectively, we can see that as the tail of the error distribution becomes heavier, so does that of STLS: whereas the MAE of STLS does not change much, its STD increases by a big margin. In designs 4 and 5, MODE does better than STLS in RMSE and STLS does better in MAE, which suggests that the tail of the distribution of MODE is not as thick as that of STLS. The performance of AMODE is similar to that of MODE, except that AMODE has a noticeably lower STD.

Checking designs 6 and 7, where STLS is supposed not to work since the distributions are asymmetric, STLS shows a high STD, but its BIAS is not high compared to that of MODE and AMODE. The performance of STLS with asymmetric densities will depend on the degree of the asymmetry, though there is no unique measure for the degree of asymmetry. Also, in principle, the testing procedure of STLS, which gives an edge to STLS over MODE with a symmetric density, is not justified anymore.

Examining table 2, in all designs BHAT is the worst in RMSE due to its high STD, though BHAT fares relatively well in MAE. One caution is that design 3 may not be very informative, since with 25% of the data retained and T = 30, only about seven to eight data are included in the estimation.

In summary, STLS is good when the distribution is symmetric and has thin tails, which is understandable since STLS is a version of LSE. MODE does relatively well with thick-tailed distributions, and AMODE does better than MODE in most cases. BHAT does not seem to be a good alternative to STLS and MODE except in one very rare case: no intercept, a single regressor, and an asymmetric density.
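For completeness, the summary statistics reported in the tables can be computed from replicated slope estimates as follows (an illustrative sketch; LQ and UQ are taken to be quartiles here):

```python
import numpy as np

def summarize(estimates, true_b=1.0):
    """The table entries, computed from replicated slope estimates."""
    e = np.asarray(estimates)
    err = e - true_b
    return {"BIAS": err.mean(),
            "STD": e.std(ddof=1),
            "RMSE": np.sqrt((err ** 2).mean()),
            "LQ": np.quantile(e, 0.25),
            "MED": np.median(e),
            "UQ": np.quantile(e, 0.75),
            "MAE": np.median(np.abs(err))}
```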
References

Andrews, D.F., P.J. Bickel, F.R. Hampel, P.J. Huber, W.H. Rogers, and J.W. Tukey, 1972, Robust estimates of location: Survey and advances (Princeton University Press, Princeton, NJ).
Bhattacharya, P.K., H. Chernoff, and S.S. Yang, 1983, Nonparametric estimation of the slope of a truncated regression, Annals of Statistics 11, 505-514.
Chernoff, H., 1964, Estimation of the mode, Annals of the Institute of Statistical Mathematics 16, 31-41.
Eddy, W.F., 1980, Optimum kernel estimators of the mode, Annals of Statistics 8, 870-882.
Ferguson, T.S., 1967, Mathematical statistics: A decision theoretic approach (Academic Press, New York, NY).
Grenander, U., 1965, Some direct estimates of the mode, Annals of Mathematical Statistics 36, 131-138.
Himmelblau, D.M., 1972, Applied nonlinear programming (McGraw-Hill, New York, NY).
Kim, J. and D. Pollard, 1987, Cube root asymptotics, Preprint.
Manski, C., 1975, Maximum score estimation of the stochastic utility model of choice, Journal of Econometrics 3, 205-228.
Manski, C., 1985, Semiparametric analysis of discrete response, Journal of Econometrics 27, 313-333.
Muller, H.G., 1984, Smooth optimum kernel estimators of densities, regression curves and modes, Annals of Statistics 12, 766-774.
Parzen, E., 1962, On the estimation of a probability density function and the mode, Annals of Mathematical Statistics 33, 1065-1076.
Pollard, D., 1984, Convergence of stochastic processes (Springer-Verlag, New York, NY).
Powell, J., 1984, Least absolute deviation estimation for the censored regression, Journal of Econometrics 25, 303-325.
Powell, J., 1986, Symmetrically trimmed least squares estimation for Tobit models, Econometrica 54, 1435-1460.
Romano, J.P., 1988, On weak convergence and optimality of kernel density estimates of the mode, Annals of Statistics 16, 629-647.
Rousseeuw, P.J., 1984, Least median of squares regression, Journal of the American Statistical Association 79, 871-880.
Rousseeuw, P.J. and A. Leroy, 1987, Robust regression and outlier detection (Wiley-Interscience, New York, NY).
Sager, T.W. and A. Thisted, 1982, MLE of isotonic modal regression, Annals of Statistics 10, 690-707.
Venter, J.H., 1967, On estimation of the mode, Annals of Mathematical Statistics 38, 1446-1455.