Regression towards the mode

Journal of Econometrics 170 (2012) 92–101

Regression towards the mode✩

Gordon C.R. Kemp a, J.M.C. Santos Silva a,b,∗

a Department of Economics, University of Essex, United Kingdom
b CEMAPRE, Portugal

Article history: Received 18 March 2011; Received in revised form 17 January 2012; Accepted 21 March 2012; Available online 28 March 2012.

Abstract: We propose a semi-parametric mode regression estimator for the case in which the dependent variable has a continuous conditional density with a well-defined global mode. The estimator is semi-parametric in that the conditional mode is specified as a parametric function, but only mild assumptions are made about the nature of the conditional density of interest. We show that the proposed estimator is consistent and has a tractable asymptotic distribution. © 2012 Elsevier B.V. All rights reserved.

JEL classification: C13; C14

Keywords: Conditional mode; Density estimation; Normal kernel; Robust regression

1. Introduction

The mode is a characterizing feature of any statistical distribution or data set and naturally its estimation has received considerable attention in the statistics literature (early references include Parzen, 1962; Chernoff, 1964). Likewise, non-parametric estimation of the conditional mode is the subject of many papers in statistical journals (e.g., Collomb et al., 1987; Samanta and Thavaneswaran, 1990; Quintela-Del-Rio and Vieu, 1997).1 Less attention, however, has been devoted to the case that is most likely to be useful in econometric applications, that is, the semi-parametric case in which the conditional mode is specified

as a parametric function, but only mild assumptions are made about the conditional distribution of interest. Having a parametric specification of the conditional mode, although restrictive, is convenient for a number of reasons. First of all, it avoids the curse of dimensionality that would plague a fully non-parametric approach in many applications. Also, it provides a convenient summary of how the regressors affect the conditional mode, and allows extrapolation of the conditional mode for values of the regressors not observed in the data. Lee (1989) introduced a semi-parametric estimator of Mode(y|x), the mode of the conditional distribution of y given x, based on the well-known loss function

LR(y, x) = 1 − 2 KR((y − x′β)/δ),  (1)

✩ We are very grateful to the Associate Editor and to four referees for their many constructive and helpful comments. We also thank Marcus Chambers, Emma Iglesias, Myoung-jae Lee, Oliver Linton, Myung Hwan Seo, Paulo Parente, and participants in seminars at Birmingham, Cambridge, cemmap, GREQAM, ISER, LSE, Manchester, Reading, University College Dublin, and York for their helpful comments and suggestions. The usual disclaimer applies. Santos Silva gratefully acknowledges the partial financial support from Fundação para a Ciência e Tecnologia, program POCTI, partially funded by FEDER.
∗ Correspondence to: Department of Economics, University of Essex, Wivenhoe Park, Colchester CO4 3SQ, United Kingdom. Tel.: +44 1206872769.
E-mail addresses: [email protected] (G.C.R. Kemp), [email protected] (J.M.C. Santos Silva).
1 Estimation of the conditional mode should not be confused with the estimation of the mode or peak of a (mean) regression function; see Müller (1985, 1989).

doi:10.1016/j.jeconom.2012.03.002

where it is assumed that Mode (y|x) = x′ β0 , δ > 0 is the bandwidth parameter, and KR (u) = 1 [|u| < 1] /2 denotes the rectangular kernel (Silverman, 1986), with 1 [A] being the indicator function for event A. The estimator proposed by Lee (1989) can be obtained by minimizing the sample analog of the expectation of (1), which is achieved when x′ β is the midpoint of the interval of length 2δ with the highest probability of containing y (Manski, 1991). If the conditional density of y, fY |X (y|x), has a unique global mode, the minimizer of this function approaches Mode (y|x) as δ → 0. Moreover, for fixed δ , the minimizer of LR (y, x) is Mode (y|x) if fY |X (y|x) has a unique global mode and is symmetric about it up
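As an illustration of the δ-mode idea, the following sketch (our own, not from the paper; the data, bandwidth, and grid are invented) minimizes the sample analog of the expectation of (1) for an intercept-only model:

```python
import numpy as np

# Illustrative sketch (not from the paper): the rectangular-kernel loss (1),
# L_R(y, x) = 1 - 2*K_R((y - x'b)/delta) with K_R(u) = 1[|u| < 1]/2,
# equals 1 - 1[|y - x'b| < delta]. Minimizing its sample average places x'b
# at the midpoint of the interval of length 2*delta most likely to contain y.

def rectangular_loss(y, fitted, delta):
    # sample analog of E[L_R]: the fraction of observations outside the interval
    return np.mean(np.abs(y - fitted) >= delta)

rng = np.random.default_rng(0)
y = rng.chisquare(3, size=5000)        # right-skewed: mode (1) < mean (3)
delta = 0.5

# Intercept-only model: grid search for the delta-mode.
grid = np.linspace(0.0, 6.0, 601)
losses = [rectangular_loss(y, b, delta) for b in grid]
delta_mode = grid[int(np.argmin(losses))]
# delta_mode lies near the chi-square(3) mode, well below the sample mean
```

Note that the objective is a step function of the candidate location, which is precisely why, as discussed next, the estimator's distribution is intractable.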


to ±δ, and it is parallel to Mode(y|x) if x affects only the location of fY|X(y|x) (see Lee, 1989, 1993). For fixed δ, Manski (1991) terms the minimizer of (1) the δ-mode. Despite its elegance, Lee's (1989) estimator is impractical because its distribution is intractable due to the nature of the objective function (Kim and Pollard, 1990). In order to overcome this problem, Lee (1993) introduced the quadratic mode regression estimator, which can be obtained by replacing the rectangular kernel in (1) with the quadratic or Epanechnikov kernel (Silverman, 1986). Lee (1993) shows that, under the assumed regularity conditions, for fixed δ, this estimator is √n-consistent and asymptotically normal.2 However, as before, consistency requires fY|X(y|x) to be symmetric about the mode up to ±δ. Alternatively, the slope parameters can be consistently estimated if x affects only the location of fY|X(y|x) (see Lee, 1989, 1993). The pathbreaking papers of Lee (1989, 1993) motivated mode regression estimators by noting that, under certain conditions, the conditional mode from truncated data provides consistent estimates of the conditional mean for the original non-truncated distribution. Although mode regression is appealing in this context, applications with truncated dependent variables are relatively rare and, therefore, tailoring mode regression to this kind of data unduly restricts its usefulness. Indeed, the interest of mode regression is much broader and the mode is arguably the most intuitive measure of central tendency, being especially interesting when modeling the positively skewed data found in many econometric applications (e.g., wages, prices, and expenditures). For example, the work of Cardoso and Portugal (2005) and Martins et al. (2010) shows that the mode is the natural measure of central tendency to use when studying the typical hiring wage.
Additionally, the robustness properties of the conditional mode may also make it attractive in many econometric applications. In particular, there is evidence that the conditional mode may be reasonably resilient to some forms of measurement error (see Bound and Krueger, 1991; Hu and Schennach, 2008) and we will show that mode regression is a limiting case of a family of well-known robust regression estimators, which are otherwise difficult to interpret. The use of these robust estimators in applied econometrics can be seen as an attempt to estimate the conditional mode in the absence of a practical method of doing so. This is illustrated by the work of Temple (1998, 1999), who advocates the use of robust regression arguing that in the context of growth regressions it is particularly appropriate to use estimators that in some sense fit the majority of the data, which is exactly what mode regression does. Although in principle quantile regression (Koenker and Bassett, 1978) can completely characterize the shape of the conditional distribution, the way it is used in practice generally fails to reveal any information about the conditional mode and on how it is affected by the regressors. For example, it is easy to find examples where the mean and all quantiles are increasing functions of a regressor, while the mode decreases with the same regressor. Therefore, mode regression is potentially a very useful tool that can be of interest in itself, or used to complement the standard mean and quantile regressions in the study of the features of conditional distributions. In this paper we study the semi-parametric estimation of the conditional mode (mode regression) for a variate that has a continuous conditional density with a well-defined global mode.3

2 Karlsson (2004) studied the finite sample properties of this estimator.
3 Two recent contributions in this area should be acknowledged. The function npconmode() in the np package in R (Hayfield and Racine, 2008) implements nonparametric mode regression for a categorical dependent variable based on the results of Hall et al. (2004). Additionally, Gannouna et al. (2010) introduced a semi-parametric mode regression estimator, but the problem they address is very different from the one considered here.


In doing this, we depart from Lee (1989, 1993) by using smooth kernels and by letting the bandwidth parameter pass to zero as the sample size increases. We show that in this case it is possible to obtain a mode regression estimator whose asymptotic validity does not depend on the restrictive assumptions on the conditional density of y required by Lee (1989, 1993). In addition, the estimator has a tractable asymptotic distribution and it is simple to implement using standard software. Of course, the fact that consistency is possible under much more general conditions has a cost. In particular, as in other cases where the objective function depends on a vanishing bandwidth (see, e.g., Parzen, 1962; Horowitz, 1992), the estimator will not converge at the usual √ n rate. Nevertheless, as we will illustrate, the proposed mode regression estimator can still be useful in practice. The remainder of the paper is organized as follows. Section 2 details our approach to mode regression and presents the main asymptotic results. Section 3 provides simulation evidence on the finite sample performance of the proposed estimator, and Section 4 contains concluding remarks. Finally, the proofs of the main results are collected in a technical Appendix. 2. Main results 2.1. Framework and asymptotic results We consider a regression model of the form yi = x′i β0 + εi

(i = 1, 2, . . . , n),  (2)

where xi takes values in Rp for some finite p, β0 is an unknown element of the parameter space B, which is a known subset of Rp, and the conditional density of εi given xi has a strict global maximum at εi = 0, so that the conditional mode of yi given xi is equal to x′iβ0.4 As in Lee (1989, 1993), our starting point is a loss function that can be written as one minus a (scaled) kernel. In particular, we consider a loss function of the form

Ln(y, x) = 1 − γ K((y − x′β)/δn),  (3)

where γ = K(0)^{-1} > 0 is a scaling constant, δn is a non-stochastic strictly positive bandwidth that depends on n, and K(u) denotes a smooth kernel function. Many smooth kernels are available but throughout we focus on the popular choice K(u) = φ(u), where φ(u) denotes the standard normal density.5 As will be shown, this choice has the advantage of generating a loss function which has both the mode and the mean as minimizers in limiting cases. Minimizing the sample analog of the expectation of (3) is equivalent to maximizing

Qn(β) = n^{-1} Σ_{i=1}^{n} δn^{-1} K((yi − x′iβ)/δn),  (4)

which leads to the estimator of interest, a regression version of Parzen's (1962) mode estimator, defined by

βˆn ≡ arg max_β Qn(β).  (5)
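To make (4) and (5) concrete, here is a minimal numerical sketch of ours (the data, the fixed bandwidth, and the one-parameter specification are invented; in the paper δn shrinks with n):

```python
import numpy as np

# Minimal sketch of (4)-(5) with the normal kernel K(u) = phi(u).
# Data, bandwidth, and the no-intercept specification are illustrative.

def Qn(y, fitted, delta):
    # Q_n(beta) = n^{-1} sum_i delta^{-1} phi((y_i - x_i'beta)/delta)
    u = (y - fitted) / delta
    return np.mean(np.exp(-0.5 * u**2) / (delta * np.sqrt(2.0 * np.pi)))

rng = np.random.default_rng(1)
n = 4000
x = rng.chisquare(3, n)
eps = rng.chisquare(3, n) - 1.0   # error with conditional mode 0 and mean 2
y = 1.0 * x + eps                 # Mode(y|x) = x, so the true slope is 1

delta = 0.4
grid = np.linspace(0.0, 2.0, 401)
beta_hat = grid[int(np.argmax([Qn(y, b * x, delta) for b in grid]))]
# beta_hat estimates the mode-regression slope (close to 1), whereas the
# no-intercept least squares slope is pulled up by the positive error mean
```

The grid search is only for illustration; the iterative scheme discussed in Section 2.2 is the practical way to maximize (4).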

For a fixed δn, the asymptotic distribution of βˆn can be obtained using standard techniques (see, e.g., Amemiya, 1985, Chapter 4). However, for a fixed δn this mode regression estimator is interesting only under very restrictive conditions and therefore we derive the asymptotic distribution of βˆn when δn is allowed to vanish as n passes to infinity.

4 Strictly speaking, the conditional density is not uniquely defined: we just require that there is a version of the conditional density with this property.
5 This choice of kernel is not innocuous because it is possible to obtain estimators with somewhat improved asymptotic properties by using different kernels (see, e.g., Eddy, 1980; Romano, 1988). However, this would also require strengthening some assumptions.

For any given value β ∈ B, Qn(β) is a kernel-based estimator of the density function of the residuals, ηi(β) ≡ yi − x′iβ, at 0. This identifies the parameters of interest because fηi(β)(0) = E[fY|X(x′iβ|xi)], which is clearly maximized at β = β0 provided that fY|X(x′iβ|xi) ≤ fY|X(x′iβ0|xi) for all x and β ∈ B, with a strict inequality when β ̸= β0 on a set of x with positive probability. Provided that the kernel is continuous and the bandwidth is finite and strictly positive, the objective function is continuous in β for any realized data. If, in addition, the data have a well-defined joint distribution, then the value of the objective function at any fixed value of β is clearly a random variable. Then, if the parameter space B is a compact subset of a finite-dimensional Euclidean space, it follows that our estimator is well defined in that there exists a random variable βˆn which satisfies (5), except possibly on a set of probability zero.

Below we present the main results on the asymptotic properties of βˆn as a set of three theorems whose proofs are provided in the Appendix. Before the theorems are presented, we give details on the assumptions under which they are valid.

2.1.1. Consistency

In order to prove consistency, we make the following assumptions.

A1. {(εi, xi)}∞i=1 is an independent and identically distributed (iid) sequence, where εi takes values in R and xi takes values in Rp for some finite p.
A2. The parameter space B is a compact subset of Rp and β0 ∈ B.
A3. The distribution of x is such that: (i) E{∥xi∥} < ∞, where ∥a∥ denotes the Euclidean norm of a for any scalar or finite-dimensional vector a; (ii) Pr{x′iλ = 0} < 1 for all fixed λ ̸= 0.
A4. There exists a version of the conditional density of ε given x, denoted fε|X(·|·): R × Rp → R, such that: (i) sup_{ε∈R, x∈Rp} fε|X(ε|x) = L0 < ∞; (ii) fε|X(ε|x) is continuous in ε for all ε and x; (iii) fε|X(ε|x) ≤ fε|X(0|x) for all ε and x. In addition, there exists a set A ⊆ Rp such that Pr{xi ∈ A} = 1 and fε|X(ε|x) < fε|X(0|x) for all ε ̸= 0 and x ∈ A.
A5. K(·): R → R is a differentiable kernel function such that: (i) ∫ K(u)du = 1; (ii) sup_{u∈R} |K(u)| = c0 < ∞; (iii) sup_{u∈R} |K′(u)| = c1 < ∞, where K′(u) = dK(u)/du.
A6. {δn}∞n=1 is a strictly positive bandwidth sequence such that: (i) δn → 0; (ii) nδn/ln(n) → ∞.

We make Assumption A1 for convenience: the assumptions in the paper could be modified to allow the {(εi, xi)}∞i=1 process to exhibit some dependence, but this would complicate the proofs quite substantially and there would be a trade-off between allowing some dependence in the {(εi, xi)}∞i=1 process (captured, for example, by mixing rates) and strengthening other assumptions (mostly on the moments of xi). Assumptions A2, A3, A4(i) and A4(ii) are all fairly standard. Assumption A4(iii) is specific to the context of mode regression and imposes that the conditional density of ε has a well-defined global mode at 0. Notice, however, that although the global mode needs to be unique, fε|X(ε|x) is not required to be unimodal, and no assumptions are made about the existence of moments of the conditional distribution of ε. Assumption A5 is fairly standard and is satisfied by many commonly used kernel functions, though the required differentiability rules out the kernels used in Lee (1989, 1993). Assumption A6 is a fairly standard condition on the bandwidth sequence and specifies that, although the bandwidth goes to 0, it cannot do so too fast. It is required for the proof of consistency since, unlike Lee (1989, 1993), we do not assume the conditional density of the errors given the regressors is symmetric on an interval around the mode.

Under these assumptions we can establish consistency.

Theorem 1. Under Assumptions A1–A6: βˆn →p β0.

2.1.2. Asymptotic normality

The proof of asymptotic normality requires the following additional assumptions.

B1. E{∥xi∥^{5+ξ}} < ∞ for some ξ > 0.
B2. β0 belongs to the interior of B.
B3. fε|X(ε|x) is three times differentiable with respect to ε for all x such that: (i) f(j)ε|X(ε|x) = ∂^j fε|X(ε|x)/∂ε^j is uniformly bounded for j = 1, 2, 3; (ii) E[f(2)ε|X(0|xi) xi x′i] is negative definite.
B4. K(·) is such that: (i) it is three times differentiable; (ii) ∫ uK(u)du = 0; (iii) lim_{u→±∞} K(u) = 0; (iv) ∫ u^2 |K(u)| du = M0 < ∞; (v) ∫ K′(u)^2 du = M1 < ∞; (vi) sup_{u∈R} |K′′(u)| = M2 < ∞; (vii) sup_{u∈R} |K′′′(u)| = M3 < ∞; (viii) ∫ K′′(u)^2 du = M4 < ∞.
B5. The sequence {δn}∞n=1 is such that: (i) nδn^7 = o(1); (ii) nδn^5/ln(n) → ∞.

Unsurprisingly, each of these additional assumptions involves strengthening a corresponding assumption used in establishing consistency. Thus the true parameter value needs to belong to the interior of the parameter space, the regressors possess moments of higher order, both the conditional density of the disturbances given the regressors and the kernel exhibit greater smoothness, and the bandwidth goes to zero at a more tightly constrained rate. The conditions imposed on the bandwidth are the same as those used by Romano (1988) to establish that the kernel density estimator of the mode is asymptotically unbiased and normal. Finally, notice that no smoothness with respect to x is required and therefore the regressors can be discrete. We are now in a position to characterize the asymptotic distribution of βˆn.

Theorem 2. Under Assumptions A1–A6 and B1–B5:

(nδn^3)^{1/2} (βˆn − β0) →D N[0, Ω0],  (6)

Ω0 = B0^{-1} A0 B0^{-1},  (7)

where:

A0 = lim_{n→∞} Var[(nδn^3)^{1/2} ∂Qn(β)/∂β |β0] = M1 E[fε|X(0|xi) xi x′i],  (8)

B0 = lim_{n→∞} E[∂^2 Qn(β)/∂β∂β′ |β0] = E[f(2)ε|X(0|xi) xi x′i].  (9)
This theorem reveals that, given our bandwidth assumptions, the proposed mode regression estimator converges to a normal distribution at a rate that can be made arbitrarily close to n^{2/7}. Moreover, we see that the variance of the asymptotic distribution, Ω0, depends on the choice of kernel, through M1, and on the interplay between characteristics of the distributions of the regressors and error term. In particular, Ω0 depends on how high and on how concave the conditional density of ε is at the mode. The following theorem provides a way of obtaining a consistent estimator of Ω0.


Theorem 3. Under Assumptions A1–A6 and B1–B5:

Ωˆn = Bˆn^{-1} Aˆn Bˆn^{-1} →p Ω0,  (10)

where:

Aˆn = n^{-1} Σ_{i=1}^{n} δn^{-1} [K′((yi − x′iβˆn)/δn)]^2 (xi x′i),  (11)

Bˆn = n^{-1} Σ_{i=1}^{n} δn^{-3} K′′((yi − x′iβˆn)/δn) (xi x′i).  (12)
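For the normal kernel, K′(u) = −uφ(u) and K′′(u) = (u^2 − 1)φ(u), so (10)–(12) can be computed directly. A sketch of ours (the data, bandwidth, and fitted coefficients below are placeholders, not the paper's):

```python
import numpy as np

# Sketch of the sandwich covariance estimator (10)-(12) with K = phi,
# for which K'(u) = -u*phi(u) and K''(u) = (u**2 - 1)*phi(u).
# beta_hat, delta, and the data below are illustrative placeholders.

def phi(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def omega_hat(y, X, beta_hat, delta):
    n = len(y)
    u = (y - X @ beta_hat) / delta
    k1 = -u * phi(u)                              # K'(u_i)
    k2 = (u**2 - 1.0) * phi(u)                    # K''(u_i)
    A = (X * (k1**2 / delta)[:, None]).T @ X / n  # (11)
    B = (X * (k2 / delta**3)[:, None]).T @ X / n  # (12)
    B_inv = np.linalg.inv(B)
    return B_inv @ A @ B_inv                      # (10)

rng = np.random.default_rng(2)
n = 2000
X = np.column_stack([np.ones(n), rng.chisquare(3, n)])
y = X @ np.array([0.0, 1.0]) + (rng.chisquare(3, n) - 1.0)  # mode-zero errors
Omega = omega_hat(y, X, beta_hat=np.array([0.0, 1.0]), delta=0.5)
# Standard errors then follow from the rate in (6):
# se = np.sqrt(np.diag(Omega) / (n * 0.5**3))
```

Note the division by nδn^3 when converting Ωˆn into standard errors, as implied by the (nδn^3)^{1/2} scaling in (6).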

Here, Bˆn is the conventional observed-Hessian estimator, while Aˆn is an outer-product-of-the-gradient variance estimator. Therefore, if conventional software is used to maximize Qn(β), asymptotically valid inference can be based on the standard misspecification-robust sandwich covariance matrix estimator.

2.2. Implementation issues

Two issues are of paramount importance in the implementation of the proposed mode regression estimator. One, of course, is the choice of the bandwidth. The other is the choice of algorithm to use in the maximization of (4) because this function may have multiple maxima, especially for small values of δn, and it is important to ensure that a global maximum is found. Our approach to these problems is based on the observation that, for K(u) = φ(u),6 maximization of (4) can be seen as solving the following moment conditions



 

E{exp[−(yi − x′iβ)^2 / (2δn^2)] (yi − x′iβ) x′i} = 0.  (13)
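The moment conditions in (13) can be solved by iterating weighted least squares, re-computing the Gaussian weights at each step and warm-starting each fit from the previous, larger-bandwidth fit, as discussed in this section. A sketch of ours (the data, the length of the bandwidth grid, and the convergence tolerances are illustrative choices):

```python
import numpy as np

# Sketch of the weighted least squares fixed point implied by (13): with a
# normal kernel, beta solves beta = (X'WX)^{-1} X'Wy, where W is diagonal
# with w_i = exp(-(y_i - x_i'beta)^2 / (2*delta_n^2)). Data, grid length,
# and tolerances are illustrative, not the paper's.

def iwls_mode(y, X, delta, beta, max_iter=500, tol=1e-9):
    for _ in range(max_iter):
        w = np.exp(-((y - X @ beta) ** 2) / (2.0 * delta**2))
        Xw = X * w[:, None]
        beta_new = np.linalg.solve(Xw.T @ X, Xw.T @ y)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

rng = np.random.default_rng(3)
n = 4000
X = np.column_stack([np.ones(n), rng.chisquare(3, n)])
y = X @ np.array([0.0, 1.0]) + (rng.chisquare(3, n) - 1.0)  # error mode 0, mean 2

beta = np.linalg.solve(X.T @ X, X.T @ y)   # OLS: the delta -> infinity limit
r = y - X @ beta
mad = np.median(np.abs(r - np.median(r)))  # mad of the OLS residuals
# shrinking grid from 50*mad down to 0.5*mad*n^(-0.143), with warm starts
for delta in np.geomspace(50.0 * mad, 0.5 * mad * n ** (-0.143), 15):
    beta = iwls_mode(y, X, delta, beta)
# the intercept moves from the error mean (about 2) toward the mode (0)
```

The endpoints of the grid mirror the range used in the application discussed below; since each fit starts from the previous solution, the whole path of β(δn) is obtained at little extra cost.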

Eq. (13) makes clear that maximization of (4) is essentially a weighted least squares problem that has as special cases mode regression, when δn passes to zero, and mean regression, when δn → ∞. Moreover, for any fixed value of δn, the estimators obtained by maximizing (4) can be understood as members of the class of M-estimators introduced by Huber (1973), which aim to ''give a good fit to the bulk of the data without being perturbed by a small proportion of outliers'' (Maronna et al., 2006, p. 88). The link between mode regression and robust M-estimators was noted by Lee (1989, 1993) and by Lee and Kim (1998), who also noted the close relation between mode regression and the least median of squares estimator of Rousseeuw (1984). These results show that (4) defines a continuum of conditional measures of central tendency, of which the two polar cases have particularly interesting interpretations. This has important implications for the choice of bandwidth because δn not only determines the properties of the estimator but also defines the conditional measure of central tendency that is estimated. Therefore, rather than just estimating the conditional mode for a chosen value of δn, we can estimate the parameters of interest for a wide range of values of δn and obtain a more detailed picture of how these parameters, say β(δn), vary within this class of conditional measures of central tendency. Of course, it is still necessary to define the limits for the sequence of values of δn to be used in the estimation. However, because inference will not be based on a single value of the smoothing parameter, this choice is less critical than the choice of an optimal value of δn to estimate the mode. In the application in Kemp and Santos Silva (2010), we estimate β(δn) for a grid of 100 values of δn between 50 mad and 0.5 mad n^{-0.143},7 where mad denotes the median of the absolute deviations from the median of the ordinary least squares (OLS) residuals, i.e., denoting by b the OLS estimates of β, mad = med_i |(yi − x′ib) − med_j (yj − x′jb)|. From a computational point of view, this strategy is attractive because OLS provides a natural set of starting values for the estimation of β(δn) when δn is large enough. Subsequently, the new estimation results can be used as starting values for the estimation with a smaller value of the smoothing parameter.8 Of course, there is no guarantee that the estimates obtained in this way will correspond to the global maximum of the objective function for each value of δn. Therefore, it is recommended that, at least for an interesting value of the smoothing parameter, additional checks are performed to try to ensure that a global maximum is indeed found.

6 More generally, a similar result holds whenever the kernel used is a function of (yi − x′iβ)^2.

3. Simulation evidence

This section presents the results of a small simulation study illustrating the finite sample performance of the proposed mode regression estimator. In these experiments data are generated by the simple linear model

yi = β0 + β1 xi + (1 + v xi)εi  (i = 1, 2, . . . , n).  (14)

For each replication, the regressor is generated as independent draws from the χ^2(3) distribution, scaled to have variance equal to one, and the error εi is generated as independent draws from a re-scaled log-gamma random variable

εi = −λ ln(Zi),  λ > 0,  (15)

where Zi has a gamma distribution with mean α/κ and variance α/κ^2, for α, κ > 0. It is possible to show that the mode of εi is given by λ ln(κ/α), and therefore we set κ = α to ensure that εi has zero mode. For this choice of parameters, εi will have positive expectation defined by µε = λ[ln(κ) − ψ0(α)], where ψ0(·) denotes the digamma function. The variance of εi is given by λ^2 ψ1(α), where ψ1(·) is the trigamma function, and in our experiments the value of λ is set so that the unconditional variance of the error (1 + v xi)εi is equal to one.9 Finally, εi is positively skewed, with coefficient of skewness −ψ2(α)ψ1(α)^{-3/2}, where ψ2(·) is the quadrigamma function. Having fixed κ and λ, α can be used to control the degree of skewness of the distribution. We perform experiments with α ∈ {0.05, 5.00},10 v ∈ {0, 2} and n ∈ {250, 1000, 4000, 16,000}. For each replication of the experiments, we estimate the conditional mean and the conditional mode of yi, which are both linear functions of xi. Specifically, with this design, Mode(yi|xi) = xi and E(yi|xi) = µε + (1 + vµε)xi, which shows that for the homoskedastic cases (v = 0) the conditional mean and the conditional mode have the same slope. The mode regression estimator was implemented using the iterative weighted least squares estimator described in Section 2.2, for smoothing parameters defined as δn = k mad n^{-0.143}, with k ∈ {0.8, 1.6}.11
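The design in (14)–(15) can be sketched as follows (our own sketch; the truncated-series approximation to the trigamma function is a shortcut of ours, and β0 = 0, β1 = 1 are assumed values, since the paper does not report them explicitly):

```python
import numpy as np

# Sketch of the simulation design (14)-(15): x_i is chi-square(3) scaled to
# unit variance, and eps_i = -lambda*ln(Z_i) with Z_i ~ Gamma(shape alpha,
# scale 1/kappa) and kappa = alpha, so Mode(eps_i) = lambda*ln(kappa/alpha) = 0.
# lambda follows footnote 9; psi_1 is approximated by a truncated series.

def trigamma(a, terms=200_000):
    # psi_1(a) = sum_{k>=0} (a + k)^{-2}, with an Euler-Maclaurin tail correction
    k = np.arange(terms)
    tail = 1.0 / (a + terms) + 0.5 / (a + terms) ** 2
    return np.sum(1.0 / (a + k) ** 2) + tail

def simulate(n, alpha, v, rng):
    x = rng.chisquare(3, n) / np.sqrt(6.0)           # Var(chi2_3) = 6
    Ex, Ex2 = 3.0 / np.sqrt(6.0), 2.5                # exact moments of scaled x
    lam = ((1.0 + 2.0 * Ex * v + Ex2 * v**2) * trigamma(alpha)) ** (-0.5)
    Z = rng.gamma(shape=alpha, scale=1.0 / alpha, size=n)  # kappa = alpha
    eps = -lam * np.log(Z)
    y = x + (1.0 + v * x) * eps                      # beta0 = 0, beta1 = 1 assumed
    return y, x

rng = np.random.default_rng(4)
y, x = simulate(100_000, alpha=5.0, v=0, rng=rng)
# the composite error (1 + v*x)*eps should have unconditional variance near one
```

With v = 0 the composite error is just εi, so its sample variance provides a direct check on the scaling of λ.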

7 The exponent −0.143 is −1/7 rounded down to three decimal places; see part (i) of B5.
8 Alternatively, (4) can be maximized, e.g., using a Newton-type algorithm. This is especially useful when the kernel used is not a function of (yi − x′iβ)^2 and therefore the proposed iterative weighted least squares estimator is not available.
9 Specifically, λ = {[1 + 2E(xi)v + E(x^2_i)v^2] ψ1(α)}^{-0.5}.


Table 1 summarizes the main results obtained with 10,000 replications of the simulation procedure. The table displays the mean and standard error of the estimates obtained with OLS, mode regression with k = 1.6, labeled Mode 1.6, and mode regression with k = 0.8, labeled Mode 0.8. The OLS results are not surprising in any way and illustrate the well-known properties of this estimator. As for the results obtained with Mode 1.6 and Mode 0.8, perhaps the most remarkable finding is that the intercept picks up most of the bias, with the mean of the estimates of the slope being always close to one. Not surprisingly, the biases shrink with the sample size, but the rate at which the biases vanish depends on the degree of skewness and heteroskedasticity of the errors. The standard errors of the mode estimators also shrink with the sample size and, again, the rate at which this happens depends on the characteristics of the conditional distribution. Generally speaking, as expected, Mode 1.6 has smaller standard errors but larger biases than Mode 0.8, and the efficiency penalty of Mode 0.8 also depends on the skewness and heteroskedasticity of the errors. As noted above, OLS and mode regression identify the same slope when v = 0. Therefore, for these cases, it is meaningful to compare the results of the mode estimators with those of OLS. In particular, it is interesting to notice that for α = 5.00 the slopes are estimated with much better precision by OLS, but for α = 0.05 the mode estimators are strong competitors, with Mode 1.6 outperforming OLS for all sample sizes considered in these exercises. The competitiveness of mode regression in this case is, of course, a reflection of the well-known fact that OLS can be outperformed by ''robust'' estimators when the distribution of the errors has high skewness and/or kurtosis (see, e.g., Maronna et al., 2006).
Overall, the results of these experiments are quite encouraging in that they show that the proposed mode estimator is likely to have a reasonable performance in samples of a realistic size.

4. Concluding remarks

We provide the asymptotic results needed for valid inference about the conditional mode of a variate that has a continuous conditional density with a well-defined global mode. The estimator, which is based on a smooth kernel with a bandwidth that is allowed to pass to zero as the sample size increases, is very easy to implement and is valid under mild conditions. In particular, its asymptotic validity does not depend on the symmetry or homoskedasticity of the conditional distribution of interest. The main drawback of this estimator is that it converges at a much slower rate than the usual √n. In spite of this, the simulation results presented in Section 3 suggest that the proposed mode regression estimator can be useful in practice.

A number of extensions of our results are possible. The most obvious is to allow the conditional mode to be a non-linear function. Extending our results to cover this case is straightforward and requires only minimal changes to our assumptions. Also, as noted in Section 2, our assumptions can be modified to allow for some dependency in the data, as is typically found in time-series applications. This extension can be very useful, for example in applications involving forecasting (see, e.g., Collomb et al., 1987; Quintela-Del-Rio and Vieu, 1997; Wallis, 1999). A related area for possible future research is the consideration of functional data as in the recent papers by Ezzahrioui and Ould Saïd (2010) and Ferraty et al. (2012).

when a(·) is an s-dimensional vector valued function and FX (·) is a cumulative distribution function on Rs (for finite s), and so on. A.1. Proof of Theorem 1 First, in Lemma 1 below, we establish that limn→∞ E [Qn (β)] exists and that it is continuous in β ∈ B with a unique global maximum at β = β0 . Second, in Lemma 2 below, we establish that sup |Qn (β) − Q0 (β)| = op (1),

where Q0 (β) = limn→∞ E [Qn (β)], i.e., Qn (β) satisfies a uniform law of large numbers. Since B is compact, then the result of the theorem follows by application of Theorem 2.1 of Newey and McFadden (1994).  Lemma 1. Under Assumptions A1–A6: m (β, δ) =

  Rp

K (u)fε|X (x′ (β − β0 ) + δ u|x)du dFX (x),

Appendix

A.1. Proof of Theorem 1

Throughout, for any finite-dimensional matrix A, ∥A∥ = [tr(A′A)]^{1/2}. Also, integrals are taken over their entire range unless explicitly indicated otherwise, so ∫ a(u) du = ∫_R a(u) du when a(·) is a scalar-valued function and ∫ a(x) dFX(x) = ∫_{R^s} a(x) dFX(x), where FX(·) is the distribution function of x.

Lemma 1. Under Assumptions A1–A6:

    m(β, δ) = ∫∫ δ^{-1} K((ε − x′(β − β0))/δ) fε|X(ε|x) dε dFX(x)    (17)

exists and is continuous for all (β, δ) ∈ R^p × R. In addition, lim_{n→∞} E[Qn(β)] is equal to m(β, 0) and has a strict global maximum over B at β = β0.

Proof. Assumptions A4(i), A4(ii) and A5(i) imply that m(β, δ) exists and is continuous for every (β, δ) ∈ R^p × R by dominated convergence. Assumption A1 implies that E[Qn(β)] = m(β, δn), and then the continuity of m(β, δ) combined with Assumption A6(i) implies that:

    lim_{n→∞} E[Qn(β)] = E[fε|X(xi′(β − β0)|xi)] = m(β, 0).

Assumptions A3(ii) and A4(iii) then imply that E[fε|X(xi′(β − β0)|xi)] < E[fε|X(0|xi)] for all β ≠ β0, which combined with the continuity of m(β, δ) and the compactness of B implies that Q0(β) = lim_{n→∞} E[Qn(β)] achieves a strict global maximum over B at β = β0. □

Lemma 2. Under Assumptions A1–A6:

    sup_{β∈B} |Qn(β) − Q0(β)| = op(1).    (18)

Proof. Under Assumption A2 it follows that there is a constant G0 < ∞ such that for each n = 1, 2, ... we can find a collection Jn of elements of B and a function β̄n(·) : B → Jn such that the number of elements of Jn is less than or equal to G0 n^{2p} and, for every β ∈ B, ∥β̄n(β) − β∥ ≤ n^{-2}. Now, for any β ∈ B define A1n(β) = [Qn(β) − Qn(β̄n(β))], A2n(β) = [Qn(β) − Qne(β)], A3n(β) = [Qne(β̄n(β)) − Q0(β)], Qne(β) = E[Qn(β)], and observe that:

    [Qn(β) − Q0(β)] = A1n(β) + A2n(β̄n(β)) + A3n(β).    (19)

Taking each of the Ajn terms in turn, first, observe that Assumption A5(iii) implies that for any βA, βB ∈ B:

    |Qn(βA) − Qn(βB)| ≤ c1 [n^{-1} Σ_{i=1}^n ∥xi∥] δn^{-2} ∥βA − βB∥,

by the mean value theorem, and hence:

    sup_{β∈B} |A1n(β)| ≤ c1 [n^{-1} Σ_{i=1}^n ∥xi∥] (nδn)^{-2}.

Assumptions A1 and A3(i) imply that Pr[n^{-1} Σ_{i=1}^n ∥xi∥ > η] ≤ E∥xi∥/η for any η > 0, and Assumption A6(ii) implies that nδn → ∞ as n → ∞. Hence it follows that sup_{β∈B} |A1n(β)| = op(1).

G.C.R. Kemp, J.M.C. Santos Silva / Journal of Econometrics 170 (2012) 92–101


Table 1
Simulation results.

                 Intercept                                                        Slope
                 n = 250        n = 1,000      n = 4,000      n = 16,000         n = 250        n = 1,000      n = 4,000      n = 16,000

α = 5.00, v = 0
  OLS            0.220 (0.101)  0.220 (0.050)  0.220 (0.025)  0.220 (0.013)      0.999 (0.064)  1.000 (0.032)  1.000 (0.016)  1.000 (0.008)
  Mode 1.6       0.053 (0.184)  0.037 (0.115)  0.025 (0.072)  0.019 (0.045)      1.002 (0.119)  1.001 (0.072)  1.001 (0.045)  1.000 (0.029)
  Mode 0.8       0.026 (0.369)  0.011 (0.253)  0.007 (0.178)  0.005 (0.125)      1.007 (0.226)  1.005 (0.150)  1.003 (0.103)  1.001 (0.072)

α = 5.00, v = 2
  OLS            0.055 (0.119)  0.055 (0.061)  0.055 (0.031)  0.055 (0.015)      1.109 (0.123)  1.110 (0.062)  1.110 (0.031)  1.110 (0.016)
  Mode 1.6       0.043 (0.096)  0.036 (0.056)  0.028 (0.034)  0.022 (0.021)      1.000 (0.191)  0.992 (0.125)  0.992 (0.083)  0.992 (0.054)
  Mode 0.8       0.019 (0.136)  0.014 (0.086)  0.008 (0.056)  0.006 (0.037)      1.007 (0.268)  0.999 (0.186)  1.000 (0.133)  0.999 (0.094)

α = 0.05, v = 0
  OLS            0.872 (0.101)  0.874 (0.050)  0.874 (0.025)  0.873 (0.012)      1.000 (0.064)  1.000 (0.032)  1.000 (0.016)  1.000 (0.008)
  Mode 1.6       0.276 (0.078)  0.222 (0.035)  0.181 (0.019)  0.145 (0.010)      1.006 (0.051)  1.002 (0.024)  1.001 (0.012)  1.000 (0.006)
  Mode 0.8       0.129 (0.113)  0.089 (0.055)  0.065 (0.029)  0.047 (0.016)      1.018 (0.078)  1.009 (0.039)  1.003 (0.020)  1.001 (0.011)

α = 0.05, v = 2
  OLS            0.219 (0.117)  0.220 (0.061)  0.219 (0.031)  0.219 (0.015)      1.437 (0.120)  1.438 (0.063)  1.438 (0.031)  1.438 (0.016)
  Mode 1.6       0.191 (0.051)  0.174 (0.024)  0.155 (0.012)  0.136 (0.006)      1.053 (0.072)  1.030 (0.034)  1.016 (0.017)  1.006 (0.009)
  Mode 0.8       0.110 (0.049)  0.096 (0.025)  0.082 (0.013)  0.069 (0.007)      1.025 (0.086)  1.004 (0.046)  0.993 (0.026)  0.989 (0.014)
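The qualitative pattern in Table 1 — OLS tracking the gap between conditional mean and conditional mode while the mode-regression bias shrinks with n — can be reproduced in a small illustrative simulation. The design below is not the one used in the paper: it is a minimal sketch with an assumed data-generating process (exponential errors, so the conditional mode is below the conditional mean), a standard normal kernel, and a crude grid search in place of a proper optimizer.

```python
import numpy as np

def mode_regression_grid(y, x, delta, b0_grid, b1_grid):
    """Maximize Q_n(b) = (n*delta)^{-1} sum_i K((y_i - b0 - b1*x_i)/delta)
    over a grid of (b0, b1), with K the standard normal density."""
    best, best_q = None, -np.inf
    for b0 in b0_grid:
        for b1 in b1_grid:
            u = (y - b0 - b1 * x) / delta
            # omit the 1/sqrt(2*pi) factor: it does not change the argmax
            q = np.exp(-0.5 * u**2).sum() / (len(y) * delta)
            if q > best_q:
                best, best_q = (b0, b1), q
    return np.array(best)

rng = np.random.default_rng(0)
n = 4000
x = rng.uniform(0.0, 2.0, n)
# Exponential errors: conditional mode 0, conditional mean 1,
# so Mode(y|x) = 1 + x while E[y|x] = 2 + x.
y = 1.0 + x + rng.exponential(1.0, n)

# OLS estimates the conditional mean: intercept shifted up by ~1.
ols = np.linalg.lstsq(np.column_stack([np.ones(n), x]), y, rcond=None)[0]

delta = 0.3  # smaller bandwidths trade bias for variance, as in Table 1
b_hat = mode_regression_grid(y, x, delta,
                             np.linspace(0.5, 2.5, 81), np.linspace(0.5, 1.5, 41))
print(ols, b_hat)
```

Under this assumed design the OLS intercept lands near 2, whereas the mode-regression intercept stays near the true value 1 up to a smoothing bias of order δn, illustrating the bias/variance pattern across the Mode 1.6 and Mode 0.8 columns of the table.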

Second, observe that for any β* ∈ B:

    A2n(β*) = n^{-1} Σ_{i=1}^n ain(β*),

where:

    ain(β*) = δn^{-1} K((εi − xi′(β* − β0))/δn) − E[δn^{-1} K((εi − xi′(β* − β0))/δn)].

Assumptions A1, A5(ii) and A6 imply that {ain(β*)}_{i=1}^n are iid random variables with mean zero and bounded in absolute value by bn = 2δn^{-1}c0. Furthermore, Assumptions A5(i) and A5(ii) imply that ∫_R K(u)² du is finite, so setting c2 = ∫_R K(u)² du it then follows by Assumptions A1, A4(i) and A6 that:

    Var[ain(β*)] ≤ E{[δn^{-1} K((εi − xi′(β* − β0))/δn)]²} ≤ δn^{-1} d0,

where d0 = L0 c2 is finite and positive. From Bernstein's inequality (see Hoeffding, 1963) it then follows that for any ε > 0:

    Pr[|A2n(β*)| ≥ ε] ≤ 2 exp{−nδn a0(ε)},

where a0(ε) = 3ε²/(6d0 + 4c0ε), and hence that:

    Pr[sup_{β*∈Jn} |A2n(β*)| ≥ ε] ≤ 2G0 n^{2p} exp{−nδn a0(ε)}.

Setting bn = 2p ln(n)/(nδn), it follows that:

    Pr[sup_{β*∈Jn} |A2n(β*)| ≥ ε] ≤ 2G0 exp{−nδn (a0(ε) − bn)} = o(1),

since a0(ε) > 0 for all ε > 0 and since Assumption A6(ii) implies that bn → 0 and nδn → ∞ as n → ∞. Then, since sup_{β∈B} |A2n(β̄n(β))| = sup_{β*∈Jn} |A2n(β*)|, it follows that sup_{β∈B} |A2n(β̄n(β))| = op(1).

Third, under Assumptions A1–A6, Lemma 1 implies that:

    A3n(β) = m(β̄n(β), δn) − m(β, δn).

But from Lemma 1 it follows that m(β, δ) is uniformly continuous on B × [0, δ̄n], where δ̄n = sup_{n≥1} δn, and since sup_{β∈B} |β̄n(β) − β| = o(1) by construction and δn = o(1) by Assumption A6(i), it then follows that sup_{β∈B} |A3n(β)| = o(1).

Thus we have established sup_{β∈B} |Ajn(β)| = op(1) for j = 1, 2, 3, and hence it follows from (19) that sup_{β∈B} |Qn(β) − Q0(β)| = op(1). □

A.2. Proof of Theorem 2

Since β̂n is consistent by Theorem 1 and K(·) is twice continuously differentiable by Assumption B4(vi), then, with probability tending to 1, β̂n belongs to the interior of the parameter space and can be characterized by the first-order conditions:

    0 = (1/(nδn²)) Σ_{i=1}^n K′((yi − xi′β̂n)/δn) xi.
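With a Gaussian kernel, K′(u) is proportional to −u exp(−u²/2), so the first-order conditions above can be rearranged into a weighted-least-squares fixed point: solve Σi wi (yi − xi′β) xi = 0 with weights wi = exp(−ui²/2) evaluated at the current residuals, and iterate. The sketch below is our illustration of that device, not code from the paper; the iteration typically converges to a local solution near the starting value, so in practice several starting values would be tried.

```python
import numpy as np

def mode_irls(y, X, delta, n_iter=1000):
    """Solve sum_i K'((y_i - x_i'b)/delta) x_i = 0 for a Gaussian kernel by the
    fixed-point iteration b <- (sum w_i x_i x_i')^{-1} sum w_i x_i y_i,
    with w_i = exp(-u_i^2 / 2) recomputed at the current residuals."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]  # OLS starting value
    for _ in range(n_iter):
        u = (y - X @ b) / delta
        w = np.exp(-0.5 * u**2)
        Xw = X * w[:, None]
        b_new = np.linalg.solve(Xw.T @ X, Xw.T @ y)
        if np.max(np.abs(b_new - b)) < 1e-10:
            b = b_new
            break
        b = b_new
    return b

rng = np.random.default_rng(1)
n = 2000
x = rng.uniform(0.0, 2.0, n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + x + rng.exponential(1.0, n)  # assumed DGP: Mode(y|x) = 1 + x

delta = 0.4
b_hat = mode_irls(y, X, delta)
u = (y - X @ b_hat) / delta
foc = (np.exp(-0.5 * u**2) * u) @ X / n  # proportional to the FOC; ~0 at a solution
print(b_hat, foc)
```

At convergence the weights no longer change, so the weighted-least-squares solution reproduces itself and the first-order condition holds; the downweighting of large positive residuals is what pulls the intercept from the conditional mean toward the conditional mode.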

Expanding these first-order conditions around β0 gives:

    0 = (1/(nδn²)) Σ_{i=1}^n K′(εi/δn) xi + (1/(nδn³)) Σ_{i=1}^n K″((yi − xi′β̂n*)/δn) xi xi′ (β̂n − β0),    (20)

where β̂n* lies on the line segment joining β0 and β̂n (formally, we may need to use a different β̂n* for each equation in this system of p equations). Multiplying (20) by (nδn³)^{1/2} and rearranging gives:

    (nδn³)^{1/2} (β̂n − β0) = −[n^{-1} Σ_{i=1}^n δn^{-3} K″((yi − xi′β̂n*)/δn) xi xi′]^{-1} [(nδn)^{-1/2} Σ_{i=1}^n K′(εi/δn) xi].

Lemma 3 below ensures that:

    (nδn)^{-1/2} Σ_{i=1}^n K′(εi/δn) xi →D N[0, A0],

where:

    A0 = M1 · E[fε|X(0|x) xx′] = lim_{n→∞} E[δn^{-1} (K′(ε/δn))² xx′].

Using integration by parts and transformation of variables, it follows from Lemma 4 below and the consistency of β̂n that:

    n^{-1} Σ_{i=1}^n δn^{-3} K″((yi − xi′β̂n*)/δn) xi xi′ = B0 + op(1),

where:

    B0 = E[f(2)ε|X(0|x) xx′] = lim_{n→∞} E[δn^{-3} K″(ε/δn) xx′],

which is invertible by Assumption B3(ii) and is also symmetric. Hence:

    (nδn³)^{1/2} (β̂n − β0) = −B0^{-1} [(nδn)^{-1/2} Σ_{i=1}^n K′(εi/δn) xi] + op(1) →D N[0, B0^{-1} A0 B0^{-1}],

as desired. □

Lemma 3. Under Assumptions A1–A6 and B1–B5:

    (nδn³)^{1/2} ∂Qn/∂β|β0 →D N[0, A0],    (21)

where A0 = M1 · E[fε|X(0|xi) xi xi′] is positive definite.

Proof. First, let gin = n^{-1/2} δn^{-1/2} K′(εi/δn) xi and gne = E[gin], so that:

    (nδn³)^{1/2} ∂Qn/∂β|β0 = −Σ_{i=1}^n gin = −ngne − Σ_{i=1}^n (gin − gne).    (22)

Using integration by parts it follows that:

    ngne = ∫ [n^{1/2} δn^{1/2} K(ε/δn) fε|X(ε|x)]_{ε=−∞}^{ε=∞} x dFX(x) − ∫∫ n^{1/2} δn^{1/2} K(ε/δn) f(1)ε|X(ε|x) x dε dFX(x).

Assumptions A4(i) and B4(iii) imply that [n^{1/2} δn^{1/2} K(ε/δn) fε|X(ε|x)]_{−∞}^{∞} = 0. Hence, by transformation of variables we obtain:

    ngne = −∫∫ n^{1/2} δn^{3/2} K(u) f(1)ε|X(δn u|x) x du dFX(x).

By Assumption B3, a second-order Taylor series expansion of f(1)ε|X(δn u|x) around u = 0 for given x implies that:

    f(1)ε|X(δn u|x) = f(1)ε|X(0|x) + (δn u) f(2)ε|X(0|x) + (1/2)(δn u)² f(3)ε|X(τδn u|x),

for some 0 ≤ τ ≤ 1 (which may vary with δn, u, and x). Assumptions A4(iii) and B3 imply that f(1)ε|X(0|x) = 0, and thus:

    ngne = −∫∫ n^{1/2} δn^{5/2} u K(u) f(2)ε|X(0|x) x du dFX(x) − (1/2) ∫∫ n^{1/2} δn^{7/2} u² K(u) f(3)ε|X(τδn u|x) x du dFX(x).

Assumptions A4(i), A4(ii), A6(i) and B4(v) imply that:

    lim_{n→∞} ∫∫ n^{1/2} δn^{5/2} u K(u) x f(2)ε|X(0|x) du dFX(x) = 0.

Assumptions A3(i), B3(i), B4(iv) and B5(i) imply that:

    |∫∫ n^{1/2} δn^{7/2} u² K(u) x f(3)ε|X(τδn u|x) du dFX(x)| ≤ (nδn^7)^{1/2} ∫ u² |K(u)| du · E{|xi|} · sup_{ε,x} |f(3)ε|X(ε|x)| = o(1).

Note that this requires that nδn^7 = o(1), as given in Assumption B5(ii). Hence it follows that ngne = o(1).

Second, fix any λ ∈ R^p such that λ ≠ 0 and set zin = [gin − gne]′λ. Clearly, by construction, E(zin) = 0, which implies that:

    Σ_{i=1}^n E{|zin|²} = nE{|gin′λ|²} − n^{-1} [(ngne)′λ]².

From above we have that ngne = o(1) and hence n^{-1}[(ngne)′λ]² = o(1). By transformation of variables we obtain:

    nE{|gin′λ|²} = ∫∫ |K′(u)|² (x′λ)² fε|X(δn u|x) du dFX(x).

Assumptions A3(i), B3(i) and B4(ii) then imply that:

    lim_{n→∞} Σ_{i=1}^n E{|zin|²} = M1 · E[fε|X(0|xi) (λ′xi)²] = λ′A0λ = ω².    (23)

Assumptions A3(ii), A4(i), A4(ii), A5(i), B1, B4(i) and B4(v) imply that ω² is finite and strictly positive. Since the data are iid, by Assumption A1, it follows that:

    Var[(nδn³)^{1/2} (∂Qn(β)/∂β|β0)′λ] = Var[Σ_{i=1}^n gin′λ] = Σ_{i=1}^n E{|zin|²},

and hence:

    A0 = lim_{n→∞} Var[(nδn³)^{1/2} ∂Qn(β)/∂β|β0] = M1 · E[fε|X(0|xi) xi xi′],

is finite and non-negative definite.


Third, observe that |x1 + x2|^r ≤ 2^{r−1}(|x1|^r + |x2|^r) for any r ≥ 1 and real x1 and x2. Hence, for any ρ > 0 such that E{|zin|^{2+ρ}} < ∞, by setting r = 2 + ρ, x1 = gin′λ, x2 = −(gne)′λ, and noting that the data are iid by Assumption A1:

    Σ_{i=1}^n E{|zin|^{2+ρ}} ≤ 2^{1+ρ} n [E{|gin′λ|^{2+ρ}} + |(gne)′λ|^{2+ρ}].    (24)

But n|(gne)′λ|^{2+ρ} = n^{−(1+ρ)} |(ngne)′λ|^{2+ρ} = o(1), since ngne = o(1) as established above. In addition, Assumption A6(ii) implies that nδn → ∞ and hence by transformation of variables it follows that:

    nE{|gin′λ|^{2+ρ}} ≤ (nδn)^{−ρ/2} L0 ∫ |K′(u)|^{2+ρ} du · E{|xi|^{2+ρ}} → 0.    (25)

Together, (23) and (25) imply that Σ_{i=1}^n zin satisfies the conditions of the Lyapunov central limit theorem, see Theorem 2.4.2 in Bierens (1994), and thus Σ_{i=1}^n zin →D N[0, ω²], where 0 < ω² = λ′A0λ < ∞. Since λ ≠ 0 was arbitrary and ngne = o(1), then (nδn³)^{1/2} ∂Qn/∂β|β0 →D N[0, A0], as desired. □

Lemma 4. Under Assumptions A1–A6 and B1–B5:

    sup_{β∈B} ∥∂²Qn(β)/∂β∂β′ − ∂²Q0(β)/∂β∂β′∥ = op(1),    (26)

where:

    ∂²Q0(β)/∂β∂β′ = E[f(2)ε|X(xi′(β − β0)|xi) xi xi′],    (27)

is continuous in β.

Proof. The proof of this lemma follows a similar approach to that of Lemma 2 but in addition makes use of a trimming argument. Observe that since E(xi xi′) is finite, by Assumption B1, and f(j)ε|X(ε|x) is continuous in ε and uniformly bounded from above for j = 0, 1, 2, 3, by Assumptions A4(i) and B3(i), then we can interchange the order of taking derivatives with respect to β and taking expectations with respect to xi to establish that:

    ∂²Q0(β)/∂β∂β′ = ∂²E[fε|X(xi′(β − β0)|xi)]/∂β∂β′ = E[f(2)ε|X(xi′(β − β0)|xi) xi xi′],

is well-defined. Next fix any λ ∈ R^p such that λ ≠ 0 and define:

    Hn(β) = λ′ [∂²Qn/∂β∂β′|β] λ,    H0(β) = λ′ [∂²Q0/∂β∂β′|β] λ.

Define G0, {Jn}_{n=1}^∞, and {β̄n(·)}_{n=1}^∞ as in the proof of Lemma 2. Now, for any β ∈ B define C1n(β) = [Hn(β) − Hn(β̄n(β))], C2n(β) = [Hn(β) − Hne(β)], C3n(β) = [Hne(β̄n(β)) − Hne(β)], C4n(β) = [Hne(β) − H0(β)], Hne(β) = E[Hn(β)], and note that:

    [Hn(β) − H0(β)] = C1n(β) + C2n(β̄n(β)) + C3n(β) + C4n(β).

Taking the Cjn terms one by one, first, Assumption B4(vii) implies that for any βA, βB ∈ B:

    |Hn(βA) − Hn(βB)| ≤ M3 [n^{-1} Σ_{i=1}^n ∥xi∥³] · ∥λ∥² · δn^{-4} ∥βA − βB∥,

by the mean value theorem, and hence it follows by Assumptions A1, A6(ii), B1 and B5(ii) that:

    sup_{β∈B} |C1n(β)| ≤ M3 [n^{-1} Σ_{i=1}^n ∥xi∥³] · ∥λ∥² · (nδn²)^{-2} = op(1).    (28)

In addition, for any fixed β ∈ B, C3n(β) = −E[C1n(β)], and hence it follows from (28) that:

    sup_{β∈B} |C3n(β)| = sup_{β∈B} |E{C1n(β)}| ≤ E[sup_{β∈B} |C1n(β)|] = o(1).    (29)

Second, define:

    hin,1(β) = δn^{-3} K″((yi − xi′β)/δn) (xi′λ)² 1[(xi′λ)² ≤ δn^{-2}],
    hin,2(β) = δn^{-3} K″((yi − xi′β)/δn) (xi′λ)² 1[(xi′λ)² > δn^{-2}],

so that Hn(β) = Hn,1(β) + Hn,2(β), where Hn,j(β) = n^{-1} Σ_{i=1}^n hin,j(β) for j = 1, 2. Also define Hne,j(β) = E[Hn,j(β)] = E[hin,j(β)] and C2n,j(β) = Hn,j(β) − Hne,j(β) for j = 1, 2. Assumption B4(vi) implies that |hin,1(β)| ≤ δn^{-5} M2 and hence that |hin,1(β) − Hne,1(β)| ≤ 2δn^{-5} M2. Assumptions A4(i), B1, and B4(viii) imply that:

    Var[hin,1(β)] ≤ δn^{-5} · L0 · ∫ [K″(u)]² du · E[(xi′λ)⁴] = δn^{-5} d1,

where d1 is finite and positive (and depends on λ). By a parallel line of argument to that used in the proof of Lemma 2, it follows from Bernstein's inequality that:

    Pr[sup_{β∈B} |C2n,1(β̄n(β))| ≥ ε] ≤ G0 n^{2p} exp{−nδn^5 a1(ε)},

for all ε > 0, where a1(ε) = 3ε²/(6d1 + 4M2ε). Since a1(ε) > 0 for all ε > 0, and since 2p ln(n)/(nδn^5) = o(1) and nδn^5 → ∞, by Assumption B5(ii), it follows that:

    sup_{β∈B} |C2n,1(β̄n(β))| = op(1).

Now observe that by Assumption B4(vi):

    sup_{β∈B} |hin,2(β̄n(β))| ≤ δn^{-3} M2 (xi′λ)² 1[(xi′λ)² > δn^{-2}],

and hence that:

    E[sup_{β∈B} |C2n,2(β̄n(β))|] ≤ 2δn^{-3} M2 E{(xi′λ)² 1[(xi′λ)² > δn^{-2}]},

noting that sup_{β∈Jn} |Hne,2(β)| = sup_{β∈Jn} |E[hin,2(β)]| ≤ E{sup_{β∈Jn} |hin,2(β)|}. Now fix r = (5 + ξ)/2, where ξ is defined as in Assumption B1(i), and note that since ξ > 0 then r > 1 and also that E{|xi|^{2r}} < ∞. In addition, define s = r/(r − 1) and note that s > 1 and that r^{-1} + s^{-1} = 1. From the Hölder inequality it follows that:

    E{(xi′λ)² 1[(xi′λ)² > δn^{-2}]} ≤ [E{|xi′λ|^{2r}}]^{1/r} [E{1[(xi′λ)² > δn^{-2}]^s}]^{1/s}.

In addition:

    E{1[(xi′λ)² > δn^{-2}]^s} = Pr[(xi′λ)² > δn^{-2}] ≤ E{|xi′λ|^{2r}} / δn^{-2r},

and therefore:

    E{(xi′λ)² 1[(xi′λ)² > δn^{-2}]} ≤ [E{|xi′λ|^{2r}}]^{1/r} [E{|xi′λ|^{2r}}]^{1/s} δn^{2r/s} = E{|xi′λ|^{2r}} δn^{2r−2}.

It follows that E{sup_{β∈B} |C2n,2(β̄n(β))|} = O(δn^{2r−5}) = O(δn^ξ) = o(1), by Assumption A6(i), and hence that sup_{β∈B} |C2n,2(β̄n(β))| = op(1) by the Markov inequality. Since C2n(β) = C2n,1(β) + C2n,2(β), this implies that sup_{β∈B} |C2n(β̄n(β))| = op(1).

Third, observe that by repeated application of integration by parts and transformation of variables, it follows that:

    Hne(β) = ∫∫ K(u) (xi′λ)² f(2)ε|X(x′(β − β0) + δn u|x) du dFX(x).

Assumptions A5(i), A5(ii), A6(i), B1 and B4(i) then imply that sup_{β∈B} |C4n(β)| = o(1) by dominated convergence. Finally, putting all of these results together, we have that:

    sup_{β∈B} |Hn(β) − H0(β)| = op(1),

and since λ ≠ 0 was set at an arbitrary value it follows that:

    sup_{β∈B} ∥∂²Qn(β)/∂β∂β′ − ∂²Q0(β)/∂β∂β′∥ = op(1),

as desired. □

A.3. Proof of Theorem 3

Define:

    Ãn(β) = n^{-1} Σ_{i=1}^n δn^{-1} [K′((yi − xi′β)/δn)]² (xi xi′),
    B̃n(β) = n^{-1} Σ_{i=1}^n δn^{-3} K″((yi − xi′β)/δn) (xi xi′) = ∂²Qn(β)/∂β∂β′,

and note that Ân = Ãn(β̂n) and B̂n = B̃n(β̂n). Lemma 5 below establishes that Ân converges in probability to A0. From Lemma 4 together with the consistency of β̂n it follows that B̂n converges in probability to B0. Since B0 is non-singular, it then follows that B̂n^{-1} converges in probability to B0^{-1}. Hence it follows that Ω̂n converges in probability to Ω0, as desired. □

Lemma 5. Under Assumptions A1–A6 and B1–B5:

    Ân = n^{-1} Σ_{i=1}^n δn^{-1} [K′((yi − xi′β̂n)/δn)]² (xi xi′) = A0 + op(1).    (30)

Proof. The proof of this lemma follows a similar approach to that of Lemmas 2 and 4. Define:

    Ane(β) = E[Ãn(β)] = ∫∫ δn^{-1} [K′((ε − x′(β − β0))/δn)]² (xi xi′) fε|X(ε|x) dε dFX(x),
    A0e(β) = ∫∫ [K′(u)]² (xi xi′) fε|X(x′(β − β0)|x) du dFX(x),

which clearly both exist, by Assumptions A4(i), A5(iii), A6 and B1. Note that A0 = A0e(β0). Next fix any λ ∈ R^p such that λ ≠ 0, define Sn(β) = λ′Ãn(β)λ, Sne(β) = E[Sn(β)], and S0e(β) = lim_{n→∞} Sne(β), and note that S0e(β) = λ′A0e(β)λ. Define G0, {Jn}_{n=1}^∞, and {β̄n(·)}_{n=1}^∞ as in the proof of Lemma 2. Then define D1n(β) = [Sn(β) − Sn(β̄n(β))], D2n(β) = [Sn(β) − Sne(β)], D3n(β) = [Sne(β̄n(β)) − Sne(β)], D4n(β) = [Sne(β) − S0e(β)], and note that:

    Sn(β) − S0e(β) = D1n(β) + D2n(β̄n(β)) + D3n(β) + D4n(β).

Taking the Djn terms one by one, first, observe that Assumption A5(iii) implies that for any βA, βB ∈ B such that ∥βA − βB∥ ≤ n^{-2}:

    |Sn(βA) − Sn(βB)| ≤ 2c1 M2 [n^{-1} Σ_{i=1}^n ∥xi∥³] · ∥λ∥² · δn^{-2} ∥βA − βB∥,

by the mean value theorem, and hence it follows by Assumptions A1, A6(ii) and B1 that:

    sup_{β∈B} |D1n(β)| ≤ 2c1 M2 [n^{-1} Σ_{i=1}^n ∥xi∥³] · ∥λ∥² · (nδn)^{-2} = op(1).    (31)

In addition, for any fixed β ∈ B, D3n(β) = −E[D1n(β)], and hence it follows from (31) that:

    sup_{β∈B} |D3n(β)| = sup_{β∈B} |E{D1n(β)}| ≤ E[sup_{β∈B} |D1n(β)|] = o(1).

Second, define:

    sin,1(β) = δn^{-1} [K′((yi − xi′β)/δn)]² (xi′λ)² 1[(xi′λ)² ≤ δn^{-2}],
    sin,2(β) = δn^{-1} [K′((yi − xi′β)/δn)]² (xi′λ)² 1[(xi′λ)² > δn^{-2}],

so that Sn(β) = Sn,1(β) + Sn,2(β), where Sn,j(β) = n^{-1} Σ_{i=1}^n sin,j(β) for j = 1, 2. Also define Sne,j(β) = E[Sn,j(β)] = E[sin,j(β)] and D2n,j(β) = Sn,j(β) − Sne,j(β) for j = 1, 2. Assumption A5(iii) implies that |sin,1(β)| ≤ δn^{-3} c1² and hence that |sin,1(β) − Sne,1(β)| ≤ b̃n = 2δn^{-3} c1². Assumptions A4, A5(iii), and B1 imply that:

    Var[sin,1(β)] ≤ δn^{-1} · L0 · ∫ [K′(u)]⁴ du · E[(xi′λ)⁴] = δn^{-1} d2,

where d2 is finite and positive (and depends on λ). By a parallel line of argument to that used in the proof of Lemma 2, it follows from Bernstein's inequality that:

    Pr[sup_{β∈B} |D2n,1(β̄n(β))| ≥ ε] ≤ G0 n^{2p} exp{−nδn a2(ε)},

where a2(ε) = 3ε²/(6d2 + 4c0ε), and hence that:

    lim_{n→∞} Pr[sup_{β∈B} |D2n,1(β̄n(β))| ≥ ε] = 0.

Now observe that by Assumption A5(iii):

    sup_{β∈B} |sin,2(β̄n(β))| ≤ δn^{-1} c1² (xi′λ)² 1[(xi′λ)² > δn^{-2}],

and hence that:

    E[sup_{β∈B} |D2n,2(β̄n(β))|] ≤ 2δn^{-1} c1² E{(xi′λ)² 1[(xi′λ)² > δn^{-2}]},

noting that sup_{β∈Jn} |Sne,2(β)| = sup_{β∈Jn} |E[sin,2(β)]| ≤ E{sup_{β∈Jn} |sin,2(β)|}.

By a parallel line of argument to that used in the proof of Lemma 4, it follows that E{sup_{β∈B} |D2n,2(β̄n(β))|} = O(δn^{2+ξ}) = o(1), and thus sup_{β∈B} |D2n,2(β̄n(β))| = op(1) by the Markov inequality. Since D2n(β) = D2n,1(β) + D2n,2(β), this implies that sup_{β∈B} |D2n(β̄n(β))| = op(1).

Third, observe that transformation of variables implies that:

    Sne(β) = ∫∫ [K′(u)]² (xi′λ)² fε|X(x′(β − β0) + δn u|x) du dFX(x).

Assumptions A5(i), A5(ii), A6(i), B1 and B4(i) then imply that sup_{β∈B} |D4n(β)| = o(1) by dominated convergence. Finally, putting these properties together we have that:

    sup_{β∈B} |Sn(β) − S0e(β)| = op(1).

But since β̂n converges in probability to β0, by Theorem 1, it follows that Sn(β̂n) converges in probability to S0e(β0) = λ′A0e(β0)λ = λ′A0λ, since A0 = A0e(β0). Since λ ≠ 0 it then follows that Ân = Ãn(β̂n) converges in probability to A0, as desired. □
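The covariance-matrix estimator whose consistency Theorem 3 and Lemma 5 establish can be assembled directly from the expressions above: Ân = n^{-1} Σ δn^{-1}[K′((yi − xi′β̂n)/δn)]² xi xi′ and B̂n = n^{-1} Σ δn^{-3} K″((yi − xi′β̂n)/δn) xi xi′, giving Ω̂n = B̂n^{-1} Ân B̂n^{-1} as the estimated asymptotic variance of (nδn³)^{1/2}(β̂n − β0). The sketch below is our illustration with a Gaussian kernel on simulated data; the plug-in value of β̂n and the data-generating process are assumptions for the example, not taken from the paper.

```python
import numpy as np

def sandwich_mode(y, X, b_hat, delta):
    """Omega_n = B_n^{-1} A_n B_n^{-1} for a Gaussian kernel K, using
    K'(u) = -u K(u) and K''(u) = (u^2 - 1) K(u)."""
    n = len(y)
    u = (y - X @ b_hat) / delta
    K = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    Kp = -u * K                      # K'
    Kpp = (u**2 - 1.0) * K           # K''
    A = (X * (Kp**2 / delta)[:, None]).T @ X / n     # A_n
    B = (X * (Kpp / delta**3)[:, None]).T @ X / n    # B_n
    B_inv = np.linalg.inv(B)
    return B_inv @ A @ B_inv

rng = np.random.default_rng(2)
n = 4000
x = rng.uniform(0.0, 2.0, n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + x + rng.exponential(1.0, n)  # assumed DGP for the example

delta = 0.5
Omega = sandwich_mode(y, X, np.array([1.4, 1.0]), delta)  # hypothetical plug-in b_hat
# Standard errors for b_hat: Omega scaled by (n * delta**3)^{-1}, per Theorem 2.
se = np.sqrt(np.diag(Omega) / (n * delta**3))
print(Omega, se)
```

Since Ân is a sum of outer products it is positive semi-definite by construction, and B̂n is symmetric, so Ω̂n is automatically symmetric and positive semi-definite whenever B̂n is invertible.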


References

Amemiya, T., 1985. Advanced Econometrics. Harvard University Press, Cambridge, MA.
Bierens, H.J., 1994. Topics in Advanced Econometrics. Cambridge University Press, Cambridge.
Bound, J., Krueger, A.B., 1991. The extent of measurement error in longitudinal earnings data: do two wrongs make a right? Journal of Labor Economics 9, 1–24.
Cardoso, A.R., Portugal, P., 2005. Contractual wages and the wage cushion under different bargaining settings. Journal of Labor Economics 23, 875–902.
Chernoff, H., 1964. Estimation of the mode. Annals of the Institute of Statistical Mathematics 16, 31–41.
Collomb, G., Härdle, W., Hassani, S., 1987. A note on prediction via estimation of the conditional mode function. Journal of Statistical Planning and Inference 15, 227–236.
Eddy, W., 1980. Optimum kernel estimators of the mode. Annals of Statistics 9, 870–882.
Ezzahrioui, M., Ould Saïd, E., 2010. Some asymptotic results of a non-parametric conditional mode estimator for functional time-series data. Statistica Neerlandica 64, 171–201.
Ferraty, F., Quintela-Del-Rio, A., Vieu, Ph., 2012. Specification test for conditional distribution with functional data. Econometric Theory 28, 363–386.
Gannouna, A., Saracco, J., Yu, K., 2010. On semiparametric mode regression estimation. Communications in Statistics—Theory and Methods 39, 1141–1157.
Hall, P., Racine, J.S., Li, Q., 2004. Cross-validation and the estimation of conditional probability densities. Journal of the American Statistical Association 99, 1015–1026.
Hayfield, T., Racine, J.S., 2008. Nonparametric econometrics: the np package. Journal of Statistical Software 27 (5). Available at: http://www.jstatsoft.org/v27/i05/.
Hoeffding, W., 1963. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association 58, 13–30.
Horowitz, J.L., 1992. A smoothed maximum score estimator for the binary response model. Econometrica 60, 505–531.
Hu, Y., Schennach, S.M., 2008. Instrumental variable treatment of nonclassical measurement error models. Econometrica 76, 195–216.
Huber, P.J., 1973. Robust regression: asymptotics, conjectures and Monte Carlo. Annals of Statistics 1, 799–821.
Karlsson, M., 2004. Finite sample properties of the QME. Communications in Statistics—Simulation and Computation 33, 567–583.
Kemp, G.C.R., Santos Silva, J.M.C., 2010. Regression towards the mode. Department of Economics, University of Essex, Discussion Paper No. 686.
Kim, J.K., Pollard, D., 1990. Cube-root asymptotics. Annals of Statistics 18, 191–219.
Koenker, R., Bassett Jr., G.S., 1978. Regression quantiles. Econometrica 46, 33–50.
Lee, M.J., 1989. Mode regression. Journal of Econometrics 42, 337–349.
Lee, M.J., 1993. Quadratic mode regression. Journal of Econometrics 57, 1–19.
Lee, M.J., Kim, H.J., 1998. Semiparametric econometric estimators for a truncated regression model: a review with an extension. Statistica Neerlandica 52, 200–225.
Manski, C.F., 1991. Regression. Journal of Economic Literature 29, 34–50.
Maronna, R.A., Martin, R.D., Yohai, V.J., 2006. Robust Statistics: Theory and Methods. John Wiley & Sons, Chichester.
Martins, P.S., Solon, G., Thomas, J., 2010. Measuring what employers really do about entry wages over the business cycle. NBER Working Paper 15767.
Müller, H.G., 1985. Kernel estimators of zeros and of location and size of extrema of regression functions. Scandinavian Journal of Statistics 12, 221–232.
Müller, H.G., 1989. Adaptive nonparametric peak estimation. Annals of Statistics 17, 1053–1069.
Newey, W.K., McFadden, D.L., 1994. Large sample estimation and hypothesis testing. In: Engle, R.F., McFadden, D.L. (Eds.), Handbook of Econometrics, vol. 4. North-Holland, Amsterdam, pp. 2111–2245 (Chapter 36).
Parzen, E., 1962. Estimation of a probability density function and mode. The Annals of Mathematical Statistics 33, 1065–1076.
Quintela-Del-Rio, A., Vieu, Ph., 1997. A nonparametric conditional mode estimate. Journal of Nonparametric Statistics 8, 253–266.
Romano, J.P., 1988. On weak convergence and optimality of kernel density estimates of the mode. Annals of Statistics 16, 629–647.
Rousseeuw, P.J., 1984. Least median of squares regression. Journal of the American Statistical Association 79, 871–880.
Samanta, M., Thavaneswarn, A., 1990. Non-parametric estimation of the conditional mode. Communications in Statistics—Theory and Methods 19, 4515–4524.
Silverman, B.W., 1986. Density Estimation for Statistics and Data Analysis. Chapman & Hall, London.
Temple, J.R.W., 1998. Robustness tests of the augmented Solow model. Journal of Applied Econometrics 13, 361–375.
Temple, J.R.W., 1999. The new growth evidence. Journal of Economic Literature 37, 112–156.
Wallis, K.F., 1999. Asymmetric density forecasts of inflation and the Bank of England's fan chart. National Institute Economic Review 167, 106–112.