Transportation Research Part A 82 (2015) 240–254
Critical assessment of five methods to correct for endogeneity in discrete-choice models
C. Angelo Guevara
Universidad de los Andes, Chile
Article info
Article history: Received 2 July 2015; received in revised form 1 October 2015; accepted 16 October 2015; available online 3 November 2015.
Keywords: Multiple indicator solution; Proxies; Control-function; Maximum-likelihood; Latent-variables; Monte Carlo
Abstract
Endogeneity often arises in discrete-choice models, precluding the consistent estimation of the model parameters, but it is habitually neglected in practical applications. The purpose of this article is to help close that gap by assessing five methods to address endogeneity in this context: the use of Proxies (PR); the two-step Control-Function (CF) method; the simultaneous estimation of the CF method via Maximum-Likelihood (ML); the Multiple Indicator Solution (MIS); and the integration of Latent-Variables (LV). The assessment is first made qualitatively, in terms of the formulation, normalization and data needs of each method. Then, the evaluation is made quantitatively, by means of a Monte Carlo experiment to study the finite sample properties under a unified data generation process, and to analyze the impact of common flaws. The methods studied differ notably in the range of problems that they can address; their underlying assumptions; the difficulty of gathering the auxiliary variables needed to apply them; and their practicality, both in terms of the need for coding and their computational burden. The analysis developed in this article shows that PR is formally inappropriate for many cases, but it is easy to apply, and often corrects in the right direction. CF is also easy to apply with canned software, but requires instrumental variables, which may be hard to collect in various contexts. Since CF is estimated in two stages, it may also compromise efficiency and complicate the estimation of standard errors. ML guarantees efficiency and direct estimation of the standard errors, but at the cost of the larger computational burden required for the estimation of a multifold integral, with potential difficulties in identification, and while retaining the difficulty of gathering proper instrumental variables. The MIS method appears relatively easy to apply and requires indicators that may be easier to obtain in various cases. Finally, the LV approach appears to be the most versatile method, but at a high cost in computational burden, problems of identification and limitations in the capability of writing proper structural equations for the latent variable.
© 2015 Elsevier Ltd. All rights reserved.
http://dx.doi.org/10.1016/j.tra.2015.10.005

1. Introduction: Causes, impact and cures for endogeneity

Endogeneity occurs when some explanatory variables are correlated with the error term of an econometric model due to, among other things, omitted attributes, measurement or specification errors, simultaneous determination or self-selection. This issue is almost unavoidable in various practical cases, and it results in inconsistent estimators of the parameters, invalidating any type of analysis performed with the model. Forecasting or behavioral assessment for policy design may be seriously misled if they rely on models with inconsistent estimators (see e.g. Guevara and Thomas, 2007).
For example, in a mode choice model between public and private urban transportation, it is likely that the perceived level of discomfort (due to e.g. crowding) will grow with travel time. Since the perceived level of discomfort is relevant for the decision maker, but very difficult to measure by the researcher, its omission will cause endogeneity. This omission will make the model useless for assessing policies that enhance comfort, such as providing air conditioning, or redesigning the vehicle's layout. Besides, this omission will result in poor forecasting capabilities and in an overestimation of the value of travel time savings, which will be confounded with the improvements in comfort. Various researchers have reported results that seem to confirm the existence of endogeneity due to the omission of comfort in public transportation models (see e.g. Wardman and Whelan, 2011 and Tirachini et al., 2013).

Similar problems can be found in models of residential location. When choosing a residence, individuals take into consideration a large list of dwelling attributes, such as location, price, size, painting, layout, natural illumination, orientation or neighborhood attributes. However, the researcher is likely to be able to measure, at best, the dwelling price, size and location, omitting many relevant attributes. Those omitted attributes will be positively correlated with the dwelling's price because of market forces. The better the dwelling, the larger the number of households willing to bid for it, which increases its market price. Therefore, a residential location model that neglects the impact of those omitted attributes will underestimate the impact of price in the choice process.
The prevalence of this type of endogeneity is evidenced by the results of different researchers who have reported estimated coefficients of housing price that are not significant, or even positive, when endogeneity was not addressed (Guevara and Ben-Akiva, 2006, 2012; Guevara, 2005, 2010; Bhat and Guo, 2004; Sermons and Koppelman, 2001; Waddell, 1992; Quigley, 1976).

A third example, among many others, is interurban mode choice modeling. In this case, different services (bus, train or airplane) are likely to compete with one another in terms of price and different dimensions of level of service. Thus, a premium alternative will be more highly priced but, in compensation, it may offer shorter travel or waiting time, fewer transfers, larger space between seats, better seats, food, entertainment, safety and security, a smaller carbon footprint, additional amenities, or just a more attentive and caring crew. The choice-maker will somehow take into consideration some of these amenities, but the researcher will usually be able to account only for price, travel and waiting time, and maybe the number of transfers. Therefore, a choice model omitting some of the various dimensions of the level of service will suffer from endogeneity. Evidence of this type of endogeneity in interurban mode choice, although treated at an aggregated level, can be found in Mumbower et al. (2014).

Various methods have been developed to obtain consistent estimators in spite of the presence of endogeneity. The methods depend on the particular model that is being analyzed. This article focuses on the problem of endogeneity in discrete-choice models. For the case of linear models, the reader is referred to the comprehensive book of Wooldridge (2010).

When endogeneity is present in a discrete-choice model, the methods to correct for it differ substantially in the assumptions considered. For the more general case, the problem could be addressed using nonparametric methods.
The reader is referred to Matzkin (2007) for a detailed review and analysis of the conditions needed for achieving identification in nonparametric discrete-choice models. Further analysis of this topic can be found in Chesher (2003, 2005, 2010) and Chesher et al. (2013). In the latter paper, the authors study in particular the discrete-choice case and show that their results also apply in the presence of parametric restrictions. In spite of its generality, the complexity of the nonparametric approach seems to have precluded its application in practice so far.

When the structural equation of the latent utility of a discrete-choice model is linear, and the endogenous explanatory variable is discrete, the problem can only be formally solved using Full Information Maximum Likelihood (FIML). An example of this type of application in transportation is the work of Abay et al. (2013), analyzing the relation between injury severity and seat belt use in two-vehicle crashes. To achieve identification in this case it is essential to be able to write the endogenous and the dependent variables as a function of exogenous variables that work as instruments.

The Latent-Variables (LV) approach (Walker and Ben-Akiva, 2002) can be classified among the FIML methods to address endogeneity. The idea of the method is to include the latent variable (which can be continuous or discrete) in the model, and to integrate it out to calculate the likelihood function. The conditional distribution of the latent variable, which is needed for the integration, is written using structural and measurement equations. The method requires writing the latent variable as a function of exogenous variables in the structural equation, which may prove challenging in various practical cases.
When the endogenous explanatory variable is continuous in a discrete-choice model with linear utility, there are some alternative methods to address endogeneity that are easier to apply than FIML or nonparametric methods. For example, if endogeneity occurs at the level of groups of observations, such as when automobile prices are determined in equilibria that occur in regional markets, the problem can be solved using the BLP approach (Berry et al., 1995). This method consists of taking the endogeneity out of the choice model by including alternative specific constants for each alternative in each group or market. Then, in a second stage, the estimated constants are regressed on other explanatory variables, where remaining endogeneity can be resolved using any method for linear models.

If endogeneity occurs not only at the level of groups of observations but at the level of each observation, the Control-Function (CF) method can be used instead. This method was originally proposed by Rivers and Vuong (1988) for binary Probit, as a generalization of a method proposed by Heckman (1978). Petrin and Train (2002) extended the CF method to Logit Mixture models. The CF method is conceptually similar to the Two Stages Least Squares (2SLS) method (see, e.g., Wooldridge, 2010) for linear models and, like 2SLS, it can be applied in two stages or simultaneously, in which case it is termed Maximum-Likelihood (ML) (Train, 2009). The CF method has been shown to be particularly suitable to address endogeneity for various types of discrete-choice models (Ferreira, 2010; Guevara and Ben-Akiva, 2006, 2012). The key aspect of the CF
method is that it requires valid instruments: auxiliary variables that have to be correlated with the endogenous variable and, at the same time, uncorrelated with (formally, independent of) the error terms of the model.

Finding suitable instruments to apply the CF method can be problematic, motivating the search for alternative methods. Guevara and Polanco (2015) showed that the Multiple Indicator Solution (MIS) method, originally proposed by Wooldridge (2010) for linear models, can be extended to discrete-choice models under mild assumptions. The MIS allows the correction of endogeneity in this context without the need for instruments or, more precisely, by making use of alternative instruments that may be easier to collect in practice. The MIS relies on having a pair of suitable indicators, which are measured variables that depend on the latent variable that causes endogeneity, but not on other measured attributes. The MIS method is applied in two stages. First, one of the indicators is included as an explanatory variable in the structural equation. With this modification, the endogeneity of other variables is eliminated, and the included indicator becomes the only endogenous variable. Then, in a second stage, the problem is solved by using the second indicator as an instrument for the first one. Wooldridge (2010) applies this method to linear models, where he proposes using the 2SLS method for the second stage. Guevara and Polanco (2015) adapted and applied the MIS method to Logit models by using the CF approach in the second stage of the MIS method.

An additional approach to deal with endogeneity in practice consists of substituting the omitted variable that causes the problem with a proxy (PR) that can be obtained from real data. This is the approach used, e.g., by Wardman and Whelan (2011) and Tirachini et al. (2013), to account for the omission of comfort in public transportation.
Wooldridge (2010) shows the conditions under which the PR method may address endogeneity in linear models. The present article studies the extension of the PR method to discrete-choice models.

The goal of this paper is to assess different methods to correct for endogeneity in discrete-choice models, evaluating their practical and theoretical differences. The article concentrates on the cases where endogeneity occurs at the level of each observation, thereby reducing the set of methods to five: the use of Proxies (PR); the two-step Control-Function (CF); the simultaneous estimation of the CF method via Maximum-Likelihood (ML); the Multiple Indicator Solution (MIS); and the integration of Latent-Variables (LV).

The article is structured as follows. Section 2 begins by considering a common data generation process under which endogeneity could potentially be addressed with the five methods under study. Then, Section 3 describes and assesses the theoretical framework of each method, their assumptions, and their data and coding requirements. Section 4 reports the use of Monte Carlo simulation to illustrate the application of each method in practice, to assess their finite sample performance, their computational cost, and the practical impact of some common flaws. Finally, Section 5 summarizes the main contributions and suggests possible extensions for this research.

2. Unified data generation process of a discrete-choice model that suffers from endogeneity

For expositional purposes, consider a data generation process related to a particular discrete-choice model that is fairly general, covering at least the three examples of endogeneity that were described in the introduction: urban mode choice, residential location, and interurban mode choice. The discrete-choice model is characterized by the structural equation of the utility, Eq. (1), and the choice indicator, Eq. (2).
The asterisk (*) is used to distinguish the variables that are latent from those that are observed by the researcher. Unless otherwise specified, all right-hand-side variables, which are also non-latent, are assumed to be independent of other model variables.

$$U^*_{in} = X_{in}\beta_x + \beta_q q^*_{in} + e^*_{in} = X_{in}\beta_x + \varepsilon^*_{in} \tag{1}$$

$$y_{in} = 1\left[\,U^*_{in} \ge U^*_{jn},\ \forall j \in C_n\,\right] \tag{2}$$
Individual $n$ chooses the alternative $i$ with the largest latent utility $U^*_{in}$ among those in the choice-set $C_n$. The latent utility depends linearly, with coefficients $\beta$, on the $K$-dimensional row vector of variables $X_{in}$, on a latent variable $q^*_{in}$ and on an exogenous error term $e^*_{in}$. $X_{in}$ can be a mix of attributes of alternative $i$ and the characteristics of the individual $n$. The first element in $X_{in}$ is a one, accounting for the alternative specific constant. Instead of the latent utility $U^*_{in}$, the researcher observes the choice $y_{in}$, which takes value 1 if alternative $i$ has the largest random utility.

For expositional purposes, from this point on the sub-indices for alternative and individual ($in$) will be omitted from the model variables within the text, but will be maintained in the equations. Unless otherwise specified, it will be assumed that the variables are stacked by alternatives and individuals.

Since $q^*$ is latent, the actual error term of the choice model (Eqs. (1) and (2)) will be $\varepsilon^* = \beta_q q^* + e^*$. This choice model will suffer from endogeneity if at least one of the elements in $X$ depends on $q^*$. This would be the case if, for example, $x_K$ corresponds to the $K$th column of $X$ and is defined by the structural equation shown in Eq. (3), where $\gamma$ is a vector of coefficients and $\upsilon^*$ is an exogenous error term. $Z \supset X \setminus x_K$ is a matrix comprising all the exogenous variables in $X$, including the intercept, and at least one additional exogenous variable.

$$x_{Kin} = Z_{in}\gamma_z + \gamma_q q^*_{in} + \upsilon^*_{in} = Z_{in}\gamma_z + \delta^*_{in} \tag{3}$$
Instead of the latent variable $q^*$ the researcher observes two indicators, generated by Eqs. (4) and (5), where $\alpha_{01}, \alpha_{02}, \alpha_1, \alpha_2$ are coefficients, and $e_{q1}$ and $e_{q2}$ are exogenous error terms.

$$q_{1in} = \alpha_{01} + \alpha_1 q^*_{in} + e_{q1in} \tag{4}$$

$$q_{2in} = \alpha_{02} + \alpha_2 q^*_{in} + e_{q2in} \tag{5}$$
Finally, it is assumed that the latent variable $q^*_{in}$ is generated by the structural equation Eq. (6), where $H$ is a matrix of exogenous variables, with its first column equal to one, $\theta$ is a vector of coefficients, and $\omega^*$ is an exogenous error term.

$$q^*_{in} = H_{in}\theta + \omega^*_{in} \tag{6}$$
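To make the data generation process concrete, the following is a minimal sketch of how one sample could be drawn from Eqs. (1)-(6). The function name `simulate_dgp`, all coefficient values, the standard normal distributions of the exogenous variables, and the simplifications (a single exogenous attribute `x1`, a single extra variable `z` in Z, a single variable `h1` in H, and alternative specific constants set to zero) are illustrative assumptions, not the design used in Section 4.

```python
import numpy as np

def simulate_dgp(n_obs=2000, n_alt=2, seed=0):
    """Draw one sample from Eqs. (1)-(6); every numeric value is illustrative."""
    rng = np.random.default_rng(seed)
    shape = (n_obs, n_alt)                       # one row per individual n
    # Eq. (6): q* = H theta + omega*, with H = [1, h1] and theta = (0, 1)
    h1 = rng.normal(size=shape)
    q_star = 0.0 + 1.0 * h1 + rng.normal(size=shape)
    # Eq. (3): x_K = Z gamma_z + gamma_q q* + upsilon*, with Z = [1, x1, z]
    x1 = rng.normal(size=shape)                  # exogenous attribute in X
    z = rng.normal(size=shape)                   # exogenous variable only in Z
    x_k = 0.0 + 0.0 * x1 + 1.0 * z + 1.0 * q_star + rng.normal(size=shape)
    # Eqs. (4)-(5): two indicators of the latent q*
    q1 = 0.5 + 1.0 * q_star + rng.normal(size=shape)
    q2 = -0.5 + 1.0 * q_star + rng.normal(size=shape)
    # Eq. (1): U* = X beta_x + beta_q q* + e*, with e* iid Extreme Value
    util = 1.0 * x1 - 1.0 * x_k + 1.0 * q_star + rng.gumbel(size=shape)
    # Eq. (2): the chosen alternative is the one with the largest utility
    choice = util.argmax(axis=1)
    return {"x1": x1, "x_k": x_k, "z": z, "h1": h1, "q1": q1, "q2": q2,
            "q_star": q_star, "choice": choice}

data = simulate_dgp()
# x_K is correlated with the omitted q*, which is the source of endogeneity
print(np.corrcoef(data["x_k"].ravel(), data["q_star"].ravel())[0, 1])
```

In this sketch `h1` does not enter the utility, so only the second source of endogeneity discussed in the text (through Eq. (3)) is present.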
If this choice model is estimated using only Eqs. (1) and (2), endogeneity will be present for two reasons. First, if a variable is both in $H$ and $X$, the error term $\varepsilon^*$ of the structural equation of the utility will be correlated with it. The second source of endogeneity under this setting may occur because $x_K$ depends on $q^*$ through Eq. (3), so that $x_K$ will be correlated with the error term $\varepsilon^*$. For the analysis developed in Sections 3 and 4, it will be assumed that only the latter type of endogeneity is present, but it should be kept in mind that both sources of endogeneity may exist. The methods that are going to be described may be used, with minor adjustments for LV and ML, to address both types.

The model depicted by Eqs. (1)–(6) can correspond, for example, to an urban mode choice model. In this case, the latent modal utility $U^*$ may depend on travel time, travel cost (and other variables contained in $X$), on the level of comfort $q^*$, and on an error term $e^*$. Comfort $q^*$ will likely depend on crowding and other things, like travel time, which will then be both in $H$ and $X$, causing endogeneity. Instead of the latent comfort $q^*$, the researcher may observe the indicators $q_1$ and $q_2$ shown in Eqs. (4) and (5), which can come from different sources. For example, the choice-maker may be asked first to rate his or her experience regarding the level of comfort on the mode ($q_1$), and then asked to declare his or her level of agreement with the statement: "travelling on this mode is highly comfortable" ($q_2$). Another source of indicators, although probably a less efficient one, may be to ask the choice-makers to provide adjectives that describe their travel experience, and then ask third parties to interpret those adjectives quantitatively (as in Glerum et al., 2014). Likewise, among other ways, the indicators could be gathered from overall ratings given by anonymous patrons regarding the same type of trip, in terms of origin, destination, mode and period.
This data generation process may also correspond to a residential location model. In this case, the elements in $X$ may contain measurable dwelling attributes, such as price, size or age. $q^*$ would represent a compound of quality attributes that are considered by the decision-maker but unobserved by the researcher. Since $q^*$ will likely explain the dwelling's price $x_K$, its omission from the model in Eqs. (1) and (2) will cause endogeneity. The variables in $H$ may correspond in this case to the quality of the painting or the plumbing, or other omitted dwelling attributes that may explain $q^*$. However, if those attributes happen to be measured, it would actually be much better to include them directly in $U^*$. This implies that the structural Eq. (6) may be very hard to construct in practice for this example. To conclude, the indicators in this case may come from sources equivalent to those described for the urban mode choice model.

Finally, this setting may also represent an interurban mode choice model. In this case $x_K$ could be the price of the mode, which may depend on the quality $q^*$, the distance and maybe the schedule, as shown in Eq. (3). In turn, $q^*$ will depend on $H$, which in this case may include the space between seats, the food and entertainment provided, the number of transfers, some safety and security features, the reported carbon footprint, among many other attributes. Likewise, indicators $q_1$ and $q_2$ could be gathered from sources similar to those described for the two previous examples.

3. Formulation and qualitative assessment of methods for correcting endogeneity in discrete-choice models

This section presents the formulation and a qualitative analysis of the capabilities and limitations of five methods to address endogeneity in the discrete-choice model described in Eqs. (1)–(6).

3.1. Proxies (PR)

The problem of endogeneity can be addressed if the latent $q^*$ is replaced by a proxy variable $p$ in the specification of the systematic part of the utility.
This is, for example, the approach used by Wardman and Whelan (2011) and Tirachini et al. (2013). Wooldridge (2010, Section 4.3.2) describes the method for the case of linear models, and it is extended here to discrete-choice models.

A proper proxy variable in a discrete-choice context would have to comply with two requirements. First, $p$ would have to be independent (see footnote 1) of $e^*$ in Eq. (1). That is, if $q^*$ had been observed, the proxy $p$ should be redundant in the model. In the mode choice example considered, if the omitted variable is comfort, a variable complying with this first requirement may be the actual level of crowding or any other element of $H$ in Eq. (6). Those variables would have no impact on the choice if somehow the true comfort had been accounted for.

The second requirement for having a proper proxy is more cumbersome. It requires that $\omega^*$ in Eq. (6) be independent of all variables in $X$ in Eq. (1). It can be easily shown that this assumption is violated in the problem depicted in Eqs. (1)–(6). The problem is that $x_K$ depends on $\omega^*$, through $q^*$ in Eq. (3), which is the cause of endogeneity to begin with.

Footnote 1: For the linear case, only zero correlation is required.
Despite the PR method being formally inappropriate for the model defined by Eqs. (1)–(6), it will be studied later using Monte Carlo experimentation, because it is very easy to apply with canned software and, therefore, it is widely used. Results reported in Section 4 suggest that, despite its theoretical limitations under this setting, the PR may mitigate the endogeneity problem considerably for the problems analyzed.

The implementation of the PR approach in this case consists of using a composite proxy $h = H\theta$, which is implemented by estimating the choice model considering the following variables in the systematic part of the utility

$$X_{in},\ h_{in}, \tag{7}$$

to obtain estimators $\hat{\beta}_x$ for the $\beta_x$ in Eq. (1), up to a scale (Guevara and Ben-Akiva, 2012).
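The specification in Eq. (7) can be illustrated with a small simulation. The sketch below assumes a binary choice with no alternative specific constant, a single proxy variable `h1` standing in for the composite proxy, and illustrative coefficient values; `logit_newton` and `diff` are hypothetical helpers written for this example, not part of any package. Because the estimators are only identified up to a scale, the comparison is made on the ratio of coefficients, whose true value in this design is -1.

```python
import numpy as np

def logit_newton(X, y, n_iter=30):
    """Binary logit MLE by Newton-Raphson; no intercept is added."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (y - p)
        hess = (X * (p * (1.0 - p))[:, None]).T @ X
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def diff(a):
    """Attribute difference between the two alternatives."""
    return a[:, 1] - a[:, 0]

rng = np.random.default_rng(0)
shape = (10_000, 2)                                  # individuals x alternatives
x1, z, h1 = (rng.normal(size=shape) for _ in range(3))
q_star = h1 + rng.normal(size=shape)                 # Eq. (6), theta = (0, 1)
x_k = z + q_star + rng.normal(size=shape)            # Eq. (3)
util = x1 - x_k + q_star + rng.gumbel(size=shape)    # Eq. (1), beta_q = 1
y = (util[:, 1] > util[:, 0]).astype(float)          # Eq. (2)

# Uncorrected model: q* is simply omitted
beta_naive = logit_newton(np.column_stack([diff(x1), diff(x_k)]), y)
# PR model, Eq. (7): the proxy enters the systematic utility
beta_pr = logit_newton(np.column_stack([diff(x1), diff(x_k), diff(h1)]), y)

ratio_naive = beta_naive[1] / beta_naive[0]
ratio_pr = beta_pr[1] / beta_pr[0]
print("true ratio: -1.0", "| uncorrected:", ratio_naive, "| proxy:", ratio_pr)
```

Under these assumptions the proxy moves the ratio toward its true value without reaching it, which is in line with the statement that PR is formally inappropriate here but tends to correct in the right direction.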
3.2. Control-Function (CF)

The Control-Function (CF) method (Heckman, 1978; Rivers and Vuong, 1988; Petrin and Train, 2002) can be used to correct for endogeneity in the model described by Eqs. (1)–(6). To apply the CF method, an instrumental variable is needed for every element of $X$ that is correlated with the omitted $q^*$ in Eq. (1). As with 2SLS for linear models, the instrumental variables need to be correlated with the endogenous variable, but uncorrelated with (formally, independent of, in the discrete-choice case) the error terms of the model. This role can be played by $Z \supset X \setminus x_K$ in Eq. (3).

The application of the CF method requires assuming that the error terms $\langle \varepsilon^*, \delta^* \rangle$ are distributed bivariate Normal. Then, it can be shown that $\varepsilon^* = \theta\delta^* + \zeta$, where $\zeta$ is iid normally distributed (see Wooldridge, 2010, p. 587). Then, if $\delta^*$, or an estimator of it, is included in the utility, a proper Probit model could be consistently estimated. Under this framework, the extension to a Logit model can come from accepting that the approximation of a Normal by an Extreme Value distribution causes negligible discrepancies (Lee, 1982; Ruud, 1983).

Villas-Boas (2007) proposed a less restrictive definition for the CF that avoids the need to assume joint normality of $\langle \varepsilon^*, \delta^* \rangle$. The author shows that, under some general assumptions, the researcher may directly specify the conditional distribution of $\beta_q q^*$ given $\delta^*$ and know that there would be some distribution of $\langle \varepsilon^*, \delta^* \rangle$ from which it would come. Then, if the researcher assumes that $\beta_q q^*$ given $\delta^*$ follows an Extreme Value distribution, there would be no need for making further assumptions to approximate a Probit by a Logit. This is the approach considered by Petrin and Train (2010).

In practice, consistent estimators $\hat{\beta}_x$ for the $\beta_x$ in Eq. (1), up to a scale (Guevara and Ben-Akiva, 2012), can be obtained by applying the following two-stage CF procedure:

Stage 1: obtain the residuals $\hat{\delta}$ (see footnote 2) from the OLS regression of

$$x_{Kin} \ \text{on} \ Z_{in}, \tag{8}$$

Stage 2: Estimate the choice model considering the following variables in the systematic part of the utility

$$X_{in},\ \hat{\delta}_{in}, \tag{9}$$
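The two stages in Eqs. (8) and (9) can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in Section 4: the choice is binary with no alternative specific constant, all coefficient values are invented, and `logit_newton` and `diff` are hypothetical helpers defined only for this example. The true ratio between the coefficients of `x_k` and `x1` is -1 in this design.

```python
import numpy as np

def logit_newton(X, y, n_iter=30):
    """Binary logit MLE by Newton-Raphson; no intercept is added."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (y - p)
        hess = (X * (p * (1.0 - p))[:, None]).T @ X
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def diff(a):
    """Attribute difference between the two alternatives."""
    return a[:, 1] - a[:, 0]

rng = np.random.default_rng(1)
n = 10_000
shape = (n, 2)                                        # individuals x alternatives
x1, z = rng.normal(size=shape), rng.normal(size=shape)
q_star = rng.normal(size=shape) + rng.normal(size=shape)   # latent, variance 2
x_k = z + q_star + rng.normal(size=shape)             # Eq. (3)
util = x1 - x_k + q_star + rng.gumbel(size=shape)     # Eq. (1)
y = (util[:, 1] > util[:, 0]).astype(float)           # Eq. (2)

# Stage 1, Eq. (8): OLS of x_K on Z = [1, x1, z], stacked over alternatives
Z = np.column_stack([np.ones(2 * n), x1.ravel(), z.ravel()])
gamma_hat, *_ = np.linalg.lstsq(Z, x_k.ravel(), rcond=None)
delta_hat = (x_k.ravel() - Z @ gamma_hat).reshape(shape)

# Stage 2, Eq. (9): choice model with the residual as an additional variable
beta_naive = logit_newton(np.column_stack([diff(x1), diff(x_k)]), y)
beta_cf = logit_newton(
    np.column_stack([diff(x1), diff(x_k), diff(delta_hat)]), y)

ratio_naive = beta_naive[1] / beta_naive[0]
ratio_cf = beta_cf[1] / beta_cf[0]
print("true ratio: -1.0", "| uncorrected:", ratio_naive, "| CF:", ratio_cf)
```

The standard errors reported by a second-stage routine would ignore the sampling error in `delta_hat`, so a resampling scheme that repeats both stages would be needed for inference.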
In terms of practicality, the CF is very easy to apply with canned software, with no coding difficulties and very low computational cost, since it implies the application of an OLS regression and the estimation of the choice model with an additional variable. Nevertheless, a different source of complexity arises because the standard errors cannot be calculated directly from the Fisher information matrix. Instead, the variance–covariance matrix can be estimated in this case using alternative methods such as bootstrapping (Petrin and Train, 2002) or the delta method (Karaca-Mandic and Train, 2003).

The main challenge for applying the CF method is to gather proper instrumental variables. For example, Guevara and Ben-Akiva (2006, 2012) used the CF method to correct for endogeneity due to the omission of quality attributes that are correlated with price in a residential location choice model. Following an argument used previously by Hausman (1996) and Nevo (2001) in other contexts, Guevara and Ben-Akiva (2006, 2012) used as an instrument for dwelling price the average price of other dwellings located not too near (to avoid reflection bias), but also not too far (to ensure relevance) from the incumbent dwelling. However, finding instruments in other circumstances may be difficult or simply impossible. For example, safety, comfort and reliability are attributes that are very relevant in mode choice models, but are hard to measure, and may also be correlated with travel cost and time, causing endogeneity. The CF may be very difficult to apply to such problems because it is unclear how to obtain proper instrumental variables for those omitted modal attributes.

Even if plausible instruments can be found and their application results in a coherent correction of the estimated parameters, the validity of the instruments is always debatable.
Newey (1985) shows that the over-identification tests needed to confirm the exogeneity of the instruments are inconsistent; that is, even when the sample size goes to infinity, the power of the test does not go to 1. Although De Blander (2008) shows that the alternative hypothesis that cannot be detected by such tests is very atypical, the fact is that the exogeneity of the instrumental variables is always controversial.

Another difficulty that arises with the use of instrumental variables is known as the weak instruments problem. When the instruments are only weakly correlated with the endogenous variable in a linear model, the 2SLS estimation may be poorer than that of the uncorrected model (Staiger and Stock, 1997). It has been shown that an F test of the first stage of 2SLS can be used to judge the strength of the instrumental variables in linear models. Recent research by Guevara and Navarro (2015) suggests that a similar approach may be applicable in binary choice models with the CF method.

An additional challenge for the application of the CF would be to assume that the distributional restrictions required in the definition described by Wooldridge (2010) do hold. This source of difficulty would be reduced under the alternative derivation of the CF proposed by Villas-Boas (2007).

Regarding efficiency, since the CF is performed in two stages there would be, in general, a loss of efficiency involved in the estimation of the method. However, Rivers and Vuong (1988) show that, if the error terms $\varepsilon^*$ and $\delta^*$ are homoscedastic and non-autocorrelated, the CF applied in two stages will be efficient. Moreover, under the alternative definition of the CF provided by Villas-Boas (2007), the potential source for a reduction of efficiency would disappear.

Finally, it should be noted that the true coefficient $\beta_q$ of the omitted $q^*$ in Eq. (1) cannot be estimated when correcting for endogeneity with the CF method, not even up to a scale. What can be obtained is the sign of the coefficient of $q^*$, provided that the sign of the causality of $q^*$ on the endogenous variable is clear.

Footnote 2: Recall that $Z \supset X \setminus x_K$ comprises all exogenous variables in $X$ and at least one additional exogenous variable. The latter may differ from the respective variable considered in the data generation process. In such a case, the statistical properties for the correction of endogeneity would hold, but $\hat{\delta}$ in Eq. (8) would not be a direct estimator of $\delta^*$ in Eq. (3), and thus a different notation would be needed for each one in the general case.

3.3. Maximum-Likelihood (ML)

The ML approach corresponds to the CF described in Section 3.2, but with the two stages estimated simultaneously. For the model described by Eqs. (1)–(3), assuming that the $e^*$ are distributed iid Extreme Value I $(0, \mu)$, and defining $\xi^* = \beta_q q^*$, the likelihood of an observation can be written as shown in Eq. (10), where $f_1$ and $f_2$ are the densities of $\xi^*$ and $\delta^*$, and $\Theta$ is used to depict the full set of model parameters.

$$P_n(i \mid X_n, Z, \Theta) = \int P_n(i \mid \xi^*_n, X_n, \Theta)\, f_1(\xi^* \mid \delta^*, \Theta)\, f_2(\delta^* \mid Z_n, X_n, \Theta)\, d\xi^* \tag{10}$$
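Assuming that $\xi^*$ is distributed iid Normal $(0, \sigma_\xi)$ and $\delta^*$ is distributed iid Normal $(0, \sigma_\delta)$, then $\xi^*$ can be written as $\xi^* = \gamma\delta^* + \vartheta$, where $\vartheta$ is an error term distributed Normal $(0, \sigma_\vartheta)$ (see Wooldridge, 2010). Then, the components of the likelihood in Eq. (10) become

$$P_n(i \mid X_n, \xi^*_n, \Theta) = \frac{e^{\mu\left(X_{in}\beta_x + \xi^*_{in}\right)}}{\sum_{j \in C_n} e^{\mu\left(X_{jn}\beta_x + \xi^*_{jn}\right)}},$$

$$f_1(\xi^* \mid \delta^*, \Theta) = \prod_{j \in C_n} \frac{1}{\sigma_\vartheta}\, \phi\!\left(\frac{\xi^*_{jn} - \gamma\delta^*_{jn}}{\sigma_\vartheta}\right) \quad \text{and}$$

$$f_2(\delta^* \mid Z_n, X_n, \Theta) = \prod_{j \in C_n} \frac{1}{\sigma_\delta}\, \phi\!\left(\frac{x_{Kjn} - Z_{jn}\gamma}{\sigma_\delta}\right), \tag{11}$$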
where /() corresponds to the density of the standard normal distribution, which has mean equal to zero and variance equal to 1. Finally, if the variable used for integration is not n⁄ but a standard normal error term g⁄ such that #⁄ = r0g⁄, the change of variables would imply that dn⁄ = r0dg⁄. Identification may be cumbersome in a general model like the one shown in Eq. (10). Although rank, order, and equality conditions (Walker et al., 2007) may be used to explore a possible proper normalization of the model, there are no necessary and sufficient conditions to solve the issue and, therefore, the problem has to be addressed on a case-by-case basis. However, the special case described in Eq. (11) is greatly simplified by the assumption that n⁄ and d⁄ are iid. It can be shown that the only two additional conditions needed for the identification of Eq. (11) are that c = 1 and l = 1. The simultaneity of the ML results, in general, in more efficient estimators than those of the CF method, but at the cost of a reduction in generality. For the ML method one needs to specify the joint distribution of n⁄ and d⁄, whereas for the CF method only the conditional distribution of n⁄ given d⁄ is needed. A given conditional distribution can correspond to various joint distributions, while the reciprocal is not true (see, e.g., Train, 2009). Regarding hypothesis testing, if the joint distribution in the ML method is correctly specified, the standard errors of the model can be estimated directly from the information matrix, avoiding the need for the alternative procedures that are required for the CF. Finally, regarding practicality, the estimation of the ML method can become seriously cumbersome as it involves coding in non-canned software and integration, what would become a serious issue particularly if the number of alternatives grows, because of what is known as the curse of dimensionality (Cherchi and Guevara, 2012). 
Besides, if the researcher adopts a wrong normalization, the estimation results will be wrong, but this would be very difficult to detect in practice from the model output. Among other things, a wrong normalization may mask identification problems.
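The change of variables mentioned above for the ML method can be illustrated with a minimal numerical sketch. It is not the likelihood of Eq. (11): it integrates a single binary logit probability over one normal error, and the values of `sigma` and `v` are hypothetical. It only shows that integrating over the error with density φ(x/σ)/σ is the same as integrating over a standard normal η after substituting x = ση, and that averaging over standard normal draws approximates the same quantity.

```python
import math
import random

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def logit_prob(v):
    """Binary logit probability for a utility difference v."""
    return 1.0 / (1.0 + math.exp(-v))

def midpoint(f, lo, hi, n=4000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(f(lo + (k + 0.5) * h) for k in range(n))

sigma = 1.5   # hypothetical standard deviation of the integrated error
v = 0.3       # hypothetical systematic utility difference

# (a) Integrate over the error itself, whose density is phi(x / sigma) / sigma.
p_direct = midpoint(lambda x: logit_prob(v + x) * phi(x / sigma) / sigma,
                    -8.0 * sigma, 8.0 * sigma)

# (b) Substitute x = sigma * eta, so dx = sigma * d(eta): the factor sigma
#     cancels the 1 / sigma of the density and only phi(eta) remains.
p_standard = midpoint(lambda eta: logit_prob(v + sigma * eta) * phi(eta),
                      -8.0, 8.0)

# (c) Simulation analogue of (b): average over standard normal draws.
rng = random.Random(0)
draws = [rng.gauss(0.0, 1.0) for _ in range(20000)]
p_simulated = sum(logit_prob(v + sigma * e) for e in draws) / len(draws)

print(p_direct, p_standard, p_simulated)
```

With several alternatives the integral becomes multifold, which is where simulation replaces quadrature and where the computational burden discussed above arises.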
3.4. Multiple Indicator Solution (MIS)
The MIS method for discrete-choice models was proposed by Guevara and Polanco (2015) as an extension of Wooldridge's (2010) MIS method for linear models. The main difference in the discrete-choice case is that the second stage of the method uses the CF method instead of 2SLS. Consider the choice model defined by Eqs. (1) and (2), where two indicators q1 and q2 for the latent variable q* are available and have the form shown in Eqs. (4) and (5). Consider that q* is replaced by q1 in the utility, such that
\[
\begin{aligned}
U_{in} &= X_{in}\beta_x + \beta_q\,\frac{q_{1in}-\alpha_{01}-\varepsilon_{q_1 in}}{\alpha_1} + \varepsilon_{in} \\
U_{in} &= X_{in}\beta_x + \gamma_1 q_{1in} + \underbrace{\left(-\frac{\beta_q\,\alpha_{01}}{\alpha_1} - \frac{\beta_q}{\alpha_1}\,\varepsilon_{q_1 in} + \varepsilon_{in}\right)}_{\nu_{in}}
\end{aligned}
\tag{12}
\]
The inclusion of one of the indicators in the utility does not solve the endogeneity problem; it only changes its source. In this new model the only endogenous variable is q1, which is correlated with the error term ν* because ν* contains the error term ε_q1. However, because of the nature of Eqs. (4) and (5), q2 is a proper instrument for q1. On the one hand, q2 is correlated with q1 because both depend on q*. On the other hand, q2 is independent of ν* because ε_q1 and ε_q2 are exogenous error terms.
Then, consistent estimators β̂_x of the model parameters β_x in Eq. (1), up to a scale, would be obtained by applying the following two-stage procedure:
Stage 1: Obtain the residuals δ̂_q from the OLS regression of
\[
q_{1in} \quad\text{on}\quad X_{in},\; q_{2in}. \tag{13}
\]
Stage 2: Estimate the choice model considering the following variables in the systematic part of the utility:
\[
X_{in},\; q_{1in},\; \hat{\delta}_{q\,in}. \tag{14}
\]
Note that X has to be used as a regressor in Eq. (13). Otherwise, δ̂_q may be correlated with some columns of X. The same consideration is needed for the application of the CF method in Eq. (8), because in that case Z must contain the columns of X other than xK. Formally, the application of the CF in this context requires ν in Eq. (12) to be independent of q2 (Wooldridge, 2010, p. 585). This assumption is only slightly stronger than the assumptions needed for the application of the MIS method in linear models. However, it is equally plausible and, furthermore, it is equivalent to the assumptions needed for any other application of the CF method.
Given that the MIS method can be seen as a CF method with a special type of instruments, the potential gains in efficiency from estimating the MIS simultaneously would be the same as those of the CF method. Consequently, a simultaneous version of the MIS is not studied here, nor in the Monte Carlo section, since the conclusions would be redundant. Furthermore, for this example, since the errors are homoscedastic and non-autocorrelated, and only one instrumental variable is used, the two-stage estimator will be as efficient as the FIML version (Rivers and Vuong, 1988). However, the standard errors of the two-stage procedure cannot be obtained directly from the information matrix; they have to be calculated using, for example, the bootstrap or the delta method proposed by Karaca-Mandic and Train (2003).
The MIS method requires assuming that the error terms (ν, δ_q) are distributed bivariate Normal. This requirement also applies to the CF, but it may be harder to fulfill for the MIS because the indicators are often discrete. Therefore, in the presence of discrete indicators, the MIS correction can only be formally understood as an approximation. Even so, the Monte Carlo evidence provided in Section 4 shows that this approximation seems to provide remarkably good results, which suggests that a formal demonstration of it may be viable.
Such a demonstration exceeds the scope of this paper, but it could potentially be pursued following an approach similar to that of Villas-Boas (2007).
It should be noted that the roles of q1 and q2 can be reversed, attaining the same asymptotic result, although differences may be found in finite samples. Also, when more than two indicators are available, the method can be performed with any indicator playing the role of q1 in Eq. (12) and all the others acting as instruments for it. The selection of the first indicator could be based on the model fit.
The MIS method will not work if some of the columns of X in Eq. (1) are also explanatory variables in Eqs. (4) or (5). In such a case, ν* would be correlated with X in Eq. (12), producing another source of endogeneity. This poses a practical problem for the application of the MIS method, since this redundancy assumption cannot be fully verified. It is equivalent to the assumption about the exogeneity of instruments required for the application of the CF.
Finally, regarding practicality, the MIS method is as practical as the CF method: it can be applied with canned software to estimate a linear regression and the discrete-choice model. It also has the additional advantage of providing, for various practical problems, a method to construct proper instruments from the indicators.
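The two-stage procedure of Eqs. (13) and (14) can be sketched in a few lines of code. The sketch below is only an illustration under stated assumptions, not the implementation used in this article: it simulates a binary choice with the coefficients of Section 4 (price coefficient taken as −2, errors Uniform (−2, 2)), and it uses hand-written helpers (`solve`, `ols_residuals`, `fit_binary_logit`), which are hypothetical names introduced here so that the example needs no external library. The binary logit is estimated on the differences of attributes between the two alternatives.

```python
import math
import random

def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def ols_residuals(X, y):
    """Residuals from the OLS regression of y on the columns of X."""
    k = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    beta = solve(XtX, Xty)
    return [yi - sum(bi * xi for bi, xi in zip(beta, row)) for row, yi in zip(X, y)]

def sigmoid(v):
    """Numerically stable logistic function."""
    if v >= 0.0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)

def fit_binary_logit(D, y, iterations=30):
    """Newton-Raphson MLE of a binary logit on attribute differences D."""
    k = len(D[0])
    beta = [0.0] * k
    for _ in range(iterations):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for d, yi in zip(D, y):
            p = sigmoid(sum(bi * di for bi, di in zip(beta, d)))
            for i in range(k):
                grad[i] += (yi - p) * d[i]
                for j in range(k):
                    hess[i][j] += p * (1.0 - p) * d[i] * d[j]
        step = solve(hess, grad)
        # Cap the step length as a simple safeguard against overshooting.
        scale = min(1.0, 1.0 / max(1e-12, max(abs(s) for s in step)))
        beta = [bi + scale * si for bi, si in zip(beta, step)]
    return beta

rng = random.Random(2015)
unif = lambda: rng.uniform(-2.0, 2.0)
N = 4000  # assumption: larger than the N = 1000 of Section 4, to reduce noise
stage1_X, stage1_y, alts, choices = [], [], [], []
for _ in range(N):
    pair = []
    for _ in range(2):
        h1 = unif()
        q = 2.0 * h1 + (h1 + unif()) + unif()       # latent q*, Eqs. (19)-(20)
        a, b, z, wz = unif(), unif(), unif(), unif()
        p = 5.0 + z + 0.03 * wz + q + unif()         # price, Eq. (21)
        q1, q2 = q + unif(), q + unif()              # proper indicators, Eq. (23)
        gumbel = -math.log(-math.log(rng.random()))  # Extreme Value (0, 1) error
        pair.append((p, a, b, q1, q2, -2.0 * p + a + b + q + gumbel))
        stage1_X.append([1.0, p, a, b, q2])
        stage1_y.append(q1)
    alts.append(pair)
    choices.append(1.0 if pair[0][5] > pair[1][5] else 0.0)

# Stage 1: residuals of the OLS regression of q1 on X and q2, Eq. (13).
resid = ols_residuals(stage1_X, stage1_y)

# Stage 2: logit on the differences of X, q1 and the residuals, Eq. (14).
D_naive, D_mis = [], []
for n, (alt1, alt2) in enumerate(alts):
    dX = [alt1[k] - alt2[k] for k in range(3)]
    D_naive.append(dX)
    D_mis.append(dX + [alt1[3] - alt2[3], resid[2 * n] - resid[2 * n + 1]])

beta_naive = fit_binary_logit(D_naive, choices)
beta_mis = fit_binary_logit(D_mis, choices)
ratio_naive = beta_naive[0] / beta_naive[1]   # q* omitted, no correction
ratio_mis = beta_mis[0] / beta_mis[1]         # MIS correction
print("ratio with q* omitted:", ratio_naive, " ratio with MIS:", ratio_mis)
```

Because the correction changes the scale of the estimators, the sketch compares the ratio of the price coefficient to the coefficient of a, whose population value under these assumptions is −2. Standard errors are not computed here; as noted above, they would require the bootstrap or the delta method.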
3.5. Latent-Variables (LV)
Another alternative to correct for the endogeneity problem under this setting is the Latent-Variables approach (Ben-Akiva et al., 2002). Under this approach, the likelihood of observation n for the problem defined by Eqs. (1)–(6) can be written as shown in Eq. (15), where f3 and f4 are the densities of q* and xK, respectively.
\[
P_n(i \mid X_n, H_n, \Theta) = \int_{q_n^*} P_n(i \mid X_n, q_n^*, \Theta)\, f_3(q_n^* \mid H_n, \Theta)\, f_4(x_{Kn} \mid Z_n, q_n^*, \Theta)\, dq_n^*
\tag{15}
\]
Note that the likelihood in Eq. (15) is different from the usual expression considered in LV models (see, e.g., Walker and Ben-Akiva, 2002), where none of the observed variables is considered to be a function of the latent one. Since in this case there is a causal relationship between the latent q* and the observed xK through Eq. (3), that relationship should be accounted for when integrating out the latent variable. This is achieved by including f4 in Eq. (15). Besides, since the researcher also observes the indicators q1 and q2, the likelihood of an observation for individual n can be rewritten as
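\[
\begin{aligned}
L(y_n, q_n) = \int_{q_n^*} & P_n(i \mid X_n, q_n^*, \Theta)\, f_3(q_n^* \mid H_n, \Theta)\, f_4(x_{Kn} \mid Z_n, q_n^*, \Theta) \\
& \times f_5(q_{1n} \mid q_n^*, \Theta)\, f_6(q_{2n} \mid q_n^*, \Theta)\, dq_n^*.
\end{aligned}
\tag{16}
\]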
Assuming that ε* is distributed iid Extreme Value I (0, μ); ε_q1 and ε_q2 are distributed iid Normal (0, σ_εq1) and Normal (0, σ_εq2), respectively; ω* is distributed iid Normal (0, σ_ω); and u_in is distributed iid Normal (0, σ_u); the components of the likelihood become:
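\[
\begin{aligned}
P_n(i \mid X_n, q_n^*, \Theta) &= \frac{e^{\mu\left(\tilde{X}_{in}\beta_x + \beta_{x_K}\left(Z_{in}\gamma_z + \gamma_q q_{in}^*\right) + \beta_q q_{in}^*\right)}}{\sum_{j\in C_n} e^{\mu\left(\tilde{X}_{jn}\beta_x + \beta_{x_K}\left(Z_{jn}\gamma_z + \gamma_q q_{jn}^*\right) + \beta_q q_{jn}^*\right)}}, \\
f_3(q_n^* \mid H_n, \Theta) &= \prod_{j\in C_n} \frac{1}{\sigma_u}\,\phi\!\left(\frac{q_{jn}^* - H_{jn}\theta}{\sigma_u}\right), \\
f_4(x_{Kn} \mid Z_n, q_n^*, \Theta) &= \prod_{j\in C_n} \frac{1}{\sigma_\omega}\,\phi\!\left(\frac{x_{Kjn} - Z_{jn}\gamma_z - \gamma_q q_{jn}^*}{\sigma_\omega}\right), \\
f_5(q_{1n} \mid q_n^*, \Theta) &= \prod_{j\in C_n} \frac{1}{\sigma_{\varepsilon_{q1}}}\,\phi\!\left(\frac{q_{1jn} - \alpha_{01} - \alpha_1 q_{jn}^*}{\sigma_{\varepsilon_{q1}}}\right) \quad\text{and} \\
f_6(q_{2n} \mid q_n^*, \Theta) &= \prod_{j\in C_n} \frac{1}{\sigma_{\varepsilon_{q2}}}\,\phi\!\left(\frac{q_{2jn} - \alpha_{02} - \alpha_2 q_{jn}^*}{\sigma_{\varepsilon_{q2}}}\right),
\end{aligned}
\tag{17}
\]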
where X̃_in = X_in \ xK_in, that is, X_in without the endogenous variable. It should be noted that there is another subtle difference between this likelihood and the one of the ML model in Eq. (10). In practice, if η* is simulated as an exogenous error, xK in the modal utility has to be replaced by Eq. (3), leading to the formulation for P_n(i|·) shown in Eq. (17). Similarly to what happens with the ML model, if the variable used for integration is not q* but a standard normal error term η* such that ω* = σ_ω η*, the change of variables implies that dq* = σ_ω dη*. Besides, identification may be cumbersome in a general model like the one shown in Eq. (16), but it is greatly simplified by the iid assumptions that led to Eq. (17). It can be shown that the only additional conditions needed for the identification of Eq. (17) are μ = 1, α_1 = 1, α_01 = 0 and σ_ω = 1.
Using the LV approach in this context has two major challenges. The first challenge is practicality. The LV method cannot be applied with canned software, and the custom coding it requires makes its application susceptible to coding errors. Furthermore, if the omitted attribute varies by alternative, a multifold integral has to be evaluated, which challenges practicality because of what is known as the curse of dimensionality (Cherchi and Guevara, 2012). This may explain why almost all applied LV models estimated simultaneously have considered only latent characteristics of the individuals, but not latent attributes of the alternatives.
The second challenge for using the LV approach in this context is causal. When omitted attributes are modeled as latent variables in the current literature, such as in Glerum et al. (2014) and Yanez et al. (2010), they are written in a structural equation as a function of socioeconomic characteristics. This means that, in practice, it is assumed that, e.g., comfort or safety by mode is the same across individuals, and that it is only their perception that changes across the population.
Such an approach applied to problem (1)–(6) would be equivalent to considering an alternative-specific constant that is a random coefficient, with a mean that varies depending on some socioeconomic attributes. In turn, finding causal variables for the comfort or safety actually experienced by individuals, treated as latent variables, is a problem that has not been addressed so far in the empirical literature.
The problem is that the researcher may not be able to collect all the variables that are in H_n. As a consequence, the error term ω* of the structural equation for the latent variable in Eq. (6) may be correlated with some observed attributes, causing a different source of endogeneity. In a general case, the endogeneity of the structural equation will propagate to the whole model, precluding the consistent estimation of the coefficients β of the utility. However, for the data generation process defined by Eqs. (1)–(6), where proper instrumental variables are available (Eq. (3)), the endogeneity problem will be confined to the structural equation of the latent variable q*.
Another issue to highlight is that the application of the latent-variable method in two stages for the problem (1)–(6) would result in inconsistent estimators. The application in two stages consists in first estimating Eq. (6) by OLS and then replacing q* by the estimator q̂_OLS in Eq. (1). Even if the researcher has the correct H to obtain a consistent estimator of q* by means of Eq. (6), the second stage will fail to produce consistent estimators because ω is correlated with xK. Similarly, what is defined as sequential numerical integration by Ben-Akiva et al. (2002) will also fail, unless the correlation between ω and xK is addressed somehow.
Finally, the comparison of Eqs. (17) and (10) serves to illustrate what was defined as the link between the CF and the latent-variables approach by Guevara (2010) and Guevara and Ben-Akiva (2010). In the absence of H to write the structural equation Eq. (6), proper instrumental variables can be used to build δ, which can be used to state an alternative structural equation for q*. In essence, f1 in Eq. (11) substitutes f3 in Eq. (17). Moreover, it can be affirmed that δ̂ can be seen as a construct that fulfills all the requirements stated in Section 3.1 for being a proper proxy.
4. Monte Carlo experiment: quantitative assessment of methods for correcting endogeneity in discrete-choice models
This section develops a Monte Carlo experiment with the threefold purpose of (i) illustrating the application of the five methods under study; (ii) assessing their finite-sample performance; and (iii) assessing the impact of some common flaws.
4.1. Data generation process
Consider a binary choice problem that replicates the main features of the conceptual model depicted in Eqs. (1)–(6). Endogeneity is created by omitting a variable that is correlated with another observed variable. Then, proper and improper instruments, indicators and structural equations are created to apply the methods under study to address endogeneity. The data are regenerated R = 100 times and statistics are calculated to analyze the finite-sample properties of the estimators obtained with the different methods.
The sample for each of the 100 repetitions consists of N = 1000 individuals. The linear systematic utility of each alternative depends on four attributes: p, a, b and q*, with population coefficients β_p = −2, β_a = 1, β_b = 1, β_q = 1. Here p represents, e.g., the price, and q* represents a quality measure that is correlated with price and that will cause endogeneity when it is omitted. The random part of the utility is distributed Extreme Value (0, 1), and therefore the choice model becomes the binary Logit shown in Eq. (18), which is used to simulate the choices in the Monte Carlo experiment.
\[
P_n(i) = \frac{e^{-2p_{in}+a_{in}+b_{in}+q_{in}^*}}{e^{-2p_{in}+a_{in}+b_{in}+q_{in}^*}+e^{-2p_{jn}+a_{jn}+b_{jn}+q_{jn}^*}}
\tag{18}
\]
Variables a, b, z, wz and h1 were generated as independent random Uniform (−2, 2). Variable q* was generated using the structural equation shown in Eq. (19), where the error term ε_q was generated as iid random Uniform (−2, 2). Besides, variable h2 was generated using Eq. (20), where ε_h was generated as iid random Uniform (−2, 2). As a result, h2 is correlated with h1, implying that the omission of h2 will cause endogeneity in the estimation of the structural equation Eq. (19).
\[
q_{in}^* = 2h_{1in} + 1h_{2in} + \varepsilon_{q\,in} \tag{19}
\]
\[
h_{2in} = 1h_{1in} + \varepsilon_{h\,in} \tag{20}
\]
The attribute p was generated as a function of variables z, wz and the attribute q*, with the coefficients shown in Eq. (21), plus an error term ε_p that was generated as independent random Uniform (−2, 2).
\[
p_{in} = 5 + 1z_{in} + 0.03\,wz_{in} + 1q_{in}^* + \varepsilon_{p\,in} \tag{21}
\]
By virtue of Eq. (21), q* is correlated with p. Therefore, if q* is omitted in the estimation of Eq. (18), the choice model will suffer from endogeneity. Furthermore, also by virtue of Eq. (21), z is a valid instrument for p because it is correlated with p and is exogenous to the model. In turn, wz is a weak instrument that is only slightly correlated with p because its coefficient in Eq. (21) is very small. Additionally, with the purpose of studying the impact of the failure of the instruments' exogeneity assumption, an endogenous instrumental variable that is correlated with q* was created as shown in Eq. (22).
\[
xz_{in} = 1z_{in} + 1q_{in}^* \tag{22}
\]
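The data generation process of Eqs. (19)–(22) can be sketched as follows. This is a minimal illustration, not the code used in the experiment: the function names `draw_alternative` and `corr` are introduced here for the example, the errors are taken as Uniform (−2, 2), and the sample size of 5000 draws is an arbitrary choice. The sample correlations make the roles of the variables visible: q* is strongly correlated with p, z is clearly correlated with p, wz is almost uncorrelated with p, and xz is strongly correlated with q*.

```python
import math
import random

def draw_alternative(rng):
    """Draw the attributes of one alternative following Eqs. (19)-(22)."""
    u = lambda: rng.uniform(-2.0, 2.0)
    a, b, z, wz, h1 = u(), u(), u(), u(), u()
    h2 = 1.0 * h1 + u()                             # Eq. (20)
    q = 2.0 * h1 + 1.0 * h2 + u()                   # Eq. (19), latent quality q*
    p = 5.0 + 1.0 * z + 0.03 * wz + 1.0 * q + u()   # Eq. (21)
    xz = 1.0 * z + 1.0 * q                          # Eq. (22), endogenous instrument
    return {"a": a, "b": b, "z": z, "wz": wz, "h1": h1, "h2": h2,
            "q": q, "p": p, "xz": xz}

def corr(x, y):
    """Sample Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(1)
data = [draw_alternative(rng) for _ in range(5000)]
col = lambda name: [row[name] for row in data]

corr_pq = corr(col("p"), col("q"))     # source of endogeneity when q* is omitted
corr_zp = corr(col("z"), col("p"))     # relevance of the proper instrument z
corr_wzp = corr(col("wz"), col("p"))   # relevance of the weak instrument wz
corr_xzq = corr(col("xz"), col("q"))   # exogeneity failure of xz
print(corr_pq, corr_zp, corr_wzp, corr_xzq)
```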
To apply the MIS method, two proper indicators (q1 and q2), two endogenous indicators (xq1 and xq2) and two weak indicators (wq1 and wq2) were generated for variable q*, as shown in Eq. (23), where the error terms ε_q1, ε_q2, ε_xq1, ε_xq2, ε_wq1 and ε_wq2 are iid Uniform (−2, 2).
\[
\begin{aligned}
q_{1in} &= 1q_{in}^* + \varepsilon_{q_1 in} \\
q_{2in} &= 1q_{in}^* + \varepsilon_{q_2 in} \\
xq_{1in} &= 0.25\,q_{in}^* + 0.5\,p_{in} + 0.5\,a_{in} + \varepsilon_{xq_1 in} \\
xq_{2in} &= 0.25\,q_{in}^* + 0.5\,p_{in} + 0.5\,a_{in} + \varepsilon_{xq_2 in} \\
wq_{1in} &= 0.01\,q_{in}^* + \varepsilon_{wq_1 in} \\
wq_{2in} &= 0.01\,q_{in}^* + \varepsilon_{wq_2 in}
\end{aligned}
\tag{23}
\]
Besides, two discrete indicators, Dq1 and Dq2, were constructed to explore the impact on the MIS method of having discrete, instead of continuous, indicators. The indicators were constructed from the quantiles of the empirical distribution of q1 and q2, which was built from the 1000 observations and 100 repetitions. The expressions used to build Dq1 are shown in Eq. (24), where Q%[q1] corresponds to the respective quantile. Dq2 was built in the same way.
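\[
Dq_{1in} =
\begin{cases}
1 & \text{if } q_{1in} < Q_{20\%}[q_1] \\
2 & \text{if } q_{1in} \le Q_{40\%}[q_1] \text{ and } q_{1in} > Q_{20\%}[q_1] \\
3 & \text{if } q_{1in} \le Q_{60\%}[q_1] \text{ and } q_{1in} > Q_{40\%}[q_1] \\
4 & \text{if } q_{1in} \le Q_{80\%}[q_1] \text{ and } q_{1in} > Q_{60\%}[q_1] \\
5 & \text{if } q_{1in} > Q_{80\%}[q_1]
\end{cases}
\tag{24}
\]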
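The construction of the discrete indicator in Eq. (24) can be sketched as below. The function name `discretize` and the simulated stand-in for q1 are assumptions made for the example; the quantile convention is that of Python's `statistics.quantiles`, which may differ slightly from the one used in the experiment. A value exactly equal to a cut point is assigned to the lower category.

```python
import bisect
import random
import statistics

def discretize(values):
    """Map continuous values to categories 1..5 using the 20/40/60/80% quantiles."""
    cuts = statistics.quantiles(values, n=5)   # four cut points
    categories = [bisect.bisect_left(cuts, v) + 1 for v in values]
    return categories, cuts

rng = random.Random(24)
# Hypothetical continuous indicator standing in for the pooled values of q1.
q1 = [rng.uniform(-10.0, 10.0) + rng.uniform(-2.0, 2.0) for _ in range(1000)]
dq1, cuts = discretize(q1)
shares = [dq1.count(c) / len(dq1) for c in range(1, 6)]
print(cuts, shares)
```

By construction, each of the five categories receives roughly one fifth of the observations, so the discrete indicator behaves like a five-point ordinal scale.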
Under this setting, 17 models were estimated for each of the 100 repetitions. The first two models are the benchmarks, which represent the best possible and the do-nothing scenarios. These are the True and the Endogenous models. The former considers all the variables described in Eq. (18), including q*, and the latter is estimated excluding q* from the systematic utility in Eq. (18) to produce endogeneity. The next five models correspond to the application of the five methods under study to correct for endogeneity under their best possible setting in this framework. The results of these models are reported and contrasted with the benchmarks in Section 4.2. The final 10 models correspond to cases in which the methods are estimated in the presence of flaws that may occur in practice. The results of these models are reported and contrasted with the benchmarks in Section 4.3.
4.2. Methods to correct for endogeneity under their best setting
The PR model to try to alleviate endogeneity in this setting was estimated including a composite proxy h_in = 2h1_in + 1h2_in in the systematic part of the utility. CF was estimated using the proper instrument z described in Eq. (21). ML was estimated considering the likelihood described in Eq. (11) and using the proper instrument z. MIS was estimated using the proper continuous indicators q1 and q2 described in Eq. (23). Finally, LV was estimated considering the structural equation described in Eq. (19) and the likelihood described in Eq. (17).
The estimators of these models were compared in terms of finite-sample bias and efficiency. Because the correction of endogeneity in discrete-choice models produces a change in the scale of the estimators (Guevara and Ben-Akiva, 2012), instead of the estimators themselves, their ratio β̂_p/β̂_a is analyzed and compared to its population value β_p/β_a = −2.
The results of the benchmarks and the five corrections are reported in Table 1 and Fig. 1. Table 1 reports the Mean Bias, the RMSE, a t-test of the hypothesis that β_p/β_a = −2 and the Empirical Coverage probability of the asymptotic confidence intervals. The average estimation Time is also reported to illustrate the computational burden of each method. Fig. 1 displays a Box-Plot (see, e.g., Larsen, 1978) of the sampling distribution of the estimators of each method across the 100 repetitions. The abscissa shows each estimation method and the ordinate corresponds to the respective estimated ratio β̂_p/β̂_a, which is compared to the population value, highlighted by a horizontal line. The diamond depicts the average and the bold bar shows the median.

Table 1
Monte Carlo analysis of methods under their best setting.

Models       Mean bias   RMSE    t-test β_p/β_a = −2   Empirical coverage 75%   Time (s)
True         0.0146      0.220   0.0666                76                       0.115
Endogenous   0.688       0.713   3.70                  3                        0.0708
PR           0.139       0.278   0.577                 64                       0.0953
CF           0.0532      0.319   0.169                 81                       0.107
ML           0.0654      0.340   0.196                 82                       89.8
MIS          0.0294      0.234   0.126                 78                       0.126
LV           0.00795     0.262   0.0303                81                       338

Statistics built from 100 repetitions of 1000 observations. Binary Logit choice model.

Fig. 1. Box-Plots of estimators for methods under their best setting.

Table 1 shows first that the True model was successfully estimated. The Bias is below 1% (0.0146/2 × 100 ≈ 0.73 in absolute value), the null hypothesis that β_p/β_a = −2 cannot be rejected at any reasonable significance level, and the Empirical Coverage is almost equal to its nominal value. Likewise, Fig. 1 shows that the sampling distribution of the True model is almost perfectly symmetric. It is centered on its true value both for the mean and the median, and both the box and the whiskers (see, e.g., Larsen, 1978) have similar size in both directions. Besides, the size of the box and the whiskers is relatively small compared to the results obtained for other methods.
The picture is very different for the Endogenous model. The Bias jumps to over 30%, the null hypothesis that β_p/β_a = −2 is wrongly rejected (t_2.5%,99 = 1.99) in spite of being true, and the Empirical Coverage is only 3%. Fig. 1 reinforces this result, showing that the sampling distribution is not only biased, but that it also has a moderate negative skewness, since the lower box and whiskers are larger than the upper ones. This means that the omission of q* successfully caused enough endogeneity to be detectable.
Regarding the PR method, the results show that it addressed the endogeneity problem to some degree, even though it is not fully correct theoretically in this context. The Bias was substantially reduced, to 7%. The t-test implies that the null hypothesis cannot be rejected at any reasonable significance level, and the sampling distribution is fairly symmetric. However, the results are clearly inferior to those achieved by the other correction methods. In this case, PR appears as a practical method that makes a correction in the right direction, although it is not consistent.
CF and ML addressed the problem of endogeneity very well. In both cases the Bias is about 3%. The Empirical Coverage is very close to the nominal one, but slightly larger, implying conservative confidence intervals. This suggests that the empirical distribution of the estimator may be skewed. The t-test is far below the critical value of t_2.5%,99 = 1.99. Note also that the RMSE of both methods is almost the same. This occurs because only one instrument is considered (Rivers and Vuong, 1988). Of course, compared to the True model, the Bias and the RMSE are larger for CF and ML. Finally, it should be noted that the CF is about 800 times faster than ML, because the former does not require integration.
Also regarding the CF and ML, Fig. 1 shows that their empirical sampling distributions seem to have a moderate negative skewness, since the lower box and whiskers are larger than the upper ones. This can be attributed to the fact that the true distribution of the error terms is not Normal, as was assumed in the likelihood written in Eq. (11). Besides, the ML shows a larger number of outliers, which could be attributed to estimation difficulties involved in the Maximum Simulated Likelihood method (see, e.g., Train, 2009) used for integration.
The MIS also resolved the problem of endogeneity. The Bias is below 2%. The Empirical Coverage is very similar to the nominal one, even closer than for CF and ML. The t-test of the null hypothesis that the ratio of the coefficients is equal to its population value is far below the critical value of t_2.5%,99 = 1.99. The RMSE is smaller than that of the CF and ML, and the estimation time is almost equal to that of the True model. Besides, Fig. 1 shows that the empirical sampling distribution is also fairly symmetric.
Finally, the results show that the LV method has the best performance in terms of Bias, which is not only smaller than that of any other method to correct for endogeneity, but even smaller than that of the True model. This is, of course, a result that will depend on the empirical setting, and in this case it is explained by the use of the additional information provided by the indicators and by the independent variables of the structural equation. It is interesting, though, that Fig. 1 shows that, in spite of the small Bias, the empirical distribution of the LV estimators is a little squeezed, which can be noted in that the median is not over the population value. Interestingly, the 50% mass represented by the box of LV is wider than the one obtained for MIS. This may be explained by the fact that the true error terms are not Normal, so that the likelihood considered for the LV method in Eq. (17) can only be seen as an approximation (White, 1982). To finish, it should be noted that the estimation time of the LV method was more than 3000 times larger than that of the two-stage methods, which illustrates the huge differences in computational burden involved in the use of different methods.
4.3. Analysis of the impact of common flaws
We now explore the finite-sample impact of various types of flaws that may affect each correction method in practice. For the PR method, we consider the case in which, instead of the composite proxy h_in = 2h1_in + 1h2_in, only h1_in is used to apply the method (PR-end). For the CF method, we consider the impact of using the weak (CF-weak) and endogenous (CF-end) instruments shown in Eqs. (21) and (22). For the MIS method, we study the impact of using the discrete indicators (MIS-disc) shown in Eq. (24), and the weak (MIS-weak) and endogenous (MIS-end) indicators shown in Eq. (23). We also consider the effect of including a proper indicator in the utility, but not using the second one to correct for the remaining endogeneity (Indic). Finally, for the case of LV, we analyze the impact of estimating it in two stages (LV-seq), of estimating it in two stages and omitting h2_in (LV-s-end), and of estimating it simultaneously but omitting h2_in (LV-end).
Table 2 and Fig. 2 show first that an endogenous specification of the proxy (PR-end) more than doubled the Bias and reduced the Empirical Coverage to 25%. The t-test is close to wrongly rejecting the null hypothesis that the ratio of the estimators is equal to its population value.
Regarding CF-weak and CF-end, the results show that those methods did not resolve the problem at all. In the first case, although the t-test and the Empirical Coverage seem to be well behaved, the Bias goes up to 19%. This dichotomy is explained by the huge RMSE of CF-weak. For CF-end the Bias is over 40% while the RMSE is only four times the one attained with the True model; consequently, the Empirical Coverage becomes as small as zero, and the null hypothesis that the estimator is equal to its true value is wrongly rejected with high confidence. Notably, CF-end is even worse than the do-nothing approach and, although CF-weak may represent a small improvement in terms of the average Bias, its huge variance implies that, in many cases, the CF-weak results are also worse than those of the uncorrected model. Fig. 2 also illustrates that, when the instruments are only weakly correlated with the endogenous variable, the severe skewness and variance of the sampling distribution of the estimators may easily make the corrected model significantly worse than the uncorrected one.
The results also show that, although in theory the MIS-disc method can only be formally seen as an approximation, the practical results are remarkably good, very similar to those obtained for MIS with continuous indicators. This finding may be relevant for practice since, in various cases, the available indicators will be discrete, such as when they are gathered from Likert scales. However, it should be kept in mind that the Monte Carlo experiments are specific case studies that cannot be used to reach a general conclusion.
Further theoretical and empirical research is needed to explore the validity of this promising finding. For both MIS and MIS-disc the Bias is below 2%. The Empirical Coverage is very similar to the nominal one, but slightly smaller for the MIS-disc, which suggests that the empirical distribution of the estimator may be skewed. This is reinforced by the Box-Plot, which shows that the median is much larger than the true value. The t-test is far below the critical value of t_2.5%,99 = 1.99. Note also that the RMSE of MIS and MIS-disc are almost the same. Finally, both methods require similar estimation time. These results are reinforced by the Box-Plots of Fig. 2, where it can be noted that the use of discrete indicators (MIS-disc) does make the empirical sampling distribution a little skewed, compared to the distribution obtained for the True model or for the application of the MIS.
In contrast, the results are extremely poor when the two indicators are endogenous (MIS-end). They are even worse than the results attained for the Endogenous model. The Bias is over 100%, the null hypothesis that β_p/β_a = −2 is wrongly rejected at the 5% significance level and the Empirical Coverage is very low. Fig. 2 reinforces this result, suggesting that the main reason for having non-zero Empirical Coverage in this case is the huge sampling variance. When the indicators are weak (MIS-weak) the results are also poor, almost as poor as those of the uncorrected model. Besides, Fig. 2 shows that the skewness becomes extreme in the case of weak indicators, which can be noted in that the median (bold line) is very far from the sample mean (diamond).
In turn, the direct inclusion of one proper indicator without further correction (Indic) did not solve the endogeneity problem, but it mitigated it substantially.
The Bias is 6%, the Empirical Coverage is close to its nominal value, and the t-test of the null hypothesis that the ratio of the estimators is equal to its population value cannot be rejected with high confidence.
It can also be noted that, as expected, the LV-seq method did not work, although the results are not remarkably bad. The Bias is over 5% and the Empirical Coverage is slightly lower than the nominal value. Besides, the t-test is lower than the critical value. Since the results for LV-seq are much better than those of the Endogenous model, it can be affirmed that, in some way, LV-seq mitigates the problem. Interestingly, the results of the sequential estimator are almost equal to those of the PR.
Table 2
Monte Carlo analysis of methods under common flaws.

Models       Mean bias   RMSE    t-test β_p/β_a = −2   Empirical coverage 75%   Time (s)
True         0.0146      0.220   0.0666                76                       0.115
Endogenous   0.688       0.713   3.70                  3                        0.0708
PR-end       0.348       0.408   1.65                  25                       0.0770
CF-weak      0.372       13.2    0.0283                95                       0.105
CF-end       0.836       0.849   5.76                  0                        0.102
MIS-disc     0.0354      0.237   0.151                 70                       0.126
MIS-weak     1.17        4.02    0.303                 98                       0.146
MIS-end      2.32        2.55    2.18                  12                       0.130
Indic        0.123       0.245   0.577                 63                       0.0896
LV-seq       0.139       0.278   0.577                 65                       0.108
LV-s-end     0.348       0.407   1.64                  25                       0.101
LV-end       0.00661     0.266   0.0248                80                       323

Statistics built from 100 repetitions of 1000 observations. Binary Logit choice model.
Fig. 2. Box-Plots of estimators for methods under common flaws.
The results also show that the sequential estimation of the LV method became significantly worse when the structural equation of the latent variable was wrongly specified (LV-s-end). In turn, when this problem occurred in the simultaneous LV model (LV-end), the endogeneity problem was not transmitted to the estimators of the utility parameters. Indeed, the statistics of the LV and the LV-end methods are virtually the same, both in Table 2 and Fig. 2. This is because, as discussed before, the inclusion of f4 in the likelihood, with proper instrumental variables, confined the endogeneity to the equation of the latent variable (Eq. (6)). For a general case, in which Eq. (2) is not available, the endogeneity in Eq. (6) will also result in inconsistent estimators of the utility coefficients of Eq. (1).
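The summary statistics of Tables 1 and 2 can be computed from the estimates of the R repetitions along the following lines. This is a sketch under assumptions, not the code used in the experiment: the function names are introduced here, the t-statistic is taken as the absolute mean bias divided by the standard deviation of the estimates across repetitions (a definition that appears consistent with the reported values, since RMSE² = bias² + variance), and the coverage uses a normal confidence interval built from a standard error supplied for each repetition. The illustrative estimates are synthetic.

```python
import math
import random
import statistics

def summarize(estimates, std_errors, true_value, level=0.75):
    """Mean bias, RMSE, t-statistic and empirical coverage (%) across repetitions."""
    errors = [e - true_value for e in estimates]
    mean_bias = statistics.fmean(errors)
    rmse = math.sqrt(statistics.fmean([err * err for err in errors]))
    t_stat = abs(mean_bias) / statistics.stdev(estimates)
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2.0)
    hits = sum(1 for e, s in zip(estimates, std_errors)
               if abs(e - true_value) <= z * s)
    return mean_bias, rmse, t_stat, 100.0 * hits / len(estimates)

def t_from_bias_and_rmse(mean_bias, rmse):
    """t-statistic implied by RMSE^2 = bias^2 + variance."""
    return abs(mean_bias) / math.sqrt(rmse ** 2 - mean_bias ** 2)

# Cross-check against the Endogenous row of Table 1 (bias 0.688, RMSE 0.713).
t_endogenous = t_from_bias_and_rmse(0.688, 0.713)

# Synthetic illustration: unbiased estimates with a correctly stated standard error.
rng = random.Random(7)
est = [rng.gauss(-2.0, 0.2) for _ in range(1000)]
bias, rmse, t_stat, coverage = summarize(est, [0.2] * len(est), -2.0)
print(t_endogenous, bias, rmse, t_stat, coverage)
```

Under this reading, a t-statistic above the critical value signals that the mean of the sampling distribution is far from the population value relative to its spread, while a coverage far from 75% signals that the asymptotic confidence intervals are unreliable.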
5. Conclusion
Endogeneity often arises in discrete-choice models, precluding the consistent estimation of the model parameters. This is important because forecasting or behavioral assessment for policy design may be seriously misled if they rely on models built on inconsistent estimators (see, e.g., Guevara and Thomas, 2007). This problem is often neglected or only partially addressed using ad-hoc approaches.
This article assessed five methods to address this issue, contrasting and clarifying the methodological and practical details involved in their application. The methods studied were: the use of Proxies (PR); the two-step Control-Function (CF) method; the simultaneous estimation of the CF method via Maximum-Likelihood (ML); the Multiple Indicator Solution (MIS); and the integration of Latent-Variables (LV). The assessment was first made qualitatively, in terms of the formulation, normalization and data needs of each method. Then, the assessment was made quantitatively, by means of a Monte Carlo experiment to study the finite-sample properties under a common data generation process, both when all assumptions hold and under some common flaws.
The methods studied differed notably in the range of problems that they can address; their underlying assumptions; the difficulty of gathering proper auxiliary variables needed to apply them; and their practicality, both in terms of the need for coding and their computational burden. The choice of a method will ultimately depend on the beliefs of the researcher about the data generation process for the particular problem under study, and the respective identification restrictions that could be gathered from it. The main contribution of the article is to inform this choice by providing, in Section 3, a comprehensive comparative analysis of the assumptions of each method under a common framework. In addition, Section 4 helps to illustrate the application of the methods and to explore the potential impact of some common flaws. The analysis showed that PR was formally inappropriate for many cases, but that it was easy to apply, and corrected in the right direction. CF was also easy to apply with canned software, but required instrumental variables which may be hard to collect in various contexts. Since CF is estimated in two stages, it may also compromise efficiency and complicate the estimation of standard errors. ML guarantees efficiency and direct estimation of the standard errors, but at the cost of a larger computational burden required for the estimation of a multifold integral, with potential difficulties in identification, and retaining the difficulty of gathering proper instrumental variables. The MIS method appears relatively easy to apply and requires indicators that, in various situations, may be easier to obtain. An interesting result from the Monte Carlo experiments suggests that discrete indicators may be as good as continuous ones for applying the MIS method.
Finally, the LV approach appears to be the most versatile method, but at a high cost in computational burden, identification problems and limitations in the ability to write proper structural equations for the latent variable.
Three lines for future research can be identified from this work. The first is to further explore the finding suggesting that discrete indicators may be as good as continuous ones for the application of the MIS method. This avenue could first be investigated in practice by performing a Monte Carlo analysis systematically oriented toward this goal. Then, a formal demonstration could be attempted following an approach similar to the one proposed by Villas-Boas (2007). A second line of future research is the study and assessment of practical applications of nonparametric methods to address endogeneity in discrete-choice models. For this goal, the starting point would be the approach proposed by Chesher et al. (2013). Finally, a weakness shared by all the methods to address endogeneity in discrete-choice models discussed in this article is that they rely on point identification, which is achieved with parametric restrictions. The problem is that such an approach will fail if the parametric restrictions do not hold in practice, as is often the case. An alternative to point identification is partial identification (Tamer, 2010). The latter approach considers the sensitivity of the estimated parameters to the validity of weak modeling assumptions. Consequently, a final avenue for further research is the analysis of partial identification methods in discrete-choice models in practice.
Acknowledgments
This publication was funded in part by CONICYT, FONDECYT 1150590. The final version of this article was prepared when I was a Visiting Research Fellow at the Choice Modelling Centre of the Institute for Transport Studies of the University of Leeds, UK.
All Monte Carlo experiments were generated, estimated and analyzed using the open-source software R (R Development Core Team, 2008). I would like to acknowledge the valuable comments provided by Dr. Thijs Dekker and three anonymous referees to a preliminary version of the article. Of course, all potential errors remain mine.
References
Abay, K.A., Paleti, R., Bhat, C.R., 2013. The joint analysis of injury severity of drivers in two-vehicle crashes accommodating seat belt use endogeneity. Transport. Res. Part B: Methodol. 50, 74–89.
Ben-Akiva, M., Walker, J., Bernardino, A.T., Gopinath, D.A., Morikawa, T., Polydoropoulou, A., 2002. Integration of choice and latent variable models. In: Perpetual Motion: Travel Behaviour Research Opportunities and Application Challenges, pp. 431–470.
Berry, S., Levinsohn, J., Pakes, A., 1995. Automobile prices in market equilibrium. Econometrica: J. Econ. Soc., 841–890.
Bhat, C.R., Guo, J., 2004. A mixed spatially correlated logit model: formulation and application to residential choice modeling. Transport. Res. Part B: Methodol. 38 (2), 147–168.
Cherchi, E., Guevara, C.A., 2012. A Monte Carlo experiment to analyze the curse of dimensionality in estimating random coefficients models with a full variance–covariance matrix. Transport. Res. Part B: Methodol. 46 (2), 321–332.
Chesher, A., 2003. Identification in nonseparable models. Econometrica 71 (5), 1405–1441.
Chesher, A., 2005. Nonparametric identification under discrete variation. Econometrica, 1525–1550.
Chesher, A., 2010. Instrumental variable models for discrete outcomes. Econometrica 78 (2), 575–601.
Chesher, A., Rosen, A.M., Smolinski, K., 2013. An instrumental variable model of multiple discrete choice. Quant. Econ. 4, 157–196.
De Blander, R., 2008. Which null hypothesis do overidentification restrictions actually test? Econ. Bull. 3 (76), 1–9.
Ferreira, F., 2010. You can take it with you: proposition 13 tax benefits, residential mobility, and willingness to pay for housing amenities. J. Public Econ. 94 (9), 661–673.
Glerum, A., Atasoy, B., Bierlaire, M., 2014. Using semi-open questions to integrate perceptions in choice models. J. Choice Modell. 10, 11–33.
Guevara, C.A., 2005. Addressing endogeneity in residential location models (Master thesis). Massachusetts Institute of Technology.
Guevara, C.A., 2010. Endogeneity and Sampling of Alternatives in Spatial Choice Models (Doctoral dissertation, Ph. D. Thesis, Department of Civil and Environmental Engineering, Massachusetts Institute of Technology, Cambridge, MA).
Guevara, C.A., Ben-Akiva, M., 2006. Endogeneity in residential location choice models. Transport. Res. Rec.: J. Transport. Res. Board 1977 (1), 60–66.
Guevara, C.A., Ben-Akiva, M., 2010. Addressing endogeneity in discrete choice models: assessing control-function and latent-variable methods. In: Choice Modelling: The State-of-the-Art and the State-of-Practice: Proceedings from the Inaugural International Choice Modelling Conference. Emerald Group Publishing, p. 353.
Guevara, C., Ben-Akiva, M., 2012. Change of scale and forecasting with the control-function method in logit models. Transport. Sci. 46 (3), 425–437.
Guevara, C.A., Navarro, P., 2015. Detection of Weak Instruments in Binary Choice-Models: A Monte Carlo Method. Working Paper, Universidad de los Andes.
Guevara, C.A., Polanco, D., 2015. Correcting for endogeneity without instruments in discrete choice models: the multiple indicator solution (MIS). Working Paper, Universidad de los Andes.
Guevara, C.A., Thomas, A., 2007. Multiple classification analysis in trip production models. Transport Policy 14 (6), 514–522.
Hausman, J., 1996. Valuation of New Goods under Perfect and Imperfect Competition. In: Bresnahan, Gordon (Eds.), The Economics of New Goods, Studies in Income and Wealth, vol. 58. National Bureau of Economic Research, Chicago.
Heckman, J., 1978. Dummy endogenous variables in a simultaneous equation system. Econometrica 46, 931–959.
Karaca-Mandic, P., Train, K., 2003. Standard error correction in two-stage estimation with nested samples. Econ. J. 6 (2), 401–407.
Larsen, W.A., 1978. Variations of box plots. Am. Stat. 32 (1), 12–16.
Lee, L., 1982. Specification error in multinomial logit models. J. Econ. 20, 197–209.
Matzkin, R.L., 2007. Nonparametric identification. In: Heckman, J., Leamer, E. (Eds.), Handbook of Econometrics, vol. 6B. Elsevier, Amsterdam.
Mumbower, S., Garrow, L.A., Higgins, M.J., 2014. Estimating flight-level price elasticities using online airline data: a first step toward integrating pricing, demand, and revenue optimization. Transport. Res. Part A: Policy Pract. 66, 196–212.
Nevo, A., 2001. Measuring market power in the ready-to-eat cereal industry. Econometrica 69 (2), 307–342.
Newey, W., 1985. Generalized method of moments specification testing. J. Econ. 29, 229–256.
Petrin, A., Train, K., 2002. Omitted Product Attributes in Discrete Choice Models. Working Paper, Department of Economics, University of California, Berkeley, CA.
Petrin, A., Train, K., 2010. A control function approach to endogeneity in consumer choice models. J. Market. Res. 47 (1), 3–13.
Quigley, J.M., 1976. Housing demand in the short run: an analysis of polytomous choice. In Explorations in Economic Research, Volume 3, number 1. NBER, pp. 76–102.
R Development Core Team, 2008. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
Rivers, D., Vuong, Q., 1988. Limited information estimators and exogeneity tests for simultaneous probit models. J. Econ. 39, 347–366.
Ruud, P., 1983. Sufficient conditions for the consistency of maximum likelihood estimation despite misspecification of distribution in multinomial discrete models. Econometrica 51, 225–228.
Sermons, M.W., Koppelman, F.S., 2001. Representing the differences between female and male commute behavior in residential location choice models. J. Transp. Geogr. 9 (2), 101–110.
Staiger, D., Stock, J.H., 1997. Instrumental variables regression with weak instruments. Econometrica, 557–586.
Tamer, E., 2010. Partial identification in econometrics. Annu. Rev. Econ. 2 (1), 167–195.
Tirachini, A., Hensher, D.A., Rose, J.M., 2013. Crowding in public transport systems: effects on users, operation and implications for the estimation of demand. Transport. Res. Part A: Policy Pract. 53, 36–52.
Train, K.E., 2009. Discrete Choice Methods with Simulation. Cambridge University Press.
Villas-Boas, J.M., 2007. A Note on Limited Versus Full Information Estimation in Non-linear Models. Discussion Paper, University of California, Berkeley.
Waddell, P., 1992. A multinomial logit model of race and urban structure. Urban Geogr. 13 (2), 127–141.
Walker, J., Ben-Akiva, M., 2002. Generalized random utility model. Math. Soc. Sci. 43 (3), 303–343.
Walker, J.L., Ben-Akiva, M., Bolduc, D., 2007. Identification of parameters in normal error component logit-mixture (NECLM) models. J. Appl. Econ. 22 (6), 1095–1125.
Wardman, M., Whelan, G., 2011. Twenty years of rail crowding valuation studies: evidence and lessons from British experience. Transport Rev. 31 (3), 379–398.
White, H., 1982. Maximum likelihood estimation of misspecified models. Econometrica 50 (1), 1–25.
Wooldridge, J., 2010. Econometric Analysis of Cross-Section and Panel Data, second ed. MIT Press, Cambridge, MA.
Yanez, M.F., Raveau, S., Ortúzar, J.D.D., 2010. Inclusion of latent variables in mixed logit models: modelling and forecasting. Transport. Res. Part A: Policy Pract. 44 (9), 744–753.