Choice of mode of transport for long-distance trips: Solving the problem of sparse data

Transportation Research Part A 40 (2006) 587–601 www.elsevier.com/locate/tra

Andrés Monzón *, Alvaro Rodríguez-Dapena

Departamento de Transportes, Escuela Técnica Superior de Ingenieros de Caminos, Universidad Politécnica de Madrid, Ciudad Universitaria, 28040 Madrid, Spain

Received 6 December 2002

Abstract

Transport planning is usually based on model forecasts, but the reliability of those forecasts depends heavily on the quality of the input data. Discrete-choice models are used to characterise travellers' behaviour in choosing their transport mode, and their calibration is usually based on data from household survey campaigns. However, modelling multimodal and intermodal transport at the interurban level is far more complicated and costly than in an urban area. One way to reduce costs is to design a choice-based sampling strategy in which household surveys are replaced by specific surveys for each transport mode. This strategy generates a non-random sample that has to be treated correctly during estimation: in principle, the sample does not reproduce the population market shares of the different transport options. Moreover, as a result of both physical and functional constraints, the survey cannot cover all origin–destination pairs (O–D pairs) in an optimal way and, consequently, the same bias also affects each individual O–D pair or, at least, each group of pairs. To overcome this problem, this study presents a new procedure derived from maximum likelihood estimators. These estimators take as given the population quotas of each mode, both overall and by group of O–D pairs. The procedure is based on the optimisation of an objective function that corrects the bias in a way similar to the estimators for choice-based samples. The method, named DWELT, estimates the parameters corresponding to each explanatory variable using mode shares for each O–D pair or group of pairs. DWELT has been successfully validated in a case study of the Madrid–Barcelona interurban corridor in Spain. This result enables a more flexible and cheaper survey procedure for interurban transport planning, so that transport policy strategies can be designed and tested at lower cost.

© 2005 Elsevier Ltd. All rights reserved.

Keywords: Multimodal transport corridor; Choice-modelling; Data collection; Sample optimisation

* Corresponding author. Tel.: +34 91 336 53 73; fax: +34 91 549 26 28. E-mail address: [email protected] (A. Monzón).

0965-8564/$ - see front matter © 2005 Elsevier Ltd. All rights reserved. doi:10.1016/j.tra.2005.11.007

A. Monzón, A. Rodríguez-Dapena / Transportation Research Part A 40 (2006) 587–601

1. Introduction

Travellers' choice among transport options (modes and/or routes) from a given limited set $C$ has usually been characterised by discrete-choice models. The most common theoretical framework for this kind of model is random utility theory, which is based on the individual maximisation of a random utility $U_i$ assigned to each option $i$. The random utility can be broken down into two components, $U_i = V_i + \varepsilon_i$ ($i \in C$): a systematic component $V_i$, which is a function of the measured explanatory variables $x$, and a random component $\varepsilon_i$ representing the uncertainty derived from individual behaviour and from the modeller's measurement errors (McFadden, 1981). If we assume a mathematical and statistical structure for the systematic and random components, the utility maximisation hypothesis can be formulated in probabilistic terms as $P_{i|C} = P(i|x, \theta)$: the probability of choosing transport alternative $i$ from the set $C$ is conditional on a set of variables $x$ that account for individual and travel characteristics and on a set of parameters $\theta$.

The problem of estimating a discrete-choice model $P_{i|C} = P(i|x, \theta)$ from a given sample of observations $\{i_n, x_n\}_{1 \le n \le N}$ lies in calibrating the parameters $\theta$ so that the model matches reality as closely as possible. From a statistical point of view, the main goal is to develop an appropriate estimator for $\theta$. This estimator is a random variable with respect to successive sample draws, and it should take a value as close as possible to the assumed true value for the population under analysis. In general, the quality of the estimator is examined in asymptotic terms, i.e., presupposing sample sizes tending to the population size and this, in turn, tending to infinity. Following Rao (1984) and Cramer (1986), two properties are checked: consistency (the estimator converges in probability to the population value) and asymptotic efficiency (the variance of the limiting distribution attains the minimum possible value). Generally speaking, maximum likelihood estimators have these two properties and are thus especially suitable for large samples. Additionally, except in a few specific cases, the non-linearity of discrete-choice models prevents the use of least-squares estimators. Thus, for a simple random sample of observations $\{i_n, x_n\}_{1 \le n \le N}$, and after taking logarithms, the maximum likelihood estimator of $\theta$ is the solution of

$$\max_{\theta} \; \ln L(\theta) = \sum_{n=1}^{N} \ln P(i_n | x_n, \theta) \tag{1}$$
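As an illustration of Eq. (1), the following minimal Python sketch evaluates the log-likelihood of a simple random sample and maximises it by a coarse grid search. The two-mode model `binary_logit`, the four observations and the grid are hypothetical choices made only for this example; they are not taken from the paper, and a real application would use a proper numerical optimiser.

```python
import math

def log_likelihood(choices, features, prob, theta):
    """Eq. (1): sum over the sample of ln P(i_n | x_n, theta)."""
    return sum(math.log(prob(i, x, theta)) for i, x in zip(choices, features))

def binary_logit(i, x, theta):
    """Hypothetical two-mode model: P(i = 1 | x) = 1 / (1 + exp(-theta * x))."""
    p1 = 1.0 / (1.0 + math.exp(-theta * x))
    return p1 if i == 1 else 1.0 - p1

# Made-up sample: chosen mode (1 or 0) and a single explanatory variable.
choices = [1, 0, 1, 0]
features = [0.5, -1.0, 2.0, 0.1]

# Coarse grid search over theta, standing in for a numerical optimiser.
grid = [k / 10 for k in range(-60, 61)]
theta_hat = max(
    grid, key=lambda t: log_likelihood(choices, features, binary_logit, t)
)
```

The same `log_likelihood` function accepts any probability model with the signature `prob(i, x, theta)`, which is why the choice model is passed in as an argument rather than hard-coded.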

However, it is not always possible to conduct survey campaigns that generate random samples at a reasonable cost. Household surveys meet the randomness requirement, since they yield random lists of observations of the trips under study that can be used to build a sample representative of the population. Nevertheless, this type of campaign usually proves costly in interurban studies for several reasons. Firstly, the target group for the survey is often scattered across different population centres that are, by definition, distant from each other. Secondly, long-distance trips are much less frequent than urban trips, so respondents have to be asked to recall trips made some time earlier, which makes omissions and erroneous descriptions of previous trips likely. Lastly, for geographic reasons, a large number of the trips recorded may be of no interest to the analyst because they fall outside the area (corridor or region) under analysis.

A possible solution to this efficiency problem is to undertake a different type of survey targeted specifically at travellers. In general, this type of survey is designed by selecting a number of strategic survey locations along communication routes and at transport terminals such as airports and stations. The aim is to draw a sufficiently representative sample of observations directly related to the study goals. For the same level of efficiency, the campaign cost is considerably reduced, since this approach ensures the direct in situ collection of data from trips that are relevant for profiling mode choice behaviour. From a statistical standpoint, this type of survey corresponds to a stratified sampling strategy based on individual travellers' actual choices; in other words, it is based directly on the dependent variable of the discrete-choice model. Under this approach, the stratification becomes endogenous ("choice-based"), generating a first type of bias due to the lack of proportionality between the observed sample and the population for each transport mode. Correcting this bias requires redesigning the estimators developed for random sampling, and the procedures for fitting the discrete-choice model therefore become more complex. The technical literature (Manski and Lerman, 1977; Cosslett, 1981; Manski and McFadden, 1981; Imbens, 1992) includes several proposals for estimators that seek to guarantee statistical properties such as consistency and asymptotic efficiency as the sample size tends to infinity. Additionally, some of these techniques can be implemented as algorithms that run efficiently on personal computers.

Moreover, there is another source of bias, geographic in nature, that arises in interurban long-distance transport surveys from the small number of possible locations for survey points. This is particularly important in the case of road transport, since survey points are restricted to existing stopping places (toll gates, gas stations, etc.), which are not always the best places to intercept the relevant travellers. This problem prevents an accurate representation of the distribution of O–D pairs, because the survey points do not cover it uniformly and in some cases it is simply impossible to find a survey point. To overcome this source of bias, it is especially important to combine suitably differentiated O–D pairs in the same sample so that enough diversity is achieved. In urban studies the diversity of O–D pairs is usually sufficient. However, this information is dramatically curtailed in the case of long-distance trips.
Only an approach based on stated preference techniques would provide sufficient variability in such a case. Therefore, it is advisable to combine different O–D pairs so as to allow for greater diversity in travellers' choices. We can benefit from the good knowledge usually available about the total number of travellers using each mode of transport for each O–D pair. This information can be used to improve the statistical estimators and thus to correct, in an economically affordable way, the overall bias affecting samples in interurban transport studies.

This paper presents a new method for estimating discrete-choice models in an interurban context, called DWELT (Double Weighted Estimator for Long distance Transport mode choice models). It uses all the information provided by specific surveys based on a choice-based sampling strategy. To correct the sample bias between modes and O–D pairs, DWELT incorporates ratios between the population quotas and the sample quotas of transport demand, disaggregated by mode and by O–D pair. These O–D pairs can be considered individually or aggregated into groups of O–D pairs with a common level of sample bias. One important requirement of DWELT is that these population quotas of transport demand are inputs to the estimation process and, consequently, must be known in advance. This requirement is met if there is a national transportation model or mobility survey that includes O–D matrices stemming either from official databases or from specific transport studies. Interurban demand matrices by mode are available in many countries, Spain included, and have been used successfully to apply the method to the Madrid–Barcelona corridor. DWELT can also be applied to modelling short-distance mobility patterns (urban and metropolitan areas). However, it is in interurban cases that DWELT proves most beneficial, owing to the almost prohibitive cost of household survey campaigns.
Reasonably priced household surveys can only be planned for urban and metropolitan areas, where they usually provide trip samples with no significant bias.

2. Estimation with endogenous sampling by O–D pairs

Fitting a discrete-choice model $P(i|x, \theta)$ to a generic stratified sample requires distinguishing between two stratification strategies: the exogenous type, based on one or more explanatory variables $x$, and the endogenous ("choice-based") type, based on the dependent variable $i$ itself. Exogenous stratification allows the model to be estimated just as in the case of a simple random sample, by means of the standard maximum likelihood estimator given in (1) (Ben-Akiva and Lerman, 1985). By contrast, endogenous stratification tends to generate a significant sampling bias, which requires the maximum likelihood estimator to be redesigned. The estimation then becomes so complex that, except in some specific instances, calculations have to be simplified at the expense of statistical efficiency. From the statistical

point of view, there is a third strategy besides the exogenous and endogenous stratification approaches. This wider approach consists of a prior grouping of the sample using a variable other than $i$ or $x$. In interurban transport studies that variable could be the group of O–D pairs, denoted by $r$. Each of the groups $r$ would thus constitute a primary group of observations, comprising in turn a set of secondary observations measured through the variables $i$ and $x$. Combining observations from different O–D pairs in the same sample adds useful variability to the estimation of discrete-choice models, especially in the case of stated preference surveys. This procedure provides the analyst with information which is accessible in a comprehensive and reliable census. In urban studies, when the survey is of the household type, it is simple to obtain a sample whose O–D pair distribution is close to that of the population. By contrast, in interurban cases, where the campaign is traveller-oriented, the grouping by O–D pairs can only be partially identified a priori, depending on the selected survey locations. The most representative groups of O–D pairs have to be identified from the survey results, which does not allow any prior control over the sampling bias associated with those O–D pairs. The statistical correction of this bias rests on a stratification scheme that incorporates not only the variables $i$ and $x$ but also the O–D pairs represented by $r$. The overall density function is expressed as follows:

$$f(i, x, r) = p(x, r) \cdot P(i|x, r, \theta^*) = p(r) \cdot p(x|r) \cdot P(i|x, r, \theta^*) \tag{2}$$

where the probability of each observation $\{i_n, x_n, r_n\}_{1 \le n \le N}$ comprises three factors: the probability $p(r_n)$ that trip observation $n$ belongs to $r$; the probability $p(x_n|r_n)$ that, given this, the explanatory variables are $x_n$; and the probability $P(i_n|x_n, r_n, \theta^*)$ that, given both, the traveller chooses option $i$ from the set $C_n$ of available modes. Under these conditions, in order to set up the estimator it is necessary to consider a group of samples based on the O–D pairs $r$. Each of them is a stratified sub-sample based on the alternatives $i$; in other words, the sub-sample is totally endogenous. Exogenous stratification with respect to the variables $x$ is not taken into consideration since it has no influence on the estimation method presented here.

Fig. 1 shows the differences to be expected between sample quotas (top) and population quotas (bottom), both in terms of O–D pairs $r$ and in terms of transport alternatives $i$. In sampling terms, there are $C \cdot R$ strata $\{N_{ir}\}_{1 \le i \le C,\, 1 \le r \le R}$ that yield the following fractions or quotas: $H_{ir} = H_r \cdot H_{i|r} = (N_r/N) \cdot (N_{i|r}/N_r)$ ($1 \le i \le C$, $1 \le r \le R$). In population terms, each stratum determines a specific probability mass, which depends on the parameters considered and is expressed by

$$Q(i, r|\theta) = \int_x f(i, x, r|\theta)\, dx = p(r) \int_x p(x|r) \cdot P(i|x, r, \theta)\, dx = Q(r) \cdot Q(i|r, \theta) \tag{3}$$

Fig. 1. Stratified sampling scheme according to mode and groups of O–D pairs.
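The sample quotas $H_r$, $H_{i|r}$ and $H_{ir}$ defined above can be computed directly from a table of survey counts. The sketch below is a minimal Python illustration under stated assumptions: the group labels, mode labels and counts in `sample_counts` are invented for the example, and `quotas` is a hypothetical helper, not part of the paper.

```python
# Hypothetical survey counts N_{i|r}: trips observed by O-D group r and mode i.
sample_counts = {
    "r1": {"plane": 300, "car": 100},
    "r2": {"plane": 50, "car": 50},
}

def quotas(counts):
    """Return the shares (H_r, H_{i|r}, H_{ir}) from a nested table of counts."""
    total = sum(sum(modes.values()) for modes in counts.values())
    # H_r = N_r / N
    share_r = {r: sum(modes.values()) / total for r, modes in counts.items()}
    # H_{i|r} = N_{i|r} / N_r
    share_i_given_r = {
        r: {i: n / sum(modes.values()) for i, n in modes.items()}
        for r, modes in counts.items()
    }
    # H_{ir} = H_r * H_{i|r}
    joint = {
        (i, r): share_r[r] * share_i_given_r[r][i]
        for r, modes in counts.items()
        for i in modes
    }
    return share_r, share_i_given_r, joint

H_r, H_i_given_r, H_ir = quotas(sample_counts)
```

Applied to population trip counts by mode and O–D group instead of survey counts, the same helper would return the corresponding population shares.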

The key step in developing maximum likelihood estimators is the design of an appropriate density function $g(i, x, r|\theta)$ induced by the sampling scheme just outlined. Following the order imposed on the stratification process, if we start with the strata based on O–D pairs $r$, the probability of an observation $(i, x, r)$ being selected in the sample is

$$g(i, x, r|\theta) = H_r \, g(i, x|r, \theta) \tag{4}$$

where $H_r = N_r/N_m$ represents the probability of stratum $r$ appearing in the sample and $g(i, x|r, \theta)$ represents the probability of the pair $(i, x)$ being selected from within that stratum. In turn, this relation can be expressed as follows:

$$g(i, x|r, \theta) = \frac{g_{I|r}(i, x|r, \theta)}{Q(r)} \tag{5}$$

where $Q(r)$ is the probability of stratum $r$ appearing in the population and $g_{I|r}(i, x|r, \theta)$ represents the density function induced by pure endogenous stratified sampling based on each mode $i$ and conditional on each O–D pair $r$. The sub-index $I|r$ denotes this sampling scheme. Combining (4) and (5) yields another expression, which is simply a direct application of Bayes' theorem:

$$g(i, x, r|\theta) = \frac{g_{I|r}(i, x|r, \theta) \cdot H_r}{Q(r)} \tag{6}$$

The density function corresponding to a pure endogenous stratified sampling scheme is well known in the literature (see Manski and McFadden, 1981) and can be set up following the same method as before, the only difference being that, in this case, all functions are conditional on $r$:

$$g_{I|r}(i, x|r, \theta) = \frac{P(i|x, r, \theta) \cdot p(x|r) \cdot H_{i|r}}{Q(i|r, \theta)} \tag{7}$$
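A quick numerical check can make Eq. (7) more concrete. The Python sketch below uses a discrete toy population for a single O–D group $r$, so the integral over $x$ becomes a sum. All values (`p_x_given_r`, `P_choice`, `H_i_given_r`) are hypothetical. Under these assumptions the induced density should sum to one, and its total mass on each mode should equal the sample quota $H_{i|r}$ fixed by the survey design.

```python
# Hypothetical discrete toy population for one O-D group r.
p_x_given_r = {"short": 0.4, "long": 0.6}   # p(x|r)
P_choice = {                                 # P(i|x, r, theta)
    "short": {"car": 0.7, "plane": 0.3},
    "long": {"car": 0.2, "plane": 0.8},
}
H_i_given_r = {"car": 0.25, "plane": 0.75}   # sample quotas set by the survey design

modes = list(H_i_given_r)

# Q(i|r, theta) = sum_x p(x|r) * P(i|x, r, theta): discrete version of the integral.
Q_i_given_r = {
    i: sum(p_x_given_r[x] * P_choice[x][i] for x in p_x_given_r) for i in modes
}

def induced_density(i, x):
    """Eq. (7): g_{I|r}(i, x | r, theta) for the toy population."""
    return P_choice[x][i] * p_x_given_r[x] * H_i_given_r[i] / Q_i_given_r[i]

# Total mass of the induced density, and its mass on each mode.
total_mass = sum(induced_density(i, x) for i in modes for x in p_x_given_r)
mode_mass = {i: sum(induced_density(i, x) for x in p_x_given_r) for i in modes}
```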

Finally, the density function induced by the sampling scheme can be written as follows:

$$g(i, x, r|\theta) = \frac{g_{I|r}(i, x|r, \theta)\, H_r}{Q(r)} = \frac{P(i|x, r, \theta) \cdot p(x|r) \cdot H_{i|r}\, H_r}{Q(i|r, \theta) \cdot Q(r)} \tag{8}$$

Once this density function has been evaluated for the sample of observations $\{i_n, x_n, r_n\}_{1 \le n \le N}$, and presupposing that the quotas $Q(r) = Q_r$ and $Q(i|r, \theta) = Q_{i|r}$ are known, we are in a position to write the maximum likelihood estimator for the parameters $\theta$:

$$\max \; \ln L(\theta, p) = \sum_{n=1}^{N} \ln \frac{P(i_n|x_n, r_n, \theta) \cdot p(x_n|r_n) \cdot H_{i_n|r_n} \cdot H_{r_n}}{Q_{i_n|r_n} \cdot Q_{r_n}} \tag{9}$$

with

$$Q_{i|r} = \int_x P(i|x, r, \theta) \cdot p(x|r)\, dx, \qquad Q_r = p(r)$$

In fact, this expression is an extension of the equation provided by Cosslett (1981) for pure endogenous sampling. However, while Cosslett's problem already posed a number of computational difficulties, the one presented in (9) involves even greater ones, since it incorporates the probability function $p(x, r) = p(r) \cdot p(x|r)$ as an unknown. Consequently, it is necessary to resort to alternative estimators in order to make the calculation feasible. In this respect, the idea put forward by Manski and Lerman (1977) of incorporating a set of weights $\{w_i\}_{1 \le i \le C} = \{Q_i/H_i\}_{1 \le i \le C}$ to correct the bias across transport modes $i$ is particularly useful. Just as these weights provide a significant correction of

the differences between sample and population quotas, we propose to use analogous weights $\{w_r\}_{1 \le r \le R} = \{Q_r/H_r\}_{1 \le r \le R}$ to solve the same problem for the O–D pairs $r$. Eq. (9) would therefore be written as follows:

$$\max \; \ln L(\theta) = \sum_{n=1}^{N_m} w_{r_n} \ln P_{I|r}(i_n|x_n, w_{i_n}, r_n, \theta) \tag{10}$$

where, judging from the estimators available so far and based on likelihood maximisation theory, $P_{I|r}$ can be expressed in two ways:

$$P^{WESML}_{I|r}(i_n|x_n, w_{i_n}, r_n, \theta) = P(i_n|x_n, r_n, \theta)^{(Q_{i_n}/H_{i_n})} = P(i_n|x_n, r_n, \theta)^{w_{i_n}}$$

$$P^{CML}_{I|r}(i_n|x_n, w_{i_n}, r_n, \theta) = \frac{(H_{i_n}/Q_{i_n}) \cdot P(i_n|x_n, r_n, \theta)}{\sum_{j=1}^{C} (H_j/Q_j) \cdot P(j|x_n, r_n, \theta)} = \frac{(1/w_{i_n}) \cdot P(i_n|x_n, r_n, \theta)}{\sum_{j=1}^{C} (1/w_j) \cdot P(j|x_n, r_n, \theta)} \tag{11}$$
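The two corrected forms in Eq. (11) can be sketched for a single observation as follows. This is only an illustration: the function names `wesml_term` and `cml_term` and the numerical values of `probs` and `weights` are assumptions made for the example. Note that the WESML form is a pseudo-likelihood contribution and is not renormalised across modes, whereas the CML form is a proper probability over the choice set.

```python
def wesml_term(probs, weights, chosen):
    """WESML form of Eq. (11): P(i_n | .) raised to the power w_{i_n} = Q/H."""
    return probs[chosen] ** weights[chosen]

def cml_term(probs, weights, chosen):
    """CML form of Eq. (11): probabilities rescaled by 1/w_j and renormalised."""
    scaled = {j: p / weights[j] for j, p in probs.items()}
    return scaled[chosen] / sum(scaled.values())

# Hypothetical model probabilities P(j | x_n, r_n, theta) for one traveller.
probs = {"car": 0.6, "plane": 0.4}
# Hypothetical weights w_j = Q_j / H_j (car under-sampled, plane over-sampled).
weights = {"car": 3.5, "plane": 0.375}

wesml_car = wesml_term(probs, weights, "car")
cml_car = cml_term(probs, weights, "car")
cml_plane = cml_term(probs, weights, "plane")
```

With all weights equal to one, both functions return the uncorrected model probability of the chosen mode, which is a convenient sanity check.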

Both expressions are derived from two different proposals¹ that reduce computational difficulty at the risk of some loss of efficiency:

• WESML ("Weighted Exogenous Sampling Maximum Likelihood"), developed by Manski and Lerman (1977). This is in fact an extension of the estimator based on simple random sampling. The reformulation modifies the likelihood function of expression (1) with a given set of weights $w_i = Q_i/H_i$:

$$\max_{\theta} \; \ln L(\theta) = \sum_{n=1}^{N} w_{i_n} \ln P(i_n|x_n, r_n, \theta) \tag{12a}$$

• CML ("Conditioned Maximum Likelihood"), developed by Cosslett (1981). This approach discretises the function $p(x)$ included in (6) into a set of values $\lambda_n$ ($1 < n < N$), which allows the constraint to be expressed as a sum instead of an integral. The maximisation problem is then solved by means of Lagrangian multipliers $\lambda_i$. The final result is a somewhat complex scheme. To overcome this shortcoming, Manski and McFadden (1981) developed a simpler formulation based on the fact that the multipliers tend to the quotient $H_i/Q_i$ as the sample size increases. The original Cosslett proposal is notably simplified at the expense of statistical efficiency, as can be seen below:

$$\max \; \ln L(\theta) = \sum_{n=1}^{N} \ln \frac{(H_{i_n}/Q_{i_n}) \cdot P(i_n|x_n, r_n, \theta)}{\sum_{j=1}^{C} (H_j/Q_j) \cdot P(j|x_n, r_n, \theta)} \tag{12b}$$
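To see what the two corrections in (12a) and (12b) achieve, consider a deliberately simple Python sketch: a two-mode model with no explanatory variables, so the only unknown is the probability of choosing car. The population shares `Q` and the choice-based sample `counts` are hypothetical, and a grid search stands in for a numerical optimiser. Under these assumptions, the uncorrected likelihood would be expected to reproduce the sample share, while the WESML and CML objectives would be expected to recover the population share.

```python
import math

# Hypothetical population shares and a choice-based sample that over-represents plane.
Q = {"car": 0.7, "plane": 0.3}
counts = {"car": 200, "plane": 800}
N = sum(counts.values())
H = {i: counts[i] / N for i in counts}   # sample shares H_i
w = {i: Q[i] / H[i] for i in Q}          # weights w_i = Q_i / H_i

def model(p_car):
    """Constant-only two-mode model: the single parameter is P(car)."""
    return {"car": p_car, "plane": 1.0 - p_car}

def naive_objective(p_car):
    """Eq. (1) applied to the biased sample, with no correction."""
    P = model(p_car)
    return sum(counts[i] * math.log(P[i]) for i in counts)

def wesml_objective(p_car):
    """Eq. (12a): each observation weighted by w_i of its chosen mode."""
    P = model(p_car)
    return sum(counts[i] * w[i] * math.log(P[i]) for i in counts)

def cml_objective(p_car):
    """Eq. (12b): probabilities rescaled by H_j / Q_j and renormalised."""
    P = model(p_car)
    den = sum(H[j] / Q[j] * P[j] for j in P)
    return sum(counts[i] * math.log(H[i] / Q[i] * P[i] / den) for i in counts)

grid = [k / 1000 for k in range(1, 1000)]
p_naive = max(grid, key=naive_objective)
p_wesml = max(grid, key=wesml_objective)
p_cml = max(grid, key=cml_objective)
```

The example has no explanatory variables, so it says nothing about the relative efficiency of the two estimators; it only illustrates the direction of the bias correction.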

Based on these results, a new method is proposed for correcting the bias across groups of O–D pairs, known as DWELT ("Double Weighted Estimator of Long distance Transport mode choice models"). The method follows the two-stage sampling procedure outlined in this section. The first stage corresponds to a grouped sampling strategy in terms of O–D pairs; at this stage, the quotas are corrected by the set of weights $\{w_r\}_{1 \le r \le R} = \{Q_r/H_r\}_{1 \le r \le R}$. The second stage involves a pure endogenous stratification conditional on $r$, which is corrected using the sets of weights $\{w_{i|r}\}_{1 \le i \le C,\, 1 \le r \le R} = \{Q_{i|r}/H_{i|r}\}_{1 \le i \le C,\, 1 \le r \le R}$. From a statistical point of view, DWELT preserves the statistical properties of pseudo-maximum-likelihood estimators to a greater extent, especially when the sample size is close to the population size. Under this approach, the Jacobian vector and the Hessian matrix acquire features similar to those of the WESML estimator developed by Manski and Lerman (1977), although in this case

¹ There are other proposals, such as that of Imbens (1992), based on the GMM method ("Generalised Moments Method"). Compared with the problem posed by Cosslett (1981), this method, based on the log-likelihood function, preserves the original statistical efficiency while slightly reducing computational complexity. However, the function itself is not set up as a maximum likelihood estimator, which hinders any generalisation aimed at correcting new biases in terms of O–D pairs.

they refer to the conditional probability $P_{I|r} = P_{I|r}(i|x, w_i, r, \theta)$ instead of $P_i = P(i|x, \theta)$, and the weights $\{w_i\}_{1 \le i \le C} = \{Q_i/H_i\}_{1 \le i \le C}$ are replaced by those corresponding to O–D pairs, $\{w_r\}_{1 \le r \le R} = \{Q_r/H_r\}_{1 \le r \le R}$. By applying the density function induced by the sampling scheme proposed here, it can be shown that the mathematical expectation of the Jacobian vector is zero. This property allows general theorems to be applied to demonstrate the consistency of the DWELT estimator. Appendix 1.11 of the cited work by Manski and McFadden (1981) includes a proposition for estimators that satisfy this property together with some other general conditions, proving that under such circumstances the estimators are consistent. DWELT fulfils these requirements and is therefore consistent as well. Following Manski and McFadden (1981), we can also conclude that the asymptotic distribution is Normal, with zero mean and variance given by the following expression:

$$\Sigma = \frac{1}{N}\, \Omega^{-1} \Delta\, \Omega^{-1}, \qquad \Omega = \sum_{r=1}^{R} \Omega_{I|r}, \qquad \Delta = \sum_{r=1}^{R} w_r\, \Delta_{I|r} \tag{13}$$

All the expressions considered so far include both the probability functions and the relative quotas for each transport mode, conditional on the group of O–D pairs. In particular, the probability of selecting any transport mode has been specified as $P(i|x, r, \theta)$. Nonetheless, the ultimate goal is to fit a function $P(i|x, \theta)$ common to all O–D pairs, in such a way that the parameters $\theta$ account for the variability of choice patterns across those O–D pairs. Given that this variability is implicit in the explanatory variables $x$, it is possible to keep the original formulation of the choice model, $P(i|x, \theta)$, instead of $P(i|x, r, \theta)$ in the formulas above. However, it is more difficult to accept the independence of the variables $x$ from the O–D pairs $r$, i.e., to admit the equivalence $p(x|r) = p(x)$. Making this assumption would amount to admitting a homogeneous trip distribution among all groups of O–D pairs. This hypothesis is open to discussion, mainly because the individual explanatory variables appear to be linked to the origins and destinations insofar as these determine individual socioeconomic features. Fortunately, estimators based on endogenous sampling (WESML and CML) do not require $p(x|r)$ to be estimated, although it is needed when calculating mathematical expectations. Consequently, if we accept that only the quotas $\{w_i\}_{1 \le i \le C}$ depend on the O–D pairs $r$, the DWELT estimation method is reformulated as follows:

$$\max \; \ln L(\theta) = \sum_{n=1}^{N} w_{r_n} \ln P_{I|r}(i_n|x_n, w_{i_n|r_n}, \theta) \tag{14}$$

Applying this method to the WESML and CML estimators yields two formulations:

• DWELT-WESML (R): extension of the WESML method

$$\max \; \ln L(\theta) = \sum_{r=1}^{R} w_r \sum_{n=1}^{N_r} w_{i_n|r} \ln P(i_n|x_n, \theta) \tag{15}$$

• DWELT-CML (R): extension of the CML method

$$\max \; \ln L(\theta) = \sum_{r=1}^{R} w_r \sum_{n=1}^{N_r} \ln \frac{(1/w_{i_n|r}) \cdot P(i_n|x_n, \theta)}{\sum_{j=1}^{C} (1/w_{j|r}) \cdot P(j|x_n, \theta)} \tag{16}$$

Both expressions remain within a relatively simple structure without losing any operational efficiency from the estimators on which they are premised.
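The two DWELT objectives in Eqs. (15) and (16) can be sketched in Python as below. This is a minimal illustration rather than the authors' implementation: it assumes a multinomial logit form for $P(i|x, \theta)$, and the one-parameter `utility` function, the three observations and the weight tables `w_r` and `w_ir` are all hypothetical. Observations are stored as `(chosen mode, x, O-D group)` triples, so the double sum over $r$ and $n$ collapses into a single loop.

```python
import math

def logit_probs(utilities):
    """Multinomial logit probabilities from a dict of systematic utilities V_i."""
    m = max(utilities.values())  # subtract the maximum for numerical stability
    expv = {i: math.exp(v - m) for i, v in utilities.items()}
    s = sum(expv.values())
    return {i: e / s for i, e in expv.items()}

def dwelt_wesml(obs, w_r, w_ir, utility, theta):
    """Eq. (15): sum of w_r * w_{i_n|r} * ln P(i_n | x_n, theta)."""
    total = 0.0
    for i_n, x_n, r_n in obs:
        P = logit_probs(utility(x_n, theta))
        total += w_r[r_n] * w_ir[r_n][i_n] * math.log(P[i_n])
    return total

def dwelt_cml(obs, w_r, w_ir, utility, theta):
    """Eq. (16): sum of w_r * ln of the probability rescaled by 1/w_{j|r}."""
    total = 0.0
    for i_n, x_n, r_n in obs:
        P = logit_probs(utility(x_n, theta))
        den = sum(P[j] / w_ir[r_n][j] for j in P)
        total += w_r[r_n] * math.log(P[i_n] / w_ir[r_n][i_n] / den)
    return total

def utility(x, theta):
    """Hypothetical utility: V_i = theta * travel time of mode i (hours)."""
    return {mode: theta * hours for mode, hours in x.items()}

# Hypothetical observations and weights w_r = Q_r/H_r, w_{i|r} = Q_{i|r}/H_{i|r}.
obs = [
    ("plane", {"plane": 3.0, "car": 6.5}, "r1"),
    ("car", {"plane": 3.0, "car": 6.5}, "r1"),
    ("car", {"plane": 2.5, "car": 3.0}, "r2"),
]
w_r = {"r1": 0.5, "r2": 1.5}
w_ir = {"r1": {"plane": 0.4, "car": 2.0}, "r2": {"plane": 0.8, "car": 1.1}}

value_wesml = dwelt_wesml(obs, w_r, w_ir, utility, theta=-0.5)
value_cml = dwelt_cml(obs, w_r, w_ir, utility, theta=-0.5)
```

Either function can be handed to a numerical optimiser over `theta`. When every weight equals one, both reduce to the ordinary log-likelihood of Eq. (1), which gives a simple check of an implementation.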

3. Validation of DWELT: Application to the Madrid–Barcelona corridor

In order to validate the estimation techniques proposed so far, they were tested with data from a travel demand survey on the 680-km long Madrid–Barcelona corridor (Spanish Department of Transport and Environment, 1992). The survey was designed to measure the impact on existing transport services of the introduction of a new high-speed rail line, currently under construction. The Madrid–Barcelona corridor is the most important transport link in the Iberian Peninsula. It is bounded at its two ends by Spain's two most important metropolitan areas, in terms of both population (almost 10 million, i.e. 25% of the total Spanish population) and GDP (30%). There are other cities in the corridor, such as Zaragoza, with 0.85 million inhabitants, Huesca, Lérida, Tarragona and Guadalajara. The corridor is therefore an appropriate setting for testing the estimation methods put forward in this paper (see Fig. 2).

Five basic transport modes coexist in the corridor, namely plane, day train, night train, car and long-distance bus. The explanatory variables $x$ used to characterise travellers' choice among these five modes are three generic variables (time, cost and frequency) and one alternative-specific variable (household income):

• Time was measured as total time between origin and destination, including not only travel time but also, for public transport modes, access and egress time, waiting time and transfer time between different services where needed.
• Cost was also measured for the whole transport chain, including not only the fare but also an estimated cost of access and egress for public transport modes.
• Frequency was taken, for public transport modes, to be the number of services per week for each O–D pair.
• Household income was measured on six different levels.

Three kinds of source provided the data required for these variables: a supply model, transport demand data and survey campaigns (see Fig. 3).

• Supply model. Each mode in turn offers a wide range of routes and services, generating multiple trip sub-options. This great diversity in transport supply called for a supply model specifically devised to calculate the most representative supply variables (relative time, cost and frequency) for each O–D pair of cities and for each specific choice. It was therefore necessary to determine the supply-related explanatory variables for those O–D pairs; e.g. Table 1 shows the basic supply data calculated for the most important O–D pairs in the case of the rail mode.

Fig. 2. The Madrid–Barcelona corridor.

[Diagram: data sources feeding the DWELT estimation method (3rd stage: modal distribution). O–D matrices (2nd stage: spatial distribution) supply $Q_i$ and $Q_{i|r}$ by purpose; survey campaigns supply household income, O–D pairs, trip purpose, supply data for checking, and $H_i$ and $H_{i|r}$ by purpose; the supply model (4th stage: route allocation) supplies supply data (time, cost, frequency).]

Fig. 3. Data sources for DWELT method.

Table 1. Main rail O–D pairs in the Madrid–Barcelona corridor: Demand and explanatory variables (1991)

Travel time: one-way trip time in minutes. Cost: one-way fares in Spanish pesetas. Frequency: services/week. Passengers: both ways/year.

| Origin | Destination | Travel time | Transfer time (a) | Cost, day | Cost, night | Cost, average | Frequency, day | Frequency, night | Passengers, day | Passengers, night | Passengers, total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Madrid | Zaragoza | 195 | – | 2404 | 0 | 2404 | 40 | 0 | 583,468 | 0 | 583,468 |
| Madrid | Huesca | 275 | – | 2872 | 0 | 2872 | 5 | 0 | 47,743 | 0 | 47,743 |
| Madrid | Lérida | 304 | – | 3770 | 5793 | 4782 | 10 | 10 | 32,851 | 11,995 | 44,846 |
| Madrid | Tarragona | 410 | – | 3953 | 5891 | 4922 | 15 | 10 | 45,005 | 43,490 | 88,495 |
| Madrid | Barcelona | 442 | – | 4189 | 5567 | 4878 | 15 | 15 | 233,776 | 291,167 | 524,943 |
| Guadalajara | Zaragoza | 162 | – | 1987 | 0 | 1987 | 25 | 0 | 24,593 | 0 | 24,593 |
| Guadalajara | Huesca | 235 | – | 2503 | 0 | 2503 | 5 | 0 | – | | – |
| Guadalajara | Lérida | 263 | – | 3247 | 3710 | 3479 | 5 | 5 | 943 | 217 | 1160 |
| Guadalajara | Tarragona | 354 | – | 2977 | 3947 | 3462 | 10 | 5 | 2373 | 1414 | 3787 |
| Guadalajara | Barcelona | 414 | – | 3423 | 4327 | 3875 | 10 | 10 | 8267 | 8252 | 16,519 |
| Zaragoza | Huesca | 52 | – | 566 | 0 | 566 | 5 | 0 | – | | – |
| Zaragoza | Lérida | 102 | – | 1187 | 0 | 1187 | 21 | 0 | 17,090 | 0 | 17,090 |
| Zaragoza | Tarragona | 184 | – | 1504 | 0 | 1504 | 25 | 0 | 45,055 | 0 | 45,055 |
| Zaragoza | Barcelona | 224 | – | 2562 | 0 | 2562 | 16 | 0 | 213,367 | 0 | 213,367 |
| Huesca | Lérida | 80 | 37 | 1120 | 0 | 1120 | 5 | 0 | 1451 | 0 | 1451 |
| Huesca | Tarragona | 253 | 50 | 2337 | 0 | 2337 | 5 | 0 | 1669 | 0 | 1669 |
| Huesca | Barcelona | 186 | 37 | 2349 | 0 | 2349 | 5 | 0 | 17,492 | 0 | 17,492 |
| Lérida | Tarragona | 68 | – | 698 | 0 | 698 | 6 | 0 | 17,239 | 0 | 17,239 |
| Lérida | Barcelona | 116 | – | 1217 | 0 | 1217 | 16 | 0 | 83,222 | 0 | 83,222 |

(a) When there is no direct service.

• Transport demand data. The population quotas $Q_r$ and $Q_{i|r}$ are calculated directly from O–D matrices by mode, also disaggregated by trip purpose, which are taken from nationwide mobility studies carried out in Spain for each mode at the level of Spanish provinces. For instance, Table 1 also shows transport demand for the most important O–D pairs by rail. In general, this is a requirement of the DWELT estimation method, which assumes that O–D matrices exist and are available, generally from national transport models or official statistical databases.
• Survey campaigns. As part of the same framework, the study launched a series of stated preference surveys of travellers using the five modes: plane, day and night trains, car and bus. Survey points were located at service areas, toll gates and gas stations along the main roads of the corridor, and at rail stations and airports. This source provides information on specific variables related to travellers and their trips, namely household income, O–D pairs, trip purpose and supply data. Household income is an explanatory variable, together with the supply variables (time, cost and frequency). In parallel, the O–D pair survey data were used to calculate the sample quotas $H_r$ and $H_{i|r}$, and the survey responses on trip purpose allow these sample quotas to be disaggregated by purpose. In addition, supply data from the surveys were used to check the time, cost and frequency values given by the supply model for each O–D pair. The sampling proved endogenous, with a sampling proportion ($H$) for road surveys substantially lower than the population proportion ($Q$). This was due to the higher cost of conducting campaigns in this mode.

This application aimed to check the validity of the DWELT method and its use in a multinomial logit model of transport mode choice. To this end, the global sample was segmented into the following homogeneous groups:

• A group of trips covering O–D pairs with five available modes: plane, day train, night train, car and bus. This group (G-5) corresponds to the most distant O–D pairs in the corridor, including the Madrid–Barcelona relationship.
Surveys were divided into two sub-samples by trip purpose: work (1163 surveys) and other purposes (1195 surveys).
• Another group of trips covering O–D pairs with four modal options (G-4): plane, day train, car and bus. It was similarly divided by trip purpose into work (469 surveys) and others (619 surveys).

The multinomial logit model assumes that the random components $\varepsilon_i$ of the utility function $U_i$ are independent and identically distributed Gumbel variables, which yields the following expression:

$$P(i \mid x, \beta) = \frac{e^{V_i}}{\sum_{j=1}^{C} e^{V_j}}, \qquad V_i = \sum_{j=1}^{J} \beta_j x_{ij}, \qquad 1 \le i \le C \tag{17}$$

where $\beta$ represents the parameters to be calibrated, i indexes the modal choices (C being the total number of modes available) and x the explanatory variables. $V_i = V_i(x, \beta)$ constitutes the systematic component of the utility function, which is formulated as a linear function of a number of explanatory variables x. There are C−1 "dummy" variables, or specific constants for each mode option; three variables related to transport supply, namely time (minutes), price (pesetas, the Spanish currency; see footnote 2) and frequency (services per week); and a further variable measuring income per person. Trip time and income per person were defined as specific to each mode option in order to analyse their variability. Some approaches for specifying $V_i$ consider time and price as related to income levels (Jara-Diaz and Farah, 1987; Jara-Diaz and Ortuzar, 1989). This is typically the case for urban trips in developing countries, but this case study analyses long interurban trips along a corridor where leisure trips vary by more than half (not constrained by time budget) and work trips are mainly paid for by companies (69% according to the survey referred to).

In order to assess the validity of the modified estimation methods DWELT-WESML (R) and DWELT-CML (R) using the multinomial logit model presented above, it was necessary to divide each group of trips into several strata according to the different O–D pairs. Thus in G-5 there were two sub-groups of O–D pairs (r = 1 corresponding to Madrid–Barcelona trips and r = 2 to the rest). Similarly, G-4 was divided into

Footnote 2. 1991 conversion rate: $1 = 102 Spanish pesetas. 1990–1995 average conversion rate: $1 = 127 pesetas.


Table 2
Results from estimating G-5 with DWELT-WESML (five modal options and two O–D pair sub-groups)

| β parameters | Work: Value | Work: Std. error | Work: t-Student | Leisure and others: Value | Leisure and others: Std. error | Leisure and others: t-Student |
|---|---|---|---|---|---|---|
| Pspecific car | 0.8597950 | 0.7570011 | 1.1357910 | 0.7415682 | 0.4224602 | 1.7553564 |
| Pspecific plane | 1.4036644 | 1.1470117 | 1.2237577 | 0.8055204 | 1.0879510 | 0.7404014 |
| Pspecific day train | 0.3562930 | 0.7722312 | 0.4613813 | 0.2142596 | 0.4249925 | 0.5041491 |
| Pspecific night train | 1.9429683 | 0.7296028 | 2.6630494 | 2.3677825 | 0.4006265 | 5.9101994 |
| Car time | 0.0049138 | 0.0010387 | 4.7307211 | 0.0033299 | 0.0006512 | 5.1134828 |
| Plane time | 0.0172913 | 0.0066252 | 2.6099288 | 0.0057954 | 0.0062971 | 0.9203284 |
| Day train time | 0.0042535 | 0.0006013 | 7.0738400 | 0.0023321 | 0.0005027 | 4.6391486 |
| Bus time | 0.0036842 | 0.0009591 | 3.8413096 | 0.0025007 | 0.0003807 | 6.5686893 |
| Price | 0.0001497 | 0.0000393 | 3.8091603 | 0.0002194 | 0.0000327 | 6.7094801 |
| Frequency | 0.0084011 | 0.0014170 | 5.9287932 | 0.0032546 | 0.0017953 | 1.8128446 |
| Car income | 0.6111343 | 0.0902020 | 6.7751746 | 0.8226480 | 0.0903139 | 9.1087640 |
| Plane income | 0.8231694 | 0.0833151 | 9.8801946 | 0.6389608 | 0.0848558 | 7.5299602 |
| Day train income | 0.3584179 | 0.1146329 | 3.1266582 | 0.0135439 | 0.0969706 | 0.1396702 |
| Night train income | 0.4420117 | 0.0973067 | 4.5424590 | 0.5101614 | 0.0981137 | 5.1996959 |

| Purpose | Ln L(0) | Ln L(Pspecific) | Ln L(β) | $\rho_1^2$ | $\rho_2^2$ | $\rho_3^2$ |
|---|---|---|---|---|---|---|
| Work | 1871.78 | 1276.01 | 1052.11 | 0.4379 | 0.4304 | 0.1645 |
| Leisure and others | 1923.28 | 1829.71 | 1626.83 | 0.154 | 0.1469 | 0.1032 |

Within Pspecific and Income, bus was considered the reference alternative. The time parameter for the night train was not taken into account, as 'arriving before' reduces the sleeping time and therefore does not represent an advantage for a quicker trip.
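The coefficients reported in these tables enter the logit formula of Eq. (17). As an illustration only, the following minimal Python sketch shows how a set of coefficients and mode attributes maps to choice probabilities. The coefficient and attribute values are hypothetical (they are not taken from Tables 2–5), the function names are ours, and for brevity a single generic time coefficient is used, whereas the study specifies time by mode.

```python
import math


def systematic_utility(beta, x):
    """Linear-in-parameters utility V_i = sum_j beta_j * x_ij."""
    return sum(beta[name] * x[name] for name in beta)


def mnl_probabilities(utilities):
    """Multinomial logit probabilities P(i) = exp(V_i) / sum_j exp(V_j)."""
    v_max = max(utilities.values())  # shift by the maximum for numerical stability
    exp_v = {mode: math.exp(v - v_max) for mode, v in utilities.items()}
    total = sum(exp_v.values())
    return {mode: e / total for mode, e in exp_v.items()}


# Hypothetical coefficients (per minute, per peseta) and mode attributes.
beta = {"time": -0.005, "price": -0.0002}
attributes = {
    "car": {"time": 360.0, "price": 6000.0},
    "plane": {"time": 150.0, "price": 15000.0},
    "bus": {"time": 480.0, "price": 3500.0},
}

utilities = {mode: systematic_utility(beta, x) for mode, x in attributes.items()}
probabilities = mnl_probabilities(utilities)

for mode, p in probabilities.items():
    print(f"{mode}: V = {utilities[mode]:.2f}, P = {p:.3f}")
```

Subtracting the largest utility before exponentiating leaves the probabilities unchanged, because only utility differences matter in the logit model, and it avoids overflow for large utilities.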

three sub-groups (Madrid–Zaragoza, Zaragoza–Barcelona, and the rest). This means that in both groups the longest and most important O–D pairs were specified separately.

The estimation procedures used in this study were based on second-order optimisation techniques, in which first and second derivatives are calculated. Each iteration provides both the optimum direction and the optimum step length. Starting from the solution of an estimation under simple random sampling, progress was satisfactory in all cases and the algorithm converged rapidly.

Results from the estimations are shown in Tables 2–5. All four tables show the calibrated parameter values, their standard errors and the Student's t statistics. All values obtained correspond to those expected for the Madrid–Barcelona corridor. For more detail, the time values derived from the results are displayed in Table 6. The high time values for plane trips are remarkable. The Madrid–Barcelona air link is one of the most important in Europe and is used mainly by business travellers, as these are the two most important economic centres in Spain, where the main companies have offices. As a result, tickets tend to be paid for by companies, which are less concerned about price. There are also many high-level leisure offers (opera, theatre, cultural events, etc.), sometimes combined with business activities, that rely on the excellent plane connection between the two cities.

Additionally, Tables 2–5 show the values of the log-likelihood function in the following cases: $\beta = 0$, $\beta$ including only the mode-specific parameters, and $\beta$ with all the parameters taken into consideration. From these values we obtain the coefficients $\rho_1^2$, $\rho_2^2$ and $\rho_3^2$, defined as follows:
• $\rho_1^2 = 1 - \ln L(\beta) / \ln L(0)$: global estimation coefficient.
• $\rho_2^2 = 1 - (\ln L(\beta) - K) / \ln L(0)$: global estimation coefficient corrected by means of the Akaike information criterion, where K represents the number of parameters.
• $\rho_3^2 = 1 - (\ln L(\beta) - K) / \ln L(\beta_{\text{specific}})$: global estimation coefficient corrected by means of the Akaike information criterion and calculated on $\ln L(\beta_{\text{specific}})$.
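As a worked check of the three definitions above, the following Python sketch recomputes the coefficients for the work sub-sample of Table 2. It rests on two assumptions that are not stated explicitly in the text: the tabulated log-likelihoods are taken as magnitudes of negative values (a log-likelihood of discrete choices cannot be positive), and K = 14 is obtained by counting the parameters listed in Table 2. The function name is ours.

```python
def rho_squared(ll_beta, ll_zero, ll_specific, n_params):
    """Return (rho1^2, rho2^2, rho3^2) from three log-likelihood values.

    ll_beta:     log-likelihood with all parameters, Ln L(beta)
    ll_zero:     log-likelihood with beta = 0, Ln L(0)
    ll_specific: log-likelihood with mode-specific constants only
    n_params:    number of estimated parameters K
    """
    rho1 = 1.0 - ll_beta / ll_zero
    rho2 = 1.0 - (ll_beta - n_params) / ll_zero
    rho3 = 1.0 - (ll_beta - n_params) / ll_specific
    return rho1, rho2, rho3


# Table 2, purpose "work": values entered as negative log-likelihoods (assumption).
rho1, rho2, rho3 = rho_squared(
    ll_beta=-1052.11, ll_zero=-1871.78, ll_specific=-1276.01, n_params=14
)
print(f"rho1^2 = {rho1:.4f}, rho2^2 = {rho2:.4f}, rho3^2 = {rho3:.4f}")
```

Under these assumptions the sketch returns values that agree with the ratios printed in Table 2 to four decimal places, which supports reading the two gaps in each formula as minus signs.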


Table 3
Results from estimating G-5 with DWELT-CML (five modal options and two O–D pair sub-groups)

| β parameters | Work: Value | Work: Std. error | Work: t-Student | Leisure and others: Value | Leisure and others: Std. error | Leisure and others: t-Student |
|---|---|---|---|---|---|---|
| Pspecific car | 1.5012979 | 0.7467952 | 2.0103208 | 0.6267177 | 0.3876197 | 1.6168366 |
| Pspecific plane | 0.4828710 | 1.1039646 | 0.4373972 | 0.9546564 | 1.0757587 | 0.8874261 |
| Pspecific day train | 0.1271781 | 0.7254868 | 0.1753004 | 0.0682367 | 0.4154265 | 0.1642570 |
| Pspecific night train | 1.5174251 | 0.6825412 | 2.2231993 | 2.5136766 | 0.3825738 | 6.5704358 |
| Car time | 0.0055111 | 0.0012108 | 4.5516188 | 0.0036257 | 0.0006309 | 5.7468696 |
| Plane time | 0.0106981 | 0.0059367 | 1.8020281 | 0.0106244 | 0.0061880 | 1.7169360 |
| Day train time | 0.0040427 | 0.0006347 | 6.3694659 | 0.0022917 | 0.0005025 | 4.5605970 |
| Bus time | 0.0030544 | 0.0008969 | 3.4055079 | 0.0026411 | 0.0003686 | 7.1652198 |
| Price | 0.0001310 | 0.0000323 | 4.0557276 | 0.0001590 | 0.0000319 | 4.9843260 |
| Frequency | 0.0095968 | 0.0013461 | 7.1293366 | 0.0033791 | 0.0017961 | 1.8813540 |
| Car income | 0.5986190 | 0.0927152 | 6.4565357 | 0.7840728 | 0.0860959 | 9.1069703 |
| Plane income | 0.8049852 | 0.0842496 | 9.5547658 | 0.5965290 | 0.0835338 | 7.1411692 |
| Day train income | 0.3702051 | 0.1115598 | 3.3184454 | 0.0156139 | 0.0955353 | 0.1634359 |
| Night train income | 0.4444447 | 0.0954427 | 4.6566652 | 0.4684621 | 0.0967657 | 4.8411999 |

| Purpose | Ln L(0) | Ln L(Pspecific) | Ln L(β) | $\rho_1^2$ | $\rho_2^2$ | $\rho_3^2$ |
|---|---|---|---|---|---|---|
| Work | 2319.4 | 1474.81 | 1237.43 | 0.4665 | 0.4605 | 0.1515 |
| Leisure and others | 1996.26 | 1904.31 | 1724.35 | 0.1362 | 0.1292 | 0.0871 |

Table 4
Results from estimating G-4 with DWELT-WESML (four modal options and three O–D pair sub-groups)

| β parameters | Work: Value | Work: Std. error | Work: t-Student | Leisure and others: Value | Leisure and others: Std. error | Leisure and others: t-Student |
|---|---|---|---|---|---|---|
| Pspecific car | 2.3199444 | 0.5090257 | 4.5576174 | 1.8045890 | 0.5821317 | 3.0999669 |
| Pspecific plane | 1.9502912 | 0.9889497 | 1.9720833 | 1.1428497 | 0.9999109 | 1.1429515 |
| Pspecific day train | 0.8848881 | 0.4963470 | 1.7828013 | 1.9893510 | 0.5392079 | 3.6893951 |
| Car time | 0.0059420 | 0.0008803 | 6.7499716 | 0.0021153 | 0.0014244 | 1.4850463 |
| Plane time | 0.0154011 | 0.0060067 | 2.5639869 | 0.0081990 | 0.0077947 | 1.0518686 |
| Day train time | 0.0044913 | 0.0006773 | 6.6311826 | 0.0023364 | 0.0010776 | 2.1681514 |
| Bus time | 0.0022974 | 0.0006685 | 3.4366492 | 0.0014176 | 0.0013467 | 0.3100913 |
| Price | 0.0001291 | 0.0000633 | 2.0394945 | 0.0001780 | 0.0000547 | 3.2541133 |
| Frequency | 0.0138743 | 0.0039777 | 3.4880207 | 0.0190176 | 0.0039116 | 4.8618468 |
| Car income | 0.1733872 | 0.1326454 | 1.3071482 | 0.6151831 | 0.1663653 | 3.6977849 |
| Plane income | 0.3284475 | 0.1724082 | 1.9050573 | 0.5195243 | 0.2013127 | 2.5806832 |
| Day train income | 0.2306178 | 0.1310653 | 1.7595641 | 0.1935763 | 0.1579946 | 1.2252083 |

| Purpose | Ln L(0) | Ln L(Pspecific) | Ln L(β) | $\rho_1^2$ | $\rho_2^2$ | $\rho_3^2$ |
|---|---|---|---|---|---|---|
| Work | 650.17 | 530.25 | 463.60 | 0.2870 | 0.2685 | 0.1031 |
| Leisure and others | 858.12 | 606.18 | 556.86 | 0.3511 | 0.3371 | 0.0616 |

On a general basis, $\rho_3^2$ seems to be the most suitable coefficient for comparing different estimations, because it captures the ability of the variables selected for the study (time, cost, frequency and income) to account for variability in travel choices without taking other factors into consideration. In order to assess the potential benefits of considering different O–D pair sub-groups, it was necessary to carry out estimations for the scenario where all the O–D pairs were grouped. This particular


Table 5
Results from estimating G-4 with DWELT-CML (four modal options and three O–D pair sub-groups)

| β parameters | Work: Value | Work: Std. error | Work: t-Student | Leisure and others: Value | Leisure and others: Std. error | Leisure and others: t-Student |
|---|---|---|---|---|---|---|
| Pspecific car | 2.1508039 | 0.4733969 | 4.5433417 | 1.2012569 | 0.5294593 | 2.2688371 |
| Pspecific plane | 0.9000520 | 0.7762305 | 1.1595164 | 0.4522433 | 0.8285091 | 0.5458519 |
| Pspecific day train | 0.6842614 | 0.4611399 | 1.4838477 | 1.2207047 | 0.4867134 | 2.5080565 |
| Car time | 0.0052004 | 0.0008283 | 6.2784015 | 0.0027380 | 0.0011375 | 2.4070330 |
| Plane time | 0.0101269 | 0.0046871 | 2.1605897 | 0.0078000 | 0.0055417 | 1.4075103 |
| Day train time | 0.0039003 | 0.0005743 | 6.7913982 | 0.0024032 | 0.0008228 | 2.9207584 |
| Bus time | 0.0020616 | 0.0006014 | 3.4280013 | 0.0010263 | 0.0010649 | 0.9637525 |
| Price | 0.0001468 | 0.0000603 | 2.4344942 | 0.0001436 | 0.0000463 | 3.1015119 |
| Frequency | 0.0154634 | 0.0038536 | 4.0127154 | 0.0171656 | 0.0031143 | 5.5118646 |
| Car income | 0.2018906 | 0.1226772 | 1.6457060 | 0.8357382 | 0.1687089 | 4.9537292 |
| Plane income | 0.4221762 | 0.1420387 | 2.9722618 | 0.5438025 | 0.1889259 | 2.8783904 |
| Day train income | 0.2336206 | 0.1204213 | 1.9400272 | 0.4044390 | 0.1655464 | 2.4430552 |

| Purpose | Ln L(0) | Ln L(Pspecific) | Ln L(β) | $\rho_1^2$ | $\rho_2^2$ | $\rho_3^2$ |
|---|---|---|---|---|---|---|
| Work | 666.39 | 611.45 | 526.31 | 0.2102 | 0.1922 | 0.1196 |
| Leisure and others | 935.71 | 681.25 | 628.40 | 0.3284 | 0.3156 | 0.0600 |

Table 6
Time values 1991 (Spanish pesetas/hour)

| Modes | DWELT-WESML: Work | DWELT-WESML: Leisure | DWELT-CML: Work | DWELT-CML: Leisure |
|---|---|---|---|---|
| Five modal options and two O–D pair sub-groups | | | | |
| Car | 1969 | 911 | 2524 | 1368 |
| Plane | 6930 | 1585 | 4900 | 4009 |
| Day train | 1705 | 638 | 1852 | 865 |
| Bus | 1477 | 684 | 1399 | 997 |
| Four modal options and three O–D pair sub-groups | | | | |
| Car | 2762 | 713 | 2126 | 1144 |
| Plane | 7158 | 2764 | 4139 | 3259 |
| Day train | 2087 | 788 | 1594 | 1004 |
| Bus | 1068 | 441 | 843 | 429 |
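The text does not spell out how the time values of Table 6 are derived. A common approach, assumed here, is to take the ratio of each mode's time coefficient to the price coefficient and convert from minutes to hours. The Python sketch below applies that assumed formula to the work-purpose coefficients of Table 2 (DWELT-WESML, G-5); the function and variable names are ours.

```python
def value_of_time(beta_time, beta_price, minutes_per_hour=60):
    """Value of time in currency units per hour.

    Assumes time is measured in minutes and price in pesetas, so the
    coefficient ratio is in pesetas per minute.
    """
    return minutes_per_hour * beta_time / beta_price


# Table 2, purpose "work": time coefficients by mode and the price coefficient.
time_coefficients = {
    "car": 0.0049138,
    "plane": 0.0172913,
    "day train": 0.0042535,
    "bus": 0.0036842,
}
price_coefficient = 0.0001497

time_values = {
    mode: round(value_of_time(b, price_coefficient))
    for mode, b in time_coefficients.items()
}
print(time_values)
```

Under this assumption the rounded results coincide with the first column of Table 6 (1969, 6930, 1705 and 1477 pesetas per hour), which suggests that the table was obtained in this way.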

Table 7
Comparison of estimations with and without the DWELT procedure ($\rho_3^2$ values)

| Purposes | WESML | DWELT-WESML | CML | DWELT-CML |
|---|---|---|---|---|
| Five modal options and two O–D pair sub-groups (R = 2) | | | | |
| Work | 0.1186 | 0.1645 | 0.1210 | 0.1515 |
| Leisure | 0.0947 | 0.1032 | 0.0846 | 0.0871 |
| Four modal options and three O–D pair sub-groups (R = 3) | | | | |
| Work | 0.0923 | 0.1031 | 0.0998 | 0.1196 |
| Leisure | 0.0612 | 0.0616 | 0.0595 | 0.0600 |

case corresponds to the use of the WESML and CML estimators, which are in fact the particular case of DWELT with R = 1, as shown previously. Table 7 shows the $\rho_3^2$ values obtained from the previous estimations for the purposes of this comparison.
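To make the comparison in Table 7 easier to read, the short Python sketch below tabulates the absolute gain in $\rho_3^2$ obtained with DWELT for each estimator, group and purpose. The figures are copied from Table 7; the data layout and names are ours and serve only as an illustration.

```python
# rho3^2 values from Table 7, keyed by (group, purpose).
rho3_values = {
    ("G-5", "work"): {"WESML": 0.1186, "DWELT-WESML": 0.1645, "CML": 0.1210, "DWELT-CML": 0.1515},
    ("G-5", "leisure"): {"WESML": 0.0947, "DWELT-WESML": 0.1032, "CML": 0.0846, "DWELT-CML": 0.0871},
    ("G-4", "work"): {"WESML": 0.0923, "DWELT-WESML": 0.1031, "CML": 0.0998, "DWELT-CML": 0.1196},
    ("G-4", "leisure"): {"WESML": 0.0612, "DWELT-WESML": 0.0616, "CML": 0.0595, "DWELT-CML": 0.0600},
}

# Absolute gain of each DWELT variant over its counterpart without DWELT.
gains = {
    key: {
        "WESML": round(row["DWELT-WESML"] - row["WESML"], 4),
        "CML": round(row["DWELT-CML"] - row["CML"], 4),
    }
    for key, row in rho3_values.items()
}

for (group, purpose), gain in gains.items():
    print(f"{group} {purpose}: WESML +{gain['WESML']:.4f}, CML +{gain['CML']:.4f}")
```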


Results clearly show that DWELT improves the global reliability of the estimations. The values of $\rho_3^2$ obtained with the DWELT method are higher than those obtained with the estimation methods without DWELT. This holds in all cases: different estimation methods, different purposes and different sub-groups of O–D pairs. It is worth highlighting that the improvement is slightly higher in the case of the WESML method. Furthermore, the improvement is higher for samples of work-related trips and in the case of five modal options and two sub-groups of O–D pairs.

Moreover, DWELT produces a calibration that is more in line with the real situation under evaluation, and it solves some of the multicollinearity problems arising from the time/cost variables on the one hand, and from the specific constants and the household income of travellers on the other. These improvements are not results expected from the theoretical development of the previous sections; they are specific to this application. However, they confirm the advantages of DWELT for representing the reality observed.

All the above considered, it seems sensible to support the application of the DWELT method for estimating discrete-choice models in an interurban context.

4. Final discussion and conclusions

Interurban sustainable transport policies aim to achieve a more balanced use of the different transport modes and hence should be multimodal. Planners use choice models to that end. However, collecting input data in large study areas, for example for a particular interurban corridor, is rather costly and difficult. Each mode of transport must be surveyed in a different way and the results are not directly comparable.
This singularity of interurban transport studies imposes the need to devise a specific methodological approach for the estimation of transport mode choice models, different from those widely applied in the urban and metropolitan cases. The proposed methodology reduces the excessive costs derived from household surveys and takes advantage of other sources of information, in particular traveller-oriented surveys. This type of approach generates biases that are so wide in scope that the estimators commonly applied in the field of discrete choice must be reworked.

The new method presented in this study, named DWELT, has been designed ad hoc for application in interurban studies, using information that can be obtained at a reasonable cost while correcting the biases inherent in such information sources:
• DWELT benefits from the internal variability of the combination of different groups of O–D pairs in the sample, upgrading the estimation quality with respect to the hypothesis of the non-existence of sampling biases among O–D pairs.
• It is based on well-known estimators premised on the pure endogenous sampling hypothesis, and its design permits efficient computerisation.
• In addition, the study carried out for its validation yields good-quality estimation results for the parameters under calibration.

In short, the proposed method improves mode-choice and route-choice calibration by using data from different types of survey for each mode in multimodal interurban corridors. Thus alternative transport policy options could be designed and tested in a more reliable way.

Acknowledgements

We would like to thank F. Robusté, F. García and S. Jara-Díaz for their comments. This paper is based on the thesis entitled "Methodology for estimating modal choice models in interurban transport. Application to the Madrid–Barcelona corridor", developed by Álvaro Rodríguez-Dapena and supervised by Prof. Andrés Monzón.

References

Ben-Akiva, M.E., Lerman, S.R., 1985. Discrete Choice Analysis: Theory and Application to Travel Demand. Massachusetts Institute of Technology.


Cosslett, S., 1981. Efficient estimation of discrete-choice models. In: Manski, C., McFadden, D. (Eds.), Structural Analysis of Discrete Data with Econometric Applications. MIT Press, Cambridge, MA, pp. 51–111.
Cramer, J.S., 1986. Econometric Applications of Maximum Likelihood Methods. Cambridge University Press, Cambridge.
Department of Transport and Environment, 1992. Estudio de tráfico y rentabilidad de la línea de alta velocidad Madrid–Barcelona. MOPTMA Report, Madrid, Spain.
Imbens, G., 1992. An efficient method of moments estimator for discrete choice models with choice-based sampling. Econometrica 60 (5), 1187–1214.
Jara-Diaz, S., Farah, M., 1987. Transport demand and user's benefits with fixed income: the goods/leisure trade-off revisited. Transportation Research 21B, 165–170.
Jara-Diaz, S., Ortuzar, J.D., 1989. Introducing the expenditure rate in the estimation of mode choice models. Journal of Transport Economics and Policy 23 (3), 293–308.
Manski, Ch., Lerman, S.R., 1977. The estimation of choice probabilities from choice-based samples. Econometrica 45 (8), 1977–1988.
Manski, Ch., McFadden, D., 1981. Alternative estimators and sample designs for discrete choice analysis. In: Manski, Ch., McFadden, D. (Eds.), Structural Analysis of Discrete Data with Econometric Applications. MIT Press, Cambridge, MA, pp. 3–50.
McFadden, D., 1981. Econometric models of probabilistic choice. In: Manski, C., McFadden, D. (Eds.), Structural Analysis of Discrete Data with Econometric Applications. MIT Press, Cambridge, MA, pp. 199–272.
Rao, S.S., 1984. Optimisation. Theory and Applications. Wiley Eastern Limited.