Transpn Res.-A Vol. 16A, No. 5-6, pp. 371-382, 1982. Printed in Great Britain.
0191-2607/82/050371-12$03.00/0 © 1982 Pergamon Press Ltd.
STATISTICAL ASPECTS OF TRAVEL DEMAND MODELLING

HUGH GUNN, Institute for Transport Studies, University of Leeds, Leeds LS2 9JT, England
and
JOHN BATES, John Bates Services, The Old Coach House, Wyke, Gillingham, Dorset, England

Abstract--This paper presents a review of statistical problems which in varying degrees of severity are common to the majority of models used in conventional analyses of travel demand, whether these are based on data defined for individual travellers or for groups of individuals. Illustration is given of the theoretical and practical problems of testing model goodness-of-fit and of model selection. Consideration is given to the special problem of spatial interaction models which incorporate a large number of parameters. The implications of poor model specification are discussed. Finally, recommendations are given for methods of validating models against data sets other than those used for estimation.

1. INTRODUCTION
A majority of the models used in conventional representations of travel demand are constructed to explain choices between alternatives, relating characteristics of options and of decision makers to proportions of decision makers selecting each option. The choices most frequently dealt with are: whether to own a car, where to travel to and what mode to use. In the last five years there has tended to be some polarization, at least in the U.K., between practitioners favouring the use of "aggregate" models and those favouring "disaggregate" models. It is unfortunate that the disagreement should have appeared to relate to the nature of the data rather than to the nature of the models and the associated estimation techniques, since the facts of the matter are that not only are the statistical considerations for model-fitting virtually the same for both kinds of data, but the models themselves show a high degree of similarity. This paper sets out to review some of the statistical problems which, in varying degrees of severity, are common to all travel demand models. The emphasis, however, will be on the opportunities to improve and extend the statistical procedures for the estimation and evaluation of models based on aggregate (grouped) data, such as are commonly used in U.K. transport models to predict car ownership, mode split and trip distribution. Illustration will be mainly by reference to the most common of choice models, the logit model. Under this heading, we will include the conventional "gravity" model of spatial interaction. This paper assumes the general suitability of maximum likelihood estimation. We begin in Section 2 by discussing the treatment of grouped data in the likelihood function and then turn in Section 3 to the general problem of model selection, using a model of household car ownership as an example. 
The next problem that we consider is how to obtain parameter estimates when we have a very large number of parameters to estimate: we discuss this in Section 4 with reference to the well known "gravity model".
However, the validity of the general theorems of maximum likelihood estimation depends on having an acceptable model specification: in Section 5 we tentatively consider the implications for the statistical properties of models which do not appear to be well-specified. Since the question of specification is so crucial, it is necessary to take the task of validating models much more seriously, and we discuss some possible methods in Section 6. We conclude that, particularly for aggregate models, the question of survey design deserves much greater attention. However, improvements can only be made in conjunction with a greater understanding of the statistical effects on the model due to data collection methods. The goal, of course, is to achieve better models and a more efficient use of data collection resources.

2. THE TREATMENT OF GROUPED DATA
Denoting the probability that individual q chooses option i by P_iq, the likelihood function for a given data set of observations on Q individuals, each choosing between J options, can be written down as

   L = Π_q Π_j P_jq^(g_jq),   where g_jq = 1 if individual q chose option j, and 0 otherwise

(the products running over q = 1, ..., Q and j = 1, ..., J). If we model P_jq by a function of observed characteristics X involving unknown parameters θ, the method of maximum likelihood (ML) entails choosing values of the unknown parameters θ to maximise L, or more conveniently ln(L). As usual, the estimate of the variance-covariance matrix of the fitted parameters is derived as the negative of the inverse of the matrix of expectations of second derivatives of ln(L). Grouping individuals together prior to the estimation of the model can be considered equivalent to assigning a single common set of characteristics to each member of a group and each option they face. The general form of the likelihood function is not affected by such a procedure, although it may be rewritten in a way to correspond to the chosen grouping. Whether or not the aggregation of the data has any effect on the parameters of the final model depends on the degree of similarity of individuals within groups. In the extreme case where individuals within groups are indistinguishable to the modeller, it is simply inefficient data handling to retain the data at a fully disaggregate level; group members can effectively be treated as "replicates" of one another. An example of this is found in those models of household car ownership which use income as the sole explanatory variable, based on surveys which record income levels only to within a finite number of groups. Further, in this case, even if income has been recorded as a continuous variable, Bates (1979) has shown that there is virtually nothing to be gained from retaining individual data points as opposed to imposing a sensible grouping. In short, if individuals within groups are similar in all important respects, they can be grouped together without detriment to the analysis. (Of course, as the number of characteristics considered relevant rises, the instances in which such a condition will obtain will be less and less common.) On the other hand, it is generally recognised that aggregation over dissimilar individuals can produce misleading analyses, masking important sources of variation or even leading to so-called "ecological correlations". In a later section we shall return to a consideration of the opportunity to test for the importance of certain explanatory variables by comparing models fitted at one level of aggregation with data at another. In the meantime, we note that the issue of level of aggregation of the data does not affect the estimation procedure.

That there is such a direct correspondence between both problems and solutions for a wide group of travel demand models, whatever the degree of disaggregation in the data to which they are fitted, has not always been clear in the literature. Thus, for one example, the analytic device of generalising the logit transformation to allow the multinomial logit model to be fitted to grouped data by least-squares estimation, familiar in the disaggregate literature as the Berkson-Theil method, has only recently been considered in the context of aggregate interaction models (van Est and van Setten, 1979). Another example of an unnoticed parallel concerns the issue of fitting choice models when certain proportions of the population are (unknown to the modeller) "captive" to particular options. In the recent literature this has been discussed by Gaudry and Dagenais (1979); by way of solution, the "dogit" model has been proposed. This model includes coefficients which allow measurement of the degree of apparent "captivity" in the data, and which permit this feature to be preserved in forecasts. However, the same problem has been explicitly recognised for over a decade in the standard models of U.K. car-ownership, based on household data aggregated by income group (see Bates, 1971). Indeed, the same model form has been used (albeit in simpler guise, since the models are defined in terms of binary choices). This apparent "captivity" is an important feature of U.K. car-ownership data; for example, Bates, Gunn and Roberts (1978) estimated that some 40% of U.K. households will not change from being single car-owning to owning two or more cars, irrespective of increasing real income. There is also statistical evidence for a corresponding "captivity" of some 5% of households to the state of being non-car-owning, once again, regardless of income. Traditionally, this feature of the data has been accommodated in the model by the introduction of "saturation levels" less than unity, acting as upper bounds to the choice probabilities governing the proportions of households opting for car-ownership or multiple car-ownership at any given income level. That this is directly equivalent to the "dogit" formulation in the simple case of binary choice and "captivity" to a single option is demonstrated in Appendix 1. In Section 4, we show that the standard gravity model for the distribution of trips can be considered as a choice model where all individuals are considered to have the same characteristics and the J options available to them are the set of movements between pairs of zones (i-j). Although it has not normally been described in this way, the model can be represented as a multinomial logit model in which the attributes of the choices are the same for all individuals. Thus we see that there is a direct equivalence between the estimation procedures that are used for zonally aggregated models of spatial interaction, models of household car-ownership based on "replicate" observations, and fully disaggregate models of individual choice. Let us therefore consider the general statistical treatment of such models.

3. GOODNESS OF FIT AND MODEL SELECTION

The measure of goodness-of-fit most commonly used in disaggregate modelling is the so-called ρ² statistic, which--on an analogy with the well known R² statistic--ranges from 0 to 1. If L*(θ̂) is the value of the log-likelihood function at the maximum, for a given model specification, and L*(0) is the value assuming that each individual chooses options at random, then

   ρ² = 1 − L*(θ̂)/L*(0).

It is well known that there are a number of problems with this statistic, and it is in any case unable to provide the relatively straightforward indication attributed to R². This may well turn out to be a blessing in disguise. Most criticism of ρ² (e.g. Stopher, 1975; Tardiff, 1976) has centered on whether the logit should contain (alternative-specific) constants in the null hypothesis or not. If, in the sample population, the various options are distributed in anything like unequal proportions, the inclusion of the constants alone as a model, compared with the null hypothesis in which all coefficients are set to zero, may appear quite significant, even though none of the behavioural variables are yet included. Dealing with this aspect goes some way to answering the criticism that for many data sets the minimum value of zero for the ρ² statistic is not a serious possibility.
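As a small numerical sketch of the statistic itself (the log-likelihood values here are hypothetical, not taken from any fitted model):

```python
import math

def rho_squared(loglik_fitted, loglik_random):
    """rho^2 = 1 - L*(theta_hat)/L*(0), where L*(0) is the
    log-likelihood under purely random choice."""
    return 1.0 - loglik_fitted / loglik_random

# Hypothetical binary-choice example: 2000 individuals choosing
# between two options at random gives L*(0) = -2000 * ln 2.
loglik_random = -2000 * math.log(2)   # approx -1386.29
loglik_fitted = -1000.0               # hypothetical fitted model

print(round(rho_squared(loglik_fitted, loglik_random), 3))
```

As the discussion below argues, neither ρ² = 0 nor ρ² = 1 is normally attainable in practice, so a value such as this is hard to interpret on its own.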
However, there seems to have been very little discussion of the other side of the problem: that, equally, the maximum value of ρ² = 1 is not a serious possibility either, since this implies that the model can account for all the variation in the data. Quite apart from the asymptotic problems of reproducing the values of zero and one with the kind of choice models used in multinomial logit analysis, it only requires two persons in the sample with the same vector of attributes but choosing different options to make it impossible for the model to reproduce all choices correctly. In our opinion, the answer to both these aspects of the problem lies in abandoning the search for a simple indicator and returning to more basic principles. It may be said in passing that the simplicity of the R² measure has not infrequently led to its misuse, and that there are considerable advantages for the analyst in being forced to consider the statistical basis of his modelling more carefully. In this case, the fundamental statistical property of the log-likelihood ratio of two alternative models, with one being a subset of the other, is that, when it is multiplied by 2, it is distributed as χ² with degrees of freedom equal to the number of additional parameters in the more highly specified model. This property, allied with some important (and possibly controversial) principles of model building, constitutes the major criterion in determining the choice of model. For the purposes of illustration, we will deal with the case of binary choice, but most of what we say can be generalised to the multiple case.
If P^S_1q is the probability ascribed by model S to individual q's choosing option 1 (so that the probability of his choosing option 2 is 1 − P^S_1q), the log-likelihood function for any model S is defined as the sum of Σ_q g_1q log P^S_1q for those individuals choosing option 1, and Σ_q g_2q log(1 − P^S_1q) for those individuals choosing option 2. Suppose there are altogether N individuals in the sample, of whom R choose option 1. Then the minimum value of the log-likelihood function L*(S) is that associated with the null hypothesis P^S_1q = 0.5 for all q, giving

   L*(S) = −N log 2.

(Note: it is of course possible to obtain lower log-likelihood values by choosing deliberately perverse models, but the optimisation procedure for estimating the coefficients will always reject such models in favour of the null model.) Let us refer to this model by S = 0. Consider now the model where P^S_1q = c for all q, and c is to be estimated. It can easily be shown that the maximum likelihood (ML) value of c will be R/N. Let us call this model 1 (S = 1). The value L*(1) is given by

   L*(1) = R log(R/N) + (N − R) log(1 − R/N)

and this can be tested against L*(0) by comparing 2(L*(1) − L*(0)) with χ² with 1 degree of freedom (df). Unless R/N is approximately 0.5, the null model will normally be rejected by this test, implying that it is necessary to have an alternative-specific constant in the logit.
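This comparison can be carried out numerically; the sample size below matches the worked example later in this section, but the count R is hypothetical:

```python
import math

N, R = 7420, 4000          # N as in the example below; R is hypothetical

# Null model S = 0: P = 0.5 for everyone.
loglik_null = -N * math.log(2)

# Model 1: a single constant, with ML estimate c = R/N (the market share).
c = R / N
loglik_const = R * math.log(c) + (N - R) * math.log(1 - c)

# Likelihood-ratio statistic, chi-squared with 1 df under the null.
lr = 2 * (loglik_const - loglik_null)
print(round(lr, 2), lr > 3.84)   # 3.84 = 95th percentile of chi2(1)
```

Even though R/N here is only modestly different from 0.5, the large sample makes the constant clearly significant, as the text anticipates.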
If the constant is found to be necessary, the next stage is to construct a number of possible models, whose specification will be a combination of intuition and theoretical requirements. There are no hard and fast rules governing the choice of "best" model, and there is a general problem in defining the trade-off between goodness-of-fit and simplicity. Most recommendations about model selection are concerned with the choice from a hierarchical nested set of models: a useful discussion is provided by Fienberg (1977). In the hierarchical case, which is characteristic of most applications of multinomial logit, the basic principle is to make successive improvements to the likelihood function, provided that each improvement is significant, as measured by the likelihood ratio (LR) test. With a limited number of independent variables, it may be possible to find consistently "best paths" through the alternatives: even so, it may well transpire that different paths lead to different conclusions. A simple example will suffice. Suppose that we have four alternative models, the simplest being model 1. Model 2 differs only by having an additional linear term in the logit, associated with variable X, and model 3 is comparably defined, but with variable Y in place of X. Finally, model 4 contains both X and Y introduced as linear terms. All the nested pairs of models can be compared directly, and the LR test can be evaluated for each such pair. Thus in this simple example, the only pair which is not directly assessable is models 2 and 3. Now it is quite possible that the LR test could indicate that both 2 and 3 were significant improvements on 1, and that 4 was a significant improvement on 2 but not on 3: additionally, it would be likely to indicate that 4 was an improvement on 1. In such a case, if we proceed from 1 to 2, then we would advance to 4, whereas if we proceed from 1 to 3, then we would not advance further.
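The path-dependence just described can be made concrete. The log-likelihood values below are invented purely to realise the pattern in the text (models 2 and 3 each improve on model 1; model 4 improves on 2 but not on 3):

```python
# Hypothetical maximised log-likelihoods for the four nested models.
# Model 1: base; 2: base + X; 3: base + Y; 4: base + X + Y.
loglik = {1: -1000.0, 2: -990.0, 3: -985.0, 4: -984.0}
n_params = {1: 1, 2: 2, 3: 2, 4: 3}

CHI2_95 = {1: 3.84, 2: 5.99}   # 95th percentiles of chi-squared by df

def lr_improves(small, big):
    """LR test: is the larger model a significant improvement?"""
    stat = 2 * (loglik[big] - loglik[small])
    df = n_params[big] - n_params[small]
    return stat > CHI2_95[df]

print(lr_improves(1, 2))  # True:  X helps on its own
print(lr_improves(1, 3))  # True:  Y helps on its own
print(lr_improves(2, 4))  # True:  adding Y after X helps
print(lr_improves(3, 4))  # False: adding X after Y does not
print(lr_improves(1, 4))  # True:  both together beat the base model
```

Starting from model 2 the procedure advances to model 4, while starting from model 3 it stops: two defensible paths, two different "best" models.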
In this case, on grounds of parsimony, we would probably choose model 3: however, in more complicated cases, we cannot possibly evaluate all the alternatives, and it is clear that the use of different criteria for choosing a path at each juncture may lead to different assessments of the "best" model. All "stepwise" methods of model selection suffer from this problem; for example, it occurs with forward or backward stepwise regression. In addition, suppose our selection criterion led us to model 3 in the first place rather than model 2; from this basis, we reject the further addition of variable X to the logit function. However, we may have extremely good a priori grounds for believing that X ought to be included; this is particularly likely to occur in the context of constructing forecasting models. In such a case, we will tend to override the selection procedure. This sort of strategy allows us to continue to add variables within a given model structure until a point is reached when no additional variables at our disposal can be justifiably included. For choice sets containing more than two options, there is a further dimension of difficulty in that one must also decide which utility functions should include each extra variable. In principle, the same sort of methods can be used; in practice,
for multiple options and many explanatory variables, it will seldom be feasible to explore all the permutations involved. In such cases, there may be not only no unique "best" model defined by the LR test, but no practical means to assess the full range of possible models in the first place. If, on top of this complication, we consider the possibility that we should also explore different functional forms for the utility functions, and indeed a range of completely different model structures, it is clear that the methods we have described can tackle only a small part of the whole problem, albeit the part most commonly encountered, at least in modelling grouped data. We now go on to another difficult area; goodness of fit. Having developed a model to an extent where we are unable to justify further complication, we still need to assess whether the model we have developed does in fact fit the data. This is a very different consideration to that which is normally addressed in logit analysis (by means of the p2 test, for instance), which concentrates on whether the variables in the model can be collectively justified. The general theory of likelihood estimation allows the LR-test to operate in both directions--backwards, to the null model and forwards to the full or saturated model. In a useful sense, the process of model-building can be seen as selecting an appropriate point in the interval for the log-likelihood function spanned between the null model and the saturated model--exactly the tradeoff between simplicity and degree of explanation. The saturated model is a "model" with no degrees of freedom, which perfectly reproduces the observed data. For grouped data, the general formula for the log likelihood function is
   Σ_j (R_j log(R_j/N_j) + (N_j − R_j) log(1 − R_j/N_j))

where N_j is the number in group j and R_j is the "responding" subgroup. The formula can of course easily be expanded for the multinomial case. With disaggregate data, each group represents a single individual, so that N_j is always 1, and R_j is either 0 or 1. Under these circumstances, the log-likelihood function for the saturated model, which we write as L*(max), is zero.
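A sketch of this saturated log-likelihood for grouped binary data (the group sizes and response counts are hypothetical); note that with N_j = 1 throughout, every term vanishes and L*(max) = 0:

```python
import math

def loglik_saturated(groups):
    """L*(max) = sum_j [ R_j log(R_j/N_j) + (N_j - R_j) log(1 - R_j/N_j) ],
    with the convention 0 * log(0) = 0 for all-or-nothing groups."""
    total = 0.0
    for n, r in groups:   # (N_j, R_j) pairs
        if r > 0:
            total += r * math.log(r / n)
        if r < n:
            total += (n - r) * math.log(1 - r / n)
    return total

grouped = [(120, 30), (80, 60), (50, 50)]       # hypothetical grouping
disaggregate = [(1, 1)] * 90 + [(1, 0)] * 160   # each "group" is one individual

print(loglik_saturated(grouped) < 0)   # True: grouping loses information
print(loglik_saturated(disaggregate))  # 0.0
```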
There seems no reason in principle why the LR test should not be used to evaluate the goodness of fit of the chosen model, by comparison with the saturated model. This should provide a far better test of the performance of the model, since it is being assessed by its ability to reproduce the data. The number of degrees of freedom is obtained by subtracting the number of coefficients estimated from the size of the sample--typically this will be a large value. There are a number of available approximations for obtaining percentile points of the χ² distribution when the number of degrees of freedom is large. In the case of fully disaggregate data, the test against the saturated model (i.e. against a log-likelihood function value of zero) is unfortunately no longer valid. McFadden, Tye and Train (1976) have shown that the χ² properties of the LR test do not apply in the absence of "replicate" individuals, and attempts to use the test will lead to an excessive number of rejections of the models tested. (We are grateful to Joe Whittaker and Joel Horowitz for clarifying this point.) However, the test against the saturated model is valid in the case of grouped data. Once the data is grouped, the log-likelihood function value of the saturated model is no longer zero, indicating that some information has been lost: effectively, the estimation procedure no longer takes into account which individuals in any given group have chosen a particular option--only the total number of such individuals. This results in a slight complication, since the estimate of L for the saturated model will depend on the level of grouping. In general, it would seem sensible to compare the chosen model with a saturated model which accurately represents the amount of information presented in the estimation process. What should we do if the LR test against the saturated model rejects the chosen model? The implication is that it is unlikely that the difference between the model and the data is due to sampling variation alone. Again, our reaction will depend on the use to which the model is to be put, but in any case we would proceed with caution after such a rejection. Certainly we would not be wise to rely on using the calibrated model in different circumstances to predict variations in individual behaviour. On the other hand, it is possible that the process of aggregation may remove the non-random effects that we are not accounting for; we shall consider the problem of identifying an occasion in which this occurs in Section 6. It should be clear, however, that the test against the saturated model will expose the weakness of many models which would not be apparent from a consideration of the ρ² statistic. As an example we present a model relating to the binary choice between car-ownership and non-ownership.
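Using the log-likelihood values reported in Table 1 below, the test against the saturated model works out as follows. The critical value is obtained from the Wilson-Hilferty formula, one of the large-df χ² approximations referred to above (1.6449 is the standard normal 95th percentile):

```python
def chi2_95_wilson_hilferty(df):
    """Approximate 95th percentile of chi-squared with df degrees of freedom."""
    z = 1.6449
    a = 2.0 / (9.0 * df)
    return df * (1.0 - a + z * a ** 0.5) ** 3

# From Table 1: best parametric model (IAB) against the saturated model (S).
loglik_iab, k_iab = -3716.19, 5
loglik_sat, k_sat = -3624.99, 133

lr = 2 * (loglik_sat - loglik_iab)   # 182.4
df = k_sat - k_iab                   # 128
print(round(lr, 1), df, lr > chi2_95_wilson_hilferty(df))
```

The statistic of 182.4 comfortably exceeds the approximate critical value of about 155, so the IAB model is rejected against the saturated model, exactly as described below.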
A sample of 7420 households from a transportation survey were classified according to gross household income, accessibility to employment and the level of bus service: altogether 133 distinct categories were used. The log-likelihood function values were as given in Table 1. A test against the tabulated χ² values with appropriate degrees of freedom indicates that each model from N to IAB provides a significant improvement. Nonetheless, the most highly specified model is rejected against the saturated model. Taking the differences between the successive values of the log-likelihood function makes it clear that we are seeing fast-diminishing returns--it appears extremely unlikely that the addition of one or two more variables will significantly reduce the gap between the parametric and the saturated model. Thus, there are important (non-random) features in the data which the model is unable to reproduce, despite the significant contribution of the included variables. Thus, the use of the more complicated LR test has the important virtue of forcing the analyst to assess the model-building process far more carefully. Although there are no rules which guarantee a unique best model, the analyst is in a position to make a number of statistical comparisons between alternative models, and having justified (whether on a priori, theoretical or statistical grounds) his best model, he must then subject it to the sterner test--its performance compared with the data.

Table 1. An example of successive reductions in log likelihood

  Model                                           No. of coefficients   Log likelihood
  N     Null (random choice)                        0                   -5143.15
  C     Constant only (market share)                1                   -5116.28
  I     Income                                      2                   -3771.80
  IA    Income + accessibility                      3                   -3728.22
  IAB   Income + accessibility + bus service        5                   -3716.19
  S     Saturated                                 133                   -3624.99

4. ESTIMATION OF ERRORS WHEN THERE ARE MANY PARAMETERS: THE GRAVITY MODEL

In the previous section, we reviewed the general question of model selection with regard to goodness-of-fit measures. Of course, another important consideration in model selection is that the coefficients should have the expected signs and have adequately small confidence intervals. In Section 2, we referred briefly to the well-known approximation to the variance-covariance matrix which is produced in the ML estimation process. It was remarked at the outset that the aggregation of the basic data in no way changes the estimation problem and thus that, for example, gravity models at one end of the spectrum can be treated in exactly the same way as models using data on individual choices at the other. However, in practice there are aspects of the circumstances in which gravity models are applied which do tend to place them in a class of their own. The one which concerns us particularly here is the question of sheer scale; many gravity models are defined in terms of thousands of parameters rather than the few dozen that are involved in most other applications and the one or two that enter conventional car-ownership models. Whilst the number of parameters involved is theoretically immaterial, in practice estimation procedures which involve inverting matrices of the same order as the number of parameters become unworkable, with current computers, at much over 200 parameters. Fortunately, the actual solution of the ML equations for the optimum can be performed without any such matrix inversion, since an efficient algorithm is available (see Murchland, 1977, for a discussion of its properties). Estimation of the accuracy of these parameters, and of the fitted model, does remain a problem, however. As a result of recent research, approximations are now available which avoid the need for the inversion of large matrices.

We shall first review the approximations given in Gunn, Kirby, Murchland and Whittaker (1980a, b) for assessment of the accuracy of the gravity model estimation and the implications of these approximations for the errors that can be expected in predicted trip interchanges and weighted sums of interchanges, such as modelled flows on links in the road network. In the following, it is
assumed that the gravity model does provide a good explanation of the data. Consider a gravity model defined in terms of an "empirical" deterrence function:

   t_ijk = α_i β_j F_k M_ijk,   where M_ijk = 1 if the separation between zones i and j falls into category k, and 0 otherwise,

and F_k, α_i and β_j are constants; thus t_ij = α_i β_j (Σ_k F_k M_ijk) models the number of trips between zone i and zone j. Equivalently, we can write

   p_ij = t_ij / (Σ_i' Σ_j' t_i'j')
        = exp(log α_i + log β_j + log(Σ_k M_ijk F_k)) / Σ_i' Σ_j' exp(log α_i' + log β_j' + log(Σ_k M_i'j'k F_k)),

where the model, defined in terms of the three sets of constants, has been re-written in more immediately recognisable logit form. Denoting Σ_i Σ_k t_ijk by S_.j., Σ_j Σ_k t_ijk by S_i.., Σ_i Σ_j t_ijk by S_..k and Σ_i Σ_j Σ_k t_ijk by S..., the approximate standard error

   (1/S_i.. + 1/S_.j. + 1/S_..k − 2/S...)^(1/2)

has been proposed for the logarithm of the modelled trip interchange, t_ij; empirical testing in circumstances where the number of parameters is small enough to permit the calculation of standard errors in the usual way (i.e. by means of inversion of the matrix of expectations of second derivatives) has confirmed the general accuracy of the approximation (Gunn, Kirby, Murchland and Whittaker, 1980a, b). Another way to generate the same approximation to the standard errors of trip interchanges is via approximations to the variance-covariance matrix of the model parameters. Either approach can be used whatever the form of deterrence function assumed, and can provide assessments of the accuracy of general functions of trip interchanges, and in particular of the weighted sums that make up assigned flows on real links in the network. It can be seen that the major sources of error in modelled trip interchanges in most applications come from estimation errors in the α and β parameters, the origin and destination constants. Typically, the estimation errors in the parameters in the deterrence function are small by comparison, and may be neglected. The approximation to the errors in the parameters gives

   var(α_i) = α_i²/S_i.. (∀i),   var(β_j) = β_j²/S_.j. (∀j),
   cov(α_i, α_j) = 0 (∀i ≠ j),   cov(β_i, β_j) = 0 (∀i ≠ j),   cov(α_i, β_j) = 0 (∀i, j).

Thus we have an approximate standard error of

   t_ij (1/S_i.. + 1/S_.j.)^(1/2)

for the modelled trip interchange t_ij. This is consistent with the other derivation of the approximation; when the number of rows and columns is large, and large in comparison to the number of deterrence parameters, we would have

   1/S_i.. + 1/S_.j. + 1/S_..k − 2/S... ≈ 1/S_i.. + 1/S_.j.

and

   t_ij exp((1/S_i.. + 1/S_.j.)^(1/2)) ≈ t_ij (1 + (1/S_i.. + 1/S_.j.)^(1/2)).

Making the variance and covariance terms for all the parameters explicit allows us to calculate the expected accuracy of assigned flows quite easily (assuming that the assignment process itself does not contribute any error). For any particular point on the network, the actual assigned flow can be written as Σ_i Σ_j w_ij t_ij, where the w_ij define the assignment, and lie between zero and one (zero if none of the trips between zones i and j pass through that link, and unity if all such trips use that link). The actual assignment provides values for each w_ij in any particular case; the variance of the overall flow may then be calculated as

   V(FLOW) = Σ_i Σ_j w_ij² t_ij² (1/S_i.. + 1/S_.j.).
This sort of error can be interpreted as the inescapable estimation error due to finite sample size, but it will usually be an underestimate of the real variability of modelled flows as compared to actual flows, even for the base period over which the data has been collected, for two reasons; firstly, because the gravity model will not generally (if ever) be a good fit, and secondly because the assignment process will never be exact.
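The error formulae above are straightforward to apply in code. The small trip matrix below is invented for illustration, and `sep_cat` is an assumed assignment of each (i, j) pair to a separation category k:

```python
# Hypothetical 3x3 matrix of modelled trips t_ij and separation categories k.
t = [[100.0,  40.0,  20.0],
     [ 50.0, 120.0,  30.0],
     [ 25.0,  35.0,  80.0]]
sep_cat = [[0, 1, 2],
           [1, 0, 1],
           [2, 1, 0]]
n = len(t)

S_i = [sum(t[i]) for i in range(n)]                       # S_i.. row totals
S_j = [sum(t[i][j] for i in range(n)) for j in range(n)]  # S_.j. column totals
S_k = [0.0] * 3                                           # S_..k category totals
for i in range(n):
    for j in range(n):
        S_k[sep_cat[i][j]] += t[i][j]
S = sum(S_i)                                              # S...

def se_log_t(i, j):
    """Approximate standard error of log t_ij."""
    k = sep_cat[i][j]
    return (1/S_i[i] + 1/S_j[j] + 1/S_k[k] - 2/S) ** 0.5

# Variance of an assigned link flow, neglecting deterrence-function error:
# V(FLOW) = sum_ij w_ij^2 t_ij^2 (1/S_i.. + 1/S_.j.)
w = [[1.0, 0.5, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.0, 0.0]]   # hypothetical assignment fractions for one link
v_flow = sum(w[i][j]**2 * t[i][j]**2 * (1/S_i[i] + 1/S_j[j])
             for i in range(n) for j in range(n))
print(round(se_log_t(0, 0), 4), round(v_flow, 2))
```

The point of the approximation is visible here: only the margins S_i.., S_.j., S_..k and S... are needed, not the inversion of a large matrix.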
Implications for data collection

The approximations are mentioned here to demonstrate that the scale of the problem involved in gravity model applications need not prevent the statistical appraisal of the reliability of the gravity model predictions, in much the same way as for other similar travel demand models. One immediate consequence of the simplicity of the approximations is that it will be possible to make a preliminary assessment of the desired sample numbers starting and finishing trips in any zone prior to data collection. The adequacy of any particular sample that is then collected can be assessed before model fitting, and if numbers are too low for the accuracy that is desired, further sampling can take place immediately. There are
potentially great advantages to be had from such a procedure, in terms of ensuring that the final data base is actually sufficient for the task in hand. In practice, there is very often an existing estimate, however crude, of the pattern of trip interchanges. This can be factored down to represent any chosen sample fraction and estimates made of the standard errors in the various parameters of the model that would result from a survey of that size. Different sample designs and sizes can then be investigated to establish the most efficient way to attain the level of accuracy that is needed. This sort of analysis relies on the fact that the errors in the fitted model depend principally on the amount of data that is collected, and are fairly insensitive to the exact pattern of trip interchanges. Thus even a crude "existing estimate" can be a reasonable guide to the accuracy that could be expected from the actual survey. Such an exercise has recently been carried out in the U.K., using a standard statistical package to fit gravity models to a data set consisting of predicted private vehicle movements between a number of urban areas (Gunn, Mackie and Ortuzar, 1980). This exercise, sponsored by the Department of Transport, was designed to assess the feasibility of estimating "values of time" from aggregate interaction data. For a number of years, the Department have recommended that interaction models should use a linear combination of time and out-of-pocket costs to measure zonal separation; where no local data exists to suggest weights for these two aspects, a typical "value of time" was suggested. In theory of course there is no reason why the "value of time" coefficient should not have been fitted freely along with the other unknown parameters in the model; however, the various analyses which attempted such a procedure encountered difficulties, not least because of the high degree of collinearity that usually exists between trip times and costs.
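A sketch of that sizing calculation, assuming a crude existing estimate of zone-level trip totals: since var(α_i) = α_i²/S_i.. implies a standard error for log α_i of roughly (1/S_i..)^(1/2), and a sample fraction f scales the expected sampled totals to f·S_i.., the smallest adequate f can be found directly (all numbers are hypothetical):

```python
# Hypothetical existing estimates of total trips starting in each zone.
zone_origin_totals = [12000.0, 5000.0, 800.0, 3000.0]

def worst_se_log_alpha(sample_fraction):
    """Largest approximate s.e. of log(alpha_i) across zones, using
    var(alpha_i) = alpha_i^2 / S_i.. with S_i.. scaled by the sample fraction."""
    return max((1.0 / (sample_fraction * s)) ** 0.5 for s in zone_origin_totals)

target = 0.10   # desired s.e. of log(alpha_i), i.e. roughly +/-10% on alpha_i

# Smallest sample fraction (to the nearest 1%) meeting the target; the
# binding constraint is the smallest zone, which needs f * 800 >= 1/0.10^2.
f = next(f / 100 for f in range(1, 101) if worst_se_log_alpha(f / 100) <= target)
print(f)   # -> 0.13
```

Repeating the calculation for different targets, or for designs that oversample the smaller zones, gives exactly the kind of design comparison described above.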
Using the forecast travel patterns from a conventional model as an "existing estimate", the sort of analysis outlined above was used to produce a direct estimate of the efficiency of various possible survey designs, in terms of survey effort and the resulting accuracy with which the "value of time" parameters could be estimated. In conclusion, despite the special problems inherent in fitting models with the numbers of parameters typical of many gravity model applications, there is considerable scope for improving current methods, and the predictions they produce, by developing the analysis of the statistical properties of the gravity model estimates. And while some of the results in this section are peculiar to the gravity model, the importance of relating data collection methods to the statistical properties of the model has much wider implications. This is of course particularly true when the model has many parameters and a large amount of data is required for calibration.

5. A TENTATIVE ASSESSMENT OF THE IMPLICATIONS OF MODEL MIS-SPECIFICATION
At the outset, we stated a bias of interest towards the estimation problems associated with grouped data, given that the experience of the authors has been in the context of U.K. transportation modelling. From that perspective, the issue of model mis-specification appears in rather a different light to that in which it is viewed in the disaggregate literature (see, for example, Tardiff, 1979; Williams and Ortuzar, 1979). If we accept the LR test against the saturated model as the appropriate test for mis-specification for models fitted to grouped data, it has been our experience that there will very often be sufficient data to demonstrate that even the best available model cannot be considered "exact". The problem is particularly acute with the gravity model, where what little evidence on goodness-of-fit is available demonstrates that such models do not provide a good fit to actual interaction data (see, for example, Sikdar and Hutchinson, 1981). On the other hand, the general structure of the model may appear sound enough to regard the mis-specification as random "noise" for the purposes of the analysis. Given the position that the problem is not one of identifying mis-specification which can be rectified, but rather of allowing for mis-specification that cannot be avoided, it is natural to look for ways of accommodating this extra source of error in the estimation and evaluation procedures. Previous sections have outlined the calculation of errors, and of approximations to errors, for the usual maximum likelihood estimates. It has been stressed that these errors are appropriate only when the model is a "good" model; they are in fact calculated on the assumption that the fitted model is exactly true. Whilst this sort of assumption is also made in the alternative techniques that are available, such as weighted least squares, it is used in a far more restrictive way in the construction of the usual ML variance estimates.
This is illustrated in Appendix 2 in the context of a simple example using a binary logit model with a single explanatory variable, and contrasting the ML estimation procedure with a weighted least squares approach. If the model is in fact a "good" model, and the weights are defined appropriately, the estimation criteria for the two approaches coincide. So too do the expectations of the variance of the coefficient of the single explanatory variable; the crucial difference is in the calculation of this variance. In the weighted least squares approach, an empirical assessment is made of the variance of the actual observed proportions around the model, whereas in the ML procedure the variance estimate is calculated without direct reference to the observed data. A theoretical measure (based on the assumption that the model is exact) replaces the empirical one. The situation can be compared most simply to the position of two experimenters observing the same set of data points and wishing to calculate their mean and variance; if one experimenter believes that the data have an underlying Poisson distribution, he might calculate only the mean, believing that mean and variance must coincide. The second experimenter, perhaps with less confidence in the way the data have been derived, might form the usual empirical estimates of both mean and variance. If the first experimenter is right in his belief, then both should end up with much the same estimates. If he is wrong, he will have the
same estimate of the mean, but a different (and wrong) estimate of the variance. In practice, of course, no-one would make such an unsupported assertion about the error distribution if, as here, it were to be crucial to the analysis. A formal test of the Poisson assumption would be performed and, if found incorrect, "belief" (and analysis) would be revised. In the context of the models that we are discussing, the likelihood ratio test discussed in Section 3, assessing the fitted model against the fully saturated model, should usually serve as the test of the adequacy of the hypothesis, namely that the sample data have been drawn randomly from a population in which the mean numbers choosing each option are correctly defined by the model for any possible set of values of the explanatory variables. But what should be done if, after exhausting all variants in model form and all (forecastable) explanatory variables, the best model is rejected by the LR test?

The effect on the error estimates
In practice, one can simply record this, state the caveat that all error estimates are "conservative", and proceed much as before. However, it is clearly unsatisfactory to have no measure at all of how "conservative" the estimates are, especially if (as we have said in the context of the gravity model) there is reason to fear that certain models will be very inaccurate indeed in some applications. Consider the analysis of aggregate data, so that the dependent variable is a proportion. In general, we know the accuracy to expect in the measurement of any proportion from the sampling process, expressed as a function of sample sizes and the true size of the proportion. We can distinguish three broad sets of circumstances, according as mis-specification errors are very much less than, comparable to, or very much greater than sampling errors. In the first case, the usual ML procedure holds. In the last case, in the absence of any systematic variation in the specification error, a simple non-linear least squares criterion seems most appropriate, implicitly treating specification error as an additive disturbance term with zero mean and constant variance, and ignoring sampling errors. However, whilst this sort of dominance of the mis-specification problem over sampling uncertainty will arise with any inexact model when the sample size is large enough, it seems likely that the first two cases will be most common in practice. If it should happen that the specification error varies systematically in proportion to the sampling error, then a suitable weighted least squares criterion will give appropriate parameter estimates and error estimates. (For gravity models, for example, there is some empirical evidence that specification errors are indeed related to size of trip interchange in much the same way as are sampling errors; see Haskey, 1973.)
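The contrast between the theoretical (ML) and empirical variance estimates can be illustrated with a small simulation. The sketch below is illustrative only: the data-generating process, sample sizes and the size of the injected specification error are all invented. A single-parameter binary logit of the form used in Appendix 2 is fitted to grouped data by ML, and the ML variance is then rescaled by the empirical scatter of the observed proportions about the fitted model, in the spirit of the weighted least squares estimate discussed there.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented grouped data: 12 income bands, 200 households each.
X = np.linspace(0.5, 3.0, 12)            # income levels (illustrative)
n = np.full_like(X, 200)
a_true = -0.8
# Inject a crude specification error on the logit scale, then sample.
noisy_logit = -a_true * X + rng.normal(0.0, 0.3, X.size)
r = rng.binomial(n.astype(int), 1.0 / (1.0 + np.exp(-noisy_logit)))

def fit_ml(X, n, r, a=0.0):
    """Newton-Raphson for the one-parameter logit p_i = 1/(1+exp(a X_i)).
    Returns the ML estimate and the usual ML variance estimate."""
    for _ in range(50):
        p = 1.0 / (1.0 + np.exp(a * X))
        score = -np.sum(X * (r - n * p))          # dl/da
        info = np.sum(X**2 * n * p * (1 - p))     # -E(d2l/da2)
        a += score / info
    return a, 1.0 / info

a_hat, v_ml = fit_ml(X, n, r)
p_hat = 1.0 / (1.0 + np.exp(a_hat * X))
# Empirical scale: scatter of observed proportions about the model,
# each term weighted by its expected sampling variance p(1-p)/n.
sigma2 = np.mean((r / n - p_hat) ** 2 / (p_hat * (1 - p_hat) / n))
v_wls = sigma2 * v_ml
print(a_hat, np.sqrt(v_ml), np.sqrt(v_wls))
# With specification error present, sigma2 exceeds 1 and the
# empirically scaled standard error exceeds the ML one.
```

Under an exact model the scale factor has expectation one, so the two standard errors agree on average; the injected noise makes the ML figure optimistic, which is precisely the pattern reported by Van Est and Van Setten below.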
Van Est and Van Setten (1977) report two studies in which singly constrained gravity models were fitted to a data set using two different criteria; firstly, the usual ML procedure, and secondly, a least squares procedure after using a transformation of the data, along the lines of the
Berkson-Theil approach. The standard errors calculated on the basis of the latter method, which they note has the advantage of incorporating the effects of model mis-specification which are omitted from the usual ML procedure, were an order of magnitude larger than those returned by the ML analysis. (The actual parameter estimates were virtually identical, however, in each case.) In short, the implication is that the usual ML procedure may overstate the accuracy of the fitted coefficients, sometimes grossly. Given these possible solutions when the mis-specification is very large compared to sampling error, or directly related to sampling error, the case left for discussion is when both sampling and specification errors need to be taken account of separately, i.e. when (a) neither dominates and (b) they are not directly related and thus cannot be treated jointly. One possible approach to the analysis of such data is outlined in Gunn (1980), using the device of interpreting model mis-specification as error in explanatory variables. For the simple case of a binary logit using grouped data and a single explanatory variable (the example used is car-ownership and income), the assumption is that the proportions falling into each category are linked to the level of the explanatory variable in that category via a third, latent variable. The usual binary logit model is used to explain car-ownership, but is defined as a function of (unobserved) latent variables. The observed explanatory variable is assumed to be an estimate of the latent variable in each group; the usual estimation problem then becomes extended to include estimation of the latent variables.
The closeness of the association between the original explanatory variables and the new "latent" variables is a measure of the amount of mis-specification in the original model, and permits error estimates for forecasts using the same sort of explanatory variables to include an expected contribution from a similar sort of mis-specification. In conclusion, if the fitted model is demonstrably mis-specified, the error estimates from the usual ML procedures will be inappropriate. To our knowledge, there is no fully satisfactory alternative procedure when an adequate model cannot be constructed; we have tentatively advanced various approaches which might be explored, depending on the relative sizes of mis-specification and sampling errors.

6. MODEL VALIDATION
In previous sections we have tried to air some of the problems that are encountered in the process of estimating choice models from data and of assessing their statistical properties. In this final section, we shall discuss an aspect of model evaluation which is less frequently encountered in the literature, namely validation in comparison with information not used during the process of model estimation. Such a validation procedure can be a useful independent test for model misspecification, albeit an expensive alternative to the statistical analysis of the estimation data set. We shall consider two cases. Firstly, the validation of a fully disaggregate model of mode choice against a data
set of individual decision makers, and secondly the validation of a "replicate household" model of car-ownership, to confirm that a model fitted at a large area level is applicable at the level of small zones.
A modal choice example--disaggregate data
The disaggregate example that will be considered concerns mode choice, and is based on data and models supplied by Juan de Dios Ortuzar, referring to work trips for Leeds city centre workers living in a particular corridor to the east of Leeds. The analysis is described in Ortuzar (1980). Given a fitted model, the validation procedure is essentially a process of ensuring that the choices observed in a data set of Q individuals (other than that on which the model was based) are "consistent" with the probabilities predicted by the model. In other words, we must check that the overall patterns of the options that were selected are consistent with a process of randomly sampling Q times from the options, when the probability of selecting option i at stage q is defined as P_iq, the model prediction of individual q's probability of choosing option i. This procedure is comparable to the test against the "saturated" model during model estimation; it compares model with data and assesses consistency. It is interesting to note that the best known methods for assessing the performance of disaggregate models, such as the Prediction Success Table proposed by McFadden (1976), use the device of comparing model performance with that of a "null" model, rather than with the data; we have discussed the limitations of such comparisons in relation to likelihood ratio tests in Section 3. It seems to us that in validation, as in estimation, the comparison of two models for superiority is a somewhat less important consideration than the assessment of each model for overall adequacy in explaining the data. In this example, we will consider the problem of comparing the relative performance of two alternative models, when transferred to a different data set.
Note that when based on a data set that has not been used in estimation, the issue of ranking models according to "fit" is particularly easily resolved, since the likelihood ratio may be used regardless of differences in model structure or numbers of parameters. For each of 124 individuals in a validation sample, each model produced a set of predicted probabilities P_jq that individual q would choose option j. The data set supplied the option actually chosen. To assess the consistency of model prediction and data outturn, the data set can be transformed into proportions and compared with predicted proportions from the model. Table 2 sets out the results of grouping data points according to the level of predicted probability of choosing any particular option. The first six rows refer to the six options available and the first model; the second six rows to the same set of options and the second model. For each option, ten groups were defined, corresponding to predictions of the probability of choosing that option lying between 0.0 and 0.1, 0.1 and 0.2, . . . , up
to 0.9 and 1.0. Thus, for example, in the cell for the 0.0 to 0.1 level for option 1 and the second model, we see the pair of numbers 9 and 0. These record the fact that, with this model, nine individuals were predicted to have an approximate probability of 0.05 of choosing option 1, but none of them actually did. Thus, on the basis of the model, we can calculate the expected number of these individuals choosing option 1 as (0.05 x 9) = 0.45. Such an expected number of "successes" can be calculated, for each model, for each cell; after pooling cells to ensure a minimum expected frequency of 5 "successes" per cell, we can calculate the usual chi-squared statistic

χ² = Σ (O − E)²/E,

where O denotes the observed and E the modelled (expected) number of "successes" in each cell, and use this as a formal test of the consistency of the models as predictions of the actual choices recorded in the validation sample. Both of the models presented in Table 2 produced values of χ² less than the 0.05 critical value for the appropriate χ² distributions; thus there is no evidence that either model is inconsistent with the data. We now calculate the likelihood ratio:

L(2)/L(1) = (0.05)^6 (0.15)^11 . . . (0.95)^15 / [(0.05)^6 (0.15)^6 . . . (0.95)^0] = 22.

Table 2. An example of model comparison
[For each of the six modes and each of the two models, the table cross-tabulates, by predicted probability band (0.0-0.1, 0.1-0.2, . . . , 0.9-1.0), the total number of individuals assigned that probability of choosing the mode ("T") and the number of those individuals actually choosing it ("0"); the individual cell entries are not recoverable from this copy.]
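The grouping, pooling and chi-squared calculation just described, together with the validation likelihood ratio, can be sketched in a few lines. This is an illustrative sketch only: the predicted probabilities and observed choices below are invented rather than the Leeds data, and the pooling rule is a simple greedy merge of adjacent cells.

```python
import numpy as np

def grouped_chi2(pred, chosen, n_bands=10, min_expected=5.0):
    """Group validation observations by predicted-probability band,
    pool adjacent cells to a minimum expected count, and return the
    chi-squared statistic with a rough degrees-of-freedom count."""
    edges = np.linspace(0.0, 1.0, n_bands + 1)
    obs, exp = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_band = (pred >= lo) & (pred < hi) | ((hi == 1.0) & (pred == 1.0))
        if in_band.any():
            obs.append(chosen[in_band].sum())
            exp.append(pred[in_band].sum())
    pooled_o, pooled_e, o_acc, e_acc = [], [], 0.0, 0.0
    for o, e in zip(obs, exp):          # greedy pooling to expected >= 5
        o_acc += o; e_acc += e
        if e_acc >= min_expected:
            pooled_o.append(o_acc); pooled_e.append(e_acc)
            o_acc = e_acc = 0.0
    if e_acc > 0 and pooled_e:          # fold any remainder into last cell
        pooled_o[-1] += o_acc; pooled_e[-1] += e_acc
    chi2 = sum((o - e) ** 2 / e for o, e in zip(pooled_o, pooled_e))
    return chi2, len(pooled_e) - 1

def validation_log_lr(pred2, pred1, chosen_idx):
    """Log of L(2)/L(1): ratio of validation-sample likelihoods."""
    q = np.arange(len(chosen_idx))
    return np.sum(np.log(pred2[q, chosen_idx]) - np.log(pred1[q, chosen_idx]))

# Invented example: 200 individuals, 3 options, two candidate models.
rng = np.random.default_rng(1)
true_p = rng.dirichlet(np.ones(3), size=200)
chosen_idx = np.array([rng.choice(3, p=p) for p in true_p])
model2 = true_p                                   # close to the truth
model1 = 0.5 * true_p + 0.5 / 3                   # flattened towards uniform

chi2, df = grouped_chi2(model2[:, 0], (chosen_idx == 0).astype(float))
log_lr = validation_log_lr(model2, model1, chosen_idx)
print(chi2, df, np.exp(log_lr))
# A likelihood ratio well above 1 favours model 2, as with the
# factor of 22 in the text.
```

Note that the likelihood ratio needs no degrees-of-freedom correction here: because neither model was fitted to the validation sample, the comparison is between two fixed sets of predictions.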
Thus the observed data set is something like twenty-two times as likely under model 2 as under model 1; we would accordingly judge the second model the better of the two, though both are consistent with the validation sample. We have assumed in this example that we are concerned with the model's overall performance in predicting the various options. In certain circumstances, there may be particular interest in the model performance for a restricted set of options; the sort of screening for consistency and superiority described above for overall model performance can be applied to comparisons on single options or groups of options if desired. Finally, we can mention a different sort of criterion on which to assess model validation, namely the total number of First Preference Returns (FPR's); a FPR is recorded for each individual who actually chooses the option to which the model assigns the greatest probability as the outcome of his choice. A methodology for the comparison of models, and groups of models, by this criterion is described in Foerster (1979). However, it can be argued that this criterion is inferior to the method set out above. The absolute number of FPR's is an ambiguous criterion of model performance, in the sense that a large number of FPR's may be an indication of poor model performance, rather than the converse. For any given model, we can calculate the expected number of FPR's from the model predictions as

Σ_q p̂_q,

where p̂_q denotes the maximum predicted probability of choosing any one of the options in the case of individual q. We can also calculate the variance of the number of FPR's as

Σ_q p̂_q (1 − p̂_q).
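Under the model, each individual contributes an independent Bernoulli trial with success probability p̂_q, so the expectation and variance of the FPR count follow directly. A minimal sketch (the probability matrix and choices below are invented):

```python
import numpy as np

def fpr_summary(pred, chosen_idx):
    """Observed, expected and s.e. of First Preference Returns.
    pred: (Q, options) predicted probabilities; chosen_idx: (Q,) choices."""
    p_max = pred.max(axis=1)                       # p-hat_q for each individual
    observed = int((pred.argmax(axis=1) == chosen_idx).sum())
    expected = p_max.sum()                         # sum_q p-hat_q
    variance = (p_max * (1.0 - p_max)).sum()       # sum_q p-hat_q (1 - p-hat_q)
    return observed, expected, np.sqrt(variance)

pred = np.array([[0.6, 0.3, 0.1],
                 [0.2, 0.7, 0.1],
                 [0.5, 0.25, 0.25],
                 [0.1, 0.1, 0.8]])
chosen = np.array([0, 1, 1, 2])
obs, exp_, se = fpr_summary(pred, chosen)
print(obs, exp_, se)   # 3 FPR's observed against an expectation of 2.6
```

An observed count far outside expectation plus or minus two standard errors would, as the text argues, signal inconsistency between model and data rather than model quality.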
If the observed number of FPR's is very much larger than expectation, or very much smaller, in the light of the estimated variance, we would have to conclude that the data were inconsistent with the model. For the two models given in Table 2, we can summarise the results as in Table 3. Thus model 2 does in fact have a higher expected number of FPR's than model 1. The actual observed numbers of FPR's from each model were within two standard errors of their expectations, once again confirming that model and data are consistent in each case. It also happened that the actual observed number of FPR's from model 2 was larger than that from model 1; however, it seems from the size of the standard errors attaching to each of these numbers, and the relative closeness of their expectations, that this ordering might be reversed in a fair proportion of random trials from such models; in other words, the criterion of observed number of FPR's is not a robust indicator of model superiority even when confined to models which have observed numbers of FPR's consistent with expectation.

An example with grouped data--household car ownership
The second sort of validation problem that we shall discuss applies only in the context of grouped data; it arises when a model is fitted to data at one particular level of aggregation and then applied to sub-groups of the data, aggregated with respect to a variable not included in the model. The null hypothesis is that the model fitted to the data at the higher level of aggregation applies also at the lower; a test of this hypothesis is then a form of mis-specification test, determining whether or not the excluded variable does have some effect on the model. It is standard practice in U.K. transportation studies to
fit logit models relating household car-ownership to household gross income, allowing for area to area differences by grouping the data according to some measure of urbanisation and fitting separate models to each group. Relationships derived for households within groups of zones are then applied to households within individual zones. In order to confirm that the measure of urbanisation is appropriate, and that there is no evidence of further sources of zone to zone variability in the fitted relationships, we can compare observed and predicted zonal car-ownership levels, given some measure of the variance of observed proportions (sampling error) and the variance of the predicted proportions (estimation error) based on the assumption that the model is correct. This procedure is described in Bates et al. (1979), in the context of a national car-ownership model. The simplest way to assess the model's performance is to form a standardised residual (SR), dividing the difference between the observed and estimated zonal proportions by the square root of the sum of their respective variances, assuming the model correct. If the model is indeed correct, we should find approximately 5% of zones with standardised residuals lying outside the range -2 to +2, with roughly half of these being over-estimates and half under-estimates. Further, although it is clearly not possible to describe any one zone as "unsatisfactorily" estimated by such a method (since the true zonal proportions are never known), there are useful insights to be gained about which factors need to be included in an unsatisfactory model by looking for common factors across the set of worst-fitting zones. By way of illustration, Table 4 sets out the zonal standardised residuals corresponding to the sequence of car-ownership models described in Section 3.
In this case, the conclusions to be drawn from the level of zonal fit agree with those derived earlier from considerations of overall fit; each additional variable improves the model, but even when they are all included, the model cannot satisfactorily account for zone to zone variability. On the most highly specified model, we find 13% of zones outside the (-2, +2) range for standardised residuals, compared with the expected 5% for a "correct" model.
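The standardised-residual screen can be coded directly. In the sketch below (illustrative only: the zonal observations, sample sizes and model predictions are invented, and the estimation variance is simply assumed), each zone's residual is the observed-minus-predicted proportion divided by the square root of the summed sampling and estimation variances, and the proportion of zones outside (-2, +2) is compared with the nominal 5%.

```python
import numpy as np

def standardised_residuals(obs_prop, pred_prop, n, pred_var):
    """SR_z = (obs - pred) / sqrt(sampling var + estimation var),
    with sampling variance taken as p(1-p)/n under the model."""
    sampling_var = pred_prop * (1.0 - pred_prop) / n
    return (obs_prop - pred_prop) / np.sqrt(sampling_var + pred_var)

rng = np.random.default_rng(2)
n_zones, n_hh = 60, 150
pred = rng.uniform(0.3, 0.8, n_zones)            # modelled zonal proportions
pred_var = np.full(n_zones, 1e-4)                # assumed estimation error
obs = rng.binomial(n_hh, pred) / n_hh            # data consistent with model

sr = standardised_residuals(obs, pred, n_hh, pred_var)
outside = np.mean(np.abs(sr) > 2.0)
print(round(outside * 100, 1), "% of zones outside (-2, +2)")
# For a correct model this should be close to the nominal 5%;
# a figure like the 13% in the text signals residual
# zone-to-zone variability.
```

Sorting the zones by |SR| and inspecting the worst cases is then the natural next step in the search for common factors across badly fitted zones.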
Table 3. An example of model comparison by FPR's

          Observed FPR's   Expected FPR's   var(FPR's)   s.e.(FPR's)
Model 1        88               79              27            5
Model 2        91               83              27            5
Table 4. Standardised residuals (SR) for zonal performance of the car-ownership models

                                              % of zones by level of SR
Model                                         SR < -2   SR > +2   Total
C   (constant only)                             n.a.      n.a.     n.a.
I   (income only)                                11         9       20
IA  (income & accessibility)                      9         6       15
IAB (income, accessibility & bus service)         7         7       13
7. CONCLUSIONS
The most striking virtue of the developing theory of disaggregate modelling is the rigour with which behavioural hypotheses have been translated into testable mathematical models. However, it must also be said that the level of statistical awareness that has been demonstrated in the application of these models has been consistently higher than that which has accompanied more conventional transportation models. In recent years it has become more generally recognised that a model can only be evaluated in terms of its statistical validity; it is in recognition of this fact that much of the work on the statistical properties of aggregate models, described in this paper, has been undertaken. As was remarked at the outset, the form of many of the models is common to both aggregate and disaggregate applications; so too is the fitting criterion, and hence also the basic form of the estimation problem. On the other hand, when the standard procedures have been applied to models using aggregate data, we have been brought up against a basic difficulty; we can often demonstrate quite clearly that the models are not exact, contrary to the assumption made during estimation. (Indeed, we would expect the gravity model to be a very poor representation of spatial interaction in some applications.) If we treat mis-specification as an additional source of error, we can avoid this difficulty in a number of ways; the effect is always to decrease the accuracy that we would claim for the model. In Section 5, we have made some tentative suggestions as to possible approaches that might be taken; however, we raise as an unresolved question the correct treatment of unavoidable mis-specification error in models of this kind. In view of this uncertainty, it is essential that practitioners take the question of goodness-of-fit and validation seriously, developing and applying the tests discussed in Sections 3 and 6 of this paper.
In addition, we emphasize the need to link the process of data collection with the statistical properties of the model, along the lines of the discussion in Section 4. While we are aware that in some places our treatment has perforce been somewhat superficial, it is to be hoped that the ideas in this paper will stimulate discussion, and lead to a strengthening of both theory and practice in the field of travel demand modelling.

REFERENCES
Bates J. J. (1971) A hard look at car ownership modelling. MAU Note 216, Mathematical Advisory Unit, Department of the Environment, London.
Bates J. J. (1979) Sample size and grouping in the estimation of disaggregate models--a simple case. Transportation 8, 347-369.
Bates J. J., Gunn H. F. and Roberts M. (1978) A model of household car ownership. Traffic Eng. Control 19 (11), 486-491; (12), 562-566.
Fienberg S. E. (1977) The Analysis of Cross-Classified Categorical Data. MIT Press, Cambridge, Mass.
Foerster J. F. (1979) Mode choice decision process models: a comparison of compensatory and non-compensatory structures. Transpn Res. 13A (1), 17-28.
Gaudry M. J. I. and Dagenais M. G. (1979) The DOGIT model. Transpn Res. 13B (2), 105-112.
Gunn H. F., Kirby H. R., Murchland J. D. and Whittaker J. C. (1980a) Regional Highways Traffic Model: trip distribution investigation. Department of Transport, London.
Gunn H. F., Kirby H. R., Murchland J. D. and Whittaker J. C. (1980b) The RHTM trip distribution investigation. Seminar P, Proc. 8th PTRC Summer Annual Meeting, University of Warwick, July 1980. PTRC Education and Research Services Ltd., London.
Gunn H. F., Mackie P. J. and Ortuzar J. D. (1980) Assessing the value of travel time savings--a feasibility study on Humberside. Working Paper 137, Institute for Transport Studies, University of Leeds, Leeds, England.
Gunn H. F. (1980) An allowance for bias in forecasts from logit models estimated by M.L. from aggregated data. Technical Note 36, Institute for Transport Studies, University of Leeds, Leeds, England.
Haskey J. C. (1973) Empirical methods for judging the efficacy of mathematical models for transportation studies. Session F2(i), Proc. PTRC Summer Annual Meeting, University of Warwick, July 1973. Planning and Transport Research and Computation Co. Ltd., London.
McFadden D. (1976) The theory and practice of disaggregate demand forecasting for various modes of urban transportation. U.S. Department of Transportation Seminar on Engineering Transportation Planning Methods.
McFadden D., Tye W. and Train K. (1976) Diagnostic tests for the independence from irrelevant alternatives property of the multinomial logit model. Working Paper No. 616, Urban Travel Demand Forecasting Project, Institute of Transportation Studies, University of California, Berkeley.
Murchland J. D. (1977) Convergence of gravity model balancing iterations. Session G24, Proc. 5th PTRC Summer Annual Meeting, University of Warwick, June 1977. PTRC Education and Research Services Ltd., London.
Ortuzar J. D. (1980) Modal choice modelling with correlated alternatives. Ph.D. thesis, Department of Civil Engineering, University of Leeds, Leeds, England.
Sikdar P. K. and Hutchinson B. G. (1981) Empirical studies of work trip distribution models. Transpn Res. 15A, 233-247.
Stopher P. R. (1975) Goodness of fit measures for transportation demand models. Transportation 4, 64-80.
Tardiff T. J. (1976) A note on goodness of fit statistics for probit and logit models. Transportation 5, 377-388.
Tardiff T. J. (1979) Specification analysis for quantal choice models. Transpn Sci. 13 (3), 179-190.
Van Est J. and van Setten J. (1977) Calibration methods: applications of some spatial distribution models. Working Paper 6, Research Center for Urban Planning, Delft, Netherlands.
Van Est J. and van Setten J. (1979) Least squares procedures for a multiplicative spatial distribution model. In New Developments in Modelling Travel Demand and Urban Systems (Edited by G. R. M. Jansen, P. H. L. Bovy, J. van Est and F. le Clercq). Saxon House, Farnborough, England.
Williams H. C. W. L. and Ortuzar J. D. (1979) Behavioural travel theories, model specification and the response error problem. Session M17, Proc. 7th PTRC Summer Annual Meeting, University of Warwick, July 1979. PTRC Education and Research Services Ltd., London.
APPENDIX 1
The "dogit" model, as stated by Gaudry and Dagenais (1979) for the general case of choice between N alternatives, relates the probability of selecting alternative i to the "utility" U_i of that alternative and the "utilities" of all other alternatives as

pr(i) = [exp(U_i) + θ_i Σ_{k=1}^{N} exp(U_k)] / [(1 + Σ_{k=1}^{N} θ_k) Σ_{k=1}^{N} exp(U_k)],

where the θ_i are non-negative constants which increase in proportion with the degree of "captivity" to alternative i. If N = 2, and θ_1 = 0 and θ_2 = (1/S_1 − 1), where S_1 is a constant, 0 < S_1 < 1, then we can write

pr(1) = exp(U_1) / [(1 + θ_2)(exp(U_1) + exp(U_2))]

or

pr(1) = S_1 / (1 + exp(U_2 − U_1)),

which is the form in which the conventional car ownership models deal with "captivity".

APPENDIX 2
Consider a binary logit model of car ownership, say

p(a, X_i) = 1 / (1 + exp(a X_i)) = p_i,

fitted to N data points at which we observe r_i households choosing to be car-owning out of n_i in the sample, all with income level X_i; a is an unknown constant, to be estimated. The log-likelihood may be written down as

l_1 = Σ_{i=1}^{N} [r_i log p_i + (n_i − r_i) log(1 − p_i)],

and the ML estimate of a, a* say, is found to solve

∂l_1/∂a |_{a=a*} = Σ_{i=1}^{N} X_i (r_i − n_i p_i) |_{a=a*} = 0,

since ∂p_i/∂a = −X_i p_i (1 − p_i). The estimate of the variance of a* is given by

V_ML = [E(−∂²l_1/∂a²)]^{−1} = [Σ_{i=1}^{N} X_i² n_i p_i (1 − p_i)]^{−1}.

The weighted least squares (WLS) approach can be interpreted as a ML procedure in which the variables {(1/W_i)(r_i/n_i)}, i = 1, . . . , N, are modelled by {p_i/W_i}; it is assumed that (1/W_i)(r_i/n_i) is independently normally distributed with mean p_i/W_i and variance σ² (for all i). Thus the WLS estimate of a, â say, is given by maximising

l_2 = −N log σ − (1/2σ²) Σ_{i=1}^{N} (1/W_i²)(r_i/n_i − p_i)²

with respect to a, so that â solves

Σ_{i=1}^{N} (1/W_i²)(r_i/n_i − p_i)(∂p_i/∂a) = 0.

If the W_i are defined to be equal to the square root of the expected sampling variance corresponding to n_i and "true" proportion p_i (a process which will involve an iterative scheme to update the weights), then W_i² = p_i (1 − p_i)/n_i, so that â is derived as the solution of

Σ_{i=1}^{N} X_i (r_i − n_i p_i) = 0.

Thus we will obtain a* = â. The variance of â, V_WLS say, is given by

V_WLS = σ̂² [Σ_{i=1}^{N} (1/W_i²)(∂p_i/∂a)²]^{−1} = σ̂² [Σ_{i=1}^{N} X_i² n_i p_i (1 − p_i)]^{−1} = σ̂² V_ML,

where σ̂² is given as the solution of ∂l_2/∂σ = 0, namely

σ̂² = (1/N) Σ_{i=1}^{N} (1/W_i²)(r_i/n_i − p_i)².

(Note that E(∂²l_2/∂a∂σ) = 0.) If the observed scatter of the (r_i/n_i) over all the modelled p_i is consistent with random sampling and an "exact" model, then

E(σ̂²) = (1/N) Σ_{i=1}^{N} (1/W_i²) var(r_i/n_i) = (1/N) Σ_{i=1}^{N} 1 = 1.

Thus V_ML is the expected value of V_WLS. If the scatter is too great to be accounted for in this way, we would expect V_WLS to be larger than V_ML.
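The two-alternative reduction in Appendix 1 can be verified numerically. The sketch below (utility values and S_1 invented for illustration) evaluates the dogit probability directly and via the S_1 form, confirming that with θ_1 = 0 and θ_2 = 1/S_1 − 1 the two expressions coincide.

```python
import math

def dogit_p1(U1, U2, theta1, theta2):
    """Dogit choice probability for alternative 1 in the N = 2 case."""
    denom_sum = math.exp(U1) + math.exp(U2)
    return (math.exp(U1) + theta1 * denom_sum) / ((1.0 + theta1 + theta2) * denom_sum)

S1 = 0.7                       # illustrative "non-captive" share
theta2 = 1.0 / S1 - 1.0
for U1, U2 in [(0.0, 0.0), (1.0, -0.5), (-2.0, 1.5)]:
    lhs = dogit_p1(U1, U2, 0.0, theta2)
    rhs = S1 / (1.0 + math.exp(U2 - U1))
    print(round(lhs, 6), round(rhs, 6))
# The two columns agree: captivity to alternative 2 simply scales
# the binary logit probability of alternative 1 by S_1.
```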