Transpn Res., Vol. 10, pp. 343-345.
Pergamon Press 1976. Printed in Great Britain
STATISTICAL NOTES ON THE EVALUATION OF CALIBRATED GRAVITY MODELS

S. R. WILSON
Department of Statistics, The Australian National University, Canberra, Australia

(Received 1 July 1975; in revised form 1 January 1976)

Abstract-This paper discusses theoretically the statistical tests and measures that are commonly used to evaluate the accuracy of the calibration of gravity models. It is shown that an inappropriate treatment of trip matrices as a certain type of contingency table has yielded erroneous interpretations. Also one of the most commonly used techniques in this context, the correlation coefficient, is inappropriate. Finally a further method of analysis involving the likelihood is proposed.
INTRODUCTION

In transportation planning a considerable amount of research has been devoted to determining the calibration of appropriate gravity models, by employing in the analysis different model constraints, deterrence functions or stratification procedures and various statistical considerations (as, for example, Evans (1971), Batty and Mackie (1972), Kirby (1974)). The calibration of gravity models is equivalent, when considered in the general modelling framework, to the Systems Analysis and Model Verification stages (using the categorization given by Mihram (1972)). An aspect which has received less attention is the next stage, namely that of Model Validation. Model validation in this context is concerned with the evaluation of the accuracy of the models which arise from the calibration procedure. This is of considerable importance for the applied researcher who is using gravity models. Theoretical problems have arisen in the selection of a suitable statistical measure to evaluate the accuracy of the calibrated models, and abuse of some statistical techniques has occurred. A summary of the methods commonly used, with references, is given by Black and Salter (1975).

The main purpose of this paper is to discuss theoretically these commonly used statistical tests and measures, showing which are appropriate in this context and which are not, and to propose a further method of general applicability.

There are two distinct outputs from the models: one is an origin-destination matrix of synthesised trips; the other is a frequency distribution of the lengths of these trips. In this paper it will first be shown that the approach of inappropriately treating the trip matrices as contingency tables has yielded erroneous interpretations. Then the various statistical approaches for analysing both trip matrices and trip lengths will be discussed, and it will be shown that one of the most commonly used statistical techniques, the correlation coefficient, is not only inappropriate and meaningless, but also is (unpredictably) incorrect as a relative measure. Finally an alternative approach involving the likelihood is discussed.

THE INTERPRETATION OF ORIGIN-DESTINATION MATRICES

Considerable confusion in the evaluation of the accuracy of calibrated models appears to have arisen through regarding the origin-destination matrices as contingency tables (Evans (1971)) and inappropriately applying some test statistics (which are suitable only for certain types of contingency tables) to the matrices. The effect of this confusion will be apparent in the next section when we consider some of the measures for comparing trip matrices. Now a contingency table arises when we have two distinct variables each of which is classified into two or more categories. The major purpose of such a treatment is mainly to measure the association between the two variables, although for clarity of presentation, classifications of units with respect to discrete variables are often presented in contingency table form. Now for the origin-destination matrix, one of the variables refers to the place of origin (Production variable), the other variable to the place of destination (Attraction variable), and obviously we are not interested in the correlation between these two variables! The tabular format is merely a convenient form of presentation of the data. Yet the comparative values obtained for these tables have been used as if they were a measure of association between the variables.

For these matrices the observations should be analysed in a univariate (not bivariate) framework. There are at least 3 possible representations. One is when we consider the matrix across the rows and the variable is the number of trips out of each zone, one appropriate model being the production-constrained gravity model. Another is when we consider the matrix down the columns and the variable is the number of trips into each zone, and the model could be an attraction-constrained gravity model. A third representation occurs from considering the individual cells of the matrix and the variable is the number of trips between each of the zones. An appropriate model could be the fully-constrained gravity model.

STATISTICAL METHODS TO EVALUATE ACCURACIES

The following statistics have been proposed to evaluate the accuracy of the calibrated gravity models and these will be individually discussed in the following: $\chi^2$-test; contingency coefficient; Kolmogorov-Smirnov test; Root-Mean-Square test; correlation coefficient.

The $\chi^2$ test is a standard statistical test of goodness-of-fit (Kendall and Stuart (1967)) used in the evaluation of
gravity models to test between the observed (survey) frequencies for the number of trips and the expected (calibrated model) frequencies (or the appropriate values as in the trip-length distribution). From the appropriate tables the probability of obtaining that value of $\chi^2$ (or larger) can be determined. To compare two different models the values of the probabilities can be compared, yielding a relative assessment of the two models. This use of $\chi^2$ as a relative measure has not been realised. (Also geographically neighbouring zones should be aggregated, where appropriate, to ensure there are sufficient observations in the cells. The model matrix must be calculated for the aggregated zones.) The high $\chi^2$ values determined (see, for example, Black and Salter (1975)) indicate that the present trip-distribution modelling procedure needs refinement.

The use of the contingency coefficient is incorrect for evaluating the accuracy of the calibrated trip-distribution models, and seems to have arisen due to an incorrect application of test statistics to trip matrices as if they were a certain type of contingency table, as well as due to confusion over the interpretation of $\chi^2$. First, the $\chi^2$ value obtained above for trip matrices is used to determine the goodness-of-fit of the model trip matrix to the survey trip matrix. This is the value that has been incorrectly used to determine the contingency coefficient. However the $\chi^2$ value which should be used is one derived from testing the hypothesis of independence when we have observations classified according to two variables, and the coefficient measures the dependence between the variables. However, as was discussed above, this representation of the trip matrix is not applicable, and hence the coefficient is meaningless.

The Kolmogorov-Smirnov test is a standard non-parametric test, used to test between the observed and model trip-length frequencies. Similarly to the $\chi^2$ it can be used as a relative measure by comparing probability levels. (A discussion of the comparison between the Kolmogorov-Smirnov test and the $\chi^2$-test is given in Kendall and Stuart (1967).) Also this test could be adapted to trip matrices.

The Root-Mean-Square (RMS) "test" is defined by

$$\mathrm{RMS} = \Big[ \sum_j (O_j - E_j)^2 / n \Big]^{1/2}$$

(where $O_j$ and $E_j$ are the $j$th survey and model values respectively and $n$ is the total number of categories). This is the "test" usually given for comparing trip-length frequencies. It can also be used for trip matrices. However the RMS is not a test but merely a measure, based on Euclidean distance. No significance levels can be attached to the values obtained. Also this measure should be used with caution as it is sensitive to large deviations of the frequencies from the mean. An alternative measure (related to the $\chi^2$ statistic) that would compensate for this effect is

$$[n^{-1} X^2]^{1/2} = \Big[ \sum_j (O_j - E_j)^2 / (n E_j) \Big]^{1/2}.$$

The most commonly used statistic to evaluate the accuracies of survey and model trip matrices is the correlation coefficient. As it is used in this context, it is evaluated between the two sets of observed and expected frequencies. However its use is completely erroneous since the correlation coefficient measures the degree of linear dependence between two random variables. Here $O_j$ is a random variable, but $E_j$ is not. $E_j$ is a non-random variable, determined by $O_j$ and other constants predetermined by the distances between the zones. Thus one would expect high values of $R$, which are meaningless. Also $R$ cannot be used as a relative measure. Nor is it possible to readily determine under what conditions, nor the proportion of times, $R$ will, or will not, make a correct relative assessment, due to the type of dependence of $E_j$ on $O_j$, and the mathematical intractability of the relationship between $R$ and a reliable measure such as the probability levels obtained from the $\chi^2$ statistic.

However it is easy to construct examples where $R$ does not make correct relative assessments. The following is a simple example to demonstrate this. Suppose we have just 3 centres, and have observed the following origin-destination matrix:

    15  10  10
    15  20  15
    80  85  90

Suppose also that we have two different calibrated gravity models to compare (for simplicity, production-constrained) yielding the two following matrices:

    Model I          Model II
    25   5   5       15  10  10
    10  30  10       15  20  15
    80  85  90       80  70 105

For Model I we have $\chi^2_6 = 22.58$ and $R = 0.98$, and for Model II $\chi^2_6 = 5.70$ and $R = 0.96$. So here Model II is a good fit to the data while Model I is not (at the 5% level of significance), but comparison of $R$ values would have given us a preference for Model I.

LIKELIHOOD METHODS OF EVALUATION

One approach to the relative evaluation of the accuracy of the calibration of gravity models that has not been considered so far is an analysis involving the likelihood. The likelihood, $L(H \mid R)$, of the hypothesis $H$ given data $R$ and a specific model is proportional to $P(R \mid H)$ (the probability of obtaining results $R$ given the hypothesis $H$, according to the probability model), the constant of proportionality being arbitrary. In this context, to compare the survey and model (hypothesis) trip matrices, the simplest and most appropriate probability model is the multinomial distribution. Let $p_{ij}$ represent the probability of a trip
between origin $O_i$ and destination $D_j$, and suppose we observe $n_{ij}$ trips between these zones and there are $n$ zones. The likelihood, under the multinomial model, of this outcome is

$$L(\{p_{ij}\}) = C \prod_{i,j} p_{ij}^{\,n_{ij}}$$

where $C$ is the constant of proportionality. The most plausible, or best supported, values of $\{p_{ij}\}$ are those that maximise $L$. In this case, the maximum of the likelihood is

$$L(\{\hat{p}_{ij}\}) = C \prod_{i,j} \hat{p}_{ij}^{\,n_{ij}}$$

where $\hat{p}_{ij} = n_{ij}/N$ and $N = \sum_{i,j} n_{ij}$.

Suppose our model (hypothesis) of the trip matrix gives us an expected probability $\tilde{p}_{ij}$ of a trip between $O_i$ and $D_j$. Then the likelihood of this outcome is

$$L(\{\tilde{p}_{ij}\}) = C \prod_{i,j} \tilde{p}_{ij}^{\,n_{ij}}.$$

A measure of the plausibility for the model having generated the observed data is the relative likelihood

$$R(\{\tilde{p}_{ij}\}) = \frac{L(\{\tilde{p}_{ij}\})}{L(\{\hat{p}_{ij}\})} = \prod_{i,j} \left( \frac{\tilde{p}_{ij}}{\hat{p}_{ij}} \right)^{n_{ij}}.$$

So the values $\{\tilde{p}_{ij}\}$ make the observed data approximately $R(\{\tilde{p}_{ij}\})$ $(0 \le R \le 1)$ as plausible as do the maximum likelihood estimates $\{\hat{p}_{ij}\}$. Hence the values of $R$ calculated for various models give us a comparable measure for determining the plausibility for the data having arisen from each of the models. Considering the two above very simple hypothetical examples, we obtain the plausibility of Model I is $3.6 \times 10^{-5}$ while that of Model II is $7.5 \times 10^{-2}$. Hence it is more plausible that the data arose from Model II rather than Model I.

An alternative but analogous approach to the above is Edwards' (1972) Method of Support. The support function, $S$, is defined to be the logarithm of the likelihood, and in this case

$$S(\{p_{ij}\}) = \log C + \sum_{i,j} n_{ij} \log p_{ij}.$$

The number of units of support that the model generating the set of values $\{\tilde{p}_{ij}\}$ is from the maximum of the likelihood is given by

$$S(\{\hat{p}_{ij}\}) - S(\{\tilde{p}_{ij}\}) = \sum_{i,j} n_{ij} \log (\hat{p}_{ij} / \tilde{p}_{ij}) = -\log R(\{\tilde{p}_{ij}\}).$$

For the above examples we have that Model II lies 2.59 units of support below the maximum, while Model I lies 10.25 units below it, and hence Model II is better supported by the data than Model I. When comparing a group of models an $m$-unit support region can be determined. It is that region in the parameter space bounded by the curve on which the support is $m$ units less than the maximum. All models which fall within this support region could be accepted as having generated the data. (For example, Edwards (1972) showed that for the case of a sample drawn from a normal distribution with unknown mean and known variance, the 2-unit support region corresponded to those values which fell within 2 standard deviations of the estimated mean.)

The main advantages of this type of approach are first that the distributional assumptions required are minimal, and second that it is not a statistical routine that can be readily applied without thought to produce a "significant" or "non-significant" result. Also the approach naturally follows on from the maximum likelihood procedures presently being considered for the calibration of gravity models (Batty and Mackie (1972), Kirby (1974)).
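The relative likelihood and the units of support for the three-centre example can be evaluated directly from these definitions. The following Python sketch is illustrative only and is not part of the original analysis. The function names are hypothetical, natural logarithms are assumed, and each model probability is assumed to be the model cell value divided by the model total.

```python
import math

# Observed matrix and the two model matrices of the three-centre example
# (rows are origins, columns are destinations).
OBSERVED = [[15, 10, 10], [15, 20, 15], [80, 85, 90]]
MODEL_I = [[25, 5, 5], [10, 30, 10], [80, 85, 90]]
MODEL_II = [[15, 10, 10], [15, 20, 15], [80, 70, 105]]


def support_deficit(observed, model):
    """Units of support below the maximum (hypothetical helper name).

    Computes the sum over cells of n_ij * log(p_hat_ij / p_tilde_ij), where
    p_hat_ij = n_ij / N is the maximum likelihood estimate and p_tilde_ij is
    assumed to be the model cell value divided by the model total.
    """
    n_total = sum(sum(row) for row in observed)
    e_total = sum(sum(row) for row in model)
    deficit = 0.0
    for obs_row, mod_row in zip(observed, model):
        for n_ij, e_ij in zip(obs_row, mod_row):
            if n_ij == 0:
                continue  # a cell with no observed trips contributes nothing
            if e_ij <= 0:
                raise ValueError("model gives zero probability to an observed trip")
            p_hat = n_ij / n_total
            p_tilde = e_ij / e_total
            deficit += n_ij * math.log(p_hat / p_tilde)
    return deficit


def relative_likelihood(observed, model):
    """Relative likelihood R, using the identity -log R = support deficit."""
    return math.exp(-support_deficit(observed, model))


if __name__ == "__main__":
    for name, model in (("Model I", MODEL_I), ("Model II", MODEL_II)):
        d = support_deficit(OBSERVED, model)
        r = relative_likelihood(OBSERVED, model)
        print(f"{name}: units below maximum = {d:.2f}, relative likelihood = {r:.2e}")
```

Working with the support deficit first and exponentiating afterwards avoids forming the product of many small probabilities, which can underflow for large trip matrices.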
DISCUSSION
Of the statistics that have been used to determine the goodness of fit of various calibrated gravity models to observed inter-zonal trip frequencies, the most appropriate, $\chi^2$, has been found to produce extremely high values. This suggests that the modelling procedure needs vast refinement (Black and Salter, 1975). From the above discussion on the interpretation of origin-destination matrices, it follows that greater emphasis should be placed on modelling in a univariate framework. The random variable involved can be considered for example as a "Production" variable, or alternatively as an "Attraction" variable, or perhaps as an "Inter-zonal trip production" variable. Another possibility is the application of Monte Carlo simulation techniques.
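The disagreement between $\chi^2$ and the correlation coefficient in the three-centre example can be checked numerically. The Python sketch below is a minimal illustration under stated assumptions, not the computation used in the paper. The function names are hypothetical, the correlation is assumed to be the ordinary product-moment coefficient taken over the nine cells, and the 5% point for 6 degrees of freedom (12.592) is taken from standard tables.

```python
import math

# Cell values of the three-centre example, read row by row.
OBSERVED = [15, 10, 10, 15, 20, 15, 80, 85, 90]
MODEL_I = [25, 5, 5, 10, 30, 10, 80, 85, 90]
MODEL_II = [15, 10, 10, 15, 20, 15, 80, 70, 105]

# Upper 5% point of the chi-square distribution with 6 degrees of freedom.
CHI2_CRITICAL_6DF_5PCT = 12.592


def chi_square(observed, expected):
    """Goodness-of-fit statistic: sum of (O_j - E_j)^2 / E_j."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def correlation(x, y):
    """Product-moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def rms(observed, expected):
    """Root-mean-square deviation: [sum (O_j - E_j)^2 / n]^(1/2)."""
    n = len(observed)
    return math.sqrt(sum((o - e) ** 2 for o, e in zip(observed, expected)) / n)


def scaled_rms(observed, expected):
    """Chi-square related alternative: [sum (O_j - E_j)^2 / (n E_j)]^(1/2)."""
    return math.sqrt(chi_square(observed, expected) / len(observed))


if __name__ == "__main__":
    for name, model in (("Model I", MODEL_I), ("Model II", MODEL_II)):
        x2 = chi_square(OBSERVED, model)
        verdict = "rejected" if x2 > CHI2_CRITICAL_6DF_5PCT else "not rejected"
        print(f"{name}: chi-square = {x2:.2f} ({verdict} at 5%), "
              f"R = {correlation(OBSERVED, model):.3f}, "
              f"RMS = {rms(OBSERVED, model):.2f}, "
              f"scaled RMS = {scaled_rms(OBSERVED, model):.2f}")
```

In this sketch the ordering by $R$ favours Model I, while the comparison of $\chi^2$ with the tabulated 5% point favours Model II, which is the disagreement described for this example.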
REFERENCES

Batty M. and Mackie S. (1972) The calibration of gravity, entropy, and related models of spatial interaction. Environment and Planning 4, 205-233.
Black I. A. and Salter R. J. (1975) A statistical evaluation of the accuracy of a family of gravity models. Proc. Institution of Civil Engineers 59, 1-20.
Edwards A. W. F. (1972) Likelihood. Cambridge University Press.
Evans A. W. (1971) The calibration of trip distribution models with exponential or similar cost functions. Transpn Res. 5, 15-38.
Kendall M. G. and Stuart A. (1967) The Advanced Theory of Statistics, Vol. 2, 2nd Edn. Charles Griffin, London.
Kirby H. R. (1974) Theoretical requirements for calibrating gravity models. Transpn Res. 8, 97-104.
Mihram G. A. (1972) Simulation: Statistical Foundations and Methodology. Academic Press, New York.