Accid. Anal. & Prev. Vol. 24, No. 3, pp. 275-305, 1992
Printed in Great Britain.
1992 Pergamon Press Ltd.
THE VARIATION OF DRIVERS’ ACCIDENT RATES BETWEEN DRIVERS AND OVER TIME

E. M. HOLROYD
Transport and Road Research Laboratory, Department of Transport, Crowthorne, Berkshire, RG11 6AU, England
Abstract. Accident records of individual drivers in eight large samples from the general driving population have been analysed as nonhomogeneous Poisson processes, i.e. by assuming that each driver’s accidents occur randomly with an individual underlying accident rate that varies in time. Previous hypotheses about the form of this variation are reviewed. The moments of the distribution over drivers of individual underlying accident rates averaged over periods of one or more years are estimated; they indicate that most of the distributions are intermediate in form between the gamma and lognormal distributions. The results suggest the presence of an overall annual factor that multiplies all the individual rates; when this is removed the remaining process is approximately stationary (i.e. the mean and standard deviation are approximately constant in time). The coefficient of variation of individual accident rates falls slowly as the length of the period over which they are averaged increases, indicating positive correlation between the underlying accident rates of the same drivers in different periods. These correlations are estimated directly from the accident frequency data, which appears to be a new and useful method of analysis. For one sample, which spans 14 years, the correlation between two adjacent years is about 0.74 and falls approximately linearly to about 0.49 for two years separated by a gap of 12 years. A “switching process” with differing individual means, previously proposed for drivers’ accidents by Bartlett, does not appear to fit these results well. A “scaling process,” studied by Mandelbrot and previously found to apply in a number of different fields, provides a better fit to the data and could represent the combined effect of a hierarchy of switching processes operating on different time scales, in accordance with what is known or conjectured about the variation of factors affecting accident rates.
Although the data analysed are affected by variation in exposure between drivers, the methods and ideas used have implications that may help to resolve the long-standing dispute about individual accident proneness.
1. INTRODUCTION
The work of the Behavioural Studies Unit of the Transport and Road Research Laboratory in Crowthorne, England, includes a programme of research into individual differences in accident liability of road users (Maycock 1985). As part of this programme it was decided to examine previous research and if necessary reanalyse existing sets of data to answer the questions:

1. How much does accident liability differ between drivers over the general driving population?
2. How consistent are these differences over time?

It was thought that the answers to these questions should provide an essential background to the rest of the work on individual differences in accident liability. In the first place, measuring the size of the differences would give an indication of their contribution to the general problem of road accidents, and measuring their consistency over time would show whether it was worth while trying to identify those drivers with high accident liability as a preliminary to taking remedial action. In the second place, the answers to the questions should provide some clues in the search for the factors responsible for the variation in accident liability by indicating the size of the variation that requires explanation and the extent to which it corresponds to stable or transitory factors.

The term “accident liability,” which is used in the above questions, refers to the tendency of individual drivers to have accidents; this will be influenced by various factors, including exposure to risk. Since the occurrence of accidents is essentially uncertain, it is natural to consider a probability model in which an individual’s accidents occur “randomly” with a certain underlying rate per unit time (which is not assumed to be constant in time). The number of accidents an individual has in a certain time provides an estimate of his underlying accident rate; but when this rate is low, as it is for road accidents, the element of chance is relatively high and the estimate of the individual’s rate provided by the actual number of accidents is therefore very unreliable. Nevertheless, the idea of an individual’s underlying accident rate, which determines his “expected” number of accidents, is important and provides the key to statistical modelling of accident data.

The rates of individual drivers may be envisaged as varying in time as shown in Fig. 1. If this diagram represents a period of several years, the rates as shown would have been smoothed or averaged in some way and a more detailed examination of a shorter time period would reveal more short-term fluctuation. A diagram covering a period of several days would be dominated by the alternation of periods of driving and non-driving (with zero accident rate), while a closer look at, for example, an hour’s driving would reveal variations due to stopping and starting, traffic, and road conditions. In fact, the data to be analysed in this paper all relate to periods of a year or more, so little can be deduced about variation over shorter periods.

Effect of exposure
The individual’s underlying accident rate, or expected number of accidents per unit time (which we shall usually refer to simply as his “accident rate” or just “rate”), is a measure of that individual’s “accident liability.” It is clear that the accident rate averaged over a period of a year, for example, will be affected by the amount of driving the individual does in that period, and also by the conditions (type of road, traffic, weather, day/night) in which he drives, in other words, by his “exposure.” For some purposes, it would be more relevant to study variations in individual accident rates in standard conditions of exposure or after allowance has been made for the effect of variation in exposure between individuals. This would presumably be more closely related to the individual’s driving behaviour and personality and has usually been called “accident proneness.” This phrase has sometimes been taken to imply a fixed lifelong characteristic of the individual and sometimes that accident-prone drivers form a distinct group, but neither implication is intended here.

Fig. 1. The variation of individual accident rates in time.

There are both practical and theoretical difficulties in making allowance for the effects of varying exposure in the way suggested. The practical problems arise from the difficulty of obtaining reliable data for the total mileage driven by individuals and even more so for the mileage driven in different conditions. The theoretical problems arise in deciding where to draw the line between “proneness” and “exposure” factors. For example, if a driver habitually drives after drinking, is this to be regarded as an exposure factor? Moreover, if one is concerned simply with predicting which drivers will have accidents, it may be unnecessary to eliminate the effect of exposure. For these reasons it was decided not to attempt in this paper to separate the exposure factor: we shall be concerned with accident liability, not accident proneness. If it is possible to arrive at an understanding of the pattern of variation of accident liability, this will provide a good basis for future work on the separate contributions of proneness and exposure.

Previous work
An examination of the existing literature showed that there was a good deal of data available of the type required, mainly for American states, giving the number of accidents incurred in one or more periods by each driver in a large sample from the general driving population. Much of the analysis of these data, however, has been confined to an examination of the actual numbers of accidents and the fitting of standard distributions to them, rather than attempting to reveal the underlying process.

The very extensive literature on accident proneness contains models and methods of analysis that can be applied to the existing data on drivers’ accidents to answer the first of the questions asked above by estimating the distribution of the underlying individual accident rates over the driving population. However, the literature does not appear to provide satisfactory models or methods to measure the degree of consistency of individual accident rates over time (with the exception of Bartlett [1986]; see Section 2). This is an astonishing omission and seems to reflect the curious way in which the subject of accident proneness has developed, as a controversy with political and moral overtones rather than a dispassionate scientific study. It may also reflect an increasing degree of specialization as a result of which accident researchers and theoretical statisticians have tended to lose contact with each other.

In a valuable review of statistical work on accident proneness Kemp (1970) drew the following conclusions:
(1) from a practical point of view . . . the concept of accident proneness has proved singularly ineffectual.
(2) from the standpoint of statistical theory, however, it has been very fruitful.
Writing as a theoretical statistician, Kemp was able to regard this situation with some satisfaction. From the point of view of the accident researcher, however, it is much less satisfactory, and in the view of the present writer, the practical failure of the idea of accident proneness and the consequent decline of interest in it may well stem partly from the failure of the statisticians to provide the appropriate kind of understanding of the processes involved for the accident researchers to use. Statisticians have tended to devote their efforts to inventing new distributions that can be fitted to the data without necessarily illuminating them, rather than starting from the data and trying to deduce the structure of the mechanisms that could have given rise to them. The subject would benefit from a renewal of the close contact between the practical and theoretical sides that existed in the early days of accident research. It is hoped that the present paper may go a little way towards bridging the gap.
2. PREVIOUS MODELS FOR VARIATION OF ACCIDENT RATES IN TIME
The Poisson process

It was said in the previous section that there appeared to be no appropriate models in the literature on accident proneness that would enable the degree of consistency of individual accident rates over time to be measured, with the exception of those in Bartlett (1986). The present paper is partly an attempt to extend Bartlett’s approach by applying his methods to other data and to some extent developing more intuitive concepts and measures. In order to put Bartlett’s contribution into perspective it will be necessary to give some account of previous models, with particular emphasis on their treatment of variation in time of individual accident rates. As a preliminary, we shall give a brief description of the Poisson process and some of its properties.

All the models to be described are based on the Poisson process, which describes the random occurrence of events in time at a constant rate, $\theta$ say. The probability of an event occurring in a short interval of length $dt$ is $\theta\,dt$. If the number of events in an interval of length $T$ is $x$, then $x$ has a Poisson distribution with mean $\lambda = T\theta$. Thus the probability that $x = r$ is $e^{-\lambda}\lambda^{r}/r!$ and the variance of the distribution is $\lambda$.

A property of the Poisson distribution which we shall need to use is that the sum of a number of independent Poisson variates is itself a Poisson variate, with mean equal to the sum of the means of the components. This has an important consequence for a “nonhomogeneous Poisson process,” i.e. a Poisson process for which the rate $\theta$ changes with time. Suppose that a period of length $T$ is divided into sub-periods of length $T_j$ ($j = 1, 2, 3, \ldots$) in which the rates are $\theta_j$ and the numbers of events $x_j$. Then the total number of events $x = \sum x_j$ has a Poisson distribution with mean $\sum T_j\theta_j$. Thus the distribution of $x$ is exactly the same as if it resulted from a homogeneous Poisson process of rate $\sum T_j\theta_j / T$ operating for time $T$.
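The additive property can be checked numerically. The following is a minimal sketch, not part of the original analysis: the function names and the three sub-period lengths and rates are hypothetical, chosen only for illustration. It builds the distribution of the total count by convolving the Poisson distributions of the sub-periods and compares it with a single Poisson distribution whose mean is $\sum T_j\theta_j$.

```python
import math

def poisson_pmf(mean, k_max):
    """Return [P(x=0), ..., P(x=k_max)] for a Poisson variate with the given mean."""
    return [math.exp(-mean) * mean**k / math.factorial(k) for k in range(k_max + 1)]

def convolve(p, q):
    """pmf of the sum of two independent counts, kept for totals 0..len(p)-1."""
    return [sum(p[i] * q[k - i] for i in range(k + 1)) for k in range(len(p))]

def total_count_pmf(lengths, rates, k_max=20):
    """pmf of the total count over sub-periods of length T_j with rate theta_j."""
    total = [1.0] + [0.0] * k_max  # no events before the first sub-period
    for t_j, theta_j in zip(lengths, rates):
        total = convolve(total, poisson_pmf(t_j * theta_j, k_max))
    return total

def equivalent_rate(lengths, rates):
    """Time-weighted average rate, sum(T_j * theta_j) / T."""
    return sum(t * r for t, r in zip(lengths, rates)) / sum(lengths)

# Hypothetical record: 0.5 years at rate 0.02, 1.5 years at 0.08, 1 year at 0.05.
lengths = [0.5, 1.5, 1.0]
rates = [0.02, 0.08, 0.05]
theta_bar = equivalent_rate(lengths, rates)
pieced = total_count_pmf(lengths, rates)
homogeneous = poisson_pmf(theta_bar * sum(lengths), 20)
max_gap = max(abs(a - b) for a, b in zip(pieced, homogeneous))
```

Under these assumed values the two distributions agree to within floating-point rounding, so the total count depends on the sub-period rates only through their time-weighted average.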
In other words, as far as the distribution of the total number of events is concerned, a nonhomogeneous Poisson process is equivalent to a homogeneous Poisson process with a rate that is the average of the component rates weighted by the lengths of time for which they apply. The time intervals between events will of course have different distributions for the two processes.

There is another way of combining Poisson processes that we shall also need to use. If the accidents of the ith individual form a (homogeneous) Poisson process with rate $\theta_i$, and if in a period of length $T$ this individual has $x_i$ accidents, with mean $\lambda_i = T\theta_i$, then the distribution of $x$ over the population of individuals is called a mixture of Poisson distributions, the distribution of $\theta$ over individuals being known as the mixing distribution. The distribution of $x$ will not be a Poisson distribution (unless all the $\theta$s are equal), and its variance will be greater than its mean. The distribution of $x$ is determined by that of $\theta$, and the moments of the $\theta$-distribution may be estimated from those of the $x$-distribution (see Section 3). It follows from the previous paragraph that if we use this estimation procedure in a situation where the accident rates of individuals are varying throughout the period of observations, we shall be estimating the moments of the distribution of the individual rates $\theta$ averaged over this period.

Previous hypotheses

It is now possible to describe the main hypotheses for the variation in time of individual accident rates that have been proposed in the literature. These are listed below and illustrated in Fig. 2. Each of the types listed includes a range of specific models obtained by specifying the parameters of the process of change and the assumptions about variation between individuals. It should be noted that these hypotheses and models were generally not intended to cover trends in accident rates that affect all individuals (due to weather, legislation, etc.)
or changes related to the increase in age and experience of individuals. These factors are probably more important for road accidents than for the factory accidents on which most of the early work was based, and will be discussed in more detail later.

H1. Constant rates (Greenwood and Yule 1920)
Accident rates $\theta$ are constant in time: either the same for all individuals (“chance”) or varying between individuals (“proneness”).

H2. Contagion models (Greenwood and Woods 1919)
An individual’s accident rate depends only on the number of accidents the individual concerned has had, i.e. $\theta$ has the value $\theta_0$ for each individual initially and changes to $\theta_1$ after his or her first accident, and so on. (The changes could be increases or decreases.)

H3. Spells models (Cresswell and Froggatt 1963)
The accident rate $\theta$ of all individuals has the value $\theta_0$ except during short spells occurring randomly and independently for each individual, during which $\theta = \theta_1$ ($> \theta_0$).

H4. Switching process (Bartlett 1986)
The accident rate $\theta$ switches at random times to a new value randomly chosen from a specified distribution.

H5. Random walk (Miller and Schuster 1983)
The accident rate $\theta$ increases or decreases at random times by amounts chosen randomly from a specified distribution.

Fig. 2. Five possible types of variation for individual accident rates. (Panels: H1 Constant; H2 Contagion, with accidents marked; H3 Spells; H4 Switching process; H5 Random walk.)

We shall now discuss the origins of these five hypotheses and how well they have succeeded in describing and explaining real accident data. The consideration of alternative hypotheses began with the papers of Greenwood and Woods (1919) and Greenwood and Yule (1920). They put forward three hypotheses:

H1*. The special case of Hypothesis H1 in which accident rates are not only constant in time but equal for all individuals;
H1. The general case of constant but unequal rates;
H2. Various forms of contagion models.

Constant rates (H1)
Hypothesis H1* (constant and equal rates) is the simplest possible hypothesis, although it would be very surprising if it were found to be true. The hypothesis implies that the number of accidents incurred by different individuals during the same period will have a Poisson distribution, with variance equal to the mean. In fact, it has been found that such distributions, whether for conditions of equal exposure or not, almost invariably have variances greater than their means, and the hypothesis H1* of constant and equal rates can therefore be rejected.

The general hypothesis H1, by which accident rates are assumed to vary from one individual to another but to remain constant in time, has sometimes been called the proneness model, but, as indicated in Section 1, this implies an unnecessarily restrictive definition of proneness. This hypothesis is the natural one to consider when Hypothesis H1* (constant and equal rates) is rejected, in the “Occam’s razor” sense of being the simplest mechanism that may account for the data. However, constancy of individual accident rates in time would perhaps be almost as surprising as equality between individuals, particularly in situations of varying exposure.

The model can be fitted to any observed distribution of accidents in a single period (provided that the variance is not less than the mean and that certain other inequalities hold between the moments) by fitting a mixed Poisson distribution with an appropriate mixing distribution of rates. Greenwood and Yule (1920) showed that if accident rates have a gamma distribution, then the number of accidents has a negative binomial distribution, and this provides a good fit to many sets of data. However, the fact that this or any other distribution fits the data for a single period provides no evidence whatever of the constancy or otherwise of the individual accident rates over time.
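The Greenwood and Yule result can be written out as a short worked derivation. The parametrisation below is an assumption made for illustration and is not taken from the paper: the expected count $\lambda$ is given a gamma density with shape $k$ and scale $s$, and the Poisson probability of $r$ accidents is averaged over that density.

```latex
\begin{aligned}
P(x = r) &= \int_0^\infty \frac{e^{-\lambda}\lambda^{r}}{r!}\,
            \frac{\lambda^{k-1}e^{-\lambda/s}}{\Gamma(k)\,s^{k}}\,d\lambda \\
         &= \frac{1}{r!\,\Gamma(k)\,s^{k}}
            \int_0^\infty \lambda^{r+k-1}e^{-\lambda(1+s)/s}\,d\lambda \\
         &= \frac{\Gamma(k+r)}{r!\,\Gamma(k)}
            \left(\frac{s}{1+s}\right)^{r}\left(\frac{1}{1+s}\right)^{k},
            \qquad r = 0, 1, 2, \ldots \\[4pt]
E(x) &= ks, \qquad \operatorname{Var}(x) = ks + ks^{2} = ks(1+s).
\end{aligned}
```

This is the negative binomial distribution, and its variance exceeds its mean whenever $s > 0$, in line with the overdispersion described above. The derivation involves only a single period, so a good fit of this form cannot by itself show whether the individual rates are constant over time.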
To test this, it is necessary to examine the accidents of the same individuals in two or more periods of time. This has in fact often been done, but the results have generally been inconclusive. Normally the correlation between the numbers of accidents in the two periods has been obtained; see, for example, Peck, McBride, and Coppin (1971). These correlations have usually been small but positive, as would be expected on hypothesis H1, but the significance of the differences from the values expected on the hypothesis does not appear to have been tested. There was some consideration in the paper by Chambers and Yule (1941) and the subsequent discussion of whether accident rates were constant, and it was generally concluded that they were not. However, the data were for professional drivers, and the effects of training and experience seemed to be important factors.

Contagion models (H2)

It might be thought that the natural hypothesis against which to test H1 would be one in which rates were liable to some form of random variation. In fact, Greenwood and his colleagues did not consider this, but instead put forward the contagion hypothesis H2 as the only one incorporating variation of individual rates in time. The idea that a person’s accident rate may change as a result of an accident is certainly an interesting possibility that deserves to be examined, and it was a legitimate hypothesis for Greenwood and his colleagues to consider. However, if accident rates do change, it seems unlikely that the occurrence of accidents will be the only or even the main cause of the changes, and it was unfortunate that this was the only model of change considered. In a pioneering paper this is perhaps understandable, but unfortunately the three hypotheses seem to have exercised a hypnotic and restrictive influence on later researchers. Many papers have begun by quoting the three hypotheses as though they were the only possibilities, and Fitzpatrick (19X$), for example, says “there are essentially only two important hypotheses available [as alternatives to H1*]. These were proposed more than 30 years ago by Greenwood and Woods.”

There are other objections to the hypothesis H2 as the alternative to H1. It contains more parameters than can be expected to be estimated with a reasonable degree of accuracy, and any limitation on the numbers, such as making $\theta_1 = \theta_2 = \theta_3 = \ldots$, or letting the $\theta_n$ form an arithmetic or geometric progression, involves an arbitrary decision. The hypothesis implies that the individual’s current accident rate depends on the total number of accidents he has incurred, some of which may have happened before the period of observations began. To the extent that this is so, the hypothesis is untestable with the usual sort of data. There is also a question about the types of accident: does an accident of one type result in a change in the rate for a different type? Finally, it has been shown by Cane (1972) that the results of certain particular versions of this hypothesis are indistinguishable from those of other noncontagious hypotheses.

For these reasons, hypothesis H2 is not considered further in this paper. This does not imply that it is thought to be unimportant; on the contrary, it would be of considerable importance to know whether drivers do become safer (or less safe) as the result of having an accident, and the question deserves further study. However, in view of the theoretical difficulties and the inconclusive results so far obtained, it seems best to exclude this type of effect from the present investigation.
Spells models (H3)

It was not until 1963 that accident models escaped from the straitjacket of Greenwood’s three hypotheses. The “spells” model introduced by Cresswell and Froggatt (1963) marked an important advance by postulating that the individual accident rate varied over time in a random manner. Just as accidents occurred randomly in accordance with a certain law (Poisson) when the accident rate was constant, so the accident rate itself also fluctuated according to a probabilistic law. The spells model was therefore an example of what was later to be called by Bartlett (1963) a “doubly stochastic Poisson process”: “doubly stochastic” because the stochastic or chance element operates in two stages, first in determining the accident rate and second in determining the times of the accidents.

It must be emphasized that for both stages the probability model is simply a convenient way of formulating our lack of knowledge. In fact, accidents are not, in one sense, random events, since they have causes and would be to some extent predictable by someone with detailed knowledge of the preceding conditions, and, in the same way, changes in accident rates also have causes and are, in principle, to some extent predictable. This applies equally, of course, to events of other types, which are often regarded as random, such as the results of coin-tossing or dice-throwing.

Although Cresswell and Froggatt’s model was valuable in being, as they claimed, less mechanistic and more “human” than its predecessors, it nevertheless suffered from some considerable drawbacks. These probably resulted from the authors’ wish to be able to derive a distribution of accident frequencies with explicit probabilities. It was assumed that all individuals (bus drivers or tram drivers in Cresswell and Froggatt’s work) were subject to spells during which their accident rate was higher than usual. The authors assumed that the number of spells that an individual had in a given period had a Poisson distribution. Although they said nothing about the duration of the spells, this assumption can be only an approximation and is valid only when the spells are short compared with the intervals between them.

According to the hypothesis formulated by Cresswell and Froggatt, all individuals are subject to the same rate of spell occurrence and the same accident rates during spells and between spells. It follows that the only correlation between the numbers of accidents of the same individuals in two periods arises from the fact of a spell’s overlapping the two periods. Because of the implied shortness of the spells, this will be rather rare for consecutive periods and will not occur at all for nonconsecutive periods (separated by a year or more, say). In practice such correlations are observed between both consecutive and nonconsecutive periods, so that the model in the form put forward by Cresswell and Froggatt cannot be considered to fit the data.

Switching process (H4)

The paper by Bartlett (1986a) is in several ways a remarkable contribution to the theory of accident modelling, perhaps most notably in ignoring virtually all previous contributions. Bartlett analysed a set of accident data for 7,842 Californian drivers over six years, previously analysed by Seal (1980), mainly as an illustrative example of a particular type of stochastic process. The most novel feature of his approach was that he was not concerned to derive the expected frequencies of accidents explicitly from his hypotheses; instead, he derived the moments of the distribution, which could be compared with those of the observed distribution. The advantage of this fresh approach was that Bartlett was able to propose a model for variation of individual accident rates in time, which was considerably more plausible than its predecessors.

The model that Bartlett proposed was called a “switching process” (hypothesis H4). For each individual there is a distribution of accident rates, and at random times with a specified frequency, the rate switches from its existing value to a new value randomly chosen from the distribution, independently of previous values. Bartlett assumed the frequency of change to be the same for all individuals. He suggested several possible assumptions for the form of the distribution of accident rates, the variation of mean rates between individuals, and the relation of the individual variances to the means.
In fact, Bartlett originally introduced the switching model to represent the common overall variation of the individual rates over time, but in a later note (Bartlett 1986b), he concluded that for his data this common variation was rather small compared with the independent individual variation, and in this paper we shall take the switching process to represent the latter.

Random walk (H5)

Finally, we refer to hypothesis H5, the random walk model, which differs from the switching model (H4) in that successive increments rather than successive values are assumed to be independent. This hypothesis does not appear to have been subjected to analysis, but is referred to in the final paragraph of Miller and Schuster (1983), although more in relation to violations (offences) than accidents. The hypothesis deserves to be considered as an alternative to H4, since a priori it seems at least as plausible that changes in accident rates should be independent of previous changes as that successive rates should be independent. Hypothesis H4 implies that changes in accident rates are caused by “temporary states” whose effects do not persist, whereas H5 implies that changes become incorporated in the future behaviour of the process. Some limitation on the changes in rates would have to be imposed to prevent the rates from becoming negative. One possibility would be to assume that the logarithm of the accident rate performs a random walk with a distribution of increments symmetrical about zero.

Summary

To summarise, the constant rates hypothesis H1, although rather implausible in the present context, has rarely been clearly refuted. The contagion hypothesis H2 has been ruled out of consideration here as being largely untestable. The spells hypothesis H3 in the form proposed by Cresswell and Froggatt may be ruled out since it fails to account for the observed correlation between the accidents of the same drivers in different periods; it may be regarded as being superseded by the more general hypothesis H4. The switching process H4 and the random walk H5 remain to be tested as possible alternatives to the constant rates hypothesis H1.
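The difference between the two remaining hypotheses can be made concrete with a small simulation. This is a minimal sketch under stated assumptions and is not Bartlett’s or Miller and Schuster’s formulation: time is taken in discrete steps rather than continuous random times, the rate distribution for H4 is an arbitrary gamma distribution, the H5 increments are normal on the log scale, and all function names and parameter values are hypothetical.

```python
import math
import random

def switching_path(n_steps, switch_prob, draw_rate, rng):
    """H4 sketch: at each step the rate is redrawn with probability switch_prob."""
    rate = draw_rate(rng)
    path = []
    for _ in range(n_steps):
        if rng.random() < switch_prob:
            rate = draw_rate(rng)  # new value, independent of previous values
        path.append(rate)
    return path

def log_random_walk_path(n_steps, start_rate, step_sd, rng):
    """H5 sketch: the log-rate receives independent increments symmetric about zero."""
    log_rate = math.log(start_rate)
    path = []
    for _ in range(n_steps):
        log_rate += rng.gauss(0.0, step_sd)
        path.append(math.exp(log_rate))  # exponentiating keeps the rate positive
    return path

# Hypothetical settings: 14 yearly steps, mean rate 0.06 accidents per year.
rng = random.Random(1)
h4_path = switching_path(14, 0.2, lambda r: r.gammavariate(1.0, 0.06), rng)
h5_path = log_random_walk_path(14, 0.06, 0.1, rng)
```

In the discrete version of H4 assumed here, an individual’s rates at two times $k$ steps apart are identical unless a switch has intervened, so their correlation about the individual’s own mean is $(1 - p)^k$ for switching probability $p$. In the H5 sketch each change persists in all later values, which is the distinction between “temporary states” and incorporated changes drawn in the text.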
3. VARIATION BETWEEN DRIVERS IN A SINGLE PERIOD
The data
The eight sets of data to be analysed all relate to random samples of the general driving population. Large samples are required to enable various quantities to be estimated with a useful degree of accuracy, and the size of the samples used here ranges from 29,000 to 2,500,000 drivers. The data relate to three American states (California, North Carolina, and Connecticut) and to the state of South Australia. California provides five samples of drivers for different periods and the other sources provide one sample each. They are all longitudinal samples in the sense that the data for the different years relate to the same individuals. The details of the eight samples are given in Table 1, including the sample size, the length of the period covered, and the subperiods for which data are available. Each sample has been given a code for future reference in this report. Some comments on the data are given below.

California. The California data are much the most detailed and the most extensive in time, with sample size ranging from 54,000 to 148,000 drivers. They are taken from the California Driver Record Study, the first report of which was issued in 1958, and which is still continuing. The accidents included were those involving death, personal injury, or more than $100 worth of damage (the last category, in particular, being probably underreported). Sample CAL1 for 1961-1974 (omitting 1964-1968) is particularly valuable for the present paper, not only because of the long time span, but because the published data give the pattern of annual accidents over the nine years of the study for individual drivers.
Table 1. Details of samples analysed

Sample  Area             Period            No. of     Period length   Subperiods available     Source of data
code                                       drivers    (years)         (year nos.)
CAL1    California       1961-74           54,165     14              1x2x3x9x10x11x12x13x14   Kwong, Kuan and Peck [1976]
                         (excl. 1964-68)              (5-yr gap)      and all combinations
CAL2    California       1961-63           148,006    3               None                     California Department of Motor Vehicles [1965]
CAL3    California       1972-80           93,393     9               (1-6)x(7-9)              Peck [1987]
CAL4    California       1977-82           120,925    6               (1-3)x(4-6)              Peck [1987]
CAL5    California       1980-82           143,624    3               (1-2)x3                  Peck [1987]
NC      North Carolina   1967-71           2,502,240  4               (1-2)x(3-4)              Stewart and Campbell [1972]
                                                                      1x(2-4)
                                                                      (1-3)x4
CON     Connecticut      1931-36           29,531     6               (1-3)x(4-6)              Forbes [1939]; Cobb [1940]
SA      South Australia  1983-84           909,466    2               None                     South Australia Dept of Transport [1986]
North Carolina. The data are taken from a single study covering four years and are particularly valuable because of the large size of the sample-over 2,500,OOOdrivers. Connecticut. The data for 30,000 drivers were collected in 1931-1936 in the first large-scale study of this type and are now mainly of historical interest. South Australia. This is the most recent set of data, the second largest sample (910,000 drivers), and the only non-American one. Estimation of moments of accident rate distributions In this section we analyse frequency distributions of accidents per driver during a single period of time of one year or more. The aim is to deduce the form of the underlying distribution of the accident rates of the individual drivers averaged over the period. It is assumed that A,, the expected number of accidents of the ith driver during the period, given his pattern of accident rates, has a certain distribution over the population of drivers, and that the actual number of accidents x, of the ith driver has a Poisson distribution whose mean is his value of A,. The distribution of x over drivers is said to be a mixture of Poisson distributions, and the distribution of A is known as the “mixing distribution.” Newbold (1927) gave formulae for estimating the moments of the distribution of A from the moments of the observed values X. These formulae were obtained on the assumption that the accidents of an individual i in a period of length T form a Poisson
Table 2. Estimated moments of accident rate distributions

                                      Accidents per year
Sample  Period length  Year      Mean     Coeff.   Skewness  Kurtosis
        (years)        numbers            of var.
CAL1    1              1         0.0632   1.062    3.67       24.2
        1              2         0.0726   1.059    2.65        9.2
        1              3         0.0700   0.851    1.13      -15.1
        1              9         0.0510   1.006    9.13       84.1
        1              10        0.0452   0.919    0.08       -9.2
        1              11        0.0509   1.060    1.27      -11.0
        1              12        0.0566   1.003    4.49       18.5
        1              13        0.0520   1.043    2.48      -15.6
        1              14        0.0396   1.103    0.52       -6.4
        2              1-2       0.0679   0.966    2.71       23.2
        2              2-3       0.0713   0.893    0.97        0.0
        2              9-10      0.0481   0.933    3.82        9.6
        2              10-11     0.0480   0.903    2.10        5.4
        2              11-12     0.0538   0.941    4.93       55.7
        2              12-13     0.0543   0.942    1.72        4.9
        2              13-14     0.0458   1.012    3.02       -6.8
        3              1-3       0.0686   0.890    1.49        6.5
        3              9-11      0.0490   0.893    2.90       15.1
        3              10-12     0.0509   0.885    2.89       20.2
        3              11-13     0.0532   0.902    3.31       28.1
        3              12-14     0.0494   0.952    2.79       18.5
        4              9-12      0.0509   0.879    3.00       16.1
        4              10-13     0.0512   0.884    3.15       34.2
        4              11-14     0.0498   0.917    3.52       39.2
        5              9-13      0.0512   0.871    3.29       31.2
        5              10-14     0.0489   0.902    3.15       30.6
        6              9-14      0.0492   0.881    3.18       27.3
Variationof drivers' accidentrates
285
Table 2 (Continued)

                                      Accidents per year
Sample  Period length  Year      Mean     Coeff.   Skewness  Kurtosis
        (years)        numbers            of var.
CAL2    3              1-3       0.0711   0.938    2.86       37.2
CAL3    3              7-9       0.0453   0.906    4.10       20.0
        6              1-6       0.0568   0.879    2.67       14.1
        9              1-9       0.0530   0.852    2.72       13.9
CAL4    3              1-3       0.0579   0.922    3.08       19.3
        3              4-6       0.0439   0.910    3.01       22.4
        6              1-6       0.0509   0.874    2.55       15.9
CAL5    1              3         0.0475   0.904    1.65       31.8
        2              1-2       0.0472   0.974    3.06       19.8
        3              1-3       0.0473   0.929    2.98       16.5
NC      1              1         0.0606   1.257    3.40       14.1
        1              4         0.0643   1.195    3.75       20.7
        2              1-2       0.0611   1.180    3.27       21.2
        2              3-4       0.0648   1.132    3.55       30.5
        3              1-3       0.0625   1.122    3.10       18.8
        3              2-4       0.0637   1.096    3.28       20.7
        4              1-4       0.0629   1.076    3.22       22.9
CON     3              1-3       0.0420   1.085    2.07       10.1
        3              4-6       0.0379   1.166    1.84        4.3
        6              1-6       0.0400   1.070    2.71       18.3
SA      2              1-2       0.0427   1.098    3.27       41.5
process with constant rate $\theta_i$ (differing between individuals), so that $\lambda_i = T\theta_i$. If we make the alternative assumption that $\theta_i$ varies within the period, then the accidents of the individual $i$ will no longer form a Poisson process, but, as we have seen in Section 2, his number of accidents $x_i$ during the period will still have a Poisson distribution, the mean being $\lambda_i = T\bar\theta_i$, where $\bar\theta_i$ is his mean accident rate over the period. It follows that Newbold’s formulae can be applied in this situation of accident rates varying in time and will give estimates of the moments of the distribution of the rates of individual drivers averaged over the period. There is one important proviso. The variation of the accident rates in time must be independent of the occurrence of accidents. We therefore exclude the possibility of the “contagion” models of Section 2 in which the rates are assumed to change to new values following an accident. Newbold’s formulae for estimating the mean and moments of $\lambda$ are as follows:
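$$
\begin{aligned}
\mu'_1(\lambda) &= m'_1 \\
\mu_2(\lambda) &= m_2 - m'_1 \\
\mu_3(\lambda) &= m_3 - 3m_2 + 2m'_1 \\
\mu_4(\lambda) &= m_4 - 6m_3 - (6m'_1 - 11)m_2 + (3m'_1 - 6)m'_1,
\end{aligned}
\tag{3.1}
$$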
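where $m'_1$, $m_2$, $m_3$, $m_4$ denote the mean and moments of the observed values of $x$. The mean and moments of $\bar\theta = \lambda/T$ can then be estimated by:
$$
\mu'_1(\bar\theta) = \frac{\mu'_1(\lambda)}{T}, \qquad \mu_r(\bar\theta) = \frac{\mu_r(\lambda)}{T^r} \quad (r = 2, 3, 4). \tag{3.2}
$$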
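The estimators (3.1) and (3.2) can be sketched in a few lines of Python. This is a minimal illustration only, assuming the observed moments are the plain large-sample moments about the mean (divisor N); the function names `newbold_rate_moments` and `shape_coefficients` are hypothetical and do not come from the paper.

```python
from math import sqrt


def newbold_rate_moments(counts, period_years):
    """Estimate moments of the mean accident rate from accident counts.

    counts: accidents per driver in one period (hypothetical input format).
    period_years: length T of the period.
    Uses the large-sample form of (3.1), then rescales by powers of T as in (3.2).
    """
    n = len(counts)
    m1 = sum(counts) / n
    # Observed moments about the mean, divisor n (large-sample form).
    m2, m3, m4 = (sum((x - m1) ** r for x in counts) / n for r in (2, 3, 4))

    # Equation (3.1): moments of the expected number of accidents (lambda).
    lam2 = m2 - m1
    lam3 = m3 - 3 * m2 + 2 * m1
    lam4 = m4 - 6 * m3 - (6 * m1 - 11) * m2 + (3 * m1 - 6) * m1

    # Equation (3.2): moments of the mean rate (lambda / T).
    t = period_years
    return {
        "mean": m1 / t,
        "mu2": lam2 / t ** 2,
        "mu3": lam3 / t ** 3,
        "mu4": lam4 / t ** 4,
    }


def shape_coefficients(moments):
    """Dimensionless coefficients; requires a positive variance estimate."""
    mu2 = moments["mu2"]
    if mu2 <= 0:
        raise ValueError("estimated variance is not positive")
    return {
        "cv": sqrt(mu2) / moments["mean"],
        "skewness": moments["mu3"] / mu2 ** 1.5,
        "kurtosis": moments["mu4"] / mu2 ** 2,
    }
```

With very small samples the estimated variance can be zero or negative, which is why the second function refuses to form the coefficients in that case.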
Newbold’s formulae (3.1), which are given in the approximate form appropriate for large samples, provide unbiased estimates of the moments of $\lambda$. They are not, in general, the most efficient estimators, except for the mean. Anscombe (1950) has shown that with a gamma distribution of $\lambda$ (negative binomial distribution of $x$), an alternative method that equates the observed mean and the proportion of zeroes to their expected values will provide a rather more efficient estimate of the variance with typical values for accident distributions. However, this involves successive approximations, and Newbold’s formulae are used here on grounds of simplicity. The standard deviations of the estimates of the moments of $\lambda$ are not in general available, although Anscombe (1950) has given approximations for the case when the cumulants $\kappa_i(\lambda)$ are zero or negligible for $i \ge 3$.

The results of applying the formulae to the various sets of data listed in Table 1 are shown in Table 2. All the subperiods for which data are available are included. The moments other than the mean have been expressed in the form of dimensionless coefficients:

Coefficient of variation $= \mu_2^{1/2}/\mu'_1$
Skewness $= \mu_3/\mu_2^{3/2}$
Kurtosis $= \mu_4/\mu_2^{2}$
This facilitates comparison between distributions with different means and enables the distributions to be compared with standard ones.

Effect of overall external factors

The mean accident rates for the shortest available subperiods are plotted as time series in Fig. 3. This shows that the mean accident rate of a sample of drivers varies considerably from one year to another, particularly over the longer periods, and shows a general downward trend. These changes in the sample will not exactly reflect those in the population of drivers as a whole, since they will include the effects of increasing age and experience, which will be largely averaged out in the population, with its changing membership. Moreover, some of the drivers in the sample may effectively drop out of the driving population during the period covered. Apart from these differences, however, changes in the mean accident rate of the sample may be expected to follow the same trend as that in the population and to be affected by the same factors, such as changes in legislation, vehicle construction, the road network, and the weather.

In modelling these effects, a natural assumption would be that they affect all individual accident rates by the same multiplicative factor, although of course this can be only an approximation. This would imply that the standard deviation of the accident rates would change in proportion to the mean; in other words, that the coefficient of variation would be constant from year to year. To test this hypothesis, standard deviations of accident rates have been plotted against the means in Figs. 4a and 4b for the six samples having data for more than one subperiod, using one-year periods for Sample CAL1 and two-year periods for Sample NC. It can be seen that as the mean accident rate varies from one subperiod to another for the same sample of drivers, the standard deviation does vary approximately in proportion.
This suggests that for the purposes of modelling, the variation of individual accident rates in time may be represented by a stationary process, that is, a process
Fig. 3. Changes in mean annual accident rates.
whose statistical properties are constant in time, with an additional factor representing changes in legislation etc. by which all the individual rates are multiplied. The variation of this factor will then be regarded as extraneous to the model and will not be regarded as forming part of the stochastic process.

This assumption of a basically stationary process would be inconsistent with the “random walk” model (hypothesis H5). In a random walk model, the successive increments are independent and the variance between individuals will tend to increase with time. To examine this possibility more explicitly, the coefficients of variation of the accident rates for the samples and subperiods shown in Figs. 4a and 4b are plotted against time in Fig. 5. There seems to be no general tendency for the coefficient of variation to increase with time, and the random walk model will therefore not be considered further.

Form of distribution of individual accident rates

The estimated moments in Table 2 give information about the form of the distribution of accident rates between drivers (averaged over the stated periods). The gamma distribution has generally been used to represent individual accident rates (usually assumed to be constant in time) since it was first proposed by Greenwood and Yule (1920). It was chosen mainly on grounds of mathematical convenience, being a simple distribution with non-negative values, which when used as a mixing distribution for the Poisson distribution leads to a simple discrete distribution of numbers of accidents with explicit probabilities: the negative binomial distribution. This has generally been found to provide a reasonable fit to observed accident distributions, whether in conditions of uniform exposure (e.g. Newbold 1926) or variable exposure, as with car accident data (e.g. Weber 1971).
Fig. 4. Mean and standard deviation of accident rates (panel (b): other samples; horizontal axis: mean accident rate).
Fig. 5. Changes in coefficient of variation of accident rates (horizontal axis: year).

Another simple distribution for non-negative quantities which could represent individual accident rates is the lognormal. Unlike the gamma distribution, this leads to a distribution of accidents (called by Anscombe [1950] the discrete lognormal distribution) whose probabilities cannot be obtained explicitly. Nevertheless, the lognormal distribution itself has convenient properties arising from its connection with the normal distribution, and for some purposes would provide a useful model for accident rates. Both the gamma and lognormal distributions have two parameters and their shape is determined by their coefficient of variation $c$. The coefficients of skewness $s$ and kurtosis $k$ can be expressed in terms of $c$ as follows (Johnson and Kotz 1970):

gamma: $s = 2c$, $k = 6c^2 + 3$
lognormal: $s = 3c + c^3$, $k = c^8 + 6c^6 + 15c^4 + 16c^2 + 3$
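As a small worked illustration of these relations, the sketch below evaluates both pairs of formulas and checks whether an estimated (c, s, k) triple lies between the gamma and lognormal values. The function names are hypothetical; the example row (c = 1.076, s = 3.22, k = 22.9) is the four-year North Carolina entry of Table 2.

```python
def gamma_shape(c):
    """Skewness and kurtosis of a gamma distribution with coefficient of variation c."""
    return 2 * c, 6 * c ** 2 + 3


def lognormal_shape(c):
    """Skewness and kurtosis of a lognormal distribution with coefficient of variation c."""
    skewness = 3 * c + c ** 3
    kurtosis = c ** 8 + 6 * c ** 6 + 15 * c ** 4 + 16 * c ** 2 + 3
    return skewness, kurtosis


def lies_between(c, s, k):
    """Return (skewness_between, kurtosis_between) for an estimated triple.

    For c > 0 the lognormal values always exceed the gamma values,
    so the gamma curve is the lower bound in both comparisons.
    """
    gamma_s, gamma_k = gamma_shape(c)
    logn_s, logn_k = lognormal_shape(c)
    return gamma_s <= s <= logn_s, gamma_k <= k <= logn_k


if __name__ == "__main__":
    # Four-year North Carolina row of Table 2.
    print(lies_between(1.076, 3.22, 22.9))
```

For that row both comparisons hold, in line with the observation that most points fall between the two curves.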
Fig. 6. Skewness and coefficient of variation of accident rates.
Fig. 7. Kurtosis and coefficient of variation of accident rates.

Figures 6 and 7 show the estimated values of $s$ and $k$ from Table 2 plotted against $c$, with curves representing the relations for the gamma and lognormal distributions. For
the small CAL1 sample the values of $s$ and $k$ are subject to large sampling fluctuations, particularly for the shorter periods, and the values for one-year and two-year periods have been omitted from the graphs. In both graphs the majority of points lie between the curves representing the gamma and lognormal distributions. This suggests that a better fit to the data would be obtained with some intermediate distribution, such as a particular form of Pearson’s Type VI distribution, which has three parameters. However, the difference between the gamma and the lognormal frequencies is not very great, and probably either of them would fit most sets of data adequately for practical purposes. It cannot be expected that any simple theoretical distribution will provide a very close fit, if only because accident rate distributions are likely to have been partially truncated at the upper end by the need to pass a driving test and by other legislative measures and are also likely to include a certain proportion of zero rates corresponding to drivers with zero exposure during the period of observation.

Effect of length of period
Table 2 gives the estimated moments of the distributions of accident rates averaged over periods of different lengths. Bartlett (1986) drew attention to the fact that these moments may vary with the length of the period and that the form of the relation provides information about the form of variation of individual accident rates between drivers and over time. In fact, Bartlett did not explicitly refer to the moments of the accident rates as such. However, the two quantities that he obtained and plotted against the period length $T$ were, in the notation of (3.1) and (3.2),
$$
\frac{m_2 - m'_1}{T} \quad \text{and} \quad \frac{m_3 - 3m_2 + 2m'_1}{T^2},
$$
which are simply our estimates of $\mu_2(\bar\theta)$ and $\mu_3(\bar\theta)$ multiplied by $T$. There are some advantages in working with the actual moment estimates without the extra factor $T$. Apart from their being more tangible and familiar quantities, their behaviour as functions of $T$ for various models is more intuitive, and departures from the constant rate model are more easily seen in the graphs. In the situation where there is an external multiplicative factor affecting all the rates, it is appropriate to eliminate it (at least approximately) by using the coefficients of variation and skewness instead of the second and third moments.

Figure 8 shows the coefficient of variation of accident rate plotted against period length for the six samples with data for subperiods. Although there is a considerable amount of vertical scatter, particularly in the small CAL1 sample, there is a clear tendency for the coefficient of variation to fall as the length of the period increases, but at a decreasing rate. The implication of this tendency may be seen by comparing the observed results with what would be expected on the basis of two particular simple models. The models will be described for stationary processes, and it will be assumed that the effect of the superimposed common variation has been eliminated from the data by using the coefficient of variation rather than the standard deviation. If the individual accident rates
Fig. 8. Coefficient of variation of accident rates and length of period (horizontal axis: length of period in years).
were constant in time and unequal (hypothesis H1), then the variance and the coefficient of variation of the rates averaged over a period would of course be independent of the length of the period. Suppose, on the other hand, that the average rate for an individual in one year is uncorrelated with the value for the same individual in all previous years, i.e. that the annual rates are in effect selected randomly from the same distribution for all individuals. Then the average rate for an individual over $n$ years will be the mean of $n$ independent annual values, and its standard deviation will be proportional to $1/\sqrt{n}$. It can be seen that the observed coefficients of variation vary with $T$ in a manner that is intermediate between the models with perfect positive correlation and zero correlation, and closer to the former, indicating the existence of substantial positive correlation.

Bartlett pointed out that his switching model provides an approximation to the uncorrelated model above if the rate of switching is high in relation to the length of the period of observations (“white noise”). He obtained an adequate fit to his data by combining the two processes described above, so that the annual rates for individual drivers were in effect drawn randomly from individual distributions whose means differed between drivers. This implies that the variance of rates averaged over periods of length $T$ is of the form $a + b/T$. A relation of this form has been fitted to the data of Fig. 8 for the samples CAL1 and NC, which have a sufficient range of values of $T$, giving for CAL1
$$c = \sqrt{0.724 + 0.307/T} \tag{3.3}$$
and for NC
$$c = \sqrt{1.056 + 0.466/T}. \tag{3.4}$$
Although the values of $c$ are considerably higher for the North Carolina than for the California data, the ratio of the coefficients is very similar in the two cases.

There is another form of relation that is also intermediate between the perfectly correlated and the uncorrelated models. The fact that these lead respectively to relations of the form
$$c(T) = aT^{0} \quad \text{and} \quad c(T) = aT^{-1/2}$$
suggests fitting a curve of the form
$$c(T) = aT^{-b} \qquad (0 < b < 0.5).$$
Curves of this form have also been fitted to the CAL1 and NC data and are shown in Fig. 8. Their interpretation is less obvious than that of the switching process curves, but they have some advantages and will be discussed further in Section 5. The equation for CAL1 is
$$c = 1.004\,T^{-0.[\ldots]} \tag{3.5}$$
and for NC
$$c = 1.228\,T^{-0.0927}. \tag{3.6}$$
The powers are similar for the two samples. It will be seen that for both samples the two curves are quite similar within the range of the data, and either would probably provide an adequate fit. The power curve
(3.6) appears to fit the NC data rather better, while the switching process curve (3.3) is slightly closer to the CAL1 data. The two curves diverge beyond the range of the data. As T becomes large, the switching process curves tend to horizontal asymptotes, while the power curve continues to fall slowly. The respective merits of the two types of relation will be discussed in Section 5. The coefficients of skewness and kurtosis were also examined in relation to the length of the period. For the large North Carolina sample the skewness appears to fall slightly as the period lengthens, but otherwise there is little evidence of any effect, probably because of the considerable amount of scatter.
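The behaviour of the two fitted relations for the North Carolina sample can be checked numerically. The sketch below evaluates (3.4) and (3.6) against the North Carolina coefficients of variation listed in Table 2; the function and constant names are hypothetical, and the comparison is only an illustration, not the fitting procedure used in the paper.

```python
from math import sqrt


def switching_cv(period, a, b):
    """Coefficient of variation under the a + b/T variance relation, as in (3.4)."""
    return sqrt(a + b / period)


def power_cv(period, a, b):
    """Coefficient of variation under the power relation a * T**(-b), as in (3.6)."""
    return a * period ** (-b)


# Fitted constants quoted in the text for the NC sample.
NC_SWITCHING = (1.056, 0.466)
NC_POWER = (1.228, 0.0927)

# Coefficients of variation for NC from Table 2, keyed by period length in years.
NC_OBSERVED = {1: [1.257, 1.195], 2: [1.180, 1.132], 3: [1.122, 1.096], 4: [1.076]}

if __name__ == "__main__":
    for period, observed in NC_OBSERVED.items():
        print(
            period,
            observed,
            round(switching_cv(period, *NC_SWITCHING), 3),
            round(power_cv(period, *NC_POWER), 3),
        )
    # Beyond the data the curves diverge: one levels off, the other keeps falling.
    for period in (10, 100, 1000):
        print(period, switching_cv(period, *NC_SWITCHING), power_cv(period, *NC_POWER))
```

Within one to four years the two curves differ by less than the scatter between subperiods, while for long periods the first tends to the square root of 1.056 and the second continues to decline.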
4. ACCIDENTS IN TWO PERIODS - DIAGONAL ANALYSIS
We now turn from the analysis of univariate accident distributions for single periods to that of bivariate distributions giving the numbers of accidents of the same individuals in two periods. This Section represents a digression from the main theme of the paper and is concerned simply with establishing the existence of variation of individual accident rates over time, as a preliminary to measuring and modelling it in Section 5. We saw in Section 3 that the coefficient of variation of individual accident rates averaged over a period of length $T$ decreased as $T$ increased. This implies that the individual rates do not remain constant from one period to another and do not all change in the same proportion, for in either case the coefficient of variation would be independent of $T$. A more direct demonstration of the existence of this variation may be obtained by using a test introduced by Bartlett (1986). Bartlett pointed out that if the individual accident rates are constant in time, then for all drivers having a given number of accidents $n$ in a period of $r$ years the expected distribution of the $n$ accidents over the $r$ single years is independent of the individual’s accident rate. It follows that the results for all drivers with $n$ accidents may be combined, and by comparing the observed numbers of accidents with the expected numbers, the hypothesis of constant rates may be tested. This argument may in fact be extended to the situation where the (unconditional) expected numbers of accidents for a particular driver in $r$ periods are unequal but are in the same proportion for all drivers, either because the periods vary in length, or because the accident rates change in the same proportion, or both. In fact, our use of this method will be confined to data for two periods only, so for given $n$ we need only examine the number of accidents in the first period, which on the “proportional change” hypothesis will have a binomial distribution.
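The binomial comparison described above can be sketched as follows. This is an illustrative outline only: the input format (a list of per-driver pairs of accident counts for the two periods) and the function name `diagonal_analysis` are assumptions, and the proportion is estimated from the data rather than from the period lengths.

```python
def diagonal_analysis(pairs, n):
    """Compare the variance of first-period accidents with its binomial value.

    pairs: iterable of (x1, x2) accident counts per driver for the two periods.
    n: total number of accidents defining the diagonal x1 + x2 == n.
    """
    first = [x1 for x1, x2 in pairs if x1 + x2 == n]
    if not first:
        raise ValueError("no drivers with the requested total")
    drivers = len(first)
    mean_x1 = sum(first) / drivers
    p = mean_x1 / n  # estimated proportion of accidents in the first period
    observed_variance = sum((x - mean_x1) ** 2 for x in first) / drivers
    binomial_variance = n * p * (1 - p)
    return {
        "drivers": drivers,
        "proportion": p,
        "variance": observed_variance,
        "binomial_variance": binomial_variance,
    }
```

An observed variance above the binomial value for a given n points to drivers' rates changing in different proportions between the two periods.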
Bartlett’s sample of drivers was relatively small, and he needed to combine results for different values of $n$ to demonstrate a significant departure from the hypothesis, but with the larger samples used here the results for the various values of $n$ may be examined separately. Suppose then that we know the numbers of accidents $x_1$ and $x_2$ of each of a sample of drivers in two nonoverlapping periods of lengths $T_1$ and $T_2$. Suppose the average accident rates of the $i$th driver in the two periods are $\theta_{i1}$ and $\theta_{i2}$, where $\theta_{i2}/\theta_{i1} = \phi$ is the same for all drivers. Then his expected numbers of accidents are $T_1\theta_{i1}$ and $T_2\theta_{i2}$, and given that $x_1 + x_2 = n$ the expected value of $x_1$ is $np$, where
$$p = \frac{T_1}{T_1 + \phi T_2},$$
which is the same for all drivers. Thus, taking all drivers for whom $x_1 + x_2 = n$, $x_1$ will have a binomial distribution with parameters $p$ and $n$. The value of $p$ may be estimated from the data, and the observed variance may be compared with its expected value $np(1 - p)$. This method may be referred to as “diagonal analysis” since it uses the data in the diagonal $x_1 + x_2 = n$ of the bivariate distribution of $x_1$ and $x_2$. This method of analysis has been applied to the six samples of drivers for which data are available for subperiods. For sample CAL1, the subperiods used were years 1-
Table 3. Diagonal analysis for accidents in two periods

                                               Proportion of accidents in first period
Sample  Periods    Total      No. of     Mean     Variance  Binomial
        and gap*   accidents  drivers                       variance
CAL1    3(5)6      2          4,013      0.416    0.523     0.486
                   3          1,184      0.419    0.832     0.730
CAL3    6(0)3      2          6,399      0.718    0.414     0.405
                   3          1,827      0.728    0.637     0.594
                   4          524        0.724    0.936     0.799
CAL4    3(0)3      2          4,802      0.574    0.505     0.489
                   3          1,017      0.568    0.801     0.736
CAL5    2(0)1      2          1,790      0.672    0.458     0.441
NC      2(0)2      2          77,910     0.490    0.546     0.500
                   3          16,721     0.496    0.870     0.750
                   4          4,090      0.491    1.198     1.000
                   5          1,044      0.488    1.612     1.249
CON     3(0)3      2          936        0.518    0.552     0.499

* T(G)U means periods of T and U years separated by a gap of G years.
3 and 9-14 (separated by a five-year gap) and for sample NC the two subperiods of two years were used. The results are shown in Table 3 for all values of $n$ for which the number of drivers exceeds 500. It will be seen that in every case the variance of the proportion of accidents in the first of the two periods exceeds the value for a binomial distribution having the observed mean and value of $n$. This implies that the value of $p$ varies from one driver to another, i.e. that the accident rates do not all change in the same ratio. Bartlett discusses appropriate tests of significance, but these are scarcely necessary with the larger samples and more consistent results obtained here. It is possible to estimate the variance of $p$ and to fit a distribution of $p$ (e.g. beta-binomial), but this approach does not fit in easily with the time-series models used in the remainder of this paper. We shall therefore regard diagonal analysis as providing a simple demonstration of the existence of individual variation of accident rates over time but not leading to a suitable way of measuring the variation. For this we turn to correlation in Section 5.

5. THE CORRELATION BETWEEN INDIVIDUAL ACCIDENT RATES IN TWO PERIODS
Estimation of correlation of accident rates
The existence of positive correlation between the accidents of the same drivers in different periods of time was indicated by the data of Fig. 8. This showed that the coefficient of variation between drivers of accident rates averaged over a period decreased as the length of the period increased, but much more slowly than if the rates in successive years had been uncorrelated. In this Section the correlation of the accident rates will be estimated directly. The method of estimation depends on a straightforward extension of Newbold’s formulae from a univariate to a bivariate distribution. It will be recalled that equations
(3.1) and (3.2) enabled the moments of the distribution over drivers of $\lambda$, the expected number of accidents, and hence of $\bar\theta$, the mean accident rate, to be estimated from the moments of $x$, the observed number of accidents in a single period. If we consider the bivariate distributions of accidents in two nonoverlapping periods of lengths $T_1$ and $T_2$, then with the same assumptions as in Section 3 and the additional assumption that deviations from expectation are independent in the two periods, it can easily be shown that the covariance $\mu_{11}$ of the expected numbers of accidents may be estimated simply by the covariance $m_{11}$ of the observed numbers, i.e. by
$$\mu_{11}(\lambda_1, \lambda_2) = m_{11}.$$
It follows that the covariance of the mean rates may be estimated by
$$\mu_{11}(\bar\theta_1, \bar\theta_2) = \frac{m_{11}}{T_1 T_2}.$$
Thus using the estimated variances of $\bar\theta_1$ and $\bar\theta_2$ from (3.1) and (3.2), the correlation between the mean accident rates may be estimated as
$$r(\bar\theta_1, \bar\theta_2) = \frac{m_{11}}{\sqrt{\mu_2(\lambda_1)\,\mu_2(\lambda_2)}}. \tag{5.1}$$
As in the case of the univariate distribution, these formulae are not in general the most efficient estimators of the quantities concerned, but they have the merit of simplicity.

Table 4 shows the results of applying formula (5.1) to the data from the six samples having data for subperiods. For sample CAL1 correlations have been estimated for all possible pairs of nonoverlapping periods of equal lengths. Table 4 also shows the correlations between numbers of accidents for comparison with those between accident rates. The former are much lower because they include a large random component arising from the Poisson distribution; they increase with the length of the periods as this component becomes relatively less important.

The correlation between numbers of accidents in different periods has frequently been calculated in accident studies, but rather remarkably the correlation between rates does not appear to have been estimated previously. (Newbold (1927) did make a reference to the possible use of this correlation, but did not calculate it, possibly because, for her rather small samples, some of the estimated values would have been greater than one.) The quantity is a simple one to understand and is easy to estimate. It is much more informative about the underlying process than the correlation between numbers of accidents. The latter has usually been found to be small and positive, which, in the absence of appropriate significance tests, has been taken to be consistent with the hypothesis of constant accident rates. The correlation between rates will also normally be positive, but much higher, and although there may still be difficulties in testing significance the amount by which it falls below unity gives a measure of the individual variation of the rates in time. The rates correlation will also provide a better basis for comparison between groups with different rates, for example drivers classified by age and sex.
Comparisons on the basis of correlations between numbers of accidents may be misleading, since groups with lower rates will have relatively higher proportions of Poisson variation and therefore lower correlations. A useful feature of the correlation between rates is that it is unaffected by a linear transformation of the rates. Thus, if we have a basic stationary process with superimposed common variation that is multiplicative, as previously assumed, or even contains an additive term as well, the correlation between rates is unaffected and may be taken to apply to the basic process.
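The difference between the two kinds of correlation can be made concrete with a short sketch. It assumes, as in the text, that each variance of the expected numbers is estimated by the observed variance minus the observed mean, and that the covariance of the expected numbers is estimated by the observed covariance; the function name and the input format (two equal-length lists of counts for the same drivers) are hypothetical.

```python
from math import sqrt


def rate_and_count_correlation(x1, x2):
    """Return (correlation between rates, correlation between accident counts).

    x1, x2: accident counts of the same drivers in two nonoverlapping periods.
    The period lengths cancel in the rate correlation, so they are not needed.
    """
    n = len(x1)
    mean1, mean2 = sum(x1) / n, sum(x2) / n
    var1 = sum((a - mean1) ** 2 for a in x1) / n
    var2 = sum((b - mean2) ** 2 for b in x2) / n
    cov = sum((a - mean1) * (b - mean2) for a, b in zip(x1, x2)) / n

    # Variances of the expected numbers: remove the Poisson component.
    lam_var1, lam_var2 = var1 - mean1, var2 - mean2
    if lam_var1 <= 0 or lam_var2 <= 0:
        raise ValueError("estimated variance of expected numbers is not positive")

    rate_corr = cov / sqrt(lam_var1 * lam_var2)
    count_corr = cov / sqrt(var1 * var2)
    return rate_corr, count_corr
```

Because only the denominators differ, the rate correlation is always the larger of the two in magnitude, and in small samples it can exceed one, as noted above for Newbold's data.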
Table 4. Correlations between accidents and between accident rates
Sample Periods
CAL1
Rates
Sample Periods
Accidents Rates
Corr'n
l(O)1
0.046
0.657
0.045
0.739
0.067
0.833
0.037
0.866
0.067
0.773
0.029
0.646
0.065
0.794
0.036
0.669
0.038
0.699
0.039 0.043 0.032
0.615
0.032
0.724
0.030
0.552
0.038
0.765
0.036
0.700
0.040
0.899
0.034
0.673
0.028
0.544
2(10)2
0.035
0.857
I(411
0.025
l(S)1 l(6)l
l(l)1
l(2)l
1(3)1
I(711
I(811
1(9)1
l(lO)l
l
Accidents
and gap* Corr'n
and gap* Corr'n CAL1
2(0)2
2(1)2
0.062
Corr'n 0.760
0.066
0.843
2(2)2
0.062
0.756
0.791
2(5)2
0.064
0.722
0.759
2(6)2
0.058
0.620
0.068
0.785
2(7)2
0.069
0.766
0.063
0.668
2(8)2
0.068
0.690
0.052
0.543
0.055
0.553
0.051
0.541
0.051
0.523
3(0)3
0.092
0.824
0.532
3(5)3
0.094
0.770
0.036
0.732
3(6)3
0.094
0.769
0.029
0.473
3(7)3
0.086
0.674
0.037
0.885
3(8)3
0.078
0.606
0.028
0.484
0.032
0.614
CAL3
6(0)3
0.118
0.816
0.030
0.579
0.032
0.645
CAL4
3(0)3
0.092
0.8\4
0.041
0.644
0.032
0.618
CAL5
2(0)1
0.050
0.899
0.038
0.631
2(9)2
NC
0.029
0.454
2(0)2
0.106
0.738
0.022
0.443
3(0)1
0.091
0.719
0.035
0.578
l(O)3
0.094
0.739
0.024
0.378 3(0)3
0.107
0.816
0.029
0.617
l(ll)l
0.027
0.456
0.031
0.530
1(12)1
0.025
0.458
CON
* T(G)U means periods of T and U years separated by a gap of G years.
Effect of lengths of periods and their separation
The great majority of the correlations in Table 4 relate to the sample CAL1, and these will form the basis of the subsequent analysis. Of the seven correlations from the other samples it should be noted that they all lie between 0.7 and 0.9. The correlations for North Carolina are lower than those for California and Connecticut. All these correlations are for pairs of adjacent periods, and it is not possible to deduce from them anything about the effect of period length or separation. For information about the effect of these factors we are dependent on the single relatively small sample CAL1. Figure 9 shows the correlations for two 1-year, 2-year, and 3-year periods plotted against the length of the gap between the periods. Although there is a good deal of scatter, it is clear that in each diagram the correlation falls as
Fig. 9. Correlation between individual accident rates and length of period (panels for 1-year, 2-year, and 3-year periods; horizontal axis: gap in years).
the length of the gap increases, as would be expected. In principle, the form of the relationship should tell us about the structure of the variation of individual accident rates in time, but because of the wide scatter and the lack of appropriate sampling theory and standard errors, any attempt to draw such conclusions must be rather speculative. Inspection of Fig. 9 suggests that the relations are all approximately linear. Straight lines have been fitted to the data and there is no indication that the fit would be improved by adding a quadratic term. If $r(T, G, U)$ represents the correlation between the accident
rates for the same drivers in periods of length $T$ and $U$ separated by a gap of length $G$, the equations are:
$$r(1, G, 1) = 0.735 - 0.0205G$$
$$r(2, G, 2) = 0.821 - 0.0248G$$
$$r(3, G, 3) = 0.851 - 0.0235G.$$
These linear equations may not correspond to any very plausible model, but they do provide a convenient empirical means of summarising the data. It is possible to express the values of $r(2, G, 2)$ and $r(3, G, 3)$ in terms of the values of $r(1, G, 1)$, by relations such as
$$r(2, 0, 2) = \frac{r(1, 0, 1) + 2r(1, 1, 1) + r(1, 2, 1)}{2\{1 + r(1, 0, 1)\}}.$$
Consequently, the second and third of the above equations may be deduced from the first. The resulting equations are also linear. They are:
$$r(2, G, 2) = 0.824 - 0.0236G$$
$$r(3, G, 3) = 0.848 - 0.0250G.$$
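The deduction of the multi-year relations from the one-year relation can be reproduced with a short sketch. It assumes a stationary process with equal annual variances, so that the correlation between two sums of annual rates is a ratio of sums of annual correlations; the function names are hypothetical, and the default intercept and slope are those of the fitted one-year line.

```python
def annual_corr(gap, intercept=0.735, slope=-0.0205):
    """Fitted linear relation r(1, G, 1) for two single years separated by a gap."""
    return intercept + slope * gap


def aggregated_corr(length, gap, r1=annual_corr):
    """Correlation r(T, G, T) between two T-year periods separated by a gap G.

    Annual variances are taken as equal (set to 1), so only correlations enter.
    """
    # Year i of the first period and year j of the second are separated by
    # gap + (length - 1 - i) + j intervening years.
    covariance = sum(
        r1(gap + (length - 1 - i) + j)
        for i in range(length)
        for j in range(length)
    )
    # Variance of a sum of `length` annual rates: years k apart within a
    # period have k - 1 intervening years, and there are (length - k) such pairs.
    variance = length + 2 * sum(
        (length - k) * r1(k - 1) for k in range(1, length)
    )
    return covariance / variance


if __name__ == "__main__":
    for length in (2, 3):
        intercept = aggregated_corr(length, 0)
        slope = aggregated_corr(length, 1) - intercept
        print(length, round(intercept, 3), round(slope, 4))
```

Because the one-year relation is linear in the gap, the aggregated relation is linear too, and its intercept and slope follow from the values at gaps of zero and one year.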
These agree quite closely with the equations fitted directly to the data, confirming that the linear relations summarise the data well.

Correlations and coefficients of variation
As a first step towards finding an appropriate model to fit the data, we recall the equations fitted to the data of Fig. 8 for coefficient of variation against length of period. It was pointed out that the data lay between the lines representing perfect correlation and zero correlation between adjacent periods, indicating the existence of substantial positive correlation. It is, in fact, possible to deduce the value of the correlation between any two periods from the relation between the coefficient of variation and period length. Suppose first that the accident rate $\theta$ of each individual constitutes a stationary process. Let $\bar\theta(i, T)$ be the mean accident rate for driver $i$ over a period of length $T$, and $\lambda(i, T) = T\bar\theta(i, T)$ be the expected number of accidents. Let $u(T)$ be the variance of $\bar\theta(i, T)$ between drivers and $v(T) = T^2u(T)$ be the variance of $\lambda(i, T)$ between drivers. Then by using the formula for the variance of a sum of correlated quantities, it can be shown that
$$r(T, G, U) = \frac{v(T + G + U) - v(T + G) - v(G + U) + v(G)}{2\sqrt{v(T)\,v(U)}}. \tag{5.2}$$
In particular,
$$r(T, G, T) = \frac{v(G + 2T) - 2v(G + T) + v(G)}{2v(T)} \tag{5.3}$$
and
$$r(T, 0, T) = \frac{v(2T) - 2v(T)}{2v(T)}. \tag{5.4}$$
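Relation (5.2) is easy to evaluate for any assumed variance function. The sketch below applies it to the two relations fitted in Section 3, taking the squared coefficient of variation as proportional to $u(T)$ (the common annual factor being assumed removed, as argued in the text) and forming $v(T) = T^2u(T)$. The function names are hypothetical; the default constants are those quoted in (3.3) for CAL1 and in (3.6) for NC.

```python
from math import sqrt


def corr_from_variance(v, first, gap, second):
    """Equation (5.2): correlation between periods of lengths first and second.

    v: function giving the variance of expected accident numbers for a period length.
    """
    numerator = v(first + gap + second) - v(first + gap) - v(gap + second) + v(gap)
    return numerator / (2 * sqrt(v(first) * v(second)))


def v_switching(period, a=0.724, b=0.307):
    """v(T) = T**2 * (a + b/T) = a*T**2 + b*T, using the constants of (3.3)."""
    return a * period ** 2 + b * period


def v_power(period, a=1.228, b=0.0927):
    """v(T) = T**2 * (a * T**(-b))**2, using the constants of (3.6)."""
    if period == 0:
        return 0.0
    return (period * a * period ** (-b)) ** 2


if __name__ == "__main__":
    for gap in (0, 2, 5, 10):
        print(
            gap,
            round(corr_from_variance(v_switching, 1, gap, 1), 3),
            round(corr_from_variance(v_power, 1, gap, 1), 3),
        )
```

Under the $a + b/T$ relation the computed correlation is the same for every gap, whereas under the power relation it falls as the gap widens.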
We have seen that the accident rate process is not in fact stationary, as assumed above, but that the individual annual mean rates can be modelled as a stationary process with an overall annual factor multiplying all the individual rates. Thus, if the annual
standard deviations are divided by the corresponding annual means, this annual factor will be removed, and the resulting annual coefficients of variation will be proportional to the standard deviations of the underlying stationary process. There is a slight approximation involved in applying the same argument to longer periods. If the coefficients of variation for two adjacent years are equal, then for a given correlation the coefficient of variation for the combined two-year period is slightly higher if the annual means are unequal than if they are equal. However, even if the ratio of the means is as high as 2, then for correlations greater than 0.7, the increase in the coefficient of variation is less than 1%. It would be possible to make a correction for this effect, but for our data this would be so small in comparison with the scatter in Fig. 8 that it was not considered worth while.

Switching process
Consider first Bartlett’s switching process, and in particular the version used in fitting a curve to the data of Fig. 8. This involved two components: different individual means for the different drivers plus a high-frequency switching process that could be approximated as “white noise,” giving uncorrelated means for adjacent years. This led to

$$ u(T) = a + b/T, $$

and therefore

$$ v(T) = aT^2 + bT. $$

Substituting in (5.3) gives

$$ r(T, G, T) = \frac{aT}{aT + b}, $$

which is independent of G. Thus, the correlation between two equal periods is independent of the gap between them, which is indeed obvious from the definition of the process. Clearly this does not fit the data in Fig. 9, which shows r(T, G, T) decreasing as G increases. Although the corresponding relation seemed to provide a reasonable fit in Fig. 8, the correlations of Fig. 9 evidently provide a more sensitive test. This suggests that if we are to use Bartlett’s switching process as a model, we shall need the accurate version specified by the switching rate s, the variance $u_s$ of the switching process, and the variance $u_m$ of the individual means. This gives

$$ v(T) = u_m T^2 + \frac{2u_s}{s^2}\left(e^{-sT} - 1 + sT\right), $$

and hence from (5.3)

$$ r(T, G, T) = \frac{m^2 f + e^{-gm}\left(1 - e^{-m}\right)^2}{m^2 f + 2\left(e^{-m} - 1 + m\right)}, $$

where $m = sT$, $f = u_m/u_s$, $g = G/T$. The curve of r(1, G, 1) against G falls exponentially from r(1, 0, 1) to its asymptotic value r(1, ∞, 1), at a rate that depends on s. To fit the data we require a value of s greater than 1, which implies that r(1, G, 1) has almost levelled off at its asymptotic value by G = 4 years. In fact the best (least squares) fit to the data is obtained with s = 1.60 per year and f = 0.95, and the corresponding curve is shown in Fig. 10. In this diagram the 36 observed values of r(1, G, 1) have been replaced by 6 means with their standard errors for various ranges of G, representing 7, 5, 6, 6, 6, and 6 values,
respectively. This is intended to give a better visual impression of the goodness of fit of the curves. It can be seen that the fit of the switching process curve is rather poor; the initial drop is too steep and the curve levels off too quickly.

[Fig. 10. Correlation between individual accident rates: fitted curves. Horizontal axis: gap between periods, G (years).]

Power law

The power law

$$ c(T) = aT^{-b}, $$

which was fitted to the data of Fig. 8, leads to

$$ u(T) \propto T^{-2b} $$

and

$$ v(T) \propto T^{2-2b}. $$
Variance-time power laws of this type have been found to fit many different physical processes (Beran 1989). They were introduced by Hurst (1951) as a description of the variation of flow in rivers and have since been extensively studied by Mandelbrot (e.g. 1983). They may arise from what Mandelbrot has called “scaling processes” because their statistical properties are unaffected by a change of scale. Their origin is not well understood, but it seems likely that they result from a hierarchy of processes operating on different time scales. It was suggested by Cox and Isham (1980, p. 93) that such a process “could be used to define a doubly stochastic Poisson process with very long term fluctuations.” The scaling property may be verified for the correlation r(T, G, U) (Mandelbrot and Van Ness 1968). If, following Mandelbrot, we use H (in honour of Hurst) to represent 1 - b, we have

$$ v(T) \propto T^{2H}, $$

and substituting in (5.2) gives

$$ r(T, G, U) = \frac{(T + G + U)^{2H} - (T + G)^{2H} - (G + U)^{2H} + G^{2H}}{2T^{H}U^{H}}, $$

from which it follows that r(kT, kG, kU) = r(T, G, U). In particular

$$ r(T, gT, T) = \frac{(g + 2)^{2H} - 2(g + 1)^{2H} + g^{2H}}{2} $$

and

$$ r(T, 0, T) = 2^{2H-1} - 1. $$
Thus the correlation between two equal adjacent periods is the same, whatever their length. For g > 1, a close approximation to r(T, gT, T) is given by

$$ r(T, gT, T) \approx H(2H - 1)(g + 1)^{2H-2}. $$

Thus the correlation also follows an approximate power law. The curve fitted to the data of Fig. 8 gave b = 0.0855, i.e. H = 0.914; but for the data of Fig. 10, the value H = 0.924 gives a slightly better least squares fit, and this curve is shown in the diagram. (Note that this curve, unlike the other two, depends on a single parameter.) The fit appears reasonable, although the data give no indication of the curvature implied by the power law. The curve fits better than that of the switching process because the curvature is less and the curve continues to fall slowly rather than levelling off as G increases.

Comparison of fitted curves

The three curves which have been fitted by least squares to the data of Fig. 9(a) are shown in Fig. 10, which gives a visual impression of their closeness of fit. The root-mean-square deviation for each curve is as follows:

Linear:             0.103
Power law:          0.113
Switching process:  0.123
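The three fitted curves can be tabulated side by side. The Python sketch below is illustrative only: it uses the parameter values quoted in the text (s = 1.60 per year and f = 0.95 for the switching process, H = 0.924 for the power law) and assumes that the linear curve is the fitted equation r(1, G, 1) = 0.735 - 0.0205G. The function names are hypothetical, and the sketch does not reproduce the least squares fitting itself, since the observed correlations are not listed here.

```python
import math


def r_linear(G):
    """Linear relation fitted to the single-year correlations."""
    return 0.735 - 0.0205 * G


def r_switching(G, T=1.0, s=1.60, f=0.95):
    """Switching process plus constant individual means (m = sT, g = G/T)."""
    m, g = s * T, G / T
    num = m * m * f + math.exp(-g * m) * (1.0 - math.exp(-m)) ** 2
    den = m * m * f + 2.0 * (math.exp(-m) - 1.0 + m)
    return num / den


def r_power(G, T=1.0, H=0.924):
    """Scaling (power-law) process: depends on G and T only through g = G/T."""
    g = G / T
    return 0.5 * ((g + 2) ** (2 * H) - 2 * (g + 1) ** (2 * H) + g ** (2 * H))


def r_power_approx(G, T=1.0, H=0.924):
    """Approximation to r_power, intended for g > 1."""
    g = G / T
    return H * (2 * H - 1) * (g + 1) ** (2 * H - 2)


if __name__ == "__main__":
    print(" G  linear  power  switching")
    for G in (0, 1, 2, 4, 8, 12):
        print(f"{G:2d}  {r_linear(G):.3f}  {r_power(G):.3f}  {r_switching(G):.3f}")
```

Evaluating the functions at a range of gaps shows the qualitative differences discussed above: the switching process curve is practically flat beyond G = 4 years, whereas the power law continues to decline slowly.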
It would be useful if we could test for each curve whether the data could have arisen from it by random variation. An analysis of variance has been made of the residuals for each curve, to test whether the deviations from the curve of the means of the six groups shown in Fig. 10 are consistent with the variation within the groups. This is only an approximate test since it ignores the variation of G within groups and the nonlinearity of two of the curves. The conclusion is that the means fit the straight line better than would be expected, that their deviations from the power curve are not significant, and from the switching process curve just significant at the 5% level. Because of the approximate nature of the test, these conclusions should be regarded only as indications and should be looked at in conjunction with the considerations in Section 6.
6. TOWARDS A MODEL FOR INDIVIDUAL ACCIDENT RATES
We have seen that the various models put forward in the past for the variation of individual accident rates have been found to be inconsistent with the data. The most
recent proposal, Bartlett’s switching process, appears not to provide the observed long-term decline of the correlation coefficient. In seeking a model that will provide a better fit to the data, it is important to take account of the likely mechanism determining individual accident rates. A useful starting point may be found in a paper by Haight (1964), entitled “Accident proneness, the history of an idea.” After describing the early work on the subject, Haight describes “a full-scale attack on all aspects of the accident proneness concept” in the postwar decade, meaning on the hypothesis of constant individual accident rates. Having referred to the “universal discredit” into which this idea has fallen, Haight suggests that the accident rate of an individual, λ (which has been called β in the present paper),

fluctuates in a completely unknowable fashion; high λ with alcohol, fatigue, anxiety, bad eyesight, anger, worn tires, or even deep resentment against persons in authority. Many of these “causes” vary within any person in a completely unmeasurable way, and so therefore would their “effect,” namely the value of λ.
There are two comments that may be made on this very interesting passage. In the first place the “causes” that Haight lists are not completely unmeasurable. Some of them, in fact, namely alcohol level, eyesight, and tire wear, are frequently measured, and the result may lead to the prosecution of the driver. Psychological factors such as fatigue, anxiety, anger, and resentment are more difficult to assess accurately, and since they are to some extent transient states, it may be difficult to obtain measurements for the relevant periods of driving. Nevertheless, most of the factors that may be expected to affect an individual’s accident rate are in principle measurable, and it is also possible to make some estimate of their effects, either in the laboratory or using actual accident rates. The possibilities and problems of doing so have been discussed by Sivak (1981). In addition to making direct measurements of psychological states, it is possible to take account of “life events” such as divorce and bereavement, which are liable to cause stress (McKenna 1983). It has, in fact, been shown that drivers’ accident rates tend to rise when they are involved in divorce proceedings (McMurray 1970).

The second comment that arises from Haight’s statement quoted above is that the various factors affecting accident rates will fluctuate on very different time scales, and thus give rise to correlation persisting over periods of different lengths. One can construct a hierarchy of factors classified by the time for which their effects persist. Thus, taking the examples given by Haight, a driver’s mood (e.g. anger) may vary within a matter of minutes, while fatigue or the effect of alcohol may last for a few hours. Worn tires might remain in use for weeks or months, bad eyesight could be present for a matter of years, and resentment against persons in authority could well be a lifetime characteristic.
Anxiety is difficult to place in the hierarchy; it may be a transitory response to some external situation, or it may be a more or less permanent feature of certain personalities. Possibly, therefore, two anxiety factors would in principle be needed. The duration of the effect of life events is likely to be measured in years; McMurray’s (1970) data for divorce, in fact, suggest a period of about a year.

The factors mentioned by Haight are all internal ones, with the exception of worn tires, but an individual’s accident rate (per year, for example) is also affected by external or exposure factors, both quantitative (distance driven) and qualitative (vehicle, road, and traffic conditions). Here also there are a number of factors operating on different time scales. Thus many drivers make a daily journey to work that may remain more or less constant over many years. Journeys for other purposes, such as shopping or recreation, may also fall into a pattern but probably a less rigid one and one that changes more frequently. The characteristics of the vehicle driven may change from time to time (e.g. worn tires replaced), with perhaps a change of vehicle every few years.

The conclusion from this discussion is that an individual’s accident rate is the resultant of a number of internal and external factors, which fluctuate at different rates, and therefore have correlations that decay over different periods. Now the switching process suggested by Bartlett has a fixed mean rate of switching, and it may well be a
suitable model for the variation of a single factor affecting the accident rate, but it seems unlikely to provide an appropriate model for the accident rate itself. The switching model that was fitted to the correlation data of Fig. 10 may be regarded as the sum of two independent processes: (i) a switching process, with a fitted switching rate of 1.60 per year; and (ii) a set of constant individual mean accident rates. Now (ii) amounts in effect to a switching process with a very low switching rate, such that a switch is unlikely to take place within a person’s driving life (or at least within the period of observations). The correlation of the combined process is simply the average of those of the two separate processes (i) and (ii) weighted by their variances. Process (i) has a correlation falling exponentially to zero, and (ii) has a constant correlation of 1. The combined correlation therefore falls exponentially to an asymptotic value, and does not match the more gradual and prolonged fall of the observed correlations.

Now suppose that instead of combining two switching processes with high and very low switching rates we were to use a whole set of processes with rates varying over a wide range. If we assume at this stage for the sake of simplicity that the processes combine additively, then it is clear intuitively that the correlation of the sum will decline gradually somewhat in the manner observed. It can in fact be shown that a continuous hierarchy of switching processes with an appropriate distribution of switching rates will lead to a power law for correlation, which is characteristic of scaling processes (Mandelbrot 1983, p. 418).

It has been assumed above for simplicity that the components of the accident rate are additive. A multiplicative assumption is probably more plausible and would be more consistent with the approximately lognormal distribution of accident rates between drivers.
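The intuitive argument for the additive hierarchy can be illustrated numerically. The Python sketch below is not taken from the original analysis: it treats each component as having an exponentially decaying correlation exp(-sG) between instantaneous rates (not yearly averages), and the choice of nine switching rates spaced logarithmically between 0.001 and 10 per year with equal variances is a purely hypothetical assumption. It is not claimed to reproduce the power law exactly.

```python
import math


def mixture_correlation(gap, rates, weights):
    """Correlation at separation `gap` of a sum of independent processes,
    each with correlation exp(-rate * gap), weighted by their variances."""
    total = sum(weights)
    return sum(w * math.exp(-s * gap) for s, w in zip(rates, weights)) / total


# Two components, as in the fitted switching model: a fast switching process
# (rate 1.60 per year) plus constant individual means (rate 0), with the
# variance ratio f = 0.95 used as the weight of the constant component.
two_rates, two_weights = [1.60, 0.0], [1.0, 0.95]

# Hypothetical hierarchy: rates from 0.001 to 10 per year, equal variances.
hier_rates = [10 ** (-3 + 0.5 * k) for k in range(9)]
hier_weights = [1.0] * len(hier_rates)

if __name__ == "__main__":
    print(" G  two-component  hierarchy")
    for gap in (0, 1, 2, 4, 8, 12):
        two = mixture_correlation(gap, two_rates, two_weights)
        hier = mixture_correlation(gap, hier_rates, hier_weights)
        print(f"{gap:2d}  {two:.3f}          {hier:.3f}")
```

With these assumed values the two-component correlation is essentially constant beyond a gap of about 4 years, while the correlation of the hierarchy is still declining at 12 years.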
Under the multiplicative assumption it might be more appropriate to assume that the logarithm of the accident rate, which would be the sum of the logarithms of the multiplicative factors, constituted a scaling process, rather than the rate itself. The available data, however, do not enable us to discriminate between the additive and multiplicative models, and the point will not be pursued further here.

It is suggested, therefore, that the power law should provisionally be used as a model for variation of accident rates. It is algebraically simple, corresponds to a simple property (scaling), has some theoretical support, and fits the available data adequately. It will not necessarily provide an adequate fit when data for larger samples or longer periods are obtained, and there is no compelling reason to expect that the model will furnish a reliable basis for extrapolation of the kind shown in Fig. 10. The power law merely represents the simplest of a whole range of models, and further work could well support a model giving results closer to the linear relation of Fig. 10. Nevertheless, the power law does appear to capture at least qualitatively some important aspects of the observations and the probable underlying structure. In particular, it throws light on the way an individual’s accident rate changes over time and yet retains a substantial individual element (reflecting both internal and external factors). It thus provides a common framework that incorporates variation both within and between individuals, and in which “individual accident liability” is meaningful without necessarily applying to a constant quantity. Such a model may well have applications to other human attributes.

7. CONCLUSIONS
Ever since the idea of accident proneness was introduced in the 1920s it has been surrounded by controversy and misunderstanding. The debate has tended to be polarised between two extreme views and contains echoes of the parallel debate about the factors determining intelligence. On the one hand, there was the view that individual accident rates differed between individuals and remained constant over time, with the implication that those with high rates should be identified and “selected out.” On the other hand, there was the view that all individuals had the same basic propensity to have accidents
but that their accident rates varied in time, and in such a way that even if the people with a high rate in one period might be identified, their rates in the following period would be no higher than average. Because these two extreme positions seemed to most people highly improbable and because no intermediate position was clearly formulated, the whole subject tended to fall into disrepute.

The way to reach a synthesis between these conflicting ideas is to focus attention on the rate of change of individual accident rates over time. If it is accepted that over a period of a year some people will have higher average accident rates than others, the question is not whether in the following year their rates will remain equally high or fall to normal levels, but how much they will change. In other words, the question should be, “How persistent are high individual accident rates?”

It is hoped that the present paper has helped to clarify some of these issues, both theoretically and numerically, although it must be stressed that the data analysed measure accident liability rather than proneness, since exposure varies between drivers. It has been shown that drivers’ “underlying” individual accident rates, averaged over a period such as a year, have a fairly consistent form of distribution in different places and at different times. It has also been shown that the individual rates do change from one year to the next but that high and low rates tend to persist (r = 0.74). The degree of change increases slowly as the separation between the two years increases; with a gap of 12 years there is still considerable persistence (r = 0.49).

The approach used in this paper could usefully be applied to studies of the accident rates of professional drivers and factory workers, for whom the exposure is more uniform. For the general driving population, further work is needed to separate the variation of accident rates into the internal and external (proneness and exposure) components.
Further data of the type used here should also be analysed, if possible for larger samples and longer periods, although the two requirements tend to conflict. Theoretical work is also needed, particularly on methods of estimation and sampling distributions of the moments and correlations of the distributions of accident rates.

Acknowledgement: I am grateful to G. Maycock for many valuable discussions during the course of this work and to R. C. Peck of the California Department of Motor Vehicles for kindly supplying me with some unpublished data. The work described in this paper forms part of the programme of the Transport and Road Research Laboratory and the paper is published by permission of its Director.

Crown Copyright. The views expressed in this paper are not necessarily those of the Department of Transport. Extracts from the text may be reproduced, except for commercial purposes, provided the source is acknowledged.
REFERENCES

Anscombe, F. J. Sampling theory of the negative binomial and logarithmic series distributions. Biometrika 37: 358-382; 1950.
Bartlett, M. S. The spectral analysis of point processes. J. R. Stat. Soc. B 24: 264-281; 1963.
Bartlett, M. S. Mixed Cox processes, with an application to accident statistics. J. Appl. Prob. 23A: 321-333; 1986a.
Bartlett, M. S. Correction to above. J. Appl. Prob. 23: 848-849; 1986b.
Beran, J. A test of location for data with slowly decaying serial correlations. Biometrika 76: 261-269; 1989.
California Department of Motor Vehicles. The 1964 California Driver Record Study. Part 2: Accidents, traffic citations and negligent operator count by sex. Report 20. Sacramento, CA: DMV; 1965.
Cane, V. R. The concept of accident proneness. Bull. Inst. Math. (Bulgaria) 15: 183-189; 1972.
Chambers, E. G.; Yule, G. U. Theory and observation in the investigation of accident causation (with discussion). J. R. Stat. Soc. Supplement 7: 89-109; 1920.
Cobb, P. W. The limit of usefulness of accident rate as a measure of accident proneness. J. Appl. Psych. 24: 154-159; 1940.
Cox, D. R.; Isham, V. Point processes. London: Chapman and Hall; 1980.
Cresswell, W. L.; Froggatt, P. The causation of bus driver accidents - an epidemiological study. London: Oxford University Press; 1963.
Department of Transport, South Australia, Road Safety Division. Driver Improvement Study, Report S/86; 1986.
Fitzpatrick, R. The detection of individual differences in accident susceptibility. Biometrics 14: 50-68; 1958.
Forbes, T. W. The normal automobile driver as a traffic problem. J. Gen. Psychol. 20: 471-474; 1939.
Greenwood, M.; Woods, H. M. The incidence of industrial accidents upon individuals with special reference to multiple accidents. Report No. 4, Medical Research Committee, Industrial Fatigue Research Board (Great Britain). London: H.M.S.O.; 1919.
Greenwood, M.; Yule, G. U. An inquiry into the nature of frequency distributions representative of multiple happenings with particular reference to the occurrence of multiple attacks of disease or of repeated accidents. J. R. Stat. Soc. 83: 255-279; 1920.
Haight, F. A. Accident proneness, the history of an idea. Automobilismo e Automobilismo Industriale 4: 3Is; 1964.
Hurst, H. E. Long-term storage capacity of reservoirs. Trans. Am. Soc. Civil Eng. 116: 770; 1951.
Johnson, N. L.; Kotz, S. Distributions in statistics: Continuous univariate distributions-1. Boston: Houghton Mifflin; 1970.
Kemp, C. D. “Accident proneness” and discrete distribution theory. In: Patil, G. P., ed. Random counts in scientific work: Vol. 2, Random counts in biomedical and social sciences. University Park, PA and London: Pennsylvania State University Press; 1970.
Kwong, K. N.; Kuan, J.; Peck, R. C. Longitudinal study of California driver accident frequencies I: An exploratory multivariate analysis. Final Report 55, California Department of Motor Vehicles; 1976.
McKenna, F. P. Accident proneness: A conceptual analysis. Accid. Anal. Prev. 15: 42-52; 1983.
McMurray, L. Emotional stress and driving performance: The effect of divorce. Behavioural Research in Highway Safety 1: 100-114; 1970.
Mandelbrot, B. B. The fractal geometry of nature. New York: W. H. Freeman; 1983.
Mandelbrot, B. B.; Van Ness, J. W. Fractional Brownian motion, fractional noises and applications. SIAM Review 10: 422-437; 1968.
Maycock, G. Accident liability and human factors - researching the relationship. Traffic Engineering and Control 26: 330-335; 1985.
Miller, T. M.; Schuster, D. H. Long-term predictability of driver behaviour. Accid. Anal. Prev. 15: 11-22; 1983.
Newbold, E. M. Practical applications of the statistics of repeated events, particularly to industrial accidents (with discussion). J. R. Stat. Soc. 90: 487-547; 1927.
Peck, R. C. (California Department of Motor Vehicles). Personal communication, 19 November 1987.
Peck, R. C.; McBride, R. S.; Coppin, R. S. The distribution and prediction of driver accident frequencies. Accid. Anal. Prev. 2: 243-299; 1971.
Seal, H. L. Mixed Poisson processes and risk theory. Notes of a lecture course. University of Lausanne; 1980.
Sivak, M. Human factors and highway-accident causation: Some theoretical considerations. Accid. Anal. Prev. 13: 61-64; 1981.
Stewart, J. R.; Campbell, B. J. The statistical association between past and future accidents and violations. Chapel Hill, NC: Highway Safety Research Centre, University of North Carolina; 1972.
Weber, D. C. An analysis of the California Driver Record Study in the context of a classical accident model. Accid. Anal. Prev. 4: 109-116; 1972.