Reliability Engineering and System Safety 90 (2005) 75–82 www.elsevier.com/locate/ress
An alternative approach to trend analysis in accident data

J.T. Kvaløy, T. Aven*

Stavanger University, P.O. Box 8002, Stavanger 4068, Norway

Received 27 September 2004; accepted 23 October 2004. Available online 11 January 2005.
Abstract

Traditional statistical theory is the most common tool used for trend analysis in accident data. In this paper, we point out some serious problems with using this theory in a practical safety management setting. An alternative approach is presented and discussed, in which the focus is on observable quantities, and on expressing uncertainties regarding these, rather than on hypothetical probability distributions. © 2004 Elsevier Ltd. All rights reserved.

Keywords: Trend analysis; Accident data; Risk analysis; Predictive Bayesian approach
1. Introduction

To many people, risk is closely related to accident statistics. Numerous reports and tables are produced showing, for example, the number of fatalities and injuries resulting from accidents. The statistics may cover the total number of accidents associated with an activity within different consequence categories (loss of life, personal injuries, material losses, etc.), and they could be related to different types of accidents, for example accidents in industry, transport, etc. Often the statistics are related to time periods, and then time trends can be identified. More detailed information is also available in some cases, related to, e.g. occupation, sex, age, operations, type of injury, etc.

Do these data provide information about the future, about risk? Yes, of course; although the data are historical, in most cases they provide a good picture of what is to be expected in the future. If the numbers of accidental deaths in traffic in the previous 5 years have been 1000, 800, 700, 800, 750, we know much about risk, even though we have not explicitly expressed it by formulating predictions and uncertainties. This is risk related to the total activity, not to individuals. Depending on your driving habits, these records could be more or less representative for you.

* Corresponding author. Tel.: +47 51 832 267; fax: +47 51 831 750. E-mail address: [email protected] (T. Aven).

doi:10.1016/j.ress.2004.10.010
Accident statistics are used actively in industry. Regular records of, for example, the number of injuries (suitably defined) per working hour, or per any other relevant reference, for the company as a whole and broken down into relevant organisational units, are seen as an essential management tool. These numbers provide useful information about the safety and risk level within the relevant units. The data are historical, but assuming that the future performance of systems and human beings follows the same lines as the history shows, the data give reasonable estimates and predictions for the future. According to the literature, accident statistics can be used to:

† monitor the risk and safety level,
† give input to risk analyses,
† identify hazards,
† analyse accident causes,
† evaluate the effect of risk reducing measures,
† compare alternative areas of efforts and measures.
In this paper, we focus on how accident data can be used for monitoring the risk and safety level. One commonly used tool for this purpose is statistical trend analysis. The purpose of a trend analysis is to investigate whether a trend is present in the data, i.e. whether the data show an increase or decrease over time that is not due to 'randomness'. Suppose we have the observations given in Table 1. We assume that the number of working hours is constant for the time period considered.
Table 1
Number of injuries per month

Month                1  2  3  4  5  6
Number of injuries   1  2  1  3  3  4
The question is now whether the data show that a trend is present, i.e. that there is a worsening in the safety level which is not due to randomness. And if we can conclude that there is a trend, what are then its causes? Answering these questions provides a basis for identifying risk reducing measures which can reverse the negative development observed.

In classical statistical theory there exist a number of tests to reveal possible trends, see Section 2.1. The null hypothesis in such tests is no trend, and a considerable amount of data or a strong tendency in the data is required to reject this null hypothesis. In the example above, Table 1, we can observe some tendency of an increasing number of injuries as a function of time, but a statistical test would not 'prove' that we have a significant increase in injuries. The amount of data is too small; the tendency could be a result of randomness. To reject the null hypothesis a large change in the number of injuries would be required, but hopefully such a development would have been stopped long before the test gives the 'alarm'.

To increase the amount of data, we may include data on near misses and deviations from established procedures. There is, however, a complex connection between such data and risk. Such events can give a relatively good picture of where accidents might occur, but they do not necessarily give a good basis for quantifying risk. An increase in the number of near misses could be a result of a worsening of the safety level, but it could also be a result of increased reporting.

When the data show a negative trend as in Table 1 above, we should conclude immediately: there is a trend present, the number of events is increasing. Quick response is required, as any injury is unwanted. We should not explain the increase by randomness. We then need to question why this trend is observed and what we can do to reduce the number of injuries.

We conclude that in an active safety management regime, classical statistical methods cannot be used as an isolated instrument for analysing trends. We must include other information and knowledge than that given by the historical data. However, some sort of statistical testing of historical data could be one of the useful tools in a trend analysis process. For instance, classical statistical testing could be useful as a screening instrument for identifying where to concentrate the follow-up when studying a number of types of accidental events. Say that we have to look into data on more than 100 hazards. Then some kind of identification of the most 'surprising' results would be useful, and statistical testing could be used for this purpose.
Statistical testing could also be useful in other situations, for instance as an automatic tool giving an alarm when the number of injuries or other unwanted events shows a tendency that should cause concern. However, classical hypothesis tests are not generally suited for this purpose, as they only detect very strong trends. Alternative tests are needed. We also see serious problems with classical hypothesis testing in terms of interpretation and communication of results. In Section 2, these issues are further discussed and an alternative approach giving some alternative tests is presented.

Another concern that needs to be taken into account when performing trend analysis of accident data is data quality. A basic requirement for historical data is that they are correct, that they are reliable. In the example above, where the focus is on injuries, it would in many cases be difficult to make accurate measurements. Psychological and organisational factors could result in underreporting. As an example, we may think of an organisational incentive structure where absence of injuries is rewarded. Then we may experience that some injuries are not reported, as the incentive structure is interpreted as rewarding 'lack of reported injuries'. So judgments are required; we cannot base our conclusions on the data alone. Another measurement problem is related to the specification of relevant reference or normalising factors to obtain suitable accident or failure rates, for example the number of working hours, opportunities for failure, and so on. It is essential that it is made clear what the data provide information on. Is the injury rate seen as an indicator of the safety level as a whole? Probably there would be no discussion when saying that such a conclusion has low validity: injuries could be more or less serious, and the safety level is more than the number of injuries.

There is an extensive literature on the analysis of accident data. Some references to authors taking a classical statistical approach are [1–3]. Bayesian approaches are, for instance, presented by Martz and Waller [4], Barlow [5] and Bedford and Cooke [6]. A critical discussion of both the classical and the common Bayesian approach to risk analysis is given in the book by Aven [7]. Problems with these approaches when it comes to philosophy, practical usefulness and communication of results are addressed in the book, and an alternative approach called 'the predictive Bayesian approach' is suggested. In the present paper, we look further into these issues for the trend analysis problem.

In the analyses performed in Section 2 we have not normalised the data. This is done to avoid too many technicalities. In practice such normalisation is required, but it is not important for the purpose of this paper. It is trivial to adjust the methods to the normalised case. The data we analyse in Section 2 are typical data from a real life application (the Risk Level Norwegian Sector project) for hazards occurring on offshore installations on the Norwegian Continental Shelf [8]. The alternative screening method
presented in Section 2.2.1 has been used for many years within this project. The conclusions drawn from the screening method are consistent with typical observations made in this project. Of course, additional information about the specific situation is required to draw conclusions about the causes of an observed trend and possible measures to reduce the accident rate.
2. Identifying trends

In this section, we look in more detail into the problem of identifying trends, for instance in accident data or in data on critical or unwanted events. As discussed in Section 1, statistical testing for strong trends is, for example, useful as a screening instrument for identifying where to concentrate the resources in cases where many hazards are examined. In other cases, for instance if we look at very critical events, it might also be useful to have an approach for detecting more moderate trends. We first present the classical approach, which many practitioners will tend to use for identifying trends, and discuss the problems we see with this approach. Then an alternative approach is outlined.

2.1. The classical approach

Let x_1, …, x_r denote the number of events of a certain type observed in r consecutive time periods of equal length, for instance in r consecutive years. We want to decide if there is a trend in the data x_1, …, x_r. In a classical framework, a typical approach is to assume that the data are realisations of Poisson distributed random variables with an expectation which, if there is a trend, may differ from year to year. A statistical hypothesis test is then used to test whether it can be claimed that the expectation varies from year to year. Several approaches can be used, see for instance [9–11] and references therein. Poisson regression, comparison of means and chi-square tests are some of the approaches that could be used. Here we look closer at the Poisson regression approach. In this approach, a parametric model for the expected number of events as a function of time is defined. A typical choice of model would be exp(α + βj), where j is the year number. In this model, β ≠ 0 implies that there is a trend in the data, and a test of the null hypothesis β = 0 versus the alternative β ≠ 0 can easily be constructed using standard statistical theory.
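To make the classical approach concrete, the following is a minimal sketch of the Poisson regression trend test just described. The paper does not prescribe an implementation, so the function name and the choice of numpy and statsmodels are ours.

# A minimal sketch of the classical Poisson regression trend test
# described above (implementation choices ours, not the paper's).
import numpy as np
import statsmodels.api as sm

def poisson_trend_test(counts):
    """Fit E[x_j] = exp(alpha + beta*j) and test beta = 0 against beta != 0.

    Returns the estimated beta and its two-sided p-value; the null
    hypothesis of no trend is rejected when the p-value is small.
    """
    counts = np.asarray(counts)
    years = np.arange(1, len(counts) + 1)
    X = sm.add_constant(years)  # columns: intercept (alpha) and year (beta)
    fit = sm.GLM(counts, X, family=sm.families.Poisson()).fit()
    return fit.params[1], fit.pvalues[1]

# The injury data of Table 1: some upward tendency, but (as argued in
# Section 1) too little data for the test to declare a significant trend.
beta, p = poisson_trend_test([1, 2, 1, 3, 3, 4])
print(f"beta = {beta:.2f}, p-value = {p:.2f}")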
We see several problems with this classical approach. One problem is the fundamental philosophy underlying it. For each year, a hypothetical probability distribution referring to the long run frequency of outcomes is defined. However, this is only a mental construction; no such distribution exists. Each year is observed only once. But in the classical approach this hypothetical distribution is constructed and is thought to describe some true underlying mechanism generating the data. Moreover, typically a specific parametric model is specified for this distribution. This creates two problems. First, the situation is not reproducible; there does not exist any true distribution. Secondly, if we believe in an underlying probability distribution, how do we know that the defined parametric model is the true model, or at least close to the true model?

Another problem is related to the practical usefulness of the classical approach. As already discussed in Section 1, classical hypothesis testing can only be used to detect a strong trend in the data. In the parametric approach sketched above it is easy to test the null hypothesis β = 0 versus the alternative β ≠ 0. If the null hypothesis is rejected, this means we have strong evidence that β ≠ 0. Rejection limits are defined such that the probability of falsely rejecting the null hypothesis, also called the probability of committing a type I error, is small (typically 5–10%). This implies that the null hypothesis will only be rejected when we have strong evidence that β ≠ 0. However, this means that in many cases it will not be detected that β ≠ 0 even if that is the case. Not detecting that the null hypothesis is wrong is called a type II error. Unfortunately, it is not possible to construct any classical hypothesis test of the null hypothesis of trend versus the alternative of no trend. So classical hypothesis testing is only useful for detecting strong trends. In practice, however, we would in many cases like to have an alarm bell which detects trends that are less strong than those detected by a classical hypothesis test. If we, for instance, are looking at critical events like the number of injuries or fatalities, we should point out trends as possible deteriorations in the safety level long before the increases are so overwhelming that they yield a rejection in the classical hypothesis test.

2.2. Alternative thinking

We have pointed out several problems with the classical approach above, both in terms of practical usefulness and interpretation. Instead of constructing probability models referring to imagined, non-existing populations, we suggest using the predictive Bayesian approach outlined in [7,12]. In this approach the focus is on observable quantities, and probability distributions are only used for expressing epistemic uncertainties. Thus in the present case the focus should be on the actual observed event data, and probability distributions are only used to express our uncertainties regarding the number of events occurring during a time period, based on available information. For instance, if we have some event data from previous years and no other information, Aven and Kvaløy [13] argue that a reasonable way of expressing uncertainties regarding the number of events next year, assuming no change in the safety level, is the following. Say that we have observed 6, 9 and 9 events in the previous 3 years. Then we could express our uncertainty regarding the number of events the next year by a Poisson distribution with mean (6 + 9 + 9)/3 = 8. If the number of events the next year falls outside the 90% prediction interval [4, 13] given by the Poisson distribution with mean 8, this is seen as an indication of a change in the safety level.
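As a quick check, the interval [4, 13] quoted above can be reproduced with the Poisson quantile function; a small sketch (the choice of scipy is ours):

# Reproducing the 90% prediction interval for next year's event count,
# given 6, 9 and 9 events in the previous three years.
from scipy.stats import poisson

mean = (6 + 9 + 9) / 3            # = 8, the average of the previous years
lower = poisson.ppf(0.05, mean)   # 5% quantile
upper = poisson.ppf(0.95, mean)   # 95% quantile
print(lower, upper)               # 4.0 13.0, i.e. the interval [4, 13]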
2.2.1. Screening for trends

As a simple screening method for trends in historical data when no other relevant information is available, we propose the following procedure:

1. Calculate the average number of hazards during the years 1, 2, …, j: m_j = (x_1 + … + x_j)/j.
2. Use the Poisson distribution with mean (r − j)m_j to calculate a 90% prediction interval for the total number of hazards during the years j+1, j+2, …, r, and check whether the observation x_{j+1} + … + x_r falls in this interval or not. If the observation falls outside the interval, an alarm is given.
3. Repeat steps 1–2 for some or all of j = r−1, r−2, …, 1.

The Poisson distribution with mean equal to the average of the previous years is used to assess the uncertainty regarding the number of hazards in the next years of the observation period. As we realise from step 3, a number of variants of this screening procedure could be used. Which values of j we run the procedure for depends on the main focus in the situation under study. If, for instance, the primary concern is the latest observation(s), the procedure is stopped when the prediction intervals for one or a few of the latest periods have been calculated. If the main concern is the development over the whole interval, the full procedure could be run. Or, for instance, a comparison of the first half to the last half could be made.

Notice that procedures that are technically similar to the procedure presented here could also be formulated within a classical statistical framework. The main novelty here is the context in which the procedure is presented, the predictive Bayesian framework, where the data are compared to uncertainty distributions expressing our uncertainty regarding what the future number of hazards would be in a no trend situation. In terms of interpretation this is very different from a seemingly similar procedure in a classical setting.

As an example, suppose we have the data 6, 9, 9, 12, 13. Then, following the complete screening procedure, we would first use a Poisson distribution with mean 9 and obtain a 90% prediction interval [5, 14] for the number of hazards in the last year. Thus no alarm is given. Next, a Poisson distribution with mean 2·8 = 16 is used to obtain a 90% prediction interval [10, 23] for the two last years, and with the observation 12 + 13 = 25 an alarm is given here. The interval for the 3 last years based on the two first becomes [15, 30], with the observation 34 giving an alarm, and for the last 4 years the interval becomes [16, 32], with the observation 43 again giving an alarm. Thus, the message from the screening procedure here is a clearly increasing trend over time, and it should be examined further what caused this trend. Note that if the prime concern is the last year, the interval for only this year gives no alarm, but an alarm is given when the last 2 years are considered together compared to the previous years.
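The following is a sketch of this screening procedure applied to the worked example (function and variable names are ours). With discrete Poisson quantiles, the computed interval limits can differ by one from the quoted intervals depending on the exact interval convention, but the alarms for the example data come out the same.

# A sketch of the screening procedure of Section 2.2.1 (names ours).
# For each split year j, the average of years 1..j gives the mean of a
# Poisson uncertainty distribution for the total count of the remaining
# r-j years; an alarm is raised if the observed total falls outside the
# 90% prediction interval.
import numpy as np
from scipy.stats import poisson

def screen_for_trend(x, coverage=0.90):
    x = np.asarray(x, dtype=float)
    r = len(x)
    tail = (1 - coverage) / 2
    for j in range(r - 1, 0, -1):          # step 3: j = r-1, r-2, ..., 1
        m_j = x[:j].mean()                 # step 1: average of years 1..j
        mu = (r - j) * m_j                 # predicted total for years j+1..r
        lo = poisson.ppf(tail, mu)         # step 2: 90% prediction interval
        hi = poisson.ppf(1 - tail, mu)
        observed = x[j:].sum()
        alarm = not (lo <= observed <= hi)
        print(f"j={j}: interval [{lo:.0f}, {hi:.0f}], "
              f"observed {observed:.0f}, alarm: {alarm}")

# The example data: no alarm for the last year alone, alarms when the
# last 2, 3 and 4 years are compared with the preceding years.
screen_for_trend([6, 9, 9, 12, 13])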
Fig. 1. Plot of the number of hazards each year for the example data. A 90% prediction interval calculated using a Poisson distribution with mean equal to the average number of hazards per year is indicated by the dashed lines.
Visualising the data, for instance by plotting histograms as displayed in Fig. 1 for the example data, is also recommended. The dashed lines in this plot represent the lower and upper limits of a 90% prediction interval using a Poisson distribution with mean equal to the average of all five observations. This is the interval expressing our uncertainty regarding the number of hazards the next year assuming a 'no strong trend' situation, but if any of the observations fall outside this interval this is also a strong message of a trend in the historical data. In the present example, none of the observations fall outside the interval, but the clear and steady increase in the number of hazards within the interval limits also gives a clear message of a trend, and this trend is detected by the screening procedure above.

2.2.2. Testing for trend

Another suggested way of thinking for detecting a trend in historical data is the following. Let x_1, …, x_r be the number of events in years 1, …, r. A measure of trend in these data can, for instance, be constructed by comparing averages of different parts of the data. One suggestion for such a measure is

T_1 = \sum_{j=1}^{r-1} \left( \frac{1}{j} \sum_{i=1}^{j} x_i - \frac{1}{r-j} \sum_{i=j+1}^{r} x_i \right).

Here the averages of the first and last parts of the data are compared for all divisions of the data into a 'first' and a 'last' part. If we have an increasing trend in the data, T_1 will become negative, and the clearer the trend is, the larger the negative value of T_1 will be. In the case of a decreasing trend things will be opposite. But how large a negative or positive value should T_1 have before we conclude that there is a trend in the data?
One way to decide that is the following. We have observed a total of n = x_1 + … + x_r events and an average of x̄ = n/r events per year. Thus, in a no trend situation we would express our uncertainty regarding the number of events each year by Poisson distributions with mean x̄, conditional on the sum of the number of events being n. To decide whether the observed value of T_1 is unusually large or small compared to what we would expect in a no trend situation, we can generate a large number of new datasets x_1*, …, x_r* from the uncertainty distribution conditionally on x_1* + … + x_r* = n. For each of these datasets we calculate T_1 and compare the observed value of T_1 from the original data with the T_1-values obtained from the generated datasets. If the observed value of T_1 from the original data is among the largest or smallest, say, 5% of the values for the generated data, we conclude that there is a trend in the original data.

For instance, in the case with data 6, 9, 9, 12 and 13 we find that T_1 = −17.1. To decide whether this is a small enough value to conclude that there is a trend in the data, we generated 10,000 new datasets of size 5 from the Poisson distribution with mean (6 + 9 + 9 + 12 + 13)/5 = 9.8, conditioned on the sum of the generated data being 49 as in the original dataset. Calculating T_1 for all these generated datasets, we find that −17.1 is among the lowest 5% of the T_1-values for the generated data, and we thus conclude that there is an increasing trend in the data.

Generating r new data x_1*, …, x_r* from the Poisson distribution with mean x̄ conditional on x_1* + … + x_r* = n is not difficult. This is simply done by generating x_1*, …, x_r* from the multinomial distribution with r outcomes of equal probability 1/r and with the number of trials equal to n. See also [5, chap. 3].
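Putting the pieces together, the following sketch (function names ours) computes T_1 and carries out the conditional multinomial simulation for the worked example:

# A sketch of the T1 test of Section 2.2.2 (names ours).
import numpy as np

def T1(x):
    # sum over all splits of (average of first part - average of last part)
    x = np.asarray(x, dtype=float)
    return sum(x[:j].mean() - x[j:].mean() for j in range(1, len(x)))

def t1_test(x, n_sim=10_000, tail=0.05, seed=1):
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    r, n = len(x), x.sum()
    # Poisson sampling conditional on the total being n is a multinomial
    # draw with n trials and r equally likely outcomes.
    sims = rng.multinomial(n, np.full(r, 1.0 / r), size=n_sim)
    t_sim = np.array([T1(s) for s in sims])
    t_obs = T1(x)
    # a trend is declared if t_obs lies among the smallest or largest
    # 100*tail per cent of the simulated values
    trend = (t_obs <= np.quantile(t_sim, tail)
             or t_obs >= np.quantile(t_sim, 1 - tail))
    return t_obs, trend

t_obs, trend = t1_test([6, 9, 9, 12, 13])
print(f"T1 = {t_obs:.1f}, trend detected: {trend}")  # T1 = -17.1, as in the paper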
Many other measures of trend than T_1 could be constructed. If we, for instance, are interested in comparing the first half of the data to the second half, we could look at

T_2 = \begin{cases} \dfrac{1}{r/2} \sum_{i=1}^{r/2} x_i - \dfrac{1}{r/2} \sum_{i=r/2+1}^{r} x_i, & r \text{ even}, \\ \dfrac{1}{(r+1)/2} \sum_{i=1}^{(r+1)/2} x_i - \dfrac{1}{(r-1)/2} \sum_{i=(r+1)/2+1}^{r} x_i, & r \text{ odd}, \end{cases}

or, if we look at longer sequences and in addition to monotonic trends would also like to detect non-monotonic trends, we could look at

T_3 = \sum_{j=1}^{r-1} \left( \frac{1}{j} \sum_{i=1}^{j} x_i - \frac{1}{r-j} \sum_{i=j+1}^{r} x_i \right)^2.
With this measure we would only claim that there is a trend if T_3 is large compared to the values calculated from the simulated data, for instance among the 10% largest values. Another measure that should have the ability to detect both monotonic and non-monotonic trends is

T_4 = \sum_{i=1}^{r} (x_i - \bar{x})^2,
where we simply compare each observation with the mean. As for T_3, we only claim that there is a trend if T_4 is large compared to the values calculated from the simulated data.

For some of the tests presented here it is probably possible to determine theoretically the large sample distributions of the test measures in the no trend situation. However, since the suggested simulation approach is very easy to perform and also covers the small sample situation, which is the main interest in this paper, we shall not elaborate further on such distributions. Note that tests technically similar to the above tests could also be constructed within a classical framework. The difference would be in the interpretation of the tests, as discussed earlier in the paper. The test based on T_4 is similar to a chi-square type test in the classical setting, and the test based on T_2 is similar to tests based on comparison of means as, for instance, discussed in [11] and references therein.

2.3. Simulations and case studies

We have done some case studies and simulations to study the practical performance of the methods for detecting trend presented in Section 2.2 and the classical Poisson regression test approach presented in Section 2.1. We have looked at the properties of the tests both from a predictive Bayesian point of view and from a classical frequentistic point of view. In the classical test, a 10% significance level is used. For the trend screening approach presented in Section 2.2.1, we have used both the full screening approach and the variant checking the last r−a years based on information from the first a years, for various values of a. For the T-test approach presented in Section 2.2.2, we have generated 1000 new datasets each time to decide whether there is a trend or not. We first present an evaluation of the tests done in the classical frequentistic framework. Then we point out the limitations and conceptual problems we see with this classical type of evaluation and present an alternative evaluation done in the predictive Bayesian framework.

2.3.1. An evaluation in the classical frequentistic framework

In the classical frequentistic approach outlined in Section 2.1, there is thought to be some true underlying probability distribution. When we observe data, the data are thought to be a realisation from a population of data described by this underlying probability distribution. If the mean of this probability distribution changes from year to year, there is said to be a trend. In this paradigm, it is easy to study properties of various procedures or tests for detecting a trend by simulation. A large number of datasets can be generated from the same sequence of probability distributions, and the proportion of times each test detects a trend can be calculated.
Table 2
Proportion of times each of the tests detects a trend in 5000 datasets of size 5 generated from a sequence of 5 Poisson distributions with means as indicated in the table. C is the classical test, T1–T4 the simulation based tests of Section 2.2.2, and Full, 1-4, …, 4-1 the screening variants of Section 2.2.1 ('a-b' checks the last b years based on the first a years).

Means               C     T1    T2    T3    T4    Full  1-4   2-3   3-2   4-1
(2,2,2,2,2)         0.09  0.09  0.05  0.10  0.07  0.58  0.45  0.24  0.14  0.07
(7,7,7,7,7)         0.10  0.09  0.07  0.10  0.10  0.61  0.43  0.27  0.17  0.10
(20,20,20,20,20)    0.10  0.10  0.08  0.10  0.10  0.63  0.44  0.28  0.18  0.12
(1,2,3,4,5)         0.57  0.56  0.36  0.55  0.36  0.93  0.82  0.77  0.59  0.36
(7,8,9,10,11)       0.27  0.28  0.20  0.27  0.19  0.78  0.59  0.48  0.37  0.21
(11,10,9,8,7)       0.27  0.27  0.19  0.26  0.18  0.72  0.53  0.42  0.31  0.16
(2,4,6,8,10)        0.84  0.84  0.66  0.83  0.67  0.99  0.95  0.92  0.82  0.56
(8,10,12,14,16)     0.58  0.57  0.42  0.56  0.39  0.91  0.76  0.73  0.61  0.39
(7,10,13,16,19)     0.84  0.84  0.67  0.83  0.66  0.99  0.93  0.92  0.82  0.57
(7,7,7,7,10)        0.19  0.19  0.12  0.21  0.17  0.69  0.45  0.32  0.27  0.27
(6,6,6,8,8)         0.18  0.17  0.16  0.16  0.14  0.70  0.48  0.36  0.32  0.15
(7,7,7,12,12)       0.48  0.47  0.51  0.47  0.38  0.87  0.58  0.60  0.70  0.33
(21,22,23,24,25)    0.16  0.16  0.13  0.16  0.13  0.69  0.49  0.35  0.26  0.16
(20,22,24,26,28)    0.36  0.35  0.27  0.35  0.23  0.82  0.62  0.55  0.43  0.28
This proportion is an estimate of what is called the rejection probability of the test for the given sequence of probability distributions. If the sequence of probability distributions has a trend (i.e. the means of the various distributions differ), the rejection probability is called the power of the test for the situation under study, and should be large. If there is no trend, the rejection probability equals the type I error probability, called the significance level of the test, and should be close to a small prespecified number (typically 0.01, 0.05 or 0.10).

In the simulations presented here, we have generated data from various sequences of Poisson distributions. In each situation, 5000 datasets have been generated. The results of generating various datasets of size 5 are reported in Table 2, while some results of generating a few datasets of size 10 are reported in Table 3. For the screening procedures, '1-4' means checking the last 4 years based on information from the first year, and so on.
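The evaluation itself is easy to reproduce in outline; a sketch (names ours), written for a generic test function so that, for example, the t1_test sketch from Section 2.2.2 can be plugged in:

# A sketch of the frequentistic evaluation (names ours): generate many
# datasets from a given sequence of Poisson means and record how often
# a given procedure declares a trend.
import numpy as np

def rejection_proportion(means, detects_trend, n_datasets=5000, seed=1):
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    hits = 0
    for _ in range(n_datasets):
        x = rng.poisson(means)   # one dataset, one count per year
        hits += bool(detects_trend(x))
    return hits / n_datasets

# With equal means this estimates the significance level; with unequal
# means it estimates the power (cf. Tables 2 and 3). For example, with
# the t1_test sketch in scope:
#   rejection_proportion([7, 8, 9, 10, 11], lambda x: t1_test(x)[1])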
In the frequentistic approach, a standard requirement for a hypothesis test is that the probability of false rejection, the type I error, should be small, i.e. the rejection probabilities for the cases with no trend, the significance level of the tests, should be small, preferably equal to the chosen level, which in this study is 10%. In the first three lines of Table 2 and the first line of Table 3, we see that the classical test, denoted C, the T_1-, T_3- and T_4-tests and the screenings 4-1 and 8-2 have a significance level close to 10%. The T_2-test is too conservative, while the other tests have too high rejection probabilities in the no trend situations according to these classical considerations. Among the tests without too high a significance level, the classical test and the T_1- and T_3-tests generally have the highest probabilities of detecting trends, i.e. they have the highest power among the tests with acceptable significance level properties. One exception is the case with means (7,7,7,7,10), for which screening 4-1 is particularly suited and is the best procedure among those with acceptable significance level properties. In the non-monotonic cases in Table 3, the T_3- and T_4-tests are, as expected, the best tests.
Table 3
Proportion of times each of the tests detects a trend in 5000 datasets of size 10 generated from a sequence of 10 Poisson distributions with means as indicated in the table. Columns as in Table 2; '2-8' etc. are screening variants checking the last 8 years based on the first 2 years, and so on.

Means                          C     T1    T2    T3    T4    2-8   4-6   6-4   8-2
(7,7,7,7,7,7,7,7,7,7)          0.10  0.10  0.07  0.10  0.10  0.45  0.27  0.18  0.11
(6,6,7,7,8,8,9,9,10,10)        0.46  0.44  0.34  0.41  0.24  0.69  0.65  0.53  0.33
(8,8,8,8,8,10,10,10,10,10)     0.23  0.22  0.24  0.21  0.15  0.52  0.46  0.34  0.18
(10,9,8,7,6,6,7,8,9,10)        0.12  0.13  0.08  0.28  0.24  0.57  0.32  0.22  0.25
(14,12,10,8,6,6,8,10,12,14)    0.14  0.15  0.08  0.62  0.58  0.75  0.40  0.35  0.47
(6,8,10,12,14,14,12,10,8,6)    0.06  0.05  0.09  0.58  0.58  0.83  0.44  0.31  0.48
In a standard frequentistic evaluation, the test procedures with significance levels higher than 10% would be judged to be more or less useless and would be called non-conservative. However, if we take a somewhat more pragmatic view and keep in mind those applications where it is very important to detect even the less strong trends, i.e. the situations where it is important to have a small type II error, we should also take a look at the other tests. In such cases, variants of the screening procedure could be useful, for instance the 3-2 and 6-4 screenings. The full screening procedure and procedures like 1-4 and 2-8 seem to give an alarm too often.
2.3.2. An evaluation in the predictive Bayesian framework

The problem with the frequentistic evaluation is that it refers to something non-existing, an imagined sequence of probability distributions. The simulations in Section 2.3.1 still tell us something about the practical properties of the tests; it is, for instance, clear that the full screening procedure will detect much more moderate trends than the T_2-test, but further interpretations of the results are unclear. Seen from the predictive Bayesian point of view, it is, for instance, meaningless to talk about things like rejection probabilities. For a given dataset, a procedure either says that there is a trend in the data or not. So, to evaluate the properties of the various procedures for detecting a trend from a predictive Bayesian point of view, we will not generate a large number of datasets from some probability distributions. Instead, we will look at a number of different datasets with more or less strong trends and in each case see which of the procedures detect the trend and which of them do not. The results for some datasets of size 5 and 10 are reported in Tables 4 and 5. These tables display an extract from a larger exploration, in which sequences of datasets with a systematic difference were studied (a number was added to, or multiplied with, all data for each new dataset).

When we study Tables 4 and 5 we realise several things. It is always a screening test that is the most sensitive test; which one varies depending on the type of trend. In several of the cases it is also one of the screening tests that is the least sensitive test. The classical test and the T_1-test give very similar results. The T_3-test is a bit less sensitive to monotonic trends but, of course, more sensitive to non-monotonic trends. The T_2- and T_4-tests are generally less sensitive than the T_1- and T_3-tests.
2.3.3. Conclusion

Based on the evaluations in Sections 2.3.1 and 2.3.2, there is no single test or procedure for detecting trend that can be recommended over the others in all situations. Which test is most relevant depends on the situation. If the point of a study is to detect only strong trends, in order to concentrate resources on the 'worst' cases, the T_1 or T_3 tests could be a good choice in the general case.
Table 4
The table indicates whether each test detects a trend (1) or not (0) for each dataset. Columns as in Table 2.

Data                C  T1  T2  T3  T4  Full  1-4  2-3  3-2  4-1
(0,1,2,3,4)         1  1   0   1   0   1     1    1    1    0
(1,2,3,4,5)         1  1   0   0   0   1     1    1    1    0
(2,3,4,5,6)         0  0   0   0   0   1     1    1    1    0
(3,4,5,6,7)         0  0   0   0   0   1     1    1    0    0
(6,7,8,9,10)        0  0   0   0   0   1     1    0    0    0
(8,9,10,11,12)      0  0   0   0   0   0     0    0    0    0
(0,2,4,6,8)         1  1   1   1   1   1     1    1    1    1
(2,4,6,8,10)        1  1   1   1   0   1     1    1    1    1
(4,6,8,10,12)       1  1   1   1   0   1     1    1    1    0
(6,8,10,12,14)      1  1   1   1   0   1     1    1    1    0
(8,10,12,14,16)     1  1   0   1   0   1     1    1    1    0
(12,14,16,18,20)    0  0   0   0   0   1     1    1    1    0
(14,16,18,20,22)    0  0   0   0   0   1     1    1    0    0
(24,26,28,30,32)    0  0   0   0   0   1     1    0    0    0
(36,38,40,42,44)    0  0   0   0   0   0     0    0    0    0
(0,0,0,3,3)         1  1   1   1   0   1     1    1    1    1
(2,2,2,5,5)         0  0   0   0   0   1     1    1    1    0
(4,4,4,7,7)         0  0   0   0   0   1     0    0    1    0
(6,6,6,9,9)         0  0   0   0   0   0     0    0    0    0
(10,10,10,13,13)    0  0   0   0   0   0     0    0    0    0
(2,2,2,4,4)         0  0   0   0   0   0     0    0    0    0
(3,3,3,6,6)         0  0   0   0   0   1     0    1    1    0
(4,4,4,8,8)         0  0   0   0   0   1     1    1    1    0
(5,5,5,10,10)       1  1   1   1   0   1     1    1    1    0
(6,6,6,12,12)       1  1   1   1   0   1     1    1    1    0
Table 5
The table indicates whether each test detects a trend (1) or not (0) for each dataset. Columns as in Table 3.

Data                              C  T1  T2  T3  T4  2-8  4-6  6-4  8-2
(6,6,6,6,6,6,6,9,9,9)             0  0   0   0   0   0    0    1    0
(10,10,10,10,10,10,10,15,15,15)   1  0   0   0   0   0    1    1    1
(12,12,12,12,12,12,12,18,18,18)   1  1   0   0   0   1    1    1    1
(14,14,14,14,14,14,14,21,21,21)   1  1   0   1   0   1    1    1    1
(16,16,16,16,16,16,16,24,24,24)   1  1   1   1   0   1    1    1    1
(6,6,4,4,2,2,4,4,6,6)             0  0   0   0   0   1    0    0    0
(12,12,8,8,4,4,8,8,12,12)         0  0   0   0   0   1    0    0    1
(15,15,10,10,5,5,10,10,15,15)     0  0   0   1   0   1    1    0    1
(18,18,12,12,6,6,12,12,18,18)     0  0   0   1   1   1    1    0    1
(21,21,14,14,7,7,14,14,21,21)     0  0   0   1   1   1    1    1    1
However, if the main concern is the number of events in the last 1 or 2 years, variants of the screening procedure checking the last 1 or 2 years based on the information from the previous years could be the best procedure. Alternatively, another variant of the T-test, designed for testing the specific concern, could be constructed. If it is important to detect also the less strong trends, a variant of the screening procedure should be used. Maximum sensitivity is gained by using the full screening procedure. The problem is, of course, that alarms might then be given too often. If particular things need to be checked, or a bit less sensitivity is required, other variants of the screening procedure could be used.
3. Summary

In this paper, we have discussed the problem of analysing accident data with respect to trends in the data. As part of such a trend analysis, statistical tests could be useful, for instance for detecting the worst cases in situations where many hazards are examined, or as an automatic alarm bell giving a signal when an unwanted change in the safety level is developing. Many practitioners would tend to use classical hypothesis testing for this purpose. We do, however, see several problems with the classical approach, both in terms of practical usefulness and interpretation, and instead suggest using a predictive Bayesian approach. We have suggested several trend tests within this approach, some of which in practical use give results similar to classical tests, and some of which are able to detect more moderate trends than the classical tests.
Acknowledgements The authors are grateful to the reviewers for valuable comments and suggestions.
References

[1] Crowder MJ, Kimber AC, Smith RL, Sweeting TJ. Statistical analysis of reliability data. London: Chapman & Hall; 1991.
[2] Høyland A, Rausand M. System reliability theory: models and statistical methods. New York: Wiley; 1994.
[3] Meeker WQ, Escobar LA. Statistical methods for reliability data. New York: Wiley; 1998.
[4] Martz HF, Waller RA. Bayesian reliability analysis. New York: Wiley; 1982.
[5] Barlow RE. Engineering reliability. Philadelphia: SIAM; 1998.
[6] Bedford T, Cooke RM. Probabilistic risk analysis. Cambridge: Cambridge University Press; 2001.
[7] Aven T. Foundations of risk analysis. New York: Wiley; 2003.
[8] Vinnem JE, Tveit O, Aven T, Ravns E. Use of risk indicators to monitor trends in major hazard risk on a national level. In: Proceedings of the European safety and reliability conference (ESREL), Lyon; 2002. p. 512–8.
[9] Frome EL, Kutner MH, Beauchamp JJ. Regression analysis of Poisson-distributed data. J Am Stat Assoc 1973;68:935–40.
[10] Magel R, Wright FT. Tests for and against trends among Poisson intensities. Inequal Stat Probab 1984;5:236–43.
[11] Krishnamoorthy K, Thomson J. A more powerful test for comparing two Poisson means. J Stat Plan Infer 2004;119:23–35.
[12] Apeland S, Aven T, Nilsen T. Quantifying uncertainty under a predictive epistemic approach to risk analysis. Reliab Eng Syst Saf 2002;75:93–102.
[13] Aven T, Kvaløy JT. Implementing the Bayesian paradigm in risk analysis. Reliab Eng Syst Saf 2002;78:195–201.