
Safety Science 31 (1999) 161-179

Methodological criteria for evaluating occupational safety intervention research 1

Harry S. Shannon a,b,*, Lynda S. Robson a, Stephen J. Guastello c

a Institute for Work and Health, 250 Bloor St. E., 7th floor, Toronto, Ontario, Canada M4W 1E6
b McMaster University, Hamilton, Ontario, Canada
c Marquette University, Milwaukee, WI, USA

Received 4 March 1998

Abstract

We describe the importance of evaluating workplace safety interventions. Based on the literature and other sources, we list eight areas in which readers can assess the quality of reports evaluating these interventions. The areas are: intervention objectives and their conceptual basis; study design; external validity; outcome measurement; use of qualitative data; threats to internal validity; statistical analysis; and study conclusions. Good quality evaluations can help avoid wasting limited time, money and effort on ineffective or even harmful interventions. © 1999 Elsevier Science Ltd. All rights reserved.

1. Introduction

The safety literature and anecdotal evidence suggest that many interventions in occupational safety are implemented with the sincere hope that they work, but with a lack of solid evidence of their effectiveness, especially in the area of safety training and education (Hale, 1984; Office of Technology Assessment, 1985; Hoyos and Zimolong, 1988; Johnston et al., 1994; Schulte et al., 1996). However, the importance of evaluating safety interventions is suggested by evidence from road accident studies, in which the actual effect of safety measures can sometimes be to make the situation worse (Evans, 1985). Clearly, there is a problem when our interventions do more harm than good, or even when they are simply ineffective, since limited resources are being wasted which could be put to better use elsewhere.

* Corresponding author. Tel.: +1-416-927-2027 ext. 2126; fax: +1-416-927-4167; e-mail: hshannon@iwh.on.ca.
1 On behalf of the Scientific Committee on Accident Prevention (SCOAP) of the International Commission on Occupational Health (ICOH).


Despite the importance of evaluating safety measures to the evolving practice of occupational safety, relatively little of this type of research has been published. An audit of the 1996 issues of nine relevant public health/occupational health and safety journals, including the Journal of Safety Research, Safety Science, and Accident Analysis and Prevention, showed that of 54 papers on occupational injuries, just four reported on interventions. There is a definite need for safety researchers to increase their efforts in evaluating the effectiveness of interventions. Even when evaluations are done, they are often of limited quality (Ellis, 1975; Office of Technology Assessment, 1985; Vojtecky and Schmitz, 1986; Hoyos and Zimolong, 1988; Goldenhar and Schulte, 1994). Further, improving the quantity and quality of evaluations was a recurrent theme at the recent National Institute for Occupational Safety and Health (NIOSH)-sponsored National Occupational Injury Research Symposium (Morgantown, WV, October 1997).

However, the principles of good evaluation have been well established. Thus, what we present in this paper will not be new material. Rather, it represents the application to occupational safety of ideas from evaluation in other fields, such as health promotion, clinical medicine and education. Methodological criteria for evaluating reports of interventions have been developed for reviewing the literature in clinical medicine and health promotion (Guyatt et al., 1993; Shekelle et al., 1994; Dobbins et al., 1996; van Tulder et al., 1997) and childhood injury prevention interventions (Harborview Medical Center Injury Prevention and Research Center, 1997). We do not know of such criteria specifically for occupational injury prevention, although Heacock et al. (1997) have suggested criteria for reviewing all types of ergonomics studies. Goldenhar and Schulte (1994, 1996), Guastello (1993) and Zwerling et al. (1997) have contributed valuable reviews of safety interventions and in doing so identified and discussed some important methodological issues, also applying evaluation concepts developed in other fields, but they did not actually use or suggest a set of methodological criteria. We have drawn on the ideas of these researchers, the experience of the members of the Scientific Committee for Accident Prevention of the International Commission on Occupational Health, and the principles of program evaluation to construct our set of criteria (Table 1). We believe our report should help those who read effectiveness evaluations to understand the trustworthiness and relevance of the results. We also expect that those planning evaluations will find it helpful to be reminded of the different methodological areas they need to consider in designing their projects.

Before moving to the methodological criteria, a few terms will be defined in the context of workplace safety interventions. The effectiveness of an intervention is the degree to which it causes an effect under realistic workplace conditions. This contrasts with the efficacy of an intervention, which is the degree to which it causes an effect under ideal conditions. Internal validity is the legitimacy with which we infer that a given intervention did (or did not) produce an effect on the safety outcome measured, i.e. whether we can believe in the results. This depends on factors such as the comparability of groups under study.
Thus, the degree of confidence that one can place in conclusions regarding the effectiveness or efficacy of an intervention relies on the internal validity of the study.


Table 1
Methodological criteria for evaluating occupational safety intervention research

Program objectives and conceptual basis
• Were the program objectives stated?
• Was the conceptual basis of the program explained and sound?

Study design
• Was an experimental or quasi-experimental design employed instead of a non-experimental design?

External validity
• Were program participants/study population fully described?
• Was the intervention explicitly described?
• Were contextual factors described?

Outcome measurement
• Were all relevant outcomes measured?
• Was the outcome measurement standardized by exposure?
• Were the measurement methods shown to be valid and reliable?

Qualitative data
• Were qualitative methods used to supplement quantitative data?

Threats to internal validity
• Were the major threats to internal validity addressed in the study?

Statistical analysis
• Were the appropriate statistical analyses conducted?
• If study results were negative, were statistical power or confidence intervals calculated?

Conclusions
• Did conclusions address program objectives?
• Were the limitations of the study addressed?
• Were the conclusions supported by the analysis?
• Was the practical significance of the result discussed?

External validity is the legitimacy with which we infer that the evident effect of the intervention can be generalized to other persons, settings or times. A bias or lack of validity, internal or external, is "a process at any stage of inference tending to produce results that depart systematically from true values" (Murphy, 1976). The implications of these terms should become clearer with their further use.

2. The criteria

2.1. Program objectives and conceptual basis

• Were the program objectives stated?
• Was the conceptual basis of the program explained and sound?


Knowledge of a program's objectives or its underlying research hypothesis is essential for assessing the appropriateness of a chosen intervention and its associated evaluation design. In particular, judgment on the appropriateness of outcome measures will depend on the clear identification of program objectives.

The conceptual basis of a program, which can be constructed from one or more existing theories and can even incorporate empirical findings (Earp and Ennett, 1991), consists of an elucidation of the linkages among all major study variables. Such an elucidation serves to orient others within the complex set of variables at play in the workplace: variables which operate at the level of the individual, work group, job task, work environment, work organization and broader socio-economic environment. This is especially important in safety research, which involves a diverse range of research disciplines and their correspondingly diverse range of theoretical perspectives. The conceptual framework can help those outside of the study team assess the appropriateness of particular intermediate outcomes (see Section 2.4) chosen for measurement and the appropriateness of intermediate outcomes serving as surrogates for final outcomes (e.g. a safety behavior measure substituting for an injury rate measure). Knowledge of the program's underlying conceptual model is also important for the ongoing development of theory, since the intervention results will need to be interpreted with reference to the model, either providing support for the model or challenging the model and possibly generating new hypotheses. Several researchers have noted a lack of a theoretical or conceptual basis in safety programs reported in the literature (McAfee and Winn, 1989; Goldenhar and Schulte, 1994; Hale and Hovden, 1998) and carried out in the field (Vojtecky and Schmitz, 1986), yet they agree on the importance of having such a basis.

2.2. Study design

• Was an experimental or quasi-experimental design employed instead of a non-experimental design?

2.2.1. Experimental design

The ideal type of study in terms of internal validity is the true experiment, in which units (typically individuals, but also perhaps work teams or even work sites) are allocated at random to either experimental (treatment) or control conditions, thus forming two (or more) groups. Proper randomization involves the use of a technique such as random number tables or a computer randomization routine. As the statistical theory of randomization shows, this ensures that experimental units and variables associated with the experimental units are allocated to the groups in an unbiased manner, thus increasing the internal validity of the study.
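As an aside on the mechanics, the following minimal sketch (in Python, with hypothetical site identifiers that are not taken from any study discussed here) shows a computer randomization routine of the kind mentioned above assigning work sites to treatment and control groups; recording the seed keeps the allocation reproducible for audit.

    import random

    # Hypothetical work sites to be allocated (illustration only).
    sites = ["site_A", "site_B", "site_C", "site_D", "site_E", "site_F"]

    rng = random.Random(19980304)  # recorded seed, so the allocation can be reproduced
    shuffled = sites[:]
    rng.shuffle(shuffled)

    half = len(shuffled) // 2
    treatment_group = sorted(shuffled[:half])
    control_group = sorted(shuffled[half:])

    print("Treatment:", treatment_group)
    print("Control:  ", control_group)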


Despite the desirability of randomization to experimental and control groups from a design point of view, reviews indicate that this is seldom done in occupational safety interventions. None of the articles reviewed by Goldenhar and Schulte (1994) or Peters (1989), nor the 38 new reports included in the article by Zwerling et al. (1997), included randomized (or quasi-randomized) experiments. However, several examples of true experiments in workplace back-injury prevention programs (Walsh and Schwartz, 1990; Reddell et al., 1992; Lahad et al., 1994), workplace stress interventions (McLeroy et al., 1984) and occupational hygiene interventions (Marcus et al., 1986; Girgis et al., 1994) suggest that it may be possible to increase the use of this design with occupational safety interventions. Glassock et al. (1997) is one recent example.

Because even randomization will by chance sometimes result in the unequal distribution among experimental groups of a variable related to the outcome measurement, it is always recommended to take baseline measurements of individual/group characteristics (e.g. age of individuals) which could influence the outcome. For similar reasons it is also recommended to take initial measurements of the outcome variables (e.g. safety behavior) that the intervention is designed to change. By taking these baseline and initial measurements, pre-existing differences between the groups can be accounted for in the statistical analysis of results.

2.2.2. Quasi-experimental design

There are several reasons, including practical, ethical, political or financial ones, why it may not be feasible to use a true experimental design, in which case a 'lesser' design (i.e. quasi- or non-experimental) might be used. The classic text by Cook and Campbell (1979) and a more recent update (Cook et al., 1990) give detailed discussions of quasi-experimental designs, which minimize some of the additional threats to internal validity that arise when not using an experimental design. The conceptualization and utility of quasi-experimental designs have continued to be recognized in the program evaluation field (Green and Lewis, 1986; Posavac and Carey, 1989; Windsor et al., 1994).

In one major type of quasi-experiment, units are assigned to one or more treatment groups and to a comparison group known as a non-equivalent control. This differs from the experimental design in that units are not randomized to the groups. A common situation would be one in which one or more pre-existing work groups (e.g. work teams, divisions, sites) are given the treatment(s), while one or more are not. The effect of an intervention could then be deduced by comparing the change in the outcome variable (from before to after an intervention) in the treatment group(s) to that in the comparison group(s). An even more convincing variant of the non-equivalent control group design occurs when the group(s) that originally served as controls receive the intervention later (an example of the switching replication design in Cook et al., 1990). By continuing data collection past this point, an additional opportunity to observe a treatment effect (or lack of it) is created.

When using the non-equivalent control group design, it is desirable to have minimal pre-existing differences between the treatment and non-equivalent control groups. Thus, one can try to match units a priori, so that variables thought to influence the effect of the treatment would be equally distributed to the groups. Whether matching units or not, it is essential to have data collected prior to the treatment administration, such that the similarity of the groups analyzed can be assessed on the variable that will serve as the outcome measure, as well as any other unmatched variables that might affect the outcome. This will either confirm that pre-existing differences were minimal (as in the case of Gregersen et al., 1996) or allow adjustment of results during their statistical analysis.
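To make the comparison of changes concrete, a minimal sketch (hypothetical percentages of observed behaviors performed safely; not data from any cited study) contrasts the pre-to-post change in a treatment group with that in a non-equivalent comparison group:

    # Hypothetical pre/post outcome scores: percentage of observed behaviors
    # performed safely in a treatment group and a non-equivalent comparison group.
    scores = {
        "treatment":  {"pre": 62.0, "post": 78.0},
        "comparison": {"pre": 60.0, "post": 65.0},
    }

    changes = {}
    for group, s in scores.items():
        changes[group] = s["post"] - s["pre"]
        print(f"{group}: pre={s['pre']:.1f}%  post={s['post']:.1f}%  change={changes[group]:+.1f}")

    # The effect estimate is the change in the treatment group over and above the
    # change in the comparison group, which absorbs any trend common to both.
    print(f"net effect: {changes['treatment'] - changes['comparison']:+.1f} percentage points")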


Another type of quasi-experimental design used in safety studies (e.g. Mohr and Clemmer, 1989) uses, instead of a comparison group, a second measure of the types of injuries which are not expected to be affected by the intervention. This allows a comparison of treatment effects and a means of detecting history and other effects (see Section 2.6). This is a much weaker design (the "non-equivalent dependent variables design") and must be used with caution (Cook and Campbell, 1979; Cook et al., 1990).

The other major type of quasi-experimental design, which also lacks a comparison group, is known as the interrupted time-series (Cook and Campbell, 1979; Cook et al., 1990) or within-group design (including reversal and multiple-baseline designs; Komaki and Jensen, 1986); it involves taking several measurements of the outcome variable both before and after introduction of the intervention. If the change in the outcome variable in the time period following introduction of the intervention is significant compared with changes observed over the other time periods, an effect of the intervention can be inferred, in the absence of major threats to internal validity. The simple interrupted time-series design differs from the simple before-and-after design in the former's greater sampling of the dependent variable, so that any maturational trends (see Section 2.6) can be measured. Unfortunately, history or instrumentation effects (Section 2.6) still threaten the internal validity of the simple time-series design (Cook et al., 1990). Thus, it is preferable that the time-series design also include either (1) a reversal of the treatment and a measurement of the effect of its reversal (e.g. Chhokar and Wallin, 1984; Komaki and Jensen, 1986), although this is not always possible in safety interventions, or (2) a non-equivalent control group, especially with a multiple baseline across groups (e.g. Fox et al., 1987).

2.2.3. Non-experimental design

The weakest designs, sometimes known as pre-experimental (Conrad et al., 1991) or non-experimental (Posavac and Carey, 1989), are those in which the inference that the intervention caused change is most debatable. Indeed, they are considered "not useful for causal inference" (Goldenhar and Schulte, 1996). Cook and Campbell (1979) included in this group of designs the following: (1) the post-only one-group design, (2) the post-only non-equivalent control design, and (3) the pre-post one-group design.

An example of the post-only one-group design would be a test of worker knowledge following an educational intervention. Even if a high score were found, one could not actually deduce that the educational intervention was effective, because one would not know how the group would have scored in the absence of an intervention. On the other hand, if one were able to compare the post-intervention score of the treatment group to the score of a control group of similar workers (an example of the post-only non-equivalent control group design), the inference would be stronger. The weakness in this design, however, is that, without pretesting, we cannot be sure how different the groups were in their knowledge before the educational intervention.


The third design mentioned above, the pre-post one-group design, is also known as a before-and-after design and is commonly used in safety studies. Unfortunately, it suffers from several threats to internal validity (Cook and Campbell, 1979; Green and Lewis, 1986; Posavac and Carey, 1989), as will be described below (Section 2.6). Thus, although this design, unlike the two non-experimental designs just mentioned, can detect change, the cause of the change often remains unclear, because other explanations are plausible. For this reason, one repeatedly sees recommendations in the safety literature for the use of comparison groups (randomized or non-equivalent controls) (Tarrants, 1980; Waller, 1980; Peters, 1989; Johnston et al., 1994). Further, Goldenhar and Schulte (1996) consider the use of either a quasi-experimental design with non-equivalent groups or time-series data collection preferable to a non-experimental design. Windsor et al. (1994) consider the before-and-after design useful for assessing the immediate impact (e.g. a change in knowledge) of short educational or training programs, but not for longer-term impacts. Others have suggested that control groups might not always be suitable in workplace research and go on to recommend other quasi-experimental approaches, such as time-series designs (Blumberg and Pringle, 1983).

If one does not use a true experimental design, opportunities for bias increase, as will be discussed below. It can, nevertheless, be justifiable to use quasi- and even non-experimental designs, but the researcher/evaluator must then address, through alternative methods, the threats to internal validity that are not addressed by the design (see Section 2.6). The study by Mohr and Clemmer (1989) is an example of how two quasi-experimental designs can provide complementary information which increases the validity of the final conclusions.

2.3. External validity

• Were program participants/study population fully described?
• Was the intervention explicitly described?
• Were contextual factors described?

Although an effectiveness evaluation is primarily concerned with issues of internal validity (whether the intervention was effective), we should also consider external validity (whether results can be generalized to other workplaces), as do methodological criteria for clinical studies (e.g. Guyatt et al., 1993; Shekelle et al., 1994). Information pertinent to external validity includes details of the program participants, such as their occupation, age, gender, experience, etc. The manner of recruitment of individuals or work organizations to the study is also relevant, since effectiveness results obtained from volunteers are not necessarily generalizable to the wider work population. For similar reasons, the drop-out rate should be reported.

Similarly, explicit details on the intervention are needed to assess external validity. They should include the intervention's duration and program content. It is important to specify not only the worker population and the intervention, but also the variety of 'contextual factors' which could also influence a program's effectiveness and thus the generalizability of a particular study's results. For example, a comprehensive safety system intervention might have a larger impact on a company which is performing below average in terms of safety performance than on a company already performing above average.


An example of a study which incorporated much contextual detail useful in judging external validity is Cooper et al. (1994), who specified that the company had a high commitment to improving safety (e.g. safety issues were the first agenda items at executive meetings) and that safety committees contained worker and management representation and had been active since the early 1960s, and who described earlier safety interventions.

2.4. Outcome measurement

• Was the outcome measurement standardized by exposure?
• Were all relevant outcomes measured?
• Were the measurement methods shown to be valid and reliable?

Outcome measurements, of course, need to be standardized by exposure, since the number of employees in a work group fluctuates with time. Thus, comparisons and statistical tests on numbers of incidents should also incorporate information on exposure, usually by including the number of workers or worker-hours exposed to the hazard or workplace. When this is not the case (e.g. Jones et al., 1987), some comment should be included which attests to the stability of exposure during the period of comparison.

The criterion concerned with relevant outcomes incorporates several ideas illustrated in Fig. 1: (1) Did the final outcome measure address the program objective? (2) Were implementation outcomes measured? (3) Were corroborating intermediate or final outcomes measured? (4) Were unintended outcomes measured?

2.4.1. Final outcome

The outcome measurement must be appropriate to the intervention's objective. If an intervention is directed towards injury prevention, then the ideal final outcome measurement would be injuries. However, the low frequency of more severe accidents means that injury incident rates and injury severity rates are often not useful in evaluating the effectiveness of safety interventions in all but the largest workgroups (Tarrants, 1980, 1987; Petersen, 1996), because of the associated low statistical power. Thus, people use surrogate measures, such as first-aid-only incidents, near-accidents or safety behavior, which occur with greater frequency. This is an acceptable strategy in injury prevention research, as long as the validity of these surrogates can be demonstrated (Bickman, 1983), which is not necessarily an easy undertaking.
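Returning to the exposure criterion above, a small numerical sketch (hypothetical figures, not taken from any cited study) shows why comparisons should be made on exposure-standardized rates rather than on raw counts when the workforce fluctuates:

    # Hypothetical injury counts and worker-hours before and after an intervention,
    # in a workplace whose workforce shrank in the second period (illustration only).
    periods = {
        "before": {"injuries": 30, "worker_hours": 600_000},
        "after":  {"injuries": 24, "worker_hours": 420_000},
    }

    for label, p in periods.items():
        # Standardize by exposure: injuries per 200,000 worker-hours
        # (roughly the annual hours of 100 full-time workers).
        rate = p["injuries"] / p["worker_hours"] * 200_000
        print(f"{label}: {p['injuries']} injuries, {rate:.1f} per 200,000 worker-hours")

    # Raw counts fell by 20% (30 to 24), yet the exposure-standardized rate rose
    # from 10.0 to about 11.4, so the raw counts alone would be misleading.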

Fig. 1. Relationship between program objectives, intervention and outcomes.


2.4.2. Implementation outcome

A measure of program implementation, i.e. of how a program was actually delivered, is another outcome of interest, especially if no effect is observed. An example of an implementation measure would be attendance at training. According to Lipsey (1996), "weak, inconsistent, or even nonexistent implementation of intervention plans is all too common in practical settings". Robins et al. (1990) found an unexpected degree of variation in the implementation of training among different worksites. If implementation results are sufficiently poor, an effectiveness evaluation might not even be worth pursuing, as Verbeek et al. (1993) found, because the treatment effect would be greatly diminished. Even if an intervention has been implemented exactly as planned, evidence of this should be specified.

2.4.3. Intermediate outcomes

Intermediate or additional outcome measures (e.g. measures of knowledge or behavior) should be taken whenever possible, even when the final outcome to be measured is injury rate. This serves at least two purposes. First, the internal validity of the study could be bolstered if the final outcome results of main interest are mirrored by other outcomes. For example, if one found, following a behavioral intervention, that there was a decrease in injuries but no change in safety behavior, it would seem unlikely that the intervention actually caused the decrease in injuries. If, however, the decrease was accompanied by an increase in knowledge and safety behavior, a causal relationship between intervention and injuries would be supported. A second purpose for the additional measures is to provide information useful to conceptual developments in safety. An intervention might achieve its final goal of reducing injury, and the primary goal of an evaluation would be to measure this outcome, but accompanying measures can give insight into how that goal was achieved.

2.4.4. Unintended outcomes

Attempted measurement of possible unintended outcomes is important when a particular class of injuries is the focus of the study and that class or its surrogate is measured as a final outcome. Inclusion of an alternative measure sensitive to all injuries would also be necessary to exclude the possibility that the intervention had inadvertently caused an increase in another class of injuries, as can sometimes be the case when technological improvements are made in industry (Laflamme, 1993). Evans' (1985) review of traffic safety studies shows how an improvement in safety, predicted from a new engineering intervention, might not be realized, or might even have the opposite effect from that intended, because behavior may change in response to the intervention.

2.4.5. Validity

An essential of effectiveness research is establishing the validity of the outcome measurement, which is the degree to which the concept under study is accurately represented by the means of measuring the outcome.


Injury rate is widely accepted as a valid means of measuring the effect of an injury prevention or safety intervention at a particular worksite, but discussion of the integrity of the worksite's reporting system helps demonstrate its validity. The validity of aggregated data sources also needs to be discussed in a report because of the high degree of under-reporting associated with such sources (McCurdy et al., 1991; Stout and Bell, 1991).

If one uses a surrogate measure for injury, validation of measurement methods should include substantiation of the choice of the surrogate. For example, the choice of a score from a safety behavior list assessment as a surrogate for injury rate could be validated by a strong correlation between the number of unsafe/safe behaviors observed and the injury rate. A measure of knowledge is not usually an adequate surrogate for injury rate because, as behavioral research has repeatedly shown, a change in knowledge does not even ensure a change in behavior, let alone a change in injury rate. As this illustrates, a marked attenuation of intervention effect size can sometimes be observed when moving from a more immediate outcome to a final outcome.

2.4.6. Reliability

The reliability of a measure refers to the degree of consistency of a measurement. The lower the reliability of the outcome measure, the more difficult it is to detect the effects of an intervention (Cook et al., 1990; Lipsey, 1990). Low reliability of a measure can occur because of (1) instability in what is being measured (true variance) and (2) variation in the instrument of measurement or the instrument's application (error variance). Examples of the first include variation in the safety behavior of individuals, in hazardous conditions in the workplace, in accident events, and in the treatment cost of injuries. This variation is beyond the control of the researcher, but it could influence which outcome measure is chosen for a study. Examples of the reliability associated with an instrument include: consistency over time in how an observer rates behavior or safety audit items (intra-rater reliability); consistency among observers in the way they rate behaviors or safety audit items (inter-rater reliability); consistency of the responses elicited by the various items in a scale of safety attitudes (internal consistency); and similarity in compliance with reporting procedures among workplaces or work groups. Of particular concern for reliability in safety studies is the situation in which the intervention affects not just the occurrence of incidents but also the reporting of those incidents, e.g. in management audits (which tend to increase reporting) and safety incentive programs (which tend to decrease reporting). At times a report of reliability could take a qualitative approach; but, whenever possible, and especially when tests or questionnaires are involved, it should take a quantitative approach.
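As one example of such a quantitative approach, the sketch below (hypothetical ratings, not data from any cited study) computes per cent agreement and Cohen's kappa, a chance-corrected agreement index, for two observers who independently coded the same behavior observations:

    # Hypothetical codings by two observers of 12 behavior observations.
    rater1 = ["safe", "safe", "unsafe", "safe", "unsafe", "safe",
              "safe", "unsafe", "safe", "safe", "unsafe", "safe"]
    rater2 = ["safe", "unsafe", "unsafe", "safe", "unsafe", "safe",
              "safe", "safe", "safe", "safe", "unsafe", "safe"]

    n = len(rater1)
    categories = sorted(set(rater1) | set(rater2))

    # Observed agreement: proportion of observations coded identically.
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n

    # Chance-expected agreement, from each rater's marginal proportions.
    p_e = sum((rater1.count(c) / n) * (rater2.count(c) / n) for c in categories)

    kappa = (p_o - p_e) / (1 - p_e)
    print(f"per cent agreement = {100 * p_o:.0f}%, Cohen's kappa = {kappa:.2f}")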


2.5. Qualitative data

• Were qualitative methods used to supplement quantitative data?

It is now generally advocated in public health and education that, for a variety of reasons, qualitative as well as quantitative information be collected for the purpose of evaluating programs (Green and Lewis, 1986, p. 149). This recommendation has more recently been extended to safety research in particular (Needleman and Needleman, 1996). Qualitative information can be gathered through interviews, observation and the use of primary and secondary documents (Patton, 1990). One is interested in the opinions and reactions of those directly involved with the site of a workplace intervention, such as managers, workers, or program delivery staff, and also of those not directly involved, since their attitudes or actions could indirectly affect a program.

Israel et al. (1992) describe the enhancement of their occupational stress program through the use of qualitative information for purposes of problem definition, interpretation of quantitative results, cross-validation (or "triangulation") of quantitative results, and better understanding of the generalizability of results. Qualitative information could also be useful for the development of theory relevant to interventions; the identification of ways to improve the nature and delivery of the intervention; the identification of barriers to its implementation and acceptance; the detection of unintended outcomes; and the recognition of previously unidentified contextual considerations. Perhaps most important to the main purpose of effectiveness evaluation would be the strengthening of internal validity or the identification of threats to internal validity. Cooper et al. (1994) and Michaels et al. (1992) illustrate some of the uses of qualitative information.

2.6. Threats to internal validity

• Were the major threats to internal validity addressed in the study?

A threat to internal validity is a condition or event which could lead to a wrong answer, i.e. which could bias the outcomes such that inferences concerning the estimate of treatment effect would be in error. Cook and Campbell (1979) deal in great detail with threats to internal validity in field situations, and this reference still forms the basis of discussion of this issue by others (Green and Lewis, 1986; Posavac and Carey, 1989; Conrad et al., 1991; Windsor et al., 1994). In general, non-experimental designs have the greatest likelihood of threats to internal validity, quasi-experimental designs have less, and experimental designs have the least.

The criterion indicates that the threats to internal validity which are not already eliminated by the study design still need to be addressed by other means and be reported on. This can draw on both qualitative and quantitative information: additional data analysis; feedback and opinion from study participants, experts or stakeholders; theory; and logical argument (Cook and Campbell, 1979; Lipsey, 1996). In the following, we first describe threats to internal validity arising in experimental designs, based on the classification by Cook and Campbell (1979); these threats are applicable to quasi-experimental designs with non-equivalent control groups as well. We will not describe how authors can explore these biases, but note that they should be explored and reported upon. We will then describe additional biases that may occur in non-randomized studies.

2.6.1. Threats to internal validity in randomized designs

One type of threat in randomized designs comprises those which would result in the groups no longer reflecting their random assignment: improper randomization; treatment-related refusal to participate (when the likelihood that subjects refuse depends on whether they are in the treatment or control group); and differential attrition (when the dropout pattern causes groups to become increasingly different).


Another type of threat to a randomized design comprises those which diminish the treatment effect by diminishing the 'dose' of treatment relative to non-treatment. This includes diffusion of treatment (or contamination), which can occur when those in the control group unintentionally receive some of the intervention, such as when the control group in Blumberg and Pringle (1983) quietly adopted some of the same safety modifications as the experimental group. Diffusion could also occur through exchange of information alone, such as when members of an education treatment group pass on new information to members of the control group. Diffusion can most easily be prevented when groups are physically separated, although this would be an impractical arrangement in some workplaces. Another threat that diminishes the dose of treatment is non-compliance, which occurs when members of the experimental group do not follow the intervention protocol, e.g. treatment group members do not use their new personal protective equipment.

In addition, a variety of situations stemming from the reactions of people can arise in a randomized or non-equivalent group design that would threaten its internal validity. For example, the control group might resent not being given the new treatment (e.g. Blumberg and Pringle, 1983) and behave in a way that alters their outcome measure. The opposite effect has also been seen, in which the no-treatment group, perhaps for competitive reasons, changes its behavior such that its outcome measure becomes more similar to that of the treatment group (Cook and Campbell, 1979). Such effects can sometimes be minimized by promising that those in the control group will eventually receive the treatment. Well known is the 'Hawthorne effect', which refers to alterations in the behavior of the treatment group simply as a result of being under observation. For this reason, the groups should be dealt with as similarly as possible, except for the intervention itself. Finally, persons administering the program can, perhaps even subconsciously, influence its conduct in subtle ways, owing to their expectations of the effect (or lack of effect) of the intervention. For this reason the presence of 'blinding' (the people administering the treatment, receiving the treatment, assessing the outcome and analyzing the data are 'blind' to the assignment of experimental units to intervention and control groups) appears as a methodological criterion for clinical research (Guyatt et al., 1993). This is less often possible in field experiments, but it is nevertheless recommended whenever the situation permits it.

2.6.2. Selection threats to internal validity

When one uses a non-equivalent control instead of a randomized control, as in some quasi-experimental designs, selection effects can pose a threat to validity. This occurs when the amount of change that takes place during the intervention period is affected by either (1) pre-existing differences between treatment and control groups or (2) interaction of those differences with other threats to internal validity (e.g. selection-history interaction, where a history bias [see below] affects only one of the groups). The major guard against this threat is to ensure as much comparability as possible between treatment and comparison groups; thus, as mentioned before, the matching of individuals or groups on variables predictive of the outcome measure is sometimes carried out (e.g. Gregersen et al., 1996).


A somewhat weaker alternative approach to reducing the effect of selection bias is to take initial measurements of outcome variables and other group characteristics, so that the contribution of selection to apparent treatment effects can be estimated or accounted for in the statistical analysis.

2.6.3. Threats to internal validity in before-and-after designs

The lack of a comparison group, as in before-and-after studies, introduces at least four other threats to internal validity (history, maturation, testing and instrumentation), with the first deserving great consideration in any but the shortest workplace studies. A history effect in a safety intervention refers to changes in observed accident rates or their proxy that are attributable to workplace, industry or societal changes other than the intervention. This could easily arise in a workplace because of its complexity and the many 'contextual factors' potentially affecting safety (see Section 2.3). To reduce the threat of a history effect, one would need to report on the stability of these factors, e.g. by reporting on industry or company trends in injury rates, or on any concurrent pertinent legislative, organizational or management changes.

Maturation threats result from natural changes in the study group that occur over time independently of the intervention. Such changes could arise from changes in the individual, such as age (Laflamme and Menckel, 1995) and experience, or they could reflect industry-wide trends (attributable to the business cycle, technology, etc.; Brooker et al., 1997). Viscusi (1985), as reported by Cook et al. (1990), attempted to account for such a trend in injury rates in the cotton textile industry by developing a prediction model for injury rates from three variables measured at five pre-treatment points. Cook et al. (1990) concluded that this analysis would have benefitted from more pre-treatment points, as would be found in a complete time-series design. Mohr and Clemmer (1989) were able to address the maturation threat to their before-and-after intervention in the oil drilling industry by using a separate quantitative examination of industry-wide trends.

A testing effect occurs when the post-intervention measurement (which in the educational field is often a test, hence the name) shows a different result from the pre-intervention measurement, simply because taking the pre-intervention measurement creates an 'effect'. One could imagine workers scoring higher on a second test of knowledge, and possibly even behavior, simply because their awareness of issues had been heightened by an initial test. For this reason, Parkinson et al. (1989) included a (non-equivalent) control group which repeatedly took the same tests as the intervention group over the course of the 2-year study.

Finally, bias attributable to instrumentation can occur when, between pre- and post-testing, there is a change in the means of obtaining outcome measurements. This includes changes in survey tools and measuring instruments, or a change in the procedure of carrying out measurements. Sometimes the intervention itself can influence the reporting of injuries. Examples include management audits that score points for features pertinent to reporting systems (expected to increase reporting) and safety incentive programs that reward low injuries (expected to suppress reporting; List, 1997). Of value in addressing this threat to validity is the reporting of the ratio of major to minor injuries (since the former are harder to suppress than the latter), as well as qualitative investigation.

A special type of bias (regression to the mean) arises when extreme individuals or groups are chosen for study, as might be the case if one applied an intervention only to those with high injury rates. Through the laws of probability alone, one would expect lower accident rates in a subsequent test of the same individuals/group. Such a result would look like an effect of treatment, unless a control group of similar individuals with high injury rates was also included. The study by Moran (1985) shows the usefulness of such a control group.
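The phenomenon is easy to demonstrate by simulation; in the minimal sketch below (hypothetical figures, using Poisson-distributed counts), every work group has the same true injury rate, yet the groups selected for having the worst first-year counts improve in the second year with no intervention at all:

    import numpy as np

    rng = np.random.default_rng(1999)

    # 200 hypothetical work groups, all with the same true expectation of
    # 8 injuries per year; observed counts vary only through Poisson chance.
    true_mean = 8.0
    year1 = rng.poisson(true_mean, size=200)
    year2 = rng.poisson(true_mean, size=200)

    # Select the 'worst' groups on the basis of year-1 counts alone, as an
    # intervention targeted at high-injury groups might do.
    worst = year1 >= np.quantile(year1, 0.9)

    print("year-1 mean count, selected groups:", round(year1[worst].mean(), 1))
    print("year-2 mean count, selected groups:", round(year2[worst].mean(), 1))
    # Year-2 counts fall back towards the overall mean of 8 even though nothing
    # changed, which is why a control group of similar high-rate units is needed.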


2.7. Statistical analysis

• Were the appropriate statistical analyses conducted?
• If study results were negative, were statistical power or confidence intervals calculated?

The limited scope of the present paper precludes detailed discussion of the appropriate statistical technique for a given analysis. We do, however, highlight some important statistical issues.

Typically, comparisons are made between experimental and control groups or between pre- and post-intervention measurements. For non-randomized studies (and sometimes even in randomized trials), it might be necessary to adjust for confounding variables (variables which differ between the two groups and which affect the measured outcome), such as age or experience, so that they do not account for any of the treatment effect observed. However, a purist can argue that the possibility of residual confounding cannot be ruled out. In such circumstances, it may be sufficient to show that the confounder could not have produced an effect of the magnitude seen, or that, if it had any effect, its direction would have been to reduce the observed effect rather than to create one when none existed. Adjustments can also be made for pre-intervention differences among groups in the outcome variable. However, because such adjustments involve assumptions and are not straightforward (Anderson et al., 1980; Kaplan and Berry, 1990; Streiner and Norman, 1995), it is best to minimize the need for them by selecting groups that are similar.

Conducting several tests of significance affects the overall level of statistical significance. Essentially, the more tests that are done, the greater the probability that at least one of the tests is significant. Strictly speaking, an adjustment should be made (either explicitly or in the reader's mind) to allow for this. The concern can arise in two ways:

1. a study may include multiple outcomes, e.g. looking at each of the frequency rate of accidents, the rates of accidents of particular types, the severity rate(s), the costs, etc.; or
2. comparisons may be made across several groups. It is improper to choose specific groups for comparison a posteriori; specific comparisons should be determined a priori or after first determining that there is an overall 'group effect'.


A less well-appreciated point is the potential impact of 'clustering'. If a measure is aimed at workplaces (rather than at individuals), the appropriate unit of analysis must be carefully considered (Hopkins, 1982). Strictly speaking, it is the workplace, not the individual. Thus we cannot treat the analysis as if the sample size were the number of workers. Rather, the effective sample size will lie between the number of workplaces and the number of subjects; it will depend on the intra-class correlation, the tendency of individuals within workplaces (the clusters) to be similar.

For 'negative' studies, i.e. those in which no 'statistically significant' result is found, it is important to have considered statistical power, i.e. whether the sample size was large enough to have detected a difference of the expected magnitude, because power is often quite low in effectiveness studies (Lipsey, 1990). Ideally, of course, it should be considered before the study, as Gregersen et al. (1996) did, in an estimate of the required sample size. When information on power is missing, readers can look at the confidence interval on the estimate of effect. It is certainly preferable for reports to provide such intervals, as Gregersen et al. (1996) did, rather than simply p-values, and indeed some biomedical journals now demand this (e.g. the Journal of the American Medical Association). Confidence intervals are also useful in 'positive' studies, since they provide bounds on the plausible effect sizes. A wide interval typically reflects a small sample size and shows that the effect size may not be robust. Researchers are encouraged to present the data required for calculation of effect size, which can be defined as the percentage reduction in accident rate, preferably in relation to the change in rate in any comparison group (e.g. Guastello, 1993). Such measures will allow the comparison and pooling of results across studies and intervention types, as carried out by Guastello (1993).
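Two of the points above, the effect of clustering on the effective sample size and the use of a confidence interval to interpret a 'negative' result, can be illustrated with a minimal sketch (hypothetical figures; the interval uses a standard approximation for a rate ratio and is not taken from any cited study):

    import math

    # (1) Clustering: workers nested within workplaces reduce the effective sample size.
    #     Hypothetical figures: 20 workplaces of 50 workers, intra-class correlation 0.05.
    n_workers, cluster_size, icc = 20 * 50, 50, 0.05
    design_effect = 1 + (cluster_size - 1) * icc
    print("effective sample size ~", round(n_workers / design_effect))  # ~290, not 1000

    # (2) A 'negative' study: 18 injuries in 150,000 worker-hours with the intervention
    #     versus 24 injuries in 160,000 worker-hours in the comparison condition.
    a, hours_a = 18, 150_000
    b, hours_b = 24, 160_000
    rate_ratio = (a / hours_a) / (b / hours_b)
    se_log_rr = math.sqrt(1 / a + 1 / b)          # approximate SE of the log rate ratio
    low = rate_ratio * math.exp(-1.96 * se_log_rr)
    high = rate_ratio * math.exp(+1.96 * se_log_rr)
    print(f"rate ratio = {rate_ratio:.2f}, 95% CI {low:.2f} to {high:.2f}")
    # The wide interval (roughly 0.43 to 1.47) includes 1.0, so the study cannot
    # distinguish a worthwhile reduction from no effect: low power, not proof of no effect.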

2.8. Conclusions

• Did the conclusions address program objectives?
• Were the limitations of the study addressed?
• Were the conclusions supported by the analysis?
• Was the practical significance of the result discussed?

Discussion of study results should focus primarily on the conclusions to be drawn from the data that are relevant to the program's objectives or its underlying research hypothesis. In doing so, authors should address all threats to the validity of these conclusions, i.e. discuss and try to discount all alternative hypotheses. Researchers, who are in the best position to appreciate the limitations of their studies, do the reader a service by highlighting these and assessing their likely impact on the conclusions. Furthermore, the strength of the conclusions should not exceed that warranted by the data, despite the temptation to 'want' the intervention to be successful.


Indeed, the publication of negative results is important because it helps prevent the implementation by others of ineffective interventions.

Finally, one wants to take away from a report of a study the answer to the question: "So what?" (Lipsey, 1996; Shekelle et al., 1994). This question concerns the practical significance of a result, which is distinct from its statistical significance. An intervention might make a statistically significant improvement in a safety outcome, but the reduction in injury rate might not be large enough to be of interest to a decision-maker. A discussion of practical significance almost always requires an outside frame of reference, such as cost, health effects in meaningful terms like injury rates (instead of an experimental surrogate measure), and the size of effects relative to standards or progress towards goals (Lipsey, 1996). Several such goals can be found in the USDHHS objectives for the year 2000 (USDHHS, 1995).

3. Conclusion

These criteria provide a set of questions which can help the reader assess the value of the intervention evaluations described in the literature. Further, for systematic reviews of particular types of interventions, the criteria can be readily adapted to be more specific. It is recognized that the time and resource constraints imposed by the workplace can demand compromises in experimental protocol. Nevertheless, if more rigor were applied to evaluations of safety interventions, truly effective measures could be more readily identified, and the redirection of resources towards the adoption of more effective measures would follow, as would their accompanying benefits.

Acknowledgements

The comments of Andrew Hale, Richie Gun, Tore J. Larsson, John Bisby, Ricardo Montero Martinez and Carin Sundstrom-Frisk are appreciated for their contribution to the content and form of this manuscript.

References

Anderson, S., Auquier, A., Hauck, W.W., Oakes, D., Vandaele, W., Weisberg, H.I., 1980. Analyzing data from premeasure/postmeasure designs. In: Statistical Methods for Comparative Studies. Techniques for Bias Reduction. Wiley Series in Probability and Mathematical Statistics. John Wiley, Toronto.
Bickman, L., 1983. The evaluation of prevention programs. Journal of Social Issues 39, 181-194.
Blumberg, M., Pringle, C.D., 1983. How control groups can cause loss of control in action research: the case of the Rushton Coal Mine. Journal of Applied Behavioral Science 19, 409-425.
Brooker, A.-S., Frank, J.W., Tarasuk, V.S., 1997. Back pain claim rates and the business cycle. Social Science and Medicine 45, 429-439.
Chhokar, J.S., Wallin, J.A., 1984. Improving safety through applied behavior analysis. Journal of Safety Research 15, 141-151.


Conrad, K.M., Conrad, K.J., Walcott-McQuigg, J., 1991. Threats to internal validity in worksite health promotion program research: common problems and possible solutions. American Journal of Health Promotion 6, 112-122.
Cook, T.D., Campbell, D.T., 1979. Quasi-experimentation. Design and Analysis Issues for Field Settings. Rand McNally College Publishing, Chicago.
Cook, T.D., Campbell, D.T., Peracchio, L., 1990. Quasi experimentation. In: Dunnette, J.D., Hough, L.M. (Eds.), Handbook of Industrial and Organizational Psychology, pp. 491-576. Consulting Psychologists Press, Palo Alto, CA.
Cooper, M.D., Phillips, R.A., Sutherland, V.J., Makin, P.J., 1994. Reducing accidents using goal setting and feedback: a field study. Journal of Occupational & Organizational Psychology 67, 219-240.
Dobbins, M., Thomas, H., Ploeg, J., Ciliska, D., Hayward, S., Underwood, J., 1996. The effectiveness of community-based heart health projects: a systematic overview. McMaster University Quality of Worklife Research Unit Working Paper Series, 96-1. McMaster University, University of Toronto, Hamilton, Ontario.
Earp, J.A., Ennett, S.T., 1991. Conceptual models for health education research and practice. Health Education Research 6, 163-171.
Ellis, L., 1975. A review of research on efforts to promote occupational safety. Journal of Safety Research 7, 180-189.
Evans, L., 1985. Human behaviour feedback and traffic safety. Human Factors 27, 555-576.
Fox, D.K., Hopkins, B.L., Anger, W.K., 1987. The long-term effects of a token economy on safety performance in open-pit mining. Journal of Applied Behavior Analysis 20, 215-224.
Girgis, A., Sanson-Fisher, R.W., Watson, A., 1994. A workplace intervention for increasing outdoor workers' use of solar protection. American Journal of Public Health 84, 77-81.
Glassock, D.J., Hansen, O.N., Rasmussen, K., Carstensen, O., Lauritsen, J., 1997. The West Jutland study of farm accidents: a model for prevention. Safety Science 25, 105-112.
Goldenhar, L.M., Schulte, P.A., 1994. Intervention research in occupational health and safety. Journal of Occupational Medicine 36, 763-775.
Goldenhar, L.M., Schulte, P.A., 1996. Methodological issues for intervention research in occupational health and safety. American Journal of Industrial Medicine 29, 289-294.
Green, L.W., Lewis, F.M., 1986. Measurement and Evaluation in Health Education and Health Promotion. Mayfield Publishing, Palo Alto, CA.
Gregersen, N.P., Brehmer, B., Moren, B., 1996. Road safety improvement in large companies. An experimental comparison of different measures. Accident Analysis and Prevention 28, 297-306.
Guastello, S.J., 1993. Do we really know how well our occupational accident prevention programs work? Safety Science 16, 445-463.
Guyatt, G.H., Sackett, D.L., Cook, D.J. for the Evidence-based Medicine Working Group, 1993. Journal of the American Medical Association 270, 2598-2601.
Hale, A.R., 1984. Is safety training worthwhile? Journal of Occupational Accidents 6, 17-33.
Hale, A.R., Hovden, J., 1998. Management and culture: the third age of safety. A review of approaches to organizational aspects of safety health and environment. In: Williamson, A., Feyer, A.-M. (Eds.), Occupational Injury: Risk, Prevention and Injury. Taylor & Francis.
Harborview Medical Center Injury Prevention and Research Center, 1997. Systematic Reviews of Childhood Injury Prevention Interventions. http://weber.u.washington.edu/hiprc/index_left.html/ (Oct. 22, 1997).
Heacock, H., Koehoorn, M., Tan, J., 1997. Applying epidemiological principles to ergonomics: a checklist for incorporating sound design and interpretation of studies. Applied Ergonomics 28, 165-172.
Hopkins, K.D., 1982. The unit of analysis: group means versus individual observations. American Educational Research Journal 19, 5-18.
Hoyos, C.G., Zimolong, B., 1988. Occupational Safety and Accident Prevention: Behavioral Strategies and Methods. Advances in Human Factors/Ergonomics, Vol. 11. Elsevier, New York.
Israel, B.A., Schurman, S.J., Hugentobler, M.K., 1992. Conducting action research: relationships between organization members and researchers. Journal of Applied and Behavioural Science 28, 74-101.
Johnston, J.J., Cattledge, G.T.H., Collins, J.W., 1994. The efficacy of training for occupational injury control. Occupational Medicine: State of the Art Reviews 9, 147-158.


Jones, J.W., Murphy, L.R., Steffy, B.D., 1987. Impact of organizational stress management on accident rates and insurance losses: two quasi-experiments. In: Opatz, J.P. (Ed.), Health Promotion Evaluation: Measuring the Organizational Impact, pp. 66-81. National Wellness Institute/National Wellness Association, Stevens Point, WI.
Kaplan, R.M., Berry, C.C., 1990. Adjusting for confounding variables. In: Sechrest, L., Perrin, E., Bunker, J. (Eds.), Research Methodology: Strengthening of Causal Interpretations of Nonexperimental Data, pp. 105-114. United States Department of Health and Human Services, Washington, DC.
Komaki, J.L., Jensen, M., 1986. Within-group designs: an alternative to traditional control-group designs. In: Cataldo, M.R., Coates, T.J. (Eds.), Health and Industry. A Behavioral Medicine Perspective, pp. 86-139. John Wiley, New York.
Laflamme, L., 1993. Technological improvements of the production process and accidents: an equivocal relationship. Safety Science 16, 249-266.
Laflamme, L., Menckel, E., 1995. Aging and occupational accidents. A review of the literature of the last three decades. Safety Science 21, 145-161.
Lahad, A., Malter, A.D., Berg, A.O., Deyo, R.A., 1994. The effectiveness of four interventions for the prevention of low back pain. Journal of the American Medical Association 272, 1286-1291.
Lipsey, M.W., 1990. Design Sensitivity. Sage Publications, Newbury Park, CA.
Lipsey, M.W., 1996. Key issues in intervention research: a program evaluation perspective. American Journal of Industrial Medicine 29, 298-302.
List, W., 1997. Do safety incentives have a place in the workplace? Canadian Occupational Safety May/June, 9-10.
Marcus, A.C., Baker, D.B., Froines, J.R., Brown, E.R., McQuiston, T., Herman, N.A., 1986. ICWU cancer control education and evaluation program: research design and needs assessment. Journal of Occupational Medicine 28, 226-236.
McAfee, R.B., Winn, A.R., 1989. The use of incentives/feedback to enhance work place safety: a critique of the literature. Journal of Safety Research 20, 7-19.
McCurdy, S.A., Schenker, M.B., Samuels, S.J., 1991. Reporting of occupational injury and illness in the semiconductor manufacturing industry. American Journal of Public Health 81, 85-89.
McLeroy, K.R., Green, L.W., Mullen, K.D., Foshee, V., 1984. Assessing the effects of health promotion in worksites: a review of the stress program evaluations. Health Education Quarterly 11, 379-401.
Michaels, D., Zoloth, S., Bernstein, N., Kass, D., Schrier, K., 1992. Workshops are not enough: making right-to-know training lead to workplace change. American Journal of Industrial Medicine 22, 637-649.
Mohr, D.L., Clemmer, D.I., 1989. Evaluation of an occupational injury intervention in the petroleum drilling industry. Accident Analysis and Prevention 21, 263-271.
Moran, G.E., 1985. Regulatory strategies for workplace injury reduction. Evaluation Review 9, 21-33.
Murphy, E.A., 1976. The Logic of Medicine. Johns Hopkins University Press, Baltimore. Quoted in Fletcher, R.H., Fletcher, S.W., Wagner, E.H., 1996. Clinical Epidemiology. The Essentials, 3rd Edition, Williams & Wilkins, Baltimore, MD.
Needleman, C., Needleman, M.L., 1996. Qualitative methods for intervention research. American Journal of Industrial Medicine 29, 329-337.
Office of Technology Assessment, 1985. Preventing Illness and Injury in the Workplace. OTA-H-256. US Congress, Office of Technology Assessment, Washington, DC.
Parkinson, D.K., Bromet, E.J., Dew, M.A., Dunn, L.O., Barkman, M., Wright, M., 1989. Effectiveness of the United Steel Workers of America coke oven intervention program. Journal of Occupational Medicine 31, 464-472.
Patton, M.Q., 1990. Qualitative Evaluation and Research Methods, 2nd Edition, Sage Publications, Newbury Park, CA.
Peters, R.H., 1989. Information Circular 9232: Review of Recent Research on Organizational and Behavioral Factors Associated with Mine Safety. Bureau of Mines, United States.
Petersen, D., 1996. Analyzing Safety System Effectiveness, 3rd Edition, Van Nostrand Reinhold, Toronto.
Posavac, E.J., Carey, R.G., 1989. Program Evaluation. Methods and Case Studies, 3rd Edition, Prentice Hall, Englewood Cliffs, NJ.


Reddell, C.R., Congleton, J.J., Huchingson, R.D., Montgomery, J.F., 1992. An evaluation of a weightlifting belt and back injury prevention training class for airline baggage handlers. Applied Ergonomics 23, 319-329.
Robins, T.G., Hugentobler, M.K., Kaminski, M., Klizman, S., 1990. Implementation of the Federal Hazard Communication Standard: does training work? Journal of Occupational Medicine 32, 1133-1140.
Schulte, P.A., Goldenhar, L.M., Connally, L.B., 1996. Intervention research: science, skills, and strategies. American Journal of Industrial Medicine 29, 285-288.
Shekelle, P.G., Andersson, G., Bombardier, C., Cherkin, D., Deyo, R., Keller, R., Lee, C., Liang, M., Lipscomb, B., Spratt, K., Weinstein, J., 1994. A brief introduction to the critical reading of the literature. Spine 19, 2028S-2031S.
Stout, N., Bell, C., 1991. Effectiveness of source documents for identifying fatal occupational injuries: a synthesis of studies. American Journal of Public Health 81, 725-728.
Streiner, D.L., Norman, G.R., 1995. Health Measurement Scales. A Practical Guide to their Development and Use, 2nd Edition, Oxford University Press, New York.
Tarrants, W.E., 1980. The Measurement of Safety Performance. Garland STPM Press, New York.
Tarrants, W.E., 1987. How to evaluate your occupational safety and health program. In: Slote, L. (Ed.), Occupational Safety and Health Handbook, pp. 166-200. Wiley Interscience, New York.
USDHHS, 1995. Healthy People 2000 Review, 1994. DHHS publication PHS 95-1256-1. National Center for Health Statistics, Hyattsville, MD.
van Tulder, M.W., Assendelft, W.J.J., Koes, B.W., Bouter, L.M., the Editorial Board of the Cochrane Collaboration Back Review Group, 1997. Method guidelines for systematic reviews in the Cochrane Collaboration Back Review Group for Spinal Disorders. Spine 22, 2323-2330.
Verbeek, J.H.A.M., Hulshof, C.T.J., van Dijk, F.J.H., Kroon, P.J., 1993. Evaluation of an occupational health care programme: negative results, positive results or failure? Occupational Medicine 43 (Suppl. 1), S34-S37.
Viscusi, W.K., 1985. Cotton dust regulation: an OSHA success story? Journal of Policy Analysis and Management 4, 325-343.
Vojtecky, M.A., Schmitz, M.F., 1986. Program evaluation and health and safety training. Journal of Safety Research 17, 57-63.
Waller, J.A., 1980. A systems model for safety program evaluation. Accident Analysis and Prevention 12, 1-5.
Walsh, N.E., Schwartz, R.K., 1990. The influence of prophylactic orthoses on abdominal strength and low back injury in the workplace. American Journal of Physical Medicine & Rehabilitation 69, 245-250.
Windsor, R., Baranowski, T., Clark, N., Cutter, G., 1994. Evaluation of Health Promotion, Health Education and Disease Prevention Programs, 2nd Edition, Mayfield Publishing, Toronto.
Zwerling, C., Daltroy, L.H., Fine, L.J., Johnston, J.J., Melius, J., Silverstein, B.A., 1997. Design and conduct of occupational injury intervention studies: a review of evaluation strategies. American Journal of Industrial Medicine 32, 164-179.