Comprehensive Psychiatry 48 (2007) 88 – 94 www.elsevier.com/locate/comppsych
Generalizability studies of the Global Assessment of Functioning–Split version

Geir Pedersen a,*, Knut Arne Hagtvet b, Sigmund Karterud c

a Department of Personality Psychiatry, Ullevål University Hospital, 0407 Oslo, Norway
b Department of Psychology, University of Oslo, 0317 Oslo, Norway
c Institute for Psychiatry, Faculty of Medicine, University of Oslo, 0318 Oslo, Norway

* Corresponding author. E-mail address: [email protected] (G. Pedersen).
Abstract

Objective: The study aimed to use generalizability theory to investigate the reliability and precision of the split version of the Global Assessment of Functioning (GAF).
Materials: Six case vignettes were assessed by 2 samples, one of 19 experienced and independent raters and another of 58 experienced raters from 8 different day-treatment units, evaluating both the symptom and function scores of GAF.
Methods: Generalizability studies were conducted to disentangle the relevant variance components accounting for error variance in GAF scores. Furthermore, decision studies were conducted to estimate the reliability of different measurement designs, as well as precision in terms of the error/tolerance ratio.
Results: Both symptom and function scores of GAF were found to be highly generalizable, and a measurement design of 2 raters per subject was found to be most efficient with respect to reliability, precision, and use of resources.
Conclusion: Both symptom and function scores of GAF seem highly consistent across experienced raters.
© 2007 Elsevier Inc. All rights reserved.
1. Background

The first standardized and broadly used instrument for assessing patients' overall mental health was introduced in 1962, when Luborsky [1] reported the development of the Health-Sickness Rating Scale. Some decades later, Endicott et al [2] modified the original instrument, which resulted in the Global Assessment Scale (GAS). Both the Health-Sickness Rating Scale and the GAS are single 100-point rating scales reflecting overall functioning, from 1, representing the hypothetically sickest, to 100, representing the hypothetically healthiest individual. In the third edition of the Diagnostic and Statistical Manual of Mental Disorders [3], axis V was introduced as a measure of "adaptive functioning," scored on a 7-point scale ranging from superior to grossly impaired. In 1987, the Global Assessment of Functioning (GAF) scale replaced axis V in DSM-III-R [4] to reflect psychological, social, and occupational functioning. The GAF scale was based on, and made very similar to, the GAS, although the upper range from 91 to 100 was omitted. There
has been, and still is, a great deal of skepticism concerning the use of a single scale to assess both the level of social and occupational functioning and the severity of psychiatric symptoms [5,6]. With DSM-IV [7], the GAF scale was extended to a 100-point scale, and axis V was supplemented with the Global Assessment of Relational Functioning scale and the Social and Occupational Functioning Assessment Scale for further study. In short, the procedure for setting a GAF score is to consider all available information about a subject's psychiatric symptoms and social and occupational level, and then to decide on a score in accordance with the lowest level of either symptoms or functioning. The time frame considered usually varies from the past year overall to the present past week. The 2 main purposes of the GAF score are to reflect the need for psychiatric treatment and to measure treatment outcome. However, the characteristics of the GAF scale, that is, combining the severity of psychiatric symptoms and the level of social functioning into a single measure, along with rather modest scoring instructions, have led to several studies on the reliability of GAF. In 1995, Hall [8] reported a higher (and acceptable) intraclass correlation (ICC) for the modified (DSM-III-R) GAF than for the original scale (DSM-III),
whereas in another study, Rey et al [9] found the interrater reliability rather moderate. In a study conducted by Løvdahl and Friis [10], interrater reliability was found to be high among experienced GAF raters and unsatisfactory among untrained raters. Furthermore, they identified raters with systematically high, systematically low, as well as random extreme deviations from the mean scores of the case vignettes. Concluding that raters' biases were large enough to cause spurious intercenter systematic differences, they strongly suggested routine quality assurance and training programs for GAF. More recently, Fallmyr and Repål [11] found that a single 1-day course in GAF is insufficient to obtain reliable GAF scores. Reliable GAF scores do not necessarily imply that they are of any clinical use, and several studies have been conducted to map their associations with other clinical phenomena, of which many reported only moderate validity [5]. However, significant associations have been reported between GAF scores and the presence of axis II pathology, self-reported symptom distress, interpersonal problems, as well as social functioning [12-14].

1.1. The split version of GAF

In 1998, Karterud et al [15] in Norway constructed a GAF scale that was divided into one symptom score and one function score. A set of 25 case vignettes was constructed to reflect a wide range of the GAF scale. An expert panel of 19 experienced psychiatrists from several mental health regions of Norway scored all vignettes on both the symptom and function dimensions. Twenty of these vignettes were used in the Norwegian National GAF Training Program on the Internet, with the mean scores from the expert panel used as reference norms. Later, 7 of these case vignettes were assessed by group consensus of the staff in 8 day hospital units, all members of the Norwegian Network of Psychotherapeutic Day Hospitals [13]. No consistent pattern of under- or overestimation was found among the units, and the ICC [16] (2,1; single measure, absolute agreement definition) was 0.97 (95% CI, 0.94-0.99) and 0.94 (95% CI, 0.85-0.98) for symptom and function scores, respectively. Thus, GAF scores rated in plenum by these units seemed highly similar. In comparison, for the same set of case vignettes, the estimated ICC (2,1; single measure, absolute agreement definition) for the "expert raters" was 0.93 (95% CI, 0.83-0.98) and 0.90 (95% CI, 0.78-0.98) for symptom and function scores, respectively. Although the estimates from the day units scoring the case vignettes in plenum were slightly higher than those for the expert raters, the estimates were very close, high, and indeed promising. This could indicate that GAF scores assessed by independent raters are as reliable as those assessed by plenum discussion. Implicit in splitting the GAF scale is the assumption that the 2 measurements measure different clinical aspects of a patient's condition. In other words, there is an expectation of "differential construct validity" [17]. In a study of patients treated in the Norwegian Network of
Psychotherapeutic Day Hospitals [13], 649 of 1244 patients were admitted to treatment after the introduction of the split GAF. The mean score on the GAF symptom level for these patients was 46.9 (SD, 5.8), and on the function level, 46.1 (SD, 6.5). The mean difference between the symptom and function levels of GAF (S − F) was 0.8 (SD, 5.4), ranging from −24 to 20. Thus, the 2 scores were found to lie close to each other, which could be an argument not to continue with the split GAF. However, the correlation between symptom and function scores was 0.61 (P < .01), which means that they share only 37% of their variance. The rest of the variance is accounted for by other systematic, and possibly important, clinical factors and/or by random error of measurement. Furthermore, about 30% of these patients had a difference between symptom and function scores greater than 5. By splitting the GAF scale, there was a clear expectation that the S and F scores should be separable; in other words, an expectation of differential construct validity was implicit. Concerning the assessment of differential construct validity, it has to be noted that the case vignettes in this study were constructed in such a way that the symptom and function levels should partly reflect the same area of the GAF scale. Therefore, the design of this study is not suitable for an examination of differential construct validity.

1.2. The objective of the study

The objective of this study was (1) to assess the generalizability of the symptom and function scores of GAF, respectively, with respect to both relative and absolute interpretations; and (2) to assess the precision of these measurements in terms of the error/tolerance (E/T) ratio.
2. Material and methods

2.1. The samples in the study

In this study, 2 samples are analyzed and compared, where both samples scored the symptom and function dimensions of GAF on the same set of 6 case vignettes. The first sample consisted of 19 experienced psychiatrists from several mental health regions of Norway, who scored the case vignettes in 1998. In 2001, 8 units from the Norwegian Network of Psychotherapeutic Day Hospitals [13] participated in a study in which the same case vignettes were GAF rated by all staff members. In all, 58 staff members independently rated the GAF levels. The staff members were all familiar with GAF; as members of the Norwegian network, all units had received an introductory course on GAF assessment, and all units routinely evaluated their patients on GAF. Staff members were psychiatrists, psychologists, psychiatric nurses, nurses, physiotherapists, occupational therapists, and social workers. Four of the 6 case vignettes had been used in a study conducted 3 years earlier, within 7 of the 8 units, where the GAF score was rated after plenum discussions (noted in the Background section). The subjects participated voluntarily after informed consent was
obtained, and the study was approved by the Regional Ethical Committee.

2.2. Assessing the reliability and precision of the split version of GAF

In clinical measurement, one measurement is no measurement. Despite this, often, and probably in most cases, only 1 staff member or clinician rates a patient's GAF score. Furthermore, raters within different units or clinical milieus might not be independent, accounting for systematic variation in GAF scores between local clinical groups. Ideally, we want only individual differences to account for the variation in observed scores. Classical test theory (CT) partitions observed score variance into 2 parts: one systematic (true score variance) and one random (error variance). However, in designs where the observed score is compounded of 3 or more sources of variance, CT falls short, as it cannot estimate them simultaneously in 1 analysis. For example, in a total sample of raters drawn from different clinical units, an ICC based on raters will "hide" the possible variance due to the unit effect, that is, systematic differences between units, whereas an analysis of variance with unit as the factor will "hide" the variance due to the rater effect. Because behavioral measurements may contain multiple sources of error, generalizability theory (G theory), which is well suited to address such issues, can be used. Within the framework of G theory, several variance components can be disentangled in just 1 analysis [18]. According to Kane [19,20], one task is to disentangle the different sources of variance accounting for the observed score variance, that is, to estimate the amount of true and error score variance. Another question is whether the measurements are precise enough for the particular use. Thus, the magnitude of measurement error has to be evaluated relative to the tolerance for error within the particular context. Reliability coefficients evaluate the standard error relative to the standard deviation of observed scores. Consequently, in a very heterogeneous population with a wide range of true scores, reliability estimates can be high even if the errors are large; in a very homogeneous population with small true score variance, reliability can be low even if the errors are small. The tolerance for error depends on the context and intended use of the measurement results. Thus, error tolerance is closely connected to validity, and the magnitude of random error in the measurements should be evaluated relative to this tolerance. A general index of context-specific precision is given by the standard error of measurement divided by the standard deviation expressing the tolerance for error. This gives an E/T ratio. If the E/T ratio is small, it indicates an acceptable level of precision within the context. If the ratio is relatively large, the precision is low, indicating a validity problem in the context. The E/T ratio allows many possible definitions of tolerance and provides a more fundamental angle on the issue of precision than commonly applied reliability coefficients.
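To make the E/T ratio concrete, the following minimal Python sketch (our own illustration, not part of the original analyses; the function name is ours) computes the ratio from an error variance and a stated tolerance. The illustrative numbers echo the 1-rater staff design reported later in Table 3, with the tolerance set equal to the universe score SD:

import math

def error_tolerance_ratio(error_variance: float, tolerance_sd: float) -> float:
    """E/T ratio: the standard error of measurement divided by the
    standard deviation expressing the tolerance for error."""
    sem = math.sqrt(error_variance)  # standard error of measurement
    return sem / tolerance_sd

# Illustrative values echoing the 1-rater staff design in Table 3:
# relative error variance ~21.447 and universe score variance ~109.007.
et = error_tolerance_ratio(21.447, math.sqrt(109.007))
print(f"E/T = {et:.3f}")  # ~0.444: the error is about 44% of the tolerance

When the tolerance is defined as the universe score SD, the squared ratio reduces to the (E/T)²rel statistic reported in Table 3; other definitions of tolerance simply replace the denominator.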
2.3. Generalizability theory

In classical test theory, the concept of reliability is connected to how accurately observed scores reflect their corresponding true scores. In G theory, reliability is replaced by the concept of generalizability, reflecting how accurately an observed score permits us to generalize to the average score that a person would receive under all possible and acceptable conditions [17]. An essential feature of G theory is its conceptualization of multiple error components, which allows a differentiated assessment of measurement error, in contrast to the single error term of classical test theory. A G study [18] makes an explicit separation of empirical information into facets of observation and objects or targets of measurement, respectively. A facet represents the set of all acceptable conditions of observation of a particular kind; for example, the set of all acceptable raters is the rater facet, and the set of all acceptable units is the unit facet. The facet, as used in G theory, serves to emphasize the distinction between the object or target of measurement and the conditions of observation. Commonly, persons, here case vignettes, are the objects of measurement, whereas raters, units, etc, constitute the facets of observation. The distinction between object of measurement and facets of observation determines how information is described and organized for a specific measurement purpose. Essential to the conceptual framework of G theory is the definition of 2 types of universe. These are the "universe of admissible observations," which underscores the relevance of its facets for a specific measurement purpose, and the "universe of generalization," which addresses the universe to which generalizations are made. The purpose of a G study is to estimate variance components reflecting different types of inconsistency of equivalent conditions of facets in the universe of admissible observations. The variance components are the building blocks of generalizability estimation. Based on the collected sample data, the relative impact of different sources of variation, in terms of variance components in the universe of admissible observations, is estimated by a G study. Based on the obtained G-study components, the generalizability framework offers a subsequent study (decision [D] study) to decide (a) which facets in the G study need to be controlled, and (b) how many conditions of each facet are necessary to obtain adequate generalizability for the actual measurement operation. Based on the G study, different D studies can be performed depending on the purpose of measurement. As part of a D study, a generalizability coefficient and different indices of error can be estimated. In G theory, there are 2 types of reliability coefficients. The first is called the "generalizability coefficient," denoted Eρ². This is the analogue of a reliability coefficient in CT and indicates the level of consistency in the rank order of the measurements. The second is called the "dependability coefficient," denoted Φ, and can be interpreted as the generalizability coefficient for absolute decisions [21].
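To illustrate what a G study estimates in the simplest case, here is a minimal Python sketch for the 1-facet crossed p × r design (our own illustration; the study itself used GENOVA and urGENOVA). It solves the expected-mean-square equations of Appendix A for a persons-by-raters matrix of ratings:

import numpy as np

def g_study_p_x_r(X: np.ndarray):
    """G study for the 1-facet crossed p x r design.
    X: n_p x n_r matrix of ratings (vignettes x raters).
    Returns the variance components (s2_p, s2_r, s2_pr)."""
    n_p, n_r = X.shape
    grand = X.mean()
    # Mean squares from the two-way ANOVA with one observation per cell
    ms_p = n_r * np.sum((X.mean(axis=1) - grand) ** 2) / (n_p - 1)
    ms_r = n_p * np.sum((X.mean(axis=0) - grand) ** 2) / (n_r - 1)
    resid = X - X.mean(axis=1, keepdims=True) - X.mean(axis=0) + grand
    ms_pr = np.sum(resid ** 2) / ((n_p - 1) * (n_r - 1))
    # Solve the EMS equations (Appendix A): EMS_p = s2_pr + n_r * s2_p, etc.
    s2_pr = ms_pr
    s2_p = max((ms_p - ms_pr) / n_r, 0.0)  # negative estimates are set to 0
    s2_r = max((ms_r - ms_pr) / n_p, 0.0)
    return s2_p, s2_r, s2_pr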
2.4. The design of the study

In terms of G theory, the GAF scores of the hypothetical persons described in the case vignettes are the objects of measurement, and raters and units represent facets of observation. Furthermore, the 6 case vignettes are considered a random sample of all admissible (ie, acceptable) case vignettes, and raters and units are considered random facets of raters and units, respectively. All raters rate each GAF case vignette; thus, case vignettes (p) are crossed with raters (r). However, in our second sample, several raters come from the same unit. In G theory terms, raters are nested within units. Thus, the design for the G study of sample 2 (staff members) is denoted p × (r:u), that is, persons are crossed with raters and units, and raters are nested within units. However, there were different numbers of raters within units, making this an unbalanced, nested, 2-facet random effects design. The G study of sample 1 (expert raters) is a 1-facet, crossed (p × r) design (see Fig. 1).

Fig. 1. Graphical illustration of the measurement designs.

2.5. Analysis and statistics

A G study will be conducted to disentangle the different variance components and their interaction effects. Based on the results of the G study, a decision (D) study will be conducted to compare the effectiveness of different measurement designs, so as to minimize error and maximize reliability. See Appendix A for the expected mean squares of the 2 measurement designs, and Appendix B for the formulas for the generalizability and dependability coefficients. Descriptive statistics were computed using SPSS for Windows, release 11.0 (SPSS, Chicago, Ill) [22]. The G and D studies were processed through the GENOVA program [23]. However, because of the unbalanced p × (r:u) design, data from the second sample were first processed through the urGENOVA program [24], which allows unbalanced designs, to obtain variance components that were then processed through GENOVA. According to Brennan [23], this is reasonable as long as the data are not unbalanced by design (ie, not planned). In our p × (r:u) design, the unbalanced number of raters within units was arbitrary, not planned (Table 1).
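As a sketch of the D-study logic (our own illustration, using the Appendix B formulas for the crossed design and the expert raters' symptom-score variance components reported later in Table 2), one can vary the number of raters n_r and watch the coefficients change:

def d_study_p_x_r(s2_p: float, s2_r: float, s2_pr: float, n_r: int):
    """D study for the crossed p x r design (Appendix B formulas)."""
    rel_err = s2_pr / n_r             # relative error variance
    abs_err = rel_err + s2_r / n_r    # absolute error variance
    e_rho2 = s2_p / (s2_p + rel_err)  # generalizability coefficient
    phi = s2_p / (s2_p + abs_err)     # dependability coefficient
    return e_rho2, phi

# Expert raters' symptom scores (G-study estimates from Table 2):
for n_r in (1, 2, 4):
    e_rho2, phi = d_study_p_x_r(136.30292, 2.55984, 17.01813, n_r)
    print(f"n_r = {n_r}: E_rho2 = {e_rho2:.3f}, Phi = {phi:.3f}")
# n_r = 1: .889/.874; n_r = 2: .941/.933; n_r = 4: .970/.965 (cf. Table 3)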
3. Results

The main findings of this study are as follows: both relative and absolute agreement of GAF symptom and function scores were found to be acceptable, even when assuming only a single rater. Both symptom and function scores of GAF were found to be highly consistent across raters and units. Considering the length of the GAF scale (100 points), only minor differences between raters were found. Table 2 shows the estimated variance components of the 2 measurement designs. For both designs, the largest variance was found for the person component, reflecting the systematic variability between the case vignettes in their GAF levels. For the unbalanced p × (r:u) design, the unit effect (u) was practically 0. This means that GAF scores did
Table 1
Descriptive statistics of GAF scores

            Staff members (n = 58)                        Expert raters (n = 19)
            Symptom                Function               Symptom                Function
Vignette    Mean  SD   Min-max     Mean  SD   Min-max     Mean  SD   Min-max     Mean  SD   Min-max
A           41.6  4.1  25-49       38.6  3.8  30-48       42.9  2.6  38-47       40.1  4.1  34-47
B           68.2  5.0  51-80       73.2  5.6  53-82       71.3  4.0  62-78       74.1  5.3  64-81
C           53.2  5.2  41-70       54.8  5.7  43-69       51.9  4.9  43-59       52.2  5.3  45-62
D           42.0  4.0  32-65       41.5  6.5  29-65       41.3  4.8  32-53       39.2  6.1  30-52
E           60.3  6.8  45-72       62.7  6.9  49-75       62.6  5.1  53-71       66.6  6.8  53-78
F           57.0  4.7  49-70       56.8  5.2  43-68       59.5  4.7  52-68       59.3  7.1  49-74
Mean        53.7                   54.6                   54.9                   55.2
Table 2
Estimated G-study variance components

                    Symptom scores                       Function scores
Effect              VC         %     df    SE            VC         %     df    SE
Staff members a
  Person (p)        109.00738  80.6  5     58.54780      167.79196  83.7  5     90.05465
  Unit (u)          0.00000    0.0   7     0.06180       0.10328    0.1   7     0.09224
  Rater:unit (r:u)  4.82650    3.6   48    0.02972       2.84232    1.4   48    0.02797
  pu                1.32820    1.0   35    0.12642       1.45517    0.7   35    0.16560
  pr:u              20.11902   14.9  240   0.03266       28.15171   14.0  240   0.04570
Expert raters
  Person (p)        136.30292  87.4  5     73.33586      198.07388  85.3  5     106.67873
  Rater (r)         2.55984    1.6   18    0.09247       5.61930    2.4   18    0.17666
  pr                17.01813   10.9  90    0.13206       28.56296   12.3  90    0.22165

The variance component for unit (u) on the symptom scores was negative (−0.53775) and was therefore set to 0. SE for staff members is estimated under the assumption of a balanced design. VC indicates variance component.
a Variance components from urGENOVA of the p × (r:u) design for staff members, analyzed by GENOVA.
not differ between units, averaged over raters within units. Because raters are nested within units, it is impossible to separate the rater main effect from the interaction effect between raters and units. However, this combined effect (r:u) can be interpreted as the main effect of raters, averaged over units. The rank order of GAF scores differed little between units (pu), accounting for only 1% of the total variance. The relatively large residual effect (pr:u) of 14.9% reflects changes in the rank order of GAF scores over raters within units or other unmeasured events, plus random error variance. The estimated variance components of the symptom and function scores were highly comparable. For the balanced p × r design of the expert raters, the rater effect accounted for only 1.6% of the variance, indicating no
substantial difference between raters in their scores. The residual effect (pr) indicates error variance due to the interaction between rater and person, and/or other unmeasured events, plus random error variance. The estimated variance components from the staff members and the expert raters were also highly comparable. Table 3 shows the results of the D study for different variations of the 2 measurement designs. In the p × (r:u) design, we had 8 units and assumed 7 raters within each unit in the data collection design of the G study. Such a comprehensive design may be characterized as expensive and unrealistic. However, in this case, it nevertheless represents an important basis for estimating stable G-study variance components, which in turn will
Table 3
D-study results and error tolerance statistics for different measurement designs

                Four raters            Two raters             One rater
                Symptom   Function    Symptom   Function    Symptom   Function
Staff members a
  σ²τ           109.007   167.792     109.007   167.792     109.007   167.792
  σ²δ           6.358     8.493       11.388    15.531      21.447    29.607
  σ²Δ           7.565     9.307       13.801    17.055      26.274    32.553
  Eρ²           .945      .952        .905      .915        .836      .850
  Φ             .935      .947        .888      .908        .806      .838
  Eσ²X          115.365   176.285     120.395   183.323     130.454   197.399
  (E/T)²rel     .058      .051        .104      .093        .197      .176
  (E/T)rel      .242      .225        .323      .304        .444      .420
Expert raters
  σ²τ           136.303   198.074     136.303   198.074     136.303   198.074
  σ²δ           4.255     7.141       8.509     14.282      17.018    28.563
  σ²Δ           4.895     8.546       9.789     17.092      19.578    34.182
  Eρ²           .970      .965        .941      .933        .889      .874
  Φ             .965      .959        .933      .921        .874      .853
  Eσ²X          140.558   205.215     144.812   212.356     153.321   226.637
  (E/T)²rel     .031      .036        .062      .072        .125      .144
  (E/T)rel      .177      .190        .250      .269        .353      .380

σ²τ indicates universe score variance; σ²δ, relative error variance; σ²Δ, absolute error variance; Eρ², generalizability coefficient (for relative decisions); Φ, dependability coefficient (for absolute decisions); Eσ²X, expected observed score variance.
a nu = 1.
represent a foundation for providing reliability estimates in more applied measurement situations. The same goes for the p × r design, with 19 raters. Therefore, 3 more realistic designs were examined for each of the 2 samples. The reliability was estimated for situations with 4, 2, and only 1 rater, respectively. For the staff member design, reliability with respect to Eρ² and Φ was estimated assuming that the raters come from only 1 unit. With only 1 rater, the expected error variance for symptom and function scores is around 15% to 20% (Table 3), which is considered acceptable for most psychological measurement. Increasing the number of raters to 2, that is, taking the mean score of 2 raters, decreases the expected error variance to less than 10%. With 4 raters per case vignette, the reliability increases to around 95%. Table 3 also shows the error tolerance statistics for the 3 D-study designs. In the 1-rater design for staff members, the E/T ratio indicates relatively weak precision for symptom and function scores. Here, the size of the error is about 20% of the tolerance for error: (E/T)²rel = 0.197 and 0.176 for symptom and function scores, respectively. In the 2-rater design, the size of the error decreases to about 10% of the tolerance for error, (E/T)²rel = 0.104 and 0.093, indicating relatively good precision. In the 4-rater design, the precision is even higher, with the size of the error under 6% of the tolerance for error: (E/T)²rel = 0.058 and 0.051. The error tolerance statistics for the expert raters are highly comparable to those for the staff members (Appendix C). However, the expert raters seem to show slightly higher precision on both the symptom and function scores of GAF.

4. Discussion

With respect to both relative and absolute interpretations, GAF scores from single independent raters hold acceptable reliability. Furthermore, there seems to be no noticeable systematic difference between the groups of local raters. Thus, GAF scores from different clinical units seem highly interchangeable. As to the reliability of the difference scores, we found no support for any differential construct validity of GAF. Although the reliability, given by Eρ² and Φ, of the symptom and function scores is acceptable for the 1-rater design, the E/T ratio is too high to indicate acceptable precision. With 2 raters, the precision increases substantially and is considered acceptable for expert raters. In the 4-rater design, the precision is highly satisfying for both staff members and expert raters. As to the precision of the difference scores, even a 4-rater design gave far from acceptable results for either staff members or expert raters. Although GAF scores set by just 1 rater show acceptable reliability, the use of 2 raters gives a considerable improvement, with estimated error variance accounting for less than 10% (Table 3). Increasing the number of raters to 4 will increase reliability somewhat further, but
considering the use of resources, the optimal number of raters seems to be just 2. Reliability coefficients are largely dependent on the objects of measurement and the differences between their true scores. In this study, the case vignettes were selected to cover a wide range of the GAF scale. The larger the distance between the true scores, the larger the tolerance for error will be compared with the error variance. Thus, it is not surprising that the raters were able to distinguish between the case vignettes in a way that revealed high reliability and acceptable precision of the symptom and function scores, respectively. A consequence of this might be some overestimation of the reliability of the relative interpretation. The scoring manual for GAF is divided into 10 sections, each with a general description of the symptom and function level it represents on the scale, along with some examples and keywords. Considering the complexity of the information each patient represents, it might not be reasonable to expect closer agreement on GAF. For example, there are no instructions or criteria that clearly distinguish between a GAF score of 42 and one of 46. At this level of precision, GAF scores have to be rather arbitrary estimates. Considering the relatively modest clinical information in the case vignettes (approximately half a page) compared with what clinicians usually know about their patients, this study might underestimate the reliability of the absolute interpretation. We are still reminded of the correlation of 0.61 between the symptom and function scores, reported in the introduction of this article, estimated from a sample of real-life patients. Although the mean difference between the symptom and function levels of GAF may not be large, the correlation of 0.61 suggests that these 2 scores are likely to be separable in real-life patients. The 2 GAF scores might share their variance with different clinical variables. For example, it could be that the symptom score shares more variance with other measures of symptoms, and the function score shares more variance with measures of interpersonal and social functioning. Several studies have supported this suggestion [12-14], but to date we have found no study focusing directly on the symptom and function scores of GAF. Accordingly, future studies on GAF should focus on the validity of both the symptom and function levels of GAF scores in general, and in particular on the clinical implications of the difference score, especially among patients who show large difference scores in either direction.

Acknowledgment

We wish to thank the staff from the day units in the Norwegian Network of Psychotherapeutic Day Hospitals, and the expert raters, for their effort in scoring the GAF case vignettes, which made this study possible.
Appendix A. Expected mean squares for the expert raters and staff members

Expected mean squares for the expert raters a:

$\mathrm{EMS}_p = \sigma^2_{pr} + n_r\sigma^2_p$
$\mathrm{EMS}_r = \sigma^2_{pr} + n_p\sigma^2_r$
$\mathrm{EMS}_{pr} = \sigma^2_{pr}$

Expected mean squares for the staff members b:

$\mathrm{EMS}_p = \sigma^2_{pr:u} + n_r\sigma^2_{pu} + n_r n_u\sigma^2_p$
$\mathrm{EMS}_u = \sigma^2_{pr:u} + n_r\sigma^2_{pu} + n_p\sigma^2_{r:u} + n_p n_r\sigma^2_u$
$\mathrm{EMS}_{r:u} = \sigma^2_{pr:u} + n_p\sigma^2_{r:u}$
$\mathrm{EMS}_{pu} = \sigma^2_{pr:u} + n_r\sigma^2_{pu}$
$\mathrm{EMS}_{pr:u} = \sigma^2_{pr:u}$

a p × r design.
b p × (r:u) design. p, r, and u are assumed random. n = sample size.
Appendix B. Formulas for generalizability and dependability coefficients Formulas for the expert ratersa q2 ¼
U¼
r2p r2pr nr V
q2 ¼
þ r2p r2p
r2pr nr V
Formulas for the staff membersb
r2
þ nrrV þ r2p
U¼
r2p r2pr:u nr Vnu V
þ
r2pu nu V
þ r2p r2p
r2pr:u nr Vnu V
þ
r2pu nu V
r2
r2
þ nuuV þ nr Vnr:uu V þ r2p
a
p r design. p (r:u) design. nV = number of r or u used in the study. E n˜2 indicates generalizability coefficient (for relative decisions); U, dependability coefficient (for absolute decisions). b
Appendix C. Formulas for error tolerance statistics

Expert raters (p × r design):

$\sigma^2_\tau = \sigma^2_p$
$\sigma^2_\delta = \sigma^2_{pr}/n_r$
$\sigma^2_\Delta = \sigma^2_\delta + \sigma^2_r/n_r$
$E\rho^2 = \sigma^2_\tau/(\sigma^2_\delta + \sigma^2_\tau)$
$\Phi = \sigma^2_\tau/(\sigma^2_\Delta + \sigma^2_\tau)$
$E\sigma^2_X = \sigma^2_\tau + \sigma^2_\delta$
$(E/T)^2_{rel} = \sigma^2_\delta/\sigma^2_\tau$

Staff members (p × (r:u) design):

$\sigma^2_\tau = \sigma^2_p$
$\sigma^2_\delta = \sigma^2_{pr:u}/(n_r n_u) + \sigma^2_{pu}/n_u$
$\sigma^2_\Delta = \sigma^2_\delta + \sigma^2_{r:u}/(n_r n_u) + \sigma^2_u/n_u$
$E\rho^2 = \sigma^2_\tau/(\sigma^2_\delta + \sigma^2_\tau)$
$\Phi = \sigma^2_\tau/(\sigma^2_\Delta + \sigma^2_\tau)$
$E\sigma^2_X = \sigma^2_\tau + \sigma^2_\delta$
$(E/T)^2_{rel} = \sigma^2_\delta/\sigma^2_\tau$
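As a check on these formulas, the following Python sketch (our own illustration, not part of the original analyses) computes the Appendix C statistics for the nested staff design from the Table 2 variance components; with 2 raters from a single unit it reproduces the corresponding column of Table 3:

import math

def et_statistics_nested(s2_p, s2_u, s2_ru, s2_pu, s2_pru, n_r, n_u=1):
    """Error/tolerance statistics for the p x (r:u) design (Appendix C)."""
    s2_tau = s2_p                                           # universe score variance
    s2_delta = s2_pru / (n_r * n_u) + s2_pu / n_u           # relative error variance
    s2_Delta = s2_delta + s2_ru / (n_r * n_u) + s2_u / n_u  # absolute error variance
    e_rho2 = s2_tau / (s2_tau + s2_delta)                   # generalizability coefficient
    phi = s2_tau / (s2_tau + s2_Delta)                      # dependability coefficient
    et2_rel = s2_delta / s2_tau                             # (E/T)^2, relative
    return e_rho2, phi, et2_rel, math.sqrt(et2_rel)

# Staff members' symptom scores (Table 2), 2 raters from 1 unit:
print(et_statistics_nested(109.00738, 0.0, 4.82650, 1.32820, 20.11902, n_r=2))
# ~ (0.905, 0.888, 0.104, 0.323), matching the 2-rater column of Table 3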
References

[1] Luborsky L. Clinicians' judgments of mental health. Arch Gen Psychiatry 1962;7:407-17.
[2] Endicott J, Spitzer RL, Fleiss JL, Cohen J. The Global Assessment Scale. A procedure for measuring overall severity of psychiatric disturbance. Arch Gen Psychiatry 1976;33:766-71.
[3] American Psychiatric Association. Diagnostic and statistical manual of mental disorders. 3rd ed. Washington (DC): American Psychiatric Association; 1980.
[4] American Psychiatric Association. Diagnostic and statistical manual of mental disorders. 3rd ed, revised. Washington (DC): American Psychiatric Association; 1987.
[5] Goldman HH, Skodol AE, Lave TR. Revising axis V for DSM-IV: a review of measures of social functioning. Am J Psychiatry 1992;149(9):1148-56.
[6] Bacon SF, Collins M, Plake EV. Does the Global Assessment of Functioning assess functioning? J Ment Health Couns 2002;24(3):202-12.
[7] American Psychiatric Association. Diagnostic and statistical manual of mental disorders. 4th ed. Washington (DC): American Psychiatric Association; 1994.
[8] Hall RC. Global assessment of functioning. A modified scale. Psychosomatics 1995;36(3):267-75.
[9] Rey JM, Starling J, Wever C, Dossetor DR, Plapp JM. Inter-rater reliability of global assessment of functioning in a clinical setting. J Child Psychol Psychiatry 1995;36(5):787-92.
[10] Løvdahl H, Friis S. Routine evaluation of mental health: reliable information or worthless 'guesstimates'? Acta Psychiatr Scand 1996;93:125-8.
[11] Fallmyr Ø, Repål A. Evaluering av GAF-skåring som del av Minste Basis Datasett [Evaluation of GAF scoring as part of the Minimum Basic Data Set]. Tidsskrift for Norsk Psykologforening 2000;39:1118-9.
[12] Hilsenroth MJ, Ackerman SJ, Blagys MD, Baumann BD, Baity MR, et al. Reliability and validity of DSM-IV axis V. Am J Psychiatry 2000;157(11):1858-63.
[13] Karterud S, Pedersen G, Bjordal E, et al. Day hospital treatment of patients with personality disorders. Experiences from a Norwegian treatment research network. J Pers Disord 2003;17(2):173-93.
[14] Startup M, Jackson MC, Bendix S. The concurrent validity of the Global Assessment of Functioning (GAF). Br J Clin Psychol 2002;41(4):417-22.
[15] Karterud S, Pedersen G, Løvdahl H, Friis S. Global Assessment of Functioning–Split version (S-GAF): background and scoring manual. Oslo: Department of Psychiatry, Ullevål University Hospital; 1998.
[16] Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull 1979;86(2):420-8.
[17] Eikeland HM. Generalizability estimates for difference scores: an aspect of the construct validity of tests. In: Fhanér S, Sjöberg L, editors. Measurement in differential psychology: a symposium. Göteborg Psychological Reports, vol. 3 (6), 1973. p. 19-22.
[18] Shavelson RJ, Webb NM, Rowley GL. Generalizability theory. Am Psychol 1989;44(6):922-32.
[19] Kane M. The precision of measurement. Appl Meas Educ 1996;9(4):355-79.
[20] Miller TB, Kane M. The precision of change scores under absolute and relative interpretations. Appl Meas Educ 2001;14(4):307-27.
[21] Shavelson RJ, Webb NM. Generalizability theory. A primer. Newbury Park (CA): Sage; 1991.
[22] SPSS Inc. SPSS Base 10.0 for Windows user's guide. Chicago (Ill): SPSS; 1999.
[23] Crick JE, Brennan RL. Manual for GENOVA: a generalized analysis of variance system. Iowa City (IA): The American College Testing Program; 1983.
[24] Brennan RL. Manual for urGENOVA, version 2.1. Iowa City (IA): Iowa Testing Programs, University of Iowa; 2001.