Journal of Statistical Planning and Inference 137 (2007) 1199 – 1212 www.elsevier.com/locate/jspi
Confidence intervals on intraclass correlation coefficients in a balanced two-factor random design

Kye Gilder (a,∗), Naitee Ting (b), Lili Tian (c), Joseph C. Cappelleri (d), R. Choudary Hanumara (e)

(a) Biogen Idec Inc., San Diego, CA 92122, USA
(b) Pfizer Inc., Global Research & Development, New London, CT 06320, USA
(c) Department of Biostatistics, School of Public Health and Health Professions, Buffalo, NY 14214-3000, USA
(d) Pfizer Inc., Global Research & Development, Groton, CT 06340, USA
(e) Department of Computer Science and Statistics, University of Rhode Island, Kingston, RI 02881, USA
Received 19 November 2003; accepted 25 March 2006 Available online 11 May 2006
Abstract

A modified large-sample (MLS) approach and a generalized confidence interval (GCI) approach are proposed for constructing confidence intervals for intraclass correlation coefficients. Two particular intraclass correlation coefficients are considered in a reliability study. Both subjects and raters are assumed to be random effects in a balanced two-factor design, which includes subject-by-rater interaction. Computer simulation is used to compare the coverage probabilities of the proposed MLS approach (GiTTCH) and GCI approaches with the Leiva and Graybill [1986. Confidence intervals for variance components in the balanced two-way model with interaction. Comm. Statist. Simulation Comput. 15, 301–322] method. The competing approaches are illustrated with data from a gauge repeatability and reproducibility study. The GiTTCH method maintains at least the stated confidence level for interrater reliability. For intrarater reliability, the coverage is accurate in several circumstances but can be liberal in some circumstances. The GCI approach provides reasonable coverage for lower confidence bounds on interrater reliability, but its corresponding upper bounds are too liberal. Regarding intrarater reliability, the GCI approach is not recommended because the lower bound coverage is liberal. Comparing the overall performance of the three methods across a wide array of scenarios, the proposed modified large-sample approach (GiTTCH) provides the most accurate coverage for both interrater and intrarater reliability. © 2006 Elsevier B.V. All rights reserved.

MSC: 62F25; 62J10; 62K99; 62P10

Keywords: Modified large-sample approach; Generalized confidence interval; Reliability study; Interrater reliability; Intrarater reliability; Variance components
∗ Corresponding author. Tel.: +1 858 4015676; fax: +1 858 4018003. E-mail address: [email protected] (K. Gilder).
0378-3758/$ - see front matter © 2006 Elsevier B.V. All rights reserved. doi:10.1016/j.jspi.2006.03.002

1. Introduction

Since the introduction of the intraclass correlation coefficient by R. A. Fisher, its use as a measure of reliability has received much attention (Bartko, 1966). Its usefulness and application in the social, behavioral, and medical sciences have been clearly demonstrated (Bartko, 1966; Lin et al., 2002; Fleiss, 1986). Researchers have become increasingly aware of the observer or rater as a source of measurement error. Because measurement error can impair the statistical analysis and its interpretation, it is important to quantify the amount of measurement error by use of some form of a reliability index such as a kappa statistic, concordance correlation coefficient, or an intraclass correlation coefficient (Lin et al., 2002; Fleiss, 1986; Fleiss and Cohen, 1973; Landis and Koch, 1977; Fleiss and Shrout, 1979; Armitage et al., 1994).

The intraclass correlation coefficient is based upon standard analysis of variance (ANOVA) models and the estimation of variance components. Although assumptions of normality for these models may not be warranted in certain cases, the ANOVA procedure is generally robust and permits the estimation of the appropriate variance components and the intraclass correlation coefficients (Landis and Koch, 1977; Searle, 1971). Typically, intraclass correlation coefficients are the ratio of variance components of interest to total variance. There are numerous forms of intraclass correlation, each appropriate for different models and objectives (Fleiss and Cohen, 1973; Müller and Büttner, 1994; St. Laurent, 1998). The appropriate version is dictated by the specific situation defined by the experimental design and conceptual intent of the reliability study (Fleiss and Shrout, 1979). This paper concentrates on two particular intraclass correlation coefficients that measure interrater reliability and intrarater reliability.

For the balanced two-factor random design with only one observation per subject–rater cell, in which the interaction term cannot be separated from the error term, confidence intervals for interrater reliability have been presented by several authors including Fleiss and Shrout (1978), Zou and McDermott (1999), Cappelleri and Ting (2003), Arteaga et al. (1982), Burdick and Graybill (1988, 1992) and Tian and Cappelleri (2004).
For the balanced two-factor random design with multiple replicates per subject–rater combination (cell), which affords assessment of subject-by-rater interaction, confidence intervals for interrater reliability have been presented using a modified large-sample (MLS) approach by Leiva and Graybill (1986) and Burdick and Graybill (1992). For general discussion of intraclass correlation coefficients, Adamec and Burdick (2003) propose a Satterthwaite approach and Hamada and Weerahandi (2000) propose a generalized confidence interval (GCI) approach. These investigations, however, do not provide confidence intervals for intrarater reliability.

This paper, therefore, focuses on applying a type of MLS approach to obtain approximate one-sided and two-sided confidence intervals for two particular forms of intraclass correlation: interrater reliability and intrarater reliability. Both reliabilities are derived from a study in which we assess the balanced two-factor random effects design having multiple replicates per cell. In addition, the proposed MLS and GCI approaches are compared to the Leiva and Graybill (LG) method based on lower bound coverage, upper bound coverage, two-sided confidence interval coverage, and confidence interval width.

Section 2 describes the two intraclass correlation coefficients. Section 3 explicates the derivation of the confidence bounds for the intraclass correlation coefficients using the proposed modified large-sample approach. Section 4 introduces the proposed generalized variables for interval estimation for the particular intraclass correlation coefficients. Section 5 covers the existing Leiva and Graybill method. Section 6 explains the computer simulation study, and Section 7 contains the simulation results. Section 8 illustrates the competing procedures with an example. Section 9 provides concluding remarks.

2. Interrater and intrarater reliability

We consider a reliability study in which each of the $I$ subjects is independently measured by each of the $J$ raters a total of $K$ times ($K > 1$). In addition, it is assumed that both subjects and raters are randomly selected from their respective populations. The appropriate statistical model is the two-factor crossed random effects model with an interaction term. The $k$th measurement on the $i$th subject by the $j$th rater is represented as

$$Y_{ijk} = \mu + A_i + B_j + (AB)_{ij} + \varepsilon_{ijk}, \quad i = 1, \ldots, I;\ j = 1, \ldots, J;\ k = 1, \ldots, K,\ K > 1, \tag{1}$$

where $\mu$ is the overall mean, $A_i$ is the effect of the $i$th subject, $B_j$ is the effect of the $j$th rater, $(AB)_{ij}$ is the effect of the interaction between the $i$th subject and the $j$th rater, and $\varepsilon_{ijk}$ is the unexplained error. The terms $A_i$, $B_j$, $(AB)_{ij}$, and $\varepsilon_{ijk}$ are assumed to be jointly independent normal random variables, each with a mean of zero and variances given by $\sigma_A^2$, $\sigma_B^2$, $\sigma_{AB}^2$, and $\sigma^2$, respectively. The expected mean squares for model (1) are shown in Table 1.
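Table 1
Expected mean squares for the two-factor crossed random model

SV      DF      MS        E(MS)
A       $n_1$   $S_1^2$   $\theta_1 = \sigma^2 + K\sigma_{AB}^2 + KJ\sigma_A^2$
B       $n_2$   $S_2^2$   $\theta_2 = \sigma^2 + K\sigma_{AB}^2 + KI\sigma_B^2$
AB      $n_3$   $S_3^2$   $\theta_3 = \sigma^2 + K\sigma_{AB}^2$
Error   $n_4$   $S_4^2$   $\theta_4 = \sigma^2$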
The assumption of joint independence among the random components in model (1) implies that the total variance of any single measurement is $\mathrm{var}[Y_{ijk}] = \sigma_A^2 + \sigma_B^2 + \sigma_{AB}^2 + \sigma^2$. In the ANOVA application, $S_q^2$ represents the mean squares with $E[S_q^2] = \theta_q$, for $q = 1, 2, 3$, and 4. Also, a balanced design implies that the $n_q S_q^2/\theta_q$ terms are independently distributed central $\chi^2$ random variables with $n_q$ degrees of freedom, respectively, for $q = 1, 2, 3$, and 4 (Burdick and Graybill, 1992; Graybill, 1976; Montgomery, 1997; Milliken and Johnson, 1992).

Interrater reliability, denoted here as $\rho_{\text{inter}}$, is defined as the proportion of total variability in observed measurements accounted for by the subject-to-subject variability. It can also be interpreted as the correlation between two randomly selected measurements on a single subject by two different randomly selected raters (Armitage et al., 1994; Sahai and Ageel, 2000; Damon and Harvey, 1987). Under model (1), this intraclass correlation coefficient is

$$\rho_{\text{inter}} = \mathrm{corr}[Y_{ijk}, Y_{ij'k'}] = \frac{\sigma_A^2}{\sigma_A^2 + \sigma_B^2 + \sigma_{AB}^2 + \sigma^2} = \frac{\theta_1 - \theta_3}{\theta_1 + (J/I)\theta_2 + (J - 1 - (J/I))\theta_3 + J(K-1)\theta_4}. \tag{2}$$
Intrarater reliability, denoted here as $\rho_{\text{intra}}$, is defined as the correlation between two randomly selected measurements on the same subject for a randomly selected rater (Sahai and Ageel, 2000; Damon and Harvey, 1987). Under model (1), this intraclass correlation coefficient is

$$\rho_{\text{intra}} = \mathrm{corr}[Y_{ijk}, Y_{ijk'}] = \frac{\sigma_A^2 + \sigma_B^2 + \sigma_{AB}^2}{\sigma_A^2 + \sigma_B^2 + \sigma_{AB}^2 + \sigma^2} = \frac{\theta_1 + (J/I)\theta_2 + (J - 1 - (J/I))\theta_3 - J\theta_4}{\theta_1 + (J/I)\theta_2 + (J - 1 - (J/I))\theta_3 + J(K-1)\theta_4}. \tag{3}$$
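The equivalence between the variance-component form and the expected-mean-square form of Eqs. (2) and (3) can be checked numerically. The following is a minimal sketch, assuming illustrative variance components that are not taken from the paper; the function names are hypothetical.

```python
def expected_mean_squares(var_a, var_b, var_ab, var_e, I, J, K):
    """Expected mean squares theta_1..theta_4 as laid out in Table 1."""
    theta1 = var_e + K * var_ab + K * J * var_a
    theta2 = var_e + K * var_ab + K * I * var_b
    theta3 = var_e + K * var_ab
    theta4 = var_e
    return theta1, theta2, theta3, theta4


def icc_from_theta(theta, I, J, K):
    """Return (rho_inter, rho_intra) from the right-hand sides of Eqs. (2), (3)."""
    t1, t2, t3, t4 = theta
    c2 = J / I
    c3 = J - 1 - J / I
    denom = t1 + c2 * t2 + c3 * t3 + J * (K - 1) * t4
    rho_inter = (t1 - t3) / denom
    rho_intra = (t1 + c2 * t2 + c3 * t3 - J * t4) / denom
    return rho_inter, rho_intra


# Hypothetical variance components (illustrative values only).
var_a, var_b, var_ab, var_e = 0.5, 0.1, 0.1, 0.3
I, J, K = 10, 2, 2
total = var_a + var_b + var_ab + var_e

theta = expected_mean_squares(var_a, var_b, var_ab, var_e, I, J, K)
rho_inter, rho_intra = icc_from_theta(theta, I, J, K)

# Each pair should agree up to floating-point rounding.
print(rho_inter, var_a / total)
print(rho_intra, (var_a + var_b + var_ab) / total)
```

Both forms agree because the denominator of Eqs. (2) and (3) equals $KJ$ times the total variance, and each numerator equals $KJ$ times the corresponding sum of variance components.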
In Eqs. (2) and (3), $\theta_1$, $\theta_2$, $\theta_3$, and $\theta_4$ represent the expected mean squares due to the between-subject, between-rater, subject-by-rater interaction, and the residual error, respectively.

3. The proposed modified large-sample approach

According to the MLS approach for developing a confidence interval on a function of variance components, it was assumed that all but one of the variance components possessed large-sample properties and the objective was to solve for the coefficient (a function of F-quantile points) of the remaining component. The process was repeated for every other variance component. A method for constructing confidence intervals using this idea has been called, in general, the MLS approach because these confidence intervals perform well for small samples, too (Burdick and Graybill, 1992; Montgomery, 1997; Ting et al., 1990, 1991). Using an MLS approach, Gui et al. (1995) proposed a general method to construct a $100(1-\alpha)\%$ confidence interval on the ratio of linear combinations of non-disjoint sets of expected mean squares of the form

$$\left(\sum_{i=1}^{P} e_i\theta_i - \sum_{j=P+1}^{Q} d_j\theta_j\right) \Bigg/ \left(\sum_{k=1}^{Q} c_k\theta_k\right), \tag{4}$$
where the $\theta_i$'s are expected mean squares and $e_i$, $d_j$, and $c_k$ are non-negative constants. Based on Gui et al. (1995), we derived confidence bounds for interrater and intrarater reliabilities for the specific case of a two-factor random effects model with an equal number of replicates. The proposed new method for constructing confidence bounds and intervals on the two intraclass correlation coefficients will be referred to as the GiTTCH (Gilder, Ting, Tian, Cappelleri, Hanumara) method.

Interrater reliability $\rho_{\text{inter}}$, as defined in Eq. (2), can be expressed in the form of Eq. (4) where $P = 1$, $Q = 4$, $e_1 = 1$, $d_2 = 0$, $d_3 = 1$, $d_4 = 0$, $c_1 = 1$, $c_2 = J/I$, $c_3 = J - 1 - (J/I)$, and $c_4 = J(K-1)$:

$$\rho_{\text{inter}} = \frac{\theta_1 - \theta_3}{\theta_1 + (J/I)\theta_2 + (J - 1 - J/I)\theta_3 + J(K-1)\theta_4} = \frac{\theta_1 - \theta_3}{\theta_1 + c_2\theta_2 + c_3\theta_3 + c_4\theta_4}. \tag{5}$$
Extending upon Gui et al., we derived the $100(1-\alpha)\%$ upper confidence bound $U$ and lower confidence bound $L$ for $\rho_{\text{inter}}$ using

$$U = \min\left(1,\ \frac{-B_U + \sqrt{\max[B_U^2 - 4A_UC_U,\ 0]}}{2A_U}\right), \tag{6}$$

and

$$L = \max\left(0,\ \frac{-B_L + \sqrt{\max[B_L^2 - 4A_LC_L,\ 0]}}{2A_L}\right), \tag{7}$$

where

$$\begin{aligned}
A_U ={}& S_1^4(1 - H_1^2) + c_2^2S_2^4(1 - G_2^2) + c_3^2S_3^4(1 - G_3^2) + c_4^2S_4^4(1 - G_4^2) \\
&+ c_2S_1^2S_2^2(2 + H_{12}) + c_3S_1^2S_3^2(2 + H_{13}) + c_4S_1^2S_4^2(2 + H_{14}) \\
&+ 2c_2c_3S_2^2S_3^2 + 2c_2c_4S_2^2S_4^2 + 2c_3c_4S_3^2S_4^2, \\
B_U ={}& -2S_1^4(1 - H_1^2) + 2c_3S_3^4(1 - G_3^2) + (1 - c_3)S_1^2S_3^2(2 + H_{13}) \\
&- c_2S_1^2S_2^2(2 + H_{12}) - c_4S_1^2S_4^2(2 + H_{14}) + 2c_2S_2^2S_3^2 + 2c_4S_3^2S_4^2, \\
C_U ={}& S_1^4(1 - H_1^2) + S_3^4(1 - G_3^2) - S_1^2S_3^2(2 + H_{13}), \\
A_L ={}& S_1^4(1 - G_1^2) + c_2^2S_2^4(1 - H_2^2) + c_3^2S_3^4(1 - H_3^2) + c_4^2S_4^4(1 - H_4^2) \\
&+ c_2S_1^2S_2^2(2 + G_{12}) + c_3S_1^2S_3^2(2 + G_{13}) + c_4S_1^2S_4^2(2 + G_{14}) \\
&+ 2c_2c_3S_2^2S_3^2 + 2c_2c_4S_2^2S_4^2 + 2c_3c_4S_3^2S_4^2, \\
B_L ={}& -2S_1^4(1 - G_1^2) + 2c_3S_3^4(1 - H_3^2) + (1 - c_3)S_1^2S_3^2(2 + G_{13}) \\
&- c_2S_1^2S_2^2(2 + G_{12}) - c_4S_1^2S_4^2(2 + G_{14}) + 2c_2S_2^2S_3^2 + 2c_4S_3^2S_4^2, \\
C_L ={}& S_1^4(1 - G_1^2) + S_3^4(1 - H_3^2) - S_1^2S_3^2(2 + G_{13}),
\end{aligned}$$

with

$$H_i = \frac{1}{F_{1-\alpha:n_i,\infty}} - 1, \qquad G_i = 1 - \frac{1}{F_{\alpha:n_i,\infty}}, \qquad i = 1, 2, 3, 4;$$

$$H_{1j} = \frac{(1 - F_{1-\alpha:n_1,n_j})^2 - (H_1F_{1-\alpha:n_1,n_j})^2 - G_j^2}{F_{1-\alpha:n_1,n_j}}, \qquad G_{1j} = \frac{(F_{\alpha:n_1,n_j} - 1)^2 - (G_1F_{\alpha:n_1,n_j})^2 - H_j^2}{F_{\alpha:n_1,n_j}}, \qquad j = 2, 3, 4.$$

Here $F_{\alpha:n_v,n_w}$ represents the upper $\alpha$-quantile point of an F distribution with $n_v$ degrees of freedom in the numerator and $n_w$ degrees of freedom in the denominator. In the context of the ANOVA model (1), $n_1 = I - 1$, $n_2 = J - 1$, $n_3 = (I-1)(J-1)$, and $n_4 = IJ(K-1)$.
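The quadratic-root form of Eqs. (6) and (7) can be sketched in a few lines. This is an illustrative sketch only, not the authors' SAS or C program: the function name `mls_inter_bound` and its argument layout are hypothetical, the F-quantile-based constants are assumed to be supplied by the caller (for example from a statistics library), and the mean squares used below are made-up values.

```python
import math


def mls_inter_bound(S, I, J, K, c_own, c_sq, c_cross, upper):
    """One-sided MLS bound on rho_inter via the quadratic of Eqs. (6)-(7).

    S       : (S1, S2, S3, S4), the observed mean squares S_q^2
    c_own   : constant paired with S_1^4 (H_1 for U, G_1 for L)
    c_sq    : {2:.., 3:.., 4:..} constants paired with S_2^4, S_3^4, S_4^4
              (G_j for U, H_j for L)
    c_cross : {2:.., 3:.., 4:..} cross constants (H_1j for U, G_1j for L)
    """
    s1, s2, s3, s4 = S
    c2, c3, c4 = J / I, J - 1 - J / I, J * (K - 1)
    w1 = s1 ** 2 * (1 - c_own ** 2)
    w2 = s2 ** 2 * (1 - c_sq[2] ** 2)
    w3 = s3 ** 2 * (1 - c_sq[3] ** 2)
    w4 = s4 ** 2 * (1 - c_sq[4] ** 2)
    x12 = s1 * s2 * (2 + c_cross[2])
    x13 = s1 * s3 * (2 + c_cross[3])
    x14 = s1 * s4 * (2 + c_cross[4])
    A = (w1 + c2 ** 2 * w2 + c3 ** 2 * w3 + c4 ** 2 * w4
         + c2 * x12 + c3 * x13 + c4 * x14
         + 2 * c2 * c3 * s2 * s3 + 2 * c2 * c4 * s2 * s4 + 2 * c3 * c4 * s3 * s4)
    B = (-2 * w1 + 2 * c3 * w3 + (1 - c3) * x13 - c2 * x12 - c4 * x14
         + 2 * c2 * s2 * s3 + 2 * c4 * s3 * s4)
    C = w1 + w3 - x13
    root = (-B + math.sqrt(max(B * B - 4 * A * C, 0.0))) / (2 * A)
    return min(1.0, root) if upper else max(0.0, root)


# Hypothetical mean squares; all constants set to zero (infinite-df limit).
S = (10.0, 4.0, 2.0, 1.0)
zero = {2: 0.0, 3: 0.0, 4: 0.0}
limit_upper = mls_inter_bound(S, 10, 2, 2, 0.0, zero, zero, upper=True)
limit_lower = mls_inter_bound(S, 10, 2, 2, 0.0, zero, zero, upper=False)
point_estimate = (S[0] - S[2]) / (S[0] + 0.2 * S[1] + 0.8 * S[2] + 2 * S[3])
print(limit_upper, limit_lower, point_estimate)
```

With every constant set to zero the coefficients reduce to $A = \hat{D}^2$, $B = -2\hat{N}\hat{D}$, and $C = \hat{N}^2$, where $\hat{N}$ and $\hat{D}$ are the estimated numerator and denominator of Eq. (5), so both bounds collapse to the point estimate. That limiting case is a convenient sanity check on the transcription of the coefficients.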
Similarly, intrarater reliability $\rho_{\text{intra}}$, as defined in Eq. (3), can be expressed in terms of Eq. (4) where $P = 3$, $Q = 4$, $e_1 = c_1 = 1$, $e_2 = c_2 = J/I$, $e_3 = c_3 = J - 1 - (J/I)$, $d_4 = J$, and $c_4 = J(K-1)$:

$$\rho_{\text{intra}} = \frac{\theta_1 + (J/I)\theta_2 + (J - 1 - (J/I))\theta_3 - J\theta_4}{\theta_1 + (J/I)\theta_2 + (J - 1 - (J/I))\theta_3 + J(K-1)\theta_4} = \frac{\theta_1 + e_2\theta_2 + e_3\theta_3 - d_4\theta_4}{\theta_1 + c_2\theta_2 + c_3\theta_3 + c_4\theta_4}. \tag{8}$$
The $100(1-\alpha)\%$ lower confidence bound $L^*$ and upper confidence bound $U^*$ for $\rho_{\text{intra}}$ can be determined using

$$U^* = \min\left(1,\ \frac{-B_U^* + \sqrt{\max[B_U^{*2} - 4A_U^*C_U^*,\ 0]}}{2A_U^*}\right), \tag{9}$$

and

$$L^* = \max\left(0,\ \frac{-B_L^* + \sqrt{\max[B_L^{*2} - 4A_L^*C_L^*,\ 0]}}{2A_L^*}\right), \tag{10}$$

where

$$\begin{aligned}
A_U^* ={}& S_1^4(1 - H_1^2) + c_2^2S_2^4(1 - H_2^2) + c_3^2S_3^4(1 - H_3^2) + c_4^2S_4^4(1 - G_4^2) \\
&+ c_4S_1^2S_4^2(2 + H_{14}) + c_2c_4S_2^2S_4^2(2 + H_{24}) + c_3c_4S_3^2S_4^2(2 + H_{34}) \\
&+ 2c_2S_1^2S_2^2 + 2c_3S_1^2S_3^2 + 2c_2c_3S_2^2S_3^2, \\
B_U^* ={}& -2S_1^4(1 - H_1^2) - 2c_2^2S_2^4(1 - H_2^2) - 2c_3^2S_3^4(1 - H_3^2) + 2Jc_4S_4^4(1 - G_4^2) \\
&- 4c_2S_1^2S_2^2 - 4c_3S_1^2S_3^2 - 4c_2c_3S_2^2S_3^2 \\
&+ (J - c_4)(2 + H_{14})S_1^2S_4^2 + (J - c_4)c_2(2 + H_{24})S_2^2S_4^2 + (J - c_4)c_3(2 + H_{34})S_3^2S_4^2, \\
C_U^* ={}& S_1^4(1 - H_1^2) + c_2^2S_2^4(1 - H_2^2) + c_3^2S_3^4(1 - H_3^2) + J^2S_4^4(1 - G_4^2) \\
&+ 2c_2S_1^2S_2^2 + 2c_3S_1^2S_3^2 + 2c_2c_3S_2^2S_3^2 \\
&- J(2 + H_{14})S_1^2S_4^2 - c_2J(2 + H_{24})S_2^2S_4^2 - c_3J(2 + H_{34})S_3^2S_4^2, \\
A_L^* ={}& S_1^4(1 - G_1^2) + c_2^2S_2^4(1 - G_2^2) + c_3^2S_3^4(1 - G_3^2) + c_4^2S_4^4(1 - H_4^2) \\
&+ c_4S_1^2S_4^2(2 + G_{14}) + c_2c_4S_2^2S_4^2(2 + G_{24}) + c_3c_4S_3^2S_4^2(2 + G_{34}) \\
&+ 2c_2S_1^2S_2^2 + 2c_3S_1^2S_3^2 + 2c_2c_3S_2^2S_3^2, \\
B_L^* ={}& -2S_1^4(1 - G_1^2) - 2c_2^2S_2^4(1 - G_2^2) - 2c_3^2S_3^4(1 - G_3^2) + 2Jc_4S_4^4(1 - H_4^2) \\
&- 4c_2S_1^2S_2^2 - 4c_3S_1^2S_3^2 - 4c_2c_3S_2^2S_3^2 \\
&+ (J - c_4)S_1^2S_4^2(2 + G_{14}) + (J - c_4)c_2S_2^2S_4^2(2 + G_{24}) + (J - c_4)c_3S_3^2S_4^2(2 + G_{34}), \\
C_L^* ={}& S_1^4(1 - G_1^2) + c_2^2S_2^4(1 - G_2^2) + c_3^2S_3^4(1 - G_3^2) + J^2S_4^4(1 - H_4^2) \\
&+ 2c_2S_1^2S_2^2 + 2c_3S_1^2S_3^2 + 2c_2c_3S_2^2S_3^2 \\
&- JS_1^2S_4^2(2 + G_{14}) - Jc_2S_2^2S_4^2(2 + G_{24}) - Jc_3S_3^2S_4^2(2 + G_{34}),
\end{aligned}$$

with

$$H_i = \frac{1}{F_{1-\alpha:n_i,\infty}} - 1, \qquad G_i = 1 - \frac{1}{F_{\alpha:n_i,\infty}}, \qquad i = 1, 2, 3, 4;$$

$$H_{i4} = \frac{(1 - F_{1-\alpha:n_i,n_4})^2 - (H_iF_{1-\alpha:n_i,n_4})^2 - G_4^2}{F_{1-\alpha:n_i,n_4}}, \qquad G_{i4} = \frac{(F_{\alpha:n_i,n_4} - 1)^2 - (G_iF_{\alpha:n_i,n_4})^2 - H_4^2}{F_{\alpha:n_i,n_4}}, \qquad i = 1, 2, 3.$$
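A large-sample limiting case offers a quick consistency check on the intrarater coefficients. The sketch below is not part of the original derivation; the symbols $\hat{N}$ and $\hat{D}$ are introduced here only for illustration and denote the estimated numerator and denominator of Eq. (8). When every $G$ and $H$ constant tends to zero (all degrees of freedom large), the three coefficients of either bound are expected to reduce to a perfect square:

```latex
\hat{N} = S_1^2 + c_2 S_2^2 + c_3 S_3^2 - J S_4^2, \qquad
\hat{D} = S_1^2 + c_2 S_2^2 + c_3 S_3^2 + c_4 S_4^2,
\\[4pt]
A^* \to \hat{D}^2, \qquad
B^* \to -2\hat{N}\hat{D}, \qquad
C^* \to \hat{N}^2,
\\[4pt]
A^*\rho^2 + B^*\rho + C^* \;\to\; (\hat{D}\rho - \hat{N})^2 = 0
\quad\Longrightarrow\quad
\rho = \hat{N}/\hat{D} = \hat{\rho}_{\text{intra}} .
```

In this limit the discriminant vanishes and both $U^*$ and $L^*$ collapse to the point estimate. Expanding $-2\hat{N}\hat{D}$ term by term therefore gives a simple way to verify each coefficient of $B^*$, including the $(J - c_4)$ factors on the cross products with $S_4^2$.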
4. The generalized confidence interval (GCI) approach

Suppose that $X = (X_1, X_2, \ldots, X_n)$ form a random sample from a distribution which depends on the parameters $\zeta = (\theta, v)$, where $\theta$ is the parameter of interest and $v^T$ is a vector of nuisance parameters. A generalized variable $R(X; x, \theta, v)$, where $x$ is an observed value of $X$, for interval estimation defined by Weerahandi (1995) has the following properties:

1. $R(X; x, \theta, v)$ has a distribution free of unknown parameters.
2. The value of $R(X; x, \theta, v)$ at $X = x$ is $\theta$.

Let $R_\alpha$ be the $100\alpha$th percentile of $R$. Then $R_\alpha$ is the $100(1-\alpha)\%$ lower bound, $R_{1-\alpha}$ is the $100(1-\alpha)\%$ upper bound, and $(R_{\alpha/2}, R_{1-\alpha/2})$ is the $100(1-\alpha)\%$ two-sided confidence interval for $\theta$.

From Table 1, the term $n_qS_q^2/\theta_q$ is an independent central $\chi^2$ random variable with $n_q$ degrees of freedom, respectively, for $q = 1, 2, 3$, and 4. Let $Q_1$, $Q_2$, $Q_3$, and $Q_4$ be random variables distributed as $\chi^2_{I-1}$, $\chi^2_{J-1}$, $\chi^2_{(I-1)(J-1)}$, and $\chi^2_{IJ(K-1)}$, respectively. The generalized test variables for estimating $\theta_1$, $\theta_2$, $\theta_3$, and $\theta_4$ are

$$R_1 = \hat{S}_1^2\,\frac{\theta_1}{S_1^2} = \hat{S}_1^2\,\frac{I-1}{Q_1} \sim \hat{S}_1^2\,\frac{I-1}{\chi^2_{I-1}}, \tag{11}$$

$$R_2 = \hat{S}_2^2\,\frac{\theta_2}{S_2^2} = \hat{S}_2^2\,\frac{J-1}{Q_2} \sim \hat{S}_2^2\,\frac{J-1}{\chi^2_{J-1}}, \tag{12}$$

$$R_3 = \hat{S}_3^2\,\frac{\theta_3}{S_3^2} = \hat{S}_3^2\,\frac{(I-1)(J-1)}{Q_3} \sim \hat{S}_3^2\,\frac{(I-1)(J-1)}{\chi^2_{(I-1)(J-1)}}, \tag{13}$$

$$R_4 = \hat{S}_4^2\,\frac{\theta_4}{S_4^2} = \hat{S}_4^2\,\frac{IJ(K-1)}{Q_4} \sim \hat{S}_4^2\,\frac{IJ(K-1)}{\chi^2_{IJ(K-1)}}, \tag{14}$$
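where $\hat{S}_1^2$, $\hat{S}_2^2$, $\hat{S}_3^2$, and $\hat{S}_4^2$ are observed values of $S_1^2$, $S_2^2$, $S_3^2$, and $S_4^2$, respectively. The generalized test variables $R_1$, $R_2$, $R_3$, and $R_4$ coincide with the usual pivotal variables for confidence intervals of $\theta_1$, $\theta_2$, $\theta_3$, and $\theta_4$. The generalized variable for a confidence interval for interrater reliability $\rho_{\text{inter}}$ can be obtained by replacing $\theta_1$, $\theta_2$, $\theta_3$, and $\theta_4$ in (5) with the generalized variables $R_1$, $R_2$, $R_3$, and $R_4$, respectively. Thus,

$$\begin{aligned}
R_{\text{inter}} &= \frac{\hat{S}_1^2\frac{\theta_1}{S_1^2} - \hat{S}_3^2\frac{\theta_3}{S_3^2}}{\hat{S}_1^2\frac{\theta_1}{S_1^2} + (J/I)\hat{S}_2^2\frac{\theta_2}{S_2^2} + (J - 1 - J/I)\hat{S}_3^2\frac{\theta_3}{S_3^2} + J(K-1)\hat{S}_4^2\frac{\theta_4}{S_4^2}} \\[6pt]
&= \frac{\hat{S}_1^2\frac{I-1}{Q_1} - \hat{S}_3^2\frac{(I-1)(J-1)}{Q_3}}{\hat{S}_1^2\frac{I-1}{Q_1} + (J/I)\hat{S}_2^2\frac{J-1}{Q_2} + (J - 1 - J/I)\hat{S}_3^2\frac{(I-1)(J-1)}{Q_3} + J(K-1)\hat{S}_4^2\frac{IJ(K-1)}{Q_4}}.
\end{aligned} \tag{15}$$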
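Similarly, the generalized variable for a confidence interval for intrarater reliability $\rho_{\text{intra}}$ can be obtained by replacing $\theta_1$, $\theta_2$, $\theta_3$, and $\theta_4$ in (8) with the generalized variables $R_1$, $R_2$, $R_3$, and $R_4$, respectively:

$$\begin{aligned}
R_{\text{intra}} &= \frac{\hat{S}_1^2\frac{\theta_1}{S_1^2} + (J/I)\hat{S}_2^2\frac{\theta_2}{S_2^2} + (J - 1 - (J/I))\hat{S}_3^2\frac{\theta_3}{S_3^2} - J\hat{S}_4^2\frac{\theta_4}{S_4^2}}{\hat{S}_1^2\frac{\theta_1}{S_1^2} + (J/I)\hat{S}_2^2\frac{\theta_2}{S_2^2} + (J - 1 - (J/I))\hat{S}_3^2\frac{\theta_3}{S_3^2} + J(K-1)\hat{S}_4^2\frac{\theta_4}{S_4^2}} \\[6pt]
&= \frac{\hat{S}_1^2\frac{I-1}{Q_1} + (J/I)\hat{S}_2^2\frac{J-1}{Q_2} + (J - 1 - (J/I))\hat{S}_3^2\frac{(I-1)(J-1)}{Q_3} - J\hat{S}_4^2\frac{IJ(K-1)}{Q_4}}{\hat{S}_1^2\frac{I-1}{Q_1} + (J/I)\hat{S}_2^2\frac{J-1}{Q_2} + (J - 1 - (J/I))\hat{S}_3^2\frac{(I-1)(J-1)}{Q_3} + J(K-1)\hat{S}_4^2\frac{IJ(K-1)}{Q_4}}.
\end{aligned} \tag{16}$$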
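The percentiles of $R_{\text{inter}}$ and $R_{\text{intra}}$ have no closed form, so in practice they are approximated by Monte Carlo. The following is a minimal sketch under stated assumptions: the mean squares are hypothetical (not the gauge R&R data), the function name `gci_limits` and the simple empirical-percentile rule are illustrative choices, and the denominator of $R_{\text{intra}}$ follows the sign convention of Eq. (3). Only the Python standard library is used, drawing each $\chi^2_n$ variate as a Gamma variate with shape $n/2$ and scale 2.

```python
import random


def gci_limits(S_obs, I, J, K, alpha=0.10, draws=2500, seed=1):
    """Two-sided 100(1 - alpha)% GCI limits for rho_inter and rho_intra."""
    rng = random.Random(seed)
    dfs = (I - 1, J - 1, (I - 1) * (J - 1), I * J * (K - 1))
    c2, c3, c4 = J / I, J - 1 - J / I, J * (K - 1)
    r_inter, r_intra = [], []
    for _ in range(draws):
        # R_q = S_obs_q * n_q / Q_q with Q_q ~ chi-square(n_q);
        # a chi-square(n) draw is Gamma(shape=n/2, scale=2).
        R = [s * n / rng.gammavariate(n / 2.0, 2.0) for s, n in zip(S_obs, dfs)]
        head = R[0] + c2 * R[1] + c3 * R[2]
        denom = head + c4 * R[3]
        r_inter.append((R[0] - R[2]) / denom)      # generalized variable, Eq. (15)
        r_intra.append((head - J * R[3]) / denom)  # generalized variable, Eq. (16)

    def percentile(values, p):
        # Simple empirical percentile: the order statistic at index floor(p * n).
        ordered = sorted(values)
        return ordered[min(len(ordered) - 1, int(p * len(ordered)))]

    return {
        "inter": (percentile(r_inter, alpha / 2), percentile(r_inter, 1 - alpha / 2)),
        "intra": (percentile(r_intra, alpha / 2), percentile(r_intra, 1 - alpha / 2)),
    }


# Hypothetical observed mean squares for a design with I=10, J=2, K=2.
out = gci_limits((10.0, 4.0, 2.0, 1.0), I=10, J=2, K=2)
print(out)
```

The one-sided bounds are obtained the same way by reading off the $100\alpha$th or $100(1-\alpha)$th percentile instead of the two tail percentiles. Note that the lower percentile of $R_{\text{inter}}$ can be negative because its numerator is a difference; truncating at zero is a separate choice that this sketch does not make.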
As discussed by Tian and Cappelleri (2004), two features are especially noteworthy. First, $X$ can be represented by $(S_1^2, S_2^2, S_3^2, S_4^2)$ and $x$ can be represented by $(\hat{S}_1^2, \hat{S}_2^2, \hat{S}_3^2, \hat{S}_4^2)$. Second, in $\zeta = (\theta, v)$, $\theta$ is the parameter of interest ($\rho$), and $v$ consists of any three of $\sigma_A^2$, $\sigma_B^2$, $\sigma_{AB}^2$, $\sigma^2$.

To construct confidence intervals based on $R_{\text{inter}}$, we must verify that $R_{\text{inter}}$ satisfies the two conditions stated earlier in this section. It is clear from (11) to (15) that, for any given $\hat{S}_1^2$, $\hat{S}_2^2$, $\hat{S}_3^2$, and $\hat{S}_4^2$, the following holds: (1) the distribution of $R_{\text{inter}}$ is independent of any unknown parameters, and (2) the value of $R_{\text{inter}}$ is $\rho_{\text{inter}}$ when $S_1^2 = \hat{S}_1^2$, $S_2^2 = \hat{S}_2^2$, $S_3^2 = \hat{S}_3^2$, and $S_4^2 = \hat{S}_4^2$. Therefore, $R_{\text{inter}}$ is a generalized variable for interval estimation, and its quantiles may be used to construct confidence limits for $\rho_{\text{inter}}$. The two-sided $100(1-\alpha)\%$ confidence interval is given by $(R_{\text{inter},\alpha/2}, R_{\text{inter},1-\alpha/2})$ and the one-sided $100(1-\alpha)\%$ lower bound is given by $R_{\text{inter},\alpha}$, which denotes the $100\alpha$th percentile of $R_{\text{inter}}$. Although for any given $\hat{S}_1^2$, $\hat{S}_2^2$, $\hat{S}_3^2$, and $\hat{S}_4^2$ the distribution of $R_{\text{inter}}$ does not depend on any unknown parameters, the confidence limits depend on the sampling distributions of $S_1^2$, $S_2^2$, $S_3^2$, and $S_4^2$, which, in turn, depend on the parameters $\sigma_A^2$, $\sigma_B^2$, $\sigma_{AB}^2$, and $\sigma^2$. It therefore becomes necessary to evaluate the performance of the proposed confidence interval by simulation. Similar arguments apply to $R_{\text{intra}}$.

5. The Leiva and Graybill (LG) approach

For the balanced two-factor random design with multiple replicates per subject–rater combination, Leiva and Graybill (1986) and Burdick and Graybill (1992) presented the approximate $(1 - 2\alpha)$ confidence interval on $\rho_{\text{inter}}$

$$\left(\frac{I\,L_{LG}}{I\,L_{LG} + J};\ \frac{I\,U_{LG}}{I\,U_{LG} + J}\right), \tag{17}$$

where

$$L_{LG} = \frac{S_1^2 - F_{\alpha:n_1,n_3}S_3^2}{I(K-1)F_{\alpha:n_1,\infty}S_4^2 + F_{\alpha:n_1,n_2}S_2^2 + (I-1)F_{\alpha:n_1,\infty}S_3^2}$$

and

$$U_{LG} = \frac{S_1^2 - F_{1-\alpha:n_1,n_3}S_3^2}{I(K-1)F_{1-\alpha:n_1,\infty}S_4^2 + F_{1-\alpha:n_1,n_2}S_2^2 + (I-1)F_{1-\alpha:n_1,\infty}S_3^2}.$$
Here $F_{\alpha:n_v,n_w}$ represents the upper $\alpha$-quantile point of an F distribution with $n_v$ degrees of freedom in the numerator and $n_w$ degrees of freedom in the denominator.

6. Simulation study

An empirical study using Monte Carlo simulation was conducted to evaluate the ability of the proposed confidence bounds for $\rho_{\text{inter}}$ and $\rho_{\text{intra}}$ to maintain the stated confidence coefficient. For the two-factor model with interaction, a broad array of designs was considered: 2, 3, and 5 raters evaluating 10, 25, 50, and 100 subjects (2 and 5 replicates); 10 raters evaluating 10 subjects (2 replicates); 5 raters evaluating 5 subjects (2 replicates); 10 raters evaluating 5 subjects (2 replicates); 2 raters evaluating 2 subjects (2 replicates); 2 raters evaluating 3 subjects (2 replicates); 2 raters evaluating 5 subjects (2 replicates); 3 raters evaluating 3 subjects (2 replicates); 4 raters evaluating 3 subjects (2 replicates); and 3 raters evaluating 5 subjects (2 replicates). Coverage for one-sided (two-sided) confidence intervals with $(1-\alpha)$ nominal confidence levels equal to 0.99 (0.98), 0.95 (0.90), and 0.90 (0.80) was considered.

In the simulation procedure, we defined $r_1 = \sigma_A^2/(\sigma_A^2 + \sigma_B^2 + \sigma_{AB}^2 + \sigma^2)$, $r_2 = \sigma_B^2/(\sigma_A^2 + \sigma_B^2 + \sigma_{AB}^2 + \sigma^2)$, $r_3 = \sigma_{AB}^2/(\sigma_A^2 + \sigma_B^2 + \sigma_{AB}^2 + \sigma^2)$, and $r_4 = 1 - r_1 - r_2 - r_3$. Without loss of generality, we defined $\sigma_A^2 + \sigma_B^2 + \sigma_{AB}^2 + \sigma^2 = 1$ so that $r_1 = \sigma_A^2$, $r_2 = \sigma_B^2$, $r_3 = \sigma_{AB}^2$, and $r_4 = \sigma^2$. Next, we also defined $\theta_i$, for $i = 1, 2, 3$, and 4, as shown in Table 1. The distributional assumptions were $S_1^2 \sim \theta_1V_1/n_1$, $S_2^2 \sim \theta_2V_2/n_2$, $S_3^2 \sim \theta_3V_3/n_3$, and $S_4^2 \sim \theta_4V_4/n_4$, where $V_1$, $V_2$, $V_3$, and $V_4$ represented a set of jointly independent central $\chi^2$ random variables with $n_1$, $n_2$, $n_3$, and $n_4$ degrees of freedom, respectively. In the context of the ANOVA model (1), $n_1 = I - 1$, $n_2 = J - 1$, $n_3 = (I-1)(J-1)$, and $n_4 = IJ(K-1)$. These four $\chi^2$ random variables were generated using the RANGAM function of the Statistical
Analysis System (SAS®) (SAS Institute, 1999–2001). Since the mean squares follow scaled $\chi^2$ distributions, it was easier to simulate them directly than to simulate values of $Y_{ijk}$. Values of $r_1$, $r_2$, and $r_3$ were selected from a set of values from 0.1 to 0.7 in increments of 0.1, resulting in 84 combinations of $r_1$, $r_2$, and $r_3$ in which $r_4$ is positive. For each of the 84 combinations, a total of 5000 sets of $\{S_1^2, S_2^2, S_3^2, S_4^2\}$ were simulated for each of the designs. Simulated values for the mean squares were substituted into the appropriate formulas and the upper and lower confidence bounds were computed. For the GCI approach, in each of the 5000 random samples, 2500 values of $R_{\text{inter}}$ and $R_{\text{intra}}$ were obtained. Coverage probabilities were determined by counting the number of times the bounds (intervals) covered the parameter and then dividing by 5000. Mean, median, and minimum percent coverage across the parameter space of 84 unique combinations were computed for the one-sided and two-sided intervals.

7. Simulation results

As the mean and median results are similar, only median results are reported. Only $(1-\alpha)$ confidence level results for $\alpha = 0.05$ are reported because results using $\alpha = 0.01$ and $\alpha = 0.1$ are similar.

7.1. Interrater reliability coverage results

Table 2 reports the simulated median lower bound and upper bound coverage probabilities across the 84 parameter sets for each of the 33 designs. Regarding the one-sided 95% lower confidence bound for $\rho_{\text{inter}}$, the Leiva–Graybill (LG) and GiTTCH methods provide confidence coefficients that are conservative and close to the stated nominal level. The GCI results are less consistent across various designs. In general, the GiTTCH coverage is on the conservative side. For the lower bound, the LG method provides liberal coverage for designs with I = 2, J = 2, K = 2 and I = 3, J = 2, K = 2. In most other cases, the LG method tends to be more conservative than the GiTTCH method.
In some cases, the GCI confidence bounds tend to be somewhat conservative, while in a few other cases (e.g., I = 10, J = 10, K = 2; I = 5, J = 10, K = 2; I = 100, J = 5, K = 2; I = 100, J = 5, K = 5) they are liberal.

Regarding upper bound coverage, it is clear that the GCI is liberal. Hence, GCI cannot be recommended for upper confidence bounds on $\rho_{\text{inter}}$. Both LG and GiTTCH were close to the stated confidence level. The LG method provided confidence coefficients that were either correct or slightly conservative, while the GiTTCH method provided confidence coefficients that were either correct or slightly liberal. In those designs where the numbers of raters, subjects, and replicates are all small (fewer than 10 each), the LG approach provides coverage closer to the correct nominal value. In the remaining designs with larger numbers of raters and subjects, the GiTTCH approach provides coverage closer to the correct nominal value than does the LG approach. In order to help encapsulate the distributions of the lower bound and upper bound coverage within a given design, box plots for three prototypical designs are presented in Fig. 1.

Table 3 reports the simulated median two-sided confidence interval coverage probabilities and median interval widths across the 84 parameter sets for each of the 33 designs. Because the upper bounds on GCI are not recommended, there is no need to compare the two-sided confidence intervals between GCI and the other methods. Hence, the GCI results from the two-sided confidence intervals are not included in Table 3. A comparison between the LG and GiTTCH methods attests to the GiTTCH method being closer to the correct nominal value in all 33 designs. The LG method provides liberal coverage for the designs with I = 2, J = 2, K = 2 and I = 3, J = 2, K = 2. Consistent with the coverage probabilities, the interval widths of the GiTTCH approach tended to be narrower than the interval widths of the LG approach.
In the few designs where LG coverage is liberal, GiTTCH maintains the probability coverage, and hence the two-sided confidence intervals are wider.

7.2. Intrarater reliability coverage

Table 4 presents the results of the simulation for the one-sided 95% upper and lower confidence bounds for $\rho_{\text{intra}}$. Since the Leiva and Graybill paper does not provide the derivation for intrarater reliability, only the GCI and GiTTCH methods were compared here. For GCI, the lower confidence bounds were liberal. Overall, for GiTTCH, the one-sided 95% lower bound for $\rho_{\text{intra}}$ tended to produce slightly conservative coverage. The one exception (I = 25, J = 5, K = 2) produced liberal coverage. For the upper bound using GCI or GiTTCH, the coverage tended to be liberal in some
Table 2
Median percent coverage of approximate 95% one-sided lower bound and one-sided upper bound of interrater reliability across 84 parameter sets (5000 simulations per parameter set)

                   GCI               LG                GiTTCH
  I    J   K    LB      UB        LB      UB        LB      UB
 10    2   2    0.9789  0.8426    0.9908  0.9568    0.9882  0.9468
 25    2   2    0.9774  0.8438    0.9908  0.9629    0.9879  0.9454
 50    2   2    0.9757  0.8429    0.9907  0.9667    0.9875  0.9441
100    2   2    0.9707  0.8477    0.9885  0.9695    0.9858  0.9421
 10    3   2    0.9616  0.8633    0.9855  0.9567    0.9769  0.9470
 25    3   2    0.9614  0.8582    0.9868  0.9635    0.9772  0.9464
 50    3   2    0.9588  0.8604    0.9877  0.9694    0.9779  0.9460
100    3   2    0.9559  0.8614    0.9865  0.9723    0.9768  0.9435
 10    5   2    0.9404  0.8755    0.9779  0.9561    0.9616  0.9474
 25    5   2    0.9415  0.8727    0.9817  0.9628    0.9646  0.9472
 50    5   2    0.9409  0.8722    0.9841  0.9672    0.9674  0.9460
100    5   2    0.9385  0.8734    0.9848  0.9727    0.9676  0.9449
 10    2   5    0.9784  0.8428    0.9917  0.9579    0.9884  0.9474
 25    2   5    0.9771  0.8434    0.9908  0.9654    0.9872  0.9453
 50    2   5    0.9736  0.8450    0.9899  0.9692    0.9860  0.9432
100    2   5    0.9672  0.8511    0.9878  0.9724    0.9829  0.9424
 10    3   5    0.9605  0.8636    0.9857  0.9571    0.9753  0.9469
 25    3   5    0.9595  0.8620    0.9874  0.9650    0.9768  0.9466
 50    3   5    0.9579  0.8611    0.9878  0.9710    0.9783  0.9454
100    3   5    0.9512  0.8635    0.9855  0.9740    0.9749  0.9436
 10    5   5    0.9398  0.8764    0.9786  0.9556    0.9608  0.9472
 25    5   5    0.9422  0.8750    0.9824  0.9628    0.9653  0.9464
 50    5   5    0.9403  0.8736    0.9844  0.9696    0.9671  0.9468
100    5   5    0.9389  0.8739    0.9836  0.9732    0.9670  0.9452
  5    5   2    0.9394  0.8802    0.9726  0.9513    0.9593  0.9464
 10   10   2    0.9203  0.8874    0.9674  0.9532    0.9541  0.9472
  5   10   2    0.9195  0.8906    0.9627  0.9509    0.9518  0.9474
  2    2   2    0.9868  0.8543    0.6302  0.9496    0.9908  0.9436
  3    2   2    0.9844  0.8504    0.8894  0.9494    0.9900  0.9445
  5    2   2    0.9815  0.8447    0.9796  0.9535    0.9888  0.9468
  3    3   2    0.9636  0.8710    0.9821  0.9499    0.9742  0.9458
  3    4   2    0.9486  0.8797    0.9746  0.9504    0.9636  0.9468
  5    3   2    0.9617  0.8653    0.9839  0.9527    0.9754  0.9466

I = number of subjects; J = number of raters; K = number of replicates. LB = lower bound coverage probability; UB = upper bound coverage probability. GCI = generalized confidence interval approach. LG = Leiva–Graybill approach. GiTTCH = Gilder, Ting, Tian, Cappelleri, and Hanumara approach.
designs. Using the normal approximation to the binomial, if the true confidence coefficient is 95%, there is less than a 5% chance that a simulated confidence coefficient based on 5000 replications will be less than 94.4%. For GiTTCH, 12 of the 33 designs in Table 4 produced slightly liberal upper bound coverage below 0.944; for GCI, 8 of 33 were liberal. Because GCI does not produce an accurate lower bound, the only method of interest for lower bounds and confidence intervals is that of GiTTCH. Table 4 provides the coverage probabilities of GiTTCH for both its lower bound and the upper bound. No table is included to compare coverage or interval widths since GCI is not recommended for confidence intervals.
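The 94.4% cutoff quoted above can be reproduced with a short calculation. The sketch below uses the normal approximation to the binomial with the nominal level 0.95 and 5000 replications, as stated in the text; the variable names are illustrative.

```python
from statistics import NormalDist

nominal = 0.95      # true confidence coefficient assumed in the text
n_sets = 5000       # simulated mean-square sets per parameter combination
cutoff = 0.944      # threshold below which coverage is called liberal

# Standard error of a simulated coverage proportion under the binomial model.
se = (nominal * (1 - nominal) / n_sets) ** 0.5

# Probability that the simulated coverage falls below the cutoff.
prob_below = NormalDist(mu=nominal, sigma=se).cdf(cutoff)
print(round(se, 5), round(prob_below, 4))
```

The standard error is about 0.0031, so 0.944 lies close to two standard errors below 0.95 and the probability of falling below it is well under 5%, consistent with the statement in the text.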
Fig. 1. Box plot of simulated confidence coefficients. Panels: (a) I=10, J=5, K=2; (b) I=25, J=5, K=5; (c) I=3, J=2, K=2. Each panel shows the LG, GiTTCH, and GCI methods for the lower bound confidence coefficient and the upper bound confidence coefficient (horizontal axes from 0.80 to 1.00).
8. An example

Hamada and Weerahandi (2000) presented an example gauge repeatability and reproducibility (R&R) study based on an experiment reported by Tsai (1988). Tsai describes a measurement system for an injection-molded plastic part with specifications on several dimensions. The data in Table 5 are deviations of one such dimension from a nominal value of 685 mm. Two operators were randomly selected from those who normally use the coordinate measuring machine to measure the part's dimensions. Ten parts were used in the study; each part was loaded twice into the fixture by each operator and then measured. The operators did not have access to the measurements until all measurements had been taken.

The appropriate statistical design for examining gauge R&R is the balanced two-factor random design with I = 10 parts, J = 2 operators, and K = 2 replicates. Table 6 contains the ANOVA. Using the significance level $\alpha = 0.05$, the estimated proportion of the total variability accounted for by differences among the parts is $\hat{\rho}_{\mathrm{inter}} = \hat{\rho}_{\mathrm{parts}} = 0.515$. The 90% two-sided confidence interval is (0.012, 0.785) using the LG method, (0.026, 0.780) using the GiTTCH method, and (0.0092, 0.7526) using the GCI method. The 95% one-sided lower bound is 0.012 using the LG method, 0.026 using the GiTTCH method, and 0.0092 using the GCI method. The 95% one-sided upper bound is 0.785 using the LG method, 0.780 using the GiTTCH method, and 0.7526 using the GCI method. Regarding the intrarater reliability, $\hat{\rho}_{\mathrm{intra}} = 0.671$; the 90% confidence interval is (0.4014, 0.9808) from GiTTCH and (0.4825, 0.9817) from GCI.

SAS® and C simulation and application programs are available from the first author.

9. Summary and conclusion

An alternative modified large-sample (MLS) method, the GiTTCH method, and a generalized confidence interval (GCI) method are derived and applied to obtain one-sided confidence bounds and two-sided confidence intervals for two particular intraclass correlation coefficients derived from a balanced two-factor random design with interaction: interrater reliability ($\rho_{\mathrm{inter}}$) and intrarater reliability ($\rho_{\mathrm{intra}}$). The GiTTCH and GCI approaches are compared with another MLS method derived by Leiva and Graybill (1986), denoted the LG method.
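As a concrete illustration of the two coefficients, the following minimal Python sketch computes their ANOVA point estimates from the mean squares of the gauge R&R example (Table 6). It assumes the usual method-of-moments estimators of the variance components for a balanced two-factor random design with interaction, and it assumes the definitions $\rho_{\mathrm{inter}} = \sigma^2_S / (\sigma^2_S + \sigma^2_R + \sigma^2_{SR} + \sigma^2_E)$ and $\rho_{\mathrm{intra}} = (\sigma^2_S + \sigma^2_R + \sigma^2_{SR}) / (\sigma^2_S + \sigma^2_R + \sigma^2_{SR} + \sigma^2_E)$. The function name `icc_point_estimates` is hypothetical and is not taken from the authors' programs.

```python
def icc_point_estimates(ms_s, ms_r, ms_sr, ms_e, I, J, K):
    """ANOVA point estimates of interrater and intrarater reliability.

    ms_s, ms_r, ms_sr, ms_e: mean squares for subjects (parts), raters
    (operators), subject-by-rater interaction, and error.
    I, J, K: numbers of subjects, raters, and replicates.
    Assumed estimators; negative variance estimates are not truncated here.
    """
    var_e = ms_e                       # error variance
    var_sr = (ms_sr - ms_e) / K        # subject-by-rater interaction
    var_r = (ms_r - ms_sr) / (I * K)   # raters
    var_s = (ms_s - ms_sr) / (J * K)   # subjects
    total = var_s + var_r + var_sr + var_e
    rho_inter = var_s / total
    rho_intra = (var_s + var_r + var_sr) / total
    return rho_inter, rho_intra


# Mean squares from Table 6 (I = 10 parts, J = 2 operators, K = 2 replicates).
rho_inter, rho_intra = icc_point_estimates(
    ms_s=0.001305525, ms_r=0.000648025, ms_sr=0.000281969, ms_e=0.000163475,
    I=10, J=2, K=2,
)
print(f"rho_inter = {rho_inter:.3f}, rho_intra = {rho_intra:.3f}")
```

Under these assumptions the printed estimates agree with the values 0.515 and 0.671 reported in the example, which supports the assumed definitions for this data set.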
Table 3
Median percent coverage and median interval width of approximate 90% two-sided confidence interval of interrater reliability across 84 parameter sets (5000 simulations per parameter set)

                   LG                    GiTTCH
  I    J    K    Coverage   Width      Coverage   Width
 10    2    2    0.9460     0.5607     0.9348     0.5443
 25    2    2    0.9509     0.4476     0.9335     0.4320
 50    2    2    0.9535     0.3873     0.9302     0.3691
100    2    2    0.9551     0.3438     0.9253     0.3224

 10    3    2    0.9425     0.4925     0.9232     0.4676
 25    3    2    0.9485     0.3794     0.9240     0.3534
 50    3    2    0.9535     0.3211     0.9235     0.2909
100    3    2    0.9559     0.2791     0.9182     0.2505

 10    5    2    0.9340     0.4210     0.9100     0.3951
 25    5    2    0.9440     0.3053     0.9122     0.2769
 50    5    2    0.9506     0.2474     0.9125     0.2167
100    5    2    0.9561     0.2088     0.9113     0.1806

 10    2    5    0.9485     0.5458     0.9349     0.5281
 25    2    5    0.9528     0.4376     0.9321     0.4127
 50    2    5    0.9562     0.3747     0.9294     0.3574
100    2    5    0.9570     0.3350     0.9226     0.3111

 10    3    5    0.9426     0.4792     0.9218     0.4525
 25    3    5    0.9513     0.3666     0.9231     0.3362
 50    3    5    0.9550     0.3110     0.9224     0.2803
100    3    5    0.9577     0.2668     0.9156     0.2399

 10    5    5    0.9346     0.4107     0.9085     0.3826
 25    5    5    0.9452     0.2940     0.9111     0.2645
 50    5    5    0.9521     0.2389     0.9130     0.2088
100    5    5    0.9565     0.2010     0.9116     0.1739

  5    5    2    0.9238     0.5626     0.9051     0.5389
 10   10    2    0.9207     0.3511     0.9012     0.3340
  5   10    2    0.9125     0.5029     0.9007     0.4886
  2    2    2    0.5997     0.5325     0.9338     0.8671
  3    2    2    0.8447     0.6825     0.9345     0.7729
  5    2    2    0.9328     0.6602     0.9357     0.6605
  3    3    2    0.9310     0.7587     0.9190     0.7385
  3    4    2    0.9249     0.7368     0.9102     0.7175
  5    3    2    0.9363     0.6238     0.9217     0.6007

Coverage = confidence interval coverage probability. LG = Leiva–Graybill approach. GiTTCH = Gilder, Ting, Tian, Cappelleri, and Hanumara approach.
For the GCI method, the coverage of upper confidence bounds on $\rho_{\mathrm{inter}}$ tends to be liberal. Hence, this method is not recommended for upper confidence bounds on $\rho_{\mathrm{inter}}$. Simulation results also indicate that the GCI method yields liberal coverage for the lower confidence bound on $\rho_{\mathrm{intra}}$. This finding is an important limitation because lower bounds are more useful than upper bounds in most applications. Therefore, the GCI method is also not recommended for confidence intervals on the intrarater reliability.

Conversely, GiTTCH is shown to be an effective method for setting confidence intervals on the interrater reliability $\rho_{\mathrm{inter}}$. Simulation results suggest that the GiTTCH method produces appropriate coverage for the upper confidence bounds, and that it tends to provide either correct or somewhat conservative coverage for both the one-sided lower bound and the two-sided confidence intervals, depending on the situation. For fixed numbers of subjects and replicates, the degree of conservatism decreases as the number of raters increases.
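Coverage statements of this kind are Monte Carlo estimates. The following minimal Python sketch shows how such a simulation can be organized; it is not the authors' SAS or C program. It assumes the standard normal-theory result for a balanced two-factor random model with interaction: each mean square is distributed as its expected value times an independent chi-square variable divided by its degrees of freedom. The function names, the variance values, and the placeholder interval `trivial_interval` are hypothetical; an actual study would pass in a function that returns LG, GiTTCH, or GCI bounds.

```python
import random


def simulate_mean_squares(var_s, var_r, var_sr, var_e, I, J, K, rng):
    """Draw (MS_S, MS_R, MS_SR, MS_E) for one simulated balanced data set.

    Requires I >= 2, J >= 2, K >= 2 so that all degrees of freedom are positive.
    """
    dfs = (I - 1, J - 1, (I - 1) * (J - 1), I * J * (K - 1))
    expected_ms = (
        var_e + K * var_sr + J * K * var_s,  # subjects
        var_e + K * var_sr + I * K * var_r,  # raters
        var_e + K * var_sr,                  # interaction
        var_e,                               # error
    )
    # A chi-square variable with df degrees of freedom is Gamma(shape=df/2, scale=2).
    return tuple(
        theta * rng.gammavariate(df / 2.0, 2.0) / df
        for theta, df in zip(expected_ms, dfs)
    )


def estimate_coverage(interval_fn, true_value, variances, design, n_sim=5000, seed=1):
    """Fraction of simulated intervals that contain the true coefficient."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        ms = simulate_mean_squares(*variances, *design, rng)
        lower, upper = interval_fn(ms, *design)
        hits += lower <= true_value <= upper
    return hits / n_sim


def trivial_interval(ms, I, J, K):
    """Placeholder only: the whole parameter space, so coverage is always 1."""
    return 0.0, 1.0


variances = (4.0, 1.0, 0.5, 1.0)  # hypothetical (subjects, raters, interaction, error)
design = (10, 2, 2)               # I, J, K
rho_inter_true = variances[0] / sum(variances)
coverage = estimate_coverage(trivial_interval, rho_inter_true, variances, design, n_sim=1000)
print(coverage)  # 1.0 by construction for the placeholder interval
```

Simulating the mean squares directly, rather than the individual observations, keeps the sketch short; it relies entirely on the normality assumption stated above.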
Table 4
Median percent coverage of approximate 95% one-sided lower bound and one-sided upper bound of intrarater reliability across 84 parameter sets (5000 simulations per parameter set)

                   GCI                  GiTTCH
  I    J    K    LB        UB         LB        UB
 10    2    2    0.8211    0.9829     0.9434    0.9900
 25    2    2    0.8371    0.9720     0.9417    0.9858
 50    2    2    0.8490    0.9621     0.9418    0.9785
100    2    2    0.8651    0.9464     0.9417    0.9696

 10    3    2    0.8336    0.9751     0.9426    0.9846
 25    3    2    0.8459    0.9641     0.9418    0.9808
 50    3    2    0.8600    0.9508     0.9405    0.9752
100    3    2    0.8697    0.9352     0.9400    0.9663

 10    5    2    0.8466    0.9645     0.9428    0.9795
 25    5    2    0.8558    0.9522     0.9410    0.9749
 50    5    2    0.8636    0.9427     0.9410    0.9712
100    5    2    0.8739    0.9267     0.9395    0.9637

 10    2    5    0.8219    0.9830     0.9380    0.9907
 25    2    5    0.8432    0.9674     0.9370    0.9828
 50    2    5    0.8595    0.9512     0.9377    0.9730
100    2    5    0.8737    0.9279     0.9422    0.9622

 10    3    5    0.8355    0.9748     0.9360    0.9871
 25    3    5    0.8512    0.9589     0.9360    0.9792
 50    3    5    0.8646    0.9430     0.9382    0.9695
100    3    5    0.8781    0.9247     0.9414    0.9614

 10    5    5    0.8448    0.9641     0.9350    0.9817
 25    5    5    0.8592    0.9541     0.9370    0.9767
 50    5    5    0.8673    0.9398     0.9387    0.9695
100    5    5    0.8799    0.9225     0.9416    0.9594

  5    5    2    0.8347    0.9726     0.9423    0.9834
 10   10    2    0.8563    0.9524     0.9413    0.9731
  5   10    2    0.8445    0.9634     0.9425    0.9788
  2    2    2    0.8291    0.9884     0.9466    0.9942
  3    2    2    0.8036    0.9932     0.9459    0.9956
  5    2    2    0.8131    0.9890     0.9447    0.9932
  3    3    2    0.8153    0.9878     0.9446    0.9923
  3    4    2    0.8211    0.9845     0.9444    0.9906
  5    3    2    0.8258    0.9816     0.9445    0.9888

I = number of subjects; J = number of raters; K = number of replicates. LB = lower bound coverage probability; UB = upper bound coverage probability. GCI = generalized confidence interval approach. GiTTCH = Gilder, Ting, Tian, Cappelleri, and Hanumara approach.
Regarding $\rho_{\mathrm{inter}}$, the LG method maintains good coverage for the upper confidence bounds. For the lower bounds, LG tends to be conservative in most cases. However, when I, J, and K are small, lower bounds obtained from the LG method can be liberal. Based on the comparison of the three methods in constructing confidence intervals on interrater reliability, we recommend the GiTTCH method. It provides good coverage for the upper confidence bounds and correct or reasonably conservative coverage for the lower confidence bounds.

Regarding $\rho_{\mathrm{intra}}$, an LG method is not available and the GCI method is not recommended. Hence, only the GiTTCH approach is discussed here. Simulation results suggest that, with only one exception, GiTTCH produces either correct or somewhat conservative coverage for the lower confidence bounds. Regarding the upper confidence bounds, GiTTCH could be liberal in certain designs. We therefore offer a qualified recommendation of the GiTTCH method to
Table 5
Gauge repeatability and reproducibility study data

         Operator 1                 Operator 2
Part     Measurement 1   Measurement 2   Measurement 1   Measurement 2
 1       0.289           0.273           0.324           0.309
 2       0.311           0.327           0.340           0.333
 3       0.295           0.318           0.335           0.326
 4       0.301           0.303           0.304           0.333
 5       0.265           0.288           0.289           0.279
 6       0.298           0.304           0.305           0.299
 7       0.273           0.293           0.287           0.250
 8       0.276           0.301           0.275           0.305
 9       0.328           0.341           0.316           0.314
10       0.293           0.282           0.300           0.297
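The sums of squares and mean squares of the gauge R&R analysis of variance (Table 6) can be reproduced from the Table 5 data. The following minimal Python sketch uses the textbook sums-of-squares decomposition for a balanced two-factor design with replication; it is an illustration, not the authors' SAS program. It assumes that the four data columns of Table 5 are, in order, measurements 1 and 2 for operator 1 followed by measurements 1 and 2 for operator 2. All variable names are hypothetical.

```python
I, J, K = 10, 2, 2  # parts, operators, replicates

# Table 5 columns (assumed order: operator 1 meas. 1, 2; operator 2 meas. 1, 2).
op1_m1 = [0.289, 0.311, 0.295, 0.301, 0.265, 0.298, 0.273, 0.276, 0.328, 0.293]
op1_m2 = [0.273, 0.327, 0.318, 0.303, 0.288, 0.304, 0.293, 0.301, 0.341, 0.282]
op2_m1 = [0.324, 0.340, 0.335, 0.304, 0.289, 0.305, 0.287, 0.275, 0.316, 0.300]
op2_m2 = [0.309, 0.333, 0.326, 0.333, 0.279, 0.299, 0.250, 0.305, 0.314, 0.297]

# y[i][j][k]: part i, operator j, replicate k
y = [[[op1_m1[i], op1_m2[i]], [op2_m1[i], op2_m2[i]]] for i in range(I)]

grand = sum(v for part in y for cell in part for v in cell) / (I * J * K)
part_means = [sum(v for cell in y[i] for v in cell) / (J * K) for i in range(I)]
oper_means = [sum(v for i in range(I) for v in y[i][j]) / (I * K) for j in range(J)]
cell_means = [[sum(y[i][j]) / K for j in range(J)] for i in range(I)]

ss_part = J * K * sum((m - grand) ** 2 for m in part_means)
ss_oper = I * K * sum((m - grand) ** 2 for m in oper_means)
ss_po = K * sum(
    (cell_means[i][j] - part_means[i] - oper_means[j] + grand) ** 2
    for i in range(I) for j in range(J)
)
ss_err = sum(
    (y[i][j][k] - cell_means[i][j]) ** 2
    for i in range(I) for j in range(J) for k in range(K)
)

anova = [
    ("Part", I - 1, ss_part),
    ("Operator", J - 1, ss_oper),
    ("Part x operator", (I - 1) * (J - 1), ss_po),
    ("Error", I * J * (K - 1), ss_err),
]
for source, df, ss in anova:
    # Columns: source of variation, degrees of freedom, sum of squares, mean square
    print(f"{source:<16}{df:>3}  {ss:.8f}  {ss / df:.9f}")
```

With the assumed column order, the computed sums of squares match the SS column of Table 6 to the reported precision, which supports that reading of the Table 5 layout.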
Table 6
Analysis of variance for gauge R&R study data

SV                DF    SS            MS             p-value
Part               9    0.01174973    0.001305525    0.00
Operator           1    0.00064803    0.000648025    0.06
Part × operator    9    0.00253773    0.000281969    0.15
Error             20    0.00326950    0.000163475
construct confidence intervals for $\rho_{\mathrm{intra}}$. If coverage results for the design of interest are not found in this paper (specifically Table 4), researchers should first perform simulations to examine the performance of the GiTTCH method. If the performance is acceptable, then use of GiTTCH is recommended for that particular scenario as well. Overall, among the three methods, the proposed modified large-sample approach (GiTTCH) provides the most accurate coverage across the wide array of scenarios investigated in this paper.

Acknowledgements

The authors are grateful to Stephen Eckert, whose useful initial derivations helped to motivate our work in its preliminary stages. We also wish to thank the associate editor and the two referees for their helpful suggestions and comments, which led to substantial improvements in the article.

References

Adamec, E., Burdick, R., 2003. Confidence intervals for a discrimination ratio in a gauge R&R study with three random factors. Qual. Eng. 15 (3), 383–389.
Armitage, P., Berry, G., Matthews, J.N.S., 1994. Statistical Methods in Medical Research, fourth ed. Blackwell, Oxford, UK, pp. 704–707.
Arteaga, C., Jeyaratnam, S., Graybill, F.A., 1982. Confidence intervals for proportions of total variance in the two-way cross component of variance model. Comm. Statist. Theory Methods 11 (15), 1643–1658.
Bartko, J.J., 1966. The intraclass correlation coefficient as a measure of reliability. Psychol. Rep. 19, 3–11.
Burdick, R.K., Graybill, F.A., 1988. The present status of confidence interval estimation on variance components in balanced and unbalanced random models. Comm. Statist. Theory Methods 17 (4), 1165–1195.
Burdick, R.K., Graybill, F.A., 1992. Confidence Intervals on Variance Components. Marcel Dekker, Inc., New York, NY.
Cappelleri, J.C., Ting, N., 2003. A modified large-sample approach to approximate interval estimation for a particular intraclass correlation coefficient. Statist. Med. 22, 1861–1877.
Damon, R.A., Harvey, W.R., 1987. Experimental Design, ANOVA, and Regression. Harper and Row, New York, NY.
Fleiss, J.L., 1986. The Design and Analysis of Clinical Experiments. Wiley, New York, NY, pp. 1–32.
Fleiss, J.L., Cohen, J., 1973. The equivalence of weighted kappa and the intraclass correlation coefficient as a measure of reliability. Educ. Psychol. Meas. 33, 613–619.
Fleiss, J.L., Shrout, P.E., 1978. Approximate interval estimation for a certain intraclass correlation coefficient. Psychometrika 43, 259–262.
Fleiss, J.L., Shrout, P.E., 1979. Intraclass correlations: uses in assessing rater reliability. Psychol. Bull. 86 (2), 420–428.
Graybill, F.A., 1976. Theory and Application of the Linear Model. Duxbury Press, Pacific Grove, CA.
Gui, R., Graybill, F.A., Burdick, R.K., Ting, N., 1995. Confidence intervals on ratios of linear combinations for non-disjoint sets of expected mean squares. J. Statist. Plann. Inference 48, 215–227.
Hamada, M., Weerahandi, S., 2000. Measurement system assessment via generalized inference. J. Qual. Technol. 32 (3), 241–253.
Landis, J.R., Koch, G.G., 1977. The measurement of observer agreement for categorical data. Biometrics 33, 159–174.
Leiva, R.A., Graybill, F.A., 1986. Confidence intervals for variance components in the balanced two-way model with interaction. Comm. Statist. Simulation Comput. 15, 301–322.
Lin, L., Hedayat, A.S., Sinha, B., Yang, M., 2002. Statistical methods in assessing agreement: models, issues, and tools. J. Amer. Statist. Assoc. 97, 257–270.
Milliken, G.A., Johnson, D.E., 1992. Analysis of Messy Data, Volume 1: Designed Experiments. Chapman & Hall, London, UK.
Montgomery, D.C., 1997. Design and Analysis of Experiments, fourth ed. Wiley, New York, NY.
Müller, R., Büttner, P., 1994. A critical discussion of intraclass correlation coefficients. Statist. Med. 13 (23–24), 2465–2476.
Sahai, H., Ageel, M., 2000. The Analysis of Variance: Fixed, Random and Mixed Models. Birkhäuser, Boston, MA.
SAS Institute, Inc., 1999–2001. SAS For Windows (Version 8.02). SAS Institute, Inc., Cary, NC.
Searle, S.R., 1971. Linear Models. Wiley, New York, NY.
St. Laurent, R.T., 1998. Evaluating agreement with a gold standard in method comparison studies. Biometrics 54 (2), 537–545.
Tian, L., Cappelleri, J.C., 2004. A new approach for interval estimation and hypothesis testing of a certain intraclass correlation coefficient: the generalized variable method. Statist. Med. 23, 2125–2135.
Ting, N., Burdick, R.K., Graybill, F.A., Jeyaratnam, S., Lu, T.F.C., 1990. Confidence intervals on linear combinations of variance components that are unrestricted in sign. J. Statist. Comput. Simul. 35, 135–143.
Ting, N., Burdick, R.K., Graybill, F.A., 1991. Confidence intervals on ratios of positive linear combinations of variance components. Statist. Probab. Lett. 11, 523–528.
Tsai, P.F., 1988. Variable gauge repeatability and reproducibility study using the analysis of variance method. Qual. Eng., 107–115.
Weerahandi, S., 1995. Exact Statistical Methods for Data Analysis. Springer, New York, NY.
Zou, K.H., McDermott, M.P., 1999. Higher-moment approaches to approximate interval estimation for a certain intraclass correlation coefficient. Statist. Med. 18 (15), 2051–2061.