Comparative Evaluation of Two Models for Estimating Sample Sizes for Tests on Trends across Repeated Measurements

John E. Overall, PhD, Ghassan Shobaki, MA, and Cheryl B. Anderson, PhD
The University of Texas Medical School, Houston, Texas

ABSTRACT: Two equations for calculating sample sizes that are required for power in testing differences in rates of change in repeated measurement designs have been presented by different authors. One equation provides support for the conclusion that increased frequency of measurements across a treatment period of fixed duration enhances power of the tests. The other equation supports the counterintuitive conclusion that increased frequency of measurements actually tends to decrease power in the presence of realistic serial dependencies in the data. Monte Carlo methods confirm that the equation providing support for the latter conclusion is accurate, whereas the alternative equation tends to underestimate sample sizes required for power in testing differences in slopes of regression lines fitted to changes in the repeated measurements across time when symmetry is absent from the covariance structure. Controlled Clin Trials 1998;19:188–197 © Elsevier Science Inc. 1998

KEY WORDS: Sample size, power, repeated measurements, frequency of measurements, linear trends, rates of change

A randomized parallel-groups design is conventionally used to evaluate differences in patterns of treatment effects across time in controlled clinical trials. Primary interest often focuses on the differences between experimental and control groups in average rates of change across a treatment period of fixed duration. In consecutive issues of this journal, Kirby, Galai, and Muñoz (KGM) [3] and Overall and Doyle (OD) [7] have proposed different equations for calculating power and sample size for what appear to be tests of the same hypothesis. This article elucidates the similarities and differences between the two models. KGM specify the treatment effect to be the difference between means of slope coefficients from linear equations fitted to the response patterns for individual subjects in two groups [3, p 168].

Address reprint requests to: Dr. John E. Overall, Department of Psychiatry and Behavioral Science, The University of Texas Medical School, P.O. Box 20708, Houston, TX 77225. Received March 25, 1996; accepted June 9, 1997.
Controlled Clinical Trials 19:188–197 (1998) © Elsevier Science Inc. 1998, 655 Avenue of the Americas, New York, NY 10010

The implied analysis, to which the sample

0197-2456/98/$19.00 PII S0197-2456(97)00095-0


size and power estimates relate, conforms to a two-stage “random regression model” [4] in which individual response patterns are modeled first, and then tests of significance for differences between groups are applied to the slope coefficients from the individual regression equations fitted in the first stage. KGM specify a design matrix consisting of coefficients for linear trend and average elevation. Because those two components of the response pattern are affected by departures from symmetry in the matrix of correlations among the repeated measurements, KGM consider a solution which, in effect, transforms the time scale to compensate for serial dependencies in the data. They then, however, relate the sample size and power estimates so derived to a test of significance for the difference between group means for slope coefficients from regression equations relating the original repeated measurements to their unadjusted assessment times. OD [7] use orthogonal polynomial coefficients as weights to define composite “linear trend scores,” which they note to be proportional to slope coefficients within a scaling constant. Normalizing the vector of orthogonal polynomial coefficients to unit sum of squares results in linear trend scores that are precisely the slope coefficients from least squares regression equations, relating the repeated measurements for the individual subjects to a normalized time scale. The normalizing does not affect sample size or power calculations because the linear treatment effect and its standard error are rescaled proportionately. In this paper, we consider that the orthogonal polynomial coefficients are normalized so that the linear trend scores of OD are slope coefficients as discussed by KGM. 
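The equivalence between normalized linear trend scores and least squares slopes is easy to confirm numerically. The following sketch is our own illustration (not code from either paper), using an arbitrary simulated response pattern; it shows that applying normalized linear orthogonal polynomial weights to a subject's repeated measurements reproduces exactly the OLS slope of those measurements on the normalized time scale.

```python
import numpy as np

# Illustration (not from either paper): normalized linear orthogonal
# polynomial weights yield exactly the least squares slope of the
# repeated measurements regressed on the normalized time scale.
k = 9                                   # number of equally spaced assessments
t = np.arange(k, dtype=float)
x = t - t.mean()                        # linear orthogonal polynomial coefficients (sum to zero)
x = x / np.sqrt(x @ x)                  # normalized to unit sum of squares, x'x = 1

rng = np.random.default_rng(1)
y = 0.4 * t + rng.normal(size=k)        # one subject's (arbitrary) response pattern

trend_score = x @ y                     # linear trend score in the OD sense
slope = np.polyfit(x, y, 1)[0]          # OLS slope of y on the normalized time scale
print(np.isclose(trend_score, slope))   # True
```

Because x has mean zero and unit sum of squares, the OLS slope formula reduces algebraically to the inner product x'y, which is why the normalizing constant cancels from sample size and power calculations.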
Whether normalized or not, the linear orthogonal polynomial coefficients are orthogonal to a summation vector representing the average response level across the repeated measurements; however, the correlation between two weighted combinations of repeated measurements (e.g., slope and average level) depends also on the matrix of covariances among the repeated measurements. Because KGM [3] introduce a scale transformation via R⁻¹ and OD [7] do not, their respective equations for sample size calculations will not, in general, provide equivalent results. The original articles do not adequately convey this difference; instead, the papers describe the alternative equations for sample size and power calculations as relating to tests of the same hypothesis. It is unrealistic to assume uniform correlations among sequentially obtained measurements in a controlled clinical trial. Sample size and power calculations require specification of a matrix of expected correlations among the repeated measurements [3,7]. To model temporal decay in the population correlation matrix, KGM [3] used a "dampened exponential" equation whereas OD [7] used a simple exponential equation. In this article, we adopt the dampened exponential model of KGM to further equilibrate conditions for comparison of the different sample size and power calculations. Another difference in approach to definition of the population correlation matrix needs to be rationalized before making direct comparisons of sample size calculations for controlled clinical trials. To generate a complete matrix of expected correlations among the repeated measurements, KGM begin with the expected correlation between adjacent time points and extrapolate across successive fixed intervals to the end of a study. Using that approach, they have considered lengthening the total treatment time to increase the number of


measurements while holding the interval between the measurements constant. OD consider clinical trials of specified duration, and they have then evaluated the sample size and power implications of inserting different numbers of repeated measurements within that time span. Increased frequency of measurements across a treatment period of fixed duration decreases the interval between adjacent measurements and thus increases their correlation according to the exponential temporal decay model. In considering different numbers of repeated measurements across a treatment period of fixed total duration, we believe that it is more appropriate to start with an estimate of the baseline-to-endpoint correlation and to adjust the correlation between adjacent measurements according to their spacing. The dampened exponential model [3,5] can accommodate this approach by fixing the baseline-to-endpoint correlation and varying the correlation between adjacent measurements as a function of the frequency of measurements across the fixed treatment period. Thus, the expected correlation between measurements at times t and (t + s) is calculated as follows:

Corr(e_it, e_i,t+s) = γ^(s^Q)    (1)

where γ is the within-groups correlation between adjacent measurements, s is the number of time intervals separating the two measurements for which correlation is to be defined, and Q measures the degree of attenuation in the exponential temporal decay γ^(s^Q). Setting Corr(e_it, e_i,t+s) equal to the baseline-to-endpoint correlation and s equal to the number of intervals between baseline and endpoint allows calculation of γ for any specified Q. We used this approach to calculate theoretical correlation matrices (R) with different baseline-to-endpoint correlations of 0.3, 0.5, and 0.7 for the present comparison of power and sample size calculations.

DIFFERENT EQUATIONS FOR CALCULATING SAMPLE SIZES

KGM [3] provide the following equation for calculating the sample size (n per group) for testing the significance of the difference between mean slope coefficients, D₁ and D₂, for experimental and control groups:

n = (Z_α/2 + Z_β)² 2σ² [(XᵀR⁻¹X)⁻¹]₂,₂ / (D₂ − D₁)²    (2)

where

Xᵀ = | 1  1  1  1  …  1 |
     | 0  1  2  3  …  V |

The subscript 2,2 in the numerator of Eq. 2 indicates the second diagonal element of the 2 × 2 matrix [XᵀR⁻¹X]⁻¹ after inversion. The constants Z_α/2 and Z_β delineate areas under the unit normal curve equal to one-half α and to β, respectively. If the desired power is greater than 0.5, Z_β is the positive-signed critical value beyond which the smaller area under the curve is equal to β. The sample size formula of OD [7] differs from Eq. 2 in not adjusting for serial dependencies in the pattern of correlations among the repeated measurements when considering differences in rates of change across time. Those authors


have considered the unadjusted slopes from linear regression equations relating change in repeated measurements to equally spaced assessment times to be a more conceptually meaningful definition of rates of change in clinical trials. No "symmetry assumption" is required.

n = (Z_α/2 + Z_β)² 2σ² (x′Rx) / (D₂ − D₁)²    (3)

The x vector contains linear orthogonal polynomial coefficients that sum to zero and that we here consider normalized to unit sum of squares. The OD mean slopes are D₁ = x′μ₁ and D₂ = x′μ₂ when x′ consists of the orthogonal polynomial coefficients normalized to x′x = 1.0. The single coefficient vector is substituted for the two-row design matrix Xᵀ of KGM with the rationale that linear coefficients in x′ that sum to zero are orthogonal to the summation vector (1′). As will be elaborated in later discussion, statistical orthogonality (independence) is retained for any 1′Rx in which the lower triangle of R is a mirror image of the upper triangle. The quadratic form x′Σx = σ²(x′Rx) is the variance of a weighted combination of the repeated measurements formed by applying the linear orthogonal polynomial coefficients in vector x′ to the repeated measurements.

Both models [3,7] permit factoring σ² from covariance matrix Σ under the assumption that within-group variances are constant across the repeated measurements. Factoring of σ² leaves the correlation matrix R, which will be assumed to exhibit dampened exponential temporal decay, γ^(s^Q). Given specified baseline-to-endpoint correlation for a treatment period of fixed total duration, the correlation γ between adjacent measurements varies as a function of the number of equally spaced repeated measurements within that time frame. We used the dampened exponential equation to generate the required theoretical correlation matrix R in an effort to conform to KGM [3] as closely as possible; however, either Eq. 2 or 3 can be used to calculate sample sizes for repeated measurements whose correlations are of any other specified form. Modeling of other types of correlation or covariance structures for repeated measurements is discussed in considerable detail in recent texts [1,2]. If the variance is expected to change across time in a predictable way, x′Σx can be substituted for σ²(x′Rx) in Eq. 3.
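Both formulas are straightforward to evaluate numerically. The sketch below is our own illustration, not code from either paper: `dampened_r` implements Eq. 1 by solving for γ given a baseline-to-endpoint correlation, and the two sample size functions implement Eqs. 2 and 3 under the conditions used later in the comparison (two-sided α = 0.05 with Z_α/2 = 1.96, power 0.9 with Z_β ≈ 1.2816, and a true effect rising linearly to one-half standard deviation at endpoint). The function and parameter names are ours.

```python
import numpy as np

def dampened_r(k, r_end, Q):
    """Eq. 1: Corr = g**(s**Q) for k equally spaced points, with g chosen
    so that the baseline-to-endpoint (s = k - 1) correlation equals r_end."""
    g = r_end ** (1.0 / (k - 1) ** Q)
    s = np.abs(np.subtract.outer(np.arange(k), np.arange(k))).astype(float)
    return np.where(s > 0, g ** (s ** Q), 1.0)

def n_kgm(k, r_end, Q, delta=0.5, z=1.96 + 1.2816, sigma=1.0):
    """Eq. 2 (KGM): n per group for a linear effect reaching delta*sigma at endpoint."""
    R = dampened_r(k, r_end, Q)
    Rinv = np.linalg.inv(R)
    t = np.arange(k, dtype=float)
    X = np.vstack([np.ones(k), t])                            # X^T: elevation and time rows
    c22 = np.linalg.inv(X @ Rinv @ X.T)[1, 1]                 # [(X^T R^-1 X)^-1]_{2,2}
    tc = t - t.mean()
    slope_diff = tc @ np.linspace(0.0, delta, k) / (tc @ tc)  # D2 - D1 (OLS slopes)
    return z**2 * 2 * sigma**2 * c22 / slope_diff**2

def n_od(k, r_end, Q, delta=0.5, z=1.96 + 1.2816, sigma=1.0):
    """Eq. 3 (OD): same effect, normalized orthogonal polynomial contrast."""
    R = dampened_r(k, r_end, Q)
    t = np.arange(k, dtype=float)
    x = t - t.mean()
    x = x / np.sqrt(x @ x)                                    # x'x = 1
    effect = x @ np.linspace(0.0, delta, k)                   # D2 - D1
    return z**2 * 2 * sigma**2 * (x @ R @ x) / effect**2

# 5 data points, baseline-to-endpoint correlation 0.3:
print(n_od(5, 0.3, 0.0), n_kgm(5, 0.3, 0.0))   # both ~94 under compound symmetry
print(n_od(5, 0.3, 1.0))                       # rises to ~124 under Q = 1.0
```

For five data points with baseline-to-endpoint correlation 0.3, both equations give about 94 subjects per group under compound symmetry (Q = 0), consistent with Table 1 below; at Q = 1.0 the OD value rises to about 124 while the KGM value remains smaller.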
Given an expected correlation matrix R, which conforms to the dampened exponential or any other theoretical model, the matrix of expected covariances can be generated by pre- and postmultiplying R by a diagonal matrix S containing expected standard deviations of the repeated measurements at the various assessment points, Σ = SᵀRS.

COMPARISON OF SAMPLE SIZES CALCULATED BY THE TWO EQUATIONS

We solved Eqs. 2 and 3 for sample sizes aimed at providing power 0.7 and 0.9 against a true treatment effect that increases linearly from zero at baseline to one-half standard deviation at endpoint. The KGM mean slopes D₁ and D₂ were calculated by least squares regression using the hypothesized linearly increasing treatment means and integers 0, 1, 2, …, V. The OD mean slopes were calculated by applying the normalized orthogonal polynomial coefficients x′ to the hypothesized linearly increasing treatment means. We chose randomized designs involving baseline plus 16, 8, or 4 repeated measurements to


Table 1  Sample Sizes Calculated to Provide Power 0.9 for Testing Mean Difference in Slopes of Regression Lines Fitted to Different Numbers of Repeated Measurements

                        Baseline-to-Endpoint Correlation**
                     0.30           0.50           0.70
                 [OD]* [KGM]    [OD]  [KGM]    [OD]  [KGM]
17 data points
  Q = 0.0         37    37       26    26       16    16
  Q = 0.5        107    98       69    64       43    40
  Q = 1.0        139   116       80    73       49    45
9 data points
  Q = 0.0         63    63       45    45       27    27
  Q = 0.5        108   104       80    76       49    47
  Q = 1.0        132   118       84    56       53    50
5 data points
  Q = 0.0         94    94       67    67       40    40
  Q = 0.5        112   110       81    80       49    49
  Q = 1.0        124   118       88    84       53    50

* [OD] denotes Overall and Doyle [7]; [KGM] denotes Kirby, Galai, and Muñoz [3].
** Three different levels of baseline-to-endpoint correlation were used to construct theoretical R matrices with dampened exponential temporal decay patterns for intervening correlations, determined by solving Eq. 1 for the indicated values of Q = 0.0, 0.5, and 1.0.

simulate weekly, biweekly, or monthly evaluations across a total treatment period of 16 weeks. Including the baseline assessment, the different series had 17, 9, or 5 repeated measurements. Three different magnitudes of baseline-to-endpoint correlation (0.3, 0.5, and 0.7) were considered in generating theoretical "population" correlation matrices exhibiting three different rates of temporal decay, γ^(s^Q), with Q = 0, 0.5, or 1.0. Note that Q = 0 results in uniform correlations (i.e., compound symmetry) among the repeated measurements.

The results presented in Table 1 indicate equivalence of the sample sizes calculated from the two equations only when Q = 0. As Q increases, correlations become increasingly less homogeneous. KGM have suggested that Q = 1.0 approaches the upper limit of realistic serial dependencies among repeated measurements. For Q ≠ 0, the sample sizes calculated to provide specified power by the OD equation are larger than those calculated using the equation of KGM. The magnitude of the difference between sample sizes calculated from the two models increases with the number of repeated measurements interposed between baseline and end of the treatment period of specified total duration. The difference is also greater for lower average levels of (nonhomogeneous) correlation than for higher levels.

We also calculated sample sizes for power 0.7 using the alternative equations. Although the sample sizes were appropriately smaller across the board, the pattern of similarities and differences was so consistent with that shown in Table 1 that those results will not be presented in detail. The accuracy of the alternative sample sizes for producing actual power 0.7 as well as 0.9 will be considered, however.
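The qualitative pattern in Table 1 can be reproduced with a short loop. The sketch below is again our own illustration, with the same assumed z-values (1.96 and 1.2816); the printed values are unrounded, so they may differ from the published integers by rounding conventions.

```python
import numpy as np

# Our illustrative reproduction of the Table 1 pattern; z targets
# two-sided alpha = 0.05 and power 0.9.
def dampened_r(k, r_end, Q):
    g = r_end ** (1.0 / (k - 1) ** Q)          # Eq. 1 solved for gamma
    s = np.abs(np.subtract.outer(np.arange(k), np.arange(k))).astype(float)
    return np.where(s > 0, g ** (s ** Q), 1.0)

def sample_sizes(k, r_end, Q, delta=0.5, z=1.96 + 1.2816):
    R = dampened_r(k, r_end, Q)
    t = np.arange(k, dtype=float)
    m = np.linspace(0.0, delta, k)             # linearly increasing effect
    tc = t - t.mean()
    # OD (Eq. 3): normalized orthogonal polynomial contrast
    x = tc / np.sqrt(tc @ tc)
    n_od = z**2 * 2 * (x @ R @ x) / (x @ m) ** 2
    # KGM (Eq. 2): GLS-style variance element with OLS slope effect
    X = np.vstack([np.ones(k), t])
    c22 = np.linalg.inv(X @ np.linalg.inv(R) @ X.T)[1, 1]
    n_kgm = z**2 * 2 * c22 / (tc @ m / (tc @ tc)) ** 2
    return n_od, n_kgm

for k in (17, 9, 5):
    for Q in (0.0, 0.5, 1.0):
        od, kgm = sample_sizes(k, 0.3, Q)
        print(f"k={k:2d}  Q={Q:3.1f}  OD={od:6.1f}  KGM={kgm:6.1f}")
```

At Q = 0 the two columns coincide; as Q grows, and especially with 17 assessment points, the OD sample sizes exceed the KGM values, matching the pattern of Table 1.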


The difference in sample sizes produced by the two equations for testing what is purportedly the same hypothesis raises the question of which, if either, actually provides the desired power. We next used Monte Carlo simulation methods to evaluate the actual power provided by sample sizes derived from the two equations. The empirical evaluation of power produced by the calculated sample sizes required generating sample data for repeated measurements having (population) correlations equal to the theoretical correlations that we used for the sample size calculations. Although alternative methods exist for generating samples from multivariate distributions with known correlation matrices (e.g., factor analysis), the objective of realistic simulation was pursued through use of a data generation model consistent with the conception of a repeated measurements experimental design. The data generation model thus included true treatment effects, between-subjects and within-subjects error components, and carry-over effects from previous assessment periods:

X_ijk = μ_ij + S_k(i) + e_ijk + v₁e_i(j−1)k + v₂e_i(j−2)k + …,    (4)

where μ_ij is the mean for the ith treatment on the jth occasion, S_k(i) is the normally distributed sampling deviation for the kth individual in the ith group, and e_ijk is the independent normally distributed error deviation for the kth individual in the ith group at the jth time period. The v₁e_i(j−1)k, v₂e_i(j−2)k, and carry-over fractions from successively more remote periods are responsible for serial dependencies that violate the symmetry assumption of analysis of variance (ANOVA) for a split-plot design. Otherwise the data generation model was consistent with the usual ANOVA model for a design in which μ_ij = μ₀ + α_i + β_j + αβ_ij. The coefficients in the model were adjusted numerically to produce data with correlation matrices that closely matched the theoretical dampened exponential patterns used for calculation of sample sizes. The relative magnitudes of components of variance associated with S_k(i) and e_ijk determined the general level of correlation, and the v coefficients were temporally graded to produce heterogeneous correlations (across large samples) that matched the theoretical R matrices used for sample size calculations.

We analyzed large numbers of such repeated measurement data sets to determine how closely actual power corresponded to the power that the calculated sample sizes were supposed to provide for tests of the difference between means of slope coefficients from regression equations relating change in repeated measurements to associated assessment times. A true treatment effect that increased linearly from zero at baseline to one-half standard deviation at endpoint was introduced into data that were generated to evaluate actual power produced by the calculated sample sizes. Randomized parallel-groups designs with baseline plus 4, 8, or 16 subsequent repeated measurements across a treatment period of fixed total duration had sample sizes that were calculated by Eq. 2 or 3 to produce power of 0.7 or 0.9 for ANOVA tests of significance for a difference of that magnitude between slope coefficients in two treatment groups. As is evident in Table 1, the two equations for calculating sample sizes for testing the difference in linear slope coefficients produced meaningfully different results only when repeated measurements were more numerous and their pattern of correlations departed substantially from compound symmetry (e.g., Q = 1.0). Thus, actual power provided by the alternative sample size estimates was evaluated across Monte


Table 2  Comparison of Actual Power Provided by Sample Sizes Calculated from Two Equations

                            Baseline-to-Endpoint Correlation
                         0.30            0.50            0.70
                      [OD]  [KGM]     [OD]  [KGM]     [OD]  [KGM]
17 repeated measures
  Intended power 0.7  0.702 0.604    0.702 0.619    0.719 0.657
  Intended power 0.9  0.901 0.823    0.900 0.852    0.908 0.868
9 repeated measures
  Intended power 0.7  0.694 0.644    0.700 0.658    0.693 0.655
  Intended power 0.9  0.899 0.856    0.908 0.874    0.893 0.865
5 repeated measures
  Intended power 0.7  0.696 0.675    0.702 0.684    0.703 0.697
  Intended power 0.9  0.894 0.880    0.905 0.883    0.912 0.899

Carlo runs of 10,000 simulated data sets with Q = 1.0 for each combination of the other conditions. The results, which are presented in Table 2, confirm that the actual power provided by the OD sample sizes for testing the significance of difference between mean slopes D₁ and D₂ closely approximated the target values of 0.7 and 0.9. The smaller sample sizes calculated by Eq. 2 tended to provide actual power modestly below the intended values for testing the difference between means of slope coefficients relating change in repeated measurements to equally spaced assessment times. The sample sizes that aimed at producing power equal to 0.9 can be identified for corresponding design and correlation conditions in the Q = 1.0 rows of Table 1. The sample sizes that aimed at producing power equal to 0.7 are not shown, but the differences between the KGM and OD sample sizes are evident in the power results of Table 2.

RATIONALIZING DIFFERENCES BETWEEN THE EQUATIONS

One might suppose that the second diagonal element of the inverse of the 2 × 2 matrix, [(XᵀR⁻¹X)⁻¹]₂,₂, in Eq. 2 adjusts the error variance for correlation between slope and elevation components defined by the nonorthogonal coefficients in rows of Xᵀ. The sample size produced by that equation might then relate to a test for difference in slopes corrected for correlation with average elevation. If the KGM sample sizes were intended to provide specified power for testing the significance of difference between mean slopes corrected only for correlation with average elevation, we suggest that a similar result should follow from introducing covariance correction into the OD model for sample size calculation. The covariate should reduce the error variance by (1 − r²), where r is the correlation between slope and elevation components. Eq. 3 can be expanded to accommodate the covariance correction as follows:

n = (Z_α/2 + Z_β)² 2σ² (x′Rx)(1.0 − r²) / (D₂ − D₁)²    (5)

where

r = 1′Rx / √[(1′R1)(x′Rx)],

in which 1′ is a summation vector with all elements equal to unity and x is the vector of linear orthogonal polynomial coefficients with elements that sum to zero. We determined that the sample sizes calculated from this covariance-adjusted equation are not equal to those calculated from Eq. 2. This r will equal zero for any R matrix in which the lower triangle is a mirror image of the upper triangle, which is much more general than the "symmetry" required for repeated measurements ANOVA. Thus, the smaller KGM sample sizes cannot be accepted as a simple consequence of correcting slope coefficients for correlation with average elevation of the response pattern.

Insight into the relationship between the KGM and OD models follows from insertion of the identity I = RR⁻¹ into Eq. 2 to obtain Eq. 3 with y′ = x′R⁻¹ substituted for x′. The result is Eq. 6. To avoid confounding slope with the average elevation component, we substitute x′ containing normalized linear orthogonal polynomial coefficients for the linearly increasing coefficients in the second row of Xᵀ in the KGM equation. As a result, there is no need to correct for average elevation in Eq. 6 since x′ is orthogonal to the unit vector in the first row of Xᵀ.

n = (Z_α/2 + Z_β)² 2σ² (x′R⁻¹RR⁻¹x) / [(x′R⁻¹μ₁) − (x′R⁻¹μ₂)]²    (6)

As noted, σ²(x′Rx) in the numerator of Eq. 3 is the (within-groups) variance of a weighted combination of the repeated measurements in which the elements in x′ are the compounding coefficients. When x′ consists of normalized linear orthogonal polynomial coefficients, σ²(x′Rx) is the variance of slope coefficients whose means in the denominator of Eq. 3 are D₁ = x′μ₁ and D₂ = x′μ₂. When y′ = x′R⁻¹ replaces x′ of Eq. 3, the "slopes" for which σ²(x′R⁻¹RR⁻¹x) = σ²(x′R⁻¹x) is the sampling variance are no longer simple products of linear coefficients and repeated measurement vectors. Instead, the treatment effect to which that error variance relates is the difference between D₁ = x′R⁻¹μ₁ and D₂ = x′R⁻¹μ₂, as indicated in the denominator of Eq. 6. Again, Eq. 6 is Eq. 2 with the identity I = RR⁻¹ inserted and with D₁ and D₂ representing the adjusted treatment effects for which σ²(x′R⁻¹RR⁻¹x) = σ²(x′R⁻¹x) of Eq. 2 is the appropriate error variance.

When elements of R satisfy compound symmetry, y′ = x′R⁻¹ is proportional to x′ within a scalar constant. Since that constant would appear (squared) in both numerator and denominator of Eq. 6, the KGM and OD models result in equivalent sample sizes under symmetry conditions. In cases where elements in R conform to an autoregressive pattern, y′ = x′R⁻¹ is not proportional to x′, and the elements in y′ may depart substantially from a linear progression. Specifically, the elements in y′ = x′R⁻¹ are a nonlinear function of the equally spaced time points with the general form of a third-degree polynomial. For the particular case of 9 repeated measurements, an exponentially declining autoregressive (order 1) correlational pattern with Q = 1.0, and baseline-to-endpoint correlation of 0.3, the nonlinear transform y′ = x′R⁻¹ of the linear orthogonal polynomial vector x′ contains the following elements:


x′R⁻¹ = (−.705  −.029  −.019  −.010  0.0  .010  .019  .029  .705).

For baseline-to-endpoint correlation of 0.5 under similar autoregressive conditions, the nonlinear coefficient vector y′ = x′R⁻¹ is as follows:

x′R⁻¹ = (−1.014  −.017  −.012  −.006  0.0  .006  .012  .017  1.014).

For baseline-to-endpoint correlation of 0.7 under the same conditions, the nonlinear coefficient vector y′ = x′R⁻¹ contains even more highly differentiated elements:

x′R⁻¹ = (−1.711  −.009  −.006  −.003  0.0  .003  .006  .009  1.711).

The nonlinear coefficients emphasize the difference between the magnitudes of the baseline and the last of the repeated measurements, with much less weight given to differences between measurements in the central portion of the treatment period. Note that these contrasts approximate the simple difference scores often used for an endpoint analysis, which we have previously shown to provide power superior to that of a complete linear trend analysis in the presence of an autoregressive pattern of serial dependencies among the repeated measurements [6]. Giving greater weight to extreme values may enhance power for testing a difference in treatment effects, but the question here is whether it is appropriately characterized as a test for the difference between mean slope coefficients from linear regression equations calculated by regressing repeated measurements on the associated assessment times. Rather than using orthogonal polynomial coefficients, KGM address orthogonality by using the inverse, [(XᵀR⁻¹X)⁻¹]₂,₂, to adjust the trend variance for dependence on average elevation, an adjustment required by nonorthogonality of the rows of their Xᵀ matrix, but the orthogonality adjustment does not linearize the function being tested in the case of serially dependent measurements.
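The endpoint-weighting of y′ = x′R⁻¹ is easy to verify numerically. The sketch below (our own code, not the authors') rebuilds the first of the three cases: 9 equally spaced points, an AR(1) pattern (Q = 1.0), and baseline-to-endpoint correlation 0.3.

```python
import numpy as np

# Our illustration of the transformed contrast y' = x'R^{-1} for 9 equally
# spaced points, AR(1) correlations (Q = 1.0), baseline-to-endpoint corr 0.3.
k = 9
g = 0.3 ** (1.0 / (k - 1))                                 # gamma: g**8 = 0.3
s = np.abs(np.subtract.outer(np.arange(k), np.arange(k)))
R = g ** s                                                 # AR(1) correlation matrix
t = np.arange(k, dtype=float)
x = (t - t.mean()) / np.sqrt(((t - t.mean()) ** 2).sum())  # normalized linear contrast
y = np.linalg.solve(R, x)                                  # y' = x'R^{-1} (R symmetric)
print(np.round(y, 3))
# approximately (-.705 -.029 -.019 -.010 0 .010 .019 .029 .705),
# matching the first vector reported above
```

The weights are antisymmetric about the midpoint and concentrated almost entirely on the two endpoints, which is why the transformed contrast behaves like a baseline-to-endpoint difference score rather than a fitted slope.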
In conclusion, the KGM and OD equations for calculating sample sizes for tests on linear trends are different, and they produce different sample sizes when serial dependencies introduce heterogeneity into the correlations among sequentially obtained repeated measurements. In those cases, the sample sizes calculated from Eq. 3 provide the intended power for a test of difference between mean slopes from least squares regression equations relating change in the repeated measurements to equally spaced time points, whereas Eq. 2 underestimates the sample size required to produce intended power for testing that difference. In cases of serious departure from symmetry, the difference is substantial, and it may be most important in the design of experiments.

Consider the number of equally spaced repeated measurements to be inserted across a treatment period of fixed total duration. As shown in Table 1, the OD model [7] suggests that increasing the number of repeated measurements spanning a fixed treatment period will actually increase the sample size required to maintain constant power in cases where serial dependencies produce an autoregressive pattern of correlations (Q = 1.0). This inverse relationship between power and the frequency of repeated measurements spanning a treatment period of fixed duration, which is suggested by the OD model, has been confirmed by simulation work reported elsewhere [6]. In contrast, the sample sizes calculated from the KGM model for the same conditions do not increase with an increase in the number of repeated measurements spanning the fixed treatment


period. In the presence of strong serial dependencies among the repeated measurements, the KGM sample sizes approach those for tests on simple baseline-to-endpoint difference scores, which are commonly used for endpoint analyses. From a design perspective, this is relevant for planning the number and placement of repeated measurements, and it emphasizes the importance of understanding which model provides the correct sample sizes for testing which types of hypotheses.

This work was supported in part by grant DHHS MH32457.

REFERENCES
1. Diggle P, Liang KY, Zeger SL. Analysis of Longitudinal Data. New York: Oxford University Press; 1994.
2. Dwyer JH, Feinleib M, Lippert P, et al. Statistical Models for Longitudinal Studies of Health. New York: Oxford University Press; 1992.
3. Kirby AJ, Galai N, Muñoz A. Sample size estimation using repeated measurements on biomarkers as outcomes. Controlled Clin Trials 1994;15:165–172.
4. Laird NM, Ware JH. Random effects models for longitudinal data. Biometrics 1982;38:963–974.
5. Muñoz A, Carey VJ, Schouten JP, et al. A parametric family of correlation structures for the analysis of longitudinal data. Biometrics 1992;48:733–742.
6. Overall JE. How many repeated measurements are useful? J Clin Psych 1996;52:243–252.
7. Overall JE, Doyle SR. Estimating sample sizes for repeated measurement designs. Controlled Clin Trials 1994;15:100–123.