FERTILITY AND STERILITY
Vol. 65, No.5, May 1996
Copyright "'"' 1996 Arllerican Society for Reproductive Medicine
Printed on acid-free paper in U. S. A.
Empirical evidence of bias in infertility research: overestimation of treatment effect in crossover trials using pregnancy as the outcome measure
Khalid S. Khan, M.B.B.S.*† Salim Daya, M.B., Ch.B.‡§‖ John A. Collins, M.D.‡§ Stephen D. Walter, Ph.D.§
McMaster University, Hamilton, Ontario, Canada
Objective: To determine whether crossover trials with simple pooling of data over different study periods lead to a different estimate of treatment effect than parallel group trials in infertility research using pregnancy as the outcome measure.

Design: An observational study using nine overviews that included trials with both crossover and parallel group designs. These overviews comprised 17 crossover and 17 parallel group trials. In total, there were 5,291 outcomes including 775 pregnancies. The association between study design and treatment effect estimate was analyzed using multiple logistic regression, controlling for differences in the therapeutic interventions and variations in the methodological quality of the trials.

Setting: Infertile patients in an academic research environment.

Patients: Infertile patients undergoing treatment efficacy evaluation in controlled trials.

Interventions: Random allocation to a variety of treatments including clomiphene citrate, hCG, IUI, tamoxifen, and bromocriptine.

Main Outcome Measure: Estimate of bias between study designs, based on the interaction of study design and treatment in the logistic regression model.

Results: Crossover trials produced a larger average estimate of treatment effect than trials with a parallel group design, overestimating the odds ratio by 74% (95% confidence interval, 2% to 197%).

Conclusion: The use of a crossover design for evaluating infertility treatments with outcomes that prevent patients from completing later phases of the trial should be avoided because it leads to exaggerated estimates of treatment effect and may result in erroneous inferences and clinical decisions. Furthermore, the type of study design should be taken into account when assessing the methodological quality of therapy trials in infertility.
Fertil Steril 1996;65:939-45

Key Words: Bias, randomized controlled trials, infertility, crossover trials, pregnancy, regression analysis, study quality, overviews, meta-analysis
Investigators generally use the parallel group design for clinical trials evaluating health care interventions. The crossover design provides a useful alternative to the parallel group comparison because a smaller sample size is required to achieve the same precision in estimating the treatment effect (1, 2). However, crossover trials are reported to suffer from a lack of adequate analysis (3) and are considered inappropriate for evaluating treatments with outcomes that are nonreversible (such as pregnancy) that prevent subjects from participating in later phases of the trial (4, 5).

Received August 4, 1995; revised and accepted January 12, 1996.
* International Scholars Programme, Department of Clinical Epidemiology and Biostatistics.
† Present address: Department of Obstetrics and Gynaecology, Ninewells Hospital, Dundee, Scotland.
‡ Department of Obstetrics and Gynaecology.
§ Department of Clinical Epidemiology and Biostatistics.
‖ Reprint requests: Salim Daya, M.B., Ch.B., Department of Obstetrics and Gynecology, McMaster University, 1200 Main Street West, Hamilton, Ontario, Canada L8N 3Z5 (FAX: 905-524-2911; E-mail: [email protected]).

Both crossover and parallel group designs have been used in experimental research evaluating infertility treatments. In a systematic review of 501 trials in the literature from 1966 to 1990, the crossover design was used in 103 trials, with conception, pregnancy, or live birth as the main outcomes of interest (6). In these trials, the analysis typically did not account for dropouts, carry-over effects, and period effects. Instead, the results in 82 trials were reported by simply pooling the data from all the study periods. Despite the fact that such an analysis is inadequate (and potentially biased), these trials have been included in systematic reviews because for some treatments they represent the only available evidence of therapeutic efficacy (7, 8).

The possibility of bias has raised concern about the validity of infertility research that is based on a crossover design (5, 9, 10), but the direction and magnitude of the bias are not clearly discernible. It has been postulated that such trials will result in an overestimation of the true treatment effect and possibly create spurious statistical significance (6, 7). However, this point of view is not accepted universally. Some investigators support the use of the crossover design in infertility treatment evaluation (11, 12), and the continued publication of findings from such trials in the infertility literature further attests to this viewpoint. For example, a MEDLINE search using the text word "crossover" identified 14 such trials in Fertility and Sterility over the period extending from 1991 to 1994. One approach to resolving this debate is to evaluate published trials to determine whether the crossover design produces a biased estimate of the treatment effect.
Therefore, the aim of this study was to test the hypothesis that there are differences in effect estimates in trials using a crossover design compared with those using a parallel group design for treatments with nonreversible outcomes. The parallel group design was chosen as the referent because a single-phase trial conducted with sufficient rigor is believed to be the ideal infertility study (6).

MATERIALS AND METHODS

Selection of Study Material

Systematic reviews of controlled trials in infertility have led to several publications that provided a database for the present study: the recent publication of 33 systematic overviews in the report of the Royal Commission on New Reproductive Technologies (8) and two other publications (6, 7) contained information from a total of 501 trials.
All overviews from this database were assessed independently by two reviewers for possible inclusion in the present study. The reviews were deemed relevant if they included trials with both crossover and parallel group designs comparing two or more treatments or interventions. A crossover trial was defined as a randomized study in which subjects, upon completion of one course of treatment, were switched over to receive a different treatment. A parallel group trial was defined as one in which subjects were allocated randomly to receive either an experimental or a control intervention. Any nonrandomized observational studies included in the overviews were excluded. Discrepancies about overview selection and trial design were resolved by consensus or through arbitration by a third reviewer.

Methodological Quality
Although an experimental research design provides the least biased method of assessing medical interventions, clinical trials are not always free from bias (13, 14). When comparing crossover and parallel group trials, the possibility of confounding by factors related to methodological quality is an important component requiring appropriate adjustment in the evaluation. Therefore, the methodological quality of selected trials was assessed using the components of study design that have been considered to have an effect on internal validity (14, 15). In addition, a validated methodological quality scoring scale was used (15, 16). After selecting relevant systematic reviews, the complete manuscripts of the trials included in these reviews were assessed independently for methodological quality by two reviewers. The following items were considered in the assessment of methodological quality:

1. Concealment of randomization. Concealment of allocation was considered adequate if randomization was performed at a site remote from the treatment site, or if coded bottles, serially numbered drug containers, opaque sealed envelopes, or other such methods were used. Concealment was considered inadequate if randomization was based on open methods such as reference to case record numbers or birth dates. In the absence of any information about concealment, the trial was categorized as being concealed unclearly.

2. Allocation sequence generation. Methods of sequence generation that were considered adequate included random numbers generated by computer, random number tables, coin tossing, or card shuffling. Trials in which allocation was performed using case record numbers, social insurance numbers, or birth dates were considered to have inadequate randomization sequence generation. Trials in which the authors did not report their method of sequence generation were categorized as being unclear.

3. Blinding. A trial was considered blind if the authors referred to the term "double-blind" in the description. Blinding was considered adequate if both the physician and the patients were blinded and the placebos used were identical in taste and/or appearance to the active treatment, or inadequate if it was clear that the physician or the study participants were able to identify the intervention being provided, e.g., oral versus parenteral administration of treatment. If reference to double-blind was made, but the method of blinding was not stated, the trial was classified as having unclear blinding. Trials in which blinding was impossible owing to the nature of the comparisons (e.g., comparison of IUI with natural intercourse) were classified separately.

4. Description of follow-up. Information was sought on all participants who were included in the trial but who did not complete the observation period. Follow-up was considered adequate if >90% of patients originally enrolled were included in the analysis. If no statement on dropouts was made, follow-up was considered to be defined unclearly.

A score for methodological quality was computed for each trial by evaluating the method of randomization sequence generation, blinding, and the description of dropouts. One point was awarded for each item that was reported. In addition, one point each was added if randomization and blinding were judged to be appropriate, whereas one point was deducted for each of these criteria judged to be inappropriate. The maximum possible score was 5. This method has been shown to have good discrimination between trials of low quality (score ≤2 points) and high quality (score ≥3) compared with quality assessment by experts (16).

Data Extraction
Information extracted from each trial included the features of study design, the number of couples, and the number of pregnancies per couple in the treatment and control groups. In most crossover trials, the data were collapsed from the different study periods and the treatment effect was estimated by simply pooling the successes and failures from the different study periods to produce a single 2 × 2 table. In parallel group trials, the effect size was determined from the results summarized in a 2 × 2 table. The trials in which the meta-analysts had extracted data separately from the first period of a crossover trial were regarded as parallel group comparisons; the results from subsequent periods were excluded from the analysis. Where possible, data were abstracted to perform an intention-to-treat analysis.

Statistical Analysis
The analysis of agreement between two observers for overview selection and methodological quality items was performed using crude percentage agreement and the kappa statistic to obtain an estimate corrected for agreement expected by chance. The weighted kappa statistic with quadratic weights was used to provide credit for partial agreement (17). The minimally acceptable kappa level was set a priori at 0.7, a level that is considered to represent a substantial strength of agreement (18). Subsequent analysis was based on logistic regression using the BMDP statistical software (BMDP Statistical Software, Inc., Los Angeles, CA) (19). A regression model was built to assess the effect of various factors on pregnancy categorized as a binary outcome. The impact of study design and quality was assessed in a simple unadjusted model and in a multivariate model adjusted for confounding factors. A term for the interaction between trial design and treatment-control allocation was used to address the primary hypothesis concerning the effect of crossover design on treatment effect. The statistical null hypothesis we wanted to test was that the value of the β coefficient for the interaction term is zero (or its exponent is 1.0). In the model, the exponent of the β coefficient associated with the interaction term represents the ratio of odds ratios for the two comparison groups (14). A value >1.0 indicated that the treatment effect, on average, was larger in trials with a crossover design (study group) than with a parallel group design (referent group). Interaction terms for quality items and treatment-control allocation were used to evaluate the effect of quality items on treatment effect. To simplify the multivariate model, the comparison of treatment effects was made in the subgroups of trials with different study designs.
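The interpretation of the interaction term can be sketched numerically. In a saturated logistic model with binary terms for treatment, design, and their product, the exponent of the interaction β coefficient is algebraically the ratio of the odds ratios in the two design strata. The counts below are purely hypothetical and serve only to illustrate that identity; the model reported here also carried adjustment terms.

```python
import math

def log_odds(events: int, total: int) -> float:
    """Log-odds of an event given counts of events and trials."""
    return math.log(events / (total - events))

# Hypothetical 2 x 2 x 2 counts: (pregnancies, cycles), by design and arm.
counts = {
    ("parallel",  "treatment"): (30, 100),
    ("parallel",  "control"):   (20, 100),
    ("crossover", "treatment"): (45, 100),
    ("crossover", "control"):   (20, 100),
}

def odds_ratio(design: str) -> float:
    t_preg, t_n = counts[(design, "treatment")]
    c_preg, c_n = counts[(design, "control")]
    return math.exp(log_odds(t_preg, t_n) - log_odds(c_preg, c_n))

# In the saturated model, the interaction coefficient is the difference
# of the treatment log-odds differences across the two designs ...
beta_interaction = (
    (log_odds(45, 100) - log_odds(20, 100))
    - (log_odds(30, 100) - log_odds(20, 100))
)

# ... so its exponent equals the ratio of the two stratum odds ratios.
ratio_of_ors = odds_ratio("crossover") / odds_ratio("parallel")
assert math.isclose(math.exp(beta_interaction), ratio_of_ors)
print(f"OR(crossover) = {odds_ratio('crossover'):.2f}, "
      f"OR(parallel) = {odds_ratio('parallel'):.2f}, "
      f"exp(beta_int) = {math.exp(beta_interaction):.2f}")
```

A value of the exponentiated interaction above 1.0 therefore indicates a larger average treatment effect in crossover trials, which is exactly the quantity reported in Table 3.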
To account for the overall effects of treatment, variables to control for the inclusion of a study in a specific meta-analysis and for the different treatment effects in the different meta-analyses were included in the model. In addition, terms for methodological quality and their interactions with treatment-control allocation were used to control for confounding by these factors. The Hosmer-Lemeshow decile-of-risk test was used to assess model goodness-of-fit (20).

RESULTS

Selection and Description of Study Material
Crude agreement concerning overview selection occurred for 30 of 33 (90%) overviews reviewed
Table 1  Observer Agreement in Overview Selection, Study Design, and Quality Assessment

                        Agreement (%)    Weighted kappa
Overview selection            90             0.80
Study design                 100             1.0
Concealment                   90             0.85
Sequence generation           84             0.78
Blinding                      81             0.70
Follow-up                     90             0.74
Quality score                 84             0.94
(weighted kappa 0.80). After consensus and arbitration by a third reviewer, nine overviews that included trials with both parallel group and crossover designs were selected for analysis. The total number of trials included in these overviews was 34, representing 5,291 outcomes, including 775 clinical pregnancies. In parallel group trials, these outcomes represented individual patients, whereas in crossover trials, they represented the number of study periods completed by the patients enrolled in the trial. There were 17 trials with a parallel group design and 17 trials with a crossover design. Among the trials with a parallel group design, five contained results from the first period of crossover trials; the data from the subsequent periods were excluded. Two trials were represented twice in different overviews. To avoid duplication of data, each trial was assigned randomly to one overview by tossing a coin. Another trial also was included in two overviews, but its data were not duplicated because different arms of the trial were included in the different meta-analyses.

Crude agreement on the components of study quality ranged from 81% to 90%. The kappa values ranged from 0.70 to 0.94 (Table 1). The methodological quality of the trials was generally poor, as shown in Table 2. Concealment of randomization and allocation sequence generation were not stated clearly in 22 (65%) trials and were assessed to be adequate in only 8 (23%) and 4 (12%) trials, respectively. There were 21 (63%) nonblinded trials, including 13 in which blinding could not have been used owing to the nature of the treatment comparisons. In three trials the method of blinding was not stated. Blinding was judged to be adequate in 10 (29%) trials. Follow-up was not described explicitly in 24 (70%) trials. When a description was provided, only seven
(20%) trials had a >90% follow-up rate. Overall, the poor methodological quality of the trials was evident in the summary scale score. Only 10 (29%) trials had a score >2, indicating that they were of high quality. The frequency of adequate study quality components was higher in the trials with the parallel group design than in those with the crossover design, but these differences were not statistically significant.

Bias in Effect Estimation
The logistic regression estimates of bias in the estimation of treatment effect resulting from differences in study design and quality are shown in Table 3. The results of the simple models (not adjusted for confounding) indicated that the crossover design overestimated the effect size by 60% compared with the parallel group comparisons (95% confidence interval 16% to 119%; P = 0.003). Inadequacy in concealment of randomization was associated with a 57% higher estimate of the treatment effect compared with trials with adequate concealment (95% confidence interval 2% to 140%; P = 0.04). Inadequacy in blinding was associated with a 41% overestimation of effect size compared with trials with adequate blinding (95% confidence interval 2% to 94%; P = 0.04). In this comparison, trials in which blinding was impossible were included in the reference group with adequate blinding. Inadequacy in sequence generation showed a trend toward overestimation, but these results were not statistically significant. In contrast, inadequacy in follow-up was associated with a 25% underestimation of the treatment effect (95% confidence interval 3% to 42%; P = 0.03). The multiple regression model that adjusted for
Table 2  Methodological Quality According to the Design of the Trials Included in the Overviews Selected for Analysis

                                          Methodological quality*
                            Adequate      Adequate sequence   Adequate    Adequate    High quality
Study design                concealment   generation          blinding    follow-up   (score >2)
Parallel groups (n = 17)         6              3                 7           5            7
Crossover (n = 17)               2              1                 3           2            3
Total (n = 34)                   8              4                10           7           10

* The data presented are the number of trials. No statistically significant differences between parallel group comparisons and crossover trials (P > 0.05) were observed for any of the methodological quality categories using Fisher's exact tests.
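The Fisher's exact comparisons mentioned in the table footnote can be reproduced directly from these counts. The sketch below is a plain two-sided Fisher's exact test built on the hypergeometric distribution using only the standard library; scipy.stats.fisher_exact would give the same P values.

```python
from math import comb

def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact P for the 2 x 2 table [[a, b], [c, d]].

    Sums the probabilities of all tables with the same margins that are
    no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2
    denom = comb(n, col1)

    def prob(k: int) -> float:  # hypergeometric P(first cell == k)
        return comb(row1, k) * comb(row2, col1 - k) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(col1, row1)
    return sum(prob(k) for k in range(lo, hi + 1)
               if prob(k) <= p_obs * (1 + 1e-9))

# Adequate concealment (Table 2): 6 of 17 parallel group trials versus
# 2 of 17 crossover trials, i.e., the table [[6, 11], [2, 15]].
p = fisher_exact_two_sided(6, 11, 2, 15)
print(f"P = {p:.2f}")  # prints "P = 0.22" -- above 0.05, as the footnote states
```

The same function applied to the other columns confirms that none of the quality comparisons between the two designs reaches significance.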
Table 3  Logistic Regression Estimates of Bias in Treatment Effect Resulting From Study Design and Variations in Methodological Quality*

                                        Unadjusted estimates             Adjusted estimates‡
                                        Exponent of β                    Exponent of β
                                        coefficient†          P value    coefficient†          P value
Study design
  Crossover design (reference:
  parallel group design)                1.60 (1.16 to 2.19)   0.003      1.74 (1.02 to 2.97)   0.043
Methodological quality
  Randomization
    Concealment of allocation:
    inadequate or unstated (reference:
    adequate concealment)               1.57 (1.02 to 2.40)   0.040      1.27 (0.43 to 3.74)   0.661
    Sequence generation: inadequate
    or unstated (reference:
    adequate generation)                1.23 (0.47 to 3.19)   0.671      1.05 (0.29 to 3.76)   0.935
  Blinding: inadequate or unstated
  (reference: adequate blinding)        1.41 (1.02 to 1.94)   0.040      1.42 (0.69 to 2.92)   0.338
  Follow-up: inadequate or unstated
  (reference: adequate follow-up)       0.75 (0.58 to 0.97)   0.031      1.02 (0.62 to 1.66)   0.944

* Multiple logistic regression was used to identify a model with pregnancy as the binary dependent variable. The independent variables included binary variables for treatment group (control or treatment), study design (parallel or crossover), and the quality measures (adequate or inadequate-unclear), and their interaction terms to evaluate their influence on the estimates of treatment effect. Bias was estimated by the exponent of the β coefficient associated with the "bias factor by treatment" interaction term in the model.
† Values in parentheses are 95% confidence intervals.
‡ The adjusted analysis includes all of the above terms as well as parameters for the different meta-analyses and the treatment effects in the different meta-analyses (Hosmer-Lemeshow goodness-of-fit test P = 0.24 on 8 df).
confounding factors by including all the available terms showed that trials with a crossover design yielded higher estimates of treatment effect compared with those with a parallel group design (Table 3). The overestimation in the odds ratio was 74% (95% confidence interval 2% to 197%; P = 0.04). In this model, the biases due to inadequacies of study quality also showed a trend toward higher treatment effects, but the estimation was imprecise, leading to statistically nonsignificant results. In another logistic regression model, in which confounding due to the quality items of sequence generation, blinding, and follow-up was controlled by a single-measure quality score incorporating all three variables, the treatment effect again was overestimated in crossover trials compared with trials with a parallel group design (72% overestimation; 95% confidence interval 1% to 194%; Hosmer-Lemeshow goodness-of-fit P = 0.82 on 8 df). However, the biases in this model from inadequate-unclear concealment of allocation and low methodological quality (40% and 43% overestimation, respectively) were both statistically nonsignificant.

DISCUSSION
In this study, the null hypothesis that study design had no effect on the estimation of treatment effect was rejected in favor of the alternate hypothesis that the use of the crossover design in infertility trials with pregnancy as the outcome influenced the effect size. Crossover trials in which the data were pooled by a simple aggregation of results from the different periods led to a systematically larger estimate of the treatment effect compared with trials with a parallel group design. The odds ratio was overestimated by 60% (95% confidence interval 16% to 119%; P = 0.003) in the unadjusted analysis and by 74% (95% confidence interval 2% to 197%; P = 0.04) in the adjusted analysis, after controlling for differences in the overviews, their treatment effects, and the variation in the quality of trials included in the reviews.

For other items of methodological quality, unadjusted analyses produced a significant overestimation of the odds ratio in trials with inadequacies of concealment and blinding. When the analysis was adjusted for confounding factors, there was a trend toward an exaggeration of the effect size in trials with poor methodological quality. However, this effect was not statistically significant for inadequacies of concealment and blinding in the present study, possibly reflecting low power to detect these differences. These factors were statistically significant in a study published by Schulz et al. (14). The unadjusted analysis for follow-up showed underestimation of the treatment effect, a trend similar to that found by Schulz et al. (14). However, this variable did not have a significant impact in the adjusted analysis.

The hypothesis about the effect of study design was generated a priori in response to the current debate on the issue of choice of study design in infertility research (5-7, 9-12). In this way, the biases associated with post hoc explorations of data were avoided (21). Only those overviews that contained both parallel group and crossover design trials were included in the analysis. The exclusion of meta-analyses that had only crossover or only parallel group trials was necessary to avoid the risk of drawing conclusions about effect sizes in different subgroups on the basis of between-group (i.e., between meta-analyses) differences that could be explained by the different treatments being evaluated (21, 22). Subsequent analysis also took into account the differences between the meta-analyses and the effects of other aspects of study quality. The agreement between reviewers regarding the assessment of study design and methodological quality was high, reducing the possibility of bias due to misclassification. Hence, the observed difference between trials with different research designs is likely to be real.

The implication of this finding for primary research evaluating treatment in the field of infertility is that a crossover trial, analyzed by simple pooling of results from different periods, should be avoided. However, if the design is used, then statistical techniques to account for the period effect and the treatment-by-period interaction should be used to determine if the results are different in the different study periods (4, 23, 24). If the results are heterogeneous, data for each period should be reported separately and the treatment comparison should be based on the first period alone (4). In effect, the results from the first period represent what would have happened in a parallel group trial, but the size of crossover trials is often too small to detect differences between treatments. Because crossover trials in infertility are not large enough to provide a powerful parallel comparison from the first period alone, they run the risk of providing an imprecise estimate of the treatment effect.
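The mechanism behind the overestimation can be illustrated with expected counts rather than a simulation. Assume, purely for illustration, a population that is an even mixture of higher- and lower-fecundity couples and a treatment that triples the per-cycle odds of pregnancy (all parameter values below are hypothetical, not taken from the trials analyzed here). Because treatment removes the more fecund couples from the trial faster, the control cycles contributed in the second period come from a less fecund group, and simple pooling inflates the odds ratio relative to a first-period (parallel group) comparison.

```python
# Expected-count sketch of the pooling bias in a two-period crossover
# trial with a nonreversible outcome. All parameters are hypothetical.

TRUE_OR = 3.0                                        # per-couple treatment odds ratio
P_CONTROL = {"fertile": 0.40, "subfertile": 0.02}    # per-cycle pregnancy probability
MIX = {"fertile": 0.5, "subfertile": 0.5}            # population mixture

def treat_prob(p: float) -> float:
    """Apply the treatment odds ratio to a control-cycle probability."""
    odds = TRUE_OR * p / (1 - p)
    return odds / (1 + odds)

def odds_ratio(preg_t, n_t, preg_c, n_c):
    return (preg_t / (n_t - preg_t)) / (preg_c / (n_c - preg_c))

# Period 1: one couple-equivalent per arm, mixed population.
t1 = sum(MIX[g] * treat_prob(P_CONTROL[g]) for g in MIX)   # treated pregnancies
c1 = sum(MIX[g] * P_CONTROL[g] for g in MIX)               # control pregnancies

# Period 2: only couples who did NOT conceive cross over, so each arm
# inherits a mixture skewed toward the subfertile group.
fail_t = {g: MIX[g] * (1 - treat_prob(P_CONTROL[g])) for g in MIX}
fail_c = {g: MIX[g] * (1 - P_CONTROL[g]) for g in MIX}
n_t2, n_c2 = sum(fail_c.values()), sum(fail_t.values())    # arms swap
t2 = sum(fail_c[g] * treat_prob(P_CONTROL[g]) for g in MIX)
c2 = sum(fail_t[g] * P_CONTROL[g] for g in MIX)

# Note: the first-period (population-averaged) OR is below TRUE_OR
# because of non-collapsibility; the comparison of interest is between
# the first-period OR and the naively pooled OR.
or_first_period = odds_ratio(t1, 1.0, c1, 1.0)             # parallel-group analogue
or_pooled = odds_ratio(t1 + t2, 1.0 + n_t2, c1 + c2, 1.0 + n_c2)

print(f"first-period OR = {or_first_period:.2f}, pooled OR = {or_pooled:.2f}")
assert or_pooled > or_first_period   # simple pooling overestimates the effect
```

With these parameters the first-period odds ratio is about 2.14 while naive pooling of both periods gives about 2.33; greater heterogeneity in fecundity widens the gap.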
If one were to recommend large sample sizes in a crossover trial, then one would lose one of the major statistical advantages of this design. An additional concern is that the statistical tests for interaction are weak, resulting in a lack of power to evaluate whether the treatment effect is different in the different study periods. Thus, an appreciable interaction could be missed in a crossover trial, resulting in an inaccurate conclusion about treatment efficacy.

Another difference between the two study designs concerns the way the research questions are posed. For example, the research question in infertility treatment evaluation can be stated as: "Given that the patient fulfills the eligibility criteria, will treatment produce a better outcome than control?" In a crossover trial, the occurrence of a pregnancy in the first period of the trial will result in the patient being excluded from the second period. Therefore, it can be said that the second period of a crossover trial actually addresses a question that is different from the original study question, i.e., "Given patients who fail to conceive in the first period of the trial, what is the likelihood that they will conceive with treatment in the second period?" Thus, the original question for outcomes that are nonreversible can be addressed only in the first period of a crossover trial or in a parallel group trial. Therefore, to obtain an unbiased estimate of treatment effect in infertility research, a parallel group design is more appropriate than a crossover design.

The demands on health care systems to help childless couples conceive are growing (25). To ensure that the therapeutic measures offered to these couples are efficacious, increasing interest is being placed on clinical decision making that is based on research evidence. The medical community is moving from reliance on traditional, largely qualitative judgements about therapeutic effectiveness to much more structured, systematic, and quantitative approaches (8). However, the strength of this evidence-based approach in making health care recommendations lies in the quality of the primary research. Valid assessment of study quality is essential in providing an unbiased estimate of treatment effect. In infertility research, trial design has been considered in quality assessment, but it has been given less importance than other items such as randomization, blinding, and follow-up (7, 8). This oversight may have been the result of insufficient evidence relating bias to inappropriate choice of trial design. Our study has now established that treatment effect is overestimated when using the crossover design with a simple analysis of its data. Therefore, validity assessment of studies now should include trial design as an important component.
Also, it should be given more weight than other items for which empirical evidence of bias has not been established. Investigators planning clinical trials on treatment efficacy in infertility should avoid the crossover design when pregnancy is the outcome measure. In this way, therapeutic decisions will be based on unbiased evidence of treatment efficacy.

Acknowledgments. The authors thank Edward Hughes, M.B., McMaster University, Hamilton, Ontario, Canada, for help in identifying original manuscripts from the database and Marina R. Quaresma, M.D., McMaster University, Hamilton, Ontario, Canada, for assistance in selecting study material and assessing its methodological quality.
REFERENCES

1. Petrie A. The crossover design. In: Tygstrup N, Lachin JM, Juhl E, editors. The randomized clinical trial and therapeutic decisions. New York: Marcel Dekker Inc., 1982:199-204.
2. Woods JR, Williams JG, Tavel M. The two-period crossover design in medical research. Ann Intern Med 1989;110:560-6.
3. Louis TA, Lavori PW, Bailar JC, Polansky M. Crossover and self-controlled designs in clinical research. N Engl J Med 1984;310:24-31.
4. Hills M, Armitage P. The two-period cross-over clinical trial. Br J Clin Pharmacol 1979;8:7-20.
5. Daya S. Is there a place for the crossover design in infertility trials? Fertil Steril 1993;59:6-7.
6. Vandekerckhove P, O'Donovan PA, Lilford R, Harada TW. Treatment of infertility: from cookery to science. Br J Obstet Gynaecol 1993;100:1005-36.
7. O'Donovan P, Vandekerckhove P, Lilford R, Hughes E. Treatment of male infertility: is it effective? Review and meta-analysis of published randomized controlled trials. Hum Reprod 1993;8:1209-22.
8. Royal Commission on New Reproductive Technologies. New reproductive technologies and the health care system: the case for evidence-based medicine. Volume 11. Ottawa, Ontario, Canada: Ministry of Supply and Services, 1993.
9. Lilford RJ. Treatments of male infertility. Hum Reprod 1994;9:566-7.
10. Mersol-Barg MS, Dlugi AM. Clomid for unexplained infertility? Simple lie or complicated truth [letter]. Fertil Steril 1994;62:896-7.
11. Peek J. Treatments of male infertility. Hum Reprod 1994;9:566-7.
12. Arici A. Clomid for unexplained infertility [letter]. Fertil Steril 1994;62:897.
13. Chalmers TC, Celano P, Sacks HS, Smith H Jr. Bias in treatment assignment in controlled clinical trials. N Engl J Med 1983;309:1358-61.
14. Schulz KF, Chalmers I, Hayes RJ, Altman DG. Empirical evidence of bias: dimensions of methodologic quality associated with estimates of treatment effect in controlled clinical trials. JAMA 1995;273:408-12.
15. Moher D, Jadad AR, Nichol G, Penman M, Tugwell P, Walsh S. Assessing the quality of randomized controlled trials: an annotated bibliography of scales and checklists. Controlled Clin Trials 1995;16:62-73.
16. Jadad AR, Moore RA, Carroll D, Jenkinson C, Reynolds DJM, Gavaghan DJ, et al. Assessing the quality of reports of randomised clinical trials: is blinding necessary? Controlled Clin Trials 1996;17:1-12.
17. Cohen J. Weighted kappa: nominal scale agreement with provision for scaled disagreement or partial credit. Psychol Bull 1968;70:213-20.
18. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977;33:159-74.
19. Dixon WJ. BMDP statistical software release 7. Los Angeles: BMDP Statistical Software Inc., 1992.
20. Hosmer DW, Lemeshow S. Goodness of fit tests for the multiple logistic regression model. Commun Statist Part A Theor Meth 1980;A9:1043-69.
21. Oxman AD, Guyatt GH. A consumer's guide to subgroup analyses. Ann Intern Med 1992;116:78-84.
22. Gillman MW, Runyan DK. Bias in treatment assignment in controlled clinical trials. N Engl J Med 1984;310:1610-1.
23. Jones B, Kenward MG. Design and analysis of cross-over trials. New York: Chapman and Hall, 1989.
24. Senn S. Cross-over trials in clinical research. Sussex, United Kingdom: John Wiley & Sons, 1993.
25. Jones HW, Toner JP. The infertile couple. N Engl J Med 1993;329:1710-5.

Note. A bibliographic listing of the trials included in this study is available from the author upon request.