The Problem of the Type II Statistical Error

Robert Mittendorf, MD, DrPH, Veena Arm, and Anna Maria V. Sapugay, MD

Objective: To determine whether type II statistical errors (also known as beta errors) are a common problem in published clinical research.

Methods: Type II statistical errors occur when sample sizes are too small to show an effect of treatment, even when an effect truly exists. Searching the Medline data base, we identified ten meta-analyses published during 1986-1994 in the American Journal of Obstetrics and Gynecology, Obstetrics and Gynecology, and The Journal of Reproductive Medicine. Meta-analyses were used as sources of component or individual studies for the following reason: When small component studies have negative findings that differ from the overall conclusions of the meta-analysis, the component studies may have type II statistical errors.

Results: We found that only 6.5% (15 of 231) of component studies provided any documentation that power calculations to determine sample sizes had been done a priori (before the research began). Thus, many of these component studies with findings of no treatment effect may have had type II errors because of too-small sample sizes. When stratifying the component studies by year of publication, we found that 7.9% (14 of 178) of studies published in the 1980s and 1990s had any documented evidence of a priori power calculations. In the 1960s and 1970s, only one of 53 component studies had documented evidence of power calculations.

Conclusion: To ensure that truly effective treatments are introduced into clinical practice as quickly as possible, we believe that a priori power calculations should always be done in quantitative clinical research. (Obstet Gynecol 1995;86:857-9)

From the Department of Obstetrics and Gynecology, Chicago Lying-in Hospital and Pritzker School of Medicine, The University of Chicago, Chicago, Illinois.
The purpose of this article is to discuss the type II statistical error (also known as beta error) and how its occurrence can impede the timely introduction of effective treatments into clinical practice. Many are familiar with the first of the two statistical errors that can occur in a study, ie, the type I (alpha) error, but fewer know the type II error. Type I errors occur when the results of a properly designed and performed study show a treatment effect but no treatment effect is truly present. By convention, we usually agree to accept a 5% probability (P < .05) of such an occurrence. Type I errors happen because of chance variation. Type II statistical errors occur when an effect truly exists but sample sizes are too small to show it. The implication of the type II error is that a possibly effective treatment can be misjudged to be ineffective.

To help avoid type II errors, power calculations must be done a priori (ie, before studies begin) to determine proper sample sizes. Power is the probability that a study will be able to determine that a treatment is effective when it truly is effective. Mathematically, power is defined as 1 − β, where β is the probability of a type II error. Thus, if a 20% probability for the occurrence of a type II error is acceptable, a study would have 80% (1 − .20 = .80) power. By convention, 80% power is usually considered sufficient for most studies.

As an example of a power calculation for a randomized controlled trial, consider the following formula:

n = [z_α/2 √(2π̄(1 − π̄)) + z_β √(π1(1 − π1) + π2(1 − π2))]² / Δ²

in which n is the minimum number of patients needed in each of the two arms of a trial (treatment arm and placebo control arm), z_α/2 = 1.96 when the probability of a type I error is predetermined to be .05, z_β = .84 when the probability of a type II error is predetermined to be .20, π1 is the probability of a given outcome in the placebo control group, π2 is the probability of the same outcome in the treatment group, π̄ is the average of π1 and π2, and Δ is the absolute difference between π1 and π2.

For example, if it is known that the probability for a given adverse outcome is 20% and it is assumed that a new drug would reduce that outcome by 50%, then π1 = .20, π2 = .10, π̄ = .15, and Δ = .10. Using the preceding formula, the minimum sample size (n) in each arm would be 199. Thus, 398 (2 × 199) subjects would be required to conduct a trial in which there is an 80% probability of finding a treatment effect of the drug when a treatment effect is truly present.

Note that in the formula, if alpha and beta are made smaller (reduced probability of committing a type I or a type II error, respectively), z_α/2 and z_β become larger. Thus, the required sample sizes would also become larger. For example, if it were decided that the probability of a type I error should be α = .01 and the probability of a type II error should be β = .10 (90% power), the required sample size in each of the arms would be 370 and the total number of subjects would be 740 (2 × 370). Likewise, as the difference, Δ, between the treatment group and the placebo group decreases, the sample size must increase. Conversely, if the difference between groups is greater, then the required sample size decreases.
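As a concrete check on the arithmetic, the sample-size calculation can be sketched in a few lines of Python. This is an illustration only, not part of the original article; the function name is ours, and we assume the standard two-proportion (Fleiss-style) formula described in the text:

```python
import math

def sample_size_per_arm(pi1, pi2, z_alpha2=1.96, z_beta=0.84):
    """Minimum subjects per arm to compare proportions pi1 (control)
    and pi2 (treatment), using the two-proportion formula in the text.
    Defaults correspond to alpha = .05 (two-sided) and 80% power."""
    pi_bar = (pi1 + pi2) / 2                      # average of pi1 and pi2
    delta = abs(pi1 - pi2)                        # absolute difference
    numerator = (z_alpha2 * math.sqrt(2 * pi_bar * (1 - pi_bar))
                 + z_beta * math.sqrt(pi1 * (1 - pi1) + pi2 * (1 - pi2))) ** 2
    return math.ceil(numerator / delta ** 2)      # round up to whole subjects

# Worked example from the text: a 20% adverse-outcome rate halved to 10%
n = sample_size_per_arm(0.20, 0.10)  # 199 per arm, 398 subjects in total
```

With α = .05 and 80% power, this reproduces the n = 199 per arm (398 total) given in the text.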

Methods

Searching the Medline data base, we identified meta-analyses on various topics that were published during 1986-1994 in the three most widely read peer-reviewed journals in the specialty of obstetrics and gynecology: the American Journal of Obstetrics and Gynecology, Obstetrics and Gynecology, and The Journal of Reproductive Medicine. The search was limited to meta-analyses because they are sources of component or individual studies in which a summary conclusion from meta-analysis can be compared with the inference drawn from the components. Thus, when the conclusions of a small component study are different from the overall conclusions of the meta-analysis, the component study may have a type II statistical error.

For this report, we identified ten meta-analyses and their component studies (Table 1). Of the 274 component studies, eight could not be recovered and 35 were duplications. Thus, 231 different component studies were reviewed to determine if the original investigators documented whether they had done power calculations or attempted to determine appropriate sample sizes.

Table 1. Review of Ten Meta-Analyses and Their Component Studies

Primary author          Discussion of power by meta-analysis authors    No. of component studies with power calculations / total no. reviewed
Fanning J [1]           Effect size method used to increase power       2/29
Fraser EJ [2]           Referred to lack of power                       1/4
Keirse MJNC [3]         Stated that some sample sizes were small        0/41*†
Mittendorf R [4]        Power discussed                                 2/30
Ohlsson A [5]           Stated that studies lacked power                0/4
Romero R [6]            Discussed high power of studies                 0/17†
Colditz GA [7]          No power discussion                             8/31*
Hunter RW [8]           No power discussion                             2/58
Owen J [9]              No power discussion                             0/4*†
Sillero-Arenas M [10]   No power discussion                             0/13*†
Total                                                                   15/231

* 35 component articles were duplicated: 12 between Keirse et al [3] and Owen et al [9], and 23 between Colditz et al [7] and Sillero-Arenas et al [10]; these are only included once in the table.
† Eight component studies were not recovered: three from Keirse et al [3], two from Owen et al [9], two from Romero et al [6], and one from Sillero-Arenas et al [10]. These are not included in the table.

Results

Six of the meta-analysts [1-6] referred to statistical power in the component studies, and four [7-10] did not. Of those who referred to power, one reported high statistical power in the component studies [6], three noted lack of power in the components [2,3,5], and the other two implied that the component data were inconclusive and recommended that future studies use larger sample sizes to increase power [1,4].

When evaluating the 231 different component studies of the ten meta-analyses, we found that only 6.5% (15 of 231) had documented evidence of a priori power calculations or sample size determinations (Table 1). When stratifying the component studies by year of publication, we found that 7.9% (14 of 178) of the component studies published in the 1980s and 1990s had a priori power calculations or sample size estimations. In the 1960s and 1970s, only one of 53 component studies had documented a priori power calculations. By Fisher exact test, the difference between these two proportions (14 of 178 and one of 53) is not statistically significant (P > .10). However, recent improvements in the percentage of studies documenting a priori power calculations would probably not be shown in this stratification by decades.
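The Fisher exact comparison above (14 of 178 versus one of 53) can be reproduced with a short Python sketch. This is our illustration, not the authors' code; it implements the standard two-sided test by summing the hypergeometric probabilities of all tables no more likely than the observed one:

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test for the 2x2 table [[a, b], [c, d]].
    Sums the probabilities of all tables with the same margins that are
    no more probable than the observed table."""
    n = a + b + c + d
    row1, col1 = a + b, a + c
    def prob(k):  # hypergeometric probability of k successes in row 1
        return comb(col1, k) * comb(n - col1, row1 - k) / comb(n, row1)
    p_obs = prob(a)
    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    return sum(prob(k) for k in range(lo, hi + 1)
               if prob(k) <= p_obs * (1 + 1e-9))

# 14/178 recent studies vs 1/53 older studies with documented power calculations
p_value = fisher_exact_two_sided(14, 164, 1, 52)  # P > .10, as reported
```

The small tolerance factor guards against floating-point ties when deciding which tables count as "at least as extreme."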

Discussion

It is important to remember that the statistical inference drawn from small studies with negative findings may be incorrect. For example, Romero et al [6] reported that four of eight studies in their review had negative findings, but the studies lacked sufficient power (less than 80% power to detect a difference between the case and control subjects) to discover an effect. Their meta-analysis found a significant benefit in treatment. In a meta-analysis by Mittendorf et al [4], the test statistic for the combined data from component studies showed that the treatment was highly effective (P < .007). However, only one [11] of the 31 trials had as much as 80% power (setting α = .05, π1 = .20, and π2 = .10, and using the same assumptions). Six of the trials had less than 20% power. The average power in each of the 31 trials was only 32%. Because of this insufficient power, the introduction into clinical practice of a truly effective treatment was impeded for many years.

Although the use of a priori power calculations may be somewhat more common now than in the past, we have a substantial opportunity to improve. When investigators do sample size determinations, they assure themselves and the specialty that they have greatly reduced the probability of committing a type II statistical error. It is important for our science to be based on data sets of sufficient size, as well as sufficient quality.
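To illustrate how the power of an individual trial can be judged after the fact, the sample-size formula can be inverted to estimate power for a given n. The sketch below is our reconstruction under the same two-proportion assumptions; the article does not state the exact method used to grade the 31 component trials, so this is an assumed approach:

```python
import math

def power_two_proportions(n, pi1, pi2, z_alpha2=1.96):
    """Approximate power of a two-arm trial with n subjects per arm to
    detect a difference between proportions pi1 and pi2, obtained by
    inverting the two-proportion sample-size formula (alpha = .05)."""
    pi_bar = (pi1 + pi2) / 2
    delta = abs(pi1 - pi2)
    z_beta = ((delta * math.sqrt(n)
               - z_alpha2 * math.sqrt(2 * pi_bar * (1 - pi_bar)))
              / math.sqrt(pi1 * (1 - pi1) + pi2 * (1 - pi2)))
    return 0.5 * (1 + math.erf(z_beta / math.sqrt(2)))  # Phi(z_beta)

# A trial sized by the earlier example (199 per arm) should have ~80% power
power = power_two_proportions(199, 0.20, 0.10)
```

Running the same function on much smaller arms (eg, 20 subjects per arm) shows how quickly power collapses, which is the point the Discussion makes about the under-powered component trials.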

References

1. Fanning J, Bennett TZ, Hilgers RD. Meta-analysis of cisplatin, doxorubicin, and cyclophosphamide versus cisplatin and cyclophosphamide chemotherapy of ovarian carcinoma. Obstet Gynecol 1992;80:954-60.
2. Fraser EJ, Grimes DA, Schulz KF. Immunization as therapy for recurrent spontaneous abortion: A review and meta-analysis. Obstet Gynecol 1993;82:854-9.
3. Keirse MJNC. Prostaglandins in preinduction cervical ripening: Meta-analysis of worldwide clinical experience. J Reprod Med 1993;38:89-100.
4. Mittendorf R, Aronson MP, Berry RE, et al. Avoiding serious infections associated with abdominal hysterectomy: A meta-analysis of antibiotic prophylaxis. Am J Obstet Gynecol 1993;169:1119-24.
5. Ohlsson A, Myhr TL. Intrapartum chemoprophylaxis of perinatal group B streptococcal infections: A critical review of randomized controlled trials. Am J Obstet Gynecol 1994;170:910-7.
6. Romero R, Oyarzun E, Mazor M, Sirtori M, Hobbins JC, Bracken M. Meta-analysis of the relationship between asymptomatic bacteriuria and preterm delivery/low birth weight. Obstet Gynecol 1989;73:576-82.
7. Colditz GA, Egan KM, Stampfer MJ. Hormone replacement therapy and risk of breast cancer: Results from epidemiologic studies. Am J Obstet Gynecol 1993;168:1473-9.
8. Hunter RW, Alexander NDE, Soutter WP. Meta-analysis of surgery in advanced ovarian carcinoma: Is maximum cytoreductive surgery an independent determinant of prognosis? Am J Obstet Gynecol 1992;166:504-10.
9. Owen J, Winkler CL, Harris BA, Hauth JC, Smith MC. A randomized, double-blind trial of prostaglandin E2 gel for cervical ripening and meta-analysis. Am J Obstet Gynecol 1991;165:991-5.
10. Sillero-Arenas M, Delgado-Rodriguez M, Rodriguez-Canteras R, Bueno-Cavanillas A, Galvez-Vargas R. Menopausal hormone replacement therapy and breast cancer: A meta-analysis. Obstet Gynecol 1992;79:286-94.
11. Polk BF, Tager IB, Shapiro M, Goren-White B, Goldstein P, Schoenbaum SC. Randomised clinical trial of perioperative cefazolin in preventing infection after hysterectomy. Lancet 1980;i:437-40.

Address reprint requests to:

Robert Mittendorf, MD, DrPH
Department of Obstetrics and Gynecology
Chicago Lying-in Hospital, MC2050
5841 South Maryland Avenue
Chicago, IL 60637

Received May 17, 1995.
Received in revised form June 27, 1995.
Accepted July 22, 1995.

Copyright © 1995 by The American College of Obstetricians and Gynecologists.