CHAPTER
3
Statistical Design of a Performance Test
We wish to remind our audience that this chapter, dealing with the size and logical design of a performance test, and the next chapter, dealing with the reliability of generalizations from test results, are largely statistical in nature. Furthermore, none of the chapters in Part II, which considers experimental methods used in such tests, depends on Chapters 3 and 4. We believe that it will be in the interests of some to proceed to the substantive material in Part II and save the more abstract ideas of Chapters 3 and 4 until later.
As another prefatory note, we point out that although the discussion in this chapter and the next is general to tests of any diagnostic system, we adopt here the terminology appropriate to medical images interpreted by human diagnosticians. Thus we refer to imaging modalities, image readers, and reading tests rather than somewhat more generally to diagnostic systems, observers, and performance tests. This specific terminology is a convenience in coordinating these chapters with Chapter 11. Chapter 11 describes a medical imaging study that we undertook
largely to examine empirically the theoretical statistical issues that we treat here.
The structure of the reading test will depend on four basic factors:
1. the particular questions to be answered by the test,
2. various logistic and cost constraints,
3. the nature of the statistical procedures to be applied, and
4. the degree of statistical power required.
We discuss each of these factors here only in general terms, sufficient to suggest the issues in designing the test. More detailed discussions occur in other chapters.
3.1
QUESTIONS TO BE ANSWERED BY THE READING TEST
We generally assume in this book that the primary question to be answered by the reading test is whether one modality mediates more accurate reading than another. But that question will usually have to be refined, for example, by specifying the conditions under which the comparison is to be made. In laying out the study plan, the fundamental condition (i.e., the diagnostic context) will have been decided upon already, but the need for more refined specification will often be ignored until the reading-test design is begun. Should the comparison be made with or without the readers having access to case background information for each test case? Should the reading be done by the most highly qualified readers or by "typical" readers? Decisions such as whether to add a reading condition or to conduct the test with more than one type of reader will expand the size of the test and thus affect its cost. A complex design involving blocking of the reading trials and counterbalancing of test conditions may be needed to ensure that comparisons across conditions are not biased by reader fatigue or learning.
Where study resources permit, the reading test might be designed to answer one or more of a variety of secondary questions. For example, one might like to know not just whether one modality is better than another, but how much better it is and why. One specific question in this connection is how much of recorded error for a modality is due to unreliability of the reader rather than due to noise in the imaging process. Measuring the reliability of a particular component of a diagnostic system requires repeated application of that component to the same cases. Thus, to measure the reader's reliability, readers must read the same cases at
least twice. Measuring the reliability of the imaging process would require, were it medically appropriate, imaging the patient at least twice and running each of those images through the reading test. Another possible question is whether or not combining information across modalities increases diagnostic accuracy. Answering that question would require imaging the same case in each modality, a rather fundamental requirement affecting how the cases are originally collected and also how the results are statistically analyzed.
3.2
LOGISTIC AND COST CONSTRAINTS
The number of test cases available that fit the various test requirements, for example, for truth and case background data, is often the major logistic constraint on test design. Another important constraint is how much of the reader's time one can demand. The number of hours it is reasonable to spend per session and the number of sessions it is reasonable to schedule without seriously disrupting the reader's other activities are often severely restrictive. Also, if each case is to be read twice by the same reader, the session involving the second reading might have to be separated by, perhaps, several weeks from the session involving the first reading, to reduce the probability that the reader will recognize the case on the second occasion. Recognition of a case previously read might compromise the reader's ability to find new information on the second reading.
The cost of the reading test will also be a major constraint. The researcher will have to weigh the costs versus the benefits of adding a reading-test condition or of increasing the amount of reading to generate more precise results. In many situations, of course, the cost of the reading test will be relatively small in comparison to the cost of the overall study, and to reject moderate expansions that might make the reading test much more informative would be inappropriate. But certainly there will be situations for which the desired design has to be trimmed to save money or fit logistic constraints. One possible approach is to simplify study questions and pare away entire test conditions. This alternative depends wholly on scientific judgment. The arithmetic of such savings is straightforward. Another approach, simply trimming the amount of reading in each condition, is discussed at greater length in Section 3.3. Also described there are approaches to case and reader matching. These approaches, when medically and logistically feasible, can achieve substantial savings in the cost of the reading test.
3.3
GENERAL APPROACH TO STATISTICAL ANALYSIS OF ACCURACY DATA
The general approach we prescribe for testing the statistical significance of a difference in accuracy is the familiar one of computing the critical ratio (C.R.), the ratio of the difference in accuracy between two modalities to the standard error of that difference. Although the approach can be extended to cover tests comparing more than two modalities (Section 4.6), we shall limit this general discussion to the two-modality situation.
We present here a general formula for computing the C.R. that accounts for two commonly varied aspects of reading-test design. The first of these is replicated reading: having each case read by more than one reader or more than once by the same reader. The second aspect covered in the general formula is matching cases and readers across the modalities being compared. Replicated reading and matching are commonly employed in reading studies for the purpose of increasing the "power" of the reading test in the face of shortages of cases. Power is the probability of correctly rejecting the null hypothesis, that is, the probability of finding a real difference between the modalities. Power is always expressed in relation to some acceptably low probability of incorrectly rejecting the null hypothesis. The main control over power is n, the number of cases being read, but when adequately sampled or documented cases are in short supply, as they often are, and when additional power is needed beyond what those limited cases can provide, replication and matching can provide considerable additional power. Replication is, of course, employed for reasons other than boosting power: for example, to increase the generality of findings that would be suspect if based on just one reader, or to provide information of interest in its own right on the variability of reading between and within readers.
3.3.1
General Formula for Computing the Critical Ratio
We present the general formula first in contracted form and then develop an expanded formula for the denominator. In contracted form, C.R. is given by
\[ \mathrm{C.R.} = \frac{\mathrm{Mean}\,\Theta_1 - \mathrm{Mean}\,\Theta_2}{\mathrm{S.E.}_{(\mathrm{diff})}} \qquad \text{(Eq. 1)} \]
The numerator is assumed to be obtained in the following way. A sample of $n$ cases is assembled for each modality and read independently by each of $\ell$ readers on each of $m$ independent occasions. For each reader on each occasion, we obtain some index, call it $\Theta$, of each reader's accuracy on the sample. For each reader we take the mean of $\Theta$ over the $m$ reading occasions, and for each modality we take the mean of those individual reader means over all readers. The difference between those grand means for each modality constitutes the numerator in Eq. 1.
Turning now to the denominator of Eq. 1, $\mathrm{S.E.}_{(\mathrm{diff})}$, we note first that, without certain simplifying assumptions, this expression is rather complex. The problem is that a large part of this expression is the merger of the standard error (S.E.) of $\Theta_1$ and the S.E. of $\Theta_2$, and each of those terms, although we have not shown them, consists of three components of variance: variance due to case sampling, reader sampling, and reader inconsistency in reading the same case on different occasions. The main simplifying assumption we make is that the S.E. of $\Theta_1$ equals the S.E. of $\Theta_2$. This requires two subassumptions: that $\ell$, $m$, and $n$ each be the same in the two modalities, and that the three variance components each be the same in the two modalities. Given those assumptions, $\mathrm{S.E.}_{(\mathrm{diff})}$ is given by:
\[ \mathrm{S.E.}_{(\mathrm{diff})} = 2^{1/2} \left[ S_c^2 (1 - r_c) + \frac{S_{br}^2}{\ell} (1 - r_{br}) + \frac{S_{wr}^2}{\ell m} \right]^{1/2} \qquad \text{(Eq. 2)} \]
where
$S_c^2$ = variability in $\Theta$ due to case sampling; proportional to $1/n$,
$r_c$ = the fraction of $S_c^2$ common to the two modalities; the correlation between cases across modalities,
$S_{br}^2$ = variability in $\Theta$ due to reader sampling,
$r_{br}$ = the fraction of $S_{br}^2$ common to the two modalities; the correlation between readers across modalities,
$S_{wr}^2$ = variability in $\Theta$ due to reader inconsistency.
The details of how to compute $\mathrm{S.E.}_{(\mathrm{diff})}$ are reserved for Chapter 4, where we present a computational formula, and for Chapter 11, where we illustrate its application to data from the mammography study. Here we are concerned mainly with the meaning of each term and how its impact can vary, depending on characteristics of the reading-test design. Our ultimate concern in this section is how the researcher can use that understanding, albeit in conjunction with estimates of these terms, to predict the cost and effectiveness of alternative reading-test designs.
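The structure of Eqs. 1 and 2 can be sketched in code. The following Python fragment is only an illustration and is not the computational formula of Chapter 4: the function and argument names are hypothetical, the variance components are assumed to be supplied as already-estimated numbers, and the values in the usage lines (variance components of 1/6 each, accuracy indices of 0.85 and 0.80, and an S.E. of 0.025) are invented for the example.

```python
import math


def se_diff(s2_c, s2_br, s2_wr, r_c, r_br, n_readers, n_readings):
    """Standard error of the difference, following the form of Eq. 2.

    s2_c, s2_br, s2_wr : case, between-reader, and within-reader variance
    r_c, r_br          : case and reader correlations across modalities
    n_readers          : number of readers per modality (ell in the text)
    n_readings         : number of readings per reader (m in the text)
    """
    variance = (
        s2_c * (1.0 - r_c)
        + s2_br * (1.0 - r_br) / n_readers
        + s2_wr / (n_readers * n_readings)
    )
    return math.sqrt(2.0 * variance)


def critical_ratio(mean_theta_1, mean_theta_2, se):
    """Critical ratio of Eq. 1: difference in mean accuracy over its S.E."""
    return (mean_theta_1 - mean_theta_2) / se


if __name__ == "__main__":
    # Hypothetical inputs: no matching, one reader, one reading.
    se = se_diff(1 / 6, 1 / 6, 1 / 6, r_c=0.0, r_br=0.0,
                 n_readers=1, n_readings=1)
    print("S.E.(diff):", round(se, 4))
    print("C.R.:", round(critical_ratio(0.85, 0.80, 0.025), 2))
```

Keeping the three variance terms as separate arguments makes it easy to see which design choice acts on which term: matching enters only through the two correlations, the number of readers divides the last two terms, and the number of readings divides only the last.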
3.3.2
Case and Reader Matching
As a general rule, the researcher will want to take advantage of case and reader matching to enhance the power of the statistical test. In principle, matching can be accomplished in two ways, either by using the same cases and/or the same readers in the two modalities or by using cases and/or readers which, although different, are matched on some variable relevant to the accuracy of diagnosis. Using the same cases and/or the same readers is the most direct procedure, of course, and in our view is the preferred alternative whenever it is possible. Using the same cases is not possible, for example, when multiple imaging is not medically justified, and using the same readers is not appropriate when in practice the readers of the two modalities have different characteristics, such as different kinds of training or experience. In our opinion, the amount of statistical power that is gained by attempting to match different cases and/or different readers does not usually justify the costs. For one thing, matching different cases (e.g., on size of lesion) and matching different readers (e.g., on level of training or years of experience) bring with them an attrition of available cases and readers, both of which tend to be in short supply. Thus, in the remainder of this discussion we consider explicitly only the relatively high degree of matching that can be achieved with the same cases or the same readers versus no matching at all.
Although use of the same cases can achieve a high degree of matching, observe that a near-perfect degree of matching will be achieved only when the two modalities are highly similar in terms of the correlation of image characteristics with the underlying pathophysiology. One would expect that correlation to be very close, for example, if the modalities being compared were essentially the same except for some small shift in an imaging or processing parameter.
Thus, provided the same cases were used in both modalities, one would expect to achieve near-perfect matching if one were comparing, let us say, the effect of two slightly different algorithms for processing CT images. The same would be true for a comparison of X-ray imaging for two slightly different dose levels, provided all other aspects of the imaging process were carefully controlled. As a counterexample, we observe that CT and RN images would probably be a considerably less than perfect match, because they do not display the same pathophysiology. Near-perfect matching of readers will usually occur only when the same readers are used and the modalities are highly similar so that the reader expertise in the one modality is essentially the same in the other.
Note that there is no restriction that cases be the same or in any way matched to achieve near-perfect reader matching.
We note that when cases are unmatched, the term $r_c$ in Eq. 2 goes to 0, and the correction for communality of variance due to case sampling, $(1 - r_c)$, goes to 1. Thus, the full impact of variability due to case sampling is felt in the computation. Similarly, when readers are unmatched, $r_{br}$ goes to 0, the correction for communality of variance due to reader sampling, $(1 - r_{br})$, goes to 1, and the full impact of variability due to reader sampling is felt. On the other hand, when the correlations are as high as 0.90, say, the corresponding case-sampling and reader-sampling variance terms are reduced by 90%.
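As a worked illustration of these two limiting cases, the substitutions can be written out explicitly. The block below is a sketch that assumes the form of Eq. 2 with its three variance terms; the value 0.90 is the illustrative correlation mentioned above, not an estimate from data.

```latex
% Unmatched cases and readers: r_c = r_{br} = 0
\mathrm{S.E.}_{(\mathrm{diff})}
  = 2^{1/2}\left[ S_c^2 + \frac{S_{br}^2}{\ell} + \frac{S_{wr}^2}{\ell m} \right]^{1/2}

% Near-perfect matching: r_c = r_{br} = 0.90
\mathrm{S.E.}_{(\mathrm{diff})}
  = 2^{1/2}\left[ 0.10\,S_c^2 + \frac{0.10\,S_{br}^2}{\ell} + \frac{S_{wr}^2}{\ell m} \right]^{1/2}
```

Note that the within-reader term is the same in both expressions: matching acts only on the case-sampling and reader-sampling terms, so whatever variability comes from reader inconsistency can be reduced only by replication.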
3.3.3
Effect of Matching and Replication on Statistical Power
The aim in this section is to convey some feeling for the effects of matching and replication on the size of the $\mathrm{S.E.}_{(\mathrm{diff})}$ and, consequently, on the size of the C.R. and the power of the test. For convenience, let us assume that
\[ S_c^2 = S_{br}^2 = S_{wr}^2 = \tfrac{1}{6}. \]
Thus, in the worst situation, with no matching and no replication, $\mathrm{S.E.}_{(\mathrm{diff})} = 1.0$. Let us also assume that near-perfect matching of cases is achieved, and that $r_c = 0.90$. Similarly, let us assume near-perfect reader matching and $r_{br} = 0.90$.
We can examine the combined impact of matching and replication on the size of $\mathrm{S.E.}_{(\mathrm{diff})}$ in Figure 12. There we see $\mathrm{S.E.}_{(\mathrm{diff})}$ as a function of $\ell$ for $m = 1$ and $m = 2$ for two conditions: both cases and readers unmatched, and both cases and readers near perfectly matched. It is clear in this illustration that replication and matching can have significant effects on the $\mathrm{S.E.}_{(\mathrm{diff})}$ and, attendantly, on the power of the statistical test. Replication alone without matching accounts for nearly a 40% decrease in $\mathrm{S.E.}_{(\mathrm{diff})}$, or equivalently, a 1.7-fold increase in the size of the C.R. Replication with near-perfect matching leads to almost a 5-fold increase in the C.R.
Figure 12. Illustration of the theoretical effects of replication and matching on the standard error of the difference. Types and degrees of replication and matching are indicated in the figure. For convenience here, the assumption is made that $S_c^2 = S_{br}^2 = S_{wr}^2 = 1/6$. [Figure not reproduced; horizontal axis: number of readers ($\ell$); curves labeled by case and reader matching.]
To look at these illustrative effects of replication and matching in terms of the power of the statistical test, we first have to set the probability of incorrectly rejecting the null hypothesis, which we shall set at 0.05. Then let us consider the impact on the C.R. for a situation in which, with no matching or replication, the actual difference between the modalities is of a size that would cause the null hypothesis to be correctly rejected only 10% of the time. The 1.7-fold increase in C.R. due to replication alone would increase the power to 24%. The 5-fold increase, due to the added impact of near-perfect matching, would raise the rate of correct rejections to 96%. Looked at another way, if we were to hold power fixed, the 1.7-fold increase in the C.R. due to replication would mean that we could detect a difference 1/1.7 the size we could detect without the benefit of replication. In general, for a $q$-fold decrease in $\mathrm{S.E.}_{(\mathrm{diff})}$ (or increase in the C.R.), we could detect at a fixed level of power a difference $1/q$ as large.
Clearly, the specific properties of functions such as these, their relative height and shape, will vary depending on the particular values of the
variance and correlation terms. Because we set those values arbitrarily for the convenience of this illustration, it would be inappropriate to dwell on effects specific to these functions. We can, however, make certain general observations. The impact of replication will clearly depend on the relative values of the variance terms. If $S_c^2$ is large relative to $S_{br}^2$ and $S_{wr}^2$, replication will have little effect. The impact of matching will similarly depend on the relative size of the variance terms. When variation due to case sampling is relatively large, matching of cases will have a potentially large effect, and the same is true for reader matching. How much of that potential can be realized depends on how successfully the cases or readers can be matched.
One other general observation to make is that increasing the number of readers or readings will always have diminishing returns. How quickly one reaches the point where the returns are negligible will, of course, depend on the situation. We can see that in this illustration very little is to be gained in going beyond about seven readers. Similarly, the effect of increasing the number of readings per reader will depend on the relative size of $S_{wr}^2$, but as a practical matter, $m$ will usually be limited to 2 or 3 at most, so, in any event, one cannot hope to gain a great deal from increasing $m$. Moreover, it can be shown that unless $r_{br} = 1$ (which cannot occur unless the modalities being compared are identical), $\mathrm{S.E.}_{(\mathrm{diff})}$ is always reduced more by a fractional increase in the number of readers than by the same fractional increase in the number of readings per reader. This suggests the principle that for a fixed total number of readings ($R = \ell \times m$) of each case, $\mathrm{S.E.}_{(\mathrm{diff})}$ is smallest if one uses $\ell = R$ readers and $m = 1$ reading per reader. However, as we discuss in Chapter 4, there is a practical need to estimate within-reader variance ($S_{wr}^2$), and this need requires, at least, that some readers read some cases twice.
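The behavior described in this section can be explored numerically. The sketch below assumes the form of Eq. 2 and the illustrative values used above (all three variance components equal to 1/6, correlations of 0.90 under matching and 0 otherwise); the function names and the upper limit of 12 readers are hypothetical choices, not taken from Figure 12. It tabulates the standard error of the difference against the number of readers and checks the principle about splitting a fixed total number of readings.

```python
import math

S2 = 1.0 / 6.0  # illustrative value assumed for all three variance components


def se_diff(s2_c, s2_br, s2_wr, r_c, r_br, n_readers, n_readings):
    """Standard error of the difference, following the form of Eq. 2."""
    variance = (
        s2_c * (1.0 - r_c)
        + s2_br * (1.0 - r_br) / n_readers
        + s2_wr / (n_readers * n_readings)
    )
    return math.sqrt(2.0 * variance)


def curve(matched, n_readings, max_readers=12):
    """S.E.(diff) for 1..max_readers readers, matched (r = 0.90) or not (r = 0)."""
    r = 0.90 if matched else 0.0
    return [
        se_diff(S2, S2, S2, r, r, k, n_readings)
        for k in range(1, max_readers + 1)
    ]


def best_split(total_readings, r_br=0.0):
    """Among (readers, readings) pairs whose product is total_readings,
    return the pair giving the smallest S.E.(diff)."""
    splits = [
        (k, total_readings // k)
        for k in range(1, total_readings + 1)
        if total_readings % k == 0
    ]
    return min(splits, key=lambda s: se_diff(S2, S2, S2, 0.0, r_br, s[0], s[1]))


if __name__ == "__main__":
    for matched in (False, True):
        for m in (1, 2):
            label = "matched" if matched else "unmatched"
            print(label, "m =", m, [round(v, 3) for v in curve(matched, m)])
    print("best split of 6 readings per case:", best_split(6))
```

Because the within-reader term depends only on the product of readers and readings, a fixed total leaves that term unchanged, and the between-reader term then favors the split with the most readers.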
3.4
THE DEGREE OF STATISTICAL POWER REQUIRED
The size of the case sample and the amount of replication required will depend on two main factors: (a) the size of a difference between modalities that the researcher decides would be of practical interest, and (b) the degree of confidence that the researcher considers necessary in concluding that the critical difference has or has not been exceeded. The number of cases, readers, and readings required to achieve this statistical power will in turn depend upon the inherent variability of the data, which
the researcher will have to estimate from prior studies of similar modalities, from a pilot study, or in the first part of a study with an optional termination. How to use such prior data to determine the desired number of cases and readers is illustrated in Chapter 11 (Section 11.9) with data from the mammography study. We can observe (although without suggesting a rule of thumb, because available experience is limited) that (a) quite adequate power was achieved in our CT/RN study with 136 cases (the same cases in the two modalities), six readers in each modality (differing across modalities), and a single reading of each case; and (b) quite inadequate power was achieved in our mammography study with 62 cases (the same cases in the three modalities), three readers in each modality (the same readers in one experimental condition and different readers in another), and a single reading of each case.
The size of a difference of interest will depend wholly on the context and aims of the overall study in which the reading test is undertaken. The required level of confidence in the test results will also depend on the situation: on the benefits of drawing one or the other correct conclusion and on the losses from drawing one or the other incorrect conclusion. The researcher will presumably have considered both of these requirements carefully in early stages of planning the overall study and at this point will have to settle on specific values of difference and confidence in difference on which to set the power of the reading test.
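The link between a difference of practical interest, the required confidence, and the precision that the design must deliver can be sketched numerically. The fragment below is an illustration under stated assumptions, not the procedure of Section 11.9: it assumes a normal approximation to the C.R., a two-sided test in which the far tail is ignored, the form of Eq. 2 for the standard error, and hypothetical function names and input values.

```python
import math
from statistics import NormalDist


def required_se_diff(delta, alpha=0.05, power=0.80):
    """Largest S.E.(diff) at which a true difference `delta` is detected
    with the requested power (normal approximation, far tail ignored)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)
    z_power = z.inv_cdf(power)
    return delta / (z_alpha + z_power)


def se_diff(s2_c, s2_br, s2_wr, r_c, r_br, n_readers, n_readings):
    """Standard error of the difference, following the form of Eq. 2."""
    variance = (
        s2_c * (1.0 - r_c)
        + s2_br * (1.0 - r_br) / n_readers
        + s2_wr / (n_readers * n_readings)
    )
    return math.sqrt(2.0 * variance)


def smallest_reader_count(target_se, s2_c, s2_br, s2_wr, r_c, r_br,
                          n_readings=1, max_readers=50):
    """Smallest number of readers whose S.E.(diff) meets target_se,
    or None if no count up to max_readers is enough."""
    for n_readers in range(1, max_readers + 1):
        se = se_diff(s2_c, s2_br, s2_wr, r_c, r_br, n_readers, n_readings)
        if se <= target_se:
            return n_readers
    return None


if __name__ == "__main__":
    target = required_se_diff(delta=2.0, alpha=0.05, power=0.80)
    print("required S.E.(diff):", round(target, 3))
    print("readers needed:",
          smallest_reader_count(target, 1 / 6, 1 / 6, 1 / 6, 0.0, 0.0))
```

A result of `None` is informative in its own right: it indicates that, for the assumed variance components, the case-sampling term alone keeps the standard error above the target, so more cases or case matching would be needed rather than more readers.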
3.5
SOME PRACTICAL RULES FOR STUDY DESIGN
Although the process of designing the reading test (and, specifically, deciding which factors should be considered first or weighed more heavily than others) will depend on the situation, we suggest the following as a general guide.
The first main consideration is the size of the case sample; the rule is that, independent of power considerations, the case sample should be large enough to ensure credible generalization of the study results to the case population and large enough for valid application of whatever statistical tests one plans to apply. As regards the size of the case sample to ensure credible generality of the reading-test results, that is a matter of scientific judgment which we leave to the investigator. As regards the size of the case sample needed to ensure valid application of particular statistical procedures, we offer two comments. First, we remind the investigator that the procedures recommended in this book are based on large-sample statistics. As a rule of thumb, then, we recommend that a minimum of 50 to 60 cases be read in each modality; otherwise, small-sample statistics should be used. Second, we point the reader to a discussion in Section 4.3 on the minimum size of a case sample for a credible estimate of case correlation when case matching is to be exploited.
The second main consideration, again to ensure credible generalization of the study results, is that the reading test be conducted, as a rule, on at least several readers. Also, when reader matching is to be exploited, a minimum number of readers is needed to obtain a credible estimate of reader correlation (see Section 4.3.2). Again, these considerations are independent of power requirements.
The third consideration is that for a complete analysis of the sources of error in reading performance, at least some of the readers should read at least some cases independently at least twice. As we explain in Chapter 4, if this minimum amount of rereading of cases is not done, then there is no basis for estimating within-reader variance, with the result that the statistical interpretation is weakened and considerable statistical power may be lost.
Finally, when consideration shifts from ensuring validity and completeness of the statistical tests and credible generalization of the results to controlling the power and cost of the test, the trade-offs of several alternative mechanisms must be considered. Probably the most important mechanisms to consider first are case and reader matching, because they can deliver considerable increases in power, often at relatively low cost. If case matching and reader matching are not feasible, or if yet more power is needed, increasing the number of cases should be considered. If cases are in short supply, increasing the number of readers will be the next most important consideration.
If for some reason the number of readers is limited, some gains in power might be achieved by increasing the number of readings, but as a rule that would be the direction of last resort.
An algorithm could presumably be developed for finding the combination of numbers of cases, readers, and readings that would maximize power at a fixed cost, or minimize cost at a fixed degree of power. But rarely will the design options be so unconstrained and the parameters of reading variability and cost be so precisely quantified as to make such an algorithm of more than academic interest. For all practical purposes, we think, the researcher will do better simply to keep in mind these various options for controlling power and to use them as a general guide in designing the reading test. A graphic aid can be helpful, as illustrated in Chapter 11, Figure 24.
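Purely to make the trade-off concrete, such a search can be sketched as a brute-force enumeration. Everything specific in the fragment below is an assumption made for illustration: the cost model (a fixed cost per case plus a cost per individual reading), the parameter `case_var` standing for the constant of proportionality in the relation of case-sampling variance to 1/n, the search limits, and all function names and numerical inputs. It assumes the form of Eq. 2 and minimizes the standard error of the difference, which at a fixed difference and significance level is equivalent to maximizing power.

```python
import math
from itertools import product


def se_diff(s2_c, s2_br, s2_wr, r_c, r_br, n_readers, n_readings):
    """Standard error of the difference, following the form of Eq. 2."""
    variance = (
        s2_c * (1.0 - r_c)
        + s2_br * (1.0 - r_br) / n_readers
        + s2_wr / (n_readers * n_readings)
    )
    return math.sqrt(2.0 * variance)


def best_design_within_budget(budget, case_var, s2_br, s2_wr, r_c, r_br,
                              cost_per_case, cost_per_reading,
                              min_cases=50, max_cases=300, case_step=10,
                              max_readers=12, max_readings=3):
    """Enumerate (cases, readers, readings) and keep the affordable design
    with the smallest S.E.(diff). Returns None if nothing is affordable."""
    best = None
    grid = product(range(min_cases, max_cases + 1, case_step),
                   range(1, max_readers + 1),
                   range(1, max_readings + 1))
    for n_cases, n_readers, n_readings in grid:
        # Hypothetical cost model: collecting each case plus every reading of it.
        cost = n_cases * (cost_per_case + cost_per_reading * n_readers * n_readings)
        if cost > budget:
            continue
        # Case-sampling variance taken as proportional to 1/n.
        se = se_diff(case_var / n_cases, s2_br, s2_wr, r_c, r_br,
                     n_readers, n_readings)
        if best is None or se < best["se"]:
            best = {"se": se, "n_cases": n_cases, "n_readers": n_readers,
                    "n_readings": n_readings, "cost": cost}
    return best


if __name__ == "__main__":
    print(best_design_within_budget(
        budget=10000.0, case_var=10.0, s2_br=1 / 6, s2_wr=1 / 6,
        r_c=0.9, r_br=0.0, cost_per_case=20.0, cost_per_reading=1.0))
```

The default lower limit of 50 cases echoes the rule of thumb given above; in practice the constraints on cases and readers would usually narrow the grid far more than any cost figure.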
3.6
SUMMARY
The structure of a performance test depends on which questions it is to answer, and how precisely it is to answer them, in relation to the various resources available. The investigator will have some control over the numbers of test cases, test readers, and readings per reader. The first two should be large enough to permit credible generalization to their respective populations of interest, large enough to justify application of the statistical tests to be used, and large enough to supply the desired reliability. The number of readings per reader will necessarily be small, but obtaining more than one may be desirable to provide an estimate of within-reader variance. When two or more systems are compared, advantages accrue to using the same cases and the same readers with each system, although such duplication is frequently impossible. In comparing systems, manipulation of the various parameters just mentioned is undertaken to provide a desired level of confidence in a difference of a specified size.