Paul L. Canner, William F. Krol, and Sandra A. Forman
12
External Quality Control Programs
INTRODUCTION
A primary concern of the Coronary Drug Project (CDP) was to assure the quality of the data being collected. The validity of the reports and results produced and published by the study depended upon the integrity of the data submitted by the clinical centers, the Central Laboratory, the ECG Reading Center, and the Mortality Classification Committee. To this end, procedures were established during the course of the study to monitor the performance of these groups with respect to the quality of the study data reported by them. These procedures generally were directed and conducted by the Coordinating Center and were in addition to the extensive internal monitoring procedures in effect at the Central Laboratory and ECG Reading Center. Although the Drug Procurement and Distribution Center did not generate any study data, an external procedure was developed to verify that the study drugs were being labeled with the proper code number. No external monitoring programs were devised for the CDP Coordinating Center; however, in recent years a number of such programs have been instituted for the Coordinating Centers of other studies. Some of these programs and procedures are described in this chapter. The following sections describe the procedures, both operational and analytical, of these external monitoring programs in the CDP, along with some of their results.
CENTRAL LABORATORY MEASUREMENTS--EXTERNAL LABORATORY SURVEILLANCE PROGRAM
Controlled Clinical Trials 4:441-466 (1983)
© Elsevier Science Publishing Co., Inc. 1983, 52 Vanderbilt Ave., New York, New York 10017

The CDP unit with the most extensive internal quality control procedures was the Central Laboratory located at the Center for Disease Control in Atlanta, Georgia. These internal programs are discussed in Chapter 7. As a supplement to this internal monitoring, the CDP Steering Committee decided to establish an external monitoring program for the Laboratory. This was
known as the External Laboratory Surveillance Program (ELSP). This decision was made late in 1968, almost 3 years after recruitment had begun and just before it was to be completed. The timing of the decision strongly influenced the design of the ELSP, especially with regard to the designation of fictitious patient ID numbers to be used for the submission of the test samples. There were two aspects of the ELSP: an external evaluation of the technical error (i.e., agreement between split samples) of the various laboratory determinations, and an evaluation of long-term trends in the test results. Both parts of the ELSP involved the submission of serum specimens to the Central Laboratory using the ID numbers and other identifying information of deceased patients. It was necessary to use such numbers because (1) they had to be recognized as valid by the Central Laboratory computer and (2) the ELSP data had to be kept distinct from the test results being received for living CDP patients. If the program had been initiated at the beginning of the study, it would have been possible to introduce "dummy" patient ID numbers into the system to be used for this purpose. As it was, although there had been a substantial number of deaths by the time the decision was made to initiate the ELSP, the Central Laboratory had been notified of these deaths so that the ID numbers of these deceased patients could not be used. Thus, before the ELSP could be implemented, the study procedures had to be changed so that the Central Laboratory was no longer notified of deaths and the implementation of the programs had to be delayed until a sufficient number of additional deaths occurred. Thus, the Technical Error Evaluation aspect of the ELSP began in July 1969 and the Secular Trends Evaluation aspect in January 1971.
Technical Error Evaluation

Twelve CDP clinics participated in the Technical Error Evaluation aspect of the ELSP. This program consisted essentially of collecting from volunteer CDP patients twice the normal amount of blood at a given examination and submitting to the Central Laboratory split samples of the serum obtained from these blood specimens. One part of each sample was submitted using identifying information for the volunteer patient and the other part with identifying information for a deceased patient. The two parts of the sample were sent to the Central Laboratory in different shipments, usually a week apart. The second part of the sample was frozen and stored at -20°C until shipped. Since the Central Laboratory computer was set up to receive data from specimens at 4-month intervals, it was necessary to obtain and prepare the specimens for the Technical Error Evaluation aspect in accordance with the Appointment Schedules of the deceased patients whose ID numbers were being used for the program. The Coordinating Center assisted the clinical centers in this by sending a notice to the clinic whenever a new specimen should be submitted for a given deceased patient's ID number. When both of the split samples had been sent to the Central Laboratory, the clinical center submitted a form to the Coordinating Center indicating the ID number and name of the true donor, the ID number of the deceased patient, and the dates of shipment of the two samples to the Central Laboratory.
Upon analysis of the split samples at the Central Laboratory, the results were forwarded to the Coordinating Center under the usual procedures. Whenever results were received under the ID number of a deceased patient, a message was generated when the attempt was made to write these data onto the magnetic disk master data file at the Coordinating Center. Once the results for the split samples were identified and matched, they were placed onto a special tape file for analysis purposes. The differences, di, between the values for the n split sample pairs were calculated electronically and used to compute (1) the mean of the 2n determinations; (2) the between-sample standard deviation, σe = (Σdi²/2n)^1/2, of the n pairs of determinations (the square of which is equivalent to the mean square between split samples in a components of variance model); (3) the mean absolute difference, Σ|di|/n, of the n pairs; (4) the average error, 100 times the ratio of the mean absolute difference to the mean of the 2n determinations; and (5) the coefficient of variation, 100 times the ratio of the standard deviation to the mean of the 2n determinations. The standard deviation, σe, can be interpreted as a measure of closeness of a particular measurement of a laboratory test to the true value for that patient at that time point (i.e., the value in the absence of measurement error). If the measurement errors are approximately normally distributed, we can be 95% confident that the true value will lie within ±1.96σe of the measured value. For example, since the between-sample standard deviation for cholesterol in the CDP was 4.384 mg/dl, we can be 95% confident that a cholesterol measurement was within ±1.96 × 4.384 = ±8.59 mg/dl of the true value. These data were reported periodically to the CDP Data and Safety Monitoring Committee (DSMC) as a part of the semiannual reports prepared by the Coordinating Center.
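Computed from the split-sample pairs, the five statistics above reduce to a few lines of arithmetic. The following sketch follows the definitions in the text; the function name and data are illustrative, not the CDP's.

```python
import math

# Between-sample (technical error) statistics for split-sample pairs,
# per the definitions in the text. Illustrative sketch only.
def split_sample_stats(pairs):
    """pairs: list of (x1, x2) duplicate determinations on one specimen."""
    n = len(pairs)
    diffs = [x1 - x2 for x1, x2 in pairs]
    mean = sum(x1 + x2 for x1, x2 in pairs) / (2 * n)      # mean of the 2n values
    sd_e = math.sqrt(sum(d * d for d in diffs) / (2 * n))  # between-sample S.D.
    mad = sum(abs(d) for d in diffs) / n                   # mean absolute difference
    avg_err = 100 * mad / mean                             # average error (%)
    cv = 100 * sd_e / mean                                 # coefficient of variation (%)
    return mean, sd_e, mad, avg_err, cv

# e.g. split_sample_stats([(200.0, 210.0), (220.0, 215.0)])
```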
Table 1 summarizes the results for various time periods between October 1969 and January 1974 as well as for the total period of data collection. Throughout the program, the results generally were consistent with the similar statistics reported by the Central Laboratory as collected in their internal monitoring programs. Some inflation of error was apparent in the ELSP results, as might be expected since there were additional components of error in the external system due to the handling of the specimens in the clinics, freezing and storing one of the samples for up to 1 week before submission, and separate shipment of the two samples to the Central Laboratory. Some of the additional error may also have resulted from the fact that the ELSP samples were totally "blinded" to the laboratory technicians. The DSMC forwarded these results to the Steering Committee. Throughout the course of the study, neither group thought any study action was warranted on the basis of the reported data.
Comment. In the CDP, the average error of the split sample data ranged from 1.7% to 9.7% and the coefficient of variation ranged from 1.5% to 10.2% for the ten laboratory tests. In both the CDP and other studies the question has been raised as to what constitutes an acceptably low or an alarmingly high level of technical error of a laboratory test. The general practice is to set arbitrary values of 5% or 10% as acceptable levels of technical error, without really knowing whether these levels make any sense.
Table 1  Analysis of Duplicate Specimens Submitted in the External Laboratory Surveillance Program, Total Period (10/69-1/74)a

Tests                                 No.     Mean     S.D.    A.E.    C.V.    I.S.D.P.
Total bilirubin (mg/dl)              1013    0.574    0.031    5.0%    5.3%     0.84%
Direct bilirubin (mg/dl)             1010    0.250    0.025    9.7    10.2      0.98
SGOT (Henry units)                   1025   29.842    1.154    3.9     3.9      0.42
Cholesterol (mg/dl)                  1015  240.118    4.384    1.9     1.8      0.42
Triglyceride (mEq/l)                  335    5.376    0.317    6.3     5.9      0.22
Uric acid (mg/dl)                     333    6.390    0.124    2.1     1.9      0.35
Urea-N (mg/dl)                        328   16.165    0.248    1.7     1.5      0.17
Alkaline phosphatase (K-A units)      332    6.917    0.233    3.5     3.4      0.37
Fasting glucose (mg/dl)               334  101.229    1.835    2.0     1.8      0.35
One-hour glucose (mg/dl)              327  172.415    4.612    2.9     2.7      0.37

aKey: S.D. = standard deviation; A.E. = average error; C.V. = coefficient of variation; I.S.D.P. = increase in standard deviation among patients (see text for definitions). Only the total-period columns are reproducible from this copy; the original table also gives No. and S.D. separately for seven time periods (10/69-9/70, 10/70-9/71, 10/71-3/72, 4/72-9/72, 10/72-2/73, 3/73-8/73, 9/73-1/74).
The following represents an attempt by the authors to provide a more practical basis for evaluating technical error. The total variability among patients of the values of a laboratory test is made up of a component due to the patients themselves and a component due to measurement error. Let σa² denote the variance due to patients alone and, as above, let σe² denote the between-sample variance. The total standard deviation among patients, (σa² + σe²)^1/2, is given elsewhere [4] for the tests included in Table 1. A useful, practical question to ask is how much the standard deviation among patients would be decreased if there were no technical or measurement error; or, vice versa, how much the variability due to patients alone would be increased if there were technical error as well. Since assessment of treatment differences with respect to laboratory values is dependent on the magnitude of the variability among patients, the ability to detect treatment differences could theoretically be hampered by the presence of a large technical error component. This leads to the definition of the quantity

[(σa² + σe²)^1/2 - σa]/σa × 100,
which we term "increase in standard deviation among patients." This new measure of technical error is given in Table 1 for the ten tests. The percent increase in standard deviation among patients ranges from 0.22% to 0.98%, substantially smaller values than the average errors and coefficients of variation. We conclude from this that the technical error associated with laboratory values in the CDP had a negligible impact on the among-patient variability and, hence, on treatment comparisons with respect to these variables. A substantial effort [perhaps about 0.5 full-time equivalent (FTE) over several years] was required by Coordinating Center personnel for the administration, data processing, and data analysis of the ELSP. The number and frequency of specimens submitted for the Technical Error Evaluation aspect in the CDP were, in retrospect, greater than necessary. It is recommended for new studies that 40 or 50 specimens be submitted during the first few months of the trial to obtain an initial assessment of technical error. If the increase in standard deviation among patients for all chemical constituents is found to be less than 1%, it may be satisfactory to repeat this exercise just once or twice more, e.g., at the midpoint and end of the trial. If larger increases in standard deviation among patients are observed, and if the biochemical determinations play an important role in the assessment of treatment efficacy and safety, it may be desirable to carry out the external program for technical error evaluation on a more frequent basis.
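The quantity can be computed directly from the two variance components. In the sketch below, σe = 4.384 mg/dl is the cholesterol value from Table 1, but the among-patient standard deviation of 47.8 mg/dl is a hypothetical value chosen only to illustrate the calculation.

```python
import math

# I.S.D.P. ("increase in standard deviation among patients"), as defined
# in the text: the percent by which technical error inflates the
# among-patient standard deviation.
def isdp(sigma_a, sigma_e):
    return 100 * (math.sqrt(sigma_a ** 2 + sigma_e ** 2) - sigma_a) / sigma_a

# Cholesterol: sigma_e = 4.384 mg/dl (Table 1); sigma_a = 47.8 mg/dl is
# a hypothetical among-patient S.D. used only for illustration.
# isdp(47.8, 4.384)  ->  approximately 0.42
```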
Secular Trends Evaluation

Six clinics participated in the Secular Trends Evaluation aspect of the ELSP. Serum specimens were obtained for this program from professional blood donors who were not participants in the CDP. Every 2 months 500-ml blood samples were obtained from two donors at the Michael Reese Research Foundation Blood Center in Chicago, Illinois, by Dr. Howard Adler of the Chicago Health Research Foundation, who assisted the CDP in this project. The serum
from each donor blood sample (hereafter referred to as "pool") was aliquoted into 72 vials labeled with CDP identifying information (using ID numbers of deceased patients) and distributed to the clinics participating in the program. The vials from one pool went to three clinics and the vials from the other went to the remaining three clinics. The distribution of the vials was made according to a master shipment schedule prepared by the Coordinating Center. The shipments were arranged so that from each pool, one of the three clinics received 32 vials of serum and the remaining two clinics received 20 vials each. This was done in such a way that any given clinic received the larger number of vials every third cycle (i.e., every 6 months). The reason for this arrangement was to allow every third shipment of vials to a clinic to be used for simulated annual visits, which required more vials than the nonannual visits (every third CDP visit was an annual visit). When the vials from a given pool were received by a clinic, one set of vials (four if for a nonannual visit, seven if for an annual visit) was included in the next scheduled shipment of vials to the Central Laboratory. The remaining vials were stored at -40°C and shipped to the Central Laboratory on three separate occasions at 2-month intervals. Thus, serum from a given pool could be used to evaluate trends in the laboratory determinations over a 6-month period. The labels that were affixed to the vials (at the Chicago Health Research Foundation) had been prepared earlier at the clinics in order to maintain the uniformity of each clinic's labeling, and thus to prevent the unblinding of the samples at the Central Laboratory. The labels had to be affixed before the vials were frozen for shipment to the clinics, thus requiring that the labeling be done centrally.
The vial labels were prepared by each clinic according to instructions provided by the Coordinating Center approximately 1 month before the preparation of a new serum pool. These instructions were based upon a master schedule for each clinic giving the deceased patient ID numbers to be used for each serum pool throughout the duration of the program. Table 2 gives an example of this schedule for one of the clinics. These schedules were prepared by the Coordinating Center in order to assure the submission of the aliquots from each serum pool at the required intervals, for the appropriate type of visit (i.e., annual or nonannual), for legitimate (i.e., deceased patient) ID numbers. In spite of the fact that these master schedules were available at the clinics, it was discovered, after the program had been in operation a number of months, that many of the specimens were not being submitted. At that time the Coordinating Center instituted a system whereby the clinics were notified whenever specimens were to be submitted and were asked to confirm to the Coordinating Center that the specimens had been submitted. Following the establishment of this procedure, the number of unsubmitted samples dropped dramatically.

For analysis purposes, the data were first arrayed by computer as in Table 3. Each line gives the four determinations made at 2-month intervals on the given specimen. Let these observations be denoted by wijpr, where i = 0, ..., 17 indexes the month of submission of the first of the four samples from a given serum pool, with i = 0 denoting March 1971 and i = 17 denoting
Table 2  Schedule for Submittal of Secular Trends Serum Samples for Clinic Xa

[Rows: bimonthly submission dates from September 1970 through May 1974. Columns (1)-(8): deceased patient ID numbers. Cell entries: serum pool numbers 1-21. The individual entries are not reproducible from this copy.]

aNumbers in table denote serum pool number. Italicized numbers refer to simulated "annual visit" specimens.

Table 3
Serum Cholesterol Values, External Laboratory Surveillance Program, Secular Trends Evaluation Aspect

[Columns: month of submission of first pool specimen (March 1971, May 1971), serum pool (A or B), clinic number, and cholesterol (mg/dl) at months 0, 2, 4, and 6.a Each line gives the four determinations made at 2-month intervals on a given specimen; the individual values (roughly 207-253 mg/dl) are not reproducible from this copy.b]

aMonth of submission relative to first specimen.
bDashes are for missing values.
January 1974; j = 0,1,2,3 indexes the number of 2-month periods since the submission of the first sample from a given pool; p = 1,2 indexes the specific serum pool obtained at time i; and r = 1, ..., n indexes the specific replication (or participating clinic) for each pool, n = 3 for the 4-monthly laboratory determinations and n = 1 for the annual determinations. The following statistical model may be specified for the observations wijpr for a particular biochemical component:

wijpr = Tip + Σ(k=i to i+j) Sk - Si + Σ(k=0 to j) Dk - D0 + eijpr,

where Tip denotes the true value of the component in the pth serum pool obtained and first analyzed at time i, Sk denotes the amount of change in the value of the specimen due to secular trends in the laboratory determinations between time points k - 1 and k, Dk denotes the amount of change in the value of the specimen due to deterioration of the sample between time points k - 1 and k, and eijpr denotes a component due to random error. The model may be simplified by taking differences of successive determinations made 2 months apart (suppressing the indices p and r for simplicity):

yij = wij - wi,j-1 = Si+j + Dj + (eij - ei,j-1).

It is impossible to obtain unique estimates of the Si+j's and Dj's due to confounding of the deterioration and secular trends components. The best that can be done is to estimate S1 + D1, S2 - S1, ..., S17 - S1, D2 - D1, and D3 - D1. Thus the model takes the form

yij = μ + S′i+j + D′j + e′ij,

where μ = S1 + D1, S′i+j = Si+j - S1, and D′j = Dj - D1. This model may be further modified to allow for missing values. Where the specimen value from 2 months previous may be missing, one can take the difference from the value 4 months previous, or, if that one is missing as well, from the value 6 months previous. The model can finally be written as

yijm = wij - wi,j-1-m = (m + 1)μ + Σ(α=0 to m) S′i+j-α + Σ(α=0 to m) D′j-α + e′ij,

where m denotes the number of immediately previous missing values. The μ, S′i+j, and D′j may be estimated via either the general linear hypothesis approach or the method of multiple regression with zero intercept. For the variable serum cholesterol, with a total of 236 determinations, the following estimates of these parameters were obtained: Ŝ1 + D̂1 = -1.50; D̂′j = 0.64 and 0.52 for j = 2,3; and Ŝ′i = 0.84, 0.32, -4.21, 3.61, 3.16, 5.02, -0.29, 1.16, -0.06, 3.98, 0.20, 4.00, 2.72, -0.47, 2.72, and -1.06 for i = 2, ..., 17. No unique plot of the secular trends in the laboratory determinations is possible due to the inseparability of S1 and D1. One can only assume different values of Ŝ1 and D̂1 subject to the constraint Ŝ1 + D̂1 = -1.50 (for the particular case considered). This is done in Figure 1, where the cumulative estimated cholesterol change from time 0 (March 1971), Σ(t=1 to i) Ŝt, is plotted for i = 1, ..., 17 for different values of Ŝ1. The value Ŝ1 = -1.107 corresponds
to the assumption that Σ(i=1 to 17) Σ(t=1 to i) Ŝt = 0, i.e., that the mean cumulative change in cholesterol value over 17 time periods is zero.

Figure 1  Secular trends in serum cholesterol determinations; results from the CDP External Laboratory Surveillance Program. Ŝ1 denotes the estimated change in cholesterol due to secular trends in the laboratory determinations between March and May 1971; D̂1 denotes the estimated change in cholesterol due to deterioration of the sample during the first two months of freezer storage; Ŝ1 + D̂1 must add to -1.5 (see text). [Curves are shown for several values of Ŝ1, including Ŝ1 = -0.500 (D̂1 = -1.000) and Ŝ1 = -1.107; the plot itself is not reproducible from this copy.]

As seen in Figure 1, with this method it is possible to detect fluctuations over time in the laboratory determinations, but it is not possible with this method alone to distinguish a gradual, consistent upward or downward trend in the laboratory determinations over the entire course of the study from a possible similar effect due to deterioration of the serum specimens over time. The analytical procedure just described was developed following termination of the CDP data collection phase. No satisfactory analysis procedure was available during the study to evaluate the data derived from the Secular Trends Evaluation aspect of the ELSP. Although the procedure described above may be useful in detecting temporal fluctuations in the laboratory determinations, it is the authors' opinion that in many cases such fluctuations can be just as readily detected using the control group data from the trial proper. A method for doing this is described in the next section. The primary goal of an external laboratory surveillance program for evaluating long-term secular trends is generally to identify any gradual upward or downward trends in laboratory determinations over the entire course of the trial, and not just to detect short-term fluctuations.
Comment. The ability to achieve such a goal, and hence the value of such a program, depends heavily on the ability to obtain, via ancillary studies, precise estimates of the deterioration effects; that is, it is necessary to determine the effects of freezing serum specimens at different temperatures for different lengths of time on the measured levels of the various chemical constituents of the serum.
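As a concrete illustration of the zero-intercept multiple regression mentioned above, the following sketch fits the difference model yij = μ + S′i+j + D′j + e′ij to simulated data (no missing values, for simplicity). The design-matrix layout and all numbers here are our own illustration, not the CDP's actual program.

```python
import numpy as np

# Hedged sketch: least-squares fit of y_ij = mu + S'_{i+j} + D'_j + e'_ij
# on simulated data. Parameter names follow the text; S'_1 = D'_1 = 0
# by definition, so only S'_2..S'_17 and D'_2, D'_3 are estimated.
rng = np.random.default_rng(0)
T = 17                                     # secular terms S'_2 ... S'_17
mu_true = -1.5                             # S_1 + D_1
Sp_true = {k: rng.normal(0.0, 2.0) for k in range(2, T + 1)}
Dp_true = {2: 0.64, 3: 0.52}               # deterioration contrasts

rows, y = [], []
for i in range(0, 15):                     # pool first submitted at time i
    for j in (1, 2, 3):                    # successive 2-month differences
        if i + j > T:
            continue
        x = np.zeros(1 + (T - 1) + 2)      # [mu, S'_2..S'_17, D'_2, D'_3]
        x[0] = 1.0
        if i + j >= 2:
            x[1 + (i + j - 2)] = 1.0       # indicator for S'_{i+j}
        if j >= 2:
            x[1 + (T - 1) + (j - 2)] = 1.0 # indicator for D'_j
        rows.append(x)
        y.append(mu_true + Sp_true.get(i + j, 0.0)
                 + Dp_true.get(j, 0.0) + rng.normal(0.0, 0.1))

X, y = np.array(rows), np.array(y)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # zero-intercept regression
mu_hat, Dp2_hat = beta[0], beta[-2]
```

The fitted beta[0] recovers μ, and the trailing coefficients recover the D′ contrasts, up to the simulated noise.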
CENTRAL LABORATORY MEASUREMENTS--CDP DATA
Another approach to the evaluation of long-term secular trends in the Central Laboratory determinations utilized the Central Laboratory data routinely collected on the CDP placebo-treated patients. Considering the R patients in the placebo group, for patient r the value of a laboratory test obtained during the ith calendar time period and at the patient's jth follow-up visit is defined as follows:

wijr = Tr + Σ(k=0 to i) Sk + Σ(k=0 to j) Vk + eijr,

where Tr denotes the hypothetical true value for patient r at baseline (j = 0) and time period i = 0; Sk denotes the amount of change in the value due to secular trends in the laboratory determinations between time periods k - 1 and k, S0 = 0; Vk denotes the amount of change in the value due to aging and other patient-related causes between follow-up visits k - 1 and k, V0 = 0; and eijr denotes a component due to random error. The model may be simplified by taking differences of successive follow-up visits for each patient:

yijr = wijr - wi-1,j-1,r = Si + Vj + (eijr - ei-1,j-1,r).

This can be rewritten as

yijr = μ + S′i + V′j + e′ijr,

where μ = S1 + V1, S′i = Si - S1, and V′j = Vj - V1. If wi-1,j-1,r is missing, the model can be modified so that wi-2,j-2,r or, in general, wi-1-m,j-1-m,r can be used in its place, thus increasing the amount of useful data. The modified model is

yijmr = wijr - wi-1-m,j-1-m,r = (m + 1)μ + Σ(α=0 to m) S′i-α + Σ(α=0 to m) V′j-α + e′ijr,

where m denotes the number of immediately previous missing values. For the case of serum cholesterol and patients in the CDP placebo group, the following estimates of these parameters were obtained: Ŝ1 + V̂1 = -0.80, and Ŝ′i = 10.33, 0.56, -3.54, -0.62, 0, 7.71, 3.92, -2.35, 8.40, 4.77, -6.04, 5.80, -0.91, -2.80, 0.27, 3.66, -1.72, 5.98, -0.49, 0, 6.03, -0.36, and 0 for i = 2, ..., 24. These results were obtained via a multiple regression analysis based on 9000 cholesterol determinations at baseline and the first 15 follow-up visits (i.e., 5 years) from 622 patients, representing a 33% systematic sample of the placebo group and with the restriction that all patients had determinations for at least 5 years. The sampling was done to reduce the cost of the regression analysis. As with the analysis of the ELSP data, no unique plot of the secular trends in the laboratory determinations is possible due to the inseparability of S1 and V1. In Figure 2, Σ(t=1 to i) Ŝt is plotted for i = 1, ..., 24 for Ŝ1 = -1.82. This value of Ŝ1 corresponds to the assumption that Σ(i=1 to 24) Σ(t=1 to i) Ŝt = 0, i.e., the mean cumulative change in cholesterol value over 24 time periods is zero.
Figure 2  Secular trends in serum cholesterol determinations; results from the CDP placebo group data. [Plot of results from the regression analysis and of crude means minus the overall mean, against number of four-month periods since May-August 1966; the plot itself is not reproducible from this copy.]
A much simpler analysis involves the computation of the crude means of all of the cholesterol determinations done in each four-month period. For this analysis, cholesterol values from all follow-up visits (as many as 24 per patient) of placebo-treated patients were used: 42,063 values in all. These means, after subtracting the overall mean from each, are plotted in Figure 2 along with the plot from the more complicated and costly multiple regression analysis. In this example the two plots are remarkably similar. In Figure 3 the results of the external surveillance program (Fig. 1) are plotted together with the results from the regression analysis of the placebo group data (Fig. 2) for the same time period. There is moderate agreement between the two plots; the discrepancies are probably due to the placebo group results reflecting a large number (median = 428) of determinations uniformly distributed over a 4-month period, whereas the external surveillance results reflect a small number (median = 16) of determinations more or less concentrated at points of time at 2-month intervals. Thus the latter results are much more sensitive to short-term variations within each 4-month period.

Comment. It is recommended that in all trials with laboratory determinations the simplest secular trends analysis, i.e., plotting mean values for control group patients by calendar time period, be carried out. The multiple regression analysis described above was developed after termination of the CDP data collection phase, and so far has been applied only to the cholesterol data. It is hoped that this method will be tried in other trials so that a better assessment can be made as to its general utility.
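The crude-means analysis recommended here amounts to grouping control-group values by calendar period and subtracting the overall mean; a minimal sketch (function name and data are illustrative):

```python
from collections import defaultdict

# Sketch of the "simplest secular trends analysis" described in the
# text: crude mean per calendar period, minus the overall mean.
# records: iterable of (period_index, value) pairs.
def crude_period_deviations(records):
    records = list(records)
    by_period = defaultdict(list)
    for period, value in records:
        by_period[period].append(value)
    overall = sum(v for _, v in records) / len(records)
    return {p: sum(vals) / len(vals) - overall
            for p, vals in sorted(by_period.items())}

# e.g. crude_period_deviations([(0, 240), (0, 250), (1, 260), (1, 270)])
```

Plotting these deviations against calendar period gives the dotted curve of Figure 2 directly, with no regression step.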
Figure 3 Secular trends in serum cholesterol determinations; comparison of results from External Laboratory Surveillance Program with results from placebo group regression analysis.
ECG CENTER READINGS For purposes of external surveillance, samples of ECGs that had already been read at the ECG Reading Center (see Chapter 8 for details) were submitted by the Coordinating Center for rereading after a lag of 1 month to 3 years. In general, the procedure consisted of adding one or two lots (60 per lot) of previously read ECGs to a new shipment; lots within a shipment were randomly ordered before being numbered 1 through 16. The coding sheets prepared by computer at the Coordinating Center and submitted with the ECGs included the ECGs for repeat reading. Thus, it was not obvious to the coders that any of the ECGs in the shipment were being resubmitted. This repeat reading program began with Shipment 11 in September 1969 and continued until the end of the study in 1975. One aspect of the program consisted of obtaining one rereading of a large number of different ECGs. This was accomplished for all three types of ECGs, namely, the qualifying, scheduled, and interim-event ECGs. Repeat reading data are available on 120 qualifying, 2616 scheduled, and 235 interim-event ECGs from this part of the program. Another aspect consisted of rereading the same two lots of ECGs repeatedly to detect more efficiently and sensitively any secular trend or systematic bias in coding. These two lots were originally read in August 1969 and were submitted in alternate fashion in subsequent shipments during the period September 1969 to September 1972. Upon receipt of the coding forms at the Coordinating Center, the sheets for the repeat reading lot(s) were separated from the rest of the coding sheets prior to keying the codes in machine-readable form. The results of the repeat
readings were keyed using the same format as that used for the original ECG readings except that a code was included to indicate that these were repeat readings and to specify whether they were the first, second, third, etc., repeat readings. The keyed original readings, besides residing in the CDP master data file on magnetic disk, were copied onto a special file along with the repeat reading data in order to facilitate analysis of these data. The repeatability of readings was analyzed using the four formulas [1] shown in Table 4. This table also illustrates in detail the application of these formulas to the comparison of the two independent readings for Q/QS findings in 2616 scheduled ECGs. Formula 1 deals with disagreements of the type where one reader has coded an abnormality and the other has coded no abnormality. Formula 2 is more severe in that it counts not only disagreements as to presence or absence of a finding but also as to the severity of the finding. For both Formulas 1 and 2 the denominator is the number of ECGs for which at least one reader coded the presence of an abnormality of the particular ECG item. Formulas 3 and 4 are the same as Formulas 1 and 2, respectively, except that the denominator includes all ECGs coded for the particular item. The percentage disagreements are given in Table 5 for the four formulas for 16 ECG items. The repeatability of quantitative measurements for heart rate, QRS axis, R and T wave amplitude, and T-R' interval is shown in Table 6 for the 2616 ECGs for which two independent readings have been made. Measures of repeatability reported in Table 6 include the between-readings standard deviation, coefficient of variation, and increase in standard deviation among patients, all defined as in the preceding section in the context of technical error of laboratory determinations.
One concludes from the last column of this table that the measurement error adds very little to the total variability among ECGs for these variables. Analyses of consistency among seven independent readings of the same lot of 120 scheduled ECGs over a 3-year period are illustrated in Table 7 for four ECG items. The first analysis involves the comparison of the frequency distributions of a particular ECG item over the seven readings to uncover any trends toward more liberal or more conservative coding of the item over time.
Table 4  Formulas Used in Evaluating Repeatability of the ECG Readings in the CDP (Example of Q/QS Findings)

                                     First reading
Second reading         1.0        1.3        1.2        1.1       Total
1.0 No Q/QS         1069 (a)       53         30         18      1170 (n2)
1.3 Minor Q/QS        51        269 (b)       54         12       386
1.2 Moderate Q/QS     32           25       436 (c)      56       549
1.1 Major Q/QS        14            6         31       460 (d)    511
Total               1166 (n1)     353        551        546      2616 (n)

Formula 1 = [(n1 - a) + (n2 - a)]/(n - a) x 100 = 12.8
Formula 2 = (n - a - b - c - d)/(n - a) x 100 = 24.7
Formula 3 = [(n1 - a) + (n2 - a)]/n x 100 = 7.6
Formula 4 = (n - a - b - c - d)/n x 100 = 14.6
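The Table 4 arithmetic can be reproduced directly from the cross-tabulation. The following is an illustrative sketch (not CDP source code) applying the four formulas to the Q/QS example:

```python
# Cross-tabulation of second reading (rows) vs. first reading (columns),
# categories ordered: no, minor, moderate, major Q/QS (Table 4).
table = [
    [1069, 53, 30, 18],
    [51, 269, 54, 12],
    [32, 25, 436, 56],
    [14, 6, 31, 460],
]
n = sum(sum(row) for row in table)           # all ECGs coded for the item
a = table[0][0]                              # both readings: no abnormality
n1 = sum(row[0] for row in table)            # first reading: no abnormality
n2 = sum(table[0])                           # second reading: no abnormality
diag = sum(table[i][i] for i in range(4))    # exact agreements: a + b + c + d

f1 = ((n1 - a) + (n2 - a)) / (n - a) * 100   # presence/absence disagreement
f2 = (n - diag) / (n - a) * 100              # also counts severity disagreement
f3 = ((n1 - a) + (n2 - a)) / n * 100         # denominator: all coded ECGs
f4 = (n - diag) / n * 100
print(round(f1, 1), round(f2, 1), round(f3, 1), round(f4, 1))  # 12.8 24.7 7.6 14.6
```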
Table 5  Repeatability of the Scheduled ECG Tracings: First Reading versus Second Reading

                                                              Percent disagreement by formula^a
Minnesota code and description                  n      n - a      1       2       3       4
1.X    Q/QS patterns^b                        2616     1547     12.8    24.7     7.6    14.6
5.X    T wave findings^b                      2616     1339     14.3    22.9     7.3    11.7
4.X    ST depression^b                        2616      751     28.1    38.5     8.1    11.0
9.2    ST elevation^b                         2616      109     43.1    43.1     1.8     1.8
2.X    QRS axis deviation                     2616      345     20.3    21.7     2.7     2.9
3.X    High amplitude R                       2616      261     23.4    27.2     2.3     2.7
6.X    A-V conduction defect                  2616      143     61.5    61.5     3.4     3.4
7.X    Ventricular conduction defect          2616      300     22.7    34.3     2.6     3.9
8.1.X  Frequent ectopic beats                 2616      157     21.7    30.6     1.3     1.8
8.2-6  Other arrhythmias                      2612       28     28.6    32.1     0.3     0.3
8.7-8  Sinus arrhythmias                      2600       89     13.5    13.5     0.5     0.5
       Any codable abnormality                2616     2380      2.7     2.7     2.5     2.5
Ectopic beat items:
       Supraventricular ectopic beats         2615      273     27.1    28.9     2.8     3.0
       Ventricular ectopic beats              2615      383     11.5    15.4     1.7     2.3
       Runs and bigeminy                      2615      160     86.3    89.4     5.3     5.5
       Multiform ectopic beats                2615       71     46.5    53.5     1.3     1.5

^a n, n - a, and formulas 1, 2, 3, and 4 are defined in Table 4.
^b Maximum findings from three anatomical sites.

The second analysis involves the determination of percent disagreement between successive readings to detect the existence of major coding problems (e.g., due to turnover in personnel) at a particular time point.

Although it is relatively simple to generate data on ECG repeatability such as displayed in Tables 5-7, the interpretation and practical application of such data are not so simple. It may be asked, "What constitutes an acceptably low rate of disagreement between two readings of an ECG item, or what constitutes an alarmingly high rate of disagreement?" An analysis that helps us to assess better the clinical and scientific importance of the ECG repeatability data is shown in Table 8. Incidence of significant worsening of the annual follow-up ECGs compared to the baseline ECG was routinely tabulated in the semiannual reports prepared for the CDP Data and Safety Monitoring Committee as well as reported in two of the CDP publications of treatment effects [2,3]. The definitions of significant worsening for the various ECG items are provided elsewhere [3]. In Column 1 of Table 8 is given, for several ECG items and combinations of items, the percentage of patients in the clofibrate, nicotinic acid, and placebo groups combined showing significant worsening from the baseline to the year 1 ECG. In Column 2, the same criteria for significant worsening have been applied to the 2616 pairs of repeat ECG readings, behaving as if the first reading were a baseline ECG and the second reading were a year 1 ECG for the same patient. By taking the ratio of the percentages given in Column 2 to those in Column 1, we see in Column 3 the magnitude of the effect of coding variability on the incidence of significant
Table 6  Repeatability of the Measured Items on the Scheduled ECG

                Number of    Mean of 2n      Between-readings     Coefficient    Increase in standard
ECG item        pairs, n     measurements    standard deviation   of variation   deviation among patients
Heart rate        2613          69.513             2.159              3.1%               1.8%
QRS axis          2606          16.210            11.842             73.1                4.4
R in lead 2       2607           5.705             0.382              6.7                0.9
R in lead V5      2608          11.976             0.977              8.2                1.7
T in lead 2       2604           0.722             0.434             60.1                6.8
T in lead V5      2605           1.035             0.616             59.5                4.2
T-R' interval      337           5.972             1.029             17.2               13.8
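Given paired readings of the same ECGs, summary statistics of the kind shown in Table 6 can be estimated with the usual duplicate-measurement ("technical error") formulas. The following is a sketch under that assumption, not the CDP code:

```python
import math

def repeatability(pairs):
    """Between-readings repeatability from duplicate measurements (x1, x2).

    Uses the standard duplicate-measurement estimate
    s_e = sqrt(sum(d^2) / (2n)), with the coefficient of variation
    expressed relative to the mean of all 2n measurements.
    """
    n = len(pairs)
    mean = sum(x1 + x2 for x1, x2 in pairs) / (2 * n)
    s_e = math.sqrt(sum((x1 - x2) ** 2 for x1, x2 in pairs) / (2 * n))
    cv = 100.0 * s_e / mean
    return mean, s_e, cv

# Illustrative heart-rate pairs (made-up data, not the CDP readings):
mean, s_e, cv = repeatability([(70, 72), (68, 68), (75, 71)])
```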
Table 7  Repeatability among Seven Independent Readings of the Same Lot of 120 Scheduled ECGs

A. Percentage frequency distributions of Q/QS findings

                First    Second   Third    Fourth   Fifth    Sixth    Seventh
                reading  reading  reading  reading  reading  reading  reading
ECG finding     8/69     9/69     8/70     1/71     4/71     4/72     7/72
No Q/QS          35.8     39.2     35.0     40.0     36.6     34.2     37.5
Minor Q/QS       13.4     16.7     15.8     15.0     16.7     18.3     18.3
Moderate Q/QS    30.8     25.0     25.0     24.2     24.2     25.8     23.3
Major Q/QS       20.0     19.1     24.2     20.8     22.5     21.7     20.8

B. Percentage disagreement between successive readings (denominators in parentheses)^a

ECG finding          1 vs 2      2 vs 3      3 vs 4      4 vs 5      5 vs 6      6 vs 7
Any Q/QS            11.4 (79)    8.9 (79)    8.5 (82)    7.5 (80)    9.0 (78)    5.3 (76)
Any T wave           9.9 (71)   11.4 (70)    8.7 (69)    8.7 (69)    7.2 (69)   14.1 (71)
Any ST depression   18.9 (37)   23.1 (39)   19.5 (41)   16.2 (37)   14.7 (34)   31.4 (35)
Any ST elevation    50.0 (8)    57.1 (7)    33.3 (6)    14.3 (7)    50.0 (8)    33.3 (6)

^a Denominator is the number of ECGs for which at least one of the two readings showed any abnormality; numerator is the number of pairs for which only one reading showed any abnormality.
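The Part B disagreement rate defined in the footnote can be sketched as follows (an illustrative encoding, not CDP code): each reading of an ECG is reduced to True/False for "any abnormality present."

```python
def successive_disagreement(read_a, read_b):
    pairs = list(zip(read_a, read_b))
    at_least_one = [p for p in pairs if p[0] or p[1]]        # denominator
    exactly_one = [p for p in at_least_one if p[0] != p[1]]  # numerator
    return 100.0 * len(exactly_one) / len(at_least_one), len(at_least_one)

# Toy example with four ECGs read twice:
rate, denom = successive_disagreement([True, True, False, False],
                                      [True, False, True, False])
print(round(rate, 1), denom)  # 66.7 3
```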
Table 8  Percentage of Patients with Significant Worsening of ECG from Baseline to Year 1 and from First to Second Reading of Scheduled ECGs

                                              (1)           (2)               (3)
ECG finding                               BL to YR1^a   First to second   (2)/(1) x 100
                                                        reading
A. Q/QS patterns                              7.4            3.0               40.5
B. Q/QS and T                                 1.8            0.2               11.1
C. T-wave findings                            7.2            2.1               29.2
D. ST depression                              4.4            1.9               43.2
E. ST elevation                               1.9            0.9               47.4
F. A-V conduction defect                      0.9            0.9              100.0
G. Ventricular conduction defect              1.5            0.3               20.0
H. Frequent ventricular ectopic beats         2.4            0.4               16.7
I. Other arrhythmias                          0.4            0.1               25.0
J. A or B                                     9.0            3.2               35.6
K. A, B, or C                                14.6            5.0               34.2
L. A-E                                       18.0            7.1               39.4
M. A-I                                       21.9            8.7               39.7
Denominators                                 4424           2616

^a Clofibrate, nicotinic acid, and placebo groups combined.
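The Table 8 arithmetic is simple to reproduce: the share of observed "significant worsening" attributable to coding variability is the first-to-second-reading rate (Column 2) divided by the baseline-to-year-1 rate (Column 1), times 100. An illustrative sketch for three of the items:

```python
# Item: (BL to YR1 %, first-to-second reading %), from Table 8.
rates = {
    "Q/QS patterns": (7.4, 3.0),
    "T-wave findings": (7.2, 2.1),
    "ST depression": (4.4, 1.9),
}
attributable = {item: round(100 * reread / worsen, 1)
                for item, (worsen, reread) in rates.items()}
print(attributable)
# {'Q/QS patterns': 40.5, 'T-wave findings': 29.2, 'ST depression': 43.2}
```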
worsening. Line A of Table 8 indicates a 3.0% incidence of "significant worsening" of Q/QS patterns from the first to the second reading of the same ECG, compared to a 7.4% incidence of significant worsening of the same item found by comparing the true baseline and year 1 ECGs of 4424 CDP patients. Thus Column 3 says that 40.5% of what is called "significant worsening of Q/QS patterns" can be attributed to variability in coding the ECGs. Similar results are seen for many of the other ECG items.

Comment. A very modest effort (perhaps about 0.1 FTE) was required by the CDP Coordinating Center to administer and process data for the repeat reading program for ECGs. The rereading effort also entailed only about a 5% increase in workload for the ECG Reading Center. This program yielded much useful information, particularly of the kind displayed in Table 8. The latter suggests the need either to increase reading precision (perhaps by having each ECG read three or four times rather than twice at the Reading Center), to tighten up the definitions of significant worsening of ECG findings, or to read baseline and follow-up ECGs side-by-side to determine significant changes. All in all, the repeat reading program for ECGs in the CDP was very worthwhile and cost-effective.
MORTALITY CLASSIFICATION COMMITTEE CODING

A Mortality Classification Committee (MCC) was established in the CDP for the purpose of providing a blinded, unbiased assessment of cause of death. Five of the clinical investigators participating in the CDP served as coders on this committee. All of the information submitted from the clinic to the Coordinating Center concerning a death, i.e., Death Form, Cause of Death
Coding Form, death certificate, and autopsy report (if available), was sent to one of the five coders. Any mention in these materials of the patient's study treatment or of signs or symptoms that had a high likelihood of unblinding the coders to the study treatment (such as the mention of enlarged breasts on the autopsy report, a common finding for patients assigned to estrogen treatment) was blanked out prior to submission to the coder. No coder was asked to code deaths occurring in his own clinic. For each death the following items were coded by the MCC: underlying cause of death, immediate cause of death, chronology of the death event in the case of an atherosclerotic coronary heart disease death with recent or acute cardiac event, hepatobiliary pathology on the autopsy, and mention of malignant neoplasia in the autopsy report. The classification systems for underlying and immediate causes of death are published elsewhere [4]. For difficult cases, the coder had the option of requesting that the death be reviewed and coded by the other four members of the Committee as well. Any discrepancies among the five sets of codings were adjudicated by means of the coders conferring together and reaching a consensus.

As a check on both the intra- and interrater reliability of the codings, 10% of the deaths were randomly selected to be returned to the same coder and another 10% to another coder for recoding. The forms and reports for such deaths were included in the same package as the information for deaths submitted for initial coding so as not to alert the coder that these were repeat codings. A total of 212 deaths were submitted to the same coders for recoding. The following percentage rates of agreement were recorded: underlying cause, 94.3%; immediate cause, 88.7%; chronology of death, 86.3%; hepatobiliary pathology, 90.6%; malignant neoplasia, 97.6%.
For the 243 deaths submitted to other coders for recoding, slightly lower rates of agreement were found: underlying cause, 91.8%; immediate cause, 80.7%; chronology of death, 77.8%; hepatobiliary pathology, 86.0%; malignant neoplasia, 96.3%. A comparison of cause of death assignment by the local clinic physician with assignment by the MCC is shown in Table 9. The figures given in this table are for death due to coronary heart disease with recent or acute event, other cardiovascular causes, noncardiovascular causes, and unknown cause, and for the clofibrate, nicotinic acid, and placebo group patients combined. There was an 88.4% (1136/1285) agreement in classification between the clinics and the coders for these broad categories. The MCC tended to assign the cause "coronary heart disease" more frequently than the clinic physicians. Causes such as "arrhythmia" or "ventricular fibrillation" written in by the clinics under "other cardiovascular causes" were usually assigned by the MCC coders to the category of "coronary heart disease with recent or acute event." The same was true for cases of unexpected, unobserved death, which the clinics tended to code as "unknown." Although this was a somewhat arbitrary rule set down by the MCC, it provided a uniform, blinded, unbiased means of coding all such deaths. Central, blinded assessment of cause of death was undertaken relatively late in the course of the study; the first meeting of the Committee took place in July 1971. Due to delays both in initiating and completing coding of deaths, only information provided by the clinics on cause of death was reported in
Table 9  Number of Deaths by Cause, as Reported by Clinics and by MCC Coders; Clofibrate, Nicotinic Acid, and Placebo Groups Combined^a

                        Cause reported by MCC coders
Cause reported
by clinics      Coronary   Other CV   Non-CV   Unknown   Total
Coronary           926        28         3        0        957
Other CV            70       109         6        0        185
Non-CV               2        10        94        3        109
Unknown             21         3         3        7         34
Total             1019       150       106       10       1285
^a Rate of agreement between clinics and MCC coders = 88.4% (1136/1285).

the CDP publications on treatment effects. However, the CDP Research Group satisfied itself that there were no major discrepancies between the findings based on the clinic designations and those based on the MCC codes. Table 10 provides, for broad categories of deaths, clofibrate-placebo and nicotinic acid-placebo comparisons in cause-specific mortality according to the source of designation of cause of death. The lines marked B and C under Source of Designation are based on the same data as shown in Table 9 for the three treatment groups combined. Although the MCC tended to assign the code "coronary heart disease" more frequently than did the clinic physicians, the Z values for the treatment-placebo differences were not substantially different for the two sources of designation of cause. On the other hand, the MCC coded fewer noncardiovascular causes in the two drug groups and more in the placebo group than did the clinics, thereby resulting in a rather large difference between Z values (1.08 vs 0.08 for clofibrate and 1.36 vs 0.55 for nicotinic acid) for the two sources. The lines marked A under Source of Designation in Table 10 give the results previously published in Table 5 of the report on clofibrate and nicotinic acid findings [3]. This latter publication was prepared on the basis of data received by the Coordinating Center through September 30, 1974. As of that date a number of deaths occurring on or prior to the official study termination date of August 31, 1974 had not yet been reported to the clinics or to the Coordinating Center. Also as of that date, a number of the reported causes of deaths occurring during the last month or two of the study were only provisional, pending further information such as hospital report, autopsy report, communication from the family of the deceased concerning the death event, and the like.
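The overall agreement rate quoted for Table 9 is the diagonal of the clinic-by-MCC cross-tabulation over all classified deaths; an illustrative check:

```python
matrix = [  # rows: clinic designation; columns: MCC designation
    [926, 28, 3, 0],   # Coronary
    [70, 109, 6, 0],   # Other CV
    [2, 10, 94, 3],    # Non-CV
    [21, 3, 3, 7],     # Unknown
]
total = sum(sum(row) for row in matrix)
agree = sum(matrix[i][i] for i in range(len(matrix)))
print(agree, total, round(100 * agree / total, 1))  # 1136 1285 88.4
```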
Table 10  Cause-Specific Mortality by Treatment Group and by Source of Designation of Cause

                                            No. (%) with event                          Z
                      Source of     Clofibrate   Nic. acid    Placebo      Clofibrate-  Nic. acid-
Cause of death        designation   (n = 1103)   (n = 1119)   (n = 2789)   placebo      placebo
All causes               A^a        281 (25.5)   273 (24.4)   709 (25.4)      0.04        -0.67
                         B^b        287 (26.0)   277 (24.8)   721 (25.9)      0.11        -0.71
                         C^c        287 (26.0)   277 (24.8)   721 (25.9)      0.11        -0.71
Coronary heart           A          195 (17.7)   203 (18.1)   535 (19.2)     -1.08        -0.75
  disease                B          202 (18.3)   206 (18.4)   549 (19.7)     -0.98        -0.91
                         C          221 (20.0)   215 (19.2)   583 (20.9)     -0.60        -1.18
All cardiovascular       A          241 (21.8)   238 (21.3)   633 (22.7)     -0.57        -0.97
  causes                 B          250 (22.7)   243 (21.7)   649 (23.3)     -0.40        -1.05
                         C          259 (23.5)   250 (22.3)   660 (23.7)     -0.12        -0.88
All noncardiovascular    A           29 (2.6)     30 (2.7)     54 (1.9)       1.35         1.45
  causes                 B           27 (2.4)     29 (2.6)     53 (1.9)       1.08         1.36
                         C           23 (2.1)     26 (2.3)     57 (2.0)       0.08         0.55
Missing, unknown         A           11 (1.0)      5 (0.4)     22 (0.8)
                         B           10 (0.9)      5 (0.4)     19 (0.7)
                         C            5 (0.5)      1 (0.1)      4 (0.1)

^a A = Cause of death designated by clinics; analysis date October 1, 1974 (published in report on clofibrate-nicotinic acid findings [3]).
^b B = Cause of death designated by clinics; analysis date February 1, 1976.
^c C = Cause of death designated by MCC coders; analysis date February 1, 1976.

Thus the final mortality data as represented in the lines marked B in Table 10 differ slightly from the data in the A lines. Since there is always great pressure to publish the final results of a clinical trial as soon as the follow-up period is ended, it is often necessary to be satisfied with data that are almost, but not absolutely, complete in the published report of the treatment group findings.

A slightly different procedure for coding cause of death was followed in the Aspirin Myocardial Infarction Study (AMIS). The information on each death was submitted to two physicians for independent coding of cause of death as well as recording findings from the autopsy report. The responses by the two coders were keyed and compared electronically, and any disagreements printed out. These disagreements were then adjudicated by a meeting of all five of the coders. Committees for blinded, unbiased evaluation of diagnoses of nonfatal events, such as myocardial infarction, angina pectoris, stroke, and others, were also established along with the Mortality Classification Committees in AMIS as well as in other more recent clinical trials.
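The AMIS-style electronic comparison amounts to an item-by-item match of the two coders' keyed responses; a hedged sketch (the field names here are hypothetical, not taken from the AMIS forms):

```python
def disagreements(coder1, coder2):
    """Return the items on which the two coders' keyed responses differ."""
    return {item: (coder1[item], coder2[item])
            for item in coder1 if coder1[item] != coder2[item]}

c1 = {"underlying cause": "CHD", "immediate cause": "arrhythmia"}
c2 = {"underlying cause": "CHD", "immediate cause": "CHF"}
print(disagreements(c1, c2))  # {'immediate cause': ('arrhythmia', 'CHF')}
```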
Comment. In this section we have sought not only to assess the usefulness of the blinded recoding program for cause of death, but also to assess the value of having an MCC at all. The coding of about 2000 deaths consumed a great deal of time of five CDP physicians. Preparation of death forms and autopsy records for submission to the MCC members, blanking out information that might unblind the coders, making requests of the clinics for additional information, and other administrative duties by Coordinating Center staff required a substantial amount of time as well. The high level of agreement between cause of death assigned by the clinic physician and that assigned by the MCC, as shown in Tables 9 and 10, suggests that an MCC may not always be necessary in clinical trials. On the other hand, the MCC of the CDP as well as of most other trials has been established not just to provide uniform assignment of cause of death but also to provide unbiased assignment, uninfluenced by knowledge of the decedent's study treatment. An alternative to both local and MCC assignment of cause of death is to have all of the death certificates coded centrally by a trained nosologist. The experiences of other trials utilizing both a nosologist and an MCC must be documented before reaching a definite recommendation in this regard. A useful approach might be to submit only a sample of death records to an MCC and compare the committee coding with nosologists' coding of the death certificates. If good agreement is found with a sample of records, it may not be necessary to have all of the records coded by the MCC.

STUDY DRUG LABELING
As indicated in Chapter 9 on the Drug Procurement and Distribution Center, a great many checks were carried out by this Center to assure that the bottles of CDP medication were labeled with the correct bottle code number. Nevertheless, it was felt by the CDP Steering Committee that an outside check on the accuracy of bottle numbering should be carried out. Beginning May 1972 the following procedures for external surveillance of the accuracy of the medication bottle codes were carried out: 1. Twice a year, in conjunction with the semiannual drug shipments to the clinics, three randomly selected clinics were asked to send one bottle of each of the 30 codes to the Central Laboratory for identification. Only those bottles received in the current shipment were sent. 2. A person at the Central Laboratory was designated to receive these bottles and to renumber them using random numbers prior to distribution to the analysts. The renumbering was done to prevent bias due to recall of the results of previous analyses.
3. Clofibrate and low-dose estrogen were identified by visual examination of the capsule contents whereas nicotinic acid and placebo (lactose) were identified by chemical procedures. By the time this surveillance was begun, 5.0 mg/day estrogen and dextrothyroxine capsules were no longer being used in the CDP. 4. The results were sent to the Coordinating Center and were reported to the Steering Committee at semiannual intervals. No incidents of incorrect bottle numbering were detected by these procedures.
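The renumbering step (2 above) is essentially a random relabeling with a retained decoding key. A hedged sketch of the idea (the function and variable names are ours, not the Central Laboratory's procedure):

```python
import random

def renumber(bottle_codes, rng=None):
    """Relabel bottles with random numbers so analysts cannot recall
    earlier results; the returned key maps random label -> true code."""
    rng = rng or random.Random()
    labels = list(range(1, len(bottle_codes) + 1))
    rng.shuffle(labels)
    return dict(zip(labels, bottle_codes))

# One bottle of each of 30 codes, as in the CDP procedure (codes invented):
key = renumber([f"code{i:02d}" for i in range(1, 31)], random.Random(1972))
```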
Comment. External monitoring of study drug bottle labeling in the CDP required relatively little effort for potentially large dividends, even though no problems were actually detected by this means in the CDP. Although not done in the CDP (except for one or two special occasions), it might also be worthwhile to engage a laboratory to perform quantitative analyses of the active ingredients of the study medication for each new drug shipment.

CLINIC PERFORMANCE

By far the largest amount of data submitted to the Coordinating Center was generated by the clinical centers. Appropriately, the greatest amount of attention was paid by the Coordinating Center to assuring the quality of these data. A description of the detailed computer editing of these data is given in Chapter 6. Clinic performance was assessed by consideration of the following at periodic intervals:

1. Number of patients enrolled to date and ratio of this number to the number who should have been enrolled to date given the clinic's stated recruitment goal and the proportion of the scheduled recruitment period already completed.
2. Percentage of patients prescribed or reported as taking less than full dosage of the study medication during the most recent follow-up period.
3. Percentage of patients prescribed or reported as taking essentially none of the study medication (although still in active follow-up) during the most recent follow-up period.
4. Percentage of patients who had dropped out of the study, i.e., discontinued not only their study medication but also their scheduled 4-month visits to the clinic.
5. Percentage of scheduled clinic visits missed by active participants.
6. Number of forms and ECGs past due at the Coordinating Center by at least 1 month according to the patients' visit schedules.
7. Percentage of forms received during a recent 3-6-month period with one or more errors detected by the computer edit.
The measures listed above were tabulated by clinic and reported every 6 months to the Steering Committee and to the entire group of clinic investigators. A few additional measures were tabulated and reported once or twice during the study and, ideally, should have been reported at regular intervals as well:
8. Number of patients whose medication code had been broken.
9. Number of patients ever issued incorrect study medication bottles.
10. Percentage of serum specimens found to be hemolyzed, thawed, or of insufficient volume upon receipt by the Central Laboratory.
11. Percentage of poor-quality ECGs submitted to the ECG Reading Center.
Clinics performing poorly with respect to any of these measures were encouraged to better performance by means of visits to the clinics by members of the Steering Committee, small group meetings in conjunction with the semiannual Technical Group meetings, or letters from the Steering Committee. In general, clinics were identified as poor performers by simply selecting the two or three that had the lowest ranks among the 53 clinics for a given measure. Sometimes further criteria were added requiring that a clinic be consistently ranked at the bottom for two or more consecutive reports or for two or more different measures of performance before being considered a poor performer. More recently, a statistical "slippage" test has been developed for detecting clinics whose performance scores are distinctly different from those of the rest of the clinics; the details of this test are reported elsewhere [5,6].

Table 11 presents distributions of three different measures of performance among the 53 CDP clinics. The distribution for Criterion C shows considerably more spread than those for Criteria A and B. When the statistical slippage test mentioned above was applied to the data in this table, no outlier clinics were identified for Criterion A even though the percentages of patients taking 80% or less of maximum dosage ranged from 2.9% in the best clinic to 23.5% in the worst. In other words, the observed distribution was consistent with the hypothesis that all clinics were doing equally well in getting their patients to take the study medication, and so there really were no "best" nor "worst" clinics. One outlier clinic was detected from Criterion B by this statistical test, a clinic with 23.2% of the patients listed as dropouts. For Criterion C the range among the 53 clinics of percentage of forms failing computer edit was 0% (two clinics) to 43.6%.
The nine worst clinics, those having 22.5% or more of their forms failing edit, were all tagged as outlier clinics by the statistical test. The information in Table 11 illustrates the fact that a great deal of knowledge about clinic performance can be obtained at relatively low cost by utilizing the information provided on the data forms submitted by the clinics.

Table 11  Distribution of Percentage of Patients Taking 80% or Less of Maximum Dosage for Most Recent Follow-up Period (Criterion A), Percentage of Dropouts (Criterion B), and Percentage of Forms Failing Computer Edit (Criterion C) among the 53 CDP Clinics, July 1973

                            Number of clinics
Criterion value (%)   Criterion A   Criterion B   Criterion C
<2.5                       0             6             4
2.5-7.4                    6            19            15
7.5-12.4                  14            14            11
12.5-17.4                 20            10             8
17.5-22.4                  9             3             6
22.5-27.4                  4             1             5
>=27.5                     0             0             4
Total                     53            53            53
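A distribution of the Table 11 kind is inexpensive to tabulate from per-clinic percentages; an illustrative sketch (the input values are invented, not the CDP data):

```python
def tabulate(percents):
    """Count clinics falling in each Table 11 criterion-value interval."""
    bins = [(0.0, 2.5), (2.5, 7.5), (7.5, 12.5), (12.5, 17.5),
            (17.5, 22.5), (22.5, 27.5), (27.5, 100.1)]
    return [sum(1 for p in percents if lo <= p < hi) for lo, hi in bins]

print(tabulate([2.9, 8.0, 14.1, 19.9, 23.5, 23.5]))  # [0, 1, 1, 1, 1, 2, 0]
```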
COORDINATING CENTER PERFORMANCE
Although the Coordinating Center of a multicenter clinical trial typically is actively involved in the continual monitoring of the clinical centers, laboratories, reading centers, Drug Distribution Center, and other study units for quality of performance, often there is no group designated to monitor the performance of the Coordinating Center. Such was the case with the CDP. However, in some newer trials external monitoring programs have been set up to evaluate the performance of the Coordinating Center. Two models of this type of program are (1) a Data Quality Control Center and (2) a Coordinating Center Monitoring Committee (CCMC).

For the Persantine-Aspirin Reinfarction Study (PARIS), a Data Quality Control Center was established in the Department of Statistics, University of Chicago, to function as a second data analysis center to monitor the performance of the Coordinating Center. Duplicate copies of certain study forms were sent directly to the Data Quality Control Center from the clinical centers. On the basis of these data, certain analyses by treatment group were prepared and presented by the staff of this center at the Data Monitoring Committee meetings for comparison with the analyses prepared by the Coordinating Center staff. Further details of this operation are published elsewhere [7].

For the National Cooperative Gallstone Study (NCGS), a CCMC was established to meet periodically with the Biostatistical Support Unit (BSU) staff to review their operations and procedures [8-10]. This was a team of three persons having expertise in biostatistics, data processing, and medical science. The team visited the NCGS BSU twice a year over a 5-year period, each visit lasting 2-3 days. Agenda items for these meetings included the following:

1. Review of data entry procedures and direct observation of data entry operation.
2. Review of randomization and treatment assignment procedures.
3. Review of data editing procedures and test of the data editing system by submission of forms completed with intentional errors by members of the CCMC.
4. Review of data base inventory and update systems.
5. Review of documentation of programs, data files, procedures, changes in study forms, changes in definitions, and catastrophes or extraordinary happenings.
6. Review of data and program file backup procedures.
7. Review of BSU procedures for monitoring the performance and quality of data from the clinical centers and the Central Serum, Bile, Radiology, Pathology, and Electron Microscopy Laboratories.
8. Review of data analysis procedures and plans.
9. Review of procedures for quality assurance of data analyses and review of reports to the Data Monitoring Committee for format of tables, consistency of figures among different tables, reasonableness of results, and comprehensiveness of analyses.
10. Audit of a small sample of study records, comparing the data received from the clinics on study forms with computer-generated listings of those data as they appear on the analysis files.

A similar committee has been established to monitor the Coordinating Center of the Coronary Artery Surgery Study [11]. The experiences with the Data Quality Control Center for the PARIS Coordinating Center and the CCMC for the NCGS BSU have both been quite favorable [7,10] and indicate that this type of monitoring of a Coordinating Center is to be recommended for incorporation in new multicenter clinical trials.
SUMMARY OF METHODOLOGICAL PRINCIPLES AND LESSONS
In most situations, it appears to be a good idea to utilize central laboratories, reading facilities, or committees in multiclinic trials to provide more precise, uniform, and unbiased (due to being blind to study treatment assignment) measurement of chemical constituents of serum; grading or coding of ECGs, radiographs, fundus photographs, etc.; morphological evaluation of tissue specimens; and the like. (A possible exception, on the basis of CDP experience, is the coding of cause of death by a committee.) However, even with a central laboratory or reading facility, the quality of measurements or gradings should be evaluated in a blind fashion by external quality control programs. It is important in this way to determine the precision of the measurements, grades, or codes provided by the laboratory or reading center, and to make certain that the measurement error is at an acceptably low level and consistent throughout the duration of the trial. Further, whenever possible, the precision provided by the central laboratory or reading facility should be compared against the precision obtained by providing the measurements, grades, or codes at the local clinics. The amount of gain in precision (if any) provided by a central facility should be weighed against the added cost usually required to run such a facility. The fact that a central facility can provide measurements in a way that is truly blinded to study treatment assignment, whereas this may not always be assured with local measurements, must also be taken into consideration when assessing the value of a central laboratory or reading facility.

The following methodological principles and lessons concerning external quality control programs may be summarized from this chapter:

1. Some external quality control programs require the use of "dummy" patient ID numbers. It is helpful to allow for such when assigning ID numbers from the beginning of the trial.
2. Technical error of laboratory, ECG, and other measurements can most usefully be evaluated if considered in the context of the magnitude of among-patient variability.
3. In order for an external surveillance program for the evaluation of long-term secular trends of serum laboratory determinations to be worthwhile, one must be able to determine precisely the effects of freezing serum specimens for different periods of time on the measured levels of the chemical constituents of the serum.
4. A useful way of interpreting rates of disagreement between two readings of an ECG item is to consider the disagreement in the context of assessing significant worsening of follow-up ECGs compared to baseline.
5. It remains to be determined whether a committee for coding cause of death provides more accurate and precise information than the much less costly nosologist coding of death certificates.
6. A substantial amount of information on performance of the clinical centers can be obtained at relatively low cost by analyzing the data submitted by the clinics on the study forms.
7. Just as the Coordinating Center is responsible for monitoring the performance of the clinical centers, Central Laboratory, and other units, so there should be an external group or committee established to monitor the performance of the Coordinating Center.
REFERENCES

1. Rose GA, Blackburn H: Cardiovascular Survey Methods. Geneva, World Health Organization, 1968
2. Coronary Drug Project Research Group: The Coronary Drug Project: Initial findings leading to modifications of its research protocol. JAMA 214:1303-1313, 1970
3. Coronary Drug Project Research Group: Clofibrate and niacin in coronary heart disease. JAMA 231:360-381, 1975
4. Coronary Drug Project Research Group: The Coronary Drug Project: Design, methods, and baseline results. Circulation 43 (suppl 1):I1-I79, 1973
5. Canner PL, Huang YB, Meinert CL: On the detection of outlier clinics in medical and surgical trials. I. Practical considerations. Controlled Clin Trials 2:231-240, 1981
6. Canner PL, Huang YB, Meinert CL: On the detection of outlier clinics in medical and surgical trials. II. Theoretical considerations. Controlled Clin Trials 2:241-252, 1981
7. Persantine-Aspirin Reinfarction Study Research Group: Persantine-Aspirin Reinfarction Study: Design, methods and baseline results. Circulation 62 (suppl 2):II1-II42, 1980
8. Lachin JM, Marks JW, Schoenfield LJ, the NCGS Protocol Committee, and the National Cooperative Gallstone Study Group: Design and methodological considerations in the National Cooperative Gallstone Study: A multicenter clinical trial. Controlled Clin Trials 2:177-229, 1981
9. Schoenfield LJ, Lachin JM, the Steering Committee, and the National Cooperative Gallstone Study Group: Chenodiol (chenodeoxycholic acid) for dissolution of gallstones: The National Cooperative Gallstone Study. Ann Intern Med 95:257-282, 1981
10. Canner PL, Gatewood LC, White C, Lachin JM, Schoenfield LJ: External monitoring of a data coordinating center: Experience of the National Cooperative Gallstone Study. (In preparation.)
11. Principal Investigators of CASS and Their Associates: National Heart, Lung, and Blood Institute Coronary Artery Surgery Study. Circulation 62 (suppl 1):I1-I81, 1981