Screening, 2 (1993) 187-199 © 1993 Elsevier Science Publishers B.V. All rights reserved 0925-6164/93/$06.00 SCREEN-00053
Neonatal screening for congenital hypothyroidism: analysis of interlaboratory quality control
J.L. Dhondt (a,b), J.P. Farriaux (a,b) and R.J. Pollitt (c)
(a) Centre Régional de Dépistage Néonatal, Lille, France, (b) Association Française pour le Dépistage et la Prévention des Maladies métaboliques et des Handicaps de l’Enfant, Paris, France and (c) Neonatal Screening Laboratory, Sheffield, UK
(Accepted 6 March 1993)
Procedures to control performance of all steps from sampling to follow-up of diagnosed patients are essential for evaluating the effectiveness of neonatal screening programmes. Since 1980, the French programme has conducted regular quality control surveys. Experience with the TSH scheme is analyzed to show various ways in which data may be presented to assist laboratories in explaining analytical deviations and in improving their overall performance. Each method of graphical presentation has its advantages and limitations; cumulative histories are beneficial. In the absence of reference calibrators and of standardisation of quality control material, the actual goal of a Quality Control programme is to position each laboratory with regard to a national consensus, in order: (i) to evaluate the ability of a laboratory to detect, among the general population, the positive case suspected of having the disease; (ii) to compare the analyte concentrations found, which is important for clarifying problems in the classification of patients with elevated TSH.
Key words: Quality control; Dried blood sample; Neonatal screening; Congenital hypothyroidism; TSH
Correspondence to: Professor J.L. Dhondt, Centre Régional de Dépistage Néonatal, Faculté de Médecine, place de Verdun, 59045 Lille, France.
Introduction
In screening, the prime aim of quality control is to evaluate the ability of a laboratory to spot an abnormal sample. In effect, this hinges on the measurement of a ‘marker’ to select from within the population ‘suspect’ subjects for further diagnostic investigations. In France, a programme for the neonatal screening of phenylketonuria and congenital hypothyroidism was placed under the control of the “Association
Française pour le Dépistage et la Prévention des Handicaps de l’Enfant” and financed by the Social Security system (CNAMTS, National Fund for the Medical Insurance of Salaried Workers). The agreement between these two organisations right from the start included the idea of monitoring the performance of participating laboratories in order to detect methodological problems before they affected the efficiency of the programme, especially as there were relatively few laboratories and they often used the same techniques. Amongst these monitoring procedures (national statistics, register of false negatives, etc.) was external quality control, which began in 1980. The object of this work is to compare different modes of presenting statistical analyses of the results of this quality control, with the aim of defining those which most effectively indicate whether the screening programme is under control. Our experience with the programme for hypothyroidism has been chosen as a model because initially all laboratories used the same reagents, and the frequency of the disorder, with 2297 cases detected up to 31st December 1990, gives us precise information on the thyrotropin (TSH) values found in affected babies.
Materials and Methods
The organisation and application of neonatal screening for metabolic disease in France have already been described [1,2].
Screening methods
Two periods must be considered:
(i) From 1980 to 1988, TSH was measured by radioimmunoassay using TSH K-NN® (Oris-CEA, Gif s/Yvette, France) reagents. This is a double antibody method using polyclonal antibodies in the liquid phase.
(ii) From 1989 onwards, to reduce problems encountered with the previous reagents (insufficient robustness, poor limit of detection, only moderate reproducibility), two other techniques were employed:
(a) ELSA TSH NN® (CIS bioindustrie, Gif s/Yvette, France), a sandwich immunoradiometric assay using two monoclonal antibodies, one fixed to a solid support (ELSA paddle), the other labelled with 125I.
(b) DELFIA neonatal h-TSH® (Pharmacia-Wallac, Turku, Finland), a sandwich immunofluorimetric method using two monoclonal antibodies, one fixed in the wells of a microtitre plate, the other labelled with europium.
All results are expressed as mIU TSH/l of blood.
The organisation of quality control
The specimens were prepared from blood from hypothyroid patients or from normal blood supplemented with TSH, spotted onto Macherey Nagel™ paper (product No. MN 818). There were five or six circulations a year, falling into two periods:
(1) From February 1980 to January 1985, each circulation comprised two concentrations, four spots each. Since repeated measurements considerably complicated the statistical analysis, the mean value of the four determinations was used in statistical calculations.
(2) Since February 1985, each circulation has comprised a set of four spots, all of different concentrations. For each sample, a measured value, an interpretation (normal, intermediate, or high) and an action (no follow-up, repeat assay on same sample, request another specimen, immediate recall of patient) are required.
A ministerial committee for quality control (Laboratoire National de Santé) analysed the results, the Association Française guaranteeing the anonymity of participants’ codes.
Data analyzed
The 25 French laboratories participated regularly in the control, with a 98.8% rate of returns. Since 1989, 16 laboratories have used ELSA TSH NN reagents and nine the DELFIA method. For statistical reasons we have included results from foreign participants in the scheme who use the latter method. Table 1 presents the characteristics of the different periods for a total of 58 circulations between 1980 and 1990. Data below are from a circulation in 1991 (specimens 9109-9112) and are used to illustrate the prospective application of certain of the statistical models.
Results and Discussion
Standard statistical analysis
Classical analysis of the results of an interlaboratory control includes for each specimen calculation of the mean, standard deviation (SD) and coefficient of variation (CV). Two methods of presentation are possible: (i) A histogram of results for each specimen (Fig. 1): information content is poor as only gross deviance is noticeable. Insufficient results per specimen means that usually only five or six classes of value can be represented. (ii) The position of each laboratory can be visualised by the relative standard deviation of their individual measurements, each expressed as the deviation from the interlaboratory mean divided by the interlaboratory SD (z-score). This
TABLE 1
Characteristics of the different periods of the French Quality Control programme (see text for details)

Period      Method                   Different TSH levels   No. of QC   Total number
                                     per survey             samples     of results*
1980-1985   TSH KNN®                 2                      59          1396
1985-1988   TSH KNN®                 4                      76          1775
1989-1990   ELSA TSH NN®             4                      36          568
1989-1990   DELFIA neonatal hTSH®    4                      36          487
Total                                                       171         4226

*Excluding data submitted as inexact quantities (‘>’ or ‘<’).
Fig. 1. Histogram of the distribution of TSH values for quality control specimens 9109-9112. Statistical parameters are before (roman type) or after (italics) truncation outside 2 SD. Only one aberrant value (at 80 mIU/l for specimen 9111) is apparent.
method of presentation shows significant deviations in analytical performance compared to that of the group as a whole (Fig. 2). However, the interpretation depends on the definition of target values and of acceptable deviances.
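The classical parameters described in this section (interlaboratory mean, SD, CV and z-score) can be illustrated with a minimal sketch. The reported values below are hypothetical, and the use of the sample SD (n - 1 denominator) is an assumption, since the text does not say which estimator the scheme used.

```python
from statistics import mean, stdev

def interlab_summary(values):
    """Return (mean, SD, CV in %) for the values reported on one specimen."""
    m = mean(values)
    sd = stdev(values)  # sample SD (n - 1 denominator): an assumption
    return m, sd, 100.0 * sd / m

def z_scores(values):
    """Deviation of each result from the interlaboratory mean, in SD units."""
    m, sd, _ = interlab_summary(values)
    return [(v - m) / sd for v in values]

# Hypothetical TSH results (mIU/l) returned by ten laboratories for one specimen
reported = [12.0, 13.0, 13.5, 14.0, 14.0, 14.5, 15.0, 15.0, 16.0, 23.0]

m, sd, cv = interlab_summary(reported)
scores = z_scores(reported)

# Indices of laboratories whose result lies outside 2 SD of the mean
flagged = [i for i, z in enumerate(scores) if abs(z) > 2]

print(f"mean={m:.1f} mIU/l  SD={sd:.2f}  CV={cv:.1f}%  flagged={flagged}")
```

Because the mean and SD are computed from the same results that are being judged, a single gross outlier inflates the SD and pulls every other z-score towards zero, which is one reason the choice of target value matters.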
Target values
The effect of outliers (poor accuracy) may be partly eliminated by using statistical parameters (mean and SD) calculated after truncation of the data, i.e., after eliminating results outside certain thresholds, e.g., ±2 SD or the 5th to 95th centile. This method is generally applied for more precise estimation of the ‘consensus’ value. However, by ‘improving’ the CV (from 23% to 17% for results between 20 and 40 mIU/l), truncation increases the range of relative standard deviation scores achieved for individual samples. In addition, retrospective analysis of the results with or without truncation has shown no significant modification of the mean values, taken as consensus values, with a few easily spotted exceptions. Moreover, this method only takes into account the dispersion of all the results for a given specimen, whereas, with four specimens per circulation, a systematic error (an error of accuracy, often resulting from poor standardisation) in fact affects all four results of a laboratory, which should then all be removed from the statistics. Youden diagrams can demonstrate this type of error, but only if specimens with widely differing target values are provided (Fig. 3).
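A minimal sketch of the truncation step follows, assuming a single pass with a ±2 SD threshold (the text does not say whether truncation was repeated) and using hypothetical values that include one aberrant result at 80 mIU/l.

```python
from statistics import mean, stdev

def truncated_stats(values, k=2.0):
    """Mean and SD before and after removing values outside mean ± k·SD.

    Single-pass truncation; whether the scheme iterated is not stated
    in the text, so this is an assumption.
    """
    m0, sd0 = mean(values), stdev(values)
    kept = [v for v in values if abs(v - m0) <= k * sd0]
    return (m0, sd0), (mean(kept), stdev(kept)), kept

# Hypothetical results (mIU/l) for one specimen, with one aberrant value
reported = [26, 28, 29, 30, 31, 31, 32, 33, 34, 36, 80]

(before_mean, before_sd), (after_mean, after_sd), kept = truncated_stats(reported)

print(f"before: mean={before_mean:.1f} SD={before_sd:.1f}")
print(f"after:  mean={after_mean:.1f} SD={after_sd:.1f} (n={len(kept)})")
```

The smaller SD after truncation is what enlarges the z-scores of the remaining results, which is the side effect noted in the text.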
Fig. 2. Results by laboratory expressed as z-scores for quality control specimens 9109-9112. Each vertical represents the four results of a single laboratory. In this example 11% of the relative standard deviations lie outside 2 SD of the mean.
Fig. 3. Youden diagram for specimens 9109 and 9112 (axes in mIU/l with the corresponding SD scale). ●, probable systematic error: the two values are at -2 SD from the group mean. ○, case of random error (outlier) in one of the measurements.
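The reading of a Youden diagram can be sketched numerically. The rule used here (both results at least 2 SD from the group mean on the same side suggests a systematic error; only one suggests a random error) is an illustrative simplification of the figure, and all values are hypothetical.

```python
from statistics import mean, stdev

def youden_classify(x, y, limit=2.0):
    """Classify each laboratory from its results on two specimens.

    x, y: results of the same laboratories, in the same order, on a low
    and a high specimen. The decision rule is an assumption made for
    illustration, not the scheme's published criterion.
    """
    mx, sx = mean(x), stdev(x)
    my, sy = mean(y), stdev(y)
    labels = []
    for xi, yi in zip(x, y):
        zx, zy = (xi - mx) / sx, (yi - my) / sy
        if abs(zx) >= limit and abs(zy) >= limit and zx * zy > 0:
            labels.append("systematic")   # same-side deviation on both specimens
        elif abs(zx) >= limit or abs(zy) >= limit:
            labels.append("random")       # isolated outlier on one specimen
        else:
            labels.append("ok")
    return labels

# Hypothetical results (mIU/l) from twelve laboratories
low_specimen = [14, 15, 13, 14, 16, 14, 15, 13, 14, 15, 8, 14]
high_specimen = [64, 66, 62, 65, 68, 63, 66, 61, 64, 65, 38, 95]

labels = youden_classify(low_specimen, high_specimen)
print(labels)
```

In this example the eleventh laboratory is low on both specimens, while the twelfth is aberrant on the high specimen only.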
Acceptable deviance
For most QC programmes, the most important source of variability is the method [4,6]. The French programme offered an interesting opportunity to study this, since for the first 8 years a single method was used by all participants. Table 2 shows the distribution of reported values during the different periods. The number of reported values outside 2 SD of the mean was not significantly
TABLE 2
Number of values outside 2 or 3 SD of the mean reported since 1980

Period:           Theoretical %*   1980-1985    1985-1988    1989-1990    1989-1990
                                                             Delfia       Elsa
No. of results:                    1396         1775         487          568
Deviation
≥ 3 SD            0.14%            2 (0.14%)    8 (0.45%)    1 (0.21%)    1 (0.18%)
≥ 2 SD            2.28%            32 (2.29%)   55 (3.10%)   14 (2.87%)   15 (2.64%)
< -2 SD           2.28%            34 (2.44%)   30 (1.69%)   9 (1.69%)    9 (1.58%)
< -3 SD           0.14%            4 (0.29%)    4 (0.23%)    0            0

*From a normal distribution.
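The theoretical percentages in Table 2 are the one-sided tails of a standard normal distribution. As a minimal check, they can be recomputed with the complementary error function and turned into expected counts for a given number of results (here the 1396 results of 1980-1985).

```python
from math import erfc, sqrt

def upper_tail(k):
    """P(Z > k) for a standard normal variable Z."""
    return 0.5 * erfc(k / sqrt(2))

n_results = 1396  # number of results for 1980-1985, from Table 2

# Expected number of results beyond +2 SD and +3 SD under normality
expected = {k: n_results * upper_tail(k) for k in (2, 3)}

for k, count in expected.items():
    print(f"beyond {k} SD: {100 * upper_tail(k):.3f}%  expected count {count:.1f}")
```

The tail beyond 2 SD is about 2.28% and the tail beyond 3 SD about 0.135%, so roughly 32 and 2 results respectively would be expected on each side for this period.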
different from that expected from a normal distribution. The ±2 SD interval seems a reasonable choice for confidence in interlaboratory control.
Cumulative analysis
As in intralaboratory quality control, time series representations provide information on trends only recognizable over long periods of time. In addition, cumulative analysis enables a more precise estimation of the imprecision of the method.
Cumulated performance of laboratories using relative standard deviation
Results expressed as standard deviations relative to the mean are equally useful for judging the cumulative performance of each laboratory (Fig. 4). The analysis for all laboratories since 1980 has shown that changing from reporting the mean of four values for a single specimen to reporting a single determination did not significantly change this distribution. Additional graphic displays can provide further information: to illustrate the historical performance of a laboratory, the graph of z-scores can be associated with that of the cumulative sum of the algebraic (Eqn. 1) or absolute (Eqn. 2) values of the z-scores (Fig. 4).

$$\sum_{r=1}^{i} \left[ z\text{-score}(r) \right] \qquad (1)$$
cumulative (running) sum of algebraic values of z-score; r = number of QC sample.

$$\sum_{r=1}^{i} \left| z\text{-score}(r) \right| \qquad (2)$$
cumulative (running) sum of absolute values of z-score; r = number of QC sample.
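Eqns. 1 and 2 are running sums and can be sketched in a few lines. The two z-score series are hypothetical and serve only to contrast a laboratory with random errors against one with a persistent negative shift.

```python
from itertools import accumulate

def cumulative_sums(z):
    """Running sums of algebraic (Eqn. 1) and absolute (Eqn. 2) z-scores."""
    algebraic = list(accumulate(z))
    absolute = list(accumulate(abs(v) for v in z))
    return algebraic, absolute

# Hypothetical z-score histories for two laboratories
z_lab_a = [0.5, -0.5, 1.0, -1.0, 0.25, -0.25]    # errors of both signs
z_lab_b = [-0.5, -1.0, -0.75, -1.25, -0.5, -1.0]  # persistently negative

alg_a, abs_a = cumulative_sums(z_lab_a)
alg_b, abs_b = cumulative_sums(z_lab_b)

print("A:", alg_a[-1], abs_a[-1])
print("B:", alg_b[-1], abs_b[-1])
```

For the first series the algebraic sum returns to zero, while for the second it drifts steadily away from zero even though no single z-score exceeds 2, the pattern the text associates with a problem of accuracy.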
If errors are solely random, the cumulative sum of their algebraic values will fluctuate around the zero line and the cumulative sum of the absolute values will remain below the mean error level deemed normal (e.g., 1 SD multiplied by number of QC samples analyzed) (Fig. 4, laboratory A). The two cumulative graphs thus show respectively the accuracy and the precision of the analyses. Problems are easily
Fig. 4. Graphs of z-scores, cumulative sum of algebraic and absolute z-score values, of results returned by two laboratories in 1988. For laboratory A (54 specimens), despite two values outside 2 SD, the cumulative variations are not significant. For laboratory B (49 specimens), the cumulative variation is significant and shows an excess of negative scores, suggesting a problem of accuracy.
detected but the rate of participation must be close to 100% to allow interlaboratory comparisons to be made on the profiles thus obtained. A bidimensional representation can also be achieved by placing on the x-axis the cumulative algebraic z-scores (Eqn. 1), which theoretically is little affected by the number of results, and on the y-axis the cumulative sum of squared z-scores divided by the number of results (Eqn. 3). The use of squares, by emphasising z-scores > 1, produces an informative graph (Fig. 5).

$$\frac{1}{n} \sum_{r=1}^{n} \left[ z\text{-score}(r) \right]^{2} \qquad (3)$$
cumulative (running) sum of z-score squared values divided by the number of analyzed QC samples (n).
Cumulated performance of laboratories taking into account the imprecision of the method
A weakness in the preceding representations is that they fail to take into account the variation of the measure with the concentration of the analyte. The distribution of the coefficient of variation as a function of the mean concentration of each specimen is presented in Fig. 6, which combines the imprecision profile of the method with the interlaboratory error.
Fig. 5. Graphs of z-scores and two-dimensional step-by-step representation of the cumulative sum of z-score values squared (y-axis) relative to the cumulative sum of their algebraic values (x-axis). Example of two laboratories having no reported values outside 2 SD, but for lab. B squared values detected a period of systematic shift (arrow).
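The step-by-step trajectory plotted in Fig. 5 can be sketched as a list of (x, y) points, with x the running algebraic sum of z-scores (Eqn. 1) and y the running sum of squared z-scores divided by the number of results so far (Eqn. 3). The z-scores are hypothetical.

```python
def two_dimensional_track(z):
    """Points (cumulative sum of z, cumulative sum of z squared / n)."""
    points = []
    run_sum = 0.0
    run_sq = 0.0
    for n, value in enumerate(z, start=1):
        run_sum += value          # x-axis, Eqn. 1
        run_sq += value * value   # numerator of Eqn. 3
        points.append((run_sum, run_sq / n))
    return points

# Hypothetical z-scores: two small errors followed by a positive shift
z_history = [0.5, -0.5, 1.5, 1.5]
points = two_dimensional_track(z_history)

for x, y in points:
    print(f"x={x:+.2f}  y={y:.3f}")
```

A period of systematic shift shows up as a run of steps moving in one horizontal direction while the vertical coordinate rises, even when every individual z-score stays within 2 SD.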
Fig. 6. Relation between coefficient of variation and mean concentration of TSH (mIU/l) for results obtained from 1980 onwards.
It is possible to take these variations into account by expressing the deviation of results as a difference relative to the mean (Eqn. 4).

$$\text{relative difference} = \frac{\text{reported value} - \text{consensus mean}}{\text{consensus mean}} \qquad (4)$$
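Eqn. 4 and the percentile-based limits discussed in this section can be sketched for a single specimen. This is a simplification: in the scheme the 25th and 75th centiles are accumulated over all circulated samples and drawn as curves against concentration, whereas here they are taken from one hypothetical set of results, and the quartile interpolation method is an assumption.

```python
from statistics import mean, quantiles

def relative_differences(values):
    """Eqn. 4: (reported value - consensus mean) / consensus mean."""
    consensus = mean(values)
    return [(v - consensus) / consensus for v in values]

def acceptable_limits(rel_diffs, factor=2.0):
    """Limits set at `factor` times the 25th and 75th percentiles."""
    q1, _, q3 = quantiles(rel_diffs, n=4, method="inclusive")
    return factor * q1, factor * q3

# Hypothetical results (mIU/l) from nine laboratories for one specimen
reported = [18, 25, 27, 29, 30, 31, 33, 35, 42]

rel = relative_differences(reported)
low, high = acceptable_limits(rel)
outside = [v for v, d in zip(reported, rel) if d < low or d > high]

print(f"limits: {low:+.2f} to {high:+.2f}; outside: {outside}")
```

Because the limits come from the central half of the distribution rather than from the group SD, a few extreme results do not widen them.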
A cumulative picture, as proposed by Röhle et al. [6], can be constructed from the numerical values of the 25th and 75th centiles of the distribution of the relative differences for each sample already circulated (Fig. 7A). The resulting curves delineate the area below and above the mean in which half of the results of a QC sample lie. Twice the vertical distance of the curves from the position of the target value on the zero line can be chosen as constituting acceptable limits for the evaluation area (Fig. 7B). Figure 7C shows the results obtained using this method with the returns for samples 9109-9112, with the identification of more deviant values (20 vs. 12) than with the conventional approach (Fig. 2). This approach circumvents the use of the group standard deviation and allows two-dimensional representation of the distribution of all laboratories (Fig. 7C) or, equally, that of a single laboratory for several specimens (Fig. 7D). The cumulative 25th and 75th percentiles may be updated after each new circulation, thus smoothing occasional deviations.
Fig. 7. Mapping of 25th and 75th percentiles (A) of the distribution of relative differences from the mean as a function of concentration of TSH (1980-1990). Construction of curves delineating acceptable limits (B). Distribution of differences relative to the mean calculated for specimens 9109-9112 for all laboratories (C) and for three of the individual laboratories (D).
Interpretation and decision
The aim of screening being the identification of suspect subjects in a normal population, an element of decision has been added to quality control by requiring that the result be interpreted. This serves to check for agreement with the consensus rather than as a parameter of quality control. For screening for hypothyroidism this consensus is, in France, to accept as normal a result < 30 mIU/l, as intermediate a result between 30 and 50 mIU/l, and as high a result > 50 mIU/l. The example of the circulation of specimens 9109-9112 (Fig. 8) shows a tendency to ‘overestimate’ values below the cut-off, an observation made since the beginning of the programme. The idea of false negatives and false positives is linked with the specificity and sensitivity of the disease marker used. Consequently the temptation to harmonise/rationalise the number of false negatives starting from the results of quality control is unjustified:
- this idea is based on the separation between the results of a normal population and those of hypothyroid patients;
- the analytical variability logically entails a probability of obtaining results on one side or the other of a predetermined cut-off.
In choosing a quality control specimen
Fig. 8. Comparison of interpretations (normal, intermediate, high) given by individual laboratories with the consensus.
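The consensus interpretation bands quoted in this section (normal below 30 mIU/l, intermediate from 30 to 50 mIU/l, high above 50 mIU/l) can be sketched as a small classifier. Two simplifications are assumed: each laboratory's interpretation is derived here from its reported value, whereas in the scheme the interpretation is reported by the laboratory itself, and the handling of values exactly at 30 or 50 mIU/l is a choice made for illustration. The values are hypothetical.

```python
def interpret(tsh, normal_below=30.0, high_above=50.0):
    """Classify a TSH value (mIU/l) into the consensus interpretation bands."""
    if tsh < normal_below:
        return "normal"
    if tsh > high_above:
        return "high"
    return "intermediate"  # boundary values fall here by assumption

def agreement(consensus_value, reported_values):
    """Expected interpretation and the count of each interpretation obtained."""
    expected = interpret(consensus_value)
    counts = {"normal": 0, "intermediate": 0, "high": 0}
    for value in reported_values:
        counts[interpret(value)] += 1
    return expected, counts

# Hypothetical specimen with a consensus value just above the 30 mIU/l cut-off
consensus_value = 31.4
reported = [26.0, 29.5, 31.0, 33.0, 35.5, 52.0]

expected, counts = agreement(consensus_value, reported)
print(expected, counts)
```

For a specimen this close to the cut-off, ordinary analytical variability alone places some results in a different band from the consensus, which is why such discordances cannot by themselves be read as analytical failures.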
for which the 99% confidence limit of the consensus value is just above the cut-off there remains a 0.5% chance of finding an ‘incorrect’ result without, a priori, casting doubt on the analytical method. Only intralaboratory surveillance of histograms of daily measurements is able to reveal this risk.
Bias
The idea of consensus value
Strictly speaking, there is no reference method for dried blood spot TSH. Problems arise from the matrix effect of the paper, variability of results with haematocrit, ‘chromatographic’ effects of diffusion of TSH on the paper, variable elution, and the effects of storage conditions. The absence of reference standards, explicable by these difficulties in production, is a particular handicap in defining accuracy. Preparation of specimens for quality control may involve either the addition of a known quantity of TSH or the use of blood from a hypothyroid patient. In the former case, the difference between supplemented and non-supplemented (base) specimens allows estimation of precision. In the latter, determination of the serum TSH and correction for haematocrit does not provide a secure measure, and reference to the mean observed value then seems the most appropriate course. However, the potential bias inherent in a limited number of participant laboratories has been raised on several occasions, notably that arising from not rejecting all the values from a laboratory when one is outside the limits for truncation.
Distortion due to change of technique
Keeping historical records of QC surveys enables the detection of changes in performance secondary to the introduction of new methods. Although the initial experience comparing the three methods showed identical proportionality between the results with DELFIA or ELSA and those with KNN [3], retrospective analysis of the period 1989-1991 shows a significant difference between the means of the two methods (DELFIA and ELSA). Figure 9 illustrates this difference for specimens with target values close to the cut-off, even though the interlaboratory reproducibility does not differ significantly between the two methods or even from that of the previous method (TSH KNN). This significant difference in means necessitates separation of the two methods for closer analysis. Above all, it may require different cut-offs according to technique. However, that appears unrealistic in a national screening programme; efforts should rather be concentrated on the choice of appropriate calibrators, since kit-to-kit variations are essentially due to differences in the TSH standards used in the assay systems [4]. The impact of the change of method on the global performance of screening has appeared negligible. It must be remembered that in the past few, if any, patients have been diagnosed with an initial TSH between 30 and 50 mIU/l [5]. Conversely, the similarity of the coefficients of variation obtained with the two methods, and thus of the differences relative to the mean, allows continued use of the cumulative data as shown in Fig. 7.
Fig. 9. Comparison of the means (with ± 1 SD) for 15 specimens measured with DELFIA or ELSA reagents. The horizontal line represents the concentration of TSH (30 mIU/l) normally considered as the cut-off value.
Special attention being paid to quality control samples
This is a perpetual problem. The original method of a quality control specimen mimicking perfectly a screening sample taken in a maternity unit has been used in Australia and New Zealand, where the very limited number of laboratories made this approach possible [8]. This method has the additional value of testing the entire process of screening: postal delivery, reception, analysis, and response. Despite precise and repeated instructions, special attention is given to quality control specimens, which explains the reporting of two results instead of one for values well below the recall level, and ambiguities in interpretation compared to the consensus. This might suggest the need to return the filter paper card with the results so that the number of disks punched out can be seen.
Conclusion
Procedures to control performance of all steps from sampling to follow-up of diagnosed patients are essential to evaluate the effectiveness of neonatal screening programmes [7]. This is especially so since, in contrast to the usual medical diagnostic procedure, neonatal screening for metabolic disease relies on a single determination of appropriate analytes. Consequently it is important to verify periodically that current methods meet screening criteria, essentially the ability to separate normal from abnormal specimens with “no false-negative results”. The preoccupation of a quality control scheme is no longer a debate in terms of quality assurance but more over methods of ensuring vigilance and appropriate reaction. The optimum efficacy of these exchanges is linked to the presentation of
the results in a way that enables each laboratory to easily draw the information needed for its everyday function. As long as plots are not produced, one can feel unduly confident that the long-term quality is acceptable. The experience of the French quality control programme for laboratories undertaking neonatal screening for congenital hypothyroidism shows that each of these methods of graphical presentation has its advantages and limitations. The ability to produce a historical record allows deviations which may seem important, but which are in fact acceptable in view of the particular technique (dried blood, etc.), to be seen in perspective. Although quality control material mimics the matrix of unknown samples, no consensus has been reached concerning the use of artificially added or patient’s (endogenous analyte) material. In addition, no consensus exists concerning the ideal concentration with which to monitor screening efficiency. False negative or positive reports from QC samples at or near the cut-off point cannot establish whether or not a laboratory is able to detect an abnormality. To improve the benefit of such a control to individual laboratories, one may perhaps choose the levels of specimens to give at least two extreme values (e.g., 20 and 100 mIU/l) to estimate accuracy and two identical values (e.g., 30 or 50 mIU/l) to estimate reproducibility. At present, most quality control programmes provide a measure of the ability of different laboratories to obtain the same result on the same specimen, but cannot yet provide an overall measure of accuracy.
References
1 Dhondt JL, Farriaux JP. Integration of screening procedures - the French experience. In: Naruse H, Irie M, eds. Neonatal screening. Amsterdam: Excerpta Medica, 1983:438-443.
2 Farriaux JP, Briard ML. Le dépistage néonatal de l’hypothyroïdie: enjeux et résultats. Immunoanal Biol Spec 1992;33:52-58.
3 Ingrand J. Rapport sur les expertises menées à l’instigation de la commission technique à propos des trousses de dosage de la TSH en période néonatale (Elsa et Delfia). La Dépêche 1989;14:38-43.
4 Irie M, Naruse H. Neonatal thyroid screening: organization, quality-control and pitfalls. In: Medeiros-Neto G, Maciel RMB, Halpern A, eds. Iodine deficiency disorders and congenital hypothyroidism. Sao Paulo, Brazil: 1986:203-212.
5 Léger J, Lemerrer M, Briard ML, Czernichow P. L’hypothyroïdie congénitale ou néonatale avec “TSH papier de dépistage” inférieure à 50 µU/ml. Conclusion sur la stratégie du dépistage en France. Arch Fr Pediatr 1987;44:13-16.
6 Röhle G, Voigt U, Kruse R, Torresani T. Result of quality control surveys of radioimmunological determinations of thyrotropin in newborns. J Clin Chem Clin Biochem 1983;21:813-821.
7 Slazyk WE, Hannon WH, Powell MK, Jensen RJ, Myrick JE, Spierto FW. National laboratory performance surveillance program for congenital hypothyroidism, phenylketonuria, and galactosemia. In: Schmidt BJ, Diament AJ, Loghin-Grosso NS, eds. Current trends in infant screening. Amsterdam: Excerpta Medica, 1989:27-30.
8 Webster D, Lyon I. Quality assurance in newborn screening - report from Australasia. In: Schmidt BJ, Diament AJ, Loghin-Grosso NS, eds. Current trends in infant screening. Amsterdam: Excerpta Medica, 1989:7-10.