Journal of Immunological Methods 373 (2011) 200–208
Computational modeling
Statistical considerations for calculation of immunogenicity screening assay cut points

David Hoffman a,⁎, Marion Berger b

a Early Development Biostatistics, Sanofi-Aventis, Bridgewater, NJ, 08807-0890, United States
b Research & CMC Biostatistics, Sanofi-Aventis, Montpellier, France

⁎ Corresponding author. Tel.: +1 908 981 6287. E-mail addresses: david.hoffman@sanofi-aventis.com (D. Hoffman), marion.berger@sanofi-aventis.com (M. Berger).
Article info

Article history: Received 26 April 2011; Received in revised form 9 August 2011; Accepted 23 August 2011; Available online 1 September 2011.

Keywords: Immunogenicity; Anti-drug antibody; Cut point; False positive rate; Prediction limit
Abstract

Most therapeutic proteins induce an unwanted immune response. Antibodies elicited by these therapeutic proteins may significantly alter drug safety and efficacy, highlighting the need for the strategic assessment of immunogenicity at various stages of clinical development. Immunogenicity testing is generally conducted by a multi-tiered approach whereby patient samples are initially screened for the presence of anti-drug antibodies in a screening assay. The screening assay cut point is statistically determined by evaluation of drug-naïve samples and is typically chosen to correspond to a false positive rate of 5%. While various statistical approaches for determination of this screening cut point have been commonly adopted and described in the immunogenicity literature, the performance of these approaches has not been fully evaluated. This paper reviews various statistical approaches for cut point calculation, evaluates the impact of sampling design and variability on the performance of each statistical approach, and highlights the difference between an ‘average’ or ‘confidence-level’ cut point in order to develop more specific recommendations regarding the statistical calculation of immunogenicity screening cut points.

© 2011 Elsevier B.V. All rights reserved.
1. Introduction

The increasing use of biotechnology-derived proteins as therapeutic agents has highlighted the need for the strategic assessment of immunogenicity at various stages of clinical development. Most therapeutic proteins induce an unwanted immune response which may be triggered by numerous patient-, disease-, product-, or process-related factors (EMA, 2007; Jahn and Schneider, 2009). The potential consequences of such an immune reaction to a therapeutic protein range from the benign transient appearance of antibodies to life-threatening conditions (EMA, 2007; FDA, 2009; Shankar et al., 2006). Clinical consequences may include, among others, loss of efficacy, altered pharmacokinetics, administration reactions,
and anaphylaxis (EMA, 2007; Shankar et al., 2006, 2008). Accordingly, the evaluation of potential immunogenicity has been a recent focus of regulatory concern (FDA, 2009; EMA, 2007, 2009; Shankar et al., 2006).

The evaluation of clinical and nonclinical immunogenicity is generally performed via detection and characterization of anti-drug antibodies. A number of different analytical formats are available for the detection of anti-drug antibodies, including direct or bridging enzyme-linked immunosorbent assays (ELISA), radioimmunoprecipitation assays (RIPA), surface plasmon resonance (SPR), and electrochemiluminescence assays (ECL) (FDA, 2009). While each format has relative advantages and disadvantages, consideration should be given to product characteristics, potential co-medications, disease-specific issues, and epitope exposure when selecting a format (Mire-Sluis et al., 2004). Regardless of the chosen format, the assay must be validated for its intended purpose. Such validation generally includes assessment of linearity, accuracy, precision, specificity, selectivity, stability, detection and/or quantification
limits, robustness, and system suitability (FDA, 1999; Mire-Sluis et al., 2004; ICH, 1996).

Immunogenicity testing is generally conducted by a multi-tiered approach whereby patient samples are initially screened for the presence of anti-drug antibodies in a screening assay (Koren et al., 2008). Samples testing positive for the presence of anti-drug antibodies in the screening assay are subsequently analyzed in a confirmatory assay which characterizes the specificity of the binding response to the drug. Samples confirmed for the presence of anti-drug antibodies are then typically analyzed in a neutralizing antibody (NAB) assay to assess the neutralizing capacity of the anti-drug antibodies. The screening assay is intended to provide a rapid and sensitive initial assessment of the samples, while the confirmatory and neutralizing antibody assays are generally more labor-intensive and time-consuming.

In this multi-tiered approach to immunogenicity assessment, a key consideration is the determination of a screening assay cut point. The screening assay cut point is the level of response of the screening assay at or above which a sample is defined to be positive for the presence of anti-drug antibodies and below which it is defined to be negative. This cut point should be statistically determined by evaluating samples deemed to be representative of the drug-naïve target subject/patient population (i.e. negative control samples) (FDA, 2009). Typically, the cut point is chosen to correspond to a false positive rate of 5% (FDA, 2009). This is intended to control the number of unnecessary confirmatory assays while providing some assurance that the false negative rate will be small. However, it should be noted that the false negative rate cannot be determined from the false positive rate and can only be estimated from evaluation of samples deemed to be representative of the target population of patients/subjects with drug-induced immune response (i.e. positive control samples).

Several statistical approaches for determination of the screening cut point are commonly used and have been described in the immunogenicity literature (Mire-Sluis et al., 2004; Gupta et al., 2007; Shankar et al., 2008; FDA, 2009). These may include parametric, robust parametric, or nonparametric approaches. There has been some recent investigation of the performance characteristics of various statistical approaches for cut point determination (Schlain et al., 2010; Jaki et al., 2011). Further, some general recommendations regarding the number of samples and independent assay runs used to determine the assay cut point have been proposed (Gupta et al., 2007; Shankar et al., 2008; FDA, 2009). However, there has been little consideration of the required control of the nominal false positive rate necessary to meet regulatory expectations and achieve suitable analytical performance. Moreover, the impact of the sampling design (i.e. number of subjects/patients and assay runs) and relevant sources of variability on cut point performance characteristics has not been thoroughly explored.

The purpose of this paper is to review various statistical approaches for cut point calculation, evaluate the impact of sampling design and variability on the performance of each statistical approach, and highlight the difference between an ‘average’ or ‘confidence-level’ cut point in order to develop more specific recommendations regarding the statistical calculation of immunogenicity screening cut points.
2. Methods

2.1. Sampling design

The importance of proper experimental design when determining a screening cut point has been recognized in the immunogenicity literature (Shankar et al., 2008). Multiple factors may introduce variability into the observed assay responses. Such factors may include the assay run (or batch), analyst, plate, and plate location, among others. Care should be taken to utilize an experimental design which allows for the direct evaluation of the effect of each factor, eliminating potential confounding between the effects of two or more individual factors. See Shankar et al. (2008) for a brief discussion of balanced experimental designs to eliminate such confounding. The chosen experimental design will determine which sources of variability are directly estimable from the observed assay responses, and the calculated cut point should incorporate each relevant source. At minimum, three sources of variability should be directly estimable from the observed assay responses and incorporated into the calculated cut point:

• Biological inter-subject (or inter-patient) variance
• Analytical inter-run (or inter-batch) variance
• Analytical intra-run (or intra-batch) variance

Depending on the assay format and experimental design, additional sources of variability may be relevant and identifiable. For simplicity, we will consider only the above three sources of variability for the remainder of this paper. Assuming all samples are assayed in each analytical run, a statistical model to describe the assay responses can then be given by:

y_ij = μ + s_i + r_j + ε_ij    (1)

where y_ij is the observed assay response for the ith (i = 1, 2, …, I) subject in the jth (j = 1, 2, …, J) assay run, μ is the true (unknown) mean response for the assay, s_i is the random effect for the ith subject, r_j is the random effect for the jth assay run, and ε_ij is the random effect for the ith subject in the jth assay run. Under the assumption that the observed assay responses follow a normal distribution (or that a transformation to achieve normality exists), we can further specify that the random effects s_i, r_j and ε_ij are normally and independently distributed with means zero and variances σ_S², σ_R² and σ_E². These variances, σ_S², σ_R² and σ_E², correspond to the biological inter-subject, analytical inter-run, and analytical intra-run variability, respectively. The total variability of an observed assay response is then given by σ_y² = σ_S² + σ_R² + σ_E². The statistical model given in Eq. (1) is commonly referred to as a balanced two-way random effects model (without interaction). Interaction terms or additional random effects may be incorporated into the analysis model as appropriate based on the experimental design and assay format.

Denote the overall mean of the observed assay responses by ȳ = ∑_{i=1}^{I} ∑_{j=1}^{J} y_ij / (IJ), the mean for the ith subject by ȳ_i = ∑_{j=1}^{J} y_ij / J, and the mean for the jth assay run by ȳ_j = ∑_{i=1}^{I} y_ij / I. Table 1 gives the analysis of variance table for the balanced two-way random effects model, where EMS denotes the expected mean square.
Table 1
Analysis of variance table for balanced two-way random effects model.

Source          Degrees of freedom      Sums of squares                              Mean square       EMS
Inter-subject   dfS = I − 1             SSS = J ∑_i (ȳ_i − ȳ)²                       MSS = SSS/dfS     J·σ_S² + σ_E²
Inter-run       dfR = J − 1             SSR = I ∑_j (ȳ_j − ȳ)²                       MSR = SSR/dfR     I·σ_R² + σ_E²
Intra-run       dfE = IJ − I − J + 1    SSE = ∑_i ∑_j (y_ij − ȳ_i − ȳ_j + ȳ)²        MSE = SSE/dfE     σ_E²
The mean squares MSS, MSR, and MSE can be used to obtain estimates of the inter-subject, inter-run, intra-run, and total variances. Table 2 gives the variance estimates obtained from the ANOVA mean squares.

Table 2
Estimates of inter-subject, inter-run, intra-run, and total variance.

Variance component    Estimate
Inter-subject         σ̂_S² = (MSS − MSE) / J
Inter-run             σ̂_R² = (MSR − MSE) / I
Intra-run             σ̂_E² = MSE
Total                 σ̂_y² = σ̂_S² + σ̂_R² + σ̂_E²
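For illustration, the calculations in Tables 1 and 2 can be carried out as in the following minimal Python/NumPy sketch (the analyses reported in this paper were performed in SAS; the function and variable names below are illustrative assumptions only, and assume a complete, balanced I × J array of responses):

```python
import numpy as np

def variance_components(y):
    """ANOVA variance component estimates for the balanced two-way
    random effects model of Eq. (1); y is an I x J array of assay
    responses (rows = subjects, columns = runs)."""
    I, J = y.shape
    grand = y.mean()
    subj_means = y.mean(axis=1)          # subject means
    run_means = y.mean(axis=0)           # run means

    # Sums of squares and mean squares (Table 1)
    ss_s = J * np.sum((subj_means - grand) ** 2)
    ss_r = I * np.sum((run_means - grand) ** 2)
    ss_e = np.sum((y - subj_means[:, None] - run_means[None, :] + grand) ** 2)
    ms_s = ss_s / (I - 1)
    ms_r = ss_r / (J - 1)
    ms_e = ss_e / (I * J - I - J + 1)

    # Variance component estimates (Table 2)
    var_s = (ms_s - ms_e) / J            # inter-subject
    var_r = (ms_r - ms_e) / I            # inter-run
    var_e = ms_e                         # intra-run
    return ms_s, ms_r, ms_e, var_s, var_r, var_e, var_s + var_r + var_e
```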
Table 3 gives the estimated proportion of total variability attributable to the inter-subject, inter-run, and intra-run variances obtained from six historical case studies. Each study was conducted using a validated ELISA or SPR assay. Studies 1 and 2 were preclinical studies each conducted with samples from 20 mice assayed in three analytical runs. Studies 3, 4, 5 and 6 were conducted with samples from 80, 90, 52 and 112 healthy human volunteers, respectively, and assayed in 6, 3, 3 and 3 analytical runs, respectively. Inspection of Table 3 and additional historical data (not shown) indicate that the inter-subject variability can be substantial, while the inter-run and intra-run variability are typically of comparable magnitude.

Table 3
Estimated proportion of total variability attributable to inter-subject (σ_S²), inter-run (σ_R²), and intra-run (σ_E²) variance in six historical case studies.

Study    σ_S²    σ_R²    σ_E²
1        0.28    0.52    0.20
2        0.83    0.02    0.15
3        0.10    0.45    0.45
4        0.93    0.01    0.06
5        0.46    0.28    0.26
6        0.04    0.24    0.72

2.2. Statistical approaches for cut point calculation

Various statistical approaches for determination of the screening cut point are commonly used and have been described in the immunogenicity literature. These typically include parametric, robust parametric, and nonparametric approaches. Here we consider five different approaches for cut point calculation:

• Nonparametric
• Simple parametric
• Robust parametric
• Prediction interval
• Quantile lower bound

Note that the simple parametric, robust parametric, prediction interval, and quantile lower bound approaches each assume approximate normality of the assay responses (or appropriately transformed responses). Details on the construction of the screening cut point for each approach follow.
2.2.1. Nonparametric

Let y(1), y(2), …, y(N) denote the ordered assay responses such that y(1) ≤ y(2) ≤ … ≤ y(N). A common definition for the empirical pth quantile is given by:

0.5 × (y(N×p) + y(N×p+1))    if N × p is an integer
y([N×p]+1)                   if N × p is not an integer

where [N × p] denotes the integer portion of N × p.

2.2.2. Simple parametric

A simple parametric estimator for the pth quantile is given by:

ȳ + z_p · √(σ̂_y²)

where z_p is the upper pth percentile of the standard normal distribution and σ̂_y² is the estimated total variance of y. Note that z_p = 1.645 for p = 0.95.

2.2.3. Robust parametric

A robust parametric estimator for the pth quantile is given by:

ỹ + z_p × 1.483 × MAD

where ỹ is the median assay response, z_p is the upper pth percentile of the standard normal distribution and MAD is the median absolute deviation given by the median of |y_ij − ỹ|.
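As an illustration of the three closed-form estimators above, a minimal Python sketch is given below (illustrative only; it reuses the variance_components helper sketched in Section 2.1 for the total variance, and the function names are our own):

```python
import numpy as np
from scipy.stats import norm

def nonparametric_cutpoint(y, p=0.95):
    """Empirical pth quantile of the pooled responses (Section 2.2.1)."""
    ys = np.sort(np.ravel(y))
    n = ys.size
    k = n * p
    if abs(k - round(k)) < 1e-9:              # N*p is an integer (within float tolerance)
        k = int(round(k))
        return 0.5 * (ys[k - 1] + ys[k])      # average of y(N*p) and y(N*p+1)
    return ys[int(np.floor(k))]               # y([N*p]+1) in 1-based indexing

def simple_parametric_cutpoint(y, p=0.95):
    """Mean plus z_p times the estimated total SD (Section 2.2.2)."""
    *_, total_var = variance_components(y)    # helper from the Section 2.1 sketch
    return np.mean(y) + norm.ppf(p) * np.sqrt(total_var)

def robust_parametric_cutpoint(y, p=0.95):
    """Median plus z_p * 1.483 * MAD (Section 2.2.3)."""
    med = np.median(y)
    mad = np.median(np.abs(np.ravel(y) - med))
    return med + norm.ppf(p) * 1.483 * mad
```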
2.2.4. Prediction interval

A one-sided (100 × p)% upper prediction interval for an assay response is given by:

ȳ + t_(p,df) · √(σ̂_y²) · √(1 + 1/Ne)

where t_(p,df) is the upper pth percentile of the t distribution with df degrees of freedom, σ̂_y² is the estimated total variance of y, and Ne is the effective sample size. The effective sample size Ne is given by σ̂_y² / σ̂_ȳ², where σ̂_ȳ² = σ̂_S²/I + σ̂_R²/J + σ̂_E²/IJ is the estimated variance of the observed mean ȳ. The degrees of freedom df for the t distribution can be estimated via Satterthwaite approximation. For the balanced two-way random effects model, this is given by:

df = (σ̂_y²)² / [ ((1/J)·MSS)² / dfS + ((1/I)·MSR)² / dfR + (((IJ − I − J)/IJ)·MSE)² / dfE ]    (2)
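A minimal Python sketch of the prediction interval cut point, again reusing the variance_components helper sketched in Section 2.1 (illustrative only, not the SAS implementation used for the analyses reported here):

```python
import numpy as np
from scipy.stats import t as t_dist

def effective_n_and_df(y):
    """Effective sample size Ne and Satterthwaite degrees of freedom
    (Eq. (2)) for a balanced I x J response array y."""
    I, J = y.shape
    ms_s, ms_r, ms_e, var_s, var_r, var_e, var_y = variance_components(y)
    var_mean = var_s / I + var_r / J + var_e / (I * J)   # variance of the grand mean
    n_e = var_y / var_mean
    den = ((ms_s / J) ** 2 / (I - 1)
           + (ms_r / I) ** 2 / (J - 1)
           + (((I * J - I - J) / (I * J)) * ms_e) ** 2 / (I * J - I - J + 1))
    df = var_y ** 2 / den
    return n_e, df, var_y

def prediction_interval_cutpoint(y, p=0.95):
    """One-sided (100*p)% upper prediction limit (Section 2.2.4)."""
    n_e, df, var_y = effective_n_and_df(y)
    return np.mean(y) + t_dist.ppf(p, df) * np.sqrt(var_y) * np.sqrt(1 + 1 / n_e)
```

Applied to the log-transformed example data of Section 4, these formulas should reproduce, up to rounding, the quantities Ne ≈ 5.31 and df ≈ 6.96 reported in Table 6.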
2.2.5. Quantile lower bound

A one-sided lower 100 × (1 − α)% confidence bound for the pth quantile is given by:

ȳ + t_(α,df,δ) · √(σ̂_y²) · √(1/Ne)

where t_(α,df,δ) is the lower αth percentile of the t distribution with df degrees of freedom and non-centrality parameter δ, and σ̂_y² and Ne are as given previously. The degrees of freedom df can be estimated via Satterthwaite approximation as given in Eq. (2). The non-centrality parameter δ is given by z_p · √Ne, where z_p is the upper pth percentile of the standard normal distribution.
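The quantile lower bound requires percentiles of the noncentral t distribution; a minimal Python sketch, reusing the helpers from the previous sketches, might look as follows (illustrative only):

```python
import numpy as np
from scipy.stats import norm, nct

def quantile_lower_bound_cutpoint(y, p=0.95, alpha=0.05):
    """One-sided lower 100*(1-alpha)% confidence bound on the pth
    quantile (Section 2.2.5), via the noncentral t distribution."""
    n_e, df, var_y = effective_n_and_df(y)        # helper from the previous sketch
    delta = norm.ppf(p) * np.sqrt(n_e)            # non-centrality parameter
    t_low = nct.ppf(alpha, df, delta)             # lower alpha-th percentile of noncentral t
    return np.mean(y) + t_low * np.sqrt(var_y) * np.sqrt(1 / n_e)
```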
2.3. Additional considerations

It should be noted that each of the above approaches can be sensitive to statistical outliers, though the robust parametric approach will generally be less sensitive. Various graphical (e.g. box-plots) and hypothesis testing (Lund, 1975; Rosner, 1975) techniques exist to identify statistical outliers, and such outliers should be excluded prior to calculation of the screening cut point. These outliers may reflect, among other causes, analytical errors, large biological variability, non-antibody proteins which may bind to the therapeutic protein, or background levels of anti-drug antibody formation. Jaki et al. (2011) discuss the impact of anti-drug antibody incidence within the drug-naïve screening samples on the performance of various cut point calculation methods. Such outliers may be effectively identified by an appropriate confirmatory assay and the influence on cut point calculation assessed. The impact of outlier inclusion or exclusion on cut point performance is not considered further in this paper.

Various types of screening cut points are possible (Shankar et al., 2008). Generally, three types of cut points are considered: fixed, floating, and dynamic. A fixed cut point is typically an absolute assay response value determined in pre-study validation and applied to in-study sample testing. A floating cut point allows for an adjustment of the actual cut point based on the performance of the assay background sample (e.g. negative control pools). This adjustment (or normalization) factor is typically determined in pre-study validation by the average biological background signal produced in the assay. A dynamic cut point also allows for an adjustment of the actual cut point based on the assay background sample but does not use variability estimates from the pre-study validation. The use of a floating cut point appears to be the most widespread in practice (Gorovits, 2009). We do not further consider the differences between the various types of cut points here, except to note that the general results in the following section hold regardless of the cut point type.

3. Results

The performance of each statistical approach for cut point calculation was assessed via simulation techniques. Simulated assay responses were assumed to follow the two-way random effects model given previously, with normally distributed random effects.
Fig. 1. Distribution of simulated false positive rates (FPR) for each cut point calculation approach for various numbers of subjects and analytical runs. Horizontal axis gives calculation approach: nonparametric (N), simple parametric (S), robust parametric (R), prediction interval (P), or quantile lower bound (Q). Variance component ratio (σ_S² : σ_R² : σ_E²) fixed at 2:1:1. Reference line at nominal false positive rate of 0.05.
Without loss of generality, we assume that μ = 0 and σ_S² + σ_R² + σ_E² = 1. The simulation considered various sampling designs (I = 20, 50, or 100 subjects and J = 3, 6, 9, or 12 runs) and various ratios of inter-subject, inter-run, and intra-run variances (σ_S² : σ_R² : σ_E² = 1:1:1, 1:2:1, 1:2:2, 2:1:1). These variance ratios were selected to be consistent with our general experience regarding the relative magnitude of each variance component. For each combination of sampling design and variance ratio, 5000 datasets were simulated using the SAS (version 9.1) NORMAL function with a random seed. For each simulated dataset, a cut point (CP) corresponding to a false positive rate of 0.05 (i.e. p = 0.95) was calculated using each of the five statistical approaches. The confidence level of the quantile lower bound approach was set at (1 − α) = 0.95. For each simulated dataset and statistical approach, the actual false positive rate (FPR) of the calculated cut point CP was determined by FPR = 1 − Φ(CP), where Φ denotes the cumulative distribution function of the standard normal distribution. For each combination of sampling design and variance ratio, the following quantities were calculated across the 5000 simulated datasets:

• Average FPR
• FPR standard deviation
• Proportion of FPR values < 0.05
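The simulations reported here were implemented with the SAS (version 9.1) NORMAL function, as noted above. Purely as an illustration of the procedure just described, a minimal Python sketch of one simulation cell (one sampling design and one variance ratio) follows; the function and variable names are ours, and cutpoint_fn stands for any of the cut point functions sketched in Section 2.2:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2011)

def simulate_fpr(cutpoint_fn, I=20, J=3, var_ratio=(2, 1, 1), n_sim=5000):
    """Simulate the actual false positive rate of a cut point method under
    the balanced two-way random effects model with mu = 0 and total variance 1."""
    w = np.array(var_ratio, dtype=float)
    var_s, var_r, var_e = w / w.sum()            # variance components summing to 1
    fpr = np.empty(n_sim)
    for k in range(n_sim):
        subj = rng.normal(0.0, np.sqrt(var_s), size=(I, 1))
        run = rng.normal(0.0, np.sqrt(var_r), size=(1, J))
        err = rng.normal(0.0, np.sqrt(var_e), size=(I, J))
        y = subj + run + err                     # simulated assay responses
        cp = cutpoint_fn(y)                      # cut point targeting FPR = 0.05
        fpr[k] = 1 - norm.cdf(cp)                # actual FPR, since a new response ~ N(0, 1)
    return fpr.mean(), fpr.std(), np.mean(fpr < 0.05)

# e.g. simulate_fpr(prediction_interval_cutpoint, I=20, J=3)
```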
Fig. 1 shows the distribution of actual false positive rates for selected simulation scenarios (I = 20, 50, or 100 subjects, J = 3 or 12 runs, and variance ratio σ_S² : σ_R² : σ_E² = 2:1:1). Note that the distributions for the nonparametric and robust parametric approaches are largely similar. Likewise, the performance characteristics (average FPR, FPR standard deviation, and proportion of FPR values < 0.05) for the nonparametric and robust parametric approaches were similar across all simulation scenarios considered. As such, results for the robust parametric approach are not shown further. Full simulation results for all approaches are provided online as supplemental material.

Figs. 2 and 3 give the FPR average and standard deviation, respectively, while Fig. 4 gives the probability of obtaining an FPR value < 0.05, for each combination of sampling design and variance ratio. For brevity, the results for I = 50 subjects are excluded in these figures. Fig. 2 indicates that the prediction interval approach yields average FPR values at the nominal level of 0.05, across all combinations of sampling design and variance ratio. This is as expected, as the prediction interval approach is intended to achieve the nominal FPR level on average. The nonparametric and simple parametric approaches yield average FPR values greater than the nominal level of 0.05, particularly when the number of subjects and number of runs is low. For large numbers of subjects and runs, the average FPR for the nonparametric and simple parametric approaches is close to the nominal level. The quantile lower bound approach yields average FPR values greater than the nominal level, across all combinations of sampling design and variance ratio.

Inspection of Fig. 3 indicates that the prediction interval approach yields FPR values with the smallest standard deviation across all combinations of sampling design and variance ratio. For large numbers of subjects and runs, the nonparametric and simple parametric approaches have FPR standard deviation comparable to that of the prediction interval approach.
Fig. 2. Average false positive rate (FPR) versus number of analytical runs, for various numbers of subjects and variance component ratios σ_S² : σ_R² : σ_E². Plotted symbol indicates cut point calculation approach: nonparametric (N), simple parametric (S), prediction interval (P), and quantile lower bound (Q). Reference line at nominal false positive rate of 0.05.
Fig. 3. False positive rate (FPR) standard deviation versus number of analytical runs, for various numbers of subjects and variance component ratios σ_S² : σ_R² : σ_E². Plotted symbol indicates cut point calculation approach: nonparametric (N), simple parametric (S), prediction interval (P), and quantile lower bound (Q).
Fig. 4. Probability of obtaining false positive rate (FPR) less than 0.05 versus number of analytical runs, for various numbers of subjects and variance component ratios σ_S² : σ_R² : σ_E². Plotted symbol indicates cut point calculation approach: nonparametric (N), simple parametric (S), prediction interval (P), and quantile lower bound (Q). Reference line at probability of 0.05.
The FPR standard deviation is largest for the quantile lower bound approach, across all combinations of sampling design and variance ratio.

Fig. 4 shows that the quantile lower bound approach controls the probability of obtaining an FPR value < 0.05. This probability is no greater than 0.05 across all combinations of sampling design and variance ratio. This result is as expected, as the quantile lower bound approach is intended to control the probability of obtaining an FPR value < 0.05 at the α = 0.05 level. The prediction interval approach yields the largest probability of obtaining an FPR value < 0.05 across all combinations of sampling design and variance ratio, while the nonparametric and simple parametric approaches yield somewhat smaller probabilities.

4. Example
Fig. 5. Histogram of log-transformed normalized mean RU values with superimposed theoretical normal distribution.
Each of the five cut point calculation approaches is illustrated by application to an actual validation study conducted in the mouse. Plasma samples from twenty drug-naïve mice were taken on three separate occasions. Each plasma sample was assayed in duplicate in each of three analytical runs using a validated electrochemiluminescence-based assay. A pool of drug-naïve mouse plasma (negative pool) was also prepared and assayed in quadruplicate in each run. The electrochemiluminescence signal intensity is denoted by RU. For each mouse plasma and the negative pool, the mean RU was calculated within each run. Cut points corresponding to an FPR of 0.05 were calculated from the normalized mean RU values (i.e. mean mouse plasma RU value divided by mean negative pool RU value within each run). Note that this corresponds to a floating cut point as described previously. Table 4 gives the mean and normalized mean RU values for each mouse and the negative pool across each run. Prior to the cut point calculation, the normalized mean RU values were log transformed to achieve approximate normality.
Table 4
Mean RU and normalized mean RU values.

Mouse ID         Mean RU                       Normalized mean RU
                 Run 1    Run 2     Run 3      Run 1    Run 2    Run 3
1                62.5     52.5      55.5       1.11     1.25     1.03
2                56.5     49.5      52.0       1.00     1.18     0.961
3                84.5     64.0      70.5       1.50     1.52     1.30
4                59.0     48.5      60.5       1.04     1.15     1.12
5                52.5     47.5      57.5       0.929    1.13     1.06
6                57.5     43.5      59.5       1.02     1.03     1.10
7                54.5     49.5      43.5       0.965    1.18     0.804
8                55.0     49.5      46.0       0.973    1.18     0.850
9                57.0     51.0      55.0       1.01     1.21     1.02
10               57.5     54.5      58.0       1.02     1.29     1.07
11               59.0     49.0      50.5       1.04     1.16     0.933
12               64.0     51.5      53.0       1.13     1.22     0.979
13               58.0     57.0      60.0       1.03     1.35     1.11
14               58.5     54.0      60.5       1.04     1.28     1.12
15               57.5     54.5      55.0       1.02     1.29     1.02
16               59.0     58.0      55.5       1.04     1.38     1.03
17               52.5     50.5      49.5       0.929    1.20     0.915
18               59.0     49.0      55.0       1.04     1.16     1.02
19               53.5     50.5      49.5       0.947    1.20     0.915
20               53.5     46.5      52.0       0.947    1.10     0.961
Negative pool    56.5     42.125    54.125     1        1        1
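As an illustration of the normalization and transformation steps described above, the following short Python sketch (ours, not the authors' code) computes the normalized mean RU values and the log transform, and applies the Shapiro–Wilk check discussed below; the 21 × 3 array of Table 4 mean RU values is assumed to be supplied by the caller:

```python
import numpy as np
from scipy.stats import shapiro

def log_normalized_responses(mean_ru):
    """mean_ru: 21 x 3 array of Table 4 mean RU values
    (rows = mice 1-20 plus the negative pool as the last row, columns = runs 1-3)."""
    mouse_ru = mean_ru[:-1, :]                # 20 mice
    pool_ru = mean_ru[-1, :]                  # negative pool, one mean RU per run
    normalized = mouse_ru / pool_ru           # floating cut point normalization, per run
    return np.log(normalized)                 # natural log transform

def check_normality(log_norm):
    """Shapiro-Wilk test of approximate normality of the pooled values."""
    stat, p_value = shapiro(np.ravel(log_norm))
    return stat, p_value
```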
Fig. 5 shows a histogram of the log-transformed normalized mean RU values. The Shapiro–Wilk test (Shapiro and Wilk, 1965) for normality was also performed and yielded a p-value of 0.275, indicating no statistically significant departure from normality. This p-value should be interpreted with some caution, as the sixty log-transformed normalized mean RU values are correlated (violating the Shapiro–Wilk assumption of independence). However, both graphical (e.g. histogram, quantile–quantile plot) and goodness of fit (e.g. Shapiro–Wilk test) assessments of the log-transformed normalized mean RU values indicate the assumption of normality is reasonable. The cut points are thus calculated from the log-transformed data, and expressed in the original scale via the anti-log transformation.

To calculate the simple parametric, prediction interval, and quantile lower bound cut points, we first construct the analysis of variance table as previously shown in Table 1. Table 5 gives the analysis of variance for the log-transformed normalized mean RU data. Table 6 gives additional intermediary quantities necessary for each cut point approach. Table 7 gives the calculated cut point values for each approach on both the log-transformed and original scales. Cut point values on the original scale (normalized mean RU) are 1.36, 1.37, 1.27, 1.45, and 1.22 for the nonparametric, simple parametric, robust parametric, prediction interval, and quantile lower bound approaches, respectively. Fig. 6 plots the normalized mean RU values with reference lines given at each calculated cut point value.
Table 5
ANOVA table for log-transformed normalized mean RU data.

Source           Degrees of freedom    Sums of squares    Mean square
Inter-subject    dfS = 19              0.40517            0.02132
Inter-run        dfR = 2               0.42374            0.21187
Intra-run        dfE = 38              0.15738            0.00414
Table 6
Intermediary quantities for cut point calculations.

Quantity        Value
ȳ               0.07945
ỹ               0.03922
y(57)           0.32208
y(58)           0.30010
σ̂_S²            0.00573
σ̂_R²            0.01039
σ̂_E²            0.00414
σ̂_y²            0.02026
MAD             0.08100
σ̂_ȳ²            0.00382
Ne              5.31
df              6.96
δ               3.79
z_p             1.645
t_0.95,df       1.896
t_0.05,df,δ     1.993
Table 7
Calculated cut points for example data.

Approach                Calculation                                     Log scale    Original scale
Nonparametric           0.5 × (0.32208 + 0.30010)                       0.3111       1.36
Simple parametric       0.07945 + 1.645 × √0.02026                      0.3136       1.37
Robust parametric       0.03922 + (1.645 × 1.483 × 0.08100)             0.2368       1.27
Prediction interval     0.07945 + 1.896 × √0.02026 × √(1 + 1/5.31)      0.3737       1.45
Quantile lower bound    0.07945 + 1.993 × √0.02026 × √(1/5.31)          0.2026       1.22
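As a quick arithmetic check, the Table 7 calculations can be reproduced directly from the rounded intermediary quantities of Table 6 (a sketch only; small differences in the last digit are expected because the tabulated inputs are rounded):

```python
import math

y_bar, y_med = 0.07945, 0.03922          # mean and median, log scale
y57, y58 = 0.32208, 0.30010              # 57th and 58th ordered values
var_y, mad, n_e = 0.02026, 0.08100, 5.31 # total variance, MAD, effective sample size
t_95, t_nct = 1.896, 1.993               # t(0.95, df) and noncentral t(0.05, df, delta)
sd_y = math.sqrt(var_y)

cutpoints_log = {
    "nonparametric":        0.5 * (y57 + y58),
    "simple parametric":    y_bar + 1.645 * sd_y,
    "robust parametric":    y_med + 1.645 * 1.483 * mad,
    "prediction interval":  y_bar + t_95 * sd_y * math.sqrt(1 + 1 / n_e),
    "quantile lower bound": y_bar + t_nct * sd_y * math.sqrt(1 / n_e),
}
for name, cp in cutpoints_log.items():
    print(f"{name:22s} log scale {cp:.4f}  original scale {math.exp(cp):.2f}")
```

The anti-log (exponential) of each log-scale cut point gives the corresponding value on the original normalized mean RU scale.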
5. Discussion

Five statistical approaches were evaluated via simulation techniques: nonparametric, simple parametric, robust parametric, prediction interval, and quantile lower bound. The prediction interval approach yields average false positive rates at the nominal level, while also providing the smallest variability. While yielding the largest average false positive rates, the quantile lower bound approach controls the probability that the false positive rate will be less than the nominal
level. The nonparametric, simple parametric and robust parametric approaches yield average false positive rates which are greater than the nominal level, and with greater variability than the prediction interval approach. For large numbers of subjects and analytical runs, the performance of the nonparametric, simple parametric and robust parametric approaches is similar to that for the prediction interval approach. The nonparametric, simple parametric, robust parametric and prediction interval approaches do not control the probability that the false positive rate will be less than the nominal level. The simulation study conducted here assumed normally distributed observed assay responses and it is noted that the simple parametric, robust parametric, prediction interval, and quantile lower bound approaches are based on the assumption of normality. Gross departures from normality may impact the performance of the cut points calculated from each approach. However, in our experience, appropriately transformed (e.g. log transformed) assay responses generally exhibit approximate normality and normal-based approaches are applicable. Some authors (Schlain et al., 2010) have suggested that cut point calculations based on non-normal distributions, such as the gamma distribution, are often appropriate for immunogenicity data. This may be a topic for further research. A fundamental issue when selecting a statistical approach for cut point calculation is the required control of the nominal false positive rate. Common practice and regulatory guidance indicate that a screening cut point be chosen to correspond to a false positive rate of 5%. However, it is unclear whether the screening cut point should be chosen to achieve an average 5% false positive rate (‘average’ cut point) or to strictly control the probability that the actual false positive rate will be less than 5% (‘confidence-level’ cut point). An ‘average’ cut point approach will achieve a 5% false positive rate on average, but may occasionally yield a false positive rate much less than 5%. A ‘confidence-level’ cut point approach will result in
Fig. 6. Normalized mean RU values for each mouse by analytical run. Reference lines given at calculated cut point for each approach.
comparatively high false positive rates, but will ensure (with specified confidence level) that the false positive rate is at least 5%. The prediction interval approach is appropriate for calculating an ‘average’ cut point, while the quantile lower bound approach is appropriate for calculating a ‘confidence-level’ cut point. The nonparametric, simple parametric and robust parametric approaches yield neither an ‘average’ nor ‘confidence-level’ cut point. Further discussion between the biopharmaceutical community and regulatory agencies may lead to consensus as to which approach is required when determining an immunogenicity screening assay cut point.

Regardless of the statistical approach utilized for cut point calculation, the choice of sampling design will influence the cut point performance characteristics. The number of subjects and analytical runs, along with the relative magnitude of the inter-subject, inter-run, and intra-run variances, will determine the variability of the cut point false positive rates. Inadequate numbers of subjects or analytical runs may result in cut points with unacceptably large variability. Prior estimates (or assumptions) for inter-subject, inter-run, and intra-run variances along with a pre-specified target for acceptable false positive rate variability can be used to develop an appropriate sampling design. Note, however, that the prediction interval approach will maintain the nominal false positive rate on average regardless of the sampling design. Similarly, the quantile lower bound approach will strictly control the probability that the false positive rate will be less than the nominal level regardless of the sampling design.

6. Conclusions

Calculation of an immunogenicity screening cut point requires consideration of appropriate experimental design, identification and estimation of relevant sources of variability, adequate numbers of samples and analytical runs, and proper statistical methodology. While various statistical approaches for cut point calculation have been suggested in the immunogenicity literature, there has been little consideration of the desired cut point performance characteristics and the impact of the sampling design and sources of variability on cut point performance. The prediction interval approach provides an ‘average’ cut point which will yield average false positive rates at the nominal level. The quantile lower bound approach provides a ‘confidence-level’ cut point which will strictly control the probability of obtaining false positive rates below the nominal level. A priori estimates (or assumptions) regarding the relative magnitude of the inter-subject, inter-run, and intra-run sources of variability can be utilized to determine a sampling design sufficient to obtain calculated cut points with acceptable variability.
Acknowledgments

The authors wish to thank Laurent Vermet and Pierre Cortez for helpful discussions and the example data, and Christophe Agut and Robert Kringle for their review and comments on the manuscript.

Appendix A. Supplementary data

Supplementary data to this article can be found online at doi:10.1016/j.jim.2011.08.019.

References

European Medicines Agency, 2007. Guideline on Immunogenicity of Biotechnology-derived Therapeutic Proteins. European Medicines Agency, Canary Wharf, London.
European Medicines Agency, 2009. Concept Paper on Immunogenicity Assessment of Monoclonal Antibodies Intended for In Vivo Clinical Use. European Medicines Agency, Canary Wharf, London.
Food and Drug Administration, 1999. Draft Guidance for Industry: Bioanalytical Method Validation. US Food and Drug Administration, Rockville, MD.
Food and Drug Administration, 2009. Guidance for Industry: Assay Development for Immunogenicity Testing of Therapeutic Proteins. US Food and Drug Administration, Rockville, MD.
Gorovits, B., 2009. Antidrug antibody assay validation: industry survey results. AAPS J. 11 (1), 133.
Gupta, S., Indelicato, S., Jethwa, V., et al., 2007. Recommendations for the design, optimization, and qualification of cell-based assays used for the detection of neutralizing antibody responses elicited to biological therapeutics. J. Immunol. Methods 321, 1.
International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use, 1996. Validation of analytical procedures: text and methodology.
Jahn, E.-M., Schneider, C., 2009. How to systematically evaluate immunogenicity of therapeutic proteins — regulatory considerations. N. Biotechnol. 25 (5), 280.
Jaki, T., Lawo, J.-P., Wolfsegger, M., et al., 2011. A formal comparison of different methods for establishing cut points to distinguish positive and negative samples in immunoassays. J. Pharm. Biomed. Anal. 55, 1148.
Koren, E., Smith, H., Shores, E., et al., 2008. Recommendations on risk-based strategies for detection and characterization of antibodies against biotechnology products. J. Immunol. Methods 333, 1.
Lund, R., 1975. Tables for an approximate test for outliers in linear models. Technometrics 17 (4), 473.
Mire-Sluis, A., Barrett, Y., Devanarayan, V., et al., 2004. Recommendations for the design and optimization of immunoassays used in the detection of host antibodies against biotechnology products. J. Immunol. Methods 289, 1.
Rosner, B., 1975. On the detection of many outliers. Technometrics 17 (2), 221.
Schlain, B., Amaravadi, L., Donley, J., et al., 2010. A novel gamma-fitting statistical method for anti-drug antibody assays to establish assay cut points for data with non-normal distribution. J. Immunol. Methods 352, 161.
Shankar, G., Shores, E., Wagner, C., Mire-Sluis, A., 2006. Scientific and regulatory considerations on the immunogenicity of biologics. Trends Biotechnol. 24 (6), 274.
Shankar, G., Devanarayan, V., Amaravadi, L., et al., 2008. Recommendations for the validation of immunoassays used for detection of host antibodies against biotechnology products. J. Pharm. Biomed. Anal. 48, 1267.
Shapiro, S., Wilk, M., 1965. An analysis of variance test for normality (complete samples). Biometrika 52, 591.