Fundamentals of assay development and validation


Chapter 6

6.1 What is an assay?

An assay (also known as a test) is a laboratory procedure, mostly performed in vitro on a biological sample taken from a human or animal body, to detect and/or measure the amount of a specific substance (also known as a biomarker, analyte, or measurand) to help diagnose or monitor disease or other conditions. As per the FDA definition, an assay includes the entire laboratory testing system, from sample collection to preparation to delivery of test results [1]. For an assay to be valuable in clinical practice, it should be analytically valid, that is, it should reliably detect or measure a clinically meaningful biological signal. This chapter focuses on the essential aspects that apply to different kinds of assays and technologies. Examples and scenarios are given whenever applicable.

6.2 Assay development

Assay development means building an assay from raw chemicals and includes the following steps.

6.2.1 Identify the purpose

As detailed in an earlier chapter, a lab test can be applied to assess a general diagnostic, companion diagnostic (CDx), prognostic, efficacy, pharmacodynamic (PD), or safety biomarker, and/or to prove a drug concept or mechanism of action. Test selection depends on the intended purpose, for example, to diagnose and follow up patients with diabetes [2], to assess the efficacy of tyrosine kinase inhibitors (TKI) [3,4], to monitor the efficacy of anti-HIV or anti-hepatitis C virus (HCV) drugs [5,6], or to predict response to PD-1/PD-L1 inhibitors [7,8].

6.2.2 Build a biological concept

Like a potential drug candidate, an in vitro diagnostic test should be built on a concept. From a pathobiological perspective, select a target biomarker (gene or gene variant, receptor, signaling protein, intermediary metabolite, or end-product metabolite) that can best describe or reflect a disease, condition, or disease progress, or predict a drug response or assess its effect. Here are some examples for building a concept:

• Blood glucose is an intermediary analyte that reflects a person’s glycemic state and serves as a surrogate clinical end point for diabetes.
• Blood creatinine is a metabolic end product, that is, waste, of muscle catabolism excreted through the kidney, and its blood level reflects kidney excretory function.
• Aberrant activation of the epidermal growth factor receptor (EGFR) signaling pathways plays a critical role in the invasive and metastatic potential of tumors. Overexpression of HER3, in addition to EGFR, may be useful to identify patients who are at risk of developing metastases in rectal cancer [3], or a change in expression may serve as a PD biomarker for TKI effect [4].
• Binding of the PD-1 ligands, PD-L1 and PD-L2, to the PD-1 receptor found on T-cells inhibits T-cell proliferation and cytokine production. Upregulation of PD-1 ligands occurs in some tumors, and signaling through this pathway can contribute to the inhibition of active T-cell immune surveillance of tumors. Checkpoint inhibitors (CPI), for example, Opdivo, Keytruda, and Tecentriq, bind to the PD-1 receptor or its ligands, releasing PD-1 pathway-mediated inhibition of the immune response, including the antitumor immune response [9–11]. It is logical to assume that the expression of PD-1 or its ligand PD-L1 can predict the efficacy of CPI.

Biomarkers, Diagnostics and Precision Medicine in the Drug Industry. DOI: https://doi.org/10.1016/B978-0-12-816121-0.00006-4 © 2019 Elsevier Inc. All rights reserved.

6.2.3 Select analyte form and technology

The analyte (commonly called the measurand in laboratory guidelines) is the form of a biomarker to be assayed. Depending on the metabolic, pathophysiological, and signaling pathways, select the analyte form (nature), for example, DNA, RNA, protein, or metabolite. Then, depending on the analyte nature and the biomarker panel (single, few, or several), select the technology, for example, real-time polymerase chain reaction (PCR), Sanger or next generation sequencing (NGS), immunohistochemistry (IHC), flow cytometry, enzyme-linked immunosorbent assay (ELISA), etc. Here are some examples:

• Blood glucose, the major parent absorbable and biologically active sugar, reflects the body's ability to utilize the molecule in different anabolic and catabolic pathways and also reflects the clinical consequences of abnormal utilization. This makes glucose the best target for a diagnostic test. Since glucose is a small molecule that is not amenable to immunoassays, it can be determined by a spectrophotometric or mass spectrometric (MS) procedure, but the former is more feasible and objective.
• For large molecular predictive biomarkers, testing the protein that is the direct drug target should be the best choice. For example, while PD-1/PD-L1 or HER2 expression can be tested at the DNA or RNA level, the expressed protein and, possibly, its cellular compartment localization can be the most sensible analyte to predict the response to drugs designed to hit these protein targets. IHC, but not mass spectrometry or Western blot, allows the assessment of relative expression and the demonstration of subcellular localization.
• Detection of a DNA mutation is traditionally more feasible and conclusive than trying to assess the downregulation of the corresponding wild-type (WT) protein or differentiating mutant from WT protein using a specific antibody. For example, EGFR mutant proteins can be detected with dedicated MS [12], and the technique can add value in research settings, but detection of the mutations at the DNA level is more feasible and robust in clinical settings. While p53 protein expression was reported as a surrogate biomarker for the gene mutations [13,14], the challenges associated with the test [15] and the extremely large number of mutations associated with the gene make DNA profiling via Sanger or NGS [16–18] the best choice.
• Quantification of the mRNA of a gene, for example, HER3, alongside the mRNA of other related genes can be a better PD biomarker panel than individual testing of multiple protein expressions. For the mRNA expression of a single gene or a few genes, quantitative reverse transcriptase (RT)-PCR can be the best option amenable to clinical lab settings, but for several genes, multiplex-based technologies, for example, microarrays, can be the solution.

6.2.4 Select sample matrix

The selection of sample type or matrix depends on multiple factors, balancing ease and noninvasiveness against the biological distribution (compartmentalization) of an analyte and its correlation with the clinical indication. For example:

• Blood glucose is a circulating molecule with homogeneous distribution (partitioning) between blood and tissues and, thus, can be estimated in whole blood, plasma, or serum.
• If feasible, blood coagulation testing is preferably done on whole blood without added anticoagulant, and plasma, but not serum, should be used for testing individual coagulation factors.
• HER3 mRNA is typically determined in tumor tissue, with preference given to fresh biopsy over formalin-fixed paraffin-embedded tumor (FFPET) due to stability issues in FFPET. However, peripheral blood, or what is called “liquid biopsy,” is gaining traction because it is less invasive and can be sampled frequently. If robust evidence can be established for the use of circulating RNA as a surrogate biomarker for tumor content, especially in early-stage cancer, blood RNA will be an ideal choice.
• PD-L1 was detected in circulating tumor cells and serum from patients with advanced breast and non-small cell lung cancer (NSCLC) [19,20], but PD-L1 is mainly contained in tumor cells or the tumor microenvironment, especially in nonmetastatic tumors, and, thus, it has to be analyzed in tumor tissue. FFPE tissue is the preferred matrix for two reasons: (1) while freezing better maintains the natural protein structure, freezing artifacts and the greater thickness of frozen tissue sections lead to loss of tissue morphology compared to FFPE and (2) FFPET is the most commonly available tissue in clinical practice.

6.2.5 Generate an analytical concept

Postulate how the addition of certain chemical reagent(s) to an unseen entity (the analyte) visualizes the selected target or makes it countable/measurable. This constitutes the principle of the reaction and/or detection system. Here are some examples of setting analytical concepts or principles:

• Glucose oxidase and hexokinase are two common mechanisms to test plasma glucose. As demonstrated by Fig. 6.1A, the principle of the first method is that glucose oxidase catalyzes the oxidation of glucose to gluconic acid with the formation of hydrogen peroxide. In the presence of peroxidase, the hydrogen peroxide produced is then reduced to water by o-dianisidine, forming colored oxidized o-dianisidine whose color intensity is directly proportional to the glucose level. In the other method, hexokinase catalyzes the reaction between glucose and adenosine triphosphate to form glucose-6-phosphate (G-6-P) and adenosine diphosphate. In the presence of nicotinamide adenine dinucleotide (NAD) or NAD phosphate (NADP), G-6-P is oxidized by the enzyme G-6-P dehydrogenase to 6-phosphogluconate with the reduction of NADP or NAD into NADPH or NADH. The increase in NADPH or NADH concentration, which can be monitored spectrophotometrically at 340 nm, is directly proportional to the glucose concentration [21].
• HER3 mRNA is first reverse transcribed into cDNA, which is then amplified in multiple cycles of PCR using a fluorescent-labeled probe, and the evolved fluorescence is measured.
• PD-L1 in a tumor section is visualized via an immunohistochemical reaction. As simplified by Fig. 6.1B, in the Dako PD-L1 IHC 28-8 pharmDx test, a rabbit antihuman PD-L1 monoclonal antibody (mAb) is used to detect PD-L1 in FFPET. Following incubation with the primary mAb, the tissue section is incubated with a visualization reagent consisting of goat antirabbit/antimouse immunoglobulin secondary antibody coupled with horseradish peroxidase (HRP) enzyme. When hydrogen peroxide (H2O2) and diaminobenzidine (DAB) are added, HRP breaks down H2O2 with the oxidation of DAB, forming a colored precipitate that can be visualized under a microscope [22]. The other Dako PD-L1 test, PD-L1 IHC 22C3 pharmDx, has the same detection principle and main reagent components, except that the primary mAb is from mouse instead of rabbit.

To choose between glucose oxidase, hexokinase, and other methods for glucose determination, or between rabbit and mouse mAbs, a method developer should weigh the pros and cons of each option and select what is judged to be the best option for each situation at the time of assay development.
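The proportionality between the rise in absorbance at 340 nm and glucose in the hexokinase method can be illustrated with a short Beer-Lambert calculation. This is only a sketch under stated assumptions: an endpoint reaction, one mole of NAD(P)H formed per mole of glucose, a 1 cm light path, and the commonly cited molar absorptivity of NADH at 340 nm (about 6220 L mol⁻¹ cm⁻¹), which is not given in this chapter. The function name, the dilution factor, and the example reading are hypothetical; a routine assay would normally be read against calibrators rather than computed this way.

```python
# Literature value for NADH/NADPH at 340 nm; an assumption, not from this chapter.
NADH_MOLAR_ABSORPTIVITY = 6220.0  # L mol^-1 cm^-1


def glucose_mmol_per_l(delta_a340, dilution_factor, path_length_cm=1.0):
    """Estimate sample glucose (mmol/L) from the rise in absorbance at 340 nm.

    Assumes an endpoint reaction in which 1 mol of NAD(P)H is formed per
    mol of glucose, so the NAD(P)H concentration in the cuvette equals the
    glucose concentration in the cuvette.
    """
    if delta_a340 < 0:
        raise ValueError("delta_a340 must be >= 0")
    if dilution_factor <= 0 or path_length_cm <= 0:
        raise ValueError("dilution_factor and path_length_cm must be > 0")
    # Beer-Lambert law: A = epsilon * c * l, hence c = A / (epsilon * l).
    nadh_mol_per_l = delta_a340 / (NADH_MOLAR_ABSORPTIVITY * path_length_cm)
    # Convert mol/L to mmol/L and correct for dilution of the sample in the mix.
    return nadh_mol_per_l * 1000.0 * dilution_factor


# Hypothetical reading: absorbance rises by 0.311 after the sample was
# diluted 100-fold into the reagent mix.
example = glucose_mmol_per_l(0.311, dilution_factor=100)
print(f"{example:.2f} mmol/L")
```

With these hypothetical inputs the sketch prints 5.00 mmol/L; doubling the absorbance change doubles the result, which is the direct proportionality the method relies on.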

[Figure 6.1. Panel (A): glucose oxidase converts glucose to gluconic acid and H2O2, and peroxidase then oxidizes colorless o-dianisidine to colored oxidized o-dianisidine; alternatively, hexokinase and ATP convert glucose to glucose-6-phosphate and ADP, and G6PD then forms 6-phosphogluconate while NADP or NAD (does not absorb at 340 nm) is reduced to NADPH or NADH (absorbs at 340 nm). Panel (B): PD-L1 is bound by rabbit mAb Dako 28-8 or mouse mAb Dako 22C3, followed by HRP, which converts colorless DAB and H2O2 into colored oxidized DAB and H2O.]

FIGURE 6.1 Examples of how a biomarker can be measured by different assay designs. Panel (A) for blood glucose estimated by two chemical assays and panel (B) for tissue PD-L1 by two IHC assay approaches. IHC, Immunohistochemistry.

6.2.6 Select or design reagents

Consider whether the analyte can be determined directly in the matrix (for example, glucose), needs to be extracted (for example, HER3 RNA), or needs to be retrieved and exposed (for example, PD-L1). Create a list of the reagents and chemicals, including buffer solutions, that are needed to translate an analyte (e.g., glucose, HER3 mRNA, or PD-L1 protein) into a detectable signal whose magnitude is expected to correlate with the analyte concentration.

6.2.7 Preliminary test of the analytical concept

At this stage the concept has to be tested on a “shot-in-the-dark” basis before investing more resources. Reagent solutions can be formulated with component proportions estimated from previous experience with a similar assay or from empirical formulas. The concept can be considered valid if the analyte can be detected, as in the following examples:

• Two levels of glucose in water, in addition to a blank (water), can be used. If optical density signals can be detected with the glucose solutions, with some proportionality to glucose levels, but not with the blank, the concept is considered valid.
• Total RNA extracts from cell lines, xenografts, or human tissues, preferably with predictable levels of expression, in addition to a no-template control (buffer) and a no-RT control (RNA extract but no RT enzyme in the reaction mix), are subjected to RT-PCR. Amplification should be seen in samples tested in the presence of RT, at a much smaller magnitude in the samples without RT, and not at all in the no-template control. The concept is more credible if the amplification signals are proportional to the expected level of expression, if known.
• Duplicate slides from a few archived tumor tissues are exposed to an exploratory antigen retrieval condition. One set is stained with an exploratory recipe containing the primary antibody, and the other set is stained with the same recipe except that the primary antibody is replaced by an isotypic antibody (rabbit or mouse monoclonal IgG antibody for the Dako PD-L1 examples listed earlier). If at least some tissues show some staining with the PD-L1 mAb but no, or significantly less, staining with the control antibody, the concept can be considered valid.

6.2.8 Assay optimization

After an assay is shown to work from an analytical perspective, it needs to be optimized by trying different options for each of the following parameters whenever applicable, preferably in a checkerboard format:

• Different concentrations of each reagent component
• Different reaction buffer pHs and ionic strengths
• Different PCR annealing, hybridization, and extension temperatures and times
• Different incubation times and temperatures for chemical reaction, ELISA, IHC
• Different antigen retrieval systems for IHC

The combination of options that can clearly differentiate between negative and positive samples with signals that correlate with analyte levels is selected to move to the assay validation phase.
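A checkerboard layout simply crosses every option of one parameter with every option of the others. The sketch below shows one way to enumerate such a grid and pick a condition. The parameter names, the option values, and the use of a positive-to-negative signal ratio as the selection score are all illustrative assumptions; the text only requires that the chosen combination clearly separates negative from positive samples with signals that track analyte level.

```python
from itertools import product


def checkerboard(**factors):
    """Return every combination of the supplied factor levels as a list of dicts."""
    names = list(factors)
    return [dict(zip(names, combo))
            for combo in product(*(factors[name] for name in names))]


def best_condition(readouts):
    """Pick the condition with the highest positive/negative signal ratio.

    readouts: list of (condition, positive_signal, negative_signal) tuples,
    with negative_signal > 0. The ratio is a simplification chosen for this
    sketch, not a criterion taken from the chapter.
    """
    return max(readouts, key=lambda r: r[1] / r[2])[0]


# Hypothetical optimization grid for an IHC-like assay.
grid = checkerboard(
    primary_ab_dilution=[50, 100, 200],
    incubation_min=[30, 60],
    buffer_ph=[6.0, 9.0],
)
print(len(grid))  # 3 x 2 x 2 = 12 conditions to run

# Hypothetical readouts for three of the conditions.
readouts = [
    (grid[0], 1.20, 0.30),
    (grid[5], 1.10, 0.10),
    (grid[11], 0.60, 0.10),
]
print(best_condition(readouts))
```

In practice each grid entry would be run with negative and positive samples at several analyte levels, and the readout table would hold measured signals rather than the placeholder numbers used here.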

6.3 Assay validation

Validation of an analytical procedure serves as objective evidence that the assay is suitable for its intended purpose [23–26]. While these guidelines focus on bioanalytical method validation, and FDA recognizes that some characteristics may not apply or that different considerations may need to be addressed, FDA states that method validation for biomarker assays should address the same questions as method validation for drug assays [26]. As discussed in the previous chapter, performance goals, including total analytical error (TAE), should ideally be predefined according to the purpose of the test and the possible impact of erroneous results on medical decisions. For example, if a statin drug is expected to lower low-density lipoprotein cholesterol by 20% [27], the assay TAE should be less than this value, preferably <1/2 of it. Otherwise, the drug effect can be hidden within the analytical noise. However, most of the time, especially with CDxs and lab-developed tests, performance characteristics are demonstrated first, and clinical value and limitations are determined accordingly. In these situations, proper risk analysis and a management plan should be conducted but, unfortunately, as will be shown in later chapters, this does not take place most of the time.


This section will highlight the fundamental technical characteristics for consideration during the validation of the analytical procedures. Whenever applicable, for demonstration purpose, glucose, HER3, and PD-L1 assays will be used as examples for quantitative assays with reference materials available for calibration, quantitative assays with no reference materials available to use as calibrators, and for qualitative assays with no calibration needed, respectively.

6.3.1 Determination of the assay measurement range

As defined in the previous chapter, the assay measurement range (AMR) for a quantitative method is the interval between the lower limit of quantification (LOQ or LLOQ) and the upper limit of quantification (ULOQ). It is recommended to use a certified traceable reference standard (also known as a primary standard) or a certified secondary reference material; if these are unavailable, a working calibrator provided by a credible manufacturer can be used to establish the AMR as follows.

6.3.1.1 Preliminary/exploratory step

This step is more relevant to assays for which calibrator materials are available to prepare solutions in a matrix-matched sample or a pool of samples or, whenever applicable, in buffer or water. Prepare concentrations that may exceed the possible AMR to adequately define the relationship between concentration and raw response, for example, optical density or fluorescence units. Mix well and analyze all samples in triplicate in a single run. To avoid possible carryover, especially with automatic analyzers (carryover will be assessed later), arrange samples from the lowest to the highest level. Plot the average raw signal (on the y-axis) against the target concentrations and try different curve fit options to find the best fit. Fig. 6.2 illustrates a real example for a gene mRNA sequence analyzed by real-time RT-PCR. Fig. 6.2A shows the average of triplicate Ct (cycle threshold) values against copy numbers of the mRNA sequence ranging from 1 to 100,000,000. It is obvious that the reaction is not linear above 1,000,000 copies or at 1 copy, which is confirmed by the curve fits of different ranges in Fig. 6.2B and C, where R², as an indicator of curve fit, increased as the range was narrowed until it reached 0.9995 with 100–1,000,000 copies.

[Figure 6.2: four panels plotting Ct (y-axis) against copy number (x-axis, logarithmic scale). Fit labels recovered from the plots: panel (B), ranges 1–100,000,000, 1–10,000,000, and 1–1,000,000 copies with R² = 0.9833, 0.993, and 0.994; panel (C), ranges 10–1,000,000 and 100–1,000,000 copies with R² = 0.9966 and 0.9995; panel (D), ranges 1–1,000,000, 10–1,000,000, and 100–1,000,000 copies with y = –1.341ln(x) + 39.586 (R² = 0.9942), y = –1.419ln(x) + 40.363 (R² = 0.9997), and y = –1.437ln(x) + 40.554 (R² = 0.9998).]

FIGURE 6.2 Different assay linearity curve plots for a gene mRNA determined by real-time RT-PCR. Panels (A)–(C) plot the average of triplicates from a single exploratory run; panel (A) with a dot-to-dot curve, and panels (B) and (C) with linear curve fits covering different ranges of analyte concentration in terms of copy number. Panel (D) presents different linear curve fits for average Ct ± SD calculated from 20 replicates. Ct, Cycle threshold; PCR, polymerase chain reaction; RT, reverse transcriptase.
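Comparing curve fits over different ranges, as done for Fig. 6.2, amounts to repeating a least-squares fit of Ct against the logarithm of copy number and watching R². The sketch below does this in plain Python. The Ct values are hypothetical numbers chosen to flatten at both ends, not the chapter's raw data, and the function names are illustrative. The efficiency helper uses the standard qPCR relation E = 10^(-1/slope) - 1 for a slope per log10 unit (equivalently exp(-1/slope) - 1 for a slope per natural-log unit); that relation is an addition here and is not discussed in the text.

```python
import math


def fit_ct_vs_ln_copies(copies, ct):
    """Least-squares fit of Ct = slope * ln(copies) + intercept.

    Returns (slope, intercept, r_squared).
    """
    x = [math.log(c) for c in copies]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(ct) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in ct)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, ct))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept, (sxy ** 2) / (sxx * syy)


def amplification_efficiency(slope_per_ln):
    """Per-cycle efficiency from a Ct-versus-ln(copies) slope (1.0 = 100%)."""
    return math.exp(-1.0 / slope_per_ln) - 1.0


# Hypothetical averaged Ct values (NOT the chapter's data): linear in the
# middle of the range and flattening at 1 copy and above 1,000,000 copies.
copies = [1, 10, 100, 1_000, 10_000, 100_000, 1_000_000, 10_000_000, 100_000_000]
ct = [38.7, 37.1, 33.9, 30.6, 27.3, 24.0, 20.7, 19.5, 19.0]

_, _, r2_full = fit_ct_vs_ln_copies(copies, ct)
_, _, r2_narrow = fit_ct_vs_ln_copies(copies[2:7], ct[2:7])  # 100 to 1,000,000
print(f"R2 full range: {r2_full:.4f}, R2 100-1,000,000: {r2_narrow:.4f}")

# Slope taken from the fit label y = -1.437 ln(x) + 40.554 in Fig. 6.2D.
print(f"efficiency: {amplification_efficiency(-1.437):.3f}")
```

Applied to the slope of –1.437 per natural-log unit, the helper gives an efficiency of roughly 1.005, that is, close to a doubling per cycle, which is what a well-behaved PCR is expected to show within its linear range.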

6.3.1.2 Confirmatory step

This step verifies the observations of the exploratory step over multiple runs. Clinical samples with copy numbers above 1,000,000/reaction were unlikely and, if encountered, could be diluted and reanalyzed, whereas assay sensitivity was important; therefore, the confirmatory step included levels between 1 and 1,000,000 copies. Samples were analyzed in 20 replicates over 10 runs by two technologists. Fig. 6.2D shows curve plots [average Ct ± standard deviation (SD)] for different ranges, which confirm the results of the exploratory step: 100–1,000,000 and 10–1,000,000 copies give the best linear fits, with R² of 0.9998 and 0.9997, respectively. Although nothing was detected in the no-template blank up to 40 PCR cycles, the average Ct at 1 copy (38.7) was used as the limit of detection (LOD). Table 6.1 lists the number of replicates at each level with amplification below this cutoff. The data confirm the curve fit conclusion: 100% of replicates at 10 copies or higher were detectable, and only 8 out of 20 replicates at 1 copy were below the Ct cutoff. In such a scenario, 10 copies can be the LLOQ (Table 6.1).

TABLE 6.1 Number of replicates with cycle threshold (Ct) below the cutoff.

Copy number                      1,000,000   100,000   10,000   1000   100   10    1
Number of replicates             20          20        20       20     20    20    20
Number of replicates <38.7 Ct    20          20        20       20     20    20    8
% of replicates <Ct cutoff       100         100       100      100    100   100   40
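The reasoning behind Table 6.1 can be reproduced in a few lines: compute the percentage of replicates below the Ct cutoff at each level, then take the lowest level at which every replicate is detected. The counts below are those of Table 6.1; the function names and the rule of walking down from the highest level until a level fails are illustrative choices for this sketch.

```python
def percent_detected(detected, total):
    """Percentage of replicates below the Ct cutoff at each copy-number level."""
    return {level: 100.0 * detected[level] / total[level] for level in detected}


def lowest_fully_detected_level(rates, required_pct=100.0):
    """Walk from the highest level down and stop at the first level that fails.

    Returns the lowest level in the unbroken run of passing levels, or None
    if even the highest level fails.
    """
    lowest = None
    for level in sorted(rates, reverse=True):
        if rates[level] < required_pct:
            break
        lowest = level
    return lowest


# Counts from Table 6.1: 20 replicates per level, Ct cutoff of 38.7.
levels = [1_000_000, 100_000, 10_000, 1_000, 100, 10, 1]
total = {level: 20 for level in levels}
detected = {level: 20 for level in levels}
detected[1] = 8  # only 8 of 20 replicates at 1 copy fell below the cutoff

rates = percent_detected(detected, total)
candidate_lloq = lowest_fully_detected_level(rates)
print(rates[1], candidate_lloq)
```

With the Table 6.1 counts, the detection rate at 1 copy is 40% and the candidate LLOQ is 10 copies, matching the conclusion in the text.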

6.3.1.3 Assay measurement range for immunohistochemistry and similar cell-based technologies

The approach listed above is not feasible for IHC; even the term AMR is uncommon, and assay dynamic range is used instead. To assess the dynamic range of an IHC assay, it is common to scan a large number of samples (100 or more) expected to express the target. Ideally, an assay should stain 0%–100% of cells with different intensities: negative (no stain above the background), low (1+), moderate (2+), and high (3+) [22,28]. In some situations, where biomarkers are not highly expressed or only a subpopulation of cells, for example, tumor-infiltrating inflammatory or immune cells, is targeted, this approach is not feasible either. For example, an estrogen receptor assay is considered positive if ≥1% of cell nuclei are stained at any intensity, where 1% to <10% is considered weak positive and ≥10% is considered high positive [29,30]. The Ventana SP142 PD-L1 assay, codeveloped with Tecentriq, determines the proportion of tumor area occupied by PD-L1-expressing tumor-infiltrating immune cells (% IC) of any intensity. The sensitivity of the assay was assessed by analyzing 3750 urothelial carcinoma specimens. Of these, 2545 (67.9%) showed immune cell staining of any percentage, 466 (12.4%) showed ≥5% immune cell staining, and 512 (13.7%) showed tumor cell staining of any percentage [31]. In such circumstances, where finding tumor samples with a high percentage of stained cells is unlikely, diluting a highly (100%) positive cell line at different proportions in a negative cell line (one that lacks the biomarker of interest), followed by pelleting and embedding in paraffin, can be a good option for assessing the assay dynamic range. Another possible approach is to use a cell line transfected in a way that makes different cell populations express different levels of a particular biomarker [32–34].

6.3.2 Sample dilutability

Dilutability, also known as dilutional linearity [26], demonstrates whether clinical samples must be diluted, or can be diluted, before analysis or reanalysis.

6.3.2.1 Obligatory dilution

Clinical samples need to be diluted before attempting analysis in three situations: (1) all or most samples are expected to have concentrations above the assay ULOQ and dilution will bring all samples within the AMR, (2) analyte concentrations exceed the assay capacity, for example, the hook effect in immunoassays, or (3) there is an interfering agent in the sample matrix.

6.3.2.2 Optional dilution

Here there is no fear of interference and most neat samples are measurable, that is, their concentrations fall within the AMR, but some clinical samples may exceed the assay ULOQ. While samples with results above the ULOQ of an assay can be reported as greater than that value, the absolute concentration is always desired to monitor disease progress or assess drug efficacy and/or toxicity through serial sampling. In these situations, samples are expected to be diluted and reanalyzed, and then the absolute concentration = observed value in the diluted sample × dilution factor. For example, in the gene mRNA example mentioned above, a sample with a copy number above 1,000,000 can be reported as >1,000,000, or the RNA extract can be diluted 100× or 1000×, or at whatever dilution ratio has been validated, and reanalyzed. For sample extracts made in buffer or water, dilution in the same vehicle is not expected to impact the analytical performance, that is, the signal from the diluted sample will show proportionality with copy number similar to that between 10 and 1,000,000 copies in the mRNA example. However, for samples run in neat matrix, for example, serum, urine, and ascitic fluid, the result from a diluted sample may be higher or lower than expected due to multiple factors that can collectively be called matrix effects.
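The back-calculation described above is a single multiplication, but it is only meaningful when the diluted reading itself falls inside the AMR. A minimal sketch follows, using the 10 to 1,000,000 copy range of the mRNA example; the function name and the sample reading are hypothetical.

```python
def back_calculate(observed, dilution_factor, lloq, uloq):
    """Return the absolute concentration of a diluted sample.

    The observed value must itself fall within the AMR (lloq..uloq);
    otherwise the dilution has to be adjusted and the sample rerun.
    """
    if dilution_factor < 1:
        raise ValueError("dilution_factor must be >= 1")
    if not lloq <= observed <= uloq:
        raise ValueError("observed value is outside the AMR; choose another dilution")
    return observed * dilution_factor


# Hypothetical sample from the mRNA example: the neat result was >1,000,000
# copies, so the extract was diluted 100x and reread as 25,000 copies.
absolute = back_calculate(25_000, 100, lloq=10, uloq=1_000_000)
print(absolute)  # 2,500,000 copies in the undiluted extract
```

Only dilution ratios that were covered during validation should be passed in; the sketch does not check that.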

6.3.2.3 Matrix effect

A matrix effect can be caused by endogenous matrix components, for example, bilirubin, lipids, urea, and other metabolites, or by additives, for example, preservatives and anticoagulants used in sample collection. Possible interference from exogenous materials (additives) should be investigated during assay validation, for example, by investigating whether ethylenediaminetetraacetic acid (EDTA) or citrate used to collect plasma from a whole blood sample can affect a given assay. Interference due to endogenous molecules can be classified into the following two groups.

6.3.2.3.1 Specific

These are interfering agents with structures related to the targeted analyte or to the reactants, for example, the antibodies (capture and/or detection antibodies) in immunoassays. Therapeutic antibodies given to a patient may compete with the assay antibody for the analyte, leading to underestimation or a false negative. Heterophilic antibodies or human antianimal antibodies (HAAA), for example, human antimouse antibodies, can cross-link the detecting antibody with the capture antibody [35–37]. Heterophilic antibodies are endogenously produced against poorly defined antigens, with no history of medical treatment with animal immunoglobulins or other well-defined immunogens, and can react with many antigens and antibodies. While heterophilic antibodies can cause overestimation or false detection, as demonstrated later, they can also attach to the antigen-binding sites of the capture antibody, inhibiting the reaction between the antigen and the capture antibody and leading to underestimation or a false negative result [35,37–39]. The presence of heterophilic antibodies, for example, rheumatoid factor, in serum was reported to cause false detection of HCV-specific IgM [40] and falsely elevated analyte levels in troponin assays [41], thyroid function tests [42], tumor marker assays [43], and other biomarkers [35,44]. Since these antibodies are formed without known reasons, the first critical step in detecting heterophilic antibody interference in a sample is to suspect it. The starting point is often a clinician contacting the laboratory about a mismatch between the clinical information and the laboratory results. Once such interference is suspected, a classic approach has been to make serial dilutions of the sample [45].

6.3.2.3.2 Nonspecific

Also known as general interfering agents, these are matrix components that are structurally unrelated to the targeted analyte (or the reactants) but can still interfere in the assay. Hemoglobin (from red blood cell hemolysis that may happen during blood sample acquisition and/or processing), abnormal bilirubin levels (icteric samples), abnormally high triglycerides (lipemic samples), and high immunoglobulin levels (paraproteins) are the most common nonspecific interfering agents. Their positive (increase) or negative (decrease) impacts on several serum/plasma biomarkers have been documented [46–52]. MS-based techniques are often vulnerable to matrix effects that may compromise their sensitivity, selectivity, accuracy, and precision. The interfering agents can affect chromatographic behavior and the ionization of target compounds, resulting in ion suppression or enhancement [26,53–62].

6.3.3 Types of assays that can be candidates for sample dilution

The top panels of Fig. 6.3 illustrate the following scenarios of an immunometric assay, as an example of techniques vulnerable to the issues listed earlier, to show whether a sample needs to be diluted or whether dilution can resolve an issue:

• Fig. 6.3A1: An assay with the analyte concentration in the middle of the assay dynamic range, where all analyte molecules can be sandwiched between the capture antibody (blue) and the detecting antibody (orange). No dilution is needed.
• Fig. 6.3A2: The analyte concentration is at the upper end of the AMR, where the analyte molecules saturate the binding sites of the capture antibody and the detecting antibody is more than enough to bind to all captured antigen. The concentration is still quantitatively measurable and no dilution is needed.
• Fig. 6.3A3: The analyte concentration exceeds the binding capacity of the capture antibody and, regardless of the abundance of the detecting antibody, any analyte molecules that are not captured on the solid matrix will be washed away. In this situation the analyte will be underestimated, as shown by Fig. 6.3B and C and Table 6.3. The assay plateaus above 5000 and below 100 ng/mL and, as shown by Fig. 6.3C, the AMR is 100–5000 ng/mL. Results above 5000 ng/mL (the ULOQ) can be reported as >5000, but if the absolute concentration is needed, reanalysis after proper dilution is required.
• Fig. 6.3A4: Similar to Fig. 6.3A3, but the analyte concentration is so high that it saturates the binding sites of the capture and detecting antibodies individually, which prevents formation of the capture antibody-antigen-detecting antibody complex (Ab–Ag–Ab). In this situation the analyte will be significantly underestimated or even undetected in what is called the “hook effect.” Also known as the prozone or high-dose effect, it occurs when increasing analyte concentrations result in decreased signals compared to the preceding concentration. The effect is very common in one-step immunometric assays. As demonstrated by Fig. 6.3B and C and Table 6.3, the AMR is 100–5000 ng/mL; concentrations above the assay ULOQ will be underestimated, and the more concentrated the sample, the lower the assayed value. The problem was encountered in assays for hormones, for example, luteinizing hormone, follicle-stimulating hormone, prolactin, human chorionic gonadotropin, and thyrotropin; for tumor markers, for example, CA125, CA19.9, and prostate-specific antigen; and for immunoglobulins, for example, IgE [35,63–73]. This is an example where dilution of samples prior to analysis is a must.
• Fig. 6.3A5: A situation where the detecting antibody is not enough to bind to all captured analyte molecules, which leads to underestimation. While dilution can solve the issue, it is recommended to adjust the secondary antibody (detecting reagent) to expand the AMR.
• Fig. 6.3A6: Another scenario of underestimation, but this time due to an interfering antibody that competes with the detecting antibody for the antigen. A therapeutic antibody may compete with the assay detection antibody for the analyte, leading to underestimation or a false negative. Heterophilic antibodies can cause overestimation or false detection, but also underestimation or a false negative result [35,37–39].
• Fig. 6.3A7: One more scenario of underestimation, but due to a cross-reacting antigen that competes with the target antigen for the capture antibody but not for the detecting antibody. If the cross-reacting antigen can bind to both antibodies, it will lead to false detection or overestimation.
• Fig. 6.3A8: A scenario of false detection due to the presence of a heterophilic antibody or HAAA, which cross-links the detecting antibody with the capture antibody in the absence of the target biomarker [35–37].
• Fig. 6.3A9: A scenario of nonspecific interference, listed earlier, which, in addition to immunoassays, can impact other methodologies including MS and spectrophotometry. The presence of any of the listed nonspecific interfering agents at abnormally high concentrations can impact an assay in different ways, including physical hindrance of the reaction/binding, volume displacement, scattering or absorbance of the reaction signal, for example, light or fluorescence, or chemical interference in the reaction or signal formation.

[Figure 6.3: panels (A1)–(A9) are schematic drawings of the binding scenarios listed above. Panels (B) and (C) plot OD (y-axis) against concentration in ng/mL (x-axis, logarithmic scale; 1–100,000 in panel B and 100–10,000 in panel C), with curves labeled Plateau and Hook. Fit labels recovered from the plot: y = –7E–09x² + 0.0003x + 0.0561 (R² = 1) and y = –3E–08x² + 0.0004x + 0.117 (R² = 1).]

FIGURE 6.3 Different scenarios of good and bad immunometric assays. (A1) Good assay with amounts of primary and secondary Ab enough to detect all Ag; (A2) saturated binding; (A3–A5) inadequate proportions of primary and/or secondary Ab to detect all Ag; (A6) Ab interference; (A7) Ag interference; (A8) cross-reacting heterophilic Ab; (A9) nonspecific hindrance of the Ag–Ab binding. Panel (B) shows an assay with plateau or hook effect and panel (C) is a plot of the best linear part of the curves in panel (B). Ab, Antibodies; Ag, antigen; OD, optical density.

6.3.4

Assessment of matrix interference

As stated by the Clinical and Laboratory Standards Institute (CLSI) guidelines [52], no practical strategy can investigate all possible interferents. However, all potential or speculated interferents should be tested during assay validation. We recommend investigating matrix interference by conducting one or both of the following two studies.

6.3.4.1 Interferent spike

Interference can be screened by adding a potential interfering agent at a final concentration at or exceeding the highest expected level in clinical samples, that is, the worst case expected, for example, hemoglobin at ≥ 5 g/L or bilirubin at ≥ 500 μmol/L [49], to a set of three to five samples or pools of samples, while a second set is spiked with the vehicle in which the interferent is made. We recommend using samples with predetermined or expected target biomarker concentrations scattered through the AMR. To avoid introducing another variable by excessive dilution of the neat sample matrix, the interfering agent should be spiked in the least possible volume of the vehicle. If the differences between observed biomarker results from the interferent-spiked and vehicle-spiked paired samples are within the assay TAE and there is no consistent drift, interference is excluded. If a consistent bias is seen, that is, all interferent-spiked samples deviate from the vehicle-spiked set by more than one-fourth of the assay TAE and all are either positively or negatively biased, titrate the interferent impact as in the next step. Details can be found in the CLSI EP07 guidelines, but, basically, the interferent can be spiked into a set of samples at 100%, 75%, 50%, 25%, and 0% of the level screened earlier. Typically, the impact of the interfering agent dissipates at a certain level, and the highest tolerable level can be determined as the level at which results from the interferent-spiked samples are within ± TAE of the results from the samples with no interferent, and no consistent drift above one-fourth of the TAE is seen. Fig. 6.4 illustrates a scenario for a test with TAE of 20% when two sets of three samples (S1, S2, and S3) were analyzed with the potential interfering agent at 100%, 75%, 50%, 25%, and 0% of the screened level. While the percentage of bias of results from the 50% interferent was within the TAE for the three samples, the three results positively deviated above the one-fourth TAE cutoff, and only 25% of the initially screened interferent level can be tolerated.
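The screening and titration rules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the titration data are hypothetical (the data are chosen to mimic the pattern described for Fig. 6.4, not read from it), and "consistent drift" is coded as every sample deviating by more than one-fourth of the TAE in the same direction.

```python
def percent_bias(spiked, vehicle):
    """Percent difference of an interferent-spiked result from its vehicle-spiked pair."""
    return (spiked - vehicle) / vehicle * 100.0


def screen_interference(pairs, tae):
    """Evaluate (interferent_spiked, vehicle_spiked) pairs; tae is in percent."""
    biases = [percent_bias(s, v) for s, v in pairs]
    within_tae = all(abs(b) <= tae for b in biases)
    quarter = tae / 4.0
    # Consistent drift: all samples beyond one-fourth TAE, all in the same direction.
    consistent_drift = (all(b > quarter for b in biases)
                        or all(b < -quarter for b in biases))
    return {
        "biases": biases,
        "within_tae": within_tae,
        "consistent_drift": consistent_drift,
        "tolerated": within_tae and not consistent_drift,
    }


def highest_tolerable_level(titration, tae):
    """Walk up the titration levels and stop at the first level that is not tolerated."""
    best = None
    for level in sorted(titration):
        if level == 0:
            continue  # the 0% spike is the no-interferent baseline
        if not screen_interference(titration[level], tae)["tolerated"]:
            break
        best = level
    return best


# Hypothetical titration: level (% of screened concentration) -> three sample pairs.
titration = {
    0: [(100, 100), (500, 500), (1000, 1000)],
    25: [(102, 100), (495, 500), (1030, 1000)],
    50: [(108, 100), (550, 500), (1120, 1000)],
    75: [(125, 100), (650, 500), (1280, 1000)],
    100: [(150, 100), (775, 500), (1600, 1000)],
}

print(highest_tolerable_level(titration, tae=20))
```

With these hypothetical numbers the 50% spike stays within the 20% TAE for every sample but drifts positively beyond 5% in all three, so only the 25% level is reported as tolerable.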

Biomarkers, Diagnostics and Precision Medicine in the Drug Industry

FIGURE 6.4 Example of assay interference test during an assay validation. The green horizontal line represents an assay without the interferent (0% bias from expected target). Interferent spiked at 25% is tolerable (individual biases from the three samples are within one-fourth of the assay TAE), spikes at 50% caused systematic bias but within the TAE, and higher spikes induced significant biases.
[Plot: x-axis, interferent level (0%, 25%, 50%, 75%, 100%); y-axis, % bias from no interferent (−10 to 70); series S1, S2, and S3; reference lines TEa, 1/4 TEa, and no bias.]

Table 6.2 shows an example for investigating possible interference of a structurally relevant sample component with the targeted biomarker. There is a high degree of homology between the NY-ESO-1 and LAGE-1 gene sequences [74–76], and to validate a quantitative assay for NY-ESO-1 mRNA, in addition to ruling out matrix interference, the assay was tested for possible cross-reactivity with LAGE-1. The AMR was established as 10–1,000,000 copies/reaction. To investigate matrix interference and cross-reaction of LAGE-1 with the NY-ESO-1 assay, the following experiment was conducted:

• RNA was extracted from five different FFPET.
• In vitro transcribed NY-ESO-1 RNA was spiked into two sets of aliquots from the five samples and five replicates of reaction buffer at final levels of 100 and 1000 copies/reaction.
• One set of the NY-ESO-1-spiked samples and buffer was spiked with LAGE-1 at a final level of 1,000,000 copies/reaction, and the other set of samples and buffer was spiked with the buffer only.
• All samples were analyzed on the same run.
• The difference between Ct of NY-ESO-1 from each sample (Sample − Buffer) at 1000 or 100 copies was calculated by subtracting the average of the buffer Ct from the Ct of the sample with the corresponding copy number.
• The difference between Ct of LAGE-1-spiked samples from each no-LAGE-1 sample [(LAGE-1) − (NY-ESO-1)] at 1000 or 100 copies was calculated by subtracting the no-LAGE Ct of each sample from the corresponding LAGE-spiked sample.

As shown by Table 6.2, the average dCt ranged between −0.17 and 0.12, and the maximum dCt was 0.52. There was no obvious trend of drift, and all biases were within the acceptable limit; hence, the data ruled out matrix interference or LAGE-1 cross-reactivity with the NY-ESO-1 assay.
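The two dCt calculations can be reproduced with a short Python sketch. The Ct values below are transcribed from the 1000-copy columns of Table 6.2; the variable names are illustrative only, and no acceptance limit is coded because the text does not state its numeric value.

```python
from statistics import mean

# Ct values for the 1000-copy NY-ESO-1 level (Samples 1-5), from Table 6.2.
buffer_ct = [26.69, 26.86, 26.36, 26.46, 26.91]   # NY-ESO-1 in reaction buffer
sample_ct = [26.94, 26.40, 26.60, 26.67, 26.58]   # NY-ESO-1 in sample matrix
lage_ct = [26.88, 26.92, 26.87, 26.48, 26.66]     # same samples plus LAGE-1

# Matrix effect: each sample Ct minus the average buffer Ct.
buffer_mean = mean(buffer_ct)
matrix_dct = [round(s - buffer_mean, 2) for s in sample_ct]

# Cross-reactivity: LAGE-spiked Ct minus the no-LAGE Ct of the same sample.
lage_dct = [round(l - s, 2) for l, s in zip(lage_ct, sample_ct)]

print(matrix_dct)
print(lage_dct)
print(max(abs(d) for d in lage_dct))
```

The largest absolute shift in this subset is the 0.52 dCt quoted in the text for Sample 2.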

6.3.4.2 Analyte spike recovery

This study can detect interference from a potential (expected) interferent, for example, in drug analysis [77], or from an unknown (unexpected) interferent. In addition to this purpose, spike recovery can be used to assess extraction recovery for procedures where samples have to be extracted [26,78]. Whenever a reference material or a pure material with known concentration is available, recovery should be performed to investigate matrix interference [79]. Ideally, a material from a second source, that is, different from the assay provider, is used but, if unavailable, a kit calibrator or quality control (QC) material can be used. If no sample extraction is needed, three to five matrix-based samples can be spiked with an analyte material of known concentration at two to five levels within the AMR, in addition to a 0 spike (vehicle spike); the same spikes are made in a vehicle suitable for the assay, and all samples are analyzed on one run. We recommend using results from the vehicle spikes instead of relying on the theoretical (targeted or nominal) spike value, which may be affected by pipetting errors or an assay analytical error (an error that is not related to interference). To avoid matrix dilution, the spike should be made in the least possible volume of vehicle. Spiked sample targeted concentrations (baseline + spike) should be within the AMR and, typically, at least one level should be around the medical decision point.

TABLE 6.2 Results from experiment to investigate matrix interference and LAGE-1 cross-reactivity with a quantitative reverse transcriptase-polymerase chain reaction assay for NY-ESO-1.

            1000 copy NY-ESO-1                  100 copy NY-ESO-1                   1,000,000 LAGE on       1,000,000 LAGE on
                                                                                    1000 NY-ESO             100 NY-ESO
            Buffer      Sample      Sample −    Buffer      Sample      Sample −    Ct          LAGE −      Ct          LAGE −
            (Ct)        (Ct)        buffer      (Ct)        (Ct)        buffer                  no-LAGE                 no-LAGE
                                    (dCt)                               (dCt)                   (dCt)                   (dCt)
Sample 1    26.69       26.94       0.28        30.56       29.91       −0.34       26.88       −0.06       29.79       −0.12
Sample 2    26.86       26.40       −0.26       30.12       30.27       0.02        26.92       0.52        29.85       −0.42
Sample 3    26.36       26.60       −0.06       29.93       30.42       0.17        26.87       0.27        29.98       −0.44
Sample 4    26.46       26.67       0.01        30.34       30.29       0.04        26.48       −0.19       30.34       0.05
Sample 5    26.91       26.58       −0.08       30.29       30.53       0.28        26.66       0.08        30.61       0.08
Average     26.66       26.64       −0.02       30.25       30.28       0.04        26.76       0.12        30.11       −0.17


If extraction is needed, matrix-based samples but not the spiked vehicle are extracted [78,80–82]. Recovery for each level from each sample is calculated as

\[ \text{Recovery \%} = \frac{\text{spiked sample assayed value} - \text{unspiked sample assayed value}}{\text{vehicle spike assayed value}} \times 100 \]

Recovery has been considered successful by the previous references if assayed values are within the target values ± 15%–20%. However, it is more reasonable to adjudicate recovery results in the light of TAE instead of a fixed percentage. We recommend considering recovery successful if the recovery percentage is within 100 ± TAE, preferably with no consistent drift, that is, not all results between 80% and 90% or between 110% and 120%.
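A minimal Python sketch of the recovery calculation and the TAE-based adjudication follows. The assayed values are hypothetical, and the drift rule is one possible reading of the 80%–90% / 110%–120% example for a TAE of 20%: drift is flagged when every recovery falls in the outer half of the 100 ± TAE window on the same side.

```python
def recovery_percent(spiked_sample, unspiked_sample, vehicle_spike):
    """Recovery % relative to the assayed vehicle spike (not the nominal spike)."""
    return (spiked_sample - unspiked_sample) / vehicle_spike * 100.0


def adjudicate_recovery(recoveries, tae):
    """Return (within_tae, consistent_drift) for a list of recovery percentages."""
    within_tae = all(abs(r - 100.0) <= tae for r in recoveries)
    half = tae / 2.0
    # Assumed drift rule: all results beyond half the TAE, all on the same side of 100%.
    consistent_drift = (all(r < 100.0 - half for r in recoveries)
                        or all(r > 100.0 + half for r in recoveries))
    return within_tae, consistent_drift


# Hypothetical run: unspiked sample assayed at 50, vehicle spike assayed at 98,
# and three spiked aliquots assayed at 145, 150, and 152 (arbitrary units).
recoveries = [recovery_percent(v, 50, 98) for v in (145, 150, 152)]
print([round(r, 2) for r in recoveries])
print(adjudicate_recovery(recoveries, tae=20))
```

Dividing by the assayed vehicle spike rather than the nominal spike follows the recommendation above, so pipetting or calibration error common to both spikes cancels out of the ratio.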

6.3.5

Testing of sample dilutability

To investigate whether clinical samples are dilutable and whether dilution can be a solution for any of the issues listed under Section 6.3.3, sample dilutability should be assessed during assay validation as follows: If neat samples with concentrations at the needed levels are not available, multiple samples or pools of samples can be spiked with a concentrated solution of analyte reference material, or the material used in the calibration, at a level exceeding the AMR. Spike volumes should be kept minimal, preferably ≤ 5% of the neat sample volumes. The spiked samples are then diluted in a vehicle compatible with the assay at different ratios, for example, 1:5, 1:10, 1:20, 1:50, and/or 1:100, or another dilution scheme. All samples, neat and diluted, are analyzed on a single run. A sample can be considered dilutable at a certain ratio if the assayed value is within the expected (target or nominal) value ± TAE, but we recommend considering consistent drift as well. Fig. 6.3B and C illustrate two assays with AMRs between 100 and 5000 ng/mL; both plateau below 100 ng/mL, one plateaus above 5000 ng/mL, and the second shows a hook effect above this level. Table 6.3 lists results from two samples in a dilutability study on the two assays depicted by the figures. Fig. 6.3B and C and Table 6.3 show the following findings:

• When spiked samples were analyzed without dilution (as is), they were significantly underestimated compared to target values, and the more concentrated the sample, the greater the loss.
• In the assay with hook effect, the raw signal was inversely proportional to sample concentrations above the ULOQ.

TABLE 6.3 Dilutability of two samples with targeted concentrations above upper limit of quantification of two assays.

                              First assay (with hook)       Second assay (plateau)
                              Sample 1      Sample 2        Sample 1      Sample 2
Target (ng/mL)                50,000        10,000          50,000        10,000
As is
  Signal                      0.400         1.050           1.501         1.480
  Observed (ng/mL)            749           3013            5530          5436
  Reportable (ng/mL)          749           3013            5530          5436
  % bias                      −98.5         −69.9           −88.9         −45.6
1:5 dilution
  Signal                      0.802         0.798           1.421         0.638
  Observed (ng/mL)            2,018         2,004           5,174         2,036
  Reportable (ng/mL)          10,090        10,020*         25,870        10,180*
  % bias                      −79.8         0.2             −48.3         1.8
1:10 dilution
  Signal                      1.325         0.501           1.286         0.322
  Observed (ng/mL)            4,623         1,041           4,592         905
  Reportable (ng/mL)          46,230*       10,410*         45,920*       9,050*
  % bias                      −7.5          4.1             −8.2          −9.5
1:100 dilution
  Signal                      0.312         103             0.138         0.071
  Observed (ng/mL)            507           103             275           49.7
  Reportable (ng/mL)          50,700*       10,300*         27,500        4970
  % bias                      1.4           3.0             −45.0         −50.3

Reportable results marked with an asterisk are within the acceptable % bias (10%) of the two assays. Observed values were calculated using the curve fit equations shown on Fig. 6.3C.


• A 1:5 dilution in buffer brought Sample 2, but not Sample 1, into the AMRs, and reportable results (observed × dilution factor) were within the acceptable percentage of bias from the target.
• A 1:10 dilution was enough to bring the two samples into the measurable ranges of the two assays.
• A 1:100 dilution was also good for the first assay but not for the second.

Results demonstrate that samples for the first assay can be diluted up to 1:100 before analysis, but 1:10 is the maximum allowable dilution for the second assay unless a dilution between 1:10 and 1:100 is tried and produces linear results.
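The dilutability check can be sketched as follows. The AMR (100–5000 ng/mL), the 10% bias limit, and the observed values for Sample 1 on the first assay come from the text and Table 6.3; the function and variable names are illustrative, and requiring the observed value to sit inside the AMR is an added assumption of this sketch.

```python
AMR = (100.0, 5000.0)    # ng/mL, analytical measurement range from the text
ACCEPTABLE_BIAS = 10.0   # percent, from the Table 6.3 footnote


def evaluate_dilution(observed, dilution_factor, target):
    """Return (reportable, % bias, acceptable) for one diluted measurement."""
    reportable = observed * dilution_factor
    bias = (reportable - target) / target * 100.0
    in_amr = AMR[0] <= observed <= AMR[1]
    return reportable, round(bias, 1), in_amr and abs(bias) <= ACCEPTABLE_BIAS


# Sample 1 on the first assay (with hook); target 50,000 ng/mL.
observed_by_factor = {1: 749, 5: 2018, 10: 4623, 100: 507}
for factor, observed in observed_by_factor.items():
    print(factor, evaluate_dilution(observed, factor, target=50_000))
```

The undiluted reading of 749 ng/mL looks like a valid in-range result even though the bias is −98.5%, which is why the hook effect can only be exposed by comparing against diluted aliquots.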

6.3.6

Can dilution eliminate matrix effect?

Matrix effect cannot be eliminated entirely, but it can be minimized or compensated for by employing a combination of various measures, for example, sample dilution, extraction, use of matrix-matching calibrators, or use of the isotopically labeled analog of the target analyte as an internal standard (IS) in MS-based assays. The latter is considered the most effective way to account for matrix effects because the matrix, theoretically, exerts the same degree of ion suppression or enhancement on the target analyte and the IS [62]. In some situations, the matrix effect issue cannot be resolved, and the lab has to find a different methodology or technology that is not liable, or is more tolerant, to a specific interferent; reject samples with interference, for example, highly hemolyzed, icteric, lipemic, or turbid samples; or, if a sample is irreplaceable, report the biomarker result with a comment to indicate the expected margin of error. These situations include those where no IS or control is applicable, where extraction is infeasible, as with most biological markers, and where samples cannot be diluted. Dilution alone may not be a solution, at least in some cases, for the following two reasons:

• Diluting a sample dilutes the interferent but also dilutes the target analyte, that is, the two reactants (target and interferent) stay at the same ratio, and unless the assay dynamics prefer the target analyte over the interferent, the latter will keep interfering with the reaction, especially if the interferent is more abundant than the target analyte.
• Apart from any interferent, matrix integrity (neat formulation) can be critical for a reaction, for example, in the determination of plasma coagulation factors, enzyme activities, ionized calcium, and protein-free (unbound) hormones. In these cases, diluting samples in buffer, saline, or water alters the results significantly. While the manipulated matrix may not be similar to the patient's native one, analyte-depleted or analyte-inactivated plasma can be tried as a diluent in such cases [83,84].

6.3.7

Matrix interference in immunohistochemistry

Like quantitative assays, tissue staining-based technologies, IHC and in situ hybridization (ISH), can suffer from interference from a matrix component. To rule out cross-reactivity of a primary or secondary antibody with nontargeted antigen(s) or a matrix component, testing selectivity is a crucial step in IHC assay development. This can be easily performed through a peptide (antigen) blocking experiment in which a couple of tissue and/or cell line sections are incubated with the primary antibody in the presence of an excess molar amount of the targeted biomarker recombinant protein or peptide, which can abolish or decrease the stainability of the sections. Replicate slides should be stained side-by-side but without the blocking protein. Fig. 6.5A shows micrographs from two sets of slides from two human renal cell carcinomas and the CAKI-1 cell line, known to express the target (name is confidential); the upper set is without and the lower set is with the blocking protein. To address the argument that the blocking may not be specific but could be due to protein aggregation that affects the accessibility of the antibody to the target in the tissue, similar sets of slides were treated similarly but using an irrelevant antibody/assay. Fig. 6.5B shows micrographs from replicate sections with or without the same type and amount of blocking protein used in the first set, but the antibody/assay was for Ki67. As demonstrated by the two figures and Table 6.4, the target-specific peptide was able to completely or significantly eliminate the stain for the target but had no impact on Ki67. This study ruled out nonspecific binding, which was not the case in other studies (data not shown), where we had to change the primary antibody in one study and reduce the concentration of antibody in another.

6.4

Precision and accuracy

As defined in the previous chapter, precision is the closeness of assayed analyte (biomarker) values from a single sample to each other, and accuracy is the closeness of the assayed value to the targeted (also known as nominal or theoretical) value, if it is known.


FIGURE 6.5 Data from IHC study to verify specificity of binding. Replicate sections from three samples (two human renal cell carcinoma biopsies and one cell line) were stained with an antibody directed to the target protein (Panel A) or with Ki67 antibody (Panel B) with or without excess molar concentration of the target protein. Incubation with the target protein did not affect Ki67 but abolished the target biomarker staining. IHC, Immunohistochemistry

6.4.1

Precision and accuracy requirements in the FDA bioanalytical guidance

FDA bioanalytical guidelines [26] allow the use of QCs freshly prepared in the clinical sample matrix to assess the precision and accuracy of an assay. As per the guidelines, for chromatographic assays, the sponsor should analyze four QC levels (LLOQ, low, medium, and high QC), in ≥ 5 replicates per QC level on at least three independent calibrated runs. Within-run and between-run accuracy should be ± 15% of nominal concentrations, except ± 20% at LLOQ. Within-run and between-run precision should be ≤ 15% coefficient of variation (CV), except at LLOQ where 20% can be acceptable. The guidelines do not imply the total error concept.


TABLE 6.4 Percentage of stained cells and H-score in the immunohistochemistry selectivity experiment.

                                      RCC1                      RCC2                      CAKI-1
                                      % stained    H-Score      % stained    H-Score      % stained    H-Score
                                      cells                     cells                     cells
Target assay with no peptide block    100          300          75           210          98           178
Target assay with peptide block       33           36           0            0            0            0
Ki67 assay with no peptide block      7.5          15.3         4.1          9.9          65           177
Ki67 assay with peptide block         7.5          15.3         3.6          8.5          65           175

Probably because chromatographic procedures are, in general, more robust than ligand-binding assays (LBA), the FDA guidance requires more QC levels and more runs for LBA, although with fewer replicates per level, and makes the acceptance criteria more flexible. The guidance indicates five QC levels per run (LLOQ, low, medium, high, and ULOQ), and ≥ 3 replicates per QC level on at least six independent calibrated runs. Within-run and between-run accuracy should be ± 20% of nominal concentrations, except at LLOQ and ULOQ where 25% can be acceptable. Within-run and between-run precision should be ≤ 20% CV, except at LLOQ and ULOQ where 25% can be acceptable. It is not clear why total error is applied here but not in chromatographic assays; for LBA it should be ± 30%, except at LLOQ and ULOQ where it should be ± 40%. Note: Acceptable limits for CV% were listed in the guidance in terms of "±" but "≤" might have been meant, and that is what is used here.
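The accuracy and precision limits summarized above for the two assay classes can be expressed as a small lookup, sketched here in Python. The limits are the ones quoted in this section; the data structure, function names, and QC replicate values are hypothetical, and the LBA total-error criterion is deliberately left out of the check.

```python
from statistics import mean, stdev

# Limits (percent) as summarized in the text; applied to both |bias| and CV.
LIMITS = {
    "chromatographic": {"default": 15.0, "edge": 20.0, "edge_levels": {"LLOQ"}},
    "lba": {"default": 20.0, "edge": 25.0, "edge_levels": {"LLOQ", "ULOQ"}},
}


def qc_summary(values, nominal):
    """Return (% bias of the mean from nominal, CV%) for one QC level."""
    avg = mean(values)
    cv = stdev(values) / avg * 100.0
    bias = (avg - nominal) / nominal * 100.0
    return bias, cv


def qc_acceptable(values, nominal, level, assay_type):
    """Check one QC level against the accuracy and precision limits."""
    limits = LIMITS[assay_type]
    limit = limits["edge"] if level in limits["edge_levels"] else limits["default"]
    bias, cv = qc_summary(values, nominal)
    return abs(bias) <= limit and cv <= limit


# Hypothetical QC replicates with a nominal concentration of 10 units.
replicates = [11.8, 12.0, 11.6, 11.9, 11.7]   # mean 11.8, about 18% bias
print(qc_acceptable(replicates, 10.0, "Low", "chromatographic"))
print(qc_acceptable(replicates, 10.0, "LLOQ", "chromatographic"))
print(qc_acceptable(replicates, 10.0, "Low", "lba"))
```

The same replicates fail as a low QC on a chromatographic assay but would pass at the LLOQ, or at any level of an LBA, which illustrates how much the class-specific limits change the outcome.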

6.4.2

Precision and accuracy requirements in laboratory guidelines

6.4.2.1 Clinical and Laboratory Standards Institute EP05

CLSI EP05 guidelines [85] recommend a more comprehensive study than what is indicated in the FDA bioanalytical guidance to assess precision: repeatability (within-run), within-lab interrun reproducibility, and between-lab reproducibility. The guidelines recommend five or more matrix-matched samples at different levels for assays with a wide AMR and multiple medical decision points, but three levels may suffice for assays with a narrow AMR and one decision point. For repeatability and within-lab reproducibility, EP05 recommends two runs per day for 20 different days with each sample analyzed in duplicate (2 × 2 × 20 design), or one run per day, alternating between morning and afternoon, on 20 different days in triplicate (1 × 20 × 3 design). For multisite (interlab) reproducibility, EP05 recommends at least three sites, with each site analyzing the samples in five replicates on five runs on 5 different days, alternating between morning and afternoon runs (3 × 5 × 5 design), or in three replicates on two runs/day for 5 days (3 × 2 × 5 × 3 design). If recognized, no more than two outliers can be excluded from a sample's 60–80 reads, no more than 1% of reads can be excluded from the whole lab multisample study, and no more than 1 read/site if ≤ 4 samples, or no more than 2 if > 4 samples, are used in a multisite study.

6.4.2.2 Clinical and Laboratory Standards Institute EP15

CLSI EP15 [86], which is meant for the verification of precision and accuracy against the manufacturer's claims, indicates that at least two samples, preferably individual patient samples, pools of samples, or commercial QC, are to be used. Samples should have different concentrations, preferably representing medical decision points (cutoffs) or reference limits, or simply falling in normal and abnormal regions, and are analyzed on five runs performed on 5 days with five replicates in each run. If recognized, no more than one outlier/sample and no more than 2/study can be excluded. If samples with known concentrations are available, for example, certified reference materials, survey materials, QC/calibrators, patient samples repeatedly measured over a substantial period of time in one or more labs, or material analyzed by a reference method, the results can be used to assess accuracy.

6.4.2.3 Suggested acceptance criteria

Neither EP05 nor EP15 defines acceptance criteria, and we recommend adjudicating results in the light of the TAE for an assay as explained in the previous chapter. If the TAE has not already been set, results from the precision and accuracy study can be used to establish the TAE of a test.


FIGURE 6.6 A scenario to show intra- and interlab precision and accuracy of a quantitative assay per common guidelines. Results from three samples are listed on Table 6.5, but this figure demonstrates results from one of the samples, with target concentration of 1 ng/mL of a biomarker, analyzed by three labs (Lab A, B, and C) on five different runs with each run included five replicates, according to the CLSI EP05 guidelines.
[Plot: y-axis, concentration (ng/mL), 0.8–1.2; series Lab A, Lab B, Lab C, and target value.]

From a general good practice perspective, repeatability and within-lab reproducibility can be acceptable if CV% values are within 0.25–0.33 of the TAE budgeted for an assay [87]. Since interlab reproducibility is expected to include all possible analytical variables, we recommend that the observed TAE should not exceed the budgeted limit. Also, % bias (percentage of difference of the average of assayed values from the target) can be acceptable if it does not exceed 0.5 of the TAE.

6.4.2.4 Illustrative example

Fig. 6.6 and Table 6.5 demonstrate results from one of three samples, with a target concentration of 1 ng/mL of a biomarker, analyzed by three labs (Lab A, B, and C) on five different runs, with each run including five replicates, according to the CLSI EP05 guidelines. Results from Lab A showed good distribution around the target value, which produced the lowest % bias (within-lab bias of 0.44%) compared to the other two labs, where Lab B showed consistent negative drift (within-lab bias of −6.00%) and Lab C showed consistent positive drift (within-lab bias of 9.16%). The overall (between-lab) % bias (1.20%, Table 6.5) was small compared to the % bias figures of Labs B and C. This indicates the importance of utilizing multiple labs, at least three, when establishing a reference material that can be used later as calibrator or QC, instead of relying on a single lab or even two labs. Results were repeatable in the three labs with within-run CV% of 2.15%–5.85%, and within-lab (interrun) CV% of 5.19%, 4.33%, and 4.14% in Lab A, B, and C, respectively, and the overall (between-lab) CV% was 7.65%. Within-lab TAE were 10.6%, 14.5%, and 17.3% in the three labs, which were smaller than the maximal TAE seen from individual runs within each lab. This observation indicates the importance of analyzing a sample on multiple runs instead of relying on a single run to assess the accuracy of an assay. The overall TAE was 16.2%.
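The TAE column of Table 6.5 is reproduced by |% bias| + 1.96 × CV%; that formula is inferred here from the tabulated numbers, not quoted from this section. The Python sketch below recomputes the within-lab and interlab TAE and applies the suggested acceptance fractions against a budgeted TAE of 20%, which is a hypothetical budget chosen only for illustration.

```python
def total_analytical_error(bias_pct, cv_pct, z=1.96):
    """TAE as |bias| + z * CV (formula inferred from Table 6.5)."""
    return abs(bias_pct) + z * cv_pct


def within_lab_acceptable(cv_pct, bias_pct, tae_budget,
                          cv_fraction=0.33, bias_fraction=0.5):
    """Suggested criteria: CV within 0.25-0.33 of TAE, |bias| within 0.5 of TAE."""
    return (cv_pct <= cv_fraction * tae_budget
            and abs(bias_pct) <= bias_fraction * tae_budget)


# (CV%, % bias) from Table 6.5.
summary = {
    "Lab A": (5.19, 0.44),
    "Lab B": (4.33, -6.00),
    "Lab C": (4.14, 9.16),
    "Interlab": (7.65, 1.20),
}

TAE_BUDGET = 20.0  # percent, hypothetical budget for this illustration

for name, (cv, bias) in summary.items():
    tae = round(total_analytical_error(bias, cv), 1)
    print(name, tae, within_lab_acceptable(cv, bias, TAE_BUDGET))
```

With a tighter hypothetical budget of 15%, Lab C would fail on bias (9.16% against a 7.5% allowance) even though its CV is the lowest of the three labs.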

6.4.3

Precision and accuracy of qualitative assays

As indicated in the CLSI EP12 guidelines [88], a precision study for a qualitative test should provide an estimate of the imprecision at analyte concentrations around the defined positive/negative cutoff. It is not appropriate to assess imprecision with low-negative or high-positive samples since the analyte concentration will be far from the medical decision point. Unlike quantitative assays, where medical decision point(s) are independent from an assay analytical cutoff (LOD or LOQ), the detection cutoff for a binary (positive/negative) qualitative assay is considered the medical decision point. For a qualitative readout based on underpinning quantitative signals, for example, a gene variant PCR test, the following assumptions, depicted by Fig. 6.7, are made:

• If a concentration of an analyte at an assay designated cutoff is analyzed several times, around 50% of results will be positive (≥ the cutoff) and around 50% of results will be negative (< the cutoff). The cutoff is called C50.
• Concentration at 5% (C5), which can be calculated as [Cutoff − (1.645 × SD)], and 95% (C95), which can be calculated as [Cutoff + (1.645 × SD)], will produce around 5% and 95% positive results, respectively.
• Concentrations > C95 will be consistently positive and concentrations < C5 will be consistently negative.
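The C5 and C95 formulas in the list above translate directly into code. In this Python sketch the cutoff of 20 ng/mL matches the C50 used in the Fig. 6.7 example, while the SD of 2 ng/mL is a hypothetical value chosen only to make the arithmetic easy to follow.

```python
def c5_c95(cutoff, sd, z=1.645):
    """Return (C5, C95) as cutoff -/+ z * SD, per the formulas in the text."""
    return cutoff - z * sd, cutoff + z * sd


c5, c95 = c5_c95(cutoff=20.0, sd=2.0)
print(round(c5, 2), round(c95, 2))
```

Because the interval is symmetric around the cutoff in this formulation, doubling the SD doubles the width of the C5–C95 interval.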


TABLE 6.5 Within-run, within-lab (interrun) and interlab imprecision, bias, and TAE from one sample analyzed by three labs in five replicates on five runs.

                        Average    SD      CV%     % Bias    TAE
Lab A repeat, Run 1     1.02       0.06    5.85    1.60      13.1
Lab A repeat, Run 2     1.00       0.05    5.30    0.40      10.8
Lab A repeat, Run 3     1.05       0.04    3.64    5.20      12.3
Lab A repeat, Run 4     0.96       0.04    4.02    −4.40     12.3
Lab A repeat, Run 5     0.99       0.03    3.07    −0.60     6.6
Lab A within-lab        1.00       0.05    5.19    0.44      10.6
Lab B repeat, Run 1     0.94       0.05    4.83    −5.80     15.3
Lab B repeat, Run 2     0.96       0.04    3.90    −4.00     11.6
Lab B repeat, Run 3     0.95       0.03    3.07    −5.00     11.0
Lab B repeat, Run 4     0.94       0.03    3.39    −5.80     12.4
Lab B repeat, Run 5     0.91       0.05    5.55    −9.40     20.3
Lab B within-lab        0.94       0.04    4.33    −6.00     14.5
Lab C repeat, Run 1     1.10       0.03    2.36    10.40     15.0
Lab C repeat, Run 2     1.06       0.04    3.37    5.80      12.4
Lab C repeat, Run 3     1.05       0.04    3.60    4.80      11.8
Lab C repeat, Run 4     1.11       0.02    2.15    11.20     15.4
Lab C repeat, Run 5     1.14       0.04    3.21    13.60     19.9
Lab C within-lab        1.09       0.05    4.14    9.16      17.3
Interlab                1.01       0.08    7.65    1.20      16.2

CV, Coefficient of variation; SD, standard deviation.

FIGURE 6.7 A scenario for decision points in assessing precision and accuracy of a qualitative assay per common guidelines. Figure depicts an example of analyte with C50 (the value where a binary examination declares a specimen to be positive 50% of the time) at 20 ng/mL, C5 (the value where a binary examination declares a specimen to be positive 5% of the time) at 8 ng/mL and C95 (the value where a binary examination declares a specimen to be positive 95% of the time) at 30 ng/mL.
[Plot: x-axis, concentration (ng/mL), 0–40; y-axis, % positive results, 0–100; C5, C50, and C95 marked.]

The width of the C5–C95 interval correlates with the amount of expected variability in an assay, that is, the wider the range, the more variable the assay is. Samples with concentrations within the interval between C5 and C95 are used to estimate a qualitative assay imprecision. Fig. 6.7 depicts an example of analyte with C50 at 20 ng/mL, C5 at 8 ng/mL, and C95 at 30 ng/mL. To assess precision for this assay, neat or contrived samples at or close to these concentrations can be analyzed in 40 replicates performed on different days. Considering the 95% confidence interval, if the assay is precise, 14–26 replicates (35%–65%) of C50 samples will be positive, 35–40 of the 40 replicates at the C95 will be positive, and 35–40 of the 40 replicates at C5 will be negative.
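The replicate-count criteria above can be applied mechanically, as in this Python sketch. The acceptance ranges are the ones stated in the text for 40 replicates; the function name and the example counts are hypothetical, and the ranges would need to be re-derived for any other replicate number.

```python
N_REPLICATES = 40

# Acceptable numbers of POSITIVE calls out of 40, taken from the text.
EXPECTED_POSITIVES = {
    "C5": range(0, 6),      # 35-40 negatives means 0-5 positives
    "C50": range(14, 27),   # 14-26 positives (35%-65%)
    "C95": range(35, 41),   # 35-40 positives
}


def qualitative_precision_ok(positives_by_level):
    """Map each tested level to True/False against the 40-replicate criteria."""
    results = {}
    for level, positives in positives_by_level.items():
        if not 0 <= positives <= N_REPLICATES:
            raise ValueError(f"{level}: positives must be between 0 and {N_REPLICATES}")
        results[level] = positives in EXPECTED_POSITIVES[level]
    return results


print(qualitative_precision_ok({"C5": 3, "C50": 21, "C95": 38}))
```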

6.5

Method comparison

Comparing an assay under development to a reference, that is, an already existing test, is a very common component of assay validation, especially to prove substantial equivalence for an FDA 510(k) submission or to transfer a method from one lab to another. A similar approach can be applied to investigate the utility of a sample matrix different from one that has already been validated, for example, to see if plasma can be an alternative matrix to serum, or for instrument-to-instrument comparison in a lab employing more than one instrument for the same assay.

6.5.1

Study setup for quantitative method comparison

Method comparison for quantitative assays is typically performed on 40–100 or more samples covering as much as possible of the AMR of the predicate or, preferably, the assay with the wider AMR. If native samples cannot cover the AMR, contrived samples can be used as a supplement, as was previously done for parathyroid hormone [89]. Additional information can be found elsewhere [90,91], but we will use the examples mentioned later to highlight some important points. While CLSI guidelines list 40 different patient specimens as the minimum number of samples to be tested, as indicated by others [92], even 20 specimens with target concentrations that cover the AMR will likely provide better information than a hundred specimens that are randomly selected and cover only a limited range. Split samples should be treated similarly in terms of storage, shipping, and handling conditions until tested simultaneously by the two procedures, generally within 2 h of each other as recommended by CLSI [91]. To avoid the impact of the imprecision of each assay on the results, samples can be analyzed in duplicate or triplicate, and averages from the two assays can be used for comparison. To avoid the impact of an error, for example, improper calibration, that can affect all samples on a given run, it is recommended to analyze samples by the two procedures on at least 5 days [91,92].

6.5.2

Data analysis and interpretation from quantitative method comparison

For quantitative assays, regression analysis, as a measure of correlation between the two methods, alone or with the calculation of the average difference or average percent difference, as is commonly done, is not enough; a difference plot, also known as a Bland–Altman (B&A) plot [93,94], should be used as well. A B&A plot is a simple way to demonstrate possible bias of one method (newly validated or transferred, if applicable) from the other (reference or predicate) by plotting the percent difference on the y-axis against the average of the two methods on the x-axis. If the first method is a standard or reference method, its values can be used on the x-axis instead of the average of the two measurements [95]. Fig. 6.8 demonstrates five hypothetical scenarios of method comparison results to show the value of investigating the data from different angles to distinguish different possible outcomes. As shown by the regression lines, method A seems to correlate well with method B, with R² higher than 0.99 in all five models.

1. Fig. 6.8A is a good example of comparable methods, with an R² of 0.993, percent differences well distributed around 0, an average percent difference of 0.5%, and all percent differences within ±8%.
2. Fig. 6.8B is an example of a comparison study in which R² is still 0.993, as in the previous one, and the two methods are comparable in the range of 100–1200 units, but percent differences were at or above 20% at lower levels. This pattern indicates that the two methods, or at least one of them, are not sensitive enough to quantify the biomarker precisely at concentrations lower than 100 units. As shown in Table 6.6, percent differences between the two methods range between -22.2% and 27.2%, but if the assay LLOQs are set at 100 units, differences will be within ±8%, similar to Fig. 6.8A.
3. Fig. 6.8C is a model for assays that are comparable at low concentrations but start to drift above a certain concentration, that is, the behavior is opposite to that of Fig. 6.8B. Table 6.6 shows that assay B can differ from assay A by up to around -16%, but if concentrations above 900 units are eliminated, differences will be within ±8%, similar to Fig. 6.8A.
4. Fig. 6.8D is another example where R² can be misleading. This scenario shows a consistent positive drift of method B in the range of 10.4%–26.0% with an average of about 19%. The drift is likely caused by a calibration issue that can be verified by cross-calibration and then fixed by adjusting the calibrators.
5. Fig. 6.8E is a model where method B drifts positively from method A at low concentrations and negatively at high concentrations. This scenario is the worst and may indicate different reaction kinetics; the assays have to be thoroughly investigated before either of them can be used (unless method A has been documented as a reference).

FIGURE 6.8 Five hypothetical scenarios of method comparison results. Panels (A)–(E) plot method B against method A with the fitted regression line; panels (A')–(E') plot the percent difference of B from A against the average of methods A and B. Regression lines: (A) y = 1.0011x - 0.6391, R² = 0.9928; (B) y = 1.0016x - 1.2797, R² = 0.9927; (C) y = 0.8981x + 26.623, R² = 0.9912. The two methods are highly correlated, with R² > 0.99 in the five models. When adjudicated in terms of percent difference between paired results from the two methods, Fig. 6.8A is a good example in terms of the range and distribution of percent differences around zero. Fig. 6.8B is an example of methods that are comparable in the range of 100–1200 units. Fig. 6.8C is a model for assays that are comparable at low concentrations but start to drift above a certain concentration. Fig. 6.8D shows a consistent positive drift of method B. Fig. 6.8E is a model where method B drifts positively from method A at low concentrations and negatively at high concentrations.
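To make the point about R² concrete, the sketch below computes R² and the percent differences for a pair of methods with a purely proportional bias. The paired values are invented (they are not the data behind Fig. 6.8), the function names are hypothetical, and dividing the difference by the x-axis value (the average, or method A when it is the reference) is an assumed convention.

```python
from statistics import mean


def r_squared(a, b):
    """Squared Pearson correlation between paired results."""
    mean_a, mean_b = mean(a), mean(b)
    sxy = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    sxx = sum((x - mean_a) ** 2 for x in a)
    syy = sum((y - mean_b) ** 2 for y in b)
    return sxy ** 2 / (sxx * syy)


def percent_differences(a, b, a_is_reference=False):
    """Return (x, y) points for a percent-difference plot of B from A.

    x is the average of the two methods, or the value from A when A is a
    reference method; y is 100 * (B - A) / x (an assumed convention).
    """
    points = []
    for va, vb in zip(a, b):
        x = va if a_is_reference else (va + vb) / 2
        points.append((x, 100 * (vb - va) / x))
    return points


def summarize(points):
    """Average, minimum, and maximum percent difference."""
    diffs = [y for _, y in points]
    return {"average": mean(diffs), "minimum": min(diffs), "maximum": max(diffs)}


method_a = [50, 100, 200, 400, 600, 800, 1000, 1200]
method_b = [1.2 * value for value in method_a]  # invented 20% proportional bias

print("R2 =", round(r_squared(method_a, method_b), 4))
print(summarize(percent_differences(method_a, method_b)))
```

With these invented values the correlation is essentially perfect while every paired result differs by roughly 18% of the pair average, which is the kind of pattern described for Fig. 6.8D.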

Biomarkers, Diagnostics and Precision Medicine in the Drug Industry

FIGURE 6.8 Continued. Panels (D) and (E) plot method B against method A with the fitted regression line: (D) y = 1.2013x - 0.767, R² = 0.9928; (E) y = 0.9062x + 28.055, R² = 0.9957. Panels (D') and (E') plot the percent difference of B from A against the average of methods A and B.

TABLE 6.6 Average, minimum, and maximum percentage of difference of method B (comparative) from method A (predicate) in the above listed scenarios.

% Difference | Fig. 6.8A | Fig. 6.8B | Fig. 6.8C | Fig. 6.8D | Fig. 6.8E
Average | 0.5 | 0.2 | 0.6 | 18.7 | 3.4
Minimum | -7.8 | -22.2 | -15.9 | 10.4 | -12.4
Maximum | 7.9 | 27.2 | 7.9 | 26.0 | 11.3

6.5.3

Study setup for qualitative method comparison

Twenty, or ideally 40 or more, specimens with an approximately equal distribution of negative and positive samples should be tested. CLSI EP05 [85] suggests at least 50 positive and 50 negative samples. As discussed under Section 6.4, samples should cover the assay dynamic range with enough representation around the assay cutoff. Samples are tested by the two methods over multiple runs. To reduce the impact of the imprecision of each assay on the results, samples can be analyzed in duplicate or triplicate, and the positivity or negativity of each sample should be determined first from the replicates of each assay before results from the two assays are compared. For example, if one replicate of a sample tests negative and the other tests positive on one assay, the result should be adjudicated and confirmed first in a blinded fashion, that is, before looking at results from the other assay. If duplicate results remain equivocal, the College of American Pathologists [96] allows equivocal results to be eliminated from the comparison.
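The order of operations described above (resolve each assay's replicates first, then compare) can be sketched as follows. This is an illustration only: the names `consensus` and `two_by_two` and the four samples are hypothetical, and a discordant replicate pair is simply labeled equivocal here, whereas in practice it would first be adjudicated and retested.

```python
from collections import Counter


def consensus(replicates):
    """Return the shared call if all replicates agree, else 'equivocal'."""
    calls = set(replicates)
    return calls.pop() if len(calls) == 1 else "equivocal"


def two_by_two(candidate, predicate):
    """Tally (candidate call, predicate call) pairs, excluding equivocal samples."""
    table = Counter()
    excluded = []
    for sample_id in candidate:
        # Each call is resolved from its own replicates only, without
        # looking at the result from the other assay.
        cand_call = consensus(candidate[sample_id])
        pred_call = consensus(predicate[sample_id])
        if "equivocal" in (cand_call, pred_call):
            excluded.append(sample_id)
        else:
            table[(cand_call, pred_call)] += 1
    return table, excluded


candidate = {"S1": ["pos", "pos"], "S2": ["neg", "neg"],
             "S3": ["pos", "neg"], "S4": ["neg", "neg"]}
predicate = {"S1": ["pos", "pos"], "S2": ["neg", "neg"],
             "S3": ["pos", "pos"], "S4": ["pos", "pos"]}

table, excluded = two_by_two(candidate, predicate)
print(dict(table), excluded)
```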


TABLE 6.7 Data from a qualitative test comparison study.

Candidate method | Predicate positive | Predicate negative | Subtotal of rows
Positive | 46 | 0 | 46
Negative | 4 | 50 | 54
Subtotal of columns | 50 | 50 | Grand total: 100

Similar to the quantitative method comparison, it is recommended to analyze samples by the two procedures on multiple days.

6.5.4

Data analysis and interpretation from qualitative method comparison

Results from the two assays are entered into what is commonly known as a 2 × 2 table, as in the following example in which 50 positive and 50 negative samples were used (Table 6.7). The concordance between the two methods is estimated using the following three parameters:

• Percent positive agreement = 46/50 × 100 = 92%
• Percent negative agreement = 50/50 × 100 = 100%
• Percent total agreement (PTA) = 96/100 × 100 = 96%

Acceptance criteria can be based on statistical analysis, as suggested by CLSI EP05 [85], which gives a confidence interval of 78%–97% for the PTA if the assays are 90% sensitive and 90% specific. Others consider different figures between 80% and 95%, or even lower; for example, overall concordance was preset at 75% in one CDx submission [106]. However, as recommended throughout this book, acceptance criteria should be determined for each assay in light of the total error that can be tolerated without significant impact on the medical decision.
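The three agreement figures, together with an interval around the PTA, can be computed as in the sketch below, which uses the counts from Table 6.7. The Wilson score interval is our choice for illustration; it is not necessarily the calculation behind the CLSI figure quoted above, and the function and argument names are hypothetical.

```python
import math


def agreement(both_pos, cand_pos_pred_neg, cand_neg_pred_pos, both_neg):
    """Percent positive, negative, and total agreement from a 2 x 2 table."""
    total = both_pos + cand_pos_pred_neg + cand_neg_pred_pos + both_neg
    return {
        "PPA": 100 * both_pos / (both_pos + cand_neg_pred_pos),
        "NPA": 100 * both_neg / (both_neg + cand_pos_pred_neg),
        "PTA": 100 * (both_pos + both_neg) / total,
    }


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z = 1.96 for about 95%)."""
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half_width, centre + half_width


# Counts from Table 6.7: 46 concordant positives, 0 candidate-only positives,
# 4 candidate-only negatives, 50 concordant negatives.
result = agreement(46, 0, 4, 50)
low, high = wilson_interval(46 + 50, 100)
print(result)
print(f"PTA interval: {100 * low:.1f}% to {100 * high:.1f}%")
```

Reporting the interval next to the point estimate makes it easier to judge whether a preset acceptance limit is met with the number of samples actually tested.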

6.6

Carryover

As per FDA bioanalytical guidance, carryover is defined as the appearance of an analyte in a sample from a preceding sample. In other words, it is the transfer of an analyte from a sample to the next. Carryover can be a systematic or random issue.

6.6.1

Random carryover

Carryover can happen randomly, especially in techniques that include signal amplification, as in manual PCR if the amplification plate or tubes are not well sealed. Unfortunately, the possibility of random carryover cannot be verified, but proactive measures should be taken to decrease its probability, for example, by assuring a complete plate seal and avoiding high concentrations of plasmid DNA or reference RNA in reaction wells close to patient samples.

6.6.2

Systematic carryover

This issue is usually encountered in continuous analysis systems, mainly chromatographic techniques, for example, LC-MS, but also in automated chemistry analyzers. It is caused by a contaminated autosampler in both techniques or by a contaminated chromatographic column in LC-MS. This type of carryover should be evaluated as part of method validation.

FIGURE 6.9 Results from carryover studies. Sample values (ng/mL) and blank values (ng/mL) are presented on the primary and secondary y-axes, respectively. The dashed orange line represents the assay LLOQ. Panel (A) shows no carryover, but panel (B) shows carryover in an exploratory experiment. Panel (C) shows carryover from serial analyte concentrations (neat sample and dilutions Dil 1 to Dil 4, with blank and LLOQ values on the secondary axis).

6.6.2.1 Screening of carryover

Carryover can be explored in different ways, the simplest of which is to insert a blank matrix sample (with no detectable analyte) immediately before and after a neat or contrived sample that contains the analyte at or above the highest concentration that may be encountered in clinical samples. Fig. 6.9 depicts results from a carryover study for a bioanalytical assay with an AMR of 50–5000 ng/mL. If no analyte is detected in the second blank, which is preceded by a sample at twofold the ULOQ (Fig. 6.9A), carryover is excluded; if carryover is seen in the second blank compared with the blank run before the sample (Fig. 6.9B), the effect should be titrated.

6.6.2.2 Defining tolerable level

A similarly concentrated sample should be diluted to different levels, and the dilutions, as well as the neat sample, should be analyzed with blanks in between to determine the highest analyte concentration at which no carryover is observed. Any detectable analyte in the blank tube, that is, at or above the method LOD, is considered carryover, but per the FDA bioanalytical guidance, carryover is acceptable as long as it does not exceed 20% of the LLOQ [26]. Fig. 6.9C shows that the analyte was not detectable in the blanks after the two lowest concentrations but was measurable in the blanks after the three highest concentrations. We recommend repeating the experiment at least two more times. If the results are reproducible, then, following the FDA acceptance criterion and given that the assay LLOQ is 50 ng/mL, 2500 ng/mL should be set as the maximal tolerable concentration in this scenario.
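The decision rule in this scenario can be written down as a short sketch. The sample and blank values below are read from the labels that survive from Fig. 6.9C, and their pairing is our assumption; the function name and the choice to stop at the first failing concentration are illustrative rather than prescribed.

```python
def max_tolerable_concentration(pairs, lloq, fraction_of_lloq=0.20):
    """Highest tested concentration whose following blank stays within the limit.

    pairs: (sample concentration, reading of the blank run right after it).
    Scanning upward stops at the first concentration that fails, so a pass
    above a failing level is not accepted.
    """
    limit = fraction_of_lloq * lloq
    tolerable = None
    for concentration, blank in sorted(pairs):
        if blank > limit:
            break
        tolerable = concentration
    return tolerable


# Values as they can be read from the surviving labels of Fig. 6.9C (ng/mL);
# the pairing of samples and blanks is an assumption.
fig_6_9c = [(10000, 186), (5000, 55), (2500, 8), (1000, 0), (500, 0)]
print(max_tolerable_concentration(fig_6_9c, lloq=50))
```

With an LLOQ of 50 ng/mL the limit is 10 ng/mL, so the blank reading of 8 ng/mL after the 2500 ng/mL sample passes and the reading of 55 ng/mL after the 5000 ng/mL sample does not.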

6.6.2.3 Management of carryover

If carryover cannot be eliminated by measures such as introducing a wash step after each sample, lengthening the wash step, or inserting a blank after each sample, the highest concentration that does not cause carryover should be set as the highest tolerable concentration.


In operation, if a sample has a level that exceeds the tolerable threshold and the following sample is quantifiable, the sample exceeding the tolerable concentration should be reanalyzed after proper dilution, and the sample(s) that might have been affected should be reanalyzed after blank(s) to differentiate between actual values and false detections.
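As a rough illustration of this operational rule, the sketch below scans a run in injection order and lists the samples to dilute and the samples to repeat after a blank. It checks only the sample immediately after a high one, which is a simplification; the function name, sample IDs, and concentrations are invented.

```python
def flag_for_reanalysis(run, tolerable, lloq):
    """Flag samples according to the carryover rule.

    run: list of (sample_id, concentration) in injection order.
    Returns (samples to reanalyze after dilution,
             samples to reanalyze after a blank).
    """
    to_dilute, to_repeat_after_blank = [], []
    for position, (sample_id, concentration) in enumerate(run):
        if concentration <= tolerable or position + 1 >= len(run):
            continue
        next_id, next_concentration = run[position + 1]
        if next_concentration >= lloq:  # the following sample is quantifiable
            to_dilute.append(sample_id)
            to_repeat_after_blank.append(next_id)
    return to_dilute, to_repeat_after_blank


# Invented run; tolerable concentration 2500 ng/mL and LLOQ 50 ng/mL
# as in the scenario above.
run = [("S1", 300), ("S2", 4200), ("S3", 75),
       ("S4", 3000), ("S5", 0), ("S6", 120)]
print(flag_for_reanalysis(run, tolerable=2500, lloq=50))
```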

6.7

Stability

Biomarker stability is a crucial component of assay validation. During bioanalytical method development [26], the FDA requires the sponsor to determine the stability of the analyte in a given matrix and to investigate the effects of sample collection procedure, handling, and storage on the analyte. Typically, biomarker stability should be tested under all environmental conditions (temperature, humidity, and light) to which clinical samples are expected to be exposed, whether planned or accidental, mainly during the preanalytical phase but also during testing.

6.7.1

Conditions for stability testing

• Sample acquisition, for example, to investigate the duration of ischemia while obtaining a tumor biopsy [97–99], the impact of prolonged tourniquet application on blood biomarkers [100], or the impact of anticoagulants, preservatives, or other additives used for the collection of blood or other body fluids.
• Sample processing, for example, tissue biopsy fixation and embedding, or the effect of blood sample centrifugation time and speed on blood components.
• Sample shipping and handling, if samples will be transported or shipped to a lab outside the sample collection facility, which is the case for most clinical trial samples. This assessment is very important if samples are expected to be exposed to extreme environmental conditions. The stability study should be conducted under the worst possible conditions to which samples may be exposed. For example, if samples will be shipped from the east coast to a lab located in California in summer, transportation time can be up to 24 h (assuming flight delays or a connecting flight in addition to ground transportation) and samples may be exposed to up to 45 °C for a few hours. Ideally, a temperature-recording probe should be shipped in a sample shipping package, and the recorded conditions (different temperatures for different durations) should be mimicked in the stability study.
• Lab bench and on-board stability: stability testing should target the longest time a sample may be left on the bench or on an instrument before it is analyzed.
• If samples can be exposed to freezing temperatures during transportation or before analysis, for example, in batch analysis, stability should be tested for multiple freeze-thaw cycles.
• Short-term stability: if samples are not analyzed as soon as they are received, especially if a lab may receive samples late on Friday for Monday analysis, and also to account for repeat analysis if needed, sample stability should be tested for up to 7 or more days.
• Long-term stability: if samples will be batched, as in most clinical trials. Long-term stability is crucial, especially if samples may be used after the end of a clinical trial when a bridging study is needed for a CDx.

6.7.2

Important points to consider in conducting a stability study

To be able to rule out confounding factors, it is important to consider the following points:

• Do not test sample stability without knowing, or before establishing, test reagent stability. Ideally, use the same lot of reagents; if that is infeasible, assess lot-to-lot variability first.
• Do not test sample stability before assessing assay precision and, when interpreting stability data, do not confuse imprecision (random error) with instability.
• Use as many samples as is feasible, targeting at least six unless samples are difficult to obtain.
• Use the same sample matrix, with the same preservatives or other additives, that will be used for clinical samples. For example, do not use serum in the stability study if clinical samples will be plasma.
• Use samples from the targeted indication or disease. For example, do not use breast cancer samples if the clinical samples will be mainly from lung tumors.
• It is preferable to prescreen potential samples in order to select those with values within the assay dynamic range that represent the medical decision point(s). If samples with measurable levels are hard to obtain, contrived samples can be used.
• Stability results can be adjudicated in different ways, but drift, that is, a trend of decrease or increase, is the most important and the simplest observation to make.
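The distinction between imprecision and drift in the points above can be illustrated numerically. The sketch below is one possible way to look at it, not a prescribed method: it expresses each time point as a percentage of the fresh (baseline) result and fits a least-squares slope against storage time. Both series and all function names are invented.

```python
from statistics import mean, stdev


def cv_percent(values):
    """Coefficient of variation (%) across all time points."""
    return 100 * stdev(values) / mean(values)


def percent_of_baseline(values):
    """Each result as a percentage of the first (fresh) result."""
    return [100 * value / values[0] for value in values]


def slope(x, y):
    """Least-squares slope of y against x."""
    mean_x, mean_y = mean(x), mean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


months = [0, 1, 2, 3, 4, 6]
stable = [100, 103, 98, 101, 97, 102]   # scatter around baseline, no trend
drifting = [100, 98, 95, 90, 86, 80]    # steady decrease over time

for name, series in (("stable", stable), ("drifting", drifting)):
    trend = slope(months, percent_of_baseline(series))
    print(f"{name}: CV% = {cv_percent(series):.1f}, "
          f"slope = {trend:.2f}% of baseline per month")
```

A CV% summarizes overall spread but cannot tell a trend from scatter; the sign and size of the slope, read alongside the known assay imprecision, is what points to drift.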

6.7.3

Examples from stability studies

Fig. 6.10 presents data from the following examples of real stability studies, which reflect most of the points listed earlier.

• Fig. 6.10A1 and 6.10A2 show data from a serum biomarker stability study at -20 °C and -70 °C, respectively. Five samples with values within the AMR were tested fresh and after 2, 3, 4, and 7 days, 2 weeks, and 1, 2, 3, 4, 6, 8, 10, 12, and 15 months of freezing at -20 °C or -70 °C. As shown by the graphs and by the overall change, depicted as CV% in Table 6.8, the biomarker is stable under the two freezing conditions for 15 months.
• Fig. 6.10B1 and 6.10B2 show data from a second serum biomarker stability study at -20 °C and -70 °C, respectively. Six samples with values within the AMR were tested at the time points listed under Fig. 6.10A. The graphs show drops (marked by the red arrows) in all samples, especially the three samples with higher levels, under the two conditions at month 3. As shown in Table 6.8, imprecision (CV%) improved significantly when it was calculated for results up to month 2. A different kit lot was used starting from month 3, which seemed to cause this drop. Even with CV% limited to results up to month 2, the assay looked more imprecise than the Fig. 6.10A assay.
• Fig. 6.10C1 and 6.10C2 show data from a third serum biomarker stability study at -20 °C and -70 °C, respectively. Six samples with values within the AMR were tested at the time points listed under Fig. 6.10A. Both sets of samples show spikes at day 2, which were evidently due to a random error (imprecision). Otherwise, the -70 °C results were fairly consistent until the end of the study, but, as shown by the Fig. 6.10C1 graph and CV%, results from samples stored at -20 °C tended to decrease over time, and the drift was obvious from month 3. As shown in Table 6.8, limiting CV% to results up to month 2 improved the imprecision, but CV% for both sets of results was still higher than for the Fig. 6.10A biomarker, mainly because of the day 2 spike.
• Fig. 6.10D shows data from an IHC stability study. Four breast cancer, one xenograft, and one cell line FFPE blocks were cut into 4 μm sections on slides; one set of slides was analyzed fresh (within 48 h of cutting), and the rest were left at ambient temperature away from direct light. Sets of slides were analyzed after 1, 2, 4, 6, and 8 weeks, and 3, 4, 6, 8, 10, 12, 16, 20, and 24 months. As shown by the graph and CV%, results from the cell line stayed the same, results from the xenograft were reasonably precise, results from one breast cancer sample (BC1) could be acceptable for IHC assays, but results from the other three breast cancer samples were highly variable.
• Fig. 6.10E shows data from a similar IHC study for the same biomarker on the same assay at the same lab, but conducted on two NSCLC (LC) FFPE tissue samples analyzed in triplicate for up to 12 months. Solid lines with solid circles represent replicates from one sample, and dashed lines with open circles represent the second sample. The graph and CV% show less variability than the previous study.

The issue of high imprecision in IHC will be discussed in a later chapter, but regarding stability, the following two points can be emphasized here:

• Not only for IHC but in general, using samples with results at the extremes of an assay dynamic range, like the cell line in the Fig. 6.10D study, is misleading and discouraged.
• Variability seen in these two IHC examples cannot be attributed to stability but rather to imprecision of the technique.

6.8

Validation of FDA-approved/cleared test

Even for FDA-approved or cleared tests, assay performance should be validated by the end user (clinical lab) at the lab location where the test will be used for clinical sample analysis; that is, if a lab organization has more than one location, validating a test at one location is not enough to use the test at another location without local validation. However, for FDA-approved/cleared and unmodified tests, validation is much simpler than what is described earlier and, to reflect this simplicity, the process is usually called “verification” rather than validation. A lab is required to verify only four parameters: reportable range (linearity), accuracy, precision, and reference range, with a limited number of samples and replicates compared with non-FDA-approved/cleared or modified FDA-approved/cleared tests [96,101]. If an FDA-approved test is modified, full assay validation should be performed.

FIGURE 6.10 Results from six stability studies for three circulating biomarkers (at -20 °C and -70 °C) and two FFPE IHC studies for one biomarker at ambient temperature. Panels (A1)–(C1) are for circulating biomarkers at -20 °C, panels (A2)–(C2) are for circulating biomarkers at -70 °C, and panels (D) and (E) are for breast and NSCLC samples, respectively. Each line represents a sample (S) with results at different time points. Y-axis units: U/L and IU/mL (panels A and B), pg/mL (panels C1 and C2), and H-score (panels D and E). D, day; w, week; m, month. Red arrows indicate where the drop occurred.

TABLE 6.8 CV% from each set of time points for each sample. Column groups: biomarker in Fig. 6.10A; biomarker in Fig. 6.10B (all time points, and through month 2); biomarker in Fig. 6.10C (all time points, and through month 2); BC IHC; NSCLC IHC.

Row | A, -20 °C | A, -70 °C | B all, -20 °C | B all, -70 °C | B through month 2, -20 °C | B through month 2, -70 °C | C all, -20 °C | C all, -70 °C | C through month 2, -20 °C | C through month 2, -70 °C | BC IHC, RT | NSCLC IHC, RT
1 | 3.7 | 3.2 | 18.1 | 14.2 | 7.4 | 7.0 | 20.6 | 6.9 | 13.5 | 8.6 | 23.6 | 15.8
2 | 2.7 | 3.2 | 7.2 | 12.3 | 6.1 | 7.2 | 16.0 | 6.2 | 7.8 | 7.6 | 251.3 | 25.2
3 | 3.3 | 3.9 | 14.0 | 11.9 | 6.9 | 7.1 | 16.3 | 7.9 | 10.6 | 10.7 | 118.3 | 22.4
4 | 3.0 | 3.6 | 20.7 | 20.1 | 9.9 | 6.7 | 13.5 | 6.9 | 11.6 | 8.4 | 102.8 | 27.8
5 | 3.2 | 3.7 | 12.0 | 10.3 | 3.9 | 4.5 | 13.2 | 6.5 | 6.0 | 8.4 | 15.6 | 29.9
6 |  |  | 12.9 | 12.4 | 8.9 | 10.7 | 13.1 | 8.5 | 8.8 | 11.4 | 0.0 | 10.8

Note: Each row under each biomarker represents CV% of different time points from a sample. BC, breast cancer; IHC, immunohistochemistry; NSCLC, nonsmall cell lung cancer; RT, room temperature.

6.9

Fit-for-purpose validation

The concept of fit-for-purpose validation has evolved with the intent to provide biomarker researchers with an approach that tailors the burden and stringency of assay validation to the nature of the technology utilized and the context in which the biomarker will be applied. The approach was meant to make drug development efficient by conserving resources in the exploratory stages of biomarker characterization. Since exploratory biomarker data would be used for less critical decisions than data describing a well-qualified biomarker, a biomarker under exploratory development in an early phase clinical trial would be less rigorously validated than an already well-qualified biomarker in the same trial. The rigor of biomarker method validation increases as the biomarker data are used for increasingly advanced clinical or otherwise business-critical decision-making [102–105]. For this approach, biomarker assays were classified into four classes from a technical perspective:

1. A definitive quantitative assay makes use of calibrators to calculate absolute quantitative values in samples. Such assays use reference standards that are well defined and fully representative of the endogenous biomarkers, as in the case of small molecule biomarkers, for example, glucose and steroids.
2. A relative quantitative assay uses calibrators made from reference standards that are not pure or not fully representative of the biomarkers, as is the case for many cytokine immunoassays.
3. A quasiquantitative assay does not employ a calibration standard but has a continuous response that can be expressed in terms of a characteristic of the test sample, for example, antibody titers in antidrug antibody assays.
4. Qualitative (categorical) assays can be described either as ordinal, reliant on discrete scoring scales like those used in IHC, or as nominal, pertaining to a yes/no situation, for example, the presence or absence of a genetic variant. A qualitative assay generates categorical data that lack proportionality to the amount of analyte in a sample. In general, qualitative methods are more applicable for differentiating marked effects such as the all-or-none effect of gene expression, or effects on relatively homogenous cell populations.

The stringency of method validation was considered to be dictated not only by the position of the biomarker in the spectrum between research tool and clinical end point [104] but also, as depicted in Fig. 6.11 (redrawn from [105]), by the nature of the analytical technology [105].

6.9.1

Assessment of the current fit-for-purpose practice

While the general consensus was that validation should demonstrate that a method is reliable for the intended application, the fit-for-purpose approach may have drifted from this consensus. The approach has been further stretched and misused in the drug industry and clinical labs, which likely contributed to some of the challenges that will be discussed in later chapters.

FIGURE 6.11 Factors that determine the level of stringency of fit-for-purpose assay performance as projected in the literature. Y-axis, biomarker purpose (levels 1 to 4): 1, novel, discovery; 2, PD, POM; 3, PD, POC; 4, surrogate, prognostic/predictive. X-axis, assay type (qualitative, quasi, relative, definitive; levels 1 to 4): 1, IHC; 2, qPCR; 3, LBA; 4, MS. PD, Pharmacodynamics; POC, proof-of-concept; POM, proof of mechanism; LBA, ligand-binding assay; MS, mass spectrometry. Redrawn from Cummings J, Raynaud F, Jones L, Sugar R, Dive C. Fit-for-purpose biomarker method validation for application in clinical trials of anticancer drugs. Br J Cancer 2010;103(9):1313–17. Available from: https://doi.org/10.1038/sj.bjc.6605910 to demonstrate Cummings’ perception of fit-for-purpose assay validation.


Assay validity (reliability) has nothing to do with the type of technology employed but, as explained in this and the previous chapter, with the level of decision to be made. According to the Cummings et al. [105] grid, depicted in Fig. 6.11, IHC and ISH were the most exploratory technologies and, as further detailed in the article, need the least assay performance characterization. Quantitative PCR comes next to IHC, and MS ranks highest. There is no doubt that MS can be the most definitive methodology for small molecules, but no MS diagnostic assay based on a large molecule has been approved yet. In fact, as shown in an earlier chapter, all CDx approved so far, except the Ferriscan MRI, were based on IHC, ISH, PCR, and, lately, NGS. Not only were the suggested assay validation parameters limited, but the number of runs to check curve fit, establish the assay dynamic range, and assess precision and accuracy was three for exploratory method validation versus six for advanced method validation [103,105]. No rationale was provided for the selection of the limited number of parameters or for the number of runs, especially since six runs are still fewer than the lower expectations in good laboratory practice. Lee et al. [103] reported that, in general, qualitative methods are more applicable for differentiating marked effects such as the all-or-none effect of gene expression, or effects on relatively homogenous cell populations, which is not true for two reasons: (1) as shown in the previous chapter, most qualitatively reported assays rely on underlying quantitative or semiquantitative signals that have to be considered during assay validation and (2) most of these assays, especially IHC, are performed on the most heterogeneous matrix, tumor tissue. There is no doubt that an assay for a diagnostic, prognostic, or predictive biomarker should be highly reliable, but Cummings et al. [105] gave no explanation for why proof of mechanism (POM) could be more exploratory than proof-of-concept (POC). Unfortunately, the fit-for-purpose approach has been further stretched and misused by some users to the extent that some off-the-shelf research-use-only kits are used to analyze clinical trial samples without any validation or even verification.

6.9.2

Our proposal for fit-for-purpose validation

6.9.2.1 Underlying concept

The degree of robustness of an assay should reflect the tolerable amount of error or, as explained in the previous chapter, the TAE, and the degree of impact of erroneous results on the decision that will be made based on those results. Decisions are made at different phases of drug development, especially when transitioning a drug candidate from preclinical to first-in-human, from phase I (PI) to phase II (PII), and from PII to the pivotal trial. While some researchers downplay the importance of a reliable biomarker assay in the preclinical, PI, and PII phases, go/no-go decisions are often biomarker based, and false results may kill a potentially good compound or promote a bad candidate. As explained in an earlier chapter, a final (locked-in) assay, or an assay with performance characteristics similar to those of the registerable assay, should be used for patient selection as a candidate CDx to avoid the hassle of a bridging study. For other purposes, an assay does not need to be fully validated for exploratory or early phases, but the basic performance characteristics should be assured. Enough evidence should be shown that the assay detects what it is meant to detect, and nothing else, in the intended sample matrix (excluding matrix interference), within the range of concentrations expected in the study samples (assay dynamic range), and under the conditions to which study samples will be exposed (same preanalytical variables).

6.9.2.2 Some examples

If the samples from an exploratory study, including POM and POC, fit on a single ELISA plate, there is no need to assess interrun, interday, intertechnologist, interinstrument, or interlot precision. Similarly, if all tissue samples from a study will be stained in a single batch on the same autostainer and scored by a single pathologist, there is no need to establish imprecision at this point. In all these cases, an expectedly negative (low value) and an expectedly positive (high value) sample should be included as QC in the batch to make sure that the assay has behaved as expected, but there is no need to establish the QC target ranges, as will be discussed in the next chapter. There is no need to establish a calibration curve, similar to the example in Fig. 6.2, to estimate changes in gene expression at the copy number level between pretreatment and posttreatment samples; Ct or ΔCt (the ratio or difference between the target gene Ct and the housekeeping gene Ct) can be used as long as all samples are analyzed in the same batch. Samples can even be analyzed in more than one batch, but in this case only ΔCt should be used. For an assay that tests samples within 48 h of collection, there is no need to assess longer term stability.
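The reason ΔCt tolerates batch effects better than raw Ct can be seen in a small numerical sketch. The Ct values are invented, the difference form of ΔCt is used, and the conversion of the change in ΔCt to a fold change via 2 to the power of minus ΔΔCt is a common convention that assumes roughly 100% amplification efficiency; it is added here for illustration and is not prescribed by this chapter.

```python
def delta_ct(ct_target, ct_housekeeping):
    """Difference form of dCt: target Ct minus housekeeping Ct."""
    return ct_target - ct_housekeeping


def fold_change(pre, post):
    """Relative expression post vs. pre as 2 ** -(ddCt).

    Assumes roughly 100% amplification efficiency for both genes; this
    conversion is a common convention, not something the chapter prescribes.
    """
    ddct = delta_ct(*post) - delta_ct(*pre)
    return 2 ** (-ddct)


pre_treatment = (28.0, 18.0)    # (target Ct, housekeeping Ct), invented values
post_treatment = (26.0, 18.0)
print(fold_change(pre_treatment, post_treatment))

# A second batch in which every Ct reads 0.5 cycles later: raw Ct moves,
# dCt does not.
post_other_batch = (26.5, 18.5)
print(delta_ct(*post_treatment), delta_ct(*post_other_batch))
```

Because a batch-wide shift affects the target and housekeeping genes alike, it cancels in ΔCt, whereas a comparison of raw Ct values across batches would carry the shift into the result.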


Other than for chromatography-based techniques, for example, LC-MS, carryover is not critical to investigate, especially if high levels are not expected in the clinical samples. To establish the needed fit-for-purpose performance characteristics, the number of runs (e.g., the three vs six mentioned earlier) cannot be predefined; it depends on the assay imprecision noted from accumulated data. For example, if the CV% for different levels from three runs is within ±10%, no more runs are needed, but if it is ±30%, more runs are necessary either to confirm this level of imprecision and take it into account when looking at the study results or to recognize an outlying run that needs to be eliminated.
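The data-driven decision on the number of runs can be sketched as a simple check after each batch of runs. The 10% limit follows the example above; the function names, QC levels, and results are invented, and a real decision would also weigh the purpose of the study.

```python
from statistics import mean, stdev


def cv_by_level(results):
    """CV% across runs for each QC level; results maps level -> run values."""
    return {level: 100 * stdev(values) / mean(values)
            for level, values in results.items()}


def more_runs_needed(results, cv_limit=10.0):
    """True if any level exceeds the CV% limit after the runs done so far."""
    return any(cv > cv_limit for cv in cv_by_level(results).values())


tight = {"low": [10.2, 9.8, 10.1], "high": [98.0, 103.0, 100.0]}
noisy = {"low": [10.0, 14.0, 7.0], "high": [98.0, 103.0, 100.0]}
print(cv_by_level(tight), more_runs_needed(tight))
print(cv_by_level(noisy), more_runs_needed(noisy))
```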

Available from: https://doi.org/10.1093/jnci/djs438. [99] Yildiz-Aktas IZ, Dabbs DJ, Bhargava R. The effect of cold ischemic time on the immunohistochemical evaluation of estrogen receptor, progesterone receptor, and HER2 expression in invasive breast carcinoma. Mod Pathol 2012;25(8):1098105. [100] Saleem S, Mani V, Chadwick MA, Creanor S, Ayling RM. A prospective study of causes of haemolysis during venepuncture: tourniquet time should be kept to a minimum. Ann Clin Biochem 2009;46(Pt 3):2446. [101] COLA. How to verify performance specifications, ,http://www.cola.org/wp-content/uploads/2016/08/LG132015.pdf.; 2017 [reviewed 20.05.17; accessed 12.09.18]. [102] Lee JW, Weiner RS, Sailstad JM, et al. Method validation and measurement of biomarkers in nonclinical and clinical samples in drug development: a conference report. Pharm Res 2005;22:499511. [103] Lee JW, Devanarayan V, Barrett YC, et al. Fit-for-purpose method development and validation for successful biomarker measurement. Pharm Res 2006;23(2):31228. [104] Lee JW, Figeys D, Vasilescu J. Biomarker assay translation from discovery to clinical studies in cancer drug development: quantification of emerging protein biomarkers. Adv Cancer Res 2007;96:26998. [105] Cummings J, Raynaud F, Jones L, Sugar R, Dive C. Fit-for-purpose biomarker method validation for application in clinical trials of anticancer drugs. Br J Cancer 2010;103(9):131317. Available from: https://doi.org/10.1038/sj.bjc.6605910. [106] FDA. Leica HER2 Summary of safety and effectiveness data (SSED). April 18, 2012. Accessed Aug 25, 2018.