Analytica Chimica Acta 703 (2011) 101–113
Tutorial
Method ruggedness studies incorporating a risk based approach: A tutorial
Phil J. Borman ∗, Marion J. Chatfield, Ivana Damjanov, Patrick Jackson
Product Development and Statistical Sciences, GlaxoSmithKline Medicines Research Centre, Gunnels Wood Road, Stevenage, Hertfordshire, SG1 2NY, UK
Article info

Article history: Received 28 March 2011; Received in revised form 7 June 2011; Accepted 5 July 2011; Available online 19 July 2011

Keywords: Precision; Variance Components; Noise; Variability; Ruggedness

Abstract

This tutorial explains how well thought-out application of design and analysis methodology, combined with risk assessment, leads to improved assessment of method ruggedness. The authors define analytical method ruggedness as an experimental evaluation of noise factors such as analyst, instrument or stationary phase batch. Ruggedness testing is usually performed upon transfer of a method to another laboratory; however, it can also be employed during method development when an assessment of the method’s inherent variability is required. The use of a ruggedness study provides a more rigorous method for assessing method precision than a simple comparative intermediate precision study, which is typically performed as part of method validation. Prior to designing a ruggedness study, factors that are likely to have a significant effect on the performance of the method should be identified (via a risk assessment) and controlled where appropriate. Noise factors that are not controlled are considered for inclusion in the study. The purpose of the study should be to challenge the method and identify whether any noise factors significantly affect the method’s precision. The results from the study are firstly used to identify any special cause variability due to specific attributable circumstances. Secondly, common cause variability is apportioned to determine which factors are responsible for most of the variability. The total common cause variability can then be used to assess whether the method’s precision requirements are achievable. The approach used to design and analyse method ruggedness studies will be covered in this tutorial using a real example.

© 2011 Elsevier B.V. All rights reserved.
Phil J. Borman is a manager in Product Development within GlaxoSmithKline and is currently responsible for data management and the implementation of quality by design within his department. He joined GlaxoWellcome in 1997 and prior to his current role has worked in both Chemical and Pharmaceutical Development. He first studied at UMIST University (Manchester, UK) where he obtained a Masters in Chemistry and more recently obtained an MSc in Industrial Data Modeling from De Montfort University (Leicester, UK). Phil is also a chartered member of The Royal Society of Chemistry, UK and has authored and co-authored over 20 publications, including reviews, perspectives, feature articles and research papers.
Marion J. Chatfield is a manager in Statistical Sciences within GlaxoSmithKline. She gained an MA in Mathematics at Cambridge followed by an MSc in Applied Statistics at Southampton in 1984. After gaining broad pharmaceutical experience she has focused her efforts on the application of statistics in process chemistry. She has a keen interest in enabling chemists and analysts to take advantage of the benefits of statistical techniques in their process development. In the analytical arena she has authored and co-authored seven publications and has started a part-time PhD on “Assessing Analytical Method Performance – with a focus on statistical aspects of estimating precision”.
∗ Corresponding author. Tel.: +44 0 1438 763713; fax: +44 0 1438 764414. E-mail address: [email protected] (P.J. Borman).
doi:10.1016/j.aca.2011.07.008
Ivana Damjanov is a senior statistician in Statistical Sciences within GlaxoSmithKline. She graduated from Louisiana State University with a BSc in Mathematics and Computer Science followed by an MSc in Applied Statistics from the same university. While at university she worked as a teaching assistant and later as a research assistant in a medical research centre. She joined GlaxoSmithKline in 2006 as a statistician in a non-clinical statistics group. Her responsibilities include application of statistics in process chemistry and supporting scientists in Product Development through consultation, and teaching statistics and statistical computing packages. She has co-authored 2 publications.
Patrick Jackson is an analyst in Product Development within GlaxoSmithKline and is currently responsible for leading an analytical method quality by design centre of excellence. Patrick joined GlaxoSmithKline in 2005 after obtaining an MChem from the University of York. Patrick is currently studying for an MSc in Applied Statistics with Sheffield Hallam University. Patrick is also an associate member of The Royal Society of Chemistry and has co-authored 2 publications to date.
1. Introduction

The terms robustness and ruggedness have often been used interchangeably in the literature [1,2]. This paper focuses on the ruggedness of an analytical procedure – “the degree of reproducibility of test results obtained by the analysis of the same sample under a variety of normal test conditions. . .” [3]. This encompasses repeatability (same analyst/equipment over a short period of time), intermediate precision (within laboratory variation) and reproducibility (between laboratories), which are defined in the ICH guidelines [4] under the precision validation characteristic. Despite the current USP [5] stating that intermediate precision is also known as ruggedness, the authors’ definition of ruggedness extends to include conditions across laboratories as per previous versions [3] of the USP. The authors of this paper restrict the use of the term robustness to the ICH [4] definition, where the influences of deliberate variations in procedure-related method parameters are evaluated.

The application of quality by design (QbD) to analytical methods has been outlined by Borman et al. [6,7]. The authors of these articles propose the use of risk assessment tools such as cause and effect diagrams (also known as fishbones [8]) and failure mode effect analysis (FMEA) [9] to identify potential noise factors. Noise factors vary randomly from a specified population and typically cannot be, or are preferred not to be, controlled, e.g. column batch, analyst and environmental conditions. Before any experimentation is performed it is important to use the output from the risk assessment to consider whether any noise factors are to be controlled. Remaining factors that are not controlled can be studied using a method ruggedness study. This study aims to challenge the method by maximising the opportunities for any problems to surface. It is much more rigorous than typical assessments of analytical methods, such as intermediate precision and reproducibility studies performed to meet validation requirements (as described by Ermer and Miller [10]), or generic gauge capability studies.

Montgomery and Runger [11,12], for example, describe typical gauge capability studies where part-to-part variability (the process), ‘gauge’ variability (the analytical method) and operator-to-operator variability (the analyst) are assessed. In contrast, in this paper the authors advocate the use of a risk assessment to identify the most important noise factors associated with the analytical method. This approach supports varying a number of at-risk method factors (not just operator), often separately rather than combined. The use of the FMEA tool to aid the design of a method ruggedness study is demonstrated in the real example in this paper. Dasgupta and Murthy [13] describe another approach where high gauge repeatability and reproducibility initially identified from a typical gauge capability study can be resolved by unearthing the root causes and taking appropriate corrective actions. The identification of critical factors is mentioned by Dasgupta, but no information is provided on how these factors were risk assessed and chosen.

Noise factors should also be considered and sometimes evaluated during method development to ensure the method is likely to be rugged when in long term use. For example, it may be critical to understand batch to batch variability of a particular stationary phase in a high performance liquid chromatography (HPLC) method, or determine whether different batches of liner cause variability in a GC method.
Nethercote et al. [14] have applied principles aligned with the FDA’s guidance on process validation [15] to method validation, where they advocate considering ruggedness testing at the end of the design stage, before the method is qualified.

When a method moves site, it is common for an exercise called analytical method (or technology) transfer [16–18] to be performed. This involves transferring the knowledge of how to operate the method to analysts who will use it routinely and performing an exercise across the groups to confirm that comparable results are obtained. This exercise provides an ideal opportunity to test method ruggedness with a second analytical group. The exercise can be designed to ensure high risk noise factors are studied, as well as to determine whether the analysts can simply generate comparable results. Once complete, the overall method variability can be estimated.

The method ruggedness study should be designed to provide the opportunity to both identify special cause variation and generate estimates of common cause variability, which constitute repeatability and intermediate precision variability [19–21]. Special cause variation has an assignable root cause and can usually be addressed. Common cause variation is the remaining quantifiable variation in the system. In the literature various other names have been used for ruggedness studies or studies which have similar objectives. Some are named according to their application area or purpose, e.g. measurement systems analysis (MSA) and precision study. Others are named according to the statistics involved, e.g. crossed/nested designs referring to the structure of the design, and mixed model or variance components studies referring to the statistical analysis. A key focus for the authors is to design the study to give a good likelihood (with the resources available) of identifying any special cause variation among potential noise factors. Often other studies focus on estimating the common cause variability due to noise factors, e.g. runs and sample preparations.

The aspects to consider when designing, conducting and analysing method ruggedness studies (see Fig. 1) are now described. These are illustrated through an application by the authors to an HPLC method for the combined analysis of assay and impurities in an Active Pharmaceutical Ingredient (API).
2. Study design

2.1. Design aspects

The four design steps shown in Fig. 1 are described in more detail.
Fig. 1. Procedure for a method ruggedness study.
D1: Noise factors potentially affecting ruggedness may be identified using cause and effect (fishbone) diagrams, FMEAs, method walkthroughs, etc. Section 2.2 gives an example of a cause and effect diagram followed by an FMEA. Ideally a ‘method walk through’ [22] is performed by an analyst performing the entire analytical procedure while being observed by other analysts face-to-face. However, videos can form a basis for a remote assessment. Those noise factors remaining after implementing any method controls or improvements are prioritised according to their assessed risk.

D2: The test material is chosen to give confidence that the method is rugged across the range of inputs, e.g. different strengths of drug product or specifically chosen batches such as those with small and large particle sizes. The analyte level in the test material should be adequate to perform an assessment, i.e. all results expected to be above the quantitation limit. This is not always easy to achieve, especially for impurities, where batches may not contain all impurities at sufficiently high levels, ideally above 0.1%.

D3: Practical and resource aspects to be considered include:
• The number of available analysts and pieces of equipment (e.g. HPLC instruments, columns). Cost would usually prohibit the purchase of such equipment merely for a ruggedness assessment, though perhaps an HPLC available in a nearby laboratory might additionally be used or the study timed to fit in with a column purchase.
• The practicality of transferring analysts or equipment between sites. Though this would allow their effects to be distinguished from other factors which differ between sites, such as water quality, it is often impractical.
• The transfer of samples from one site to another. The stability of samples is important as the transfer or timing of analysis must not cause a difference.

D4: The study is designed to provide the opportunity to detect special cause variation in the noise factors as well as estimate common cause variability. Designing is an iterative procedure—evaluation of sample size may result in re-evaluation of the design and vice versa. Following construction of the design, the order of measurements within the design should be randomised where possible, e.g. the order of analytical measurements within an analytical run, and the order in which higher level factors such as HPLC instrument and column are applied to runs (see the sketch below). Some principles the authors use to design the study are described below. It should be noted that the design of a study depends on the objectives and prior knowledge, requiring careful thought and often statistical knowledge, as an appropriate resource-efficient design can sometimes be complex.
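As a small illustration of the randomisation recommended in step D4, the following sketch shuffles the order in which runs (and, within runs, sample preparations) are executed. The factor labels are hypothetical and this is only one of many ways to randomise, not the authors’ own procedure.

```python
import random
from itertools import product

random.seed(20110705)  # fix the seed so the executed order is reproducible and auditable

# Hypothetical grouped factors: 2 analysts x 2 HPLC instruments, one run per combination
runs = [{"analyst": a, "hplc": h} for a, h in product(["A1", "A2"], ["H1", "H2"])]
random.shuffle(runs)  # randomise the order in which the runs are performed

plan = []
for run_order, run in enumerate(runs, start=1):
    preps = ["P1", "P2"]
    random.shuffle(preps)  # randomise preparation order within each run
    for prep in preps:
        plan.append({**run, "run_order": run_order, "prep": prep})

for row in plan:
    print(row)
```

Fixing the random seed keeps the executed order traceable in the study records.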
However, the application of the principles to situations where practicalities result in 2 levels within a site for most noise factors, or where (unusually) resource is not an issue, is more straightforward. Section 2.2 describes their application to a method where most noise factors at a site were investigated at two levels. For advanced designs there is considerable statistical literature on this topic, for which Khuri [23] gives a valuable review.

D4a: Structure of design

(i) High risk factors should be incorporated on their own, if possible. However, it is usually necessary to group factors to reduce resource. ‘Grouping’ factors means testing levels together, e.g. if analyst, HPLC and column factors were grouped together in the design and included at 4 levels, this would mean that each of the four analysts uses a unique HPLC instrument with a unique column (see Fig. 2). If any of the four analysts, or four HPLC instruments, or four columns were to cause unusual results (special cause variation), or any of these factors induced high common cause variation, this should be observed in the results. Sometimes the factor causing the variation can be deduced from the effect seen and investigation of the chromatograms. However, if it is not clear which specific noise factor is the cause, additional experimentation can be performed.

(ii) It is usually recommended to incorporate factors (individual or grouped) into the design as crossed (i.e. each level of one of the factors occurs with each level of the other factor and vice versa). Hence if analyst, HPLC instrument and column were incorporated into the design as crossed individual factors with 2 levels, each analyst would use every instrument with every column (2 × 2 × 2 = 8 combinations)—see Fig. 2. However, practicalities and physical constraints will often result in nested [24] (or hierarchical) factors (i.e. each level of one factor occurs with only one level of the other factor). Again using the example above, if HPLC instruments were nested within analyst and columns nested within HPLC instrument, this design could lead to 2 analysts, each of whom studies 2 different instruments, and for each instrument 2 distinct columns—see Fig. 2. This impractical design results in 2 analysts, a total of 4 HPLC instruments and a total of 8 columns being studied (a short enumeration of the crossed and nested arrangements is sketched after Fig. 2). A realistic situation where nesting occurs is that analyst is often nested within site, as it is impractical to transfer analysts across sites. Another example is sample preparation, which could be crossed with an instrument or column but has to be nested within analyst as the analyst prepares the sample. It will often be nested within runs using different instruments or columns to avoid stability issues. In contrast to crossed factors, nesting results in the interaction between the two nested factors being confounded with (inseparable from) the nested effect. Prediction of the relative sizes of nested factors is also required, to try to place larger nested effects higher up in the design [25].
Fig. 2. Structure of factors in design.
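To make the crossed and nested arrangements of Fig. 2 concrete, the sketch below (with hypothetical labels) enumerates the eight runs of the fully crossed 2 × 2 × 2 arrangement and the runs implied by nesting HPLC within analyst and column within HPLC.

```python
from itertools import product

analysts = ["A1", "A2"]

# Fully crossed: every analyst uses every HPLC instrument with every column (2 x 2 x 2 = 8 runs)
crossed = list(product(analysts, ["H1", "H2"], ["C1", "C2"]))
print("Crossed runs:", crossed)

# Nested: each analyst has their own pair of instruments, and each instrument its own pair of columns,
# giving 2 analysts, 4 instruments and 8 columns in total (the impractical arrangement of Fig. 2)
nested = []
for i, analyst in enumerate(analysts):
    for j in range(2):
        hplc = f"H{2 * i + j + 1}"                   # instruments H1-H4, each unique to one analyst
        for k in range(2):
            column = f"C{2 * (2 * i + j) + k + 1}"   # columns C1-C8, each unique to one instrument
            nested.append((analyst, hplc, column))
print("Nested runs:", nested)
```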
Fig. 3. Partial crossing in design.
Pfleeger [26] gives a fuller description of nesting and crossing as well as a discussion on how to choose between the two arrangements, albeit in a different application area. Tsang et al. [27] also provide examples.

(iii) Resource may be saved through not performing all combinations of levels of crossed factors (partial crossing), ideally using statistical designs such as fractional factorials or Latin squares [28]. Fig. 3 shows an example for 3 analysts, 3 HPLC instruments and 3 columns. Only 9 runs (formed from a Graeco-Latin square [28]) are used to investigate the effects of the noise factors, whereas 27 runs would be required for full crossing of all three factors. In the 9 runs each analyst uses all HPLC instruments and all columns, and all HPLC instruments are run with each of the 3 columns (a small construction of this kind is sketched after Fig. 4). Partial crossing may be more easily seen in the bottom axis of a variability plot of the collected data—see Fig. 4 for this alternative representation of the design in Fig. 3 (plotting an impurity response). Partial crossing is used in the example in Section 2.2.

“Staggering” [25,29] can also be used to reduce the number of factor levels arising from nested factors, and occurs towards the bottom of the design. For some combinations of higher level factors, a single rather than duplicate preparation might be performed, and for some preparations single rather than duplicate injections might be used. Ignoring particular knowledge/considerations, e.g. practicalities and cost, Leone et al. [29] recommend spreading the degrees of freedom (d.f.) almost evenly over all the factors (d.f. are discussed in step D4b). The staggering should ideally be balanced with respect to higher design factors.
The concept of staggering is illustrated in Fig. 5. For each site, the combinations of analyst, instrument and column result in 8 runs. If duplicate preparations and duplicate injections were to be performed, a total of 32 preparations and 64 injections would be required. However, 4 runs per site have duplicate preparations with single injections (e.g., runs 1, 4, 6, 7 for site S1), whilst the remaining 4 runs have single preparations with duplicate injections (e.g., runs 2, 3, 5, 8 for site S1). This saves resource by only requiring a total of 24 preparations and 32 injections. The staggered design results in 8 d.f. to estimate the preparation variance and 8 d.f. for the injection variance, compared to 16 and 32 respectively had staggering not been applied – providing the same number of d.f. over the bottom three factors (a short check of these numbers is sketched after Fig. 5).
Fig. 4. Partial crossing shown in variability plot of data.
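A 3 × 3 Latin-square arrangement of the kind behind Figs. 3 and 4 can be written down directly: with hypothetical labels, assigning column index = (analyst index + HPLC index) mod 3 gives 9 runs in which each analyst meets every instrument and every column, and every instrument meets every column. This is only a sketch of the idea, not the specific square used by the authors.

```python
# Assign columns to analyst/HPLC pairs via a cyclic Latin square:
# nine runs instead of the 27 needed for full crossing of three 3-level factors.
analysts = ["A1", "A2", "A3"]
hplcs = ["H1", "H2", "H3"]
columns = ["C1", "C2", "C3"]

runs = []
for i, analyst in enumerate(analysts):
    for j, hplc in enumerate(hplcs):
        column = columns[(i + j) % 3]  # Latin-square assignment of column
        runs.append((analyst, hplc, column))

for run in runs:
    print(run)

# Check the partial-crossing property: each factor pair covers all 9 level combinations
pairs_ac = {(a, c) for a, _, c in runs}
pairs_hc = {(h, c) for _, h, c in runs}
assert len(runs) == 9 and len(pairs_ac) == 9 and len(pairs_hc) == 9
```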
Fig. 5. Illustrating staggering and degrees of freedom in design.
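The resource and d.f. figures quoted for the staggered design of Fig. 5 can be checked with simple counting; in this sketch the d.f. for a nested bottom-level factor is taken as the number of its units minus the number of units it is nested within.

```python
sites, runs_per_site = 2, 8          # 2 analysts x 2 instruments x 2 columns per site
runs = sites * runs_per_site         # 16 runs in total

# Staggering: per site, 4 runs have 2 preparations x 1 injection, 4 runs have 1 preparation x 2 injections
preps = sites * (4 * 2 + 4 * 1)               # 24 preparations (32 if all runs had duplicate preparations)
injections = sites * (4 * 2 * 1 + 4 * 1 * 2)  # 32 injections (64 if everything were duplicated)

# d.f. for a nested factor = number of units minus number of parent units
df_prep = preps - runs               # 24 - 16 = 8
df_injection = injections - preps    # 32 - 24 = 8
print(runs, preps, injections, df_prep, df_injection)
```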
D4b: Sample size

The number of levels for each factor being investigated should be considered, and not just the total number of preparations or injections. The design should ideally provide sufficient opportunity to detect special cause variation (related to the number of levels) and, where appropriate, enough data to estimate common cause variation (related to degrees of freedom (d.f.)—the number of independent pieces of information available to estimate the variance, which is always less than the number of levels). Risk assessment should be used to make the most effective use of resource. Usually it is infeasible to estimate the common cause variation for all factors incorporated in the design. Common cause variations which typically can be estimated are: the long term variability (intermediate precision), the between run variation and the variability of repeated preparations and injections. Practicalities and resource limitations result in only a rough estimate of the long term variability. The ability of a design to estimate common cause variation depends not only on the design itself but also on the expected relative magnitude of the sources of variation.

Evaluating the sample size can be very complex statistically. However, the authors take a pragmatic approach, usually focusing on just the design, calculating the d.f. as if an ‘ANOVA’ (analysis of variance) analysis were to be performed and assessing these as if a simple variance were being estimated. The general concept and example calculations are shown in Fig. 5 for a reasonably simple design. The d.f. shown in Fig. 5 are found by taking the number of levels of a factor and subtracting the d.f. for any effects it is nested within. For example, column has 4 levels and is nested within both site and the overall mean, each of which has d.f. = 1. Thus column has d.f. = 2 (4 − 1 − 1). For more complicated designs, provided the analyst can determine the appropriate model, the easiest way of calculating the d.f. is to analyse the model and a dummy response using ANOVA in appropriate software. This is illustrated in Section 2.2 for the API example.

There are no fixed rules for establishing whether the d.f. available are sufficient to estimate the common cause variation. However, evaluating the sample sizes required to estimate simple variances may provide some indication. A sample size of at least 6 is recommended for repeatability by ICH [4], though for an estimated standard deviation (SD) of 1%, this gives a wide 95% confidence interval of 0.62% to 2.45% (based on the chi-squared distribution). Larger sample sizes are suggested for estimating process capability. For a typical sample size of 30, an estimated SD of 1% gives a 95% confidence interval of 0.80% to 1.34% (see the sketch at the end of this section). Statistica [30], the statistical software used to analyse the API example, colour codes d.f. for random effects, with < 4 being colour coded red and > 8 being colour coded green.
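The confidence intervals quoted above for an estimated SD follow from the chi-squared distribution; a minimal sketch using scipy, with the sample sizes from the text, is:

```python
from scipy.stats import chi2

def sd_confidence_interval(sd, n, level=0.95):
    """Two-sided confidence interval for a standard deviation estimated from n results."""
    df = n - 1
    alpha = 1 - level
    lower = sd * (df / chi2.ppf(1 - alpha / 2, df)) ** 0.5
    upper = sd * (df / chi2.ppf(alpha / 2, df)) ** 0.5
    return lower, upper

print(sd_confidence_interval(1.0, 6))   # approx (0.62, 2.45), as quoted for n = 6
print(sd_confidence_interval(1.0, 30))  # approx (0.80, 1.34), as quoted for n = 30
```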
2.2. Application to an HPLC method for the combined analysis of assay and impurities in API

The method in this example uses HPLC with UV detection for the analysis of API assay and impurities in the API. The transfer of this method from development into manufacturing provided a valuable opportunity to perform a ruggedness study. Fig. 6 provides a fishbone (cause and effect) diagram for this method, where factors have been listed and categorized according to whether they will be controlled (C), are noise factors (N) or identified as experimental (X) parameters. The identified noise factors were risk assessed using an FMEA tool. See Table 1 for an excerpt from the FMEA that was performed, which displays the highest risk noise factors. Note: risk is calculated by multiplying the scores assigned to the three categories: severity, occurrence and detection. This calculation provides a risk priority number (RPN) for each noise factor, which enables them to be ranked according to their perceived risk (a minimal calculation is sketched below). Further controls were identified and written into the method following the above risk assessment (e.g. the use of a particular grade of glassware was specified in the method). All the noise factors with an RPN above 20 were considered for inclusion in the study. Not every identified factor could be studied separately, so factors were grouped together in the ruggedness study (this can be seen in the last column of Table 1). The grouping of factors driven by a risk based approach has also been used to aid the design of ‘reduced’ method robustness studies [31]. The grouped factors used were site of operation, analyst, HPLC instrument, column, sample preparation and sample injection (see Fig. 7).

The output of the risk assessment was used to select the levels for some of these grouped factors. For example, two types of HPLC instrument (made by different manufacturers) were deliberately chosen for inclusion in the study because they were known to have differences in dwell volume (one of the high risk factors identified). One of the sites was prone to higher levels of humidity (another high risk factor); therefore, prior to the study being performed, experimental work was performed under controlled conditions to assess the impact of humidity on the sample preparation. Following this work, controls were identified which minimised the uptake of water by samples. The ruggedness study provided an opportunity to verify that the implemented method controls were appropriate. Lastly, the method for preparing the mobile phase could differ according to whether it was pre-mixed by the analyst or mixed by the instrument. Both of these methods were included in the study (within the analyst and instrument factors) to ensure this high risk factor (mobile phase homogeneity) was investigated. Three batches, selected to be representative of the commercial process, were tested for content and impurities. An additional batch was used as the analytical working standard to calculate the % (w/w) assay for the other three batches.
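The RPN used to rank the noise factors in Table 1 is simply the product of the three scores; the sketch below reproduces the calculation for two of the tabulated factors.

```python
def rpn(severity, occurrence, detection):
    """Risk priority number used to rank failure modes in an FMEA."""
    return severity * occurrence * detection

fmea = {
    "Glassware quality": (9, 3, 5),   # scores taken from the Table 1 excerpt
    "Dwell volume": (9, 3, 3),
}
ranked = sorted(((name, rpn(*scores)) for name, scores in fmea.items()),
                key=lambda item: item[1], reverse=True)
print(ranked)  # [('Glassware quality', 135), ('Dwell volume', 81)]
```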
Fig. 6. Fishbone diagram for the example.
The design incorporates partially crossed factors, e.g. ‘column’ is partially crossed with HPLC because not all of the possible combinations involving this factor are performed. Only a certain fraction of the ‘column’ experiments were selected whilst retaining the balance in the number of sample preparations across the design. For each combination of the factors shown in Fig. 7 the three batches were analysed by performing 2 preparations, each injected twice. The bottom level of the design, ‘injection’, is not shown. An in-house package, Draw Design, was used to create and visualise the design. The proposed design was evaluated to assess whether there were enough d.f. to estimate the common cause variation. The d.f. for this design are summarized in Table 2 (generated using Statistica). This indicates that the d.f. are too few to adequately estimate the random effects of analyst, HPLC and column – the design only supports an assessment of special cause variation for these factors. The bottom three levels in the ANOVA table (Table 2) have an adequate number of d.f.
Table 1
Excerpt from FMEA for the example.

Failure mode (variable identified from fishbone) | Failure effects | Severity (S) | Occurrence (O) | Detection (D) | Risk priority number (RPN) | Level (in ruggedness study)
Glassware quality | Extraneous peaks | 9 | 3 | 5 | 135 | Preparation
Dwell volume | Critical resolution between main peak and impurity on tail | 9 | 3 | 3 | 81 | Instrument
Analyst experience | Peaks misclassified/poorly integrated | 9 | 3 | 3 | 81 | Analyst
Homogeneity of mobile phase | Shift in retention time. Loss of selectivity. | 9 | 4 | 2 | 72 | Analyst/Site/Instrument
Laboratory humidity | Weighing inaccuracy (due to sample water uptake) | 9 | 3 | 2 | 54 | Site
Quality of reagents | Presence of extraneous peaks | 9 | 2 | 3 | 54 | Analyst/Site/Instrument
Sample storage | Extraneous peaks | 9 | 3 | 2 | 54 | Analyst
Column equilibration | Extraneous peaks, ‘carry over’ | 9 | 3 | 2 | 54 | Analyst
Column age | Loss of selectivity | 9 | 3 | 2 | 54 | Column
Injector module | Quantification variation | 3 | 3 | 5 | 45 | Injection
Column batch | Loss of critical resolution, impact on quantitation limit. Loss of selectivity. | 9 | 3 | 1 | 27 | Column
Impurity levels | Quantification inaccuracy | 3 | 3 | 1 | 9 | Batch
Fig. 7. Structure of the study for the example.
Table 2
Evaluation of d.f. for random effects for the example.
3. Implementation of study

It is important to consider the implementation of the study, e.g. the practicalities of transporting samples to avoid stability issues. As the design often includes a number of analysts in different locations, the use of data collection plans/templates is recommended (a simple template sketch is given at the end of this section). This ensures the design is run in the correct randomised order and that data are collated in an appropriate format (to sufficient significant figures, and with correct factor assignment). It is important to consider where mistakes might be made and try to plan the data collection to minimise this risk.

As well as ensuring sample and data processing integrity throughout a study, thought must be given to the risk assessment to ensure any low risk noise factors are adequately controlled. For example, within each site there may be specific equipment used for the sample preparation. Consideration should be given as to whether the equipment should be fixed or left to vary. The instrument set-up should also be considered. In a study where two analysts need to use the same two instruments (where analyst is ‘crossed’ with instrument), execution of the design must be clearly articulated. It would be simpler, and probably common practice, for one of the analysts to set up both of the instruments prior to analysis. However, this may introduce bias into the study, as the intent of the design is to evaluate analyst-to-analyst variability where all of the method procedural steps should be performed by the allocated analyst. It is therefore important to ensure clear instructions are provided to the participating analysts so that bias or unintended variation is not introduced. Output from the method risk assessment should also be used to ensure any identified method controls are implemented.
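A data collection plan can be as simple as a spreadsheet pre-populated with the randomised run order and empty result columns; the sketch below (pandas, with hypothetical column names and an illustrative excerpt of a design) writes such a template.

```python
import pandas as pd

# Hypothetical excerpt of the design in its randomised execution order
design = pd.DataFrame(
    [
        {"run_order": 1, "site": "S1", "analyst": "A2", "hplc": "H1", "column": "C2", "prep": "P1"},
        {"run_order": 1, "site": "S1", "analyst": "A2", "hplc": "H1", "column": "C2", "prep": "P2"},
        {"run_order": 2, "site": "S1", "analyst": "A1", "hplc": "H2", "column": "C1", "prep": "P1"},
    ]
)
# Empty columns for the analyst to fill in, keeping factor assignment and significant figures consistent
design["injection"] = 1
design["result_pct_ww"] = None
design["comments"] = ""
design.to_csv("ruggedness_data_collection_template.csv", index=False)
```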
4. Analysis and interpretation

4.1. Description of analysis steps

The three analysis and interpretation steps were shown in Fig. 1. The statistical analysis typically consists of visualising the data and fitting statistical mixed models to estimate the variation due to factors or their interactions (variance components). The term mixed model is used to denote that the model includes both fixed effects and random effects. Fixed effects are those where the levels are deliberately chosen by the experimenter, e.g. test materials with specific attributes. Random effects [32] are those which are allowed to vary “randomly” from a specified population, e.g. any reagent bottle of grade X, and are usually noise factors. Of the various statistical methods used to fit mixed models, restricted maximum likelihood (REML) [32] is perhaps the most commonly used. The three analysis steps are now described in more detail.

A1: To prioritise detailed data evaluation of special cause variation, it can be useful to calculate variance components (note the purpose is not to use these as estimates of common cause variation). Some factors may be included as random effects (e.g. site) even though in step A2 they could be fixed effects.
A stacked bar chart of the estimated variance components for responses allows easy visualisation of all sources of analytical variation, particularly if there are several responses or many factors investigated in the study. If the responses have similar expected values (e.g. similar specifications) the raw variance components can be plotted; otherwise normalisation of the data with respect to specification limits can be useful (see the example). The plot can be visually scanned for responses with particularly large components, or particularly large total variation. From this, particular responses are identified for review. A variability plot (see Fig. 4) is a very useful way of examining the effect of factors on each response and looking for special cause variation (larger than expected differences between factor levels) and possible causes. Root causes can often be found from method knowledge, study records or raw data traces, e.g. chromatograms. Otherwise specific investigational studies may be required. Controls should then be implemented.

A2: Individual sources of variance (if sufficient data), and estimates of repeatability and longer term variability (often intermediate precision, though data from two laboratories may be included), should be compared against what is acceptable for that type of method. If larger than expected, possible reasons for this should be assessed and improvements sought. The preparation and injection variation can be used to assess the merits of increasing the number of preparations and/or injections when the method is conducted, if averaging is a viable option in producing a batch reported result during manufacture. In addition, any fixed effects of particular interest are examined, e.g. the size of any bias between two specific sites. The variance estimates and fixed effects are estimated using a mixed model (after removing special cause variation if appropriate).

A3: The predicted long term analytical variation is assessed against the specific performance requirements. One way to assess method capability is through using the precision to tolerance ratio [33], PTOL(2-sided) = 6σa/(USL − LSL), where σa = analytical method standard deviation (6σa represents the spread of reportable results due to analytical variability), LSL = lower specification limit and USL = upper specification limit. PTOL(2-sided), however, ignores whether the process is close to a specification limit and also cannot be applied directly to one-sided specifications. Analogous to process capability, the authors have defined a one-sided PTOL(1-sided) [34] = minimum of 3σa/(USL − process mean) and 3σa/(process mean − LSL). Rules of thumb (taken from an engineering background [35]) are: PTOL < 0.1 is ideal; PTOL > 0.3 is unacceptable [33]. To achieve PTOL < 0.3, half the analytical variation (3σa) should occupy < 30% of the distance between the LSL and the process mean (which is nearest to the LSL). This allows for process variability and some change in the process mean. Calculation of P/TOL is illustrated in the example below. The PTOL is a useful tool, but the limit needs to be chosen appropriate to the context. For example, if the process variation is very small, or the analytical variation cannot be estimated distinct from process variability, a larger analytical variation is often acceptable. If performance is unlikely to be met or variation is unexpectedly large, possible reasons and improvements should be investigated.
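Outside Statistica, the same kind of variance components analysis can be sketched with a mixed model fitted by REML. The fragment below uses statsmodels and hypothetical column names, and is deliberately simpler than the full model of Table 3: site enters as a random intercept (the grouping factor), with analyst, column and preparation as variance components nested within site, and batch as the fixed effect. It is a sketch of the general approach, not the authors’ exact analysis.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical tidy data set: one row per injection, with columns
# result, batch, site, analyst, column, prep_id (names are assumptions)
df = pd.read_csv("ruggedness_results.csv")

# Batch is a fixed effect (levels deliberately chosen); the noise factors enter as
# variance components nested within site, which acts as the grouping factor.
vc = {"analyst": "0 + C(analyst)",
      "column": "0 + C(column)",
      "prep": "0 + C(prep_id)"}
model = smf.mixedlm("result ~ C(batch)", data=df, groups="site", vc_formula=vc)
fit = model.fit(reml=True)

print(fit.summary())                                 # variance components appear as "... Var" rows
print("Residual (injection-to-injection) variance:", fit.scale)
```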
4.2. Analysis of example

The variance estimation, precision and comparison (VEPAC) module in Statistica [30] was used to set up the model applicable to the design that was created. The final fitted mixed model, which included the following nested and crossed effects, is shown in Table 3. All of the model terms were selected as random effects except batch. Batch was selected as a fixed effect as the batches were deliberately selected to cover all of the impurities which could be present in the API. The effects were estimated using the REML estimation method and Type I decomposition.

A few steps were performed in order to decide on the final model (Table 3) used in the evaluation of the data. Firstly, one random effect (labelled as Analyst × HPLC × Col(Site) in the table) was created which included the 2 and 3 factor interactions between random factors and also run to run and standard set to standard set variation (which are confounded with the 3 factor interaction). Secondly, random effects that consisted of all the other 3 factor interactions involving batch were combined into the 4 factor interaction random effect (labelled as 4FI in the table). The default statistical test provided in Statistica (Wald statistic [36]) was used to decide whether to further split the 3 and 4 factor interactions. If either were statistically significant (p-value < 0.05), the model would have been expanded to include all constituent 2 and 3 factor interactions as random effects. For this example, the 3 and 4 factor interactions were not statistically significant and were therefore kept in the model. The model could be further reduced by removing the random effects with variance estimates of zero. This would not affect the estimates of variance components or the overall method standard deviation (σa), though see the discussion for implications regarding confidence intervals.

A variability plot for content (Fig. 8) was created to visualise all the data, assess the overall variation observed between the factors studied and identify cases of special cause variability. Each data point in Fig. 8 represents an injection within the first or second preparation. Two cases of special cause variability are observed for analyst 2 at the transferring site, where low values are likely to be due to an error in sample preparation (assessed at the time of analysis through discussion with analyst 2, who had issues transferring the sample from the balance to the volumetric flask). These two data points were therefore removed prior to any further analysis. Further controls were also added to the method to reduce the risk of special cause variation arising from sample handling.

The breakdown of variance was also examined (see Fig. 9) and the estimated variance components are included in Table 4. The greatest source of variance was error, which describes the bottom factor in the MSA design – injection-to-injection replication. The injection volume was therefore investigated and it was found that increasing the injection volume from 5 μL to 10 μL improved the performance of the method.
Table 3
The final fitted mixed model for the example.

Effect
Batch
Site
Analyst nested within site
HPLC nested within site
Column nested within site
Site crossed with batch
(Analyst crossed with HPLC crossed with column) nested within site – ‘Analyst × HPLC × Col(Site)’
((Analyst crossed with HPLC crossed with column) nested within site) crossed with batch – ‘4FI’
Preparation nested within (((Analyst crossed with HPLC crossed with column) nested within site) crossed with batch)
Error (injection variability to be estimated as error in the fitted model)
Fig. 8. Variability plot of assay data for the example.
Table 4
Estimated variance components for the example.

Effects | Estimated variance components
Site | 0.000
Analyst (Site) | 0.003
HPLC (Site) | 0.000
Column (Site) | 0.005
Site × batch | 0.007
Analyst × HPLC × Col(Site) | 0.055
4FI | 0.000
Preparation | 0.052
Error | 0.137
The other main sources were preparation and the ‘Analyst × HPLC × Col’ interaction, which is likely to relate to standard preparation or run-to-run variability. No particular reasons for the variation were identified and, as the magnitudes of the variances were considered typical for an assay content method, no further investigation was performed. No effect from site was seen and thus this was not analysed further as a fixed effect. The estimated variation due to analyst, HPLC or column was small or zero. Since little or no variation was attributed to HPLC instrument or analyst, the possible effects of dwell volume or method of mobile phase preparation were also not analysed further as fixed effects.

The repeatability SD and longer term SD (σa) were then estimated. In order to calculate the between preparation variance, the estimated variance components for sample preparation and error were summed. Table 5 shows their estimates together with 95% confidence intervals produced by Statistica (based on a chi-squared distribution with d.f. = 2 × Z², where Z is the Wald statistic [36]). To obtain the repeatability variance based on the routine analysis procedure (2 preparations, each injected once), the preparation variance was divided by 2. To obtain the longer term variance (σa²), the repeatability variance was then added to the other estimated variance components. The square root of the variances was taken to obtain SDs. The repeatability SD was estimated as 0.31 with a 95% confidence interval of (0.25, 0.39). The longer term SD was estimated as 0.4, which was used in the P/TOL calculation.
Table 5
Variance estimates with 95% confidence intervals.

Effect | Variance estimate | 95% Confidence interval (Lower) | 95% Confidence interval (Upper)
Preparation | 0.052 | 0.024 | 0.188
Error | 0.137 | 0.101 | 0.196
Sum | 0.189 | 0.128 | 0.306
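The arithmetic that turns the variance components of Tables 4 and 5 into the reported SDs and P/TOL can be reproduced directly; small differences from the rounded values quoted in the text (0.31, 0.4 and 0.96) arise from rounding of the tabulated components.

```python
# Variance components from Table 4 (squared % w/w units)
components = {"site": 0.000, "analyst": 0.003, "hplc": 0.000, "column": 0.005,
              "site_x_batch": 0.007, "analyst_x_hplc_x_col": 0.055, "4FI": 0.000}
prep_plus_injection = 0.052 + 0.137           # Table 5 "Sum"

# Routine reporting uses the mean of 2 preparations, each injected once
repeatability_var = prep_plus_injection / 2
repeatability_sd = repeatability_var ** 0.5   # ~0.31

long_term_var = repeatability_var + sum(components.values())
long_term_sd = long_term_var ** 0.5           # ~0.4

# One-sided precision-to-tolerance ratio against the lower specification limit
lsl, process_mean = 98.0, 99.3                # % w/w
ptol_1sided = 3 * long_term_sd / (process_mean - lsl)   # roughly 0.9-1.0, i.e. well above 0.3
print(round(repeatability_sd, 3), round(long_term_sd, 3), round(ptol_1sided, 2))
```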
A P/TOL of 0.96 was calculated based on an LSL of 98% (w/w), an estimated standard deviation of 0.4 and a process mean of 99.3% (w/w) (see Fig. 10). P/TOL was further improved by using a larger injection volume and reassessing the specification limits (P/TOL was approximately 0.6, which, though still > 0.3, is typical for this type of method). Note that assay content method capability is often poor, and Hofer et al. [37] question whether an API HPLC assay is a useful attribute for API batch release given that the quality of the material is monitored and assured by other analytical tests on chemical content.

Similar to the content assessment, impurity data were also evaluated. The sources of variability across the impurity responses were visualised using a normalised stacked bar chart of the estimated variance components (see Fig. 11). The data were normalised because impurity 4 had considerably higher levels in the study than the other impurities and therefore had a higher variance. Each data point for a given impurity was expressed as a percentage of its specification limit (note, different specification limits existed across the impurities studied). Fig. 11 shows that impurity 3 has the highest level of variability relative to its specification limit. The variability plot for impurity 3 (see Fig. 12) showed that the batches contained this impurity at very low levels – for three of the four batches the levels were around the limit of detection (0.01% area) and well below the limit of quantitation (0.03% area). The variability attributed to site arises because, for one site, this impurity often was not detected in those 3 batches. The fourth batch had this impurity at a higher level. For this batch similar measurements were observed at both sites, no particular causes of variation were observed and the variation was considered acceptable given
Fig. 9. Breakdown of variances for the example (in percent).
Fig. 10. Variation of assay data for the example relative to process and specifications.
the specification limit is 0.15%. This example shows the importance of having batches with impurities close to the specification limit. In practice this is usually not possible, but having batches which contain impurities above the limit of quantitation is very important. Fig. 13 shows the variability plot for impurity 4—the impurity with the next largest variation with respect to its specification limit. There was no special cause variability observed. There was no particularly large source of common cause variation in this data-set and it was considered to be at a low enough level compared to the
specification limit of 0.4%. Similar conclusions were drawn for the other two impurities.

Typically a comparison of means from each laboratory against criteria is performed as part of the transfer from one laboratory to another. Thus, in addition to the analysis described above, the data were used for this purpose.

5. Discussion

The focus of the statistical analysis used in this paper is the identification of factors that cause method variation and a visual
Fig. 11. Stacked bar chart of estimated variance components for the example.
Fig. 12. Variability plot for impurity 3 from the example.
assessment of method ruggedness. The uncertainty associated with variance estimates is likely to be large given the small number of levels for most individual noise factors. The example above only uses two levels of a ruggedness factor at each laboratory, which is typical. Consequently the long term variability estimate is likely to be just indicative. It is recommended that a greater number of levels should be studied if possible. Ideally, confidence intervals for the variance or P/TOL estimates should be provided as advocated in the literature on measurement systems analysis [38]. The analysis of the example provided confidence intervals for preparation and injection variance components and the repeatability
SD, as the associated degrees of freedom were high and the software can provide these (albeit with a small additional calculation). However, the authors do not provide general recommendations for obtaining confidence intervals in this tutorial. Given that the number of levels associated with many of the factors being investigated is usually very small in these analytical studies, and that the variance components themselves are often estimated as small or zero, the authors have concerns about which methodology is appropriate. The confidence intervals also depend on whether variance components estimated to be small remain in the model or not. Note the example above used the Wald statistic [36] to assess whether to further investigate
Fig. 13. Variability plot for impurity 4 from the example.
terms representing multiple interactions. An improvement would be to examine changes in deviance which, though available in Statistica, requires manual extraction and calculations. Also, much of the methodology discussed in the literature [39,40] is not readily accessible to a practicing analyst. The methodology implemented in Statistica does not provide the means to calculate the confidence intervals for the long term SD since the two preparations are averaged (and the variance-covariance matrix is not available to perform additional calculations). There are other statistical issues which complicate the estimation of uncertainty in P/TOL or variance estimates, such as the effect of non-homogeneous variation. Despite these statistical reservations concerning the small sample size for individual noise factors, or the assumptions behind the analysis performed, it should be noted that in practice these studies have proved to be very useful in identifying method improvements.

Analysis step A2 referred to examining fixed effects of particular interest. For example, two types of stationary phase could be assessed in the ruggedness study, or it might be necessary to demonstrate comparability or equivalency between sites. This can be accommodated in the mixed model analysis provided the sample size is sufficient to estimate the appropriate variance adequately. For equivalence this may entail a review for any special cause variation and then pooling similar or low risk sources of variation so that the sample size is adequate. Borman et al. [41] illustrate the use of a two one-sided tests (TOST) procedure to perform an assessment of the equivalency of two methods from a study design similar to that of a ruggedness study, and the design principles set out in this paper can be used to design equivalency studies (see the sketch below). It should also be noted that the TOST approach mentioned above demonstrates the equivalence of means. If there is a desire to demonstrate equivalence of precision across sites then this has difficulties: derivation of the confidence interval for the ratio of variances derived from multiple sources of variation; the large sample size required; and possible non-homogeneity of variances for some attributes, e.g. impurities.
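For the TOST assessment of equivalence of means mentioned above, a minimal sketch (hypothetical data and acceptance limit, using scipy) is to check that the 90% confidence interval for the between-site difference lies entirely within the pre-defined equivalence limits; a mixed-model estimate of the difference and its degrees of freedom, as used by Borman et al. [41], could be substituted for the simple two-sample quantities shown here.

```python
import numpy as np
from scipy import stats

# Hypothetical reportable results (% w/w) from the transferring and receiving sites
site1 = np.array([99.2, 99.4, 99.3, 99.1, 99.5, 99.3])
site2 = np.array([99.0, 99.3, 99.2, 99.4, 99.1, 99.2])
theta = 1.0  # hypothetical equivalence acceptance limit for the difference in means (% w/w)

diff = site1.mean() - site2.mean()
se = np.sqrt(site1.var(ddof=1) / len(site1) + site2.var(ddof=1) / len(site2))
df = len(site1) + len(site2) - 2  # simple approximation; Welch or mixed-model d.f. could be used instead

# A 90% confidence interval corresponds to two one-sided tests, each at the 5% level
t_crit = stats.t.ppf(0.95, df)
ci = (diff - t_crit * se, diff + t_crit * se)
equivalent = (-theta < ci[0]) and (ci[1] < theta)
print(f"difference = {diff:.3f}, 90% CI = ({ci[0]:.3f}, {ci[1]:.3f}), equivalent: {equivalent}")
```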
6. Conclusion

A method ruggedness study can be used to rigorously assess all three types of precision described in ICH Q2 [4] rather than using generic validation experiments to measure repeatability, intermediate precision and reproducibility separately. The structured approach recommended in this tutorial encourages the analyst to focus on finding and eliminating any potential method problems, rather than performing generic precision studies merely as a validation requirement. This will result in improved method performance in routine use. The output from a method risk assessment can be used to identify which noise factors should be incorporated into a study. Designing the study in this way maximises the chances of discovering any unknown sources of special cause variability and enables estimation of the common cause variability. Method controls should be applied to eliminate special cause variability. If the common cause variability is higher than desired, the variation due to each noise factor is assessed and strategies employed to improve the method precision. The estimated method variability can also be combined with knowledge gained in development to help set limits for the continuous verification of the method (e.g. through using system suitability criteria such as acceptable agreement between replicate sample preparations).

References

[1] B. Dejaegher, Y. Vander Heyden, J. Chromatogr. A 1158 (2007) 138–157.
[2] U. Shafrir, R.S. Kenett, Accred. Qual. Assur. 15 (2010) 585–590.
[3] United States Pharmacopoeia, 29th ed., National Formulary, 24th ed., United States Pharmacopoeial Convention, Rockville, MD, USA, 2006, <1225>.
[4] ICH-Topic Q2(R1), Validation of Analytical Procedures, 1994.
[5] United States Pharmacopoeia, 32nd ed., National Formulary, 27th ed., United States Pharmacopoeial Convention, Rockville, MD, USA, 2009, <1225>.
[6] P. Borman, M. Chatfield, P. Nethercote, D. Thompson, K. Truman, Pharm. Technol. 31 (2007) 142–152.
[7] M. Pohl, M. Schweitzer, G. Hansen, M. Hanna-Brown, P. Borman, K. Smith, J. Larew, P. Nethercote, Pharm. Technol. Eur. 22 (2010) 29–36.
[8] K. Ishikawa, What is Total Quality Control? The Japanese Way, Prentice-Hall, Englewood Cliffs, NJ, 1985, pp. 63–64.
[9] D. Stamatis, Failure Modes and Effects Analysis, FMEA from Theory to Execution, second ed., ASQ Quality Press, Milwaukee, WI, 2003.
[10] J. Ermer, J.H.M. Miller, Method Validation in Pharmaceutical Analysis, first ed., Wiley-VCH, Weinheim, Germany, 2005, pp. 32–35.
[11] D.C. Montgomery, G.C. Runger, Qual. Eng. 6 (1993–1994) 115–135.
[12] D.C. Montgomery, G.C. Runger, Qual. Eng. 6 (1993–1994) 289–305.
[13] T. Dasgupta, S.V.S.N. Murthy, Total Qual. Manage. 12 (2001) 649–655.
[14] P. Nethercote, P. Borman, T. Bennett, G. Martin, P. McGregor, Pharm. Manuf. 9 (2010) 37–47.
[15] US Food and Drug Administration, Guidance for Industry – Process Validation: General Principles and Practices, 2011.
[16] M. Swartz, I. Krull, LCGC North America 24 (2006) 480–490, http://chromatographyonline.findanalytichem.com/lcgc/Misc/AnalyticalMethod-Transfer/ArticleStandard/Article/detail/387497.
[17] ISPE, ISPE Good Practice Guide: Technology Transfer, Tampa, Florida, 2003, http://www.ispe.org/guidancedocs/technology transfer.
[18] S. Scypinski, D. Roberts, M. Oates, J. Etse, Pharm. Technol. 26 (2002) 84–89.
[19] W.H. Woodall, J. Qual. Technol. 32 (2000) 341–350.
[20] D.C. Montgomery, Introduction to Statistical Quality Control, sixth ed., Wiley, NJ, USA, 2009, p. 52.
[21] J. Mandel, J. Qual. Technol. 4 (1972) 74–85.
[22] M. Puertollano, T. Cartwright, M. Aylott, N. Kaye, Tablets Capsules 1 (2009) 30–39.
[23] A.I. Khuri, Int. Stat. Rev. 68 (2000) 311–322.
[24] Y. Vander Heyden, K. De Braekeleer, Y. Zhu, E. Roets, J. Hoogmartens, J. De Beer, D.L. Massart, J. Pharm. Biomed. Anal. 20 (1998) 875–887.
[25] ISO, International Organization for Standardization, Accuracy (trueness and precision) of measurement methods and results. Part 3: intermediate measures of the precision of a standard measurement method, ISO 5725-3, 1994.
[26] S. Pfleeger, Ann. Software Eng. 1 (1995) 219–253.
[27] P.K.S. Tsang, J.S.A. Larew, L.A. Larew, T.W. Miyakawa, J.D. Hofer, J. Pharm. Biomed. Anal. 16 (1998) 1125–1141.
[28] W.G. Cochran, G.M. Cox, Experimental Designs, second ed., John Wiley & Sons, New York, USA, 1957, pp. 117, 132, 244.
[29] F.C. Leone, L.S. Nelson, N.L. Johnson, S. Eisenstat, Technometrics 10 (1968) 719–737.
[30] Statistica Software, www.statsoft.com.
[31] P. Borman, M. Chatfield, P. Jackson, A. Laures, G. Okafo, Pharm. Technol. 34 (2010) 72–86.
[32] H. Brown, R. Prescott, Applied Mixed Models in Medicine, second ed., Wiley, West Sussex, England, 2006.
[33] K.D. Majeske, P.C. Hammett, J. Manuf. Process. 5 (2003) 54–65.
[34] M.J. Chatfield, P.J. Borman, Anal. Chem. 81 (2009) 9841–9848.
[35] M. Adams, M. Kiemele, L. Pollock, T. Quan, Lean Six Sigma: A Tools Guide, second ed., Air Academy Associates, CO, USA, 2004.
[36] A. Wald, Ann. Math. Stat. 10 (1939) 299–326.
[37] J.D. Hofer, B.A. Olsen, E.C. Rickard, J. Pharm. Biomed. Anal. 44 (2007) 906–913.
[38] R.K. Burdick, C.M. Borror, D.C. Montgomery, Design and Analysis of Gauge R&R Studies: Making Decisions with Confidence Intervals in Random and Mixed ANOVA Models, ASA-SIAM Series on Statistics and Applied Probability, 17, SIAM/ASA, Philadelphia/Alexandria, VA, 2005.
[39] C.M. Borror, D.C. Montgomery, G.C. Runger, Qual. Reliab. Eng. Int. 13 (1997) 361–369.
[40] R.K. Burdick, C.M. Borror, D.C. Montgomery, J. Qual. Technol. 35 (2003) 342–354.
[41] P.J. Borman, M.J. Chatfield, I. Damjanov, P. Jackson, Anal. Chem. 81 (2009) 9849–9857.