Chemometrics and Intelligent Laboratory Systems 87 (2007) 147–154 www.elsevier.com/locate/chemolab
Using the correct intervals for prediction: A tutorial on tolerance intervals for ordinary least-squares regression

Steven De Gryze a, Ivan Langhans b,c,⁎, Martina Vandebroek c,d

a Department of Plant Sciences, University of California, Davis, USA
b CQ Consultancy, Leuven, Belgium
c Faculty of Economics and Applied Economics, Katholieke Universiteit Leuven, Belgium
d University Center for Statistics, Katholieke Universiteit Leuven, Belgium

⁎ Corresponding author. Faculty of Economics and Applied Economics, Katholieke Universiteit Leuven, Naamsestraat 69, 3000 Leuven, Belgium. E-mail address: [email protected] (I. Langhans).
Received 11 October 2005; received in revised form 11 March 2007; accepted 13 March 2007. Available online 24 March 2007.
Abstract

The use of regression models for prediction purposes is one of the most common applications of chemometrics. In most cases, reporting only the point predictions is unsatisfactory and intervals are needed to quantify the uncertainty involved. It is common practice to use the 95% prediction interval for this goal and to interpret such an interval as if 95% of the future responses will be contained within it. However, this is inappropriate, and it can result in a gross underestimation of the real uncertainty of the predicted response, especially in cases where the degrees of freedom are small. To correctly quantify the uncertainty of a prediction, so-called tolerance intervals should be used. Although the theoretical background on tolerance intervals is well documented in the statistical literature, practical guidelines to calculate and use tolerance intervals in real-world applications are lacking. Another lesser-known concept is that of simultaneous versus non-simultaneous intervals. In this tutorial, we explain the origin, interpretation and practical calculation of confidence, prediction, and tolerance intervals, both in their simultaneous and non-simultaneous form. Depending on the case, equations for exact or approximate tolerance intervals are provided. The accuracy of the approximations is discussed. A MATLAB program is available to estimate the exact width of tolerance intervals in all cases involving ordinary least-squares regression.
© 2007 Elsevier B.V. All rights reserved.
1. Introduction

A large part of chemometrics involves some form of model building based on regression methods. The majority of these chemometric applications focus on the prediction of some property of interest rather than on the estimation of a model parameter (e.g. a factor effect in the analysis of a designed experiment). Examples of such studies are the prediction of constituent concentrations or product properties using calibration models for near-infrared spectrometry, the prediction of reaction yields based on a response surface model in design of experiments (DOE), or the prediction of the quality level of a product based on multivariate process models.
Within this prediction framework we can differentiate between two significantly different objectives.

1.1. Predicting the mean

The first objective is to make a statement about the mean or expected value of a response at one or more given values of the explanatory variable(s) x: μy|x or E(y|x), i.e. the value we would find for the mean if we could collect an infinite number of responses at x under similar conditions. Three examples where predicting the mean is the objective:

• Example 1 — DOE: using the optimal factor setting derived from a response surface modeling (RSM) study, we will predict what the average yield will be if we maintain the process at these settings and produce a large (infinite) number of batches.
• Example 2 — spectroscopic calibration: based on a multivariate calibration model for the concentration of the active
ingredient in a tablet, we can make a statement on the mean of a batch of tablets.
• Example 3 — stability testing: based on data collected over a period of six months, a model is built to predict the average amount of degradation over the next two years.

In these cases, the prediction based on the regression model estimated from a finite number of data will not be exact and we will only be able to specify an interval that contains μy|x with a chosen level of confidence, i.e. the well-known confidence interval for the mean. A confidence level of 95% means that, if we were to repeatedly recollect the data, fit the same model and predict the response for a certain x, then μy|x would be contained in the 95%-confidence interval in 95% of the cases. For a one-factor ordinary least squares (OLS) model the continuum of confidence intervals results in a confidence band around the fitted line, as shown in Fig. 1.

Fig. 1. Introductory example of 8 observations (x, y), for which y = 1 + x + ϵ, ϵ ∼ N(0, σϵ²), with σϵ = 1.5. The OLS regression line is shown, together with the (non-simultaneous) confidence and prediction bands, both with a confidence level of 95%.
1.2. Predicting individual response values

In other cases, the objective is not only to make statements on the expected value, but also to indicate in which range individual responses will be found given values for x.

• Example 1 (continued): specify the range in which the yields of future batches are likely to be found.
• Example 2 (continued): specify the range in which the concentration of the active ingredient will vary in individual tablets ('content uniformity').
• Example 3 (continued): after one year, one sample will be analyzed to verify the model. Specify the interval in which this single result should be found.

It is common practice to use 95% prediction intervals to answer these questions, as visualized by the dashed hyperbola in Fig. 1. This interval is often interpreted as 'if we can maintain x at the given position, then 95% of the response values will end up within the 95% prediction interval'. Unfortunately, this is a fundamentally incorrect interpretation of what the prediction interval really stands for, as can easily be verified by a simulation of the following simple ordinary least squares (OLS) model:

y = \beta_1 + \beta_2 x + \epsilon, \qquad \epsilon \sim N(0, \sigma_\epsilon^2).   (1)

The simulation consists of the following three steps.

1. Generate eight x-values (either at fixed settings or at random locations) and corresponding responses according to Eq. (1). In our simulation we used β1 = β2 = 1 and σϵ = 1.5.
2. Fit the OLS model to the eight (x, y) pairs and calculate the prediction interval (cf. Section 2.3).
3. Calculate the coverage of the prediction interval at a single new point x, i.e. the probability of finding a response value inside the interval. In our simulation the verification is at x = 4.5, which implies that y|x = 4.5 ∼ N(μy|x = 4.5, σy) with μy|x = 4.5 = 5.5 and σy = σϵ = 1.5.
Repeat steps 1 to 3 a number of times to quantify how often a certain coverage occurs. It is found that the statement '95% of response values will fall within the prediction interval' is only true in 70% of the cases (Table 1 and Fig. 2). In 5% of the cases, the prediction interval contains even less than 80% of the response values. Fig. 3 shows an example of eight such generated (x, y) couples, the corresponding OLS regression line, the associated prediction interval and the interval which truly contains 95% of the response values.

Table 1
The coverage of the 95% prediction interval with corresponding levels of confidence estimated from simulation for the example from Fig. 3

Coverage C of the 95% prediction interval    Fraction of cases with at least a coverage C
95                                           0.70
92                                           0.80
85                                           0.90
79                                           0.95
62                                           0.99
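For readers who want to reproduce this experiment, the following is a minimal Python sketch of steps 1–3 and their repetition. It is an illustration, not the MATLAB program that accompanies the paper; the fixed x-grid, random seed and number of repetitions are assumptions chosen for convenience.

```python
# Sketch of the simulation of Section 1.2: how often does a 95% prediction interval
# at x = 4.5 really cover at least 95% of future responses? Settings follow the text;
# the fixed x-grid, seed and number of repetitions are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p, alpha = 8, 2, 0.05
beta, sigma = np.array([1.0, 1.0]), 1.5
x = np.linspace(1.0, 8.0, n)                 # eight fixed x-values
X = np.column_stack([np.ones(n), x])
x_new = np.array([1.0, 4.5])                 # verification point x = 4.5
mu_new = x_new @ beta                        # true mean response, 5.5

coverages = np.empty(20_000)
for j in range(coverages.size):
    y = X @ beta + rng.normal(0.0, sigma, n)                    # step 1: generate data
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)                # step 2: fit the OLS model
    s = np.sqrt(np.sum((y - X @ beta_hat) ** 2) / (n - p))
    H = x_new @ np.linalg.inv(X.T @ X) @ x_new
    half = stats.t.ppf(1 - alpha / 2, n - p) * s * np.sqrt(1 + H)   # 95% prediction interval
    yhat_new = x_new @ beta_hat
    # step 3: exact coverage of this interval, since y|x=4.5 ~ N(5.5, sigma^2)
    coverages[j] = (stats.norm.cdf(yhat_new + half, mu_new, sigma)
                    - stats.norm.cdf(yhat_new - half, mu_new, sigma))

print("fraction of repetitions with coverage >= 0.95:",
      np.mean(coverages >= 0.95))            # close to 0.70 rather than 0.95 (cf. Table 1)
```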
This little-known discrepancy comes from the fact that the '95%' is misinterpreted as a coverage, where in fact it is a confidence level for a single verification of a prediction. Indeed, when steps 1 to 3 of the simulation are repeated a large number of times, and each time in step 3 a single response at x = 4.5 is generated, this individual response will fall inside the prediction interval generated in that same iteration in 95% of the cases. So the prediction interval answers the question "Where will one find a single response value with 95% confidence?", a question which has very little practical relevance in chemometrics (except for Example 3 (continued)). The relevant interval is one that tells us where effectively 95% of the response values will be found (with a certain level of confidence). This is precisely what is given by the much wider tolerance intervals, as shown in Fig. 4. Note that tolerance intervals are by no means new [1–3], nor have they become obsolete [4,5]; they have simply drowned in ignorance [6]. Some standard statistical packages offer tolerance intervals for the simplest model y = μ + ϵ, but even for the simplest regression models, tolerance intervals are lacking.

Fig. 2. The relation between confidence as estimated from simulations (fraction of cases) and coverage of the population (in %) for the case from Fig. 3.

1.3. Simultaneous versus non-simultaneous intervals

Another distinction to be made is the difference between simultaneous and non-simultaneous intervals. If we restrict ourselves to prediction at a single, pre-chosen location x, we can use a so-called non-simultaneous interval. If we want our confidence levels and coverages to be valid over the entire range of x-values, we need simultaneous intervals. In this tutorial we introduce both simultaneous and non-simultaneous tolerance intervals and compare them with confidence and prediction intervals within the framework of ordinary least-squares regression. In a sequel to this paper we will discuss their use for biased regression models such as principal components regression (PCR) and partial least squares (PLS).

2. Theory

2.1. General considerations
2.1.1. Mean, predicted and individual or observed response

We will use the term (individual) response value (denoted by y) to refer to the outcome of a random variable drawn from the population of responses at a position x. Fig. 3 visualizes the distribution of the response by the dotted Gaussian curve for a new response at x = 4.5. The term 'mean response' at location x (denoted by μy|x) then refers to the mean (or expected value) of the response variable, as can be seen in Fig. 3. The mean response is estimated by a random variable which we will refer to as the 'predicted response', denoted by ŷ|x (not shown in Fig. 3). Assume now the following linear model:

y = \beta_1 + \beta_2 x_1 + \cdots + \beta_p x_{p-1} + \epsilon = x\beta + \epsilon,   (2)
in which the βj's are the regression coefficients (β in vector notation), the xj's are the explanatory variables or predictors (x in vector notation), and y is the response (a random variable). The error term ϵ is assumed to be normally distributed with mean 0 and variance σ², so that the mean response at x is μy|x = xβ.
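For concreteness, the short Python sketch below simulates data from the model in Eq. (2) and computes the least-squares quantities used throughout the remainder of the paper: the estimates β̂, the fitted responses ŷ|x and the residual standard deviation s. The design, coefficients and noise level are illustrative assumptions chosen to mirror the introductory example; the sketch is not the authors' MATLAB program.

```python
# Minimal sketch: simulate data from the linear model of Eq. (2) and compute the
# OLS quantities used throughout the paper. All numerical settings are illustrative.
import numpy as np

rng = np.random.default_rng(1)
n, p = 8, 2
x = np.linspace(1.0, 8.0, n)
X = np.column_stack([np.ones(n), x])        # design matrix (intercept + one predictor)
beta = np.array([1.0, 1.0])                 # true coefficients, as in the introductory example
y = X @ beta + rng.normal(0.0, 1.5, n)      # responses y = x*beta + eps, eps ~ N(0, 1.5^2)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y                # least-squares estimates
y_hat = X @ beta_hat                        # predicted (mean) responses yhat|x
s = np.sqrt(np.sum((y - y_hat) ** 2) / (n - p))   # root mean squared error of the residuals

x_new = np.array([1.0, 4.5])                # a new position x_{n+1}
print("prediction at x_new:", x_new @ beta_hat, "  s:", s)
```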
Fig. 3. Introductory example of 8 observations (x, y), for which y = 1 + x + ϵ, ϵ ∼ N(0, σϵ²), with σϵ = 1.5. The mean response and the OLS regression line based on the 8 points are shown. At x = 4.5, a prediction interval is presented. It can be observed that this is different from the interval containing 95% of the population of the responses at this location (dotted line).
Fig. 4. Introductory example of 8 observations (x, y), for which y = 1 + x + ϵ, ϵ ∼ N(0, σϵ²), with σϵ = 1.5. The ordinary least-squares (OLS) regression line is shown, together with the 95%-confidence band for μy|x, the prediction band with 95% confidence and the tolerance band for a 95% proportion with 95% confidence. The region centered around μy|x that truly contains 95% of the population is shown in gray. All bands are non-simultaneous.
Least-squares regression is used to estimate these βj's (the estimates are denoted by β̂j) based on a set of n observations (xi1, …, xi(p−1), yi). The predicted mean response can then be calculated as ŷ|x = xβ̂. The predicted response at a single new position of the predictors, denoted by xn+1, is calculated as xn+1β̂.

2.1.2. Confidence level versus probability and coverage

As is often the case in statistics, the confusion between prediction and tolerance intervals finds its origin in the mix-up of some key concepts; in this case the confusion between probabilities or coverages on the one hand and confidence levels on the other. The first is always associated with the distribution of a random variable, and the second is nearly always used in the context of inference on a population parameter. One of the exceptions is precisely the context of prediction intervals, where the confidence level still has the same interpretation (i.e. what would happen if I were to do it over and over again), but since it is now associated not with a population parameter but with the outcome of the random variable y, it gets misinterpreted as a probability for finding y at that setting of x.

2.1.3. Simultaneous versus non-simultaneous

Traditional intervals constructed in basic statistical problems (e.g. intervals for μ based on the sample mean) are non-simultaneous since only a single statement on a population parameter is made. In regression, non-simultaneous intervals are appropriate if one constructs an interval at one single pre-chosen location xn+1 [7]. If one wants to use the same model for predicting the response at several positions xn+j simultaneously, or in a post-hoc situation, e.g. if one wants to calculate an interval after first determining the optimal setting, a simultaneous interval is appropriate. This is precisely equivalent to multiple comparisons in a one-way ANOVA setting. If only one pre-chosen comparison is to be made, Fisher's LSD test is appropriate, but if more comparisons are made, other approaches are needed. If a limited number of comparisons are made, we can use a Bonferroni correction to maintain the overall Type I error rate α. If this α is to be valid for any prediction at an
arbitrary linear combination of factor levels, then a Scheffé correction is needed. Another way to derive simultaneous intervals (a.k.a. confidence bands) is to consider the joint confidence region of the regression coefficients [6].

2.1.4. One-sided versus two-sided

Intervals can be constructed one-sided or two-sided. One-sided intervals are appropriate if one is only concerned with placing a boundary on extreme values on one side; in all other cases two-sided intervals are used.

2.2. Confidence interval

The (non-simultaneous) confidence interval for the expected response at xn+1 with confidence level 1 − α is the interval IC(xn+1) around ŷ|xn+1 for which

\Pr[\mu_{y|x_{n+1}} \in I_C(x_{n+1})] = 1 - \alpha,   (3)

in which μy|xn+1 is the mean response of y at xn+1. In the case of upper one-sided confidence intervals, IC,1(xn+1) = (−∞, ŷ|xn+1 + ΔC,1(xn+1)s]; in the case of two-sided confidence intervals, IC,2(xn+1) = ŷ|xn+1 ± ΔC,2(xn+1)s, in which s is an estimator for σ (s² corresponds to the mean squared error of the residuals). The predicted response ŷ|xn+1 is a normally distributed random variable for which

E(\hat{y}|x_{n+1}) = \mu_{y|x_{n+1}}, \qquad \mathrm{Var}(\hat{y}|x_{n+1}) = \sigma^2\, x_{n+1}'(X'X)^{-1}x_{n+1}.   (4)

This leads to the well-known confidence intervals, calculated from

\Delta_{C,1}(x_{n+1}) = t(n-p, \alpha)\,\sqrt{x_{n+1}'(X'X)^{-1}x_{n+1}},
\Delta_{C,2}(x_{n+1}) = t\!\left(n-p, \tfrac{\alpha}{2}\right)\sqrt{x_{n+1}'(X'X)^{-1}x_{n+1}},   (5)

where t(ν, α) is the (1 − α)100% percentile of the t-distribution with ν degrees of freedom. These intervals hold only for prediction at a single xn+1 (non-simultaneous). For a two-sided simultaneous interval, the t(n − p, α/2) factor in Eq. (5) should be replaced by
\sqrt{p\,F(p, n-p, \alpha)},

in which F(d1, d2, α) is the (1 − α)100% percentile of the F-distribution with d1 and d2 degrees of freedom [7]. The resulting confidence band is often referred to as the Working–Hotelling band.

2.3. Prediction interval

The (non-simultaneous) prediction interval for the predicted response at xn+1 with confidence 1 − α is the interval IP(xn+1) for which

\Pr[\,y|x_{n+1} \in I_P(x_{n+1})\,] = 1 - \alpha,   (6)
in which y|xn+1 is the response at a single new position xn+1. In the case of upper one-sided prediction intervals, IP,1(xn+1) = (−∞, ŷ|xn+1 + ΔP,1(xn+1)s]; in the case of two-sided prediction intervals, IP,2(xn+1) = ŷ|xn+1 ± ΔP,2(xn+1)s. It is important to note here that the confidence statement bounds the difference y|xn+1 − ŷ|xn+1 rather than y|xn+1 itself [6]. Therefore, (absolute) inferences on y|xn+1 cannot be obtained using a prediction interval, as illustrated by simulation in Section 1.2. To avoid confusion, it is therefore better to use the following definitions:

one-sided (upper limit): \Pr[\,y|x_{n+1} - \hat{y}|x_{n+1} \le \Delta_{P,1}(x_{n+1})\,s\,] = 1 - \alpha,   (7)

two-sided: \Pr[\,|y|x_{n+1} - \hat{y}|x_{n+1}| \le \Delta_{P,2}(x_{n+1})\,s\,] = 1 - \alpha.   (8)

The variance of the difference y|xn+1 − ŷ|xn+1 is

\mathrm{Var}(y|x_{n+1} - \hat{y}|x_{n+1}) = \sigma^2 + \sigma^2\, x_{n+1}'(X'X)^{-1}x_{n+1},

which is estimated by s²(1 + xn+1′(X′X)⁻¹xn+1). This leads to the classic prediction intervals, which can be calculated from

\Delta_{P,1}(x_{n+1}) = t(n-p, \alpha)\,\sqrt{1 + x_{n+1}'(X'X)^{-1}x_{n+1}},
\Delta_{P,2}(x_{n+1}) = t\!\left(n-p, \tfrac{\alpha}{2}\right)\sqrt{1 + x_{n+1}'(X'X)^{-1}x_{n+1}}.   (9)

Since α relates to a confidence level and not to a probability, many misunderstandings would have been avoided by giving this interval a more appropriate (but less catchy) name, such as the confidence interval for the distance of a single new observation to the corresponding ŷ. The prediction intervals above are all non-simultaneous; they are only valid at a single new location xn+1. Similarly to confidence intervals, two-sided simultaneous prediction intervals can be calculated by replacing t(n − p, α/2) in Eq. (9) by √(pF(p, n − p, α)) [8].

2.4. Tolerance interval

2.4.1. Definitions

A non-simultaneous tolerance interval at xn+1 which contains a proportion P of the individual responses y|xn+1 with confidence 1 − α is defined as the interval IT(xn+1) around the predicted response ŷ|xn+1 for which

\Pr[\,\Pr[\,y|x_{n+1} \in I_T(x_{n+1})\,] \ge P\,] = 1 - \alpha.   (10)

Note that this definition contains two probability statements. The probability Pr[y|xn+1 ∈ IT(xn+1)] is called the coverage of the interval IT(xn+1). In the case of upper one-sided tolerance intervals, IT,1(xn+1) = (−∞, ŷ|xn+1 + ΔT,1s]; in the case of two-sided tolerance intervals, IT,2(xn+1) = ŷ|xn+1 ± ΔT,2s.

In the simultaneous case, the probability statement must be valid for a set of predictions at various combinations of the predictors, with overall confidence level 1 − α. Define the points at which (simultaneous) prediction will take place as the set S. Then the simultaneous tolerance interval is defined as

\Pr[\,\Pr[\,y|x_{n+j} \in I_T(x_{n+j})\,] \ge P \ \text{for each } x_{n+j} \in S\,] = 1 - \alpha,   (11)

in which IT(xn+j) is defined analogously to a non-simultaneous tolerance interval.

2.4.2. Calculation of a non-simultaneous tolerance interval: one-sided case

This tolerance interval can be derived analytically and makes use of the non-central t-distribution [9]. For this, we work from the definition of ΔT,1:

\Pr[\,\Pr[\,y|x_{n+1} \le \hat{y}|x_{n+1} + \Delta_{T,1}\,s\,] \ge P\,] = 1 - \alpha.   (12)

Since y|xn+1 is normally distributed with mean μy|xn+1 and variance σ², it follows that
\Pr\!\left[\frac{\hat{y}|x_{n+1} + \Delta_{T,1}\,s - \mu_{y|x_{n+1}}}{\sigma} \ge \Phi^{-1}(P)\right] = 1 - \alpha,   (13)

where Φ⁻¹(P) is the inverse of the standard normal cumulative distribution function. Rearranging this equation, and dividing both sides by s/σ and by the constant √H, where H = xn+1′(X′X)⁻¹xn+1, yields

\Pr\!\left[\frac{\dfrac{\mu_{y|x_{n+1}} - \hat{y}|x_{n+1}}{\sigma\sqrt{H}} + \dfrac{\Phi^{-1}(P)}{\sqrt{H}}}{s/\sigma} \le \frac{\Delta_{T,1}}{\sqrt{H}}\right] = 1 - \alpha.   (14)
Since ŷ|xn+1 is a random variable with mean μy|xn+1 and variance Var(ŷ|xn+1) = σ²H, the term (μy|xn+1 − ŷ|xn+1)/(σ√H) is a standard normally distributed random variable. The left-hand side of the inequality in Eq. (14) thus consists of this standard normally distributed random variable, plus a constant Φ⁻¹(P)/√H, divided by the random variable s/σ, and hence it follows a non-central t-distribution with non-centrality parameter Φ⁻¹(P)/√H. Therefore ΔT,1 can be calculated as

\Delta_{T,1} = \sqrt{H}\; t'\!\left(n-p,\ \alpha,\ \frac{\Phi^{-1}(P)}{\sqrt{H}}\right),   (15)

in which t′(ν, α, δ) is, analogously to t(ν, α), the (1 − α)100% percentile of the non-central t-distribution with ν degrees of freedom and non-centrality parameter δ.
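As an illustration, Eq. (15) can be evaluated with any routine for non-central t quantiles. The short Python sketch below (using scipy rather than the authors' MATLAB code) does so for the simple-regression design of Section 3 — six equidistant x-values, prediction at their mean, an assumption taken from that example — and contrasts the result with the corresponding one-sided prediction factor.

```python
# Sketch of Eq. (15): exact one-sided tolerance factor via the non-central t-distribution,
# compared with the one-sided prediction factor of Eq. (9). The 6-point design and the
# evaluation point (the mean of the x_i) mirror the simple-regression example of Section 3.
import numpy as np
from scipy import stats

alpha, P = 0.05, 0.95
n, p = 6, 2
x = np.arange(1.0, n + 1.0)
X = np.column_stack([np.ones(n), x])
x_new = np.array([1.0, x.mean()])

H = x_new @ np.linalg.inv(X.T @ X) @ x_new              # H = x'(X'X)^{-1} x
delta = stats.norm.ppf(P) / np.sqrt(H)                  # non-centrality parameter
delta_T1 = np.sqrt(H) * stats.nct.ppf(1 - alpha, n - p, delta)   # Eq. (15)
delta_P1 = stats.t.ppf(1 - alpha, n - p) * np.sqrt(1 + H)        # one-sided version of Eq. (9)

print("exact one-sided tolerance factor:", delta_T1)    # compare with 4.18 in Table 2
print("one-sided prediction factor:     ", delta_P1)    # considerably smaller
```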
2.4.3. Calculation of a non-simultaneous tolerance interval: two-sided case

In contrast to the one-sided case, no analytical solution exists for two-sided tolerance intervals. However, one can at least bound the tolerance interval using the Bonferroni inequality, as was pointed out by Lieberman and Miller [10] and Miller [8]. If both the population mean response μy|xn+1 and the population standard deviation σ were known, a two-sided tolerance interval would simply be

\mu_{y|x_{n+1}} \pm \Phi^{-1}\!\left(\tfrac{1+P}{2}\right)\sigma.   (16)

When these population parameters are estimated by sample statistics, a tolerance interval has to encompass two sources of uncertainty: (1) the estimation of μy|xn+1 by ŷ|xn+1, and (2) the estimation of σ by s. As was seen before, the uncertainty on μy|xn+1 can be quantified by a confidence interval (with confidence 1 − α):

\hat{y}|x_{n+1} - t\!\left(n-p, \tfrac{\alpha}{2}\right) s\sqrt{H} \;\le\; \mu_{y|x_{n+1}} \;\le\; \hat{y}|x_{n+1} + t\!\left(n-p, \tfrac{\alpha}{2}\right) s\sqrt{H}.   (17)
In addition, the uncertainty on σ can be quantified by (with confidence 1 − α)

\sigma \le s\,\sqrt{\frac{n-p}{\chi^2(n-p,\ 1-\alpha)}},   (18)

with χ²(n − p, 1 − α) the value that a χ² variable with n − p degrees of freedom exceeds with probability 1 − α (i.e. its lower α·100% percentile). Lieberman and Miller [10] used the Bonferroni inequality to combine these two statements for simultaneous tolerance intervals, and Zorn et al. [11] incorrectly translated this to the following non-simultaneous two-sided tolerance interval (with confidence 1 − α):

I_{T,2} \supset \hat{y}|x_{n+1} \pm s\left[\,t\!\left(n-p, \tfrac{\alpha}{2}\right)\sqrt{H} + \Phi^{-1}\!\left(\tfrac{1+P}{2}\right)\sqrt{\frac{n-p}{\chi^2\!\left(n-p,\ 1-\tfrac{\alpha}{2}\right)}}\,\right].   (19)

The error made by Zorn et al. is that, if a Bonferroni correction is applied, one should use α/4 for the t-quantile. But the original use of the Bonferroni correction by Lieberman and Miller also deserves some critique, since this approach superimposes the worst-case scenario for the approximation of μy|xn+1 on the worst-case scenario for the approximation of σ. Since these two quantities are not independent at all, Eq. (19) overestimates the true width of the tolerance intervals and is therefore strongly conservative. We found that taking α in both the χ²(n − p) and the t(n − p) quantiles in Eq. (19) led to more correct (and less conservative) tolerance intervals. The approximation then becomes

I_{T,2} \approx \hat{y}|x_{n+1} \pm s\left[\,t(n-p, \alpha)\sqrt{H} + \Phi^{-1}\!\left(\tfrac{1+P}{2}\right)\sqrt{\frac{n-p}{\chi^2(n-p,\ 1-\alpha)}}\,\right].   (20)

In what follows, the latter equation is referred to as the modified Lieberman and Miller [10] approximation. We will evaluate these approximations in Section 3. Other, more sophisticated methods are discussed by Lieberman and Miller [10], but these authors pointed out that those do not guarantee a smaller tolerance interval. Moreover, the simple formula in Eq. (20) performs just as well as the more complex methods.

Note that for a large number of observations the mean response is known exactly, and the prediction and tolerance intervals converge towards each other. As the degrees of freedom approach infinity, the prediction and tolerance intervals become equal. In that case, s converges to σ and H = xn+1′(X′X)⁻¹xn+1 converges to 0. In addition,

\lim_{n-p \to \infty} \sqrt{\frac{n-p}{\chi^2\!\left(n-p,\ 1-\tfrac{\alpha}{2}\right)}} = 1,   (21)

and therefore

\lim_{n-p \to \infty} I_{T,2} = \mu_{y|x_{n+1}} \pm \Phi^{-1}\!\left(\tfrac{1+P}{2}\right)\sigma.   (22)

Furthermore, in this case, the prediction interval converges to
\lim_{n-p \to \infty}\left[\hat{y}|x_{n+1} \pm s\, t\!\left(n-p, \tfrac{\alpha}{2}\right)\sqrt{1 + x_{n+1}'(X'X)^{-1}x_{n+1}}\right] = \mu_{y|x_{n+1}} \pm \Phi^{-1}\!\left(1-\tfrac{\alpha}{2}\right)\sigma,   (23)

which coincides with the limit in Eq. (22) when P = 1 − α (e.g. a 95% proportion with 95% confidence).
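To make the two-sided approximations concrete, the sketch below evaluates Eqs. (19) and (20), as reconstructed above, for the setting of Table 3 (simple linear regression, n − p = 4, prediction at the mean of the xi, P = 95%, 95% confidence). It is an illustration in Python under the notational conventions used here, not the authors' program.

```python
# Sketch of the two-sided approximations: Eq. (19) (Zorn-type Bonferroni bound) and
# Eq. (20) (the modified approximation). Settings follow Table 3: simple regression,
# n - p = 4 degrees of freedom, prediction at the mean of the x_i.
import numpy as np
from scipy import stats

alpha, P = 0.05, 0.95
n, p = 6, 2
x = np.arange(1.0, n + 1.0)
X = np.column_stack([np.ones(n), x])
x_new = np.array([1.0, x.mean()])
H = x_new @ np.linalg.inv(X.T @ X) @ x_new
df = n - p
z2 = stats.norm.ppf((1 + P) / 2)                     # two-sided normal factor (1.96 for P = 0.95)

# chi2(df, 1 - alpha) in the text: the value exceeded with probability 1 - alpha,
# i.e. the lower-alpha quantile of the chi-square distribution.
chi2_a  = stats.chi2.ppf(alpha, df)
chi2_a2 = stats.chi2.ppf(alpha / 2, df)

delta_T2_zorn = stats.t.ppf(1 - alpha / 2, df) * np.sqrt(H) + z2 * np.sqrt(df / chi2_a2)  # Eq. (19)
delta_T2      = stats.t.ppf(1 - alpha,     df) * np.sqrt(H) + z2 * np.sqrt(df / chi2_a)   # Eq. (20)

print("Eq. (19) factor:", delta_T2_zorn)             # about 6.8 for df = 4 (cf. Table 3)
print("Eq. (20) factor:", delta_T2)                  # about 5.5 for df = 4 (cf. Table 3)
```

The simultaneous bound derived next (Eq. (24)) follows the same pattern, with the t-quantile replaced by a Scheffé-type factor based on the F-distribution.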
2.4.4. Calculation of a simultaneous tolerance interval: two-sided case

As in the non-simultaneous case, the Bonferroni inequality can be used to construct bounds for a simultaneous tolerance interval; see Lieberman and Miller [10] and Miller [8]. This interval looks like

I_{T,2} \supset \hat{y}|x_{n+1} \pm s\left[\,\sqrt{p\,F(p, n-p, \alpha/2)}\,\sqrt{H} + \Phi^{-1}\!\left(\tfrac{1+P}{2}\right)\sqrt{\frac{n-p}{\chi^2(n-p,\ 1-\alpha/2)}}\,\right].   (24)

Mee et al. [12] reported a narrower solution for simultaneous tolerance intervals, but only for a region within a pre-specified distance to the average of the explanatory variables. The solution is rather involved, as factors have to be looked up in Mee et al. [12].

3. Simulations: tolerance intervals in regression at work

In this section, the estimation of tolerance intervals is discussed, and the Lieberman approximations are evaluated in various situations.

3.1. The simulation algorithm

3.1.1. Non-simultaneous case

The simulation algorithm calculates the proportion P for a given confidence 1 − α, a position of a new observation xn+1, and a given tolerance factor ΔT,1 (one-sided case) or ΔT,2 (two-sided case), as defined in Section 2.4.1. The proportion P also depends on n and X.
Table 2
Tolerance factors ΔT (see Section 2.4.1) for 95% confidence and a proportion of 95%, in one- and two-sided cases, for simple (n = 6, p = 2) and multiple (n = 8, p = 4) linear regression, and for the new observation xn+1 at the mean or at the maximum of the explanatory variables xi (i ≤ n)

                                              xn+1 at the mean of the xi           xn+1 at the extremes
                   Tails       Regression     Simulation  Exact  Approximate       Simulation  Exact  Approximate
Non-simultaneous   One-sided   Simple         4.16        4.18   –                 4.59        4.60   –
                               Multiple       4.11        4.08   –                 4.58        4.59   –
                   Two-sided   Simple         5.02        –      5.52              5.16        –      6.19
                               Multiple       4.69        –      5.40              5.14        –      6.16
Simultaneous       One-sided   Simple         –           –      –                 5.40        –      –
                               Multiple       –           –      –                 6.18        –      –
                   Two-sided   Simple         –           –      6.17              6.06        –      7.35
                               Multiple       –           –      5.97              6.77        –      7.28

For the one-sided, non-simultaneous tolerance intervals, exact ΔT factors are calculated with Eq. (15); in the two-sided, non-simultaneous case, ΔT is approximated with Eq. (20); in the two-sided simultaneous case, with Eq. (24).
To calculate P, a vector C of N coverages (e.g. N ≥ 50,000) is generated using the following steps for j = 1 … N (a code sketch follows the list).

1. Generate a new set of n 'observed' responses yi for each of the vectors xi (the rows of X) using a normally distributed random number generator, so that yi ∼ N(xiβ, σ²), with β a vector of ones and σ² = 1.
2. From these n (xi, yi) observations, calculate the β̂ vector, the predicted responses ŷi = xiβ̂, and the root mean squared error of the residuals, s, using least-squares regression.
3. Calculate the tolerance interval at xn+1 based on the factor ΔT,1 or ΔT,2, and its coverage C(j).
   • One-sided case: the tolerance interval is (−∞, UL1] = (−∞, ŷ|xn+1 + sΔT,1]. The coverage is calculated from
     C(j) = \Phi\!\left(\frac{UL_1 - \mu_{y|x_{n+1}}}{\sigma}\right).   (25)
   • Two-sided case: the tolerance interval is [LL2, UL2] = ŷ|xn+1 ± sΔT,2. The coverage is given by
     C(j) = \Phi\!\left(\frac{UL_2 - \mu_{y|x_{n+1}}}{\sigma}\right) - \Phi\!\left(\frac{LL_2 - \mu_{y|x_{n+1}}}{\sigma}\right).   (26)
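The steps above might be implemented along the following lines — a hedged Python sketch rather than the MATLAB program mentioned in the abstract; the candidate tolerance factor, seed, design and number of iterations are illustrative assumptions.

```python
# Sketch of the non-simultaneous, two-sided simulation: generate N coverages for a
# candidate tolerance factor and read off the proportion P that can be claimed with
# confidence 1 - alpha. The design mirrors the simple-regression example (n = 6, p = 2).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
alpha, N = 0.05, 50_000
n, p, sigma = 6, 2, 1.0
x = np.arange(1.0, n + 1.0)
X = np.column_stack([np.ones(n), x])
beta = np.ones(p)                                   # beta: a vector of ones
x_new = np.array([1.0, x.mean()])
mu_new = x_new @ beta
delta_T2 = 5.0                                      # candidate two-sided tolerance factor

C = np.empty(N)
for j in range(N):
    y = X @ beta + rng.normal(0.0, sigma, n)                         # step 1
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)                     # step 2
    s = np.sqrt(np.sum((y - X @ beta_hat) ** 2) / (n - p))
    yhat_new = x_new @ beta_hat
    LL2, UL2 = yhat_new - s * delta_T2, yhat_new + s * delta_T2      # step 3
    C[j] = stats.norm.cdf(UL2, mu_new, sigma) - stats.norm.cdf(LL2, mu_new, sigma)   # Eq. (26)

P_claimed = np.quantile(C, alpha)    # coverage exceeded in a fraction 1 - alpha of iterations
print("with 95% confidence the interval covers at least a proportion", P_claimed)
# The 'reversed problem' described next -- finding delta_T2 for a requested P -- can be
# solved by iterating on delta_T2 (e.g. by bisection) until P_claimed matches the target.
```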
After N iterations, the proportion P is the α quantile of the vector C, i.e. the coverage that is exceeded in a fraction 1 − α of the iterations (the element at position αN after sorting the vector C in ascending order). The reversed problem, estimating the width ΔT,1 or ΔT,2 for a chosen proportion P and confidence 1 − α, is solved using an optimization algorithm: many candidate Δ̃T values are evaluated iteratively until the resulting P̃ equals the requested P. The starting value of Δ̃T in the optimization procedure was set to the width of a prediction interval, ΔP.

3.1.2. Simultaneous case

Simultaneous tolerance intervals are valid for each element within an interval [xmin, xmax]. For the simulations, we chose xmin = min(xi) and xmax = max(xi). The simulation procedure is very similar to the non-simultaneous case, except that for each iteration j, the coverages at xmin and xmax are evaluated, and the
smallest of the two is stored in C(j). Therefore, this procedure leads to a worst-case or conservative estimate of a simultaneous tolerance interval in which the width of the tolerance interval is equal for all positions between xmin and xmax.

3.2. Simulation results

The simulation algorithm to estimate ΔT is illustrated with two examples: (1) simple linear regression with 6 equidistant xi values (n = 6, p = 2), and (2) a three-factor full-factorial design with two levels for each factor and one replicate for each combination of the factors (n = 8, p = 4). Note that in both cases the degrees of freedom are n − p = 4. In the one-sided case, the accuracy of the algorithm was tested by comparing the produced Δ̃T,1 factors to the exact ΔT,1 factors from Eq. (15). Table 2 shows that 50,000 iterations were sufficient to estimate the correct ΔT,1 with less than 1% error. In the two-sided case, no exact solution exists, only the approximations given in Eqs. (19) (Zorn) and (20) (our modification).

Table 3
The influence of the degrees of freedom (df) on the simulated and approximated two-sided non-simultaneous tolerance intervals, together with their effective coverages (between brackets)

df   Simulated ΔT,2   Zorn approximation of ΔT,2   Our approximation of ΔT,2   ΔP,2
1    35.4             69.9 (100)                   34.9 (94.9)                 14.7 (56.4)
2    9.54             14.5 (99.7)                  10.1 (96.4)                 4.81 (66.4)
3    6.26             8.73 (99.5)                  6.78 (96.7)                 3.49 (71.4)
4    5.02             6.77 (99.3)                  5.52 (96.9)                 3.00 (75.2)
5    4.37             5.78 (99.1)                  4.86 (97.0)                 2.75 (77.5)
6    3.99             5.18 (99.0)                  4.44 (97.1)                 2.60 (79.3)
7    3.71             4.78 (98.9)                  4.15 (97.2)                 2.49 (80.7)
8    3.53             4.48 (98.8)                  3.94 (97.2)                 2.42 (81.8)
9    3.37             4.26 (98.7)                  3.78 (97.2)                 2.36 (82.8)
10   3.27             4.08 (98.6)                  3.65 (97.2)                 2.32 (83.5)
20   2.73             3.28 (98.1)                  3.03 (97.1)                 2.13 (87.4)
30   2.54             2.98 (97.9)                  2.80 (96.9)                 2.07 (89.1)
40   2.44             2.82 (97.7)                  2.67 (96.8)                 2.04 (90.0)
50   2.37             2.71 (97.5)                  2.58 (96.7)                 2.03 (90.6)

The original Lieberman approximation was calculated using Eq. (19), and our modified Lieberman approximation using Eq. (20). Results are for a proportion P of 95%, a confidence level of 95%, simple linear regression (p = 2) and prediction at the mean of the explanatory variables x.
Table 2 contains the approximation from Eq. (20). While the approximated and simulated ΔT,2 factors correspond very well for xn+1 at the mean (error < 1%), a slightly larger deviation is present when the position of xn+1 is more extreme. In that case, our modification overestimates the true ΔT,2 factor by about 5%, and the approximation becomes more conservative.

Table 3 shows that the ΔT,2 factor decreases with increasing degrees of freedom until it converges to Φ⁻¹((1 + P)/2) = 1.96 (see Section 2.4.3), where the tolerance and prediction bands become equal. It follows that the difference between prediction and tolerance intervals is especially important for cases where the degrees of freedom are relatively small (< 50). As a consequence, the commonly made error of assuming that a 95%-prediction interval contains 95% of the population of y|xn+1 leads to a severe error when the degrees of freedom are small (< 20). For example, the 95%-confidence prediction interval covers only 83% for 10 degrees of freedom. However, even when the degrees of freedom are large, a tolerance interval is still considerably wider than a prediction interval: for df = 50, the latter covers only 91%. It is actually more informative to think in terms of 1 − P here: in terms of Example 2, at df = 50 on average twice as many tablets will fall outside the limits calculated from a prediction interval as compared to the expected 5%. At df = 20, the number of out-of-limit cases will be almost three times higher than suggested by the 95% prediction interval.

Also from Table 3, it becomes clear that the Zorn approximation is especially conservative for small degrees of freedom. For example, the true coverage of a 95% Zorn-approximated tolerance interval is > 99% for 5 degrees of freedom or less. Table 3 shows that the accuracy of this approximation increases with the degrees of freedom. Our approximation follows the simulated ΔT,2 factor much more closely: its accuracy is only mildly dependent on the degrees of freedom, and its coverage never deviated more than 1.3% from the simulated coverage. Our approximation will produce satisfactory tolerance intervals for most cases and will in any case be far more correct than the prediction intervals (which are also conceptually wrong). If more precise intervals are needed, they can be constructed using the simulation program.

4. Conclusions

Tolerance intervals are hardly known and even less used in practice. We have demonstrated that the classical prediction interval is used in many instances where in fact tolerance intervals should be used. We have shown that there is a large discrepancy
between the sizes of prediction and tolerance intervals when the degrees of freedom are relatively small (< 50). For the non-simultaneous case, the tolerance intervals can be calculated directly in the one-sided case and approximated in the two-sided case. For simultaneous intervals similar approximations are given. These approximations can be somewhat conservative; should this be a problem, the simulation approach is a valid alternative, and the MATLAB program to calculate the various tolerance intervals is available on request.

Although this tutorial was restricted to OLS regression with normal, homoscedastic noise, deviations from these premises can never be a justification to cling to the familiar prediction intervals, since these are not only conceptually wrong but also consistently narrower than the appropriate tolerance intervals. The development of tolerance intervals for PLS regression models is the topic of the sequel to this paper.

Acknowledgement

This research was supported by the Fund for Scientific Research-Flanders (FWO project G.0611.05).

References

[1] S. Wilks, Determination of sample sizes for setting tolerance limits, Annals of Mathematical Statistics 12 (1941) 91–96.
[2] A. Wald, J. Wolfowitz, Tolerance limits for a normal distribution, Annals of Mathematical Statistics 17 (1946) 208–215.
[3] W.A. Wallis, Tolerance intervals for linear regression, in: J. Neyman (Ed.), Second Berkeley Symposium on Mathematical Statistics and Probability, University of California Press, Berkeley and Los Angeles, CA, 1951, pp. 43–51.
[4] S.B. Vardeman, What about the other intervals? The American Statistician 46 (1992) 193–197.
[5] Y. Lee, T. Mathew, Tolerance regions in multivariate linear regression, Journal of Statistical Planning and Inference 126 (2004) 253–271.
[6] R. Myers, Classical and Modern Regression with Applications, PWS-Kent Publishing Company, Boston, U.S.A., 1990.
[7] T.P. Lane, W.H. DuMouchel, Simultaneous confidence intervals in multiple regression, The American Statistician 48 (1994) 315–321.
[8] R.G.J. Miller, Simultaneous Statistical Inference, second edition, Springer-Verlag, New York, U.S.A., 1981.
[9] D.B. Owen, A survey of properties and applications of the noncentral t-distribution, Technometrics 10 (1968) 445–478.
[10] G. Lieberman, R. Miller, Simultaneous tolerance intervals in regression, Biometrika 50 (1963) 155–168.
[11] M.E. Zorn, R.D. Gibbons, W.C. Sonzogni, Weighted least-squares approach to calculating limits of detection and quantification by modeling variability as a function of concentration, Analytical Chemistry 69 (1997) 3069–3075.
[12] R. Mee, K. Eberhardt, C. Reeve, Calibration and simultaneous tolerance intervals for regression, Technometrics 33 (1991) 211–219.