14. ONE- AND TWO-TAILED TESTS
This chapter is going to be rather dull and boring. The topic we will discuss is important, but our discussion of it will be rather mundane; nor do we have any cute tricks up our sleeves to enliven the presentation (at least, not at this point, when we are just starting to write it; maybe something will occur to us as we go along. We can always hope). So, onward we push.

Figure 14.1-A shows a diagram we should all be used to by now, a Normal distribution plot with the 95% confidence interval marked. Recall the meaning of this: given a random, normally distributed variable, 95% of the time a randomly selected value will fall between the lines, so that the probability of finding a value between μ − 1.96σ and μ + 1.96σ is 0.95. Since probabilities must add to 1.0, the probability of finding a value outside that range is only 0.05, so that if a measurement is made that does in fact fall outside that range, we consider that the probability of it happening purely because of the random variability of the data is so low that we say, "this value is too improbable to have happened by chance alone; something beyond the randomness of the reference population must be operating". This statement, or a variant of it, is the basis for performing statistical hypothesis testing.

The Normal distribution, of course, applies only to certain statistics under the proper conditions; so far we have discussed the fact that it applies to data subjected to many independent sources of variation and to the mean of a set of data of any distribution. At this point we will note but not pursue the fact that there are two possible explanations why data can fall outside the 95% confidence interval: either
STATISTICS IN SPECTROSCOPY
FIGURE 14.1. Part A shows the conditions for a two-tailed hypothesis test. Each of the tails, which start at ±1.96σ, contains 2.5% of the possible cases; this is appropriate for testing if a measurement is different from the hypothesized value of μ. Part B shows the conditions for a one-tailed test; this is appropriate for testing whether a measurement is higher (or lower, if the other tail is used) than the hypothesized population value. For a one-tailed test, the tail starts at 1.65σ.
the population mean for the measured data is not the hypothesized mean, or the population value of the standard deviation is not the hypothesized value. Either of these situations is a possible explanation for the failure of any given null hypothesis concerning μ. What we do wish to point out here is that the 95% confidence interval shown in Figure 14.1-A is symmetrical, and is appropriate to a null hypothesis of μ being equal to some value that we wish to test, against the alternative hypothesis that μ does not equal that value. It is this alternative hypothesis that is the subject of this chapter.
Stating the alternative hypothesis that way, i.e., simply allowing μ not to equal the hypothesized value, leads to the diagram of Figure 14.1-A and the associated probabilities: 95% of the time a sample from the specified population falls within the indicated range, 2.5% of the time it is above the range, and 2.5% of the time it is below the range.

Now you are probably sitting there wondering, "What else could there be? μ either equals the hypothesized value or it doesn't." Well, that is almost but not quite true (surprise!). When we draw a diagram such as in Figure 14.1-A, we make an implicit assumption: that, if μ does not equal the hypothesized value, the probability of it being above the hypothesized value is equal to the probability of it being below the hypothesized value. This assumption is sometimes true, but sometimes not. If it is true, then Figure 14.1-A is a correct and proper description of the situation. On the other hand, suppose that assumption is not true; then we have a problem.

To illustrate the problem, let us suppose that the physical situation prevents μ from ever being less than the hypothesized value; it can only be greater. In this case, we would never obtain a value below the low end of the confidence interval due to failure of the null hypothesis; under these conditions the lower end of the distribution might as well not exist, since only the upper end is operative. But random samples from the distribution obtained from the population specified by the null hypothesis will exceed the upper bound only 2.5% of the time under these conditions, not the 5% originally specified. Thus, under conditions where the possible values are restricted, what was supposed to be a hypothesis test at a 95% confidence level is actually being performed at a 97.5% confidence level. Intuitively you might like the extra confidence, but that is not an objective criterion for doing scientific work.
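These tail areas are easy to verify numerically. The following sketch is ours, not part of the original discussion; it assumes Python's standard `statistics` module and a standard Normal variable:

```python
from statistics import NormalDist

z = NormalDist()  # standard Normal: mu = 0, sigma = 1

# Two-tailed test at the 95% level: 2.5% of samples fall in each tail
# beyond +/-1.96 sigma, leaving 95% in between.
upper_tail = 1.0 - z.cdf(1.96)
lower_tail = z.cdf(-1.96)
print(f"upper tail area:    {upper_tail:.4f}")                    # ~0.025
print(f"lower tail area:    {lower_tail:.4f}")                    # ~0.025
print(f"within the limits:  {1 - upper_tail - lower_tail:.4f}")   # ~0.950

# If the physics prevents mu from ever being below the hypothesized
# value, only the upper tail can ever produce a rejection, so the null
# hypothesis is rejected just 2.5% of the time: the test is effectively
# running at a 97.5% confidence level, not 95%.
print(f"effective confidence level: {1 - upper_tail:.4f}")        # ~0.975
```

Restoring a true 95% level means moving the single cutoff to the point with 5% of the area above it, `z.inv_cdf(0.95)` ≈ 1.645σ, which is the 1.65σ shown in Figure 14.1-B.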
If a true 95% confidence level for the hypothesis test is desired, it is necessary to adjust the position of the upper confidence limit to the point where, in fact, 95% of the samples from the test population fall within it. In the case of the Normal distribution, this value is μ + 1.65σ. This value compensates for the fact that the lower confidence limit has disappeared. If a higher confidence level is desired, it should be introduced explicitly, by setting the confidence level to 97.5% (or 99%, or whatever confidence level is appropriate). Figure 14.1-B presents the probability diagram for this condition. Ninety-five percent of the data lies below μ + 1.65σ and 5% lies above. Contrast this with Figure 14.1-A, where the corresponding points are μ ± 1.96σ.

The region outside the confidence limits is, in both cases, called the rejection region for the hypothesis test, because if a measurement results in a value outside the confidence interval, the null hypothesis is rejected. In the case shown in Figure 14.1-A the rejection region includes both tails of the distribution; in
Figure 14.1-B the rejection region includes only one tail of the distribution. For this reason, the two types of hypothesis test are called two-tailed and one-tailed tests, respectively. Two-tailed tests are used whenever the alternative hypothesis is that the population parameter is simply not equal to the hypothesized value, with no other specification. One-tailed tests are used when the alternative hypothesis is more specific, and states that the population parameter is greater (or less, as appropriate) than the hypothesized value.

We have presented this discussion specifically in terms of the Normal distribution. That was useful, as it allowed us to show the situation for a particular distribution that was well known and, furthermore, allowed us to use actual numbers in the discussion. Notice, however, that the only item of real importance is the probability levels associated with given values of the distribution. The point, of course, is that the discussion is general, and applies to the t statistic, as well as all other statistics, including ones that we have not yet discussed.

We have noted above that the Normal and t distributions are symmetric, so the critical values for a two-tailed test are symmetrically positioned around the population parameter. We will find in the future that the distributions of some statistics are not symmetric; how do we handle this situation? The key to the answer is to note that it is always the probability levels that are important in statistical testing. For asymmetric distributions we can still find values for which the probability of obtaining a reading equals a given, predefined percentage of all possible random samples. These then become the critical values for hypothesis testing.
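To illustrate this point numerically (our sketch, not from the text), the program below simulates an asymmetric statistic, the sum of four squared standard Normal deviates (which follows the right-skewed chi-square distribution with 4 degrees of freedom), and finds its 95% critical values purely from probability levels:

```python
import random
from statistics import mean

random.seed(42)  # reproducible illustration

# An asymmetric statistic: the sum of 4 squared standard-Normal
# deviates, chi-square distributed with 4 degrees of freedom
# (population mean 4, skewed toward high values).
samples = sorted(sum(random.gauss(0.0, 1.0) ** 2 for _ in range(4))
                 for _ in range(100_000))

lower = samples[int(0.025 * len(samples))]   # empirical 2.5th percentile
upper = samples[int(0.975 * len(samples))]   # empirical 97.5th percentile
center = mean(samples)

# 95% of the samples fall between the two critical values, yet the
# values are not symmetric about the mean -- and for hypothesis
# testing that does not matter at all.
print(f"mean {center:.2f}; critical values {lower:.2f} and {upper:.2f}")
```

The upper critical value sits much farther from the mean than the lower one does, yet each still cuts off 2.5% of the possible samples, which is the only property the hypothesis test requires.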
It is of no consequence if they are not symmetrically positioned around the population parameter: that is not a fundamental characteristic of the critical values, merely a happenstance of the nature of the Normal and t distributions.

But, you are probably asking yourselves, when would someone want to make the more specific alternative hypothesis, rather than just making the simpler hypothesis of μ ≠ (test value)? In textbooks of statistics the usual examples involve a "manufacturer" who manufactures (fill in the blank) and who wants to know if a new manufacturing procedure is "better" than the old one in some way. Clearly the "manufacturer" is not interested in a procedure that is worse. However, as an example of a situation of more interest to us as chemists where the concept of one-tailed testing applies, let us consider the question of detection limits. Considerable literature exists concerning the definition of the detection limit for a given analysis; two recent articles have appeared (in the same issue of Analytical Chemistry; interest must be growing!) which treat the subject from the chemist's point of view (14-1, 14-2) and include a good list of citations to older literature.
We are not going to enter the fray and try to come up with a definition that will suit everybody. Not only are people too cantankerous in general for such an attempt to succeed, but also, as we shall see, different definitions may be appropriate in different circumstances. What we will do, however, is to discuss the subject from the statistical point of view; we hope that the insight that you, our readers, gain will allow you to formulate definitions to suit your needs at the appropriate time.

What does the concept of a detection limit entail? In general, it allows the analyst to make a statement concerning the sensitivity of an analytical procedure. Presumably, in the absence of the analyte of interest, the method (whether it be specifically spectroscopic, or more generally instrumental, or most generally any method at all) will give some value of a measured quantity that can then be called the "zero" reading. In the presence of the analyte, some different reading will be obtained; that reading then indicates the presence of the analyte and gives some information concerning how much analyte is present. The difficulty that enters, and that makes this process of statistical interest, is the fact that any reading, whether the analyte is present or absent, is subject to noise and/or other sources of error.

Let us consider the situation in the absence of analyte, which is then exactly as pictured in curve A of Figure 14.2. In the absence of analyte, we know a priori that μ = 0. The readings obtained, however, will vary around zero due to the noise. Indeed, when μ actually does equal zero, the presence of random noise will
FIGURE 14.2. As the true concentration increases, the probability of obtaining a measurement above the upper critical value for the null hypothesis μ = 0 also increases until, as shown in curve F, a sufficiently high measurement will always be obtained.
give rise to measured values (let us say, of an electrical signal that is otherwise proportional to the amount of analyte) that could even be negative. Ordinarily we would censor these; we know that negative values of a concentration are physically impossible. For the sake of our discussion, let us not perform this censoring, but consider these negative values to be as real as the positive values. In this case, we might describe the situation as in Figure 14.1-A and, to define a detection limit, decide to perform a hypothesis test, where the null hypothesis is μ = 0 against the alternative hypothesis μ ≠ 0. In this case, failure of the null hypothesis would indicate that the analyte is present. Clearly, this is not correct: even when the null hypothesis is true, we base our hypothesis test upon the fact that we will not find a value beyond the confidence limits, either above or below.

In the presence of analyte, the distributions of the readings will be as shown by curves B–F of Figure 14.2. Consider curve B, where μ > 0 (but not by much). If, in fact, the concentration can be expressed as μ = μB, then the distribution of measured values will be as shown in curve B. If the null hypothesis μ = 0 is used when the actual distribution follows curve B, then the probability of obtaining a value above the upper confidence limit will be approximately 25%, within the confidence interval 75%, and below the lower confidence limit effectively zero. Note that, even though the concentration is non-zero, we will fail to reject the null hypothesis more often than we will reject it in this case.
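The curve-B numbers can be checked directly. In the sketch below (ours, not the book's), σ is taken as 1 and the value μB ≈ 1.29σ is a hypothetical choice made so that the two-tailed upper limit is exceeded about 25% of the time, matching the figure quoted above:

```python
from statistics import NormalDist

# Upper confidence limit for the null hypothesis mu = 0 (two-tailed,
# 95% level), in units of sigma.
cutoff = NormalDist().inv_cdf(0.975)          # ~1.96

# Hypothetical small true concentration, as in curve B of Figure 14.2.
mu_B = 1.29
p_above = 1.0 - NormalDist(mu=mu_B, sigma=1.0).cdf(cutoff)

print(f"P(reading above upper limit) = {p_above:.2f}")     # ~0.25
print(f"P(failing to reject mu = 0)  = {1 - p_above:.2f}") # ~0.75
```

Even with analyte genuinely present at μB, three readings out of four will fail to reject the null hypothesis, exactly the situation described in the text.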
As the true concentration (i.e., μ) increases, it becomes more and more likely that the measured value will, in fact, be sufficiently high to allow us to reject the null hypothesis μ = 0, until the point is reached, as shown in curve F of Figure 14.2, where the true concentration is so high that measured values will always cause rejection of the null hypothesis, and we will always conclude that the analyte is in fact present. Inspect Figure 14.2, and take notice of the fraction of the area of each curve that lies above the upper confidence limit for the concentration (μ) = 0 case; this should be convincing.

Thus, when μ > 0, we will certainly never obtain a reading below the lower confidence limit; only higher values are possible. This asymmetry in the behavior of the system is exactly the condition that gives rise to the change in the true confidence level from 0.95 to 0.975 described earlier. Thus, even when we do not censor the values on physical grounds (which, by the way, would lead us to the same conclusion), the mathematics of the situation shows that, in order to do a proper statistical hypothesis test at a true 95% confidence level, it is necessary to specify only the one confidence limit and, instead of testing the null hypothesis μ = 0 against the alternative hypothesis μ ≠ 0, the alternative hypothesis must be formulated as μ > 0, and a one-tailed test used.

In order to implement this concept, it is necessary to know σ, or at least to have estimated σ by performing many analyses of a "blank", samples of a specimen (or
set of specimens) that do not contain any of the analyte. Then, the upper confidence limit can be specified as an appropriate number of standard deviations of the distribution, which could either be Normal (if σ is known) or, more commonly, the t distribution (if σ must be estimated by the sample standard deviation S from the data). We could then state that, when μ does in fact equal zero, the distribution is known, and furthermore, the distribution of the average of any number of readings is also known. We could then determine the criterion for stating that a single reading, or the average of a given number of readings, above the appropriate cutoff value is an indication that the concentration of the analyte is non-zero.

Now, this is all fine and well when we can sit here and discuss distributions, and show pretty pictures of what they look like, and calmly talk about the probabilities involved. When you actually have to make a decision about a measurement, though, you do not have all these pretty pictures available; all you have is the measurement. (Well, that's not quite true; you can always look up this chapter and reread it, and look at all the pictures again.) However, there are two important points to be made here: one obvious and one that is subtle.

Curve A in Figure 14.2 shows the distribution and upper confidence limit of readings when μ = 0. Curve C shows the distribution when μ is equal to the upper confidence limit of curve A. There is an important relation between these two distributions: the lower confidence limit of curve C equals zero. That is the obvious point. The subtle point is that there is an inverse relationship between μ and X (the value of a single reading): if a reading (rather than the true μ) happens to fall at μC, the distribution shown as curve C is then the distribution of probabilities for finding μ at any given distance away from the reading.
Using this interpretation, if a reading falls at μC or above, then the chance of the true value being equal to zero becomes too unlikely, and so in this interpretation we also conclude that a reading higher than μC indicates a non-zero value for the true concentration.

The problem of defining a detection limit is that knowing how and when to do a hypothesis test, even using a one-tailed test, is only half the battle. This procedure allows you to specify a minimum value of the measurement that allows you to be sure that the analyte is present; but it does not tell you what the minimum value of the analyte concentration must be. As we noted earlier, if the concentration of the analyte is μB, there is a probability of well over 50% that the method will fail to detect it. Suppose the true concentration (μ) is equal to the upper critical limit (for the null hypothesis μ = 0), which, by our sneakiness, is shown as μC. This condition is shown in curve C of Figure 14.2. In this case, half of the distribution lies above and half below the cutoff value, and the probability of detecting the presence of the analyte is still only 50%.
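The rising detection probability across curves A–F, and the exactly-50% case at μ = μC, can be sketched as follows (our illustration, assuming σ = 1 and the one-tailed cutoff argued for above):

```python
from statistics import NormalDist

# One-tailed upper critical value for the null hypothesis mu = 0, in
# sigma units (sigma assumed constant and equal to 1).
cutoff = NormalDist().inv_cdf(0.95)           # ~1.645

# Sweep the true concentration upward, as in curves A-F of Figure 14.2,
# and compute the probability that a reading exceeds the cutoff, i.e.
# the probability of detecting the analyte.
for mu in (0.0, 0.5 * cutoff, cutoff, 2 * cutoff, 3 * cutoff):
    p_detect = 1.0 - NormalDist(mu=mu, sigma=1.0).cdf(cutoff)
    print(f"mu = {mu:.2f} sigma  ->  P(detect) = {p_detect:.3f}")

# At mu = cutoff (curve C), exactly half of the distribution lies above
# the critical value, so the analyte is detected only 50% of the time.
```

At μ = 0 the "detection" rate is just the 5% false-alarm rate of the test; it climbs through 50% at μ = μC and approaches certainty only when μ is well above the cutoff.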
What is the minimum concentration needed? The answer depends upon the variation (i.e., the standard deviation) of the measurements in the presence of the analyte. This variation is not necessarily the same as the standard deviation of the measurements when no analyte is present. The curves of Figure 14.2 were drawn with the implicit assumption that the standard deviation of the measurements is independent of the amount of analyte present. In fact, many analytical methods are known where the error does depend upon the concentration (exercise for the reader: think of one from your own experience). In such a case it is necessary to determine the standard deviation of measurements in the presence of small amounts of analyte, as well as the standard deviation of measurements in the absence of any analyte.

Figure 14.3-A illustrates the situation in this case. When the true concentration, μ, is zero, the readings obtained will follow a distribution around zero, just as
FIGURE 14.3. (A) At very high concentrations, there is no overlap between the distributions of possible measurements, and the presence of analyte is readily distinguished from the absence of the analyte. Readings will essentially never occur between the two confidence limits shown. (B) When the concentration decreases to the point where the two confidence limits merge into a single value, it is still possible to distinguish unambiguously between the presence and absence of the analyte. (C) At still lower concentrations, there is a region where measurements can occur that are consistent with both the presence and absence of the analyte.
before. This distribution will have an upper confidence limit beyond which any measurement indicates a non-zero concentration. Clearly, in order to be sure of detecting the analyte when it is present, its concentration must be "sufficiently high". "Sufficiently high" in this case means that we are sure to obtain a reading that is high enough that there is a negligible probability of its falling below the confidence limit of the distribution around zero. In order for this to happen, the lower confidence limit of the distribution of measured values around the actual concentration must be equal to or greater than the upper confidence limit for the measurements around a μ of zero. This condition is illustrated by the distribution of measured values when the true concentration is μA. There is a region, between the confidence limits of the two distributions, where no measured values will occur.

In Figure 14.3-B the concentration has decreased to the point where the confidence limits of the two curves have merged into a single value. In the situation shown in Figure 14.3-B we could perform two hypothesis tests: μ = 0 and μ = μB. Since the same actual value represents the confidence limit for both tests, this value separates the two possibilities. A continuum of measured values is possible in this case, but there is still an unambiguous demarcation between the presence and absence of the analyte. If the actual concentration is lower yet, as shown in Figure 14.3-C, then there is a region between the two confidence limits where there is uncertainty, because a reading in this region would not cause rejection of either null hypothesis. When the concentration is still higher, as in Figure 14.3-A, there is no problem in distinguishing the two cases.
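The merging condition of Figure 14.3-B can be turned into a small formula: the minimum reliably detectable concentration is the one whose lower one-sided confidence distance just reaches down to the blank's upper confidence limit. A sketch of ours follows; the two σ values are hypothetical (chosen to show that the blank and the analyte measurements may have different noise levels), and 95% one-sided limits are assumed:

```python
from statistics import NormalDist

z95 = NormalDist().inv_cdf(0.95)    # one-sided 95% multiplier, ~1.645

# Hypothetical noise levels: the blank and the low-concentration
# measurements need not share the same standard deviation.
sigma_blank = 1.0     # scatter of readings when no analyte is present
sigma_analyte = 1.5   # scatter of readings near the detection limit

# Figure 14.3-B condition: the lower confidence limit of the analyte
# distribution coincides with the upper confidence limit of the blank,
# so the minimum concentration is the sum of the two one-sided
# critical distances from their respective means.
mu_min = z95 * sigma_blank + z95 * sigma_analyte
print(f"minimum detectable concentration: {mu_min:.2f} signal units")
```

With these hypothetical numbers the limit is about 4.1 signal units; if the two standard deviations were equal, the familiar "two critical distances" form 2 × 1.645σ would result.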
Therefore, the minimum true concentration that will prevent an error in either direction is μB, shown in Figure 14.3-B, where the true concentration equals the sum of the distances of the two confidence limits from their respective population means. It is only when the true concentration is this high or higher that we can be sure that we will not erroneously claim that the concentration is zero because we accepted the null hypothesis μ = 0.

Thus, we see that there are at least these two possible definitions for detection limit: one specifies the minimum measurement that ensures a non-zero concentration, and the other specifies the minimum concentration that ensures a non-zero measurement. In fact, there are an infinite number of possible values between these two that could be claimed. In each case, there is a different probability of obtaining an incorrect conclusion from the two possible hypothesis tests. These intermediate possibilities are discussed at length in the references.

References

14-1. Clayton, C.A., Hines, J.W., and Elkins, P.D., Analytical Chemistry, 59(20), (1987), 2506–2514.
14-2. Bergmann, G., Oepen, B.v., and Zinn, P., Analytical Chemistry, 59(20), (1987), 2522–2526.