9 Experimental Designs: Part 2
As we mentioned in the last chapter, experimental design in scientific investigations often takes the form in which some of the experimental objects have been exposed to one level of a variable while others have not. This situation is often called the "experimental subject" versus "control subject" type of experiment. In the face of experimental error, or other sources of variability in the readings, both the "experimental" and the "control" readings are taken multiple times. This provides information about the "natural" variability of the system, against which the difference between the two groups can be compared. A t-test is then used to see whether the difference between the "experimental" and the "control" subjects is greater than can be accounted for by the inherent variability of the system. If it is, we conclude that the difference is "statistically significant", and that there is a real effect due to the "treatment" applied to the experimental subject. Of course there are variations on this theme: the difference between the "experimental" and the "control" subjects can be due to different amounts of something applied to the two types of object, for example. That is how we treated this type of experiment previously.

We will now consider a somewhat different way to formulate the same experiment, the purpose being to set up the experimental design, and the analysis of the data, in such a way that it can be generalized to more complicated types of experiments. To do this, we recognize that the value of any individual reading, whether from the experimental subject or the control subject, can be expressed as the sum of three quantities. These three quantities arise from a careful consideration of the nature of the data. Given that a particular measurement belongs either to the experimental group or to the control group, the value of the datum can be expressed as the sum of:

1) The grand mean of all the data (experimental + control)
2) The difference between the mean of the datum's group (experimental or control) and the grand mean of the data
3) The difference between the individual reading and the mean reading of its pertinent group.

This can then be expressed mathematically as:
$X_{ij} = \bar{\bar{X}} + (\bar{X}_i - \bar{\bar{X}}) + (X_{ij} - \bar{X}_i)$    (9-1)
where
$X_{ij}$ represents each individual datum,
$\bar{X}_i$ represents the mean of the particular data group (experimental or control) that the individual datum belongs to, and
$\bar{\bar{X}}$ represents the grand mean of all the data (from both groups).
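To make the decomposition concrete, here is a minimal sketch in Python (the data values are hypothetical, chosen only for illustration) showing that each reading is recovered exactly as the sum of the three quantities in equation 9-1:

```python
# A sketch of equation 9-1 with made-up "control" and "experimental" data:
# each reading = grand mean + (group mean - grand mean) + (reading - group mean).
import numpy as np

control = np.array([5.1, 4.9, 5.0, 5.2])       # hypothetical control readings
experimental = np.array([5.6, 5.8, 5.5, 5.7])  # hypothetical experimental readings

grand_mean = np.mean(np.concatenate([control, experimental]))

for name, group in (("control", control), ("experimental", experimental)):
    group_mean = group.mean()
    for x in group:
        # The three additive terms of equation 9-1; they sum back to x exactly.
        terms = (grand_mean, group_mean - grand_mean, x - group_mean)
        assert np.isclose(sum(terms), x)
        print(f"{name}: {x:.2f} = {terms[0]:.3f} + {terms[1]:+.3f} + {terms[2]:+.3f}")
```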
By rearranging equation 9-1, we can also express it as follows, wherein the fact that it is a mathematical identity becomes apparent:
$X_{ij} = (\bar{\bar{X}} - \bar{\bar{X}}) + (\bar{X}_i - \bar{X}_i) + X_{ij}$    (9-2)
We have previously shown that through the operation called "partitioning the sums of squares", the following equality holds [1]:

$\sum_i X_i^2 = \sum_i \bar{X}^2 + \sum_i (X_i - \bar{X})^2$    (9-3)
Note that what we call the grand mean here was simply called the mean in the prior discussion. That is because in the prior discussion there was no further splitting of the data into subgroups. In the current discussion we have indeed split the data into subgroups, and we note that what was previously the total difference from the mean now consists of two contributions: the difference of each subgroup's mean from the grand mean, and the difference of each datum's value from its subgroup's mean. We might expect, and it turns out to be so (again we leave the proof as an "exercise for the reader"), that the sum of squares of the differences of each datum's value from the grand mean can also be partitioned; thus:
$\sum_i \sum_j X_{ij}^2 = \sum_i \sum_j \bar{\bar{X}}^2 + \sum_i \sum_j (\bar{X}_i - \bar{\bar{X}})^2 + \sum_i \sum_j (X_{ij} - \bar{X}_i)^2$    (9-4)
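A minimal numerical check of equation 9-4 (again with hypothetical two-group data; any grouped data set will do):

```python
# Verify the sum-of-squares partition of equation 9-4 numerically.
import numpy as np

groups = [np.array([5.1, 4.9, 5.0, 5.2]),
          np.array([5.6, 5.8, 5.5, 5.7])]
all_data = np.concatenate(groups)
grand_mean = all_data.mean()

total_ss = np.sum(all_data**2)                   # sum of X^2 over every datum
grand_ss = all_data.size * grand_mean**2         # sum of (grand mean)^2
between_ss = sum(g.size * (g.mean() - grand_mean)**2 for g in groups)
within_ss = sum(np.sum((g - g.mean())**2) for g in groups)

# The four quantities satisfy equation 9-4 exactly (to rounding error).
assert np.isclose(total_ss, grand_ss + between_ss + within_ss)
print(total_ss, grand_ss + between_ss + within_ss)
```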
We have previously discussed the situation (from a slightly different point of view) where more than two subgroups of data existed. In that case we noted that we could generate two estimates of sigma, the within-group standard deviation. One estimate is calculated from the pooled within-group standard deviations. The other is calculated from the standard deviation between the means of the various subgroups; this quantity, you will recall, is equal to the within-group standard deviation divided by the square root of n, the number of data used in the calculation of each subgroup's mean. However, the second calculation is correct only if the differences between the means are due to the random variations of the data itself and there are no external influences. If such influences exist, then the second calculation (from the between-group means) will estimate a larger value for sigma than the first calculation (from the pooled within-group standard deviations). This was then used as the basis of a statistical hypothesis test: if the value of sigma calculated from the between-group means is statistically significantly larger than the value of sigma calculated from within the groups, then we have evidence that there are indeed external influences acting upon the data; we used an F-test to determine whether there was more scatter between the means than could be accounted for by the random variations within the subgroups.

In the case at hand, with only two subgroups, we can proceed the same way. The difference is that now, with only two subgroups, there is only one degree of freedom
available for the difference between the subgroups. No matter; an F-test with one degree of freedom is possible. Thus, to analyze the data from the model of equation 9-4, we calculate the mean square between the subgroups and the mean square within the subgroups, and perform an F-test (rather than a t-test as before) between these two mean squares. We would recommend doing it formally, with an ANOVA table, but this is the basic calculation. The conclusions drawn will be identical to those drawn by use of the t-test. Check it out: the tabled value of F for one and n degrees of freedom is equal to the square of the tabled value of t for n degrees of freedom. We might also note here, almost parenthetically, that if the hypothesis test gives a statistically significant result, it would be valid to calculate the sensitivity of the result to the difference between the two groups (i.e., divide the difference in the means of the two groups by the difference in the values of the variable that correspond to the "experimental" and "control" groups).

As an example of using an experimental design together with its associated analysis of variance to obtain a meaningful result, we present an example based on some real data that we have collected. The problem was interesting: to troubleshoot a method of (wet) chemical analysis. A large quantity of sample was available, and had been well ground and mixed. Suitable data were collected to permit performing a straightforward one-way analysis of variance.

To start with, 5 g of sample was dissolved in 100 ml of water, and 20 repeat analyses were performed. The resulting values are shown in Table 9-1. The entry in the third row, second column was noted to have been measured under abnormal conditions. Since an assignable cause for this discrepant value was available, the reading was discarded. The statistics for the remaining data were Mean = 5.01, SD = 0.327. This value for the standard deviation was accepted as the best available approximation to the population value for σ.

The next step was to take several different aliquots from a large sample (a different sample than used previously) and collect multiple readings from each of them. Six aliquots, each consisting of 10 g of test sample per 100 ml of water, were placed in six flasks, and six repeat measurements were made on each flask. The results are shown in Table 9-2. The value for the pooled within-flask standard deviation, while somewhat higher than for the twenty repeat readings, is not so high as to be worrisome. Strictly speaking, we should have done an F-test between the variances from the two sets of results to see if there is any extra variance there, but we will ignore that question for now, because the important point here is the highly statistically significant value of the "between-flasks" standard deviation, indicating that some extra source of variation was superimposed on the analytical value.
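Parenthetically, the equivalence noted above between the F- and t-tables (F for one and n degrees of freedom equals the square of t for n degrees of freedom) is easy to confirm numerically; a short sketch using scipy (our illustration, not part of the original analysis):

```python
# Check that the tabled F(1, n) equals the square of the two-tailed tabled t(n).
from scipy import stats

for n in (5, 10, 30):
    f_crit = stats.f.ppf(0.95, 1, n)    # 95th percentile of F with (1, n) df
    t_crit = stats.t.ppf(0.975, n)      # two-tailed 5% point of t with n df
    print(n, round(f_crit, 4), round(t_crit**2, 4))  # the two columns agree
```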
Table 9-1  Results from 20 repeat readings of 5 g of sample dissolved in 100 ml water

5.12  5.60  5.18  4.71
5.28  5.14  4.74  4.72
4.97  3.85  5.39  4.94
5.20  4.69  4.49  4.91
4.50  5.12  5.61  4.99
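A minimal sketch (ours) reproducing the Table 9-1 statistics; the discarded reading (3.85, the third row, second column) is identified by value here purely for convenience:

```python
# Drop the reading taken under abnormal conditions, then compute the
# mean and standard deviation of the remaining 19 values.
import numpy as np

readings = np.array([
    5.12, 5.60, 5.18, 4.71,
    5.28, 5.14, 4.74, 4.72,
    4.97, 3.85, 5.39, 4.94,
    5.20, 4.69, 4.49, 4.91,
    4.50, 5.12, 5.61, 4.99,
])
kept = readings[readings != 3.85]
# Prints values close to the quoted Mean = 5.01 and SD = 0.327.
print(kept.mean(), kept.std(ddof=1))
```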
Table 9-2  Results of repeat readings of six aliquots in six flasks (from 10-g samples)

Flask #      1      2      3      4      5      6
          7.25  10.07   5.96   7.10   5.74   4.74
          7.68   9.02   6.66   6.10   6.90   6.75
          7.76   9.51   5.87   6.27   6.29   6.71
          8.10  10.64   6.95   5.99   6.37   6.51
          7.50  10.27   6.54   6.32   5.99   5.95
          7.58   9.64   6.29   5.54   6.58   6.50
Means:    7.64   9.85   6.37   6.22   6.31   6.19
SDs:      0.28   0.58   0.42   0.51   0.41   0.77

Pooled SD = 0.52, "Between" SD = 1.46, Expected "Between" SD = 0.212, F = 47, F(crit) = F(0.95, 5, 30) = 2.53
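These quantities are straightforward to reproduce; here is a sketch of the calculation in Python (a reconstruction of the arithmetic described above, using scipy only for the critical-value lookup):

```python
# One-way ANOVA on the Table 9-2 data: pooled within-flask SD, SD of the
# flask means, the "expected" between SD (pooled SD / sqrt(6)), and the
# F-ratio of the between- to the within-flask mean squares.
import numpy as np
from scipy import stats

flasks = np.array([
    [7.25, 7.68, 7.76, 8.10, 7.50, 7.58],
    [10.07, 9.02, 9.51, 10.64, 10.27, 9.64],
    [5.96, 6.66, 5.87, 6.95, 6.54, 6.29],
    [7.10, 6.10, 6.27, 5.99, 6.32, 5.54],
    [5.74, 6.90, 6.29, 6.37, 5.99, 6.58],
    [4.74, 6.75, 6.71, 6.51, 5.95, 6.50],
])
n = flasks.shape[1]                                      # readings per flask

pooled_sd = np.sqrt(flasks.var(axis=1, ddof=1).mean())   # within-flask sigma
between_sd = flasks.mean(axis=1).std(ddof=1)             # SD of the six means
expected_between_sd = pooled_sd / np.sqrt(n)             # if no external effect

F = n * between_sd**2 / pooled_sd**2                     # MS(between)/MS(within)
F_crit = stats.f.ppf(0.95, 5, 30)                        # 5 and 30 df
print(pooled_sd, between_sd, expected_between_sd)        # ~0.52, 1.46, 0.21
print(F, F_crit)                                         # ~47 versus ~2.53
```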
Having found a statistically significant "between-flasks" standard deviation, the next step was to formulate hypotheses as to the possible physical causes of this situation. The list we arrived at was the following:

• Inhomogeneous sample
• Drift between sets of readings
• Sampling error
• Something else.
The first physical cause considered was the possibility of an inhomogeneous sample. To eliminate this as a possibility, the sample was ground before aliquots were taken. The sample size was still 10 g of sample per 100 ml of water. In this case, however, time constraints permitted only three replicate readings per flask. The results are shown in Table 9-3. We note that there is still a much larger difference between the different flasks' readings than can be accounted for by the within-flask repeatability. Therefore we press onward to consider another possible cause of the variation; in this case we consider the possibility of inhomogeneity of the sample at a scale not affected by grinding. For example, the sample might contain small specks of material that are too small to be ground further, but which are large enough to measurably affect the analysis.
Table 9-3  Results of repeat readings of six aliquots in six flasks (from 10-g samples, ground)

Flask #      1      2      3      4      5      6
          6.57   5.06   8.07   4.93   4.78   6.23
          6.27   6.27   7.82   5.64   5.50   7.37
          6.35   5.88   8.52   5.19   5.99   5.27
Means:    6.39   5.74   8.19   5.25   5.43   7.29
SDs:      0.16   0.61   0.35   0.36   0.61   1.01

Pooled SD = 0.58, "Between" SD = 1.14, Expected "Between" SD = 0.33, F = 11.3, F(crit) = F(0.95, 5, 12) = 3.10
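The same test can be run with a library one-liner; a sketch using scipy.stats.f_oneway, which performs this one-way ANOVA directly on the raw readings (because it works from the unrounded data, the F it reports may differ somewhat from the quoted 11.3, which was computed from the tabulated means and SDs):

```python
# One-way ANOVA on the Table 9-3 readings via scipy.
from scipy import stats

flasks = [
    [6.57, 6.27, 6.35],
    [5.06, 6.27, 5.88],
    [8.07, 7.82, 8.52],
    [4.93, 5.64, 5.19],
    [4.78, 5.50, 5.99],
    [6.23, 7.37, 5.27],
]
F, p = stats.f_oneway(*flasks)
print(F, p)   # F far exceeds F(crit) = 3.10, so p is well below 0.05
```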
In this case, the expected distribution of the sampling variation of such particles would be the Poisson distribution [2]. In such a case, if we take a larger sample, we would expect the standard deviation to decrease as the square root of the sample size. Thus, if we take samples ten times larger than previously, the "between" standard deviation should become approximately one-third of the previous value. Therefore, for the next test, 100-g samples were each dissolved in 1 liter of water. The results are shown in Table 9-4.

Table 9-4  Results from using 10× larger (100-g) samples

Flask #      1      2      3      4
          8.29   8.61  10.04   8.86
          8.12   8.72  11.67   9.02
          8.72   8.42  11.38   9.29
          8.54   8.76  10.19   8.63
Means:    8.42   8.63  10.82   8.94
SDs:      0.26   0.15   0.82   0.26

Pooled SD = 0.46, "Between" SD = 1.10, Expected "Between" SD = 0.23, F = 23, F(crit) = F(0.95, 3, 12) = 3.49
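A quick simulation (ours, with a purely hypothetical contamination density) illustrates the square-root scaling the Poisson model predicts:

```python
# If discrete specks drive the variation, a 10x larger sample should cut
# the SD of the measured concentration by about sqrt(10), i.e. to ~1/3.
import numpy as np

rng = np.random.default_rng(0)
specks_per_gram = 2.0                     # assumed (hypothetical) speck density

for grams in (10, 100):
    counts = rng.poisson(specks_per_gram * grams, size=100_000)
    concentration = counts / grams        # specks per gram of sample
    print(grams, concentration.std())     # SD shrinks ~sqrt(10)-fold at 100 g
```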
Note that the "between" standard deviation is almost identical to the previous value; we conclude that inhomogeneity of the sample is not the problem.

The possibility of drift between sets of readings was ruled out by virtue of the fact that many of the steps of the analytical procedure were done simultaneously on the several readings of the different aliquots. The possibility of drift between readings was ruled out by repeating the readings in different orders; the same values were obtained regardless of the order of reading.

This left "something else" as the possible cause of the variability. When we considered the nature of the test, which was sensitive to parts per million of organic materials, we realized that one possibility was contamination of the glassware by the soap used to clean it. We next cleaned all glassware with chromic acid cleaning solution and reran the tests, with the results shown in Table 9-5. Removal of the extraneous source of variability did indeed reduce the "between-flasks" variance to a level that is now explainable (in the statistical sense) by the underlying random variations attributable to the within-flask variability.
Table 9-5  Results after cleaning glassware with chromic acid

Flask #      1      2      3      4      5      6
          4.65   5.98   5.19   4.97   4.62   3.93
          5.03   4.61   3.96   4.43   4.94   4.60
          4.38   4.49   4.92   4.79   3.37   5.95
Means:    4.68   5.16   4.69   4.73   4.31   4.84
SDs:      0.33   0.73   0.64   0.27   0.83   1.03

Pooled SD = 0.69, "Between" SD = 0.27, Expected "Between" SD = 0.39, F = 0.47, F(crit) = F(0.95, 5, 12) = 3.10
Table 9-6  Types of experimental designs

                                        Number of factors
Number of levels   Single                                  Multiple
Two                Experimental versus control subjects    One-at-a-time designs
                                                           Factorial designs
                                                           Fractional factorial designs
                                                           Nested designs
                                                           Special designs
Multiple           Sensitivity testing                     Response surface designs
                   Simple regression                       Multiple regression
End of example.

From the prototype experiment we can generate many variations of the basic scheme. The two main ways that the model shown in equation 9-4 can be varied are to increase the number of factors and to increase the number of levels of each factor. A given factor must have at least two levels (even if one of the levels is an implied zero), and may have any number greater than two. Table 9-6 lists the types of designs that fall into each of these categories. The types of designs used by scientists in simple settings, not usually considered "statistical" designs, are the "experimental versus control" designs (discussed above), the one-at-a-time designs (where each factor is individually changed from its "control" value to its "experimental" value, then restored when the next factor is changed), and simple regression (often used in calibration work when only one physical variable is affected; in chemistry, electrochemical and chromatographic applications come to mind). The table is not exhaustive, although it does include the majority of experimental designs that are used. One-at-a-time designs are the usual "non-statistical" type of experiments that are often carried out by scientists in all disciplines. Not included explicitly, however, are experimental designs that are generated from combinations of the listed items. For example, a multi-factor experiment may have several levels of some of the factors but only two levels of other factors. Also, due to the nature of the physical factors involved, the values of some factors may not be under the experimenter's control. Thus, some factors may be nested while others are not.
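As a small illustration of one entry in Table 9-6, here is a sketch (factor names and levels are hypothetical) that enumerates the runs of a two-level full factorial design:

```python
# Generate all runs of a 2-level full factorial design for three factors.
from itertools import product

factors = {
    "temperature": (20, 40),     # low and high level of each factor
    "pH": (5.0, 7.0),
    "stir_rate": (100, 300),
}
runs = [dict(zip(factors, levels)) for levels in product(*factors.values())]
for i, run in enumerate(runs, 1):   # 2**3 = 8 experimental runs
    print(i, run)
```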
REFERENCES

1. Mark, H. and Workman, J., Statistics in Spectroscopy (Academic Press, Boston, 1991), pp. 80-81.
2. Mark, H. and Workman, J., Spectroscopy 5(3), 55-56 (1991).