Soil Biology and Biochemistry 31 (1999) 1323–1330
Analysis of inter-sample distances from BIOLOG plate data in Euclidean and simplex spaces
P.J.A. Howard*
The Institute of Terrestrial Ecology, Merlewood Research Station, Grange-over-Sands, Cumbria, LA11 6JU, UK
Accepted 23 March 1999
Abstract
BIOLOG plates have a potential use in quantitative or qualitative comparisons between physiological activities of microorganisms from different soils or soil horizons. However, the scaling of the data advocated by Garland and Mills (1991) restricts the range of subsequent data analyses because the scaled points lie in a constrained simplex space. This article describes methods for analyzing inter-sample distances from BIOLOG plate data in the usual Euclidean space, using total (quantitative plus qualitative) or quantitative information, and in simplex space, using only qualitative information. The methods are applied to data for three replicates of each of five soils. The quantitative comparisons showed that for some soils the replicates formed distinct groups, but for others they did not. Major differences must be due to real differences in activity levels, which need further study. The qualitative comparisons showed that on the first two axes of a simplex PCA the replicates formed compact and distinct groups, although two of the groups were close together. The C-sources that best discriminate between the samples can be found from the eigenvectors. Attention is drawn to the problem of negative data values. © 1999 Elsevier Science Ltd. All rights reserved.
Keywords: BIOLOG plates; Compositional data; Physiological activity; PCA; Euclidean space; Simplex space; Inter-sample distances
* Corresponding author. Fax: +44-015395-34705. E-mail address: [email protected] (P.J.A. Howard)
0038-0717/99/$ - see front matter © 1999 Elsevier Science Ltd. All rights reserved. PII: S0038-0717(99)00055-3
1. Introduction
In studies of soil processes, it is often of interest to make quantitative or qualitative comparisons between the physiological activities of micro-organisms in different soils, or in different horizons of the same soil. For example, it is still not clear if there are distinct qualitative or quantitative physiological differences between the microbial populations of mull and mor humus forms. This article shows how quantitative and qualitative information about differences between samples can be obtained from data from BIOLOG plates (Garland and Mills, 1991). The 95 substrate wells in a standard Gram-negative or Gram-positive BIOLOG plate contain, in dried form, a complex, low-concentration buffered medium with different C-sources. Microbial activity in each well is determined, as dehydrogenase activity, by the development of colour from the reduction of a tetrazolium dye. A control or blank well contains no C-source. The colour intensity of the control is subtracted from that for each of the 95 wells with substrate.
The method of data analysis described here differs from multivariate methods that have been used by other researchers. For example, Grayston and Campbell (1996) used canonical variate analysis (CVA) to examine differences in C utilization among the microbial communities from three forest sites. In CVA, the objects studied are in a priori groups, which are assumed to have homogeneous dispersion matrices. The method described here studies detailed interrelationships among individual samples which may, or may not, form groups. Garland and Mills (1991) advocated scaling the colour values for the 95 wells of a BIOLOG plate to give them a mean of 1. They called this the `average well
colour development', but the term `plate mean' will be used here. Howard (1997) pointed out that this scaling should not be applied to the data if inter-sample distances are to be calculated in the usual way, as it distorts those distances. There is also a more profound problem, because the rows of a scaled data matrix are of constant sum, in this case 95. Such data are called compositional data. Reyment and Jöreskog (1993) stated: `It has seemed natural, and uncomplicated, to employ the usual methods of statistics for such data and, in fact, most people, including professional statisticians, hardly expend a single thought on eventual computational complications or interpretative malchances. The analysis of compositional data is not given adequate consideration in even the most modern of texts'. Aitchison (1986) discussed the special distinctions between the statistics of usual space and those of constrained (simplex) space in which compositional data lie, and described the special methods that are needed to analyse such data.
Differences or similarities between samples can be represented conveniently by analyzing the relationships contained in inter-sample distances. In the present article, BIOLOG plate data for three replicates from each of five soils are used in a case-study to illustrate a method for examining inter-sample distances for unscaled data, and to illustrate the corresponding simplex approach for data after Garland–Mills scaling. The two approaches allow qualitative and quantitative differences to be separated. The data are used solely to illustrate the methods of statistical analysis. A full description of the soil samples and the ecological interpretation of the data will be published elsewhere.
2. Methods
2.1. Inter-sample distances in Euclidean space
In this example, the data matrix contains N = 15 rows (samples, plates) and P = 95 columns (C-sources).
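The Garland–Mills scaling discussed in the Introduction, and the constant-sum constraint it imposes, can be illustrated with a minimal Python sketch. The five-well "plate", its readings, and the function names are hypothetical and serve only as an illustration; a real plate has 95 substrate wells.

```python
def subtract_control(wells, control):
    """Subtract the control (blank) well reading from each substrate well."""
    return [w - control for w in wells]


def plate_mean(values):
    """Mean of the control-corrected well values (the 'plate mean')."""
    return sum(values) / len(values)


def garland_mills_scale(values):
    """Divide every well by the plate mean, so the scaled wells have mean 1."""
    m = plate_mean(values)
    return [v / m for v in values]


# Hypothetical readings for a five-well plate (illustration only).
raw_a = [0.90, 1.40, 0.50, 2.10, 1.10]
plate_a = subtract_control(raw_a, control=0.10)
# A second plate with the same pattern of use but twice the activity.
plate_b = [2 * v for v in plate_a]

scaled_a = garland_mills_scale(plate_a)
scaled_b = garland_mills_scale(plate_b)

print(plate_mean(plate_a), plate_mean(plate_b))  # quantitative information
print(sum(scaled_a), sum(scaled_b))              # both equal the number of wells
```

After scaling, the two plates are indistinguishable and every row sums to the number of wells, which is why the scaled data carry only qualitative information and lie in a simplex.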
The control value (colour intensity) was subtracted from the value for each experimental well on a plate, and the resulting values were summed and divided by 95, giving the plate mean. The raw data contain information on the overall, that is, quantitative plus qualitative, inter-sample distances. To examine these in detail, we begin with a principal component analysis (PCA) of the correlation matrix. Geometrically, the N objects (soil samples in this case) can be thought of as being distributed in a P-dimensional test or variable space. PCA is a very useful exploratory technique for continuous variables. It represents an orthogonal rotation of the original test axes to give new axes, the first having maximum variance, the second accounting for a maximum of the
remaining variance in a direction orthogonal to the first, and so on. The variances of the axes (components) are given by the eigenvalues. With P tests we normally obtain r ≤ P non-zero eigenvalues, but if, as in this example, N ≤ P, we obtain a maximum of N − 1 non-zero eigenvalues. It is often convenient to disregard components which have small eigenvalues, treating them as constants, so that the main aspects of variation can be studied in a subspace of dimension q < P. Although some information is lost in this approach, the q components may be chosen to account for enough of the total variance to make the reduced dimensionality useful. Indeed, much of the variance in the lower components may only be noise. If only the first q components are used, the problem is deciding on a value for q. If the PCA is based on a correlation matrix for data with N > P, which is generally the case, a useful rule of thumb is to consider only those components with eigenvalues greater than one (Howard, 1991), and perhaps the next one or two if they are greater than 0.75. However, if N < P, these criteria become 1 × P/N and 0.75 × P/N. Attempts have been made to provide more statistically-based stopping rules, but no other simple rule has proved to be suitable (Franklin et al., 1995). As PCA is an orthogonal rotation, the original inter-sample distances are retained in the total component space, which is Euclidean. Hence, inter-sample distances can be calculated from the component scores (axis co-ordinates) using the normal Pythagorean formula. Of course, any distances calculated from the scores on the first q components will be an approximation of those in the full component or variable space, but it may happen that the first q components contain all the information that is of interest in a particular study, as will be shown below. Next, the minimum spanning tree (Gower and Ross, 1969) is calculated.
The minimum spanning tree provides a concise summary of the inter-sample distance relationships, and leads naturally to the construction of a single-linkage cluster analysis, the dendrogram of which provides a graphical representation of that information. The plate means contain information about the quantitative relationships only, and the inter-sample distances are calculated from them.

Table 1
Eigenvalues of the correlation matrix for the BIOLOG plate data
Component  Eigenvalue  % of trace (component)  % of trace (cumulative)
1          24.85       26.16                   26.16
2          19.47       20.49                   46.65
3          12.35       13.00                   59.66
4           7.33        7.72                   67.38
5           6.71        7.07                   74.44
6           5.13        5.40                   79.84
7           4.66        4.91                   84.75
8           3.16        3.33                   88.08
9           2.99        3.15                   91.23
10          2.82        2.97                   94.19
11          1.83        1.92                   96.11
12          1.55        1.63                   97.75
13          1.19        1.26                   99.00
14          0.95        1.00                  100.00

Fig. 1. Scatterplot of points representing the samples on the first two component axes from a principal components analysis of the correlation matrix. Components 1 and 2 account for 26 and 20% of the total variation respectively.

2.2. Inter-sample distances in simplex space
For analysing the purely qualitative inter-sample differences, we need to apply the little-known method for compositional data to BIOLOG plate data after Garland–Mills scaling. The scaled data must not contain negative values (see below). We proceed as follows:
1. Replace any zero values by the procedure of Reyment (1991); suitable computer code is given in the Appendix.
2. Scale the row sums to 1, converting the data to proportions; this is matrix X.
3. Compute the N × P logratio data matrix Z by converting all the values x_ij to ln(x_ij), calculating the resulting row means, and subtracting the appropriate row mean from each ln(x_ij) in that row.
4. With Z as the data matrix, compute the covariance and correlation matrices using the usual formulae.
5. This centred logratio covariance matrix, or the corresponding correlation matrix, is used in simplex PCA because it possesses symmetry of all P parts.
Examples are given in Reyment (1991) and Reyment and Jöreskog (1993), and the former gives a SAS program for carrying out the analysis. Once the PCA has been carried out, the remaining analyses are the same as those described for inter-sample distances in Euclidean space.
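Steps 2 and 3 of the procedure above (scaling each row to unit sum, then centring the logarithms on their row mean) can be sketched in Python as follows. This is a minimal illustration, not the SAS program of Reyment (1991); the function names and the four-part example row are hypothetical.

```python
import math


def close(row):
    """Step 2: scale a row of positive values so that it sums to 1."""
    total = sum(row)
    return [v / total for v in row]


def clr(row):
    """Step 3: centred logratio of one row, ln(x_ij) minus the row mean of the logs."""
    logs = [math.log(v) for v in close(row)]  # fails for zero or negative values
    mean_log = sum(logs) / len(logs)
    return [lv - mean_log for lv in logs]


# Hypothetical four-part composition (illustration only).
plate = [0.8, 1.3, 0.4, 2.0]
doubled = [2 * v for v in plate]   # same proportions, twice the activity

z = clr(plate)
z_doubled = clr(doubled)
print(z)
print(sum(z))   # zero apart from rounding error
```

Two properties are worth noting: each transformed row sums to zero, and multiplying a whole row by a constant leaves the result unchanged, so only the ratios among wells (the qualitative information) survive. The call to `math.log` raises an error for zero or negative entries, which is why zeros must be replaced first and negative values are a problem.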
Table 2
Component scores from the correlation matrix of the BIOLOG plate data, and the plate mean values (one column per sample; rows are components 1 to 14, followed by the plate mean)
Component      A1     A2     A3     B1     B2     B3     C1     C2     C3     D1     D2     D3     E1     E2     E3
1           −6.24   5.25  −2.87  −3.32   0.97  −1.81   6.09   8.77   6.51  −4.11  −7.47  −4.55   2.28  −0.95   1.46
2           −3.66 −11.54  −5.84   1.59  −0.95   2.69   4.30   4.27   3.60   1.72   3.88   2.67  −2.40   0.36  −0.68
3           −7.91   3.25  −0.85  −4.12   3.03  −1.75  −1.40  −0.80  −1.47   4.22   3.65   5.22   0.63  −2.40   0.70
4            2.74   2.84  −2.24   0.70   4.05   1.54   1.22   1.25   0.05  −0.24   1.50  −0.36  −4.92  −4.13  −4.00
5            1.71   1.34  −7.62  −0.84   1.30   2.22  −2.40  −0.90  −0.90   0.46  −0.12  −1.01   2.42   2.26   2.08
6           −3.81  −0.58   0.89   4.66   3.66   2.46  −1.82  −0.59  −1.31  −2.70  −1.15  −0.68  −0.04   0.52   0.50
7           −1.92   3.11  −0.78   1.24  −4.22   1.47   2.08  −0.49  −1.35  −2.02   3.63  −0.98  −1.63   1.56   0.29
8            0.70   0.17  −1.27   2.23   0.67  −5.06  −0.56   1.77  −0.78  −1.60   1.38   1.14   0.26   1.07  −0.13
9           −0.05   0.72  −1.07   2.28  −1.76   0.33   1.61   0.43  −2.56   1.87  −3.35   2.58   0.15  −0.97  −0.31
10           0.48  −0.99   0.15  −1.74   2.33  −0.55   3.92  −1.58  −2.67  −1.22  −0.16  −0.08  −0.28   1.01   1.38
11          −0.04   0.44  −0.80   1.10  −0.04  −0.93   0.74  −3.13   2.85  −0.39  −0.59   1.05  −0.89  −0.57   1.21
12           0.74  −0.08   0.19  −1.20  −0.47   1.08  −0.96   0.94  −0.05  −2.66  −0.23   2.65  −0.53  −0.67   1.26
13           0.16  −0.22  −0.21   0.33  −0.19   0.26   0.77  −0.65   0.01  −1.10   0.90   0.08   2.90  −1.93  −1.11
14          −0.22   0.27  −0.05  −0.54   0.21   0.28   0.14  −0.53   0.54  −0.52  −0.53   1.17   0.24   2.01  −2.47
Plate mean   1.30   1.01   1.17   1.19   1.08   1.18   0.96   0.82   0.88   1.25   1.39   1.28   1.03   1.15   1.09
Table 3
Inter-sample Pythagorean distances. Lower half-matrix: calculated from all 14 component values in Table 2, representing qualitative plus quantitative differences. Upper half-matrix: calculated from the plate mean values, representing quantitative differences
Sample    A1    A2    A3    B1    B2    B3    C1    C2    C3    D1    D2    D3    E1    E2    E3
A1      0.00  0.29  0.13  0.11  0.22  0.12  0.34  0.48  0.42  0.05  0.08  0.03  0.27  0.15  0.22
A2      1.35  0.00  0.16  0.18  0.07  0.17  0.05  0.19  0.13  0.24  0.37  0.26  0.02  0.14  0.07
A3      1.03  1.12  0.00  0.01  0.09  0.01  0.22  0.35  0.29  0.08  0.21  0.10  0.15  0.03  0.09
B1      0.90  1.34  0.94  0.00  0.11  0.01  0.23  0.37  0.30  0.07  0.20  0.09  0.16  0.04  0.10
B2      1.13  1.06  1.02  0.91  0.00  0.10  0.12  0.26  0.20  0.17  0.31  0.20  0.05  0.07  0.01
B3      0.97  1.28  1.04  0.69  0.83  0.00  0.23  0.36  0.30  0.07  0.20  0.09  0.15  0.04  0.10
C1      1.27  1.28  1.16  0.98  0.96  0.89  0.00  0.14  0.07  0.29  0.43  0.32  0.07  0.19  0.13
C2      1.39  1.27  1.27  1.07  0.96  0.99  0.62  0.00  0.06  0.43  0.57  0.46  0.21  0.33  0.27
C3      1.25  1.25  1.15  0.98  0.91  0.88  0.65  0.57  0.00  0.37  0.50  0.39  0.14  0.26  0.20
D1      1.05  1.28  0.98  0.90  0.83  0.78  1.02  1.11  1.00  0.00  0.13  0.02  0.22  0.11  0.17
D2      1.13  1.47  1.12  0.91  1.01  0.86  1.16  1.29  1.18  0.72  0.00  0.11  0.36  0.24  0.30
D3      1.14  1.33  0.99  0.86  0.84  0.84  1.03  1.13  1.05  0.52  0.67  0.00  0.25  0.13  0.19
E1      1.09  1.00  0.92  0.92  0.82  0.86  0.93  0.91  0.82  0.82  1.10  0.89  0.00  0.12  0.06
E2      0.92  1.20  0.92  0.69  0.89  0.70  0.88  0.97  0.86  0.81  0.90  0.85  0.59  0.00  0.06
E3      1.06  1.04  0.91  0.83  0.77  0.74  0.82  0.89  0.77  0.77  0.97  0.79  0.47  0.49  0.00
Fig. 2. Dendrogram of a single-linkage cluster analysis from the distances in the lower half-matrix of Table 3, representing qualitative plus quantitative differences between the samples in the total variable space.
Fig. 3. Dendrogram of a single-linkage cluster analysis from the distances in the upper half-matrix of Table 3, representing quantitative differences between the samples.
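The quantitative distances and the tree behind a single-linkage analysis of them can be sketched in Python from the plate means in Table 2. This is a minimal illustration: the function names are hypothetical, and Prim's algorithm is used here as one standard way to build a minimum spanning tree, not necessarily the procedure used for the published figures. Because the tabulated means are rounded to two decimals, some distances may differ by 0.01 from those in Table 3, which were presumably computed from unrounded values.

```python
# Plate means from the last row/column of Table 2.
plate_means = {
    "A1": 1.30, "A2": 1.01, "A3": 1.17,
    "B1": 1.19, "B2": 1.08, "B3": 1.18,
    "C1": 0.96, "C2": 0.82, "C3": 0.88,
    "D1": 1.25, "D2": 1.39, "D3": 1.28,
    "E1": 1.03, "E2": 1.15, "E3": 1.09,
}


def distance(a, b):
    """Pythagorean distance in one dimension: the absolute difference."""
    return abs(plate_means[a] - plate_means[b])


def minimum_spanning_tree(labels, dist):
    """Prim's algorithm; returns a list of (distance, from, to) edges."""
    in_tree = [labels[0]]
    remaining = set(labels[1:])
    edges = []
    while remaining:
        # Shortest link between any sample in the tree and any sample outside it.
        d, a, b = min((dist(a, b), a, b) for a in in_tree for b in remaining)
        edges.append((d, a, b))
        in_tree.append(b)
        remaining.remove(b)
    return edges


labels = list(plate_means)
mst = minimum_spanning_tree(labels, distance)

# Sorting the tree edges by length gives the order in which a
# single-linkage analysis merges groups (Gower and Ross, 1969).
for d, a, b in sorted(mst):
    print(f"{a} - {b}: {d:.2f}")
```

The tree has N − 1 = 14 edges, and reading them from shortest to longest reproduces the levels at which groups fuse in a single-linkage dendrogram.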
Table 4
Eigenvalues of the simplex correlation matrix
Component  Eigenvalue  % of trace (component)  % of trace (cumulative)
1          32.67       34.39                   34.39
2          13.39       14.10                   48.49
3           8.78        9.24                   57.73
4           8.18        8.61                   66.35
5           6.97        7.34                   73.69
6           4.70        4.95                   78.63
7           4.34        4.56                   83.20
8           3.53        3.72                   86.92
9           3.45        3.63                   90.55
10          2.70        2.84                   93.39
11          2.55        2.68                   96.07
12          1.52        1.60                   97.66
13          1.29        1.36                   99.02
14          0.93        0.98                  100.00
Fig. 4. Scatterplot of points representing the samples on the first two component axes from a principal components analysis of the simplex correlation matrix. Components 1 and 2 account for 34% and 14% of the total variation respectively.

3. Results
3.1. Analysis of the data in Euclidean space
The eigenvalues of the correlation matrix are shown in Table 1. We have condensed all of the variation in the original 95 tests into only 14 orthogonal axes. If we wished to reduce the dimensionality, we could use the rule of thumb mentioned above, which would mean that we could accept all the eigenvalues greater than 1 × 95/15 = 6.33, or possibly 0.75 × 95/15 = 4.75, as being important. Thus, we might accept the first six eigenvalues, which account for about 80% of the total variation, and consider the rest as `noise'. The first component accounts for some 26% of the total variation, and represents the main axis of variation in the original test space. The component scores are given in Table 2, and a scatterplot of points representing the samples on the first two components, which account for almost 47% of the variation in the data, is given in Fig. 1. As one would expect from the fact that the first component is a weighted mean, the scores on the first component are correlated strongly
Table 5
Simplex component scores (one column per sample; rows are components 1 to 14)
Component      A1     A2     A3     B1     B2     B3     C1     C2     C3     D1     D2     D3     E1     E2     E3
1           −0.56  −0.62  −0.46   0.08   0.02   0.02   0.81   1.16   0.84  −0.31  −0.35  −0.19  −0.17  −0.12  −0.15
2           −0.67  −0.54  −0.26  −0.24   0.14  −0.09  −0.01  −0.02  −0.06   0.53   0.65   0.71  −0.07  −0.12   0.05
3            0.07  −0.65   0.03   0.01  −0.28  −0.00  −0.22  −0.19  −0.00   0.10  −0.12  −0.01   0.49   0.43   0.35
4            0.36  −0.47   0.38   0.10  −0.10  −0.10   0.31   0.06   0.06   0.17   0.03   0.21  −0.33  −0.32  −0.36
5           −0.34   0.21   0.55  −0.08  −0.45  −0.38   0.13   0.16   0.06  −0.06   0.09   0.16  −0.02  −0.02  −0.01
6           −0.41  −0.28   0.22   0.41   0.34   0.14  −0.02  −0.13  −0.24  −0.12  −0.09   0.14   0.01   0.00   0.03
7            0.17   0.06  −0.06  −0.18   0.34  −0.62  −0.29   0.23   0.04   0.09  −0.15   0.19   0.08   0.01   0.07
8           −0.03  −0.03   0.02  −0.25   0.05   0.21  −0.53  −0.07   0.50  −0.03   0.25  −0.30   0.09   0.06   0.07
9            0.10  −0.13   0.24  −0.28   0.12   0.05   0.49  −0.24  −0.18  −0.01  −0.29  −0.02   0.00   0.03   0.12
10           0.04   0.07  −0.01   0.01  −0.21   0.20  −0.19   0.01   0.17   0.33  −0.56   0.10   0.03  −0.04   0.04
11           0.02   0.06   0.14  −0.20  −0.07   0.22  −0.10   0.43  −0.29   0.01   0.06  −0.21   0.06  −0.00  −0.14
12           0.07   0.03   0.08  −0.10  −0.02   0.11   0.00   0.10  −0.07  −0.39  −0.02   0.30  −0.05  −0.08   0.04
13           0.04  −0.06   0.01   0.02  −0.09  −0.02  −0.12   0.08  −0.08   0.08   0.13  −0.17   0.00  −0.07   0.24
14          −0.04  −0.02   0.01  −0.06  −0.01  −0.09   0.19   0.01  −0.02  −0.09   0.10   0.02   0.20  −0.14  −0.06
Table 6
Inter-sample Pythagorean distances calculated from the first four simplex components (lower half-matrix)
Sample    A1    A2    A3    B1    B2    B3    C1    C2    C3    D1    D2    D3    E1    E2
A1      0.00
A2      0.28  0.00
A3      0.11  0.28  0.00
B1      0.20  0.29  0.15  0.00
B2      0.29  0.27  0.21  0.13  0.00
B3      0.24  0.27  0.17  0.06  0.09  0.00
C1      0.39  0.44  0.33  0.21  0.23  0.23  0.00
C2      0.47  0.49  0.42  0.28  0.29  0.29  0.11  0.00
C3      0.39  0.44  0.34  0.20  0.23  0.21  0.08  0.09  0.00
D1      0.31  0.37  0.21  0.22  0.17  0.19  0.32  0.40  0.33  0.00
D2      0.35  0.35  0.25  0.25  0.17  0.21  0.34  0.41  0.35  0.07  0.00
D3      0.36  0.40  0.26  0.25  0.18  0.22  0.31  0.39  0.32  0.06  0.07  0.00
E1      0.27  0.33  0.23  0.18  0.21  0.14  0.34  0.39  0.30  0.22  0.25  0.27  0.00
E2      0.26  0.32  0.22  0.16  0.20  0.13  0.33  0.37  0.28  0.23  0.26  0.27  0.02  0.00
E3      0.28  0.31  0.23  0.17  0.18  0.12  0.33  0.37  0.28  0.19  0.22  0.23  0.05  0.05
with the plate means, which are given in the last column of Table 2. The inter-sample distances calculated from all 14 components are given in the lower half of Table 3; the minimum spanning tree is not listed, but the corresponding single-linkage dendrogram is given in Fig. 2. Of course, this analysis represents all the information contained in the BIOLOG plate data, which is a combination of two independent types of information: (i) quantitative differences among the samples, measured by the total dehydrogenase activity in all the wells of the plate, and represented by the plate sum or mean, and (ii) qualitative differences among the samples, represented by the ratios of the values for the wells. The inter-sample distances calculated from the plate means (quantitative differences) are given in the upper half of Table 3. The corresponding single-linkage dendrogram is given in Fig. 3. For studying the qualitative inter-sample differences we need to use simplex analysis.
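The rule-of-thumb thresholds quoted in Section 3.1 can be checked against the eigenvalues in Table 1 with a few lines of Python. This is a minimal sketch; the function names are hypothetical, and the trace of the correlation matrix is taken to be P = 95, as the percentages in Table 1 imply.

```python
# Eigenvalues of the correlation matrix, copied from Table 1.
eigenvalues = [24.85, 19.47, 12.35, 7.33, 6.71, 5.13, 4.66,
               3.16, 2.99, 2.82, 1.83, 1.55, 1.19, 0.95]
N, P = 15, 95   # samples and C-sources


def retained(eigs, n, p, factor=1.0):
    """Count eigenvalues above factor * P/N when N < P, else above factor."""
    threshold = factor * p / n if n < p else factor
    return sum(1 for e in eigs if e > threshold)


def cumulative_percent(eigs, q, p):
    """Percentage of the trace (P for a correlation matrix) in the first q components."""
    return 100.0 * sum(eigs[:q]) / p


strict = retained(eigenvalues, N, P, factor=1.0)     # threshold 95/15 = 6.33
lenient = retained(eigenvalues, N, P, factor=0.75)   # threshold 0.75 * 95/15 = 4.75

print(strict, lenient)
print(round(cumulative_percent(eigenvalues, lenient, P), 2))
```

With the stricter threshold five components are retained; the more lenient one adds the sixth, and those six components carry about 80% of the total variation, in line with the text.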
3.2. Distances in simplex space
The eigenvalues of the simplex correlation matrix are given in Table 4, and the corresponding component scores are given in Table 5. A scatterplot of points representing the samples on the first two component axes, which account for nearly 49% of the total variation in the scaled data, is given in Fig. 4. As only the first four components provide clear discrimination between groups of replicates, the inter-sample distances calculated from the first four components are given in Table 6; the minimum spanning tree is not listed, but the corresponding dendrogram is given in Fig. 5.

Fig. 5. Dendrogram of a single-linkage cluster analysis from the distances in Table 6, representing qualitative differences between the samples in the space of the first four components of a simplex PCA.

4. Discussion
4.1. Inter-sample distances
The overall PCA, which contains both qualitative
and quantitative information, shows that on the first two components (Fig. 1), which account for nearly 47% of the total variation (Table 1), points C1, C2, and C3 are fairly close together, but are separate from the remaining points. Points A1, A2, and A3 are not very alike, and are different from the other points. The groups of B, D, and E points form an elongated cluster, with points B2 and E2 overlapping. However, in the full space of 14 components (Fig. 2, Table 2), E2 is closer to E3 (distance 0.49) than it is to B2 (0.89) (Table 3, lower half-matrix), so that points E1, E2, and E3 form a compact group. However, B2 is still distant from B1 and B3 because it is closer to E3 (0.77) than it is to B1 (0.91) or B3 (0.83). A2 is distant from all the other points, being closest to E1 (1.00). The smallest of these interpoint distances is E1 to E3 (distance 0.47). The A points join others only at large distances: A1 joins B1, A2 joins E1, and A3 joins E3. The group of B points is joined to the group of E points because B1 joins E2 and B2 joins E3. Similarly, C3 and D1 also join E3, so sample E3 has much in common with samples from all the other soils. These relationships are presented graphically in the dendrogram of the single-linkage cluster analysis (Fig. 2), which also shows that the groups of D and E points are more distinct than is suggested in Fig. 1, as is the pair B1 and B3.
The interrelationships among the points on the first two axes of the simplex PCA (Fig. 4), which account for about 48.5% of the total variation in the qualitative relationships between the samples (Table 4), are very different from those described above. Points A1, A2, and A3 are closer together and form a distinct group, as do the groups of C and D points. Although the groups of B and E points are close together, there is no overlap. It follows that the lack of such distinct groupings in the overall PCA described above must be due to quantitative differences and similarities. Thus, Table 3 and Fig.
3 show that points A1, A2, and A3 are not close together because their plate means are very different. Similarly, the plate mean of B2 is not close to the plate means of B1 and B3, but it is close to the plate mean of E3. The plate mean of A2 is close to the plate mean of E1. However, the most effective axes for discriminating between groups of points in terms of qualitative differences are components 1 to 4 (Table 5); components 5 to 14 have little or no discriminatory power. Hence, the subset of the original tests which is most effective for showing qualitative differences between the communities of organisms in the original samples is that which is associated with the largest (absolute) elements of the first four eigenvectors in the way described in Howard (1991). The relationships between the points in the space of the first four components from the simplex PCA (Tables 5 and 6) are slightly different from
those in Fig. 4. The important distances are represented graphically in the single-linkage dendrogram in Fig. 5. Apart from the distances B1 to B2, A2 to A3, and A2 to A1, the three replicates of each soil sample have mutually small distances, that is, form compact groups (Table 6). In the dendrogram, the group of B replicates joins the group of E replicates via the small distance B3 to E3. Similarly, the group of B replicates joins the group of C replicates via the small distance B1 to C3. The above results raise the question: why do samples that are qualitatively similar have such different overall levels of activity? In the present example, care was taken to ensure that the aliquots of microbial suspension used were as alike as possible, so large differences must be due to real differences in levels of activity. If such care is not taken, quantitative differences will also reflect different amounts of microbial material in the aliquots, and cannot be interpreted.
4.2. Reliability of the data
Some authors, for example, Hitzl et al. (1997), have advocated the use of formal statistical methods in the analysis of BIOLOG plate data, but this assumes that the data have minimum levels of accuracy and precision such as are found in routine chemical analyses. However, it is not unusual for the control well colour value to exceed those of some of the test wells, so that subtracting the control value from the test values gives some negative values, and in the real world negative dehydrogenase activity is impossible. Two possible explanations for this effect are: (i) that the control value is an error; or (ii) that some of the C-sources have an inhibiting effect on the organisms. It is difficult to see how the control value could be an error without throwing into question all the wells on the plate, and hence also the materials in the wells initially. Randomly-distributed negative values in wells with different C-sources would also suggest this.
A possible source of problems is the effect on soil micro-organisms of the tetrazolium salts used in this type of test, because of the known toxicity of these compounds (Benefield et al., 1977; Cook and Garland, 1997). These aspects need further investigation; it is not acceptable simply to remove the negative values from the data and analyse the rest without knowing the underlying cause. In this connection, the second component value for sample A2 (−11.54) is so large that it was thought to be anomalous. A detailed examination of the contributions to this value showed that there were 23 elements of the second eigenvector with values equal to, or greater than, 0.75 times the largest value (Howard, 1991). Nineteen of them had positive signs and four had negative signs, indicating a contrast in the corresponding groups of tests. The individual tests had opposite signs, that is, there were 19 negative test values, which influenced this component very strongly. These 23 contributions to the component summed to 8.32, so that 24% of the tests contributed 72% of the component value. Because there are no logarithms of zero or negative numbers, the positive equivalent of the largest negative value in the data matrix was added to each value in the matrix to allow the simplex analysis to proceed as a demonstration. The process of adding a constant to, or subtracting it from, a data matrix is called translation. Geometrically, it moves the origin of the measurement scale, but it does not affect variances and distance measures. This is not recommended as a general strategy, for reasons given above. Furthermore, it should be noted that logarithms of very small numbers are very sensitive to a change of origin, and can give undue importance to the smallest items in the composition, which is the opposite of what is desirable.
5. Conclusions
The method of data analysis described here provides much information about physiological relationships between samples, and used thoughtfully can provide useful insights. By computing inter-sample distances in the ways described, a researcher can partition the total information that is contained in BIOLOG plate data into qualitative and quantitative forms. The quantitative information is obtained from the plate mean values. As was shown in this example, replicate samples from the same soil type may have different plate mean values, and in some cases these may be closer to the values for samples from other soil types. If care has been taken to ensure that the aliquots of microbial suspension used are as alike as possible, these relationships will reflect real levels of activity that will need to be interpreted by the researcher. Qualitative relationships require a simplex analysis.
In the present examples, replicates of the same soil type formed compact groups, in spite of quantitative differences. Three of the five groups were well separated, but two were close together. Such results help the researcher to generate hypotheses that can be tested in further research. However, it is necessary to ensure that the data are reliable.
Acknowledgements
I am grateful to Dr Nisha Parekh for providing the data for the demonstration of these methods. I thank the referees for their helpful suggestions.
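Appendix
Computer code for replacing zero data values

    delta = 0.005;
    do i = 1 to N;
       cnt = 0;
       /* count the zero values in row i */
       do j = 1 to p;
          if x[i,j] = 0 then cnt = cnt + 1;
       end;
       /* value given to each zero, and amount taken from each non-zero value */
       zerorpl = delta*(cnt+1)*(p-cnt)/(p*p);
       correc = delta*cnt*(cnt+1)/(p*p);
       if cnt > 0 then do j = 1 to p;
          if x[i,j] = 0 then x[i,j] = zerorpl;
          else x[i,j] = x[i,j] - correc;
       end;
    end;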
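For readers without access to SAS, the same zero-replacement arithmetic can be sketched in Python. This is an illustrative port under the assumption that each row is handled independently, as in the Appendix code; the function name and the example row are hypothetical.

```python
def replace_zeros(row, delta=0.005):
    """Return a copy of row with zeros replaced, leaving the row sum unchanged."""
    p = len(row)
    cnt = sum(1 for v in row if v == 0)   # number of zeros in this row
    if cnt == 0:
        return list(row)
    # Value given to each zero, and amount taken from each non-zero value.
    zerorpl = delta * (cnt + 1) * (p - cnt) / (p * p)
    correc = delta * cnt * (cnt + 1) / (p * p)
    return [zerorpl if v == 0 else v - correc for v in row]


# Hypothetical row of proportions with two zeros (illustration only).
row = [0.5, 0.0, 0.3, 0.0, 0.2]
fixed = replace_zeros(row)
print(fixed)
print(sum(row), sum(fixed))
```

The total added to the zeros, cnt × zerorpl, equals the total removed from the other values, (p − cnt) × correc, so the row sum is preserved while every entry becomes strictly positive and its logarithm can be taken.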
References
Aitchison, J., 1986. The Statistical Analysis of Compositional Data. In: Monographs on Statistics and Applied Probability. Chapman and Hall, London.
Benefield, C.B., Howard, P.J.A., Howard, D.M., 1977. The estimation of dehydrogenase activity in soil. Soil Biology & Biochemistry 9, 67–70.
Cook, K.L., Garland, J.L., 1997. The relationship between electron transport activity as measured by CTC reduction and CO2 production in mixed microbial communities. Microbial Ecology 34, 237–247.
Franklin, S.B., Gibson, D.J., Robertson, P.A., Pohlmann, J.T., Fralish, J.S., 1995. Parallel analysis: a method for determining significant principal components. Journal of Vegetation Science 6, 99–106.
Garland, J.L., Mills, A.L., 1991. Classification and characterization of heterotrophic microbial communities on the basis of patterns of community-level sole-carbon-source utilization. Applied and Environmental Microbiology 57, 2351–2359.
Gower, J.C., Ross, G.J.S., 1969. Minimum spanning trees and single linkage cluster analysis. Applied Statistics 18, 54–64.
Grayston, S.J., Campbell, C.D., 1996. Functional biodiversity of microbial communities in the rhizospheres of hybrid larch (Larix eurolepis) and sitka spruce (Picea sitchensis). Tree Physiology 16, 1031–1038.
Hitzl, W., Henrich, M., Kessel, M., Insam, H., 1997. Application of multivariate analysis of variance and related techniques in soil studies with substrate utilization tests. Journal of Microbiological Methods 30, 81–89.
Howard, P.J.A., 1991. An Introduction to Environmental Pattern Analysis. Parthenon, Carnforth.
Howard, P.J.A., 1997. Analysis of data from BIOLOG plates: comments on the method of Garland and Mills. Soil Biology & Biochemistry 29, 1755–1757.
Reyment, R., 1991. Multidimensional Palaeobiology. Pergamon Press, Oxford.
Reyment, R., Jöreskog, K.G., 1993. Applied Factor Analysis in the Natural Sciences. Cambridge University Press, Cambridge.