The closure problem as reflected in discriminant function analysis

Chemical Geology, 37 (1982) 367--375

Elsevier Scientific Publishing Company, Amsterdam -- Printed in The Netherlands

THE CLOSURE PROBLEM AS REFLECTED IN DISCRIMINANT FUNCTION ANALYSIS

JOHN C. BUTLER

Department of Geosciences, University of Houston, Houston, TX 77004 (U.S.A.) (Received October 6, 1981; revised and accepted March 22, 1982)

ABSTRACT

Butler, J.C., 1982. The closure problem as reflected in discriminant function analysis. Chem. Geol., 37: 367--375.

Much has been written concerning difficulties in assessing the strengths of relationships among percentages and the fact that problems exist whether one is examining trends on binary or ternary scatter diagrams or examining the results from some multivariate statistical process. Discriminant function analysis (DFA) is intuitively appealing as it offers a procedure whereby one can attempt to reclaim known group membership with data that were not employed in defining the groups. Successful assignment of specimens to their known groups requires that one or more of the selected variables have a high variability among the groups and a low variability within the groups. If DFA is successful (as assessed by a high percentage of correct assignments) it is tempting to interpret the results in a geochemical/petrologic context. However, if it is known that some transformation (such as percentage formation) performed prior to the analysis is responsible for some of the structure of the data matrix then one must interpret the results with extreme care and caution.

INTRODUCTION

Much effort has been directed towards detecting naturally occurring groups of specimens of geochemical and petrologic interest. Given the relative ease with which tectonic-setting, major-element, trace-element, isotopic and modal data can be obtained, there is indeed a wealth of information from which one can attempt to define groups of similar specimens. Of all of the multivariate procedures available, discriminant function analysis (DFA) is intuitively appealing to a geochemist/petrologist. Quite often one is faced with a situation in which groups of similar specimens are recognized on the basis of field occurrence, geographic distribution or stratigraphic position. Given such recognized groups plus data that were not used to define the groups (such as their major- or trace-element chemistry), one can determine if the complementary data also allow characterization of the groups. Pearce and Cann (1973), for example, selected basalts from several tectonic settings and were concerned as to whether the amounts of Ti, Zr, Y and Sr would allow reclamation of the same group membership. If such data provide an efficient means of assignment, then one may inquire as to whether there is a relationship between tectonic setting and trace-element chemistry.

In recent years there has been considerable interest, by a relatively small number of geoscientists, in the effects that various transformations have on the structure of a data matrix (that is, on the statistical descriptors of the data set). The purpose of this paper is to extend previous studies of such transformations specifically to DFA.

MODIFICATION OF STRUCTURE DUE TO FORMING PERCENTAGES

In this paper emphasis will be focused on percentage formation but the reader should realize that other transformations may lead to similar difficulties. Papers by Miesch (1969), Chayes (1971), Butler (1978, 1979, 1981) and Skala (1979) should be consulted for more detailed discussions of percentage formation effects.

It will be assumed an investigator has measured the values of several variables (such as major and/or trace elements) on each of several specimens and these data are arranged in matrix form with the specimens in the rows and values of the measured variables in the columns. One could elect to perform some transformation on these data prior to some multivariate analysis. For example, rare-earth element data are usually "normalized" by dividing each measured value by an appropriate average rare-earth abundance. That is, each measured value of a given variable is divided by a constant, the average abundance of that variable. Such simple linear transformations modify some of the univariate statistical descriptors of the columns of the data array (such as sample means and variances) but others are unchanged and survive the transformation (such as the coefficients of variation).
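The behaviour of such column-wise scaling can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the data are synthetic, and the divisors in `constants` are arbitrary stand-ins for "average abundances" rather than values taken from any study. The correlation matrix is included in the check because it, too, is unaffected by dividing each column by a constant.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data matrix: 50 specimens (rows) by 3 measured variables (columns).
X = rng.gamma(shape=5.0, scale=2.0, size=(50, 3))

def coefficient_of_variation(data):
    """Sample standard deviation divided by the sample mean, per column."""
    return data.std(axis=0, ddof=1) / data.mean(axis=0)

# Divide each column by its own constant (arbitrary illustrative values).
constants = np.array([0.5, 2.0, 10.0])
X_scaled = X / constants

means_change = not np.allclose(X.mean(axis=0), X_scaled.mean(axis=0))
variances_change = not np.allclose(X.var(axis=0, ddof=1), X_scaled.var(axis=0, ddof=1))
cv_survives = np.allclose(coefficient_of_variation(X), coefficient_of_variation(X_scaled))
corr_survives = np.allclose(np.corrcoef(X, rowvar=False),
                            np.corrcoef(X_scaled, rowvar=False))

print("means change:", means_change)
print("variances change:", variances_change)
print("coefficients of variation survive:", cv_survives)
print("correlation coefficients survive:", corr_survives)
```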
In addition, the covariance between a pair of the columns is changed proportional to the constant(s) but the correlation coefficient (Pearson's product moment correlation coefficient) survives. Davis (1973) provides a good introduction to such simple linear transformations.

Other transformations make use of a property of the columns or rows of the data matrix. Column transformations (Butler, 1979a) (such as the proportional to the range transformation) modify the means, variances and covariances but the coefficients of variation and correlation coefficients survive. Row transformations (Butler, 1979a) are those that make use of a property of each row of the data matrix. Percentage formation is a row transformation as each value of a variable is divided by the sum of all of the measured variables for that specimen. Percentages are said to be closed as all row sums in the final transformation are constant (100% for percentages). Data not subject to the constant item sum restriction are referred to as open. Such row transformations can be responsible for major changes in the structure of a data matrix.

In general, the following types of changes may be expected to occur when percentages are formed (these generalizations apply regardless of whether the percentages are formed from an open set or from a part of an already closed set).

(1) The correlation coefficient (Pearson's product moment or one of the rank-order correlation coefficients) will not remain the same after percentages are formed. For variables with large variances, percentage formation will tend to change the correlation coefficient in the negative direction. Changes in the positive direction have been recorded between trace elements (Chayes, 1971) and when one of the variables has an extremely small coefficient of variation (Butler, 1978).

(2) The absolute value of all of the simple summary statistics (mean, variance, skewness and kurtosis) of the transformed data should be expected to differ from those of the parent-data (Chayes, 1971; Butler, 1979a).

(3) In certain situations properties change in rank order as well as in absolute value. For example, when ternary percentages are formed one may observe a change in the rank order of the sample variances. In such data sets it is not uncommon to observe that the variable with the largest sample variance in the parent-set has the smallest variance in the ternary subset (Butler, 1979b).

(4) Variables that exhibit a symmetrical distribution will not retain zero skewness after percentages are formed (Butler, 1979a).

In spite of considerable discussion of these effects in the literature, there still may be some concern and/or confusion as to their implications. If, for example, DFA is performed on an open set and then repeated with the percentage form of the set, there will be differences between the two sets of results. Such differences reflect changes in the structure of a data matrix that must accompany the percentage formation process and one must exercise extreme care and caution in making comparisons between the two sets of results.

DISCRIMINANT FUNCTION ANALYSIS

In the literature there is a variance of opinion as to whether percentage formation poses a problem for the interpretation of the results of a DFA. For example, Miesch (1969, p. 162) argues that techniques such as cluster analysis, DFA and Q-mode factor analysis are not affected by percentage formation. Skala (1979), however, argues that percentage formation imposes constraints upon the DFA process and that such constraints complicate interpretation of the results. It seems intuitive to the author that transformations that modify the structure of a data matrix may lead to difficulties in interpreting the results of multivariate statistical methods. Assessment of such effects can be accomplished through an analysis of the mathematics involved in the procedure or by a direct comparison of results from a non-transformed set and its transformed equivalent. The latter approach will be emphasized in this paper although the former will not be ignored.

Davis (1973) provides a good discussion of DFA as applied to geochemical problems and as to the requirements that must be satisfied in order to undertake a thorough interpretation of the results. As noted previously, it is assumed that the investigator has determined which of his specimens belong to a given group. DFA finds a linear combination of the selected variables that yields the maximum difference between the groups. If such a linear combination of variables is successful (that is, if the function assigns a large percentage of the specimens to their known groups) one might elect to use the function to assign additional specimens (not part of the training set) to one of the recognized groups. Or, one could elect to obtain a linear subset of the variables such that a relatively small number of variables are used to discriminate between groups and, if successful, inquire as to the geochemical/petrologic significance of the selected subset.

In all multivariate analyses of variance one can partition the matrix of total sums of squares and cross-products (T) into the matrices of the among (A) and within (W) group sums of squares and cross-products (Koch and Link, 1971):

$$T = A + W \qquad (1)$$
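The partition in Eq. 1 can be verified numerically. The sketch below is illustrative only: the function name `sscp_partition`, the two synthetic groups and their means are assumptions, not part of the original study. It builds the total, among-group and within-group sums of squares and cross-products matrices from their definitions and confirms that they satisfy T = A + W.

```python
import numpy as np

def sscp_partition(X, groups):
    """Return total (T), among-group (A) and within-group (W) SSCP matrices."""
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    grand_mean = X.mean(axis=0)
    centered = X - grand_mean
    T = centered.T @ centered                # deviations from the grand mean
    p = X.shape[1]
    A = np.zeros((p, p))
    W = np.zeros((p, p))
    for g in np.unique(groups):
        Xg = X[groups == g]
        mean_g = Xg.mean(axis=0)
        d = (mean_g - grand_mean)[:, None]   # group mean minus grand mean
        A += len(Xg) * (d @ d.T)             # among-group contribution
        dev = Xg - mean_g                    # deviations from the group mean
        W += dev.T @ dev                     # within-group contribution
    return T, A, W

rng = np.random.default_rng(1)
# Hypothetical two-group data set with three variables (20 and 32 specimens).
X = np.vstack([rng.normal([8.0, 3.0, 8.0], 1.0, size=(20, 3)),
               rng.normal([9.0, 4.0, 6.0], 1.0, size=(32, 3))])
groups = np.array([0] * 20 + [1] * 32)

T, A, W = sscp_partition(X, groups)
print("T equals A + W:", np.allclose(T, A + W))
```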

The pooled within-group variance-covariance matrix ($S_p$) is obtained by dividing each element of W by the number of degrees of freedom. A discriminant function transforms the raw measurements into a single value, the discriminant score. In general, DFA involves finding a linear function that maximizes the ratio of the difference between multivariate means to the multivariate variance (Davis, 1973). The matrix of the coefficients of the discriminant function (C) can be found by solving the equation:

$$C = S_p^{-1} D \qquad (2)$$

where $S_p^{-1}$ is the inverse of the pooled within-group variance-covariance matrix; and D contains the differences between the variable means.

Chayes (1971) showed that the sum of each row of the variance-covariance matrix of a closed data set is zero; that is, for variable X1, the sum of the covariances of X1 and all other variables is the negative of the variance of variable X1. The covariance of variables 1 and 2 is given by:

$$\mathrm{cov}_{12} = \frac{SP_{12}}{N - 1} \qquad (3)$$

where $SP_{12}$ is the sum of the cross products of variables 1 and 2; and N is the number of observations. Each row of the sums of squares and cross-products matrix must sum to zero as all of the elements in the variance-covariance matrix can be multiplied by (N - 1) to yield the matrix of sums of squares and cross-products. No such restriction exists for the within-group sums of squares and cross-products matrix (W) for the open form of the data.

There are a number of available computer programs that perform DFA. In this study the program Stepwise Discriminant Analysis in the Biomedical Computer Programs Package (program BMDP 7M, Dixon, 1977) has been used. Although similar to other routines, BMDP 7M offers considerable flexibility and provides an abundance of options and output that makes its use convenient although not critical. BMDP 7M allows for working with more than two groups and processes information in a stepwise fashion. The first step in the formulation of the discriminant function defines the one variable that maximizes the differences between the groups. The second step identifies the pair of variables that maximizes the differences between the groups and so on. If, with respect to the efficiency of assignment, three variables are just as efficient as ten, then one may wish to consider whether there is some geochemical significance associated not only with that triplet of variables but also with respect to the sequence in which the variables were added to the discriminant functions.

Chayes and Metais (1964, p. 179) found that TiO2 (weight percent) by itself is sufficient to discriminate between oceanic-island and circum-oceanic basalts with an efficiency of more than 93%. The binary pair TiO2--MgO increases the efficiency of assignment only by ~2%. Thus, one could elect to determine the role of TiO2 (or TiO2 and MgO) in allowing such a strong discrimination between these two groups of basalts.

Of concern in this note is whether a given transformation (such as percentage formation) will produce changes in the outcome of DFA such as: (1) the overall efficiency of assignment; (2) differences of assignment of individual observations (specimens); and (3) the sequence in which variables are added to the discriminant function.

IGNEOUS ROCKS FROM BIG BEND NATIONAL PARK AND TRISTAN DA CUNHA

A comprehensive comparison of the major-element chemistry of the extrusive igneous rocks in the Big Bend National Park region (Maxwell et al., 1967) with extrusive igneous rocks from various tectonic settings is being conducted at the University of Houston. As a part of this comparison, the 20 analyzed extrusive specimens from the Big Bend region with SiO2 less than 62 wt.% (trachyandesites and basalts; Maxwell et al., 1967) were compared with the 32 analyzed specimens from Tristan da Cunha (olivine basalts, trachyandesites, trachybasalts, and trachytes; Baker et al., 1964). Given that many petrologists routinely publish an F--M--A (total Fe as FeO--MgO--Na2O+K2O) variation diagram as a descriptor of their data set(s), it was believed worthwhile to compare the Big Bend and Tristan da Cunha data sets with respect to the "open" values of F--M--A (not recalculated to 100%) and with respect to the ternary percentages F'--M'--A' (primed values will refer to ternary percentage values). It must be remembered, however, that the so-called "open" values are themselves part of a larger closed set (the whole-rock analyses).

The effects of percentage formation are evident in the summary statistics for the combined data sets (Table I). Note the nature of the changes in summary statistics accompanying the formation of ternary percentages: (1) there is a reversal in rank order of the sample means of F and A; (2) there is a reversal in rank order of sample variances of F and A; and (3) A' contributes more than 60% of the sample variance in the ternary subset whereas A contributes less than 30% of the sample variance.

TABLE I

Summary statistics for the combined Big Bend and Tristan da Cunha data sets

OPEN (SUB)SET OF VARIABLES

Variable                    F (total Fe)    M (MgO)       A (Na2O + K2O)    Row sum
Sample mean                 8.302 (1)       3.051 (3)     8.110 (2)         19.463
Sample variance             9.747 (1)       5.338 (3)     6.256 (2)         9.096
% variance                  45.7% (1)       25.0% (3)     29.3% (2)
Coefficient of variation    0.376 (2)       0.757 (1)     0.308 (3)         0.155

CLOSED SET OF VARIABLES

Variable                    F' (total Fe)   M' (MgO)      A' (Na2O + K2O)
Sample mean                 41.426 (2)      14.479 (3)    44.095 (1)*
Sample variance             135.50 (2)      86.729 (3)    353.37 (1)*
% variance                  23.5%           15.1%         61.4%
Coefficient of variation    0.281 (3)       0.643 (1)     0.426 (2)*

Rank order within the set given in parentheses. *Indicates change in rank order following percentage formation.

WITHIN-GROUP VARIANCE-COVARIANCE MATRIX FOR OPEN (SUB)SET OF VARIABLES

          F          M          A
F       9.327
M       5.699      5.276
A       6.672     -5.006      6.052

WITHIN-GROUP VARIANCE-COVARIANCE MATRIX FOR CLOSED SET OF VARIABLES

          F'         M'         A'
F'     112.34
M'      75.00      87.78
A'    -187.34    -161.05     348.39
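The derived rows of Table I follow directly from the sample means and variances, which gives a quick consistency check on the table. The sketch below assumes that "% variance" is each variable's share of the summed sample variances of the three variables and that the coefficient of variation is the sample standard deviation divided by the sample mean; both definitions are inferred from the tabulated numbers rather than stated in the text, and the helper name `derived_rows` is hypothetical.

```python
import math

# Sample means and sample variances copied from Table I.
open_set = {"F": (8.302, 9.747), "M": (3.051, 5.338), "A": (8.110, 6.256)}
closed_set = {"F'": (41.426, 135.50), "M'": (14.479, 86.729), "A'": (44.095, 353.37)}

def derived_rows(stats):
    """Return {variable: (% of summed variance, coefficient of variation)}."""
    total_variance = sum(var for _, var in stats.values())
    return {
        name: (round(100.0 * var / total_variance, 1),
               round(math.sqrt(var) / mean, 3))
        for name, (mean, var) in stats.items()
    }

open_rows = derived_rows(open_set)
closed_rows = derived_rows(closed_set)

for name, (pct, cv) in {**open_rows, **closed_rows}.items():
    print(f"{name:>2}: {pct:5.1f}% of variance, coefficient of variation = {cv:.3f}")
```

Under these assumptions the recomputed values agree with the "% variance" and "Coefficient of variation" rows of Table I to the printed precision.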

For the "open" set, variables were added to the discriminant functions in the order:

F(1) -- M(2) -- A(3)

and 84.6% of the specimens were assigned to their known group (16 of the 20 Big Bend and 28 of the 32 Tristan da Cunha). Although not perfect (less than 100% assignment efficiency), these results indicate that there are significant differences between the two suites with respect to the "open" values of F, M and A.

For the ternary subset variables were added in the order:

F'(1) -- A'(2) -- M'(3)

and 92.3% of the specimens were assigned to the correct group (18 of the 20 Big Bend and 30 of the 32 Tristan da Cunha). Note that the sequence in which variables were added to the discriminant functions is not the same for the FMA and F'M'A' subsets. A summary of the results is given in Table II.

TABLE II

A comparison of the Big Bend and Tristan da Cunha suites using discriminant functions containing the variables F--M--A

             "Open"                                           Closed
Step No.     Variable added    Assignment efficiency (%)      Variable added    Assignment efficiency (%)
1            F                 53.8                           F'                57.7
2            M (+F)            82.6                           A' (+F')          92.3
3            A (+F+M)          84.6                           M' (+F'+A')       92.3
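As a rough illustration of the kind of comparison summarized in Table II, the sketch below runs a two-group linear discriminant on an open data set and on its percentage form. Everything in it is an assumption made for illustration: the data are synthetic (they are not the Big Bend or Tristan da Cunha analyses), the function names are hypothetical, all variables enter at once rather than stepwise as in BMDP 7M, and a pseudo-inverse is used because the pooled within-group matrix of a fully closed set is singular. The helper `assignment_efficiency` also reproduces the final-step percentages from the specimen counts quoted in the text.

```python
import numpy as np

def assignment_efficiency(n_correct, n_total):
    """Percentage of specimens assigned to their known group."""
    return 100.0 * n_correct / n_total

def two_group_dfa(X, groups):
    """Two-group linear discriminant: C = pinv(Sp) @ D, cut at the midpoint score."""
    X0, X1 = X[groups == 0], X[groups == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    W = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    Sp = W / (len(X) - 2)                              # pooled within-group covariance
    C = np.linalg.pinv(Sp, rcond=1e-10) @ (m0 - m1)    # pseudo-inverse copes with closure
    cutoff = C @ (m0 + m1) / 2.0                       # midpoint between group mean scores
    predicted = np.where(X @ C > cutoff, 0, 1)
    n_correct = int(np.sum(predicted == groups))
    return C, assignment_efficiency(n_correct, len(groups))

rng = np.random.default_rng(42)
groups = np.array([0] * 20 + [1] * 32)
# Synthetic stand-ins for F, M and A in two suites (arbitrary parameters).
suite0 = rng.gamma(shape=[12.0, 6.0, 10.0], scale=[0.75, 0.60, 0.70], size=(20, 3))
suite1 = rng.gamma(shape=[12.0, 6.0, 10.0], scale=[0.65, 0.40, 0.85], size=(32, 3))
open_data = np.vstack([suite0, suite1])
closed_data = 100.0 * open_data / open_data.sum(axis=1, keepdims=True)

C_open, eff_open = two_group_dfa(open_data, groups)
C_closed, eff_closed = two_group_dfa(closed_data, groups)

print(f"open form:   efficiency = {eff_open:.1f}%, coefficients = {np.round(C_open, 3)}")
print(f"closed form: efficiency = {eff_closed:.1f}%, coefficients = {np.round(C_closed, 3)}")

# Final-step efficiencies from the counts quoted in the text.
print(round(assignment_efficiency(16 + 28, 52), 1))   # open: 16 of 20 and 28 of 32
print(round(assignment_efficiency(18 + 30, 52), 1))   # closed: 18 of 20 and 30 of 32
```

The efficiencies obtained from the synthetic data carry no geological meaning; the point is only that the two forms of the same data generally yield different coefficients and may yield different assignments.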

All four of the incorrectly assigned FMA Big Bend specimens were correctly assigned in the F'M'A' subset. Of the four incorrectly assigned Tristan da Cunha samples in the FMA subset, two are incorrectly assigned in the ternary subset. Another way to compare the two sets of results is to note that, of the 20 Big Bend samples, six (30%) were inconsistently assigned (that is, correct in one set but wrong in the other); similarly, four (13%) of the Tristan da Cunha specimens were inconsistently assigned.

From Table II, there is an improvement in the assignment efficiency accompanying percentage formation (84.6% vs. 92.3%). F. Chayes (pers. commun., 1982) noted that this improvement is most likely due to the addition of a new variable rather than the percentage formation process itself. That is, a multivariate data set can be reduced to the four variables A, B, C and D where D is defined as:

$$D = (\text{row sum}) - (A + B + C)$$

or

$$A + B + C = (\text{row sum}) - D \qquad (4a)$$

That is, D is the sum of the variables not included in the subset ABC. The percentage of A in the ternary subset A'B'C' can be expressed as:

$$A' = 100\,[A/(A + B + C)]$$

or

$$A' = 100\,[A/\{(\text{row sum}) - D\}] \qquad (4b)$$
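The equivalence of the two forms of Eq. 4b is easy to confirm numerically. The following sketch uses a synthetic four-part closed data set (the values and random seed are arbitrary assumptions) and shows that the ternary percentage A' depends on the excluded variable D through the row sum.

```python
import numpy as np

rng = np.random.default_rng(7)
# Hypothetical closed analyses with parts A, B, C and D (everything else);
# each row is rescaled to sum to 100.
raw = rng.gamma(shape=4.0, scale=1.0, size=(10, 4))
parts = 100.0 * raw / raw.sum(axis=1, keepdims=True)
A, B, C, D = parts.T
row_sum = parts.sum(axis=1)

A_prime_direct = 100.0 * A / (A + B + C)     # first form of Eq. 4b
A_prime_via_D = 100.0 * A / (row_sum - D)    # second form of Eq. 4b, using Eq. 4a

print("both forms agree:", np.allclose(A_prime_direct, A_prime_via_D))
```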

Thus, it is not correct to compare assignment efficiencies for the FMA and F'M'A' subsets. Studies are underway to determine if inclusion of a variable such as D always improves the assignment efficiency.

Regardless of the above, percentage formation does modify the statistical descriptors and these modifications will be reflected by differences between subsets such as FMA and F'M'A'. For example, Chayes and Metais (1964) note that if variables are uncorrelated or weakly correlated, those with mean values that differ the most between the groups will contribute most to a discriminant function. Thus, any transformation that modifies the correlation coefficient is bound to affect the results of DFA.

Do results such as those discussed in this note pose a problem? If one were to select that form of the data (open or closed) that yielded the highest percentage of correct assignments and to ignore the other form of the data the answer would be "yes". The decision to use the open or closed form should be based on the model of the investigator and not on the results of some mathematical manipulation. Otherwise, the investigator must accept the possibility that a numerical consequence may be interpreted as some evidence of petrologic significance. A comparison using the open form would be logical if one were most concerned with absolute amounts. However, if the model was concerned with relative amounts of the parts of a particular subset the percentage form would be appropriate as in most data sets the sum of an open subset is not a constant. That is, the form of the data should be selected with as much care (or possibly more) as is given to the interpretation of the results.

Consider for a moment the fact that the so-called open subset in this study is in fact part of a closed set for which the open subset is unavailable (Chayes, 1971). Given the nature of the differences already presented, what assurance does the investigator have that the structure of his open subset has not been modified seriously as a result of percentage formation? Thus, all of the concerns of Chayes (1971) and others concerning the interpretation of various scatter diagrams in which percentages are employed must be extended to multivariate procedures such as DFA.
SUMMARY

Percentage formation can produce changes in the structure of a data matrix of significant magnitude and can lead to difficulties in the interpretation of the results of DFA. Changes in rank order of summary statistics may take place along with the anticipated changes in magnitude. There is no transformation that can be applied to the rows of a data matrix that does not modify the statistical properties of and between the columns of the transformed array as compared with the original array. Therefore, the investigator must select the form of the data with as much care as will be devoted to the interpretation of the results of some analytical procedure.
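The central point of the summary, that forming percentages alters the relationships between columns, can be illustrated with a small simulation. This is a sketch under stated assumptions, not a reanalysis: the three "open" variables are synthetic and drawn independently, and the distribution parameters are arbitrary. It forms ternary percentages, confirms that each row of the resulting variance-covariance matrix sums to zero (the property attributed to Chayes, 1971, earlier in the paper), and prints the correlation coefficients before and after closure.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical open data: three independent, positive variables on 500 specimens.
open_data = rng.gamma(shape=[7.0, 2.0, 10.0], scale=[1.2, 1.5, 0.8], size=(500, 3))
# Row transformation: each value divided by its row sum and scaled to 100%.
closed_data = 100.0 * open_data / open_data.sum(axis=1, keepdims=True)

cov_closed = np.cov(closed_data, rowvar=False)
corr_open = np.corrcoef(open_data, rowvar=False)
corr_closed = np.corrcoef(closed_data, rowvar=False)

pairs = np.triu_indices(3, k=1)   # index pairs (1,2), (1,3), (2,3)
print("row sums of closed covariance matrix:", np.round(cov_closed.sum(axis=1), 10))
print("open correlations:  ", np.round(corr_open[pairs], 2))
print("closed correlations:", np.round(corr_closed[pairs], 2))
```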

REFERENCES

Baker, P.E., Gass, I.G., Harris, P.G. and LeMaitre, R.W., 1964. The volcanological report of the Royal Society expedition to Tristan da Cunha, 1962. Philos. Trans. R. Soc. London Ser. A, 256: 439--578.

Butler, J.C., 1978. Visual bias in r-mode dendrograms due to the effect of closure. J. Math. Geol., 10: 243--252.

Butler, J.C., 1979a. Numerical consequences of changing the units in which chemical analyses of igneous rocks are analyzed. Lithos, 12: 33--39.

Butler, J.C., 1979b. Trends in ternary petrologic variation diagrams -- fact or fantasy? Am. Mineral., 64: 1115--1121.

Butler, J.C., 1981. Effect of various transformations on the analysis of percentage data. J. Math. Geol., 13: 53--68.

Chayes, F., 1971. Ratio Correlation. University of Chicago Press, Chicago, Ill., 99 pp.

Chayes, F. and Metais, D., 1964. Discriminant functions and petrographic classification. Carnegie Inst. Washington, D.C., Annu. Rep. Dir. Geophys. Lab., No. 1440, pp. 179--181.

Davis, J., 1973. Statistics and Data Analysis in Geology. New York, N.Y., 550 pp.

Dixon, W.J. (Editor), 1977. Biomedical Computer Programs P-Series. University of California Press, Los Angeles, Calif., 881 pp.

Koch, G.S. and Link, R.F., 1971. Statistical Analysis of Geological Data, Vol. II. Wiley, New York, N.Y., 438 pp.

Maxwell, R.A., Lonsdale, J.T., Hazzard, R.T. and Wilson, J.A., 1967. Geology of the Big Bend National Park, Brewster County, Texas. Bur. Econ. Geology, Austin, Texas, Univ. Texas Publ. 6711: 245--266.

Miesch, A.T., 1969. The constant sum problem in geochemistry. In: D.F. Merriam (Editor), Computer Applications in the Earth Sciences. Plenum, New York, N.Y., pp. 161--176.

Pearce, J.A. and Cann, J.R., 1973. Tectonic setting of basic volcanic rocks using trace element analyses. Earth Planet. Sci. Lett., 19: 290--300.

Skala, W., 1979. Some effects of the constant-sum problem in geochemistry. Chem. Geol., 27: 1--9.