The statistical analysis of compositional data, by John Aitchison


Chemometrics and Intelligent Laboratory Systems

Chemometrics: A New “Curriculum” in Chemistry?

The importance of measurement in all branches of science is well known. Analytical chemistry has long been a specialisation needing sophisticated measurement techniques. With the advent of widespread computing power, analysts saw a wide horizon opening before their eyes; new techniques for data processing were developed and, with these, a new field called chemometrics. This field has now spread to other specialities of chemistry. There are many stages in the acquisition and processing of data and, although some are performed automatically by instruments, a chemist should not treat computers as a “black box”; chemists should, in my opinion, obtain an understanding of all these stages. The best statistical programmes are of little use if there is no clear understanding as to whether the approaches being used are pertinent (let alone optimal) to the particular problem. In our Institute, and especially in the Chemometrics Department, we believe that chemistry students

should, already during their graduation studies and independently of their future specialization, become familiar with the basic concepts of physics, mathematics and information science, as well as with the necessary tools for measurement. By way of information, and very briefly, I would like to enumerate the chemometrics-related subjects which have been taught in our Institute for the last few years as compulsory subjects within the graduation studies, and whose content covers the basic needs in this matter.

- Graphic and numerical calculus (75 h)
- Linear and advanced algebra (60 and 28 h)
- Infinitesimal calculus (100 h)
- Differential equations (45 h)
- Electrostatics and electromagnetism (85 h)
- Digital programming (20 h theory, 15 h laboratory)
- Basic statistics (40 h)
- Electronics for chemists (110 h theory, 65 h laboratory)
- Applied digital calculus (42 h theory, 65 h laboratory)
- Digital techniques of analysis (50 h theory, 50 h laboratory)

In addition to the graduate studies, by way of test and as courses for postgraduates and for the doctorate degree, several further courses were organized in 1987:

- A “Curriculum” on informatics with two subjects, already taught in 1987, under the titles: Architecture and interface in microcomputers (30 h) and Digital processing of signals (30 h);
- One subject for a doctorate course under the title: Experimental design and optimization.

For 1988 there are, in addition, two subjects under the titles: Programming algorithms (30 h) and Simulation and elements for industrial control (30 h). In the future the chemometrician will be a valuable partner to the chemist in tackling problems in laboratory automation such as robotics, expert systems and the building of truly intelligent laboratory systems. It is likely that the chemometrician trained today will play a valuable role in the society of tomorrow.

W.K. BEK
Institute Quimico de Sarria, S/N, 08017 Barcelona, Spain

Book Reviews

The Statistical Analysis of Compositional Data, by John Aitchison
Chapman & Hall, London, 1986, XV + 416 pages, price £29.95, ISBN 0-412-28060-4

Closed data sets, i.e. compositional data, occur naturally and unavoidably in many areas of natural science: geology, chemistry, ecology, genetics. The characteristic feature of compositional data is that the proportions of a composition are subject to a unit-sum constraint. Although the existence of “closure” has been recognized by a few for several decades, the attempts made to come to grips with the problem have not been appropriate. More generally, the unit-sum constraint has been ignored by the vast body of practitioners (including professional statisticians), and standard versions of statistical methods devised for unconstrained data have, after perhaps lip-service to the problem, been applied to constrained data. Considering that this way of dealing with constrained data is so widespread, and practised by so many people of high scientific standing, it might well be asked if the bugbear is really so bad as it is made out to be by “purists”. The situation is, however, grave. The results of such incorrect applications of statistical methods are usually misleading, at best, and just plain wrong, at worst.

Among the geochemists who recognized the significance of closure some 25 years ago, Felix Chayes deserves special mention. Chayes made several attempts at achieving a mathematical solution of the problem, culminating with a book devoted to the topic [1]. The first sentence in his book reads, “Descriptive petrography is a web of ratio correlations”. This is a true representation of the nature of unit-sum data.

What then is the constant-sum problem? We may describe it briefly in terms of any vector $x$ with elements $x_1, \ldots, x_D$ (all non-negative), which represent proportions of some whole and are subject to the constraint $x_1 + \cdots + x_D = 1$ or, equivalently, 100%. The reader will immediately recognize the familiar representational form of chemical analyses, chemical compositions of rocks, foodstuffs, serological data, etc. All of these situations have one thing in common, to wit, they are defined in a restricted part of real space, referred to as the simplex.

Professor J. Aitchison has, in the book here under review, summarized his several contributions to a solution of the statistical analysis of compositional data. In addition, many new

results are published for the first time. The text, presented as a monograph, is mainly directed towards the enlightenment of his fellow mathematical statisticians, but it can be read and understood without much difficulty by any chemist with a normal background in mathematics at the university level.

Of particular relevance for chemometrics is the subject of graphical distortions due to the closure constraint; this is presented and analyzed at length. Thus, the dangers and pitfalls inherent in Harker diagrams and the like are made clear. Aitchison shows how serious attempts at treating the problem of the statistical analysis of compositions made in the past all fall short of an adequate solution, including the use of the Dirichlet class of distributions. With the ready access to large-scale computer programs we have today, and the rapid development of microcomputers with impressive capabilities, the temptation to cast caution to the winds and ignore the unit-sum constraint proves for many to be overwhelming. It can be shown without much trouble that methods of multivariate statistical analysis applied to compositional data almost always yield an incorrect, irrelevant, distorted analysis.

The problem posed in the multivariate case can be readily appreciated by the following simple exposition. Consider a composition consisting of two parts. It is easy to show that in the “normal” case (raw correlations)

$$\operatorname{cov}(x_1, x_2) = -\operatorname{var}(x_1) = -\operatorname{var}(x_2)$$

and that

$$\operatorname{corr}(x_1, x_2) = -1.$$
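As an illustrative aside that is not taken from Aitchison's text, the two-part identities can be checked numerically. The minimal Python sketch below assumes made-up uniform data for $x_1$; the helper names `mean`, `covariance` and `correlation` are hypothetical and written out by hand so that nothing beyond the standard library is needed.

```python
import random


def mean(values):
    """Arithmetic mean of a sequence of numbers."""
    return sum(values) / len(values)


def covariance(a, b):
    """Sample covariance (n - 1 denominator) of two equal-length sequences."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b)) / (len(a) - 1)


def correlation(a, b):
    """Pearson correlation coefficient built from the sample covariance."""
    return covariance(a, b) / (covariance(a, a) * covariance(b, b)) ** 0.5


# Hypothetical data: 200 two-part compositions. x1 is drawn arbitrarily;
# x2 is then fully determined by the unit-sum (closure) constraint.
rng = random.Random(0)
x1 = [rng.uniform(0.05, 0.95) for _ in range(200)]
x2 = [1.0 - value for value in x1]

cov12 = covariance(x1, x2)
var1 = covariance(x1, x1)
var2 = covariance(x2, x2)
corr12 = correlation(x1, x2)

print("cov(x1, x2) =", cov12)
print("-var(x1)    =", -var1)
print("-var(x2)    =", -var2)
print("corr(x1, x2) =", corr12)
```

Whatever distribution is chosen for $x_1$, the printed covariance should agree with both negative variances, and the correlation should equal $-1$ up to rounding error, because $x_2 = 1 - x_1$ is an exact linear function of $x_1$.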

Thus, the correlation coefficient is not free to range from $-1$ to $+1$; it is restricted to a specified value. Moreover, the covariances between the components of an unrestricted vector do not change for various possible subvectors of that vector. They do, however, change for subcompositions of a compositional vector. Consequently, $\operatorname{cov}(x_1, x_2)$ in a $p$-part composition is not the same as $\operatorname{cov}(x_1, x_2)$ in a $(p - 1)$-part subcomposition for the same $x_1$ and $x_2$. It is indeed amazing that these very serious flaws have not created more of a furore among chemists, geoscientists, biologists, and, for that matter, applied statisticians, than has been the case.

Aitchison has given the first unified treatment of how to approach the analysis of compositional data. He has carefully exposed the pitfalls, fallacies and dangers involved, with frequent reference to a large number of examples. In order to further the usefulness of his text, he has prepared a microcomputer statistical package, CODA, which is sold separately. Unfortunately, the original version of this diskette contains programming errors; these have now been corrected and a second version is being made available. A persuasive argument is given in favour of a compositional covariance structure, based on logratios of components, and the superiority of the logistic normal class for statistical analysis.

Most chemometrical analyses are made on multivariate data, as is well known to readers of Chemometrics and Intelligent Laboratory Systems; the wide use of standard packages such as SIMCA, BMDP, etc. attests to this statement. Unfortunately, none of these takes account of the distortional effects of closure, a malaise that was made distressingly apparent at the recent conference held in Ulvik, Norway (June, 1986). Aitchison has applied the additive logratio transformation to the whole arsenal of


multivariate techniques available, thus making them suitable for unit-sum data. In a section largely devoted to principal component analysis, the stark differences between the results of an analysis made on the crude covariance structures and those attained by log-contrast principal components, using the centred logratio covariance matrix, are shown to be especially significant when there is curvature in the compositional data, a common condition. The classification of specimens of igneous rocks on chemical criteria is an area of prime endeavour among modern petrologists. Currently used

computer-oriented methods are open to serious criticisms, some of which are so severely damaging as to render many a result invalid.

Another useful aspect of the monograph concerns the technicalities of operations on the elements of the simplex. Special attention is paid to the formation of subcompositions from compositions and to the operation of perturbation within the simplex. The useful concept of perturbation in closed data has, for example, been applied in genetics for charting changes in the proportions of genotypes, before and after selection on them.

Aitchison’s book is clearly written and enhanced by the inclusion of problems for solution at the end of every chapter. It is absolutely indispensable to any chemometrician interested in correctly analyzing his data.

Reference
1 F. Chayes, Ratio Correlation, University of Chicago Press, Chicago, London, 1971, VIII + 99 pp.

R.A. REYMENT
Paleontologiska Institutionen, Box 558, S-751 22, Uppsala, Sweden

Trends in Analytical Chemistry, Vol. 6, No. 10: special issue “Chemometrics”
Elsevier Science Publishers, Amsterdam, 1987, price US$10.00, Dfl. 20.00

The editorial, written by guest editor Dr. B.G.M. Vandeginste, outlines the motive for this special issue, which arose, in particular, from the 1987 Pittsburgh conference. Three out of the four speakers at the chemometrics session contribute to this issue. A note from the editorial office outlines the publishers’ philosophy that chemometrics is more than merely the application of statistical methods to chemical data. Many of the recent and forthcoming chemometrics meetings such as CAC 88, SCA, COMPANA 88 and the Pittsburgh LIMS meeting are well covered. There is an article on the chemometrics society by the outgoing president and several other items of general interest. The four books reviewed reflect the surprising rate at which chemometrics research monographs are being produced. A special feature of this issue is a product news section: it is certainly useful to keep analysts up to date by short announcements of new software and hardware, as it often takes a long time before scientific articles in this area appear in the literature, so this approach enables information about the rapidly changing marketplace to reach the reader relatively promptly.

The “Trends” articles, authored by leading analytical chemometricians, provide clearly written introductions as to how chemometrics is useful in modern analytical practice, and are well illustrated with figures and references. The article by Vandeginste, Klaessens and Kateman describes how LABGEN can be used to automate decision making in an analytical laboratory. The main questions asked by the laboratory manager are outlined, together with their possible solutions, given the need for efficient laboratory systems. Programming approaches including expert systems are briefly compared and an approach to simulation of a modern analytical laboratory is described. An article by Brown discusses systems theory. De Smet and Massart describe how information theory, cluster analysis and expert systems combine to help automate HPLC method selection for the analysis of drugs. Frank describes various regression methods, and Chretien provides an overview of how chemometrics methods are used in chromatography. In the “Computer Corner” a program for varimax rotation in factor analysis is described by Forina, Lanteri and Leardi.

This issue of Trends in Analytical Chemistry is certainly likely to convince both the manager of an analytical laboratory and intending students that chemometrics is an important, interesting and rapidly developing field. I do not, however, believe that the issue quite succeeds in providing a flavour of what modern chemometrics is about. Although the first and most mature developments of the subject appear to be within the field