A New Beginning. . .

A New Beginning. . .

A New Beginning... Why do we title this chapter "A New B e g i n n i n g . . . "? Well, there are a lot of reasons. First of all, of course, is the s...

433KB Sizes 6 Downloads 171 Views

A New Beginning...

Why do we title this chapter "A New B e g i n n i n g . . . "? Well, there are a lot of reasons. First of all, of course, is the simple fact that that is just the way we do things. Secondly, is the fact that we developed this book in much the same way we developed our previous book Statistics in Spectroscopy (SIS). Those of you out there who have followed the series of articles published in Spectroscopy magazine since 1986 know that for the most part, each column in the series was pretty much self-contained and could stand alone, yet also fit into that series in the appropriate place and contributed to the flow of information in that series as a whole. We hope to be able to reproduce that on a larger scale. Just as the series Statistics in Spectroscopy (this is too long to write out each time, from here on we will abbreviate it SiS) was self-contained and stood alone, so too will we try to make this new series stand alone, and at the same time be a worthy successor to SiS, and also continue to develop the concepts we began there. Thirdly is the fact that we are finally starting to write again. To you, our readership, it may seem like we have been writing continuously since we began SiS, but in fact we have been running on backlog for a longer time than you would believe. That was advantageous in that it allowed us time to pursue our personal and professional lives including such other projects as arranging for SiS to be published as a book [1 ]. The downside of our getting ahead of ourselves, on the other hand, is that we were not able to keep you abreast on the latest developments related to our favorite topic. However, since the last time we actually wrote something, there have been a number of noteworthy developments. Our last series dealt only with the elementary concepts of statistics related to the general practice of calibration used for UV-VIS-NIR and occasionally for IR spectroscopy. Our purpose in writing SiS was to help provide a small foot bridge to cross the gap between specialized chemometrics literature written at the expert level and those general statistics articles and texts dealing with examples and questions far removed from chemistry or spectroscopic practice. Since the beginning of the "Statistics" series in 1986, several reviews, tutorials, and textbooks have been published to begin the construction of a major highway bridging this gap. Most notably, at least in our minds, have been tutorial articles on classical least squares (CLS), principal components regression (PCR), and partial least squares regression (PLSR) by Haaland and Thomas [2, 3]. Other important work includes textbooks on calibration and chemometrics by Naes and Martens [4], and Mark [5]. Chemometric reviews discussing the progress of tutorial and textbook literature appear regularly in Analytical Chemistry, Critical Review issues. Another recent series of articles on chemometric concepts termed "The Chemometric Space" by Naes and Isaksson has appeared [6]. In addition, there is a North American chapter of the International Chemometrics Society (NAmICS) which we are told has

2

Chemometrics in Spectroscopy

over 300 members. Those interested in joining or obtaining further information may contact Professor Thomas O'Haver at the Department of Chemistry and Biochemistry, University of Maryland, College Park, Maryland 20742 (Donald B. Dahlberg, 1993, personal communication). All the foregoing was true as of when the Chemometrics column began in 1993. Now in 2006, when we are preparing this for book publication, there are many more sources of information about Chemometrics. However, since this is not a review of the field, we forebear to list them all, but will correct one item that has changed since then: to obtain information about NAmlCS, or to join the discussion group, contact David Duewer at NIST ([email protected])) or send a message to the discussion group (ICSL @LISTSERV.UMD.EDU). Finally, since imitation is the sincerest form of flattery (or so they tell us), we are pleased to see that others have also taken the route of printing longer tutorial discussions in the form of a series of related articles on a given topic. Two series that we have no qualms recommending, on topics related to ours, have appeared in some of the sister publications of Spectroscopy [7-15] (note: there have been recent indications that the series in Spectroscopy International has continued beyond the ones we have listed. If we can obtain more information we will keep you posted - Spectroscopy International has also undergone some transformations and it is not always easy to get copies). So, overall the chemometrics bridge between the lands of the overly simplistic and severely complex is well under construction; one may find at least a single lane open by which to pass. So why another series? Well, it is still our labor of love to deal with specific issues that plague ourselves and our colleagues involved in the practice of multivariate qualitative and quantitative spectroscopic calibration. Having collectively worked with hundreds of instrument users over 25 combined years of calibration problems, we are compelled, like bees loaded with pollen, to disseminate the problems, answers, and questions brought about by these experiences. Then what would a series named "Chemometrics in Spectroscopy" hope to cover which is of interest to the readers of "Spectroscopy"? We have been taken to task (with perhaps some justice) for using the broader title label "Chemometrics in Spectroscopy" for what we have claimed will be discussions of the somewhat narrower range of topics included in the field of multivariate statistical algorithms applied to chemical problems, when the term "Chemometrics" actually applies to a much wider range of topics. Nevertheless, we will use this title, for a number of reasons. First, that is what we said we were going to do, and we hate to not follow through, even on such a minor point. Secondly, we have said before (with all due arrogance) that this is our column, and we have been pretty fortunate that the editors of Spectroscopy have always pretty much let us do as we please. Finally, at this point we consider the possibility that we may very well eventually extend our range to include some of these other topics that the broader term will cover. As of fight now, some of the topics we foresee being able to expand upon over the series will include, but not be limited to • The multivariate normal distribution • Defining the bounds for a data set

A New Beginning...

3

• The concept of Mahalanobis distance • Discriminant analysis and its subtopics of - Sample selection - Spectral matching (Qualitative analysis) • Finding the maximum variance in the multivariate distribution • Matrix algebra refresher • Analytic geometry refresher • Principal components analysis (PCA) • Principal components regression (PCR) • More on Multiple linear least squares regression (MLLSR), also known as Multiple linear regression (MLR) and P-matrix, and its sibling, K-matrix • More on Simple linear least squares regression (SLLSR), also known as Simple least squares regression (SLSR) or univariate least squares regression • Partial least squares regression (PLSR) • Validation of calibration models • Laboratory data and assessing error • Diagnosis of data problems • An attempt to standardize statistical/chemometric terms • Special calibration problems (and solutions) • The concept of outliers: theory and practice • Standardization concepts and methods for transfer of calibrations • Collaborative study problems related to methods and instruments. We also plan to include in the discussions the important statistical concepts, such as correlation, bias, slope, and associated errors and confidence limits. Beyond this, it is also our hope that readers will write to us with their comments or suggestions for chemometric challenges which confront them. If time and energy permit, we may be able to discuss such issues as neural networks, general factor analysis, clustering techniques, maximizing graphical presentation of data, and signal processing.

THE

MULTIVARIATE

NORMAL

DISTRIBUTION

We will begin with the concept of the multivariate normal distribution. Think of a cigar, suspended in space. If you cannot think of a cigar suspended in space, look at Figure 1-1a. Now imagine the cigar filled with little flecks of stuff, as in Figure 1-lb (it does not really matter what the stuff is, mathematics never concerned itself with such unimportant details). Imagine the flecks being more densely packed toward the middle of the cigar. Now imagine a swarm of gnats surrounding the cigar; if they are attracted to the cigar, then naturally there will be fewer of them far away from the cigar than close to it (Figure 1-1 c). Next take away the cigar, and just leave the flecks and the gnats. By this time, of course, you should realize that the flecks and the gnats are really the same thing, and are neither flecks nor gnats but simply abstract representations of points in space. What is left looks like Figure 1-1d.

4

Chemometrics in Spectroscopy (a)

(b)

/

/ (0)

(c)

I

• •

,

i.i

"

• .



.'.."



, . .





:



..'.

.





• . . .

..-',.-'. 7...''..

,

"

:7 ;!)iii:::i: 3

.:~~:,

"

.

Figure 1-1 Development of the concept of the Multivariate Normal Distribution (this one shown having three dimensions) - see text for details. The density of points along a cross-section of the distribution in any direction is also an MND, of lower dimension. Figure 1-1d, of course, is simply a pictorial/graphical representation of what a Multivariate Normal Distribution (MND) would look like, if you could see it. Furthermore, it is a representation of only one particular MND. First of all, this particular MND is a three-dimensional MND. A two-dimensional MND will be represented by points in a plane, and a one-dimensional MND is simply the ordinary Normal distribution that we have come to know and love [16]. An MND can have any number of dimensions; unfortunately we humans cannot visualize anything with more than three dimensions, so for our examples we are limited to such pictures. Also, the MND depicted has a particular shape and orientation. In general, an MND can have a variety of shapes and orientations, depending upon the dispersion of the data along the different axes. Thus, for example, it would not be uncommon for the dispersion along two of the axes to be equal and independent. In this case, which represents one limiting situation, an appropriate cross-section of the MND would be circular rather than elliptical. Another limiting situation, by the way, is for two or more of the variables to be perfectly correlated, in which case the data would lie along a straight line (or plane, or hyperplane as the corresponding higher-dimensional figure is called). Each point in the MND can be projected onto the planes defined by each pair of the axes of the coordinate system. For example, Figure 1-2 shows the projection of the data onto the plane at the "bottom" of the coordinate system. There it forms a twodimensional MND, which is characterized by several parameters, the two-dimensional MND being the prototype for all MNDs of higher dimension and the properties of this MND are the characteristics of the MND that are the key defining properties of it. First of all, the data contributing to an MND itself has a Normal distribution along any of the

A New

Beginning...

5



J



..~i ""

"?.,':.



".'.

k"

".



.

• ..........2:.i:'

.".

"

::..~-..~:...'.

.....

."





Figure 1-2 Projecting each point of the three-dimensional MND onto any of the planes defined by two axes of the coordinate system (or, more generally, a n y plane passing through the coordinate system) results in the projected points being represented by a two-dimensional MND). The correlation coefficients for the projections in all planes are needed to fully describe the original MND.

axes of the MND. We have discussed the Normal distribution previously [ 16], and have seen that it is described by the expression: x-~)

f (x)

2

ae-( 7

(1-1)

The MND can be mathematically described by an expression that is similar in form, but has the characteristic that each of the individual parts of the expression represents the multivariate analog of the corresponding part of equation 1-1. Thus, for example, where Y represents the mean of the data for which equation 1-1 describes the distribution, there is a corresponding quantity X that represents in matrix notation the fact that for each of the axes shown in Figure 1-1, each datum has a value, and therefore the collection of data has a mean value along each dimension. This quantity represented as a list of the set of means along all the different dimensions is called a vector, and is represented as X (as opposed to Y, an individual mean). If we project the MND onto each axis of the coordinate system containing the MND, then as stated above, these projections of the data will be distributed as an ordinary Normal distribution, as shown in Figure 1-3. This distribution will itself then have a standard deviation, so that another defining characteristic of the MND is the standard deviation of the projection of the MND along each axis. This must also then be represented by a vector.



•"..

•;:..:.~:-..:i-..

.. . . . . . . . . .

.

•. . . . :..:~.#~".'...... :i

/ Figure 1-3 Projecting the points onto a line results in a point density that is our familiar Normal Distribution.

6

Chemometrics in Spectroscopy

The final key point to note about the MND, which can also be seen from Figure 1-2, is the fact that when the MND is projected onto the plane defined by any two axes of the coordinate system the data may show some correlation (as does the data in Figure 1-2). In fact, the projection onto any of the planes defined by two of the axes will have some value for the correlation coefficient between the corresponding pair of variables. The amount of correlation between projections along any pair of axis can vary from zero, in which case the data would lie in a circular blob, to unity, in which case the data would all lie exactly on a straight line. Since each pair of axes define another plane, many such projections may be possible, depending on the number of dimensions in which the MND exists. Indeed, every possible pair of axes in the coordinate system defines such a plane. As we have shown, we mere mortals cannot visualize more than three dimensions, as so our examples and diagrams will be limited to showing data in three or lesser dimensions, but the mathematical descriptions can be extended with all generality, to as high dimensionality as might be needed. Thus, the full description of the MND must include all the correlations of the data between every pair of axes. This is conventionally done by creating what is known as the correlation matrix. This matrix is a square matrix, in which any given row or column corresponds to a variable, and the individual positions (i.e., the m, n position for example, where m and n represent indices of the variables) in the matrix represent the correlation between the variable represented by the row it lies in and the variable represented by the column it lies in. In actuality, for mathematical reasons, the correlation itself is not used, but rather the related quantity the covariance replaces the correlation coefficient in the matrix. The elements of the matrix that lie along what is called the main diagonal (i.e., where the column and row numbers are the same) are then the variances (the square of the standard deviation- this shows that there is a rather close relationship between the standard deviation and the correlation) of the data. This matrix is thus called the variance-covariance matrix, and sometimes just the covariance matrix for simplicity. Since it is necessary to represent the various quantities by vectors and matrices, the operations for the MND that correspond to operations using the univariate (simple) Normal distribution must be matrix operations. Discussion of matrix operations is beyond the scope of this column, but for now it suffices to note that the simple arithmetic operations of addition, subtraction, multiplication, and division all have their matrix counterparts. In addition, certain matrix operations exist which do not have counterparts in simple arithmetic. The beauty of the scheme is that many manipulations of data using matrix operations can be done using the same formalism as for simple arithmetic, since when they are expressed in matrix notation, they follow corresponding rules. However, there is one major exception to this: the commutative rule, whereby for simple arithmetic: A (operation) B = B (operation) A e.g.: A + B = B + A A-B=B-A does not hold true for matrix multiplication: AxBCBxA

A New Beginning...

7

That is because of the way matrix multiplication is defined. Thus, for this case the order of appearance of the two matrices to be multiplied may provide different matrices as the answer. Thus, instead of f ( x ) and the expression for it in equation 1-1 describing the simple Normal distribution, the MND is described by the corresponding multivariate expression (1-2): f (X) - Ke -(x-~)rA(X-X)

(1-2)

where now the capital letters X and K represent vectors, and the capital letter A represents the covariance matrix. This is, by the way, a somewhat straightforward extension of the definition (although it may not seem so at first glance) because for the simple univariate case the matrix A degenerates into the number 1, X becomes x, and thus the exponent becomes simply the square of x - 2. Most texts dealing with multivariate statistics have a section on the MND, but a particularly good one, if a bit heavy on the math, is the discussion by Anderson [17]. To help with this a bit, our next few chapters will include a review of some of the elementary concepts of matrix algebra. Another very useful series of chemometric related articles has been written by David Coleman and Lynn Vanatta. Their series is on the subject of regression analysis. It has appeared in American Laboratory in a set of over twenty-five articles. Copies of the back articles are available on the web at the URL address found in reference [ 18].

REFERENCES 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14.

Mark, H. and Workman, J., Statistics in Spectroscopy (Academic Press, Boston, 1991) Haaland, D. and Thomas, E., Analytical Chemistry 60, 1193-1202 (1988). Haaland, D. and Thomas, E., Analytical Chemistry 60, 1202-1208 (1988). Naes, T. and Martens, H., Multivariate Calibration (John Wiley & Sons, New York, 1989). Mark, H., Principles and Practice of Spectroscopic Calibration (John Wiley & Sons, New York, 1991). Naes, T. and Isaksson, T., "The Chemometric Space", NIR News (PO Box 10, Selsey, Chichester, West Sussex, PO20 9HR, UK, 1992). Bonate, P.L., "Concepts in Calibration Theory", LC/GC, 10(4), 310-314 (1992). Bonate, P.L., "Concepts in Calibration Theory", LC/GC, 10(5), 378-379 (1992). Bonate, P.L., "Concepts in Calibration Theory", LC/GC, 10(6), 448-450 (1992). Bonate, P.L., "Concepts in Calibration Theory", LC/GC, 10(7), 531-532 (1992). Miller, J.N., "Calibration Methods in Spectroscopy", Spectroscopy International 3(2), 42-44 (1991). Miller, J.N., "Calibration Methods in Spectroscopy", Spectroscopy International 3(4), 41-43 (1991). Miller, J.N., "Calibration Methods in Spectroscopy", Spectroscopy International 3(5), 43-46 (1991). Miller, J.N., "Calibration Methods in Spectroscopy", Spectroscopy International 3(6), 45-47 (1991).

Chemometrics in Spectroscopy

15. Miller, J.N., "Calibration Methods in Spectroscopy", Spectroscopy International 4(1), 41-43 (1992). 16. Mark, H. and Workman, J., "Statistics in Spectroscopy- Part 6 - The Normal Distribution", Spectroscopy 2(9), 37-44 (1987). 17. Anderson, T.W., An Introduction to Multivariate Statistical Analysis (Wiley, New York, 1958). 18. Coleman, D. and Vanatta, L., Statistics in Analytical Chemistry, International Scientific Communications, Inc. found at http://www.iscpubs.com/articles/index.php?2.