Latent-structure decompositions (projections) of multivariate data
Original Research Paper

Chemometrics and Intelligent Laboratory Systems, 2 (1987) 283-290
Elsevier Science Publishers B.V., Amsterdam - Printed in The Netherlands

OLAV M. KVALHEIM

Department of Chemistry, University of Bergen, N-5000 Bergen (Norway)

(Received 30 March 1987; accepted 29 June 1987)
ABSTRACT

Kvalheim, O.M., 1987. Latent-structure decompositions (projections) of multivariate data. Chemometrics and Intelligent Laboratory Systems, 2: 283-290.

Several approaches to the decomposition of multivariate data arrays in terms of latent structure are developed within a common mathematical frame. The methods considered are (i) decomposition into principal components, also called singular-value decomposition, (ii) decomposition using the partial-least-squares approach and (iii) projection on to axes defined by selected variables or objects, so-called markers. Graphic display, being of major importance in interactive data exploration and classification, is discussed.

INTRODUCTION

Latent-structure decomposition (LSD) forms the basis of many approaches to multivariate data analysis [1,2]. The method most often used in chemistry is decomposition in terms of principal components [3]. To obtain factors amenable to chemical interpretation, the decomposition step is often followed by rotation, usually by means of the target-transformation technique [4], to reveal "chemical" factors that influence the data. The partial-least-squares (PLS) method [5,6] represents another approach to LSD. The PLS method has been used extensively for multivariate calibration [6] and the description of structure-activity relationships [7]. Oberrauch et al. [8] used PLS to examine the relevance of ratios commonly

used for correlation analysis in organic geochemistry. This application represents a first step toward extending the range of PLS from calibration and regression to the more general field of exploratory data analysis [9]. Recently, a method of projecting on to selected variables was developed as a means of decomposing a data matrix into either oblique or orthogonal factors [10]. The term marker was introduced to imply a variable carrying specific information about a "chemical" factor, i.e., a factor amenable to interpretation in terms of a chemical phenomenon. Varimax [11] and Promax [12] are methods for rotating a principal-component decomposition so as to obtain a simple structure [13] with respect to variables. Marker-variable projections aim, as does target transformation [4], to produce chemically interpretable factors. Thus, projecting a set of objects on to marker variables represents a means of displaying multivariate data on meaningful axes of a lower-dimensional space. These axes are generally, but not necessarily, oblique [10].

In this paper, the method of projecting on to markers is extended so as to incorporate projections on to axes defined by marker objects. The aims and results of projecting on to marker variables have been discussed previously [10]. However, projecting on to marker objects requires some further consideration. The marker-object projection (MOP) technique aims at producing a classification directly on interpretable factors without the rotation step necessary for target transformation of principal-component [4] or PLS [9] decompositions. Similarly to the PLS [5,6] and marker-variable projections [10], marker-object projections are obtained directly from the matrix of raw data. However, the marker-object projection technique does not presuppose external information as does, e.g., PLS. Samples influenced by different chemical factors can be revealed by straightforward calculation of scalar products, and only two matrix multiplications are necessary to determine scores and loadings for each projection.

This work is organized as follows. First, a brief description of projections of multivariate data in general is given to shed light on the connection between the various approaches to latent-structure decomposition (LSD). Then it is shown that PLS, marker-variable projections, marker-object projections and principal-component analysis can be developed within a uniform mathematical frame, providing a very simple interpretation of these methods in factor-analytical terms. Finally, the marker-object projection technique is investigated and its performance is compared with those of other methods. An appendix on graphical display is included.

Fig. 1. Projecting an object vector x_k on a latent variable w_a. Variables 1 and 2 are original variables; unit vectors are {e_1, e_2}.

PROJECTING MULTIVARIATE DATA ON TO A SELECTED AXIS

Fig. 1 shows an object k characterized by two variables. The location of the object in the two-dimensional variable space spanned by the two variable axes is determined by the object vector:

x_k = x_k1 e_1 + x_k2 e_2    (1)

where {e_1, e_2} are the unit vectors along the two axes {1, 2} and {x_ki, i = 1, 2} are the coordinates of the object on the axes. The coordinates of object k can be written as a row vector, i.e., (x_k1, x_k2). Also shown in Fig. 1 is an axis with unit vector w_a. In the same way as for object k, we can represent w_a as a linear combination of the two unit vectors {e_1, e_2}, i.e.,

w_a = w_a1 e_1 + w_a2 e_2    (2)

The object k is projected on the axis determined by the vector w_a. The coordinate t_ka of this object on w_a is the scalar product of x_k and w_a:

t_ka = x_k · w_a'    (3)

The projections of N objects on the axis spanned by w_a can be presented in a column vector of scores t_a on the latent variable w_a, i.e., t_a = X w_a'.
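As a concrete illustration of eqs. 1-3, the NumPy sketch below projects a few objects (rows of X) on a selected unit axis; the data values and the axis direction are invented for the example, not taken from the paper.

```python
import numpy as np

# Objects as rows of X, each described by two variables (eq. 1).
# Data and axis direction are invented for illustration.
X = np.array([[2.0, 1.0],
              [1.0, 3.0],
              [0.5, 0.5]])

# A latent-variable axis w_a, normalized so that ||w_a|| = 1 (eq. 2).
w_a = np.array([1.0, 1.0])
w_a = w_a / np.linalg.norm(w_a)

# Scores: scalar products t_ka = x_k . w_a (eq. 3), or t_a = X w_a'.
t_a = X @ w_a   # equals [3, 4, 1] / sqrt(2) for this data
```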

LATENT-STRUCTURE DECOMPOSITIONS

The relationship

X = Σ(a=1 to A) t_a p_a + E    (4)

represents a decomposition of a data matrix X into A columns of orthogonal score vectors {t_a} and A rows of orthogonal (principal components) or oblique (PLS) loading vectors {p_a} [14]. Decomposition into principal components represents an ordinary least-squares solution which minimizes the residuals E. Other criteria are available, however, e.g., using the covariance ("overlap") between X and another variable block Y as in partial-least-squares (PLS) and canonical correlation [15] decomposition to obtain more "relevant components" [6], or using an a priori factor criterion as in marker-variable projection [10]. Each row of X corresponds to one of N object row vectors. We can set up the following general algorithm for the decomposition of X by successive projections:

Define X_1 = X. Repeat for a = 1, 2, ..., A (A = number of dimensions extracted):
1. Select a latent variable in variable space, i.e., a row vector of coordinates w_a; ||w_a|| = 1.
2. Project the objects on w_a, i.e., calculate the column vector of scores: t_a = X_a w_a'.
3. Calculate the row vector of (structure [16]) loadings p_a expressing the covariances between the latent variable and the original variables, i.e.,

p_a = t_a' X_a / (t_a' t_a)    (5)

Eq. 5 can be used to calculate (structure) loadings both for orthogonal (follows directly from eq. 4) and non-orthogonal score vectors. For orthogonal score vectors a fourth step is necessary:
4. Remove the dimension of X_a associated with w_a: X_(a+1) = X_a - t_a p_a.

Note that the decomposition formula, eq. 4, holds only for orthogonal scores. Different decompositions of X can be selected by different choices of {w_a}. Four different methods for direct decomposition are of interest for this work:
1. Principal-component decomposition: w_a = p_a / ||p_a||, where p_a is the (converged) loading vector for each dimension.
2. Partial-least-squares decomposition [5,6,17,18]: w_a = u_a' X_a / ||u_a' X_a||, where u_a are the (converged) latent-variable column vectors for the external (dependent, predicted) variable block (equal to y with one predicted variable only [17]).
3. Marker-variable projection (MVP) [10]: w_a = e_i, where variable i is projected on.
4. Marker-object projection (MOP): w_a = x_k / ||x_k||, where object k is projected on.

From the projection algorithm above and the discussion following Fig. 1, note that the PLS weightings {w_a} (ref. 6, p. 209) are coordinates of the latent variables on the original variable axes. The (oblique) PLS loadings {p_a} (ref. 6, p. 210) are proportional to covariances (structure loadings; see, e.g., refs. 10 and 16 and Appendix) between the latent variables and the original variables (see eq. 5). Principal components represent the special case of the general projection method defined above where weightings and normalized loadings are identical. For this case, the row loading vectors defined in eq. 4 are orthogonal. Depending on the choice of representation, marker projections can be performed with or without the orthogonalization step in the successive projection algorithm. Orthogonalization is necessary if scores and loadings are to obey eq. 4, but representation on chemically interpretable factors is straightforward also without orthogonalization (see Appendix).
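The successive projection algorithm above (steps 1-4) can be sketched directly in NumPy. The sketch below is a minimal reading of the algorithm under stated assumptions: the function names `successive_projections` and `mop_axis` are my own illustrative choices, `choose_axis` stands in for the method-specific selection of w_a, and `mop_axis` illustrates the marker-object choice w_a = x_k / ||x_k||.

```python
import numpy as np

def successive_projections(X, choose_axis, A):
    """Decompose X by the successive projection algorithm (steps 1-4).

    choose_axis(X_a) must return a unit row vector w_a in variable space;
    different choices of w_a give PCA, PLS, MVP or MOP decompositions.
    Returns scores T (N x A) and structure loadings P (A x M) such that
    X ~ T @ P + E (eq. 4; exact bookkeeping requires orthogonal scores).
    """
    X_a = X.astype(float).copy()
    scores, loadings = [], []
    for a in range(A):
        w_a = choose_axis(X_a)                # step 1: select latent variable
        t_a = X_a @ w_a                       # step 2: scores, t_a = X_a w_a'
        p_a = (t_a @ X_a) / (t_a @ t_a)       # step 3: structure loadings, eq. 5
        X_a = X_a - np.outer(t_a, p_a)        # step 4: deflate X_a
        scores.append(t_a)
        loadings.append(p_a)
    return np.column_stack(scores), np.vstack(loadings)

def mop_axis(k):
    """Marker-object choice of axis: w_a = x_k / ||x_k|| (method 4 above)."""
    def choose(X_a):
        x_k = X_a[k]
        return x_k / np.linalg.norm(x_k)
    return choose
```

With this deflation each new score vector comes out orthogonal to the previous ones, so eq. 4 holds; substituting a principal-component or PLS choice of axis into `choose_axis` would correspond to the other decompositions in the list above.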

MARKER-OBJECT PROJECTIONS

Each object can be represented as a point in variable space. The collection of objects defines what is called a point representation or point configuration [13] in this vector space. Two objects span a plane in variable space if not exactly similar to each other with respect to the measured variables. Likewise, three or more dissimilar objects span a hyperplane. The other objects can be projected as points on such planes or hyperplanes. The object scores on the latent variable defined by object k are given by

t_l = x_l · x_k' / ||x_k||,    l = 1, 2, ..., N    (6)
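Eq. 6 is a single matrix-vector product; a further division by each object's own norm gives the cosine association measure discussed in the text. A small NumPy sketch with invented data:

```python
import numpy as np

# Objects (rows) in a three-variable space; invented data.
X = np.array([[1.0, 0.0, 1.0],
              [2.0, 0.1, 2.1],
              [0.0, 3.0, 0.2]])

k = 0                                  # index of the marker object
x_k = X[k]

# Eq. 6: scores of all objects on the axis defined by object k.
t = X @ x_k / np.linalg.norm(x_k)

# Dividing each score by the object's own norm gives the cosine
# association between each object and the marker.
cos_assoc = t / np.linalg.norm(X, axis=1)
```

Here the marker has association 1 with itself, the nearly parallel second object scores close to 1, and the dissimilar third object scores close to 0.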


Fig. 2. (a) Marker-object score plot. Data from ref. 10. Marker objects were selected by using information about the geographical location of oil wells. The different symbols represent samples from five different oil fields. (b) Samples plotted on axes obtained by target transformation of principal components (from ref. 10). Symbols as in (a). (c) Samples plotted on axes determined by the PLS canonical variate approach [14]. Symbols as in (a).


Thus, the similarity between the marker k and the other objects is quantitatively determined from the scalar product above. The scores on the latent variable defined by object k differ only in normalization from the cosine association measure introduced by Imbrie and Purdy [19,20], i.e., division by ||x_l|| in eq. 6 gives this similarity measure directly (for percentage data, division by 100 gives the same similarity measure, as normalization is inherent in such data). This observation suggests several interesting applications. By projecting on to objects representing preconceived classes, classification of objects in terms of "real" factors can be obtained. These factors may be oblique combinations of "pure" factors, i.e., providing the possibility of direct extraction of higher order factors [21,22]. Fig. 2a shows a marker-object score plot based on the triterpane distribution (percentages) [10] of 35 crude oils from five different North Sea oil fields. The oils are projected on to axes defined by two oil samples randomly selected among the samples from the two fields differing most in depositional environment and thermal maturity of the source [10], one sample from each of the two fields (classes). These two samples can easily be identified in the marker-object plot (Fig. 2a), as each one will lie on the axis it spans. Comparison with the score plot on chemically interpretable factors (Fig. 2b) obtained by target transformation of a principal-component decomposition [10] reveals that the "geographically" selected marker-object projections separate samples from two of the oil fields (samples symbolized by filled and unfilled triangles) completely, whereas no separation was observed using the target-transformation technique. An oblique loading plot might have revealed the variables responsible for this separation and thus served as a basis for a "geological" interpretation. Such an interpretation is outside the scope of this work, but it is worth mentioning that each of these higher order "geological" correlated factors represents a combination of source-related factors, mainly the influence of depositional environment and thermal maturity, factors which are often correlated in an experimental sample [10].

Fig. 2c shows the classification on the first two factors obtained using the PLS canonical variate technique [14], which has a similar aim to that of canonical variate analysis [15]. This projection is obtained by using information about the samples' field belonging as external variables in PLS. This information is coded as a block of binary variables, one variable for each of the five oil fields, with values equal to one for samples belonging to the oil field represented by that variable, otherwise zero. Although the projection on the PLS canonical variates is similar to that obtained using the marker-object projection technique (Fig. 2a), the grouping of the samples in accordance with field belongings is much less distinct. For a quantitative comparison of the methods, the factors displayed in Fig. 2a-c were expressed as linear combinations of the principal-component decomposition of the oil triterpane distributions. Table 1 gives the contributions along the most dominant principal components, and shows that the PLS canonical variates represent an almost orthogonal rotation of the first two principal components. These factors were also just as difficult to interpret geochemically as were the principal components [10].

TABLE 1

Contributions from principal components to factors obtained by means of (a) marker-object projections (MOP), (b) target transformation of principal components (T-T) and (c) PLS canonical variates (PLSCV)

The expansion coefficients are given for the three principal components with largest variance (cumulative variance 92.7%) obtained from the triterpane distributions of 35 North Sea oils [10].

PC/factor No.   MOP 1    MOP 2    T-T 1    T-T 2    PLSCV 1   PLSCV 2
1                0.169    0.823    0.547    0.576    0.825    -0.560
2               -0.985    0.457   -0.740    0.300   -0.560    -0.828
3                0.018    0.334   -0.141    0.714    0.077    -0.022

The factors obtained by target transformation of the principal components and by marker-object projections are similar, although not completely congruent. Although marker-object projection has different aims, i.e., data exploration and classification, it is conceptually very similar to Hruschka and Norris's regression [23], where a subset of samples is selected so as to span the independent variation in a set of calibration samples. In both cases, samples are used to define loadings, for Hruschka and Norris regression directly, for marker-object projections indirectly through a projection. Further, both cases imply that samples with mutual cosine association measures close to zero are selected, a feature also used in polar ordination in ecology [24]. A strong feature of the marker-object projection technique is that the method can be used without presupposing external information as do, e.g., PLS and canonical variate analysis. Dissimilar samples can either be selected directly by calculation of scalar products, or through a screening procedure using principal-component analysis or cluster techniques such as fuzzy clustering to reveal marker-object candidates. Unique samples, although they will be revealed as outliers after projection and then can be rejected, should be avoided. If there are several samples spanning the same direction, a good choice is to use that with the maximum norm. With preconceived classes, a compromise between closeness to the centre of gravity of a class and the maximum-norm criterion should provide good marker objects. Projection on to objects may be useful also in problems related to "masked" correlations and unmixing, e.g., quantitative determination of input from different sources in oil-source correlation and for the determination of the composition of mixtures when pure standards are available.
Lastly, it should be emphasized that the marker-object projection technique is an exploratory data analytical technique and when used as such, no assumptions are required about distribution of error, etc.
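The marker-selection heuristics described above (near-zero mutual cosine association, preference for the sample with maximum norm) can be sketched as a simple greedy screen. The function name, the threshold value and the greedy strategy below are my own illustrative assumptions, not a procedure from the paper.

```python
import numpy as np

def select_markers(X, n_markers, max_cos=0.3):
    """Greedily pick marker-object candidates from the rows of X.

    Samples are tried in order of decreasing norm (maximum-norm
    preference); a sample is accepted only if its cosine association
    with every already-chosen marker is below max_cos.
    """
    norms = np.linalg.norm(X, axis=1)
    Xn = X / norms[:, None]                 # unit-length object vectors
    order = np.argsort(-norms)              # largest-norm samples first
    markers = []
    for i in order:
        # accept sample i only if nearly orthogonal to chosen markers
        if all(abs(Xn[i] @ Xn[j]) < max_cos for j in markers):
            markers.append(i)
        if len(markers) == n_markers:
            break
    return markers
```

For data containing two roughly orthogonal groups, this picks the largest-norm representative of each group, in the spirit of the compromise discussed above.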


ACKNOWLEDGEMENTS

I am grateful to John Birks and Rolf Manne, University of Bergen, for major suggestions for improving this work. Harald Martens, Norwegian Computing Center, Oslo, is thanked for informing me about the Hruschka and Norris regression. A referee is thanked for stimulating ideas for improving the marker-object selection procedure. The Norwegian Research Council for Science and Humanities (NAVF) is thanked for financial support.

APPENDIX

Graphical interpretation

Two graphical displays are of major importance in multivariate data analysis: score plots showing the relationship between samples and loading plots showing the relationship between variables. The following expressions define the coordinate system for any of the four decompositions discussed above:

cos φ_ab = w_a · w_b'    (7a)

cos ν_ab = p*_a · p*_b' / (||p*_a|| ||p*_b||)    (7b)

where {φ_ab} and {ν_ab} are the angles between axes a and b in score plots and loading plots, respectively. The cosines defined by eqs. 7a and 7b are elements of the A × A matrices Φ and Θ, where A is the number of latent variables. Coordinate systems for score and loading plots for markers and PLS components are now uniquely defined. From eqs. 7a and 7b it follows that decomposition of a data array into principal or PLS components leads to orthogonal axes for both scores and loadings. Decomposition on to marker variables or marker objects, however, may require oblique axes if we do not orthogonalize after each projection. This will often be the case if chemically interpretable factors are wanted. The last section above provides an example of such a case.

Fig. 3a shows how to position an object in a marker-object score plot either by use of the calculated projections (eq. 3) on oblique axes or by use of object coordinates. The connection between projections T and coordinates U for a set of objects is given through the matrix relationship

U = Φ^(-1) T    (8a)

As shown previously [10], marker projection loadings are equivalent to structure loadings in factor analysis. Fig. 3b shows both the structure loadings and the coordinates or pattern loadings [16] for two latent variables with correlation given by eq. 7b. The connection between structure and pattern loadings is defined through a matrix relationship similar to eq. 8a:

Q = Θ^(-1) P*    (8b)

where Q and P* are the matrices of pattern and structure loadings, respectively. The asterisk on P implies variance-weighted loadings [10], i.e., P* consists of the row vectors {p*_a = ||t_a|| p_a, a = 1, 2, ..., A}. One can also obtain these graphical oblique representations for markers indirectly from an orthogonal decomposition. Scores and loadings are then plotted on the orthogonal axes, while the oblique axes are obtained from scalar products between orthogonal and oblique basis sets of latent variables.

Fig. 3. (a) Relationship between scores {t_ka, t_kb} and coordinates {u_ka, u_kb} for object k projected on the latent variables {w_a, w_b}. The angle φ_ab between the axes {a, b} is calculated from eq. 7a. (b) Relationship between structure loadings {p*_ai, p*_bi} and pattern loadings {u_ai, u_bi} for variable i on the latent variables {w_a, w_b}. The angle ν_ab between the axes {a, b} is calculated from eq. 7b.

REFERENCES

1. P. Horst, Factor Analysis of Data Matrices, Holt, Rinehart and Winston, New York, Chicago, London, 1964.
2. K.G. Jöreskog, J.E. Klovan and R.A. Reyment, Geological Factor Analysis, Elsevier, Amsterdam, 1976.
3. H. Hotelling, Analysis of a complex of statistical variables into principal components, Journal of Educational Psychology, 24 (1933) 417-441, 498-520.
4. E.R. Malinowski and D.G. Howery, Factor Analysis in Chemistry, Wiley, New York, 1980.
5. S. Wold, A. Ruhe, H. Wold and W.J. Dunn III, The collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses, SIAM Journal on Scientific and Statistical Computing, 5 (1984) 735-743.
6. H. Martens, Multivariate Calibration - Quantitative Interpretation of Non-selective Chemical Data, Dr. Techn. Thesis, Technical University of Norway, Trondheim, 1985.
7. S. Hellberg, A Multivariate Approach to QSAR, Fil. Dr. Thesis, University of Umeå, Umeå, 1986.
8. E. Oberrauch, T. Salvatori, L. Novelli and S. Clementi, Oils of the Po basin: a chemometric-geochemical study, Chemometrics and Intelligent Laboratory Systems, 2 (1987) 137-147.
9. O.M. Kvalheim, A partial-least-squares approach to interpretative analysis of multivariate data, Chemometrics and Intelligent Laboratory Systems, 3 (1988) in press.
10. O.M. Kvalheim and N. Telnaes, Visualizing information in multivariate data: Applications to petroleum geochemistry. Part 1. Projection methods; Part 2. Interpretation and correlation of North Sea oils using three different biomarker fractions, Analytica Chimica Acta, 191 (1986) 87-110.
11. H.F. Kaiser, The Varimax criterion for analytic rotation in factor analysis, Psychometrika, 23 (1958) 187-200.
12. A.E. Hendrickson and P.O. White, Promax: A quick method for rotation to oblique simple structure, British Journal of Statistical Psychology, 17 (1964) 65-70.
13. L.L. Thurstone, Multiple-Factor Analysis, University of Chicago Press, Chicago, London, 1947.
14. S. Wold, C. Albano, W.J. Dunn III, U. Edlund, K. Esbensen, P. Geladi, S. Hellberg, E. Johansson, W. Lindberg and M. Sjöström, Multivariate data analysis in chemistry, in B.R. Kowalski (Editor), Chemometrics - Mathematics and Statistics in Chemistry, Reidel, Dordrecht, 1984, pp. 17-95.
15. R.A. Reyment, R.E. Blackith and N.A. Campbell, Multivariate Morphometrics, Academic Press, London, 2nd ed., 1984, 233 pp.
16. H.H. Harman, Modern Factor Analysis, University of Chicago Press, Chicago, 2nd ed., 1967, pp. 249-272.
17. R. Manne, Analysis of two partial-least-squares algorithms for multivariate calibration, Chemometrics and Intelligent Laboratory Systems, 2 (1987) 187-197.
18. I.S. Helland, On the Structure of Partial Least Squares Regression, Reports from the Department of Mathematics and Statistics, Agricultural University of Norway, No. 21, 1986, 44 pp.
19. J. Imbrie and E. Purdy, Classification of modern Bahamian carbonate sediments, in Classification of Carbonate Rocks, American Association of Petroleum Geologists, Memoir 7, 1962, pp. 253-272.
20. K.G. Jöreskog, J.E. Klovan and R.A. Reyment, Geological Factor Analysis, Elsevier, Amsterdam, 1976, pp. 16 and 124.
21. L.L. Thurstone, Multiple-Factor Analysis, University of Chicago Press, Chicago, London, 1947, Ch. 18.
22. R.J. Rummel, Applied Factor Analysis, Northwestern University Press, Evanston, IL, 1970, pp. 423-432.
23. N. Hruschka and K. Norris, Least squares curve fitting of near infrared spectra predicts protein and moisture content of ground wheat, Applied Spectroscopy, 36 (1982) 261-265.
24. G. Cottam, F.G. Goff and R.H. Whittaker, Wisconsin comparative ordination, in R.H. Whittaker (Editor), Ordination of Plant Communities, Dr Junk, The Hague, 1978, pp. 185-213.

44 PP. J. Imbrie and E. Purdy, Classification of modern Bahamian carbonate sediments, in Classification of Corbonate Rocks, American Association of Petroleum Geologists, Memoir 7, 1962, pp. 253-272. K.G. Jiireskog, J.E. Klovan and R.A. Reyment, Geological Factor Analysis, Elsevier, Amsterdam, 1976, pp. 16 and 124. L.L. Thurstone, Multiple-Factor Analysis, University of Chicago Press, Chicago, London, 1947, Ch. 18. R.J. Rummel, Applied Factor Analysis, Northwestern University Press, Evanston, IL, 1970, pp. 423-432. N. Hruschka and K. Norris, Least squares curve fitting of near infrared spectra predicts protein and moisture content of ground wheat, Applied Spectroscopy, 36 (1982) 261-265. G. Cottam, F.G. Goff and R.H. Whittaker, Wisconsin comparative ordination, in R.H. Whittaker (Editor), Ordination of Plant Communities, Dr Junk, The Hague, 1978, pp. 185-213.