Original Research Paper
213
n
Chemometrics and Intelligent Laboratory Systems, 7 (1990) 273-219 Elsevier Science Publishers B.V., Amsterdam - Printed in The Netherlands
A New Method for the Analysis of Correlated Data Using Procrustes Rotation which is Suitable for Spectral Analysis MIKAEL KUBISTA
Department of Physical Chemistry, Chalmers University of Technology, S-412 96 Gothenburg (Received
15 April 1989; accepted
(Sweden)
3 August 1989)
ABSTRACT
Kubista, M., 1990. A new method for the analysis of correlated data using Procrustes rotation which is suitable for spectral analysis. Chemometrics and Intelligent Laboratov Systems, I: 273-219. A new approach is presented for the analysis of correlated data sets. It is based on a singular value decomposition of the data and the relation of the sets through Procrustes rotation. The approach is especially suitable for spectral analysis. If two related spectra are recorded on each sample, the components in the samples can be identified. In addition, their responses to the two measurements, their relative concentrations, and the correlation between their responses in the two measurements can be determined. The requirement is that the response of each component to the two spectra has the same intensity distribution but different magnitude. This is usually the case for absorption, fluorescence excitation, circular dichroism, linear dichroism, and magnetic linear dichroism spectra and any two of these can be used in the analysis. In addition, the method can be used to analyze correlated time resolved measurements.
INTRODUCTION
Spectroscopic methods are frequently used to characterize and quantify components in mixtures. The spectra are usually easy to interpret, since one often has a good idea of the spectroscopic properties of the species involved. However, in some cases the molecules studied are involved in complex equilibria such as tautomerism, pH equilibria, dimerization and oligomerization, solvent complexes, complexes with other species present, etc. For such systems a direct spectral analysis is difficult since the number of components contrib0169-7439/90/$03.50
0 1990 Elsevier Science Publishers B.V.
uting to the spectrum and their spectral profiles are usually unknown. If the relative concentrations of the components can be shifted by changing factors that do not alter the individual spectral profiles, information theory can be used to estimate the number of species present as well as the overall similarities between the different spectra [l-3]. The recorded spectra are organized in a matrix which is analyzed in terms of eigenvalues and eigenvectors. The eigenvalues are proportional to the contributions of the corresponding eigenvectors to the data set and can usually be used to deduce the number of physically relevant eigen-
n
Chemometrics
and Intelligent
Laboratory
Systems
214
vectors that describe the data matrix. This value corresponds to the number of components that significantly contribute to the spectra and the eigenvectors are linear combinations of their spectral profiles. The remaining eigenvectors essentially contain noise. Without any knowledge of the correlation between the different spectra it is not possible to find the linear combination of the eigenvectors that represents the pure spectral profiles of the components. If the influence of the factors altered between the different samples can be physically correlated to the concentrations of the species involved - for example, the effect of temperature and pressure on equilibrium constants - the correct linear combinations may be found [4]. In this work I extend previous analysis of correlated data (see, for example Refs. 5 and 6) by relating different data sets through the Procrustes rotation. I show that if two related spectra are measured on each of a set of samples, the number of independent components and their relative concentrations and spectral contributions can be ob-
mined without any assumption about the dependence of concentration on external factors. This approach opens new possibilities for the study of complex equilibria, for example.
THEORY
On each of n different samples two types of spectra, A( A) and B(X), are measured and each is digitized into m data points. The two spectra depend on the concentrations of the components and are related in such way that the contributions of each component to the two spectra has the same wavelength (A) dependence but differs in amplitude: A,(9
= h
GY!D)
k=l
4(X)
=
i
C,kdkhl’,(X)
(1)
k=l A,(A)
and B,(X) are the two spectra
measured
on
Wavelength (A) 12..............m
Al A2
.
.
T, -
U
.
.
TP
-
.
.
-
TP
T!I
.
.
Tk
-
-
-
A
Tl
TB
-
T2
BI B2
B
-
PI
I
p;
Fig. 1. Schematic loading P.
illustration
of the decomposition
of the two data sets A and B into the score matrices
T” and Tb and a common
Original
275
sample i(i = 1, 2,. . . , n), V, is the spectral profile of component k(k = 1, 2,. . . , r) and C,, is the concentration of component k in sample i. d, is the scale factor between the contributions of component k to the two types of spectra and corresponds to an intrinsic molecular property that relates the response of component k in the two measurements. Eq. 1 can be written in matrix form: A=CV B=CDV
(2)
where D is a diagonal matrix with the elements d,. The rank of A and B is given by r(r 6 n, r 6 m), the number of components in the mixtures. To simplify the analysis we decompose A and B in some orthonormal basis set, for example using NIPALS [7,g] or singular value decomposition [9,10] (see illustration in Fig. 1): A = T”P’
or
Aij=
i
TE(P’)kj
k=l
B = TbP’
or
Bjj=
i
(3) qi(P’)kj
k=l
The orthonormal condition is P’P = I, where I is a unit matrix with the dimension r x r and P’ is the transpose of the matrix P. The matrices T” and Tb have n rows and r columns. They are known from the experimental data and the choice of P; for example: T”=AP
T”=CVP
Tb=BP
Tb = CDVP
(4)
Research
n
Paper
The matrix Q has dimensions r x r. It has full rank and can be determined by the least squares method using all spectra in T” and Tb. The solution is: Q = (TafTa)-‘T”‘Tb
(7)
Combining eq. 5 with eq. 6 we write: T”=CU Tb = CDU = T”Q CUQ = CDU
(8)
where C is a standing rectangular matrix with n rows and r columns. We need to find a pseudo-inverse to C with the property C-C = I where I has dimensions r x r. If the rank of C is r such a matrix does exist; in fact, there is an infinite number of such matrices [ll]. This gives us the important relation: UQ=DU
(9)
where all matrices have dimensions r X r and D is diagonal. Transposing both sides of eq. 9 we obtain: Q’U’ = U’D
or
(Q’);k(U’)kj=
i
(u’)ijdj
k=l (10)
This is an ordinary nonsymmetric eigenvalue equation. The eigenvectors are nonorthogonal and indeterminable with respect to normalization. They can be arbitrarily normalized, for example, by relating all concentrations to sample No. 1 (C,, = 1, j=l, 2 ,..., r):
defining U = VP: K”k= i
T”=CU
j=l
Tb = CDU
(5)
T” and Tb now contain all information about the objects in A and B, and the rows in the matrix U describe the spectra of the pure components in ‘T-representation’. We now seek the Procrustes rotation of T” relative Tb, i.e., the matrix Q that minimizes the differences between Tb and T”Q [ll]. Tb=TaQ
or
$=
i k=l
TiQkj
(6)
Ciil$,
= i
$
(IIa)
j=l
The concentrations, relative to sample 1, of the components are obtained as the row vectors of the matrix C calculated from eq. 5: C = T”U-’
(IIb)
The pure spectral profiles of the components are obtained as the column vectors of the matrix V (eqs. 3-5): V’ = PU’
(12)
The intrinsic molecular factors that relate the re-
n
Chemometrics
and Intelligent
Laboratory
Systems
216
WAVELENGTH
WAVELENGTH
Fig. 2. Ten synthesized spectra of two types obeying eq. 1. Three components data points. All spectra have a signal to noise ratio of 250.
contribute
to each spectrum
which is a collection
of 300
30
(b)
t
WAVELENGTH
WAVELENGTH
I
I
I
I
I
I
I
I
I
I
I
Fig. 3. The four most significant
eigenvectors
I
I
I
I
I
I
(
‘I ,
I
I
WAVELENGTH
WAVELENGTH
obtained
from the spectra
in Fig. 1. Corresponding
eigenvalues
are shown in the figures.
277
Original Research Paper
n
sponse of the components in the two measurements are found as the diagonal elements of the matrix D.
A TEST EXAMPLE
To illustrate the precision of the treatment described, ten spectra of two types have been synthesized as shown in Fig. 2. All spectra have contributions from three components and the spectra of types A and B are related according to eq. 1. A random noise corresponding to an average signalto-noise ratio of 250 is added (to be compared with typical S/N ratios of 10000 usually encountered in UV/VIS measurements and 1000 in IR). Each spectrum is treated as a collection of 300 data points. As seen in Fig. 2, the spectra are of rather poor quality, and it is even hard to guess the number of independent components. The spectra are decomposed using the NIPALS algorithm [5,6] and the four eigenvectors with largest eigenvalues are shown in Fig. 3. It is seen that the fourth eigenvector essentially contains noise and has a significantly lower eigenvalue than the other three. It may thus be concluded that three different components are responsible for the main features in the spectral shown in Fig. 2. The matrix Q is calculated from eq. 7 and is diagonalized to give the matrices D and U (cf. eqs. 9 and 10). The elements of the matrix D, which are the eigenvalues of Q, correspond to the factors d,(i = 1, 2, 3) that relate the contributions of the components to the two types of spectra. The calculated d values are compared with those used in the synthesis of the original spectra in Table 1. The correspondence is surprisingly good, all calculated d values are within 0.25% of the correct values.
SAMPLE
Fig. 4. Calculated concentrations (0) compared used in the synthesis of the original data ( -).
with those
The matrix C is calculated from eq. 11. Its elements correspond to the concentrations of the components to the ten samples and are compared with the concentrations used in the synthesis of the original spectra in Fig. 4. The calculated concentrations are in excellent agreement with the the true ones. The spectral profiles are finally calculated from eq. 12 and are compared with the original profiles in Fig. 5. It can be seen that, despite the severe overlap between the individual profiles, the calculated profiles are in good agreement with the originals.
TABLE 1 Original and calculated
1 2 3
Original
Calculated
1 0.5 0.25
0.9992 0.4987 0.2504
WAVELENGTH
Fig. 5. Calculated spectral originals (. . . . . ).
profiles
( -)
compared
with
w
Chemometrics and Intelligent Laboratory Systems
278
DISCUSSION
A novel method for the analysis of a correlated data set has been presented. It is based on a singular value decomposition analysis of the data followed by relating the sets through Procrustes rotation. It is shown that essentially all information can be obtained, such as the response of the components, their relative contributions to the different measurements, and the correlation between their responses to the two types of measurements. The requirement is that the two measurements made on each sample are correlated accord-
TABLE
ing to eq. 1. Such a correlation may usually be expected for various spectroscopic measurements. The profiles of pure electronic or vibrational transitions are usually the same in absorption, fluorescence excitation, circular dichroism, linear dichroism, and magnetic circular dichroism spectra. Their magnitudes, which are different in all these spectra, reflect some intrinsic molecular properties. Any two of these spectra can be used in the analysis (Table 2). The ratio between the contributions to the two spectra usually varies between different transitions in the same component. Care must therefore be
2
Examples of spectra suitable for this type of analysis Difference
Spectrum A
Spectrum B
Domain
di
Concentration Concentration Concentration Concentration Concentration Concentration Concentration Concentration Time Time Concentration Concentration Concentration Concentration Aex Aen?
ABSa LDd
FEXb ABS Al’ ABS ABS
wavelength wavelength wavelength wavelength wavelength wavelength wavelength wavelength wavelength wavelength
& LD,’ dg
All’ CDh MCDJ FEM’ hex, FEX A,;, III In FEM A_, FEX A,,! ABS A,
FEM Xex~
FEX Lnz Ilrn
FEM Aexz FEX L-a
ABS A,
&xl A,, hex2 Aem, AnlP I* II
A,, A,,
time time time wavelength wavelength
I
(“A+ B)/D k FEX(A,)/FEX(W FEM(A,)/FEM(A,)
Ill/I
I
FEX(A,)/FEX(A,)” FEM(A,)/FEM(A,)” ABS(A,)/ABS(W’ FEM(A,)/FEM(A,)“ FEX(A,)/FEX(A,)’ An [l/An 1” IlIPJ1 II(I’I
a Absorption spectrum. b Fluorescence excitation spectrum. ’ Fluorescence quantum yield. d Linear dichroism. e Reduced linear dichroism. f Absorption measured with polarized light. Index refers to polarization direction relative to the orientation axis of the sample. g Dichroic ratio. h Circular dichroism, vibronic perturbations may violate eq. 1. i Kuhn’s dissymmetry factor. J Magnetic circular dichroism, only B and C terms satisfy eq. 1 and may be perturbed by vibronic interactions. ’ The ratio between the sum of MCD A and B terms divided by the dipole strength. ’ Fluorescence emission. m Polarized fluorescence intensity, either emission or excitation wavelength can be varied. ” Matrix C contains decay profiles. o Matrix V contains decay profiles. p Increase in birefringence measured with polarized light. q Matrix C contains excitation profiles and matrix V the emission profiles. ’ Matrix C contains emission profiles and matrix V the excitation profiles.
Original Research Paper
219
taken when analyzing complete spectra if some components have more than one transition. In such case, transitions within the same component that have different d values will appear as individual components. The spectral profiles of such ‘false’ components will be some linear combinations of the transition profiles and consequently corresponding concentrations and d values might be erroneous. The spectra should instead be divided into segments to which each component only contributes no more than one transition. One exception, though, is the combination of absorption and fluorescence excitation spectra. Since most molecules obey the Kasha-Vavilov rules [12], transitions in a single component will have the same d value and complete spectra can be safely analyzed. The analysis can also be applied to spectra collected as a function of time e.g., absorption or fluorescence emission measured at two wavelengths in stopped flow or time resolved fluorescence measurements. The time developments of the spectral intensities will then be contained in the matrix C. Alternatively, complete spectra can be recorded in time. In this case only a single sample is required and the time developments of the spectral intensities will be contained in the matrix V. Finally, in special cases, such as polarized fluorescence, steady-state measurements can be performed on a single sample. The apparent concentrations are varied by changing excitation wavelength and the spectra are recorded as two polarized components of the emitted light. The matrix C will contain the excitation profiles and the emission profiles are found in the matrix V. The d values reflect the intrinsic anisotropies of the components.
n
ACKNOWLEDGEMENT
I wish to thank Professor Rolf Manne at University of Bergen for his help in presenting this work in mathematical terms and for valuable discussions.
REFERENCES 1 W.C. Johnson, Circular dichroism and its empirical application to biopolymers, Methods of Biochemical Analysis, 31 (1985) 61-163. 2 H. Cartwright, Color perception and factor analysis, Journal of Chemical Education, 63 (1986) 984-987. 3 LA. Cowe and J.W. McNicol, The use of principal components in the analysis of near-infrared spectra, Applied Spectroscopy, 39 (1985) 257-266. 4 Ch. Marck, Ch. Schneider and L. Brehamet, Vectorial interpretation of spectra for the study of associations, Biopolymers, 17 (1978) 1065-1074. 5 O.M. Kvalheim and N. Telnae, Visualizing information in multivariate data: application to petroleum geochemistry, Analytica Chimica Acfu, 191 (1986) 87-96. 6 C. Pizarro, L.A. Sarabia and J.L. Palacios, Multivariate calibration in UV-VIS spectroscopy, Journal of Chemometrics, 3 (1988) 241-247. 7 R. Fisher and W. MacKenzie, Studies in crop variation. II. The manurial response of different potato varieties, Journal of Agricultural Science, 13 (1923) 311-320. 8 H. Weld, Nonlinear estimation by iterative least squares procedures, in F. Daved (Editor), Research Papers in Statistics, Wiley, New York, 1966, pp. 411-444. 9 G. Golub and C. VanLoan, Matrix Computations, The Johns Hopkins University Press, Oxford, 1983. 10 J. Mandel, Use of the singular value decomposition in regression analysis, American Statistician, 36 (1982) 15-24. 11 K.V. Mardia, J.T. Kent and J.M. Bibby, Multivariate Analysis, Academic Press, London, 1979. 12 N.J. Turro, Modern Molecular Photochemistry, The Benjamin Cummings Publishing Company, Menlo Park, CA, 1978.