Original Research Paper
Chemometrics and Intelligent Laboratory Systems, 8 (1990) 69-71
Elsevier Science Publishers B.V., Amsterdam. Printed in The Netherlands
On Reducing the Underestimation of Correlation by Optimal Weighting
RENE HENRION and GUNTER HENRION *
Humboldt-University Berlin, Department of Chemistry, Hessische Str. 1-2, 1040 Berlin (G.D.R.)
(Received 29 March 1989; accepted 7 September 1989)
ABSTRACT
Henrion, R. and Henrion, G., 1990. On reducing the underestimation of correlation by optimal weighting. Chemometrics and Intelligent Laboratory Systems, 8: 69-71.
If a set of random variables (e.g. trace concentrations of different samples) is measured on two occasions, the well-known effect of underestimation of correlation due to measurement errors may be reduced by using optimal linear combinations. A simplified theoretical approach relates this problem to principal component analysis. For practical computations another method, described elsewhere, is proposed. A simple example from tungsten wire production serves as an illustration.
INTRODUCTION
In technological processes it is useful to investigate whether a correlation exists between contamination by traces of several elements at the various stages of the manufacture of a raw material (see ref. 1). A high correlation coefficient, $r(y_1, y_2)$, with $y_1$, $y_2$ denoting concentrations at stages 1 and 2, will indicate a linear relation $y_2 = a \cdot y_1 + b$. Typical cases are, for instance, $a = 1$, $b = 0$ or $a = 1$, $b > 0$ or $a < 1$, $b = 0$ [2]. It is well known that, under the influence of measurement errors, the computed correlation $r_c$ will generally underestimate the true correlation to a greater or lesser degree, so that $r_c < r$ (ref. 3, p. 606). A natural way of avoiding this would consist in increasing the precision of analysis by repeated measurements. In contrast to this it was more important for us to characterize contamination due to the presence of many trace elements in many samples of the raw material rather than to have only a few determinations, though high precision was necessary. We describe below how to reduce the underestimation by taking account of intercorrelations among trace elements.
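The underestimation effect can be illustrated with a small simulation. The sketch below is not taken from the paper: the true variance `sigma2`, the error variance `d`, the sample size and the seed are all illustrative assumptions. It takes a single element with an exact relation ($a = 1$, $b = 0$) and independent errors of equal variance at both stages, for which the computed correlation is expected to be close to $\sigma^2/(\sigma^2 + d)$ rather than 1.

```python
import math
import random

def pearson(u, v):
    """Plain Pearson correlation coefficient of two equally long sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

def simulate(sigma2=1.0, d=0.5, n=50_000, seed=0):
    """Return (true correlation, computed correlation) for one element.

    sigma2, d, n and seed are hypothetical values chosen for illustration.
    """
    rng = random.Random(seed)
    y1 = [rng.gauss(0.0, math.sqrt(sigma2)) for _ in range(n)]  # true values, stage 1
    y2 = list(y1)                                               # exact relation: a = 1, b = 0
    x1 = [y - rng.gauss(0.0, math.sqrt(d)) for y in y1]         # measured values (y = x + e)
    x2 = [y - rng.gauss(0.0, math.sqrt(d)) for y in y2]
    return pearson(y1, y2), pearson(x1, x2)

r_true, r_computed = simulate()
r_expected = 1.0 / (1.0 + 0.5)  # sigma2 / (sigma2 + d) for the default values
print(f"true r = {r_true:.3f}, computed r = {r_computed:.3f}, expected about {r_expected:.3f}")
```

With these assumed values the measured variables correlate at roughly two thirds although the true relation is exact.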
THEORETICAL BACKGROUND

Let $y_1$, $y_2$ be $p$-dimensional random vectors representing true concentrations of $p$ trace elements at two different stages of manufacture. We shall consider here the case of an exact linear relation $y_2 = A y_1 + b$, with $A$ a $(p, p)$-matrix and $b$ a $p$-vector. Thus we are in the situation of exact true correlation ($r = 1$). Denote by $x_1$, $x_2$ the corresponding measured concentrations, i.e. $y_i = x_i + e_i$ ($i = 1, 2$), where $e_i$ are measurement errors. Let us make the following assumptions concerning the errors:

1. $E(e_1) = E(e_2) = 0$ ($E$ = expectation)
2. $\mathrm{Cov}(e_1) = \mathrm{Cov}(e_2) = D = \mathrm{diag}(d_1, \ldots, d_p)$ ($\mathrm{Cov}$ = covariance matrix)
3. $e_1$ and $e_2$ are independent of $y_1$ as well as of each other.

Supposing these conditions we may derive the covariance matrix $C$ for the $2p$-dimensional random vector $(x_1, x_2)^T$:

$$C = \begin{pmatrix} \Sigma + D & \Sigma A^T \\ A \Sigma & A \Sigma A^T + D \end{pmatrix} \qquad (1)$$

Here $\Sigma$ denotes the covariance matrix for $y_1$. The right upper part of eq. (1), for instance, is obtained by ($E$ = expectation, see point 1, above):

$$\begin{aligned}
E\big((x_1 - E x_1)(x_2 - E x_2)^T\big)
&= E\big((y_1 - e_1 - E y_1)(y_2 - e_2 - E y_2)^T\big) \\
&= E\big((y_1 - E y_1)(y_2 - E y_2)^T\big) - E\big(e_1 (y_2 - E y_2)^T\big) - E\big((y_1 - E y_1) e_2^T\big) + E e_1 e_2^T
\end{aligned}$$

and, since $y_2 = A y_1 + b$ and due to point 3, above:

$$= E\big((y_1 - E y_1)(y_1 - E y_1)^T\big) A^T - E\big(e_1 (y_1 - E y_1)^T\big) A^T - E\big((y_1 - E y_1) e_2^T\big) + E e_1 e_2^T = \Sigma A^T$$

The other parts of eq. (1) are verified in a similar way. Consider now an arbitrary linear combination of $x_1$ and $x_2$, defined by the normalized weighting vector $w$ (i.e. $w^T w = 1$). For the random variables $w^T x_1$ and $w^T x_2$ we compute a correlation coefficient (using eq. (1)):

$$r_c(w) = \mathrm{corr}(w^T x_1, w^T x_2) = \frac{w^T \Sigma A^T w}{\sqrt{w^T (\Sigma + D) w \cdot w^T (A \Sigma A^T + D) w}} \qquad (2)$$

The numerator of eq. (2), for instance, is obtained as the covariance for $w^T x_1$ and $w^T x_2$ which, by virtue of eq. (1), is equal to

$$(w^T, 0^T) \, C \begin{pmatrix} 0 \\ w \end{pmatrix} = w^T \Sigma A^T w$$

The terms in the denominator are verified analogously. Obviously, for fixed $w$ the underestimation of the true correlation ($r = 1$) is increased by the measurement error $D$ ($\Sigma$, $D$ and usually $A$ too are positive semi-definite). For the purpose of clarification we state two further simplifying assumptions:

4. $A = I$ (= identity matrix)
5. $D = d \cdot I$ (i.e. equal errors for all trace elements).

Then eq. (2) reduces to $r_c(w) = (w^T \Sigma w)/(w^T \Sigma w + d)$ (since $w^T w = 1$). Since our task is to avoid the underestimation of correlation, we have to choose a weighting vector $w$ that maximizes $r_c(w)$. This is equivalent to minimizing $1/r_c(w) = 1 + d/(w^T \Sigma w)$. Once again, this is equivalent to maximizing $w^T \Sigma w$. Of course, the solution of the last problem is provided by principal component analysis (PCA). The maximum correlation (by variation of $w$) is found to be

$$r_c^{\max} = \lambda_1/(\lambda_1 + d) \qquad (3)$$

with $\lambda_1$ being the first eigenvalue of $\Sigma$. This last equation enables us to recognize circumstances under which optimal weighting may be successful in the problem described. High intercorrelations among trace elements (i.e. strong deviation of $\Sigma$ from diagonal shape) yield high $\lambda_1$ and consequently may be used for reducing the underestimation of $r$ (with weights taken from the first eigenvector of $\Sigma$). This gives an explanation of why measurements of complexes of similar trace elements may be considered as a kind of repeated measurement for these elements.

However, in most applications we may not expect that all of the assumptions 1-5 are fulfilled (at least, not exactly). Hence, using PCA to find optimal weights gives us an insight into the theoretical background, but it seems to be justified for practical computations only in a few special cases. Therefore for computational purposes, we used a method that was not based on any restrictive assumptions. This procedure is based on the solution of nonlinear equations [4].

TABLE 1
Correlation of Fe and Mn concentrations during four stages of tungsten manufacturing. Using optimal weights $w_{Fe}$ and $w_{Mn}$ one obtains adjusted correlation coefficients $r^*$

Stage of manufacture    r_Fe     r_Mn     r*      w_Fe     w_Mn
1 → 2                   0.05     0.02    0.62     0.72    -0.70
1 → 3                  -0.02    -0.16    0.43     0.75    -0.66
2 → 3                   0.81    -0.08    0.81     1.00     0.00
1 → 4                  -0.09     0.15    0.18     0.33    -0.94
2 → 4                  -0.06     0.22    0.26     0.32    -0.95
3 → 4                  -0.16    -0.05    0.17     0.96     0.28
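The simplified case of assumptions 4 and 5 can be checked numerically. The following is only a minimal sketch, not the authors' computational procedure (which is described in ref. 4): the covariance matrix `sigma`, the error variance `d` and all function names are hypothetical. The first eigenvector is obtained here by plain power iteration as a stand-in for a full PCA. The sketch evaluates $r_c(w)$ for a single element and for the first principal axis, and compares the latter with eq. (3).

```python
import math

def matvec(M, v):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def quad(u, M, v):
    """Bilinear form u^T M v."""
    return sum(u_i * z_i for u_i, z_i in zip(u, matvec(M, v)))

def r_c_simplified(w, sigma, d):
    """Eq. (2) under assumptions 4 and 5 (A = I, D = d*I), for w^T w = 1."""
    s = quad(w, sigma, w)
    return s / (s + d)

def first_principal_axis(sigma, iters=500):
    """Leading eigenvalue and eigenvector of a symmetric PSD matrix.

    Uses power iteration; assumes the start vector is not orthogonal
    to the leading eigenvector.
    """
    w = [1.0 + i for i in range(len(sigma))]
    for _ in range(iters):
        z = matvec(sigma, w)
        norm = math.sqrt(sum(z_i * z_i for z_i in z))
        w = [z_i / norm for z_i in z]
    return quad(w, sigma, w), w

# Hypothetical covariance of two strongly intercorrelated trace elements
sigma = [[1.0, 0.8], [0.8, 1.0]]
d = 0.5  # hypothetical common error variance

lam1, w_opt = first_principal_axis(sigma)
r_single = r_c_simplified([1.0, 0.0], sigma, d)  # one element on its own
r_weighted = r_c_simplified(w_opt, sigma, d)     # weights from first eigenvector
r_max = lam1 / (lam1 + d)                        # eq. (3)
print(f"single element: {r_single:.3f}, weighted: {r_weighted:.3f}, eq. (3): {r_max:.3f}")
```

For this assumed $\Sigma$ the first eigenvalue is 1.8, so weighting raises the computed correlation from about 0.67 to about 0.78. With a diagonal $\Sigma$ of equal variances no such gain would be obtained.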
EXAMPLE

In ref. 5 contaminations of trace elements were monitored during four stages of tungsten manufacture:

(1) Ammonium paratungstate
(2) Tungsten trioxide (decomposition of (1))
(3) Impregnation of (2) with K, Al, Si
(4) Tungsten powder (reduction of (3)).

We restrict our attention to the contamination of 20 samples by Fe and Mn. Table 1 shows computed and ‘optimized’ correlation coefficients for all possible comparisons of stages of manufacture. Furthermore the optimal weights are indicated. In the cases 1 → 2 and 1 → 3 considerable improvements are obtained in the correlation. In the situation 2 → 3 ($r_{Fe} = 0.81$) the problem of underestimation did not arise. In the cases 1 → 4, 2 → 4, 3 → 4 optimal weighting does not lead to any essential increase in correlation. Hence we may conclude that during stages (1), (2), (3) the Fe and Mn contaminations remain rather stable (although they do not necessarily remain unchanged). However, this is not a priori obvious, but is only revealed after weighting. The success of the method in this case is clearly due to the correlation between Fe and Mn (compare the preceding section). In contrast to this, the reduction of WO$_3$ to tungsten powder (3 → 4) coincides with a considerable increase in contamination by Fe and Mn due to the use of steel boats during reduction (this was first reported in ref. 6). Hence, this step in the manufacturing process distorts the correlation with the preceding stages.

REFERENCES

1 E.V. Ušakov, V.I. Karavajcev, E.K. Drobyševa, V.I. Tiraspol’skij, N.M. Vdovin and N.M. Zelencova, Korreljacionno-regressionnyj analiz vzaimosvjazi svoistv neprovisajuščego vol’frama i charakteristik ischodnogo poroška, in Splavy Tugoplavkich i Redkich Metallov dlja Raboty pri Vysokich Temperaturach, Nauka, Moscow, 1984, pp. 114-118.
2 R. Henrion, U. Rassmann and G. Henrion, Chemometric analysis of the behaviour of trace element concentrations in tungsten wire production, Analytica Chimica Acta, submitted for publication.
3 M. Fisz, Wahrscheinlichkeitsrechnung und mathematische Statistik, Verlag der Wissenschaften, Berlin, 1980.
4 R. Henrion and G. Henrion, Latent features with maximal correlation between two equal sets of variables, Journal of Chemometrics, in press.
5 U. Rassmann, Dissertation, Humboldt-University, Berlin, 1989.
6 G. Henrion, H.-J. Lunk, A. Henrion and R. Henrion, Klassifizierung von sehr ähnlichen Wolframmaterialien durch multivariate Interpretation umfangreicher Analysenserien der Spurenelemente, Zeitschrift für Chemie, 25 (1985) 393-397.