Chemometrics and intelligent laboratory systems ELSEVIER
Chemometrics and Intelligent Laboratory Systems 32 (1996) 57-6.5
Multivariate correlation analysis - a method for the analysis of multidimensional time series in environmental studies Sabine GeiB a9*, Jiirgen Einax b a Thiiringer Landesanstalt fir Urnwelt, Zentrallabor, Priissingstr. 25, 07745 Jena-Gkchwitz, Germany b Institut fir Anorganische und Analytische Chemie, Friedrich-Schiller-Uniuersitiit Jena, Steiger 3, 07743 Jena, Germany Received 1 October 1992; accepted 16 July 1995
Abstract After a short demonstration of the theoretical base of univariate, partial and multiple correlation analysis (auto- and crosscorrelation), the multivariate auto- and crosscorrelation function was deduced. One advantage of multivariate correlation is the possibility for simultaneous handling of all variables of the time or local series, which allows considering all interactions within the variables in the series and between the series which are dependent upon lag. Applications in environmental analy-
sis are provided in this text for the partial crosscorrelation and the multivariate auto- and crosscorrelation for three selected practical examples. Keywords: Correlation techniques; Multivariate analysis; Signal/time series analysis
1. Introduction Many analytical investigations deal with the determination of variation of concentration - in time or space [l]. But these concentration variations not only concern themselves with one dimensional cases, e.g., the time series of one parameter [2-71 in one dimension or in areas like in geostatistics [8], but also with many parameters which are changing simultaneously. Especially in environmental analysis such time or local changes of environmental contaminants are very relevant [9]. Multivariate time series models,
* Corresponding author. 0169-7439/96/$15.00
0 1996 Elsevier Science B.V. AI1 rights reserved
SSDI 0169-7439(95)00067-4
e.g., for the autoregressive moving average [ 10,111 or multiway decomposition [12], are available. In environmental analysis there are two problems which need to be solved: 1. Strong variation of the trace concentrations in environmental compartments like air, water and soil due to the natural inhomogeneities. 2. Complex problems of interactions and the common appearance of different parameters. For that purpose the well-known univariate correlation analysis [13] was generalized into the multivariate case. The multivariate correlation analysis permits including all interactions within the variables and excluding the share of the variance due to the variable noise.
58
S. Geip, J. Einax/Chemometrics
and Intelligent Laboratory Systems 32 (1996) 57-65
2. Theory of multivariate auto- and crosscorrelation analysis The aim of correlation analysis is to compare more functions and to calculate their relationship with respect to a change of T (lag) in time or distance. Assumed is a normal distribution of the functions x(t) and y(t) with the means
2.1. Multiple correlation between x and y Multiple correlation between x and y means more variables in y, variables 1, 2, 3,. . . ,m, interact with one variable in x.
%2(T) ... %mtT)
and the variances
s22(r) sx=
sy=
$1:
[i-x(t)]“dt fg
i/-T
.
[y-y(t)12dz g fg
1 = ,br+nm2,jtg x(t)Y(t+ 8 57 - fg
Partial variables only this the other formula)
%yW =
i
correlation between x and y means more in y interact with one variable in x, but one interaction is relevant for our study. So interactions (relations to 2, 3,4,. . . ,m in the have to be excluded. C,*(r)
~+1(2,3,4,...,m)W
=
lEimGx3
g=S-’
S(T) /
=
function
smm-w
r)dt
In practice, correlation analysis can also used for the discrete measuring values of x(t) and y(t). In the simple univariate case, x(t) and y(t) contain one variable. The autocorrelation function is defined as
and the crosscorrelation
.
2.2. Partial correlation between x and y
where t, is the total time or distance. Correlation analysis involves the comparison of functions in relation to local or temporal change. Therefore folding functions can be used: sJ7)
...
SAT)
s&)
%2(T)
...
SIX(~)
%1(T)
%2(T)
...
s2*(7)
s12b)
s22(7)
,..
s3J7)
.
.
.
.
hn(r> %mW %n(4
as
;I
gX(t)y(t+r) -fg
where n is the number of the items and r is the lag. In the following cases more variables in x(t) and y(t) are available.
,%x(3
sir)
2.3. Multivariate correlation between x and y Multivariate correlation between x and y means all relations between the various variables within -X
S. Gei;6,J. Einax/ Chemometrics and Intelligent Laboratory Systems 32 (19%) 57-65
(e.g., x1, x2,. . ., x,,), within Y (e.g., yl, y2,. . . ,y,) and furthermore the relations between X and Y, which are dependent upon lag, are considered. NC% the total variance can be expressed as a correlation matrix by using centered values. The multivariate auto- and crosscorrelation functions have to contain the total variance in dependence on r. Principal component analysis is one possibility for the extraction of the total variance from the correlation matrix. The to tal variance is equal to the sum of positive eigenvalues of the correlation matrices. For II variables, at least 3n values must be available for the calculation, according to the rule of HORST [14]. The multivariate autocorrelation function SX~ (7) _ (for centered values) is defined as sx,W
= [xWx(t
+ r,‘]
Now this multivariate function is reduced by the following instruction into a univariate function [L(r) S,,(r)
- u] =
= 0
i *i A>0
The multivariate crosscorrelation treated in the same way S&> [L(r)
= [x(M+ -ti]
function
was
r,‘]
/ -0.2
-.
-04: '0
1
2
3
4
6
6
7
=
C
8
9 10 11 12 13 14 15 16 17 16 19 20 21 22 la9 (months)
Fig. 1. Partial crosscorrelation function between the nitrate load at the storage reservoir intake and nitrate concentration in the storage reservoir itself.
significancy of the multivariate correlation value, it is necessary to take into consideration two influences. 2.3.1. Multiplication of probabilities A level of 0.9523 = 0.307 is the result of the multivariate case with 23 variables of a statistical significancy of 0.95 for each variable. 2.3.2. Multivariate testing by the spur criterion In the multivariate case the significant correlation coefficients for each variable add to the multivariate critical correlation value.
=0 2.4. The calculation
S*y(7)
59
of degrees of freedom
hi
A>0
These reduced multivariate auto- and crosscorrelation functions are now simply called multivariate auto- and crosscorrelation functions. For testing the
Table 1 Scheme of crosscorrelated
time series of application
f=n.n-n.m=n.(n-m) where n is the number ber of objects.
Time series 2
storage reservoir intake
storage reservoir
monthly 1976-1991(180
objects)
and m the num-
3.1
Time series 1
date
of variables
nitrate load (x variable)
nitrate concentration (var. 1 in y)
COD (var. 2 in y)
PO: (var. 3 in y)
x(O) x(l)
l(O) 10)
2(O)
n(2)
l(2)
2(2)
3(O) 30) 3(3)
x(180)
;(180)
2(180)
j(180)
20)
60
S. Geib, J. Einax/Chemometrics
and Intelligent Laboratory Systems 32 (1996) 57-65
For the determination of IZ variables, II objects are necessary. The rest are degrees of freedom. For example: Dust deposition was analyzed in 23 elements for 60 weeks (objects). In order to find a multivariate probability P =
mo
7,oo
a00
a00
no0
tt.00
3200
#roe
0.977, the following quired.
univariate
is re
Puni" = 2&377 = 0.999
0 a.00
’
’
’
r.00
’
’
a00
’
’
aw
flmo LyJZn twu
probability
-
’
’
n
1100
*
’
lmo
’
am
Tim
1
,Iaca
”
TM)
”
aw
Fig. 2. Time series of the elements on the first bridge.
”
t3.w
”
nm
mw
’
“‘I
law
l2w
1
v3.m
S. GeijI, J. Einax/Chemometrics
and intelligent Laboratory Systems 32 (1996) 57-65
61
100.
0.“’ a00
n
”
a00
7.00
wo
”
no0 mm
”
nca
’
”
(ew
I
80.
40
30
30
20
20
10 -
10 ’
’ 7.00
’
’ aw
’
’ 900
’
s mw
a00
~~00
1l.w
aw
13.00
’
’ 1t.00
’
WWfU
,
80
/
/
80
40
0 a.00
a00
nma
w Zn (uonl
-
01” ““““7.00 “” aw
taw
’ lzoo
’
l3.00
0 a00
i
’
’ 7.00
m
’ a00
’
’ e.00
’
nm
TIIIN Fig. 3. Time series of the elements on the second bridge.
’ tQ.00
m
’ la.00
’
’ l2.w
’
13.00
62
S. GeijI, J. Einox/Chemometrics and Intelligent Laboratory Systems 32 (19%) 57-65
Fig. 4. Univariate
crosscorrelation
The number of degrees of freedom depends on the lag r. f=
23(60 - 23) = 851 for T= 0
f=
23(_50 - 23) = 621 for T= 10
rcrir = 0.115 for f= (table for univariate
851 critical
correlation
coefficients
[71) Rcrit = 2.645 for
f = 851
function for each element.
areas is strong in winter, but minimal in summer due to the assimilation of the plants. Due to the storage time of the water another influence on the nitrate concentration in the storage reservoir is the biochemical interaction between the parameters nitrate and chemical oxygen demand and phosphate, respectively. Therefore it was necessary to use the partial crosscorrelation function to analyze the time-lag between the nitrate load upon intake of the storage reservoir and the nitrate concentration in the storage reservoir itself and now it was possible to exclude the
(resulting from multivariate critical correlation value) These critical values depend on r because of the resulting degrees of freedom. 2
Multivariate
cro88correlaiion
function
3. Applications
mean
flow
time
f,6
3.1. Partial crosscorrelation for the determination of the time-lag of nitrate pollution in a storage reservoir In Thuringia there are many storage reservoirs for winning drinking water. These storage reservoirs may have nitrate problems. The nitrate concentration time series upon intake and the reservoir water itself are periodic. The main source of nitrate is agricultural fertilization, The wash-out effect from agricultural
fl
0
1
2
3 lag (30
Fig. 5. Multivariate crosscorrelation transport rate in a river.
4
6
6
7
mid
function for calculation
of the
S. Geij?, J. Einax/ Table 2 Scheme of crosscorrelated Time
6.00 6.30 7.00
Chemometrics
time series for application
63
and Intelligent Laboratory Systems 32 (1996) 57-65
3.2
Time series 1
Time series 2
bridge 1 (x variables)
bridge 2 ( y variables)
Cd
Cr
CU
Fe
Zn
Cd
Cr
CU
Fe
Zn
x JOI x,(l) x,(2)
x,(O) x,(l) x,(2)
x,(O) x&l) x,(2)
x,(O) x,(l) x,(2)
x,(O) x&l) x,(2)
YI(O) Y,(l) Y,(2)
Y,(O) Y*(l) Y*(2)
Y&O) Y,(l) Y,(2)
Y,(O) Y,(l) Y,(2)
Y,(O) Y,(l) Y,(2)
13.00
influences of chemical oxygen demand and phosphate on nitrate concentration in the storage reservoir (Table 1). A lag of 1.5 months was found (Fig. 1). This means maximum nitrate concentrations in the intake of the reservoir cause maximum mitrate concentrations in the storage reservoir 1.5 months later. Such predictions may help to determine if drinking water should be taken from other storage reservoirs, if nitrate concentrations are too high. 3.2. Multivariate crosscorrelation
for computation of
transport rates of dissolved metals in river water
The transport rates of substances in river water are of economic importance, e.g., in potentially hazardous or pollutive situations. Usually these transport rates have been measured by expensive tracer experiments with salts, colored compounds or radioactive isotopes which can cause pollution themselves when the measurements are repeated frequently. Single 1.2 1-t 0.8 0.6-
time series of trace concentrations of metals in the river are evidently of scattered character (Figs. 2 and 3). The univariate crosscorrelation function is a possible means of measuring transport rates in flowing systems. For this purpose, the time series of heavy metal concentrations (Cd, Cr, Cu, Fe and Zn) at two sampling points from the river with a distance of 4.5 km at exactly the same time (Figs. 2 and 3) were crosscorrelated. The 10 time series of the 5 elements were crosscorrelated univariately, e.g., for each element the time series at the first sampling point from the river was crosscorrelated with the time series at the second sampling point. The univariate crosscorrelation functions for the 5 elements are very noisy and offer no useful information (Fig. 4). With the help of the multivariate crosscorrelation analysis has been possible to include the interaction between the metals due to the relationships in emission and transformation (Table 2). The assumption is that the metals are transported at the same rate. The multivariate crosscorrelation function sXY(r) shows a wide maximum at T = 3, e.g., 1.5 h (Fig. 5). This means that the mean transport rate of the metals is 3 km/h. Hydrological data of this river on the same day makes this result plausible (Fig. 5): maximum
flow rate = 4.28 km/h
mean flow rate = 2.64 km/h
3.3. Multivariate-statistical
analysis of dust deposition data on the North Frisian island Pellworm
Fig. 6. Univariate fraction.
autocorrelation
function of Sulfur. Sulfur in 3rd
Aerosol samples [15] were taken once a week over a period of 14 months on the North Frisian island
64
S. GeijI, J. Einax / Chemometrics and Intelligent Laboratory Systems 32 (I 9%) 57-65
Table 3 Scheme of autocorrelated Date
time series in application
3.3 a
1. fraction time series (x variables)
weekly (52 objects)
x,(O), x,(O), x,Ko,
‘.
>
x*3@)
x,(l), x,(l), x,(l), . .’9x,,(l) x,(2), , 42) x,(51), x*(51),
. , x&1)
2. fraction time series Lx variables) weekly (52 objects)
+
x,(O),x,(O),x,(O), >40) x,(l), x,(l), x,(l), ‘. .T x,,(l) x,(2), ., x&9 x,(51), x,(51), . . . , x*,(51) 3. fraction time series (x
Fig. 8. Multivariate
limit
*Ignlficancy
autocorrelation
function of the 2nd fraction.
26,
I
variables)
weekly (52 objects)
x,(O),x,(O),x,(O), >x,,(O) x,(l), x,(l), x,(l), ‘. . >41) x,(2), , x,,(2) x,(51), x,(51), . . ., -%,(51)
10
c +
a 23 elements: As, Ba, Ca, Cd, Cr, Cu, Fe, K, Mn, MO, Ni, Pb, Rb, S, Sb, Se, Sn, Sr, Ti, V, Y, Zn, Zr. +
Pellworm. A high volume sampler was used according to the impactor principle which allows classification for the aerodynamic equivalent diameter for a 50% separation. In six steps dust particles with the diameters demonstrated in Table 4 were precipitated. The sampling period was 14 months, from May 1984 to June 1985. Determination of metal concentrations in the dried digestion solutions was carried out by total reflection X-ray fluorescence (TXRF) analysis.
Fig. 9. Multivariate
+ Fig. 7. Multivariate
8Ignitlcancy
autocorrelation
limit
function of the 1st fraction.
function of the 5th fraction.
Fig. 6 demonstrates the low evidence of the univariate autocorrelation function for the element sulfur, shown here as an example. The multivariate autocorrelation functions (Table 3) of the data set of 23 elements in these dustlike emission samples were demonstrated in Figs 7-9. In the first fraction (30-7.2 pm) (Fig. 7) significant autocorrelation cases are to Table 4 Lags of significant the particle size
i_:_
*Ignllicancy llmlt
autocorrelation
multivariate
autocorrelation
in dependence
Fraction No.
Particle size in pm
Lags of significant multivariate autocorrelation in weeks
1 2 3 4 5 6 7 (total amount)
30-7.2 7.2-3.0 3.0-1.5 1.5-0.7 0.7-0.3 < 0.3 _X
1 5 5-6 7 7-10 6 5
on
S. Gei@,J. Einax/ Chemometrics and Intelligent Laboratory Systems 32 (1996) 57-65
be seen only accidentally. Conclusions for the relationship between the emission situations per week can only be poorly drawn. In the second fraction (7.2-3.0 pm) (Fig. 81, the autocorrelation graph is not so noisy as compared with the first fraction. Significant relationships up to five weeks are obtained between the deposition situations. This tendency of stronger relations between the deposition events with smaller particles continues up to the fraction number 5 (Fig. 9, Table 4). This means, that changes in the emission of smaller particles are more dependent on the preceding emission state than those of bigger particles. So it is better to characterize the long time emission level by smaller particles than by bigger particles of airborne dust.
Acknowledgements The authors thank Professor Michaelis from the research center in Geesthacht for the cooperation in relation to the dust deposition correlation studies.
References [l] K. Doerffel and A. Wundrack, in R. Bock, W. Fresenius, H. Giinzler, H. Huber and G. Tiilg (Eds.), Correlation Functions
65
in Analytics, Analytiker Taschenbuch Bd. 6, Akademieverlag, Berlin, 1986, pp. 37-63 (in German). Dl E. Fiirster and B. R&z, Methods of Correlation and Regression Analysis, Die Wirtschaft, Berlin, 1979 (in German). [31 P. Metzler and B. Nickel, Methodic in Medicine and Psychology, Time Series and Progress Analysis, Hirzel, Leipzig, 1986 (in German). [41 P.J. Brockwell and R.A. Davis, Time Series: Theory and Methods, Springer, Berlin, 1987. 151 T.B. Fomby, R.C. Hill and S.R. Johnson, Advanced Econometric Methods, Springer, Berlin, 1984. [61 R. Schlittgen and B.H.J. Streitherg, Time Series Analysis, Oldenbourg, Mlnchen, 1989 (in German). 171 SM. Pandit and S. Wu, Time Series and System Analysis with Applications, Krieger, Malabar, FL, 1990. Springer, 181 H. Akin and H. Siemes, Practical Geostatistics, Berlin, 1988 (in German). Valuation of Analytical Investiga[91 J. Einax, Chemometrical tions of Metals in Environment, Dissertation B, Jena, Friedrich Schiller University, 1990 (in German). [lOI D.R. Cox, Stand. J. Statist., 8 (1981) 93. Ml M.B. Priestley, Spectral Analysis and Time Series, Academic Press, London, 1981. WI L. Stable, J. Pharm. Biomed. Anal., 9 (1991) 671. 1131 G. Kateman, Chemometrics and Species Identification, Akademieverlag, Berlin, 1987. [141 E. Weher, Introduction to Factor Analysis, Gustav Fischer, Jena, 1974 (in German). WI R.-P. StoOel, Examinations on the Wet and Dry Deposition of Heavy Metals on the Island Pellworm, Dissertation, GKSS Research Center Geesthacht GmbH, Geesthacht, 1987 (in German).