Ecological Modelling 187 (2005) 475–490
Interaction structures analysed from water-quality data Ulrich Callies∗ Institute for Hydrophysics, GKSS Research Center, D-21494 Geesthacht, Germany Received 15 March 2004; received in revised form 30 December 2004; accepted 5 January 2005 Available online 23 March 2005
Abstract Fortnightly observations of water quality parameters, discharge and water temperature along the River Elbe have been subjected to a multivariate data analysis. In a previous study [Petersen, W., Bertino, L., Callies, U., Zorita, E., 2001. Process identification by principal component analysis of river-quality data. Ecol. Model. 138, 193–213] applied principal component analysis (PCA) to show that 60% of variability in the data set can be explained through just two linear combinations of eight original variables. In the present paper more advanced multivariate methods are applied to the same data set, which are supposed to suit better interpretations in terms of the underlying system dynamics. The first method, graphical modelling, represents interaction structures in terms of a set of conditional independence constraints between pairs of variables given the values of all other variables. Assuming data from a multinormal distribution conditional independence constraints are expressed by zero partial correlations. Different graphical structures with nodes for each variable and connecting edges between them can be assessed with regard to their likelihood. The second method, canonical correlation analysis (CCA), is applied for studying the correlation structures of external forcing and water quality parameters. Results of CCA turn out to be consistent with the dominant patterns of variability obtained from PCA. The percentages of variability explained by external forcing, however, are estimated to be smaller. Fitting graphical models allows a more detailed representation of interaction structures. For instance, for given discharge and temperature correlated variations of the concentrations of oxygen and nitrate, respectively, can be modelled as being mediated by variations of pH, which is a representer for algal activity. Considerably simplified graphical models do not much affect the outcomes of both PCA and CCA, and hence it is concluded that these graphical models successfully represent the main interaction structures represented by the covariance matrix of the data. The analysed conditional independence patterns provide constraints to be satisfied by directed probabilistic networks, for instance. © 2005 Elsevier B.V. All rights reserved. Keywords: Elbe River; Water quality data; Statistical analysis; Graphical modelling; Conditional independences
1. Introduction
∗
Tel.: +49 4152 87 2837; fax: +49 4152 87 2818. E-mail address:
[email protected] (U. Callies).
0304-3800/$ – see front matter © 2005 Elsevier B.V. All rights reserved. doi:10.1016/j.ecolmodel.2005.01.045
In a previous study Petersen et al. (2001) investigated time series of data collected by the official water authorities from the River Elbe on a two weekly basis. The data set included nutrients and relevant water
476
U. Callies / Ecological Modelling 187 (2005) 475–490
quality parameters (cf. Section 2). Petersen et al. subjected the data to principal component analysis (PCA), which is a standard statistical technique widely used for the condensation of multivariate data sets (e.g. Johnson and Wichern, 1992). Other recent applications of the method in ecological or environmental research can be found in Brosse et al. (2001), Giraudel and Lek (2001) or Parinet et al. (2004), for instance. Given a k-dimensional random vector, x, as function of time, t, PCA allows to expand x into a series of fixed orthogonal k-dimensional vectors, yj , with uncorrelated time coefficients (principal components), αj (t): x(t) =
k
αj (t)yj .
(1)
j=1
In the Earth Sciences the vectors yj are called empirical orthogonal functions (EOFs), a terminology (e.g. von Storch and Zwiers, 1999) that will be adopted here. Mathematically they are the eigenvectors of the sample covariance matrix. An often considerable reduction of the dimension of the data set can be achieved if already a small number (
and water temperature as external forcing parameters. The main focus, however, will be on the application of graphical modelling as a second method, which is less conventional. Graphical modelling is different from PCA and CCA in that it does not involve any coordinate transformation like Eq. (1), for instance. The simplified description of the data is not related to a reduction of data dimensionality. Instead, a condensed description of the dominant interactions between the originally observed variables in a sampled data set is achieved. The technique is closely related to multiple regression analysis. Suppose that variable xi is to be predicted on an explanatory vector r. Assuming that all variables are centred to have zero means, the linear least squares predictor xˆ i of xi is (e.g. Draper and Smith, 1998) xˆ i = cov(xi , r) var−1 (r)r,
(2)
where var(r) = cov(r, r) = E(rrT ) and cov(xi , r) = E(xi rT ) with E denoting the expectation operator. Suppose now that in addition to xi a second variable xj is to be predicted based on r. The question that can be addressed by graphical modelling is the following: Is there any covariation between variables xi and xj that cannot be explained by the common explanatory vector r? In other words, does a significant correlation between xi and xj remain when the value of r is fixed? The amount of covariation of xi and xj given the value of r can be represented by the partial covariance: cov(xi , xj | r) = cov(xi , xj ) − cov(ˆxi , xˆ j )
(3)
with cov(ˆxi , xˆ j ) = cov(xi , r) var−1 (r) cov(r, xj )
(4)
according to Eq. (2). From Eq. (3) the partial correlation of xi and xj adjusted for r is obtained as (Whittaker, 1990) corr(xi , xj | r) =
cov(xi , xj | r) var(xi | r) var(xj | r)
(5)
with var(xi | r) = var(xi ) − var(ˆxi ). If xi and xj are uncorrelated given r, no additional information about xi in xj beyond that contained in xˆ i , can be exploited by a linear regression approach. For normally distributed data lack of correlation is equivalent to independence. Similarly lack of partial correlation is equivalent to conditional independence (Whittaker, 1990).
U. Callies / Ecological Modelling 187 (2005) 475–490
Simon (1954) discusses a simple example with measurements of three variables. He points out that observed conditional independence of two of the variables given the third variable is not sufficient to decide which of the two following situations one has: (a) An unconditional correlation between the two variables is produced by a causal effect operating through the third variable or (b) the correlation between the two variables results from a joint causal effect of the third variable on both of the other two variables, and hence the correlation is spurious. A distinction between these two situations is possible, however, when using directed graphs (Bayesian networks) whose nodes represent random variables and an arrow from one node to another represents probabilistic influence. Pearl (1988) expounds the usage of directed graphs as a tool for structuring probabilistic knowledge. The emphasis is put on Bayesian inference by which posterior probabilities of certain variables can be calculated given observations of other variables. Reckhow (1999) discusses the usefulness of probability networks for water quality modelling. Borsuk et al. (2004) demonstrate their use to realistically represent existing knowledge about a river ecosystem as a basis for eutrophication management. Lee and Rieman (1997) provide another application to fisheries assessment. Varis and Kuikka (1997), Kuikka and Varis (1997) propose the use of probabilistic networks for embedding competing or/and mutually supportive assessment models to represent uncertainties in the system. In a more recent book, Pearl (2000) developed a theory of how causal models could coherently be inferred from data. Pearl’s theory is an extension of the traditional view of structural equation modelling (Duncan, 1975). Stow and Borsuk (2003) apply Pearl’s technique for assessing assumed causal relationships between fishkills and toxic dinoflagellates in a river estuary. Shipley (2000) expounds a framework similar to Pearl’s and discusses its applications to problems in biology. In particular he addresses the topic of causal inference and conditional independence. Shipley (2004) applies this technique for testing alternative models of multivariate allometric patterns. It is well known that causal relationships cannot be learned from data without any further assumptions. However, missing edges in directed graphs induce constraints in the form of conditional independence relationships, which may be refuted by a statistical analysis of observations (Pearl, 2000). Sets of
477
such pairwise conditional independence relationships can conveniently be summarised by undirected graphs (Whittaker, 1990) discussed in the present paper. Alternative representations like vines, for instance, developed more recently (Bedford and Cooke, 2001) will not be considered in this study. At present they seem less suitable for learning conditional independence structures from observed data. Generally different directed graphs may parameterise the same multivariate probability distribution of a set of variables, and hence imply the same set of conditional independence constraints. The above three nodes problem discussed by Simon (1954) provides the most simple example of this ambiguity. Graphical modelling as applied in this paper does not explicitly hypothesize how causal effects propagate through measured variables but investigates properties of the joint multivariate distribution of the data. According to Cox and Wermuth (1996) conditional independence graphs are useful in that they aid interpretation in the light of existing background information even though the statistical analysis on its own cannot provide convincing evidence of causality. In the present study analysed conditional independence relationships will be discussed with respect to their consistency with the outcomes of the wellestablished methods PCA and CCA, which originally have not been designed for the description of causal relationships but investigate joint probability distributions with regard to the possibility of an efficient data reduction. Section 2 describes the set of observations that are analysed in the present study. Section 3 summarizes the general theory underlying graphical modelling. In Section 4 a preparatory study is conducted to illustrate the concept of graphical modelling for a subset of only three variables. Results of fitting graphical models to the full data set are reported in Section 5. Section 6 applies both CCA and graphical modelling to the analysis of a partitioned data set with external meteorological forcing treated separately. Finally, main results of the paper are summarised in Section 7.
2. Data The water quality data analysed in this study were taken at eight monitoring stations on the Elbe River in Germany. These stations and their distances from
478
U. Callies / Ecological Modelling 187 (2005) 475–490
the Czech border are listed in Table 1. All stations are located upstream of the weir Geesthacht (km 590) and therefore not influenced by the tidal regime of the North Sea. The distance between any two neighbouring
stations ranges from 4 to 137 km. Observations were collected every 14 days by the water authorities on a routine basis (ARGE Elbe, 1997). The same five years (1993–1997) of observations already examined
Fig. 1. Time series of observations in Schnackenburg.
U. Callies / Ecological Modelling 187 (2005) 475–490 Table 1 List of the stations, from which data have been used No.
Station
Location
1 2 3 4 5 6 7 8
Wittenberg Magdeburg (left) Magdeburg (right) Wahrenberg (left) Wahrenberg (right) Cumlosen Schnackenburg Boizenburg
km 214 km 318 km 322 km 459 km 459 km 470 km 475 km 559
In Magdeburg and Wahrenburg measurements are taken on both sides of the river. Locations (km) are given relative to the Czech border.
by Petersen et al. (2001) have been selected here to be consistent with the preceding study. Each data vector at given location and time is made up by the observations of eight different variables. These variables include concentrations of three dissolved nutrients, ammonia (NH4 ), nitrate (NO3 ) and o-phosphate (PO4 ) that are mainly involved in biological cycles. The remaining five vector components are water temperature (T ), discharge (D), pH, electric conductivity (EC) and oxygen saturation index (O% 2 ). By using the oxygen saturation index (in percent) the physical dependence of oxygen concentrations on water temperature (solubility) is excluded. As a typical example, Fig. 1 shows the observations that have been made at Schnackenburg (km 475).
479
All data – except for discharge, which represents the average of daily observations during the preceding two weeks, – are actual observations, no averaging has been applied. Some outlying measurements were removed from the data set. In cases where concentrations were below the limit of detection, the observations were taken to be half the detection limit. Petersen et al. (2001) assumed for their study that water quality dynamics at the eight selected stations are similar in the sense that all observations can be pooled as if they had all been made at the same location. We will adhere to this approximation. The statistical analyses applied in this study exclusively deal with variances and covariances between different variables. For CCA (Section 6) a partitioned representation x = (f , θ) of the eightdimensional data vector x will be used, with a twodimensional random vector f = (D, T ) comprising external forcing and a six-dimensional vector θ = (pH, NH4 , PO4 , O% 2 , NO3 , EC) being made up by water quality parameters. Then the variance–covariance matrix reads: var(f ) cov(f , θ) S = var(x) = . (6) cov(θ, f ) var(θ) All data are standardised to have zero means and unit variances, so that the covariance matrix coincides with
Table 2 Unconditional (upper triangle) and partial (lower triangle) correlations D Sample correlations D 1 T −0.164 pH −0.032 NH4 −0.174 PO4 −0.494 −0.134 O% 2 NO3 0.219 EC 0.289
T
pH
NH4
PO4
O% 2
NO3
EC
−0.240 1 0.095 −0.607 −0.076 0.183 −0.278 0.063
−0.191 0.526 1 −0.100 −0.124 0.406 −0.183 0.213
0.007 −0.704 −0.390 1 −0.071 −0.017 0.036 0.241
−0.394 −0.136 −0.264 0.114 1 −0.271 −0.042 0.175
−0.137 0.523 0.613 −0.391 −0.330 1 −0.019 −0.128
0.382 −0.558 −0.462 0.371 −0.011 −0.368 1 −0.060
−0.332 0.073 0.239 0.158 −0.054 0.065 −0.193 1
−0.394 −0.123 −0.143 0.101 1 −0.278 0 0
−0.029 0.523 0.613 −0.355 −0.330 1 0 0
0.382 −0.558 −0.462 0.371 −0.025 −0.380 1 0
−0.332 −0.024 0.239 0.158 0.108 0.077 −0.143 1
Fitted correlations (constrained to satisfy the restrictions of graph (A) in Fig. 4) D 1 −0.139 −0.163 0.051 T 0 1 0.398 −0.704 pH 0 0 1 −0.245 NH4 0 −0.602 0 1 −0.381 0 0 0 PO4 O% 0 0.271 0.480 0 2 NO3 0.310 −0.342 −0.247 0 EC −0.273 0 0.202 0.182
480
U. Callies / Ecological Modelling 187 (2005) 475–490
the correlation matrix. Elements of this correlation matrix are given in the upper part of Table 2 (upper triangle).
dev(G) = 2[l(S) − l(V)] = 2N0 I(S; V) ≥ 0.
3. Fitting Gaussian graphical models The objective of graphical modelling is to achieve a simplified description of inter-relationships between variables by small modifications of the elements of sample covariance matrix S. In a graphical model each variable is represented by a vertex and any conditional independence constraint for two variables is portrayed by the absence of an edge that directly connects them. Examples of such graphs are shown in Figs. 3 and 4. It can be shown (Whittaker, 1990) that zero partial correlations (i.e. absent edges in a Gaussian graphical model) correspond with zero elements in the inverse correlation matrix. Due to this relationship fitting the model to data can be formulated as a maximumlikelihood estimation of those elements of the inverse correlation matrix that correspond with existing edges in the graphical model. Let S be the sample correlation matrix of multinormal data and let N0 denote the sample size. The likelihood, l(W), of any alternative correlation matrix W differs from the likelihood of matrix S according to l(W) = l(S) − N0 I(S; W),
(7)
where I(S; W) is the Kullback–Leibler information divergence (Kullback, 1959) between two jointly normal distributions when the means are equal: I(S; W) =
deviance of the graph, dev(G), provides an entropytype measure of the amount of information in the data against the hypothetical interaction structure that is represented by the graphical model:
1 [tr(SW−1 − I) − log det(SW−1 )] ≥ 0. 2 (8)
In Eq. (8) I is the unit matrix and operators tr(· · ·) and det(· · ·) denote trace and determinant, respectively. For details of the derivation see Whittaker (1990). Generally, one has l(W) ≤ l(S) so that unconstrained maximisation of the likelihood l(W) gives W = S. Assume now that a set of conditional independence constraints portrayed by a graphical model, G, is imposed. The likelihood that graph G provides an adequate description of the data structure is specified as the likelihood of the correlation matrix W = V that results from a constrained maximisation of l(W). The
(9)
Generally, a numerical iterative method is needed to find the maximum likelihood covariance matrix V subject to the constraints of the graph (Dempster, 1972; Whittaker, 1990).2 For the present study iterative proportional fitting described by Whittaker has been implemented in Fortran 90. Just in the very special case that only one edge has been deleted from the complete graph, the deviance of the truncated model can be calculated analytically as a function of the observed partial correlation that corresponds with the absent edge (cf. Eq. (17)). In an already simplified graphical model, however, an edge may turn out to be important notwithstanding a small corresponding partial correlation. For normally distributed data the deviance has an asymptotic χ2 distribution with degrees of freedom given by the number of edges that have been excluded from the graph (Whittaker, 1990). In this study, however, statistical tests will not be applied. One reason is that the outcome of any such test would crucially depend on the specified number of observations. The analysed time series, however, exhibit serial correlations. In addition observations at different stations have been pooled, so that the nominal number of observations (about 1000) by far overestimates the effective number of independent pieces of information. The main objective of this study is to identify the dominant inter-relationships between variables, thereby deliberately accepting model simplifications as long as they are small compared to the overall connectivity of the data. The assumed number of observations does not affect the relative ordering of edge strengths. For convenience the number of observations, N0 , has been set to 100. The deviance will be employed throughout this study as a relative measure of the overall goodness of fit of a particular graphical model, even though the assumption of a multinormal distribution is not strictly
2 A freeware version of a popular software package for graphical modelling, MIM3, can be downloaded from the internet (http://www.hypergraph.dk).
U. Callies / Ecological Modelling 187 (2005) 475–490
justified for the data. To accept the conditional independence constraints of a graphical model means to assume a reduced overall data connectivity. Therefore, a natural reference value for the deviance of a graphical model, G, is the overall strength of association between variables in the data set. For standardised data, the observational evidence against the hypothesis of completely independent variables may be represented by the deviance 2N0 I(S; I). The reduction of the overall connectivity by graphical modelling becomes evident when expanding Eq. (9) in the following way: dev(G) = 2N0 [I(S; I) − I(V; I)]
(10)
For any correlation matrix, W, one has tr(W) = tr(I). With eigenvalues λi of W it follows from Eq. (8): 2N0 I(W; I) = −N0 log det(W) = −N0 log
k
481
of individual vector components, it exclusively deals with interaction strengths.
4. Graphical models for only three variables To illustrate the concept of graphical modelling in some more detail, in this section correlations between only three variables pH, oxygen and phosphate are investigated. The correlation matrix, S∗ , for this subset of variables can be extracted from the upper part of Table 2. The corresponding matrix of partial correlations, S∗p , can be specified using Eqs. (3)–(5), where now the length of vector r is one:
λi
i=1
= −N0
k
log λi .
(11)
i=1
The size of this positive value depends on how far individual eigenvalues λi deviate from their mean value of one. If all eigenvalues have a value of one, the original variables are independent and the statistical dimension (degrees of freedom) of the data equals the number of variables. With increasing overall interaction strength between the data, eigenvalues become different in size and the statistical dimension of the data shrinks. PCA amounts to an orthogonal transformation of the original data that conserves both the trace and the determinant of the correlation matrix. The EOFs, yj , in Eq. (1) are eigenvectors of sample correlation matrix S = var(x), the conjugated eigenvalues λj represent the variances of the corresponding uncorrelated principal components, αj . Accordingly the total variability in the data set can be written as: k j
var(xj ) = tr(S) =
k
λj = k.
(12)
j
The application of PCA to data filtering means to neglect principal components with small variances, λi . As a consequence generally the variances of all components of vector x decrease. As opposed to this, graphical modelling does not affect marginal variances
(13) Subjecting S∗ to PCA reveals that the pattern of variability represented by the leading eigenvector shown in Fig. 2 already explains 61% of the total variability of the three normalised variables. According to Petersen et al. (2001), parallel changes of pH and oxygen being opposed to a change of phosphate reflect the process of algal growth and degradation, respectively: during primary production of biomass by photosynthesis, the nutrient phosphate is assimilated while oxygen is released. At the same time the value of pH increases due to a removal of acidifying carbon dioxide. In agreement with the anticipated mechanistic coupling, pH and oxygen are effective mutual explanatory variables for each other. Using a linear least squares prediction scheme (cf. Eq. (2)) based on correlation matrix S∗ we get the following multiple regression
482
U. Callies / Ecological Modelling 187 (2005) 475–490
% : and O estimates pH 2 = 0.59 O% −0.07 PO4 pH 2
37.6%←
(14)
(40.6%).
(15)
7.0%←
% = 0.57 pH −0.18 PO O 2 4
37.6%←
(38.0%),
10.9%←
Numbers in parentheses give the percentages of the particular response variable’s observed variance that can be attributed to fitting the bivariate linear model. In addition, under each explanatory variable the percentage of explained variance when using this variable alone is shown. According to Eqs. (14)–(15), using phosphate as an additional explanatory variable does not much improve the estimates in particular of pH. This finding may be displayed graphically in terms of conditional independence statements. The complete graph (A) in Fig. 3 with connecting edges between all pairs of variables represents the most general unconstrained model. The simplified graph (B) with no direct edge between pH and phosphate portrays the fact that, conditionally on is nearly independent of phosoxygen, the estimate pH phate (cf. Eq. (14)). Similarly, the acceptance of graph (C) implies a simplification of Eq. (15). The adjusted maximum likelihood matrix of unconditional correlations, V∗ , which satisfies the conditional independence constraints of graph (B), and the corresponding matrix of partial correlations, Vp∗ , read:
(16) A comparison of V∗ with S∗ in Eq. (13) illustrates the general result (Whittaker, 1990) that as long as only one edge is absent, the constrained likelihood optimum can
Fig. 2. First eigenvector (EOF) of correlation matrix S∗ (Eq. (13)), displayed in arbitrary units. The EOF explains 61% of data variability.
be found analytically: that element of the sample correlation matrix, which corresponds with the absent edge, has to be replaced by a value, which makes the corresponding partial correlation (i.e. the LHS of Eq. (3)) vanish. All other elements of V∗ are just copied from S∗ . The deviance of graph (B) simplifies to a function of the partial correlation coefficient that has been replaced by zero (cf. Whittaker, 1990): 2N0 I(S∗ ; V∗ ) = −N0 log(1 − corr2 (PO4 , pH | O% 2 )) = −100 log(1 − 0.0832 ) = 0.7.
(17)
The exclusion deviances for the other two edges in the complete graph (A) in Fig. 3 have been calculated analogously. The fact that graph (B) has a smaller deviance than graph (C) corresponds well with the lower importance of phosphate in Eq. (14) than in Eq. (15) in terms of explained variances. However, model (C) seems still acceptable considering the fact that its deviance of 5.0 is small compared to the exclusion deviance of 40.5 of the edge connecting pH and oxygen in the complete graph (A). The existence of two model options is reminiscent of the lack of uniqueness which is often found when fitting parameters of mechanistic models (Spear, 1997). It is important to note that if in a non-complete graph an edge is considered for elimination, the corresponding exclusion deviance must always be calculated as the incremental change of the graph’s overall deviance specified according to Eq. (9), including the iterative procedure for updating V. Formulas in analogy with Eq. (17) are not applicable. For instance, the removal of the edge connecting pH and phosphate in graph (C) in Fig. 3 would enlarge the deviance of the graph by 7.2 which is clearly more than the exclusion deviance
U. Callies / Ecological Modelling 187 (2005) 475–490
483
Fig. 3. Graphical models for three variables. For each edge its exclusion deviance is shown.
of 0.7 of the corresponding edge in the complete graph. Thus, in graph (C) an edge directly connecting variables pH and phosphate is more important than it is in the complete graph (A) where its absence could partly be compensated via the intervening variable oxygen. The importance of an edge depends on the overall topology of a graphical model. 5. Graphical models for the full data set The upper part of Table 2 shows the elements of the observed correlation matrix, S, for eight variables (upper triangle) alongside all pairwise partial correlations conditionally on six variables (lower triangle).3 Each partial correlation could be calculated based on Eqs. (2)–(5) with r being a vector of length six. However, a more efficient approach (Whittaker, 1990) allows to specify all partial correlations at one pass by calculating the inverse of S, scaling S−1 to have a unit diagonal and finally multiplying all off–diagonal elements by −1. This procedure shows the equivalence between imposing conditional independence constraints and specifying zeros in the inverse correlation matrix (Whittaker, 1990). Some very small partial correlations in Table 2 indicate the potential for a significantly simplified representation of the inter-relationships between the eight variables. Graph (A) in Fig. 4 arises after the exclusion of 17 edges from the complete graph made up by 28 edges. The corresponding maximum likelihood unconditional and partial correlations obtained by iterative proportional fitting are shown in the lower part 3 The partial correlations do not significantly differ from corresponding elements of S∗p in Eq. (13). Differences are due to the fact that the sets of variables partial correlations are conditioned on are not identical.
of Table 2. Note that only those unconditional correlations that correspond to the absence of an edge in the graph (or to a zero partial correlation) have changed. The deviance of the graph is 21.7. This value may be compared with the total amount of information in sample correlation matrix S against complete independence of all variables, 2N0 I(S; I) = 291.2. In Fig. 4 for all edges the corresponding exclusion deviances are shown. The three weakest edges in graph (A) connect electric conductivity with the remaining variables. Exclusion deviances are as small as 7.2, 7.6 and 11.7. On the other hand, re-establishing the strongest of the eliminated edges (connecting temperature and pH) would diminish the deviance by 4.9, so that the graphical model (A) does not provide a clear separation between edges to be eliminated and edges to be kept. However, electric conductivity may be excluded from the data set (cf. graph (B)) without much affecting the interaction structure between the remaining seven variables. The importance of an edge directly connecting temperature and pH does even decrease so that now existing and eliminated edges are well separated with respect to their importance. The deviance of graph (B) is much smaller than the deviance of a model assuming complete independence of the seven variables, 263.4, and smaller than the exclusion deviance of any remaining edge. Thus the model being composed of 8 edges (13 edges have been eliminated) is adequate. An interesting aspect of graph (B) in Fig. 4 is its representation of the impact of external forcing on water quality parameters. Nitrate and phosphate are the only variables that are directly associated with discharge. Given the concentrations of these two nutrients, the impacts of discharge on ammonia, pH and oxygen vanish. It is well known that concentrations of phosphate tend to decrease with discharge due to dilution effects, while concentrations of nitrate usually increase
484
U. Callies / Ecological Modelling 187 (2005) 475–490
Fig. 4. Left column: Graphical models for eight, seven and five variables. Right column: As graphs on the LHS but after conditioning on discharge and water temperature. For all edges their exclusion deviances are shown. Dashed lines represent the strongest among eliminated connections, corresponding negative exclusion deviances specify the decrease of the graph’s deviance when the respective edge is re-established. Together with deviances of the graphs the deviances of models assuming completely independent variables are given as reference values (in parentheses).
after precipitation events due to an increased drainage from agricultural fields (Behrendt., 1993). According to the graph most information about phosphate beyond that obtained from discharge should be available from oxygen. Indeed the corresponding bivariate regression model is able to explain 30.6% of phosphate’s observed variance. A complete model with seven explanatory variables does not perform significantly better (35.6%). Regarding the impact of temperature, the most important conclusion from graph (B) is that for fixed
temperature ammonia becomes uncorrelated with any of the remaining five variables. The edge connecting ammonia and temperature is supposed to mainly reflect seasonal variations. Ammonia is involved into biological processes which are active only from spring to autumn when temperatures are sufficiently high. Given both discharge and temperature (cf. graphs (D) and (E)), all correlations between oxygen and phosphate on the one side and nitrate on the other side are modelled as being mediated by pH. This aspect of the
U. Callies / Ecological Modelling 187 (2005) 475–490
485
Table 3 Eigenvalues of sample correlation matrix, S, and fitted correlation matrix, V (cf. Table 2)
S V
1
2
3
4
5
6
7
8
Sum
3.136 2.959
1.614 1.702
1.181 1.181
0.612 0.608
0.500 0.537
0.411 0.442
0.319 0.339
0.226 0.232
8 8
graphical model seems reasonable if pH is seen as a representer for algal activity related to consumption of nitrate and/or phosphate and production of oxygen. Petersen et al. (2001) found that excluding nitrate from the statistical analysis did not much affect the shape of the leading EOF-pattern. Graph (C) in Fig. 4 arises after omitting both NO3 and NH4 from the data set. Note that associations between discharge and temperature as well as between temperature and pH, which before both were mediated by nitrate, are now transformed into direct connections. However, no direct edge is obtained connecting discharge and pH. To conclude this section, the weak impact of graphical modelling on PCA shall be demonstrated. Table 3 lists the eigenvalues for both the sample
correlation matrix S and the fitted correlation matrix V that satisfies the restrictions inherent in graph (A) in Fig. 4. To satisfy the graphical model, the largest eigenvalue has slightly decreased whereas small eigenvalues have slightly increased. The sum of the eigenvalues remained unchanged by graphical modelling. The weak impact of graphical modelling on the two leading EOFs (eigenvectors) is documented by the two upper panels of Fig. 5.
6. Analysis of the partitioned data set In this section the notation of a partitioned data vector x = (f , θ) with forcing vector, f , and response
Fig. 5. Upper panels: The two leading eigenvectors of sample correlation matrix S (black) and fitted correlation matrix V (grey) from Table 2. The first EOF explains 39.2% (S) or 37.0% (V) of total variability, the second mode explains 20.2% (S) or 21.3% (V). Lower panels: Two pairs of canonically correlated patterns for a partitioned data vector. Correlations are 0.80 (S) or 0.81 (V) for the first pair and 0.62 (S) or 0.57 (V) for the second pair of canonical variates.
486
U. Callies / Ecological Modelling 187 (2005) 475–490
vector, θ, already introduced in Eq. (6) will be adopted. In Section 6.1 CCA will be applied to identify and quantify linear associations between f and θ. In Section 6.2 graphical modelling will be applied to analyse the correlation structure within the space of residues of θ after the linear impact of external forcing has been subtracted. 6.1. Externally forced variability CCA concentrates information on the kind of association between the partial data sets by identifying pairs (pj , qj ) of vectors (CCA-patterns) in the spaces of f and θ, respectively, so that the correlation, ρj , between the corresponding time coefficients (canonical variates) (uj , vj ), in the following expansion is maximised: f (t) =
2
uj (t)pj ;
j=1
θ(t) =
2
˜ vj (t)qj + θ(t).
j=1
(18) The vectors pj and qj satisfy a pair of eigen-equations that share the same eigenvalues. These eigenvalues are the squared correlations of the canonical variates: cov(f , θ) var−1 (θ) cov(θ, f ) var−1 (f ) pj = ρj2 pj , (19) −1
cov(θ, f ) var
−1
(f ) cov(f , θ) var
j
(θ) q =
ρj2 qj . (20)
Technical details can be found, for instance, in von Storch and Zwiers (1999). As external forcing, f , is two-dimensional, only two pairs of patterns (pj , qj ) ˜ is needed can be identified so that a residual term θ(t) in Eq. (18) to span the six-dimensional space of θ. In Appendix A the way the canonical variates in Eqs. (18) are calculated is summarised. According to Fig. 5, re-combining the two analysed pairs of CCA-patterns essentially reproduces the structure of the two leading eigenvectors (EOFs) that Petersen et al. (2001) obtained by subjecting the unpartitioned data vector to PCA.4 This is interesting 4 According to Eq. (24) the vector lengths of a CCA-pattern represents the contribution of this pattern to the total variability of the data. To make EOFs and CCA-patterns comparable, each EOF has been normalised in the same way.
because a fundamental constraint for EOFs to be orthogonal is relaxed for CCA (von Storch and Zwiers, 1999). Fig. 5 also compares patterns analysed from the sample correlation matrix S with those obtained from the fitted correlation matrix V that satisfies the constraints of graph (A) in Fig. 4 (cf. Table 2). In both cases the truncation effect due to graphical modelling remains small which indicates consistency of the interaction structure portrayed by the graphical model. The first CCA-pattern in the space of forcing is mainly aligned with temperature, the second with discharge. This observation confirms the interpretation proposed by Petersen et al. (2001) that the first EOF (‘biological mode’) represents temperature dependent algal activities whereas the second EOF (‘discharge mode’) represents dilution effects (e.g. for phosphate) and increased drainage from agricultural fields after strong precipitation events (for nitrate). A more detailed analysis, however, reveals that the second pair of CCA-patterns is unable to provide a reasonably efficient model due to a small proportion of explained variance. The relevance of external forcing for the variability of θ depends on two factors. The first factor is the magnitude of correlations ρj . Correlations between all canonical variates are zero except for the correlations between the pairs (uj , vj ). Consequently expected values vj (t) of time coefficient vj (t) given the values, f (t), of external forcing can be modelled using a simple univariate least squares regression approach. If canonical variates are normalized to unit variance (cf. Appendix A) one gets: vj (t) = E(vj | f (t)) = ρj uj (t),
j = 1, 2.
(21)
The second factor the relevance of external forcing de˜ in pends on is the relative magnitude of the residue, θ, Eq. (18). According to the second column of Table 4, q1 and q2 together represent 52.5% of θ’s total variability in the six-dimensional data space, the remain˜ ing 47.5% are represented by θ(t). These 52.5% of total variability, however, are only partly correlated with external forcing so that according to the analysis in the Appendix A only 29.6% of θ’s total variability can be attributed to external forcing (third column of Table 4): 23.0% are explained by the ‘biological mode’ (to be compared with 39.2% obtained by
U. Callies / Ecological Modelling 187 (2005) 475–490 Table 4 Partitioned covariance matrix S: correlations, ρi , and explained variances
1. CCA-patterns 2. CCA-patterns
ρi
Variance θ
Expl. Variance θ
0.80 0.62
35.6% 16.9%
23.0% 6.6%
52.5%
29.6%
Total
Variance θ: Percentages of total variance in θ represented by individual CCA-patterns (cf. Eq. (24)). Expl. Variance θ: Percentages of variance in θ that can be modelled by regressing on canonical variates in the space of f (cf. Eq. (25)).
Petersen et al. (2001) from PCA) and 6.6% are explained by the ‘discharge mode’ (to be compared with 20.2% from PCA). 6.2. Internal variability We subtract now from the observed data those components that can be estimated by linear least squares regression on external forcing. In analogy with Eqs. (3)–(4) the correlation matrix of the resulting residues reads:
(22) The smaller the variance of a residue (diagonal element) is the more variability of the respective parameter can be attributed to external forcing. In agreement with graph (A) in Fig. 4, the smallest residual variance is found for ammonia due to its strong association with temperature. On the other hand, electric conductivity exhibits the weakest coupling with external forcing. The sum of all diagonal elements is 4.224, i.e. 70.4% of the total variance of 6.0, in agreement with the 29.6% of total variability that can be attributed to external forcing according to Table 4.
487
For graphical modelling the matrix of partial correlations must be calculated from Eq. (22). This step is extremely simple. Correlations in Eq. (22) are already conditioned on temperature and discharge. Conditioning them on remaining four variables reproduces exactly the partial correlations in the upper part of Table 2. The only difference is that now columns and lines referring to temperature or discharge are omitted. The matrix of partial correlations remains unaffected by rescaling var(θ | f ) to obtain a unit diagonal. CCA is scale-free as well. The situation is different for PCA. Petersen et al. (2001) presented results for the rescaled version of the correlation matrix var(θ | f ), although it turned out that in this specific example the impact of scaling was not very pronounced. Graphs (D)–(F) in Fig. 4 portray the interactions of residues. These graphs result from their counterparts (A)–(C) on the LHS by the elimination of discharge and temperature and all corresponding edges. Edge strengths and deviances for the reduced models, however, are different. According to graph (D) all association between the residue of ammonia and other parameters is mediated by electric conductivity. Given the value of conductivity, ammonia becomes independent. Since all edges involving conductivity are weak, ammonia becomes also independent when conductivity is simply excluded from the data set or marginalised (graph (E)). Finally, graph (F) portrays the interaction structure of residues of the three variables pH, oxygen and phosphate. The graph corresponds with graph (B) in Fig. 3 that was fitted, however, to original data instead of residues. A more detailed study of the separate impacts of discharge or temperature (not shown) reveals that conditioning on discharge reduces the overall data connectivity much less than conditioning on temperature. When conditioning on discharge, the deviance of a model assuming complete independence of the six residues (excluding again electric conductivity from the analysis) is 219.1 while after conditioning on temperature a value of only 85.5 is found. This agrees well with the fact that according to Table 4 only 6.6% of θ’s total variability can be explained by the second CCA-pattern, which is dominated by discharge. In contrast, the first CCA-pattern dominated by temperature was found to explain 23.0% of θ’s total variance.
488
U. Callies / Ecological Modelling 187 (2005) 475–490
7. Discussion Statistical inter-relationships between eight variables monitored at German stations on the Elbe River have been analysed using graphical Gaussian modelling and CCA. Both methods examine the sample correlation matrix. Graphical models proved to be able to condense considerably the structure of associations between the data. A complete graph with 28 pairwise connections between eight variables could be approximated by a graphical model with only 11 edges. After the exclusion of electric conductivity from the analysis, keeping only 8 of 21 possible edges provided an adequate model. The maximum likelihood technique of fitting graphical models is based on the assumption that all observations have a jointly normal distribution. Even though this assumption is not strictly adequate for water quality data, two facts corroborate the graphical models that were obtained: (1) the constraints imposed by the graphical models do not much change (i.e. are consistent with) the results of PCA and CCA, (2) the interaction graphs are meaningful in the light of a process oriented interpretation. A main objective has been to describe the impact of external forcing on water quality variability. The proposed graphical model portrays discharge as being linked with only phosphate, nitrate and electric conductivity. These links suggest an interpretation in terms of dilution (e.g. phosphate) and increased drainage after precipitation events (nitrate). Temperature is directly linked with ammonia, nitrate and oxygen. The observed correlation between temperature and pH is established indirectly via a strong edge between pH and oxygen. According to the graphical model the correlation between oxygen and nitrate given temperature and discharge is mediated by pH. This is in agreement with the notion of pH as a representer for algal activity. According to the graphical model temperature governs the strength of all biological processes involving ammonia. This result reflects most probably the seasonal signal in the data. The construction of directed graphs (e.g. Pearl, 1988) has not been addressed in this paper as this would need further assumptions. From any directed graph a corresponding undirected conditional independence graph can be derived by first placing edges between all pairs of unconnected nodes that share a common child
and then second replacing all arrows by lines (Bedford and Cooke, 2001). Consequently the undirected graph (F) in Fig. 4, for instance, would be consistent with three different directed graphs: (1) pH influences oxygen and oxygen in turn influences phosphate, (2) as before but with both arrows being reversed, and (3) oxygen influences both of the other two variables. The forth option, a model with two arrows pointing from pH and phosphate, respectively, into oxygen is inconsistent with the conditional independence graph since application of the above rule would require an edge that directly connects pH and phosphate. None of the three possible models seems clearly preferable based on mechanistic model concepts. The conditional independence graph, however, portrays something that the three directed graphs have in common and that can be tested if data are available. In contrast with graphical modelling CCA introduces artificial coordinates. This specific coordinate transformation eliminates the distinction between partial and unconditional correlations. Here CCA has been applied to identify patterns of water quality variation that are most correlated with variations of discharge and water temperature. Two pairs of CCA-patterns were found to be in very good agreement with the two leading EOFs previously analysed by Petersen et al. (2001) using PCA. However, the results of CCA indicate that PCA in the space of the full data vector might be prone to an overestimation of the importance of external forcing for water quality variability. Given the observations of two sets of variables, CCA provides a unique description of associations between them. Contrary to this, in general the definition of a satisfactory graphical model is not unique. Adding new independence constraints affects the appropriateness of constraints introduced at an earlier stage of the analysis, so that different paths of successive model simplifications may be tried. Several models can be found that are consistent with the data, and different models may be selected by different users. The ambiguity in graphical model selection is related to the problem of identifying appropriate explanatory variables in linear regression models. However, a lack of uniqueness of model structure is also often found when fitting simplified mechanistic models to data (Spear, 1997). The topology of a conditional independence graph being consistent with the available data can
U. Callies / Ecological Modelling 187 (2005) 475–490
guide modellers in designing parsimonious model structures for different applications. All conditional independence constraints portrayed in a graphical Gaussian model can be verified by considering suitable regression models that involve variables in different combinations and in different roles as explanatory or response variables. Generally it is not possible, however, to represent all constraints that are represented in a graph in the same multivariate regression model. Conditional independences between explanatory variables given the value of response variables remain invisible in a regression framework. Acknowledgements The open provision of data by the “Arbeitsgemeinschaft zur Reinhaltung der Elbe” is much appreciated. I wish to thank Wilhelm Petersen for helpful discussions. This study has been carried out within the European Commission project IMPACT (contract IST-1999-11313). Appendix A. CCA and explained variances The vectors pj and qj in Eq. (18) do not form orthogonal bases in their respective subspaces. Therefore, unlike in PCA, the corresponding time coefficients uj (t) and vj (t) cannot be specified by simply projecting f (t) and θ(t) onto the patterns pj and qj , respectively. Instead one has (von Storch and Zwiers, 1999) uj (t) = (var−1 (f )pj )T f (t) vj (t) = (var
−1
and
j T
(θ)q ) θ(t).
(23)
j = var−1 (f )pj and q j = Often the vectors p −1 j var (θ)q instead of pj and qj are introduced as the basic entities, thereby putting the main emphasis on Eqs. (23) rather than on Eqs. (18). In this case the eigen-equations (19)–(20) are to be re-formulated accordingly (e.g. von Storch and Zwiers, 1999). An arbitrary normalisation can be applied to achieve var(uj ) = var(vj ) = 1. Then, with Eq. (18), the fraction of the total variance of θ, tr(var(θ)), represented by pattern qj reads, tr(var(vj qj )) var(vj )tr(qj qj ) qj qj = = , tr(var(θ)) tr(var(θ)) tr(var(θ)) T
T
(24)
489
where tr(qj qj ) = qj qj has been used. To obtain the fraction of total variance of the variables in θ that can be attributed to variations of f , the time coefficients vj in Eq. (24) must be replaced by the corresponding estimates vj from Eq. (21): T
T
var(uj )tr(qj qj ) qj qj tr(var(ˆvj qj )) = ρj2 = ρj2 . tr(var(θ)) tr(var(θ)) tr(var(θ)) T
T
(25) References ARGE Elbe, 1997. Wasserg¨utedaten der Elbe von Schmilka bis zur See, Zahlentafel 1997. ISSN 0931–2153, Arbeitsgemeinschaft f¨ur die Reinhaltung der Elbe, Hamburg. Available at http://www.arge-elbe.de/wge/Allgem/VeroefJb.html. Bedford, T., Cooke, R.M., 2001. Probabilistic Risk Analysis — Foundations and Methods. Cambridge University Press, Cambridge, UK. Behrendt, H., 1993. Point and diffuse loads of selected pollutants in the River Rhine and its main tributaries. Report RR-93–1, International Institute for Applied Systems Analysis, Laxenburg, Austria. Borsuk, M.E., Stow, C.A., Reckhow, K.H., 2004. A Bayesian network of eutrophication models for synthesis, prediction, and uncertainty analysis. Ecol. Model. 173, 219–239. Brosse, S., Giraudel, J.L., Lek, S., 2001. Utilisation of nonsupervised neural networks and principal component analysis to study fish assemblances. Ecol. Model. 146, 159–166. Cox, D.R., Wermuth, N., 1996. Multivariate Dependencies — Models, Analysis and Interpretations. Chapman & Hall/CRC Press, Boca Raton. Dempster, A.P., 1972. Covariance selection. Biometrics 28, 157–175. Draper, N.R., Smith, H., 1998. Applied Regression Analysis, third ed. John Wiley & Sons, New York. Duncan, O.D., 1975. Introduction to Structural Equation Models. Academic Press, New York. Giraudel, J.L., Lek, S., 2001. A comparison of self-organizing map algorithm and some conventional statistical methods for ecological community ordination. Ecol. Model. 146, 329–339. Johnson, R.A., Wichern, D.W., 1992. Applied Multivariate Statistical Analysis, third ed. Prentice Hall, Englewood Cliffs, New Jersey. Kuikka, S., Varis, O., 1997. Uncertainties of climate change impacts in Finnish water-sheds: a Bayesian network analysis of expert knowledge. Boreal. Environ. Res. 2, 109–128. Kullback, S., 1959. Information Theory and Statistics. Wiley, New York. Lee, D.C., Rieman, B.E., 1997. Population viability assessment of salmonids by using probabilistic networks. N. Am. J. Fish. Manag. 17, 1144–1157. Parinet, B., Lhote, A., Legube, B., 2004. Principal component analysis: an appropriate tool for water quality evaluation and management — application to a tropical lake system. Ecol. Model. 178, 295–311.
490
U. Callies / Ecological Modelling 187 (2005) 475–490
Pearl, J., 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers, San Mateo, California. Pearl, J., 2000. Causality: Models, Reasoning, and Inference. Cambridge University Press, Cambridge, UK. Petersen, W., Bertino, L., Callies, U., Zorita, E., 2001. Process identification by principal component analysis of river-quality data. Ecol. Model. 138, 193–213. Reckhow, K.H., 1999. Water quality prediction and probability network models. Can. J. Fish. Aquat. Sci. 56, 1150– 1158. Shipley, B., 2000. Cause and Correlation in Biology: A User’s Guide to Path Analysis, Structural Equations, and Causal Inference. Cambridge University Press, Cambridge, UK. Shipley, B., 2004. Analysing the allometry of multiple interacting traits. Pers. Ecol. Evol. Syst. 6, 235–241.
Simon, H.A., 1954. Spurious correlation: a causal interpretation. J. Am. Stat. Assoc. 49, 467–479. Spear, R.C., 1997. Large simulation models: calibration, uniqueness and goodness of fit. Environ. Model. Software 12, 219– 228. Stow, C.A., Borsuk, M.E., 2003. Enhancing causal assessment of estuarine fishkill using graphical models. Ecosystems 6, 11–19. Tarantola, A., 1987. Inverse Problem Theory — Methods for Data Fitting and Model Parameter Estimation. Elsevier, Amsterdam. Varis, O., Kuikka, S., 1997. Joint use of multiple environmental assessment models by a Bayesian meta-model: the Baltic salmon case. Ecol. Model. 102, 341–351. von Storch, H., Zwiers, F.W., 1999. Statistical Analysis in Climate Research. Cambridge University Press, Cambridge, UK. Whittaker, J., 1990. Graphical Models in Applied Multivariate Statistics. John Wiley & Sons, Chichester.