Mathl Comput. Modelling, Vol. 11, pp. 1073-1076, 1988. Printed in Great Britain
0895-7177/88 $3.00 + 0.00 Pergamon Press plc
PREDICTING THM CONCENTRATION IN TREATED WATER WITH HIGHLY CORRELATED DATA
Paul J. Ossenbruggen, Department of Civil Engineering
Marie Gaudard, Department of Mathematics
M. Robin Collins, Department of Civil Engineering
University of New Hampshire, Durham, NH 03824
Abstract. Chlorine is an effective disinfectant in killing bacteria and viruses; however, trihalomethanes (THM) and several chlorination by-products of the drinking water treatment process are suspected carcinogens. In order to develop treatment strategies to minimize THM production, mathematical models containing appropriate raw water quality and control predictor variables are needed. Some of these predictor variables are known to be highly correlated; therefore, when they are collectively introduced into a regression model, questions about the validity of the model arise. Condition indexes and variance decompositions, as well as ridge regression, are used to identify the degree and causes of ill-conditioning. Path analysis is used to support the inclusion of predictor variables in the final model.

Keywords. Linear regression, trihalomethanes, control, water quality.
INTRODUCTION
Drinking water disinfection with chlorine is probably the single most effective factor in controlling typhoid fever and other waterborne diseases. However, trihalomethane (THM) and other by-products of the chlorination process are suspected carcinogens. Plant operators must judiciously use chlorine to maximize bacteria and virus kill while, at the same time, minimizing THM production. The purpose of this paper is to describe the steps used in developing a mathematical model that can be used to evaluate treatment strategies under different operating conditions.

The THM production process is a very complex chemical process that is not entirely understood. There are no mechanistic models that explain the THM production process; empirical studies based on correlation analysis and simple regression methods have been widely used to study the problem. As a result of some of these efforts (Kavanaugh et al., 1980), we have learned that the concentration of finished water THM, a mixture of CHCl3 (chloroform), CHCl2Br, CHClBr2, and CHBr3, is a function of:
• type of raw water organic precursor
• pH
• temperature
• presence of selected inorganic species
• method of chlorination
• detention time
Raw water contains a mixture of dissolved and suspended organic matter that generally gives water its yellowish-to-brownish color. This organic matter is generally formed by the decomposition of plant and animal material. Fulvic acid, humic acid, and other products with complex chemical structure have been isolated from raw water and have been shown to have very different chemical properties. The presence of naturally occurring inorganic species, such as bromide and calcium, as well as pH and temperature, has also been shown to affect THM production. Consequently, THM production is highly dependent upon the quality and source of the raw water.

THM production is also dependent upon the effectiveness of the treatment process in removing organic matter and the location of chlorine addition during treatment. Prechlorination refers to the practice of adding chlorine to raw water prior to the removal of organic precursors. The chlorinated raw water is subject to further treatment to remove unwanted matter. Postchlorination refers to the practice of adding chlorine as the final treatment step. For proper plant operation and maintenance, both pre- and post-chlorination are used in most plants. It has been found, however, that reducing the prechlorination dose reduces THM concentration in the finished water.

While correlation analysis and simple regression models have been successful in identifying the factors that are related to THM production, a mathematical model that contains the collective effects of these factors is imperative for establishing better treatment control strategies. An effective control model must contain variables reflecting the quality of the raw water and the effect of chlorine dose.

EFFECTS OF ILL-CONDITIONING
Consider a multiple regression model, $y = Xb + e$, where the columns of the $X$ matrix are highly dependent (or, equivalently, where
Proc. 6th Int. Conf. on Mathematical Modelling
predictor variables are highly correlated). In such a case, ordinary least squares procedures can result in a final model that is inappropriate for control. In our study of data collected periodically (approximately bimonthly) from July 1980 to March 1982 at a water treatment plant in Canton, New York (Edzwald, 1983), six of seven predictor variables were eliminated from a multiple regression model using the SAS backward elimination procedure. A significance level of 0.10 was used. The seven predictors were raw water UV absorbance, total organic carbon (TOC), trihalomethane formation potential (THMFP), temperature, turbidity, and pre- and post-chlorination dose. The final model contained raw water temperature as its only predictor variable. Clearly, this model is of very limited use for control because variables of raw water quality and chlorine dose are absent.

In the Canton data set, UV absorbance, TOC and THMFP, all surrogate measures of organic precursor concentration, are highly correlated. All pairwise correlation coefficients (r) are greater than 0.90. Raw water temperature is also highly correlated with these variables, with all r > 0.8. These high correlations are expected to cause the $X$ data matrix to be ill-conditioned.

Models fit by ordinary least squares when the $X$ matrix is ill-conditioned are often highly unstable; the sample variances of regression parameters are often larger than expected, resulting in confidence intervals that are too wide and t-statistics that are too low. For the Canton data set, ill-conditioning may explain the elimination of all surrogate measures of raw water quality and chlorination dose variables by the SAS backward elimination procedure.

For models consisting of three or more predictor variables, an examination of pairwise correlation coefficients may not always lead to the identification of variables causing ill-conditioning.
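The instability described above, and the limits of pairwise correlations as a diagnostic, can be illustrated with a small numerical sketch. It is written in Python with NumPy and uses synthetic data, not the Canton data or the authors' SAS analysis. The sample size, the noise level, the five-predictor design, and the helper name `ols_std_errors` are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200  # hypothetical sample size

# Four independent predictors plus a fifth that is almost exactly their sum.
base = rng.standard_normal((n, 4))
x5 = base.sum(axis=1) + 0.01 * rng.standard_normal(n)
X = np.column_stack([base, x5])

# Largest off-diagonal pairwise correlation: moderate, nowhere near 1.
R = np.corrcoef(X, rowvar=False)
max_pairwise = np.max(np.abs(R - np.eye(5)))

# Collective dependence: the smallest eigenvalue of the correlation
# matrix is close to zero, so the matrix is nearly singular.
min_eig = np.linalg.eigvalsh(R).min()


def ols_std_errors(X, sigma=1.0):
    """Standard errors of OLS slopes for centred X and error s.d. sigma."""
    Xc = X - X.mean(axis=0)
    cov = sigma**2 * np.linalg.inv(Xc.T @ Xc)
    return np.sqrt(np.diag(cov))


se_with = ols_std_errors(X)        # all five predictors in the model
se_without = ols_std_errors(base)  # near-dependent column dropped

print(f"largest pairwise |r|        : {max_pairwise:.2f}")
print(f"smallest eigenvalue of R    : {min_eig:.2e}")
print(f"s.e. inflation for slope 1  : {se_with[0] / se_without[0]:.0f}x")
```

In this construction no pair of predictors looks alarming, yet the standard error of each slope is inflated by roughly two orders of magnitude once the fifth column enters the model. That is the kind of inflation that pushes t-statistics below a significance threshold in a backward elimination.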
It is possible that three or more predictor variables taken collectively may be highly dependent, whereas their pairwise correlations may be small. Therefore, a means of evaluating the degree and causes of ill-conditioning that is more effective than the use of correlation coefficients alone is needed.

MODEL DEVELOPMENT APPROACH
Our approach consists of the use of the following statistical tools:
• Condition indexes and variance-decomposition proportions
• Ridge regression
• Ordinary least squares
• Path analysis

Condition indexes, variance-decomposition proportions (Belsley, Kuh, and Welsch, 1980) and ridge regression (Hoerl, 1962; Hoerl and Kennard, 1970) are used to determine the degree of ill-conditioning and to identify the variables that cause it. After variables are removed from the ill-conditioned model, ordinary least squares is used to estimate the parameters (the b's) of the reformulated model. Path analysis (Li, 1975) is used to validate the reformulated model by examining the contribution that each variable has toward prediction.

ILL-CONDITIONING DIAGNOSTICS
Condition indexes and variance-decomposition proportions are derived from the singular-value decomposition

$X = U D V^T$

where the diagonal elements of $D$, $\mu_j$, are the singular values of $X$. The condition index for predictor $j$ is defined to be $\mu_{max}/\mu_j$, where $\mu_{max}$ is the largest singular value. It can be shown empirically that large condition indexes, say from 30 to 100, are associated with moderate to strong dependencies. The notion of variance-decomposition proportion is based on the relationship

$Var(\hat{b}) = \sigma^2 (X^T X)^{-1} = \sigma^2 V D^{-2} V^T$

where $\hat{b}$ is the least squares estimate of $b$, and $\sigma^2$ is the variance of $e$. For each condition index, a variance-decomposition proportion is associated with each predictor and reflects the proportion of its variance that is associated with that condition index.

The Canton data exhibit two moderate to strong dependencies because two condition indexes exceed 30. Since two dependencies are involved, it is suspected that at least two predictor variables will need to be eliminated from the model in order to stabilize it. Analysis of the variance-decomposition proportions and auxiliary least squares models shows that UV absorbance, TOC and temperature are involved in one near dependency, and THMFP and TOC are involved in the other.

Ridge regression, a biased least squares estimation method, is used as a second means of identifying the causes of ill-conditioning. The ridge estimators are defined by

$\hat{\beta}(k) = (Z^T Z + kI)^{-1} Z^T u$

where $Z$ and $u$ are obtained by standardizing $X$ and $y$, respectively. The ridge trace for the Canton data set, Figure 1, shows that
temperature, UV absorbance, and TOC are the most unstable predictors. Additionally, as $k$ approaches one, the $\hat{\beta}(k)$'s for THMFP, UV absorbance and turbidity approach zero, indicating little predicting power. Since the findings from these two different diagnostic techniques indicate that THMFP and UV absorbance are causing ill-conditioning and providing little predicting power, they are removed from the model. Turbidity, while not considered to be a direct cause of ill-conditioning, is also removed because it has little predicting power. The ridge trace of the reformulated model, Figure 2, exhibits strong stability. The final model was then fit by ordinary least squares regression.

FIG. 1. Ridge trace of initial model (horizontal axis: k, ridge coefficient).
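The two diagnostics used in this section, condition indexes with variance-decomposition proportions and the ridge trace, can be sketched in a few lines of Python with NumPy. This is a minimal illustration on synthetic data, not a reproduction of the Canton analysis. The function names `condition_diagnostics` and `ridge_trace` are hypothetical. Two conventions are assumptions rather than details stated in the text: scaling the columns of X to unit length before the singular-value decomposition, and standardizing so that the cross-product matrix of the predictors is their correlation matrix.

```python
import numpy as np


def condition_diagnostics(X):
    """Condition indexes and variance-decomposition proportions of X.

    Assumption: columns are scaled to unit length before the SVD.
    """
    Xs = X / np.linalg.norm(X, axis=0)
    _, mu, Vt = np.linalg.svd(Xs, full_matrices=False)
    cond_index = mu.max() / mu              # largest singular value over each one
    phi = (Vt.T ** 2) / mu**2               # phi[j, k] = v_jk^2 / mu_k^2
    props = phi / phi.sum(axis=1, keepdims=True)
    return cond_index, props                # props[j, k]: share of var(b_j) tied to index k


def ridge_trace(X, y, ks):
    """Ridge estimates (Z'Z + kI)^{-1} Z'u for each k in ks.

    Assumption: Z and u are standardized so that Z'Z is a correlation matrix.
    """
    n, p = X.shape
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    u = (y - y.mean()) / y.std(ddof=1)
    ZtZ = Z.T @ Z / (n - 1)
    Ztu = Z.T @ u / (n - 1)
    return np.array([np.linalg.solve(ZtZ + k * np.eye(p), Ztu) for k in ks])


# Synthetic example: x1 and x2 are nearly collinear, x3 is unrelated.
rng = np.random.default_rng(1)
n = 60
t = rng.standard_normal(n)
x1 = t + 0.02 * rng.standard_normal(n)
x2 = t + 0.02 * rng.standard_normal(n)
x3 = rng.standard_normal(n)
X = np.column_stack([x1, x2, x3])
y = x1 + x2 + 0.5 * x3 + rng.standard_normal(n)

cond_index, props = condition_diagnostics(X)
trace = ridge_trace(X, y, ks=np.linspace(0.0, 1.0, 11))

print("condition indexes:", np.round(cond_index, 1))
print("variance proportions for the largest index:", np.round(props[:, -1], 3))
print("ridge estimates at k = 0:", np.round(trace[0], 3))
print("ridge estimates at k = 1:", np.round(trace[-1], 3))
```

Here the largest condition index exceeds 30, and nearly all of the variance of the first two coefficients loads on it. This is the pattern that flags those two predictors as jointly responsible for a near dependency. Plotting each column of `trace` against k gives a ridge trace; coefficients that move sharply for small k are the unstable ones.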
The fitted model is

$THM = -2.95 + 3.96\,(TEMP) + 19.7\,(PreCl_2) + 12.9\,(PostCl_2) - 5.65\,(TOC)$    (4)

FIG. 2. Ridge trace of reformulated model (horizontal axis: k, ridge coefficient).

PATH ANALYSIS
Since the effects of ill-conditioning are assumed to be eliminated from the reformulated model, path analysis may be used to study the predicting power of each variable of the model. It is also used to validate the model as a reasonable model for control. Using path analysis, each predictor's contribution to the coefficient of determination is divided into two components, a direct effect and a joint association. The direct effects $p_j$ and the ridge regression estimates for $k = 0$ are the same, $p_j = \hat{\beta}_j(k = 0)$, and are simply the least squares estimates of a multiple regression equation when the standardized forms of $X$ and $y$ are used. The direct effect, $p_j$, of each predictor measures its direct contribution to predicting $y$. The total association of that predictor with $y$ is measured by its correlation with $y$, $r_j$. The joint association is then simply the difference, $r_j - p_j$. See Tables 1 and 2. Since the coefficient of determination is

$R^2 = \sum_j R_j^2 = \sum_j r_j p_j$    (5)

the contribution to determination of each predictor can be determined as shown in Table 2. In our final model for THM prediction, predictors have respectable direct effects and, with the exception of post-chlorination dose, have significant correlations with THM and contribute to $R^2$. Interestingly, of these, raw water TOC has the weakest correlation with treated water THMs (see Table 2).

TABLE 1
Path Analysis Results

Predictor         Direct Effect    Joint Association
                  $p_j$            $r_j - p_j$
Temperature        0.885           -0.126
Pre-Cl2 dose       0.486            0.100
Post-Cl2 dose      0.313           -0.338
TOC               -0.504            0.993

TABLE 2
Path Analysis Results

Predictor         Correlation Coefficient    Contribution to Determination
                  $r_j$                      $R_j^2 = r_j p_j$
Temperature        0.759                      0.672
Pre-Cl2 dose       0.586                      0.285
Post-Cl2 dose     -0.025                     -0.008
TOC                0.489                     -0.246

                            $R^2 = \sum_j r_j p_j$ =  0.705

TABLE 3
Correlation Coefficients Between Predictors*

                  Temperature    Pre-Cl2 Dose
Pre-Cl2 dose       0.602
Post-Cl2 dose                     -0.570
TOC                0.823           0.519

*Only correlations between predictors which are significant at 0.10 are shown.
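The path-analysis decomposition can be checked by simple arithmetic on the tabulated values. The short Python sketch below recomputes the joint associations $r_j - p_j$ of Table 1 and the contributions $r_j p_j$ of Table 2 from the direct effects and correlation coefficients. The numbers are transcribed from the tables; the variable names are chosen here for illustration only.

```python
# Values transcribed from Tables 1 and 2.
predictors = ["Temperature", "Pre-Cl2 dose", "Post-Cl2 dose", "TOC"]
direct_effect = [0.885, 0.486, 0.313, -0.504]  # p_j (Table 1)
correlation = [0.759, 0.586, -0.025, 0.489]    # r_j (Table 2)

# Joint association: total association minus direct effect.
joint_association = [round(r - p, 3) for r, p in zip(correlation, direct_effect)]

# Contribution of each predictor to the coefficient of determination.
contribution = [round(r * p, 3) for r, p in zip(correlation, direct_effect)]

# Equation (5): R^2 is the sum of the products r_j * p_j.
r_squared = sum(r * p for r, p in zip(correlation, direct_effect))

for name, ja, c in zip(predictors, joint_association, contribution):
    print(f"{name:14s} joint = {ja:+.3f}   contribution = {c:+.3f}")
print(f"R^2 = {r_squared:.3f}")
```

The recomputed joint associations and contributions agree with the tabulated entries to three decimals. The sum comes to about 0.702, slightly below the reported 0.705, a difference that is presumably due to rounding of the tabulated $p_j$ and $r_j$. The negative contribution of TOC arises because its direct effect is negative while its correlation with THM is positive.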
The variables TOC and post-chlorination dose were retained in the model because of their joint associations with the other variables. The correlation coefficients between the dependent variable, THM, and TOC, UV absorbance and THMFP are, respectively, 0.480, 0.557, and 0.550. This relationship contributes to the predicting power of the model. Note that even though ill-conditioning has been eliminated, there still remain significant correlations among certain predictor variables (see Table 3).

CONCLUSIONS
The two approaches used to diagnose ill-conditioning and stabilize our model, namely the diagnostics of Belsley, Kuh, and Welsch and ridge regression, give different perspectives on the problem and seem to complement one another. Path analysis provides a "check" on the model suggested by these two procedures. Note that the joint association among predictor variables can result in a common causal effect; therefore, it plays an important part in explaining the retention of a predictor variable in the final model. High
correlation between a predictor variable and the dependent variable does not guarantee the retention of that predictor when using our approach. This was illustrated by the retention of TOC while excluding UV absorbance and THMFP from the final model. Likewise, a low correlation between a predictor variable and the dependent variable does not guarantee its exclusion, as exhibited by our retention of post-chlorination dose in the model.

Our final model contains a measure of raw water quality (TOC) and the effects of pre- and post-chlorination dose on THM production. It also shows the importance of temperature; therefore, it is a satisfactory model for control. In three other data sets which were examined (not reported here), this approach led to models that are consistent with theories of THM formation and that contain appropriate predictor variables for control.

REFERENCES
Kavanaugh, M.C., Trussell, A.R., Cramer, J. and Trussell, R.R. (1980) An Empirical Kinetic Model of Trihalomethane Formation: Applications to Meet the Proposed Standard, Journal AWWA, 72, 578-582.
Edzwald, J.K. (1983) Removal of Trihalomethane Precursors by Direct Filtration and Conventional Treatment, Department of Civil Engineering and Environmental Engineering, Clarkson College, Potsdam, NY 13676, 224 pp.
Belsley, D.A., Kuh, E., and Welsch, R.E. (1980) Regression Diagnostics: Identifying Influential Data and Sources of Collinearity, John Wiley and Sons.
Hoerl, A.E. and Kennard, R.W. (1970) Ridge Regression: Applications to Nonorthogonal Problems, Technometrics, 12, 69-82.
Hoerl, A.E. and Kennard, R.W. (1970) Ridge Regression: Biased Estimation for Nonorthogonal Problems, Technometrics, 12, 55-67.
Hoerl, A.E. (1962) Applications of Ridge Analysis to Regression Problems, Chemical Engineering Progress, 58, 54-59.
Li, C.C. (1975) Path Analysis - A Primer, Boxwood Press.

The research on which this paper is based was financed in part by the U.S. Department of the Interior as authorized by the Water Research and Development Act of 1978 (P.L. 95-467).