Mathl Comput. Modelling, Vol. 11, pp. 1077-1082, 1988
Printed in Great Britain
0895-7177/88 $3.00+0.00 Pergamon Press plc
PREDICTING THE PROBABILITY OF CONTAMINATION AT GROUNDWATER BASED PUBLIC DRINKING SUPPLIES
Jailyn A. Brown, M.S. William P. Darby, Ph.D. Department of Engineering and Policy, Washington University, Campus Box 1106, One Brookings Drive, St. Louis, Missouri 63130
Abstract. Logistic regression analysis is used to predict chemical contamination of groundwater based water supplies using site-specific parameters. Predictions are evaluated for accuracy and usefulness in an expected utility sensitivity analysis, balancing the costs of monitoring too comprehensively with the risks of monitoring too infrequently. Results confirm that utility is maximized under conditions of unlimited monitoring budget when all sites are monitored. Prioritizing by population improves the selection process when the number of sites to be monitored is restricted.
Keywords. groundwater, logistic regression, VOC contamination.
INTRODUCTION
Water issues have been gaining an increasing amount of federal and local attention during the past few years. The reauthorization of the Clean Water Act despite two presidential vetoes is an example (1, 2). In addition, both the House and Senate are developing groundwater related legislation, and the U.S. Environmental Protection Agency (USEPA) is undertaking a project entitled "The National Survey of Pesticides in Drinking Water Wells" (3). These examples indicate the increasing priority being placed on maintaining water resources, with an emerging emphasis on groundwater.

Identifying populations at risk due to the contamination of groundwater based drinking supplies provides one important aspect of a comprehensive groundwater management program. A monitoring program could be established to measure the chemical quality of every groundwater drinking supply in the nation. However, limits on money and personnel make such a monitoring program infeasible. The objective of this paper is to investigate a methodology for developing guidelines or criteria useful in targeting groundwater based public drinking supplies to monitor for chemical contamination. The selection criteria balance the costs of monitoring too comprehensively (increasing the expense associated with monitoring sites with low likelihood of contamination) against the risks of monitoring too infrequently (increasing the health costs associated with population exposure to contaminants through drinking water).

The Groundwater Supply Survey provided most of the information used in this analysis (4). The project, conducted by USEPA in 1980-1981, involved analyzing samples taken from groundwater based drinking supplies for the presence of 34 volatile organic compounds (VOCs). In addition, a questionnaire was distributed to representatives of each water supply to ascertain the impacts local influences had on well field characteristics.

A total of 945 sites were sampled. Of these, 466 were randomly chosen sites, identified in the Federal Reporting Data System's inventory of public water systems. Random site selection was subdivided into two sets: 285 sites served populations under 10,000 while 181 served populations over 10,000. The remaining 479 sites comprised the non-random sample. USEPA delegated selection responsibility for the non-random sites to each state, relying on local expertise to choose sites of suspected contamination, or sites with no available VOC information, in order to maximize the discovery of contamination.

Of the 34 VOCs analyzed by USEPA, only the six most frequently occurring were used in this study, as listed in Table 1. Three sets of binary variables were developed from these data. The first set, including all six of the VOCs, used the detection limit to differentiate a zero-one variable, where one indicated a VOC found at or above its detection limit. The other two sets were based upon USEPA recommended maximum contaminant levels (RMCLs) and proposed maximum contaminant levels (PMCLs) (5, 6, 7). For those of the six VOCs that had RMCLs and/or PMCLs, corresponding zero-one variables were similarly assigned. Table 1 lists detection level, RMCL and PMCL values for each VOC along with the frequency of occurrence for each VOC at each level. In addition, a variable indicating the presence of any of the VOCs at each quantitation level was included in the analysis.

Most of the independent variables used in this analysis were derived from the USEPA questionnaire. It gathered information on well characteristics and distribution, water treatment processes, and commercial and industrial activity near the well sites. The questionnaire-derived variables are listed in Table 2. Additional data were provided from previous research on this same topic (8).
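The zero-one coding of the VOC data can be illustrated with a minimal sketch. The threshold values are those read from Table 1 for three of the six VOCs; the function names, the dictionary layout and the example site are assumptions made for illustration and are not taken from the original study.

```python
# Thresholds in ug/L as read from Table 1; None marks a level with no listed value.
THRESHOLDS = {
    "1,1-dichloroethane": {"detection": 0.2, "rmcl": None, "pmcl": None},
    "1,1,1-trichloroethane": {"detection": 0.2, "rmcl": 200.0, "pmcl": 200.0},
    "trichloroethylene": {"detection": 0.2, "rmcl": 0.2, "pmcl": 5.0},
}


def zero_one(concentration, threshold):
    """Return 1 if the concentration is at or above the threshold, else 0."""
    if threshold is None:
        return None  # no variable is defined for this VOC at this level
    return 1 if concentration >= threshold else 0


def code_site(concentrations, level):
    """Code one site's concentrations (ug/L) at 'detection', 'rmcl' or 'pmcl'."""
    coded = {
        voc: zero_one(concentrations[voc], levels[level])
        for voc, levels in THRESHOLDS.items()
    }
    # The "any" variable is one when at least one VOC is coded one at this level.
    coded["any"] = 1 if any(value == 1 for value in coded.values()) else 0
    return coded


# Hypothetical site, concentrations in ug/L.
site = {
    "1,1-dichloroethane": 0.0,
    "1,1,1-trichloroethane": 3.1,
    "trichloroethylene": 6.0,
}
detection = code_site(site, "detection")
pmcl = code_site(site, "pmcl")
```

For this hypothetical site, 1,1,1-trichloroethane is coded one at the detection level but zero at the PMCL, while trichloroethylene is coded one at both.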
Proc. 6th Int. Conf. on Mathematical Modelling

TABLE 1. VOC Contaminant Levels in ug/L and Frequency of Occurrence

VOC                               Detection              RMCL                   PMCL
                                  ug/L  occurrence*      ug/L  occurrence       ug/L  occurrence
1,1-dichloroethane                0.2   41 (0.043)       --    --               --    --
cis/trans-1,2-dichloroethylene    0.2   54 (0.057)       70    1 (0.001)        --    --
1,1,1-trichloroethane             0.2   78 (0.083)       200   0 (0.000)        200   0 (0.…)
carbon tetrachloride              0.2   30 (0.032)       0.2   30 (0.032)       5     3 (0.…)
trichloroethylene                 0.2   91 (0.096)       0.2   91 (0.096)       5     26 (0.…)
tetrachloroethylene               0.2   79 (0.084)       --    --               --    --
any of the above                        189 (0.200)            115 (0.122)            28 (0.…)

*Numbers in parentheses indicate proportion of total sample (n = 945).
TABLE 2. Questionnaire Derived Independent Variables

variable    description
ID          Sample id number.
TTLWELL     Total number of wells.
CTWELL      Number of wells contributing to the sampling point.
CONF        Percent total pump capacity from confined aquifers contributing to the sampling point.
WT          Percent total pump capacity from water tables contributing to the sampling point.
CLAY        Clay overburden.
SAND        Sand overburden.
LOAM        Loam overburden.
OTHER       Other overburden.
MGD         Average daily production.
GROUPS      Number of groups of wells.
CLOSE       Closest well distance.
FAR         Farthest well distance.
HOLD        Holding facilities used.
CHLOR       Chlorination used.
Treatment:
TA          Non-Chlorine Disinfection.
TB          Coagulation.
TC          Sedimentation.
TD          Filtration.
TE          Lime Soda Softening.
TF          Ion Exchange Softening.
TG          Aeration.
TH          Ammoniation.
TI          Iron Removal.
TJ          Activated Alumina.
TK          Corrosion Control.
TL          Fluoride Addition.
TM          Fluoride Removal.
TN          Gran. Activated Carbon.
TO          Other.
TRTD        Percent water treated.
WELLT       Treatment at wells.
CNTLT       Number central treatment plants.
PROX        Close indus./comm. activity.
PRXA        Number of wells within 1/2 mile of activity.
PRXB        Number of wells 1/2 to 1 mile from activity.
PRXC        Number of wells 1 to 3 miles from activity.
PRXD        Number of wells 3 to 10 miles from activity.
Indus./comm. activity:*
VXA, WXA    Dry cleaning.
VXB, WXB    Aviation.
VXC, WXC    Machine Shop.
VXD, WXD    Metal Fabrication.
VXE, WXE    Electroplating.
VXF, WXF    Refineries.
VXG, WXG    Chemical Plants.
VXH, WXH    Dumps/Landfills.
VXI, WXI    Haz. Waste Disposal.
VXJ, WXJ    Industrial Septic.
VXK, WXK    Home Septic.
VXL, WXL    Industrial Pits.
VXM, WXM    Other.
CONCERN     Community concern expressed regarding water quality.

*VX variables represent indus./comm. activity within 3 miles of the well field. WX variables represent indus./comm. activity 3 to 10 miles from the well field.
PRINCIPAL COMPONENTS ANALYSES
Principal components analyses were conducted on both the dependent VOC data and the independent data to look for underlying relationships among the variables. In each case, varimax rotation was used, which attempts to weight each variable heavily on one component and very lightly on all others.

For the dependent VOC data, two pairs of VOCs were linked, with the occurrence of the other VOCs essentially independent. For the first pair linked, cis/trans-1,2-dichloroethylene and tetrachloroethylene, no explanation was apparent. Dichloroethylene was commonly used as an organic solvent and in the production of other chlorinated compounds. Tetrachloroethylene's primary uses included dry cleaning, textile processing, metal degreasing and fluorocarbon production (9). Textile processing and metal degreasing seem to have been the common links between these two VOCs, though no specific relationships were established. For the second pair linked, 1,1-dichloroethane and 1,1,1-trichloroethane, a substantial relationship was identified. Dichloroethane's primary use was in the production of trichloroethane. Trichloroethane, in turn, had a wide variety of applications from cleaning electronic components to missile hardware, with an increasing importance in the dry cleaning industry (9). A complementary and a supplementary variable were created for each VOC pair. The complementary variable accounted for the presence of both VOCs, while the supplementary variable accounted for the presence of either VOC.

The principal components analysis of the independent variables identified twenty-nine components. Data from the USEPA questionnaire weighed on several different components, indicating that groups of questionnaire variables contributed unique pieces of information to the database. Composite binary variables were developed for each component having two or more binary variable constituents.
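The complementary and supplementary variables for a linked VOC pair amount to a logical AND and a logical OR of the two zero-one indicators. A minimal sketch follows; the function name and the example data are hypothetical.

```python
def pair_variables(first, second):
    """Build the two pair variables from two equal-length zero-one lists.

    complementary: one only where both VOCs are present (logical AND)
    supplementary: one where either VOC is present (logical OR)
    """
    complementary = [a & b for a, b in zip(first, second)]
    supplementary = [a | b for a, b in zip(first, second)]
    return complementary, supplementary


# Hypothetical detection indicators at four sites.
dichloroethane = [1, 0, 1, 0]
trichloroethane = [1, 1, 0, 0]
both, either = pair_variables(dichloroethane, trichloroethane)
```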
UNIVARIATE ANALYSES
In order to "weed out" less significant candidate variables, univariate t and chi-square tests were conducted. All of the non-binary questionnaire-derived variables were subjected to t tests conducted at the 5% level, allowing for simultaneous hypothesis testing. Two groups of chi-square tests were conducted, one on the binary questionnaire variables and one on the binary composite variables derived from the principal components analysis. Both groups were tested at the 5% level, allowing for simultaneous hypothesis testing. Significant composite variables were compared with one another. Only those that were most representative or inclusive were chosen as candidate variables.

An additional binary variable called SIZE was included as a candidate variable for all regressions. SIZE was set to one if the population was greater than 10,000, and set to zero if the population was less than or equal to 10,000. The Groundwater Supply Survey report suggested that this distinction was important to characterize the nature of each component. SIZE was used to incorporate this distinction into the present study.
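As an illustration of the chi-square screening step, the sketch below tests a binary candidate variable against a binary VOC indicator using a 2 x 2 table. The paper does not say how simultaneous testing was allowed for; a Bonferroni division of the 5% level is assumed here purely for illustration, and the variable names and cell counts are hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (1 df, no continuity correction).

    The table is [[a, b], [c, d]]; all four margins must be non-zero.
    """
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


def p_value_1df(statistic):
    """Upper-tail probability of a chi-square variate with one degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))


def screen(tables, alpha=0.05):
    """Keep variables significant at alpha divided by the number of tests.

    The Bonferroni division is an assumption, not the paper's stated method.
    """
    per_test = alpha / len(tables)
    return [
        name
        for name, cells in tables.items()
        if p_value_1df(chi_square_2x2(*cells)) < per_test
    ]


# Hypothetical counts: (exposed & contaminated, exposed & clean,
#                       unexposed & contaminated, unexposed & clean)
tables = {
    "VXA": (30, 70, 11, 361),
    "TL": (10, 90, 37, 335),
}
candidates = screen(tables)
```

With these made-up counts only the first variable survives the screen.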
REGRESSION AND RESULTS
Logistic multiple regression was used to fit the independent candidate variables, identified by the univariate t and chi-square tests, to the dependent binary VOC data. The probability, P, that the dependent variable was one (that is, contamination was present) was estimated by:

$$P(y_i = 1) = \frac{1}{1 + \exp(-\alpha - B X_i)}$$

where $y_i$ was the ith dependent observation, $X_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ was the p x 1 vector of independent variables for the ith observation, $B = (B_1, B_2, \ldots, B_p)$ was the 1 x p vector of regression parameter estimates, and $\alpha$ was the intercept term. The regressions were done using the SAS LOGIST procedure in the backward stepwise mode with the significance levels for entering the model and staying in the model set at 10% (10). Regression results are listed in Table 3.

The data set was randomly split into two parts, one used for regression equation development and the other set aside, used later to test the results. Regression results were essentially limited to the detection level. RMCL equations were developed only where the RMCL value was equal to the detection level value. No equations were successfully developed at the PMCL. Three examples of the regression results are presented here.

The final regression model for 1,1-dichloroethane at the detection level included only one independent variable. This variable was a binary composite, derived from the principal components analysis, set to one if, within three miles of the well field, dry cleaning, machine shop or metal fabricating activities took place. 1,1-Dichloroethane's primary use was in the production of 1,1,1-trichloroethane. Trichloroethane, in turn, had a wide variety of solvent applications like vapor degreasing and dry cleaning (9). Hence it appears, given the close match of composite variable activities to the uses of trichloroethane, that dichloroethane may have often been a contaminant in the trichloroethane used for these activities.
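The logistic equation can be evaluated directly. The sketch below uses the values listed in Table 3 for the "any of the VOCs" model (intercept -1.89, D52 0.95, VXC 0.66), reading the parenthesized entries as coefficient estimates; that reading, the function name and the example sites are assumptions made for illustration.

```python
import math


def contamination_probability(alpha, coefficients, x):
    """P(y = 1) = 1 / (1 + exp(-alpha - B.x)) for one observation x."""
    linear = alpha + sum(b * xi for b, xi in zip(coefficients, x))
    return 1.0 / (1.0 + math.exp(-linear))


# Values as read from Table 3 for the "any of the VOCs" model.
ALPHA = -1.89
B = [0.95, 0.66]  # D52 (composite treatment variable), VXC (machine shop within 3 miles)

# Hypothetical sites: neither condition present, and both present.
p_neither = contamination_probability(ALPHA, B, [0, 0])
p_both = contamination_probability(ALPHA, B, [1, 1])
```

Under these assumptions the predicted probability rises from about 0.13 when neither condition holds to about 0.43 when both hold.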
Cis/trans-1,2-dichloroethylene was positively related to two variables, sedimentation treatment and metal fabrication. Its primary use was as a low temperature extraction solvent for organic materials (9). Dichloroethylene's solvent applications probably related to its use as a rinse or in vapor degreasing to remove oil and other organic substances, such as is typically done during metal fabrication. Sedimentation treatment of groundwater drawn for public supplies may provide another reflection of the existence of similar conditions.

The model for the variable representing the presence of any of the VOCs at the detection level included two independent variables. The first was a composite treatment variable set to one if lime soda softening or corrosion control were practiced by the water treatment facility. The second was machine shop activity within three miles of the well field. This equation broadly reflects the common VOC application, vapor degreasing, as used in machine shop activities. The composite treatment variable was more difficult to explain, but may be related to metal removal and acid solution control related to the solvent applications.

TABLE 3. Regression Results

VOC
    Detection (ug/L, variable) / RMCL (ug/L, variable) / PMCL (ug/L, variable)

1,1-dichloroethane
    Detection: 0.2; intc. (-4.33), D115 (7.68)
    RMCL: No RMCL value
    PMCL: No PMCL value
cis/trans-1,2-dichloroethylene
    Detection: 0.2; intc. (-3.06), TC (1.32), VXD (0.78)
    RMCL: 70; No Equation
    PMCL: No PMCL value
1,1,1-trichloroethane
    Detection: 0.2; intc. (-2.50), D116 (1.06), CONF (-0.01)
    RMCL: 200; No Equation
    PMCL: No PMCL value
carbon tetrachloride
    Detection: 0.2; intc. (-3.53), TA (2.84)
    RMCL: 0.2; intc. (-3.53), TA (2.84)
    PMCL: No Equation
trichloroethylene
    Detection: 0.2; intc. (-2.97), D45 (0.63), D121 (0.70), SIZE (0.76)
    RMCL: 0.2; intc. (-2.97), D45 (0.63), D121 (0.70), SIZE (0.76)
    PMCL: No Equation
tetrachloroethylene
    Detection: 0.2; No Equation
    RMCL: No RMCL value
    PMCL: No PMCL value
any of the VOCs
    Detection: intc. (-1.89), D52 (0.95), VXC (0.66)
    RMCL: intc. (-2.68), TRTD (0.01)
    PMCL: No Equation
tetrachloroethylene OR cis/trans-1,2-dichloroethylene
    intc. (-2.48), TK (1.05), VXC (0.74)
tetrachloroethylene AND cis/trans-1,2-dichloroethylene
    intc. (-5.07), D114 (2.06)
1,1-dichloroethane OR 1,1,1-trichloroethane
    intc. (-2.38), TK (1.05), D116 (0.84), CONF (-0.01)
1,1-dichloroethane AND 1,1,1-trichloroethane
    intc. (-5.03), D115 (2.29)

SENSITIVITY ANALYSIS: THE GENERAL CASE
A sensitivity analysis was conducted to examine applications of the regression equations to the reserved part of the data set. The analysis was designed to establish benchmarks in identifying the number and characteristics of sites to be monitored. The analysis was based upon a general case, in which a choice had to be made to monitor or not to monitor a site. Then it would happen that the site was correctly determined to be either contaminated or not contaminated, leading to four possible outcomes from each monitor/not monitor decision.

Four components were considered for determining the utility of each decision/outcome pair: the cost of monitoring; remedial costs associated with pollution mitigation at contaminated sites; avoided risk; and the benefit of environmental awareness. The components were assigned utility values, as indicated below:
1. monitoring: 1.
2. environmental awareness: 5.
3. remedial cleanup: 100.
4. avoided risk: 200.

Monitoring represented the smallest component. Environmental awareness was assigned a slightly higher, but comparable value. Remedial cleanup was given a higher utility value. Avoided risk was given an even higher utility value than remedial cleanup (in any other case, remedial action could not have been justified). The utility of each decision/outcome pair was evaluated in terms of these four components. The individual utilities formed part of the basis of calculating an expected utility for each decision option.
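The way the four components might combine into a utility for each decision/outcome pair, and then into a total for a given probability threshold, can be sketched as follows. The component values (1, 5, 100, 200) are those listed in the text; the signs and the combination rule for each pair are assumptions made here for illustration only, since the text does not spell them out, and the example sites are hypothetical.

```python
# Component utility values from the text.
MONITORING = 1
AWARENESS = 5
CLEANUP = 100
AVOIDED_RISK = 200


def outcome_utility(monitor, contaminated):
    """Utility of one decision/outcome pair (assumed combination rule)."""
    if monitor and contaminated:
        # pay for monitoring and cleanup, gain awareness and avoided risk
        return -MONITORING + AWARENESS - CLEANUP + AVOIDED_RISK
    if monitor:
        # pay for monitoring, gain awareness only
        return -MONITORING + AWARENESS
    if contaminated:
        # contamination goes undiscovered, so the risk is not avoided
        return -AVOIDED_RISK
    return 0


def total_utility(sites, threshold):
    """Sum outcome utilities when every site with p >= threshold is monitored.

    sites is a list of (predicted probability, contaminated) pairs.
    """
    return sum(outcome_utility(p >= threshold, hit) for p, hit in sites)


# Hypothetical reserved sites.
sites = [(0.60, True), (0.40, False), (0.10, True), (0.05, False)]
utility_at_half = total_utility(sites, 0.5)
utility_all_monitored = total_utility(sites, 0.0)
```

Under this assumed rule monitoring never has a lower utility than not monitoring, so the total is largest when the threshold admits every site.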
ANALYSIS OF UTILITIES
The basis of the monitor/not monitor decision at each site was the establishment of thresholds for judging that the predicted probabilities of contamination were sufficient to warrant monitoring. In order to test the regression results, the predicted probabilities for contamination were calculated for the sites in the reserved data set. Then the expected utility associated with each probability threshold value was calculated using the weighted sums for each of the four outcomes. Weights were assigned as the number of cases for each outcome, defined by the probability threshold value and the contamination data of the application data set. The total expected utility associated with each probability threshold value was the sum of the utilities of the four outcomes evaluated with respect to that probability value.

Given that no external (for example, budget) restrictions were placed on the number of sites monitored, and that the component relationships were essentially fixed with respect to one another, the anticipated result emerged: expected utility was maximized when all sites were monitored.

Typically, budget considerations limit the number of sites that can be monitored. In response to such constraints an analysis was done to evaluate the incremental utility gained from monitoring one to n sites. First a set of sites was ranked according to descending probability of predicted contamination, with sites having identical probabilities randomly ordered. For comparison, the same set of sites was ranked by descending probability, and then by descending population for sites with identical probabilities. Then the expected utility was calculated for each site in both the population-corrected and uncorrected sets.

An example of the results is graphically displayed in Figure 1, which corresponds to the regression results for 1,1-dichloroethane, discussed above. Once the data had been graphed, natural cutoff points, where plateaus occurred, were identified, indicating reasonable choices for numbers of sites to monitor. Then, the probability values were checked for natural breaks near the ones established from the graphs.

Population-corrected data should improve the selection process since such a ranking more effectively characterized avoided risk, giving large populations with predicted contamination problems preference for monitoring over smaller populations with the same likelihood of contamination. In the case of Figure 1, the population correction did improve site selection. For the uncorrected data, it appeared that approximately 200 sites should be monitored, corresponding to a break in the predicted probability at 214 sites. The population-corrected data suggested that only 125 sites could be monitored, a 75 site decrease with only a slight corresponding drop in expected utility, which remained positive. This example suggested that prioritizing monitoring site selection on the basis of population could be useful.

Not all of the VOC-based dependent variables behaved so well. However, when coupled with probability predictions, the overall results were somewhat unclear due to the interaction of population and probability criteria. So, although ordering by population was generally useful, rules for establishing the probability level at which to monitor should first be developed. Then population considerations could be used to refine the choice and reduce the number of sites to be monitored.

Figure 1. 1,1-Dichloroethane per Site Expected Utility Analysis. (Plot against sites monitored, 100 to 400, with vertical axis ticks from -6000 to 2000. Population corrected data are indicated by the boldface line.)

Using 1,1-dichloroethane as an example, the sensitivity of the results to the relative magnitudes of the utility values was tested while maintaining the assumption that the value of avoided risk must be greater than the value of remedial cleanup (to justify remedial action at contaminated sites). The results showed that site selection was improved using monitoring criteria only when avoided risk was relatively high. Ranking sites initially by descending avoided risk may have further improved the site selection process, but might be difficult. Such a ranking system would be akin to ranking sites according to their present and future uses, which would be an intricate and time intensive process.
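The two site rankings compared in the utility analysis can be sketched as a sort with and without a population tie-break. The field names and example sites are hypothetical. One difference from the text is flagged in the comments: the uncorrected ranking here leaves tied sites in their input order rather than ordering them randomly.

```python
def rank_sites(sites, population_corrected):
    """Rank sites by descending predicted probability of contamination.

    With population_corrected=True, ties in probability are broken by
    descending population. Otherwise ties keep their input order (the
    paper ordered them randomly; Python's sort is stable).
    """
    if population_corrected:
        return sorted(sites, key=lambda s: (-s["p"], -s["population"]))
    return sorted(sites, key=lambda s: -s["p"])


def select_sites(sites, n, population_corrected=True):
    """Return the ids of the first n sites in the chosen ranking."""
    ranked = rank_sites(sites, population_corrected)
    return [s["id"] for s in ranked[:n]]


# Hypothetical sites with tied predicted probabilities.
sites = [
    {"id": "A", "p": 0.3, "population": 2000},
    {"id": "B", "p": 0.7, "population": 500},
    {"id": "C", "p": 0.3, "population": 15000},
    {"id": "D", "p": 0.7, "population": 40000},
]
corrected = select_sites(sites, 3)
uncorrected = select_sites(sites, 3, population_corrected=False)
```

With a budget of three sites, the corrected ranking picks the larger of the two tied low-probability sites, which is the effect the population correction is meant to have.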
Making the assumption that avoided risk is relatively high with respect to monitoring, especially where large populations are dependent on groundwater drinking supplies, is reasonable. Given that assumption, in conjunction with justifying remedial action in all cases of contamination, the methodology presented in this paper yields sensible and applicable results.

CONCLUSIONS
The methodology, as outlined in this paper, proved to be generally useful. Results, on the whole, had reasonable physical explanations. Logistic regression results provided good matches among VOC uses and activities adjacent to the well fields.

It appears that sites at the most risk of contamination have vapor degreasing, dry cleaning and/or electroplating activities located close to the well fields. However, confined aquifers seem to lower the likelihood of contamination. Unfortunately it was found that sites to be monitored for the presence of any of the six VOCs included in the study are more difficult to target than sites to be monitored for the presence of individual VOCs.

The analysis should be repeated on the basis of the population size distinction discussed earlier in the paper. Then results from the separate analyses should be compared with those obtained from this analysis, looking for improvements in predictive capability, especially for those of the six VOCs that had high VOC concentration data which related to small or large populations. Results for the VOCs not exhibiting this population-based distinction should be compared to examine the consistency in their findings. Particular attention should be given to the model describing any VOC contamination, which may be the most important model examined since monitoring for a host of contaminants versus individual ones allows for a more comprehensive and uniform method of chemical analysis. Results for that model in the present study were unsatisfactory. Perhaps they could be improved in the revised method.

The sensitivity analysis confirmed that expected utility is maximized when all sites are monitored. This result is based on the assumption that remedial action is always justified for every case of contamination, or that risk avoided by taking remedial action always makes remedial action worthwhile. However, given budgetary restrictions, monitoring all sites is infeasible. Prioritizing monitoring site selection by population improves the selection process by weighting avoided risks without markedly differing from the non-degradation goal.

REFERENCES
1. The Bureau of National Affairs, "Senate Approves Water Act Reauthorization: Leaders Claim Margin to Override Reagan Veto," Environment Reporter, Volume 17, Number 39, The Bureau of National Affairs, Washington, D. C. (January 23, 1987).
2. The Bureau of National Affairs, "Congress Overrides President's Veto to Renew Water Act into 1990s by Big Margin," Environment Reporter, Volume 17, Number 41, The Bureau of National Affairs, Washington, D. C. (February 6, 1987).
3. U. S. Water News, "EPA to survey nation's wells for ag chemical residues," U. S. Water News, Volume 3, Number 9, U. S. Water News and the Freshwater Foundation, Halstead, Kansas (March, 1987).
4. Westrick, James J., J. Wayne Mellow, and Robert F. Thomas, "The Groundwater Supply Survey: Summary of Volatile Organic Contaminant Occurrence Data," U. S. Environmental Protection Agency, Cincinnati, Ohio (June, 1982).
5. Mainstream, "EPA Moves Toward Final Drinking Water Regs," Mainstream, Volume 29, Number 12, American Water Works Association, Denver, Colorado (December, 1985).
6. Munro, Nancy B., and Curtis C. Travis, "Drinking-water standards: Risks for chemicals and radionuclides," Environmental Science and Technology, Volume 20, Number 8, American Chemical Society, Washington, D. C. (August, 1986).
7. Dowd, Richard M., "EPA drinking-water proposals: Round One," Environmental Science and Technology, Volume 19, Number 12, American Chemical Society, Washington, D. C. (December, 1985).
8. Kennedy, Kevin M., &W&.iu Areas of ter w, Kevin M. Kennedy, Saint Louis, Missouri (August, 1985).
9. Grayson, Martin, Executive Editor, Kirk-Othmer Encyclopedia of Chemical Technology, Third Edition, John Wiley and Sons, New York, New York (1984).
10. SAS Institute, Inc., SUGI Supplemental Library User's Guide, 1983 Edition, SAS Institute, Inc., Cary, North Carolina (1983).