Int. J. Man-Machine Studies (1992) 36, 327-336
Chemical environmental models using data of uncertain quality

D. A. SWAYNE
Computing and Information Science Department, University of Guelph, Guelph, Ontario, N1G 2W1, Canada and ES Aquatic Inc., Guelph, Ontario, Canada

D. C.-L. LAM AND A. S. FRASER
National Water Research Institute, Burlington, Ontario, L7R 4A6, Canada

J. STOREY
ES Aquatic Inc., Guelph, Ontario, Canada
The application of knowledge-based systems to modelling lake acidification is outlined. The approaches taken by our group are described: using expert systems to incorporate human understanding and qualitative reasoning in the selection of models of the progress of acidity, and to overcome problems of data noise and scale variation in the spatial and temporal domains.
1. Introduction

This paper reports on, and abstracts the experience from, the use of knowledge-based systems (Swayne & Fraser, 1986) in the large-scale modelling of the effects of acid rain on lakes in Canada (RMCC, 1990). Complex geochemical models are derived from chemical kinetic equations to explain elevated levels of acidity in lakes, measured directly as pH or indirectly as depression of alkalinity. Inputs include: background chemistry, estimates of water evaporation and run-off rates, the ability of soil and rock to neutralize acidity, and the input of acids from industrial activity. Theoretical considerations dictate that one or two mechanisms for acid neutralization dominate in each particular instance, and different models are based on these considerations. Thus a region of alkaline soil will favour a neutralization mechanism involving dissolved salts in groundwater, whereas lakes with barren, rocky surroundings will (possibly) react more directly to acid reduction, unless the surroundings are naturally acidic (from vegetation or wetlands). The models have required abstraction from large datasets which most often were not collected for the purpose of modelling acidification. The inputs to the models come from data collections which are not necessarily coordinated in space or time. Integrative approaches are needed to combine the necessary data in a meaningful way. Data with missing or biased components must be detected and edited or removed. Related environmental issues share the same problems in their database inputs and model calculations: run-off and general water quality modelling, the effects of
© 1992 Academic Press Limited
mining activity, and non-point-source pollutant modelling (such as from agro-industrial activity).

2. The domain

Prior to the evolution of large-scale industrial activity, the geochemistry of Canada's lakes would have been distributed according to background acidity, largely from organic sources, with counteracting effects from carbonate rocks and other elements in the soils through which the waters pass. The input of industrial acids through rain directly affects the pH (the negative logarithm of hydrogen ion concentration), or reacts first with natural buffering agents. The presence of metallic ions above background levels is a typical indicator of stress, as are the dissolved acids themselves. For background, see Henriksen (1982), Bobba et al. (1986), Jones et al. (1984), Lam et al. (1986) and Schnoor et al. (1986).

Various models are based on the consumption and transport of the known acidic inputs and their supposed pre-industrial levels. Assuming successive transitions through various equilibrium states, the models attempt to predict a final asymptotic pH level appropriate to the geochemistry of the locality. Two such models are illustrated in Figure 1. In Figure 2, a somewhat larger suite of models is plotted against available real data. Figure 2 illustrates the difficulty of selecting one model over another for any given geographical region of Canada. The small vertical bars in the figure are the data ranges (10-90 percentile) and the lines join the respective model predictions based on the asymptotic effects of current acid loading. These data values are distributed over the aggregates (Figure 3) chosen for supposed simplification of the problem. These aggregates are selected for proximity and geological similarity. Seemingly, no model is applicable to all aggregates, and for some, no model appears to work. In our assembly of models, however, we have succeeded in fitting the distribution of observations rather well.
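The idea of mechanism-dependent asymptotic models can be sketched in a few lines of code. The sketch below is purely illustrative: it is not the actual CDR or TD formulation, and the neutralization fractions are invented numbers chosen only to show how an alkaline-soil watershed and a barren-rock watershed would diverge under the same loading.

```python
# Toy steady-state acidification models (illustrative only; NOT the actual
# CDR or TD formulations discussed in the text). Each model maps current
# sulfate loading to an asymptotic ANC via a mechanism-dependent
# neutralization factor.

def anc_asymptotic(anc_background, sulfate_load, neutralization_fraction):
    """Predict final ANC (ueq/L): background ANC minus the un-neutralized
    part of the acid loading."""
    return anc_background - (1.0 - neutralization_fraction) * sulfate_load

# Hypothetical mechanisms: alkaline-soil watersheds neutralize more of the
# incoming acid than watersheds with barren, rocky surroundings.
alkaline_soil = anc_asymptotic(200.0, 100.0, neutralization_fraction=0.8)
barren_rock = anc_asymptotic(200.0, 100.0, neutralization_fraction=0.2)

print(alkaline_soil)  # 180.0
print(barren_rock)    # 120.0
```

The same loading thus yields very different asymptotic chemistry depending on which neutralization mechanism dominates, which is why model selection per region matters.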
Our accomplishment would have been rendered much more difficult without applying knowledge-based principles.

3. Environmental data

Environmental data are often collected for a specific purpose and re-used for purposes other than those for which the data were originally intended. Such re-use and
FIGURE 1. Sample asymptotic (steady-state) models, CDR and TD.
FIGURE 2. Model-data comparisons. Vertical bars represent data ranges by aggregate, with lines joining median predicted acid neutralization capacity (ANC) by six models: observed, CDR, CDR-LTH, CDR-BO, TD, TD-LTH and TD-BO. Horizontal axis: watershed aggregate number.
multiple use is wise conservation of the public purse. However, extrapolation from any single data collection requires detailed (expert) knowledge of the relevance of that collection. Some data are collected frequently over long periods (lasting decades) for research. Other data, collected for monitoring, are sparse (once yearly or, in extreme cases, a single observation). Scientific data measurement changes over time. Systems suffer from bias in methodology (which may not be recorded) for periods of time. The techniques for the measurement of sulfate, for example, change from collection to collection. Detection limits, baselines and units of measure (gram-molecular vs sample weight) vary. The size of the database makes such considerations problematic: error in recording and maintenance is sporadic, and requires quality control measures to be automated, such as reasonableness checking and units analysis; ionic balance
FIGURE 3. Aggregates of Eastern Canada (not to scale).
FIGURE 4. Distribution of water quality sampling stations (monitoring stations in Eastern Canada), superimposed on tertiary watershed map.
principles; data which have been subjected to smoothing or interpolation/extrapolation without retention of the algorithms employed; data below detection limits; data with large uncertainties; and contaminated samples or faulty instrumentation.

There are often spatial scale differences among components in data collection. Access to a region may be limited by remoteness. Figure 4 illustrates the uneven distribution of sampling stations. Conversely, some regions are covered extensively. The analysis must reflect geographical considerations, which may not be accurately represented in the distribution of observations. Data with uneven spatial or temporal distributions may, because of scaling, provide an incomplete or distorted picture. In Canada, physical access is one obstacle to uniform distribution in space. Secondary characteristics (such as geology and rainfall) which apply on larger spatial scales have been used to reinforce or corroborate extrapolation from measurement. Linkage between well-covered regions and sparsely sampled ones is necessary. However, sources of interference abound. Water quality information is local in nature, with a "zone of influence" restricted to a few square kilometres or less. Rainfall, temperature, wind and thus deposition vary on a larger length scale. Samples from these latter spatial domains may be more representative of a trend than one from the former. The domains are possibly biased towards one geographical direction or feature: data from a river mouth may not provide accurate information about a lake, but may instead be more representative of part of the river. Bias may also result from unrelated chemical inputs (from road salting, for example). These extraneous factors, possibly known at the time of sampling, are usually lost with the passage of time and the centralized accumulation of data.
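An automated quality-control measure of the kind mentioned earlier, the ionic balance check, can be sketched as follows. The field names, the choice of major ions, and the 10% tolerance are all illustrative assumptions, not the actual screening rules used in the project.

```python
# Sketch of an automated QC check on a water-chemistry record: the charge
# balance of major ions (concentrations in ueq/L) should close to within a
# tolerance; records that fail are flagged for expert review rather than
# silently used. Field names and the 10% tolerance are illustrative.

def ionic_balance_ok(sample, tolerance=0.10):
    cations = sample["Ca"] + sample["Mg"] + sample["Na"] + sample["K"]
    anions = sample["SO4"] + sample["Cl"] + sample["ANC"]  # ANC stands in for bicarbonate
    total = cations + anions
    if total == 0:
        return False  # an all-zero record is itself suspect
    imbalance = abs(cations - anions) / total
    return imbalance <= tolerance

good = {"Ca": 100, "Mg": 40, "Na": 30, "K": 10, "SO4": 90, "Cl": 20, "ANC": 75}
bad = {"Ca": 100, "Mg": 40, "Na": 30, "K": 10, "SO4": 90, "Cl": 20, "ANC": 10}

print(ionic_balance_ok(good))  # True
print(ionic_balance_ok(bad))   # False
```

A record that fails this check is not necessarily wrong (an incomplete analysis also fails), which is why such flags feed a person-in-the-loop review rather than automatic deletion.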
Time series in the environment are extremely noisy, and gradients may be harder to estimate accurately than cumulative measures. Long and short time-scales may
coexist within the same model. Weathering occurs over long time-intervals; seasonal values may be appropriate for other system variables; and episodic events like run-off are measured in hours or days. This stiffness may require relatively transient phenomena to be measured in a cumulative sense, while other equally important phenomena remain comparatively constant in time. Nature also provides "delivery delays". For example, when acid inputs are locked up in Winter snow, acid is delivered as a "pulse load" in Springtime. The short-term effect on wildlife may be more marked by this episodic effect than by the longer time-averaged acidity. For example, if eggs or larval populations are more sensitive to acidity, and occur mainly in Spring, then the effect of the acidity on a particular population may be greatly amplified, even though the yearly average acidity is much less marked.

Figure 5 illustrates the layout of our modelling paradigm, including the acceptance testing for the input data. We have used a "person-in-the-loop" to censor part of our data. Tasks are automated as experience permits. Code was written to detect "corruption" in the large datasets as we became familiar with their characteristics (which were mainly a function of their size and length of residence in large computer systems). For any environmental advisory system, a "smart" data diagnostic would be essential to improvement in functionality and reliability. We were faced with time pressures that allowed only partial automation.

Some data are not presented in numeric form. Soils data, for example, are already heavily screened and appear only as contours, with a sensitivity index. Deposition data are yearly averaged, and contoured or gridded; they are therefore, in a sense, very coarse compared to the water samples. Precipitation data are universally available as yearly or seasonal accumulations, and provide some localized detailed history.
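The time-scale problem described above can be made concrete with a small numeric sketch: the same daily acidity series aggregated two ways. The series and its units are invented for illustration; the point is only that a yearly mean hides the Spring pulse load that a short-window measure exposes.

```python
# Illustrative time-scale sketch: a yearly average hides the Spring "pulse
# load" of acid released by snowmelt. The daily series below is invented
# (300 quiet days, then a 65-day snowmelt pulse, arbitrary units).

daily_h_ion = [1.0] * 300 + [8.0] * 65

# Cumulative view: one yearly mean.
yearly_mean = sum(daily_h_ion) / len(daily_h_ion)

# Episodic view: the worst 30-day running mean, the scale relevant to
# acid-sensitive eggs and larvae present mainly in Spring.
spring_peak = max(
    sum(daily_h_ion[i:i + 30]) / 30
    for i in range(len(daily_h_ion) - 29)
)

print(round(yearly_mean, 2))  # 2.25
print(round(spring_peak, 2))  # 8.0
```

The episodic measure is several times the yearly average, which is why a model calibrated only on annual aggregates can misjudge biological impact.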
Even watershed surface area is not exactly known for many watersheds, and the number of lakes in many regions is only an estimate. The latter uncertainty stems partly from the threshold chosen for determining whether a body of water is a lake. We have been required to develop some visual tools for handling the large water quality datasets. Figure 6, the cloud plot, is useful for uncovering rules which separate populations of watersheds: the more distinct the data population subdivisions, the more unique the shape of the "hull" enclosing the data points. Figure 7, a plot of ionic balance, is useful for diagnosing faulty data. The molecular balance
FIGURE 5. Modelling paradigm (target scenarios; acceptance criteria; data analysis and model selection; knowledge and rule bases; correction models for missing data; empirical models; database).
FIGURE 6. Convex hulls ("cloud plots") of ions in several aggregate data populations.
of positive and negative ions (if a complete analysis is taken) should reflect the expectations of chemists modelling the system.

4. Models

Modelling is an exercise intended to explain the underlying causes behind changes in the system being examined. Models serve several purposes; among these: they form a basis for prediction; they support management of resources through "what-if" scenarios; they express hypotheses to be affirmed or contradicted; and they are tools for investigating possible research or data gaps. Figure 5 also illustrates the relationships between modelling and data analysis. As with the data used for modelling, any theoretical or empirical model which has been tested in a suite of settings, and for which axioms of appropriateness are
FIGURE 7. Ionic balance plot (used by data screening and modelling teams to visualize data populations for completeness and correctness of major ions).
assumed, should demonstrate knowledge of its likely validity in any particular application. The models themselves are as varied and complex as the environments from which they are derived. The ones chosen have largely measured acid-neutralizing capacity (ANC), which is comparable to alkalinity. Alkalinity is supposedly a dual measure of acidity, but it is not directly so, particularly when the data used for its estimate are noisy. Translation of model results to pH is then rendered difficult, requiring a range or distribution calculation. Organization into a common geographical frame of reference and scale has presented challenges. Models assume operation over an area with a certain specified parameter-set, but are calculated from poorly-distributed samples. Even such a simple item as acid reduction is geographical in nature, and local "hot-spots" could persist at higher levels even when an overall reduction is effected.

5. Implementation overview
Symbolic problem-solving techniques are of use in many aspects of these environmental models. Some of the analyses outlined thus far were completed and some only reached initial prototype status. One aspect which has been considerably refined is that of running the models in the context of the large-scale or regional assessment, which we now describe.

Several variants of the original system have been constructed for the problem (Ando, 1987; Lam et al., 1988; Lam et al., 1989). The most recent system (RMCC, 1990) works in the following, straightforward way. For each station, the known background factors are used to calculate a model ANC, which we label ALK1. ALK1 is compared against its nearest competition and against a reasonable (model-dependent) value range for its output ANC. If the value of ALK1 is acceptable and no other considerations (turbidity, dissolved organic carbon) apply, it is accepted. If no models appear acceptable, a value is selected and an alternate kept. If several models are accepted, one is selected and the alternates are retained. When an acid reduction scenario is calculated, the choice of model is subjected to the further check that several common-sense rules are not violated. For example, a switch in model choice can lead to a jump in acidity, even under a loading reduction. When any choice is rejected, the reason for its rejection is sought and an alternate choice accepted.

We have used the calculation in predictive mode by inverting the question posed. The prediction of so-called critical loading involves calculating the threshold sulfate deposition which would leave an acceptable level of acidity. It is calculated by a succession of intermediate attempts. This could be categorized as a "shooting" technique, since there is no underlying justification for the use of interpolants.
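The per-station selection step can be sketched as follows. The model names, plausibility ranges, and tie-breaking rule are illustrative assumptions standing in for the actual rule base; the structure (accept within a model-dependent range, retain alternates for later common-sense checks) follows the description above.

```python
# Sketch of the model-selection step: each candidate model yields an ANC
# estimate (ALK1); a model is accepted only if its estimate falls within its
# own model-dependent plausibility range, and alternates are retained for
# the common-sense checks applied later. Ranges and names are illustrative.

PLAUSIBLE_ANC = {           # model-dependent acceptance range (ueq/L)
    "CDR": (-50.0, 400.0),
    "TD": (0.0, 300.0),
}

def select_model(predictions):
    """predictions: {model_name: ALK1}. Returns (chosen, alternates)."""
    accepted = [
        (name, alk1) for name, alk1 in predictions.items()
        if PLAUSIBLE_ANC[name][0] <= alk1 <= PLAUSIBLE_ANC[name][1]
    ]
    if not accepted:
        # No model acceptable: keep everything as alternates for review.
        return None, list(predictions.items())
    chosen, *alternates = accepted  # first acceptable model wins here
    return chosen, alternates

chosen, alternates = select_model({"CDR": 150.0, "TD": 350.0})
print(chosen)      # ('CDR', 150.0)
print(alternates)  # []
```

Retaining the alternates is the key design point: when a later scenario check rejects the chosen model, an alternate is already at hand rather than requiring a recomputation.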
Critical loading scenarios have attracted attention for their sensitivity to the geographical distribution of the pollution sources and to the effect of variation in individual loading reductions at these sources. Where possible, and for current acid loading, data are split or jack-knifed in an attempt to validate the codes and rules. Predictions, based on a small scale (
failure to mirror the current state would entirely invalidate the approach. Fortunately we have been very successful in this modest test, and see no remedy for reduction-scenario validation short of that performed by the originators of the individual models and the actual results of any particular realization in nature.

6. Rationale for symbolic problem-solving systems
This project has features in common with other environmental modelling tasks. The process of formalizing the policy-level analysis of such an issue is being carried forward for non-point sources of pollution, for example. For other domains (such as mining and other industrial effluent) there are multiple sources, complex interactions, and many points of concern. Model synthesis in a single coherent information-processing environment is a difficult task.

The approach of using many small models with complex interconnection, as opposed to a single monolithic algorithmic model, has come under criticism at various times: why is this approach different? First-level answers lie in economy of computation. More detailed justification is arguably based on good science: models are placed in an approximation of the environment for which they were designed. Expert control is exercised over the output: if the models have a certain ordering based on their pH predictions, then their behaviour under changed circumstances can be subjected to a "reality check" using expert advice. We have also begun to use our system (cloud plots, geographical overlays) to develop aggregates based on acid deposition and soil sensitivity (Figures 8, 9). This may lead to simpler and more understandable models.

Qualitative reasoning is often an integral part of modelling. Parameter sets and variable interactions (even the choice of dependent variables and time-scales) are chosen from experience and observation, with obviously irrelevant factors discarded in the name of simplicity and structure. When models developed for one setting are transported to another, parameters are added and discarded (often with the aid of subjective evaluation rather than the rigor imposed by formal statistical means). The range of valid parameters for a model may be selected algorithmically in the procedure code of the implementation, and may even be announced in a run-time narrative of execution.
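One such common-sense "reality check", with its run-time narrative, can be sketched as below. The rule shown (predicted pH must not fall under a loading reduction) follows the example given earlier; the function name, message format, and pH values are illustrative, not the actual RAISON rule base.

```python
# Sketch of an expert "reality check" on model output: under a loading
# reduction, predicted acidity should not jump upward merely because the
# rule base switched models. The messages form a browsable run-time
# narrative; rule and wording are illustrative.

def reality_check(ph_current, ph_after_reduction, narrative):
    """Flag a scenario where reduced loading yields a lower pH (more acid)."""
    if ph_after_reduction < ph_current:
        narrative.append(
            "REJECT: pH fell from %.2f to %.2f under a loading reduction "
            "(likely a model switch); seeking alternate model."
            % (ph_current, ph_after_reduction)
        )
        return False
    narrative.append("ACCEPT: pH %.2f -> %.2f." % (ph_current, ph_after_reduction))
    return True

log = []
print(reality_check(5.1, 5.6, log))  # True
print(reality_check(5.1, 4.8, log))  # False
print(len(log))                      # 2
```

Because each decision appends a human-readable entry, the log itself becomes material for the post-mortem browsing discussed next.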
Criteria which have been applied are not, however, generally retained in the model outputs in any explicit fashion. Since symbolic problem solvers usually involve processing of text strings meaningful to a reader, they are useful even at this simple level to permit "post-mortem" browsing of the execution of models. Choices made become part of the knowledge base for the problem at hand, and an advisor is available to navigate the results, thereby providing insight into the local choices made in running a model. For the same reasons, results of models may be rejected by an experienced scientist if they do not conform to expectation. When this happens, some alternative answer must be calculated.

These considerations alone do not, however, justify the use of symbolic problem solvers, unless they permit improvement in the modelling process not possible otherwise. We feel that a substantial increase in utility over more conventional approaches has been achieved, partly because conventional tools have not yet absorbed the developments in the artificial intelligence community. Large multi-parameter models (often tens of thousands of lines of FORTRAN code) continue to be developed and refined. They may not be well-suited to
FIGURE 8. Soil classification.
piecewise explanation over a large geographical area, unless that facility is developed around the inputs, or added to the model outputs. By their very nature, they are quantitative; qualitative post-processing is often a new, distinct and difficult-to-implement function. Each of the parts of this large problem-solving system might not alone qualify as a candidate for AI tools, but the sheer size of the analysis, the diversity of problems, and the ease of maintaining and upgrading as new knowledge becomes available all contribute to the defence of our approach.

The support from Environment Canada for Swayne and Storey, in the form of several research contracts, is gratefully acknowledged. Several individuals have contributed to the development of RAISON. Isaac Wong, Jane Kerby and Yoko Ando, all of the University of Guelph, have worked on the development of rules for model applicability. Data analysis has been performed (and the knowledge shared) by Janice Jones and Fred Naroussian at NWRI.
FIGURE 9. Superposition of soil and deposition maps in creating new test aggregates.
References

ANDO, Y. (1987). An expert system (USRFUNO) on RAISON Micro, 12 pp. Unpublished Report DSS#K405-7-3166/01-SE, NWRI, Burlington, Ontario, Canada.
BOBBA, A. G., LAM, D. C.-L., JEFFRIES, D. S., BOTTOMLEY, D., CHARETTE, J. Y., DILLON, P. J. & LOGAN, L. (1986). Modelling the hydrological regimes in acidified watersheds. Water, Air and Soil Pollution, 31, 155-163.
HENRIKSEN, A. (1982). Changes in base cation concentrations due to freshwater acidification. Acid Rain Research Report 1/1982, NIVA, N-0314, Oslo 3, Norway.
JONES, M., MARMOREK, D. & CUNNINGHAM, G. (1984). Predicting the extent of damage to fisheries in inland lakes of Eastern Canada due to acidic precipitation, 91 pp. Report by ESSA, Toronto, Ontario, Canada.
LAM, D. C.-L., BOBBA, A. G., JEFFRIES, D. S. & KELSO, J. R. (1986). Relationship of spatial gradients of primary production, buffering capacity, and hydrology in Turkey Lakes Watershed. In B. G. ISOM, Ed. Impact of acid rain and deposition on aquatic biological systems, pp. 42-53. Philadelphia, PA: ASTM STP 928.
LAM, D. C.-L., FRASER, A. S., SWAYNE, D. A., STOREY, J. & WONG, I. (1988). Regional analysis of watershed acidification using the expert systems approach. Environmental Software, 3, 127-134.
LAM, D. C.-L., SWAYNE, D. A., STOREY, J. & FRASER, A. S. (1989). Regional acidification models using the expert systems approach. Ecological Modelling, 47, 131-152.
RMCC REPORT. (1990). National LRTAP Assessment Report, Federal/Provincial Research and Monitoring Coordinating Committee. Environment Canada, Ottawa.
SCHNOOR, J. L., LEE, S., NIKOLAIDIS, N. P. & NAIR, D. R. (1986). Lake resources at risk to acidic deposition in the Eastern United States. Water, Air and Soil Pollution, 31, 1091-1101.
SWAYNE, D. A. & FRASER, A. S. (1986). Development of an expert system/intelligent interface for acid rain analysis. Journal of Microcomputers in Civil Engineering, 1, 181-185.
THOMPSON, M. E. (1982). The cation denudation rate as a quantitative index of sensitivity of Eastern Canadian rivers to acidic atmospheric precipitation. Water, Air, and Soil Pollution, 18, 215-226.