Ecological Modelling, 40 (1988) 155-159
Elsevier Science Publishers B.V., Amsterdam - Printed in The Netherlands
A METHODOLOGY FOR DERIVING MODEL INPUT PARAMETERS FROM A SET OF ENVIRONMENTAL DATA
D.E. FIELDS and C.W. MILLER
Health and Safety Research Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831 (U.S.A.)
(Accepted 6 May 1987)
ABSTRACT
Fields, D.E. and Miller, C.W., 1988. A methodology for deriving model input parameters from a set of environmental data. Ecol. Modelling, 40: 155-159.
Selection of appropriate values for model input parameters requires knowledge of the distribution of these parameters. The TERPED computer code is a versatile methodology for determining with what confidence a parameter set may be considered to have a normal or lognormal frequency distribution and for determining appropriate representative values. Several measures of central tendency are computed. Other options include computation of the chi-square statistic, the Kolmogorov-Smirnov (KS) non-parametric statistic, and Pearson's correlation coefficient. Our implementation of the KS test uses algorithms that yield results valid for both large and small sample sizes. Cumulative probability plots are produced either in high resolution (pen and ink or film) or in printer-plot form.
INTRODUCTION
Knowledge of the distribution of values characterizing an observable quantity is crucial for discussing and utilizing these values. Such knowledge is especially valuable in selecting input parameters for mathematical models. Accurate specification of the distribution of model input parameters is perhaps even more important if Monte Carlo techniques are to be applied. Furthermore, predictions of Monte Carlo models may often be examined and compared by characterizing their statistical distribution. The importance of knowing the statistical distribution of model input parameters and predictions, and a desire for a convenient methodology for assessing this distribution, have been expressed by several investigators (Hoffman and Baes, 1979; Shaeffer, 1979). The SAS * methodology, for example, is useful but lacks certain capabilities, as will be discussed below.
We have developed and implemented an interactive computer code, TERPED (Fields, 1981, 1982; Kurtz and Fields, 1983a), to aid in analyzing a set of data and determining whether these data may be considered to be samples from either a normal or lognormal population. The code is written in FORTRAN and runs on a Digital Equipment Corporation PDP-10 computer; typical central-processor-unit execution time is about 0.32 s, exclusive of plotting time. The code size is about 1500 card images.

Research sponsored by the Office of Reactor Research and Technology, U.S. Department of Energy, under contract DE-AC05-84OR21400 with Martin Marietta Energy Systems, Inc.
0304-3800/88/$03.50 © 1988 Elsevier Science Publishers B.V.

STATISTICAL TESTS PERFORMED BY TERPED
For a set of m values ordered by increasing magnitude with index i, the cumulative probability may be computed (D.G. Gosslee, Oak Ridge National Laboratory, personal communication, 1972) by the following equation:

$$P_i = \frac{i - 0.375}{m + 0.25} \tag{1}$$
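As an illustration of equation (1), the following Python sketch computes the cumulative probability assigned to each ranked value. It is not taken from the TERPED FORTRAN source; the function names `plotting_positions` and `cumulative_probabilities` are hypothetical, and the three values in the usage note are made up for the example.

```python
def plotting_positions(m):
    """Return P_i = (i - 0.375) / (m + 0.25) for i = 1..m, as in equation (1)."""
    if m < 1:
        raise ValueError("need at least one value")
    return [(i - 0.375) / (m + 0.25) for i in range(1, m + 1)]


def cumulative_probabilities(values):
    """Pair each value, sorted by increasing magnitude, with its P_i.

    Hypothetical helper: the name and the return format are assumptions
    made for this sketch, not part of TERPED.
    """
    ordered = sorted(values)
    return list(zip(ordered, plotting_positions(len(ordered))))
```

For instance, `cumulative_probabilities([3.0, 1.0, 2.0])` assigns the middle value 2.0 a cumulative probability of 0.5. The positions are symmetric about 0.5, so for m = 16 the first and last probabilities (0.625/16.25 and 15.625/16.25) sum to 1.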
Values of cumulative probability are used in generating graphical output. The TERPED user may invoke a number of numerical algorithms to test a hypothesis about the distribution of the data. The user assumes either a normal or lognormal distribution, and the code first calculates cumulative probability values using equation (1). The user's data are then transformed into a linear function of cumulative probability assuming the hypothesized normal or lognormal distribution. Data linearization is by a spline fit (Ahlberg et al., 1967) of computed probability-of-occurrence values to a set of tabulated inverse error function values representing the number of standard deviations away from the mean value (50% cumulative probability value). If the user's assumption is a normal distribution, then the mean value $\bar{x}$ and variance $S^2$ are computed and output, whereas if the choice is a lognormal distribution, then the calculated values of the log mean $\mu$, log variance $\sigma^2$, most probable value $x_p$, log median $x_m$, log mean $\bar{x}$, and 99th quantile value $x_{99}$ are printed.
The user may assess the validity of this assumption before generating a plot data set by specifying that a chi-squared goodness-of-fit test (Kendall and Stuart, 1966) or a Kolmogorov-Smirnov (KS) one-sample test be performed (Siegel, 1956; Bradley, 1968, pp. 367-369). The degree of linearity of the cumulative probability plot is quantified as Pearson's correlation coefficient r (Snedecor and Cochran, 1980) between points corresponding to the input data and the equivalent values from a least-squares fit to the data. In either case (normal or lognormal), the assumed distribution (given or log-transformed) is tested against a normal distribution with the same mean ($\bar{x}$ or $\mu$) and variance ($S^2$ or $\sigma^2$) as the input data set. The goodness of fit between the data and the assumed distribution may also be calculated using a chi-squared test.
A second TERPED user option is the KS one-sample test (Siegel, 1956). The KS test considers the hypothesis of equality between the actual distribution of data and the assumed normal or lognormal distribution. The KS test is based on the maximum difference between corresponding terms of the value-ranked sets $x_i$ and $y_i$. We define

$$D = \max_{i=1,\ldots,m} |x_i - y_i|.$$

This value is compared to an expected statistic valid for that number of points, and a 'confidence level' is determined. The KS confidence level is the probability of D being exceeded if the hypothesis of equality of the two distributions is true and if the alternative is 'two sided', i.e., if a given measurement has equal chance of lying above or below the expected value. The numerical procedure we have developed for use in TERPED to compute the confidence level is valid for both large and small sample sizes (Kurtz and Fields, 1983a, b).
In summary, TERPED computes several measures of central tendency, the chi-squared probability, the correlation coefficient, and the KS statistic.

* Proprietary software distributed by SAS Institute, Inc., Raleigh, NC.

TERPED GRAPHICS OPTIONS
Several TERPED options control the generation of plots of user-supplied data versus cumulative probability. The cumulative probability axis is scaled so the data will lie along a straight line if the data are distributed as hypothesized.
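The linearization, least-squares fit, correlation coefficient, and KS statistic described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the TERPED implementation: the standard-library `NormalDist.inv_cdf` stands in for the spline fit to tabulated inverse error function values, the KS statistic is computed in its usual textbook form (largest gap between the empirical and the fitted normal cumulative distributions), and the small-sample confidence-level procedure of Kurtz and Fields is not reproduced. All function names are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev

STANDARD_NORMAL = NormalDist()


def linearize(values, lognormal=False):
    """Return (z, x): standard normal quantiles and the ranked data.

    z_i is the number of standard deviations from the mean that
    corresponds to the cumulative probability of equation (1);
    x is log-transformed first when a lognormal hypothesis is tested.
    """
    x = sorted(math.log(v) for v in values) if lognormal else sorted(values)
    m = len(x)
    z = [STANDARD_NORMAL.inv_cdf((i - 0.375) / (m + 0.25))
         for i in range(1, m + 1)]
    return z, x


def least_squares_line(z, x):
    """Slope and intercept of the least-squares line x = slope * z + intercept."""
    zbar, xbar = mean(z), mean(x)
    szz = sum((zi - zbar) ** 2 for zi in z)
    szx = sum((zi - zbar) * (xi - xbar) for zi, xi in zip(z, x))
    slope = szx / szz
    return slope, xbar - slope * zbar


def pearson_r(a, b):
    """Pearson's correlation coefficient between two equal-length sequences."""
    abar, bbar = mean(a), mean(b)
    num = sum((ai - abar) * (bi - bbar) for ai, bi in zip(a, b))
    den = math.sqrt(sum((ai - abar) ** 2 for ai in a)
                    * sum((bi - bbar) ** 2 for bi in b))
    return num / den


def ks_statistic(x):
    """Largest gap between the empirical CDF of sorted x and a normal CDF
    with the same mean and standard deviation as x (textbook form)."""
    m = len(x)
    fitted = NormalDist(mean(x), stdev(x))
    d = 0.0
    for i, xi in enumerate(x, start=1):
        f = fitted.cdf(xi)
        d = max(d, i / m - f, f - (i - 1) / m)
    return d
```

A typical call sequence would be `z, x = linearize(data, lognormal=True)`, then `slope, intercept = least_squares_line(z, x)`, then `pearson_r(x, [slope * zi + intercept for zi in z])` and `ks_statistic(x)`. Values of r close to 1 and small values of D both indicate that the points lie near the fitted straight line.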
A least-squares fit of the linearized values is plotted on the same graph to permit visual evaluation of the distribution hypothesis. Plots may be in printer-plot form, high-resolution Calcomp output, or 35 mm film plot form.
APPLICATION OF TERPED TO ANALYSIS OF CHEMICAL EXCHANGE PARAMETER VALUES
The TERPED program has been used to examine the distribution of values of the chemical exchange parameter ($K_D$) for strontium. The $K_D$ values used are the arithmetic means of each of 16 different experiments compiled by Baes and Sharp (1983). The goal of this exercise was to determine a representative $K_D$ value while minimizing any bias in favor of a single experimental method and at the same time covering a broad range of soil pH (4.5-9.0). Figure 1 is an example of an analysis assuming a normal hypothesis and utilizing the 'high-resolution' graphics option. Figure 2 shows the lognormal analysis of the same data. By comparing the two graphs, it can be seen that the lognormal assumption more closely describes the data.
Fig. 1. Normal (Gaussian) analysis of strontium $K_D$ values. [Horizontal axis: cumulative probability (%).]
Fig. 2. Lognormal analysis of strontium $K_D$ values. [Horizontal axis: cumulative probability (%).]
SUMMARY
The TERPED code is a versatile methodology for determining with what confidence a parameter set may be considered to have a normal or lognormal frequency distribution. TERPED should prove beneficial even to users having access to SAS. TERPED produces high-quality graphics, whereas SAS generates only printer plots unless the user has purchased additional, graphics-oriented software. The code described here is portable, fast, and interactive, and requires significantly less computer memory for execution than SAS.
REFERENCES
Ahlberg, J., Nelson, E. and Walsh, J., 1967. The Theory of Splines and Their Applications. Academic Press, New York, 284 pp.
Baes, C.F., III and Sharp, R.D., 1983. A proposal for estimation of soil leaching constants for use in assessment models. J. Environ. Qual., 12(1), 12 pp.
Bradley, J.V., 1968. Distribution-Free Statistical Tests. Prentice-Hall, Englewood Cliffs, NJ, 450 pp.
Fields, D.E., 1981. TERPED: A versatile code for examining the distribution of experimental data. Rep. ORNL-5689, Oak Ridge National Laboratory, Oak Ridge, TN, 85 pp.
Fields, D.E., 1982. Examining the distribution of experimental data. Rep. DOE/TIC/EG 81/125, Technical Information Center, U.S. Department of Energy, Oak Ridge, TN, 1 p.
Hoffman, F.O. and Baes, C.F., III (Editors), 1979. Statistical Analysis of Selected Parameters for Predicting Food Chain Transport and Internal Dose of Radionuclides. NUREG/CR-1004, U.S. Nuclear Regulatory Commission, Washington, DC, 201 pp.
Kendall, M.G. and Stuart, A., 1966. The Advanced Theory of Statistics, Vol. 2. Hafner, New York, 250 pp.
Kurtz, S.E. and Fields, D.E., 1983a. An analysis/plot generation code with significance levels computed using Kolmogorov-Smirnov statistics valid for both large and small samples. Rep. ORNL-5967, Oak Ridge National Laboratory, Oak Ridge, TN, 67 pp.
Kurtz, S.E. and Fields, D.E., 1983b. An algorithm for computing significance levels using the Kolmogorov-Smirnov statistic and valid for both large and small samples. Rep. ORNL-5819, Oak Ridge National Laboratory, Oak Ridge, TN, 39 pp.
Shaeffer, D.L., 1979. A model evaluation methodology applicable to environmental assessment models. Rep. ORNL-5507, Oak Ridge National Laboratory, Oak Ridge, TN, 53 pp.
Siegel, S., 1956. Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill, New York, 312 pp.
Snedecor, G.W. and Cochran, W.G., 1980. Statistical Methods (7th Edition). Iowa State University Press, Ames, IA, 507 pp.