Atmospheric Environment Vol. 24A, No. 9, pp. 2387-2395, 1990.
A STATISTICAL PROCEDURE FOR DETERMINING THE BEST PERFORMING AIR QUALITY SIMULATION MODEL

WILLIAM M. COX and JOSEPH A. TIKVART

United States Environmental Protection Agency, Office of Air Quality Planning and Standards (MD-14), Research Triangle Park, NC 27711, U.S.A.

(First received 19 September 1989 and in final form 21 February 1990)

Abstract--Air quality simulation models have been the subject of extensive evaluations to determine their performance under a variety of environmental and meteorological conditions. While much information has been gathered, no clearly defined methodology exists for comparing the performance of two or more models. The purpose of this paper is to present a statistically oriented procedure to test whether the performance of one model is superior to that of others, using a composite performance index together with the bootstrap resampling technique.

Key word index: Air quality models, model performance, model comparisons, statistical analysis, bootstrap resampling.
1. BACKGROUND AND PURPOSE
EPA has conducted an extensive program to evaluate the performance of air quality simulation models used for regulatory purposes (Cox and Tikvart, 1985). The statistical foundation for evaluating models was established as the result of two workshops sponsored jointly by EPA and the American Meteorological Society and summarized by Fox (1981, 1984). Based on recommendations from these workshops, EPA set out to develop a comprehensive library of statistics that summarizes the performance of various model categories. While this goal has been achieved (e.g. Londergan et al., 1982; EPA, 1985, 1986, 1987), the resulting array of statistics is so diverse and numerous that it is difficult to objectively compare the overall performance of competing models (EPA, 1984). Since completion of the original model evaluation studies, advances have been made in statistical methodology for calculating uncertainty that requires fewer assumptions about the form of the underlying probability distribution (Efron, 1982). Using this newer methodology, it is now feasible to aggregate model evaluation results from various averaging periods and from different meteorological regimes and to compare aggregated results for different models in a probabilistic framework. In addition, statistical methodologies have been developed (Breiman et al., 1979) and applied (Rao et al., 1985) that allow treatment of extreme concentrations in the context of model evaluation. Concentrations from the upper end of the distribution are of great concern to regulators because of the nature of many existing air quality standards. The purpose of this paper is to describe a method for aggregating component results of model performance into a single performance measure that may
be used to compare the overall performance of two or more models. The distribution of the difference in composite performance between models is determined by using the bootstrap resampling technique (Efron, 1982). Results from different data bases are combined using a technique related to meta-analysis to produce an overall result. An example of the method is provided in which the performance of two rural models is compared using SO2 data collected around four large midwestern power plants.
2. METHODOLOGY
The methodology is derived from EPA's experience in evaluation of isolated rural point source models located in essentially flat terrain. A strong emphasis is placed on the ability of models to predict peak concentrations independent of time of occurrence for two primary reasons: (1) the inherent limitations of most model evaluation data bases and (2) the specific regulatory purposes for which models are used. A major limitation of data commonly used to evaluate point source models is the lack of accurate information necessary to pinpoint actual transport wind direction. Data bases for rural models are usually rich in terms of the number of observations (time periods) but relatively sparse in terms of the size of the monitoring network. This fact argues for using peak concentrations under the assumption that "the data are dense enough in time to define the maximum concentration for each stability class at each of the monitors" (Irwin and Smith, 1984). The other reason for focusing on peak concentrations is that rural models are used to estimate the impact of sources on ambient standards or increments. Because of the nature of some ambient
standards, such as those for SO2, models must accurately predict the highest 3-h or 24-h average concentration independent of exactly when or where it may occur.

Following the rationale above, the evaluation is divided into two separate components. For the sake of discussion, the term 'scientific component' refers to evaluation of peak 1-h averages during specific meteorological conditions at each monitor, while the term 'operational component' refers to evaluation of peak 3-h or 24-h averages independent of meteorological condition or spatial location. The scientific component is designed to test the fundamental capability of models in predicting 1-h concentrations at specific locations and meteorological conditions, subject to the limitations of the data base. The operational component is designed to test the ability of the models in predicting peak concentrations averaged over periods consistent with regulatory needs.

The fractional bias is used as the fundamental measure of discrepancy between the measurement-based and prediction-based test statistics. Actually, the absolute fractional bias is used since the direction of the difference (overprediction or underprediction) is not being specifically considered here. The performance measures obtained from the operational and scientific components and from among the various data bases are combined to create a composite performance measure. The bootstrap procedure is used to estimate the uncertainty in the composite performance measure for each model. Using the estimate of uncertainty, the significance of the difference between models is assessed. Before illustrating the technique using actual data, the terminology used to describe the process is defined in more detail below.
2.1. Robust highest concentration

Because peak concentrations are highly variable, a robust test statistic is calculated using information contained in the upper end of the distribution of concentrations. The statistic, referred to as the robust highest concentration (RHC), is preferred to the actual peak value because it mitigates the undesirable influence of unusual events. The RHC is based on the tail exponential estimator described by Breiman et al. (1979) but has been adapted for this purpose using estimates of percentiles after Hoaglin (1983). The tail exponential estimate of a given upper percentile, Xp, of a distribution is given by:

Xp = Xp0 + Theta * log((1 - p0)/(1 - p))    (1)

where
p = probability associated with percentile Xp
Xp = upper percentile to be estimated (e.g. 99th)
p0 = probability associated with percentile Xp0
Xp0 = upper percentile cutoff (e.g. 90th)
Theta = average concentration above Xp0.

The procedure (Hoaglin) for associating data percentiles with rank-ordered data values is as follows:

p(i, N) = (i - 1/3)/(N + 1/3)    (2)

where
p(i, N) = fraction of values less than or equal to X(i)
X(i) = ith smallest data value
N = total number of observations.

For the case when i = N (the maximum value is estimated) and Xp0 is set to be the (N - R + 1)th ordered value (corresponding to the Rth largest value), Expression (1) becomes:

RHC = X(R) + Theta * log((3R - 1)/2)    (3)

where
Theta = XBAR - X(R)
XBAR = average of the R - 1 largest values
X(R) = Rth largest value
R = number of values exceeding the threshold value (R <= 26).

The value of R is arbitrarily chosen to be equal to 26 but may be lower in cases where there are fewer concentrations exceeding the threshold value. Whenever R < 3, the RHC statistic is set equal to the threshold value, where the threshold is defined as a concentration near background which has no impact on the determination of the robust highest concentration.

2.2. Performance measures

The fractional bias is used as the basic measure of model performance. The general expression for the fractional bias is given by:

FB = 2 * (OB - PR)/(OB + PR)    (4)

where
OB = RHC calculated using the observed data
PR = RHC calculated using the predicted data.
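As a concrete illustration of Equations (1)-(4), the following minimal Python sketch computes the RHC and the fractional bias as described above. The function names and default argument values (robust_highest_concentration, fractional_bias, r_max, threshold) are illustrative choices, not part of the published procedure, and the defaults simply echo the values quoted in the text.

```python
import numpy as np

def robust_highest_concentration(conc, r_max=26, threshold=5.0):
    """Robust highest concentration (RHC), Equation (3).

    conc      : concentrations for one data category (one monitor and
                meteorological condition, or the network maxima)
    r_max     : nominal number of upper-tail values used (26 in the text)
    threshold : concentration near background; returned whenever R < 3
    """
    x = np.sort(np.asarray(conc, dtype=float))[::-1]   # descending order
    r = min(r_max, int(np.sum(x > threshold)))         # R = values above threshold
    if r < 3:
        return threshold
    x_r = x[r - 1]                                     # Rth largest value, X(R)
    xbar = x[: r - 1].mean()                           # mean of the R - 1 largest values
    theta = xbar - x_r
    return x_r + theta * np.log((3.0 * r - 1.0) / 2.0)

def fractional_bias(ob, pr):
    """Fractional bias, Equation (4): -2 is extreme overprediction,
    +2 is extreme underprediction."""
    return 2.0 * (ob - pr) / (ob + pr)
```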
The fractional bias was chosen because of two desirable features. First, the fractional bias is symmetrical and bounded. Values for the fractional bias range between -2.0 (extreme overprediction) and +2.0 (extreme underprediction). Second, the fractional bias is a dimensionless number, which makes it convenient for combining results from data categories having significantly different concentration levels.

The scientific component (1-h averages) of the evaluation requires a separate fractional bias calculation for each meteorological condition and each monitor. The meteorological conditions selected, while somewhat arbitrary, are designed to test model performance over the range of meteorological conditions that would be expected in practical applications. Experience with evaluation of rural models suggests use of six meteorological conditions that are a function of atmospheric stability and wind speed. These six meteorological conditions correspond with the coupling of two wind speed categories (below and above
4.0 m s-1) and three stability categories: unstable (A, B, C), neutral (D) and stable (E and F).

The operational component (3-h and 24-h averages) is determined by comparing the largest measurement- and prediction-based RHC statistics from all monitors in the network. This evaluation procedure is consistent with assessment of ambient standards in that air quality status is determined first for each monitor and then an area-wide assessment is based on the worst (highest) level within the area.

2.3. Composite performance measure

The individual fractional bias and absolute fractional bias (AFB) components (1-, 3- and 24-h) are averaged to produce a composite result. The composite bias is a measure of the overall tendency for the models to over- or underpredict the measured values. The composite absolute bias can in fact be quite different from the composite bias since positive and negative components do not offset each other. For this reason, the composite absolute bias is the actual statistic used in judging which model is performing better overall. For the 1-h average component, the composite bias and composite absolute bias are computed by averaging the separate results for each meteorological category and station combination. In some situations, it may be appropriate to use different weights for some stations and/or meteorological categories. For the 3-h and 24-h components, no averaging is necessary since a single fractional bias is calculated for each of these two averaging periods. The algebraic expression for the composite performance measure is:

C = (AVG(AFB(i, j)) + AFB(3) + AFB(24))/3    (5)

where
AFB(i, j) = absolute fractional bias for meteorological category i at station j
AFB(3) = absolute fractional bias for 3-h averages
AFB(24) = absolute fractional bias for 24-h averages.

Because the purpose of the analysis is to contrast the performance among the models, the composite performance measure is used to calculate pairs of differences between the models. For discussion purposes, the difference between the composite performance of one model and another is referred to as the model comparison measure. The expression for the model comparison measure is given by:

M(A, B) = C(A) - C(B)    (6)

where
C(A) = composite performance measure for Model A
C(B) = composite performance measure for Model B.
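A brief sketch of Equations (5) and (6) follows, in the same hedged spirit as the earlier one; the names composite_performance and model_comparison are illustrative. It averages the 1-h absolute fractional biases over the meteorological category and station combinations, with optional weights, and forms the pairwise comparison measure.

```python
import numpy as np

def composite_performance(afb_1h, afb_3h, afb_24h, weights=None):
    """Composite performance measure C, Equation (5).

    afb_1h  : absolute fractional biases, one per (meteorological
              category, station) combination
    afb_3h  : absolute fractional bias for 3-h averages
    afb_24h : absolute fractional bias for 24-h averages
    weights : optional weights for the 1-h components (equal by default)
    """
    avg_1h = np.average(np.asarray(afb_1h, dtype=float), weights=weights)
    return (avg_1h + afb_3h + afb_24h) / 3.0

def model_comparison(c_a, c_b):
    """Model comparison measure M(A, B), Equation (6).
    Negative values favour Model A, positive values favour Model B."""
    return c_a - c_b
```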
When more than two models are being compared simultaneously, the number of model comparison measure statistics is equal to the number of
unique combinations of two models. For example, for three models, three comparison measures are computed; for four models, six comparison measures are computed; and so on. The model comparison measure is used to judge the significance of the apparent superiority of any one particular model over another.

2.4. Bootstrap estimate of sampling error

Because of its simplicity, the blocked bootstrap (Tukey, 1987) is used to generate the estimates of the sampling error. The bootstrap is basically a resampling technique whereby the desired performance measure is recalculated for a number of 'trial' years. The original year of data (nominally 365 days) is partitioned into four blocks corresponding to the four seasons. Within each season, 3-day 'pieces' are randomly sampled with replacement, resulting in approximately 30 sampled units per season. This process is repeated for each of the four seasons to construct a complete bootstrap year. Three-day pieces are chosen to preserve day-to-day meteorological persistence, while sampling within seasonal blocks guarantees that each sample year will be represented appropriately by each season. Since sampling is done with replacement, some days may be represented more than once while other days may not be represented at all. For each bootstrap year that is generated, the composite performance measure and the model comparison measure are calculated. Finally, the standard deviation of the model comparison measure is computed using the data resulting from the bootstrap calculations.

It is important that the bootstrap sampling be done such that all data are output for the selected 3-day piece, including the meteorological data, the measured data for each averaging period, and the corresponding model predictions for each model being evaluated. This ensures that all correlations inherent in the data are preserved including, for example, correlations that may exist in time between measurements and predictions. While other methods, including the jackknife, may be used to generate sampling errors, the bootstrap has been successfully used in model evaluation (Thrall and Stoeckenius, 1985) and found to reproduce the variability in the robust highest concentration.

2.5. Selecting the best models

The magnitude and sign of the model comparison measure are indicative of the relative performance of each pair of models. The smaller the composite performance measure, the better the overall performance of the model. This means that for two arbitrary models, Model A and Model B, a negative difference between the composite performance measures for Model A and Model B implies that Model A is performing better (Model A has the smaller composite performance measure), while a positive value indicates that Model B is performing better. When only two models are being compared, the most straightforward procedure for determining a significant difference is by
invoking the central limit theorem and using the z-test, where z is computed as the ratio of the model comparison measure to the standard deviation of the bootstrap outcomes. Alternatively, approximate confidence bounds may be determined from the upper and lower percentiles of the bootstrap outcomes. While more sophisticated methods have been advocated to compute confidence intervals using bootstrap results (Efron, 1987), these methods require further testing before they can be adapted for model evaluation purposes. When more than two models are being compared, it is convenient to calculate simultaneous confidence intervals for each pair of model comparisons (Cleveland and McGill, 1984). For each pair of model comparisons, the significance of the model comparison measure depends upon whether or not the confidence interval overlaps zero. If the confidence interval overlaps zero, the two models are not performing at a level which is statistically different. If the confidence interval does not overlap zero (upper and lower limits are both negative or both positive), then there exists a statistically significant difference between the two models at the stated level of confidence.
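The following Python sketch illustrates the blocked bootstrap of Section 2.4 and the z-test of Section 2.5. It is a minimal illustration under simplifying assumptions: the helper names, the use of day indices rather than an actual data structure, the number of bootstrap trials, and the seasonal block boundaries are all hypothetical. In practice, all measured and predicted values attached to each sampled 3-day piece would be carried along together so that correlations between observations and predictions are preserved.

```python
import numpy as np

def bootstrap_year(days_by_season, rng, piece=3):
    """Return day indices for one bootstrap year: within each seasonal
    block, 3-day pieces are sampled with replacement until the season
    is (approximately) refilled."""
    sampled = []
    for days in days_by_season:                     # four seasonal blocks
        blocks = [days[i:i + piece] for i in range(0, len(days), piece)]
        for k in rng.integers(0, len(blocks), size=len(blocks)):
            sampled.extend(blocks[k])
    return sampled

def compare(m_actual, comparison_measure, days_by_season, n_boot=500, seed=0):
    """z-test for the model comparison measure.  comparison_measure is a
    user-supplied function that recomputes M(A, B) for a list of day indices."""
    rng = np.random.default_rng(seed)
    boot = np.array([comparison_measure(bootstrap_year(days_by_season, rng))
                     for _ in range(n_boot)])
    sd = boot.std(ddof=1)                           # bootstrap standard deviation
    z = m_actual / sd                               # ratio used in the z-test
    return z, (m_actual - 2.0 * sd, m_actual + 2.0 * sd)   # approx. 95% bounds
```

The approximate 95 per cent bounds here are formed as twice the bootstrap standard deviation on either side of the actual (no bootstrap) value, mirroring the bounds used later in Figs 1a and 1b.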
3. EXAMPLE MODEL COMPARISON
The following example is based on comparisons among model predictions and measured data using SO2 data collected for four large power plants located in relatively flat terrain. The two models being compared are EPA's MPTER, which is routinely used in regulatory applications, and an 'alternate' model, which incorporates a number of features not presently available in MPTER or in other comparable rural models*. Table 1 provides a brief summary of environmental and data characteristics for each of the four plants. Detailed information for each plant is available elsewhere (Londergan et al., 1982; EPA, 1985, 1986, 1987). For the Clifty Creek and Muskingum plants, data are available for both 1975 and 1976, while for the Paradise and Kincaid plants only a single year of data is used.

The evaluation data base consists of 1-, 3- and 24-h average measured and predicted concentrations at each of the monitoring stations. For the 1-h periods, wind speed and atmospheric stability were also included. An hourly background concentration was estimated and subtracted from the measured hourly concentrations. Background is calculated as the average of values at monitor locations outside the 90-deg. sector downwind from the source. Whenever a negative
* The alternate model is currently under development and has not yet been released in its final form. This example exercise was conducted using an early version of the model not generally available to the public (personal communication with Bob Paine at ENSR).
value resulted, the value of 0 is used. In addition, a threshold check is used to eliminate low observed or predicted values that have no effect on the performance statistics. For 1-h averages, a threshold value of 25 µg m-3 is used, while for 3-h and 24-h averages a value of 5 µg m-3 is used. The threshold checks are applied independently to the measured and predicted concentrations. To minimize distortions associated with small counts, data categories having fewer than 100 observations were eliminated from the analysis.

3.1. Results for Clifty Creek--1975

The robust highest concentration estimates for 1-h averages, along with the corresponding fractional bias for both models, are shown in Table 2. Generally, the robust highest concentration for observed values is largest for unstable conditions and smaller during more stable conditions, although variations among the six stations are fairly large. The magnitude of the RHC for model predictions is also dependent on meteorological conditions; however, the models differ with respect to which condition is most conducive to the highest levels. For example, MPTER tends to predict the largest values during low wind-unstable conditions and the lowest values during high wind-stable conditions. This contrasts with the alternate model, which tends to predict high concentrations in most categories except high wind-stable conditions. The net effect is for MPTER to overpredict during unstable conditions and lower wind speeds and underpredict during stable conditions and higher wind speeds. The alternate model tends to overpredict not only for unstable and low wind conditions, but throughout the entire stability and wind speed range. Note that since the high wind-stable category has fewer than 100 counts (N = 66), it is excluded from the analysis.

Figure 1a displays the absolute fractional bias for each model for 1-h averages (panel 1), 3-h averages (panel 2), 24-h averages (panel 3) and the grand composite among the three averaging periods (last panel). Figure 1b displays the difference in absolute fractional bias between MPTER and the alternate model. The results for 1-h averages are the composite average over the individual stations and six meteorological conditions, while the 3-h and 24-h results are based on the maximum concentrations across the six monitoring stations. The upper and lower values represent an approximate 95 per cent confidence bound for the absolute fractional bias. The upper bound was computed by adding twice the standard deviation of the bootstrap outcomes to the actual (no bootstrap) absolute fractional bias. The lower bound was computed by subtracting twice the standard deviation from the actual absolute fractional bias.

For 1-h averages, the composite absolute fractional bias is approximately the same for the two models.
Table 1. Characteristics of power plant data bases

Plant name and location | Source characteristics | Terrain | Monitoring network
Clifty Creek, Indiana | 1300 MW; three 208-m stacks | Rolling hills, low ridges, below stack height | Six SO2 stations 3-15 km from plant (1975 and 1976)
Muskingum River, Ohio | 1460 MW; two 252-m stacks | Rolling hills, low ridges, below stack height | Four SO2 stations 4-20 km from plant (1975 and 1976)
Paradise, Kentucky | 2560 MW; two 183-m stacks, one 244-m stack | Rolling hills, below stack height | Twelve SO2 stations 3-17 km from plant (1976)
Kincaid, Illinois | 1260 MW; one 187-m stack | Relatively isolated, in flat terrain | Thirty SO2 stations 3-20 km from plant (1980/81)

All plants are coal-fired, base-load facilities.
Table 2. Robust highest concentrations and fractional bias by meteorological category and monitoring station for Clifty Creek--1975
Meteorological category
Station
Robust highest concentration 1 h averages (#g m - 3 ) Observed MPTERV6 ALT
Fractional bias MPTERV6 ALT.
Low winds, unstable (n = 1312)
1 2 3 4 5 6
732 806 982 927 421 828
866 1280 1409 1000 996 1137
970 1488 1272 1341 1211 1474
-0.2 -0.5 -0.4 -0.1 -0.8 - 0.3
-0.3 -0.6 -0.3 -0.4 - 1.0 - 0.6
Low winds, neutral (n = 3353)
1 2 3 4 5 6
652 595 544 698 273 643
321 150 65 291 25 25
1269 1719 1159 2166 874 1108
0.7 1.2 1.6 0.8 1.7 1.9
-0.6 - 1.0 -0.7 - 1.0 - 1.0 -0.5
Low winds, stable (n = 2559)
1 2 3 4 5 6
501 253 502 354 176 610
1093 277 112 605 25 25
850 1820 1564 2641 25 151
-0.7 - 0.1 1.3 -0.5 1.5 1.8
-0.5 - 1.5 - 1.0 - 1.5 1.5 1.2
High winds, unstable (n = 147)
1 2 3 4 5 6
831 863 193 311 201 870
674 697 1498 631 370 361
375 878 818 693 430 153
0.2 0.2 - 1.5 -0.7 - 0.6 0.8
0.8 -0.0 - 1.2 -0.8 - 0.7 1.4
High winds, neutral (n = 1323)
1 2 3 4 5 6
631 801 839 550 132 518
613 662 282 612 25 60
1055 2043 1955 1685 485 780
0.0 0.2 1.0 -0.1 1.4 1.6
-0.5 - 0.9 -0.8 - 1.0 - 1.1 -0.4
High winds, stable (n = 66)
1 2 3 4 5 6
190 39 157 371 118 57
25 25 83 289 25 25
25 25 1007 662 25 467
1.5 0.4 0.6 0.2 1.3 0.8
1.5 0.4 - 1.5 - 0.6 1.3 - 1.6
A L T = Alternate model.
[Figure 1a: four panels (1-h averages, 3-h averages, 24-h averages, composite); vertical axis: absolute fractional bias, 0.0-2.0; left bar MPTER, right bar alternate model.]

Fig. 1a. Performance comparison between MPTER V6 (left) and the alternate model (right) with 95 per cent bootstrap confidence bounds for Clifty Creek--1975.

[Figure 1b: four panels (1-h averages, 3-h averages, 24-h averages, composite); vertical axis: difference in absolute fractional bias, -1.0 to 1.0.]

Fig. 1b. Difference in performance between MPTER V6 and the alternate model with 95 per cent bootstrap confidence bounds for Clifty Creek--1975.
Confidence bounds for the difference between MPTER and the alternate model overlap 0, which suggests no statistical difference between the models. For 3-h averages, the differences are larger and suggest that MPTER is performing better statistically. For 24-h averages, MPTER is also performing better, although the confidence bounds do slightly overlap 0. The composite result across the three averaging periods indicates that overall MPTER is performing marginally better than the alternate model. If information were available for only this data base, we would conclude that MPTER is the model of choice.

3.2. Multiple data bases

The same procedure was applied to evaluate and compare the performance of the two models using data from each of the other five data bases. Figure 2 displays the results as the difference in absolute fractional bias between MPTER and the alternate model for each of the six data bases. The first panel summarizes the results for 1-h averages, which is the composite result among the six meteorological conditions. The alternate model appears to be performing better since most of the differences are positive and since the 95 per cent confidence limits do not cover 0 except for Clifty Creek--1975. For 3-h averages, the results shift in favor of MPTER. For three of the data bases, both the upper and lower confidence bounds are negative, indicating that MPTER is performing better. For the other three data bases, the results slightly favor MPTER even though the upper confidence bounds overlap 0. The confidence bounds for 3-h averages are always larger than the corresponding bounds for 1-h averages, which is a direct result of the number of comparisons: for 1-h averages, multiple meteorological conditions and stations are pooled, which effectively reduces the variance. Note that the confidence bounds for Kincaid are large compared with the other five data bases, indicating greater uncertainty in the calculated difference in performance between the two models.

For 24-h averages, confidence bounds are relatively large and overlap 0 for each data base except Clifty Creek--1976, where MPTER appears to perform better, and Muskingum River--1976, where the alternate model appears to perform better. The data presented in the last panel are the composite performance among the three averaging periods for each data base. MPTER is clearly the better performing model at Clifty Creek, with results mixed at the other three power plants. At Muskingum River--1976, the alternate model appears to be performing better, while for the other three data bases confidence bounds overlap 0, indicating no clearly better performing model.
[Figure 2: four panels (1-hour averages, 3-hour averages, 24-hour averages, combined); vertical axis: difference in absolute fractional bias, -1.0 to 1.0; data bases on the horizontal axis: CC75, CC76, MK75, MK76, PD76, Kincaid.]

Fig. 2. Performance comparisons between MPTER V6 and the alternate model by averaging period for six rural data bases.
3.3. Combining results from data bases

The goal of this study is to combine the results of model performance comparisons from multiple averaging periods and multiple data bases into one composite measure that can be used to distinguish the overall performance between the two (or more) models. Combination of results is meaningful only if the factors that influence differences between the models are accounted for among the data bases. For example, one model may perform better than another in very flat terrain while the opposite may be true when terrain height is closer to stack height. Obviously it could be misleading to combine results from studies that had such marked differences in environmental conditions. In this case, there are strong similarities (Table 1) among the data bases that argue for one conclusion regarding differences in performance among the models.

Since the results for each data base are derived independently, a straightforward combination of composite results is obtained by simple averaging. Table 3 summarizes the results, which include the composite differences in absolute fractional bias for the two models and the corresponding bootstrap estimated standard error for each of the six data bases. If a common difference between the two models can be assumed among the six data bases, the minimum variance unbiased estimator for the common difference is the weighted sum of the individual differences, where the weights are inversely proportional to the squares of the standard errors.
Table 3. Comparison of composite performance between MPTER V6 and the alternate model using differences in absolute fractional bias and the associated bootstrap standard deviations

Data base | AFB diff (1-h) | Std dev (1-h) | AFB diff (3-h) | Std dev (3-h) | AFB diff (24-h) | Std dev (24-h) | AFB diff composite | Std dev composite
Clifty Creek--1975 | -0.021 | 0.054 | -0.702 | 0.144 | -0.335 | 0.237 | -0.353 | 0.099
Clifty Creek--1976 | 0.168 | 0.053 | -0.480 | 0.117 | -0.793 | 0.140 | -0.368 | 0.067
Muskingum River--1975 | 0.432 | 0.069 | -0.124 | 0.154 | 0.059 | 0.168 | 0.123 | 0.085
Muskingum River--1976 | 0.380 | 0.076 | -0.064 | 0.099 | 0.310 | 0.124 | 0.209 | 0.055
Paradise--1976 | 0.508 | 0.051 | -0.460 | 0.198 | -0.231 | 0.256 | -0.061 | 0.126
Kincaid--1980/1981 | 0.093 | 0.040 | -0.236 | 0.389 | -0.397 | 0.347 | -0.180 | 0.224
Composite among data bases | | | | | | | -0.04 | 0.03
Using the notation from Equation (6), this becomes:

CM = SUM(Wi * Mi)/SUM(Wi)    (7)

where
CM = composite model comparison measure
Mi = model comparison measure for the ith data base
Si = bootstrap estimated standard error for the ith data base
Wi = 1.0/Si**2.

The estimated standard error for the composite model comparison measure is then given by:

S = SQRT(1.0/SUM(Wi)).    (8)
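The inverse-variance weighting of Equations (7) and (8) can be sketched as follows; the function name and the direct reuse of the Table 3 composite values as a usage example are illustrative rather than part of the published procedure.

```python
import numpy as np

def combine_data_bases(m, s):
    """Composite model comparison measure (Equation 7) and its standard
    error (Equation 8) from per-data-base measures m and bootstrap
    standard errors s."""
    m, s = np.asarray(m, dtype=float), np.asarray(s, dtype=float)
    w = 1.0 / s**2                      # inverse-variance weights
    cm = np.sum(w * m) / np.sum(w)
    se = np.sqrt(1.0 / np.sum(w))
    return cm, se, cm / se              # ratio used for the significance test

# Composite differences and standard deviations taken from Table 3:
m = [-0.353, -0.368, 0.123, 0.209, -0.061, -0.180]
s = [0.099, 0.067, 0.085, 0.055, 0.126, 0.224]
cm, se, z = combine_data_bases(m, s)    # roughly -0.04, 0.03 and -1.3
```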
The usual test of significance for the combined estimate can be obtained from the ratio of the composite measure to its standard error. Using the data for the six data bases results in a composite measure of -0.04 ± 0.03, which indicates that the composite difference among the models is not significant. At this point, the evidence leads us to conclude that there is no reason to believe that one model performs better than the other.

Some readers will note the strong similarities between the statistical technique used here to combine the results from different data bases and the technique known as meta-analysis (Hedges and Olkin, 1985; Rosenthal, 1984). Meta-analysis is often applied by investigators in the social and medical sciences to arrive at an overall conclusion about the effectiveness of a particular treatment or policy. By pooling results from multiple studies, meta-analytic techniques provide more discriminating power to detect effects than is possible from any one individual study. While conceptually attractive for model evaluation, some of the parameters required for meta-analysis are not directly available. For example, 'study size' in meta-analysis is a parameter associated with each study that is directly related to the statistical significance of the measured effect. In the context of this evaluation, there is no quantitative definition of study size, although the number of monitoring stations and the number of independent meteorological events within the year are indirect indicators of study size.
4. CONCLUDING REMARKS
A statistical procedure has been devised and tested for comparing the performance of two or more air quality models. Models were compared using a composite performance measure that combines the performance from the scientific and operational components. The bootstrap procedure was used to estimate the uncertainty of the composite performance measure. A statistical test was used to determine the significance of the difference in composite performance between models.
The technique was applied using available data from four large SO2-emitting power plants located in relatively isolated terrain. The example compared the performance of EPA's MPTER model and an alternate model that incorporated advanced algorithms for predicting plume dispersion. The results suggested that differences in model performance exist, although those differences appear to vary among the three averaging periods. The alternate model appeared to perform best for 1-h averages at specific monitoring stations over a variety of meteorological conditions. MPTER appeared to perform better for 3-h and 24-h averaging periods unpaired in time or space. A simple, straightforward statistical method was used to combine results from the six data bases to obtain an overall comparison between the performance of the two models. The method used for statistically aggregating the results from the six data bases is strongly related to meta-analysis, which has been commonly used for combining results from multiple research studies. Although meta-analytic techniques hold promise for combining the results of separate model evaluation studies, problems related to the definition and incorporation of study size must be resolved before these techniques can be directly applied.

The methodology as presented has focused on the ability of models to accurately predict peak concentrations. In practice, the technique may be easily adapted to include other conditions of interest. For example, it might be desirable to include a separate performance measure to evaluate the ability of a model in predicting concentrations completely paired in time and/or space. This could be accomplished by including a fractional bias statistic based on paired data values in the calculation of the composite performance measure, using a weight that reflects its importance relative to other measures.
REFERENCES
Breiman L., Stone and Gins J. D. (1979) New methods for estimating tail probabilities and extreme value distributions. TSC-PD-A226-1, Technology Services Corporation.
Cleveland W. S. and McGill R. (1984) Graphical perception: theory, experimentation and application to the development of graphical methods. J. Am. Stat. Ass. 79, 531-554.
Cox W. M. and Tikvart J. A. (1985) Assessing the performance level of air quality models. Paper presented at the 15th International Technical Meeting on Air Pollution Modelling and Its Application, NATO/CCMS Conference, St. Louis, MO.
Efron B. (1982) The Jackknife, the Bootstrap and Other Resampling Plans. Society for Industrial and Applied Mathematics, Philadelphia, PA.
Efron B. (1987) Better bootstrap confidence intervals. J. Am. Stat. Ass. 82, 171-185.
Fox D. G. (1981) Judging air quality model performance. Bull. Am. Met. Soc. 62, 599-609.
Fox D. G. (1984) Uncertainty in air quality modeling. Bull. Am. Met. Soc. 65, 27-36.
Hedges L. V. and Olkin I. (1985) Statistical Methods for Meta-analysis. Academic Press, New York.
Hoaglin D. C., Mosteller F. and Tukey J. W. (1983) Understanding Robust and Exploratory Data Analysis. John Wiley, New York.
Irwin J. and Smith M. (1984) Potentially useful additions to the rural model performance evaluation. Bull. Am. Met. Soc. 65, 559-568.
Londergan R. J., Minott D. H., Wackter D. J., Kincaid T. and Bonitata D. (1982) Evaluation of rural air quality simulation models. EPA-450/4-83-003, U.S. Environmental Protection Agency.
Rao S. T., Sistla G., Pagnotti V., Petersen W. B., Irwin J. S. and Turner D. B. (1985) Resampling and extreme value statistics in air quality model performance evaluation. Atmospheric Environment 19, 1503-1518.
Rosenthal R. (1984) Meta-analytic Procedures for Social Research. Sage Publications, Beverly Hills.
Thrall A. D. and Stoeckenius T. E. (1985) Use of the bootstrap method to generate confidence intervals for atmospheric dispersion model predictions: a numerical experiment. Presented at the Ninth Conference on Probability and Statistics in the Atmospheric Sciences, American Meteorological Society, Virginia Beach, VA. Systems Applications, Inc., San Rafael, California.
Tukey J. W. (1987) Kinds of bootstraps and kinds of jackknives, discussed in terms of a year of weather-related data. Technical Report No. 292, Dept of Statistics, Princeton University.
U.S. Environmental Protection Agency (1984) Interim procedures for evaluating air quality models (revised). EPA-450/4-84-023, U.S. Environmental Protection Agency.
U.S. Environmental Protection Agency (1985) Evaluation of rural air quality simulation models, addendum A: Muskingum River data base. EPA-450/4-83-003a, U.S. Environmental Protection Agency.
U.S. Environmental Protection Agency (1986) Evaluation of rural air quality simulation models, addendum C: Kincaid SO2 data base. EPA-450/4-83-003c, U.S. Environmental Protection Agency.
U.S. Environmental Protection Agency (1987) Evaluation of rural air quality simulation models, addendum D: Paradise SO2 data base. EPA-450/4-83-003d, U.S. Environmental Protection Agency.