Atmosphex Environment Vol. 21. No. 3. pp. 629439. Primed in Great Briuin
owl4981197 Perternan
1987
f3.OQ+0.00
Journals Ltd.
ON THE STATISTICAL ANALYSIS OF AMBIENT OZONE DATA WHEN MEASUREMENTS ARE MISSING A. C. DAVISON* Center for Statistimf Sciences, The University of Texas at Austin, Texas 78712, U.S.A.
and M. w. ~EMPHILL Texas Air Control Board, 6330 Highway 290 East, Austin, Texas 78723, U.S.A. (First received 15 Aprit 1986 and receiued $0~ ~ub~~e~rioa11 Augusr 1986)
Abstract-Two methods are proposed to deal with missing observations, which frequently pose problems in the statistical analysis of ambient ozone data. The first is based on exceedances of the data over thresholds and providesa flexible and general class of modeis for statistical analysis of air pollution data. The second uses the measured values of related variables to impute missing observations. They are applied to data for three sites in
E Texas, Key word index: Ambient ozone, binomial distribution, exceedance, generalized linear model, imputation of missing values, multivariate regression, normal distribution, Poisson process, threshold.
Vast amounts of data regarding the levels in the atmosphere of pollutants are automatically gathered throughout the world, in part for subsequent statistical analysis to detect possibie trends in air quality. Relatively high levels of pollutant concentrations, which have possible adverse effects on human health, are often the focus of interest. However such analysis is bedevilled by the frequently large proportions of observations missing from the data due to machine failure, routine maintenance, changes in the siting of monitors, human error, and other factors. Changes in monitoring equipment and procedures may also complicate analysis. Moreover climatic changes from year to year obscure possibte long-term trends. The use of statistical methods capable to some extent of making aifowance for these features of the data is thus a necessity, and it is this that we discuss in this paper. The idea fundamentai to the type of analysis suggested here is to restrict attention to those occasions on which a particular fixed threshold level of pollutant con~ntration is exceeded. One motivation for this in the U.S.A. is that federal standards for ambient 0, concent~tions promulgated by the U.S. Environmental Protection Agency are phrased in terms of the number of days on which a threshold level is exceeded during a given period. Costly penalties may be imposed on a state containing sites not in compliance with regulations, and massive, expensive, and sometimes l Current address: Department of Mathematics, Imperial College, London SW? 2B2, U.K.
ineffective efTorts may be made to attain compIian~. This focus of legislative attention on exceedances over a fixed threshofd suggests that a method of statistical analysis which is directly interpretable in terms of the rate at which exceedances occur is relevant. In other contexts there may be medical or physical, rather than legal, reasons why such analysis is useful. Having fixed a threshold, we then modef statistically the frequency with which it is exceeded as a function of such factors as climatic variables, site characteristics, and so on. Among other useful features, this modefling enables apparent increases in exceedance rates due to, for example, exceptionally hot weather, to be distinguished from the effects of possible trend. It may also provide information for the operation of monitors, as seen in section 4. The method of analysis adopted in this paper is rather flexible. If data are gathered at houriy intervals, say, we may opt to analyze the number of hours per day with recorded exceedances over a threshoid level; or we may simply record whether it was exceeded each day, and analyze the number of such days instead. In both approaches such other quantities ofpossible interest as the size of an exceedance- over the threshold level and the integrated total excess of it are ignored. The payoff is a flexible and readily interpretable class of statistica models, capable of making allowance for missing data, which can be fitted in standard statistical packages. The models involve the use of methods closely related to linear regression techniques (Weisberg, 19851, but applicable to discrete measurements in the form of counts rather than to continuous data. Such models are common in the social and biological sciences, but
629
A. C. DAVISON and M. W. HEMPHILL
630
we are unaware of other uses of them in the air pollution literature. Throughout this paper the statistical methods are discussed in relation to the analysis of ambient Oj levels, but the ideas may readily be applied to the analysis of levels of other pollutants. The next section of this paper shows how allowance may be made for entire missing days of data. In section 3 we discuss a possible method for estimating missing values when days with incomplete data are available. As an illustration of the use of the methods, we discuss in section 4 the analysis and interpretation of ambient ozone data at three sites in Texas. Section 5 concludes the paper with a discussion. 2. ENTIRE MISSING DAYS OF DATA
Ambient levels of atmospheric pollutants typically depend upon seasonal variations in meteorological conditions. To understand the statistical features of such data, it is therefore necessary to allow for such dependence. One way to achieve this is to split the data into fairly short periods and accommodate dependence on the recorded values of climatic variables during them. Such periods should be short enough that climatic factors may be thought ofas roughly constant over them, and so the choice of periods of length 1 month, though arbitrary, is natural. We assume for the moment that only entire days are missing from the data. Suppose, then, that the ith in a series of k months is of length mi days, but that data are only available for n, of the days. Thus ni is between zero and mi. Suppose that ambient levels of a pollutant exceed the threshold level once or more on Rj of the ni days for which data are available that month. Suppose also that days on which exceedances occur arise independently and with equal probability pi on each day of that month, a supposition made primarily for ease of exposition and discussed in section 5. Under these conditions, Ri has the binomial distribution with probability pi and denominator ni, i.e. tli!
prob (Ri = r) = ___ ’ r!fni_r)lPi(l-Pi)n.-r*
(1)
where 0 C r d n, and 0 < pi ,< 1. As it stands, (1) is of little use because a different value of pi must be estimated each month, making it hard to summarize easily site and climatic factors affecting the rate at which exceedances occur. A more parsimonious representation is needed, and may be derived as follows. If exceedances over the threshold during month i occur as a Poisson process (Feller, 1968, p. 446) with a positive constant rate Ri exceedances per day, the number Z, occurring on any day that month has the Poisson distribution prob(Z( = k) = $exp( ->.,I,
wherek=0,1,2,. . . . Thus the probability pi of one or more exceedances that day is pi = prob(Zi 2 1)~ 1 -eXp(-li).
(2)
The final stage in giving a readily interpretable statistical model for the number of days with exceedances is to expressthe rate parameter li in terms ofclimate, site, and other relevant characteristics of the data as Ii = exp (xl/?),
(3)
where xi is a p x 1vector of known quantities, so-called ‘covariates’, measured at the site in month i, and /3 is a p x 1 vector of parameters to be estimated from the data. Thus, for example, we might have 8’ = (Do, #?,, p2) and xi = (1, i/12, ti), with ti the mean daily maximum temperature (“C) at the site in month i. Single elements, or combinations of the elements, of xi are known as ‘terms’ of the statistical model. Here /iO reflects the baseline rate at which exceedances occur; p, represents a long-term trend, with the rate increasing by a factor exp (fl, ) per year; and j12represents a crude climatic effect, with the rate Ri increasing by a factor exp(j&) with each 1°C increase in the mean daily maximum temperature per month. Other positive functions could be used in (3) instead of the exponential, whose main practical advantage is the ease with which the parameter values may be interpreted. Other quantities thought to affect pi may be included in xi and the corresponding parameters 1 estimated. The statistical model embodied in Equations (I), (2) and (3) is a particular case of a generalized linear model (Dobson, 1983; McCullagh and Nelder, 1983), a generalization of linear regression models ( Weisberg, 1985). Such models are increasingly used in many areas of application of statistics; they can be readily fitted to data using maximum likelihood estimation in computer packages such as GLIM (Baker and Nelder, 1978) and SAS (SAS Institute, Inc., 1982). Further details of such models are sketched in Appendix I. Note that models may be fitted very rapidly; that evidence for or against the presence of effects such as the trend PI described above may be weighed; and that how well such models fit data may be assessed both formally by significance tests and informally by graphical procedures. The main statistical assumptions made in the course of the development above are: first, that the rates i., of exceedances are constant within periods of I month; and second, that exceedances of a given level occur independently on different days. These are somewhat unrealistic because the conditions leading to exceedances, whether natural or caused by man, may vary within a month, and because of the short-term persistence of such conditions. The main qualitative effect of this variation is that data will tend to be statistically over-dispersed relative to the model, in the sense that the data will be more variable than would be expected from the binomial distribution alone. This overdispersion will tend to increase the margin oferror in the conclusions drawn, rather than bias estimates of
On the statistical analysis of ambient ozone data parameters in the model. The issue is further discussed below in relation to Table 3. It should be borne in mind that in the present context the aim of statistical modelling is to give an easily interpreted summary of the main features of the data, with particular regard to possible trends in pollutant levels, rather than to give direct physical insight into the mechanisms governing the production of air pollutants. However it is possible to incorporate relevant information about emissions from sources close to the monitor and from local meteorological conditions into the rate parameter li of the statistical model described by Equations (I), (2) and (3), and this may be useful in some applications. 3. INCOMPLETE
DAYS OF DATA
In addition to entire missing days, a database will typically contain days for which some but not all hourly observations are present. The proportion of such days may be appreciable, so it is wasteful and potentially misleading not to use statistical methods capable of incorporating them into an analysis. The possibility we explore here is to impute missing values in 0, data at a site by reference to the levels there of related pollutants and air temperature. We discuss below the use of multivariate regression analysis to impute missing 0, values. Empirical statistical methods akin in spirit to those described here have been used by ApSimon and Davison (1986), Tiao ef ai. (1976), and many others; a useful general reference in Horowitz (1982, section 7.8). Technical statistical details are relegated to Appendix 2. First we discuss briefly the typical pattern of diurnal variation of O,, NO and NO*. The formation of ambient 03 requires sunlight, NO or N02, and organic compounds. The principal mechanism relies on NO-to-NO2 oxidation by organic and other radicals. A description of the reactions is provided by Atkinson and Lloyd (1984). The typical pattern of O3 concentrations in areas like SE Texas is that maximum values occur from late morning to early afternoon. OJ-forming chemical reactions require several hours to yield a maximum Ol level, which diminishes in mid-afternoon as atmospheric dilution processes and wind transport peak. Generally the NO levels peak shortly after dawn and then decrease sharply due to increasing atmospheric mixing and photochemical conversion. They then remain low during the day, but increase at night. Concentrations of NOz typically increase after dawn and decrease again after midday. They may then increase in the late afternoon and often climb through the night. Study of the data described in section 4 suggests that O>, NO and NO, levels may be related by the equations log(O3) = alI -t-a21 log(T)+a,,/T, log (NO) = aI 2 + QZ log V) + a,,/T,
(4.1) (4.2)
log(NO~)=al~+az310gtT)+a~~/~
631 (4.3)
where we use 01, T, NO and NO1 to denote the hourly levels at a site of OJ (ppm), temperature (“C), NO (ppm) and NO, (ppm), respectively. Moreover, levels of O3 are related to those of NO and NOL, suggesting that when present, NO and NO1 obser~tions can be used to impute missing values of 0~ data. Thus when measurements of OJ, T, NO and NO2 are available for each of a total of N h, the equation Y=XA+U
(6)
may adequately describe the data. Here Yis the known N x 3 matrix whose ith row is (log (O,,), log(NOj), log (NOZi)], X is the known N x 3 matrix with ith row {I, log(Ti), l/T,}, A the unknown 3 x 3 matrix of parameters aij at (4), and U a random N x 3 matrix of errors whose rows are independent multivariate normal vectors with mean zero and unknown covariante matrix Q. The subscript i indexes the ith of the total N h for which data are available,and the matrix Cl summarizes the correlations among the poliutant levels. The unknown 3 x 3 matrices A and R can then be estimated by multivariate linear regression, as outlined in Appendix 2. Once estimated, the values of the parameters in (5) may be used to impute missing 03 values, as follows. If NO and NOz values as well as 0s values are missing, imputation of the missing O3 values is based on (4.1) using the estimated values of the parameters a. If however either the NO or NO* levels, or both, are available, their values provide additional information which can provide better estimates of the missing OJ level 0,. An estimate of log (03) for the missing value of OJ is then found by calculating the mean of the distribution of log (0,)conditional upon the observed values of the other variable(s). This procedure makes appropriate allowance for the correlation, summarized in the estimate of 52, between levels of Ox, NO, and N02. Thus in general four distributions are needed: (a) of log (0,) without regard to levels of NO and NOz; (b) of log (03) given the observed level of NO, but without regard to the level of NO,; (e) of log (0, ) given the observed level of NO1, but without regard to the level of NO, and (d) of log (0,) given the observed values both of NO and NOz. An estimate of a particular missing hourly Oa level is found as the mean of whichever of these four distributions is relevant, depending on which of the variables NO and NO1 are absent. If the temperature observation for a particular hour is absent then no O3 level can be imputed, but this is relatively rare, since the temperature data are much more complete. In any application of this idea, major questions pertain to whether Equations (4.1)-(4.3) fit the data well enough reliably to impute missing 0, values, especially high ones. These issues are discussed in section 4 in the context of our application.
A. C. DAVISON and M. W. HEMPHILL
632
4. OZONE AT THREE SITES IN TEXAS 4.1.
Introduction
The National Ambient Air Quality Standard for 0, is designed to limit exposure to high concentrations of OS throughout the U.S.A. It sets a limit of0.12 ppm for a l-h average at a monitoring site, not to be. exceeded on an average of more than 1 day per year during any 3year period. Areas which violate this standard are required to implement programs intended to bring them into compliance with it. National control efforts have reduced the levels of precursors of O,-emissions of volatile organic compounds from stationary sources and automobiles, which react in sunlight with NO, and Or to produce O,-but OJ levels have not yet been correspondingly reduced. Statistical analyses of ambient O3 measurement to detect overall trends have generally used annual statistics computed from all available l-h averages and have not made allowance in the analysis itself for the occurrence of missing data. An example of this type of analysis is Walker (1985), who excludes from his study monitoring sites in Texas at which fewer than 50% of all possible l-h averages are available over his period of interest, among other criteria. We apply instead the methods described above to the analysis of l-h 0s averages for the years 1981-1984 at three monitoring sites in SE Texas: Beaumont, Port
Arthur and Orange (see Fig. 1). We examine the occurrence of exceedances of 0.08 ppm 03, in part because exceedances of 0.12 ppm 0s are too rare at these sites to demonstrate the features of our method of analysis. Interest focuses on two main issues: overall differences among the 4 years; and differences among the three sites. Our study and all the discussion below are restricted entirely to the hours 9 a.m.-9 p.m., for two reasons. Firstly, EPA guidelines suggest this period for the interpretation of data; and secondly, exceedances even of level 0.08 ppm, apart from those due to machine malfunction, are rare outside this period. A typical level overnight is 0.02 ppm or so. The data follow diurnal and seasonal patterns similar to those from urban areas such as Houston and Dallas. Table 1, which displays summary statistics for missing data on a yearly basis at each site, shows the severity of problems posed by missing values. Some years are relatively complete, but in others more than one-half of the data is missing. Table 2 gives a more detailed breakdown of the statistics, showing that in several years large fractions of data in the summer months-when 0, levels are generally highest-are missing. Ignoring the missing data could lead to gross underprediction of the occurrence of high 0s levels, as we shall see. Several months are missing entirely, most notably the early months of 1981 at Port Arthur. In the next section we discuss the application of the
SCde
’ 200 miles ’
Fig. 1. Map of Texas showing area of monitoring sites.
On the statistical analysis of ambient ozone data
633
Table 1. Summary statistics for missing data Entire days missing
Incomplete days
Proportion of missing hours in incomplete days
29 81 189 24
138 107 51 133
0.204 0.241 0.333 0.22 1
220 81 132 62
41 71 74 156
0.500
97 82 120 17
130 112 59 141
0.233 0.231 0.322 0.219
Beaumont 1981 1982 1983 1984 Port Arthur 1981 1982 1983 1984 Orange 1981 1982 1983 1984
methods outlined in section 3 to these data, and then in section 4.3 we describe and interpret the application of the ideas in section 2. 4.2. Classification of data The method for the imputation of missing values outlined in section 3 is only needed for incomplete days and no recorded exceedances over the current threshold level of interest. To allow for possible seasonal variation of the parameter matrices A and R, data were processed on a monthly basis and missing values between 9 a.m. and 9 p.m. were imputed from monthly rather than overall estimates of A and 0. Data for the three sites were processed separately, using the SAS statistical package (SAS Institute, Inc., 1982). In order to assess the effect on the analysis of the imputation of missing values, days with exceedances due to observed and imputed values were distinguished. Imputed observations gave rise to few exceedances relative to those from recorded data, contributing only to 8 of the total 144 months, all 8 of which had observed as well as imputed exceedances. Seven of the months had 1 day with an imputed exceedance of the level, and the other had 2 such days. Comparison of observed levels of OJ with those that would have been imputed by our procedure indicates that as a result of regression towards the mean the procedure tends to underestimate 0, levels: thus the method is conservative in the sense that the rate parameter Ri is slightly underestimated. The effect of this conservation seems small because of the small contribution made by the imputed values. The regression towards the mean could to some extent be overcome by ‘post-whitening’ the imputed values: adding approximately chosen random noise to them.
4.3. Mode/fitting and interpretation Once each day in the database was classified as missing, containing an exceedance of a given level, or not containing an exceedance thereof, the methods described in section 2 were applied. The data were split
0.312 0.351 0.234
into monthly periods, and the number of non-missing days of data ni each month and the recorded number of days with exceedances, Ri, found. Here there are 4 different full years of data at three sites, so the total number of months for analysis is k = 144. The average maximum daily temperature ti each month at each site was also extracted from the database to act as a surrogate for insolation, which drives the chemical reactions which produce 0,. Several versions of the statistical model described by Equations (I), (2) and (3), but with varying combinations of covariates, were fitted to the data by the method of maximum likelihood using the statistical package GLIM (Baker and Nelder, 1978). A model adequate for our purpose contains the following terms: an effect of average daily maximum temperature; different baseline rates at which exceedances occur each year in the so-called ‘Ozone season’-the months May-October; and different baseline rates at which exceedances occur for each combination of site and year-a so-called ‘site-year interaction.’ Table 3 shows the contributions of the different terms of the model towards explaining the variation observed in the data, both with and without six anomalous observations discussed below. The ‘deviance’ is defined as twice the diITerence between the maximum possible loglikelihood for the data under the model defined at Equations (l), (2) and (3) and the loglikelihood achieved by fitting model terms in the table. For instance, the reduction in deviance due to augmenting xi/3 for site effects at (3) by terms representing yearly variation is 6.36 if the six anomalous observations are included. Table 3 may be thought of as analogous to an analysis of variance table, but with the distinction that the deviance explained by a term would be approximately distributed as chi-squared on the corresponding number of degrees of freedom if that term were not needed in the model, and otherwise would tend to be too large. If the data had no effect of daily maximum temperature, for example, the correspond-
319 8%? 149 Oft
2B O@ 149 ot”
Port Arthur 1981 1982 1983 1984
Orange 1981 1982 1983 1984
JN 1oq 289 04p
28g 03 3q l@
3+$ OH 283 l+$
Feb.
OH of-3 149 oq
318 09 29 4%
7@ 4+j 26Y OJrt
Mar.
2 ts* OH 3Y OF
30?? 24 1Olp 8l:
6f3 I# 129 6+!
Apr.
17Y 3ff 29 o!jQ
319 OY 179 O;t’
Oq 6#$ 3% O+#
May
Missing hours in incomplete days Key: Entire missing days -” incomplete days ---_
tff .Of-4 1g%? Oy
Beaumont 1981 1982 1983 1984
Jan.
309 3): 9zp 7+Q
129 2v 303 27y
Ojj 8+(j 9y 6$
June
185$ 249 998 l#
12182 219 26Y 5%
l’p 28 Isp 31% 6+
July
4= 23$ 139 0%
188 209 12% J$j
6#$ 119 319 2%
Aug.
0% 45s 174r 3%
7’” 1S$ 1oq 0%
4jj 6+ 219 1%
Sept.
11491 6”lf 6#$ 0%
134& 59 4p 14g
O+j 69 59 I+$
oci.
lift 2y” Oq 48
6y”: Off 2q 2%
lfp 4L: 34;? 0%
Nov.
1% 7F Srf 2&j
fls” 89 2q 0%
O+$ I$? 2# 1%
Dec.
Table 2. Numbers of enbre missing days, of missing hours in incomplete days, and of incomplete days, per month
97% 82ftp 12oq9@ 17%
220!,# 81% 132% 62 ‘&
29% 81% 189e 24j#
Totals
635
On the statistical analysis of ambient ozone data
Table 3. Deviance explained by model terms when entered in the order given Degrees of
Model term Site Year Site-Year interaction Daily maximum temperature Year-Ozone Season interaction Residual
Deviance explained with outliers
Deviance explained
2 :
10.65 6.36 19.39
8.38 5.55 42.53
1
133.75
137.19
4
17.80
10.98
116/110
199.13
140.11
freedom
ing deviances explained in Table 3 would be randomly dist~bu~ed as chi-squared on one degree of freedom. In fact those deviances are very large compared to that distribution, overwhelming evidence of a strong effect of daily maximum temperature whether or not the six outlying observations are included in the fit. The deviance explained by each effect except that for years is statistically significant at less than the 5 % level, indicting the need for the retention of all effects in the model. The effect of daily maximum temperature is enormous, The year effect is retained in the model because of the interaction of site and year: it makes little sense to allow overall rates of exceedance to vary with year and site but to require that the overall year effects must of necessity sum to zero. The strength of the site-year interaction apparently increases when the outliers are excluded, but the OJ season effect seems to diminish, though it remains significant at roughly the 5 % level. The residual deviances explained with and without the six outliers have 116 and 110 degrees of freedom, respectively. They are respectively very high and fairly high relative to the appropriate chi-squared distributions, which suggests that although a good deal of the extra variation is due to the outliers, the data are somewhat overdispersed relative to what might be expected. However there are technical statistical reasons why the approximate chi-squared distribution of the deviance may be less adequate in this case, and an alternative and preferable check on model fit is provided by Pearson’s X2 statistic. After the six outliers are excluded, X2 has value 119.3 and so gives no evidence of lack of fit relative to its asymptotiq chisquared distribution on 110 degrees of freedom. See McCullagh (1985) for more details. The effect on the exceedance rate parameter li of a 1°C increase in average daily maximum temperature, f;, is to increase i, by a factor of 1.11, with approximate 95 ‘i; confidence interval ( I .06, 1.16). Since r, varies by about 15°C between winter and summer months, the annual variation in the rate due to daily maximum temperature alone is roughly a factor 4.8. This effect is to some extent linked with the O3 season effect, which increases the baseline rate at which exceedances occur
without outliers
by factors respectively 0.99, 1.77,2.37 and 0.82 in the years 1981-1984. Corresponding separate 95 % confidence intervals for these increases are (0.48,2.05), (0.87, 3.60), (1.13, 4.97) and (0.45, 1.49). Thus although the years 1982 and 1983 seem to show an increase in overall levels of OJ distinct from the effect of temperature during the months May-October, 1981 and 1984 arguably did not. A possible interpretation of the OJ season effect is as climatic or regional annual variation, as the data show no evidence that the effect varies from site to site. Figure 2 shows the estimated daily rates of exceedance i. of 0.08 ppm ozone for the 4 years of interest, together with the observed proportion ofdays per month with exceedances of that level and approximate confidence bands obtained by estimating daily rates of exceedance from 50 sets of binomial data randomly simulated from the estimated daily rates of exceedance at each site. These confidence bands consist of the maxima and minima of the 50 simulations and are displayed to give some idea of the variability of the
%
0.6
Fig. 2(a). Estimated rate of exceedance over 0.08 ppm ambient 03: Beaumont, 1981-1984. Key:exceedance rate (--); envelope from 50 ~rnu~tjons (---): monthly ratio of days with exceedancesto non-missing days (*).
A. C. DAVISONand M. W. HEMPHILL
636
Table 4. Comparison of counts of days with exceedances over 0.08 ppm ozone: observed counts, fitted counts, and estimated annnal total counts based on model
Observed
Beaumont
Port Arthur
Orange
1981 1982 1983 1984 1981 1982 1983 1984 1981 1982 1983 1984
24 Month
Fitted
Estimated annual total
29 34 23 44 29 27 29 31 25
8.64(2.88) W.OO(5.25) 23.16(4.30) 33.55(5.45) 28.98(4.68) 27.02(4.76) 28.86(4.67) 26.87(4.89) 24.97(4.67)
9.43(3.02) 51.40(6.37) 56.32(6.52) 36.55(5.68) 46.87(5.91) 41.87(5.84) 57.75(6.53) 33.80(5.48) 38.37(5.76)
11 27 30
10.98(3.21) 26.80(4.71) 30.03(5.19)
16.80(3.95) 39.36(5.70) 31.71(5.33)
Format is: count (estimated standard error of count). Fig. 2(b). Estimated rate of exceedance over 0.08 ppm ambient 0,: Port Arthur, 1981-1984. Key: exceedance rate (-); envelope from 50 simulations (---); monthly ratio of days with exceedances to non-missing days (,).
0.8 1 0.7 S 0.6 B & ,p 0.5 I
0
12
24 Month
36
48
Fig. 2(c). Estimated rate of exceedance over 0.08 ppm ambient 0,: Orange, 1981-1984. Key: exceedance rate (--); envelope from 50 simulations (---); monthly ratio of days with exceedances to non-missing days (.).
estimated
rate. Partly because substantial
numbers
that they should reflect the patterns of those at nearby Port Arthur and Orange. Table 4 compares the observed number of days with exceedances at each site by year with the number fitted under the model and the total number each year estimated from the fitted model. The fitted number is Zai/?i, with ai and p^i,respectively the number of nonmissing days and the estimated daily probability of exceedance in month i of the given year at the site of interest; summation is over i = 1,. . . , 12. The total number of exceedances each year predicted by the fitted model is Cmipli, with estimated variance ~mi~i (1 -ii); here mi is the total number of days in the ilh month. This makes appropriate allowance for the different number of days of missing data and probabilities of exceedance pi each month. The table indicates that the model fits very well except for Beaumont in 1981 and Beaumont and Port Arthur in 1984, for each of which it underestimates the numbers of exceedances. This underestimation is particularly bad at Beaumont in 1981, and is due to several unusual observations incapable of prediction by the model, which are discussed below. The estimated annual total numbers of observations are often substantially larger than the corresponding observed numbers, and indicate the sometimes large loss of information due to missing data. Examination of the model to detect lack of fit shows the presence of six apparently anomalous observations. Dropping them from the model and refitting gives a better fit of the model to the data, but no qualitative change in the results. In particular the need for inclusion of a site-year interaction remains. All six outliers represent months for which the fitted model underestimated the number of exceedances. These peculiar observations seem to arise in three groups: the first at Beaumont during the second half of 1981; the second during January 1984 at Beaumont and Port Arthur; and the third during May 1984 at Beaumont (see Fig. 2). supposition
of
entire days of data are missing in some months, the observed proportions of exceedances, which are shown as stars in the figure, are quite scattered. The curves are slightly smoothed using a spline algorithm. Marked seasonal variation of the rate of exceedances 1 is evident in Fig. 2. The figure explains the site-year interaction as a difference between Orange and Port Arthur on one hand and Beaumont on the other, since the pattern of exceedances at Beaumont over the years in question is evidently different from that at the other two sites. Thus it may be misleading not to monitor On levels at Beaumont under the
On the statistical analysis of ambient ozone data
The first group of high exposures, in August, September and November 1981, occurred in a series of fairly separate incidents, during which the winds were typically of 2-3 m s-r and variable in direction. The third group of exposures, at Beaumont in May 1984, occurred in lower wind speeds, but the pattern of exceedances is, like the first, rather sporadic. The second group is due to two isolated episodes, during each of which the winds were from the SW and had rather lower speeds, typically of 0.5-2.5 m S- I. Trajectory plots show evidence that exposures at both Beaumont and Port Arthur may have been due to emissions from sources SW of Beaumont, but whether these emissions were routine or extraordinary is not known, as records were not kept at the time. The levels of OJ experienced during these episodes were higher than for the other groups. No other observations seeem odd, nor is there evidence that any particular observations influence the parameter estimates overmuch. Examination of residuals shows no systematic departures from the model, except for some evidence that the data are slightly overdispersed relative to it. Such overdispersion, which seems common in environmental applications, is discussed in detail in McCullagh and Nelder (1983, Ch. 8), together with statistical methods to accommodate it.
5.
CONCLUSION
AND DISCUSSION
This paper has shown how two ideas for dealing with missing observations may be applied lo the statistical analysis of ambient OJ data, and results obtained which are not available from methods currently in use. The first idea, which is closely related to United States Environmental Protection Agency guidelines for the monitoring of ambient 03, is to make allowance for entire missing days of data by statistically modefling the rate at which a predetermined threshold level of pollutant is exceeded. This enables an analysis of the data to be performed whose results are directly interpretable in terms of the National Ambient Air Quality Standard for OJ. Moreover appropriate allowance can be made for missing days of data as and when they occur, leading to estimates of the total numbers of days in exceedance of the threshold even when many observations are missing. The et&t of temperature and other variables which affect the rate at which the given threshold level is exceeded can be estimated, unanticipated features of the data detected, and occasions with unusually high levels of 0, can be identified. The basis of the method is a so-called generalized linear model (McCullagh and Nelder, 1983) which can be fitted to data using standard statistical packages. The second idea is to use the observed levels of related pollutants to impute missing OJ observations
631
within incomplete days of data. The levels ofNO, NOz and 0, are regressed on simple functions of temperature, and missing O3 values imputed using the estimated regression equations. The basis of this is multivariate regression analysis (Mardia et 01.~197% a standard and easily applied statistical method. The usefulness of these ideas has been illustrated by their providing an incisive and readily interpretable statistical analysis of OJ data at three sites in SE Texas. Some miscellaneous comments and discussion follow. (a) In section 2 it was supposed that days on which exceedances occur arise independently and with equal probability on each day of a month of interest. Since high pollution days tend to cluster together due to shortterm persistence of meteorological conditions, this is somewhat unrealistic. The effect of this clustering is rather small, but the difficulty can to some extent be overcome by breaking the data into shorter periods, of length 1 week or even 1 day, and using in the model the average daily maximum temperature or other suitable covariates associated with that period. This would also help overcome the obvious problem that the weather is not always homogeneous throughout monthly periods, and hence that the probabilities p, of exceedances may vary within them. (b) Throughout the development in section 2 and subsequent use of the method in section 4, attention is paid only to the occurrence of exceedances over the threshold of interest. In the present context of an analysis closely related to the United States EPA standard for ozone this seems sensible, but in other contexts the size of the exceedances or some related quantity might be the focus of attention. Davison (1984) and Smith (1984) describe statistical methods which can be used in such circumstances. See ApSimon and Davison (1986) and Berger et al. (1982) for examples of the use of these and related methods in the air pollution literature. (c) Methods other than those of section 3, particularly time series methods, could be used to impute missing values in ambient O3 data. A referee suggested that missing values be deduced by interpolation from available O3 observations at the same site, using estimated diurnal variations. This idea seems good but we feel that it would be important to base the interpolation on a measure of the diurnal variation observed on the day in question, such as hourly temperature observations, rather than some average measure of diurnal variation. In any case, the choice of method seems unlikely to be crucial except where very few data are available-in which case any method should be used cautiously. (d) Weather conditions which lead to stagnant air, such as unusually prolonged stationary anticyclones, can produce episodes of consecutive days each with abnormally high 0, concentrations. The method for imputation of missing values described in section 3, however, makes no allowance for this buildup of 03. Such allowance could be made by incorporating some measure of O3 levels on immediately preceding days,
A.
638
such as their average maximum (4.1).
C. DAVISONand M. W. HEMPHKL
OS level, into Equation
~cknow~~ffe~~fs-~is work was performed at the Center for Statistical Sciences at The Uni~~ity of Texas at Austin;
the first author was supported in part by the National Science Foundation. We thank Joe Hill, David Hinkley, Alfred0 Vaquiax and Lee Watkins of The University of Texas, and Brua Brobera. Becky Caldwell and Jim Gise of the Texas Air Control Boari; for s&port and helpful discussions. A referee made several helpful comments.
independently distributed with means fi. The vector p of means is assumed to be linearly related to an n x p matrix X of known quantitie-s, covariata, by the equation cc = X/?, where fi is a p x I vector of unknown parameters to be estimated from Y. This may be written thus: (1) The random component: Y has an inde~ndent normal distribution with constant variance c2 and expected value EY = I(, i.e. the ith component Yi of Y has probability density function
j(Yi;~.~)=expC-{(yi-~i)2/uZ+~~~(2~~2)}/21~ which has mean EY, = pi and variance var (&) = 6’. (2) The systematic component: the n x I covariate vectors XI,. I. * x,, produce a linear predictor q given by q=
REFERENCES ApSimon H. M. and Davison A. C. (1986) A statistical model for deriving probability distributions of contamination for accidental releases. Atmospheric Enuironmenr 20, 1249-1259. .4tkinson R. and Lloyd A. C. (1984) Evaluation of kineticand mechanistic data for modeling of photochemical smog. J. Phys. Chem. Ref. Dora 13, 31-39. Baker R. J. and Nelder J. A. (1978) The GLIM System, Release 3, Generalized Linear lnteraclive Modelling, Numerical Algorithms Group, Oxford. Berger A., Melice J. L. and DeMuth Cl. (1982) Statistical distributions of daily and high atmospheric SO2 concentrations. Atmospheric
Enuironmenr
16, 2863-2877.
Davison A. C. (1984) Modelling exceedanccs over high thresholds, with an application. In Storistical Extremes and Applicorions (edited by Tiago de Oliveira J.). D. Reidel, Dordrecht. Dobson A. J. (f983) An forr~ucrion to Sratisrical quelling. Chapman and Hall, London. Feller W. (1968) An Introduction to Probabiliry Theory nnd ifs Applications. Third edition, Vol. I. Wiley, New York. Horowitz J. L. (1982) Air Qualiry Analysis for Urban Transporrofion Planning. MIT Press, Cambridge, Massachusetts. Mardia K. V., Kent J. T. and Bibby J. M. (1979) Mulrivmiure Analysis. Academic Press, London. McCullagh P. (1985) On the asymptotic distribution of Pearson’s statistic in linear ex~nential-family models. fnr. Statisr. Reo. 53, 61-67. McCuIlagh P. and Nelder J. A. (1983) Generalized Linear Models. Chapman and Hall, London. SAS Institute. Inc. (1982) Srarisricol Anolvsis Svstem User’s Guide: B&s d Sr&stics, 1982 edit&s. SkS Institute, Gary, N. Carolina. Smith R. L. (1984) Threshold methods for sample extremes. In Stufisrical Extremes und Appbcurions (edited by Tiago de Oliveira J.). D. Reidel, Dordrecht. Tiao G. C., Phadke M. S. and Box G. E. P. (1976) Some empirical models for the Los Angeles photochemical smog data. J. Air Polfut. Controt Ass. 26,485-490. United States Environmental Protection Agency (1979) Guideline for interpretation of ozone air quality standards. Office of Air. Noise and Radiation: Office of Air Quality Planning and Standards; Research &iangle Park, N Carolina 27711. Walker H. M. (1985)Ten-year ozone trends in California and Texas. J. Air Pollur. Control Ass. 35, 903-912. We&erg S. (1985) Applied Linear Regression, Second edition.
Wiley, New York.
APPENDIX I: GENERALIZED LINEAR
MODELS
At their simplest, generalized linear models extend linear
regression models in two ways. A linear regression model has the following structure: a vector of observations y of length N is darned to arise from a vector of random variables Y
(3) The
link between
Zj@jXj+
the
random
and
systematic
componenls: ? = s@k where g(p) = p. Generalized linear models extend this in two ways: first, the random component need not be normal but may be a member of a class of distributions known as an exponential family; and second, the link function g(q) need not be the identity function g(p) = p, but must be a monotonic dj~erentiable function. Thus (I) and (3) above are replaced by (1’) The random component: Y has an independent distribution in the one-parameter exponential family, i.e. the i’h component & of Y has the probability density function /(Yi;ei,#)=expCtY,Oi-b(Bi)jlp(~)+cCy,~)] for some functions a(,), b(*) and c(.), which has mean EY, = p = ~b($,)/~~j and variance var (Yi) = u(~~zb(~i~/~~~. (3’) The link between the systematic and random component is 9 = 9(/1X where g(‘) is a monotonic differentiable function. This generalization encompasses many common statistical procedures, such as binary logistic regression and log-linear models, as well as the method of least squares. Thus the linear regression model outlined above has normal distribution with @,=k, b(&)=pf/2, # =u2, a(#$= #, and c) = - {y2/02 + log (2zo*)], and identity link function. The model de&bed at %quati&s (I), (2) and (3) of section 2 has binomial distribution, with a(~$)= I/n, b(O) = log (I $ e’), c& 4) = log { G”,I}, p = ee(l + e’), and the complementary log-log link, i.e. g(p) = log { - log (I -II)}. The book by McCullagh and Nelder (I 983), on which this brief account is based, shows how maximum likelihood estimation of the unknown parameters fi in the systematic component of a generalized Iinear model may be performed quickly and efficiently by the method of iterated weighted least squares. The usual diagnostic methods for modelchecking based on residuals and so on may easily be applied in this much more general setting. Several statistical packages, most notably GLIM (Baker and Nelder, 1978) make fitting these models to data fast and simple.
APPENDIX 2: MULTIYARlATE
REGRESSION ANALYSIS
Suppose that an x p matrix Y of observations is related to a n x q matrix X of known quantities by the linear relationship Y=XA+U, where A is a q x p matrix of unknown parameters, and the rows of the n x p matrix U of random errors are independent multivariate normal random variables with mean zero and unknown p x p covariancc matrix Q. The pro~bility density
639
On the statistical analysis of ambient ozone data of Y then has logarithm I(A,n)=
-fnlog)ZnR(-ffr{(Y-XA)R-‘(Y-XA)‘}. (At)
which is the loglikelihood of the data Y in terms of the unknown parameters A and fL We briefly discuss maximum likelihood estimation of A and f&basing our summary on the account by Mardia et 01. (1979). wherein more details may be found. To ensure that the maximum likelihood estimate of A is unique, we assume that n 2 p + q and that X has full rank q, so that the inverse (X,X)-’ exists. The matrix P = I -X(X’X)-‘X’ is the idempotent symmetric matrix which projects the observation space onto the space orthogonal to the columns of X, and so PX = 0. To derive the maximum likelihood estimates of A and Q, let A = (X,X)-‘X’Y and h = n-‘Y’PY; define also Y = XA =X(X,X)-‘X’YandO=Y-XA=PY.NowY-XA=O + XA - XA. so the second term on the right-hand side of Equation (1) is -frr{(R-‘(Y-XA)‘(Y-XA)} + (2 - A)‘X’X(i
= -ftr[W’{rj’Ij
Y* = X*A+U*, where X * is a known 1 x q vector ofcovariates and that U l is a 1 x q vector of multivariate normal random variables with covariance matrix CI. The conditional mean of Y: given that YJ = y: is then
-A)}] g(Y:)Y:
and (1) becomes /(A,R) = -tnloglZxRl-tnrr(R-‘6) -frr{f--‘(A-A)‘X’X(A-A)}.
(A2)
Only !he last term of (2) involves A, and it is maximized when A = A. The ‘reduced’ loglikelihood obtained when A = 2 is rcA,n,=
fi = n _ ’ Y’PY are respectively the maximum likelihood estimates of A and R. Moreover A is an unbiased estimate of A, and an unbiased estimate of f2 is (n - p)- ’ Y’PY. That the estimates d and fi are indenendentlv distributed is evident upon noting that since PX’=_O, A and 0 are independent, and that nfi = Y’PY = O’U. The estimate A is multivariate normal with mean A and covariances given by cov (a’,j, ci,) = wj,ge, where G is the q x q matrix (X,X)-‘. The estimated is so distributed that nb is a Wishart random variable with parameters p, R and n -4. Section 3 requires calculation of the estimated mean of a single unobserved element Y:, say, of the new 1 x p observation Y* = (UT,. . . , Y;), conditionally on the observed values of a separate subset Y: = (Y:, . . , Y;) of Y*. We assume that Y* is observed independently of Y. Suppose that Y: has taken observed value yl, that
-fnlog(2n)-tn{logIRI+rr(n-16)},
which is maximized when R = 6. Thus A = (X,X)- ‘X’Y and
= y,+) = X’o,
+&,f&,l
(I’,+ -X*A:).
(3)
Here a, and A: are respectively the q x 1 and q x @ -s + 1) submatrices of A corresponding to Y 7 and Y:, and R,, and R,, are respectively the 1 x (p -s + 1) and (p -s + 1) x (p-s + 1) submatrices of R corresponding to the covariance of Y: and Y: and of Y,’ with itself. in our application the matrices A and Rare unknown, but the conditional mean may be estimated by substituting their maximum likelihood estimates into Equation (3).