Atmospheric Environment Vol. 16, No. 10, pp. 2299-2308, 1982
0004-6981/82/102299-10 $03.00/0 Pergamon Press Ltd.
Printed in Great Britain.
TIME SERIES ANALYSIS OF AN HISTORICAL VISIBILITY DATA BASE
L. C. MURRAY and R. J. FARBER
Research and Development, Southern California Edison Company, P.O. Box 800, Rosemead, CA 91770, U.S.A.
(First received 30 March 1981 and received for publication 25 March 1982)
Abstract--An historical visibility-sulfate data base has been statistically modeled using time series analysis. Salt Lake City, for the 3-year period beginning January 1971, was selected because daily sulfate values were available. This study examines the procedures involved in selecting the most appropriate statistical techniques for this kind of data base. First, the inadequacies of linear regression analysis are shown. Second, we demonstrate the usefulness of the Box-Jenkins technique on the time series. This method is able to account for autocorrelation among the variables as well as lead-lag relationships between them.
1. INTRODUCTION
Various forms of regression analysis have been employed by several researchers in an attempt to relate independent variables, such as sulfate, to visual range. For several reasons, regression analysis appears unsatisfactory for this type of data base. This paper investigates the procedures involved in selecting appropriate statistical techniques and compares various types of regression models with the statistical method which we have chosen as most appropriate.
2. DESCRIPTION OF THE HISTORICAL VISIBILITY DATA BASE
During the past several years, ambient air quality measurements, including the measurement of sulfate, have been taken every sixth day, while visual range measurements have been taken on an hourly basis at many airports throughout the West. However, for a short period of time in Salt Lake City, daily sulfate measurements were taken. Because of the completeness of this data set, Salt Lake City was chosen for analysis, beginning in January 1971 and continuing to the end of 1973. To simplify the study, only two variables were selected for evaluation--visual range and sulfate. Visual range was taken hourly at Salt Lake City airport and sulfate was sampled daily for a 24-h period in the same area. There are obvious shortcomings associated with this data set, which is one important reason for not trying to draw quantitative conclusions from the data. Visual range was measured in 10-mile increments for longer distances and the measurement itself is certainly somewhat subjective. Furthermore, visual range was arbitrarily truncated at the upper limit of 50 miles (80 km). Sulfate concentrations were collected on untreated glass fiber filters and are subject to substantial spurious sulfate formation on the filter, particularly during the winter months when relative humidities are comparatively higher.
3. SELECTION OF AN APPROPRIATE STATISTICAL TECHNIQUE
Traditionally, researchers have tried to establish a physical relationship between visibility and other air quality measurements using linear models (Cass, 1979; Leaderer et al., 1979; Trijonis, 1979; White and Roberts, 1977). One such relationship is that the extinction coefficient, $b_{ext}$, is linearly related to sulfate, nitrate, TSP and relative humidity, where
$$ b_{ext} = \frac{24.3}{v}, \tag{1} $$
with $v$ the visual range in miles and $b_{ext}$ the extinction coefficient in $(10^4\,\mathrm{m})^{-1}$. Equation (1) is based on the Koschmieder formula for black objects with a 2 % contrast threshold. In keeping with meteorological protocol, visual range is expressed in miles but $b_{ext}$ in meters. In the present study, because only two variables were available, the extinction coefficient was regressed against sulfate only. The result of these calculations is a poor fit, $r^2 = 0.0895$, due to one outlier when visibility was 0.05 miles. A change in this value to 0.5 miles was made to improve the fit, $r^2 = 0.327$. With the modified outlier, Fig. 1 plots the extinction coefficient versus sulfate. The time plot of the extinction coefficient is given in Fig. 2 and a frequency histogram of the data in Fig. 3. Figure 4 is a histogram of the residuals. As can be clearly seen from the heavy tail and peaked histogram, the residuals are far from normally distributed. Figure 5 is a plot of the autocorrelation of the residuals, defined by
$$ r_i = \frac{\sum_t e_t\, e_{t-i}}{\sum_t e_t^2}, \qquad i = 1, 2, \ldots $$
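As a numerical aside on Equation (1), the following Python sketch derives the constant 24.3 from the Koschmieder relation with a 2 % contrast threshold and converts a visual range in miles to an extinction coefficient. The function name `bext_from_visual_range` and the conversion factor of 1609.344 m per mile are assumptions introduced here for illustration; they are not taken from the paper.

```python
import math

METERS_PER_MILE = 1609.344   # assumed statute-mile conversion
CONTRAST_THRESHOLD = 0.02    # 2 % contrast threshold, as stated for Equation (1)

# Koschmieder: b_ext = -ln(threshold) / visual range (in meters).
# Expressing b_ext in (10^4 m)^-1 and the range in miles gives the constant.
KOSCHMIEDER_CONSTANT = -math.log(CONTRAST_THRESHOLD) / METERS_PER_MILE * 1.0e4


def bext_from_visual_range(v_miles):
    """Extinction coefficient in (10^4 m)^-1 for a visual range in miles."""
    if v_miles <= 0:
        raise ValueError("visual range must be positive")
    return KOSCHMIEDER_CONSTANT / v_miles


if __name__ == "__main__":
    print("constant:", round(KOSCHMIEDER_CONSTANT, 1))
    # The outlier (0.05 miles), its modified value (0.5 miles) and the
    # truncation limit (50 miles) mentioned in the text.
    for v in (0.05, 0.5, 50.0):
        print(v, "miles ->", round(bext_from_visual_range(v), 2))
```

Because $b_{ext}$ varies as $1/v$, the single 0.05 mile observation maps to a value ten times larger than it does at 0.5 miles, which is consistent with the strong influence that one point had on the fit.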
Fig. 1. Salt Lake City calculated $b_{ext}$ from observed visual range versus measured sulfate levels from 1/1/71 to 12/31/73. [Scatter plot; x-axis: sulfate (µg/m³). Legend note: one plotting symbol contains about 900 data points.]
Fig. 2. Time plot of the daily averaged calculated $b_{ext}$ from observed visual range over the three year period in Salt Lake City from 1/1/71 to 12/31/73. [x-axis: days.]
Fig. 3. Frequency of occurrence histogram of the $b_{ext}$ values for Salt Lake City. [x-axis: $b_{ext}$ in $(10^4\,\mathrm{m})^{-1}$.]
Fig. 4. Frequency of occurrence histogram of the residuals from the extinction coefficient-sulfate linear regression model. [x-axis: regression residuals in $(10^4\,\mathrm{m})^{-1}$.]
Fig. 5. Autocorrelation function of the residuals from the extinction coefficient-sulfate linear regression model. [x-axis: lag (days).]
In the autocorrelation function of Fig. 5, $e_t$ denotes the residuals of the regression equation
$$ b_{ext} = -0.59 + 0.37\,S \quad \text{(outlier modified)}, \tag{2} $$
and $S$ is the sulfate concentration in µg m$^{-3}$. The autocorrelation becomes significantly different from zero at the 95 % confidence level when its absolute value exceeds the critical value, which is approximately
$$ \frac{2}{\sqrt{n}} = 0.06, $$
where $n$ is the number of observations (1096 in this case). Autocorrelation exists among the residuals since many of the values are greater than 0.06. Because of the non-normality and autocorrelation, the estimators for the slope and intercept are biased estimators. Since regression analysis using least squares estimates is not able to treat non-normality and autocorrelation among the residuals statistically, any physical interpretations drawn from regression analysis may be invalid.
The second regression model presented, Equation (3), is visibility regressed on log sulfate,
$$ v = 34.7 - 6.8 \ln(S), \tag{3} $$
where $\ln(S)$ is the natural log of the sulfate concentration in µg m$^{-3}$. This particular model is presented because, of the several models investigated, it demonstrated the best linear relationship. The results of the calculations are presented in Figs 6 and 7.
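The residual check applied to both regression models can be sketched in Python as follows. This is a minimal illustration, not the authors' procedure: the helper names are hypothetical, and the data are synthetic (a straight line plus first-order autoregressive errors with an assumed coefficient of 0.6), since the Salt Lake City observations are not reproduced here.

```python
import math
import random


def ols_fit(x, y):
    """Least squares slope and intercept for y = intercept + slope * x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def residual_autocorrelation(e, lag):
    """sum_t e_t * e_{t-lag} / sum_t e_t^2, the residual form used for Fig. 5."""
    num = sum(e[t] * e[t - lag] for t in range(lag, len(e)))
    den = sum(v * v for v in e)
    return num / den


def critical_value(n):
    """Approximate 95 % bound on the autocorrelation, 2 / sqrt(n)."""
    return 2.0 / math.sqrt(n)


def synthetic_example(n=1096, phi=0.6, seed=0):
    """Hypothetical data: linear signal plus AR(1) errors (not measured data)."""
    rng = random.Random(seed)
    x = [rng.uniform(1.0, 30.0) for _ in range(n)]
    err, y = 0.0, []
    for xi in x:
        err = phi * err + rng.gauss(0.0, 1.0)   # errors remember yesterday
        y.append(-0.59 + 0.37 * xi + err)
    return x, y


if __name__ == "__main__":
    x, y = synthetic_example()
    slope, intercept = ols_fit(x, y)
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    bound = critical_value(len(resid))
    flagged = [k for k in range(1, 11)
               if abs(residual_autocorrelation(resid, k)) > bound]
    print("bound:", round(bound, 3), "lags exceeding it:", flagged)
```

In this synthetic case the fitted line itself looks reasonable, yet the residuals at short lags exceed the bound, which is the kind of violation the text describes for the regression residuals.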
Fig. 6. Frequency of occurrence histogram of the residuals from the visibility-log sulfate linear regression model. [x-axis: regression residuals (miles).]
Fig. 7. Autocorrelation function of the residuals from the visibility-log sulfate linear regression model. [x-axis: lag (days).]
In this case the regression residuals appear to be normally distributed or very nearly so, but they still demonstrate autocorrelation (i.e. values are greater than 0.06). Again, the assumptions of regression analysis have been violated so that any conclusions based upon regression may be invalid. The inadequacies associated with univariate linear regression analysis, including autocorrelation and non-normality of residuals, also extend to multivariate regression techniques.
According to Box and Jenkins (1970), time series analysis takes into account seasonality, lead-lag relationships within and between variables, and exogenous variables. Because of these capabilities, it is well suited to analyzing the dynamic relationship between any two variables, in this case visibility and ambient air quality parameters. This dynamic concept is illustrated by considering atmospheric behavior. Visual range will be comparatively excellent in the clean airmass as high pressure builds into an area following a weather front. As the high pressure area stagnates, with the passage of time, there will be a gradual deterioration in visual range as various natural and anthropogenic pollutants accumulate. Here both seasonality and lead-lag relationships are involved. For this particular statistical analysis, sulfates are the input into the dynamic system and visibility is the output or response variable. Time series analysis examines sequences of observations and how the inputs (sulfate) into a system affect the output (visibility) as a function of time.
4. DESCRIPTION OF TIME SERIES ANALYSIS
Several steps are involved in the proper construction of a statistical model. These include model identification, model building and model evaluation. Model evaluation is an integral ongoing process of model development.
(a) Model identification
The quality of the data base and the degree of independence for each variable need to be examined. As a first step, the raw data for daily averaged visibility in Salt Lake City have been plotted in Fig. 8. From this plot we note that there are many fixed values; no visual range exceeds 50 miles; the spread in visual range is uniform over time; and visibility is cyclic in nature. A frequency histogram of these values, shown in Fig. 9, indicates that the data are approximately normally distributed. Figures 8 and 9 indicate that no further transformation is necessary because the data are sufficiently normal about the mean. Figure 10 is an analogous plot of daily sulfate levels with time. Unlike visual range data, there are no fixed values and the spread in sulfate is not uniform over time. This is more clearly seen in the frequency histogram, Fig. 11, where the sulfate distribution is
skewed and tails significantly toward higher concentrations. Since the relationship between visibility and log sulfate is linear or very nearly so, it was determined that this transformation would provide better least squares estimates for the parameters. As shown in Figs 12 and 13, the log transformation on the sulfate data does tend to normalize the data. Figure 12 illustrates that the variability of sulfate concentration is also much more uniform.
Another facet of model identification involves examination of the independence of the sequence of visual range observations and sulfate concentrations. Time series analysis can account for the dependence of these variables and employs this dependence as an integral part of the statistical model. By contrast, regression analysis cannot account for this dependence and, when it is statistically significant, should not be used. The degree of dependence of these variables is analyzed by estimating the autocorrelation function. The autocorrelation of lag $i$ days for visibility is defined as
$$ r_i = \frac{\sum_t (v_t - \bar{v})(v_{t-i} - \bar{v})}{\sum_t (v_t - \bar{v})^2}, \qquad i = 1, 2, \ldots \tag{4} $$
where $v_t$ is the visual range at day $t$, $v_{t-i}$ is the visual range $i$ days back and $\bar{v}$ is the mean visual range. These results are displayed in Figs 14 and 15 for visual range and log transformed sulfate values. These results are statistically significant at the 95 % confidence level, or $2\sigma$, for daily values which are greater than 0.06. The standard deviation is
$$ \sigma = \frac{1}{\sqrt{n}}, $$
where $n = 1096$, the number of observations. The first value of Fig. 14 says that approximately $(0.5)^2 = 25\,\%$ of the variability measured today can be explained by what was measured yesterday. The second value on Fig. 14 says that 16 % of the variability measured today can be explained by what was measured two days ago.
Fig. 8. Time plot of the average daily visual range in Salt Lake City during the three year period from 1/1/71 to 12/31/73. [x-axis: days.]
Fig. 9. Histogram of the average daily visual range in Salt Lake City from 1/1/71 to 12/31/73. [x-axis: visibility (miles).]
Fig. 10. Time plot of the daily sulfate levels in Salt Lake City from 1/1/71 to 12/31/73. [x-axis: days.]
Fig. 11. Histogram of the daily sulfate levels in Salt Lake City from 1/1/71 to 12/31/73. [x-axis: sulfate (µg m$^{-3}$).]
Fig. 12. Time plot of the log of the daily sulfate levels in Salt Lake City. [x-axis: days; logarithmic y-axis.]
Fig. 13. Histogram of the log of the daily sulfate concentrations in Salt Lake City. [x-axis: log sulfate (log µg m$^{-3}$).]
(b) Model building
After model identification has been completed, the next step is the process of building univariate models for each variable. In this study, because we have two variables, sulfate and visibility, two univariate models are constructed. The fitting of the properly transformed data to the time series models is accomplished by obtaining least squares estimates of the parameters. The residuals from these univariate processes are cross correlated and used in the identification of the complete dynamic model. This completed model is defined as the transfer function model. For each phase of model fitting, the criteria for adequacy are that the residuals are independent (no autocorrelation) and that the model exhibits parsimony. The residuals should also exhibit a symmetric distribution (e.g. a normal distribution). A parsimonious model employs the least number of parameters for adequate representation.
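Returning to the log transformation of sulfate discussed under model identification, the short Python sketch below illustrates how a logarithm can reduce the skewness of a right-tailed series. The data are hypothetical lognormal draws with assumed parameters (1.5 and 0.8), chosen only to mimic a distribution that tails toward higher concentrations; the function names are likewise illustrative and nothing here reproduces the paper's measurements.

```python
import math
import random


def skewness(values):
    """Sample skewness: third central moment divided by variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def synthetic_sulfate(n=1096, seed=1):
    """Hypothetical right-skewed concentrations (lognormal), not measured data."""
    rng = random.Random(seed)
    return [rng.lognormvariate(1.5, 0.8) for _ in range(n)]


if __name__ == "__main__":
    raw = synthetic_sulfate()
    logged = [math.log(s) for s in raw]
    print("skewness of raw values:", round(skewness(raw), 2))
    print("skewness after log transform:", round(skewness(logged), 2))
```

A skewness near zero after the transform corresponds to the more symmetric histogram the text reports for log sulfate, although a symmetric histogram alone does not establish normality.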
Fig. 14. Autocorrelation function of the average daily visibility. [x-axis: lag (days).]
Fig. 15. Autocorrelation function of the log sulfate. [x-axis: lag (days).]
5. RESULTS
The univariate models provide an increased understanding of the behavior of the system. The changes in the measured values from day to day are a function of "shocks" to the time process occurring during the past several days. Shocks are the deviations from what would be expected if the dynamics of the system were held constant. The intent of this procedure is to reduce the variance ($\sigma^2$) to a minimum, fit the model so that the residuals are not correlated at the 95 % level (no autocorrelation) and normalize the residuals as much as possible. An analogous procedure was followed for the sulfate concentrations. The equations which resulted from these models are
$$ v_t = (1 - B)^{-1}(1 - 0.63B - 0.16B^2 - 0.10B^3)\,e_t \tag{5} $$
and
$$ \ln(S)_t = (1 - 0.68B)^{-1}(1 - B)^{-1}(1 - 1.12B + 0.15B^2)\,e_t, \tag{6} $$
where $B$ is the backshift operator defined in general by $B x_t = x_{t-1}$ and $B^2 x_t = x_{t-2}$.
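To make the operator notation concrete, Equation (5) can be multiplied through by $(1 - B)$ and written as the recursion $v_t = v_{t-1} + e_t - 0.63e_{t-1} - 0.16e_{t-2} - 0.10e_{t-3}$. The Python sketch below applies this recursion to a list of shocks and then inverts it. It is an illustration under stated assumptions: shocks before the start of the record are taken as zero, the starting level `v0` is arbitrary, and the function names are hypothetical.

```python
import random

THETA = (0.63, 0.16, 0.10)  # moving average coefficients from Equation (5)


def simulate_visibility(shocks, v0=0.0):
    """Apply (1 - B) v_t = (1 - 0.63B - 0.16B^2 - 0.10B^3) e_t.

    Shocks before the start of the record are assumed to be zero.
    """
    series, prev_v = [], v0
    for t, e in enumerate(shocks):
        ma = e
        for k, theta in enumerate(THETA, start=1):
            if t - k >= 0:
                ma -= theta * shocks[t - k]
        prev_v = prev_v + ma      # undo the difference (1 - B)
        series.append(prev_v)
    return series


def recover_shocks(series, v0=0.0):
    """Invert the model: e_t = (v_t - v_{t-1}) + sum_k theta_k * e_{t-k}."""
    shocks, prev_v = [], v0
    for t, v in enumerate(series):
        e = v - prev_v
        for k, theta in enumerate(THETA, start=1):
            if t - k >= 0:
                e += theta * shocks[t - k]
        shocks.append(e)
        prev_v = v
    return shocks


if __name__ == "__main__":
    rng = random.Random(0)
    e = [rng.gauss(0.0, 1.0) for _ in range(10)]
    v = simulate_visibility(e, v0=30.0)
    e_back = recover_shocks(v, v0=30.0)
    print("largest round-trip error:", max(abs(a - b) for a, b in zip(e, e_back)))
```

Under this recursion a single unit shock raises the level by 1 on the day it occurs, after which the moving average terms remove 0.63, 0.16 and 0.10 in turn, leaving a lasting shift of 0.11.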
The next step is the proper fitting of the univariate models into one model, which is termed the transfer function model. This model will shed light on the sequence of visibility observations in relationship to the sequence of sulfate concentrations. To build this model an identification step is necessary. This identification step can be accomplished by calculating the cross correlation function between the residuals of the univariate visibility and sulfate time series models (Table 1). Because the residuals within each series
exhibit no significant dependence (no autocorrelation), it is possible to obtain a clearer understanding of the actual cross correlation function between the variables. As may be expected when autocorrelation exists within the univariate series, the cross correlation between the measured values of two series can lead to inflated spurious correlations. Consequently, correlation coefficients ($r^2$) from multiple regression analysis will incorporate both the cross and autocorrelation functions as there is no way of separating them in that analysis. The construction and formulation of time series models has been discussed, among others, by Merz et al. (1972) and by Chock et al. (1975). Equation (7) is the result of the complete time series model:
$$ v_t = (-4.58 - 1.95B)\ln(S)_t + N_t, $$
where
$$ N_t = (1 - B)^{-1}(1 - 0.72B - 0.13B^2 - 0.08B^3)\,e_t, \tag{7} $$
$N_t$ is the noise, and $N_t$ and $\ln(S)_t$ are independent. Table 1 presents the cross correlation function. Only the present and lag -1 are statistically significant at the 95 % level. The model results may be interpreted to mean that today's visual range is affected by only today's and yesterday's sulfate levels. Because there are no correlations at positive lags, the interpretation is that visibility does not influence future sulfate levels, as would be expected. Table 2 provides an additional comparison between regression and time series models. The objective of these statistical models is to minimize the residual variance. From Table 2, we see that the time series model is far preferable to regression models and provides the best statistical fit to the data.

Table 1. Cross correlation function (lag: value)
-14: -0.042   -13: -0.007   -12:  0.036   -11:  0.003   -10:  0.049
 -9:  0.009    -8:  0.017    -7:  0.033    -6:  0.007    -5:  0.004
 -4: -0.020    -3: -0.070    -2:  0.010    -1: -0.215     0: -0.320
  1: -0.003     2: -0.034     3:  0.012     4: -0.032     5: -0.002
  6: -0.004     7:  0.013     8:  0.032     9:  0.004    10: -0.034
 11: -0.031    12: -0.015    13:  0.018    14:  0.035

χ² test for independence    χ²        Degrees of freedom    Significance level
All lags                    184.86    29                    0.1313
Negative lags only           64.46    14                    0.00
Positive lags only            8.16    14                    0.88

1st series: 1096 residuals (0, 1, 3), Salt Lake visibility (miles). 2nd series: 1096 residuals (1, 1, 2), Salt Lake sulfate (µg m$^{-3}$). Statistically significant cross correlations at the 95 % level are underlined in the printed table. Standard deviation for independent white noise series = 0.030.

Table 2. Comparison of residual variances with visibility as the dependent variable
Residual variance                            Var.      %
Visibility time series model                 69.3      34.2
Visibility-sulfate time series model         58.2      44.7
Visibility regression model                  78.3      25.6
Variance of visibility                      105.3
Extinction coefficient regression model
  r² (×100)                                             8.9
  r² (outlier modified) (×100)                         32.7

The % column gives the percentage reduction in the variance of visibility as explained by the respective model. For the visibility regression model, the percentage reduction in the variance is very nearly equal to r-squared. The percentage reduction in the variance for time series models is analogous to an r-squared value.

6. SUMMARY AND CONCLUSIONS
The objective of this study has been to demonstrate an approach which will most accurately model the data base statistically. Because meteorological and ambient air quality data often exhibit autocorrelation with time, and thus require special attention when used statistically, linear regression analysis is not suitable for our purposes. The Box-Jenkins technique considers autocorrelation among the variables as well as lead-lag relationships between them in order to attain a solid presumption of a cause-effect relationship in bivariate time series analysis. This time series approach can also be naturally extended to multivariate analysis. Although it would be ideal to have an equation that has complete physical meaning, this is not always possible since often the assumptions of least squares theory are violated. In this case, the mathematical relationship between visibility and log sulfate using time series analysis did the best job of reducing the variance without violating any underlying assumptions. From a physical viewpoint, the time series analysis was able to evaluate the lead-lag relationship within and between visibility and sulfate.
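The entries of Tables 1 and 2 can be cross-checked with a short Python sketch. Two points are assumptions rather than statements from the paper: the χ² statistic is taken to be $n \sum_k r_k^2$ over the lags in question, and the percentage column of Table 2 is taken to be $100\,(1 - \text{residual variance}/105.3)$. Both assumptions reproduce the tabulated values to within rounding of the three-decimal correlations.

```python
N_OBS = 1096  # number of residuals in each series (Table 1 footnote)

# Cross correlations keyed by lag, as listed in Table 1.
CCF = {
    -14: -0.042, -13: -0.007, -12: 0.036, -11: 0.003, -10: 0.049,
    -9: 0.009, -8: 0.017, -7: 0.033, -6: 0.007, -5: 0.004,
    -4: -0.020, -3: -0.070, -2: 0.010, -1: -0.215, 0: -0.320,
    1: -0.003, 2: -0.034, 3: 0.012, 4: -0.032, 5: -0.002,
    6: -0.004, 7: 0.013, 8: 0.032, 9: 0.004, 10: -0.034,
    11: -0.031, 12: -0.015, 13: 0.018, 14: 0.035,
}


def chi_square(lags):
    """Assumed statistic: n times the sum of squared cross correlations."""
    return N_OBS * sum(CCF[k] ** 2 for k in lags)


def percent_reduction(residual_variance, total_variance=105.3):
    """Percentage reduction relative to the variance of visibility (Table 2)."""
    return 100.0 * (1.0 - residual_variance / total_variance)


if __name__ == "__main__":
    print("all lags:", round(chi_square(CCF), 2))
    print("negative lags:", round(chi_square(range(-14, 0)), 2))
    print("positive lags:", round(chi_square(range(1, 15)), 2))
    for name, var in (("visibility time series", 69.3),
                      ("visibility-sulfate time series", 58.2),
                      ("visibility regression", 78.3)):
        print(name, round(percent_reduction(var), 1))
```

Most of the all-lags statistic comes from lags 0 and -1, in line with the statement that only the present and the previous day's sulfate levels are significant.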
REFERENCES
Box G. E. P. and Jenkins G. M. (1970) Time Series Analysis: Forecasting and Control. Holden-Day, San Francisco.
Cass G. R. (1979) On the relationships between sulfate and air quality and visibility in Los Angeles. Atmospheric Environment 13, 1069-1084.
Chock D. P., Terrell T. R. and Levitt S. B. (1975) Time series analysis of Riverside, California, Air Quality Data. Atmospheric Environment 9, 978-989.
Jenkins G. M. (1979) Practical experiences with modelling and forecasting time series. Gwilym Jenkins and Partners (Overseas) Ltd., St. Helier, Jersey, Channel Islands.
Leaderer B. P., Holford T. R. and Stolwijk J. (1979) Relationships between sulfate and visibility. J. Air Pollut. Control Ass. 29, 154-157.
Merz P. H., Painter L. J. and Ryason P. R. (1972) Aerometric data analysis--time series analysis and forecast and an atmospheric smog diagram. Atmospheric Environment 6, 319-342.
Trijonis J. (1979) Visibility in the Southwest--an exploration of the historical data base. Atmospheric Environment 13, 833-843.
White W. H. and Roberts P. T. (1977) On the nature and origin of visibility reducing aerosols in the Los Angeles air basin. Atmospheric Environment 11, 803-812.