E(OLOGI(nL mODELLInG ELSEVIER
Ecological Modelling 73 (1994) 159-165
Short Communication The Kalman filter model and Bayesian outlier detection for time series analysis of B O D data R a m C. T i w a r i *, T i m o t h y P. D i e n e s Department of Mathematics, University of North Carolina at Charlotte, Charlotte, NC 28223, USA (Received 11 November 1992; accepted 9 June 1993)
Abstract
The purpose of this paper is to fit a trigonometric time series model to a biochemical oxygen demand (BOD) data set using the Kalman filter approach to allow estimates of the parameters to be updated recursively with each new observation. In addition, we analyse the data set Ibr outliers by computing the prior and posterior probabilities for all observations. The smoothing equations result in a better fit of the proposed model than the model of Papadopoulos et al. (Ecological Modelling, 55(1991): 57-65) based on the same data, in the sense that it reduces the mean squared error by more than half. Key words: Biochemical oxygen demand; Kalman filter; Time series analysis 1. Introduction
Water pollution is commonly monitored by observation of the levels of biochemical oxygen demand (BOD) and dissolved oxygen (DO). These terms are defined respectively by Padgett and D u r h a m (1976) as " t h e amount of oxygen required by bacteria while organic matter is being decomposed with the help of dissolved oxygen in the water. Dissolved oxygen is obtained directly from the air by aeration and indirectly through the photosynthetic process of aquatic plants." The original mathematical model which described B O D levels was produced by Streeter and Phelps (1925). Their work was expanded upon by Dobbins (1964), Loucks and Lynn (1966), and Thayer and Krutchkoff (1967). Padgett and D u r h a m (1976) and Padgett and Papadopoulos (1979) used a joint probability density function for B O D and DO, as did Chen and Papadopoulos (1988), but using a non-parametric kernel density estimator.
* Corresponding author. 0304-3800/94/$07.00 © 1994 Elsevier Science B.V. All rights reserved
SSDI 0 3 0 4 - 3 8 0 0 ( 9 3 ) E 0 0 5 0 - D
R.C. Tiwari, T.P. Dienes / Ecological Modelling 73 (1994) 159-165
160
Fuller and Tsokos (1971) performed a spectral time series analysis of DO data, and Papadopoulos et al. (1991) fitted a trigonometric polynomial to a BOD data set, examining bootstrap confidence intervals. We utilized the same data as the latter paper. The data set consists of 109 weekly measurements of BOD levels, measured in parts per million, taken between February 1982 and February 1984 from McDowell Creek, Charlotte, NC. In Section 2 we define the model in the terms of a Kalman filter and give the recursive time series equations, including those for smoothing, and describe the method used for detection of outliers. In Section 3 we analyse the data.
2. The model
The cosine time series model for the BOD data is given by Y,=/z t + a t cos(2"rrwt+4,) + E , ,
t = l . . . . . n,
where the parameters /z, a, w, 4, and e are, respectively, the overall mean, the amplitude, the frequency, the phase shift and the model error. The Kalman filter (set of equations) which allows updating of estimators is given by Yt = Izt + at c o s ( 2 ~ w t + 4') + et, /3~t = t/'t--1 "[- r/t '
Ott=OLt_l-{-~/t,
(1)
t= l,...,n,
where the errors et, r/t and Yt are independent, normally distributed with zero means, and variances equal to 3.-1Ry, A-1R and A-1R,, respectively, where A-l is the steady-state observation variance. Eq. 1 is a special form of the general linear dynamic model (see for example, Gordon and Smith, 1990), (2)
Yt = H,O + e,, Ot=GtOt_l +r/t,
t= l,...,n,
where Yt is the observation, 0, the vector of parameters, H t the regression matrix, G t the parameter transition matrix, such that et-U(O,A-1Ry),
r/t - N(O,A IR0),
(Here and throughout we have used the notation ~ to denote "distributed as".) It is assumed that e s and e t are independent for s 4:t, that {et} and {r/s} are independent, and that e t and r/t are independent of Or_ 1 for all t, given Y1. . . . . Y¢-aLetting 0t
=Pq, [at]
H, = [1 cos(2"rrwt + 4')]
in Eq. 2 yields the model (1).
R. C. Tiwari, T.P. Dienes / Ecological Modelling 73 (1994) 159-165
161
2.1. Recursive and smoothing equations and a s s u m p t i o n s
Given the steady-state observation variance h -l, we assume that the prior distribution of the initial (at t = 0) parameter vector 0 0 is given by 0 o [h ~ U ( m o , A - ) C 0 ) . Further, given the first t - 1 observations, D t 1 = ( ] 1 1 vector 0 t_ 1 is similarly to be distributed according to
..... Yt-)),
the parameter (3)
0,_11D,_1, X ~ N(mt_I,A-1C,_,),
while A given D t_ 1 is assumed to follow the distribution given by X I D,_1 ~ g a m m a ( ½ n t _ , , ½ G _ l ) ,
(4a)
where U ~ g a m m a ( a , f l ) is the gamma density defined by [ [ Y ~ / F ( a ) l u '~-1 e x p ( - f l u ) ,
(u > 0)
(4b)
and initially (at t = 0) the prior distribution of A is gamma(sn0,sr0). 1 ) Once Yt is observed, the posterior distribution of 0 t and A t a r e given by h-lct),
O, ] D , , X ~ N ( m , ,
(5a)
and 1
1
a ] D t ~ gamma(~n t,Srt),
(5b)
where the forward updating recursion (Kalman filter) equations for m t and A t defined by (cf. Gordon and Smith, 1990; Meinhold and Singpurwalla, 1987): m t = Gtm,_
(6a)
1 + Stet,
(Gb)
Ct = Pt - S t F t S ; , nt = nt- 1 +
are
1,
(6c) t = l . . . . . n,
rt=rt_l+et[Ftl-le,,
(6d)
where ft=HtGtmt_l,
et=E-ft,
Ft=H,P, Ht'+Ry,
nt=GtCt_lG't+no,
(7)
St=PtHt'[F,] -l,
are respectively, fitted value at time t, residual at time t, cumulative parameter variance matrix, cumulative observational variance matrix, and proportion of total observational variance contributed by the latest observation. Also, Ot[Dn,a~N(mt,,,,A-1CtI,,),
where the smoothing (backward) equations are given by (cf. Meinhold and Singpurwalla, 1987; Harvey, 1989) m , i . = m t + Ct+,G[+,P~+'I(mt+,I . - G t + l m , ) ,
t
(8a)
Ct In = C t _ CG t+ t el+ t (lC+ t nl_P+ t )lp~-+G l+ tC l,t- 1
t
(8b)
with Cnl . = Cn, and real n = ma obtained from Eqs. 6a and 6b.
R.C. Tiwari, T.P. Dienes /Ecological Modelling 73 (1994) 159-165
162
2.2. Detection of outliers using prior-posterior analysis The following method of outlier detection is taken from Chib and Tiwari (1993). Their work is based on Zellner (1975), Zellner and Moulton (1985), Chaloner and Brant (1988) and Chaloner (1991). According to Chib and Tiwari (1993), Yt, for t = 1 . . . . . n, is an outlier if the event [etl > ko- occurs for some k. The choice of k is suggested by Chaloner and Brant (1988) in the case of a linear model. This definition of an outlier involves the computation and comparison of the prior and posterior probabilities. We have assumed that the prior probability that there are no outliers among the n data points is 0.95. This gives the value of the standard normal ordinate k as k = *-1{0.5 + 1 ( 0 . 9 5 ) ' / ' } ,
(9)
where qb-l(.), the inverse function of ~, is the quantile function of the standard normal distribution. Thus the prior probability that any particular observation Yt is an outlier is { 2 ~ ( - k ) } . Letting o-= R~-y/A, the smoothed residual ~tl. and leverage h t I n are calculated as follows: ~tln = Yt - H t m t l n
and
htl,,=Ht'Ctl,,H t,
(t
Define the normal ordinates utl,, and L,tl . as
and
The posterior probability that each given Yt is an outlier, Ptln, is defined as P,I. = P ( l e , I > ko" I O.) =
£~{1 - dP(u,h,, ) + (P(u,I,)}p(A I O,) dA,
(11)
where p ( A I D , ) is the density of gamma(½(n o + n),½rn). Thus Yt is an outlier if Ptln is larger than 2 ( D ( - k ) , where (P(-) is the standard normal distribution
funtion.
3. Analysis of BOD data
3.1. Fitting recursive and smoothing equations to BOD data In order to compare our results with those found in Papadopoulos et al. (1991) we used the first 100 observations. Also, we assumed that A = 1, n o = 1, r 0 = 1 and
R.C. Tiwari, T.P. Dienes /Ecological Modelling 73 (1994) 159-165
163
ppm
15
13
m
m
11
I~/l° ° I • '~,
•
.,,/." ° ~t'l,ll'
7': i
2[
1
11
21
31
41
51
61
71
81
91
101
Week
Fig. 1. B O D d a t a a n d f o r w a r d m o d e l .
C0~, = Co, = R~, = R~ = 1. Initial least squares estimates of the parameters were obtained using a numerical method computer program containing a quasi-Newton algorithm for finding a minimum of a function. The least squares fit obtained was similar to that found in Papadopoulos et al. (1991) and is given by = 9.5744 + 1.9806 cos(2-rr" 0.0184(t) + 0.2445). It is readily observed from Fig. 1 that the forward recursion responds well to each new observation, and appears to lag the series by one week. The mean squared error for the first 100 observations was approximately 0.88267. Fig. 1 also gives the predicted values for the observations 101-109. Adding the smoothing equations to the program produced a significantly better fit (Fig. 2) and the mean squared error of only 0.13756.
3.2. Outlier analysis First we analyse the data for outliers using the prior-posterior analysis proposed in Section 2.2. Letting n = 100 in Eq. 9 we obtain k = 3.474 and the prior probability of at least one outlier is 0.000513. Smoothing residuals ~,l- and leverages htl n are used with or= (Ry)l/2 = 1 and h = 1 to produce utl ~ and vtl ~ as given in Eq. 10. In this case, the posterior probability of Yt being an outlier, Ptln, given by Ptl~ = 1 - ~ ( u t l ~ ) + ~(vtl ~) is compared to the prior probability. The following outliers were detected (Table 1). Letting n = 109 yields k = 3.497 and the prior probability of at least one outlier is 0.000470. In this case no new outliers were detected. Allowing h to vary as previously described (cf. Eq. 4) with n o = 1 and r 0 = 1, the following ordered partial listing of outlier probabilities was generated (Table 2).
164
R.C. TiwarL T.P. Dienes / Ecological Modelling 73 (1994) 159-165 ppm
15
13 mW m
1I
•
°=
I
i =
® ~ ®
A~
t
I
11
21
31
l
41
,
•
,
2
t
I
L
I
I
51
61
71
81
91
101
Week
Fig. 2. BOD data and smoothing model.
T h e o b s e r v a t i o n s 33 a n d 45 w e r e a g a i n f o u n d to b e o u t l i e r s , b u t t h e o b s e r v a t i o n 51 c e a s e d t o b e a n o u t l i e r ( r e c a l l i n g t h e p r i o r p r o b a b i l i t y for at l e a s t o n e o u t l i e r w a s 0.000513).
4. Conclusion I n this p a p e r w e h a v e s h o w n t h a t t h e K a l m a n filter, w h e n a p p l i e d w i t h smoothing equations, reduces the mean squared error significantly over the model
Table 1 Outliers and their posterior probabilities with fixed A( = 1) t
Yt
htl,,
Ptl, = 100,a = 1
Ptl,, = 109,A = 1
33 45 51
7.3 8.9 14.2
0.496995 0.501883 0.506193
0.000763 0.000672 0.000763
0.000674 0.000590 0.000684
Table 2 Outliers and their posterior probabilities with variable A and fixed A (A = 1) t 33 45
Ptln 0.003038 0.000923
P~in,X=l 0.000763 0.000672
R.C. Tiwari, T.P. Dienes /Ecological Modelling 73 (1994) 159-165
165
o f P a p a d o p o u l o s e t al. (1991). A l s o , t h e d e t e c t i o n o f o u t l i e r s is e a s i l y c a r r i e d o u t b y c o m p a r i n g t h e p r i o r a n d p o s t e r i o r p r o b a b i l i t i e s o f all o b s e r v a t i o n s .
Acknowledgement T h e a u t h o r s w o u l d like t o t h a n k a r e f e r e e f o r t h e h e l p f u l c o m m e n t s .
References Chaloner, K., 1991. Bayesian residual analysis in the presence of censoring. Biometrika, 78: 637-644. Chaloner, K. and Brant, R., 1988. A Bayesian approach to outlier detection and residual analysis. Biometrika, 75: 651-659. Chen, K.W. and Papadopoulos, A.S., 1988. A non-parametric method for estimating the joint probability density of BOD and DO. Ecol. Modelling, 41: 183-191. Chib, S. and Tiwari, R.C., 1993. Outlier detection in the state space model. Stat. Probability Lett., 19(5) (in press). Dobbins, W.E., 1964. BOD and oxygen relationships in streams. SA3 Proc. Pap. 3949, J. Sanit. Eng. Div. ASCE, 90: 53-78. Fuller, F.C., Jr. and Tsokos, C.P., 1971. Time series analysis of water pollution data. Biometrics, 27: 1017-1034. Gordon, K. and Smith, A.F.M., 1990. Modeling and monitoring biomedical time series. J. Am. Stat. Assoc., 85: 328-337. Harvey, A.C., 1989. Time Series Model. Wiley, New York. Loucks, D.P. and Lynn, W.R., 1966. Probabilistic models for predicting stream quality. Water Resour. Res., 2: 593-605. Meinhold, R.J. and Singpurwalla, N.D., 1987. A Kalman filter smoothing approach for extrapolations in certain dose-response, damage-assessment and accelerated-life-testing studies. Am. Stat., 41:101 106. Padgett, W.J. and Durham, S.D., 1976. A random differential equation rising in stream pollution. J. Dyn. Syst. Meas. Control (Trans. ASME), 98: 32-36. Padgett, W.J. and Papadopoulos, A.S., 1979. Stochastic models for prediction of BOD and DO in streams. Ecol. Modelling, 6: 289-303. Papadopoulos, A.S., Tiwari, R.C. and Muha, M.J., 1991. Bootstrap procedures for time series analysis of BOD data. Ecol. Modelling, 55: 57-65. Streeter, H.W. and Phelps, E.B., 1925. A study of the pollution and natural purification of the Ohio River. Public Health Bull. 146, U.S. Public Health Service, Washington, DC. Thayer, R.P. and Krutchkoff, R.G., 1967. Stochastic model for BOD and DO in streams. J. Sanit. Eng. Div. ASCE, 93: 59-72. Zellner, A., 1975. Bayesian analysis of regression error terms. J. Am. Star. Assoc., 70: 138-144. Zellner, A. and Moulton, B.R., 1985. Bayesian regression diagnostics with applications to international consumption and income data. J. Econometrics, 29:187-211.