Step Ahead Streamflow Forecasting Using Pattern Analysis

Step Ahead Streamflow Forecasting Using Pattern Analysis

374 STEP AHEAD STREAMFLOW FORECASTING U S I N G PATTERN ANALYSIS M.G. GOEBEL and T . E . UNNY Department of C i v i l E n g i n e e r i n g , U n i...

657KB Sizes 0 Downloads 74 Views

374 STEP AHEAD STREAMFLOW FORECASTING U S I N G PATTERN ANALYSIS M.G.

GOEBEL and T . E .

UNNY

Department of C i v i l E n g i n e e r i n g , U n i v e r s i t y o f Waterloo, Waterloo, O n t a r i o , Canada

S YN O P S I S

Persistence i n streamflow time s e r i e s i s u s u a l l y s p e c i f i e d through c o r r e l a t i o n c o e f f i c i e n t s o f o r d e r one and l a r g e r . a r e c a l c u l a t e d u s i n g t h e whole t i m e s e r i e s r e c o r d .

These c o e f f i c i e n t s However,

p e r s i s t e n c e e f f e c t s a r e bound t o be v a r i a b l e i n d i f f e r e n t p a r t s o f t h e record.

C o n s i d e r a t i o n o f s ma l l c o l l e c t i o n s o f s u c c e s s i v e d a t a as

d i s t i n c t e n t i t i e s e f f e c t i v e l y incorporates persistence i n analysis o f data.

T h i s i s t h e b a s i c method i n v o l v e d i n p a t t e r n a n a l y s i s .

The h i s t o r i c t i m e s e r i e s i s segmented i n t o a s e r i e s o f r e p r e s e n t a t i o n a l o b j e c t s which d e f i n e l o c a l shapes w i t h i n t h e t i m e s e r i e s . o b j e c t s a r e i n t u r n d e f i n e d by a s e t o f p a t t e r n v e c t o r s .

These

Cluster

a n a l y s i s i s used t o s p e c i f y t i m e i n d e x e d r e g i o n s o r c l a s s e s i n n-dimensional p a t t e r n space a c c o r d i n g t o which f u t u r e p a t t e r n s may be classified. The c u r r e n t s t r e a m f l o w o b s e r v a t i o n s a r e used f o r s t e p ahead f o r e c a s t i n g based on t h e c o n c e p t t h a t a c o m p l i m e n t a r y datum i s r e q u i r e d t o complete a p a t t e r n . INTRODUCTION St reamf low d a t a r e c o r d e d a t m o n t h l y i n t e r v a l s i s c h a r a c t e r i z e d b y t h e presence o f s e a s o n a l i t y caused by r e o c c u r r i n g g e o p h y s i c a l e v e n t s w i t h i n a year.Thus

one observes groups o f h i g h o r l o w f l o w s a c c o r d i n g

t o wet o r d r y p e r i o d s .

Interdependence o f s u c c e s s i v e f l o w d a t a i n

m o nt hly s t r e a m f l o w t i m e s e r i e s i s termed p e r s i s t e n c e and can be e x p l a i n e d by watershed s t o r a g e s as w e l l as b y l o n g t e r m m e t e o r o l o g i c a l Reprinted from T i m e Series Methods in Hydrosciences, by A.H. El-Shaarawiand S.R. Esterby (Editors) 0 1982 Elsevier Scientific Publishing Company, Amsterdam - Printed in The Netherlands

375

events. Persistence in data is usually specified through correlation coefficients of order one and larger. However, because historical data is only a single realization of a stochastic sequence, correlation coefficients are bound to be variable with increases in the length of record with time. Furthermore, correlation coefficients are derived by scanning the data from beginning to end; thus if at all they represent the persistence effect, it could be nothing more than an averaged effect across the whole time series. On the other hand, in physical situations dealing with real world data, the persistence effect is bound to vary change in different par o f the record. An effective way of incorporating persistence in data analysis is through consideration of small collections of data as distinct entities in the analysis. Such an approach is taken through the use of pattern analysis. Pattern analysis is a well developed discipline by itself and it has been applied in diverse fields such as satellite imagery, medical diagnosis and cybernetics; however, it's application in hydrology is recent. Panu (1978) has developed a model to synthesize streamflow records using pattern analysis. Within the context of analysis of time dependent hydrologic data Panu, Unny, MacInnes and Wong (1981) give a well documented review. In simple terms, a pattern is a shape representation of a physical object. In this application a rising limb, the peak or the falling limb of a streamflow time series (a streamflow time waveform or STW) are a1 1 considered separate physical objects. Measurements on these objects define the shape or pattern of the objects. In practice the actual streamflow data is simply arranged into a set of n-dimensional pattern vectors with each pattern element being numerically equal to the field measurement. Selecting an appropriate dimensionality for the pattern vectors is based on the degree of interdependence between successive streamflow observations. For monthly data it would be difficult to justify any significant dependence beyond a two month period; hence,

376

t h r e e dimensional p a t t e r n v e c t o r s a r e adequate.

Overlapping the

o b j e c t s such t h a t each p a t t e r n v e c t o r c o n t a i n s o n l y one new datum a l l o w s a l l p o s s i b l e shapes t o be r e p r e s e n t e d .

Pattern analysis i s

c a r r i e d o u t on these p a t t e r n v e c t o r s and g i v e s i n t e r p r e t a b l e i n f o r m a t i o n regarding the nature o f the streamflow time s e r i e s . v e c t o r formation i s i l l u s t r a t e d i n F i g u r e 1. process t h a t maps a p a t t e r n

&

Pattern

Feature e x t r a c t i o n , a

onto a feature vector

L, i s

used t o reduce d i m e n s i o n a l i t y and enhance s e p a r a b i l i t y . was n o t r e q u i r e d , h e r e t h e r e f o r e f e a t u r e v e c t o r s ,Y+,

commonly

T h i s process

a r e synonomous w i t h

pattern vectors. Step ahead f o r e c a s t i n g i s c a r r i e d o u t by c o n s i d e r i n g t h e p a t t e r n found f r o m c u r r e n t and most r e c e n t o b s e r v a t i o n s as b e i n g i n c o m p l e t e i n t h e sense t h a t t h e f u t u r e o b s e r v a t i o n i s r e q u i r e d t o p r o v i d e t h e data t o complete t h e p a t t e r n v e c t o r .

The i n f o r m a t i o n g a t h e r e d t h r o u g h

p a t t e r n a n a l y s i s o f t h e h i s t o r i c a l r e c o r d i s used t o p r e d i c t t h e m i s s i n g datum.

Dixon (1979) used p a t t e r n a n a l y s i s t o e s t i m a t e d a t a

i n data sets w i t h missing observations.

The methodology p r e s e n t e d

h e r e i n d i f f e r s f r o m D i x o n ' s approach i n t h a t t h e s t e p ahead f o r e c a s t i s made by f i n d i n g t h e expected v a l u e o f t h e f u t u r e s t r e a m f l o w g i v e n thi possible c l a s s i f i c a t i o n o f t h e incomplete p a t t e r n . Data from t h e B l a c k R i v e r watershed (gauging s t a t i o n 02EC002, n e a r Washago, O n t a r i o ) was used t o s i m u l a t e s t e p ahead f o r e c a s t i n g .

For

t h i s r i v e r 30 y e a r s o f h i s t o r i c a l r e c o r d was used t o p e r f o r m c l u s t e r analysis.

An a d d i t i o n a l 28 y e a r s o f h i s t o r i c a l d a t a was t h e n used t o

make s t e p ahead f o r e c a s t s which were t h e n compared t o a c t u a l observations.

F u r t h e r f o r e c a s t s were always based upon t h e a c t u a l

observations r a t h e r than forecasted values. CLUSTER ANALYSIS A p a t t e r n c l a s s can be c o n s i d e r e d t o be a r e g i o n i n n-dimensional p a t t e r n space c o n t a i n i n g t h e s e t o f a l l p o s s i b l e p a t t e r n s which a r e s u f f i c i e n t l y a l i k e a c c o r d i n g t o some s p e c i f i e d measure.

Any mathematical d e s c r i p t i o n o f a c l a s s i s a s t a t i s t i c a l summary o f

377

STW 1916 for

Black River

....

1

T

I

I

I

I

I

I

1

I

I

I

F

M

A

M

J

J

A

S

O

N

D

.I

a)

100

(No n t hs )

H i s t o r i c a l Data

--

75 -50 --

2 5 --

0

A

L .I

M

F

5

F

M

A

52 b)

1

A

M

A

53

M

54

J

M

J

J

55

Representational Objects

c)

Figure

M

Pattern Vectors

Pattern Vector Formation From Streamflow Data For Black River

J

J

‘6

A

378

i t s member patterns. I n many instances of pattern analysis, including the one i n t h i s application the t r u e classes a r e u n k n o w n . All information concerning c l a s s s t r u c t u r e must be estimated from a training s e t of unlabelled sample patterns t h r o u g h a procedure referred t o a s clustering or unsupervised learning. Various clustering algorithms have been developed, the simplest and b e t t e r known of which are the K-means algorithm (Tou and Gonzales, 1974) and the Isodata algorithm (Ball and Hall, 1965). I n the clustering algorithm developed f o r monthly streamflow patterns i t was assumed t h a t there a r e K = 1 2 d i s t i n c t pattern classes. All the patterns from one time position are i n i t i a l l y assumed t o belong t o one g r o u p , G k ; t h a t i s they form an i n i t i a l c l u s t e r . The c l u s t e r center k i s a reference vector, m , calculated a s follows: Nk

m k = -1 Z Y N k j = 1 -j where N k i s the number of patterns ( f e a t u r e vectors Y . ) i n group -J k G . These i n i t i a l c l u s t e r s are characterized by a rather large variation. This variation i s described by a covariance matrix C k as fol 1ows : Nk

Another way o f examining the variation within a g r o u p i s t o calculate the distance from each pattern t o i t s reference vector. Using a Euclidean distance measure:

379

ntra group distance i s :

The mean

Nk

1

(4)

Another useful distance i s the i n t e r g r o u p distance or the distance k j between reference vector 3 and any other reference vector E : k

j

dE(" , m )

n =

C Z

(mr

-

j 2m i ) 1'

(5)

i=l A boundary between two groups can be visualized as a distance half-

way between any two c l u s t e r centres. The hyperplane describing such a boundary i s a decision surface according t o which a p a r t i c u l a r pattern i s said t o belong t o one group o r the other. For example:

C1 ustering proceeds, as fol1 ows :

Determine i n i t i a l c l u s t e r centres and boundaries as noted above . 2 . Using 'Equation ( 3 ) , determine f o r every pattern the distance t o i t s own c l u s t e r centre and the distance t o the centre of the two c l u s t e r s t h a t are situated time contiguous on b o t h sides (i . e . , neighbouring c l u s t e r s ) . 3 . Reassign a l l the patterns t o any of the three time contiguous c l u s t e r s according t o the c r i t e r i o n of Equation ( 6 ) . T h e centres of these c l u s t e r s can be considered t o be the centres of the pattern classes of which the c l u s t e r s form a s u b g r o u p . The procedure ensures t h a t a proper t r a j e c t o r y progressing in time i s formed by connecting the centres o f the classes. The c l u s t e r s so formed are time-dependent c l u s t e r s a n d n o t global c l u s t e r s . 4. For the c l u s t e r s obtained in Step 3 above, new c l u s t e r centres and boundaries are calculated a n d the procedure repeated from Step 2 t o Step 3 w i t h the objective of reducing the variance of the groups, and thus the i n t r a c l u s t e r distances are a l s o reduced. 1.

380

Further, the i n t e r c l a s s distances are enhanced providing f o r increased s e p a r a b i l i t y of the classes. This clustering algorithm stops when the l a s t complete cycle of i t e r a t i o n was carried o u t and i t was found t h a t there was no f u r t h e r change i n the membership of any of the c l u s t e r s . Typically, about fourteen i t e r a t i o n s were required t o c l u s t e r the patterns. The final k c l u s t e r s represent sample estimates of the pattern classes ( W ) . These classes are characterized by a narrow multivariate probability distribution having a mean and some variance about the mean. The regions i n m-dimensional feature space representing each of the twelve classes a l s o have d e f i n i t e boundaries according t o which any new unlabelled pattern can be properly c l a s s i f i e d . Two typical c l u s t e r s f o r Black River are shown i n Figure 2 . The ordered progression from one class centre t o the next i s called a mean annual cycle (See Figure 3 ) . The outer segment of the curve h a v i n g large values of y l , y2 o r y3 o r combinations of a l l three, represent the regions in pattern space t h a t are equivalent t o peak flows. The two loops represent streamflows t h a t have a large annual peak flow (as occurs in the spring r u n o f f ) and a smaller peak occurring annually i n the f a l l . A larger distance between two adjacent reference vectors on the annual cycle s i g n i f i e s a greater v a r i a b i l i t y of the member patterns within the c l a s s . Cluster boundaries bisect the distance between adjacent reference vectors. Clusters may be r e l a t i v e l y close together i n terms of distance i n the 3-dimensional feature space. However, i f the reference vectors a r e on d i f f e r e n t loops, they are f a r apart i n time ( i . e . , they are n o t time contiguous). STEP AHEAD FORECASTING

Forecasting i s based on a l l of the information gathered t h r o u g h the use o f c l u s t e r analysis. The most recent pattern i s considered t o be incomplete in t h a t i t has two known features, y1 and y 2 , b u t the t h i r d feature, y3, representing the step ahead flow, i s u n k n o w n .

381

Figure 2

Typical Clusters Represented by Objects from the STW f o r Black River. (Not a l l objects are shown)

125 ,3

100

75 A

--. m (I:

E

v

3

50

4 LA

25 P a t t e r n Vector

----0

1

y3

50

-w

m

E

v

25

8

--

3

0

0

I

I

1

Reference Vector

382

1

Figure 3 .

Mean Annual C y c l e For t h e B l a c k R i v e r enpresented i n thr--i dirli?psional_ F'carur? sr)' ? & - . ~ ' - 0

3 83

A step ahead forecast i s made by estimating the value o f y3.

The i n i t i a l step i s t o " c l a s s i f y " the incomplete pattern. Because the t h i r d element of the pattern vector i s missing, proper classification i s n o t always possible. However, classes t o which the-pattern i s l i k e l y t o belong can be determined with a f i n i t e probability. The classes t o which an unlabelled pattern may belong are called candidate classes. These candidate classes are determined u s i n g the f i r s t two elements of the incomplete pattern. Forecasting i s carried o u t based on the probability of the occurrence of y3 w i t h i n a candidate c l a s s as well a s the probability o f occurrence of the candidate class within the pattern space. Thus the forecast i s the expected value o f y3. I t was observed in the c l u s t e r analysis t h a t patterns tended t o cluster i n groups whose member patterns were derived from times of origin t h a t are the same o r neighbouring months of the year. The time of origin of a pattern i s therefore an a p r i o r i information for determination of candidate c l a s s e s . Two c r i t e r i a must be satisfied f o r a c l a s s t o be a candidate c l a s s :

For the features y1 and y2 from an incomplete pattern, a f i n i t e value f o r y3 must e x i s t w i t h i n the c l a s s boundaries. 2. The probability of the class must be non-zero given the months of origin o f the incomplete pattern class t o be a candidate c l a s s . 1.

Pattern c l a s s i f i c a t i o n i s based on the a posteriori probability of the class which can be found using Bayes Theorem. Written i n terms o f the application:

384

k where p(y, ,y21u ) i s the c l a s s - c o n d i t i o n a l p r o b a b i l i t y d e n s i t y and i s e s t i m a t e d from the marginal p r o b a b i l i t y a s f o l l o w s :

ull

“12

i s the p a r t i a l covariance m a t r i x of c l a s s w k which i s

The a p o s t e r i o r i p r o b a b i l i t i e s f o r a l l k G 12 c a n d i d a t e c l a s s e s a r e such t h a t

zK k

P(,

k

IY1 ? Y 2 )

=

1

(10)

I t must be noted here t h a t a l l p r o b a b i l i t i e s a r e based on c a l c u l a t i o n s using d a t a from the c l u s t e r a n a l y s i s .

Should a s l i g h t l y

d i f f e r e n t set o f p a t t e r n s be used f o r c l u s t e r i n g then i t i s quite l i k e l y t h a t the a p o s t e r i o r i p r o b a b i l i t i e s w i l l change given t h e same incomplete p a t t e r n . Such a case would occur a s new streamflow measurements c o n t i n u o u s l y update t h e h i s t o r i c a l r e c o r d . This s i t u a t i o n i s common t o any model t h a t uses a f i n i t e and, in most c a s e s , 1 imi t e d amount of samples.

385 The p a t t e r n c l a s s e s a r e assumed t o have a m u l t i v a r i a t e normal The m i s s i n g f e a t u r e y3 o f an

d i s t r i b u t i o n o f sample p a t t e r n s .

i n complet e p a t t e r n i s fo u n d b y c a l c u l a t i n g t h e e x p e c t e d v a l u e o f F o r any y1 and y2 f r o m a p a r t i c u l a r c a n d i d a t e c l a s s t h e r e i s y3. a p r o b a b i l i t y a s s o c i a t e d w i t h any y3. T h i s c o n d i t i o n a l p r o b a b i l i t y

is:

P(Y31Yl ,Y,Ik

where

P(Y1 'Y2 ,Y3) =

P(Y, ,Y$ k

P(Yl ,YZyY3)

k

1

--

Zn3" and

P(Y, YY2)

k

=

e

-+(Y-! -

k T k-1 ) cc 1

(Y-m _ - k)

detCCkl'

k

P(Y, YY2IW )

as i n Equat ion (8) Not a l l value s o f y3 a r e f e a s i b l e .

F e a s i b l e v a l u e s o f y3 a r e

d e t ermined by f i n d i n g p o i n t s on t h e boundary o f t h e c l a s s i n m-dimensional f e a t u r e space a l o n g t h e l i n e g i v e n b y y1 and y2. These values f o r y3 range from, say, A t o B where A i s on t h e boundary between c l a s s k- 1 w k and w .

wk

and wkt1

and B i s on t h e boundary between

The expect ed v a l u e o f y 3 g i v e n t h e c a n d i d a t e c l a s s i s : B

386

The variance of y3 can a l s o be calculated by simply including an additional term in the expression above. I n a l l cases the estimate of the most probable value o f y3 i s weighted according t o the class a posteriori probabilities. The expected value of y3 i s :

k= 1

STEP AHEAD FORECASTING RESULTS

The methodology described above was applied t o an additional 28 years o f historical data t o produce a s e t of step ahead forecasts based on actual observations. This made i t possible t o compare the step ahead forecasts w i t h the streamflows t h a t actually occurred. I t i s c l e a r t h a t there a r e always some forecast e r r o r s . These e r r o r s being the difference between predicted and observed streamflows are assumed t o be independent random variables. The mean overall e r r o r for the e n t i r e s e r i e s of forecasts i s g i v e n a s follows: €

=

-

1 n

n

*

E

i=l

i

(14)

where n i s the number of step ahead forecasts. The standard deviation i s calculated as: s

=

c n

A - -E 2) / (n-1)1'

(Ei

i=l

The two sided Student's t Test can be used t o t e s t whether the overall e r r o r i s s i g n i f i c a n t l y d i f f e r e n t from zero. The t t e s t i s a s follows:

IL.5 I S

>

tn-1;0,025

387

The f o r e c a s t e r r o r s were a l s o e v a l u a t e d on a monthly b a s i s u s i n g t h e method g i v e n above. A 5% s i g n i f i c a n c e l e v e l was used f o r the c r i t i c a l value o f t . R e s u l t s f o r the Black River a r e given in the following t a b l e : Summary of Forecast E r r o r s by Months f o r Black River Month

S

a l1 S

January February March April May June July August September October November December Overall

-0.07 -1.93 -0.33 3.54 8.03 3.05 -0.36 2.17 0.89 -1.77 -3.81 -3.37 0.50

11.27 6.44 20.46 26.76 13.49 7.88 6.46 3.03 3.17 11.47 10.83 12.95

0.032 1.583 0.086 0.699 3.150 2.048 0.292 3.790 1.493 0.816 1.860 1.379

13.2

0.70

I

tn-l ; O . 025 2.052 2.052 2.052 2.052 2.052 2.052 2.052 2.052 2.052 2.052 2.052 2.052 1.960

Figure 4 shows the f i r s t two y e a r s of h i s t o r i c a l d a t a used f o r the simulation of f o r e c a s t i n g .

Superimposed on t h e time s e r i e s a r e the

s t e p ahead f o r e c a s t s i n c l u d i n g the 2 s t a n d a r d d e v i a t on lim t s a s c a l c u l a t e d by the f o r e c a s t i n g methodology. CONCLUDING REMARKS

A methodology i s given f o r s t e p ahead streamflow f o r e c a s t i n g

using p a t t e r n a n a l y s i s which accounts f o r varying p e r s i s t e n c e e f f e c t ! within the time s e r i e s . C l u s t e r i n g the hi t o r i c d a t a provides the information t h a t c h a r a c t e r i z e s the streamf ows. The a c t u a l f o r e c a s t i n g procedure uses t h i s information t o complete a p a t t e r n v e c t o r on a p r o b a b i l i s t i c b a s i s . This approach g ves s a t i s f a c t o r y r e s u l t s f o r the case s t u d y presented h e r e .

388

\-. 130

Observed Streamflow

i

S t e p ahead f o r e c a s t s .

*

Mean flow and 2 s t a n d a r d d e v i a t i o n s on e i t h e r s i d e .

120

110 100

90

.. .-. . .. ..*

I

80

,--.70 (11

60

3

2 0

:.

l

1

1 ;

--m E v

. .. ...

:.

-

?

50

i

40

-

.. ! ..

-

30 T

20

10

0

r

J

Figure

4.

,

F

l

l

,

,

l

1946

,

,

,

l

,

,

l

D J F

,

,

1

1947

,

,

,

,

(

D

S t e p Ahead F o r e c a s t e d Streamflows Superimposed on t h e Observed T i m e S e r i e s For the Black River

389

Further developments in t h i s use of pattern analysis will be t o extend the methodology t o weekly and daily streamflow time s e r i e s . Pattern vectors of higher dimensionality would then be needed. In a d d i t i o n pattern vectors can be used t o incorporate other pertinent

information i n the analysis. This would include such data as snow accumulation, temperatures, the amount o f water stored in the watershed, o r weather forecasts from independent sources. Data from t r i b u t a r i e s o r other rivers i n the region could be included as we1 1 . ACKNOWLEDGEMENTS

The senior author gratefully acknowledges the support provided during the research period by grants from the National Science and Engineering Research Counci 1 of Canada. REFERENCES

Ball, G . H . , and Hall, D.J., "Isodata, An I t e r a t i v e Method o f Multivariate Analysis and Pattern Classification", Proceedings of the IFIPS Congress, 1965. Dixon, J.K., "Pattern Recognition with Partly Missing Data", I E E E Trans. on Systems, Man and Cybernetics, Vol . SMC-9, No. 10, October 1979. Panu, U.S., "Stochastic Synthesis of Monthly Streamflows Based on Pattern Recognition", Doctoral Dissertation, Department Of Civil Engineering, University of Waterloo, Canada, May, 1978. Tou, J.T., and Gonzalez, R . C . , "Pattern Recognition Principles", Addison-Wesley Publishing Company, Massachusetts, 1974, rev. 1977. Unny, T . E . , Panu, U.S., MacInnes, C . D . , and Wong, A . K . C . , "Pattern Analysis and Synthesis o f Time-Dependent Hydrologic Data", Advances i n Hydroscience, Academic Press, New York, Vol. 1 2 , P . P . 195-297, 1981.