Residual maximum likelihood (REML) methods for analysing hydrological data series

Residual maximum likelihood (REML) methods for analysing hydrological data series

Journal of Hydrology ELSEVIER Journal of Hydrology 182 (1996) 277-295 Residual maximum likelihood (REML) methods for analysing hydrological data se...

837KB Sizes 0 Downloads 98 Views

Journal of

Hydrology ELSEVIER

Journal of Hydrology 182 (1996) 277-295

Residual maximum likelihood (REML) methods for analysing hydrological data series R o b i n T. C l a r k e Institute de Pesquisas HidrAulicas, Porto Alegre, RS Brazil Received 29 June 1995; revision accepted 13 July 1995

Abstract

Much hydrological data can be displayed as two-way tables with observations classified (for example) by years (rows) and sites (columns), commonly with many missing entries; data classified by three factors or more (e.g. gauge sites within drainage basins; drainage basins; years) can also be put in this form. On an appropriate scale, the observations in such tables can frequently be represented by linear, additive models of components, some of which can be considered as random variables. Residual maximum likelihood (REML) is a technique for fitting models in which each observation is expressed additively in terms of fixed and random effects. When the model contains only one such random effect, the linear model reduces to a restricted form of multiple regression; REML can be regarded as an extension of multiple regression to the case where there are several error terms with different statistical characteristics. Models of this kind are appropriate in the hydrological context where the effects of the years (or other periods) of observation can be regarded as a sample from a hypothetical population of years (periods), or where sites can be regarded as random. The paper discusses two examples where REML was used: one in estimating mean areal monthly rainfaU in Amazonia, using incomplete records from 48 raingauge sites, and the other using inc0mplete records of annual floods from 19 gauging stations on the Rio Itajai-A¢6, in southern Brazil. In both cases, the assumptions of the REML model were satisfied and the objectives of the analysis achieved. Given the prevalence of incomplete hydrological records, the REML method may well have wider application.

1. Introduction The principal characteristics of m a n y hydrological data sets are that (a) they result from measurements taken at points distributed spatially; (b) measurements at each point are commonly recorded at equal intervals of time; (c) the time at which measurement began (and, possibly, ended) varies from site to site within the region. 0022-1694/96/$15.00 © 1996 - Elsevier Science B.V. All rights reserved SSDI 0022-1694(95)02929-X

278

R.T. Clarke / Journal of Hydrology 182 (1996) 277-295

These characteristics are found, for example, in measurements of daily rainfall at gauge sites within a drainage basin; in estimates of mean daily flow at gauging points of a river system; in measurements of soil moisture by neutron probe at a number of access tubes within a river basin; and in piezometric levels recorded at wells penetrating an aquifer. Subsequent analysis addresses the (often highly) incomplete array in which sites are the columns and measurement dates are the rows, and commonly seeks to relate the marginal means of the rows (columns) of the array to other variables. Examples are" (a) mean soil moisture (calculated from the rows of the array) related to rain falling on a river basin; (b) mean monthly rainfall (calculated from the rows of the array) related to external variables such as sea surface temperatures (SSTs); (c) mean annual floods (calculated from the columns of the array) related to physical characteristics of drainage basins, for the purpose of estimating floods at ungauged sites; (d) mean annual flow (calculated from the columns) related to drainage basin characteristics. For some purposes, as in example (c), variances as well as means are also required, so that quantiles (e.g. annual floods with return period T years) can be calculated. In the following discussion, we consider observations classified by periods as rows, and sites as columns. Observations classified by three factors or more (e.g. gauge sites within drainage basins; drainage basins; years) can be displayed in this basic form. Analyses of such data are complicated by the 'non-orthogonality' of the rows and columns of the data array: since this array (sites as columns, times or periods as rows) has missing values to a greater or lesser degree, the period (row) means are not always calculated from all the sites in the array, and the site (column) means are not always calculated over the same periods. There are many ad hoc methods for dealing with this difficulty. In many cases, it is just ignored, and the literature contains copious examples where, say, mean annual flow, or mean annual flood, from a number of sites (columns of the array) is related by multiple regression to physical characteristics of river basins. This is equivalent not only to ignoring the fact that site means are calculated over periods of different lengths, but also ignores t h e fact that the site means have some periods in common; thus, in a year of exceptionally heaw rainfall, the floods at all sites for which flow records exist in that year can be expected to be large, and conversely for years of exceptional drought. The effects of such 'intersite correlation' have received a good deal of attention from hydrologists. Alexander (1954) considered the estimation of the mean annual flow in a region by the average of the observed at-site mean annual flows, and derived an expression for the 'effective number of independent sites'; this expression was used subsequently by Haan (1977) and Stedinger (1983). However Alexander's expression appears to have assumed that the period-site array did not contain missing values. Hosking and Wallis (1988) looked at the effects of intersite correlation by Monte Carlo methods, generating artificial 'records' of unequal length at N sites. Values were sampled from a multivariate Normal distribution with correlation between sites i andj taken as e x p ( - a d i j ) , where dij is the Euclidean distance between the sites and a was a suitably chosen constant. The Normal values thus generated were transformed to new values with the desired marginal distributions. They found that the accuracy of flood quantile estimates decreased when intersite dependence was present, but that this effect was

R.T. Clarke / Journal of Hydrology 182 (1996) 277-295

279

less importan t for practical applications than the bias in flood quantile estimates due to heterogeneity of the flood frequency distributions in the region. Even where both heterogeneity and intersite dependence were present and the form of the flood frequency distribution was misspecified, regional flood frequency analysis was more accurate than at-site analysis. Although the work by Hosking and Wallis (1988) came close to the reality of annual flood data (unequal record lengths, intersite correlation where records exist for common periods) their work reported a simulation study. What is really needed is a straightforward method of statistical analysis that is readily applicable as a routine, and the purpose of the present paper isto give two examples of a method having some of these characteristics. In its simplest form, the method to be discussed (residual maximum likelihood, REML) assumes a data array with P periods (rows) and Q sites (columns); the basic model for the N (< P Q ) data values Yij in the occupied cells of the array is (on some appropriate scale: not necessarily the scale of the raw data): Yij = # + ri + cj + £ij

(1)

where # is a constant, eiy is a random variable with zero mean and variance a 2, and the row (column) effects r i (Cj) are either constants satisfying Eri = 0 (~cj = 0), or are random variables with zero mean and variance ~r ( ~ ) , respectively. The eij are taken to be mutually uncorrelated for all i and j. The case where r i is a random variable corresponds to an intersite correlation of ~ / ( o 2 + ~ ) ; depending on context, one or other or both of ri and cj may be taken as random. Where, for example, the data array is of annual maximum discharges at Q sites for up to P years, it may be appropriate to consider the year effects r i as a sample of effects from a hypothetical population of years with variance ~ ; if the Q sites as regarded as fixed, then annual floods at sitej (j = 1 . . . Q) have mean value # + cj and variance a 2 + ~ . The routine estimation of the fixed effects # and (in the example of the last sentence) cj, and of the variances of the random effects ~2 and ~ , can be undertaken by the technique of residual maximum likelihood (REML) which is now standard in good computer software for statistical calculation (e.g. Genstat, 1993). The method is well established and widely used in the context of agricultural field trials, having first been proposed by Patterson and Thompson (1971). In using the software, it is necessary, of course, to identify which of the factors by which the observations are classified are to be considered as fixed, and which are to be considered random: if, in (1), the r i (as well as the eij) are regarded as random and the cj as fixed, then output will normally consist of estimates of the variances ~ and a 2, together with estimates of # and the Q constants cj, j = 1 . . . Q and their standard errors; extension to the case where observations are classified by three or more factors is straightforward. We give below two hydrological case studies where R E M L was used, and discuss the advantages and limitations of the method in each case. We note in passing that the basic model given as Eq. (1) above can be immediately extended to the case where the model contains one or more covariates; for example, i f y q is annual maximum discharge, the model in (1) can be extended without difficulty to Yij = ~ q- ri q- f l l X l j q- fl2x2j q- " " " q- flkXkj q- ~-ij

(2)

280

R.T. Clarke / Journal of Hydrology 182 (1996) 277-295

where the covariates xlj, x 2 j . . , xkj are, say, the values of k basin characteristics. Then if the ri and eij are random variables as before, the k + 1 fixed quantities #, ~ can be estimated by REML, together with the variances ~ and ~ of the random components. The annual flood at any site, whether gauged or ungauged, then has a distribution with expected value E[yij ] = # + ~lXlj + t~2x2j --1-... + ~kXkj

and variance ~ + ~ from which, subject to appropriate distributional assumptions discussed more fully below, T-year floods can be estimated. Thus, REML permits physical characteristics of river basins to be incorporated into the estimation procedure for means and variances of flood distributions; this contrasts with the more common procedure of calculating mean annual floods at Q sites, followed by a search for a relationship between mean annual flood and catchment characteristics. In effect, REML permits the estimation of the # and ~ parameters, taking into account intersite correlation between annual floods observed at different gauging stations in the same year. The ability of the REML model to include covariates is also likely to be particularly useful where, for example, (a) sediment yield is the variable of interest, and flow variables are the covariates; (b) water quality variables are of main interest, and the covariates describe flow and meteorological conditions.

2. Case study: time variation in mean monthly areal rainfall in the Amazon basin In a study of rainfall on the Amazon basin lying within Brazilian frontiers, monthly rainfall was analysed from 48 raingauge sites with records extending over a 23-year period during which some areas had been extensively deforested. Because of difficulties in recording daily rainfall in a hostile environment, records of monthly rainfall were far from complete; nevertheless non-parametric tests for time-trend showed both significant positive and significant negative trends in certain regions, with the total number of significant trends being much greater than would be expected on the hypothesis of zero trend with a Type I error of 5% (de Paiva and Clarke, 1995). It was suggested that the observed trends could contain a decadal periodicity correlated with sea surface temperatures (SSTs) in the Pacific and, possibly, in the Atlantic Ocean. Therefore, before such a correlation could be investigated, it was necessary to obtain 'best' estimates of mean monthly rainfall, averaged over the 48 sites, for each month of the 23-year record, taking account of the fact that, in most months of that period, the number of gauges with a complete record was somewhat less than 48. Indeed of the total possible number of observations of monthly rainfall (23 x 12 x 48 = 13 248) there were 9153 observations; the last 4 months contained no data at all and were omitted. The model shown as Eq. (1) above was used, with the monthly rainfall data in an array with 272 (= 23 x 12, less 4) rows and 48 columns, for months and sites respectively. Since mean monthly rainfall, adjusted for site differences, was required for correlation with SSTs in the particular 23-year period of record, the row effects ri

R.T. Clarke/Journal of Hydrology 182 (1996,) 277-295

281

were taken as constants subject to the constraint ~ r i = O. Similarly, the 48 site effects were also taken as fixed, subject to the constraint ~cj = O. The object of the calculation was to estimate the 272 quantities # + r/, eliminating the site effects cj; that is, to calculate monthly rainfall, adjusted for the data missing from each site. Application of R E M L requires that the errors e/j in Eqs. (1) and (2) be Normally distributed. Given the 'design matrix' X, consisting of the values 0 and 1 specifying the occurrence of the fixed effects #, r/and cj in the model which relates these quantities to the observations Yiy of monthly rainfall, and denoting the fixed effects collectively by the vector a , the model may be written y = X a + E where y is a 9153 x 1 vector of data values, Xis a 9153 x (1 + 272 + 48 = 321) design matrix, a is a 321 x 1 vector of fixed effects, and e is a 9153 x 1 vector of random errors. Since only the e contains random variables, the model reduces in this particular case to a multiple linear regression model with vector of regression coefficients or. In the more general case where, say, the 272 row effects ri are also random variables, the model has the more general f o r m y -- X a + Z/~ + E, where the second design matrix Z o f s i z e 9153 x 272 specifies the occurrence of the 272 random variables ri in the data y/y, and/~ is the 272 × 1 vector of random variables ri. A more general form still allows the possibility of several groups of random variables/~ of the form: (3)

y = XOL "-}-~-~Zi~ i --}-E

where summation is over i. In this most general form, E[Y] = X c t and the variancecovariance matrix of Y is var[Y] = ~ Z i C i Z ;

+ ~21n

(4)

where var[~/] = ~ C i and cov[/~i,/~j] = 0 for i ¢ j, and I n is a unit matrix of dimension n × n where in this example n = 9153. Each Ci is a symmetric matrix, so that the variance-covariance matrix for the whole set of random effects [/~1,/32, 133...] takes a simple block-diagonal form. The general (standard) procedure described in the Appendix was used to estimate the effects in the present model in which all effects are fixed, and hence the 272 monthly means # + ri were obtained, corrected for the fact that not all of the 48 stations had complete rainfall measurements in all 272 months. Table 1 shows these corrected means, together with the number of observations in each month, and the values of the uncorrected means. Inspection of Table 1 shows that the differences between the corrected and uncorrected means are small, unless the number of observations ("N values") in the month is also small; thus in the final month 272, with only two measurements of monthly rainfall, the difference between corrected and uncorrected means exceeded 50 mm. As mentioned above, since both month (row) and column (gauge-site) effects are considered fixed, the vector of fixed effects ct could, in this simple case, have been equally well calculated by a multiple regression expressing each observation in terms of P + Q dummy variables X, each taking values of 1 or 0 according to whether or not there was an observation Yij in the i-th row and j-th column of the data array. The multiple regression coefficients ot (consisting of # together with the P constants ri and

282

R.T. Clarke / Journal of Hydrology 182 (1996) 277-295

Table 1 Corrected monthly rainfall (MthEffs) in 272 consecutive months, averaged over 48 raingauge sites. Also shown: number of observations in each month (NValues); uncorrected mean rainfall (UnCorrtd) NValues UnCorrtd MthEffs

20 211.6 204.0

22 294.4 285.2

22 353.7 344.6

22 345.9 336.8

22 261.1 252.5

NValues UnCorrtd MthEffs

22 182.1 174.7

22 126.4 119.0

22 120.4 112.9

2O 108.6 97.4

20 100.3 90.9

NValues UnCorrtd MthEfts

20 204.3 194.9

20 164.4 156.0

9 242.0 242.1

9 288.4 288.5

9 388.7 388.8

NValues Uncorrtd MthEffs

9 455.3. 455.4

9 330.8 330.9

8 186.0 183.7

8 139.6 137.4

8 81.2 79.0

NValues UnCorrtd MthEfts

8 62.7 60.5

8 91.0 88.7

7 82.9 79.5

8 82.6 82.1

7 322.1 318.8

NValues UnCorrtd MthEffs

6 321.7 333.5

7 423.0 435.8

7 345.3 358.1

7 180.4 190.8

8 110.2 126.0

N Values UnCorrtd MthEffs

14 112.0 119.4

15 107.3 115.6

18 85.9 93.9

17 102.0 109.8

17 141.6 147.0

NValues UnCorrtd MthEffs

19 224.0 233.8

21 220.2 222.1

22 307.4 311.2

22 351.7 355.4

20 311.3 318.9

NValues UnCorrtd MthEffs

22 321.0 324.7

22 183.7 188.2

22 169.7 173.5

23 91.7 92.9

25 106.9 110.7

NValues UnCorrtd MthEffs

26 139.9 144.2

24 172.7 176.1

25 232.8 236.4

26 301.0 305.4

26 369.7 374.0

NValues UnCorrtd MthEfts

26 375.5 379.9

25 337.2 345.5

26 275.6 279.9

24 152.5 161.2

26 120.7 123.4

NValues UnCorrtd MthEfts

26 81.5 88.8

27 124.9 128.4

26 136.2 138.8

26 158.2 160.8

25 302.6 309.1

NValues UnCorrtd MthEfts

24 353.2 356.9

24 348.0 351.4

26 356.0 360.3

26 310.1 314.4

26 254.7 259.0

NValues UnCorrtd MthEffs

29 147.8 152.3

30 137.7 142.2

32 96.8 102.4

33 122.1 124.6

33 139.8 14Z3

R.T. Clarke / Journal of Hydrology 182 (1996) 277-295

283

Table 1 (continued) NValues UnCorrtd MthEffs

33 158.7 166.3

36 235.5 239.4

36 326.2 330.1

36 291.3 295.6

35 336.7 341.1

NValues UnCorrtd MthEffs

38 269.9 272.8

39 191.2 193.7

40 121.3 123.8

40 62.5 66.9

41 63.8 67.8

NValues UnCorrtd MthEffs

43 71.8 76.6

40 158.9 163.4

42 160.8 166.8

41 205.0 208.3

44 233.0 235.9

NValues UnCorrtd MthEffs

45 280.9 283.9

45 330.6 334.1

45 286.7 289.3

47 235.7 236.6

47 174.6 175.6

NValues UnCorrtd MthEffs

46 110.5 111.5

45 81.0 80.3

46 132.4 133.4

46 176.4 177.3

45 134.2 133.6

NValues UnCorrtd MthEffs

46 272.3 273.2

45 279.6 280.2

46 246.0 247.0

46 295.8 296.8

45 280.7 281.6

NValues UnCorrtd MthEffs

45 219.4 220.4

43 130.4 128.5

46 145.7 146.7

44 89.7 89.6

46 133.7 134.7

NValues UnCorrtd MthEffs

44 158.0 160.5

44 145.5 149.2

46 ' 261.0 262.0

43 220.0 218.9

45 236.0 235.9

NValues UnCorrtd MthEffs

46 336.8 336.6

46 265.9 265.8

45 206.7 208.6

41 119.3 121.6

44 81.1 81.0

NValues UnCorrtd MthEffs

44 102.6 101.9

47 95.1 95.1

45 155.6 155.8

44 166.9 168.5

45 217.5 217.4

NValues UnCorrtd MthEffs

46 252.8 252.8

45 240.9 240.3

46 265.7 265.7

45 199.1 198.8

46 170.4 171.5

NValues UnCorrtd MthEffs

45 132.1 133.2

47 102.9 102.9

47 83.1 83.1

46 102.4 102.0

47 156.9 156.8

NValues UnCorrtd MthEffs

45 180.9 180.5

45 186.0 185.6

47 300.7 300.6

45 284.4 284.9

NValues UnCorrtd MthEffs

47 203.4 203.3

44 198.5 197.4

47 130.3 130.3

45 100.4 99.8

46 95.8 96.3

NValues UnCorrtd MthEffs

47 99.3 99.3

46 135.2 137.2

44 199.5 202.7

47 228.3 228.2

39 '-321.1 321.2



47 224.9 ~: 224.9

284

R.T.

Clarke/Journal of Hydrology 182 (1996) 277-295

Table 1 (continued) NValues UnCorrtd MthEffs

39 296.3 296.4

40 323.0 322.6

39 332.4 331.4

36 235.0 234.6

37 124.4 127.1

NValues UnCorrtd MthEffs

36 126.9 127.5

36 94.7 94.9

39 105.6 105.7

38 94.8 92.9

40 129.1 129.3

NValues UnCorrtd MthEffs

39 221.9 220.1

45 155.4 156.1

47 200.9 200.9

46 271.4 271.7

46 257.5 257.1

NValues UnCorrtd MthEffs

47 154.8 154.8

47 lO0.O 100.1

44 69.2 66.6

46 107.7 107.1

45 80.9 80.1

NValues UnCorrtd MthEffs

45 164.1 165.5

42 133.9 128.7

44 220.1 218.0

43 259.7 258.2

45 295.2 294.8

NValues UnCorrtd MthEffs

47 310.9 311.0

47 317.5 317.5

46 258.3 259.1

46 151.6 150.0

46 122.5 122.5

NValues UnCorrtd MthEffs

46 100.4 99.3

45 129.8 129.9

45 162.3 161.8

46 146.7 146.0

46 224.7 225.1

NValues UnCorrtd MthEffs

46 295.3 295.7

46 266.2 264.7

46 308.4 306.9

46 253.9 254.5

45 261.9 262.6

NValues UnCorrtd MthEffs

42 147.7 147.7

44 121.7 121.6

44 115.7 115.6

43 95.0 96.7

43 142.6 146.3

NValues UnCorrtd MthEffs

42 179.6 181.4

4O 306.9 305.7

44 282.4 282.6

46 275.2 275.9

45 344.1 345.1

NValues UnCorrtd MthEffs

45 257.8 258.8

45 208.8 209.8

43 177.4 177.6

44 138.0 137.3

45 73.2 74.3

NValues UnCorrtd MthEffs

43 133.2 133.6

42 195.1 195.8

42 201.0 200.9

41 210.0 210.3

43 242.3 240.1

NValues UnCorrtd MthEffs

43 247.9 246.8

44 247.6 247.2

43 299.5 299.0

42 168.1 165.6

42 147.3 148.9

NValues UnCorrtd MthEffs

41 107.4 106.8

43 86.0 86.1

43 104.5 104.6

43 I05.5 104.4

4O 150.3 152.5

NValues UnCorrtd MthEffs

40 177.6 178.7

41 281.4 281.7

40 306.9 307.1

4O , 243.3 243.4

35 281.1 280.4

R.T. Clarke / Journal of Hydrology 182 (1996) 277-295

285

Table 1 (continued) NValues UnCorrtd MthEffs

34 264.9 267.0

33 172.8 171.9

34 151.2 152.9

36 84.3 86.2

36 99.7 101.6

NValues UnCorrtd MthEffs

34 126.0 125.9

35 216.3 216.3

33 277.0 276.0

40 294.4 291.5

38 268.9 268.2

NValues UnCorrtd MthEffs

40 317.6 316.1

37 268.1 265.5

35 270.9 267.9

31 211.9 205.6

31 147.4 141.2

NValues UnCorrtd MthEffs

29 114.0 104.7

27 129.2 122.5

26 215.1 208.4

25 233.4 227.2

23 211.3 205.2

NValues UnCorrtd MthEffs

26 295.6 287.9

25 201.9 194.8

25 274.4 267.3

24 230.9 224.7

23 217.7 211.0

NValues UnCorrtd MthEffs

20 194.6 187.7

20 164.8 157.8

20 124.0 117.0

21 116.9 109.0

23 152.2 143.7

NValues UnCorrtd MthEffs

23 183.8 175.3

23 214.1 207.5

29 301.4 298.9

28 244.4 237.9

27 296.9 289.7

NValues UnCorrtd MthEffs

26 299.4 293.0

24 254.0 244.9

21 196.0 191.6

20 160.5 149.7

22 100.2 89.5

NValues UnCorrtd MthEffs

22 108.1 97.4

19 126.1 117.3

20 152.0 142.7

16 149.6 142.5

16 146.6 146.3

NValues UnCorrtd MthEffs

14 189.1 190.2

7 233.4 253.9

7 147.0 154.9

5 165.4 178.8

4 234.3 249.5

NValues UnCorrtd MthEffs

3 225.0 241.4

2 133.0 184.2

t h e Q c o n s t a n t s cj) are also subject t o t w o constraints: n a m e l y , t h a t the sums o f the r l a n d cj are b o t h zero. A s with m u l t i p l e regression w h e n significance tests a n d confidence intervals are used, the R E M L p r o c e d u r e requires t h a t the errors e and, w h e n they o c c u r in the m o d e l (3), the r a n d o m effects Jfli also be n o r m a l l y d i s t r i b u t e d a n d m u t u a l l y i n d e p e n d e n t . I n the p r e s e n t case study, ~however, the c a l c u l a t e d residuals Y i j - # r i - c j s h o w evidence t h a t the residuals are positively skewed; Fig. 1 shows a h i s t o g r a m a n d N o r m a l p l o t o f the 9153 residuals. T o n o r m a l i s e the d a t a , a B o x - C o x t r a n s f o r m a t i o n Ys*j = (Y~ - 1 ) / ~ was tried for v a r i o u s values o f the p a r a m e t e r A, after w h i c h R E M L was a g a i n used; a log t r a n s f o r m a t i o n , c o r r e s p o n d i n g to A = 0, o v e r c o r r e c t e d , giving residuals w h i c h were negatively skewed. Fig. 2 shows

R.T. Clarke / Journal of Hydrology 182 (1996) 277-295

286

Histogram of re, iduab 4500-

4000-

3800-

3OOO-

ZSOO-

a(X)O-

1500K300-

500-

0

I

i

i

RESIDUAL MAGNITUDE

Normal probability plot of residuals

./

CO0-

dr, 600-

400-

ti

200-

o-

| -ZOOm

NONI&AL OEVIATI[

Fig. 1. Histogram and Normal plot of 9153 residuals: 23 years of monthly rainfall data from 48 gauge sites in Brazilian Amazonia~

R.T. Clarke / Journal of Hydrology 182 (1996) 277-295

287

Histogram of residuals

I

3800-

30O0-

tSOO-

L~00-

11500-

I000-

500-

|

0 °

-

,

O

,'o

!

2O '

RE$iOUAL MAONITUOE

Normal probability plot of remduab sO-

I I1

-i j ;: / " -4

-~

-|

-) NORMAL

0

I

|

11

4

"

DEVIATE

Fig. 2. Histogram and Normal plot of 9153 residuals: 23 years of monthly rainfall data from 48 gauge sites in Brazilian Amazonia. Data transformed by Box-Cox transformation with parameter A = 0.5.

288

R.T. Clarke/ Journalof Hydrology 182 (1996) 277-295

the histogram and Normal plot corresponding to the case A = 0.5, which shows reasonable evidence of Normality. Optimisation of the parameter A may be expected to improve linearity still further. In the more general model specified by Eqs. (3) and (4) above, there occur matrices Ci, with C i ~ giving the covariances amongst the random effects/~i. By specifying a particular form for the Ci, it is possible to allow for particular correlational structures amongst the random effects/~i. Thus, if the 48 gauge-site effects were regarded as random with correlation between the effects corresponding to sites i a n d j of the form exp(-adij), dij being the Euclidean distance between the sites, then the matrix Ci can be specified to have this particular form withexp(-adu) as the element in the i-th row and j-th column. The parameter a must be assumed known; if it is unknown, it is possible to estimate it by assuming a series of fixed values a 1 = 0.1, a 2 -- 0.2,... and using REML for each assumed a-value. A goodness-of-fit statistic, such as the deviance (McCullagh and Nelder, 1989) would then indicate the appropriate estimate of the parameter a. It is useful to compare the REML method with the E - M algorithm which is sometimes used to estimate the means, variances and covariances of a multivariate Normal distribution, when observations on some of the variates are missing (e.g. Dempster et al., 1977; Johnson and Wichern, 1992, pp. 202-207). To apply the E - M algorithm to estimate missing values in the 272 x 48 array of Amazon monthly rainfall, it would be necessary to separate out the 23 Januaries, the 23 Februaries etc. forming 12 arrays of dimension 23 x 48; each row of an array could then be considered, in principle, as one sample-vector from a multivariate Normal distribution with 48 variables, and the missing values estimated as shown by Johnson and Wichern (1992), all observations from one site being assumed to have the same mean. When both row and column effects are considered as fixed, the REML approach is more general, since the expected value of an observation Yij in the i-th row andj-th column of the array depends on both the row and column effects ri and cj. Furthermore, in the E - M approach it is necessary to assume that the rows of the data array are statistically independent; as discussed in the preceding paragraph, the REML approach does not necessarily require statistical independence of this kind.

3. Case study: analysis of annual flood, data from sub-basins of the Rio Itajai-A~ As a second case study of the application of REML, we consider the analysis of annual flood data from 19 flow gauging sites in sub-basins of a large river system in Southern Brazil, the Rio Itajai-A~fi. When in fiord, this river constitutes a major threat to human life and property, and has been partie~ady damaging to the major urban area of Blumenau in the State of Santa Catarina. The analysis presented here covers a period of 55 years ending i n 1984. As with all river basins subject to land clearance and urban growth, there is a fear that flood peak discharges are becoming higher, and that the times to peak discharge are becoming shorter, so that methods of data analysis are required which allow changes in flood regime to be identified, This cannot be done simply by looking at the time-sequences of annual floods site by site,

R.T. Clarke/JournalofHydrology 182 (1996) 277-295

289

because these floods are likely to be greater in years when rainfall is high, and conversely; so a m e t h o d is required which eliminates annual effects. A t the same time, the analysis needs to provide a m e t h o d o f calculating f l o o d s o f T-year return period at each o f the gauged sites. Table 2 shows the names and drainage areas (km 2) o f the 19 sites, together with the number o f years o f record at each site. A preliminary analysis (Bartlett's test) showed no evidence o f heterogeneity in variance at the 19 sites. Some o f the records are very short, and it could be argued that they should be rejected. On the other hand, it can also be argued that where data are few in number it is essential to make the fullest use o f those that are available. Again, it could be argued that the inclusion o f short records could lead to a dilution o f information, if and where the well-known criterion for information gain p2/> 1/(n - 2) is not satisfied (Fiering, 1963; Fiering and Jackson, 1971). W e would counter this argument by saying that this criterion refers to the condition for increased information on the mean/~y o f a bivariate N o r m a l distribution with means #x, #y, given n independent data-pairs (x, y), and additional nl observations o f the variable X. However, (i)the correlation/9 is never known exactly; (ii) in R E M L with year effects fixed, the expected value o f Yij varies from year to year (being # + ri + cj) and so the model is different from that for which the information criterion was derived. Therefore, we prefer to keep the short records in our analysis. The basic model was, as before, Yij = # + ri + cj + elj, but for the purpose o f estimating T-year floods, it was considered appropriate to regard the year effects r i as r a n d o m variables with zero m e a n and variance ~ ; the eij are taken to be independently distributed, and distributed independently o f the ri, with variance o2. Table 2 2 and years of record at 19 gauging sites on the Rio Itajai-A¢6, Santa Catarina, Names, drainage areas (km), South Brazil. For explanation of column headed # + cj, see text Stream and site name

Drainage area

Years of flood record

# + cj

R Pouso Redondo R Trombudo at Trombudo Central R Itajai at Barracao R Itajal do Sul at Saltinho R Itajai do Sul at Jararaca R Itajai-A~fi at Rio do Sul R Itajai-A~fi at Rio do Sul Novo R Hercilio at Barra da Prata R Hercilio at Ibirama R Neisse Central at Neisse Central R Itajai-A¢6 at Apiuna R Itajai-Aeft at Warnow R Benedito at Benedito Novo R Benedito at Timb6 R Itajai-Aqfi at Indaial R Itajai-Ag"6at Itoupava Seca R Garcia at Gamia R Itajai-Mirim at Botuvera R Itajai-Midm at Brusque

130 55 364 483 728 5100 5100 1420 3314 195 9242 9714 692 1342 11151 11719 127 859 1240

32 24 22 11 26 39 7 8 50 25 51 4 51 48 51 15 33 7 50

3.863 4.465 5.490 4.730 5.581 6.616 6.637 5.929 6.531 4.274 7.272 7.409 5.166 5.972 7.517 7.451 4.003 4.879 5.329

290

R.T. Clarke / Journal of Hydrology 182 (1996) 277-295

The annual maximum floods (more precisely, the annual maximum mean daily discharges) were transformed to logarithms before using REML, the justification for the transform being given in Table 2. Thus the fixed effects were (a) the overall constant # and (b) the 'site effect' cy. The random effects were (c) the 'year effect' r i and (d) the 'error' or 'site x year effect' eij. The total number of annual floods from the 19 stations was 505. Recalling that these were transformed to logarithms, the R E M L analysis estimated ~ as 0.13803 (4-0.02943); 02, the variance of the eij, was estimated as 0.09661 4- 0.00657. The deviance, a measure of how badly the assumed model fits the data, was 379.9 with 485 degrees of freedom. This statistic is distributed approximately as X2, and the calculated value is close to the mean value 485 of this distribution, giving no evidence that the model fit is unsatisfactory. Estimates of the 19 quantities # + cj, giving the expected values of loge(annual flood) at the sites, are shown in Table 2, right-hand column. The variance of loge(annual flood) is the same for each site, namely 63 + ~y = 0.23488. Using these values of/2 + ~y and 63 + ~ , the value of the T-year flood at any site can be easily calculated. Thus if K0.99 is the 99% quantile of the standard normal distribution, the estimate o f the 100-year flood at Site i (i = 1 ... 19) is exp {/2 + cj + K0.99V/(t~2 + ~y) } • The analysis has made two distributional assumptions: (i) that the 'year effects' r i are Normally distributed, N(0, ~ ) and (ii) that the errors or 'site x year effects' eij are N(0, 02). These two assumptions need to be verified. Fig. 3 shows a histogram and Normal plot for the calculated year effects; this does not appear to warrant rejection of the assumption that the ri are N(0, ~ ) . Similarly, Fig. 4 shows a histogram and Normal plot for the site x year effects trij. Again, the assumption N(0, 02) appears justified. Therefore, transformation of the annual floods to logarithms justifies, in this case study, the R E M L assumptions. The estimated errors aij are free from the year-to-year variations in annual floods, caused where years of extreme rainfall result in floods at all sites that are larger than usual, and where drought years result in floods at all sites that are small. Plots of the trij against time might therefore be expected to show whether annual maximum mean daily discharges have tended to increase over the years. These plots were computed for sites with longer records, and Fig. 5 shows as an example the plot obtained for the Rio Itajai-Mirlm at Brusque, draining an area of 1240 k m 2, with 50 years of record. There was no significant evidence of time-trend in the estimated errors at this site, or any other, as judged by linear regression. However, testing for trend by linear regression is not strictly valid because the estimated errors are correlated as a consequence of fitting the model.

4. Conclusion

Two case studies are discussed which suggest that R E M L has possibilities as a

R.T. Clarke / Journal of Hydrology 182 (1996) 277-295

291

Histogrom of 55 yeor effocts le-

1614-

12-

| 642-

I I

C -O.q

-0.3

0

O.G

0.3

I

O.9

SITE X YEAR EFFECTS

Normol ~

of yoor effocts

2-

I-

O-

.l°

! QS -

.O,~

-o~ -o:, .~t YEAR

~

o'.t

o',4 o'., io,.,

EFFECTS

Fig. 3. Histogram and Normal plot o f calculated year effects rt: 55 years of annual flood data from 19 gauge sites on the Rio Itajai-A~fi, southern Brazil.

R.T. Clarke / Journal of Hydrology 182 (1996) 277-295

292

Histogromof residuois leO-

160-

140-

IZO-

z h

SO-

4020O-

I - 1.00

: -0.75

J -0.~0

J -0.25

SITE X

0

(123

).SO

I 0.7S

IDO

YEAR EFFECTS

Normal prolmbility plot of residuals

Geo-

ojm.

i o-OII|-

-G'I~-

-I~O-

-I.SS!

NORMAL



|

NVIATE

Fig. 4. Histogram and Normal plot o f site x year effects etj: 55 years o f annual flood data from 19 gauge sites on the Rio I t a j a t - A ~ , southern Brazil.

293

R.T. Clarke / Journal of Hydrology 182 (1996) 277-295





G2I • •

I

I







I I I

O.O-

I x







0.2-

ammm

0.4-

0.8

|

i









mare





I 1944

I 1954

I MN,I

I 1974

I

?

YEAR

Fig. 5. Plot of site x year effectseij: data from 50 years of recordsof annual floodson the Rio Itajai-Mirim at Brusque.

method of analysis of hydrological data sets having the following characteristics which are typical of many such sets: hydrological records of varying length; intersite correlations; year-to-year effects which can be considered random. When all effects are considered fixed except for the errors denoted in the text by a U, R E M L reduces to multiple regression y = X a + E as used to estimate the constants in a non-orthogonal analysis, with linear constraints on the regression coefficients. The strength of R E M L lies, however, (a) in extending this to more complex models of the form y = X a + Z/3 + ~ and y = X ~ + E Z i f l i + ~ , where the /3 and/3i are also random variables, in addition to the E; (b) in allowing for specified correlational structures amongst the elements in the random vectors fl or fli. The generality of the model comes at a price however, since it is necessary to assume that the distributions o f the ¢, and of each of the/3 i, are multivariate Normal. This need not be a serious drawback as long as a suitable transformation to a Normal scale can be found.

Appendix In the model y = Xa + ~Zifli + e

let Z be the partitioned matrix [ Z 1 Z : . . . Zc] and /3 be the partitioned vector [/31~.../3c]. We assume that the random effects/3 and e are mutually independent

R.T. Clarke/Journal of Hydrology 182(1996) 277-295

294

N o r m a l l y distributed r a n d o m variables with zero mean, with cov[e] = a21n and cov[/3i] = ~ C i where In is the n x n unit matrix and Ci is a qi × qi symmetric matrix with c o v ~ i , /3j] = 0 for i a n d j unequal. It is usually assumed also that v a r ~ i ] = ~ l q i although this is not essential. Then

E[r] = and cov[Y] = V = o

{E.y

z,c,z; + I.}

= a2H.

where 7/= ~/o2. If Ci = lqi then H = zrz' diag{q, lIql~/2Iq2... %Iqc}. The residual log likelihood is then (Genstat,. 1993)

RL(y) = - 1/2 loge I VI - 1/2

log~lX'V - I x [

+ In

where

r =

- 1 / 2 ( y - Xct)' V -1 (y - Xct)

from which the first derivatives with respect to the 7 / a n d a2 are obtained as

ORL(y)/07i and ORL(y)/Oa 2 G i v e n an initial estimate o f V, the G e n s t a t c o m p u t a t i o n a l procedure is then as follows: (1) Calculate estimates o f a a n d / 3 from

x,x Z'X

Z ' Z X + r -1

Lz'y]

(2) Putting Q22 = z ' z + r -1, calculate U = r -~ - r - l Q 2 2 r - l ~ (3) U p d a t e the estimates o f 7 / a n d o2 using "/new ff2new

A/old + l _ 1 ld

ORL(y)/OO~oid

where I is the information matrix E[o~logeRL(y)~07ion]. (4) Cheek for convergence o f V; if convergence is not achieved, return to step (1) else stop.

References

Alexander, G.N., 1954. Some aspects of time series analysis in hydrology. J. Inst. Eng. Aust, 26: 196. Dempster, A.P., Laird, N.M. and Rubin, D.B., 1977. Maximum likelihood from incomplete data via the EM algorithm (with Discussion). J.R. Stat. Soc. Series B, 39(1): 1-38. ,~ de Paiva, E.C.D. and Clarke, R.T., 1995. Time trends in Amazon rainfall. Bull Am. Metl Sot., in press. Fiering, M.B., 1963. The use of correlation to improve ' estimates of the moan and variance. USGS Professional Paper 434-C, Washington, DC.

R.T. Clarke / Journal of Hydrology 182 (1996) 277-295

295

Fiering, M.B. and Jackson, B.B., 1971. Synthetic streamflows. Am. Geophys. Union Water Resour., Monograph 1. C~nstat 5, 1993. Release 3, Reference Manual. Clarendon Press, Oxford. Haan, C.T., 1977. Statistical Methods in Hydrology. Iowa State University Press, Ames, Iowa. Hosking, J.R.M. and Wallis, J.R., 1988. The effect of intersite dependence on regional flood frequency analysis. Water Resour. Res., 24(4): 588-600. Johnson, R.A. and Wichem, D.W., 1992. Applied Multivariate Statistical Analysis. 3rd ¢dn., Prentice-Hall International Editions. McCullagh, P. and Nelder, J.A., 1989. Generalised Linear Models. 2nd edn., Chapman & Hall Monographs on Statistics and Applied Probability. Patterson, H.D. and Thompson, R., 1971. Recovery of interblock information when block sizes are unequal. Biometrika, 58: 545-554. Stedinger, J., 1983. Estimating a regional flood frequency distribution. Water Resour. Res., 19: 503-510.