Journal of Hydrology, 78 (1985) 137--150 Elsevier Science Publishers B.V., Amsterdam -- Printed in The Netherlands
137
[1] COVARIANCE BETWEEN SUBSAMPLE MEAN AND VARIANCE RELATED TO STORAGE VARIABLES
AS
MEHMETCIK BAYAZIT 1 , J.T.B. OBEYSEKERA 2 and VUJICA YEVJEVICH 2 l
.
.
.
.
.
.
.
Department of Cwd Engineering Istanbul Technical Unwers*ty Istanbul (Turkey) International Water Resources Instztute, George Washmgton Unwers~ty, Washington, DC 20037 (U.S.A.) 2
•
'
.
'
,
',
.
.
(Received June 14, 1984; revised and accepted November 19, 1984)
ABSTRACT Bayazit, M., Obeysekera, J.T.B. and Yevjevich, V., 1985. Covariance between subsample mean and variance as related to storage variables. J. Hydrol., 78: 137--150. The covariance, cov(x, Sx 2 ), between mean and variance of subsamples is conceived as a single parameter which integrates the effects of asymmetrical extremes and dependence structure on storage capacity required for seasonal flow regulation. An exact expression for this covariance for the first-order autoregressive processes is derived. The use of cov(x, s2x) for estimation of skewness is investigated by simulation for both independent and dependent stationary processes. This approach is shown to result in a lesser bias in estimated skewness than the moment estimator. The relationship between cov(~,s~) and the storage related variables, such as range and deficit, is also studied by simulation. It is found that cov(~,S2x) is positively correlated with mean range and deficit for skewness smaller than 3, and negatively correlated with adjusted range. It is concluded that these storage related variables are influenced in different ways by the sampling fluctuations in the coefficients of skewness and serial correlation.
INTRODUCTION
A high correlation exists between the annual mean and the annual variance of seasonal {monthly, weekly) flows when they are asymmetrically distributed. This means that seasonal flows have a higher fluctuation around t h e a n n u a l m e a n in w e t y e a r s c o m p a r e d t o d r y y e a r s . T h i s p r o p e r t y o f s t r e a m f l o w s e r i e s is r e l e v a n t in t h e d e s i g n o f s t o r a g e c a p a c i t i e s f o r s e a s o n a l flow regulation. I n t h i s p a p e r , i t is s t u d i e d h o w t h e c o v a r i a n c e b e t w e e n t h e s u b s a m p l e m e a n a n d t h e s u b s a m p l e v a r i a n c e is r e l a t e d t o t h e c o e f f i c i e n t o f s k e w n e s s a n d t h e a u t o c o r r e l a t i o n s t r u c t u r e o f s t a t i o n a r y p r o c e s s e s . A n e x p r e s s i o n is derived for the covariance of first-order linear autoregressive processes with skewed distributions. This relationship can be used to estimate the population coefficient of skewness. The sampling properties of estimates
0022-1694/85/$03.30
© 1985 Elsevier Science Publishers B.V.
138
of this covariance and of the corresponding coefficient of skewness are investigated by simulation, and compared with the distribution of the skewness coefficient estimated by the m e t h o d of moments. Another objective of the study is to investigate how the covariance affects the storage capacity needed for seasonal regulation. For this purpose a simulation study is carried out to determine how the storage-related statistics (range, adjusted range, deficit) are correlated with the covariance between the subsample mean and the subsample variance.
COVARIANCE BETWEEN THE SUBSAMPLE MEAN AND SUBSAMPLE VARIANCE OF A SKEWED AR(1) PROCESS
Consider a first-order linear autoregressive (AR(1)) process:
xi = PXi-1 + ei
(1)
with xi a standardized random variable whose distribution has the third central m o m e n t / ~ 3 , P is the first serial correlation coefficient of the process, and ei is an independent random variable which has a skewed distribution with the third central m o m e n t : P3 (e) -
1 --,0 3 (1 -- p2)a/2 #3
(2)
For a subsample of size m drawn from the population of xi, subsample mean ~ a n d subsample variance s~ are estimated by:
x--
Z xi/m
s=2 =
and
i=1
-- ~2
x~/m
(3)
i=1
For subsamples of m o n t h l y values equal to the year, m -- 12. Following expression then gives the covariance between these two estimates (the proof of which is given in Appendix A): cov(E,s~)
=
-P3 ~ { (m--1)+(m--1)(1--3/m)
6 --
-
(m
--
2)
p(I+2p) 1 _p2
pS
--
6 [p4 __pro+2 +
--
[
m (1--p)(1 _p2) m (1 __p)3 p6 __ p2rn+2 ]'~ (1--p)(1--p2)2 , f o r m ~> 3
fl
(4)
Neglecting the terms with powers of p higher than 4, an approximate expression is obtained that yields practically identieal values of cov(~,s~) as those of eq. 4 for p < 0.5:
139
• IX
0.20
~' o.18 0.16
/
/
zzz "×
0,14
0.12
Eq;
//
0.10
0.08
0.06
' 0.2
' 0.4
, 0.6
, , 0.8
P Fig. 1. Covariance o f x a n d s x 2 as f u n c t i o n o f p for m = 12.
cov(E,s= 2) = ~-~ (m - - 1 ) + (m - - 1 ) ( 1 - - 3/m) P(l + 2p)
1 _p2
p2(1 + 2p + 2p2) --
(1
--
3/m) (1
+
6 p4 m (1%)3
__p2)2
6 (m -- 2) - -
p3
m (I--p)(l
1
_p2) (5)
Another approximate expression that is reasonably accurate for p < 0.3 is:
cov(x,s= 2) ~- ~3
m,[ -
m2 -
l + ( 1 - - 3 / m ) P ( l + 21 p_p2 )
]
(6)
The above expressions are reduced to the known relationship: cov(x, Sx2) = #3 ( m - - 1 ) m -2
(7)
for an independent but skewed random variable (p = 0). The covariance is always zero for a symmetrical distribution (/~3 = 0). It increases with both P3 and p for a given subsample size m (Fig. 1). SAMPLING PROPERTIES OF THE COVARIANCE
It is known (Wallis et al., 1974) that the estimate of the coefficient of skewness computed using the unbiased estimators of the third and second
140 (o p.
- -
~5
MOMI'P
0
=0.4
COY I "p LAP
=
o
= 01,
3
i
2 I
. . . .
a
•
" a
~
i
0.5
1.0
i
1.5
2.0
2.5
p
3.0
Cs
Fig. 2. Bias in C s as a function of population values of Cs and p. central moments has a negative bias. A simulation study was carried out to investigate the sampling properties of the coefficient of skewness computed using the expression for the covariance derived above. A total of 500 samples of standardized m o n t h l y flows (assumed that periodicity in mean and standard deviation are removed), each 50 yr. long (N = 50), were generated for various population values of Cs (0, (0.5), 3.0) and p (0, 0.4) of the first-order linear autoregressive process of eq. 1. For each sanfple the covariance between the annual mean and the annual variance of twelve m o n t h l y flows (m = 12) of each year and the corresponding lag-one serial correlation coefficient were estimated. Then the estimate of tt 3 was computed using the relationship between the covariance, #3 and p (eq. 5). The coefficient of skewness was then evaluated by two methods: (1) by using the estimated value of P3 ; and (2) by using unbiased estimators of the second and third central moments. Thus 500 estimates of C+ were obtained by each m e t h o d for each combination of population values of Cs and p. The mean of covariances estimated from the samples was always lower than the corresponding population value. An approximate expression for the bias in the estimation of covariance is obtained in case the subsample variance is computed with respect to the population mean (assumed to be zero) instead of using the subsample mean (see Appendix B for the details of derivation): 1
(bias)c;~(~,s ~) ----- - - N c°v(x'sx2) + ~-- -- cov(~,sx2)/N
N--1
N2
2 +/)3 (1 _p2)2
#3 p + 3 p
rn 2
(8)
141 where N is the sample size, or the num ber of subsamples each of size m. Covariances estimated from the sample were corrected for the bias by using the above a p p r o x i m a t e expression. This step reduced the bias, but did not eliminate it. The average bias was ~ 3.5% (1% still left after the correction) for p = 0, and 5% (1% after the correction) for p = 0.4. The bias in estimation of C8 (expressed as a percentage) is pl ot t ed in Fig. 2 as a f u n ct i on of the population skewness for p = 0 and p = 0.4. It is seen that the percentage bias increases linearly with Cs when the m e t h o d of m o m e n t s is used (MOM in Fig. 2), but remains approximately constant at ~ 1% when c o m p u t e d through the covariance (COV in Fig. 2). When the population C~ is higher than ~ 1.0, estimates based on the covariance have a smaller bias for the sample size used in the study. The sampling variance, however, is higher in this latter case. These results accord with the findings of Yevjevich and Obeysekera (1984) for the i ndependent process.
EFFECT OF COVARIANCE ON SEASONAL STORAGE The required capacity of a reservoir to regulate the streamflow can be characterized by the storage-related variables such as the range R . , adjusted range R*, m a x i m u m deficit DR, and similar. These are random variables whose statistical properties depend on characteristics of the underlying streamflow process and the period of regulation, n (which in general is different f r o m m and N, defined above). In the case where t he time series interval considered in the flow regulation is less than a year, the total storage can be divided in two parts. The storage capacity that is required for regulation between the years is called the annual storage. The part of the storage capacity that is needed for regulation within a year is called the seasonal storage. In this study seasonal storage refers to the capacity required for regulating the fluctuations of the stochastic comp o n e n t o f the streamflow within a year when the time interval chosen is a m o n t h (in this case n -- m = 12). The correlation between the annual mean and the annual variance (year as a sample) implies t hat wet years have more fluctuating m o n t h l y flows than dry years. Since the fluctuations of m o n t h l y flows in a year affect the required storage capacity, it can be expected t hat the storage-related variables for seasonal storage will be correlated with the covariance between the annual mean and the annual variance. A simulation study was carried out to investigate these relationships. As described above, 500 samples, each 5 0 y r . long, were generated for standardized m o n t h l y flows (with removed periodicity) that were assumed to have a skewed A R ( 1 ) population with certain values of C8 and p. For each year, variables t hat are related to seasonal storage (R12, R~2, D12 and D~2, which is the m a x i m u m deficit of partial regulation for a level of d e v elo p men t a = 0.5) were determined and their means (R12, R~2, D12,
142
sl . . 4
/
E(D12)
o SIMULATION
2
ANIS.LLOYD (1975) " ""---o.
E(DIP
--~
1
,
i
,
o5
,o
,
20
21.5
310,
Cs Fig. 3. E x p e c t e d values of RI2, R~2 , D12 and DiP2 for p = 0. m
D~2) were c o m p u t e d . Fig. 3 shows the means (for all 500 × 5 0 y r . ) of these variables as functions of Cs for p = 0. Corresponding curves for p = 0.4 are presented in Fig. 4. It is seen that the expected values of storage-related variables, with the e x cep tio n o f D~2 , are little affected by skewness of the distribution. All the ex p ected values decrease only slightly with an increase of Cs. The rate
(E (R~2) :
o
5
SIMULATION
" YEVJEVlCH (1967) iE
4 .
.
.
.
.
.
.
(D12)
. wl--__
o --,o .---eo--
~
--
.,..o
3
2, . . . . . ~ ..... ~ .
0
05'
E (0F2)
1.0 '
1.5 ' Cs
2'.0
25 '
3.0 '
Fig. 4. E x p e c t e d values of R12, R~2, D12 and D~P2 for p = 0.4.
143
0.6
•
bLL W
R~2, cov(~ ,S2x) ~)12,cov(~.S2x)
o p =o
RI~. c o v ( ~ , s 2 )
• p = o.~
o OA
o~ O~
0.2
- 0.2
- 0.4
-06 ~
-0.8
0
, 0.5
, 1.0
I 1.5
210
2.5
o
i . 3.0
Cs Fig. 5. C o r r e l a t i o n c o e f f i c i e n t s b e t w e e n t h e covariance and storage-related variables as f u n c t i o n s o f C8 and p.
of decrease is higher for larger values of Cs. Simulation results for E (R12) and E (RT2) are in perfect agreement with the theoretical expressions given by Anis and Lloyd (1953) and Solari and Anis (1957) for normal independent variables, and by Anis and Lloyd (1975) for independent skewed variables. E (D12) and E (D~2) for normal independent variables were checked with the results of Gomide (1975). Yevjevich (1967) provided an expression for the approximate expected value of the range of a first-order linear autoregressive process, which shows that E (R,) increases with p. Gomide (1975) has shown that E (Dn) and E (D~) also increase with p. A simulation study by Yevjevich (1965), on the other hand, indicates that E (R*) may be smaller for a large p than E (R*)_ for a small p when n < 25. For a 50-yr. long sample R12, R~2 and D12, which are the means of storage-related variables for that sample, fluctuate around their corresponding expected values due to chance fluctuations of Cs and p in that sample; cov(~, s ] ) for the sample also fluctuates around its population value for similar reasons. 500 values of each of the variables R12, RT2, D12 and cov(~,sx ~) that were obtained in the simulation were used to investigate the correlations between the means of storage-related variables and coy (~, sx 2). Fig. 5 shows the behavior of these correlation coefficients with the variation ofCs forp =0andp =0.4. It is seen that all the correlations are nearly zero for a normally distributed
144
1.0
p = 0.4 . . . . . . .
P=O
~o
-I.0 -0.I
C) &P
0.'I :
Fig. 6. F l u c t u a t i o n o f R12 d u e t o f l u c t u a t i o n o f p f o r p = 0 a n d p ---- 0.4.
process. They show different behavior with the increase of C,, The correlation between R12 and covariance first increases with C,, passes through a maximum, and then decreases. It is always positive for p = 0, and is also positive when C, < 2.4 for p = 0.4. The correlation coefficient between D12 and cov(~, sx2) has a similar behavior, but its values are lower than those between R 12 and- covariance. The correlation between R~2 and covariance, on the other hand, is always negative, and increases in absolute value with an increase of C,. The behavior of the correlation coefficient between R 12 and covariance can be explained as follows. The effect of AC8 (fluctuation of C, around its population value) on R12 is negligible except for very high values of Cs. Therefore, the correlation is due only to the effect of Ap (fluctuation of p around its population value). For Ap > 0 positive fluctuations occur in both/~12 and covariance (A/~12 ~ 0 and A c o v > 0), whereas for Ap < 0 negative fluctuations are obtained for both variables (AR12 < 0 and A c o v < 0 ; Figs. 6 and 7, respec.). Therefore, a positive correlation is observed between/~12 and covariance. The correlation coefficient is almost zero near C8 = 0 since covariance remains almost constant at zero in this case (Acov ~ 0). On the other hand, ACs produces a positive fluctuation in covariance, b u t some negative fluctuation in R12 at high values of C~ (Figs. 8 and 9), which has a negative contribution to the correlation ( A c o v > 0 b u t AR~2 < 0 ) . Therefore, the correlation between R12 and covariance decreases at high values of C~. The correlation coefficient between R~2 and cov(~, sx2) is always negative since Ap > 0 produces_ a positive fluctuation in covariance, but a negative fluctuation in R~2 ( A c o v > 0 but AR~2 < 0; Yevjevich, 1965), whereas for Ap < 0 a negative fluctuation is obtained in covariance b u t a positive
145
0.02 P =0.4
~x v
o>
0.01
L)
,Q
0
-0.01
- 0.02 -0.1
, 0
"0.1
~P
Fig. 7. F l u c t u a t i o n o f covariance due to f l u c t u a t i o n o f p for p = 0 and p -- 0.4.
0.15
......
Cs = 0
.......
Cs = I
- - C s = 2
R12
0
-015 -0,5
0
05
aCs/Cs
Fig. 8. F l u c t u a t i o n o f R12 due t o f l u c t u a t i o n o f C s for Cs = 0, 1 and 2.
0.04 %×
o> u
0
-0.04 -0,5
i
i
0
0.5
~Cs/Cs
Fig. 9. F l u c t u a t i o n o f covariance due t o f l u c t u a t i o n o f C s.
146 fluctuation in RT2 (Acov < 0 but AR~2 > 0). The behavior of the correlation coefficient between D12 and cov(~,sx 2) is between the other two since the effect of Ap on /)12 is weaker than that on/~12 but the effect of AC8 on D12 is stronger than that on R12. These results show that the storage variables are affected in different ways by the sampling fluctuations in C~ and p. If the fluctuations are such that for a certain sample cov(~,sx 2) is larger than its population value, then R 12 for that sample will, on the average, be overestimated, particularly for small skew, whereas RT2 will be underestimated. When cov(~,s~ 2) for a sample is less than the population value, R12 will be underestimated but R~2 will be overestimated. The influence of the fluctuations of C~ and p onDI2 can be expected to be less than those on R~2 and R~2, on the average. For dependent processes the probability of overestimating R~2 and D12 is higher than that for independent processes, and the probability of underestimating R 12 is smaller. One final c o m m e n t regarding the assumptions involved in above derivations is in order. The general case of variability of flows within the year has not been considered, when mean, variances and serial correlation coefficients are all periodic as well as when the model is more complex than an AR(1). The mathematical derivations for more general cases than those considered in this paper will certainly be more complex and tedious. This paper is only a step in studying the covariance between mean and variance as a parameter in modeling, and its implications on practical applications.
CONCLUSIONS The correlation between the subsample mean and the subsample variance increases with both P3 and p of the streamflow process. An exact expression is obtained for the covariance between the former two variables in the case of first-order linear autoregressive processes. The results of a simulation study have indicated that the coefficient of skewness estimated by using the value of P3 obtained from this expression has less bias than the conventional estimate when the value of population C8 is greater than 1. Therefore, covariance between the subsample mean and the subsample variance can be used to obtain the estimate of Cs that has less bias than that obtained by the m e t h o d of moments. For a certain sample the estimates of the covariance and of the means of variables that are related to seasonal storage show fluctuations around their corresponding expected values. It has been found that these estimates are correlated. The behavior of the correlation coefficients between the sample covariance and sample means of the storage-related variables are investigated by simulation, with the observed behavior explained. It is concluded that the mean maximum deficit is the variable that is least affected by the sampling fluctuations in Cs and p, and that the mean range and mean
147
adjusted range are affected in opposite directions such that the mean range will usually be underestimated (especially when 0.5 < Cs < 2.0) whereas the mean adjusted range will usually be overestimated. Further research is needed to investigate the properties of cov(~,s= 2) under more general conditions of periodicity in autocorrelation and more complex models than AR(1).
APPENDIX A
Covariance between the subsample mean and the subsample variance is defined as:
The first expectation on the right-hand side of eq. A-1 can be written as:
i:l
i=l
+ E
m-5 mE(x/3) + E
=
~',
j=l
•
I=1
xix~+ z
~ x~x~_~
(a-2)
J=2 l = l
For the first-order linear autoregressive process of eq. 1 :
E(xix'/+t) = E ( x ] ) p 2z = p3p 2z
and
E(xix}_z ) = E(xja) pZ = #3 pZ
(A-3)
substituting these into eq. A-2:
i=1
rn 2 m +
i=1
~
j=l
}-"
/=1
p2t+ ~
j=2
~
l=l
pl
(A-4)
After evaluating the double summations, eq. A-4 becomes:
E
xi/m i=1
x'~/rn 1
p + (rn - - 1 ) 1 - - p
.
P3 . . . p2 m2 m + (m -- 1) 1 - - p 2
p2 p2 _ p2m (1 __p2)2
_p_pm] P (1 _ p ) 2 j
(A-5)
The second expectation on the right-hand side of eq. A-1 can be evaluated as follows:
148
E
xi/m
--
E
, m J-1
xi a
i=l
'~
+ 3E
\ i=1 l=l
[ rn-2s m-j-Is m-j ~ j=l
'j--2 I:1
u=l v=u+l
(A-6)
for m ~ 3 For the first-order linear autoregressive process: E(xiXj+uXj+o ) = E(x 3) pU+O = #3pU+V
(A-7)
Substituting eq. A-7 into eq. A-6 and evaluating summations: E
I( i=l
xi/m
= -~
l + 3 (m - - 1 )
I _p2
p + 3 (m -- 1)
(1 _p)2
p3
+ - (m -- 2) 1 --p 1 _p2
6
p4 __p2m]
1--p
(1---p2-~]
+___p2
(1 --
p __pm 3p
1 --p
6
2_ 2m - p2)2
3P2
__
6
1 --p
p2
p2 __pro (1 _ p ) 2 (A-8)
form >3
Subtracting eq. A-8 from eq. A-5, an expression is obtained for the covariance between the subsample mean and the subsample variance of the first-order linear autoregressive process: /[/3 I 3 p(1 + 2p) c°v(E's~2) = m-2 t (m - - 1 ) + (m - - 1 ) ( 1 - - - m - ) (1 - - p 2 )
3 p2(1+2p+2p 2) 3 [ pm+l - (1 - - ) p2)2 + (1 - - ) [ )2 m
(1 --
p2(m+ 1) ]
6
~ _~p~'2j - - ( m - - 2 ) - + -m | (1 _ p ) 3
m
(1 --p
p3
rn (1--p)(1 _ p 2 )
= (1-p)(1-p~)
~
for m ~> 3
(A-9)
APPENDIX B
An approximate expression for the bias in estimation of the covariance between the subsample mean and the subsample variance is derived where for the sake of simplicity the subsample variance is c o m p u t e d with respect to
149
the population mean (E(x) = 0) instead of the subsample mean. In this case the covariance is defined as: cov(~,s~2) =
E [(~=lx~/rnl (~=~x~/m) l = E(~jsj 2)
(B-l)
The following expression is used to estimate the covariance from a sample of size N: cOv(2,s~ 2) = E
~
(2j--x)(si 2 --~)
(B-2)
i=1
It can be shown that the right-hand side of eq. B-2 can be rearranged such that: 1
cOv(x, sx2) =
N
N
E(xisi2)---~--d X ~.
E(~sk 2)
(B-3)
i=1 k = l
For the first-order linear autoregressive process the last term of eq. B-3 becomes:
1 N N 1 1 #3 [p(1--pm)[ y:,~ .=~SE(xj'sk 2) = N - - E ( ~ s ~ ) + N 2 m 2 ( ~ Z p ~ IN--
N2
p m _ pNm] l - - p - ~ j+
p=:_ ___p2N~l 1 1 _ p 2 m !J
p=(l_p=.,) [ ..... (1 _ p 2 ) 2 N - - 1
which, for large m, reduces to: 1 N N 1
N--lp3
g---~ ~ ~ E(~jSk2) = -- E(~isj 2) + 1:I ~=I N N:
(B-4)
P +3p 2 +p3
m2
(B-5)
(l--p:):
Substituting eq. B-5 into eq. B-3: cOv(x, sx2) =
1) 1-- N
N--1 P3 p+3p2 +p3 E(~isi2)+ N ~ rn2 ( l _ p 2 ) 2
(B-6)
Finally, an expression for bias is obtained considering that c o v ( ~ , s x : ) = eq. B-l: 1
E (~isi2) as given by (bias)c6v(~, sx 2)
=
c 0 v (~, sx 2 ) - - c o y (~, s x 2 ) -
--cov(Z,
sx2)
N +
N--1 N2
tt a p + 3 p 2 + p 3 rn 2 ( l - - p : ) :
(B-7)
Last term in eq. B-7 can be neglected for all practical purposes: (bias)c6v(~, sx') ~---
-
1 -N
coy
(~, sx 2)
(B-8)
150
It is seen that the a ppr oxi m a t e bias correction for the covariance is the same as that of the variance. Eq. B-8, which is a p p r o x i m a t e for d e p e n d e n t processes, can be shown to be exact for i ndependent processes.
REFERENCES Anis, A.A. and Lloyd, E.H., 1953. On the range of partial sums of a finite number of independent normal variates. Biometrika, 40: 35--42. Anis, A.A. and Lloyd, E.H., 1975. Skew inputs and the Hurst effect. J. Hydrol., 26: 39--53. Gomide, F.L.S., 1975. Range and deficit analysis using Markov chains. Colorado State University, Fort Collins, Colo., Hydrol. Pap. No. 79. Solari, M.E. and Anis, A.A., 1957. The mean and variance of the maximum of the adjusted partial sums of a finite number of independent normal variates. Ann. Math. Stat., 28: 706--716. Wallis, J.R., Matalas, N.E. and Slack, J.R., 1974. Just a moment! Water Resour. Res., 10: 211--219. Yevjevich, V., 1965. The application of surplus, deficit and range in hydrology. Colorado State University, Fort Collins, Colo., Hydrol. Pap. No. 10. Yevjevich, V., 1967. Mean range and linearly dependent annual variables with applications to storage problems. Water Resour. Res., 3: 663---671. Yevjevich, V. and Obeysekera, J.T.B., 1984. Estimation of skewness of hydrologic variables. Water Resour. Res., 20 (7): 935--943.