Journal of Hydrology (2008) 355, 16–33
available at www.sciencedirect.com
journal homepage: www.elsevier.com/locate/jhydrol
On the tails of extreme event distributions in hydrology
S. El Adlouni *, B. Bobée, T.B.M.J. Ouarda
Hydro-Quebec/NSERC Chair in Statistical Hydrology, Canada Research Chair on the Estimation of Hydrological Variables, University of Quebec, INRS-ETE, 490, de la Couronne, QC, Canada G1K 9A9
Received 28 June 2007; received in revised form 21 January 2008; accepted 15 February 2008
KEYWORDS Heavy tail; Extreme values theory; Return period events; Classes of distributions; Asymptotic behaviour; Quantiles
Summary
The study of the tail behaviour of extreme event distributions is important in several fields such as hydrology, finance, and telecommunications. Based on two classifications and five graphical criteria, this paper presents a practical procedure to select the class of distributions that provides the best fit to a dataset, especially for the right tail (large extreme events). Numerical illustrations show that almost all graphical tools discriminate between the regularly varying class and the sub-exponential class of distributions and lead to coherent conclusions. © 2008 Elsevier B.V. All rights reserved.
* Corresponding author. Tel.: +1 418 654 2639; fax: +1 418 654 2600. E-mail address: [email protected] (S. El Adlouni).
0022-1694/$ - see front matter © 2008 Elsevier B.V. All rights reserved. doi:10.1016/j.jhydrol.2008.02.011
Introduction
Most extreme hydrological events cause severe human and material damage. Flood Frequency Analysis (FFA) is of particular interest for the design and management of hydraulic structures. The principal objective of FFA is to obtain robust estimates of extreme quantiles, and information concerning the probability that a given event can be exceeded. This procedure is related to extreme value theory (EVT), which is often derived from asymptotic properties according to the Fisher–Tippett theorem (Fisher and Tippett, 1928). Conventional estimates of flood exceedance quantiles are highly dependent on the form of the underlying flood frequency distribution, especially on the form of the right tail, which is the most difficult to estimate from observed data.
Extreme event modelling is the central issue in EVT. The main purpose of the theory is to provide asymptotic models with which one can model the right tail of a distribution. EVT was pioneered by the work of Fisher and Tippett (1928), followed by Gnedenko's (1943) extreme value theorem, presented by Gumbel (1958). More recently, Balkema and De Haan (1974) and Pickands (1975) presented the basic foundation for threshold-based extreme value methods. Applications of the theory have since spread to different areas, such as hydrology and climatology. Recently, we have observed an increasing interest in EVT in such fields as finance and insurance. Presentations emphasizing these applications can be found in Embrechts et al. (2003) and Reiss and Thomas (2001).
Note that the assumptions required for the application of the EVT theorems (as presented in the section ‘Classes of
heavy tailed distributions’) are generally not valid in hydrology. Hydrological series are the results of complex processes and therefore do not necessarily fulfil the basic requirements of EVT. Indeed, in the case of a flood dataset, daily stream flow data are statistically auto-correlated (Kidson and Richards, 2005).
Several standard frequency distributions have been extensively studied in the statistical analysis of hydrologic data. The physical processes that generate extreme events are rarely considered in the choice of the model (Kidson et al., 2005; Singh and Strupczewski, 2002). The selection of the most appropriate distribution of annual maximum or peaks over threshold series has received widespread attention. There are no rigid rules governing which type of distribution is most appropriate for a particular case, and a variety of probability distributions are commonly used as frequency–magnitude distributions in hydrology. Brooks and Carruthers (1953) indicated that the Gumbel distribution (Gumbel, 1942), which is commonly used in flood frequency prediction, tends to underestimate the magnitude of the rarest rainfall events. Bernier (1959) suggested the Log-Gumbel distribution for hydrological series. The Log-Gumbel distribution is also called the Fréchet distribution (Fréchet, 1927), which is a special case of the generalized extreme value (GEV) distribution.
In order to select an adequate distribution, empirical comparisons are commonly used for a given region. A comparison of the Lognormal, gamma, Gumbel, Log-Gumbel, Hazen, and Log-Pearson type 3 fits was given by Benson (1968) for 10 USA stream flow stations. Based on this study, the USA adopted a uniform approach to flood frequency estimation, which consists in fitting the annual peak discharges to the Log-Pearson type 3 (LP3) distribution (US Water Resources Council, 1981).
Although many countries, such as Australia, have adopted the LP3 distribution, others have selected different distributions, such as the Generalized Logistic distribution in the United Kingdom and the Lognormal (LN) in China (Bobée, 1999; Robson and Reed, 1999). Discussions and reviews of the application of these and other statistical distributions to flood frequency analysis are given in Stedinger et al. (1993), Bobée and Rasmussen (1995) and Rao and Hamed (2000). A comparison study of some distributions, given by Koutsoyiannis (2004), shows that the Fréchet distribution is more adequate to represent extreme rainfall series.
Malamud and Turcotte (2006) showed that the most commonly used frequency–magnitude distributions in hydrology can be divided into four groups: the normal family (normal, Lognormal), the general extreme value (GEV) family (GEV, Gumbel, Fréchet, reverse Weibull), the Pearson type 3 family (Gamma, Pearson type 3, Log-Pearson type 3), and the Generalized Pareto distribution. These groupings were deduced from the historical development of the distributions that compose them. However, no discussion of the tail behaviour of these classes was given in their paper.
In practice, almost all these models are fitted to data and compared using conventional goodness-of-fit tests. However, these procedures essentially test the adequacy of the model in the central range of the sample, whereas adequacy should be tested in the extreme range of observations in order to estimate high return period events.
In the present paper, a new approach is proposed for FFA. It is based on the characterisation of the best class of
distributions that will give a precise estimation of extreme events. The proposed approach includes two principal elements. The first one is related to the classification of distributions with respect to their tails, and the second one provides criteria to select the class of distributions which adequately represents the studied sample. The most widely used criteria for class identification are presented to discriminate between the commonly used distributions with respect to their right tail. All these criteria are presented and illustrated using simulated and observed peak flow data series.
The paper is organized as follows. A review of distribution classes using the theoretical formulation of the extreme value theory is provided in the section ‘Classes of heavy tailed distributions’. Section ‘Classification of heavy tailed distributions based on asymptotic behaviour’ presents the classes of the distributions commonly used in hydrology. A short discussion of the mechanisms to generate Lognormal and power-law distributions is given in the section ‘Mechanisms for generating Lognormal and power-law distributions’. In the section ‘Standard methods of tail discrimination’, a short review of graphical tools to discriminate between classes is presented. The use of these techniques is illustrated for some distributions (presented in the Appendix) in the section ‘Application’. Section ‘Conclusions’ concludes the paper.
Classes of heavy tailed distributions
Several definitions can be attributed to heavy tailed distributions, also called fat-tailed, thick-tailed or long-tailed. An often used definition of heavy tailed distributions is based on the fourth central moment. If X is a random variable and $\mu_X$ and $\sigma_X$ are, respectively, the mean and the standard deviation of X, then the distribution of X is called heavy tailed if
$$C_k = E\left[\left(\frac{X - \mu_X}{\sigma_X}\right)^4\right] > 3 \qquad (1)$$
$(C_k - 3)$ is known as the excess kurtosis because the standardized fourth central moment (kurtosis) of the Normal distribution is 3. However, in hydrology, the term heavy tailed is used for distributions which have a restriction with respect to moment existence. The difference between the Normal distribution and the power tail was illustrated by Hubert and Bendjoudi (1996), who showed that the event of return period T = 1000 years estimated by the Normal distribution corresponds to T = 100 years when the power type tail distribution is the most adequate to fit precipitation data series.
Fig. 1 shows that heavy tailed distribution quantiles are larger than those of the Normal distribution even if they have the same statistical characteristics (mean and variance), especially for a large non-exceedance probability. This comparison is illustrated using the Halphen type Inverse B (HIB, Appendix B) distribution (with parameters ν = 3, α = 3.2 and m = 90), which is heavy tailed, and the Normal distribution with the same mean (E[X] = 35) and variance (Var[X] = 80). Even if these distributions do not have the same parameter space dimension, this comparison illustrates the difference in their tail behaviour. Indeed, the parameter space dimension affects mainly the variance of the estimated quantiles and not their tail behaviour.
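The kurtosis criterion of Eq. (1) can be checked on a sample with a few lines of code. The following is a minimal sketch, not part of the original study: the function names `kurtosis_coefficient` and `is_heavy_tailed_by_kurtosis` are hypothetical, and the moments are computed with the 1/n (population) convention, which is an assumption.

```python
def kurtosis_coefficient(sample):
    """Sample analogue of C_k in Eq. (1): mean of ((x - mean) / sd)**4."""
    n = len(sample)
    mean = sum(sample) / n
    # 1/n variance (assumed convention), so C_k follows the moment definition.
    var = sum((x - mean) ** 2 for x in sample) / n
    if var == 0.0:
        raise ValueError("C_k is undefined for a constant sample")
    fourth = sum((x - mean) ** 4 for x in sample) / n
    return fourth / var ** 2


def is_heavy_tailed_by_kurtosis(sample):
    """True when the sample kurtosis exceeds the Normal value of 3."""
    return kurtosis_coefficient(sample) > 3.0
```

As the text notes, this criterion is only meaningful when the fourth moment exists; for samples from distributions with an infinite fourth moment the sample value does not stabilise as the sample grows.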
Figure 1 Illustration of the difference between a heavy tailed distribution (Halphen type B⁻¹) and the Normal distribution: (a) probability density function and (b) quantile-return period plot.
Note that the classification based on Eq. (1) is vague and cannot be used for distributions with an infinite fourth moment. A tail ranking can be obtained for particular classes of distributions. In the following, five classes given in Werner and Upper (2002) are discussed, and more details are presented in Appendix A. These classes of distributions are nested ($A \subset B \subset C \subset D \subset E$) and can be presented as in Fig. 2:
E: distributions with non-existence of exponential moments
D: sub-exponential distributions
C: regularly varying distributions
B: Pareto-type tail distributions
A: α-stable (non-normal) distributions
Figure 2 Different nested classes of heavy tailed distributions (from Werner and Upper, 2002).
Classes C and B are very important considering their connection to classical extreme value theory. A main topic of EVT is the modelling of the fluctuation of sample maxima. Let $Y_1, Y_2, \ldots, Y_p$ be a sequence of independent and identically distributed random variables; the sample maximum X is then defined as $X = \max(Y_1, Y_2, \ldots, Y_p)$. For example, for the maximum annual flow, Y is the daily stream flow and p = 365. The Fisher–Tippett theorem (Fisher and Tippett, 1928) states that if the properly normalised sample maximum converges to a non-degenerate distribution, then it belongs to one of the following three distributions (cumulative distribution function, cdf):
$$\text{Gumbel (EV1):} \quad \Lambda(x) = \exp(-e^{-x}), \quad x \in \mathbb{R} \qquad (2)$$
$$\text{Fréchet (EV2):} \quad \Phi_\xi(x) = \begin{cases} 0 & x \le 0 \\ \exp(-x^{-\xi}) & x > 0 \end{cases}, \quad \xi > 0 \qquad (3)$$
$$\text{reverse Weibull (EV3):} \quad \Psi_\xi(x) = \begin{cases} \exp(-(-x)^{-\xi}) & x \le 0 \\ 1 & x > 0 \end{cases}, \quad \xi < 0 \qquad (4)$$
In this case, the distribution of the original variable Y belongs to the Max-Domain of Attraction (MDA) of $\Lambda$, $\Phi_\xi$ or $\Psi_\xi$. Von Mises (1954) proposed a generalized extreme value (GEV) distribution, which includes the three limit distributions. Another re-parameterization given by Jenkinson (1955), with κ = −ξ as presented in Appendix B.2, is generally used in hydrology.
Based on the previous theorem, distributions can be classified into three categories. The first one corresponds to the MDA of $\Lambda$ and contains distributions with medium tails. The second one is the class of heavy tailed distributions, whose extremes follow the EV2 distribution (the MDA of $\Phi_\xi$). The third one is the class of short-tailed distributions with finite end-point and contains the distributions of the MDA of $\Psi_\xi$.
The class of distributions that belong to the MDA of $\Phi_\xi$ (Fréchet distribution) is very important when studying heavy tailed behaviour. Indeed, distributions of this class have a regularly varying tail, i.e. they belong to the class C. Note that the heaviness of the tail decreases as the tail index ξ (given in Eq. (3)) increases.
Applications of the EVT in hydrology are restricted by the length of the dataset and by the serial correlation structure. Indeed, in the case of a flood dataset, daily flow values $Y_i$ are not statistically independent and the length of the data series (p = 365) is not large enough to apply asymptotic results. Thus, rigorously, the use of the GEV distribution has no strong theoretical basis to represent extreme value datasets. However, it can be considered as a candidate to fit extremes. This justifies the approach used in FFA based on the fit of several probability distributions and the use of criteria to select the best fit. The classification presented in this
paper has the objective of discriminating between distributions with respect to their right tail and will be helpful for best fit selection.
Class B contains mainly the Generalized Pareto distribution (GPD), which generalizes the Pareto law and is commonly used in the peaks over threshold (POT) approach. Indeed, it has been shown that the GPD arises as the limiting distribution of the excesses $X - x_0$ of a random variable X over a high threshold $x_0$ (Pickands, 1975). The GPD exhibits a unique threshold stability property, i.e. if X follows the GPD then the conditional distribution of excesses also follows the GPD with the same shape parameter. Another extreme value theory result shows that the distribution of maximum excesses follows the GEV distribution when the exceedances over the threshold are generated from a Poisson process. However, these mathematical properties of the POT approach are asymptotic and require the same hypotheses (independence and homogeneity) as in the case of the maximum.
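The threshold stability property of the GPD mentioned above can be verified numerically. The sketch below is illustrative only: it assumes the common parameterisation P(X > x) = (1 + shape·x/scale)^(−1/shape) with shape > 0, which is not necessarily the parameterisation used in the paper's appendix, and the function names are hypothetical.

```python
def gpd_survival(x, scale, shape):
    """P(X > x) for a Generalized Pareto law, x >= 0, shape > 0 (assumed form)."""
    return (1.0 + shape * x / scale) ** (-1.0 / shape)


def excess_survival(y, threshold, scale, shape):
    """P(X - threshold > y | X > threshold), straight from the definition."""
    return (gpd_survival(threshold + y, scale, shape)
            / gpd_survival(threshold, scale, shape))


def excess_scale(threshold, scale, shape):
    """Scale of the excess distribution: the shape is unchanged, only the scale moves."""
    return scale + shape * threshold
```

Under this parameterisation, `excess_survival(y, u, scale, shape)` coincides with `gpd_survival(y, excess_scale(u, scale, shape), shape)` for every y ≥ 0, which is the stability property in computational form.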
Classification of heavy tailed distributions based on asymptotic behaviour
Ouarda et al. (1994) presented a classification of distributions commonly used in hydrology based on the asymptotic behaviour of the probability density function (pdf). For a given return period T, large enough, the following equivalence holds (Gumbel, 1958; Ouarda et al., 1994):
$$T \underset{x_T \to \infty}{\simeq} \left[\frac{1}{f(x_T)}\right]' = -\frac{f'(x_T)}{f^2(x_T)} \qquad (5)$$
Indeed, for a large value of T, the probability of exceedance function $\bar{F}(x_T) = 1 - F(x_T) = 1/T$ and the pdf $f(x_T)$ converge to zero. L'Hôpital's rule states that, for large T:
$$\lim_{x_T \to \infty} \frac{f(x_T)}{1 - F(x_T)} \simeq \lim_{x_T \to \infty} -\frac{f'(x_T)}{f(x_T)} \qquad (6)$$
$$\Rightarrow \quad T = \frac{1}{1 - F(x_T)} \underset{x_T \to \infty}{\simeq} -\frac{f'(x_T)}{f^2(x_T)} = \left[\frac{1}{f(x_T)}\right]' \qquad (7)$$
Using this equivalence, the asymptotic behaviour of distributions commonly used in hydrology is presented in Table 1 (Ouarda et al., 1994). As presented in Table 1, four groups are deduced from this classification. The first one (class I) contains distributions for which the tail is a power function of the return period. The second class (class II) contains the Lognormal distribution. As shown below, the tail behaviour of the Lognormal distribution is very similar to that of the power-law. This similarity will be discussed in more detail in the remainder of the document. The tail of the distributions in class III is a power function of the logarithm of the return period, and almost all these distributions belong to the class D of sub-exponential distributions. The last class (class IV) contains distributions with an upper bounded support.
Table 1 Asymptotic behaviour classification of commonly used distributions in hydrology (from Ouarda et al., 1994)
Class | Characteristics | Distribution | Parameters
I: x_T ∼ T^P | Class C, P = 1/α | Log-Pearson 3 (α, λ, m) | α > 0; λ > 0; m ∈ ℝ
 | Class C, P = 1/α | Log-logistic (α, k) | α > 0; k > 0
 | Class C, P = 1/(sλ) | Generalized gamma (s, α, λ) | s < 0; α > 0; λ > 0
 | Class C, P = 1/λ | Inverse gamma (α, λ) | α > 0; λ > 0
 | Class C, P = 1/k | Fréchet (α, k, u) | α > 0; k > 0; u ∈ ℝ
 | Class B, P = 1/k | Generalized Pareto (α, k) | α > 0; k < 0
 | Class C, P = 1/(2ν) | Halphen type B⁻¹ (α, ν, m) | α ∈ ℝ; ν > 0; m > 0
II: x_T ∼ exp[(ln T)^{1/2}] | – | Lognormal 2 (μ, σ) | μ ∈ ℝ; σ > 0
 | – | Lognormal 3 (μ, σ, m) | μ ∈ ℝ; σ > 0; m ∈ ℝ
III: x_T ∼ (ln T)^P | Class D, P = 1 | Pearson type 3 (α, λ, m) | α > 0; λ > 0; m ∈ ℝ
 | Class D, P = 1 | Gamma (α, λ) | α > 0; λ > 0
 | Class E, P = 1 | Exponential (α, m) | α > 0; m ∈ ℝ
 | Class D, P = 1 | Halphen type A (α, ν, m) | α > 0; ν ∈ ℝ; m > 0
 | Class D, P = 1 | Log-F (k, b) | k > 0; b > 0
 | Class D, P = 1 | Gumbel (α, u) | α > 0; u ∈ ℝ
 | Class D, P = 1/2 | Halphen type B (α, ν, m) | α ∈ ℝ; ν > 0; m > 0
 | Class D, P = 1/s | Generalized gamma (s, α, λ) | s > 0; α > 0; λ > 0
IV: x ≤ P (upper bounded support) | P = m | Pearson 3 (α, λ, m) | α < 0; λ > 0; m ∈ ℝ
 | P = 0 | Gamma (α, λ) | α < 0; λ > 0
 | P = exp(m/ln_a(e)) | Log-Pearson 3 (α, λ, m) | α < 0; λ > 0; m ∈ ℝ
 | P = 1 | Log-logistic (α, k) | α < 0; k > 0
 | P = u + α/k | reverse Weibull 3 (α, k, u) | α > 0; k < 0; u ∈ ℝ
 | P = α/k | Generalized Pareto (α, k) | α > 0; k > 0
The classification given by Ouarda et al. (1994) is equivalent to that given in the previous section (as presented in
the first column of Table 1). Note that the distributions belonging to the first class (class I, Table 1) have a power-law tail and correspond generally to the class C of regularly varying distributions (this class contains the Generalized Pareto with α > 0 and k < 0, cf. Fig. 2). However, for special parameter values, these distributions of the class C belong to the sub-exponential distribution class (class III, Table 1).
Given that all these classes are nested, the class of distributions that belong to the class C2 and not to the class C1, such that $C_1 \subset C_2$, will be denoted C2\C1. For example, the class of sub-exponential distributions that are not regularly varying is denoted D\C. By combining these two classifications, distributions commonly used in hydrology can be ordered with respect to their tails. Note that almost all these distributions are available in the software HYFRAN (CHS, 2002). Fig. 3 presents sub-exponential, regularly varying and stable distributions (upper squares) ordered from light-tailed (on the left) to heavy tailed (on the right) distributions, and the limiting cases (lower squares) represented by distributions at the limits of classes.
This classification emphasizes the need to develop techniques to discriminate between the class C of regularly varying distributions and the class D\C of the other sub-exponential distributions, and especially between the class C (asymptotically power type tail) and the Lognormal distribution. Indeed, the generation processes of the Lognormal and power-law distributions are often closely connected. This phenomenon has been studied in social sciences, biology and computer sciences. A useful discussion can be found in Reed (2002), Reed and Hughes (2002) and Mitzenmacher (2004). It turns out that many seemingly small effects in the generative process of the Lognormal distribution can lead to a power-law tail instead (Champernowne, 1953; Mandelbrot, 1997, 2003; Turcotte, 1997).
Figure 3 Distributions commonly used in hydrology, classed with respect to their tail behaviour (*distributions of a given class; **limiting case). Diagram labels, from light tailed to heavy tailed: Class E\D; Class D\C (Gumbel, Halphen A, Gamma, Halphen B, Pearson type 3); Class C\B (Fréchet, Halphen B⁻¹, Inverse Gamma, Log-Pearson type 3); Class B\A; Class A (Stable Distributions). Limiting cases: Normal, Exponential, Lognormal, Pareto.
Mechanisms for generating Lognormal and power-law distributions
Singh and Strupczewski (2002) noted that great strides have been made in hydrologic data collection and in the conceptualization of the processes surrounding floods. However, hydrologic processes are not considered in frequency analysis or in model selection. In this section, we look at possible candidate mechanisms by which Lognormal and power-law distributions might arise in hydrological processes or, more generally, in natural systems. Indeed, the study of physical mechanisms that generate such processes provides additional information (Yevjevich, 1987) and may be useful for model selection.
Lognormal distribution
The Lognormal distribution, also called the Gibrat–Gauss distribution (Morlat, 1957), is generated by processes that follow what the economist Gibrat called the law of proportionate effect (Gibrat, 1930). The term multiplicative process is also used to describe the underlying model (Mitzenmacher, 2004). Such processes are used to describe many physical mechanisms. Suppose we start with a state $X_0$ and, at each step j, the state may grow or shrink according to a random variable $F_j$, so that $X_j = F_j X_{j-1}$. The idea is that the random development of a phenomenon is expressed as a proportion of its current state. If $X_0$ and the $F_k$, $1 \le k \le j$, are all governed by independent Lognormal distributions, then so is each $X_j$, inductively, since the product of independent Lognormal variables is again Lognormal. More generally, approximately Lognormal distributions may be obtained even if the $F_j$ are not themselves Lognormal. Specifically, consider $\ln X_j = \ln X_0 + \sum_{k=1}^{j} \ln F_k$. Assuming the random variables $\ln F_k$ are independent and identically distributed with finite mean and variance, the Central Limit Theorem (CLT) says that $\sum_{k=1}^{j} \ln F_k$, suitably normalised, converges to a Normal distribution, and hence, for sufficiently large j, $X_j$ is well approximated by a Lognormal distribution.
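The law of proportionate effect described above can be illustrated by simulation. The sketch below is a hedged example rather than the authors' procedure: the choice of Uniform(0.5, 1.5) growth factors, the seed, the sample sizes and the function names are all assumptions made for illustration. It compares the skewness of $\ln X_j$ (expected to be near zero if $X_j$ is approximately Lognormal) with the skewness of $X_j$ itself.

```python
import math
import random


def simulate_log_states(n_paths, n_steps, rng, x0=1.0, low=0.5, high=1.5):
    """Return ln(X_j) after n_steps of X_j = F_j * X_{j-1}, F_j ~ Uniform(low, high)."""
    logs = []
    for _ in range(n_paths):
        log_x = math.log(x0)
        for _ in range(n_steps):
            # Multiplying states is the same as adding their logarithms.
            log_x += math.log(rng.uniform(low, high))
        logs.append(log_x)
    return logs


def sample_skewness(values):
    """Moment-based skewness (1/n convention)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


rng = random.Random(2008)  # fixed seed, for reproducibility of the illustration only
log_states = simulate_log_states(n_paths=2000, n_steps=100, rng=rng)
skew_log = sample_skewness(log_states)                         # near 0: ln X_j is close to Normal
skew_raw = sample_skewness([math.exp(v) for v in log_states])  # strongly positive
```

The growth factors here are deliberately not Lognormal, which is the more general case covered by the CLT argument.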
Power-law distributions
Several measured quantities in physical, biological and social systems have been proposed to follow power laws. The fact that power laws are used in several fields has led many scientists to wonder whether there is a single underlying mechanism linking all these different systems together. Several candidates for such mechanisms have been proposed, like "self-organized criticality" and "highly optimized tolerance". However, as presented by Newman (2005), there are many mechanisms for producing power laws, and different ones are applicable to different cases. Some of them are quite complex; we present here some simple algebraic methods of generating power laws.
A much more common mechanism is the combination of exponential distributions. Let Y be a random variable that
has an exponential distribution: $f_Y(y) \sim e^{ay}$, with a < 0. Suppose that we are interested in an exponential transformation, $X = e^{bY}$, of Y, where b is a constant. The probability density function of X is
$$f_X(x) = f_Y(\ln(x)/b)\,\frac{dy}{dx} \sim \frac{e^{ay}}{b\,e^{by}} = \frac{x^{-1+a/b}}{b} \qquad (8)$$
which is a power-law with tail index α = −a/b.
Another mechanism to generate power laws is related to the properties of random walks. Consider a random walk in one dimension, in which the present state of the process is equal to the last state plus or minus one (simple case). Suppose that the starting position is zero. We are usually interested in the probability of returning to the initial position for the first time at time t. This is the so-called first return time of the random walk. It can be shown (Mitzenmacher, 2004) that the distribution of return times follows a power-law with exponent α = 3/2.
Completely different mechanisms for generating a power-law, which have received much attention, are critical phenomena and self-organized criticality. A complete review of all these mechanisms is presented with illustrative examples by Newman (2005).
In this section, we discussed some mechanisms that may generate power-law distributions. However, a scientist who is confronted with a new set of data having a highly skewed distribution should certainly bear in mind that a power-law model is only one of several possibilities for fitting it. It is thus important to use criteria to select the class of distributions that may be the most adequate to represent the data.
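The exponential-transformation mechanism of Eq. (8) can be checked by simulation. In the sketch below, which is only an illustration, `rate` plays the role of −a (so that Y has density proportional to e^{−rate·y}), and the survival function of X = e^{bY} is then exactly x^{−rate/b} for x ≥ 1. The function names, the seed and the use of the Pareto maximum-likelihood estimator with known lower bound are assumptions of this example.

```python
import math
import random


def simulate_power_law(n, rate, b, rng):
    """Draw X = exp(b * Y) with Y ~ Exponential(rate); P(X > x) = x**(-rate / b), x >= 1."""
    return [math.exp(b * rng.expovariate(rate)) for _ in range(n)]


def pareto_tail_index_mle(sample, x_min=1.0):
    """Maximum-likelihood tail index of a Pareto law with known lower bound x_min."""
    logs = [math.log(x / x_min) for x in sample]
    return len(logs) / sum(logs)


rng = random.Random(42)  # fixed seed, for reproducibility of the illustration only
sample = simulate_power_law(n=20000, rate=3.0, b=1.5, rng=rng)
alpha_hat = pareto_tail_index_mle(sample)  # expected to be close to rate / b = 2
```

With these values the estimated tail index should be close to −a/b = 2, in line with the statement following Eq. (8).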
Standard methods of tail discrimination
This section presents the most widely used methods for model selection. Note that the methods presented in this section are based on graphical criteria. A detailed investigation of these methods and their properties can be found in Embrechts et al. (2003), Reiss and Thomas (2001) and Beirlant et al. (2004). In the following sections, the most important properties of these methods are reviewed. The Cunnane plotting position formula $p_k = \frac{k - a}{n + 1 - 2a}$ with a = 0.4 is used for all numerical illustrations presented in this paper.
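For reference, the Cunnane plotting positions can be computed as in the following sketch. It assumes that k is the rank of an observation in the sample sorted in ascending order, so that $p_k$ is read as an empirical non-exceedance probability and the empirical return period is $T = 1/(1 - p_k)$; this reading and the function names are assumptions of the example.

```python
def cunnane_positions(n, a=0.4):
    """Plotting positions p_k = (k - a) / (n + 1 - 2a) for ranks k = 1..n."""
    return [(k - a) / (n + 1 - 2 * a) for k in range(1, n + 1)]


def empirical_return_periods(sample, a=0.4):
    """Pair each observation (ascending order) with T = 1 / (1 - p_k)."""
    ordered = sorted(sample)
    positions = cunnane_positions(len(ordered), a)
    return [(x, 1.0 / (1.0 - p)) for x, p in zip(ordered, positions)]
```

For example, with n = 10 the smallest observation receives $p_1 = 0.6/10.2$ and the largest $p_{10} = 9.6/10.2$.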
Log–log plot (Beirlant et al., 2004)
The log–log plot, also called the tail probability plot, is widely used in the literature (especially in finance) to study tail behaviour. The log–log plot is based on the fact that for an exponential tail with mean θ, $\bar{F}(u) = P(X > u) = e^{-u/\theta}$, and for a power-law tail with tail index α > 1, $\bar{F}$ is equivalent (for large quantiles) to:
$$\bar{F}(u) = P(X > u) \sim C \int_u^{\infty} x^{-\alpha}\,dx = \frac{C}{\alpha - 1}\,u^{-\alpha + 1} \qquad (9)$$
Therefore, taking the logarithm, we have for exponential type distributions
$$\log[P(X > u)] \sim -u/\theta \qquad (10)$$
and for power-law distributions:
$$\log[P(X > u)] \sim \log\left(\frac{C}{\alpha - 1}\right) - (\alpha - 1)\log(u) \qquad (11)$$
This suggests that, for the log–log plot, the tail probability is represented by a straight line for power-law (or regularly varying) distributions (class C) but not for the other sub-exponential or exponential distributions (class D\C or E\D; this notation is defined in the section ‘Classification of heavy tailed distributions based on asymptotic behaviour’). This is illustrated in Fig. 4 for two distributions belonging to the classes C and D\C. As illustrated in this figure, the plot can be used to discriminate between the distributions that belong to the class C and the sub-exponential distributions that are not regularly varying, i.e. distributions of the class D\C.
Figure 4 Log–log plot for the regularly varying class of distributions.
The mean excess function based method (Embrechts et al., 2003)
The mean excess function method is based on the function $e(u) = E[X - u \mid X > u]$. This function is constant for exponential tail distributions ($e(u) = \theta$), while for a power-law with tail index α (α > 2): $e(u) = \frac{u}{\alpha - 2}$. This suggests that when plotting the empirical value of e(u) against u:
– If the plot is linear and the slope is equal to zero, it suggests an exponential type.
– If the plot is linear, the slope is greater than zero and the intercept is zero, then it suggests a sub-exponential distribution.
This plot makes it possible to discriminate between the distributions that belong to the class D and those of the class E\D. The forms of this plot are illustrated for exponential and sub-exponential distributions in Fig. 5. In this figure, k is the position of an observation in the ordered sample.
Figure 5 Mean excess function for exponential and sub-exponential distributions (panels: exponential distribution and sub-exponential distribution; horizontal axis: k; vertical axis: E(X−u|X>u)).
The max–sum ratio plot (Beirlant et al., 2004)
The idea is based on the fact that, for a stationary series $X_1, \ldots, X_n$:
$$R_n(p) = \frac{\max(X_1^p, \ldots, X_n^p)}{\sum_{i=1}^{n} X_i^p} \underset{n \to \infty}{\longrightarrow} 0 \quad \text{if and only if} \quad E[X^p] < \infty.$$
In practice, we plot $R_n(p)$ against n for various values of p (p = 1, 2, 3, ...) and find that $R_n(p)$ jumps up (does not converge to zero) for all p exceeding $p_0$. This indicates that X has a power-law type tail with tail index α = $p_0$ (as shown in the section ‘Classes of heavy tailed distributions’, when defining the class B, only k-moments such that k < α are finite).
Fig. 6 illustrates the use of the max–sum ratio plot to check whether the underlying data come from a power-law type distribution (class C). It is shown that for p = 3, the ratio converges to zero. However, this is not true for p = 7. These graphs should be given for different values of p. If all curves converge to zero, the distribution does not belong to the class C of regularly varying distributions. This method can be used for the same objective as the log–log plot. It makes it possible to discriminate between the distributions that belong to the class C and those of the class D\C.
Figure 6 Max–sum ratio plot for the regularly varying (power-law) class of distributions.
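The three graphical diagnostics described so far (log–log plot, mean excess function and max–sum ratio) reduce to simple sample computations. The following sketch gives one possible implementation under stated assumptions: the data are strictly positive, the empirical exceedance probability in the log–log plot is taken as one minus the Cunnane plotting position with a = 0.4, and all function names are hypothetical. It only produces the coordinates to be plotted and does not draw anything.

```python
import math


def loglog_points(sample, a=0.4):
    """(log x_(k), log P(X > x_(k))) pairs for the log-log (tail probability) plot."""
    ordered = sorted(sample)
    n = len(ordered)
    points = []
    for k, x in enumerate(ordered, start=1):
        exceedance = 1.0 - (k - a) / (n + 1 - 2 * a)  # 1 - Cunnane position
        points.append((math.log(x), math.log(exceedance)))
    return points


def mean_excess(sample, u):
    """Empirical e(u): average of (x - u) over the observations with x > u."""
    excesses = [x - u for x in sample if x > u]
    if not excesses:
        raise ValueError("no observation exceeds the threshold u")
    return sum(excesses) / len(excesses)


def max_sum_ratio(sample, p):
    """R_n(p) for n = 1..len(sample), in the order the data are given (positive data)."""
    ratios = []
    running_max = 0.0
    running_sum = 0.0
    for x in sample:
        power = x ** p
        running_max = max(running_max, power)
        running_sum += power
        ratios.append(running_max / running_sum)
    return ratios
```

A roughly linear `loglog_points` cloud, a linearly increasing `mean_excess` curve, or a `max_sum_ratio` curve that fails to decay for some p would each point in the direction discussed in the corresponding subsection.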
The generalized Hill ratio plot (Beirlant et al., 2004)
The fourth method is the generalized Hill ratio plot. Let
$$\alpha_n(x_n) = \frac{\sum_{i=1}^{n} I(X_i > x_n)}{\sum_{i=1}^{n} \log(X_i / x_n)\, I(X_i > x_n)} \qquad (12)$$
where
$$I(X_i > x_n) = \begin{cases} 1 & \text{if } X_i > x_n \\ 0 & \text{if } X_i \le x_n \end{cases}$$
This method is based on the fact that $\alpha_n$ is a consistent estimator of α if the tail is regularly varying (class C) with tail index α (Hill, 1975). In Eq. (12), $x_n$ is chosen to be large such that $P(X > x_n) \to 0$ and $nP(X > x_n) \to \infty$, and I is the indicator function. The standard Hill estimator of the tail index corresponds to the particular case where the observations are ordered $X_{(1)} \le \cdots \le X_{(n)}$ and $x_n = X_{(k_n + 1)}$, where $k_n$ is an integer which tends to infinity as n tends to infinity.
In practice, one plots $\alpha_n(x_n)$ as a function of $x_n$ and looks for some stable region from which $\alpha_n(x_n)$ can be considered as an estimator of α. Fig. 7 presents the generalized Hill ratio plot for a sample generated from the Generalized Pareto distribution (a) and the exponential distribution (b). Note that the generalized Hill method is an estimation method and can be used to characterize distributions of the class C and thus to discriminate between the classes C and D\C.
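Eq. (12) translates directly into code. The sketch below is illustrative and assumes strictly positive data; the function names are hypothetical, and the curve is evaluated at each observed value below the sample maximum, which is one possible choice of thresholds rather than a prescription from the paper.

```python
import math


def hill_ratio(sample, x):
    """alpha_n(x) of Eq. (12): number of exceedances of x over the sum of their log-ratios."""
    log_ratios = [math.log(xi / x) for xi in sample if xi > x]
    if not log_ratios:
        raise ValueError("no observation exceeds the threshold x")
    return len(log_ratios) / sum(log_ratios)


def hill_ratio_curve(sample):
    """(threshold, alpha_n(threshold)) pairs, for thresholds below the sample maximum."""
    ordered = sorted(sample)
    largest = ordered[-1]
    return [(x, hill_ratio(ordered, x)) for x in ordered if x < largest]
```

Plotting the second coordinate of `hill_ratio_curve` against the first gives the generalized Hill ratio plot; a stable plateau at high thresholds would be read as an estimate of α.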
Figure 7 Generalized Hill ratio plot for: (a) Generalized Pareto and (b) exponential distributions.
Method based on the Jackson statistic (Beirlant et al., 2006)
This method is presented by Beirlant et al. (2006) and is based on the Jackson statistic. It allows testing whether the sample is consistent with Pareto-type distributions (class B). Originally, the Jackson statistic (Jackson, 1967) was proposed as a goodness-of-fit statistic for testing exponential behaviour. Given the link between the Exponential and the Pareto distributions (if X has a Pareto distribution, the logarithmic transformation Y = log(X) is exponentially distributed), this statistic is used to assess Pareto-type behaviour. The Jackson statistic is further modified by taking into account the second-order tail behaviour (using a Taylor expansion) of a Pareto-type model. We recall here the bias corrected version of the Jackson statistic as developed in Beirlant et al. (2006)
and Goegebeur et al. (2006). Let X1, . . ., Xn be iid random variables and X1 6 6 Xn the corresponding order statistics. The test statistic is given by Pk 1 j¼1 Ckjþ1;k Z j T k ¼ k ð13Þ Hk;n where Zj ¼ jðlog X ðnjþ1Þ log X ðnjÞ Þ; j ¼ 1; . . . k; Hk;n ¼ Pk jþ1 1 j¼1 Z j and Cðkjþ1Þ ¼ 1 log kþ1 . k In practice, this test is similar to the generalized Hill plot. One presents T k against k and looks for a stable region (i.e. T k converges to the mean for large values of k). To obtain a bias corrected version of T k , Beirlant et al. (2006) derived the following approximate representation for log-spacing of order statistics: q j Zj c þ bn;k þ ej ; j ¼ 1; . . . ; k ð14Þ kþ1 n k
where bn,k is function of and ej, j = 1, . . ., k are zero centred error terms. This approximation leads to the bias corrected e k: version T ^q Pk j 1 ^ C Z b ð^ q Þ j LS;k j¼1 kjþ1;k kþ1 k ek ¼ ð15Þ T ^cLS;k ð^ qÞ ^LS;k ð^ ^ a consistent estimator of q, b with q qÞ and ^cLS;k ð^ qÞ are, respectively, the least square estimators of bn,k and c, obtained from Eq. (14).
For Pareto-type distributions (class B):

$$\sqrt{k}\,\bigl(\tilde{T}_k(\rho) - 2\bigr) \xrightarrow{\;L\;} N\!\left(0, \left(\frac{\rho}{1-\rho}\right)^{2}\right) \tag{16}$$

and for large values of $k$, the statistic $\tilde{T}_k$ converges to its mean 2. For a fixed value of $\rho$,

$$\hat{\gamma}_{LS,k}(\rho) = \frac{1}{k}\sum_{j=1}^{k} Z_j - \frac{\hat{b}_{LS,k}(\rho)}{1-\rho} \tag{17}$$

and the least squares estimator of $b_{n,k}$ can be expressed as

$$\hat{b}_{LS,k}(\rho) = \frac{(1-\rho)^2(1-2\rho)}{\rho^2}\,\frac{1}{k}\sum_{j=1}^{k}\left(\left(\frac{j}{k+1}\right)^{-\rho} - \frac{1}{1-\rho}\right) Z_j \tag{18}$$

Beirlant et al. (2006) considered the following estimator of $\rho$:

$$\hat{\rho}_{\tilde{k}} = -\frac{1}{\log\lambda}\,\log\frac{H_{\lfloor\lambda^2\tilde{k}\rfloor,n} - H_{\lfloor\lambda\tilde{k}\rfloor,n}}{H_{\lfloor\lambda\tilde{k}\rfloor,n} - H_{\tilde{k},n}} \tag{19}$$

for $\lambda \in (0, 1)$ and $\tilde{k}$ such that $\sqrt{\tilde{k}}\, b_{n,\tilde{k}} \to \infty$, as proposed by Drees and Kaufmann (1998). For the practical implementation of this test, Beirlant et al. (2006) considered several fixed values of $\rho$ (for example $\rho = -1$ or $-2$). Using simulation results, they showed that $\rho = -1$ gives good results. Fig. 8 presents the Jackson test statistic for samples generated from the Generalized Pareto distribution (a) and the exponential distribution (b). It shows that for the Generalized Pareto distribution (class B) the statistic $\tilde{T}_k$ converges to 2 for large $k$.

This method characterizes the class B (Pareto-type tail) and thus discriminates between the distributions of the class C (which have asymptotically the same tail as the class B distributions) and the remaining distributions of the sub-exponential class D, which do not have a power-law type tail.

Figure 8. Jackson statistic for: (a) Generalized Pareto and (b) exponential distributions.
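The bias correction can be sketched in Python for a fixed value of $\rho$. The sketch assumes the usual least squares solution of the regression $Z_j \approx \gamma + b\,(j/(k+1))^{-\rho}$, a negative $\rho$, and weights $C_{k-j+1,k} = 1 - \log((j+1)/(k+1))$; the function name `bias_corrected_jackson` and the default $\rho = -1$ are illustrative choices, not part of the original text.

```python
import math
import random


def bias_corrected_jackson(sample, k, rho=-1.0):
    """Bias corrected Jackson statistic for a fixed (negative) rho."""
    xs = sorted(sample)
    n = len(xs)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < n")
    if rho >= 0:
        raise ValueError("rho is assumed to be negative")
    # Scaled log-spacings Z_j, regressors (j/(k+1))^(-rho) and assumed weights.
    z = [j * (math.log(xs[n - j]) - math.log(xs[n - j - 1])) for j in range(1, k + 1)]
    u = [(j / (k + 1)) ** (-rho) for j in range(1, k + 1)]
    c = [1.0 - math.log((j + 1) / (k + 1)) for j in range(1, k + 1)]
    # Least squares slope: estimate of b_{n,k}.
    factor = (1.0 - rho) ** 2 * (1.0 - 2.0 * rho) / rho ** 2
    b_hat = factor * sum((uj - 1.0 / (1.0 - rho)) * zj for uj, zj in zip(u, z)) / k
    # Least squares intercept: estimate of gamma.
    gamma_hat = sum(z) / k - b_hat / (1.0 - rho)
    # Weighted mean of the bias-adjusted spacings, divided by gamma_hat.
    numerator = sum(cj * (zj - b_hat * uj) for cj, zj, uj in zip(c, z, u)) / k
    return numerator / gamma_hat


rng = random.Random(0)
pareto_sample = [rng.paretovariate(2.0) for _ in range(2000)]
print(bias_corrected_jackson(pareto_sample, k=500))  # expected to lie near 2
```

For an exact Pareto sample the fitted slope should be close to zero, so the corrected and uncorrected statistics should nearly coincide; the correction matters when the tail is only asymptotically Pareto.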
Application

Simulated data

To illustrate the practical use of the methods presented in the previous section, results of some simulation studies are presented. Most methods presented in the section ‘Standard methods of tail discrimination’ allow discrimination between the classes C (regularly varying distributions), D (sub-exponential distributions) and E (exponential type distributions). Note that the classes C and D are very important for fitting extremes and that there is, in practical cases, considerable controversy about the distinction between the sub-exponential type and the power-law type (regularly varying) distributions. We consider four distributions:

• Halphen type Inverse B [HIB] (Eq. (B.4.1)) and Fréchet [EV2] (Eq. (B.2.2)), both belonging to the class C.
• Halphen type A [HA] (Eq. (B.3.1)) from the class D\C.
• Lognormal [LN] (Eq. (B.1.2)), a limiting case (C and D\C, Fig. 3).

Some theoretical characteristics (pdf, moments) of these distributions are presented in Appendix B. For more details on the Halphen system of distributions, see for example El Adlouni and Bobée (2007). Samples of size n = 50 are simulated from these distributions with the same statistical characteristics: a coefficient of variation Cv = 0.25 (mean 40 and standard deviation 10) and a coefficient of skewness Cs = 1.84, which corresponds to skewed hydrological data that may be observed in practice. Table 2 gives the corresponding parameters for each of the distributions listed above.

Table 2. Parameters of simulated distributions (a), Cs = 1.84

HIB          EV2          LN           HA
ν = 1.8      u = 37       x0 = 18      ν = 1
α = 8        α = 7.8      μ = 3        α = 0.5
m = 120      κ = 0.10     σ = 0.46     m = 100

(a) These distributions are presented in Appendix B.

Figure 9. Log–log plots for samples generated from the HIB, EV2, LN and HA distributions.

Fig. 9 presents the log–log plots of samples generated from the HIB, Fréchet, Lognormal and HA distributions with Cs = 1.84. The log–log plot should be linear with a negative slope for regularly varying distributions (class C) but not for the other distributions of the sub-exponential class (class D\C). Fig. 9 indicates that the log–log plot method can distinguish the class C (regularly varying distributions: HIB and EV2) from the class D\C (sub-exponential distributions: HA). Indeed, samples generated from HIB and EV2 produce a straight line, except at the extremity, and thus belong to the class C of regularly varying
distributions. In contrast, the curve corresponding to a sample generated from the HA distribution is concave, indicating that the underlying distribution is not regularly varying. The plot corresponding to the Lognormal distribution, however, is very similar to that of a regularly varying distribution. Indeed, as shown in the section ‘Mechanisms for generating Lognormal and power-law distributions’, the mechanisms that generate the LN distribution are very similar to those generating regularly varying distributions.

In the empirical mean excess function method, we plot the empirical value of e(u) against k (the order of the observation u) and compute the slope of the curve. If the plot is linear with a slope equal to zero, this suggests an exponential type (class E\D); if the plot is linear with a slope greater than zero and a zero intercept, it suggests a sub-exponential type (class D). Fig. 10 illustrates the use of the empirical mean excess function method for the selected distributions. The mean excess function is increasing in all cases, which indicates that all data series belong to the class D of sub-exponential distributions. Indeed, all distributions considered in this section belong to the class D. This method is useful for eliminating the distributions that are in the class E\D (Fig. 2).

In the case of the max–sum ratio plot, Fig. 11 shows, for each distribution, the plot for p = 1, 2, ..., 9. Note that if the curve does not converge to zero for all values of p greater than some p0, then the underlying distribution has a regularly varying tail (class C). Fig. 11 shows that the max–sum ratio curves corresponding to HIB, Fréchet and Lognormal do not converge to zero for some values of p (p ≥ 4 for HIB, p ≥ 5 for EV2 and p = 4 for LN). However, for the HA generated samples, the corresponding max–sum ratio curve converges to zero for all values of p.

Fig. 12 shows that for HIB, EV2 and LN the Hill ratio plot indicates that the underlying distribution is of power-law type. Note that this method can be used to estimate the tail index (almost equal to 2), but other versions of the Hill estimator
may converge to a reliable estimate of the tail index. A detailed comparison of semi-parametric methods for estimating the tail index is given by De Haan and Peng (1998). This approach, usually used as an estimation method, can be used to confirm the class choice based on the other approaches.

Fig. 13 shows the plot of the bias corrected version of the method based on the Jackson statistic (Eq. (13)). When the underlying distribution is Pareto-type (belongs to the class B), the statistic $\tilde{T}_k$ converges to its mean 2 for large values of k. This is the case for the HIB, EV2 and LN distributions. However, for the HA samples, the test statistic does not converge to this mean. This method leads to the same conclusions as the log–log, max–sum ratio and Hill ratio plots.

These plots are potential tools to distinguish regularly varying distributions (class C) from the rest of the sub-exponential distributions (class D\C). However, it is difficult to discriminate between the Lognormal distribution and the class C of regularly varying distributions. Recall that these classes are nested and that the LN distribution is a limiting case, which is sometimes difficult to distinguish from the distributions of the class C.

Figure 10. Empirical mean excess function for samples with size n = 50.

Figure 11. Max–sum ratio plots with different values of p (p = 1, 2, ..., 9).

Figure 12. Hill ratio plot for simulated samples generated from HIB, EV2, LN and HA.

Figure 13. Adapted Jackson statistic to detect Pareto-type tail.

Case study: Potomac River peak flow

In this section, we study the annual instantaneous peak flows of the Potomac River at Point of Rocks for the time period 1895–2006 (water year October–September). Fig. 14 shows the observed annual peak flow time series. Smith (1987) and Katz et al. (2002) analyzed the same time series for the time periods 1895–1986 and 1895–2000, respectively. To check whether the observations are independent and the data series is stationary and homogeneous, we applied the Wald–Wolfowitz, Kendall and Wilcoxon tests (Bobée and Ashkar, 1991). These tests indicated that the observed peak flow data series can be considered as independent and identically distributed.

Figure 14. Annual peak flows of the Potomac River at Point of Rocks 1895–2000 (horizontal axis: year; vertical axis: annual peak flow in cfts, ×10^5).
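The max–sum ratio used among the graphical criteria (Figs. 11 and 16) is simple to compute. The sketch below assumes the usual definition $R_n(p) = \max_{i \le n}|X_i|^p / \sum_{i \le n}|X_i|^p$, evaluated for growing $n$; the function name `max_sum_ratio` is hypothetical, and the first observation is assumed to be non-zero so that the ratio is defined from $n = 1$.

```python
from itertools import accumulate


def max_sum_ratio(sample, p):
    """Return R_n(p) for n = 1, ..., len(sample), in the order the data are given."""
    powered = [abs(x) ** p for x in sample]
    running_max = accumulate(powered, max)  # max of |x_i|^p over the first n values
    running_sum = accumulate(powered)       # sum of |x_i|^p over the first n values
    return [m / s for m, s in zip(running_max, running_sum)]


# One curve per value of p, as in the panels of the max-sum ratio plots.
flows = [1.0, 2.0, 3.0, 10.0, 2.5]  # made-up values, for illustration only
for p in (1, 2, 4):
    print(p, [round(r, 3) for r in max_sum_ratio(flows, p)])
```

A curve that keeps returning to values well above zero as $n$ grows suggests that the moment of order $p$ does not exist, which is how the plots are read in the text.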
Figs. 15 and 16 show the results of the graphical criteria used to select the class of heavy tailed distributions for the Potomac River peak flow data series. The log–log plot (Fig. 15a) is linear, especially for high values of log(u), which indicates that the distribution belongs to the class C of regularly varying distributions. The mean excess function has a positive slope, which implies that the underlying distribution is sub-exponential (Fig. 15b). The Hill ratio plot (Fig. 15c) converges towards a value different from zero. The adapted Jackson statistic (Fig. 15d) converges to its mean value 2, which implies that the studied data series has a power-law distribution (belongs asymptotically to the class B). The max–sum ratio plot (Fig. 16) shows that for values of p greater than or equal to 4 the ratio does not converge to zero, and thus the distribution belongs to the class C of regularly varying distributions. All these criteria show that the distribution function belongs to the class C of regularly varying distributions (such as Fréchet, HIB, Log-Pearson type 3, etc.).

Smith (1987) used the GEV distribution to fit this time series for the time period 1895–1986 with the maximum likelihood (ML) method and obtained an estimated shape parameter of κ = −0.42. This estimate is erroneous: fitting the same dataset (1895–1986) with the ML method, we obtained κ = −0.186 (the ML estimates of the location and scale parameters are u = 89,018.6 and α = 43,262.8). For the time period 1895–2006, the value κ = −0.189 was obtained (u = 86,869.1 and α = 41,671.2). This estimate is very similar to that given by Katz et al. (2002) for the time period 1895–2000 (κ = −0.191, u = 87,535.7 and α = 42,499.2).

In order to illustrate the impact of the distribution choice, we compared the quantiles for different return periods, T = 100, 200 and 1000. The maximum likelihood method is used to estimate the parameters of all fitted distributions. Results show that there is a significant difference between the quantiles estimated with the distributions of the class D (HA and P3) and those estimated with regularly varying distributions (class C: Fréchet, HIB and LP3), especially for T = 1000. For a large return period T, (Q_T)_D < (Q_T)_LN < (Q_T)_C, where (Q_T)_D, (Q_T)_LN and (Q_T)_C are, respectively, the quantiles estimated by a distribution from class D, the LN distribution
Figure 15. Criteria to select the class of heavy tailed distributions for the Potomac River peak flow: (a) the log–log plot (log P(X > u) against log(u)), (b) the empirical mean-excess function (E(X−u|X>u) against u), (c) the Hill ratio plot (Hill ratio against k) and (d) the adapted Jackson statistic (Tk against k).

Figure 16. The max–sum ratio criteria to select the class of heavy tailed distributions for the Potomac River peak flows (max–sum ratio against n, one panel for each of p = 1, ..., 9).
or class C. For T ≤ 200, all distributions lead roughly to the same results, except for the Normal distribution. Note that the Lognormal distribution always gives an estimate between those given by the distributions of the classes C and D. Indeed, as mentioned above, the LN distribution has properties very similar to those of the distributions of the class C (regularly varying distributions). Inference is carried out using the software HYFRAN (CHS, 2002) and results are presented in Table 3.

The results presented in Table 3 show that there is a significant difference between the estimates of the 1000-year event obtained with distributions from different classes. Fig. 17 shows the differences between three distributions: HIB (class C), LN (limiting case) and P3 (class D\C). The 1000-year event estimated by the P3 distribution (4.82 × 10^5, Table 3) corresponds approximately to the 200-year event when the HIB distribution is fitted (4.49 × 10^5, Table 3). For instance, the estimated quantiles given by the HIB, Fréchet and LP3 distributions,
Table 3. The 100-, 200- and 1000-year events obtained with various commonly used distributions

Return period   Class D                   Class C                                  LN           Normal
                HA           PIII         Fréchet      HIB          LPIII
T = 100         3.82E+005    3.54E+005    3.93E+005    3.74E+005    3.88E+005    3.70E+005    2.94E+005
T = 200         4.32E+005    3.93E+005    4.67E+005    4.49E+005    4.54E+005    4.25E+005    3.12E+005
T = 1000        5.11E+005    4.82E+005    6.82E+005    7.11E+005    6.32E+005    5.69E+005    3.51E+005

Figure 17. Comparison of the HIB, LN and PIII fits for the annual peak flood at Potomac River.
all belonging to the class C, are very similar and larger than the values estimated with the other distributions. Almost all criteria lead to the conclusion that the class C is adequate for representing the annual instantaneous peak flows of the Potomac River at Point of Rocks. This example illustrates the importance of using the adequate class of distributions. Once the class that adequately represents the studied dataset is selected, other criteria can be used to select the most adequate fit within that class. Classical tests and criteria, such as the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC), can be used to select the distribution within the selected class. These criteria are available in the software HYFRAN (CHS, 2002).
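To make the link between fitted GEV parameters and return-period quantiles concrete, the cdf of Eq. (B.2.2) can be inverted at F(x_T) = 1 − 1/T. The sketch below uses the 1895–2006 ML estimates quoted in this section; the shape parameter is entered as a negative number (the Fréchet case in the sign convention of Appendix B), which is an assumption about the sign of the printed value. The function name `gev_quantile` is hypothetical.

```python
import math


def gev_quantile(T, u, alpha, kappa):
    """T-year quantile from F(x) = exp{-[1 - kappa (x - u) / alpha]^(1 / kappa)}."""
    y = -math.log(1.0 - 1.0 / T)  # equals [1 - kappa (x_T - u) / alpha]^(1 / kappa)
    if kappa == 0.0:
        return u - alpha * math.log(y)  # Gumbel limit of the GEV
    return u + (alpha / kappa) * (1.0 - y ** kappa)


for T in (100, 200, 1000):
    q = gev_quantile(T, u=86869.1, alpha=41671.2, kappa=-0.189)
    print(f"T = {T:5d}: {q:.3e}")
```

The printed values can be compared with the Fréchet column of Table 3. Because the quantile grows like $T^{-\kappa}$ when κ is negative, small changes in the shape parameter have a large effect on the 1000-year event.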
Conclusions

In this paper, the classes of distributions commonly used for extremes, as given by Werner and Upper (2002), are presented. Five nested classes were considered: class A (stable distributions), B (Pareto-type tail), C (regularly varying distributions), D (sub-exponential distributions) and the class E of distributions with non-existent exponential moments. This classification was combined with that given by Ouarda et al. (1994) to present the distributions commonly used in hydrology in an ordered form with respect to their tails.

The second part of this paper illustrates some graphical techniques to discriminate between these classes, especially between the class C of regularly varying distributions and the class D of sub-exponential distributions. Five methods were considered: the log–log plot, the empirical mean excess function, the max–sum ratio plot, the Hill plot and the adapted Jackson test. All these methods led to the same conclusion when the samples were generated from the HIB and EV2 distributions. Even for a small sample size (n = 50), almost all the graphical criteria presented in this paper conclude that the underlying distribution of the generated samples belongs to the class C of regularly varying distributions in the case of HIB and EV2, and to the class D for HA generated samples. Note that the class C is included in the class D. Except for the empirical mean excess method, all these methods discriminate between the class C and the rest of the class D of sub-exponential distributions. The empirical mean excess method should be used to detect exponential tails and thus to distinguish the distributions of class D from those of class E.

These results illustrate the potential of graphical tools for selecting the adequate class of distributions to estimate extremes. However, more studies should be carried out on the power of these methods. Such studies can be carried out using Monte Carlo simulations and should take into account the sample size and the skewness.

For all these methods, we noticed that it is difficult to discriminate between samples generated from the Lognormal distribution and those generated from regularly varying distributions. Indeed, none of the presented criteria allows discriminating between samples generated from the LN distribution and from the class C distributions. This conclusion confirms the similarity between the generative processes of the Lognormal and power-law type distributions presented in this paper.

The case study illustrates that the difference between quantile estimates obtained from the various classes might be very large. In some cases this can correspond to an over-estimation. However, when all criteria indicate that the underlying distribution belongs to the class C, the use of a distribution from other classes such as D or E could lead to a more dangerous underestimation.

To conclude, we propose the use of all these graphical criteria to select the class of distributions that seems to adequately represent the sample extremes. In a second step, and inside each class, other criteria can be used to select the most adequate fit to the dataset. Classical tests and criteria, such as the Anderson–Darling test, the Akaike Information Criterion or the Bayesian Information Criterion, can be used to select the distribution within the selected class. An important criterion is the efficiency of the parameter estimation method, as is the case for the Halphen distribution family. Indeed, an inefficient parameter estimation method can, for some distributions, lead to a bad choice of the class and consequently to a bad estimate of the extreme quantiles.
Acknowledgements

The financial support provided by the Natural Sciences and Engineering Research Council of Canada (NSERC) is gratefully acknowledged. The authors wish to thank an anonymous reviewer and Dr. R. Kidson for their insightful comments which helped improve the quality of the paper.

Appendix A. Classification of heavy tailed distribution (Werner and Upper, 2002)

The broadest class E encompasses all distributions with $E(e^{X}) = \infty$. Note that the Normal distribution has a lighter tail than the class E distributions. Thus, distributions of the class E are heavy tailed with respect to the Normal distribution. The class D contains the sub-exponential distributions defined by the following equation (Teugels, 1975; Goldie and Klüppelberg, 1998):

$$\lim_{x\to\infty} \frac{P(X_1 + \cdots + X_n > x)}{P(\max(X_1, \ldots, X_n) > x)} = 1 \tag{A.1}$$

It means that the sum of n independent and identically distributed (iid) sub-exponential random variables is likely to exceed x if and only if their maximum is larger than x. It can be shown that Eq. (A.1) is equivalent to (Embrechts et al., 2003):

$$\lim_{x\to\infty} \frac{\bar{F}(x)}{e^{-\varepsilon x}} = \infty \quad \forall \varepsilon > 0 \tag{A.2}$$

where $\bar{F}(x) = 1 - F(x)$ corresponds to the probability of exceedance function and $F$ is the cumulative distribution function (cdf). Note that the denominator $e^{-\varepsilon x}$ ($\varepsilon > 0$) represents the tail of the exponential distribution. Thus, class D contains all distributions with tails that decrease more slowly than any exponential distribution. The class C contains distributions such that:

$$\lim_{t\to\infty} \frac{\bar{F}(tx)}{\bar{F}(t)} = x^{-\alpha} \tag{A.3}$$

This relationship states that, asymptotically, the tail declines according to a power function and thus these distributions have the same tail as the Pareto distribution. Distributions of the class C are also called regularly varying distributions; the parameter α is called the ‘‘tail index’’ and is used as a measure of tail-heaviness. Distributions in class B have exact Pareto tails. The cumulative distribution function of the Pareto distribution is

$$F(x) = 1 - u^{\alpha} x^{-\alpha} = 1 - \left(\frac{u}{x}\right)^{\alpha}, \quad x \ge u \text{ and } u > 0 \tag{A.4}$$

where α > 0 is the tail index. The tail probability function (the probability of exceedance function) $\bar{F}(x) = P(X > x)$ of a class B distribution is therefore exactly equal to $u^{\alpha} x^{-\alpha}$, and not only asymptotically (as for the class C). The tail index α can be related to the moments of a distribution with Pareto tails. Indeed, the probability density function of a random variable following the Pareto distribution is $f_{\mathrm{Pareto}}(x) = \alpha u^{\alpha} x^{-\alpha-1}$ and the moment of order k is given by

$$E[X^{k}] = \alpha u^{\alpha} \int_{u}^{\infty} x^{k-\alpha-1}\, dx \tag{A.5}$$

Thus, only the moments of order k such that k < α are finite. This property is important to define the last class (class A). Note that small values of α imply heavier tails. Class A contains stable (or α-stable) distributions. Stable distributions have a Pareto tail with α < 2, which implies infinite variance and, as a consequence, very fat tails. In spite of this restriction (infinite variance), class A distributions are of great importance because they constitute a generalization of the Normal theory. Indeed, the Central Limit Theorem, which states that the sum of a large number of independent, identically distributed variables from a finite-variance distribution tends to be normally distributed, cannot be applied to some variables with infinite variance (e.g. variables representing natural catastrophes). In response to the empirical evidence, Fama (1965) and Mandelbrot (2003) proposed the stable distribution as an alternative model. Stable distributions are supported by the generalized Central Limit Theorem, which states that stable laws are the only possible limit distributions for properly normalized and centred sums of independent, identically distributed random variables. Since stable distributions can accommodate fat tails and asymmetry, they often give a very good fit to empirical data. In particular, they are valuable models for datasets covering extreme events. This class was characterized by Lévy (1925) in his work on the sum of independent and identically distributed (iid) variables. However, the absence of explicit formulas for the probability density function of these distributions limited their use. Recently, several software tools have been proposed to solve these computational problems (Nolan, 2006), and stable distributions are used in several fields such as finance, physics and Internet traffic.
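Returning to the Pareto moments of Eq. (A.5), the integral can be evaluated in closed form, which makes the role of the tail index explicit. The short derivation below is a worked illustration using only the Pareto density given in this appendix.

```latex
E[X^{k}] = \alpha u^{\alpha} \int_{u}^{\infty} x^{k-\alpha-1}\,dx
         = \alpha u^{\alpha} \left[ \frac{x^{k-\alpha}}{k-\alpha} \right]_{u}^{\infty}
         = \frac{\alpha\, u^{k}}{\alpha - k}, \qquad k < \alpha ,
```

For $k \ge \alpha$ the integrand decays no faster than $1/x$, so the integral diverges. In particular, the mean $\alpha u/(\alpha-1)$ requires $\alpha > 1$ and the variance is finite only if $\alpha > 2$.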
Appendix B. Distributions used in the section ‘Application’

B.1. The three parameter Lognormal distribution

The three parameter Lognormal (LN) distribution differs from the two parameter Lognormal distribution by the introduction of a lower bound $x_0$, so that if X follows the LN distribution,

$$Z = \log(X - x_0) \tag{B.1.1}$$

is normally distributed with parameters μ and σ². Thus, it is a three parameter distribution: $x_0$ is the location parameter, while μ and σ² control the scale and shape, respectively. It can be shown that the pdf of X is

$$f(x) = \frac{1}{(x - x_0)\,\sigma\sqrt{2\pi}}\; e^{-\frac{1}{2}\left[(\log(x - x_0) - \mu)/\sigma\right]^{2}} \tag{B.1.2}$$

The moments of X may be obtained as functions of those of the corresponding Normal distribution. The mean and the variance are given by

$$E(X) = x_0 + \exp\left(\mu + \tfrac{1}{2}\sigma^{2}\right) \tag{B.1.3}$$

$$\mathrm{Var}(X) = \exp(2\mu + \sigma^{2})\,(e^{\sigma^{2}} - 1) \tag{B.1.4}$$

The coefficient of variation is

$$C_v = (e^{\sigma^{2}} - 1)^{1/2} \tag{B.1.5}$$

and the coefficient of skewness is

$$C_s = (e^{\sigma^{2}} - 1)^{1/2}\,(e^{\sigma^{2}} + 2) \tag{B.1.6}$$

B.2. Generalized extreme value (GEV) distribution

The generalized extreme value distribution encompasses the three standard extreme value distributions: Fréchet, reverse Weibull and Gumbel (cf. section ‘Classes of heavy tailed distributions’). The cumulative distribution function, as presented by Von Mises (1954), is

$$F(x) = \exp\left\{-\left[1 + \xi\,\frac{(x - u)}{\alpha}\right]^{-1/\xi}\right\} \tag{B.2.1}$$

The reparameterization (κ = −ξ) used in hydrology is that given by Jenkinson (1955):

$$F(x) = \exp\left\{-\left[1 - \kappa\,\frac{(x - u)}{\alpha}\right]^{1/\kappa}\right\} \tag{B.2.2}$$

Each member of the extreme value family is characterised by the value of the shape parameter κ. The family can be divided into three classes, corresponding to different ranges of κ values. The three classes are type 1 (EV1), type 2 (EV2) and type 3 (EV3). In practice, κ values lie in the range ]−0.5; 0.5]. This range is divided among the three classes as follows: κ negative corresponds to the Fréchet distribution (EV2), κ equal to zero corresponds to the Gumbel distribution (EV1) and κ positive corresponds to the reverse Weibull distribution (EV3). Note that the cdf given in (B.2.2) exists if $1 - \kappa\,\frac{(x - u)}{\alpha} \ge 0$; thus the reverse Weibull distribution (κ positive) has an infinite lower bound and a fixed upper bound given by $u + \frac{\alpha}{\kappa}$, whereas the Fréchet distribution (κ negative) has an infinite upper bound and a finite lower bound given by $u + \frac{\alpha}{\kappa}$.

B.3. Halphen type A distribution (HA)

Halphen (1941) first suggested what he called the harmonic distribution:

$$f(x) = \frac{1}{2x\,K_0(2\alpha)} \exp\left[-\alpha\left(\frac{x}{m} + \frac{m}{x}\right)\right], \quad x > 0 \tag{B.3.1}$$

where m > 0 is a scale parameter; α > 0 is a shape parameter; and $K_0(\cdot)$ is the modified Bessel function of the second kind of order zero (Perreault et al., 1999). Being a two-parameter distribution, the shape of the harmonic distribution is entirely determined by the shape parameter α. To obtain additional flexibility, Halphen generalized the harmonic distribution by adding a second shape parameter, ν, giving rise to the Halphen type A distribution:

$$f_A(x) = \frac{1}{2 m^{\nu} K_{\nu}(2\alpha)}\, x^{\nu - 1} \exp\left[-\alpha\left(\frac{x}{m} + \frac{m}{x}\right)\right], \quad x > 0 \tag{B.3.2}$$

where m > 0 is a scale parameter, and α > 0 and ν ∈ ℝ are shape parameters. $K_{\nu}(\cdot)$ is the modified Bessel function of the second kind of order ν. The Halphen type A distribution may be recognized as a re-parameterized form of the generalized inverse Gaussian distribution (GIG). The non-centred moments of order r (r an integer) of a random variable following the HA distribution are given by

$$\mu'_r = E[X^{r}] = \frac{m^{r} K_{\nu + r}(2\alpha)}{K_{\nu}(2\alpha)} \tag{B.3.3}$$

Thus, the mean and the variance are given by

$$E[X] = m\,\frac{K_{\nu + 1}(2\alpha)}{K_{\nu}(2\alpha)} \tag{B.3.4}$$

$$\mathrm{Var}[X] = \frac{m^{2}}{K_{\nu}^{2}(2\alpha)}\left(K_{\nu}(2\alpha)\,K_{\nu + 2}(2\alpha) - K_{\nu + 1}^{2}(2\alpha)\right) \tag{B.3.5}$$

B.4. Halphen type Inverse B distribution (HIB)

The probability density function (pdf) of the HIB distribution is given by (Perreault et al., 1999):

$$f_{HIB}(y; m, \alpha, \nu) = \frac{2}{m^{-2\nu}\, ef_{\nu}(\alpha)}\; y^{-2\nu - 1} \exp\left[-\left(\frac{m}{y}\right)^{2} + \alpha\left(\frac{m}{y}\right)\right], \quad y > 0 \tag{B.4.1}$$

where m > 0 is a scale parameter, and α ∈ ℝ and ν > 0 are shape parameters. The non-central moment expression is given by

$$[\mu'_r]_{HIB} = E(Y^{r}) = \frac{m^{r}\, ef_{\nu - r/2}(\alpha)}{ef_{\nu}(\alpha)} \tag{B.4.2}$$

where $[\mu'_r]_{HIB}$ exists provided that ν − r/2 > 0, and the mean and the variance are

$$E[X] = m\,\frac{ef_{\nu - 1/2}(\alpha)}{ef_{\nu}(\alpha)} \tag{B.4.3}$$

$$\mathrm{Var}[X] = \frac{m^{2}}{ef_{\nu}^{2}(\alpha)}\left(ef_{\nu}(\alpha)\, ef_{\nu - 1}(\alpha) - ef_{\nu - 1/2}^{2}(\alpha)\right) \tag{B.4.4}$$

More details on the Halphen system of distributions are given by Perreault et al. (1999); El Adlouni and Bobée (2007) proposed generation algorithms for these distributions.
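As a quick numerical companion to the Lognormal moment formulas of section B.1, the sketch below evaluates Eqs. (B.1.3), (B.1.4) and (B.1.6) for given parameters. The function name `ln3_moments` is hypothetical and the parameter values in the usage lines are arbitrary, chosen only for illustration.

```python
import math


def ln3_moments(x0, mu, sigma):
    """Mean, variance and skewness of the three parameter Lognormal distribution."""
    w = math.exp(sigma ** 2)
    mean = x0 + math.exp(mu + 0.5 * sigma ** 2)             # Eq. (B.1.3)
    variance = math.exp(2.0 * mu + sigma ** 2) * (w - 1.0)  # Eq. (B.1.4)
    skewness = math.sqrt(w - 1.0) * (w + 2.0)               # Eq. (B.1.6)
    return mean, variance, skewness


# The lower bound x0 shifts the mean only; variance and skewness are unchanged.
print(ln3_moments(x0=0.0, mu=0.0, sigma=1.0))
print(ln3_moments(x0=5.0, mu=0.0, sigma=1.0))
```

Since the skewness depends on σ alone, matching a target Cs fixes σ first; μ and the lower bound can then be adjusted to reach a target standard deviation and mean.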
References

Balkema, A., De Haan, L., 1974. Residual life time at great age. Annals of Probability 2, 792–804.
Beirlant, J., Goegebeur, Y., Segers, J., Teugels, J., 2004. Statistics of Extremes: Theory and Applications. Wiley, 490 pp.
Beirlant, J., de Wet, T., Goegebeur, Y., 2006. A goodness-of-fit statistic for Pareto-type behaviour. Journal of Computational and Applied Mathematics 186, 99–116.
Benson, M.A., 1968. Uniform flood-frequency estimating methods for federal agencies. Water Resources Research 4, 891–908.
Bernier, J., 1959. Comparaison des lois de Gumbel et de Fréchet sur l’estimation des débits maxima de crue. La Houille Blanche (SHF) (1), 681.
Bobée, B., 1999. Extreme flood events valuation using frequency analysis: a critical review. Houille Blanche 54 (7–8), 100–105.
Bobée, B., Ashkar, F., 1991. The Gamma Family and Derived Distributions Applied in Hydrology. Springer, Littleton, CO, 217 pp.
Bobée, B., Rasmussen, P.F., 1995. Invited paper: recent advances in flood frequency analysis. US National Report Contributions in Hydrology to International Union of Geodesy and Geophysics 1991–1994. Review of Geophysics, 1111–1116.
Brooks, C.E.P., Carruthers, N., 1953. Handbook of Statistical Methods in Meteorology. Her Majesty’s Stationery Office, London, pp. 85–111.
Champernowne, D., 1953. A model of income distribution. Economic Journal 63, 318–351.
Chaire en Hydrologie Statistique (CHS), 2002. HYFRAN: Logiciel pour l’analyse fréquentielle en hydrologie. INRS-Eau, rapport technique. Water Resources Publications.
De Haan, L., Peng, L., 1998. Comparison of tail index estimators. Statistica Neerlandica 52 (1), 60–70.
Drees, H., Kaufmann, E., 1998. Selecting the optimal sample fraction in univariate extreme value estimation. Stochastic Processes and their Applications 75, 149–172.
El Adlouni, S., Bobée, B., 2007. Sampling techniques for Halphen distributions. Journal of Hydrologic Engineering 12 (6), 592–604.
Embrechts, P., Klüppelberg, C., Mikosch, T., 2003. Modelling Extremal Events for Insurance and Finance. Applications of Mathematics, vol. 33. Springer, 648 pp.
Fama, E.F., 1965. Portfolio analysis in a stable Paretian market. Management Science (B2), 409–419.
Fisher, R., Tippet, L., 1928. Limiting forms of the frequency distribution of the largest or smallest member of a sample. In: Proceedings of the Cambridge Philosophical Society, vol. 24, pp. 180–190.
Fréchet, M., 1927. Sur la loi de probabilité de l’écart maximum. Annales de la Société Polonaise de Mathématique, Cracovie 6, 93–116 (in Johnson et al. 1995).
Gibrat, R., 1930. Une loi des répartitions économiques: l’effet proportionnel. Bulletin de la Statistique Générale de la France 19, 469–514.
Gnedenko, B.V., 1943. Sur la distribution limite du terme maximum d’une serie aleatoire. Annals of Mathematics 44, 423–453.
Goegebeur, Y., Beirlant, J., de Wet, T., 2006. Goodness-of-fit testing and Pareto-tail estimation. Seminar given for the Department of Statistics, University of Southern Denmark.
Goldie, C.M., Klüppelberg, C., 1998. Subexponential distributions. In: Adler, R., Feldman, R., Taqqu, M.S. (Eds.), A Practical Guide to Heavy Tails. Birkhäuser, Boston, MA, pp. 435–459.
Gumbel, E.J., 1942. On the frequency distribution of extreme values in meteorological data. Bulletin of the American Meteorological Society 23, 95.
Gumbel, E.J., 1958. The Statistics of Extremes. Columbia University Press, New York, 371 pp.
Halphen, E., 1941. Sur un nouveau type de courbe de fréquence. Comptes Rendus de l’Académie des Sciences 213, 633–635 (Due to war constraints, published under the name ‘‘Dugué’’).
Hill, B.M., 1975. A simple general approach to inference about the tail of a distribution. Annals of Statistics 3, 1163–1174.
Hubert, P., Bendjoudi, H., 1996. Introduction à l’étude des longues séries pluviométriques. In: Journées hydrologiques de l’ORSTOM, Montpellier, p. 20.
Jackson, O.A.Y., 1967. An analysis of departures from the exponential distribution. Journal of the Royal Statistical Society B 29, 540–549.
Jenkinson, A.F., 1955. The frequency distribution of the annual maximum (or minimum) values of meteorological elements. Quarterly Journal of the Royal Meteorological Society 87, 158–171.
Katz, R.W., Parlange, M.B., Naveau, P., 2002. Statistics of extremes in hydrology. Advances in Water Resources 25, 1287–1304.
Kidson, R.L., Richards, K.S., 2005. Flood frequency analysis: assumptions and alternatives. Progress in Physical Geography 29 (3), 392–410.
Kidson, R., Richards, K.S., Carling, P.A., 2005. Reconstructing the ca. 100-year flood in Northern Thailand. Geomorphology 70, 279–295.
Koutsoyiannis, D., 2004. On the appropriateness of the Gumbel distribution for modelling extreme rainfall. In: Brath, A., Montanari, A., Toth, E. (Eds.), Hydrological Risk: Recent Advances in Peak River Flow Modelling, Prediction and Real-time Forecasting. Assessment of the Impacts of Land-use and Climate Changes. Editoriale Bios, Castrolibero, Bologna, Italy, pp. 303–319.
Lévy, P., 1925. Calcul des Probabilités. Gauthier-Villars.
Malamud, B.D., Turcotte, D.L., 2006. The applicability of power-law frequency statistics to floods. Journal of Hydrology 322, 168–180.
Mandelbrot, B., 1997. Fractales, Hasard et Finances. Flammarion, Coll. Champs.
Mandelbrot, B., 2003. Multifractal power law distributions: negative and critical dimensions and other ‘‘anomalies’’, explained by a simple example. Journal of Statistical Physics 110 (3–6), 739–774.
Mitzenmacher, M., 2004. A brief history of generative models for power law and lognormal distributions. Internet Mathematics (2), 226–251.
Morlat, G., 1957. Sur la théorie des choix aléatoires. Revue d’économie Politique 3, 378–380.
Newman, M.E.J., 2005. Power laws, Pareto distributions and Zipf’s law. Contemporary Physics 46, 323–351.
Nolan, J.P., 2006. Stable Distributions – Models for Heavy Tailed Data. Birkhäuser, Boston (Chapter 1).
Ouarda, T.B.M.J., Ashkar, F., Bensaid, E., Hourani, I., 1994. Statistical distributions used in hydrology. Transformations and asymptotic properties, Scientific Report, Department of Mathematics, University of Moncton, 31 pp.
Perreault, L., Bobée, B., Rasmussen, P.F., 1999. Halphen distribution system. I: Mathematical and statistical properties. Journal of Hydrologic Engineering 4, 189–199.
Pickands, J., 1975. Statistical inference extreme order statistics. Annals of Statistics 3, 119–130.
Rao, A.R., Hamed, K.H., 2000. Flood Frequency Analysis. CRC Press, Boca Raton, FL, USA.
Reed, W.J., 2002. On the rank-size distribution for human settlements. Journal of Regional Science 41, 1–17.
Reed, W.J., Hughes, B.D., 2002. From gene families and genera to incomes and internet file sizes: why power-laws are so common in nature. Physical Review E 66, 067103.
Reiss, R.D., Thomas, M., 2001. Statistical Analysis of Extreme Values, second ed. Birkhäuser, Basel/Boston/Berlin.
Robson, A., Reed, D.W., 1999. Flood Estimation Handbook: Statistical Procedures for Flood Frequency Estimation, vol. 3. Centre for Ecology and Hydrology, Wallingford, UK, 108 pp.
Singh, V.P., Strupczewski, W.G., 2002. On the status of flood frequency analysis. Hydrological Processes 16 (18), 3737–3740.
Smith, J.A., 1987. Estimating the upper tail of flood frequency distributions. Water Resources Research 23, 1657–1666.
Stedinger, J.R., Vogel, R.M., Foufoula-Georgiou, E., 1993. Frequency analysis of extreme events. In: Maidment, D.R. (Ed.), Handbook of Hydrology. McGraw Hill, New York (Chapter 18).
Teugels, J.L., 1975. The class of subexponential distributions. Annals of Probability 3 (6), 1001–1011.
Turcotte, D.L., 1997. Fractals and Chaos in Geology and Geophysics, second ed. Cambridge University Press, Cambridge.
US Water Resources Council, 1981. Guideline for Determining Flood Flow Frequency. Bulletin #17B of the Hydrology Committee.
Von Mises, R., 1954. La distribution de la plus grande de n valeurs. Selected Papers, vol. 2. American Mathematical Society, Providence, RI, pp. 271–294.
Werner, T., Upper, C., 2002. Time variation in the tail behaviour of bund futures returns. Working paper No. 199, European Central Bank.
Yevjevich, V., 1987. Stochastic models in hydrology. Stochastic Hydrology and Hydraulics 1 (1), 17–36.