Probabilistic modeling of random variables with inconsistent data


Applied Mathematical Modelling 73 (2019) 401–411


Applied Mathematical Modelling journal homepage: www.elsevier.com/locate/apm

Jianjun Qin
Department of Civil Engineering, Aalborg University, Niels Bohrs Vej 8A, Esbjerg 6700, Denmark

Article info

Article history: Received 29 August 2018; Revised 20 March 2019; Accepted 3 April 2019; Available online 16 April 2019

Keywords: Probability modeling; Reliability assessment; Tail behavior; Monte Carlo simulation; Data inconsistency

Abstract

The aim of the present paper is to formulate probabilistic modeling for random variables with inconsistent data to facilitate accurate reliability assessment. Traditionally, some outputs of a random variable are available and, based on these, a single distribution is identified. However, as will be illustrated, the data relevant to extreme events might not follow the same distribution as the rest of the data, yet they generally carry little weight in the definition of the distribution because there are so few of them. Adopting one single probability distribution to describe random variables with such inconsistent data might cause great errors in the reliability assessment, especially for extreme events. A new formulation of probabilistic modeling is proposed here for this type of random variable. The inconsistency within the data set is identified and the set is divided accordingly. Each division is described by its own distribution and finally the distributions are unified into one framework. The relevant problems in the modeling (e.g., the identification of the boundary between the divisions, the definition of the probability distributions, and the unification of the distributions into one framework) are presented and solved. The realization of the proposed approach in practical numerical analysis is then investigated. Finally, two examples are presented to illustrate the application from different perspectives. © 2019 Elsevier Inc. All rights reserved.

https://doi.org/10.1016/j.apm.2019.04.017
0307-904X/© 2019 Elsevier Inc. All rights reserved.

1. Introduction

Probabilistic analysis is needed in engineering and other fields because our knowledge is limited and uncertainties exist in all the relevant aspects. One feasible description of the uncertainties is the adoption of random variables. A few outputs of random variables might be available and, based on these, their probabilistic models are formulated; see e.g. [1] and [2]. In probabilistic analysis, traditionally, one single distribution is adopted to describe each random variable. For example, the lognormal distribution is commonly used to describe the variables in structural reliability analysis, such as the variables included in the limit state functions for concrete corrosion presented by DuraCrete [3]. In such a case, the data are considered as a whole, and the bulk of the data set, rather than its tail, governs the values of the parameters because of its dominant quantity. The parameters can be fixed by the maximum likelihood method (MLM), a Bayesian approach, or other methods (see e.g. [4] and [5] for illustration). However, as introduced in [6], the tails of the data set can be considerably different from the bulk, or the central part, of the data set. Such a difference, as will be illustrated in the following section, might cause potential error in the probabilistic analysis.

Fig. 1. Daily records of mean precipitation at Acton Escondido Canyon in the year 2010. (Vertical axis: daily records of mean precipitation (mm); horizontal axis: day of the year 2010.)

The particularity of the tail behavior of the available data of random variables (i.e., extreme values) has been recognized (for example, see [7–10]), and researchers tend to use special distributions to describe the maximal and minimal values. Several probability distributions are listed in [11] as typical extreme value distributions, such as the Gumbel maximum, Gumbel minimum, Frechet maximum, and Weibull minimum. Nevertheless, in many cases, formulating the distribution function of the extreme values is not enough for the practice of probabilistic analysis. Several problems still need to be clarified. For example, given a data set as the output of some random variable, exactly which part of the set should be regarded as the tail or the extreme values? How can the probabilistic description be realized for the random variable within the whole feasible region when only the distribution of its tail is known?

Following the recognition of the inconsistency within the behavior of the available data set of random variables, several special formulations of probability distribution have been proposed in recent decades, such as the so-called hybrid distribution (see e.g. [12]) and the mixture kernel density function by Miao et al. [13]. However, the hybrid distribution accounts for the existence of null records, while the inconsistency within the non-zero records is not analyzed. The cost of formulating the so-called mixture kernel density function is extremely high, and its definition involves strong arbitrariness. To minimize the potential errors and to make the probability distribution as close as possible to the information from the available data in practice, a new formulation of probabilistic modeling is proposed for random variables.
In the following discussion, the existence of inconsistency in the data is presented first in Section 2. Afterwards, a new formulation of probabilistic modeling for random variables is derived in Section 3, considering the boundary between divisions, the identification of the distribution based on the proposed measure of goodness of the alternative distributions, and the final formulation of the new unified distribution. In Section 4, the approach to realize the formulation in numerical analysis is introduced. Finally, two typical examples, one on the distribution of the engineering demand of a concrete column and one on the distribution of local temperature, are presented to illustrate the application of the proposed approach.

2. Inconsistency in the data set of a random variable

In this section, one simple example is provided to show the existence of potential inconsistency in the data set of a random variable. Fig. 1 illustrates the daily records of the mean precipitation at Acton Escondido Canyon in the US, collected by the National Centers for Environmental Information (NCEI) (www.ncdc.noaa.gov) in the year 2010. Here, four types of probability distribution common in the description of such phenomena, namely the lognormal, generalized Pareto, generalized beta, and Student’s t distributions, are taken into account, and the Quantile–Quantile plot (Q–Q plot) of the variable is made to illustrate their performance within the whole feasible region (see Fig. 2). The vertical axis in the plot represents the quantities from the data, while the horizontal axis shows the quantities from the assumption with different distributions. For comparison, the straight line “y=x” is also shown in Fig. 2; the discrepancy between the straight line and each curve expresses the difference between the assumed distribution and the real data.
Evidently, both the lognormal and the generalized Pareto distribution represent the records reasonably well in the middle, and either could in general be regarded as a good description of this variable because it performs well in most of the feasible region. The generalized beta and Student’s t distributions, by contrast, fail completely in this case: the discrepancy between the straight line and both of their curves is large.
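The construction of such a Q–Q plot can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `qq_pairs`, the plotting position (i − 0.5)/N, the synthetic sample, and the moment-matching fit of a lognormal candidate are all assumptions made here, not details of the analysis behind Fig. 2.

```python
import math
from statistics import NormalDist, fmean, stdev


def qq_pairs(data, inv_cdf):
    """Return (theoretical, observed) pairs for a Q-Q plot.

    The plotting position (i - 0.5) / N is an assumed convention.
    """
    xs = sorted(data)
    n = len(xs)
    return [(inv_cdf((i - 0.5) / n), x) for i, x in enumerate(xs, start=1)]


# Hypothetical positive sample standing in for the precipitation records.
sample = [0.4, 1.1, 2.5, 0.8, 6.0, 1.7, 3.2, 0.6, 12.0, 45.0]

# Lognormal candidate: fit a normal distribution to the log-data (moment matching).
logs = [math.log(v) for v in sample]
log_normal = NormalDist(fmean(logs), stdev(logs))
pairs = qq_pairs(sample, lambda p: math.exp(log_normal.inv_cdf(p)))

# Points far from the line y = x show where the candidate fits poorly.
for theoretical, observed in pairs:
    print(f"{theoretical:8.2f}  {observed:8.2f}")
```

Plotting the observed values against the theoretical ones for several candidate distributions would give curves comparable in spirit to those in Fig. 2.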

Fig. 2. Q–Q plot of the variable. (Vertical axis: data quantities; horizontal axis: theoretical quantities of the distributions; curves: y=x, generalized Pareto, lognormal, generalized beta, and Student’s t.)

Table 1. Two groups of data points and their typical characteristics.

Group            Data points      Target events      Consequences
Bulk             Large amounts    Moderate events    Low
High/Low tail    Few              Extreme events     High

However, there is a significant difference even between the curves of the lognormal and generalized Pareto distributions when the quantities are large. Such a difference might not be clearly seen using traditional measures of goodness of fit such as the chi-squared test and the Kolmogorov–Smirnov statistic. This is because both the size of the data in the tail and the difference in the scale of absolute value are small. Obviously, inconsistency exists in the description of the distribution of such data. Although there might be randomness in the output of the available data of variables, it is still common that it is difficult or even impossible to identify a single distribution that describes a variable well in the whole feasible region.

Two groups of data points (i.e., bulk and tail) and their respective characteristics are listed in Table 1. The tails can be further divided into two types. One type is a high tail with low exceedance probability and the other is a low tail with low non-exceedance probability. Here, exceedance probability and non-exceedance probability represent the value of the complementary cumulative distribution function (CCDF) and the cumulative distribution function (CDF), respectively. One or both of the two tails correspond to extreme events with very few data points and correspondingly low probabilities of occurrence, but they are generally of interest because of the extreme potential consequences. If there is inconsistency in the data set and one distribution is used to describe the whole feasible region of the variable, there will be errors in the estimated probabilities of the extreme events, and these errors can become great in further risk analysis considering the extreme consequences.

3. Refined probability modeling of random variables

Here, we consider random variables with some inconsistent data available.
The probability distribution is generally acknowledged, or some distribution fits the data well from the statistical point of view, but some part (like the tail) shows a large order-of-magnitude difference between the exceedance probability (or non-exceedance probability) obtained directly from the data and that from the distribution function. The aim here is to propose a refined formulation of the probability distribution of the variable to make it as close as possible to the knowledge from the data. The basic idea is to identify and divide the parts that are inconsistent with each other within the data set and to establish a probability distribution function for each part. Finally, the distribution functions are unified into one framework that satisfies the basic properties of a probability distribution function, so that the probabilistic modeling can be formulated and the probabilistic analysis can be further realized. To reach the objective, the following problems need to be solved: (1) separation and identification of the boundary between the parts that are inconsistent; (2) identification of the distribution in the tail regions; (3) definition of the distribution function of the other region consistent with the change in the tail region; and (4) implementation of the probabilistic analysis with the new formulation of the distribution function of the random variables.

Denote X as a random variable and let there be N data points representing the available realizations of X. The data points are denoted by $\boldsymbol{\xi} = \{\xi_1, \xi_2, \ldots, \xi_N\}^T$ ($\xi_1 \le \xi_2 \le \cdots \le \xi_N$), of which the median is termed $\xi_m$, for the convenience of the following analysis. The non-exceedance and exceedance probabilities of $\xi_i$ ($i = 1, 2, \ldots, N$), represented by $G(\xi_i)$ and $\bar{G}(\xi_i)$, respectively, are expressed as:

$$G(\xi_i) = \Pr(x \le \xi_i) = \frac{i}{N}, \qquad \bar{G}(\xi_i) = 1 - G(\xi_i) = \frac{N-i}{N} \tag{1}$$
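As a minimal sketch of Eq. (1), the empirical probabilities can be computed directly from the sorted sample. The function name and the sample values below are hypothetical. Note that Eq. (1) gives an exceedance probability of exactly zero at the largest data point, so its logarithm is undefined; any later use on a logarithmic scale has to treat that point separately.

```python
def empirical_probabilities(data):
    """Return the sorted data with G (non-exceedance) and G-bar (exceedance).

    Follows Eq. (1): G(xi_i) = i / N and G-bar(xi_i) = (N - i) / N,
    where i is the 1-based rank of the point in ascending order.
    """
    xi = sorted(data)
    n = len(xi)
    g = [i / n for i in range(1, n + 1)]
    g_bar = [(n - i) / n for i in range(1, n + 1)]
    return xi, g, g_bar


# Hypothetical sample of four realizations.
xi, g, g_bar = empirical_probabilities([3.0, 1.0, 2.0, 4.0])
for point, p, q in zip(xi, g, g_bar):
    print(point, p, q)
```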

It is assumed here that X has a generally recognized distribution in the analysis, denoted as $F_0(x)$ and $\bar{F}_0(x)$ for the CDF and CCDF, respectively. For convenience, the following discussion starts from the high-tail problem. Consider a total of $T_h$ types of distributions as the candidates for the description of the high tail, for each of which the CDF and CCDF given $\lambda^j(\xi_i)$ are represented by $F_h^j(x|\lambda^j(\xi_i))$ and $\bar{F}_h^j(x|\lambda^j(\xi_i))$ ($j = 1, 2, \ldots, T_h$), respectively. Here $\lambda^j(\xi_i)$ represents a vector whose components are the parameters in the formulation of the function (e.g., the mean value and the standard deviation in the formulation of the normal distribution), and $\xi_i$ in the bracket means that $\lambda^j(\xi_i)$ is decided by the $N - i + 1$ points included in the set $\{\xi_i, \xi_{i+1}, \ldots, \xi_N\}$ with curve fitting using the method of least squares or some other approach such as the maximum likelihood method. Given the boundary $\xi_{h_j}$ ($N/2 < h_j < N$) corresponding to the jth type, the corresponding CDF representing the bulk of the data, $F_b^j(x|\xi_{h_j})$, is defined next, and has to satisfy the following two conditions:

(1) $F_b^j(-\infty|\xi_{h_j}) = 0$; and
(2) $F_b^j(\xi_{h_j}|\xi_{h_j}) = F_h^j(\xi_{h_j}|\lambda^j(\xi_{h_j})) = 1 - \bar{F}_h^j(\xi_{h_j}|\lambda^j(\xi_{h_j}))$.

The first condition is one basic property of CDFs and the second condition ensures the continuity of the function at the boundary. Following the tail entropy approximation approach by Lind and Hong [14], the CDF $F_0(x)$ is modified to obtain $F_b^j(x|\xi_{h_j})$:

$$F_b^j(x|\xi_{h_j}) = \frac{F_h^j(\xi_{h_j}|\lambda^j(\xi_{h_j}))}{F_0(\xi_{h_j})} F_0(x), \qquad x < \xi_{h_j} \tag{2}$$

It is obvious that the function $F_b^j(x|\xi_{h_j})$ presented in Eq. (2) satisfies the two conditions mentioned above.

The next problem is to identify the boundary of the high tail $\xi_{h_j}$ ($N/2 < h_j < N$) for each type of distribution. First, in the probabilistic analysis, especially in the tail, the order-of-magnitude difference between the value of the CCDF $\bar{F}_h^j(\xi|\lambda^j(\xi_i))$ and the exceedance probability directly from the data $\bar{G}(\xi)$ ($\xi \ge \xi_i$) is generally of greater interest than the absolute value difference, as introduced in Section 2. Second, as shown in Eq. (2), the definition of the boundary also changes the CDF $F_b^j(x|\xi_{h_j})$ and, correspondingly, the magnitude difference between it and the $\bar{G}(\xi)$ values. The boundary is identified as the point that minimizes the sum of the maximal distances in the tail and bulk parts, weighted by the respective numbers of data points:

$$\xi_{h_j} = \left\{ \xi_i \;\middle|\; \min_{\xi_i \in \boldsymbol{\xi}} \left[ \max_{\xi < \xi_i} \left| \log_{10} \bar{G}(\xi) - \log_{10} \bar{F}_b^j(\xi|\xi_i) \right| \times (i-1) + \max_{\xi \ge \xi_i} \left| \log_{10} \bar{G}(\xi) - \log_{10} \bar{F}_h^j(\xi|\lambda^j(\xi_i)) \right| \times (N-i+1) \right] \right\} \tag{3}$$

Here, to avoid potential confusion, the logarithm of the CCDF is adopted to measure the magnitude difference, and the maximal distances in the tail and bulk parts are added together with the respective numbers of data points as the weights to identify the boundary.

Once the boundary separating the high tail from the bulk of the data for each of the $T_h$ distribution functions $\bar{F}_h^j(\xi|\lambda^j(\xi_i))$ is selected as $\xi_{h_j}$ ($j = 1, 2, \ldots, T_h$), the parameters included in $\bar{F}_h^j(\xi_{h_j}|\lambda^j(\xi_{h_j}))$ (i.e., $\lambda^j$) can be fixed by least squares fitting to the data in the high tail $\{\xi_{h_j}, \xi_{h_j+1}, \ldots, \xi_N\}$. The next question is how to identify the best CDF (and CCDF) from all the $T_h$ types to describe the high tail. As introduced above, there are several traditional approaches to measure the goodness of fit from different perspectives. However, the data are generally considered as a whole and the tail is ignored due to its small quantity of available data. An index of the average residuals in the high tail, $R_h^j$, is proposed to measure the goodness of fit and compare the high-tail distribution candidates. It is defined as the average distance between $\log_{10}(\bar{F}_h^j(\xi_i|\lambda^j(\xi_{h_j})))$ and $\log_{10}(\bar{G}(\xi_i))$, $\xi_i \in \{\xi_{h_j}, \xi_{h_j+1}, \ldots, \xi_N\}^T$:

$$R_h^j = \frac{\sum_{i=h_j}^{N} \left| \log_{10}\left(\bar{G}(\xi_i)\right) - \log_{10}\left(\bar{F}_h^j(\xi_i|\lambda^j(\xi_{h_j}))\right) \right|}{N - h_j + 1} \tag{4}$$
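The index in Eq. (4) can be sketched as below, assuming the boundary index and an already fitted tail CCDF are given. The function name, the sample, and the two candidate CCDFs are hypothetical. One deviation from Eq. (4) is an assumption of this sketch: the largest data point is skipped because its empirical exceedance probability is zero and has no logarithm, so the average is taken over the remaining tail points.

```python
import math


def average_tail_residual(xi, h, tail_ccdf):
    """Average log10 gap between empirical and model exceedance probabilities.

    xi        : data sorted in ascending order
    h         : 1-based index of the boundary point, with h < len(xi)
    tail_ccdf : callable returning the candidate CCDF value at x

    Assumption: the largest point is skipped, because (N - i) / N = 0 there
    and log10(0) is undefined; the average is taken over the points kept.
    """
    n = len(xi)
    gaps = []
    for i in range(h, n + 1):  # i runs over the tail points, 1-based
        g_bar = (n - i) / n
        if g_bar > 0.0:
            gaps.append(abs(math.log10(g_bar) - math.log10(tail_ccdf(xi[i - 1]))))
    return sum(gaps) / len(gaps)


# Hypothetical sorted sample and two made-up candidate CCDFs.
xi = [float(k) for k in range(1, 11)]
exact = lambda x: (10.0 - x) / 10.0       # reproduces the empirical values
too_small = lambda x: (10.0 - x) / 100.0  # one order of magnitude too low

print(average_tail_residual(xi, 6, exact))      # 0.0
print(average_tail_residual(xi, 6, too_small))  # close to 1
```

A residual near one means the candidate is off by roughly one order of magnitude on average in the tail, which is the kind of discrepancy a fit measure on the absolute scale would barely register.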

Similar to the definition of the boundary between the high tail and the bulk, a logarithm is used in this measure to consider the difference of the magnitudes of the CCDFs instead of their values. We now have a total of $T_h$ values of the index of the average residuals, $R_h^1, R_h^2, \ldots, R_h^{T_h}$, corresponding to the $T_h$ types of distribution. The distribution that has the smallest $R_h^j$ value ($j = 1, 2, \ldots, T_h$) is identified as the best distribution to describe the high tail of the variable X, denoted as $F_h^0(x|\lambda^0(\xi_{h_0}))$ and $\bar{F}_h^0(x|\lambda^0(\xi_{h_0}))$ for the CDF and CCDF, respectively, with $\xi_{h_0}$ for the boundary. Finally, the refined formulation of the CDF of the random variable X is:

$$F(x) = \begin{cases} F_b^0(x|\xi_{h_0}) = \dfrac{F_h^0(\xi_{h_0}|\lambda^0(\xi_{h_0}))}{F_0(\xi_{h_0})} F_0(x), & x < \xi_{h_0} \\[2ex] F_h^0(x|\lambda^0(\xi_{h_0})), & x \ge \xi_{h_0} \end{cases} \tag{5}$$
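The piecewise form of Eq. (5) can be sketched as follows, assuming purely for illustration a standard normal $F_0$, a Gumbel maximum tail with made-up parameters, and a made-up boundary; none of these values come from the examples in this paper. The sketch only shows that the scaled bulk and the tail join continuously at the boundary.

```python
import math
from statistics import NormalDist


def gumbel_max_cdf(x, mu, beta):
    """CDF of the Gumbel maximum distribution with location mu and scale beta."""
    return math.exp(-math.exp(-(x - mu) / beta))


def make_refined_cdf(base_cdf, tail_cdf, boundary):
    """Build F(x): scaled base CDF below the boundary, tail CDF at and above it."""
    scale = tail_cdf(boundary) / base_cdf(boundary)  # ratio multiplying F_0(x)

    def refined_cdf(x):
        if x < boundary:
            return scale * base_cdf(x)
        return tail_cdf(x)

    return refined_cdf


# Hypothetical inputs: standard normal bulk, Gumbel maximum tail, boundary at 1.5.
base = NormalDist(0.0, 1.0)
tail = lambda x: gumbel_max_cdf(x, mu=0.5, beta=0.8)
refined = make_refined_cdf(base.cdf, tail, boundary=1.5)

for x in (-3.0, 0.0, 1.499999, 1.5, 3.0):
    print(f"F({x}) = {refined(x):.6f}")
```

With these arbitrary parameters the scaling ratio is noticeably different from one; with a tail fitted to real data it would be expected to be much closer to one.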

In Eq. (5), $F_b^0(x|\xi_{h_0})$ is the CDF describing the bulk part. Note that although the CDF $F_b^0(x|\xi_{h_0})$ keeps a linear relation with $F_0(x)$, it does not necessarily follow the same distribution type as $F_0(x)$. For example, the product of the CDF of a normal distribution and a constant does not follow a normal distribution if the constant is not equal to one. Nevertheless, the ratio $F_h^0(\xi_{h_0}|\lambda^0(\xi_{h_0}))/F_0(\xi_{h_0})$ would generally be close to one, considering that the residual should be very small. The comparison between $F_b^0(x|\xi_{h_0})$ and $F_0(x)$ in the bulk ($x < \xi_{h_0}$) could be realized through calculation of the index of the average residuals, which will not be discussed in detail here.

The low-tail problem is similar. One difference is that the CDF instead of the CCDF should be considered in the definition of the boundary and the measurement of the goodness of fit, due to the focus on the non-exceedance probability. The values of the CCDF in the low tail always approach one, while the values of the CDF in the low tail can often differ by one or more orders of magnitude, as with those of the CCDF in the high tail. Let us assume that there are in total $T_l$ choices of low-tail distribution for X, with $F_l^j(x|\eta^j(\xi_i))$ ($j = 1, 2, \ldots, T_l$) denoting the jth CDF of the low-tail region. Here the role of $\eta^j$ in the formulation is the same as that of $\lambda^j$ in $F_h^j(x|\lambda^j(\xi_i))$. The difference is that the vector $\eta^j$ is expressed as a function of $\xi_i$, which represents the total of i data points included in the set $\{\xi_1, \xi_2, \ldots, \xi_i\}^T$ instead of $\{\xi_i, \xi_{i+1}, \ldots, \xi_N\}^T$. In symmetry with the high-tail problem, denote $\xi_{l_j}$ as the boundary of the jth type of distribution; the corresponding CCDF $\bar{F}_b^j(x|\xi_{l_j})$ is defined next, and has to satisfy the following two conditions:

(1) $\bar{F}_b^j(+\infty|\xi_{l_j}) = 0$; and
(2) $\bar{F}_b^j(\xi_{l_j}|\xi_{l_j}) = \bar{F}_l^j(\xi_{l_j}|\eta^j(\xi_{l_j})) = 1 - F_l^j(\xi_{l_j}|\eta^j(\xi_{l_j}))$.

Then, the CCDF would be:

$$\bar{F}_b^j(x|\xi_{l_j}) = \frac{\bar{F}_l^j(\xi_{l_j}|\eta^j(\xi_{l_j}))}{\bar{F}_0(\xi_{l_j})} \bar{F}_0(x), \qquad x > \xi_{l_j} \tag{6}$$

Similar to the high-tail problem, the boundary of the low-tail region corresponding to the CDF of the jth type of distribution, $\xi_{l_j}$, can be identified as:

$$\xi_{l_j} = \left\{ \xi_i \;\middle|\; \min_{\xi_i \in \boldsymbol{\xi}} \left[ \max_{\xi \le \xi_i} \left| \log_{10} G(\xi) - \log_{10} F_l^j(\xi|\eta^j(\xi_i)) \right| \times i + \max_{\xi > \xi_i} \left| \log_{10} G(\xi) - \log_{10} F_b^j(\xi|\xi_i) \right| \times (N-i) \right] \right\} \tag{7}$$

Furthermore, the parameters included in $\eta^j$ can be fixed by least squares fitting to the data in the low tail $\{\xi_1, \xi_2, \ldots, \xi_{l_j}\}^T$. Also, the index of the average residuals in the low tail, $R_l^j$, corresponding to the jth CDF can be defined as:

$$R_l^j = \frac{\sum_{i=1}^{l_j} \left| \log_{10}\left(G(\xi_i)\right) - \log_{10}\left(F_l^j(\xi_i|\eta^j(\xi_{l_j}))\right) \right|}{l_j} \tag{8}$$

We now have a total of $T_l$ values of the index of the average residuals, $R_l^1, R_l^2, \ldots, R_l^{T_l}$, corresponding to the $T_l$ types of distribution. The distribution that has the smallest $R_l^j$ value ($j = 1, 2, \ldots, T_l$) is identified as the best distribution to describe the low tail of the variable X, denoted as $F_l^0(x|\eta^0(\xi_{l_0}))$ and $\bar{F}_l^0(x|\eta^0(\xi_{l_0}))$ for the CDF and CCDF, respectively, with $\xi_{l_0}$ for the boundary. Finally, the refined formulation of the CDF of the random variable X can be identified correspondingly:

$$F(x) = \begin{cases} F_l^0(x|\eta^0(\xi_{l_0})), & x \le \xi_{l_0} \\[2ex] 1 - \bar{F}_b^0(x|\xi_{l_0}) = 1 - \dfrac{\bar{F}_l^0(\xi_{l_0}|\eta^0(\xi_{l_0}))}{\bar{F}_0(\xi_{l_0})} \bar{F}_0(x), & x > \xi_{l_0} \end{cases} \tag{9}$$

The procedure to define the proposed probability distribution of the variable is summarized in Fig. 3.

Fig. 3. Flowchart of the definition of the proposed probability distribution of the variable X. (Given the N sorted data points and the original CDF $F_0(x)$, the high-tail branch follows Eqs. (3)–(5) and the low-tail branch follows Eqs. (7)–(9).)

4. Realizations of a random variable in the numerical analysis

In the previous section, a refined formulation of a probability distribution function is proposed. However, it is generally difficult or even impossible to identify the analytical solution from the existing theoretical formulations. Furthermore, the CDF of the variable X turns into $F(x)$ (see Eqs. (5) and (9)) rather than $F_0(x)$, which makes the problem complicated. Numerical simulation, in most cases, is necessary to generate realizations of such random variables. The aim of this section is to investigate the procedure to generate realizations of X with the proposed formulation of the probability distribution in practice.

First, we could have M realizations of the variable U following a uniform distribution within the interval (0, 1), $\{u_1, u_2, \ldots, u_M\}$. For the high-tail problem, corresponding to the improved probability distribution $F(x)$ shown in Eq. (5), realizations of X can also be divided into two parts. If the value of $u_i$ ($1 \le i \le M$) is no less than the value of the function $F_h^0(x|\lambda^0(\xi_{h_0}))$ at the boundary $\xi_{h_0}$ (i.e., $F_h^0(\xi_{h_0}|\lambda^0(\xi_{h_0}))$), the corresponding realization $x_i$ is given by the inverse function, $(F_h^0)^{-1}(u_i|\lambda^0(\xi_{h_0}))$. If the value of $u_i$ is less than $F_h^0(\xi_{h_0}|\lambda^0(\xi_{h_0}))$, $x_i$ is given by the inverse function of $\dfrac{F_h^0(\xi_{h_0}|\lambda^0(\xi_{h_0}))}{F_0(\xi_{h_0})} F_0(x)$.

Fig. 4. Flowchart of the detailed procedure to obtain realizations of the random variable X.

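That is, the M realizations of X would be:

$$x_i = \begin{cases} F_0^{-1}\!\left( \dfrac{F_0(\xi_{h_0})}{F_h^0(\xi_{h_0}|\lambda^0(\xi_{h_0}))}\, u_i \right), & u_i < F_h^0(\xi_{h_0}|\lambda^0(\xi_{h_0})) \\[2ex] (F_h^0)^{-1}\!\left(u_i|\lambda^0(\xi_{h_0})\right), & u_i \ge F_h^0(\xi_{h_0}|\lambda^0(\xi_{h_0})) \end{cases} \qquad (i = 1, 2, \ldots, M) \tag{10}$$

Similarly, the M realizations of X for the low-tail problem would be:

$$x_i = \begin{cases} (F_l^0)^{-1}\!\left(u_i|\eta^0(\xi_{l_0})\right), & u_i < F_l^0(\xi_{l_0}|\eta^0(\xi_{l_0})) \\[2ex] \bar{F}_0^{-1}\!\left( \dfrac{\bar{F}_0(\xi_{l_0})}{\bar{F}_l^0(\xi_{l_0}|\eta^0(\xi_{l_0}))}\, (1 - u_i) \right), & u_i \ge F_l^0(\xi_{l_0}|\eta^0(\xi_{l_0})) \end{cases} \qquad (i = 1, 2, \ldots, M) \tag{11}$$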
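The two branches of Eq. (10) can be sketched as an inverse-transform sampler. As before, the standard normal $F_0$, the Gumbel maximum tail, its parameters, the boundary, and the random seed are assumptions chosen only for illustration, and the function names are hypothetical.

```python
import math
import random
from statistics import NormalDist


def gumbel_max_cdf(x, mu, beta):
    """CDF of the Gumbel maximum distribution."""
    return math.exp(-math.exp(-(x - mu) / beta))


def gumbel_max_inv_cdf(u, mu, beta):
    """Inverse CDF of the Gumbel maximum distribution, for 0 < u < 1."""
    return mu - beta * math.log(-math.log(u))


def realization_high_tail(u, base, boundary, mu, beta):
    """Map a uniform value u in (0, 1) to x following the two branches of Eq. (10)."""
    p_boundary = gumbel_max_cdf(boundary, mu, beta)
    if u >= p_boundary:
        # Tail branch: invert the tail CDF directly.
        return gumbel_max_inv_cdf(u, mu, beta)
    # Bulk branch: undo the scaling ratio, then invert the original CDF F_0.
    return base.inv_cdf(base.cdf(boundary) / p_boundary * u)


# Hypothetical setup: standard normal bulk, Gumbel maximum tail, boundary at 1.5.
base = NormalDist(0.0, 1.0)
boundary, mu, beta = 1.5, 0.5, 0.8

rng = random.Random(2019)
us = [u for u in (rng.random() for _ in range(10000)) if u > 0.0]  # drop u == 0
xs = [realization_high_tail(u, base, boundary, mu, beta) for u in us]

p_boundary = gumbel_max_cdf(boundary, mu, beta)
share_in_tail = sum(x >= boundary for x in xs) / len(xs)
print(f"share of realizations in the tail: {share_in_tail:.3f}")
print(f"1 - F at the boundary:              {1.0 - p_boundary:.3f}")
```

The share of realizations falling at or above the boundary should be close to the exceedance probability of the boundary under the refined distribution, which is a quick sanity check on the sampler. The low-tail case of Eq. (11) would follow the same pattern with the CCDF in place of the CDF.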

The detailed procedure to obtain M realizations of X is shown in Fig. 4.

5. Examples

5.1. High-tail problem: seismic demand analysis of a reinforced concrete column

As an example of the high-tail problem, the seismic demand of a reinforced concrete column is investigated here. The circular flexural column has a diameter of 1.3 m, a clear height of 6 m with an additional 0.8 m to the center of mass of the superstructure, a 2% longitudinal reinforcing ratio, and a 1% transverse reinforcing ratio. The nominal compression strength of the concrete is 45 MPa and the nominal yield strength of the reinforcement is 462 MPa. The superstructure is represented by a concentrated mass weighing 3000 kN. The column is assumed to be fixed to the foundation at the bottom. The column behaves as a cantilever in both the longitudinal and transverse directions. However, a rigid link is used to represent the portion of the column between the top clear height and the center of mass of the superstructure, potentially creating an inflection point in the column. The initial elastic fundamental vibration periods of the column are 0.54 s in each of the orthogonal lateral directions and 0.042 s in the vertical direction.

Here, as is widely recognized in engineering seismic analysis (see e.g. [15] and [16]), the engineering demand parameter (EDP) is considered the indicator of the seismic demand of engineering structures. A total of 160 ground motions were applied to the OpenSees (http://opensees.berkeley.edu) model of the column, which was verified by the experiments presented in [16], to generate a rich numerically simulated data set to evaluate the CDF of an EDP conditioned on the intensity measures (IM) of the seismic hazards. The 160 ground motions are available from the Pacific Earthquake Engineering Research (PEER) Center strong motion database and a detailed introduction can be found in e.g. [17]. Each ground motion comprised three components: two orthogonal horizontal accelerations and a vertical acceleration. The square root of the sum of squares (SRSS) of the two orthogonal horizontal peak ground velocity (PGV) values was selected as the IM. Correspondingly, the EDP is the SRSS of the obtained horizontal drifts in the two orthogonal directions. Generally, it is considered to follow a lognormal distribution, which means that lnEDP follows a normal distribution.

The performance of the normal distribution assumption for lnEDP was investigated through the index $R_h$; see Fig. 5. When the lnEDP (and hence EDP) values are large, the index $R_h$ is also large, but it drops significantly as the data quantity decreases. The index $R_h$ becomes stable once lnEDP is smaller than −4. The drastic fluctuation shows that there is inconsistency in the data set and that the normal distribution does not perform well for lnEDP in the high tail.

Fig. 5. Variation of $R_h$ values of the normal distribution assumption for lnEDP with data quantities. (Vertical axis: $R_h$; horizontal axis: data quantities.)

Four types of distributions (i.e., Gumbel maximum, Gumbel minimum, Frechet maximum, and Student’s t) were considered and compared here to identify the best distribution so that the tail of the variable would be recognized and described. A comparison of the four tail CCDFs considered for the variable lnEDP is shown in Fig. 6 and Table 2. In this case, the Student’s t distribution performs poorly and its average residual measure $R_h$ is 0.31, even larger than that of the normal distribution in the high tail as shown in Fig. 5. The other three types (i.e., Gumbel maximum, Gumbel minimum, and Frechet maximum) show similar and good performance. They have the same high tail boundary and a similar bulk distribution translation ratio. The scaling ratios $F_h^j(\xi_{h_j}|\lambda^j(\xi_{h_j}))/F_0(\xi_{h_j})$ from all three distributions are very close to one, which means that the CDF in the bulk part approaches the normal distribution; in other words, the modification of the normal distribution needed to fit the bulk of the data and connect to the selected high tail distribution is small. They also have similar and low average residual measures $R_h$, all around 0.03. Among them, the Gumbel maximum distribution has the lowest value of $R_h$, which is actually only slightly different from those of the Gumbel minimum and Frechet maximum, and it is considered the best description of the high-tail data here. Comparing the index $R_h$ of the Gumbel maximum distribution from the proposed approach (0.032) with that of the normal distribution in the high tail of lnEDP (fluctuating between 0.2 and 0.7) shows that the proposed approach greatly improves the probabilistic modeling of lnEDP for this case.

Fig. 6. Comparison of the distributions for the high tail of lnEDP. (Vertical axis: Pr(x>lnEDP), logarithmic scale; horizontal axis: x; curves: Data, Normal, Student’s t, Gumbel Min, Frechet Max, Gumbel Max.)

Table 2. High tail boundary and other parameters of the tail distributions for the lnEDP data set. The bulk distribution translation ratio is $F_h^j(\xi_{h_j}|\lambda^j(\xi_{h_j}))/F_0(\xi_{h_j})$.

Type of distribution    High tail boundary $\xi_h$    Bulk distribution translation ratio    Average residual measure $R_h$
Gumbel max              −3.61                         1.03                                   0.032
Gumbel min              −3.61                         1.03                                   0.035
Frechet max             −3.61                         1.03                                   0.033
Student’s t             −4.11                         0.17                                   0.31

5.2. Low-tail problem: extreme low temperature analysis

Temperature is one great concern of our society. At the global level, global warming is one of the most popular topics (see e.g. [18] and [19]). At the local level, the temperature influences almost all aspects of our life. For example, the temperature is the key factor determining the process of steel corrosion in reinforced concrete structures; see [3]. Here, extreme low temperature analysis is undertaken to illustrate the application of the proposed approach to a low-tail problem.

Fig. 7. Illustration of the ascending sort of daily lowest temperatures of northern Beijing in December between 1951 and 2000 (data source: National Meteorological Information Center of China data.com.cn). (Vertical axis: lowest temperature (0.1 °C); horizontal axis: sort number of the data points.)

The daily lowest temperatures in December from 1951 to 2000, $T_{min}$, of northern Beijing recorded by one station (No. 54511) belonging to the China ground international exchange stations (published by National Meteorological Information Center of China data.com.cn) were analyzed to obtain their probability distribution using the proposed approach. In each year, there were a total of 31 days in December and, correspondingly, 31 records of daily lowest temperature. Therefore, the size of the data set (i.e., the number of available data points) was 1550.
The records were sorted in ascending order and are illustrated in Fig. 7. The consistency of the data with the normal distribution assumption was investigated with the index Rl; see Fig. 8 for the results. At the low end of the data quantities, the value of Rl increases sharply and then gradually decreases as the data quantity increases. The index Rl becomes small and stable once the data quantity is larger than 0. The drastic fluctuation shows that there is inconsistency in the data set and that the normal distribution performs poorly in the low tail for this case. The proposed approach was therefore applied to the low tail with the same four types of distribution as in the previous example, and the results are provided in Table 3. As in the high-tail example, the Student’s t distribution performs badly: its average residual measure Rl is 0.29, even larger than that of the normal distribution over the whole region covered by the data quantities (see Fig. 8). Here, the Gumbel minimum and Frechet maximum distributions perform similarly and well. They have similar


[Fig. 8 plot: Rl (vertical axis, 0 to 0.3) against data quantities in units of 0.1 °C (horizontal axis, −100 to 200).]

Fig. 8. Variation of Rl values of the normal distribution assumption for Tmin with data quantities.

[Fig. 9 plot: probability on a logarithmic vertical axis (10^0 to 10^-4) against x (horizontal axis, −100 to 50); curves: Data, Normal, Student’s t, Frechet Max, Gumbel Max, Gumbel Min.]

Fig. 9. Comparison of the distributions for the low tail of Tmin.

Table 3
Boundary and the parameters of the distributions for the data set of Tmin.

| Type of distribution | Low tail boundary ξl (0.1 °C) | Bulk distribution translation ratio $\bar{F}_l^{j}(\xi_l^{j} \mid \eta^{j}(\xi_l^{j})) / \bar{F}^{j}(\xi_l^{j})$ | Average residual measure Rl |
|---|---|---|---|
| Gumbel max | 4 | 0.95 | 0.014 |
| Gumbel min | −44 | 1.10 | 0.069 |
| Frechet max | −47 | 0.99 | 0.067 |
| Student’s t | 33 | 1.84 | 0.29 |


boundaries (around −4 °C) and similar Rl values (around 0.07), and both ratios $\bar{F}_l^{j}(\xi_l^{j} \mid \eta^{j}(\xi_l^{j})) / \bar{F}^{j}(\xi_l^{j})$ are close to one. By comparison, the Gumbel maximum distribution has a clearly lower Rl value (0.014), which means it is the best distribution in the tail region for this case. The comparison of the distributions in the low tail is illustrated in Fig. 9. Comparing the index Rl of the Gumbel maximum distribution from the proposed approach (0.014) with that of the normal distribution in the low tail of Tmin (fluctuating between 0.1 and 0.3) shows that the proposed approach improves the probabilistic modeling of Tmin for this case.

6. Conclusions

Random variables with inconsistent data are considered here. A new formulation of probabilistic modeling is proposed with the aim of minimizing the potential error in probabilistic analysis based on limited available data with inconsistency. The data set of a random variable is divided following the identification of the inconsistency. A separate probability distribution function is identified for each division using the proposed approach, and the distributions are finally unified into one framework, within which the basic properties of a probability distribution function are satisfied, to serve the subsequent probabilistic analysis. The realization of this approach in practical numerical analysis is also described. Finally, two examples, one high-tail problem of seismic demand analysis of a reinforced concrete column and one low-tail problem of extreme low temperature analysis, are presented to illustrate the application of the proposed approach. Dividing the data set into parts is a promising but preliminary exploration, and the formulation could be refined further in several respects so that it is as consistent as possible with the information in the available data. This consistency is the basis for accuracy in the probabilistic analysis.
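As a rough illustration of the unification step summarized above, the sketch below joins a normal bulk distribution and a Gumbel maximum low tail into a single CDF. It is a simplified stand-in and not the formulation proposed in this paper: here the tail CDF is simply rescaled so that the two pieces agree at the boundary, whereas the paper adjusts the bulk distribution through its translation ratio. All parameter values, the boundary, and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def gumbel_max_cdf(x, loc, scale):
    """CDF of the Gumbel (maximum) distribution."""
    return math.exp(-math.exp(-(x - loc) / scale))


def make_spliced_cdf(bulk, tail_cdf, boundary):
    """Return a CDF that uses tail_cdf below the boundary and bulk above it.

    The tail piece is rescaled so both pieces take the same value at the
    boundary (a simple continuity splice, assumed here for illustration).
    """
    factor = bulk.cdf(boundary) / tail_cdf(boundary)

    def cdf(x):
        if x <= boundary:
            return factor * tail_cdf(x)
        return bulk.cdf(x)

    return cdf


def tail(x):
    # Hypothetical low-tail model, in units of 0.1 degC.
    return gumbel_max_cdf(x, loc=-60.0, scale=20.0)


bulk = NormalDist(mu=-40.0, sigma=35.0)   # hypothetical bulk model
boundary = -80.0                          # hypothetical low-tail boundary
cdf = make_spliced_cdf(bulk, tail, boundary)

# Check the basic CDF properties on a grid: values in [0, 1], non-decreasing.
grid = [-200.0 + 0.5 * i for i in range(601)]
values = [cdf(x) for x in grid]
print(all(0.0 <= v <= 1.0 for v in values))
print(all(a <= b for a, b in zip(values, values[1:])))
```

The spliced function is continuous at the boundary and non-decreasing, but its density generally jumps there, which is one reason a more careful formulation is needed in practice.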
Declarations of interest

None.

Acknowledgments

The author appreciates greatly the valuable comments by the reviewers. The author also acknowledges the helpful discussion with Professor Bozidar Stojadinovic and Professor Kevin R. Mackie during the author’s stay at ETH Zurich.

References

[1] A.H.S. Ang, W.H. Tang, Probability Concepts in Engineering: Emphasis on Applications in Civil and Environmental Engineering, Wiley, US, 2007, p. 417.
[2] K. Velmanirajan, K. Anuradha, A. Syed Abu Thaheer, R. Ponalagusamy, R. Narayanasamy, Statistical evaluation of forming limit diagram for annealed Al 1350 alloy sheets using first order reliability method, Appl. Math. Model. 38 (2014) 145–167.
[3] DuraCrete, Statistical Quantification of the Variables in the Limit State Functions, Probabilistic Performance Based Durability Design of Concrete Structures, DuraCrete, Delft, 2000, p. 136.
[4] S. Coles, L.R. Pericchi, S. Sisson, A fully probabilistic approach to extreme rainfall modelling, J. Hydrol. 273 (2003) 35–50.
[5] D. Straub, A. Der Kiureghian, Improved seismic fragility modelling from empirical data, Struct. Saf. 30 (2008) 320–336.
[6] J. Caers, M.A. Maes, Identifying tails, bounds and end-points of random variables, Struct. Saf. 20 (1998) 1–23.
[7] M.D. Pandey, Extreme quantile estimation using order statistics with minimum cross-entropy principle, Probab. Eng. Mech. 16 (2001) 31–42.
[8] J. Diebolt, M. Garrido, C. Trottier, Improving extremal fit: a Bayesian regularization procedure, Reliab. Eng. Syst. Saf. 82 (2003) 21–31.
[9] K. Fujimura, A. Der Kiureghian, Tail-equivalent linearization method for nonlinear random vibration, Probab. Eng. Mech. 22 (2007) 63–76.
[10] M.J. Kaiser, The impact of extreme weather on offshore production in the Gulf of Mexico, Appl. Math. Model. 32 (2008) 1996–2018.
[11] M.H. Faber, Risk and Safety in Engineering, ETH Zurich, Zurich, 2009, p. 355.
[12] S.E. Tuller, A.C. Brett, The characteristics of wind velocity that favor the fitting of a Weibull distribution in wind speed analysis, J. Climate Appl. Meteorol. 23 (1984) 124–134.
[13] S. Miao, K. Xie, H. Yang, R. Karki, H.M. Tai, T. Chen, A mixture kernel density model for wind speed probability distribution estimation, Energy Convers. Manag. 126 (2016) 1066–1083.
[14] N.C. Lind, H.P. Hong, Tail entropy approximations, Struct. Saf. 10 (1991) 297–306.
[15] C.A. Cornell, H. Krawinkler, Progress and Challenges in Seismic Performance Assessment, 2, PEER Center News, Pacific Earthquake Engineering Research Center, Berkeley, 2000.
[16] K.R. Mackie, J.M. Wong, B. Stojadinovic, Integrated Probabilistic Performance-based Evaluation of Benchmark Reinforced Concrete Bridges, 199, Pacific Earthquake Engineering Research Center, Berkeley, 2008.
[17] K.J. Cronin, Response Sensitivity of Highway Bridges to Random Multi-component Earthquake Excitation, 102, University of Central Florida, Orlando, 2009.
[18] L.V. Alexander, X. Zhang, T.C. Peterson, et al., Global observed changes in daily climate extremes of temperature and precipitation, J. Geophys. Res. – Atmos. 111 (2006) D05109.
[19] D.R. Easterling, G.A. Meehl, C. Parmesan, et al., Climate extremes: observations, modelling, and impacts, Science 289 (2000) 2068–2074.