On the consistency of penalized MLEs for Erlang mixtures

Statistics and Probability Letters xx (xxxx) xxx–xxx

Cuihong Yin a, X. Sheldon Lin b,*, Rongtan Huang c, Haili Yuan d

a School of Insurance, Southwestern University of Finance and Economics, Chengdu, China
b Department of Statistical Sciences, University of Toronto, Toronto, Ontario, Canada M5S 3G3
c School of Mathematical Sciences, Xiamen University, Xiamen, China
d School of Mathematics and Statistics, Wuhan University, Wuhan, China

Article info

Article history: Received 13 July 2017; Received in revised form 29 July 2018; Accepted 5 August 2018; Available online xxxx

Keywords: Erlang mixture; Insurance loss modeling; Penalized maximum likelihood estimates; Consistency

Abstract

In Yin and Lin (2016), a new penalty, termed the iSCAD penalty, is proposed for obtaining the maximum likelihood estimates (MLEs) of the weights and the common scale parameter of an Erlang mixture model. That paper shows, through simulation studies and a real data application, that the penalty provides an efficient way to determine the MLEs and the order of the mixture. In this paper, we provide a theoretical justification and show that the penalized maximum likelihood estimators of the weights and the scale parameter, as well as the order of the mixture, are all consistent.

© 2018 Elsevier B.V. All rights reserved.

1. Introduction

The Erlang mixture model under consideration in this paper has the density

\[
h(x;\alpha,\gamma,\theta)=\sum_{j=1}^{m}\alpha_j\,g(x;\gamma_j,\theta),\qquad x>0,
\tag{1.1}
\]

where α = (α_1, …, α_m) is the mixing distribution and m is the number of components, or the order of the mixture. Each component density is an Erlang of the form

\[
g(x;\gamma_j,\theta)=\frac{x^{\gamma_j-1}e^{-x/\theta}}{\theta^{\gamma_j}(\gamma_j-1)!},\qquad x>0,
\tag{1.2}
\]

with common scale parameter θ > 0 and positive integer shape parameter γ_j. To ensure a unique representation of the density (1.1), we assume that γ_1 < γ_2 < ⋯ < γ_m, as we did in Yin and Lin (2016).

The Erlang mixture model and its multivariate version have been widely used in modeling insurance losses due to their desirable distributional properties. For example, risk measures such as VaR and TVaR can be calculated easily. For more details on the applications, see Lee and Lin (2010), Cossette et al. (2013), Porth et al. (2014), Verbelen et al. (2015), Hashorva and Ratovomirija (2015), Verbelen et al. (2016), and references therein.

As a mixture model, an expectation–maximization (EM) algorithm is naturally used to fit the model to data by estimating the scale parameter and the mixing weights. However, the shape parameter of each Erlang component is not estimated. In order to include all possible Erlang distributions for component selection, one must start with a large number of components when running the EM algorithm. Over-fitting could be a concern in this situation. To

* Corresponding author. E-mail address: [email protected] (X. Sheldon Lin).

https://doi.org/10.1016/j.spl.2018.08.004
0167-7152/© 2018 Elsevier B.V. All rights reserved.

Please cite this article in press as: Yin C., et al., On the consistency of penalized MLEs for Erlang mixtures. Statistics and Probability Letters (2018), https://doi.org/10.1016/j.spl.2018.08.004.


maintain the goodness of fit and avoid over-fitting at the same time, an ad hoc method for shape parameter selection together with BIC is used; see Lee and Lin (2010) and Verbelen et al. (2015). Several issues arise. First, the ad hoc method requires repeated runs of the EM algorithm, which can be computationally burdensome. Second, the chosen shape parameters are often suboptimal in terms of the order of the mixture. Third, using BIC often results in a poor fit to the sparse right tail of the data, a major shortcoming in insurance loss modeling and risk measure calculation. Last, statistical properties of the corresponding estimators cannot be obtained under the ad hoc approach.

Yin and Lin (2016) propose a new thresholding penalty function, termed iSCAD, to penalize the likelihood when estimating the scale parameter and the mixing weights of the Erlang mixture. This approach is motivated by the smoothly clipped absolute deviation (SCAD) penalty (Fan and Li, 2001) in regression analysis and the MSCAD penalty introduced in Chen and Khalili (2008) for Gaussian mixtures. The thresholding feature of the proposed penalty ensures the sparsity of the mixture, which allows us to avoid over-fitting and maintain fitting accuracy at the same time. Moreover, the structure of the penalty yields unbiasedness and continuity in estimation.

In this paper, we turn to the issue of the consistency of the estimates, including the order estimate, when the iSCAD penalized likelihood is used for the Erlang mixture model, as the consistency of the order is one of the most important statistical issues in mixture modeling. In the current statistics literature, most research focuses on Gaussian mixtures, and a number of methods have been proposed; see Leroux (1992), James et al. (2001), Keribin (2000), Ciuperca et al. (2003), Ahn (2009), Chen et al. (2012) and references therein. However, little research has been done on non-negative, non-Gaussian mixtures, and few existing results are directly applicable to the Erlang mixture described above.

In this paper, we examine the consistency of the estimators of the weight parameters and the common scale parameter, as well as the order estimator, when the iSCAD penalized likelihood is used. In Section 2, we introduce the iSCAD penalty, the corresponding penalized likelihood and the estimators obtained from maximizing the penalized likelihood. Main results and their proofs are given in Section 3, in which we show that the estimators are consistent.
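To make the model of Section 1 concrete, the following is a minimal, self-contained sketch (not code from the paper; the weights, shapes and scale used below are purely illustrative) of the Erlang mixture density (1.1)-(1.2) and of a VaR obtained by bisecting the mixture's closed-form CDF:

```python
from math import exp, factorial

def erlang_pdf(x, shape, theta):
    # Erlang density (1.2): x^(shape-1) e^(-x/theta) / (theta^shape (shape-1)!)
    return x ** (shape - 1) * exp(-x / theta) / (theta ** shape * factorial(shape - 1))

def erlang_cdf(x, shape, theta):
    # Closed form: 1 - e^(-x/theta) * sum_{i=0}^{shape-1} (x/theta)^i / i!
    return 1.0 - exp(-x / theta) * sum((x / theta) ** i / factorial(i) for i in range(shape))

def mixture_pdf(x, weights, shapes, theta):
    # Erlang mixture density (1.1)
    return sum(a * erlang_pdf(x, g, theta) for a, g in zip(weights, shapes))

def mixture_cdf(x, weights, shapes, theta):
    return sum(a * erlang_cdf(x, g, theta) for a, g in zip(weights, shapes))

def mixture_var(p, weights, shapes, theta, hi=1e6):
    # VaR_p: invert the monotone mixture CDF by bisection
    lo = 0.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mixture_cdf(mid, weights, shapes, theta) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Illustrative mixture: weights alpha, increasing shapes gamma_j, common scale theta
alpha, shapes, theta = [0.3, 0.5, 0.2], [1, 3, 7], 2.0
var95 = mixture_var(0.95, alpha, shapes, theta)
```

Because the mixture CDF is available in closed form, tail quantities such as TVaR can be obtained in a similarly direct way; this tractability is what the references above exploit.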


2. The iSCAD penalty

Yin and Lin (2016) propose a new penalty function, termed iSCAD, which penalizes the weights of an Erlang mixture individually. For each weight π_j, j = 1, …, m, the iSCAD penalty function is defined as

\[
P_\lambda(\pi_j)=\lambda\left\{\log\frac{a\lambda+\varepsilon}{\varepsilon}+\frac{a^2\lambda^2}{2}-\frac{a\lambda}{a\lambda+\varepsilon}\right\}I(\pi_j>a\lambda)
+\lambda\left\{\log\frac{\pi_j+\varepsilon}{\varepsilon}-\frac{\pi_j^2}{2}+\left(a\lambda-\frac{1}{a\lambda+\varepsilon}\right)\pi_j\right\}I(\pi_j\le a\lambda),
\tag{2.1}
\]

where λ is a tuning parameter, a function of n satisfying λ → 0 as n → ∞. The parameter a = m^m > 1 is to ensure that the estimator π̂_j of π_j is continuous, and the parameter ε = λ^{3/2} is to ensure that the range of π_j includes 0. These conditions are motivated by the conditions in Theorem 4 of Leroux (1992), which guarantee that the order of the Erlang mixture is not overestimated. The tuning parameter ensures the sparsity of the mixture, as aλ serves as a lower bound for the nonzero mixing weights. This property is crucial for avoiding over-fitting while maintaining fitting accuracy. Moreover, the structure of the iSCAD penalty and its derivative yields unbiasedness and continuity in the estimation of the mixing distribution when an EM algorithm is used.

Insurance loss/claim data are mostly left truncated with known truncation points (in the form of a deductible or retention limit); see the data sets in Beirlant et al. (2006) and Verbelen et al. (2015). Suppose that l is a truncation point. Then the probability density function of a left-truncated Erlang mixture is

\[
h(x;\phi)=\frac{h(x;\alpha,\gamma,\theta)}{H(l;\alpha,\gamma,\theta)}
=\sum_{j=1}^{m}\alpha_j\,\frac{g(x;\gamma_j,\theta)}{H(l;\alpha,\gamma,\theta)}
=\sum_{j=1}^{m}\alpha_j\,\frac{G(l;\gamma_j,\theta)}{H(l;\alpha,\gamma,\theta)}\cdot\frac{g(x;\gamma_j,\theta)}{G(l;\gamma_j,\theta)}
=\sum_{j=1}^{m}\pi_j\,g_\theta(x;l,\gamma_j),
\tag{2.2}
\]

where φ = (π_1, …, π_m, θ). Here, H(x; α, γ, θ) and G(x; γ_j, θ) are the survival functions of h(x; α, γ, θ) and g(x; γ_j, θ), respectively,

\[
g_\theta(x;l,\gamma_j)=\frac{g(x;\gamma_j,\theta)}{G(l;\gamma_j,\theta)},
\qquad\text{and}\qquad
\pi_j=\alpha_j\,\frac{G(l;\gamma_j,\theta)}{H(l;\alpha,\gamma,\theta)}.
\tag{2.3}
\]

Further, let G_θ(x; l, γ_j) be the cumulative distribution function of g_θ(x; l, γ_j).
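As a numerical companion to the penalty definition (2.1), here is a short sketch (function and argument names are ours; ε = λ^{3/2} follows the condition stated above, while the value of a used in any experiment is only illustrative). Note that P_λ(0) = 0, that P_λ is constant on π_j > aλ, and that the two branches agree at the threshold π_j = aλ:

```python
from math import log

def iscad_penalty(pi, lam, a):
    """iSCAD penalty P_lambda(pi_j) of (2.1), with eps = lam**1.5."""
    eps = lam ** 1.5
    if pi > a * lam:
        # constant branch: weights above the threshold a*lam are not shrunk further
        return lam * (log((a * lam + eps) / eps)
                      + (a * lam) ** 2 / 2.0
                      - a * lam / (a * lam + eps))
    # increasing branch on [0, a*lam]; equals 0 at pi = 0, so zero weights cost nothing
    return lam * (log((pi + eps) / eps)
                  - pi ** 2 / 2.0
                  + (a * lam - 1.0 / (a * lam + eps)) * pi)
```

A direct calculation shows the two branches coincide at π_j = aλ, which is the continuity property the penalty is designed for, and P_λ(0) = 0, which is what leaves zero weights unpenalized and produces sparsity.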


Suppose that X = (X_1, …, X_n) is a random sample of size n from a sparse Erlang mixture with density (2.2), and let x = (x_1, …, x_n) be its realization. The penalized log-likelihood function with the iSCAD penalty is defined as

\[
L_n(\phi)=\log f_n(x_1,\ldots,x_n;\phi)=\sum_{i=1}^{n}\log h(x_i;\phi)-n\sum_{j=1}^{m}P_\lambda(\pi_j).
\tag{2.4}
\]

That is, \(f_n(x_1,\ldots,x_n;\phi)=\prod_{i=1}^{n}h(x_i;\phi)\exp\{-n\sum_{j=1}^{m}P_\lambda(\pi_j)\}\), and in particular,

\[
f_1(x;\phi)=h(x;\phi)\exp\left\{-\sum_{j=1}^{m}P_\lambda(\pi_j)\right\}.
\tag{2.5}
\]

Yin and Lin (2016) applied an EM algorithm to (2.4) and obtained penalized maximum likelihood estimators of the mixing weights π = (π_1, …, π_m) and the common scale parameter θ, as described in the following. Suppose that the kth iteration of the EM algorithm gives the estimates φ^{(k)} = (π_1^{(k)}, …, π_m^{(k)}, θ^{(k)}). The values of the parameters in the next iteration are obtained by the following formulas:

\[
\pi_j^{(k+1)}=\bar q_j^{(k)}\,I\big(\bar q_j^{(k)}>a\lambda\big)+\big(\bar q_j^{(k)}-\lambda\big)_{+}\,I\big(\bar q_j^{(k)}\le a\lambda\big),
\tag{2.6}
\]

where \(\bar q_j^{(k)}=\frac{1}{n}\sum_{i=1}^{n}q(j\mid x_i,\phi^{(k)})\), j = 1, …, m. Here, q(j | x_i, φ^{(k)}) is the probability of the observation x_i coming from the jth component:

\[
q(j\mid x_i,\phi^{(k)})=\frac{\pi_j^{(k)}\,g_{\theta^{(k)}}(x_i;l,\gamma_j)}{\sum_{j'=1}^{m}\pi_{j'}^{(k)}\,g_{\theta^{(k)}}(x_i;l,\gamma_{j'})}.
\]

The updated estimate of the scale parameter θ is given by

\[
\theta^{(k+1)}=\frac{\frac{1}{n}\sum_{i=1}^{n}x_i-t^{(k)}}{\sum_{j=1}^{m}\gamma_j\,\bar q_j^{(k)}},
\]

where

\[
t^{(k)}=\sum_{j=1}^{m}\bar q_j^{(k)}\,\frac{l^{\gamma_j}e^{-l/\theta}}{\theta^{\gamma_j-1}(\gamma_j-1)!\,G(l;\gamma_j,\theta)}\bigg|_{\theta=\theta^{(k)}}.
\]
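For readers who want to experiment, here is a schematic sketch of one EM iteration (function names, data and starting values are ours; the weight update assumes simple thresholding at aλ with a soft-threshold by λ, in the spirit of (2.6), and omits any renormalization of the updated weights; see Yin and Lin (2016) for the exact update):

```python
from math import exp, factorial

def erlang_pdf(x, shape, theta):
    return x ** (shape - 1) * exp(-x / theta) / (theta ** shape * factorial(shape - 1))

def erlang_sf(x, shape, theta):
    # survival function G(x; gamma_j, theta) = e^(-x/theta) * sum_{i<shape} (x/theta)^i / i!
    return exp(-x / theta) * sum((x / theta) ** i / factorial(i) for i in range(shape))

def g_trunc(x, l, shape, theta):
    # left-truncated component density g_theta(x; l, gamma_j) of (2.3)
    return erlang_pdf(x, shape, theta) / erlang_sf(l, shape, theta)

def em_step(data, l, shapes, pi, theta, lam, a):
    """One schematic EM iteration: E-step responsibilities, thresholded weight
    update, and the scale update with the truncation correction t^{(k)}."""
    n, m = len(data), len(shapes)
    # E-step: q(j | x_i, phi^{(k)})
    q = []
    for x in data:
        num = [p * g_trunc(x, l, g, theta) for p, g in zip(pi, shapes)]
        tot = sum(num)
        q.append([v / tot for v in num])
    qbar = [sum(q[i][j] for i in range(n)) / n for j in range(m)]
    # M-step for the weights: keep large responsibilities, soft-threshold small ones
    new_pi = [qb if qb > a * lam else max(qb - lam, 0.0) for qb in qbar]
    # truncation correction t^{(k)} and scale update
    t = sum(qb * l ** g * exp(-l / theta)
            / (theta ** (g - 1) * factorial(g - 1) * erlang_sf(l, g, theta))
            for qb, g in zip(qbar, shapes))
    new_theta = (sum(data) / n - t) / sum(g * qb for g, qb in zip(shapes, qbar))
    return new_pi, new_theta
```

Iterating `em_step` until the parameters stabilize, and then counting the nonzero weights, gives the order estimate m̂ of (2.7).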

Let φ̂ = (π̂_1, …, π̂_m, θ̂) be the estimated values in the final EM iteration. The estimate of the order of the mixture is

\[
\hat m=\#\{\hat\pi_j\mid\hat\pi_j\ne 0,\ j=1,\ldots,m\}.
\tag{2.7}
\]

For notational cleanness, we rename the shape parameters {γ_j | π̂_j ≠ 0, j = 1, …, m} in increasing order as γ̃ = (γ̃_1, …, γ̃_m̂) and the corresponding mixing weights as π̃ = (π̃_1, …, π̃_m̂). Finally, from (2.3) the estimates of the corresponding original weight parameters α̃ = (α̃_1, …, α̃_m̂) are obtained as

\[
\tilde\alpha_j=c\,\frac{\tilde\pi_j}{G(l;\tilde\gamma_j,\hat\theta)},
\tag{2.8}
\]

where c is a normalizing constant such that \(\sum_{j=1}^{\hat m}\tilde\alpha_j=1\).

3. Consistency results and their proofs

In this section we show that the estimators obtained from the proposed penalized MLE are consistent, beginning with several lemmas. Recall that the density of the Erlang distribution with left truncation point l and shape parameter γ_j is

\[
g_\theta(x;l,\gamma_j)=\frac{g(x;\gamma_j,\theta)}{G(l;\gamma_j,\theta)},\qquad x>l.
\tag{3.1}
\]

Let the range of the scale parameter space be Θ = {θ : c_1 ≤ θ ≤ c_2}, where c_1 and c_2 are positive constants, with the true parameter θ_0 ∈ Θ. The first lemma concerns boundedness properties of the truncated Erlang (3.1).

Lemma 3.1. For θ ∈ Θ and sufficiently small ρ_θ > 0, define g_{θ,ρ_θ}(x; l, γ_j) = sup_{θ′}{g_{θ′}(x; l, γ_j) : |θ − θ′| ≤ ρ_θ, θ′ ∈ Θ} and g*_{θ,ρ_θ}(x; l, γ_j) = max(1, g_{θ,ρ_θ}(x; l, γ_j)). Then

\[
\int_l^{\infty}\log g^*_{\theta,\rho_\theta}(x;l,\gamma_j)\,dG_{\theta_0}(x;l,\gamma_j)<\infty
\]

and

\[
\int_l^{\infty}\big|\log g_\theta(x;l,\gamma_j)\big|\,dG_{\theta_0}(x;l,\gamma_j)<\infty.
\]

Proof. It is well known that the mode of the Erlang distribution (1.2) is θ(γ_j − 1) when γ_j ≥ 1. If γ_j > 1, the density of the left-truncated Erlang (3.1) is bounded as

\[
g_\theta(x;l,\gamma_j)=\frac{g(x;\gamma_j,\theta)}{G(l;\gamma_j,\theta)}
=\frac{x^{\gamma_j-1}e^{-x/\theta}}{G(l;\gamma_j,\theta)\,\theta^{\gamma_j}(\gamma_j-1)!}
\le\frac{(\gamma_j-1)^{\gamma_j-1}e^{-(\gamma_j-1)}}{G(l;\gamma_j,\theta)\,\theta\,(\gamma_j-1)!}.
\]

If γ_j = 1 and x ≥ l, the density of the left-truncated Erlang (3.1) is bounded as

\[
g_\theta(x;l,\gamma_j=1)=\frac{e^{-x/\theta}}{G(l;\gamma_j=1,\theta)\,\theta}\le\frac{e^{-l/\theta}}{G(l;\gamma_j=1,\theta)\,\theta}.
\]

Define

\[
B(\theta)=\max\left\{\frac{(\gamma_j-1)^{\gamma_j-1}e^{-(\gamma_j-1)}}{G(l;\gamma_j,\theta)\,\theta\,(\gamma_j-1)!},\ \frac{e^{-l/\theta}}{G(l;\gamma_j=1,\theta)\,\theta},\ 1\right\}.
\]

Obviously, g_θ(x; l, γ_j) ≤ B(θ). Thus, we have

\[
\int_l^{\infty}\log g^*_{\theta,\rho_\theta}(x;l,\gamma_j)\,dG_{\theta_0}(x;l,\gamma_j)
=\int_l^{\infty}\log\big(\max\{1,g_{\theta,\rho_\theta}(x;l,\gamma_j)\}\big)\,dG_{\theta_0}(x;l,\gamma_j)
\le\int_{\{x:\,g_{\theta,\rho_\theta}(x;l,\gamma_j)>1,\,x\ge l\}}\log\sup_{|\theta'-\theta|\le\rho_\theta}B(\theta')\,dG_{\theta_0}(x;l,\gamma_j)<\infty.
\]

To show \(\int_l^{+\infty}|\log g_\theta(x;l,\gamma_j)|\,dG_{\theta_0}(x;l,\gamma_j)<\infty\), we have the following:

\[
\begin{aligned}
\int_l^{+\infty}\big|\log g_\theta(x;l,\gamma_j)\big|\,dG_{\theta_0}(x;l,\gamma_j)
&\le\int_l^{+\infty}\Big|(\gamma_j-1)\log x-\frac{x}{\theta}\Big|\,dG_{\theta_0}(x;l,\gamma_j)+C_1(j)\\
&=\int_l^{1}\Big|(\gamma_j-1)\log x-\frac{x}{\theta}\Big|\,dG_{\theta_0}(x;l,\gamma_j)
+\int_1^{+\infty}\Big|(\gamma_j-1)\log x-\frac{x}{\theta}\Big|\,dG_{\theta_0}(x;l,\gamma_j)+C_1(j)\\
&\le\Big(\gamma_j-1+\frac{1}{\theta}\Big)\int_1^{+\infty}x\,dG_{\theta_0}(x;l,\gamma_j)+C_2(j)<\infty,
\end{aligned}
\]

where C_1(j) = |log(θ^{γ_j}(γ_j − 1)! G(l; γ_j, θ))| and C_2(j) = C_1(j) + (|(γ_j − 1) log l| + 1/θ)(G_{θ_0}(1; l, γ_j) − G_{θ_0}(l; l, γ_j)) are both constants, j = 1, …, m. □

In the next lemma, we show that the properties in Lemma 3.1 extend to the Erlang mixture model. The proof is similar to that in Redner (1981). Denote the parameter space of the mixture as

\[
\Phi=\Big\{\phi=(\pi_1,\ldots,\pi_m,\theta):\sum_{j=1}^{m}\pi_j=1,\ \pi_j\ge 0,\ 0<c_1\le\theta\le c_2\Big\},
\]

with the true parameter φ_0 = (π_{0,1}, …, π_{0,m}, θ_0) ∈ Φ, in which there are only m_0 non-zero weights. For any φ′, φ ∈ Φ, we define the distance between φ′ and φ as \(|\phi'-\phi|=\sum_{j=1}^{m}\arctan|\pi'_j-\pi_j|+|\theta'-\theta|\).

Lemma 3.2. Recall that the mixture model (2.2) with left truncation point l > 0 is given by

\[
h(x;\phi)=\sum_{j=1}^{m}\pi_j\,g_\theta(x;l,\gamma_j),\qquad x>l,
\]

with pre-given l and γ_j, j = 1, …, m. Let H(x; φ) be its cumulative distribution function. For each φ ∈ Φ and sufficiently small ρ_φ > 0,

\[
\int_l^{\infty}\log h^*(x;\phi,\rho_\phi)\,dH(x;\phi_0)<\infty
\]

and

\[
\int_l^{\infty}\big|\log h(x;\phi)\big|\,dH(x;\phi_0)<\infty,
\]

where h(x; φ, ρ_φ) = sup_{φ′}{h(x; φ′) : |φ′ − φ| ≤ ρ_φ, φ′ ∈ Φ} and h*(x; φ, ρ_φ) = max(1, h(x; φ, ρ_φ)).


Proof. The first boundedness property holds because

\[
\begin{aligned}
\int_l^{\infty}\log h^*(x;\phi,\rho_\phi)\,dH(x;\phi_0)
&=\int_l^{\infty}\log\max\Big\{1,\sup_{|\phi'-\phi|\le\rho_\phi}\sum_{j=1}^{m}\pi'_j\,g_{\theta'}(x;l,\gamma_j)\Big\}\,dH(x;\phi_0)\\
&\le\int_l^{\infty}\log\max\Big\{1,\sum_{j=1}^{m}\sup_{|\phi'-\phi|\le\rho_\phi}g_{\theta'}(x;l,\gamma_j)\Big\}\,dH(x;\phi_0)\\
&\le\sum_{j=1}^{m}\int_l^{\infty}\log g^*_{\theta,\rho_\phi}(x;l,\gamma_j)\,dH(x;\phi_0)<\infty.
\end{aligned}
\]

To show the second boundedness property, let \(A_1=\{x\in[l,\infty):\sum_{j=1}^{m}\pi_j g_\theta(x;l,\gamma_j)\ge 1\}\) and \(A_2=[l,\infty)\setminus A_1\). Then we have

\[
\begin{aligned}
\int_l^{\infty}\Big|\log\sum_{j=1}^{m}\pi_j g_\theta(x;l,\gamma_j)\Big|\,dH(x;\phi_0)
&=\int_{A_1}\Big|\log\sum_{j=1}^{m}\pi_j g_\theta(x;l,\gamma_j)\Big|\,dH(x;\phi_0)
+\int_{A_2}\Big|\log\sum_{j=1}^{m}\pi_j g_\theta(x;l,\gamma_j)\Big|\,dH(x;\phi_0)\\
&\le\sum_{j=1}^{m}\int_{A_1}\big|\log g_\theta(x;l,\gamma_j)\big|\,dH(x;\phi_0)
+\sum_{j=1}^{m}\int_{A_2}\big|\log g_\theta(x;l,\gamma_j)\big|\,dH(x;\phi_0)\\
&=\sum_{j=1}^{m}\int_l^{\infty}\big|\log g_\theta(x;l,\gamma_j)\big|\,dH(x;\phi_0)<\infty.\ \square
\end{aligned}
\]

The next lemma shows that the order of the mixture model cannot be underestimated, using an idea from Theorem 4 of Leroux (1992). Divide the parameter space Φ into Φ* = {φ ∈ Φ : m* < m_0} and Φ** = {φ ∈ Φ : m* ≥ m_0}, where m* is the number of non-zero weights.

Lemma 3.3. For each φ ∈ Φ*, L_n(φ) < L_n(φ_0) for large n.

Proof. It follows from Lemma 2.4 of Newey and McFadden (1994) that

\[
\sup_{\phi\in\Phi^*}\frac{\sum_{i=1}^{n}\log h(X_i;\phi)}{n}\xrightarrow{P}\sup_{\phi\in\Phi^*}\int h(x;\phi_0)\log h(x;\phi)\,dx\quad\text{as }n\to\infty.
\]

By the law of large numbers, we also have

\[
\frac{\sum_{i=1}^{n}\log h(X_i;\phi_0)}{n}\xrightarrow{P}\int h(x;\phi_0)\log h(x;\phi_0)\,dx\quad\text{as }n\to\infty.
\]

Thus, as n → ∞,

\[
\frac{\sum_{i=1}^{n}\log h(X_i;\phi_0)}{n}-\sup_{\phi\in\Phi^*}\frac{\sum_{i=1}^{n}\log h(X_i;\phi)}{n}
\xrightarrow{P}\inf_{\phi\in\Phi^*}\int h(x;\phi_0)\log\frac{h(x;\phi_0)}{h(x;\phi)}\,dx.
\tag{3.2}
\]

For each φ ∈ Φ* we have m* < m_0, and hence φ ≠ φ_0. The identifiability of the mixture model implies that h(x; φ) ≠ h(x; φ_0), and the properties of the Kullback–Leibler divergence lead to

\[
\int h(x;\phi_0)\log\frac{h(x;\phi_0)}{h(x;\phi)}\,dx>0.
\]

As a result,

\[
\frac{\sum_{i=1}^{n}\log h(X_i;\phi_0)}{n}-\frac{\sum_{i=1}^{n}\log h(X_i;\phi)}{n}>0
\tag{3.3}
\]

for each φ ∈ Φ* and for large n. For any weight parameter π_j of φ ∈ Φ*, the definition of the iSCAD penalty implies that P_{λ_n}(π_j) = 0 when π_j = 0, while for π_j > 0,

\[
\lim_{n\to\infty}P_{\lambda_n}(\pi_j)
=\lim_{n\to\infty}\lambda_n\left\{\log\frac{a\lambda_n+\varepsilon_n}{\varepsilon_n}+\frac{a^2\lambda_n^2}{2}-\frac{a\lambda_n}{a\lambda_n+\varepsilon_n}\right\}
=\lim_{n\to\infty}\big[\lambda_n\log(a\lambda_n+\varepsilon_n)\big]-\lim_{n\to\infty}\big(\lambda_n\log\varepsilon_n\big)
+\lim_{n\to\infty}\lambda_n\left(\frac{a^2\lambda_n^2}{2}-\frac{a\lambda_n}{a\lambda_n+\varepsilon_n}\right)=0,
\]

where the middle limit vanishes because ε_n = λ_n^{3/2} gives λ_n log ε_n = (3/2) λ_n log λ_n → 0. We thus have

\[
\sum_{j=1}^{m}P_{\lambda_n}(\pi_j)-\sum_{j=1}^{m}P_{\lambda_n}(\pi_{0,j})\to 0,\quad\text{as }n\to\infty.
\tag{3.4}
\]

Combining (3.3) and (3.4), we have, for large n,

\[
\sum_{j=1}^{m}P_{\lambda_n}(\pi_{0,j})-\sum_{j=1}^{m}P_{\lambda_n}(\pi_j)
<\frac{\sum_{i=1}^{n}\log h(X_i;\phi_0)}{n}-\frac{\sum_{i=1}^{n}\log h(X_i;\phi)}{n}.
\tag{3.5}
\]

Equivalently,

\[
L_n(\phi)=\sum_{i=1}^{n}\log h(X_i;\phi)-n\sum_{j=1}^{m}P_{\lambda_n}(\pi_j)
<\sum_{i=1}^{n}\log h(X_i;\phi_0)-n\sum_{j=1}^{m}P_{\lambda_n}(\pi_{0,j})=L_n(\phi_0).
\tag{3.6}
\]

The lemma is therefore proved. □

The next lemma concerns the expected penalized log-likelihood; we show that it is maximized at φ_0.

Lemma 3.4. For each φ ∈ Φ**, φ ≠ φ_0, we have L(φ) = E_0 log f_1(X; φ) < E_0 log f_1(X; φ_0) = L(φ_0).

Proof. We re-express the iSCAD penalty function (2.1) as

\[
P_{\lambda_n}(\pi_j)=A\cdot I(\pi_j>a\lambda_n)+B(\pi_j)\cdot I(\pi_j\le a\lambda_n),
\]

where \(A=\lambda_n\{\log\frac{a\lambda_n+\varepsilon_n}{\varepsilon_n}+\frac{a^2\lambda_n^2}{2}-\frac{a\lambda_n}{a\lambda_n+\varepsilon_n}\}\) does not contain π_j, and \(B(\pi_j)=\lambda_n\{\log\frac{\pi_j+\varepsilon_n}{\varepsilon_n}-\frac{\pi_j^2}{2}+(a\lambda_n-\frac{1}{a\lambda_n+\varepsilon_n})\pi_j\}\). For sufficiently large n, if π_j > 0 then P_{λ_n}(π_j) = A. Thus \(\sum_{j=1}^{m}P_{\lambda_n}(\pi_j)=m^*A\) and, similarly, \(\sum_{j=1}^{m}P_{\lambda_n}(\pi_{0,j})=m_0A\). For each φ ∈ Φ** we have

\[
\frac{\sum_{j=1}^{m}P_{\lambda_n}(\pi_j)}{\sum_{j=1}^{m}P_{\lambda_n}(\pi_{0,j})}=\frac{m^*}{m_0}\ge 1.
\tag{3.7}
\]

That is,

\[
\sum_{j=1}^{m}P_{\lambda_n}(\pi_j)\ge\sum_{j=1}^{m}P_{\lambda_n}(\pi_{0,j}).
\tag{3.8}
\]

It follows from (2.5) and Lemma 3.2 that

\[
E_0\big|\log f_1(X;\phi_0)\big|\le E_0\big|\log h(X;\phi_0)\big|+\sum_{k=1}^{m}P_{\lambda_n}(\pi_{0,k})<\infty.
\tag{3.9}
\]

Let U = log f_1(X; φ) − log f_1(X; φ_0). Then U ≠ 0 due to the identifiability of the mixture model (see Teicher, 1963), and

\[
E_0[e^U]=E_0\left[\frac{f_1(X;\phi)}{f_1(X;\phi_0)}\right]
=\exp\left\{-\Big(\sum_{j=1}^{m}P_{\lambda_n}(\pi_j)-\sum_{j=1}^{m}P_{\lambda_n}(\pi_{0,j})\Big)\right\}\le 1,
\]

where the inequality follows from (3.8). By Jensen's inequality,

\[
E_0U<\log E_0[e^U]\le 0.
\]

The lemma is proved. □

The next lemma shows the convergence of the expected log-likelihood. The proof is similar to that of Lemma 2 of Wald (1949).

Lemma 3.5. For each φ ∈ Φ** and sufficiently small ρ_φ > 0, let f_1(x; φ, ρ_φ) = sup_{φ′}{f_1(x; φ′) : |φ − φ′| ≤ ρ_φ, φ′ ∈ Φ**}, f_1*(x; φ, ρ_φ) = max(1, f_1(x; φ, ρ_φ)) and f_1*(x; φ) = max(1, f_1(x; φ)). Then

\[
\lim_{\rho_\phi\to 0}E_0\log f_1(X;\phi,\rho_\phi)=E_0\log f_1(X;\phi).
\]


Proof. It follows from the definition that log f_1*(X; φ, ρ_φ) is increasing in ρ_φ. The continuity of f_1(X; φ) at each φ ∈ Φ** leads to

\[
\lim_{\rho_\phi\to 0}\log f_1^{*}(X;\phi,\rho_\phi)=\log f_1^{*}(X;\phi).
\]

By the monotone convergence theorem, we have

\[
\lim_{\rho_\phi\to 0}E_0\log f_1^{*}(X;\phi,\rho_\phi)=E_0\log f_1^{*}(X;\phi).
\tag{3.10}
\]

Now, denote f_1**(x; φ) = min(1, f_1(x; φ)) and f_1**(x; φ, ρ_φ) = min(1, f_1(x; φ, ρ_φ)). Clearly,

\[
\big|\log f_1^{**}(x;\phi,\rho_\phi)\big|\le\big|\log f_1^{**}(x;\phi)\big|,
\]

and (3.9) implies E_0|log f_1**(X; φ)| < ∞. The continuity of f_1(X; φ) at each φ ∈ Φ** implies

\[
\lim_{\rho_\phi\to 0}\log f_1^{**}(X;\phi,\rho_\phi)=\log f_1^{**}(X;\phi),
\]

and hence

\[
\lim_{\rho_\phi\to 0}E_0\log f_1^{**}(X;\phi,\rho_\phi)=E_0\log f_1^{**}(X;\phi)
\tag{3.11}
\]

by the dominated convergence theorem. Since E_0 log f_1(X; φ, ρ_φ) = E_0 log f_1*(X; φ, ρ_φ) + E_0 log f_1**(X; φ, ρ_φ) and E_0 log f_1(X; φ) = E_0 log f_1*(X; φ) + E_0 log f_1**(X; φ), it follows from (3.10) and (3.11) that Lemma 3.5 holds. □

We now present a main consistency result. Again, the proof is motivated by Wald (1949).

Theorem 3.1. Let Ω be any closed subset of the parameter space Φ** with the true parameter φ_0 ∉ Ω. Then we have

\[
P\left(\lim_{n\to\infty}\frac{\sup_{\phi\in\Omega}f_1(X_1;\phi)f_1(X_2;\phi)\cdots f_1(X_n;\phi)}{f_1(X_1;\phi_0)f_1(X_2;\phi_0)\cdots f_1(X_n;\phi_0)}=0\right)=1.
\]

Proof. It follows from Lemmas 3.4 and 3.5 that for each φ ∈ Ω there is a sufficiently small ρ_φ > 0 such that

\[
E_0\log f_1(X;\phi,\rho_\phi)<E_0\log f_1(X;\phi_0),
\tag{3.12}
\]

and

\[
\lim_{\rho_\phi\to 0}E_0\log f_1(X;\phi,\rho_\phi)=E_0\log f_1(X;\phi)<E_0\log f_1(X;\phi_0).
\tag{3.13}
\]

Since Ω is compact, there exist finitely many points φ_1, …, φ_h in Ω such that \(\Omega\subset\bigcup_{k=1}^{h}S(\phi_k,\rho_{\phi_k})\), where S(φ, ρ_φ) denotes a sphere with center φ and radius ρ_φ. Then

\[
0\le\sup_{\phi\in\Omega}\prod_{i=1}^{n}f_1(x_i;\phi)\le\sum_{k=1}^{h}\prod_{i=1}^{n}f_1(x_i;\phi_k,\rho_{\phi_k}).
\]

Thus, Theorem 3.1 will be proved if we can show that

\[
P\left(\lim_{n\to\infty}\frac{f_1(X_1;\phi_k,\rho_{\phi_k})f_1(X_2;\phi_k,\rho_{\phi_k})\cdots f_1(X_n;\phi_k,\rho_{\phi_k})}{f_1(X_1;\phi_0)f_1(X_2;\phi_0)\cdots f_1(X_n;\phi_0)}=0\right)=1,\qquad k=1,2,\ldots,h.
\]

Rewrite these equations as

\[
P\left\{\lim_{n\to\infty}\sum_{i=1}^{n}\big[\log f_1(X_i;\phi_k,\rho_{\phi_k})-\log f_1(X_i;\phi_0)\big]=-\infty\right\}=1,\qquad k=1,\ldots,h.
\]

These equations hold due to (3.12) and the law of large numbers. This completes the proof. □


To show the consistency of the estimates, we give another lemma.

Lemma 3.6. Let φ̄_n ∈ Φ** be a function of the observations x_1, x_2, …, x_n such that

\[
\frac{f_1(x_1;\bar\phi_n)f_1(x_2;\bar\phi_n)\cdots f_1(x_n;\bar\phi_n)}{f_1(x_1;\phi_0)f_1(x_2;\phi_0)\cdots f_1(x_n;\phi_0)}\ge c>0,\quad\text{for large }n.
\tag{3.14}
\]

Then

\[
P\big\{\lim_{n\to\infty}\bar\phi_n=\phi_0\big\}=1.
\]

Proof. It suffices to prove that, for sufficiently small ρ_{φ_0}, with probability one all limit points φ̄ of the sequence φ̄_n satisfy the inequality |φ̄ − φ_0| < ρ_{φ_0}. If there exists a limit point φ̄ of the sequence φ̄_n such that |φ̄ − φ_0| ≥ ρ_{φ_0}, then

\[
\sup_{|\phi-\phi_0|\ge\rho_{\phi_0}}\prod_{i=1}^{n}f_1(x_i;\phi)\ge\prod_{i=1}^{n}f_1(x_i;\bar\phi_n)\quad\text{for large }n.
\]

It follows from (3.14) that

\[
\frac{\sup_{|\phi-\phi_0|\ge\rho_{\phi_0}}\prod_{i=1}^{n}f_1(x_i;\phi)}{f_1(x_1;\phi_0)f_1(x_2;\phi_0)\cdots f_1(x_n;\phi_0)}\ge c>0.
\]

According to Theorem 3.1, this is an event with probability zero. Hence, with probability one all limit points φ̄ of the sequence φ̄_n satisfy |φ̄ − φ_0| < ρ_{φ_0}. □

We now show that the estimates of all the parameters are consistent.

Theorem 3.2. The penalized maximum likelihood estimators φ̂ ∈ Φ**, i.e. the estimators φ̂ that maximize \(\prod_{i=1}^{n}f_1(X_i;\phi)\), are consistent:

\[
P\big\{\lim_{n\to\infty}\hat\phi=\phi_0\big\}=1.
\tag{3.15}
\]

Proof. Since φ̂ ∈ Φ** are the penalized maximum likelihood estimators of \(\prod_{i=1}^{n}f_1(X_i;\phi)\), we have

\[
\prod_{i=1}^{n}f_1(X_i;\hat\phi)\ge\prod_{i=1}^{n}f_1(X_i;\phi_0),
\quad\text{or}\quad
\frac{\prod_{i=1}^{n}f_1(X_i;\hat\phi)}{\prod_{i=1}^{n}f_1(X_i;\phi_0)}\ge 1.
\]

Obviously, φ̂ satisfies (3.14) with c = 1. Thus, by Lemma 3.6, we have

\[
P\big\{\lim_{n\to\infty}\hat\phi=\phi_0\big\}=1.\ \square
\]


Theorem 3.2 leads to the following corollary.

Corollary 3.1. The penalized maximum likelihood estimators α̂ = (α̂_1, α̂_2, …, α̂_m) are consistent, i.e.

\[
P\big\{\lim_{n\to\infty}\hat\alpha=\alpha_0\big\}=1,
\]

where α_0 = (α_{0,1}, …, α_{0,m}) is the true weight parameter vector of the original Erlang mixture (1.1).

Proof. From Theorem 3.2 and (2.8), for j = 1, …, m,

\[
\begin{aligned}
\|\hat\alpha_j-\alpha_{0,j}\|
&=c\left|\frac{\hat\pi_j}{G(l;\gamma_j,\hat\theta)}-\frac{\pi_{0,j}}{G(l;\gamma_j,\theta_0)}\right|\\
&=c\left|\frac{\hat\pi_j-\pi_{0,j}}{G(l;\gamma_j,\hat\theta)}+\frac{\pi_{0,j}}{G(l;\gamma_j,\hat\theta)}-\frac{\pi_{0,j}}{G(l;\gamma_j,\theta_0)}\right|\\
&\le c\left|\frac{\hat\pi_j-\pi_{0,j}}{G(l;\gamma_j,\hat\theta)}\right|+\pi_{0,j}\cdot c\left|\frac{1}{G(l;\gamma_j,\hat\theta)}-\frac{1}{G(l;\gamma_j,\theta_0)}\right|\to 0.\ \square
\end{aligned}
\]

Theorem 3.3. If lim inf_{n→∞} nλ_n > 0, then the estimator m̂ of the order of the mixture is also consistent, i.e.

\[
P\big\{\lim_{n\to\infty}\hat m=m_0\big\}=1.
\]

Remark. The condition lim inf_{n→∞} nλ_n > 0 is always satisfied by the 'optimal' choice of λ_n in Yin and Lin (2016), in which

\[
\lambda_n=\frac{c(1+m)}{m^{3/2}\sqrt{n}}.
\]

Proof. As shown in Theorem 3.2, the penalized maximum likelihood estimators φ̂ are consistent, i.e., P{lim_{n→∞} φ̂ = φ_0} = 1. As a result, we have

\[
P\big\{\liminf_{n\to\infty}\hat m\ge m_0\big\}=1,
\tag{3.16}
\]

where m̂ is the estimate of the order of the mixture defined in (2.7).


Clearly,

\[
P(\hat m>m_0)\le P\big(L_n(\hat\phi)\ge L_n(\phi_0)\big).
\]

Rewrite the right-hand side of the above as

\[
P\big(L_n(\hat\phi)\ge L_n(\phi_0)\big)
=P\left(\frac{L_n^{0}(\hat\phi)-L_n^{0}(\phi_0)}{n\sum_{k=1}^{m}P_{\lambda_n}(\pi_{0,k})}
-\frac{\sum_{j=1}^{m}P_{\lambda_n}(\hat\pi_j)}{\sum_{k=1}^{m}P_{\lambda_n}(\pi_{0,k})}+1\ge 0\right),
\]

where L_n^0(φ) is the likelihood without penalty. Since (3.15) implies that L_n^0(φ̂) → L_n^0(φ_0) as n → ∞, and under the condition of this theorem

\[
\liminf_{n\to\infty}n\sum_{j=1}^{m}P_{\lambda_n}(\pi_{0,j})
=\liminf_{n\to\infty}\sum_{\pi_{0,j}>0,\,j=1,\ldots,m}n\lambda_n\left\{\log\frac{a\lambda_n+\varepsilon_n}{\varepsilon_n}+\frac{a^2\lambda_n^2}{2}-\frac{a\lambda_n}{a\lambda_n+\varepsilon_n}\right\}=\infty,
\]

we have

\[
\frac{L_n^{0}(\hat\phi)-L_n^{0}(\phi_0)}{n\sum_{k=1}^{m}P_{\lambda_n}(\pi_{0,k})}\to 0,\quad\text{as }n\to\infty.
\tag{3.17}
\]

Based on (3.7), if m̂ > m_0,

\[
\frac{\sum_{j=1}^{m}P_{\lambda_n}(\hat\pi_j)}{\sum_{k=1}^{m}P_{\lambda_n}(\pi_{0,k})}=\frac{\hat m}{m_0}>1,\quad\text{for sufficiently large }n.
\tag{3.18}
\]

It follows from (3.17) and (3.18) that

\[
\frac{L_n^{0}(\hat\phi)-L_n^{0}(\phi_0)}{n\sum_{j=1}^{m}P_{\lambda_n}(\pi_{0,j})}
-\frac{\sum_{j=1}^{m}P_{\lambda_n}(\hat\pi_j)}{\sum_{j=1}^{m}P_{\lambda_n}(\pi_{0,j})}+1<0,\quad\text{for sufficiently large }n.
\]

Thus,

\[
P\big(L_n(\hat\phi)\ge L_n(\phi_0)\big)=0,\quad\text{for sufficiently large }n.
\]

Equivalently,

\[
P\{\hat m>m_0\ \text{for sufficiently large }n\}=0.
\]

Combining this with (3.16), we have

\[
P\big\{\lim_{n\to\infty}\hat m=m_0\big\}=1.\ \square
\]

Uncited references

Badescu et al. (2015).

Acknowledgments

This research was partly supported by grants from the Natural Sciences and Engineering Research Council of Canada (RGPIN-2017-06684) and the Natural Science Foundation of China (No. 11471272).

References

Ahn, S.M., 2009. Note on the consistency of a penalized maximum likelihood estimate. Commun. Stat. Appl. Methods 16 (4), 573–578.
Badescu, A.L., Gong, L., Lin, X.S., Tang, D., 2015. Modeling correlated frequencies with application in operational risk management. J. Oper. Risk 10 (1), 1–43.
Beirlant, J., Goegebeur, Y., Segers, J., Teugels, J., 2006. Statistics of Extremes: Theory and Applications. John Wiley & Sons.
Chen, J., Khalili, A., 2008. Order selection in finite mixture models with a nonsmooth penalty. J. Amer. Statist. Assoc. 103 (484), 1674–1683.
Chen, J., Li, P., Fu, Y., 2012. Inference on the order of a normal mixture. J. Amer. Statist. Assoc. 107 (499), 1096–1105.
Ciuperca, G., Ridolfi, A., Idier, J., 2003. Penalized maximum likelihood estimator for normal mixtures. Scand. J. Stat. 30 (1), 45–59.
Cossette, H., Cote, M.P., Marceau, E., Moutanabbir, K., 2013. Multivariate distribution defined with Farlie–Gumbel–Morgenstern copula and mixed Erlang marginals: Aggregation and capital allocation. Insurance Math. Econom. 52 (3), 560–572.
Fan, J., Li, R., 2001. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96 (456), 1348–1360.
Hashorva, E., Ratovomirija, G., 2015. On Sarmanov mixed Erlang risks in insurance applications. ASTIN Bulletin J. IAA 45 (01), 175–205.
James, L.F., Priebe, C.E., Marchette, D.J., 2001. Consistent estimation of mixture complexity. Ann. Statist. 29 (5), 1281–1296.
Keribin, C., 2000. Consistent estimation of the order of mixture models. Sankhya Indian J. Stat., Ser. A 49–66.
Lee, S.C., Lin, X.S., 2010. Modeling and evaluating insurance losses via mixtures of Erlang distributions. N. Am. Actuar. J. 14 (1), 107–130.
Leroux, B.G., 1992. Consistent estimation of a mixing distribution. Ann. Statist. 20 (3), 1350–1360.
Newey, W.K., McFadden, D., 1994. Large Sample Estimation and Hypothesis Testing. In: Handbook of Econometrics, vol. 4, pp. 2111–2245.
Porth, L., Zhu, W., Tan, K.S., 2014. A credibility-based Erlang mixture model for pricing crop reinsurance. Agric. Finance Review 74 (2), 162–187.


Redner, R., 1981. Note on the consistency of the maximum likelihood estimate for nonidentifiable distributions. Ann. Statist. 225–228.
Teicher, H., 1963. Identifiability of finite mixtures. Ann. Math. Stat. 1265–1269.
Verbelen, R., Antonio, K., Gong, L., Lin, X.S., 2015. Fitting mixtures of Erlangs to censored and truncated data using the EM algorithm. ASTIN Bulletin J. IAA 45 (3), 729–758.
Verbelen, R., Antonio, K., Claeskens, G., 2016. Multivariate mixtures of Erlangs for density estimation under censoring. Lifetime Data Anal. 22 (3), 429–455.
Wald, A., 1949. Note on the consistency of the maximum likelihood estimate. Ann. Math. Stat. 20 (4), 595–601.
Yin, C., Lin, X.S., 2016. Efficient estimation of Erlang mixtures using iSCAD penalty with insurance application. ASTIN Bulletin J. IAA 46 (3), 779–799.
