Risk analysis with categorical explanatory variables


Journal Pre-proof Risk analysis with categorical explanatory variables Seul Ki Kang, Liang Peng, Hongmin Xiao

PII: S0167-6687(20)30023-8
DOI: https://doi.org/10.1016/j.insmatheco.2020.02.007
Reference: INSUMA 2620

To appear in: Insurance: Mathematics and Economics

Received date : 3 November 2019 Revised date : 22 January 2020 Accepted date : 11 February 2020 Please cite this article as: S.K. Kang, L. Peng and H. Xiao, Risk analysis with categorical explanatory variables. Insurance: Mathematics and Economics (2020), doi: https://doi.org/10.1016/j.insmatheco.2020.02.007. This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

© 2020 Elsevier B.V. All rights reserved.


Risk Analysis With Categorical Explanatory Variables

Seul Ki Kang¹, Liang Peng¹, and Hongmin Xiao²

Abstract. To better forecast the Value-at-Risk of the aggregate insurance losses, Heras, Moreno and Vilar-Zanón (2018) propose a two-step inference of using logistic regression and quantile regression without providing detailed model assumptions, deriving the related asymptotic properties, or quantifying the inference uncertainty. This paper argues that the application of quantile regression at the second step is not necessary when explanatory variables are categorical. After describing the explicit model assumptions, we propose another two-step inference of using logistic regression and the sample quantile. Also, we provide an efficient empirical likelihood method to quantify the uncertainty. A simulation study confirms the good finite sample performance of the proposed method.

Key words and phrases: Aggregate loss, empirical likelihood, insurance ratemaking, logistic regression, Value-at-Risk.

1 Introduction

Consider a set of insurance policies with $n$ independent policyholders in a given time period. For the $i$-th policyholder, we observe the aggregate loss $S_i$ and the $d$-dimensional categorical explanatory vector $X_i$, representing some characteristics such as the driver's age and the age of the vehicle in automobile insurance. Because $X_i$ is categorical, we assume the possible values are $x_1, \dots, x_m$ without loss of generality. Let $S$ denote the aggregate loss of a new policyholder, and let $X$ represent the corresponding characteristics. Then the practical question is to forecast the Value-at-Risk (VaR) of $S$ at a given level $\alpha \in (0, 1)$ given $X = x$, which is defined as

$$\mathrm{VaR}_S(\alpha \mid x) = \inf\{s : P(S > s \mid X = x) \le 1 - \alpha\}.$$

Because $X$ is categorical, we can classify these $n$ policies into $m$ categories, defined as $A_{x_j} = \{i : X_i = x_j, 1 \le i \le n\}$ for $j = 1, \dots, m$. So we only need to forecast $\{\mathrm{VaR}_S(\alpha \mid x_j)\}_{j=1}^m$. A naive estimator for $\mathrm{VaR}_S(\alpha \mid x_j)$ is the sample quantile of those $S_i$'s in the category $A_{x_j}$ without really using the explanatory variables. Alternatively, one can model the quantile of $S_i$ as a linear function of $X_i$ and employ the quantile regression technique to forecast $\mathrm{VaR}_S(\alpha \mid x_j)$ for $j = 1, \dots, m$. Applying the quantile regression technique to insurance ratemaking is not new. For example, Kudryavtsev (2009) employs the quantile regression technique to estimate quantiles of the net premium rate in ratemaking. We refer to Koenker (2005) for an overview of the quantile regression technique.

¹ Department of Risk Management and Insurance, Georgia State University, Atlanta, GA 30303, USA

² College of Mathematics and Statistics, Northwest Normal University, Lanzhou, Gansu 730000, P.R. China

Because insurance data commonly contain many zero losses due to no insurance claims, the above two simple VaR estimators are not efficient and often underestimate the risk because they do not model the probability of having zero losses. When $P(S > 0 \mid X = x_j) > 1 - \alpha$, we can write

$$\begin{aligned}
\mathrm{VaR}_S(\alpha \mid x_j) &= \inf\{s > 0 : P(S > s \mid X = x_j) \le 1 - \alpha\} \\
&= \inf\{s > 0 : P(S > s \mid X = x_j, S > 0)\,P(S > 0 \mid X = x_j) \le 1 - \alpha\} \\
&= \inf\Big\{s > 0 : P(S > s \mid X = x_j, S > 0) \le \frac{1 - \alpha}{P(S > 0 \mid X = x_j)}\Big\},
\end{aligned} \tag{1}$$

which motivates the two-step inference in Heras, Moreno and Vilar-Zanón (2018) by separately estimating $\alpha^*_{x_j} = 1 - \frac{1 - \alpha}{P(S > 0 \mid X = x_j)}$ and $\mathrm{VaR}_S(\alpha^*_{x_j} \mid x_j, S > 0)$. Hence, when we model and infer $P(S > 0 \mid X = x)$ soundly, the two-step inference should be better than the above two simple estimators, which do not model $P(S > 0 \mid X)$.
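The level adjustment in (1) can be illustrated numerically. The following is a minimal sketch, not part of the paper: the function name `adjusted_level`, the claim probability 0.10, and the unit exponential distribution for positive losses are all assumptions chosen only to make the arithmetic easy to follow.

```python
import math

def adjusted_level(alpha: float, p_positive: float) -> float:
    """Return alpha* = 1 - (1 - alpha) / P(S > 0 | X = x_j), as in (1)."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    if p_positive <= 1.0 - alpha:
        # (1) is derived under P(S > 0 | X = x_j) > 1 - alpha
        raise ValueError("need P(S > 0 | x_j) > 1 - alpha")
    return 1.0 - (1.0 - alpha) / p_positive

# Hypothetical category: 10% of policies have a positive loss,
# and positive losses follow a unit exponential distribution.
alpha, p_positive = 0.95, 0.10
alpha_star = adjusted_level(alpha, p_positive)

# Direct route: solve P(S > s) = p_positive * exp(-s) = 1 - alpha for s.
var_direct = -math.log((1.0 - alpha) / p_positive)
# Two-step route: quantile of the positive losses at the adjusted level.
var_two_step = -math.log(1.0 - alpha_star)
```

In this toy setting the 95% VaR of the loss with zeros is the median of the positive losses, which is why the two routes agree.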

More specifically, the first step in Heras, Moreno and Vilar-Zanón (2018) employs logistic regression to model and estimate $p_i = P(S_i > 0 \mid X_i)$. Logistic regression is a generalized linear model commonly used for estimating the probability of having nonzero claims in actuarial science; see De Jong and Heller (2008) and Goldburd, Khare and Tevet (2016). Note that the number of distinct $p_i$'s is $m$ at most because $X_i$ is categorical with $m$ levels. After obtaining the estimator $\hat\alpha^*_{x_j}$ for $\alpha^*_{x_j}$, which pools information from all levels of the explanatory variables, the second step in Heras, Moreno and Vilar-Zanón (2018) uses the quantile regression technique to estimate $\mathrm{VaR}_S(\hat\alpha^*_{x_j} \mid x_j, S > 0)$. When we model the conditional quantile of $S_i$ at the level $p$ given $S_i > 0$ and $X_i$ by a linear function of $X_i$, we can estimate $\mathrm{VaR}_S(p \mid x_j, S > 0)$ by the quantile regression technique, which pools information from all levels of the explanatory variables. However, it remains unclear how Heras, Moreno and Vilar-Zanón (2018) apply the quantile regression technique to estimate $\mathrm{VaR}_S(\hat\alpha^*_{x_j} \mid x_j, S > 0)$ because the risk level $\hat\alpha^*_{x_j}$ depends on the category $x_j$. Nevertheless, it is clear that Heras, Moreno and Vilar-Zanón (2018) in their data analysis apply quantile regression to those positive $S_i$'s and the related $X_i$'s in each category of $\{A_{x_j}\}_{j=1}^m$ rather than all positive $S_i$'s. Hence, we suspect that the quantile regression estimation in Heras, Moreno and Vilar-Zanón (2018) fails to pool information from all levels of the explanatory variables and is the same as the sample quantile based on those positive $S_i$'s in each category of $\{A_{x_j}\}_{j=1}^m$.

This paper proposes an alternative two-step inference for the VaR via logistic regression and the sample quantile rather than quantile regression. To quantify the inference uncertainty, which has not been addressed in Heras, Moreno and Vilar-Zanón (2018), we develop an empirical likelihood method. Applying the proposed empirical likelihood test to the real dataset in Heras, Moreno and Vilar-Zanón (2018), we find the risk estimates in Heras, Moreno and Vilar-Zanón (2018) are not significantly different from the true values without modeling the conditional quantile of $S$ given $X$ and $S > 0$. That is, using quantile regression in the second step is not necessary for categorical explanatory variables. We refer to Owen (2001) for an overview of the empirical likelihood method, which has been proved to be quite effective in interval estimations and hypothesis tests.

As we often do not observe all policies in the full cycle, it is important to take the exposure rates into account. Heras, Moreno and Vilar-Zanón (2018) adjust for the exposure rates in the first step but ignore them in the second step. Moreover, Heras, Moreno and Vilar-Zanón (2018) neither state the model assumptions explicitly nor provide asymptotic results for their two-step inference. In this paper, we provide explicit model assumptions and a rigorous two-step inference with exposure rates adjusted in both steps. Also, we propose an efficient empirical likelihood method for uncertainty quantification.

We organize this paper as follows. Section 2 presents the explicit model, a two-step inference with asymptotic results, and an empirical likelihood method for uncertainty quantification. Section 3 applies the proposed method to the same insurer database in Heras, Moreno and Vilar-Zanón (2018) for testing whether the risk estimates in Heras, Moreno and Vilar-Zanón (2018) are significantly different from the true values without modeling the conditional quantile of $S$ given $X$ and $S > 0$. A simulation study is conducted in Section 4 to examine the finite sample performance of the proposed empirical likelihood method. Some conclusions are summarized in Section 5. All proofs are put in the Appendix.

2 Methodologies and Main Results

As argued in the introduction, this paper forecasts the Value-at-Risk of insurance losses given some categorical explanatory variables of a policyholder and develops an efficient empirical likelihood method for uncertainty quantification. To better appreciate the idea, we start with the detailed model description.

For each policyholder $i = 1, \dots, n$, we observe the aggregate loss $S_i$ in a particular year and some characteristics of this policyholder denoted by the $d$-dimensional categorical explanatory vector $X_i$ with $m$ categories $x_1, \dots, x_m$. Also, we observe the exposure rates $R_1, \dots, R_n$ because policies may not be observed in a full year. Let $S_i^*$ denote the aggregate loss of the $i$-th policyholder in a full year, which equals $S_i$ if $R_i = 1$. We use $S^*$ and $X$ to denote the aggregate loss in a full year and the related characteristics of a future policyholder, respectively. The question is to forecast the risk $\mathrm{VaR}_{S^*}(\alpha \mid X = x_j)$ for $j = 1, \dots, m$. Throughout, we write $p_i = P(S_i > 0 \mid X_i)$ and $p_i^* = P(S_i^* > 0 \mid X_i)$. Assume

$$p_i = p_i^* R_i \tag{2}$$

and

$$S_i^* = S_i h(R_i) \tag{3}$$

for $i = 1, \dots, n$ and a known positive function $h$. Like (1), when $P(S^* > 0 \mid X = x_j) > 1 - \alpha$ for $j = 1, \dots, m$, we can write

$$\begin{aligned}
\mathrm{VaR}_{S^*}(\alpha \mid x_j) &= \inf\{s > 0 : P(S^* > s \mid X = x_j) \le 1 - \alpha\} \\
&= \inf\{s > 0 : P(S^* > s \mid X = x_j, S^* > 0)\,P(S^* > 0 \mid X = x_j) \le 1 - \alpha\} \\
&= \inf\Big\{s > 0 : P(S^* > s \mid X = x_j, S^* > 0) \le \frac{1 - \alpha}{P(S^* > 0 \mid X = x_j)}\Big\},
\end{aligned} \tag{4}$$

which suggests the following two-step inference by separately estimating $\alpha^*_{x_j} = 1 - \frac{1 - \alpha}{P(S^* > 0 \mid X = x_j)}$ and $\mathrm{VaR}_{S^*}(\alpha^*_{x_j} \mid x_j, S^* > 0)$. To efficiently estimate $P(S^* > 0 \mid X)$, the first step uses logistic regression to model $p_i$:

$$\operatorname{logit}\Big(\frac{p_i}{R_i}\Big) = \log\frac{p_i / R_i}{1 - p_i / R_i} = \beta^T \bar X_i \quad \text{for } i = 1, \dots, n, \tag{5}$$

which is equivalent to

$$\operatorname{logit}(p_i^*) = \log\frac{p_i^*}{1 - p_i^*} = \beta^T \bar X_i \quad \text{for } i = 1, \dots, n, \text{ under condition (2)},$$

where $\bar X_i$ denotes the corresponding dummy variable of $X_i$ and $A^T$ denotes the transpose of the vector or matrix $A$. Using the standard logistic regression estimation technique, we can obtain the estimator $\hat\beta$ for $\beta$, which gives

$$\hat p_i = \frac{R_i}{1 + \exp\{-\hat\beta^T \bar X_i\}} \quad \text{and} \quad \hat p_i^* = \frac{1}{1 + \exp\{-\hat\beta^T \bar X_i\}} \quad \text{for } i = 1, \dots, n. \tag{6}$$
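The first step can be sketched as a direct maximum likelihood fit of the model $P(S_i > 0 \mid X_i) = R_i / (1 + \exp\{-\beta^T \bar X_i\})$ from (5). This is only an illustrative sketch under our own assumptions: the paper does not specify the numerical routine, and the function name `fit_exposure_logistic`, the use of a general-purpose optimizer, and the argument layout are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def fit_exposure_logistic(y, X, R):
    """MLE of beta when P(y_i = 1) = R_i * expit(X_i @ beta).

    y : 0/1 indicators I(S_i > 0)
    X : design matrix of dummy variables (one row per policy)
    R : exposure rates in (0, 1]
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    R = np.asarray(R, dtype=float)

    def neg_mean_loglik(beta):
        # Clip to keep the logarithms finite during the search
        p = np.clip(R * expit(X @ beta), 1e-12, 1.0 - 1e-12)
        return -np.mean(y * np.log(p) + (1.0 - y) * np.log1p(-p))

    result = minimize(neg_mean_loglik, x0=np.zeros(X.shape[1]), method="BFGS")
    return result.x
```

Given the fitted coefficients `beta_hat`, the quantities in (6) would be `R * expit(X @ beta_hat)` for $\hat p_i$ and `expit(X @ beta_hat)` for $\hat p_i^*$.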

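As $X_i$ only takes $m$ values, $\bar X_i$ has $m$ different values, implying that $\{\hat p_i^*\}_{i=1}^n$ have $m$ distinct values at most. Hence, for $j = 1, \dots, m$, we define $p^*_{x_j}$ and $\hat p^*_{x_j}$ and estimate $\alpha^*_{x_j}$ by $\hat\alpha^*_{x_j}$, where

$$p^*_{x_j} = \frac{1}{1 + \exp\{-\beta^T \bar x_j\}}, \qquad \hat p^*_{x_j} = \frac{1}{1 + \exp\{-\hat\beta^T \bar x_j\}}, \qquad \hat\alpha^*_{x_j} = 1 - \frac{1 - \alpha}{\hat p^*_{x_j}}, \tag{7}$$

and $\bar x_j$ is the corresponding dummy variable of $x_j$. Define

$$A^+_{x_j} = \{i : X_i = x_j, S_i > 0, 1 \le i \le n\} \quad \text{for } j = 1, \dots, m.$$

It follows from (2) and (3) that we can estimate $\mathrm{VaR}_{S^*}(\alpha^*_{x_j} \mid x_j, S^* > 0)$ by the sample quantile at the level $\hat\alpha^*_{x_j}$ based on $\{h(R_i) S_i : i \in A^+_{x_j}, i = 1, \dots, n\}$, which gives the two-step estimator of $\mathrm{VaR}_{S^*}(\alpha \mid x_j)$ as

$$\widehat{\mathrm{VaR}}_{S^*}(\alpha \mid x_j) = \inf\Big\{s : \frac{\sum_{i=1}^n I(i \in A^+_{x_j}, h(R_i) S_i > s)}{\sum_{i=1}^n I(i \in A^+_{x_j})} \le \frac{1 - \alpha}{\hat p^*_{x_j}}\Big\}.$$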
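The two-step estimator displayed above can be computed for one category by scanning the sorted positive losses. The sketch below is illustrative only: the function name `two_step_var` and its arguments are hypothetical, the first-step estimate $\hat p^*_{x_j}$ is passed in as a number, and `h_of_r` holds the values $h(R_i)$ (defaulting to 1, i.e., no exposure adjustment).

```python
import numpy as np

def two_step_var(losses, p_star_hat, alpha, h_of_r=None):
    """Smallest s with (share of positive adjusted losses above s) <= (1 - alpha) / p_star_hat.

    losses     : losses S_i of the policies in one category (zeros included)
    p_star_hat : first-step estimate of P(S* > 0 | X = x_j)
    alpha      : risk level; requires p_star_hat > 1 - alpha
    h_of_r     : optional array of h(R_i) values, same length as losses
    """
    losses = np.asarray(losses, dtype=float)
    scale = np.ones_like(losses) if h_of_r is None else np.asarray(h_of_r, dtype=float)
    positive = np.sort((scale * losses)[losses > 0])
    m_plus = positive.size
    threshold = (1.0 - alpha) / p_star_hat
    # Share of positive losses strictly above each order statistic
    n_at_or_below = np.searchsorted(positive, positive, side="right")
    exceed_share = (m_plus - n_at_or_below) / m_plus
    # The infimum is attained at the first order statistic meeting the bound
    first_ok = np.nonzero(exceed_share <= threshold)[0][0]
    return positive[first_ok]

# Toy category: 10 policies, 4 positive losses, first-step estimate 0.5
estimate = two_step_var([0, 0, 0, 0, 0, 0, 1, 2, 3, 4], p_star_hat=0.5, alpha=0.8)
```

In the toy call the bound is 0.2 / 0.5 = 0.4, and the exceedance shares at the sorted losses 1, 2, 3, 4 are 0.75, 0.5, 0.25, 0, so the estimate is 3.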

In the estimator above, $I(\cdot)$ denotes the indicator function. A simple application of the standard asymptotic theory can prove the asymptotic normality of the above VaR estimator, but the asymptotic variance is complicated and depends on the inference uncertainties of both steps. To construct a confidence interval or conduct a hypothesis test without estimating the asymptotic variance of the above two-step risk estimation, we propose the following empirical likelihood method based on the estimating equation approach in Qin and Lawless (1994). It follows from (5) that

$$p_i(\beta) = \frac{R_i}{1 + \exp\{-\beta^T \bar X_i\}}$$

and the corresponding scores are

$$Z_i(\beta) = \frac{I(S_i > 0)}{p_i(\beta)} \frac{\partial p_i(\beta)}{\partial \beta} - \frac{I(S_i = 0)}{1 - p_i(\beta)} \frac{\partial p_i(\beta)}{\partial \beta} \quad \text{for } i = 1, \dots, n.$$

Put $\theta = \mathrm{VaR}_{S^*}(\alpha \mid x_j)$, which by (4) equals $\mathrm{VaR}_{S^*}(\alpha^*_{x_j} \mid x_j, S^* > 0)$. For $i \in A^+_{x_j}$, it follows from (4) that $P(S_i > \theta \mid X_i) = \frac{1 - \alpha}{p_i / R_i}$. Hence, we define

$$Z_{i,x_j}(\theta, \beta) = I(S_i > \theta) - \frac{1 - \alpha}{p_i(\beta) / R_i} \quad \text{for } i \in A^+_{x_j}.$$

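By noting that the $Z_i$'s are constructed from $\{I(S_i > 0)\}_{i=1}^n$, the $Z_{i,x_j}$'s are based on $\{S_i : i \in A^+_{x_j}, i = 1, \dots, n\}$, and $\{I(S_i > 0)\}_{i=1}^n$ is independent of $\{S_i : i \in A^+_{x_j}, i = 1, \dots, n\}$, we formulate the empirical likelihood function for $\theta = \mathrm{VaR}_{S^*}(\alpha \mid x_j)$ based on the method for two independent samples as

$$\begin{aligned}
L(\theta, \beta \mid x_j) = \sup\Big\{ &\prod_{k=1}^n (n q_k) \prod_{i \in A^+_{x_j}} (m^+_{x_j} q_i^*) : q_k \ge 0 \text{ for } k = 1, \dots, n,\ \sum_{k=1}^n q_k = 1,\ \sum_{k=1}^n q_k Z_k(\beta) = 0, \\
&q_i^* > 0 \text{ for } i \in A^+_{x_j},\ \sum_{i \in A^+_{x_j}} q_i^* = 1,\ \sum_{i \in A^+_{x_j}} q_i^* Z_{i,x_j}(\theta, \beta) = 0 \Big\},
\end{aligned}$$

where $m^+_{x_j} = \sum_{i=1}^n I(i \in A^+_{x_j})$. It follows from the Lagrange multiplier technique that

$$l(\theta, \beta \mid x_j) := -2 \log L(\theta, \beta \mid x_j) = 2 \sum_{i=1}^n \log\{1 + \lambda_1^T Z_i(\beta)\} + 2 \sum_{i \in A^+_{x_j}} \log\{1 + \lambda_2 Z_{i,x_j}(\theta, \beta)\},$$

where $\lambda_1 = \lambda_1(\beta)$ and $\lambda_2 = \lambda_2(\theta, \beta)$ satisfy

$$\sum_{i=1}^n \frac{Z_i(\beta)}{1 + \lambda_1^T Z_i(\beta)} = 0 \quad \text{and} \quad \sum_{i \in A^+_{x_j}} \frac{Z_{i,x_j}(\theta, \beta)}{1 + \lambda_2 Z_{i,x_j}(\theta, \beta)} = 0. \tag{8}$$

Because we are interested in $\theta = \mathrm{VaR}_{S^*}(\alpha \mid x_j)$, we consider the following profile log-empirical likelihood ratio

$$l_P(\theta \mid x_j) = \min_{\beta} l(\theta, \beta \mid x_j).$$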
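To make the role of the Lagrange multiplier in (8) concrete, the following sketch evaluates only the scalar second term, $2 \sum \log\{1 + \lambda_2 Z_{i,x_j}\}$, for a given vector of values $Z_{i,x_j}(\theta, \beta)$ with $\beta$ held fixed. It is a simplified illustration under our own assumptions and does not perform the profiling over $\beta$; the function name `el_log_ratio` and the root-finding approach are hypothetical choices.

```python
import numpy as np
from scipy.optimize import brentq

def el_log_ratio(z):
    """2 * sum(log(1 + lam * z_i)), where lam solves sum(z_i / (1 + lam * z_i)) = 0."""
    z = np.asarray(z, dtype=float)
    if z.min() >= 0.0 or z.max() <= 0.0:
        # Zero is not inside the range of the z_i: the constraint cannot be met
        return np.inf
    # lam must keep every weight positive, i.e. 1 + lam * z_i > 0 for all i
    eps = 1e-10
    lower = -1.0 / z.max() + eps
    upper = -1.0 / z.min() - eps

    def score(lam):
        return np.sum(z / (1.0 + lam * z))

    lam = brentq(score, lower, upper)  # score is strictly decreasing in lam
    return 2.0 * np.sum(np.log1p(lam * z))

# Toy sample: 30 of 100 positive losses exceed theta, target exceedance level q = 0.25
q = 0.25
z_toy = np.concatenate([np.full(30, 1.0 - q), np.full(70, -q)])
ratio = el_log_ratio(z_toy)
```

Because the toy values take only two levels, the result can be compared with the binomial likelihood ratio $2\{30 \log(0.30/0.25) + 70 \log(0.70/0.75)\}$, which is a convenient sanity check.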

Theorem 1. Assume $\{(S_i, X_i^T)^T\}_{i=1}^n$ is a sequence of independent random vectors, $\{R_i\}_{i=1}^n$ are deterministic, and $X_i$ is a $d$-dimensional categorical vector with levels $x_1, \dots, x_m$. Further assume (2), (3), (5), $\lim_{n \to \infty} m^+_{x_j} / n = \tau^+_{x_j} \in (0, 1)$ for $j = 1, \dots, m$, $E\{Z_1(\beta^0) Z_1^T(\beta^0)\}$ is positive definite with $\beta^0$ denoting the true value of $\beta$, and the conditional distribution function of $S_i^*$ given $X_i = x_j$ and $S_i^* > 0$ is independent of $i$ with a positive density at $\mathrm{VaR}_{S^*}(\alpha^*_{x_j} \mid x_j)$ for $j = 1, \dots, m$. Then, for each $j = 1, \dots, m$, $l_P(\mathrm{VaR}_{S^*}(\alpha \mid x_j) \mid x_j)$ converges in distribution to a chi-squared limit with one degree of freedom as $n \to \infty$.

Based on the above theorem, an empirical likelihood confidence interval with level $a$ for $\mathrm{VaR}_{S^*}(\alpha \mid x_j)$ is obtained as

$$I_a(\alpha \mid x_j) = \{\theta : l_P(\theta \mid x_j) \le \chi^2_{1,a}\},$$

where $\chi^2_{1,a}$ denotes the $a$-th quantile of a chi-squared distribution function with one degree of freedom. Similarly, an empirical likelihood test for $H_0 : \mathrm{VaR}_{S^*}(\alpha \mid x_j) = \theta_0$ at the level $a$ rejects the null hypothesis whenever $l_P(\theta_0 \mid x_j) > \chi^2_{1,1-a}$.
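The decision rule based on the chi-squared limit can be sketched as follows. The helper name `el_test` is hypothetical, and the value of the profile log-empirical likelihood ratio is assumed to have been computed already.

```python
from scipy.stats import chi2

def el_test(log_ratio: float, level: float = 0.05):
    """Return (reject, p_value) for H0 using the chi-squared(1) limit."""
    critical_value = chi2.ppf(1.0 - level, df=1)  # the (1 - a)-th quantile
    p_value = chi2.sf(log_ratio, df=1)            # upper tail probability
    return bool(log_ratio > critical_value), float(p_value)

# A confidence interval with level 0.95 collects all theta whose ratio is at most this cutoff
cutoff_95 = chi2.ppf(0.95, df=1)
reject, p_value = el_test(2.0, level=0.05)
```

In this illustration a ratio of 2.0 lies below the 95% cutoff, so the null value would stay inside the confidence interval.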

Remark 1. We can develop a similar empirical likelihood method for constructing a confidence region of $(\mathrm{VaR}_{S^*}(\alpha_1 \mid x_1), \dots, \mathrm{VaR}_{S^*}(\alpha_k \mid x_k))^T$ and testing

$$H_0 : \mathrm{VaR}_{S^*}(\alpha_1 \mid x_1) = \theta_1, \dots, \mathrm{VaR}_{S^*}(\alpha_k \mid x_k) = \theta_k,$$

where $k$ is a given integer. Here we allow different risk levels.

3 Application to an insurer database

We reexamine the dataset from an Australian automobile insurance company between the years 2004 and 2005, which is analyzed by Heras, Moreno and Vilar-Zanón (2018) based on the two-step inference of using logistic regression at the first step and quantile regression at the second step. The total number of policies is $n = 67856$, and the categorical explanatory variable $X_i$ involves the age of the vehicle with four levels and the driver's age with six levels. Hence, the total number of different levels of $X_i$ is $m = 4 \times 6 = 24$. We refer to De Jong and Heller (2008) for a detailed description of this dataset.

Following Heras, Moreno and Vilar-Zanón (2018), we use $h(R_i) = 1$ without adjusting the exposure rates at the second step. Because Heras, Moreno and Vilar-Zanón (2018) apply quantile regression to the sample in each category of $\{A^+_{x_j}\}_{j=1}^{24}$, we suspect that their estimates may be equal to our estimates $\{\widehat{\mathrm{VaR}}_{S^*}(\alpha \mid x_j)\}_{j=1}^{24}$ without modeling the conditional quantile of $S^*$ given $X$ and $S^* > 0$. After computing $\{\widehat{\mathrm{VaR}}_{S^*}(0.95 \mid x_j)\}_{j=1}^{24}$ and reporting them in Table 1 below, we find that our two-step estimates are different from the two-step estimates in Heras, Moreno and Vilar-Zanón (2018). To further investigate the effectiveness and necessity of using quantile regression, we employ the proposed empirical likelihood method to test whether the estimates in Heras, Moreno and Vilar-Zanón (2018) are significantly different from the true values without modeling the conditional quantile of $S^*$ given $X$ and $S^* > 0$. The p-values reported in Table 1 show that the two-step estimates in Heras, Moreno and Vilar-Zanón (2018) are not significantly different from the true values at the 5% significance level. In other words, the estimates in Heras, Moreno and Vilar-Zanón (2018) are in the 95% confidence intervals of the true values without modeling the conditional quantile of $S^*$ given $X$ and $S^* > 0$, i.e., it is not significantly useful to employ quantile regression.

Table 1: We report our two-step estimates $\{\widehat{\mathrm{VaR}}_{S^*}(0.95 \mid x_j)\}_{j=1}^{24}$, copy the two-step estimates $\{\widetilde{\mathrm{VaR}}_{S^*}(0.95 \mid x_j)\}_{j=1}^{24}$ from Heras, Moreno and Vilar-Zanón (2018), and report the P-values of the proposed empirical likelihood test for testing whether the true Value-at-Risk equals the estimate in Heras, Moreno and Vilar-Zanón (2018) for each group. The two numbers inside the bracket of Group represent the levels of the age of the vehicle and the driver's age, respectively.

| Group | $\widetilde{\mathrm{VaR}}_{S^*}(0.95 \mid x_j)$ | P-value | $\widehat{\mathrm{VaR}}_{S^*}(0.95 \mid x_j)$ |
|---|---|---|---|
| 1(2&1) | 3212.78 | 0.6431 | 3277.82 |
| 2(1&1) | 2534.94 | 0.6584 | 3241.63 |
| 3(3&1) | 2901.37 | 0.4582 | 2636.08 |
| 4(2&2) | 1726.35 | 0.2197 | 1438.54 |
| 5(4&1) | 2927.59 | 0.4146 | 2442.07 |
| 6(1&2) | 1407.25 | 0.2456 | 1573.67 |
| 7(2&3) | 1556.17 | 0.7342 | 1615.35 |
| 8(1&3) | 1327.27 | 0.0773 | 1045.51 |
| 9(2&4) | 1347.33 | 0.9427 | 1346.14 |
| 10(3&2) | 1691.04 | 0.9821 | 1709.33 |
| 11(1&4) | 1143.21 | 0.6944 | 1203.77 |
| 12(3&3) | 1487.55 | 0.7001 | 1576.59 |
| 13(4&2) | 1736.64 | 0.5042 | 1859.44 |
| 14(3&4) | 1283.53 | 0.8254 | 1271.97 |
| 15(4&3) | 1470.00 | 0.9062 | 1478.81 |
| 16(4&4) | 1311.30 | 0.9199 | 1319.02 |
| 17(2&5) | 1014.82 | 0.7465 | 1073.10 |
| 18(2&6) | 1146.38 | 0.5192 | 1086.51 |
| 19(1&5) | 837.34 | 0.8487 | 836.90 |
| 20(1&6) | 947.30 | 0.7327 | 926.72 |
| 21(3&5) | 914.85 | 0.8660 | 969.59 |
| 22(3&6) | 1067.53 | 0.1958 | 1193.75 |
| 23(4&5) | 889.11 | 0.9058 | 882.40 |
| 24(4&6) | 1062.24 | 0.1844 | 893.90 |
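The reading of Table 1 at the 5% significance level can be checked mechanically. The short sketch below simply transcribes the 24 P-values from Table 1 (group 1 first) and looks for values below 0.05; the variable names are our own.

```python
# P-values from Table 1, ordered by group number 1..24
p_values = [
    0.6431, 0.6584, 0.4582, 0.2197, 0.4146, 0.2456, 0.7342, 0.0773,
    0.9427, 0.9821, 0.6944, 0.7001, 0.5042, 0.8254, 0.9062, 0.9199,
    0.7465, 0.5192, 0.8487, 0.7327, 0.8660, 0.1958, 0.9058, 0.1844,
]

smallest = min(p_values)
group_of_smallest = p_values.index(smallest) + 1
groups_rejected_at_5pct = [g for g, p in enumerate(p_values, start=1) if p < 0.05]
```

No group falls below 0.05; the smallest P-value belongs to group 8.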

4 Simulation study

This section investigates the finite sample performance of the proposed empirical likelihood confidence interval in terms of coverage accuracy.

First, we compute the sample proportions $\tau_{x_1}, \dots, \tau_{x_{24}}$ of those 24 categories in the real dataset analyzed in Section 3 and generate the explanatory variables $X_i = (\mathrm{VehAge}_i, \mathrm{AgeCat}_i)^T$ to have the sample size $n_j = [n \tau_{x_j}]$ for the $j$-th category, where $j = 1, \dots, 24$. Here $\mathrm{VehAge}_i$ denotes the age of the vehicle with levels 1 (newest), 2, 3, and 4, and $\mathrm{AgeCat}_i$ denotes the driver's age with levels 1 (youngest), 2, 3, 4, 5, and 6. Note that the total sample size $\tilde n = n_1 + \dots + n_{24}$ may be smaller than $n$ due to the rounding effect.

Next, using the estimator $\hat\beta$ in fitting (5) to the real dataset in Section 3, we generate independent Bernoulli random variables with $p_i = \frac{1}{1 + \exp\{-\hat\beta^T \bar X_i\}}$ for $i = 1, \dots, \tilde n$, where $\bar X_i$ is the dummy variable of $X_i$. We use $\{N_i\}_{i=1}^{\tilde n}$ to denote these Bernoulli variables, which indicate whether a loss is positive or zero.

For each $i = 1, \dots, \tilde n$, we further generate $S_i$ from a standard Gamma distribution with parameter being the sum of the levels of $\mathrm{VehAge}_i$ and $\mathrm{AgeCat}_i$ when $N_i = 1$, and set $S_i = 0$ when $N_i = 0$. We use $R_i = 1$ and $h(R_i) = 1$ for all $i = 1, \dots, \tilde n$, i.e., all policies are observed in a full year cycle.

Based on the generated data $\{(S_i, X_i^T)^T\}_{i=1}^{\tilde n}$, for each category, we first compute the true value $\mathrm{VaR}_{S^*}(0.95 \mid x_j)$ from the employed Gamma distribution and the corresponding $p_{x_j} = \frac{1}{1 + \exp\{-\hat\beta^T \bar x_j\}}$, and then calculate the profile log-empirical likelihood ratio $l_P(\mathrm{VaR}_{S^*}(0.95 \mid x_j) \mid x_j)$.

Repeat this procedure 5000 times so that the empirical coverage probabilities for the empirical likelihood confidence intervals $I_{0.90}(0.95 \mid x_j)$ and $I_{0.95}(0.95 \mid x_j)$ are obtained and reported in Table 2. This table shows that the proposed empirical likelihood method produces accurate confidence intervals and the coverage accuracy improves as the sample size becomes larger.
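The data-generating step for a single category can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the fitted $\hat\beta$ is not reported in the text, so the claim probability is passed in as a hypothetical number; "standard Gamma distribution with parameter" is read here as shape equal to the sum of the two levels and scale 1; and the function names are our own.

```python
import numpy as np
from scipy.stats import gamma

def true_var(alpha, p_positive, shape):
    """True VaR of a loss that is zero w.p. 1 - p_positive and Gamma(shape, 1) otherwise."""
    alpha_star = 1.0 - (1.0 - alpha) / p_positive  # adjusted level from (4)
    return gamma.ppf(alpha_star, a=shape)

def simulate_category(n_j, p_positive, shape, rng):
    """Generate n_j losses: Bernoulli claim indicators N_i, then Gamma sizes when N_i = 1."""
    has_claim = rng.random(n_j) < p_positive
    sizes = rng.gamma(shape, 1.0, size=n_j)
    return np.where(has_claim, sizes, 0.0)

# Hypothetical category with VehAge = 1 and AgeCat = 2 (shape 3) and claim probability 0.2
rng = np.random.default_rng(1)
losses = simulate_category(200_000, p_positive=0.2, shape=3.0, rng=rng)
target = true_var(0.95, p_positive=0.2, shape=3.0)
```

With a large simulated sample, the empirical 95% quantile of `losses` should be close to `target`, which is the Gamma quantile at the adjusted level 0.75.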

Table 2: We report the $\tau_j$'s and the empirical coverage probabilities of the proposed empirical likelihood based confidence intervals $I_{0.9}(0.95 \mid x_j)$ and $I_{0.95}(0.95 \mid x_j)$ with sample sizes $n = 30{,}000$ and $n = 60{,}000$. The two numbers inside the bracket of Group represent the levels of the age of the vehicle and the driver's age, respectively.

| Group | $\tau_j$ | $I_{0.9}(0.95 \mid x_j)$, $n = 30000$ | $I_{0.95}(0.95 \mid x_j)$, $n = 30000$ | $I_{0.9}(0.95 \mid x_j)$, $n = 60000$ | $I_{0.95}(0.95 \mid x_j)$, $n = 60000$ |
|---|---|---|---|---|---|
| 1(2&1) | 2.2165% | 0.9152 | 0.9588 | 0.9086 | 0.9542 |
| 2(1&1) | 1.8908% | 0.9022 | 0.9522 | 0.9040 | 0.9520 |
| 3(3&1) | 2.4213% | 0.9064 | 0.9556 | 0.9042 | 0.9544 |
| 4(2&2) | 4.6672% | 0.9104 | 0.9558 | 0.9094 | 0.9572 |
| 5(4&1) | 1.9335% | 0.9084 | 0.9532 | 0.9110 | 0.9550 |
| 6(1&2) | 3.1832% | 0.9090 | 0.9524 | 0.9072 | 0.9534 |
| 7(2&3) | 5.5131% | 0.9126 | 0.9594 | 0.9068 | 0.9554 |
| 8(1&3) | 3.9879% | 0.9122 | 0.9604 | 0.9128 | 0.9550 |
| 9(2&4) | 5.7755% | 0.9102 | 0.9560 | 0.9144 | 0.9570 |
| 10(3&2) | 5.8230% | 0.9090 | 0.9544 | 0.9150 | 0.9528 |
| 11(1&4) | 4.3253% | 0.9190 | 0.9604 | 0.9070 | 0.9592 |
| 12(3&3) | 7.1121% | 0.9126 | 0.9606 | 0.9096 | 0.9556 |
| 13(4&2) | 5.2936% | 0.9160 | 0.9590 | 0.9142 | 0.9610 |
| 14(3&4) | 7.0149% | 0.9186 | 0.9580 | 0.9200 | 0.9592 |
| 15(4&3) | 6.6228% | 0.9120 | 0.9584 | 0.9176 | 0.9604 |
| 16(4&4) | 6.7422% | 0.9174 | 0.9584 | 0.9128 | 0.9618 |
| 17(2&5) | 3.8832% | 0.9196 | 0.9680 | 0.9076 | 0.9558 |
| 18(2&6) | 2.3889% | 0.9098 | 0.9602 | 0.9060 | 0.9552 |
| 19(1&5) | 3.0093% | 0.9112 | 0.9570 | 0.9124 | 0.9600 |
| 20(1&6) | 1.6668% | 0.9122 | 0.9598 | 0.9130 | 0.9576 |
| 21(3&5) | 4.5508% | 0.9154 | 0.9632 | 0.9124 | 0.9606 |
| 22(3&6) | 2.6394% | 0.9198 | 0.9642 | 0.9130 | 0.9574 |
| 23(4&5) | 4.3784% | 0.9218 | 0.9612 | 0.9206 | 0.9664 |
| 24(4&6) | 2.9533% | 0.9164 | 0.9540 | 0.9146 | 0.9580 |

5 Conclusions

To accurately forecast the Value-at-Risk of the aggregate insurance losses in insurance ratemaking, Heras, Moreno and Vilar-Zanón (2018) propose to model the probability with nonzero claims by logistic regression at the first step and model the conditional quantile of the aggregate loss given nonzero claims by quantile regression at the second step. When the explanatory variables are categorical, the adjusted quantile level from the first step is different for each category and quantile regression can only be applied to each category. Thus it is unnecessary to employ quantile regression at the second step. This observation motivates us to estimate the Value-at-Risk and quantify the uncertainty without using quantile regression. This paper provides the explicit model assumptions and develops an empirical likelihood method to construct a confidence interval for the Value-at-Risk measure and to test whether the two-step estimates in Heras, Moreno and Vilar-Zanón (2018) are significantly different from those without modeling the conditional quantile of the aggregate loss given the explanatory variable. Data analysis with the proposed new method does show that the second step of using quantile regression is not necessary. A simulation study confirms the good finite sample performance of the proposed empirical likelihood method.

Acknowledgements

We thank a reviewer for helpful comments. Peng's research was supported by the Simons Foundation.

References

[1] Goldburd, M., Khare, A. and Tevet, D. (2016). Generalized Linear Models for Insurance Rating. CAS Monograph Series Number 5.

[2] Heller, G.Z. and De Jong, P. (2008). Generalized Linear Models for Insurance Data. Cambridge University Press.

[3] Heras, A., Moreno, I. and Vilar-Zanón, J. (2018). An application of two-stage quantile regression to insurance ratemaking. Scandinavian Actuarial Journal 9, 753–769.

[4] Koenker, R. (2005). Quantile Regression. Cambridge University Press.

[5] Kudryavtsev, A.A. (2009). Using quantile regression for ratemaking. Insurance: Mathematics and Economics 45, 296–304.

[6] Owen, A.B. (2001). Empirical Likelihood. Chapman and Hall.

[7] Qin, J. and Lawless, J. (1994). Empirical likelihood and general estimating equations. Annals of Statistics 22, 300–325.

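Appendix: Proof of Theorem 1

Before proving Theorem 1, we need some lemmas, where $\beta^0$ and $\theta_j^0$ denote the true values of $\beta$ and $\theta_j = \mathrm{VaR}_{S^*}(\alpha \mid x_j)$, respectively. We use $\overset{d}{\to}$ and $\overset{p}{\to}$ to denote the convergence in distribution and in probability, respectively. We also use the following conventional notation for partial derivatives:

$$\frac{\partial y}{\partial x} = \Big(\frac{\partial y}{\partial x_1}, \dots, \frac{\partial y}{\partial x_{d_1}}\Big)^T$$

when $y$ is scalar and $x = (x_1, \dots, x_{d_1})^T$;

$$\frac{\partial y}{\partial x} = \begin{pmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_{d_1}} \\ \vdots & \cdots & \vdots \\ \frac{\partial y_{d_2}}{\partial x_1} & \cdots & \frac{\partial y_{d_2}}{\partial x_{d_1}} \end{pmatrix}$$

when $x = (x_1, \dots, x_{d_1})^T$ and $y = (y_1, \dots, y_{d_2})^T$.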

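Lemma 1. Under conditions of Theorem 1, for each $j = 1, \dots, m$, we have

$$\begin{cases}
\frac{1}{\sqrt{n}} \sum_{i=1}^n Z_i(\beta^0) \overset{d}{\to} W_1 \sim N(0, \Sigma), \\[4pt]
\frac{1}{n} \sum_{i=1}^n Z_i(\beta^0) Z_i^T(\beta^0) \overset{p}{\to} \Sigma = E\{Z_1(\beta^0) Z_1^T(\beta^0)\}, \\[4pt]
\frac{1}{\sqrt{m^+_{x_j}}} \sum_{i \in A^+_{x_j}} Z_{i,x_j}(\theta_j^0, \beta^0) \overset{d}{\to} W_2 \sim N(0, \sigma_j^2), \\[4pt]
\frac{1}{m^+_{x_j}} \sum_{i \in A^+_{x_j}} \{Z_{i,x_j}(\theta_j^0, \beta^0)\}^2 \overset{p}{\to} \sigma_j^2 = \frac{\alpha - 1 + p^*_{x_j}}{p^*_{x_j}} \Big\{1 - \frac{\alpha - 1 + p^*_{x_j}}{p^*_{x_j}}\Big\},
\end{cases}$$

as $n \to \infty$, where $W_1$ and $W_2$ are independent, and $p^*_{x_j}$ is defined in (7).

Proof. It directly follows from the central limit theorem, the weak law of large numbers, and the independence between $\{I(S_i > 0)\}_{i=1}^n$ and $\{S_i : i \in A^+_{x_j}, i = 1, \dots, n\}$.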
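As a brief check of the expression for $\sigma_j^2$ in Lemma 1, the following derivation treats the indicator in $Z_{i,x_j}$ as a Bernoulli variable at the true parameter values. The shorthand $q_j$ is introduced here only for this check and is not notation from the paper.

```latex
% Shorthand (ours): q_j = (1 - \alpha) / p^*_{x_j}, the exceedance probability
% of a positive loss above \theta_j^0 implied by (4).
\begin{align*}
Z_{i,x_j}(\theta_j^0, \beta^0) &= I(S_i > \theta_j^0) - q_j,
  \qquad I(S_i > \theta_j^0) \sim \mathrm{Bernoulli}(q_j) \ \text{for } i \in A^+_{x_j}, \\
E\{Z_{i,x_j}(\theta_j^0, \beta^0)\} &= 0,
  \qquad \mathrm{Var}\{Z_{i,x_j}(\theta_j^0, \beta^0)\} = q_j (1 - q_j), \\
1 - q_j &= \frac{p^*_{x_j} - (1 - \alpha)}{p^*_{x_j}} = \frac{\alpha - 1 + p^*_{x_j}}{p^*_{x_j}}, \\
\sigma_j^2 &= (1 - q_j)\{1 - (1 - q_j)\}
  = \frac{\alpha - 1 + p^*_{x_j}}{p^*_{x_j}}
    \Big\{1 - \frac{\alpha - 1 + p^*_{x_j}}{p^*_{x_j}}\Big\}.
\end{align*}
```

So the limit in Lemma 1 is the usual Bernoulli variance written in terms of $1 - q_j$.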

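Lemma 2. Under conditions of Theorem 1, as $n \to \infty$, with probability tending to one, $l(\theta_j^0, \beta \mid x_j)$ for each $j = 1, \dots, m$ attains its minimum value at some point $\tilde\beta$ in the interior of the ball $\|\beta - \beta^0\| \le n^{-1/3}$, and $\tilde\beta$, $\tilde\lambda_1 = \lambda_1(\tilde\beta)$ and $\tilde\lambda_2 = \lambda_2(\tilde\beta)$ satisfy that

$$Q_{1n}(\tilde\beta, \tilde\lambda_1) = 0, \quad Q_{2n}(\tilde\beta, \tilde\lambda_2) = 0, \quad Q_{3n}(\tilde\beta, \tilde\lambda_1, \tilde\lambda_2) = 0, \tag{9}$$

where

$$Q_{1n}(\beta, \lambda_1) = \frac{1}{n} \sum_{i=1}^n \frac{Z_i(\beta)}{1 + \lambda_1^T Z_i(\beta)}, \qquad Q_{2n}(\beta, \lambda_2) = \frac{1}{n} \sum_{i \in A^+_{x_j}} \frac{Z_{i,x_j}(\theta_j^0, \beta)}{1 + \lambda_2 Z_{i,x_j}(\theta_j^0, \beta)}$$

and

$$Q_{3n}(\beta, \lambda_1, \lambda_2) = \frac{1}{n} \sum_{i=1}^n \frac{1}{1 + \lambda_1^T Z_i(\beta)} \Big(\frac{\partial Z_i(\beta)}{\partial \beta}\Big)^T \lambda_1 + \frac{1}{n} \sum_{i \in A^+_{x_j}} \frac{1}{1 + \lambda_2 Z_{i,x_j}(\theta_j^0, \beta)} \frac{\partial Z_{i,x_j}(\theta_j^0, \beta)}{\partial \beta} \lambda_2.$$

Proof. It follows from the same arguments in the proof of Lemma 1 in Qin and Lawless (1994).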

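Proof of Theorem 1. Note that

$$\frac{\partial Q_{1n}(\beta, 0)}{\partial \beta} = \frac{1}{n} \sum_{i=1}^n \frac{\partial Z_i(\beta)}{\partial \beta}, \quad \frac{\partial Q_{1n}(\beta, 0)}{\partial \lambda_1} = -\frac{1}{n} \sum_{i=1}^n Z_i(\beta) Z_i^T(\beta), \quad \frac{\partial Q_{1n}(\beta, 0)}{\partial \lambda_2} = 0,$$

$$\frac{\partial Q_{2n}(\beta, 0)}{\partial \beta} = \frac{1}{n} \sum_{i \in A^+_{x_j}} \frac{\partial Z_{i,x_j}(\theta_j^0, \beta)}{\partial \beta}, \quad \frac{\partial Q_{2n}(\beta, 0)}{\partial \lambda_1} = 0, \quad \frac{\partial Q_{2n}(\beta, 0)}{\partial \lambda_2} = -\frac{1}{n} \sum_{i \in A^+_{x_j}} \{Z_{i,x_j}(\theta_j^0, \beta)\}^2,$$

$$\frac{\partial Q_{3n}(\beta, 0, 0)}{\partial \beta} = 0, \quad \frac{\partial Q_{3n}(\beta, 0, 0)}{\partial \lambda_1} = \frac{1}{n} \sum_{i=1}^n \Big(\frac{\partial Z_i(\beta)}{\partial \beta}\Big)^T, \quad \frac{\partial Q_{3n}(\beta, 0, 0)}{\partial \lambda_2} = \frac{1}{n} \sum_{i \in A^+_{x_j}} \frac{\partial Z_{i,x_j}(\theta_j^0, \beta)}{\partial \beta}.$$

By Taylor expansion and Lemmas 1-2, we have

$$\begin{aligned}
0 = Q_{1n}(\tilde\beta, \tilde\lambda_1) &= Q_{1n}(\beta^0, 0) + \frac{\partial Q_{1n}(\beta^0, 0)}{\partial \beta} (\tilde\beta - \beta^0) + \frac{\partial Q_{1n}(\beta^0, 0)}{\partial \lambda_1} \tilde\lambda_1 + \frac{\partial Q_{1n}(\beta^0, 0)}{\partial \lambda_2} \tilde\lambda_2 + o_p(\delta_n), \\
0 = Q_{2n}(\tilde\beta, \tilde\lambda_2) &= Q_{2n}(\beta^0, 0) + \Big(\frac{\partial Q_{2n}(\beta^0, 0)}{\partial \beta}\Big)^T (\tilde\beta - \beta^0) + \Big(\frac{\partial Q_{2n}(\beta^0, 0)}{\partial \lambda_1}\Big)^T \tilde\lambda_1 + \frac{\partial Q_{2n}(\beta^0, 0)}{\partial \lambda_2} \tilde\lambda_2 + o_p(\delta_n), \\
0 = Q_{3n}(\tilde\beta, \tilde\lambda_1, \tilde\lambda_2) &= Q_{3n}(\beta^0, 0, 0) + \frac{\partial Q_{3n}(\beta^0, 0, 0)}{\partial \beta} (\tilde\beta - \beta^0) + \frac{\partial Q_{3n}(\beta^0, 0, 0)}{\partial \lambda_1} \tilde\lambda_1 + \frac{\partial Q_{3n}(\beta^0, 0, 0)}{\partial \lambda_2} \tilde\lambda_2 + o_p(\delta_n),
\end{aligned}$$

where $\delta_n = \|\tilde\beta - \beta^0\| + \|\tilde\lambda_1\| + |\tilde\lambda_2|$. That is,

$$\begin{pmatrix} \tilde\lambda_2 \\ \tilde\lambda_1 \\ \tilde\beta - \beta^0 \end{pmatrix} = S_n^{-1} \begin{pmatrix} -Q_{2n}(\beta^0, 0) + o_p(\delta_n) \\ -Q_{1n}(\beta^0, 0) + o_p(\delta_n) \\ o_p(\delta_n) \end{pmatrix},$$

where

$$S_n = \begin{pmatrix}
\frac{\partial Q_{2n}(\beta^0, 0)}{\partial \lambda_2} & \big(\frac{\partial Q_{2n}(\beta^0, 0)}{\partial \lambda_1}\big)^T & \big(\frac{\partial Q_{2n}(\beta^0, 0)}{\partial \beta}\big)^T \\
\frac{\partial Q_{1n}(\beta^0, 0)}{\partial \lambda_2} & \frac{\partial Q_{1n}(\beta^0, 0)}{\partial \lambda_1} & \frac{\partial Q_{1n}(\beta^0, 0)}{\partial \beta} \\
\frac{\partial Q_{3n}(\beta^0, 0, 0)}{\partial \lambda_2} & \frac{\partial Q_{3n}(\beta^0, 0, 0)}{\partial \lambda_1} & \frac{\partial Q_{3n}(\beta^0, 0, 0)}{\partial \beta}
\end{pmatrix}
\overset{p}{\to} S = \begin{pmatrix}
-\tau^+_{x_j} E\{Z^2_{j_0,x_j}(\theta_j^0, \beta^0)\} & 0^T & \tau^+_{x_j} E\big\{\frac{\partial Z_{j_0,x_j}(\theta_j^0, \beta^0)}{\partial \beta}\big\}^T \\
0 & -E\{Z_1(\beta^0) Z_1^T(\beta^0)\} & E\big\{\frac{\partial Z_1(\beta^0)}{\partial \beta}\big\} \\
\tau^+_{x_j} E\big\{\frac{\partial Z_{j_0,x_j}(\theta_j^0, \beta^0)}{\partial \beta}\big\} & E\big\{\frac{\partial Z_1(\beta^0)}{\partial \beta}\big\}^T & 0
\end{pmatrix}$$

with $j_0$ being any element in $A^+_{x_j}$. Put $a_1 = \tau^+_{x_j} E\{Z^2_{j_0,x_j}(\theta_j^0, \beta^0)\}$, $A_1 = E\{Z_1(\beta^0) Z_1^T(\beta^0)\}$, and write

$$S_{11} = \begin{pmatrix} -a_1 & 0^T \\ 0 & -A_1 \end{pmatrix}, \qquad S_{12} = \begin{pmatrix} \tau^+_{x_j} E\big\{\frac{\partial Z_{j_0,x_j}(\theta_j^0, \beta^0)}{\partial \beta}\big\}^T \\ E\big\{\frac{\partial Z_1(\beta^0)}{\partial \beta}\big\} \end{pmatrix},$$

that is, we write

$$S = \begin{pmatrix} S_{11} & S_{12} \\ S_{12}^T & 0 \end{pmatrix},$$

which gives that

$$S^{-1} = \begin{pmatrix} S_{11}^{-1} - S_{11}^{-1} S_{12} \Delta^{-1} S_{12}^T S_{11}^{-1} & S_{11}^{-1} S_{12} \Delta^{-1} \\ \Delta^{-1} S_{12}^T S_{11}^{-1} & -\Delta^{-1} \end{pmatrix},$$

where $\Delta = S_{12}^T S_{11}^{-1} S_{12}$. Therefore

$$\sqrt{n} \begin{pmatrix} \tilde\lambda_2 \\ \tilde\lambda_1 \end{pmatrix} = \big\{S_{11}^{-1} - S_{11}^{-1} S_{12} \Delta^{-1} S_{12}^T S_{11}^{-1}\big\} \sqrt{n} \begin{pmatrix} -Q_{2n}(\beta^0, 0) \\ -Q_{1n}(\beta^0, 0) \end{pmatrix} + o_p(1) \tag{10}$$

and

$$\sqrt{n} (\tilde\beta - \beta^0) = \Delta^{-1} S_{12}^T S_{11}^{-1} \sqrt{n} \begin{pmatrix} -Q_{2n}(\beta^0, 0) \\ -Q_{1n}(\beta^0, 0) \end{pmatrix} + o_p(1). \tag{11}$$

Put $\tilde\lambda = (\tilde\lambda_2, \tilde\lambda_1^T)^T$ and $W_n = \sqrt{n} \big(Q_{2n}(\beta^0, 0), Q_{1n}^T(\beta^0, 0)\big)^T$. Then, by Lemma 1, (10)–(11) and Taylor expansion, we have

$$\begin{aligned}
l_P(\theta_j^0 \mid x_j) &= 2 \tilde\lambda_1^T \sum_{i=1}^n Z_i(\tilde\beta) - \tilde\lambda_1^T \sum_{i=1}^n Z_i(\tilde\beta) Z_i^T(\tilde\beta) \tilde\lambda_1 \\
&\quad + 2 \tilde\lambda_2 \sum_{i \in A^+_{x_j}} Z_{i,x_j}(\theta_j^0, \tilde\beta) - \tilde\lambda_2 \sum_{i \in A^+_{x_j}} Z^2_{i,x_j}(\theta_j^0, \tilde\beta) \tilde\lambda_2 + o_p(1) \\
&= 2 \sqrt{n} \tilde\lambda^T W_n - \sqrt{n} \tilde\lambda^T \begin{pmatrix} a_1 & 0^T \\ 0 & A_1 \end{pmatrix} \sqrt{n} \tilde\lambda + o_p(1) \\
&= -2 W_n^T \big\{S_{11}^{-1} - S_{11}^{-1} S_{12} \Delta^{-1} S_{12}^T S_{11}^{-1}\big\} W_n \\
&\quad + W_n^T \big\{S_{11}^{-1} - S_{11}^{-1} S_{12} \Delta^{-1} S_{12}^T S_{11}^{-1}\big\} S_{11} \big\{S_{11}^{-1} - S_{11}^{-1} S_{12} \Delta^{-1} S_{12}^T S_{11}^{-1}\big\} W_n + o_p(1) \\
&= -W_n^T \big\{S_{11}^{-1} - S_{11}^{-1} S_{12} \Delta^{-1} S_{12}^T S_{11}^{-1}\big\} W_n + o_p(1) \\
&= \big\{(-S_{11})^{-1/2} W_n\big\}^T \big\{I_{k \times k} - S_{11}^{-1/2} S_{12} \Delta^{-1} S_{12}^T S_{11}^{-1/2}\big\} (-S_{11})^{-1/2} W_n + o_p(1),
\end{aligned}$$

where $I_{k \times k}$ denotes the $k \times k$ identity matrix, and $k$ is the dimension of $S_{11}$. Hence the theorem follows from the facts that $(-S_{11})^{-1/2} W_n \overset{d}{\to} N(0, I_{k \times k})$, the matrix $I_{k \times k} - S_{11}^{-1/2} S_{12} \Delta^{-1} S_{12}^T S_{11}^{-1/2}$ is idempotent, and its rank is

$$k - \mathrm{rank}(S_{12} \Delta^{-1} S_{12}^T S_{11}^{-1}) = k - \mathrm{rank}(\Delta^{-1} \Delta) = k - (k - 1) = 1.$$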
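The idempotence and rank claims at the end of the proof can be checked numerically on an arbitrary example. The sketch below is our own illustration, not part of the proof: it draws a random negative definite matrix in place of $S_{11}$ and a random $k \times (k-1)$ matrix in place of $S_{12}$, and it works in real arithmetic with $(-S_{11})^{-1/2}$. Under that reading, the formal expression $I - S_{11}^{-1/2} S_{12} \Delta^{-1} S_{12}^T S_{11}^{-1/2}$ becomes $I + (-S_{11})^{-1/2} S_{12} \Delta^{-1} S_{12}^T (-S_{11})^{-1/2}$, which is our interpretation of the notation.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 4  # dimension of S11, i.e. 1 + dim(beta) in the proof

# Random stand-ins: S11 negative definite, S12 of full column rank k - 1
G = rng.normal(size=(k, k))
S11 = -(G @ G.T + k * np.eye(k))
S12 = rng.normal(size=(k, k - 1))

# Real symmetric inverse square root of -S11 via its eigendecomposition
eigvals, eigvecs = np.linalg.eigh(-S11)
root_inv = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T

Delta = S12.T @ np.linalg.inv(S11) @ S12  # negative definite, (k-1) x (k-1)
B = root_inv @ S12
M = np.eye(k) + B @ np.linalg.inv(Delta) @ B.T  # equals I - B (B^T B)^{-1} B^T

is_idempotent = np.allclose(M @ M, M)
rank_of_M = np.linalg.matrix_rank(M, tol=1e-8)
```

Since $\Delta = -B^T B$ here, `M` is the projection onto the orthogonal complement of the column space of `B`, so it has rank one, matching the single degree of freedom in Theorem 1.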