Journal Pre-proof

Risk analysis with categorical explanatory variables

Seul Ki Kang, Liang Peng, Hongmin Xiao

PII: S0167-6687(20)30023-8
DOI: https://doi.org/10.1016/j.insmatheco.2020.02.007
Reference: INSUMA 2620
To appear in: Insurance: Mathematics and Economics
Received date: 3 November 2019
Revised date: 22 January 2020
Accepted date: 11 February 2020

Please cite this article as: S.K. Kang, L. Peng and H. Xiao, Risk analysis with categorical explanatory variables. Insurance: Mathematics and Economics (2020), doi: https://doi.org/10.1016/j.insmatheco.2020.02.007.

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
© 2020 Elsevier B.V. All rights reserved.
Risk Analysis With Categorical Explanatory Variables

Seul Ki Kang¹, Liang Peng¹, and Hongmin Xiao²

Abstract. To better forecast the Value-at-Risk of the aggregate insurance losses, Heras, Moreno and Vilar-Zanón (2018) propose a two-step inference of using logistic regression and quantile regression without providing detailed model assumptions, deriving the related asymptotic properties, and quantifying the inference uncertainty. This paper argues that the application of quantile regression at the second step is not necessary when explanatory variables are categorical. After describing the explicit model assumptions, we propose another two-step inference of using logistic regression and the sample quantile. Also, we provide an efficient empirical likelihood method to quantify the uncertainty. A simulation study confirms the good finite sample performance of the proposed method.
Key words and phrases: Aggregate loss, empirical likelihood, insurance ratemaking, logistic regression, Value-at-Risk.
1 Introduction
Consider a set of insurance policies with $n$ independent policyholders in a given period of time. For the $i$-th policyholder, we observe the aggregate loss $S_i$ and the $d$-dimensional categorical explanatory vector $X_i$, representing some characteristics such as the driver's age and the age of the vehicle in automobile insurance. Because $X_i$ is categorical, we assume the possible values are $x_1, \dots, x_m$ without loss of generality. Let $S$ denote the aggregate loss of a new policyholder, and let $X$ represent the corresponding characteristics. Then the practical question is to forecast the Value-at-Risk (VaR) of $S$ at a given level $\alpha \in (0, 1)$ given $X = x$, which is defined as

$$\mathrm{VaR}_S(\alpha|x) = \inf\{s : P(S > s|X = x) \le 1 - \alpha\}.$$
Because $X$ is categorical, we can classify these $n$ policies into $m$ categories, defined as $A_{x_j} = \{i : X_i = x_j, 1 \le i \le n\}$ for $j = 1, \dots, m$. So we only need to forecast $\{\mathrm{VaR}_S(\alpha|x_j)\}_{j=1}^m$. A naive estimator for $\mathrm{VaR}_S(\alpha|x_j)$ is the sample quantile of those $S_i$'s in the category $A_{x_j}$ without really using the explanatory variables. Alternatively, one can model the quantile of $S_i$ as a linear function of $X_i$ and employ the quantile regression technique to forecast $\mathrm{VaR}_S(\alpha|x_j)$ for $j = 1, \dots, m$. Applying the quantile regression technique to insurance ratemaking is not new. For example, Kudryavtsev (2009) employs the quantile regression technique to estimate quantiles of the net premium rate in ratemaking. We refer to Koenker (2005) for an overview of the quantile regression technique.

¹ Department of Risk Management and Insurance, Georgia State University, Atlanta, GA 30303, USA
² College of Mathematics and Statistics, Northwest Normal University, Lanzhou, Gansu 730000, P.R. China

Because insurance data typically contain many zero losses arising from policies with no claims, the above two simple VaR estimators are not efficient and often underestimate the risk, as they do not model the probability of having zero losses. When $P(S > 0|X = x_j) > 1 - \alpha$, we can write
$$\begin{aligned}
\mathrm{VaR}_S(\alpha|x_j) &= \inf\{s > 0 : P(S > s|X = x_j) \le 1 - \alpha\} \\
&= \inf\{s > 0 : P(S > s|X = x_j, S > 0)\,P(S > 0|X = x_j) \le 1 - \alpha\} \\
&= \inf\Bigl\{s > 0 : P(S > s|X = x_j, S > 0) \le \frac{1-\alpha}{P(S > 0|X = x_j)}\Bigr\},
\end{aligned} \tag{1}$$

which motivates the two-step inference in Heras, Moreno and Vilar-Zanón (2018) by separately estimating $\alpha^*_{x_j} = 1 - \frac{1-\alpha}{P(S > 0|X = x_j)}$ and $\mathrm{VaR}_S(\alpha^*_{x_j}|x_j, S > 0)$. Hence, when we model and infer $P(S > 0|X = x)$ soundly, the two-step inference should be better than the above two simple estimations without modeling $P(S > 0|X)$.
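As a small numerical illustration of the adjusted level appearing after (1), the following sketch computes $\alpha^*_{x_j}$ for a hypothetical category. The function name `adjusted_level` and the probability 0.20 are assumptions made only for this example; they do not come from the paper or its data.

```python
def adjusted_level(alpha: float, p_positive: float) -> float:
    """Return alpha* = 1 - (1 - alpha) / P(S > 0 | X = x_j).

    The decomposition in (1) requires P(S > 0 | X = x_j) > 1 - alpha;
    otherwise the VaR at level alpha is attained at a zero loss.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    if p_positive <= 1.0 - alpha:
        raise ValueError("need P(S > 0 | X = x_j) > 1 - alpha")
    return 1.0 - (1.0 - alpha) / p_positive


# Hypothetical category in which 20% of the policies have a positive loss.
alpha_star = adjusted_level(0.95, 0.20)
print(round(alpha_star, 4))  # 0.75
```

In this hypothetical case, the 95% VaR of the aggregate loss corresponds to the 75% quantile of the positive losses, which shows why the second-step quantile level varies with the category.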
More specifically, the first step in Heras, Moreno and Vilar-Zanón (2018) employs logistic regression to model and estimate $p_i = P(S_i > 0|X_i)$, belonging to the generalized linear models commonly used for estimating the probability of having nonzero claims in actuarial science; see De Jong and Heller (2008) and Goldburd, Khare and Tevet (2016). Note that the number of distinct $p_i$'s is $m$ at most because $X_i$ is categorical with $m$ levels. After obtaining the estimator $\hat\alpha^*_{x_j}$ for $\alpha^*_{x_j}$, which pools information from all levels of the explanatory variables, the second step in Heras, Moreno and Vilar-Zanón (2018) uses the quantile regression technique to estimate $\mathrm{VaR}_S(\hat\alpha^*_{x_j}|x_j, S > 0)$. When we model the conditional quantile of $S_i$ at the level $p$ given $S_i > 0$ and $X_i$ by a linear function of $X_i$, we can estimate $\mathrm{VaR}_S(p|x_j, S > 0)$ by the quantile regression technique, which pools information from all levels of the explanatory variables. However, it remains unclear how Heras, Moreno and Vilar-Zanón (2018) apply the quantile regression technique to estimate $\mathrm{VaR}_S(\hat\alpha^*_{x_j}|x_j, S > 0)$ because the risk level $\hat\alpha^*_{x_j}$ depends on the category $x_j$. Nevertheless, it is clear that Heras, Moreno and Vilar-Zanón (2018) in their data analysis apply quantile regression to those positive $S_i$'s and the related $X_i$'s in each category of $\{A_{x_j}\}_{j=1}^m$ rather than all positive $S_i$'s. Hence, we suspect that the quantile regression estimation in Heras, Moreno and Vilar-Zanón (2018) fails to pool information from all levels of the explanatory variables and is the same as the sample quantile based on those positive $S_i$'s in each category of $\{A_{x_j}\}_{j=1}^m$.

This paper proposes an alternative two-step inference for the VaR via logistic regression and the sample quantile rather than quantile regression. To quantify the inference uncertainty, which has not been addressed in Heras, Moreno and Vilar-Zanón (2018), we develop an empirical likelihood method. Applying the proposed empirical likelihood test to the real dataset in Heras, Moreno and Vilar-Zanón (2018), we find the risk estimates in Heras, Moreno and Vilar-Zanón (2018) are not significantly different from the true values without modeling the conditional quantile of $S$ given $X$ and $S > 0$. That is, using quantile regression in the second step is not necessary for categorical explanatory variables. We refer to Owen (2001) for an overview of the empirical likelihood method, which has been proved to be quite effective in interval estimations and hypothesis tests.
As we often do not observe all policies in the full cycle, it is important to take the exposure rates into account. Heras, Moreno and Vilar-Zanón (2018) do adjust for the exposure rates in the first step but ignore them in the second step. Moreover, Heras, Moreno and Vilar-Zanón (2018) neither state the model assumptions explicitly nor provide asymptotic results for their two-step inference. In this paper, we provide explicit model assumptions and a rigorous two-step inference with exposure rates adjusted in both steps. Also, we propose an efficient empirical likelihood method for uncertainty quantification.
We organize this paper as follows. Section 2 presents the explicit model, a two-step inference with asymptotic results, and an empirical likelihood method for uncertainty quantification. Section 3 applies the proposed method to the same insurer database in Heras, Moreno and Vilar-Zanón (2018) for testing whether the risk estimates in Heras, Moreno and Vilar-Zanón (2018) are significantly different from the true values without modeling the conditional quantile of $S$ given $X$ and $S > 0$. A simulation study is conducted in Section 4 to examine the finite sample performance of the proposed empirical likelihood method. Some conclusions are summarized in Section 5. All proofs are put in the Appendix.

2 Methodologies and Main Results
As argued in the introduction, this paper forecasts the Value-at-Risk of insurance losses given some categorical explanatory variables of a policyholder and develops an efficient empirical likelihood method for uncertainty quantification. To better appreciate the idea, we start with the detailed model description.

For each policyholder $i = 1, \dots, n$, we observe the aggregate loss $S_i$ in a particular year and some characteristics of this policyholder denoted by the $d$-dimensional categorical explanatory vector $X_i$ with $m$ categories $x_1, \dots, x_m$. Also, we observe the exposure rates $R_1, \dots, R_n$ because policies may not be observed in a full year. Let $S_i^*$ denote the aggregate loss of the $i$-th policyholder in a full year, which equals $S_i$ if $R_i = 1$. We use $S^*$ and $X$ to denote the aggregate loss in a full year and the related characteristics of a future policyholder, respectively. The question is to forecast the risk $\mathrm{VaR}_{S^*}(\alpha|X = x_j)$ for $j = 1, \dots, m$.

Throughout, we write $p_i = P(S_i > 0|X_i)$ and $p_i^* = P(S_i^* > 0|X_i)$. Assume

$$p_i = p_i^* R_i \quad \text{and} \quad S_i^* = S_i h(R_i) \quad \text{for } i = 1, \dots, n, \tag{2}$$

where $h$ is a known positive function. Like (1), when

$$P(S^* > 0|X = x_j) > 1 - \alpha \quad \text{for } j = 1, \dots, m, \tag{3}$$

we can write

$$\begin{aligned}
\mathrm{VaR}_{S^*}(\alpha|x_j) &= \inf\{s > 0 : P(S^* > s|X = x_j) \le 1 - \alpha\} \\
&= \inf\{s > 0 : P(S^* > s|X = x_j, S^* > 0)\,P(S^* > 0|X = x_j) \le 1 - \alpha\} \\
&= \inf\Bigl\{s > 0 : P(S^* > s|X = x_j, S^* > 0) \le \frac{1-\alpha}{P(S^* > 0|X = x_j)}\Bigr\},
\end{aligned} \tag{4}$$
which suggests the following two-step inference by separately estimating $\alpha^*_{x_j} = 1 - \frac{1-\alpha}{P(S^* > 0|X = x_j)}$ and $\mathrm{VaR}_{S^*}(\alpha^*_{x_j}|x_j, S^* > 0)$. To efficiently estimate $P(S^* > 0|X)$, the first step uses logistic regression to model $p_i$:

$$\operatorname{logit}(p_i^*) := \log\frac{p_i^*}{1 - p_i^*} = \beta^T \bar X_i \quad \text{for } i = 1, \dots, n,$$

which, under condition (2), is equivalent to

$$\log\frac{p_i/R_i}{1 - p_i/R_i} = \beta^T \bar X_i \quad \text{for } i = 1, \dots, n, \tag{5}$$

where $\bar X_i$ denotes the corresponding dummy variable of $X_i$ and $A^T$ denotes the transpose of the vector or matrix $A$. Using the standard logistic regression estimation technique, we can obtain the estimator $\hat\beta$ for $\beta$, which gives

$$\hat p_i = \frac{R_i}{1 + \exp\{-\hat\beta^T \bar X_i\}} \quad \text{and} \quad \hat p_i^* = \frac{1}{1 + \exp\{-\hat\beta^T \bar X_i\}} \quad \text{for } i = 1, \dots, n. \tag{6}$$
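To make the first step concrete, here is a minimal sketch of a logistic regression fit by Newton-Raphson and of the fitted probabilities in (6). It assumes full exposure ($R_i = 1$) during fitting, so it is plain logistic regression and not an exposure-adjusted fit; the helper names (`fit_logistic`, `fitted_probabilities`) and the toy data are hypothetical and only serve the illustration.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_logistic(rows, ys, iters=25):
    """Newton-Raphson for logit P(y = 1) = beta^T x (assumes R_i = 1)."""
    d = len(rows[0])
    beta = [0.0] * d
    for _ in range(iters):
        grad = [0.0] * d
        hess = [[0.0] * d for _ in range(d)]
        for x, y in zip(rows, ys):
            p = sigmoid(sum(b * v for b, v in zip(beta, x)))
            for a in range(d):
                grad[a] += (y - p) * x[a]
                for c in range(d):
                    hess[a][c] += p * (1.0 - p) * x[a] * x[c]
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
    return beta


def fitted_probabilities(beta, x, exposure=1.0):
    """Return (p_hat, p_hat_star) as in (6) for one dummy-coded row x."""
    p_star = sigmoid(sum(b * v for b, v in zip(beta, x)))
    return exposure * p_star, p_star


# Toy data: intercept plus one dummy column, 10 policies per level.
rows = [[1.0, 0.0]] * 10 + [[1.0, 1.0]] * 10
claims = [1, 1] + [0] * 8 + [1] * 5 + [0] * 5
beta_hat = fit_logistic(rows, claims)
p_level0 = fitted_probabilities(beta_hat, [1.0, 0.0])[1]
p_level1 = fitted_probabilities(beta_hat, [1.0, 1.0])[1]
```

Because this toy model is saturated, the fitted probabilities reproduce the observed claim frequencies of the two levels. With several additive factors, as in the paper's application, the fit pools information across categories instead.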
As $X_i$ only takes $m$ values, $\bar X_i$ has $m$ different values, implying that $\{\hat p_i^*\}_{i=1}^n$ have $m$ distinct values at most. Hence, for $j = 1, \dots, m$, we write

$$p^*_{x_j} = \frac{1}{1 + \exp\{-\beta^T \bar x_j\}} \quad \text{and} \quad \hat p^*_{x_j} = \frac{1}{1 + \exp\{-\hat\beta^T \bar x_j\}}$$

and estimate $\alpha^*_{x_j}$ by

$$\hat\alpha^*_{x_j} = 1 - \frac{1-\alpha}{\hat p^*_{x_j}}, \tag{7}$$

where $\bar x_j$ is the corresponding dummy variable of $x_j$.

Define

$$A^+_{x_j} = \{i : X_i = x_j, S_i > 0, 1 \le i \le n\} \quad \text{for } j = 1, \dots, m.$$

It follows from (2) that we can estimate $\mathrm{VaR}_{S^*}(\alpha^*_{x_j}|x_j, S^* > 0)$ by the sample quantile at the level $\hat\alpha^*_{x_j}$ based on $\{h(R_i)S_i : i \in A^+_{x_j}, i = 1, \dots, n\}$, which gives the two-step estimator of $\mathrm{VaR}_{S^*}(\alpha|x_j)$ as

$$\widehat{\mathrm{VaR}}_{S^*}(\alpha|x_j) = \inf\Bigl\{s : \frac{\sum_{i=1}^n I(i \in A^+_{x_j}, h(R_i)S_i > s)}{\sum_{i=1}^n I(i \in A^+_{x_j})} \le \frac{1-\alpha}{\hat p^*_{x_j}}\Bigr\}.$$
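The second step can be sketched as follows, taking the first-step estimate $\hat p^*_{x_j}$ as given. The function name `two_step_var` and the toy inputs are assumptions for illustration; the input list is meant to hold the exposure-adjusted positive losses $h(R_i)S_i$ of one category.

```python
from bisect import bisect_right


def two_step_var(positive_losses, p_hat_star, alpha):
    """Evaluate inf{s : share of losses above s <= (1 - alpha) / p_hat_star}.

    The share of losses exceeding s is a decreasing step function of s that
    only jumps at observed losses, so the infimum is attained at one of the
    order statistics and a scan over the sorted sample suffices.
    """
    if not positive_losses:
        raise ValueError("the category has no positive losses")
    if p_hat_star <= 1.0 - alpha:
        raise ValueError("need p_hat_star > 1 - alpha")
    ordered = sorted(positive_losses)
    m = len(ordered)
    threshold = (1.0 - alpha) / p_hat_star
    for s in ordered:
        exceed = m - bisect_right(ordered, s)  # losses strictly above s
        if exceed / m <= threshold:
            return s
    raise RuntimeError("unreachable: the largest loss has no exceedances")


# Toy category: ten positive losses and a fitted claim probability of 0.20.
estimate = two_step_var(list(range(1, 11)), p_hat_star=0.20, alpha=0.95)
print(estimate)  # 8
```

Here the threshold is $0.05/0.20 = 0.25$, so at most two of the ten positive losses may exceed the estimate, which gives the eighth smallest loss.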
Here $I(\cdot)$ denotes the indicator function. A simple application of the standard asymptotic theory can prove the asymptotic normality of the above VaR estimator, but the asymptotic variance is complicated and depends on the inference uncertainties of both steps. To construct a confidence interval or conduct a hypothesis test without estimating the asymptotic variance of the above two-step risk estimation, we propose the following empirical likelihood method based on the estimating equation approach in Qin and Lawless (1994).

It follows from (5) that

$$p_i(\beta) = \frac{R_i}{1 + \exp\{-\beta^T \bar X_i\}}$$

and the corresponding scores are

$$Z_i(\beta) = \frac{I(S_i > 0)}{p_i(\beta)} \frac{\partial p_i(\beta)}{\partial \beta} - \frac{I(S_i = 0)}{1 - p_i(\beta)} \frac{\partial p_i(\beta)}{\partial \beta} \quad \text{for } i = 1, \dots, n.$$

Put $\theta = \mathrm{VaR}_{S^*}(\alpha|x_j) = \mathrm{VaR}_{S^*}(\alpha^*_{x_j}|x_j, S^* > 0)$. For $i \in A^+_{x_j}$, it follows from (4) that $P(S_i > \theta|X_i) = \frac{1-\alpha}{p_i/R_i}$. Hence, we define

$$Z_{i,x_j}(\theta, \beta) = I(S_i > \theta) - \frac{1-\alpha}{p_i(\beta)/R_i} \quad \text{for } i \in A^+_{x_j}.$$
By noting that $Z_i$'s are constructed from $\{I(S_i > 0)\}_{i=1}^n$, $Z_{i,x_j}$'s are based on $\{S_i : i \in A^+_{x_j}, i = 1, \dots, n\}$, and $\{I(S_i > 0)\}_{i=1}^n$ is independent of $\{S_i : i \in A^+_{x_j}, i = 1, \dots, n\}$, we formulate the empirical likelihood function for $\theta = \mathrm{VaR}_{S^*}(\alpha|x_j)$ based on the method for two independent samples as

$$\begin{aligned}
L(\theta, \beta|x_j) = \sup\Bigl\{ & \prod_{k=1}^n (n q_k) \prod_{i \in A^+_{x_j}} (m^+_{x_j} q_i^*) : q_k \ge 0 \text{ for } k = 1, \dots, n,\ \sum_{k=1}^n q_k = 1,\ \sum_{k=1}^n q_k Z_k(\beta) = 0, \\
& q_i^* > 0 \text{ for } i \in A^+_{x_j},\ \sum_{i \in A^+_{x_j}} q_i^* = 1,\ \sum_{i \in A^+_{x_j}} q_i^* Z_{i,x_j}(\theta, \beta) = 0 \Bigr\},
\end{aligned}$$

where $m^+_{x_j} = \sum_{i=1}^n I(i \in A^+_{x_j})$. It follows from the Lagrange multiplier technique that

$$l(\theta, \beta|x_j) := -2 \log L(\theta, \beta|x_j) = 2 \sum_{i=1}^n \log\{1 + \lambda_1^T Z_i(\beta)\} + 2 \sum_{i \in A^+_{x_j}} \log\{1 + \lambda_2 Z_{i,x_j}(\theta, \beta)\},$$

where $\lambda_1 = \lambda_1(\beta)$ and $\lambda_2 = \lambda_2(\theta, \beta)$ satisfy

$$\sum_{i=1}^n \frac{Z_i(\beta)}{1 + \lambda_1^T Z_i(\beta)} = 0 \quad \text{and} \quad \sum_{i \in A^+_{x_j}} \frac{Z_{i,x_j}(\theta, \beta)}{1 + \lambda_2 Z_{i,x_j}(\theta, \beta)} = 0. \tag{8}$$

Because we are interested in $\theta = \mathrm{VaR}_{S^*}(\alpha|x_j)$, we consider the following profile log-empirical likelihood ratio

$$l_P(\theta|x_j) = \min_\beta l(\theta, \beta|x_j).$$
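The scalar multiplier $\lambda_2$ in (8) can be found by a one-dimensional root search. The sketch below handles only the second term of $l(\theta, \beta|x_j)$ with $\beta$ held fixed, so it is a simplified building block and not the profile statistic $l_P$, which also involves $\lambda_1$ and a minimization over $\beta$. The names `z_values` and `el_log_ratio`, and the use of bisection, are choices made for this illustration.

```python
import math


def z_values(losses, theta, ratio):
    """Z_{i,x_j} = I(S_i > theta) - ratio, with ratio = (1 - alpha) / p*."""
    return [(1.0 if s > theta else 0.0) - ratio for s in losses]


def el_log_ratio(z):
    """Return (lam, 2 * sum(log(1 + lam * z_i))) with lam solving
    sum(z_i / (1 + lam * z_i)) = 0, the scalar equation in (8)."""
    z_min, z_max = min(z), max(z)
    if not z_min < 0.0 < z_max:
        raise ValueError("zero must lie strictly inside the range of z")
    # Implied weights are positive only when 1 + lam * z_i > 0 for all i.
    lo = (-1.0 / z_max) * (1.0 - 1e-10)
    hi = (-1.0 / z_min) * (1.0 - 1e-10)

    def score(lam):
        return sum(v / (1.0 + lam * v) for v in z)

    # score is strictly decreasing in lam, so bisection finds its root.
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return lam, 2.0 * sum(math.log1p(lam * v) for v in z)


# Toy example: eight positive losses, candidate theta = 6, ratio = 0.5.
z = z_values([1, 2, 3, 4, 5, 6, 7, 8], theta=6.0, ratio=0.5)
lam, stat = el_log_ratio(z)
```

A candidate $\theta$ with no losses above it or no losses below it makes all $Z_{i,x_j}$ share one sign, in which case the constraint set is empty and the function raises an error.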
Theorem 1. Assume $\{(S_i, X_i^T)^T\}_{i=1}^n$ is a sequence of independent random vectors, $\{R_i\}_{i=1}^n$ are deterministic and $X_i$ is a $d$-dimensional categorical vector with levels $x_1, \dots, x_m$. Further assume (2), (3), (5), $\lim_{n\to\infty} m^+_{x_j}/n = \tau^+_{x_j} \in (0, 1)$ for $j = 1, \dots, m$, $E\{Z_1(\beta^0) Z_1^T(\beta^0)\}$ is positive definite with $\beta^0$ denoting the true value of $\beta$, and the conditional distribution function of $S_i^*$ given $X_i = x_j$ and $S_i^* > 0$ is independent of $i$ with a positive density at $\mathrm{VaR}_{S^*}(\alpha^*_{x_j}|x_j, S^* > 0)$ for $j = 1, \dots, m$. Then, for each $j = 1, \dots, m$, $l_P(\mathrm{VaR}_{S^*}(\alpha|x_j)|x_j)$ converges in distribution to a chi-squared limit with one degree of freedom as $n \to \infty$.

Based on the above theorem, an empirical likelihood confidence interval with level $a$ for $\mathrm{VaR}_{S^*}(\alpha|x_j)$ is obtained as

$$I_a(\alpha|x_j) = \{\theta : l_P(\theta|x_j) \le \chi^2_{1,a}\},$$

where $\chi^2_{1,a}$ denotes the $a$-th quantile of a chi-squared distribution function with one degree of freedom. Similarly, an empirical likelihood test for $H_0 : \mathrm{VaR}_{S^*}(\alpha|x_j) = \theta_0$ at the level $a$ rejects the null hypothesis whenever $l_P(\theta_0|x_j) > \chi^2_{1,1-a}$.
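The critical values and P-values used above only require the chi-squared distribution with one degree of freedom, which is the distribution of a squared standard normal variable. A minimal sketch using the Python standard library follows; the function names are illustrative assumptions, and the profile statistic is taken as an input computed elsewhere.

```python
import math
from statistics import NormalDist


def chi2_1_quantile(level):
    """a-th quantile of chi-squared(1): squared normal quantile at (1 + a) / 2."""
    return NormalDist().inv_cdf(0.5 * (1.0 + level)) ** 2


def chi2_1_pvalue(stat):
    """P(chi-squared(1) > stat) = 2 * (1 - Phi(sqrt(stat)))."""
    return 2.0 * (1.0 - NormalDist().cdf(math.sqrt(stat)))


def in_confidence_interval(profile_stat, level):
    """True when theta belongs to I_a, i.e. l_P(theta | x_j) <= chi2_{1,a}."""
    return profile_stat <= chi2_1_quantile(level)


def reject_null(profile_stat, test_level):
    """Reject H0 at level a when l_P(theta_0 | x_j) > chi2_{1,1-a}."""
    return profile_stat > chi2_1_quantile(1.0 - test_level)
```

For instance, `chi2_1_quantile(0.95)` is approximately 3.84, so a hypothesized value $\theta_0$ is rejected at the 5% level exactly when its profile statistic exceeds that number.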
Remark 1. We can develop a similar empirical likelihood method for constructing a confidence region of $(\mathrm{VaR}_{S^*}(\alpha_1|x_1), \dots, \mathrm{VaR}_{S^*}(\alpha_k|x_k))^T$ and testing

$$H_0 : \mathrm{VaR}_{S^*}(\alpha_1|x_1) = \theta_1, \dots, \mathrm{VaR}_{S^*}(\alpha_k|x_k) = \theta_k,$$

where $k$ is a given integer. Here we allow different risk levels.

3 Application to an insurer database
We reexamine the dataset from an Australian automobile insurance company between the years 2004 and 2005, which is analyzed by Heras, Moreno and Vilar-Zanón (2018) based on the two-step inference of using logistic regression at the first step and quantile regression at the second step. The total number of policies is $n = 67856$, and the categorical explanatory variable $X_i$ involves the age of the vehicle with four levels and the driver's age with six levels. Hence, the total number of different levels of $X_i$ is $m = 24$. We refer to De Jong and Heller (2008) for a detailed description of this dataset.

Following Heras, Moreno and Vilar-Zanón (2018), we use $h(R_i) = 1$ without adjusting the exposure rates at the second step. Because Heras, Moreno and Vilar-Zanón (2018) apply quantile regression to the sample in each category of $\{A^+_{x_j}\}_{j=1}^{24}$, we suspect that their estimates may be equal to our estimates $\{\widehat{\mathrm{VaR}}_{S^*}(\alpha|x_j)\}_{j=1}^{24}$ without modeling the conditional quantile of $S^*$ given $X$ and $S^* > 0$. After computing $\{\widehat{\mathrm{VaR}}_{S^*}(0.95|x_j)\}_{j=1}^{24}$ and reporting them in Table 1 below, we find that our two-step estimates are different from the two-step estimates in Heras, Moreno and Vilar-Zanón (2018). To further investigate the effectiveness and necessity of using quantile regression, we employ the proposed empirical likelihood method to test whether the estimates in Heras, Moreno and Vilar-Zanón (2018) are significantly different from the true values without modeling the conditional quantile of $S^*$ given $X$ and $S^* > 0$. The P-values reported in Table 1 show that the two-step estimates in Heras, Moreno and Vilar-Zanón (2018) are not significantly different from the true values at the 5% significance level. In other words, the estimates in Heras, Moreno and Vilar-Zanón (2018) are in the 95% confidence intervals of the true values without modeling the conditional quantile of $S^*$ given $X$ and $S^* > 0$, i.e., it is not significantly useful to employ quantile regression.
Table 1: We report our two-step estimates $\{\widehat{\mathrm{VaR}}_{S^*}(0.95|x_j)\}_{j=1}^{24}$, copy the two-step estimates $\{\widetilde{\mathrm{VaR}}_{S^*}(0.95|x_j)\}_{j=1}^{24}$ from Heras, Moreno and Vilar-Zanón (2018), and report the P-values of the proposed empirical likelihood test for testing whether the true Value-at-Risk equals the estimate in Heras, Moreno and Vilar-Zanón (2018) for each group. The two numbers inside the bracket of Group represent the levels of the age of the vehicle and the driver's age, respectively.

| Group | $\widetilde{\mathrm{VaR}}_{S^*}(0.95\|x_j)$ | P-value | $\widehat{\mathrm{VaR}}_{S^*}(0.95\|x_j)$ |
|---|---|---|---|
| 1(2&1) | 3212.78 | 0.6431 | 3277.82 |
| 2(1&1) | 2534.94 | 0.6584 | 3241.63 |
| 3(3&1) | 2901.37 | 0.4582 | 2636.08 |
| 4(2&2) | 1726.35 | 0.2197 | 1438.54 |
| 5(4&1) | 2927.59 | 0.4146 | 2442.07 |
| 6(1&2) | 1407.25 | 0.2456 | 1573.67 |
| 7(2&3) | 1556.17 | 0.7342 | 1615.35 |
| 8(1&3) | 1327.27 | 0.0773 | 1045.51 |
| 9(2&4) | 1347.33 | 0.9427 | 1346.14 |
| 10(3&2) | 1691.04 | 0.9821 | 1709.33 |
| 11(1&4) | 1143.21 | 0.6944 | 1203.77 |
| 12(3&3) | 1487.55 | 0.7001 | 1576.59 |
| 13(4&2) | 1736.64 | 0.5042 | 1859.44 |
| 14(3&4) | 1283.53 | 0.8254 | 1271.97 |
| 15(4&3) | 1470.00 | 0.9062 | 1478.81 |
| 16(4&4) | 1311.30 | 0.9199 | 1319.02 |
| 17(2&5) | 1014.82 | 0.7465 | 1073.10 |
| 18(2&6) | 1146.38 | 0.5192 | 1086.51 |
| 19(1&5) | 837.34 | 0.8487 | 836.90 |
| 20(1&6) | 947.30 | 0.7327 | 926.72 |
| 21(3&5) | 914.85 | 0.8660 | 969.59 |
| 22(3&6) | 1067.53 | 0.1958 | 1193.75 |
| 23(4&5) | 889.11 | 0.9058 | 882.40 |
| 24(4&6) | 1062.24 | 0.1844 | 893.90 |

4 Simulation study
This section investigates the finite sample performance of the proposed empirical likelihood confidence interval in terms of coverage accuracy.

First, we compute the sample proportions $\tau_{x_1}, \dots, \tau_{x_{24}}$ of those 24 categories in the real dataset analyzed in Section 3 and generate the explanatory variables $X_i = (\mathrm{VehAge}_i, \mathrm{AgeCat}_i)^T$ to have the sample size $n_j = [n\tau_{x_j}]$ for the $j$-th category, where $j = 1, \dots, 24$. Here $\mathrm{VehAge}_i$ denotes the age of the vehicle with levels 1 (newest), 2, 3, and 4, and $\mathrm{AgeCat}_i$ denotes the driver's age with levels 1 (youngest), 2, 3, 4, 5, and 6. Note that the total sample size $\tilde n = n_1 + \cdots + n_{24}$ may be smaller than $n$ due to the rounding effect.

Next, using the estimator $\hat\beta$ in fitting (5) to the real dataset in Section 3, we generate independent Bernoulli random variables with $p_i = \frac{1}{1 + \exp\{-\hat\beta^T \bar X_i\}}$ for $i = 1, \dots, \tilde n$, where $\bar X_i$ is the dummy variable of $X_i$. We use $\{N_i\}_{i=1}^{\tilde n}$ to denote these Bernoulli variables, which indicate whether a loss is positive or zero.

For each $i = 1, \dots, \tilde n$, we further generate $S_i$ from a standard Gamma distribution with shape parameter being the sum of the levels of $\mathrm{VehAge}_i$ and $\mathrm{AgeCat}_i$ when $N_i = 1$, and set $S_i = 0$ when $N_i = 0$. We use $R_i = 1$ and $h(R_i) = 1$ for all $i = 1, \dots, \tilde n$, i.e., all policies are observed in a full year cycle.

Based on the generated data $\{(S_i, X_i^T)^T\}_{i=1}^{\tilde n}$, for each category, we first compute the true value $\mathrm{VaR}_{S^*}(0.95|x_j)$ from the employed Gamma distribution and the corresponding $p_{x_j} = \frac{1}{1 + \exp\{-\hat\beta^T \bar x_j\}}$, and then calculate the profile log-empirical likelihood ratio $l_P(\mathrm{VaR}_{S^*}(0.95|x_j)|x_j)$. Repeat this procedure 5000 times so that the empirical coverage probabilities for the empirical likelihood confidence intervals $I_{0.90}(0.95|x_j)$ and $I_{0.95}(0.95|x_j)$ are obtained and reported in Table 2. This table shows that the proposed empirical likelihood method produces accurate confidence intervals and the coverage accuracy improves as the sample size becomes larger.
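The data-generating step and the true Value-at-Risk for one category can be sketched as below. The sketch relies on the shape parameter being an integer (a sum of two levels), for which the standard Gamma distribution function has a closed form. The probability of a positive loss is a hypothetical input here, whereas the paper obtains it from the fitted $\hat\beta$; the function names are likewise assumptions.

```python
import math
import random


def erlang_cdf(x, shape):
    """CDF of a standard Gamma distribution with integer shape and unit scale."""
    if x <= 0.0:
        return 0.0
    term, total = 1.0, 1.0
    for n in range(1, shape):
        term *= x / n
        total += term
    return 1.0 - math.exp(-x) * total


def true_var(alpha, p_positive, shape):
    """VaR at level alpha of S = N * G, N ~ Bernoulli(p), G ~ Gamma(shape, 1)."""
    if p_positive <= 1.0 - alpha:
        raise ValueError("need p_positive > 1 - alpha")
    level = 1.0 - (1.0 - alpha) / p_positive  # adjusted level, as in (4)
    lo, hi = 0.0, 1.0
    while erlang_cdf(hi, shape) < level:  # bracket the quantile
        hi *= 2.0
    for _ in range(200):  # bisection on the increasing CDF
        mid = 0.5 * (lo + hi)
        if erlang_cdf(mid, shape) < level:
            lo = mid
        else:
            hi = mid
    return hi


def simulate_losses(n, p_positive, shape, rng):
    """Draw n losses: zero with probability 1 - p, Gamma(shape, 1) otherwise."""
    return [rng.gammavariate(shape, 1.0) if rng.random() < p_positive else 0.0
            for _ in range(n)]


# Hypothetical category with VehAge = 2 and AgeCat = 3, hence shape 5.
rng = random.Random(2020)
losses = simulate_losses(1000, p_positive=0.10, shape=5, rng=rng)
var_95 = true_var(0.95, p_positive=0.10, shape=5)
```

One replication of the coverage experiment would then evaluate the profile statistic at `var_95` for the simulated sample and compare it with the chi-squared critical value.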
Table 2: We report $\tau_j$'s, the empirical coverage probabilities for $I_{0.9}(0.95|x_j)$ and $I_{0.95}(0.95|x_j)$ of the proposed empirical likelihood based confidence intervals with sample sizes $n = 30{,}000$ and $n = 60{,}000$. The two numbers inside the bracket of Group represent the levels of the age of the vehicle and the driver's age, respectively.

| Group | $\tau_j$ | $n = 30000$, $I_{0.9}(0.95\|x_j)$ | $n = 30000$, $I_{0.95}(0.95\|x_j)$ | $n = 60000$, $I_{0.9}(0.95\|x_j)$ | $n = 60000$, $I_{0.95}(0.95\|x_j)$ |
|---|---|---|---|---|---|
| 1(2&1) | 2.2165% | 0.9152 | 0.9588 | 0.9086 | 0.9542 |
| 2(1&1) | 1.8908% | 0.9022 | 0.9522 | 0.9040 | 0.9520 |
| 3(3&1) | 2.4213% | 0.9064 | 0.9556 | 0.9042 | 0.9544 |
| 4(2&2) | 4.6672% | 0.9104 | 0.9558 | 0.9094 | 0.9572 |
| 5(4&1) | 1.9335% | 0.9084 | 0.9532 | 0.9110 | 0.9550 |
| 6(1&2) | 3.1832% | 0.9090 | 0.9524 | 0.9072 | 0.9534 |
| 7(2&3) | 5.5131% | 0.9126 | 0.9594 | 0.9068 | 0.9554 |
| 8(1&3) | 3.9879% | 0.9122 | 0.9604 | 0.9128 | 0.9550 |
| 9(2&4) | 5.7755% | 0.9102 | 0.9560 | 0.9144 | 0.9570 |
| 10(3&2) | 5.8230% | 0.9090 | 0.9544 | 0.9150 | 0.9528 |
| 11(1&4) | 4.3253% | 0.9190 | 0.9604 | 0.9070 | 0.9592 |
| 12(3&3) | 7.1121% | 0.9126 | 0.9606 | 0.9096 | 0.9556 |
| 13(4&2) | 5.2936% | 0.9160 | 0.9590 | 0.9142 | 0.9610 |
| 14(3&4) | 7.0149% | 0.9186 | 0.9580 | 0.9200 | 0.9592 |
| 15(4&3) | 6.6228% | 0.9120 | 0.9584 | 0.9176 | 0.9604 |
| 16(4&4) | 6.7422% | 0.9174 | 0.9584 | 0.9128 | 0.9618 |
| 17(2&5) | 3.8832% | 0.9196 | 0.9680 | 0.9076 | 0.9558 |
| 18(2&6) | 2.3889% | 0.9098 | 0.9602 | 0.9060 | 0.9552 |
| 19(1&5) | 3.0093% | 0.9112 | 0.9570 | 0.9124 | 0.9600 |
| 20(1&6) | 1.6668% | 0.9122 | 0.9598 | 0.9130 | 0.9576 |
| 21(3&5) | 4.5508% | 0.9154 | 0.9632 | 0.9124 | 0.9606 |
| 22(3&6) | 2.6394% | 0.9198 | 0.9642 | 0.9130 | 0.9574 |
| 23(4&5) | 4.3784% | 0.9218 | 0.9612 | 0.9206 | 0.9664 |
| 24(4&6) | 2.9533% | 0.9164 | 0.9540 | 0.9146 | 0.9580 |

5 Conclusions

To accurately forecast the Value-at-Risk of the aggregate insurance losses in insurance ratemaking, Heras, Moreno and Vilar-Zanón (2018) propose to model the probability with nonzero claims by logistic regression at the first step and model the conditional quantile of the aggregate loss given nonzero claims by quantile regression at the second step. When the explanatory variables are categorical, the adjusted quantile level from the first step is different for each category and quantile regression can only be applied to each category. Thus it is unnecessary to employ quantile regression at the second step. This observation motivates us to estimate the Value-at-Risk and quantify the uncertainty without using quantile regression. This paper provides the explicit model assumptions and develops an empirical likelihood method to construct a confidence interval for the Value-at-Risk measure and to test whether the two-step estimates in Heras, Moreno and Vilar-Zanón (2018) are significantly different from those without modeling the conditional quantile of the aggregate loss given the explanatory variable. Data analysis with the proposed new method does show that the second step of using quantile regression is not necessary. A simulation study confirms the good finite sample performance of the proposed empirical likelihood method.
Acknowledgements
We thank a reviewer for helpful comments. Peng’s research was supported by Simons Foundation.
References
[1] Goldburd, M., Khare, A. and Tevet, D. (2016). Generalized Linear Models for Insurance Rating. CAS Monograph Series Number 5.

[2] De Jong, P. and Heller, G.Z. (2008). Generalized Linear Models for Insurance Data. Cambridge University Press.

[3] Heras, A., Moreno, I. and Vilar-Zanón, J. (2018). An application of two-stage quantile regression to insurance ratemaking. Scandinavian Actuarial Journal 9, 753–769.

[4] Koenker, R. (2005). Quantile Regression. Cambridge University Press.

[5] Kudryavtsev, A.A. (2009). Using quantile regression for ratemaking. Insurance: Mathematics and Economics 45, 296–304.

[6] Owen, A.B. (2001). Empirical Likelihood. Chapman and Hall.

[7] Qin, J. and Lawless, J. (1994). Empirical likelihood and general estimating equations. Annals of Statistics 22, 300–325.
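Appendix: Proof of Theorem 1

Before proving Theorem 1, we need some lemmas, where $\beta^0$ and $\theta_j^0$ denote the true values of $\beta$ and $\theta_j = \mathrm{VaR}_{S^*}(\alpha|x_j)$, respectively. We use $\xrightarrow{d}$ and $\xrightarrow{p}$ to denote the convergence in distribution and in probability, respectively. We also use the following conventional notation for partial derivatives:

• $\frac{\partial y}{\partial x} = \bigl(\frac{\partial y}{\partial x_1}, \dots, \frac{\partial y}{\partial x_{d_1}}\bigr)^T$ when $y$ is scalar and $x = (x_1, \dots, x_{d_1})^T$;

• $\frac{\partial y}{\partial x} = \begin{pmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_{d_1}} \\ \vdots & \cdots & \vdots \\ \frac{\partial y_{d_2}}{\partial x_1} & \cdots & \frac{\partial y_{d_2}}{\partial x_{d_1}} \end{pmatrix}$ when $x = (x_1, \dots, x_{d_1})^T$ and $y = (y_1, \dots, y_{d_2})^T$.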
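Lemma 1. Under conditions of Theorem 1, for each $j = 1, \dots, m$, we have

$$\begin{aligned}
& \frac{1}{\sqrt{n}} \sum_{i=1}^n Z_i(\beta^0) \xrightarrow{d} W_1 \sim N(0, \Sigma), \\
& \frac{1}{n} \sum_{i=1}^n Z_i(\beta^0) Z_i^T(\beta^0) \xrightarrow{p} \Sigma = E\{Z_1(\beta^0) Z_1^T(\beta^0)\}, \\
& \frac{1}{\sqrt{m^+_{x_j}}} \sum_{i \in A^+_{x_j}} Z_{i,x_j}(\theta_j^0, \beta^0) \xrightarrow{d} W_2 \sim N(0, \sigma_j^2), \\
& \frac{1}{m^+_{x_j}} \sum_{i \in A^+_{x_j}} \{Z_{i,x_j}(\theta_j^0, \beta^0)\}^2 \xrightarrow{p} \sigma_j^2 = \frac{\alpha - 1 + p^*_{x_j}}{p^*_{x_j}} \Bigl\{1 - \frac{\alpha - 1 + p^*_{x_j}}{p^*_{x_j}}\Bigr\},
\end{aligned}$$

as $n \to \infty$, where $W_1$ and $W_2$ are independent, and $p^*_{x_j}$ is defined in (7).

Proof. It directly follows from the central limit theorem, weak law of large numbers and the independence between $\{I(S_i > 0)\}_{i=1}^n$ and $\{S_i : i \in A^+_{x_j}, i = 1, \dots, n\}$.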
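The form of $\sigma_j^2$ in Lemma 1 can be checked directly. The following is a short sketch in which $q_j = (1-\alpha)/p^*_{x_j}$ is shorthand introduced only for this check. At the true values, and given $i \in A^+_{x_j}$, the indicator $I(S_i > \theta_j^0)$ is a Bernoulli variable with success probability $q_j$ by (4), and $Z_{i,x_j}(\theta_j^0, \beta^0) = I(S_i > \theta_j^0) - q_j$.

```latex
\begin{align*}
E\{Z_{i,x_j}(\theta_j^0,\beta^0)\} &= q_j - q_j = 0,\\
\operatorname{Var}\{Z_{i,x_j}(\theta_j^0,\beta^0)\} &= q_j(1-q_j),
\qquad 1-q_j = \frac{\alpha-1+p^*_{x_j}}{p^*_{x_j}},\\
\sigma_j^2 &= \frac{\alpha-1+p^*_{x_j}}{p^*_{x_j}}
\Bigl\{1-\frac{\alpha-1+p^*_{x_j}}{p^*_{x_j}}\Bigr\}.
\end{align*}
```

Condition (3) guarantees $0 < q_j < 1$, so $\sigma_j^2$ is strictly positive.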
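Lemma 2. Under conditions of Theorem 1, as $n \to \infty$, with probability tending to one, $l(\theta_j^0, \beta|x_j)$ for each $j = 1, \dots, m$ attains its minimum value at some point $\tilde\beta$ in the interior of the ball $\|\beta - \beta^0\| \le n^{-1/3}$, and $\tilde\beta$, $\tilde\lambda_1 = \lambda_1(\tilde\beta)$ and $\tilde\lambda_2 = \lambda_2(\tilde\beta)$ satisfy that

$$Q_{1n}(\tilde\beta, \tilde\lambda_1) = 0, \quad Q_{2n}(\tilde\beta, \tilde\lambda_2) = 0, \quad Q_{3n}(\tilde\beta, \tilde\lambda_1, \tilde\lambda_2) = 0, \tag{9}$$

where

$$Q_{1n}(\beta, \lambda_1) = \frac{1}{n} \sum_{i=1}^n \frac{Z_i(\beta)}{1 + \lambda_1^T Z_i(\beta)}, \quad Q_{2n}(\beta, \lambda_2) = \frac{1}{n} \sum_{i \in A^+_{x_j}} \frac{Z_{i,x_j}(\theta_j^0, \beta)}{1 + \lambda_2 Z_{i,x_j}(\theta_j^0, \beta)}$$

and

$$Q_{3n}(\beta, \lambda_1, \lambda_2) = \frac{1}{n} \sum_{i=1}^n \frac{1}{1 + \lambda_1^T Z_i(\beta)} \Bigl(\frac{\partial Z_i(\beta)}{\partial \beta}\Bigr)^T \lambda_1 + \frac{1}{n} \sum_{i \in A^+_{x_j}} \frac{1}{1 + \lambda_2 Z_{i,x_j}(\theta_j^0, \beta)} \frac{\partial Z_{i,x_j}(\theta_j^0, \beta)}{\partial \beta} \lambda_2.$$

Proof. It follows from the same arguments in the proof of Lemma 1 in Qin and Lawless (1994).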
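Proof of Theorem 1. Note that

$$\begin{aligned}
& \frac{\partial Q_{1n}(\beta, 0)}{\partial \beta} = \frac{1}{n} \sum_{i=1}^n \frac{\partial Z_i(\beta)}{\partial \beta}, \quad \frac{\partial Q_{1n}(\beta, 0)}{\partial \lambda_1} = -\frac{1}{n} \sum_{i=1}^n Z_i(\beta) Z_i^T(\beta), \quad \frac{\partial Q_{1n}(\beta, 0)}{\partial \lambda_2} = 0, \\
& \frac{\partial Q_{2n}(\beta, 0)}{\partial \beta} = \frac{1}{n} \sum_{i \in A^+_{x_j}} \frac{\partial Z_{i,x_j}(\theta_j^0, \beta)}{\partial \beta}, \quad \frac{\partial Q_{2n}(\beta, 0)}{\partial \lambda_1} = 0, \quad \frac{\partial Q_{2n}(\beta, 0)}{\partial \lambda_2} = -\frac{1}{n} \sum_{i \in A^+_{x_j}} \{Z_{i,x_j}(\theta_j^0, \beta)\}^2, \\
& \frac{\partial Q_{3n}(\beta, 0, 0)}{\partial \beta} = 0, \quad \frac{\partial Q_{3n}(\beta, 0, 0)}{\partial \lambda_1} = \frac{1}{n} \sum_{i=1}^n \Bigl(\frac{\partial Z_i(\beta)}{\partial \beta}\Bigr)^T, \quad \frac{\partial Q_{3n}(\beta, 0, 0)}{\partial \lambda_2} = \frac{1}{n} \sum_{i \in A^+_{x_j}} \frac{\partial Z_{i,x_j}(\theta_j^0, \beta)}{\partial \beta}.
\end{aligned}$$

By Taylor expansion and Lemmas 1-2, we have

$$\begin{aligned}
0 = Q_{1n}(\tilde\beta, \tilde\lambda_1) &= Q_{1n}(\beta^0, 0) + \frac{\partial Q_{1n}(\beta^0, 0)}{\partial \beta}(\tilde\beta - \beta^0) + \frac{\partial Q_{1n}(\beta^0, 0)}{\partial \lambda_1}\tilde\lambda_1 + \frac{\partial Q_{1n}(\beta^0, 0)}{\partial \lambda_2}\tilde\lambda_2 + o_p(\delta_n), \\
0 = Q_{2n}(\tilde\beta, \tilde\lambda_2) &= Q_{2n}(\beta^0, 0) + \Bigl(\frac{\partial Q_{2n}(\beta^0, 0)}{\partial \beta}\Bigr)^T(\tilde\beta - \beta^0) + \Bigl(\frac{\partial Q_{2n}(\beta^0, 0)}{\partial \lambda_1}\Bigr)^T\tilde\lambda_1 + \frac{\partial Q_{2n}(\beta^0, 0)}{\partial \lambda_2}\tilde\lambda_2 + o_p(\delta_n), \\
0 = Q_{3n}(\tilde\beta, \tilde\lambda_1, \tilde\lambda_2) &= Q_{3n}(\beta^0, 0, 0) + \frac{\partial Q_{3n}(\beta^0, 0, 0)}{\partial \beta}(\tilde\beta - \beta^0) + \frac{\partial Q_{3n}(\beta^0, 0, 0)}{\partial \lambda_1}\tilde\lambda_1 + \frac{\partial Q_{3n}(\beta^0, 0, 0)}{\partial \lambda_2}\tilde\lambda_2 + o_p(\delta_n),
\end{aligned}$$

where $\delta_n = \|\tilde\beta - \beta^0\| + \|\tilde\lambda_1\| + |\tilde\lambda_2|$. That is,

$$\begin{pmatrix} \tilde\lambda_2 \\ \tilde\lambda_1 \\ \tilde\beta - \beta^0 \end{pmatrix} = S_n^{-1} \begin{pmatrix} -Q_{2n}(\beta^0, 0) + o_p(\delta_n) \\ -Q_{1n}(\beta^0, 0) + o_p(\delta_n) \\ o_p(\delta_n) \end{pmatrix},$$

where

$$S_n = \begin{pmatrix}
\frac{\partial Q_{2n}(\beta^0, 0)}{\partial \lambda_2} & \bigl(\frac{\partial Q_{2n}(\beta^0, 0)}{\partial \lambda_1}\bigr)^T & \bigl(\frac{\partial Q_{2n}(\beta^0, 0)}{\partial \beta}\bigr)^T \\
\frac{\partial Q_{1n}(\beta^0, 0)}{\partial \lambda_2} & \frac{\partial Q_{1n}(\beta^0, 0)}{\partial \lambda_1} & \frac{\partial Q_{1n}(\beta^0, 0)}{\partial \beta} \\
\frac{\partial Q_{3n}(\beta^0, 0, 0)}{\partial \lambda_2} & \frac{\partial Q_{3n}(\beta^0, 0, 0)}{\partial \lambda_1} & \frac{\partial Q_{3n}(\beta^0, 0, 0)}{\partial \beta}
\end{pmatrix}
\xrightarrow{p} S = \begin{pmatrix}
-\tau^+_{x_j} E\{Z^2_{j_0,x_j}(\theta_j^0, \beta^0)\} & 0^T & \tau^+_{x_j} E\bigl\{\frac{\partial Z_{j_0,x_j}(\theta_j^0, \beta^0)}{\partial \beta}\bigr\}^T \\
0 & -E\{Z_1(\beta^0) Z_1^T(\beta^0)\} & E\bigl\{\frac{\partial Z_1(\beta^0)}{\partial \beta}\bigr\} \\
\tau^+_{x_j} E\bigl\{\frac{\partial Z_{j_0,x_j}(\theta_j^0, \beta^0)}{\partial \beta}\bigr\} & E\bigl\{\frac{\partial Z_1(\beta^0)}{\partial \beta}\bigr\}^T & 0
\end{pmatrix}$$

with $j_0$ being any element in $A^+_{x_j}$. Put $a_1 = \tau^+_{x_j} E\{Z_{j_0,x_j}(\theta_j^0, \beta^0)\}^2$, $A_1 = E\{Z_1(\beta^0) Z_1^T(\beta^0)\}$, and write

$$S = \begin{pmatrix} S_{11} & S_{12} \\ S_{12}^T & 0 \end{pmatrix},$$

that is, we write

$$S_{11} = \begin{pmatrix} -a_1 & 0^T \\ 0 & -A_1 \end{pmatrix}, \quad S_{12} = \begin{pmatrix} \tau^+_{x_j} E\bigl\{\frac{\partial Z_{j_0,x_j}(\theta_j^0, \beta^0)}{\partial \beta}\bigr\}^T \\ E\bigl\{\frac{\partial Z_1(\beta^0)}{\partial \beta}\bigr\} \end{pmatrix},$$

which gives that

$$S^{-1} = \begin{pmatrix} S_{11}^{-1} - S_{11}^{-1} S_{12} \Delta^{-1} S_{12}^T S_{11}^{-1} & S_{11}^{-1} S_{12} \Delta^{-1} \\ \Delta^{-1} S_{12}^T S_{11}^{-1} & -\Delta^{-1} \end{pmatrix},$$

where $\Delta = S_{12}^T S_{11}^{-1} S_{12}$. Therefore

$$\sqrt{n} \begin{pmatrix} \tilde\lambda_2 \\ \tilde\lambda_1 \end{pmatrix} = \bigl\{S_{11}^{-1} - S_{11}^{-1} S_{12} \Delta^{-1} S_{12}^T S_{11}^{-1}\bigr\} \sqrt{n} \begin{pmatrix} -Q_{2n}(\beta^0, 0) \\ -Q_{1n}(\beta^0, 0) \end{pmatrix} + o_p(1) \tag{10}$$

and

$$\sqrt{n}(\tilde\beta - \beta^0) = \Delta^{-1} S_{12}^T S_{11}^{-1} \sqrt{n} \begin{pmatrix} -Q_{2n}(\beta^0, 0) \\ -Q_{1n}(\beta^0, 0) \end{pmatrix} + o_p(1). \tag{11}$$

Put $\tilde\lambda = (\tilde\lambda_2, \tilde\lambda_1^T)^T$ and $W_n = \sqrt{n}\bigl(Q_{2n}(\beta^0, 0), Q_{1n}^T(\beta^0, 0)\bigr)^T$. Then, by Lemma 1, (10)–(11) and Taylor expansion, we have

$$\begin{aligned}
l_P(\theta_j^0|x_j) &= 2\tilde\lambda_1^T \sum_{i=1}^n Z_i(\tilde\beta) - \tilde\lambda_1^T \sum_{i=1}^n Z_i(\tilde\beta) Z_i^T(\tilde\beta) \tilde\lambda_1 + 2\tilde\lambda_2 \sum_{i \in A^+_{x_j}} Z_{i,x_j}(\theta_j^0, \tilde\beta) - \tilde\lambda_2 \sum_{i \in A^+_{x_j}} Z^2_{i,x_j}(\theta_j^0, \tilde\beta) \tilde\lambda_2 + o_p(1) \\
&= 2\sqrt{n}\tilde\lambda^T W_n - \sqrt{n}\tilde\lambda^T \begin{pmatrix} a_1 & 0^T \\ 0 & A_1 \end{pmatrix} \sqrt{n}\tilde\lambda + o_p(1) \\
&= -2 W_n^T \bigl\{S_{11}^{-1} - S_{11}^{-1} S_{12} \Delta^{-1} S_{12}^T S_{11}^{-1}\bigr\} W_n \\
&\quad + W_n^T \bigl\{S_{11}^{-1} - S_{11}^{-1} S_{12} \Delta^{-1} S_{12}^T S_{11}^{-1}\bigr\} S_{11} \bigl\{S_{11}^{-1} - S_{11}^{-1} S_{12} \Delta^{-1} S_{12}^T S_{11}^{-1}\bigr\} W_n + o_p(1) \\
&= -W_n^T \bigl\{S_{11}^{-1} - S_{11}^{-1} S_{12} \Delta^{-1} S_{12}^T S_{11}^{-1}\bigr\} W_n + o_p(1) \\
&= \bigl\{(-S_{11})^{-1/2} W_n\bigr\}^T \bigl\{I_{k \times k} - S_{11}^{-1/2} S_{12} \Delta^{-1} S_{12}^T S_{11}^{-1/2}\bigr\} (-S_{11})^{-1/2} W_n + o_p(1),
\end{aligned}$$

where $I_{k \times k}$ denotes the $k \times k$ identity matrix, and $k$ is the dimension of $S_{11}$. Hence the theorem follows from the facts that $(-S_{11})^{-1/2} W_n \xrightarrow{d} N(0, I_{k \times k})$, the matrix $I_{k \times k} - S_{11}^{-1/2} S_{12} \Delta^{-1} S_{12}^T S_{11}^{-1/2}$ is idempotent and its rank is

$$k - \operatorname{rank}(S_{12} \Delta^{-1} S_{12}^T S_{11}^{-1}) = k - \operatorname{rank}(\Delta^{-1} \Delta) = k - (k - 1) = 1.$$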