Data graduation based on statistical time series methods


Statistics & Probability Letters 52 (2001) 169 – 175

Víctor M. Guerrero (a,*), Rodrigo Juárez (a), Pilar Poncela (b)

(a) Departamento de Estadística, Instituto Tecnológico Autónomo de México (ITAM), Río Hondo 1, Col. San Ángel, D.F. México 01000, Mexico
(b) Departamento de Análisis Económico: Economía Cuantitativa, Universidad Autónoma de Madrid, Cantoblanco, 28049 Madrid, Spain

Received July 2000; received in revised form September 2000

Abstract

On the basis of some suitable assumptions, we show that the best linear unbiased estimator of the true mortality rates has the form of Whittaker's solution to the graduation problem. Some statistical tools are also proposed to help reduce subjectivity when graduating a dataset. © 2001 Elsevier Science B.V. All rights reserved.

Keywords: Best linear unbiased estimation; Difference stationarity; Whittaker graduation

1. Introduction

Whittaker's graduation method uses a linear combination of some crude mortality rates to obtain a new set of graduated values (see London, 1985). The new dataset satisfies two competing criteria: goodness of fit and smoothness. The trade-off between these criteria is controlled through a smoothing parameter that reflects the relative importance of both aspects. In practice, Whittaker graduation is extensively employed, with both the standard used to measure smoothness and the smoothing parameter chosen arbitrarily. Kimeldorf and Jones (1967) viewed Whittaker graduation from a Bayesian standpoint. They used a multivariate Normal distribution as a model for the vector of (correlated) true rates and postulated the covariance matrix a priori. Later on, Taylor (1992) assumed the rates to be mutually independent. Verrall (1993) posed the graduation problem in the form of a state-space model and used Kalman filtering. However, the dynamic structure of the rates had to be decided from a priori arguments, and Verrall used a local linear trend model. We suggest an alternative interpretation of Whittaker graduation that yields the graduated series of rates as the best linear unbiased estimator (BLUE) of the true series. The smoothing parameter as well as the smoothness standard are related to a data-based model, and some additional results are provided.

* Corresponding author. E-mail address: [email protected] (V.M. Guerrero).



2. Whittaker's graduation and basic notation

Let {u_x} be a sequence of observed crude rates corresponding to ages x = 1, ..., n, where the actual age is given by x + α, for some constant α > 0. The basic expression underlying Whittaker graduation is

M = \sum_{x=1}^{n} w_x (v_x - u_x)^2 + h \sum_{x=1}^{n-d} (\Delta^d v_x)^2,    (2.1)

where v_x denotes a graduated value, w_x is a non-negative weight, Δ is the forward-differencing operator defined as Δv_x = v_{x+1} − v_x for every v and x, d is the number of times the operator is applied and h is a smoothing parameter. The graduated values minimize M, which can also be written as

M = (v - u)' W (v - u) + h v' S_d' S_d v,    (2.2)

where u = (u_1, ..., u_n)', v = (v_1, ..., v_n)', W = diag(w_1, ..., w_n) and S_d is an (n − d) × n matrix whose ij-th element is S_d(i, j) = (−1)^{d+i−j} d!/[(j − i)!(d − j + i)!] for i = 1, ..., n − d and j = 1, ..., n, with S_d(i, j) = 0 for j < i or j > d + i. The minimizer of (2.2) is given by (see Shiu, 1986)

\hat{v} = (W + h S_d' S_d)^{-1} W u.    (2.3)

To use this formula, most practitioners would first fix the value of d at 2, 3 or 4, according to their experience; then h is selected by trial and error, visually trading off smoothness against fit. Taylor (1992) related the true values, {t_x}, to {u_x} by u_x = t_x + ε_x, where ε_x is a zero-mean random error. The {u_x} was assumed to be a sequence of independent and asymptotically normally distributed random variables with Var(u_x) = w_x^{-1}. Then, h was shown to be given by h = Var^{-1}(Δ^d t_x). By minimizing h v' S_d' S_d v, the structure of {v_x} tends to lie on an adaptive polynomial of degree d − 1.

In Section 3 we propose a formulation that takes into account the autocorrelation structure of the rates. This is in clear opposition to the assumption of independence of {u_x}. We were motivated by some comments in Kimeldorf and Jones (1967) attributed to Henderson and Elphinstone, both of whom considered the rates to be connected, in the sense of having some kind of interrelationship. We suggest a time-series representation in order to employ some analytical tools pertaining to this field of study. However, as one referee pointed out, x could index a collection of lives with similar mortality characteristics rather than age, and the mortality data might have been collected in a fixed period, as in a census. The important fact is that the data are ordered and equally spaced in some scale. Our proposed representation is fairly general since it basically assumes that, after appropriate differencing, {∇^d T(u_x)} is a stationary process, where ∇ is the backward-differencing operator given by ∇u_x = u_x − u_{x−1} = Δu_{x−1}, with ∇^d u_x = Δ^d u_{x−d} for d = 0, 1, .... Furthermore, T(·) denotes an appropriate transformation, usually applied to make the constant variance and Normal error distribution assumptions plausible. The idea of using transformations has been strongly advocated in this context by Hickman (1987) and Hickman and Miller (1977).
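As a concrete illustration of (2.1)–(2.3), the following sketch builds the difference matrix S_d and computes the Whittaker solution with NumPy. The crude rates, weights and the values of d and h are illustrative placeholders, not data or choices from the paper.

```python
import numpy as np
from math import comb

def difference_matrix(n, d):
    """(n - d) x n matrix S_d whose rows apply the d-th forward difference;
    its nonzero band reproduces S_d(i, j) = (-1)^(d+i-j) d! / [(j-i)!(d-j+i)!]."""
    S = np.zeros((n - d, n))
    coeffs = [(-1) ** (d - k) * comb(d, k) for k in range(d + 1)]
    for i in range(n - d):
        S[i, i:i + d + 1] = coeffs
    return S

def whittaker_graduation(u, w, d=2, h=1000.0):
    """Minimizer of (2.2): v_hat = (W + h S_d' S_d)^(-1) W u."""
    n = len(u)
    W = np.diag(w)
    S = difference_matrix(n, d)
    return np.linalg.solve(W + h * S.T @ S, W @ u)

# Illustrative use with made-up crude rates and equal weights.
u = np.array([0.010, 0.012, 0.011, 0.015, 0.016, 0.020, 0.019, 0.024, 0.027, 0.030])
v_hat = whittaker_graduation(u, w=np.ones_like(u), d=2, h=100.0)
```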

3. Statistical model-based graduation

Our interest lies in estimating an unobserved series of rates {t_x} that follows a smooth pattern. A preliminary estimate of each t_x is given by a crude rate u_x^0, obtained as the proportion of deaths out of n_x people exposed at age x. We start with u_x^0 = t_x + ε_x, where E(ε_x) = 0 and Var(ε_x) = t_x(1 − t_x)/n_x (on the assumption that n_x u_x^0 follows a binomial model with parameters t_x and n_x for each age x). We deem graduation necessary because the crude rates contain some undesired variability due in part to the presence of outliers. Thus, we consider an outlier-corrected series {u_x} obtained from {u_x^0}. Then the range 0 < u_x < 1 is extended to −∞ < T(u_x) < ∞


by means of a log-odds-ratio transformation, i.e.

T(u_x) = \log[u_x / (1 - u_x)]  \quad \text{for } x = 1, \ldots, n.    (3.1)

The dynamic structure in {u_x^0} is captured by a difference-stationary representation whose autocorrelation structure is basically due to the presence of outliers. Of course, autocorrelation might be due to other causes, e.g. shifts in t_x over time, but we assume they are negligible relative to that caused by outliers. Thus, we have

Assumption 1. {T(u_x)} admits the model

\nabla^d T(u_x) = a_{u,x}  \quad \text{for } x = d+1, \ldots, n    (3.2)

with T(u_1), ..., T(u_d) some fixed initial conditions and {a_{u,x}} a zero-mean Gaussian white noise process.

A first-order Taylor expansion of (3.1), together with the relationship between u_x^0 and t_x, led us to

Assumption 2. The transformed rates are related by means of

T(u_x) = T(t_x) + e_x  \quad \text{for } x = 1, \ldots, n    (3.3)

with e_x ≈ [dT(u_x)/du_x]|_{u_x = t_x} (u_x − t_x), so that E(e_x) = 0 and Var(e_x) = [n_x t_x(1 − t_x)]^{-1}.
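For completeness, the first-order Taylor step behind Assumption 2 can be written out; it is the standard delta-method computation implied (but not displayed) in the text, using dT(u)/du = 1/[u(1 − u)] for the log-odds transformation together with the binomial variance stated at the start of this section.

```latex
% Delta-method step behind Assumption 2 (spelled out here, not shown in the paper).
T(u_x) \approx T(t_x) + \left.\frac{dT(u)}{du}\right|_{u = t_x}(u_x - t_x)
       = T(t_x) + \frac{u_x - t_x}{t_x(1 - t_x)} \equiv T(t_x) + e_x,
\qquad
\operatorname{Var}(e_x) \approx \frac{\operatorname{Var}(u_x - t_x)}{[\,t_x(1 - t_x)\,]^2}
       = \frac{t_x(1 - t_x)/n_x}{[\,t_x(1 - t_x)\,]^2}
       = \frac{1}{n_x\, t_x(1 - t_x)} = w_x^{-1}.
```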

It should be clear that {T(u_x^0)}, {T(u_x)} and {T(t_x)} share the same trend behavior, so that {e_x} is stationary. Moreover, since the autocorrelation structure is supposedly due to outliers in the observed data, {e_x} must be an uncorrelated process. Within a Bayesian context, it is agreed that a sensible prior distribution for {T(t_x)} will have a smooth mean vector (e.g. Taylor, 1992; Whittaker and Robinson, 1944, p. 303). That idea also agrees with the aforementioned trend behavior; therefore we tentatively assume

Assumption 3. The dynamic structure of {T(t_x)} is given by

\nabla^d T(t_x) = a_{t,x}  \quad \text{for } x = d+1, \ldots, n,    (3.4)

where T(t_1), ..., T(t_d) are fixed initial conditions and {a_{t,x}} is a zero-mean white noise process with Var(a_{t,x}) = σ_t² and uncorrelated with {e_x}. We call this a tentative assumption because, strictly speaking, (3.4) is not consistent with (3.2) and (3.3). In fact, Assumptions 1 and 2 are realistic given the nature of the data, while Assumption 3 is basically a simplifying device that will lead us to a Whittaker-like solution.

We now define T(u) = (T(u_1), ..., T(u_n))', T(t) = (T(t_1), ..., T(t_n))' and e = (e_1, ..., e_n)' to express (3.3) as T(u) = T(t) + e with E(e) = 0 and Var(e) = W^{-1}, where W = diag(w_1, ..., w_n) and w_x^{-1} = Var(e_x) for x = 1, ..., n. Similarly, by calling a_t = (a_{t,d+1}, ..., a_{t,n})', we can write (3.4) as S_d T(t) = a_t with E(a_t) = 0, Var(a_t) = σ_t² I_{n−d} and E(a_t e') = 0. Thus we have the system of equations

\begin{pmatrix} T(u) \\ 0 \end{pmatrix} = \begin{pmatrix} I_n \\ S_d \end{pmatrix} T(t) + \begin{pmatrix} e \\ -a_t \end{pmatrix}    (3.5)

so that, given u, W, S_d and σ_t², we can apply generalized least squares to obtain the BLUE of T(t) as

\hat{T}(t) = (W + \sigma_t^{-2} S_d' S_d)^{-1} W T(u)    (3.6)

whose variance–covariance matrix Σ = Var[T̂(t)] is given by

\Sigma = (W + \sigma_t^{-2} S_d' S_d)^{-1} = W^{-1} - W^{-1} S_d' (\sigma_t^2 I_{n-d} + S_d W^{-1} S_d')^{-1} S_d W^{-1}.    (3.7)
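A minimal sketch of the BLUE computation (3.6)–(3.7) in the transformed scale follows, assuming hypothetical deaths and exposures and an arbitrary value of σ_t²; none of the numbers come from the paper.

```python
import numpy as np
from math import comb

def difference_matrix(n, d):
    """(n - d) x n forward-difference matrix S_d."""
    S = np.zeros((n - d, n))
    coeffs = [(-1) ** (d - k) * comb(d, k) for k in range(d + 1)]
    for i in range(n - d):
        S[i, i:i + d + 1] = coeffs
    return S

def blue_graduation(T_u, w, d, sigma2_t):
    """GLS solution (3.6)-(3.7): returns (T_hat, Sigma) in the transformed scale."""
    n = len(T_u)
    W = np.diag(w)
    S = difference_matrix(n, d)
    precision = W + (1.0 / sigma2_t) * (S.T @ S)   # Sigma^{-1} = W + sigma_t^{-2} S'S
    Sigma = np.linalg.inv(precision)
    T_hat = Sigma @ (W @ T_u)
    return T_hat, Sigma

# Illustrative use with hypothetical deaths/exposures (not the paper's data).
deaths = np.array([12, 15, 13, 20, 22, 28, 27, 35, 41, 50])
exposed = np.full(10, 1000)
u0 = deaths / exposed
T_u = np.log(u0 / (1 - u0))                  # log-odds transform (3.1)
w = exposed * u0 * (1 - u0)                  # binomial weights w_x = n_x u_x (1 - u_x)
T_hat, Sigma = blue_graduation(T_u, w, d=2, sigma2_t=0.01)
t_hat = 1 / (1 + np.exp(-T_hat))             # inverse log-odds transform
```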

172

V.M. Guerrero et al. / Statistics & Probability Letters 52 (2001) 169 – 175

Expression (3.6) has the form of Whittaker's solution (2.3) to the graduation problem in a transformed scale, when h = σ_t^{-2} and the crude rates are corrected for outliers. Further, if we assume that e is normally distributed, we get a Normal distribution for T̂(t), with mean T(t) and variance–covariance matrix Σ. This matrix is an increasing function of σ_t², with Σ → 0 as σ_t² → 0 and Σ → W^{-1} as σ_t² → ∞. Moreover, Σ^{-1} is the sum of two precisions, namely W and σ_t^{-2} S_d' S_d, associated with the binomial and the time series model, respectively. As σ_t^{-2} S_d' S_d corresponds to the smoothness element in (2.2), Guerrero (1993) proposed to measure its precision share in Σ^{-1} by means of a scalar index. Theil's measure of the share of P in P + Q, with P and Q positive-definite matrices of size n × n, is given by λ(P, P + Q) = tr[P(P + Q)^{-1}]/n. This measure has the following properties: (i) it takes on values between 0 and 1, (ii) it is invariant under linear, nonsingular transformations of the variable involved, (iii) it behaves linearly, and (iv) λ(P, P + Q) + λ(Q, P + Q) = 1. The share of σ_t^{-2} S_d' S_d in Σ^{-1} is thus measured as

\lambda(\sigma_t^{-2} S_d' S_d, \Sigma^{-1}) = \mathrm{tr}[\sigma_t^{-2} S_d' S_d (W + \sigma_t^{-2} S_d' S_d)^{-1}]/n    (3.8)

which is a decreasing function of σ_t², with 0 < λ(σ_t^{-2} S_d' S_d, Σ^{-1}) ≤ 1. The idea of measuring precision shares through a scalar index was also put forward by Hickman and Miller (1977). They employed the determinant of a matrix as their index for descriptive purposes, while we intend to use ours as an inferential device. We call (3.8) an index of precision share attributable to the time series model, since it can also be expressed as λ(σ_t^{-2} S_d' S_d, Σ^{-1}) = tr({Var^{-1}[T̂(t)] − Var^{-1}[T(u)]} Var[T̂(t)])/n. Thus, it measures the relative reduction in variance attributable to using T̂(t) instead of T(u). For a fixed σ̂_t² we may calculate the precision share achieved. Or else, we can fix a desirable precision share and select σ̂_t² as the value satisfying the condition. In the transformed scale (where the errors are normally distributed), T(t_x) is both the mean and the median of T̂(t_x). Then, by applying the inverse transformation we obtain

\hat{t}_x = T^{-1}[\hat{T}(t_x)]  \quad \text{for } x = 1, \ldots, n    (3.9)

which (by monotonicity of T(·)) has its median located at t_x. The whole series of estimates {t̂_x} is what we call the graduated rates under our proposal. Besides, confidence limits can also be constructed in the original scale by back-transforming the limits obtained in the transformed scale.
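The precision-share index (3.8) and the search for a σ_t² that attains a pre-specified share can be sketched as follows; the bisection bounds and tolerance are arbitrary choices, not prescribed by the paper.

```python
import numpy as np
from math import comb

def difference_matrix(n, d):
    S = np.zeros((n - d, n))
    coeffs = [(-1) ** (d - k) * comb(d, k) for k in range(d + 1)]
    for i in range(n - d):
        S[i, i:i + d + 1] = coeffs
    return S

def precision_share(sigma2_t, w, S):
    """Index (3.8): share of the time-series precision sigma_t^{-2} S'S in Sigma^{-1}."""
    n = len(w)
    P = (1.0 / sigma2_t) * (S.T @ S)
    Sigma = np.linalg.inv(np.diag(w) + P)
    return np.trace(P @ Sigma) / n

def sigma2_for_share(target, w, S, lo=1e-10, hi=1e4, tol=1e-10):
    """Bisection on sigma_t^2 in log scale; the share is decreasing in sigma_t^2."""
    for _ in range(200):
        mid = np.sqrt(lo * hi)          # geometric midpoint, the scale spans many orders
        if precision_share(mid, w, S) > target:
            lo = mid                    # share too high -> increase sigma_t^2
        else:
            hi = mid
        if hi / lo < 1 + tol:
            break
    return np.sqrt(lo * hi)
```

Once Σ̂ is available, 95% limits in the original scale follow by computing T̂(t_x) ± 1.96 Σ̂_xx^{1/2} and applying the inverse log-odds transformation to both limits, as described above.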

4. An empirically supported solution

In order to apply the previous results we should determine the values of S_d, W and σ_t² from the observed data. We start by correcting the data for outliers, assuming the data follow a Normal distribution, except for a small fraction coming from a heavy-tailed symmetric distribution. Then we use Huber's function

\psi_c(z) = \min\{c, \max\{z, -c\}\} = \begin{cases} z & \text{if } |z| \le c, \\ c\,\mathrm{sgn}(z) & \text{if } |z| > c, \end{cases}    (4.1)

for z ∼ N(0, 1). This function produces robust location estimates that are optimal in a minimax sense, i.e. they are maximum likelihood estimates for the least favorable distribution (see Hampel et al., 1986). The constant c depends on the expected fraction of contaminated data and is usually chosen as c = 1.645, implying a potential 10% contamination. In our case, Huber's function is used for robust estimation of dynamic location. When {T(u_x^0)} is viewed as a process, its dynamic structure sets the standard to judge the degree of contamination. Such a structure is completely determined by the value of d that renders the series stationary. An augmented Dickey–Fuller (ADF) test on {T(u_x^0)} is used to specify d objectively (see Hamilton, 1994, for a detailed description of


this test and related ones). Once d has been fixed, S_d can be constructed and the elements of W can be preliminarily estimated by

\tilde{w}_x = \widetilde{\mathrm{Var}}^{-1}(e_x) = n_x u_x^0 (1 - u_x^0)  \quad \text{for } x = 1, \ldots, n    (4.2)

so that W̃ = diag(w̃_1, ..., w̃_n). Then the value of σ̂_t² is decided from the index of precision share, by fixing a desirable (large) precision share attributable to the time series model and looking for the corresponding σ_t² that produces it. The underlying assumptions on the random errors are: (1) e_x = T(u_x) − T(t_x) ∼ N(0, w_x^{-1}) and (2) {a_{t,x}} is a zero-mean white noise process with variance σ_t² and uncorrelated with {e_x}, where a_{t,x} = ∇^d T(t_x). Then it is natural to calculate the residuals ê_x = T(u_x) − T̂(t_x) to verify (1), while (2) cannot be validated empirically. The following steps summarize our proposal (a sketch implementing them appears after the list):

1. Determine the degree of differencing from an ADF test on {T(u_x^0)} and create S_d. Obtain W̃ = diag(w̃_1, ..., w̃_n) with w̃_x = n_x u_x^0(1 − u_x^0) for x = 1, ..., n.
2. Fix the value σ̂_t² in such a way that λ(σ̂_t^{-2} S_d' S_d, Σ̃^{-1}) = λ_0, with 0 < λ_0 < 1 specified beforehand.
3. Calculate the preliminary values of T̃(t) and Σ̃ with (3.6) and (3.7).
4. Obtain the following outlier-corrected results, for x = 1, ..., n: (a) residuals ẽ_x = w̃_x^{-1/2} ψ_{1.645}{w̃_x^{1/2}[T(u_x^0) − T̃(t_x)]}; (b) observations T(u_x) = T̃(t_x) + ẽ_x and u_x = T^{-1}[T(u_x)]; and (c) weights ŵ_x = n_x u_x(1 − u_x).
5. Calculate the final values of T̂(t), Σ̂ and λ(σ̂_t^{-2} S_d' S_d, Σ̂^{-1}).
6. Carry out a residual analysis on {ŵ_x^{1/2} ê_x}. If the assumptions are supported by the data, you are finished; otherwise go back to step 2 and try another value of λ_0. As a guide, the higher the value of λ_0, the less autocorrelation in the standardized residuals, but the more non-Normal (platykurtic) their distribution.
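A compact sketch of steps 3–5, with Huber's function used for the corrections in step 4, is given below. It assumes d and σ_t² have already been fixed in steps 1–2; the function names and default values are illustrative, not from the paper.

```python
import numpy as np
from math import comb

def difference_matrix(n, d):
    S = np.zeros((n - d, n))
    coeffs = [(-1) ** (d - k) * comb(d, k) for k in range(d + 1)]
    for i in range(n - d):
        S[i, i:i + d + 1] = coeffs
    return S

def blue(T_u, w, S, sigma2_t):
    Sigma = np.linalg.inv(np.diag(w) + (1.0 / sigma2_t) * (S.T @ S))
    return Sigma @ (np.diag(w) @ T_u), Sigma

def psi(z, c=1.645):
    return np.clip(z, -c, c)            # Huber's function (4.1)

def graduate(u0, n_exposed, d=2, sigma2_t=0.01, c=1.645):
    """One pass of steps 3-5 (sigma2_t assumed already chosen in step 2)."""
    T = lambda u: np.log(u / (1 - u))
    T_inv = lambda y: 1 / (1 + np.exp(-y))
    n = len(u0)
    S = difference_matrix(n, d)
    w_tilde = n_exposed * u0 * (1 - u0)                       # step 1: preliminary weights
    T_tilde, _ = blue(T(u0), w_tilde, S, sigma2_t)            # step 3: preliminary fit
    z = np.sqrt(w_tilde) * (T(u0) - T_tilde)
    e_tilde = psi(z, c) / np.sqrt(w_tilde)                    # step 4(a): corrected residuals
    T_u = T_tilde + e_tilde                                   # step 4(b): corrected observations
    u = T_inv(T_u)
    w_hat = n_exposed * u * (1 - u)                           # step 4(c): corrected weights
    T_hat, Sigma = blue(T_u, w_hat, S, sigma2_t)              # step 5: final fit
    return T_inv(T_hat), T_hat, Sigma, T_u - T_hat            # graduated rates and residuals
```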

5. A numerical illustration

The dataset used for this example is the 1971 Group Annuity Mortality Table for Female Lives (London, 1985). As a preliminary step, we present in Fig. 1 the original data and reproduce the results obtained by London using Whittaker graduation. The values 1000 and 426,138 of the smoothing parameter h were originally employed by London. The graduation was carried out in the logarithmic scale assuming d = 2, and the graduated rates were obtained by taking anti-logs.

Fig. 1. Mortality rates, observed and graduated by the Whittaker method.

Table 1. Results of the ADF tests

         d = 1     d = 2     d = 3     d = 4
ADF     -1.216    -0.869   -10.275    -7.643
p            6         5         4         6
D-W      1.902     1.975     2.019     2.200
sigma    0.330     0.331     0.330     0.354

Visual inspection of the data is the key element to judge the adequacy of these results. To apply our procedure we start by using the ADF test on the observed (transformed) rates and testing the values of d commonly used in graduation, that is d = 1, ..., 4. The ADF test employed is based on the regression

\nabla^{d-1} T(u_x^0) = \rho\,\nabla^{d-1} T(u_{x-1}^0) + \phi_1 \nabla^{d} T(u_{x-1}^0) + \cdots + \phi_{p-1} \nabla^{d} T(u_{x-p+1}^0) + \varepsilon_x

so that testing d vs. d − 1 as the proper order of differencing is equivalent to testing H_0: ρ = 1 vs. H_1: ρ < 1. The model order p must be large enough to capture all the serial correlation present in the data. Such a value was decided by discarding the highest lag in a regression when its coefficient was not significantly different from zero at the 5% level, starting with p = 10. Table 1 shows the results obtained. The Durbin–Watson (D–W) statistic was used to check for serial correlation in the residuals and the last row contains the residual standard error σ̂. The critical values for the ADF test at the 1%, 5% and 10% levels are −2.616, −1.948 and −1.620, respectively (taken from Table B.6 in Hamilton, 1994), leading to d = 2.

We then fixed the desired precision at 75% and found σ̂_t² = 0.00171286, with 12 original values of {T(u_x^0)} corrected to obtain {T(u_x)}. Those values correspond to ages 60, 62, 64, 66, 68, 71, 79, 84, 92, 93, 94 and 97. Finally, the Ljung–Box statistic Q(12) = 22.94 became significant at the 2.8% level, so that some autocorrelation was still present in the residuals {ê_x}. Another round of the procedure was called for, with a higher precision value, in order to diminish residual autocorrelation by correcting more observed values for outliers. We tried 90% precision and found that σ̂_t² = 0.00003879 satisfied this criterion. Now 14 original values were corrected: the previous ones, except that of age 64, plus those of ages 75, 90 and 95. In this case the Ljung–Box statistic Q(12) = 10.0 did not show any evidence of autocorrelation. Similarly, the skewness and (excess) kurtosis statistics Sk = −0.10 and Ku = −1.08 lent support to the null hypothesis of Normality. The t-statistic for the null hypothesis that E(e_x) = 0 took the value t = 0.12 and the hypothesis was not rejected. Next, a visual inspection of the standardized residuals provided support to the constant variance assumption. Therefore, we obtained an estimated series from a model that is validated by the data.

In Fig. 2, we show the final product of this exercise, i.e. the graduated values and their 95% confidence limits. The smoothness achieved is evident to the eye, but it can also be stated numerically by saying that 90% of the precision (smoothness) achieved is due to the use of a dynamic structure on the true rates. An extremely smooth structure for the estimated rates is undesirable for various reasons. Firstly, it would correspond to a very rigid pattern, so that adaptivity of the implicit polynomial is lost. Secondly, as the precision gets closer to 100%, more observations have to be discarded through Huber's function and the residuals tend to show a strange behavior due to truncation. And thirdly, σ_t² → 0 will introduce numerical instabilities in the solution. In fact, we tried 95% precision and verified the previous statements; in particular, 18 observed values needed correction (rather than the 14 values corresponding to 90% precision), producing valleys and peaks in the residual pattern.
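The order-selection and diagnostic computations described above can be sketched with plain NumPy as follows; the lag order is held fixed rather than trimmed by significance as in the paper, the 5% critical value −1.948 is the one quoted from Hamilton (1994), and the chosen-d rule is our reading of the "d vs. d − 1" test.

```python
import numpy as np

def adf_t_stat(y, p=10):
    """t-statistic for H0: rho = 1 in the no-constant ADF regression of the text:
    y_x = rho*y_{x-1} + phi_1*dy_{x-1} + ... + phi_{p-1}*dy_{x-p+1} + eps_x."""
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)                                   # dy[k] = y[k+1] - y[k]
    rows, resp = [], []
    for x in range(p, len(y)):
        rows.append([y[x - 1]] + [dy[x - i - 1] for i in range(1, p)])
        resp.append(y[x])
    X, Y = np.array(rows), np.array(resp)
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ beta
    s2 = resid @ resid / (len(Y) - X.shape[1])
    cov = s2 * np.linalg.inv(X.T @ X)
    return (beta[0] - 1.0) / np.sqrt(cov[0, 0])

def choose_d(T_u0, d_max=4, crit=-1.948, p=10):
    """Smallest d such that nabla^d T(u^0) is judged stationary by the ADF test."""
    for d_test in range(1, d_max + 1):
        y = np.diff(T_u0, n=d_test - 1)               # nabla^{d_test-1} T(u^0)
        if adf_t_stat(y, p=p) < crit:                 # reject rho = 1 -> stationary
            return d_test - 1
    return d_max

def ljung_box_q(resid, m=12):
    """Ljung-Box Q(m); compare with a chi-square with m degrees of freedom."""
    r = np.asarray(resid, dtype=float) - np.mean(resid)
    n, denom = len(r), r @ r
    acf = np.array([(r[k:] @ r[:-k]) / denom for k in range(1, m + 1)])
    return n * (n + 2) * np.sum(acf ** 2 / (n - np.arange(1, m + 1)))
```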


Fig. 2. Observed and estimated rates (original scale).

Acknowledgements

We thank the editor and an anonymous referee for very helpful comments on a previous version of this paper. The first author acknowledges financial support provided by Asociación Mexicana de Cultura, A.C., and the third author acknowledges support from CICYT project PB98-0075 (Spain).

References

Guerrero, V.M., 1993. Combining historical and preliminary information to obtain timely time series data. Internat. J. Forecasting 9, 477–485.
Hamilton, J.D., 1994. Time Series Analysis. Princeton University Press, Princeton, NJ.
Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J., Stahel, W.A., 1986. Robust Statistics: The Approach Based on Influence Functions. Wiley, New York.
Hickman, J.C., 1987. Connections between graduation and forecasting. In: MacNeill, I.B., Umphrey, G.J. (Eds.), Actuarial Science. Reidel Publishing Co., Dordrecht, pp. 1–17.
Hickman, J.C., Miller, R.B., 1977. Notes on Bayesian graduation. Trans. Soc. Actuaries 29, 7–21 (Discussion, pp. 23–49).
Kimeldorf, G.S., Jones, D.A., 1967. Bayesian graduation. Trans. Soc. Actuaries 11, 66–112.
London, D., 1985. Graduation: The Revision of Estimates. ACTEX Publications, USA.
Shiu, E.S.W., 1986. A survey of graduation theory. Proceedings of Symposia in Applied Mathematics, Vol. 35, pp. 67–84.
Taylor, G., 1992. A Bayesian interpretation of Whittaker–Henderson graduation. Insurance Math. Econom. 11, 7–16.
Verrall, R.J., 1993. A state space formulation of Whittaker graduation. Insurance Math. Econom. 13, 7–14.
Whittaker, E.T., Robinson, G., 1944. The Calculus of Observations, 4th Edition. Blackie, London.