Estimation of a flexible simple linear model for interval data based on set arithmetic

Estimation of a flexible simple linear model for interval data based on set arithmetic

Computational Statistics and Data Analysis 55 (2011) 2568–2578 Contents lists available at ScienceDirect Computational Statistics and Data Analysis ...

329KB Sizes 1 Downloads 94 Views

Computational Statistics and Data Analysis 55 (2011) 2568–2578

Contents lists available at ScienceDirect

Computational Statistics and Data Analysis journal homepage: www.elsevier.com/locate/csda

Estimation of a flexible simple linear model for interval data based on set arithmetic Angela Blanco-Fernández a,∗ , Norberto Corral a , Gil González-Rodríguez b a

Dpto. de Estadística, I.O y D.M., Oviedo University, C/ Calvo Sotelo s/n, Oviedo 33007, Spain

b

European Centre for Soft Computing, C/ Gonzalo Gutiérrez Quirós s/n, Mieres-Asturias 33600, Spain

article

info

Article history: Received 28 April 2010 Received in revised form 3 March 2011 Accepted 11 March 2011 Available online 22 March 2011 Keywords: Linear regression model Interval data Interval-arithmetic Least-squares estimation

abstract The estimation of a simple linear regression model when both the independent and dependent variable are interval valued is addressed. The regression model is defined by using the interval arithmetic, it considers the possibility of interval-valued disturbances, and it is less restrictive than existing models. After the theoretical formalization, the leastsquares (LS) estimation of the linear model with respect to a suitable distance in the space of intervals is developed. The LS approach leads to a constrained minimization problem that is solved analytically. The strong consistency of the obtained estimators is proven. The estimation procedure is reinforced by a real-life application and some simulation studies. © 2011 Elsevier B.V. All rights reserved.

1. Introduction Sometimes the exact value of a variable may be hidden for confidentiality reasons, or may be identified imprecisely due to uncertainty in its quantification. In such cases, one may know the interval that contains a representative data point instead of its precise value (see, for instance, Giordani and Kiers, 2006; Ham and Hsiao, 1984; Rivero and Valdes, 2008). On the other hand, it is also of interest to focus on characteristics which are essentially interval valued, like economical fluctuations, numerical ranges (in the sense of the range of variation (minimum, maximum) of a magnitude over the course of a certain period of time), subjective perceptions, grouped data, and so on (see Diamond, 1990; Gil et al., 2002, 2007). When the data for the regressors and/or for the dependent variable in a regression model are intervals, the regression analysis becomes more complex (see Zhang and Sun, 2010; Hashimoto et al., 2011). Frequently, only certain representative values of the intervals, such as the midpoints (see, for instance, Billick et al., 1979) are considered, disregarding part of the information this way. These approaches produce in general inconsistent or inaccurate solutions (see Hasselblad et al., 1980). In particular, their application in the context of linear regression models is criticized in Hsiao (1983). Interval data are often assumed only for the dependent variable, and in the context of grouped data (see, for instance, Chesher and Irish, 1987; Hong and Tamer, 2003; Steward, 1983). Manski and Tamer (2002) consider either an interval-valued regressor or outcome, but not both variables at the same time. When both the regressor and the dependent variable take interval values, the estimation of real linear models involving separately the centres (or midpoints) and semiranges (or spreads) has been suggested (see Lima Neto et al., 2008; Lima Neto and de Carvalho, 2010 and the references therein). However, these approaches are not always well-suited to the inferential scenario, because the models do not prevent the values of the dependent variable from being ill-defined as interval elements. Thus, although the problem is easily overcome in the fitting process, the estimates may refer to an ill-defined population model.



Corresponding author. Tel.: +34 985 102958; fax: +34 985 103354. E-mail addresses: [email protected] (A. Blanco-Fernández), [email protected] (N. Corral), [email protected] (G. González-Rodríguez). 0167-9473/$ – see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.csda.2011.03.005

A. Blanco-Fernández et al. / Computational Statistics and Data Analysis 55 (2011) 2568–2578

2569

Table 1 Data set on the range of variation of blood pressures of 59 patients. X

Y

X

Y

X

Y

[11.8, 17.3] [10.4, 16.1] [13.1, 18.6] [10.5, 15.7] [12.0, 17.9] [10.1, 19.4] [10.9, 17.4] [12.8, 21.0] [9.4, 14.5] [14.8, 20.1] [11.1, 19.2] [11.6, 20.1] [10.2, 16.7] [10.4, 16.1] [10.6, 16.7] [11.2, 16.2] [13.6, 20.1] [9.0, 17.7] [11.6, 16.8] [9.8, 15.7]

[6.3, 10.2] [7.1, 10.8] [5.8, 11.3] [6.2, 11.8] [5.9, 9.4] [4.8, 11.6] [6.0, 11.9] [7.6, 12.5] [4.7, 10.4] [8.8, 13.0] [5.2, 9.6] [7.4, 13.3] [3.9, 8.4] [5.5, 9.8] [4.5, 9.5] [6.2, 11.6] [6.7, 12.2] [5.2, 10.4] [5.8, 10.9] [5.0, 11.1]

[11.9, 21.2] [12.2, 17.8] [12.7, 18.9] [11.3, 21.3] [14.1, 20.5] [9.9, 16.9] [12.6, 19.7] [9.9, 20.1] [8.8, 22.1] [11.3, 18.3] [9.4, 17.6] [10.2, 15.6] [10.3, 15.9] [10.2, 18.5] [11.1, 19.9] [13.0, 18.0] [10.3, 16.1] [12.5, 19.2] [9.7, 18.2] [12.7, 22.6]

[4.7, 9.3] [7.3, 10.5] [7.4, 12.5] [5.2, 11.2] [6.9, 13.3] [5.3, 10.9] [6.0, 9.8] [5.5, 12.1] [3.7, 9.4] [5.5, 8.5] [5.6, 12.1] [5.0, 9.4] [5.2, 9.5] [6.3, 11.8] [5.7, 11.3] [6.4, 12.1] [5.5, 9.7] [5.9,10.1] [5.4,10.4] [5.7, 10.1]

[9.8,16.0] [9.7,15.4] [8.7, 15.0] [14.1, 25.6] [10.8, 14.7] [11.5, 19.6] [9.9, 17.2] [11.3, 17.6] [11.4, 18.6] [14.5, 21.0] [12.0, 18.0] [10.0, 16.1] [15.9, 21.4] [13.8, 22.1] [8.7, 15.2] [12.0, 18.8] [9.5, 16.6] [9.2,17.3] [8.3,14.0]

[4.7,10.8] [6.0,10.7] [4.7, 8.6] [7.7, 15.8] [6.2, 10.7] [6.5, 11.7] [4.2, 8.6] [5.7, 9.5] [4.6, 10.3] [10.0, 13.6] [5.9, 9.0] [5.4, 10.4] [9.9, 12.7] [7.0, 11.8] [5.0, 9.5] [5.3, 10.5] [5.4, 10.0] [4.5, 10.7] [4.5, 9.1]

Gil et al. (2001, 2002, 2007) consider an affine transformation, which relates random intervals, based on the natural interval arithmetic, and avoids ill-defined models. Nevertheless, González-Rodríguez et al. (2007) shows that, contrary to what happens in the case of real-valued random variables, the optimal solutions for the affine transformation model can produce estimates non-coherent for the regression parameters. This shortcoming has been overcome in the latter work by solving the least-squares (LS) problem over a suitable feasible set. The model in González-Rodríguez et al. (2007) implies that the midpoints and the spreads of the variables are related by means of two linear functions in which the absolute value of the slope parameter is the same. There are some practical situations in which it is reasonable to assume that the midpoints and the spreads of the variables may increase/decrease with the same absolute ratio. However, in general, this consideration imposes some restrictions. The aim of this paper is to present and estimate a more flexible simple linear model between random interval variables. 1.1. Motivating example To motivate the methodology, the analysis of the (linear) relationship between the fluctuation of the systolic and diastolic blood pressure for the patients in a hospital is considered. Data have been supplied by the Department of Nephrology of the Hospital Valle del Nalón in Asturias, Spain (see, for instance, Gil et al., 2002). Values of the blood pressures for each patient are measured at different points in a day. For some purposes, physicians focus on the range of variation (minimum-maximum) of these magnitudes which are registered for the patient on a concrete day. In these cases, devices used for measuring the pressure are programmed to record not the whole registers for a day, but only the lowest and highest measurements of pressure during the day. Therefore, the range of variation of each pressure over one day can be modelled by means of a real compact interval defined from the corresponding extreme values registered during that day. Data in Table 1 correspond to X = ‘‘range of variation of the systolic blood pressure over a day’’ and Y = ‘‘range of variation of the diastolic blood pressure over the same day’’. The sample data consist of 59 patients of the Hospital Valle del Nalón, taken from a population of 3000 who are hospitalized per year in this hospital. Fig. 1 represents the scatter plot of both variables, by means of crosses. For each cross, the intersection represents the pair of midpoints of the corresponding intervals, and the length of the arms of the cross is the diameter of each interval. The rest of the paper is organized as follows: Section 2 describes the simple linear model and some of its main features. In Section 3, the LS estimators of the linear model are obtained by finding the solutions to a constrained minimization problem. Both the theoretical and the empirical performance of the solutions are shown in Section 4. Specifically, the strong consistency of the estimators is proven and some illustrative simulation studies are performed. Moreover, the method is applied to the sample data presented in Section 1.1 in order to check its practical applicability in a real-life case study. Finally, Section 5 concludes and give some future directions. 2. A new interval regression model: the model M The formalization of the linear model for an interval-valued regressor and an interval-valued-dependent variable is based on the (mid, spr)-representation of the intervals, i.e. A = [mid A ± spr A] for each nonempty real compact interval, with spr A ≥ 0. This formalization will be used throughout the paper. In general, for statistical developments, it is more suitable to work with this representation instead of the (inf, sup)-representation, A = [inf A, sup A] with inf A ≤ sup A, since the non-negativity condition for the spread is usually easier to handle than the order restriction of infima and suprema.

2570

A. Blanco-Fernández et al. / Computational Statistics and Data Analysis 55 (2011) 2568–2578

Fig. 1. Interval data mapping.

The linear model M will be formalized by considering the usual interval arithmetic, which in terms of the (mid, spr)representation is expressed as A1 + λA2 =



mid A1 + λ mid A2



  ± spr A1 + |λ|spr A2 ,

for any λ ∈ R. The expected value for any random interval R is defined in terms of the Aumann expectation, and it satisfies that E ([mid R ± spr R]) = [E (mid R) ± E (spr R)],

(1)

whenever these expected values exist, i.e. mid R, spr R ∈ L . The representation A = [mid A ± spr A] can be split into two terms depending on the midpoint and spread values of A by means of the canonical decomposition 1

A = mid A[1 ± 0] + spr A[0 ± 1]. Obviously, the (inf, sup)-representation of the intervals [1 ± 0] and [0 ± 1] is [1, 1] and [−1, 1], respectively. Based on the canonical decomposition, we define the model M as Y = α mid X [1 ± 0] + β spr X [0 ± 1] + γ [1 ± 0] + ε,

(2)

where Y is the interval-dependent variable and X is the interval regressor. α and β are the regression coefficients associated with the independent variable, γ is an intercept term affecting the mid component of the dependent variable, and ε is an interval-valued random error variable such that E (ε|X ) = [−δ, δ], with δ ≥ 0. As in the interval model presented in González-Rodríguez et al. (2007), if the usual condition of zero mean error is considered, the residual term is reduced to a real-valued variable instead of an interval-valued one, due to the lack of linearity of the natural interval arithmetic. Model M induces linear relationships between the mid and the spr variables of intervals X and Y . To be precise, mid Y = α mid X + γ + mid ε, and spr Y = |β|spr X + spr ε.

(3)

Since α and β are different in general, model M is more flexible than the model developed in González-Rodríguez et al. (2007). The relationship for the mid variables coincides with the simple linear regression model of mid Y on mid X . But, this is not the case for the relationship for the spr variables in (3), given the constraints on spr Y , spr X and spr ε . Remark 2.1. One can also consider separate models for mids and spreads. Nevertheless, the estimation problem is not technically simplified, since the non-negativity condition of the spreads leads to a constrained linear model whose solution cannot be obtained by classical linear estimation techniques. On the other hand, although for predictive purposes the consideration of separate linear models may be enough, for explanatory purposes, to consider a unified model is usually preferable. For simpler notation, if we define the real interval B = [γ − δ, γ + δ], the regression function associated with model M can be expressed as E (Y |X ) = α X M + β X S + B, where X M = mid X [1 ± 0] and X S = spr X [0 ± 1]. The random interval X S = spr X [0 ± 1] = [−spr X , spr X ] satisfies that X S = −X S . From this property, it is clear that model M can always be written in two different ways: Y = α X M + β X S + γ [1 ± 0] + ε = α X M + (−β)X S + γ [1 ± 0] + ε. The existence of a double model in all the cases allows us to simplify the estimation process, because it is possible to look only for the estimate of the non-negative value of the parameter β .

A. Blanco-Fernández et al. / Computational Statistics and Data Analysis 55 (2011) 2568–2578

2571

3. Least-squares estimation of model M For the LS estimation of model M, a random sample of n observations for (X , Y ) is obtained. For the estimation of the parameters (α, β, B), the following objective function is minimized over (a, b, C ) n 1−

n i=1

d2θ (Yi , aXiM + bXiS + C ),

(4)

where dθ denotes the distance between intervals defined for every pair of intervals A1 and A2 as dθ (A1 , A2 ) =

 (mid A1 − mid A2 )2 + θ (spr A1 − spr A2 )2

for an arbitrary θ > 0. This metric has good operational properties, and it generalizes other well-known distances between intervals (see Trutschnig et al., 2009). The constant θ allows us to choose the relative importance given for the squared Euclidean distance between the spreads of the intervals (which is related to the difference of imprecision of both intervals) with respect to the corresponding squared Euclidean distance between the midpoints of the intervals (that measures the difference in location/position in the real line). In particular, d1/3 is often used for practical applications, since it corresponds to the Bertoluzza metric dW (see Bertoluzza et al., 1995) when W is the Lebesgue measure on [0, 1]. The objective function in (4), and also the solutions for the minimization problem, will be written in terms of sample moments of the intervals. Thus, in the following the theoretical second-order moments for random intervals are presented. The Aumann expected value for intervals presented in (1) agrees with the Fréchet expectation (see Fréchet, 1948) with respect to the metric dθ (see Körner, 1997; Körner and Näther, 2002). Thus, the variance of an interval X can be defined as the usual Fréchet variance (see Fréchet, 1948) associated with the Aumann expectation in the metric space of intervals endowed with dθ metric; that is,

  σX2 = E d2θ (X , E (X )) whenever E (|X |2 ) < ∞ (where |X | = sup{|x| : x ∈ X }). This concept has an expression in terms of classical variances 2 2 of mid and spr variables as σX2 = σmid X + θ σspr X . The semilinearity of the interval space entails difficulties to extend the classical concept of covariance. Thus, the covariance between two random sets is often introduced as the corresponding covariance between elements in a Hilbert space (see Körner, 1997 for details). Specifically, the covariance between two random intervals X and Y associated with the metric dθ has the expressions

σX ,Y = E [(mid X − E (mid X ))(mid Y − E (mid Y ))] + θ E [(spr X − E (spr X ))(spr Y − E (spr Y ))] = σmid X ,mid Y + θ σspr X ,spr Y . Thereafter, the second-order moments of the random intervals X M , X S , and Y involved in model M are assumed to be finite, 2 2 2 and the variances strictly positive. It can be proven that σX2M = σmid X and σX S = θ σspr X , so that the conditions for the variances of the random intervals X S and X S are equivalently assumed for the variances of mid X and spr X . The objective function in (4) can be written in terms of (a, b, C ) and the data values as n 1−

n i=1

(mid Yi − a mid Xi − mid C )2 + θ

n 1−

n i =1

(spr Yi − b spr Xi − spr C )2 .

From this expression, it is obtained that the minimization over a and mid C can be solved by means of the classical LS estimation of the simple linear regression of mid Y on mid X . Nevertheless, the minimization over b and spr C cannot be solved by the classical technique. Contrary to what happens in the real-valued case, in the interval setting it can be shown that the ordinary least-squares (OLS) estimates can sometimes be not suitable for estimating the linear relationship established by the theoretical model M. A simple example to illustrate this fact is described below. Example 3.1. Let X and ε be random intervals playing the role of regressor and error term, respectively, and consider a linear model between X and Y defining Y as Y = X M + 2X S +ε and E (ε|X ) = [−δ, δ]. Let mid X and mid ε be independent variables, both of them following a N (0, 1) distribution, spr X and spr ε being independent χ12 . Thus, γ = 0 and δ = 1 in model M between X and Y . Table 2 contains simulated sample data of n = 3 individuals from this situation. The values minimizing (4) are, respectively,  αOLS = 1.2636,  βOLS = 3.0157 and  BOLS = [−0.2557, −0.4015]. Obviously, since −0.4015 < −0.2557, the estimate of B is not well defined as a real interval. The shortcoming of the classical OLS estimator is associated with some features of the interval arithmetic. It is well known that the sum of intervals A1 + A2 and the product of an interval by a scalar, λA1 , are natural and well-defined inner operations in the space of real intervals. However, the operation A1 +(−1)A2 differs from the natural difference of two elements A1 − A2 ; in other words, A1 + (−1)A2 + A2 is not, in general, A1 . Moreover, in some cases there exists no other interval D such that A2 + D equals A1 . For instance, for A1 = [1, 2] and A2 = [0, 4], the unique way to get A1 = A2 + D (and then D = A1 − A2 ) is D being [1, −2], which is not a valid interval.

2572

A. Blanco-Fernández et al. / Computational Statistics and Data Analysis 55 (2011) 2568–2578 Table 2 Data set for the linear model Y = X M + 2X S + ε . Xi

Yi

[0.3698, 0.9424] [−0.0987, 0.0319] [−0.7885, 0.2447]

[−0.5329, 0.7805] [0.8997, 1.2875] [−3.3117, − 0.2081]

The classical OLS estimator of the interval parameter B would have the expression

   BOLS = Y −  αOLS X M +  βOLS X S , ∑n (with Y = (1/n) i=1 Yi , and analogously X M , X S ), in which the difference of the elements is not always a real interval, as in the example presented above. In order to assure the existence of a well-defined interval estimating B in all the situations, and therefore a good estimation of model M, the minimization problem in (4) will be reformulated. The difference D = A1 − A2 verifying that A1 = A2 + D is the so-called Hukuhara difference. Thus, the LS approach will be expressed as a minimization problem constrained to the existence of the Hukuhara differences guaranteeing the existence of the estimated residuals of model M. From a random sample of n observations  {Xi , Yi } of (X , Y ), it is necessary that the estimators of (α, β, B) verify that all the differences or residuals Yi −  α XiM +  β XiS are well-defined intervals. For this reason, the minimization problem associated with the estimation of model M is given by min

n 1−

n i=1

2

( ,

dθ Yi aXiM

+

bXiS

+ C)

subject to exists, for all i = 1, . . . , n

Yi − aXiM + bXi

 S



   

.

(5)

  

As in the classical linear regression, the estimator of the independent parameter B (an interval value in this case) can be obtained first, in terms of the other regression parameters, by solving (5) for the unknown quantity C . This minimum value is attained in the analogous expression for the OLS estimator of B, i.e.

   B=Y−  αX M +  β XS . In this case, the search for  α and  β fulfilling the constrains of (5) will guarantee that  B is a well-defined interval. Thus, problem (5) can be expressed as min

n 1−

n i=1

 

 d2θ Yi − (aXiM + bXiS ), Y − (aX M + bX S )   

subject to exists, for all i = 1, . . . , n

Yi − aXiM + bXi



 S

.

(6)

  

It can be shown that the constraints in (6) only regard the unknown quantity b. They can be written in terms of the spread values of the intervals in the sample as follows: Yi − (aXiM + bXiS ) exists

⇔b ≤

spr Yi spr Xi

for all i = 1, . . . , n such that spr Xi ̸= 0.

Therefore, the feasible set associated with (6) can equivalently be written as (a, b) ∈ S = R × [0, s0 ], where

  s0 = min

spr Yi spr Xi



: spr Xi ̸= 0 .

(7)

The resolution of (6) over the feasible set S gives the estimators for the regression parameters (α, β) of model M by means of the expressions

 α=

 σ X M ,Y  σX2M

and





 σX S ,Y  β = min  s0 , max 0, 2  σX S

 ,

A. Blanco-Fernández et al. / Computational Statistics and Data Analysis 55 (2011) 2568–2578

2573

where  σX M ,Y and  σX2M (analogously for X S ) denote the sample covariance and variance for random intervals defined in terms of the dθ -metric; that is,

 σX2M =

n 1−

n i =1

d2θ (XiM , X M )

and

 σX M ,Y =

n 1−

n 1−

n i =1

n i=1

(mid XiM − mid X M )(mid Yi − mid Y ) + θ

(spr XiM − spr X M )(spr Yi − spr Y ).

The random sample {(Xi , Yi )}ni=1 obtained from (X , Y ) involved in the minimization problem (6) must verify that  σX2M (or 2 2 equivalently  σmid σX2S (equivalently  σspr X ) are strictly positive. X ) and 

The particular definition of the intervals X M and X S , defined from the independent interval X , allows us to obtain more intuitive expressions for the regression estimators of model M. In terms of classical moments of real-valued random variables involving mids and spreads:

 α=

 σmid X ,mid Y , 2  σmid X  

 σspr X ,spr Y  β = min  s0 , max 0, 2  σspr X

(8)

 ,

(9)

 γ = mid  B

(10)

 δ = spr  B,

(11)

and

where

   B = (mid Y −  α mid X ) ± (spr Y −  β spr X ) .

(12)

The estimators of α and γ coincide with the OLS estimators of the regression coefficient and intercept for the usual linear regression model between mid X and mid Y variables (considering mid X as the regressor), respectively. Nevertheless, the estimators of β and δ = spr B are affected by the conditions to assure the coherency of the estimators with the interval linear model between X and Y . So,  β and  B are, in general, different from the OLS estimators  βOLS and  BOLS , respectively. It is important to remark that the expressions obtained in (8)–(12) do not depend on the parameter θ , which is chosen for the metric dθ and it is involved in the minimization problem. This is a valuable property for the estimators of model M, which is not verified by the estimators of the linear model introduced in González-Rodríguez et al. (2007). 4. Statistical properties of the LS estimators In this section, the theoretical and empirical adequacy of the LS estimators of model M is studied, and the method is applied to a case study. 4.1. Strong consistency results The main properties associated with the statistical reliability of the estimator of a given parameter are the asymptotic unbiasedness and the strong consistency of the estimator with respect to the parameter to be estimated. Since  α in (8) and  γ in (10) are exactly the OLS estimators for the parameters of the classical linear model mid Y = α mid X + γ + mid ε , it is clear that E ( α ) = α and E ( γ ) = γ , that is,  α and  γ are unbiased estimators w.r.t. the parameters α and γ , respectively. On the other hand, an analytic expression for the expected value of the estimators  β and  δ is not so easy to get, because the expression of  β (as well as  δ ) depends on min and max operators. Nevertheless, in this case Monte Carlo experiments in Section 4.2 show that these estimators are not unbiased, but they are asymptotically unbiased. To study the consistency property of the regression estimators, the following supporting result is to be used. All proofs are collected in the Appendix. 2 2 Lemma 4.1. Let X , Y be random intervals verifying a linear model M, such that 0 < σspr X , σspr Y < ∞. Then, the sample term

   s0 converges almost-surely-[P] to s0 = sup b ∈ R+ : F(b spr X −spr Y ) (0) = 1 as the sample size tends to ∞, i.e.   P lim  s0 (n) = s0 = 1, n→∞

where for a random variable ξ , Fξ (x) = P (ξ ≤ x).

2574

A. Blanco-Fernández et al. / Computational Statistics and Data Analysis 55 (2011) 2568–2578

Table 3 Empirical validation of the estimation procedure for model M. Model

n

M1

10 30 50 100 500

M2

10 30 50 100 500

 E ( α)

 ( MSE α)

 E ( β)

 ( MSE β)

 E ( γ)

 ( MSE γ)

 E ( δ)

 ( MSE δ)

2.9985 2.9995 3.0014 3.0007 2.9998

0.1414 0.0365 0.0211 0.0103 0.0020

0.8518 0.9216 0.9409 0.9590 0.9821

0.0952 0.0201 0.0106 0.0049 0.0009

−0.0038 0.0014 0.0027 0.0006 0.0001

0.1140 0.0352 0.0201 0.0100 0.0020

1.1224 1.0709 1.0561 1.0405 1.0177

0.2945 0.0904 0.0541 0.0265 0.0051

−2.0044 −2.0005 −1.9997 −2.0001 −1.9999

0.0627 0.0143 0.0077 0.0035 0.0007

0.8875 0.9148 0.9317 0.9515 0.9766

0.2192 0.0468 0.0244 0.0102 0.0019

1.0040 0.9979 0.9994 1.0001 0.9986

0.1549 0.0477 0.0274 0.0013 0.0027

2.0973 2.0774 2.0679 2.0505 2.0220

0.5583 0.1771 0.1066 0.0523 0.0102

Taking the statistical convergence stated in Lemma 4.1 into account, the following result about the strong consistency of the regression estimators is derived. Theorem 4.1. Let X , Y be random intervals verifying a model M, such that 0 < σX2M , σX2S , σY2 < ∞, and σX M ,Y , σX S ,Y < ∞. The regression estimators  α,  β and  B in (8), (9) and (12), respectively, are strongly consistent w.r.t. the parameters of the regression function of the model M, i.e. n→∞ n→∞ n→∞  α −→ α,  β −→ β, and  B −→ B a.s.-[P].

From the property of strong consistency for  B w.r.t. the parameter B, it is immediate to conclude the strong consistency of the estimators  γ and  δ in (10) and (11) w.r.t. the parameters γ and δ in model M, respectively. Although the regression estimators have analytic expressions in terms of classical moments, not all the theses in Lemma 4.1 and Theorem 4.1 are immediately obtained from classical results. In particular, the expressions of  β and  B depend on the constraints included in the minimization process in order to assure the interval-arithmetic coherency, and the classical results need to be complemented (see Appendix for details). 4.2. Empirical results The empirical performance of the estimation process developed in Section 3 is tested by studying the proximity of the estimates to the regression parameters by means of a Monte Carlo simulation. Given a theoretical situation in which model M is established for an interval regressor and a dependent variable X and Y , respectively, the distribution of the estimator for different samples sizes is approximated. First, k random samples of n observations from (X , Y ) are simulated. Then, the estimates of the regression parameters for each iteration, denoted by  αj ,  βj ,  γj and  δj (with  Bj = [ γj −  δj ,  γj +  δj ]), j = 1, . . . , k, are computed. The vector of k values for each estimate can be seen as an empirical distribution of the corresponding estimator close to the theoretical one if k is large enough. Thus, as a description of the performance of the estimator, the mean value and the mean squared error are computed from that distribution, as empirical approximations of the theoretical ones. Based on k iterations, the estimated mean value and the estimated mean squared error of an estimator ∑ ∑  (  ν are computed by means of the expressions  E ( ν) = ( kj=1  νj )/k and MSE ν) = ( kj=1 ( νj − ν)2 )/k, respectively. Two linear models with the structure of model M are investigated. We consider different distributions for the random interval regressor X and the random interval response Y , as well as different regression parameters. The distributions are determined through those for the associated mids and sprs. Some of the most commonly used models, taking into account that the spr variables are non-negative, are employed.

• Model M1 : Let mid X , mid ε ∼ N (0, 1) and spr X , spr ε ∼ χ12 be independent random variables, and let Y be a random interval defined by means of the linear model Y = 3X M − X S + ε = 3X M + X S + ε.

(13)

From the considered distribution for spr ε it can be shown that δ = 1, and thus B = [−1, 1]. • Model M2 : Let mid ε ∼ N (0, 1), spr ε ∼ χ22 , spr X ∼ χ12 and mid X = spr X − Z , with Z ∼ N (0, 1) independent of spr X , and the interval linear model between X and Y is Y = −2X M + X S + [1 ± 0] + ε. In this situation, a certain degree of dependence between mid X and spr X is allowed; hence, the intervals X (14) are dependent. The definition of ε entails that δ = 2, so B = [−1, 3].

(14) M

and X S in

In Table 3, the experimental results for k = 10,000 iterations of the Monte Carlo method from these two linear models are reported, taking the value θ = 1/3 for the metric dθ . Several conclusions can be extracted from these results. On one hand,

A. Blanco-Fernández et al. / Computational Statistics and Data Analysis 55 (2011) 2568–2578

2575

Fig. 2. Box plot for  α (left) and  β (right) for model M1 .

Fig. 3. Box plot for  α (left) and  β (right) for model M2 .

in both situations the mean values of the estimates are closer to the corresponding regression parameters as the number of observations n increases. The above findings show empirically the asymptotic unbiasedness of the estimators. On the other hand, in all the cases the estimated MSE of the estimators goes to zero as n increases. The empirical performance of the regression estimators can also be checked graphically, by means of a box-plot representation. Figs. 2 and 3 show the box plots of the estimators  α and  β obtained from the preceeding k = 10,000 iterations of the Monte Carlo experiment for model M1 and M2 , respectively. In both situations, the boxes reduce the spread around the true value of the corresponding parameter (α or β in each case) as the number of observations n increases. This finding suggests the empirical consistency of the regression estimators. 4.3. A real-life case study To motivate the study of an interval linear model, a real-life example was introduced in Section 1.1. The objective was to analyze the linear relationship between the intervals determined by the ranges of fluctuation of the systolic (X ) and diastolic (Y ) blood pressures for a sample of patients in a hospital over a day (see Fig. 1). Specifically, the aim is to estimate the interval linear model M, Y = α X M + β X S + γ [1 ± 0] + ε , for a random sample from (X , Y ). The estimation process developed in Section 3, based on the sample data for (X , Y ) in Table 1, results in the following estimation of model M:

 Y = 0.4527X M + 0.2641X S + 1.6920[1 ± 0] + [−1.5502, 1.5502] = 0.4527X M + 0.2641X S + [0.1418, 3.2422] .

(15)

The suitability of the estimation over the available data set is illustrated in Fig. 4. In order to get a two-dimensional representation, data mapping for mid and spr values obtained from the interval data set are plotted separately, each of them accompanied by the corresponding estimated linear relationship transferred from (15). For instance, for the sample unit (X39 , Y39 ) = ([15.9, 21.4], [9.9, 12.7]), the predicted range of fluctuation of the diastolic blood pressure would be  Y39 = [7.8842, 12.1943]. Besides, for the individual (X57 , Y57 ) = ([8.3, 14.0], [4.5, 9.1]), the prediction from (15) results in  Y57 = [4.6269, 9.0227]. 2 In this case, the sample value  σspr X ,spr Y / σspr X verifies all the constraints involved in the estimation of the model. Thus, the proposed estimation procedure leads to this value as the estimate for β , which obviously coincides with the classical OLS estimate for the linear model between the spr variables. However, when the classical OLS estimate does not guarantee the existence of all the residuals in the model, the estimate for β is different from the classical one. For instance, in Example 3.1, whereas  βOLS = 3.0157 was shown to be non-coherent with the model, the estimation strategy developed in Section 3 provides for the sample data in Table 2 the estimated model Y = 1.2636X M + 2.2937X S + [−0.4655, −0.1915], where  β = 2.2937 guarantees the existence of all the residuals and, consequently, the coherency of the estimate with the theoretical model.

2576

A. Blanco-Fernández et al. / Computational Statistics and Data Analysis 55 (2011) 2568–2578

Fig. 4. Estimated linear models for mid and spr values in example 4.3.

5. Conclusions The main objective of this work was to formalize a new linear regression model for interval-valued random variables, defined in terms of the usual interval arithmetic. The new model has more flexible properties than the linear models between intervals described in other works in the interval framework. Some properties of the space of intervals endowed with the arithmetic and the metric considered in this work allow us to determine the LS estimators. However, it has been shown that the lack of linearity of this space entails some particularities in the estimation process, leading to the conclusion that usual OLS estimators are sometimes not suitable for forecasting the dependent variable from the estimated interval linear model. The problem was overcome by looking for the estimates of the regression parameters in a set of possible values in which the coherency of the solutions with the interval nature of the variables is assured. Interval linear model M is an useful tool in many studies, when in a regression analysis both the regressor and the dependent variable are collected or described as intervals, because of different facts (grouping, censoring, uncertainty in the measure, and so on), or when the study is focused just on interval-valued characteristics (fluctuations, ranges of variation, etc.). In both situations, interval random sets are suitable to model these characteristics, and the inferential study of the linear relationship between two variables with this nature can be modelled in a coherent way by means of model M. Acknowledgements The research in this paper has been partially supported by the Spanish Ministry of Science and Innovation Grants MTM2009-09440-C02-01 and MTM2009-09440-C02-02, Principality of Asturias Proyect IB09-042C2. It has also benefitted from a short-term scientific mission associated with the COST Action IC0702. Their financial support is gratefully acknowledged. The authors thank especially Marianna Lyra and Henning Fischer for the proofreading of the paper, and Prof. Ana Colubi for her valuable comments; the scientific discussion with her has contributed to greatly improve the paper. The suggestions made by the associate editor and the anonymous referees are highly appreciated. Appendix





Proof of Lemma 4.1. Using Eq. (7), it is easy to check that  s0 can be written as  s0 = sup b ∈ R+ :  F(nb spr X −spr Y ) (0) = 1 , where for a random variable ξ ,  Fξn is the empirical distribution function of a sample of ξ with size n. + Let b ∈ Q arbitrary. From the Glivenko–Cantelli Theorem applied to the real random variable (b spr X − spr Y ), we can assure that n→∞

supx∈R | F(nb spr X −spr Y ) (x) − F(b spr X −spr Y ) (x)| −→ 0

a.s.-[P] ,

from what is directly obtained that  F(nb spr X −spr Y ) converges pointwise a.s.-[P] to F(b spr X −spr Y ) . Thus, there exists a measurable set Cb , with P (Cb ) = 1, such that n,ω lim F n→∞ (b spr X −spr Y )



(0) = F(b spr X −spr Y ) (0),

for all ω ∈ Cb .

(A.1)

Let C = b∈Q+ Cb . It is clear that C is a measurable set with P (C ) = 1. Let ω ∈ C arbitrary. We will prove that limn→∞ sn0 (ω) = s0 . Suppose that limn→∞ sn0 (ω) > s0 (since sn0 is a sample version n +   of s0 , it is obvious that s0 cannot be strictly less than s0 ). Then, there exists b1 ∈ Q such that limn→∞ sn0 (ω) > b1 > s0 . On one hand, b1 > s0 implies F(b1 spr X −spr Y ) (0) < 1. On the other hand, limn→∞ sn0 (ω) > b1 implies that there exists n ,ω n n0 ∈ N such that for all n ≥ n0 ,  s0 (ω) > b1 , so  F(b spr X −spr Y ) (0) = 1 for the samples of the variable (b1 spr X − spr Y ) with



1





n,ω size n ≥ n0 . In conclusion, it is not possible that the sequence  F(b spr X −spr Y ) (0) 1

is contradictory with the result in (A.1).



n∈N

has limit F(ωb

1 spr X −spr Y )

(0) < 1, which

A. Blanco-Fernández et al. / Computational Statistics and Data Analysis 55 (2011) 2568–2578

Proof of Theorem 4.1. The strong consistency of the estimator  α= σmid X ,mid Y 2 σmid X

 σX M ,Y  σ 2M

=

X

 σmid X ,mid Y 2  σmid X

2577

w.r.t. the parameter α =

σX M ,Y σ 2M

=

X

can be assured directly from the classical result for the OLS estimator of the regression parameter of the linear

model between real random variables mid X and mid Y .



In the following, we will prove the strong consistency property w.r.t. β for estimator  β . First, let Qn = max 0,

 σX S ,Y  σ 2S

 .

X

We will show that Qn converges a.s.-[P] to β . Applying the classical result about the strong consistency of the sample variance and covariance w.r.t. the corresponding population moments of real random variables, it follows that

 σX S ,Y  σ 2S

=

X

 σspr X ,spr Y 2  σspr X

converges a.s.-[P] to

σX S ,Y σ 2S

=

X

σspr X ,spr Y 2 σspr X

.

Taking into account that max{0, ·} is a real continuous mapping and then applying the Continuous Mapping Theorem, we have that



 σX S ,Y Qn = max 0, 2  σX S





σ X S ,Y −→ max 0, 2 σX S



n→∞

a.s.-[P].

(A.2)

At this point it is necessary to split the development in two cases, depending on the of β .  value  σ

S

If β > 0, from the expression β = σX S ,Y /σX2S it follows that σX S ,Y > 0, so max 0, X 2 ,Y σ S X



If β = 0, then σX S ,Y = 0, so in this case max 0,

σX S ,Y σ 2S



=

σX S ,Y σ 2S

= β.

X

= 0 = β.

X

In both cases, the convergence in (A.2) results in n→∞

Qn −→ β

a.s.-[P] .

(A.3)

Let us now show the convergence of  β = min{ s0 , Qn }. On one hand,  β ≤ Qn . Moreover, as  s0 can be seen as the sample version of s0 , it is obvious that s0 ≤  s0 , and then it follows min{s0 , Qn } ≤ min{ s0 , Qn } =  β . Thus, min{s0 , Qn } ≤  β ≤ Qn .

(A.4)

Since spr X , spr ε ≥ 0, and the non-negative value of β is considered, from Eq. (3) we can conclude that min{s0 , β} = β . n→∞

Then, we can again apply the continuous mapping theorem to min{s0 , ·} and obtain min{s0 , Qn } −→ min{s0 , β} = β a.s.-[P]. From this last convergence result and the one shown in (A.3), applying the sandwich rule in (A.4) we conclude the desired convergence, n→∞  β −→ β a.s.-[P].

Finally, the strong consistency of the estimator  B = Y − ( αX M +  β X S ) w.r.t. the parameter B = E (Y ) − (α E (X M ) + β E (X S ))  follows from the corresponding convergences of  α and β , together with the strong law of large numbers for the sample mean of an interval random set (see Artstein and Vitale, 1975).  References Artstein, Z., Vitale, R.A., 1975. A strong law of large numbers for random compact sets. Annals of Probability 5, 879–882. Bertoluzza, C., Corral, N., Salas, A., 1995. On a new class of distances between fuzzy numbers. Mathware & Soft Computing 2, 71–84. Billick, I.H., Curran, A.S., Shier, D.R., 1979. Analysis of pediatric blood lead levels in New York for 1970–1976. Environmental Health Perspectives 31, 183–190. Chesher, A., Irish, M., 1987. Residual analysis in the grouped and censored normal linear model. Journal of Econometrics 34, 33–61. Diamond, P., 1990. Least squares fitting of compact set-valued data. Journal of Mathematical Analysis and Applications 147, 531–544. Fréchet, M., 1948. Les éléments aléatoires de natures quelconque dans un éspace distancié. Annales de l’Institut Henri Poincaré 10, 215–310. Gil, M.A., López-García, M.T., Lubiano, M.A., Montenegro, M., 2001. Regression and correlation analyses of a linear relation between random intervals. Test 10 183–201. Gil, M.A., Lubiano, A., Montenegro, M., López-García, M.T., 2002. Least squares fitting of an affine function and strength of association for interval-valued data. Metrika 56, 97–111. Gil, M.A., González-Rodríguez, G., Colubi, A., Montenegro, M., 2007. Testing linear independence in linear models with interval-valued data. Computational Statistics & Data Analysis 51, 3002–3015. Giordani, P., Kiers, H.A.L., 2006. A comparison of three methods for principal component analysis of fuzzy interval data. Computational Statistics & Data Analysis 51, 379–397. González-Rodríguez, G., Blanco, A., Corral, N., Colubi, A., 2007. Least squares estimation of linear regression models for convex compact random sets. Advances in Data Analysis and Classification 1, 67–81. Ham, J., Hsiao, C., 1984. Two-stage estimation of structural labor supply parameters using interval data from the 1971 Canadian Census. Journal of Econometrics 24, 133–158. Hashimoto, E.M., Ortega, E.M.M., Paula, G.A., Barreto, M.L., 2011. Regression models for grouped survival data: Estimation and sensitivity analysis. Computational Statistics & Data Analysis 55, 993–1007. Hasselblad, V., Stead, A.G., Galke, W., 1980. Analysis of coarsely grouped data from the lognormal distribution. Journal of the American Statistical Association 75 (372), 771–778. Hong, H., Tamer, E., 2003. Inference in censored models with endogenous regressors. Econometrica 71 (3), 905–932.

2578

A. Blanco-Fernández et al. / Computational Statistics and Data Analysis 55 (2011) 2568–2578

Hsiao, C., 1983. Regression analysis with a categorized explanatory variable. In: Karlin, S., Amemiya, T., Goodman, L. (Eds.), Studies in Econometrics, Time Series and Multivariate Analysis. Academic Press, New York. Körner, R., 1997. On the variance of fuzzy random variables. Fuzzy Sets and Systems 92, 83–93. Körner, R., Näther, W., 2002. On the varianze of random fuzzy variables. In: Statistical Modelling, Analysis and Management of Fuzzy Data. Physica-Verlag, Heidelberg, pp. 22–39. Lima Neto, E.A., DeCarvalho, F.A.T., Freire, E.S., 2008. Centre and Range method for fitting a linear regression model to symbolic interval data. Computational Statistics & Data Analysis 52, 1500–1515. Lima Neto, E.A., de Carvalho, F.A.T., 2010. Constrained linear regression models for symbolic interval-valued variables. Computational Statistics & Data Analysis 54, 333–347. Manski, C.F., Tamer, E., 2002. Inference on regressions with interval data on a regressor or outcome. Econometrica 70 (2), 519–546. Rivero, C., Valdes, T., 2008. An algorithm for robust linear estimation with grouped data. Computational Statistics & Data Analysis 53, 255–271. Steward, M.B., 1983. On least squares estimation when the dependent variable is grouped. Review of Economic Studies 50, 737–753. Trutschnig, W., González-Rodríguez, G., Colubi, A., Gil, M.A., 2009. A new family of metrics for compact, convex (fuzzy) sets based on a generalized concept of mid and spread. Information Sciences 179 (23), 3964–3972. Zhang, X., Sun, J., 2010. Regression analysis of clustered interval-censored failure time data with informative cluster size. Computational Statistics & Data Analysis 54, 1817–1823.