A robust method for linear regression of symbolic interval data


Pattern Recognition Letters 31 (2010) 1991–1996


Pattern Recognition Letters journal homepage: www.elsevier.com/locate/patrec

Marco A.O. Domingues a, Renata M.C.R. de Souza a,*, Francisco José A. Cysneiros b

a Centro de Informática, Universidade Federal de Pernambuco, Av. Prof. Luiz Freire, s/n, Cidade Universitária, CEP 50740-540, Recife (PE), Brazil
b Departamento de Estatística, CCEN, Universidade Federal de Pernambuco, Av. Prof. Luiz Freire, s/n, Cidade Universitária, CEP 50740-540, Recife (PE), Brazil

Article info

Article history:
Received 10 December 2009
Available online 30 June 2010
Communicated by R.C. Guido

Keywords:
Symbolic interval-valued data
Symbolic data analysis
Symmetrical linear regression
Outliers

Abstract

This paper introduces a new linear regression method for interval-valued data. The method is based on the symmetrical linear regression methodology, so that the prediction of the lower and upper bounds of the interval value of the dependent variable is not damaged by the presence of interval-valued outliers. The method considers the mid-points and ranges of the interval values assumed by the variables in the learning set. The boundaries of an interval are predicted by combining the predictions of the mid-point and the range of the interval values. The evaluation of the method is based on the average behavior of a pooled root mean-square error. Experiments with real and simulated symbolic interval data sets demonstrate the usefulness of this symbolic symmetrical linear regression method.

© 2010 Elsevier B.V. All rights reserved.

1. Introduction

In real-world decision-making applications, inaccuracy, uncertainty or variability must often be taken into account to represent the available information. In these cases, classical data cannot represent these nuances, and other kinds of data, such as interval-valued data (IVD), are required. IVD are suitable for imprecise data resulting from repeated measurements or confidence interval estimation, for bounds on the set of possible values of an item, and for the variation range of a variable through time or through the aggregation of a large data set into a reduced number of smaller groups of information. Interval data are also relevant for confidential data in companies and in specific government areas, where only ranges of values may be disclosed.

Symbolic Data Analysis (SDA) is a research field related to multivariate analysis, pattern recognition and artificial intelligence that offers suitable methods to deal with interval-valued data (see Bertrand and Goupil, 2000). SDA aims to provide a comprehensive way to summarize data sets by means of symbolic data, resulting in a smaller and more manageable data set that preserves the essential information, and to analyze it subsequently by generalizing exploratory data analysis and data mining techniques to symbolic data. Symbolic data allow multiple values for each variable. These new variables (set-valued, interval-valued and histogram-valued) make it possible to retain the intrinsic variability and/or uncertainty of the original data set, as shown in Diday and Noirhomme-Fraiture (2008).

* Corresponding author. Tel.: +55 81 21268430; fax: +55 81 21268438. E-mail addresses: [email protected] (R.M.C.R. de Souza), [email protected] (F.J.A. Cysneiros).
0167-8655/$ - see front matter © 2010 Elsevier B.V. All rights reserved. doi:10.1016/j.patrec.2010.06.008

This work introduces a new prediction method for IVD based on symmetrical linear regression (SLR) analysis. It has two innovative features: the predicted value is less susceptible to the presence of IVD outliers, and probabilistic assumptions are established for the errors of the method. Here, an interval variable is represented by two independent quantitative variables that describe the mid-point and the range of an interval. The new method fits two symmetrical linear regression models, one on the mid-points and one on the ranges of the interval values assumed by the variables in the learning set. These regression models allow heavy-tailed and light-tailed distributions (e.g. the Student-t distribution) to be considered for the errors. The interval value of the dependent variable is predicted from its mid-point and range, which are estimated from the fitted linear regression models applied to the mid-point and range of each interval value of the independent variables.

The paper is organized as follows. Section 2 describes related work and the motivation. Section 3 presents the theory of standard symmetrical linear regression. Section 4 introduces the symbolic symmetrical linear regression for symbolic interval-valued data. Section 5 presents an experimental evaluation with simulated interval-valued data in a Monte Carlo framework, and Section 6 describes an application to a real interval-valued data set. Finally, Section 7 gives concluding remarks.

2. Related work and motivation

In SDA, the problem of predicting interval-valued data has been approached in various ways. Billard and Diday (2000) presented an approach to extend the classic linear regression model (CLRM) to symbolic interval data by fitting the method of least squares to the mid-points of the IVD assumed by the interval variables. Billard and Diday (2002) proposed another approach that fits two independent CLRMs on the lower and upper bounds of the intervals. Billard and Diday (2006) also included explanatory variables as well as a hierarchical variable structure in the symbolic regression framework. Recently, Lima Neto and De Carvalho (2008) proposed the center and range method for fitting the CLRM to IVD as an improvement over the methods presented in Billard and Diday (2000, 2002).

Although these recent works on regression models for symbolic data represent an advance, some research topics remain open: for instance, the cited models do not consider any probabilistic assumptions for the model errors and do not treat interval-valued data sets with outliers. In the presence of outliers the least squares estimates can be affected: the fitted regression model is dragged towards the outliers, inflating the variance of the estimates. Thus, most specialists prefer to discard outliers before computing the line that best fits the data under investigation. In general, in a classic data set, outliers can be interpreted as misleading data. However, a small number of outliers are not due to any anomalous condition or measurement error; they often contain valuable information about the process being analyzed and should be carefully investigated before being removed from the data set. The issue is more serious for symbolic data sets, in which a single interval-valued outlier may represent an aggregation of a group of measurements; it is therefore not advisable to discard IVD outliers.

SDA starts by extracting knowledge from data sets in order to provide symbolic descriptions. In practice, symbolic descriptions are modelled mathematically by a generalization process applied to a set of individuals.
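As an illustration of such a generalization step, the sketch below aggregates individual measurements into one interval per group by taking the minimum and maximum, and then computes the mid-point and range of an interval. The min/max rule, the function names and the sample records are assumptions made for this example; the paper does not prescribe a particular aggregation procedure.

```python
from collections import defaultdict


def aggregate_to_intervals(records):
    """Turn (group, value) pairs into one [lower, upper] interval per group.

    Assumption: the interval is the min/max envelope of the group's values.
    """
    groups = defaultdict(list)
    for group, value in records:
        groups[group].append(value)
    return {group: (min(values), max(values)) for group, values in groups.items()}


def midpoint_and_range(interval):
    """Return (mid-point, range) of an interval given as (lower, upper)."""
    lower, upper = interval
    return (lower + upper) / 2, upper - lower


# Hypothetical individual measurements for two groups.
records = [("group_a", 5.0), ("group_a", 15.0), ("group_a", 9.0),
           ("group_b", 7.0), ("group_b", 10.0)]
intervals = aggregate_to_intervals(records)
for name, interval in intervals.items():
    print(name, interval, midpoint_and_range(interval))
```

With a min/max envelope, a single extreme individual widens the whole interval, which is why an interval-valued outlier cannot simply be treated as one bad measurement.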
According to Diday and Noirhomme-Fraiture (2008), overgeneralization problems can arise when extreme values are in fact outliers or when the set of individuals to generalize is in fact composed of subsets with different distributions. For example, Fig. 1 displays IVD outliers found in a symbolic data set with 23 species of a mushroom family called Amanita. Each species of this family is described by two regressor interval-valued variables, stipe length and stipe thickness, and the problem is to predict the pileus cap from these variables. In this figure, IVD outliers are intervals that have an unusual pileus cap coordinate. In predicting the pileus cap from stipe length and stipe thickness, procedures that dampen the effect of IVD outliers on the regression equation are needed.

Some methods have been proposed to address the drawbacks of least squares in the presence of outliers. Robust methods have been used to damp the effect of observations that would be highly influential on the classical linear regression model. Robust procedures also tend to leave the residuals associated with outliers large, thereby making the identification of influential points much easier (see Rousseeuw and Leroy, 1987). Another approach to robust estimation of regression models is to replace the normal distribution with a heavy-tailed distribution. This parametric approach is adopted in this work and is explained in detail in Section 3.

The main contribution of this work is a prediction method for interval data that is less sensitive to the presence of IVD outliers. Here, IVD outliers are identified through mid-point outliers and/or range outliers. The method allows a heavy-tailed probability distribution to be assumed for the model errors. The next section presents the symmetrical linear regression for classical data.

3. Symmetrical linear regression

As stated earlier, the presence of outliers normally has an impact on the regression model. A practical case is one in which the errors of the model follow a distribution that has heavier tails than the normal. Such a case is evidence of the presence of outliers among the data, and heavy-tailed distributions very often accommodate them better, as shown in Montgomery et al. (2006).
This section presents the symmetrical linear regression (SLR) model. The estimates of this model are less susceptible to the presence of outliers when a heavy-tailed distribution is used. Assuming a probability distribution for the errors makes it possible to apply regular statistical hypothesis tests and other inference techniques to the model. Symmetrical models for classic data have been widely developed. Galea et al. (2003) introduced diagnostic methods based on local influence for symmetrical linear models, and Galea et al. (2005) discussed the extension of diagnostic methods to non-linear models. Cysneiros and Vanegas (2008) proposed a general definition of residuals in the class of non-linear symmetrical models.

Suppose $Y_1, \ldots, Y_n$ are n independent random variables whose density function is given by

$$f_{Y_i}(y) = \frac{1}{\sqrt{\phi}}\, g\{(y - \mu_i)^2/\phi\}, \quad y \in \mathbb{R}, \qquad (1)$$

with $\mu_i \in \mathbb{R}$ and $\phi > 0$ being the location and dispersion parameters, respectively. The function $g: \mathbb{R} \to [0, \infty)$, which satisfies $\int_0^\infty g(u)\,du < \infty$, is typically known as the density generator, and we write $Y_i \sim S(\mu_i, \phi, g)$. The SLR model is defined as

$$Y_i = \mu_i + \epsilon_i, \quad i = 1, \ldots, n, \qquad (2)$$

where $\mu_i = x_i^T \beta$, $\beta = (\beta_0, \ldots, \beta_p)^T$ is an unknown parameter vector, $\epsilon_i \sim S(0, \phi, g)$ and $x_i$ is the vector of explanatory variables. When they exist, $E(Y_i) = \mu_i$ and $\mathrm{Var}(Y_i) = \xi\phi$, where $\xi > 0$ is a constant that depends on the distribution (see, for instance, Fang et al., 1990). This class of models includes all symmetric continuous distributions, such as the normal, Student-t and logistic, among others. For example, the Student-t distribution with $\nu$ degrees of freedom results in $\xi = \nu/(\nu - 2)$, so that $\mathrm{Var}(Y_i) = \frac{\nu}{\nu - 2}\phi$ for $\nu > 2$, and the normal distribution gives $\xi = 1$ and $\mathrm{Var}(Y_i) = \phi$.

Fig. 1. Amanita mushroom data outliers.

In this model, the maximum likelihood estimates of $\beta$ and $\phi$ cannot be obtained separately, and closed-form expressions for these estimates do not exist. Iterative procedures such as Newton–Raphson, BFGS and Fisher scoring can be used. Fisher scoring can be easily applied to obtain $\hat\beta$ and $\hat\phi$, and the process for $\hat\beta$ can be interpreted as weighted least squares. The iterative process for $\hat\beta$ and $\hat\phi$ takes the form

$$\beta^{(m+1)} = \{X^T D(v^{(m)}) X\}^{-1} X^T D(v^{(m)})\, y, \qquad (3)$$

$$\phi^{(m+1)} = \frac{1}{n}\{y - X\beta\}^T D(v)\{y - X\beta\} \quad (m = 0, 1, 2, \ldots), \qquad (4)$$

with $D(v) = \mathrm{diag}\{v_1, \ldots, v_n\}$, $y = (y_1, \ldots, y_n)^T$, $X = (x_1^T, \ldots, x_n^T)^T$, $v_i = -2W_g(u_i)$, $W_g(u) = g'(u)/g(u)$, $g'(u) = dg(u)/du$ and $u_i = (y_i - \mu_i)^2/\phi$.

For the normal distribution the maximum likelihood estimates take closed-form expressions, because $v_i = 1$ for all $i$. For the Student-t distribution with $\nu$ degrees of freedom, $g(u) = c(1 + u/\nu)^{-(\nu+1)/2}$, $\nu > 0$ and $u > 0$, so that $W_g(u_i) = -(\nu + 1)/\{2(\nu + u_i)\}$ and $v_i = (\nu + 1)/(\nu + u_i)$ for all $i$. In this case the current weight $v_i^{(m)}$ from (3) decreases with the distance between the observed value $y_i$ and its current predicted value $x_i^T\beta^{(m)}$, so that outlying observations tend to have small weights in the estimation process (see the discussion in, for instance, Cysneiros and Paula, 2005).

4. Constructing symbolic symmetrical linear regression on interval-valued data

This section introduces a prediction method for interval data based on symmetrical linear regression (SSLR–IVD) that is less susceptible to the presence of IVD outliers. Here, an interval is represented by a pair of independent observations (mid-point and range), and IVD outliers are identified according to the presence of mid-point outliers and/or range outliers. However, this work focuses the analysis of outliers on the mid-points of the intervals. Thus, two symmetrical linear regression models are fitted on the interval values assumed by the variables in the learning set: a model on the mid-points adopting the Student-t distribution for the errors, and a model on the ranges adopting the normal distribution for the errors.

Let $\Omega = \{1, \ldots, n\}$ be a data set of n objects described by the response interval-valued variable $Y = (y_1, \ldots, y_i, \ldots, y_n)^T$ and p explanatory interval-valued variables $X_1, \ldots, X_p$ with $X_j = (x_{1j}, \ldots, x_{ij}, \ldots, x_{nj})^T$. Each object i of $\Omega$ is represented as $(x_i, y_i)$, $x_i = (x_{i1}, \ldots, x_{ip})^T$, where $x_{ij} = [x_j^L(i), x_j^U(i)] \in I = \{[a, b] : a, b \in \mathbb{R}, a \le b\}$ and $y_i = [y^L(i), y^U(i)] \in I$.

In this method, an interval is represented by two independent variables that describe the mid-point and range of this interval. Let $Y^c = (y_1^c, \ldots, y_i^c, \ldots, y_n^c)^T$ and $Y^r = (y_1^r, \ldots, y_i^r, \ldots, y_n^r)^T$ be the mid-point and range dependent variables, respectively, related to the dependent interval-valued variable Y. Let $X_j^c = (x_{1j}^c, \ldots, x_{ij}^c, \ldots, x_{nj}^c)^T$ and $X_j^r = (x_{1j}^r, \ldots, x_{ij}^r, \ldots, x_{nj}^r)^T$ be the mid-point and range explanatory variables, respectively, related to the explanatory interval-valued variable $X_j$ (j = 1, ..., p). Now, each object i (i = 1, ..., n) of $\Omega$ is represented by two vectors, $(x_i^c, y_i^c)$ with $x_i^c = (x_{i1}^c, \ldots, x_{ip}^c)^T$ and $(x_i^r, y_i^r)$ with $x_i^r = (x_{i1}^r, \ldots, x_{ip}^r)^T$, where

$$y_i^c = (y^L(i) + y^U(i))/2 \quad\text{and}\quad x_{ij}^c = (x_j^L(i) + x_j^U(i))/2,$$
$$y_i^r = y^U(i) - y^L(i) \quad\text{and}\quad x_{ij}^r = x_j^U(i) - x_j^L(i).$$

4.1. Definition of IVD outlier

An IVD outlier (y, x) is an object of $\Omega$, recorded in the data set, whose mid-point coordinate $y^c$ lies at an abnormal distance from the other mid-point values in a sample from a population. As stated, this definition does not consider any kind of leverage intervals.

4.2. Regression equations

Let $z_i^c = (1, (x_i^c)^T)^T$ and $z_i^r = (1, (x_i^r)^T)^T$. The SSLR–IVD is defined by two independent regression equations:

$$y_i^c = (z_i^c)^T \beta^c + \epsilon_i^c \quad\text{and}\quad y_i^r = (z_i^r)^T \beta^r + \epsilon_i^r,$$

where $\beta^c = (\beta_0^c, \ldots, \beta_p^c)^T$ is an unknown mid-point parameter vector, $\epsilon_i^c \sim S(0, \phi, g)$ and $z_i^c$ (i = 1, ..., n) is the vector of explanatory mid-point variables; likewise, $\beta^r = (\beta_0^r, \ldots, \beta_p^r)^T$ is an unknown range parameter vector, $\epsilon_i^r \sim S(0, \phi, g)$ and $z_i^r$ (i = 1, ..., n) is the vector of explanatory range variables. The parameter vectors $\beta^c$ and $\beta^r$ are estimated by the maximum likelihood method, assuming symmetrical distributions for the errors on mid-points and ranges as described in Section 3. In this work, we assume the Student-t distribution for the errors on the mid-points and the normal distribution for the errors on the ranges. The prediction of the lower and upper bounds of the ith interval $\hat y_i = [\hat y^L(i), \hat y^U(i)]$ is based on the predictions $\hat y_i^c$ and $\hat y_i^r$.

4.3. Prediction rule

Consider a new object with explanatory interval-valued data vector $x = (x_1, \ldots, x_p)^T$, where each $x_j$ is an interval $x_j = [x_j^L, x_j^U]$. Let $z^c = (1, x_1^c, \ldots, x_p^c)^T$ and $z^r = (1, x_1^r, \ldots, x_p^r)^T$, where $x_j^c = (x_j^L + x_j^U)/2$ and $x_j^r = x_j^U - x_j^L$ are the mid-point and range values, respectively, of the interval $x_j$. The interval $\hat y = [\hat y^L, \hat y^U]$ is obtained as follows:

$$\hat y^L = (z^c)^T\hat\beta^c - \tfrac{1}{2}(z^r)^T\hat\beta^r, \qquad \hat y^U = (z^c)^T\hat\beta^c + \tfrac{1}{2}(z^r)^T\hat\beta^r.$$

5. Simulated interval-valued data experiments

To show the usefulness of the SSLR–IVD approach proposed in this paper, this section considers experiments with synthetic symbolic interval-valued data sets that present different degrees of difficulty for fitting a linear regression model.

5.1. Simulated interval-valued data

Initially, simulated IVD sets in $\mathbb{R}^2$ and $\mathbb{R}^4$ are generated from simulated standard quantitative data sets, such that each point of a standard quantitative data set is a center (seed) for a rectangle in $\mathbb{R}^2$ or a hypercube in $\mathbb{R}^4$. Each standard quantitative data set has 375 points in $\mathbb{R}^2$ or $\mathbb{R}^4$, as in Lima Neto and De Carvalho (2008).

5.1.1. Configurations in $\mathbb{R}^2$

For the data sets in $\mathbb{R}^2$, the mid-points and ranges of the intervals are simulated independently from uniform distributions. All independent variables were simulated as random draws: the mid-points $x_i^c$ were drawn from a uniform distribution on [a, b], and their values were held fixed throughout the simulations with equal sample size. The mid-points $y_i^c$ are related to the mid-points $x_i^c$ by $y_i^c = \beta_0 + \beta_1 x_i^c + \epsilon_i$, where $\beta_0, \beta_1$ are simulated from a uniform distribution on [c, d] and $\epsilon_i$ is simulated from a uniform distribution on [e, f]. Thus, the centers of the rectangles in $\mathbb{R}^2$ are the points $(x_i^c, y_i^c)$, and these rectangles are formed in the following way:

$$\left([x_i^c - x_i^r/2,\; x_i^c + x_i^r/2],\; [y_i^c - y_i^r/2,\; y_i^c + y_i^r/2]\right),$$

where $x_i^r$ and $y_i^r$ are generated from a uniform distribution on [g, h].

5.1.2. Configurations in $\mathbb{R}^4$

For the data sets in $\mathbb{R}^4$, the mid-points $(x_{i1}^c, x_{i2}^c, x_{i3}^c)$ are simulated from a uniform distribution on [a, b]. The mid-points $y_i^c$ are related to them by $y_i^c = \beta_0 + \beta_1 x_{i1}^c + \beta_2 x_{i2}^c + \beta_3 x_{i3}^c + \epsilon_i$, where $\beta_0, \beta_1, \beta_2, \beta_3$ are simulated from a uniform distribution on [c, d] and $\epsilon_i$ is simulated from a uniform distribution on [e, f]. The centers of the hypercubes in $\mathbb{R}^4$ are the points $(x_{i1}^c, x_{i2}^c, x_{i3}^c, y_i^c)$, and these hypercubes are formed in the following way:

$$\left([x_{i1}^c - x_{i1}^r/2,\; x_{i1}^c + x_{i1}^r/2],\; [x_{i2}^c - x_{i2}^r/2,\; x_{i2}^c + x_{i2}^r/2],\; [x_{i3}^c - x_{i3}^r/2,\; x_{i3}^c + x_{i3}^r/2],\; [y_i^c - y_i^r/2,\; y_i^c + y_i^r/2]\right),$$

where $x_{i1}^r, x_{i2}^r, x_{i3}^r$ and $y_i^r$ are generated from a uniform distribution on [g, h].

Four different configurations are considered for the rectangles in $\mathbb{R}^2$ and the hypercubes in $\mathbb{R}^4$. They are drawn according to the parameters of the uniform distributions presented in Table 1. For the simulated IVD sets considered in this paper, a rectangle is an outlier if its mid-point coordinate $y^c$ is remote in the set of rectangles represented by the mid-points $y_i^c$. The effect that such a rectangle has on the regression model depends on the x coordinate of its mid-point and on the general disposition of the other rectangles in the data set.

Table 1
Configuration parameters for hypercubes in $\mathbb{R}^2$ and $\mathbb{R}^4$.

Config.   [a,b]     [c,d]     [e,f]      [g,h]
1         [20,40]   [20,40]   [-20,20]   [20,40]
2         [20,40]   [20,40]   [-5,5]     [20,40]
3         [20,40]   [1,5]     [-20,20]   [1,5]
4         [20,40]   [1,5]     [-5,5]     [1,5]
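A minimal sketch of how one configuration in R^2 could be simulated is given below. It follows the description above (uniform draws for mid-points, coefficients, errors and ranges), but the function name `simulate_r2`, the seed handling and the use of Python's `random` module are assumptions of this example, not the authors' code. The call uses the Configuration 1 parameters and reads the error interval [e, f] as [-20, 20], which is also an assumption (a zero-width interval would produce no noise at all).

```python
import random


def simulate_r2(n, ab, cd, ef, gh, seed=0):
    """Draw n rectangles in R^2 as ((x_lower, x_upper), (y_lower, y_upper))."""
    rng = random.Random(seed)
    beta0 = rng.uniform(*cd)                 # coefficients drawn from U[c, d]
    beta1 = rng.uniform(*cd)
    rectangles = []
    for _ in range(n):
        xc = rng.uniform(*ab)                # mid-point of x drawn from U[a, b]
        yc = beta0 + beta1 * xc + rng.uniform(*ef)   # linear model plus U[e, f] error
        xr = rng.uniform(*gh)                # ranges drawn from U[g, h]
        yr = rng.uniform(*gh)
        rectangles.append(((xc - xr / 2, xc + xr / 2),
                           (yc - yr / 2, yc + yr / 2)))
    return rectangles


# Configuration 1, with 375 rectangles as in the experiments.
rectangles = simulate_r2(375, ab=(20, 40), cd=(20, 40), ef=(-20, 20), gh=(20, 40))
print(rectangles[0])
```

Each rectangle is built from its center and ranges exactly as in the formula for the rectangles, so the mid-point and range can be recovered from the bounds at any time.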

Outlier IVDs are created from the mid-point data set $(y_i^c, x_i^c)$ (i = 1, ..., n). First, the sets in $\mathbb{R}^2$ and $\mathbb{R}^4$ are sorted in ascending order of the dependent variable $Y^c$, and a small cluster containing the first m points of the sorted set $(y_i^c, x_i^c)$ is selected. The observations of this cluster are changed into outlier points by

$$y_i^c = y_i^c - 3 \times S_{Y^c} \quad (i = 1, \ldots, m),$$

where $S_{Y^c}$ is the standard deviation of $(y_1^c, \ldots, y_n^c)$. Fig. 2 displays the IVD sets 1, 2, 3 and 4. Fig. 2(a) and (b) show high variability in the ranges of the rectangles; hence the outlier rectangles are only slightly remote in the y coordinate. Fig. 2(c) and (d) show low variability in the ranges of the rectangles, and the outlier rectangles in those figures are remote in the y coordinate.

5.2. Performance analysis

Monte Carlo simulations with four simulated IVD sets in $\mathbb{R}^2$ and in $\mathbb{R}^4$ were carried out to assess the performance of the SSLR–IVD model in the presence of IVD outliers. In addition, the proposed SSLR–IVD model is compared with the linear regression model for interval-valued data introduced by Lima Neto and De Carvalho (2008), here called LR–IVD. Initially, values for $\beta_0, \beta_1$ in $\mathbb{R}^2$ and $\beta_0, \beta_1, \beta_2, \beta_3$ in $\mathbb{R}^4$ are selected randomly from the uniform distribution U[-10, 10], and a Monte Carlo simulation with 100 replications is performed for each data set in $\mathbb{R}^2$ and in $\mathbb{R}^4$. Test and learning sets are randomly selected from each simulated IVD set. The learning set has 250 units (about 67% of the 375 units of the original data set) and the test set has 125 units (about 33%). The performance assessment of the SSLR–IVD method is based on the pooled root mean-square error (PRMSE). This measure is calculated for the learning and test data sets, and the average PRMSE over the 100 replications is computed. The PRMSE measure for the learning data set is given by

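$$\mathrm{PRMSE}^l = \sqrt{\frac{\sum_{i=1}^{n} v_i\,\mathrm{error}_i}{n}}, \quad\text{where}\quad \mathrm{error}_i = (l_y(i) - \hat l_y(i))^2 + (u_y(i) - \hat u_y(i))^2,$$

where $v_i$ is the weight obtained from the symmetrical linear regression applied to the mid-points of the intervals, $l_y(i)$ and $u_y(i)$ are the lower and upper bounds of the ith interval of the dependent variable, and $\hat l_y(i)$ and $\hat u_y(i)$ are their predictions.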
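To make the role of the weights $v_i$ concrete, the following sketch fits the mid-point model with Student-t weights by the reweighting iteration of Eqs. (3) and (4), fits the range model by ordinary least squares, combines both predictions into bounds and evaluates the weighted pooled error. It is a simplified illustration under stated assumptions: a single regressor (so weighted least squares has a closed form), $\nu = 4$, a least-squares starting point, the update order inside the loop, and a small hypothetical learning set. None of these choices is taken from the paper.

```python
import math


def wls_line(x, y, w):
    """Weighted least squares for the one-regressor model y = b0 + b1 * x."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    b1 = sxy / sxx
    return my - b1 * mx, b1


def fit_student_t(x, y, nu=4.0, iters=500, tol=1e-12):
    """Iterate Eqs. (3)-(4) with Student-t weights v_i = (nu + 1) / (nu + u_i)."""
    n = len(y)
    v = [1.0] * n
    b0, b1 = wls_line(x, y, v)                     # least-squares start (assumption)
    phi = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)) / n
    for _ in range(iters):
        if phi <= 0.0:                             # exact fit: nothing to reweight
            break
        v = [(nu + 1.0) / (nu + (yi - b0 - b1 * xi) ** 2 / phi)
             for xi, yi in zip(x, y)]
        nb0, nb1 = wls_line(x, y, v)               # Eq. (3) with current weights
        phi = sum(vi * (yi - nb0 - nb1 * xi) ** 2
                  for vi, xi, yi in zip(v, x, y)) / n   # Eq. (4)
        shift = abs(nb0 - b0) + abs(nb1 - b1)
        b0, b1 = nb0, nb1
        if shift < tol:
            break
    return b0, b1, v


def prmse(lower, upper, lower_hat, upper_hat, weights=None):
    """Pooled root mean-square error over lower and upper bounds."""
    n = len(lower)
    w = weights if weights is not None else [1.0] * n
    total = sum(wi * ((l - lh) ** 2 + (u - uh) ** 2)
                for wi, l, lh, u, uh in zip(w, lower, lower_hat, upper, upper_hat))
    return math.sqrt(total / n)


# Hypothetical learning set: mid-points near y = 1 + 2x, last one shifted down.
xc = [float(i) for i in range(10)]
yc = [1.0 + 2.0 * x + (0.1 if i % 2 == 0 else -0.1) for i, x in enumerate(xc)]
yc[-1] -= 30.0                                     # mid-point outlier
xr = [1.0 + 0.1 * i for i in range(10)]
yr = [2.0 + 0.5 * r for r in xr]

c0, c1, v = fit_student_t(xc, yc)                  # mid-point model, Student-t errors
r0, r1 = wls_line(xr, yr, [1.0] * len(yr))         # range model, normal errors
yc_hat = [c0 + c1 * x for x in xc]
yr_hat = [r0 + r1 * r for r in xr]
lo_hat = [c - r / 2 for c, r in zip(yc_hat, yr_hat)]
up_hat = [c + r / 2 for c, r in zip(yc_hat, yr_hat)]
lo = [c - r / 2 for c, r in zip(yc, yr)]
up = [c + r / 2 for c, r in zip(yc, yr)]
print(prmse(lo, up, lo_hat, up_hat, weights=v))
```

Because the outlying mid-point receives the smallest weight, it contributes little both to the fitted line and to the weighted error. Calling `prmse` without `weights` gives the unweighted pooled error.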

Fig. 2. Interval-valued data sets 1, 2, 3 and 4 containing outlier rectangles.
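The contamination step used to create the outlier rectangles can be sketched as follows. The function name is hypothetical, and the sample standard deviation (`statistics.stdev`) is an assumption, since the text does not say whether the sample or the population formula is used. Only the mid-points are shifted; the ranges are left unchanged, so each affected rectangle moves as a whole.

```python
import statistics


def inject_midpoint_outliers(yc, m):
    """Shift the m smallest mid-points down by 3 standard deviations of yc."""
    s = statistics.stdev(yc)                   # assumption: sample standard deviation
    order = sorted(range(len(yc)), key=lambda i: yc[i])   # ascending by mid-point
    contaminated = list(yc)
    for i in order[:m]:
        contaminated[i] = yc[i] - 3.0 * s
    return contaminated


# Hypothetical mid-points; the two smallest ones become outliers.
print(inject_midpoint_outliers([5.0, 1.0, 3.0, 2.0, 4.0], m=2))
```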

Table 2
Comparison between regression models according to the rejection ratio (%) of H0 for interval-valued data sets in $\mathbb{R}^2$.

      Configuration 1      Configuration 2      Configuration 3      Configuration 4
DF    PRMSE^l   PRMSE^t    PRMSE^l   PRMSE^t    PRMSE^l   PRMSE^t    PRMSE^l   PRMSE^t
2     100       100        100       100        100       100        100       100
4     100       100        100       100        100       100        100       100
6     100       100        100       100        100       100        100       100
8     100       100        100       100        100       100        100       100
10    100       100        100       100        100       100        100       100

Table 3
Comparison between regression models according to the rejection ratio (%) of H0 for interval-valued data sets in $\mathbb{R}^4$.

      Configuration 1      Configuration 2      Configuration 3      Configuration 4
DF    PRMSE^l   PRMSE^t    PRMSE^l   PRMSE^t    PRMSE^l   PRMSE^t    PRMSE^l   PRMSE^t
2     90        100        100       100        90        100        100       100
4     94        100        100       100        94        100        100       100
6     98        100        100       100        98        100        100       100
8     98        100        100       100        98        100        100       100
10    98        100        100       100        98        100        100       100

Table 4 Ranges of pileus cap, stipe length and stipe thickness of the family Amanita of mushrooms.

The PRMSE measure for the test data set is given by

$$\mathrm{PRMSE}^t = \sqrt{\frac{\sum_{i=1}^{n}\mathrm{error}_i}{n}}.$$

These measures are estimated for each fixed configuration in $\mathbb{R}^2$ and in $\mathbb{R}^4$ of the learning and test data sets. At each replication of the Monte Carlo method, the SSLR–IVD and LR–IVD models are fitted to the learning simulated IVD set. Moreover, for the SSLR–IVD model, the analysis is carried out for degrees of freedom varying from 2 to 10 in steps of 2. The fitted models are then used to predict the interval values of the dependent interval-valued variable Y in the test and learning simulated IVD sets. The average and standard deviation over the 100 Monte Carlo simulations are calculated for each $\mathrm{PRMSE}^{l,t}$, and a Student's t-test for paired samples at a significance level of 1% is then applied to compare the SSLR–IVD method proposed in this paper with the LR–IVD method. Let $\mu_1^{l,t}$ and $\mu_2^{l,t}$ be the averages of the $\mathrm{PRMSE}^{l,t}$ for the SSLR–IVD and LR–IVD methods, respectively. The null and alternative hypotheses are, respectively:

$$H_0: \mu_1^{l,t} = \mu_2^{l,t}, \qquad H_1: \mu_1^{l,t} < \mu_2^{l,t}.$$
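The paired comparison behind these hypotheses can be sketched with the standard library alone. The function names are hypothetical, and the critical value is passed in by the caller (from a t table or a statistics package) rather than computed; for 100 replications, i.e. 99 degrees of freedom, the one-sided 1% critical value is roughly 2.36 according to standard tables.

```python
import math


def paired_t_statistic(a, b):
    """Return (t, degrees of freedom) for the paired differences a_i - b_i."""
    d = [ai - bi for ai, bi in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((di - mean) ** 2 for di in d) / (n - 1)   # sample variance of differences
    return mean / math.sqrt(var / n), n - 1


def reject_h0(a, b, critical):
    """One-sided test of H1: mean(a) < mean(b); reject H0 when t < -critical."""
    t, _ = paired_t_statistic(a, b)
    return t < -critical


# Hypothetical PRMSE values of two methods over four replications.
errors_method_1 = [1.0, 2.0, 3.0, 4.0]
errors_method_2 = [2.0, 3.5, 4.0, 5.5]
print(paired_t_statistic(errors_method_1, errors_method_2))
```

The rejection ratio for a configuration is then the percentage of parameter draws for which `reject_h0` returns `True`.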

In order to evaluate the performance of the SSLR–IVD model based on the pooled root mean-square error, a Monte Carlo simulation is carried out with 100 different values of each parameter $\beta_0, \beta_1$ for the configurations in $\mathbb{R}^2$ and of each parameter $\beta_0, \beta_1, \beta_2, \beta_3$ for the configurations in $\mathbb{R}^4$, selected randomly from the uniform distribution U[-10, 10]. Then, for each configuration, the ratio of times that the hypothesis H0 is rejected is calculated. Tables 2 and 3 display the ratio of times that the null hypothesis is rejected for the measures $\mathrm{PRMSE}^{l,t}$ for all configurations of IVD in $\mathbb{R}^2$ and $\mathbb{R}^4$, respectively. The results in Tables 2 and 3 show clearly that the SSLR–IVD model is superior to the LR–IVD model when rectangle outliers are present in the simulated data sets in $\mathbb{R}^2$ and $\mathbb{R}^4$.

6. Application with amanita mushroom interval-valued data set

This section illustrates the application of the SSLR–IVD model proposed in this paper to the Amanita mushroom interval-valued data set. Table 4 shows this data set, which was introduced in Section 2. These mushroom species are members of the genus Amanita, and the values displayed in Table 4 were extracted from the Fungi of California Species Index (http://www.mykoweb.com/CAF/species_index.html). The interval data of this data set were obtained by aggregating individual mushrooms according to species.

                  Symbolic variables
Amanita species   Pileus cap     Stipe length    Stipe thickness
Aprica            [5.00:15.00]   [3.30:9.10]     [1.40:3.50]
Bivolvata         [7.00:10.00]   [13.00:15.00]   [1.60:2.50]
Breckonii         [4.00:9.00]    [7.00:10.00]    [0.90:2.00]
Californica       [6.00:7.00]    [6.00:10.00]    [0.60:0.80]
Cokeri            [7.00:15.00]   [10.00:20.00]   [1.00:2.00]
Constricta        [6.00:12.00]   [9.00:17.00]    [1.00:2.00]
Farinosa          [2.50:6.50]    [3.00:6.50]     [0.30:1.00]
Franchetii        [4.00:12.00]   [5.00:15.00]    [1.00:2.00]
Gemmata           [3.00:11.00]   [4.00:15.00]    [0.50:2.00]
Lanei             [8.00:25.00]   [10.00:20.00]   [1.50:4.00]
Magniverrucata    [4.00:13.00]   [7.00:11.50]    [1.00:2.50]
Muscaria          [6.00:39.00]   [7.00:16.00]    [2.00:3.00]
Novinupta         [5.00:14.00]   [6.00:12.00]    [1.50:3.50]
Ocreata           [5.00:13.00]   [10.00:22.00]   [1.50:3.00]
Pachycolea        [8.00:18.00]   [10.00:25.00]   [1.00:3.00]
Pantherina        [4.00:15.00]   [7.00:11.00]    [1.00:2.50]
Phalloides        [3.50:15.00]   [4.00:18.00]    [1.00:3.00]
Porphyria         [3.00:12.00]   [5.00:18.00]    [1.00:1.50]
Protecta          [4.00:14.00]   [5.00:15.00]    [1.00:3.00]
Silvicola         [5.00:12.00]   [6.00:10.00]    [1.00:2.50]
Smithiana         [5.00:17.00]   [6.00:18.00]    [1.00:3.50]
Vaginata          [5.50:10.00]   [6.00:13.00]    [1.20:2.00]
Velosa            [5.00:11.00]   [4.00:11.00]    [1.00:2.50]

In Fig. 1, shown in Section 2, we can observe that the mid-point coordinate $y^c$ of the Muscaria species lies at an abnormal distance from the other mid-point values in this data set. For this reason, we assume the Student-t distribution for the errors on the mid-points and the normal distribution for the errors on the ranges of the intervals in the SSLR–IVD model. In addition, the Akaike Information Criterion (Akaike, 1973) is adopted to select the degrees of freedom of the Student-t distribution.

Here, our aim is to evaluate the performance of the SSLR–IVD method in comparison with the LR–IVD method using a real interval-valued data set that contains outliers on the mid-points of the intervals. The evaluation was carried out through the pooled root mean-square error introduced in Section 5.2. This measure was estimated by the leave-one-out method. According to the Akaike criterion, 3 degrees of freedom were used for the Student-t distribution. The results of


the pooled root mean-square error for the SSLR–IVD and LR–IVD methods were 1.672 and 2.660, respectively. These results show that SSLR–IVD outperformed LR–IVD: its pooled root mean-square error was 37.1% lower than that of LR–IVD.

7. Conclusions

This paper presented a new linear regression model for interval-valued data. The model is based on the symmetrical linear regression methodology applied to the ranges and mid-points of the intervals. The predicted value is less susceptible to the presence of IVD outliers because the model considers heavy-tailed distributions when estimating its parameters. Here, IVD outliers are identified by the presence of unusual interval mid-points. The evaluation of the model is based on the average behavior of a pooled root mean-square error. Experiments with simulated symbolic interval data sets and an application to a mushroom interval-valued data set containing interval outliers demonstrated the superiority of the symbolic symmetrical linear model proposed in this paper over a linear regression model for interval data that uses least squares estimates.

Acknowledgement

The authors would like to thank CNPq, CAPES and FACEPE (Brazilian agencies) for their financial support.

References

Akaike, H., 1973. Information theory and an extension of the maximum likelihood principle. In: Petrov, B.N., Csaki, F. (Eds.), Second Internat. Symposium on Information Theory, Budapest, pp. 267–281.
Bertrand, P., Goupil, F., 2000. Descriptive statistic for symbolic data. In: Bock, H., Diday, E. (Eds.), Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information from Complex Data. Springer, Heidelberg, pp. 106–124.
Billard, L., Diday, E., 2000. Regression analysis for interval-valued data. In: Data Analysis, Classification and Related Methods, Proc. 7th Conf. Internat. Federation of Classification Societies (IFCS 2000), Springer, Belgium, pp. 369–374.
Billard, L., Diday, E., 2002. Symbolic regression analysis. In: Classification, Clustering and Data Analysis, Proc. 8th Conf. Internat. Federation of Classification Societies, Springer, Poland, pp. 281–288.
Billard, L., Diday, E., 2006. Symbolic Data Analysis: Conceptual Statistics and Data Mining. Wiley, West Sussex, England.
Cysneiros, F.J.A., Vanegas, L.H., 2008. Residuals and their statistical properties in symmetrical nonlinear models. Statist. Probab. Lett. 78, 3269–3273.
Cysneiros, F.J.A., Paula, G.A., 2005. Restricted methods in symmetrical linear regression models. Comput. Statist. Data Anal. 49, 689–708.
Diday, E., Noirhomme-Fraiture, M., 2008. Symbolic Data Analysis and the SODAS Software. Wiley, West Sussex, England.
Fang, K.T., Kotz, S., Ng, K.W., 1990. Symmetric Multivariate and Related Distributions. Chapman and Hall, London.
Galea, M., Paula, G.A., Uribe-Opazo, M., 2003. On influence diagnostics in univariate elliptical linear regression models. Statist. Papers 44, 23–45.
Galea, M., Paula, G.A., Cysneiros, F.J.A., 2005. On diagnostic in symmetrical nonlinear models. Statist. Probab. Lett. 73, 459–467.
Lima Neto, E.A., De Carvalho, F.A.T., 2008. Centre and Range method for fitting a linear regression model to symbolic interval data. Comput. Statist. Data Anal. 52 (3), 1500–1515.
Montgomery, D.C., Peck, E.A., Vining, G.G., 2006. Introduction to Linear Regression Analysis. Wiley, USA.
Rousseeuw, P.J., Leroy, A.M., 1987. Robust Regression and Outlier Detection. Wiley, USA.