COMSTA: 6479
Model 3G
pp. 1–33 (col. fig: nil)
Computational Statistics and Data Analysis xx (xxxx) xxx–xxx
Contents lists available at ScienceDirect
Computational Statistics and Data Analysis journal homepage: www.elsevier.com/locate/csda
Constrained center and range joint model for interval-valued symbolic data regression✩ Junpeng Guo *, Peng Hao College of Management and Economics, Tianjin University, Tianjin 300072, China
highlights • • • •
We introduce a constrained center and range joint model to fit linear regression to interval-valued symbolic data. We apply both the center and range of the interval to fit the linear regression model. The model avoids the negative value of the range of the predicted dependent interval variable. We adopt overlapping constraints to improve the model’s prediction accuracy.
article
info
Article history: Received 7 September 2016 Received in revised form 27 May 2017 Accepted 12 June 2017 Available online xxxx Keywords: Interval-valued data Linear regression model Constrained center and range joint model Least squares estimation
a b s t r a c t A constrained center and range joint model to fit linear regression to interval-valued symbolic data is introduced. This new method applies both the center and range of the interval to fit a linear regression model, and avoids the negative value of the range of the predicted dependent interval variable by adding nonnegative constraints. To improve prediction accuracy it adopts overlapping constraints. Using a little algebra, it is constructed as a special case of the least squares with inequality (LSI) problem and is solved with a Matlab routine. The assessment of the proposed prediction method is based on an estimation of the average root mean square error and accuracy rate. In the framework of a Monte Carlo experiment, different data set configurations take into account the rich or lack of error, as well as the slope with respect to the dependent and independent variables. A statistical t-test compares the performance of the new model with that of four previously reported methods. Based on experiment results, it is outlined that the new model has better fitness. An analysis of outliers is performed to determine the effects of outliers on our proposal. The proposed method is illustrated by analyses of data from two real-life case studies to compare its performance with those of the other methods. © 2017 Elsevier B.V. All rights reserved.
1. Introduction The exploration of the relationship between dependent and independent variables is an important task in many contexts, including data analysis, pattern recognition, data mining, and machine learning. Regression analysis is a common method for analyzing the relationship of dependent and independent variables. The traditional regression model, which is mainly applied to traditional point data, is generally used to predict the behavior of a dependent variable Y as a function of other ✩ A software is attached in the appendix. Correspondence to: Tianjin University, No.92, Weijin Road, Nankai District, Tianjin, 300072, China. Fax: +86 22 87402183. E-mail address:
[email protected] (J. Guo).
*
http://dx.doi.org/10.1016/j.csda.2017.06.005 0167-9473/© 2017 Elsevier B.V. All rights reserved.
Please cite this article in press as: Guo, J., Hao, P., Constrained center and range joint model for interval-valued symbolic data regression. Computational Statistics and Data Analysis (2017), http://dx.doi.org/10.1016/j.csda.2017.06.005.
1
2 3 4 5
COMSTA: 6479
2
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
J. Guo, P. Hao / Computational Statistics and Data Analysis xx (xxxx) xxx–xxx
independent variables X that are responsible for the variability of Y . To fit this model, it is necessary to estimate a parameter vector β, based on the vector Y and matrix X . However, most information cannot be represented as real point data, as in the cases of missing, censored, or ambiguous data. As such, the need arises for regression with interval-valued data. Interval-valued data, a kind of symbolic data, implements data dimension reduction by ‘‘data packaging’’, resulting in data analyses with less computational complexity. Symbolic data analysis (SDA) has been discussed by Bock and Diday (2000), Billard and Diday (2003), Billard (2007), Diday and Noirhomme-Fraiture (2008), Brito and Noirhomme-Fraiture (2011), Chiun-How et al. (2014), and Débora et al. (2016), among others. Previous research has addressed interval-valued parametric regression whose parameter β is interval-valued, i.e., the coefficient β is interval-valued whereas variables Y and X are either real or interval-valued. Such instances have been addressed by Tanaka et al. (1982), Hojati et al. (2005), Savic and Pedrycz (1991), Tanaka et al. (1989), Tanaka and Ishibuchi (1991), Sakawa and Yano (1992), Chen and Hsueh (2007, 2009), and Hladík and Černý (2012). In SDA, however, regression model parameters of interval-valued variables are typically real numbers, i.e., the regression coefficient β is real whereas variables Y and X are interval-valued data. Symbolic interval-valued regression involves parametric and nonparametric regression algorithms. Being applicable to both linear and nonlinear regression problems is the valuable advantage of nonparametric regression methods. Based on CRM (Lima Neto and De Carvalho, 2008) and a reverse fitting method (Buja et al., 1989; Friedman and Stuetzle, 1981), Lim (2016) developed the nonparametric additive model (CRAM) to estimate the midpoint and radius of the response variables. By simulation and example verification, this method has the better fitness than both CRM (Lima Neto and De Carvalho, 2008) and SCM (Xu, 2010). Roberta et al. (2014) applies kernel regression method, which employs the Gauss kernel function, to fit interval-valued data. In this paper, we determine the regression method that best obtains the model parameters. Due to their salient performance in nonlinear situations, machine learning methods have also been applied to intervalvalued data regression. Examples include interval multi-layer perceptrons (iMLP) (San Roque et al., 2007), exponential smoothing (Arroyo et al., 2007), the linear autoregressive integrated moving average—nonlinear artificial neural network (ARIMA-ANN) (Maia et al., 2008), Holt-MLP (Maia and De Carvalho, 2011), firefly algorithm—multidimensional support vector regression (FA-MSVR) (Xiong et al., 2014), and the MSVR-vector error correction model (VECM) (Xiong et al., 2015). The common feature of machine learning processes in interval regression is to estimate the upper and lower bounds (respectively, y and y) without assuming any constraint of y ≥ y. As a rough classification, we can identify at least three kinds of linear interval-valued regression algorithms, including least square (LS) (e.g., Billard and Diday, 2000; Domingues et al., 2010; Lima Neto and De Carvalho, 2008, 2010), set arithmetic (e.g., Blanco-Fernandez et al., 2011), and probabilistic assumptions (e.g., Ahn et al., 2012; Xu, 2010). Set arithmetic ensures the existence of Hukuhara distance and adopts it as a criterion for building a fitting model. The probabilistic assumptions method takes into account the inner point distribution feature of an interval in a regression model. Essentially, these two kinds of algorithms are based to some degree on the LS method. With respect to interval regression, since it was first introduced by Billard and Diday (2000), LS algorithms have been studied for a relatively long period of time from Billard and Diday (2000). The authors developed the center method (CM) model and utilized the LS algorithm for the first time to estimate the upper and lower boundaries (respectively,y and y with y ≥ y) of the interval with the same coefficients. Their approach consists of fitting a linear regression model to the midpoint y+y
39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61
(yc = 2 ) of the interval values and applying this model to predict the lower and upper boundaries of the interval value of the dependent variable. However, having the same coefficients of the two models results in the models lacking flexibility and in an inability to characterize real-life situations. The MinMax model (Billard and Diday, 2002) improved the CM model by establishing two models to fit the lower- and upper-bound data series. Compared with end point (EP) of the interval, the expression of the midpoint–radius (MR) separates the uncertainty and the variation tendency, which are represented y−y by the radius (yr = 2 ) and midpoint, respectively, and in many practical cases, MR demonstrates more natural results (Boukezzoula et al., 2011). From this perspective Lima Neto and De Carvalho (2008) proposed the center and range method (CRM) to fit a center and radius model by the LS method into a uniform model. But this model does not mathematically ensure that the radius is greater than zero, which induces the case in which the upper bound may be less than the lower. To solve this problem, Lima Neto and De Carvalho (2010) put forward the constrained center and range method (CCRM) to ensure the rationality of the predicted interval. After a nonnegative transformation of the radius parameters, however, the predicted radius may be very unlikely to recover its true original value (i.e., nonnegative predicted radius parameters are only a sufficient rather than necessary condition for a nonnegative predicted radius). Allowing for negative relationships in the radius model, Giordani (2015) introduced a more flexible linear regression method, Lasso-IR, which uses least absolute shrinkage and selection operator (Lasso) constraints (Tibshirani, 1996) in his proposed model. With an optimization solution provided by Gill et al. (1981), Lasso-IR determines the midpoints and radii parameters through an LS approach to improve prediction accuracy. The objective function is defined as the Euclidian distance between the observed and the estimated intervals, which are represented by the midpoints and radii. This kind of objective function also can be found in Trutschnig et al. (2009), Sinova et al. (2012), and Blanco-Fernandez et al. (2011, 2012). Interval regression methods based on midpoint– radius (MR) (Billard and Diday, 2000, 2002) and end point (EP) (Lima Neto and De Carvalho, 2008, 2010; Giordani, 2015) representation are to some degree applications of the LS approach (e.g., Billard and Diday, 2002; Lima Neto and De Carvalho, 2008) or its improvement, whose instances are radius nonnegative constraints (Lima Neto and De Carvalho, 2010; Giordani, 2015). Please cite this article in press as: Guo, J., Hao, P., Constrained center and range joint model for interval-valued symbolic data regression. Computational Statistics and Data Analysis (2017), http://dx.doi.org/10.1016/j.csda.2017.06.005.
COMSTA: 6479
J. Guo, P. Hao / Computational Statistics and Data Analysis xx (xxxx) xxx–xxx
3
Model M, a method based on set arithmetic different from the LS method, was first proposed by Blanco-Fernandez et al. (2011). Through canonical decomposition, Model M is defined as Y = α midX [1 ± 0] + β sprX [0 ± 1] + γ [1 ± 0] + ε , where the intervals [1 ± 0] and [0 ± 1] are [1, 1] and [−1, 1], respectively, and ε is an interval-valued random error variable such ( aX M + bX S
))
2 3
S 2 M i=1 dθ Yi − aXi + bXi , Y −
4
subject to the condition that Yi − aXiM + bXiS exists, for all i = 1, . . . , n, which ensures the existence of
5
that E(ε|X ) = [−δ, δ ], with δ ≥ 0. The minimization problem of Model M is given by min
(
∑n
1
(
(
)
)
{
{
1 n
Hukuhara distance. LS estimation βˆ = min sˆ0 , max 0,
σˆ sprX ,sprY 2 σˆ sprX
}} , where sˆ0 = min
{
sprYi sprXi
}
: sprXi ̸= 0 , ensures that sprYˆ
is nonnegative and Yˆ is a real interval. As a typical assumptions method, the symbol covariance method (SCM), ( probabilistic ) proposed by Xu (2010), forms a model Y − Y = X − X β + ϵ , where Y and Y are referred to as an interval-valued variable and its midpoint, respectively (analogously for X and X ). Following a uniform distribution, Ahn et al. (2012) randomly select a real sample from the observation interval-valued variables and apply LS to calculate the regression parameter βˆ ∗b . After several repetitions of βˆ ∗ , the average of βˆ ∗b is determined as the regression parameter. This constructed random sample algorithm is referred to as the Monte Carlo method (MCM). We suggest that CCRM (Lima Neto and De Carvalho, 2010) is more rational than the other LS methods. In some cases, however, the nonnegative predictions for independent variables do not require that all of the parameters be nonnegative. Moreover, in many mathematical programming methods, constraint conditions must be more suitable for their objective function. For example, Tanaka et al. (1982) request that the predicted region completely cover the observed region for a certain confidence level, which leads to an oversized predicted region. So, it makes sense to try relaxing the constraint conditions, which only require crossover rather than complete coverage between the observed and predicted regions. In this paper we develop a new linear regression model for interval-valued symbolic data. To make this model more flexible, we do not require all the regression parameters to be nonnegative. Instead, we add constraints to the new model to ensure that the predicted radii of the dependent variable are nonnegative. Moreover, we utilize the optimization process to fit the model without empirically assigning the parameters. To avoid an oversized prediction interval for the dependent variable, we also adopt crossover constraints, rather than require complete coverage, between the observed and predicted regions. The remainder of the paper is organized as follows. In Section 2, we introduce our constrained center and range joint model and rewrite it as a special case of the least squares with inequality (LSI) problem to deal with the model’s optimization. To compare our model results with those of the CCRM (Lima Neto and De Carvalho, 2010) and Lasso-IR (Giordani, 2015) methods, the probabilistic assumptions method MCM (Ahn et al., 2012) and SCM (Xu, 2010), we report the results of our simulation experiment in Section 3. Section 4 performs an analysis of outliers to exhibit the effects of outliers on our proposal. In Section 5, we describe two real-life applications, and we draw our conclusions in Section 6. 2. Constrained center and range joint model
6
7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
31
In this section we propose a new model for interval-valued data regression, and then rewrite it as a special case of the least squares with inequality (LSI) problem to deal with the model’s optimization. Let E = {e1 , . . . , en } be a set of objects that are described by the p + 1 symbolic interval-valued variables Y , X1 , X2 , . . . , Xp . Consider the dependent variable to be Y , represented as Y = (Y c , Y r ) in vector terms, where Y c and Y r are the center and radius vectors, respectively, of Y . (Vectors and matrices will be denoted by boldface throughout.) The interval variable Y can be expressed in the matrix form Yn×2 = [Y c |Y r ]. Suppose the independent interval variables X have p dimensions. Again, let us consider X to be represented as X = (X c , X r ) in vector terms, where X c and X r are the center and radius vectors, ⎡ ⎤
32 33 34 35 36 37 38
11
. respectively, of X . The interval variable X can be expressed in the matrix form Xn×(2p+1) = [1|X c |X r ], where 1n×1 = ⎣ .. ⎦,
⎡xc
11
. Xnc×p = ⎣ ..
xcn1
··· ..
. ···
xc1p
. . .
⎡xr
⎤
···
11
⎦, and Xnr×p = ⎣
xc1p
. . .
..
. ···
xrn1
39
1n
xr1p
⎤
. . .
⎦.
40
xr1p
Suppose that the mapping relationships between variables X and Y [are expressed by Z( = (X), Y ). Then, n observations ] [ ] of E that have a relation Z are ei = (xi , yi ) ∈ E (i = 1, . . . , n), where xi = 1|xci |xri and yi = yci |yri . Let β(2p+1)×2 = βc |βr be
c the real coefficient vectors, which represent the relation between interval variables X and Y , where βc = β0c , β1c , . . . , β2p
(
)T
41 42
,
43
) )T )T ( ( ( r T and β = β0r , β1r , . . . , β2p . Let εc and εr be random errors where εc = ε1c , . . . , εnc and εr = ε1r , . . . , εnr . Then, we
44
develop a multi-interval variable regression model as follows:
45
r
{
Ync×1 Ynr×1
= Xn×(2p+1) β = Xn×(2p+1) β
c (2p+1)×1 r (2p+1)×1
+ε
c n×1 r n×1
+ε
,
(1)
where E (εc ) = 0, Var(εc ) = σ 2 I , E(εr ) = 0, Var (εr ) = σ 2 I , Xn×(2p+1) is the matrix of order n × (2p + 1) of the center and range of the explanatory variables, containing 1′ s in its first column. Please cite this article in press as: Guo, J., Hao, P., Constrained center and range joint model for interval-valued symbolic data regression. Computational Statistics and Data Analysis (2017), http://dx.doi.org/10.1016/j.csda.2017.06.005.
46
47 48
COMSTA: 6479
4
J. Guo, P. Hao / Computational Statistics and Data Analysis xx (xxxx) xxx–xxx
Fig. 1. Relationships between the observed interval Y and the predicted interval Yˆ .
1
Thus, in our proposed method, the sum of the squares of deviations is given by:
(
2
) n ( ∑ ( c )2 ( r )2 ) min εi + εi .
(2)
i=1 3 4 5
6
7 8 9 10 11 12 13
14 15 16 17
18
19 20
21
which is similar to the CRM proposed by Lima Neto and De Carvalho (2008), which fits the center and radius model by the LS method in a uniform model. The drawback of the CRM lies in its inability to mathematically ensure that the radius of the predicted interval-valued dependent variable is greater than zero. As such, we add the following nonnegative constraints: s.t .X βr ≥ 0,
(3)
( ) r T where Fig. 1 denotes an interval variable matrix with n objects and 2p + 1 dimensions, β = β0r , β1r , . . . , β2p . We can c r express the observed interval-valued dependent variable Y as [(Y c − Y r ) , (Y c + Y r )]. Using the parameters β and ) β of ( c r ˆ model (1), we can ( write the)lower bound of the predicted interval-valued dependent variable Y as X β − X β and the upper bound as X βc + X βr . In general, there are no more than six relationships between the observed interval Y and the predicted interval Yˆ , as shown in Fig. 1. r
With the exception of Fig. 1(e) and (f), we find that there is a crossover region. Furthermore, we can conclude that a crossover region exists if: max
((
Y c − Y r , X βc − X βr
) (
))
(( ) ( )) ≤ min Y c + Y r , X βc + X βr .
(4)
which means that the maximum lower bounds of the observed and predicted intervals must be less than their minimum upper bounds. Otherwise, there will be no crossover region, as we see in Fig. 1(e) and (f). First, let us suppose that Y c + Y r > X βc + X βr on the right side of inequation (4), which can be written as follows: max
Y c − Y r , X βc − X βr
((
) (
))
≤ X βc + X βr .
(5)
We assume X βr ≥ 0 to ensure that the predicted interval Yˆ rational. Hence, we have X βc − X βr ≤ X βc + X βr and in Eq. (5) is equivalent to the following: Y c − Y r ≤ X βc + X βr .
(6)
Please cite this article in press as: Guo, J., Hao, P., Constrained center and range joint model for interval-valued symbolic data regression. Computational Statistics and Data Analysis (2017), http://dx.doi.org/10.1016/j.csda.2017.06.005.
COMSTA: 6479
J. Guo, P. Hao / Computational Statistics and Data Analysis xx (xxxx) xxx–xxx
5
Next, we suppose Y c + Y r < X βc + X βr and inequation (4) becomes the following:
1
) ( )) max Y c − Y r , X βc − X βr ≤ Y c + Y r . ((
r
(7) c
r
c
r
In the same way, we assume Y ≥ 0 and obtain Y − Y ≤ Y + Y . Then, inequation (7) becomes the following: X βc − X βr ≤ Y c + Y r .
3
(8)
Certainly, there is a third case whereby Y + Y = X β + X β . Now, we can optionally transfer in Eq. (4) into (5) or (7) and obtain inequation (6) or (8). So, the integrations of inequations (6) and (8) are equivalent to inequation (4). Finally, we combine inequations (6) and (8) and obtain the crossover constraints after doing a little algebra, as shown in inequation (9) below: c
{( s.t. (
Xβ + Xβ c
r
)
c
(
r
c
r
r
− Y −Y ≥0 ( )
(9)
where X βc + X βr denotes the upper bound of the prediction interval and X βc − X βr the lower, and Y c − Y r denotes the lower bound of the observation interval and Y c + Y r the upper. A predicted region that completely covers the observed region was requested by Tanaka et al. (1982), which leads to an oversized prediction region. to)the cases ( c This corresponds ) ( c r r of Fig. 1(c) and (d), where the oversized prediction region constraints are as follows: X β + X β − Y + Y ≥ 0 and ( c ) r c r (Y − Y ) − X β − X β ≥ 0. Oversized constraints exclude situations such as those in Fig. 1(a) and (b) where there is also a crossover region. We relax those constraints, requiring only that there be crossover rather than complete coverage between the observed and predicted regions. By the crossover constraints of inequation (9), we ensure the existence of an overlap between the prediction and observation intervals for every object. Therefore, we express the model as follows:
(
(
4 5 6 7 8
)
Y c + Y r − X βc − X βr ≥ 0,
)
2
)
9
10 11 12 13 14 15 16 17
) ∑ (( )2 ( )2 ) r c εi + εi min n
i=1 ) ( c ) ⎧( c r r ⎨ Xβ + Xβ − Y − Y ≥ 0 ( c ) ( c ) r r s.t. ⎩ Y r + Y − Xβ − Xβ ≥ 0 X β ≥ 0,
(10)
which we call the constrained center and range joint model, abbreviated as CCRJM. The minimization problem can be rewritten as a special case of the least squares with inequality (LSI) problem posed by Lawson and Hanson (1974). After doing a little algebra, we can rewrite the problem (to be solved w.r.t. βc and βr ) as follows: min ∥Y − Z β∥2 , s.t. G β ≥ h where Y =
[ c] Y Yr
,Z =
[
X 0
0 X
]
[ c] β
, β = βr , G =
[−X X 0
19 20 21
(11)
22
. To solve this problem, we use the Matlab routine
23
[Y r − Y c ]
]
−X −X , h −X
18
=
Yr + Yc 0
‘‘lsqlin’’. In the Appendix, we have provided the software of the Matlab routine for computing the solution of this regression model. 3. Simulation study In this section, we share the results of our simulation experiment and recovery performance analyses of the CCRJM, as compared with those of the CCRM and Lasso-IR (Giordani, 2015) methods, the probabilistic assumptions methods MCM (Ahn et al., 2012) and SCM (Xu, 2010). To compare the goodness-of-fit among these five methods, we introduce their validity indexes. Then, in accordance with the symbolic data analysis (SDA) procedure, we construct one of the synthetic intervalvalued sets by data packaging and set different regression coefficient values for the center and range models in the other data set. Lastly, we evaluate these methods with respect to the constructed synthetic interval-valued sets. 3.1. Measurements Regarding the evaluation criteria, we utilize three criterion classes. Root mean square error (RMSE), one of the error indexes for evaluating interval linear regression methods, is frequently employed in linear interval regression. For example, Lima Neto and De Carvalho (2008) calculate the RMSE of the upper and lower bounds of the interval (respectively, RMSEu andRMSEl ), De Carvalho et al. (2006) argue for the use of the RMSEh based on the Hausdorff distance, Bargiela et al. (2007) calculate RMSE based on the minimum, maximum, and median values of a fuzzy numeric, among others (Lim, 2016; Chuang, 2008). The rate index, presented in Hu and He (2007), is also used in Hojati et al. (2005) to calculate the rate of the observed interval in the predicted interval, as well as that of the predicted interval in the observed interval, to measure the degree to which they overlap. Please cite this article in press as: Guo, J., Hao, P., Constrained center and range joint model for interval-valued symbolic data regression. Computational Statistics and Data Analysis (2017), http://dx.doi.org/10.1016/j.csda.2017.06.005.
24 25
26
27 28 29 30 31 32
33
34 35 36 37 38 39 40 41
COMSTA: 6479
6
1 2
3
4
5 6
7
J. Guo, P. Hao / Computational Statistics and Data Analysis xx (xxxx) xxx–xxx
RMSEl and RMSEu ( Lima Neto and De Carvalho, 2008; (Lim, 2016)) and RMSEh (De Carvalho et al., 2006; Chuang, 2008) are defined as follows:
n n 1 ∑ ( 1 ∑ ( ) )2 l l 2 √ RMSEl = yi − yˆ i and RMSEu = √ yui − yˆ ui , n
n
i=1
(12)
i=1
n 1 ∑ (⏐ ⏐ ⏐ ⏐) ⏐yc − yˆ c ⏐ + ⏐yr − yˆ r ⏐ 2 , RMSEh = √ i i i i n
(13)
i=1
where yci = yli + yui /2, yri = yui − yli /2, yˆ ci = yˆ li + yˆ ui /2, yˆ ri = yˆ ui − yˆ li /2. Hu and He (2007) define the accuracy rate as follows:
(
)
(
)
( ) ˆi n w Y ∩Y ∑ i 1 ( ), AR = n ˆ i=1 w Yi ∪ Yi (
(
)
(
)
(14)
)
8
where w Yi ∩ Yˆi indicates the interval width of the elements simultaneously belonging to the ith object’s observed and
9
predicted intervals, while w Yi ∪ Yˆi
(
)
is the width of interval containing all of the elements belonging to the ith object’s
12
observed or predicted intervals. As for the goodness-of-fit, error indexes such as RMSEh , RMSEl and RMSEu has better goodness-of-fit when smaller index values are determined. In contrast, with respect to the accuracy rate, bigger values indicate better goodness-of-fit.
13
3.2. Data set configuration
10 11
14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29 30 31 32
33
34 35 36 37 38 39 40 41 42 43 44 45
We consider two data set configurations. In accordance with the SDA procedure, we construct the synthetic intervalvalued sets by data packaging. We generate these sets depending on the independent variable xs , regression coefficient βs , and the error ε of data points dispersed in the intervals. To consider different regression coefficient values for the center and range models, in the other data set, we set βc for the center model regression coefficients and βr for those of the range model. So, the main difference in these two data set configurations is whether or not they have different regression coefficient values for the center and range models. 3.2.1. Data set configuration 1 Next, we consider three cases with p = 1, p = 3, and p = 6. E = {e1 , . . . , en } is a set of objects that are described by p + 1 symbolic interval-valued variables Y , X1 , X2 , . . . , Xp , where n is set to 45, 150 and 1500, separately. There are n data sets to be considered in each iteration, among which 23 n are partitioned into a learning set and 31 n into a test set. Each data set contains m samples from an interval-valued dependent variable Y and independent variable X . The construction of interval data sets is carried out in the following steps: (s1 ) Suppose the data points dispersed in each interval of the interval-valued independent variable X follow a normal distribution. In each interval, m data points are generated that follow a normal distribution. For example, when p = 3, let us suppose there are 100 data points (m = 100) generated within each interval variable Xi , which are represented as xs1j , xs2j , and xs3j where xsij ∈ ℜ (i = 1, 2, 3; j = 1, . . . , m). (s2 ) Suppose there are linear relationships between the data points dispersed in the intervals of dependent variable Y and independent variable X . For instance, given that p = 3 and m = 100, let us suppose that there exists a linear relationship, as follows: ysj = β0s + β1s xs1j + β2s xs2j + β3s xs3j + εj ,
(15)
( ) where ys ∈ ℜ, βis ∈ ℜ (i = 0, . . . , 3), E εj = 0, and Var(εj ) = σ 2 (j = 1, . . . , 100). We assume that the βis values are
randomly selected from a population that follows a uniform distribution. By this linear relationship, the data points ys are generated. (s3 ) To construct a data set of the interval-valued variables Y and X , we select the minimal and maximal values of every m-generated data point as the lower and upper bounds of the interval, respectively. This procedure ( ) ( ) is the so-called ‘‘data packaging’’. For example, to again consider p = 3 and m = 100, let xli = min xsij and xui = max xsij construct the interval [ l u] ( ) ( ) xi , xi of the interval-valued variable Xi (i = 1, 2, 3; j = 1, . . . , m), and let yl = min ysj and yu = max ysj (j = 1, . . . , m) [ l u] construct the interval y , y of the interval-valued variable Y . (s4 ) We replicate steps (s1 ) to (s3 ) 150 times to generate 150 data sets from which we randomly select 100 data sets to form learning sets, leaving the other 50 data sets as test sets. To compare the of these)methods in different situations, we consider four different data set configurations. ( )T performances ( ( )T T Let βs = β0s , β1s , βs = β0s , β1s , β2s , β3s , and βs = β0s , β1s , β2s , β3s , β4s , β5s , β6s correspond to p = 1, p = 3, and p = 6, Please cite this article in press as: Guo, J., Hao, P., Constrained center and range joint model for interval-valued symbolic data regression. Computational Statistics and Data Analysis (2017), http://dx.doi.org/10.1016/j.csda.2017.06.005.
COMSTA: 6479
J. Guo, P. Hao / Computational Statistics and Data Analysis xx (xxxx) xxx–xxx
7
Table 1 Interval data set configurations, where i = 1, . . . , p, j = 1, . . . , m. C1 C2 C3 C4
βs βs βs βs
∼ U (0.5, 1) ∼ U (0.5, 1) ∼ U (2, 4) ∼ U (2, 4)
xsi xsi xsi xsi
∼ N (e, 20) ∼ N (e, 20) ∼ N (e, 20) ∼ N (e, 20)
εj εj εj εj
∼ N (0, 20) ∼ N (0, 40) ∼ N (0, 20) ∼ N (0, 40)
Fig. 2. Configurations C 1, C 2, C 3 and C 4.
respectively. We take into account βs ∼ U (0.5, 1) and βs ∼ U (2, 4), one of which has larger absolute value coefficients and the other smaller. Then, we assume errors of εj ∼ N (0, 20) and εj ∼ N (0, 40). Let data points xsi (i = 1, . . . , p) follow a normal distribution for n data sets that are expected to exhibit a progressively increasing (or decreasing) tendency. For example, when p = 3 and n = 150, let data points xsi (i = 1, 2, 3) have the expectations of e, where e = 1, . . . , 150 in 150 sequential data sets. To compare the performance of the methods in these situations, Table 1 shows four different configurations for the considered interval data sets. Fig. 2 shows a case in which the p = 1 configurations C 2 and C 4 have a rich variability error between ys and xs , as well as a small linear relationship between variables, while C 1 and C 3 have a low variability error and a rich linear relationship between variables. C 1 and C 2 have a low slope in their linear relationship between variables due to the smaller absolute value coefficients between ys and xs . In contrast, configurations C 3 and C 4 have a rich slope in their linear relationship between variables due to the rich variability error between ys and xs . We show the pseudo code for generating interval variables in Algorithm 1. Algorithm 1 For o = 1 : n ( ) Initialize the linear relationship ysj = β0s + βis xsij + εj , where ys ∈ ℜ, β0s , βis ∈ ℜ (i = 1, . . . , p), E εj = 0, and Var(εj ) = σ 2 (j = 1, . . . , m) Randomly select βis (i = 0, . . . , p) following a uniform distribution and εj (j = 1, . . . , m) following a normal distribution Please cite this article in press as: Guo, J., Hao, P., Constrained center and range joint model for interval-valued symbolic data regression. Computational Statistics and Data Analysis (2017), http://dx.doi.org/10.1016/j.csda.2017.06.005.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
COMSTA: 6479
8
J. Guo, P. Hao / Computational Statistics and Data Analysis xx (xxxx) xxx–xxx
Table 2 Interval data set configurations, where i = 1, . . . , n, j = 1, . . . , p.
βc βc βc βc
C5 C6 C7 C8
1 2 3 4 5 6 7
8 9 10 11 12 13 14 15 16 17 18 19 20
23 24 25
26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
xcj xcj xcj xcj
∼ U (−150, 0) ∼ U (−150, 0) ∼ U (−150, 0) ∼ U (−150, 0)
εic εic εic εic
∼ N (0, 25) ∼ N (0, 50) ∼ N (0, 25) ∼ N (0, 50)
βr βr βr βr
∼ U (2.5, 5) ∼ U (2.5, 5) ∼ U (2.5, 5) ∼ U (2.5, 5)
xrj xrj xrj xrj
∼ U (5, 10) ∼ U (5, 10) ∼ U (5, 10) ∼ U (5, 10)
εir εir εir εir
∼ U (0, 10) ∼ U (0, 20) ∼ U (0, 10) ∼ U (0, 20)
Generate m data points xsij ∈ ℜ following normal distribution in each interval variable Xi and generate the same number of ysj ∈ ℜ (i = 1, . . . , p; j = 1, . . . , m) for the oth data set [ ] [ ] Construct intervals xli , xui (i = 1, . . . , p) and yl , yu , whose lower and upper bounds are, respectively, the minimal and maximal values of the m generated data points in the oth data set End for o Randomly select 32 n objects from n data sets be training sets and 13 n to be test sets End of Algorithm 1 3.2.2. Data set configuration 2 We also consider three cases with p = 1, p = 3, and p = 6. E = {e1 , . . . , en } is a set of objects that are described by the p + 1 symbolic interval-valued variables Y , X1 , X2 , . . . , Xp , where n is set to 45, 150, and 1500, respectively. These n data sets are considered in each iteration, in which 23 n are partitioned into a learning set and the remaining 31 n into a test set. We consider different regression coefficient values for the center and range models and construct the interval data sets by the following ( ′ ) steps: s1 Suppose each random variable xcj and xrj (j = 1, . . . , p) is uniformly distributed; we randomly select n samples of variable xcj and xrj (j = 1, . . . , p) at each iteration and sort the samples of each xcj (j = 1, . . . , p) in ascending (or descending) order. For example, when p = 3 and n = 150, let us suppose there are 150 samples generated for xc1 , xc2 , xc3 and for xr1 , xr2 , xr3 , respectively. We sort samples xc1 , xc2 , and xc3 in ascending order. ( ′) s2 Suppose there are linear relationships between the dependent variable yc and independent variable xc such that c Yn×1 = Xnc×(p+1) βc(p+1)×1 . For instance, given that p = 3 and n = 150, let us suppose that there exists the following linear relationship: c yci = βi0 + βi1c xci1 + βi2c xci2 + βi3c xci3 + εic ,
21 22
∼ U (2, 2.5) ∼ U (2, 2.5) ∼ U (−2.5, −2) ∼ U (−2.5, −2)
(16)
( ) ( ) = 0, Var εic = σ 2 (i = 1, . . . , 150). We assume that vectors where ε follows a normal distribution and E εic βjc ((j =) 0, . . . , 3) are randomly selected 150 times from a uniformly distributed population. c i
s′3 Suppose there are linear relationships between the dependent variable yr and independent variable xr such that = Xnr×(p+1) βr(p+1)×1 . For instance, given that p = 3 and n = 150, suppose there exists the following linear relationship:
Ynr×1
r yri = βi0 + βi1r xri1 + βi2r xri2 + βi3r xri3 + εir ,
(17)
where ε (i = 1, . . . , 150) follow a uniform distribution. We assume vectors β (j = 0, . . . , 3) are randomly selected 150 r times ( ′ )from a uniformly distributed population and sort y in ascending (or descending) order. s4 150 interval-valued data sets are thus generated, from which we randomly select 100 data sets as learning sets, which leaves( the other )T 50 data ( sets )asT test sets. ( )T ( )T ( )T Let βc = β0c , β1c , βr = β0r , β1r ; βc = β0c , β1c , β2c , β3c , βr = β0r , β1r , β2r , β3r and βc = β0c , β1c , β2c , β3c , β4c , β5c , β6c , r i
r j
( )T βr = β0r , β1r , β2r , β3r , β4r , β5r , β6r correspond to p = 1, p = 3, and p = 6, respectively. We take into account βc ∼ U (2, 2.5) and βc ∼ U (−2.5, −2), one of which has positive value coefficients and the other negative. We assume vectors of βr ∼ U (2.5, 5) and errors of εic ∼ N (0, 25), εic ∼ N (0, 50), εir ∼ U (0, 10), εir ∼ U (0, 20). Let the center of the interval variables be xci ∼ U (−150, 0) and the radius of the interval variables be xri ∼ U (5, 10). To compare the performance of these methods in these situations, Table 2 shows four different configurations for the interval data sets we considered. Fig. 3 shows a case in which the p = 1 configurations C 6 and C 8 have a rich variability error between yc and xc , yr and r x , as well as a small linear relationship between variables, whereas C 5 and C 7 have a low variability error and a rich linear relationship between variables. C 5 and C 6 have a positive linear relationship between the center variables due to the positive value coefficients between yc and xc . In contrast, configurations C 7 and C 8 have a negative linear relationship between the center variables due to the negative value coefficients between yc and xc . Algorithm 2 shows the pseudo code for generating interval variables as follows: Algorithm 2 For i = 1 : n Generate a sample of xcij and xrij (j = 1, . . . , p) that follows a uniform distribution for each interval variable Xj for the ith object Randomly select βjc , βjr (j = 1, . . . , p) following a uniform distribution
Please cite this article in press as: Guo, J., Hao, P., Constrained center and range joint model for interval-valued symbolic data regression. Computational Statistics and Data Analysis (2017), http://dx.doi.org/10.1016/j.csda.2017.06.005.
COMSTA: 6479
J. Guo, P. Hao / Computational Statistics and Data Analysis xx (xxxx) xxx–xxx
9
Fig. 3. Configurations C 5, C 6, C 7 and C 8.
c Initialize the linear relationship yci = βi0 + βijc xcij + εic , where εic follows a normal distribution, and yri = βi0r + βijr xrij + εir , r where εi follows a uniform distribution Generate yci and yri for the ith object End for i Respectively sort xcj and yr in ascending (or descending) order [ ] [ ] Construct intervals xlj , xuj (j = 1, . . . , p) and yl , yu , where xlj = xcj − xrj , xuj = xcj + xrj , yl = yc − yr and yu = yc + yr for the ith data set Randomly select 23 n objects from n data sets be training sets and 13 n to be test sets End of Algorithm 2
3.3. Comparison results For comparison, we selected four linear regression methods for which the parameters of the fitted regression models could be achieved: CCRM (Lima Neto and De Carvalho, 2010), which has good fitness, the recently developed Lasso-IR (Giordani, 2015), the probabilistic assumptions method MCM (Ahn et al., 2012) and SCM (Xu, 2010). At each replication, we fit a linear regression model to the training interval data set using the CCRM, Lasso-IR, MCM, SCM and our CCRJM method. We used these fitted regression models to predict the interval values of the dependent variable Y in the test interval data set and calculated their RMSEh , RMSEl , RMSEu , and AR values. We repeated Algorithm 1 and Algorithm 2 75 times, and for each measure, we calculated the average and standard deviation (shown in parentheses) for 75 Monte Carlo simulations. For configurations C 1 to C 8, regarding the CCRJM and CCRM, Lasso-IR, MCM, SCM methods, Tables 3–5 show the overall results of the Monte Carlo experiments when n = 150, n = 45 and n = 1500. The CCRJM method has a higher average prediction performance than the CCRM, Lasso-IR, MCM and SCM method for almost all situations and measures considered. To compare these approaches further, we applied a statistical paired t-test to every 75 matched-pairs samples for each measure and adopted a 1% significance level. For each configurations C 1 to C 8, we calculated the ratio of times that the Please cite this article in press as: Guo, J., Hao, P., Constrained center and range joint model for interval-valued symbolic data regression. Computational Statistics and Data Analysis (2017), http://dx.doi.org/10.1016/j.csda.2017.06.005.
1 2 3 4 5 6 7 8 9
10
11 12 13 14 15 16 17 18 19 20 21 22 23
C6
C5
C4
C3
6
3
1
6
3
1
6
3
1
6
3
1
6
3
1
6
(2.5047)
(6.4628)
(1.8440)
84.5820
75.0720
(5.7581)
(1.5948)
70.6878
(5.3162)
66.3280
60.7225
(2.1370)
57.3427
(5.3577)
(1.2582)
66.1172
(3.7823)
55.4277
49.2183
44.0297
35.9555
(0.8745)
32.4228
(22.4160)
(7.4640)
(2.7549)
440.2842
(10.0378)
75.9338
108.3608
52.3154
(4.2121)
(2.5165)
(4.6175)
54.5530
(20.9983)
(9.2078)
32.3361
446.2600
(9.8861)
75.5119
106.3805
49.5409
(3.8721)
(1.3537)
(4.5660)
48.7065
(4.7542)
22.4787
(3.7499)
(2.9836)
80.5573
38.0757
(2.5484)
(2.1494)
33.9858
(1.9846)
29.1851
25.3395
(4.4115)
24.0503
(4.3217)
(2.3814)
102.8318
36.9171
(2.2486)
(1.3796)
27.7922
24.4249
(0.7526)
(6.5732)
136.1820
(5.1403)
97.9182
(5.3477)
74.2846
(4.7032)
112.2110
(3.6002)
72.6557
(2.5951)
46.5642
(18.9899)
445.6991
(10.1817)
144.8287
(10.2668)
155.5249
(20.6530)
452.1814
(9.2322)
149.2193
(12.3034)
134.4146
(4.5997)
81.2432
(6.4535)
55.3651
(4.6545)
101.9047
(5.2119)
104.0715
(3.2406)
33.8468
(3.3861)
56.1521
MCM
(22.0177)
154.2127
(7.2954)
100.3953
(5.4268)
74.4934
(20.4144)
133.3237
(6.2256)
76.3441
(2.6968)
46.3803
(22.9226)
529.6057
(10.7147)
173.6991
(5.9126)
82.3600
(21.6222)
535.0953
(9.2363)
182.2894
(4.8016)
61.0771
(5.2963)
95.0194
(3.7494)
39.4569
(3.9939)
90.0750
(5.2028)
122.8732
(2.3947)
35.5276
(2.2211)
39.2930
SCM
(6.6862)
64.3144
(5.9436)
58.2071
(5.3561)
57.6924
(5.0285)
44.1024
(3.6741)
35.6555
(2.7916)
32.6089
(5.2966)
56.9117
(3.4490)
41.8548
(2.3989)
29.3008
(3.6624)
32.3810
(2.0138)
21.6394
(0.9303)
11.2973
(2.9872)
31.2114
(2.2900)
26.6825
(1.9662)
23.7339
(1.5559)
13.6994
(0.9977)
10.4580
(0.6029)
7.1458
CCRJM
(7.1332)
63.7831
(6.0327)
57.7606
(5.3718)
52.4277
(5.4163)
45.0571
(3.9897)
36.5562
(2.7925)
28.6703
(7.4463)
58.4887
(4.7449)
39.5765
(2.8520)
24.3335
(8.9567)
64.2580
(4.5414)
41.9190
(1.5493)
18.4579
(3.6843)
29.4946
(2.7068)
22.5519
(2.0663)
18.4806
(4.2217)
32.2605
(2.3170)
20.8429
(0.8611)
9.3086
CCRM
17.0254
Lasso-IR
CCRM
11.6525
RMSEl
RMSEh
(2.9502)
74.4550
(2.0292)
62.9204
(1.5720)
55.1598
(2.7995)
58.9995
(1.6802)
44.1366
(0.9557)
32.1715
(23.1692)
434.8828
(9.0096)
77.4656
(4.0590)
38.1368
(21.3603)
440.6626
(7.8489)
74.6660
(3.5381)
32.7115
(4.9725)
78.0007
(3.0431)
25.3397
(2.4952)
19.2920
(4.5134)
101.4855
(2.4035)
20.1764
(1.4131)
12.4168
Lasso-IR
(9.3872)
102.9528
(7.2195)
74.4921
(6.0873)
58.7727
(7.2873)
89.0366
(5.1748)
55.7550
(3.3973)
35.4300
(24.9445)
373.1780
(9.7017)
79.4534
(10.2009)
112.5079
(24.8728)
376.7395
(12.4933)
82.3121
(8.5254)
91.0647
(5.9920)
62.0698
(5.7408)
40.2478
(4.8558)
88.2834
(6.2287)
85.2318
(1.8167)
17.6258
(3.0320)
44.6201
MCM
(27.7608)
104.2408
(10.2263)
70.4601
(6.1052)
57.1954
(25.7972)
88.2206
(8.5732)
52.2215
(3.6878)
33.7292
(22.4665)
524.3993
(9.9232)
158.3273
(6.6414)
76.0075
(21.6941)
530.7589
(10.3793)
168.8747
(5.9138)
51.9899
(5.6150)
91.4860
(4.1509)
33.0387
(4.1149)
89.0762
(5.1525)
121.1445
(2.2758)
29.3745
(2.3668)
38.2751
SCM
(6.6695)
62.4812
(5.9255)
56.4700
(5.4061)
52.7130
(5.0708)
42.3483
(3.6990)
34.2116
(2.8408)
28.8405
(5.1166)
42.4117
(3.6174)
30.6195
(2.5131)
21.8199
(3.2752)
23.2532
(1.7786)
15.6170
(0.8841)
8.0521
(2.9470)
23.7120
(2.3215)
20.5639
(2.0054)
18.2078
(1.4279)
10.0224
(1.0777)
7.7236
(0.6145)
5.3510
CCRJM
(6.4955)
63.4058
(6.0216)
57.4530
(5.3934)
52.2854
(5.5012)
45.2538
(3.9919)
36.1849
(2.8706)
28.5558
(7.7445)
60.7113
(4.9374)
42.2064
(2.8499)
25.8003
(8.6719)
65.5738
(4.5387)
42.5996
(1.5380)
18.9921
(3.7962)
30.2292
(2.5967)
22.8756
(2.1242)
18.8167
(4.1207)
32.6374
(2.1817)
21.1764
(0.8804)
9.6126
CCRM
RMSEu
(2.2659)
62.6501
(1.9526)
55.8382
(1.5706)
51.8917
(1.5458)
44.5405
(1.1764)
35.1061
(0.8820)
28.5037
(18.5896)
174.6201
(11.0931)
91.1279
(4.4000)
46.2294
(17.3642)
175.6655
(10.9560)
90.4007
(4.4093)
42.7055
(4.4135)
39.3217
(3.1558)
27.3787
(2.0942)
19.9455
(3.8772)
42.0908
(2.5012)
23.0462
(1.5354)
13.9652
Lasso-IR
(9.6632)
104.9814
(7.8858)
73.5241
(6.3947)
58.1359
(6.7517)
89.3302
(4.8383)
56.2132
(3.7913)
36.1879
(26.8303)
398.8467
(11.7074)
135.3640
(12.4009)
131.9855
(29.1381)
406.9433
(11.8527)
139.4930
(14.0359)
115.4248
(5.9821)
70.6853
(7.0053)
45.7248
(5.1940)
89.3742
(7.2628)
92.3856
(3.6202)
32.0289
(3.8282)
48.0367
MCM
(32.3385)
124.7171
(12.3244)
80.8188
(6.8497)
60.7397
(29.5642)
110.6456
(10.7708)
63.6886
(3.8711)
37.6823
(20.9532)
242.9361
(12.7820)
113.0731
(5.7973)
49.6642
(21.4290)
243.7057
(11.0024)
115.6791
(5.0638)
44.9205
(5.7980)
51.4320
(3.7231)
27.8276
(3.5355)
32.6632
(5.3798)
59.1216
(2.9171)
26.4701
(2.0464)
17.3186
SCM
(6.7165)
62.4712
(5.8513)
56.5083
(5.4268)
52.6092
(5.0041)
42.3527
(3.6626)
34.2567
(2.8868)
28.7080
(5.7666)
45.8007
(3.8054)
34.2519
(2.6151)
23.5923
(3.9621)
26.8452
(2.3393)
17.8966
(1.0165)
9.4317
(3.0447)
24.6934
(2.1962)
20.8900
(2.0682)
18.6077
(1.6944)
11.2193
(1.0152)
8.4675
(0.6739)
5.6979
CCRJM
(0.0201)
0.7660
(0.0294)
0.6433
(0.0272)
0.6873
(0.0156)
0.8227
(0.0205)
0.7422
(0.0174)
0.8007
(0.0138)
0.8185
(0.0113)
0.8366
(0.0100)
0.8622
(0.0219)
0.7113
(0.0187)
0.7288
(0.0130)
0.7815
(0.0122)
0.8385
(0.0097)
0.8608
(0.0087)
0.8728
(0.0227)
0.6957
(0.0198)
0.7285
(0.0110)
0.8127
CCRM
AR
(0.0081)
0.7396
(0.0099)
0.6170
(0.0134)
0.3751
(0.0070)
0.7892
(0.0081)
0.7061
(0.0100)
0.5138
(0.0174)
0.6417
(0.0113)
0.8290
(0.0104)
0.8574
(0.0165)
0.6362
(0.0130)
0.8271
(0.0098)
0.8588
(0.0136)
0.7530
(0.0100)
0.8564
(0.0090)
0.8728
(0.0137)
0.6706
(0.0113)
0.8411
(0.0091)
0.8623
Lasso-IR
(0.0129)
0.5273
(0.0216)
0.4358
(0.0280)
0.2145
(0.0131)
0.5601
(0.0168)
0.5071
(0.0254)
0.3311
(0.0212)
0.5721
(0.0133)
0.7886
(0.0234)
0.5070
(0.0209)
0.5646
(0.0129)
0.7763
(0.0286)
0.5545
(0.0175)
0.7250
(0.0297)
0.7453
(0.0225)
0.2357
(0.0218)
0.6114
(0.0136)
0.8172
(0.0271)
0.3882
MCM
CCRJM
(0.0234)
0.7705
(0.0325)
0.6484
(0.0272)
0.6858
(0.0186)
0.8331
(0.0241)
0.7544
(0.0175)
0.7996
(0.0103)
0.8622
(0.0097)
0.8687
(0.0090)
0.8746
(0.0118)
0.8726
(0.0103)
0.8806
(0.0081)
0.8931
(0.0106)
0.8673
(0.0086)
0.8722
(0.0086)
0.8743
(0.0120)
0.8815
(0.0097)
0.8839
(0.0081)
0.8881
(continued on next page)
(0.0459)
0.4964
(0.0276)
0.4290
(0.0273)
0.2185
(0.0435)
0.5306
(0.0255)
0.4979
(0.0255)
0.3354
(0.0155)
0.5616
(0.0141)
0.7354
(0.0144)
0.7802
(0.0159)
0.5539
(0.0137)
0.7159
(0.0129)
0.8215
(0.0153)
0.7024
(0.0120)
0.8330
(0.0143)
0.5227
(0.0136)
0.5948
(0.0120)
0.8010
(0.0154)
0.6720
SCM
10
C2
1
C1
3
p
Config.
Comparison of CCRM, Lasso-IR, MCM, SCM and CCRJM methods with different data set configurations, with the average and standard deviation values in parentheses(n = 150).
Table 3
COMSTA: 6479
J. Guo, P. Hao / Computational Statistics and Data Analysis xx (xxxx) xxx–xxx
Please cite this article in press as: Guo, J., Hao, P., Constrained center and range joint model for interval-valued symbolic data regression. Computational Statistics and Data Analysis (2017), http://dx.doi.org/10.1016/j.csda.2017.06.005.
C8
1
C7
6
3
1
6
3
p
Config.
SCM
CCRJM
(11.8243)
(6.6946)
(5.6190)
157.1490
75.4456
(5.6932)
(1.9210)
107.4290
(5.2810)
66.0581
71.6886
(11.6592)
57.4326
(5.1680)
(5.8150)
156.7741
(4.1386)
55.1648
107.6673
43.9477
71.7068
(1.9992)
32.8706
(3.0931)
(5.6730)
134.6987
(6.0071)
98.0635
(5.5068)
74.9102
(4.3636)
111.7008
(3.4438)
72.8985
(2.6575)
46.4818
(39.2722)
186.1453
(12.2886)
108.3258
(5.4157)
74.6512
(37.3119)
166.5156
(11.4516)
84.8753
(3.0159)
47.1709
(6.9418)
64.7956
(5.8092)
58.1140
(5.3166)
57.7519
(4.9940)
43.9540
(3.8293)
35.7575
(3.1046)
33.0617
(6.7847)
63.7507
(5.8715)
57.4275
(5.4037)
52.4639
(4.9748)
45.1963
(4.3330)
36.4412
(2.9863)
29.0561
RMSEl
MCM
CCRM
Lasso-IR
RMSEh
CCRM
Table 3 (continued)
(12.4658)
140.8672
(6.1384)
96.4713
(1.9564)
64.4345
(12.1967)
140.1477
(6.4662)
96.6182
(2.1248)
64.3792
Lasso-IR
(9.0767)
103.8403
(7.8417)
74.2176
(6.3120)
59.0095
(6.6515)
89.9733
(5.2303)
56.2389
(3.4233)
35.9186
MCM
(20.3514)
96.0589
(8.8748)
69.7375
(6.2075)
57.5686
(18.1503)
81.0833
(6.9915)
50.0725
(3.4612)
35.2750
SCM
(6.9738)
62.9262
(5.6294)
56.4244
(5.4216)
52.7787
(4.9594)
42.1708
(3.8050)
34.3530
(3.0072)
29.2224
CCRJM
(7.2770)
64.1491
(5.9201)
57.4625
(5.1669)
52.3920
(5.8868)
44.8061
(4.1845)
36.3329
(3.1489)
29.0599
CCRM
RMSEu
(4.3038)
97.0096
(2.4796)
66.9478
(1.7727)
52.9303
(4.4125)
97.0327
(2.2936)
67.2525
(1.6832)
53.1420
Lasso-IR
(10.1828)
101.7990
(8.4469)
73.7387
(6.2417)
59.1209
(6.7163)
87.8395
(5.3179)
56.1515
(3.3471)
35.5284
MCM
(48.7097)
174.2781
(18.6471)
92.6412
(6.9156)
59.8438
(43.6019)
158.9903
(15.6817)
76.7755
(4.4016)
37.2710
SCM
(6.9310)
62.9445
(5.5836)
56.4020
(5.1979)
52.6417
(5.0146)
42.2045
(3.8296)
34.3481
(3.1605)
29.2390
CCRJM
AR
(0.0206)
0.7653
(0.0288)
0.6454
(0.0284)
0.6846
(0.0150)
0.8234
(0.0222)
0.7418
(0.0194)
0.7979
CCRM
(0.0366)
0.5405
(0.0291)
0.4416
(0.0141)
0.2752
(0.0361)
0.5423
(0.0285)
0.4399
(0.0140)
0.2759
Lasso-IR
(0.0130)
0.5305
(0.0217)
0.4373
(0.0268)
0.2098
(0.0118)
0.5614
(0.0151)
0.5067
(0.0226)
0.3309
MCM
(0.0807)
0.4160
(0.0458)
0.3898
(0.0277)
0.2100
(0.0779)
0.4449
(0.0475)
0.4506
(0.0274)
0.3218
SCM
(0.0242)
0.7686
(0.0309)
0.6508
(0.0285)
0.6832
(0.0182)
0.8337
(0.0262)
0.7536
(0.0194)
0.7970
CCRJM
COMSTA: 6479
J. Guo, P. Hao / Computational Statistics and Data Analysis xx (xxxx) xxx–xxx 11
Please cite this article in press as: Guo, J., Hao, P., Constrained center and range joint model for interval-valued symbolic data regression. Computational Statistics and Data Analysis (2017), http://dx.doi.org/10.1016/j.csda.2017.06.005.
C6
C5
C4
C3
6
3
1
6
3
1
6
3
1
6
3
1
6
3
1
6
(12.8340)
(13.1577)
(11.0311)
86.6701
80.3750
(11.1788)
(10.0666)
71.0764
68.0337
(10.0046)
(9.1840)
62.1907
58.3851
(10.6076)
(8.6135)
63.8613
(7.4945)
59.9036
49.9375
(5.3455)
45.6142
36.3469
33.2518
(5.3252)
(6.6244)
(13.2995)
(4.5942)
61.1025
89.9734
(8.4877)
(3.8963)
30.9185
60.8316
(5.7260)
(7.3438)
25.3691
36.5488
(13.8396)
(3.2615)
75.1194
87.4896
(8.7864)
(2.1284)
21.9559
58.5173
(3.4897)
(6.6163)
15.2371
25.6900
(5.7297)
(4.9317)
60.4007
36.5883
(4.4030)
(3.0500)
31.2334
29.2707
(3.6882)
(6.0792)
24.8872
24.8855
(4.4119)
(3.7480)
74.1382
29.6236
(3.1046)
(2.4127)
22.2742
(2.2492)
21.5875
(11.5115)
140.2059
(9.6404)
99.8185
(9.1286)
73.5584
(9.0918)
116.4049
(6.9928)
75.0486
(5.6399)
46.8364
(32.1604)
225.9521
(40.8694)
198.9112
(14.3926)
199.9267
(35.8245)
235.6720
(40.9371)
191.4319
(14.6380)
177.8016
(14.8430)
58.9184
(15.7074)
101.6412
(6.5815)
114.4579
(7.8576)
52.1162
(10.8872)
64.6834
(5.1688)
69.0241
MCM
SCM
(39.4858)
171.9535
(15.3554)
105.7755
(10.3394)
75.9886
(44.0385)
156.2948
(10.9234)
79.1185
(6.1944)
47.9851
(31.2728)
326.5463
(14.3354)
130.5535
(6.9961)
47.3668
(29.5194)
332.9651
(14.3855)
133.2186
(5.2709)
35.9383
(6.1540)
64.0870
(4.7114)
31.1901
(5.2877)
49.7938
(7.9981)
78.2543
(3.9274)
29.1638
(3.2898)
21.8246
(16.4109)
78.5645
(11.4853)
62.5313
(10.5089)
59.6690
(12.3637)
54.7272
(8.1695)
38.9964
(5.4530)
33.6918
(19.1198)
108.5186
(9.6368)
65.7492
(5.9357)
37.0852
(17.5729)
104.9094
(10.4166)
62.6602
(3.6316)
26.1416
(7.4739)
43.5336
(4.8394)
31.5583
(3.8354)
25.5139
(6.3333)
35.4681
(3.3959)
23.2048
(2.2760)
15.2385
CCRJM
(13.3597)
67.5998
(11.3834)
59.0486
(10.4770)
53.2174
(10.8908)
48.5272
(7.7865)
37.4714
(5.3574)
29.1514
(12.2125)
68.0849
(8.6513)
46.0964
(5.2451)
26.8921
(13.5023)
66.2352
(8.5345)
43.7410
(3.6303)
17.9291
(5.8544)
28.0185
(4.6398)
22.2311
(3.7279)
19.0431
(4.6731)
22.1656
(2.9344)
15.8701
(2.0645)
11.0258
CCRM
Lasso-IR
15.5607
CCRM
14.9875
RMSEl
RMSEh Lasso-IR
(13.0997)
75.0700
(12.0951)
63.3786
(10.2266)
55.8553
(9.6258)
54.8870
(9.0357)
44.0651
(5.4571)
32.2638
(7.9259)
51.3169
(5.0661)
23.4204
(3.9361)
19.1551
(8.5791)
64.3551
(3.1894)
16.3289
(2.3928)
11.6484
(7.9202)
50.6951
(4.5889)
23.9413
(3.5092)
18.8532
(8.2145)
63.3428
(3.8671)
16.6481
(2.4798)
11.4370
MCM
(16.7499)
107.3566
(14.7712)
74.6180
(10.6692)
57.7094
(13.3620)
89.8255
(9.5929)
57.8379
(7.0368)
36.1921
(35.1247)
165.3957
(37.8297)
147.1614
(18.1856)
166.4203
(39.8127)
173.9079
(38.7037)
135.0899
(17.0588)
145.0027
(11.8843)
43.3166
(14.4223)
83.9764
(7.8477)
102.0912
(7.5065)
38.9770
(11.0675)
48.1908
(6.0574)
60.0654
SCM
(41.4430)
113.3119
(18.7330)
75.5127
(11.9917)
58.7460
(45.9469)
102.6611
(13.8561)
55.5703
(6.5447)
35.0418
(31.3145)
316.2515
(14.7116)
117.5614
(6.7545)
34.8779
(30.4903)
323.6522
(15.1106)
123.4199
(4.7539)
22.0494
(6.9900)
58.7242
(4.7347)
23.4976
(5.9223)
46.8108
(7.9362)
74.6684
(3.6466)
24.2651
(3.7096)
19.2272
(16.4716)
75.5404
(11.5001)
60.4012
(11.0219)
54.4144
(12.1526)
51.8786
(8.1183)
37.0703
(5.5199)
29.5498
(17.9423)
81.0458
(10.0171)
49.5841
(5.4235)
27.3121
(17.9505)
78.1345
(9.7504)
46.8663
(3.7183)
18.1146
(7.4179)
33.0322
(4.9323)
23.7533
(3.8303)
19.5431
(6.2414)
26.5008
(3.3588)
16.7462
(2.0974)
11.1461
CCRJM
(13.5406)
67.9204
(11.0658)
58.9232
(9.9325)
53.1987
(10.5743)
48.9493
(7.9026)
37.6106
(5.3630)
29.3408
(14.4740)
72.2527
(9.6770)
48.3374
(5.8736)
29.4070
(14.7474)
69.8605
(9.2826)
46.6552
(3.8124)
21.5543
(5.6600)
28.5264
(4.4837)
22.9589
(3.8141)
19.3555
(4.5179)
23.6414
(3.3027)
17.5073
(2.3913)
12.0627
CCRM
RMSEu
(13.1957)
68.7655
(10.4318)
56.7635
(10.0643)
52.9671
(9.9890)
47.8163
(8.4892)
37.2840
(5.6312)
28.8785
(8.2260)
42.4483
(4.7009)
24.2607
(3.9663)
19.8274
(8.7874)
50.0572
(3.5277)
17.5100
(2.2109)
11.9155
(6.9564)
41.8552
(4.6074)
24.1743
(3.6533)
19.4069
(8.5417)
49.4272
(3.5744)
17.8041
(2.3020)
12.4364
Lasso-IR
MCM
(16.5033)
104.4331
(13.7558)
74.4188
(11.0447)
57.5463
(12.9487)
92.1692
(9.5586)
57.0180
(7.1344)
35.3425
(34.2165)
182.5007
(37.9090)
148.8663
(18.3327)
168.5061
(36.0893)
189.7382
(37.1391)
146.4477
(18.7136)
148.2667
(12.8967)
46.5501
(15.4158)
83.7527
(7.5275)
102.2884
(7.8664)
42.1323
(10.8154)
50.7476
(6.0509)
59.3417
SCM
(51.9943)
140.2364
(22.1266)
83.5893
(12.3612)
61.1988
(52.2908)
126.6168
(15.3375)
63.6969
(8.0545)
38.8945
(36.7862)
204.5879
(19.4871)
88.4118
(7.9622)
38.0291
(39.7224)
206.3252
(17.6841)
88.6152
(5.8455)
31.9530
(7.2910)
42.6753
(4.7296)
24.5734
(5.7525)
28.3192
(10.0152)
51.1037
(4.8066)
21.4191
(2.7084)
14.0905
CCRJM
(16.4525)
75.6843
(11.2931)
60.5082
(10.3552)
54.3958
(12.4197)
52.1554
(8.3063)
37.1512
(5.4960)
29.6981
(19.6324)
87.0402
(10.9515)
52.3330
(6.0481)
29.8827
(18.2994)
84.2803
(10.9115)
49.8418
(3.9123)
22.0272
(7.3171)
34.2058
(5.1679)
24.8787
(3.9451)
19.7947
(6.2683)
28.3111
(3.6561)
19.0526
(2.4472)
12.3008
(0.0390)
0.7520
(0.0550)
0.6339
(0.0499)
0.6806
(0.0323)
0.8091
(0.0402)
0.7326
(0.0334)
0.7952
(0.0199)
0.8458
(0.0182)
0.8580
(0.0166)
0.8740
(0.0221)
0.8453
(0.0185)
0.8556
(0.0137)
0.8936
(0.0203)
0.8537
(0.0177)
0.8634
(0.0173)
0.8697
(0.0207)
0.8484
(0.0163)
0.8584
(0.0167)
0.8675
CCRM
AR
(0.0377)
0.7316
(0.0577)
0.6131
(0.0763)
0.3654
(0.0296)
0.7931
(0.0474)
0.7028
(0.0672)
0.5100
(0.0235)
0.7773
(0.0189)
0.8573
(0.0156)
0.8677
(0.0279)
0.6842
(0.0187)
0.8561
(0.0182)
0.8663
(0.0217)
0.7794
(0.0176)
0.8568
(0.0139)
0.8681
(0.0252)
0.6898
(0.0232)
0.8540
(0.0181)
0.8644
Lasso-IR
MCM
(0.0272)
0.5209
(0.0377)
0.4297
(0.0449)
0.2095
(0.0253)
0.5535
(0.0326)
0.4985
(0.0510)
0.3291
(0.0309)
0.7244
(0.0908)
0.5896
(0.0514)
0.1611
(0.0379)
0.7080
(0.0930)
0.5987
(0.0602)
0.1892
(0.0644)
0.7588
(0.1028)
0.3541
(0.0414)
0.0684
(0.0335)
0.7505
(0.0949)
0.4972
(0.0580)
0.1037
CCRJM
(0.0509)
0.7292
(0.0610)
0.6302
(0.0531)
0.6746
(0.0437)
0.7977
(0.0536)
0.7359
(0.0349)
0.7923
(0.0272)
0.8182
(0.0206)
0.8481
(0.0174)
0.8719
(0.0270)
0.8180
(0.0220)
0.8463
(0.0138)
0.8921
(0.0261)
0.8279
(0.0203)
0.8534
(0.0174)
0.8669
(0.0265)
0.8211
(0.0183)
0.8489
(0.0172)
0.8659
(continued on next page)
(0.0773)
0.4644
(0.0553)
0.4088
(0.0501)
0.2085
(0.0880)
0.4892
(0.0459)
0.4837
(0.0507)
0.3210
(0.0360)
0.6063
(0.0284)
0.7581
(0.0185)
0.8578
(0.0359)
0.5957
(0.0280)
0.7407
(0.0160)
0.8826
(0.0251)
0.7507
(0.0191)
0.8564
(0.0270)
0.7328
(0.0362)
0.6389
(0.0256)
0.8103
(0.0263)
0.8044
SCM
12
C2
1
C1
3
p
Config.
Comparison of CCRM, Lasso-IR, MCM, SCM and CCRJM methods with different data set configurations, with the average and standard deviation values in parentheses(n = 45).
Table 4
COMSTA: 6479
J. Guo, P. Hao / Computational Statistics and Data Analysis xx (xxxx) xxx–xxx
Please cite this article in press as: Guo, J., Hao, P., Constrained center and range joint model for interval-valued symbolic data regression. Computational Statistics and Data Analysis (2017), http://dx.doi.org/10.1016/j.csda.2017.06.005.
C8
1
C7
6
3
1
6
3
p
Config. CCRJM
(23.1100)
(14.2907)
(17.4229)
131.5732
81.6436
(11.2063)
(9.6362)
99.0289
68.6250
(8.9783)
(26.3577)
71.3713
57.6778
(10.3344)
(16.1625)
131.9825
(7.6687)
59.5380
99.4605
(9.9271)
45.5647
72.0890
32.7766
(5.1527)
(11.4091)
140.1921
(10.5247)
99.4273
(10.3203)
75.2778
(8.8585)
114.0465
(6.5116)
74.7462
(5.7134)
47.8229
(78.1002)
223.4658
(26.7351)
115.9793
(10.5153)
76.2740
(80.7809)
204.2082
(18.7274)
88.7597
(6.0741)
47.8992
(17.3029)
80.1790
(12.5734)
64.1921
(9.2439)
58.7018
(11.2909)
54.6916
(7.8068)
38.9708
(5.2425)
33.3078
(14.9811)
70.0992
(11.2908)
59.1535
(8.8165)
52.6564
(10.6424)
48.6420
(7.6180)
37.2610
(5.2116)
28.6557
CCRM
SCM
RMSEl
MCM
CCRM
Lasso-IR
RMSEh
Table 4 (continued)
(25.1039)
117.1058
(18.0353)
88.5151
(9.8626)
64.1201
(27.4939)
117.1451
(16.7236)
89.8357
(10.0215)
64.2947
Lasso-IR
(18.3841)
105.1358
(14.4062)
74.9038
(10.5941)
59.9034
(14.0047)
89.5022
(10.0181)
56.8114
(7.1434)
37.4010
MCM
(51.3489)
120.8160
(22.6578)
76.2079
(11.1159)
58.5904
(51.1289)
104.3747
(12.1854)
53.5577
(6.2120)
35.3796
SCM
(17.2253)
77.2362
(12.5652)
62.0743
(9.0088)
53.6049
(11.1922)
51.9043
(7.6753)
37.0014
(5.2712)
29.1097
CCRJM
(14.3371)
68.3638
(11.4434)
59.5433
(9.1611)
52.4218
(10.5274)
48.3548
(7.8996)
37.9308
(5.3987)
28.8962
CCRM
RMSEu
(18.9709)
91.9643
(13.5507)
67.7112
(9.4375)
52.9698
(20.1570)
92.5161
(12.6565)
65.7547
(9.5513)
54.1967
Lasso-IR
(18.7010)
105.6165
(13.9238)
74.3700
(11.3823)
58.0849
(13.5775)
88.9576
(9.5460)
56.6613
(6.5084)
35.6674
MCM
(89.1897)
207.8750
(33.9217)
97.3332
(12.8578)
61.3318
(88.6220)
194.3549
(24.9247)
78.2671
(8.5100)
38.0269
SCM
(17.1961)
77.2975
(12.6183)
62.0486
(9.3985)
53.3040
(11.2617)
51.8353
(7.9161)
37.2169
(5.4659)
29.3643
CCRJM
(0.0409)
0.7465
(0.0576)
0.6296
(0.0485)
0.6818
(0.0310)
0.8093
(0.0416)
0.7323
(0.0330)
0.7979
CCRM
AR
(0.0599)
0.6203
(0.0789)
0.4882
(0.0751)
0.2703
(0.0683)
0.6192
(0.0766)
0.4865
(0.0798)
0.2697
Lasso-IR
(0.0277)
0.5235
(0.0418)
0.4295
(0.0552)
0.2066
(0.0242)
0.5597
(0.0300)
0.5017
(0.0512)
0.3209
MCM
(0.1295)
0.3573
(0.0807)
0.3688
(0.0500)
0.2062
(0.1355)
0.3852
(0.0770)
0.4385
(0.0532)
0.3165
SCM
(0.0527)
0.7256
(0.0677)
0.6190
(0.0485)
0.6768
(0.0400)
0.7974
(0.0498)
0.7334
(0.0341)
0.7946
CCRJM
COMSTA: 6479
J. Guo, P. Hao / Computational Statistics and Data Analysis xx (xxxx) xxx–xxx 13
Please cite this article in press as: Guo, J., Hao, P., Constrained center and range joint model for interval-valued symbolic data regression. Computational Statistics and Data Analysis (2017), http://dx.doi.org/10.1016/j.csda.2017.06.005.
C6
C5
C4
C3
6
3
1
6
3
1
6
3
1
6
3
1
6
3
1
6
(2.5841)
(1.9946)
(1.9438)
84.4117
73.2137
(1.6466)
(1.4799)
70.8669
(1.4827)
64.7649
60.9127
(2.2772)
57.0478
(1.4795)
(1.2474)
66.0160
(1.1403)
53.7811
49.1558
43.2872
35.9550
(0.9186)
32.4175
(0.9383)
(12.9817)
(83.0625)
(5.2761)
109.8082
743.5641
(38.1079)
(1.6986)
72.9192
(4.5589)
502.2342
37.5564
(12.2049)
213.6823
(77.0163)
(5.7420)
111.2239
747.5628
(38.7289)
(1.4054)
74.2861
(4.6638)
502.9085
33.8213
(12.8300)
212.9882
(41.8692)
(5.4245)
108.8336
375.6684
(20.3611)
(1.7048)
72.6064
(2.4057)
250.1816
37.6428
(11.7836)
106.3337
(39.2477)
(5.9280)
110.6917
380.7035
(19.8911)
(1.2530)
73.6542
(2.3408)
252.8973
(3.4716)
134.3061
(3.1388)
98.0152
(2.8810)
73.7637
(2.4062)
111.1019
(1.9162)
72.4965
(1.7046)
46.5742
(49.9435)
454.3499
(39.6140)
360.8696
(10.6296)
245.8860
(47.1470)
448.0094
(32.6744)
353.7047
(12.0770)
227.7970
(13.7985)
111.1620
(9.4050)
113.3211
(2.9799)
114.6753
(14.5798)
113.9611
(9.1509)
92.3254
(2.5506)
75.1255
MCM
(6.1462)
139.5802
(2.1520)
98.4508
(1.5610)
73.8211
(6.2966)
118.2240
(1.6697)
73.2863
(0.9057)
46.1952
(46.9294)
493.2115
(41.8192)
354.4293
(9.2926)
239.5897
(50.9381)
498.5227
(36.5415)
355.7128
(9.0909)
219.6875
(14.3885)
121.1976
(8.6007)
113.2953
(3.1914)
118.9468
(12.2050)
120.7599
(9.4778)
92.8746
(2.4989)
75.4871
SCM
CCRJM
(1.9488)
60.6231
(4.2544)
59.6397
(2.1038)
57.5730
(1.4136)
41.1554
(1.1169)
34.5043
(0.9372)
32.4344
(11.3927)
145.6424
(7.2813)
102.5417
(1.9032)
51.0721
(11.1232)
141.8052
(7.0925)
97.7705
(1.7623)
44.2611
(4.9692)
61.9745
(3.6565)
49.8124
(1.0327)
31.8828
(3.9524)
50.9146
(2.8345)
37.5873
(0.8392)
20.8469
(2.1262)
62.0126
(1.7825)
56.2377
(1.5389)
52.0877
(1.5835)
43.8962
(1.3065)
35.8548
(0.9436)
28.6503
(80.8543)
691.1677
(37.2878)
465.0039
(4.6466)
194.7269
(75.1278)
696.6802
(37.5344)
467.2135
(4.7375)
196.4501
(40.7983)
354.0366
(19.8198)
232.8107
(2.5122)
95.1790
(38.5161)
362.4457
(19.3696)
239.4831
(2.3202)
98.8436
CCRM
33.6091
Lasso-IR
CCRM
106.5634
RMSEl
RMSEh Lasso-IR
(3.0904)
74.3205
(2.1939)
63.2015
(1.4766)
55.2739
(2.9421)
58.8975
(1.6711)
44.0543
(0.9632)
32.1594
(9.3630)
87.7357
(5.0012)
59.4240
(1.6060)
29.1942
(8.3383)
95.9907
(5.6459)
62.8460
(1.3925)
27.5926
(8.7206)
86.9300
(4.8993)
59.1979
(1.7039)
29.0226
(8.0401)
96.1012
(5.6357)
62.5427
(1.4057)
27.4740
(5.1037)
103.1287
(4.1221)
74.4761
(3.5024)
57.9736
(3.8232)
89.1820
(2.8244)
55.7760
(2.2095)
35.6629
(47.6053)
374.1652
(40.9507)
301.2752
(9.6405)
211.0870
(47.2904)
372.2977
(33.9114)
294.2398
(9.5334)
195.2533
(12.7781)
89.2101
(9.3288)
96.0010
(3.1810)
101.8181
(13.7092)
93.3952
(9.1882)
76.6110
(2.2024)
65.3274
MCM
(11.0323)
96.1926
(4.1168)
70.2740
(1.8396)
56.5861
(11.3292)
81.3514
(3.6070)
51.5751
(1.0818)
33.6690
(45.5650)
409.6679
(43.3517)
327.8444
(9.6715)
232.5826
(43.9981)
417.0928
(39.4118)
329.1582
(9.8292)
211.8936
(14.1113)
101.4218
(8.9762)
108.3934
(3.4194)
116.7268
(11.9959)
100.6137
(10.0487)
86.8597
(2.6253)
73.8348
SCM
(1.9145)
59.0453
(2.4343)
55.4180
(1.8036)
52.2830
(1.3915)
39.6928
(1.1185)
33.2632
(0.9428)
28.6649
(10.1497)
104.4313
(6.8251)
73.0048
(2.0625)
36.1772
(9.2585)
101.1890
(5.9834)
68.0571
(1.8815)
30.1495
(4.5543)
44.6305
(3.6843)
36.6427
(1.1328)
23.9613
(2.9262)
34.4528
(2.3500)
25.4810
(0.8722)
14.7073
CCRJM
(2.1132)
62.0858
(1.7558)
56.3209
(1.5051)
52.1319
(1.6948)
43.9116
(1.2331)
35.8314
(0.9633)
28.6262
(81.1543)
694.2658
(36.8705)
467.5257
(4.8762)
196.1003
(75.2511)
698.9551
(37.6478)
469.8927
(4.9128)
197.6990
(41.0662)
354.8226
(19.6911)
233.5568
(2.5689)
95.6230
(38.5667)
363.4547
(19.6519)
240.1692
(2.4484)
99.4245
CCRM
RMSEu Lasso-IR
(2.1119)
62.6365
(1.8451)
55.9659
(1.4497)
52.0661
(1.5933)
44.5592
(1.2446)
35.1087
(0.8990)
28.5025
(13.6461)
97.6685
(5.0437)
62.4005
(1.8101)
30.6021
(14.7503)
97.3395
(5.5586)
65.6022
(1.5855)
28.5939
(14.1242)
96.6381
(5.4850)
61.9856
(1.9159)
30.7306
(14.2210)
96.2344
(5.7933)
64.9669
(1.3891)
28.4881
MCM
(5.3527)
102.5999
(3.7424)
73.9206
(3.4113)
58.4218
(3.6673)
88.5878
(2.8252)
56.4454
(2.1752)
36.0171
(49.1858)
419.6637
(39.3711)
337.7637
(12.6661)
231.8267
(44.8628)
412.8664
(32.2630)
332.1612
(14.1057)
216.8248
(13.8261)
99.7225
(9.8234)
102.2018
(3.7259)
103.5005
(14.2809)
104.3691
(9.5106)
85.7933
(3.2060)
69.2097
SCM
(13.9898)
116.7248
(5.0254)
79.0017
(1.9025)
60.2646
(12.9551)
103.7499
(4.1587)
61.3404
(1.2015)
37.6032
(43.3584)
423.5278
(35.1315)
291.8640
(9.5427)
183.4235
(49.3720)
428.5158
(31.0139)
295.2645
(8.8895)
176.1872
(12.2469)
100.9743
(7.3447)
80.7155
(3.4095)
77.4904
(11.1665)
103.3447
(7.8883)
74.0277
(2.6082)
52.4982
(1.9445)
59.0368
(2.4503)
55.4984
(1.7562)
52.3956
(1.3880)
39.7165
(1.1029)
33.2610
(0.9632)
28.6430
(14.1437)
121.6440
(8.4149)
86.1894
(2.1526)
43.0086
(13.7021)
118.9935
(8.5311)
83.2804
(2.1215)
38.2029
(6.2768)
51.6658
(4.2005)
40.9350
(1.1754)
25.6740
(4.7740)
44.1285
(3.5193)
32.4646
(0.9421)
17.5731
CCRJM
(0.0058)
0.7720
(0.0086)
0.6512
(0.0082)
0.6885
(0.0044)
0.8278
(0.0064)
0.7461
(0.0059)
0.8009
(0.0113)
0.6346
(0.0094)
0.6409
(0.0057)
0.6862
(0.0112)
0.6325
(0.0094)
0.6369
(0.0067)
0.6722
(0.0105)
0.6254
(0.0101)
0.6413
(0.0056)
0.7138
(0.0095)
0.6159
(0.0090)
0.6219
(0.0063)
0.6666
CCRM
AR
(0.0080)
0.7399
(0.0102)
0.6158
(0.0123)
0.3738
(0.0081)
0.7889
(0.0080)
0.7066
(0.0105)
0.5141
(0.0170)
0.7330
(0.0115)
0.7683
(0.0064)
0.8373
(0.0168)
0.6993
(0.0138)
0.7188
(0.0078)
0.7862
(0.0179)
0.7345
(0.0124)
0.7677
(0.0067)
0.8376
(0.0168)
0.7002
(0.0135)
0.7201
(0.0075)
0.7867
Lasso-IR
(0.0080)
0.5313
(0.0118)
0.4355
(0.0164)
0.2167
(0.0067)
0.5605
(0.0103)
0.5072
(0.0152)
0.3297
(0.0176)
0.6867
(0.0206)
0.6621
(0.0153)
0.4754
(0.0172)
0.6859
(0.0185)
0.6610
(0.0180)
0.5037
(0.0188)
0.7322
(0.0211)
0.6251
(0.0079)
0.2783
(0.0206)
0.6967
(0.0211)
0.6724
(0.0103)
0.3988
MCM
SCM
CCRJM
(0.0065)
0.7817
(0.0102)
0.6589
(0.0084)
0.6883
(0.0052)
0.8435
(0.0074)
0.7613
(0.0059)
0.8008
(0.0060)
0.9108
(0.0062)
0.9116
(0.0036)
0.9206
(0.0061)
0.9121
(0.0059)
0.9136
(0.0036)
0.9258
(0.0060)
0.9184
(0.0058)
0.9141
(0.0032)
0.9131
(0.0058)
0.9269
(0.0059)
0.9245
(0.0035)
0.9273
(continued on next page)
(0.0131)
0.5226
(0.0082)
0.4351
(0.0085)
0.2197
(0.0118)
0.5541
(0.0074)
0.5078
(0.0079)
0.3359
(0.0168)
0.6780
(0.0225)
0.6664
(0.0140)
0.5293
(0.0174)
0.6736
(0.0182)
0.6618
(0.0152)
0.5612
(0.0184)
0.7183
(0.0179)
0.6473
(0.0072)
0.3266
(0.0168)
0.6909
(0.0198)
0.6750
(0.0102)
0.4503
14
C2
1
C1
3
p
Config.
Comparison of CCRM, Lasso-IR, MCM, SCM and CCRJM methods with different data set configurations, with the average and standard deviation values in parentheses(n = 1500).
Table 5
COMSTA: 6479
J. Guo, P. Hao / Computational Statistics and Data Analysis xx (xxxx) xxx–xxx
Please cite this article in press as: Guo, J., Hao, P., Constrained center and range joint model for interval-valued symbolic data regression. Computational Statistics and Data Analysis (2017), http://dx.doi.org/10.1016/j.csda.2017.06.005.
C8
1
C7
6
3
1
6
3
p
Config. CCRJM
(11.1792)
(2.0181)
(5.2947)
156.5761
73.3380
(1.7860)
(1.7841)
107.5634
(1.5365)
65.0616
71.8625
(12.0370)
57.1057
(1.5781)
(6.0987)
156.6581
(1.1889)
54.0348
106.4056
43.2804
71.5393
(1.8220)
32.4156
(0.8557)
(3.4611)
133.5745
(3.2158)
97.6463
(3.1116)
73.7615
(2.2946)
110.6527
(1.8361)
72.3457
(1.6132)
46.0253
(13.6060)
158.8113
(3.8751)
103.2705
(1.6870)
74.4503
(17.6698)
143.1959
(4.8867)
79.7953
(0.9650)
46.8501
(2.0021)
60.7929
(4.1140)
59.4639
(2.0409)
57.6118
(1.4488)
41.4707
(1.1588)
34.5170
(0.8603)
32.4343
(2.0159)
62.1225
(1.8746)
56.5867
(1.5568)
52.2074
(1.7528)
44.1327
(1.2206)
35.7952
(0.8803)
28.6260
CCRM
SCM
RMSEl
MCM
CCRM
Lasso-IR
RMSEh
Table 5 (continued)
(12.0072)
140.1431
(5.8487)
96.4856
(1.8756)
64.5715
(12.5866)
140.1087
(6.4466)
95.1946
(1.8735)
64.1523
Lasso-IR
(5.3022)
102.5994
(4.5616)
73.9774
(3.6875)
58.3319
(3.8123)
88.9532
(3.1928)
55.9602
(2.0048)
35.7064
MCM
(7.5422)
88.6373
(3.5786)
69.0165
(1.8999)
57.7324
(8.8661)
74.5515
(3.5031)
50.1569
(1.0830)
34.9774
SCM
(1.9744)
59.2266
(2.4818)
55.6074
(1.7521)
52.3755
(1.4243)
39.9929
(1.1289)
33.2525
(0.8847)
28.6415
CCRJM
RMSEu
(2.2228)
62.2878
(1.8301)
56.5232
(1.5412)
52.1516
(1.5943)
44.1480
(1.2918)
35.8665
(0.8647)
28.6383
CCRM
(4.2961)
97.1007
(2.5267)
67.2068
(1.5248)
53.1938
(4.4107)
96.9575
(2.4311)
67.2424
(1.6486)
52.9443
Lasso-IR
(5.0675)
102.2429
(4.1950)
73.9743
(3.2947)
58.0620
(3.5030)
88.2818
(2.8499)
55.8056
(2.1521)
35.1865
MCM
(19.1979)
147.1431
(7.2580)
87.0834
(2.1815)
59.6732
(20.8745)
137.8404
(7.7939)
71.2451
(1.4686)
37.1816
SCM
(1.9732)
59.2214
(2.3103)
55.5646
(1.7697)
52.4190
(1.4303)
39.9914
(1.1206)
33.2642
(0.8687)
28.6562
CCRJM
AR
(0.0060)
0.7715
(0.0090)
0.6504
(0.0081)
0.6878
(0.0045)
0.8271
(0.0062)
0.7460
(0.0056)
0.8015
CCRM
(0.0353)
0.5422
(0.0280)
0.4405
(0.0131)
0.2746
(0.0375)
0.5425
(0.0317)
0.4469
(0.0138)
0.2752
Lasso-IR
(0.0077)
0.5326
(0.0116)
0.4362
(0.0147)
0.2171
(0.0066)
0.5609
(0.0104)
0.5088
(0.0160)
0.3336
MCM
(0.0294)
0.4692
(0.0169)
0.4071
(0.0089)
0.2113
(0.0354)
0.4901
(0.0205)
0.4698
(0.0085)
0.3245
SCM
(0.0068)
0.7813
(0.0100)
0.6584
(0.0083)
0.6877
(0.0053)
0.8426
(0.0073)
0.7612
(0.0056)
0.8014
CCRJM
COMSTA: 6479
J. Guo, P. Hao / Computational Statistics and Data Analysis xx (xxxx) xxx–xxx 15
Please cite this article in press as: Guo, J., Hao, P., Constrained center and range joint model for interval-valued symbolic data regression. Computational Statistics and Data Analysis (2017), http://dx.doi.org/10.1016/j.csda.2017.06.005.
COMSTA: 6479
16
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
J. Guo, P. Hao / Computational Statistics and Data Analysis xx (xxxx) xxx–xxx
hypothesis H0a is rejected for each measure of RMSEh , RMSEl , RMSEu , and AR for the 50 repeats of the simulation experiment. In the order of A and B, for any two methods compared concerning the RMSEh , RMSEl and RMSEu measures (the lower the measures, the better the method), the null and alternative hypotheses are as follows: H0a: Measure of Method A ≥ Measure of Method B versus H1a: Measure of Method A < Measure of Method B. With respect to the AR measures (the higher the measures, the better the method), the hypotheses are structured as follows: H0a: Measure of Method A ≤ measure of Method B versus H1a: Measure of Method A > Measure of Method B. Tables 6 through 8 show comparisons of the CCRJM method and the CCRM, Lasso-IR, MCM, and SCM methods with respect to the percentages of rejection of the null hypothesis H0a (paired t-test for matched-pairs samples following a t-distribution with 1% significance level) when n = 150, n = 45, and n = 1500. Next, we exchange the positions of the null and alternative hypotheses and check the percentage of rejection of the null hypothesis H0b (paired t-test for matched-pairs samples following a t-distribution with 1% significance level). Again, in the order of A and B, for any two methods compared concerning the RMSEh , RMSEl and RMSEu measures (the lower the measures, the better the method), the null and alternative hypotheses are as follows: H0b: Measure of Method A ≤ Measure of Method B versus H1b: Measure of Method A > Measure of Method B. With respect to the AR measures (the higher the measures, the better the method), the hypotheses are structured as follows: H0b: Measure of Method A ≥ Measure of Method B versus H1b: Measure of Method A < Measure of Method B. Tables 9 through 11 show comparisons of the CCRJM method and the CCRM, Lasso-IR, MCM, and SCM methods with respect to the percentages of rejection of the null hypothesis H0b (paired t-test for matched-pairs samples following a t-distribution with 1% significance level), when n = 150, n = 45, and n = 1500. Fig. 4 illustrates the results displayed in Tables 6 through 11. The values of the percentages of rejection, shown at the vertical coordinates, are connected by a smooth curve. Higher percentages of rejection of hypothesis H0a and lower percentages of rejection of hypothesis H0b indicate better behavior by our proposed method. In Tables 6–8, a high percentage of rejection of the null hypothesis indicates the superiority of the CCRJM approach. In contrast, in Tables 9–11, a high percentage of rejection of the null hypothesis indicates the inferiority of the CCRJM approach. In Table 6, the percentage of rejection of H0a is 100% in almost all situations of C 1, C 2, C 3, and C 4. In contrast, in Table 9, the percentage of rejection of H0b is 0% in almost all situations. These results indicate that the CCRJM method always performs better than the CCRM, Lasso-IR, MCM, and SCM methods in data set configuration 1, regardless of the number of variables and performance measure concerned. In data set configuration 2, as shown in Table 6, a comparison of CCRJM and CCRM shows that the percentage of rejection of H0a is 100% in almost all situations except for the p = 1 case in data set configurations C 5, C 6, C 7, and C 8. This indicates that, except for the p = 1 case, the performance of the CCRJM method is always superior to that of the CCRM method. In Table 9, a comparison of CCRJM and CCRM shows that the rejection percentage of the null hypothesis H0b is 0% in almost all situations, which means that the CCRJM method is never inferior to the CCRM method except for the p = 1 case. The corresponding diagram is shown in Fig. 4(a). Likewise, in a comparison of the CCRJM method with the Lasso-IR, MCM, and SCM methods, Tables 6 and 9 show that the rejection percentage of the null hypothesis H0a is 100% in almost all situations except for the p = 1 case. This means that the CCRJM method is almost always superior to the Lasso-IR, MCM, and SCM methods except for the p = 1 case. For the corresponding diagram, again see Fig. 4(a). When we consider the extremely large data set example n = 1500, the results are much the same, as shown in Tables 8, 11 and Fig. 4(c). When we consider the extremely small data set example n = 45, however, we must separately analyze the results. A comparison of CCRJM and MCM as well as SCM in Table 7 shows that the rejection percentage of the null hypothesis H0a is 100% in nearly all situations, and Table 10 shows that of the null hypothesis H0b to be 0% in all situations. These results indicate that the performance of the CCRJM method is always superior to that of the MCM and SCM methods. In a comparison of CCRJM and CCRM, Table 7 shows that the rejection percentage of the null hypothesis H0a is 0% in almost all situations and the rejection percentage of the null hypothesis H0b is relatively higher in Table 10. We concluded that the CCRJM is slightly inferior to the CCRM method in the extremely small data set example n = 45. In the same manner, we can say that the CCRJM method is slightly superior in performance to the Lasso-IR method when n = 45. For the corresponding diagram see Fig. 4(b). In summary, when we consider extremely small data set examples, the CCRJM method performs slightly worse than the CCRM method but its performance is not inferior to those of the other three methods. These results indicate that, compared with the CCRM and Lasso-IR, MCM, SCM methods, the CCRJM method is more suitable for the determining regression with higher dimensions of interval independent variables and relative larger size data sets examples. Consequently, we can draw the conclusion that the interval-valued linear regression fitting model CCRJM performs better than the Lasso-IR, MCM, and SCM methods in our simulation experiment settings. Except for situations involving a low dimension of the independent variables and extremely small data set examples, the CCRJM model behaves better than the CCRM method in our simulation experiment settings. In summary, the CCRJM model produces a better effect than the other four methods in most cases according to our simulation study. Table 12 shows the average computation time when the number of objects in the learning and test sets are 100 and 50, respectively, for every configuration we considered. The main CPU frequency of our computing environment is 2.8 MHz, which has a 128 kB session level cache, a 1 MB second level cache, an 8 MB third level cache, and 8 GB of internal memory.
Please cite this article in press as: Guo, J., Hao, P., Constrained center and range joint model for interval-valued symbolic data regression. Computational Statistics and Data Analysis (2017), http://dx.doi.org/10.1016/j.csda.2017.06.005.
1 3 6 1 3 6 1 3 6 1 3 6
1 3 6 1 3 6 1 3 6 1 3 6
C1
C5
C8
C7
C6
C4
C3
C2
p
Config.
0 100 100 0 100 100 0 100 100 0 100 100
100 100 100 28 100 100 100 100 100 100 100 100
0 84 60 0 20 12 0 72 60 0 8 0
100 100 100 24 96 100 100 100 100 96 100 100
0 60 72 0 16 4 0 68 60 0 16 12
100 100 100 12 96 100 100 100 100 96 100 100 0 96 96 0 32 16 4 92 92 0 24 12
100 100 100 32 100 100 100 100 100 100 100 100 20 100 100 0 100 100 0 100 100 0 100 100
100 100 100 10 100 100 100 100 100 100 100 100
RMSEh
AR
Lasso-IR
RMSEu
RMSEh
RMSEl
CCRM
40 100 100 0 88 88 4 100 100 8 100 100
100 100 100 10 100 100 100 100 100 100 100 100
RMSEl
0 12 48 0 4 4 0 100 100 0 100 100
70 100 100 0 100 100 100 100 100 100 100 100
RMSEu
100 100 100 56 96 96 100 100 100 100 100 100
100 100 100 10 100 100 100 100 100 100 100 100
AR
100 100 100 0 100 100 100 100 100 0 100 100
100 100 100 100 100 100 100 100 100 100 100 100
RMSEh
MCM
90 100 100 0 100 100 100 100 100 0 100 100
100 90 100 100 100 100 100 100 100 100 100 100
RMSEl
Table 6 Comparison of CCRJM and CCRM, Lasso-IR, MCM, SCM methods: percentages of rejection of the null hypothesis 0a (%) (n = 150).
100 100 100 0 100 100 80 100 100 0 100 100
100 100 100 100 100 100 100 100 100 100 100 100
RMSEu
100 100 100 100 100 100 100 100 100 100 100 100
100 100 100 100 100 100 100 100 100 100 100 100
AR
100 100 100 0 100 100 100 100 100 0 100 100
100 100 100 100 100 100 100 100 100 100 100 100
RMSEh
SCM
88 100 100 0 100 100 100 100 100 0 100 100
100 100 100 100 100 100 100 100 100 100 100 100
RMSEl
100 100 100 0 100 100 96 100 100 0 100 100
100 100 100 100 90 100 100 100 100 100 100 100
RMSEu
100 100 100 100 100 100 100 100 100 100 100 100
100 100 100 100 100 100 100 100 100 100 100 100
AR
COMSTA: 6479
J. Guo, P. Hao / Computational Statistics and Data Analysis xx (xxxx) xxx–xxx 17
Please cite this article in press as: Guo, J., Hao, P., Constrained center and range joint model for interval-valued symbolic data regression. Computational Statistics and Data Analysis (2017), http://dx.doi.org/10.1016/j.csda.2017.06.005.
1 3 6 1 3 6 1 3 6 1 3 6
1 3 6 1 3 6 1 3 6 1 3 6
C1
C5
C8
C7
C6
C4
C3
C2
p
0 100 36 0 60 4 0 96 44 0 44 12
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 4 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 4 0 0 0 0 0 12 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 4 0 0 0 0 0 4 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
AR
88 96 76 0 76 16 8 100 96 12 100 100
0 0 100 0 0 100 0 0 100 0 0 100
Lasso-IR
RMSEu
RMSEh
RMSEl
CCRM
RMSEh
40 32 8 0 8 0 24 100 92 28 88 96
0 0 100 0 0 100 0 0 100 0 0 100
RMSEl
0 0 0 0 0 0 0 24 32 0 24 44
0 0 100 0 0 80 0 0 100 0 0 90
RMSEu
68 16 8 40 12 0 100 100 92 100 92 100
0 0 100 0 0 100 0 0 100 0 0 100
AR
MCM
100 100 100 40 100 100 100 100 100 60 100 100
100 100 100 100 100 80 100 100 100 100 100 100
RMSEh
90 100 100 10 100 100 90 100 100 20 90 100
100 100 100 100 100 60 100 100 100 100 100 100
RMSEl
100 100 100 10 100 100 70 100 100 0 100 100
100 100 100 100 100 70 100 100 100 100 100 100
RMSEu
100 100 100 100 100 100 100 100 100 100 100 100
100 100 100 100 100 70 100 100 100 100 100 100
AR
SCM
100 100 100 60 100 100 100 100 100 60 100 100
100 100 100 100 0 100 20 100 100 90 100 100
RMSEh
72 100 100 0 68 92 84 100 96 4 64 96
100 100 100 100 0 100 10 100 100 90 100 100
RMSEl
88 100 100 12 96 100 96 100 100 8 88 100
80 10 100 100 0 70 0 90 100 20 90 100
RMSEu
100 100 100 100 100 100 100 100 100 100 100 100
100 100 100 100 0 100 10 100 100 100 100 100
AR
18
Config.
Table 7 Comparison of CCRJM and CCRM, Lasso-IR, MCM, SCM methods: percentages of rejection of the null hypothesis 0a (%) (n = 45).
COMSTA: 6479
J. Guo, P. Hao / Computational Statistics and Data Analysis xx (xxxx) xxx–xxx
Please cite this article in press as: Guo, J., Hao, P., Constrained center and range joint model for interval-valued symbolic data regression. Computational Statistics and Data Analysis (2017), http://dx.doi.org/10.1016/j.csda.2017.06.005.
1 3 6 1 3 6 1 3 6 1 3 6
1 3 6 1 3 6 1 3 6 1 3 6
C1
C5
C8
C7
C6
C4
C3
C2
p
Config.
0 100 100 0 96 100 0 100 100 0 92 100
100 100 100 100 100 100 100 100 100 100 100 100
0 100 100 0 44 100 0 100 100 0 48 100
100 100 100 100 100 100 100 100 100 100 100 100
0 100 100 0 36 100 0 100 100 0 60 100
100 100 100 100 100 100 100 100 100 100 100 100 0 100 100 0 100 100 0 100 100 0 96 100
100 100 100 100 100 100 100 100 100 100 100 100 0 100 100 0 100 100 0 100 100 0 100 100
100 100 100 100 100 100 100 100 100 100 100 100
RMSEh
AR
Lasso-IR
RMSEu
RMSEh
RMSEl
CCRM
0 100 100 0 100 100 0 100 100 0 100 100
100 100 100 100 100 100 100 100 100 100 100 100
RMSEl
0 100 100 0 16 100 0 100 100 0 100 100
100 100 100 100 100 100 100 100 100 100 100 100
RMSEu
96 100 100 0 100 100 100 100 100 100 100 100
100 100 100 100 100 100 100 100 100 100 100 100
AR
100 100 100 0 100 100 100 100 100 0 100 100
100 100 100 100 100 100 100 100 100 100 100 100
RMSEh
MCM
80 100 100 0 100 100 90 100 100 0 100 100
100 100 100 100 100 100 100 100 100 100 100 100
RMSEl
Table 8 Comparison of CCRJM and CCRM, Lasso-IR, MCM, SCM methods: percentages of rejection of the null hypothesis 0a (%) (n = 1500).
100 100 100 0 100 100 70 100 100 0 100 100
100 100 100 100 100 100 100 100 100 100 100 100
RMSEu
100 100 100 100 100 100 100 100 100 100 100 100
100 100 100 100 100 100 100 100 100 100 100 100
AR
12 100 100 0 100 100 32 100 100 0 100 100
100 100 100 100 100 100 100 100 100 100 100 100
RMSEh
SCM
0 100 100 0 100 100 4 100 100 0 100 100
100 100 100 100 100 100 100 100 100 100 100 100
RMSEl
56 100 100 0 100 100 48 100 100 0 100 100
100 100 100 100 100 100 100 100 100 100 100 100
RMSEu
100 100 100 100 100 100 100 100 100 100 100 100
100 100 100 100 100 100 100 100 100 100 100 100
AR
COMSTA: 6479
J. Guo, P. Hao / Computational Statistics and Data Analysis xx (xxxx) xxx–xxx 19
Please cite this article in press as: Guo, J., Hao, P., Constrained center and range joint model for interval-valued symbolic data regression. Computational Statistics and Data Analysis (2017), http://dx.doi.org/10.1016/j.csda.2017.06.005.
1 3 6 1 3 6 1 3 6 1 3 6
1 3 6 1 3 6 1 3 6 1 3 6
C1
C5
C8
C7
C6
C4
C3
C2
p
4 0 0 8 0 0 20 0 0 16 0 0
0 0 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 12 0 0 12 0 0
0 0 0 0 0 0 0 0 0 0 0 0
4 0 0 4 0 0 20 0 0 4 0 0
0 0 0 0 0 0 0 0 0 0 0 0 12 0 0 4 0 0 8 0 0 12 0 0
0 0 0 0 0 0 0 0 0 0 0 0
AR
0 0 0 100 0 0 68 0 0 44 0 0
0 0 0 0 0 0 0 0 0 0 0 0
Lasso-IR
RMSEu
RMSEh
RMSEl
CCRM
RMSEh
0 0 0 68 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
RMSEl
0 0 0 76 0 0 72 0 0 72 0 0
0 0 0 0 0 0 0 0 0 0 0 0
RMSEu
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
AR
MCM
0 0 0 80 0 0 0 0 0 80 0 0
0 0 0 0 0 0 0 0 0 0 0 0
RMSEh
0 0 0 30 0 0 0 0 0 50 0 0
0 0 0 0 0 0 0 0 0 0 0 0
RMSEl
0 0 0 40 0 0 0 0 0 30 0 0
0 0 0 0 0 0 0 0 0 0 0 0
RMSEu
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
AR
SCM
0 0 0 80 0 0 0 0 0 60 0 0
0 0 0 0 0 0 0 0 0 0 0 0
RMSEh
0 0 0 40 0 0 0 0 0 28 0 0
0 0 0 0 0 0 0 0 0 0 0 0
RMSEl
0 0 0 8 0 0 0 0 0 8 0 0
0 0 0 0 0 0 0 0 0 0 0 0
RMSEu
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
AR
20
Config.
Table 9 Comparison of CCRJM and CCRM, Lasso-IR, MCM, SCM methods: percentages of rejection of the null hypothesis 0b (%) (n = 150).
COMSTA: 6479
J. Guo, P. Hao / Computational Statistics and Data Analysis xx (xxxx) xxx–xxx
Please cite this article in press as: Guo, J., Hao, P., Constrained center and range joint model for interval-valued symbolic data regression. Computational Statistics and Data Analysis (2017), http://dx.doi.org/10.1016/j.csda.2017.06.005.
1 3 6 1 3 6 1 3 6 1 3 6
1 3 6 1 3 6 1 3 6 1 3 6
C1
C5
C8
C7
C6
C4
C3
C2
p
Config.
12 0 0 12 0 0 4 0 0 4 0 0
16 76 100 24 84 100 8 68 100 4 80 100
4 0 8 4 4 44 8 0 16 4 12 16
0 24 92 20 44 80 12 32 80 0 40 80
8 0 8 16 20 40 8 0 8 0 12 56
16 56 84 24 64 92 12 32 96 0 60 96 4 0 0 4 0 48 12 0 20 0 12 28
12 68 100 12 80 100 8 52 100 12 76 96 0 0 0 4 0 0 0 0 0 0 0 0
10 30 0 20 0 0 0 10 0 10 0 0
Lasso-IR AR
RMSEh
RMSEu
RMSEh
RMSEl
CCRM
0 0 0 0 0 0 0 0 0 0 0 0
10 10 0 10 0 0 0 10 0 0 0 0
RMSEl
0 4 0 8 8 20 0 0 0 0 0 0
20 10 0 10 0 0 0 10 0 0 0 0
RMSEu
0 0 0 0 0 0 0 0 0 0 0 0
0 40 0 20 10 0 0 20 0 0 10 0
AR
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
RMSEh
MCM
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
RMSEl
Table 10 Comparison of CCRJM and CCRM, Lasso-IR, MCM, SCM methods: percentages of rejection of the null hypothesis 0b (%) (n = 45).
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
RMSEu
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
AR
SCM
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
RMSEh
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
RMSEl
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
RMSEu
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 10 0 0 0 0 0 0 0
AR
COMSTA: 6479
J. Guo, P. Hao / Computational Statistics and Data Analysis xx (xxxx) xxx–xxx 21
Please cite this article in press as: Guo, J., Hao, P., Constrained center and range joint model for interval-valued symbolic data regression. Computational Statistics and Data Analysis (2017), http://dx.doi.org/10.1016/j.csda.2017.06.005.
1 3 6 1 3 6 1 3 6 1 3 6
1 3 6 1 3 6 1 3 6 1 3 6
C1
C5
C8
C7
C6
C4
C3
C2
p
8 0 0 4 0 0 4 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
12 0 0 4 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
20 0 0 0 0 0 4 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 8 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
AR
100 0 0 100 0 0 100 0 0 100 0 0
0 0 0 0 0 0 0 0 0 0 0 0
Lasso-IR
RMSEu
RMSEh
RMSEl
CCRM
RMSEh
28 0 0 100 0 0 100 0 0 100 0 0
0 0 0 0 0 0 0 0 0 0 0 0
RMSEl
100 0 0 100 0 0 100 0 0 100 0 0
0 0 0 0 0 0 0 0 0 0 0 0
RMSEu
0 0 0 52 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
AR
MCM
0 0 0 100 0 0 0 0 0 100 0 0
0 0 0 0 0 0 0 0 0 0 0 0
RMSEh
0 0 0 100 0 0 0 0 0 100 0 0
0 0 0 0 0 0 0 0 0 0 0 0
RMSEl
0 0 0 100 0 0 0 0 0 100 0 0
0 0 0 0 0 0 0 0 0 0 0 0
RMSEu
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
AR
SCM
0 0 0 100 0 0 0 0 0 100 0 0
0 0 0 0 0 0 0 0 0 0 0 0
RMSEh
4 0 0 100 0 0 0 0 0 100 0 0
0 0 0 0 0 0 0 0 0 0 0 0
RMSEl
0 0 0 100 0 0 0 0 0 100 0 0
0 0 0 0 0 0 0 0 0 0 0 0
RMSEu
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
AR
22
Config.
Table 11 Comparison of CCRJM and CCRM, Lasso-IR, MCM, SCM methods: percentages of rejection of the null hypothesis 0b (%) (n = 1500).
COMSTA: 6479
J. Guo, P. Hao / Computational Statistics and Data Analysis xx (xxxx) xxx–xxx
Please cite this article in press as: Guo, J., Hao, P., Constrained center and range joint model for interval-valued symbolic data regression. Computational Statistics and Data Analysis (2017), http://dx.doi.org/10.1016/j.csda.2017.06.005.
0.033 0.056 0.025 0.050
0.051 0.052 0.052 0.026
C5 C6 C7 C8
0.053 0.052 0.041 0.055
0.055 0.035 0.041 0.036
0.041 0.050 0.042 0.042
0.027 0.055 0.035 0.043 0.971 0.865 4.509 1.703
1.064 1.205 0.850 1.158 6.511 3.170 0.932 0.999
1.223 1.993 1.161 1.447
p=3
5.369 7.404 0.923 1.799
2.193 2.901 2.462 2.053
p=6
Lasso-IR (rs = 10, t = 10)
p=6
p=1
p=3
CCRM
p=1
C1 C2 C3 C4
Config.
Table 12 Average computation time for every configuration (seconds).
18.701 18.868 18.753 18.794
18.701 19.146 18.854 19.143 18.873 18.818 19.298 19.451
18.866 19.101 18.924 19.601
p=3
MCM (B = 1000) p=1
18.737 18.741 18.868 18.764
18.799 19.180 18.953 19.031
p=6
SCM
0.013 0.022 0.020 0.020
0.015 0.020 0.020 0.013
p=1
0.021 0.010 0.021 0.021
0.015 0.011 0.013 0.021
p=3
0.016 0.012 0.021 0.016
0.011 0.020 0.011 0.020
p=6
CCRJM
0.259 0.270 0.271 0.271
0.258 0.285 0.289 0.280
p=1
0.249 0.255 0.264 0.264
0.243 0.271 0.259 0.239
p=3
0.281 0.270 0.275 0.244
0.268 0.276 0.347 0.287
p=6
COMSTA: 6479
J. Guo, P. Hao / Computational Statistics and Data Analysis xx (xxxx) xxx–xxx 23
Please cite this article in press as: Guo, J., Hao, P., Constrained center and range joint model for interval-valued symbolic data regression. Computational Statistics and Data Analysis (2017), http://dx.doi.org/10.1016/j.csda.2017.06.005.
COMSTA: 6479
24
J. Guo, P. Hao / Computational Statistics and Data Analysis xx (xxxx) xxx–xxx
(a) Cases of n = 150.
Fig. 4. Comparison of the CCRJM method and the CCRM, Lasso-IR, MCM, and SCM methods: percentages of rejection of Hypotheses H0a and H0b. Table 13 Interval data set configuration, where i = 1, . . . , 50, j = 1. C9 C 10
1
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
βc ∼ U (2, 2.5) βc ∼ U (2, 2.5)
xcj ∼ U (0, 150) xcj ∼ U (0, 150)
εic ∼ t (1) εic ∼ N (0, 20)
βr ∼ U (2.5, 5) βr ∼ U (2.5, 5)
xrj ∼ U (5, 10) xrj ∼ U (5, 10)
εir ∼ U (0, 10) εir ∼ U (0, 10)
4. Analysis of outliers This section describes an experiment to determine the effects of outliers in the interval-valued data regression and provides an analysis. We still adopt RMSEl , RMSEu , RMSEh , and AR for the evaluation criteria and generate data following Algorithm 2 of data set configuration 2. To illustrate the effects of outliers, Table 13 shows configuration C 9 as the interval data set where the error of the midpoint εic ∼ t (1) (i = 1, . . . , 50) to ( ensure ) the existence of(outliers. ) Configuration C 10 has εic ∼ N (0, 20) (i = 1, . . . , 50) and outliers are manually added at xc1 , yc = (0, 600) and xc1 , yc = (150, −400), whose xr1 and yr are assigned 10 and 50, respectively. Outliers are located on one side of the bulk of the data in configuration C 9 and on both sides in configuration C 10. To visualize the outliers, let us take into account the case of only one explanatory interval-valued variable represented by xcj and xrj , where j = 1. Fig. 5 shows the data set of configuration C 9, and Fig. 6 shows the observed interval-valued objects as well ( as their ) predictions using the CCRJM method. In Fig. 5, we can see clearly that the interval-valued object whose center xc1 , yc = (73.2681, 823.4511) is an outlier, which we will refer to as outlier 1. Figs. 5(c) and 6(c) remove this outlier and create a new data set. Figs. 5(d) and 6(d) illustrate the observed interval-valued objects and their predictions of this new data set using the CCRJM method. Table 14 shows the performance of the CCRJM method on configuration C 9, which contains outliers. When we remove outlier 1, the prediction performance improves dramatically. For instance, the measurement RMSEh changes from 424.2938 Please cite this article in press as: Guo, J., Hao, P., Constrained center and range joint model for interval-valued symbolic data regression. Computational Statistics and Data Analysis (2017), http://dx.doi.org/10.1016/j.csda.2017.06.005.
COMSTA: 6479
J. Guo, P. Hao / Computational Statistics and Data Analysis xx (xxxx) xxx–xxx
25
(b) Cases of n = 45. Fig. 4. (continued) Table 14 Performance of CCRJM method with outlier(s).
C9 Outlier 1 removed Outliers 1, 2 removed C 10 Outliers removed
RMSEh
RMSEl
RMSEu
AR
424.2938 18.8440 16.2442 362.2770 21.8424
92.8963 16.4030 13.2681 293.1866 20.4530
414.7792 17.3975 15.2601 261.1438 20.6196
0.1744 0.7136 0.7286 0.1531 0.6434
to 18.8440, and the other measurements do much the same. Moreover, we can see another outlier located at xc1 , yc = (37.6594, 162.8786), which we will refer to as outlier 2. When we remove outliers 1 and 2 simultaneously, the prediction performance improves slightly, showing a slight decrease in RMSEh , RMSEl , and RMSEu and a slight increase in AR. Similar results can be seen in configuration C 10. As the experiment above shows, the data set configurations C 9 and C 10 with outliers perform poorly and thus do not challenge our proposal. Thus, the presence of only one outlier implies that our solution is not useful, because just one ‘‘bad’’ object can produce very deleterious results. As shown in Figs. 5(b) and 6(b), the adopted overlapping constraints make the predictions of the independent interval-valued variable cover a much larger region than their observations. We can see that to satisfy the overlapping constraints, the upper bound of the object of outlier 1’s prediction reaches a high point on the lower bound of its observation. Because of the effects of outlier 1, nearly all of the ranges of the prediction interval are much larger than their actual values. However, unlike outlier 1 of C 9, outlier 2 does not make a distinct impact on the regression process. Therefore, the adopted overlapping constraints lead to deleterious results in the case of a limited number of anomalous objects.
(
)
Please cite this article in press as: Guo, J., Hao, P., Constrained center and range joint model for interval-valued symbolic data regression. Computational Statistics and Data Analysis (2017), http://dx.doi.org/10.1016/j.csda.2017.06.005.
1 2 3 4 5 6 7 8 9 10 11 12 13
COMSTA: 6479
26
J. Guo, P. Hao / Computational Statistics and Data Analysis xx (xxxx) xxx–xxx
(c) Cases of n = 1500. Fig. 4. (continued)
11
Table 14 shows that when more outliers are removed, better performance will be achieved through our proposal. In fact, however, we want only to find the outliers that have a significant effect on the regression of interval-valued data, such as outlier 1 in this case. Clearly, one can identify and remove the outliers by observation before using our proposal, just as in the process shown in Figs. 5 and 6. Moreover, because the outliers’ locations are determined by the interval-valued object centers, we can perform outlier detection on the center data set of a series of interval-valued data. As regards the outliers’ detection, there are basically two ways (Rousseeuw and Leroy, 1987). The first, and probably most well-known, approach is to construct so-called regression diagnostics. Diagnostics are particular quantities computed from the data with the purpose of pinpointing influential points, after which the outliers can be removed or corrected, followed by an LS analysis on the remaining cases. The other approach is robust regression, which tries to devise estimators that are not so strongly affected by outliers. For more details regarding outlier detection see Rousseeuw and Leroy (1987), Hawkins (1980), Belsley et al. (1980), and Pardoe (2012).
12
5. Real-life case study
1 2 3 4 5 6 7 8 9 10
14
In this section, to illustrate the performance of our proposed method, we take into account two interval-valued data sets, in which the dependent variable in one has a positive value and in the other a positive or negative value.
15
5.1. Taobao seller credit data set
13
16 17 18 19
We consider a seller credit data set from Taobao, a popular online retail platform in China. This data set concerns the record of seller credit Y , the popularity of shop X1 , and the quantity of goods X2 for ten cities, such as Beijing, Shanghai, and others in 2014. For instance, an interval-valued seller credit of [54, 16038] in Beijing means that the minimum seller credit is 54 for a certain shop and the maximum is 16038 for another shop. Please cite this article in press as: Guo, J., Hao, P., Constrained center and range joint model for interval-valued symbolic data regression. Computational Statistics and Data Analysis (2017), http://dx.doi.org/10.1016/j.csda.2017.06.005.
COMSTA: 6479
J. Guo, P. Hao / Computational Statistics and Data Analysis xx (xxxx) xxx–xxx
27
Fig. 5. Effect on outliers using the CCRJM method in Configuration C 9.
Table 15 Data set for interval-valued seller credit, popularity of shop, and quantity of goods in ten Chinese cities from the Taobao online retail platform. Number
City
Y
X1
X2
1 2 3 4 5 6 7 8 9 10
Beijing Changsha Chengdu Guangzhou Haerbin Jinan Nanjing Shanghai Tianjin Hangzhou
[54,16038] [14,3410] [30,6986] [59,18348] [27,8985] [12,3258] [41,10547] [100,18265] [37,10814] [46,18158]
[6,3037] [4,1010] [8,6960] [4,7338] [4,3900] [5,990] [5,2450] [20,6988] [29,4790] [2,5400]
[23,1625] [8,595] [13,945] [28,1089] [25,930] [10,677] [29,1328] [46,1379] [17,824] [25,1455]
Table 15 displays the interval data set. Our objective in this case is to predict the interval dependent variable values of Y from Xj (j = 1, 2) using a linear regression model. We note that seller credit Y should not be negative, which is meaningless, to ensure the rationality of the predicted results. Fig. 7 shows the interval dependent variable of seller credit Y versus the independent variables of the popularity of shop X1 and the quantity of goods X2 . The regression equations fitted to the Taobao seller credit interval data set using the CCRJM, Lasso-IR, CCRM, MCM and SCM approaches are presented below: ( ) ( ) CCRM: yˆ l = yˆ c − yˆ r and yˆ u = yˆ c + yˆ r , where yˆ c = 1(, xc1 , xc2 ()2737.95, 1.05, 11.29)T and yˆ r = 1, x(r1 , xr2 (0,)0(.91, 7.50)T . Lasso-IR: yˆ l = yˆ c − yˆ r and yˆ u = yˆ c + yˆ r , where yˆ c = 1, xc1 , xc2 (−2737.95, 1.05, 11.29)T and yˆ r = 1, xr1 , xr2 −2706.32,
)T
1.08, 11.51 .
Please cite this article in press as: Guo, J., Hao, P., Constrained center and range joint model for interval-valued symbolic data regression. Computational Statistics and Data Analysis (2017), http://dx.doi.org/10.1016/j.csda.2017.06.005.
1 2 3 4 5 6 7 8 9 10
COMSTA: 6479
28
J. Guo, P. Hao / Computational Statistics and Data Analysis xx (xxxx) xxx–xxx
Fig. 6. Effect on outliers using the CCRJM method in Configuration C 10.
Fig. 7. Interval variables for the Taobao online retail platform.
1 2 3 4 5 6 7 8
MCM: yˆ l
= min
((
1, xl1 , xl2
)(
)T (
2651.39, 0.72, 2.68 , 1, xu1 , xu2 (2651.39, 0.72, 2.68)T
)
)
and yˆ u
(( ) = max 1, xl1 , xl2
( ) ) T , 1, xu1 , x)u2 (2651.39, 0.72,(2.68)T .) (2651.39, 0.72, 2.68)(( ) (( ) yˆ l = min 1, x)l1 , xl2 (0, 0.94, 8.29)T , 1, xu1 , xu2 (0, 0.94, 8.29)T and yˆ u = max 1, xl1 , xl2 (0, 0.94, 8.29)T , ( SCM: ) T 1, xu1 , xu2 (0, 0.94, 8.29) . ( ) CCRJM : yˆ l = yˆ c )− yˆ r and yˆ u = yˆ c + yˆ r , where yˆ c = 1, xc1 , xc2 , xr1 , xr2 (−2553.10, 8.19, 61.25, −7.32, −51.81)T and ( yˆ r = 1, xc1 , xc2 , xr1 , xr2 (−2532.24, 7.62, 59.83, −6.75, −50.42)T . Based on the fitted linear regression models, CCRM, Lasso-IR, MCM, SCM and CCRJM, Table 16 shows the fitted values of variable Y . Please cite this article in press as: Guo, J., Hao, P., Constrained center and range joint model for interval-valued symbolic data regression. Computational Statistics and Data Analysis (2017), http://dx.doi.org/10.1016/j.csda.2017.06.005.
COMSTA: 6479
J. Guo, P. Hao / Computational Statistics and Data Analysis xx (xxxx) xxx–xxx
29
Table 16 Observed and fitted values of the Taobao seller credit variable by different methods. Number
City
Y
CCRM
Lasso-IR
MCM
SCM
CCRJM
1 2 3 4 5 6 7 8 9 10
Beijing Changsha Chengdu Guangzhou Haerbin Jinan Nanjing Shanghai Tianjin Hangzhou
[54,16038] [14,3410] [30,6986] [59,18348] [27,8985] [12,3258] [41,10547] [100,18265] [37,10814] [46,18158]
[771,15549] [−1462,3857] [−352,12989] [84,14741] [−474,9870] [−1288,4612] [223,12195] [796,17157] [−667,9734] [620,16273]
[13,16308] [−19,2414] [−97,12734] [49,14776] [90,9306] [8,16885] [−3,3327] [121,12298] [247,17707] [21,9045]
[2670,9513] [2624,5059] [2641,10233] [2683,10922] [2674,8060] [2630,5287] [2687,8227] [2747,11535] [2668,8366] [2673,10655]
[196,16317] [70,5879] [115,14365] [236,15913] [211,11368] [88,6539] [245,13305] [400,17988] [168,11325] [209,17127]
[48,15516] [4,2349] [31,11422] [53,14721] [40,9909] [9,3343] [50,12863] [92,19399] [43,9161] [50,16125]
Fig. 8. Taobao seller credit prediction by the CCRM, Lasso-IR, MCM, SCM and CCRJM methods.
Fig. 8 shows the observed and fitted interval values of the Taobao seller credit in different cities by the CCRM, LassoIR, MCM, SCM and CCRJM methods. By comparing of Fig. 8 and Table 14, we can see that CCRJM has better fitness than CCRM, Lasso-IR, MCM and SCM, and that CCRJM performs better at the lower boundaries of the interval variable Y than the other methods. We were surprised to see several negative values in the lower boundaries of predicted interval by the CCRM and Lasso-IR methods. As noted above, in no circumstance should seller credit be negative. So, cases for which there exists negative seller credit in this case study are not rational. We evaluated the performance of each method by calculating the RMSEh , RMSEl , RMSEu , and AR measures. As shown in Table 17, the performance of these methods reveals the superiority of CCRJM with respect to CCRM, MCM and SCM for every performance measure. The performance of the Lasso-IR method exceeded that of CCRJM by 1.1% (a slim advantage) with respect to the accuracy rate. This means that the CCRJM method demonstrates better effect than the Lasso-IR in almost all performance measures. Considering the rationality of these results, we conclude that the CCRJM method produced a better solution for this real-life case study. Please cite this article in press as: Guo, J., Hao, P., Constrained center and range joint model for interval-valued symbolic data regression. Computational Statistics and Data Analysis (2017), http://dx.doi.org/10.1016/j.csda.2017.06.005.
1 2 3 4 5 6 7 8 9 10 11 12
COMSTA: 6479
30
J. Guo, P. Hao / Computational Statistics and Data Analysis xx (xxxx) xxx–xxx Table 17 Performance of the methods considered with respect to the Taobao seller credit data set. Methods
RMSEh
RMSEl
RMSEu
AR
CCRM Lasso-IR MCM SCM CCRJM
2510.4620 2347.4000 6066.8747 5960.4773 2204.6070
782.5487 72.1138 3420.4426 1628.3818 7.2171
2465.1570 2347.5000 5338.0980 4149.2909 2204.6070
0.7760 0.8534 0.3443 0.6626 0.8436
Table 18 Data set for interval-valued stock market value(%).
1
2 3 4 5 6 7 8 9 10 11 12 13 14
5.2. Stock market value data set
We consider the stock market value data set in Hu and He (2007) in which the values of the dependent variable may be negative, as shown in Table 18. Stock market values (SP ) are linearly determined by five macroeconomic factors (Chen et al., 1986): the growth rate variations of seasonally adjusted Industrial Production Index (IP ); changes in expected inflation (DI ) and unexpected inflation (UI ); default risk premiums (DF ); and unexpected changes in interest rates (TM ). Fig. 9 shows the interval dependent variable of stock market value SP versus the independent variables IP, DI, UI, DF , and TM. Below, we present the regression equations fitted to the stock market value data set using the CCRJM, Lasso-IR, CCRM, MCM and SCM approaches: ( ) CCRM: yˆ l = yˆ c −(yˆ r and yˆ u = yˆ c + yˆ r , where yˆ c = 1, xc1 , xc2 , xc3 , xc4 , xc5 (0.0004, 2.4089, 0.1345, 0.1800, −0.0492, ) −0.0314)T and yˆ r = 1, xr1 , xr2 , xr3 , xr4 xr5 (0.0018, 0, 0.1478(, 0, 0, 0.0149)T . ) c c c c c r u c r c Lasso-IR: yˆ l = yˆ c − ( yˆ randr yˆ r =r yˆr )+ yˆ , where yˆ = 1, x1 , x2 , x3 , x4 , x5 (0.0004, 2.4089, 0T .1345, 0.1800, −0.0492, T r −0.0314) and yˆ = 1, x1 , x2 , x3 , x4 x5 (0.0027, −0.3675, 0.1381, −0.0644, −0.0009, 0.0262) . MCM: (( ) ) 1, xl , xl , xl , xl , xl (0.0007, 0.4336, 0.0417, 0.0423, −0.0173, −0.0060)T , and ( 1u 2u 3u 4u 5u ) T ((1, x1 , x2 , x3 , x4 , x5 )(0.0007, 0.4336, 0.0417, 0.0423, −0.0173, −0.0060) ) 1, xl , xl , xl , xl , xl (0.0007, 0.4336, 0.0417, 0.0423, −0.0173, −0.0060)T , yˆ u = max ( 1u 2u 3u 4u 5u ) . 1, x1 , x2 , x3 , x4 , x5 (0.0007, 0.4336, 0.0417, 0.0423, −0.0173, −0.0060)T SCM: (( ) ) 1, xl , xl , xl , xl , xl (0, 0.7623, 0.1675, 0.1558, −0.0426, −0.0081)T , yˆ l = min ( 1u 2u 3u 4u 5u ) and T ((1, x1 , x2 , x3 , x4 , x5 )(0, 0.7623, 0.1675, 0.1558, −0.0426, −0.0081) ) 1, xl , xl , xl , xl , xl (0, 0.7623, 0.1675, 0.1558, −0.0426, −0.0081)T , yˆ u = max ( 1u 2u 3u 4u 5u ) . 1, x1 , x2 , x3 , x4 , x5 (0, 0.7623, 0.1675, 0.1558, −0.0426, −0.0081)T (
yˆ l = min
15
16 17 18
19
CCRJM: yˆ l = yˆ c − yˆ r and yˆ u = yˆ c + yˆ r , where yˆ c = 1, xc1 , xc2 , xc3 , xc4 , xc5 , xr1 , xr2 , xr3 , xr4 xr5
20
0.0017, −0.4812, 0.1175, 0.1142, −0.0193, −0.0424, −1.5441, 0.0983, −0.0524, 0.0122, −0.0238
( 21
0.0025, −1.3178, 0.1295, 0.0465, 0.0179, −0.0425, −0.3319, 0.1308, −0.0620, 0.0035, 0.0260
( 22 23 24
)T
)T
)
and yˆ r = 1, xc1 , xc2 , xc3 , xc4 , xc5 , xr1 , xr2 , xr3 , xr4 xr5
(
)
.
In addition, we compare the out-of-sample and in-sample OLS methods of Hu and He (2007) with our proposed method using a time-window of ten years. As shown in Table 19, the performance of these methods reveals the superiority of CCRJM Please cite this article in press as: Guo, J., Hao, P., Constrained center and range joint model for interval-valued symbolic data regression. Computational Statistics and Data Analysis (2017), http://dx.doi.org/10.1016/j.csda.2017.06.005.
COMSTA: 6479
J. Guo, P. Hao / Computational Statistics and Data Analysis xx (xxxx) xxx–xxx
31
Fig. 9. Interval variables for the stock market value.
Table 19 Performance of the methods considered with respect to the stock market value data set. Methods
RMSEh
RMSEl
RMSEu
AR
CCRM Lasso-IR MCM SCM CCRJM
0.0072 0.0071 0.0105 0.0078 0.0061
0.0038 0.0038 0.0045 0.0081 0.0028
0.0065 0.0064 0.0092 0.0069 0.0058
0.5902 0.5810 0.1479 0.5105 0.5934
Out-of-sample OLS CCRJM
0.0133 0.0123
0.0092 0.0067
0.0124 0.0115
0.2185 0.2862
In-sample OLS CCRJM
0.0153 0.0126
0.0095 0.0068
0.0142 0.0118
0.2346 0.3276
for every performance measure. So, we conclude that the CCRJM method produced a better solution in this real-life case study. 6. Conclusions The constrained center and range joint method (CCRJM) proposed in this paper represents a new method for fitting a linear regression model to symbolic interval-valued data. We considered many inequality constraints to ensure overlaps between the observed and predicted intervals of the dependent variable as well as the nonnegative interval radius of the predicted dependent variable. As such, we ensured the rationality of our results by making their lower boundaries less than or equal to the upper boundaries. We presented four goodness-of-fit measures (root mean square errors and accuracy rate), which are commonly used in interval-valued data regression analysis. The results of a series of Monte Carlo experiments demonstrate that the prediction Please cite this article in press as: Guo, J., Hao, P., Constrained center and range joint model for interval-valued symbolic data regression. Computational Statistics and Data Analysis (2017), http://dx.doi.org/10.1016/j.csda.2017.06.005.
1 2
3
4 5 6 7 8 9 10
COMSTA: 6479
32
J. Guo, P. Hao / Computational Statistics and Data Analysis xx (xxxx) xxx–xxx
15
performance of the CCRJM method is superior to that of the CCRM, Lasso-IR, MCM and SCM approaches, since the CCRJM method yielded a lower average root mean square error and higher accuracy rate than those of the other two methods in almost every circumstance proposed in this paper. Our statistical t-test results also support this conclusion. With respect to the Taobao seller credit and the stock market value interval data sets, the CCRJM method outperformed the CCRM, Lasso-IR, MCM and SCM approaches for all rational results. Thus, we suggest that the CCJRM method be used as a suitable strategy for fitting an interval-valued symbolic data regression model. Even though the model introduced in this paper demonstrated relatively good performance in Monte Carlo experiments and in a real-life case study, when the independent interval variables have lower dimensions, such as p = 1, our method did not perform well. The CCRJM method is suitable for situations in which independent interval variables have higher dimensions. Moreover, the adopted overlapping constraints lead to deleterious results in the case of a limited number of anomalous objects far from the bulk of the data. In future research, we will focus on model constraints. To avoid an oversized prediction region, in this paper, we required only that there be crossovers between the predicted and observed intervals. Perhaps the use of stricter constraints would yield better results than those presented here. Moreover, the computation time for every solution process is relatively long. To shorten the calculation time, a challenging task for future research is to identify another approach for solving this model.
16
Uncited references
1 2 3 4 5 6 7 8 9 10 11 12 13 14
17
18
Harnett (1982), Johnson and Wichern (2002)
Acknowledgments
21
The research for this paper was supported by the National Natural Science Foundation of China, grants 71271147 and 71671121. This work was made possible by the facilities of the College of Management and Economic of Tianjin University, China.
22
Appendix
19 20
23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51
The Matlab software for computing the solution of the regression model: % Constrained center and range joint model for interval-valued symbolic data regression % Note: Ccrjm requires lsqlin (MATLAB Optimization Toolbox) % Input: % ym: vector of the centers of the independent variable % yr: vector of the radii of the independent variable % xm: matrix of the centers of the independent variables % xr: matrix of the radii of the independent variables % Output: % bc: vector of the estimated coefficients for prediction of centers % br: vector of the estimated coefficients for prediction of radii % yM: vector of the estimated centers of the dependent variable % yR: vector of the estimated radii of the dependent variable % B: vector of the estimated coefficients for prediction of centers and radii % n: the row of xm and xr % p: the column of xm and xr % Note: row of xm and xr, as well as column of them, must be the same number function[bc,br,yM,yR,B,n,p]=Ccrjm(ym,yr,xm,xr) [n,p]=size(xm); % Extract the row and column of xm x=[ones(n,1),xm,xr]; % Construct the centers and radii of the independent variables matrix x options=optimset(’Display’,’off’,’LargeScale’,’off’); ZERO=zeros(n,1);ZEROs=zeros(n,2*p+1);C=[x ZEROs;ZEROs x];d=[ym;yr];G=[-x -x;x -x;ZEROs -x];h=[(yr-ym);(ym+yr); ZERO]; B=lsqlin(C,d,G,h,[],[],[],[],[],options); for i=1:(2*p+1) bc(i)=B(i);br(i)=B(i+2*p+1); end yM=x*bc’; % Compute the vector of the estimated centers of the dependent variable yR=x*br’; % Compute the vector of the estimated radii of the dependent variable Please cite this article in press as: Guo, J., Hao, P., Constrained center and range joint model for interval-valued symbolic data regression. Computational Statistics and Data Analysis (2017), http://dx.doi.org/10.1016/j.csda.2017.06.005.
COMSTA: 6479
J. Guo, P. Hao / Computational Statistics and Data Analysis xx (xxxx) xxx–xxx
33
References Ahn, J., Peng, M., Park, C., Jeon, Y., 2012. A resampling approach for interval-valued data regression. Stat. Anal. Data Min. 5, 336–348. Arroyo, J., Munoz San Roque, A.M., Maté, C., Sarabia, A., 2007. Exponential smoothing methods for interval time series. In: Proceedings of the 1st European Symposium on Time Series Prediction, pp. 231–240. Bargiela, A., Pedrycz, W., Nakashima, T., 2007. Multiple regression with fuzzy data. Fuzzy Sets and Systems 158 (19), 2169–2188. Belsley, D.A., Kuh, E., Welsch, R.E., 1980. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. John Wiley, Hoboken, New Jersey. Billard, L., 2007. Dependencies and variation components of symbolic interval-valued data. In: Brito, P., Cucumel, G., Bertrand, P., De Carvalho, F.A.T. (Eds.), Selected Contributions in Data Analysis and Classification. Springer-Verlag, Berlin, pp. 3–13. Billard, L., Diday, E., 2000. Regression Analysis for Interval-Valued Data. Springer, Berlin, Heidelberg, pp. 369–374. Billard, L., Diday, E., 2002. Symbolic Regression Analysis. Springer, Berlin, Heidelberg, pp. 281–288. Billard, L., Diday, E., 2003. From the statistics of data to the statistics of knowledge: symbolic data analysis. J. Amer. Statist. Assoc. 98, 470–487. Blanco-Fernandez, A., Colubi, A., Gonzalez-Rodriguez, G., 2012. Confidence sets in a linear regression model for interval data. J. Statist. Plann. Inference 142, 1320–1329. Blanco-Fernandez, A., Corral, N., Gonzalez-Rodriguez, G., 2011. Estimation of a flexible simple linear model for interval data based on set arithmetic. Comput. Statist. Data Anal. 55, 2568–2578. Bock, H.H., Diday, E., 2000. Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information from Complex Data. Springer. Boukezzoula, R., Galichet, S., Bisserier, A., 2011. A midpoint-radius approach to regression with interval data. Internat. J. Approx. Reason. 52 (9), 1257–1271. Brito, P., Noirhomme-Fraiture, M., 2011. Far beyond the classical data models: symbolic data analysis. Stat. Anal. Data Min. 4, 157–170. Buja, A., Hastie, T., Tibshirani, R., 1989. Linear smoothers and additive models. Ann. Statist. 17, 453–555. Chen, L.H., Hsueh, C.C., 2007. A mathematical programming method for formulating a fuzzy regression model based on distance criterion. IEEE Trans. Syst. Man Cybern. B 37 (3), 705–712. Chen, L.H., Hsueh, C.C., 2009. Fuzzy regression models using the least-squares method based on the concept of distance. IEEE Trans. Fuzzy Syst. 17 (6), 1259–1272. Chen, N., Roll, R., Ross, S., 1986. Economic forces and the stock market. J. Bus. 59, 383–403. Chiun-How, Kao, Junji, Nakano, Sheau-Hue, Shieh, Yin-Jing, Tien, Han-Ming, Wu, Chuan-kai, Yang, Chun-houh, Chen, 2014. Exploratory data analysis of interval-valued symbolic data with matrix visualization. Comput. Statist. Data Anal. 79, 14–29. Chuang, C.C., 2008. Extended support vector interval regression networks for interval input–output data. Inform. Sci. 178 (3), 871–891. Débora, C.C., Francisco, Ap., Rodrigues, 2016. A survey on symbolic data-based music genre classification. Expert Syst. Appl. 60, 190–210. De Carvalho, F.A.T., De Souza, R.M.C.R., Chavent, M., et al., 2006. Adaptive hausdorff distances and dynamic clustering of symbolic interval data. Pattern Recognit. Lett. 27 (3), 167–179. Diday, E., Noirhomme-Fraiture, M., 2008. Symbolic Data Analysis and the SODAS Software. Wiley, Chichester. Domingues, M.A.O., De Souza, R.M.C.R., Cysneiros, F.J.A., 2010. A robust method for linear regression of symbolic interval data. Pattern Recognit. Lett. 31, 1991–1996. Gill, P.E., Murray, W., Wright, M.H., 1981. Practical Optimization. Academic Press, London. Giordani, P., 2015. Lasso-constrained regression analysis for interval-valued data. Adv. Data Anal. Classif. 9 (1), 5–19. Harnett, D.L., 1982. Statistical Methods. Addison-Wesley, Reading, Mass. Hawkins, D.M., 1980. Identification of Outliers. Chapman and Hall, New York. Hladík, M.,Černý, M., 2012. Interval regression by tolerance analysis approach. Fuzzy Sets and Systems 193, 85–107. Hojati, M., Bector, C.R., Smimou, K., 2005. A simple method for computation of fuzzy linear regression. European J. Oper. Res. 166 (1), 172–184. Hu, C., He, L.T., 2007. An application of interval methods to stock market forecasting. Reliab. Comput. 13 (5), 423–434. Johnson, R.A., Wichern, D.W., 2002. Applied Multivariate Statistical Analysis. Prentice Hall, Upper Saddle River, N.J. Lawson, C.L., Hanson, R.J., 1974. Solving Least Squares Problem. Prentice-Hall, New York. Lim, C., 2016. Interval-valued data regression using nonparametric additive models. J. Korean Stat. Soc. 45 (3), 358–370. Lima Neto, E.A., De Carvalho, F.A.T., 2008. Centre and range method for fitting a linear regression model to symbolic interval data. Comput. Statist. Data Anal. 52 (3), 1500–1515. Lima Neto, E.A., De Carvalho, F.A.T., 2010. Constrained linear regression models for symbolic interval-valued variables. Comput. Statist. Data Anal. 54 (2), 333–347. Maia, A.L.S., De Carvalho, F.A.T., 2011. Holt’s exponential smoothing and neural network models for forecasting interval-valued time series. Int. J. Forecast 27 (3), 740–759. Maia, A.L.S., De Carvalho, F.A.T., Ludermir, T.B., 2008. Forecasting models for interval valued time series. Neurocomputing 71 (16), 3344–3352. Pardoe, I., 2012. Applied Regression Modeling. John Wiley, Hoboken, New Jersey. Roberta, A.A., Fagundes, Renata, de Souza, M.C.R., Cysneiros, Francisco José A., 2014. Interval kernel regression. Neurocomputing 128 (27), 371–388. Rousseeuw, P.J., Leroy, A.M., 1987. Robust Regression and Outlier Detection. John Wiley, New York. Sakawa, M., Yano, H., 1992. Multiobjective fuzzy linear regression analysis for fuzzy input–output data. Fuzzy Sets and Systems 47 (2), 173–181. San Roque, A.M., Maté, C., Arroyo, J., Sarabia, á, 2007. iMLP: applying multi-layer perceptrons to interval-valued data. Neural Process. Lett. 25 (2), 157–169. Savic, D.A., Pedrycz, W., 1991. Evaluation of fuzzy linear regression models. Fuzzy Sets and Systems 39 (1), 51–63. Sinova, B., Colubi, A., Gil, M.Á., González-Rodríguez, G., 2012. Interval arithmetic-based simple linear regression between interval data: Discussion and sensitivity analysis on the choice of the metric. Inform. Sci. 199, 109–124. Tanaka, H., Hayashi, I., Watada, J., 1989. Possibilistic linear regression analysis for fuzzy data. European J. Oper. Res. 40 (3), 389–396. Tanaka, H., Ishibuchi, H., 1991. Identification of possibilistic linear systems by quadratic membership functions of fuzzy parameters. Fuzzy Sets and Systems 41 (2), 145–160. Tanaka, H., Uejima, S., Asai, K., 1982. Linear regression analysis with fuzzy model. IEEE Trans. Syst. Man Cybern. 12, 903–907. Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Stat. Methodol. 58, 267–288. Trutschnig, W., Gonzalez-Rodriguez, G., Colubi, A., Gil, M.A., 2009. A new family of metrics for compact, convex (fuzzy) sets based on a generalized concept of mid and spread. Inform. Sci. 179, 3964–3972. Xiong, T., Bao, Y., Hu, Z., 2014. Multiple-output support vector regression with a firefly algorithm for interval-valued stock price index forecasting. Knowl.Based Syst. 55, 87–100. Xiong, T., Li, C., Bao, Y., Hu, Z., Zhang, L., 2015. A combination method for interval forecasting of agricultural commodity futures prices. Knowl.-Based Syst. 77, 92–102. Xu, W., 2010. Symbolic Data Analysis: Interval-Valued Data Regression (Ph.D. thesis). University of Georgia.
Please cite this article in press as: Guo, J., Hao, P., Constrained center and range joint model for interval-valued symbolic data regression. Computational Statistics and Data Analysis (2017), http://dx.doi.org/10.1016/j.csda.2017.06.005.
1
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71