Outlier elimination using granular box regression


Information Fusion 27 (2016) 161–169


M. Reza Mashinchi (a), Ali Selamat (a,b,*), Suhaimi Ibrahim (c), Hamido Fujita (d)

(a) Faculty of Computing, Universiti Teknologi Malaysia, Johor, Malaysia
(b) UTM-IRDA Digital Media Center of Excellence, Universiti Teknologi Malaysia, Johor, Malaysia
(c) Advance Informatics School, Universiti Teknologi Malaysia, Kuala Lumpur, Malaysia
(d) Intelligent Software Laboratory, Iwate Prefectural University (IPU), Iwate, Japan

Article history: Received 1 October 2014; Received in revised form 8 April 2015; Accepted 12 April 2015; Available online 20 April 2015.

Keywords: Granular box regression; Outlier elimination; Noisy data; Data simplification; Data abstraction

Abstract

A regression method should fit a curve to a data set irrespective of outliers. This paper modifies granular box regression approaches to deal with data sets containing outliers. Each approach incorporates a three-stage procedure comprising granular box configuration, outlier elimination, and linear regression analysis. The first stage investigates two objective functions, each applying a different penalty scheme, on boxes or on instances. The second stage investigates two methods of outlier elimination, after which linear regression is performed in the third stage. The performance of the proposed granular box regressions is investigated in terms of the volume of the boxes, the insensitivity of the boxes to outliers, the elapsed time for box configuration, and the regression error. The proposed approach offers a better linear model, with a smaller error, on the given data sets containing various rates of outliers. The investigation shows the superiority of applying the penalty scheme on instances.

1. Introduction

Simplifying and abstracting help us understand data and trace its general pattern or trend. While the term "abstracting" is associated with studies in artificial intelligence, the term "granularity" is its synonym in soft computing studies [1–4]. Granularity often aims at reducing the complexity of data that increases the processing cost, mostly where uncertainties are involved. Certain practices therefore motivate the studies on granularity, i.e., clarity, low-cost approximation, and tolerance of uncertainty. One application of these practices, through understanding the data, is identifying or eliminating anomalies, known as outliers. Granulated data can benefit computation in the following ways. (i) To understand the data, by making its complexity transparent through reduced-size data, known as granules. When a method performs an estimation based on granular data, as a requisite, the outcome should be more accurate than using the original complex data. (ii) To reduce the cost of data analysis, by avoiding complex tools that run through the data for insight discovery [2,5–7]; thus, a non-expert in data mining can also make sense of it. (iii) To increase the power of estimation and the capability of dealing with uncertainty. A data point, as a part of a


granule, does not represent only a single observation but, rather, a group of data. However, besides the noted advantages of granulation, methods are required to sharpen the transparency of the data. Granular box regression analysis [8,9] carries this out by detecting the outliers in the data. It finds the correlation between the dependent and independent variables using hyper-dimensional interval numbers known as boxes. In granular box regression (GBR), every instance in the data set affects the size and coordinates of the boxes. As a result, the approach becomes sensitive to outliers, which is an issue. To resolve this, we propose variations of granular box regression based on the subset of data they process, and we then investigate their performance in the presence of outliers in a data set. There are two motivations for the proposed variations: first, to simplify a data set containing numerous data points, thereby helping a non-expert by clarifying the relationship between the dependent and independent variables; second, to study the performance of each variation of granular box regression in the presence of outliers. An outlier is an anomalous object that is atypical of the remaining data. It deviates from the other objects so much that it is suspected of being generated by a different mechanism [10]. The subject of outliers is either the elimination of a disturbance, such as noise reduction [10–14], or an interest in detection, such as crime detection [15–23]. This paper focuses on the former view. Different approaches [10,11,15,24] have studied outliers; this paper differentiates itself by applying a granular box approach and the elimination of


the outliers. To measure the goodness of the applied approach, either the boxes or the relationships between the boxes can indicate the quality of the box configuration on the data. In the case of box measurement, the approach should minimize the overall volumes to reduce the complexity of the data, as intended; in the case of measuring the relationships of the boxes, an approach should build a coherent relationship, similar to the true function, by regression analysis. A possible approach to address the former issue is to employ a genetic algorithm (GA) [25–27] to find the optimal volumes gradually. Performing the box configuration based on GA builds the relationships between the boxes to reduce the complexity of the data and exclude the outliers. As a result, the configured boxes represent the simplified original data.

This paper is organized in seven sections. Section 2 reviews granular box regression and explains its notions. Section 3 explains the three stages of the proposed framework to configure the boxes, eliminate the outliers, and fit the curve, where box-based penalization (BP) and instance-based penalization (IP) are proposed for the box configuration, and clean- and candidate-based methods are proposed for the elimination of outliers. Section 4 describes the data preparation and the measure of model goodness. Section 5 reveals the results on six data sets, each with two rates of outliers. Then, Section 6 gives detailed analyses in three parts to investigate the regression analysis and the box configuration with respect to the effect of the dimensionality of the data and the rate of outliers on each method. Section 7 concludes the achieved results and addresses future works for each method.

2. Granular box regression

A regression approach should eliminate the outliers prior to fitting a model to the data; otherwise, it may fit the outliers and produce a wrong interpretation. Granular box regression (GBR) is an inclusive approach that detects outliers in every dimension of the data. Compared with classical regression analysis (CRA), which operates only on the response variable [27–29], GBR also operates on the predictors to detect the outliers. Where the CRA approach minimizes the sum of distances between the actual and the estimated values, the GBR approach minimizes the sum of the volumes of all boxes. In general, GBR is a granular regression approach based on fuzzy granulation generalization, f.g-generalization, to find the relationships of boxes [2,30–33]. Fig. 1 shows the fuzzy graph representation of f.g-generalization. GBR offers three methods [9]: borderline, residual, and average distance. The borderline method examines whether the box volume decreases significantly when a potential outlier on the box border is eliminated. The residual method examines whether the distance between the residual error of a potential outlier and the

average residual error of all data is significantly large, where the residual error is the distance between the estimated and the true value. The average distance method examines whether the distance between the individual distance of a potential outlier and the aggregated distance is significantly large, where the individual distance is the average distance of a data point to all others within a box, and the aggregated distance is the average distance of all data points to each other within a box [8]. Eqs. (1) and (2) define the individual and aggregated distances, where All is the number of data points in a box, $D_i$ and $D_j$ are the ith and jth instances, and $D_{k,Box}$ is the kth instance in the box:

$individual\_Distance_{k,Box} = \dfrac{\sum_{i=1}^{All-1} \lVert D_i - D_{k,Box} \rVert}{All - 1}$  (1)

$aggregated\_Distance_{Box} = \dfrac{\sum_{i=1}^{All} \sum_{j=i+1}^{All} \lVert D_i - D_j \rVert}{\sum_{i=1}^{All-1} (All - i)}$  (2)
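For readers who prefer code, a minimal sketch of Eqs. (1) and (2) follows; the function names and the use of Euclidean norms over NumPy arrays are illustrative assumptions rather than part of the original formulation.

```python
import numpy as np

def individual_distance(box_data: np.ndarray, k: int) -> float:
    """Eq. (1): average distance of the k-th instance to all other instances in the box."""
    others = np.delete(box_data, k, axis=0)
    return float(np.mean(np.linalg.norm(others - box_data[k], axis=1)))

def aggregated_distance(box_data: np.ndarray) -> float:
    """Eq. (2): average pairwise distance between all instances in the box."""
    n = len(box_data)
    dists = [np.linalg.norm(box_data[i] - box_data[j])
             for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(dists))
```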

The earlier approach, P, for GBR [8] examines the box volume based on the borderline method to identify the actual outliers among the potential outliers. In contrast, this paper proposes objectives based on the average distance method. In both approaches, P and ours, the mathematical formulation for obtaining the optimized coordinates of the n boxes for an m-dimensional problem is:

$\text{Minimize} \left( \sum_{k=1}^{n} V(B_k) \right)$  (3)

where V calculates the volume of a given box $B_k$; note that $B_k$ is the kth box, which four vertices can represent in a two-dimensional space. Both approaches fix the number of boxes before performing the box regression analysis. The larger the number of boxes, the higher the resolution of the regression data; conversely, the smaller the number of boxes, the more abstract the derived relationships. Studying the optimal number of boxes is in the scope of future work, though it significantly affects GBR.

Fig. 1. Fuzzy graph representation of f.g-generalization (fuzzy graph X versus crisp function y).
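As a sketch of the quantity minimized in Eq. (3), a box can be held as a pair of lower and upper corner vectors. The Box class below and its field names are illustrative assumptions; later sketches in this paper reuse them.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Box:
    lower: np.ndarray  # lower corner, one value per dimension
    upper: np.ndarray  # upper corner, one value per dimension

    def volume(self) -> float:
        """V(B_k): product of the side lengths of the hyper-interval."""
        return float(np.prod(np.maximum(self.upper - self.lower, 0.0)))

def total_volume(boxes: list[Box]) -> float:
    """The objective of Eq. (3): sum of the volumes of all boxes."""
    return sum(b.volume() for b in boxes)
```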

3. Proposed granular box regression

An overall view of our proposed GBR is shown in Fig. 2. It illustrates the idea of the penalty scheme, where the box configuration represents the simplification of the data by clean instances or by candidates of the boxes. Concretely, Fig. 3 shows the procedure of performing GBR in three stages.

Fig. 2. An overall view of the proposed penalty-scheme GBRs.


Fig. 3. Steps of the proposed approach for GBR: Stage 1, granular box configuration (penalty on data or penalty on boxes); Stage 2, elimination of outliers (obtain clean data or obtain candidates); Stage 3, curve fitting by linear regression.

The stages are: (i) apply a granular box configuration, (ii) exclude the outliers based on the dispersion of the data in each box, and (iii) apply linear regression analysis on the result of the second stage, i.e., on the remaining data or on a subset of the remaining data. The following subsections explain each stage.

3.1. First stage: box configuration

In the first stage of the proposed GBRs, we investigate the application of four objectives: (i) S, which is sensitive to outliers [8]; (ii) P, the original objective proposed by Peters [8]; and (iii) and (iv) BP and IP, our proposed objectives. Eqs. (4)–(7) express the objective functions of each approach, with the penalty functions $Instance^{penalty}$ and $Box_b^{penalty}$ for IP and BP given in Eqs. (8) and (9), respectively. Note that Eqs. (4)–(7) are used in conjunction with the minimization of the total volume of the boxes (see Figs. 4 and 5).

$S = \mathrm{Obj}_{Sensitive} = \sum_{b=1}^{m} Box_b^{volume}$  (4)

$P = \mathrm{Obj}_{Peters} = \left( \sum_{b=1}^{m} Box_b^{volume} - \sum_{b=1}^{m} Box_b^{volume}\Big|_{\bigcup_{i=1}^{r} Outlier_i \notin Box_b} \right) \wedge \left( \bigcup_{i=1}^{r} Outlier_i \subseteq Outliers_{actual} \right)$  (5)

$IP = \mathrm{Obj}_{Pro1} = Instance^{penalty} + \sum_{b=1}^{m} Box_b^{volume}$  (6)

$BP = \mathrm{Obj}_{Pro2} = \sum_{b=1}^{x} Box_b^{penalty} + \sum_{b=1}^{m} Box_b^{volume}$  (7)

$Instance^{penalty} = Const_1 \left( \dfrac{\sum_{i=1}^{All} Instance_i - \sum_{b=1}^{n} \sum_{i=1}^{m} Instance_i^{Box_b}}{\sum_{i=1}^{All} Instance_i} \times 100 \right)$  (8)

$Box_b^{penalty} = Const_2 \left( \dfrac{\sum_{i=1}^{All} Instance_i - \sum_{i=1}^{m} Instance_i^{Box_b}}{\sum_{i=1}^{All} Instance_i} \times 100 \right)$  (9)

where Const1 and Const2 are pre-defined percentages for IP and BP, respectively. We define them intuitively by fitting them to the context of the proposed models. We recall that S is unsuspicious of the outliers, P suspects every potential outlier in order to find the actual outliers, IP penalizes the instances outside of the boxes, and BP penalizes the configured boxes. The penalty values are computed based on pre-determined values: Const1 defines the required minimum number of instances inside all boxes, and Const2 defines the minimum number of instances each box must confine. Therefore, the number of instances violating Const1 or Const2 results in the penalty values $Instance^{penalty}$ and $Box_b^{penalty}$, respectively. Fig. 6 shows an example of a granular box configuration. Completing the granular box configuration allows us to perform the second stage of GBR to obtain the clean data or the candidates of the boxes.
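The penalty terms of Eqs. (8) and (9) count instances left uncovered by the boxes. A minimal sketch follows, reusing the Box class and total_volume from the earlier sketch; the membership test, the Const1/Const2 handling, and the way the penalties are added to the total volume (Eqs. (6) and (7)) are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def in_box(x: np.ndarray, box: Box) -> bool:
    """True if instance x lies inside the axis-aligned box."""
    return bool(np.all(x >= box.lower) and np.all(x <= box.upper))

def instance_penalty(data: np.ndarray, boxes: list[Box], const1: float) -> float:
    """Eq. (8)-style penalty: percentage of instances covered by no box, scaled by Const1."""
    uncovered = sum(1 for x in data if not any(in_box(x, b) for b in boxes))
    return const1 * (uncovered / len(data)) * 100.0

def box_penalty(data: np.ndarray, boxes: list[Box], const2: float) -> float:
    """Eq. (9)-style penalty: for each box, percentage of instances it does not confine, scaled by Const2."""
    penalty = 0.0
    for b in boxes:
        outside = sum(1 for x in data if not in_box(x, b))
        penalty += const2 * (outside / len(data)) * 100.0
    return penalty

def obj_ip(data, boxes, const1):   # Eq. (6)
    return instance_penalty(data, boxes, const1) + total_volume(boxes)

def obj_bp(data, boxes, const2):   # Eq. (7)
    return box_penalty(data, boxes, const2) + total_volume(boxes)
```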

Fig. 4. An example of box configurations by the S method on the two-dimensional Peters' data set [9] with a low rate of outliers.

Fig. 5. An example of box configurations by the IP method on the two-dimensional Peters' data set [9] with a low rate of outliers.

Fig. 6. An example of box configurations by the BP method on the two-dimensional Peters' data set [9] with a low rate of outliers.

3.2. Second stage: elimination of outliers

In the second stage of the proposed GBRs, we obtain the clean data set [34] by applying Algorithm 1 to the instances confined in each box. To compute the distance of each instance, we measure the average distance of all data to the median point. We exclude an instance if


its distance is greater than the average, considering it dispersed from the other instances. Separately, we apply Algorithm 2 to generate the candidate of each box.

Algorithm 1 (The procedure to obtain the clean data).
(i) For each box:
  1.1. Compute the individual average distance, as in Eq. (1).
  1.2. Compute the aggregated average distance, as in Eq. (2).
  1.3. Exclude the instances whose individual average distance is greater than the aggregated average distance.
(ii) End

Algorithm 2 (The procedure to obtain the candidate of a box).
(i) For each box:
  1.1. Compute the candidate (centre) of the box:
  $Centre_{box} = \bigcup_{i=1}^{d} \dfrac{\sum_{j=1}^{2} x_j}{2}$, where d is the dimensionality and $i \geq 1$, i.e., the per-dimension midpoint of the two box borders.
(ii) End
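A compact rendering of Algorithms 1 and 2, assuming the distance helpers sketched after Eqs. (1) and (2) and the Box class from Section 2; returning arrays rather than storing state is an illustrative choice.

```python
import numpy as np

def clean_box(box_data: np.ndarray) -> np.ndarray:
    """Algorithm 1: keep only instances whose individual average distance
    does not exceed the aggregated average distance of the box."""
    threshold = aggregated_distance(box_data)
    keep = [k for k in range(len(box_data))
            if individual_distance(box_data, k) <= threshold]
    return box_data[keep]

def box_candidate(box: Box) -> np.ndarray:
    """Algorithm 2: the candidate of a box is its centre,
    i.e. the per-dimension midpoint of the two borders."""
    return (box.lower + box.upper) / 2.0
```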


3.3. Third stage: curve fitting

In the third stage of the proposed GBRs, we apply a linear regression analysis (LR) either on the clean data or on the candidates of the boxes, called LR-on-clean and LR-on-candidates, respectively. We modified four possible granular box regressions, as given in Table 1. To optimize Eqs. (6) and (7), ObjPro1 and ObjPro2, we apply the proposed method in Algorithm 3.

Algorithm 3 (The procedure of the granular box regression with outlier elimination).
Stage 1:
1. Define the number of boxes.
2. Set the control variables for the box configuration and the genetic algorithm, detailed in Table 3.
3. Conduct the box configuration on the complete data set:
  3.1. Apply GA to find the granular boxes with respect to objective function ObjPro1 or ObjPro2.
4. For each box:
  4.1. Find the clean data (clean-based method) or the candidate of the box (candidate-based method).
  4.2. Store the clean data set or the candidates of the boxes.
Stage 2: Apply linear regression analysis on the clean data set (LR-on-clean) or on the candidates (LR-on-candidates).

4. Preparation and computing goodness of model

We investigate the performance of all variations of granular box regression given in Table 1. We conducted 100 runs to report the average and the standard deviation for each configuration. As given in Table 2, we used the following six data sets: micro-economic data for the Customer Price Index (CPI) of Germany, the artificial data set used in [9], two data sets generated by Eqs. (10) and (11), and the servo and combined cycle power plant (CCPP) data sets. We generated 1000 instances to produce the synthetic data sets using Eqs. (10) and (11). It is worth mentioning that Eq. (10) is a well-known approach for comparing two methods [8,35]:

$\begin{cases} x + 5 + e + o, & \text{with probability of } outlier\ percentage \\ x + 5 + e, & \text{with probability of } (1 - outlier\ percentage) \end{cases}$  (10)

$\begin{cases} y + x + 5 + e + o, & \text{with probability of } outlier\ percentage \\ y + x + 5 + e, & \text{with probability of } (1 - outlier\ percentage) \end{cases}$  (11)

In Eqs. (10) and (11), e is a random error with a normal distribution and o is an outlier value; the independent variables x and y are uniformly distributed on the interval [0, 10]. We artificially introduced the outliers by randomly affecting instances. We reproduced each original data set by affecting 30% and 70% of the total instances with outliers. As a result, we generated 12 configured data sets to test the performance of the GBRs. We performed the GBRs with the control variables given in Table 3.
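A sketch of the generator behind Eq. (10) under the stated assumptions (x uniform on [0, 10], e Gaussian); the noise scale and the outlier offset o are not specified in the paper and are chosen here only for illustration.

```python
import numpy as np

def generate_eq10(n=1000, outlier_rate=0.3, noise_std=0.5, outlier_offset=20.0, seed=0):
    """Synthetic data per Eq. (10): target = x + 5 + e (+ o for a random fraction of instances)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 10.0, size=n)
    e = rng.normal(0.0, noise_std, size=n)
    is_outlier = rng.random(n) < outlier_rate
    target = x + 5.0 + e + np.where(is_outlier, outlier_offset, 0.0)
    return np.column_stack([x, target]), is_outlier
```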

To measure the goodness of a model, we compute the residual error of the original data against the estimated points. We apply Eq. (12), together with Eqs. (13) and (14), to investigate the superiority of a linear model E1 over a model E2, such as the outlier-sensitive estimation S:

$Superiority_{E1} = Sup_{E1} = \dfrac{G_{E2} - G_{E1}}{G_{E2}} \times 100$  (12)

Table 2. List of data sets used in this paper.

| Data set | Variables | Instances | Nature | Repository/Reference |
| Artificial | 2 | 30 | Synthetic | [9] |
| Functional #1 – generated by Eq. (10) | 2 | 1000 | Synthetic | Eq. (10) |
| Functional #2 – generated by Eq. (11) | 3 | 1000 | Synthetic | Eq. (11) |
| Micro-economic data for the Customer Price Index (CPI) of Germany | 3 | 78 | Real-world | DSB (a) |
| Servo | 5 | 167 | Real-world | UCI repository (b) |
| Combined Cycle Power Plant (CCPP) | 5 | 9569 | Real-world | UCI repository |

(a) Deutsches Statistisches Bundesamt (http://www.destatis.de). (b) University of California Irvine (http://archive.ics.uci.edu).

Table 1. Variations of granular box regression.

| Name | Stage one | Stage two | Stage three |
| IP-clean | ObjPro1 | Clean | LR-on-clean |
| IP-candidate | ObjPro1 | Candidate | LR-on-candidate |
| BP-clean | ObjPro2 | Clean | LR-on-clean |
| BP-candidate | ObjPro2 | Candidate | LR-on-candidate |


Table 3. Configuration of control variables for the granular box regression and the genetic algorithm.

| Software | Model | Control variable | Value/choice |
| GUI MATLAB-9.3 | Genetic algorithm | Population size | 50 |
| | | Stopping condition | Exceeds 200 generations |
| | | Iteration | 100 |
| | | Selection function | Stochastic uniform |
| | | Crossover function | Scattered |
| | | Mutation function | Gaussian |
| | Granular box regression | Number of boxes | 3 |
| | | Minimum data inside the box | 30% |
| | | Maximum data outside the box | 30% |
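To make the role of the Table 3 control variables concrete, the sketch below runs a plain generational GA (population 50, 200 generations, uniform "scattered" crossover, Gaussian mutation) over the flattened box coordinates, minimizing the IP objective sketched earlier. It is a simplified, illustrative stand-in for the MATLAB GA toolbox actually used; the selection scheme and mutation scale are assumptions.

```python
import numpy as np

def decode(vec, n_boxes, dim):
    """Interpret a flat chromosome as n_boxes axis-aligned boxes."""
    boxes = []
    for g in vec.reshape(n_boxes, 2 * dim):
        lo, hi = np.minimum(g[:dim], g[dim:]), np.maximum(g[:dim], g[dim:])
        boxes.append(Box(lo, hi))
    return boxes

def ga_box_configuration(data, n_boxes=3, pop_size=50, generations=200,
                         const1=1.0, mut_sigma=0.5, seed=0):
    rng = np.random.default_rng(seed)
    dim = data.shape[1]
    lo, hi = data.min(axis=0), data.max(axis=0)
    pop = rng.uniform(np.tile(lo, 2 * n_boxes), np.tile(hi, 2 * n_boxes),
                      size=(pop_size, 2 * dim * n_boxes))

    def fitness(vec):
        return obj_ip(data, decode(vec, n_boxes, dim), const1)

    for _ in range(generations):
        scores = np.array([fitness(ind) for ind in pop])
        order = np.argsort(scores)
        parents = pop[order[: pop_size // 2]]            # truncation selection (simplified)
        children = []
        while len(children) < pop_size - len(parents):
            a, b = parents[rng.integers(len(parents), size=2)]
            mask = rng.random(a.shape) < 0.5              # scattered (uniform) crossover
            child = np.where(mask, a, b) + rng.normal(0, mut_sigma, a.shape)  # Gaussian mutation
            children.append(child)
        pop = np.vstack([parents, children])
    best = min(pop, key=fitness)
    return decode(best, n_boxes, dim)
```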

The goodness of a model E is measured as the mean absolute error between the original and estimated values:

$Goodness\ of\ Model_E = G_E = \mathop{\mathrm{Average}}_{i=1}^{n} \left| value_i^{true} - value_i^{est} \right|$  (13)

where n is the number of instances, $value_i^{true}$ is the value of the ith original instance prior to the noise affection, and $value_i^{est}$ is the value of the ith outcome of the estimated function. Note that a superiority of zero indicates that the two models perform similarly. The if-then statement in Eq. (14) distinguishes three cases of performance:

$Superiority\ variations = \begin{cases} \text{if } Sup_{E1} < 0, & \text{then } E2 \text{ is superior to } E1 \\ \text{if } Sup_{E1} = 0, & \text{then } E1 \text{ and } E2 \text{ perform similarly} \\ \text{if } Sup_{E1} > 0, & \text{then } E1 \text{ is superior to } E2 \end{cases}$  (14)
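A direct transcription of Eqs. (12)–(14), assuming the goodness measure is the mean absolute error between the noise-free values and the fitted values:

```python
import numpy as np

def goodness(values_true: np.ndarray, values_est: np.ndarray) -> float:
    """Eq. (13): mean absolute error of the estimated values against the original (pre-noise) values."""
    return float(np.mean(np.abs(values_true - values_est)))

def superiority(g_e1: float, g_e2: float) -> float:
    """Eq. (12): positive means model E1 is superior to E2, negative the opposite (Eq. (14))."""
    return (g_e2 - g_e1) / g_e2 * 100.0
```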

5. Results

To produce the results of the proposed GBRs, we performed linear regression analysis on the clean data and on the candidates of the three configured boxes. The results are averages of 100 runs in terms of residual error. Table 4 gives the results based on the rates of outliers over each data set. The comparisons investigate six methods against their equivalent S approach, as each of the three GBRs performs two methods, either based on clean data or based on candidates of boxes. Tables 4 and 5 reveal that IP, BP and P decrease the regression error compared with S. We further analyzed the clean-based methods of the GBRs against other regression approaches for the servo and CCPP data sets. Tables 6 and 7 compare the results based on [36–38]. The compared methods are drawn from different categories, i.e., function-based methods such as LMS and SMOReg; lazy-learning algorithms such as K*; meta-learning algorithms such as BREP; rule-based algorithms such as M5R; tree-based learning algorithms such as M5P and REP; and kernel-based methods such as GAM, HKL, GP SE-ARD, GP Additive, and SS.

6. Analyses of proposed granular box regression

This section provides analyses of the proposed GBRs based on the 12 variations of the six data sets, i.e., the artificial data set [9], the two data sets generated by Eqs. (10) and (11), CPI, servo, and CCPP. The regression analysis and the box configuration are examined with the following measurements: (i) for regression: residual error, rate of outliers, and statistical analyses; and (ii) for box configuration: elapsed time, volume of boxes, and standard deviation of box volumes.

Table 4. GBR errors (mean of absolute errors, MAE) on the synthetic data sets; in the original, bold indicates the lowest error among the clean and candidate methods. Low/high refer to the rate of outliers.

| Approach | Functional-Eq. (10), low | Functional-Eq. (10), high | Functional-Eq. (11), low | Functional-Eq. (11), high | Peters [9], low | Peters [9], high |
| S-clean | 0.5187 | 1.2226 | 0.3299 | 0.7688 | 1.3403 | 2.4010 |
| S-candidate | 24.5074 | 13.3041 | 125.001 | 103.7303 | 21.2767 | 10.0269 |
| BP-clean | 0.4198 | 1.675 | 0.3292 | 0.7684 | 1.0039 | 1.8866 |
| BP-candidate | 22.2748 | 13.5846 | 35.9774 | 28.1761 | 13.7822 | 13.2625 |
| IP-clean | 0.2955 | 0.723 | 0.5096 | 2.1224 | 1.5812 | 1.9586 |
| IP-candidate | 4.2443 | 5.2792 | 50.0692 | 51.3603 | 12.2358 | 8.0264 |
| P-clean | 0.7113 | 1.2214 | 2.6853 | 3.1243 | 4.4147 | 2.4048 |
| P-candidate | 9.6477 | 40.9395 | 39.5285 | 93.0011 | 8.9980 | 29.6718 |

Table 5. GBR errors (MAE) on the real-world data sets; in the original, bold indicates the lowest error among the clean and candidate methods. Low/high refer to the rate of outliers.

| Approach | CPI, low | CPI, high | Servo, low | Servo, high | CCPP, low | CCPP, high |
| S-clean | 0.5436 | 0.9573 | 1.8397 | 2.7897 | 4.8527 | 9.5681 |
| S-candidate | 15.6762 | 28.3775 | 31.3897 | 29.3897 | 123.4579 | 120.6715 |
| BP-clean | 0.5041 | 0.6513 | 1.7120 | 1.8199 | 3.5687 | 4.81561 |
| BP-candidate | 12.7497 | 16.5123 | 21.5382 | 20.6252 | 103.2674 | 114.6480 |
| IP-clean | 0.3842 | 3.4740 | 1.3812 | 1.8524 | 2.9301 | 6.6549 |
| IP-candidate | 2.0511 | 3.6762 | 2.3492 | 2.1693 | 8.6252 | 12.8456 |
| P-clean | 1.9025 | 2.2162 | 1.8897 | 2.4168 | 5.1267 | 5.9471 |
| P-candidate | 3.6759 | 7.5947 | 4.5218 | 5.6037 | 14.3570 | 55.3547 |


Table 6. Comparison of results on the CCPP data set.

| Approach | Abbreviation | MAE |
| Bagging Reduced Error Pruning trees [39,40] | BREP | 2.818 |
| KStar [41] | K* | 2.882 |
| Instance-based Penalty on clean data | IP-clean | 2.932 |
| Reduced Error Pruning trees [42] | REP | 3.133 |
| Model Trees Regression [43] | M5P | 3.140 |
| Model Trees Rules [44] | M5R | 3.172 |
| Support Vector Poly Kernel [45] | SMOReg | 3.620 |
| Least Median Square [46] | LMS | 3.621 |
| Linear Regression | LR | 3.625 |

Table 7. Comparison of results on the servo data set.

| Approach | Abbreviation | MSE |
| Structure Search [38] | SS | 0.10 |
| Additive Gaussian Process [47] | GP Additive | 0.11 |
| Gaussian Process SE kernel using Automatic Relevance Determination [48] | GP SE-ARD | 0.12 |
| Hierarchical Kernel Learning [49] | HKL | 0.19 |
| Instance-based Penalty on clean data | IP-clean | 0.21 |
| Generalized additive model [50] | GAM | 0.28 |
| Linear Regression | LR | 0.52 |
| Dynamic Ensemble Selection and Instantaneous Pruning [37] | RE-DESIP | 0.66 |
| | RE | 1.42 |
| | OA | 1.43 |
| | RFE | 1.94 |
| | GA | 2.37 |

Fig. 8. Overall improved error for regression over the S method, by method (clean and candidate variants of IP, BP, and P) and dimensionality (2D, 3D, 5D, and overall).

Fig. 9. Middle of regression errors between the clean- and candidate-based methods (average of errors over the clean and candidate methods using low and high outlier rates, for S, BP, IP, and P).

Fig. 7. Improved volumes over the sensitive (S) method, by approach (BP, IP, P) and dimensionality (2D, 3D, 5D, and overall).

6.1. Regression analysis

To illustrate the contribution of each approach to decreasing the box volume over the dimensionality of the data, Fig. 7 shows the improvement rates over S. Although all three approaches decrease the volume relative to S, overall IP outperforms by 99% and P is superior to BP. In addition, IP is only negligibly affected by the dimensionality of the data. The same holds for the contribution of each approach over the rates of outliers (Fig. 8); therefore, IP outperforms the other approaches over both the dimensionality of the data and the rates of outliers.

Concretely, to analyse the regression in terms of residual error, we examine Tables 4 and 5 (in Section 5) with respect to the dimensionality of the data. The results show that the higher error rates belong to the candidate-based methods compared with the clean-based methods, which is true for both two- and three-dimensional data. The reason is inappropriate candidate points representing the boxes, on which the regression then performs poorly.

The regression error is also affected by the rates of outliers introduced into the original data. We observe in Fig. 8 that BP-clean and IP-candidate improve the error over their corresponding S methods by 20% and 70%, the least and the greatest improvements, respectively. To observe the range of errors, Fig. 9 shows the ranges and the midpoints of the errors between the clean- and candidate-based methods; in other words, it shows the difference between using either the clean- or the candidate-based method for the GBRs. We observe that IP and P have the smallest ranges, which shows that either the candidates of the boxes are close to the mean of the distributed data in each box, or the candidates should be readjusted to improve the results. In contrast, we observe that BP covers a wide range of errors, which shows that either the configured boxes were unsuccessful in simplifying the data, or the candidates were unable to represent the configured boxes well. Consequently, we recognize the following two points: (i) IP, P, BP, and S have, in that order, closer intervals and lower midpoint errors; and (ii) the BP-candidate method needs to improve the candidate representation of a box. In general, we can recognize the error interval of each approach by (i).

For further analysis, we performed a statistical analysis with a 5% significance level on the results of 100 runs for the 8 GBR methods over the 12 variations of the six data sets. We applied the z-test and the k-test, where the k-test revealed that P mostly dissatisfies the conditions. Then, similarly, we applied a two-sample t-test for the GBRs against iteratively reweighted least-squares (IRLS) [53] and LR; IRLS is known as a robust estimation against outliers, and LR is the regression method used within each GBR method. As a result, LR rejected the null hypothesis, while the GBRs revealed the same mean as IRLS for most of the cases.

6.2. Box configuration

In terms of the elapsed time to configure the boxes and eliminate the outliers, Fig. 10 shows that IP and P take the shortest and the longest time, respectively. It is worth mentioning that the elapsed time of the P method may vary depending on the nature of the outliers, as it examines every instance on the edge of the configured boxes for being an actual outlier. Therefore, the elapsed time of the P method is sensitive to outliers.

We analyzed the dependency of the volumes on the generated outliers by tracking the obtained volumes and their standard deviations.


Fig. 10. Elapsed time (in seconds) used by the GA to execute each box configuration method, for each approach on the 2D (Eq. (10), Peters), 3D (Eq. (11), CPI), and 5D (CCPP, Servo) data sets.

In every run, we measured the obtained volumes, given in Table 8, and their standard deviations, given in Table 9, to compute the indicators of volume insensitivity: (i) $STD^{Outliered}_{V_B}$ for the data set with outliers and (ii) $STD^{GB}_{V_B}$ for the volumes of the granular box configuration approaches. The higher the standard deviation, the higher the dependency on outliers. Consequently, Fig. 11 shows that P, BP, and S are outlier sensitive, and IP is the most insensitive. As expected, the standard deviation of S is close to that of the data with outliers. We found that the results depend on the dimensionality of the data and the rate of outliers; in most cases, the dependency increases slightly with the rate of outliers and increases significantly with the dimensionality of the data.

Fig. 11. Average of the standard deviations of the volumes (for the outliered data set and the S, BP, IP, and P approaches).

6.3. Summary

The analyses showed the superiority of the penalty scheme in terms of volume, residual error, time, and insensitivity. Correspondingly, Eqs. (15)–(18), and consequently Eqs. (19)–(22) in the appendix, express each indicator:

$Volume^{GB} = \dfrac{V_S - V_{GB}}{V_S} \times 100$  (15)

$Time^{GB} = \dfrac{T_S - T_{GB}}{T_S} \times 100$  (16)

$Insensitivity^{GB}_{STD} = \dfrac{STD^{Outliered}_{V_B} - STD^{GB}_{V_B}}{STD^{Outliered}_{V_B}} \times 100$  (17)

$V_B = \bigcup_{b=1}^{m} volume_b, \quad m = 1 \text{ for } STD^{S}_{V_B}, \text{ and } m = 3 \text{ for } STD^{GB}_{V_B}$  (18)

where $V_S$ and $V_{GB}$ indicate the volumes obtained by S and by another granular box approach (BP, IP, or P); $T_S$ and $T_{GB}$ indicate the elapsed time for box configuration by each approach; and $STD^{Outliered}_{V_B}$ denotes the standard deviation for the data with outliers, while $STD^{GB}_{V_B}$ denotes that of the granular box approaches. Each of these indicators is summarized in Fig. 12 and Table 10. Fig. 12 illustrates the improved results of each approach over S, and Table 10 shows the best-performing approach in terms of each measurement. IP outperforms in terms of volume and insensitivity, with 99% in both cases; IP-candidate outperforms in regression, and IP-clean outperforms in granular box configuration.
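A small helper for the improvement indicators of Eqs. (15)–(17); all three share the same relative-improvement form, so one function suffices (an illustrative simplification):

```python
def improvement_over(reference: float, value: float) -> float:
    """Eqs. (15)-(17): percentage improvement of `value` relative to `reference`
    (volume, elapsed time, or standard deviation of box volumes)."""
    return (reference - value) / reference * 100.0

# Example: volume improvement of a granular box approach over S (Eq. (15)):
# improvement_over(V_S, V_GB)
```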

Table 8. Obtained box volumes for the configured boxes over the synthetic data sets; in the original, bold indicates the lowest volumes and highlights indicate the smallest distance between L and H. L/H refer to low and high rates of outliers.

| Approach | Functional-Eq. (10) (2D), L | Functional-Eq. (10) (2D), H | Functional-Eq. (11) (3D), L | Functional-Eq. (11) (3D), H | Peters (2D), L | Peters (2D), H |
| S | 84.1820 | 110.8725 | 1475.5000 | 1581.0000 | 90.3804 | 103.5050 |
| BP | 43.3555 | 67.5943 | 1178.5000 | 1246.8000 | 53.9513 | 56.9508 |
| IP | 7.5980 | 11.2788 | 0.0000447 | 0.0000259 | 3.6289 | 1.0466 |
| P | 60.1890 | 74.8227 | 319.8614 | 530.3134 | 77.5317 | 102.6414 |

Table 9. Average of the standard deviation of the configured boxes over the synthetic data sets. L/H refer to low and high rates of outliers.

| | Functional-Eq. (10) (2D), L | Functional-Eq. (10) (2D), H | Functional-Eq. (11) (3D), L | Functional-Eq. (11) (3D), H | Peters (2D), L | Peters (2D), H |
| Data with outliers | 11.95 | 15.9659 | 65.2580 | 88.3393 | 8.3101 | 11.5201 |
| S | 18.02 | 14.6710 | 136.0300 | 145.9600 | 10.2300 | 16.0940 |
| BP | 11.26 | 11.1730 | 107.1700 | 807.8810 | 5.4511 | 16.8320 |
| IP | 2.04 | 2.6350 | 0.0002 | 0.0002 | 1.3892 | 1.2383 |
| P | 53.09 | 76.7934 | 564.6020 | 685.2220 | 50.6395 | 169.9300 |


Fig. 12. Summary of results for the improved approaches: percentage improvement in volume, residual, time, and insensitivity for the BP, IP, and P approaches (improvement over the S method, and over the data set with outliers for insensitivity).

Table 10. Summary of the best-performing approaches per measurement (a check mark indicates the best approach).

| Approach | | Time | Volume | Error | Insensitivity |
| IP | –clean | ✓ | ✓ | – | ✓ |
| | –candidate | – | – | ✓ | – |
| BP | –clean | – | – | – | – |
| | –candidate | – | – | – | – |
| P | –clean | – | – | – | – |
| | –candidate | – | – | – | – |

7. Conclusion

This paper investigated the insensitivity of granular box regressions. They configure the granular boxes on the data and then perform the regression analysis on subsets of the granulated data. We modified two approaches for the granular box configuration: the first approach penalizes a box that does not contain the required number of data points, and the second approach penalizes the instances not confined by any box. Then, we performed the outlier elimination by applying two methods to keep the major trend of the data within each granular box: the first method cleans the data based on dispersion, and the second method takes the centre of each box as its candidate. The investigation compared the proposed granular box regressions against the outlier-sensitive approach. The comparisons were carried out for the regression analysis and the box configuration, respectively, in terms of: (i) residual error of the estimated function, (ii) elapsed time for box configuration, (iii) obtained volume, and (iv) standard deviation of volumes. Moreover, the analyses considered the dimensionality of the data and the rate of outliers. The results over S revealed the overall superiority of IP. It improved the volume by 99%, the residual error by 72%, and the insensitivity to outliers in box configuration by 99%.

For future work, the candidate-based BP requires enhancement to decrease the regression error. Consequently, one can consider the following works for the box configuration. (i) Enhance the representation of box candidates by investigating the optimal number and placement of candidates. (ii) Enhance the quality of the cleaned data when excluding data from a box, by investigating different measurements. (iii) Investigate the number of data points that the IP or BP methods are required to contain in each box, to make the efficacy of the optimal box size transparent. (iv) Investigate the optimal number of boxes to best furnish the regression data. (v) Enhance the performance of the P approach by investigating the optimal values of its control variables. (vi) Investigate various outlier-generating functions to find their efficacy on each method. In addition, one can investigate the same characteristics for non-linear functions to detect outliers in a data set generated from a linear function, given the promising results of granular box regression.

Acknowledgements

The Universiti Teknologi Malaysia (UTM) and the Ministry of Education Malaysia, under Research University Grants 00M19, 02G71 and 4F550, are hereby acknowledged for some of the facilities that were utilized during the course of this research work.

Appendix A

$Error_{Method} = \mathop{\mathrm{Avg}}_{n=1}^{outlier}\ \mathop{\mathrm{Avg}}_{i=1}^{dataset} \left( \mathop{\mathrm{Avg}}_{j=1}^{run} Error_j^{clean\text{-}method,\,dataset} \right) + \mathop{\mathrm{Avg}}_{n=1}^{outlier}\ \mathop{\mathrm{Avg}}_{i=1}^{dataset} \left( \mathop{\mathrm{Avg}}_{j=1}^{run} Error_j^{candidate\text{-}method,\,dataset} \right)$  (19)

$Volume_{Method} = \mathop{\mathrm{Avg}}_{n=1}^{outlier}\ \mathop{\mathrm{Avg}}_{i=1}^{dataset} \left( \mathop{\mathrm{Avg}}_{j=1}^{run} Vol_j^{clean\text{-}method,\,dataset} \right) + \mathop{\mathrm{Avg}}_{n=1}^{outlier}\ \mathop{\mathrm{Avg}}_{i=1}^{dataset} \left( \mathop{\mathrm{Avg}}_{j=1}^{run} Vol_j^{candidate\text{-}method,\,dataset} \right)$  (20)

$Time_{Method} = \mathop{\mathrm{Avg}}_{n=1}^{outlier}\ \mathop{\mathrm{Avg}}_{i=1}^{dataset} \left( \mathop{\mathrm{Avg}}_{j=1}^{run} Elapsed_j^{clean\text{-}method,\,dataset} \right) + \mathop{\mathrm{Avg}}_{n=1}^{outlier}\ \mathop{\mathrm{Avg}}_{i=1}^{dataset} \left( \mathop{\mathrm{Avg}}_{j=1}^{run} Elapsed_j^{candidate\text{-}method,\,dataset} \right)$  (21)

$STD_{Method} = \mathop{\mathrm{Avg}}_{n=1}^{outlier}\ \mathop{\mathrm{Avg}}_{i=1}^{dataset} \left( \mathop{\mathrm{Avg}}_{j=1}^{run} STD_j^{clean\text{-}method,\,dataset} \right) + \mathop{\mathrm{Avg}}_{n=1}^{outlier}\ \mathop{\mathrm{Avg}}_{i=1}^{dataset} \left( \mathop{\mathrm{Avg}}_{j=1}^{run} STD_j^{candidate\text{-}method,\,dataset} \right)$  (22)

where in Eqs. (19)–(22), outlier = 2, dataset = 6, and run = 100.

References

[1] F. Giunchiglia, T. Walsh, A theory of abstraction, Artif. Intell. 57 (1992) 323–389. [2] L.A. Zadeh, Toward a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic, Fuzzy Sets Syst. 90 (1997) 111–127. [3] J.R. Hobbs, Granularity, in: Proceedings of the Ninth International Joint Conference on Artificial Intelligence, Citeseer, 1985. [4] Y. Yao, Human-inspired granular computing, in: Novel Developments in Granular Computing: Applications for Advanced Human Reasoning and Soft Computation, 2010, pp. 1–15. [5] Y.-Q. Zhang, B. Jin, Y. Tang, Granular neural networks with evolutionary interval learning, IEEE Trans. Fuzzy Syst. 16 (2008) 309–319. [6] Y. Yao, Interpreting concept learning in cognitive informatics and granular computing, IEEE Trans. Syst. Man Cybern., Part B: Cybern. 39 (2009) 855–866. [7] Y. Yao, J. Luo, Top-down progressive computing, in: Rough Sets and Knowledge Technology, Springer, 2011, pp. 734–742.

M. Reza Mashinchi et al. / Information Fusion 27 (2016) 161–169 [8] G. Peters, Granular box regression, IEEE Trans. Fuzzy Syst. 19 (2011) 1141– 1152. [9] G. Peters, Z. Lacic, Tackling outliers in granular box regression, Inf. Sci. 212 (2012) 44–56. [10] D.M. Hawkins, Identification of Outliers, Springer, 1980. [11] V.J. Hodge, J. Austin, A survey of outlier detection methodologies, Artif. Intell. Rev. 22 (2004) 85–126. [12] J. Fox, Regression Diagnostics: An Introduction, Sage, 1991. [13] V. Barnett, T. Lewis, Outliers in Statistical Data, Wiley, New York, 1994. [14] N. Devarakonda, S. Subhani, S.A.H. Basha, Outliers detection in regression analysis using partial least square approach, in: ICT and Critical Infrastructure: Proceedings of the 48th Annual Convention of Computer Society of India, vol. II, Springer, 2014, pp. 125–135. [15] A. Dastanpour, S. Ibrahim, R. Mashinchi, Using genetic algorithm to supporting artificial neural network for intrusion detection system, in: The International Conference on Computer Security and Digital Investigation (ComSec2014), The Society of Digital Information and Wireless Communication, 2014, pp. 1–13. [16] V. Chandola, A. Banerjee, V. Kumar, Anomaly detection: a survey, ACM Comput. Surv. (CSUR) 41 (2009) 15. [17] D.W. Osgood, Poisson-based regression analysis of aggregate crime rates, J. Quant. Criminol. 16 (2000) 21–43. [18] A. Shen, R. Tong, Y. Deng, Application of classification models on credit card fraud detection, in: Service Systems and Service Management, 2007 International Conference on, IEEE, 2007, pp. 1–4. [19] E. Ngai, Y. Hu, Y. Wong, Y. Chen, X. Sun, The application of data mining techniques in financial fraud detection: a classification framework and an academic review of literature, Decis. Support Syst. 50 (2011) 559–569. [20] E.M. Matsumura, R.R. Tucker, Fraud detection: a theoretical foundation, Account. Rev. (1992) 753–782. [21] L.C. Mercer, Fraud detection via regression analysis, Comput. Secur. 9 (1990) 331–338. [22] Y. Wang, A multinomial logistic regression modeling approach for anomaly intrusion detection, Comput. Secur. 24 (2005) 662–674. [23] Y. Tsuda, M. Samejima, M. Akiyoshi, N. Komoda, M. Yoshino, An anomaly detection method for individual services on a web-based system by selection of dummy variables in multiple regression, Electron. Commun. Jpn 97 (2014) 9–16. [24] M.H. Mashinchi, M.A. Orgun, M. Mashinchi, Solving fuzzy linear regression with hybrid optimization, in: C. Leung, M. Lee, J. Chan (Eds.), Neural Information Processing, Springer, Berlin Heidelberg, 2009, pp. 336–343. [25] J.H. Holland, Genetic algorithms, Sci. Am. 267 (1992) 66–72. [26] M. Gen, R. Cheng, Genetic Algorithms and Engineering Optimization, John Wiley & Sons, 2000. [27] S. Chatterjee, A.S. Hadi, Regression Analysis by Example, John Wiley & Sons, 2013. [28] R.H. Myers, Classical and Modern Regression with Applications (Duxbury Classic), Duxbury Press, Pacific Grove, 2000. [29] D. Kleinbaum, L. Kupper, A. Nizam, E. Rosenberg, Applied Regression Analysis and Other Multivariable Methods, Cengage Learning, 2013. [30] B. Apolloni, D. Iannizzi, D. Malchiodi, W. Pedrycz, Granular regression, in: Neural Nets, Springer, 2006, pp. 147–156. [31] A. Bargiela, W. Pedrycz, T. Nakashima, Multiple regression with fuzzy data, Fuzzy Sets Syst. 158 (2007) 2169–2188.


[32] S.-P. Chen, J.-F. Dang, A variable spread fuzzy linear regression model with higher explanatory power and forecasting accuracy, Inf. Sci. 178 (2008) 3973– 3988. [33] S. Roychowdhury, W. Pedrycz, Modeling temporal functions with granular regression and fuzzy rules, Fuzzy Sets Syst. 126 (2002) 377–387. [34] M.H. Mashinchi, M.A. Orgun, M.R. Mashinchi, A least square approach for the detection and removal of outliers for fuzzy linear regression, in: World Congress on Nature and Bio-logically Inspired Computing, 2010, pp. 134–139. [35] J.J. Buckley, T. Feuring, Linear and non-linear fuzzy regression: evolutionary algorithm solutions, Fuzzy Sets Syst. 112 (2000) 381–394. [36] P. Tüfekci, Prediction of full load electrical power output of a base load operated combined cycle power plant using machine learning methods, Int. J. Electr. Power Energy Syst. 60 (2014) 126–140. [37] K. Dias, T. Windeatt, Dynamic ensemble selection and instantaneous pruning for regression, in: ESANN, Bruges, Belgium, 2014. [38] D. Duvenaud, J.R. Lloyd, R. Grosse, J.B. Tenenbaum, Z. Ghahramani, Structure discovery in nonparametric regression through compositional kernel search, 2013, arXiv preprint arXiv:1302.4922. [39] L. Breiman, Bagging predictors, Mach. Learn. 24 (1996) 123–140. [40] J. D’Haen, D. Van Den Poel, Temporary staffing services: a data mining perspective, in: 2012 IEEE 12th International Conference on Data Mining Workshops (ICDMW), 2012, pp. 287–292. [41] J.G. Cleary, L.E. Trigg, K⁄: An instance-based learner using an entropic distance measure, in: ICML, 1995, pp. 108–114. [42] S. Portnoy, R. Koenker, The Gaussian hare and the Laplacian tortoise: computability of squared-error versus absolute-error estimators, Stat. Sci. 12 (1997) 279–300. [43] Y. Wang, I.H. Witten, Inducing model trees for continuous classes, in: Proceedings of the Ninth European Conference on Machine Learning, 1997, pp. 128–137. [44] S. Ekinci, U.B. Çelebi, M. Bal, M.F. Amasyali, U.K. Boyaci, Predictions of oil/chemical tanker main design parameters using computational intelligence techniques, Appl. Soft Comput. 11 (2011) 2356–2366. [45] M.O. Elish, A comparative study of fault density prediction in aspect-oriented systems using MLP, RBF, KNN, RT, DENFIS and SVR models, Artif. Intell. Rev. (2012) 1–9. [46] P.J. Rousseeuw, Least median of squares regression, J. Am. Stat. Assoc. 79 (1984) 871–880. [47] D. Duvenaud, H. Nickisch, C.E. Rasmussen, Additive Gaussian processes, in: Neural Information Processing Systems, 2011. [48] R.M. Neal, Bayesian Learning for Neural Networks, University of Toronto, 1995. [49] F. Bach, High-dimensional non-linear variable selection through hierarchical kernel learning, 2009. arXiv preprint arXiv:0909.0844. [50] T.J. Hastie, R.J. Tibshirani, Generalized Additive Models, CRC Press, 1990. [51] T. Windeatt, K. Dias, Feature ranking ensembles for facial action unit classification, in: L. Prevost, S. Marinai, F. Schwenker (Eds.), Artificial Neural Networks in Pattern Recognition, Springer, Berlin Heidelberg, 2008, pp. 267– 279. [52] D. Hernández-Lobato, G. Martínez-Muñoz, A. Suárez, Empirical analysis and evaluation of approximate techniques for pruning regression bagging ensembles, Neurocomputing 74 (2011) 2250–2264. [53] P.W. Holland, R.E. Welsch, Robust regression using iteratively reweighted least-squares, Commun. Stat.–Theory Methods 6 (1977) 813–827.