Information Fusion 27 (2016) 161–169
Outlier elimination using granular box regression

M. Reza Mashinchi (a), Ali Selamat (a,b,*), Suhaimi Ibrahim (c), Hamido Fujita (d)

(a) Faculty of Computing, Universiti Teknologi Malaysia, Johor, Malaysia
(b) UTM-IRDA Digital Media Center of Excellence, Universiti Teknologi Malaysia, Johor, Malaysia
(c) Advance Informatics School, Universiti Teknologi Malaysia, Kuala Lumpur, Malaysia
(d) Intelligent Software Laboratory, Iwate Prefectural University (IPU), Iwate, Japan
(*) Corresponding author at: Faculty of Computing, Universiti Teknologi Malaysia, 81310 Johor Bahru, Johor, Malaysia.

doi:10.1016/j.inffus.2015.04.001
Article info

Article history: Received 1 October 2014; Received in revised form 8 April 2015; Accepted 12 April 2015; Available online 20 April 2015.
Keywords: Granular box regression; Outlier elimination; Noisy data; Data simplification; Data abstraction.

Abstract

A regression method aims to fit a curve on a data set irrespective of outliers. This paper modifies granular box regression approaches to deal with data sets containing outliers. Each approach incorporates a three-stage procedure that includes granular box configuration, outlier elimination, and linear regression analysis. The first stage investigates two objective functions, each applying a different penalty scheme, on boxes or on instances. The second stage investigates two methods of outlier elimination, after which the linear regression is performed in the third stage. The performance of the proposed granular box regressions is investigated in terms of the volume of the boxes, the insensitivity of the boxes to outliers, the elapsed time for box configuration, and the error of regression. The proposed approach offers a better linear model, with a smaller error, on the given data sets containing a variety of outlier rates. The investigation shows the superiority of applying the penalty scheme on instances.

© 2015 Elsevier B.V. All rights reserved.
1. Introduction

Simplifying and abstracting data helps us understand it and trace its general pattern or trend. While the term "abstracting" is associated with studies in artificial intelligence, the term "granularity" is its synonym in soft computing studies [1–4]. Granularity often aims at reducing the complexities of data that increase the processing cost, mostly where uncertainties are involved. Therefore, certain practices motivate the studies on granularity, i.e., clarity, low-cost approximation, and tolerance of uncertainty. An application of these practices, through understanding the data, is identifying or eliminating anomalies, known as outliers. Granulated data can benefit the computation in three ways, as follows. (i) To understand the data, by making the complexity transparent through reduced-size data known as granules; when a method performs an estimation based on granular data, as a requisite, the outcome should be more accurate than using the original complex data. (ii) To reduce the cost of data analysis, by avoiding complex tools that must run through the data for insight discovery [2,5–7]; thus, a non-expert in data mining can also make sense of it. (iii) To increase the power of estimation and the capability of dealing with uncertainty. A datum, as a part of a
granule, does not only represent a single observation but, rather, a group of data. However, besides the advantages pointed out for granulation, methods are required to sharpen the transparency of the data. Granular box regression analysis [8,9] carries this out by detecting the outliers in the data. It finds the correlation between the dependent and independent variables using hyper-dimensional interval numbers known as boxes. In granular box regression (GBR), every instance in the data set affects the size and coordinates of the boxes. As a result, such an approach becomes sensitive to outliers, which is an issue. To resolve this, we propose variations of granular box regression based on the subset of data they process, and we then investigate their performance in the presence of outliers in a data set. There are two motivations for the proposed variations: first, to simplify a data set containing numerous data, and thus help a non-expert's understanding by clarifying the relationship between the dependent and independent variables; second, to study the performance of each variation of granular box regression in the presence of outliers.

An outlier is an anomalous object that is atypical of the remaining data. It deviates from the other objects to such an extent that it is suspected of being generated by a different mechanism [10]. The subject of outliers is treated either as the elimination of a disturbance, such as noise reduction [10–14], or as an interest in detection, such as crime detection [15–23]. This paper focuses on the former view. Different approaches [10,11,15,24] have studied outliers; this paper differentiates itself by applying a granular box approach and the elimination of
the outliers. To measure the goodness of the applied approach, either the boxes themselves or the relationships between boxes can indicate the quality of the box configuration on the data. In the case of box measurement, the approach should minimize the overall volumes to reduce the complexity of the data, as intended; in the case of measuring the relationships between boxes, an approach should build a coherent relationship similar to the true function through regression analysis. A possible approach to address the former issue is employing a genetic algorithm (GA) [25–27] to find the optimal volumes gradually. Performing the box configuration based on a GA builds the relationships between boxes to reduce the complexity of the data and exclude the outliers. As a result, the configured boxes represent a simplified version of the original data.

This paper is organized in seven sections. Section 2 reviews granular box regression and explains its notions. Section 3 explains the three stages of the proposed framework to configure the boxes, eliminate the outliers, and fit the curve, where box-based penalization (BP) and instance-based penalization (IP) are proposed for the box configuration, and clean- and candidate-based methods are proposed for the elimination of outliers. Section 4 describes the data preparation and how the goodness of a model is computed. Section 5 reveals the results on six data sets, each with two rates of outliers. Section 6 then gives detailed analyses in three parts to investigate the regression analysis and the box configuration with respect to the effect of the dimensionality of data and the rate of outliers on each method. Section 7 concludes the achieved results and addresses the future works for each method.
2. Granular box regression

A regression approach should eliminate the outliers before fitting a model to the data; otherwise, it may fit to the outliers and produce a wrong interpretation. Granular box regression (GBR) is an inclusive approach that detects the outliers in every dimension of the data. Compared with classical regression analysis (CRA), which operates only on the response variable [27–29], GBR also operates on the predictors to detect the outliers. Whereas the CRA approach minimizes the summation of distances between the actual and the estimated values, the GBR approach obtains the minimum summation of the volumes of all boxes. In general, GBR is a granular regression approach involving fuzzy granulation generalization (f.g-generalization) to find the relationships of the boxes [2,30–33]. Fig. 1 shows the fuzzy graph representation of f.g-generalization. GBR addresses three methods [9]: borderline, residual, and average distance. The borderline method examines the significant decrease of a box volume when a potential outlier on the box borders is eliminated. The residual method examines the significant largeness of the distance between the residual error of a potential outlier and the
average of all data, where the residual error is the distance between the estimated and the true value. The average distance method examines the significant largeness of the distance between the individual distance of a potential outlier and the aggregated distance, where the individual distance is the average distance of a datum to all others within a box and the aggregated distance is the average distance of all data to each other within a box [8]. Eqs. (1) and (2) define the individual and aggregated distances, where, for a box, All is the number of data, D_i and D_j are the ith and jth instances, and k^Box denotes the kth instance in the box:

$$individualDistance_k^{Box} = \frac{\sum_{i=1}^{All-1} \lVert D_i - D_k^{Box} \rVert}{All - 1} \qquad (1)$$

$$aggregatedDistance^{Box} = \frac{\sum_{i=1}^{All} \sum_{j=i+1}^{All} \lVert D_i - D_j \rVert}{\sum_{i=1}^{All-1} (All - i)} \qquad (2)$$

Fig. 1. Fuzzy graph representation of f.g-generalization (X: fuzzy graph; y: crisp function).

The earlier approach, P, for GBR [8] examines the box volume based on the borderline method to identify the actual outliers among the potential outliers. In contrast, this paper proposes an approach based on the average distance method. In both approaches, P and ours, the mathematical formulation to obtain the optimized coordinates of n boxes for an m-dimensional problem is as follows:

$$\text{Minimize}\left( \sum_{k=1}^{n} V(B_k) \right) \qquad (3)$$

where V calculates the volume of a given box B_k; note that B_k is the kth box, which four vertices can represent in a two-dimensional space. Both approaches initialize the number of boxes prior to performing the box regression analysis. The larger the number of boxes, the higher the resolution of the regression data; conversely, the smaller the number of boxes, the more abstract the derived relationships. Studying the optimal number of boxes is left to future work, though it significantly affects GBR.
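To make the average distance test of Eqs. (1) and (2) concrete, the sketch below computes both quantities for the instances confined in a single box and flags the instances that are more dispersed than the aggregated average. It is a minimal Python illustration under our own assumptions (a NumPy array of box instances and Euclidean distances), not the authors' MATLAB implementation.

```python
import numpy as np

def individual_distances(box_data: np.ndarray) -> np.ndarray:
    """Eq. (1): average distance of each instance to all other instances in the box."""
    n = len(box_data)
    diffs = box_data[:, None, :] - box_data[None, :, :]   # pairwise difference vectors
    dist = np.linalg.norm(diffs, axis=-1)                  # pairwise Euclidean distances
    return dist.sum(axis=1) / (n - 1)                      # self-distance is zero, so divide by n-1

def aggregated_distance(box_data: np.ndarray) -> float:
    """Eq. (2): average pairwise distance over all instances in the box."""
    n = len(box_data)
    diffs = box_data[:, None, :] - box_data[None, :, :]
    dist = np.linalg.norm(diffs, axis=-1)
    n_pairs = n * (n - 1) / 2
    return dist[np.triu_indices(n, k=1)].sum() / n_pairs

def dispersed_mask(box_data: np.ndarray) -> np.ndarray:
    """Flag instances whose individual distance exceeds the aggregated distance."""
    return individual_distances(box_data) > aggregated_distance(box_data)
```

In the clean-based method of Section 3.2, the flagged instances would be the ones excluded from the box before the regression stage.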
3. Proposed granular box regression

An overall view of our proposed GBR is shown in Fig. 2. It illustrates the idea of the penalty scheme, where the box configuration represents the simplification of the data by clean instances or by candidates of the boxes.

Fig. 2. An overall view of the proposed penalty-scheme GBRs.

Fig. 3. Steps of the proposed approach for GBR (Stage 1: granular box configuration with a penalty on data or a penalty on boxes; Stage 2: elimination of outliers to obtain clean data or candidates; Stage 3: curve fitting by linear regression).
Concretely, Fig. 3 shows the procedure of performing GBR in three stages: (i) apply a granular box configuration, (ii) exclude the outliers based on the dispersion of data in each box, and (iii) apply the linear regression analysis on the result of the second stage, i.e., the remaining data or a subset of the remaining data. The following subsections explain each stage.

3.1. First stage: box configuration

In the first stage of the proposed GBRs, we investigate the application of four objectives: (i) S, which is sensitive to outliers [8]; (ii) P, the original objective proposed by Peters [8]; and (iii) and (iv) BP and IP, our proposed objectives. Eqs. (4)–(7) express the objective functions of each approach, with the penalty functions Instance_penalty and Box_b^penalty for IP and BP given in Eqs. (8) and (9), respectively. Note that Eqs. (4)–(7) are used in conjunction with the minimization of the total volume of boxes (see Figs. 4 and 5).
$$S = Obj_{Sensitive} = \sum_{b=1}^{m} Box_b^{volume} \qquad (4)$$

$$P = Obj_{Peters} = \sum_{b=1}^{m} Box_b^{volume} \;\Big|\; \left(\bigcup_{i=1}^{r} Outlier_i \notin \bigcup_{b=1}^{m} Box_b^{volume}\right) \wedge \left(\bigcup_{i=1}^{r} Outlier_i \subseteq Outliers_{actual}\right) \qquad (5)$$

$$IP = Obj_{Pro1} = Instance_{penalty} + \sum_{b=1}^{m} Box_b^{volume} \qquad (6)$$

$$BP = Obj_{Pro2} = \sum_{b=1}^{m} Box_b^{penalty} + \sum_{b=1}^{m} Box_b^{volume} \qquad (7)$$

$$Instance_{penalty} = Const_1 - \frac{\sum_{i=1}^{All} Instance_i - \sum_{b=1}^{n}\sum_{i=1}^{m} Instance_i^{Box_b}}{\sum_{i=1}^{All} Instance_i} \times 100 \qquad (8)$$

$$Box_b^{penalty} = Const_2 - \frac{\sum_{i=1}^{All} Instance_i - \sum_{i=1}^{m} Instance_i^{Box_b}}{\sum_{i=1}^{All} Instance_i} \times 100 \qquad (9)$$
where Const1 and Const2 are the pre-defined percentages for IP and BP, respectively. We define them intuitively by fitting them to the context of the proposed models. We recall that S is unsuspicious of the outliers, P is suspicious of every potential outlier in order to find the actual outliers, IP penalizes the instances outside the boxes, and BP penalizes the configured boxes. The penalty values are computed based on pre-determined thresholds: Const1 defines the required minimum number of instances inside the total boxes, and Const2 defines the minimum number of instances each box must confine. Therefore, the number of instances violating Const1 or Const2 results in the penalty values Instance_penalty and Box_b^penalty, respectively. Fig. 6 shows an example of a granular box configuration. Completing the granular box configuration allows us to perform the second stage of GBR to obtain the clean data or the candidates of the boxes.

Fig. 4. An example of box configuration by the S method on the two-dimensional Peters data set [9] with a low rate of outliers.

Fig. 5. An example of box configuration by the IP method on the two-dimensional Peters data set [9] with a low rate of outliers.

Fig. 6. An example of box configuration by the BP method on the two-dimensional Peters data set [9] with a low rate of outliers.
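As a rough illustration of the penalty scheme, the sketch below evaluates an IP-style objective: the total box volume plus an instance penalty, in the spirit of Eqs. (6) and (8), charged when the share of instances confined by the boxes falls below Const1. The box representation (lower/upper corner arrays), the simplified penalty formula, and the default Const1 value are assumptions for illustration only, not the authors' implementation.

```python
import numpy as np

def box_volume(lower: np.ndarray, upper: np.ndarray) -> float:
    """Volume of one hyper-dimensional box given its lower and upper corners."""
    return float(np.prod(np.maximum(upper - lower, 0.0)))

def inside(data: np.ndarray, lower: np.ndarray, upper: np.ndarray) -> np.ndarray:
    """Boolean mask of instances confined by the box."""
    return np.all((data >= lower) & (data <= upper), axis=1)

def ip_objective(data: np.ndarray, boxes: list, const1: float = 70.0) -> float:
    """IP-style objective: total box volume plus a penalty on unconfined instances."""
    total_volume = sum(box_volume(lo, hi) for lo, hi in boxes)
    confined = np.zeros(len(data), dtype=bool)
    for lo, hi in boxes:
        confined |= inside(data, lo, hi)
    pct_inside = 100.0 * confined.sum() / len(data)
    # penalise only when fewer than const1 percent of the instances are confined
    instance_penalty = max(const1 - pct_inside, 0.0)
    return total_volume + instance_penalty
```

A BP-style objective would instead charge each individual box whose own share of instances falls below a Const2-style threshold.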
3.2. Second stage: elimination of outliers

In the second stage of the proposed GBRs, we obtain the clean data set [34] by applying Algorithm 1 to the instances confined in each box. To assess the dispersion of each instance, we measure the average distance of all data to the median point; an instance is excluded if its distance is greater than this average, since it is considered dispersed from the other instances. Alternatively, we apply Algorithm 2 to generate the candidate of each box.

Algorithm 1 (The procedure to obtain the clean data).
(i) For each box:
  1.1. Compute the individual average distance, as in Eq. (1).
  1.2. Compute the aggregated average distance, as in Eq. (2).
  1.3. Exclude the instances whose individual average distance is greater than the aggregated average distance.
(ii) End

Algorithm 2 (The procedure to obtain the candidate of a box).
(i) For each box:
  1.1. Compute the candidate (centre) of the box:
  $$Centre_{box} = \bigcup_{i=1}^{d} \frac{\sum_{j=1}^{2} x_j}{2}$$
  where d is the dimensionality and i >= 1.
(ii) End

3.3. Third stage: curve fitting

In the third stage of the proposed GBRs, we apply a linear regression analysis (LR) either on the clean data or on the candidates of the boxes, called LR-on-clean and LR-on-candidates, respectively. We modified four possible granular box regressions, as given in Table 1. To optimize Eqs. (6) and (7), Obj_Pro1 and Obj_Pro2, we apply the method in Algorithm 3.

Algorithm 3 (The procedure of the granular box regression with outlier elimination).
Stage 1:
1. Define the number of boxes.
2. Set the control variables for the box configuration and the genetic algorithm (detailed in Table 3).
3. Conduct the box configuration on the complete data set:
  3.1. Apply GA to find the granular boxes with respect to objective function Obj_Pro1 or Obj_Pro2.
4. For each box:
  4.1. Find the clean data (clean-based method) or the candidate of the box (candidate-based method).
  4.2. Store the clean data set or the candidates of the boxes.
Stage 2: Apply linear regression analysis on the clean data set (LR-on-clean) or on the candidates (LR-on-candidates).

4. Preparation and computing goodness of model

We investigate the performance of all variations of granular box regression given in Table 1. We conducted 100 runs and report the average and the standard deviation for each configuration. As given in Table 2, we used the following six data sets: micro-economic data for the Consumer Price Index (CPI) of Germany, the artificial data set used in [9], two data sets generated by Eqs. (10) and (11), and the servo and combined cycle power plant (CCPP) data sets. We generated 1000 instances for each synthetic data set using Eqs. (10) and (11). It is worth mentioning that Eq. (10) is a well-known benchmark for comparing two methods [8,35].

$$y = \begin{cases} x + 5 + e + o, & \text{with probability of the outlier percentage} \\ x + 5 + e, & \text{with probability of } (1 - \text{outlier percentage}) \end{cases} \qquad (10)$$

$$z = \begin{cases} y + x + 5 + e + o, & \text{with probability of the outlier percentage} \\ y + x + 5 + e, & \text{with probability of } (1 - \text{outlier percentage}) \end{cases} \qquad (11)$$
In Eqs. (10) and (11), e is a random error with a normal distribution and o is an outlier value, where the independent variables x and y are uniformly distributed on the interval [0, 10]. We introduced the outliers artificially by randomly affecting instances. We reproduced each original data set by affecting 30% and 70% of the total instances with outliers. As a result, we generated 12 configured data sets to test the performance of the GBRs. We performed the GBRs with the control variables given in Table 3. To measure the goodness of a model, we compute the residual error of the original data with respect to the estimated points. We apply Eq. (12), and its property in Eq. (13), to investigate the superiority of a linear model E1 over a model E2, such as the outlier-sensitive estimation S.
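Before turning to the goodness measures, here is a minimal sketch of the data generation just described, following Eq. (10); it is an illustrative assumption of ours, since the paper does not state the noise scale or the magnitude of the outlier value o, and those parameters below are placeholders.

```python
import numpy as np

def generate_functional1(n: int = 1000, outlier_rate: float = 0.3,
                         noise_std: float = 0.5, outlier_shift: float = 20.0,
                         seed: int = 0) -> np.ndarray:
    """Two-dimensional data set in the spirit of Eq. (10): y = x + 5 + e (+ o for outliers)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 10.0, n)                  # x uniform on [0, 10]
    e = rng.normal(0.0, noise_std, n)              # normally distributed random error e
    is_outlier = rng.random(n) < outlier_rate      # outlier with the given probability
    o = np.where(is_outlier, outlier_shift, 0.0)   # outlier value o, zero otherwise
    y = x + 5.0 + e + o
    return np.column_stack([x, y])

# Eq. (11) extends this to three dimensions: z = y + x + 5 + e (+ o).
```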
$$Superiority_{E_1} = Sup_{E_1} = \frac{G_{E_2} - G_{E_1}}{G_{E_2}} \times 100 \qquad (12)$$

Table 1. Variations of granular box regression.

Name | Stage one | Stage two | Stage three
IP-clean | ObjPro1 | Clean | LR-on-clean
IP-candidate | ObjPro1 | Candidate | LR-on-candidate
BP-clean | ObjPro2 | Clean | LR-on-clean
BP-candidate | ObjPro2 | Candidate | LR-on-candidate

Table 2. List of data sets used in this paper.

Data set | Variables | Instances | Nature | Repository/Reference
Artificial | 2 | 30 | Synthetic | [9]
Functional#1 – generated by Eq. (10) | 2 | 1000 | Synthetic | Eq. (10)
Functional#2 – generated by Eq. (11) | 3 | 1000 | Synthetic | Eq. (11)
Micro-economic data for the Consumer Price Index (CPI) of Germany | 3 | 78 | Real-world | DSB (a)
Servo | 5 | 167 | Real-world | UCI repository (b)
Combined Cycle Power Plant (CCPP) | 5 | 9569 | Real-world | UCI repository (b)

(a) Deutsches Statistisches Bundesamt (http://www.destatis.de).
(b) University of California Irvine (http://archive.ics.uci.edu).
Table 3. Configuration of control variables for granular box regression and the genetic algorithm.

Software | Model | Control variable | Value/choice
GUI MATLAB-9.3 | Genetic algorithms | Population size | 50
 | | Stopping condition | Exceeds 200 generations
 | | Iteration | 100
 | | Selection function | Stochastic uniform
 | | Crossover function | Scattered
 | | Mutation function | Gaussian
 | Granular box regression | Number of boxes | 3
 | | Minimum data inside the box | 30%
 | | Maximum data outside the box | 30%
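Were one to reproduce the setup of Table 3 outside MATLAB, the control variables could be gathered as below before handing the objective of Eq. (6) or (7) to a genetic-algorithm optimizer. The paper used MATLAB's GA toolbox, so this Python snippet is only an illustrative stand-in; the dictionary keys are our own naming.

```python
# Control variables of Table 3, collected for a GA-driven box configuration (illustrative only).
GA_CONFIG = {
    "population_size": 50,
    "stopping_condition": "exceeds 200 generations",
    "iterations": 100,
    "selection_function": "stochastic uniform",
    "crossover_function": "scattered",
    "mutation_function": "gaussian",
}

GBR_CONFIG = {
    "number_of_boxes": 3,
    "min_data_inside_box": 0.30,      # per-box threshold in the spirit of Const2 (30%)
    "max_data_outside_boxes": 0.30,   # overall threshold in the spirit of Const1 (30%)
}

# A chromosome can encode the boxes as a flat vector of lower/upper corners
# (length = number_of_boxes * 2 * dimensionality) and be evaluated by an
# objective such as the ip_objective sketch shown earlier.
```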
$$\text{Goodness of Model}_E = G_E = \operatorname{Average}\left(\sum_{i=1}^{n}\left|value_i^{true} - value_i^{est}\right|\right) \qquad (13)$$

where n is the number of instances, value_i^true is the value of the ith original instance prior to the noise affection, and value_i^est is the value of the ith outcome of the estimated function. Note that a superiority of zero indicates that the two models perform similarly. The if-then statement in Eq. (14) distinguishes three cases of performance:

$$\text{Superiority variations} = \begin{cases} \text{if } Sup_{E_1} < 0, & \text{then } E_2 \text{ is superior to } E_1 \\ \text{if } Sup_{E_1} = 0, & \text{then } E_1 \text{ and } E_2 \text{ perform similarly} \\ \text{if } Sup_{E_1} > 0, & \text{then } E_1 \text{ is superior to } E_2 \end{cases} \qquad (14)$$
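A short sketch of the goodness and superiority measures of Eqs. (12)–(14), interpreting Eq. (13) as the mean absolute error reported in Tables 4 and 5; the value arrays are placeholders.

```python
import numpy as np

def goodness(value_true: np.ndarray, value_est: np.ndarray) -> float:
    """Eq. (13): mean absolute error between original and estimated values."""
    return float(np.mean(np.abs(value_true - value_est)))

def superiority(g_e1: float, g_e2: float) -> float:
    """Eq. (12): improvement of model E1 over model E2, in percent."""
    return (g_e2 - g_e1) / g_e2 * 100.0

# Eq. (14): a positive value means E1 is superior, zero means the models tie,
# and a negative value means E2 is superior.
```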
5. Results

To produce the results of the proposed GBRs, we performed the linear regression analysis on the clean data and on the candidates of the three configured boxes. The results are averages of 100 runs in terms of residual error. Table 4 gives the results for each rate of outliers over each data set. The comparisons examine six methods against their equivalent S approach, as each of the three GBRs performs two methods, based either on the clean data or on the candidates of the boxes. Tables 4 and 5 reveal that IP, BP and P decrease the regression error compared with S. We further analyzed the clean-based GBR methods against other regression approaches for the servo and CCPP data sets. Tables 6 and 7 compare the results based on [36–38]. The compared methods are drawn from different categories, i.e., function-based methods such as LMS and SMOReg; lazy-learning algorithms such as K*; meta-learning algorithms such as BREP; rule-based algorithms such as M5R; tree-based learning algorithms such as M5P and REP; and kernel-based methods such as GAM, HKL, GP SE-ARD, GP Additive, and SS.

6. Analyses of proposed granular box regression
This section provides analyses of the proposed GBRs based on the 12 variations of the six data sets, i.e., the artificial data set [9], the two data sets generated by Eqs. (10) and (11), CPI, servo, and CCPP. The results are examined for the regression analysis and the box configuration with the following measurements: (i) for regression, the residual error, the rate of outliers, and statistical analyses; and (ii) for box configuration, the elapsed time, the volume of boxes, and the standard deviation of box volumes.

6.1. Regression analysis

To illustrate the contribution of each approach to decreasing the box volume over the dimensionality of data, Fig. 7 shows the improvement rates over S. Although all three approaches decrease the volume relative to S, overall IP outperforms the others by 99% and P is superior to BP. In addition, IP is negligibly affected by the dimensionality of data. The same holds for the contribution of each approach over the rates of outliers (Fig. 8); therefore, IP outperforms the other approaches over both the dimensionality of data and the rates of outliers.
Table 4. GBR errors (mean of absolute errors, MAE) on the synthetic data sets; bold in the original indicates the lowest error among the clean-based and among the candidate-based methods. Low/High refers to the rate of outliers.

Approach | Functional–Eq. (10): Low / High | Functional–Eq. (11): Low / High | Peters [9]: Low / High
S-clean | 0.5187 / 1.2226 | 0.3299 / 0.7688 | 1.3403 / 2.4010
S-candidate | 24.5074 / 13.3041 | 125.001 / 103.7303 | 21.2767 / 10.0269
BP-clean | 0.4198 / 1.675 | 0.3292 / 0.7684 | 1.0039 / 1.8866
BP-candidate | 22.2748 / 13.5846 | 35.9774 / 28.1761 | 13.7822 / 13.2625
IP-clean | 0.2955 / 0.723 | 0.5096 / 2.1224 | 1.5812 / 1.9586
IP-candidate | 4.2443 / 5.2792 | 50.0692 / 51.3603 | 12.2358 / 8.0264
P-clean | 0.7113 / 1.2214 | 2.6853 / 3.1243 | 4.4147 / 2.4048
P-candidate | 9.6477 / 40.9395 | 39.5285 / 93.0011 | 8.9980 / 29.6718
Table 5. GBR errors (mean of absolute errors, MAE) on the real-world data sets; bold in the original indicates the lowest error among the clean-based and among the candidate-based methods. Low/High refers to the rate of outliers.

Approach | CPI: Low / High | Servo: Low / High | CCPP: Low / High
S-clean | 0.5436 / 0.9573 | 1.8397 / 2.7897 | 4.8527 / 9.5681
S-candidate | 15.6762 / 28.3775 | 31.3897 / 29.3897 | 123.4579 / 120.6715
BP-clean | 0.5041 / 0.6513 | 1.7120 / 1.8199 | 3.5687 / 4.81561
BP-candidate | 12.7497 / 16.5123 | 21.5382 / 20.6252 | 103.2674 / 114.6480
IP-clean | 0.3842 / 3.4740 | 1.3812 / 1.8524 | 2.9301 / 6.6549
IP-candidate | 2.0511 / 3.6762 | 2.3492 / 2.1693 | 8.6252 / 12.8456
P-clean | 1.9025 / 2.2162 | 1.8897 / 2.4168 | 5.1267 / 5.9471
P-candidate | 3.6759 / 7.5947 | 4.5218 / 5.6037 | 14.3570 / 55.3547
Table 6. Comparison of results on the CCPP data set.

Approach | Abbreviation | MAE
Bagging Reduced Error Pruning trees [39,40] | BREP | 2.818
KStar [41] | K* | 2.882
Instance-based Penalty on clean data | IP-clean | 2.932
Reduced Error Pruning trees [42] | REP | 3.133
Model Trees Regression [43] | M5P | 3.140
Model Trees Rules [44] | M5R | 3.172
Support Vector Poly Kernel [45] | SMOReg | 3.620
Least Median Square [46] | LMS | 3.621
Linear Regression | LR | 3.625

Fig. 8. Overall improved error for regression over the S method (by approach and rate of outliers).

Table 7. Comparison of results on the servo data set.

Approach | Abbreviation | MSE
Structure Search [38] | SS | 0.10
Additive Gaussian Process [47] | GP Additive | 0.11
Gaussian Process SE kernel using Automatic Relevance Determination [48] | GP SE-ARD | 0.12
Hierarchical Kernel Learning [49] | HKL | 0.19
Instance-based Penalty on clean data | IP-clean | 0.21
Generalized additive model [50] | GAM | 0.28
Linear Regression | LR | 0.52
Dynamic Ensemble Selection and Instantaneous Pruning [37] | RE-DESIP | 0.66
— | RE | 1.42
— | OA | 1.43
— | RFE | 1.94
— | GA | 2.37
Fig. 9. Middle of regression errors between clean- and candidate-based methods (average of errors over the clean- and candidate-based methods using low and high rates of outliers, for S, BP, IP, and P).
Fig. 7. Improved volumes over the sensitive (S) method, by approach (BP, IP, P) and dimensionality (2D, 3D, 5D, and overall).
Concretely, to analyse the regression in terms of residual error, we examine Tables 4 and 5 (in Section 5) with respect to the dimensionality of data. The results show that the higher error rates belong to the candidate-based methods compared with the clean-based methods, which holds for both the two- and three-dimensional data. The reason is the inappropriate candidate points that represent the boxes, on which the regression is subsequently performed. The regression error is also affected by the rate of outliers introduced into the original data. We observe in Fig. 8 that BP-clean and IP-candidate improve the error over their corresponding S methods by 20% and 70%, as the least and the most improvement, respectively. To observe the range of errors, Fig. 9 shows the ranges and the midpoints of the errors between the clean- and candidate-based methods; in other words, it shows the difference between using either the clean- or the candidate-based method for the GBRs. We observe that IP and P have the smallest ranges, which shows either that the candidates of the boxes are close to the mean of the distributed data in each box or that the candidates should be readjusted to improve the results. In contrast, we observe that BP covers a wide range of errors, which shows either that the configured boxes were unsuccessful in simplifying the data or that the candidates were unable to represent the configured boxes well. Consequently, we recognize the following two points: (i) IP, P, BP, and S have, in that order, closer intervals and smaller midpoint errors; and (ii) the BP-candidate method needs an improved candidate representation of a box. In general, we can recognize the error interval of each approach by (i).

For further analysis, we performed a statistical analysis at the 5% significance level on the results of 100 runs for the 8 GBR methods over the 12 variations of the six data sets. We applied a z-test and a k-test, where the k-test revealed that P mostly dissatisfies the conditions. Then, similarly, we applied a two-sample t-test for the GBRs against iteratively reweighted least-squares (IRLS) [53] and LR; IRLS is known as a robust estimation against outliers, and LR is the regression method used within each GBR method. As a result, LR rejected the null hypothesis, while the GBRs revealed the same mean as IRLS in most cases.
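As an illustration of the statistical comparison just described, the sketch below runs a two-sample t-test between the per-run errors of a GBR method and those of a reference estimator such as IRLS; the error arrays are placeholders standing in for the 100-run experimental results, and only the 5% significance level follows the text.

```python
import numpy as np
from scipy import stats

# Placeholder per-run residual errors (100 runs each), e.g. loaded from experiments.
errors_gbr = np.random.default_rng(1).normal(1.00, 0.2, 100)
errors_irls = np.random.default_rng(2).normal(1.05, 0.2, 100)

t_stat, p_value = stats.ttest_ind(errors_gbr, errors_irls, equal_var=False)
same_mean = p_value >= 0.05   # fail to reject the null hypothesis of equal means
print(f"t = {t_stat:.3f}, p = {p_value:.3f}, same mean: {same_mean}")
```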
6.2. Box configuration

In terms of the elapsed time to configure the boxes and eliminate the outliers, Fig. 10 shows that IP and P take the shortest and the longest time, respectively. It is worth mentioning that the elapsed time of the P method may vary depending on the nature of the outliers, as it examines the potential of being an actual outlier for every instance on the edge of the configured boxes. Therefore, the elapsed time of the P method is sensitive to outliers. We analyzed the dependency of the volumes on the generated outliers by tracking the obtained volumes and their standard deviations.

Fig. 10. Elapsed time (in seconds) taken by the GA to execute each box configuration method (S, BP, IP, P) on each data set.
In every run, we measured the obtained volumes, given in Table 8, and their standard deviations, given in Table 9, to compute the insensitivity of the volumes via the following indicators: (i) STD_VB^Outliered, the standard deviation of volumes for the data set with outliers, and (ii) STD_VB^GB, the standard deviation of volumes for the granular box configuration approaches. The higher the standard deviation, the higher the dependency on the outliers. Consequently, Fig. 11 shows that P, BP, and S are outlier sensitive, whereas IP is the most insensitive. As expected, the standard deviation of S is identical to that of the data with outliers. We found that the results depend on the dimensionality of data and the rate of outliers: in most cases, the dependency increases slightly with the rate of outliers and significantly with the dimensionality of data.

Fig. 11. Average of standard deviations of volumes for the outliered data set and for each approach (S, BP, IP, P).

6.3. Summary

The analyses showed the superiority of the penalty scheme in terms of volume, residual error, time, and insensitivity. Correspondingly, Eqs. (15)–(18), and the consequent Eqs. (19)–(22) in the appendix, express each indicator:

$$Volume^{GB} = \frac{V_S - V_{GB}}{V_S} \times 100 \qquad (15)$$

$$Time^{GB} = \frac{T_S - T_{GB}}{T_S} \times 100 \qquad (16)$$

$$Insensitivity_{STD_{V_B}^{GB}} = \frac{STD_{V_B}^{Outliered} - STD_{V_B}^{GB}}{STD_{V_B}^{Outliered}} \times 100 \qquad (17)$$

$$V_B = \bigcup_{b=1}^{m} volume_b, \quad m = 1 \text{ for } STD_{V_B}^{S}, \text{ and } m = 3 \text{ (the number of boxes) for } STD_{V_B}^{GB} \qquad (18)$$

where V_S and V_GB denote the volumes obtained by S and by another granular box approach (BP, IP, or P); T_S and T_GB denote the elapsed times for box configuration by each approach; and STD_VB^Outliered and STD_VB^GB denote the standard deviation of volumes for the data set with outliers and for the granular box approaches, respectively. Each of these indicators reveals the results in Fig. 12 and Table 10, which summarize the analyses of this paper. Fig. 12 illustrates the improvement of each approach over S, and Table 10 shows the best-performing approach for each measurement. IP outperforms in terms of volume and insensitivity, with 99% in both cases; IP-candidate performs best in regression, and IP-clean performs best in granular box configuration.
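A minimal sketch of the improvement indicators of Eqs. (15)–(17); the input values are placeholders.

```python
def improvement(reference: float, value: float) -> float:
    """Percentage improvement of `value` over `reference`, as in Eqs. (15)-(17)."""
    return (reference - value) / reference * 100.0

# Volume^GB      : improvement(V_S, V_GB)              -- box volume vs. the S approach
# Time^GB        : improvement(T_S, T_GB)              -- elapsed time vs. the S approach
# Insensitivity  : improvement(STD_outliered, STD_GB)  -- vs. the data set with outliers
```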
Table 8. Obtained volumes of the configured boxes over the synthetic data sets (L = low and H = high rate of outliers); bold in the original indicates the lowest volumes, and highlighting the smallest distance between L and H.

Approach | Functional–Eq. (10): L / H | Functional–Eq. (11): L / H | Peters: L / H
S | 84.1820 / 110.8725 | 1475.5000 / 1581.0000 | 90.3804 / 103.5050
BP | 43.3555 / 67.5943 | 1178.5000 / 1246.8000 | 53.9513 / 56.9508
IP | 7.5980 / 11.2788 | 0.0000447 / 0.0000259 | 3.6289 / 1.0466
P | 60.1890 / 74.8227 | 319.8614 / 530.3134 | 77.5317 / 102.6414
Table 9. Average standard deviation of the configured box volumes over the synthetic data sets (L = low and H = high rate of outliers).

 | Functional–Eq. (10): L / H | Functional–Eq. (11): L / H | Peters: L / H
Data with outliers | 11.95 / 15.9659 | 65.2580 / 88.3393 | 8.3101 / 11.5201
S | 18.02 / 14.6710 | 136.0300 / 145.9600 | 10.2300 / 16.0940
BP | 11.26 / 11.1730 | 107.1700 / 807.8810 | 5.4511 / 16.8320
IP | 2.04 / 2.6350 | 0.0002 / 0.0002 | 1.3892 / 1.2383
P | 53.09 / 76.7934 | 564.6020 / 685.2220 | 50.6395 / 169.9300
Fig. 12. Summary of results for the improved approaches (percentage improvement over the S method for volume, residual error, and time; improvement over the data set with outliers for insensitivity).
Table 10. Summary of the best-performing approaches per measurement.

Approach | Time | Volume | Error | Insensitivity
IP-clean | ✓ | ✓ | – | ✓
IP-candidate | – | – | ✓ | –
BP-clean | – | – | – | –
BP-candidate | – | – | – | –
P-clean | – | – | – | –
P-candidate | – | – | – | –
7. Conclusion

This paper investigated the insensitivity of granular box regressions to outliers. These approaches configure granular boxes on the data and then perform the regression analysis on subsets of the granulated data. We modified two approaches for the granular box configuration: the first penalizes a box that does not contain the required number of data, and the second penalizes the instances not confined by any box. Then, we performed the outlier elimination by applying two methods to keep the major trend of the data within each granular box: the first method cleans the data based on dispersion, and the second takes the centre of each box as its candidate. The investigation compared the proposed granular box regressions against the outlier-sensitive approach. The comparisons were carried out for the regression analysis and the box configuration in terms of: (i) the residual error of the estimated function, (ii) the elapsed time for box configuration, (iii) the obtained volume, and (iv) the standard deviation of volumes. Moreover, the analyses considered the dimensionality of data and the rate of outliers. The results over S revealed the overall superiority of IP: it improved the volume by 99%, the residual error by 72%, and the insensitivity to outliers in box configuration by 99%.

For future work, the candidate-based BP requires enhancement to decrease the regression error. Consequently, one can consider the following works for the box configuration. (i) Enhance the representation of box candidates by investigating the optimal number and placement of candidates. (ii) Enhance the quality of the cleaned data when data are excluded from a box, by investigating different measurements. (iii) Investigate the number of data that the IP or BP method should be required to contain in each box, to make the efficacy of the optimal size of boxes transparent. (iv) Investigate the optimal number of boxes to best furnish the regression data. (v) Enhance the performance of the P approach by investigating the optimal values of its controlling variables. (vi) Investigate various outlier-generating functions to find their efficacy on each method. In addition, one can investigate the same characteristics for non-linear functions to detect outliers in a data set generated from a linear function, given the promising results of granular box regression.

Acknowledgements

The Universiti Teknologi Malaysia (UTM) and the Ministry of Education Malaysia, under Research University Grants 00M19, 02G71 and 4F550, are hereby acknowledged for some of the facilities that were utilized during the course of this research work.

Appendix A

$$Error_{Method} = Avg\left(\sum_{n=1}^{outlier}\sum_{i=1}^{dataset} Avg\left(\sum_{j=1}^{run} Error_j^{clean\text{-}method,\,dataset}\right)\right) + Avg\left(\sum_{n=1}^{outlier}\sum_{i=1}^{dataset} Avg\left(\sum_{j=1}^{run} Error_j^{candidate\text{-}method,\,dataset}\right)\right) \qquad (19)$$

$$Volume_{Method} = Avg\left(\sum_{n=1}^{outlier}\sum_{i=1}^{dataset} Avg\left(\sum_{j=1}^{run} Vol_j^{clean\text{-}method,\,dataset}\right)\right) + Avg\left(\sum_{n=1}^{outlier}\sum_{i=1}^{dataset} Avg\left(\sum_{j=1}^{run} Vol_j^{candidate\text{-}method,\,dataset}\right)\right) \qquad (20)$$

$$Time_{Method} = Avg\left(\sum_{n=1}^{outlier}\sum_{i=1}^{dataset} Avg\left(\sum_{j=1}^{run} Elapsed_j^{clean\text{-}method,\,dataset}\right)\right) + Avg\left(\sum_{n=1}^{outlier}\sum_{i=1}^{dataset} Avg\left(\sum_{j=1}^{run} Elapsed_j^{candidate\text{-}method,\,dataset}\right)\right) \qquad (21)$$

$$STD_{Method} = Avg\left(\sum_{n=1}^{outlier}\sum_{i=1}^{dataset} Avg\left(\sum_{j=1}^{run} STD_j^{clean\text{-}method,\,dataset}\right)\right) + Avg\left(\sum_{n=1}^{outlier}\sum_{i=1}^{dataset} Avg\left(\sum_{j=1}^{run} STD_j^{candidate\text{-}method,\,dataset}\right)\right) \qquad (22)$$
where, in Eqs. (19)–(22), outlier = 2, dataset = 6, and run = 100.

References

[1] F. Giunchiglia, T. Walsh, A theory of abstraction, Artif. Intell. 57 (1992) 323–389. [2] L.A. Zadeh, Toward a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic, Fuzzy Sets Syst. 90 (1997) 111–127. [3] J.R. Hobbs, Granularity, in: Proceedings of the Ninth International Joint Conference on Artificial Intelligence, Citeseer, 1985. [4] Y. Yao, Human-inspired granular computing, in: Novel Developments in Granular Computing: Applications for Advanced Human Reasoning and Soft Computation, 2010, pp. 1–15. [5] Y.-Q. Zhang, B. Jin, Y. Tang, Granular neural networks with evolutionary interval learning, IEEE Trans. Fuzzy Syst. 16 (2008) 309–319. [6] Y. Yao, Interpreting concept learning in cognitive informatics and granular computing, IEEE Trans. Syst. Man Cybern., Part B: Cybern. 39 (2009) 855–866. [7] Y. Yao, J. Luo, Top-down progressive computing, in: Rough Sets and Knowledge Technology, Springer, 2011, pp. 734–742.
[8] G. Peters, Granular box regression, IEEE Trans. Fuzzy Syst. 19 (2011) 1141–1152. [9] G. Peters, Z. Lacic, Tackling outliers in granular box regression, Inf. Sci. 212 (2012) 44–56. [10] D.M. Hawkins, Identification of Outliers, Springer, 1980. [11] V.J. Hodge, J. Austin, A survey of outlier detection methodologies, Artif. Intell. Rev. 22 (2004) 85–126. [12] J. Fox, Regression Diagnostics: An Introduction, Sage, 1991. [13] V. Barnett, T. Lewis, Outliers in Statistical Data, Wiley, New York, 1994. [14] N. Devarakonda, S. Subhani, S.A.H. Basha, Outliers detection in regression analysis using partial least square approach, in: ICT and Critical Infrastructure: Proceedings of the 48th Annual Convention of Computer Society of India, vol. II, Springer, 2014, pp. 125–135. [15] A. Dastanpour, S. Ibrahim, R. Mashinchi, Using genetic algorithm to supporting artificial neural network for intrusion detection system, in: The International Conference on Computer Security and Digital Investigation (ComSec2014), The Society of Digital Information and Wireless Communication, 2014, pp. 1–13. [16] V. Chandola, A. Banerjee, V. Kumar, Anomaly detection: a survey, ACM Comput. Surv. (CSUR) 41 (2009) 15. [17] D.W. Osgood, Poisson-based regression analysis of aggregate crime rates, J. Quant. Criminol. 16 (2000) 21–43. [18] A. Shen, R. Tong, Y. Deng, Application of classification models on credit card fraud detection, in: Service Systems and Service Management, 2007 International Conference on, IEEE, 2007, pp. 1–4. [19] E. Ngai, Y. Hu, Y. Wong, Y. Chen, X. Sun, The application of data mining techniques in financial fraud detection: a classification framework and an academic review of literature, Decis. Support Syst. 50 (2011) 559–569. [20] E.M. Matsumura, R.R. Tucker, Fraud detection: a theoretical foundation, Account. Rev. (1992) 753–782. [21] L.C. Mercer, Fraud detection via regression analysis, Comput. Secur. 9 (1990) 331–338. [22] Y. Wang, A multinomial logistic regression modeling approach for anomaly intrusion detection, Comput. Secur. 24 (2005) 662–674. [23] Y. Tsuda, M. Samejima, M. Akiyoshi, N. Komoda, M. Yoshino, An anomaly detection method for individual services on a web-based system by selection of dummy variables in multiple regression, Electron. Commun. Jpn 97 (2014) 9–16. [24] M.H. Mashinchi, M.A. Orgun, M. Mashinchi, Solving fuzzy linear regression with hybrid optimization, in: C. Leung, M. Lee, J. Chan (Eds.), Neural Information Processing, Springer, Berlin Heidelberg, 2009, pp. 336–343. [25] J.H. Holland, Genetic algorithms, Sci. Am. 267 (1992) 66–72. [26] M. Gen, R. Cheng, Genetic Algorithms and Engineering Optimization, John Wiley & Sons, 2000. [27] S. Chatterjee, A.S. Hadi, Regression Analysis by Example, John Wiley & Sons, 2013. [28] R.H. Myers, Classical and Modern Regression with Applications (Duxbury Classic), Duxbury Press, Pacific Grove, 2000. [29] D. Kleinbaum, L. Kupper, A. Nizam, E. Rosenberg, Applied Regression Analysis and Other Multivariable Methods, Cengage Learning, 2013. [30] B. Apolloni, D. Iannizzi, D. Malchiodi, W. Pedrycz, Granular regression, in: Neural Nets, Springer, 2006, pp. 147–156. [31] A. Bargiela, W. Pedrycz, T. Nakashima, Multiple regression with fuzzy data, Fuzzy Sets Syst. 158 (2007) 2169–2188.
[32] S.-P. Chen, J.-F. Dang, A variable spread fuzzy linear regression model with higher explanatory power and forecasting accuracy, Inf. Sci. 178 (2008) 3973– 3988. [33] S. Roychowdhury, W. Pedrycz, Modeling temporal functions with granular regression and fuzzy rules, Fuzzy Sets Syst. 126 (2002) 377–387. [34] M.H. Mashinchi, M.A. Orgun, M.R. Mashinchi, A least square approach for the detection and removal of outliers for fuzzy linear regression, in: World Congress on Nature and Bio-logically Inspired Computing, 2010, pp. 134–139. [35] J.J. Buckley, T. Feuring, Linear and non-linear fuzzy regression: evolutionary algorithm solutions, Fuzzy Sets Syst. 112 (2000) 381–394. [36] P. Tüfekci, Prediction of full load electrical power output of a base load operated combined cycle power plant using machine learning methods, Int. J. Electr. Power Energy Syst. 60 (2014) 126–140. [37] K. Dias, T. Windeatt, Dynamic ensemble selection and instantaneous pruning for regression, in: ESANN, Bruges, Belgium, 2014. [38] D. Duvenaud, J.R. Lloyd, R. Grosse, J.B. Tenenbaum, Z. Ghahramani, Structure discovery in nonparametric regression through compositional kernel search, 2013, arXiv preprint arXiv:1302.4922. [39] L. Breiman, Bagging predictors, Mach. Learn. 24 (1996) 123–140. [40] J. D’Haen, D. Van Den Poel, Temporary staffing services: a data mining perspective, in: 2012 IEEE 12th International Conference on Data Mining Workshops (ICDMW), 2012, pp. 287–292. [41] J.G. Cleary, L.E. Trigg, K⁄: An instance-based learner using an entropic distance measure, in: ICML, 1995, pp. 108–114. [42] S. Portnoy, R. Koenker, The Gaussian hare and the Laplacian tortoise: computability of squared-error versus absolute-error estimators, Stat. Sci. 12 (1997) 279–300. [43] Y. Wang, I.H. Witten, Inducing model trees for continuous classes, in: Proceedings of the Ninth European Conference on Machine Learning, 1997, pp. 128–137. [44] S. Ekinci, U.B. Çelebi, M. Bal, M.F. Amasyali, U.K. Boyaci, Predictions of oil/chemical tanker main design parameters using computational intelligence techniques, Appl. Soft Comput. 11 (2011) 2356–2366. [45] M.O. Elish, A comparative study of fault density prediction in aspect-oriented systems using MLP, RBF, KNN, RT, DENFIS and SVR models, Artif. Intell. Rev. (2012) 1–9. [46] P.J. Rousseeuw, Least median of squares regression, J. Am. Stat. Assoc. 79 (1984) 871–880. [47] D. Duvenaud, H. Nickisch, C.E. Rasmussen, Additive Gaussian processes, in: Neural Information Processing Systems, 2011. [48] R.M. Neal, Bayesian Learning for Neural Networks, University of Toronto, 1995. [49] F. Bach, High-dimensional non-linear variable selection through hierarchical kernel learning, 2009. arXiv preprint arXiv:0909.0844. [50] T.J. Hastie, R.J. Tibshirani, Generalized Additive Models, CRC Press, 1990. [51] T. Windeatt, K. Dias, Feature ranking ensembles for facial action unit classification, in: L. Prevost, S. Marinai, F. Schwenker (Eds.), Artificial Neural Networks in Pattern Recognition, Springer, Berlin Heidelberg, 2008, pp. 267– 279. [52] D. Hernández-Lobato, G. Martínez-Muñoz, A. Suárez, Empirical analysis and evaluation of approximate techniques for pruning regression bagging ensembles, Neurocomputing 74 (2011) 2250–2264. [53] P.W. Holland, R.E. Welsch, Robust regression using iteratively reweighted least-squares, Commun. Stat.–Theory Methods 6 (1977) 813–827.