Accepted Manuscript
AdaptativeCC4.5: Credal C4.5 with a rough class noise estimator Joaqu´ın Abellan, ´ Carlos J. Mantas, Javier G. Castellano PII: DOI: Reference:
S0957-4174(17)30663-2 10.1016/j.eswa.2017.09.057 ESWA 11578
To appear in:
Expert Systems With Applications
Received date: Revised date: Accepted date:
5 June 2017 7 August 2017 26 September 2017
Please cite this article as: Joaqu´ın Abellan, ´ Carlos J. Mantas, Javier G. Castellano, AdaptativeCC4.5: Credal C4.5 with a rough class noise estimator, Expert Systems With Applications (2017), doi: 10.1016/j.eswa.2017.09.057
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
ACCEPTED MANUSCRIPT
1
Highlights
AC
CE
PT
ED
M
AN US
CR IP T
– CC4.5 is a parametric classifier based on imprecise probabilities outperforming C4.5 – That improvement increases in situations with class noise – Higher values of the parameter of CC4.5 are related with higher levels of noise – A new procedure to estimate the level of noise is combined with CC4.5 – The new classifier is similar to the CC4.5 when its best parameter value is used
ACCEPTED MANUSCRIPT
AdaptativeCC4.5: Credal C4.5 with a rough class noise estimator
CR IP T
Joaqu´ın Abell´ an and Carlos J. Mantas and Javier G. Castellano Department of Computer Science and Artificial Intelligence University of Granada, Granada, Spain {jabellan,cmantas,fjgc}@decsai.ugr.es
ED
M
AN US
Abstract. The application of classifiers on data represents an important help in a process of decision making. Any classifier, or other method used for knowledge extraction, suffers a deterioration when it is applied on data with noise. Credal C4.5 (CC4.5) is a recent method of classification, that introduces imprecise probabilities in the algorithm of the classic C4.5. It is very suitable in classification noise tasks, but it has a clear dependency of a parameter. It has been proved that this parameter is related with the level of overfitting of the model on the data used for training. In noisy domains, this characteristic is important in the sense that variations of this parameter can reduce the variance of the model. Depending on the degree of noise that a data set has, the application of different values of this parameter can produce different performance of the CC4.5 model. Hence, the use of the correct parameter is fundamental to attain a high level of performance for this model. In this paper, that problem is solved via a rough procedure to estimate the level of class noise in the training data. Combining this new noise estimation process with the CC4.5, it is presented a direct method that has an equivalent performance than the one of the CC4.5 when it is used with the best value of its parameter for each level of class noise.
Introduction
CE
1
PT
Keywords: decision support system, classification, decision tree, imprecise probabilities; uncertainty measures; noisy data
AC
Data mining is the computing process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. The task of the supervised classification (Hand, 1997) is a part of the data mining area. It starts from a set of observations determined by a set of features or attributes that have assigned a known class label. From a subset of observations, called training set, the classifier obtains a set of rules to predict a value of the class variable for each new observation. To verify the quality of this
ACCEPTED MANUSCRIPT
3
set of rules, it is used a different set of observations called test set. From this verification, a measure of the classification error is obtained.
CR IP T
Real-world datasets may contain noise, which it can be defined as anything that obscures the relationship between the features of an instance and its class (Frenay & Verleysen, 2014). In particular, Class noise or label noise is named to those situations where data sets have incorrect class labels. This situation is mainly motivated by deficiencies in the data learning and/or test capture process, such that wrong disease diagnosis method, human errors in the class label assignation, etc. The presence of label noise in the data can represent an added difficulty to extract knowledge from data. The study of methods to apply in this situation is an important task of the data mining area (see (Frenay & Verleysen, 2014), (Delany, Segata, & MacNamee, 2012), (Garcia et al., 2015), (Chaoqun, Victor, Liangxiao, & Hongwei, 2016)).
AN US
Decision trees (DTs) also known as classification trees are a type of classifier that can be seen as a set of rules in a tree format.An attribute variable is introduced in each node. Each leaf contains a label of the class variable or a set of probabilities for each class label. Hunt’s work (Hunt, Marin, & Stone, 1966) was the origin of decision trees, although they began to gain importance with the publication of the ID3 algorithm proposed by Quinlan (Quinlan, 1986). Afterwards Quinlan proposed the C4.5 (Quinlan, 1993) algorithm, which is an improvement of the previous ID3 with better results.
PT
ED
M
An important problem in supervised learning is to minimize the two sources of the classification error (Kohavi & Wolpert, 1996): (i) the Bias component, that represents the systematic component of the error resulting from the incapacity of the predictor to model the underlying distribution; and (ii) the Variance component, that represents the component of the error that stems from the particularities of the training sample and can be decreased by the increasing of the size of the data set. The Bias-Variance trade off can be seen as a tool for understand the performance of a classifier. A good learning method must have low Bias and low Variance.
CE
When a DT is pruned, the bias error is increased and the variance error is reduced. Hence, it is necessary to achieve a trade off between bias and variance when a DT is pruned. For this reason, a node is pruned if this operation improves certain criteria, normally error estimation measures (Rokach & Maimon, 2008).
AC
Via a mathematical model based on imprecise probabilities, known as Imprecise Dirichlet Model (IDM), (Abell´ an & Moral, 2003) have developed an algorithm for designing DTs, called credal decision trees (CDTs). The split criterion for this algorithm uses uncertainty measures on credal sets, i.e. closed and convex sets of probability distributions. In particular, the CDT algorithm extends the measure of information gain used by the ID3 algorithm. The split criterion is called the Imprecise Info-Gain (IIG). More recently, the theory of credal decision trees and the C4.5 algorithm have been connected in (Mantas & Abell´ an, 2014b), with the definition of the Credal-C4.5 (CC4.5) algorithm. This algorithm uses a new split criterion called Imprecise Info-Gain Ratio (IIGR).
ACCEPTED MANUSCRIPT
4
AC
CE
PT
ED
M
AN US
CR IP T
When imprecise probabilities are used to build decision trees, we are supposing that the training data are not clean. For this reason, both models (CDT and CC4.5), that it will be called as credal trees,1 are especially suitable to be applied on class noise. This fact has been experimentally shown in (Abell´ an & Masegosa, 2009, 2012; Mantas & Abell´ an, 2014a; Abell´ an & Mantas, 2014; Mantas & Abell´ an, 2014b), where it is shown that CC4.5 method has some better performance than the CDT method in noise domains. A complete and recent revision of machine learning methods to manipulate data sets with class noise can be found in (Frenay & Verleysen, 2014). In fact, variance error of the trees is reduced when imprecise probabilities are used to design DTs. For the same classification problem, credal trees produce smaller trees than classic DTs (Abell´ an & Masegosa, 2012; Mantas & Abell´ an, 2014a, 2014b). Therefore, these models have a new tool to solve the high variance problem of the DTs. This new characteristic is complementary with the pruning methods, as it was exposed in (Abell´ an & Masegosa, 2012; Mantas & Abell´ an, 2014b). In this way, we have two tools for controlling the variance in DTs: pruning methods and the use of imprecise probabilities to estimate values. The pruning methods can find a trade off between bias and variance by using error estimation based criteria for pruning a node. However, a trade off between bias and variance is not found when a credal tree procedure is used to build a tree. In this case, the imprecision degree of the probability distributions is determined by a hyperparameter s. The value s = 1 is always used in the previous works about CDTs, due to recommendations of the author of the IDM and computational reasons (see (Mantas & Abell´ an, 2014b)). In this way, the degree of imprecision of the probability distributions is fixed and, therefore, a trade off between bias and variance is not set for each data set. In a recent work, (Mantas, Abell´ an, & Castellano, 2016) have studied the relation between the value of the s parameter and the behavior of the CC4.5 (the best credal tree model in noise domains) when it is applied on data sets with different levels of label noise. It is shown the importance of correctly choosing the value for that parameter. If the noise level of a data set can be estimated, to determine the correct value for s can involve excellent results in the application of the CC4.5 method. Normally, higher values of s should be used when the level of noise is higher. However, this is not a rule for all the data sets, as we can seen in (Mantas et al., 2016). For each estimated level of noise, we cannot fix the same value of the s parameter for all the data sets with the aim of obtaining the best possible results. That optimum value of s can also depend on other characteristics of the data set. In this paper, it is solved the limitations above exposed, including the one of the static imprecision degree of the probability distributions in the CC4.5 algorithm. Here, it is presented a new method where the value of the s parameter is determined by an rough estimation of the noise level for each training data set and the number of states of the class variable. A particular value of the 1
Here, the term “credal” refers to the use of credal sets in the process to build decision trees.
ACCEPTED MANUSCRIPT
5
2.1
Previous knowledge
M
2
AN US
CR IP T
parameter s is found for each data set that is used to build a DT via the CC4.5 algorithm. In this way, it is achieved that such DT can find a trade off between bias and variance by means of two complementary tools: controlled pruning methods and the selection of a value for the parameter s by noise estimation. With this last tool, the imprecision degree of the probability distributions in the CC4.5 algorithm can be adjusted for each data set. Finally, a large experimental study has been carried out. Credal C4.5 algorithm with the new characteristic of selecting the value for s by noise detection (called AdaptativeCC4.5 ) was implemented for these experiments. From this study, it can be concluded that the AdaptativeCC4.5 algorithm obtains the best results when noisy data sets are classified. It is shown that the performance of the new method is, normally, equivalent or better than the one of the CC4.5 method when it is used with its best value of the s parameter on each level of class noise for a data set. Section 2 briefly describes the necessary previous knowledge about decision trees, credal decision trees and the CC4.5 algorithm. Sections 3 describes the AdaptativeCC4.5 algorithm, and exposes the characteristics of its algorithm. Section 4 describes and comments the experimentation carried out on data sets varying the percentage of label noise. Finally, Section 5 is devoted to the conclusions.
Decision Trees and the problem of the Bias-Variance trade off
ED
Decision trees (DTs) are models based on a recursive partition method, the aim of which is to divide the data set using a single variable at each level. The process for inferring a decision tree is mainly determined by the followings aspects:
CE
PT
(1) The split criterion, i.e. the method used to select the attribute to be inserted in a node and branching. (2) The criterion to stop the branching. (3) The method for assigning a class label or a probability distribution at the leaf nodes. An optional final step in the procedure to build DTs, which is used to reduce the overfitting of the model on the training set, is the following one:
AC
(4) The post-pruning process used to simplify the tree structure. In classic procedures for building DTs, where a measure of information based in probability theory (PT) is used, the criterion to stop the branching (above point (2)) normally is the following one: when the measure of information is not improved, or when a threshold of gain in that measure is attained. With respect to the above point (3), the value of the class variable inserted in a leaf node is the one with more frequency in the partition of the data associated with that leaf
ACCEPTED MANUSCRIPT
6
CR IP T
node; also can be inserted the distribution of probabilities, on the class variable, associated with the partition set in that node. Then the principal difference among all the procedures to build DTs is the point (1), i.e. the split criterion used to select the attribute variable to be inserted in a node. Many different approaches for building DTs have been published. The ones of Quinlan, ID3 and C4.5 (Quinlan, 1993), stand out among the most known of these ones. Let us see the split criteria used by these algorithms. Let us suppose a classification problem. Let C be the class variable, {X1 , . . . , Xf } the set of features, and X a general feature. With these conditions, the split criteria used by ID3 and C4.5 can be resume as follows: • The Info-Gain (IG) criterion was introduced by Quinlan as the basis for his ID3 model, which is defined as follows: X IG(C, X) = H(C) − P (X = xi )H(C|X = xi ),
AN US
i
where the function H is the classic Shannon’s entropy (Shannon, 1948). • The Info-Gain Ratio (IGR) criterion was introduced for the C4.5 model (Quinlan, 1993) in order to improve the ID3 model. IGR penalizes variables with many states. It is defined as follows: IGR(C, X) =
IG(C, X) . H(X)
M
The bias-variance trade off, in supervised learning, consists of simultaneously minimize the two sources of the classification error (Kohavi & Wolpert, 1996):
CE
PT
ED
• The bias represents the systematic component of the error resulting from the incapacity of the predictor to model the underlying distribution. It is the error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting). • The variance represents the component of the error that stems from the particularities of the training sample and can be decreased by the increasing in the size of the data set. So, if the classifier becomes more sensitive to changes in the training data set, the variance increases. High variance of an algorithm can cause overfitting: modeling the random noise in the training data, rather than the intended outputs.
AC
Ideally, one wants to choose a model that both accurately captures the regularities in its training data, but also generalizes well to unseen data. Unfortunately, it is typically impossible to do both simultaneously (Geman, Bienenstock, & Doursat, 1992) (Sammut & Webb, 2011). High-variance learning methods may be able to represent their training set well, but they are at risk of overfitting the noise of the unrepresentative data from the training set. In contrast, algorithms with high bias typically produce simpler models that do not tend to overfit, but may underfitting the used training data, failing to capture important regularities.
ACCEPTED MANUSCRIPT
7
2.2
CR IP T
It is known that DTs are learning algorithms that have high variance. They suffer the overfitting problem, specially when the overfitting risk is higher as happens with high noise levels appear in the training data. In order to solve the high variance problem, as the depth of the tree determines the variance, DTs are commonly pruned to control the variance (James, Witten, Hastie, & Tibshirani, 2013). It has been shown in various studies (Maimon & Rokach, 2005) that pruning methods can improve the generalization performance of a DT, especially in noisy domains. There are various techniques for pruning DTs. A revision of these techniques can be found in (Rokach & Maimon, 2008). By virtue of the pruning method, the DTs reduce the probabilities of overfitting due to noise in the training data. Credal Decision Trees
M
AN US
The Imprecise Info-Gain (IIG) (Abell´ an & Moral, 2003), used in the procedure to build Credal Decision Trees (CDT), is based on imprecise probabilities and the utilization of uncertainty measures on credal sets. Probability intervals are obtained from the data set using the Walley’s Imprecise Dirichlet Model (IDM) (Walley, 1996) (a special type of credal sets (Abell´ an, 2006)). The mathematical basis applied is described below. With the above notation for the class variable and attributes, p(cj ), j = 1, .., k, is defined for each value cj of the variable C via the IDM on the following way: ncj ncj + s , , j = 1, .., k; (1) p(cj ) ∈ N +s N +s
AC
CE
PT
ED
where ncj is the frequency of the case (C = cj ) in the data set, N the sample size and s a given hyperparameter. The value of parameter s regulates the convergence speed of the upper and lower probability when the sample size increases. Higher values of s produce an additional cautious inference. Walley (Walley, 1996) does not give a decisive recommendation for the value of the parameter s, but he proposed two candidates: s = 1 or s = 2, nevertheless he recommend the value s = 1. It is easy to check that the size of the intervals increases when the value of s increases. That representation gives rise to a specific kind of credal set on the variable C, K D (C) (Abell´ an, 2006), with D a data set or a partition of a data set. This set is defined as follows: K D (C) =
ncj ncj + s , , p | p(cj ) ∈ N +s N +s
j = 1, .., k .
(2)
On this type of sets (really credal sets, (Abell´ an, 2006)), uncertainty measures can be applied. The procedure to build CDTs uses the maximum of entropy function2 on the above defined credal set. This function, denoted as H ∗ , is defined in the following way: 2
A well accepted measure of uncertainty based information on credal sets (Klir, 2005)
ACCEPTED MANUSCRIPT
8
H ∗ (K D (C)) = max H D (p) | p ∈ K D (C)
(3)
CR IP T
The procedure to obtain H ∗ for the special case of the IDM reaches its lowest computational cost for s ≤ 1 (see Abell´ an (Abell´ an, 2006) for more details). The scheme to induce CDTs is like the one used by the classical ID3 algorithm (Quinlan, 1986), replacing its Info-Gain Split criterion with the Imprecise InfoGain (IIG) split criterion which can be defined by the following way: IIGD (C, X) = H ∗ (K D (C)) − H ∗ (K D (C|X)),
(4)
M
AN US
where H ∗ (K D (C|X)) is calculated via a similar way than H D (C|X) in the IG criterion (for a more extended explanation see Mantas and Abell´ an (Mantas & Abell´ an, 2014b)) It should be taken into account that for a variable X and a data set D, IIGD (C, X) can be negative. This situation does not occur with the Info-Gain criterion. This important characteristic allows that the IIG criterion discards variables that worsen the information on the class variable. This is an important feature of the model that can be considered as an additional criterion to stop the branching of the tree, reducing the overfitting of the model. For s ≤ 1, the procedure to calculate the maximum entropy value of a credal set associated with a set of probability intervals from the IDM can be seen in (Abell´ an, 2006). From a set of probability intervals obtained via the IDM, it can be resumed as follows: A = {cj | ncj = mini {nci }}.
ED
The frequencies are modified by the following way: nc i if ci 6∈ A s nci = ; i = 1, .., k; nci + s/l if ci ∈ A
(5)
(6)
p∗ (ci ) =
nsci N +s
(7)
CE
PT
where l = |A|, i.e. the number of elements of A. Finally the distribution with maximum entropy is
AC
In (Mantas et al., 2016) is exposed a simple algorithm for calculating the distribution with maximum entropy when s > 1. This algorithm is explained in Figure 1, and it expresses a repetitive procedure of sharing a mass of 1 taking from the s value. It is less efficient than the one presented in (Abell´ an & Moral, 2006) for more general credal sets, but easier to understand. 2.3
Credal C4.5
The method for building CC4.5 trees is similar to the Quinlan‘s C4.5 algorithm (Quinlan, 1993). The main difference is that CC4.5 estimates the values of the
ACCEPTED MANUSCRIPT
9
CR IP T
1. stemp = s 2. While stemp > 0 2.1. If stemp > 1 then saux = 1 else saux = stemp 2.2. A = {cj | ncj = mini {nci }} nci if ci 6∈ A 2.3. nci = ; i = 1, .., k; nci + saux /|A| if ci ∈ A 2.4. stemp = stemp − 1 nc i 3. p∗ (ci ) = N +s i = 1, .., k
Fig. 1. Procedure to calculate the distribution with maximum entropy when s > 1.
AN US
features and class variable by using imprecise probabilities. An uncertainty measure on credal sets is defined together with a new split criterion. CC4.5 is created by replacing the Info-Gain Ratio (IGR) split criterion from C4.5 with the Imprecise Info-Gain Ratio (IIGR) split criterion. With the notation used in previous section, this criterion can be defined as follows: IIGRD (C, X) = where IIG is equal to:
X
M
IIGD (C, X) = H ∗ (K D (C)) −
IIGD (C, X) . H(X)
P D (X = xi )H ∗ (K D (C|X = xi )),
(8)
(9)
i
CE
PT
ED
with K D (C) and K D (C|X = xi ) are the credal sets obtained via the IDM for the C and (C|X = xi ) variables respectively, for a partition D of the data set (Abell´ an & Moral, 2003); {P D (X = xi ), i = 1, ..., n} is a probability distribution that belongs to the credal set K D (X). We choose the probability distribution P D from K D (X) that maximizes the following expression: X P (X = xi )H(C|X = xi )). (10) i
It is simple to calculate this probability distribution. From the set B = {xj ∈ X | H(C|X = xj ) = maxi {H(C|X = xi )}},
AC
the probability distribution P D will be equal to ( nx i if xi 6∈ B +s D P (xi ) = N nxi +s/m if xi ∈ B N +s
(11)
(12)
where m is the number of elements of B. This expression shares out the parameter s among the values xi with H(C|X = xi ) maximum.
ACCEPTED MANUSCRIPT
10
Procedure BuildCredalC4.5Tree(J,L)
CR IP T
Each node J in a decision tree causes a partition of the data set (for the root node, D is considered to be the entire data set). Furthermore, each J node has an associated list L of feature labels (that are not in the path from the root node to J). The procedure for building CC4.5 trees is explained in the algorithm in Figure 2. We can summarize the main ideas of this procedure:
If L = ∅, then Exit. Let D be the partition associated with node J If |D| < minimum number of instances, then Exit. Calculate P D (X = xi ) (i = 1, ..., n) on the convex set K D (X) Compute the value α = maxXj ∈M IIGRD (C, Xj ) with M = Xj ∈ L | IIGD (C, Xj ) > avgXj ∈L IIGD (C, Xj ) 6. If α ≤ 0 then Exit 7. Else 8. Let Xl be the variable for which the maximum α is attained 9. Assign Xl to node J and remove it from L 10. For each possible value xl of Xl 11. Add a node Jl 12. Make Jl a child of J 13. Call BuildCredalC4.5Tree(Jl ,L)
M
AN US
1. 2. 3. 4. 5.
Fig. 2. Procedure to build a CC4.5 decision tree.
AC
CE
PT
ED
Split Criterion: IIGR criterion is employed to select the split attribute at each branching node. In a similar way to the classic C4.5 algorithm, it is selected the attribute with the highest Imprecise Info-Gain Ratio score and whose Imprecise Info-Gain score is higher than the average Imprecise Info-Gain scores of the valid split attributes. These valid split attributes are those which are numeric or whose number of values is smaller than the thirty percent of the number of instances which are in this branch. Labeling leaf node: The most probable value of the class variable in the partition associated with a leaf node is inserted as label. Stopping Criteria: The branching of the decision tree is stopped when the uncertainty measure is not reduced (α ≤ 0, step 6) or when there are no more features to insert in a node (L = ∅, step 1) or when there are not a minimum number of instances per leaf (step 3). The branching of a decision tree is also stopped when there is not any valid split attribute using the aforementioned condition in “Split Criterion”, like classic C4.5. Handling Numeric Attributes and Missing Values: The numeric attributes and missing values are handled in the same way that classic C4.5 algorithm. The only difference is the use of IIG instead of the IG measure (Quinlan, 1986).
ACCEPTED MANUSCRIPT
11
Post-Pruning Process: Like C4.5, Pessimistic Error Pruning is employed in order to prune a CC4.5.
3
The new AdaptativeCC4.5 classification method
AN US
CR IP T
The AdaptativeCC4.5 algorithm is similar to the CC4.5 algorithm with the difference that the value for the s parameter is previously selected using the training data set3 , i.e. it is determined before using the algorithm and it is dependent from the data set to be classified. In the previous works about credal trees (Abell´ an & Masegosa, 2009, 2012; Mantas & Abell´ an, 2014a; Abell´ an & Mantas, 2014; Mantas & Abell´ an, 2014b), the value s = 1 was used, motivated principally by recommendation from the author of the IDM and by computational reasons. In the AdaptativeCC4.5 algorithm, the value for the s parameter of the IDM is determined in accordance with a rough estimate of the level of noise from the data set. Hence, the value of s is adjusted for each training data. Once the value of the s parameter is fixed, the CC4.5 algorithm is executed with this value. The steps in the AdaptativeCC4.5 algorithm to find a value for the s parameter from a data set are shown in Figure 3.
ED
M
1. Let be L a rough estimation of the noise level from the training data D. 2. Select case L. 2.1. case L ≤ 1%: s = 0 //Very low noise level 2.2. case 1% < L ≤ 15%: s = 1 //Low or medium noise level 2.3. case 15% < L ≤ 20%: s = 2 //High noise level 2.4. case 20% < L: s = 3 //Very high noise level 3. Call BuildCredalC4.5Tree with the estimated s value
PT
Fig. 3. AdaptativeCC4.5 Procedure.
The noise levels utilized in this approach to choose the value of the s parameter, are drawn for the outcomes presented in (Mantas et al., 2016). Rough estimation of the noise level from the data
CE
3.1
AC
As may be deduced from the above description, the key aspect of this proposal to work properly is the correct estimation of the class noise level from training data. To achieve this, the instances with class noise and potentially noisy instances must be detected. In other works (Liu & Tao, 2016),(Natarajan, Dhillon, Ravikumar, & Tewari, 2013), methods for estimating the noise level were presented where the noisy samples are treated as statistical outliers. Efficient methods have been not proposed to estimate the noise level and to find them is an open problem (Yang, Mahdavi, Jin, Zhang, & Zhou, 2012). 3
The value of this parameter is constant for the CC4.5 algorithm
ACCEPTED MANUSCRIPT
12
AC
CE
PT
ED
M
AN US
CR IP T
Since we do not have access to the data without noise to make reliable comparisons, we will only be able to make a rough estimate of the noise level. We must take into consideration that finding noisy or potentially noisy instances may be a difficult task in datasets with a medium-low level of noise but, with high levels of noise, it is far more complicated to differentiate between correct and noisy instances. For reasons of simplicity and efficiency, the k-nearest neighbors approach (k-NN) (Fix & Hodges, 1989) has been considered to perform the task of estimating the noise level. The k-NN is a type of instance-based learning classifier (Aha, Kibler, & Albert, 1991). The k-NN classifier has been used largely to detect anomalous instances (Olvera-L´ opez, Carrasco-Ochoa, Mart´ınez-Trinidad, & Kittler, 2010) and it is known to be very sensitive to noise data (Kononenko & Kukar, 2007) (Wu, Ianakiev, & Govindaraju, 2002). In this context, the nearest neighbor classifier (1-NN) is more sensitive to noise than the k-NN with k > 1, due to the fact that larger values of k reduce the effect of noise on the classification (Everitt, Landau, Leese, & Stahl, 2011). Hence, our intention is to compare, for each instance from the training data, the predictions of the 1-NN and k-NN classifiers with the value of the class variable of such instance. This comparison is used to determine whether the instance has class noise or not. Once we have the number of (potentially) noisy instances, we divide them by the total number of instance in order to get the noise ratio of the data. In previous works to remove or repair noisy instances it is common to work only with binary classification problems or to decompose the problem into several binary subproblems (Garcia et al., 2015). This is due to the fact that in multiclass problems, the chances of incorrect classifications are increased. Although the k-NN is designed to work with multi-class classification, its performance is severely hindered by noise due to some difficulties are exacerbated by noise, such as the presence of small disjuncts (Jo & Japkowicz, 2004), the overlapping regions of classes (V. Garc´ıa, Mollineda, & S´ anchez, 2008) or borderline instances (S. Garc´ıa, Luengo, & Herrera, 2015). This is why we divide the estimated noise ratio by |C|/2 to improve its performance in multi-class datasets and limit the effects produced by a larger number of classes. The steps of this rough noise estimator are shown in the Figure 4. We must bear in mind, however, that some instances are not correctly classified by the 1-NN and k-NN classifiers and these instances can not be considered as noisy instances, even though our estimator may detect them as noisy instances. Due to this fact, we should consider this proposal as a rough estimator, since the estimation heavily depends on the performance of the k-NN classifier to correctly classify the instances of a given dataset. 3.2
Characteristics of the new algorithm
The AdaptativeCC4.5 algorithm allows us to determine the value of the s parameter for each training set. In this way, we obtain an algorithm with two new
ACCEPTED MANUSCRIPT
13
CR IP T
1. Let be |D| the number of instances and |C| the number of classes. 2. problematicInstances = 0 3. For each instance i of D. 3.1. If k-NN prediction and the class value of the instance i are not the same. 3.1.1. If 1-NN prediction also differs from the class value of the instance i. problematicInstances+ = 1.0 //noisy instance 3.1.2. Else problematicInstances+ = 0.5 //possible noisy instance 4. L=(problematicInstances/|D|) 5. If (|C| > 2) then L=L/(|C|/2) 6. return L ∗ 100
AN US
Fig. 4. Rough noise level detection algorithm.
characteristics: it generalizes the concept of decision trees and it is able to find a trade off between bias and variance. On the one hand, the AdaptativeCC4.5 algorithm is a concept that generalizes the methods for designing decision trees. When s = 0 we can observe that the probabilities intervals of IDM (see equation 1) are composed by only one value:
M
p(cj ) =
ncj N
(13)
AC
CE
PT
ED
Hence, when s = 0, the credal sets are composed by only one probability distribution, the function H ∗ is equivalent to the Shannon’s entropy function and the IIGR split criterion is equivalent to the IGR criterion. Therefore, the AdaptativeCC4.5 algorithm is the classic C4.5. For values s > 0, the AdaptativeCC4.5 algorithm provides new properties to the method for designing decision trees if this is required by the data set. On the other hand, the AdaptativeCC4.5 achieves a trade off between bias and variance. In decision trees, the depth of the tree determines the variance. The bigger the value of s is, the smaller is the generated credal tree (see (Mantas et al., 2016)). In this way, AdaptativeCC4.5 algorithm can adjust the variance of the decision trees by controlling the size of the trees. The stopping criteria for AdaptativeCC4.5 algorithm are similar to the ones of classic C4.5, except that a new stopping criterion is defined for the AdaptativeCC4.5 algorithm that it is not available in classic C4.5. The values of the IIGRD (C, X) split criterion can be negative for a feature X (see Example 1), a class variable C and a partition D, this fact implies information loss about C. However, the IGR split criterion of C4.5 can not be negative. In this way, if the measure IIGRD (C, X) is less or equal to 0 for all the available features in a node, then the branching of the tree is stopped. These negative values come from the use of the maximum of entropy function on credal sets (H ∗ ) in the expression (9). So, it is possible to find the following condition:
ACCEPTED MANUSCRIPT
14
X
P D (X = xi )H ∗ (K D (C|X = xi )) > H ∗ (K D (C)).
(14)
i
CR IP T
The condition of Eq. (14) is normally fulfilled when the sample size N is small. This usually happens in the deepest levels of the tree. When the sample size is high (normally in the first levels of the tree), the probabilities intervals from Eq. (1) of the IDM are narrow and its influence on the value of IIGRD (C, X) is negligible. However, when N is small, the size of the probabilities intervals is enough to influence on the values of IIGRD (C, X). Eq. (14) is then related with the branching of the DT. In Example 1 we can see a case where we consider two possible values for the parameter (s1 > s2 ), and we will see that with the lower one still we need to branch but not with the higher one.
= x11 = x12 = x21 = x22 = x23
→ (5 → (4 → (2 → (5 → (2
of of of of of
class class class class class
M
X1 X1 X2 X2 X2
AN US
Example 1. Lets consider of a node J in a DT, where the class variable C, with two possible states {c1 , c2 }, has the frequencies {c1 : 9, c2 : 4}. In this node we consider that we have only 2 attribute variables X1 , X2 , with possible values X1 ∈ {x11 , x12 }, and X2 ∈ {x21 , x22 , x23 }. The frequencies of each combination of states in the node J are the following ones: c1 , c1 , c1 , c1 , c1 ,
4 0 2 2 0
of of of of of
class class class class class
c2 ) c2 ) c2 ) c2 ) c2 )
ED
Taking into account two possible values for the s parameter: s1 = 1, s2 = 3, we analyze the possibility of branching in the node J considering these two cases - Case s1 . Here we have the following values:
PT
IIGRD (C, X1 ) =
CE
IIGRD (C, X2 ) =
H ∗ (K D (C))−
= 0.0290 H ∗ (K D (C))−
= −0.0159
P
P D (X1 =x1i )H ∗ (K D (C|X1 =x1i )) H(X1 )
P
P D (X2 =x2i )H ∗ (K D (C|X2 =x2i )) H(X2 )
i
i
- Case s2 . Here we have the following values:
AC
IIGRD (C, X1 ) =
IIGRD (C, X2 ) =
H ∗ (K D (C))−
= −0.0076 H ∗ (K D (C))−
= −0.0080
P
P D (X1 =x1i )H ∗ (K D (C|X1 =x1i )) H(X1 )
P
P D (X2 =x2i )H ∗ (K D (C|X2 =x2i )) H(X2 )
i
i
By the values obtained, in the node J we will insert the variable X1 when s = s1 , but no branching is produced when s = s2 .
ACCEPTED MANUSCRIPT
15
Experimental analysis
AN US
4
CR IP T
As we can see in the Example 1, with similar values for each case of the attribute variables, we can obtain a negative gain of information by the IIGR criterion for all the attribute variables if the value of s is increased. Then, using higher values of s the branching can be stopped in a node, producing smaller DTs. Hence, when the AdaptativeCC4.5 algorithm selects the value of the s parameter by a rough estimation of the noise level, it is controlling the fulfillment of the condition (14) in some nodes of the tree. So, AdaptativeCC4.5 algorithm with high values of s designs smaller tree than the same algorithm with values s0 < s. As the size of the tree determines the variance, the AdaptativeCC4.5 algorithm carries out a bias-variance trade off through the control of the value of the s parameter.
AC
CE
PT
ED
M
Our aim is to compare the performance of the new AdaptativeCC4.5 method in front of the ones of the classic C4.5 and CC4.5 with a fixed value for s, when data sets with different levels of added class noise are classified4 . In order to check the above procedures, we have used a broad and diverse set of 50 known data sets, obtained from the UCI repository of machine learning data sets (Lichman, 2013). We took data sets that are different with respect to the number of cases of the variable to be classified, data set size, feature types (discrete or continuous) and number of cases of the features. A brief description of these can be found in Table 1. We used Weka software (Witten & Frank, 2005) for our experimentation. The implementation of C4.5 algorithm provided by Weka software, called J48, was employed with its default configuration. We added the necessary methods to build CC4.5 trees with the same experimental conditions. For the AdaptativeCC4.5 algorithm, the selection of the s parameter is made by a rough noise detector based on the k-NN classifier already implemented in the Weka software. The minimum number of instances per leaf for branching was fixed to 2 for C4.5, C-C4-5 and AdaptativeCC4.5, as it is appears in the configuration by defect for C4.5 (J48 in Weka). On the other hand, by using Weka’s filters, we added the following percentages of random noise to the class variable: 0%, 5%, 10%, 20% and 30%, only in the training data set. The procedure to introduce noise into a variable was the following: a given percentage of instances of the training data set was randomly selected, after that their values for the variable were randomly changed to other possible values. The instances belonging to the test data set were left unmodified. We repeated 10 times a 10-fold cross validation procedure for each data set. 4
We remark that we do not consider as reference the k-NN classifier (used to estimate the level of noise) because its low level of accuracy compared with the methods used here.
ACCEPTED MANUSCRIPT
16
Table 1.
Data set description. Column “N” is the number of instances in the data sets, column
“Feat” is the number of features or attribute variables, column “Num” is the number of numerical variables, column “Nom” is the number of nominal variables, column “k” is the number of cases or of the nominal variables of each data set.
N 898 452 226 205 625 286 699 1728 1473 368 690 1000 366 768 366 214 306 303 294 270 155 3772 351 150 3196 20000 345 146 2000 12960 5620 5473 10992 339 2310 3772 1066 208 683 4601 531 3190 76 151 946 435 990 5000 178 101
Feat Num Nom k Range 38 6 32 6 2-10 279 206 73 16 2 69 0 69 24 2-6 25 15 10 7 2-22 4 4 0 3 9 0 9 2 2-13 9 9 0 2 6 0 6 4 3-4 9 2 7 3 2-4 22 7 15 2 2-6 15 6 9 2 2-14 20 7 13 2 2-11 34 1 33 6 2-4 8 8 0 2 7 7 0 7 9 9 0 7 3 2 1 2 12 13 6 7 5 2-14 13 6 7 5 2-14 13 13 0 2 19 4 15 2 2 30 7 23 4 2-4 35 35 0 2 4 4 0 3 36 0 36 2 2-3 16 16 0 26 6 6 0 2 18 3 15 4 2-8 240 0 240 10 4-6 8 0 8 4 2-4 64 64 0 10 10 10 0 5 16 16 0 10 17 0 17 21 2-3 19 16 0 7 29 7 22 2 2 12 0 6 3 2-8 60 60 0 2 35 0 35 19 2-7 57 57 0 2 101 100 1 48 4 60 0 60 3 4-6 44 0 44 3 2-9 5 3 2 3 2 18 18 0 4 16 0 16 2 2 11 10 1 11 2 40 40 0 3 13 13 0 3 16 1 16 7 2
AC
CE
PT
ED
M
AN US
Data set anneal arrhythmia audiology autos balance-scale breast-cancer wisconsin-breast-cancer car cmc horse-colic credit-rating german-credit dermatology pima-diabetes ecoli Glass haberman cleveland-14-heart-disease hungarian-14-heart-disease heart-statlog hepatitis hypothyroid ionosphere iris kr-vs-kp letter liver-disorders lymphography mfeat-pixel nursery optdigits page-blocks pendigits primary-tumor segment sick solar-flare2 sonar soybean spambase spectrometer splice Sponge tae vehicle vote vowel waveform wine zoo
CR IP T
states of the class variable (always a nominal variable) and column “Range” is the range of states
ACCEPTED MANUSCRIPT
17
Following the recommendation of (Demˇsar, 2006), we used a series of tests to compare the methods5 . We used the following tests to compare multiple classifiers on multiple data sets, with a level of significance of α = 0.05:
CR IP T
Friedman test (Friedman, 1937, 1940): a non-parametric test that ranks the algorithms separately for each data set, the best performing algorithm being assigned the rank of 1, the second best, rank 2, etc. The null hypothesis is that all the algorithms are equivalent. If the null-hypothesis is rejected, we can compare all the algorithms to each other using the Nemenyi test (Nemenyi, 1963).
M
AN US
To compare the performance of two classifiers over multiple data sets, we have carried out the Wilcoxon Signed-Ranks test (Wilcoxon, 1945) which is a non-parametric test that ranks the differences in performance of two classifiers of each dataset comparing the ranks for the positive and the negative differences. It takes into account the commensurability of the differences that the Sign Test does not (Demˇsar, 2006). It is used to compare two related samples, matched samples, or repeated measurements on a single sample to check whether their population mean ranks differently. On the other hand, in order to make a comparative analysis of the behavior of different classifiers when they deal with noisy data, we used the Equalized Loss of Accuracy (ELA) measure presented in (S´ aez, Luengo, & Herrera, 2014). This new behavior-against-noise measure allows to characterize the behavior of a method with noisy data considering performance and robustness. ELA measure is: 100 − Ax% A0%
(15)
ED
ELAx% =
AC
CE
PT
where A0% is the accuracy of the classifier with a level noise 0% and Ax% is the accuracy of the classifier with a noise level x%. For each implemented classifier in the experiments, ELAx% measure will be calculated for x=5, 10, 20 and 30. The classifier with the lowest values for ELAx% will be the most robust classifier with noisy data according this measure. Finally, an experimental comparison between the new AdaptativeCC4.5 method and the known classifier Random Forest (RF) was carried out. The purpose of this last experiment is to show that the performance of AdaptativeCC4.5 is comparable with a good classifier as RF. The implementation of Random Forest provided by Weka was used with its default configuration. The Wilcoxon test was used to compare the results obtained by AdaptativeCC4.5 and RF. 4.1
Results
Here we show the results obtained by the C4.5, CC4.5 (with different values of the parameter s) and AdaptativeCC4.5 trees. 5
All the tests were carried out using Keel software (Alcal´ a-Fdez et al., 2009), available at www.keel.es
ACCEPTED MANUSCRIPT
18
Table 2.
CR IP T
In Appendix A, it can be found the tables which show the accuracy results for each method, applied on data sets with a percentage of added random noise to the class variable equal to 0%, 5%, 10%, 20% and 30%, respectively. Table 2 presents the average result of the accuracy for each method when is applied to data sets with several percentages of added random noise. Average result of accuracy for C4.5, CC4.5 (with different values for the s parameter)
and AdaptativeCC4.5 when they are built from data sets with percentage of added noise equal to
0%, 5%, 10%, 20% and 30%. The numbers in bold fonts correspond with the best result for each noise level, the second best value is noted with gray fonts. noise 0% 82.62 82.45 82.44 82.31 81.96 81.54 81.00 82.56
noise 5% 81.77 81.88 81.87 81.85 81.61 81.25 80.67 81.93
noise 10% 80.77 80.97 81.10 81.21 81.03 80.77 80.29 81.25
noise 20% 78.20 78.67 79.03 79.44 79.58 79.55 79.35 79.73
noise 30% 74.14 74.89 75.41 76.38 76.95 77.32 77.44 77.41
AN US
Tree C4.5 CC4.5s=0.25 CC4.5s=0.5 CC4.5s=1.0 CC4.5s=1.5 CC4.5s=2.0 CC4.5s=3.0 Adaptative CC4.5
Table 3.
M
Table 3 shows Friedman’s ranks of C4.5, CC4.5 (with different values for the s parameter) and AdaptativeCC4.5 when they are used on data sets with percentages of added random noise equal to 0%, 5%, 10%, 20% and 30%. For all the test carried out, the level of significance has been of α = 0.05. Friedman’s ranks for α = 0.05 of C4.5, CC4.5 (with different values for the s parameter)
and AdaptativeCC4.5, when they are used on data sets with percentage of added random noise equal
ED
to 0%, 5%, 10%, 20% and 30%. The numbers in bold fonts correspond with the lowest value for each noise level, the second lowest value is marked with italic fonts.
CE
PT
Tree noise 0% noise 5% noise 10% noise 20% noise 30% C4.5 3.84 4.17 5.29 6.15 6.35 CC4.5s=0.25 3.77 3.87 4.88 5.82 5.92 CC4.5s=0.5 4.00 4.06 3.90 5.25 5.52 CC4.5s=1.0 4.00 3.97 3.62 4.30 4.83 CC4.5s=1.5 4.82 4.66 4.00 3.49 3.89 CC4.5s=2.0 5.88 5.41 4.99 3.65 3.04 CC4.5s=3.0 6.18 6.37 5.83 4.06 3.16 Adaptative CC4.5 3.51 3.49 3.49 3.28 3.29
AC
Tables 4-8 show the p-values of the Nemenyi test for the methods C4.5, CC4.5 (with different values for the s parameter) and AdaptativeCC4.5 when they are applied on data sets with a percentage of added random noise equal to 0%, 5%, 10%, 20% and 30%. In all the cases, Nemenyi test rejects the hypotheses that the methods are equivalent when p-value≤ 0.001786. The methods distinguished in bold fonts are better than its pair regarding accuracy. Table 9 presents the ELAx% measures calculated with the average accuracy obtained from each method when it is applied on data sets with added random noise equal to 0%, 5%, 10%, 20% and 30%.
ACCEPTED MANUSCRIPT
19 Table 4.
p-values of the Nemenyi test
Table 5.
p-values of the Nemenyi test
about the accuracy on data sets without
about the accuracy on data sets with 5% of
added noise.
added noise. i 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
algorithms CC4.5s=3 vs. AdaptativeCC4.5 CC4.5s=0.25 vs. CC4.5s=3 CC4.5s=1 vs. CC4.5s=3 CC4.5s=0.5 vs. CC4.5s=3 C4.5 vs. CC4.5s=3 CC4.5s=2 vs. AdaptativeCC4.5 CC4.5s=1.5 vs. CC4.5s=3 CC4.5s=0.25 vs. CC4.5s=2 CC4.5s=1 vs. CC4.5s=2 CC4.5s=0.5 vs. CC4.5s=2 C4.5 vs. CC4.5s=2 CC4.5s=1.5 vs. AdaptativeCC4.5 CC4.5s=2 vs. CC4.5s=3 CC4.5s=0.25 vs. CC4.5s=1.5 CC4.5s=1.5 vs. CC4.5s=2 CC4.5s=1 vs. CC4.5s=1.5 C4.5 vs. AdaptativeCC4.5 CC4.5s=0.5 vs. CC4.5s=1.5 CC4.5s=0.5 vs. AdaptativeCC4.5 C4.5 vs. CC4.5s=1.5 CC4.5s=1 vs. AdaptativeCC4.5 CC4.5s=0.25 vs. AdaptativeCC4.5 C4.5 vs. CC4.5s=0.25 C4.5 vs. CC4.5s=1 CC4.5s=0.25 vs. CC4.5s=0.5 C4.5 vs. CC4.5s=0.5 CC4.5s=0.25 vs. CC4.5s=1 CC4.5s=0.5 vs. CC4.5s=1
p − values 0 0 0.000001 0.000002 0.000007 0.000089 0.000482 0.001669 0.003289 0.005857 0.011369 0.016929 0.050044 0.106836 0.125786 0.158996 0.165124 0.220671 0.244624 0.31721 0.327187 0.437943 0.540291 0.683091 0.698137 0.82234 0.838256 0.85424
CR IP T
p − values 0 0.000001 0.000001 0.000002 0.000009 0.000009 0.000017 0.000031 0.000124 0.000124 0.005502 0.007495 0.030486 0.032089 0.045455 0.094166 0.094166 0.31721 0.31721 0.500559 0.540291 0.595611 0.638723 0.638723 0.743971 0.743971 0.88638 1
AN US
algorithms CC4.5s=3 vs. AdaptativeCC4.5 CC4.5s=0.25 vs. CC4.5s=3 CC4.5s=2 vs. AdaptativeCC4.5 C4.5 vs. CC4.5s=3 CC4.5s=1 vs. CC4.5s=3 CC4.5s=0.5 vs. CC4.5s=3 CC4.5s=0.25 vs. CC4.5s=2 C4.5 vs. CC4.5s=2 CC4.5s=1 vs. CC4.5s=2 CC4.5s=0.5 vs. CC4.5s=2 CC4.5s=1.5 vs. CC4.5s=3 CC4.5s=1.5 vs. AdaptativeCC4.5 CC4.5s=1.5 vs. CC4.5s=2 CC4.5s=0.25 vs. CC4.5s=1.5 C4.5 vs. CC4.5s=1.5 CC4.5s=1 vs. CC4.5s=1.5 CC4.5s=0.5 vs. CC4.5s=1.5 CC4.5s=0.5 vs. AdaptativeCC4.5 CC4.5s=1 vs. AdaptativeCC4.5 C4.5 vs. AdaptativeCC4.5 CC4.5s=2 vs. CC4.5s=3 CC4.5s=0.25 vs. AdaptativeCC4.5 CC4.5s=0.25 vs. CC4.5s=0.5 CC4.5s=0.25 vs. CC4.5s=1 C4.5 vs. CC4.5s=0.5 C4.5 vs. CC4.5s=1 C4.5 vs. CC4.5s=0.25 CC4.5s=0.5 vs. CC4.5s=1
M
i 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
4.2
PT
ED
At the end of Appendix A, it can be found the table which show the accuracy results for AdaptativeCC4.5 and RF, when they are used on data sets with different percentages of added random noise to the class variable. Table 10 presents the average result of the accuracy for the AdaptativeCC4.5 and RF methods when they are used to data sets with several levels of added random noise. Comments about the results
AC
CE
From the data obtained with the experiments carried out, we can say that the new method AdaptativeCC4.5 obtains the general best results on each analyzed aspect. It must be remarked that the new method is always better than the best one of the rest on each level of added noise, only it is not better but equivalent to the best one of the rest for 30% of added noise, where the best one is the CC4.5 with s = 2.0. It is shown that for the lowest levels of noise, the new method is significantly better than the CC4.5 with highest values of s; and for the highest levels of noise, the new method is significantly better than the CC4.5 with lowest values of s. For low levels of added noise, the methods with lowest values of s attains the best results together with the new method; and the contrary situation appears for the highest values of added noise, where the methods with highest s values are the best ones, together again with the new
ACCEPTED MANUSCRIPT
20 Table 6.
p-values of the Nemenyi test
Table 7.
p-values of the Nemenyi test
about the accuracy on data sets with 10%
about the accuracy on data sets with 20%
of added noise.
of added noise.
i 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
p − values 0 0 0 0 0.000002 0.000009 0.00002 0.000058 0.000159 0.000327 0.000327 0.001091 0.001918 0.015137 0.037336 0.052479 0.066193 0.098248 0.111347 0.184573 0.244624 0.244624 0.402644 0.450093 0.500559 0.624206 0.66817 0.743971
CR IP T
algorithms C4.5 vs. AdaptativeCC4.5 C4.5 vs. CC4.5s=1.5 CC4.5s=0.25 vs. AdaptativeCC4.5 C4.5 vs. CC4.5s=2 CC4.5s=0.25 vs. CC4.5s=1.5 CC4.5s=0.25 vs. CC4.5s=2 C4.5 vs. CC4.5s=3 CC4.5s=0.5 vs. AdaptativeCC4.5 C4.5 vs. CC4.5s=1 CC4.5s=0.5 vs. CC4.5s=1.5 CC4.5s=0.25 vs. CC4.5s=3 CC4.5s=0.5 vs. CC4.5s=2 CC4.5s=0.25 vs. CC4.5s=1 CC4.5s=0.5 vs. CC4.5s=3 CC4.5s=1 vs. AdaptativeCC4.5 CC4.5s=0.5 vs. CC4.5s=1 C4.5 vs. CC4.5s=0.5 CC4.5s=1 vs. CC4.5s=1.5 CC4.5s=3 vs. AdaptativeCC4.5 CC4.5s=1 vs. CC4.5s=2 CC4.5s=1.5 vs. CC4.5s=3 CC4.5s=0.25 vs. CC4.5s=0.5 CC4.5s=2 vs. CC4.5s=3 CC4.5s=2 vs. AdaptativeCC4.5 C4.5 vs. CC4.5s=0.25 CC4.5s=1 vs. CC4.5s=3 CC4.5s=1.5 vs. AdaptativeCC4.5 CC4.5s=1.5 vs. CC4.5s=2
algorithms C4.5 vs. CC4.5s=2 C4.5 vs. CC4.5s=3 C4.5 vs. AdaptativeCC4.5 CC4.5s=0.25 vs. CC4.5s=2 CC4.5s=0.25 vs. CC4.5s=3 CC4.5s=0.25 vs. AdaptativeCC4.5 CC4.5s=0.5 vs. CC4.5s=2 C4.5 vs. CC4.5s=1.5 CC4.5s=0.5 vs. CC4.5s=3 CC4.5s=0.5 vs. AdaptativeCC4.5 CC4.5s=0.25 vs. CC4.5s=1.5 CC4.5s=1 vs. CC4.5s=2 CC4.5s=1 vs. CC4.5s=3 CC4.5s=0.5 vs. CC4.5s=1.5 CC4.5s=1 vs. AdaptativeCC4.5 C4.5 vs. CC4.5s=1 CC4.5s=0.25 vs. CC4.5s=1 CC4.5s=1 vs. CC4.5s=1.5 CC4.5s=1.5 vs. CC4.5s=2 C4.5 vs. CC4.5s=0.5 CC4.5s=1.5 vs. CC4.5s=3 CC4.5s=0.5 vs. CC4.5s=1 CC4.5s=1.5 vs. AdaptativeCC4.5 C4.5 vs. CC4.5s=0.25 CC4.5s=0.25 vs. CC4.5s=0.5 CC4.5s=2 vs. AdaptativeCC4.5 CC4.5s=3 vs. AdaptativeCC4.5 CC4.5s=2 vs. CC4.5s=3
PT CE AC
i 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
p-values of the Nemenyi test about the accuracy on data sets with 30% of added noise.
ED
Table 8.
p − values 0.000002 0.000006 0.000082 0.000187 0.000239 0.000652 0.0022 0.004549 0.004549 0.005166 0.008458 0.010112 0.026084 0.043297 0.045455 0.052479 0.072448 0.086411 0.270344 0.29786 0.402644 0.402644 0.437943 0.540291 0.567628 0.790731 0.82234 0.838256
AN US
algorithms CC4.5s=3 vs. AdaptativeCC4.5 CC4.5s=1 vs. CC4.5s=3 CC4.5s=0.5 vs. CC4.5s=3 CC4.5s=1.5 vs. CC4.5s=3 C4.5 vs. AdaptativeCC4.5 C4.5 vs. CC4.5s=1 CC4.5s=2 vs. AdaptativeCC4.5 CC4.5s=0.25 vs. AdaptativeCC4.5 C4.5 vs. CC4.5s=0.5 CC4.5s=1 vs. CC4.5s=2 C4.5 vs. CC4.5s=1.5 CC4.5s=0.25 vs. CC4.5s=1 CC4.5s=0.5 vs. CC4.5s=2 CC4.5s=1.5 vs. CC4.5s=2 CC4.5s=0.25 vs. CC4.5s=0.5 CC4.5s=0.25 vs. CC4.5s=3 CC4.5s=0.25 vs. CC4.5s=1.5 CC4.5s=2 vs. CC4.5s=3 C4.5 vs. CC4.5s=3 CC4.5s=1.5 vs. AdaptativeCC4.5 CC4.5s=0.5 vs. AdaptativeCC4.5 C4.5 vs. CC4.5s=0.25 CC4.5s=1 vs. CC4.5s=1.5 C4.5 vs. CC4.5s=2 CC4.5s=0.5 vs. CC4.5s=1 CC4.5s=1 vs. AdaptativeCC4.5 CC4.5s=0.25 vs. CC4.5s=2 CC4.5s=0.5 vs. CC4.5s=1.5
M
i 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
p − values 0 0 0 0 0 0 0 0.000001 0.000001 0.000005 0.000034 0.000258 0.000652 0.000877 0.001669 0.001918 0.026084 0.055014 0.082731 0.090222 0.136196 0.158996 0.220671 0.380088 0.414216 0.609834 0.790731 0.806496
ACCEPTED MANUSCRIPT
21 Table 9.
ELAx% measures calculated with the average accuracy of C4.5, CC4.5 (with different
values for the s parameter) and AdaptativeCC4.5 when they are used on data sets with percentage of added random noise equal to 0%, 5%, 10%, 20% and 30%. The numbers in bold fonts correspond with the lowest value of the ELAx% measure for each noise level, the second lowest value is marked with grey fonts.
Table 10.
noise 5% 0.221 0.220 0.220 0.221 0.224 0.230 0.239 0.219
noise 10% 0.233 0.231 0.229 0.228 0.231 0.236 0.243 0.227
noise 20% 0.264 0.259 0.254 0.250 0.249 0.251 0.255 0.246
noise 30% 0.313 0.305 0.298 0.287 0.281 0.278 0.279 0.274
CR IP T
Tree C4.5 CC4.5s=0.25 CC4.5s=0.5 CC4.5s=1.0 CC4.5s=1.5 CC4.5s=2.0 CC4.5s=3.0 Adaptative CC4.5
Average result of accuracy for AdaptativeCC4.5 and RF when they are built from data
sets with percentage of added noise equal to 0%, 5%, 10%, 20% and 30%. The numbers in bold fonts
Tree Adaptative CC4.5 RF
noise 0% 82.56 84.68
AN US
correspond with the best result for each noise level. noise 5% 81.93 83.40
noise 10% 81.25 81.69
noise 20% 79.73 77.51
noise 30% 77.41 71.79
ED
M
method. This last statement can be checked via the tests carried out, where the differences are statistically significant in the way that we comment: for the lowest levels of noise, the methods with highest s values are statistically worse compared with the rest of methods; and for the highest levels of noise, the methods with lowest s value are now statistically worse compared with the rest. Following to the new method, the CC4.5 with s = 1.0 and s = 2.0 can be considered the second best methods in this comparative. For more detail, we will comment the results according to the following aspects: Average accuracy, Friedman’s ranking, Nemenyi test and ELAx% measure.
AC
CE
PT
• Average accuracy: In this aspect, the new method attains the best average result for each level of added noise, only for 30% of added noise and without noise it attains the second best value, but very close to the best ones. Really for levels of noise lower or equal to 20%, the accuracy results of each method are not very different. This can be explained because on some of the used original data sets, the CC4.5 method obtains excellent results when it is applied with highest values of s, this could imply that those original data sets have some levels of noise. However, for the case of 30% of added noise, the differences are bigger among the methods with highest values of s (the best ones, together with the new method) and the rest. In this case, the contrary situation rarely appears, i.e. data sets with highest values of added noise where the CC4.5 with lowest values of s give us the best accuracy results. • Friedman’s ranking and Nemenyi test: The results of the set of test carried out, reinforce the comments about the accuracy. The new method
ACCEPTED MANUSCRIPT
22
M
AN US
CR IP T
obtains the best Friedman’s rank for all the levels of added noise except for 30% of added noise, where the best one is obtained for the CC4.5 with s = 2.0 following very close for the CC4.5 with s = 3.0 and the new method. With respect to the Nemenyi test, the new method is the only one that always is significantly better than the worse methods for each level of added noise. For the lowest level of added noise, the worse ones are the CC4.5 with s value in {2.0, 3.0}; and for the highest values of added noise the worse ones are the C4.5 and the ones of the CC4.5 with s values in {0.25, 0.5, 1.0}. We note that the methods with the values recommend by Walley (s = 1.0 and s = 2.0) give us contrary situations for the lowest and the highest value of added noise: CC4.5 with s = 1.0 is statistically better than CC4.5 with s = 2.0, when the level of noise is 0%; and CC4.5 with s = 2.0 is statistically better than CC4.5 with s = 1.0, when the level of noise is 30%. • ELAx% measure: According to this measure, there is no doubt about what method is the strongest one on class noise. We can observe that AdaptativeCC4.5 algorithm obtains always the lowest values of ELAx% for all noise levels. This fact implies that AdaptativeCC4.5 is the most robust method when noisy data sets are classified. Besides, the ELAx% measure was defined considering performance and robustness of a method when it classifies noisy data. We remark that ELA measure takes into account higher levels of accuracy on data sets with added noise, that is different to other measures that take into account the difference on the performance of a classifier when it is applied on data set with and without noise (S´ aez, Luengo, & Herrera, 2011).
AC
CE
PT
ED
Respect to the experimental comparison between AdaptativeCC4.5 and RF, it can be observed that RF obtains accuracy results better than AdaptativeCC4.5 for data without added noise and data with low level of added noise (5%). For data with level of noise equal to 10% the accuracy results are close. However, when datasets with high level of noise added (20% and 30%) are classified, AdaptativeCC4.5 achieves better accuracy results than RF. These conclusions are corroborated by the Wilcoxon test. This test indicates that RF is better algorithm than AdaptativeCC4.5 when datasets without added noise or with low level of added noise (5%) are classified, both algorithms are equivalent when datasets with 10% of added noise are classified and, finally, AdaptativeCC4.5 is better classifier than RF when datasets with high level of noise added (20% and 30%) are classified. In this manner, it can be concluded with this experimental comparison that AdaptativeCC4.5 is a good classifier compared with the known RF, mainly, when it is used to classify very noisy datasets. From the general and particular comments above expressed, we can conclude that the new method of AdaptativeCC4.5 achieves a good trade off between performance and robustness in class noise. This fact is reasonable taking into account that AdaptativeCC4.5 is defined in order to obtain a trade off between bias and variance error, i.e a good trade off between over and under-fitting on the data used to build a decision trees.
ACCEPTED MANUSCRIPT
References
5
23
Conclusion
ED
M
AN US
CR IP T
In a previous work, it has been shown that the recent model called Credal C4.5 (CC4.5) presents excellent performance in class noise tasks, but its performance has a high dependency of its parameter (s) when the level of noise varies. Normally, the relation between the level of class noise, in a data set, and the correct value of s to be applied, is directly proportional. In some situations of noise levels, the CC4.5 with lower values of s is significantly better than the same model when it uses higher values for s; and vice versa. It makes necessary to know the level of noise to apply the appropriate value of the s parameter. The problems founds are two: (i) it can be not possible to guest the correct level of noise in a data set; and (ii) for all the data sets and level of noise, we cannot assure that the same values of s would be the best ones. In this paper, we present a new model based on the CC4.5, that does not suffer the above mentioned problems. The new method is always better or equivalent to the best model based on the CC4.5 varying its s parameter. That is, it is shown in the results of the experimentation that the new method is always better than the best one of the rest on each level of added noise, only it is not better but equivalent to the best one of the rest for 30% of added noise. Besides, in the experimental study, using a set of tests, we show that for the lowest levels of noise, the new method is significantly better than the CC4.5 with higher values of s; and for the highest levels of noise, the new method is significantly better than the CC4.5 with the lowest values of s. Here, we remark that C4.5 is the CC4.5 with s = 0. By the results presented in this paper, the new direct method is very suitable to be applied on data sets with class noise, without the need to know the levels of noise. Acknowledgments
PT
This work has been supported by the Spanish “Ministerio de Econom´ıa y Competitividad” and by “Fondo Europeo de Desarrollo Regional” (FEDER) under Project TEC2015-69496-R.
CE
References
AC
Abell´ an, J. (2006). Uncertainty measures on probability intervals from the imprecise dirichlet model. International Journal of General Systems, 35 (5), 509-528. doi: 10.1080/03081070600687643 Abell´ an, J., & Mantas, C. J. (2014). Improving experimental studies about ensembles of classifiers for bankruptcy prediction and credit scoring. Expert Systems with Applications, 41 (8), 3825 - 3830. doi: 10.1016/j.eswa.2013.12.003 Abell´ an, J., & Masegosa, A. (2009). An experimental study about simple decision trees for bagging ensemble on datasets with classification noise. In C. Sossai & G. Chemello (Eds.), Symbolic and quantitative approaches to reasoning
ACCEPTED MANUSCRIPT
24
AC
CE
PT
ED
M
AN US
CR IP T
with uncertainty (Vol. 5590, p. 446-456). Springer Berlin Heidelberg. doi: 10.1007/978-3-642-02906-6 39 Abell´ an, J., & Masegosa, A. R. (2012). Bagging schemes on the presence of class noise in classification. Expert Systems with Applications, 39 (8), 6827 - 6837. doi: 10.1016/j.eswa.2012.01.013 Abell´ an, J., & Moral, S. (2003). Building classification trees using the total uncertainty criterion. International Journal of Intelligent Systems, 18 (12), 1215–1225. doi: 10.1002/int.10143 Abell´ an, J., & Moral, S. (2006). An algorithm to compute the upper entropy for order-2 capacieties. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 14 (02), 141-154. doi: 10.1142/S0218488506003911 Aha, D. W., Kibler, D., & Albert, M. K. (1991). Instance-based learning algorithms. Machine Learning, 6 (1), 37–66. doi: 10.1007/BF00153759 Alcal´ a-Fdez, J., S´ anchez, L., Garc´ıa, S., del Jesus, M., Ventura, S., Garrell, J., . . . Herrera, F. (2009). Keel: a software tool to assess evolutionary algorithms for data mining problems. Soft Computing, 13 (3), 307-318. doi: 10.1007/s00500-008-0323-y Chaoqun, L., Victor, S., Liangxiao, J., & Hongwei, L. (2016). Noise filtering to improve data and model quality for crowdsourcing. Knowledge Based Systems, 107 , 96 - 103. doi: 10.1016/j.knosys.2016.06.003 Delany, S., Segata, N., & MacNamee, B. (2012). Profiling instances in noise reduction. Knowledge Based Systems, 31 , 28 - 40. doi: 10.1016/j.knosys.2012.01.015 Demˇsar, J. (2006). Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res., 7 , 1–30. Everitt, B. S., Landau, S., Leese, M., & Stahl, D. (2011). Miscellaneous clustering methods. In Cluster analysis (pp. 215–255). John Wiley And Sons, Ltd. doi: 10.1002/9780470977811.ch8 Fix, E., & Hodges, J. L. (1989). Discriminatory analysis. nonparametric discrimination: Consistency properties. International Statistical Review / Revue Internationale de Statistique, 57 (3), 238-247. doi: 10.2307/1403797 Frenay, B., & Verleysen, M. (2014). Classification in the presence of label noise: A survey. Neural Networks and Learning Systems, IEEE Transactions on, 25 (5), 845-869. doi: 10.1109/TNNLS.2013.2292894 Friedman, M. (1937). The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association, 32 , 675-701. Friedman, M. (1940). A comparison of alternative tests of significance for the problem of m rankings. The Annals of Mathematical Statistics, 11 (1), 86–92. doi: 10.1214/aoms/1177731944 Garcia, L. P., S´ aez, J. A., Luengo, J., Lorena, A. C., de Carvalho, A. C., & Herrera, F. (2015). Using the one-vs-one decomposition to improve the performance of class noise filters via an aggregation strategy in multi-class classification problems. Knowledge Based Systems, 90 , 153 - 164. doi:
ACCEPTED MANUSCRIPT
References
25
AC
CE
PT
ED
M
AN US
CR IP T
10.1016/j.knosys.2015.09.023 Garc´ıa, S., Luengo, J., & Herrera, F. (2015). Dealing with noisy data. In Data preprocessing in data mining (pp. 107–145). Springer International Publishing. doi: 10.1007/978-3-319-10247-4 5 Garc´ıa, V., Mollineda, R. A., & S´ anchez, J. S. (2008). On the k-nn performance in a challenging scenario of imbalance and overlapping. Pattern Analysis and Applications, 11 (3), 269–280. doi: 10.1007/s10044-007-0087-5 Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation(4), 1–58. doi: 10.1162/neco.1992.4.1.1 Hand, D. J. (1997). Construction and assessment of classification rules. John Wiley and Sons, New York. Hunt, E., Marin, J., & Stone, P. (1966). Experiments in induction. Academic Press. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 103). Springer. doi: 10.1007/978-1-4614-7138-7 Jo, T., & Japkowicz, N. (2004). Class imbalances versus small disjuncts. ACM SIGKDD Explorations Newsletter - Special issue on learning from imbalanced datasets, 6 (1), 40–49. doi: 10.1145/1007730.1007737 Klir, G. J. (2005). Uncertainty and information: Foundations of generalized information theory. John Wiley And Sons, Inc. doi: 10.1002/0471755575 Kohavi, R., & Wolpert, D. (1996). Bias plus variance decomposition for zero-one loss functions. Proceedings of the Thirteenth International Conference of Machine Learning, 275 - 283. Kononenko, I., & Kukar, M. (2007). Machine learning and data mining: Introduction to principles and algorithms. Woodhead Publishing Limited. Lichman, M. (2013). UCI machine learning repository. Retrieved from http://archive.ics.uci.edu/ml Liu, T., & Tao, D. (2016). Classification with noisy labels by importance reweighting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38 , 447-461. Maimon, O., & Rokach, L. (2005). Data mining and knowledge discovery handbook. NJ, USA: Springer-Verlag New York. doi: 10.1007/978-0-387-098234 Mantas, C. J., & Abell´ an, J. (2014a). Analysis and extension of decision trees based on imprecise probabilities: Application on noisy data. Expert Systems with Applications, 41 (5), 2514 - 2525. doi: 10.1016/j.eswa.2013.09.050 Mantas, C. J., & Abell´ an, J. (2014b). Credal-c4.5: Decision tree based on imprecise probabilities to classify noisy data. Expert Systems with Applications, 41 (10), 4625 - 4637. doi: 10.1016/j.eswa.2014.01.017 Mantas, C. J., Abell´ an, J., & Castellano, J. G. (2016). Analysis of credal-c4.5 for classification in noisy domains. Expert Systems with Applications, 61 , 314 - 326. doi: 10.1016/j.eswa.2016.05.035 Natarajan, N., Dhillon, I. S., Ravikumar, P., & Tewari, A. (2013). Learning with
ACCEPTED MANUSCRIPT
26
AC
CE
PT
ED
M
AN US
CR IP T
noisy labels. In Proceedings of the 26th international conference on neural information processing systems (pp. 1196–1204). USA: Curran Associates Inc. Retrieved from http://dl.acm.org/citation.cfm?id=2999611.2999745 Nemenyi, P. (1963). Distribution-free multiple comparisons. Doctoral dissertation, Princeton University, New Jersey, USA. Olvera-L´ opez, J. A., Carrasco-Ochoa, J. A., Mart´ınez-Trinidad, J. F., & Kittler, J. (2010). A review of instance selection methods. Artificial Intelligence Review , 34 (2), 133–143. doi: 10.1007/s10462-010-9165-y Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1 (1), 81–106. doi: 10.1023/A:1022643204877 Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc. Rokach, L., & Maimon, O. (2008). Data mining with decision trees: theory and applications (Vol. 81). World Scientific. S´ aez, J. A., Luengo, J., & Herrera, F. (2011). Fuzzy rule based classification systems versus crisp robust learners trained in presence of class noise’s effects: A case of study. In Intelligent systems design and applications (isda), 2011 11th international conference on (p. 1229-1234). doi: 10.1109/ISDA.2011.6121827 S´ aez, J. A., Luengo, J., & Herrera, F. (2014). Evaluating the classifier behavior with noisy data considering performance and robustness: the equalized loss of accuracy measure. Neurocomputing, 176 , 26 - 35. (Recent Advancements in Hybrid Artificial Intelligence Systems and its Application to Real-World Problems, selected papers from the {HAIS} 2013 conference) doi: 10.1016/j.neucom.2014.11.086 Sammut, C., & Webb, G. I. (2011). In C. Sammut & G. I. Webb (Eds.), Encyclopedia of machine learning (p. 100-101). Springer. doi: 10.1007/978-0-387-30164-8 Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal , 27 (3), 379–423. doi: 10.1002/j.15387305.1948.tb01338.x Walley, P. (1996). Inferences from multinomial data; learning about a bag of marbles (with discussion). Journal of the Royal Statistical Society. Series B (Methodological), 58 (1), 3-57. doi: 10.2307/2346164 Wilcoxon, F. (1945). Individual Comparisons by Ranking Methods. Biometrics Bulletin, 1 (6), 80–83. doi: 10.2307/3001968 Witten, I. H., & Frank, E. (2005). Data mining: Practical machine learning tools and techniques (Second ed.). San Francisco, CA, USA: Morgan Kaufmann Publishers Inc. Wu, Y., Ianakiev, K., & Govindaraju, V. (2002). Improved k-nearest neighbor classification. Pattern Recognition, 35 (10), 2311 - 2318. doi: 10.1016/S0031-3203(01)00132-7 Yang, T., Mahdavi, M., Jin, R., Zhang, L., & Zhou, Y. (2012). Multiple kernel learning from noisy labels by stochastic programming. CoRR, abs/1206.4629 .
ACCEPTED MANUSCRIPT
References
Appendix A
27
Tables about accuracy results
AC
CE
PT
ED
M
AN US
CR IP T
Tables 11-15 present the accuracy results for each method, applied on data sets with a percentage of added random noise to the class variable equal to 0%, 5%, 10%, 20% and 30%, respectively. Finally, Table 16 present the accuracy results of AdaptativeCC4.5 and RF methods, when they are used on data sets with the same percentages of added random noise to the class variable.
ACCEPTED MANUSCRIPT
28
Table 11.
Accuracy results of C4.5; CC4.5, with different values for the s parameter; and Adap-
98.57 65.65 77.26 81.77 77.82 74.28 95.01 92.22 51.44 85.16 85.57 71.25 94.10 74.49 82.83 67.63 72.16 76.94 80.22 78.15 79.22 99.54 89.74 94.73 99.44 88.03 65.84 75.84 78.66 97.18 90.52 96.99 96.54 41.39 96.79 98.72 99.53 73.61 91.78 92.68 47.50 94.17 92.50 57.41 72.28 96.57 80.20 75.25 93.20 92.61 82.62
PT
CE AC
CC4.5 s = 0.5 98.69 65.53 78.18 78.20 77.69 74.60 95.08 92.18 51.55 85.21 85.80 71.14 94.04 74.26 82.80 66.70 72.00 76.36 81.65 79.44 77.13 99.54 88.95 94.27 99.46 87.92 63.98 78.65 79.34 97.20 90.67 96.87 96.46 41.01 96.37 98.77 99.53 72.09 91.73 92.73 48.16 94.09 92.50 53.05 73.04 96.52 79.61 76.19 92.63 92.42 82.44
CC4.5 s = 1.0 98.49 67.30 80.66 74.70 77.26 74.84 95.14 89.86 52.91 85.26 85.36 71.36 94.12 74.15 81.72 65.52 71.18 76.86 82.23 80.37 79.85 99.55 88.27 94.13 99.45 87.55 64.45 78.58 79.38 96.30 90.77 96.68 96.42 40.09 96.11 98.76 99.53 71.19 92.12 92.57 45.97 94.09 92.50 53.26 73.07 96.62 77.19 76.21 92.47 93.22 82.31
CC4.5 s = 1.5 98.42 68.61 78.30 69.52 77.60 74.74 95.17 87.85 53.47 85.24 85.09 71.19 94.32 73.85 81.40 65.86 71.25 74.78 82.98 80.89 81.45 99.57 88.53 93.87 99.32 86.71 65.79 76.48 79.15 94.86 90.70 96.55 96.12 40.24 95.68 98.64 99.53 70.93 91.54 92.46 45.16 94.01 92.50 52.73 71.97 96.62 74.29 75.85 92.20 94.20 81.96
CC4.5 s = 2.0 98.42 70.04 76.05 66.93 78.25 74.91 94.97 87.03 53.46 85.24 84.68 71.13 93.80 74.18 81.04 64.78 71.28 73.82 82.40 80.63 81.70 99.54 88.04 93.60 99.26 85.91 67.12 74.24 78.99 93.69 90.53 96.54 95.75 40.03 95.24 98.54 99.53 70.54 91.64 92.39 43.35 94.10 92.50 53.47 71.48 96.59 71.30 75.26 91.52 91.54 81.54
CC4.5 AdaptativeCC4.5 s = 3.0 98.51 98.57 71.57 67.99 72.71 80.53 59.96 74.31 78.89 77.44 75.30 75.33 94.98 95.12 85.98 89.86 53.50 53.5 85.24 85.24 84.83 85.43 71.70 71.7 93.55 93.96 75.09 75.09 80.68 81.72 64.31 65.52 71.31 71.31 74.02 76.73 81.25 82.67 80.04 80.59 81.70 80.62 99.46 99.55 88.04 88.27 93.60 94.13 99.25 99.45 84.44 88.03 67.32 67.32 72.30 78.58 78.28 78.66 92.27 96.3 90.01 90.52 96.59 96.68 95.20 96.54 39.38 41.8 95.10 96.15 98.11 98.78 99.53 99.53 69.48 71.67 91.01 91.9 92.18 92.57 42.96 45.97 93.62 94.1 92.50 92.5 54.87 54.87 68.72 73.07 96.59 96.59 66.79 80.18 74.83 75.26 90.79 92.47 91.73 93.22 81.00 82.56
AN US
anneal arrhythmia audiology autos balance-scale breast-cancer wisconsin-breast-cancer car cmc horse-colic credit-rating german-credit dermatology pima-diabetes ecoli Glass haberman cleveland-14-heart-disease hungarian-14-heart-disease heart-statlog hepatitis hypothyroid ionosphere iris kr-vs-kp letter liver-disorders lymphography mfeat-pixel nursery optdigits page-blocks pendigits primary-tumor segment sick solar-flare2 sonar soybean spambase spectrometer splice sponge tae vehicle vote vowel waveform wine zoo Average
CC4.5 s = 0.25 98.60 65.68 77.26 79.49 77.92 74.32 95.04 92.23 51.15 85.18 85.70 71.30 94.07 74.64 82.95 67.22 72.56 76.86 81.07 79.26 77.63 99.54 88.95 94.73 99.44 87.96 64.01 76.71 79.19 97.18 90.57 96.95 96.52 40.77 96.63 98.76 99.53 72.37 91.83 92.78 48.49 94.05 92.50 52.17 72.81 96.62 80.10 76.06 92.98 92.42 82.45
M
C4.5
ED
Dataset
CR IP T
tativeCC4.5, when they are used on data sets without added noise.
ACCEPTED MANUSCRIPT
References
Table 12.
29
Accuracy results of C4.5; CC4.5, with different values for the s parameter; and Adap-
98.57 64.56 77.71 78.94 78.49 72.59 94.36 91.68 51 85.21 85.45 71.33 93.8 73.19 82.11 65.85 71.67 77.26 78.99 76.33 77.54 99.51 88.12 93.93 99.23 87.49 63.28 74.98 78.26 96.77 89.66 96.84 95.95 40.97 95.94 98.41 99.53 71.18 91.17 91.91 46 93.79 92.34 54.12 71.09 95.9 79.06 73.05 90.78 92.71 81.77
PT
CE AC
CC4.5 s = 0.5 98.53 64.51 79.08 77.91 78.32 73.18 94.41 91.67 50.88 85.15 85.1 71.39 93.53 73.5 82.29 66.22 72.62 77.21 79.87 77.59 77.8 99.5 88.5 93.6 99.22 87.46 61.97 75.11 78.43 96.76 89.6 96.82 96 40.62 95.92 98.48 99.53 73.18 91.77 92.11 45.84 93.69 92.34 51.58 71.15 95.98 78.25 75.29 91.49 92.71 81.87
CC4.5 s = 1.0 98.53 65.62 79.96 74.35 78.62 73.18 94.63 90.25 51.44 85.1 85.12 71.29 93.64 73.54 81.72 65.37 72.45 77.62 81.07 78.78 78.28 99.49 88.92 93.67 99.22 87.18 61.98 75.14 78.36 96.22 89.81 96.76 95.99 40.12 95.81 98.34 99.53 73.89 92.2 92.15 45.57 93.75 92.21 51.6 71.19 95.93 76.26 75.92 91.33 93.51 81.85
CC4.5 s = 1.5 98.49 67.88 79.37 70.95 78.64 73.15 94.65 87.94 51.74 85.18 85.16 71.27 93.64 73.42 81.55 64.58 72.48 76.36 81.55 79.48 78.79 99.45 88.69 93.6 99.18 86.61 62.17 74.86 78.39 95.49 89.89 96.69 95.81 40.47 95.55 98.24 99.53 72.22 91.6 92.06 43.95 93.79 92.21 52.14 70.97 95.84 73.42 75.79 91.17 94.3 81.61
CC4.5 s = 2.0 98.39 69.09 76.31 67.65 78.21 73.26 94.68 87.1 52.19 85.26 85.23 71.2 93.45 73.83 81.19 63.51 72.51 75.34 81.71 79.22 80.16 99.44 88.52 93.4 99.18 85.95 62.58 73.71 78.1 94.51 89.93 96.65 95.61 39.71 95.17 98.17 99.53 71.65 91.42 91.97 43.5 93.8 92.21 52.08 70.81 95.72 70.98 75.74 90.85 91.92 81.25
CC4.5 AdaptativeCC4.5 s = 3.0 97.98 98.53 70.15 66.44 72.09 79.74 63.48 74.21 77.92 78.1 73.57 73.75 94.34 94.62 85.87 90.25 52.42 52.42 85.1 85.29 85.07 85.22 71.66 71.66 93.09 93.64 74.22 74.22 80.84 81.72 63.67 65.37 72.58 72.58 74.59 77.65 81.27 81.55 78.81 79 80.86 79.68 99.45 99.48 88.29 88.52 93.4 93.67 99.07 99.22 84.61 87.49 62.81 62.81 72.39 75.14 77.4 78.36 92.7 96.22 89.6 89.81 96.65 96.76 95.17 95.99 39.32 41.27 94.68 95.81 98.11 98.36 99.53 99.53 69.39 71.22 90.23 91.87 91.6 92.15 42.52 45.57 93.28 93.28 92.21 92.21 52.53 52.53 68.93 71.15 95.72 95.88 66.77 76.26 75.45 75.45 90 91.33 92.12 93.51 80.67 81.93
AN US
anneal arrhythmia audiology autos balance-scale breast-cancer wisconsin-breast-cancer car cmc horse-colic credit-rating german-credit dermatology pima-diabetes ecoli Glass haberman cleveland-14-heart-disease hungarian-14-heart-disease heart-statlog hepatitis hypothyroid ionosphere iris kr-vs-kp letter liver-disorders lymphography mfeat-pixel nursery optdigits page-blocks pendigits primary-tumor segment sick solar-flare2 sonar soybean spambase spectrometer splice sponge tae vehicle vote vowel waveform wine zoo Average
CC4.5 s = 0.25 98.6 64.38 78.33 79.03 78.29 73.12 94.28 91.66 50.63 85.1 85.52 71.57 93.66 73.67 82.2 66.58 72.13 77.38 79.29 77.85 78.2 99.5 88.32 93.87 99.22 87.49 62.94 75.13 78.59 96.76 89.6 96.9 96 40.62 95.95 98.45 99.53 71.73 91.71 91.97 46.19 93.72 92.34 51.64 71.07 95.97 79.26 74.44 91.1 92.62 81.88
M
C4.5
ED
Dataset
CR IP T
tativeCC4.5, when they are used on data sets with percentage of added random noise equal to 5%.
ACCEPTED MANUSCRIPT
30
Table 13.
Accuracy results of C4.5; CC4.5, with different values for the s parameter; and Adap-
98.37 62.54 77.53 74.72 78.11 71.13 93.72 90.92 49.95 84.61 84.78 71.18 93.31 72.37 81.87 65.37 72.32 75.78 79.78 75.63 77.88 99.4 86.9 92.73 98.97 86.74 62.38 75.11 76.77 96.29 88.47 96.7 95.37 39.59 95.06 98.22 99.53 67.56 90.54 90.96 43.2 93.05 91.8 50.77 68.51 95.74 77.13 69.51 87.35 92.39 80.77
PT
CE AC
CC4.5 s = 0.5 98.2 62.23 77.53 73.94 78.39 71.79 94.29 90.91 50.78 84.77 85.25 70.8 93.66 73.16 81.85 66.55 72.75 76.71 79.4 77.48 79.22 99.44 87.64 93.33 98.94 86.77 61.8 74.41 77.69 96.36 88.6 96.81 95.45 39.8 95.12 98.2 99.53 68.71 91.02 91.3 43.25 93.05 91.66 49.85 69.36 95.61 76.44 73.63 89.1 92.3 81.10
CC4.5 s = 1.0 98.39 64.36 77.96 72.55 78.21 72.07 94.28 90.05 51.33 84.55 85.22 71.35 93.28 73.8 81.61 65.91 72.39 76.64 80.39 78.44 79.87 99.45 87.04 93.4 98.95 86.66 61.66 74.3 77.76 96.06 88.98 96.78 95.49 39.62 95.23 98.24 99.53 70.24 91.23 91.52 44.05 93.11 91.66 48.95 70.09 95.47 74.98 75.26 89.5 92.79 81.21
CC4.5 s = 1.5 98.42 65.97 78.39 68.48 78.13 72.21 94.26 88.28 51.55 84.75 85.03 70.77 93.31 73.5 81.63 64.96 72.42 76.57 80.87 78.78 79.04 99.42 87.73 93.47 98.89 86.28 61.17 74.3 77.46 95.57 89.2 96.69 95.51 39.65 95.12 98.21 99.53 71.16 90.99 91.66 43.56 93.2 91.66 49.01 69.59 95.49 72.14 75.87 89.67 91.99 81.03
CC4.5 s = 2.0 98.37 66.92 75.75 67.16 77.76 72 94.11 87.08 52 84.93 85.38 71.02 93.31 73.81 81.25 64.73 72.42 75.74 81.41 78.59 79.09 99.41 87.93 93.27 98.88 85.67 61 74.63 77.27 94.7 89.35 96.61 95.4 39.18 94.93 98.17 99.53 70.84 90.73 91.51 42.69 93.27 91.66 49.42 69.24 95.42 69.4 75.64 89.05 91 80.77
CC4.5 AdaptativeCC4.5 s = 3.0 97.92 98.39 68.38 65.95 72.61 77.78 62.01 71.01 77.2 77.2 72.36 72.46 93.86 94.26 85.35 90.05 51.93 51.93 84.91 84.96 85.12 85.2 71.56 71.56 92.95 93.28 74.08 74.08 81.42 81.61 64.77 65.91 72.42 72.42 76.01 76.87 81.41 81.31 78.11 78.11 79.81 80.02 99.43 99.45 87.9 87.87 93.33 93.27 98.7 98.95 84.57 86.66 61.06 61.06 73.61 74.3 76.78 77.76 93.16 96.06 89.3 88.98 96.61 96.78 95.11 95.49 38.82 40.36 94.72 95.23 98.1 98.24 99.53 99.53 69.21 69.35 89.65 91.52 91.11 91.51 41.47 44.05 92.92 92.92 91.66 91.66 49.68 49.68 68.71 69.27 95.33 95.33 65.65 74.98 75.44 75.44 88.3 89.5 90.33 92.79 80.29 81.25
AN US
anneal arrhythmia audiology autos balance-scale breast-cancer wisconsin-breast-cancer car cmc horse-colic credit-rating german-credit dermatology pima-diabetes ecoli Glass haberman cleveland-14-heart-disease hungarian-14-heart-disease heart-statlog hepatitis hypothyroid ionosphere iris kr-vs-kp letter liver-disorders lymphography mfeat-pixel nursery optdigits page-blocks pendigits primary-tumor segment sick solar-flare2 sonar soybean spambase spectrometer splice sponge tae vehicle vote vowel waveform wine zoo Average
CC4.5 s = 0.25 98.29 61.7 77.7 75.08 78.11 71.9 93.99 90.94 50.95 84.58 85.14 71.13 93.55 72.85 81.34 66.26 72.45 76.53 79.4 77.15 78.15 99.41 86.7 93.27 98.94 86.81 62.38 74.37 77.63 96.34 88.46 96.77 95.38 39.42 95.08 98.19 99.53 67.69 90.88 91.17 43.48 93.05 91.66 50.4 68.91 95.63 76.75 72 88.71 92.38 80.97
M
C4.5
ED
Dataset
CR IP T
tativeCC4.5, when they are used on data sets with percentage of added random noise equal to 10%.
ACCEPTED MANUSCRIPT
References
Table 14.
31
Accuracy results of C4.5; CC4.5, with different values for the s parameter; and Adap-
97.42 54.68 74.42 66.91 77.29 70.33 92.72 88.33 48.36 83.31 83.3 69.26 91.45 72.17 79.11 61.06 71.89 73.02 78.52 70.3 76.11 99.16 82.46 90.87 97.27 84.86 61.3 70.99 74.67 95.24 85 96.11 93.49 37.46 92.97 97.64 99.53 64.22 90 88.37 38.96 90.12 91.39 47.49 62.9 94.76 72.17 62.74 76.84 90.94 78.20
PT
CE AC
CC4.5 s = 0.5 97.44 56.05 74.34 68.01 77.37 70.45 93.76 88.36 48.97 83.09 83.9 68.26 91.05 72.45 80.22 62.96 72.38 75.15 77.98 74.7 77.6 99.17 84.6 92.2 97.24 84.88 60.29 70.29 75.56 95.45 85.37 96.49 93.89 37.87 93.89 97.84 99.53 66.45 89.84 89.58 38.4 89.92 91.27 47.61 65.93 94.78 72.76 69.99 84.75 91.35 79.03
CC4.5 s = 1.0 97.55 58.59 75.57 65.91 77.28 70.73 93.79 88.15 49.82 82.93 84.29 68.92 91.62 72.48 81.46 63.52 72.12 76.46 79.16 76.63 77.73 99.22 85.11 92.87 97.33 84.95 59.95 70.51 75.63 95.35 86.28 96.67 94.41 37.9 94.35 97.94 99.53 66.2 89.96 89.94 41.32 90.06 91.27 47.17 67.3 94.71 71.38 73.17 86.43 90.55 79.44
CC4.5 s = 1.5 97.64 61 75.79 63.45 77.04 71.23 93.86 87.21 50.47 83.39 84.33 69.21 91.84 72.84 81.66 64.24 72.36 76.26 79.98 76.59 78.38 99.3 85.34 92.67 97.18 84.96 58.97 71.61 75.28 95.13 87.19 96.6 94.71 38.46 94.51 97.95 99.53 66.92 90.13 90.04 41.58 90.48 91.39 46.89 67.95 94.69 69.78 74.37 86.64 90.15 79.58
CC4.5 s = 2.0 97.56 62.93 74.28 62.78 76.59 71.22 93.92 86.3 50.6 83.41 84.64 69.74 91.7 72.88 81.42 63.06 72.29 76.17 80.74 76.19 78.76 99.32 85.7 92.73 97.19 84.62 58.56 71.9 75.23 94.75 87.77 96.61 94.78 38.32 94.48 97.93 99.53 67.5 90.3 90.1 41.42 90.91 91.39 47.03 67.94 94.76 68.28 74.91 86.5 89.85 79.55
CC4.5 AdaptativeCC4.5 s = 3.0 97.43 97.55 65.2 63.27 71.4 75.18 61.39 63.37 76.21 76.21 71.37 71.44 93.68 93.65 84.95 86.48 51.62 51.62 83.79 83.74 84.77 84.77 70.06 70.06 91.21 91.59 72.84 72.84 81.34 81.46 62.85 64.13 72.36 72.36 75.8 76.17 80.9 80.7 75.85 75.85 78.57 78.04 99.37 99.29 86.13 86.13 93 93 97.26 97.26 83.8 84.95 58.49 58.49 72.36 72.1 74.94 75.63 93.6 95.35 87.71 86.28 96.54 96.67 94.65 94.41 38.4 39 94.24 94.35 97.96 97.94 99.53 99.53 67.42 67.42 90.01 90.72 90.07 90.07 40.66 41.32 91.39 91.39 91.27 91.27 46.68 46.68 67.71 67.82 95.15 95.11 64.12 71.38 74.98 74.98 86.95 86.95 89.54 90.55 79.35 79.73
AN US
anneal arrhythmia audiology autos balance-scale breast-cancer wisconsin-breast-cancer car cmc horse-colic credit-rating german-credit dermatology pima-diabetes ecoli Glass haberman cleveland-14-heart-disease hungarian-14-heart-disease heart-statlog hepatitis hypothyroid ionosphere iris kr-vs-kp letter liver-disorders lymphography mfeat-pixel nursery optdigits page-blocks pendigits primary-tumor segment sick solar-flare2 sonar soybean spambase spectrometer splice sponge tae vehicle vote vowel waveform wine zoo Average
CC4.5 s = 0.25 97.28 54.87 74.2 67.85 77.22 70.32 93.28 88.47 48.64 83.11 84.2 68.77 91.13 72.65 79.56 61.15 72 74.13 78.42 73.44 76.8 99.17 84.14 91.53 97.35 84.87 61.01 70.54 75.56 95.39 84.93 96.31 93.63 37.31 93.32 97.8 99.53 64.66 90 89.06 38.42 89.99 91.27 47.43 64.43 94.74 73.1 67.08 81.73 91.62 78.67
M
C4.5
ED
Dataset
CR IP T
tativeCC4.5, when they are used on data sets with percentage of added random noise equal to 20%.
ACCEPTED MANUSCRIPT
32
Table 15.
Accuracy results of C4.5; CC4.5, with different values for the s parameter; and Adap-
96.03 49.15 70.88 57.92 74.16 68.65 89.24 86 46.39 79.63 74.58 63.09 87.64 69.39 75.27 55.23 68.83 68 78.16 65.52 68.15 98.59 78.18 84 91.13 82.13 56.83 66.33 71.98 93.99 76.91 94.91 89.21 37.67 85.35 95.2 99.53 60.84 88.45 86.07 33.02 81.21 88.84 45.86 56.06 90.99 66.01 57.32 71.02 87.65 74.14
PT
CE AC
CC4.5 s = 0.5 95.87 49.25 72.09 59.92 74.42 67.71 91.23 85.97 46.85 79.66 78.33 63.1 87.31 69.99 77.14 58.74 72.36 69.58 76.74 70.11 70.14 98.73 78.95 88.4 90.7 82.21 56.56 67.34 72.75 94.21 77.86 95.89 91.03 37.02 89.92 96.76 99.49 63.41 88.34 87.31 33.96 79.97 87.64 44.1 60.87 91.13 65.82 66.33 79.37 88.14 75.41
CC4.5 s = 1.0 95.81 53.91 71.56 60.89 75.02 67.71 92.2 85.89 47.8 79.91 81.39 63.71 88.7 69.67 80.01 60.28 72.85 71.83 77.31 72.3 72.29 98.91 80.07 89 90.96 82.59 55.45 68.11 72.97 94.3 81.17 96.3 92.5 36.99 92.13 97.12 99.49 63.29 88.37 87.69 37.03 79.96 87.25 43.58 64.04 91.55 66.34 70.23 83.02 87.64 76.38
CC4.5 s = 1.5 96.03 57.83 71.52 59.33 74.91 67.21 92.36 85.49 48.5 79.99 82.19 65.09 89.27 69.93 80.52 61.7 72.72 73.57 78.75 72.96 74.4 99.02 80.57 90.2 91.25 82.92 55.57 68.74 73.08 94.24 83.41 96.44 93.47 37.47 93.02 97.27 99.46 63.68 88.71 87.77 38.42 80.93 87.54 43.92 65.32 91.68 65.68 72.42 82.73 88.43 76.95
CC4.5 s = 2.0 96.24 59.81 70.94 59.5 74.58 68.12 92.3 85.13 48.6 80.97 83.12 65.83 89.95 70.27 80.78 62.87 72.56 74.86 79.09 73.63 75.33 99.06 81.11 90.6 92.23 82.79 55.78 69.08 73.08 94.03 84.73 96.43 93.88 37.85 93.28 97.34 99.46 62.87 88.53 87.84 38.59 83.13 88.07 43.87 65.46 91.93 64.56 73.41 84.25 88.13 77.32
CC4.5 AdaptativeCC4.5 s = 3.0 96.36 95.81 61.89 61.7 69.25 70.73 57.37 60.24 74 74 67.93 67.85 92.72 92.78 84.46 84.73 48.74 48.74 81.05 81.05 83.87 83.9 66.61 66.61 89.74 88.7 70.12 70.12 80.72 80.01 62.26 62.87 72.46 72.46 75.75 75.38 79.85 80.03 73.59 73.59 75.97 76.18 99.11 99.12 80.74 80.74 91.33 91.33 93.33 93.33 82.34 82.59 56.1 56.1 71.21 70.87 73.11 72.97 93.46 93.47 85.68 81.17 96.32 96.43 94.01 92.5 37.88 37.71 93.26 92.13 97.35 97.31 99.46 99.46 62.01 62.01 88.5 89.19 87.93 87.93 38.18 37.03 86.75 86.75 88.52 88.52 42.6 42.6 66.48 66.48 93.01 92.9 62.63 66.34 74.24 74.24 84.29 84.29 87.45 87.64 77.44 77.41
AN US
anneal arrhythmia audiology autos balance-scale breast-cancer wisconsin-breast-cancer car cmc horse-colic credit-rating german-credit dermatology pima-diabetes ecoli Glass haberman cleveland-14-heart-disease hungarian-14-heart-disease heart-statlog hepatitis hypothyroid ionosphere iris kr-vs-kp letter liver-disorders lymphography mfeat-pixel nursery optdigits page-blocks pendigits primary-tumor segment sick solar-flare2 sonar soybean spambase spectrometer splice sponge tae vehicle vote vowel waveform wine zoo Average
CC4.5 s = 0.25 95.82 48.62 71.82 60.32 74.5 67.68 90.59 86.02 46.62 79.74 76.35 62.57 87.3 70.17 76.31 57.1 71.47 69.45 77.02 69.07 70.61 98.6 77.96 86.87 90.84 82.16 56.92 68.85 72.59 94.11 76.11 95.6 89.82 37.02 87.31 96.18 99.5 61.93 88.14 86.89 33.47 80.31 87.61 44.19 58.32 91.13 65.77 62.37 76.31 88.53 74.89
M
C4.5
ED
Dataset
CR IP T
tativeCC4.5, when they are used on data sets with percentage of added random noise equal to 30%.
ACCEPTED MANUSCRIPT
References
Table 16.
33
Accuracy results of AdaptativeCC4.5 (ACC4,5) and RF, when they are used on data
sets with a percentage of added random noise to the class variable equal to 0%, 5%, 10%, 20% and
RF 97.24 66.37 75.19 78.22 78.16 67.45 94.69 92.52 48.55 83.45 83.58 72.97 94.94 73.88 82.11 75.40 64.42 79.41 79.35 78.56 82.04 98.94 90.91 92.40 96.72 92.62 67.61 80.74 90.28 96.93 96.41 96.69 98.56 41.30 96.48 97.93 98.39 78.34 90.22 93.31 51.40 88.04 92.50 62.67 73.20 94.55 92.62 81.14 96.13 94.67 83.40
Noise 10% ACC4.5 98.39 65.95 77.78 71.01 77.2 72.46 94.26 90.05 51.93 84.96 85.2 71.56 93.28 74.08 81.61 65.91 72.42 76.87 81.31 78.11 80.02 99.45 87.87 93.27 98.95 86.66 61.06 74.3 77.76 96.06 88.98 96.78 95.49 40.36 95.23 98.24 99.53 69.35 91.52 91.51 44.05 92.92 91.66 49.68 69.27 95.33 74.98 75.44 89.5 92.79 81.25
RF 95.07 65.40 72.88 74.17 76.36 65.48 92.93 90.81 48.30 82.03 81.96 70.81 93.79 72.21 80.92 72.88 63.02 78.29 77.69 76.93 80.43 98.57 89.03 89.73 94.35 90.45 65.00 79.56 88.88 95.11 95.86 95.85 98.07 39.29 94.96 97.38 97.21 76.45 87.04 91.50 50.11 85.78 91.48 60.43 72.00 93.42 88.98 80.51 93.98 91.00 81.69
Noise 20% ACC4.5 97.55 63.27 75.18 63.37 76.21 71.44 93.65 86.48 51.62 83.74 84.77 70.06 91.59 72.84 81.46 64.13 72.36 76.17 80.7 75.85 78.04 99.29 86.13 93.00 97.26 84.95 58.49 72.1 75.63 95.35 86.28 96.67 94.41 39.00 94.35 97.94 99.53 67.42 90.72 90.07 41.32 91.39 91.27 46.68 67.82 95.11 71.38 74.98 86.95 90.55 79.73
AC
CE
PT
ED
M
AN US
Noise 0% Noise 5% Dataset ACC4.5 RF ACC4.5 anneal 98.57 99.38 98.53 arrhythmia 67.99 66.66 66.44 audiology 80.53 76.56 79.74 autos 74.31 81.75 74.21 balance-scale 77.44 80.11 78.1 breast-cancer 75.33 69.50 73.75 wisconsin-breast-cancer 95.12 95.81 94.62 car 89.86 93.18 90.25 cmc 53.5 49.75 52.42 horse-colic 85.24 84.91 85.29 credit-rating 85.43 84.88 85.22 german-credit 71.7 73.72 71.66 dermatology 93.96 95.71 93.64 pima-diabetes 75.09 74.44 74.22 ecoli 81.72 83.30 81.72 Glass 65.52 76.16 65.37 haberman 71.31 65.36 72.58 cleveland-14-heart-disease 76.73 80.37 77.65 hungarian-14-heart-disease 82.67 79.77 81.55 heart-statlog 80.59 80.41 79.00 hepatitis 80.62 83.13 79.68 hypothyroid 99.55 99.18 99.48 ionosphere 88.27 93.11 88.52 iris 94.13 94.27 93.67 kr-vs-kp 99.45 98.87 99.22 letter 88.03 94.54 87.49 liver-disorders 67.32 67.98 62.81 lymphography 78.58 80.74 75.14 mfeat-pixel 78.66 91.23 78.36 nursery 96.3 98.36 96.22 optdigits 90.52 96.79 89.81 page-blocks 96.68 97.25 96.76 pendigits 96.54 98.85 95.99 primary-tumor 41.8 42.13 41.27 segment 96.15 97.69 95.81 sick 98.78 98.23 98.36 solar-flare2 99.53 99.42 99.53 sonar 71.67 81.07 71.22 soybean 91.9 91.80 91.87 spambase 92.57 94.96 92.15 spectrometer 45.97 51.70 45.57 splice 94.1 90.09 93.28 sponge 92.5 94.11 92.21 tae 54.87 65.85 52.53 vehicle 73.07 74.08 71.15 vote 96.59 95.95 95.88 vowel 80.18 95.75 76.26 waveform 75.26 81.86 75.45 wine 92.47 97.13 91.33 zoo 93.22 95.94 93.51 Average 82.56 84.68 81.93
CR IP T
30%, respectively.
RF 89.43 62.65 66.23 68.65 71.99 62.41 87.97 86.94 45.64 77.58 76.51 68.59 91.17 69.33 77.61 68.27 59.76 75.66 75.01 72.93 77.10 96.91 82.69 81.47 86.67 85.43 61.53 73.56 85.61 90.16 93.80 93.06 96.41 37.90 91.27 93.97 93.60 73.67 81.21 85.97 47.88 79.76 88.05 54.34 68.76 88.74 80.60 77.39 88.93 84.87 77.51
Noise 30% ACC4.5 95.81 61.7 70.73 60.24 74.00 67.85 92.78 84.73 48.74 81.05 83.9 66.61 88.7 70.12 80.01 62.87 72.46 75.38 80.03 73.59 76.18 99.12 80.74 91.33 93.33 82.59 56.1 70.87 72.97 93.47 81.17 96.43 92.5 37.71 92.13 97.31 99.46 62.01 89.19 87.93 37.03 86.75 88.52 42.6 66.48 92.9 66.34 74.24 84.29 87.64 77.41
RF 81.09 59.87 60.78 58.95 65.42 59.35 79.33 80.67 43.18 70.38 68.35 62.85 85.28 65.45 72.56 64.54 57.59 71.95 70.38 67.33 71.97 93.28 72.94 72.07 76.14 78.91 56.94 65.62 80.64 83.23 90.40 88.38 93.06 34.52 85.61 85.75 88.96 65.20 74.08 78.07 43.77 71.19 78.61 49.72 64.04 81.24 71.75 72.45 82.48 78.15 71.49