Applied Soft Computing Journal 83 (2019) 105607
A rough-granular approach to the imbalanced data classification problem
K. Borowska∗, J. Stepaniuk
Faculty of Computer Science, Bialystok University of Technology, Wiejska 45A, 15-351 Bialystok, Poland
Highlights
• The new rough-granular approach to imbalanced data was introduced.
• Selective data oversampling was combined with a filtering step.
• Difficulty factors that affect classifier performance were handled.
• The problem of tuning the algorithm's parameters was addressed.
Article info

Article history:
Received 31 October 2018
Received in revised form 2 May 2019
Accepted 25 June 2019
Available online 2 July 2019

Keywords: Data preprocessing; Class imbalance; Granular computing; Information granules; Rough sets; SMOTE
Abstract

More than two decades ago the imbalanced data problem turned out to be one of the most important and challenging problems. Indeed, missing information about the minority class leads to a significant degradation in classifier performance. Moreover, comprehensive research has proved that there are certain factors increasing the problem's complexity. These additional difficulties are closely related to the data distribution over decision classes. Although numerous methods have been proposed, the flexibility of existing solutions needs further improvement. Therefore, we offer a novel rough-granular computing approach (RGA, in short) to address the mentioned issues. New synthetic examples are generated only in specific regions of the feature space. This selective oversampling approach is applied to reduce the number of misclassified minority class examples. A strategy relevant for a given problem is obtained by the formation of information granules and an analysis of their degrees of inclusion in the minority class. Potential inconsistencies are eliminated by applying an editing phase based on a similarity relation. The most significant algorithm parameters are tuned in an iterative process. The set of evaluated parameters includes the number of nearest neighbours, the complexity threshold, the distance threshold and the cardinality redundancy. Each data model is built by exploiting different parameter values. The results obtained in an experimental study on different datasets from the UCI repository are presented. They prove that the proposed method of inducing the neighbourhoods of examples is crucial for the proper creation of synthetic positive instances. The proposed algorithm outperforms related methods on most of the tested datasets. The set of valid parameters for the Rough–Granular Approach (RGA) technique is established.

© 2019 Published by Elsevier B.V.
∗ Corresponding author.
E-mail addresses: [email protected] (K. Borowska), [email protected] (J. Stepaniuk).
https://doi.org/10.1016/j.asoc.2019.105607

1. Introduction

Dealing with the imbalanced learning problem plays an essential role in real-life applications. The issue occurs when the number of examples representing a class of major interest is significantly lower than the number of instances from another class. It is assumed that the under-represented class is called positive or minority, and the class for which a sufficient number of instances is provided is called negative or majority [1]. In fact, the
problem of imbalanced data in data mining involves several other factors that cannot be ignored by researchers [1–3]. These additional difficulties can be categorized into three main groups: class overlapping [4], small disjuncts [5] and the presence of noise and outliers [6]. Developing solutions that identify all of these problems and deal with them simultaneously is very challenging. Existing techniques are typically divided into the categories specified below [1,3,7]:
• data-level methods that focus on transforming the original data distribution mainly by oversampling and undersampling,
• algorithm-level methods that create new or modify existing learning algorithms to consider the importance of the minority class,
• hybrid methods that combine both approaches.

In this paper we focus on data-level methods, since they are classifier-independent and more flexible. They are usually performed as one of the preprocessing steps. Data preprocessing involves all transformations that should be applied to turn raw data into a consistent format acceptable to the chosen learning algorithm [8]. Data-level preprocessing techniques interfere with the cardinalities of classes to make them approximately equal. They usually pay particular attention to specific data characteristics. Oversampling is defined as creating more positive class representatives and undersampling as removing selected negative class instances; these are the main examples of the data-level approach.

One of the most ground-breaking preprocessing methods is SMOTE (Synthetic Minority Oversampling Technique) [9]. It is based on the assumption that random oversampling, which replicates minority class examples, may lead to overfitting. Therefore, the SMOTE method enriches the original dataset by introducing new instances slightly different from the existing ones. These new instances are generated by interpolating several minority class examples that lie together [1]. SMOTE, as a cutting-edge and powerful technique, became the inspiration for many associated methods. However, many of them are applicable only to a narrow range of specific problems, ignoring other factors that frequently occur in real-world datasets [3,10]. Moreover, even if they are considered flexible, they usually require careful calibration of numerous parameters to fit particular problems. Hence, a study of the possible and most appropriate parameter values is presented in this paper.
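The interpolation idea behind SMOTE can be sketched in a few lines of Python. This is a minimal illustration for numeric features only, using our own naming (`smote_sample` is not part of any official implementation):

```python
import random

def smote_sample(x, minority_neighbours):
    """Create one synthetic minority example by interpolating between
    instance x and a randomly chosen minority-class neighbour."""
    neighbour = random.choice(minority_neighbours)
    gap = random.random()  # interpolation coefficient in [0, 1]
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbour)]

# A minority instance and two of its minority-class nearest neighbours
x = [1.0, 2.0]
neighbours = [[2.0, 2.0], [1.0, 3.0]]
synthetic = smote_sample(x, neighbours)
# Each coordinate of the synthetic example lies on the segment
# between x and the chosen neighbour
```

Because the interpolation coefficient is drawn from [0, 1], every synthetic instance stays inside the convex hull of the minority examples involved, which is exactly why blind interpolation can land in majority-dominated regions when neighbours lie across a class boundary.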
It is also worth mentioning that combining oversampling with filtering or cleaning techniques has emerged as an especially promising approach [11]. In particular, rough set theory can be applied to the imbalanced data problem. This successful approach, developed by Pawlak (1926–2006) [12], assumes that vague concepts can be replaced by a pair of precise notions: the lower and upper approximation [13]. Therefore, complex problems described by inconsistent information, frequently the case in real-life domains, may be handled effectively.

The imbalanced data problem requires a deep analysis of the neighbourhood of minority class examples. Considering the data distribution enables us to deal with the additional difficulty factors. To implement this approach, granules with parameters are introduced. Tuning the granules' parameters leads to the adequate creation of synthetic minority class instances and improved classification performance. The aim of this paper is to propose a novel data preprocessing RGA algorithm that applies the mentioned method, and to present a methodology for easily obtaining appropriate parameter values. Tuning these parameters is crucial for handling various imbalanced data characteristics, especially the more complex ones.

The paper is organized as follows. Section 2 describes the main factors influencing an imbalanced data learning process, typical parameters, and evaluation measures. Section 3 is an overview of existing methods that involve data preprocessing. In Section 4 the proposed RGA algorithm is introduced. Section 5 contains the results of an experimental study. Section 6 contains the conclusions and possible directions for future research.

2. The nature of the imbalanced data problem

The relevance of the imbalanced data problem is emphasized by its numerous occurrences in real-world domains, such as medical
diagnosis, text categorization, oil spill detection, risk management, behaviour and sentiment analysis, modern manufacturing plants, and intrusion and fraud detection [1,7,10,14,15]. All of these examples are characterized by unequal class cardinalities. Moreover, proper recognition of instances representing the positive class has a greater value than in the contrary case [10]. In other words, incorrect classification of rare minority class cases, such as cancer patients, can have dramatic consequences: mistakes made during the analysis of data describing cancer patients are significantly more costly than misclassifying a healthy person [14]. Therefore, in many applications, it is critically important to resolve the imbalanced data problem.

2.1. Imbalanced data issues

A typical insight into the imbalanced dataset problem concerns the unequal distribution between classes. Since standard machine learning algorithms tend to generalize, they misclassify under-represented examples from the minority class more often than the numerous majority class instances [1,3]. The main requirement of most classifiers is a sufficient number of training instances. When the expected quantity of instances needed for learning is not provided, the algorithm decides in favour of majority examples. Furthermore, the overall accuracy remains valid, thus the results are outwardly satisfying. The measure that helps to evaluate the degree of imbalance is obtained by calculating the class size ratio.

Definition 1. The imbalance ratio (IR), the relation between majority and minority class cardinalities, is defined as follows [11]:
IR = N− / N+  (1)
where N− = card({x ∈ U : d(x) = −}) is the number of majority class examples (instances) and N+ = card({x ∈ U : d(x) = +}) is the number of minority class examples (instances).

There is no explicit definition of the threshold at which data become truly imbalanced, leading to degraded classification performance [3]. Ratios as high as 1000:1 and as low as 35:1 have been shown to often adversely influence the model building process [10]. Apart from the mentioned problem characteristic, there are other, even more meaningful factors affecting classifier performance. Recent comprehensive studies indicate that complex data distribution is the primary reason for the degraded effectiveness of learning algorithms on imbalanced datasets [1–6,10,11,14]. Notably, three main factors are distinguished:
• small disjuncts - the dataset consists of many subconcepts, each accumulating only a few examples from a particular class [4–6,10,11,14]. The presence of within-class clusters, in the case of an under-represented minority class, leads to serious learning problems. Standard classifiers may not properly identify small numbers of examples placed in specific areas of the feature space [5]. Moreover, the existence of rare cases forming small clusters is usually implicit, since high-dimensional data cannot be easily visualized;
• class overlapping - occurs when the data distribution contains ambiguous regions, where samples from different classes are very similar. Therefore, the classifier has difficulty in inducing discriminative rules [1]. By comparison, in linearly separable problems, patterns that enable distinguishing instances between classes are easily generated. The experiments mentioned in [10] prove that the higher the class overlapping degree, the higher the system's sensitivity to imbalance becomes;
• presence of noise - specific minority class examples are placed in feature space regions dominated by majority class instances [11]. They are usually considered erroneous samples; however, it should be emphasized that some of the outliers may be rare cases [3]. These outstanding instances, similar to the majority examples and different from the minority ones, become untypical cases, since it is impossible to collect other comparable positive examples.

It is important to handle the described issues in any method proposed to solve the imbalanced data problem.

3. Overview of data-level techniques

In this section some related methods are described. Focusing on data-level techniques only, it should be emphasized that many methods from this category were inspired by SMOTE. The idea of generating synthetic positive instances based on the similarities between existing minority instances is considered very powerful and successful. When additional difficulties are taken into account, it turns out that even such a ground-breaking method as SMOTE is not free of certain drawbacks. By generating new instances blindly, some inconsistencies may be introduced. Synthetic samples often overlap with instances from the majority class, or they are created in areas dominated by negative examples. Therefore, new algorithms improve SMOTE by reducing drawbacks such as its tendency to over-generalization and variance. Since real-life data usually suffer from high complexity, conceptualized as additional difficulty factors, two main approaches to applying SMOTE were established [11]:
• SMOTE modifications - involve creating new instances only in specific parts of the feature space, considering the characteristics of the data,
• filtering-based techniques - extensions of SMOTE, developed by integrating it with additional cleaning techniques that ensure more regular class boundaries and filter out noise.

Apart from the methods belonging to these categories, there are also many other algorithms that are not inspired by SMOTE but are worth mentioning. Therefore, the following two subsections cover both SMOTE-based and other approaches. Most of the preprocessing methods require tuning of parameters. These should be carefully evaluated since they have a huge impact on classifier performance.

3.1. SMOTE modifications

There are numerous methods based on the SMOTE algorithm. Most of them were developed to deal with specific data characteristics that make a problem more complex. Borderline-SMOTE [16] assumes that the examples located in the area surrounding class boundaries are the most prone to misclassification. Therefore, only these specific instances, considered the most important in the classification process, are oversampled. There are two versions of this algorithm. Borderline-SMOTE1 generates new samples by combining features of a danger instance from the minority class and its nearest neighbours from the same class. In the Borderline-SMOTE2 method the nearest neighbours from the majority class are also considered; however, new instances are generated closer to the examples from the minority class. On the other hand, Safe-Level-SMOTE [17] analyses the number of positive instances in the neighbourhood of the example under consideration and of its nearest neighbours. Based on this analysis the algorithm assigns appropriate safe levels. The more nearest neighbours that represent the minority class, the higher the
safe level is. New instances are generated closer to the largest safe level, thus they are created only in safe regions. When both the instance under consideration and its selected neighbour are noise examples, no new samples are generated. These two methods represent two opposite oversampling approaches. Although they are successful in some specific domains, they are not suitable for more complex problems, where the dataset suffers from more than one difficulty factor.

3.2. Filtering-based techniques

Since SMOTE may introduce some inconsistent samples into the original data distribution, and most real-life datasets are complex themselves, it is usually crucial to apply an additional cleaning phase. Selected sophisticated and successful filtering-based techniques are described briefly in this section. SMOTE-RSB∗ [18] applies rough set theory to provide a more consistent set of synthetic data (generated by SMOTE). After oversampling, the degree of similarity between all instances is obtained. Newly created samples are evaluated by analysing whether they belong to the lower approximation. Each synthetic example that has a similarity value lower than the threshold (in other words, belonging to the lower approximation) is added to the final result set. In the SMOTE-ENN [19] method, inconsistent instances from both classes are removed after oversampling. The Edited Nearest Neighbour (ENN) rule removes examples that are misclassified by their three nearest neighbours. VISROT [20] combines selective preprocessing and editing techniques. New synthetic instances are created only in selected regions of the feature space to prevent introducing additional noise. After generating new minority class instances, the filtering technique is applied and all newly generated instances identified as inconsistent are removed. In the granular undersampling [21] method, information granules are built from the majority class instances.
Undersampling is performed by eliminating instances that constitute information granules of insufficient specificity. This kind of information granule should be considered noise. For more examples the reader is referred to the comprehensive overview of SMOTE-based methods in [22].

4. Proposed solution based on rough–granular computing

Table 1 summarizes the principal notation used in this article. The aim of the proposed RGA algorithm is to provide flexibility and effectiveness when learning from imbalanced data. The novel approach is based on the idea of achieving satisfactory results by applying data preprocessing without interfering with the classification process itself. Since undersampling may lead to a loss of potentially valuable information, we decided to focus only on generating new synthetic samples and removing introduced noise. Therefore, the main phase of the algorithm involves selective data oversampling with an additional filtering step. The RGA method also enables automatic parameter tuning by repeating data preprocessing iteratively with different combinations of parameter values. Fig. 1 shows the main steps of the presented algorithm.

4.1. Information granule formation

The very first step involves the identification of information granules (for more details on information granules see [23,24]). We assume that instances can be treated as similar in terms of their position in the feature space. Hence, by discerning the set of minority class instances and adjacent examples from both classes, information granules are formed. The efficient kNN algorithm is
Table 1
Notation used in this article.

Symbol — Interpretation
U — Set of instances (universe, e.g., sample of instances)
a — Condition attribute over U (a : U → Va)
Va — Set of attribute values of a ∈ A
A — Set of condition attributes over U
a(x) — The value of instance x ∈ U for attribute a ∈ A
d — Decision attribute over U (d ∉ A, d : U → Vd, Vd = {+, −})
DT — Decision table (U, A ∪ {d})
DTbalanced — Balanced decision table (Ubalanced, A ∪ {d}), where U ⊆ Ubalanced
card(U) — Number of elements in U
P(U) — Set of all subsets of U
d(x) = + — x ∈ U is in the minority class
Xd=+ — The minority class of instances
d(x) = − — x ∈ U is in the majority class
Xd=− — The majority class of instances
IR — Imbalance ratio (see Definition 1)
NNk(x) — The set of k nearest neighbours of x ∈ U
σa — The standard deviation of the values of attribute a
Label(x) — The label assigned to example x
SAFE — Etiquette indicating a high degree of information granule inclusion
BOUNDARY — Etiquette indicating a low degree of information granule inclusion
NOISE — Etiquette indicating noninclusion of the information granule
complexity_threshold — One of the algorithm's parameters (see Section 4.2)
distance_threshold — One of the algorithm's parameters (see Section 4.3)
cardinality_redundancy — One of the algorithm's parameters (see Section 4.3)
IND ⊆ U × U — The indiscernibility relation
param — The set of algorithm parameter values
DTparam — Decision table balanced with particular parameter values
classifierparam — The classification model built after applying certain parameter values
applied to obtain the entities that consist of instances arranged together. Since most real-life problems involve a concomitance of qualitative and quantitative attributes, a dedicated distance metric is chosen. To properly recognize the neighbourhood of individual instances, described by both numeric and symbolic features, the Heterogeneous Value Distance Metric (HVDM) is utilized. It is defined as:

HVDMA(x, y) = √( Σa∈A da(a(x), a(y))² )  (2)
where x, y ∈ U are instances, and a(x) and a(y) are the values of attribute a ∈ A for instances x and y, respectively. The distance function da for an attribute a ∈ A is defined as:

da(v, v′) = { normalized_vdma(v, v′), if a is nominal; normalized_diffa(v, v′), if a is linear }  (3)
The distance function da thus consists of two component functions, normalized_vdma and normalized_diffa, suited to the different kinds of attributes. The following function is defined for nominal features:
normalized_vdma(v, v′) = √( Σc∈{+,−} | Na=v,d=c / Na=v − Na=v′,d=c / Na=v′ |² )  (4)
where Na=v = card({x ∈ U : a(x) = v}) is the number of instances x in the training set U that have value v for attribute a, Na=v,d=c = card({x ∈ U : a(x) = v ∧ d(x) = c}) is the number of instances that have value v for attribute a and output class c, and {+, −} is the set of decision classes. The function appropriate for linear attributes is defined as:

normalized_diffa(v, v′) = |v − v′| / (4σa)  (5)
where σa is the standard deviation of the values of attribute a.

It is crucial to assign an appropriate value of the parameter k > 0 when identifying the nearest neighbours of minority class instances. The number of nearest neighbours directly affects the size of the information granule. Moreover, since instances from both classes are considered as adjacent examples, the sensitivity of the algorithm to noise and outliers becomes adjustable, as does its specificity. The appropriate selection of the k parameter is discussed in detail in the remainder of this section.

We want to briefly explain why we regard the kNN step (present in most imbalanced classification approaches) as a data granulation approach. An information granule is a group of objects drawn together by indiscernibility, similarity or functionality (see e.g. [23,24]). Similarity of objects can be defined, among other things, by a distance function, which is also used in the kNN method. First, we use the kNN method to form information granules for minority class objects. Next, we consider the degree of inclusion of such granules in the granule labelled by the minority class.

4.2. Inclusion of information granules

Information granule formation facilitates splitting the problem into more feasible subtasks [23]. Each subtask can then be easily managed by applying appropriate approaches dedicated to specific types of entities. After defining groups of similar instances (namely minority class instances x ∈ Xd=+ and their k nearest neighbours NNk(x)), the inclusion degree of each information granule NNk(x) in Xd=+ is examined. Based on this analysis, labels Label(x) are assigned to all positive examples x ∈ Xd=+. In this paper we assume that information granule evaluation is crucial for further processing. Before applying an oversampling mechanism, each information granule NNk(x) having a positive instance x as the anchor point is labelled with one of the following etiquettes: SAFE, BOUNDARY or NOISE. The category of an individual entity is determined by the inclusion degree of NNk(x) in the information granule Xd=+ (the whole minority class). Details of the proposed technique are presented in Eq. (6) and explained below.
Label(x) =
  SAFE,      if card(NNk(x) ∩ Xd=+) / k > 1/2
  BOUNDARY,  if 0 < card(NNk(x) ∩ Xd=+) / k ≤ 1/2
  NOISE,     if card(NNk(x) ∩ Xd=+) / k = 0
(6)
Fig. 1. The flowchart of the RGA algorithm.
• Etiquette Label(x) = SAFE for x ∈ Xd=+. A high inclusion degree indicates that the information granule NNk(x) is placed in a homogeneous area, and therefore x can be considered SAFE. The inclusion level is obtained by an analysis of granule characteristics, especially the cardinalities of instances from both classes. The number of positive class representatives belonging to the analysed entity (except the anchor example), i.e. card(NNk(x) ∩ Xd=+), is compared with the number of negative class instances, i.e. card(NNk(x) ∩ Xd=−). When more than half of the neighbours belong to the positive class, x is labelled as SAFE (see Fig. 2(a) for k = 5 and Table 2 for k = 7).
• Etiquette Label(x) = BOUNDARY for x ∈ Xd=+. A low inclusion degree is determined by a large representation of the majority class Xd=− in the information granule NNk(x). When half or more of the neighbours belong to the negative class, the BOUNDARY label for x is chosen. These kinds of entities are placed in the area surrounding class boundaries, where examples from both classes overlap (see Fig. 2(b) for k = 5 and Table 2 for k = 7).
• Etiquette Label(x) = NOISE for x ∈ Xd=+. Noninclusion of the information granule NNk(x) in the minority class Xd=+ is identified with the situation where no instances belong to the minority class (except the anchor instance), i.e. NNk(x) ∩ Xd=+ = ∅. Since the only positive example among the analysed instances is the core instance x itself, the information granule is created around a rare individual placed in an area occupied by representatives of the negative class Xd=−. This case is considered NOISE (Label(x) = NOISE, see Fig. 2(c) for k = 5 and Table 2 for k = 7).

An example of labelling instances from the minority class is presented in Table 2. It shows all possible cases of the minority class instance's neighbourhood.
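Eq. (6) translates directly into a small labelling routine. The sketch below uses our own names and assumes, as in the text, that the minority count excludes the anchor instance:

```python
def label_instance(minority_neighbours, k):
    """Eq. (6): label a minority instance by the fraction of its k nearest
    neighbours that also belong to the minority class."""
    ratio = minority_neighbours / k
    if ratio > 0.5:
        return 'SAFE'
    if ratio == 0:
        return 'NOISE'
    return 'BOUNDARY'

# The three boundary cases of Table 2 (k = 7):
labels = [label_instance(m, 7) for m in (4, 3, 0)]
# labels == ['SAFE', 'BOUNDARY', 'NOISE']
```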
Table 2
Identification of the type of the minority class instance in the case of k = 7 nearest neighbours.

Label(x)    card(NN7(x) ∩ Xd=+)    card(NN7(x) ∩ Xd=−)
SAFE        7                      0
SAFE        6                      1
SAFE        5                      2
SAFE        4                      3
BOUNDARY    3                      4
BOUNDARY    2                      5
BOUNDARY    1                      6
NOISE       0                      7

Assuming that the k parameter is equal to 7, the second column presents the number of nearest neighbours belonging to the same class as the instance under consideration, and the third column shows the number of nearest neighbours representing the opposite class.

Example 1. Let x ∈ Xd=+ be an instance from the minority class, i.e. the decision d(x) = +. Let x1, x2, x3, x4, x5, x6, x7 be the seven nearest neighbours of x. Let us assume that d(x1) = −, d(x2) = +, d(x3) = +, d(x4) = −, d(x5) = −, d(x6) = + and d(x7) = +. We obtain card(NN7(x) ∩ Xd=+) = card({x1, x2, x3, x4, x5, x6, x7} ∩ Xd=+) = card({x2, x3, x6, x7}) = 4 and card(NN7(x) ∩ Xd=−) = card({x1, x2, x3, x4, x5, x6, x7} ∩ Xd=−) = card({x1, x4, x5}) = 3. Hence, we conclude that the correct label for x is Label(x) = SAFE.

Algorithm 1 RGA
INPUT: DT - input dataset, minorityData - instances from the minority class, majorityData - instances from the majority class, parametersList - list of parameters' vectors
OUTPUT: DTbalanced - minority and majority class instances after preprocessing
1: Initial steps: Calculate the HVDM distance between each minority class example and instances from both classes. Save the values in the dist list. Calculate the average distance and save it in the avg_distance variable.
2: for param IN parametersList do
3:   Step I: Find the param.k nearest neighbours of each instance from minorityData. Use the values from the dist list. Save the results in granulesList.
4:   Step II: Verify the inclusion degree of each information granule. Use the set_labels method (Algorithm 2).
5:   Step III: Select the mode of processing based on the number of information granules labelled as BOUNDARY. Use the parameter param.complexity_threshold.
6:   while |synthetic_samples| < (|majorityData| − |minorityData|) ∗ param.cardinality_redundancy do
7:     if mode ≠ NOISE then
8:       Step IV: Generate a synthetic minority sample following the rules specified for the selected mode and add it to the synthetic_samples list.
9:     end if
10:  end while
11:  for instance IN synthetic_samples do
12:    Step V: Calculate the HVDM distance between instance and the examples from majorityData. If none of the distances is less than param.distance_threshold ∗ avg_distance, add instance to the result set DTbalanced.
13:  end for
14: end for

Algorithm 2 set_labels
1: for granule IN granulesList do
2:   if |granule.minorityData|/param.k > 1/2 then
3:     Label(granule) = SAFE
4:   else if 0 < |granule.minorityData|/param.k ≤ 1/2 then
5:     Label(granule) = BOUNDARY
6:   else if |granule.minorityData|/param.k == 0 then
7:     Label(granule) = NOISE
8:   end if
9: end for
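Example 1 can be checked mechanically by counting the neighbours' decision values. The helper below is our own, written only to verify the example:

```python
def granule_counts(neighbour_decisions):
    """Count minority (+) and majority (-) members of an information granule."""
    plus = sum(1 for c in neighbour_decisions if c == '+')
    return plus, len(neighbour_decisions) - plus

# Decisions d(x1), ..., d(x7) of the seven neighbours in Example 1
decisions = ['-', '+', '+', '-', '-', '+', '+']
plus, minus = granule_counts(decisions)
label = 'SAFE' if plus / 7 > 0.5 else ('NOISE' if plus == 0 else 'BOUNDARY')
# plus == 4, minus == 3, label == 'SAFE', matching Example 1
```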
After categorizing the information granules (instances from the minority class Xd=+), the mode of the oversampling algorithm is selected. Three methods are proposed to deal with various real-life data characteristics. The most appropriate selection mainly depends on the number of information granules labelled as BOUNDARY. Assuming that a certain threshold is one of the algorithm's parameters, the problem's complexity is defined based on this value and the number of granules recognized as BOUNDARY. The threshold indicates how many instances of the entire minority class should be placed in boundary regions to treat the problem as a complex one. Experiments described in [14] proved that if more than 30% of the positive examples overlapped with the negative class (borderline instances), classifier performance might be significantly degraded. However, several values of the threshold are applied and compared in this paper. Having fewer BOUNDARY entities, i.e.

card({x ∈ Xd=+ : Label(x) = BOUNDARY}) / card(Xd=+) < complexity_threshold,

means that the problem is not complex and the following method of creating new instances can be applied:

Definition 2. LowComplexity mode for obtaining the DTbalanced table from the DT table: DT ↦→LowComplexity DTbalanced
• Label(x) = SAFE: there is no need to significantly increase the number of instances in these safe areas. Only one new instance per existing minority SAFE instance is generated. Numeric attributes are handled by interpolation with one of the k nearest neighbours. For nominal features, a new sample has the same attribute values as the instance under consideration.
• Label(x) = BOUNDARY: most of the synthetic samples are generated in these borderline areas, since numerous majority class representatives may have a greater impact on classifier learning when there are not enough minority examples. Hence, many new examples are created closer to the instance x under consideration. One of the k nearest neighbours is chosen for each new sample when determining the value of a numeric feature. Values of nominal attributes are obtained by the majority vote of the k nearest neighbours' features.
• Label(x) = NOISE: no new samples are created.

On the other hand, the prevalence of BOUNDARY information granules, i.e.

card({x ∈ Xd=+ : Label(x) = BOUNDARY}) / card(Xd=+) ≥ complexity_threshold,

involves more complications during the learning process. Therefore, a dedicated approach (described below) is chosen:

Definition 3. HighComplexity mode for obtaining the DTbalanced table from the DT table: DT ↦→HighComplexity DTbalanced
Fig. 2. (a) High inclusion degree of NN5 (x) in Xd=+ (Label(x) = SAFE); (b) Low inclusion degree of NN5 (x) in Xd=+ (Label(x) = BOUNDARY ); (c) Intersection of information granules is equal to empty set: NN5 (x) ∩ Xd=+ = ∅ (Label(x) = NOISE).
• Label(x) = SAFE: assuming that these concentrated instances provide specific, easy-to-learn patterns that enable proper recognition of minority samples, plenty of new data is created by interpolation between a SAFE instance and one of its k nearest neighbours. Nominal attributes are determined by a majority vote of the k nearest neighbours' features.
• Label(x) = BOUNDARY: the number of instances is doubled by creating one new example halfway along the line segment between a BOUNDARY instance and one of its k nearest neighbours. For nominal attributes, the values describing the instance under consideration are replicated.
• Label(x) = NOISE: no new examples are created.

The last option is a special case, applied when no SAFE information granule is recognized, i.e. {x ∈ Xd=+ : Label(x) = SAFE} = ∅. In this situation the following method is used:

Definition 4.
noSAFE mode: DT ↦→_noSAFE DT_balanced
• Label(x) = BOUNDARY: all of the synthetic instances are created in the area surrounding class boundaries. This particular solution is selected in the case of an especially complex data distribution, which does not include any SAFE samples. Missing SAFE elements indicate that most of the examples are labelled as BOUNDARY (there are no homogeneous regions). Since only BOUNDARY and NOISE examples are available, only generating new instances in the neighbourhood of BOUNDARY instances would provide a sufficient number of minority samples.
• Label(x) = NOISE: no new instances are created. NOISE granules are completely excluded from the preprocessing phase, since their anchor instances are erroneous examples or outliers. Therefore, they should not be removed, but they also should not be taken into consideration when creating new synthetic instances, to avoid introducing more inconsistencies.
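All generation modes share one numeric-attribute mechanism: interpolation along the segment towards a chosen minority neighbour. The following minimal sketch illustrates this under stated assumptions (the gap ranges and the helper name `synthesize` are illustrative, not taken from the paper; nominal-attribute handling by vote or replication is omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(42)

def synthesize(x, minority_neighbours, label):
    """Create a synthetic numeric sample for one minority instance,
    following its granule label. Per-mode sample counts (one per SAFE
    instance, many per BOUNDARY instance, etc.) are left to the caller."""
    if label == "NOISE":
        return []                              # NOISE granules are skipped
    nb = minority_neighbours[rng.integers(len(minority_neighbours))]
    if label == "BOUNDARY":
        gap = rng.uniform(0.0, 0.5)            # stay close to x (assumed range)
    else:                                      # SAFE
        gap = rng.uniform(0.0, 1.0)            # anywhere on the segment
    return [x + gap * (nb - x)]
```

Keeping the gap below 0.5 for BOUNDARY instances mirrors the idea of placing new examples closer to the instance under consideration (and within half the distance to the neighbour in HighComplexity mode).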
4.3. Parameters evaluation

After generating minority instances, the correctness of the new dataset DT_balanced is examined. Only examples belonging to the lower approximation of the minority class are preserved. Let U denote a universe. The indiscernibility relation IND ⊆ U × U identifies instances that are described by the same information, considering certain attributes [24,25]. One of the main concepts based on this notion is the lower approximation of the set X (being any subset of U):

{x ∈ U : [x]_IND ⊆ X},    (7)
This is the set of all instances that can be certainly classified as members of X with respect to IND. This step assures that no improper samples are added to the original data and that the complexity is not increased. Since both qualitative and quantitative attributes are permissible, the extended rough set theory is applied. The standard indiscernibility relation is replaced by a weaker concept, namely a tolerance relation (a reflexive and symmetric relation) [24,26]. This modification requires establishing a threshold value, which provides the boundary between similar instances and the remaining ones. Determining the threshold (the distance_threshold parameter) is crucial for identifying the improper minority synthetic samples that are placed close to the majority class instances. The distance_threshold is the percentage of the average HVDM distance that indicates adverse similarity between instances representing different classes. All minority examples having any of the kNN distances smaller than the calculated threshold value are removed. Filtered positive instances are added to the original dataset and the classification process can be started.

Predicting the number of incorrectly generated samples is impossible. Therefore, the percentage of additional positive examples is specified by the cardinality_redundancy parameter, i.e.

additional_positive_examples = cardinality_redundancy · (N⁻ − N⁺),

where N⁻ and N⁺ denote the numbers of majority and minority class examples, respectively. Since there are several parameters that need to be tuned, a new approach to simplifying this process is proposed in this paper. Fig. 3 illustrates the parameters evaluation method.
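The editing phase can be sketched as follows. This is an illustrative reconstruction under stated assumptions: plain Euclidean distance replaces the HVDM metric, the threshold is taken as a fraction of the average synthetic-to-majority distance, and `filter_synthetic` is a hypothetical helper name:

```python
import numpy as np

def filter_synthetic(synthetic, X_majority, distance_threshold=0.15):
    """Editing phase: remove synthetic minority samples that are
    tolerance-similar to any majority instance, i.e. samples that would
    fall outside the lower approximation of the minority class.
    The threshold is a fraction of the average distance; plain Euclidean
    distance stands in for the HVDM metric used in the paper."""
    diffs = synthetic[:, None, :] - X_majority[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)     # shape (n_synthetic, n_majority)
    tau = distance_threshold * dists.mean()   # similarity (tolerance) threshold
    keep = (dists >= tau).all(axis=1)         # no majority instance within tau
    return synthetic[keep]
```

A synthetic sample surviving this filter has no majority instance inside its tolerance neighbourhood, so it can be certainly assigned to the minority class.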
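The parameter evaluation loop amounts to a grid search over candidate parameter vectors, with each resulting model scored by a quality measure. A minimal sketch, assuming the value grids reported in Section 5.2 and scoring by the AUC approximation of Eq. (14); the `evaluate` callback is a hypothetical stand-in for the whole preprocess-train-validate pipeline:

```python
from itertools import product

def auc(tp, fn, fp, tn):
    """AUC approximation from the confusion matrix, cf. Eq. (14)."""
    return (1 + tp / (tp + fn) - fp / (fp + tn)) / 2

# The 81 parameter vectors evaluated for RGA (values from Section 5.2).
grid = list(product([3, 5, 7],          # k nearest neighbours
                    [0.2, 0.3, 0.4],    # complexity_threshold
                    [10, 15, 25],       # distance_threshold (% of avg HVDM)
                    [20, 30, 40]))      # cardinality_redundancy (%)

def select_best(evaluate):
    """`evaluate` preprocesses the data with the given parameter vector,
    trains a classifier, and returns confusion-matrix counts
    (tp, fn, fp, tn) obtained from cross-validation."""
    return max(grid, key=lambda params: auc(*evaluate(params)))
```

Each parameter vector yields one preprocessed dataset and one model; the vector whose model attains the highest AUC is kept.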
Fig. 3. Evaluation of parameters.

Firstly, combinations of several possible parameters' values are obtained. Next, the preprocessing step is performed based on the distinct parameters provided. Classifiers Classifier_param1, ..., Classifier_paramn are built for each separate final dataset DT_param1, ..., DT_paramn, namely the original data enriched with newly generated and filtered positive samples (see Fig. 3). Each model is obtained by applying a specified set of parameters' vectors (param1, ..., paramn) to either the preprocessing method or the classifier, depending on the particular parameter destination. Computed models are then evaluated and the most effective one is selected.

5. Experimental study

This section describes the applied evaluation measures and the results of experiments.

5.1. Evaluation measures

The measures of classification quality are built from the confusion matrix CM (shown in Table 3), which records correctly and incorrectly classified instances for classes "+" and "−". As mentioned in the previous subsection, imbalanced data classification cannot be evaluated in the standard way. Since simply obtaining the number of correct predictions from all made predictions is sensitive to the data distribution, more sophisticated methods have been introduced.

Table 3
Confusion matrix CM for a two decision class problem.

                      Prediction "+"        Prediction "−"
Decision class "+"    True positive (TP)    False negative (FN)
Decision class "−"    False positive (FP)   True negative (TN)

Definition 5. Considering the confusion matrix CM, the following measures are frequently used in various problems [10,14]:

Accuracy = (TP + TN) / (TP + FN + FP + TN)    (8)

is the percentage of all correct predictions (both minority and majority class examples are considered),

TPrate = Recall = TP / (TP + FN)    (9)

is the proportion of instances from the decision class "+" correctly classified as belonging to the class "+",

TNrate = TN / (FP + TN)    (10)

is the proportion of instances from the decision class "−" correctly classified as belonging to the class "−",

FPrate = FP / (FP + TN)    (11)

is the proportion of instances from the decision class "−" misclassified as belonging to the class "+",

FNrate = FN / (TP + FN)    (12)

is the proportion of instances from the decision class "+" misclassified as belonging to the class "−",

Precision = TP / (TP + FP)    (13)

is the number of correct positive results divided by the number of all positive results,

AUC = (1 + TPrate − FPrate) / 2    (14)

is the area under the ROC curve. A Receiver Operating Characteristics (ROC) graph combines the FPrate on the X axis and the TPrate on the Y axis. Calculating the area under this curve provides a single measure for comparing the performance of classifiers based on the trade-off between benefits and costs. By defining four metrics that characterize the positive and negative classes separately (Eqs. (9)–(12)), it is possible to obtain one universal measure that reflects the classification results for both classes. The main example of such a method is the AUC measure, Eq. (14). Since the AUC measure enables us to determine which of the models is more effective on average (similar to the Wilcoxon statistic [10]), it is used in this paper to evaluate the results of experiments.

5.2. Experiments

The experimental study was performed on fifteen real-life datasets characterized by particularly complex distributions. They were obtained from the UCI repository [27]. Since many researchers use these datasets, it is more convenient to compare results between different methods. Characteristics of the evaluated datasets are presented in Table 4 below. It shows information
Table 4
Characteristics of evaluated datasets.

Dataset                          Instances  Attributes  IR     Boundary region
ecoli-0-1_vs_5                   240        6           11     Nonempty
ecoli-0-1_vs_2-3-5               244        7           9.17   Nonempty
ecoli-0-1-4-6_vs_5               280        6           13     Nonempty
ecoli-0-1-4-7_vs_5-6             332        6           12.28  Nonempty
ecoli-0-3-4-7_vs_5-6             257        7           9.28   Empty
ecoli-0-4-6_vs_5                 203        6           9.15   Nonempty
ecoli-0-6-7_vs_3-5               222        7           9.09   Nonempty
ecoli-0-2-6-7_vs_3-5             224        7           9.18   Nonempty
glass-0-1-6_vs_5                 184        9           19.44  Empty
glass-0-4_vs_5                   92         9           9.22   Empty
glass-0-6_vs_5                   108        9           11.00  Empty
glass5                           214        9           22.78  Empty
led7digit-0-2-4-5-6-7-8-9_vs_1   443        7           10.97  Nonempty
page-blocks-1-3_vs_4             472        10          15.86  Nonempty
yeast-0-3-5-9_vs_7-8             506        8           9.12   Nonempty
about the dataset name, the number of instances, the number of attributes, the imbalance ratio and a boundary region indicator. The last column informs about the existence of a boundary region, which increases the complexity of the problem, and thus highlights datasets that are especially prone to misclassification. To adapt the selected datasets to the two-class problem, instances from several classes were grouped into two classes. Datasets sharing the same base name but suffixed with different numbers correspond to different grouping methods. For example, the ecoli-0-1_vs_5 dataset combines examples belonging to classes cp (0) and im (1) as positive instances, while examples from the class om (5) are labelled as negative. Each dataset was partitioned to enable 5-fold stratified cross-validation. Table 5 shows the average results of the experimental study. The classifiers' performance was evaluated by applying the AUC measure (see Definition 5). The approach introduced in this paper was compared with six other methods that represent the group of data-level techniques. The results of classification without a preprocessing step are also presented. The RGA algorithm setup involved evaluating 81 combinations of parameter values. Four parameters were considered, based on the main algorithm assumption. The number of nearest neighbours k was set to 3, 5 and 7. The complexity_threshold was set to 0.2, 0.3 and 0.4. For the distance_threshold the following values were chosen: 10, 15 and 25. The cardinality_redundancy was set to 20, 30 and 40. The detailed classification results depending on the k and complexity_threshold parameters are shown in [28]. The experimental results were obtained with dedicated software. The project was implemented in the Java programming language utilizing several useful frameworks. The
Table 6
The summary of classification results for each algorithm: the number of absolute wins/shared wins.

        noPRE  SMOTE  S–ENN  Border–S  SafeL–S  S–RSB*  VISROT  RGA  Total
Test    0/4    0/1    0/1    0/1       0/0      2/2     0/4     7/5  10/5
Weka machine learning library was applied to perform data parsing and classification. The KEEL software tool was used to provide results for the related methods. The results revealed that the RGA technique obtains the highest AUC value in most cases and can be considered the best method. The proposed algorithm outperforms all other methods on seven of the datasets. For four datasets it has the joint best result together with some of the compared techniques. In the case of the three remaining datasets, two other methods are slightly more effective (Safe-Level SMOTE and SMOTE-RSB*). Table 6 shows the summary. It contains the total number of cases when each of the evaluated algorithms obtained the highest AUC value. Two values are provided for every method: the first counts absolute wins, while the second counts cases where more than one technique obtained the same best result. Table 7 presents the performance ranking of the algorithms for each dataset based on the AUC values. The proposed RGA method has very stable performance: twelve times in the first position, once in the second position and two times in the third position. The results of the Friedman test are shown in Table 8. According to these ranks, the RGA method proves to be the best of the analysed techniques. The second is the SMOTE-RSB* method, and VISROT is third. The SMOTE_ENN algorithm has the worst performance (apart from classification without a preprocessing step). Since the datasets used in the described experiments suffer from the mentioned difficulty factors (for detailed studies see [2]) and the proposed algorithm obtained promising results, we can conclude that RGA correctly handles complex data distributions. The strategy of selective preprocessing based on information granules, combined with a filtering technique, can be considered a promising direction for the development of data preprocessing algorithms.
Additionally, more comprehensive studies of the parameters' impact on the results were performed. Figs. 4–6 present the changes of AUC depending on particular parameter values for three selected datasets. These three datasets were chosen because they show the most significant differences in the obtained results; the remaining box plots are not included in this article. The experiments revealed that the number of nearest neighbours is the most influential parameter: it had a significant impact
Table 5
Classification results for the selected UCI datasets — comparison of the proposed algorithm Rough–Granular Approach (RGA) with six other techniques and classification without a preprocessing step (noPRE).

Dataset                  noPRE   SMOTE   S–ENN   Border–S  SafeL–S  S–RSB*  VISROT  RGA
ecoli-0-1_vs_5           0.8159  0.7977  0.8250  0.8318    0.8568   0.7818  0.8636  0.9136
ecoli-0-1_vs_2-3-5       0.7136  0.8377  0.8332  0.7377    0.7550   0.7777  0.7314  0.8513
ecoli-0-1-4-6_vs_5       0.7885  0.8981  0.8981  0.7558    0.8519   0.8231  0.8366  0.8635
ecoli-0-1-4-7_vs_5-6     0.8318  0.8592  0.8424  0.8420    0.8197   0.8670  0.8220  0.8918
ecoli-0-3-4-7_vs_5-6     0.7757  0.8568  0.8546  0.8427    0.7995   0.8984  0.8471  0.9092
ecoli-0-4-6_vs_5         0.8168  0.8701  0.8869  0.8615    0.8923   0.9476  0.8060  0.9141
ecoli-0-6-7_vs_3-5       0.8250  0.8500  0.8125  0.8550    0.7950   0.8525  0.8500  0.8750
ecoli-0-2-6-7_vs_3-5     0.7752  0.8155  0.8179  0.8352    0.8380   0.8227  0.7977  0.8452
glass-0-1-6_vs_5         0.8943  0.8129  0.8743  0.8386    0.8429   0.8800  0.8943  0.8943
glass-0-4_vs_5           0.9941  0.9816  0.9754  0.9941    0.9261   0.9941  0.9941  0.9941
glass-0-6_vs_5           0.9950  0.9147  0.9647  0.9950    0.9137   0.9650  0.9950  0.9950
glass5                   0.8976  0.8829  0.7756  0.8854    0.8939   0.9232  0.9951  0.9951
led7digit02456789_vs_1   0.8788  0.8908  0.8379  0.8908    0.9023   0.9019  0.8918  0.9056
page-blocks-1-3_vs_4     0.9978  0.9955  0.9888  0.9978    0.9831   0.9978  0.9944  0.9978
yeast-0-3-5-9_vs_7-8     0.5868  0.7047  0.7024  0.6228    0.7296   0.7400  0.6868  0.7281
Table 7
Algorithms performance ranking used in the Friedman test (based on the AUC values).

Dataset                  1        2          3         4         5        6          7         8
ecoli-0-1_vs_5           RGA      VISROT     SafeL–S   Border–S  S–ENN    noPRE      SMOTE     S–RSB
ecoli-0-1_vs_2-3-5       RGA      SMOTE      S–ENN     S–RSB     SafeL–S  Border–S   VISROT    noPRE
ecoli-0-1-4-6_vs_5       SMOTE*   S–ENN*     RGA       SafeL–S   VISROT   S–RSB      noPRE     Border–S
ecoli-0-1-4-7_vs_5-6     RGA      S–RSB      SMOTE     S–ENN     Border–S noPRE      VISROT    SafeL–S
ecoli-0-3-4-7_vs_5-6     RGA      S–RSB      SMOTE     S–ENN     VISROT   Border–S   SafeL–S   noPRE
ecoli-0-4-6_vs_5         S–RSB    SafeL–S    RGA       S–ENN     SMOTE    Border–S   noPRE     VISROT
ecoli-0-6-7_vs_3-5       RGA      Border–S   S–RSB     SMOTE*    VISROT*  noPRE      S–ENN     SafeL–S
ecoli-0-2-6-7_vs_3-5     RGA      SafeL–S    Border–S  S–RSB     S–ENN    SMOTE      VISROT    noPRE
glass-0-1-6_vs_5         noPRE*   VISROT*    RGA*      S–RSB     S–ENN    SafeL–S    Border–S  SMOTE
glass-0-4_vs_5           noPRE*   Border–S*  S–RSB*    VISROT*   RGA*     SMOTE      S–ENN     SafeL–S
glass-0-6_vs_5           noPRE*   Border–S*  VISROT*   RGA*      S–RSB    S–ENN      SMOTE     SafeL–S
glass5                   RGA*     VISROT*    S–RSB     noPRE     SafeL–S  Border–S   SMOTE     S–ENN
led7digit02456789_vs_1   RGA      SafeL–S    S–RSB     VISROT    SMOTE*   Border–S*  noPRE     S–ENN
page-blocks-1-3_vs_4     noPRE*   Border–S*  S–RSB*    RGA*      SMOTE    VISROT     S–ENN     SafeL–S
yeast-0-3-5-9_vs_7-8     S–RSB    SafeL–S    RGA       SMOTE     S–ENN    VISROT     Border–S  noPRE

*Denotes methods that obtain the same result.

Table 8
Friedman test ranks.

Algorithm  Mean rank
RGA        1.83
S–RSB*     3.43
VISROT     4.70
Border–S   4.90
SMOTE      4.97
SafeL–S    5.20
S–ENN      5.30
noPRE      5.67
on results for most of the examined datasets. Therefore, this parameter especially requires careful tuning. The first dataset, namely yeast-0-3-5-9_vs_7-8 (Fig. 4), obtained the best results for the highest evaluated value of the k parameter. The lowest value of the nearest neighbours was evidently a disadvantageous choice: in the worst case it caused a reduction of AUC by almost 8%. The ecoli-0-4-6_vs_5 dataset (Fig. 5) also had the highest AUC for k equal to 7. Surprisingly, the value of 5 nearest neighbours, preferred in many applications, gave the worst results. The last dataset (ecoli-0-1_vs_5, Fig. 6) also required higher values of the k parameter. While 3 was insufficient, both 5 and 7 nearest neighbours gave similarly better results. For the ecoli-0-1_vs_5 dataset, which was preprocessed in HighComplexity mode, the distance threshold becomes the most important parameter: the higher the distance_threshold value, the higher the AUC value obtained. The complexity threshold should also be considered influential in this case. The rest of the analysed datasets show the opposite tendency: their results measured by the AUC are better for lower values of the distance_threshold. This means that these two remaining datasets probably have more complex distributions, where instances representing the positive and negative classes strongly overlap. Since examples from different classes are very similar, more restrictive rules must be applied when defining the boundary region in the tolerance relation, in order to identify and remove any inconsistencies. The remaining two parameters, namely complexity_threshold and cardinality_redundancy, did not demonstrate any relevant impact on the results. Their values are presumably dependent on the other parameters. For two datasets, namely ecoli0146_vs_5 and led7digit02456789_vs_1, the occurrence of inconsistent synthetic samples was reported. This means that some of the generated minority class examples belonged to the boundary region defined by the rough set notions. All of these samples, treated as inconsistent, were removed before classifier learning. Fig. 7 shows that the classification results, interpreted in terms of the AUC measure, are better for a low number of inconsistent samples. However, the results are not particularly poor and they do not vary significantly, since the algorithm creates redundant synthetic samples to properly handle this case. Only the ecoli0146_vs_5 dataset demonstrates a considerable depletion of classifier performance when the cardinality redundancy parameter has a lower value and not enough minority class samples are created. This confirms the assumption that more synthetic samples should be created to provide a better quality of data in the resulting dataset.

6. Conclusions
A new rough–granular computing approach to imbalanced data was introduced in this paper. The main aims of the proposed solution are:
• considering additional factors that affect classifier performance (namely small disjuncts, class overlapping and the presence of noise) by involving rough–granular computing;
• handling the problem of the proper selection of the algorithm's parameters.

The results of experiments presented in this paper confirm that combining oversampling and filtering techniques is an effective approach. The formation of information granules and the analysis of their inclusion degrees, incorporated into the algorithm, enable proper handling of even highly imbalanced and complex data distributions. Enhanced classification results prove that taking into account the local characteristics of examples when generating new minority samples is an advisable method. However, the adjustment of the algorithm to particular problems requires careful tuning of its parameters. Therefore, multiple combinations of parameter values were examined by applying a dedicated evaluation technique. The experimental study revealed that varying the number of nearest neighbours (k) had the highest impact on the classification results achieved on the evaluated datasets. There were also cases that proved the importance of other parameters: the distance threshold and cardinality redundancy were notably influential.

In further research we plan to develop methods for a deeper analysis of the risk related to the predicted decisions. This should lead, in particular, to the development of adaptive strategies for tuning the parameters specified in the paper; changes in the properties of the analysed datasets should also be taken into account. We also plan to investigate the role of the Akaike information criterion (AIC) [29] in tuning the models developed in the paper, in particular how changes of the Akaike criterion relative to the number of parameters may influence the quality of the developed data models.
Fig. 4. Results for yeast-0-3-5-9_vs_7-8 dataset.
Fig. 5. Results for ecoli-0-4-6_vs_5 dataset.
Fig. 6. Results for ecoli-0-1_vs_5 dataset.
Fig. 7. The impact of number of inconsistent samples on the classification results for two datasets (ecoli0146_vs_5 and led7digit02456789_vs_1).
Acknowledgements The work was supported by the grant S/WI/1/2018 from Bialystok University of Technology and funded with resources for research by the Ministry of Science and Higher Education in Poland. Declaration of competing interest No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have
impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.asoc.2019.105607. References [1] M. Galar, A. Fernández, E. Barrenechea, H. Bustince, F. Herrera, A review on ensembles for the class imbalance problem: Bagging-, boosting-, and hybrid-based approaches, IEEE Trans. Syst. Man Cybern. C (Appl. Rev.) 42 (4) (2012) 463–484, http://dx.doi.org/10.1109/TSMCC.2011.2161285. [2] K. Napierala, J. Stefanowski, Types of minority class examples and their influence on learning classifiers from imbalanced data, J. Intell. Inf. Syst. 46 (3) (2016) 563–597, http://dx.doi.org/10.1007/s10844-015-0368-1. [3] J. Stefanowski, Dealing with data difficulty factors while learning from imbalanced data, in: Challenges in Computational Statistics and Data
Mining, 2016, pp. 333–363, http://dx.doi.org/10.1007/978-3-319-18781-5_17.
[4] K. Napierala, J. Stefanowski, S. Wilk, Learning from imbalanced data in presence of noisy and borderline examples, in: Proceedings of the 7th International Conference on Rough Sets and Current Trends in Computing, Springer-Verlag, 2010, pp. 158–167, http://dx.doi.org/10.1007/978-3-642-13529-3_18.
[5] T. Jo, N. Japkowicz, Class imbalances versus small disjuncts, SIGKDD Explor. Newsl. 6 (1) (2004) 40–49, http://dx.doi.org/10.1145/1007730.1007737.
[6] G. Weiss, Mining with rarity: A unifying framework, SIGKDD Explor. Newsl. 6 (1) (2004) 7–19.
[7] B. Krawczyk, Learning from imbalanced data: Open challenges and future directions, Prog. Artif. Intell. 5 (4) (2016) 221–232.
[8] S. Garcia, J. Luengo, F. Herrera, Data Preprocessing in Data Mining, vol. 72, Springer International Publishing, 2015, http://dx.doi.org/10.1007/978-3-319-10247-4.
[9] N. Chawla, K.W. Bowyer, L. Hall, W. Kegelmeyer, SMOTE: Synthetic minority over-sampling technique, J. Artif. Int. Res. 16 (1) (2002) 321–357, http://dx.doi.org/10.1613/jair.953.
[10] Y. Sun, M. Kamel, A. Wong, Y. Wang, Cost-sensitive boosting for classification of imbalanced data, Pattern Recognit. 40 (12) (2007) 3358–3378.
[11] J. Sáez, J. Luengo, J. Stefanowski, F. Herrera, SMOTE–IPF: Addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering, Inform. Sci. 291 (2015) 184–203, http://dx.doi.org/10.1016/j.ins.2014.08.051.
[12] Z. Pawlak, Rough sets, Int. J. Comput. Inf. Sci. 11 (5) (1982) 341–356, http://dx.doi.org/10.1007/BF01001956.
[13] Z. Pawlak, A. Skowron, Rough sets: Some extensions, Inform. Sci. 177 (1) (2007) 28–40, http://dx.doi.org/10.1016/j.ins.2006.06.006.
[14] H. He, E. Garcia, Learning from imbalanced data, IEEE Trans. Knowl. Data Eng. 21 (9) (2009) 1263–1284, http://dx.doi.org/10.1109/TKDE.2008.239.
[15] V. López, A. Fernández, S. García, V. Palade, F. Herrera, An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics, Inform. Sci. 250 (2013) 113–141.
[16] H. Han, W.-Y. Wang, B.-H. Mao, Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning, in: Advances in Intelligent Computing: International Conference on Intelligent Computing, ICIC 2005, Hefei, China, August 23-26, 2005, Proceedings, Part I, Springer Berlin Heidelberg, 2005, pp. 878–887, http://dx.doi.org/10.1007/11538059_91.
[17] C. Bunkhumpornpat, K. Sinapiromsaran, C. Lursinsap, Safe-level-SMOTE: Safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem, in: Proceedings of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD '09, Springer-Verlag, 2009, pp. 475–482, http://dx.doi.org/10.1007/978-3-642-01307-2_43.
[18] E. Ramentol, Y. Caballero, R. Bello, F. Herrera, SMOTE-RSB*: A hybrid preprocessing approach based on oversampling and undersampling for high imbalanced data-sets using SMOTE and rough sets theory, Knowl. Inf. Syst. 33 (2) (2012) 245–265, http://dx.doi.org/10.1007/s10115-011-0465-6.
[19] G. Batista, R. Prati, M. Monard, A study of the behavior of several methods for balancing machine learning training data, SIGKDD Explor. Newsl. 6 (1) (2004) 20–29, http://dx.doi.org/10.1145/1007730.1007735.
[20] K. Borowska, J. Stepaniuk, Rough sets in imbalanced data problem: Improving re-sampling process, in: Computer Information Systems and Industrial Management: 16th IFIP TC8 International Conference, CISIM 2017, Bialystok, Poland, June 16-18, 2017, Proceedings, Springer International Publishing, 2017, pp. 459–469, http://dx.doi.org/10.1007/978-3-319-59105-6_39.
[21] X. Zhu, W. Pedrycz, Granular Under-Sampling for Processing Imbalanced Data, IEEE, 2017, in print.
[22] A. Fernández, F. Herrera, N. Chawla, SMOTE for learning from imbalanced data: Progress and challenges, marking the 15-year anniversary, J. Artif. Intell. Res. 61 (2018) 863–905.
[23] W. Pedrycz, Granular computing: An introduction, in: Future Directions for Intelligent Systems and Information Sciences, Physica-Verlag HD, 2000, pp. 309–328, http://dx.doi.org/10.1007/978-3-7908-1856-7.
[24] J. Stepaniuk, Rough–Granular Computing in Knowledge Discovery and Data Mining, vol. 152, Springer Berlin Heidelberg, 2008, http://dx.doi.org/10.1007/978-3-540-70801-8.
[25] Z. Pawlak, A. Skowron, Rudiments of rough sets, Inform. Sci. 177 (1) (2007) 3–27, http://dx.doi.org/10.1016/j.ins.2006.06.003.
[26] A. Skowron, J. Stepaniuk, Tolerance approximation spaces, Fundam. Inf. 27 (2–3) (1996) 245–253.
[27] UCI machine learning repository, 2018, http://archive.is.uci.edu/ml/. (Accessed 22 October 2018).
[28] K. Borowska, J. Stepaniuk, Granular computing and parameters tuning in imbalanced data preprocessing, Comput. Inf. Syst. Ind. Manag. CISIM 2018 (2018) 233–245, http://dx.doi.org/10.1007/978-3-319-99954-8_20.
[29] H. Akaike, A new look at the statistical model identification, IEEE Trans. Autom. Control 19 (6) (1974) 716–723, http://dx.doi.org/10.1109/TAC.1974.1100705.