Ranking-based instance selection for pattern classification


Journal Pre-proof

George D.C. Cavalcanti, Rodolfo J.O. Soares

PII: S0957-4174(20)30094-4
DOI: https://doi.org/10.1016/j.eswa.2020.113269
Reference: ESWA 113269

To appear in: Expert Systems With Applications

Received date: 16 May 2019
Revised date: 6 January 2020
Accepted date: 31 January 2020

Please cite this article as: George D.C. Cavalcanti, Rodolfo J.O. Soares, Ranking-based Instance Selection for Pattern Classification, Expert Systems With Applications (2020), doi: https://doi.org/10.1016/j.eswa.2020.113269

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain. © 2020 Published by Elsevier Ltd.

Highlights

• Three novel algorithms of the RIS family are proposed.
• RIS employs a ranking that selects the best instances in terms of classification.
• The ranking implies that borderline and noisy instances have low priority.
• RIS obtains promising accuracy and reduction rates when compared with the literature.


Ranking-based Instance Selection for Pattern Classification

George D. C. Cavalcanti*, Rodolfo J. O. Soares

Universidade Federal de Pernambuco (UFPE), Centro de Informática (CIn), Av. Jornalista Anibal Fernandes s/n, Cidade Universitária 50740-560, Recife, PE, Brazil

Abstract

In instance-based learning algorithms, the need to store a large number of examples as the training set results in several drawbacks related to large memory requirements, oversensitivity to noise, and slow execution speed. Instance selection techniques can improve the performance of these algorithms by selecting the best instances from the original data set, removing, for example, redundant information and noisy points. The relationship between an instance and the other patterns in the training set plays an important role and can impact its misclassification by learning algorithms. Such a relationship can be represented as a value that measures how difficult such an instance is for classification purposes. Based on that, we introduce a novel instance selection algorithm called Ranking-based Instance Selection (RIS) that attributes a score per instance that depends on its relationship with all other instances in the training set. In this sense, instances with higher scores form safe regions (neighborhoods of samples with relatively homogeneous class labels) in the feature space, and instances with lower scores form indecision regions (borderline samples of different classes). This information is further used in a selection process to remove instances from both safe and indecision regions that are considered irrelevant to represent their clusters in the feature space. In contrast to previous algorithms, the proposal combines a ranking procedure with a selection process aiming to find a promising tradeoff

*Corresponding author. Address: Centro de Informática (CIn), Universidade Federal de Pernambuco (UFPE). Av. Jornalista Anibal Fernandes. Cidade Universitária 50740-560 - Recife, PE - Brazil. Tel.: +55 81 2126-8430 Ext. 4346 Fax: +55 81 2126-8438. Email addresses: [email protected] (George D. C. Cavalcanti), [email protected] (Rodolfo J. O. Soares)

Preprint submitted to Expert Systems with Applications

February 7, 2020

between accuracy and reduction rate. Experiments are conducted on twenty-four real-world classification problems and show the effectiveness of the RIS algorithm when compared against other instance selection algorithms in the literature.

Keywords: instance selection, ranking, instance-based learning, k-nearest neighbor, classification


1. Introduction

Instance-based learning algorithms are classifiers that have no learning phase. They use all training set instances as exemplars when generalizing. The k-Nearest Neighbor (k-NN) algorithm (Cover and Hart, 1967) is a well-known instance-based learning classifier. k-NN computes the distance from each input instance to every stored example and labels it according to the class labels of its k nearest neighbors, which are its most similar patterns in the training set. The use of the whole training set confronts k-NN with several drawbacks, such as deciding how many exemplars to store and what portion of the input space they should cover. Excessive storage can result in high memory usage and a significant increase in computational time, since similarities between each test sample and all training examples must be computed. Moreover, k-NN degrades rapidly with the level of attribute noise in training instances (Aha et al., 1991).

Instance selection is one of the most effective approaches to increase the performance of instance-based classifiers (Garcia et al., 2012). It is applied as a pre-processing step on the training set. This kind of technique reduces the data used for establishing a classification rule by selecting relevant prototypes. With the reduction of the stored examples, the memory requirements and execution time for generalization decrease as well. A successful instance selection algorithm searches for the smallest subset of the original data set that maintains or even improves the classification accuracy of instance-based algorithms. As the size of data sets grows, algorithms that can present a shortlist of representative samples selected from the whole training data are of great importance.

In this work, we propose an instance selection algorithm called Ranking-based Instance Selection (RIS). RIS calculates a score per instance, and this
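The k-NN rule described in the paragraph above can be written in a few lines. This is our own minimal illustration (Euclidean distance, majority vote), not code from the paper:

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest stored examples."""
    # Every prediction scans the whole training set: this is the storage and
    # speed drawback that instance selection aims to mitigate.
    neighbors = sorted(zip(train_X, train_y),
                       key=lambda xy: math.dist(xy[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```

Since each query requires distances to all stored examples, both memory and prediction time grow with the training set size, motivating the reduction techniques discussed next.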


score is used to evaluate whether the instance should be included in the final subset of prototypes. RIS is composed of two phases: ranking and selection. In the first phase, RIS starts by assigning a score to each instance in the training set that represents how vital the instance is for further generalization. This ranking procedure aims at evaluating the relationship among the instances in the training set to identify which instances can be removed for representing irrelevant information or noisy points. The selection process is performed next, where instances with higher scores are selected first. In other words, the knowledge gathered by the ranking procedure can be applied to choose a subset of the most critical instances from the training set and, consequently, improve the performance of instance-based learning algorithms. To the best of our knowledge, this procedure that combines the ranking calculation with the selection process has not been employed previously.

RIS takes advantage of the fact that the relationship of an instance to other patterns in the training set impacts its misclassification by learning algorithms. So, RIS calculates a score per instance that is determined by its relationship to the other instances, and this score represents a soft decision: the lower the score, the closer the instance is to the border. In other words, these scores create a heat map that shows safe regions (neighborhoods of samples with relatively homogeneous class labels) in the feature space, as well as indecision regions (borderline samples of different classes). The selection phase of RIS uses this information to choose a subset of the original training set regardless of whether these samples are borderline or belong to a homogeneous region.

This is an interesting procedure because the principal contributor to increasing the likelihood of an instance being misclassified is the occurrence of outliers and border points in class-overlapping regions of the feature space (Smith et al., 2014). In this way, integrating into the learning process the knowledge of which instances are hard to classify (the ones that have lower scores) can lead to an increase in classification accuracy.

The major contribution of this paper is the proposal of a family of instance selection methods, called Ranking-based Instance Selection (RIS), composed of three algorithms, RIS1, RIS2, and RIS3, where each algorithm offers a different compromise between accuracy and reduction rate of the training set. RIS1 has the best accuracy, while RIS3 reduces the training set the most. Contrary to previous works, RIS incorporates in the selection process a score per instance (a real value instead of a binary decision) that can be interpreted as a measure indicating how difficult it is to classify that instance. This measure proved to be an interesting strategy through a set


of comprehensive experiments on 24 datasets of the KEEL repository. We show that the proposed method can considerably reduce the original size of the training set while improving the recognition rates. The results reached by our method compare favorably to other published methods.

This paper extends the work in (Pereira and Cavalcanti, 2011b) in many ways: i) novel algorithms of the RIS family are proposed; ii) a deeper explanation of the proposed approaches is presented; iii) a detailed analysis using a toy example is conducted to offer a better understanding of the proposed approach; iv) the number of real datasets used to evaluate the proposed approach is increased; v) a more rigorous statistical comparison is performed.

This work is organized as follows. Section 2 describes the basic concepts of the instance reduction research area, in particular instance selection, and also presents how instance selection has been applied to different areas of machine learning. The proposed methods are described in Section 3, along with an illustrative example of the algorithm. Experimental studies of the proposed method on public machine learning datasets are conducted and analyzed in Section 4. Finally, in Section 5, experimental results are summarized, and future work is suggested.


2. Reduction techniques



Instance reduction algorithms have two main purposes: reduce or eliminate noise and outliers, and reduce the computational burden of instance-based learning algorithms. Regardless of the main objective, the training set has its number of instances reduced. Instance reduction techniques rely on two different approaches: instance selection (Garcia et al., 2012) and prototype generation (Triguero et al., 2012). Instance selection aims at finding the best subset of instances of the training dataset. In contrast, prototype generation creates new instances, called prototypes, to represent the original dataset of instances. The main advantage of selection algorithms is their capacity to choose relevant samples without generating new artificial data since, for some applications, the generation of new data is not a sound procedure. Moreover, in general terms, selection algorithms are more straightforward than generation ones. Section 2.1 recalls concepts about instance selection methods, and Section 2.2 reviews works that apply instance selection to different areas of machine learning, such as imbalanced learning and ensemble learning.




2.1. Instance selection

Instance selection methods, also known as prototype selection (PS), can be classified according to the type of selection into Condensation, Edition, and Hybrid (Kim and Oommen, 2003). The Edition algorithms aim at cleaning the noisy instances and outliers without any commitment to obtaining a high reduction rate of the training dataset. On the other hand, Condensation focuses on reducing the dataset, maintaining borderline instances, and removing instances that are far away from the frontiers among the classes. The Hybrid algorithms aim to combine the best of the two previously mentioned classes of algorithms. To do so, they select the most representative instances regardless of whether the instances are at the border or positioned in safer regions of the feature space.

From the point of view of the direction of search, Garcia et al. (2012) classified the PS techniques into Incremental, Decremental, Batch, Mixed, and Fixed. Incremental starts with an empty subset, and an instance that satisfies a given criterion is added to the subset. In contrast, Decremental begins with the whole training set and iteratively removes instances from it. In the Batch mode, the first task is to mark the instances that should be removed; afterwards, all the marked instances are removed at once. Another strategy is to start with a preselected subset and, at each iteration, add or remove instances from this subset; this one is called Mixed. Lastly, the Fixed search is a subfamily of the Mixed search where the number of instances is predefined at the beginning of the search and never changes.

Prototype selection algorithms have been employed in many different applications, such as traffic sign recognition (Chen et al., 2015), handwritten connected digits classification (Pereira and Cavalcanti, 2011a), financial data mining (jae Kim, 2006), and data summarization (Smith-Miles and Islam, 2010).

2.2. Instance selection applied to different areas

In the literature, there is a myriad of methods that aim to select the most relevant prototypes from a training set. These proposals appear in different machine learning areas, such as imbalanced learning, one-class classification, regression, big data, and ensemble learning.

Instead of using classical under/oversampling techniques to address the class imbalance problem, Tsai et al. (2019) presented an approach that combines clustering analysis and instance selection techniques. In the same vein, Kuncheva et al. (2019) performed a theoretical evaluation on instance


selection for imbalanced data using the geometric mean as the performance measure. Bien and Tibshirani (2011) proposed a prototype selection algorithm that attempts to present a shortlist of representative samples to increase its interpretative value. In other words, their work used prototypes as a tool to make the data easier for humans to understand.

The majority of PS algorithms were developed to deal with multi-class classification problems. Krawczyk et al. (2019) proposed PS algorithms that work in scenarios in which we do not have access to counterexamples during the training phase. In other words, instead of dealing with multiple classes, these algorithms focus on one-class classification. In contrast to existing methods that use PS for classification, Arnaiz-González et al. (2016) proposed instance selection methods for regression tasks, where the aim is to predict continuous values instead of one class. Arnaiz-González et al. also adapted instance selection algorithms for multi-label learning (Arnaiz-González et al., 2018).

Standard PS methods cannot cope with huge data sets. To fill this gap, Triguero et al. (2015) proposed a partitioning methodology that uses a MapReduce-based framework to distribute the functioning of the PS algorithms through a cluster. Ensemble learning aims at improving the precision of a classification task by combining classifiers. Cruz et al. (2017, 2018b, 2019) showed that PS methods can improve the accuracy of dynamic ensemble selection techniques (Cruz et al., 2018a), a promising multiple classifier approach in which the base classifiers are selected on the fly, according to each new test sample to be classified.

For more information, some reviews of instance selection can be found in the literature (Garcia et al., 2012; Kim and Oommen, 2003; Wilson and Martinez, 2000).


3. Ranking-based Instance Selection Algorithm



The Ranking-based Instance Selection (RIS) is a hybrid instance selection technique that can be divided into two phases: Ranking and Selection. During the Ranking phase, RIS associates a score to each instance in the training set. This score represents the importance of each instance for classification purposes; the higher the score, the more important the instance. After the



construction of the ranking, the Selection phase keeps the highest-ranked instances. For the sake of simplicity, the notation used in this paper is listed as follows:

X: the training dataset.
C: the classes of the problem.
m: the number of instances in X.
x_i: the i-th instance in X.
R: the result set after the selection process.
S: the score vector that represents the ranking of the instances.
s_i: the score of x_i, s_i ∈ S.
t: the threshold parameter.
I: the indexes vector.
class(x): the class label of x.
min(V): the minimum value in vector V.
max(V): the maximum value in vector V.
d(x, y): the distance measure between the instances x and y.
argmin_v f(v): the value of v that minimizes f.

Sections 3.1 and 3.2 describe the two main phases of the Ranking-based Instance Selection (RIS), Ranking and Selection, respectively. To illustrate the RIS algorithm, we present a toy problem in Section 3.3. Sections 3.4 and 3.5 present two variations of the proposed method, RIS2 and RIS3, respectively. Finally, Section 3.6 describes the classification rule adopted by the RIS algorithms.



3.1. Ranking

Each instance has a score associated with it in such a way that an instance surrounded by patterns of its own class has a higher score than borderline and noisy instances, which are considered less important at first sight. The ranking construction works as follows: the score s_i is calculated for each instance x_i ∈ X using Equation 1, where m is the number of instances in the training set X.

    s_i = \sum_{j=1,\, j \neq i}^{m} \alpha(x_i, x_j) \times sm(x_i, x_j, X)    (1)

For the calculation of the score s_i of the instance x_i, all the patterns x_j ∈ X − {x_i} are taken into consideration. So, for each x_j ∈ X − {x_i}, the functions α(·) (Equation 2) and sm(·) (Equation 3) are calculated.

    \alpha(x_i, x_j) = \begin{cases} 1, & class(x_i) = class(x_j) \\ -1, & \text{otherwise} \end{cases}    (2)

    sm(x_i, x_j, X) = \frac{\exp(-d(x_i, x_j))}{\sum_{k=1,\, k \neq i}^{m} \exp(-d(x_i, x_k))}    (3)

The function α defines the sign of the adjustment. In other words, if x_i and x_j have the same class label, the score s_i of x_i is increased by the return value of the sm function (Equation 3); otherwise, this value is subtracted from the score. The function sm calculates the absolute value of the adjustment of the score of x_i given x_j. This value is given by the normalized exponential transform of the distance between x_i and x_j. This transform is also known as softmax (Bridle, 1990); it preserves the hierarchical order of the input values and is a differentiable generalization of the "winner-takes-all" rule (Haykin, 1999). This strategy ensures that smaller distances imply higher score adjustments, which is sound given that the influence of a pattern that is closer to x_i should be higher than that of patterns far away. After the evaluation of the whole set of instances in X, the scores s_i are normalized into the [0, 1] interval using Equation 4,


    scaling(s_i, S) = \frac{s_i - \min(S)}{\max(S) - \min(S)}    (4)

where min(S) and max(S) represent the minimum and the maximum values in S, respectively.

The scores of two artificial datasets (Circles and Banana – Figures 1(a) and (b), respectively) were calculated by Equation 1 and are shown using two different plots: surface and contour. Note that, to improve the visual effect, each score was inverted to 1 − score. So, in the surface plots (Figures 1(c) and (d)), the higher the inverted score of an instance, the higher the elevation level. Figures 1(e) and (f) show a 2D view of the surface, where it is possible to note that the regions defined by the instances having the highest scores preserve the original shapes of the datasets.
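As an illustration of Equations 1-4, the ranking can be prototyped in a few lines. This is our own sketch, not the authors' implementation; `X` is assumed to be a list of (point, label) pairs and `d` the Euclidean distance:

```python
import math

def rank_scores(X):
    """Per-instance score (Equation 1) min-max scaled into [0, 1] (Equation 4)."""
    raw = []
    for i, (xi, ci) in enumerate(X):
        # Denominator of Equation 3: softmax normalization over all x_k, k != i.
        denom = sum(math.exp(-math.dist(xi, xk))
                    for k, (xk, _) in enumerate(X) if k != i)
        # Equation 1: same-class neighbors add their similarity (alpha = +1),
        # enemies subtract it (alpha = -1).
        raw.append(sum((1 if cj == ci else -1) *
                       math.exp(-math.dist(xi, xj)) / denom
                       for j, (xj, cj) in enumerate(X) if j != i))
    lo, hi = min(raw), max(raw)
    return [(s - lo) / (hi - lo) for s in raw]  # Equation 4
```

On a toy set with one mislabeled point placed inside the opposite class, that point receives the lowest score (0 after scaling), matching the heat-map intuition described above.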


[Figure 1 appears here. Panels: (a) Circles data; (b) Banana data; (c) surface graph of the scores of the Circles data; (d) surface graph of the scores of the Banana data; (e) contour lines of the scores of the Circles data; (f) contour lines of the scores of the Banana data.]

Figure 1: Surface and contour graphs of the ranking.



3.2. Selection

The instance selection phase aims at removing redundant and undesirable instances. Two definitions, radius and relevant instance, which are used by the proposed algorithm to select the best instances, are given below.

Definition 1. The radius of an instance x, radius(x), is given by the radius of the largest hypersphere centered at x containing only instances having the same class as x.

The radius of an instance, as described in Definition 1, can be obtained by calculating the distance between the instance x_i and its nearest enemy. An enemy is defined as any instance having a different class from x_i.

Definition 2. An instance x_i ∈ X is considered relevant in the RIS algorithm if there does not exist x_r ∈ X such that: i) class(x_r) = class(x_i), and ii) d(x_i, x_r) ≤ radius(x_r); i ≠ r.

Definition 2 can be summarized in one function (isRelevant – Equation 5) that is used to identify relevant instances. If x_i is relevant, the function returns true; otherwise, it returns false. In other words, every instance that belongs to the coverage area (defined by the radius) of another instance of the same class is not relevant and should be removed.

    isRelevant(x_i, R) = \begin{cases} true, & \nexists\, x_r \in R \mid class(x_r) = class(x_i) \wedge d(x_i, x_r) \leq radius(x_r) \\ false, & \text{otherwise} \end{cases}    (5)

The selection procedure starts by selecting the instance x_i ∈ X that has the highest score s_i ∈ S. This instance is added to the result set R. Afterwards, the instance x_j with the second highest score is evaluated. If class(x_i) and class(x_j) are different, x_j is inserted into R. Otherwise, x_j is only selected if it is not located in the space covered by the hypersphere centered at x_i with radius radius(x_i). The evaluation continues with the whole set of instances. Instances are only selected if they do not fall in the covered area of the

Algorithm RIS1
Input:
    X: Training data set
    t: Threshold
Output:
    R: Set of instances selected from X
 1  R ← {}
 2  S ← {0, 0, ..., 0}
 3  for each x_i in X do
 4      for each x_j (j ≠ i) in X do
 5          if class(x_i) = class(x_j) then
 6              s_i ← s_i + sm(x_i, x_j, X)
 7          else
 8              s_i ← s_i − sm(x_i, x_j, X)
 9          end if
10      end for
11  end for
12  for i from 1 to m do
13      s_i ← scaling(s_i, S)
14  end for
15  [S, I] ← sortdesc(S)
16  X ← sortIdx(X, I)
17  i ← 1
18  while i < m and s_i ≥ t do
19      if isRelevant(x_i, R) then
20          R ← R ∪ {x_i}
21      end if
22      i ← i + 1
23  end while
end

Algorithm 1: RIS1 pseudocode.
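Algorithm 1 can be turned into a short runnable sketch. This is our own illustrative Python, not the authors' code: `X` is assumed to be a list of (point, label) pairs, distances are Euclidean, the score computation inlines Equations 1-3, and the constant-score guard in the scaling step is our addition:

```python
import math

def ris1(X, t):
    """Illustrative RIS1: X is a list of (point, label) pairs, t the threshold.
    Returns the selected prototype set R, mirroring Algorithm 1."""
    m = len(X)

    # Lines 3-11: signed softmax score per instance (Equations 1-3).
    s = []
    for i, (xi, ci) in enumerate(X):
        denom = sum(math.exp(-math.dist(xi, xk))
                    for k, (xk, _) in enumerate(X) if k != i)
        s.append(sum((1 if cj == ci else -1) *
                     math.exp(-math.dist(xi, xj)) / denom
                     for j, (xj, cj) in enumerate(X) if j != i))

    # Lines 12-14: min-max scaling into [0, 1] (guard against equal scores: ours).
    lo, hi = min(s), max(s)
    span = (hi - lo) or 1.0
    s = [(v - lo) / span for v in s]

    # Definition 1: radius of an instance = distance to its nearest enemy,
    # computed on the full set X (before any elimination, as in RIS1).
    def radius(xr, cr):
        return min(math.dist(xr, xj) for xj, cj in X if cj != cr)

    # Lines 15-23: visit instances by descending score; keep an instance only
    # if no already-selected same-class prototype covers it (Equation 5).
    R = []
    for i in sorted(range(m), key=lambda i: -s[i]):
        if s[i] < t:
            break  # scores below the threshold are discarded
        xi, ci = X[i]
        if not any(ci == cr and math.dist(xi, xr) <= radius(xr, cr)
                   for xr, cr in R):
            R.append((xi, ci))
    return R
```

On two well-separated clusters, for example, this sketch keeps a single prototype per class, since each selected instance's hypersphere covers the rest of its cluster.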




hyperspheres generated by the patterns that have the same class as them. This strategy aims to preserve the best instances per cluster per class.

Algorithm 1 shows the pseudo-code of the proposed instance selection method (RIS1). From line 3 to line 11, the algorithm calculates the scores per instance. Then, from line 12 to 14, the scores are normalized into the [0, 1] interval. The scores S and the training set X are reordered in descending order of the instance scores s_i (lines 15 and 16). Each instance is added to the result set R if its score is higher than a given threshold and if it is a relevant instance (Equation 5) (lines 18-23). Instances having scores lower than the threshold are discarded.

3.3. Toy example

Figure 2 shows the behavior of the RIS1 instance selection algorithm applied to the XOR problem. To simulate a "soft" XOR problem, a data set composed of two classes was artificially generated; Figure 2(a) shows the original data set consisting of 10 instances per class. In Figure 2(b), each instance has a number associated with it that represents its order in the ranking. So, the instance labeled with the number one obtained the highest score, the instance labeled with 2 obtained the second highest score, and so on. The scores are calculated following Algorithm 1 (lines 1-16). Instances removed by the threshold (the second part of the logical expression in line 18) are represented by filled black markers. Note that these instances are placed near the classification boundaries and occupy the lowest positions of the ranking. This shows that instances with a tendency to be noisy points either are removed by the threshold or have low selection priority.

In Figure 2(c), selected instances are represented by filled markers (blue and red). First, the instance at the top of the ranking is selected. Since the hyperspheres are defined before the noise elimination, the hypersphere of instance 1 is delimited by instance 18, which is its nearest enemy. Instances 2, 5, 7, and 9 are in the coverage area of instance 1, so they are removed (lines 19-21). Instance 3 is the second selected instance, and its coverage area is delimited by instance 16. Instances 4 and 6 are removed because they are in the coverage area of instance 3. This process continues until the whole set of instances is evaluated. Figure 2(d) shows the five (out of 20) instances selected by RIS. Note that not only the first elements of the ranking are selected, but those elements considered relevant to cover the entire feature space.
Since the hyperspheres are defined before the noise elimination, the hypersphere of instance 1 is delimited by instance 18 that is its nearest enemy. Instances 2, 5, 7 and 9 are in the coverage area of the instance 1, so they are removed (lines 19-21). Instance 3 is the second selected instance and its coverage area is delimited by instance 16. Instances 4 and 6 are removed because they are in the coverage area of instance 3. This process continues until the whole set of instances is evaluated. Figure 2(d) shows the five (out of 20) instances selected by RIS. Note that not only the first elements of the ranking are selected, but those elements considered relevant to cover the entire feature space. 14

3

3

2.5

2.5 11

1

2

2

1.5

1.5

2

5

10 12 15

7

18

20

9

14 1

16

13

1

17 0.5

0

8 6

19 3 4

0.5

0

0.5

1

1.5

2

2.5

0

3

0

0.5

(a) Original Training set

1

1.5

2

2.5

3

(b) Ranking

3

3

2.5 11

1

2.5

2

1

5

10

10

12

2

2 15

18

20

1.5

7

9 1.5

14 16

13

1

8

17

19 3 4

0.5

0

0

0.5

1

8

13

1

6

1.5

2

3

0.5

2.5

0

3

(c) Selection Task

0

0.5

1

1.5

2

2.5

(d) Reduced Set

Figure 2: RIS selection process with 2-D artificial data points.

295 296 297 298 299 300 301 302

3.4. RIS2: Scaling per Class

The labels of the instances are not taken into consideration in the scaling operation (Algorithm 1, lines 12-14). After scaling, the instance at the top of the ranking has the maximum score and the instance at the bottom has the minimum score. This global scaling may cause a problem if many elements of a specific class are grouped into the lower positions of the ranking. In that case, all the instances that belong to that class can be removed while noisy instances from other classes are preserved.


Algorithm RIS2 scaling
1  for each c_j in C do
2      W_{c_j} ← {w_i | class(x_i) = c_j}
3      for each w_i in W_{c_j} do
4          w_i ← scaling(w_i, W_{c_j})
5      end for
6  end for
end

Algorithm 2: RIS2 scaling procedure.
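Algorithm 2 amounts to applying the scaling of Equation 4 within each class instead of globally. A minimal sketch (ours, not the authors' code; the guard for a class whose scores are all equal is our addition):

```python
def scale_per_class(scores, labels):
    """RIS2 scaling: min-max scale scores separately within each class."""
    out = list(scores)
    for c in set(labels):
        idx = [i for i, y in enumerate(labels) if y == c]
        lo = min(scores[i] for i in idx)
        hi = max(scores[i] for i in idx)
        for i in idx:
            # Degenerate class (all scores equal): keep every instance at 1.0.
            out[i] = (scores[i] - lo) / (hi - lo) if hi > lo else 1.0
    return out
```

With scores [1, 3, 10, 20] and labels ['a', 'a', 'b', 'b'], each class now spans the full [0, 1] range, so no class is pushed wholesale below the threshold.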




In order to address this problem, the scaling procedure can be performed per class. Algorithm 2 shows the scaling procedure of the RIS2 algorithm, which should replace lines 12-14 of Algorithm 1. One by one, the classes have their instances processed separately. This local scaling aims to produce a fairer ranking than the one generated by the RIS1 algorithm.

3.5. RIS3: Avoiding Noise Influence in the Coverage Areas

In the lower left quadrant of Figure 3(a), there is one small circle that is completely inside a bigger one. These two circles are defined by two instances of the square class. Only one of these two instances should be used to map that area. However, the instance that represents the inner region is selected first because it has a higher score than the instance that represents the bigger circle; so, algorithms RIS1 and RIS2 maintain both instances. In fact, the problem is not the selection order, but the small size of the coverage area of the inner instance. Observe that the hypersphere of this instance is delimited by an enemy (ball instance) that is not selected. In the RIS3 algorithm, the definition of the coverage areas is performed after the elimination of the instances that have scores smaller than the threshold parameter. Figure 3(b) shows the result set produced by RIS3 for the same data used in Figure 3(a). In Figure 3(b), the coverage area of the selected instances increased compared with Figure 3(a), since the redundant information was removed. Thus, the overlapping of the coverage areas defined by instances belonging to the same class is eliminated.

3.6. Classification Rule

In the proposed selection methods, all instances located in the space delimited by the hypersphere of a selected instance x_r are removed if they have

[Figure 3 appears here. Panels: (a) RIS2; (b) RIS3.]

Figure 3: Removing redundant information with RIS3.


the same class as x_r. In order to compensate for this removal, the radius of the selected instances is used in the distance calculation: the higher the radius, the smaller the distance. So, the class of a query instance x is given by the class of the instance x_r such that:

    class(x) = \operatorname*{argmin}_{class(x_r)} \frac{d(x, x_r)}{radius(x_r)}    (6)


This criterion is of particular importance in the classification of new points that are located on the borders of the hyperspheres. This procedure aims to prevent instances with small hyperspheres located near the borders of instances with large hyperspheres from taking advantage in the decision.
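Equation 6 translates directly into code. A minimal sketch (ours), where `R` holds the selected (point, label) prototypes and `radii` their precomputed nearest-enemy radii:

```python
import math

def classify(query, R, radii):
    """Equation 6: pick the prototype minimizing d(x, x_r) / radius(x_r),
    so prototypes with larger hyperspheres win near their borders."""
    best = min(range(len(R)),
               key=lambda r: math.dist(query, R[r][0]) / radii[r])
    return R[best][1]
```

A query equidistant from two prototypes is thus assigned to the one with the larger radius, implementing the compensation described above.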


4. Experiments



Experiments were carried out using 24 databases (Table 1) from the KEEL repository (Alcalá-Fdez et al., 2011). All the reported experiments use the 10-fold cross-validation procedure, where the prior probability of each class is preserved in each fold. All the features were normalized to a number between 0 and 1. The proposed algorithms are compared with the k-Nearest Neighbor classifier with Euclidean distance, Edited Nearest Neighbors (ENN) (Wilson, 1972), Decremental Reduction Optimization Procedure 3 (DROP3) (Wilson and


Dataset              #Instances  #Features  #Classes
adult                     45222         14         2
appendicitis                106          7         2
balance                     625          4         3
bupa                        345          6         2
coil2000                   9822         85         2
connect-4                 67557         42         3
contraceptive              1473          9         3
haberman                    306          3         2
hayes-roth                  160          4         3
heart                       270         13         2
ionosphere                  351         33         2
led7digit                   500          7        10
marketing                  6876         13         9
monk-2                      432          6         2
movement-libras             360         90        15
pima                        768          8         2
satimage                   6435         36         7
segment                    2310         19         7
titanic                    2201          3         2
vowel                       990         13        11
wine                        178         13         3
winequality-red            1599         11         6
winequality-white          4898         11         7
yeast                      1484          8        10

Table 1: Dataset characteristics.

Martinez, 2000), and ATISA1 (Cavalcanti et al., 2013). ENN is an editing, decremental method, while ATISA and DROP3 are hybrid and decremental (Garcia et al., 2014). Hybrid methods maintain inner and border instances and, according to Garcia et al. (2012), this family of methods obtains an excellent tradeoff between reduction and accuracy. The ENN rule is a widely used reduction technique that removes instances that are not correctly classified by their k nearest neighbors. ATISA1 uses the distance of each instance to its nearest enemy as a threshold and removes redundant instances that lie within the coverage area delimited by this threshold. In contrast to ENN, which only removes borderline instances,



ATISA1 removes inner and borderline instances. Decremental Reduction Optimization Procedure (DROP) is a family of five algorithms that remove an instance if doing so does not decrease the accuracy of the k nearest neighbors rule on the set of its associates (all the instances that have the analysed instance among their own k nearest neighbors). Among them, the DROP3 algorithm obtains the best mix of storage reduction and generalization accuracy (Wilson and Martinez, 2000).

As for performance measures, all experiments report the accuracy rate and the reduction percentage; the latter represents how much the original training set was reduced. In order to verify whether differences in accuracy and storage reduction are significant, a non-parametric Wilcoxon signed-rank test was performed, whose decision does not depend on the pool of classifiers included in the original experiment (Benavoli et al., 2016). The next section shows the results and compares the RIS family with other algorithms in the literature. In Section 4.2, we discuss the parameter t of the proposed algorithms.

4.1. Results

In the RIS family, an instance is included in the final prototype subset if its score is higher than the parameter t (Algorithm 1 – line 18). This parameter is defined during the memorization phase using the training set, so one threshold is defined for each round of the 10-fold cross-validation. In other words, 9 folds are used to define the threshold t, and this value of t is applied on the remaining fold (test set). Different values of t, in the interval [0, 1], generate different subsets of prototypes. So, for each generated subset, the performance of the RIS algorithm is calculated, and the value of t that yields the best accuracy rate on the training set is selected and used to assess the test set. The following procedure is performed to calculate the threshold t* per round:

1) for t = [0.1, 0.2, ..., 1.0]
2)   apply RIS on the training set using t as threshold: P = RIS(T, t)
3)   evaluate P on the whole training set T
4) endfor
5) t* is defined as the value of t that reached the best accuracy in line 3)
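The five steps above amount to a grid search over t; a sketch where ris and evaluate stand in for Algorithm 1 and the training-set evaluation (both names are placeholders, not the paper's code):

```python
def select_threshold(train_X, train_y, ris, evaluate):
    """Steps 1)-5): pick the t whose prototype subset classifies the
    whole training set best."""
    grid = [round(0.1 * i, 1) for i in range(1, 11)]   # 0.1, 0.2, ..., 1.0
    best_t, best_acc = grid[0], -1.0
    for t in grid:
        P = ris(train_X, train_y, t)         # prototype subset for this t
        acc = evaluate(P, train_X, train_y)  # accuracy on the training set
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t
```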


Table 2 shows the accuracy rate obtained by the proposed algorithms compared with the k-Nearest Neighbor (k-NN) classifier, where the value of k was defined per dataset as the best value among {1, 3, 5} on the training


set. It also shows the reduction percentage (R), except for the k-NN since it uses the entire training set.

Dataset             k-NN   RIS1      R   RIS2      R   RIS3      R
adult              78.99  75.17  47.25  75.17  47.25  75.17  47.25
appendicitis       74.08  80.79  70.64  79.14  72.42  79.14  72.84
balance            81.90  86.72  76.53  84.17  84.76  81.48  90.91
bupa               58.40  57.97  43.83  56.85  50.18  57.14  56.14
coil2000           90.84  94.03  31.01  87.78  27.10  87.76  25.11
connect-4          56.55  65.83  52.81  65.57  46.74  65.80  95.30
contraceptive      44.95  46.91  26.14  47.05  38.36  45.69  41.37
haberman           62.81  64.98  49.20  65.65  58.02  66.34  64.78
hayes-roth         58.56  66.42  44.79  63.49  48.42  61.46  51.47
heart              76.26  74.44  50.99  77.41  61.77  78.89  77.86
ionosphere         84.42  90.62  75.50  90.63  74.64  90.92  74.89
led7digit          61.27  72.37  16.53  67.27  57.76  74.30  92.53
marketing          21.80  18.28  82.96  15.20  72.30  15.63  50.43
monk-2             95.36  92.11  72.79  92.60  83.04  92.83  87.27
movement-libras    65.60  78.11  62.28  60.56  70.73  59.22  72.58
pima               69.22  63.40  53.78  63.27  52.82  63.27  53.27
satimage           21.49  25.05  52.74  18.27  60.82  18.19  13.76
segment            94.52  92.12  88.96  91.82  90.51  91.65  91.51
titanic            63.79  69.06  17.61  56.44  34.28  55.48  68.34
vowel              93.84  87.88  76.90  66.06  83.45  65.56  84.39
wine               89.79  93.39  86.64  91.62  86.15  92.25  87.83
winequality-red    51.47  45.83  42.01  44.78  64.83  44.72  69.35
winequality-white  41.19  41.55  41.83  41.81  60.69  42.16  62.91
yeast              42.10  41.77  36.27  39.89  56.02  40.17  60.57
Mean               65.80  67.70  54.17  64.38  61.70  64.46  66.26
Wilcoxon               ∼    n/a    n/a      +      −      +      −

Table 2: Accuracy rates and reduction percentages (R) of the proposed algorithms: RIS1, RIS2, and RIS3. The symbols +, − and ∼, in the row labeled "Wilcoxon", indicate that RIS1 is significantly better, worse, or equivalent compared with the other algorithms, respectively. The best results are in bold.

Results show that, among the RIS family, RIS1 obtained the best accuracy rate, while RIS3 was the best regarding the reduction rate. It is important to remark that, even though it is the method with the lowest reduction rate in the


RIS family, RIS1 was able to discard more than 54 percent of the training data, on average, without hindering the accuracy. Compared to the k-NN, RIS1 obtained a better accuracy, by almost two percentage points on average. The last row of the table shows the results of the non-parametric Wilcoxon signed-rank test for zero median at a 0.05 significance level. The symbols +, − and ∼ in that row indicate that RIS1 is significantly better, worse, or equivalent compared with the other algorithms, respectively.

Based on the trade-off between the accuracy rates and the reduction percentages shown in Table 2, RIS1 was chosen to represent the RIS family in the comparison study with the state-of-the-art algorithms (Table 3). The proposed approach is compared with the following algorithms: Edited Nearest Neighbors (ENN) (Wilson, 1972), Decremental Reduction Optimization Procedure 3 (DROP3) (Wilson and Martinez, 2000), and ATISA1 (Cavalcanti et al., 2013). Excluding ATISA1, all compared instance selection methods are implemented in the KEEL software (Garcia et al., 2012).

The experimental results reported in Table 3 show that RIS1 reached the highest average accuracy rates in 10 out of 24 datasets, while ENN, DROP3, and ATISA obtained the best accuracy rates in 5 out of 24 datasets each. ENN is an editing algorithm that aims at removing points that contradict their neighbors, so, as expected, its strategy was the one with the lowest level of reduction. In contrast, DROP3 is a hybrid algorithm that obtained high reduction rates at the cost of harming accuracy. ATISA, which also belongs to the hybrid family, obtained similar precision to DROP3. On average, RIS1 outperformed DROP3 and ATISA by almost four percentage points regarding accuracy. The last row of the table shows the p-value of the Wilcoxon signed-rank test, and we can observe that the algorithms present similar precision. As highlighted by the boldface values in the table, some algorithms perform much better than the others on some datasets. This suggests that a procedure to choose the algorithm depending on the task could achieve impressive rates.

Figure 4 shows a two-dimensional plot of accuracy versus reduction. Each point in this figure represents the average values shown in Tables 2 and 3. The top right corner is the best possible solution. RIS1 is ahead in terms of accuracy, and RIS3 reached an interesting compromise between generalization accuracy and reduction rate.
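The paired, per-dataset comparison reported in the "Wilcoxon" rows can be reproduced with SciPy's implementation of the signed-rank test; the two accuracy vectors below are illustrative placeholders, not the paper's numbers:

```python
import numpy as np
from scipy.stats import wilcoxon

# Paired accuracies of two methods over the same datasets (illustrative).
acc_a = np.array([75.2, 80.8, 86.7, 58.0, 94.0, 65.8, 46.9, 65.0])
acc_b = np.array([74.1, 78.3, 85.2, 56.0, 93.9, 67.7, 48.7, 70.0])

# Two-sided test of the null hypothesis that the paired differences
# have zero median; reject at the 0.05 significance level.
stat, p = wilcoxon(acc_a, acc_b)
significant = p < 0.05
```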


Dataset              RIS1      R    ENN      R  DROP3      R  ATISA1      R
adult               75.17  47.25  81.01  20.07  80.69  91.38   80.54  84.68
appendicitis        80.79  70.64  78.33  15.52  78.33  90.98   78.33  89.62
balance             86.72  76.53  87.14  17.78  85.24  83.25   84.92  73.07
bupa                57.97  43.83  57.43  36.84  56.00  74.30   56.86  44.93
coil2000            94.03  31.01  93.96  80.39  93.96  99.96   93.94  84.17
connect-4           65.83  52.81  63.43  89.73  67.71  99.47   59.46  98.27
contraceptive       46.91  26.14  49.13  56.21  48.66  80.95   50.34  67.74
haberman            64.98  49.20  69.06  31.23  70.00  89.14   69.69  78.87
hayes-roth          66.42  44.79  42.22  76.60  44.44  87.44   36.11  81.25
heart               74.44  50.99  79.26  30.62  76.30  83.46   81.11  78.11
ionosphere          90.62  75.50  82.22  15.89  82.22  93.07   83.89  72.43
led7digit           72.37  16.53  39.82  83.13  39.45  94.82   41.27  85.84
marketing           18.28  82.96  19.88  98.92  21.32  99.79   21.14  99.67
monk-2              92.11  72.79  95.45  11.42  79.32  96.73   95.45  37.73
movement-libras     78.11  62.28  71.94  23.04  65.27  62.81   68.05  55.94
pima                63.40  53.78  71.30  26.43  71.95  82.29   71.69  61.47
satimage            25.05  52.74  21.07  99.79  21.07  99.93   21.07  99.87
segment             92.12  88.96  90.74  05.05  91.73  88.58   92.47  80.85
titanic             69.06  17.61  32.17  99.14  32.17  99.84   32.17  99.14
vowel               87.88  76.90  90.71  03.14  87.07  52.83   88.89  51.16
wine                93.39  86.64  89.47  04.31  89.47  84.39   90.53  74.66
winequality-red     45.83  42.01  52.96  48.29  52.16  81.41   53.40  61.09
winequality-white   41.55  41.83  42.02  51.99  43.34  81.82   42.08  62.00
yeast               41.77  36.27  48.69  46.96  47.32  81.67   48.63  65.00
Mean                67.70  54.17  64.39  45.28  63.56  87.58   64.00  75.10
Wilcoxon-p            n/a    n/a  0.731  0.391  0.345 <0.001   0.637  0.002

Table 3: Accuracy rates (%) and reduction percentages (R) of RIS1, ENN, DROP3, and ATISA1 algorithms. The best results are in bold.
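For reference, the ENN rule compared in Table 3 admits a compact implementation (an illustrative NumPy version, not the KEEL one):

```python
import numpy as np

def enn_filter(X, y, k=3):
    """Wilson's Edited Nearest Neighbors: drop every instance that is
    misclassified by the majority vote of its k nearest neighbors."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)              # exclude the point itself
    keep = []
    for i in range(len(X)):
        nn = np.argsort(d[i])[:k]            # k nearest neighbors of i
        votes = np.bincount(y[nn])
        if votes.argmax() == y[i]:           # kept only if the vote agrees
            keep.append(i)
    return np.array(keep)
```

As the discussion above notes, this editing step removes only borderline or noisy points, which is why ENN's reduction rates are the lowest in the comparison.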


4.2. Analysis of the parameter t

The RIS algorithms have the parameter t, a threshold used to select the relevant instances. In the previous section, this threshold was defined using the training set. In this section, we conduct an evaluation supposing that the best threshold can be found. It is a kind of Oracle in which the best threshold is defined using the test dataset. This analysis aims at verifying whether there is room for improvement in the threshold definition procedure, i.e., is the optimal threshold set during training? Moreover, if the threshold obtained during training is not the best, how much could we improve the accuracy rate given that we know the optimal threshold?

Table 4 shows the accuracy rates for the threshold calculated using the

Figure 4: Accuracy x Reduction chart. Each point represents the average accuracy and reduction rates for each instance selection algorithm.


training (ttrain) and test (ttest) sets for the RIS algorithms. The column ∆ shows the difference between the accuracy rates achieved using ttest and ttrain. RIS1 obtained the smallest average difference, meaning that RIS1 was able to better approximate the predicted threshold to the ideal one. The difference (∆) was zero in 5 datasets for RIS1, which means that, for these datasets, no improvement was possible since the best threshold was already found during training. The same can be observed for RIS2 and RIS3 in four and three datasets, respectively. It is also possible to observe that, for some datasets, the difference ∆ is high. For instance, RIS2 and RIS3 could benefit from an increment of more than 13 percentage points in the appendicitis dataset if the best t were chosen. We also performed a statistical test to verify the significance of the results given the best threshold. The non-parametric Wilcoxon signed-rank test for zero median at a 0.05 significance level showed that all the RIS algorithms are significantly better than the literature algorithms discussed in the last section if we are able to determine the threshold better.


5. Conclusion



In this work, we introduced a new instance selection algorithm, called Ranking-based Instance Selection (RIS), that aims at selecting a subset of the

                       RIS 1                   RIS 2                   RIS 3
Dataset            ttrain  ttest     ∆     ttrain  ttest     ∆     ttrain  ttest     ∆
adult               75.17  75.43  0.26      75.17  77.09  1.92      75.17  77.13  1.96
appendicitis        80.79  84.70  3.91      79.14  92.26 13.12      79.14  92.26 13.12
balance             86.72  89.93  3.21      84.17  87.70  3.53      81.48  85.95  4.47
bupa                57.97  61.43  3.46      56.85  62.31  5.46      57.14  62.34  5.20
coil2000            94.03  94.03  0.00      87.78  87.78  0.00      87.76  87.77  0.01
connect-4           65.83  65.83  0.00      65.57  65.59  0.02      65.80  65.80  0.00
contraceptive       46.91  49.15  2.24      47.05  48.95  1.90      45.69  49.15  3.46
haberman            64.98  74.16  9.18      65.65  70.23  4.58      66.34  69.62  3.28
hayes-roth          66.42  71.94  5.52      63.49  66.40  2.91      61.46  64.44  2.98
heart               74.44  82.59  8.15      77.41  84.07  6.66      78.89  85.19  6.30
ionosphere          90.62  92.89  2.27      90.63  90.92  0.29      90.92  91.22  0.30
led7digit           72.37  73.34  0.97      67.27  68.84  1.57      74.30  76.44  2.14
marketing           18.28  18.34  0.06      15.20  15.53  0.33      15.63  16.06  0.43
monk-2              92.11  92.11  0.00      92.60  92.84  0.24      92.83  94.00  1.17
movement-libras     78.11  80.11  2.00      60.56  60.56  0.00      59.22  59.22  0.00
pima                63.40  69.00  5.60      63.27  68.08  4.81      63.27  67.57  4.30
satimage            25.05  25.13  0.08      18.27  18.31  0.04      18.19  18.20  0.01
segment             92.12  92.60  0.48      91.82  91.82  0.00      91.65  91.65  0.00
titanic             69.06  69.06  0.00      56.44  56.75  0.31      55.48  56.21  0.73
vowel               87.88  87.88  0.00      66.06  66.06  0.00      65.56  65.56  0.00
wine                93.39  93.95  0.56      91.62  93.32  1.70      92.25  97.25  5.00
winequality-red     45.83  51.46  5.63      44.78  48.67  3.89      44.72  50.61  5.89
winequality-white   41.55  45.06  3.51      41.81  43.39  1.58      42.16  42.92  0.76
yeast               41.77  42.37  0.60      39.89  46.85  6.96      40.17  45.75  5.58
Mean                67.70  70.10  2.40      64.38  66.85  2.47      64.46  67.18  2.72

Table 4: Accuracy rate (%) for the threshold obtained using the training (ttrain) and test (ttest) sets. ∆ shows the difference between them in percentage points.


original training set. In this sense, it is similar to previously published hybrid methods. In contrast, however, RIS uses a ranking strategy to select the best instances in terms of classification. This ranking is based on a score assigned to each instance: the higher the number of nearby patterns of the same class, the higher the instance score. Therefore, the ranking implies that borderline or noisy instances have low priority. In addition to the ranking, each instance is responsible for a coverage area in the feature space. This area is defined by a hypersphere that has the instance as its center and is delimited by its nearest enemy. The selection process uses the scores to establish an order of the instances in the training set, and instances that belong to the coverage area of another instance of the same class are removed.

The scores calculated in the proposed approach can be used as a heat


map that shows border and safe regions of the feature space. In our work, we used 2-dimensional toy problems to illustrate this property of the scores; however, it could also be investigated for high-dimensional tasks. One limitation of the proposed approach resides in how the instance coverage area is defined: it is assumed to be a hypersphere centered at the instance, with a radius equal to the distance to its nearest enemy. A different strategy (probably not a hypersphere) could avoid relying on information from only one direction (the nearest enemy) while the other directions are ignored in the definition of the instance coverage area.

The three versions of the proposed technique were evaluated on twenty-four datasets. The experimental study provided empirical evidence of the effectiveness of the proposed techniques in outperforming literature algorithms. In general terms, RIS1 obtains the highest average generalization accuracy, RIS3 obtains the best reduction rates, and RIS2 is a balanced choice between accuracy and reduction.

For future work, we intend to investigate how to select the most appropriate version of the RIS algorithms per dataset using data complexity measures (de Oliveira Moura et al., 2014), as well as how to better define the parameter t of the proposed approach. Furthermore, the effectiveness of the novel method at the instance level remains to be investigated, making use of instance hardness (Smith et al., 2014) and other instance properties.


Acknowledgements



The authors would like to thank Brazilian agencies: CAPES (Coordena¸ca˜o de Aperfei¸coamento de Pessoal de N´ıvel Superior), CNPq (Conselho Nacional de Desenvolvimento Cient´ıfico e Tecnol´ogico) and FACEPE (Funda¸ca˜o de Amparo `a Ciˆencia e Tecnologia de Pernambuco).


References


Aggarwal, C.C., Hinneburg, A., Keim, D.A., 2001. On the surprising behavior of distance metrics in high dimensional spaces, in: International Conference on Database Theory, pp. 420–434.

Aggarwal, C.C., Yu, P.S., 2001. Outlier detection for high dimensional data, in: ACM SIGMOD International Conference on Management of Data, pp. 37–46.


Aha, D.W., Kibler, D., Albert, M.K., 1991. Instance-based learning algorithms. Machine Learning 6, 37–66.

Alcalá-Fdez, J., Fernandez, A., Luengo, J., Derrac, J., García, S., Sánchez, L., Herrera, F., 2011. KEEL data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework. Journal of Multiple-Valued Logic and Soft Computing 17, 255–287.

Angiulli, F., 2007. Fast nearest neighbor condensation for large data sets classification. IEEE Transactions on Knowledge and Data Engineering 19, 1450–1464.

Arnaiz-González, Á., Díez-Pastor, J.F., Rodríguez, J.J., García-Osorio, C., 2018. Study of data transformation techniques for adapting single-label prototype selection algorithms to multi-label learning. Expert Systems with Applications 109, 114–130.

Arnaiz-González, Á., Díez-Pastor, J.F., Rodríguez, J.J., García-Osorio, C.I., 2016. Instance selection for regression by discretization. Expert Systems with Applications 54, 340–350.

Benavoli, A., Corani, G., Mangili, F., 2016. Should we really use post-hoc tests based on mean-ranks? Journal of Machine Learning Research 17, 1–10.

Beyer, K.S., Goldstein, J., Ramakrishnan, R., Shaft, U., 1999. When is "nearest neighbor" meaningful?, in: International Conference on Database Theory, pp. 217–235.

Bien, J., Tibshirani, R., 2011. Prototype selection for interpretable classification. The Annals of Applied Statistics 5, 2403–2424.

Bridle, J.S., 1990. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition, in: Fougelman-Soulie, F., Harault, J. (Eds.), Neuro-computing: Algorithms, Architectures and Applications, Springer-Verlag, New York, pp. 227–236.


Cavalcanti, G.D.C., Ren, T.I., Pereira, C.L., 2013. ATISA: Adaptive threshold-based instance selection algorithm. Expert Systems with Applications 40, 6894–6900.

Chen, Z.Y., Lin, W.C., Ke, S.W., Tsai, C.F., 2015. Evolutionary feature and instance selection for traffic sign recognition. Computers in Industry 74, 201–211.

Cover, T.M., Hart, P.E., 1967. Nearest neighbor pattern classification. IEEE Transactions on Information Theory 13, 21–27.

Cruz, R.M., Oliveira, D.V., Cavalcanti, G.D., Sabourin, R., 2019. FIRE-DES++: Enhanced online pruning of base classifiers for dynamic ensemble selection. Pattern Recognition 85, 149–160.

Cruz, R.M., Sabourin, R., Cavalcanti, G.D., 2018a. Dynamic classifier selection: Recent advances and perspectives. Information Fusion 41, 195–216.

Cruz, R.M.O., Sabourin, R., Cavalcanti, G.D.C., 2017. Analyzing different prototype selection techniques for dynamic classifier and ensemble selection, in: International Joint Conference on Neural Networks (IJCNN), pp. 3959–3966.

Cruz, R.M.O., Sabourin, R., Cavalcanti, G.D.C., 2018b. Prototype selection for dynamic classifier and ensemble selection. Neural Computing and Applications 29, 447–457.

García, S., Derrac, J., Cano, J., Herrera, F., 2012. Prototype selection for nearest neighbor classification: Taxonomy and empirical study. IEEE Transactions on Pattern Analysis and Machine Intelligence 34, 417–435.

García, S., Luengo, J., Herrera, F., 2014. Data Preprocessing in Data Mining. Springer Publishing Company. Chapter: Instance Selection.

Haykin, S., 1999. Neural Networks: A Comprehensive Foundation, 2/E. Prentice Hall PTR, Upper Saddle River, NJ, USA.

Kim, K.j., 2006. Artificial neural networks with evolutionary instance selection for financial forecasting. Expert Systems with Applications 30, 519–526. Intelligent Information Systems for Financial Engineering.


Kim, S.W., Oommen, B.J., 2003. A brief taxonomy and ranking of creative prototype reduction schemes. Pattern Analysis & Applications 6, 232–244.

Krawczyk, B., Triguero, I., García, S., Woźniak, M., Herrera, F., 2019. Instance reduction for one-class classification. Knowledge and Information Systems 59, 601–628.

Kuncheva, L.I., Arnaiz-González, Á., Díez-Pastor, J.F., Gunn, I.A.D., 2019. Instance selection improves geometric mean accuracy: a study on imbalanced data classification. Progress in Artificial Intelligence 8, 215–228.

Marchiori, E., 2008. Hit miss networks with applications to instance selection. Journal of Machine Learning Research 9, 997–1017.

Marchiori, E., 2010. Class conditional nearest neighbor for large margin instance selection. IEEE Transactions on Pattern Analysis and Machine Intelligence 32, 364–370.

Nanni, L., Lumini, A., 2011. Prototype reduction techniques: A comparison among different approaches. Expert Systems with Applications 38, 11820–11828.

de Oliveira Moura, S., de Freitas, M.B., Cardoso, H.A., Cavalcanti, G.D., 2014. Choosing instance selection method using meta-learning, in: IEEE International Conference on Systems, Man, and Cybernetics, pp. 2003–2007.

Pereira, C.S., Cavalcanti, G.D.C., 2011a. Handwritten connected digits detection: An approach using instance selection, in: IEEE International Conference on Image Processing, pp. 2613–2616.

Pereira, C.S., Cavalcanti, G.D.C., 2011b. Instance selection algorithm based on a ranking procedure, in: International Joint Conference on Neural Networks, pp. 2409–2416.

Smith, M.R., Martinez, T., Giraud-Carrier, C., 2014. An instance level analysis of data complexity. Machine Learning 95, 225–256.

Smith-Miles, K., Islam, R., 2010. Meta-learning for data summarization based on instance selection method, in: IEEE Congress on Evolutionary Computation, pp. 1–8.



Triguero, I., Derrac, J., García, S., Herrera, F., 2012. A taxonomy and experimental study on prototype generation for nearest neighbor classification. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 42, 86–100.

Triguero, I., Peralta, D., Bacardit, J., García, S., Herrera, F., 2015. MRPR: A MapReduce solution for prototype reduction in big data classification. Neurocomputing 150, 331–345.

Tsai, C.F., Lin, W.C., Hu, Y.H., Yao, G.T., 2019. Under-sampling class imbalanced datasets by combining clustering analysis and instance selection. Information Sciences 477, 47–54.

Wilson, D.L., 1972. Asymptotic properties of nearest neighbor rules using edited data. IEEE Transactions on Systems, Man, and Cybernetics 2, 408–421.

Wilson, D.R., Martinez, T.R., 2000. Reduction techniques for instance-based learning algorithms. Machine Learning 38, 257–286.

Yang, L., Zhu, Q., Huang, J., Wu, Q., Cheng, D., Hong, X., 2019. Constraint nearest neighbor for instance reduction. Soft Computing.


Declaration of interests

☐ The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

☐ The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: