Asymmetric Label Switching Resists Binary Imbalance
Journal Pre-proof
Asymmetric Label Switching Resists Binary Imbalance
Aitor Gutiérrez-López, Francisco-Javier González-Serrano, Aníbal R. Figueiras-Vidal
PII: S1566-2535(19)30557-3
DOI: https://doi.org/10.1016/j.inffus.2020.02.004
Reference: INFFUS 1205
To appear in: Information Fusion
Received date: 15 July 2019
Revised date: 30 January 2020
Accepted date: 23 February 2020
Please cite this article as: Aitor Gutiérrez-López, Francisco-Javier González-Serrano, Aníbal R. Figueiras-Vidal, Asymmetric Label Switching Resists Binary Imbalance, Information Fusion (2020), doi: https://doi.org/10.1016/j.inffus.2020.02.004
This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain. © 2020 Published by Elsevier B.V.
Highlights
• Asymmetric label switching generates diversity in ensemble learning
• Bregman divergences enable ensembles to estimate the a posteriori probabilities
• Optimum thresholds are analytically derived under the Bayes Decision Theory framework
• Experiments with imbalanced datasets provide evidence of gains in performance
Asymmetric Label Switching Resists Binary Imbalance

Aitor Gutiérrez-López, Francisco-Javier González-Serrano∗, Aníbal R. Figueiras-Vidal

Department of Signal Theory and Communications, Universidad Carlos III de Madrid, Avda. Universidad, 30, 28911 Leganés, Spain.
Abstract

In this correspondence, an asymmetric version of the label switching technique to build binary classification ensembles is introduced. The new version presents one more design parameter, the degree of asymmetry, and, consequently, it is more flexible in adapting to the problem under study. In particular, asymmetric switching allows designs that resist class imbalance. A Bayesian analysis serves to establish how to deal with datasets in order to carry out a principled rebalancing, which can be combined with other principled procedures according to the relative advantages of asymmetric switching. A number of simple experiments support the low sensitivity to imbalance and the validity of the analysis for this method of constructing ensembles.

Keywords: Bayesian framework, ensembles, imbalanced classification, label switching
∗Corresponding author
Email addresses: [email protected] (Aitor Gutiérrez-López), [email protected] (Francisco-Javier González-Serrano), [email protected] (Aníbal R. Figueiras-Vidal)
Preprint submitted to Information Fusion, February 24, 2020
1. Introduction

Machine ensembles improve the performance of single learning machines by aggregating the outputs of a number of them –learning units, or learners– that are trained under diverse conditions [1, 2]. Since the 1990s, they have served, in particular, to alleviate the practical limitations that the size of training datasets imposes on the approximation capacity of shallow MultiLayer Perceptrons (MLPs).

Committees are a family of machine ensembles in which the units are separately trained and their final output is obtained by a conventional fusion procedure, such as direct averaging or majority voting. Breiman proposed a number of committee designs. One of them is Bagging [3], where diversity comes from bootstrapping the training dataset. Randomizing outputs [4], or switching, creates new sample populations by randomly changing the labels of a portion of the class populations –in the original version, the same number of changes for each class. This serves to create diverse classification problems and to build ensembles that can improve the performance of single classifiers.

The above methods are forms of creating diversity that can be used to build ensembles without any additional effort (beyond adding an aggregation mechanism). Of course, they are not competitive with other, more sophisticated techniques for constructing ensembles, such as Boosting (see [1, 2]; yet Boosting does not provide estimates of the posterior probabilities of the classes, and, therefore, it cannot be considered within the principled framework of the study we present here). Bagging and Switching can be easily combined with the other fundamental way of increasing the
capacities of (shallow) MLPs: Deep Neural Networks (this is not the case for Boosting). As a consequence, these fusion mechanisms contribute performance improvements that come just from their diversity. If complementary performance improvement techniques –such as adjustable sample weighting according to successful versions of Boosting [5, 6, 7]– are also employed, deep machine ensembles with excellent performance can be obtained [8, 9].

Imbalanced classification deserves much attention because it appears in many important real-world problems of different application areas. Those problems have very different class population sizes and/or classification costs. As a consequence, conventionally designed discriminative classifiers do not provide satisfactory results: They naturally tend to decide in favour of the majority class. Among the procedures that serve to combat this undesirable effect –references [10, 11, 12, 13, 14, 15] offer clear and complete perspectives– there are methods based on sample preprocessing, and others that apply modified learning algorithms, such as one-class or modified-kernel Support Vector Machines [16, 17]. Fuzzy formulations have also been applied [18]. Recently, there have been contributions that include cost-sensitive elements [19, 20].

The techniques based on preprocessing the (training) samples are simple and effective. They include, besides increasing the cost of wrongly classifying the minority samples [21], oversampling the minority and/or undersampling the majority classes [22, 23] –which can be considered as asymmetric versions of bagging– or generating minority samples –SMOTE [24] is a successful
mechanism from which many modifications have been proposed. Obviously, the classification problems that result from these processes are more balanced, and, under certain conditions, the solution of the original classification problem can be obtained from their results.

In the same direction, the possibility of applying switching as a rebalancing mechanism (which can be combined with others) is obviously interesting: Bagging-type diversifications (resampling) introduce some risks –deletion of critical samples, emphasis of outliers– and even sample generation can create distortions in the class likelihoods. However, switching uses the available training samples without these drawbacks. This is even more important when some observed variables are binary, for example, because the undue deletion risk and the generation difficulties are worse. But, surprisingly, switching –another form of creating diversity and taking advantage of fusion mechanisms– has not been applied to imbalanced problems, with the exception of a particular form of it [25]. The reason for this seems to be the particular mode in which switching was defined: The number of changed labels is the same for all the classes, so the effects are clearly different for big and small class sizes, and it is impossible to find an appropriate randomization in order to face the imbalance. Additionally, the lack of an analytical framework for the switching process did not allow the application of a principled aggregation scheme, and a simple voting was applied. Voting aggregations offer a discrete value as the ensemble output, and this discontinuity is a drawback when coming back to the solution of the original imbalanced problem, because exact compensations are not possible. And even versions with equal switching rates do not appear to be a good option.
In this correspondence, we propose an asymmetric form of switching for dealing with binary imbalanced problems. Its theoretical analysis serves to conclude that, if the different learners –we use shallow MLPs– are trained to obtain estimates of the ‘a posteriori’ probabilities of their ‘switched’ classes –which only requires the use of Bregman divergences [26] as surrogate costs– there is a principled aggregation procedure that offers estimates of the ‘a posteriori’ probabilities of the original problem classes. Additionally, we show that modifying the standard activation nonlinearities of the MLP units is necessary to keep a consistent processing. These new algorithms have an intrinsic resistance to the difficulties that imbalanced datasets produce, because the two different switching rates –which can be selected by conventional Cross-Validation (CV)– can generate diverse problems with a lower imbalance. Moreover, other rebalancing procedures can be additionally applied, following the principled approach presented in [27]. However, we will limit the content of this letter to the analysis and experimental demonstration of the resistance which asymmetric switching shows to imbalanced dataset effects (Sections 2 and 3, respectively). The main conclusions of this work and some open lines for further research close this contribution.

2. The proposed algorithm and its analysis

An asymmetric switching mechanism changes the labels of randomly selected samples with different rates for each class. Let {C_i}, i = 0, 1, be the two classes of a binary problem (C_1 being the minority class), {P_i} their ‘a priori’ probabilities, x the observations, {p(x|C_i)} the likelihoods, and {Pr(C_i|x)} the ‘a posteriori’ class probabilities. The Bayesian formula

$$\Pr(C_1|x) = \frac{p(x|C_1)P_1}{p(x|C_1)P_1 + p(x|C_0)P_0} = \frac{1}{1 + \dfrac{P_0\,p(x|C_0)}{P_1\,p(x|C_1)}} \tag{1}$$
establishes a nonlinear one-to-one correspondence between ‘a posteriori’ class probabilities and the likelihood ratio, assuming that {P_i} are given.

The application of a random label switching with rates α and β to the C_0 and C_1 samples, respectively, offers a key possibility. If α > β, the resulting new classification problem S will have ‘classes’ C′_1 and C′_0, with mixed statistical probability densities, that are more balanced. Therefore, the estimation of its ‘class’ probabilities

$$\Pr{}_S(C'_1|x) = \frac{p(x|C_1)(1-\beta)P_1 + p(x|C_0)\,\alpha P_0}{p(x|C_1)P_1 + p(x|C_0)P_0} \tag{2}$$

(and its complement) will be easier. Moreover, by adding and subtracting $\alpha\,p(x|C_1)P_1$ in the numerator of (2), we obtain

$$\Pr{}_S(C'_1|x) = (1-\alpha-\beta)\Pr(C_1|x) + \alpha \tag{3}$$

from which

$$\Pr(C_1|x) = \frac{\Pr_S(C'_1|x) - \alpha}{1-\alpha-\beta} \tag{4}$$
i.e., we can recover (an estimate of) Pr(C_1|x) from (an estimate of) Pr_S(C′_1|x). So, we can solve the original classification problem by using the classical Bayesian test

$$\Pr(C_1|x) \underset{C_0}{\overset{C_1}{\gtrless}} \frac{Q_C}{Q_C + 1} \tag{5}$$

where $Q_C = (c_{10} - c_{00})/(c_{01} - c_{11})$, $c_{ji}$ being the cost of selecting $C_j$ when $C_i$ is true. From (4) and (5),

$$\Pr{}_S(C'_1|x) \underset{C_0}{\overset{C_1}{\gtrless}} \alpha + (1-\alpha-\beta)\,\frac{Q_C}{Q_C + 1} \tag{6}$$
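The switching relations (2)–(4) are easy to check numerically. The following is a minimal numpy sketch (names and the chosen rates are illustrative, not from the paper) that applies asymmetric label switching to simulated labels at a fixed observation x and verifies (3) and (4) by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 0.3, 0.05   # switching rates for C0 and C1 (illustrative values)
p1 = 0.2                  # assumed true Pr(C1 | x) at some fixed x

# Simulate labels at this x, then switch them asymmetrically:
# a C0 label flips with probability alpha, a C1 label with probability beta.
n = 1_000_000
y = rng.random(n) < p1                                       # True = C1
flip = np.where(y, rng.random(n) < beta, rng.random(n) < alpha)
y_s = np.logical_xor(y, flip)                                # switched labels

# Eq. (3): Pr_S(C1'|x) = (1 - alpha - beta) * Pr(C1|x) + alpha
p1_s_emp = y_s.mean()
p1_s_theory = (1 - alpha - beta) * p1 + alpha

# Eq. (4): recover Pr(C1|x) from the switched-problem probability
p1_rec = (p1_s_emp - alpha) / (1 - alpha - beta)

print(p1_s_emp, p1_s_theory, p1_rec)
```

With these rates, the switched minority ‘class’ probability rises from 0.2 to about 0.43, which illustrates how asymmetric switching rebalances the problem seen by each learner.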
Training a learning machine with a Bregman divergence [26] c(t, o) as surrogate cost –the function whose added sample values are minimized for training– o = o(x) being the learning machine output, is a necessary and sufficient condition to obtain consistent estimators of the ‘a posteriori’ expectations E{t|o}. Bregman divergences are those such that

$$\frac{\partial c(t,o)}{\partial o} = -g(o)(t-o)\,, \qquad g(o) > 0 \tag{7}$$

References [28, 29] are general discussions of Bregman divergences in the context of machine learning, and the appendix of [27] presents the proof of the if-and-only-if character of (7) by studying the minimization of the average surrogate cost. For a binary classification problem with targets t = ±1,

$$o(x) = E\{t|x\} = (+1)\Pr(C_1|x) + (-1)\Pr(C_0|x) = 2\Pr(C_1|x) - 1 \tag{8}$$

Applying (8) to (6) leads to

$$o_S(x) \underset{C_0}{\overset{C_1}{\gtrless}} 2\left[\alpha + (1-\alpha-\beta)\,\frac{Q_C}{Q_C + 1}\right] - 1 \tag{9}$$
as the classification rule for the ensemble learners, which work with (consistently) estimated values. Obviously, (9) could be applied to each learner and a majority vote used to aggregate the results, but this would close the door to the possibility of including other principled rebalancing techniques. As aggregation, simply averaging the outputs of the ensemble learners and comparing the result with the second member of (9) is possible without that limitation, and it will be used in the experiments.¹
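The averaging aggregation together with the threshold of (9) can be sketched as follows (a hedged illustration; the function name `ensemble_decide` and the toy outputs are ours, not from the paper's code):

```python
import numpy as np

def ensemble_decide(learner_outputs, alpha, beta, qc=1.0):
    """Aggregate learner outputs o_S(x) in [-1, 1] by direct averaging and
    apply the threshold of Eq. (9); returns True where C1 is decided.
    `learner_outputs` has shape (n_learners, n_samples)."""
    o_s = learner_outputs.mean(axis=0)   # simple average aggregation
    thr = 2.0 * (alpha + (1.0 - alpha - beta) * qc / (qc + 1.0)) - 1.0
    return o_s > thr

# Toy check: with alpha = beta and QC = 1 the threshold reduces to
# alpha - beta = 0, i.e., the sign test on the averaged output.
outs = np.array([[-0.2, 0.4],
                 [ 0.1, 0.3]])          # 2 learners, 2 samples
print(ensemble_decide(outs, alpha=0.1, beta=0.1))
```

For QC = 1 the threshold simplifies to α − β, which is the case discussed next in the text.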
From the above, if $Q_C = 1$, the test reduces to $o_S(x) \underset{C_0}{\overset{C_1}{\gtrless}} \alpha - \beta$, and if in addition α = β, to $o_S(x) \underset{C_0}{\overset{C_1}{\gtrless}} 0$. But note that the last case, as the original switching method, does not rebalance the problems for the learners and, therefore, the ensemble will be more sensitive to imbalanced situations.

There is a hidden condition in the above formulation. Assuming that 0 ≤ Pr(C_1|x) ≤ 1, (6) requires that α ≤ Pr_S(C′_1|x) ≤ 1 − β, and, according to (8), this imposes −(1 − 2α) ≤ o_S(x) ≤ 1 − 2β. So, if we apply a conventional activation nonlinearity act_0(z) = tanh(z) at the output of the learners, there is a risk of getting values of o_S(x) out of this margin, and the appearance of such values will degrade the performance of the ensemble. To avoid this risk, it is enough to apply asymmetric activations such as
$$\mathrm{act}_1(z) = \begin{cases} (1-2\alpha)\tanh(z), & z < 0 \\ (1-2\beta)\tanh(z), & z > 0 \end{cases} \tag{10a}$$

or

$$\mathrm{act}_2(z) = \begin{cases} (1-2\alpha)\tanh\!\left(\dfrac{z}{1-2\alpha}\right), & z < 0 \\ (1-2\beta)\tanh\!\left(\dfrac{z}{1-2\beta}\right), & z > 0 \end{cases} \tag{10b}$$
¹According to the one-to-one correspondence (1), averaging likelihood ratios is also possible.
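The asymmetric activations (10a) and (10b) are straightforward to implement; a numpy sketch (function names are ours) that also checks the output margin is:

```python
import numpy as np

def act1(z, alpha, beta):
    """Asymmetric activation (10a): rescaled tanh keeping the learner
    output inside [-(1 - 2*alpha), 1 - 2*beta]."""
    return np.where(z < 0, (1 - 2 * alpha) * np.tanh(z),
                           (1 - 2 * beta) * np.tanh(z))

def act2(z, alpha, beta):
    """Asymmetric activation (10b): the argument is also rescaled, so the
    slope at the origin remains 1 on both sides."""
    return np.where(z < 0, (1 - 2 * alpha) * np.tanh(z / (1 - 2 * alpha)),
                           (1 - 2 * beta) * np.tanh(z / (1 - 2 * beta)))

z = np.linspace(-5, 5, 1001)
y = act1(z, alpha=0.3, beta=0.05)
print(y.min(), y.max())   # stays inside (-(1 - 2*0.3), 1 - 2*0.05)
```

The difference between the two variants is only the behaviour around the origin: (10a) changes the slope there, while (10b) keeps it equal to 1.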
3. Experiments

We have carried out a number of experiments, and their results support that asymmetric switching binary classifiers offer an intrinsic resistance to imbalance effects. We present here those corresponding to twelve databases, selected to illustrate the different possible situations: Vowel0, Abalone9vs18, Abalone17vs7-8-9-10, Yeast4, and Yeast6 from [30], and Ecoli-imU, Satimage4, Balance-B, Solarflare-m0, Oil, Letter img-Z, and Winequality4, which are versions of the original datasets obtained from [31]. Table 1 shows their main characteristics. There are different sizes and imbalance ratios.

Datasets               N      D    IR
Ecoli-imU              336    7    8.6
Satimage4              6435   36   9.3
Vowel0                 988    13   10
Balance-B              625    4    11.8
Abalone9vs18           731    8    16.4
Solarflare-m0          1389   32   19
Oil                    937    49   22
Winequality4           4898   11   26
Letter img-Z           20000  16   26
Yeast4                 1484   8    28.1
Abalone17vs7-8-9-10    2338   8    39.3
Yeast6                 1484   8    41.4

Table 1: Description of the datasets. N: Number of instances, D: Number of attributes, IR: Imbalance Ratio.
We randomly select (proportional) 75%-25% training-test sets for all the databases (these sets are different in each run, in order to obtain appropriately averaged results). This is a classical train-test partition proportion because it keeps the number of training samples high enough without reducing too much the number of test samples. We apply switching ensembles of 31 learners, which are enough to exploit switching diversity with moderate sample populations. Single-hidden-layer MLPs with 4 hidden units are used as learners, since they are weak enough to be sensitive to the effect of switching diversity, and they can be trained with a moderate number of examples. These empirical values are used because the purpose of the experiments is simply to make evident the resistance of the asymmetric forms against imbalance effects. 100 runs (with different initializations of the MLPs' weights) are averaged, and the L-BFGS-B method is used for training.

To illustrate the importance of the independently defined switching rates of our proposed design, we compare:

• The original Breiman switching mechanism, which changes the label of the same number of positive and negative instances; we characterize it by the switching rate β for the minority class.

• The proposed asymmetric switching scheme, presenting results not only for the best (α, β) combination, but also for the best α = β case, to appreciate the effects of asymmetry.

The explored switching rate values go from 0 to 0.45 in 0.05 steps. Note that this increases the ensemble design cost because one more switching parameter must be cross-validated. For a fair comparison, all the ensemble outputs are aggregated according to the principles suggested in Section 2, i.e., by a direct average. Finally, we present only results for act_1(z), because differences with the other activations are not significant. Square error is the training cost.

We use as performance indicator the well-known F_1-score, the harmonic mean of precision (TP/(TP + FP), where TP and FP are the numbers of true positives and false positives, respectively) and recall (TP/(TP + FN), where FN is the number of false negatives):

$$F_1 = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}$$
Note that this is equivalent to defining a working point.

Datasets               Breiman (β)          Symmetric (α = β)    Asymmetric (α, β)
Ecoli-imU              0.57 ± 0.12 (0)      0.57 ± 0.12 (0)      0.68 ± 0.05 (0.25, 0.05)
Satimage4              0.60 ± 0.03 (0)      0.60 ± 0.03 (0.05)   0.63 ± 0.03 (0.2, 0)
Vowel0                 0.98 ± 0.03 (0.05)   0.99 ± 0.02 (0.1)    0.99 ± 0.02 (0.1, 0)
Balance-B              0.22 ± 0.15 (0)      0.23 ± 0.16 (0.05)   0.42 ± 0.16 (0.35, 0)
Abalone9vs18           0.51 ± 0.13 (0)      0.53 ± 0.13 (0.05)   0.61 ± 0.12 (0.25, 0)
Solarflare-m0          0.08 ± 0.07 (0.05)   0.08 ± 0.07 (0.05)   0.24 ± 0.07 (0.4, 0.05)
Oil                    0.57 ± 0.13 (0)      0.59 ± 0.13 (0.05)   0.62 ± 0.12 (0.15, 0)
Winequality4           0.21 ± 0.05 (0)      0.21 ± 0.05 (0)      0.27 ± 0.06 (0.4, 0)
Letter img-Z           0.92 ± 0.02 (0)      0.94 ± 0.01 (0.05)   0.94 ± 0.01 (0.05, 0.05)
Yeast4                 0.32 ± 0.10 (0)      0.32 ± 0.10 (0)      0.44 ± 0.12 (0.35, 0)
Abalone17vs7-8-9-10    0.28 ± 0.12 (0)      0.30 ± 0.12 (0.05)   0.44 ± 0.09 (0.4, 0.05)
Yeast6                 0.54 ± 0.13 (0)      0.55 ± 0.15 (0.05)   0.61 ± 0.11 (0.3, 0)

Table 2: F_1-score (average ± standard deviation) for the different datasets and switching ensembles. Datasets in italics are those where we achieve consistent improvements (the difference in average is at least equal to the semi-sum of the standard deviations).
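The training loop described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions: it uses scikit-learn's standard MLPClassifier (tanh activation, lbfgs solver) instead of the paper's learners with the asymmetric activations (10a)/(10b) and L-BFGS-B training, and all names (`switch_labels`, `switching_ensemble`, `ensemble_predict`) are ours:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def switch_labels(y, alpha, beta, rng):
    """Asymmetric label switching: flip C0 labels (y == 0) with rate alpha
    and C1 labels (y == 1, the minority) with rate beta."""
    flip = np.where(y == 1, rng.random(y.size) < beta,
                            rng.random(y.size) < alpha)
    return np.where(flip, 1 - y, y)

def switching_ensemble(X, y, alpha, beta, n_learners=31, seed=0):
    """Train one weak MLP (4 hidden units, tanh) per switched problem."""
    rng = np.random.default_rng(seed)
    learners = []
    for k in range(n_learners):
        y_s = switch_labels(y, alpha, beta, rng)
        mlp = MLPClassifier(hidden_layer_sizes=(4,), activation='tanh',
                            solver='lbfgs', max_iter=500, random_state=k)
        mlp.fit(X, y_s)
        learners.append(mlp)
    return learners

def ensemble_predict(learners, X, alpha, beta, qc=1.0):
    # Average o_S(x) = 2*Pr_S(C1'|x) - 1 over learners, then apply Eq. (9).
    o_s = np.mean([2 * m.predict_proba(X)[:, 1] - 1 for m in learners], axis=0)
    thr = 2 * (alpha + (1 - alpha - beta) * qc / (qc + 1)) - 1
    return (o_s > thr).astype(int)
```

Cross-validating the (α, β) pair over the 0 to 0.45 grid then amounts to wrapping these calls in a standard grid search.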
Table 2 shows the experimental results. There is a great difference of results for the four slightly imbalanced datasets: Easy problems such as Vowel0 –all the designs offer almost 100% performance– do not show significant changes, but Balance-B is not so easy, and the results of the asymmetric switching are clearly the best. When IR increases, the Breiman switching tends not to be activated, and its results for β = 0 correspond to pure initialization diversity. This is an expected result: Switching minority samples for high values of IR produces negative effects. Symmetric switching behaves in a similar way. The case of Oil is different, and it shows no significant performance changes: The reasons are the low number of instances and the high number of dimensions of this dataset, which do not permit a good estimation of the ‘a posteriori’ probabilities, which are functions of D dimensions. Other datasets, such as Winequality4, Abalone17vs7-8-9-10, and Letter img-Z, allow good estimations of ‘a posteriori’ probabilities as a consequence of having a reasonable relationship between the number of samples, N, and D.

Overall, there is some advantage for the asymmetric switching designs for increasing values of IR. These advantages are obtained in the majority of the datasets, but they are especially clear in Balance-B, Solarflare-m0, and Abalone17vs7-8-9-10. However, further increasing the IR value will require the combination of asymmetric switching with other rebalancing techniques.

It is important to note that Solarflare-m0, which is a difficult problem, has a high dimension with respect to the number of hidden units in the MLPs. In these cases, ensembles without any rebalancing technique almost never detect the minority class (only 5% on average, for Solarflare-m0). However, using high values for α enables the ensemble to detect the minority class and to offer good performance. According to our experience, this happens when the minority class is completely embedded in the majority one (high class overlapping). It is, indeed, the situation in this case, as it can be checked by
elementary procedures (such as an examination of the nearest neighbours of the minority samples).

To demonstrate what we said in the above paragraph, we present a few experiments with the Ringnorm dataset in Table 3. Ringnorm, which was originally proposed by Breiman [32], is composed of two 20-dimensional Gaussian class-conditional distributions (p(x|C_i) = N(x; μ_i, Σ_i), where μ_i and Σ_i are the mean vector and covariance matrix of the i-th class, respectively) and has one class completely embedded in the other. The likelihood for class 0 is p(x|C_0) = N(x; 0, 4I), and for class 1 is p(x|C_1) = N(x; a, I), with a = [a, a, ..., a]^T, a = 2/√20, and I being the identity matrix. We have done experiments with class 1 being the minority one (“RingnormA”) and with the opposite class configuration (“RingnormB”), which has the minority class embedded in the majority samples. All datasets have 300 samples in the minority class, and we have selected IR = 20, which is similar to Solarflare-m0, and IR = 60, to check if our argument (asymmetric switching performs well in problems with high class overlapping) is valid for higher Imbalance Ratios.

Datasets       Breiman (β)         Symmetric (α = β)    Asymmetric (α, β)
RingnormA20    0.54 ± 0.06 (0)     0.64 ± 0.05 (0.1)    0.70 ± 0.05 (0.25, 0)
RingnormA60    0.43 ± 0.06 (0)     0.51 ± 0.07 (0.05)   0.52 ± 0.07 (0.05, 0)
RingnormB20    0.21 ± 0.06 (0)     0.24 ± 0.05 (0.05)   0.75 ± 0.03 (0.4, 0.15)
RingnormB60    0.00 ± 0.01 (0)     0.00 ± 0.01 (0)      0.62 ± 0.05 (0.45, 0)

Table 3: F_1-score (average ± standard deviation) for the different versions of the Ringnorm dataset.
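A generator for these imbalanced Ringnorm variants can be sketched as follows (a numpy sketch; the function name and seeding are ours, and the class-conditional distributions follow the description above):

```python
import numpy as np

def make_ringnorm(n_minority=300, ir=20, minority=1, d=20, seed=0):
    """Sample an imbalanced Ringnorm problem: C0 ~ N(0, 4I), C1 ~ N(a, I)
    with a = (2/sqrt(d)) * ones(d).  `minority` selects which class is the
    minority one (1 -> 'RingnormA', 0 -> 'RingnormB')."""
    rng = np.random.default_rng(seed)
    n_maj = n_minority * ir
    n0, n1 = (n_maj, n_minority) if minority == 1 else (n_minority, n_maj)
    a = (2.0 / np.sqrt(d)) * np.ones(d)
    x0 = rng.normal(0.0, 2.0, size=(n0, d))        # N(0, 4I): std = 2
    x1 = rng.normal(0.0, 1.0, size=(n1, d)) + a    # N(a, I)
    X = np.vstack([x0, x1])
    y = np.concatenate([np.zeros(n0, int), np.ones(n1, int)])
    return X, y

X, y = make_ringnorm()
print(X.shape, y.mean())   # (6300, 20); minority fraction = 300/6300
```

Since C1 has the smaller covariance, it sits entirely inside the C0 cloud; setting `minority=0` reproduces the harder "RingnormB" configuration with the minority class embedded in the majority one.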
As expected, we get improvements with asymmetric switching. In particular, we obtain a huge increase in performance with RingnormB20, the dataset which is most similar to Solarflare-m0, and RingnormB60, which is even more imbalanced. This confirms what we said before: When the minority class is embedded in the majority one and D is high with respect to the number of hidden units, an ensemble using asymmetric switching offers better performance than others with Breiman or symmetric switching, which are not able to detect minority samples properly.

α \ β    0              0.05           0.1            0.15           0.2
0        0.07 ± 0.08    0.08 ± 0.07    0.06 ± 0.06    0.04 ± 0.06    0.04 ± 0.07
0.05     0.11 ± 0.07    0.08 ± 0.07    0.06 ± 0.07    0.06 ± 0.06    0.06 ± 0.08
0.1      0.10 ± 0.08    0.09 ± 0.09    0.07 ± 0.08    0.05 ± 0.06    0.06 ± 0.07
0.15     0.12 ± 0.07    0.10 ± 0.08    0.08 ± 0.07    0.07 ± 0.08    0.07 ± 0.08
0.2      0.15 ± 0.08    0.10 ± 0.08    0.10 ± 0.08    0.09 ± 0.07    0.07 ± 0.07
0.25     0.17 ± 0.06    0.14 ± 0.09    0.14 ± 0.09    0.09 ± 0.08    0.09 ± 0.06
0.3      0.21 ± 0.08    0.20 ± 0.09    0.18 ± 0.06    0.16 ± 0.07    0.12 ± 0.08
0.35     0.22 ± 0.05    0.24 ± 0.08    0.23 ± 0.08    0.21 ± 0.07    0.17 ± 0.10
0.4      0.23 ± 0.05    0.24 ± 0.07    0.21 ± 0.06    0.24 ± 0.09    0.19 ± 0.08
0.45     0.21 ± 0.04    0.18 ± 0.04    0.20 ± 0.05    0.20 ± 0.04    0.17 ± 0.04

Table 4: F_1-score (average ± standard dev.) for (α, β) pairs for Solarflare-m0.
In Table 4, we show the F_1-score values for different pairs (α, β) when dealing with Solarflare-m0. It is obvious that high values for switching majority samples (α) and low values for changing the labels of minority samples (β) are appropriate, and that other combinations offer degraded performances (although changes are smooth enough to permit effective cross-validation).

In order to provide a more complete analysis and further support for claiming that the principles we use for our designs are solid and robust enough, we have also considered another performance indicator: the Matthews Correlation Coefficient (MCC), which also takes into account the True Negatives (TN) in the calculation:

$$\mathrm{MCC} = \frac{\mathrm{TP}\cdot\mathrm{TN} - \mathrm{FP}\cdot\mathrm{FN}}{\sqrt{(\mathrm{TP}+\mathrm{FP})(\mathrm{TP}+\mathrm{FN})(\mathrm{TN}+\mathrm{FP})(\mathrm{TN}+\mathrm{FN})}}$$
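The MCC formula above translates directly to code; a small sketch (the counts in the example are invented for illustration) with the conventional zero-denominator guard:

```python
import math

def mcc(tp, tn, fp, fn):
    # Matthews Correlation Coefficient from the four confusion-matrix counts;
    # defined as 0 when any marginal is empty (zero denominator).
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

print(mcc(tp=30, tn=940, fp=10, fn=20))   # ≈ 0.656
```

Unlike F_1, this indicator rewards correct rejections of the majority class as well, which is why the selected (α, β) pairs can differ slightly between the two measures.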
Datasets               Breiman (β)          Symmetric (α = β)    Asymmetric (α, β)
Ecoli-imU              0.54 ± 0.12 (0)      0.58 ± 0.10 (0.05)   0.65 ± 0.08 (0.3, 0.05)
Satimage4              0.57 ± 0.03 (0)      0.58 ± 0.03 (0.05)   0.59 ± 0.03 (0.2, 0)
Vowel0                 0.98 ± 0.03 (0.05)   0.99 ± 0.02 (0.1)    0.99 ± 0.02 (0.1, 0)
Balance-B              0.29 ± 0.17 (0)      0.29 ± 0.18 (0.05)   0.46 ± 0.17 (0.1, 0)
Abalone9vs18           0.51 ± 0.13 (0)      0.53 ± 0.13 (0.05)   0.59 ± 0.13 (0.25, 0)
Solarflare-m0          0.07 ± 0.09 (0.05)   0.07 ± 0.09 (0.2)    0.20 ± 0.08 (0.4, 0.05)
Oil                    0.58 ± 0.13 (0)      0.59 ± 0.13 (0.05)   0.61 ± 0.12 (0.15, 0)
Winequality4           0.25 ± 0.05 (0)      0.25 ± 0.05 (0)      0.27 ± 0.02 (0.45, 0)
Letter img-Z           0.92 ± 0.02 (0)      0.94 ± 0.01 (0.05)   0.94 ± 0.01 (0.05, 0.05)
Yeast4                 0.35 ± 0.13 (0)      0.35 ± 0.13 (0)      0.42 ± 0.12 (0.35, 0)
Abalone17vs7-8-9-10    0.30 ± 0.12 (0)      0.33 ± 0.13 (0.05)   0.43 ± 0.09 (0.4, 0.05)
Yeast6                 0.54 ± 0.15 (0)      0.56 ± 0.15 (0.05)   0.60 ± 0.13 (0.3, 0.05)

Table 5: MCC (average ± standard deviation) for the different datasets and switching ensembles. Datasets in italics are those where we achieve consistent improvements (the difference in average is at least equal to the semi-sum of the standard deviations).
Table 5 is equivalent to Table 2, but showing MCC instead of F_1-score. As expected, the selected parameters are not exactly the same, because the exploration to optimize this measure leads to different values. However, they are qualitatively equivalent, and the same conclusions can be extracted. It is important to note that, in most cases, we obtain the best results of this measure with the same –or very close– α and β values that achieved the best results of F_1-score.

In practice, asymmetric switching has to be combined with other principled (see [27]) rebalancing procedures to provide high-performance solutions for real-world imbalanced classification problems. Thus, the number of non-trainable parameters (rebalance intensity, combination proportions...) must be determined together with those of the different rebalancing procedures (α and β for asymmetric switching). Needless to say, this will require more sophisticated search procedures, such as genetic algorithms or similar.
4. Conclusions

In this correspondence, an asymmetric binary label switching algorithm and its theoretical analysis are presented. The additional degree of freedom which asymmetry provides permits the application of principled versions of this new algorithm for dealing with imbalanced classification problems. Some direct experiments show that asymmetric switching ensembles resist imbalance, and they support the validity of the proposed analysis.

It is clear that asymmetric switching can be combined with other principled rebalancing schemes to solve real-world classification problems in the best possible manner. The inclusion of asymmetric switching would be interesting in order to benefit from its intrinsic advantages, such as the absence of resampling risks and of the difficulties of defining good sample generation methods when the observed variables are discrete. This research avenue requires more study and experimental work, including the use of deep learners and the application to dichotomic versions of multiclass problems.

Acknowledgment

This work has received the support of BBVA Foundation Grant “2-step BAyesian Re-BAlancing Studies” (2-BARBAS).

References

[1] L. Rokach, Pattern Classification Using Ensemble Methods, World Scientific, Singapore, 2010.

[2] R. E. Schapire, Y. Freund, Boosting: Foundations and Algorithms, MIT Press, Cambridge, MA, 2012.
[3] L. Breiman, Bagging predictors, Machine Learning 24 (2) (1996) 123–140.

[4] L. Breiman, Randomizing outputs to increase prediction accuracy, Machine Learning 40 (2000) 229–242.

[5] V. Gómez-Verdejo, M. Ortega-Moral, J. Arenas-García, A. R. Figueiras-Vidal, Boosting by weighting critical and erroneous samples, Neurocomputing 69 (2006) 679–685.

[6] V. Gómez-Verdejo, J. Arenas-García, A. R. Figueiras-Vidal, A dynamically adjusted mixed emphasis method for building boosting ensembles, IEEE Trans. on Neural Networks 19 (2008) 3–17.

[7] A. Ahachad, L. Álvarez-Pérez, A. R. Figueiras-Vidal, Boosting ensembles with controlled emphasis intensity, Pattern Recognition Letters 88 (2017) 1–5.

[8] R. F. Alvear-Sandoval, A. R. Figueiras-Vidal, On building ensembles of stacked denoising auto-encoding classifiers and their further improvement, Information Fusion 39 (2018) 41–52.

[9] R. F. Alvear-Sandoval, J. L. Sancho-Gómez, A. R. Figueiras-Vidal, On improving CNNs performance: The case of MNIST, Information Fusion 52 (2019) 106–109.

[10] H. He, E. A. García, Learning from imbalanced data, IEEE Trans. on Knowledge and Data Engineering 21 (2009) 1263–1284.
[11] V. López, A. Fernández, S. García, V. Palade, F. Herrera, An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics, Information Sciences 250 (2013) 113–141.

[12] H. He, Y. Ma (Eds.), Imbalanced Learning: Foundations, Algorithms, and Applications, IEEE-Wiley, Hoboken, NJ, 2013.

[13] A. Fernández, V. López, M. Galar, M. J. Del Jesús, F. Herrera, Analysing the classification of imbalanced data-sets with multiple classes: Binarization techniques and ad-hoc approaches, Knowledge-Based Systems 42 (2013) 113–141.

[14] P. Branco, L. Torgo, R. P. Ribeiro, A survey of predictive modeling on imbalanced domains, ACM Computing Surveys 49 (2016) 31:1–50.

[15] A. Fernández, S. García, M. Galar, R. C. Prati, B. Krawczyk, F. Herrera, Learning from Imbalanced Data Sets, Springer, Berlin, 2018.

[16] A. Kowalczyk, B. Raskutti, One class SVM for yeast regulation prediction, ACM SIGKDD Explorations Newsl. 4 (2002) 99–100.

[17] C.-Y. Yang, J.-S. Yang, J.-J. Wang, Margin calibration in SVM class-imbalanced learning, Neurocomputing 73 (2009) 397–411.

[18] R. Batuwita, V. Palade, FSVM-CIL: Fuzzy support vector machines for class imbalance learning, IEEE Trans. on Fuzzy Systems 18 (2010) 558–571.

[19] B. Gu, V. S. Sheng, K. Y. Tay, W. Romano, S. Li, Cross validation
through two-dimensional solution surface for cost-sensitive SVM, IEEE Trans. on Pattern Analysis and Mach. Intell. 39 (2017) 1103–1121.

[20] W. N. Robinson, A. Aria, Sequential fraud detection for prepaid cards using hidden Markov model divergence, Expert Systems with Applications 91 (2018) 235–251.

[21] C. Elkan, The foundations of cost-sensitive learning, in: Proc. 17th Intl. Joint Conf. on Artificial Intelligence, Lawrence Erlbaum, 2001, 973–978.

[22] A. Estabrooks, T. Jo, N. Japkowicz, A multiple resampling method for learning from imbalanced data sets, Computational Intelligence 20 (2004) 18–36.

[23] S. Hido, H. Kashima, Y. Takahashi, Roughly balanced bagging for imbalanced data, Statistical Analysis and Data Mining: The ASA Data Science Journal 2 (2009) 412–426.

[24] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, SMOTE: Synthetic minority over-sampling technique, J. of Artificial Intelligence Res. 16 (2002) 321–357.

[25] S. González, S. García, M. Lázaro, A. R. Figueiras-Vidal, F. Herrera, Class switching according to nearest enemy distance for learning from highly imbalanced data-sets, Pattern Recognition 70 (2017) 12–24.

[26] L. M. Bregman, The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming, USSR Computational Mathematics and Mathematical Physics 7 (1967) 200–217.
[27] A. Benítez-Buenache, L. Álvarez-Pérez, V. J. Mathews, A. R. Figueiras-Vidal, Likelihood ratio equivalence and imbalanced binary classification, Expert Systems with Applications 130 (2019) 84–96.

[28] J. Cid-Sueiro, J. I. Arribas, S. Urbán-Muñoz, A. R. Figueiras-Vidal, Cost functions to estimate a posteriori probabilities in multiclass problems, IEEE Trans. on Neural Networks 10 (1999) 645–656.

[29] J. Cid-Sueiro, A. R. Figueiras-Vidal, On the structure of strict sense Bayesian cost functions and its applications, IEEE Trans. on Neural Networks 12 (2001) 445–455.

[30] J. Alcalá-Fdez, A. Fernández, J. Luengo, J. Derrac, S. García, L. Sánchez, F. Herrera, KEEL data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework, J. of Multiple-Valued Logic & Soft Computing 17 (2011) 255–287.

[31] D. Dua, C. Graff, UCI Machine Learning Repository, http://archive.ics.uci.edu/ml (2019).

[32] L. Breiman, Bias, variance, and arcing classifiers, Tech. Rep. 460, Statistics Department, University of California, Berkeley (1996).
Declaration of interests

☒ The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.