Information Fusion 26 (2015) 96–102
Post-aggregation of classifier ensembles

Adil Omari *, Aníbal R. Figueiras-Vidal

Universidad Carlos III de Madrid, Department of Signal Theory and Communications, Av. de la Universidad 30, Leganés, Madrid, Spain
Article info
Article history: Received 22 July 2014; received in revised form 20 January 2015; accepted 22 January 2015; available online 2 February 2015.
Keywords: Classification; Ensemble; Maximal margin; Post-aggregation
Abstract
We propose to apply an adequate form of an ensemble's output to the last level of an additional classifier – the post-aggregation element – as a method to improve the ensemble's performance. Our experimental results prove that a Gate-Generated Functional Weight Classifier post-aggregation serves to reach this objective, both in situations in which all data are available everywhere and when some features are missing for the post-aggregation task – a case which is relevant for distributed classification problems. Post-aggregation techniques can be especially useful for massive ensembles (those integrated by many learners) – such as most committees, which do not allow trainable first aggregations – and for human decision fusion, because it is unclear which features are considered in this kind of process.
© 2015 Elsevier B.V. All rights reserved.
1. Introduction

1.1. The concept of post-aggregation

Machine ensembles consist of diverse single machines, or learners, whose outputs are aggregated to constitute the ensemble output. The learners' diversity permits improving the performance of monolithic designs as well as reducing training difficulties [1,2].

Two main families of ensembles exist. Committees are obtained by training the learners in a first step, and then aggregating their outputs. Representative cases are bagging [3], label change [4], random forests [5], and stacking [6]. The last method is based on training a number of versions of different learners, each excluding one of several partitions of the labeled examples. An aggregation unit is trained with all the examples, but using the outputs of the learners' versions that have not seen each example. An overall refining training is finally applied. The need to use different learners and to train the aggregation imposes a limited size – i.e., a moderate number of learners – on these ensembles. However, the trainable aggregation serves to obtain reasonable performances. On the contrary, the rest of the above-mentioned committees require massive forms – i.e., a very high number of learners – to offer their best performances. This forces the use of non-trainable aggregation schemes, such as majority voting or averaging.
This work has been partly supported by the Spanish Ministry of Science and Innovation, under Grant TIN 2011-24533.
* Corresponding author.
E-mail addresses: [email protected] (A. Omari), [email protected] (A.R. Figueiras-Vidal).
http://dx.doi.org/10.1016/j.inffus.2015.01.003
1566-2535/© 2015 Elsevier B.V. All rights reserved.
As a consequence, their performances are usually worse than those provided by some designs of the second family of ensembles, in which the learners and the aggregation element are simultaneously trained. Thus, boosting [7–9] has been found to be clearly superior to standard committees [10,11].

According to the above, investigating new trainable aggregation methods that are suitable for massive committees is a relevant research line, because it can allow a size reduction and/or a performance improvement of these easy-to-design ensembles. There have been some advances in this direction, such as considering local measures to design weighted aggregations [12]. Obviously, other principled approaches deserve attention.

In this paper, we address the problem by proposing the concept of post-aggregation. This fusion procedure includes two steps. The first is to use a traditional non-trainable aggregation unit for the outputs of the ensemble learners. In the second step, the true post-aggregation, a soft version of the previous aggregation – the average itself, or the relative numbers of class votes if voting is used – is introduced as an input into the last level of a complementary learning machine which also reads the observations. It can be, for example, a Multi-Layer Perceptron (MLP) classifier, the soft previously aggregated output being applied just before the MLP output activation, with its corresponding trainable weight. This complementary machine is trained in a conventional manner. It must be remarked that, theoretically, the worst result would equal that of the non-trainable fusion. Nevertheless, there is a reasonable possibility of improvement, because the observed variables can allow correcting wrong decisions.
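As an illustration of this idea, the following is our own minimal sketch (not the authors' code; module and parameter names are arbitrary) of an MLP post-aggregator in which the soft aggregated ensemble output is injected just before the output activation, with its own trainable weight.

```python
# Minimal sketch (ours): an MLP post-aggregator. The soft aggregated ensemble
# output s(x) is injected just before the output activation, with its own
# trainable weight; the module is then trained in a conventional manner.
import torch
import torch.nn as nn

class MLPPostAggregator(nn.Module):
    def __init__(self, n_features, n_hidden=20):
        super().__init__()
        self.hidden = nn.Linear(n_features, n_hidden)
        self.out = nn.Linear(n_hidden, 1)
        self.fusion_weight = nn.Parameter(torch.tensor(1.0))  # weight for s(x)

    def forward(self, x, soft_aggregation):
        # soft_aggregation: average of the learners' soft outputs, shape (batch, 1)
        h = torch.tanh(self.hidden(x))
        logit = self.out(h) + self.fusion_weight * soft_aggregation
        return torch.sigmoid(logit)  # output activation applied after the injection
```

With the fusion weight initialized to a positive value and the remaining weights small, the starting point essentially reproduces the non-trainable fusion, which is consistent with the worst-case remark above.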
1.2. A powerful post-aggregation machine, and some additional applications

Needless to say, the potential improvement will be more likely if a powerful post-aggregation unit is employed. This leads us to consider a high-performance classification machine that we have previously introduced, the Gate-Generated Functional Weight Classifier (GG-FWC) [13]. It appears as an evolution of the well-known Mixture-of-Experts (MoE) ensembles [14]. An interesting overview of MoEs is [15]; recently, Gaussian Process based formulations have also appeared [16,17]. MoEs' performance in classification tasks seems to be limited by their maximum-likelihood training. GG-FWCs can be trained by means of the more appropriate Maximal Margin – or Support Vector Machine – algorithms [18–20], which maximize a measure of the separation between samples of different classes. On the other hand, as we will see later, the architecture of our GG-FWCs gives them global–local approximation capabilities, i.e., an expressive power which allows establishing both extensive smooth regions and highly variable local forms for the classification hypersurfaces. The advantage of these capabilities has been demonstrated for other high-performance classification ensembles, such as [21,22] with respect to standard boosting designs. This global–local approximation characteristic makes GG-FWCs an attractive alternative to the usual kernel-based SVM classifiers, even compared to "ad hoc" architectures [23] and growing and adaptive designs [24,25].

Two more issues must be mentioned here. First, physically distributed systems are becoming more and more important. The excellent tutorial [26] presents a complete overview of filtering-based distributed estimation and decision systems. Curiously enough, the most studied systems consist of structurally identical learners which access basically common, or shared, observations – i.e., samples with the same features. This is not a general model, because there are many practical situations in which the same features are not available everywhere and the distributed units have different architectures. In any case, a high number of distributed learners and/or communication capacity limitations force direct (non-trainable) fusion mechanisms, such as those applied to standard committees. Thus, it is important to check whether the post-aggregation ideas we propose can be useful to improve the performance of distributed systems in general situations. Here, we will present experimental results for an important kind of case, that in which part of the features are not available at the place where the fusion is carried out. Note that this lack of information increases the difficulty of a successful post-aggregation.

The second issue is the usefulness of the post-aggregation concept to extract benefit from human decisions. A group of experts, for example, can provide their decisions for an instance, but it is unclear which features they consider and, even more, how they process them. In any case, there is some diversity among them; consequently, aggregation is adequate. However, the traditional ways of aggregating their decisions are non-trainable. But, if some features can be read and there are enough labeled observations, a post-aggregation process can be applied to improve the simple aggregation results. This is a research avenue which can be of great importance in many application areas: health, economy, sociology, etc.
We would like to emphasize that the main objective of this contribution is to introduce post-aggregation as an efficient technique to obtain better results from massive ensembles, a matter which is important by itself and for distributed learning situations. The concrete examples we analyze have the purpose of demonstrating the usefulness of the post-aggregation concept.

The rest of the paper is organized as follows. Section 2 briefly reviews GG-FWCs and indicates how they will be designed
for post-aggregation objectives. In Section 3, we describe the experiments and present and discuss their results, including some comments on computational loads, in the common-information situation. Section 4 does the same for the situation in which the post-aggregation unit accesses limited information with respect to the learners. The main conclusions of our work and a number of suggestions for further research close the paper.

In our discussion and experiments, we will address binary problems. Extensions to multi-class situations follow well-known formulations.

2. Gate-Generated Functional Weight Classifiers and their use in post-aggregation

2.1. The monolithic GG-FWC

Fig. 1 shows the architecture of the monolithic GG-FWC we introduced in [13]. That paper explained how it can be obtained from a MoE ensemble with linear learners by reordering the summations in the output formula and selecting a kernel gate, which provides enough expressive power to the resulting machine. There is no need for complex training if the kernels are previously selected according to an appropriate algorithm (note that the kernel dispersion can be established by means of a Cross-Validation (CV) process). The form of the GG-FWC output is
o(x) = \sum_{d=0}^{D} \left( \sum_{r=0}^{R} w_{dr}\, k_r(x) \right) x_d \qquad (1)

where x_0 = 1, k_0(x) = 1, x = [x_1 \ldots x_D]^T is a D-dimensional observation, and k_r(x), r = 1, \ldots, R, are the kernel outputs. Note that, re-indexing the (r, d) pairs as s = r(D + 1) + d and calling z_s(x) = k_r(x)\, x_d, formula (1) becomes

o(x) = \sum_{s=0}^{S-1} w_s\, z_s(x) \qquad (2)
which, since {z_s(x)} are just numbers for a given input, is a linear-in-the-parameters form in {w_s}. Therefore, an MM/SVM linear algorithm can be applied to determine these parameters – see [18–20] for the corresponding Lagrangian-formulation-based optimization procedures. Accordingly, an improved classification performance can be expected with respect to the maximum-likelihood training techniques used for MoEs.

Additionally, {z_s(x)} includes global ({x_d}), local ({k_r(x)}), and local–global ({x_d k_r(x)}) elements. This gives GG-FWCs a natural global–local approximation capability, i.e., they can construct classification borders containing both extensive smooth regions and wrinkly small areas.

The previous selection of kernel centers among the training samples can be done with many auxiliary algorithms, such as those in [27,28]. Our experience indicates that results are very similar if the selection mechanism pays attention to both the difficulty of classifying each sample and its proximity to a reasonably established classification frontier, which are the essential aspects when evaluating the importance of training examples [29,30]. Consequently, we employed a simple two-step procedure to design the highest-performance machine presented in [13], GG-FWC 3, which used Gaussian kernels. First, we preselected center candidates by means of the Shin–Cho algorithm [31], which measures the proximity to the border and the classification difficulty with an auxiliary K-Nearest Neighbor (K-NN) classifier. Then, the Adaptive Pattern Classifier-III (APC-III) algorithm [32] sequentially selected the final centers, starting with the nearest-to-the-border preselected example and excluding the candidates that lie inside a hypersphere of a given radius around it.
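The linear-in-the-parameters form (2) suggests a direct implementation path: build the products z_s(x) = k_r(x) x_d (including the constant kernel and the constant feature) and fit a linear maximal-margin classifier on them. The following sketch is our own minimal illustration under those assumptions; the kernel centers and widths are taken as given, and sklearn's LinearSVC stands in for the MM/SVM algorithm of [18–20].

```python
# Minimal sketch (ours) of the GG-FWC expansion of Eqs. (1)-(2) with Gaussian
# kernels, followed by a linear maximal-margin fit on z_s(x) = k_r(x) * x_d.
import numpy as np
from sklearn.svm import LinearSVC

def gg_fwc_features(X, centers, sigmas):
    """Build z_s(x) = k_r(x) * x_d, with k_0(x) = 1 and x_0 = 1 included."""
    # Gaussian kernel activations, shape (n_samples, R)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-d2 / (2.0 * sigmas ** 2))            # k_1 .. k_R
    K = np.hstack([np.ones((X.shape[0], 1)), K])     # prepend k_0 = 1
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])    # prepend x_0 = 1
    # all products k_r(x) * x_d -> shape (n_samples, (R+1)*(D+1))
    return (K[:, :, None] * Xb[:, None, :]).reshape(X.shape[0], -1)

# Usage (hypothetical arrays; centers/sigmas assumed already selected):
# Z_train = gg_fwc_features(X_train, centers, sigmas)
# clf = LinearSVC(C=1.0).fit(Z_train, y_train)
# o = clf.decision_function(gg_fwc_features(X_test, centers, sigmas))
```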
Fig. 1. Gate-Generated Functional Weight Classifier. x: input sample; o: output; k_r(x): r-th kernel output; w_d(x): d-th feature weight.
To determine the values of the non-trainable parameters – the kernel dispersion parameter σ, a scale factor for the hypersphere radius, and the parameter C of the MM/SVM algorithm – standard CV techniques were used. Experimental evidence in [13] supports that GG-FWCs offer extraordinary performances – better than SVMs and Real AdaBoost (RAB), in general – even though they are shallow monolithic machines. The price to be paid is a high computational charge for their design, mainly due to the CV requirements.

2.2. Post-aggregation with GG-FWCs

We said above that post-aggregation means injecting a soft output of the firstly applied conventional fusion unit just before the final output of the classification machine which is used as the post-aggregation element. If this machine is a GG-FWC, we only need to include this output as an unweighted additional variable z_{S+1}(x), because the combination weights are provided by the MM/SVM algorithm (a minimal illustrative sketch is given at the end of this subsection).

We will consider here that the ensemble is designed first, and the post-aggregation is applied afterwards. Although in some cases a joint training is possible, we prefer to keep the committee concept. Moreover, a joint training could require a huge computational effort, which would only be justified for some delicate and/or expensive applications, and always after favorable preliminary experiments such as those we address here.

Since the soft output of the first aggregation is available, the kernel center selection can be based on it. Samples that are near the border are preselected, and the APC-III algorithm completes the process. This reduces the computational effort needed to train the post-aggregation GG-FWC, because another auxiliary classifier is not necessary. Obviously, the CV has to include a parameter establishing the threshold for acceptable values of the proximity to the border.
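To make this concrete, here is our own minimal sketch (not the authors' code), reusing gg_fwc_features from the sketch in Section 2.1: the soft first aggregation – the plain average of the learners' outputs – is appended as the extra variable before the linear MM/SVM fit, which then provides all combination weights.

```python
# Minimal sketch (ours): GG-FWC post-aggregation of a committee.
# The soft first aggregation (average of learner outputs) is appended as an
# extra, initially unweighted variable; the MM/SVM fit provides all weights.
import numpy as np
from sklearn.svm import LinearSVC

def post_aggregation_features(X, learner_outputs, centers, sigmas):
    """learner_outputs: array (n_samples, n_learners) of soft learner outputs."""
    soft_aggregation = learner_outputs.mean(axis=1, keepdims=True)  # first (non-trainable) fusion
    Z = gg_fwc_features(X, centers, sigmas)       # expansion from the previous sketch
    return np.hstack([Z, soft_aggregation])       # z_{S+1}(x): the ensemble's soft output

# Usage (hypothetical arrays X_tr, O_tr, y_tr):
# post_agg = LinearSVC(C=1.0).fit(post_aggregation_features(X_tr, O_tr, centers, sigmas), y_tr)
```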
3. Experiments with common observations

We analyze here the post-aggregation results when the corresponding units read the same features as the ensemble learners. Bagging and RAB ensembles are post-aggregated by means of GG-FWCs (although RAB is not a committee, we want to check whether post-aggregation can be useful in general). We will compare these post-aggregation results with those of bagging and boosting, of nonlinear SVMs – which are state-of-the-art shallow monolithic classifiers – and of the above-described GG-FWC 3.

For an easy appreciation of the differences among classification performances, we address a series of ten well-known binary datasets of different sizes, dimensionalities, and difficulties: kwok [33], phoneme [34], ripley [35], and abalone, breast, contraceptive, hepatitis, image, ionosphere and spam [36]. Table 1 shows their main characteristics, including the numbers of positive and negative examples (C₁/C₋₁) in the training and test sets (which are pre-established in order to allow fair comparisons among different classifiers).

Table 1
Main characteristics of the ten datasets used in the simulations.

Problem         Brief name   D    Train (C₁/C₋₁)   Test (C₁/C₋₁)
abalone         aba          8    1238/1269        843/827
breast          bre          9    145/275          96/183
contraceptive   con          9    506/377          338/252
hepatitis       hep          19   70/23            53/9
image           ima          18   821/1027         169/293
ionosphere      ion          33   101/100          124/26
kwok            kwo          2    300/200          6120/4080
phoneme         pho          5    952/2291         634/1527
ripley          rip          2    125/125          500/500
spam            spa          57   1673/1088        1115/725

3.1. Machine designs

For the bagging ensembles, we use MLPs as learners and a simplified design method which is well known to practitioners. First, the size of a single one-hidden-layer MLP for each problem is selected by means of a 5-fold, 20-run (random zero-mean, 0.01-variance Gaussian initialization of the weights) CV process exploring values from 2 to 20 in unit steps. Then, we create 50 different sets of 400 MLPs of that size, each MLP being trained on a different bootstrap resampling with the size of the original training sample population. After that, we construct 50 bagging ensembles of 3–301 elements by taking at random the MLPs from each one of the sets. The bagging aggregation is a direct arithmetic average. All these previous steps serve to select the number of learners N_b according to the average performance of the corresponding 50 results. The final designs are obtained by repeating the above training for the selected size, and using the average of the learners' outputs as the ensemble output.
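The following sketch (ours, not the authors' code) illustrates this bagging design in simplified form; sklearn's MLPClassifier stands in for the paper's MLP learners, and the initialization and size-selection details are not reproduced.

```python
# Minimal sketch (ours) of the bagging design described above: MLPs of a
# CV-selected size are trained on bootstrap resamples and aggregated by a
# direct arithmetic average of their soft outputs.
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_bagging(X, y, n_hidden, n_learners, seed=0):
    rng = np.random.default_rng(seed)
    learners = []
    for _ in range(n_learners):
        idx = rng.integers(0, len(X), size=len(X))   # bootstrap resample, original size
        mlp = MLPClassifier(hidden_layer_sizes=(n_hidden,), max_iter=500)
        learners.append(mlp.fit(X[idx], y[idx]))
    return learners

def bagging_soft_output(learners, X):
    # direct (non-trainable) aggregation: average of the learners' soft outputs
    return np.mean([l.predict_proba(X)[:, 1] for l in learners], axis=0)
```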
RAB ensembles are built as in [13], using MLP learners whose sizes M are determined by CV and applying a simple stopping criterion [29,30]; T is the resulting number of learners. The reference (Gaussian-kernel) SVMs are also designed by CV in a standard manner [13].

The GG-FWCs have Gaussian kernels whose centroids are obtained from the training samples whose overall normalized ensemble output lies inside [−ε, ε], ε being a parameter to be selected by CV. A second selection step of the APC-III type starts at the sample with the minimal absolute ensemble output value and sequentially excludes samples that are inside a hypersphere of radius h·D_min, D_min being the average nearest intersample distance and h a parameter to be selected by CV. The "variance" σ_r² of the Gaussian kernel for each centroid c_r in
k_r(x) = \exp\!\left( - \frac{\lVert x - c_r \rVert^2}{2\,\sigma_r^2} \right) \qquad (3)

is proportional to the sample variance of the selected samples inside the corresponding hypersphere,

\sigma_r^2 = \delta\, \frac{1}{\# N_r} \sum_{k \in N_r} \lVert x^{(k)} - c_r \rVert^2 \qquad (4)
N_r denoting these selected samples and #N_r their number. The proportionality parameter δ is also selected by means of CV. Finally, CV also serves to select the (inverse) regularization parameter C of the MM/SVM formulation. We initially explore the values listed below (after the following illustrative sketch).
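This sketch (ours, not the authors' code) puts the above ingredients together: preselection of samples whose soft ensemble output lies in [−ε, ε], an APC-III-style exclusion with radius h·D_min, and per-center variances following Eq. (4); the fallback for degenerate neighborhoods is our own addition.

```python
# Minimal sketch (ours) of the post-aggregation kernel design described above.
import numpy as np

def select_centers_and_widths(X, soft_output, eps, h, delta):
    d_all = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d_all, np.inf)
    D_min = d_all.min(axis=1).mean()          # average nearest intersample distance
    radius = h * D_min

    cand = np.where(np.abs(soft_output) <= eps)[0]       # near-border preselection
    cand = cand[np.argsort(np.abs(soft_output[cand]))]   # start at minimal |output|
    available = set(cand.tolist())
    centers, sigmas2 = [], []
    for i in cand:
        if i not in available:
            continue
        inside = [j for j in available if np.linalg.norm(X[j] - X[i]) <= radius]
        available -= set(inside)              # exclude candidates inside the hypersphere
        centers.append(X[i])
        sq_dists = [np.linalg.norm(X[j] - X[i]) ** 2 for j in inside if j != i]
        # Eq. (4); the fallback for singleton neighborhoods is our own safeguard
        sigmas2.append(delta * np.mean(sq_dists) if sq_dists else delta * radius ** 2)
    return np.array(centers), np.sqrt(np.array(sigmas2))
```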
– ε: 0.1; 0.2; 0.3; 0.4; 0.5.
– h: 1; 2; 3; 4; 5.
– δ: from 0.2 to 4, in 0.2 steps.
– C: powers of 10 between 10^-2 and 10^4.
These grids are explored with a 5-fold, 50-run CV. When the CV indicates an extreme value, the interval is extended. This occurs:
– when using bagging ensembles, for ε in contraceptive, image, kwok, and phoneme (extension: 0.6, 0.7) and in hepatitis (extension: 0.6, 0.7, 0.8); for h in abalone (extension: 6, 7, 8) and contraceptive (extension: 6, 7); for δ in kwok (extension: 5 to 26, in unit steps); and for C in hepatitis and kwok (extension: 0.001);
– when using RAB, for ε in abalone and hepatitis (extension: 0.6, 0.7); for h in ripley (extension: 6, 7); and for C in hepatitis and ionosphere (extension: 0.001).
After this first selection, a higher-resolution exploration of C is carried out, exploring another 8 values between the best value and each of its immediate neighbors. The same was necessary for δ in kwok.

The GG-FWC 3 single machines are designed in the same manner, but using the Shin–Cho preselection of centers, as in [13].

3.2. Results and their discussion

Table 2 presents the experimental results: average ± standard deviation error rates over 50 runs for the designs that are sensitive to the training process, i.e., bagging, RAB, and their GG-FWC-based post-aggregations. Note that SVM and GG-FWC 3 give just a single error rate because they are trained by deterministic algorithms. We denote our post-aggregation methods by B-GG-FWC for bagging and RAB-GG-FWC for RAB.

To compare the post-aggregation designs with SVM and GG-FWC 3, the mean and standard deviation are the only relevant figures, since we just check whether a distribution lies above or below a given value. In the other cases, applying a statistical test is adequate in order to confirm the significance of the differences. Since these experiments are not independent, a non-parametric test is appropriate, and we have applied the Wilcoxon Rank Sum test [37,38]. Note that the Wilcoxon test only indicates whether we deal with statistically different populations; the average and standard deviation then decide which method is superior when such a difference appears.
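For instance (our own illustration, with illustrative numbers), the test can be applied to two sets of 50 per-run error rates as follows; scipy's ranksums implements the Wilcoxon Rank Sum comparison.

```python
# Minimal sketch (ours, with illustrative numbers): Wilcoxon Rank Sum test between
# the 50-run error-rate populations of two designs, at the usual 5% level.
import numpy as np
from scipy.stats import ranksums

rng = np.random.default_rng(0)
errors_bagging = rng.normal(19.7, 0.1, size=50)    # hypothetical per-run error rates (%)
errors_b_gg_fwc = rng.normal(18.9, 0.1, size=50)

stat, p_value = ranksums(errors_bagging, errors_b_gg_fwc)
significant = p_value < 0.05
# If significant, the design with the lower mean error rate is declared superior.
print(f"p = {p_value:.3g}, statistically different populations: {significant}")
```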
Table 2
Average error rate (%) (± standard deviation over 50 runs, when applicable) for SVM, bagging, RAB, GG-FWC 3, and the proposed post-aggregation schemes on the ten analyzed problems. CV-selected parameter values are given in parentheses.

Problem  SVM (σ/C)            Bagging (M/N_b)      RAB (M/T)                    GG-FWC 3 (K/δ/h/C)         B-GG-FWC (ε/h/δ/C)             RAB-GG-FWC (ε/h/δ/C)
aba      19.8 (2.6/10)        19.7 ± 0.1 (8/13)    19.4 ± 0.02 (4/31.2 ± 0.4)   18.9 (4/1.8/14/10)         18.9 ± 0.1 (0.4/7/0.4/0.05)    18.9 ± 0.3 (0.6/3/0.2/5)
bre      3.2 (1.2/1)          2.1 ± 0.1 (6/293)    2.6 ± 0.5 (6/9.7 ± 3.6)      3.2 (3/3.4/13/10)          2.1 ± 0.05 (0.4/4/0.2/0.1)     2.1 ± 0.2 (0.3/4/0.2/0.1)
con      29.3 ± 1.4 (4.6/90)  29.5 ± 1.5 (3/21)    29.0 ± 0.2 (2/33.7 ± 0.7)    28.8 ± 1.2 (2/1.6/17/0.1)  28.9 ± 0.4 (0.6/5/1/0.5)       23.0 ± 0.4 (0.3/3/3.8/0.1)
hep      14.5 (2.0/0.0001)    8.9 ± 1.3 (10/81)    8.6 ± 1.6 (17/10.6 ± 3.2)    12.9 (2/0.6/2/0.9)         6.4 ± 0.2 (0.7/3/0.2/0.01)     6.9 ± 1.0 (0.6/3/0.2/0.01)
ima      3.2 (1.4/100)        5.8 ± 0.4 (6/7)      2.5 ± 0.04 (11/19.6 ± 0.4)   2.6 (2/0.4/2/100000)       3.3 ± 0.2 (0.5/2/2.6/0.2)      2.4 ± 0.1 (0.3/2/0.4/0.1)
ion      2.0 (1.4/8)          7.6 ± 0.9 (2/31)     4.9 ± 0.9 (5/13.4 ± 4.5)     6.7 (2/0.6/2/1)            3.9 ± 0.4 (0.2/1/0.4/0.1)      3.7 ± 0.6 (0.3/2/0.2/0.01)
kwo      12.1 (0.4/3)         19.9 ± 0.06 (12/79)  11.7 ± 0.01 (15/29.3 ± 0.1)  12.1 (2/0.8/5/1)           11.5 ± 0.03 (0.5/3/25.0/0.01)  11.7 ± 0.1 (0.3/2/0.2/1)
pho      11.1 ± 0.4 (0.4/9)   21.3 ± 0.6 (2/25)    14.0 ± 0.07 (60/27.7 ± 0.3)  11.8 ± 0.5 (2/0.4/1/100)   11.7 ± 0.1 (0.6/3/2.0/5)       11.6 ± 0.1 (0.2/3/2/5)
rip      9.6 (0.8/9)          11.2 ± 0.2 (24/67)   9.7 ± 0.01 (48/28.9 ± 0.2)   9.1 (2/3.8/8/100000)       8.9 ± 0.2 (0.2/4/3.4/1)        8.9 ± 0.2 (0.2/5/2/0.1)
spa      6.3 ± 0.7 (6.8/9)    6.3 ± 0.4 (8/83)     5.9 ± 0.09 (7/26.2 ± 0.6)    6.2 ± 0.7 (2/0.2/19/0.1)   5.8 ± 0.2 (0.2/1/0.4/1)        3.6 ± 0.2 (0.3/1/0.6/0.1)
First, we will discuss whether post-aggregation serves to improve the performance of the baseline ensemble. Then, we will evaluate the post-aggregation designs in absolute terms.

When comparing B-GG-FWC with bagging, the differences in breast and contraceptive are not significant. Note that bagging is clearly better than SVM and RAB just for breast, and similar for contraceptive and spam. Thus, it seems that the high performance of bagging, the baseline ensemble, makes post-aggregation not effective. In all other problems, B-GG-FWC improves bagging's results. With respect to RAB, RAB-GG-FWC is always better except in kwok (equal), also probably because of the high performance of the baseline ensemble.

B-GG-FWC is better than the other reference machines except:
– For contraceptive (equal) and image, ionosphere, and phoneme (worse) when compared to SVM. In the first case, SVM is similar to bagging and, as said above, their high performance does not allow getting an advantage from post-aggregating; and image is much better solved by an SVM than by bagging. Ionosphere and phoneme are imbalanced problems, and purely local classifiers, such as SVMs, usually offer an advantage for this kind of database.
– With respect to RAB, for contraceptive (tie) and image (loss). Again, the reason is that RAB is slightly better than bagging for contraceptive, and clearly better for image, bagging being the baseline ensemble for B-GG-FWC.
– Compared to GG-FWC 3, in abalone and contraceptive (ties) and image (loss), for the same reasons as above.

RAB-GG-FWC improves the performance of B-GG-FWC for contraceptive, image (where it is also better than SVM, RAB and GG-FWC 3), ionosphere, and phoneme, but it is worse than SVM for these two last databases, which, we repeat, are imbalanced and thus easier for a purely local classifier. On the contrary, B-GG-FWC is better for hepatitis and kwok, without a clear reason for it, because RAB is slightly and much better than bagging, respectively. Compared to GG-FWC 3, RAB-GG-FWC is better except for abalone (equal), a problem for which all the designs we are using offer similar results, i.e., it seems that these results are close to the problem's real limit and that it is not difficult to reach it.

As a summary of the above, we can conclude:
– Applying GG-FWC post-aggregation to a given ensemble permits improving its performance, except if the ensemble already has a very high quality.
– If the baseline ensemble offers a relatively good performance, the corresponding post-aggregation design is better than well-known powerful monolithic and ensemble classifiers, except if some characteristic of the problem (e.g., imbalance) makes it easier for some of these designs.
3.3. A brief discussion of computational efforts

We have carried out all the above experimental work on a powerful heterogeneous computation cluster (1280 cores, 17 Tflop throughput). Its processing distribution procedure makes it impossible to offer direct measurements of the training computational charges, but a rough estimate can be given following the same approach we used in [13]. CV processes impose most of the training effort of the different designs. Accordingly, training a GG-FWC is about two orders of magnitude more demanding than training an SVM (it has two more non-trainable parameters). This effort is also much larger than the computation needed to design a bagging or an RAB ensemble (including the MLPs' training). Consequently, any design including a GG-FWC requires substantially the same effort, about two orders of magnitude more than the rest. This is the price to be paid in order to obtain improvements, regardless of whether the ensemble is already available – as in distributed classification situations – or not.

With respect to the operation computational effort – i.e., that of classifying an unseen sample – the presence of the ensemble guides in the GG-FWC post-aggregation systems penalizes them. As Table 3 shows, GG-FWC 3 and both post-aggregation GG-FWC units require fewer multiplications (MUL) and nonlinear transformations (NL) than the corresponding single SVM design, but including the figures of their guides changes this balance in some cases:
– With B-GG-FWC, for breast, hepatitis, kwok, ripley, and spam.
– With RAB-GG-FWC, for hepatitis, ionosphere, kwok, phoneme, ripley and spam (MUL), and hepatitis, kwok, phoneme, and ripley (NL).

The overall classification effort of the post-aggregation schemes is higher than that of GG-FWC 3 in all cases, although in some of them (breast, hepatitis, ionosphere, and spam for B-GG-FWC, and hepatitis, image and ionosphere for RAB-GG-FWC) the post-aggregation unit alone is computationally smaller than GG-FWC 3. Let us also remark that GG-FWC post-aggregation is beneficial for solving hepatitis, and RAB-GG-FWC for image (in other cases, such as ripley, the benefit requires only a modest increase in computational charge). This suggests that, in distributed learning situations where the locally available and post-aggregation features are not the same, such comparisons may lead to different conclusions. Concrete cases must be analyzed to confirm this, but we will not explore this avenue because our objective is, we repeat, to show the potential usefulness of the post-aggregation concept.
Table 3
Operation computational costs (number of products, MUL, and number of nonlinear transformations, NL) for the machine and ensemble designs under analysis.

Problem  SVM          GG-FWC 3    B-GG-FWC                         RAB-GG-FWC
         MUL/NL       MUL/NL      Bagging MUL/NL   GG-FWC MUL/NL   RAB MUL/NL    GG-FWC MUL/NL
aba      11990/1199   278/15      936/117          294/15          1302/155      3163/166
bre      1672/152     149/7       17580/2051       136/6           680/70        157/7
con      6149/559     569/28      630/84           724/34          816/102       2110/100
hep      1659/79      899/22      16200/891        143/3           3949/198      143/3
ima      7220/361     1880/49     798/49           3841/98         4440/240      1579/40
ion      4445/127     1801/26     2108/93          862/12          2301/78       724/10
kwo      584/146      68/11       2844/1027        353/50          1798/464      493/70
pho      9163/1309    3461/288    300/75           5102/392        11816/1708    8157/627
rip      312/78       44/7        4824/1675        59/8            5626/1421     66/9
spa      39766/674    7365/63     38512/747        7078/60         10790/208     27436/234
4. Experiments in the private information situation

Table 4
Average error rate (%) ± standard deviation and non-trainable parameter values for the reduced-input aggregation (private information) experiments.

Problem  Bagging (M/N_b)     RAB (M/T)                    GG-FWC 3 (K/δ/h/C)           B-GG-FWC (ε/h/δ/C)
cra      4.6 ± 0.7 (12/3)    12.9 ± 0.6 (5/32.1 ± 6.6)    3.7 ± 0.0 (2/1/3.8/1000)     1.2 ± 0.0 (0.6/5/1.4/100)
spa      6.3 ± 0.4 (8/83)    7.4 ± 0.3 (11/22.9 ± 3.8)    8.7 ± 0.6 (2/0.6/7/0.0001)   5.9 ± 0.2 (0.5/1/0.6/0.1)
Now, let us assume that some features that are available for the ensemble learners cannot be observed by the post-aggregation machine, a situation which is relatively frequent in practical cases of distributed learning, and also when the guide for the post-aggregation step comes from human experts. We will consider here two simple examples – there is a huge variety of possible situations. In these examples, variables that are highly correlated with the labels are not available to the post-aggregation GG-FWC. We use two databases that have enough features (to avoid seriously damaging the direct information which the GG-FWC accesses). The first is the database crabs (cra) [39], suppressing 'frontal lobe size' and 'body depth'. The second is spam, erasing the frequencies of the words 'remove' and 'your', and of '000'. We select this database not only because it has 57 features, but also because the results of the previous section prove that bagging gives a relatively high performance when applied to it; therefore, the contribution of the bagging output is potentially important for obtaining good results if some variables are excluded in the post-aggregation.

The bagging ensembles that were previously designed are used here (RAB is not appropriate for distributed learning tasks because its design requires a sequential construction), and the post-aggregation GG-FWCs are designed in the same way as for the common-feature situation, exploring the same values of the non-trainable parameters (extensions: ε = 0.6, 0.7 for both datasets; h = 6, 7 for crabs). For comparison purposes, RAB ensembles and GG-FWC 3 machines were also trained with the same inputs as the post-aggregation GG-FWC, in the manner indicated above (no extensions were required).

Table 4 shows the experimental results (averages ± standard deviations for 50 runs). They speak for themselves, without applying any statistical test: RAB and GG-FWC 3 offer seriously degraded performances, while B-GG-FWC is able to take advantage of the ensemble guides, giving the best results, which, for spam, are not much worse than those of the common-feature situation (the best GG-FWC 3 design does not make errors in the original crabs problem). These results support the potential usefulness of the post-aggregation concept for different practical distributed decision situations, as well as for improving (individual or simply aggregated) human decisions.

5. Conclusions and further work

In this paper, we introduce the concept of post-aggregation as a possibility to improve the performance of a previously aggregated, possibly massive, maybe distributed classifier ensemble. Post-aggregation means introducing a soft version of the previously aggregated output as an input to the final step of a machine classifier, which is subsequently trained. We provide experimental evidence of the frequent performance benefits coming from the application of a Gate-Generated Functional Weight Classifier as the post-aggregation element, including situations in which the ensemble learners have some private information – the case that also corresponds to human decisions.

There are many research lines starting here. Evaluating the effectiveness of simpler post-aggregation schemes is one of them. Combining post-aggregation with stacking is also interesting, to check whether compact high-performance ensembles can be obtained. Our present work is focused on introducing diversity in post-aggregation units, investigating how post-aggregation works in other distributed classification situations, checking the usefulness and acceptance of post-aggregating human decisions to create new decision support systems, and exploring how to extend these procedures to regression problems, mainly for Digital Signal Processing applications [40–42].

References
[1] L.I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms, Wiley, Hoboken, NJ, 2004.
[2] L. Rokach, Pattern Classification Using Ensemble Methods, World Scientific, Singapore, 2010.
[3] L. Breiman, Bagging predictors, Mach. Learn. 24 (1996) 123–140.
[4] L. Breiman, Randomizing outputs to increase prediction accuracy, Mach. Learn. 40 (2000) 229–242.
[5] L. Breiman, Random forests, Mach. Learn. 45 (2001) 5–32.
[6] D.H. Wolpert, Stacked generalization, Neural Netw. 5 (1992) 241–259.
[7] Y. Freund, R.E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, J. Comput. Syst. Sci. 55 (1997) 119–139.
[8] R.E. Schapire, Y. Singer, Improved boosting algorithms using confidence-rated predictions, Mach. Learn. 37 (1999) 297–336.
[9] R.E. Schapire, Y. Freund, Boosting: Foundations and Algorithms, MIT Press, Cambridge, MA, 2012.
[10] H. Drucker, C. Cortes, L.D. Jackel, Y. LeCun, V. Vapnik, Boosting and other ensemble methods, Neural Comput. 6 (1994) 1289–1301.
[11] E. Bauer, R. Kohavi, An empirical comparison of voting classification algorithms: bagging, boosting, and variants, Mach. Learn. 36 (1999) 105–139.
[12] S. Sun, Local within-class accuracies for weighting individual outputs in multiple classifier systems, Pattern Recogn. Lett. 31 (2010) 119–124.
[13] A. Omari, A.R. Figueiras-Vidal, Feature combiners with gate-generated weights for classification, IEEE Trans. Neural Netw. Learn. Syst. 24 (2013) 158–163.
[14] R.A. Jacobs, M.I. Jordan, S.J. Nowlan, G.E. Hinton, Adaptive mixtures of local experts, Neural Comput. 3 (1991) 79–87.
[15] S.E. Yuksel, J.N. Wilson, P.D. Gader, Twenty years of mixture of experts, IEEE Trans. Neural Netw. Learn. Syst. 23 (2012) 1177–1193.
[16] S. Sun, X. Xu, Variational inference for infinite mixtures of Gaussian processes with applications to traffic flow prediction, IEEE Trans. Intell. Transport. Syst. 12 (2011) 466–475.
[17] S. Sun, Infinite mixtures of multivariate Gaussian processes, in: International Conference on Machine Learning and Cybernetics (ICMLC), 2013, pp. 1011–1016.
[18] K.R. Müller, S. Mika, G. Rätsch, K. Tsuda, B. Schölkopf, An introduction to kernel-based learning algorithms, IEEE Trans. Neural Netw. 12 (2001) 181–202.
[19] B. Schölkopf, A. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, MIT Press, Cambridge, MA, 2002.
[20] J. Shawe-Taylor, N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge Univ. Press, New York, NY, 2004.
[21] E. Mayhua-López, V. Gómez-Verdejo, A.R. Figueiras-Vidal, Real Adaboost with gate controlled fusion, IEEE Trans. Neural Netw. Learn. Syst. 23 (2012) 2003–2009.
[22] E. Mayhua-López, V. Gómez-Verdejo, A.R. Figueiras-Vidal, A new boosting design of Support Vector Machine classifiers, Inform. Fusion 25 (2015) 63–71.
[23] E. Parrado-Hernández, I. Mora-Jiménez, J. Arenas-García, A.R. Figueiras-Vidal, A. Navia-Vázquez, Growing support vector classifiers with controlled complexity, Pattern Recogn. 36 (2003) 1479–1488.
[24] A. Navia-Vázquez, F. Pérez-Cruz, A. Artés-Rodríguez, A. Figueiras-Vidal, Weighted least squares training of support vector classifiers leading to compact and adaptive schemes, IEEE Trans. Neural Netw. 12 (2001) 1047–1059.
[25] J.L. Rojo-Álvarez, M. Martínez-Ramón, A.R. Figueiras-Vidal, A. Garcia-Armada, A. Artés-Rodríguez, A robust support vector algorithm for nonparametric spectral analysis, IEEE Signal Proc. Lett. 10 (2003) 320–323.
[26] A. Sayed, Adaptive networks, Proc. IEEE 102 (2014) 460–497.
[27] A. Lyhyaoui, M. Martínez-Ramón, I. Mora-Jiménez, M. Vázquez-Castro, J.L. Sancho-Gómez, A.R. Figueiras-Vidal, Sample selection via clustering to construct support vector-like classifiers, IEEE Trans. Neural Netw. 10 (1999) 1474–1481.
[28] M.B. Almeida, A. Braga, J. Braga, SVM-KM: speeding SVMs learning with a priori cluster selection and k-means, in: Proc. 6th Brazilian Symp. Neural Networks, IEEE Computer Society, Washington, DC, 2000, pp. 162–167.
[29] V. Gómez-Verdejo, M. Ortega-Moral, J. Arenas-García, A.R. Figueiras-Vidal, Boosting by weighting critical and erroneous samples, Neurocomputing 69 (2006) 679–685.
[30] V. Gómez-Verdejo, J. Arenas-García, A.R. Figueiras-Vidal, A dynamically adjusted mixed emphasis method for building boosting ensembles, IEEE Trans. Neural Netw. 19 (2008) 3–17.
[31] H.J. Shin, S. Cho, Neighborhood property-based pattern selection for support vector machines, Neural Comput. 19 (2007) 816–855.
[32] Y.-S. Hwang, S.-Y. Bang, An efficient method to construct a radial basis function neural network classifier and its application to unconstrained handwritten digit recognition, in: Proc. Intl. Conf. Pattern Recognition, Vienna, Austria, 1996, pp. 640–644.
[33] J.T. Kwok, Moderating the outputs of support vector machine classifiers, IEEE Trans. Neural Netw. 10 (1999) 1018–1031.
[34] P. Alinat, Periodic Progress Report 4, ROARS Project Esprit II-5516, Tech. Rep. ASM 93/S/EGS/NC/079, 1993.
[35] B.D. Ripley, Neural networks and related methods for classification, J. Roy. Stat. Soc. 56 (1994) 409–456.
[36] A. Frank, A. Asuncion, UCI Machine Learning Repository, Univ. California, Irvine, School of Information and Computer Science, 2010.
[37] F. Wilcoxon, Individual comparisons by ranking methods, Biometrics Bull. 1 (1945) 80–83.
[38] M. Hollander, D. Wolfe, Nonparametric Statistical Methods, Wiley, New York, NY, 1999.
[39] B.D. Ripley, Pattern Recognition and Neural Networks, Cambridge Univ. Press, Cambridge, UK, 1996.
[40] J. Rojo-Álvarez, G. Camps-Valls, M. Martínez-Ramón, E. Soria-Olivas, A. Navia-Vázquez, A.R. Figueiras-Vidal, Support vector machines framework for linear signal processing, Signal Proc. 85 (2005) 2316–2326.
[41] M. Martínez-Ramón, J. Rojo-Álvarez, G. Camps-Valls, J. Munoz-Marí, A. Navia-Vázquez, E. Soria-Olivas, A.R. Figueiras-Vidal, Support vector machines for nonlinear kernel ARMA system identification, IEEE Trans. Neural Netw. 17 (2006) 1617–1622.
[42] J. Rojo-Álvarez, M. Martínez-Ramón, M. de Prado-Cumplido, A. Artés-Rodríguez, A.R. Figueiras-Vidal, Support vector method for robust ARMA system identification, IEEE Trans. Signal Proc. 52 (2004) 155–164.