Post-aggregation of classifier ensembles

Information Fusion 26 (2015) 96–102


Adil Omari *, Aníbal R. Figueiras-Vidal
Universidad Carlos III de Madrid, Department of Signal Theory and Communications, Av. de la Universidad 30, Leganés, Madrid, Spain
This work has been partly supported by the Spanish Ministry of Science and Innovation, under Grant TIN 2011-24533.
* Corresponding author. E-mail addresses: [email protected] (A. Omari), [email protected] (A.R. Figueiras-Vidal).
http://dx.doi.org/10.1016/j.inffus.2015.01.003. 1566-2535/© 2015 Elsevier B.V. All rights reserved.

Article history: Received 22 July 2014; Received in revised form 20 January 2015; Accepted 22 January 2015; Available online 2 February 2015.
Keywords: Classification; Ensemble; Maximal margin; Post-aggregation.

Abstract: We propose to apply an adequate form of an ensemble output to the last level of an additional classifier – the post-aggregation element – as a method to improve the ensemble's performance. Our experimental results show that a Gate-Generated Functional Weight Classifier post-aggregation serves to achieve this objective, both in situations in which data are available everywhere and when some features are missing for the post-aggregation task – a case which is relevant for distributed classification problems. Post-aggregation techniques can be especially useful for massive (integrated by many learners) ensembles – such as most of the committees, which do not allow trainable first aggregations – and for human decision fusion, because it is unclear what features are considered in this kind of process. © 2015 Elsevier B.V. All rights reserved.

1. Introduction 1.1. The concept of post-aggregation Machine ensembles consist of diverse single machines, or learners, whose outputs are aggregated to constitute the ensemble output. Learners’ diversity permits to improve the performance of monolithic designs as well as to reduce training difficulties [1,2]. Two main families of ensembles exist. Committees are obtained by training the learners in a first step, and then aggregating their outputs. Representative cases are bagging [3], label change [4], random forests [5], and stacking [6]. The last method is based on training a number of versions of different learners excluding one of several partitions of the labeled examples. An aggregation unit is trained with all the examples, but using the outputs of learners’ versions that have not seen each examples. An overall refining training is finally applied. The needs of using different learners and of training the aggregation impose a limited size – i.e., a moderate number of learners – for these ensembles. However, the trainable aggregation serves to obtain reasonable performances. On the contrary, the rest of the above mentioned committees require massive forms – i.e., a very high number of learners – to offer their best performances. This forces the use of non-trainable aggregation schemes, such as majority voting or averaging. As a


consequence, their performances are usually worse than those provided by some designs of the second family of ensembles, in which the learners and the aggregation element are simultaneously trained. So, boosting [7–9] has been found clearly superior to standard committees [10,11]. According to the above, investigating new trainable aggregation methods that are suitable for massive committees is a relevant research line, because it can allow a size reduction and/or a performance improvement of these easy-to-design ensembles. There have been some advances in this direction, such as considering local measures to design weighted aggregations [12]. Obviously, other principled approaches would also deserve attention. In this paper, we address the problem by proposing the concept of post-aggregation. This fusion procedure includes two steps. The first is to use a traditional non-trainable aggregation unit for the outputs of the ensemble learners. In the second step, the true post-aggregation, a soft version of the previous aggregation – the average itself, or the relative numbers of class votes if voting is used – is introduced as an input into the last level of a complementary learning machine which also reads the observations. It can be, for example, a Multi-Layer Perceptron (MLP) classifier, the soft previously aggregated output being applied just before the MLP output activation, with its corresponding trainable weight. This complementary machine is trained in a conventional manner. It must be remarked that, theoretically, the worst result would equal that of the non-trainable fusion. Nevertheless, there is a reasonable possibility of improvement, because the observed variables can help to correct wrong decisions.
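As an illustration only (a minimal sketch we add here, not code from the paper), the idea can be realized with an MLP post-aggregator as follows: the hidden layer reads the raw observation, and the soft ensemble output enters just before the output activation through its own trainable weight. All class and variable names are hypothetical, and PyTorch is an arbitrary choice.

```python
import torch
import torch.nn as nn

class PostAggregationMLP(nn.Module):
    """Hypothetical post-aggregation MLP: the soft ensemble output s(x) is
    injected just before the output activation, with its own trainable weight,
    while the hidden layer processes the raw observation x."""
    def __init__(self, n_features: int, n_hidden: int):
        super().__init__()
        self.hidden = nn.Linear(n_features, n_hidden)
        self.out = nn.Linear(n_hidden, 1)
        # trainable weight multiplying the injected soft ensemble output
        self.ensemble_weight = nn.Parameter(torch.tensor(1.0))

    def forward(self, x, soft_ensemble_out):
        # x: (batch, n_features); soft_ensemble_out: (batch, 1)
        h = torch.tanh(self.hidden(x))
        pre_activation = self.out(h) + self.ensemble_weight * soft_ensemble_out
        return torch.sigmoid(pre_activation)
```

In the worst case, training can drive the network to rely on the injected ensemble output alone, which is consistent with the remark that the result should not fall below that of the non-trainable fusion.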


1.2. A powerful post-aggregation machine, and some additional applications Needless to say, the potential improvement will be more likely if a powerful post-aggregation unit is employed. This moves us to consider a high performance classification machine that we have previously introduced, the Gate-Generated Functional Weight Classifiers (GG-FWC) [13]. It appears as an evolution of the wellknown Mixture-of-Expert (MoE) ensembles [14]. An interesting overview of MoEs is [15]. Recently, Gaussian Process based formulation have also appeared [16,17]. MoEs’ performance in classification tasks seems to be limited due to their maximum-likelihood training. GG-FWCs can be trained by means of the more appropriate Maximal Margin – or Support Vector Machine – algorithms [18–20], which maximize a measure of the separation of samples corresponding to different classes. On the other hand, as we will see later, the architecture of our GG-FWCs gives them global–local approximation capabilities, i.e., an expressive power which allows to establish both extensive smooth regions and highly variable local forms for the classification hypersurfaces. The advantage of these capabilities has been demonstrated for other high performance classification ensembles, such as [21,22] with respect to the standard boosting designs. This global–local approximation characteristic makes GG-FWCs an attractive alternative to the usual kernel-based SVM classifiers, even for ‘‘ad hoc’’ architectures [23] and growing and adaptive designs [24,25]. Two more issues must be mentioned here. First, physically distributed systems are becoming more and more important. The excellent tutorial [26] presents a complete overview of filteringbased distributed estimation and decision systems. Curiously enough, the most studied systems consist on structurally identical learners which access basically common, or shared, observations – i.e., samples with the same features. This is not a general model, because there are many practical situations in which the same features are not available everywhere and the distributed units have different architectures. In any case, a high number of distributed learners and/or communication capacity limitations force direct (non-trainable) fusion mechanisms, such as those applied to standard committees. Thus, it is important to check if the post-aggregation ideas we propose can be useful to improve the performance of distributed systems in general situations. Here, we will present experimental results for an important kind of cases, those in which a part of the features are not available at the place where the fusion is carried out. Note that this lack of information increases the difficulties for a successful post-aggregation. The second issue is the usefulness of the post-aggregation concept to extract benefit from human decisions. A group of experts, for example, can provide their decisions for an instance, but it is unclear what are the features they consider and, even more, how they process them. In any case, there is some diversity among them. Consequently, aggregation is adequate. However, the traditional ways of aggregating their decisions are non-trainable. But, if some features can be read and there are enough labeled observations, a post-aggregation process can be applied to improve the simple aggregation results. This is a research avenue which can be of great importance in many application areas: Health, economy, sociology, etc. 
We would like to emphasize that the main objective of this contribution is to introduce post-aggregation as an efficient technique to obtain better results from massive ensembles, an important matter by itself and for distributed learning situations. The concrete examples we analyze have the purpose of demonstrating the usefulness of the post-aggregation concept. The rest of the paper has the following content. Section 2 is a brief revision of GG-FWCs and indicates how they will be designed


for post-aggregation objectives. In Section 3, we describe the experiments, and present and discuss their results, including some comments on computational loads, in the common information situation. Section 4 refers in the same manner to the situation in which the post-aggregation unit accesses limited information with respect to the learners. The main conclusions of our work and a number of suggestions for further research close the paper. In our discussion and experiments, we will address binary problems. Extensions to multi-class situations follow well-known formulations. 2. Gate-Generated Functional Weight Classifiers and their use in post-aggregation 2.1. The monolithic GG-FWC Fig. 1 shows the architecture of the monolithic GG-FWC we introduced in [13]. That paper explained how it can be obtained from a MoE ensemble with linear learners by reordering summations in the formula of the output and selecting a kernel gate – which provides enough expressive power to the resulting machine. There is no need of a complex training if kernels are previously selected according to an appropriate algorithm (note that kernel dispersion can be established by means of a Cross Validation (CV) process). The form of the GG-FWC output is

o(\mathbf{x}) = \sum_{d=0}^{D} \Big( \sum_{r=0}^{R} w_{dr}\, k_r(\mathbf{x}) \Big)\, x_d \qquad (1)

where x_0 = 1, k_0(x) = 1, x = [x_1 ... x_D]^T is a D-dimensional observation, and k_r(x), r = 1, ..., R, are the kernel outputs. Note that, reindexing s = r(D + 1) + d and calling z_s(x) = k_r(x) x_d, formula (1) becomes

o(\mathbf{x}) = \sum_{s=0}^{S-1} w_s\, z_s(\mathbf{x}) \qquad (2)

which, {z_s(x)} being just numbers for a given input, is a linear-in-the-parameters form in {w_s}. Therefore, an MM/SVM linear algorithm can be applied to determine these parameters – see [18–20] for the corresponding Lagrangian-formulation-based optimization procedures. Accordingly, an improved classification performance with respect to the maximum-likelihood-based training techniques used for MoEs can be expected. Additionally, {z_s(x)} includes global ({x_d}), local ({k_r(x)}), and local–global ({x_d k_r(x)}) elements. This fact gives GG-FWCs a natural global–local approximation capability, i.e., they can construct classification borders containing both extensive smooth regions and wrinkly small areas.
The previous selection of kernel centers among the training samples can be done according to many auxiliary algorithms, such as those in [27,28]. Our experience says that results are very similar if the selection mechanism pays attention to both the difficulty of classifying each sample and its proximity to a reasonably established classification frontier, which are the essential aspects to evaluate the importance of training examples [29,30]. Consequently, we employed a simple two-step procedure to design the highest performance machine presented in [13], GG-FWC 3, which used Gaussian kernels. First, we preselected center candidates by means of the Shin–Cho algorithm [31], which measures the proximity to the border and the classification difficulty by means of an auxiliary K-Nearest Neighbor (K-NN) classifier. Then, the Adaptive Pattern Classifier-III (APC-III) algorithm [32] sequentially selected the final centers, starting with the nearest-to-the-border preselected example and excluding the candidates that are inside a hypersphere of a given radius around it.


Fig. 1. Gate-Generated Functional Weight Classifier. x: input sample; o: output; k_r(x): r-th kernel output; w_d(x): d-th feature weight.
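To make Fig. 1 and Eqs. (1) and (2) concrete, the following minimal sketch (our own illustration under simplifying assumptions, not the authors' implementation) builds the z_s(x) features from Gaussian kernels and fits the weights w_s with a linear maximal-margin solver; the centers and dispersions are random placeholders rather than the Shin–Cho/APC-III selections described in the text.

```python
import numpy as np
from sklearn.svm import LinearSVC

def gaussian_kernels(X, centers, sigma2):
    """k_r(x) = exp(-||x - c_r||^2 / (2 sigma_r^2)), cf. Eq. (3) below."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma2[None, :]))

def gg_fwc_features(X, centers, sigma2):
    """z_s(x) = k_r(x) * x_d for r = 0..R (k_0 = 1) and d = 0..D (x_0 = 1)."""
    n = X.shape[0]
    K = np.hstack([np.ones((n, 1)), gaussian_kernels(X, centers, sigma2)])  # (n, R+1)
    Xb = np.hstack([np.ones((n, 1)), X])                                    # (n, D+1)
    return np.einsum("nr,nd->nrd", K, Xb).reshape(n, -1)                    # (n, (R+1)(D+1))

# Illustrative usage with random data and randomly picked centers (placeholders).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 2, size=200)
centers = X[rng.choice(200, size=10, replace=False)]
sigma2 = np.full(10, 1.0)
Z = gg_fwc_features(X, centers, sigma2)
clf = LinearSVC(C=1.0).fit(Z, y)      # maximal-margin (MM/SVM) training of the weights w_s
o = clf.decision_function(Z)          # o(x), as in Eq. (2)
```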

To determine the values of the non-trainable parameters – the kernel dispersion parameter σ, a scale factor for the hypersphere radius, as well as the parameter C of the MM/SVM algorithm – standard CV techniques were used. Experimental evidence in [13] supports that GG-FWCs offer extraordinary performances – better than SVMs and Real AdaBoost (RAB), in general – even though they are shallow monolithic machines. The price to be paid is a high computational charge for their design, mainly due to the CV requirements.

2.2. Post-aggregation with GG-FWCs

We said above that post-aggregation means to inject a soft output of the firstly applied conventional fusion unit just before the final output of the classification machine which is used as the post-aggregation element. If this machine is a GG-FWC, we only need to include this output as an unweighted additional variable z_{S+1}(x), because the combination weights are provided by the MM/SVM algorithm.
We will consider here that the ensemble is designed first and the post-aggregation is applied afterwards. Although in some cases a joint training is possible, we prefer to keep committees' concepts. On the other hand, a joint training could require a huge computational effort. This effort would only be justified for some delicate and/or expensive applications, and always after favorable preliminary experiments such as those we are addressing here.
Since the soft output of the first aggregation is available, the kernel center selection can be based on it. Samples that are near the border are preselected, and the APC-III algorithm completes the process. This reduces the computational effort to train the post-aggregation GG-FWC, because another auxiliary classifier is not necessary. Obviously, the CV has to include a parameter establishing the threshold for acceptable values of the proximity to the border.
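Following the description above, the post-aggregation variant only differs in appending the soft ensemble output as one more, initially unweighted, variable z_{S+1}(x). Again a hedged sketch with hypothetical names, reusing gg_fwc_features from the previous block:

```python
import numpy as np
from sklearn.svm import LinearSVC

def post_aggregation_features(X, centers, sigma2, soft_ensemble_out):
    # z_1 ... z_S from the monolithic GG-FWC, plus z_{S+1}(x) = soft ensemble output;
    # the MM/SVM solver then learns the combination weight of the ensemble guide too.
    Z = gg_fwc_features(X, centers, sigma2)  # defined in the previous sketch
    return np.hstack([Z, np.asarray(soft_ensemble_out).reshape(-1, 1)])

# clf = LinearSVC(C=1.0).fit(post_aggregation_features(X, centers, sigma2, s), y)
```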

3. Experiments with common observations

We analyze here the post-aggregation results when the corresponding units read the same features as the ensemble learners. Bagging and RAB ensembles are post-aggregated by means of GG-FWCs (although RAB is not a committee, we want to check if post-aggregation can be useful in general). We will compare these post-aggregation results with those of bagging and boosting, nonlinear SVMs, which are state-of-the-art shallow monolithic classifiers, and the above described GG-FWC 3.
For an easy appreciation of the differences among classification performances, we will address a series of ten well-known binary datasets of different sizes, dimensionality, and difficulty: kwok [33], phoneme [34], ripley [35], and abalone, breast, contraceptive, hepatitis, image, ionosphere and spam [36]. Table 1 shows their main characteristics, including the numbers of positive and negative examples (C1/C-1) in the training and test sets (which are pre-established in order to allow fair comparisons among different classifiers).

Table 1. Main characteristics of the ten datasets used in the simulations.

Problem       | Brief name | D  | Train (C1/C-1) | Test (C1/C-1)
abalone       | aba        | 8  | 1238/1269      | 843/827
breast        | bre        | 9  | 145/275        | 96/183
contraceptive | con        | 9  | 506/377        | 338/252
hepatitis     | hep        | 19 | 70/23          | 53/9
image         | ima        | 18 | 821/1027       | 169/293
ionosphere    | ion        | 33 | 101/100        | 124/26
kwok          | kwo        | 2  | 300/200        | 6120/4080
phoneme       | pho        | 5  | 952/2291       | 634/1527
ripley        | rip        | 2  | 125/125        | 500/500
spam          | spa        | 57 | 1673/1088      | 1115/725

3.1. Machine designs

For the bagging ensembles, we use MLPs as learners and a simplified design method which is well known to practitioners. First, the size of a one-hidden-layer single MLP for solving each problem is selected by means of a 5-fold, 20-run (random zero-mean, 0.01-variance Gaussian initialization of the weights) CV process exploring values from 2 to 20 in unit steps. Then, we create 50 different sets of 400 MLPs of that size, each MLP being trained with a different bootstrap resampling of the size of the original training sample population. After that, we construct 50 bagging ensembles of


3–301 elements by taking at random the MLPs for each one of the sets. The bagging aggregation is a direct arithmetic average. All these previous steps serve to select the number of learners N_b according to the average performance of the corresponding 50 results. The final designs are obtained by repeating the above training for the selected size, and using the average of the learners' outputs as the ensemble output.
RAB ensembles are built as in [13], using MLP learners whose sizes M are determined by CV and applying a simple stopping criterion [29,30]. T is the resulting number of learners. The reference (Gaussian kernel) SVMs are also designed by CV in a standard manner [13].
The GG-FWCs have Gaussian kernels whose centroids are obtained from the training samples that give an overall normalized output of the corresponding ensemble in [-ε, ε], ε being a parameter to be selected by CV. A second selection step of the APC-III type starts at the sample with minimal absolute ensemble output value and sequentially excludes samples that are inside a hypersphere of radius h·D_min, D_min being the average nearest inter-sample distance and h a parameter to be selected by CV. The "variance" of the Gaussian kernels, σ_r² for each centroid c_r in

k_r(\mathbf{x}) = \exp\!\left( - \frac{\|\mathbf{x} - \mathbf{c}_r\|^2}{2 \sigma_r^2} \right) \qquad (3)

is proportional to the sample variance of the selected samples inside the corresponding hypersphere

\sigma_r^2 = \delta \, \frac{1}{\# N_r} \sum_{k \in N_r} \left\| \mathbf{x}^{(k)} - \mathbf{c}_r \right\|^2 \qquad (4)

N_r indicating these selected samples and # their number. The proportionality parameter δ is also selected by means of CV. Finally, CV serves to select the (inverse) regularization parameter C of the MM/SVM formulation, too. We initially explore:

- ε: 0.1; 0.2; 0.3; 0.4; 0.5.
- h: 1; 2; 3; 4; 5.
- δ: from 0.2 to 4, in 0.2 steps.
- C: powers of 10 between 10^-2 and 10^4.

with a 5-fold, 50-run CV.
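The following is a minimal sketch of our reading of these center-selection and dispersion rules (function names, tie-breaking and edge cases are our assumptions, not the authors' code): samples whose soft ensemble output falls in [-ε, ε] are preselected, centers are kept APC-III style by excluding candidates within a radius h·D_min of an accepted center, and each σ_r² follows Eq. (4).

```python
import numpy as np

def average_nearest_distance(X):
    """D_min: average distance from each training sample to its nearest neighbour."""
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    np.fill_diagonal(d, np.inf)
    return d.min(axis=1).mean()

def select_centers_and_dispersions(X, soft_out, eps, h, delta):
    """Preselect |soft ensemble output| <= eps, keep centers APC-III style, and
    set each sigma_r^2 proportionally to the local sample variance (Eq. (4))."""
    radius = h * average_nearest_distance(X)
    order = np.argsort(np.abs(soft_out))                 # minimal |output| first
    remaining = [i for i in order if abs(soft_out[i]) <= eps]
    centers, sigma2 = [], []
    while remaining:
        c = remaining[0]                                  # nearest to the border
        inside = [i for i in remaining
                  if np.linalg.norm(X[i] - X[c]) < radius]
        centers.append(X[c])
        # degenerate (zero) if the accepted center is isolated inside its hypersphere
        sigma2.append(delta * np.mean([np.sum((X[i] - X[c]) ** 2) for i in inside]))
        remaining = [i for i in remaining if i not in inside]
    return np.array(centers), np.array(sigma2)
```

Here eps, h and delta play the roles of the CV-selected ε, h and δ listed above.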

When the CV indicates an extreme value, the interval is extended. This occurs:

- when using bagging ensembles, for ε in contraceptive, image, kwok, phoneme (extension: 0.6, 0.7) and hepatitis (extension: 0.6, 0.7, 0.8); for h in abalone (ext. 6, 7, 8) and contraceptive (ext. 6, 7); for δ in kwok (ext. 5 to 26, in unit steps); and for C in hepatitis and kwok (ext. 0.001);
- when using RAB, for ε in abalone and hepatitis (ext. 0.6, 0.7); for h in ripley (ext. 6, 7); and for C in hepatitis and ionosphere (ext. 0.001).

After the above first selection, a higher resolution exploration for C is carried out, by exploring another 8 values between the best value and each of its immediate neighbors. The same was necessary for δ in kwo. The GG-FWC 3 single machines are designed in the same manner, but using the Shin–Cho preselection of centers, as in [13].

3.2. Results and their discussion

Table 2 presents the experimental results: average ± standard deviation error rates for 50 runs in the cases of designs that are sensitive to the training process, i.e., bagging, RAB, and their GG-FWC based post-aggregations. Note that SVM and GG-FWC 3 give just a single error rate because they are trained by deterministic algorithms. We indicate our post-aggregation methods by B-GG-FWC for bagging and RAB-GG-FWC for RAB.
To compare the post-aggregation designs with SVM and GG-FWC 3, mean and standard deviation are the only relevant figures, since we try to check if a distribution is above or below a given value. In the other cases, applying a statistical test is adequate in order to confirm the significance of the differences. Since these experiments are not independent, a non-parametric test is appropriate. We have applied the Wilcoxon Rank Sum test [37,38]. Note that the Wilcoxon test only indicates if we deal with statistically different populations, and, therefore, the average and standard deviation statistics decide the superior method when such a difference appears.
First, we will discuss if post-aggregation serves to improve the performance of the baseline ensemble. Then, we will evaluate the post-aggregation designs in absolute terms.
When comparing B-GG-FWC with bagging, differences in breast and contraceptive are not significant. Note that bagging is clearly better than SVM and RAB just for breast, and similar for contracep-

Table 2. Average error rate (%) (± standard deviation when applicable) and CV-selected parameter values for SVM, bagging, RAB, GG-FWC 3, and the proposed post-aggregation schemes on the ten analyzed problems.

Problem | SVM: error, σ/C | Bagging: error, M/N_b | RAB: error, M/T | GG-FWC 3: error, K/δ/h/C | B-GG-FWC: error, ε/h/δ/C | RAB-GG-FWC: error, ε/h/δ/C
aba | 19.8, 2.6/10 | 19.7±0.1, 8/13 | 19.4±0.02, 4/31.2±0.4 | 18.9, 4/1.8/14/10 | 18.9±0.1, 0.4/7/0.4/0.05 | 18.9±0.3, 0.6/3/0.2/5
bre | 3.2, 1.2/1 | 2.1±0.1, 6/293 | 2.6±0.5, 6/9.7±3.6 | 3.2, 3/3.4/13/10 | 2.1±0.05, 0.4/4/0.2/0.1 | 2.1±0.2, 0.3/4/0.2/0.1
con | 29.3±1.4, 4.6/90 | 29.5±1.5, 3/21 | 29.0±0.2, 2/33.7±0.7 | 28.8±1.2, 2/1.6/17/0.1 | 28.9±0.4, 0.6/5/1/0.5 | 23.0±0.4, 0.3/3/3.8/0.1
hep | 14.5, 2.0/0.0001 | 8.9±1.3, 10/81 | 8.6±1.6, 17/10.6±3.2 | 12.9, 2/0.6/2/0.9 | 6.4±0.2, 0.7/3/0.2/0.01 | 6.9±1.0, 0.6/3/0.2/0.01
ima | 3.2, 1.4/100 | 5.8±0.4, 6/7 | 2.5±0.04, 11/19.6±0.4 | 2.6, 2/0.4/2/100000 | 3.3±0.2, 0.5/2/2.6/0.2 | 2.4±0.1, 0.3/2/0.4/0.1
ion | 2.0, 1.4/8 | 7.6±0.9, 2/31 | 4.9±0.9, 5/13.4±4.5 | 6.7, 2/0.6/2/1 | 3.9±0.4, 0.2/1/0.4/0.1 | 3.7±0.6, 0.3/2/0.2/0.01
kwo | 12.1, 0.4/3 | 19.9±0.06, 12/79 | 11.7±0.01, 15/29.3±0.1 | 12.1, 2/0.8/5/1 | 11.5±0.03, 0.5/3/25.0/0.01 | 11.7±0.1, 0.3/2/0.2/1
pho | 11.1±0.4, 0.4/9 | 21.3±0.6, 2/25 | 14.0±0.07, 60/27.7±0.3 | 11.8±0.5, 2/0.4/1/100 | 11.7±0.1, 0.6/3/2.0/5 | 11.6±0.1, 0.2/3/2/5
rip | 9.6, 0.8/9 | 11.2±0.2, 24/67 | 9.7±0.01, 48/28.9±0.2 | 9.1, 2/3.8/8/100000 | 8.9±0.2, 0.2/4/3.4/1 | 8.9±0.2, 0.2/5/2/0.1
spa | 6.3±0.7, 6.8/9 | 6.3±0.4, 8/83 | 5.9±0.09, 7/26.2±0.6 | 6.2±0.7, 2/0.2/19/0.1 | 5.8±0.2, 0.2/1/0.4/1 | 3.6±0.2, 0.3/1/0.6/0.1


tive and spam. Thus, it seems that the high performance of bagging, the baseline ensemble, makes post-aggregation not effective. In all other problems, B-GG-FWC improves bagging’s results. With respect to RAB, RAB-GG-FWC is always better but in kwok (equal), also probably because the high performance of the baseline ensemble. B-GG-FWC is better than the other reference machines but: – For contraceptive (equal) and image, ionosphere, and phoneme (worse) when compared to SVM. In the first case, SVM is similar to bagging, and, as above said, their high performance does not allow to get advantage from post-aggregating. And image is much better solved by an SVM than with bagging. Ionosphere and phoneme are imbalanced problems, and purely local classifiers, such as SVMs, usually offer advantage for this kind of databases. – With respect to RAB, for contraceptive (tie) and image (loss). Again, the reason is that RAB is slightly better than bagging for contraceptive, and clearly better for image, bagging being the baseline ensemble for B-GG-FWC. – Compared to GG-FWC 3, in abalone and contraceptive (ties) and image (loss), the reasons being the same as above. RAB-GG-FWC improves the performance of B-GG-FWC for contraceptive, image (it is also better than SVM, RAB and GG-FWC 3), and ionosphere and phoneme, but it is worse than SVM for these two last databases, which, we repeat, are imbalanced, thus easier for a purely local classifier. On the contrary, B-GG-FWC is better for hepatitis and kwok, without a clear reason for it, because RAB is slightly and much better than bagging, respectively. Compared to GG-FWC 3, RAB-GG-FWC is better but for abalone (equal), a problem for which all the designs we are using offer similar results, i.e., it seems that these results are problem’s real limits and that it is not difficult to reach them. As a summary of the above, we can conclude: – Applying GG-FWC post-aggregation to a given ensemble permits to improve its performance, except if the ensemble has a very high quality. – If the baseline ensemble offers a relatively good performance, the corresponding post-aggregation design is better than well-known powerful monolithic and ensemble classifiers, except if some characteristic of the problem (e.g., imbalance) make it easier for some of these designs.
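As a side illustration (our own snippet with made-up numbers, not the authors' evaluation script), the comparison described above can be reproduced with SciPy's Wilcoxon rank-sum test; when the test signals different populations, the mean and standard deviation statistics decide which design is superior.

```python
import numpy as np
from scipy.stats import ranksums

rng = np.random.default_rng(0)
errors_bagging = rng.normal(loc=19.7, scale=0.1, size=50)    # illustrative values only
errors_b_gg_fwc = rng.normal(loc=18.9, scale=0.1, size=50)
stat, p_value = ranksums(errors_bagging, errors_b_gg_fwc)
different_populations = p_value < 0.05
```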

impossible to offer direct measurements of the training computational charges. But a rough estimate can be given following the same approach we used in [13]. It can be said that CV processes impose most the training effort of the different designs. According to that, training a GG-FWC is about two orders of magnitude more demanding than an SVM (it has two more non-trainable parameters). This effort is also much bigger than the necessary computation for designing a bagging or an RAB ensemble (including MLPs’ training). Consequently, any design including a GG-FWC requires substantially the same effort, about two orders of magnitude bigger than the rest. This is the price to be paid in order to get improvements, regardless of the availability of the ensemble – as in distributed classification situations – or not. With respect to operation computational efforts – i.e., those of classifying an unseen sample–, the presence of the ensemble guides in the GG-FWC post-aggregation systems penalizes them. As Table 3 shows, GG-FWC 3 and both the post-aggregation GGFWC units have numbers of multiplication MUL and nonlinear transformations NL lower than the corresponding single SVM design, but the inclusion of the figures corresponding to their guides change in some cases this balance: – With B-GG-FWC, for breast, hepatitis, kwok, ripley, and spam. – With RAB-GG-FWC, for hepatitis, ionosphere, kwok, phoneme, ripley and spam (MUL) and hepatitis, kwok, phoneme, and ripley (NL). The overall classification effort of the post-aggregation schemes is higher than that of the GG-FWC 3 in all the cases, although in some of them (breast, hepatitis, ionosphere, and spam, for B-GGFWC, and hepatitis, image and ionosphere for RAB-GG-FWC) the post-aggregation is computationally smaller than GG-FWC 3. And let us remark that GG-FWC post-aggregation is beneficial for solving hepatitis, and RAB-GG-FWC for image (in other cases, such as ripley, the benefit requires a modest computational charge increment). This suggest that for distributed learning situations in which local and post-aggregation available features are not the same comparisons may lead to different conclusions. Concrete cases must be analyzed to confirm it, but we will not explore this avenue because our objective is, we repeat, to show the potential usefulness of the post-aggregation concept.

4. Experiments in private information situation 3.3. A brief discussion of computational efforts We have carried out all the above experimental work with a powerful inhomogeneous computation cluster (1280 cores, 17 Tflop throughput). Its processing distribution procedure makes

Now, let us assume that some features that are available for the ensemble learners cannot be observed by the post-aggregation machine, a situation which is relatively frequent when dealing with practical cases of distributed learning, and also when the

Table 3. Operation computational costs (number of products, MUL, and number of nonlinear transformations, NL) for the machine and ensemble designs under analysis.

Problem | SVM MUL/NL | GG-FWC 3 MUL/NL | B-GG-FWC: Bagging MUL/NL | B-GG-FWC: GG-FWC MUL/NL | RAB-GG-FWC: RAB MUL/NL | RAB-GG-FWC: GG-FWC MUL/NL
aba | 11990/1199 | 278/15 | 936/117 | 294/15 | 1302/155 | 3163/166
bre | 1672/152 | 149/7 | 17580/2051 | 136/6 | 680/70 | 157/7
con | 6149/559 | 569/28 | 630/84 | 724/34 | 816/102 | 2110/100
hep | 1659/79 | 899/22 | 16200/891 | 143/3 | 3949/198 | 143/3
ima | 7220/361 | 1880/49 | 798/49 | 3841/98 | 4440/240 | 1579/40
ion | 4445/127 | 1801/26 | 2108/93 | 862/12 | 2301/78 | 724/10
kwo | 584/146 | 68/11 | 2844/1027 | 353/50 | 1798/464 | 493/70
pho | 9163/1309 | 3461/288 | 300/75 | 5102/392 | 11816/1708 | 8157/627
rip | 312/78 | 44/7 | 4824/1675 | 59/8 | 5626/1421 | 66/9
spa | 39766/674 | 7365/63 | 38512/747 | 7078/60 | 10790/208 | 27436/234

Table 4. Average error rate (%) ± standard deviation and non-trainable parameter values for the reduced input aggregation (private information) experiments.

Problem | Bagging: error, M/N_b | RAB: error, M/T | GG-FWC 3: error, K/δ/h/C | B-GG-FWC: error, ε/h/δ/C
cra | 4.6±0.7, 12/3 | 12.9±0.6, 5/32.1±6.6 | 3.7±0.0, 2/1/3.8/1000 | 1.2±0.0, 0.6/5/1.4/100
spa | 6.3±0.4, 8/83 | 7.4±0.3, 11/22.9±3.8 | 8.7±0.6, 2/0.6/7/0.0001 | 5.9±0.2, 0.5/1/0.6/0.1


guide for the post-aggregation step comes from human experts. We will consider here two simple examples – there is a huge variety of possible situations. In these examples, variables that are highly correlated with the labels are not available for the post-aggregation GG-FWC. We use two databases that have enough features (to avoid seriously damaging the direct information which the GG-FWC accesses). The first is the database crabs (cra) [39], suppressing 'frontal lobe size' and 'body depth'. The second is spam, erasing the frequency of the words 'remove' and 'your', and of '000'. We select this database not only because it has 57 features, but also because the results of the previous section prove that bagging gives a relatively high performance when applied to it. Therefore, the contribution of the bagging output is potentially important for obtaining good results if some variables are excluded in the post-aggregation.
The bagging ensembles that were previously designed are used here (RAB is not appropriate for distributed learning tasks because its design requires a sequential construction), and the post-aggregation GG-FWCs are designed in the same way as for the common feature situation, exploring the same values of the non-trainable parameters (extensions: ε = 0.6, 0.7, for both datasets, and h = 6, 7, for crabs). For comparison purposes, RAB ensembles and GG-FWC 3 machines were also trained with the same inputs as the post-aggregation GG-FWC, in the manner we indicated above (no extensions are required).
Table 4 shows the experimental results (averages ± standard deviations for 50 runs). They speak for themselves, without applying any statistical test: RAB and GG-FWC 3 offer seriously degraded performances, while B-GG-FWC is able to take advantage of the ensemble guides, giving the best results, which, for spam, are not much worse than those of the common feature situation (the best GG-FWC 3 design does not make errors in the original crabs problem). These results support the potential usefulness of the post-aggregation concept for different practical distributed decision situations, as well as for improving (individual or simply aggregated) human decisions.

5. Conclusions and further work

In this paper, we introduce the concept of post-aggregation as a possibility to improve the performance of a previously aggregated, possibly massive, maybe distributed classifier ensemble. Post-aggregation means to introduce a soft version of the previously aggregated output as an input to the final step of a machine classifier, which is subsequently trained. We provide experimental evidence of the frequent performance benefits coming from the application of a Gate-Generated Functional Weight Classifier as the post-aggregation element, including situations in which the ensemble learners have some private information – the case that also corresponds to human decisions.
There are many research lines starting here. Evaluating the effectiveness of simpler post-aggregation schemes is one of them. Combining post-aggregation with stacking is also interesting, to check if compact high performance ensembles can be obtained. Our present work is focused on introducing diversity in post-aggregation units, investigating how post-aggregation works in other distributed classification situations, checking the usefulness and acceptance of post-aggregating human decisions to create new decision support systems, and exploring how to extend these procedures to regression problems, mainly for Digital Signal Processing applications [40–42].

References

[1] L.I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms, Wiley, Hoboken, NJ, 2004. [2] L. Rokach, Pattern Classification Using Ensemble Methods, World Scientific, Singapore, 2010. [3] L. Breiman, Bagging predictors, Mach. Learn. 24 (1996) 123–140. [4] L. Breiman, Randomizing outputs to increase prediction accuracy, Mach. Learn. 40 (2000) 229–242. [5] L. Breiman, Random forests, Mach. Learn. 45 (2001) 5–32. [6] D.H. Wolpert, Stacked generalization, Neural Netw. 5 (1992) 241–259. [7] Y. Freund, R.E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, J. Comput. Syst. Sci. 55 (1997) 119–139. [8] R.E. Schapire, Y. Singer, Improved boosting algorithms using confidence-rated predictions, Mach. Learn. 37 (1999) 297–336. [9] R.E. Schapire, Y. Freund, Boosting, Foundations and Algorithms, MIT Press, Cambridge, MA, 2012. [10] H. Drucker, C. Cortes, L.D. Jackel, Y. LeCun, V. Vapnik, Boosting and other ensemble methods, Neural Comput. 6 (1994) 1289–1301. [11] E. Bauer, R. Kohavi, An empirical comparison of voting classification algorithms: bagging, boosting, and variants, Mach. Learn. 36 (1999) 105–139. [12] S. Sun, Local within-class accuracies for weighting individual outputs in multiple classifier systems, Pattern Recogn. Lett. 31 (2010) 119–124. [13] A. Omari, A.R. Figueiras-Vidal, Feature combiners with gate-generated weights for classification, IEEE Trans. Neural Netw. Learn. Syst. 24 (2013) 158–163. [14] R.A. Jacobs, M.I. Jordan, S.J. Nowlan, G.E. Hinton, Adaptive mixtures of local experts, Neural Comput. 3 (1991) 79–87. [15] S.E. Yuksel, J.N. Wilson, P.D. Gader, Twenty years of mixture of experts, IEEE Trans. Neural Netw. Learn. Syst. 23 (2012) 1177–1193. [16] S. Sun, X. Xu, Variational inference for infinite mixtures of Gaussian processes with applications to traffic flow prediction, IEEE Trans. Intell. Transport. Syst. 12 (2011) 466–475. [17] S. Sun, Infinite mixtures of multivariate gaussian processes, in: International Conference on Machine Learning and Cybernetics (ICMLC), 2013, 2013, pp. 1011–1016. [18] K.R. Müller, S. Mika, G. Rätsch, K. Tsuda, B. Schölkopf, An introduction to kernel-based learning algorithms, IEEE Trans. Neural Netw. 12 (2001) 181– 202. [19] B. Schölkopf, A. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, MIT Press, Cambridge, MA, 2002. [20] J. Shawe-Taylor, N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge Univ. Press, New York, NY, 2004. [21] E. Mayhua-López, V. Gómez-Verdejo, A.R. Figueiras-Vidal, Real Adaboost with gate controlled fusion, IEEE Trans. Neural Netw. Learn. Syst. 23 (2012) 2003– 2009. [22] E. Mayhua-López, V. Gómez-Verdejo, A.R. Figueiras-Vidal, A new boosting design of Support Vector Machine classifiers, Inform. Fusion 25 (2015) 63–71. [23] E. Parrado-Hernández, I. Mora-Jiménez, J. Arenas-García, A.R. Figueiras-Vidal, A. Navia-Vázquez, Growing support vector classifiers with controlled complexity, Pattern Recogn. 36 (2003) 1479–1488. [24] A. Navia-Vázquez, F. Pérez-Cruz, A. Artés-Rodríguez, A. Figueiras-Vidal, Weighted least squares training of support vector classifiers leading to compact and adaptive schemes, IEEE Trans. Neural Netw. 12 (2001) 1047– 1059. [25] J.L. Rojo-Álvarez, M. Martínez-Ramón, A.R. Figueiras-Vidal, A. Garcia-Armada, A. Artés-Rodríguez, A robust support vector algorithm for nonparametric spectral analysis, IEEE Signal Proc. Lett. 10 (2003) 320–323. [26] A. Sayed, Adaptive networks, Proc. 
IEEE 102 (2014) 460–497. [27] A. Lyhyaoui, M. Martínez-Ramón, I. Mora-Jiménez, M. Vázquez-Castro, J.L. Sancho-Gómez, A.R. Figueiras-Vidal, Sample selection via clustering to construct support vector-like classifiers, IEEE Trans. Neural Netw. 10 (1999) 1474–1481. [28] M.B. Almeida, A. Braga, J. Braga, SVM-KM: speeding SVMs learning with a priori cluster selection and k-means, in: Proc. 6th Brazilian Symp. Neural Networks, IEEE Computer Society, Washington, DC, 2000, pp. 162–167. [29] V. Gómez-Verdejo, M. Ortega-Moral, J. Arenas-García, A.R. Figueiras-Vidal, Boosting by weighting critical and erroneous samples, Neurocomputing 69 (2006) 679–685. [30] V. Gómez-Verdejo, J. Arenas-García, A.R. Figueiras-Vidal, A dynamically adjusted mixed emphasis method for building boosting ensembles, IEEE Trans. Neural Netw. 19 (2008) 3–17. [31] H.J. Shin, S. Cho, Neighborhood property-based pattern selection for support vector machines, Neural Comput. 19 (2007) 816–855.


[32] Y.-S. Hwang, S.-Y. Bang, An efficient method to construct a radial basis function neural network classifier and its application to unconstrained handwritten digit recognition, in: Proc. Intl. Conf. Pattern Recognition, Vienna, Austria, 1996, pp. 640–644. [33] J.T. Kwok, Moderating the outputs of support vector machine classifiers, IEEE Trans. Neural Netw. 10 (1999) 1018–1031. [34] P. Alinat, Periodic Progress Report 4, ROARS Project Esprit II-5516, Tech. Rep., ASM 93/S/EGS/NC/079, 1993. [35] B.D. Ripley, Neural networks and related methods for classification, J. Roy. Stat. Soc. 56 (1994) 409–456. [36] A. Frank, A. Asuncion, UCI Machine Learning Repository, Univ California, Irvine, School of Information and Computer Science, 2010. . [37] F. Wilcoxon, Individual comparisons by ranking methods, Biometrics Bull. 1 (1945) 80–83.

[38] M. Hollander, D. Wolfe, Nonparametric Statistical Methods, Wiley, New York, NY, 1999. [39] B.D. Ripley, Pattern Recognition and Neural Networks, Cambridge Univ. Press, Cambridge, UK, 1996. [40] J. Rojo-Álvarez, G. Camps-Valls, M. Martínez-Ramón, E. Soria-Olivas, A. NaviaVázquez, A.R. Figueiras-Vidal, Support vector machines framework for linear signal processing, Signal Proc. 85 (2005) 2316–2326. [41] M. Martínez-Ramón, J. Rojo-Álvarez, G. Camps-Valls, J. Munoz-Marí, A. NaviaVázquez, E. Soria-Olivas, A.R. Figueiras-Vidal, Support vector machines for nonlinear kernel ARMA system identification, IEEE Trans. Neural Netw. 17 (2006) 1617–1622. [42] J. Rojo-Álvarez, M. Martínez-Ramón, M. de Prado-Cumplido, A. ArtésRodríguez, A.R. Figueiras-Vidal, Support vector method for robust ARMA system identification, IEEE Trans. Signal Proc. 52 (2004) 155–164.