A novel combining classifier method based on Variational Inference


Pattern Recognition 49 (2016) 198–212


Tien Thanh Nguyen (a), Thi Thu Thuy Nguyen (b), Xuan Cuong Pham (c), Alan Wee-Chung Liew (a,*)

(a) School of Information and Communication Technology, Griffith University, Gold Coast Campus, QLD 4222, Australia
(b) Hue College of Economics, Hue University, No. 99, Ho Dac Di Street, An Cuu Ward, Hue City, Vietnam
(c) Hanoi University of Science and Technology, No. 1, Dai Co Viet Street, Hai Ba Trung District, Hanoi, Vietnam
(*) Corresponding author. Tel.: +61 7 55528671; fax: +61 7 55528066. E-mail: a.liew@griffith.edu.au

http://dx.doi.org/10.1016/j.patcog.2015.06.016

Article history: Received 14 January 2015; received in revised form 16 May 2015; accepted 25 June 2015; available online 11 July 2015.

Keywords: Ensemble method; multi-classifier system; mixture of experts; classifier fusion; combining classifier algorithm; Variational Inference; multivariate Gaussian distribution

Abstract

In this paper, we propose a combining classifier method based on the Bayesian inference framework. Specifically, the outputs of the base classifiers (called Level1 data or meta-data) are utilized in a combiner to produce the final classification. In our ensemble system, each class in the training set induces a distribution on the Level1 data, which is modeled by a multivariate Gaussian distribution. Traditionally, the parameters of the Gaussian are estimated by maximum likelihood. Here, however, maximum likelihood estimation cannot be applied because the covariance matrix of the Level1 data of each class is not of full rank. Instead, we propose to estimate the multivariate Gaussian distribution of the Level1 data of each class using the Variational Inference method. Experiments conducted on eighteen UCI Machine Learning Repository datasets and a selected 10-class CLEF2009 medical imaging database demonstrate the advantage of our method over several well-known ensemble methods.

© 2015 Elsevier Ltd. All rights reserved.

1. Introduction

In recent years, ensemble methods have been studied extensively and are an active research area in the machine learning community [1]. In general, it is difficult to know a priori which learning algorithm is suitable for a particular dataset, and an ensemble approach can achieve better accuracy than any single learning algorithm. According to statistics from the Web of Knowledge, more than 600 publications with the keyword "classifier ensemble" appeared in the two years 2011 and 2012 alone [2]. In addition to applications in computer-aided medical diagnosis, computer vision, software engineering, and information retrieval, ensemble-based algorithms have also won competitions such as the Netflix Prize (http://www.netflixprize.com) and the KDD-Cup (http://www.sigkdd.org/kddcup) [3]. All this demonstrates the significant interest in ensemble methods in both theoretical and applied studies.

Over the past 30 years, various approaches to ensemble learning have been proposed [1,3], and several taxonomies exist that reflect different views of an ensemble system [1,3-7]. In this paper, we follow the taxonomy in [8], in which ensemble methods are divided into two types (contrasted in the short sketch after this list):

• Homogeneity: generic classifiers are generated by the same learning algorithm on different training sets obtained from the original one, and the outputs of these classifiers are then combined to give the final decision. Several state-of-the-art coverage-based ensemble methods in the literature are AdaBoost [9], Bagging [10] and Random Forest [11].

• Heterogeneity: a fixed set of different learning algorithms is applied to the same training set to generate different classifiers, and the final decision is made from the outputs of these classifiers (called Level1 data or meta-data [12-14]). This approach focuses on algorithms for combining the Level1 data so as to achieve higher accuracy than any single classifier.
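The following scikit-learn sketch contrasts the two types. The dataset, library and base learners here are our choices for illustration only; the paper's own experiments were implemented in Matlab.

```python
# Illustrative contrast between the two ensemble types (library and dataset are our
# choices for this sketch; they are not part of the paper's experimental setup).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# Homogeneity: one learning algorithm, many resampled training sets (here Bagging).
homogeneous = BaggingClassifier(DecisionTreeClassifier(), n_estimators=200)
print("Bagging (homogeneous):", cross_val_score(homogeneous, X, y, cv=10).mean())

# Heterogeneity: different learning algorithms on the same training set; their
# posterior outputs (the Level1 data) would then be passed to a separate combiner.
for clf in (LinearDiscriminantAnalysis(), GaussianNB(), KNeighborsClassifier(5)):
    print(type(clf).__name__, cross_val_score(clf, X, y, cv=10).mean())
```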

In this paper, we focus on the second type of ensemble method. There are two ways to combine the outputs of different classifiers, namely fixed combining methods and trainable combining methods [8]. Fixed combining methods, which are based on the Bayesian decision model [15], do not take the label information in the Level1 data of the training set into account when combining. Their advantage is that no training on the Level1 data is needed, so they are less time-consuming than their trainable counterparts. Several popular fixed combining methods have been studied in the literature, namely the Sum, Product, Vote, Max, Min, Average, Median and Oracle rules [15,16]. To our knowledge, Vote and Sum are the most popular rules and have been successfully applied in many combining classifier situations.


Kittler et al. [15] showed that the Sum rule is derived under two assumptions ("conditional independence of the respective representations used by the classifiers" and "classes being highly ambiguous") and that it produces the most reliable predictions. Kuncheva [16] also derived theoretical error probabilities for several fusion rules under normal and uniform distribution assumptions. In contrast, trainable combining methods learn a prediction model from the Level1 data of the training set. Exploiting the Level1 data in this way can improve classification accuracy, but it also increases the computational cost. Several state-of-the-art trainable combining algorithms are Multiple Response Linear Regression (MLR) [13], Decision Template [17] and SCANN [18].

The most important trainable combining methods are based on the Stacking algorithm, first proposed by Wolpert [19] and further developed in [12,13,18]. In Stacking, the training set is divided into several equal disjoint parts; each part in turn plays the role of the test set while the rest serves as the training set during the training phase. The output of Stacking is the posterior probability that an observation belongs to each class according to each classifier. The common feature of Stacking-based approaches is that the Level1 data of the training set is trained again by a combining method to form the final prediction framework.

Several strategies have been proposed to exploit the label information in the Level1 data of the training set. In one strategy, the outputs of the classifiers are grouped according to the given labels and a template associated with each label is constructed. Two well-known methods using this strategy are MLR [13] and Decision Template [17]. MLR assumes that each classifier puts a different weight on each class; the combining algorithm is then based on M linear combinations of the posterior probabilities and the associated weights for the M classes. The predicted label of an unlabeled observation is the one with the maximum value among these combinations, so finding suitable combining weights is crucial for high classification accuracy. Ting and Witten [13] obtained the weights by solving M linear regression models, one per class, from the Level1 data and the training labels in crisp form. Zhang and Zhou [20] later proposed finding the weights by linear programming. Sen et al. [21] introduced a method inspired by MLR that uses a hinge loss in the combiner instead of the conventional least-squares loss; with regularization based on group sparsity, three different combinations were proposed, namely weighted sum, dependent weighted sum and linear stacked generalization.

In the Decision Template method [17], the Level1 data of the training set is grouped according to the class labels of the training observations and the Decision Template of each class is constructed by averaging the Level1 data within that class. Kuncheva et al. [17] proposed eleven measures of similarity between a Decision Template and the Level1 data of an unlabeled observation to predict its class label. The benefit of this method is that both training and testing are fast thanks to its simple computation. However, it can have a high error rate when the base classifiers are not accurate enough, because a simple Decision Template may not represent a class well. A compact sketch of this template-based combiner is given below.
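The sketch builds one Decision Template per class by averaging Level1 rows and assigns a new observation to the nearest template. It is a minimal sketch: the Level1 array L is assumed to be available, and plain Euclidean distance is used instead of one of the eleven similarity measures studied in [17].

```python
import numpy as np

def fit_decision_templates(L, y, n_classes):
    """One template per class: the mean of the Level1 rows belonging to that class."""
    return np.stack([L[y == m].mean(axis=0) for m in range(n_classes)])

def predict_by_template(L_new, templates):
    """Assign each Level1 row to the class whose template is closest (Euclidean distance)."""
    d = ((L_new[:, None, :] - templates[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)
```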


Merz [18] combined Stacking, Correspondence Analysis (CA) and K Nearest Neighbor (KNN) in a single algorithm called SCANN. The idea is to discover the relationship between the training observations and the classification outputs of the base classifiers by applying CA to the indicator matrix formed by the Level1 data and the true labels of these observations; KNN is then used to classify unseen observations in the new scaled space. In real-world applications the method is sometimes impractical because the indicator matrix used by CA can be singular. Moreover, the testing process of SCANN is more complicated than that of other combining classifier algorithms, which increases classification time. Another approach, proposed by Todorovski and Džeroski [14], is the Meta Decision Tree, a decision tree built on Level1 data in which a classifier, rather than a value for splitting an attribute, is chosen at each node. The authors also proposed expanding the Level1 data with the entropy and the maximum posterior probability so as to increase its discrimination ability, although no theoretical basis was provided for the effectiveness of this expansion. Recently, Nguyen et al. [22] proposed a hybrid combining classifier system in which fuzzy rules are applied to the Level1 data to produce the classification rules. Although that system outperforms other fuzzy rule-based methods and ensemble methods in experiments, and addresses the high-dimensionality problem commonly found in fuzzy rule-based methods, its training is time-consuming because of the large number of rules generated on the Level1 data.

Unlike the above approaches, we propose a novel combining classifier method, called VIG, that approximates the density distribution of the Level1 data to obtain a prediction framework based on the Bayesian decision model. Since the maximum likelihood approach is not applicable due to the singularity of the Level1 data, we propose using Variational Inference (VI) to estimate the multivariate Gaussian density.

The rest of this paper is organized as follows. Section 2 describes the properties of Level1 data and then introduces the VI method for multivariate Gaussian distribution estimation. Section 3 presents the proposed framework, based on the Bayesian decision model, for combining the outputs of the base classifiers. Experimental results on eighteen UCI datasets [23] and the CLEF medical image database [24] are reported and discussed in Section 4. The conclusion is given in the last section.

2. Preliminaries

A summary of the mathematical notation is given below.

X : observed data (the training set)
x : an observation
M : the number of classes
N : the number of observations
K : the number of learning algorithms
{y_m}, m = 1,...,M : the set of class labels
Z : hidden variable
μ, Σ : mean and covariance of a Gaussian distribution
Λ : the precision matrix, i.e. the inverse of the covariance matrix, Λ = Σ^{-1}
D : the dimension of the input data
W_0, υ_0 : initial scale matrix and degrees of freedom of the Wishart distribution p(Λ)
m_0, β_0 : initial mean vector and scale of the precision matrix Λ of the Gaussian distribution p(μ | Λ)
m, H : mean vector and precision matrix of the Gaussian distribution q(μ) = N(μ | m, H^{-1})
W, υ : scale matrix and degrees of freedom of the Wishart distribution q(Λ) = W(Λ | W, υ)
Tr(·) : the trace operator of a matrix
Γ(·) : the Gamma function, Γ(t) = ∫_0^∞ x^{t-1} e^{-x} dx
ℒ(q) : the lower bound
L : the meta-data (Level1 data) of X
L(x) : the meta-data (Level1 data) of observation x
{L_m}, m = 1,...,M : the Level1 data of the observations in the m-th class, L_m = {(L(x), y) | x ∈ X, y = y_m}
{G_m}, m = 1,...,M : the Gaussian model for the m-th class
| · | : the (relative) cardinality of a set

2.1. Level1 data

Let $\{y_m\}_{m=1,\dots,M}$ denote the set of labels, N the number of observations, K the number of base classifiers and M the number of classes. For an observation x, $P_k(y_m \mid x)$ is the probability that x belongs to the class with label $y_m$ according to the k-th classifier. Kuncheva et al. [17] summarized three output types for x, for each k = 1,...,K:

• Crisp label: only a class label is returned, $P_k(y_m \mid x) \in \{0, 1\}$ and $\sum_m P_k(y_m \mid x) = 1$.
• Fuzzy label: the posterior probabilities that x belongs to the classes are returned, i.e. $P_k(y_m \mid x) \in [0, 1]$ and $\sum_m P_k(y_m \mid x) = 1$.
• Possibilistic label: the same as the fuzzy label, but the supports are not required to sum to one, i.e. $P_k(y_m \mid x) \in [0, 1]$ and $\sum_m P_k(y_m \mid x) > 0$.

In this work, we focus only on the second type, i.e. the fuzzy label, which has a familiar interpretation: the posterior probability reflects the support of a class for an observation. The Level1 data of all observations is the $N \times MK$ posterior probability matrix

$$L := \begin{bmatrix}
P_1(y_1 \mid x_1) & \cdots & P_1(y_M \mid x_1) & \cdots & P_K(y_1 \mid x_1) & \cdots & P_K(y_M \mid x_1) \\
P_1(y_1 \mid x_2) & \cdots & P_1(y_M \mid x_2) & \cdots & P_K(y_1 \mid x_2) & \cdots & P_K(y_M \mid x_2) \\
\vdots & \ddots & \vdots & \ddots & \vdots & \ddots & \vdots \\
P_1(y_1 \mid x_N) & \cdots & P_1(y_M \mid x_N) & \cdots & P_K(y_1 \mid x_N) & \cdots & P_K(y_M \mid x_N)
\end{bmatrix} \qquad (1)$$

whereas the Level1 data of an observation $x_n$ is written in one of two forms, depending on the combining algorithm:

$$L(x_n) := \begin{bmatrix}
P_1(y_1 \mid x_n) & \cdots & P_1(y_M \mid x_n) \\
\vdots & \ddots & \vdots \\
P_K(y_1 \mid x_n) & \cdots & P_K(y_M \mid x_n)
\end{bmatrix}, \quad n = 1, \dots, N \qquad (2a)$$

$$L(x_n) := \big[\, P_1(y_1 \mid x_n) \;\cdots\; P_1(y_M \mid x_n) \;\cdots\; P_K(y_1 \mid x_n) \;\cdots\; P_K(y_M \mid x_n) \,\big], \quad n = 1, \dots, N \qquad (2b)$$

We have the following properties of the Level1 data.

Lemma 1. The Level1 data generated using fuzzy labels is not of full column rank.

Corollary 1. The covariance matrix of the Level1 data is not of full rank.

The proofs can be found in Appendix A.
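The rank deficiency stated in Lemma 1 and Corollary 1 is easy to confirm numerically. The sketch below generates stacking-style Level1 data for a small dataset with K = 3 base classifiers and checks the ranks; the dataset and the scikit-learn tooling are our choices for illustration, not the paper's Matlab pipeline.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_predict
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)                      # N = 150 observations, M = 3 classes
base = [LinearDiscriminantAnalysis(), GaussianNB(), KNeighborsClassifier(5)]

# Stacking-style Level1 data: out-of-fold posterior probabilities of the K = 3 base
# classifiers, concatenated column-wise into the N x MK matrix of Eq. (1).
L = np.hstack([cross_val_predict(c, X, y, cv=10, method="predict_proba") for c in base])

print(L.shape)                                         # (150, 9)
print(np.linalg.matrix_rank(L))                        # < 9: not full column rank (Lemma 1)
print(np.linalg.matrix_rank(np.cov(L, rowvar=False)))  # < 9: singular covariance (Corollary 1)
```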

2.2. Variational Inference for multivariate Gaussian distribution

Maximum likelihood estimation is a popular way to fit a Gaussian distribution to an observed dataset. However, it cannot be applied when the covariance matrix of the dataset is not of full rank, and Kuncheva et al. [17] noted that when the accuracy of all base classifiers is high, the covariance matrix of the Level1 data is likely to be singular. In this work, we therefore propose to use the VI method to estimate the multivariate Gaussian model. The idea behind VI is to approximate the posterior distribution $p(Z \mid X)$ of the hidden variables Z given the observed data X by a more tractable distribution $q(Z)$ that minimizes the divergence between $p(Z \mid X)$ and $q(Z)$. In the maximum likelihood method, the parameters $(\mu, \Sigma)$ are not treated as random variables and the likelihood function is maximized with respect to their values. In contrast, in VI the parameters $(\mu, \Sigma)$ are treated as random variables and priors are placed over them to obtain the posterior distribution $p(\mu, \Sigma \mid X)$. In the literature, the Kullback–Leibler (KL) divergence is commonly used to measure the distance between two distributions:

$$\mathrm{KL}(q \,\|\, p) = \mathbb{E}_q\!\left[\ln \frac{q(Z)}{p(Z \mid X)}\right] = -\int q(Z) \ln \frac{p(Z \mid X)}{q(Z)}\, dZ \qquad (3)$$

It is worth noting that the KL divergence is difficult to optimize directly, since it requires knowledge of the very distribution we are trying to approximate. Because $\mathrm{KL}(q \,\|\, p) \ge 0$ and $\mathrm{KL}(q \,\|\, p) = \ln p(X) - \mathcal{L}(q)$, where $\mathcal{L}(q) = \int q(Z) \ln \frac{p(X, Z)}{q(Z)}\, dZ$ is a lower bound on the log marginal probability $\ln p(X)$, we can maximize the lower bound $\mathcal{L}(q)$ instead of minimizing $\mathrm{KL}(q \,\|\, p)$.

If we assume the factorization $q(Z) = \prod_{i=1}^{M} q_i(Z_i)$, where $Z = \cup_{i=1}^{M} Z_i$ and $Z_i \cap Z_j = \emptyset$ for $i \ne j$, and iteratively maximize $\mathcal{L}(q)$ with respect to $q_j(Z_j)$ while the factors $q_{i \ne j}$ are held fixed, the optimal solution $q_j^{*}(Z_j)$ is given by [25,26]:

$$\ln q_j^{*}(Z_j) = \mathbb{E}_{i \ne j}\big[\ln p(X, Z)\big] + \text{const} \qquad (4)$$

Here $\mathbb{E}_{i \ne j}[\cdot]$ denotes an expectation with respect to the q distributions over all variables $Z_i$ with $i \ne j$, and the constant is independent of $Z_j$. Convergence is guaranteed because the bound is convex with respect to each of the factors $q_i(Z_i)$ [27].

In the literature, VI-based approaches have been used to estimate Dirichlet distributions and Gaussian Mixture Models (GMM) [26,28,29]. In this work, we apply VI to estimate the parameters of a single multivariate Gaussian distribution. By the Central Limit Theorem, a Gaussian can approximate a wide range of other distributions, such as the Poisson, Binomial and Gamma [30], whereas Dirichlet distributions are most commonly used as priors for categorical or multinomial variables in Bayesian models [31,32]. Here, the multivariate Gaussian is used to approximate the likelihood function $p(L(x) \mid G_m)$ for each class label, where all features of $L(x)$ are real-valued and lie in $[0, 1]$. Although a GMM could also be used to model each class, a GMM requires many parameters, which makes training expensive; moreover, when only a small amount of data is available, the choice of the number of Gaussian components becomes critical [33].

Our goal is to infer the posterior distribution of the mean $\mu$ and the precision matrix $\Lambda = \Sigma^{-1}$, given a dataset $X = \{x_n \mid n = 1, \dots, N\}$ whose elements are assumed to be drawn independently from the multivariate Gaussian $N(x \mid \mu, \Lambda^{-1})$. The likelihood function is

$$p(X \mid \mu, \Lambda) = \prod_{n=1}^{N} N(x_n \mid \mu, \Lambda^{-1}) = (2\pi)^{-\frac{ND}{2}}\, |\Lambda|^{\frac{N}{2}} \exp\!\left\{ -\frac{1}{2} \sum_{n=1}^{N} (x_n - \mu)^{T} \Lambda (x_n - \mu) \right\} \qquad (5)$$

where D is the dimensionality of the variable x. To formulate a variational solution, we write down the joint distribution of all of the random variables, $p(X, \mu, \Lambda) = p(X \mid \mu, \Lambda)\, p(\mu \mid \Lambda)\, p(\Lambda)$.

The conjugate prior of a multivariate Gaussian with unknown $\mu$ and $\Lambda$ is the Gaussian–Wishart distribution $p(\mu, \Lambda) = p(\mu \mid \Lambda)\, p(\Lambda)$, where $p(\mu \mid \Lambda)$ is a Gaussian,

$$p(\mu \mid \Lambda) = N\big(\mu \mid m_0, (\beta_0 \Lambda)^{-1}\big) = (2\pi)^{-\frac{D}{2}}\, |\beta_0 \Lambda|^{\frac{1}{2}} \exp\!\left\{ -\frac{1}{2} (\mu - m_0)^{T} \beta_0 \Lambda (\mu - m_0) \right\} \qquad (6)$$

and $p(\Lambda)$ is a Wishart distribution,

$$p(\Lambda) = W(\Lambda \mid W_0, \upsilon_0) = B(W_0, \upsilon_0)\, |\Lambda|^{\frac{\upsilon_0 - D - 1}{2}} \exp\!\left\{ -\frac{1}{2} \mathrm{Tr}\big(W_0^{-1} \Lambda\big) \right\} \qquad (7)$$

$$B(W_0, \upsilon_0) = |W_0|^{-\frac{\upsilon_0}{2}} \left( 2^{\frac{\upsilon_0 D}{2}}\, \pi^{\frac{D(D-1)}{4}} \prod_{i=1}^{D} \Gamma\!\left(\frac{\upsilon_0 + 1 - i}{2}\right) \right)^{-1} \qquad (8)$$

where $m_0$ and $\beta_0$ are the D-dimensional mean vector and the scale of the precision matrix $\Lambda$ of the Gaussian $p(\mu \mid \Lambda)$, $W_0$ and $\upsilon_0$ are the $D \times D$ scale matrix and the number of degrees of freedom of the Wishart distribution $p(\Lambda)$, $\mathrm{Tr}(\cdot)$ denotes the trace operator, and $\Gamma(\cdot)$ denotes the Gamma function $\Gamma(t) = \int_0^{\infty} x^{t-1} e^{-x}\, dx$.

Adopting a factorized variational approximation to the posterior distribution, i.e. $q(\mu, \Lambda) = q(\mu)\, q(\Lambda)$, the following update equations can be derived:

$$\ln q^{*}(\mu) = \mathbb{E}_{\Lambda}\big[\ln p(X, \mu, \Lambda)\big] + \text{const} \qquad (9)$$

$$\ln q^{*}(\Lambda) = \mathbb{E}_{\mu}\big[\ln p(X, \mu, \Lambda)\big] + \text{const} \qquad (10)$$

We have the following results.

Lemma 2. The optimum solution $q^{*}(\mu)$ of update equation (9) is a Gaussian $q^{*}(\mu) = N(\mu \mid m, H^{-1})$ with mean m and precision H given by (11) and (12):

$$m = \frac{\beta_0 m_0 + N \bar{x}}{\beta_0 + N} \qquad (11)$$

$$H = (\beta_0 + N)\, \mathbb{E}[\Lambda] \qquad (12)$$

Lemma 3. The optimum solution $q^{*}(\Lambda)$ of update equation (10) is a Wishart $q^{*}(\Lambda) = W(\Lambda \mid W, \upsilon)$ with number of degrees of freedom $\upsilon$ and scale matrix W given by (13) and (14):

$$\upsilon = \upsilon_0 + N + 1 \qquad (13)$$

$$W^{-1} = W_0^{-1} + (\beta_0 + N) H^{-1} + S + \frac{\beta_0 N}{\beta_0 + N} J \qquad (14)$$

where

$$\bar{x} = \frac{1}{N} \sum_{n=1}^{N} x_n, \qquad S = \sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^{T}, \qquad J = (\bar{x} - m_0)(\bar{x} - m_0)^{T}$$

Lemma 4. The lower bound $\mathcal{L}(q)$ of the Variational Inference for the multivariate Gaussian distribution is given by (15):

$$\mathcal{L}(q) = \ln B(W_0, \upsilon_0) - \ln B(W, \upsilon) - \frac{1}{2}\Big[ ND \ln(2\pi) - D \ln \beta_0 - \upsilon D + \ln|H| + \upsilon\, \mathrm{Tr}(SW) + \frac{\beta_0 N \upsilon}{\beta_0 + N} \mathrm{Tr}(JW) + \upsilon\, \mathrm{Tr}\big(W_0^{-1} W\big) \Big] \qquad (15)$$

The proofs of Lemmas 2–4 are given in Appendix A. Denoting by $\mathcal{L}_i(q)$ the value of the lower bound at the i-th iteration, it can be shown that

$$\mathcal{L}_i(q) - \mathcal{L}_{i-1}(q) = -\ln B(W_i, \upsilon) - \frac{\upsilon}{2} \mathrm{Tr}\Big[\Big(S + W_0^{-1} + \frac{\beta_0 N}{\beta_0 + N} J\Big) W_i\Big] + \ln B(W_{i-1}, \upsilon) + \frac{\upsilon}{2} \mathrm{Tr}\Big[\Big(S + W_0^{-1} + \frac{\beta_0 N}{\beta_0 + N} J\Big) W_{i-1}\Big] - \frac{1}{2}\big(\ln|W_i| - \ln|W_{i-1}|\big) \qquad (16)$$

where we have made use of the fact that

$$\ln|H_i| - \ln|H_{i-1}| = \ln \frac{|\mathbb{E}[\Lambda_i]|}{|\mathbb{E}[\Lambda_{i-1}]|} = \ln \frac{|\upsilon W_i|}{|\upsilon W_{i-1}|} = \ln|W_i| - \ln|W_{i-1}|$$

This leads to the following algorithm for multivariate Gaussian distribution estimation.

Algorithm 1. VI for multivariate Gaussian distribution estimation

Input: dataset X, threshold $\varepsilon$; $m_0$, $\beta_0$, $\upsilon_0$, $W_0$, $\mathbb{E}[\Lambda]$
Output: m, H of $q(\mu) = N(\mu \mid m, H^{-1})$ and W, $\upsilon$ of $q(\Lambda) = W(\Lambda \mid W, \upsilon)$

  i := 1
  Repeat
    Update m, H using (11) and (12)
    Update W, $\upsilon$ using (13) and (14)
    If i > 1 and $\mathcal{L}_i(q) - \mathcal{L}_{i-1}(q) < \varepsilon$ then break
    i := i + 1
  End

In the algorithm above, the four variables of $q(\mu)$ and $q(\Lambda)$ are updated step by step from their initial values. The updating process stops when the change in the lower bound $\mathcal{L}(q)$ is smaller than a specified threshold $\varepsilon$. In our experiments, 3 or 4 iterations typically proved sufficient to achieve convergence with a threshold $\varepsilon = 10^{-10}$.
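A minimal NumPy sketch of Algorithm 1 is given below, using the initialization later adopted in Section 4.2 (m0 = 0, beta0 = 1, nu0 = D, W0 = I, E[Lambda] = nu0*W0). For brevity the loop stops when E[Lambda] stabilizes rather than monitoring the lower-bound change (16); this stopping rule is our simplification.

```python
import numpy as np

def vi_gaussian(X, eps=1e-10, max_iter=100):
    """Variational Inference for a multivariate Gaussian, following updates (11)-(14)."""
    N, D = X.shape
    m0, beta0 = np.zeros(D), 1.0
    nu0, W0_inv = float(D), np.eye(D)          # W0 = I, hence W0^{-1} = I
    E_Lambda = nu0 * np.eye(D)                 # initial E[Lambda] = nu0 * W0

    xbar = X.mean(axis=0)
    S = (X - xbar).T @ (X - xbar)              # scatter matrix around the sample mean
    J = np.outer(xbar - m0, xbar - m0)
    nu = nu0 + N + 1                           # Eq. (13): fixed over the iterations
    m = (beta0 * m0 + N * xbar) / (beta0 + N)  # Eq. (11): does not change over iterations

    for _ in range(max_iter):
        H = (beta0 + N) * E_Lambda                              # Eq. (12)
        W_inv = (W0_inv + (beta0 + N) * np.linalg.inv(H)
                 + S + beta0 * N / (beta0 + N) * J)             # Eq. (14)
        W = np.linalg.inv(W_inv)
        E_new = nu * W                         # mean of the Wishart q(Lambda) = W(Lambda|W, nu)
        converged = np.max(np.abs(E_new - E_Lambda)) < eps
        E_Lambda = E_new
        if converged:
            break

    return m, (beta0 + N) * E_Lambda, W, nu    # m, H, W, nu
```

Only W (through E[Lambda]) changes from one iteration to the next, which is consistent with the very fast convergence reported later in Table 6.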

3. Proposed combining classifier method

The most important distinction between our work and previous work is that we use a statistical-learning approach on the Level1 data to build the combining classifier. Attributes in the original data often differ in nature, measurement unit and type, so a Gaussian model frequently fits the distribution of the original data poorly. Level1 data, on the other hand, can be viewed as data rescaled from the feature domain to the posterior domain, where every value is real and lies in [0, 1]. Observations belonging to the same class tend to receive similar posterior probabilities from the base classifiers and therefore lie close together in the new domain. Consequently, Level1 data is expected to be more discriminative than the original data, and a Gaussian model on the Level1 data is expected to be more effective than one on the original data.

The proposed VIG combining classifier method is illustrated in Fig. 1. First, the Stacking algorithm is applied to the training set to generate the Level1 data, denoted by L. Since the labels of the training observations $X = \{(x, y) \mid y \in \{y_m\}_{m=1,\dots,M}\}$ are known, we can group L into M groups corresponding to the M labels: $L_m = \{(L(x), y) \mid x \in X,\; y = y_m\}$, $m = 1, \dots, M$. The VI method is then applied to each $L_m$ to model the distribution of each label by a multivariate Gaussian. Based on the Bayesian decision model, the posterior probability that an observation x belongs to the m-th class is given by

$$p(G_m \mid L(x)) \propto p(L(x) \mid G_m)\, p(G_m) \qquad (17)$$

where $G_m$ is the model for the m-th class.


Fig. 1. The proposed VIG combining classifier method.

Here $p(G_m)$ is the prior probability of the m-th class. Many approaches to the choice of prior probability have been introduced [34]; they generally fall into two classes, informative priors and uninformative priors. Spiegelhalter et al. [35] and Lauritzen et al. [36] demonstrated improved prediction when informative priors are used in Bayesian systems. Gelman [37] stated that the choice of prior distribution has only a minor effect on the posterior probabilities when there are many observations and the parameters are well identified, whereas it plays an important role when the number of observations is small or the available data provide only indirect information about the parameters of interest. In this paper, due to space limitations and to show that even a simple empirical choice can achieve good classification performance, we compute the prior probability simply as

$$p(G_m) = \frac{|L_m|}{|X|} \qquad (18)$$

where $|\cdot|$ denotes the cardinality of a set. The likelihood function $p(L(x) \mid G_m)$ is given by

$$p(L(x) \mid G_m) = N(L(x) \mid \mu_m, \Sigma_m) = (2\pi)^{-\frac{MK}{2}}\, |\Sigma_m|^{-\frac{1}{2}} \exp\!\left\{ -\frac{1}{2} \big(L(x) - \mu_m\big)^{T} \Sigma_m^{-1} \big(L(x) - \mu_m\big) \right\} \qquad (19)$$

where $\mu_m$ and $\Sigma_m$ are the mean and covariance matrix of the multivariate Gaussian model obtained by VI for the m-th class. Note that instead of $\Sigma_m$, the precision matrix $\Lambda_m$ is computed by VI during the training process.

In the classification phase, the Level1 data of an unlabeled observation is first generated by the base classifiers. Its class prediction is then obtained by selecting the label associated with the maximum posterior probability computed by the M multivariate Gaussian models. The class label of an unlabeled observation $x^{*}$ is therefore predicted by

$$x^{*} \in y_t \quad \text{if} \quad t = \arg\max_{m=1,\dots,M} p\big(G_m \mid L(x^{*})\big) = \arg\max_{m=1,\dots,M} p\big(L(x^{*}) \mid G_m\big)\, p(G_m) \qquad (20)$$

Our ensemble framework has some similarity with Bayesian Model Averaging (BMA) [38], since the results of all hypotheses (classifiers) are used to obtain the final discriminative model. However, in our work we consider only the outputs themselves (obtained from the base classifiers), whereas in BMA the base classifier models are explicitly part of the formulation. The VIG combining classifier method is given below.

Algorithm 2. VIG combining classifier method

Training process:
Input: training set $X = \{(x, y)\}$, K = {K learning algorithms}, $\varepsilon$, $m_0$, $\beta_0$, $\upsilon_0$, $W_0$, $\mathbb{E}[\Lambda]$
Output: Gaussian models $\{G_m\}$ with parameters $\mu_m$, $\Lambda_m$ and $p(G_m)$, $m = 1, \dots, M$
  Step 1: L = Stacking(X, K)
  Step 2: $L_m = \{(L(x), y) \mid x \in X,\; y = y_m\}$
  Step 3: For the m-th class
            $(m_m, H_m, W_m, \upsilon_m)$ = Algorithm1$(L_m, \varepsilon, m_0, \beta_0, \upsilon_0, W_0, \mathbb{E}[\Lambda])$
            where $\mu_m = m_m$, $\Lambda_m = \upsilon_m W_m$ and $p(L(x) \mid G_m) = N(L(x) \mid \mu_m, \Lambda_m^{-1})$
            Compute $p(G_m)$ using (18)
          End

Classification process:
Input: unlabeled observation $x^{*}$
Output: predicted label for $x^{*}$
  Step 1: Compute $L(x^{*})$
  Step 2: For the m-th class
            Compute $p(G_m \mid L(x^{*})) \propto p(L(x^{*}) \mid G_m)\, p(G_m)$ using (17)
          End
  Step 3: Predict the label of $x^{*}$ using (20)
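Building on the vi_gaussian() sketch from Section 2.2, the following minimal implementation of Algorithm 2 fits one Gaussian per class on the Level1 data and predicts by Eqs. (17)-(20). It assumes integer class labels 0..M-1 and an N x MK Level1 array L; it is an illustrative sketch rather than the authors' Matlab code.

```python
import numpy as np

def vig_fit(L, y, n_classes):
    """Train the VIG combiner: one multivariate Gaussian (via VI) per class plus its prior."""
    models = []
    for m in range(n_classes):
        Lm = L[y == m]
        mean, H, W, nu = vi_gaussian(Lm)
        Lambda = nu * W                          # precision matrix Lambda_m = nu_m * W_m
        prior = Lm.shape[0] / L.shape[0]         # Eq. (18): p(G_m) = |L_m| / |X|
        models.append((mean, Lambda, prior))
    return models

def vig_predict(L_new, models):
    """Predict by the maximum of log p(L(x)|G_m) + log p(G_m), cf. Eqs. (17), (19), (20)."""
    scores = []
    for mean, Lambda, prior in models:
        _, logdet = np.linalg.slogdet(Lambda)
        diff = L_new - mean
        quad = np.einsum("ni,ij,nj->n", diff, Lambda, diff)
        # The common -(MK/2)*log(2*pi) term is omitted: it does not affect the argmax.
        scores.append(0.5 * logdet - 0.5 * quad + np.log(prior))
    return np.argmax(np.stack(scores, axis=1), axis=1)
```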

4. Experimental results

4.1. Dataset

To evaluate the performance of the proposed VIG method we performed experiments on two datasets. In the first experiment, eighteen data files from the UCI Machine Learning Repository were used, since this repository is widely used to validate the performance of classification systems [23]. To ensure an objective comparison between our method and the benchmark algorithms, we chose files whose number of observations varies considerably, from small files such as Fertility and Iris to a large file such as Skin&NonSkin. The number of attributes also varies widely, from three (Titanic) to sixty (Sonar). Information about the selected data files is summarized in Table 1.

The second experiment was conducted on CLEF2009, a medical imaging database collected by Aachen University, Germany [24]. It is a large database containing 15,363 images allocated to 193 hierarchical categories. Here, we chose 10 classes with different numbers of observations per class (Table 2). The Histogram of Local Binary Patterns (HLBP) [39] was selected as the feature vector of each image; a rough sketch of such a feature extractor is given below.
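The sketch uses scikit-image's uniform LBP as a stand-in for HLBP; the radius, number of neighbours and binning are our illustrative choices and are not the exact configuration of [39].

```python
import numpy as np
from skimage.feature import local_binary_pattern

def hlbp_feature(gray_image, P=8, R=1.0):
    """Histogram of (uniform) Local Binary Pattern codes as an image feature vector."""
    codes = local_binary_pattern(gray_image, P, R, method="uniform")   # codes in 0..P+1
    hist, _ = np.histogram(codes, bins=P + 2, range=(0, P + 2), density=True)
    return hist
```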

4.2. Experimental settings

Three learning algorithms, namely Linear Discriminant Analysis (LDA), Naïve Bayes, and K Nearest Neighbor (with K set to 5, denoted by 5-NN), were chosen to construct the base classifiers. These diverse learning algorithms were chosen to ensure the diversity of the ensemble system. Since our method is a combining classifier method, it is necessary to compare it with other well-known combining classifier methods; it is also important to compare its error rates with those of the base classifiers to demonstrate the advantage of the ensemble approach. In our experiments, the proposed method was compared with seven benchmarks: the best result among the base classifiers on the test set, the best result among the fixed combining rules on the test set, Decision Template (using the similarity measure $S_1(L(x), DT_m) = |L(x) \cap DT_m| / |L(x) \cup DT_m|$, where $DT_m$ is the Decision Template of the m-th class and $|\cdot|$ is the relative cardinality of a set [17]), MLR, SCANN, AdaBoost, and Bagging.

Only simple values were chosen to initialize the parameters of Algorithm 1: $m_0$ is the D-dimensional zero vector $(0, \dots, 0)^{T}$, $\beta_0 = 1$, $\upsilon_0 = D$, $W_0$ is the $D \times D$ identity matrix and $\mathbb{E}[\Lambda] = \upsilon_0 W_0$. We performed 10-fold cross-validation and ran the test 10 times to obtain 100 test results for each data file. To assess statistical significance, we used a two-sample t-test to compare the classification results of our approach and each benchmark algorithm. Specifically, the test evaluates the null hypothesis $H_0: e_A = e_B$, where $e_A$ and $e_B$ are the classification error rates of the proposed method and the benchmark algorithm:

$$t = \frac{\bar{e}_A - \bar{e}_B - (e_A - e_B)}{\sqrt{\dfrac{s_A^2}{n_A} + \dfrac{s_B^2}{n_B}}} \;\overset{\text{if } H_0 \text{ true}}{=}\; \frac{\bar{e}_A - \bar{e}_B}{\sqrt{\dfrac{s_A^2}{n_A} + \dfrac{s_B^2}{n_B}}} \qquad (21)$$

where $\bar{e}_A$, $\bar{e}_B$ are the means and $s_A$, $s_B$ the standard deviations of the classification error rates computed over the $n_A = n_B = 100$ test results of the proposed method and the benchmark algorithm, respectively. The critical region depends on the chosen alternative hypothesis $H_1$, e.g. one-tailed ($e_A > e_B$ or $e_A < e_B$) or two-tailed ($e_A \ne e_B$). We reject the null hypothesis that the two error rates are equal if the statistic t falls in the rejection region, and vice versa. In this paper a one-tailed alternative hypothesis was used and the significance level was set to 0.05. All source code was implemented in Matlab running on a PC with an Intel Core i5 2.5 GHz processor and 4 GB of RAM. The results of the experiments are summarized in Tables 3 and 4. A worked example of this significance test is sketched below.
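The snippet evaluates Eq. (21) from the summary statistics of two methods and applies the one-tailed decision at the 0.05 level; the Welch degrees-of-freedom approximation is our assumption, since the paper does not state which convention it uses.

```python
import numpy as np
from scipy import stats

def one_tailed_ttest(mean_a, var_a, mean_b, var_b, n_a=100, n_b=100, alpha=0.05):
    """Two-sample t-test of H0: e_A = e_B against H1: e_A > e_B, as in Eq. (21)."""
    se = np.sqrt(var_a / n_a + var_b / n_b)
    t = (mean_a - mean_b) / se                            # Eq. (21) under H0
    df = (var_a / n_a + var_b / n_b) ** 2 / (
        (var_a / n_a) ** 2 / (n_a - 1) + (var_b / n_b) ** 2 / (n_b - 1))
    p = 1.0 - stats.t.cdf(t, df)                          # one-tailed p-value
    return t, p, p < alpha                                # reject H0 when p < alpha
```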

Table 1. Information of the UCI data files used in the evaluation.

| File name | No. of attributes | Attribute type | No. of observations | No. of classes | No. of attributes in Level1 data |
| Bupa | 6 | C,I,R | 345 | 2 | 6 |
| Artificial | 10 | R | 700 | 2 | 6 |
| Pima | 6 | R,I | 768 | 2 | 6 |
| Sonar | 60 | R | 208 | 2 | 6 |
| Heart | 13 | C,I,R | 270 | 2 | 6 |
| Phoneme | 5 | R | 540 | 2 | 6 |
| Titanic | 3 | R,I | 2,201 | 2 | 6 |
| Balance | 4 | C | 625 | 3 | 9 |
| Fertility | 9 | R | 100 | 2 | 6 |
| Skin&NonSkin | 3 | R | 245,057 | 2 | 6 |
| Wdbc | 30 | R | 569 | 2 | 6 |
| Australian | 14 | C,I,R | 690 | 2 | 6 |
| Twonorm | 20 | R | 7,400 | 2 | 6 |
| Ring | 20 | R | 7,400 | 2 | 6 |
| Tae | 20 | C,I | 151 | 2 | 6 |
| Contraceptive | 9 | C,I | 1,473 | 3 | 9 |
| Vehicle | 18 | I | 946 | 4 | 12 |
| Iris | 4 | R | 150 | 3 | 9 |

R: Real, C: Category, I: Integer.

4.3. Results and discussion

4.3.1. Comparison with other combining classifier methods and base classifiers

First, the error rates and variances of the three base classifiers are reported in Table 3. From these results, the classifier with the smallest error rate on each data file was selected as the best base-classifier result. From the significance tests in Table 5, we see that VIG is better than the best result from the base classifiers, obtaining seven wins and only one loss. In comparison with the best result from the six fixed combining rules (see Table 4), our method achieved better results on five files, namely Phoneme (0.1164 vs. 0.1407), Ring (0.1168 vs. 0.2122), Skin&NonSkin (4.28E-04 vs. 6E-04), Vehicle (0.2069 vs. 0.2645) and CLEF2009 (0.1693 vs. 0.2023), and worse results on two files, Bupa (0.3151 vs. 0.297) and Artificial (0.2401 vs. 0.2193). As mentioned earlier, fixed combining rules do not exploit the label information in the Level1 data of the training set, so they sometimes do not reach high accuracy. In contrast, the proposed method is a trainable combining method in which the label information in the Level1 data is exploited to make the prediction. As a result, in almost all situations our method is better than any fixed combining rule.

Table 2. Information of the 10 classes selected from the CLEF2009 medical image database (the sample images of the original table are omitted).

| Class description | Number of observations |
| Abdomen | 80 |
| Cervical | 81 |
| Chest | 80 |
| Facial cranium | 80 |
| Left elbow | 69 |
| Left shoulder | 80 |
| Left breast | 80 |
| Finger | 66 |
| Left ankle joint | 80 |
| Left carpal joint | 80 |


Table 3. Classification error rates and variances of the 3 base classifiers. Each cell gives the mean error rate followed by its variance in parentheses; the "Best" column is marked [+] if the method is better than VIG, [=] if equal to VIG, [-] if worse than VIG (checked by t-test).

| File name | LDA | Naïve Bayes | 5-NN | Best base classifier |
| Bupa | 0.3693 (8.30E-03) | 0.4264 (7.60E-03) | 0.3331 (6.10E-03) | 0.3331 (6.10E-03) [-] |
| Artificial | 0.4511 (1.40E-03) | 0.4521 (1.40E-03) | 0.2496 (2.40E-03) | 0.2496 (2.40E-03) [=] |
| Pima | 0.2396 (2.40E-03) | 0.2668 (2.00E-03) | 0.2864 (2.30E-03) | 0.2396 (2.40E-03) [=] |
| Sonar | 0.2629 (9.70E-03) | 0.3042 (7.40E-03) | 0.1875 (7.60E-03) | 0.1875 (7.60E-03) [=] |
| Heart | 0.1593 (5.30E-03) | 0.1611 (5.90E-03) | 0.3348 (5.10E-03) | 0.1593 (5.30E-03) [=] |
| Phoneme | 0.2408 (3.00E-04) | 0.2607 (3.00E-04) | 0.1133 (2.00E-04) | 0.1133 (2.00E-04) [=] |
| Titanic | 0.2201 (5.00E-04) | 0.2515 (8.00E-04) | 0.2341 (3.70E-03) | 0.2201 (5.00E-04) [=] |
| Balance | 0.2917 (2.90E-03) | 0.2600 (3.30E-03) | 0.1442 (1.20E-03) | 0.1442 (1.20E-03) [-] |
| Fertility | 0.3460 (2.01E-02) | 0.3770 (2.08E-02) | 0.1550 (4.50E-03) | 0.1550 (4.50E-03) [-] |
| Skin&NonSkin | 0.0659 (2.74E-06) | 0.1785 (6.61E-06) | 0.0005 (1.68E-08) | 0.0005 (1.68E-08) [-] |
| Wdbc | 0.0397 (7.00E-04) | 0.0587 (1.20E-03) | 0.0666 (8.00E-04) | 0.0397 (7.00E-04) [=] |
| Australian | 0.1416 (1.55E-03) | 0.1297 (1.71E-03) | 0.3457 (2.11E-03) | 0.1297 (1.71E-03) [=] |
| Twonorm | 0.0217 (3.12E-05) | 0.0217 (3.13E-05) | 0.0312 (3.96E-05) | 0.0217 (3.12E-05) [=] |
| Ring | 0.2381 (2.27E-04) | 0.2374 (2.23E-04) | 0.3088 (1.30E-04) | 0.2374 (2.23E-04) [-] |
| Tae | 0.4612 (1.21E-02) | 0.4505 (1.22E-02) | 0.5908 (1.37E-02) | 0.4505 (1.22E-02) [=] |
| Contraceptive | 0.4992 (1.40E-03) | 0.5324 (1.42E-03) | 0.4936 (1.70E-03) | 0.4936 (1.70E-03) [-] |
| Vehicle | 0.2186 (1.39E-03) | 0.5550 (2.94E-03) | 0.3502 (2.35E-03) | 0.2186 (1.39E-03) [-] |
| Iris | 0.0200 (1.40E-03) | 0.0400 (2.30E-03) | 0.0353 (1.50E-03) | 0.0200 (1.40E-03) [+] |
| CLEF2009 | 0.1683 (1.63E-03) | 0.3757 (2.12E-03) | 0.3116 (2.34E-03) | 0.1683 (1.63E-03) [=] |

Table 4. Classification error rates and variances of the combining classifier algorithms. Each cell gives the mean error rate followed by its variance in parentheses; [+] the method is better than VIG, [=] equal to VIG, [-] worse than VIG (checked by t-test); "x" indicates that the method could not be run on that file.

| File name | Best of 6 fixed rules | MLR | SCANN | Decision Template | VIG |
| Bupa | 0.2970 (4.89E-03) [+] | 0.3033 (4.70E-03) [=] | 0.3304 (4.29E-03) [-] | 0.3348 (7.10E-03) [-] | 0.3151 (3.73E-03) |
| Artificial | 0.2193 (2.05E-03) [+] | 0.2426 (2.20E-03) [=] | 0.2374 (2.12E-03) [=] | 0.2433 (1.60E-03) [=] | 0.2401 (2.33E-03) |
| Pima | 0.2365 (2.10E-03) [=] | 0.2432 (2.30E-03) [=] | 0.2384 (2.06E-03) [=] | 0.2482 (2.00E-03) [-] | 0.2340 (2.21E-03) |
| Sonar | 0.2079 (8.16E-03) [=] | 0.1974 (7.20E-03) [=] | 0.2128 (8.01E-03) [=] | 0.2129 (8.80E-03) [=] | 0.2025 (7.84E-03) |
| Heart | 0.1570 (4.64E-03) [=] | 0.1607 (4.70E-03) [=] | 0.1637 (4.14E-03) [=] | 0.1541 (4.00E-03) [=] | 0.1556 (4.01E-03) |
| Phoneme | 0.1407 (1.95E-04) [-] | 0.1136 (1.75E-04) [=] | 0.1229 (6.53E-04) [-] | 0.1462 (2.00E-04) [-] | 0.1164 (1.96E-04) |
| Titanic | 0.2167 (5.00E-04) [=] | 0.2169 (4.00E-04) [=] | 0.2216 (6.29E-04) [=] | 0.2167 (6.00E-04) [=] | 0.2169 (4.83E-04) |
| Balance | 0.1112 (4.82E-04) [=] | 0.1225 (8.00E-04) [-] | x | 0.0988 (1.40E-03) [+] | 0.1123 (6.19E-04) |
| Fertility | 0.1270 (1.97E-03) [=] | 0.1250 (2.28E-03) [=] | x | 0.4520 (3.41E-02) [-] | 0.1310 (2.34E-03) |
| Skin&NonSkin | 0.0006 (2.13E-08) [-] | 4.79E-04 (1.97E-08) [-] | x | 0.0332 (1.64E-06) [-] | 4.28E-04 (1.57E-08) |
| Wdbc | 0.0395 (5.03E-04) [=] | 0.0399 (7.00E-04) [=] | 0.0397 (5.64E-04) [=] | 0.0385 (5.00E-04) [=] | 0.0408 (5.75E-04) |
| Australian | 0.1262 (1.37E-03) [=] | 0.1268 (1.80E-03) [=] | 0.1259 (1.77E-03) [=] | 0.1346 (1.50E-03) [-] | 0.1225 (1.49E-03) |
| Twonorm | 0.0216 (2.82E-05) [=] | 0.0217 (2.24E-05) [=] | 0.0216 (2.39E-05) [=] | 0.0221 (2.62E-05) [=] | 0.0218 (2.49E-05) |
| Ring | 0.2122 (1.62E-04) [-] | 0.1700 (1.69E-04) [-] | 0.2150 (2.44E-04) [-] | 0.1894 (1.78E-04) [-] | 0.1168 (1.00E-04) |
| Tae | 0.4435 (1.70E-02) [=] | 0.4652 (1.24E-02) [-] | 0.4428 (1.34E-02) [=] | 0.4643 (1.21E-02) [-] | 0.4348 (1.71E-02) |
| Contraceptive | 0.4653 (1.79E-03) [=] | 0.4675 (1.10E-03) [=] | 0.4869 (1.80E-03) [-] | 0.4781 (1.40E-03) [-] | 0.4634 (1.32E-03) |
| Vehicle | 0.2645 (1.37E-03) [-] | 0.2139 (1.40E-03) [=] | 0.2224 (1.54E-03) [-] | 0.2161 (1.50E-03) [-] | 0.2069 (1.23E-03) |
| Iris | 0.0327 (1.73E-03) [=] | 0.022 (1.87E-03) [=] | 0.032 (2.00E-03) [=] | 0.0400 (2.50E-03) [=] | 0.0313 (2.00E-03) |
| CLEF2009 | 0.2023 (1.85E-03) [-] | 0.1633 (1.58E-03) [=] | 0.1895 (1.68E-03) [-] | 0.1893 (1.74E-03) [-] | 0.1693 (1.76E-03) |

Table 5. Statistical two-sample t-test results comparing the proposed method with the benchmark algorithms.

| | VIG vs. best base classifier | VIG vs. best fixed rule | VIG vs. MLR | VIG vs. Decision Template | VIG vs. SCANN |
| Better | 7 | 5 | 4 | 11 | 6 |
| Equal | 11 | 12 | 15 | 7 | 10 |
| Worse | 1 | 2 | 0 | 1 | 0 |

Compared with SCANN, our method is significantly better, obtaining 6 wins and 0 losses. Note that only 16 files could be compared here because SCANN could not be run on 3 files (Fertility, Balance and Skin&NonSkin): for these files the indicator matrix in SCANN has columns in which all posterior probabilities from the base classifiers are zero, so its column mass is singular and the standardized residuals are not available [18]. We ignored these cases in the comparison.

Finally, our method performed significantly better than the Decision Template method, posting 11 wins and only 1 loss. Similar to our approach, the Decision Template method groups the training observations by their labels in the Level1 data and builds a template for each class; in fact, the Decision Template of the m-th class is the average of the Level1 data of the training observations with label $y_m$ [17]. However, this template representation is not as powerful as our Variational Inference-based approach, so our method frequently outperforms this benchmark. The MLR method also represents each class, by building regression models from the Level1 data of the training observations and their class labels. It performs better than the Decision Template method in most cases thanks to the more powerful statistical representation obtained by linear regression. Nevertheless, our method is still better than MLR, obtaining 4 wins and 0 losses. The 4 winning cases are Balance (0.1123 vs. 0.1225), Skin&NonSkin (4.28E-04 vs. 4.79E-04), Ring (0.1168 vs. 0.17) and Tae (0.4348 vs. 0.4652). This is a significant outcome because MLR is a highly competitive trainable combining method on many datasets.

To compare training and classification time, we computed the average training and classification times over all 100 runs; the averaged values are reported in Figs. 2 and 3. Here we only compare the four trainable combining algorithms: fixed combining rules always need less training time than trainable algorithms because they do not exploit the label information in the Level1 data, so we do not include them in this comparison. In the training process, no method dominates. In 6 cases (Artificial, Contraceptive, Balance, Australian, Ring and CLEF2009), Decision Template is the fastest of the 4 methods; it is simpler than VIG and MLR, as it only averages the Level1 data of the training observations of each class. In MLR, we have to solve M linear regression models to find the weight each classifier puts on each class. In VIG, several iterations are needed to find the mean and covariance parameters of the multivariate Gaussian distribution of each class. Table 6 shows the average number of iterations over the 10 runs of the 10-fold cross-validation procedure; convergence is reached after about 3 or 4 iterations, so the training time of VIG is acceptable. It is even lower than that of the other methods on Heart, Titanic, Wdbc, Tae and Iris. In Appendix B, we show the decrease in the values of $\mathcal{L}_i - \mathcal{L}_{i-1}$ ($i \ge 2$) in one experiment on all nineteen datasets. Although Decision Template usually needs less training time than the other three methods, the differences in training time between Decision Template and VIG are small.

In classification time, MLR ranks first, followed by VIG, Decision Template, and SCANN. In MLR, the label of an observation is predicted simply by multiplying its Level1 data by the weights obtained in training. Our method computes a product between the Level1 data of the test observation and the precision matrix, so it is somewhat more time-consuming than MLR, although the difference is small. In SCANN, M representations corresponding to the M classes must be computed from the Level1 data of the test observation, and then the distance between each representation and each row of the selected principal coordinates-of-columns matrix is computed [18]. As a result, SCANN is usually the most time-consuming of the four trainable methods in the classification process.

Fig. 2. Average time of the training process (in seconds).

Fig. 3. Average time of the classification process (in seconds). (Top: the results for three datasets; bottom: the results for the others.)

Table 6. Average number of iterations of the proposed method.

| File | Iterations | File | Iterations |
| Bupa | 4 | Wdbc | 4 |
| Artificial | 4 | Australian | 4 |
| Pima | 4 | Twonorm | 4 |
| Sonar | 4 | Ring | 4 |
| Heart | 4 | Tae | 4 |
| Phoneme | 3.99 | Contraceptive | 4 |
| Titanic | 4 | Vehicle | 4 |
| Balance | 4 | Iris | 4 |
| Fertility | 4.5 | CLEF2009 | 4 |
| Skin&NonSkin | 3 | | |

4.3.2. Different numbers of learning algorithms

Two further learning algorithms, Quadratic Discriminant Analysis (QDA) and Decision Tree, were added to our multi-classifier system in order to assess the effect of using a different number of base classifiers. We denote the combining method with 3 learning algorithms by VIG-3 and the method with five learning algorithms by VIG-5. Table 7 shows the error rates and variances of Decision Tree, QDA, and VIG-5, and Table 8 shows the statistical two-sample t-test results comparing VIG-3 with VIG-5. VIG-3 is better than VIG-5 on 4 datasets (Phoneme, Titanic, Skin&NonSkin, and Australian), whereas VIG-5 outperforms VIG-3 on 6 datasets (Bupa, Artificial, Balance, Ring, Vehicle, and CLEF2009). There is therefore no clear overall advantage in using more base classifiers in this case. However, additional learning algorithms can improve classification performance significantly on some files. For example, the classification error rate on Ring obtained by the ensemble with 3 learning algorithms is 0.11, but it drops to about 0.02 when the two new learning algorithms are added. Looking at the outputs of the base classifiers, one of the newly added base classifiers has significantly better classification accuracy on this file, so the discrimination of the Level1 data can be enhanced by having the right base classifiers in the ensemble. This experiment shows that adding more learning algorithms can either improve or degrade the classification performance of a multi-classifier system; a learning mechanism that searches for an optimal subset of learning algorithms, as in [40], could be used to build a highly effective classification system.

Table 8. Statistical two-sample t-test results comparing VIG-3 with VIG-5.

| | VIG-3 vs. VIG-5 |
| Better | 4 |
| Equal | 7 |
| Worse | 6 |

Table 7. Classification error rates and variances of the additional learning algorithms and VIG-5. Each cell gives the mean error rate followed by its variance in parentheses; in the VIG-5 column, [-] means VIG-3 is better than VIG-5, [=] means VIG-3 and VIG-5 are equal, [+] means VIG-3 is worse than VIG-5 (checked by t-test); "x" indicates that the method could not be run on that file.

| File name | Decision Tree | QDA | VIG-5 |
| Bupa | 0.3514 (6.10E-03) | 0.3965 (7.48E-03) | 0.2970 (4.85E-03) [+] |
| Artificial | 0.2414 (2.20E-03) | 0.4021 (7.92E-03) | 0.2217 (2.05E-03) [+] |
| Pima | 0.2892 (1.80E-03) | 0.2628 (2.29E-03) | 0.2403 (2.11E-03) [=] |
| Sonar | 0.2866 (6.20E-03) | 0.2401 (8.15E-03) | 0.1861 (6.68E-03) [=] |
| Heart | 0.2381 (6.70E-03) | 0.1722 (5.94E-03) | 0.1644 (4.43E-03) [=] |
| Phoneme | 0.1298 (2.00E-04) | 0.2138 (2.35E-04) | 0.1276 (2.16E-04) [-] |
| Titanic | 0.2101 (3.00E-04) | 0.2267 (8.25E-04) | 0.2290 (4.95E-04) [-] |
| Balance | 0.2107 (2.10E-03) | 0.0860 (8.78E-04) | 0.0851 (1.09E-03) [+] |
| Fertility | 0.1730 (7.20E-03) | x | x |
| Skin&NonSkin | 0.0004 (1.55E-08) | 0.0165 (6.26E-07) | 0.0006 (2.33E-08) [-] |
| Wdbc | 0.0705 (1.10E-03) | 0.0433 (6.99E-04) | 0.0450 (6.92E-04) [=] |
| Australian | 0.1678 (2.13E-03) | 0.2078 (1.75E-03) | 0.1338 (1.08E-03) [-] |
| Twonorm | 0.0536 (4.22E-05) | 0.0224 (2.37E-05) | 0.0216 (2.82E-05) [=] |
| Ring | 0.2485 (1.32E-04) | 0.0210 (2.32E-05) | 0.0206 (3.10E-05) [+] |
| Tae | 0.4275 (1.06E-02) | x | x |
| Contraceptive | 0.5317 (1.28E-03) | 0.4895 (1.63E-03) | 0.4629 (1.34E-03) [=] |
| Vehicle | 0.2932 (2.13E-03) | 0.1453 (1.13E-03) | 0.1578 (1.23E-03) [+] |
| Iris | 0.0507 (2.40E-03) | 0.0260 (1.59E-03) | 0.0260 (1.68E-03) [=] |
| CLEF2009 | 0.3613 (2.00E-03) | 0.1719 (1.40E-03) | 0.1521 (1.30E-03) [+] |


Table 9. Classification error rates and variances of Bagging and AdaBoost. Each cell gives the mean error rate followed by its variance in parentheses; the bracketed markers indicate whether the benchmark is worse than [-], equal to [=], or better than [+] VIG-3 and VIG-5 (checked by t-test).

| File name | Bagging | AdaBoost |
| Bupa | 0.2741 (4.37E-03) [VIG-3: +, VIG-5: +] | 0.2587 (3.30E-03) [VIG-3: +, VIG-5: +] |
| Artificial | 0.2069 (2.30E-03) [VIG-3: +, VIG-5: +] | 0.2197 (1.90E-03) [VIG-3: +, VIG-5: =] |
| Pima | 0.2357 (2.04E-03) [VIG-3: =, VIG-5: =] | 0.2444 (1.97E-03) [VIG-3: =, VIG-5: =] |
| Sonar | 0.1545 (6.39E-03) [VIG-3: +, VIG-5: +] | 0.1413 (5.05E-03) [VIG-3: +, VIG-5: +] |
| Heart | 0.1700 (4.80E-03) [VIG-3: =, VIG-5: =] | 0.1896 (4.67E-03) [VIG-3: -, VIG-5: -] |
| Phoneme | 0.0878 (1.08E-04) [VIG-3: +, VIG-5: +] | 0.1920 (2.27E-04) [VIG-3: -, VIG-5: -] |
| Titanic | 0.2160 (4.62E-04) [VIG-3: =, VIG-5: +] | 0.2217 (4.31E-04) [VIG-3: =, VIG-5: +] |
| Balance | 0.1570 (1.10E-03) [VIG-3: -, VIG-5: -] | 0.1334 (6.31E-04) [VIG-3: -, VIG-5: -] |
| Fertility | 0.1260 (4.92E-03) [VIG-3: =] | 0.1600 (9.00E-03) [VIG-3: -] |
| Skin&NonSkin | 0.0004 (1.40E-08) [VIG-3: =, VIG-5: +] | 0.0428 (1.65E-06) [VIG-3: -, VIG-5: -] |
| Wdbc | 0.0362 (6.21E-04) [VIG-3: =, VIG-5: +] | 0.0330 (4.76E-04) [VIG-3: +, VIG-5: +] |
| Australian | 0.1351 (1.68E-03) [VIG-3: -, VIG-5: =] | 0.1425 (1.53E-03) [VIG-3: -, VIG-5: -] |
| Twonorm | 0.0273 (3.20E-05) [VIG-3: -, VIG-5: -] | 0.0310 (3.76E-05) [VIG-3: -, VIG-5: -] |
| Ring | 0.0500 (6.47E-05) [VIG-3: +, VIG-5: -] | 0.0456 (6.34E-05) [VIG-3: +] |
| Tae | 0.3353 (1.58E-02) [VIG-3: +] | 0.5145 (1.80E-02) [VIG-3: -, VIG-5: -] |
| Contraceptive | 0.4627 (1.55E-03) [VIG-3: =, VIG-5: =] | 0.4996 (8.99E-04) [VIG-3: -, VIG-5: -] |
| Vehicle | 0.2499 (1.58E-03) [VIG-3: -, VIG-5: -] | 0.4451 (2.87E-03) [VIG-3: -, VIG-5: -] |
| Iris | 0.0493 (2.63E-03) [VIG-3: -, VIG-5: -] | 0.0540 (2.82E-03) [VIG-3: -, VIG-5: -] |
| CLEF2009 | 0.1897 (1.64E-03) [VIG-3: -, VIG-5: -] | 0.5565 (1.91E-03) [VIG-3: -, VIG-5: -] |

4.3.3. Comparison with other state-of-the-art ensemble methods

We also compared the performance of our proposed method with two well-known ensemble methods, Bagging and AdaBoost, both implemented in Matlab 2014a with the Decision Tree learning algorithm. Following [41-43], the number of iterations in AdaBoost and the number of learners in Bagging were both set to 200. Table 9 shows the experimental results of the two ensemble methods on the nineteen datasets, and Table 10 shows the statistical two-sample t-test results comparing VIG-3 and VIG-5 with Bagging and AdaBoost.

Both VIG-based approaches are competitive with Bagging, obtaining six wins and six losses for VIG-3 and six wins and seven losses for VIG-5. It should be noted that Bagging is one of the best-performing ensemble methods in the literature. In comparison with AdaBoost, VIG-3 achieves better results on 12 and worse results on 5 datasets, while VIG-5 has 11 wins and 4 losses. This is a significant outcome because our approach is not only better in accuracy but also considerably less time-consuming than AdaBoost with 200 iterations. In our experiments, Bagging and AdaBoost with 200 learners are much more time-consuming than our approach in both training and testing. Specifically, Bagging is on average 8.1 and 7.85 times slower than VIG-3 for training and testing, respectively, and 2.93 and 4.61 times slower than VIG-5; for AdaBoost the corresponding factors are 3.81 and 6 compared with VIG-3, and 1.35 and 3.45 compared with VIG-5. The computation times for each dataset are shown in Figs. 4 and 5.

Table 10. Statistical two-sample t-test results comparing VIG-3 and VIG-5 with Bagging and AdaBoost.

| | VIG-3 vs. Bagging | VIG-5 vs. Bagging | VIG-3 vs. AdaBoost | VIG-5 vs. AdaBoost |
| Better | 6 | 6 | 12 | 11 |
| Equal | 7 | 4 | 2 | 2 |
| Worse | 6 | 7 | 5 | 4 |

5. Conclusion

We have introduced a novel combining classifier method, denoted VIG, in which the class distributions of the Level1 data are exploited to form the decision-making model. Our approach groups the Level1 data of the training observations by their class labels, estimates the distribution of each class, modeled as a multivariate Gaussian, using VI, and then performs classification by maximizing the posterior probability under the Bayesian decision model. Experimental results on eighteen UCI data files and a 10-class CLEF2009 medical image database demonstrate the benefit of our approach compared with several well-known combining classifier methods. Specifically, the proposed method is better than the individual base classifiers, ensembles with fixed combining rules, MLR, Decision Template, AdaBoost and SCANN, and is highly competitive with Bagging. It is also less time-consuming than the other trainable methods and the two well-known ensemble methods we tested. Moreover, VIG is more widely applicable than SCANN, since singularity is not a problem for VIG. In this work we have also shown that a set of good base classifiers can boost the discrimination ability of the Level1 data and improve the overall classification accuracy of the system. In the future, the proposed method could be combined with feature and classifier selection approaches [40,44,45] to further improve its classification performance.

Conflict of Interest Statement

We wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome.


Fig. 4. Average time of training process (in seconds) of two VIG methods, AdaBoost and Bagging.

Fig. 5. Average time of testing process (in seconds) of two VIG methods, AdaBoost and Bagging.

Acknowledgments Tien Thanh Nguyen acknowledges the support of a Griffith University International Postgraduate Research Scholarship (GUIPRS).

Appendix A

In Eq. (3), the Kullback–Leibler (KL) divergence is given by

$$\mathrm{KL}(q \,\|\, p) = \mathbb{E}_q\!\left[\ln \frac{q(Z)}{p(Z \mid X)}\right] = -\int q(Z) \ln \frac{p(Z \mid X)}{q(Z)}\, dZ$$

Using $p(Z \mid X) = p(Z, X)/p(X)$, we can rewrite (3) as

$$\mathrm{KL}(q \,\|\, p) = -\int q(Z) \ln \frac{p(Z, X)}{p(X)\, q(Z)}\, dZ = \ln p(X) \int q(Z)\, dZ - \int q(Z) \ln \frac{p(Z, X)}{q(Z)}\, dZ$$

Since $\int q(Z)\, dZ = 1$, we obtain $\mathrm{KL}(q \,\|\, p) = \ln p(X) - \mathcal{L}(q)$, where $\mathcal{L}(q) = \int q(Z) \ln \frac{p(X, Z)}{q(Z)}\, dZ$.

Lemma 1. The Level1 data generated using fuzzy labels is not of full column rank.

Proof. The Level1 data generated using fuzzy labels is the $N \times MK$ matrix (1), in which $\sum_{m=1}^{M} P_k(y_m \mid x_n) = 1$ for each $k = 1, \dots, K$ and $n = 1, \dots, N$. Because of this property, there is a non-trivial linear combination of the columns equal to zero (the coefficient is $K - 1$, so that each row sums to $(K-1) - (K-1) = 0$):

$$(K - 1) \sum_{m=1}^{M} \begin{pmatrix} P_i(y_m \mid x_1) \\ \vdots \\ P_i(y_m \mid x_N) \end{pmatrix} - \sum_{\substack{j=1 \\ j \ne i}}^{K} \sum_{m=1}^{M} \begin{pmatrix} P_j(y_m \mid x_1) \\ \vdots \\ P_j(y_m \mid x_N) \end{pmatrix} = 0 \qquad (A1)$$

Therefore the columns are not linearly independent, so the rank of L is smaller than the number of columns, i.e. L is not of full column rank. □

Corollary 1. The covariance matrix of the Level1 data is singular.

Proof. Let $\bar{L}$ denote the $N \times MK$ matrix whose every row contains the column means of L:

$$\bar{L} := \begin{bmatrix}
\frac{1}{N}\sum_{n=1}^{N} P_1(y_1 \mid x_n) & \frac{1}{N}\sum_{n=1}^{N} P_1(y_2 \mid x_n) & \cdots & \frac{1}{N}\sum_{n=1}^{N} P_K(y_M \mid x_n) \\
\vdots & \vdots & \ddots & \vdots \\
\frac{1}{N}\sum_{n=1}^{N} P_1(y_1 \mid x_n) & \frac{1}{N}\sum_{n=1}^{N} P_1(y_2 \mid x_n) & \cdots & \frac{1}{N}\sum_{n=1}^{N} P_K(y_M \mid x_n)
\end{bmatrix} \qquad (A2)$$

Then the $MK \times MK$ covariance matrix of L, denoted by $\Sigma$, is $\Sigma = (L - \bar{L})^{T}(L - \bar{L})$. By the properties of the rank operator, $\mathrm{rank}(\Sigma) = \mathrm{rank}\big((L - \bar{L})^{T}(L - \bar{L})\big) = \mathrm{rank}(L - \bar{L})$. The dependence (A1) also holds for $\bar{L}$, so by Lemma 1 $\mathrm{rank}(L - \bar{L}) < MK$. As a result, the covariance matrix $\Sigma$ is singular. □

Lemma 2. The optimum solution $q^{*}(\mu)$ of update equation (9) is a Gaussian $q^{*}(\mu) = N(\mu \mid m, H^{-1})$ with mean and precision given by (11) and (12).


  Proof. The optimal solution for q μ is given by     ln qn μ ¼ EΛ ln p X; μ; Λ þconst     ¼ EΛ ln p Xj μ; Λ þ ln p μ j Λ þconst T   1 h ¼  EΛ μ  m 0 β 0 Λ μ  m 0 2 XN  T  i þ const þ x  μ Λ x  μ n n n¼1



h



μ  m0 μ  m0

T i

¼ ðm  m 0 Þðm  m 0 ÞT  þ H  1 ; μ  N μ j m; H  1 

Using N X

N n X

xn xn T ¼

n¼1

N X

¼

ðxn xÞðxn  xÞT þ

n¼1

" # N  X    1 ln qn μ ¼  EΛ μT β0 Λμ  2μT β 0 Λm0 þ μT Λμ  2μT Λxn þ const 2 n¼1

! N X  1 T T ¼  μ β0 þN E Λ μ þ μ E Λ β0 m0 þ xn þconst 2 n¼1

ðxn  xÞðxn xÞT þ xn xT þ xxTn  xxT

N X   H ¼ β0 þ N E Λ ; Hm ¼ E Λ β 0 m0 þ xn

!

  ¼ E Λ β0 m0 þNx

N X

and we can easily infer (11) and (12) □.   n Lemma 3. Optimum   solution  q Λ of update equation (10) is a n Wishart q Λ ¼ W Λ j W; υ with the number of degrees of freedom υ and the scale matrix W given by (13) and (14).   Proof. The optimal solution for q Λ is given by  



  N þ1 þ υ0  D  1   lnΛ ln qn Λ ¼ 2   n h T   1  Tr W0 1 Λ þ Eμ μ  m0 β0 Λ μ  m0 2 # N  X T   xn  μ Λ xn  μ g þ const þ

xm ¼



μ  m0 μ  m0

xn  μ xn  μ

n¼1



Eμ μ ¼ m; Eμ μ

T

β0 ðx m0 Þ β0 þ N 

ðm m0 Þðm  m0 ÞT ¼  ðx  mÞðx  mÞT ¼

ðA14Þ

N β0 þ N

β0

2 ðA15Þ

J

2 ðA16Þ

J

β0 þ N

we obtain h



μ  m0 μ  m0

T i

 ¼

N β0 þN

2

J þH  1

# N  X  T  β2 N xn  μ xn  μ ¼ S þ  0 2 J þNH  1 β0 þ N n¼1

ðA17Þ

ðA18Þ

ðA19Þ

Lemma 4. The variational lower bound $\mathcal{L}(q)$ is computed as in (15).

Proof. Based on the expression of the lower bound we have

$\mathcal{L} = \iint q(\boldsymbol{\mu},\boldsymbol{\Lambda})\,\ln\!\left\{\dfrac{p(\mathbf{X},\boldsymbol{\mu},\boldsymbol{\Lambda})}{q(\boldsymbol{\mu},\boldsymbol{\Lambda})}\right\}\mathrm{d}\boldsymbol{\mu}\,\mathrm{d}\boldsymbol{\Lambda} = \mathbb{E}\big[\ln p(\mathbf{X},\boldsymbol{\mu},\boldsymbol{\Lambda})\big] - \mathbb{E}\big[\ln q(\boldsymbol{\mu},\boldsymbol{\Lambda})\big] = \mathbb{E}\big[\ln p(\mathbf{X}\mid\boldsymbol{\mu},\boldsymbol{\Lambda})\big] + \mathbb{E}\big[\ln p(\boldsymbol{\mu}\mid\boldsymbol{\Lambda})\big] + \mathbb{E}\big[\ln p(\boldsymbol{\Lambda})\big] - \mathbb{E}\big[\ln q(\boldsymbol{\mu})\big] - \mathbb{E}\big[\ln q(\boldsymbol{\Lambda})\big]$   (A20)

in which the first component is computed by

$\mathbb{E}\big[\ln p(\mathbf{X}\mid\boldsymbol{\mu},\boldsymbol{\Lambda})\big] = \dfrac{N}{2}\mathbb{E}\big[\ln|\boldsymbol{\Lambda}|\big] - \dfrac{ND}{2}\ln(2\pi) - \dfrac{1}{2}\mathbb{E}\Big[\sum_{n=1}^{N}(\mathbf{x}_{n}-\boldsymbol{\mu})^{T}\boldsymbol{\Lambda}(\mathbf{x}_{n}-\boldsymbol{\mu})\Big]$

$\mathbb{E}\Big[\sum_{n=1}^{N}(\mathbf{x}_{n}-\boldsymbol{\mu})^{T}\boldsymbol{\Lambda}(\mathbf{x}_{n}-\boldsymbol{\mu})\Big] = \mathbb{E}_{\boldsymbol{\Lambda}}\Big[\mathbb{E}_{\boldsymbol{\mu}}\Big[\operatorname{Tr}\Big(\boldsymbol{\Lambda}\sum_{n=1}^{N}(\mathbf{x}_{n}-\boldsymbol{\mu})(\mathbf{x}_{n}-\boldsymbol{\mu})^{T}\Big)\Big]\Big] = \operatorname{Tr}\Big(\mathbb{E}[\boldsymbol{\Lambda}]\,\mathbb{E}_{\boldsymbol{\mu}}\Big[\sum_{n=1}^{N}(\mathbf{x}_{n}-\boldsymbol{\mu})(\mathbf{x}_{n}-\boldsymbol{\mu})^{T}\Big]\Big)$   (A21)

$= \operatorname{Tr}\Big(\mathbb{E}[\boldsymbol{\Lambda}]\Big(\mathbf{S} + \dfrac{\beta_{0}^{2}N}{(\beta_{0}+N)^{2}}\mathbf{J} + N\mathbf{H}^{-1}\Big)\Big) = \upsilon\operatorname{Tr}(\mathbf{S}\mathbf{W}) + \dfrac{\upsilon\beta_{0}^{2}N}{(\beta_{0}+N)^{2}}\operatorname{Tr}(\mathbf{J}\mathbf{W}) + \dfrac{ND}{\beta_{0}+N}$   (A22)

where we have used $\boldsymbol{\Lambda} \sim \mathcal{W}(\boldsymbol{\Lambda}\mid\mathbf{W},\upsilon)$, $\mathbb{E}[\boldsymbol{\Lambda}] = \upsilon\mathbf{W}$ and $\mathbf{H}^{-1} = \mathbb{E}[\boldsymbol{\Lambda}]^{-1}/(\beta_{0}+N)$. Substituting (A22) into (A21), we get

$\mathbb{E}\big[\ln p(\mathbf{X}\mid\boldsymbol{\mu},\boldsymbol{\Lambda})\big] = \dfrac{1}{2}\Big\{N\,\mathbb{E}\big[\ln|\boldsymbol{\Lambda}|\big] - ND\ln(2\pi) - \upsilon\operatorname{Tr}(\mathbf{S}\mathbf{W}) - \dfrac{ND}{\beta_{0}+N} - \dfrac{\upsilon\beta_{0}^{2}N}{(\beta_{0}+N)^{2}}\operatorname{Tr}(\mathbf{J}\mathbf{W})\Big\}$   (A23)

Referring to (A17), we get the second component in the lower bound's expression:

$\mathbb{E}\big[\ln p(\boldsymbol{\mu}\mid\boldsymbol{\Lambda})\big] = \mathbb{E}\Big[-\dfrac{D}{2}\ln(2\pi) + \dfrac{1}{2}\ln|\beta_{0}\boldsymbol{\Lambda}| - \dfrac{1}{2}(\boldsymbol{\mu}-\mathbf{m}_{0})^{T}\beta_{0}\boldsymbol{\Lambda}(\boldsymbol{\mu}-\mathbf{m}_{0})\Big] = \dfrac{1}{2}\Big\{D\ln\beta_{0} + \mathbb{E}\big[\ln|\boldsymbol{\Lambda}|\big] - D\ln(2\pi) - \beta_{0}\operatorname{Tr}\big(\mathbb{E}\big[(\boldsymbol{\mu}-\mathbf{m}_{0})(\boldsymbol{\mu}-\mathbf{m}_{0})^{T}\big]\,\mathbb{E}[\boldsymbol{\Lambda}]\big)\Big\} = \dfrac{1}{2}\Big\{D\ln\beta_{0} + \mathbb{E}\big[\ln|\boldsymbol{\Lambda}|\big] - D\ln(2\pi) - \dfrac{D\beta_{0}}{\beta_{0}+N} - \dfrac{\upsilon\beta_{0}N^{2}}{(\beta_{0}+N)^{2}}\operatorname{Tr}(\mathbf{J}\mathbf{W})\Big\}$   (A24)

In the same way, we obtain

$\mathbb{E}\big[\ln p(\boldsymbol{\Lambda})\big] = \ln B(\mathbf{W}_{0},\upsilon_{0}) + \dfrac{\upsilon_{0}-D-1}{2}\mathbb{E}\big[\ln|\boldsymbol{\Lambda}|\big] - \dfrac{1}{2}\mathbb{E}\big[\operatorname{Tr}\big(\mathbf{W}_{0}^{-1}\boldsymbol{\Lambda}\big)\big] = \ln B(\mathbf{W}_{0},\upsilon_{0}) + \dfrac{\upsilon_{0}-D-1}{2}\mathbb{E}\big[\ln|\boldsymbol{\Lambda}|\big] - \dfrac{\upsilon}{2}\operatorname{Tr}\big(\mathbf{W}_{0}^{-1}\mathbf{W}\big)$   (A25)

$\mathbb{E}\big[\ln q(\boldsymbol{\mu})\big] = \mathbb{E}\big[\ln \mathcal{N}(\boldsymbol{\mu}\mid\mathbf{m},\mathbf{H}^{-1})\big] = \dfrac{1}{2}\Big\{\ln|\mathbf{H}| - D\ln(2\pi) - \operatorname{Tr}\big(\mathbf{H}\,\mathbb{E}\big[(\boldsymbol{\mu}-\mathbf{m})(\boldsymbol{\mu}-\mathbf{m})^{T}\big]\big)\Big\} = \dfrac{1}{2}\Big\{\ln|\mathbf{H}| - D\ln(2\pi) - \operatorname{Tr}\big(\mathbf{H}\mathbf{H}^{-1}\big)\Big\} = \dfrac{1}{2}\big\{\ln|\mathbf{H}| - D\ln(2\pi) - D\big\}$   (A26)

$\mathbb{E}\big[\ln q(\boldsymbol{\Lambda})\big] = \mathbb{E}\big[\ln \mathcal{W}(\boldsymbol{\Lambda}\mid\mathbf{W},\upsilon)\big] = \ln B(\mathbf{W},\upsilon) + \dfrac{\upsilon-D-1}{2}\mathbb{E}\big[\ln|\boldsymbol{\Lambda}|\big] - \dfrac{1}{2}\operatorname{Tr}\big(\mathbf{W}^{-1}\mathbb{E}[\boldsymbol{\Lambda}]\big) = \ln B(\mathbf{W},\upsilon) + \dfrac{\upsilon-D-1}{2}\mathbb{E}\big[\ln|\boldsymbol{\Lambda}|\big] - \dfrac{\upsilon D}{2}$   (A27)

where

$\ln B(\mathbf{W}_{0},\upsilon_{0}) = -\dfrac{\upsilon_{0}}{2}\ln|\mathbf{W}_{0}| - \dfrac{\upsilon_{0}D}{2}\ln 2 - \dfrac{D(D-1)}{4}\ln\pi - \sum_{i=1}^{D}\ln\Gamma\Big(\dfrac{\upsilon_{0}+1-i}{2}\Big)$   (A28)

$\ln B(\mathbf{W},\upsilon) = -\dfrac{\upsilon}{2}\ln|\mathbf{W}| - \dfrac{\upsilon D}{2}\ln 2 - \dfrac{D(D-1)}{4}\ln\pi - \sum_{i=1}^{D}\ln\Gamma\Big(\dfrac{\upsilon+1-i}{2}\Big)$   (A29)

Using (A23)–(A29) we have the lower bound given by (15). □
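The bound can be assembled directly from (A23) to (A29). The sketch below does this in Python; it assumes the standard Wishart identity E[ln|Λ|] = Σ_{i=1}^{D} ψ((υ+1-i)/2) + D ln 2 + ln|W| (with ψ the digamma function), which is not reproduced in this appendix but is presumably what equation (15) in the main text relies on, and the function names are illustrative. Monitoring the differences ℒ_i - ℒ_{i-1} between successive iterations, as in Appendix B, gives a simple convergence check.

import numpy as np
from scipy.special import gammaln, digamma

def log_B(W, nu):
    """Log of the Wishart normalisation constant B(W, nu), as in (A28)/(A29)."""
    D = W.shape[0]
    return (-0.5 * nu * np.linalg.slogdet(W)[1]
            - 0.5 * nu * D * np.log(2.0)
            - 0.25 * D * (D - 1) * np.log(np.pi)
            - gammaln(0.5 * (nu + 1 - np.arange(1, D + 1))).sum())

def lower_bound(X, m, H, W, nu, beta0, m0, nu0, W0):
    """Variational lower bound assembled from (A23)-(A29)."""
    N, D = X.shape
    xbar = X.mean(axis=0)
    S = (X - xbar).T @ (X - xbar)
    J = np.outer(xbar - m0, xbar - m0)
    E_logdet = (digamma(0.5 * (nu + 1 - np.arange(1, D + 1))).sum()
                + D * np.log(2.0) + np.linalg.slogdet(W)[1])     # E[ln|Lambda|]
    tr_SW, tr_JW = np.trace(S @ W), np.trace(J @ W)

    E_lpX = 0.5 * (N * E_logdet - N * D * np.log(2 * np.pi) - nu * tr_SW
                   - N * D / (beta0 + N)
                   - nu * beta0**2 * N / (beta0 + N)**2 * tr_JW)                  # (A23)
    E_lpmu = 0.5 * (D * np.log(beta0) + E_logdet - D * np.log(2 * np.pi)
                    - D * beta0 / (beta0 + N)
                    - nu * beta0 * N**2 / (beta0 + N)**2 * tr_JW)                 # (A24)
    E_lpLam = (log_B(W0, nu0) + 0.5 * (nu0 - D - 1) * E_logdet
               - 0.5 * nu * np.trace(np.linalg.inv(W0) @ W))                      # (A25)
    E_lqmu = 0.5 * (np.linalg.slogdet(H)[1] - D * np.log(2 * np.pi) - D)          # (A26)
    E_lqLam = log_B(W, nu) + 0.5 * (nu - D - 1) * E_logdet - 0.5 * nu * D         # (A27)
    return E_lpX + E_lpmu + E_lpLam - E_lqmu - E_lqLam                            # (A20)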

Appendix B

The values of $\mathcal{L}_{i}-\mathcal{L}_{i-1}$ $(i \geq 2)$ in one experiment on all nineteen datasets with VIG-3 are shown in Fig. A1.

Fig. A1. Values of $\mathcal{L}_{i}-\mathcal{L}_{i-1}$ $(i \geq 2)$ in one experiment on all nineteen datasets with VIG-3.

References

[1] L. Rokach, Taxonomy for characterizing ensemble methods in classification tasks: a review and annotated bibliography, Comput. Stat. Data Anal. 53 (2009) 4046–4072.
[2] M. Woźniak, M. Graña, E. Corchado, A survey of multiple classifier systems as hybrid systems, Inf. Fusion 16 (2014) 3–17.
[3] Z.-H. Zhou, Ensemble Methods: Foundations and Algorithms, CRC Press, 2012.
[4] R.P.W. Duin, The combining classifier: to train or not to train?, in: Proceedings of International Conference on Pattern Recognition, 2002, pp. 765–770.
[5] L.I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms, Wiley, 2004.
[6] A.C. Sharkey, Types of multinet system, in: F. Roli, J. Kittler (Eds.), Multiple Classifier Systems, Springer, Berlin Heidelberg, 2002, pp. 108–117.
[7] H.T. Kam, J.J. Hull, S.N. Srihari, Decision combination in multiple classifier systems, IEEE Trans. Pattern Anal. Mach. Intell. 16 (1994) 66–75.
[8] C.-X. Zhang, R.P.W. Duin, An experimental study of one- and two-level classifier fusion for different sample sizes, Pattern Recognit. Lett. 32 (2011) 1756–1767.
[9] Y. Freund, R.E. Schapire, Experiments with a new boosting algorithm, in: Proceedings of International Conference on Machine Learning, 1996, pp. 148–156.
[10] L. Breiman, Bagging predictors, Mach. Learn. 24 (1996) 123–140.
[11] L. Breiman, Random forests, Mach. Learn. 45 (2001) 5–32.
[12] S. Džeroski, B. Ženko, Is combining classifiers with stacking better than selecting the best one? Mach. Learn. 54 (2004) 255–273.
[13] K.M. Ting, I.H. Witten, Issues in stacked generalization, J. Artif. Intell. Res. 10 (1999) 271–289.
[14] L. Todorovski, S. Džeroski, Combining classifiers with meta decision trees, Mach. Learn. 50 (2003) 223–249.
[15] J. Kittler, M. Hatef, R.P.W. Duin, J. Matas, On combining classifiers, IEEE Trans. Pattern Anal. Mach. Intell. 20 (1998) 226–239.
[16] L.I. Kuncheva, A theoretical study on six classifier fusion strategies, IEEE Trans. Pattern Anal. Mach. Intell. 24 (2002) 281–286.
[17] L.I. Kuncheva, J.C. Bezdek, R.P.W. Duin, Decision templates for multiple classifier fusion: an experimental comparison, Pattern Recognit. 34 (2001) 299–314.
[18] C. Merz, Using correspondence analysis to combine classifiers, Mach. Learn. 36 (1999) 33–58.
[19] D.H. Wolpert, Stacked generalization, Neural Networks 5 (1992) 241–259.
[20] L. Zhang, W.-D. Zhou, Sparse ensembles using weighted combination methods based on linear programming, Pattern Recognit. 44 (2011) 97–106.
[21] M.U. Şen, H. Erdoǧan, Linear classifier combination and selection using group sparse regularization and hinge loss, Pattern Recognit. Lett. 34 (2013) 265–274.
[22] T.T. Nguyen, A.W.-C. Liew, C. To, X.C. Pham, M.P. Nguyen, Fuzzy If-Then rules classifier on ensemble data, in: X. Wang, W. Pedrycz, P. Chan, Q. He (Eds.), Machine Learning and Cybernetics, Springer, 2014, pp. 362–370.
[23] 〈http://archive.ics.uci.edu/ml/datasets.html〉.
[24] 〈http://ganymed.imib.rwth-aachen.de/irma/datasets_en.php〉.
[25] C.M. Bishop, Pattern Recognition and Machine Learning, Springer-Verlag, New York, 2006.
[26] N. Nasios, A.G. Bors, Variational learning for Gaussian mixture models, IEEE Trans. Syst. Man Cybern. Part B Cybern. 36 (2006) 849–862.
[27] S. Boyd, L. Vandenberghe, Convex Optimization, Cambridge University Press, 2004.
[28] D.M. Blei, M.I. Jordan, Variational methods for the Dirichlet process, in: Proceedings of ACM International Conference on Machine Learning, 2004.


[29] D.M. Blei, M.I. Jordan, Variational Inference for Dirichlet process mixtures, Bayesian Anal. 1 (2006) 121–143.
[30] N. Balakrishnan, V.B. Nevzorov, A Primer on Statistical Distributions, Wiley & Sons Press, 2003.
[31] T. Minka, Bayesian inference, entropy, and the multinomial distribution, in: Technical Report, Microsoft Research, 2003.
[32] A. Agresti, D.B. Hitchcock, Bayesian inference for categorical data analysis, Stat. Methods Appl. 14 (3) (2005) 297–330.
[33] C. Désir, S. Bernard, C. Petitjean, L. Heutte, One class random forests, Pattern Recognit. 46 (12) (2013) 3490–3506.
[34] A. Gelman, J.B. Carlin, H.S. Stern, D.B. Rubin, Bayesian Data Analysis, second ed., CRC Press, 2004.
[35] D.J. Spiegelhalter, A. Dawid, S. Lauritzen, R. Cowell, Bayesian analysis in expert systems (with discussion), Stat. Sci. 8 (1993) 219–283.
[36] S.L. Lauritzen, B. Thiesson, D.J. Spiegelhalter, Diagnostic systems created by model selection methods: a case study, in: P. Cheeseman, W. Oldford (Eds.), Uncertainty in Artificial Intelligence, 1994, vol. 4, pp. 143–152.
[37] A. Gelman, Prior distribution, Encycl. Environ. 3 (2002) 1634–1637.
[38] J.A. Hoeting, D. Madigan, A.E. Raftery, C.T. Volinsky, Bayesian model averaging: a tutorial, Stat. Sci. 14 (4) (1999) 382–416.
[39] T. Ojala, M. Pietikäinen, D. Harwood, Performance evaluation of texture measures with classification based on Kullback discrimination of distributions, in: Proceedings of the 12th International Conference on Pattern Recognition, 1994, vol. 1, pp. 582–585.
[40] T.T. Nguyen, A.W.-C. Liew, M.T. Tran, X.C. Pham, M.P. Nguyen, A novel genetic algorithm approach for simultaneous feature and classifier selection in multi classifier system, in: IEEE Congress on Evolutionary Computation (CEC), 2014, pp. 1698–1705.
[41] C.D. Sutton, Classification and regression trees, Bagging, and boosting, in: C.R. Rao, E.J. Wegman, J.L. Solka (Eds.), Handbook of Statistics, Elsevier, 2005, pp. 303–329.
[42] P. Viola, M. Jones, Robust real-time face detection, Int. J. Comput. Vision 57 (2002) 137–154.
[43] T. Dietterich, An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization, Mach. Learn. 40 (2000) 139–157.
[44] T.T. Nguyen, A.W.-C. Liew, X.C. Pham, M.P. Nguyen, A novel 2-stage combining classifier model with stacking and genetic algorithm based feature selection, in: D.-S. Huang, K.-H. Jo, L. Wang (Eds.), Intelligent Computing Methodologies, Springer International Publishing, 2014, pp. 33–43.
[45] T.T. Nguyen, A.W.-C. Liew, M.T. Tran, M.P. Nguyen, Combining multi classifiers based on a genetic algorithm – a Gaussian mixture model framework, in: D.-S. Huang, K.-H. Jo, L. Wang (Eds.), Intelligent Computing Methodologies, Springer International Publishing, 2014, pp. 56–67.

Tien Thanh Nguyen is currently a Ph.D. student at the School of Information & Communication Technology, Griffith University, Australia. His research interests are in the fields of machine learning, pattern recognition, image processing, and evolutionary computation. He has been a member of the IEEE since 2014.

Thi Thu Thuy Nguyen is currently a lecturer at the Faculty of Economic Information System, College of Economics, Hue University, Vietnam. She graduated from the Faculty of Mathematics, Voronezh State University, Russia, in 2008. Her research interests include machine learning, pattern recognition, and image processing.

Xuan Cuong Pham graduated from the School of Information and Communication Technology, Hanoi University of Science and Technology, Vietnam, in 2014 and worked for a short time in the R&D department of Samsung Co Ltd, Vietnam. His research interests include machine learning, image processing, information retrieval and computer vision.

Alan Wee-Chung Liew is currently an Associate Professor with the School of Information & Communication Technology, Griffith University, Australia. His research interests are in the fields of medical imaging, bioinformatics, computer vision, pattern recognition, and machine learning. He serves on the technical program committee of many international conferences and is on the editorial board of several journals, including the IEEE Transactions on Fuzzy Systems. He has been a senior member of the IEEE since 2005.