A novel combining classifier method based on Variational Inference


Pattern Recognition 49 (2016) 198–212


Tien Thanh Nguyen (a), Thi Thu Thuy Nguyen (b), Xuan Cuong Pham (c), Alan Wee-Chung Liew (a,*)

(a) School of Information and Communication Technology, Griffith University, Gold Coast Campus, QLD 4222, Australia
(b) Hue College of Economics, Hue University, No. 99, Ho Dac Di Street, An Cuu Ward, Hue City, Vietnam
(c) Hanoi University of Science and Technology, No. 1, Dai Co Viet Street, Hai Ba Trung District, Hanoi, Vietnam
(*) Corresponding author. Tel.: +61 7 55528671; fax: +61 7 55528066. E-mail: a.liew@griffith.edu.au

http://dx.doi.org/10.1016/j.patcog.2015.06.016

Article history: Received 14 January 2015; received in revised form 16 May 2015; accepted 25 June 2015; available online 11 July 2015.

Keywords: Ensemble method; multi-classifier system; mixture of experts; classifier fusion; combining classifier algorithm; Variational Inference; multivariate Gaussian distribution

Abstract

In this paper, we propose a combining classifier method based on the Bayesian inference framework. Specifically, the outputs of the base classifiers (called Level1 data or meta-data) are utilized in a combiner to produce the final classification. In our ensemble system, each class in the training set induces a distribution on the Level1 data, which is modeled by a multivariate Gaussian distribution. Traditionally, the parameters of the Gaussian are estimated by maximum likelihood. Here, however, maximum likelihood estimation cannot be applied because the covariance matrix of the Level1 data of each class is not of full rank. Instead, we propose to estimate the multivariate Gaussian distribution of the Level1 data of each class using the Variational Inference method. Experiments conducted on eighteen UCI Machine Learning Repository datasets and a selected 10-class CLEF2009 medical imaging database demonstrate the advantage of our method over several well-known ensemble methods.

© 2015 Elsevier Ltd. All rights reserved.

1. Introduction

In recent years, ensemble methods have been studied extensively and are an active research area in the machine learning community [1]. In general, it is difficult to know a priori which learning algorithm is suitable for a particular dataset, and an ensemble approach can achieve better accuracy than any single learning algorithm. According to statistics from the Web of Knowledge, more than 600 publications with the keyword "classifier ensemble" appeared in the two years 2011 and 2012 alone [2]. In addition to applications in computer-aided medical diagnosis, computer vision, software engineering, and information retrieval, ensemble-based algorithms have also won competitions such as the Netflix Prize (http://www.netflixprize.com) and the KDD-Cup (http://www.sigkdd.org/kddcup) [3]. All this demonstrates the significant interest in ensemble methods in both theoretical and applied studies.

Over the past 30 years, various approaches to ensemble learning have been proposed [1,3], and several taxonomies exist that reflect different views of an ensemble system [1,3-7]. In this paper, we follow the taxonomy in [8], in which ensemble methods are divided into two types (contrasted in the short sketch after this list):

• Homogeneity: generic classifiers are generated by the same learning algorithm on different training sets obtained from the original one, and the outputs of these classifiers are then combined to give the final decision. Several state-of-the-art coverage-based ensemble methods in the literature are AdaBoost [9], Bagging [10] and Random Forest [11].

• Heterogeneity: a fixed set of different learning algorithms is applied to the same training set to generate different classifiers, and the final decision is made from the outputs of these classifiers (called Level1 data or meta-data [12-14]). This approach focuses on algorithms for combining the Level1 data so as to achieve higher accuracy than any single classifier.
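The following scikit-learn sketch contrasts the two types. The dataset, library and base learners here are our choices for illustration only; the paper's own experiments were implemented in Matlab.

```python
# Illustrative contrast between the two ensemble types (library and dataset are our
# choices for this sketch; they are not part of the paper's experimental setup).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# Homogeneity: one learning algorithm, many resampled training sets (here Bagging).
homogeneous = BaggingClassifier(DecisionTreeClassifier(), n_estimators=200)
print("Bagging (homogeneous):", cross_val_score(homogeneous, X, y, cv=10).mean())

# Heterogeneity: different learning algorithms on the same training set; their
# posterior outputs (the Level1 data) would then be passed to a separate combiner.
for clf in (LinearDiscriminantAnalysis(), GaussianNB(), KNeighborsClassifier(5)):
    print(type(clf).__name__, cross_val_score(clf, X, y, cv=10).mean())
```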

In this paper, we focus on the second type of ensemble method. There are two ways to combine the outputs of different classifiers, namely fixed combining methods and trainable combining methods [8]. Fixed combining methods, which are based on the Bayesian decision model [15], do not take the label information in the Level1 data of the training set into account when combining. Their advantage is that no training on the Level1 data is needed, so they are less time-consuming than their trainable counterparts. Several popular fixed combining methods have been studied in the literature, namely the Sum, Product, Vote, Max, Min, Average, Median and Oracle rules [15,16]. To our knowledge, Vote and Sum are the most popular rules and have been successfully applied in many combining classifier situations.


Kittler et al. [15] showed that the Sum rule is derived under two assumptions ("conditional independence of the respective representations used by the classifiers" and "classes being highly ambiguous") and that it produces the most reliable predictions. Kuncheva [16] also derived theoretical error probabilities for several fusion rules under normal and uniform distribution assumptions. In contrast, trainable combining methods learn a prediction model from the Level1 data of the training set. Exploiting the Level1 data in this way can improve classification accuracy, but it also increases the computational cost. Several state-of-the-art trainable combining algorithms are Multiple Response Linear Regression (MLR) [13], Decision Template [17] and SCANN [18].

The most important trainable combining methods are based on the Stacking algorithm, first proposed by Wolpert [19] and further developed in [12,13,18]. In Stacking, the training set is divided into several equal disjoint parts; each part in turn plays the role of the test set while the rest serves as the training set during the training phase. The output of Stacking is the posterior probability that an observation belongs to each class according to each classifier. The common feature of Stacking-based approaches is that the Level1 data of the training set is trained again by a combining method to form the final prediction framework.

Several strategies have been proposed to exploit the label information in the Level1 data of the training set. In one strategy, the outputs of the classifiers are grouped according to the given labels and a template associated with each label is constructed. Two well-known methods using this strategy are MLR [13] and Decision Template [17]. MLR assumes that each classifier puts a different weight on each class; the combining algorithm is then based on M linear combinations of the posterior probabilities and the associated weights for the M classes. The predicted label of an unlabeled observation is the one with the maximum value among these combinations, so finding suitable combining weights is crucial for high classification accuracy. Ting and Witten [13] obtained the weights by solving M linear regression models, one per class, from the Level1 data and the training labels in crisp form. Zhang and Zhou [20] later proposed finding the weights by linear programming. Sen et al. [21] introduced a method inspired by MLR that uses a hinge loss in the combiner instead of the conventional least-squares loss; with regularization based on group sparsity, three different combinations were proposed, namely weighted sum, dependent weighted sum and linear stacked generalization.

In the Decision Template method [17], the Level1 data of the training set is grouped according to the class labels of the training observations and the Decision Template of each class is constructed by averaging the Level1 data within that class. Kuncheva et al. [17] proposed eleven measures of similarity between a Decision Template and the Level1 data of an unlabeled observation to predict its class label. The benefit of this method is that both training and testing are fast thanks to its simple computation. However, it can have a high error rate when the base classifiers are not accurate enough, because a simple Decision Template may not represent a class well. A compact sketch of this template-based combiner is given below.
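The sketch builds one Decision Template per class by averaging Level1 rows and assigns a new observation to the nearest template. It is a minimal sketch: the Level1 array L is assumed to be available, and plain Euclidean distance is used instead of one of the eleven similarity measures studied in [17].

```python
import numpy as np

def fit_decision_templates(L, y, n_classes):
    """One template per class: the mean of the Level1 rows belonging to that class."""
    return np.stack([L[y == m].mean(axis=0) for m in range(n_classes)])

def predict_by_template(L_new, templates):
    """Assign each Level1 row to the class whose template is closest (Euclidean distance)."""
    d = ((L_new[:, None, :] - templates[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)
```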


Merz [18] combined Stacking, Correspondence Analysis (CA) and K Nearest Neighbor (KNN) in a single algorithm called SCANN. The idea is to discover the relationship between the training observations and the classification outputs of the base classifiers by applying CA to the indicator matrix formed by the Level1 data and the true labels of these observations; KNN is then used to classify unseen observations in the new scaled space. In real-world applications the method is sometimes impractical because the indicator matrix used by CA can be singular. Moreover, the testing process of SCANN is more complicated than that of other combining classifier algorithms, which increases classification time. Another approach, proposed by Todorovski and Džeroski [14], is the Meta Decision Tree, a decision tree built on Level1 data in which a classifier, rather than a value for splitting an attribute, is chosen at each node. The authors also proposed expanding the Level1 data with the entropy and the maximum posterior probability so as to increase its discrimination ability, although no theoretical basis was provided for the effectiveness of this expansion. Recently, Nguyen et al. [22] proposed a hybrid combining classifier system in which fuzzy rules are applied to the Level1 data to produce the classification rules. Although that system outperforms other fuzzy rule-based methods and ensemble methods in experiments, and addresses the high-dimensionality problem commonly found in fuzzy rule-based methods, its training is time-consuming because of the large number of rules generated on the Level1 data.

Unlike the above approaches, we propose a novel combining classifier method, called VIG, that approximates the density distribution of the Level1 data to obtain a prediction framework based on the Bayesian decision model. Since the maximum likelihood approach is not applicable due to the singularity of the Level1 data, we propose using Variational Inference (VI) to estimate the multivariate Gaussian density.

The rest of this paper is organized as follows. Section 2 describes the properties of Level1 data and then introduces the VI method for multivariate Gaussian distribution estimation. Section 3 presents the proposed framework, based on the Bayesian decision model, for combining the outputs of the base classifiers. Experimental results on eighteen UCI datasets [23] and the CLEF medical image database [24] are reported and discussed in Section 4. The conclusion is given in the last section.

2. Preliminaries

A summary of the mathematical notation is given below.

X : observed data (the training set)
x : an observation
M : the number of classes
N : the number of observations
K : the number of learning algorithms
{y_m}, m = 1,...,M : the set of class labels
Z : hidden variable
μ, Σ : mean and covariance of a Gaussian distribution
Λ : the precision matrix, i.e. the inverse of the covariance matrix, Λ = Σ^{-1}
D : the dimension of the input data
W_0, υ_0 : initial scale matrix and degrees of freedom of the Wishart distribution p(Λ)
m_0, β_0 : initial mean vector and scale of the precision matrix Λ of the Gaussian distribution p(μ | Λ)
m, H : mean vector and precision matrix of the Gaussian distribution q(μ) = N(μ | m, H^{-1})
W, υ : scale matrix and degrees of freedom of the Wishart distribution q(Λ) = W(Λ | W, υ)
Tr(·) : the trace operator of a matrix
Γ(·) : the Gamma function, Γ(t) = ∫_0^∞ x^{t-1} e^{-x} dx
ℒ(q) : the lower bound
L : the meta-data (Level1 data) of X
L(x) : the meta-data (Level1 data) of observation x
{L_m}, m = 1,...,M : the Level1 data of the observations in the m-th class, L_m = {(L(x), y) | x ∈ X, y = y_m}
{G_m}, m = 1,...,M : the Gaussian model for the m-th class
| · | : the (relative) cardinality of a set

2.1. Level1 data

Let $\{y_m\}_{m=1,\dots,M}$ denote the set of labels, N the number of observations, K the number of base classifiers and M the number of classes. For an observation x, $P_k(y_m \mid x)$ is the probability that x belongs to the class with label $y_m$ according to the k-th classifier. Kuncheva et al. [17] summarized three output types for x, for each k = 1,...,K:

• Crisp label: only a class label is returned, $P_k(y_m \mid x) \in \{0, 1\}$ and $\sum_m P_k(y_m \mid x) = 1$.
• Fuzzy label: the posterior probabilities that x belongs to the classes are returned, i.e. $P_k(y_m \mid x) \in [0, 1]$ and $\sum_m P_k(y_m \mid x) = 1$.
• Possibilistic label: the same as the fuzzy label, but the supports are not required to sum to one, i.e. $P_k(y_m \mid x) \in [0, 1]$ and $\sum_m P_k(y_m \mid x) > 0$.

In this work, we focus only on the second type, i.e. the fuzzy label, which has a familiar interpretation: the posterior probability reflects the support of a class for an observation. The Level1 data of all observations is the $N \times MK$ posterior probability matrix

$$L := \begin{bmatrix}
P_1(y_1 \mid x_1) & \cdots & P_1(y_M \mid x_1) & \cdots & P_K(y_1 \mid x_1) & \cdots & P_K(y_M \mid x_1) \\
P_1(y_1 \mid x_2) & \cdots & P_1(y_M \mid x_2) & \cdots & P_K(y_1 \mid x_2) & \cdots & P_K(y_M \mid x_2) \\
\vdots & \ddots & \vdots & \ddots & \vdots & \ddots & \vdots \\
P_1(y_1 \mid x_N) & \cdots & P_1(y_M \mid x_N) & \cdots & P_K(y_1 \mid x_N) & \cdots & P_K(y_M \mid x_N)
\end{bmatrix} \qquad (1)$$

whereas the Level1 data of an observation $x_n$ is written in one of two forms, depending on the combining algorithm:

$$L(x_n) := \begin{bmatrix}
P_1(y_1 \mid x_n) & \cdots & P_1(y_M \mid x_n) \\
\vdots & \ddots & \vdots \\
P_K(y_1 \mid x_n) & \cdots & P_K(y_M \mid x_n)
\end{bmatrix}, \quad n = 1, \dots, N \qquad (2a)$$

$$L(x_n) := \big[\, P_1(y_1 \mid x_n) \;\cdots\; P_1(y_M \mid x_n) \;\cdots\; P_K(y_1 \mid x_n) \;\cdots\; P_K(y_M \mid x_n) \,\big], \quad n = 1, \dots, N \qquad (2b)$$

We have the following properties of the Level1 data.

Lemma 1. The Level1 data generated using fuzzy labels is not of full column rank.

Corollary 1. The covariance matrix of the Level1 data is not of full rank.

The proofs can be found in Appendix A.
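The rank deficiency stated in Lemma 1 and Corollary 1 is easy to confirm numerically. The sketch below generates stacking-style Level1 data for a small dataset with K = 3 base classifiers and checks the ranks; the dataset and the scikit-learn tooling are our choices for illustration, not the paper's Matlab pipeline.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_predict
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)                      # N = 150 observations, M = 3 classes
base = [LinearDiscriminantAnalysis(), GaussianNB(), KNeighborsClassifier(5)]

# Stacking-style Level1 data: out-of-fold posterior probabilities of the K = 3 base
# classifiers, concatenated column-wise into the N x MK matrix of Eq. (1).
L = np.hstack([cross_val_predict(c, X, y, cv=10, method="predict_proba") for c in base])

print(L.shape)                                         # (150, 9)
print(np.linalg.matrix_rank(L))                        # < 9: not full column rank (Lemma 1)
print(np.linalg.matrix_rank(np.cov(L, rowvar=False)))  # < 9: singular covariance (Corollary 1)
```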

2.2. Variational Inference for multivariate Gaussian distribution

Maximum likelihood estimation is a popular way to fit a Gaussian distribution to an observed dataset. However, it cannot be applied when the covariance matrix of the dataset is not of full rank, and Kuncheva et al. [17] noted that when the accuracy of all base classifiers is high, the covariance matrix of the Level1 data is likely to be singular. In this work, we therefore propose to use the VI method to estimate the multivariate Gaussian model. The idea behind VI is to approximate the posterior distribution $p(Z \mid X)$ of the hidden variables Z given the observed data X by a more tractable distribution $q(Z)$ that minimizes the divergence between $p(Z \mid X)$ and $q(Z)$. In the maximum likelihood method, the parameters $(\mu, \Sigma)$ are not treated as random variables and the likelihood function is maximized with respect to their values. In contrast, in VI the parameters $(\mu, \Sigma)$ are treated as random variables and priors are placed over them to obtain the posterior distribution $p(\mu, \Sigma \mid X)$. In the literature, the Kullback–Leibler (KL) divergence is commonly used to measure the distance between two distributions:

$$\mathrm{KL}(q \,\|\, p) = \mathbb{E}_q\!\left[\ln \frac{q(Z)}{p(Z \mid X)}\right] = -\int q(Z) \ln \frac{p(Z \mid X)}{q(Z)}\, dZ \qquad (3)$$

It is worth noting that the KL divergence is difficult to optimize directly, since it requires knowledge of the very distribution we are trying to approximate. Because $\mathrm{KL}(q \,\|\, p) \ge 0$ and $\mathrm{KL}(q \,\|\, p) = \ln p(X) - \mathcal{L}(q)$, where $\mathcal{L}(q) = \int q(Z) \ln \frac{p(X, Z)}{q(Z)}\, dZ$ is a lower bound on the log marginal probability $\ln p(X)$, we can maximize the lower bound $\mathcal{L}(q)$ instead of minimizing $\mathrm{KL}(q \,\|\, p)$.

If we assume the factorization $q(Z) = \prod_{i=1}^{M} q_i(Z_i)$, where $Z = \cup_{i=1}^{M} Z_i$ and $Z_i \cap Z_j = \emptyset$ for $i \ne j$, and iteratively maximize $\mathcal{L}(q)$ with respect to $q_j(Z_j)$ while the factors $q_{i \ne j}$ are held fixed, the optimal solution $q_j^{*}(Z_j)$ is given by [25,26]:

$$\ln q_j^{*}(Z_j) = \mathbb{E}_{i \ne j}\big[\ln p(X, Z)\big] + \text{const} \qquad (4)$$

Here $\mathbb{E}_{i \ne j}[\cdot]$ denotes an expectation with respect to the q distributions over all variables $Z_i$ with $i \ne j$, and the constant is independent of $Z_j$. Convergence is guaranteed because the bound is convex with respect to each of the factors $q_i(Z_i)$ [27].

In the literature, VI-based approaches have been used to estimate Dirichlet distributions and Gaussian Mixture Models (GMM) [26,28,29]. In this work, we apply VI to estimate the parameters of a single multivariate Gaussian distribution. By the Central Limit Theorem, a Gaussian can approximate a wide range of other distributions, such as the Poisson, Binomial and Gamma [30], whereas Dirichlet distributions are most commonly used as priors for categorical or multinomial variables in Bayesian models [31,32]. Here, the multivariate Gaussian is used to approximate the likelihood function $p(L(x) \mid G_m)$ for each class label, where all features of $L(x)$ are real-valued and lie in $[0, 1]$. Although a GMM could also be used to model each class, a GMM requires many parameters, which makes training expensive; moreover, when only a small amount of data is available, the choice of the number of Gaussian components becomes critical [33].

Our goal is to infer the posterior distribution of the mean $\mu$ and the precision matrix $\Lambda = \Sigma^{-1}$, given a dataset $X = \{x_n \mid n = 1, \dots, N\}$ whose elements are assumed to be drawn independently from the multivariate Gaussian $N(x \mid \mu, \Lambda^{-1})$. The likelihood function is

$$p(X \mid \mu, \Lambda) = \prod_{n=1}^{N} N(x_n \mid \mu, \Lambda^{-1}) = (2\pi)^{-\frac{ND}{2}}\, |\Lambda|^{\frac{N}{2}} \exp\!\left\{ -\frac{1}{2} \sum_{n=1}^{N} (x_n - \mu)^{T} \Lambda (x_n - \mu) \right\} \qquad (5)$$

where D is the dimensionality of the variable x. To formulate a variational solution, we write down the joint distribution of all of the random variables, $p(X, \mu, \Lambda) = p(X \mid \mu, \Lambda)\, p(\mu \mid \Lambda)\, p(\Lambda)$.

The conjugate prior of a multivariate Gaussian with unknown $\mu$ and $\Lambda$ is the Gaussian–Wishart distribution $p(\mu, \Lambda) = p(\mu \mid \Lambda)\, p(\Lambda)$, where $p(\mu \mid \Lambda)$ is a Gaussian,

$$p(\mu \mid \Lambda) = N\big(\mu \mid m_0, (\beta_0 \Lambda)^{-1}\big) = (2\pi)^{-\frac{D}{2}}\, |\beta_0 \Lambda|^{\frac{1}{2}} \exp\!\left\{ -\frac{1}{2} (\mu - m_0)^{T} \beta_0 \Lambda (\mu - m_0) \right\} \qquad (6)$$

and $p(\Lambda)$ is a Wishart distribution,

$$p(\Lambda) = W(\Lambda \mid W_0, \upsilon_0) = B(W_0, \upsilon_0)\, |\Lambda|^{\frac{\upsilon_0 - D - 1}{2}} \exp\!\left\{ -\frac{1}{2} \mathrm{Tr}\big(W_0^{-1} \Lambda\big) \right\} \qquad (7)$$

$$B(W_0, \upsilon_0) = |W_0|^{-\frac{\upsilon_0}{2}} \left( 2^{\frac{\upsilon_0 D}{2}}\, \pi^{\frac{D(D-1)}{4}} \prod_{i=1}^{D} \Gamma\!\left(\frac{\upsilon_0 + 1 - i}{2}\right) \right)^{-1} \qquad (8)$$

where $m_0$ and $\beta_0$ are the D-dimensional mean vector and the scale of the precision matrix $\Lambda$ of the Gaussian $p(\mu \mid \Lambda)$, $W_0$ and $\upsilon_0$ are the $D \times D$ scale matrix and the number of degrees of freedom of the Wishart distribution $p(\Lambda)$, $\mathrm{Tr}(\cdot)$ denotes the trace operator, and $\Gamma(\cdot)$ denotes the Gamma function $\Gamma(t) = \int_0^{\infty} x^{t-1} e^{-x}\, dx$.

Adopting a factorized variational approximation to the posterior distribution, i.e. $q(\mu, \Lambda) = q(\mu)\, q(\Lambda)$, the following update equations can be derived:

$$\ln q^{*}(\mu) = \mathbb{E}_{\Lambda}\big[\ln p(X, \mu, \Lambda)\big] + \text{const} \qquad (9)$$

$$\ln q^{*}(\Lambda) = \mathbb{E}_{\mu}\big[\ln p(X, \mu, \Lambda)\big] + \text{const} \qquad (10)$$

We have the following results.

Lemma 2. The optimum solution $q^{*}(\mu)$ of update equation (9) is a Gaussian $q^{*}(\mu) = N(\mu \mid m, H^{-1})$ with mean m and precision H given by (11) and (12):

$$m = \frac{\beta_0 m_0 + N \bar{x}}{\beta_0 + N} \qquad (11)$$

$$H = (\beta_0 + N)\, \mathbb{E}[\Lambda] \qquad (12)$$

Lemma 3. The optimum solution $q^{*}(\Lambda)$ of update equation (10) is a Wishart $q^{*}(\Lambda) = W(\Lambda \mid W, \upsilon)$ with number of degrees of freedom $\upsilon$ and scale matrix W given by (13) and (14):

$$\upsilon = \upsilon_0 + N + 1 \qquad (13)$$

$$W^{-1} = W_0^{-1} + (\beta_0 + N) H^{-1} + S + \frac{\beta_0 N}{\beta_0 + N} J \qquad (14)$$

where

$$\bar{x} = \frac{1}{N} \sum_{n=1}^{N} x_n, \qquad S = \sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^{T}, \qquad J = (\bar{x} - m_0)(\bar{x} - m_0)^{T}$$

Lemma 4. The lower bound $\mathcal{L}(q)$ of the Variational Inference for the multivariate Gaussian distribution is given by (15):

$$\mathcal{L}(q) = \ln B(W_0, \upsilon_0) - \ln B(W, \upsilon) - \frac{1}{2}\Big[ ND \ln(2\pi) - D \ln \beta_0 - \upsilon D + \ln|H| + \upsilon\, \mathrm{Tr}(SW) + \frac{\beta_0 N \upsilon}{\beta_0 + N} \mathrm{Tr}(JW) + \upsilon\, \mathrm{Tr}\big(W_0^{-1} W\big) \Big] \qquad (15)$$

The proofs of Lemmas 2–4 are given in Appendix A. Denoting by $\mathcal{L}_i(q)$ the value of the lower bound at the i-th iteration, it can be shown that

$$\mathcal{L}_i(q) - \mathcal{L}_{i-1}(q) = -\ln B(W_i, \upsilon) - \frac{\upsilon}{2} \mathrm{Tr}\Big[\Big(S + W_0^{-1} + \frac{\beta_0 N}{\beta_0 + N} J\Big) W_i\Big] + \ln B(W_{i-1}, \upsilon) + \frac{\upsilon}{2} \mathrm{Tr}\Big[\Big(S + W_0^{-1} + \frac{\beta_0 N}{\beta_0 + N} J\Big) W_{i-1}\Big] - \frac{1}{2}\big(\ln|W_i| - \ln|W_{i-1}|\big) \qquad (16)$$

where we have made use of the fact that

$$\ln|H_i| - \ln|H_{i-1}| = \ln \frac{|\mathbb{E}[\Lambda_i]|}{|\mathbb{E}[\Lambda_{i-1}]|} = \ln \frac{|\upsilon W_i|}{|\upsilon W_{i-1}|} = \ln|W_i| - \ln|W_{i-1}|$$

This leads to the following algorithm for multivariate Gaussian distribution estimation.

Algorithm 1. VI for multivariate Gaussian distribution estimation

Input: dataset X, threshold $\varepsilon$; $m_0$, $\beta_0$, $\upsilon_0$, $W_0$, $\mathbb{E}[\Lambda]$
Output: m, H of $q(\mu) = N(\mu \mid m, H^{-1})$ and W, $\upsilon$ of $q(\Lambda) = W(\Lambda \mid W, \upsilon)$

  i := 1
  Repeat
    Update m, H using (11) and (12)
    Update W, $\upsilon$ using (13) and (14)
    If i > 1 and $\mathcal{L}_i(q) - \mathcal{L}_{i-1}(q) < \varepsilon$ then break
    i := i + 1
  End

In the algorithm above, the four variables of $q(\mu)$ and $q(\Lambda)$ are updated step by step from their initial values. The updating process stops when the change in the lower bound $\mathcal{L}(q)$ is smaller than a specified threshold $\varepsilon$. In our experiments, 3 or 4 iterations typically proved sufficient to achieve convergence with a threshold $\varepsilon = 10^{-10}$.
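A minimal NumPy sketch of Algorithm 1 is given below, using the initialization later adopted in Section 4.2 (m0 = 0, beta0 = 1, nu0 = D, W0 = I, E[Lambda] = nu0*W0). For brevity the loop stops when E[Lambda] stabilizes rather than monitoring the lower-bound change (16); this stopping rule is our simplification.

```python
import numpy as np

def vi_gaussian(X, eps=1e-10, max_iter=100):
    """Variational Inference for a multivariate Gaussian, following updates (11)-(14)."""
    N, D = X.shape
    m0, beta0 = np.zeros(D), 1.0
    nu0, W0_inv = float(D), np.eye(D)          # W0 = I, hence W0^{-1} = I
    E_Lambda = nu0 * np.eye(D)                 # initial E[Lambda] = nu0 * W0

    xbar = X.mean(axis=0)
    S = (X - xbar).T @ (X - xbar)              # scatter matrix around the sample mean
    J = np.outer(xbar - m0, xbar - m0)
    nu = nu0 + N + 1                           # Eq. (13): fixed over the iterations
    m = (beta0 * m0 + N * xbar) / (beta0 + N)  # Eq. (11): does not change over iterations

    for _ in range(max_iter):
        H = (beta0 + N) * E_Lambda                              # Eq. (12)
        W_inv = (W0_inv + (beta0 + N) * np.linalg.inv(H)
                 + S + beta0 * N / (beta0 + N) * J)             # Eq. (14)
        W = np.linalg.inv(W_inv)
        E_new = nu * W                         # mean of the Wishart q(Lambda) = W(Lambda|W, nu)
        converged = np.max(np.abs(E_new - E_Lambda)) < eps
        E_Lambda = E_new
        if converged:
            break

    return m, (beta0 + N) * E_Lambda, W, nu    # m, H, W, nu
```

Only W (through E[Lambda]) changes from one iteration to the next, which is consistent with the very fast convergence reported later in Table 6.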

3. Proposed combining classifier method

The most important distinction between our work and previous work is that we use a statistical-learning approach on the Level1 data to build the combining classifier. Attributes in the original data often differ in nature, measurement unit and type, so a Gaussian model frequently fits the distribution of the original data poorly. Level1 data, on the other hand, can be viewed as data rescaled from the feature domain to the posterior domain, where every value is real and lies in [0, 1]. Observations belonging to the same class tend to receive similar posterior probabilities from the base classifiers and therefore lie close together in the new domain. Consequently, Level1 data is expected to be more discriminative than the original data, and a Gaussian model on the Level1 data is expected to be more effective than one on the original data.

The proposed VIG combining classifier method is illustrated in Fig. 1. First, the Stacking algorithm is applied to the training set to generate the Level1 data, denoted by L. Since the labels of the training observations $X = \{(x, y) \mid y \in \{y_m\}_{m=1,\dots,M}\}$ are known, we can group L into M groups corresponding to the M labels: $L_m = \{(L(x), y) \mid x \in X,\; y = y_m\}$, $m = 1, \dots, M$. The VI method is then applied to each $L_m$ to model the distribution of each label by a multivariate Gaussian. Based on the Bayesian decision model, the posterior probability that an observation x belongs to the m-th class is given by

$$p(G_m \mid L(x)) \propto p(L(x) \mid G_m)\, p(G_m) \qquad (17)$$

where $G_m$ is the model for the m-th class.


Fig. 1. The proposed VIG combining classifier method.

Here $p(G_m)$ is the prior probability of the m-th class. Many approaches to the choice of prior probability have been introduced [34]; they generally fall into two classes, informative priors and uninformative priors. Spiegelhalter et al. [35] and Lauritzen et al. [36] demonstrated improved prediction when informative priors are used in Bayesian systems. Gelman [37] stated that the choice of prior distribution has only a minor effect on the posterior probabilities when there are many observations and the parameters are well identified, whereas it plays an important role when the number of observations is small or the available data provide only indirect information about the parameters of interest. In this paper, due to space limitations and to show that even a simple empirical choice can achieve good classification performance, we compute the prior probability simply as

$$p(G_m) = \frac{|L_m|}{|X|} \qquad (18)$$

where $|\cdot|$ denotes the cardinality of a set. The likelihood function $p(L(x) \mid G_m)$ is given by

$$p(L(x) \mid G_m) = N(L(x) \mid \mu_m, \Sigma_m) = (2\pi)^{-\frac{MK}{2}}\, |\Sigma_m|^{-\frac{1}{2}} \exp\!\left\{ -\frac{1}{2} \big(L(x) - \mu_m\big)^{T} \Sigma_m^{-1} \big(L(x) - \mu_m\big) \right\} \qquad (19)$$

where $\mu_m$ and $\Sigma_m$ are the mean and covariance matrix of the multivariate Gaussian model obtained by VI for the m-th class. Note that instead of $\Sigma_m$, the precision matrix $\Lambda_m$ is computed by VI during the training process.

In the classification phase, the Level1 data of an unlabeled observation is first generated by the base classifiers. Its class prediction is then obtained by selecting the label associated with the maximum posterior probability computed by the M multivariate Gaussian models. The class label of an unlabeled observation $x^{*}$ is therefore predicted by

$$x^{*} \in y_t \quad \text{if} \quad t = \arg\max_{m=1,\dots,M} p\big(G_m \mid L(x^{*})\big) = \arg\max_{m=1,\dots,M} p\big(L(x^{*}) \mid G_m\big)\, p(G_m) \qquad (20)$$

Our ensemble framework has some similarity with Bayesian Model Averaging (BMA) [38], since the results of all hypotheses (classifiers) are used to obtain the final discriminative model. However, in our work we consider only the outputs themselves (obtained from the base classifiers), whereas in BMA the base classifier models are explicitly part of the formulation. The VIG combining classifier method is given below.

Algorithm 2. VIG combining classifier method

Training process:
Input: training set $X = \{(x, y)\}$, K = {K learning algorithms}, $\varepsilon$, $m_0$, $\beta_0$, $\upsilon_0$, $W_0$, $\mathbb{E}[\Lambda]$
Output: Gaussian models $\{G_m\}$ with parameters $\mu_m$, $\Lambda_m$ and $p(G_m)$, $m = 1, \dots, M$
  Step 1: L = Stacking(X, K)
  Step 2: $L_m = \{(L(x), y) \mid x \in X,\; y = y_m\}$
  Step 3: For the m-th class
            $(m_m, H_m, W_m, \upsilon_m)$ = Algorithm1$(L_m, \varepsilon, m_0, \beta_0, \upsilon_0, W_0, \mathbb{E}[\Lambda])$
            where $\mu_m = m_m$, $\Lambda_m = \upsilon_m W_m$ and $p(L(x) \mid G_m) = N(L(x) \mid \mu_m, \Lambda_m^{-1})$
            Compute $p(G_m)$ using (18)
          End

Classification process:
Input: unlabeled observation $x^{*}$
Output: predicted label for $x^{*}$
  Step 1: Compute $L(x^{*})$
  Step 2: For the m-th class
            Compute $p(G_m \mid L(x^{*})) \propto p(L(x^{*}) \mid G_m)\, p(G_m)$ using (17)
          End
  Step 3: Predict the label of $x^{*}$ using (20)
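Building on the vi_gaussian() sketch from Section 2.2, the following minimal implementation of Algorithm 2 fits one Gaussian per class on the Level1 data and predicts by Eqs. (17)-(20). It assumes integer class labels 0..M-1 and an N x MK Level1 array L; it is an illustrative sketch rather than the authors' Matlab code.

```python
import numpy as np

def vig_fit(L, y, n_classes):
    """Train the VIG combiner: one multivariate Gaussian (via VI) per class plus its prior."""
    models = []
    for m in range(n_classes):
        Lm = L[y == m]
        mean, H, W, nu = vi_gaussian(Lm)
        Lambda = nu * W                          # precision matrix Lambda_m = nu_m * W_m
        prior = Lm.shape[0] / L.shape[0]         # Eq. (18): p(G_m) = |L_m| / |X|
        models.append((mean, Lambda, prior))
    return models

def vig_predict(L_new, models):
    """Predict by the maximum of log p(L(x)|G_m) + log p(G_m), cf. Eqs. (17), (19), (20)."""
    scores = []
    for mean, Lambda, prior in models:
        _, logdet = np.linalg.slogdet(Lambda)
        diff = L_new - mean
        quad = np.einsum("ni,ij,nj->n", diff, Lambda, diff)
        # The common -(MK/2)*log(2*pi) term is omitted: it does not affect the argmax.
        scores.append(0.5 * logdet - 0.5 * quad + np.log(prior))
    return np.argmax(np.stack(scores, axis=1), axis=1)
```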

4. Experimental results

4.1. Dataset

To evaluate the performance of the proposed VIG method we performed experiments on two datasets. In the first experiment, eighteen data files from the UCI Machine Learning Repository were used, since this repository is widely used to validate the performance of classification systems [23]. To ensure an objective comparison between our method and the benchmark algorithms, we chose files whose number of observations varies considerably, from small files such as Fertility and Iris to a large file such as Skin&NonSkin. The number of attributes also varies widely, from three (Titanic) to sixty (Sonar). Information about the selected data files is summarized in Table 1.

The second experiment was conducted on CLEF2009, a medical imaging database collected by Aachen University, Germany [24]. It is a large database containing 15,363 images allocated to 193 hierarchical categories. Here, we chose 10 classes with different numbers of observations per class (Table 2). The Histogram of Local Binary Patterns (HLBP) [39] was selected as the feature vector of each image; a rough sketch of such a feature extractor is given below.
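The sketch uses scikit-image's uniform LBP as a stand-in for HLBP; the radius, number of neighbours and binning are our illustrative choices and are not the exact configuration of [39].

```python
import numpy as np
from skimage.feature import local_binary_pattern

def hlbp_feature(gray_image, P=8, R=1.0):
    """Histogram of (uniform) Local Binary Pattern codes as an image feature vector."""
    codes = local_binary_pattern(gray_image, P, R, method="uniform")   # codes in 0..P+1
    hist, _ = np.histogram(codes, bins=P + 2, range=(0, P + 2), density=True)
    return hist
```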

4.2. Experimental settings

Three learning algorithms, namely Linear Discriminant Analysis (LDA), Naïve Bayes, and K Nearest Neighbor (with K set to 5, denoted by 5-NN), were chosen to construct the base classifiers. These diverse learning algorithms were chosen to ensure the diversity of the ensemble system. Since our method is a combining classifier method, it is necessary to compare it with other well-known combining classifier methods; it is also important to compare its error rates with those of the base classifiers to demonstrate the advantage of the ensemble approach. In our experiments, the proposed method was compared with seven benchmarks: the best result among the base classifiers on the test set, the best result among the fixed combining rules on the test set, Decision Template (using the similarity measure $S_1(L(x), DT_m) = |L(x) \cap DT_m| / |L(x) \cup DT_m|$, where $DT_m$ is the Decision Template of the m-th class and $|\cdot|$ is the relative cardinality of a set [17]), MLR, SCANN, AdaBoost, and Bagging.

Only simple values were chosen to initialize the parameters of Algorithm 1: $m_0$ is the D-dimensional zero vector $(0, \dots, 0)^{T}$, $\beta_0 = 1$, $\upsilon_0 = D$, $W_0$ is the $D \times D$ identity matrix and $\mathbb{E}[\Lambda] = \upsilon_0 W_0$. We performed 10-fold cross-validation and ran the test 10 times to obtain 100 test results for each data file. To assess statistical significance, we used a two-sample t-test to compare the classification results of our approach and each benchmark algorithm. Specifically, the test evaluates the null hypothesis $H_0: e_A = e_B$, where $e_A$ and $e_B$ are the classification error rates of the proposed method and the benchmark algorithm:

$$t = \frac{\bar{e}_A - \bar{e}_B - (e_A - e_B)}{\sqrt{\dfrac{s_A^2}{n_A} + \dfrac{s_B^2}{n_B}}} \;\overset{\text{if } H_0 \text{ true}}{=}\; \frac{\bar{e}_A - \bar{e}_B}{\sqrt{\dfrac{s_A^2}{n_A} + \dfrac{s_B^2}{n_B}}} \qquad (21)$$

where $\bar{e}_A$, $\bar{e}_B$ are the means and $s_A$, $s_B$ the standard deviations of the classification error rates computed over the $n_A = n_B = 100$ test results of the proposed method and the benchmark algorithm, respectively. The critical region depends on the chosen alternative hypothesis $H_1$, e.g. one-tailed ($e_A > e_B$ or $e_A < e_B$) or two-tailed ($e_A \ne e_B$). We reject the null hypothesis that the two error rates are equal if the statistic t falls in the rejection region, and vice versa. In this paper a one-tailed alternative hypothesis was used and the significance level was set to 0.05. All source code was implemented in Matlab running on a PC with an Intel Core i5 2.5 GHz processor and 4 GB of RAM. The results of the experiments are summarized in Tables 3 and 4. A worked example of this significance test is sketched below.
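The snippet evaluates Eq. (21) from the summary statistics of two methods and applies the one-tailed decision at the 0.05 level; the Welch degrees-of-freedom approximation is our assumption, since the paper does not state which convention it uses.

```python
import numpy as np
from scipy import stats

def one_tailed_ttest(mean_a, var_a, mean_b, var_b, n_a=100, n_b=100, alpha=0.05):
    """Two-sample t-test of H0: e_A = e_B against H1: e_A > e_B, as in Eq. (21)."""
    se = np.sqrt(var_a / n_a + var_b / n_b)
    t = (mean_a - mean_b) / se                            # Eq. (21) under H0
    df = (var_a / n_a + var_b / n_b) ** 2 / (
        (var_a / n_a) ** 2 / (n_a - 1) + (var_b / n_b) ** 2 / (n_b - 1))
    p = 1.0 - stats.t.cdf(t, df)                          # one-tailed p-value
    return t, p, p < alpha                                # reject H0 when p < alpha
```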

Table 1. Information of the UCI data files used in the evaluation.

| File name | No. of attributes | Attribute type | No. of observations | No. of classes | No. of attributes in Level1 data |
| Bupa | 6 | C,I,R | 345 | 2 | 6 |
| Artificial | 10 | R | 700 | 2 | 6 |
| Pima | 6 | R,I | 768 | 2 | 6 |
| Sonar | 60 | R | 208 | 2 | 6 |
| Heart | 13 | C,I,R | 270 | 2 | 6 |
| Phoneme | 5 | R | 540 | 2 | 6 |
| Titanic | 3 | R,I | 2,201 | 2 | 6 |
| Balance | 4 | C | 625 | 3 | 9 |
| Fertility | 9 | R | 100 | 2 | 6 |
| Skin&NonSkin | 3 | R | 245,057 | 2 | 6 |
| Wdbc | 30 | R | 569 | 2 | 6 |
| Australian | 14 | C,I,R | 690 | 2 | 6 |
| Twonorm | 20 | R | 7,400 | 2 | 6 |
| Ring | 20 | R | 7,400 | 2 | 6 |
| Tae | 20 | C,I | 151 | 2 | 6 |
| Contraceptive | 9 | C,I | 1,473 | 3 | 9 |
| Vehicle | 18 | I | 946 | 4 | 12 |
| Iris | 4 | R | 150 | 3 | 9 |

R: Real, C: Category, I: Integer.

4.3. Results and discussion

4.3.1. Comparison with other combining classifier methods and base classifiers

First, the error rates and variances of the three base classifiers are reported in Table 3. From these results, the classifier with the smallest error rate on each data file was selected as the best base-classifier result. From the significance tests in Table 5, we see that VIG is better than the best result from the base classifiers, obtaining seven wins and only one loss. In comparison with the best result from the six fixed combining rules (see Table 4), our method achieved better results on five files, namely Phoneme (0.1164 vs. 0.1407), Ring (0.1168 vs. 0.2122), Skin&NonSkin (4.28E-04 vs. 6E-04), Vehicle (0.2069 vs. 0.2645) and CLEF2009 (0.1693 vs. 0.2023), and worse results on two files, Bupa (0.3151 vs. 0.297) and Artificial (0.2401 vs. 0.2193). As mentioned earlier, fixed combining rules do not exploit the label information in the Level1 data of the training set, so they sometimes do not reach high accuracy. In contrast, the proposed method is a trainable combining method in which the label information in the Level1 data is exploited to make the prediction. As a result, in almost all situations our method is better than any fixed combining rule.

Table 2. Information of the 10 classes selected from the CLEF2009 medical image database (the sample images of the original table are omitted).

| Class description | Number of observations |
| Abdomen | 80 |
| Cervical | 81 |
| Chest | 80 |
| Facial cranium | 80 |
| Left elbow | 69 |
| Left shoulder | 80 |
| Left breast | 80 |
| Finger | 66 |
| Left ankle joint | 80 |
| Left carpal joint | 80 |


Table 3. Classification error rates and variances of the 3 base classifiers. Each cell gives the mean error rate followed by its variance in parentheses; the "Best" column is marked [+] if the method is better than VIG, [=] if equal to VIG, [-] if worse than VIG (checked by t-test).

| File name | LDA | Naïve Bayes | 5-NN | Best base classifier |
| Bupa | 0.3693 (8.30E-03) | 0.4264 (7.60E-03) | 0.3331 (6.10E-03) | 0.3331 (6.10E-03) [-] |
| Artificial | 0.4511 (1.40E-03) | 0.4521 (1.40E-03) | 0.2496 (2.40E-03) | 0.2496 (2.40E-03) [=] |
| Pima | 0.2396 (2.40E-03) | 0.2668 (2.00E-03) | 0.2864 (2.30E-03) | 0.2396 (2.40E-03) [=] |
| Sonar | 0.2629 (9.70E-03) | 0.3042 (7.40E-03) | 0.1875 (7.60E-03) | 0.1875 (7.60E-03) [=] |
| Heart | 0.1593 (5.30E-03) | 0.1611 (5.90E-03) | 0.3348 (5.10E-03) | 0.1593 (5.30E-03) [=] |
| Phoneme | 0.2408 (3.00E-04) | 0.2607 (3.00E-04) | 0.1133 (2.00E-04) | 0.1133 (2.00E-04) [=] |
| Titanic | 0.2201 (5.00E-04) | 0.2515 (8.00E-04) | 0.2341 (3.70E-03) | 0.2201 (5.00E-04) [=] |
| Balance | 0.2917 (2.90E-03) | 0.2600 (3.30E-03) | 0.1442 (1.20E-03) | 0.1442 (1.20E-03) [-] |
| Fertility | 0.3460 (2.01E-02) | 0.3770 (2.08E-02) | 0.1550 (4.50E-03) | 0.1550 (4.50E-03) [-] |
| Skin&NonSkin | 0.0659 (2.74E-06) | 0.1785 (6.61E-06) | 0.0005 (1.68E-08) | 0.0005 (1.68E-08) [-] |
| Wdbc | 0.0397 (7.00E-04) | 0.0587 (1.20E-03) | 0.0666 (8.00E-04) | 0.0397 (7.00E-04) [=] |
| Australian | 0.1416 (1.55E-03) | 0.1297 (1.71E-03) | 0.3457 (2.11E-03) | 0.1297 (1.71E-03) [=] |
| Twonorm | 0.0217 (3.12E-05) | 0.0217 (3.13E-05) | 0.0312 (3.96E-05) | 0.0217 (3.12E-05) [=] |
| Ring | 0.2381 (2.27E-04) | 0.2374 (2.23E-04) | 0.3088 (1.30E-04) | 0.2374 (2.23E-04) [-] |
| Tae | 0.4612 (1.21E-02) | 0.4505 (1.22E-02) | 0.5908 (1.37E-02) | 0.4505 (1.22E-02) [=] |
| Contraceptive | 0.4992 (1.40E-03) | 0.5324 (1.42E-03) | 0.4936 (1.70E-03) | 0.4936 (1.70E-03) [-] |
| Vehicle | 0.2186 (1.39E-03) | 0.5550 (2.94E-03) | 0.3502 (2.35E-03) | 0.2186 (1.39E-03) [-] |
| Iris | 0.0200 (1.40E-03) | 0.0400 (2.30E-03) | 0.0353 (1.50E-03) | 0.0200 (1.40E-03) [+] |
| CLEF2009 | 0.1683 (1.63E-03) | 0.3757 (2.12E-03) | 0.3116 (2.34E-03) | 0.1683 (1.63E-03) [=] |

Table 4. Classification error rates and variances of the combining classifier algorithms. Each cell gives the mean error rate followed by its variance in parentheses; [+] the method is better than VIG, [=] equal to VIG, [-] worse than VIG (checked by t-test); "x" indicates that the method could not be run on that file.

| File name | Best of 6 fixed rules | MLR | SCANN | Decision Template | VIG |
| Bupa | 0.2970 (4.89E-03) [+] | 0.3033 (4.70E-03) [=] | 0.3304 (4.29E-03) [-] | 0.3348 (7.10E-03) [-] | 0.3151 (3.73E-03) |
| Artificial | 0.2193 (2.05E-03) [+] | 0.2426 (2.20E-03) [=] | 0.2374 (2.12E-03) [=] | 0.2433 (1.60E-03) [=] | 0.2401 (2.33E-03) |
| Pima | 0.2365 (2.10E-03) [=] | 0.2432 (2.30E-03) [=] | 0.2384 (2.06E-03) [=] | 0.2482 (2.00E-03) [-] | 0.2340 (2.21E-03) |
| Sonar | 0.2079 (8.16E-03) [=] | 0.1974 (7.20E-03) [=] | 0.2128 (8.01E-03) [=] | 0.2129 (8.80E-03) [=] | 0.2025 (7.84E-03) |
| Heart | 0.1570 (4.64E-03) [=] | 0.1607 (4.70E-03) [=] | 0.1637 (4.14E-03) [=] | 0.1541 (4.00E-03) [=] | 0.1556 (4.01E-03) |
| Phoneme | 0.1407 (1.95E-04) [-] | 0.1136 (1.75E-04) [=] | 0.1229 (6.53E-04) [-] | 0.1462 (2.00E-04) [-] | 0.1164 (1.96E-04) |
| Titanic | 0.2167 (5.00E-04) [=] | 0.2169 (4.00E-04) [=] | 0.2216 (6.29E-04) [=] | 0.2167 (6.00E-04) [=] | 0.2169 (4.83E-04) |
| Balance | 0.1112 (4.82E-04) [=] | 0.1225 (8.00E-04) [-] | x | 0.0988 (1.40E-03) [+] | 0.1123 (6.19E-04) |
| Fertility | 0.1270 (1.97E-03) [=] | 0.1250 (2.28E-03) [=] | x | 0.4520 (3.41E-02) [-] | 0.1310 (2.34E-03) |
| Skin&NonSkin | 0.0006 (2.13E-08) [-] | 4.79E-04 (1.97E-08) [-] | x | 0.0332 (1.64E-06) [-] | 4.28E-04 (1.57E-08) |
| Wdbc | 0.0395 (5.03E-04) [=] | 0.0399 (7.00E-04) [=] | 0.0397 (5.64E-04) [=] | 0.0385 (5.00E-04) [=] | 0.0408 (5.75E-04) |
| Australian | 0.1262 (1.37E-03) [=] | 0.1268 (1.80E-03) [=] | 0.1259 (1.77E-03) [=] | 0.1346 (1.50E-03) [-] | 0.1225 (1.49E-03) |
| Twonorm | 0.0216 (2.82E-05) [=] | 0.0217 (2.24E-05) [=] | 0.0216 (2.39E-05) [=] | 0.0221 (2.62E-05) [=] | 0.0218 (2.49E-05) |
| Ring | 0.2122 (1.62E-04) [-] | 0.1700 (1.69E-04) [-] | 0.2150 (2.44E-04) [-] | 0.1894 (1.78E-04) [-] | 0.1168 (1.00E-04) |
| Tae | 0.4435 (1.70E-02) [=] | 0.4652 (1.24E-02) [-] | 0.4428 (1.34E-02) [=] | 0.4643 (1.21E-02) [-] | 0.4348 (1.71E-02) |
| Contraceptive | 0.4653 (1.79E-03) [=] | 0.4675 (1.10E-03) [=] | 0.4869 (1.80E-03) [-] | 0.4781 (1.40E-03) [-] | 0.4634 (1.32E-03) |
| Vehicle | 0.2645 (1.37E-03) [-] | 0.2139 (1.40E-03) [=] | 0.2224 (1.54E-03) [-] | 0.2161 (1.50E-03) [-] | 0.2069 (1.23E-03) |
| Iris | 0.0327 (1.73E-03) [=] | 0.022 (1.87E-03) [=] | 0.032 (2.00E-03) [=] | 0.0400 (2.50E-03) [=] | 0.0313 (2.00E-03) |
| CLEF2009 | 0.2023 (1.85E-03) [-] | 0.1633 (1.58E-03) [=] | 0.1895 (1.68E-03) [-] | 0.1893 (1.74E-03) [-] | 0.1693 (1.76E-03) |

Table 5. Statistical two-sample t-test results comparing the proposed method with the benchmark algorithms.

| | VIG vs. best base classifier | VIG vs. best fixed rule | VIG vs. MLR | VIG vs. Decision Template | VIG vs. SCANN |
| Better | 7 | 5 | 4 | 11 | 6 |
| Equal | 11 | 12 | 15 | 7 | 10 |
| Worse | 1 | 2 | 0 | 1 | 0 |

Compared with SCANN, our method is significantly better, obtaining 6 wins and 0 losses. Note that only 16 files could be compared here because SCANN could not be run on 3 files (Fertility, Balance and Skin&NonSkin): for these files the indicator matrix in SCANN has columns in which all posterior probabilities from the base classifiers are zero, so its column mass is singular and the standardized residuals are not available [18]. We ignored these cases in the comparison.

Finally, our method performed significantly better than the Decision Template method, posting 11 wins and only 1 loss. Similar to our approach, the Decision Template method groups the training observations by their labels in the Level1 data and builds a template for each class; in fact, the Decision Template of the m-th class is the average of the Level1 data of the training observations with label $y_m$ [17]. However, this template representation is not as powerful as our Variational Inference-based approach, so our method frequently outperforms this benchmark. The MLR method also represents each class, by building regression models from the Level1 data of the training observations and their class labels. It performs better than the Decision Template method in most cases thanks to the more powerful statistical representation obtained by linear regression. Nevertheless, our method is still better than MLR, obtaining 4 wins and 0 losses. The 4 winning cases are Balance (0.1123 vs. 0.1225), Skin&NonSkin (4.28E-04 vs. 4.79E-04), Ring (0.1168 vs. 0.17) and Tae (0.4348 vs. 0.4652). This is a significant outcome because MLR is a highly competitive trainable combining method on many datasets.

To compare training and classification time, we computed the average training and classification times over all 100 runs; the averaged values are reported in Figs. 2 and 3. Here we only compare the four trainable combining algorithms: fixed combining rules always need less training time than trainable algorithms because they do not exploit the label information in the Level1 data, so we do not include them in this comparison. In the training process, no method dominates. In 6 cases (Artificial, Contraceptive, Balance, Australian, Ring and CLEF2009), Decision Template is the fastest of the 4 methods; it is simpler than VIG and MLR, as it only averages the Level1 data of the training observations of each class. In MLR, we have to solve M linear regression models to find the weight each classifier puts on each class. In VIG, several iterations are needed to find the mean and covariance parameters of the multivariate Gaussian distribution of each class. Table 6 shows the average number of iterations over the 10 runs of the 10-fold cross-validation procedure; convergence is reached after about 3 or 4 iterations, so the training time of VIG is acceptable. It is even lower than that of the other methods on Heart, Titanic, Wdbc, Tae and Iris. In Appendix B, we show the decrease in the values of $\mathcal{L}_i - \mathcal{L}_{i-1}$ ($i \ge 2$) in one experiment on all nineteen datasets. Although Decision Template usually needs less training time than the other three methods, the differences in training time between Decision Template and VIG are small.

In classification time, MLR ranks first, followed by VIG, Decision Template, and SCANN. In MLR, the label of an observation is predicted simply by multiplying its Level1 data by the weights obtained in training. Our method computes a product between the Level1 data of the test observation and the precision matrix, so it is somewhat more time-consuming than MLR, although the difference is small. In SCANN, M representations corresponding to the M classes must be computed from the Level1 data of the test observation, and then the distance between each representation and each row of the selected principal coordinates-of-columns matrix is computed [18]. As a result, SCANN is usually the most time-consuming of the four trainable methods in the classification process.

Fig. 2. Average time of the training process (in seconds).

Fig. 3. Average time of the classification process (in seconds). (Top: the results for three datasets; bottom: the results for the others.)

Table 6. Average number of iterations of the proposed method.

| File | Iterations | File | Iterations |
| Bupa | 4 | Wdbc | 4 |
| Artificial | 4 | Australian | 4 |
| Pima | 4 | Twonorm | 4 |
| Sonar | 4 | Ring | 4 |
| Heart | 4 | Tae | 4 |
| Phoneme | 3.99 | Contraceptive | 4 |
| Titanic | 4 | Vehicle | 4 |
| Balance | 4 | Iris | 4 |
| Fertility | 4.5 | CLEF2009 | 4 |
| Skin&NonSkin | 3 | | |

4.3.2. Different numbers of learning algorithms

Two further learning algorithms, Quadratic Discriminant Analysis (QDA) and Decision Tree, were added to our multi-classifier system in order to assess the effect of using a different number of base classifiers. We denote the combining method with 3 learning algorithms by VIG-3 and the method with five learning algorithms by VIG-5. Table 7 shows the error rates and variances of Decision Tree, QDA, and VIG-5, and Table 8 shows the statistical two-sample t-test results comparing VIG-3 with VIG-5. VIG-3 is better than VIG-5 on 4 datasets (Phoneme, Titanic, Skin&NonSkin, and Australian), whereas VIG-5 outperforms VIG-3 on 6 datasets (Bupa, Artificial, Balance, Ring, Vehicle, and CLEF2009). There is therefore no clear overall advantage in using more base classifiers in this case. However, additional learning algorithms can improve classification performance significantly on some files. For example, the classification error rate on Ring obtained by the ensemble with 3 learning algorithms is 0.11, but it drops to about 0.02 when the two new learning algorithms are added. Looking at the outputs of the base classifiers, one of the newly added base classifiers has significantly better classification accuracy on this file, so the discrimination of the Level1 data can be enhanced by having the right base classifiers in the ensemble. This experiment shows that adding more learning algorithms can either improve or degrade the classification performance of a multi-classifier system; a learning mechanism that searches for an optimal subset of learning algorithms, as in [40], could be used to build a highly effective classification system.

Table 8. Statistical two-sample t-test results comparing VIG-3 with VIG-5.

| | VIG-3 vs. VIG-5 |
| Better | 4 |
| Equal | 7 |
| Worse | 6 |

Table 7. Classification error rates and variances of the additional learning algorithms and VIG-5. Each cell gives the mean error rate followed by its variance in parentheses; in the VIG-5 column, [-] means VIG-3 is better than VIG-5, [=] means VIG-3 and VIG-5 are equal, [+] means VIG-3 is worse than VIG-5 (checked by t-test); "x" indicates that the method could not be run on that file.

| File name | Decision Tree | QDA | VIG-5 |
| Bupa | 0.3514 (6.10E-03) | 0.3965 (7.48E-03) | 0.2970 (4.85E-03) [+] |
| Artificial | 0.2414 (2.20E-03) | 0.4021 (7.92E-03) | 0.2217 (2.05E-03) [+] |
| Pima | 0.2892 (1.80E-03) | 0.2628 (2.29E-03) | 0.2403 (2.11E-03) [=] |
| Sonar | 0.2866 (6.20E-03) | 0.2401 (8.15E-03) | 0.1861 (6.68E-03) [=] |
| Heart | 0.2381 (6.70E-03) | 0.1722 (5.94E-03) | 0.1644 (4.43E-03) [=] |
| Phoneme | 0.1298 (2.00E-04) | 0.2138 (2.35E-04) | 0.1276 (2.16E-04) [-] |
| Titanic | 0.2101 (3.00E-04) | 0.2267 (8.25E-04) | 0.2290 (4.95E-04) [-] |
| Balance | 0.2107 (2.10E-03) | 0.0860 (8.78E-04) | 0.0851 (1.09E-03) [+] |
| Fertility | 0.1730 (7.20E-03) | x | x |
| Skin&NonSkin | 0.0004 (1.55E-08) | 0.0165 (6.26E-07) | 0.0006 (2.33E-08) [-] |
| Wdbc | 0.0705 (1.10E-03) | 0.0433 (6.99E-04) | 0.0450 (6.92E-04) [=] |
| Australian | 0.1678 (2.13E-03) | 0.2078 (1.75E-03) | 0.1338 (1.08E-03) [-] |
| Twonorm | 0.0536 (4.22E-05) | 0.0224 (2.37E-05) | 0.0216 (2.82E-05) [=] |
| Ring | 0.2485 (1.32E-04) | 0.0210 (2.32E-05) | 0.0206 (3.10E-05) [+] |
| Tae | 0.4275 (1.06E-02) | x | x |
| Contraceptive | 0.5317 (1.28E-03) | 0.4895 (1.63E-03) | 0.4629 (1.34E-03) [=] |
| Vehicle | 0.2932 (2.13E-03) | 0.1453 (1.13E-03) | 0.1578 (1.23E-03) [+] |
| Iris | 0.0507 (2.40E-03) | 0.0260 (1.59E-03) | 0.0260 (1.68E-03) [=] |
| CLEF2009 | 0.3613 (2.00E-03) | 0.1719 (1.40E-03) | 0.1521 (1.30E-03) [+] |


Table 9. Classification error rates and variances of Bagging and AdaBoost. Each cell gives the mean error rate followed by its variance in parentheses; the bracketed markers indicate whether the benchmark is worse than [-], equal to [=], or better than [+] VIG-3 and VIG-5 (checked by t-test).

| File name | Bagging | AdaBoost |
| Bupa | 0.2741 (4.37E-03) [VIG-3: +, VIG-5: +] | 0.2587 (3.30E-03) [VIG-3: +, VIG-5: +] |
| Artificial | 0.2069 (2.30E-03) [VIG-3: +, VIG-5: +] | 0.2197 (1.90E-03) [VIG-3: +, VIG-5: =] |
| Pima | 0.2357 (2.04E-03) [VIG-3: =, VIG-5: =] | 0.2444 (1.97E-03) [VIG-3: =, VIG-5: =] |
| Sonar | 0.1545 (6.39E-03) [VIG-3: +, VIG-5: +] | 0.1413 (5.05E-03) [VIG-3: +, VIG-5: +] |
| Heart | 0.1700 (4.80E-03) [VIG-3: =, VIG-5: =] | 0.1896 (4.67E-03) [VIG-3: -, VIG-5: -] |
| Phoneme | 0.0878 (1.08E-04) [VIG-3: +, VIG-5: +] | 0.1920 (2.27E-04) [VIG-3: -, VIG-5: -] |
| Titanic | 0.2160 (4.62E-04) [VIG-3: =, VIG-5: +] | 0.2217 (4.31E-04) [VIG-3: =, VIG-5: +] |
| Balance | 0.1570 (1.10E-03) [VIG-3: -, VIG-5: -] | 0.1334 (6.31E-04) [VIG-3: -, VIG-5: -] |
| Fertility | 0.1260 (4.92E-03) [VIG-3: =] | 0.1600 (9.00E-03) [VIG-3: -] |
| Skin&NonSkin | 0.0004 (1.40E-08) [VIG-3: =, VIG-5: +] | 0.0428 (1.65E-06) [VIG-3: -, VIG-5: -] |
| Wdbc | 0.0362 (6.21E-04) [VIG-3: =, VIG-5: +] | 0.0330 (4.76E-04) [VIG-3: +, VIG-5: +] |
| Australian | 0.1351 (1.68E-03) [VIG-3: -, VIG-5: =] | 0.1425 (1.53E-03) [VIG-3: -, VIG-5: -] |
| Twonorm | 0.0273 (3.20E-05) [VIG-3: -, VIG-5: -] | 0.0310 (3.76E-05) [VIG-3: -, VIG-5: -] |
| Ring | 0.0500 (6.47E-05) [VIG-3: +, VIG-5: -] | 0.0456 (6.34E-05) [VIG-3: +] |
| Tae | 0.3353 (1.58E-02) [VIG-3: +] | 0.5145 (1.80E-02) [VIG-3: -, VIG-5: -] |
| Contraceptive | 0.4627 (1.55E-03) [VIG-3: =, VIG-5: =] | 0.4996 (8.99E-04) [VIG-3: -, VIG-5: -] |
| Vehicle | 0.2499 (1.58E-03) [VIG-3: -, VIG-5: -] | 0.4451 (2.87E-03) [VIG-3: -, VIG-5: -] |
| Iris | 0.0493 (2.63E-03) [VIG-3: -, VIG-5: -] | 0.0540 (2.82E-03) [VIG-3: -, VIG-5: -] |
| CLEF2009 | 0.1897 (1.64E-03) [VIG-3: -, VIG-5: -] | 0.5565 (1.91E-03) [VIG-3: -, VIG-5: -] |

4.3.3. Comparison with other state-of-the-art ensemble methods

We also compared the performance of our proposed method with two well-known ensemble methods, Bagging and AdaBoost, both implemented in Matlab 2014a with the Decision Tree learning algorithm. Following [41-43], the number of iterations in AdaBoost and the number of learners in Bagging were both set to 200. Table 9 shows the experimental results of the two ensemble methods on the nineteen datasets, and Table 10 shows the statistical two-sample t-test results comparing VIG-3 and VIG-5 with Bagging and AdaBoost.

Both VIG-based approaches are competitive with Bagging, obtaining six wins and six losses for VIG-3 and six wins and seven losses for VIG-5. It should be noted that Bagging is one of the best-performing ensemble methods in the literature. In comparison with AdaBoost, VIG-3 achieves better results on 12 and worse results on 5 datasets, while VIG-5 has 11 wins and 4 losses. This is a significant outcome because our approach is not only better in accuracy but also considerably less time-consuming than AdaBoost with 200 iterations. In our experiments, Bagging and AdaBoost with 200 learners are much more time-consuming than our approach in both training and testing. Specifically, Bagging is on average 8.1 and 7.85 times slower than VIG-3 for training and testing, respectively, and 2.93 and 4.61 times slower than VIG-5; for AdaBoost the corresponding factors are 3.81 and 6 compared with VIG-3, and 1.35 and 3.45 compared with VIG-5. The computation times for each dataset are shown in Figs. 4 and 5.

Table 10. Statistical two-sample t-test results comparing VIG-3 and VIG-5 with Bagging and AdaBoost.

| | VIG-3 vs. Bagging | VIG-5 vs. Bagging | VIG-3 vs. AdaBoost | VIG-5 vs. AdaBoost |
| Better | 6 | 6 | 12 | 11 |
| Equal | 7 | 4 | 2 | 2 |
| Worse | 6 | 7 | 5 | 4 |

5. Conclusion

We have introduced a novel combining classifier method, denoted VIG, in which the class distributions of the Level1 data are exploited to form the decision-making model. Our approach groups the Level1 data of the training observations by their class labels, estimates the distribution of each class, modeled as a multivariate Gaussian, using VI, and then performs classification by maximizing the posterior probability under the Bayesian decision model. Experimental results on eighteen UCI data files and a 10-class CLEF2009 medical image database demonstrate the benefit of our approach compared with several well-known combining classifier methods. Specifically, the proposed method is better than the individual base classifiers, ensembles with fixed combining rules, MLR, Decision Template, AdaBoost and SCANN, and is highly competitive with Bagging. It is also less time-consuming than the other trainable methods and the two well-known ensemble methods we tested. Moreover, VIG is more widely applicable than SCANN, since singularity is not a problem for VIG. In this work we have also shown that a set of good base classifiers can boost the discrimination ability of the Level1 data and improve the overall classification accuracy of the system. In the future, the proposed method could be combined with feature and classifier selection approaches [40,44,45] to further improve its classification performance.

Conflict of Interest Statement

We wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome.


Fig. 4. Average time of training process (in seconds) of two VIG methods, AdaBoost and Bagging.

Fig. 5. Average time of testing process (in seconds) of two VIG methods, AdaBoost and Bagging.

Acknowledgments Tien Thanh Nguyen acknowledges the support of a Griffith University International Postgraduate Research Scholarship (GUIPRS).

Appendix A

In Eq. (3), the Kullback–Leibler (KL) divergence is given by

$$\mathrm{KL}(q \,\|\, p) = \mathbb{E}_q\!\left[\ln \frac{q(Z)}{p(Z \mid X)}\right] = -\int q(Z) \ln \frac{p(Z \mid X)}{q(Z)}\, dZ$$

Using $p(Z \mid X) = p(Z, X)/p(X)$, we can rewrite (3) as

$$\mathrm{KL}(q \,\|\, p) = -\int q(Z) \ln \frac{p(Z, X)}{p(X)\, q(Z)}\, dZ = \ln p(X) \int q(Z)\, dZ - \int q(Z) \ln \frac{p(Z, X)}{q(Z)}\, dZ$$

Since $\int q(Z)\, dZ = 1$, we obtain $\mathrm{KL}(q \,\|\, p) = \ln p(X) - \mathcal{L}(q)$, where $\mathcal{L}(q) = \int q(Z) \ln \frac{p(X, Z)}{q(Z)}\, dZ$.

Lemma 1. The Level1 data generated using fuzzy labels is not of full column rank.

Proof. The Level1 data generated using fuzzy labels is the $N \times MK$ matrix (1), in which $\sum_{m=1}^{M} P_k(y_m \mid x_n) = 1$ for each $k = 1, \dots, K$ and $n = 1, \dots, N$. Because of this property, there is a non-trivial linear combination of the columns equal to zero (the coefficient is $K - 1$, so that each row sums to $(K-1) - (K-1) = 0$):

$$(K - 1) \sum_{m=1}^{M} \begin{pmatrix} P_i(y_m \mid x_1) \\ \vdots \\ P_i(y_m \mid x_N) \end{pmatrix} - \sum_{\substack{j=1 \\ j \ne i}}^{K} \sum_{m=1}^{M} \begin{pmatrix} P_j(y_m \mid x_1) \\ \vdots \\ P_j(y_m \mid x_N) \end{pmatrix} = 0 \qquad (A1)$$

Therefore the columns are not linearly independent, so the rank of L is smaller than the number of columns, i.e. L is not of full column rank. □

Corollary 1. The covariance matrix of the Level1 data is singular.

Proof. Let $\bar{L}$ denote the $N \times MK$ matrix whose every row contains the column means of L:

$$\bar{L} := \begin{bmatrix}
\frac{1}{N}\sum_{n=1}^{N} P_1(y_1 \mid x_n) & \frac{1}{N}\sum_{n=1}^{N} P_1(y_2 \mid x_n) & \cdots & \frac{1}{N}\sum_{n=1}^{N} P_K(y_M \mid x_n) \\
\vdots & \vdots & \ddots & \vdots \\
\frac{1}{N}\sum_{n=1}^{N} P_1(y_1 \mid x_n) & \frac{1}{N}\sum_{n=1}^{N} P_1(y_2 \mid x_n) & \cdots & \frac{1}{N}\sum_{n=1}^{N} P_K(y_M \mid x_n)
\end{bmatrix} \qquad (A2)$$

Then the $MK \times MK$ covariance matrix of L, denoted by $\Sigma$, is $\Sigma = (L - \bar{L})^{T}(L - \bar{L})$. By the properties of the rank operator, $\mathrm{rank}(\Sigma) = \mathrm{rank}\big((L - \bar{L})^{T}(L - \bar{L})\big) = \mathrm{rank}(L - \bar{L})$. The dependence (A1) also holds for $\bar{L}$, so by Lemma 1 $\mathrm{rank}(L - \bar{L}) < MK$. As a result, the covariance matrix $\Sigma$ is singular. □

Lemma 2. The optimum solution $q^{*}(\mu)$ of update equation (9) is a Gaussian $q^{*}(\mu) = N(\mu \mid m, H^{-1})$ with mean and precision given by (11) and (12).


  Proof. The optimal solution for q μ is given by     ln qn μ ¼ EΛ ln p X; μ; Λ þconst     ¼ EΛ ln p Xj μ; Λ þ ln p μ j Λ þconst T   1 h ¼  EΛ μ  m 0 β 0 Λ μ  m 0 2 XN  T  i þ const þ x  μ Λ x  μ n n n¼1



h



μ  m0 μ  m0

T i

¼ ðm  m 0 Þðm  m 0 ÞT  þ H  1 ; μ  N μ j m; H  1 

Using N X

N n X

xn xn T ¼

n¼1

N X

¼

ðxn xÞðxn  xÞT þ

n¼1

" # N  X    1 ln qn μ ¼  EΛ μT β0 Λμ  2μT β 0 Λm0 þ μT Λμ  2μT Λxn þ const 2 n¼1

! N X  1 T T ¼  μ β0 þN E Λ μ þ μ E Λ β0 m0 þ xn þconst 2 n¼1

ðxn  xÞðxn xÞT þ xn xT þ xxTn  xxT

N X   H ¼ β0 þ N E Λ ; Hm ¼ E Λ β 0 m0 þ xn

!

  ¼ E Λ β0 m0 þNx

N X

and we can easily infer (11) and (12) □.   n Lemma 3. Optimum   solution  q Λ of update equation (10) is a n Wishart q Λ ¼ W Λ j W; υ with the number of degrees of freedom υ and the scale matrix W given by (13) and (14).   Proof. The optimal solution for q Λ is given by  



  N þ1 þ υ0  D  1   lnΛ ln qn Λ ¼ 2   n h T   1  Tr W0 1 Λ þ Eμ μ  m0 β0 Λ μ  m0 2 # N  X T   xn  μ Λ xn  μ g þ const þ

xm ¼



μ  m0 μ  m0

xn  μ xn  μ

n¼1



Eμ μ ¼ m; Eμ μ

T

β0 ðx m0 Þ β0 þ N 

ðm m0 Þðm  m0 ÞT ¼  ðx  mÞðx  mÞT ¼

ðA14Þ

N β0 þ N

β0

2 ðA15Þ

J

2 ðA16Þ

J

β0 þ N

we obtain h



μ  m0 μ  m0

T i

 ¼

N β0 þN

2

J þH  1

# N  X  T  β2 N xn  μ xn  μ ¼ S þ  0 2 J þNH  1 β0 þ N n¼1

ðA17Þ

ðA18Þ

ðA19Þ

Lemma 4. The variational lower bound $\mathcal{L}(q)$ is computed as in (15).

Proof. Based on the expression of the lower bound we have

$\mathcal{L} = \iint q(\boldsymbol{\mu},\boldsymbol{\Lambda})\,\ln\!\left\{\dfrac{p(\mathbf{X},\boldsymbol{\mu},\boldsymbol{\Lambda})}{q(\boldsymbol{\mu},\boldsymbol{\Lambda})}\right\}\mathrm{d}\boldsymbol{\mu}\,\mathrm{d}\boldsymbol{\Lambda} = \mathbb{E}\big[\ln p(\mathbf{X},\boldsymbol{\mu},\boldsymbol{\Lambda})\big] - \mathbb{E}\big[\ln q(\boldsymbol{\mu},\boldsymbol{\Lambda})\big] = \mathbb{E}\big[\ln p(\mathbf{X}\mid\boldsymbol{\mu},\boldsymbol{\Lambda})\big] + \mathbb{E}\big[\ln p(\boldsymbol{\mu}\mid\boldsymbol{\Lambda})\big] + \mathbb{E}\big[\ln p(\boldsymbol{\Lambda})\big] - \mathbb{E}\big[\ln q(\boldsymbol{\mu})\big] - \mathbb{E}\big[\ln q(\boldsymbol{\Lambda})\big]$   (A20)

in which the first component is computed by

$\mathbb{E}\big[\ln p(\mathbf{X}\mid\boldsymbol{\mu},\boldsymbol{\Lambda})\big] = \dfrac{N}{2}\mathbb{E}\big[\ln|\boldsymbol{\Lambda}|\big] - \dfrac{ND}{2}\ln(2\pi) - \dfrac{1}{2}\mathbb{E}\Big[\sum_{n=1}^{N}(\mathbf{x}_{n}-\boldsymbol{\mu})^{T}\boldsymbol{\Lambda}(\mathbf{x}_{n}-\boldsymbol{\mu})\Big]$

$\mathbb{E}\Big[\sum_{n=1}^{N}(\mathbf{x}_{n}-\boldsymbol{\mu})^{T}\boldsymbol{\Lambda}(\mathbf{x}_{n}-\boldsymbol{\mu})\Big] = \mathbb{E}_{\boldsymbol{\Lambda}}\Big[\mathbb{E}_{\boldsymbol{\mu}}\Big[\operatorname{Tr}\Big(\boldsymbol{\Lambda}\sum_{n=1}^{N}(\mathbf{x}_{n}-\boldsymbol{\mu})(\mathbf{x}_{n}-\boldsymbol{\mu})^{T}\Big)\Big]\Big] = \operatorname{Tr}\Big(\mathbb{E}[\boldsymbol{\Lambda}]\,\mathbb{E}_{\boldsymbol{\mu}}\Big[\sum_{n=1}^{N}(\mathbf{x}_{n}-\boldsymbol{\mu})(\mathbf{x}_{n}-\boldsymbol{\mu})^{T}\Big]\Big)$   (A21)

$= \operatorname{Tr}\Big(\mathbb{E}[\boldsymbol{\Lambda}]\Big(\mathbf{S} + \dfrac{\beta_{0}^{2}N}{(\beta_{0}+N)^{2}}\mathbf{J} + N\mathbf{H}^{-1}\Big)\Big) = \upsilon\operatorname{Tr}(\mathbf{S}\mathbf{W}) + \dfrac{\upsilon\beta_{0}^{2}N}{(\beta_{0}+N)^{2}}\operatorname{Tr}(\mathbf{J}\mathbf{W}) + \dfrac{ND}{\beta_{0}+N}$   (A22)

where we have used $\boldsymbol{\Lambda} \sim \mathcal{W}(\boldsymbol{\Lambda}\mid\mathbf{W},\upsilon)$, $\mathbb{E}[\boldsymbol{\Lambda}] = \upsilon\mathbf{W}$ and $\mathbf{H}^{-1} = \mathbb{E}[\boldsymbol{\Lambda}]^{-1}/(\beta_{0}+N)$. Substituting (A22) into (A21), we get

$\mathbb{E}\big[\ln p(\mathbf{X}\mid\boldsymbol{\mu},\boldsymbol{\Lambda})\big] = \dfrac{1}{2}\Big\{N\,\mathbb{E}\big[\ln|\boldsymbol{\Lambda}|\big] - ND\ln(2\pi) - \upsilon\operatorname{Tr}(\mathbf{S}\mathbf{W}) - \dfrac{ND}{\beta_{0}+N} - \dfrac{\upsilon\beta_{0}^{2}N}{(\beta_{0}+N)^{2}}\operatorname{Tr}(\mathbf{J}\mathbf{W})\Big\}$   (A23)

Referring to (A17), we get the second component in the lower bound's expression:

$\mathbb{E}\big[\ln p(\boldsymbol{\mu}\mid\boldsymbol{\Lambda})\big] = \mathbb{E}\Big[-\dfrac{D}{2}\ln(2\pi) + \dfrac{1}{2}\ln|\beta_{0}\boldsymbol{\Lambda}| - \dfrac{1}{2}(\boldsymbol{\mu}-\mathbf{m}_{0})^{T}\beta_{0}\boldsymbol{\Lambda}(\boldsymbol{\mu}-\mathbf{m}_{0})\Big] = \dfrac{1}{2}\Big\{D\ln\beta_{0} + \mathbb{E}\big[\ln|\boldsymbol{\Lambda}|\big] - D\ln(2\pi) - \beta_{0}\operatorname{Tr}\big(\mathbb{E}\big[(\boldsymbol{\mu}-\mathbf{m}_{0})(\boldsymbol{\mu}-\mathbf{m}_{0})^{T}\big]\,\mathbb{E}[\boldsymbol{\Lambda}]\big)\Big\} = \dfrac{1}{2}\Big\{D\ln\beta_{0} + \mathbb{E}\big[\ln|\boldsymbol{\Lambda}|\big] - D\ln(2\pi) - \dfrac{D\beta_{0}}{\beta_{0}+N} - \dfrac{\upsilon\beta_{0}N^{2}}{(\beta_{0}+N)^{2}}\operatorname{Tr}(\mathbf{J}\mathbf{W})\Big\}$   (A24)

In the same way, we obtain

$\mathbb{E}\big[\ln p(\boldsymbol{\Lambda})\big] = \ln B(\mathbf{W}_{0},\upsilon_{0}) + \dfrac{\upsilon_{0}-D-1}{2}\mathbb{E}\big[\ln|\boldsymbol{\Lambda}|\big] - \dfrac{1}{2}\mathbb{E}\big[\operatorname{Tr}\big(\mathbf{W}_{0}^{-1}\boldsymbol{\Lambda}\big)\big] = \ln B(\mathbf{W}_{0},\upsilon_{0}) + \dfrac{\upsilon_{0}-D-1}{2}\mathbb{E}\big[\ln|\boldsymbol{\Lambda}|\big] - \dfrac{\upsilon}{2}\operatorname{Tr}\big(\mathbf{W}_{0}^{-1}\mathbf{W}\big)$   (A25)

$\mathbb{E}\big[\ln q(\boldsymbol{\mu})\big] = \mathbb{E}\big[\ln \mathcal{N}(\boldsymbol{\mu}\mid\mathbf{m},\mathbf{H}^{-1})\big] = \dfrac{1}{2}\Big\{\ln|\mathbf{H}| - D\ln(2\pi) - \operatorname{Tr}\big(\mathbf{H}\,\mathbb{E}\big[(\boldsymbol{\mu}-\mathbf{m})(\boldsymbol{\mu}-\mathbf{m})^{T}\big]\big)\Big\} = \dfrac{1}{2}\Big\{\ln|\mathbf{H}| - D\ln(2\pi) - \operatorname{Tr}\big(\mathbf{H}\mathbf{H}^{-1}\big)\Big\} = \dfrac{1}{2}\big\{\ln|\mathbf{H}| - D\ln(2\pi) - D\big\}$   (A26)

$\mathbb{E}\big[\ln q(\boldsymbol{\Lambda})\big] = \mathbb{E}\big[\ln \mathcal{W}(\boldsymbol{\Lambda}\mid\mathbf{W},\upsilon)\big] = \ln B(\mathbf{W},\upsilon) + \dfrac{\upsilon-D-1}{2}\mathbb{E}\big[\ln|\boldsymbol{\Lambda}|\big] - \dfrac{1}{2}\operatorname{Tr}\big(\mathbf{W}^{-1}\mathbb{E}[\boldsymbol{\Lambda}]\big) = \ln B(\mathbf{W},\upsilon) + \dfrac{\upsilon-D-1}{2}\mathbb{E}\big[\ln|\boldsymbol{\Lambda}|\big] - \dfrac{\upsilon D}{2}$   (A27)

where

$\ln B(\mathbf{W}_{0},\upsilon_{0}) = -\dfrac{\upsilon_{0}}{2}\ln|\mathbf{W}_{0}| - \dfrac{\upsilon_{0}D}{2}\ln 2 - \dfrac{D(D-1)}{4}\ln\pi - \sum_{i=1}^{D}\ln\Gamma\Big(\dfrac{\upsilon_{0}+1-i}{2}\Big)$   (A28)

$\ln B(\mathbf{W},\upsilon) = -\dfrac{\upsilon}{2}\ln|\mathbf{W}| - \dfrac{\upsilon D}{2}\ln 2 - \dfrac{D(D-1)}{4}\ln\pi - \sum_{i=1}^{D}\ln\Gamma\Big(\dfrac{\upsilon+1-i}{2}\Big)$   (A29)

Using (A23)–(A29) we have the lower bound given by (15). □
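The bound can be assembled directly from (A23) to (A29). The sketch below does this in Python; it assumes the standard Wishart identity E[ln|Λ|] = Σ_{i=1}^{D} ψ((υ+1-i)/2) + D ln 2 + ln|W| (with ψ the digamma function), which is not reproduced in this appendix but is presumably what equation (15) in the main text relies on, and the function names are illustrative. Monitoring the differences ℒ_i - ℒ_{i-1} between successive iterations, as in Appendix B, gives a simple convergence check.

import numpy as np
from scipy.special import gammaln, digamma

def log_B(W, nu):
    """Log of the Wishart normalisation constant B(W, nu), as in (A28)/(A29)."""
    D = W.shape[0]
    return (-0.5 * nu * np.linalg.slogdet(W)[1]
            - 0.5 * nu * D * np.log(2.0)
            - 0.25 * D * (D - 1) * np.log(np.pi)
            - gammaln(0.5 * (nu + 1 - np.arange(1, D + 1))).sum())

def lower_bound(X, m, H, W, nu, beta0, m0, nu0, W0):
    """Variational lower bound assembled from (A23)-(A29)."""
    N, D = X.shape
    xbar = X.mean(axis=0)
    S = (X - xbar).T @ (X - xbar)
    J = np.outer(xbar - m0, xbar - m0)
    E_logdet = (digamma(0.5 * (nu + 1 - np.arange(1, D + 1))).sum()
                + D * np.log(2.0) + np.linalg.slogdet(W)[1])     # E[ln|Lambda|]
    tr_SW, tr_JW = np.trace(S @ W), np.trace(J @ W)

    E_lpX = 0.5 * (N * E_logdet - N * D * np.log(2 * np.pi) - nu * tr_SW
                   - N * D / (beta0 + N)
                   - nu * beta0**2 * N / (beta0 + N)**2 * tr_JW)                  # (A23)
    E_lpmu = 0.5 * (D * np.log(beta0) + E_logdet - D * np.log(2 * np.pi)
                    - D * beta0 / (beta0 + N)
                    - nu * beta0 * N**2 / (beta0 + N)**2 * tr_JW)                 # (A24)
    E_lpLam = (log_B(W0, nu0) + 0.5 * (nu0 - D - 1) * E_logdet
               - 0.5 * nu * np.trace(np.linalg.inv(W0) @ W))                      # (A25)
    E_lqmu = 0.5 * (np.linalg.slogdet(H)[1] - D * np.log(2 * np.pi) - D)          # (A26)
    E_lqLam = log_B(W, nu) + 0.5 * (nu - D - 1) * E_logdet - 0.5 * nu * D         # (A27)
    return E_lpX + E_lpmu + E_lpLam - E_lqmu - E_lqLam                            # (A20)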

Appendix B

The values of $\mathcal{L}_{i}-\mathcal{L}_{i-1}$ $(i \geq 2)$ in one experiment on all nineteen datasets with VIG-3 are shown in Fig. A1.

Fig. A1. Values of $\mathcal{L}_{i}-\mathcal{L}_{i-1}$ $(i \geq 2)$ in one experiment on all nineteen datasets with VIG-3.

References

[1] L. Rokach, Taxonomy for characterizing ensemble methods in classification tasks: a review and annotated bibliography, Comput. Stat. Data Anal. 53 (2009) 4046–4072.
[2] M. Woźniak, M. Graña, E. Corchado, A survey of multiple classifier systems as hybrid systems, Inf. Fusion 16 (2014) 3–17.
[3] Z.-H. Zhou, Ensemble Methods: Foundations and Algorithms, CRC Press, 2012.
[4] R.P.W. Duin, The combining classifier: to train or not to train?, in: Proceedings of International Conference on Pattern Recognition, 2002, pp. 765–770.
[5] L.I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms, Wiley, 2004.
[6] A.C. Sharkey, Types of multinet system, in: F. Roli, J. Kittler (Eds.), Multiple Classifier Systems, Springer, Berlin Heidelberg, 2002, pp. 108–117.
[7] H.T. Kam, J.J. Hull, S.N. Srihari, Decision combination in multiple classifier systems, IEEE Trans. Pattern Anal. Mach. Intell. 16 (1994) 66–75.
[8] C.-X. Zhang, R.P.W. Duin, An experimental study of one- and two-level classifier fusion for different sample sizes, Pattern Recognit. Lett. 32 (2011) 1756–1767.
[9] Y. Freund, R.E. Schapire, Experiments with a new boosting algorithm, in: Proceedings of International Conference on Machine Learning, 1996, pp. 148–156.
[10] L. Breiman, Bagging predictors, Mach. Learn. 24 (1996) 123–140.
[11] L. Breiman, Random forests, Mach. Learn. 45 (2001) 5–32.
[12] S. Džeroski, B. Ženko, Is combining classifiers with stacking better than selecting the best one? Mach. Learn. 54 (2004) 255–273.
[13] K.M. Ting, I.H. Witten, Issues in stacked generalization, J. Artif. Intell. Res. 10 (1999) 271–289.
[14] L. Todorovski, S. Džeroski, Combining classifiers with meta decision trees, Mach. Learn. 50 (2003) 223–249.
[15] J. Kittler, M. Hatef, R.P.W. Duin, J. Matas, On combining classifiers, IEEE Trans. Pattern Anal. Mach. Intell. 20 (1998) 226–239.
[16] L.I. Kuncheva, A theoretical study on six classifier fusion strategies, IEEE Trans. Pattern Anal. Mach. Intell. 24 (2002) 281–286.
[17] L.I. Kuncheva, J.C. Bezdek, R.P.W. Duin, Decision templates for multiple classifier fusion: an experimental comparison, Pattern Recognit. 34 (2001) 299–314.
[18] C. Merz, Using correspondence analysis to combine classifiers, Mach. Learn. 36 (1999) 33–58.
[19] D.H. Wolpert, Stacked generalization, Neural Networks 5 (1992) 241–259.
[20] L. Zhang, W.-D. Zhou, Sparse ensembles using weighted combination methods based on linear programming, Pattern Recognit. 44 (2011) 97–106.
[21] M.U. Şen, H. Erdoǧan, Linear classifier combination and selection using group sparse regularization and hinge loss, Pattern Recognit. Lett. 34 (2013) 265–274.
[22] T.T. Nguyen, A.W.-C. Liew, C. To, X.C. Pham, M.P. Nguyen, Fuzzy If-Then rules classifier on ensemble data, in: X. Wang, W. Pedrycz, P. Chan, Q. He (Eds.), Machine Learning and Cybernetics, Springer, 2014, pp. 362–370.
[23] 〈http://archive.ics.uci.edu/ml/datasets.html〉.
[24] 〈http://ganymed.imib.rwth-aachen.de/irma/datasets_en.php〉.
[25] C.M. Bishop, Pattern Recognition and Machine Learning, Springer-Verlag, New York, 2006.
[26] N. Nasios, A.G. Bors, Variational learning for Gaussian mixture models, IEEE Trans. Syst. Man Cybern. Part B Cybern. 36 (2006) 849–862.
[27] S. Boyd, L. Vandenberghe, Convex Optimization, Cambridge University Press, 2004.
[28] D.M. Blei, M.I. Jordan, Variational methods for the Dirichlet process, in: Proceedings of ACM International Conference on Machine Learning, 2004.


[29] D.M. Blei, M.I. Jordan, Variational Inference for Dirichlet process mixtures, Bayesian Anal. 1 (2006) 121–143.
[30] N. Balakrishnan, V.B. Nevzorov, A Primer on Statistical Distributions, Wiley & Sons Press, 2003.
[31] T. Minka, Bayesian inference, entropy, and the multinomial distribution, in: Technical Report, Microsoft Research, 2003.
[32] A. Agresti, D.B. Hitchcock, Bayesian inference for categorical data analysis, Stat. Methods Appl. 14 (3) (2005) 297–330.
[33] C. Désir, S. Bernard, C. Petitjean, L. Heutte, One class random forests, Pattern Recognit. 46 (12) (2013) 3490–3506.
[34] A. Gelman, J.B. Carlin, H.S. Stern, D.B. Rubin, Bayesian Data Analysis, second ed., CRC Press, 2004.
[35] D.J. Spiegelhalter, A. Dawid, S. Lauritzen, R. Cowell, Bayesian analysis in expert systems (with discussion), Stat. Sci. 8 (1993) 219–283.
[36] S.L. Lauritzen, B. Thiesson, D.J. Spiegelhalter, Diagnostic systems created by model selection methods: a case study, in: P. Cheeseman, W. Oldford (Eds.), Uncertainty in Artificial Intelligence, 1994, vol. 4, pp. 143–152.
[37] A. Gelman, Prior distribution, Encycl. Environ. 3 (2002) 1634–1637.
[38] J.A. Hoeting, D. Madigan, A.E. Raftery, C.T. Volinsky, Bayesian model averaging: a tutorial, Stat. Sci. 14 (4) (1999) 382–416.
[39] T. Ojala, M. Pietikäinen, D. Harwood, Performance evaluation of texture measures with classification based on Kullback discrimination of distributions, in: Proceedings of the 12th International Conference on Pattern Recognition, 1994, vol. 1, pp. 582–585.
[40] T.T. Nguyen, A.W.-C. Liew, M.T. Tran, X.C. Pham, M.P. Nguyen, A novel genetic algorithm approach for simultaneous feature and classifier selection in multi classifier system, in: IEEE Congress on Evolutionary Computation (CEC), 2014, pp. 1698–1705.
[41] C.D. Sutton, Classification and regression trees, Bagging, and boosting, in: C.R. Rao, E.J. Wegman, J.L. Solka (Eds.), Handbook of Statistics, Elsevier, 2005, pp. 303–329.
[42] P. Viola, M. Jones, Robust real-time face detection, Int. J. Comput. Vision 57 (2002) 137–154.
[43] T. Dietterich, An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization, Mach. Learn. 40 (2000) 139–157.
[44] T.T. Nguyen, A.W.-C. Liew, X.C. Pham, M.P. Nguyen, A novel 2-stage combining classifier model with stacking and genetic algorithm based feature selection, in: D.-S. Huang, K.-H. Jo, L. Wang (Eds.), Intelligent Computing Methodologies, Springer International Publishing, 2014, pp. 33–43.
[45] T.T. Nguyen, A.W.-C. Liew, M.T. Tran, M.P. Nguyen, Combining multi classifiers based on a genetic algorithm – a Gaussian mixture model framework, in: D.-S. Huang, K.-H. Jo, L. Wang (Eds.), Intelligent Computing Methodologies, Springer International Publishing, 2014, pp. 56–67.

Tien Thanh Nguyen is currently a Ph.D. student at the School of Information & Communication Technology, Griffith University, Australia. His research interests are in the fields of machine learning, pattern recognition, image processing, and evolutionary computation. He has been a member of the IEEE since 2014.

Thi Thu Thuy Nguyen is currently a lecturer at the Faculty of Economic Information System, College of Economics, Hue University, Vietnam. She graduated from the Faculty of Mathematics, Voronezh State University, Russia, in 2008. Her research interests include machine learning, pattern recognition, and image processing.

Xuan Cuong Pham graduated from the School of Information and Communication Technology, Hanoi University of Science and Technology, Vietnam, in 2014 and worked for a short time in the R&D department of Samsung Co Ltd, Vietnam. His research interests include machine learning, image processing, information retrieval and computer vision.

Alan Wee-Chung Liew is currently an Associate Professor with the School of Information & Communication Technology, Griffith University, Australia. His research interests are in the fields of medical imaging, bioinformatics, computer vision, pattern recognition, and machine learning. He serves on the technical program committee of many international conferences and is on the editorial board of several journals, including the IEEE Transactions on Fuzzy Systems. He has been a senior member of the IEEE since 2005.