Pattern Recognition 49 (2016) 198–212
A novel combining classifier method based on Variational Inference

Tien Thanh Nguyen a, Thi Thu Thuy Nguyen b, Xuan Cuong Pham c, Alan Wee-Chung Liew a,*

a School of Information and Communication Technology, Griffith University, Gold Coast Campus, QLD 4222, Australia
b Hue College of Economics, Hue University, No. 99, Ho Dac Di Street, An Cuu Ward, Hue City, Vietnam
c Hanoi University of Science and Technology, No. 1, Dai Co Viet Street, Hai Ba Trung District, Hanoi, Vietnam

* Corresponding author. Tel.: +61 7 55528671; fax: +61 7 55528066. E-mail address: a.liew@griffith.edu.au (A.-C. Liew).
http://dx.doi.org/10.1016/j.patcog.2015.06.016
Article info

Article history: Received 14 January 2015; Received in revised form 16 May 2015; Accepted 25 June 2015; Available online 11 July 2015

Abstract
In this paper, we propose a combining classifier method based on the Bayesian inference framework. Specifically, the outputs of the base classifiers (called Level1 data or meta-data) are utilized in a combiner to produce the final classification. In our ensemble system, each class in the training set induces a distribution on the Level1 data, which is modeled by a multivariate Gaussian distribution. Traditionally, the parameters of the Gaussian are estimated using a maximum likelihood approach. However, maximum likelihood estimation cannot be applied since the covariance matrix of the Level1 data of each class is not full rank. Instead, we propose to estimate the multivariate Gaussian distribution of the Level1 data of each class by using the Variational Inference method. Experiments conducted on eighteen UCI Machine Learning Repository datasets and a selected 10-class CLEF2009 medical imaging database demonstrated the advantage of our method compared with several well-known ensemble methods.
© 2015 Elsevier Ltd. All rights reserved.
Keywords: Ensemble method; Multi classifier system; Mixture of experts; Classifier fusion; Combining classifier algorithm; Variational Inference; Multivariate Gaussian distribution
1. Introduction

In recent years, the ensemble method has been studied extensively and is an active research area in the machine learning community [1]. In general, it is difficult to know a priori which learning algorithm is suitable for a particular dataset, and using an ensemble approach can achieve better accuracy than any single learning algorithm. According to statistics from the Web of Knowledge, there were more than 600 publications with the keyword "classifier ensemble" in the two years 2011 and 2012 [2]. In addition to various applications in computer-aided medical diagnosis, computer vision, software engineering, and information retrieval, ensemble-based algorithms have also won many competitions such as the Netflix Prize (http://www.netflixprize.com) and the KDD-Cup (http://www.sigkdd.org/kddcup) [3]. All this demonstrates the significant interest in ensemble methods in both theoretical and applied studies.

Over the past 30 years of development, various approaches related to ensemble methods have been proposed [1,3]. Hence, there are many taxonomies of ensemble methods, focusing on different views of the ensemble system [1,3–7]. In this paper, we follow the taxonomy in [8], in which ensemble methods are divided into two types:
- Homogeneity: generic classifiers are generated from different training sets obtained from the original one by the same learning algorithm, and the outputs of these classifiers are then combined to give the final decision. Several state-of-the-art coverage-based ensemble methods in the literature are AdaBoost [9], Bagging [10] and Random Forest [11].
- Heterogeneity: a fixed set of different learning algorithms is applied to the same training set to generate different classifiers, and the decision is then made from the outputs of these classifiers (called Level1 data or meta-data [12–14]). This approach focuses more on algorithms for combining the Level1 data so as to achieve higher accuracy than any single classifier.
In this paper, we focus on the second type of ensemble method. There are two approaches to combining the outputs of different classifiers, namely fixed combining methods and trainable combining methods [8]. Fixed combining methods, which are based on the Bayesian decision model [15], do not take the label information in the Level1 data of the training set into consideration when combining. The advantage of applying fixed methods in an ensemble system is that no training on Level1 data is needed; as a result, they are less time-consuming than their trainable counterparts. Several popular fixed combining methods have been studied in the literature, namely the Sum, Product, Vote, Max, Min, Average, Median and Oracle rules [15,16]. To our knowledge, Vote and Sum are the most popular rules and have been successfully
applied in many classifier combination situations. Kittler et al. [15] showed that the Sum rule is developed under two assumptions, namely "conditional independence of respective representations used by classifiers and classes being highly ambiguous", and that the Sum rule gives the most reliable predictions. Kuncheva [16] also derived the theoretical probability of error of several rules under normality and uniformity assumptions on the distributions. In contrast, trainable combining methods work on the Level1 data of the training set to form the prediction model. Although exploiting the Level1 data of the training set to discover knowledge can enhance classification accuracy, the computational cost also increases. Several state-of-the-art trainable combining algorithms are Multiple Response Linear Regression (MLR) [13], Decision Template [17] and SCANN [18].

The most important studies on trainable combining methods are based on the Stacking algorithm, which was first proposed by Wolpert [19] and further developed in [12,13,18]. In this algorithm, the training set is divided into several equal disjoint parts; each part plays the role of the test set in turn, while the rest plays the role of the training set during the training phase. The output of Stacking is the posterior probability that an observation belongs to a class according to each classifier. The common feature of Stacking-based approaches is that the Level1 data of the training set is trained again by a combining classifier method to form the final prediction framework.

Several strategies have been proposed to exploit the label information in the Level1 data of the training set. In one strategy, the outputs of the classifiers are grouped according to the given labels, and a template associated with each label is then constructed. Two well-known methods using this strategy are Multiple Response Linear Regression (MLR) [13] and Decision Template [17]. MLR is based on the assumption that each classifier puts a different weight on each class; combining is then conducted through M linear combinations of the posterior probabilities and the associated weights for the M classes. The predicted class label of an unlabeled observation is obtained by selecting the maximum value among those combinations. As a result, it is important to find suitable combining weights so as to achieve high classification accuracy. Ting et al. [13] proposed solving M Linear Regression models, corresponding to the M classes, on the Level1 data and the training labels in crisp form to find these combining weights. Recently, Zhang and Zhou [20] proposed using linear programming to find the weights. Sen et al. [21] introduced a method inspired by MLR which uses a hinge loss function in the combiner instead of the conventional least squares loss; using this loss with regularization, three different combinations were proposed, namely weighted sum, dependent weighted sum and linear stacked generalization, based on different regularizations with group sparsity. In the Decision Template method [17], the Level1 data of the training set is grouped according to the class labels of the training observations, and the Decision Template of each class is constructed by averaging the Level1 data in that class. Kuncheva et al. [17] proposed eleven measures between a Decision Template and the Level1 data of an unlabeled observation to predict its class label. The benefit of this method is that it saves time in both training and testing due to its simple computation.
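To make the contrast between a fixed rule and a trainable template-based combiner concrete, the sketch below implements the Sum rule and a simple Decision Template combiner on Level1 data. This is a minimal sketch under our own conventions (Level1 data stored as an N × K × M array of posterior probabilities), and it uses a Euclidean distance to the templates rather than one of the eleven similarity measures studied in [17].

```python
import numpy as np

def sum_rule(level1):
    """Fixed Sum rule: average the K posterior vectors and take the argmax.
    level1: array of shape (N, K, M) holding P_k(y_m | x_n)."""
    return level1.mean(axis=1).argmax(axis=1)

def fit_decision_templates(level1, labels, n_classes):
    """Trainable combiner: the template of class m is the mean Level1 profile
    of the training observations whose label is m."""
    return np.stack([level1[labels == m].mean(axis=0) for m in range(n_classes)])

def decision_template_predict(level1, templates):
    """Assign each observation to the class whose template is closest
    (Euclidean distance is used here only for simplicity)."""
    dists = ((level1[:, None, :, :] - templates[None]) ** 2).sum(axis=(2, 3))
    return dists.argmin(axis=1)
```

The Sum rule needs no training at all, whereas the templates are learned from the labeled Level1 data, which is exactly the fixed/trainable distinction discussed above.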
However, this method could have a high error rate if the base classifiers are not accurate enough, because a simple Decision Template may not provide a good representation of a particular class. Merz [18] combined Stacking, Correspondence Analysis (CA) and K Nearest Neighbor (KNN) into a single algorithm called SCANN. The idea of that algorithm is to discover the relationship between the training observations and the classification outputs of the base classifiers generated by Stacking, by applying CA to the indicator matrix formed by the Level1 data and the true labels of these observations. After that, KNN
is used to classify unseen observations in the new scaled space. In real-world applications, the method is sometimes impractical due to the singularity of the indicator matrix obtained by CA. Moreover, the testing process of SCANN is more complicated than that of other combining classifier algorithms, which increases the classification time. Another approach, proposed by Todorovski and Džeroski [14], is the Meta Decision Tree, a Decision Tree built on Level1 data in which at each node a classifier is chosen instead of selecting a value for splitting an attribute. The authors also proposed an expansion of the Level1 data by adding the entropy and the maximum posterior probability so as to increase the discrimination ability; however, no theoretical basis was provided for the effectiveness of that expansion. Recently, Nguyen et al. [22] proposed a hybrid combining classifier system in which fuzzy rules are applied to Level1 data to produce the classification rules. Although that system outperforms other fuzzy rule-based methods and ensemble methods in their experiments, as well as addressing the high-dimensionality problem commonly found in general fuzzy rule-based methods, it takes a long time to train due to the large number of rules generated on the Level1 data.

Unlike the above approaches, here we propose a novel combining classifier method, called VIG, which approximates the density distribution of Level1 data to obtain a prediction framework based on the Bayesian model. Since the maximum likelihood approach is not applicable due to the singularity property of Level1 data, we propose using Variational Inference (VI) to estimate the multivariate Gaussian density distribution.

The rest of this paper is organized as follows. Section 2 describes the properties of Level1 data and then introduces the VI method for multivariate Gaussian distribution estimation. After that, in Section 3, a framework based on the Bayesian decision model is proposed to combine the outputs of the base classifiers. Experimental results on eighteen UCI datasets [23] and the CLEF medical image database [24] are reported and discussed in Section 4. The conclusion is given in the last section.

2. Preliminaries

A summary of the mathematical notations:

X: observed data or training set
x: an observation
M: the number of classes
N: the number of observations
K: the number of learning algorithms
{y_m}, m = 1,…,M: the set of labels
Z: hidden variable
μ, Σ: mean and covariance of a Gaussian distribution
Λ: the precision matrix, the inverse of the covariance matrix, Λ = Σ^(−1)
D: the dimension of the input data
W_0 and υ_0: the initial values of the scale matrix and degrees of freedom of the Wishart distribution p(Λ)
m_0 and β_0: the initial values of the mean vector and the scale of the precision matrix Λ of the Gaussian distribution p(μ | Λ)
m, H: mean vector and precision matrix of the Gaussian distribution q(μ) = N(μ | m, H^(−1))
W, υ: scale matrix and degrees of freedom of the Wishart distribution q(Λ) = W(Λ | W, υ)
Tr(·): the trace operator of a matrix
Γ(·): the Gamma function, Γ(t) = ∫_0^∞ x^(t−1) e^(−x) dx
ℒ(q): the lower bound
L: the meta-data or Level1 data of X
L(x): the meta-data or Level1 data of an observation x
{L_m}, m = 1,…,M: the Level1 data of the observations belonging to the m-th class, L_m = {(L(x), y) | x ∈ X, y = y_m}
{G_m}, m = 1,…,M: the Gaussian model for the m-th class
|·|: relative cardinality of a set
2.1. Level1 data

Let {y_m}, m = 1,…,M, denote the set of labels, N the number of observations, K the number of base classifiers and M the number of labels. For an observation x, P_k(y_m | x) is the probability that x belongs to the class with label y_m as given by the k-th classifier. Kuncheva et al. [17] summarized three output types for x, for each k = 1,…,K:
- Crisp label: returns only a class label, i.e. P_k(y_m | x) ∈ {0, 1} and Σ_m P_k(y_m | x) = 1.
- Fuzzy label: returns posterior probabilities that x belongs to the classes, i.e. P_k(y_m | x) ∈ [0, 1] and Σ_m P_k(y_m | x) = 1.
- Possibilistic label: the same as the fuzzy label, but the posterior probabilities are not required to sum to one, i.e. P_k(y_m | x) ∈ [0, 1] and Σ_m P_k(y_m | x) > 0.
In this work, we focus only on the second type, i.e. the fuzzy label, which has a familiar interpretation: the posterior probability reflects the support of a class for an observation. The Level1 data of all observations, an N × MK posterior probability matrix, is defined as:
L := [ P_1(y_1|x_1)  ⋯  P_1(y_M|x_1)  ⋯  P_K(y_1|x_1)  ⋯  P_K(y_M|x_1)
       P_1(y_1|x_2)  ⋯  P_1(y_M|x_2)  ⋯  P_K(y_1|x_2)  ⋯  P_K(y_M|x_2)
          ⋮       ⋱       ⋮       ⋱       ⋮       ⋱       ⋮
       P_1(y_1|x_N)  ⋯  P_1(y_M|x_N)  ⋯  P_K(y_1|x_N)  ⋯  P_K(y_M|x_N) ]                    (1)

whereas the Level1 data of an observation x_n is defined in one of two forms, depending on the combining algorithm:

L(x_n) := [ P_1(y_1|x_n)  ⋯  P_1(y_M|x_n)
               ⋮       ⋱       ⋮
            P_K(y_1|x_n)  ⋯  P_K(y_M|x_n) ],   n = 1,…,N                                    (2a)

L(x_n) := [ P_1(y_1|x_n)  ⋯  P_1(y_M|x_n)  ⋯  P_K(y_1|x_n)  ⋯  P_K(y_M|x_n) ],   n = 1,…,N   (2b)
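As an illustration of how the matrix in (1) can be generated in practice, the following sketch produces Level1 data by cross-validated stacking. It assumes scikit-learn-style base learners with a predict_proba method and class labels encoded as 0,…,M−1; the fold count and learner list are illustrative, not the exact configuration used in the paper.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold

def build_level1(X, y, base_learners, n_folds=10):
    """Return the N x (M*K) Level1 matrix of posterior probabilities P_k(y_m | x_n),
    produced by cross-validated stacking so that each observation is predicted by
    classifiers that never saw it during training."""
    n, m = len(y), len(np.unique(y))          # assumes labels are 0, ..., m-1
    level1 = np.zeros((n, m * len(base_learners)))
    for train_idx, test_idx in StratifiedKFold(n_splits=n_folds).split(X, y):
        for k, learner in enumerate(base_learners):
            fitted = clone(learner).fit(X[train_idx], y[train_idx])
            level1[test_idx, k * m:(k + 1) * m] = fitted.predict_proba(X[test_idx])
    return level1
```

Row n of the returned matrix is L(x_n) in the form (2b); reshaping it to K × M gives the form (2a).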
We have the following properties with respect to the Level1 data:

Lemma 1. The Level1 data generated by using fuzzy labels is not full column rank.

Corollary 1. The covariance matrix of Level1 data is not full rank.

The proofs can be found in Appendix A.

2.2. Variational Inference for multivariate Gaussian distribution

Maximum likelihood estimation is a popular method for fitting a Gaussian distribution to an observed dataset. However, when the covariance matrix of the dataset is not full rank, it cannot be applied. Kuncheva et al. [17] observed that if the accuracy of all classifiers is high, the covariance matrix of Level1 data is likely to
be singular. In this work, we propose to use the VI method to estimate the multivariate Gaussian model. The idea behind the VI method is to approximate the posterior distribution p(Z|X) of the hidden variables Z given the observed data X by a more easily accessible distribution q(Z) which minimizes the divergence between p(Z|X) and q(Z). In the maximum likelihood method, the parameters (μ, Σ) are not considered as random variables, and the likelihood function is maximized with respect to the parameter values. In contrast, in the VI method the parameters (μ, Σ) are treated as random variables, and priors are placed over the parameters to obtain the posterior distribution p(μ, Σ | X). In the literature, the Kullback–Leibler (KL) divergence is commonly used to measure the distance between two distributions:
KL(q ∥ p) = E_q[ ln( q(Z) / p(Z|X) ) ] = ∫ q(Z) ln( q(Z) / p(Z|X) ) dZ                       (3)

It is worth noting that the KL divergence is difficult to optimize directly since it requires knowledge of the distribution we are trying to approximate. Since KL(q ∥ p) ≥ 0 and KL(q ∥ p) = ln p(X) − ℒ(q), where ℒ(q) = ∫ q(Z) ln( p(X, Z) / q(Z) ) dZ is a lower bound on the log marginal probability ln p(X), we can maximize the lower bound ℒ(q) instead of minimizing KL(q ∥ p).

If we assume that q(Z) = ∏_{i=1}^{M} q_i(Z_i), in which Z = ∪_{i=1}^{M} Z_i, and iteratively maximize ℒ(q) with respect to q_j(Z_j) (Z_i ∩ Z_j = ∅, i ≠ j) while the q_{i≠j} are held fixed, the optimal solution q_j*(Z_j) is given by [25,26]:

ln q_j*(Z_j) = E_{i≠j}[ ln p(X, Z) ] + const                                                 (4)

where the notation E_{i≠j}[·] denotes an expectation with respect to the q distributions over all variables Z_i (i ≠ j), and the constant is independent of Z_j. Convergence is guaranteed because the bound is convex with respect to each of the factors q_i(Z_i) [27].

In the literature, VI-based approaches have been used to estimate the density of Dirichlet distributions and Gaussian Mixture Models (GMM) [26,28,29]. In this work, we apply the VI method to estimate the parameters of a multivariate Gaussian distribution. Based on the Central Limit Theorem, the Gaussian can be used to approximate a wide range of other distributions such as the Poisson, Binomial and Gamma distributions [30], whereas Dirichlet distributions are most commonly used as the prior distribution of categorical or multinomial variables in Bayesian models [31,32]. Here, the multivariate Gaussian distribution is used to approximate the likelihood function p(L(x) | G_m) for each class label, in which all features of L(x) are real and belong to [0, 1]. Although a GMM could also be used to model each class, a GMM requires many parameters, resulting in expensive computation in the training process. Moreover, when only a small amount of data is available, the choice of the number of Gaussian components of the GMM becomes critical [33].

Our goal is to infer the posterior distribution of the mean μ and precision matrix Λ, where Λ is the inverse of the covariance matrix (Λ = Σ^(−1)), given a dataset X = {x_n | n = 1,…,N} of variables x which are assumed to be drawn independently from the multivariate Gaussian distribution N(x | μ, Λ^(−1)). The likelihood function is given by

p(X | μ, Λ) = ∏_{n=1}^{N} N(x_n | μ, Λ^(−1)) = (2π)^(−ND/2) |Λ|^(N/2) exp{ −(1/2) Σ_{n=1}^{N} (x_n − μ)^T Λ (x_n − μ) }     (5)

where D is the dimensionality of the variable x. In order to formulate a variational solution, we write down the joint distribution of all the random variables: p(X, μ, Λ) = p(X | μ, Λ) p(μ | Λ) p(Λ). The conjugate prior of a multivariate Gaussian distribution, p(μ, Λ), with unknowns μ and Λ is given by the Gaussian–Wishart distribution p(μ, Λ) = p(μ | Λ) p(Λ), where p(μ | Λ) is a Gaussian distribution:

p(μ | Λ) = N(μ | m_0, (β_0 Λ)^(−1)) = (2π)^(−D/2) |β_0 Λ|^(1/2) exp{ −(1/2) (μ − m_0)^T β_0 Λ (μ − m_0) }                 (6)

and p(Λ) is a Wishart distribution:

p(Λ) = W(Λ | W_0, υ_0) = B(W_0, υ_0) |Λ|^((υ_0 − D − 1)/2) exp{ −(1/2) Tr(W_0^(−1) Λ) }                                   (7)

B(W_0, υ_0) = |W_0|^(−υ_0/2) ( 2^(υ_0 D/2) π^(D(D−1)/4) ∏_{i=1}^{D} Γ((υ_0 + 1 − i)/2) )^(−1)                             (8)

where m_0 and β_0 are the D-dimensional mean vector and the scale of the precision matrix Λ of the Gaussian distribution p(μ | Λ), W_0 and υ_0 are the D × D scale matrix and the number of degrees of freedom of the Wishart distribution p(Λ), Tr(·) denotes the trace operator of a matrix, and Γ(·) denotes the Gamma function Γ(t) = ∫_0^∞ x^(t−1) e^(−x) dx. Restricting our attention to a factorized variational approximation of the posterior distribution, i.e. q(μ, Λ) = q(μ) q(Λ), the following update equations can be derived:

ln q*(μ) = E_Λ[ ln p(X, μ, Λ) ] + const                                                      (9)

ln q*(Λ) = E_μ[ ln p(X, μ, Λ) ] + const                                                     (10)
We have the following results.

Lemma 2. The optimum solution q*(μ) of update equation (9) is a Gaussian q*(μ) = N(μ | m, H^(−1)) with mean m and precision H given by (11) and (12):

m = (β_0 m_0 + N x̄) / (β_0 + N)                                                            (11)

H = (β_0 + N) E[Λ]                                                                          (12)

Lemma 3. The optimum solution q*(Λ) of update equation (10) is a Wishart q*(Λ) = W(Λ | W, υ) with the number of degrees of freedom υ and the scale matrix W given by (13) and (14):

υ = υ_0 + N + 1                                                                             (13)

W^(−1) = W_0^(−1) + (β_0 + N) H^(−1) + S + (β_0 N / (β_0 + N)) J                            (14)

where

x̄ = (1/N) Σ_{n=1}^{N} x_n,   S = Σ_{n=1}^{N} (x_n − x̄)(x_n − x̄)^T,   J = (x̄ − m_0)(x̄ − m_0)^T

Lemma 4. The lower bound ℒ(q) of the Variational Inference for the multivariate Gaussian distribution is given by (15):

ℒ(q) = ln B(W_0, υ_0) − ln B(W, υ) − (1/2){ ND ln(2π) − D ln β_0 − υD + ln|H| + υ Tr(SW) + (β_0 N υ / (β_0 + N)) Tr(JW) + υ Tr(W_0^(−1) W) }     (15)

The proofs of Lemmas 2–4 are given in Appendix A. Denoting by ℒ_i(q) the value of the lower bound at the i-th iteration, it can be shown that

ℒ_i(q) − ℒ_{i−1}(q) = −ln B(W_i, υ) − (1/2) υ Tr[ (S + W_0^(−1) + (β_0 N / (β_0 + N)) J) W_i ] + ln B(W_{i−1}, υ) + (1/2) υ Tr[ (S + W_0^(−1) + (β_0 N / (β_0 + N)) J) W_{i−1} ] − (1/2) ( ln|W_i| − ln|W_{i−1}| )     (16)

where we have made use of the fact that

ln|H_i| − ln|H_{i−1}| = ln( |E[Λ_i]| / |E[Λ_{i−1}]| ) = ln( |υW_i| / |υW_{i−1}| ) = ln|W_i| − ln|W_{i−1}|.

We have the following algorithm for multivariate Gaussian distribution estimation:

Algorithm 1. VI for multivariate Gaussian distribution estimation
Input: dataset X, threshold ε; m_0, β_0, υ_0, W_0, E[Λ]
Output: m, H of q(μ) = N(μ | m, H^(−1)) and W, υ of q(Λ) = W(Λ | W, υ)
  i := 1
  For each i
    Update m, H using (11) and (12)
    Update W, υ using (13) and (14)
    If i > 1 and ℒ_i(q) − ℒ_{i−1}(q) < ε then break
    i := i + 1
  End

In the algorithm above, the four variables of q(μ) and q(Λ) are updated step by step from their initial values. The updating process stops when the change in the lower bound value ℒ(q) is smaller than a specified threshold ε. In our experiments, 3 or 4 iterations typically proved sufficient to achieve convergence with a threshold ε = 1e−10.
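A minimal NumPy sketch of Algorithm 1, using the update equations (11)–(14) and the simple initial values adopted later in Section 4.2 (m_0 = 0, β_0 = 1, υ_0 = D, W_0 = I, E[Λ] = υ_0 W_0). For brevity the stopping test monitors the change in W rather than the lower bound ℒ(q) of (15); the function and variable names are ours.

```python
import numpy as np

def vi_gaussian(X, eps=1e-10, max_iter=100):
    """Variational Inference for a multivariate Gaussian with a Gaussian-Wishart prior.
    Returns (m, H) of q(mu) = N(mu | m, H^-1) and (W, nu) of q(Lambda) = W(Lambda | W, nu)."""
    N, D = X.shape
    m0, beta0 = np.zeros(D), 1.0            # simple initial values, as in Section 4.2
    nu0, W0 = float(D), np.eye(D)
    E_Lambda = nu0 * W0                      # E[Lambda] = nu0 * W0

    xbar = X.mean(axis=0)
    Xc = X - xbar
    S = Xc.T @ Xc                            # sum_n (x_n - xbar)(x_n - xbar)^T
    J = np.outer(xbar - m0, xbar - m0)

    m = (beta0 * m0 + N * xbar) / (beta0 + N)   # (11), does not change across iterations
    nu = nu0 + N + 1                             # (13), does not change across iterations
    W = W0.copy()
    for _ in range(max_iter):
        H = (beta0 + N) * E_Lambda               # (12)
        W_inv = (np.linalg.inv(W0) + (beta0 + N) * np.linalg.inv(H)
                 + S + beta0 * N / (beta0 + N) * J)   # (14)
        W_new = np.linalg.inv(W_inv)
        E_Lambda = nu * W_new
        if np.max(np.abs(W_new - W)) < eps:      # convergence proxy instead of L(q)
            W = W_new
            break
        W = W_new
    return m, (beta0 + N) * E_Lambda, W, nu
```

Since (11) and (13) do not depend on the other factor, only H, W and E[Λ] actually change across iterations, which is consistent with convergence being reached after only a few passes.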
3. Proposed combining classifier method

The most important distinction between our work and previous work is that we use a statistical learning-based approach on Level1 data to build the combining classifier method. Attributes in the original data frequently vary in nature, measurement unit, and type. As a result, a Gaussian model does not perform well when it is used to approximate the distribution of the original data. Level1 data, on the other hand, can be viewed as data scaled from the feature domain to the posterior domain, where the data is reshaped to real values in [0, 1]. Observations that belong to the same class will likely have similar posterior probabilities generated by the base classifiers and will be located close together in the new domain. Consequently, Level1 data is expected to be more discriminative than the original data, and a Gaussian model on Level1 data will be more effective than one on the original data.

The proposed VIG combining classifier method is illustrated in Fig. 1. First, the Stacking algorithm is applied to the training set to generate the Level1 data, denoted by L. Since the labels of the training observations X = {(x, y) | y ∈ {y_m}, m = 1,…,M} are known, we can gather L into M groups corresponding to the M labels: L_m = {(L(x), y) | x ∈ X, y = y_m}, m = 1,…,M. Then, the VI method is applied to each L_m to model the distribution of each label by a multivariate Gaussian distribution. Based on the Bayesian decision model, the posterior probability of an observation x belonging to the m-th class is given by

p(G_m | L(x)) ∝ p(L(x) | G_m) p(G_m)                                                        (17)

where G_m is the model for the m-th class.
Fig. 1. The proposed VIG combining classifier method.
Here p(G_m) is the prior probability of the m-th class. Many approaches to the choice of prior probability have been introduced [34]; they generally belong to one of two classes, informative priors and uninformative priors. Spiegelhalter et al. [35] and Lauritzen et al. [36] demonstrated the improvement in prediction when informative priors are used in a Bayesian-based system. Gelman [37] stated that the choice of prior distribution has only a minor effect on the posterior probabilities when there is a large number of observations and 'well-identified' parameters, whereas it plays an important role when the number of observations is small or when the available data provide only indirect information about the parameters of interest. In this paper, due to space limitations and to show that even a simple empirical choice can achieve good classification performance, we compute the prior probability simply by

p(G_m) = |L_m| / |X|                                                                        (18)

where |·| denotes the cardinality of a set. The likelihood function p(L(x) | G_m) is given by

p(L(x) | G_m) = N(L(x) | μ_m, Σ_m) = (2π)^(−MK/2) |Σ_m|^(−1/2) exp{ −(1/2) (L(x) − μ_m)^T Σ_m^(−1) (L(x) − μ_m) }          (19)

where μ_m and Σ_m in (19) are the mean and covariance matrix of the multivariate Gaussian model obtained by VI for the m-th class. Note that instead of Σ_m, the precision matrix Λ_m = Σ_m^(−1) is computed by VI during the training process. In the classification phase, the Level1 data of an unlabeled observation is first generated by the base classifiers. Its class prediction is then obtained by selecting the label associated with the maximum posterior probability computed from the M multivariate Gaussian models. Therefore, the class label of an unlabeled observation x* is predicted by

x* ∈ y_t   if   t = argmax_{m=1,…,M} p(G_m | L(x*)) = argmax_{m=1,…,M} p(L(x*) | G_m) p(G_m)   (20)

Our ensemble framework has some similarity with Bayesian Model Averaging (BMA) [38], since the results of all hypotheses (classifiers) are used to obtain the final discriminative model. However, in our work we only consider the outputs themselves (which in our framework are obtained from the base classifiers), whereas in BMA the base classifier models are actually considered in the formulation. The VIG combining classifier method is given below:

Algorithm 2. VIG combining classifier method

Training process:
Input: training set X = {(x, y)}, K = {K learning algorithms}, ε, m_0, β_0, υ_0, W_0, E[Λ]
Output: Gaussian models {G_m} with parameters (μ_m, Λ_m) and p(G_m), m = 1,…,M
  Step 1: L = Stacking(X, K)
  Step 2: L_m = {(L(x), y) | x ∈ X, y = y_m}
  Step 3: For the m-th class
            (m_m, H_m, W_m, υ_m) = Algorithm 1(L_m, ε, m_0, β_0, υ_0, W_0, E[Λ])
            p(L(x) | G_m) = N(L(x) | μ_m, Λ_m^(−1)), where μ_m = m_m and Λ_m = W_m υ_m
            Compute p(G_m) using (18)
          End

Classification process:
Input: unlabeled observation x*
Output: predicted label for x*
  Step 1: Compute L(x*)
  Step 2: For the m-th class
            Compute p(G_m | L(x*)) ∝ p(L(x*) | G_m) p(G_m) using (17)
          End
  Step 3: Predict the label of x* using (20)
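A minimal sketch of the training and classification steps of Algorithm 2, built on the build_level1 and vi_gaussian sketches given earlier. The log-density is evaluated directly with the precision matrix Λ_m = υ_m W_m, and all function names are ours rather than the authors'.

```python
import numpy as np

def log_gaussian(x, mu, precision):
    """log N(x | mu, precision^-1), evaluated with the precision matrix Lambda."""
    d = x - mu
    sign, logdet = np.linalg.slogdet(precision)
    return 0.5 * logdet - 0.5 * len(mu) * np.log(2 * np.pi) - 0.5 * d @ precision @ d

def vig_fit(level1, labels, n_classes):
    """Training: one Gaussian model per class on its Level1 data, plus the prior (18)."""
    models = []
    for c in range(n_classes):
        Lc = level1[labels == c]
        m, H, W, nu = vi_gaussian(Lc)                       # Algorithm 1 sketch above
        models.append((m, nu * W, len(Lc) / len(labels)))   # mu_m, Lambda_m = nu_m*W_m, p(G_m)
    return models

def vig_predict(level1, models):
    """Classification: argmax_m of log p(L(x)|G_m) + log p(G_m), i.e. rule (20) in log form."""
    scores = np.array([[log_gaussian(row, mu, lam) + np.log(prior)
                        for (mu, lam, prior) in models] for row in level1])
    return scores.argmax(axis=1)
```

Training corresponds to Steps 1–3 of the training process, and prediction applies (17) and (20) in log form.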
4. Experimental results

4.1. Dataset

To evaluate the performance of the proposed VIG method, we performed experiments on two datasets. In the first experiment, eighteen data files from the UCI Machine Learning Repository were used, since it is often used to validate the performance of classification systems [23]. To ensure the objectivity of the comparison between our method and the other benchmark algorithms, we chose files whose number of observations varies significantly, from small files like Fertility and Iris to a big file such as Skin&NonSkin. The number of attributes also varies widely, from three (Titanic) to sixty (Sonar). Information about the selected data files is summarized in Table 1.

The second experiment was conducted on CLEF2009, a medical imaging database collected by Aachen University, Germany [24]. It is a large database containing 15,363 images allocated to 193 hierarchical categories. Here, we chose 10 classes with a different number of observations in each class (Table 2). The Histogram of Local Binary Patterns (HLBP) [39] was selected as the feature vector of each image.

4.2. Experimental settings

Three learning algorithms, namely Linear Discriminant Analysis (LDA), Naïve Bayes, and K Nearest Neighbor (with K set to 5, denoted by 5-NN), were chosen to construct the base classifiers. These diverse learning algorithms were chosen to ensure the diversity of the ensemble system. Since our method is a combining classifier method, it is necessary to compare it with other well-known combining classifier methods. It is also important to compare the error rates of the proposed method with those of the base classifiers to demonstrate the advantage of the ensemble approach. In our experiments, the proposed method was compared with seven benchmark algorithms: the best result from the base classifiers based on the outcomes on the test set, the best result from the fixed rules based on the outcomes on the test set, Decision Template (we use the similarity measure S1 defined as S1(L(x), DT_m) = |L(x) ∩ DT_m| / |L(x) ∪ DT_m|, where DT_m is the Decision Template of the m-th class and |·| is the relative cardinality of a set [17]), MLR, SCANN, AdaBoost, and Bagging.

Only simple values were chosen to initialize the parameters of Algorithm 1: m_0 is the D-vector with all zero elements (0,…,0)^T, β_0 = 1, υ_0 = D, W_0 is the D × D identity matrix, and E[Λ] = υ_0 W_0. We performed 10-fold cross validation and ran the test 10 times to obtain 100 test results for each data file. To assess statistical significance, we used a two-sample t-test to compare the classification results of our approach and each benchmark
algorithm. Specifically, a two-sample t-test was conducted to test the null hypothesis H_0: e_A = e_B, where e_A and e_B are the true classification error rates of the proposed method and the benchmark algorithm:

t = [ (ē_A − ē_B) − (e_A − e_B) ] / √( s_A²/n_A + s_B²/n_B )  =  (ē_A − ē_B) / √( s_A²/n_A + s_B²/n_B )   if H_0 is true     (21)

where ē_A, ē_B are the means and s_A, s_B the standard deviations of the classification error rates computed on the n_A = n_B = 100 test results of the proposed method and the benchmark algorithm, respectively. The critical region is determined by how we choose the alternative hypothesis H_1, e.g. for a one-tailed test e_A > e_B or e_A < e_B, or for a two-tailed test e_A ≠ e_B. We reject the null hypothesis that the classification error rates of the two methods are equal if the statistic t falls in the rejection region, and vice versa. In this paper, a one-tailed alternative hypothesis was used and the level of significance was set to 0.05. All source code was implemented in Matlab and run on a PC with an Intel Core i5 2.5 GHz processor and 4 GB RAM. The results of the experiment are summarized in Tables 3 and 4.
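A minimal sketch of the significance test in (21) applied to the 100 error rates produced by the 10 × 10-fold protocol, for the one-tailed alternative e_A < e_B. The degrees of freedom n_A + n_B − 2 and the use of scipy for the critical value are our own choices; the paper does not state how the critical region was computed.

```python
import numpy as np
from scipy import stats

def one_tailed_t_test(err_vig, err_bench, alpha=0.05):
    """Test H0: e_A = e_B against H1: e_A < e_B with the statistic of (21),
    on n_A = n_B = 100 classification error rates (arrays of floats)."""
    nA, nB = len(err_vig), len(err_bench)
    t = (err_vig.mean() - err_bench.mean()) / np.sqrt(err_vig.var(ddof=1) / nA +
                                                      err_bench.var(ddof=1) / nB)
    # One-tailed critical region at level alpha; nA + nB - 2 degrees of freedom are
    # used here for simplicity (a Welch-style correction would also be reasonable).
    t_crit = stats.t.ppf(alpha, df=nA + nB - 2)
    return t < t_crit   # True: reject H0, i.e. the first method has a lower error rate
```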
4.3. Results and discussion
Table 1. Information of the UCI data files used in the evaluation.

File name | No. of attributes | Attribute type | No. of observations | No. of classes | No. of attributes of Level1 data
Bupa | 6 | C,I,R | 345 | 2 | 6
Artificial | 10 | R | 700 | 2 | 6
Pima | 6 | R,I | 768 | 2 | 6
Sonar | 60 | R | 208 | 2 | 6
Heart | 13 | C,I,R | 270 | 2 | 6
Phoneme | 5 | R | 540 | 2 | 6
Titanic | 3 | R,I | 2,201 | 2 | 6
Balance | 4 | C | 625 | 3 | 9
Fertility | 9 | R | 100 | 2 | 6
Skin&NonSkin | 3 | R | 245,057 | 2 | 6
Wdbc | 30 | R | 569 | 2 | 6
Australian | 14 | C,I,R | 690 | 2 | 6
Twonorm | 20 | R | 7,400 | 2 | 6
Ring | 20 | R | 7,400 | 2 | 6
Tae | 20 | C,I | 151 | 2 | 6
Contraceptive | 9 | C,I | 1,473 | 3 | 9
Vehicle | 18 | I | 946 | 4 | 12
Iris | 4 | R | 150 | 3 | 9

R: Real, C: Category, I: Integer.
4.3.1. Comparison with other combining classifier methods and base classifiers

First, the error rates and variances of the three base classifiers are reported in Table 3. From these results, the one with the smallest error rate for each data file was selected as the best result. From the significance test in Table 5, we see that VIG is better than the best result from the base classifiers, obtaining seven wins and only one loss. In comparison with the best result from the six fixed combining rules (see Table 4), our method achieved better results on five files, namely Phoneme (0.1164 vs. 0.1407), Ring (0.1168 vs. 0.2122), Skin&NonSkin (4.28E−04 vs. 6E−04), Vehicle (0.2069 vs. 0.2645), and CLEF2009 (0.1693 vs. 0.2023), and worse results on two files, Bupa (0.3151 vs. 0.297) and Artificial (0.2401 vs. 0.2193). As mentioned earlier, fixed combining rules do not exploit the label information in the Level1 data of the training set to form the prediction, so they sometimes do not obtain high classification accuracy. In contrast, the proposed method is a trainable combining method in which the label information in the Level1 data of the training set is exploited to make the prediction. As a result, in almost all situations our method is better than any fixed combining rule.
Table 2. Information of the 10 classes selected from the CLEF2009 medical image database.

Description | Number of observations
Abdomen | 80
Cervical | 81
Chest | 80
Facial cranium | 80
Left elbow | 69
Left shoulder | 80
Left breast | 80
Finger | 66
Left ankle joint | 80
Left carpal joint | 80
Table 3. Classification error rates and variances of the 3 base classifiers. Each cell gives the mean error rate with its variance in parentheses.

File name | LDA | Naïve Bayes | 5-NN | Best result from base classifiers
Bupa | 0.3693 (8.30E−03) | 0.4264 (7.60E−03) | 0.3331 (6.10E−03) | 0.3331 (6.10E−03) (−)
Artificial | 0.4511 (1.40E−03) | 0.4521 (1.40E−03) | 0.2496 (2.40E−03) | 0.2496 (2.40E−03) (=)
Pima | 0.2396 (2.40E−03) | 0.2668 (2.00E−03) | 0.2864 (2.30E−03) | 0.2396 (2.40E−03) (=)
Sonar | 0.2629 (9.70E−03) | 0.3042 (7.40E−03) | 0.1875 (7.60E−03) | 0.1875 (7.60E−03) (=)
Heart | 0.1593 (5.30E−03) | 0.1611 (5.90E−03) | 0.3348 (5.10E−03) | 0.1593 (5.30E−03) (=)
Phoneme | 0.2408 (3.00E−04) | 0.2607 (3.00E−04) | 0.1133 (2.00E−04) | 0.1133 (2.00E−04) (=)
Titanic | 0.2201 (5.00E−04) | 0.2515 (8.00E−04) | 0.2341 (3.70E−03) | 0.2201 (5.00E−04) (=)
Balance | 0.2917 (2.90E−03) | 0.2600 (3.30E−03) | 0.1442 (1.20E−03) | 0.1442 (1.20E−03) (−)
Fertility | 0.3460 (2.01E−02) | 0.3770 (2.08E−02) | 0.1550 (4.50E−03) | 0.1550 (4.50E−03) (−)
Skin&NonSkin | 0.0659 (2.74E−06) | 0.1785 (6.61E−06) | 0.0005 (1.68E−08) | 0.0005 (1.68E−08) (−)
Wdbc | 0.0397 (7.00E−04) | 0.0587 (1.20E−03) | 0.0666 (8.00E−04) | 0.0397 (7.00E−04) (=)
Australian | 0.1416 (1.55E−03) | 0.1297 (1.71E−03) | 0.3457 (2.11E−03) | 0.1297 (1.71E−03) (=)
Twonorm | 0.0217 (3.12E−05) | 0.0217 (3.13E−05) | 0.0312 (3.96E−05) | 0.0217 (3.12E−05) (=)
Ring | 0.2381 (2.27E−04) | 0.2374 (2.23E−04) | 0.3088 (1.30E−04) | 0.2374 (2.23E−04) (−)
Tae | 0.4612 (1.21E−02) | 0.4505 (1.22E−02) | 0.5908 (1.37E−02) | 0.4505 (1.22E−02) (=)
Contraceptive | 0.4992 (1.40E−03) | 0.5324 (1.42E−03) | 0.4936 (1.70E−03) | 0.4936 (1.70E−03) (−)
Vehicle | 0.2186 (1.39E−03) | 0.5550 (2.94E−03) | 0.3502 (2.35E−03) | 0.2186 (1.39E−03) (−)
Iris | 0.0200 (1.40E−03) | 0.0400 (2.30E−03) | 0.0353 (1.50E−03) | 0.0200 (1.40E−03) (+)
CLEF2009 | 0.1683 (1.63E−03) | 0.3757 (2.12E−03) | 0.3116 (2.34E−03) | 0.1683 (1.63E−03) (=)

(+) means that the method is better than VIG, (=) means that the method is equal to VIG, (−) means that the method is worse than VIG. Relative performance was checked using a t-test.
Table 4. Classification error rates and variances of the combining classifier algorithms. Each cell gives the mean error rate with its variance in parentheses; "x" indicates that the method could not be run on that file.

File name | Best result from 6 fixed rules | MLR | SCANN | Decision Template | VIG
Bupa | 0.2970 (4.89E−03) (+) | 0.3033 (4.70E−03) (=) | 0.3304 (4.29E−03) (−) | 0.3348 (7.10E−03) (−) | 0.3151 (3.73E−03)
Artificial | 0.2193 (2.05E−03) (+) | 0.2426 (2.20E−03) (=) | 0.2374 (2.12E−03) (=) | 0.2433 (1.60E−03) (=) | 0.2401 (2.33E−03)
Pima | 0.2365 (2.10E−03) (=) | 0.2432 (2.30E−03) (=) | 0.2384 (2.06E−03) (=) | 0.2482 (2.00E−03) (−) | 0.2340 (2.21E−03)
Sonar | 0.2079 (8.16E−03) (=) | 0.1974 (7.20E−03) (=) | 0.2128 (8.01E−03) (=) | 0.2129 (8.80E−03) (=) | 0.2025 (7.84E−03)
Heart | 0.1570 (4.64E−03) (=) | 0.1607 (4.70E−03) (=) | 0.1637 (4.14E−03) (=) | 0.1541 (4.00E−03) (=) | 0.1556 (4.01E−03)
Phoneme | 0.1407 (1.95E−04) (−) | 0.1136 (1.75E−04) (=) | 0.1229 (6.53E−04) (−) | 0.1462 (2.00E−04) (−) | 0.1164 (1.96E−04)
Titanic | 0.2167 (5.00E−04) (=) | 0.2169 (4.00E−04) (=) | 0.2216 (6.29E−04) (=) | 0.2167 (6.00E−04) (=) | 0.2169 (4.83E−04)
Balance | 0.1112 (4.82E−04) (=) | 0.1225 (8.00E−04) (−) | x | 0.0988 (1.40E−03) (+) | 0.1123 (6.19E−04)
Fertility | 0.1270 (1.97E−03) (=) | 0.1250 (2.28E−03) (=) | x | 0.4520 (3.41E−02) (−) | 0.1310 (2.34E−03)
Skin&NonSkin | 0.0006 (2.13E−08) (−) | 4.79E−04 (1.97E−08) (−) | x | 0.0332 (1.64E−06) (−) | 4.28E−04 (1.57E−08)
Wdbc | 0.0395 (5.03E−04) (=) | 0.0399 (7.00E−04) (=) | 0.0397 (5.64E−04) (=) | 0.0385 (5.00E−04) (=) | 0.0408 (5.75E−04)
Australian | 0.1262 (1.37E−03) (=) | 0.1268 (1.80E−03) (=) | 0.1259 (1.77E−03) (=) | 0.1346 (1.50E−03) (−) | 0.1225 (1.49E−03)
Twonorm | 0.0216 (2.82E−05) (=) | 0.0217 (2.24E−05) (=) | 0.0216 (2.39E−05) (=) | 0.0221 (2.62E−05) (=) | 0.0218 (2.49E−05)
Ring | 0.2122 (1.62E−04) (−) | 0.1700 (1.69E−04) (−) | 0.2150 (2.44E−04) (−) | 0.1894 (1.78E−04) (−) | 0.1168 (1.00E−04)
Tae | 0.4435 (1.70E−02) (=) | 0.4652 (1.24E−02) (−) | 0.4428 (1.34E−02) (=) | 0.4643 (1.21E−02) (−) | 0.4348 (1.71E−02)
Contraceptive | 0.4653 (1.79E−03) (=) | 0.4675 (1.10E−03) (=) | 0.4869 (1.80E−03) (−) | 0.4781 (1.40E−03) (−) | 0.4634 (1.32E−03)
Vehicle | 0.2645 (1.37E−03) (−) | 0.2139 (1.40E−03) (=) | 0.2224 (1.54E−03) (−) | 0.2161 (1.50E−03) (−) | 0.2069 (1.23E−03)
Iris | 0.0327 (1.73E−03) (=) | 0.022 (1.87E−03) (=) | 0.032 (2.00E−03) (=) | 0.0400 (2.50E−03) (=) | 0.0313 (2.00E−03)
CLEF2009 | 0.2023 (1.85E−03) (−) | 0.1633 (1.58E−03) (=) | 0.1895 (1.68E−03) (−) | 0.1893 (1.74E−03) (−) | 0.1693 (1.76E−03)

(+) means that the method is better than VIG, (=) means that the method is equal to VIG, (−) means that the method is worse than VIG. Relative performance was checked using a t-test.
Table 5. Statistical two-sample t-test results comparing the proposed method with the benchmark algorithms.

       | VIG vs. best result from base classifiers | VIG vs. best result from fixed rules | VIG vs. MLR | VIG vs. Decision Template | VIG vs. SCANN
Better | 7  | 5  | 4  | 11 | 6
Equal  | 11 | 12 | 15 | 7  | 10
Worse  | 1  | 2  | 0  | 1  | 0
Compared with SCANN, our method outperformed it significantly, obtaining 6 wins and 0 losses. We note that only 16 files were compared here, because SCANN could not be run on 3 files (Fertility, Balance and Skin&NonSkin). The reason is that the
indicator matrix in SCANN has columns in which all posterior probabilities from the base classifiers are zero. As a result, its column mass is singular and the standardized residuals are not available [18]. We ignored these cases in the comparison. Finally, our method performed significantly better than the Decision Template method, posting 11 wins and only 1 loss. Similar to our approach, the Decision Template method also groups the training observations based on their labels in Level1 data and then builds a template for each class. In fact, the Decision Template of the m-th class is the average of the Level1 data of the training observations with label y_m [17]. However, the template representation is not as powerful as our Variational Inference-based approach, so our method frequently outperforms that benchmark algorithm. MLR also represents each class by building regression models based on the Level1 data of the training observations and their class labels. It has better performance than the Decision Template method
in most cases, due to the more powerful statistical representation obtained by linear regression. Nevertheless, our method is still better than MLR, obtaining 4 wins and 0 losses. The 4 winning cases are Balance (0.1123 vs. 0.1225), Skin&NonSkin (4.28E−04 vs. 4.79E−04), Ring (0.1168 vs. 0.17), and Tae (0.4348 vs. 0.4652). This is a significant outcome because MLR is a highly competitive trainable combining method on many datasets.

To compare the training and classification times, we computed the average training and classification time over all 100 runs; the averaged values are reported in Figs. 2 and 3. Here we only compare the performance of the 4 trainable combining algorithms; fixed combining algorithms always outperform trainable algorithms in training time because they do not exploit the label information in Level1 data, so we do not compare them with the trainable combining methods. First, in the training process, no method dominates. In 6 cases, namely Artificial, Contraceptive, Balance, Australian, Ring and CLEF2009, Decision Template is the best among the 4 methods. In fact, that approach is simpler than VIG and MLR, as it only calculates the average of the Level1 data of the training observations associated with each class. In MLR, we have to solve M Linear Regression models to find the weight each classifier puts on a particular class. In VIG, several iterations are needed to find the mean and covariance parameters of the multivariate Gaussian distribution of each class. Table 6 shows the average number of iterations over the 10 runs of the 10-fold cross validation procedure. Convergence is obtained after about 3 or 4 iterations, so the training time of the VIG method is acceptable; it is even better than the other methods on Heart, Titanic, Wdbc, Tae and Iris. In Appendix B, we show the decrease in the values of ℒ_i − ℒ_{i−1} (i ≥ 2) in an experiment on all nineteen datasets. Although Decision Template is usually less time-consuming in the training process than the other three methods, the differences in training time between Decision Template and VIG are small.

For the classification time, MLR is ranked first, followed by VIG, Decision Template, and SCANN. In MLR, the label of an observation is predicted by simply multiplying its Level1 data with the associated weight values obtained from the training process. Our method, on the other hand, computes the product of the Level1 data of the test observation with the precision matrix, so it is more time-consuming than MLR, although the difference is not large. In SCANN, we have to compute M representations corresponding to the M classes from the Level1 data of the test observation; after that, the distance between each representation and each row of the selected principal coordinates of the columns matrix is computed [18]. As a result, SCANN is frequently the most time-consuming of the four trainable methods in the classification process.

Fig. 2. Average time of the training process (in seconds).

Fig. 3. Average time of the classification process (in seconds). (Top: the results of three datasets; bottom: the results of the others.)

Table 6. Average number of iterations of the proposed method.
Bupa: 4; Artificial: 4; Pima: 4; Sonar: 4; Heart: 4; Phoneme: 3.99; Titanic: 4; Balance: 4; Fertility: 4.5; Skin&NonSkin: 3; Wdbc: 4; Australian: 4; Twonorm: 4; Ring: 4; Tae: 4; Contraceptive: 4; Vehicle: 4; Iris: 4; CLEF2009: 4.
4.3.2. Different number of learning algorithms

Two further learning algorithms, namely Quadratic Discriminant Analysis (QDA) and Decision Tree, were added to our multi-classifier system. In this experiment, we want to assess the effect of having a different number of base classifiers. We denote our combining classifier method with 3 learning algorithms by VIG-3 and the method with five learning algorithms by VIG-5. Table 7 shows the error rates and variances of the classification results of Decision Tree, QDA, and VIG-5, and Table 8 shows the statistical two-sample t-test results comparing VIG-3 with VIG-5.

VIG-3 is better than VIG-5 on 4 datasets (Phoneme, Titanic, Skin&NonSkin, and Australian), whereas VIG-5 outperforms VIG-3 on 6 datasets (Bupa, Artificial, Balance, Ring, Vehicle, and CLEF2009). There does not appear to be a clear overall advantage in using more base classifiers in this case. However, by using additional learning algorithms, the classification performance of the ensemble system can improve significantly on some files. For example, the classification error rate on Ring given by the ensemble system with 3 learning algorithms is 0.11, but that number reduces to only 0.02 when the two new learning algorithms are added. Looking at the outputs of the base classifiers, we see that one of the newly added base classifiers has significantly better classification accuracy on this file. Hence, the discrimination ability of Level1 data can be enhanced by having the right base classifiers in the ensemble system. Through this experiment we can see that the addition of more learning algorithms can either improve or degrade the classification performance of a multi-classifier system. A learning mechanism, such as that in [40], that searches for an optimal subset of learning algorithms could be used to build a highly effective classification system.
Table 8. Statistical two-sample t-test results comparing VIG-3 with VIG-5.

VIG-3 vs. VIG-5: Better 4; Equal 7; Worse 6.
Table 7. Classification error rates and variances of the additional learning algorithms and VIG-5. Each cell gives the mean error rate with its variance in parentheses; "x" indicates that the method could not be run on that file.

File name | Decision tree | QDA | VIG-5
Bupa | 0.3514 (6.10E−03) | 0.3965 (7.48E−03) | 0.2970 (4.85E−03) (+)
Artificial | 0.2414 (2.20E−03) | 0.4021 (7.92E−03) | 0.2217 (2.05E−03) (+)
Pima | 0.2892 (1.80E−03) | 0.2628 (2.29E−03) | 0.2403 (2.11E−03) (=)
Sonar | 0.2866 (6.20E−03) | 0.2401 (8.15E−03) | 0.1861 (6.68E−03) (=)
Heart | 0.2381 (6.70E−03) | 0.1722 (5.94E−03) | 0.1644 (4.43E−03) (=)
Phoneme | 0.1298 (2.00E−04) | 0.2138 (2.35E−04) | 0.1276 (2.16E−04) (−)
Titanic | 0.2101 (3.00E−04) | 0.2267 (8.25E−04) | 0.2290 (4.95E−04) (−)
Balance | 0.2107 (2.10E−03) | 0.0860 (8.78E−04) | 0.0851 (1.09E−03) (+)
Fertility | 0.1730 (7.20E−03) | x | x
Skin&NonSkin | 0.0004 (1.55E−08) | 0.0165 (6.26E−07) | 0.0006 (2.33E−08) (−)
Wdbc | 0.0705 (1.10E−03) | 0.0433 (6.99E−04) | 0.0450 (6.92E−04) (=)
Australian | 0.1678 (2.13E−03) | 0.2078 (1.75E−03) | 0.1338 (1.08E−03) (−)
Twonorm | 0.0536 (4.22E−05) | 0.0224 (2.37E−05) | 0.0216 (2.82E−05) (=)
Ring | 0.2485 (1.32E−04) | 0.0210 (2.32E−05) | 0.0206 (3.10E−05) (+)
Tae | 0.4275 (1.06E−02) | x | x
Contraceptive | 0.5317 (1.28E−03) | 0.4895 (1.63E−03) | 0.4629 (1.34E−03) (=)
Vehicle | 0.2932 (2.13E−03) | 0.1453 (1.13E−03) | 0.1578 (1.23E−03) (+)
Iris | 0.0507 (2.40E−03) | 0.0260 (1.59E−03) | 0.0260 (1.68E−03) (=)
CLEF2009 | 0.3613 (2.00E−03) | 0.1719 (1.40E−03) | 0.1521 (1.30E−03) (+)

(−) means that VIG-3 is better than VIG-5, (=) means that VIG-3 and VIG-5 are equal, (+) means that VIG-3 is worse than VIG-5. Relative performance was checked using a t-test.
Table 9. Classification error rates and variances of Bagging and AdaBoost. Each cell gives the mean error rate with its variance in parentheses.

File name | Bagging | AdaBoost
Bupa | 0.2741 (4.37E−03) (VIG-3: +) (VIG-5: +) | 0.2587 (3.30E−03) (VIG-3: +) (VIG-5: +)
Artificial | 0.2069 (2.30E−03) (VIG-3: +) (VIG-5: +) | 0.2197 (1.90E−03) (VIG-3: +) (VIG-5: =)
Pima | 0.2357 (2.04E−03) (VIG-3: =) (VIG-5: =) | 0.2444 (1.97E−03) (VIG-3: =) (VIG-5: =)
Sonar | 0.1545 (6.39E−03) (VIG-3: +) (VIG-5: +) | 0.1413 (5.05E−03) (VIG-3: +) (VIG-5: +)
Heart | 0.1700 (4.80E−03) (VIG-3: =) (VIG-5: =) | 0.1896 (4.67E−03) (VIG-3: −) (VIG-5: −)
Phoneme | 0.0878 (1.08E−04) (VIG-3: +) (VIG-5: +) | 0.1920 (2.27E−04) (VIG-3: −) (VIG-5: −)
Titanic | 0.2160 (4.62E−04) (VIG-3: =) (VIG-5: +) | 0.2217 (4.31E−04) (VIG-3: =) (VIG-5: +)
Balance | 0.1570 (1.10E−03) (VIG-3: −) (VIG-5: −) | 0.1334 (6.31E−04) (VIG-3: −) (VIG-5: −)
Fertility | 0.1260 (4.92E−03) (VIG-3: =) | 0.1600 (9.00E−03) (VIG-3: −)
Skin&NonSkin | 0.0004 (1.40E−08) (VIG-3: =) (VIG-5: +) | 0.0428 (1.65E−06) (VIG-3: −) (VIG-5: −)
Wdbc | 0.0362 (6.21E−04) (VIG-3: =) (VIG-5: +) | 0.0330 (4.76E−04) (VIG-3: +) (VIG-5: +)
Australian | 0.1351 (1.68E−03) (VIG-3: −) (VIG-5: =) | 0.1425 (1.53E−03) (VIG-3: −) (VIG-5: −)
Twonorm | 0.0273 (3.20E−05) (VIG-3: −) (VIG-5: −) | 0.0310 (3.76E−05) (VIG-3: −) (VIG-5: −)
Ring | 0.0500 (6.47E−05) (VIG-3: +) (VIG-5: −) | 0.0456 (6.34E−05) (VIG-3: +) (VIG-5: −)
Tae | 0.3353 (1.58E−02) (VIG-3: +) | 0.5145 (1.80E−02) (VIG-3: −)
Contraceptive | 0.4627 (1.55E−03) (VIG-3: =) (VIG-5: =) | 0.4996 (8.99E−04) (VIG-3: −) (VIG-5: −)
Vehicle | 0.2499 (1.58E−03) (VIG-3: −) (VIG-5: −) | 0.4451 (2.87E−03) (VIG-3: −) (VIG-5: −)
Iris | 0.0493 (2.63E−03) (VIG-3: −) (VIG-5: −) | 0.0540 (2.82E−03) (VIG-3: −) (VIG-5: −)
CLEF2009 | 0.1897 (1.64E−03) (VIG-3: −) (VIG-5: −) | 0.5565 (1.91E−03) (VIG-3: −) (VIG-5: −)

(VIG-3: −) and (VIG-5: −) mean that the benchmark algorithm is worse than VIG-3 or VIG-5, (VIG-3: =) and (VIG-5: =) mean that the benchmark algorithm is equal to VIG-3 or VIG-5, and (VIG-3: +) and (VIG-5: +) mean that the benchmark algorithm is better than VIG-3 or VIG-5.
4.3.3. Comparison with other state-of-the-art ensemble methods

We also compared the performance of our proposed method with two well-known ensemble methods, namely Bagging and AdaBoost. Both were implemented in Matlab 2014a with the Decision Tree learning algorithm. To choose the parameters of these methods, we refer to [41–43], in which the number of iterations in AdaBoost and the number of learners in Bagging are both set to 200. Comparing against these well-known ensemble methods allows the classification performance of our method to be assessed against strong baselines. Table 9 shows the experimental results of the two ensemble methods on the nineteen datasets, and the statistical two-sample t-test results comparing VIG-3 and VIG-5 with Bagging and AdaBoost are shown in Table 10.

First, both VIG-based approaches are equally competitive with Bagging, obtaining six wins and six losses for VIG-3 and six wins and seven losses for VIG-5. It should be noted that Bagging is one of the best performing ensemble methods in the literature. In comparison with AdaBoost, VIG-3 achieves better results on 12 and worse results on 5 datasets, while VIG-5 has 11 wins and 4 losses. This is a significant outcome because our approach is not only better in performance but also considerably less time-consuming than AdaBoost with 200 iterations. In our experiments, Bagging and AdaBoost with 200 learners were much more time-consuming than our approach in both training and testing. Specifically, Bagging is on average 8.1 and 7.85 times more time-consuming than VIG-3 for training and testing, respectively, and on average 2.93 and 4.61 times more time-consuming than VIG-5 for training and testing, respectively. For AdaBoost, the corresponding factors are 3.81 and 6 compared with VIG-3, and 1.35 and 3.45 compared with VIG-5. The computation times for each dataset are shown in Figs. 4 and 5.
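For reference, a roughly equivalent configuration of the two baselines in Python with scikit-learn would look as follows. This is an assumption on our part: the experiments in the paper were run in Matlab 2014a, and the exact decision-tree settings (e.g. the depth of the weak learners) are not specified, so defaults and a stump are used here.

```python
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Bagging with 200 decision-tree learners (tree settings assumed, not from the paper).
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=200)

# AdaBoost with 200 boosting iterations over decision-tree weak learners
# (a depth-1 stump is assumed here as the weak learner).
adaboost = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=200)
```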
Table 10. Statistical two-sample t-test results comparing VIG-3 and VIG-5 with Bagging and AdaBoost.

       | VIG-3 vs. Bagging | VIG-5 vs. Bagging | VIG-3 vs. AdaBoost | VIG-5 vs. AdaBoost
Better | 6 | 6 | 12 | 11
Equal  | 7 | 4 | 2  | 2
Worse  | 6 | 7 | 5  | 4
5. Conclusion

We have introduced a novel combining classifier method, denoted VIG, in which the class distributions of Level1 data are exploited to form the decision-making model. Our approach groups the Level1 data of the training observations based on their class labels and estimates the distribution of each class, modeled as a multivariate Gaussian, using VI. Classification is then conducted by maximizing the posterior probability according to the Bayesian decision model. Experimental results on eighteen UCI data files and a 10-class CLEF2009 medical image database demonstrated the benefit of our approach compared with several well-known combining classifier methods. Specifically, the proposed method is better than the individual base classifiers, ensemble methods with fixed combining rules, MLR, Decision Template, AdaBoost, and SCANN, and is highly competitive with Bagging. It is also less time-consuming than the other trainable methods and the two well-known ensemble methods that we experimented with. Besides, VIG is more widely applicable than SCANN, since singularity is not a problem for VIG. In the future, the proposed method could be combined with feature and classifier selection approaches [40,44,45] to further improve its classification performance. In this work, we have shown that having a set of good base classifiers can boost the discrimination ability of Level1 data and improve the overall classification accuracy of the system.

Conflict of interest statement

We wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome.
Fig. 4. Average time of training process (in seconds) of two VIG methods, AdaBoost and Bagging.
Fig. 5. Average time of testing process (in seconds) of two VIG methods, AdaBoost and Bagging.
Acknowledgments

Tien Thanh Nguyen acknowledges the support of a Griffith University International Postgraduate Research Scholarship (GUIPRS).
Appendix A

In equation (3), the Kullback–Leibler (KL) divergence is given by

KL(q ∥ p) = E_q[ ln( q(Z) / p(Z|X) ) ] = ∫ q(Z) ln( q(Z) / p(Z|X) ) dZ

Using p(Z|X) = p(Z, X) / p(X), we can rewrite (3) as

KL(q ∥ p) = −∫ q(Z) ln( p(Z, X) / (p(X) q(Z)) ) dZ = −∫ q(Z) ln( p(Z, X) / q(Z) ) dZ + ln p(X) ∫ q(Z) dZ

Since ∫ q(Z) dZ = 1, we obtain KL(q ∥ p) = ln p(X) − ℒ(q), where ℒ(q) = ∫ q(Z) ln( p(X, Z) / q(Z) ) dZ.

Lemma 1. The Level1 data generated by using fuzzy labels is not full column rank.

Proof. Level1 data generated by using fuzzy labels is an N × MK matrix (1) in which Σ_{m=1}^{M} P_k(y_m | x_n) = 1 for each k = 1,…,K and n = 1,…,N. Due to this property, each classifier's M columns sum to the all-ones vector, so for any fixed classifier i we can exhibit a non-trivial linear combination of all the columns that vanishes:

(K − 1) Σ_{m=1}^{M} [P_i(y_m|x_1), …, P_i(y_m|x_N)]^T − Σ_{j=1, j≠i}^{K} Σ_{m=1}^{M} [P_j(y_m|x_1), …, P_j(y_m|x_N)]^T = 0     (A1)

Therefore the columns are not linearly independent; consequently, the rank of the matrix L is smaller than the number of columns, i.e. L is not full column rank. □

Corollary 1. The covariance matrix of Level1 data is singular.

Proof. Let L̄ denote the N × MK matrix whose rows all equal the vector of column means of L:

L̄ := [ (1/N) Σ_{n=1}^{N} P_1(y_1|x_n)   (1/N) Σ_{n=1}^{N} P_1(y_2|x_n)   ⋯   (1/N) Σ_{n=1}^{N} P_K(y_M|x_n) ]     (A2)

Then the MK × MK covariance matrix of L, denoted by Σ, is computed by Σ = (L − L̄)^T (L − L̄). Based on the properties of the rank operator we have rank(Σ) = rank((L − L̄)^T (L − L̄)) = rank(L − L̄). By Lemma 1, rank(L − L̄) < MK. As a result, the covariance matrix Σ is singular. □
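A quick numerical check of Lemma 1 and Corollary 1 (our own illustration, not part of the paper): for random fuzzy-label outputs whose class posteriors sum to one for every classifier, both the Level1 matrix and its covariance matrix have rank strictly smaller than MK.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, M = 200, 3, 4
# Random fuzzy-label outputs: each classifier's M posteriors sum to 1 per observation.
blocks = [rng.dirichlet(np.ones(M), size=N) for _ in range(K)]
L = np.hstack(blocks)                 # N x MK Level1 matrix as in (1)

cov = np.cov(L, rowvar=False)         # MK x MK covariance matrix
print(np.linalg.matrix_rank(L), np.linalg.matrix_rank(cov), M * K)
# Both ranks are strictly less than MK, so the covariance matrix is singular.
```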
n Lemma 2. Optimum solution q μ of update equation (9) is a 1 n with mean and precision given by Gaussian q μ ¼ N μ j m; H (11) and (12).
210
T.T. Nguyen et al. / Pattern Recognition 49 (2016) 198–212
Proof. The optimal solution for q μ is given by ln qn μ ¼ EΛ ln p X; μ; Λ þconst ¼ EΛ ln p Xj μ; Λ þ ln p μ j Λ þconst T 1 h ¼ EΛ μ m 0 β 0 Λ μ m 0 2 XN T i þ const þ x μ Λ x μ n n n¼1
Eμ
h
μ m0 μ m0
T i
¼ ðm m 0 Þðm m 0 ÞT þ H 1 ; μ N μ j m; H 1
Using N X
N n X
xn xn T ¼
n¼1
N X
¼
ðxn xÞðxn xÞT þ
n¼1
" # N X 1 ln qn μ ¼ EΛ μT β0 Λμ 2μT β 0 Λm0 þ μT Λμ 2μT Λxn þ const 2 n¼1
! N X 1 T T ¼ μ β0 þN E Λ μ þ μ E Λ β0 m0 þ xn þconst 2 n¼1
ðxn xÞðxn xÞT þ xn xT þ xxTn xxT
Completing the square over $\mu$, we see that $q^{*}(\mu)$ is a Gaussian with mean $m$ and precision $H$ given by

$$H = (\beta_0 + N)\,E[\Lambda], \qquad Hm = E[\Lambda]\Big(\beta_0 m_0 + \sum_{n=1}^{N} x_n\Big) = E[\Lambda]\big(\beta_0 m_0 + N\bar{x}\big), \tag{A3}$$

and we can easily infer (11) and (12). □

Lemma 3. The optimum solution $q^{*}(\Lambda)$ of update equation (10) is a Wishart distribution $q^{*}(\Lambda) = \mathcal{W}(\Lambda \mid W, \upsilon)$, with the number of degrees of freedom $\upsilon$ and the scale matrix $W$ given by (13) and (14).

Proof. The optimal solution for $q^{*}(\Lambda)$ is given by

$$\ln q^{*}(\Lambda) = E_{\mu}\big[\ln p(X, \mu, \Lambda)\big] + \mathrm{const} = E_{\mu}\big[\ln p(X \mid \mu, \Lambda) + \ln p(\mu \mid \Lambda) + \ln p(\Lambda)\big] + \mathrm{const} \tag{A4}$$

$$= \frac{N + 1 + \upsilon_0 - D - 1}{2}\ln|\Lambda| - \frac{1}{2}\Big\{\mathrm{Tr}\big(W_0^{-1}\Lambda\big) + E_{\mu}\Big[\beta_0(\mu - m_0)^{T}\Lambda(\mu - m_0) + \sum_{n=1}^{N}(x_n - \mu)^{T}\Lambda(x_n - \mu)\Big]\Big\} + \mathrm{const}. \tag{A5}$$

Rewrite the second term on the right-hand side of $\ln q^{*}(\Lambda)$ as

$$I = -\frac{1}{2}\Big\{\mathrm{Tr}\big(W_0^{-1}\Lambda\big) + E_{\mu}\Big[\mathrm{Tr}\big(\beta_0(\mu - m_0)(\mu - m_0)^{T}\Lambda\big)\Big] + \mathrm{Tr}\Big(\sum_{n=1}^{N} E_{\mu}\big[(x_n - \mu)(x_n - \mu)^{T}\big]\,\Lambda\Big)\Big\} \tag{A6}$$

$$= -\frac{1}{2}\mathrm{Tr}\big[(W_0^{-1} + Q)\,\Lambda\big], \tag{A7}$$

where

$$Q = \beta_0\, E_{\mu}\big[(\mu - m_0)(\mu - m_0)^{T}\big] + \sum_{n=1}^{N} E_{\mu}\big[(x_n - \mu)(x_n - \mu)^{T}\big]. \tag{A8}$$

Since

$$E_{\mu}[\mu] = m, \qquad E_{\mu}\big[\mu\mu^{T}\big] = mm^{T} + H^{-1}, \tag{A9}$$

we have

$$E_{\mu}\big[(\mu - m_0)(\mu - m_0)^{T}\big] = (m - m_0)(m - m_0)^{T} + H^{-1}, \tag{A10}$$

$$E_{\mu}\big[(x_n - \mu)(x_n - \mu)^{T}\big] = (x_n - m)(x_n - m)^{T} + H^{-1}. \tag{A11}$$

With $\bar{x} = \frac{1}{N}\sum_{n=1}^{N} x_n$ and $S = \sum_{n=1}^{N}(x_n - \bar{x})(x_n - \bar{x})^{T}$,

$$\sum_{n=1}^{N}(x_n - m)(x_n - m)^{T} = \sum_{n=1}^{N} x_n x_n^{T} - N\bar{x}m^{T} - Nm\bar{x}^{T} + Nmm^{T} = S + N(\bar{x} - m)(\bar{x} - m)^{T}. \tag{A12}$$

Since $H = (\beta_0 + N)E[\Lambda]$ and $Hm = E[\Lambda](\beta_0 m_0 + N\bar{x})$, i.e. $m = (\beta_0 m_0 + N\bar{x})/(\beta_0 + N)$, we have

$$m - m_0 = \frac{N}{\beta_0 + N}(\bar{x} - m_0), \tag{A13}$$

$$\bar{x} - m = \frac{\beta_0}{\beta_0 + N}(\bar{x} - m_0), \tag{A14}$$

$$(m - m_0)(m - m_0)^{T} = \frac{N^{2}}{(\beta_0 + N)^{2}}\,J, \tag{A15}$$

$$(\bar{x} - m)(\bar{x} - m)^{T} = \frac{\beta_0^{2}}{(\beta_0 + N)^{2}}\,J, \qquad J = (\bar{x} - m_0)(\bar{x} - m_0)^{T}, \tag{A16}$$

and we obtain

$$E_{\mu}\big[(\mu - m_0)(\mu - m_0)^{T}\big] = \frac{N^{2}}{(\beta_0 + N)^{2}}\,J + H^{-1}, \tag{A17}$$

$$\sum_{n=1}^{N} E_{\mu}\big[(x_n - \mu)(x_n - \mu)^{T}\big] = S + \frac{\beta_0^{2}N}{(\beta_0 + N)^{2}}\,J + NH^{-1}, \tag{A18}$$

$$Q = (\beta_0 + N)\,H^{-1} + S + \frac{\beta_0 N}{\beta_0 + N}\,J. \tag{A19}$$

Therefore, $q^{*}(\Lambda)$ is a Wishart distribution of the form $q^{*}(\Lambda) = \mathcal{W}(\Lambda \mid W, \upsilon)$, with $W$ and $\upsilon$ computed as in (13) and (14). □

Lemma 4. The variational lower bound $\mathcal{L}(q)$ is computed as in (15).

Proof. Based on the expression of the lower bound, we have

$$\mathcal{L} = \iint q(\mu, \Lambda)\,\ln\frac{p(X, \mu, \Lambda)}{q(\mu, \Lambda)}\,d\mu\,d\Lambda = E\big[\ln p(X, \mu, \Lambda)\big] - E\big[\ln q(\mu, \Lambda)\big]$$
$$= E\big[\ln p(X \mid \mu, \Lambda)\big] + E\big[\ln p(\mu \mid \Lambda)\big] + E\big[\ln p(\Lambda)\big] - E\big[\ln q(\mu)\big] - E\big[\ln q(\Lambda)\big], \tag{A20}$$

in which the first component is computed by

$$E\big[\ln p(X \mid \mu, \Lambda)\big] = \frac{N}{2}E\big[\ln|\Lambda|\big] - \frac{ND}{2}\ln(2\pi) - \frac{1}{2}E\Big[\sum_{n=1}^{N}(x_n - \mu)^{T}\Lambda(x_n - \mu)\Big], \tag{A21}$$

where

$$E\Big[\sum_{n=1}^{N}(x_n - \mu)^{T}\Lambda(x_n - \mu)\Big] = \mathrm{Tr}\Big(E[\Lambda]\,E_{\mu}\Big[\sum_{n=1}^{N}(x_n - \mu)(x_n - \mu)^{T}\Big]\Big) = \mathrm{Tr}\Big(E[\Lambda]\Big(S + \frac{\beta_0^{2}N}{(\beta_0 + N)^{2}}\,J + NH^{-1}\Big)\Big)$$
$$= \upsilon\,\mathrm{Tr}(SW) + \frac{\upsilon\beta_0^{2}N}{(\beta_0 + N)^{2}}\,\mathrm{Tr}(JW) + \frac{ND}{\beta_0 + N}, \tag{A22}$$

and where we have used $\Lambda \sim \mathcal{W}(\Lambda \mid W, \upsilon)$, $E[\Lambda] = \upsilon W$ and $H^{-1} = E[\Lambda]^{-1}/(\beta_0 + N)$. Substituting (A22) into (A21), we get

$$E\big[\ln p(X \mid \mu, \Lambda)\big] = \frac{1}{2}\Big\{N E\big[\ln|\Lambda|\big] - ND\ln(2\pi) - \upsilon\,\mathrm{Tr}(SW) - \frac{\upsilon\beta_0^{2}N}{(\beta_0 + N)^{2}}\,\mathrm{Tr}(JW) - \frac{ND}{\beta_0 + N}\Big\}. \tag{A23}$$

Referring to (A17), we get the second component in the lower bound's expression:

$$E\big[\ln p(\mu \mid \Lambda)\big] = \frac{1}{2}\Big\{D\ln\beta_0 + E\big[\ln|\Lambda|\big] - D\ln(2\pi) - \beta_0\,\mathrm{Tr}\big(E_{\mu}\big[(\mu - m_0)(\mu - m_0)^{T}\big]\,E[\Lambda]\big)\Big\}$$
$$= \frac{1}{2}\Big\{D\ln\beta_0 + E\big[\ln|\Lambda|\big] - D\ln(2\pi) - \frac{D\beta_0}{\beta_0 + N} - \frac{\upsilon\beta_0 N^{2}}{(\beta_0 + N)^{2}}\,\mathrm{Tr}(JW)\Big\}. \tag{A24}$$

In the same way, we obtain

$$E\big[\ln p(\Lambda)\big] = \ln B(W_0, \upsilon_0) + \frac{\upsilon_0 - D - 1}{2}E\big[\ln|\Lambda|\big] - \frac{1}{2}\mathrm{Tr}\big(W_0^{-1}E[\Lambda]\big) = \ln B(W_0, \upsilon_0) + \frac{\upsilon_0 - D - 1}{2}E\big[\ln|\Lambda|\big] - \frac{\upsilon}{2}\mathrm{Tr}\big(W_0^{-1}W\big), \tag{A25}$$

$$E\big[\ln q(\mu)\big] = E\big[\ln\mathcal{N}(\mu \mid m, H^{-1})\big] = \frac{1}{2}\Big\{\ln|H| - D\ln(2\pi) - \mathrm{Tr}\big(H\,E_{\mu}\big[(\mu - m)(\mu - m)^{T}\big]\big)\Big\} = \frac{1}{2}\big\{\ln|H| - D\ln(2\pi) - D\big\}, \tag{A26}$$

$$E\big[\ln q(\Lambda)\big] = E\big[\ln\mathcal{W}(\Lambda \mid W, \upsilon)\big] = \ln B(W, \upsilon) + \frac{\upsilon - D - 1}{2}E\big[\ln|\Lambda|\big] - \frac{1}{2}\mathrm{Tr}\big(W^{-1}E[\Lambda]\big) = \ln B(W, \upsilon) + \frac{\upsilon - D - 1}{2}E\big[\ln|\Lambda|\big] - \frac{\upsilon D}{2}, \tag{A27}$$

where

$$\ln B(W_0, \upsilon_0) = -\frac{\upsilon_0}{2}\ln|W_0| - \frac{\upsilon_0 D}{2}\ln 2 - \frac{D(D-1)}{4}\ln\pi - \sum_{i=1}^{D}\ln\Gamma\Big(\frac{\upsilon_0 + 1 - i}{2}\Big), \tag{A28}$$

$$\ln B(W, \upsilon) = -\frac{\upsilon}{2}\ln|W| - \frac{\upsilon D}{2}\ln 2 - \frac{D(D-1)}{4}\ln\pi - \sum_{i=1}^{D}\ln\Gamma\Big(\frac{\upsilon + 1 - i}{2}\Big). \tag{A29}$$

Using (A23)–(A29), we obtain the lower bound given by (15). □

Appendix B

The values of $\mathcal{L}_i - \mathcal{L}_{i-1}$ ($i \geq 2$) in one experiment on all nineteen datasets with VIG-3 are shown in Fig. A1.
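To make the iterative scheme behind these updates concrete, the following is a minimal NumPy/SciPy sketch, not the authors' implementation: it covers only the per-class Gaussian estimation step (not the full ensemble pipeline), assumes the Gaussian–Wishart model used in the appendix ($x_n \sim \mathcal{N}(\mu, \Lambda^{-1})$, $p(\mu \mid \Lambda) = \mathcal{N}(\mu \mid m_0, (\beta_0\Lambda)^{-1})$, $p(\Lambda) = \mathcal{W}(\Lambda \mid W_0, \upsilon_0)$), and all function and variable names (`vi_gaussian`, `log_B`, etc.) are illustrative rather than taken from the paper. The bound is assembled from the five components in (A20) in their generic form, which reduces to (A23)–(A29) once the two updates are mutually consistent; successive differences of the returned `bounds` list should be non-negative, mirroring the convergence check reported in Appendix B.

```python
# Minimal sketch of the coupled variational updates and lower bound of Appendix A.
import numpy as np
from scipy.special import gammaln, digamma

def log_B(W, nu):
    """log of the Wishart normalisation constant B(W, nu), cf. (A28)-(A29)."""
    D = W.shape[0]
    return (-0.5 * nu * np.linalg.slogdet(W)[1]
            - 0.5 * nu * D * np.log(2.0)
            - 0.25 * D * (D - 1) * np.log(np.pi)
            - gammaln(0.5 * (nu + 1 - np.arange(1, D + 1))).sum())

def vi_gaussian(X, m0, beta0, W0, nu0, max_iter=100, tol=1e-9):
    """Iterate q*(mu) = N(m, H^{-1}) and q*(Lambda) = W(W, nu), tracking L(q)."""
    N, D = X.shape
    xbar = X.mean(axis=0)
    S = (X - xbar).T @ (X - xbar)          # scatter about the sample mean
    nu = nu0 + N + 1                       # degrees of freedom (cf. Lemma 3)
    W = W0.copy()                          # initial scale matrix of q(Lambda)
    W0_inv = np.linalg.inv(W0)
    bounds = []
    for _ in range(max_iter):
        # q*(mu): precision and mean, cf. (A3)
        E_Lambda = nu * W
        H = (beta0 + N) * E_Lambda
        H_inv = np.linalg.inv(H)
        m = (beta0 * m0 + N * xbar) / (beta0 + N)
        # q*(Lambda): scale matrix from Q, cf. (A8) with (A10)-(A12)
        Q = (beta0 * np.outer(m - m0, m - m0)
             + S + N * np.outer(xbar - m, xbar - m)
             + (beta0 + N) * H_inv)
        W = np.linalg.inv(W0_inv + Q)
        # lower bound from the five components of (A20)
        E_logdet = (digamma(0.5 * (nu + 1 - np.arange(1, D + 1))).sum()
                    + D * np.log(2.0) + np.linalg.slogdet(W)[1])
        E_Lambda = nu * W
        lb = (0.5 * N * (E_logdet - D * np.log(2 * np.pi))
              - 0.5 * np.trace(E_Lambda @ (S + N * np.outer(xbar - m, xbar - m) + N * H_inv)))
        lb += (0.5 * (D * np.log(beta0) + E_logdet - D * np.log(2 * np.pi))
               - 0.5 * beta0 * np.trace(E_Lambda @ (np.outer(m - m0, m - m0) + H_inv)))
        lb += log_B(W0, nu0) + 0.5 * (nu0 - D - 1) * E_logdet - 0.5 * np.trace(W0_inv @ E_Lambda)
        lb -= 0.5 * (np.linalg.slogdet(H)[1] - D * np.log(2 * np.pi) - D)
        lb -= log_B(W, nu) + 0.5 * (nu - D - 1) * E_logdet - 0.5 * nu * D
        bounds.append(lb)
        if len(bounds) > 1 and abs(bounds[-1] - bounds[-2]) < tol:
            break
    return m, H, W, nu, bounds

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 3)) @ np.diag([1.0, 2.0, 0.5]) + np.array([1.0, -1.0, 0.0])
    D = X.shape[1]
    m, H, W, nu, bounds = vi_gaussian(X, m0=np.zeros(D), beta0=1.0,
                                      W0=np.eye(D), nu0=D + 1)
    print(np.min(np.diff(bounds)) >= -1e-10)   # bound should be non-decreasing
```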
Tien Thanh Nguyen is currently a Ph.D. student at the School of Information & Communication Technology, Griffith University, Australia. His research interests are in the fields of machine learning, pattern recognition, image processing and evolutionary computation. He has been a member of the IEEE since 2014.
Thi Thu Thuy Nguyen is currently a lecturer at the Faculty of Economic Information System, College of Economics, Hue University, Vietnam. She graduated from the Faculty of Mathematics, Voronezh State University, Russia, in 2008. Her research interests include machine learning, pattern recognition, and image processing.
Xuan Cuong Pham graduated from the School of Information and Communication Technology, Hanoi University of Science and Technology, Vietnam, in 2014 and worked for a short time in the R&D department of Samsung Co. Ltd, Vietnam. His research interests include machine learning, image processing, information retrieval and computer vision.
Alan Wee-Chung Liew is currently an Associate Professor with the School of Information & Communication Technology, Griffith University, Australia. His research interests are in the fields of medical imaging, bioinformatics, computer vision, pattern recognition, and machine learning. He serves on the technical program committees of many international conferences and is on the editorial boards of several journals, including the IEEE Transactions on Fuzzy Systems. He has been a senior member of the IEEE since 2005.