Neurocomputing 72 (2009) 1648–1655
Interpretation of hybrid generative/discriminative algorithms

Jing-Hao Xue a,b, D. Michael Titterington a

a Department of Statistics, University of Glasgow, Glasgow G12 8QQ, UK
b Department of Statistical Science, University College London, London WC1E 6BT, UK
Article history: Received 5 November 2007; received in revised form 16 March 2008; accepted 28 August 2008. Communicated by T. Heskes. Available online 1 October 2008.

Abstract

In discriminant analysis, probabilistic generative and discriminative approaches represent two paradigms of statistical modelling and learning. In order to exploit the best of both worlds, hybrid modelling and learning techniques have attracted much research interest recently, one example being the so-called hybrid generative/discriminative algorithm proposed in Raina et al. [Classification with hybrid generative/discriminative models, in: NIPS, 2003] and its multi-class extension [A. Fujino, N. Ueda, K. Saito, A hybrid generative/discriminative approach to text classification with additional information, Inf. Process. Manage. 43 (2) (2007) 379–392]. In this paper, we interpret this hybrid algorithm from three perspectives, namely class-conditional probabilities, class-posterior probabilities and loss functions underlying the model. We suggest that the hybrid algorithm is by nature a generative model with its parameters learnt through both generative and discriminative approaches, in the sense that it assumes a scaled data-generation process and uses scaled class-posterior probabilities to perform discrimination. Our suggestion can also be applied to its multi-class extension. In addition, using simulated and real-world data, we compare the performance of the normalised hybrid algorithm as a classifier with that of the naïve Bayes classifier and linear logistic regression. Our simulation studies suggest in general the following: if the covariance matrices are diagonal matrices, the naïve Bayes classifier performs the best; if the covariance matrices are full matrices, linear logistic regression performs the best. Our studies also suggest that the hybrid algorithm may provide worse performance than either the naïve Bayes classifier or linear logistic regression alone.

Keywords: Hybrid generative/discriminative models; Probabilistic generative and discriminative approaches; Statistical modelling and learning
1. Introduction

In recent years, under the new terminology of generative and discriminative approaches, research interest in the classical statistical modelling and learning approaches to discriminant analysis, namely the sampling paradigm and the diagnostic paradigm [3,12], has re-emerged in the machine learning community.

In discriminant analysis, observations with features x are classified into classes labelled by a categorical variable y. The generative approach, such as normal-based discriminant analysis and the naïve Bayes classifier, models the joint distribution $p(x, y)$ of the features and the group labels, factorised in the form $p(x|y)p(y)$, and learns the model parameters by maximising the corresponding likelihood; the discriminative approach, such as logistic regression, models the conditional distribution $p(y|x)$ of the group labels given the features, and learns the model
parameters by maximising the corresponding conditional likelihood. Each of these two paradigms has its advantages and disadvantages relative to the other [4,11,9].

In order to exploit the best of both worlds, Bouchard and Triggs [2] propose the trade-off approach to modelling both $p(y|x)$ and $p(x, y)$, and McCallum et al. [8] propose the multi-conditional learning approach to modelling $p(y|x)$ and the data-generation process (DGP) $p(x|y)$. In the sense that the generative and discriminative components within both approaches are derived from the joint distribution $p(x, y)$, they can be regarded as hybrid learning of generative models [13]. Another interesting idea in this direction, proposed by Raina et al. [10], is the so-called hybrid generative/discriminative algorithm, which assigns different weights to the partitions of the features within x, learning most parameters generatively but the weights discriminatively.

In this paper, we first interpret the hybrid algorithm from three perspectives, namely class-conditional probabilities, class-posterior probabilities and loss functions underlying the model, and then discuss one of its multi-class extensions. Finally, by using simulated and real-world data,
we compare its performance as a classifier with that of the naïve Bayes classifier and linear logistic regression.
2. Interpretation of the hybrid algorithm
Consider classifying an observation with h features into one of K groups by a classifier $\hat{y}$, which was trained by using the observed features and group labels of m other so-called training observations. We use an h-variate random vector $x = (x_1, \dots, x_h)^T$ to represent the h features of the observation and a random categorical variable $y \in \{1, \dots, K\}$ to represent the group label. We denote a classifier of x by $\hat{y}(x)$, and the loss function for misclassifying x, which arises from the group y, into the group $\hat{y}(x)$ is $L(y, \hat{y}(x))$.

2.1. Class-conditional probabilities

For binary classification, where $K = 2$, based on Bayes' Theorem, the Bayes discriminant criterion (i.e., $\hat{y}(x) = \arg\max_y p(y|x)$) of the generative classifier for classifying x into the group $y = 1$ can be written as $p(x, y=1) \ge p(x, y=2)$, or equivalently $p(y=1)p(x|y=1) \ge p(y=2)p(x|y=2)$. In addition, specific generative classifiers, such as linear normal-based discriminant analysis with a common diagonal covariance matrix (denoted by LDA-L) and the naïve Bayes classifier, assume that the h features are conditionally independent given the group label y, i.e., $p(x|y) = \prod_{i=1}^{h} p(x_i|y)$.

In the normalised hybrid and the unnormalised hybrid algorithms proposed by Raina et al. [10], the feature vector x is divided into R partial feature vectors $x^1, \dots, x^R$, because they suggest different levels of importance for different partitions, or partial feature vectors; for example, $x^1$ may represent the message subject of an email while $x^2$ represents the message body. As with Raina et al. [10], we focus on $R = 2$, such that $x = ((x^1)^T, (x^2)^T)^T$, $x^1 = (x_1, \dots, x_{h_1})^T$, $x^2 = (x_{h_1+1}, \dots, x_h)^T$ and $h_2 = h - h_1$, and assume that the discriminant criterion of the generative classifiers can be rewritten as $p(y=1)p(x^1|y=1)p(x^2|y=1) \ge p(y=2)p(x^1|y=2)p(x^2|y=2)$. Thus, given $p(x, y) \ne 0$, the corresponding discriminant function $\lambda_G(x) = \log\{p(y=1|x)/p(y=2|x)\}$ can be expressed in terms of likelihood ratios as

$$\lambda_G(x) = \log\frac{p(y=1)}{p(y=2)} + \log\frac{p(x^1|y=1)}{p(x^1|y=2)} + \log\frac{p(x^2|y=1)}{p(x^2|y=2)}.$$

Such a representation can be obtained by assuming the generative DGP $p(x|y) = w(x^1, x^2)p(x^1|y)p(x^2|y)$, where $w(x^1, x^2)$ can be regarded as a normalisation factor. However, if, for all y, $p(x^1|y)$ and $p(x^2|y)$ are proper marginal distributions derived from $p(x|y)$ (i.e., $p(x^1|y) = \sum_{x^2} p(x|y)$, $p(x^2|y) = \sum_{x^1} p(x|y)$ and $\sum_{x} p(x|y) = \sum_{x^1} p(x^1|y) = \sum_{x^2} p(x^2|y) = 1$), then $w(x^1, x^2) \equiv 1$, given that there exists some x such that $p(x|y=1) \ne p(x|y=2)$. In other words, it leads to assuming conditional independence between the partial feature vectors $x^1|y$ and $x^2|y$, such that $p(x|y) = p(x^1|y)p(x^2|y)$. In addition, to some extent for the sake of a simple implementation in practice, Raina et al. [10] further assume that $p(x^1|y) = \prod_{j=1}^{h_1} p(x_j|y)$ and $p(x^2|y) = \prod_{j=h_1+1}^{h} p(x_j|y)$; these imply the conditional independence of the elements within $x^1$ and $x^2$ given y, respectively.

Raina et al. [10] introduce two additional parameters $\theta_1$ and $\theta_2$ into the discriminant criterion, leading to different weights for different partial feature vectors in the discrimination. Two ways of weighting are proposed by Raina et al. [10]: one corresponds to assigning x to the group $y = 1$ if

$$p(y=1)p(x^1|y=1)^{\theta_1/h_1}p(x^2|y=1)^{\theta_2/h_2} \ge p(y=2)p(x^1|y=2)^{\theta_1/h_1}p(x^2|y=2)^{\theta_2/h_2},$$

which is the criterion (denoted by Criterion-H) corresponding to the normalised hybrid algorithm; the other gives

$$p(y=1)p(x^1|y=1)^{\theta_1}p(x^2|y=1)^{\theta_2} \ge p(y=2)p(x^1|y=2)^{\theta_1}p(x^2|y=2)^{\theta_2},$$

which is the criterion corresponding to the unnormalised hybrid algorithm. Without loss of generality, in this paper we focus on the normalised hybrid algorithm.

Let us write $\theta = (\theta_1, \theta_2)^T$. Then the hybrid algorithm can be derived from

$$p_\theta(x|y) = w_\theta(x^1, x^2)\,p(x^1|y)^{\theta_1/h_1}p(x^2|y)^{\theta_2/h_2}$$

and

$$p_\theta(x, y) = p(y)\,p_\theta(x|y),$$

where $w_\theta(x^1, x^2)$ is independent of the group y, so that it cancels out of Criterion-H, but it is not necessarily further factorised as $w_\theta(x^1, x^2) = w^1_\theta(x^1)w^2_\theta(x^2)$. However, in order to maintain $p_\theta(x|y)$ as a proper probability distribution (so that Criterion-H is derived from a proper probabilistic model), with the marginal distributions $p(x^1|y) = \sum_{x^2} p_\theta(x|y)$ and $p(x^2|y) = \sum_{x^1} p_\theta(x|y)$, it is required that, for all y,

$$\sum_{x^2} w_\theta(x^1, x^2)\,p(x^2|y)^{\theta_2/h_2} = p(x^1|y)^{1-\theta_1/h_1},$$

$$\sum_{x^1} w_\theta(x^1, x^2)\,p(x^1|y)^{\theta_1/h_1} = p(x^2|y)^{1-\theta_2/h_2}.$$

In some cases, it might be difficult to validate the existence of such a $w_\theta(x^1, x^2)$, e.g., when $\theta_1/h_1 = 1$ while $\theta_2/h_2 \ne 1$ or vice versa, as the sums on the left-hand sides of the above equations, which are in terms of x, have to become independent of y. In other cases, further assumptions might be needed to guarantee the existence. We illustrate this by assuming that $w_\theta(x^1, x^2)$ can be further factorised as $w_\theta(x^1, x^2) = w^1_\theta(x^1)w^2_\theta(x^2)$; in other words, we assume conditional independence between $x^1|y$ and $x^2|y$. It follows that

$$p_\theta(x|y) = w^1_\theta(x^1)p(x^1|y)^{\theta_1/h_1}\,w^2_\theta(x^2)p(x^2|y)^{\theta_2/h_2},$$

which also leads to Criterion-H. One option for $w_\theta(x^1, x^2)$ is, for all y,

$$w^1_\theta(x^1) = q(y)\,p(x^1|y)^{1-\theta_1/h_1}, \qquad w^2_\theta(x^2) = \frac{1}{q(y)}\,p(x^2|y)^{1-\theta_2/h_2},$$

where $q(y)$ is a non-zero function used to cancel out terms in y within $p(x^1|y)^{1-\theta_1/h_1}$ and $p(x^2|y)^{1-\theta_2/h_2}$. If such a $w_\theta(x^1, x^2)$ cannot be found, Criterion-H is not a Bayes discriminant criterion derived from a proper probabilistic model; nevertheless, in practice it can still be used as a criterion for discrimination, although in this case the hybrid algorithm is no longer a true Bayes classifier and, under a 0–1 loss function, it cannot provide a minimum Bayes error.

Under Criterion-H, we classify x into $y = 1$ if $p_\theta(x, y=1) \ge p_\theta(x, y=2)$. Given $p_\theta(x, y) \ne 0$, the discriminant function $\lambda_H(x)$ of the hybrid algorithm can be expressed in terms of weighted likelihood ratios as

$$\lambda_H(x) = \log\frac{p(y=1)}{p(y=2)} + \frac{\theta_1}{h_1}\log\frac{p(x^1|y=1)}{p(x^1|y=2)} + \frac{\theta_2}{h_2}\log\frac{p(x^2|y=1)}{p(x^2|y=2)}.$$

Therefore, $\lambda_H(x)$ can be viewed as a "weighted" version of the discriminant function $\lambda_G(x)$ of the generative classifier; however, as mentioned above, in theory the hybrid algorithm should satisfy some conditions on the marginal distributions in order to
make the underlying model probabilistically valid. In addition, as with $\lambda_G(x)$, most parameters in $\lambda_H(x)$, such as those for $p(x^1|y)$ and $p(x^2|y)$, are learnt by using a generative approach; only a few parameters, such as the two weights $\theta_1$ and $\theta_2$, are then learnt by using a discriminative approach based on the learning results (about $p(x^1|y)$ and $p(x^2|y)$) from the generative approach. Therefore, the hybrid algorithm can be regarded as a generative classifier, since it assumes the DGP $p(x|y)$ and thus $p(x, y)$.

With the assumption of conditional independence between $x^1|y$ and $x^2|y$, it follows that the two class-conditional probabilities $p(x|y)$ and $p_\theta(x|y)$ are related by

$$p_\theta(x|y) = p(x|y)\,\{w_\theta(x^1, x^2)\,p(x^1|y)^{\theta_1/h_1-1}p(x^2|y)^{\theta_2/h_2-1}\}.$$

This indicates that, in practice, the hybrid algorithm assumes a scaled DGP $p_\theta(x|y)$ which scales the generative DGP $p(x|y)$ by a function not only of the group label y but also of the feature vector x.

2.2. Class-posterior probabilities

The second perspective for interpreting the hybrid algorithm is via its modelling of class-posterior probabilities:

$$p_\theta(y|x) = \frac{p_\theta(x, y)}{p_\theta(x)} = \frac{p(y)\,w_\theta(x^1, x^2)\,p(x^1|y)^{\theta_1/h_1}p(x^2|y)^{\theta_2/h_2}}{p_\theta(x)} = \frac{p(y)\,p(x^1|y)^{\theta_1/h_1}p(x^2|y)^{\theta_2/h_2}}{p_\theta(x)/w_\theta(x^1, x^2)},$$

where $p_\theta(x) = \sum_y p_\theta(x, y) = p_\theta(x, y=1) + p_\theta(x, y=2)$. According to Bayes' Theorem, the class-posterior probabilities in terms of the generative DGP $p(x|y)$ are $p(y|x) = p(y)p(x|y)/p(x)$; it follows that

$$p_\theta(y|x) = p(y|x)\,\frac{p(x)}{p_\theta(x)}\,w_\theta(x^1, x^2)\,p(x^1|y)^{\theta_1/h_1-1}p(x^2|y)^{\theta_2/h_2-1}.$$

This indicates that the normalised hybrid algorithm assumes scaled class-posterior probabilities $p_\theta(y|x)$ which scale the posterior probabilities $p(y|x)$ by a function not only of the feature vector x but also of the group label y.

2.3. Loss functions

In order to find the best classifier, one of the optimality criteria is to minimise the so-called unconditional or total risk

$$R(\hat{y}) = E_y[E_{x|y}[L(y, \hat{y}(x))]] = E_x[E_{y|x}[L(y, \hat{y}(x))]].$$

To minimise this criterion, it suffices to minimise the Bayes error, also called the Bayes risk,

$$E_{y|x}[L(y, \hat{y}(x))] = \sum_{y=1}^{K} p(y|x)\,L(y, \hat{y}(x)).$$

A simple and widely used loss function is a 0–1 loss such that $L(y, \hat{y}(x)) = 1$ if $\hat{y} \ne y$ and 0 otherwise. This leads to a Bayes classifier, $\hat{y}(x) = \arg\max_y p(y|x)$. Since there are many loss functions that can lead to the normalised hybrid algorithm, here we present only one loss function, fixing $L(y, \hat{y}(x)) = 0$ if $\hat{y} = y$.

Proposition 1. If the number of groups is $K \ge 2$, and it is assumed that, given y, $L(y, \hat{y}(x)) = L_y$ is independent of $\hat{y}(x)$ if $\hat{y} \ne y$, then the hybrid algorithm proposed in Raina et al. [10] can be obtained through minimising the Bayes error with a loss function $L(y, \hat{y}(x))$ such that $L(y, \hat{y}(x)) = L_y$ if $\hat{y} \ne y$ and 0 otherwise, where

$$L_y = \frac{p(x^1|y)^{\theta_1/h_1}p(x^2|y)^{\theta_2/h_2}}{p(x|y)},$$
in which $h_1$ and $h_2$ are the dimensions of $x^1$ and $x^2$, and $x = ((x^1)^T, (x^2)^T)^T$. A generalisation of such a loss function is $L_y = p_\theta(x|y)/p(x|y)$.

Proof. The Bayes error for a classifier $\hat{y}(x)$ with such a loss function $L(y, \hat{y}(x))$ is minimised by

$$\hat{y}(x) = \arg\min_{\hat{y}} \sum_{y \ne \hat{y}} p(y|x)\,L_y = \arg\min_{\hat{y}} \sum_{y \ne \hat{y}} p(x^1|y)^{\theta_1/h_1}p(x^2|y)^{\theta_2/h_2}p(y) = \arg\min_{\hat{y}} \{-p(x^1|\hat{y})^{\theta_1/h_1}p(x^2|\hat{y})^{\theta_2/h_2}p(\hat{y})\} = \arg\max_{y} p(x^1|y)^{\theta_1/h_1}p(x^2|y)^{\theta_2/h_2}p(y),$$

which is Criterion-H. The proof for the generalisation of $L_y$ can be obtained similarly by replacing $p(x^1|y)^{\theta_1/h_1}p(x^2|y)^{\theta_2/h_2}$ with $p_\theta(x|y)$. □

From Proposition 1, we observe that the loss from misclassification by the hybrid algorithm depends on the accuracy of the approximation of the true DGP $p(x|y)$ by the assumed one, $p_\theta(x|y)$ say. The closer $p_\theta(x|y)$ is to $p(x|y)$, the more closely can $L(y, \hat{y}(x))$ be approximated by a 0–1 loss function. Furthermore, in contrast to the 0–1 loss, $L_y$ is dependent on x.

2.4. A multi-class extension

Fujino et al. [5] present the result of a multi-class and multi-partition extension of the hybrid algorithm, obtained by maximising a conditional entropy of $p(y|x)$ under certain constraints associated with the joint distribution $p(x, y)$ and the class-conditional probabilities $p(x^r|y)$ for each partial feature vector $x^r$, $r = 1, \dots, R$, as

$$p(y|x) = \frac{e^{\mu_y} \prod_{r=1}^{R} p(x^r|y)^{\lambda_r}}{\sum_{y} e^{\mu_y} \prod_{r=1}^{R} p(x^r|y)^{\lambda_r}},$$

where $\lambda_r$ and $\mu_y$ are Lagrange multipliers. This result is equivalent to a straightforward extension of the hybrid algorithm, in which $\lambda_r = \theta_r/h_r$ and $\mu_y = \log p(y) + \log w_\theta(x)$.
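To make this equivalence explicit in the binary case with $R = 2$: substituting $\lambda_r = \theta_r/h_r$ and $\mu_y = \log p(y) + \log w_\theta(x^1, x^2)$ into the expression above, and noting that $w_\theta(x^1, x^2)$ does not depend on y, gives

$$p(y|x) = \frac{p(y)\,w_\theta(x^1, x^2)\,p(x^1|y)^{\theta_1/h_1}p(x^2|y)^{\theta_2/h_2}}{\sum_{y'} p(y')\,w_\theta(x^1, x^2)\,p(x^1|y')^{\theta_1/h_1}p(x^2|y')^{\theta_2/h_2}} = \frac{p_\theta(x, y)}{p_\theta(x)} = p_\theta(y|x),$$

which is exactly the scaled class-posterior probability of Section 2.2.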
3. Parameter estimation, implementation and evaluation of the classifiers

3.1. Discriminative learning of θ

By "hybrid", the normalised hybrid algorithm proposed in Raina et al. [10] means to use a discriminative approach to the estimation of θ, such that
$$\hat{\theta} = \arg\max_{\theta} \sum_{i=1}^{m} \log p_\theta(y^{(i)}|x^{(i)}) = \arg\max_{\theta} \sum_{i=1}^{m} \log\frac{p_\theta(x^{(i)}, y^{(i)})}{\sum_{y} p_\theta(x^{(i)}, y)},$$

where m is the number of independent training observations $\{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$, in which $(x^{(i)})^T = ((x^{1,(i)})^T, (x^{2,(i)})^T)$. If y is a binary variable such that $y \in \{1, 2\}$, $p_\theta(y=1|x)$ can be written in a way similar to that of logistic regression:

$$p_\theta(y=1|x) = \frac{\exp(\lambda_H(x))}{1 + \exp(\lambda_H(x))},$$

where $\lambda_H(x)$, as defined in Section 2.1, is the discriminant function corresponding to Criterion-H. As with linear logistic regression, $\lambda_H(x)$ is a linear function of $\theta_1$ and $\theta_2$.
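Writing $g_r(x) = \log\{p(x^r|y=1)/p(x^r|y=2)\}$, $r = 1, 2$, for the generatively estimated log-likelihood ratios (a notation introduced here only for illustration), the linearity can be made explicit:

$$\lambda_H(x) = \log\frac{p(y=1)}{p(y=2)} + \frac{\theta_1}{h_1}\,g_1(x) + \frac{\theta_2}{h_2}\,g_2(x),$$

so that, once $g_1(x)$ and $g_2(x)$ have been computed, estimating θ amounts to a logistic-regression-type fit with $g_1$ and $g_2$ as covariates; this is the form exploited in the implementation of Section 3.2.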
Instead of using maximisation, we minimise the negative log-likelihood $\ell_H$ to estimate $\theta_1$ and $\theta_2$, where

$$\ell_H = -\sum_{i=1}^{m} \log p_\theta(y^{(i)}|x^{(i)}) = \sum_{i=1}^{m} \{1_{\{y^{(i)}=1\}} \log(1 + e^{-\lambda_H(x^{(i)})}) + 1_{\{y^{(i)}=2\}} \log(1 + e^{\lambda_H(x^{(i)})})\}.$$
Concerning $\lambda_H(x)$, in order to estimate the parameters in the same discriminative way as in linear logistic regression, Raina et al. [10] redefine θ as $\theta = (\theta_0, \theta_1, \theta_2)^T$, where $\theta_0 = \log\{p(y=1)/p(y=2)\}$, similar to the intercept in a linear logistic regression model, is estimated discriminatively; i.e., $\log\{p(y=1)/p(y=2)\}$ is not calculated by using generative estimators of $p(y=1)$ and $p(y=2)$ but is directly estimated by a discriminative approach. Apart from that, $\log\{p(x^1|y=1)/p(x^1|y=2)\}$ and $\log\{p(x^2|y=1)/p(x^2|y=2)\}$ are estimated by a generative approach.

Considering that the discriminative estimator of θ uses outputs from the generative estimator of $p(x|y)$ as inputs while both estimators use the same training set, Raina et al. [10] suggest that the discriminative estimator of θ is biased. Consequently, they use a "leave-one-out" strategy as follows:

$$\hat{\theta}_i = \arg\max_{\theta} \sum_{i=1}^{m} \log p_{\theta,i}(y^{(i)}|x^{(i)}) = \arg\max_{\theta} \sum_{i=1}^{m} \log\frac{p_{\theta,i}(x^{(i)}, y^{(i)})}{\sum_{y} p_{\theta,i}(x^{(i)}, y)},$$

where $p_{\theta,i}(x^{(i)}, y)$ and $p_{\theta,i}(x^{(i)}, y^{(i)})$ are obtained from the data with the i-th observation removed. However, when the training set size m is large enough, there is little difference between $\hat{\theta}_i$ and $\hat{\theta}$, and thus such a bias can be ignored. Therefore, in our study, we do not use the "leave-one-out" strategy to estimate θ.
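As a concrete illustration of this estimation step, the following is a minimal R sketch (not the authors' code) that minimises $\ell_H$ directly over $\theta = (\theta_0, \theta_1, \theta_2)^T$, assuming the generatively estimated log-likelihood ratios are already available as vectors g1 and g2 (hypothetical names) for the m training observations:

# Negative log-likelihood l_H as a function of theta = (theta0, theta1, theta2),
# given generatively estimated log-likelihood ratios g1, g2 and labels y in {1, 2}.
neg_loglik_H <- function(theta, g1, g2, y, h1, h2) {
  # lambda_H(x) = theta0 + (theta1/h1) * g1(x) + (theta2/h2) * g2(x)
  lambda_H <- theta[1] + (theta[2] / h1) * g1 + (theta[3] / h2) * g2
  z <- as.numeric(y == 1)  # indicator 1{y = 1}
  sum(z * log1p(exp(-lambda_H)) + (1 - z) * log1p(exp(lambda_H)))
}
# Illustrative call (g1, g2, y, h1, h2 assumed to exist in the workspace):
# theta_hat <- optim(par = c(0, 1, 1), fn = neg_loglik_H, g1 = g1, g2 = g2,
#                    y = y, h1 = h1, h2 = h2, method = "BFGS")$par

Section 3.2 describes the simpler route actually used in our study, namely fitting the same linear form by the R function glm.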
3.2. Implementation of the classifiers

In order to evaluate the discrimination performance of the hybrid algorithm, we compare it with two widely used discriminative and generative classifiers, linear logistic regression and the naïve Bayes classifier, using simulated continuous and discrete data.

The naïve Bayes classifier is implemented by the R function naiveBayes from the contributed package e1071 for R. As with Raina et al. [10], for discrete data we use Laplace (add-one) smoothing. For simulated continuous data, the naïve Bayes classifier, which assumes normal distributions for the class-conditional probabilities $p(x|y)$, corresponds to LDA-L when the covariance matrix $\Sigma_1$ of the group $y = 1$ is equal to the covariance matrix $\Sigma_2$ of the group $y = 2$, and corresponds to quadratic normal discriminant analysis with diagonal covariance matrices (QDA-L) when $\Sigma_1 \ne \Sigma_2$. The naïve Bayes classifier assumes the conditional independence of all h features given the group label y, such that $p(x|y) = \prod_{j=1}^{h} p(x_j|y)$; its discriminant function $\lambda_G(x)$ can be written as

$$\lambda_G(x) = \log\frac{p(y=1)}{p(y=2)} + \sum_{j=1}^{h} \log\frac{p(x_j|y=1)}{p(x_j|y=2)}.$$

The implementation of parameter estimation for the hybrid algorithm with $\lambda_H(x)$ consists of two steps: in the first step, by use of the R function naiveBayes, $p(x_j|y)$, $j = 1, \dots, h$, are generatively estimated, and thus $\log\{p(x^1|y=1)/p(x^1|y=2)\}$ and $\log\{p(x^2|y=1)/p(x^2|y=2)\}$ can be calculated; in the second step, θ is estimated discriminatively by use of the R function glm (from the standard package stats in R) with $\log\{p(x^1|y=1)/p(x^1|y=2)\}$ and $\log\{p(x^2|y=1)/p(x^2|y=2)\}$ as predictor variables. The hybrid algorithm assumes conditional independence within the partial feature vectors, such that $p(x^1|y) = \prod_{j=1}^{h_1} p(x_j|y)$ and $p(x^2|y) = \prod_{j=h_1+1}^{h} p(x_j|y)$.

Linear logistic regression is implemented by the R function glm, which uses an iteratively reweighted least squares algorithm (IRLS, or IWLS, also known as the Fisher scoring algorithm) to fit the model. The discriminant function $\lambda_D(x)$ of linear logistic regression can be written as

$$\lambda_D(x) = \beta_0 + \sum_{j=1}^{h} \beta_j x_j,$$

which does not necessarily imply that the conditional independence assumption holds.
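The two-step procedure can be sketched in R as follows. This is a minimal illustration under the Gaussian naïve Bayes assumption, estimating the class-conditional densities directly rather than through e1071, and fitting the weights with glm; the data frame train, with columns x1–x4 and a label column y in {1, 2}, and the split h1 = 2 are hypothetical.

# Step 1: generative estimates of p(x_j | y) under normality, and the two
# log-likelihood-ratio features g1 (from x^1) and g2 (from x^2).
loglik_ratio <- function(X, y, cols) {
  # Sum over the chosen columns of log p(x_j | y = 1) - log p(x_j | y = 2),
  # with each p(x_j | y) estimated as a univariate normal density.
  g <- rep(0, nrow(X))
  for (j in cols) {
    m1 <- mean(X[y == 1, j]); s1 <- sd(X[y == 1, j])
    m2 <- mean(X[y == 2, j]); s2 <- sd(X[y == 2, j])
    g <- g + dnorm(X[, j], m1, s1, log = TRUE) - dnorm(X[, j], m2, s2, log = TRUE)
  }
  g
}

X  <- as.matrix(train[, c("x1", "x2", "x3", "x4")])  # 'train' is a hypothetical data frame
y  <- train$y                                        # group labels in {1, 2}
g1 <- loglik_ratio(X, y, cols = 1:2)                 # log p(x^1|y=1)/p(x^1|y=2)
g2 <- loglik_ratio(X, y, cols = 3:4)                 # log p(x^2|y=1)/p(x^2|y=2)

# Step 2: discriminative estimation of theta by logistic regression.
z   <- as.numeric(y == 1)                            # 1 for group 1, 0 for group 2
fit <- glm(z ~ g1 + g2, family = binomial)
# coef(fit): the intercept plays the role of theta0, and the slopes of theta1/h1 and theta2/h2.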
3.3. Evaluation of the classifiers

To evaluate the performance of the three classifiers, we use the misclassification error rate (ER) and the logarithmic loss (LL). The ER is defined as usual by the number of misclassified observations over the total number of observations; it is based on a 0–1 loss function and is independent of the observed value x. In contrast, the LL is dependent on x. The LL, also referred to as the logistic loss for logistic regression, is based on the loss function $L(y, \hat{y}(x)) = -\log p(y|x)$, where $p(y|x)$ is determined by the classifier $\hat{y}(x)$, and is thus defined by

$$LL = \sum_{i=1}^{t} \{-\log p(y^{(i)}|x^{(i)})\},$$

where t is the number of test observations. It can easily be recognised that the LL is in fact the negative of the log-likelihood of $p(y|x)$, and therefore the estimates obtained by the discriminative classifiers provide the best classification for the training observations if the minimum LL is used to measure the performance.

Consider two groups $y \in \{1, 2\}$ with the discriminant function $\lambda(x) = \log\{p(y=1|x)/p(y=2|x)\}$. Then the LL can be rewritten as

$$LL = \sum_{i=1}^{t}\left\{-\log\left(\frac{e^{\lambda(x^{(i)})}}{1 + e^{\lambda(x^{(i)})}}\right)^{1_{\{y^{(i)}=1\}}} - \log\left(\frac{1}{1 + e^{\lambda(x^{(i)})}}\right)^{1_{\{y^{(i)}=2\}}}\right\},$$

where $1_{\{y^{(i)}=k\}}$ is an indicator function of the subset $\{y^{(i)} = k\}$. A simple notation for the LL used by the machine learning community, for two groups coded such that $y \in \{-1, +1\}$, is

$$LL = \sum_{i=1}^{t} -\log\frac{1}{1 + e^{-y^{(i)}\lambda(x^{(i)})}} = \sum_{i=1}^{t} \log(1 + e^{-y^{(i)}\lambda(x^{(i)})}).$$
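For concreteness, ER and LL can be computed in R as below; this is a small sketch assuming a vector p1 of predicted probabilities $p(y=1|x^{(i)})$ and a test label vector y_test in {1, 2}, both hypothetical names.

eval_classifier <- function(p1, y_test) {
  # p1: predicted p(y = 1 | x) for each test observation; y_test: labels in {1, 2}
  y_hat <- ifelse(p1 >= 0.5, 1, 2)            # assign to the group with larger posterior
  er    <- mean(y_hat != y_test)              # misclassification error rate (0-1 loss)
  p_obs <- ifelse(y_test == 1, p1, 1 - p1)    # p(y^(i) | x^(i)) of the observed label
  ll    <- sum(-log(p_obs))                   # logarithmic loss
  c(ER = er, LL = ll)
}
# Example (hypothetical 'fit' and 'test' objects):
# eval_classifier(predict(fit, newdata = test, type = "response"), test$y)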
4. Numerical studies

4.1. Simulation studies

Twelve datasets are simulated here, of which six are composed of h continuous features and the other six are composed of h discrete features. In each continuous dataset, the data arise from two h-variate normal distributions; in each discrete dataset, the data arise from two h-variate Bernoulli distributions. Each dataset consists of $N = 10^3$ observations, which are equally categorised into two groups by a group label $y \in \{1, 2\}$. Amongst them, $m/2$ observations from each of the two groups are used as training observations; m is sampled within [100, 400] in steps of 25. For each sampled m, the N observations are randomly split into m training observations and $t = N - m$ test observations, with 400 replicates; from them, the medians of the ERs and LLs are recorded and plotted. In each dataset, we set
$h = 4$, and the feature vector $x = (x_1, x_2, x_3, x_4)^T$ is composed of two partial feature vectors $x^1 = (x_1, x_2)^T$ and $x^2 = (x_3, x_4)^T$, i.e., $h_1 = h_2 = 2$. Amongst the 12 datasets, six (three continuous and three discrete) have $\Sigma_1 = \Sigma_2$, i.e., the two groups have a common covariance matrix Σ. In addition, there are four datasets (two continuous and two discrete) with diagonal covariance matrices, for which the assumption of conditional independence of all h features of x given y, underlying the naïve Bayes classifier, is satisfied. There are also four datasets with block-diagonal covariance matrices of two blocks, where one block consists of the $h_1$ features of $x^1$ and the other consists of the $h_2$ features of $x^2$, and thus for them the assumption of conditional independence between $x^1$ and $x^2$ given y is satisfied. The other four datasets have full covariance matrices, such that each of the h features of x given y is dependent on the others.
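As an illustration of this set-up, the following R sketch generates one replicate of a continuous dataset with a block-diagonal common covariance matrix; the group means and within-block correlation below are illustrative values, with the specific settings used in the paper given in Sections 4.1.1 and 4.1.2.

library(MASS)  # provides mvrnorm() for multivariate normal sampling

N   <- 1000                         # N = 10^3 observations, split equally between groups
c0  <- 0.25                         # illustrative within-block correlation
mu1 <- c( 1.5, 0,  0.5, 0)          # illustrative group means
mu2 <- c(-1.5, 0, -0.5, 0)
Sigma <- rbind(c(1,  c0, 0,  0),    # block-diagonal covariance: one block for x^1 = (x1, x2),
               c(c0, 1,  0,  0),    # the other for x^2 = (x3, x4)
               c(0,  0,  1,  c0),
               c(0,  0,  c0, 1))

dat <- data.frame(rbind(mvrnorm(N / 2, mu1, Sigma),
                        mvrnorm(N / 2, mu2, Sigma)),
                  y = rep(c(1, 2), each = N / 2))
names(dat)[1:4] <- c("x1", "x2", "x3", "x4")
# A training set of m/2 observations per group would then be sampled from 'dat',
# with the remaining t = N - m observations used as the test set.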
As our results for the simulated discrete data showed similar patterns to those for the simulated continuous data, only the latter are presented below.

4.1.1. Continuous data with a common covariance matrix Σ

The first three datasets contain simulated continuous data arising from two 4-variate normal distributions: $x \sim N(\mu_1, \Sigma_1)$ for the group with $y = 1$ and $x \sim N(\mu_2, \Sigma_2)$ for $y = 2$, with $\mu_1 = (1.5, 0, 0.5, 0)^T$, $\mu_2 = (-1.5, 0, -0.5, 0)^T$, $\Sigma_1 = \Sigma_2 = \Sigma$ and

$$\Sigma = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}, \quad \begin{pmatrix} 1 & c & 0 & 0 \\ c & 1 & 0 & 0 \\ 0 & 0 & 1 & c \\ 0 & 0 & c & 1 \end{pmatrix} \quad \text{or} \quad \begin{pmatrix} 1 & c & c & c \\ c & 1 & c & c \\ c & c & 1 & c \\ c & c & c & 1 \end{pmatrix},$$

with $c = 0.25$, giving a diagonal, a block-diagonal and a full covariance matrix, respectively, for the three datasets.
Medians of the ERs and LLs are obtained from 400 replicates; the medians are plotted against the training set size m in Fig. 1, of which each row represents the results for one dataset.

Fig. 1. Simulated normally distributed data with equal covariance matrices. Plots of classification performance measured by ER and by LL vs. training set size m.
4.1.2. Continuous data with unequal covariance matrices $\Sigma_1, \Sigma_2$

The structure of the second set of three datasets is similar to that of the first set in Section 4.1.1, except that $\Sigma_1 \ne \Sigma_2$ and

$$\Sigma_2 = \begin{pmatrix} 0.25 & 0 & 0 & 0 \\ 0 & 0.75 & 0 & 0 \\ 0 & 0 & 1.25 & 0 \\ 0 & 0 & 0 & 1.75 \end{pmatrix}, \quad \begin{pmatrix} 0.25 & c & 0 & 0 \\ c & 0.75 & 0 & 0 \\ 0 & 0 & 1.25 & c \\ 0 & 0 & c & 1.75 \end{pmatrix} \quad \text{or} \quad \begin{pmatrix} 0.25 & c & c & c \\ c & 0.75 & c & c \\ c & c & 1.25 & c \\ c & c & c & 1.75 \end{pmatrix},$$

while $\Sigma_1$ is the same as the Σ shown in Section 4.1.1, respectively, for these three datasets. The results for these three datasets are shown in Fig. 2.

Fig. 2. Simulated normally distributed data with unequal covariance matrices. Plots of classification performance measured by ER and by LL vs. training set size m.
4.2. Empirical studies

For the empirical studies, six continuous datasets from the UCI machine learning repository [1] are used here. The six UCI datasets are "Breast cancer Wisconsin (diagnostic)", "Breast cancer Wisconsin (prognostic)", "Connectionist bench (sonar)", "Ecoli (cp vs. pp)", "Pima Indians diabetes" and "Wine (1 vs. 2)".

Raina et al. [10] used newsgroups data, reasonably dividing a message x into a message subject $x^1$ and a message body $x^2$ and obtaining very promising results from the hybrid algorithm. However, for these UCI datasets, there might not be such an apparently reasonable division. As a random division of x may break down the required connection of the features within either of the $x^r$ and thus lead to a bias disfavouring the hybrid algorithm,
we simply took the first half of the features as $x^1$ and the others as $x^2$. Such a simple division may preserve the connection between features, as similar features are in general next to each other in the order measured. Similarly to the training–test split of the simulated datasets, for each group we randomly chose ρ% of the observations as training data and the remaining (100 − ρ)% as test data, where ρ = 20(10)80 (i.e., ρ runs from 20 to 80 in steps of 10), such that the group proportions are preserved for training. For each value of ρ, we generated 100 such random partitions to assess classifier performance; medians of the ERs for these 100 replicates are shown in Fig. 3, those of the LLs showing similar patterns.
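A minimal R sketch of this stratified split (with a hypothetical data frame dat holding the features and a group column y) is:

rho <- 0.4                           # e.g. 40% of each group used for training
idx_train <- unlist(lapply(split(seq_len(nrow(dat)), dat$y),
                           function(ix) sample(ix, size = round(rho * length(ix)))))
train <- dat[idx_train, ]            # group proportions preserved
test  <- dat[-idx_train, ]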
4.3. Summary of numerical studies

Based on the results shown in Figs. 1–3, our numerical studies suggest the following conclusions.
First, with the simulated datasets, in general, in terms of both performance measures, namely ER and LL, if both covariance matrices $\Sigma_1$ and $\Sigma_2$ are diagonal, the naïve Bayes classifier performs the best; if both covariance matrices are full, linear logistic regression performs the best, in particular when the training set size m is large. The superior performance of the naïve Bayes classifier can be attributed to the fact that the simulated data satisfy the assumption of conditional independence underlying the classifier; the superior performance of linear logistic regression can be attributed to its robustness when the assumptions underlying other classifiers are violated.

Second, the hybrid algorithm performs the best for three of the six UCI datasets, while either the naïve Bayes classifier or linear logistic regression performs the best for the others. Therefore, with these datasets, our studies suggest that the hybrid algorithm may provide worse performance than either the naïve Bayes classifier or linear logistic regression alone.
Fig. 3. UCI datasets. Plots of classification performance measured by ER vs. ρ.
5. Discussion

First, one of the key points of the hybrid algorithm in Raina et al. [10] is to assign weights to the class-conditional distributions of subsets of the variables in x, the subsets being obtained by partitioning x. The extremes of such a block-wise naïve Bayes classifier are either the independence model investigated by Titterington et al. [12] and Hand and Yu [7], assigning a common weight, or a more sophisticated model assigning different weights to the distributions of different variables. In addition, it may not be necessary to use a hybrid strategy to estimate the parameters, as the weights can also be estimated in a generative way.

Second, although the hybrid algorithm offered good empirical results, our results showed that simpler generative classifiers like the naïve Bayes classifier and discriminative classifiers like linear logistic regression could offer comparable performance to the hybrid algorithm. This conforms to an argument made by Hand [6] that simple classifiers typically yield performance that is almost as good as that of more sophisticated classifiers.

Finally, some of the good performance of hybrid classifiers, such as the hybrid algorithm [10] and the naïve Bayes classifier-based independence model [12,7], may be the consequence of a bias–variance trade-off, as they are in general biased models.
Acknowledgement

The authors thank the reviewers for their constructive comments, which led to more concise yet wide-ranging numerical studies. We are grateful to Rajat Raina for communication about Raina et al. [10] and to David J. Hand for discussion leading to Section 5. The work also benefited from our participation in the Research Programme on "Statistical Theory and Methods for Complex, High-Dimensional Data" at the Isaac Newton Institute for Mathematical Sciences in Cambridge.

References

[1] A. Asuncion, D.J. Newman, UCI machine learning repository, University of California, School of Information and Computer Science, Irvine, CA. <http://www.ics.uci.edu/mlearn/MLRepository.html>
[2] G. Bouchard, B. Triggs, The tradeoff between generative and discriminative classifiers, in: IASC International Symposium on Computational Statistics (COMPSTAT), Prague, 2004, pp. 721–728.
[3] A.P. Dawid, Properties of diagnostic data distributions, Biometrics 32 (3) (1976) 647–658.
[4] B. Efron, The efficiency of logistic regression compared to normal discriminant analysis, J. Am. Stat. Assoc. 70 (352) (1975) 892–898.
[5] A. Fujino, N. Ueda, K. Saito, A hybrid generative/discriminative approach to text classification with additional information, Inf. Process. Manage. 43 (2) (2007) 379–392.
[6] D.J. Hand, Classifier technology and the illusion of progress (with discussion), Statistical Science 21 (1) (2006) 1–34.
[7] D.J. Hand, K. Yu, Idiot's Bayes—not so stupid after all? International Statistical Review 69 (3) (2001) 385–398.
[8] A. McCallum, C. Pal, G. Druck, X. Wang, Multi-conditional learning: generative/discriminative training for clustering and classification, in: AAAI, 2006, pp. 433–439.
[9] A.Y. Ng, M.I. Jordan, On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes, in: NIPS, 2001.
[10] R. Raina, Y. Shen, A.Y. Ng, A. McCallum, Classification with hybrid generative/discriminative models, in: NIPS, 2003.
[11] Y.D. Rubinstein, T. Hastie, Discriminative vs. informative learning, in: KDD, 1997, pp. 49–53.
[12] D.M. Titterington, G.D. Murray, L.S. Murray, D.J. Spiegelhalter, A.M. Skene, J.D.F. Habbema, G.J. Gelpke, Comparison of discrimination techniques applied to a complex data set of head injured patients (with discussion), J. R. Stat. Soc. Ser. A (General) 144 (2) (1981) 145–175.
[13] J.-H. Xue, D.M. Titterington, On the generative–discriminative tradeoff approach: interpretation, asymptotic efficiency and classification performance, Technical Report, Department of Statistics, University of Glasgow, 2008.

Jing-Hao Xue was born in Jiangxi, China, in 1971. He received the B.Eng. degree in telecommunication and information systems in 1993 and the Dr.Eng. degree in signal and information processing in 1998, both from Tsinghua University, the M.Sc. degree in medical imaging and the M.Sc. degree in statistics, both from Katholieke Universiteit Leuven in 2004, and the degree of Ph.D. in statistics from the University of Glasgow in 2008. He is currently a Lecturer in the Department of Statistical Science at University College London. His research interests include statistical modelling and learning for pattern recognition.

D. Michael Titterington was born in Marple, England, in 1945. He received the B.Sc. degree in mathematical science from the University of Edinburgh in 1967 and the degree of Ph.D. from the University of Cambridge in 1972. He has worked in the Department of Statistics at the University of Glasgow since 1972; he was appointed Titular Professor in 1982 and Professor in 1988. He was Head of Department from 1982 until 1991. He has held visiting appointments at Princeton University, SUNY at Albany, the University of Wisconsin-Madison and the Australian National University. His research interests include optimal design, incomplete data problems including mixtures, statistical pattern recognition, statistical smoothing including image analysis, and statistical aspects of neural networks. Dr. Titterington was elected Fellow of the Institute of Mathematical Statistics in 1996, Member of the International Statistical Institute in 1991, and Fellow of the Royal Society of Edinburgh, also in 1991. He has held editorial appointments with the Annals of Statistics, Biometrika, the Journal of the American Statistical Association, the Journal of the Royal Statistical Society (Series B), Statistical Science, and IEEE Transactions on Pattern Analysis and Machine Intelligence.