Pattern Recognition 35 (2002) 2705 – 2710
www.elsevier.com/locate/patcog
On the use of Bernoulli mixture models for text classification

A. Juan*, E. Vidal

Dep. de Sistemas Informáticos y Computación, Universidad Politécnica de Valencia, Camino de Vera s/n, 46022 Valencia, Spain

Received 31 October 2001; accepted 31 October 2001

Work supported by the Spanish MCT under grant TIC2000-1703-CO3-01.
* Corresponding author. Tel.: +34-9-63-877-251; fax: +34-9-63-877-239.
E-mail addresses: [email protected] (A. Juan), [email protected] (E. Vidal).
URLs: http://www.iti.upv.es/~ajuan, http://www.iti.upv.es/~evidal
Abstract

Mixture modelling of class-conditional densities is a standard pattern recognition technique. Although most research on mixture models has concentrated on mixtures for continuous data, emerging pattern recognition applications demand extending research efforts to other data types. This paper focuses on the application of mixtures of multivariate Bernoulli distributions to binary data. More concretely, a text classification task aimed at improving language modelling for machine translation is considered. © 2002 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Mixture models; EM algorithm; Data categorization; Multivariate binary data; Text classification; Multivariate Bernoulli distribution
1. Introduction

A standard pattern recognition approach to modelling class-conditional densities consists of using mixtures of parametric distributions [1]. On the one hand, mixtures are flexible enough for finding an appropriate tradeoff between model complexity and the amount of training data available. Usually, model complexity is controlled by varying the number of mixture components while keeping the same (often simple) parametric form for all components. On the other hand, maximum-likelihood estimation of mixture parameters can be reliably accomplished by the well-known Expectation-Maximization (EM) algorithm.

Although most research on mixture models has concentrated on mixtures for continuous data, emerging pattern recognition applications demand extending research efforts to other data types. Text classification is a good example. Standard text classification procedures developed in information retrieval are based on either binary or integer-valued features. In both cases, it has recently been shown that these procedures are closely tied to more well-founded statistical decision rules [2,3]. In fact, both classical and new pattern recognition techniques are currently being tested with success on this task [2,4–6].

One of the most widely used "new" text classification techniques is the so-called naive Bayes classifier (for binary data) [2,3,5]. It is a Bayes plug-in classifier which assumes that the binary features are class-conditionally independent, and thus each pattern class can be represented by a multivariate Bernoulli distribution. While this technique is still considered to be among the most accurate text classifiers, it is commonly accepted that better classifiers can be designed by simply relaxing the class-conditional independence assumption.

In this paper, we focus on the use of mixtures of multivariate Bernoulli distributions for a text classification task aimed at improving language modelling for machine translation. We first review the basic theory on EM-based learning of these models. Then, experimental results are reported
showing that significant improvements can be obtained by spreading class-conditional dependencies over mixture components.
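To make the baseline concrete, the following is a minimal Python sketch (not the authors' code) of such a Bernoulli naive Bayes plug-in classifier over binary word-occurrence vectors; the function names, the smoothing constant and the data layout are illustrative assumptions.

import numpy as np

def train_bernoulli_naive_bayes(X, y, n_classes, alpha=1.0):
    """Estimate class priors and per-class Bernoulli parameters.

    X: (n, d) binary matrix of word occurrences; y: (n,) labels in 0..n_classes-1.
    alpha is a Laplace-style smoothing count that keeps parameters inside (0, 1).
    """
    n, d = X.shape
    priors = np.zeros(n_classes)
    p = np.zeros((n_classes, d))
    for c in range(n_classes):
        Xc = X[y == c]
        priors[c] = len(Xc) / n
        p[c] = (Xc.sum(axis=0) + alpha) / (len(Xc) + 2 * alpha)
    return priors, p

def classify(X, priors, p):
    """Assign each sample to the class with the largest posterior (computed in log-space)."""
    log_lik = X @ np.log(p).T + (1 - X) @ np.log(1 - p).T   # (n, n_classes)
    return np.argmax(log_lik + np.log(priors), axis=1)

Relaxing the independence assumption, as discussed below, amounts to replacing the single Bernoulli prototype per class with a mixture of such prototypes.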
2. Probabilistic framework

2.1. Mixtures for supervised and unsupervised learning

A (finite) mixture model consists of a number, say c, of components. The mixture model generates a binary d-dimensional sample x = (x_1, ..., x_d)^t by first selecting the jth component with prior probability π_j, and then generating x in accordance with the jth component-conditional probability function (p.f.) P_j(x). The priors must satisfy the constraints

\sum_{j=1}^{c} \pi_j = 1 \quad \text{and} \quad \pi_j \geq 0 \quad (j = 1, \ldots, c).    (1)

The posterior probability of x being actually generated by the jth component can be calculated via Bayes' rule as

z_j(x) = \frac{\pi_j P_j(x)}{P(x)},    (2)

where

P(x) = \sum_{j=1}^{c} \pi_j P_j(x)    (3)

is the unconditional mixture probability function.

In a simple classification setting, there is a fixed, known number of classes and each class is modelled by a single component. A collection of class-labelled training samples is used to estimate the priors and class-conditional p.f.'s, following the conventional supervised statistical learning paradigm. Then, unlabelled test samples are classified by maximum posterior probability. However, if each class is to be modelled by more than one component, we face an unsupervised learning problem, since we do not know which component of its class each sample has been generated from. In what follows, we adopt a "semi-supervised" framework. More precisely, we assume that the number of classes and even their priors are known, and that each class-conditional p.f. is a mixture of multivariate Bernoulli components. Our basic problem consists of learning a Bernoulli mixture model for each class from training samples that are only labelled by the class they belong to.

2.2. Unsupervised learning of a Bernoulli mixture model

A Bernoulli mixture model is a mixture of the form (3) in which each component has a multivariate Bernoulli p.f.,

P_j(x \mid p_j) = \prod_{k=1}^{d} p_{jk}^{x_k} (1 - p_{jk})^{1 - x_k} \quad (j = 1, \ldots, c).    (4)

Consider an arbitrary component P_j(x | p_j). It identifies a certain subclass of binary vectors "resembling" its parameter vector or prototype p_j = (p_{j1}, ..., p_{jd})^t, p_j ∈ [0, 1]^d. In fact, each p_{jk} is the probability of bit x_k being one, whereas 1 − p_{jk} is the probability of it being zero. Thus, for instance, if p_j = (1, 0)^t, then P_j(x | p_j) is one for x = (1, 0)^t and zero otherwise. Note that P_j(x | p_j) is simply the product of the probabilities of generating the individual bits of x. On the other hand, note that no feature dependences are modelled, since each p_{jk} does not depend on any bit x_i, i ≠ k.

Let X = {x_1, ..., x_n} be a set of samples available to learn a Bernoulli mixture model. This is a statistical parameter estimation problem, since the mixture is a p.f. of known functional form and all that is unknown is a parameter vector comprising the priors and component prototypes, Θ = (π_1, ..., π_c, p_1, ..., p_c)^t. Here we exclude the number of components from the estimation problem, as it is a crucial parameter for controlling model complexity and receives special attention in Section 3. Following the maximum likelihood principle, the best parameter values are those that maximize the log-likelihood function of Θ,

L(\Theta \mid X) = \sum_{i=1}^{n} \log \sum_{j=1}^{c} \pi_j P_j(x_i \mid p_j).    (5)

In order to find these optimal values, it is useful to think of each sample x_i as an incomplete component-labelled sample, which can be completed by an indicator vector z_i = (z_{i1}, ..., z_{ic})^t with 1 in the position corresponding to the component generating x_i and zeros elsewhere. In doing so, a complete version of the log-likelihood function (5) can be stated as

L_C(\Theta \mid X, Z) = \sum_{i=1}^{n} \sum_{j=1}^{c} z_{ij} \left( \log \pi_j + \log P_j(x_i \mid p_j) \right),    (6)

where Z = {z_1, ..., z_n} is the so-called missing data. The form of the log-likelihood function given in (6) is generally preferred because it makes available the well-known EM optimization algorithm (for finite mixtures) [7]. This algorithm proceeds iteratively in two steps. The E(xpectation) step computes the expected value of the missing data given the incomplete data and the current parameters. The M(aximization) step finds the parameter values which maximize (6) on the basis of the missing data estimated in the E step. In our case, the E step replaces each z_{ij} by the posterior probability of x_i being actually generated by the jth component,

z_{ij} = \frac{\pi_j P_j(x_i \mid p_j)}{\sum_{k=1}^{c} \pi_k P_k(x_i \mid p_k)} \quad (i = 1, \ldots, n; \; j = 1, \ldots, c),    (7)

while the M step finds the maximum likelihood estimates for the priors,

\pi_j = \frac{1}{n} \sum_{i=1}^{n} z_{ij} \quad (j = 1, \ldots, c),    (8)
and the component prototypes,

p_j = \frac{\sum_{i=1}^{n} z_{ij} x_i}{\sum_{i=1}^{n} z_{ij}} \quad (j = 1, \ldots, c).    (9)
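As an illustration, here is a minimal NumPy sketch of this EM procedure, implementing the E step (7), the M steps (8) and (9) and the log-likelihood (5) in log-space for numerical stability; the simple random initialization and all names are assumptions made for the sketch, not the initialization actually used below.

import numpy as np

def em_bernoulli_mixture(X, c, n_iter=100, eps=1e-6, seed=0):
    """Fit a c-component Bernoulli mixture to binary data X (n x d) with EM."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(c, 1.0 / c)                      # priors
    p = rng.uniform(0.25, 0.75, size=(c, d))      # prototypes, kept away from {0, 1}

    for _ in range(n_iter):
        # E step (Eq. 7): posterior z_ij of component j given sample x_i, in log-space.
        log_pj = X @ np.log(p).T + (1 - X) @ np.log(1 - p).T + np.log(pi)   # (n, c)
        log_norm = np.logaddexp.reduce(log_pj, axis=1, keepdims=True)
        z = np.exp(log_pj - log_norm)

        # M step (Eqs. 8 and 9): update priors and prototypes.
        nj = z.sum(axis=0)                        # expected counts per component
        pi = nj / n
        p = (z.T @ X) / nj[:, None]
        p = np.clip(p, eps, 1 - eps)              # keep parameters inside (0, 1)

    log_likelihood = log_norm.sum()               # Eq. (5)
    return pi, p, log_likelihood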
An initial value for the parameters is required to start the EM algorithm. To this end, it is recommended to avoid "pathological" points in the parameter space, such as those touching the parameter boundaries and those in which the same prototype is used for all components [8]. Provided that a non-pathological starting point is used, each iteration is guaranteed not to decrease the log-likelihood function and the algorithm is guaranteed to converge to a proper stationary point (local maximum). For the experiments reported in the next section, starting points are obtained through a clustering initialization technique called the maxmin algorithm [9]. This technique efficiently selects c well-separated training samples in accordance with a given distance (in our case, the Hamming distance). The selected samples are used as cluster "seeds" and non-selected samples are assigned to clusters by minimum-distance classification. Cluster proportions and means are then employed to initialize the mixture priors and component prototypes, respectively. To ensure that prototype parameters lie in (0, 1), a simple smoothing procedure is applied.

3. Experiments

3.1. The task

We are interested in a text classification task aimed at improving language modelling in a limited-domain (speech-to-speech) Spanish-English machine translation application developed in the EUTRANS ESPRIT project [10]. The general domain considered in EUTRANS is that of a traveller (tourist) visiting a foreign country. It encompasses a variety of different translation scenarios which range from very restricted applications to unconstrained natural language. This allows for progressive experimentation with an increasing level of complexity. In this work, the domain has been limited to human-to-human communication situations at the front desk of a hotel. It will be referred to as the Traveller Task.

To obtain a corpus for this task, several traveller-oriented booklets were collected and those pairs of sentences fitting the above scenario were selected. More precisely, only 16 sub-domains were considered for the sake of keeping the task within the desired complexity level. This provided a (small) "seed corpus" which was further extended to a large set of sentence pairs following a semi-automatic procedure [11]. This procedure was carried out independently by four persons, each of whom had to cater for a (non-disjoint) subset of sub-domains. As a result, four natural classes of sentence pairs were obtained, which are labelled A, F, J and P. Table 1 lists these 16 sub-domains and their class assignments. Here are a couple of examples of paired sentences:
• Reservé una habitación individual y tranquila con televisión hasta pasado mañana.
• I booked a quiet, single room with a tv. until the day after tomorrow.
• Por favor, prepárenos nuestra cuenta de la habitación dos veintidós.
• Could you prepare our bill for room number two two two for us, please?

While very good results have been achieved in the Traveller Task with corpus-based, finite-state and statistical technology [10,12,13], these results could be improved by taking advantage of the clustered nature of the sentences. Both language modelling and translation-model learning would greatly benefit from training specific sub-domain models instead of a single huge model covering the whole domain. Given the way the corpus has been produced for this task, this kind of sub-modelling is quite straightforward. But in larger and more complex tasks, such as those more recently considered in EUTRANS [10,12], the data are no longer available in clustered form.

In this work we explore the capability of Bernoulli mixture models to provide the required automatic classification of text sentences. In a first attempt, the simple Traveller Task corpus described above has been adopted. Given that it has been produced in a clustered manner, a ground-truth label for each sentence is available and we can easily study to what extent Bernoulli mixture models are effective in obtaining the desired class labelling. More specifically, we are interested in studying their performance for the classification of a sentence into the four classes A, J, P and F (i.e., the identifiers of the persons who catered for the corpus sentences). As the four classes overlap by definition, no perfect classification is possible, and low classification error rates would indicate that the models are able to capture a major source of variability.

3.2. Experiments

For the experiments described in this section, we used a corpus of 8000 Spanish sentences extracted from the simple Traveller Task corpus. This corresponds to 2000 sentences per class, with the first half devoted to training purposes and the second half reserved for classification tests. After tokenizing and removing words that occur only once, a vocabulary of 614 words was obtained. Tokens were formed from contiguous alphabetic characters (including accented vowels and the Spanish 'ñ') and several punctuation marks ('¿', '?', '¡', '!', ',' and '.'). Neither stemming nor stop-list removal was used.
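A rough sketch of this preprocessing (tokenization into letter runs plus the retained punctuation marks, pruning of words occurring only once, and binary vector encoding) could look as follows; the regular expression and all names are illustrative assumptions rather than the exact EUTRANS pipeline.

import re
from collections import Counter

# Tokens: runs of (possibly accented) letters, plus the retained punctuation marks.
TOKEN_RE = re.compile(r"[a-záéíóúüñ]+|[¿?¡!,.]", re.IGNORECASE)

def tokenize(sentence):
    return TOKEN_RE.findall(sentence.lower())

def build_vocabulary(sentences, min_count=2):
    """Keep words occurring at least min_count times (here: drop words occurring once)."""
    counts = Counter(tok for s in sentences for tok in tokenize(s))
    return sorted(w for w, n in counts.items() if n >= min_count)

def to_binary_vector(sentence, vocab_index):
    """d-dimensional bit vector: 1 if the word occurs in the sentence, 0 otherwise."""
    vec = [0] * len(vocab_index)
    for tok in set(tokenize(sentence)):
        if tok in vocab_index:
            vec[vocab_index[tok]] = 1
    return vec

# Usage sketch:
# vocab = build_vocabulary(training_sentences)
# index = {w: k for k, w in enumerate(vocab)}
# x = to_binary_vector("¿ me podría despertar a la una y cuarto ?", index)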
Table 1
Sub-domains in the Traveller Task and their class assignments (each sub-domain is catered for by one or more of the classes A, F, J and P)

#    Description
1    Notifying a previous reservation
2    Asking about rooms (availability, features, prices)
3    Having a look at rooms
4    Asking for rooms
5    Signing the registration form
6    Complaining about rooms
7    Changing rooms
8    Asking for wake-up calls
9    Asking for keys
10   Asking for moving the luggage
11   Notifying the departure
12   Asking for the bill
13   Asking about the bill
14   Complaining about the bill
15   Asking for a taxi
16   General sentences
Following a standard feature selection technique for text classification, non-discriminative words were filtered out in accordance with the information gain or average mutual information criterion [14]. This criterion measures the number of bits of information obtained for class-label prediction by knowing the presence or absence of a word in a sentence. It was computed from the training sentences for each word, and the d most informative words were selected. Then, each sentence was represented as a d-dimensional bit vector in which the kth bit denotes the presence or absence of the kth most informative word.

As the dimension d is known to have a significant impact on classification performance, several values of d were considered in the experiments. For each d ∈ {50, 100, ..., 600} and c ∈ {1, 2, ..., 20}, a classifier was trained with c components for each class-conditional mixture. This includes the naive Bayes classifier as the particular case in which c = 1. The error rate of each trained classifier was estimated by simply counting the number of misclassified test samples.

Fig. 1 shows the results obtained for some selected values of d and c. Each estimated error rate ê has an approximate 95% confidence interval of [ê ± 0.5%] (for clarity, exact bounds are only plotted for d = 250). As expected, the dimension has an important effect on classification performance, but the variations become negligible for values of d larger than 250 (2/5 of the complete vocabulary). Similarly, the number of Bernoulli components also has an important impact on performance. Generally speaking, accuracy improves monotonically with the number of components, though there is a point after which the improvements become small or even some worsening is observed. More specifically, good behaviour is observed for dimensions equal to or larger than 250 words and four or more Bernoulli components per class.
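The criterion is the usual information gain IG(W) = H(C) − H(C | W) for the class variable C and the presence/absence indicator W of a word. A small illustrative sketch of computing it and keeping the d most informative words follows; the names and data layout are assumptions.

import numpy as np

def entropy(probs):
    """Entropy in bits, ignoring zero-probability terms."""
    probs = probs[probs > 0]
    return -np.sum(probs * np.log2(probs))

def information_gain(X, y, n_classes):
    """Information gain of each binary feature (word) for predicting the class label.

    X: (n, d) binary matrix, y: (n,) integer labels. Returns a length-d array of gains."""
    n, d = X.shape
    class_probs = np.bincount(y, minlength=n_classes) / n
    h_c = entropy(class_probs)
    gains = np.zeros(d)
    for k in range(d):
        present = X[:, k] == 1
        gain = h_c
        for mask in (present, ~present):
            if mask.any():
                cond = np.bincount(y[mask], minlength=n_classes) / mask.sum()
                gain -= mask.mean() * entropy(cond)
        gains[k] = gain
    return gains

def select_most_informative(X, y, n_classes, d_keep):
    """Indices of the d_keep words with the largest information gain."""
    gains = information_gain(X, y, n_classes)
    return np.argsort(gains)[::-1][:d_keep]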
Fig. 1. Classification error rate as a function of the number of Bernoulli components per class, for several dimensions (d = 150, 200, 250 and 350). Each estimated error rate ê has an approximate 95% confidence interval of [ê ± 0.5%] (exact bounds are plotted for d = 250).
In fact, some of the best figures are 1.7% for d = 250 and c = 6 or 7, and 1.5% for d = 350 and c = 6. These are quite good results given the difficulty of the task considered.

It is interesting to study the type of sentences each trained component prototype accounts for. As training sentences are relatively short, their corresponding vectors are mostly filled with zeroes, and hence many trained prototype parameters are also (close to) zero. This facilitates interpretation of the prototypes, since we can focus on those parameters having larger probability values. As an example, we analyse here the mixture of five components that was trained for class F from the 250 most informative words.
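For such an inspection, a prototype can be summarized by listing its highest-probability words; a tiny illustrative helper (hypothetical names, assuming a c × d prototype matrix and the list of selected words) might be:

import numpy as np

def top_words_per_prototype(p, vocab, k=10):
    """For each prototype (row of the c x d matrix p), return its k most probable words."""
    return [[vocab[i] for i in np.argsort(row)[::-1][:k]] for row in p]

# Usage sketch: if p was estimated by EM and vocab lists the d selected words,
# top_words_per_prototype(p, vocab) gives word lists of the kind visualized in Fig. 2.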
Fig. 2. Prototypes in the mixture of five components that was trained for class F from the 250 most informative words. A simple vocabulary of 34 words is considered, which covers the ten most probable parameters in each prototype. The parameter values are represented as grey levels (white = 0, black = 1).

Table 2
Examples of correctly classified test sentences from class F and their corresponding most probable components (a)

Proto.  Sentence
1       ¿ les importara bajar mi equipaje a la habitacion ?
1       por favor , quisiera que llevara mis bultos a la habitacion .
1       ¿ pueden llevar nuestro equipaje a la numero ochocientos nueve ?
2       tengo hecha una reserva .
2       hice una reserva a nombre de Gloria Redondo .
2       una habitacion con telefono y buena vista del lago , por favor .
3       por favor , déme la llave de la habitacion .
3       la llave de la habitacion numero tres diecinueve .
3       ¿ nos puede usted dar las llaves de nuestra habitacion ?
4       ¡ hola !
4       mi nombre es Lidia Berrueco .
4       ¿ a que día estamos ?
5       ¿ pueden despertarnos mañana a las cuatro y media , por favor ?
5       despiértenos mañana a las diez menos cuarto , por favor .
5       ¿ me podría despertar a la una y cuarto ?

(a) Words in the simple vocabulary given in Fig. 2 are typed in bold face.
In this case, it suffices to focus on a simple vocabulary of 34 words that covers the ten most probable parameters in each prototype. The parameter values for these words are represented as a grey-level image (white = 0, black = 1) in Fig. 2. Each image row is numbered in accordance with the prototype it represents, while columns are labelled with their corresponding words. From this image, it can be seen that the five prototypes trained for class F very closely match the corresponding sub-domains of this class (see Table 1). In particular, the first prototype accounts mainly for sentences about moving the luggage, the second about room reservations and asking for rooms, the third about asking for keys, the fourth about general sentences and the fifth about wake-up calls. This can be seen much more clearly in Table 2, where a few
examples of correctly classified test sentences from class F are shown, along with the specific Bernoulli component that exhibited the largest probability for each sentence.

Similar behaviour is exhibited by the prototypes of the other mixtures trained from the 250 most informative words. However, we have found that class A (which is slightly more complex) is best described with six or seven prototypes, while fewer than five prototypes are enough to properly model the simpler classes J and P. This brings up the very difficult problem of fine-tuning the number of components in a mixture, which is known not to be well solved yet. Although this problem is of great importance when trying to uncover data structure, it is not so important when mixture modelling is used to approximate class-conditional probability functions. In this case, over-dimensioning the number of components in a mixture is not a problem at all when there is enough data available to avoid overfitting. Clearly, we never have enough data to learn arbitrarily complex models, and hence some sort of experimentation is always needed to find an appropriate tradeoff between model complexity and the amount of data available.

4. Conclusions

Emerging pattern recognition applications demand extending research efforts to non-continuous data. This paper has focused on the application of mixtures of multivariate Bernoulli distributions to binary data. In particular, a text classification task aimed at improving language modelling for machine translation has been considered. In this task, experimental results have been reported showing that Bernoulli mixtures are adequate models for the class-conditional probability functions.

5. Summary

A standard pattern recognition approach to modelling class-conditional densities consists of using mixtures of parametric distributions. On the one hand, mixtures are flexible enough for finding an appropriate tradeoff between
model complexity and the amount of training data available. On the other hand, maximum-likelihood estimation of mixture parameters can be reliably accomplished by the well-known Expectation-Maximization (EM) algorithm. Although most research on mixture models has concentrated on mixtures for continuous data, emerging pattern recognition applications demand extending research efforts to other data types. Text classification is a good example. In this application, one of the most widely used classification techniques is the so-called naive Bayes classifier (for binary data). It is a Bayes plug-in classifier which assumes that the binary features are class-conditionally independent, and thus each pattern class can be represented by a multivariate Bernoulli distribution. While this technique is still considered to be among the most accurate text classifiers, it is commonly accepted that better classifiers can be designed by simply relaxing the class-conditional independence assumption.

In this work, we focus on the use of (class-conditional) Bernoulli mixture models for text classification. Generally speaking, this is a generalization of naive Bayes that tries to properly model significant class-conditional dependencies by spreading them over different mixture components. We first review the basic theory on EM-based learning of these models. Then, experimental results are reported showing the effectiveness of these models in a text classification task aimed at improving language modelling for machine translation.

References

[1] A.K. Jain, R.P.W. Duin, J. Mao, Statistical pattern recognition: a review, IEEE Trans. PAMI 22 (1) (2000) 4–37.
[2] T. Joachims, A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization, in: Proceedings of the ICML'97, 1997, pp. 143–151.
[3] D.D. Lewis, Naive Bayes at forty: the independence assumption in information retrieval, in: Proceedings of the ECML'98, 1998, pp. 4–15.
[4] T. Joachims, Text categorization with support vector machines: learning with many relevant features, in: Proceedings of the ECML'98, 1998, pp. 137–142.
[5] K. Nigam, et al., Text classification from labeled and unlabeled documents using EM, Mach. Learning 39 (2/3) (2000) 103–134.
[6] Y. Yang, An evaluation of statistical approaches to text categorization, J. Inf. Retrieval 1 (1/2) (1999) 67–88.
[7] A.P. Dempster, N.M. Laird, D.B. Rubin, Maximum likelihood from incomplete data via the EM algorithm (with discussion), J. Roy. Statist. Soc. B 39 (1977) 1–38.
[8] M.A. Carreira-Perpiñán, S. Renals, Practical identifiability of finite mixtures of multivariate Bernoulli distributions, Neural Comput. 12 (1) (2000) 141–152.
[9] A. Juan, E. Vidal, Comparison of four initialization techniques for the K-medians clustering algorithm, in: Proceedings of the Joint IAPR International Workshops SSPR 2000 and SPR 2000, 2000, pp. 842–852.
[10] E. Vidal, et al., Example-based understanding and translation systems (EuTrans), Final Report, ESPRIT Project 20268, ITI, 2000.
[11] J.C. Amengual, et al., Example-based understanding and translation systems (EuTrans), Final Report (Part I), ESPRIT Project 20268, ITI, 1996.
[12] F. Casacuberta, et al., Speech-to-speech translation based on finite-state transducers, in: Proceedings of the ICASSP'01, Vol. 1, 2001, pp. 613–616.
[13] E. Vidal, Finite-state speech-to-speech translation, in: Proceedings of the ICASSP'97, 1997, pp. 111–114.
[14] Y. Yang, J.O. Pedersen, Feature selection in statistical learning of text categorization, in: Proceedings of the ICML'97, 1997, pp. 412–420.
About the Author—ALFONS JUAN received the M.S. and Ph.D. degrees in Computer Science from the Universidad Politécnica de Valencia (UPV) in 1991 and 2000, respectively. He has been a professor at the UPV since 1995. His research interests are in the areas of Pattern Recognition, Computer Vision and Human Language Technology. Dr. Juan is a member of the Spanish Society for Pattern Recognition and Image Analysis (AERFAI) and the International Association for Pattern Recognition (IAPR).

About the Author—ENRIQUE VIDAL received the Doctor en Ciencias Físicas degree in 1985 from the Universidad de Valencia, Spain. From 1978 to 1986 he was with this University, serving in computer system programming and teaching positions. In the same period he coordinated a research group in the fields of Pattern Recognition and Automatic Speech Recognition. In 1986, he joined the Departamento de Sistemas Informáticos y Computación of the Universidad Politécnica de Valencia (UPV), where he is currently serving as a full professor of the Facultad de Informática. In 1995 he joined the Instituto Tecnológico de Informática, where he has been coordinating several projects on Pattern Recognition and Machine Translation. He is co-leader of the Pattern Recognition and Human Language Technology research group of the UPV. His current fields of interest include Statistical and Syntactic Pattern Recognition and their applications to language, speech and image processing. In these fields, he has published more than one hundred papers in journals, conference proceedings and books. Dr. Vidal is a member of the Spanish Society for Pattern Recognition and Image Analysis (AERFAI) and a Fellow of the International Association for Pattern Recognition (IAPR).