Decision Support Systems 57 (2014) 64–76
Decision Support Systems journal homepage: www.elsevier.com/locate/dss
Exploiting poly-lingual documents for improving text categorization effectiveness
Chih-Ping Wei a, Chin-Sheng Yang b,⁎, Ching-Hsien Lee a, Huihua Shi c, Christopher C. Yang d
a Department of Information Management, National Taiwan University, Taipei, Taiwan, ROC
b Department of Information Management, Yuan Ze University, Chung-Li, Taiwan, ROC
c Infrastructure & System Department I, Information Technology Division (AUT), AU Optronics Corporation, Hsinchu Science Park, Hsinchu, Taiwan, ROC
d College of Computing and Informatics, Drexel University, Philadelphia, PA, USA
Article info
Article history: Received 17 April 2012; received in revised form 2 July 2013; accepted 2 August 2013; available online 15 August 2013
Keywords: Text mining; Text categorization; Poly-lingual text categorization; Feature reinforcement; Document management
Abstract

With the globalization of business environments and rapid emergence and proliferation of the Internet, organizations or individuals often generate, acquire, and then archive documents written in different languages (i.e., poly-lingual documents). Prevalent document management practice is to use categories to organize this ever-increasing volume of poly-lingual documents for subsequent searches and accesses. Poly-lingual text categorization (PLTC) refers to the automatic learning of text categorization models from a set of preclassified training documents written in different languages and the subsequent assignment of unclassified poly-lingual documents to predefined categories on the basis of the induced text categorization models. Although PLTC can be approached as multiple, independent monolingual text categorization problems, this naïve PLTC approach employs only the training documents of the same language to construct a monolingual classifier and thus fails to exploit the opportunity offered by poly-lingual training documents. In this study, we propose a feature-reinforcement-based PLTC (FR-PLTC) technique that takes into account the training documents of all languages when constructing a monolingual classifier for a specific language. Using the independent monolingual text categorization (MnTC) approach as a performance benchmark, the empirical evaluation results show that our proposed FR-PLTC technique achieves higher classification accuracy than the benchmark technique. In addition, our empirical results suggest the superiority of the proposed FR-PLTC technique over its counterpart across a range of training sizes. © 2013 Elsevier B.V. All rights reserved.
1. Introduction With advances in and the proliferation of information and networking technologies, organizations increasingly participate in or shift to the Internet environment to conduct business transactions, gather marketing and competitive intelligence information from various online sources, and facilitate information and knowledge sharing within or beyond organizational boundaries. Such e-commerce and knowledge management applications generate and maintain a tremendous amount of textual documents (or documents for short) in organizational repositories. To facilitate subsequent access to these documents, the use of categories to manage the ever-increasing volume of documents often occurs at both organizational and individual levels. Text categorization deals with the assignment of documents to appropriate categories on the basis of their contents [2,7,8,40]. Central to text categorization is the automatic learning of a text categorization model using a set of preclassified documents that
⁎ Corresponding author. Tel.: +886 3 463 8800x2799; fax: +886 3 435 2077. E-mail addresses: [email protected] (C.-P. Wei), [email protected] (C.-S. Yang), [email protected] (C.-H. Lee), [email protected] (H. Shi), [email protected] (C.C. Yang).
0167-9236/$ – see front matter © 2013 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.dss.2013.08.001
serve as training examples. The induced text categorization model then can classify (or predict) the particular category (or categories) to which a new document belongs. Various text categorization techniques have been proposed [1,2,7,8,18,19,22,34,39,40,42], but most of them focus on monolingual documents (i.e., all documents written in the same language) for both learning a text categorization model and assigning (or predicting) new documents into appropriate categories. Because of the trend of globalization, an organization or individual often generates, acquires, and then archives documents written in different languages (i.e., poly-lingual documents). Assume that the languages involved in a repository include L1, L2, …, Ls, where s ≥ 2. That is, the set of poly-lingual documents contains some documents in L1, some in L2, …, and some in Ls. Consider the following scenarios: A division in a multinational corporation receives poly-lingual documents from other divisions and uses them in its routine activities. Or a financial analyst scrutinizing the global investment market needs to collect and archive financial reports and news that are in effect poly-lingual. Such scenarios are even more prevalent in countries with more than one official language. For example, Chinese and English are official languages of Hong Kong; French and English for Canada; Chinese, Malay, Tamil, and English for Singapore; and Dutch, French, and German for Belgium. In such poly-lingual environments, if organizations or individuals have already organized their poly-lingual
C.-P. Wei et al. / Decision Support Systems 57 (2014) 64–76
documents into existing categories and want to use this set of preclassified documents as a training set to construct text categorization models and then to classify newly received poly-lingual documents into appropriate categories, they face a poly-lingual text categorization (PLTC) problem. Formally, PLTC pertains to learning text categorization models from a training set of poly-lingual documents (some written in language L1, some in L2, …, and some in Ls, where s ≥ 2, but each training document only written in one language), each of which is preclassified into a predefined category (C1, C2, …, or Cn), and then assigning unclassified documents written in L1, …, or Ls into appropriate categories. Because of the availability of training documents in each language, PLTC can be approached simply as multiple, independent monolingual text categorization problems. That is, given a set of poly-lingual preclassified documents as training examples, we can construct a monolingual text categorization model (i.e., classifier) for each language on the basis of the training examples of that respective language. When a new document written in a specific language arrives, we select the corresponding classifier to predict the appropriate category for the target document. However, this naïve PLTC approach employs only the training documents of the same language to construct a monolingual classifier and ignores all training documents of other languages. Thus, if the training documents of a language (e.g., Li) are less representative of the target semantic space of the predefined categories, especially when the training set of the language is small-sized, the effectiveness of the induced classifier for that language would not be satisfactory. 
If the representativeness of the training documents of another language (e.g., Lj) with respect to the predefined categories is greater than that of the training documents of Li, the training documents of Lj can be beneficial to improve the effectiveness of the classifier for Li. Because the naïve PLTC approach constructs each monolingual classifier independently, it fails to exploit this opportunity offered by poly-lingual training documents. In addition, if the training documents of the target language (e.g., Li) contain features (i.e., terms) that are non-semantic-bearing for the predefined categories, the inclusion of those noisy features into the induced classifier will degrade its effectiveness. However, terms in another language (e.g., Lj) that linguistically correspond to the noisy features in Li may not have any discriminating power according to the training documents of Lj. In this case, a cross-check of features occurring in the training documents of different languages (i.e., referred to as feature reinforcement in this study) can help remove the noisy features of individual languages and hence improve the effectiveness of the classifier for each language. Most existing text categorization techniques deal with monolingual text categorization [1,2,7,8,18,19,22,27,34,39,40,42], and some prior studies address the challenge of cross-lingual text categorization (i.e., learning from a set of training documents written in one language and then classifying new documents in a different language) [4,9,23,24,33]. However, prior research has not paid much attention to PLTC. This study therefore is motivated by the importance of providing PLTC support to organizations and individuals in increasingly global, multilingual environments. Specifically, we propose a feature-reinforcement-based PLTC (FR-PLTC) technique that takes into account the training documents of all languages when constructing a monolingual classifier for a specific language. 
For the purposes of our intended feasibility assessment and illustration, this study concentrates on only two languages involved in poly-lingual documents. To support linguistic interoperability between training documents in different languages, we rely on a statistical-based bilingual thesaurus, automatically constructed from a collection of parallel documents. Experimentally, we evaluate the effectiveness of our proposed FR-PLTC technique, using independent monolingual classifiers built by the aforementioned naïve PLTC approach as the performance benchmark. The remainder of this article is organized as follows: In Section 2, we review literature relevant to this study, including existing monolingual, poly-lingual, and cross-lingual text categorization techniques. We depict the detailed development of our proposed FR-PLTC technique
in Section 3, including the overall processes and specific designs. Subsequently, we describe the evaluation design and discuss important experimental results in Section 4. Finally, we conclude with a summary and some further research directions in Section 5.

2. Literature review

2.1. Monolingual text categorization techniques

Text categorization refers to the assignment of documents, on the basis of their contents, to one or more predefined categories. Many text categorization techniques have been proposed in the literature [1,2,7,8,18,19,22,27,34,39,40,42], but most of them focus on monolingual documents. Central to text categorization is the automatic learning of a text categorization model from a training set of preclassified documents. The resulting model will then be used to classify or predict the particular category or categories to which a new, unclassified document belongs. The process of (monolingual) text categorization generally includes three main steps: 1) feature extraction and selection, 2) document representation, and 3) induction [1,32].

Feature extraction extracts or identifies terms (or features) from the training documents. Different languages exhibit different grammatical and lexical characteristics that affect how the features in documents are segmented. For example, there exist prominent differences between European languages (e.g., English) and Oriental languages (e.g., Chinese). Term extraction of English documents typically involves lexical analysis, stopword removal, stemming, and term-phrase formation [37]. However, no natural delimiter in the Chinese language marks word boundaries. An additional mechanism, such as the lexical rule-based or the statistical approach, is required to support lexical analysis and term-phrase segmentation for Chinese documents [38]. 
Following extraction is feature selection, which reduces the size of the feature space, a process that not only improves learning efficiency but also potentially improves learning effectiveness by suppressing potential biases embedded in the original (i.e., non-condensed) feature set [8]. According to the top-k selection method, commonly used in prior research, the k features with the highest selection metric scores are selected to represent each training document. However, previous research varies considerably in the underlying metric used for feature selection. Common examples include TF (term frequency), TF × IDF (IDF denotes inverse document frequency), correlation coefficient, χ2 metric, and mutual information [8,17,19,22,26].

In the document representation step, each document is represented as a vector in the space jointly defined by the top-k features selected in the previous step and is labeled to indicate its category membership. A review of prior research suggests the prevalence of several representation methods, such as binary (which indicates the presence or absence of a feature in a document), within-document TF, and TF × IDF.

Finally, in the induction step, one or more text categorization models that distinguish categories from one another on the basis of the set of training documents are constructed. Prevalent supervised learning techniques employed for text categorization include decision-tree induction [34], decision-rule induction [2,7], k-nearest neighbor (kNN) classification [12,18,20,39], neural networks [22,35], the Naïve Bayes probabilistic algorithm [1,3,18,19,21], and SVM [8,14,41]. Empirical evaluations of different supervised learning strategies for text categorization can be found in the studies by Sebastiani [27] and Yang and Liu [41].

2.2. Poly-lingual and cross-lingual text categorization techniques

As an emerging research topic, the literature related to PLTC remains very limited. Bel et al. 
[4] assume that each training document is available in two different languages (i.e., parallel document) and a newly arrived (i.e., unclassified) document is available in one or both languages. Accordingly, they simply construct a single classifier that encompasses terms from both English and Spanish as its features. However, their
Fig. 1. Illustrations of PLTC and CLTC. a) Poly-lingual Text Categorization (PLTC): preclassified documents (training corpus) in L1 and L2, organized into categories C1, C2, …, Cn, are used to classify unclassified documents (prediction corpus) into appropriate categories. b) Cross-lingual Text Categorization (CLTC): preclassified documents (training corpus) in categories C1, C2, …, Cn are used to classify unclassified documents (prediction corpus) into appropriate categories.
empirical evaluations show that the effectiveness of their PLTC technique is inferior to that of monolingual categorization. The study by Gonçalves and Quaresma [10,11], on the other hand, assumes that both training and unclassified documents are available in multiple languages. Their technique first constructs multiple, independent monolingual classifiers and then adopts a weighted sum scheme to combine them to obtain a poly-lingual text classifier. Our study differs from the aforementioned studies because it assumes that each document in the poly-lingual training set and each unclassified document is written in only one language rather than available in multiple languages. Therefore, the techniques proposed in the prior studies cannot directly be applied to the PLTC problem investigated in our study.

PLTC is related to cross-lingual text categorization (CLTC), because both pertain to multilingual text categorization [33]. However, as Fig. 1 illustrates, PLTC differs from CLTC because CLTC deals with learning from a set of training documents written in one language (L1) and then classifying any unclassified documents (i.e., prediction corpus) in a different language (L2) [4,9,23,24,33], whereas PLTC aims to construct text categorization models automatically from a set of training documents written in a mix of languages (though each training document is written only in one language), and then classify unclassified documents in any of those languages.

The major challenge facing CLTC is the cross-lingual semantic interoperability that establishes the bridge between the representations of the training and prediction documents written in different languages. In response to this challenge, CLTC techniques generally rely on a translation mechanism to cross the language boundary between the prediction and the training corpora. For example, the CLTC technique proposed by Olsson et al. 
[23] assumes the availability of a bilingual dictionary and deals with the CLTC scenario in which a set of English training documents is used to classify Czech documents. Specifically, each unclassified document in Czech is translated into English on the basis of the bilingual dictionary. Similarly, Gliozzo and Strapparava [9] exploit common words in English and Italian
(i.e., identical words used by the two languages) and the availability of English–Italian dictionaries (i.e., MultiWordNet1 and Collins2) to address the CLTC problem. Rigutini et al. [24] develop a machine translation-based CLTC technique between English documents and Italian documents; Bel et al. [4] use a statistical-based bilingual thesaurus (Spanish and English) for their CLTC translations. Finally, the CLTC technique proposed by Wei et al. [33] employs a statistical-based bilingual thesaurus automatically constructed from a parallel corpus (English and Chinese) to translate unclassified documents into the same language as that of the training documents. Although several CLTC techniques have been proposed, they do not involve poly-lingual preclassified documents as training examples, as PLTC does. Thus, CLTC cannot take advantage of the semantics embedded in poly-lingual training documents for text categorization model learning but rather relies on monolingual training documents and a translation mechanism to classify documents in another language. Prior CLTC studies suggest three different translation strategies: a bilingual dictionary, a machine translation system, or a corpus-based method (i.e., constructing a statistical-based bilingual thesaurus from a parallel corpus). The bilingual dictionary translation may be proprietary or costly and is less tolerant to novel terms, technical terms, and proper nouns [33]. The effectiveness of existing machine translation systems may not be satisfactory [16], especially when translating a document that requires more contextual information for accurate translation. In contrast, translation based on a statistical-based bilingual thesaurus offers better contextualized translations. 
In addition, because the corpus-based method extracts from a parallel corpus terms to be included in the statistical-based bilingual thesaurus, those terms in the thesaurus are less limited and can be extended by expanding the number and coverage of parallel documents. Although the corpus-based method overcomes the limitations of the bilingual dictionary and the machine translation approaches, it may be constrained by the availability of an appropriate parallel corpus for a target task. However, prior studies have developed different techniques for mining parallel corpora of different languages from the Web [15,36,43], which can address the limitation of the corpus-based method. As a result, due to its constructability, extensibility and capability of dealing with novel terms, technical terms, and proper nouns, the corpus-based method represents a promising approach to CLTC and other cross-lingual research (e.g., cross-lingual information retrieval, or CLIR). Accordingly, in this study, we focus on PLTC with the support of a statistical-based bilingual thesaurus (i.e., constructed by the corpus-based method).

1 Available at http://multiwordnet.fbk.eu/english/home.php
2 Available at http://www.collinslanguage.com

3. Design of feature-reinforcement-based poly-lingual text categorization (FR-PLTC) technique

As mentioned previously, PLTC can be approached simply as multiple, independent, monolingual text categorization problems. However, when training a monolingual classifier for a specific language, this naïve PLTC approach does not exploit the training documents of other languages to improve the classification effectiveness of the target classifier. In this study, we instead propose a feature-reinforcement-based PLTC (FR-PLTC) technique with the support of a statistical-based bilingual thesaurus to address the potential limitations of the naïve PLTC approach. Fig. 2 shows the overall design of our proposed FR-PLTC technique, which consists of three main tasks: 1) bilingual thesaurus construction to build a statistical bilingual thesaurus (in this study, English and Chinese) from a parallel corpus, 2) categorization learning to induce a text categorization model for each language based on a set of preclassified documents (some written in language L1 and some in L2), and 3) category assignment to predict appropriate categories for unclassified documents in either L1 or L2. 
We depict the detailed design of each task in the following subsections.

3.1. Bilingual thesaurus construction

This task automatically constructs a statistical-based bilingual thesaurus using the co-occurrence analysis technique [13,37], as is commonly employed in CLTC and CLIR research. Fig. 3 illustrates the overall process of bilingual thesaurus construction. Given a parallel corpus (within which each parallel document consists of a subdocument in L1 and a subdocument in L2), the thesaurus construction process starts with term extraction and selection. In this study, we primarily deal with only English and Chinese languages. Accordingly, we use the rule-based part-of-speech tagger developed by Brill [5,6] to tag each word in the English subdocuments in the parallel corpus. Subsequently, we adopt the approach proposed by Voutilainen [30] to extract noun phrases from the syntactically tagged English subdocuments. For the Chinese subdocuments in the parallel corpus, we employ a hybrid approach that combines dictionary-based and statistical approaches (specifically, mutual information measure) [37,38]. Please note that our bilingual thesaurus construction mechanism and the proposed FR-PLTC technique can also be applied to other language contexts (e.g., French, Spanish, Japanese), as long as appropriate language-specific term (or feature) extraction methods are adopted.

After term extraction, the term selection step selects representative terms in both languages for each parallel document. We adopt the TF × IDF scheme as the selection metric. The term weight of a term fj (English or Chinese) in a parallel document di (denoted twij) is calculated as:

\[ tw_{ij} = tf_{ij} \times \log\left(\frac{N_P}{n_j}\right), \tag{1} \]
where tfij is the term frequency of fj in di, NP is the total number of parallel documents in the corpus, and nj is the number of subdocuments (in the same language as fj) in the parallel corpus in which fj appears. For each parallel document, the top kclt English and kclt Chinese terms with the highest TF × IDF values (i.e., twij) that simultaneously occur in more than δDF documents are selected for this parallel document.

Fig. 2. Overall process of proposed FR-PLTC technique (bilingual thesaurus construction derives the statistical-based bilingual thesaurus from the parallel corpus; categorization learning uses the thesaurus and the preclassified documents in L1 and L2 to produce a text categorization model in L1 and one in L2; category assignment applies these models to unclassified documents in L1 or L2 to yield predicted categories).

Fig. 3. Process of bilingual thesaurus construction (parallel corpus of L1 and L2 subdocuments; term extraction and selection; representative nouns and noun phrases in L1 and L2; co-occurrence analysis; statistical-based bilingual thesaurus).

On the basis of the concept that relevant terms often co-occur in the same parallel documents, the co-occurrence analysis then measures the co-importance weight cwijh between terms fj and fh (where fj and fh are in different languages) in the parallel document di, as follows [37]:

\[ cw_{ijh} = tf_{ijh} \times \log\left(\frac{N_P}{n_{jh}}\right), \tag{2} \]

where tfijh is the minimum of tfij and tfih in di, and njh is the number of parallel documents in which both fj and fh occur. Finally, the relevance weights between fj and fh are computed asymmetrically as [37]:

\[ rw_{jh} = \frac{\sum_{i=1}^{N_P} cw_{ijh}}{\sum_{i=1}^{N_P} tw_{ij}} \quad \text{and} \quad rw_{hj} = \frac{\sum_{i=1}^{N_P} cw_{ijh}}{\sum_{i=1}^{N_P} tw_{ih}}, \tag{3} \]
where rwjh denotes the relevance weight from fj to fh, and rwhj is the relevance weight from fh to fj. After we estimate all directional statistical strengths between each pair of English and Chinese terms selected previously, we prune insignificant strengths. Specifically, if the statistical strength from a term in one language to a term in another language is less than a relevance threshold δrw, we remove the link. Upon completion of link pruning, we construct from the input parallel corpus a statistical-based bilingual thesaurus, which consists of many-to-many cross-lingual associations. Table 1 shows a partial bilingual thesaurus on the basis of the parallel
Table 1. A partial statistical-based bilingual thesaurus (English entries: financial market*, chief executive, convertibility, bank; economy, financial crisis*; electronic money*, payment, octopus card, visa, cash, bank; the listed relevance weights range from 0.18 to 0.94).
corpus collected for our evaluation study (described in Section 4). For illustrative purposes, we manually identify and mark with a ‘*’ the cross-lingual associations that denote the correct translations between the Chinese and English terms. The remaining cross-lingual associations (i.e., those without a ‘*’) in Table 1 denote semantically related terms in the two languages.

3.2. Categorization learning

The categorization learning task is the core of our proposed FR-PLTC technique. As we show in Fig. 4, when training a monolingual classifier for language Li (L1 or L2), our proposed categorization learning method takes into account not only the preclassified documents in Li but also the preclassified documents in another language Lj, as well as the statistical-based bilingual thesaurus. Specifically, to train a monolingual classifier for language Li, the categorization learning task involves four steps: feature extraction (for Li and Lj), feature reinforcement and selection (for Li), document representation (in Li), and induction.

3.2.1. Feature extraction

In this step, we extract features from the preclassified documents in both languages. As in the bilingual thesaurus construction task, we use the rule-based part-of-speech tagger [5,6] and the noun phrase parser [30] to extract nouns and noun phrases as features from preclassified English documents. For the preclassified Chinese documents, we employ a hybrid of dictionary-based and statistical-based approaches to extract Chinese terms from this training corpus [37,38].

3.2.2. Feature reinforcement and selection

Following feature extraction, we first assess the discriminating power of each feature in its respective training corpus and language. In this study, we employ the χ2 statistic metric for the target assessment purpose. The χ2 statistic, which measures the dependence between a feature fj and a category Ci, tends to be 0 when fj and Ci are independent. 
Using a two-way contingency table of fj and Ci, let nr+ be the number of documents in the category Ci in which fj occurs, nr− be the number of documents in Ci in which fj does not appear, nn+ be the number of documents in categories other than Ci in which fj occurs, nn− be the number of documents in categories other than Ci in which fj does not appear, and n be the total number of documents in the respective training corpus and language. The χ2 statistic of fj relevant to Ci thus is defined as follows [42]:
\[ \chi^2(f_j, C_i) = \frac{n \times (n_{r+} n_{n-} - n_{r-} n_{n+})^2}{(n_{r+} + n_{r-})(n_{n+} + n_{n-})(n_{r+} + n_{n+})(n_{r-} + n_{n-})}. \tag{4} \]
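As an illustration of Eq. (4), the following minimal Python sketch computes the χ2 statistic from the four cells of the two-way contingency table. The function name `chi_square`, its argument names, and the choice to return 0 when a marginal count is zero (a case in which Eq. (4) is undefined) are assumptions made for this example and are not taken from the original study.

```python
def chi_square(n_r_pos, n_r_neg, n_n_pos, n_n_neg):
    """Chi-square statistic of a feature f_j with respect to a category C_i.

    n_r_pos: documents in C_i in which f_j occurs
    n_r_neg: documents in C_i in which f_j does not appear
    n_n_pos: documents in other categories in which f_j occurs
    n_n_neg: documents in other categories in which f_j does not appear
    """
    n = n_r_pos + n_r_neg + n_n_pos + n_n_neg
    numerator = n * (n_r_pos * n_n_neg - n_r_neg * n_n_pos) ** 2
    denominator = (
        (n_r_pos + n_r_neg)
        * (n_n_pos + n_n_neg)
        * (n_r_pos + n_n_pos)
        * (n_r_neg + n_n_neg)
    )
    if denominator == 0:
        # Assumption of this sketch: a degenerate table (an empty row or
        # column) carries no evidence of dependence, so the score is 0.
        return 0.0
    return numerator / denominator


# A feature that occurs in every document of C_i and in no other document.
print(chi_square(10, 0, 0, 10))  # 20.0
```

When the feature is distributed identically inside and outside the category (for instance, `chi_square(10, 10, 20, 20)`), the numerator vanishes and the score is 0, which matches the independence property noted above.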
Fig. 4. Process of categorization learning (for Li): feature extraction for Li and for Lj from the preclassified documents in each language; feature reinforcement and selection for Li, supported by the statistical-based bilingual thesaurus, yielding the feature set for categories in Li; document representation in Li, yielding document-feature vectors with category labels; and induction, yielding the text categorization model in Li.
After deriving the χ2 statistic of feature fj relevant to each category Ci, we calculate the overall χ2 statistic of fj for all categories T according to the weighted average scheme [42]. That is,

\[ \chi^2(f_j) = \sum_{C_i \in T} p(C_i) \times \chi^2(f_j, C_i), \tag{5} \]
where p(Ci) is the number of documents in Ci divided by n. After the χ2 statistic scores for all features in both languages are obtained, we start to reassess the discriminating power of a feature in one language by considering the discriminating power of its related features in another language. The reason for such cross-checking between two languages is that if a feature in one language and its related features in another language possess high χ2 statistic scores, it is likely that the feature has greater discriminatory power. However, inconsistent assessments between two languages (i.e., χ2 statistic score of a feature fj is high in one language but the χ2 statistic scores of fj's related features are low in another language) result in lower confidence in the discriminatory power of the feature. In this study, we refer to this cross-checking process as feature reinforcement.

Assume a total of N1 features in L1 are extracted from the preclassified training documents in L1 and N2 features in L2 are extracted from the preclassified training documents in L2. Given a feature fj in L1, let R(fj) be the set of features in L2 that have direct cross-lingual associations to fj according to the previously constructed statistical-based bilingual thesaurus. A simple method for deriving the alignment weight for fj in L1 (denoted aw(fj)) from its related features (i.e., R(fj)) in L2 is as follows:

\[ aw(f_j) = \begin{cases} \dfrac{\sum_{\forall g_h \in R(f_j)} \chi^2(g_h) \times rw_{g_h f_j}}{|R(f_j)|} & \text{if } |R(f_j)| \neq 0 \\ 0 & \text{if } |R(f_j)| = 0 \end{cases} \tag{6} \]

where χ2(gh) is the χ2 statistic score of feature gh, and rwghfj is the relevance weight from gh to fj, as specified in the statistical-based bilingual thesaurus. In this formulation, we consider both the discriminating power of the related features (i.e., χ2(gh)) in another language and their relevance weights to fj (i.e., rwghfj). Eq. (6) takes all the associated features into consideration and is normalized by the number of related features (i.e., |R(fj)|). For example, as Fig. 5 illustrates, feature f2 in L1 has four cross-lingual associations from L2 (i.e., g1, g2, g4, and g5). Therefore, the alignment weight of f2 is computed as:

\[ aw(f_2) = \frac{\chi^2(g_1) \times rw_{g_1 f_2} + \chi^2(g_2) \times rw_{g_2 f_2} + \chi^2(g_4) \times rw_{g_4 f_2} + \chi^2(g_5) \times rw_{g_5 f_2}}{4}. \]

Fig. 5. Examples of cross-lingual associations (N1 features f1, f2, f3, f4, f5, … in L1 and N2 features g1, g2, g3, g4, g5, … in L2; f2 is linked to g1, g2, g4, and g5 with relevance weights rwg1f2, rwg2f2, rwg4f2, and rwg5f2).

The aforementioned formula for deriving the alignment weight of fj only considers the relevance weights and the χ2 statistic scores of fj's related features in another language. We can further enhance this formula by exploiting the characteristics of the statistical-based bilingual thesaurus. Particularly, if a term in one language is an overly general term (e.g., “system” and “technique” in information technology-related documents), there must exist many terms in another language that have direct, cross-lingual associations to it. From the text categorization perspective, overly general terms typically do not have any discriminatory power for document categories and therefore should not be included in the feature set for text categorization. On the basis of this observation, the original alignment weight formula is revised in this study as:

\[ aw(f_j) = \begin{cases} \dfrac{\sum_{\forall g_h \in R(f_j)} \chi^2(g_h) \times rw_{g_h f_j}}{|R(f_j)|} \times \log\dfrac{N_2}{|R(f_j)|} & \text{if } |R(f_j)| \neq 0 \\ 0 & \text{if } |R(f_j)| = 0 \end{cases} \tag{7} \]
where N2 denotes the number of features in L2 that are extracted from the preclassified training documents in L2, and log2(N2/|R(fj)|) is referred to as the inverse term frequency (ITF). We again take f2 in Fig. 5 as an example. Its alignment weight now is computed as:

$$
aw(f_2) = \frac{\chi^2(g_1) \times rw_{g_1 f_2} + \chi^2(g_2) \times rw_{g_2 f_2} + \chi^2(g_4) \times rw_{g_4 f_2} + \chi^2(g_5) \times rw_{g_5 f_2}}{4} \times \log_2 \frac{N_2}{4}.
$$
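The alignment weight computation described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `alignment_weight`, the dictionary-based data layout, and all χ2 scores, relevance weights, and the feature count used below are hypothetical values chosen to make the arithmetic easy to follow.

```python
import math


def alignment_weight(related, chi2_other, n_other, use_itf=True):
    """Alignment weight of one feature f_j (Eq. 6, or Eq. 7 when use_itf is True).

    related    -- dict: related feature g_h in the other language -> relevance
                  weight rw from g_h to f_j (this plays the role of R(f_j))
    chi2_other -- dict: feature in the other language -> chi-square score
    n_other    -- number of features extracted in the other language (N2)
    """
    if not related:  # |R(f_j)| = 0
        return 0.0
    # Average of chi-square scores weighted by the relevance weights (Eq. 6).
    aw = sum(chi2_other[g] * rw for g, rw in related.items()) / len(related)
    if use_itf:
        # Inverse term frequency: penalizes features with many associations (Eq. 7).
        aw *= math.log2(n_other / len(related))
    return aw


# Hypothetical scores for feature f2 of Fig. 5 (four related features in L2).
chi2_l2 = {"g1": 12.0, "g2": 8.0, "g4": 4.0, "g5": 16.0}
related_f2 = {"g1": 0.5, "g2": 0.25, "g4": 0.5, "g5": 0.25}

aw_plain = alignment_weight(related_f2, chi2_l2, n_other=1024, use_itf=False)
aw_itf = alignment_weight(related_f2, chi2_l2, n_other=1024)
print(aw_plain, aw_itf)
```

With these assumed inputs the weighted average is (6 + 2 + 2 + 4) / 4 = 3.5, and the ITF factor is log2(1024 / 4) = 8, so the ITF-adjusted weight is 28.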
Apparently, the ITF of an overly general term will be much smaller than that of a more specific term (with fewer cross-lingual associations from another language). Thus, the revised alignment weight formula favors specific terms over overly general ones. Subsequently, we use the following formula to arrive at the overall weight of a feature fj by combining the weights of fj derived from the training documents in both languages:

$$
w(f_j) = \chi^2(f_j) \times aw(f_j)^{\alpha},
\tag{8}
$$
where α denotes the trade-off between the χ2 statistic score of fj in its original language and the alignment weight of fj derived from the other language (where α ≥ 0). When α = 0, the assessment of the overall weight of a feature relies completely on the original language. In contrast, as α increases, the assessment of a feature depends more heavily on its related features in the other language. Because aw(fj) is the average of the χ2 statistic scores of fj's related features in the other language, weighted by the relevance weights from the related features to fj, aw(fj) generally is much less than χ2(fj) in its original language. Because the value ranges of these two scores differ, the value of α does not represent the exact trade-off ratio between the target feature in the original language and its related features in the other language.

After the overall weights of all features for both languages are derived, we perform feature selection. For each language (L1 or L2), we select the k features with the highest overall weights as the features to represent each training document of the respective language.

3.2.3. Document representation
In this step, the training documents of each language are represented by the corresponding feature set selected previously. In this study, because our documents are relatively short (please see Section 4.1 for details), we choose the binary scheme for document representation. Please note that other schemes (e.g., within-document TF and TF × IDF) can also be employed. As a result, each training document di forms a document-feature vector $\vec{d}_i$, together with its known category label. That is,

$$
\vec{d}_i = (w_{i1}, w_{i2}, \ldots, w_{ik}),
\tag{9}
$$
where k is the number of features selected in the previous step and wij indicates the presence (i.e., 1) or absence (i.e., 0) of fj in di. For example, assume that 10 features (i.e., k = 10) are selected and that features f1, f2, f3, f5, and f8 appear in document di. In this case, document di is represented as the document-feature vector $\vec{d}_i = (1, 1, 1, 0, 1, 0, 0, 1, 0, 0)$.

3.2.4. Induction
The induction step creates two monolingual text categorization models from the preclassified documents in L1 and L2, respectively. We adopt the Naïve Bayes probabilistic algorithm and Support Vector Machine (SVM) as alternative supervised learning algorithms because of their popularity in prior research on text categorization [28,31]. The Naïve Bayes probabilistic algorithm uses the joint probabilities of words and categories to estimate the probabilities of categories fitting a particular document [1], whereas SVM, based on the structural risk minimization principle [29], attempts to find a decision surface that best separates the positive and negative training examples with the maximum margin.

3.3. Category assignment
In the category assignment task (as Fig. 6 illustrates), we categorize each unclassified document in L1 or L2 using the corresponding text categorization model induced previously. According to the language used in the unclassified document, we use the respective feature extraction method (see Section 3.2) to extract features from this unclassified document. Subsequently, we use the same document representation scheme as employed in the previous task (i.e., binary) to represent the target unclassified document. Finally, the category prediction step uses the feature vector of the unclassified document to determine an appropriate category on the basis of the corresponding text categorization model.

Fig. 6. Process of category assignment. (Flow diagram: an unclassified document in Li passes through feature extraction and document representation in Li, which yields a document-feature vector in Li; category prediction then applies the text categorization model in Li to produce the predicted category.)

4. Empirical evaluation
In this section, we report on our empirical evaluation of the proposed FR-PLTC technique. In the following, we detail the design of our empirical experiments, including the data collection, evaluation procedure and criteria, and our benchmark technique. Subsequently, we discuss some important evaluation results.

4.1. Data collection
As mentioned previously, the construction of a statistical-based bilingual thesaurus requires parallel documents in two languages. News releases from Government Information Center, Hong Kong Special Administrative Region of the People's Republic of China (accessible at http://www.info.gov.hk/), were collected to construct a statistical-based bilingual thesaurus. Specifically, the parallel corpus collected for our experimental purpose contains 7779 pairs of Chinese and English news releases. Two additional monolingual document corpora also were collected to evaluate the effectiveness of our proposed FR-PLTC technique.
These English and Chinese corpora also include news releases collected from the Government Information Center, Hong Kong. We manually assigned the collected news documents into eight categories (i.e., Commerce & Economy, Communication & IT, Culture & Leisure, Education, Health & Environment, Housing & Land, Security, and Transportation & Traffic). To avoid any biases on classification effectiveness resulting from nonidentical category sizes, we randomly selected the same number of news documents for every category in each corpus. Specifically, every category in the English or Chinese corpus consists of 75 news documents and the total number of news documents in each corpus is 600. We then merged these two monolingual corpora into a poly-lingual corpus for our evaluation purpose. Table 2 shows the summary of the three document corpora used in our empirical evaluation.

Table 2
Summary of our English–Chinese corpora.

Document corpus    Number of documents    Average number of words per document
Parallel corpus    7779 (pairs)           English: 104; Chinese: 115
English corpus     600                    101
Chinese corpus     600                    106
4.2. Evaluation procedure and criteria
To evaluate the effectiveness of PLTC, we randomly select 50% of the documents in the English and the Chinese corpora as our training data set and leave the remaining half in each corpus as the testing data set. To avoid any bias caused by random sampling and obtain a reliable evaluation performance estimate, we repeat the sampling and train-and-test process 30 times and evaluate the effectiveness of the PLTC technique under investigation (i.e., FR-PLTC or its benchmark technique) by averaging the performance obtained from these 30 individual processes. We measure the effectiveness of PLTC on the basis of classification accuracy, defined as the percentage of documents in the testing data set that the PLTC technique under investigation correctly classifies into the predefined categories.

4.3. Performance benchmark
Because the PLTC problem can be approached as multiple, independent monolingual text categorization problems, we construct a monolingual text categorization model (i.e., classifier) for each language on the basis of the training documents of that language only. We adopt this naïve PLTC approach as our benchmark technique and refer to it as the MnTC technique. The design of the MnTC technique follows that of our proposed FR-PLTC technique, except that the MnTC technique does not involve feature reinforcement. Specifically, to construct the monolingual text categorization model for each language, the MnTC technique employs the χ2 statistic metric for feature selection (to select the top-k features on the basis of the training documents of that language), the binary scheme for document representation, and the Naïve Bayes probabilistic algorithm and SVM as alternative supervised learning algorithms. In addition, the category assignment task of the MnTC technique is the same as that of the FR-PLTC technique.

4.4. Comparative evaluations
Before we evaluate the effectiveness of our proposed FR-PLTC technique, we need to determine the appropriate values for the parameters involved in bilingual thesaurus construction. Specifically, during the bilingual thesaurus construction, for every parallel document in each language, the selected terms must satisfy the document frequency threshold δDF (i.e., the selected term must occur in at least δDF documents in the parallel corpus) and be the top kclt terms in the document. Furthermore, the relevance weight of all cross-lingual term associations in the statistical-based bilingual thesaurus must satisfy a prespecified relevance threshold δrw. To avoid a possible overfitting problem, we adopt the values for these parameters suggested by Wei et al. [33]. That is, we set δDF as 3, kclt as 30, and δrw as 0.15 for the subsequent experiments.

4.4.1. English classifier
The parameter involved in the MnTC technique is k (the number of features) in the feature selection step. We examine different values of k, ranging from 200 to 2000 in increments of 200. We also examine the inclusion of all features for categorization learning (i.e., without feature selection). For the proposed FR-PLTC technique, the parameters involved include k and α (required by the feature reinforcement step). We therefore examine the same range of k as that for the MnTC technique and investigate different values of α between 1 and 5 in increments of 1.

Fig. 7 reveals the effects of k on the MnTC technique and k and α on the proposed FR-PLTC technique using the Naïve Bayes algorithm for the English classifier. The classification accuracy attained by the MnTC technique improves when k increases from 200 to all features. That is, MnTC achieves its best classification accuracy when all features are included for categorization learning (i.e., without feature selection). At any level of α under discussion, the FR-PLTC technique exhibits a similar pattern across the range of k investigated. When α is 1, the classification accuracy of the FR-PLTC technique improves if k expands from 200 to 800 and deteriorates when k increases to 1000 and beyond. Overall, when using the Naïve Bayes algorithm, the FR-PLTC technique achieves its best effectiveness when α equals 4 and k equals 1200.

When we employ SVM for the English classifier, the effects of k on the benchmark technique and the effects of k and α on the proposed FR-PLTC technique are similar to those recorded by the Naïve Bayes algorithm. As Fig. 8 illustrates, using SVM, the classification accuracy achieved by MnTC improves when k increases from 200 to all features (i.e., no feature selection). On the other hand, the best FR-PLTC accuracy is achieved when α equals 4 and k equals 1800. Although the proposed FR-PLTC technique generally outperforms its benchmark technique (i.e., MnTC) for both supervised learning algorithms when α equals 4 and k is equal to or higher than 800, its effectiveness is relatively worse when k is less than 600.

Fig. 7. Effects of k and α (Naïve Bayes for English classifier). (Line chart: classification accuracy versus number of features (k) for MnTC (English) and FR-PLTC with α = 1, 2, 3, 4, and 5.)

Fig. 8. Effects of k and α (SVM for English classifier). (Line chart: classification accuracy versus number of features (k) for MnTC (English) and FR-PLTC with α = 1, 2, 3, 4, and 5.)

A plausible reason is that the features selected by the FR-PLTC technique may be representative for text categorization but occur less frequently in the testing data set than those selected by the MnTC technique. For instance, when α = 4 and k = 200, 9.72% of the testing documents do not
contain any features chosen by the FR-PLTC technique, whereas only 0.69% of the testing documents encounter the same problem with the features selected by the MnTC technique. When we increase k to 600, the percentage of the testing documents that do not include any features selected by the FR-PLTC technique decreases significantly (i.e., from 9.72% to 0.35%). For this reason, our proposed FR-PLTC technique does not exhibit its power until a larger number of features is selected.

4.4.2. Chinese classifier
The effects of k on the MnTC technique and those of k and α on the FR-PLTC technique, using either supervised learning algorithm in the Chinese classifier, are highly similar to those in the English classifier. As Fig. 9 shows, the classification accuracy attained by the MnTC technique using Naïve Bayes improves when k increases from 200 to 1800 and then levels off when k is beyond 1800. In contrast, FR-PLTC achieves its best classification accuracy when α equals 3 and k equals 1000. As we illustrate in Fig. 10, using SVM, MnTC without feature selection results in the best effectiveness, whereas α = 4 and k = 1000 together yield the best effectiveness for FR-PLTC.

4.4.3. Overall discussion
Using the parameter values selected previously (i.e., k for MnTC and k and α for FR-PLTC) for each language-specific classifier (English and Chinese) across the two supervised learning algorithms, we highlight the effectiveness of our proposed FR-PLTC technique by comparing it with that of the benchmark technique. As we summarize in Tables 3 and 4, using Naïve Bayes, the classification accuracy achieved by our FR-PLTC technique is 57.11% in the English classifier and 55.38% in the Chinese classifier. In particular, our proposed FR-PLTC technique improves the classification accuracy by 3.60% and 4.21% compared with the benchmark technique in the English and Chinese classifiers, respectively.
We perform two-tailed, paired t-tests to examine the statistical significance of the difference between the classification accuracy of FR-PLTC and that of the benchmark technique for each classifier (English and Chinese). According to our results, the differences in both classifiers are statistically significant (i.e., p-values less than 0.01). When using the SVM algorithm, our FR-PLTC technique improves the classification accuracy over the benchmark technique by 4.19% (53.10% vs. 48.91%) and 1.95% (45.24% vs. 43.29%) for the English and Chinese classifiers, respectively. The differentials in both English and Chinese classifiers are also statistically significant at the 0.01 level. Overall, our proposed FR-PLTC technique consistently outperforms its counterpart across two supervised learning algorithms (Naïve Bayes and SVM) and two classifiers (i.e., English and Chinese). In addition, the FR-PLTC technique with Naïve Bayes achieves the best classification accuracy in both English and Chinese classifiers.
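The paired comparison over the 30 repeated train-and-test runs can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' analysis script: the function name `paired_t_statistic` is hypothetical, and the per-run accuracies below are made-up placeholders rather than results from this study. The constant 2.756 is the standard two-tailed critical value of Student's t distribution at the 0.01 level with 29 degrees of freedom.

```python
import math
import statistics


def paired_t_statistic(acc_a, acc_b):
    """Return (t, degrees of freedom) for a paired t-test on two accuracy lists.

    acc_a and acc_b must come from the same runs, in the same order.
    Assumes the per-run differences are not all identical (nonzero spread).
    """
    if len(acc_a) != len(acc_b) or len(acc_a) < 2:
        raise ValueError("need two equally long lists with at least two runs")
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    mean_diff = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff / std_err, len(diffs) - 1


# Placeholder accuracies for 30 runs (hypothetical, for illustration only).
mntc_acc = [0.50 + 0.002 * (i % 5) for i in range(30)]
frpltc_acc = [m + 0.030 + 0.004 * (i % 3) for i, m in enumerate(mntc_acc)]

t_value, dof = paired_t_statistic(frpltc_acc, mntc_acc)

# Two-tailed critical value at the 0.01 level for 29 degrees of freedom.
CRITICAL_T_29_001 = 2.756
significant = abs(t_value) > CRITICAL_T_29_001
print(round(t_value, 2), dof, significant)
```

Pairing the runs matters here because both techniques are evaluated on the same random splits, so the test operates on per-split differences rather than on two independent samples.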
Fig. 10. Effects of k and α (SVM for Chinese classifier). (Line chart: classification accuracy versus number of features (k) for MnTC (Chinese) and FR-PLTC with α = 1, 2, 3, 4, and 5.)
4.5. Analysis of ITF effects
As described in Section 3.2, the simplest method for deriving the alignment weight of feature fj in one language (e.g., L1) considers only the relevance weights and the χ2 statistic scores of fj's related features in another language (e.g., L2) (Eq. (6)). We revise the alignment weight formula in Eq. (7) by decreasing the weights of potentially general terms (i.e., we multiply the original alignment weight by the inverse term frequency (ITF) of fj). To understand the effects of ITF on the classification accuracy of our proposed FR-PLTC technique, we empirically evaluate the effectiveness of FR-PLTC with and without the consideration of ITF when we derive the alignment weights of terms. We refer to the FR-PLTC technique without ITF as FR-PLTC w/o ITF. In this experiment, we focus on the FR-PLTC technique using the Naïve Bayes probabilistic algorithm because, as noted previously, this design achieves the best classification accuracy for both classifiers.

As Fig. 11 shows, across the range of k examined, the FR-PLTC w/o ITF technique (α = 4) generally is slightly better than the MnTC technique in the English classifier. However, the FR-PLTC technique (α = 4) attains higher classification accuracy than does the FR-PLTC w/o ITF technique when k is larger than 600. In the best versus best scenario comparison, the highest classification accuracy achieved by the FR-PLTC technique (57.11% when k = 1200) is greater than that reached by the FR-PLTC w/o ITF technique (i.e., 53.51% when all features are used). The differential (i.e., 3.60%) between the FR-PLTC technique and the FR-PLTC w/o ITF technique is statistically significant (i.e., p-values less than 0.01). This empirical result suggests the utility of ITF for decreasing the alignment weights of general terms and thus favors the selection of more specific terms for categorization learning.
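How ITF changes a feature ranking can be illustrated with a toy example that combines the alignment weight (Eqs. (6) and (7)) with the overall weight of Eq. (8). The sketch below is an assumption-laden illustration rather than the authors' code: the function names, the two example terms, and every score and relevance weight are hypothetical. The general term is linked to eight features in the other language, whereas the specific term is linked to only two.

```python
import math


def alignment_weight(related, chi2_other, n_other, use_itf):
    """Eq. (6) when use_itf is False, Eq. (7) when use_itf is True."""
    if not related:
        return 0.0
    avg = sum(chi2_other[g] * rw for g, rw in related.items()) / len(related)
    return avg * math.log2(n_other / len(related)) if use_itf else avg


def rank_features(chi2_own, related_map, chi2_other, n_other, alpha, use_itf):
    """Rank features by the overall weight w = chi2 * aw ** alpha (Eq. (8))."""
    weights = {
        f: chi2_own[f]
        * alignment_weight(related_map.get(f, {}), chi2_other, n_other, use_itf) ** alpha
        for f in chi2_own
    }
    return sorted(weights, key=weights.get, reverse=True)


# Hypothetical toy data: 16 features in the other language, all with chi2 = 4.0.
chi2_other = {f"g{i}": 4.0 for i in range(1, 17)}
chi2_own = {"government": 10.0, "tunnel": 10.0}
related_map = {
    "government": {f"g{i}": 0.5 for i in range(1, 9)},  # general: 8 associations
    "tunnel": {"g9": 0.4, "g10": 0.4},                  # specific: 2 associations
}

rank_without_itf = rank_features(chi2_own, related_map, chi2_other, 16, alpha=4, use_itf=False)
rank_with_itf = rank_features(chi2_own, related_map, chi2_other, 16, alpha=4, use_itf=True)
print(rank_without_itf, rank_with_itf)
```

In this toy setting the general term ranks first without ITF, but the specific term overtakes it once the ITF factor (log2(16/8) = 1 versus log2(16/2) = 3) is applied; selecting the top-k features then amounts to slicing the ranked list.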
As Table 5 illustrates, FR-PLTC w/o ITF (α = 4) ranks such overly general terms as “hong kong,” “people,” and “government” within the top 100 features and “survey,” “year,” and “today” within the top 300 features. However, when considering ITF in the alignment weight estimation, FR-PLTC (α = 4) significantly lowers the ranks of these general terms (e.g., “hong kong” from 28 to 1900 and “government” from 92 to 1903).

We also observe similar empirical results in the Chinese classifier. As we show in Fig. 12, the FR-PLTC technique (α = 3) generally outperforms the FR-PLTC w/o ITF technique (α = 5), which in turn surpasses the MnTC technique in the Chinese classifier. The highest classification accuracy achieved by the FR-PLTC technique (55.38% when k = 1000) is significantly greater (i.e., p-value less than 0.01) than that attained by the FR-PLTC w/o ITF technique (53.58% when k = 600). Together, our empirical results support the positive effects of ITF for improving classification accuracy in poly-lingual text categorization environments.

Table 5
Rankings of features in FR-PLTC w/o ITF and FR-PLTC.

Features       Ranking in FR-PLTC w/o ITF    Ranking in FR-PLTC
Hong kong      28                            1900
People         83                            969
Government     92                            1903
Survey         158                           658
Year           278                           2018
Today          284                           1988
Information    362                           1454
China          458                           1528

Fig. 9. Effects of k and α (Naïve Bayes for Chinese classifier). (Line chart: classification accuracy versus number of features (k) for MnTC (Chinese) and FR-PLTC with α = 1, 2, 3, 4, and 5.)

Table 3
Comparison of effectiveness of FR-PLTC and MnTC for English classifier.

Supervised learning algorithm    Classification accuracy (FR-PLTC)    Classification accuracy (MnTC)    Δ
Naïve Bayes                      57.11%                               53.51%                            3.60%***
SVM                              53.10%                               48.91%                            4.19%***

Notes: Δ denotes the improvement, calculated as (classification accuracy of FR-PLTC − classification accuracy of MnTC), in Tables 3–4. ***: Significant at p < 0.01 on a two-tailed, paired t-test.

Table 4
Comparison of effectiveness of FR-PLTC and MnTC for Chinese classifier.

Supervised learning algorithm    Classification accuracy (FR-PLTC)    Classification accuracy (MnTC)    Δ
Naïve Bayes                      55.38%                               51.17%                            4.21%***
SVM                              45.24%                               43.29%                            1.95%***

4.6. Sensitivity to size of training data set
In our next experiment, we assess the sensitivity of the proposed FR-PLTC technique to the size of the training data set. In previous experiments, we randomly selected 50% of the documents from the English and Chinese corpora as our training data set and the remainder as the testing data set. Here, we reduce the size of the training set of the target language L1 from 50% to 20% in decrements of 10%, but still maintain the same training size (i.e., 50%) in the other language L2 (referred to as the auxiliary language). As with the previous experiment, we concentrate on the FR-PLTC technique that uses the Naïve Bayes probabilistic algorithm and adopts 4 for α to derive the overall weights of features. Tables 6 and 7 summarize, for each specific training size, the best classification accuracy attained by the FR-PLTC and MnTC techniques across the ranges of k examined. When the size of the training data set of the target language (L1) declines from 50% to 20%, the classification accuracy of each technique gradually degrades. However, for both classifiers (English and Chinese), the FR-PLTC technique significantly outperforms (i.e., p-values less than 0.01) the MnTC technique at any size of the training data set examined. As Table 6 demonstrates, for the English text categorization, the accuracy improvement achieved by the FR-PLTC technique is 3.08% or higher across the range of training sizes investigated. For the Chinese text categorization (as Table 7 shows), the FR-PLTC technique outperforms the MnTC technique by 4.17% or more for any training set size under examination.

Moreover, we conduct another sensitivity analysis by decreasing the size of the training set of the auxiliary language (L2) from 50% to 20% in decrements of 10% but keeping the training size in the target language (L1) unchanged (i.e., 50%). The best classification performance achieved by the FR-PLTC and MnTC techniques across the ranges of k investigated is shown in Tables 8 and 9. Because the size of the training set of the target language (L1) does not change, the MnTC technique maintains the same classification accuracy over the range of the training sizes in the auxiliary language (L2). When the size of the training data set of L2 declines from 50% to 20%, the classification accuracy of the FR-PLTC technique generally decreases marginally. However, for both classifiers (English and Chinese), the FR-PLTC technique still significantly outperforms (i.e., p-values less than 0.01) the MnTC technique at any size of the training data set of L2 examined. As Table 8 demonstrates, for the English text categorization, the accuracy improvement achieved by the FR-PLTC technique is 2.66% or higher across the range of training sizes in L2 investigated. For the Chinese text categorization (as Table 9 shows), the FR-PLTC technique outperforms the MnTC technique by 3.22% or more for any training set size under examination. Together, the empirical results obtained from the two sensitivity analysis experiments suggest the robustness of the proposed FR-PLTC technique with respect to the range of training sizes in either the target language (L1) or the auxiliary language (L2) investigated.

4.7. Generalizability analysis on English–French corpora
To assess the generalizability of our proposed FR-PLTC technique, we perform an additional evaluation on English–French corpora. News releases from Canada News Centre (news releases in English are accessible at http://news.gc.ca/web/index-eng.do and those in French are accessible at http://nouvelles.gc.ca/web/index-fra.do) were collected to form the parallel corpus and the two monolingual corpora (English and French, respectively). Specifically, we collected 5754 pairs of English and French news releases. The news releases in our collection span 22 categories; many of the collected news releases pertain to two or more categories simultaneously assigned by Canada News Centre. To maintain the same number of categories as that in our English–Chinese corpora, we chose the eight largest categories, including Economics & Industry, Government & Politics, Health & Safety, Information & Communications, Military, Nature & Environment, Society & Culture, and Transportation.
Fig. 11. Effects of ITF (Naïve Bayes for English classifier). (Line chart: classification accuracy versus number of features (k) for MnTC (English), FR-PLTC (α = 4), and FR-PLTC w/o ITF (α = 4).)

Fig. 12. Effects of ITF (Naïve Bayes for Chinese classifier). (Line chart: classification accuracy versus number of features (k) for MnTC (Chinese), FR-PLTC (α = 3), and FR-PLTC w/o ITF (α = 5).)

Table 6
Effectiveness of FR-PLTC and MnTC for English classifier across different training sizes in the target language (L1).

Training size                           Classification accuracy (FR-PLTC)    Classification accuracy (MnTC)    Δ
English (L1): 50%, Chinese (L2): 50%    57.11%                               53.51%                            3.60%***
English (L1): 40%, Chinese (L2): 50%    55.29%                               51.86%                            3.43%***
English (L1): 30%, Chinese (L2): 50%    53.86%                               50.12%                            3.74%***
English (L1): 20%, Chinese (L2): 50%    50.27%                               47.19%                            3.08%***

Notes: Δ denotes the improvement, calculated as (classification accuracy of FR-PLTC − classification accuracy of MnTC), in Tables 6–7. ***: Significant at p < 0.01 on a two-tailed, paired t-test.

Table 8
Effectiveness of FR-PLTC and MnTC for English classifier across different training sizes in the auxiliary language (L2).

Training size                           Classification accuracy (FR-PLTC)    Classification accuracy (MnTC)    Δ
English (L1): 50%, Chinese (L2): 50%    57.11%                               53.51%                            3.60%***
English (L1): 50%, Chinese (L2): 40%    56.66%                               53.51%                            3.15%***
English (L1): 50%, Chinese (L2): 30%    56.53%                               53.51%                            3.02%***
English (L1): 50%, Chinese (L2): 20%    56.17%                               53.51%                            2.66%***

Notes: Δ denotes the improvement, calculated as (classification accuracy of FR-PLTC − classification accuracy of MnTC), in Tables 8–9. ***: Significant at p < 0.01 on a two-tailed, paired t-test.

For each category in the English corpus, we randomly selected from our collection 75 news releases exclusively belonging to the focal category and retained only their English parts (i.e., removing their French parts). Likewise, for each category in the French corpus, we also randomly selected 75 news releases that exclusively belonged to the focal category but were not included in our English corpus, and then preserved their French parts only. The remaining news releases in our collection (i.e., 5754 − 75 × 8 − 75 × 8 = 4554) then serve as the parallel corpus for the construction of a statistical-based bilingual (English–French) thesaurus. Table 10 shows the summary of our English–French corpora.

For English term (feature) extraction, we also use the rule-based part-of-speech tagger [5,6] and the noun phrase parser [30] to extract nouns and noun phrases as features from the English documents in our English corpus and the parallel corpus. For French term (feature) extraction, we use the TreeTagger [25], which is one of the most widely used French part-of-speech taggers. Accordingly, we extract nouns and noun phrases from the syntactically tagged French documents in our French corpus and the parallel corpus. Subsequently, we follow the evaluation procedure and parameter setting identical to those of the experiments described in Sections 4.2 and 4.4 to evaluate the effectiveness of our proposed FR-PLTC and the benchmark technique (i.e., MnTC) on the English–French corpora.

As we summarize in Tables 11 and 12, when using Naïve Bayes, the classification accuracy achieved by our FR-PLTC technique is 44.11% in the English classifier and 47.72% in the French classifier. Particularly, our proposed FR-PLTC technique improves the classification accuracy over the benchmark technique by 2.45% and 1.16% for the English and French classifiers, respectively. The differences in both classifiers are statistically significant (i.e., p-values less than 0.01). Similarly, when using SVM, the FR-PLTC technique outperforms the benchmark technique by 2.02% and 0.84% in classification accuracy for the English and French classifiers, respectively. The differentials are also statistically significant (p < 0.01 for the English classifier and p < 0.05 for the French classifier).
Consistent with the evaluation results obtained from our English– Chinese corpora, the proposed FR-PLTC technique outperforms its counterpart across two supervised learning algorithms and two classifiers (i.e., English and French) in the English–French corpora. Overall, our empirical results obtained from the two corpora (i.e., English–Chinese and English–French) suggest that the proposed FR-PLTC technique is capable of providing more effective PLTC support than the benchmark technique in different poly-lingual environments.
Because of the trend of globalization, an organization or individual often generates or acquires, and then archives, documents written in different languages. If organizations or individuals have already
organized poly-lingual documents into categories and would like to use this set of preclassified poly-lingual documents as training documents to construct text categorization models that can classify newly arrived, unclassified poly-lingual documents into appropriate categories, they face the PLTC challenge. In this study, we propose the FRPLTC technique, which takes into account the training documents of all languages when constructing a monolingual classifier for a specific language. Using the naïve PLTC approach (i.e., MnTC) as performance benchmark, we conduct an empirical evaluation and find that the proposed FR-PLTC technique achieves higher classification accuracy than the benchmark technique. In addition, our empirical results suggest the superiority of the proposed FR-PLTC technique across the range of training sizes investigated. Our research contributions are fourfold. First, this study contributes to poly-lingual text categorization research by proposing a more effective PLTC technique. Specifically, we exploit the opportunity offered by poly-lingual training documents to propose a feature reinforcement mechanism for selecting more representative features for text categorization purposes and accordingly develop a feature-reinforcement-based PLTC (FR-PLTC) technique. Our empirical evaluation results offer empirical evidence of the relative effectiveness of the proposed PLTC technique, compared with the benchmark technique (i.e., MnTC). Second, this study also contributes to PLTC research by relaxing the requirements of existing PLTC techniques, which assume that each training document is available in multiple languages. Our study assumes that each document in the poly-lingual training set is written in only one language and thus the applicability of our proposed technique is greater than that of existing PLTC techniques. 
Third, this study demonstrates the effectiveness of our proposed feature reinforcement mechanism and therefore can advance current text mining research. For example, existing event tracking techniques mainly focus on monolingual news documents. Thus, event tracking research can adopt our proposed feature reinforcement mechanism to develop poly-lingual event tracking techniques for dealing with tracking events from poly-lingual news documents. Fourth, the current study can advance and support practices in many application domains in which poly-lingual text categorization prevails. For example, to mitigate information overload problem experienced by customers and strengthen customer relationships, business-to-customer (B2C) e-commerce sites have increasingly adopted recommendation systems to provide personalized recommendations. Content-based recommendation, representing a salient approach for supporting automated recommendations, recommends products similar to those a given
5. Conclusion and future research directions
Table 7 Effectiveness of FR-PLTC and MnTC for Chinese classifier across different training sizes in the target language (L1).
| Training size | Classification accuracy: FR-PLTC | Classification accuracy: MnTC | Δ |
| --- | --- | --- | --- |
| Chinese (L1): 50%, English (L2): 50% | 55.38% | 51.17% | 4.21%*** |
| Chinese (L1): 40%, English (L2): 50% | 53.51% | 49.34% | 4.17%*** |
| Chinese (L1): 30%, English (L2): 50% | 51.17% | 46.17% | 5.00%*** |
| Chinese (L1): 20%, English (L2): 50% | 48.77% | 44.51% | 4.26%*** |
Table 9 Effectiveness of FR-PLTC and MnTC for Chinese classifier across different training sizes in the auxiliary language (L2).
| Training size | Classification accuracy: FR-PLTC | Classification accuracy: MnTC | Δ |
| --- | --- | --- | --- |
| Chinese (L1): 50%, English (L2): 50% | 55.38% | 51.17% | 4.21%*** |
| Chinese (L1): 50%, English (L2): 40% | 54.99% | 51.17% | 3.82%*** |
| Chinese (L1): 50%, English (L2): 30% | 54.48% | 51.17% | 3.31%*** |
| Chinese (L1): 50%, English (L2): 20% | 54.39% | 51.17% | 3.22%*** |
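The Δ values reported in Tables 7 and 9 are consistent with a simple difference between the FR-PLTC and MnTC accuracies, expressed in percentage points. The following minimal Python sketch recomputes them from the tabulated accuracies; the names `TABLE_7`, `TABLE_9`, and `improvements` are illustrative and do not come from the paper, and the significance markers are not reproduced because the per-trial results behind the paired t-tests are not listed here.

```python
# Accuracy values (in percent) transcribed from Tables 7 and 9.
# Each row: (training-size setting, FR-PLTC accuracy, MnTC accuracy).
# L1 = Chinese (target language), L2 = English (auxiliary language).
TABLE_7 = [
    ("L1 50%, L2 50%", 55.38, 51.17),
    ("L1 40%, L2 50%", 53.51, 49.34),
    ("L1 30%, L2 50%", 51.17, 46.17),
    ("L1 20%, L2 50%", 48.77, 44.51),
]
TABLE_9 = [
    ("L1 50%, L2 50%", 55.38, 51.17),
    ("L1 50%, L2 40%", 54.99, 51.17),
    ("L1 50%, L2 30%", 54.48, 51.17),
    ("L1 50%, L2 20%", 54.39, 51.17),
]


def improvements(rows):
    """Return delta = FR-PLTC accuracy - MnTC accuracy (percentage points) per row."""
    return [round(fr_pltc - mntc, 2) for _, fr_pltc, mntc in rows]


if __name__ == "__main__":
    for name, rows in (("Table 7", TABLE_7), ("Table 9", TABLE_9)):
        for (setting, _, _), delta in zip(rows, improvements(rows)):
            print(f"{name} | {setting}: delta = {delta:.2f} points")
```

In Table 9 the MnTC accuracy stays at 51.17% in every row because the monolingual benchmark uses only the L1 training documents, so shrinking the L2 training set affects FR-PLTC alone; the recomputed gain falls from 4.21 to 3.22 points as the L2 share drops from 50% to 20%.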
Table 10 Summary of our English–French corpora.
| Document corpus | Number of documents | Average number of words per document (English) | Average number of words per document (French) |
| --- | --- | --- | --- |
| Parallel corpus | 4554 (pairs) | 113 | 160 |
| English corpus | 600 | 114 | |
| French corpus | 600 | | 166 |
Table 11 Effectiveness of FR-PLTC and MnTC for English classifier on English–French corpora.
| Supervised learning algorithm | Classification accuracy: FR-PLTC | Classification accuracy: MnTC | Δ |
| --- | --- | --- | --- |
| Naïve Bayes | 44.11% | 41.66% | 2.45%*** |
| SVM | 44.40% | 42.38% | 2.02%*** |
Notes: Δ denotes the improvement, calculated as (classification accuracy of FR-PLTC − classification accuracy of MnTC), in Tables 11–12. ***: Significant at p < 0.01 on a two-tailed, paired t-test; **: p < 0.05.
Table 12 Effectiveness of FR-PLTC and MnTC for French classifier on English–French corpora.
| Supervised learning algorithm | Classification accuracy: FR-PLTC | Classification accuracy: MnTC | Δ |
| --- | --- | --- | --- |
| Naïve Bayes | 47.72% | 46.56% | 1.16%*** |
| SVM | 45.72% | 44.88% | 0.84%** |
customer has liked in the past. When target products are information products (e.g., books, articles), the content-based recommendation approach generally is formulated as a text classification problem, in which a supervised learning algorithm is employed to construct a classifier (i.e., recommendation model) for each customer, using the products the customer has liked or disliked in the past as training examples. As B2C e-commerce sites become globalized, they likely carry information products in more than one language and, as a result, their content-based recommendations are in effect a poly-lingual text categorization problem. Thus, our proposed FR-PLTC technique can be employed to produce more effective recommendations than the traditional MnTC technique can. Another application domain that our proposed technique can support is content-based spam filtering, which is commonly treated as a text categorization problem (i.e., using spam emails as positive training examples and legitimate emails as negative training examples) [31]. Poly-lingual text categorization evidently prevails in the spam filtering problem, because email users often receive poly-lingual emails (i.e., some written in language L1, some in L2, etc.). Hence, our proposed FR-PLTC technique can be adopted to construct a more effective spam filtering mechanism.
Some future research directions related to this study include the following. First, our current evaluation study employs two sets of document corpora in a specific domain (i.e., news releases). Evaluating our proposed FR-PLTC technique using other document corpora that pertain to more diversified application domains (e.g., content-based recommendation, spam filtering) is an essential and desired objective. Second, the documents involved in our current evaluations are relatively short. Document length may affect the choice of document representation scheme in our proposed technique. Further empirical investigations should involve the use of longer documents. Third, our proposed FR-PLTC technique focuses on only two languages. It would be interesting to extend the FR-PLTC technique to deal with preclassified poly-lingual documents that involve more than two languages. Fourth, in addition to PLTC, the development of other poly-lingual text mining techniques (e.g., poly-lingual event tracking, poly-lingual sentiment classification) demands further research attention as well.
Acknowledgment
This work was supported by the National Science Council of the Republic of China under the grant NSC 101-2410-H-002-041-MY3.
References
[1] R. Agrawal, R. Bayardo, R. Srikant, Athena: mining-based interactive management of text databases, Proceedings of the 7th International Conference on Extending Database Technology, 2000, pp. 365–379.
[2] C. Apte, F. Damerau, S. Weiss, Automated learning of decision rules for text categorization, ACM Transactions on Information Systems 12 (3) (1994) 233–251.
[3] L.D. Baker, A.K. McCallum, Distributional clustering of words for text classification, Proceedings of the 21st International ACM SIGIR Conference on Research and Development in Information Retrieval, 1998, pp. 96–103.
[4] N. Bel, C.H.A. Koster, M. Villegas, Cross-lingual text categorization, Lecture Notes in Computer Science 2769 (2003) 126–139.
[5] E. Brill, A simple rule-based part of speech tagger, Proceedings of the 3rd Conference on Applied Natural Language Processing, 1992, pp. 152–155.
[6] E. Brill, Some advances in rule-based part of speech tagging, Proceedings of the 12th National Conference on Artificial Intelligence, 1994, pp. 722–727.
[7] W.W. Cohen, Y. Singer, Context-sensitive learning methods for text categorization, ACM Transactions on Information Systems 17 (2) (1999) 141–173.
[8] S. Dumais, J. Platt, D. Heckerman, M. Sahami, Inductive learning algorithms and representation for text categorization, Proceedings of the 1998 ACM 7th International Conference on Information and Knowledge Management, 1998, pp. 148–155.
[9] A. Gliozzo, C. Strapparava, Exploiting comparable corpora and bilingual dictionaries for cross-language text categorization, Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the ACL, 2006, pp. 553–560.
[10] T. Gonçalves, P. Quaresma, Multilingual text classification through combination of monolingual classifiers, Proceedings of the 4th Workshop on Legal Ontologies and Artificial Intelligence Techniques, 2010, pp. 29–38.
[11] T. Gonçalves, P. Quaresma, Polylingual text classification in the legal domain, Informatica & Diritto Journal (2011) 203–216.
[12] M. Iwayama, T. Tokunaga, Cluster-based text categorization: a comparison of category search strategies, Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1995, pp. 273–281.
[13] Y. Jing, W.B. Croft, An association thesaurus for information retrieval, Technical Report, Department of Computer Science, University of Massachusetts at Amherst, 1994.
[14] T. Joachims, Text categorization with support vector machines: learning with many relevant features, Lecture Notes in Computer Science 1398 (1998) 137–142.
[15] C. Kit, J.Y.H. Ng, An intelligent Web agent to mine bilingual parallel pages via automatic discovery of URL pairing patterns, Proceedings of 2007 IEEE/WIC/ACM International Conferences on Web Intelligence, 2007, pp. 526–529.
[16] K.L. Kwok, Evaluation of an English–Chinese cross-lingual retrieval experiment, AAAI Symposium on Cross Language Text & Speech Retrieval, 1997, pp. 133–137.
[17] W. Lam, C.Y. Ho, Using a generalized instance set for automatic text categorization, Proceedings of the 21st International ACM SIGIR Conference on Research and Development in Information Retrieval, 1998, pp. 81–89.
[18] L. Larkey, W. Croft, Combining classifiers in text categorization, Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1996, pp. 289–297.
[19] D. Lewis, M. Ringuette, A comparison of two learning algorithms for text categorization, Proceedings of Symposium on Document Analysis and Information Retrieval, 1994, pp. 81–93.
[20] B. Masand, G. Linoff, D. Waltz, Classifying news stories using memory based reasoning, Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1992, pp. 59–64.
[21] A.K. McCallum, K. Nigam, A comparison of event models for naïve Bayes text classification, Proceedings of AAAI-98 Workshop on Learning for Text Categorization, 1998.
[22] H.T. Ng, W.B. Goh, K.L. Low, Feature selection, perceptron learning, and a usability case study for text categorization, Proceedings of Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1997, pp. 67–73.
[23] J.S. Olsson, D.W. Oard, J. Hajic, Cross-language text classification, Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2005, pp. 645–646.
[24] L. Rigutini, M. Maggini, B. Liu, An EM-based training algorithm for cross-language text categorization, Proceedings of the 2005 IEEE/WIC/ACM International Conference on Web Intelligence, 2005, pp. 529–535.
[25] H. Schmid, Probabilistic part-of-speech tagging using decision trees, Proceedings of International Conference on New Methods in Language Processing, 1994, pp. 44–49.
[26] H. Schutze, D.A. Hull, J.O. Pedersen, A comparison of classifiers and document representations for the routing problem, Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1995, pp. 229–237.
[27] F. Sebastiani, Machine learning in automated text categorization, ACM Computing Surveys 34 (1) (2002) 1–47.
[28] A. Sun, E.P. Lim, Y. Liu, On strategies for imbalanced text classification using SVM: a comparative study, Decision Support Systems 48 (1) (2009) 191–201.
[29] V.N. Vapnik, The Nature of Statistical Learning Theory, 2nd ed. Springer, Berlin, Germany, 2000.
[30] A. Voutilainen, Nptool: a detector of English noun phrases, Proceedings of the Workshop on Very Large Corpora, 1993, pp. 48–57.
[31] C. Wei, H.C. Chen, T.H. Cheng, Effective spam filtering: a single-class learning and ensemble approach, Decision Support Systems 45 (3) (2008) 491–503.
[32] C. Wei, Y.H. Lee, Event detection from online news documents for supporting environmental scanning, Decision Support Systems 36 (4) (2004) 385–401.
[33] C. Wei, Y.T. Lin, C.C. Yang, Cross-lingual text categorization: conquering language boundaries in globalized environments, Information Processing and Management 47 (5) (2011) 786–804.
[34] S.M. Weiss, C. Apte, F.J. Damerau, D.E. Johnson, F.J. Oles, T. Goetz, T. Hampp, Maximizing text-mining performance, IEEE Intelligent Systems 14 (4) (1999) 63–69.
[35] E. Wiener, J.O. Pedersen, A.S. Weigend, A neural network approach to topic spotting, Proceedings of the 4th Annual Symposium on Document Analysis and Information Retrieval, 1995, pp. 317–332.
[36] C.C. Yang, K.W. Li, Automatic construction of English/Chinese parallel corpora, Journal of the American Society for Information Science and Technology 54 (8) (2003) 730–742.
[37] C.C. Yang, J. Luk, Automatic generation of English/Chinese thesaurus based on a parallel corpus in laws, Journal of the American Society for Information Science and Technology 54 (7) (2003) 671–682.
[38] C.C. Yang, J. Luk, S. Yung, J. Yen, Combination and boundary detection approach for Chinese indexing, Journal of the American Society for Information Science 51 (4) (2000) 340–351.
[39] Y. Yang, Expert network: effective and efficient learning from human decisions in text categorization and retrieval, Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1994, pp. 13–22.
[40] Y. Yang, C.G. Chute, An example-based mapping method for text categorization and retrieval, ACM Transactions on Information Systems 12 (3) (1994) 252–277.
[41] Y. Yang, X. Liu, A re-examination of text categorization methods, Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1999, pp. 42–49.
[42] Y. Yang, J.O. Pedersen, A comparative study on feature selection in text categorization, Proceedings of 14th International Conference on Machine Learning, 1997, pp. 412–420.
[43] Y. Zhang, K. Wu, J. Gao, P. Vines, Automatic acquisition of Chinese–English parallel corpus from the Web, Lecture Notes in Computer Science 3936 (2006) 420–431.
Chih-Ping Wei received a BS in Management Science from the National Chiao-Tung University in Taiwan, ROC in 1987 and an MS and a Ph.D. in Management Information Systems from the University of Arizona in 1991 and 1996. He is currently a professor in the Department of Information Management at National Taiwan University, Taiwan, ROC. Prior to joining National Taiwan University in 2010, he was a professor at National Tsing Hua University and National Sun Yat-sen University in Taiwan and a visiting scholar at the University of Illinois at Urbana-Champaign (Fall 2001) and the Chinese University of Hong Kong (Summer 2006 and 2007). His papers have appeared in Journal of Management Information Systems (JMIS), European Journal of Information Systems, Decision Support Systems (DSS), IEEE Transactions on Engineering Management, IEEE Software, IEEE Intelligent Systems, IEEE Transactions on Systems, Man, and Cybernetics, IEEE Transactions on Information Technology in Biomedicine, Journal of the American Society for Information Science and Technology, Information Processing and Management, Journal of Database Management, and Journal of Organizational Computing and Electronic Commerce, among others. His current research interests include knowledge discovery and data mining, text mining and information retrieval, knowledge management, and patent analysis and intelligence. Chih-Ping Wei can be reached at the Department of Information Management, National Taiwan University, Taipei, Taiwan, ROC;
[email protected].
Chin-Sheng Yang received a BS in Management Information Systems from the National Chengchi University in Taiwan, ROC in 2000 and an MBA and a Ph.D. in Information Management from the National Sun Yat-sen University in Taiwan, ROC in 2002 and 2007. He is currently an assistant professor in the Department of Information Management at Yuan Ze University in Chung-Li, Taiwan, ROC. His papers have appeared in Journal of Management Information Systems (JMIS), Decision Support Systems (DSS), Information Processing and Management, and Information Systems and E-Business Management. His current research interests include text mining, information retrieval, and knowledge discovery and data mining. He can be reached at the Department of Information Management, Yuan Ze University, Chung-Li, Taiwan, ROC;
[email protected].
Ching-Hsien Lee received a BS in Information Management from the National Sun Yat-sen University in Taiwan, ROC and an MS in Information Communication from Yuan Ze University. He is currently a doctoral student in the Department of Information Management at National Taiwan University, Taiwan, ROC. His current research interests include text mining, information retrieval, and knowledge discovery and data mining. He can be reached at the Department of Information Management at National Taiwan University, Taiwan, ROC;
[email protected].
Huihua Shi received a BS in Information Management from the National Central University in Taiwan, ROC in 2004 and an MBA in Information Management from the National Sun Yat-sen University in Taiwan, ROC in 2006. She is currently a senior engineer at AU Optronics Corporation in Taiwan.
Christopher C. Yang is an associate professor in the College of Computing and Informatics at Drexel University. He has also been an associate professor in the Department of Systems Engineering and Engineering Management and the director of the Digital Library Laboratory at the Chinese University of Hong Kong, an assistant professor in the Department of Computer Science and Information Systems at the University of Hong Kong, and a research scientist in the Department of Management Information Systems at the University of Arizona. His recent research interests include social media analytics, Web 2.0, security informatics, health informatics, Web search and mining, knowledge management, and electronic commerce. He has published over 200 refereed journal and conference papers in Journal of the American Society for Information Science and Technology (JASIST), Decision Support Systems (DSS), IEEE Transactions on Systems, Man, and Cybernetics, IEEE Transactions on Image Processing, IEEE Transactions on Robotics and Automation, IEEE Computer, IEEE Intelligent Systems, Information Processing and Management (IPM), Journal of Information Science, Graphical Models and Image Processing, Optical Engineering, Pattern Recognition, International Journal of Electronic Commerce, Applied Artificial Intelligence, ISI, WWW, SIGIR, ICIS, CIKM, and more. He has edited several special issues on multilingual information systems, knowledge management, Web mining, social media, and electronic commerce in JASIST, DSS, IPM, and IEEE Transactions. He has chaired and served in many international conferences and workshops. He has also frequently served as an invited panelist in the NSF and other government agencies' review panels. He can be reached at
[email protected].