Expert Systems with Applications 38 (2011) 1575–1582
A language model approach for tag recommendation

Ke Sun, Xiaolong Wang, Chengjie Sun, Lei Lin
School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
Department of Control Science and Engineering, Harbin Institute of Technology, Harbin 150001, China
Keywords: Tag recommendation; Language model for tag recommendation
Abstract: Tags are user-generated keywords for entities. Recently, tags have become a popular way to allow users to contribute metadata to large corpora on the web. However, tagging websites lack mechanisms for guaranteeing the quality of tags for other usages, such as collaboration/community, clustering, and search. Thus, as a remedy, automatic tag recommendation, which recommends a set of candidate tags for the user to choose from while tagging a document, has recently drawn much attention. In this paper, we introduce statistical language model theory into the tag recommendation problem, naming the resulting approach the language model for tag recommendation (LMTR), by converting tag recommendation into a ranking problem and then modeling the correlation between tag and document within the language model framework. Furthermore, we leverage two different methods, based on keyword extraction and keyword expansion, to collect candidate tags before ranking with LMTR, improving its performance. Experiments on large-scale tagging datasets of both scientific and web documents indicate that our proposals recommend tags efficiently and effectively. © 2010 Elsevier Ltd. All rights reserved.
1. Introduction

Tags are user-generated keywords that organize entities by their common attributes. In contrast to the predefined organization style "taxonomy", this tagging-based, self-generated organization style is called "folksonomy". It differs from taxonomy in that, instead of forcing entities into predefined categories, it presents a more flexible style by allowing people to freely annotate entities with their own keywords. Recently, tags have become a popular way to let users contribute metadata to large corpora on the web, adopted by many famous websites (e.g. Delicious, Flickr). These advantages make tagging suitable for organizing web objects whose distribution and types change rapidly. Although tagging is easy to perform and has many advantages, it also has drawbacks. Golder and Huberman (2006) identified three major problems with current tagging systems: Polysemy, where a single tag has multiple meanings; for example, the company "Apple" versus the fruit "apple". Synonymy, where multiple tags have the same meaning; for example, "news" versus "current events", or misspelling problems like "Nokia" versus "Nokea".
Level variation refers to the phenomenon of users tagging content at different levels of abstraction. Content can be tagged at a "basic level" or at varying levels of specificity, often depending on the tag poster's expertise or requirements. For example, given an entity like Google, normal users may mark it with "search engine", "famous web site" or "good se", while researchers working on IR techniques may use keywords from more academic fields like "page rank strategy", "map reduce" or "distributed indexing system". These problems are caused by the lack of clear functional pressure to make tagging consistent, stable and complete. Consequently, the collected tags are hard to use in applications dealing with collaboration/community, clustering, and search. In order to tackle these problems, tag recommendation systems have recently been proposed. Such a system reminds users of alternative tags, with fewer polysemy and synonymy problems, from different abstraction levels, so that more suitable tags can easily be selected. For example, when a user wants to tag a document like "what is the single chip? what is it for?", the recommendation system generates a list of recommended tags based on the given document, such as "computer", "single chip", "hardware" and "electronic engineer". Also, because tags can mostly be "selected" instead of "typed", the misspelling problem is controlled. The recommendation system can thus not only tackle the level variation problem, by encouraging users to supply useful tags from different abstraction levels, but also improve the quality of posted tags, by proposing candidates with fewer polysemy and synonymy problems.
In this paper, we focus on the tag recommendation problem for documents. By converting tag recommendation into the problem of retrieving a set of tags relevant to the given document, a language model approach for tag recommendation (LMTR) is proposed. The statistical language model has been used in many natural language processing applications, such as speech recognition, part-of-speech tagging, and syntactic parsing. In 1998, Ponte and Croft (Ponte & Croft, 1998) first introduced the language model approach to information retrieval by ranking retrieved documents based on the probability of generating the query from the corresponding language model of each document. Although language model theory has been studied for years in many domains, to the best of our knowledge this is the first effort to introduce the statistical language model theory into the tag recommendation problem. Our contribution focuses on the tag recommendation algorithm for documents. Specifically, we (a) propose a novel tag recommendation framework based on statistical language model theory, and (b) propose two expansion methods, based on keyword extraction and keyword expansion, for improving tagging speed and performance. Effectiveness and efficiency are both carefully analyzed for these proposals. The remainder of this paper is organized as follows: in Section 2, we survey related work on tag recommendation and language modeling. In Section 3, our language-model-based approach to tag recommendation is presented, together with the expansion methods. In Section 4, we set up the experimental platform, and in Section 5 the effectiveness and efficiency of our proposals are empirically verified. Section 6 concludes the paper by summarizing our work and discussing future directions.

2. Related work

In this section, we first review the latest advances in tag recommendation and then survey some methods in language modeling.

2.1. Tag recommendation

The tag recommendation problem can be divided into two major application domains. One domain aims at recommending tags for media resources such as pictures, audio and video. Research in this domain (Ames & Naaman, 2007; Liu, Hua, Yang, Wang, & Zhang, 2009; Sigurbjörnsson & Van Zwol, 2008; Wu, Yang, Yu, & Hua, 2009) mostly concerns expanding the existing tag set online to encourage users to post more tags; the methods mostly inherit from research on query expansion or keyword expansion. The other domain concerns recommending tags for documents (Brooks & Montanez, 2006; Golder & Huberman, 2006; Heymann, Ramage, & Garcia-Molina, 2008; Mishne, 2006; Song et al., 2008; Sood, Owsley, Hammond, & Birnbaum, 2007; Xu, Fu, Mao, & Su, 2006; Yan & Hauptmann, 2007), and our work focuses on this domain as well. Brooks and Montanez (2006) developed a system that automatically tags blog documents with the top three terms extracted from each document using TFIDF scoring. Their method inherits from a related research area called keyword extraction (Frank, Paynter, Witten, Gutwin, & Nevill-Manning, 1999; Turney, 2000), which has been studied for years. However, keyword extraction can be viewed as only a subset of keyword generation, because it extracts keywords/keyphrases from the content of a document but ignores tags at a more abstract level which do not appear in the document content.
Chirita, Costache, Nejdl, and Handschuh (2007) proposed deeper methods that produce tags from both the document content and the data residing on the user's desktop, which can somewhat overcome the drawbacks of keyword extraction; however, the application environment is quite limited, because it relies on personal data that are not easy to obtain. These approaches are also known as text-mining-based approaches. Collaborative filtering is another popular scenario for tag recommendation. Mishne (2006) proposed a simple collaborative-filtering-based tagging system called "AutoTag", which finds similarly tagged documents and suggests a set of their associated tags to the user for selection. Sood et al. (2007) improved this idea by introducing tag compression and case evaluation to filter and rank tag suggestions. In contrast to text-mining-based approaches, tags recommended by collaborative approaches mainly concern high abstraction levels, because tags are aggregated from already-tagged documents, and highly abstract tags are more common among these documents and thus easier to push out. Most similar to our work, Song et al. (2008) proposed a clustering- and classification-based tag recommendation system, which partitions tags and documents into different clusters, classifies new documents into those clusters with a two-way Poisson Mixture Model, and recommends the tags belonging to the chosen cluster. It can be viewed as a multi-label text-classification-based approach, and it overcomes the problems of both the text-mining-based and the collaborative approaches. In this paper, we simplify their idea by treating each tag as a cluster and associating a new document directly with each tag, rather than with a group of clustered tags, for which the similarity between the document and each individual tag in the cluster cannot easily be calculated.
Also, we consider the tag recommendation problem to be more of a ranking problem than a classification problem, because there are no fixed, rich classes, but rather dynamic, open tags which do not contain a fixed set of documents for partitioning.

2.2. Language modeling

A statistical language model assigns a probability to a sequence of m words by means of a probability distribution, written P(w_1, w_2, \ldots, w_m) or P(w_{1,m}). Estimating the probability of a word sequence directly can be expensive, since sentences can be long and the corpus must be extremely large to avoid the data sparseness problem. In practice, the statistical language model is often approximated by smoothed n-gram models based on the Markov property, and the probability P(w_{1,m}) can be represented as,
P(w_{1,m}) = \prod_{i=1}^{m} P(w_i | w_1, w_2, \ldots, w_{i-1}) \approx \prod_{i=1}^{m} P(w_i | w_{i-(n-1)}, \ldots, w_{i-2}, w_{i-1})    (1)
Given different n, there are corresponding n-gram models; the most commonly used ones are,
Unigram:  P(w_{1,m}) = \prod_{i=1}^{m} P(w_i)    (2)

Bigram:   P(w_{1,m}) = \prod_{i=1}^{m} P(w_i | w_{i-1})    (3)
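As an illustration, maximum-likelihood estimates for the unigram and bigram models of Eqs. (2) and (3) can be sketched as below; the function names are ours, and no smoothing is applied yet:

```python
from collections import Counter

def unigram_model(corpus):
    """MLE unigram probabilities P(w) from a list of tokenized sentences."""
    counts = Counter(w for sent in corpus for w in sent)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def bigram_model(corpus):
    """MLE bigram probabilities P(w_i | w_{i-1})."""
    pair_counts = Counter()
    left_counts = Counter()
    for sent in corpus:
        for prev, cur in zip(sent, sent[1:]):
            pair_counts[(prev, cur)] += 1
            left_counts[prev] += 1
    return {pair: c / left_counts[pair[0]] for pair, c in pair_counts.items()}

def sentence_prob_unigram(sent, p_uni):
    """P(w_{1,m}) under the unigram model of Eq. (2)."""
    prob = 1.0
    for w in sent:
        prob *= p_uni.get(w, 0.0)  # unseen words give zero probability without smoothing
    return prob
```

The zero probability assigned to unseen words is exactly the data sparseness problem that the smoothing discussed later in the paper addresses.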
The language modeling approach was successfully introduced to information retrieval by Ponte and Croft (1998). Given a document and a query, it treats the similarity between document and query as the probability of generating the query from the language model formed by the document; this is known as LMIR. LMIR has been studied
for years, particularly regarding its smoothing methods (Zhai & Lafferty, 2004), and it has been applied in many domains, including web search (Ponte & Croft, 1998), question-answer pair retrieval (Duan, Cao, Lin, & Yu, 2008; Jeon, Croft, & Lee, 2005), and topic detection and tracking (Spitters & Kraaij, 2001). But to the best of our knowledge, this is the first effort to introduce the statistical language model theory into the tag recommendation problem.

3. A language model approach for tag recommendation

Given a document d, our job is first to collect a set of candidate tags T_C that have a potential semantic correlation to the document. Each tag t \in T_C is then ranked by its similarity to the document, sim(t, d), and finally the top N tags in T_C are listed for recommendation. In this section, we first formalize the tag ranking problem as a language model approach (LMTR) and introduce the methods for estimating its parameters; then, several approaches for collecting the candidate tag set T_C are presented.

3.1. Modeling document and tag for recommendation

We employ the framework of language modeling to develop our approach to tag recommendation. In LMTR, the similarity sim(t, d) between document d and a candidate tag t is given by the probability P(t|d),
sim(t, d) = P(t | d) = P(t) P(d | t) / P(d) \propto P(t) P(d | t)    (4)
In Eq. (4), the probability P(t) can be set as a constant, or used to integrate document-independent features of tag t such as its quality or authority. For our task, P(t) is estimated with maximum likelihood estimation (MLE),
P_{mle}(t) = tf_{t,T} / N_T    (5)
where T denotes all the tags in collection C, tf_{t,T} denotes the frequency of tag t, and N_T denotes the total number of tags in collection C. For the probability P(d|t) in Eq. (4), which can be viewed as generating document d from tag t, we first express document d with the unigram model, so that d is represented as a sequence of independent words w_1, w_2, \ldots, w_m, and we get,
P(d | t) = \prod_{i=1}^{m} P(w_i | t)    (6)
Estimating the probability P(w|t) is the core problem of our task, because a tag t alone contains too sparse information to estimate a probability from. However, as shown by Brooks and Montanez (2006), "tags are useful for grouping documents into broad categories". Further, leveraging the corresponding concept from taxonomy, where a group of documents of a certain category can represent the semantic distribution of that category, we apply a substitution for estimating the probability P(w|t). Given tag t and the set Dt of documents tagged with t, the probability P(w|t) can be represented as,
P(w | t) \propto P(w | D_t)    (7)
and sim(t, d) can then be expressed as,
sim(t, d) = P(t) \prod_{i=1}^{m} P(w_i | D_t)    (8)
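To make Eq. (8) concrete, the following sketch scores one candidate tag against a document in log space (to avoid floating-point underflow over long products). The function name and data layout are our assumptions, and it already uses the Jelinek–Mercer interpolation of Eq. (10), introduced below, so that unseen words do not zero out the score:

```python
import math
from collections import Counter

def lmtr_score(doc_words, p_t, tag_doc_words, bg_words, lam=0.5):
    """log sim(t, d) = log P(t) + sum_i log P(w_i | D_t)            (Eq. 8)
    with P(w | D_t) = lam * P_mle(w | D_t) + (1 - lam) * P_mle(w | C)  (Eq. 10).

    doc_words: words of document d; p_t: the prior P(t);
    tag_doc_words: all words of the documents tagged with t (D_t);
    bg_words: all words of the background collection C."""
    tf_D, n_D = Counter(tag_doc_words), len(tag_doc_words)
    tf_C, n_C = Counter(bg_words), len(bg_words)
    score = math.log(p_t)
    for w in doc_words:
        p = lam * tf_D[w] / n_D + (1 - lam) * tf_C[w] / n_C
        score += math.log(p) if p > 0 else float("-inf")  # w absent even from C
    return score
```

Ranking then amounts to computing this score for every candidate tag and keeping the top N.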
Hereafter, for simplicity, we use D to denote Dt when no ambiguity exists. We also use MLE to estimate P(w|D),

P_{mle}(w | D) = tf_{w,D} / N_D    (9)

where tf_{w,D} denotes the raw term frequency of word w in the document group D, and N_D denotes the total number of words in D.
The MLE method suffers from several problems, the most obvious being that words unseen in D receive zero probability: feeding such a word into our model makes the whole similarity zero, which is not what we expect. To remedy this, smoothing, which transfers some probability mass from seen words to unseen words, is a good choice. Here we select the popular Jelinek–Mercer (JM) smoothing (Zhai & Lafferty, 2004) for its good performance and low computational cost, and the probability P(w|D) can then be written as,
P(w | D) = \lambda P_{mle}(w | D) + (1 - \lambda) P_{mle}(w | C)    (10)

where \lambda is the smoothing parameter between 0 and 1, C denotes the background collection of documents, and P_{mle}(w|C) is estimated as,

P_{mle}(w | C) = tf_{w,C} / N_C    (11)
where tfw,C denotes the term frequency of word w in the background collection C, and NC denotes the total number of words in C. Now, with the LMTR model expressed by Eq. (8), we can calculate the similarity between candidate tags and the document. In the next section, we will introduce the methods for collecting candidate tag set TC for LMTR. 3.2. Approaches for collecting candidate tags Before ranking set of tags TC for recommendation, one problem is to how to collect the candidate tag set TC. A straightforward idea is to make TC = T, and we can compute the similarity of all tags in T with document d. The advantage is this approach would not omit any tag, but it suffers at least two problems: (a) It costs lots of time to compute the similarity of each tag with document d, and (b) because the tag is influenced by both P(t) and P(d|t), some useful tags with lower abstraction level but more particular and meaningful specificity will be ranked out of the scoop because of they often have lower occurrences and correspondingly lower P(t). Auto: To fix the time costing problem, employing some pre-filtering methods is a good choice. And the collaborative-based tagging approach could somehow meet the requirement because their high recall and highly collecting speed. Here we adopt one famous collaborative-based approach named as AutoTag, proposed by Mishne (2006). AutoTag leverages the traditional IR method (e.g. Okapi-BM25 (Robertson, Walker, & Beaulieu, 1999)) to collect a group of documents similar to the given one from the background documents. Then the most appeared K tags from S most similar documents are considered as the candidate tags T CA for document d. This approach is easy to perform, but aggravates the problem of omitting those particular tags, because it only concerns on those ‘general’ tags between similar documents. From now on, for simplify, we denote AutoTag as Auto if there is no ambiguous. 
Based on this approach, in the rest of this section we propose two modified approaches that overcome this problem by expanding the candidate tag set T_CA with some particular tags.

Sum: One way to expand the candidate tag set is to find tags with meanings similar to the existing ones, and the co-occurrence between two tags can be viewed as a good metric for measuring their relationship. Sigurbjörnsson and Van Zwol (2008) normalize the co-occurrence between two tags into two kinds of relationships,
R^{a}_{tag}(t_i, t_j) = P(t_j | t_i) = tf_{t_i \cap t_j, T} / tf_{t_i, T}    (12)

R^{s}_{tag}(t_i, t_j) = P(t_i, t_j) = tf_{t_i \cap t_j, T} / tf_{t_i \cup t_j, T}    (13)
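Eqs. (12) and (13) can be computed directly from per-document tag sets; the set-based counting below is one straightforward reading of the tf terms, and the function name is ours:

```python
def tag_cooccurrence(tag_sets, ti, tj):
    """Asymmetric and symmetric tag co-occurrence (Eqs. 12-13).
    tag_sets: one set of tags per document in the collection."""
    both = sum(1 for s in tag_sets if ti in s and tj in s)
    n_ti = sum(1 for s in tag_sets if ti in s)
    either = sum(1 for s in tag_sets if ti in s or tj in s)
    r_asym = both / n_ti if n_ti else 0.0     # R^a_tag(ti, tj) = P(tj | ti)
    r_sym = both / either if either else 0.0  # R^s_tag(ti, tj)
    return r_asym, r_sym
```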
where R^a_tag(t_i, t_j) denotes the asymmetric relationship, R^s_tag(t_i, t_j) the symmetric one, and t_i and t_j are any two tags in the background tag collection T. Obviously, the asymmetric relationship makes sense for our task of finding some
less abstract tags, since it denotes the probability of the occurrence of tag t_j given tag t_i. Further, we define:
Table 1
Algorithm 1 (Sum).
Input: tag set T_CA generated by Auto; size threshold J of each tag t's expansion set T_t; size threshold K of the candidate tag set
Definition 1. Given two tags t_i and t_j, if R^a_tag(t_i, t_j) < R^a_tag(t_j, t_i), then t_i is considered more general than t_j, denoted t_i > t_j.

Based on Eq. (12) and Definition 1, we can expand the candidate tag set T_CA with more particular tags. Further, we leverage the Sum method (Sigurbjörnsson & Van Zwol, 2008) to aggregate the related tag sets of each tag in T_CA, as Algorithm 1 in Table 1 shows. Algorithm 1 first expands each tag t in T_CA with a set of tags T_t less abstract than t, then aggregates the sets T_t with the Sum method into an expanded tag set T_CS, and finally returns T_CS \cup T_CA as the candidate tag set.

Ctx: Another method, named contextual tags (Ctx), is also proposed to expand the candidate tag set T_CA. Different from Sum, which takes tag relationships into consideration, Ctx inherits the concept of keyword extraction by making use of potential words from the content of document d. Users often pick words or phrases from the document content as tags, and such tags (or keywords) often play an important role in conveying the particular meaning of the document. Based on this intuition, we take these contextual tags into consideration to make up for what AutoTag ignores, and the candidate tag set T_CC is collected with Algorithm 2, shown in Table 2. Algorithm 2 is a simple method for expanding the existing tag set T_CA: it takes all potential keywords appearing in document d with a competitive TFIDF score as candidate tags. It ignores the tags' relationships with T_CA, but respects their correlation with document d. The two expansion methods both have advantages. Sum considers candidates from the global tag set and can expand the tag set to a less abstract level, but it cannot avoid bringing in noise while digging deeper.
For example, when trying to expand the tag company "Apple", Sum may bring in noise tags like "apple tree" or "apple juice", which may influence the final result. In contrast, Ctx concerns the document itself and ignores the tags' global relationships, so it can expand more precise, particular tags than Sum. In the same example, it can push out tags like "iPod shuffle" or "Mac" if they appear in the document content, but abstract or synonymous tags that do not appear in the document will be missed. In Section 5, we conduct experiments to find the differences between these two methods and their suitable application environments.

4. Experimental setup

In this section, we introduce the experimental setup, including the datasets, ground truth and evaluation measures used in our experiments.

4.1. Dataset

The experimental datasets come from two domains with different document lengths and writing styles: scientific documents (CiteULike) and web pages (Zhishi2) are collected for evaluating the effectiveness of our proposals in different application environments. Data from both sets are parsed into a uniform object {d, Tu}, where d denotes the document content and Tu denotes the tag set annotated by users. We now introduce these two datasets separately.

2 http://zhishi.baidu.com is a Chinese community-based question and answering service.

CiteULike is a free online service for organizing academic publications, based on the social tagging style and aimed at promoting and developing the sharing of scientific references among researchers, as Fig. 1 illustrates. We acquired a subset of
Output: expanded candidate tag set T_CS
1:  let T_CS = \emptyset
2:  foreach t in T_CA
3:    collect T_t, |T_t| \le J, as the J tags most correlated to t, ordered descending by R^a_tag(t, t'), t' \in T_t
4:    foreach t' in T_t
5:      if t > t'
6:        if t' \notin T_CS
7:          add t' to T_CS
8:          let s(t') = 0
9:        end if
10:       let s(t') = s(t') + R^a_tag(t, t')
11:     end if
12:   end for
13: end for
14: keep the K tags with the highest score s(t'), t' \in T_CS, as the final T_CS
15: let T_CS = T_CS \cup T_CA
16: return T_CS
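Assuming a `related(t)` lookup that returns tags sorted by descending R^a_tag(t, t'), and a `more_general(t, t2)` predicate implementing Definition 1, Algorithm 1 can be sketched as follows (all names are ours):

```python
def sum_expand(T_CA, related, more_general, J=5, K=5):
    """Sketch of Algorithm 1 (Sum). related(t) -> [(t', R^a(t, t')), ...]
    sorted by descending score; more_general(t, t2) is Definition 1 (t > t2)."""
    scores = {}
    for t in T_CA:
        for t2, r in related(t)[:J]:       # top-J tags correlated to t
            if more_general(t, t2):        # keep only tags less general than t
                scores[t2] = scores.get(t2, 0.0) + r  # Sum: aggregate R^a scores
    top = sorted(scores, key=scores.get, reverse=True)[:K]
    return set(top) | set(T_CA)            # final T_CS = expansion union T_CA
```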
Table 2
Algorithm 2 (Ctx).
Input: document d; tag set T_CA generated by Auto; size threshold K of the candidate tag set
Output: expanded candidate tag set T_CC
1:  let k = 0; T_CC = \emptyset
2:  foreach w in d, ordered by TFIDF_{w,d}
3:    if w appears as a tag in T
4:      add w to T_CC
5:    end if
6:    k = k + 1
7:    if k \ge K
8:      break
9:    end if
10: end for
11: let T_CC = T_CC \cup T_CA
12: return T_CC
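Algorithm 2 can be sketched in a few lines, assuming the document's TFIDF scores are precomputed (the dict layout and function name are ours):

```python
def ctx_expand(doc_tfidf, T, T_CA, K=5):
    """Sketch of Algorithm 2 (Ctx). doc_tfidf: {word: TFIDF score} for document d;
    T: the background tag vocabulary. Scans the K highest-scoring words of d and
    keeps those that already exist as tags, then unions with T_CA."""
    T_CC = set()
    for k, w in enumerate(sorted(doc_tfidf, key=doc_tfidf.get, reverse=True)):
        if k >= K:       # only the top-K words by TFIDF are considered
            break
        if w in T:       # keep w only if it is an existing tag
            T_CC.add(w)
    return T_CC | set(T_CA)
```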
37,927 user-annotated references for our experiments. In order to unify papers across the references annotated by different users and to combine their tags, only references with DOI3 entries are kept; papers are unified by their DOI IDs, and references without titles or abstracts are filtered out. We then combine each paper's title and abstract as the document content d for tagging, and the tags of all references sharing the same DOI entry are merged into the user tag set Tu. Natural tags often suffer from an uncontrolled vocabulary. In order to normalize tags into a unified form in our background tag collection T, we employ the method proposed by Sood et al. (2007): the words in each tag are tokenized and stemmed to a morphological root using Porter's stemmer (Porter, 1997). In addition, tags containing more than one atomic word are sorted into alphabetical order to avoid the tag variation problem; for example, "politics and news" and "news and politics" both resolve to "and new polit". In the end, a collection of 5493 distinct referenced papers with 24,023 tags in total is collected; 50% of the papers are randomly selected for training (TR), 25% for development (DEV) and 25% for testing (TST). The background tag collection T and document collection D are built from the TR set. The statistics of this dataset can be found in Table 3, and the 5 most-tagged documents in the TST set with their tags are listed in Table 4. Note that in this dataset the intersection of DEV
3 http://www.doi.org/.
Fig. 1. Reference page in CiteULike.
Table 3
Statistics on experimental dataset.

Source site  Dataset  Documents  Tags    Distinct tags  Tags intersection with T (%)  Tags per document  Words per document (stop words are filtered)
CiteULike    TR       2695       12,038  4700           100                           4.5                188
CiteULike    DEV      1379       5875    2900           40.7                          4.3                184
CiteULike    TST      1419       6110    3059           39.7                          4.3                186
Zhishi       TR       35,321     86,161  2710           100                           2.4                3.8
Zhishi       DEV      17,500     42,118  2638           99.8                          2.4                3.8
Zhishi       TST      17,876     42,959  2639           99.7                          2.4                3.9
Table 4
Example documents with their tags on CiteULike.

1. Title: Epidemiology of European Community-Associated Methicillin-Resistant Staphylococcus aureus Clonal Complex 80 Type iv Strains Isolated in Denmark from 1993 to 2004 (Ar Larsen et al.)
   User tags: infect, adolescent, adult, age, agent, analysis, anti bacterium, aurous, bacterium, cross, DNA, electrophoresis, family, gel, genotype, health, human, infant, methicillin, microbe, middle, newborn, pidmiosajsths, protein, field pulse, resist, sensate, sequence, soft, staphylococci, staphylococcus, test, tissue
   Our tags (top 10): pidmiosajsths, staphylococci, methicillin, infect, resist, staphylococcus, field pulse, electrophoresis, aurous, type

2. Title: Literature-based concept profiles for gene annotation: the issue of weighting (Rob Jelier et al.)
   User tags: database, gene, abstract, artificial, automatic, computer, control, curve, express, function, genet, index, inform, intelligent, interact, language, likelihood, manage, map, nature, network, neural, pattern, process, profile, protein, pubmed, recognition, roc, system, terminology, theory, topic, uncertainty, vocabulary
   Our tags (top 10): roc, topic, database, profile, retrieval, inform, process, storage, system, gene

3. Title: Text-derived concept profiles support assessment of DNA microarray data for acute myeloid leukemia and for androgen receptor stimulation (Rob Jelier et al.)
   User tags: neoplasm, algorithm, analysis, array, biology, database, diagnosis, express, gene, genet, human, inform, language, leukemia, male, marker, nature, oligonucleotide, process, profile, prostate, protein, receptor, reproduce, result, retrieval, sensate, sequence, specify, storage, tumor
   Our tags (top 10): profile, microarray, express, gene, database, analysis, cluster, cancer, genet

4. Title: ATLAS – a data warehouse for integrative bioinformatics (Sohrab P. Shah et al.)
   User tags: computer, database, genome, human, biology, express, factual, gene, genet, graphic, inform, interact, interface, internet, language, manage, map, messenger, model, profile, program, protein, retrieval, RNA, software, storage, system, topic, computer user
   Our tags (top 10): database, manage, software, retrieval, inform, interface, system, internet, language, nature

5. Title: Super-families of evolved and designed networks (Ron Milo et al.)
   User tags: animal, bibtex, biology, caenorhabd, drosophila, elegant, evolution, feedback, genet, human, internet, language, mathematic, melanogast, model, nervy, protein, signal, social, support, system, theoretic, theory, transcript, transduction
   Our tags (top 10): proteome, theoretic, animal, protein, profile, genet, network, structure, similar, biology
and TST's tag sets with T is quite low (40.7% and 39.7%). Because our approach can only recommend observed tags, out-of-vocabulary (OOV) tags in the DEV and TST sets are meaningless for our research, and they are also filtered out based on T.

Zhishi: The second dataset comes from a Chinese community-based question and answering service (cQA for short) named "Zhishi" (Knowledge Manager). It allows users to organize a group of question-answer pairs on a similar topic into a piece of knowledge, and the user who edits a knowledge page is required to give three tags depicting its topic, as Fig. 2 illustrates. By June 2nd, 2009, Zhishi had accumulated 29,469 knowledge pages with 11,978 distinct tags, covering more than 100 thousand question-answer (qna) pairs. We acquired 3411 knowledge pages from this site for our experiments, with 77,997 distinct qna pairs and 7942 tags in total. For each qna pair, the question title is parsed as the document d, and the user tags Tu are inherited from the knowledge page the qna pair belongs to. As with CiteULike, we split this dataset into 50% for training (TR), 25% for development (DEV) and 25% for testing (TST), as Table 3 lists. Different from the scientific documents in CiteULike, the number of tags per document here is much lower, which means that in Zhishi one tag carries more information than in CiteULike. This may be because tags here are meant not for personal organization but for others' browsing, and they are selected more carefully under the incentive of bonus points in Zhishi. This is good news for us: it makes the tag recommendation problem on this dataset more suitable for our hypothesis in Eq. (7), and it also suffers less from the tag variation problem. Another difference is that the average number of words per document is much lower and the writing style is more informal, which may lead to a much more serious data sparseness problem on this dataset.
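The tag normalization applied to the CiteULike tags above (tokenize, stem to a morphological root, alphabetize multi-word tags) can be sketched as follows; `stem` stands in for any word-to-root function, such as Porter's stemmer used in the paper, and the toy stemmer in the test is ours:

```python
def normalize_tag(tag, stem):
    """Normalize a tag as described above: lowercase, tokenize, stem each word,
    and sort multi-word tags alphabetically so that word-order variants merge."""
    words = sorted(stem(w) for w in tag.lower().split())
    return " ".join(words)
```

With a Porter-style stemmer, both "politics and news" and "news and politics" collapse to the same normalized form, matching the example in the text.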
Fig. 2. Knowledge page about ‘CPU’ in Zhishi.
4.2. Ground truth and evaluation measures

Our goal is to recommend a set of tags ordered by their relevance to the given document. Finding a suitable ground truth for this task is hard: although the tags posted by users can be considered relevant, we cannot say that all other tags are irrelevant. Nevertheless, we use the user tags Tu of each sample as our ground truth, because they were picked by users and are therefore more suitable than other candidates. We employ the Mean Average Precision (MAP), P@N and R@N (N = 5, 10) measures to evaluate how well our proposals rank the relevant tags.

5. Experimental results

We separately evaluate the effectiveness of our proposals on CiteULike and Zhishi with five different approaches in this section. Specifically, the Auto approach serves as our baseline, and LMTR with all tags in T as its candidate tag set is evaluated as our proposed tag recommendation method. Furthermore, the three candidate tag collection approaches (Auto, Sum and Ctx) combined with LMTR are also analyzed, respectively.

5.1. CiteULike

CiteULike is evaluated first. We build the statistical models for LMTR and the expansion methods from the TR set. The parameters of each method are independently tuned on the DEV set with the MAP measure, and the evaluation is conducted with those optimal parameters on the TST set. Note that each approach is required to recommend 10 tags (N = 10), and for the Sum method we let J = K to simplify the tuning process. The optimal parameters are listed in Table 5, and the evaluation results are reported in Table 6.

Table 5
Optimal parameters on CiteULike.

RUNID        \lambda   K    S
Baseline     –         –    10
LMTR         0.006     –    –
LMTR + Auto  0.006     15   5
LMTR + Sum   0.006     5    3
LMTR + Ctx   0.006     5    3

Table 6
Experimental results on CiteULike.

RUNID        MAP    P@5    R@5    P@10   R@10
Baseline     0.319  0.193  0.293  0.124  0.373
LMTR         0.354  0.198  0.305  0.134  0.412
LMTR + Auto  0.345  0.199  0.307  0.126  0.387
LMTR + Sum   0.387  0.228  0.351  0.155  0.479
LMTR + Ctx   0.399  0.233  0.359  0.157  0.485

The results in Table 6 indicate that, compared with the baseline, LMTR significantly improves the effectiveness of tag recommendation. The P@N and R@N scores show that the improvement mainly comes from the higher recall of LMTR: it ranks tags better than AutoTag, so more relevant tags appear earlier. When LMTR is combined with the tag collection method Auto (LMTR + Auto), performance is slightly worse than plain LMTR because some potential tags are filtered out; however, the purpose of combining with Auto is to improve tagging speed, and the efficiency test is conducted later. As expansion methods designed to improve the effectiveness of Auto, Sum and Ctx both perform better than the other approaches. An interesting phenomenon is that LMTR + Ctx outperforms LMTR + Sum. This suggests that users in CiteULike tend to annotate their papers with keywords that appear in the title and abstract rather than create new tags, and this further
1581
K. Sun et al. / Expert Systems with Applications 38 (2011) 1575–1582
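The evaluation measures defined in Section 4.2 can be sketched as follows; the example tags are hypothetical, not taken from the datasets:

```python
def precision_at_n(ranked, relevant, n):
    """P@N: fraction of the top-N recommended tags that are relevant."""
    return sum(1 for t in ranked[:n] if t in relevant) / n

def recall_at_n(ranked, relevant, n):
    """R@N: fraction of the relevant tags found among the top N."""
    return sum(1 for t in ranked[:n] if t in relevant) / len(relevant)

def average_precision(ranked, relevant):
    """AP: mean of P@k over the ranks k at which a relevant tag appears."""
    hits, total = 0, 0.0
    for k, tag in enumerate(ranked, start=1):
        if tag in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP: average of AP over (ranked_tags, relevant_tags) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Hypothetical example: 3 of the 5 recommended tags are user tags.
ranked = ["bioinformatics", "network", "protein", "array", "clustering"]
relevant = {"bioinformatics", "protein", "clustering"}
print(precision_at_n(ranked, relevant, 5))  # 0.6
```

MAP rewards placing relevant tags near the top of the list, which is why it is used for parameter tuning above.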
proves the usefulness of tag recommendation: users prefer to "select" rather than to "type". For the same reason, the Sum method, which prefers collecting tags from the global tag set rather than from the content of the document, matches this requirement less well than Ctx. We present user tags and our recommended tags for papers in Table 4, where bold font indicates an overlap. Among these five papers, at least five tags per paper are correctly recommended by our approach. In addition, although some recommended tags do not match the user tags literally, most of them are semantically relevant: e.g., "microarray" appears in the paper title, is relevant to "array" and is even more precise for this paper; a "proteome" is a collection of "protein"s; "network" appears in the paper title and is also an important aspect of the given paper; "structure" often co-occurs with "network". In the best case, 9/10 recommended tags match the user tags for both of the papers "Epidemiology of European Community-Associated Methicillin-Resistant Staphylococcus aureus Clonal Complex 80 Type iv Strains Isolated in Denmark From 1993 to 2004" and "ATLAS – A Data Warehouse for Integrative Bioinformatics".
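The core LMTR ranking step can be sketched as follows. This is a simplified, assumption-laden version: unigram tag language models estimated from the documents annotated with each tag, linearly interpolated with a collection background model. The interpolation direction and the default smoothing weight here are illustrative assumptions, not the paper's tuned values:

```python
import math
from collections import Counter

def rank_tags(doc_words, tag_models, collection_model, lam=0.1):
    """Rank candidate tags by the smoothed log-likelihood of the document
    under each tag's unigram language model.

    tag_models: tag -> Counter of words from documents annotated with that tag
    collection_model: Counter over the whole collection (background model)
    lam: weight of the background model (illustrative default)
    """
    coll_total = sum(collection_model.values())
    scores = {}
    for tag, model in tag_models.items():
        tag_total = sum(model.values())
        score = 0.0
        for w in doc_words:
            p_tag = model[w] / tag_total if tag_total else 0.0
            p_coll = collection_model[w] / coll_total if coll_total else 0.0
            p = (1 - lam) * p_tag + lam * p_coll  # linear interpolation
            if p > 0:
                score += math.log(p)
        scores[tag] = score
    return sorted(scores, key=scores.get, reverse=True)
```

A tag whose model assigns high probability to the document's words rises to the top, which mirrors the ranking view of tag recommendation taken in this paper.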
Table 8
Experimental results on Zhishi.

RUNID          MAP      P@5      R@5      P@10     R@10
Baseline       0.501    0.311    0.634    0.189    0.754
LMTR           0.569    0.319    0.663    0.189    0.787
LMTR + Auto    0.571    0.323    0.668    0.191    0.775
LMTR + Sum     0.579    0.328    0.677    0.200    0.796
LMTR + Ctx     0.572    0.326    0.677    0.191    0.792
5.2. Zhishi
We choose Zhishi as the second dataset to evaluate our approaches, because it differs considerably from CiteULike in document length, writing style, tags and even language. Experiments on this dataset can reveal the effectiveness of our proposals on domains with a more flexible style and less informative content. Again, 10 tags are required to be recommended for each document, and the parameters are tuned as Table 7 shows. As the results in Table 8 present, the relative ordering of the approaches is similar to that on CiteULike, but some phenomena deserve attention. First, every approach performs much better on Zhishi than on CiteULike. As discussed in Section 4.1, we believe this is because Zhishi has fewer tags and more documents per tag, and because tagging in Zhishi is done more for others' browsing than for personal organization as in CiteULike; consequently the quality and abstraction level of tags in Zhishi are much higher. To test this hypothesis, we treat each tag and its corresponding documents as a cluster, and then employ the inter-cluster similarity (inter-sim) and intra-cluster similarity (intra-sim) to compare the two datasets, as Fig. 3 shows. In Fig. 3, the gap between inter-sim and intra-sim on Zhishi is very clear, which means a tag in Zhishi distinguishes well the documents belonging to it and there are few cross-cluster documents in this dataset. On CiteULike, however, inter-sim and intra-sim are very close, and their order is even reversed; this means the tags in this dataset are much more chaotic and contain many synonyms, which the experimental results also confirm. Another phenomenon that draws our attention is that the LMTR + Sum approach performs better than LMTR + Ctx here.
Also, as discussed above, the abstraction level of tags in Zhishi is much higher than in CiteULike; thus the keywords-extraction-based collection method Ctx cannot find more relevant tags than the keywords-expansion-based method Sum.
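The inter-/intra-cluster comparison behind Fig. 3 can be sketched as follows, assuming cosine similarity over bag-of-words vectors (the paper does not specify the similarity function; this is one plausible choice):

```python
import math
from collections import Counter
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def intra_sim(cluster):
    """Average pairwise similarity of the documents sharing one tag."""
    pairs = list(combinations(cluster, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0

def inter_sim(c1, c2):
    """Average similarity between documents of two different tag clusters."""
    return sum(cosine(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))
```

A large intra-sim together with a small inter-sim indicates that a tag cleanly separates its documents from the rest, which is the behavior observed on Zhishi.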
Table 7
Optimal parameters on Zhishi. (λ = 0.15 for LMTR and all combined runs; K and S are tuned separately for each combined run, with tuned values ranging from 10 to 30.)
Fig. 3. Inter-cluster similarity and intra-cluster similarity on the TR set of both Zhishi and CiteULike.
Table 9
Average tagging time (milliseconds per document).

RUNID          CiteULike    Zhishi
Baseline       16.7         1.9
LMTR           82.9         7.6
LMTR + Auto    21.3         1.5
LMTR + Sum     18.9         1.9
LMTR + Ctx     17.6         1.3
5.3. Efficiency of our proposals

We also test the efficiency of our proposals on both CiteULike and Zhishi. Table 9 presents the average tagging time of each approach. On CiteULike, the baseline achieves the fastest tagging speed and, as we predicted, LMTR is the slowest, although its effectiveness is much better than the baseline's. The expansion approach LMTR + Auto tags faster than LMTR, despite its slightly weaker effectiveness, and the advanced LMTR + Sum and LMTR + Ctx achieve both high tagging speed and good effectiveness. A small surprise is that LMTR + Ctx runs even faster than LMTR + Auto on both datasets; we find this may be because LMTR + Ctx requires smaller K and S, as shown in Tables 5 and 7. We also find that tagging speed is correlated with document length: since the average CiteULike document is much longer than the average Zhishi document, its tagging is slower. Furthermore, compared with the results reported by Song et al. (2008) on a similar dataset (CiteULike) and in a similar working environment, our approach is significantly faster (17.6 ms/doc vs. 1080 ms/doc).4
4 Our experiment was performed on a 2.2 GHz personal computer and is coded in C#.
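The per-document timing of Table 9 can be reproduced with a harness along these lines. This is a Python sketch (the original experiments were coded in C#), and `recommend` is a stand-in for any of the recommenders:

```python
import time

def avg_tagging_time_ms(recommend, docs):
    """Average wall-clock milliseconds per document for one recommender."""
    start = time.perf_counter()
    for doc in docs:
        recommend(doc)  # discard output; only the elapsed time matters
    elapsed = time.perf_counter() - start
    return elapsed * 1000 / len(docs)

# Usage with a trivial stand-in recommender:
ms = avg_tagging_time_ms(lambda doc: doc.split()[:10], ["some document text"] * 100)
```

Averaging over the whole test set, as in Table 9, smooths out per-document variance caused by differing document lengths.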
6. Conclusions

In this paper, we proposed a language model based approach for tag recommendation, which recommends tags by ranking them according to their similarity to the given document and leverages the content information of both tags and documents for ranking. We also proposed several candidate tag collection methods based on collaborative filtering, keywords extraction and keywords expansion; combined with LMTR, these methods make tag recommendation both efficient and effective. Our future work includes: (a) leveraging a more sophisticated model than MLE to estimate the relationship between tag and document; (b) since the inter-cluster and intra-cluster similarity of a tag is important for recommendation, developing methods that find higher-quality documents for each tag may improve performance.

Acknowledgements

This work was funded in part by the National Natural Science Foundation of China (Grant No. 60973076) and the Special Fund Projects for Harbin Science and Technology Innovation Talents (2010RFXXG003).

References

Ames, M., & Naaman, M. (2007). Why we tag: Motivations for annotation in mobile and online media. In Proceedings of the SIGCHI conference on human factors in computing systems (p. 980).
Brooks, C. H., & Montanez, N. (2006). Improved annotation of the blogosphere via autotagging and hierarchical clustering. In Proceedings of the 15th international conference on world wide web (pp. 625–632). Edinburgh, Scotland: ACM.
Chirita, P. A., Costache, S., Nejdl, W., & Handschuh, S. (2007). P-tag: Large scale automatic generation of personalized annotation tags for the web. In Proceedings of the 16th international conference on world wide web (p. 854).
Duan, H., Cao, Y., Lin, C. Y., & Yu, Y. (2008). Searching questions by identifying question topic and question focus. In Proceedings of ACL.
Frank, E., Paynter, G. W., Witten, I. H., Gutwin, C., & Nevill-Manning, C. G. (1999). Domain-specific keyphrase extraction. In International joint conference on artificial intelligence (Vol. 16, pp. 668–673).
Golder, S. A., & Huberman, B. A. (2006). Usage patterns of collaborative tagging systems. Journal of Information Science, 32(2), 198–208.
Heymann, P., Ramage, D., & Garcia-Molina, H. (2008). Social tag prediction. In Proceedings of the 31st annual international ACM SIGIR conference on research and development in information retrieval (pp. 531–538).
Jeon, J., Croft, W. B., & Lee, J. H. (2005). Finding similar questions in large question and answer archives. In Proceedings of the 14th ACM international conference on information and knowledge management (pp. 84–90).
Liu, D., Hua, X. S., Yang, L., Wang, M., & Zhang, H. J. (2009). Tag ranking. In Proceedings of the 18th international conference on world wide web (pp. 351–360).
Mishne, G. (2006). Autotag: A collaborative approach to automated tag assignment for weblog posts. In Proceedings of the 15th international conference on world wide web (p. 954).
Ponte, J. M., & Croft, W. B. (1998). A language modeling approach to information retrieval. In Proceedings of the 21st annual international ACM SIGIR conference on research and development in information retrieval (pp. 275–281).
Porter, M. F. (1997). An algorithm for suffix stripping. In Readings in information retrieval.
Robertson, S. E., Walker, S., & Beaulieu, M. (1999). Okapi at TREC-7: Automatic ad hoc, filtering, VLC and interactive track. NIST Special Publication SP (pp. 253–264).
Sigurbjörnsson, B., & Van Zwol, R. (2008). Flickr tag recommendation based on collective knowledge. In Proceedings of the 17th international conference on world wide web (pp. 327–336).
Song, Y., Zhuang, Z., Li, H., Zhao, Q., Li, J., Lee, W. C., et al. (2008). Real-time automatic tag recommendation. In Proceedings of the 31st annual international ACM SIGIR conference on research and development in information retrieval (pp. 515–522).
Sood, S., Owsley, S., Hammond, K., & Birnbaum, L. (2007). TagAssist: Automatic tag suggestion for blog posts. In Proceedings of the international conference on weblogs and social media (ICWSM 2007).
Spitters, M., & Kraaij, W. (2001). TNO at TDT2001: Language model-based topic detection. In Topic detection and tracking workshop report.
Turney, P. D. (2000). Learning algorithms for keyphrase extraction. Information Retrieval, 2(4), 303–336.
Wu, L., Yang, L., Yu, N., & Hua, X. S. (2009). Learning to tag. In Proceedings of the 18th international conference on world wide web (pp. 361–370).
Xu, Z., Fu, Y., Mao, J., & Su, D. (2006). Towards the semantic web: Collaborative tag suggestions. In Collaborative web tagging workshop at WWW2006, Edinburgh, Scotland.
Yan, R., & Hauptmann, A. (2007). Query expansion using probabilistic local feedback with application to multimedia retrieval. In Proceedings of the sixteenth ACM conference on information and knowledge management (pp. 361–370).
Zhai, C., & Lafferty, J. (2004). A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems (TOIS), 22(2), 179–214.