Applying an intelligent notification mechanism to blogging systems utilizing a genetic-based information retrieval approach



Expert Systems with Applications 37 (2010) 705–715

Contents lists available at ScienceDirect

Expert Systems with Applications journal homepage: www.elsevier.com/locate/eswa

Yong-Ming Huang, Tien-Chi Huang, Yueh-Min Huang *

Department of Engineering Science, National Cheng Kung University, No. 1, Ta-Hsueh Road, Tainan 701, Taiwan, ROC

Article info

Keywords: Blog comment; Intelligent notification mechanism; Genetic algorithm; Information retrieval

Abstract

Blogging systems have received a lot of attention in recent years due to their wide spectrum of applications. The comment function is a significant part of a blog application, which can be used to gather the readers' feedback and produce social interactions with them. However, most of the existing blogging systems only provide simple comment notification mechanisms for bloggers. Since a popular blog may receive thousands of comments in a short period of time, it is almost impossible for the notification mechanism to inform the blogger about every comment, even meaningful ones. In this paper, we propose a Two-stage Intelligent Notification Mechanism (TINM) for blogging systems to carry out intelligent comment notification, so that the blogger only receives meaningful comments. To reduce the computation cost in the keyword retrieval, a Genetic-based Information Retrieval Approach (GIRA) was designed. Experimental results show that the proposed approach reduces the computation cost during keyword retrieval, and still leads to near-optimal results.

© 2009 Elsevier Ltd. All rights reserved.

* Corresponding author. Tel.: +886 6 2757575x63336; fax: +886 6 2766549. E-mail addresses: [email protected] (Y.-M. Huang), [email protected] (T.-C. Huang), [email protected] (Y.-M. Huang).
0957-4174/$ - see front matter © 2009 Elsevier Ltd. All rights reserved. doi:10.1016/j.eswa.2009.05.094

1. Introduction

In recent years, due to the prevalence of computers and the advancement of Internet technologies, blogging applications have become more popular and have started to change people's daily lives, for example through e-business (Chen, Tsai, & Chan, 2008) and e-learning (Du & Wagner, 2007; Huang, Huang, & Cheng, 2008; Wang, Huang, Jeng, & Wang, 2008). A blog is a simple personal publishing platform which enables people to publish their thoughts and then to gather readers' comments (Lindahl & Blount, 2003). The owner of a blog is called a blogger, who uses the post function to publish articles. Readers can use the comment function to express their opinions about articles on the blog. For example, a company can use the post function to publish news about free samples on its blog platform, and consumers can then use the comment function to discuss their experience of using the samples (Murugesan, 2007). The company can then address the shortcomings of the samples based on consumer feedback.

Although the comment function is usually used to obtain feedback from readers, some problems can emerge. For example, when the latest comments appear on a blog, the blogger does not know about them unless he or she browses the blog continuously. Therefore, a smart blogging system needs to automatically determine whether any new comments have appeared on the blog. Unfortunately, a blogger might not be interested in all new comments, and therefore, the blogging system has to offer an intelligent notification mechanism to filter for user-specified comments and then inform the blogger.

Many studies have investigated blogs (Chen et al., 2008; Du & Wagner, 2007; Huang et al., 2008; Kwai Fun & Wagner, 2008; Kuan, Wu, & Lee, 2008; Lin, Sundaram, Tseng, Chi, & Tatemura, 2007; Lin, Sundaram, Chi, Tatemura, & Tseng, 2008; Thelwall & Hasler, 2007; Wang et al., 2008). However, most studies do not take the effect of the notification mechanism on blog comments into account. Previous studies directly used basic blog platforms for various applications, and thus, a blogger would need to spend a lot of time frequently visiting his or her blog, which is inconvenient in real-world applications. The lack of an intelligent notification mechanism is an important issue that needs to be resolved in blogging platforms. To the best of our knowledge, the notification mechanism for blog comments has not been studied before.

In this paper, an intelligent notification mechanism is designed to provide highly accurate comment notification for bloggers. The idea originates from the auto-reply service system, which efficiently and automatically answers students' questions (Hwang, Yin, Wang, Tseng, & Hwang, 2008; Tseng & Hwang, 2007). We propose a Two-stage Intelligent Notification Mechanism (TINM) for blog comments. In the first stage, a genetic algorithm (Holland, 1975) is used to design a Genetic-based Information Retrieval Approach (GIRA) to retrieve the keyword interest value associated with the blogger's keywords. By using GIRA, the TINM can retrieve the keyword interest value and dramatically reduce the computation cost.


In the second stage, we use the results of GIRA (i.e., the keyword interest value) to evaluate the similarity between the new comment and a user-interested comment (i.e., a user-specified comment), and determine whether the system needs to notify the blogger. By employing this new technology in blogging platforms, bloggers can easily monitor new interesting comments. For instance, companies could apply TINM to build smart business blogging systems for monitoring customer responses to specific products. Companies could quickly obtain consumer feedback, and then rapidly reply to customer needs. In a learning environment, a teacher could build an educational blog with TINM so that students could get useful and meaningful comments posted by the teacher or other peers.

The rest of this paper is organized as follows. Section 2 reviews the related studies, which include blog research and genetic algorithms in information retrieval. In Section 3, we describe the parameters and define our problem. Section 4 presents the proposed Two-stage Intelligent Notification Mechanism (TINM). Section 5 shows the experiment results. Finally, a brief conclusion is given in Section 6.

2. Related studies

In this section, we give a brief introduction to blogs and the current research on blogs. We then introduce Genetic Algorithms (GA) and several studies that have applied GA in information retrieval. Finally, we present the differences between this study and previous research.

2.1. Related studies on blogs

In recent years, blogs have become the most popular tool for information dissemination (Rosenbloom, 2004; Young, 2003). Blogs are designed to allow easy and fast creation of Web content; users can easily use blogs to express their thoughts, ideas, suggestions, and comments on the Internet (Murugesan, 2007). Due to the ease of disseminating information on the Internet, many companies are using blogs to publish news and information, and to obtain feedback (e.g., product views) from their customers (Amd blog, 2008; Intel blog, 2008). Due to the proliferation of blogs (according to Technorati's State of the Blogosphere report from 2008, the number of visitors was over 77 million (Technorati, 2008)), many researchers have focused on blog research. Studies on blogs include the accuracy of blog search engines, spam blog detection, and the various applications of blogs.

Many researchers have focused on developing an effective retrieval technique to improve the search accuracy and quality of blog search engines (Chen et al., 2008; Thelwall & Hasler, 2007). Spam blogs (splogs) contain fake articles; their purpose is to increase website traffic in order to increase the effectiveness of advertising. Therefore, splog detection technology plays an important role in protecting innocent blogs (Lin et al., 2008). Several researchers have studied the applications of blogs and their variants. Applications include assistance in education (Du & Wagner, 2007; Huang et al., 2008; Wang et al., 2008), business (Chen et al., 2008; Yang & Liu, 2009), and social networking (Kwai Fun & Wagner, 2008; Kuan et al., 2008; Lin et al., 2007). For example, learners can document their learning experiences or knowledge and share them in blog postings. Companies can publish the latest product information on their blogs and gather customer feedback. To expand their social network, users can find new friends with the same interests using the friend lists of bloggers.

2.2. Related studies on genetic algorithms in information retrieval

Genetic Algorithms (GA) (Srinivas & Patnaik, 1994) are heuristic search algorithms based on natural selection and evolution. They were developed to simulate the evolution models of natural systems. GA is an intelligent search method that searches for an optimal solution within a defined search space. GA optimizes complex problems and has been applied to information retrieval (Fan, Gordon, & Pathak, 2004; Kushchu, 2005).

Information retrieval (IR) (Manning, Raghavan, & Schütze, 2008) was developed to retrieve important information according to the demands of the user. In a recent study (Oussalah, Khan, & Nefti, 2008), a fuzzy-based approach that uses information retrieval was proposed to accommodate the user's needs. However, the computation cost of finding the optimal information can be very large because, in many practical applications, the needs of the user are very complicated. Thus, many researchers have used GA to solve the computational problem of IR. Yang and colleagues applied GA to improve the performance of document retrieval from large databases (Yang & Korfhage, 1993; Yang, Korfhage, & Rasmussen, 1993). In their study, GA was used to find better queries. Similarly, Chang and Chen (2006) focused on using GA to reweight the query vector of the user according to the relevance feedback of the user and to increase the performance of document retrieval systems. The above-mentioned studies illustrate that GA-based approaches can be effectively used for IR.

2.3. Two main contributions in our work

• We developed an intelligent notification mechanism for comments that are interesting to a blogger. This intelligent notification mechanism can be used to increase the interactions between the blogger and the reader.
• We designed a Genetic-based Information Retrieval Approach (GIRA) that retrieves the keyword interest value according to the blogger's interests. The comments that the blogger is interested in can be retrieved by employing the results of GIRA.

3. Problem description

3.1. Theoretical foundations of TINM

The TINM problem is to determine whether a new comment is interesting to the blogger; i.e., the TINM problem is to explore the blogger's preferences by applying information retrieval. Some information retrieval studies (Croft, 1987; Rijsbergen, 1986) suggested that significant improvements in retrieval performance will require techniques that, in some sense, "understand" the content of documents and queries (i.e., the user's demands) to infer probable relationships between documents and queries. From this viewpoint, information retrieval is an inference or evidential reasoning process in which the probability that a user's need (i.e., a submitted query) is met is evaluated, given a document as "evidence" (Croft & Thompson, 1987; Croft & Turtle, 1989; Syu & Lang, 2000). Hence, the common problem of information retrieval is to rank documents according to their relevance to the user's demands (Larkey & Connell, 2005). The document with the highest rank is then selected as the answer.

3.2. TINM problem definition

The formal definition of the TINM problem is as follows. Assume that a blogger submits a post of interest $P_m$ that includes $m1$ keywords, denoted as $K_i(P_m)$, $i = 1, 2, \ldots, m1$. Since these $m1$ keywords were extracted from the blogger's post $P_m$, they are regarded as the interesting keywords of the blogger. Let $IV(K_i(P_m))$ be the interest value of the $i$th keyword in $P_m$, which ranges between 0 and 100, with 100 representing the highest interest. A post $P_m$ may include $n$ comments, $C_1, C_2, \ldots, C_n$, where $n \geq 1$. Let each comment $C_n$ be characterized by a document vector including $n1$ keywords, $K_j(C_n)$, $j = 1, \ldots, n1$. Let $IV(K_j(C_n))$ be the interest value of the $j$th keyword in $C_n$, with a range of 0–100. A greater $IV(K_j(C_n))$ means more interest in the keyword $K_j$ in the comment $C_n$. Let $P(K_j(C_n))$ be the proportion of the $j$th keyword among all keywords of $C_n$. The interest degree between the post $P_m$ and the comment $C_n$ is computed as:

$$ID(P_m, C_n) = \sum_{i=1}^{m1} \sum_{j=1}^{n1} \bigl( \mathrm{equals}(K_i(P_m), K_j(C_n)) \times IV(K_j(C_n)) \times P(K_j(C_n)) \bigr), \qquad (1)$$

where

$$\mathrm{equals}(K_i(P_m), K_j(C_n)) = \begin{cases} 1 & \text{if keyword } K_i(P_m) \text{ is equal to } K_j(C_n); \text{ then } IV(K_i(P_m)) \text{ is assigned to } IV(K_j(C_n)), \\ 0 & \text{if keyword } K_i(P_m) \text{ is not equal to } K_j(C_n); \text{ then } IV(K_i(P_m)) \text{ is not assigned to } IV(K_j(C_n)). \end{cases}$$

For the post $P_m$, the comment $C_n$ with the maximum value $ID(P_m, C_n)$ will be retrieved as the most interesting comment. To ensure that the most interesting comment is detected, a training process needs to be executed to retrieve the keyword interest value for the blogger's interesting keywords. In the training process, each training case consists of a post submitted by the blogger and several comments submitted by the readers. The most interesting comment perceived by the blogger is chosen. Note that for the training process, the post and the corresponding most interesting comment have been recorded during the training data collection process. Assume that there are $m$ training cases collected in the training database. The mapping function is as follows:

$$f : \{P_1, P_2, \ldots, P_m\} \rightarrow \{C_1, C_2, \ldots, C_n\},$$

where $f(P_m) = C_n$ indicates that the blogger has tagged comment $C_n$ as the most interesting comment to the post $P_m$. Therefore, the keyword interest value retrieval problem is formally defined as:

$$\mathrm{Maximize}\bigl( \mathrm{MaxID}(P_m, C_n) \times \mathrm{Match}(P_m, C_n) \bigr), \qquad (2)$$

where

$$\mathrm{Match}(P_m, C_n) = \begin{cases} 1 & \text{if } f(P_m) = C_n, \\ 0 & \text{if } f(P_m) \neq C_n. \end{cases}$$

During the training process, the keyword interest value retrieval problem aims to find the optimal keyword interest value by using Eqs. (1) and (2). In other words, the problem of the training process is to retrieve the optimal $IV(K_i(P_m))$ while meeting the constraint $f(P_m) = C_n$. An intuitive approach to find the optimal $IV(K_i(P_m))$ is to test all possible $IV(K_i(P_m))$, where $IV(K_i(P_m))$ is an integer between 0 and 100. Assume that 100 candidate values are tested for each $IV(K_i(P_m))$. The complexity of executing the training process is then $O(100^{m1} \times n \times m)$; this high complexity does not meet our needs. In this paper, we design a highly efficient Genetic-based Information Retrieval Approach (GIRA) to resolve this problem, the details of which are presented in Section 4.

4. Two-stage Intelligent Notification Mechanism (TINM)

In the first stage of TINM (hereafter named the offline stage), a training process is designed to retrieve the keyword interest value. In order to increase the training efficiency, a Genetic-based Information Retrieval Approach (GIRA) is designed to reduce the computation cost for retrieving the optimal keyword interest value. In the second stage of TINM (hereafter named the online stage), an intelligent notification mechanism is presented, which has two main concerns. (1) The results of the offline stage are used to calculate the interest value of a new comment, and a pre-defined interest threshold is used to determine whether the new comment is interesting. (2) When the interest value is larger than the threshold value, the blogging system informs the blogger to take a look at the new comment. The details of TINM are presented in the following three subsections.

4.1. Offline stage

Fig. 1 shows a flow chart of the offline stage for retrieving the keyword interest value, which includes three phases: (i) the data collection phase, (ii) the data preprocessing phase, and (iii) the keyword interest value retrieval phase.

Fig. 1. Flow diagram of the offline stage.

4.1.1. Data collection phase

This phase collects the training data in three steps. First, the blogger posts the blog articles to the platform via the post function. Then, the reader reads the blogger's blog articles and posts some comments via the comment function. Finally, all blog articles with comments are collected in a blog training database.

4.1.2. Data preprocessing phase

The data preprocessing phase extracts the specific keywords from each blog article in the blog training database. This phase uses general information retrieval technology (Manning et al., 2008). Fig. 2 shows the workflow for the data preprocessing phase. The workflow includes four components: (i) tokenizer, (ii) stopper, (iii) stemmer, and (iv) synonymizer, which are discussed in detail below.

Fig. 2. Workflow of the data preprocessing phase.

Fig. 3. Example of synonymizer's task.

Tokenizer. The blog training database outputs the training data to the tokenizer, whose task is to chop it up into pieces, called tokens, and to drop certain characters. In this paper, the tokenizer discards punctuation and numbers because in many practical applications they are relatively meaningless as words.

Stopper. Once the training data has been segmented by the tokenizer, the task of the stopper is to remove common words from the tokens (also called stop words, which are words that appear frequently in documents, such as 'and', 'the', and 'of'). In this paper, the stopper adopts a list of stop words from WordNet (Wordnet, 2008) to remove common words.

Stemmer. The task of the stemmer is to normalize. It matches morphological word variants by using the base or root form, so that morphological variants of the same word can be compared more easily. In this paper, the well-known Porter stemming algorithm (Porter, 1980) is adopted to stem the words in the tokens to their root words.

Synonymizer. The purpose of the synonymizer is to reduce the token size of articles; it works in two steps. First, the synonymizer examines which synonyms are arranged in the same category. In this work, we use WordNet (Wordnet, 2008) to find matching synonyms. Second, the synonymizer randomly selects a word to represent all words of this category. An example of the synonymizer's task is shown in Fig. 3.
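The four preprocessing components can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `STOP_WORDS` and `SYNONYMS` are tiny hypothetical tables standing in for the WordNet resources, `stem` is a crude suffix stripper standing in for the Porter algorithm, and the synonymizer uses a fixed representative word per category rather than a randomly selected one.

```python
import re

# Hypothetical stop-word list; the paper takes its list from WordNet.
STOP_WORDS = {"and", "the", "of", "a", "an", "is", "to", "in", "it"}

# Hypothetical synonym categories, each mapped to one representative word.
SYNONYMS = {"defect": "drawback", "flaw": "drawback", "item": "product"}


def tokenize(text):
    """Split text into lowercase word tokens, dropping punctuation and numbers."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop common words that carry little meaning."""
    return [t for t in tokens if t not in STOP_WORDS]


def stem(token):
    """Crude suffix stripping; a stand-in for the Porter stemmer, not a port of it."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def synonymize(token):
    """Replace a token by the representative word of its synonym category."""
    return SYNONYMS.get(token, token)


def preprocess(text):
    """Run tokenizer, stopper, stemmer, and synonymizer in sequence."""
    tokens = remove_stop_words(tokenize(text))
    return [synonymize(stem(t)) for t in tokens]


def keyword_proportions(keywords):
    """Proportion of each keyword among all keywords of one article."""
    total = len(keywords)
    counts = {}
    for k in keywords:
        counts[k] = counts.get(k, 0) + 1
    return {k: c / total for k, c in counts.items()}


keywords = preprocess("The products have 2 flaws and a defect!")
proportions = keyword_proportions(keywords)
```

The proportions returned by `keyword_proportions` play the role of $P(K_j(C_n))$ in Eq. (1).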

All blog articles are divided into tokens by the tokenizer, stopper, stemmer, and synonymizer. When the training data preprocessing phase is completed, the offline system continues on to the next process.

4.1.3. Keyword interest value retrieval phase

In this subsection, we present a Genetic-based Information Retrieval Approach (GIRA) to retrieve the optimal keyword interest value. This approach is based on the vector space model that is generally applied to information retrieval. Thus, the output of the preceding phase, tokens, can be characterized as an integer vector in the integer vector space. The token's value is regarded as the interest value of the corresponding keyword in the post/comment articles. Hence, the interest degree of each comment can be computed by utilizing the interest value of each token. The comment with the highest interest degree is retrieved as the most interesting comment. Fig. 4 shows a flow chart of GIRA for keyword interest value retrieval, which includes the eight steps discussed below.

Fig. 4. GIRA flow chart for retrieving keyword interest value.

Step 1. Encoding. In Step 1, the variables of the solution are encoded as chromosomes. Encoding methods include integer encoding, real number encoding, and binary encoding. In this paper, since the variables of the solution are the keyword interest values (i.e., $IV(K_i(P_m))$), which are integers, $IV(K_i(P_m))$ must use integer encoding.

Step 2. Generating initial population. When the suitable chromosomes have been determined in Step 1, the next step is to generate an initial population to be the starting point for the genetic algorithm. In this paper, a uniform distribution is used to generate a random initial population.

Step 3. Fitness function evaluation. In the genetic algorithm, the fitness function evaluates the suitability of the chromosomes for the environment under consideration. In this paper, we design the fitness function based on Eqs. (1) and (2). Eq. (1) is used to compute the interest degree between the post and the comment. Eq. (2) is used to verify whether the comment retrieved as the most interesting is the one tagged by the blogger.

Step 4. Termination condition. Since the genetic algorithm is a repeated process, an appropriate termination condition needs to be defined to terminate the process. In this paper, the termination condition is a fixed number of generations. The algorithm is also terminated when the best fitness is retrieved. If the conditions are satisfied, the genetic algorithm outputs the optimal solution. Otherwise, the chromosome with the highest fitness value goes into the reproduction procedure.
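Steps 1 to 3 can be sketched as follows. This is a minimal Python illustration, assuming that a chromosome is a list of integers in 0–100 aligned with the post's keyword list and that each comment is represented as a dictionary of keyword proportions; the function names and the tie-breaking rule (the first comment wins on equal interest degree) are assumptions, not details given in the paper.

```python
import random


def interest_degree(post_iv, comment_props):
    """Eq. (1): sum of IV x proportion over keywords shared by post and comment."""
    return sum(post_iv[k] * p for k, p in comment_props.items() if k in post_iv)


def random_population(size, n_keywords, rng):
    """Step 2: integer-encoded chromosomes drawn uniformly from 0..100."""
    return [[rng.randint(0, 100) for _ in range(n_keywords)] for _ in range(size)]


def fitness(chromosome, keywords, comments, tagged_index):
    """Eq. (2): the maximum interest degree, zeroed unless it belongs to the tagged comment."""
    post_iv = dict(zip(keywords, chromosome))
    degrees = [interest_degree(post_iv, c) for c in comments]
    best = max(range(len(degrees)), key=degrees.__getitem__)
    match = 1 if best == tagged_index else 0
    return degrees[best] * match


# Hypothetical training case: the blogger tagged the second comment (index 1).
keywords = ["product", "drawback"]
comments = [{"product": 0.5, "price": 0.5}, {"drawback": 1.0}]

good = fitness([10, 80], keywords, comments, tagged_index=1)
bad = fitness([100, 10], keywords, comments, tagged_index=1)
population = random_population(4, len(keywords), random.Random(0))
```

A chromosome that ranks an untagged comment highest receives zero fitness, which is how the constraint $f(P_m) = C_n$ enters the search.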

Fig. 5. Flow chart of the online stage.

Fig. 6. Preliminary steps of the example.


Fig. 7. An example of integer encoding.

Step 5. Reproduction. Reproduction is used to determine how the genetic algorithm creates children for each new generation. In this paper, the familiar roulette wheel selection is applied. Roulette wheel selection simulates a roulette wheel with the area of each segment proportional to its expectation. In our study, a random number is used to select one of the segments with a probability equal to its area. The selection probability of chromosome $k$ is given by Eq. (3), where $f_k$ is the fitness of chromosome $k$:

$$P_k = f_k \Big/ \sum_{i=1}^{n} f_i. \qquad (3)$$

Step 6. Crossover. The crossover process creates a new chromosome, which inherits features from both parents. Common crossover methods include one-point crossover, two-point crossover, and uniform crossover. Several studies (Falkenauer, 1999; Syswerda, 1989) have shown that uniform crossover is the best way to carry out the crossover process. Hence, uniform crossover is chosen in this study. The uniform crossover process can be divided into two steps. The first step creates a random binary crossover vector. Then, at each position, the genetic algorithm selects the gene from the first parent where the vector is equal to 1, and from the second parent where the vector is equal to 0. The genes are combined to form a new child.

Step 7. Mutation. Mutation is used to avoid the local maximum/minimum problem by precluding chromosomes from becoming too similar to each other. The main concept of mutation is to make a random change in the chromosomes. The genetic diversity is thus increased, and the genetic algorithm has an enlarged search space in which to generate better descendants. Once the mutation procedure has finished, the evolution process of a generation has also finished. Then, Steps 3–7 are repeated for each generation until the termination condition is satisfied. When the termination condition is satisfied, the keyword interest value retrieval has finished. The system outputs the best chromosome, and the result is an optimal keyword interest value list (hereafter named OKIV list) for a training case.
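The three genetic operators of Steps 5 to 7 can be sketched in Python as below. This is an illustrative sketch under assumptions: the function names are hypothetical, the fallback to a uniform choice when every fitness is zero is an added assumption (Eq. (3) is undefined in that case), and mutation is modelled as replacing a gene by a random integer in 0–100 with a per-gene probability `rate`.

```python
import random


def roulette_select(population, fitnesses, rng):
    """Step 5: pick a chromosome with probability f_k / sum(f_i), as in Eq. (3)."""
    total = sum(fitnesses)
    if total == 0:
        # Assumption: fall back to a uniform choice when Eq. (3) is undefined.
        return rng.choice(population)
    pick = rng.random() * total
    cumulative = 0.0
    for chromosome, f in zip(population, fitnesses):
        cumulative += f
        if pick < cumulative:
            return chromosome
    return population[-1]  # guard against floating-point edge cases


def uniform_crossover(parent1, parent2, mask):
    """Step 6: take the gene from parent1 where mask is 1, from parent2 where it is 0."""
    return [a if m == 1 else b for a, b, m in zip(parent1, parent2, mask)]


def mutate(chromosome, rate, rng):
    """Step 7: replace each gene by a random integer in 0..100 with probability rate."""
    return [rng.randint(0, 100) if rng.random() < rate else g for g in chromosome]


rng = random.Random(42)
child = uniform_crossover([38, 23, 47, 65, 98], [56, 74, 58, 12, 45], [1, 0, 0, 1, 0])
mutated = mutate(child, rate=0.1, rng=rng)
```

In a full run, the crossover mask itself would be drawn at random, for example with `[rng.randint(0, 1) for _ in parent1]`.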

Step 8. Data fusion. This step combines a set of OKIV lists from various training cases into an optimal keyword interest value table (hereafter named OKIV table). We propose an Interest-oriented Data Fusion Method (IDFM) for this purpose. From the blogger's viewpoint, different posts have different interest degrees. Hence, keywords from different posts should be assigned different weights. The IDFM method is presented below.

Assume that there are $m$ posts, $P_1, P_2, \ldots, P_m$, and that their relative interest weights are $W_1, W_2, \ldots, W_m$. Suppose that post $P_n$ has an OKIV list containing $m1$ optimal keyword interest values, denoted as $Opt(K_i(P_n))$, $i = 1, 2, \ldots, m1$. The weighted keyword interest value is formally defined as:

$$W(K_i(P_n)) = Opt(K_i(P_n)) \times W_n. \qquad (4)$$

Some keywords can appear in different posts. From the blogger's viewpoint, such duplicate keywords can be regarded as individual interesting keywords, so they need to be handled differently when calculating the keyword interest value. To tackle this problem, we adopt the cumulative calculation of keyword interest values to fuse duplicate keywords. Using IDFM in the offline stage, the results of GIRA are exported into an OKIV table, as shown on the right side of Fig. 4.

4.2. Online stage

The intelligent notification mechanism was developed using the results of the offline stage. Fig. 5 shows a flow chart of the intelligent notification mechanism. The online stage has two major steps: (i) comment interest value evaluation, and (ii) message notification. The comment interest value evaluation adopts an interest threshold control approach to determine whether the new comment is an interesting comment. Assume that a new comment is submitted to the blog platform by a reader. The interest value of this new comment is calculated according to the OKIV table. If the evaluation result is interesting (i.e., e > d, where e is the interest value and d is the pre-defined interest value threshold), then the message notification sends a message to remind the blogger to take a look at the new comment. An example of the TINM process is given in Section 4.3.

4.3. An illustrative example

This subsection shows an example of the Two-stage Intelligent Notification Mechanism (TINM), which consists of two stages: the offline stage and the online stage.

Fig. 8. Fitness function evaluation for the example.

Fig. 9. Illustration of uniform crossover.

Fig. 10. Illustration of mutation.

4.3.1. Offline stage

To simplify the description, the offline stage is divided into two steps: (i) the preliminary step, and (ii) the keyword interest value retrieval step. The preliminary step includes the data collection phase and the data preprocessing phase, and the keyword interest value retrieval step corresponds to the keyword interest value retrieval phase.

Preliminary steps. Fig. 6 shows the preliminary steps of the offline stage. Assume that the blog training database includes a post $P_m$ and three comments $C_1$, $C_2$, and $C_3$. The third comment, $C_3$, is the most interesting comment chosen by the blogger, as shown in Fig. 6a. The keywords of the articles are generated by applying the data preprocessing phase (see Section 4.1.2), as shown in Fig. 6b. Finally, the keywords of each article and the proportions of the keywords are extracted, as shown in Fig. 6c.

Keyword interest value retrieval step.

Step 1. Encoding: The administrator uses integer encoding to encode the interest value of the keyword $IV(K_i(P_m))$, $i = 1, \ldots, 5$. Fig. 7 shows the integer encoding used in this paper. In this example, because five keywords were extracted from the blogger's post $P_m$, the example has five variables, $IV(K_i(P_m))$, $i = 1, \ldots, 5$. Each variable is an integer between 0 and 100 (see Section 3.2).

Step 2. Generating initial population: The GA generates an initial population of chromosomes using a uniform distribution.

Step 3. Fitness function evaluation: The GA evaluates the suitability of chromosomes using the fitness function. The fitness of a chromosome is evaluated as follows. First, the interest value of each comment is computed using Eq. (1). The acquired values are 58.63, 11.28, and 16.52, respectively. In the current solution, comment $C_1$ has the highest interest value, as shown in Fig. 8a. Second, the suitability of the chromosome is evaluated using Eq. (2), as shown in Fig. 8b. Since the present solution is not comment $C_3$, the result is equal to 0 (because $\mathrm{Match}(P_m, C_1) = 0$). For this example, this outcome indicates that the chromosome is not good; hence, the GA process will continue to the next step.

Step 4. Termination condition: The GA determines whether it should terminate. For this example, the GA process will continue to the next step because this is the first generation of evolution.

Step 5. Reproduction: This step uses Eq. (3) to determine how the GA creates children at each new generation. For example, assume that there are three chromosomes, $Ch_1$, $Ch_2$, and $Ch_3$, whose segment areas (expectations) are 1, 1.3, and 0.7, respectively, corresponding to selection probabilities of about 0.33, 0.43, and 0.23 under Eq. (3). In this step, the GA uses a random number to select one of the segments with a probability equal to its area. $Ch_2$ is therefore the chromosome most likely to be selected by the GA.

Step 6. Crossover: The crossover step combines two individuals to generate a new individual for the next generation. Fig. 9 shows an example of uniform crossover. The first parent is (38, 23, 47, 65, 98), the second parent is (56, 74, 58, 12, 45), and the crossover vector is (1, 0, 0, 1, 0). After carrying out the uniform crossover, the child is (38, 74, 58, 65, 45).

Step 7. Mutation: The mutation step makes small random changes in the individuals, which provide genetic variety and enable the GA to search an expanded space. Fig. 10 shows an example of mutation. Assume that the third gene is the mutation point in this example; the mutation procedure changes the gene to a random number with a pre-defined probability. After mutation, the third gene has changed from 58 to 22, as shown in Fig. 10.

Step 8. Data fusion: This step combines a set of optimal keyword interest values into an OKIV table, as shown in Fig. 11. For this example, assume that there are three posts with relative interest weights of 2:3:1. Each post has an OKIV list obtained using Steps 1–7, as shown in Fig. 11a. The first step is to use the interest weight to compute the weighted interest value of each keyword using Eq. (4), as shown in Fig. 11b. Then, the second step is to use the cumulative calculation method to combine each keyword into an OKIV table, as shown in Fig. 11c. The OKIV table is thus generated by the GIRA process.
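The data fusion of Step 8 can be sketched as a weighted accumulation. This is a minimal Python sketch, assuming each OKIV list is a dictionary from keyword to optimal interest value; the function name `fuse_okiv_lists` and the keyword values below are hypothetical (the actual values of Fig. 11 are not reproduced here), while the 2:3:1 weights follow the example.

```python
def fuse_okiv_lists(okiv_lists, weights):
    """Apply Eq. (4) per post, then accumulate duplicate keywords across posts."""
    table = {}
    for okiv, weight in zip(okiv_lists, weights):
        for keyword, value in okiv.items():
            weighted = value * weight                         # Eq. (4): Opt x W_n
            table[keyword] = table.get(keyword, 0) + weighted  # cumulative calculation
    return table


# Hypothetical OKIV lists for three posts with relative interest weights 2:3:1.
okiv_lists = [
    {"product": 30, "price": 10},
    {"product": 20, "drawback": 50},
    {"complaint": 40},
]
okiv_table = fuse_okiv_lists(okiv_lists, weights=[2, 3, 1])
```

Because duplicate keywords are summed after weighting, entries in the OKIV table can exceed the 0–100 range of a single interest value, which is consistent with the values above 100 used in the online-stage example.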

Fig. 11. Data fusion of this example.


Fig. 12. All possible solutions for comments 1–5: (a) comment 1; (b) comment 2; (c) comment 3; (d) comment 4; (e) comment 5.

4.3.2. Online stage Once the blogger’s OKIV table is obtained using GIRA, the comments that interest the blogger can be detected in the online stage. Assume that the pre-defined interest threshold d is 150, and that a reader submitted a new comment to the blogging platform. A set of keywords is extracted from the comment, namely {product, drawback, complaint}, and the proportion of each keyword is 0.23, 0.33, and 0.44, respectively. Thus, the interest value of the new comment (i.e., e) is derived by using the OKIV table; the result is

$179.43 \; (= 156 \times 0.23 + 203 \times 0.33 + 174 \times 0.44)$. Because e > d, the blogging system uses message notification to notify the blogger to view the new comment.

Fig. 13. Interest degree when using GIRA.

Fig. 14. Different comment scenarios.

4.4. Discussion

TINM has four advantages. (1) The proposed technique offers comment notification for blogging systems. (2) TINM can notify users about interesting comments. (3) Users can flexibly set the interest threshold according to different needs. (4) TINM uses GIRA to retrieve the keyword interest values, which dramatically reduces the computation cost while still leading to a near-optimal solution.
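The online-stage check of Section 4.3.2 can be written as a few lines of Python. This is a minimal sketch, not the authors' implementation: the function and variable names are hypothetical, the keyword-to-value mapping is read off the worked example, and treating keywords missing from the OKIV table as contributing zero is an assumption.

```python
def comment_interest(okiv_table, keyword_proportions):
    """Interest value e: OKIV value of each keyword times its proportion, summed.

    ASSUMPTION: a keyword absent from the OKIV table contributes 0.
    """
    return sum(okiv_table.get(keyword, 0.0) * proportion
               for keyword, proportion in keyword_proportions.items())

# Values from the worked example in Section 4.3.2.
okiv_table = {"product": 156, "drawback": 203, "complaint": 174}
proportions = {"product": 0.23, "drawback": 0.33, "complaint": 0.44}
threshold = 150                        # interest threshold d

e = comment_interest(okiv_table, proportions)
notify = e > threshold                 # notify the blogger only if e exceeds d
print(round(e, 2), notify)             # 179.43 True
```

Raising the threshold makes notification stricter; with a threshold of 180, for instance, the same comment would not trigger a notification.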

Table 1. Parameters and settings of the comments.

| Parameter | Setting |
| --- | --- |
| Number of comments for a post | 5, 10, 15 |
| Number of keywords in a comment | 3, 5 |

5. Experiments

Three experiments were conducted to evaluate the performance of the proposed approach. The first experiment evaluated the accuracy of keyword interest value optimization. The second experiment observed whether the fitness value increases and converges as the generation number increases in various scenarios. The third experiment compared the computation cost of optimization between GIRA and the enumeration approach. The system prototype designed for the experiments was implemented using MATLAB (MATLAB, 2008).

Fig. 15. Optimal solutions for the different scenarios using GIRA. (a) number of comments = 5; number of keywords in each comment = 3; (b) number of comments = 5; number of keywords in each comment = 5; (c) number of comments = 10; number of keywords in each comment = 3; (d) number of comments = 10; number of keywords in each comment = 5; (e) number of comments = 15; number of keywords in each comment = 3; (f) number of comments = 15; number of keywords in each comment = 5.

5.1. Accuracy evaluation

To verify the accuracy of GIRA in optimizing the keyword interest values, we implemented an enumeration approach that tests all possible solutions and then searches for the optimal one. We also used GIRA to find a solution. Finally, we compared the results computed by GIRA and by the enumeration approach to evaluate the performance of GIRA. In this experiment, the training case included one post and five comments. Two keywords, $K_1(P)$ and $K_2(P)$, were extracted from the post, and their interest values were the quantities to be optimized. In addition, the population size of the genetic algorithm was set to 50, and the maximum number of generations was set to 50.

Fig. 12a–e illustrate all possible solutions for the interest values for comments 1–5 using the enumeration approach. The x-axis and y-axis are all possible interest values of $K_1(P)$ and $K_2(P)$, respectively, and the z-axis is the interest degree between the post and the comment. Every dot on the surface of the interest degree distribution represents the result of Eq. (1) for the corresponding $K_1(P)$ and $K_2(P)$ coordinates. We tested all combinations of $K_1(P)$ and $K_2(P)$ pairs to obtain the interest degrees. Eq. (2) was used to determine the optimal solution; the results are $K_1(P) = 100$, $K_2(P) = 10$, and an interest degree of 34.1667. As the curved surfaces show, the computation space is extremely large (i.e., $O(100^2 \times 5 \times 1)$) when using the enumeration approach to optimize the keyword interest values.
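To make the size of the enumeration search concrete, the following Python sketch counts the candidate solutions and performs an exhaustive search on a toy problem. It rests on several assumptions: the term $100^2 \times 5 \times 1$ is read as 100 possible values for each of 2 keywords, evaluated against 5 comments of 1 post; the function names are hypothetical; and `toy_fitness` is only a stand-in, since Eqs. (1) and (2) are not reproduced in this section.

```python
from itertools import product

def enumeration_candidates(values_per_keyword, n_keywords):
    """Number of keyword-interest-value combinations an exhaustive search must try."""
    return values_per_keyword ** n_keywords

def exhaustive_search(fitness, n_keywords, values):
    """Try every combination of values and return the one with the highest fitness."""
    return max(product(values, repeat=n_keywords), key=fitness)

def toy_fitness(kiv):
    """Hypothetical stand-in for the interest degree; peaks at (7, 3)."""
    k1, k2 = kiv
    return -((k1 - 7) ** 2 + (k2 - 3) ** 2)

# Enumeration cost for the setting of Section 5.1 (under the reading above).
print(enumeration_candidates(100, 2) * 5 * 1)   # 50000
# Candidate solutions a GA with population 50 and 50 generations would examine.
print(50 * 50)                                  # 2500
# Exhaustive search on a small 10 x 10 grid with the toy fitness.
print(exhaustive_search(toy_fitness, 2, range(1, 11)))  # (7, 3)
```

The number of combinations grows as a power of the number of keywords (for example, $100^5$ for five keywords), whereas the GA budget is fixed by the population size and the number of generations.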
The results show that the computation space of the enumeration approach increases dramatically as the number of keywords increases. Fig. 13 shows the evolution of the individuals’ fitness in each generation; the optimal solution found is $K_1(P) = 98$, $K_2(P) = 8$, with an interest degree of 33.3333. From these results, we estimate that the accuracy of the interest degree is up to 97% when using GIRA (33.3333/34.1667 ≈ 0.976). Moreover, the curves in this figure show that GIRA obtains its best solution after 10 iterations. This shows that the GA can compute the interest values more efficiently than the enumeration approach.

5.2. Robustness analysis

This experiment analyzed the robustness of GIRA in different comment scenarios. The post in this experiment had five keywords, $K_1(P)$, $K_2(P)$, $K_3(P)$, $K_4(P)$, and $K_5(P)$. Table 1 shows the parameters and settings of the comments for this experiment. We assume that a post may include n comments, where n is 5, 10, or 15. In addition, each comment may have a different length, and longer comments have more keywords. We assume that the number of keywords contained in a comment is either 3 or 5. These parameter settings were used to generate six different comment scenarios, as shown in Fig. 14.

The results of this experiment are shown in Fig. 15a–f. In Fig. 15, the upper part shows the evaluation of each generation, and the lower part shows the computational results for the optimal interest value of each keyword and the best fitness. From the curves in each figure, we observe that GIRA acquires the best solution before 20 iterations. Therefore, we can infer that the maximum computation cost of GIRA is $O(50 \times 20)$. The results show that the GA has a fixed computation cost for optimizing the keyword interest values in various situations.

5.3. Computation cost comparison

The purpose of this experiment was to compare the computation cost of GIRA with that of the enumeration approach. Computation cost was measured as the number of combinations tested while optimizing the keyword interest values. The results are shown in Fig. 16. The figure shows that GIRA has a lower computation cost. Unlike that of GIRA, the computation cost of the enumeration approach increases as the number of keywords increases. This happens because the enumeration approach has to try all possible combinations to find the optimal solution. In contrast, since GIRA uses a fixed number of evolutionary generations to find a solution, it can greatly reduce the computation cost of optimizing the keyword interest values.

Fig. 16. Comparison of the computation cost when using GIRA and the enumeration approach.

6. Conclusion

Blogs are critical web-based applications. However, previous blog research has not considered the effect of the comment notification mechanism. In this paper, we proposed TINM to improve the interaction of people in blogging applications. TINM can identify whether comments correspond with users’ interests and notify the users to view those comments. To identify interesting comments, we designed GIRA to retrieve the optimal keyword interest value table (OKIV table). Interesting comments can be detected using the OKIV table. Furthermore, we designed an interest value control solution that lets bloggers adjust the interest degree by setting the threshold of the interest value. In our experiments, GIRA produced solutions with an accuracy of up to 97%. GIRA was able to robustly retrieve the best solution in various situations. Most importantly, GIRA had a relatively low computation cost for retrieving the keyword interest values.

In the near future, we will apply TINM to real-world applications, such as e-business, e-social science, and e-learning. In practical applications, our system will be improved to offer an immediate customer relationship management platform that addresses the lack of interaction between businesses and customers. The applications for e-social science and e-learning will assist users in improving social interactions. Some data analysis methods, such as on-line data mining, will also be developed to enhance TINM. We expect that the enhanced TINM will improve the blogging experience of bloggers and readers.

Acknowledgements

The authors would like to thank the National Science Council of the Republic of China for financially supporting this research under Contract No. NSC 97-2511-S-006-001-MY3. The authors are grateful to the reviewers and the editor for their constructive comments and assistance in revising and polishing this paper.

References

Amd blog. (2008). (retrieved July 2008).
Croft, W. B. (1987). Approaches to intelligent information retrieval. Information Processing and Management, 23(4), 95–110.
Croft, W. B., & Thompson, R. H. (1987). I3R: A new approach to the design of document retrieval systems. Journal of the American Society for Information Science, 38(6), 389–404.
Croft, W. B., & Turtle, H. (1989). A retrieval model incorporating hypertext links. In Proceedings of the second annual ACM conference on hypertext (Hypertext’89) (pp. 213–224).
Chang, Y. C., & Chen, S. M. (2006). A new query reweighting method for document retrieval based on genetic algorithms. IEEE Transactions on Evolutionary Computation, 10(5), 617–622.
Chen, Y., Tsai, F. S., & Chan, K. L. (2008). Machine learning techniques for business blog search and mining. Expert Systems with Applications, 35(3), 581–590.
Du, H. S., & Wagner, C. (2007). Learning with weblogs: Enhancing cognitive and social knowledge construction. IEEE Transactions on Professional Communication, 50(1), 1–16.
Falkenauer, E. (1999). The worth of the uniform. In Proceedings of the congress on evolutionary computation (CEC’99) (pp. 776–782).
Fan, W., Gordon, M. D., & Pathak, P. (2004). Discovery of context-specific ranking functions for effective information retrieval using genetic programming. IEEE Transactions on Knowledge and Data Engineering, 16(4), 523–527.
Holland, J. H. (1975). Adaptation in natural and artificial systems. The University of Michigan Press.
Huang, T. C., Huang, Y. M., & Cheng, S. C. (2008). Automatic and interactive e-learning auxiliary material generation utilizing particle swarm optimization. Expert Systems with Applications, 35(4), 2113–2122.
Hwang, G. J., Yin, P. Y., Wang, T. T., Tseng, J. C. R., & Hwang, G. H. (2008). An enhanced genetic approach to optimizing auto-reply accuracy of an e-learning system. Computers & Education, 51(1), 337–353.
Intel blog. (2008). (retrieved July 2008).
Kushchu, I. (2005). Web-based evolutionary and adaptive information retrieval. IEEE Transactions on Evolutionary Computation, 9(2), 109–126.
Kwai Fun, R. I. P., & Wagner, C. (2008). Weblogging: A study of social computing and its impact on organizations. Decision Support Systems, 45(2), 242–250.
Kuan, S. T., Wu, B. Y., & Lee, W. J. (2008). Finding friend groups in blogosphere. In Proceedings of the 22nd international conference on advanced information networking and applications – workshops (AINAW 2008) (pp. 1046–1050).
Lindahl, C., & Blount, E. (2003). Weblogs: Simplifying web publishing. Computer, 36(11), 114–116.
Larkey, L. S., & Connell, M. E. (2005). Structured queries, language modeling, and relevance modeling in cross-language information retrieval. Information Processing and Management, 41, 457–473.
Lin, Y. R., Sundaram, H., Tseng, B., Chi, Y., & Tatemura, J. (2007). Blog community discovery and evolution based on mutual awareness expansion. In Proceedings of the IEEE/WIC/ACM international conference on web intelligence (pp. 48–56).
Lin, Y., Sundaram, H., Chi, Y., Tatemura, J., & Tseng, B. L. (2008). Detecting splogs via temporal dynamics using self-similarity analysis. ACM Transactions on the Web, 2(1), 1–35.
Murugesan, S. (2007). Understanding Web 2.0. IEEE IT Professional Magazine, 9(4), 34–41.
Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval. Cambridge University Press.
MATLAB. (2008). (retrieved July 2008).
Oussalah, M., Khan, S., & Nefti, S. (2008). Personalized information retrieval system in the framework of fuzzy logic. Expert Systems with Applications, 35(1–2), 423–433.
Porter, M. (1980). An algorithm for suffix stripping. Program, 14(3), 130–137.
Rijsbergen, C. J. V. (1986). A non-classical logic for information retrieval. Computer Journal, 29(6), 481–485.
Rosenbloom, A. (2004). The blogosphere. Communications of the ACM, 47(12), 31–33.
Syswerda, G. (1989). Uniform crossover in genetic algorithms. In Proceedings of the third international conference on genetic algorithms (ICGA’89) (pp. 2–9).
Srinivas, M., & Patnaik, L. M. (1994). Genetic algorithms: A survey. IEEE Computer, 27(6), 17–26.
Syu, I., & Lang, S. D. (2000). Adapting a diagnostic problem-solving model to information retrieval. Information Processing and Management, 36, 313–330.
Thelwall, M., & Hasler, L. (2007). Blog search engines. Online Information Review, 31(4), 467–479.
Technorati. (2008). Technorati: State of the blogosphere. (retrieved September 2008).
Tseng, J. C. R., & Hwang, G. J. (2007). Development of an automatic customer service system on the internet. Electronic Commerce Research and Applications, 6(1), 19–28.
Wang, K. T., Huang, Y. M., Jeng, Y. L., & Wang, T. I. (2008). A blog-based dynamic learning map. Computers and Education, 51(1), 262–278.
Wordnet. (2008). (retrieved July 2008).
Yang, J., & Korfhage, R. R. (1993). Effects of query term weights modification in document retrieval: A study based on a genetic algorithm. In Proceedings of the second annual symposium on document analysis and information retrieval (pp. 271–285).
Yang, J., Korfhage, R. R., & Rasmussen, E. (1993). Query improvement in information retrieval using genetic algorithms: A report on the experiments of the TREC project. In Proceedings of the first text retrieval conference (TREC-1) (pp. 31–58).
Young, T. E. (2003). Blogs: Is the new online culture a fad or the future? Knowledge Quest, 31(5), 50–51.
Yang, H. L., & Liu, C. L. (2009). A new standard of on-line customer service process: Integrating language-action into blogs. Computer Standards and Interfaces, 31(1), 227–245.