Expert Systems with Applications 42 (2015) 3751–3759
Linguistic features for review helpfulness prediction

Srikumar Krishnamoorthy, Indian Institute of Management, Ahmedabad, India
Article history: Available online 5 January 2015

Keywords: Review helpfulness; Linguistic category features; Sentiment analysis; Machine learning; Text mining
Abstract

Online reviews play a critical role in customers' purchase decision making process on the web. The reviews are often ranked based on user helpfulness votes to minimize the review information overload problem. This paper examines the factors that contribute towards the helpfulness of online reviews and builds a predictive model. The proposed predictive model extracts novel linguistic category features by analysing the textual content of reviews. In addition, the model makes use of review metadata, subjectivity and readability related features for helpfulness prediction. Our experimental analysis on two real-life review datasets reveals that a hybrid set of features delivers the best predictive accuracy. We also show that the proposed linguistic category features are better predictors of review helpfulness for experience goods such as books, music, and video games. The findings of this study can provide new insights to e-commerce retailers for better organization and ranking of online reviews and help customers in making better product choices. © 2015 Elsevier Ltd. All rights reserved.
1. Introduction

The advent of Web 2.0 has enabled users to share their opinions, experiences and knowledge via blogs, forums, and other social media websites. In the e-commerce context, Web 2.0 allows consumers to share their purchase and usage experiences in the form of product reviews (e.g. Amazon product reviews, CNET reviews). Such reviews contain valuable information and are often used by potential customers for making purchase decisions. However, some of the most popular products receive several hundred or even thousands of reviews, resulting in a review information overload problem. Besides, the review quality across large volumes of reviews exhibits wide variation (Liu, Huang, An, & Yu, 2008; Tsur & Rappoport, 2009). In order to help potential customers navigate through large volumes of reviews, e-commerce websites provide an interactive voting feature. For example, Amazon asks its review viewers "Was this review helpful? Yes/No" to get user votes on reviews. The votes thus gathered from multiple users are then aggregated, ranked and presented, e.g. "24 of 36 people found the following review helpful". Reviews with a higher share of helpful votes are ranked higher than the ones with fewer helpful votes. This paper aims to study the factors that play an important role in a review receiving more helpful votes. Such an analysis is important for the following reasons: First, reviews can be effectively summarized
by filtering low quality reviews. Second, websites that do not use a voting feature could benefit from an automated helpfulness prediction system. Third, the review ranking system could be improved with a better understanding of the underlying review helpfulness factors, avoiding the early bird bias problem (Liu, Cao, Lin, Huang, & Zhou, 2007). The review voting behaviour that influences review helpfulness can be visualized as a socio-psychological process between the reviewer and the reviewee, facilitated by Web 2.0 as a communication medium. Language plays a very important role in this process between the reviewer and reviewee. In the offline world, communication between a sender and receiver is often influenced by non-verbal cues, communication contexts and past interactions between the sender and receiver. In the absence of such external factors in the online world, language plays a crucial role. The sender's message (composed using a language) impacts the receiver's cognition and influences their behaviour. As the sender's message can be composed in numerous ways, its impact on the receiver's cognition and behaviour varies. Our basic intuition is that the review voting behaviour can be better understood by studying the psychological properties and propensities of the language. The Linguistic Category Model (LCM) proposed by Semin and Fiedler (1991) is a conceptual framework that models the psychological properties of language. The linguistic categories used in the LCM model and their descriptions are presented in Table 1. The LCM model (Coenen, Hedebouw, & Semin, 2006; Semin & Fiedler, 1991) uses three broad linguistic categories, namely Adjectives (e.g. fantastic, excellent, beautiful), State verbs (e.g. love,
hate, envy) and Action verbs. The action verbs are further subdivided into State Action Verbs (e.g. amaze, anger, shock), Interpretive Action Verbs (e.g. help, avoid, recommend), and Descriptive Action Verbs (e.g. call, talk, run). All of these linguistic categories are organized on an abstract-to-concrete dimension. At one extreme (ADJ), the terms are abstract, less verifiable, more disputable and least informative, while at the other extreme (DAV), the terms are concrete, verifiable, less disputable and most informative. Consider the following three review examples tagged with key linguistic categories:

1. A fantastic (ADJ) camera. The picture quality of this camera is wonderful (ADJ).
2. This is my first camera and I love (SV) it. The camera is excellent (ADJ).
3. I regularly take (DAV) pics with this camera. The quality of the pics has really amazed (SAV) me. Battery life is fabulous (ADJ). My only issue is that it makes (DAV) a lot of noise in autofocus mode. I strongly recommend (IAV) this camera.

Review 1 is highly abstract and subjective as it primarily uses adjectives. Review 2 uses a subjective verb, 'love', indicating the emotional state of the reviewer. The last review provides a more concrete and objective description of the camera using DAVs; besides, it also contains the subjective (ADJ) opinion of the reviewer. It is evident that review 3, with far more concrete and descriptive information, is likely to be more helpful than the other two reviews for purchase decision making. Therefore, our basic intuition is that the linguistic categories impact the receivers' (or consumers') cognitive processes, influence their voting behaviour and affect review helpfulness. In this paper, our objective is to examine the use of such linguistic category features for predicting review helpfulness. We make a first attempt at devising a new method for extracting linguistic category features from review text and building a binary classification model. We conduct a detailed experimental analysis on two real-life review datasets to demonstrate the utility of the proposed linguistic features. Furthermore, we study the effect of product type on review helpfulness and show that the proposed linguistic features are better predictors of review helpfulness for experience goods.

The rest of the paper is organized as follows. Section 2 describes the related work on review helpfulness. Section 3 elucidates the proposed novel features used in the model. Subsequently, Section 4 presents detailed experimental analysis, results and discussions. Section 5 highlights the implications of this research for theory and practice. Finally, Section 6 provides concluding remarks and outlines directions for future research work.
Table 1
Linguistic categories.

Category                            Description
ADJ (Adjectives)                    Qualifies a noun; highly subjective and abstract
SV (State Verbs)                    Refers to a mental or emotional state
SAV (State Action Verbs)            Describes the emotional consequences of an action; high positive or negative connotation
IAV (Interpretive Action Verbs)     Multitude of actions that have the same meaning; have a positive or negative connotation
DAV (Descriptive Action Verbs)      Objective description of a specific action; no positive/negative connotation
2. Related literature

Zhang and Varadarajan (2006) build a regression model for predicting the utility of product reviews. They use lexical similarity, syntactic terms based on Part-Of-Speech (POS) tags, and lexical subjectivity as features. Mudambi and Schuff (2010) formulate a linear regression model for determining the factors that contribute towards review helpfulness. Their work was replicated by Huang and Yen (2013), achieving just 15% explanatory power; the authors conclude that the review helpfulness prediction problem is considerably hard. Lee and Choeh (2014) build a multilayer perceptron neural network model and make use of product, review metadata, reviewer and review characteristics as features. The key contribution of their work is the use of a neural network model to improve helpfulness prediction accuracy; the authors demonstrate that their model works better than other linear regression models used in the literature. Ngo-Ye and Sinha (2014) use reviewer engagement related features to predict review helpfulness. While prior studies have examined reviewer characteristics, the authors introduce a new concept, the reviewer's RFM (Recency, Frequency, Monetary value), to improve prediction performance. They demonstrate that a hybrid model combining features from textual characteristics and the reviewer's RFM provides the best predictive results, based on their evaluation on Yelp and Amazon reviews. The authors primarily use a simple bag-of-words model as part of the textual features. They do not consider other rich sets of features, such as readability, subjectivity and metadata, that are empirically proven to be better predictors of review helpfulness (Ghose & Ipeirotis, 2011; Kim, Pantel, Chklovski, & Pennacchiotti, 2006; Liu et al., 2007). Liu and Park (2015) present a helpfulness prediction model for travel product websites. They employ a combination of reviewer and review characteristics to predict helpfulness. More specifically, the authors use features such as the reviewer's identity, reputation and expertise, the valence of reviews, and readability, and build a text regression model to predict review helpfulness. A non-linear regression model based on radial basis functions for predicting the helpfulness of movie reviews is presented by Liu et al. (2008). They utilize reviewer expertise, the writing style of reviews and the timeliness of reviews as features for the prediction problem. Other works in the literature that use regression models include Cao, Duan, and Gan (2011), Chua and Banerjee (2014), Ghose and Ipeirotis (2011), Korfiatis, Garcia-Bariocanal, and Sanchez-Alonso (2012), and Pan and Zhang (2011). These works study various textual and non-textual characteristics of reviews to determine the factors that contribute towards the helpfulness of online reviews. REVRANK (Tsur & Rappoport, 2009) is an unsupervised algorithm for ranking the helpfulness of online book reviews. It constructs a lexicon of dominant terms across reviews and builds a virtual core review; the similarity of each review to this virtual core review is then assessed to determine the overall helpfulness ranking. Closely related to the work on the helpfulness of reviews is the method proposed by Liu et al. (2007) for detecting low quality reviews. The authors employ features related to informativeness, subjectiveness and readability to classify reviews as high or low quality. Lexical, structural, syntactic, semantic and metadata related features were used by Kim et al. (2006) for automatic helpfulness assessment. They demonstrate that the use of review length, review valence and unigrams achieves the best results. In one of the more recent works, Hong, Lu, Yao, Zhu, and Zhou (2012) develop a binary helpfulness-based review classification system. Their system uses a set of novel features based on needs fulfillment, information reliability and a sentiment divergence measure.
Their SVM-based classifier achieves 69.62% accuracy and was shown to be superior to other earlier works in the literature (Kim et al., 2006; Liu et al., 2007). Valence consistency, a new dimension to the common review helpfulness problem, has recently been explored by Quaschning, Pandelaere, and Vermeir (2014). In contrast to other works in the literature that consider each review individually while assessing its helpfulness, the authors investigate the influence of other nearby reviews and its effect on perceived helpfulness. The authors show that consistent reviews are perceived to be more helpful than inconsistent ones. Zhang, Wei, and Chen (2014) present a new method to estimate the degree of helpfulness of reviews. Unlike a traditional review helpfulness problem, the authors use helpfulness distribution information and the confidence interval for helpfulness prediction. Both synthetic and real datasets are used by the authors to empirically demonstrate the utility of the method. Mudambi, Schuff, and Zhang (2014) argue that misalignment between the star rating and the review text leads to an increase in cognitive processing cost, thereby resulting in suboptimal customer purchasing decisions. The authors explore the reasons behind the misalignment between the valence of reviews and the review text, and find that such misalignment often occurs for experience goods and products with high star ratings. We believe that these findings are pertinent in the context of the present work: the proposed linguistic category features, as we show in our experiments, offer better helpfulness prediction for experience goods and positive valence reviews.

This paper builds on the extant literature and proposes a new model for predicting the helpfulness of online reviews. The proposed model employs novel linguistic category features for the review helpfulness prediction problem. The basic ideas for the linguistic features are borrowed from the social psychology literature. We show the benefits of using the linguistic features for the problem through rigorous experimentation on real-life review datasets.

3. Review helpfulness model

We first describe the terminology used in this paper and formally define the problem. Then, we explain the features used in our prediction model.

3.1. Terminology and problem statement

Let $R = \{r_1, r_2, r_3, \ldots, r_n\}$ be a set of reviews in the dataset. Each review $r_i$ can be represented as a tuple containing five elements $[P, A, C, M, H]$, where $P$ describes the characteristics of a product such as identifier, name, description, type, category, release date and so on; $A$ is the author/reviewer of the review; $C$ is the textual content of the review, consisting of multiple sentences; $M$ describes the review metadata such as review title, date and valence; and $H$ denotes the helpfulness score, computed as the ratio of helpful votes to total votes, $H \in [0, 1]$. This helpfulness metric has been widely used in the literature (Liu et al., 2008; Zhang & Varadarajan, 2006). Let $F$ be an $n \times m$ feature matrix, where $n$ is the total number of reviews in $R$ and $m$ is the total number of features. $F_i$ is the feature vector for review $r_i$, $F_i \in \mathbb{R}^m\ \forall i$. $Y$ is an $n$-dimensional vector, where $Y_i$ indicates whether a review $r_i$ is helpful (1) or not helpful (0). Given a helpfulness threshold value $\bar{H}$, $Y_i$ is calculated as follows:

$$Y_i = \begin{cases} 1, & \text{if } H_i > \bar{H} \\ 0, & \text{otherwise} \end{cases}$$

Our objective in this paper is to build a model that minimizes the prediction error of $Y$ given $F$. The model is then used to predict $Y_k$ for a new review $r_k$.

3.2. Model features

The proposed model uses several state-of-the-art features in addition to the novel linguistic category features. More specifically, we utilize the following features: (1) linguistic category features extracted from the textual content of reviews (C), (2) review metadata (M) related features, (3) readability features, and (4) subjectivity features. Let us refer to the linguistic, metadata, readability, and subjectivity features as LF, MF, RF and SF respectively. The combined set of features (LF, MF, RF, SF) forms the overall feature matrix, F.

3.2.1. Linguistic category features

In the LCM model (Coenen et al., 2006; Semin & Fiedler, 1991) introduced in the psychology literature, there are five linguistic categories, namely ADJ, SV, SAV, IAV and DAV. In the psychology literature, these categories were manually coded, often by multiple coders or annotators, before conducting experiments on the subjects. The reliability of the coding is assessed using standard statistical measures such as Cohen's kappa coefficient. This approach can be followed if the experiment is small and controlled and the nature of the messages is known in advance. However, for a large scale review helpfulness assessment problem, the preparation of a training corpus and the manual annotation of linguistic categories is a very tedious and time consuming process. Besides, a language contains an innumerable number of words, and hence annotating a large text corpus is likely to be impractical. Therefore, we propose an automated and unsupervised mechanism for determining linguistic categories from text documents. Our linguistic category feature extraction procedure works as follows: we first use the NLTK Parts-Of-Speech (POS) tagger (Bird, 2006) to parse and tag each of the user reviews in the text corpus.

3.2.1.1. Adjective feature (ADJ). Extract all words with an adjective tag in each sentence of a review (e.g. use a regular expression to match all words with POS tag 'JJ'). The adjective feature has been widely used in the literature.

3.2.1.2. State verb feature (SV). A state verb is abstract in nature and refers to the emotional, affective or mental state of a person. It is used in sentences that go beyond a specific behaviour or situation, and usually relates to thoughts, emotions, senses and states of being. The set of state verbs in a language is generally limited. Hence, a predefined list of keywords, given in Fig. 1, is used to determine state verbs. This list is created by manual examination of words in the dictionary that pertain to emotions, thoughts, senses and states of being.

[Fig. 1. List of state verbs.]

To identify state verbs from a given text document, we first extract all words with a verb POS tag in each sentence of a review.
It is to be noted that state and action verbs may be used in conjunction in some sentences. For example, the review sentence 'I am using this camera regularly' has both a state verb ('am') and an action verb ('using'). In such cases, we ignore the state verb and consider only the action verb, as the sentence describes a specific action. The eight state verb keywords listed in the last two rows of Fig. 1 are likely to occur in conjunction with an action verb.

3.2.1.3. Action verb features (SAV, IAV and DAV). There are three types of action verbs, and each of them varies in its level of abstraction. A state action verb (SAV) is more abstract and describes a general class of behaviours, whereas a descriptive action verb (DAV), at the other extreme, is very objective and describes a specific situation and an observable behaviour. The feature extraction for these categories of action verbs works as follows: a word (with a verb tag) that is not identified as a state verb is treated as an action verb. The identified action verb is further categorized as SAV, IAV or DAV. One approach to categorizing an action verb is to build a list of keywords under each of these categories, but this is a very difficult task considering the innumerable action verbs present in a language. An alternate approach is to devise an automated means of categorizing action verbs. This paper introduces an automatic categorization mechanism based on the fundamental definitions of the three categories of action verbs. Our basic intuition is that the action verb categories have an implicit valence component. That is, SAVs (shock, anger) have a higher valence compared to IAVs (help, cheat), and IAVs in turn have a higher valence compared to DAVs (talk, run); DAVs generally have very low or no valence component. Semin and Fiedler (1991)'s description of each of the linguistic categories also makes specific reference to the valence component. Therefore, one can automatically categorize an action verb into DAV, IAV or SAV using valence as the primary criterion. In order to determine the valence of a particular action verb, we utilize the publicly available SentiWordNet (Baccianella, Esuli, & Sebastiani, 2010). To categorize an action verb, we first compute its score, $AVScore(w)$, by taking the mean of the subjectivity scores of its Top-K synsets. We observed that words with a large number of synsets have a very wide range of subjectivity scores. For example, the word 'good', with 21 synsets, has subjectivity scores ranging from 0 to 1. Given such a wide range of scores, taking the mean of all synset scores is not useful; therefore, we use the Top-K synsets for computing valence scores (the default value of K is set to 3). The complete scoring formula is given in Eq. (1). Once the AVScore is computed, we determine the type of action verb using Eq. (2). That is, a word with a valence score above $\tau_1$ is treated as SAV. Similarly, a word with a valence score below $\tau_2$ is treated as DAV. Otherwise, a word with a valence score in the intermediate range is categorized as IAV. The cut-off values in Eq. (2) reflect the high (respectively low or no) valence requirement of SAVs (respectively DAVs). We discuss the actual cut-off values used in the experimental results section.
$$AVScore(w) = \frac{1}{K}\sum_{k=1}^{K} SWN\_SubjScore_k(w, POS = \text{'verb'}) \qquad (1)$$

$$AVType(w) = \begin{cases} \text{'sav'}, & \text{if } AVScore(w) \ge \tau_1 \\ \text{'iav'}, & \text{if } \tau_2 \le AVScore(w) < \tau_1 \\ \text{'dav'}, & \text{otherwise} \end{cases} \qquad (2)$$
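A sketch of Eqs. (1) and (2) using NLTK's SentiWordNet interface is given below. The paper does not spell out the exact synset subjectivity formula; here we assume it is pos + neg (i.e. 1 minus objectivity), and we take the first K verb synsets returned:

```python
# Sketch of the action-verb categorization in Eqs. (1)-(2). Assumes the
# synset subjectivity score is pos + neg; requires the NLTK
# 'sentiwordnet' and 'wordnet' corpora.
from nltk.corpus import sentiwordnet as swn

TAU1, TAU2 = 0.6, 0.1   # cut-offs used in the paper's experiments
K = 3                   # Top-K synsets

def av_score(word, k=K):
    """Mean subjectivity of the Top-K verb synsets of `word` (Eq. (1))."""
    synsets = list(swn.senti_synsets(word, "v"))[:k]
    if not synsets:
        return 0.0
    return sum(s.pos_score() + s.neg_score() for s in synsets) / len(synsets)

def av_type(word):
    """Categorize an action verb as SAV, IAV or DAV (Eq. (2))."""
    score = av_score(word)
    if score >= TAU1:
        return "sav"
    if score >= TAU2:
        return "iav"
    return "dav"

print(av_type("amaze"), av_type("recommend"), av_type("talk"))
```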
If multiple categories of action verbs are identified in a single sentence, we pick the most descriptive category. For example, the review sentence 'I suggest that you avoid this camera' has two action verbs, 'suggest' (DAV) and 'avoid' (IAV). As DAV is the most descriptive in this case, the sentence is assigned DAV as the action verb category.
After the linguistic category features (ADJ, SV, SAV, IAV, DAV) are extracted from the reviews, the features are assigned weights or scores. The scoring procedure is described in Algorithm 1. The algorithm computes the occurrence counts of the five linguistic features for each and every review in the dataset (lines 1–15). For each review, the occurrence count of the linguistic features is first computed at the sentence level; the process is repeated for each and every sentence in the review, and the scores are aggregated to obtain the overall $LF_i$ score for review $r_i$. The algorithm then computes the mean ($\mu_c$) and standard deviation ($\sigma_c$) of the LF scores at the product category level (line 16). In the final step (lines 17–19), a z-score is computed to obtain the final feature weights or scores for LF. The z-score for review $i$ and feature $k$ is calculated as the normalized score $(LF_{ik} - \mu_c)/\sigma_c$. Lines 16–19 account for category level variations in the usage of linguistic terms; for example, one product category may use more linguistic terms in reviews than others. We discuss such category level variations further in the experimental results section.

Algorithm 1. Linguistic feature scoring procedure

1:  for each review r_i in R do
2:    for each sentence s_j in r_i do
3:      if ADJ found then
4:        score(ADJ) = 1
5:      else if AV not found and SV found then
6:        score(SV) = 1
7:      else
8:        Determine the AV category for all words in s_j
9:        Pick the most descriptive AV category
10:       Set the score of the most descriptive AV category to 1
11:     end if
12:     Get LF_ij with scores based on the above steps
13:   end for
14:   Obtain LF_i by cumulating scores across all s_j's
15: end for
16: Compute the mean and standard deviation of the LF scores for each product category
17: for each review r_i in R do
18:   Compute the z-score to obtain the final LF_i
19: end for
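A compact sketch of Algorithm 1, reusing the illustrative helpers from the earlier sketches, is given below. Note that a literal reading of the else-if in lines 3–5 would suppress the SV score whenever an adjective is present; the worked example that follows instead scores ADJ and the verb categories independently within a sentence, and the sketch follows the worked example:

```python
# Sketch of Algorithm 1: per-sentence linguistic category counting followed
# by category-level z-score normalization. Uses extract_adj_sv() and
# av_type() from the earlier illustrative sketches.
import numpy as np

CATEGORIES = ["ADJ", "SV", "SAV", "IAV", "DAV"]
DESCRIPTIVENESS = {"sav": 0, "iav": 1, "dav": 2}  # DAV is most descriptive

def score_review(tagged_sentences):
    """Occurrence counts of the five categories for one review (lines 1-15)."""
    counts = dict.fromkeys(CATEGORIES, 0)
    for sent in tagged_sentences:
        adjs, svs, avs = extract_adj_sv(sent)
        if adjs:
            counts["ADJ"] += 1          # each category counted once per sentence
        if avs:
            # Pick the most descriptive action-verb category in the sentence.
            best = max((av_type(v) for v in avs), key=DESCRIPTIVENESS.get)
            counts[best.upper()] += 1
        elif svs:
            counts["SV"] += 1           # SV counted only when no action verb
    return np.array([counts[c] for c in CATEGORIES], dtype=float)

def zscore_by_category(lf_matrix):
    """Normalize raw LF counts within one product category (lines 16-19)."""
    mu = lf_matrix.mean(axis=0)
    sigma = lf_matrix.std(axis=0, ddof=1)
    return (lf_matrix - mu) / sigma
```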
In summary, we create five linguistic category features based on the Linguistic Category Model, so LF is a tuple containing five elements $\langle ADJ, SV, SAV, IAV, DAV \rangle$. Let us illustrate the above feature score computation procedure with the help of an example. Consider the following three reviews tagged with linguistic categories for analysis:

1. A fantastic (ADJ) camera. The picture quality of this camera is (SV) wonderful (ADJ). This review has two adjectives and one state verb ('is'). Also, the state verb is not used in conjunction with an action verb. Therefore, the linguistic feature for the review is $LF = \langle 2, 1, 0, 0, 0 \rangle$.
2. This is (SV) my first camera and I love (SV) it. The camera is (SV) excellent (ADJ). This review has two state verbs in the first sentence. As we account for the verb only once in a sentence, the SV count is incremented to one for the first sentence. The second sentence has one state verb and one adjective. Therefore, $LF = \langle 1, 2, 0, 0, 0 \rangle$.
3. I regularly take (DAV) pics with this camera. The quality of the pics has really amazed (SAV) me. Battery life is (SV) fabulous (ADJ). My only issue is (SV) that it makes (DAV) a lot of noise in autofocus mode. I strongly recommend (IAV) this camera.
[Fig. 2. Distribution of helpfulness scores: histograms of relative frequency vs. helpfulness (%) for the DS1 and DS2 datasets.]
Table 2
Datasets.

Dataset   No. of reviews   Product categories
DS1       11,965           Books, DVD, electronics and kitchen & housewares
DS2       1653             Camera, cellphone, laser printer, mp3 player, music CD, video game
There are five sentences in this review. The score computation for sentences 1–3 is straightforward. Sentence 4 has a state verb ('is') used in conjunction with an action verb ('makes'). The sentence describes an action, and hence only the action verb is considered; the action verb 'makes' is assigned the DAV category. The resulting linguistic feature scores for the individual sentences are $\langle 0,0,0,0,1 \rangle$, $\langle 0,0,1,0,0 \rangle$, $\langle 1,1,0,0,0 \rangle$, $\langle 0,0,0,0,1 \rangle$ and $\langle 0,0,0,1,0 \rangle$. The linguistic feature score for the complete review is $LF = \langle 1, 1, 1, 1, 2 \rangle$.
4. Compute the mean and standard deviation across all three reviews: $mean = \langle 1.33, 1.33, 0.33, 0.33, 0.67 \rangle$, $sd = \langle 0.58, 0.58, 0.58, 0.58, 1.15 \rangle$.
5. Calculate the z-score for each review to obtain the final feature score.
6. The final feature weights or scores for all three reviews are: Review 1: $LF = \langle +1.15, -0.58, -0.58, -0.58, -0.58 \rangle$; Review 2: $LF = \langle -0.58, +1.15, -0.58, -0.58, -0.58 \rangle$; Review 3: $LF = \langle -0.58, -0.58, +1.15, +1.15, +1.15 \rangle$.

Note that the final score for each review is obtained by computing the z-score at the product category level. In the above illustrative example, all reviews pertain to the same 'camera' product category, and hence the z-score is computed across all three reviews.
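The normalization in steps 4–6 can be checked with a few lines of NumPy, using the sample standard deviation (ddof=1), which matches the numbers above:

```python
# Reproducing the worked example: raw LF counts for the three reviews,
# then category-level z-scores.
import numpy as np

LF = np.array([[2, 1, 0, 0, 0],   # review 1
               [1, 2, 0, 0, 0],   # review 2
               [1, 1, 1, 1, 2]],  # review 3
              dtype=float)

mu = LF.mean(axis=0)               # [1.33, 1.33, 0.33, 0.33, 0.67]
sd = LF.std(axis=0, ddof=1)        # [0.58, 0.58, 0.58, 0.58, 1.15]
print(np.round((LF - mu) / sd, 2)) # rows match the final scores above
```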
3.2.2. Review metadata features

Past research studies show that review valence and review publication date influence the total number of helpful votes received by a review (Ghose & Ipeirotis, 2011; Liu et al., 2007). Hence, in this paper we also utilize two review metadata related features, namely (1) review extremity and (2) review age. Review extremity is calculated as the difference between the review valence (or rating) and the mean product rating. The age of a review is computed from the difference between the review publication date and the product release date, as given in Eq. (3).

$$reviewAge = \log(reviewDate - releaseDate) \qquad (3)$$
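A minimal sketch of these two metadata features follows; the field names, the day granularity, and the guard against a zero date difference are our assumptions, as the paper does not specify them:

```python
# Sketch of the two review-metadata features described above.
import math
from datetime import date

def review_extremity(review_rating, mean_product_rating):
    """Difference between the review's rating and the product's mean rating."""
    return review_rating - mean_product_rating

def review_age(review_date, release_date):
    """log(reviewDate - releaseDate), Eq. (3); assumes day granularity."""
    days = (review_date - release_date).days
    return math.log(max(days, 1))  # guard against log(0) for release-day reviews

print(review_extremity(5.0, 3.8))
print(review_age(date(2014, 6, 1), date(2013, 1, 15)))
```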
3.2.3. Readability features

The readability of a review is another important characteristic that can influence review helpfulness. A highly readable review is likely to be read and voted on by a larger number of users. In order to assess review readability, we utilize the following grade level readability metrics used in the literature (DuBay, 2004): (1) Automated Readability Index, (2) SMOG, (3) Flesch–Kincaid Grade Level, (4) Gunning Fog Index, and (5) Coleman–Liau Index.

3.2.4. Subjectivity features

Another commonly used feature in helpfulness assessment is review subjectivity. A review with a higher number of subjective words is more likely to be helpful (Ghose & Ipeirotis, 2011). We compute subjectivity as the total number of subjective words (positive and negative opinion words) normalized by review length.

4. Experimental evaluation

4.1. Online review datasets

We used two real-life datasets for the experimentation. The first dataset is a publicly available multi-domain sentiment analysis dataset (Blitzer, Dredze, & Pereira, 2007). This dataset has 13,120 customer reviews across four different product categories. The second dataset, a more recent review dataset, was obtained by crawling the amazon.com website. The details of both datasets are summarized in Table 2.
[Fig. 3. Mean ADJ, SV, SAV, IAV and DAV counts and mean review valence by product category for the DS1 dataset, for helpful vs. unhelpful reviews.]
The datasets are cleaned and prepared for analysis by applying the following three preprocessing steps. (1) Online review websites tend to have the same reviews repeated across multiple locations. In order to remove such duplicate reviews, a data de-duplication operation is performed: duplicate reviews are identified by matching the bigrams of every pair of reviews, and a pair of reviews that has more than 90% of its bigrams in common is considered a duplicate (see the sketch below). (2) Reviews that have very few total votes but high helpfulness scores are less reliable indicators of usefulness. For example, a review having 3 helpful out of 4 total votes is less likely to be useful than a review having 33 helpful out of 44 total votes. Hence, we use reviews with at least ten total votes to ensure robustness of the results; this approach has also been followed in the past research literature (Liu et al., 2008). (3) Blank reviews (i.e. with no textual content) that nevertheless have helpful and total votes are likely to be spurious in nature; hence, we also removed such blank reviews. After the above three data cleansing operations, the final datasets have 11,965 and 1653 reviews for the DS1 and DS2 datasets respectively. The distribution of helpfulness scores for the DS1 and DS2 datasets is given in Fig. 2; the histograms indicate that the helpfulness scores are skewed to the right for both datasets.

One of the key parameters in our prediction model is the helpfulness threshold value, $\bar{H}$. The default value of $\bar{H}$ is set to 0.60 for our experiments; that is, a review having more than 60% helpful votes is considered a helpful review. This value is chosen based on its proven effectiveness in past research studies (Ghose & Ipeirotis, 2011; Hong et al., 2012). For determining the type of action verb from Eq. (2), the parameters $\tau_1$ and $\tau_2$ were chosen as 0.6 and 0.1 respectively. The $\tau$ values were determined based on a manual review and validation of the words that fall into each of the action verb categories.
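A sketch of preprocessing steps (1) and (2) is given below. The paper states only that pairs with more than 90% common bigrams are treated as duplicates; the normalization used here (shared bigrams over the smaller review's bigram set) is one plausible reading, and the function names are our own:

```python
# Sketch of the de-duplication and minimum-votes filtering steps.
def bigrams(text):
    tokens = text.lower().split()
    return set(zip(tokens, tokens[1:]))

def is_duplicate(review_a, review_b, threshold=0.90):
    """Step (1): flag a pair of reviews with > 90% common bigrams."""
    ba, bb = bigrams(review_a), bigrams(review_b)
    if not ba or not bb:
        return False
    return len(ba & bb) / min(len(ba), len(bb)) > threshold

def keep_review(text, total_votes, min_votes=10):
    """Steps (2) and (3): at least ten total votes and non-blank text."""
    return total_votes >= min_votes and bool(text.strip())
```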
Fig. 3 shows the mean linguistic feature counts and mean review valence at the product category level for the DS1 dataset. The following observations are evident from the charts: (1) There is a clear and wide separation of the mean scores of helpful and unhelpful reviews, the only exception being the SAV count, which shows a very narrow separation; this reflects the limited use of SAV category terms in the review datasets studied. The chart trends give a positive indication of the usefulness of the proposed features in discriminating helpful from unhelpful reviews. (2) Linguistic term usage exhibits variation at the category level. For example, in Fig. 3, the mean DAV count of category 1 (Books) is higher than that of category 4 (Kitchen & Housewares). This can be attributed to variations in the product descriptions of individual product categories. As discussed earlier in Section 3.2, we account for such category level differences by using the z-score instead of the raw count as the feature score. The above observations were found to be valid for both datasets studied, although the mean score graphs are shown only for the DS1 dataset.

To build a predictive model, we use Naive Bayes (NB), Support Vector Machine (SVM) and Random Forest (RandF) as learning methods. The R statistical programming language is used for all model development and testing. Specific R packages used for model development include (1) e1071, which provides an interface to the LIBSVM package (Chang & Lin, 2011) and a Naive Bayes implementation, and (2) randomForest, which implements the random forest algorithm of Breiman (2001). We evaluate our predictive model using standard metrics such as f-measure and accuracy. To ensure robustness of the results, all our experiments were conducted using stratified 10-fold cross validation.
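The models themselves were built in R (e1071 and randomForest); purely for illustration, the sketch below reproduces the same setup in Python with scikit-learn on toy stand-in data: label construction at the 0.60 threshold, a random forest classifier, and stratified 10-fold cross validation. It is an equivalent pipeline, not the authors' original code:

```python
# Hedged scikit-learn equivalent of the paper's R-based modeling setup.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

H_BAR = 0.60  # helpfulness threshold

def make_labels(helpful_votes, total_votes):
    """Y_i = 1 if the review's helpfulness ratio exceeds the threshold."""
    return (helpful_votes / total_votes > H_BAR).astype(int)

rng = np.random.default_rng(0)                 # toy stand-in data
X = rng.normal(size=(200, 13))                 # e.g. 5 LF + 2 MF + 5 RF + 1 SF
y = make_labels(rng.integers(0, 40, 200), np.full(200, 40))

clf = RandomForestClassifier(n_estimators=500, random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
print(cross_val_score(clf, X, y, cv=cv, scoring="f1").mean())
```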
Table 3
Prediction performance analysis (F = f-measure, A = accuracy, %).

                   DS1               DS2
Learning method    F       A         F       A
NB                 81.91   71.27     61.48   67.58
SVM                84.84   77.81     75.39   74.60
RandF              87.21   81.33     77.87   77.02

Table 5
Effect of review type on the DS1 and DS2 datasets (F = f-measure, A = accuracy, %).

          DS1 neg           DS1 pos           DS2 neg           DS2 pos
Method    F       A         F       A         F       A         F       A
LF        63.80   70.10     91.32   84.60     46.62   71.73     87.78   80.08
MF        57.82   64.03     87.00   77.44     42.13   69.43     85.30   77.49
RF        42.58   55.83     91.32   84.43     35.54   70.07     86.71   77.81
SF        42.39   54.31     89.58   81.46     36.11   65.92     78.19   66.19
All       66.82   73.09     91.96   85.66     46.67   73.16     88.65   81.10
4.2. Experimental results

We conduct a series of experiments to evaluate the utility of the proposed helpfulness prediction model.

4.2.1. Predicting review helpfulness

In the first set of experiments, we build predictive models and study their effectiveness. The models use a combination of the LF, MF, RF and SF features. The results of our experiments are shown in Table 3. It is evident from Table 3 that the RandF based classifier gives the best results: on the DS1 dataset it achieves 87.21% f-measure and 81.33% accuracy, while the DS2 dataset shows relatively lower performance (77.87% and 77.02%). The overall performance results are quite promising and demonstrate the utility of the proposed novel features for review helpfulness prediction. We conduct the subsequent experiments using the RandF classifier, as it is found to outperform the other methods on both datasets studied.

The next set of experiments analyses the performance of our predictive model across the individual feature categories, namely LF, MF, RF, SF, and the combined set of all features (Table 4). The combined set of features is found to offer the best predictive results. LF and MF deliver comparable results across the datasets studied, while RF and SF deliver comparatively low predictive performance. In the past research literature, Ghose and Ipeirotis (2011) demonstrate that readability and subjectivity features perform better than the lexical features used by Zhang and Varadarajan (2006). From Table 4 it is evident that LF performs better than RF and SF on both datasets studied. This clearly demonstrates the utility of the proposed novel features against state-of-the-art methods. In addition, our experimental results reveal that the hybrid set of features based on linguistic category, review metadata, readability and subjectivity offers the best predictive performance.

Table 4
Feature-wise performance analysis (F = f-measure, A = accuracy, %).

          DS1               DS2
Method    F       A         F       A
LF        81.57   71.65     70.73   69.33
MF        81.15   72.76     70.80   70.72
RF        80.57   69.48     62.86   60.01
SF        77.91   65.44     57.63   57.47
All       87.21   81.33     77.87   77.02

4.2.2. Effect of review type on helpfulness

Reviews are classified as positive or negative based on the review valence: a review is considered positive if its valence is above three and negative if its valence is below three. This classification method is widely used in the literature (Blitzer et al., 2007). We conduct experiments to study the effect of review type (positive or negative) on prediction performance. It is evident from the results in Table 5 that LF performs better than the other individual feature sets in all of the cases studied. Also, the hybrid method offers the best predictive performance. The predictive performance on negative reviews is poorer compared to positive reviews, which could be explained by the non-uniform distribution of review examples (Fig. 2).

4.2.3. Effect of product type on helpfulness

There are two types of products, viz. search goods and experience goods (Huang, Lurie, & Mitra, 2009; Nelson, 1970). Prior research demonstrates that helpfulness characteristics differ across classes of goods (Huang & Yen, 2013). While a helpful search good review is likely to contain more information on product aspects or attributes, a helpful experience good review is likely to have more descriptions of customer experiences. Therefore, we conjecture that the linguistic category features could be more effective in the case of experience goods, and we conducted further experiments to validate this conjecture. The books and DVD product categories in the DS1 dataset (and the mp3 player, music CD and video game categories in DS2) belong to the experience goods. The results of our experiments are shown in Table 6. For the experience goods, the results clearly reveal that the linguistic features are more effective than the other feature sets, although RF also shows competitive performance compared to LF on the DS1 dataset. For the search goods, LF is found to perform slightly worse than MF. These results indicate that the linguistic features alone are not very effective in the case of search goods. The hybrid set of features, however, offers the best helpfulness prediction for both search and experience goods.

Table 6
Prediction performance for experience and search goods (F = f-measure, A = accuracy, %).

          DS1 exp.          DS2 exp.          DS1 search        DS2 search
Method    F       A         F       A         F       A         F       A
LF        89.21   81.27     78.82   71.58     73.87   64.35     61.19   68.97
MF        84.78   74.95     77.38   70.89     77.80   71.95     60.45   70.28
RF        88.84   80.53     74.43   63.21     73.02   63.55     49.49   59.77
SF        86.59   76.85     68.29   58.82     66.21   57.43     44.15   57.02
All       89.78   82.34     83.85   78.56     84.18   80.43     71.77   77.30

4.2.4. Sensitivity analysis of LF threshold values

In this paper, the $\tau$ values were chosen based on a manual review and validation of the terms that fall into the action verb categories. However, we also conducted a sensitivity analysis to study the impact of varying the $\tau_1$ and $\tau_2$ threshold values. The results in Table 7 indicate that the prediction performance for both datasets increases marginally as we increase $\tau_1$, but no clear trend could be observed for changes in $\tau_2$ across the datasets studied. As part of future work, we intend to evaluate the model on more diverse datasets and estimate optimal threshold values.

Table 7
Sensitivity analysis of LF threshold values (F = f-measure, A = accuracy, %).

                  DS1               DS2
τ1      τ2        F       A         F       A
0.60    0.10      81.57   71.65     70.73   69.33
0.65    0.10      81.60   71.64     71.25   69.75
0.70    0.10      81.67   71.79     71.70   70.05
0.60    0.15      81.88   72.00     71.17   69.81
0.60    0.20      82.01   72.20     70.95   69.69

4.3. Discussions

This paper proposed a new review helpfulness prediction model that makes use of four different kinds of features, namely review metadata, subjectivity, readability and linguistic category features.
It also presented a method to automatically derive linguistic category features from review text documents. The proposed model was found to be quite effective, achieving a predictive accuracy of over 77% on the real-life review datasets studied (Table 3). The linguistic category features introduced in this paper were found to be effective in improving review helpfulness prediction, as is evident from the results of our experiments in Table 4. The best predictive performance was obtained for the hybrid set of features. A stand-alone model that uses either linguistic features or review metadata features delivers comparable performance. Furthermore, a stand-alone model that uses linguistic features delivers superior performance compared to a model that uses either subjectivity or readability features in almost all of the cases studied.

Additional analysis of our model on different types of products shows that the linguistic category features are quite effective in review helpfulness prediction for experience goods. For example, in the case of books, an experience good, a review is likely to be more helpful if it provides a more objective description (say, using IAVs or DAVs) of the user's book reading experience. As the proposed linguistic features effectively capture the subjectivity or objectivity of a word on a continuum (ADJs to DAVs), they are likely to perform better. Our results on two real-life review datasets clearly demonstrate the utility of the approach for experience goods.
5. Implications

The findings of this paper have implications for both theory and practice. From a theoretical perspective, the paper brings fresh ideas into the expert and intelligent systems research community from the social psychology literature. The basic ideas for the linguistic category features introduced in this paper are borrowed from the LCM model (Semin & Fiedler, 1991) used in the psychology literature. Another important contribution of this paper is the design of an automatic linguistic category feature extraction method that obviates the need for expensive manual annotation of review text. We hope that these ideas can be effectively used in other related sentiment analysis or opinion mining tasks.

The proposed linguistic category features capture the psychological properties of a language on an abstract-to-concrete dimension (adjectives to descriptive action verbs). The ability of our model to capture such granular language abstractions offers unique advantages. This is evident from our empirical analysis of reviews for experience goods. In the case of experience goods, the reviews often contain more descriptive terms on product usage experiences. Such usage experiences are valuable (and helpful) for a customer, as the product or service quality of experience goods cannot be clearly ascertained before consumption. The linguistic features that capture multiple levels of language abstraction offer a useful mechanism to effectively discriminate objective and subjective terms in a review text. We believe that the proposed linguistic features, capturing language abstractions on a continuum, will be very useful for several other related text mining tasks such as question answering systems and opinion mining.

E-commerce retailers offer a platform for customers to express their opinions or comments and give their votes for useful reviews. These consumer generated product evaluations are known to increase customer trust and help in the decision making process. However, customers often face a review overload problem, incurring high cognitive processing costs while making product purchase decisions. The findings of our study can provide useful insights for e-commerce retailers in better organizing product reviews, thereby minimizing the cognitive processing costs for their customers. Mudambi et al. (2014), in one of their recent works, demonstrate that review valence and review text are commonly misaligned for experience goods and for products with high star ratings. The linguistic category features introduced in this work were found to deliver better helpfulness prediction results for positive valence reviews and for experience goods (Tables 5 and 6). The use of review valence (an element of review metadata), on the other hand, is likely to introduce more noise (due to higher misalignment) into a predictive model for experience goods. Therefore, one can use the findings of this study to design custom review ranking systems based on the type of product. We believe that the use of linguistic category features can play an important role in the design of such systems.
6. Conclusions

This paper examined the online review helpfulness problem and built a new prediction model. The proposed model used a hybrid set of features (review metadata, subjectivity, readability, and linguistic category) to predict review helpfulness. The effectiveness of the proposed model was empirically evaluated on two real-life review datasets. The linguistic category features were found to be effective in predicting the helpfulness of reviews of experience goods.

The paper described an automatic linguistic category extraction procedure. The LF feature extraction procedure, especially the AV categorization, utilized SentiWordNet and a set of threshold values ($\tau_1$ and $\tau_2$). We made an appropriate choice of the threshold values through a manual review and validation of the words that fall into the different action verb categories. Additionally, we conducted a sensitivity analysis to assess the impact of changes in the threshold values. In future, we plan to extend this work to estimate optimal threshold values for a given review helpfulness problem.

As part of future work, several interesting extensions can be explored. One can devise a supervised learning mechanism to automatically determine action verb categories, obviating the need for determining optimal threshold values ($\tau_1$ and $\tau_2$). Such a supervised learning scheme would result in the creation of a two stage classification model (an action verb classifier and a helpfulness classifier) for helpfulness prediction. Future research can also explore the use of additional features such as the reviewer's identity, social network and semantic features of reviews, and examine their influence on review helpfulness prediction. Another interesting research direction is the application of the proposed ideas in other related domains. For example, linguistic category features can be used to predict the most useful response in social query answering systems such as StackExchange, Yahoo! Answers, and Google Answers. As the responses in query answering systems (Surdeanu, Ciaramita, & Zaragoza, 2008) are likely to use descriptive terms at different levels of language abstraction, we conjecture that linguistic category features can be quite effective in predicting useful responses.

The findings of this research work are likely to be useful to e-commerce retailers. With better insights on review helpfulness,
retailers can improve the display of user reviews to their potential customers. This can also help potential customers in making better purchase decisions. Overall, this paper makes a useful contribution to the literature on review helpfulness prediction.

References

Baccianella, S., Esuli, A., & Sebastiani, F. (2010). SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In Proceedings of the seventh international conference on language resources and evaluation (LREC'10). Valletta, Malta: European Language Resources Association (ELRA).
Bird, S. (2006). NLTK: The natural language toolkit. In Proceedings of the COLING/ACL interactive presentation sessions (COLING-ACL'06) (pp. 69–72).
Blitzer, J., Dredze, M., & Pereira, F. (2007). Biographies, Bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In ACL (pp. 187–205).
Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.
Cao, Q., Duan, W., & Gan, Q. (2011). Exploring determinants of voting for the helpfulness of online user reviews: A text mining approach. Decision Support Systems, 50, 511–521.
Chang, C.-C., & Lin, C.-J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2, 1–27.
Chua, A. Y., & Banerjee, S. (2014). Understanding review helpfulness as a function of reviewer reputation, review rating, and review depth. Journal of the Association for Information Science and Technology.
Coenen, L. H. M., Hedebouw, L., & Semin, G. R. (2006). The Linguistic Category Model (LCM). Retrieved from: . Last accessed: 2014-03-29.
DuBay, W. H. (2004). The principles of readability. Impact Information.
Ghose, A., & Ipeirotis, P. (2011). Estimating the helpfulness and economic impact of product reviews: Mining text and reviewer characteristics. IEEE Transactions on Knowledge and Data Engineering, 23, 1498–1512.
Hong, Y., Lu, J., Yao, J., Zhu, Q., & Zhou, G. (2012). What reviews are satisfactory: Novel features for automatic helpfulness voting. In Proceedings of the 35th international ACM SIGIR conference on research and development in information retrieval (SIGIR'12) (pp. 495–504).
Huang, A. H., & Yen, D. C. (2013). Predicting the helpfulness of online reviews – A replication. International Journal of Human–Computer Interaction, 29, 129–138.
Huang, P., Lurie, N. H., & Mitra, S. (2009). Searching for experience on the web: An empirical examination of consumer behavior for search and experience goods. Journal of Marketing, 73, 55–69.
Kim, S.-M., Pantel, P., Chklovski, T., & Pennacchiotti, M. (2006). Automatically assessing review helpfulness. In Proceedings of the 2006 conference on empirical methods in natural language processing (EMNLP 2006) (pp. 423–430).
Korfiatis, N., Garcia-Bariocanal, E., & Sanchez-Alonso, S. (2012). Evaluating content quality and helpfulness of online product reviews: The interplay of review helpfulness vs. review content. Electronic Commerce Research and Applications, 11, 205–217.
Lee, S., & Choeh, J. Y. (2014). Predicting the helpfulness of online reviews using multilayer perceptron neural networks. Expert Systems with Applications, 41, 3041–3046.
Liu, J., Cao, Y., Lin, C.-Y., Huang, Y., & Zhou, M. (2007). Low-quality product review detection in opinion summarization. In Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning (pp. 334–342).
Liu, Y., Huang, X., An, A., & Yu, X. (2008). Modeling and predicting the helpfulness of online reviews. In IEEE international conference on data mining.
Liu, Z., & Park, S. (2015). What makes a useful online review? Implication for travel product websites. Tourism Management, 47, 140–151.
Mudambi, S. M., & Schuff, D. (2010). What makes a helpful online review? A study of customer reviews on amazon.com. MIS Quarterly, 34, 185–200.
Mudambi, S. M., Schuff, D., & Zhang, Z. (2014). Why aren't the stars aligned? An analysis of online review content and star ratings. In 2014 47th Hawaii international conference on system sciences (HICSS) (pp. 3139–3147). IEEE.
Nelson, P. (1970). Information and consumer behavior. Journal of Political Economy, 78, 311–329.
Ngo-Ye, T. L., & Sinha, A. P. (2014). The influence of reviewer engagement characteristics on online review helpfulness: A text regression model. Decision Support Systems, 61, 47–58.
Pan, Y., & Zhang, J. Q. (2011). Born unequal: A study of the helpfulness of user-generated product reviews. Journal of Retailing, 87, 598–612.
Quaschning, S., Pandelaere, M., & Vermeir, I. (2014). When consistency matters: The effect of valence consistency on review helpfulness. Journal of Computer-Mediated Communication.
Semin, G. R., & Fiedler, K. (1991). The linguistic category model, its bases, applications and range. European Review of Social Psychology, 2, 1–30.
Surdeanu, M., Ciaramita, M., & Zaragoza, H. (2008). Learning to rank answers on large online QA collections. In Proceedings of ACL-08: HLT (pp. 719–727).
Tsur, O., & Rappoport, A. (2009). RevRank: A fully unsupervised algorithm for selecting the most helpful book reviews. In International AAAI conference on weblogs and social media.
Zhang, Z., & Varadarajan, B. (2006). Utility scoring of product reviews. In Proceedings of the 15th ACM international conference on information and knowledge management (CIKM'06) (pp. 51–57).
Zhang, Z., Wei, Q., & Chen, G. (2014). Estimating online review helpfulness with probabilistic distribution and confidence. In Foundations and applications of intelligent systems (pp. 411–420). Springer.