An Ensemble Approach for Spam Detection in Arabic Opinion Texts

Radwa M.K. Saeed, Sherine Rady, Tarek F. Gharib

Journal Pre-proof

PII: S1319-1578(19)30741-4
DOI: https://doi.org/10.1016/j.jksuci.2019.10.002
Reference: JKSUCI 684

To appear in: Journal of King Saud University - Computer and Information Sciences

Received Date: 3 June 2019
Revised Date: 1 October 2019
Accepted Date: 6 October 2019

Please cite this article as: Saeed, R.M.K., Rady, S., Gharib, T.F., An Ensemble Approach for Spam Detection in Arabic Opinion Texts, Journal of King Saud University - Computer and Information Sciences (2019), doi: https://doi.org/10.1016/j.jksuci.2019.10.002

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

© 2019 Production and hosting by Elsevier B.V. on behalf of King Saud University.

An Ensemble Approach for Spam Detection in Arabic Opinion Texts

Radwa M.K. Saeed
Information Systems Department, Faculty of Computer and Information Sciences, Ain Shams University, Cairo 11566, Egypt
[email protected]

Sherine Rady (Corresponding author)
Information Systems Department, Faculty of Computer and Information Sciences, Ain Shams University, Cairo 11566, Egypt
[email protected]

Tarek F. Gharib
Information Systems Department, Faculty of Computer and Information Sciences, Ain Shams University, Cairo 11566, Egypt
[email protected]

Abstract
Nowadays, individuals express their experiences and opinions through online reviews. These reviews influence online marketing and provide real knowledge about products and services. However, some online reviews can be unreal: they may have been written to promote low-quality products/services or to sabotage the reputation of a product/service and mislead potential customers. Such misleading reviews are known as spam reviews and demand serious attention. Prior spam detection research focused on English reviews, with less attention paid to other languages. The detection of spam reviews in Arabic online sources is a relatively new topic despite the huge amount of data generated. Therefore, this paper contributes to this topic by presenting four different Arabic spam review detection methods, with particular focus on the construction and evaluation of an ensemble approach. The proposed ensemble method integrates a rule-based classifier with machine learning techniques, utilizing content-based features that depend on N-gram features and negation handling. The four proposed methods are evaluated on two datasets of different sizes. The results indicate the efficiency of the ensemble approach, which achieves classification accuracies of 95.25% and 99.98% on the two experimented datasets, outperforming existing related work by as much as 25%.

Keywords — Arabic spam reviews detection; N-gram features; Content-based features; Negation handling; Machine learning; Ensemble approach

Conflict of Interest
The authors declare that they have no conflict of interest for this article.

Declaration of Interest
None

Funding
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.


1. Introduction
Recently, there has been a spectacular increase in online opinion resources that are rich with customers’ opinions. These opinions serve to highlight positive or negative aspects of a business, including the quality of products/services, purchase interactions, or customer support engagements. They also allow customers to benefit from each other’s experiences with a product or a service and to make an informed decision before purchasing. Analyzing multiple reviews for decision making would be a very hard and time-consuming task if done manually. This has led to the emergence of automatic analysis of reviews through Sentiment Analysis (Ismail et al., 2016). Sentiment Analysis uses Natural Language Processing (NLP) to identify and extract subjective information (i.e. features) from online reviews (Touahri & Mazroui, 2019). The adopted features are essential for classifying reviews in order to determine the overall sentiment embedded in them (e.g. negative, positive or neutral). Sentiment Analysis is very useful for organizations and service providers since it allows them to track and monitor customer reviews on numerous subjects and classify them to enhance product quality or promote service performance. Due to the significance of Sentiment Analysis, several studies have been dedicated to this research area. The overwhelming majority of this research has addressed sentiment analysis for English texts, whereas only a small number of studies have worked on texts in morphologically rich languages, such as Arabic (Tartir & Abdul-Nabi, 2017).

Sentiment Analysis for Arabic texts is a research field that encloses several challenging points that need handling to achieve a design that works effectively. These challenges include the morphological complexity of the language, which implies the need for efficient pre-processing and feature representation, the need to build robust classifiers, and the need to identify and eliminate spam opinion texts (Touahri & Mazroui, 2019). Morphological complexity arises from the complex nature of the Arabic language. For instance, there is a lack of standardization in the writing of the same word: an Arabic word can be written in different forms using various suffixes, affixes and prefixes. In addition, different Arabic words can convey different meanings using the same three-letter root, and the same letter can be written in different forms according to its position in the word. Another reason is that many Arabic texts are written in an unstructured format, where some words can be full of spelling mistakes, missing punctuation, and repetitions (Al-Radaideh & Al-Abrat, 2018). A further challenge is negation handling. Negation words can change the meaning of a sentence and reverse its sentiment polarity; therefore, negation handling can play a vital role in detecting sentiment accurately (Wahsheh et al., 2013). Another challenge is Arabic spam opinion detection, one of the main challenging tasks closely related to opinion analysis. A spam opinion is a fake or false review, usually written to ruin a certain product's reputation using negative opinions or to promote a low-quality product through positive opinions (Saeed et al., 2018). Spam opinion detection has a great impact on companies, because users' experience will be affected if the opinions provided about a product/service contain a large amount of spam opinion information.
Moreover, users will not purchase/use a product/service again if they are deceived by these spam opinions. Therefore, developing an approach for detecting spam reviews in Arabic opinion texts is an indispensable task, especially as very few attempts have been concerned with this task for the Arabic language (Hammad & El-Halees, 2013). The main contribution of this paper is comparing different methods for detecting spam reviews in Arabic online opinion sources while considering the other aforementioned challenges. In this scope, hybrid (ensemble) methods are investigated. The paper takes two study directions, rule-based and machine learning classification, and concludes that a hybrid design achieves high accuracy for the detection of Arabic spam reviews. The rule-based classifier depends on a set of defined rules while considering content-based features that rely on extracting a triple combination of N-gram features along with applying negation handling. The decision output of the rule-based classifier is judged along with the decision outputs of supervised and unsupervised machine learning classifiers to conclude a final decision about the class. To verify the efficiency, experiments are performed on two datasets and comparisons against related works are given. The rest of the paper is organized as follows: Section 2 gives an insight into the related work in the area of spam review detection, Section 3 describes the proposed approach, Section 4 presents the experimental results and discussion, and finally Section 5 concludes the paper and describes future work.

2. Related Work
A massive amount of research has been carried out in the field of detecting spam reviews during the last decade. Most of this research focused on detecting spam reviews using supervised learning techniques, and mostly on reviews written in the English language (Hammad & El-Halees, 2013).
Research focusing on the detection of spam reviews in Arabic opinion texts is very scarce. Hence, this section presents an overview of previous works that deal with spam review detection in the Arabic and English languages. Some Arabic spam review detection methods have been applied in (Wahsheh et al., 2013; Hammad & El-Halees, 2013; Sabbeh et al., 2018; Jardaneh et al., 2019; Alorini et al., 2019; Alzanin et al., 2019). (Wahsheh et al., 2013) developed an Arabic spam URL detection system in which reviews in the Yahoo!-Maktoob social network are classified as high-level or low-level spam reviews with reference to the URLs they contain. A review is classified as a high-level spam review if its URL is listed in a blacklist dictionary or labeled as spam, and as a low-level spam review if the URL is not labeled as spam or contains successive numbers or an ‘@’ symbol with successive letters. If the review does not contain a URL, it is classified as non-spam. This system reported an accuracy of 97.5% when validated using a Support Vector Machine classifier. (Hammad & El-Halees, 2013) classified reviews collected from tripadvisor.com.eg, booking.com, and agoda.ae into spam and non-spam using a combination of review content features, metadata about each review, and reviewer features. Their dataset was not labeled, so they labeled the spam cases manually based on scoring against a defined list of subjective measures. This system was evaluated using three classifiers: K-Nearest Neighbor, Naive Bayes, and Support Vector Machine, where the highest accuracy, 99.2%, was achieved by the Naive Bayes classifier. (Sabbeh et al., 2018) and (Jardaneh et al., 2019) presented models for detecting fake Arabic news on Twitter. Both employed user-based, content-based and sentiment analysis features, tested their models on different machine learning techniques, and concluded that utilizing sentiment analysis features has a significant effect on improving accuracy. (Sabbeh et al., 2018) obtained an accuracy of 89.9% using a Decision Tree classifier, while (Jardaneh et al., 2019) achieved an accuracy of 76% with a Random Forest classifier. (Alorini et al., 2019) studied Gulf dialectal Arabic spam on Twitter. Features such as the number of hash-tags, the number of shortened URLs, and the existence of profanity words were extracted. Naive Bayes and Support Vector Machine were tested, where a maximum accuracy of 86% was achieved using the Naive Bayes classifier.
(Alzanin et al., 2019) introduced two different learning models to detect fake Arabic tweets: semi-supervised learning and unsupervised learning using expectation–maximization. Two types of features were studied: tweet-based and topic-based features. The semi-supervised learning model achieved better results than the unsupervised learning model, with an accuracy of 78.6%. Some relevant works have also been presented in the scope of English spam review detection, such as (Narayan et al., 2018; Mani et al., 2018; Saumya & Singh, 2018; Kumar et al., 2018; Hassan et al., 2019; Barushka et al., 2019; Jain et al., 2019). These works detected spam reviews using different machine learning classifiers. (Narayan et al., 2018) combined linguistic inquiry and word count, and Uni-grams along with sentiment score as features; an accuracy of 86.25% was recorded for the Logistic Regression classifier. (Mani et al., 2018) introduced an ensemble technique combining three classifiers: Naive Bayes, Random Forest, and Support Vector Machine. They employed N-gram (Uni-gram + Bi-gram) features and achieved a maximum accuracy of 87.68%. In their study, they concluded that using only simple features like N-gram features and adopting an ensemble technique can boost the efficiency of detecting spam reviews. (Saumya & Singh, 2018) introduced a method that utilizes three features: sentiments of a review and its comments, a content-based factor, and rating deviation. They recorded an F1 score of 91% for the Random Forest classifier; their work clarified that utilizing sentiment mining allows achieving better accuracy. (Kumar et al., 2018) classified spam reviews based on review content and product rating deviation. Their model obtained an accuracy of 82.2% using a Neural Network classifier. (Hassan et al., 2019) introduced a model that utilizes content-based features including word frequency count, sentiment polarity and length of review.
This model achieved an accuracy of 86.32% with a Naive Bayes classifier. Their work elucidated that using review length as a feature is of significant value in detecting spam reviews. Some other works focused on detecting English spam reviews using Deep Neural Networks, such as (Barushka et al., 2019) and (Jain et al., 2019). (Barushka et al., 2019) used a content-based model that considers both bag-of-words and word context properties. They utilized N-grams and the skip-gram word embedding method. This work achieved an accuracy of 89.75%. (Jain et al., 2019) introduced two different models: a Multi-Instance Learning model (MIL) and a Convolutional Neural Network model (CNN-GRU). The MIL model is based on feeding different instances of the same training example to the same model, while the CNN-GRU model is based on extracting N-gram-like semantic features using a CNN and learning semantic dependencies among the features extracted by the CNN modules. The highest accuracy, 91.9%, was reported by CNN-GRU.

3. Proposed Approach
The proposed approach for detecting and classifying Arabic spam reviews is shown in Figure 1. It consists of three main modules: (I) Pre-processing, (II) Extraction Module, and (III) Spam Detection. Each of these modules is explained in the following subsections.

Figure 1 Overview of the Proposed Approach

3.1. Pre-processing

Pre-processing is accomplished to remove irrelevant parts of the data before extracting any feature. The pre-processing module consists of five consecutive steps: tokenization, non-Arabic text removal, normalization, stop words removal, and light stemming. These steps, which are applied first to the reviews’ text, are important to generate a pre-processed text ready for feature extraction and classification.
– Tokenization: splits the review’s text into a sequence of tokens, where each token represents a single word, based on the whitespace character.
– Non-Arabic text removal: checks all of the review’s tokens and removes any non-Arabic token.
– Normalization: produces a consistent form of the input text by converting the different forms of a word into a common form. In this step, the characters of each review token are checked to detect whether they are in their normalized form. Table 1 shows the way in which Arabic text normalization is performed.

Table 1 Arabic Text Normalization

Letters to Replace              Replaced with
ئ, ٸ, ى                         ي
ٳ, ٲ, ٱ, إ, أ, آ                 ا
ة                               ه
ٶ, ؤ, ۉ                         و
Diacritics (ً, ٌ, ٍ, َ, ُ, ِ)         Nothing (removed)
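As a minimal sketch (not the authors' implementation), the normalization mapping of Table 1 can be expressed in Python; the character map below mirrors a representative subset of the letter variants in the table:

```python
import re

# Letter variants mapped to a common form, following Table 1
# (illustrative subset; further Unicode variants exist for each letter).
NORMALIZATION_MAP = {
    "ئ": "ي", "ى": "ي",
    "آ": "ا", "أ": "ا", "إ": "ا", "ٱ": "ا",
    "ة": "ه",
    "ؤ": "و",
}
# Arabic diacritics (fathatan ... sukun) are removed entirely.
DIACRITICS = re.compile(r"[\u064B-\u0652]")

def normalize(token: str) -> str:
    """Strip diacritics and replace letter variants with a common form."""
    token = DIACRITICS.sub("", token)
    return "".join(NORMALIZATION_MAP.get(ch, ch) for ch in token)
```

For example, `normalize` maps the different alef forms to the bare alef and drops short-vowel marks, so differently typed spellings of the same word collapse to one form.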

– Stop words removal: removes meaningless words that occur frequently in the review’s text, which improves response time and reduces the index space. A list of 700 Arabic stop words is used; it includes words such as (إلي, من, كان, و, أو, علي, عن, في, كل, أمام, فقط, لقد, etc.).
– Light stemming: returns a word to its original form. For non-Arabic languages, a basic stem can be either prefixed or postfixed to express grammatical syntax. In Arabic, however, it is difficult to differentiate between some words after full stemming, because words sharing the same root can have completely different meanings. Table 2 shows an example of this Arabic stemming problem. As a result, light stemming is used to avoid this problem: a common set of prefixes and suffixes is cropped from a word without reducing the word to its root. Table 3 lists the prefixes and suffixes that are removed when applying Arabic light stemming.

Table 2 Example for Arabic Stemming Problem

Arabic Word    Meaning in English    Sentiment Score    Root
تلاعب           Jugglery              -1                 لعب
يلعب            Plays                 1                  لعب
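A light stemmer in the spirit described above can be sketched as follows; the affix lists mirror Table 3, and the minimum-length guard (keeping at least two characters) is an assumption, since the paper does not state one:

```python
# Light stemming: crop common prefixes/suffixes (as in Table 3) without
# reducing the word to its root. Lists are ordered longest-first so the
# most specific affix is stripped.
PREFIXES = ["وبال", "وكال", "وفال", "بال", "فال", "كال", "وال", "ولل", "لل", "ال"]
SUFFIXES = ["ها", "ان", "ون", "ات", "ين", "يه", "وا"]

def light_stem(word: str) -> str:
    """Remove at most one prefix and one suffix from the word."""
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= 2:
            word = word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= 2:
            word = word[:-len(s)]
            break
    return word
```

For example, a word carrying the definite article "ال" and the plural suffix "ون" is reduced to its light stem while words like "لعب"/"تلاعب" keep their distinct surface forms.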

Table 3 List of Prefixes and Suffixes removed in Arabic Light Stemming

Prefixes    ال, وال, بال, فال, لل, كال, ولل, وبال, وكال, وفال
Suffixes    ها, ان, ون, ات, ين, يه, وا

3.2. Extraction Module

The extraction module is an important module because the selection of an appropriate feature set plays a key role in classification (Arif et al., 2017). This module is composed of three processes: N-gram Feature Extraction, Negation Handling, and Content-based Feature Extraction.

3.2.1. N-gram Feature Extraction
N-gram is the most commonly used feature extraction method in text classification. An N-gram is a sequence of n contiguous words in a text. The most frequently used N-gram features are: Uni-gram (one word), Bi-gram (two contiguous words), and Tri-gram (three contiguous words). In the Arabic language, employing combined N-gram features rather than Uni-gram features alone is required for better feature representation, because of their role in completing the meaning and affecting the polarity of a statement, as shown in Table 4. Therefore, in the proposed design, different combinations of the three N-gram features are extracted for testing: Uni-gram features, Uni-gram along with Bi-gram features, and Uni-gram along with Bi-gram and Tri-gram features. Afterwards, the polarity of the extracted features is retrieved from a sentiment lexicon that encloses 17,000 words/phrases. Only the N-gram features that exist in the sentiment lexicon are considered.

Table 4 N-gram Features Examples with identified polarity

N-gram feature    Example              Polarity
Uni-gram          خيبة                  -1
Uni-gram          أمل                   1
Uni-gram          كبيرة                 1
Bi-gram           خيبة أمل              -1
Tri-gram          خيبة أمل كبيرة         -1
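The triple N-gram extraction with lexicon lookup can be sketched as below; the toy `SENTIMENT_LEXICON` stands in for the paper's 17,000-entry lexicon and its entries are taken from Table 4:

```python
# Toy stand-in for the 17,000-word/phrase sentiment lexicon
# (entries and polarities taken from Table 4).
SENTIMENT_LEXICON = {
    "خيبة": -1, "أمل": 1, "كبيرة": 1,
    "خيبة أمل": -1, "خيبة أمل كبيرة": -1,
}

def extract_ngram_features(tokens, max_n=3):
    """Extract Uni-, Bi- and Tri-grams; keep only lexicon hits.

    Returns (ngram, polarity, start_position) triples; the position is
    kept so a later step can check the preceding word for negation.
    """
    features = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n])
            if gram in SENTIMENT_LEXICON:
                features.append((gram, SENTIMENT_LEXICON[gram], i))
    return features
```

Running it on the Table 4 example sentence yields all five lexicon-matched N-grams, illustrating how the Bi-gram and Tri-gram reverse the polarity suggested by the individual words.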

3.2.2. Negation Handling
Arabic negation words have been taken into consideration for correct polarity classification. A list of 50 Arabic negation words, such as (ليس, ما, لا, لم, لن, مو, مش, etc.), is constructed. In this process, each extracted N-gram feature is checked to determine whether the word preceding it is a negation word. If the preceding word is a negation word, the polarity of the N-gram feature is reversed; otherwise, the polarity is retained as it is. Table 5 shows an example of negation handling.

Table 5 Negation Handling Example

Arabic Sentence       Polarity
الغرفة نظيفة            1
الغرفة ليست نظيفة       -1
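A minimal sketch of this polarity-reversal step, using a small subset of the 50-negator list, could look like this:

```python
# Small subset of the paper's 50-word Arabic negator list.
NEGATORS = {"ليس", "ليست", "ما", "لا", "لم", "لن", "مو", "مش"}

def apply_negation(tokens, features):
    """Reverse an N-gram feature's polarity when preceded by a negator.

    `features` holds (ngram, polarity, position) triples, where position
    is the index of the feature's first token in `tokens`.
    """
    handled = []
    for gram, polarity, pos in features:
        if pos > 0 and tokens[pos - 1] in NEGATORS:
            polarity = -polarity
        handled.append((gram, polarity))
    return handled
```

On the Table 5 example, the feature "نظيفة" (+1) preceded by "ليست" comes out with polarity -1, matching the reversed sentence polarity.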

3.2.3. Content-based Feature Extraction
This process constructs a set of content-based features: (i) Words count, (ii) Unique words percentage, and (iii) Review rating deviation. It depends on the previously extracted N-gram features and their handled polarity in order to count the numbers of negative and positive words correctly. The review sentiment score is then calculated from the positive and negative word counts. Next, the calculated sentiment score is compared with the review rating value given by the reviewer, which ranges from 1 to 5; this comparison finally gives the review rating deviation. Table 6 describes the extracted content-based features.

Table 6 Content-based features and their description

Content-based feature        Description
Words count                  Total number of words in a review
Unique words percentage      Percentage of unique words in a review
Review rating deviation      Difference between the calculated review’s sentiment score and the review’s rating given by the reviewer
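The three features can be sketched as follows. Note that the exact rule for turning the sentiment-score/rating comparison into a deviation flag is not spelled out in the paper; the sign-agreement test below is an assumption for illustration:

```python
def content_based_features(tokens, polarities, review_rating):
    """Compute the three content-based features of Table 6.

    tokens: pre-processed review tokens.
    polarities: negation-handled polarities (+1/-1) of lexicon-matched
        N-gram features.
    review_rating: the 1-5 star rating given by the reviewer.
    """
    words_count = len(tokens)
    unique_pct = len(set(tokens)) / max(words_count, 1) * 100
    sentiment_score = sum(polarities)  # positive minus negative counts
    # Assumed deviation rule: flag when the sentiment sign disagrees
    # with the rating (ratings of 3+ treated as positive).
    predicted_positive = sentiment_score > 0
    rated_positive = review_rating >= 3
    rating_deviation = predicted_positive != rated_positive
    return words_count, unique_pct, rating_deviation
```

These three values are exactly what the spam-detection methods in Section 3.3 consume.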

3.3. Spam Detection

In this module, four different methods are introduced for testing: (I) Rule-based Classifier, (II) Classical Machine Learning Classifier, (III) Majority Voting Ensemble Classifier, and (IV) Stacking Ensemble Classifier. These methods use the previously extracted content-based features to classify a review as a spam review or as a truthful review. In the following subsections, these four methods are explained.

3.3.1. Rule-based Classifier
In this classification method, a set of rules that depends on the content-based features (words count, unique words percentage, and review rating deviation) is constructed. In these rules, two threshold values are defined: one for the words count and one for the unique words percentage. For each review, the number of words is counted, the percentage of unique words is calculated, and both are compared to the defined threshold values. On the basis of these comparisons, along with the review rating deviation, the review is classified as a spam review or as a truthful review. Equation 1 shows how the review is classified:

RL = { Truthful,  if CW < TCW and PUW > TPUW and RRD = False
     { Spam,      if CW ≥ TCW or PUW ≤ TPUW or RRD = True          (1)

Where:
RL: Review label
CW: Words count
PUW: Unique words percentage
TCW: Words count threshold value
TPUW: Unique words percentage threshold value
RRD: Review rating deviation

3.3.2. Machine Learning Classifiers
The classification of the reviews as spam or truthful is performed using several machine learning classifiers: Decision Tree, Naive Bayes, Logistic Regression, Support Vector Machine, K-Means, K-Nearest Neighbor, Bagging, Boosting, Random Forest and Neural Networks.

3.3.3. Majority Voting Ensemble Classifier
A majority voting ensemble aggregates the decisions made by both the rule-based classifier and the machine learning classifiers when applying them together in a parallel manner. The final decision is made by taking the prediction with the highest number of votes from the multiple predicting models. Based on this final decision, the review is classified as a spam or a truthful review. In the scope of our design, all classical machine learning classifiers are employed in the voting ensemble. Figure 2 shows an overview of the majority voting ensemble classifier.

Figure 2 Overview of the Majority Voting Ensemble Classifier
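The rule of Equation 1 and the majority-vote aggregation can be sketched together; the default thresholds (170 words, 75%) are the tuned values reported later in Section 4.3.1:

```python
from collections import Counter

def rule_based_label(words_count, unique_pct, rating_deviation,
                     t_cw=170, t_puw=75.0):
    """Equation (1): Truthful only if the review is short enough,
    lexically diverse enough, and shows no rating deviation."""
    if words_count < t_cw and unique_pct > t_puw and not rating_deviation:
        return "truthful"
    return "spam"

def majority_vote(labels):
    """Final ensemble decision: the most common label among all voters
    (the rule-based classifier plus the machine learning classifiers)."""
    return Counter(labels).most_common(1)[0][0]
```

In use, the rule-based label is appended to the list of per-classifier predictions for a review, and `majority_vote` returns the ensemble's class.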

3.3.4. Stacking Ensemble Classifier
Stacking is an ensemble method that resembles majority voting; however, the rule-based classifier and the machine learning classifiers are executed sequentially rather than in parallel. In this method, the rule-based classifier is first executed to classify the reviews as spam or truthful. Second, the output of the rule-based classifier is compared with the original dataset labels to divide the reviews into a set of correctly classified reviews and a set of incorrectly classified reviews. Finally, the set of correctly classified reviews is used to train the machine learning classifier, while the set of incorrectly classified reviews is used as a test set for the machine learning classifier. Figure 3 shows an overview of the stacking ensemble classifier.

Figure 3 Overview of the Stacking Ensemble Classifier
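The sequential flow described above can be sketched as follows; `rule_clf` and `ml_clf` are placeholders (a callable and a fit/predict-style estimator are assumed), not the authors' actual components:

```python
def stacking_ensemble(reviews, labels, rule_clf, ml_clf):
    """Sequential stacking as described in Section 3.3.4.

    1. Run the rule-based classifier on every review.
    2. Split reviews by whether the rule-based label matched the gold label.
    3. Train the ML classifier on the correctly classified set and use it
       to re-classify the set the rules got wrong.
    """
    rule_preds = [rule_clf(r) for r in reviews]
    correct = [(r, y) for r, y, p in zip(reviews, labels, rule_preds) if p == y]
    incorrect = [r for r, y, p in zip(reviews, labels, rule_preds) if p != y]
    ml_clf.fit([r for r, _ in correct], [y for _, y in correct])
    return ml_clf.predict(incorrect)
```

Any classifier exposing `fit`/`predict` (e.g. a scikit-learn estimator over the content-based feature vectors) can play the role of `ml_clf` here.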

4. Experimentation, Evaluation and Discussion
This section includes the experimental results and the evaluation of the proposed architecture in Figure 1, including the four Arabic spam opinion detection methods mentioned above.

4.1. Datasets

Two different publicly available opinion datasets are used. The first is the Deceptive Opinion Spam Corpus (DOSC), created by (Ott et al., 2011). The dataset contains 1,600 opinion reviews in English about 20 popular hotels in Chicago, and was translated from English to Arabic with the help of expert Arabic language translators. The dataset is divided into two groups: 800 truthful reviews and 800 spam reviews. The second dataset is the Hotel Arabic Reviews Dataset (HARD), created by (Elnagar et al., 2018). This dataset is larger, consisting of 94,052 opinion reviews in Arabic about 1,858 hotels, collected from Booking.com. The reviews are expressed in both Modern Standard Arabic and dialectal Arabic. This dataset is unfortunately not labeled as truthful and spam reviews; therefore, to be usable for testing and verification, it had to be labeled. Accordingly, the already-labeled DOSC dataset has been used as a training set for the designed majority voting ensemble classifier to assign class labels to HARD. The majority voting ensemble classifier was selected based on its performance, as it outperformed all the standalone machine learning classifiers when experimenting with the DOSC dataset, as detailed in the upcoming Section 4.3.

4.2. Evaluation Metrics

Different evaluation metrics are used to measure the performance of the four proposed Arabic spam opinion detection methods. These include accuracy, recall (sensitivity), specificity, precision and F1 score (Powers, 2011).

Accuracy is the ability of an approach to differentiate spam and truthful reviews correctly. It is measured by calculating the proportion of true positives and true negatives among all evaluated cases.

Accuracy = (TP + TN) / (TP + FP + TN + FN)    (2)

Recall measures the proportion of spam reviews correctly identified. It shows how good an approach is at detecting spam reviews.

Recall = TP / (TP + FN)    (3)

Specificity measures the proportion of truthful reviews correctly identified, showing how good an approach is at avoiding false alarms.

Specificity = TN / (TN + FP)    (4)

Precision measures how many of the reviews classified as spam are correctly predicted.

Precision = TP / (TP + FP)    (5)

F1 score is the weighted average of recall and precision.

F1 Score = 2TP / (2TP + FP + FN)    (6)
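Equations (2)-(6) translate directly into code from the confusion-matrix counts (TP, TN, FP, FN, defined below):

```python
def evaluation_metrics(tp, tn, fp, fn):
    """Compute accuracy, recall, specificity, precision and F1 score
    (Equations 2-6) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return accuracy, recall, specificity, precision, f1
```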

Where:
True Positive (TP): number of correctly predicted spam reviews, i.e., the actual class is spam and the predicted class is also spam.
True Negative (TN): number of correctly predicted truthful reviews, i.e., the actual class is truthful and the predicted class is also truthful.
False Positive (FP): number of reviews incorrectly predicted as spam, i.e., the actual class is truthful but the predicted class is spam.
False Negative (FN): number of reviews incorrectly predicted as truthful, i.e., the actual class is spam but the predicted class is truthful.

4.3. Experimental Results

Four experiments have been conducted to measure the performance of the four proposed methods, while taking into consideration the effect of the employed extraction module; in more specific terms, the experiments aimed at testing the impact of N-gram feature extraction and negation handling on the detection of Arabic spam reviews. This section shows the experiments for the two datasets: DOSC and HARD.

4.3.1. Experiment 1: Rule-based Classifier

This experiment consists of two parts: evaluating the effect of N-gram features, and assessing the effect of negation handling on the performance of the rule-based classifier. Using equation 1, preliminary trials were carried out to tune the threshold values for detecting Arabic spam reviews in the two considered datasets. Based on these trials, the two thresholds in equation 1 were set to 170 words for the word-count threshold and 75% for the unique-words-percentage threshold. To evaluate the effect of N-gram features, different combinations (Uni-gram only; Uni-gram with Bi-gram; and Uni-gram with Bi-gram and Tri-gram) were examined. As shown in Table 7, with Uni-gram features only, the classifier reached accuracies of 64.00% and 65.23% on the DOSC and HARD datasets respectively. With Uni-gram and Bi-gram features, accuracy rose to 64.88% and 67.26% respectively. The maximum accuracy was achieved with the combination of all three N-gram features (Uni-gram, Bi-gram and Tri-gram), reaching 66.63% and 71.45% respectively. The triple combination therefore improves the classifier accuracy by 2.63% and 6.22% for DOSC and HARD respectively. Table 7 also shows that precision, specificity and F1 score increase slightly with the triple combination on the DOSC dataset, while recall, precision, specificity and F1 score increase markedly on the HARD dataset.

Table 7 Evaluation of N-gram features extraction on the performance of the rule-based classifier in the detection of Arabic spam reviews

Classifier   Dataset  N-gram feature                  Accuracy  Recall   Precision  Specificity  F1 Score
Rule-based   DOSC     Uni-gram                        64.00%    69.25%   62.67%     58.75%       65.80%
                      Uni-gram + Bi-gram              64.88%    66.00%   64.55%     63.75%       65.27%
                      Uni-gram + Bi-gram + Tri-gram   66.63%    65.75%   66.92%     67.50%       66.33%
             HARD     Uni-gram                        65.23%    57.59%   70.92%     73.73%       63.56%
                      Uni-gram + Bi-gram              67.26%    58.07%   74.15%     77.48%       65.14%
                      Uni-gram + Bi-gram + Tri-gram   71.45%    59.04%   81.67%     85.25%       68.53%
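The two tuned thresholds lend themselves to a simple length-and-diversity rule. The sketch below is an illustrative Python rendering, not the paper's exact equation 1: the way the word-count and unique-words tests are combined (here, a conjunction) is an assumption.

```python
def looks_spam_by_length(review: str,
                         words_count_threshold: int = 170,
                         unique_words_threshold: float = 0.75) -> bool:
    """Illustrative rule: flag a review whose word count exceeds the tuned
    threshold (170) while its unique-word ratio falls below the tuned
    percentage (75%). The exact combination used in equation 1 may differ."""
    tokens = review.split()
    if not tokens:
        return False
    unique_ratio = len(set(tokens)) / len(tokens)
    return len(tokens) > words_count_threshold and unique_ratio < unique_words_threshold
```

In this sketch a long, highly repetitive review is flagged while a short review passes, matching the intuition that spam reviews tend to pad length with a limited vocabulary.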

To assess the effect of negation handling, the results obtained in the first part of this experiment (before negation handling) were compared with those obtained after applying negation handling, in both cases using the triple combination of N-gram features. As shown in Table 8, on the DOSC and HARD datasets the classifier obtained accuracies of 66.63% and 71.45% respectively before handling negation, and 82.38% and 98.35% respectively after handling negation. These results reveal that negation handling combined with the triple N-gram features has a significant impact on classification, improving accuracy by 15.75% and 26.90% for the two datasets. Table 8 also shows a remarkable increase in recall, precision, specificity and F1 score for both datasets.

Table 8 Effect of negation handling on the performance of the rule-based classifier in the detection of Arabic spam reviews (N: effect is not included; Y: effect is included)

Classifier   Dataset  Negation Handling  Accuracy  Recall   Precision  Specificity  F1 Score
Rule-based   DOSC     N                  66.63%    65.75%   66.92%     67.50%       66.33%
                      Y                  82.38%    68.00%   95.44%     96.75%       79.42%
             HARD     N                  71.45%    59.04%   81.67%     85.25%       68.53%
                      Y                  98.35%    97.84%   99.99%     99.97%       98.90%
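Negation handling of the kind evaluated above is commonly implemented by fusing each negation particle with the word that follows it, so the negated sense survives tokenization and N-gram extraction. The particle list below is a hypothetical example for illustration; the paper's actual list is not reproduced here.

```python
# Hypothetical Arabic negation particles; the paper's actual list may differ.
NEGATION_WORDS = {"لا", "ليس", "لم", "لن", "ما", "غير"}

def handle_negation(tokens):
    """Fuse each negation particle with the word that follows it, so that,
    e.g., 'لا' + 'أنصح' ("I do not recommend") becomes the single token
    'لا_أنصح' and keeps its negated sense through N-gram extraction."""
    out, i = [], 0
    while i < len(tokens):
        if tokens[i] in NEGATION_WORDS and i + 1 < len(tokens):
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out
```

Applied before feature extraction, this keeps "recommend" and "do-not-recommend" as distinct features, which is one plausible explanation for the accuracy gains reported in Table 8.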

4.3.2. Experiment 2: Machine Learning Classifiers

As in the previous experiment with the rule-based classifier, the machine learning classifiers are tested in two parts: evaluating the effect of N-gram features, and assessing the effect of negation handling. The same conclusion holds, with the best accuracy achieved using the triple combination of the N-gram feature representation. Before negation handling, the unsupervised machine learning classifier K-means obtained accuracies of 47.13% and 72.73% on the DOSC and HARD datasets respectively, while the supervised machine learning classifiers (Decision Tree, Naive Bayes, Logistic Regression, Support Vector Machine, K-Nearest Neighbor, Bagging, Boosting, Random Forest and Neural Networks) achieved accuracies between 50.88% and 55.25% on DOSC and between 61.30% and 91.51% on HARD. After negation handling, K-means obtained accuracies of 52.88% and 84.97% respectively, while the supervised classifiers achieved accuracies between 52.88% and 72.50% on DOSC and between 75.41% and 96.71% on HARD, as shown in Table 9. Negation handling thus increases accuracy by 2.00−20.73% and 5.20−24.05% for DOSC and HARD respectively. Table 9 also shows that, among the supervised machine learning classifiers, the best accuracy is obtained with Decision Tree on the DOSC dataset and with Boosting on the HARD dataset.

Table 9 Effect of negation handling on the performance of different machine learning classifiers in the detection of Arabic spam reviews (N: effect is not included; Y: effect is included)

Dataset  Classifier               Negation Handling  Accuracy  Recall   Precision  Specificity  F1 Score
DOSC     K-Means                  N                  47.13%    28.50%   45.42%     65.75%       35.02%
                                  Y                  52.88%    38.50%   54.04%     67.25%       44.96%
         K-Nearest Neighbor       N                  50.88%    62.50%   50.71%     39.25%       55.99%
                                  Y                  52.88%    66.25%   52.27%     39.50%       58.43%
         Naive Bayes              N                  52.25%    66.75%   51.74%     37.75%       58.30%
                                  Y                  57.38%    70.00%   55.89%     44.75%       62.15%
         Random Forest            N                  51.13%    58.75%   50.98%     43.50%       54.59%
                                  Y                  59.75%    66.25%   58.63%     53.25%       62.21%
         Bagging                  N                  51.25%    58.25%   51.10%     44.25%       54.44%
                                  Y                  60.13%    68.50%   58.67%     51.75%       63.21%
         Boosting                 N                  52.38%    59.00%   52.10%     45.75%       55.33%
                                  Y                  61.00%    68.75%   59.52%     53.25%       63.81%
         Support Vector Machine   N                  52.50%    62.50%   52.08%     42.50%       56.82%
                                  Y                  64.38%    71.75%   62.53%     57.00%       66.82%
         Logistic Regression      N                  55.25%    83.75%   53.34%     26.75%       65.18%
                                  Y                  65.38%    70.50%   63.95%     60.25%       67.06%
         Neural Networks          N                  52.00%    62.25%   51.66%     41.75%       56.46%
                                  Y                  72.13%    67.75%   74.25%     67.75%       70.85%
         Decision Tree            N                  51.77%    51.02%   51.28%     52.50%       51.15%
                                  Y                  72.50%    65.50%   76.16%     79.50%       70.43%
HARD     K-Means                  N                  72.73%    77.54%   88.13%     49.49%       82.50%
                                  Y                  84.97%    89.13%   90.93%     71.75%       90.02%
         Naive Bayes              N                  61.30%    53.35%   99.91%     99.76%       69.56%
                                  Y                  75.41%    70.22%   96.50%     91.92%       81.29%
         Logistic Regression      N                  69.35%    63.19%   99.72%     99.14%       77.36%
                                  Y                  80.51%    83.28%   90.34%     71.71%       86.67%
         Decision Tree            N                  81.12%    88.92%   86.61%     56.32%       87.75%
                                  Y                  92.73%    98.87%   92.14%     73.20%       95.39%
         Support Vector Machine   N                  73.65%    78.65%   85.55%     57.80%       81.95%
                                  Y                  93.70%    92.97%   98.68%     96.03%       95.74%
         Neural Networks          N                  84.86%    90.90%   89.38%     65.67%       90.13%
                                  Y                  93.80%    99.97%   92.49%     74.21%       96.08%
         K-Nearest Neighbor       N                  69.91%    74.92%   83.81%     54.01%       79.11%
                                  Y                  93.96%    99.08%   93.38%     77.68%       96.14%
         Random Forest            N                  82.25%    84.39%   93.56%     71.91%       88.74%
                                  Y                  93.99%    99.71%   92.91%     75.81%       96.19%
         Bagging                  N                  89.25%    95.88%   91.55%     57.18%       93.67%
                                  Y                  94.77%    99.89%   93.66%     78.53%       96.67%
         Boosting                 N                  91.51%    90.16%   99.54%     98.01%       94.62%
                                  Y                  96.71%    99.73%   96.09%     87.11%       97.87%
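A pipeline in the spirit of the one evaluated above, with uni-, bi- and tri-gram counts feeding a classical classifier, can be sketched with scikit-learn. The toy data and the choice of Multinomial Naive Bayes below are illustrative assumptions, not the authors' exact setup.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data standing in for the DOSC/HARD reviews (1 = spam, 0 = truthful).
train_texts = ["great great great hotel buy now",
               "clean room friendly staff",
               "click this link best deal deal deal",
               "quiet location helpful reception"]
train_labels = [1, 0, 1, 0]

# ngram_range=(1, 3) mirrors the triple N-gram combination
# (Uni-gram + Bi-gram + Tri-gram) used in the experiments.
model = make_pipeline(CountVectorizer(ngram_range=(1, 3)), MultinomialNB())
model.fit(train_texts, train_labels)
preds = model.predict(["click this link deal", "friendly staff clean room"])
```

Swapping `MultinomialNB` for any of the other supervised classifiers in Table 9 changes only the second pipeline stage; the N-gram feature extraction stays identical.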

From the two experiments above, it is concluded that negation handling together with the triple combination of N-gram features as the feature extraction method has a positive impact on the performance of the proposed approach. Accordingly, in the two remaining methods and experiments, negation handling and the triple combination of N-gram features are applied directly.

4.3.3. Experiment 3: Majority Voting Ensemble Classifier

On the DOSC and HARD datasets, the majority voting ensemble classifier achieved accuracies of 74.38% and 98.21% respectively, as shown in Table 10. These results reveal that the majority voting ensemble classifier outperforms all the standalone machine learning classifiers: its accuracy is 1.88% and 1.5% higher than that achieved by the best machine learning classifier (Decision Tree for the DOSC dataset and Boosting for the HARD dataset).

Table 10 Performance of the majority voting ensemble classifier in the detection of Arabic spam reviews

Classifier                Dataset  Accuracy  Recall   Precision  Specificity  F1 Score
Majority Voting Ensemble  DOSC     74.38%    65.50%   79.64%     83.25%       71.88%
                          HARD     98.21%    99.57%   98.10%     93.88%       98.83%
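Majority (hard) voting of the kind used here can be sketched with scikit-learn's VotingClassifier: each base learner casts one vote per review and the majority label wins. The base learners and toy numeric features below are illustrative assumptions standing in for vectorized review texts.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Toy 2-D features standing in for vectorized reviews (1 = spam, 0 = truthful).
X = np.array([[0, 1], [1, 1], [2, 0], [3, 0], [0, 2], [3, 1]])
y = np.array([0, 0, 1, 1, 0, 1])

# Hard voting: each base classifier votes, the majority label is returned.
vote = VotingClassifier(
    estimators=[("lr", LogisticRegression()),
                ("rf", RandomForestClassifier(n_estimators=25, random_state=0)),
                ("nb", GaussianNB())],
    voting="hard")
vote.fit(X, y)
preds = vote.predict([[0, 2], [3, 0]])
```

With an odd number of base learners there is always a strict majority, which is why voting ensembles are usually built from an odd-sized pool of classifiers.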

4.3.4. Experiment 4: Stacking Ensemble Classifier

On the DOSC and HARD datasets, the stacking ensemble classifier achieves maximum accuracies of 95.25% and 99.98% respectively, obtained by integrating the outputs of the rule-based classifier with the K-means classifier. Table 11 shows that the stacking ensemble classifier outperforms all three previous methods (the rule-based classifier, the machine learning classifiers and the majority voting ensemble classifier), with accuracy improvements in the range 12.87−22.75% for the DOSC dataset and 1.63−3.27% for the HARD dataset.

Table 11 Performance of the stacking ensemble classifier in the detection of Arabic spam reviews

Dataset  Classifier                            Accuracy  Recall   Precision  Specificity  F1 Score
DOSC     Rule-based + Naive Bayes              79.06%    68.45%   87.58%     90.00%       76.84%
         Rule-based + Decision Tree            80.13%    69.00%   88.75%     91.25%       77.64%
         Rule-based + Logistic Regression      82.00%    69.00%   93.24%     95.00%       79.31%
         Rule-based + Support Vector Machine   83.88%    68.00%   99.63%     99.75%       80.83%
         Rule-based + Boosting                 85.25%    72.25%   97.64%     98.25%       83.05%
         Rule-based + Random Forest            85.63%    72.75%   97.98%     98.50%       83.50%
         Rule-based + Neural Network           85.75%    72.25%   98.97%     99.25%       83.53%
         Rule-based + Bagging                  85.75%    73.00%   97.99%     98.50%       83.67%
         Rule-based + K-Nearest Neighbor       86.25%    74.00%   98.01%     98.50%       84.33%
         Rule-based + K-Means                  95.25%    91.75%   98.66%     98.75%       95.08%
HARD     Rule-based + Naive Bayes              99.17%    98.91%   99.99%     99.98%       99.45%
         Rule-based + Support Vector Machine   99.53%    99.40%   99.99%     99.98%       99.69%
         Rule-based + K-Nearest Neighbor       99.90%    99.88%   99.99%     99.97%       99.94%
         Rule-based + Random Forest            99.92%    99.90%   99.99%     99.97%       99.95%
         Rule-based + Logistic Regression      99.93%    99.91%   99.99%     99.97%       99.95%
         Rule-based + Bagging                  99.94%    99.93%   99.99%     99.97%       99.96%
         Rule-based + Decision Tree            99.95%    99.94%   99.99%     99.97%       99.97%
         Rule-based + Neural Network           99.96%    99.96%   99.99%     99.97%       99.98%
         Rule-based + Boosting                 99.97%    99.97%   99.99%     99.97%       99.98%
         Rule-based + K-Means                  99.98%    99.98%   99.99%     99.97%       99.98%
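The best-performing configuration, stacking the rule-based classifier with K-means, can be sketched manually: the level-0 outputs (the unsupervised cluster id and the rule-based decision) become the input features of a level-1 meta-classifier. The placeholder rule, toy data and choice of Logistic Regression as the meta-learner below are assumptions for illustration, not the paper's implementation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def rule_based_score(tokens):
    # Placeholder for the paper's rule-based classifier output (1 = spam-like).
    return int(len(tokens) > 170 or
               len(set(tokens)) / max(len(tokens), 1) < 0.75)

# Toy feature vectors standing in for N-gram representations of six reviews.
X = np.array([[5., 0.], [4., 1.], [0., 5.], [1., 4.], [5., 1.], [0., 4.]])
y = np.array([1, 1, 0, 0, 1, 0])
token_lists = [["deal"] * 200, ["deal"] * 180, ["nice", "room", "staff"],
               ["calm", "clean", "area"], ["buy"] * 190, ["good", "food", "view"]]

# Level-0 outputs: K-means cluster id plus the rule-based decision.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
level0 = np.column_stack([km.labels_, [rule_based_score(t) for t in token_lists]])

# Level-1 meta-classifier learns how to combine the two base outputs.
meta = LogisticRegression().fit(level0, y)
```

In a real deployment the meta-classifier would be trained on held-out level-0 predictions to avoid leakage; this sketch fits on the training outputs only to keep the structure visible.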

To assess the scalability of the stacking ensemble classifier, subsets of different sizes are drawn from the HARD dataset to study the effect of data size on classification accuracy. The sizes range from 1K to 90K samples with an increment step of 10K. Figure 4 shows the results of this experiment.

Figure 4 Accuracy of the proposed stacking ensemble classifier with respect to varying data sizes

As Figure 4 indicates, the accuracy increases roughly linearly, at a rate of 0.5% per step, up to 60K samples, within the accuracy range 96−99.5%. The curve then rises more slowly, reaching 99.87% at the maximum corpus size of 90K. These results indicate that the stacking classifier remains suitable as the dataset size grows.

4.4. Analysis and Discussion

The experiments presented in the previous section show that the accuracy of Arabic spam review detection improves when negation handling is combined with the triple combination of N-gram features. This improvement, between 5% and 27%, holds across the different classifiers presented (rule-based, classical machine learning, and hybrid methods). It reflects an increase in both the number of true positives (reviews correctly classified as spam) and the number of true negatives (reviews correctly classified as truthful), i.e. a more accurate classification. Of the four proposed classification methods, the stacking ensemble classifier that combines the K-means classifier with the rule-based classifier performed best. Its performance exceeds that of the other methods by an approximate minimum margin of 12% on the DOSC dataset: the accuracy gains of the stacking ensemble over the rule-based classifier, the majority voting ensemble classifier and the best machine learning classifier are 12.87%, 20.87% and 22.75% respectively, whereas these gains are 1.63%, 1.77% and 3.27% respectively on the HARD dataset. For the two experimented datasets, Figure 5 compares the four spam review detection methods, while Table 12 summarizes the results obtained with the stacking ensemble classifier.

Table 12 Performance summary for stacking ensemble classifier in the detection of Arabic spam reviews

Classifier         Dataset  Accuracy  Recall   Precision  Specificity  F1 Score
Stacking Ensemble  DOSC     95.25%    91.75%   98.66%     98.75%       95.08%
                   HARD     99.98%    99.98%   99.99%     99.97%       99.98%
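The five measures reported throughout the evaluation derive from the four confusion-matrix counts mentioned above (true/false positives and negatives). A minimal helper makes the definitions explicit:

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, recall, precision, specificity and F1 score from the four
    confusion-matrix counts (tp = spam correctly flagged, tn = truthful
    correctly kept, fp/fn = the corresponding errors)."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    recall = tp / (tp + fn)            # true-positive rate on spam reviews
    precision = tp / (tp + fp)
    specificity = tn / (tn + fp)       # true-negative rate on truthful reviews
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, recall, precision, specificity, f1
```

Raising both tp and tn, as negation handling does here, lifts accuracy directly, while recall and specificity track the two error types separately.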

Figure 5 Comparing the accuracy of the four spam review detection methods for DOSC and HARD datasets

To validate the proposed approach, the stacking ensemble has been compared against related works. Since no directly comparable Arabic works are available, the comparison is performed against three state-of-the-art methods presented for English spam review detection. These three methods, together with the stacking ensemble approach, are evaluated on the Arabic DOSC dataset of 1.6K reviews. Figure 6 illustrates the results of this comparison: the proposed stacking ensemble approach yields better results in the detection of Arabic spam reviews, outperforming the state-of-the-art results by at least 25.87%.

Figure 6 Comparison between the stacking ensemble approach and some related works on the DOSC dataset

5. Conclusion

This paper investigated four classification methods for spam detection in Arabic opinion texts: a rule-based classifier, classical machine learning classifiers, a majority voting ensemble and a stacking ensemble. N-gram features and negation handling were proposed for feature representation and extraction. The performance of the four methods was verified on two datasets, DOSC and HARD, and across different data sizes. The experimental results showed that the performance of all four methods improves when negation handling is combined with the triple combination of N-gram features. The rule-based classifier outperforms the machine learning classifiers, and the stacking ensemble classifier that combines the K-means classifier with the rule-based classifier outperforms all of the other three methods. The proposed stacking ensemble classifier has also been compared against related works, showing an increase of almost 28% in the accuracy of detecting Arabic spam reviews. As future work, deep learning methods will be studied and compared for enhancing Arabic spam review detection.

References

Al-Radaideh, Q. A., & Al-Abrat, M. A. (2018). An Arabic text categorization approach using term weighting and multiple reducts. Soft Computing, 1–15.
Alorini, D., & Rawat, D. B. (2019, February). Automatic spam detection on Gulf dialectical Arabic tweets. In 2019 International Conference on Computing, Networking and Communications (ICNC) (pp. 448–452). IEEE.
Alzanin, S. M., & Azmi, A. M. (2019). Rumor detection in Arabic tweets using semi-supervised and unsupervised expectation–maximization. Knowledge-Based Systems, 104945.
Arif, M. H., Li, J., Iqbal, M., & Liu, K. (2017). Sentiment analysis and spam detection in short informal text using learning classifier systems. Soft Computing, 1–11.
Barushka, A., & Hajek, P. (2019, May). Review spam detection using word embeddings and deep neural networks. In IFIP International Conference on Artificial Intelligence Applications and Innovations (pp. 340–350). Springer, Cham.
Elnagar, A., Khalifa, Y. S., & Einea, A. (2018). Hotel Arabic-reviews dataset construction for sentiment analysis applications. In Shaalan, K., Hassanien, A., & Tolba, F. (Eds.), Intelligent Natural Language Processing: Trends and Applications. Studies in Computational Intelligence, vol. 740 (pp. 35–52). Springer International Publishing.
Hammad, A. A., & El-Halees, A. (2013). An approach for detecting spam in Arabic opinion reviews. The International Arab Journal of Information Technology, 12(1), 9–16.
Hassan, R., & Islam, M. R. (2019, February). Detection of fake online reviews using semi-supervised and supervised learning. In 2019 International Conference on Electrical, Computer and Communication Engineering (ECCE) (pp. 1–5). IEEE.
Ismail, S., Alsammak, A., & Elshishtawy, T. (2016). A generic approach for extracting aspects and opinions of Arabic reviews. In Proceedings of the 10th International Conference on Informatics and Systems (pp. 173–179).
Jain, N., Kumar, A., Singh, S., Singh, C., & Tripathi, S. (2019, June). Deceptive reviews detection using deep learning techniques. In International Conference on Applications of Natural Language to Information Systems (pp. 79–91). Springer, Cham.
Jardaneh, G., Abdelhaq, H., Buzz, M., & Johnson, D. (2019, April). Classifying Arabic tweets based on credibility using content and user features. In 2019 IEEE Jordan International Joint Conference on Electrical Engineering and Information Technology (JEEIT) (pp. 596–601). IEEE.
Kumar, K. A., Anil, B., Kumar, U. R., Anand, C., & Aniruddha, S. (2018). Effective approaches for classification and rating of users reviews. In Proceedings of International Conference on Cognition and Recognition (pp. 1–9).
Mani, S., Kumari, S., Jain, A., & Kumar, P. (2018). Spam review detection using ensemble machine learning. In International Conference on Machine Learning and Data Mining in Pattern Recognition (pp. 198–209).
Narayan, R., Rout, J. K., & Jena, S. K. (2018). Review spam detection using opinion mining. In Progress in Intelligent Computing Techniques: Theory, Practice, and Applications (pp. 273–279). Springer.
Ott, M., Choi, Y., Cardie, C., & Hancock, J. T. (2011). Finding deceptive opinion spam by any stretch of the imagination. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (Vol. 1, pp. 309–319).
Powers, D. M. W. (2011). Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. International Journal of Machine Learning Technology, 2, 37–63.
Sabbeh, S. F., & Baatwah, S. Y. (2018). Arabic news credibility on Twitter: an enhanced model using hybrid features. Journal of Theoretical & Applied Information Technology, 96(8).
Saeed, N. M., Helal, N. A., Badr, N. L., & Gharib, T. F. (2018, December). The impact of spam reviews on feature-based sentiment analysis. In 2018 13th International Conference on Computer Engineering and Systems (ICCES) (pp. 633–639). IEEE.
Saumya, S., & Singh, J. P. (2018). Detection of spam reviews: a sentiment analysis approach. CSI Transactions on ICT, 6(2), 137–148.
Tartir, S., & Abdul-Nabi, I. (2017). Semantic sentiment analysis in Arabic social media. Journal of King Saud University - Computer and Information Sciences, 29(2), 229–233.
Touahri, I., & Mazroui, A. (2019). Studying the effect of characteristic vector alteration on Arabic sentiment classification. Journal of King Saud University - Computer and Information Sciences.
Wahsheh, H. A., Al-Kabi, M. N., & Alsmadi, I. M. (2013). SPAR: a system to detect spam in Arabic opinions. In Applied Electrical Engineering and Computing Technologies (AEECT) (pp. 1–6).