The Social Relation Key: A new paradigm for security


Information Systems 71 (2017) 68–77

Contents lists available at ScienceDirect

Information Systems journal homepage: www.elsevier.com/locate/is

Sihyun Jeong, Jaehoon Lee, Junhyun Park, Chong-kwon Kim∗

Dept. of Computer Science and Engineering, Seoul National University, Gwanak-gu, Seoul 151–744, Republic of Korea

Article info

Article history: Received 23 June 2017; Revised 7 July 2017; Accepted 7 July 2017; Available online 18 July 2017

Keywords: Online social network; Security key; SMS; Twitter; Spam; Authentication

Abstract

For the last decade, online social networking services have consistently shown explosive annual growth, and have become some of the most widely used applications and services. Large amounts of social relation information accumulate on these platforms, and advanced services, such as targeted advertising and viral marketing, have been introduced to exploit this social information. Although many prior social relation-based services have been commerce oriented, we propose employing social relations to improve online security. Specifically, we argue that real social networks possess unique characteristics that are difficult to imitate with random or artificial networks. Also, the social relations of each individual are unique, like a fingerprint or an iris. These observations lead to the Social Relation Key (SRK) concept. We applied the SRK concept to two real-world use cases: detecting spam SMS messages and pinpointing fraudulent Twitter followers. Since spammers multicast the same SMS to multiple, randomly selected receivers, whereas normal users multicast an SMS to friends or acquaintances who know each other, we devise a detection scheme that makes use of the clustering coefficient. We conducted a large-scale experiment using an SMS log obtained from a major cellular network operator in Korea, and observed that the proposed scheme performs significantly better than conventional content-based Naive Bayesian Filtering (NBF). To detect fraudulent Twitter followers, we use different social network signatures, namely isomorphic triadic counts and the property of social status. An experiment based on a Twitter dataset again confirmed the feasibility of the SRK. Our code is available online.1

© 2017 Published by Elsevier Ltd.

1. Introduction

Over the last decade, we have witnessed explosive growth in online social networking services (SNS), such as Facebook, Twitter, Ren-Ren, and Sina Weibo. These started as microblogs or online replicas of off-line social interaction; since then, social networking services and social media have rapidly changed the landscape of Internet use. Social media services are now a fast artery of news propagation. Furthermore, many online systems, such as audio/video content providers, question-and-answer sites, review and rating systems, and e-commerce systems, increasingly support social media to provide product information or to improve customer retention. Before the proliferation of social media, telephone and cellular companies were the primary points at which social relation information would accumulate, but now many companies that support online social interactions can collect an abundance of social relation information, which opens unprecedented opportunities to develop novel services.

The everyday lives of people increasingly depend on online access to information and services. Many first hear news through friends (or followers) rather than through mass media [1], and we frequently consult online reviews and ratings before making purchase decisions [2]. Even though online systems certainly make our lives more efficient and convenient, they are not without drawbacks. One drawback is online security. Online security is susceptible to a number of problems, and as a rule of thumb, attacks evolve as new online services are introduced. Because people often first hear news from unverified sources, such as Twitter, attackers propagate false rumors, some of which are maliciously fabricated. In addition, many users consult and trust online information, including reviews and ratings, so systematic campaigns intended to promote or undermine the reputations of certain products, affiliations, and public figures have been increasingly carried out. In contrast to earlier attacks, which were performed rather naively and for amusement, recent attacks and information distortion are motivated by monetary gain, and are carried out by skilled and malicious offenders who overwhelm average security operators.

∗ Corresponding author. E-mail addresses: [email protected] (S. Jeong), [email protected] (J. Lee), [email protected] (J. Park), [email protected], [email protected] (C.-k. Kim).
1 https://github.com/sihyunj/TSP-SS-SRK
http://dx.doi.org/10.1016/j.is.2017.07.003
0306-4379/© 2017 Published by Elsevier Ltd.


A plethora of methods have been proposed to detect and prevent attacks. Taking spam e-mail detection as an example, content-based methods, including Naive Bayesian Filtering (NBF) and SVM, have been introduced and widely deployed in production e-mail systems. Also, network operators eagerly contribute to lists of spammers' IP addresses for blacklisting [3]. However, attackers can evade these security measures by avoiding prohibited words or by using IP address spoofing. Therefore, security operators and attackers engage in an endless arms race: once a security operator devises a new defense, attackers come up with new attacks that elude the security checks. In addition, many security solutions lack generality because they are optimized for a specific service. Content-based spam e-mail detection methods are not very effective for spam SMS detection, because spam SMS messages, mostly no longer than 140 characters, use the same corpus as non-spam SMS messages and also contain characters in image form and URLs. To end the arms race between operators and attackers, we aim to devise security measures that are difficult to circumvent and have general applicability.

In this paper, we propose using social relation information to design security solutions. As mentioned before, the availability of abundant social relation information has opened opportunities to create new services, and several applications that exploit social relation information have already appeared. Most of these applications use social relations to improve monetary gain. One application is Influence Maximization (IM) [4], which identifies the most important users, who may maximize the sales of newly introduced goods along established social relations. Viral marketing also utilizes social relations to expedite and expand the propagation of advertisements. Various types of link prediction schemes have been proposed to recommend plausible friends [5,6].
More recently, researchers have devised clever recommender systems that extract social relation features to identify similar users [7,8]. In contrast to previous methods that utilize social relation information for commercial purposes, we aim to devise security schemes that use social relations. Like fingerprints or irises that are unique for each person, every person has a unique set of social relations. Due to this uniqueness, we can use social relations as an authentication key. We call the set of social relations for each individual a Social Relation Key (SRK). Several systems have already used social relations as a supplementary authentication tool. For example, Facebook requests users to list several friends, and asks if the user is indeed their friend [9]. This is a use case with direct adoption of social relation information for security purposes. However, a request for explicit social relations can be cumbersome, and can be easily thwarted. Instead of implementing an explicit request for social relations, we propose using social relations naturally, and in unobtrusive ways. Ideally, the security mechanisms should detect attackers and attack-related activity during normal use, without the need for requests for additional information or interruptions. As far as we are aware, the idea of actively using accumulated social relation information to design security solutions has not been proposed before. Several researchers proposed a scheme that detects attack-related activity according to abnormal behavior, such as plural review postings in a short time span [10]. However, behavior-based schemes do not use individual social relations, but rather depend on the average habits of individuals. Sybil detection methods based on a random walk implicitly use clustering in a social network. However, these methods do not use individual social relations, but use the global property of clustering. 
Of course, global social network properties, including clustering and small-world phenomena, are characteristics useful in the design of security solutions. However, we believe that social network properties that are observable at the level of the individual or ego-network, including homophily, clustering coefficient, balancedness, status, etc., provide stronger and more robust grounds to develop security solutions.

This paper proposes the overarching idea of using the SRK in various security solutions. We devised two schemes based on the SRK and applied them to security problems in two different research areas to test the feasibility and effectiveness of the proposed idea. The first problem is detecting SMS spam, and the second is detecting follower spam on Twitter. The spam SMS detection experiment used real voice call and SMS data obtained from one of the major cellular operators in Korea. We devised a spam-detection mechanism based on the clustering coefficient of the recipients. Even though we use a constrained social network derived from a two-week communication history, the proposed scheme performs significantly better than a conventional content-based spam detection scheme. We also designed a Twitter spam detection method. We suggest the use of two social features, triad counts and social status, that are easily obtainable from each individual's ego-network. These two features discriminate well between the ego-network of a spammer and that of a legitimate user. Compared to previous studies [11,12], our method based on integrated social features achieves a notably high true positive rate and a low false positive rate.

The contributions of this paper are as follows.

• We propose using social relations for security purposes. We develop the concept of the SRK, which may be able to spawn various applications.
• To show the feasibility and effectiveness of the proposed SRK concept, we implemented a spam SMS detection scheme with the SRK. In addition, we developed a scheme that detects Twitter follower spam using two social network features: the triad counts and social status.
• We conducted experiments using real-world large-scale social network data. In particular, the cellular network data is one of the largest and most recent sets of data, and our experimental results indicate the feasibility of the umbrella idea and its effectiveness for both SMS spam detection and Twitter follower spam detection.

The rest of this paper is organized as follows. Section 2 provides an overview of several security solutions for SMS and Twitter. Section 3 introduces the basic SRK model, while Section 4 illustrates the proposed approach for SMS spam detection, and presents the performance results obtained in a rigorous experiment using real-world data. Section 5 delineates the details of the proposed approaches for Twitter spam detection using the triad count and social status. Section 6 concludes the paper, and suggests interesting topics for future research.

2. Related work

2.1. SMS spam filtering

2.1.1. Content-based SMS spam filtering

Since e-mail and SMS share fundamental characteristics, in that communication consists of exchanging text messages and various attachments, existing e-mail filtering methods can be adapted to SMS without extensive modifications. The Naive Bayesian filter [13,14] and the Support Vector Machine (SVM) [13,15–17] are well-known approaches in this category. The Naive Bayesian filter determines in advance a bag of words that occur in spam and non-spam messages. Then, using Bayesian inference, it classifies a message as spam or non-spam according to the probability that the message is spam. SVM is a well-known machine learning method that has also been applied to filtering SMS spam. Although it shows a high classification performance compared to that of Bayesian filters, it has a fundamental limitation due to its high complexity, since an SMS spam filter system should be fast and adaptive in real time. To overcome this obstacle, efforts have been made to reduce the complexity of the SVM [17]. However, the lack of content and the prevalence of word deformation in short messages are the primary hindrances to content-based approaches. Also, spammers can easily and adaptively change the content of messages to bypass a filter trained on an outdated dataset.

2.1.2. Social network-based SMS spam filtering

Social networks linked according to users' social relationships have drawn the attention of many researchers. For normal users, network structures constructed from relationships between close friends remain strong for a long time. In contrast, network structures constructed from spammers' social relationships are abnormal in shape, since communication takes place in a massive, one-sided manner. One approach that employs this feature is to compose a blacklist and a whitelist by analyzing the social relationships [18]. Since this approach does not depend on the content of a message, it is robust against content forgery. At the same time, it is very convenient, since filtering can be automated by monitoring social relationships and communication patterns. James and Hendler suggested a ranking method based on a reputation value [19]. Closeness, preferences, and trust relationships between users have also been introduced [20,21].

2.2. Spam filtering in social networking services

2.2.1. Content-based spam filtering

In Twitter, content, such as the user profile, tweets, and activity logs, has been used to distinguish spammers from legitimate users. Many studies have discovered discriminating features by analyzing Twitter content. For example, in previous works [22–25], spammers were seen to upload tweets that contain a hashtag or URL, because these are the most effective means of making spam accessible. First, Egele [22] proposed COMPA, a countermeasure to a compromise attack in Twitter.
A compromise attack is similar to a camouflage attack, in that it hijacks legitimate users' accounts to spread spam. COMPA detected compromised user accounts based on content, such as tweeting language, timestamp, and URL. Similarly, Benevenuto et al. [23] and Martinez–Romo and Araujo [24] used spam filtering models that learned from hashtags and URLs extracted from spam tweets. Also, Yardi et al. [25] pointed out that trending topics in Twitter are effective in understanding spammers' strategic behavioral patterns. Gao et al. [26] utilized a template matching scheme to filter spam according to the sentence structure of ground-truth spam tweets.

2.2.2. Social network-based spam filtering

Social relations consist of information accumulated from social networking services. Considering that spammers focus on spreading spam content to normal users, it is hard for them to fake the social relations of users. Recently, several works studied the discriminating power of social relations. Most of them intended to detect naive spam attacks in social media, that is, a spamming strategy using fake accounts. In general, fake accounts have been used to diffuse spam to unspecified masses, but sometimes they were used to increase the reputation (i.e., PageRank or the number of followers) of certain target accounts. Viswanath et al. [27] used Principal Component Analysis (PCA) to detect anomalies in Facebook. Their PCA uses "like" action information to find the anomalous behavioral patterns of spammers. Jiang et al. [12] discovered synchronicity in the behavior of fake accounts that increase the status of a target account. Stringhini et al. [22] filtered out fake accounts based on an outbreak of the target accounts' followers. Particularly on Twitter, a novel link-farming strategy called follow spam has become a critical issue. Follow spam accounts follow a massive number of people to gain their attention or follow-backs [28]. Ghosh et al. [11] first investigated the follow spam strategy, and proposed a PageRank-based spam detection algorithm. Integro [29] is an optimized random walk-based ranking algorithm for fake account detection. It is a state-of-the-art scalable solution, and compared to SybilRank [30], it efficiently counters the befriending actions of spammers.

3. The basic model

We propose using an individual's social relations as the authentication key, and refer to it as the Social Relation Key (SRK). With the SRK, an individual can be identified by integrated social information on friendship, social interaction, and even affiliation. The SRK is not tied to a particular storage format, but a more memory-efficient key requires sophisticated feature engineering on the social information. In the following sections, we introduce some powerful social network-based features for user identification: clustering coefficient, triad significance profile, and social status.

Fig. 1 shows an example of a generic security system based on the SRK. User activities are accumulated and stored in a social graph, and this social graph is basically a matrix, where element (i, j) expresses the social relation that user i has formed with user j. The system computes various graph properties either instantaneously for each activity, or later for accumulated activities.

Fig. 1. Generic security system based on the SRK.

The SRK has properties that distinguish it from other security keys, such as artificially generated passwords and natural fingerprints. First, the SRK is not random, but contains the rich semantics of the many relations between the neighbors, as well as between the owner and the neighbors. In other words, the SRK represents a one-hop ego-network of the owner. Even though the SRK denotes direct relations only, we can assume that the ego-network exhibits the rich characteristics of a social network, such as a high level of homophily, clustering coefficient, and balancedness.
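The description of the social graph and the one-hop ego-network can be made concrete with a small sketch. This is only an illustration under stated assumptions: the class name `SocialGraph`, its methods, and the choice to treat a relation as present when either party has contacted the other are all hypothetical and are not taken from the paper or its released code.

```python
from collections import defaultdict


class SocialGraph:
    """Sparse stand-in for the relation matrix: weight[i][j] counts contacts from i to j."""

    def __init__(self):
        self.weight = defaultdict(lambda: defaultdict(int))

    def record(self, sender, receiver):
        """Accumulate one observed activity (call, message, follow, ...)."""
        self.weight[sender][receiver] += 1

    def related(self, a, b):
        """Assumption: a relation exists if either side contacted the other at least once."""
        return (self.weight.get(a, {}).get(b, 0) > 0
                or self.weight.get(b, {}).get(a, 0) > 0)

    def srk(self, owner):
        """One-hop ego-network of `owner`: its neighbours and the ties among them."""
        neighbours = set(self.weight.get(owner, {}))
        neighbours |= {u for u, row in self.weight.items() if owner in row}
        neighbours.discard(owner)
        ties = {
            frozenset((u, v))
            for u in neighbours
            for v in neighbours
            if u < v and self.related(u, v)
        }
        return neighbours, ties


# Toy activity log (made-up users, not data from the paper).
graph = SocialGraph()
for sender, receiver in [("alice", "bob"), ("alice", "carol"),
                         ("bob", "carol"), ("dave", "alice")]:
    graph.record(sender, receiver)

neighbours, ties = graph.srk("alice")
```

Here the key is stored as raw neighbours and ties; a deployed system would more likely keep derived features, as the text notes when it mentions memory-efficient keys.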
Unlike passwords or cryptographic keys that are meaningless when partially applied, we may be able to use subsets of an SRK, without significant degradation in performance. The SRKs of the same person from several domains can be superimposed into a single key, and we may also construct a group SRK, by merging the SRKs of different individuals. The SRK changes dynamically, since it is instantiated when the person becomes the member of a social group, and changes dynamically, as the person establishes new social relations, or terminates existing relations. In addition to the above SRK properties, we argue that the SRK is unique, that is, it is inconceivable to find two individuals with exactly the same social relation. The most important property of the SRK is that it is difficult to fabricate. Fabricating an SRK that satisfies the superficial social network characteristics, such as the degree distribution, is easy. However, it would be quite difficult to create an SRK that satisfies the needed semantics. Given that an attacker cannot access the global social network structure, a randomly generated SRK would possess the properties of a random network, rather than that of real social

networks. One plausible way to contrive a fake SRK is to observe the interactions of an individual over a long period of time. Long espionage itself is perilous to the attacker, and requires significant effort and resources. Even if an attacker successfully creates a fake SRK, its use can be easily detected using the source ID, or through offline interaction between friends. Table 1 shows that the properties of real social networks and random networks, such as the Erdős–Rényi network, are rather distinct. If every social network had different properties, it would be difficult to design schemes that are applicable to all social networks. Fortunately, most, if not all, social networks have the same characteristics, and we can devise schemes that are generally applicable to all social networks.

Table 1. Comparison of real social networks and random networks (N: graph size).

Graph properties        | Real social network                        | Random network        | Measurement level
Degree distribution     | Power law                                  | Binomial distribution | Graph
Search complexity       | O(log N)                                   | O(√N)                 | Graph
Clustering coefficient  | Large (usually > 0.1)                      | Very small (1/N)      | Node
Community (conductance) | Yes (conductance of communities is small)  | No                    | Graph
Balance                 | Yes                                        | No                    | Graph, Node
Status                  | Yes                                        | No                    | Graph, Node
Similarity              | Yes                                        | No                    | Node

Let us examine each property of a social network in greater detail.

• Degree distribution: The degree distribution of most large offline and online social networks follows the power law [31]. On the other hand, in the Erdős–Rényi random network, node degrees are distributed according to a binomial distribution. However, the applicability of the degree distribution can be limited, because it is observed at the graph level, and networks obeying the power law can also be generated using a random network generation model, such as preferential attachment [32].
• Search complexity: While both a real social network and a random network contain O(log N) shortest paths, the search times are not the same. In real social networks, we can find paths in O(log N) [33], but in random networks the search can take O(√N). Even though there is a random network generation model that yields O(log N) search complexity [33], fabricating one requires huge effort on the part of the attacker. A more serious factor that limits the applicability of the search complexity is that it is a graph-level property. Also, it can be difficult to identify attacks that add a small number of edges to existing social networks.
• Clustering coefficient: The clustering coefficient is a metric that incorporates several network properties, including triadic closure, similarity, community, and social capital. The clustering coefficient of a real social network is much larger than that of random networks, and the clustering coefficient seems to be a very useful feature, since it is locally observable, and is hidden in the system. Although an attack can be produced with a large clustering coefficient, this is only possible after investing sizable effort and resources.
• Community: Real social networks encompass communities consisting of nodes that are heavily connected with each other, and their conductance is small. Random networks do not have such communities.
• Balance and status theory: Many social networks conform to the balance theory [34], the status theory [35], or both. Balancedness is most prominent in social networks that imitate offline social interactions, such as friendship. On the other hand, the characteristics of status appear strongly in question-and-answer sites [36]. Balancedness and orderedness can be observed at the ego-network level, and can be easily applied in the design of security solutions.
• Similarity (homophily): In real social networks, according to the homophily theory, two individuals with similar attributes become friends with each other, or, according to the social influence theory, friends tend to become similar [37]. Again, similarity is observable at the ego-network level, and enjoys a high level of applicability.
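The contrast in clustering coefficient between random and community-structured networks can be checked numerically. The sketch below is an illustration only: the graph sizes, the ring-of-cliques stand-in for a community-structured network, and all function names are assumptions rather than anything from the paper, and it uses the standard undirected local clustering coefficient, 2 × links / (k(k − 1)), which differs by a constant factor from the receiver-set formula the paper gives later.

```python
import random
from itertools import combinations


def local_clustering(adj, u):
    """Fraction of pairs of u's neighbours that are themselves connected."""
    nbrs = adj[u]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))


def average_clustering(adj):
    return sum(local_clustering(adj, u) for u in adj) / len(adj)


def erdos_renyi(n, p, rng):
    """Random graph: every pair of nodes is linked independently with probability p."""
    adj = {u: set() for u in range(n)}
    for a, b in combinations(range(n), 2):
        if rng.random() < p:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def ring_of_cliques(num_cliques, clique_size):
    """Toy community graph: dense cliques joined in a ring by single bridge edges."""
    adj = {u: set() for u in range(num_cliques * clique_size)}
    for c in range(num_cliques):
        members = range(c * clique_size, (c + 1) * clique_size)
        for a, b in combinations(members, 2):
            adj[a].add(b)
            adj[b].add(a)
        # Bridge the last member of this clique to the first member of the next.
        a = c * clique_size + clique_size - 1
        b = ((c + 1) % num_cliques) * clique_size
        adj[a].add(b)
        adj[b].add(a)
    return adj


rng = random.Random(0)
n, c = 400, 8  # illustrative size and average degree, not values from the paper
random_cc = average_clustering(erdos_renyi(n, c / n, rng))
clustered_cc = average_clustering(ring_of_cliques(n // 8, 8))
print(f"random: {random_cc:.4f}  community-structured: {clustered_cc:.4f}")
```

With a comparable average degree, the random graph's clustering stays near its connection probability c/N, while the community-structured graph's is close to 1, which is the gap the table summarizes.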

As briefly explained above, SRK-based security systems do not require additional actions to be taken by users, but are instead integrated into social activities. Applying security measures indirectly, as part of routine social activities on the site, eliminates the chore of entering keys required by most conventional authentication systems. In addition, the system is difficult to circumvent or compromise. If we use direct social relations as a criterion, the attacker would need to have established social relations with future victims before commencing attack campaigns. Since it is difficult to control others' activities, the attacker cannot easily avoid detection.

4. Application 1: SMS spam detection

SMS is one of the most popular applications in cellular communications. Along with mobile messengers, it is replacing voice calls due to several advantages, which include low or no fees. It enables quick sharing of text and images among multiple receivers, and it is more discreet than placing a voice call. While it provides ample advantages to innocent users (or because of these advantages), it is also employed by attackers as a vehicle to disseminate unwanted and annoying propaganda. Before the deployment of 3G networks, SMS spam was just a nuisance that could be ignored without incurring any hazard. However, as smartphones and 3G/LTE networks proliferated, attackers started to exploit the data connection capabilities of these devices and systems, and developed much more vicious threats, such as smishing (also referred to as SMS phishing). Usually, a smishing message disguises itself as a normal SMS message using the same word corpus, except for the URL that directs unsuspecting receivers to malicious websites. Once victims are on the malicious sites, criminals use social engineering to retrieve sensitive information, including passwords and credit card numbers. In fact, SMS spam is more difficult to detect than e-mail spam, because normal SMS messages contain many abbreviations and misspelled words that make spam and non-spam share a large word set.

In this section, we propose an SMS spam detection mechanism that is based on social relation information, and we assess its suitability by conducting a large-scale experiment with real data obtained from a major cellular operator in South Korea.

4.1. SMS spam detection scheme

There are several reasons why SMS spam is popular among fraudsters. In addition to the ease with which attacks can be launched, the low execution cost and high success rate result in a hefty payback to the attackers. The launch procedure is highly automated, and only requires a few clicks to disseminate an SMS spam message to many randomly selected subscribers. Some serious attackers maintain a list of victims, but their collection process usually involves some randomness. For the sake of efficiency, and to evade content-based or network-based detection, fraudsters transmit bulk SMS messages to multiple victims. Cellular operators in South Korea limit the maximum number of recipients to 500, to limit the damage to the public. Normal users also transmit bulk SMS messages. It is not uncommon in South Korea for people to send wedding party invitations and meeting notices via SMS messages. Fig. 2 shows two SMS messages found in the real data set, a spam message and a normal message. (We translated the messages, originally written in Korean, into English.) Except for the URLs, they are exactly the same, so any content-based detection method would fail to tell them apart.

Fig. 2. Example of spam and normal SMS messages.

We devised a simple spam SMS detection scheme that exploits the high clustering coefficient that is characteristic of social networks. One difference between spam SMS and normal SMS is that the recipients of spam are random, while those of a legitimate SMS are not. In a large social network, the probability that two randomly selected persons know each other is the same as the connection probability $p = c/N$, where $c$ is the average degree and $N$ is the number of vertices. However, due to the triadic closure property, friends of a legitimate sender know each other with a much higher probability than $p$. Therefore, a spam SMS detection scheme can easily be devised based on the clustering coefficient among receivers. If it is close to $p$, the message is determined to be spam, and if it is significantly larger than $p$, the message is considered genuine.

Suppose that user $u$ transmits a bulk SMS message to four receivers $r_1$, $r_2$, $r_3$ and $r_4$. This is like an announcement that the relationships with $r_1$, $r_2$, $r_3$ and $r_4$ are stored in the sender's SRK. A simple method to verify the authenticity of the SRK is to examine whether relations were established between the sender and the receivers. However, this naive method can be easily eluded by sending messages to receivers before launching the attack. The clustering coefficient for node $u$, assuming that $r_1$, $r_2$, $r_3$ and $r_4$ are its neighbors, is given as:

$$\frac{e}{\deg(u)\,(\deg(u) - 1)} \qquad (1)$$

where $e$ is the number of existing edges between the receivers in the social graph, and $\deg(u)$ is the receiver set size (i.e., 4 in this example). Accordingly, the clustering coefficient for node $u$ is $\frac{2}{4 \cdot 3} = \frac{1}{6} \approx 0.167$. The experimental data consists of tens of millions of users, and the difference can be six orders of magnitude. Since there is such a huge difference in the clustering coefficients between spam and non-spam messages, we can easily define a threshold parameter to separate spam from non-spam. It is worth noting that it is difficult to manipulate the clustering coefficient unless the fraudster can induce the receivers to engage in communications with each other before launching the attack. Therefore, even if the attackers understand the detection scheme, they cannot easily evade the defense.

4.2. Experiment

Table 2. Experimental data collected from a cellular operator.

Number of voice/SMS senders   | 60,328,911
Number of voice/SMS receivers | 38,587,815
Number of links               | 625,498,315
Average node degree           | 10.37
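Before turning to the data, the detection rule of Section 4.1 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: relations are taken to be undirected and each connected receiver pair is counted once in $e$, the function names are hypothetical, and the threshold value is purely illustrative.

```python
from itertools import combinations


def receiver_clustering(known_pairs, receivers):
    """Eq. (1): e / (k * (k - 1)), where k is the receiver set size and e is the
    number of receiver pairs that already have a relation in the social graph."""
    receivers = list(dict.fromkeys(receivers))  # drop duplicate recipients
    k = len(receivers)
    if k < 2:
        return 0.0  # a single recipient carries no clustering evidence
    e = sum(1 for a, b in combinations(receivers, 2)
            if frozenset((a, b)) in known_pairs)
    return e / (k * (k - 1))


def looks_like_spam(known_pairs, receivers, threshold):
    """Flag a bulk message whose receivers are (almost) strangers to each other."""
    return receiver_clustering(known_pairs, receivers) < threshold


# Toy social graph: r1 knows r2, and r3 knows r4 (made-up relations).
known_pairs = {frozenset(pair) for pair in [("r1", "r2"), ("r3", "r4")]}
receivers = ["r1", "r2", "r3", "r4"]

score = receiver_clustering(known_pairs, receivers)  # 2 / (4 * 3)
ILLUSTRATIVE_THRESHOLD = 1e-3                         # assumption, for demonstration only
genuine_flag = looks_like_spam(known_pairs, receivers, ILLUSTRATIVE_THRESHOLD)
random_flag = looks_like_spam(set(), receivers, ILLUSTRATIVE_THRESHOLD)
```

With two connected pairs among four receivers the score reproduces the worked value of 1/6, while a receiver set with no mutual relations scores 0 and falls below any positive threshold.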

We conducted a large-scale experiment with real-world data, using billing call records obtained from a major cellular network operator in South Korea. We extracted the sender's and receiver's phone numbers to build a social network using both voice call records and SMS records. A raw record consists of several fields, which include the sender number, receiver number, calling (or sending) time, and call duration. We note that the dataset provided by the cellular operator does not include private information; every phone number has been anonymized. Table 2 shows that the social network consists of about 38 million nodes, and slightly more than 625 million links. The construction of a dependable social network requires data to be collected over a long period of months, or even years. However, due to highly restrictive regulations on the collection of public telecommunication records, we were forced to build a social network with data collected over two consecutive weeks in 2014. Note that even though the dataset is quite large, it only includes relations that occurred within the two weeks, and many social relations are missing. To preserve as many social relations as possible with the limited dataset, we do not use the common link definition techniques that eliminate infrequent contact and one-way contact. In Table 2, a link may involve more than one instance of communication between the two end nodes.

Training dataset for content-based spam detection: We compared the performance of the proposed method with that of NBF, which is currently used by the cellular operator. However, we cannot access their software or systems, so we implemented our own NBF, which is straightforward because several open source software libraries are available.2 The main problem was to build a spam word set and a non-spam word set.
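To make the content-based baseline concrete, the following is a minimal from-scratch sketch of a word-count Naive Bayesian filter. It is an assumption-laden illustration, not the Weka-based implementation used in the experiment: it uses a multinomial model with add-one smoothing and whitespace tokenization, the class and method names are hypothetical, and the training messages are invented.

```python
import math
from collections import Counter


class NaiveBayesFilter:
    """Multinomial Naive Bayes over whitespace-separated words (illustrative only)."""

    LABELS = ("spam", "non-spam")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.message_counts = Counter()

    def train(self, text, label):
        """Add one labelled message to the spam or non-spam word set."""
        self.word_counts[label].update(text.lower().split())
        self.message_counts[label] += 1

    def _log_score(self, words, label, vocab_size):
        # Requires at least one training message per label (otherwise the prior is 0).
        total_words = sum(self.word_counts[label].values())
        prior = self.message_counts[label] / sum(self.message_counts.values())
        score = math.log(prior)
        for word in words:
            # Add-one (Laplace) smoothing so unseen words do not zero the product.
            count = self.word_counts[label][word]
            score += math.log((count + 1) / (total_words + vocab_size))
        return score

    def is_spam(self, text):
        words = text.lower().split()
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["non-spam"])
        spam_score = self._log_score(words, "spam", len(vocab))
        ham_score = self._log_score(words, "non-spam", len(vocab))
        return spam_score > ham_score


# Invented training messages, for demonstration only.
nbf = NaiveBayesFilter()
nbf.train("win cash now click link", "spam")
nbf.train("free casino bonus click now", "spam")
nbf.train("meeting moved to friday at noon", "non-spam")
nbf.train("wedding invitation for saturday please come", "non-spam")

flag_a = nbf.is_spam("click now for free cash")
flag_b = nbf.is_spam("meeting on friday please come")
```

The sketch also shows why such a filter depends entirely on wording: a spam message that reuses the vocabulary of a legitimate one would receive the same score as the legitimate message.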
Note that the NBF is a supervised learning algorithm that requires cumbersome training, whereas our social network-based method is unsupervised. We used spam that had been reported to the government-endorsed authority, the Korea Internet and Security Agency (KISA), during a two-month period in 2014. Since some reported messages were not actually spam, we manually inspected every reported message and classified it as spam or non-spam. Table 3 summarizes the SMS dataset we used. Due to privacy issues, it is more difficult to collect non-spam messages than spam messages. Thus, in addition to these non-spam messages, we also collected legitimate messages from the

2 http://www.cs.waikato.ac.nz/~ml/weka/.


Table 3 Spam and non-spam training dataset for NBF.

Spam dataset
  KISA        18,471
  Volunteer   144
  Total       18,615

Non-spam dataset
  KISA        976
  Volunteer   10,852
  Twitter     31,299
  Total       43,127

Fig. 4. CDF of spam and non-spam receiver set clustering coefficient (log-scaled distribution).

Fig. 3. The categories of the spam messages.

Table 4 Performance of social relation-based SMS spam detection scheme.

            Precision   False positive
Spam        1           0
Non-spam    0.999       0.004

Twitter timeline, and from volunteers who agreed to provide us with their messages. Overall, we collected 43,127 non-spam messages and created a non-spam word set from the collection.

Test dataset: To obtain the ground truth, we examined the spam reported to KISA on one day in 2014 and manually classified the messages as spam or non-spam based on their content. There were 841 spam and 2,238 non-spam messages, sent by 205 and 1,910 users, respectively. Fig. 3 shows the categories of the spam messages. The majority of the reported spam was for gambling, which is illegal in Korea. Smishing and various unwanted advertisements, such as loan offers, also constituted sizable portions. Since we could not access the content of the SMS messages, it is hard to match spam messages to their receivers. We therefore first determined a receiver set for each bulk SMS; if the phone number of a user who reported the spam occurs in a receiver set, we judge that bulk SMS to be spam.

4.3. Result

Fig. 4 illustrates the distribution of the receiver set clustering coefficient for spam and non-spam. The two distributions are widely separated. The mean and standard deviation for spam are 3.74 × 10⁻⁶ and 1.04 × 10⁻⁵, respectively, and those for non-spam are 7.52 × 10⁻² and 4.58 × 10⁻², respectively. For this dataset, any clustering coefficient threshold in the interval (1.2 × 10⁻⁴, 2.7 × 10⁻³) partitions the receiver sets into two non-overlapping clusters, achieving 100% precision and recall (Fig. 4). Table 4 summarizes the results of the experiment.

We compared the performance of the social network-based method to that of the content-based NBF. Note that we apply NBF to each message but apply the proposed scheme to each of the receiver sets as described above. NBF detects 821 spam messages out of 841, and it falsely classifies 213 normal messages (out of 2,238) as spam, a false positive rate of around 9.5%. False positives greatly affect user satisfaction, and cellular network operators regard them as much more serious algorithmic defects than false negatives. It is worth noting that the performance of our own NBF closely matches that of the production NBF systems operated by cellular network operators.

In our NBF experiments, we witnessed extensive efforts by spammers to defeat the filtering mechanism. Many smishing messages with URLs and advertisements intentionally use special characters and deformed words to circumvent detection schemes that rely on the occurrence of specific words, and fraudsters also place common words before sensitive words. We may overcome these tactics by adding customized rules to the NBF. However, this may in turn trigger new attack schemes that counteract these rules, forcing us into a never-ending arms race against online criminals.
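The receiver-set test of Section 4 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' released code: the function names, the representation of the social graph as a set of (sender, receiver) pairs, and the choice to count ordered pairs for e are all hypothetical, and the default threshold is simply one value inside the interval reported above.

```python
from itertools import permutations

def receiver_set_clustering(receivers, contacts):
    """Eq. (1) as printed: e / (n * (n - 1)) for a receiver set of size n.

    contacts: set of (sender, receiver) pairs taken from the social graph.
    Assumption: e counts ordered pairs, so a two-way contact counts twice.
    """
    members = set(receivers)
    n = len(members)
    if n < 2:
        return 0.0  # assumption: treat the undefined case as zero
    e = sum(1 for pair in permutations(members, 2) if pair in contacts)
    return e / (n * (n - 1))

def is_spam(receivers, contacts, threshold=1e-3):
    """Flag a bulk SMS whose receivers barely know each other.

    The default threshold is an illustrative value inside the interval
    (1.2e-4, 2.7e-3) reported for the paper's dataset.
    """
    return receiver_set_clustering(receivers, contacts) < threshold

# Toy social graph (hypothetical data).
contacts = {("a", "b"), ("b", "a"), ("b", "c"), ("c", "d")}
friends = ["a", "b", "c", "d"]      # receivers who contact each other
strangers = ["a", "x", "y", "z"]    # receivers with no mutual contacts

print(receiver_set_clustering(friends, contacts))
print(is_spam(friends, contacts), is_spam(strangers, contacts))
```

Because the coefficient depends only on who the receivers are, a filter of this kind needs no message content and no training, which is the contrast with the NBF drawn in the text.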

5. Application 2: Twitter spam detection

This section introduces another example in which the SRK is applied to the design of security solutions. We deal in particular with spamdexing and discuss an algorithm that uses both triad compositions and the status property. We validate the effectiveness of the SRK-based scheme in terms of true positives and false positives by comparing it with Collusionrank [11] and CatchSync [12], which are PageRank-based and HITS-based follow spam detection algorithms, respectively. We carried out a performance analysis using a Twitter dataset. The experiment showed that the SRK-based scheme is very effective in detecting spam. The experimental results of the SRK-based approach build on our previous work [38].

Social networking services (SNS), such as Facebook, Twitter, Ren-Ren and Sina Weibo, have become the most influential networking media for building social relations. Like the Web, where the importance of each page is largely determined by who references whom, the influence of individuals in many SNS systems is determined by the number of indexes that they receive. For example, on Twitter, the number of followers is the most important factor that determines social status, and the number of ‘likes’ on Facebook has a similar effect. This feature has attracted extensive fraud that tries to increase the importance or reputation of entities by generating bogus indexes. This class of attacks is referred to as spamdexing.

Twitter is one of the largest social networking and real-time microblogging services, and it has grown in size over the past several years. The latest announcement indicated that the number of


Twitter users to date exceeded 255 M (footnote 3). The most distinctive social interaction feature of Twitter is the follow relation, where users may follow famous individuals with whom they are unacquainted (usually celebrities or standout opinion leaders), as well as close friends. Twitter plays the role of an information propagation medium in addition to that of an online social network [39]. However, allowing free and unlimited attachment through follow relations enables ‘follow spam’ attacks, in which bogus follows are produced for target nodes [28]. Commercial attackers maintain fake accounts and cooperating users to generate bogus follows. Ghosh et al. [11] first analyzed follow spammers and discovered that over 80% of the target followees (who are followed by the spammers) made mutual follow links with the spammers. Using these bi-directional follow relations, spammers can effortlessly expose their spam content on their timelines. One may guess that a bogus Twitter account has a relatively smaller set of followers (indegree) than follows (outdegree). However, the high percentage of mutual follows with an attacker may negate the effectiveness of simple algorithms based on a node's indegree-to-outdegree ratio. Considering this limitation, we develop a Twitter spam detection mechanism based on triad frequency and social status.

5.1. Dataset and experimental settings

Dataset: For the experiment, we used the real Twitter dataset provided by [11]. The dataset consists of 54,981,152 users, including 41,352 spammers, and 1,963,263,821 follow links. Among these, we randomly sampled 1000 legitimate users and 1000 spammers. We only include users who have more than 10 edges in the experiment dataset. Finally, we build 2000 ego-networks, one for each user.

Collusionrank: Collusionrank [11] is a PageRank-based follow spam detection mechanism. The idea behind Collusionrank is rather simple: spammers tend to have a lower PageRank than real users.
According to [11], 94% of follow spammers belong to the lowest 10% group in terms of the Collusionrank metric. We also estimated the number of innocent users that belong to the lowest 10% group. If we classify the users in the lowest 10% group as spammers and the others as legitimate, the false positive rate is as high as 9.9%. Considering the simplicity of the algorithm, this can be considered satisfactory. However, as mentioned before, false positives are more serious than false negatives. In the following sections, we develop an SRK-based scheme that may improve on the performance of Collusionrank.

CatchSync: CatchSync [12] is a HITS-based anomaly detection scheme. The rationale behind this approach is that most abnormal accounts in social networks act in a synchronized way: anomalies (or outliers) in an SNS normally follow (or make subscription links to) numerous accounts that have similar indegree and authority (HITS) values. Since these abnormal accounts aim to increase the influence (i.e., the number of followers) of target users, they intensively make follow links to ‘follower buyers’. In contrast, normal users typically link to accounts with varied indegree and authority values. We conducted an experiment with CatchSync on the same real Twitter dataset. We used two major features of CatchSync to classify Twitter spammers: synchronicity and normality. With 10-fold validation and the RandomForest classifier, the true positive rate of CatchSync is 91.5%, and the false positive rate is as high as 10.9%. In the following sections, we also compare the SRK-based scheme to CatchSync.

3 http://thenextweb.com/twitter/2014/04/29/twitter-passes-255m-monthly-active-users-198m-mobile-users-sees-80-advertising-revenue-mobile/.

Table 5 Performance of the degree-based follow spam detection scheme.

                  True positive   False positive
Spammer           0.808           0.196
Legitimate user   0.804           0.192

Fig. 5. An example of ego-network.

5.2. Spam detection with degree information

In social network-based spam detection, most prior work has relied largely on user node degree information. Node degree has been a discriminating feature for classification because the degree distributions of spammers and legitimate users differ. In general, spammers on SNS have widely used broadcast attacks to distribute spam content to unspecified individuals, and have therefore preferred messaging services that diffuse spam at low cost. In such cases, we can infer that spammer nodes have numerous outgoing links but few incoming links in the social network graph; that is, normal users do not appear to send messages to spammers, because most spammer nodes are fake accounts used for spamming. Hence, we leveraged the discriminatory power of degree information for follow spam. Table 5 shows that the degree feature is useful to some extent, but it does not offer good performance when used alone: about 20% of follow spammers and legitimate users were misclassified when using the degree information. As we briefly mentioned in the previous section, follow spammers can have a high indegree due to link farming attacks and reciprocated incoming links. Our investigation of the misclassified users found that this characteristic of follow spam confused the classification. These results indicate that degree information alone is not appropriate for follow spam classification. Therefore, in the following sections, we propose two promising social relational features and validate their feasibility.

5.3. Ego-network for user identification

At the micro-level of a social network, an ‘ego-network’ is a compelling subgraph that can be used to grasp the social characteristics related to an ‘ego’.
Although the definition of an ego-network varies according to the research topic, in the case of Twitter, we define a user’s ego-network as the neighborhood social graph consisting of the user (i.e., the ego), his/her followers and follows, and the links among these nodes. Note that an SRK defines the vertices of the ego-network centered on the owner, and the SRK itself does not show whether a pair of neighbor nodes is connected or not. This information is managed by, and stored in, the online social network systems. We utilized each user’s ego-network as an (extended) SRK. For example, in Fig. 5, the ‘ego’ (a yellow-colored node) is followed by r1, r2, and r3, and it simultaneously follows r4, r5 and r6 (green-colored nodes). The system recognizes the relations


Fig. 7. TSP of follow spammer in Twitter.

Fig. 6. Isomorphic triad types.

between neighboring nodes. Fig. 5 shows that a Twitter ego-network is represented as a directed graph G = (V, E), where the node set is V = {ego, r1, r2, r3, r4, r5, r6} and the edge set E includes every ‘follow’ link among the nodes. Note that gray-colored nodes and edges that are more than one hop away from the ego are not included in the ego-network. We extracted feasible social properties from each user’s ego-network, including the triad significance (Section 5.4) and social status (Section 5.5). In the following sections, we demonstrate the mechanism and performance of SRK to detect follow spam using these features.
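The ego-network definition above translates directly into a small extraction routine. The following Python sketch assumes the follow graph is available as (follower, followee) pairs; the function name and the toy links among neighbors are hypothetical, since the text does not list the neighbor-to-neighbor links of Fig. 5.

```python
def ego_network(ego, follows):
    """Return (V, E) for the one-hop ego-network of `ego`.

    follows: iterable of (follower, followee) pairs.
    V holds the ego, its followers and its followees; E holds every
    follow link whose two endpoints are both in V.
    """
    follows = set(follows)
    followers = {a for a, b in follows if b == ego}
    followees = {b for a, b in follows if a == ego}
    nodes = {ego} | followers | followees
    edges = {(a, b) for a, b in follows if a in nodes and b in nodes}
    return nodes, edges

# Toy follow graph shaped like Fig. 5 (neighbor links are made up).
follows = [
    ("r1", "ego"), ("r2", "ego"), ("r3", "ego"),   # followers of the ego
    ("ego", "r4"), ("ego", "r5"), ("ego", "r6"),   # accounts the ego follows
    ("r1", "r4"), ("r5", "r6"),                    # links among neighbors
    ("r6", "g1"), ("g2", "r2"),                    # nodes two hops away
]
nodes, edges = ego_network("ego", follows)
print(sorted(nodes))
print(len(edges))
```

Links that touch the two-hop nodes g1 and g2 are dropped, matching the statement that gray nodes and edges are not part of the ego-network.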

5.4. Triad significance profile (TSP)

Milo et al. [40] adopted isomorphic triads as network motifs. Here, a triad refers to a partially-connected 3-node subgraph pattern. The triad frequencies among heterogeneous networks have been observed in various fields, including social networks, word-adjacency networks and microorganism networks. To compare networks efficiently, [40] proposed the Triad Significance Profile (TSP), that is, a vector of normalized triad counts, and discovered triad types that are over- or under-represented in networks. We adopted the TSP heuristically, without a firm theoretical justification: we simply conjectured that the follow relations of spammers differ from those of innocent users, and that this difference may appear in the TSP. Following Milo’s approach, we identify the 13 isomorphic triad patterns shown in Fig. 6. We used the average triad counts of legitimate users’ ego-networks as the null model. For each legitimate user, we counted the number of triads in each class and computed the means and standard deviations for the 13 triad types. We also counted the number of triads that appeared in spammers' ego-networks and used Eq. (2) to compute the Z-score for each triad type.

$$Z_i = \frac{N_{\mathrm{user}_i} - \langle N_{\mathrm{null}_i} \rangle}{\mathrm{std}(N_{\mathrm{null}_i})} \qquad (2)$$

where N_{user_i} is the occurrence frequency of the ith triad class in a spammer’s ego-network, and ⟨N_{null_i}⟩ and std(N_{null_i}) are the mean and standard deviation of that frequency over legitimate users’ ego-networks, respectively. The TSP is the vector of the Z-scores normalized to a length of 1, as defined in Eq. (3).

$$\mathrm{TSP}_i = \frac{Z_i}{\left(\sum_i Z_i^2\right)^{1/2}} \qquad (3)$$

Fig. 7 shows the average TSP of 1000 follow spammers. The error bars are based on the standard deviation of the spammers’ TSP. Since we measured spammers’ triad frequencies against those of legitimate users, zero would indicate that a triad occurs equally often in a spammer’s ego-network and in legitimate users’ ego-networks. Fig. 7 shows that only the 021D triad is over-represented, while all other 12 patterns are under-represented. The 021D triad appears to correspond to a link-farming pattern: follow spam is apparently based on a link farming attack to obtain more followers. The most under-represented type is 021U, which has two incoming edges. It is worth noting that triad types with two or three bi-directional links (i.e., types 201, 210 and 300) are under-represented. These phenomena may reflect the fact that spammers' links are artificially generated.

5.5. Social status

The status property is a social network property that is useful in directed networks [35]. It can be observed in networks that involve voting, questions and answers, etc. Since opinion leaders and celebrities have a large number of followers, we conjecture that Twitter’s follow mechanism connotes status. According to status theory, a person with a lower rank tends to give a positive directed link to a person with a higher rank. In Twitter, if a person follows another person, we consider that there is a positive link from the follower to the followee. For simplicity, we define the status of a person in a straightforward manner: the status of a node u is the ratio of the number of incoming links to u (the number of followers) to the number of outgoing links from u (the number of followees), as shown in Eq. (4).

$$\mathrm{status}(u) = \frac{\mathrm{indegree}(u)}{\mathrm{outdegree}(u)} \qquad (4)$$
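The two per-user features defined so far, the TSP of Eqs. (2) and (3) and the status of Eq. (4), are simple arithmetic on counts. The Python sketch below assumes the triad counts per class are already available; the function names are hypothetical, the use of the population standard deviation is an assumption (the text does not say which estimator is used), and the toy example uses three triad classes instead of the paper's 13 for brevity.

```python
import math
from statistics import mean, pstdev

def z_scores(user_counts, null_counts):
    """Eq. (2): Z-score of each triad class against the null model.

    user_counts: triad counts of one ego-network, one entry per class.
    null_counts: list of such count vectors for legitimate users.
    Assumption: population standard deviation; a class whose null
    counts never vary gets Z = 0 to avoid division by zero.
    """
    zs = []
    for i, n_user in enumerate(user_counts):
        column = [counts[i] for counts in null_counts]
        sd = pstdev(column)
        zs.append((n_user - mean(column)) / sd if sd > 0 else 0.0)
    return zs

def tsp(zs):
    """Eq. (3): scale the Z-score vector to unit length."""
    norm = math.sqrt(sum(z * z for z in zs))
    return [z / norm for z in zs] if norm > 0 else [0.0] * len(zs)

def status(indegree, outdegree):
    """Eq. (4). Assumption: a user who follows nobody gets infinite status."""
    return indegree / outdegree if outdegree > 0 else math.inf

# Toy data with three triad classes (hypothetical counts).
null_counts = [[10, 4, 2], [14, 6, 2], [12, 8, 2]]
profile = tsp(z_scores([20, 2, 2], null_counts))
print(profile)
print(status(indegree=5, outdegree=200))
```

A positive entry in the profile marks a triad class that is over-represented relative to legitimate users, and a negative entry one that is under-represented, which is how Fig. 7 is read.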

Once the status of each user has been computed, we measure whether a given user generates follow relations that obey the status property, i.e., whether she tends to follow persons with a higher status. Specifically, we measure the positive link probability plp(u) of a user u as the fraction of follows to persons with higher status (Npos(u)) among the user's total follows (outdegree(u)); Nneg(u) denotes the remaining follows. Eq. (5) expresses the positive link probability of user u.

$$\mathrm{plp}(u) = \frac{N_{pos}(u)}{N_{pos}(u) + N_{neg}(u)} \qquad (5)$$
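As a minimal sketch of Eq. (5), the following Python function computes the positive link probability from a precomputed status table. The function name, the dictionary-based inputs, and the handling of ties and of users who follow nobody are assumptions made for illustration.

```python
def positive_link_probability(user, followees, status):
    """Eq. (5): share of a user's follows that point to higher status.

    followees: mapping user -> iterable of accounts that user follows.
    status:    mapping account -> status value as in Eq. (4).
    Assumptions: a follow to an account of equal status counts toward
    N_neg, and plp is taken as 0 for a user who follows nobody.
    """
    targets = list(followees.get(user, ()))
    if not targets:
        return 0.0
    n_pos = sum(1 for v in targets if status[v] > status[user])
    n_neg = len(targets) - n_pos
    return n_pos / (n_pos + n_neg)

# Toy example (hypothetical accounts and status values).
status_table = {"u": 0.5, "celebrity": 50.0, "friend": 1.2, "bot": 0.1}
followees = {"u": ["celebrity", "friend", "bot"]}
plp_u = positive_link_probability("u", followees, status_table)
print(plp_u)
```

Here two of the three follows go to higher-status accounts, so the user mostly obeys the status property.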

Fig. 8 shows the relation between status and positive link probability for legitimate users and spammers. We can observe that, except for two spammers, the status of all spammers is very low. However, status by itself may not be a suitable classification criterion because many innocent users also have a low status. Another important point is that the positive link probability of spammers spans from low to high values.

Fig. 8. Relation between user status and positive link probability (plp).

5.6. Experimental result

We analyzed the performance of a combined mechanism that uses three different features of social networks: TSP, social status and degree information. Table 6 shows the result of Twitter spam detection. The true positive and false positive probabilities are 96.3% and 3.7%, respectively. The results are not as impressive as those for spam SMS detection. However, compared to Collusionrank, whose true positive and false positive probabilities are 94% and 9.9%, respectively (Table 7), we can conclude that the performance of the social network-based (SRK-based) scheme is satisfactory. In particular, the social network method reduces the false positive probability, which is a more serious problem than false negatives, from 9.9% to 3.7%, a reduction by a factor of about 2.7. Table 7 also shows that, compared to CatchSync, the SRK-based scheme improves substantially on both true positives and false positives. This experiment is meaningful because it shows the feasibility of combining several social network features in the design of security solutions.

Table 6 Performance of the social relation based follow spam detection scheme.

                  True positive   False positive
Spammer           0.963           0.057
Legitimate user   0.943           0.037

Table 7 Overall performance comparison: true positive in spammer classification and false positive in legitimate user classification (in percent).

                  SRK-based scheme   Collusionrank   CatchSync
True positive     96.3%              94%             91.5%
False positive    3.7%               9.9%            10.9%

A further investigation of the information gain showed that the triads 021D and 021U and the user status seemed to be the most effective features. In particular, triads 021D and 021U reflect many non-reciprocal outgoing links and incoming links, respectively. As expected, the ratio of indegree to outdegree has limited effectiveness. Note that triad types 021D and 021U reflect not only the links between an ego and its neighbors but also the links between neighbors. Even though spammers can control the follow relations centered on themselves, it is much more difficult to modify relations between the other nodes. We observe that the positive link probability is effective as a social status feature. Our analysis shows that a spammer made links with numerous randomly-selected users, and many of them have lower social status than the spammer. This goes against normal social following patterns, because a typical legitimate user is likely to make links to more influential (i.e., higher status) users.

In Sections 4 and 5, we applied the SRK in the design of security mechanisms to solve two real-world cases. Even though these experiments do not prove the general applicability of the SRK, we believe that we have gained some confidence in the feasibility of its use.

6. Conclusion

In this paper, we have proposed the Social Relation Key (SRK) as an authentication mechanism that takes advantage of the abundant social relations accumulated over a long period of time. We verified this idea through SMS and Twitter spam detection experiments that implement individual verification. For SMS spam detection, we exploited the clustering coefficient as the discriminating social relation feature. The experiment used a real-world SMS dataset provided by a major cellular network operator in Korea, and our approach showed significantly better performance, with a 100% true positive rate, than a traditional NBF-based spam detection scheme. In addition, we used triad counts and social status to detect Twitter follow spam. By investigating the ego-networks of spammers and legitimate users, we discovered social relational properties that discriminate between them. As a result, the proposed approach confirmed the feasibility of using the SRK, with 96.3% true positive and 5.7% false positive rates on the real-world Twitter dataset. Compared to previous works, this result demonstrates superior performance and supports the feasibility of using social relations as a key.

For future work, we will devise an approach to deter camouflage attacks. A camouflage attack compromises legitimate user accounts and uses them as a medium to spread unsolicited content. Our mechanism cannot yet detect such attacks, because they exploit legitimate users’ social relations. If our scheme tracks social relations over time, we will be able to distinguish such hijacked accounts with the SRK. We should also broaden the experiments to cover every user’s case. Examining the feasibility of the SRK as an identification tool across individual social networks will provide greater insight into future authentication systems.
Acknowledgments

The authors deeply appreciate the administrative support for this work from the Institute for Industrial Systems Engineering of Seoul National University. This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. NRF-2017R1A2A1A01007400), and by the Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIP) (No. B019016-2017, Resilient/Fault-Tolerant Autonomic Networking Based on Physicality, Relationship and Service Semantic of IoT Devices). This work was also supported by the National Research Foundation of Korea (NRF) grant funded by the Korean Government (MSIP) (No. 2016R1A5A1012966).

References

[1] D.M. Romero, B. Meeder, J. Kleinberg, Differences in the mechanics of information diffusion across topics: idioms, political hashtags, and complex contagion on twitter, in: Proceedings of the 20th International Conference on World Wide Web, ACM, 2011, pp. 695–704.
[2] Bazaarvoice, Social trends report, 2013. URL http://media2.bazaarvoice.com/documents/Bazaarvoice_Social-Trends-Report-2013-a.pdf.

[3] A. Ramachandran, N. Feamster, S. Vempala, Filtering spam with behavioral blacklisting, in: Proceedings of the 14th ACM Conference on Computer and Communications Security, ACM, 2007, pp. 342–351.
[4] P. Domingos, M. Richardson, Mining the network value of customers, in: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2001, pp. 57–66.
[5] D. Liben-Nowell, J. Kleinberg, The link-prediction problem for social networks, J. Assoc. Inf. Sci. Technol. 58 (7) (2007) 1019–1031.
[6] L. Backstrom, J. Leskovec, Supervised random walks: predicting and recommending links in social networks, in: Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, ACM, 2011, pp. 635–644.
[7] H. Ma, H. Yang, M.R. Lyu, I. King, Sorec: social recommendation using probabilistic matrix factorization, in: Proceedings of the 17th ACM Conference on Information and Knowledge Management, ACM, 2008, pp. 931–940.
[8] H. Ma, D. Zhou, C. Liu, M.R. Lyu, I. King, Recommender systems with social regularization, in: Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, ACM, 2011, pp. 287–296.
[9] Facebook, National Cybersecurity Awareness Month Updates, 2011. URL https://www.facebook.com/notes/facebook-security/national-cybersecurity-awareness-month-updates/10150335022240766.
[10] S. KC, A. Mukherjee, On the temporal dynamics of opinion spamming: case studies on yelp, in: Proceedings of the 25th International Conference on World Wide Web, International World Wide Web Conferences Steering Committee, 2016, pp. 369–379.
[11] S. Ghosh, B. Viswanath, F. Kooti, N.K. Sharma, G. Korlam, F. Benevenuto, N. Ganguly, K.P. Gummadi, Understanding and combating link farming in the twitter social network, in: Proceedings of the 21st International Conference on World Wide Web, ACM, 2012, pp. 61–70.
[12] M. Jiang, P. Cui, A. Beutel, C. Faloutsos, S. Yang, Catchsync: catching synchronized behavior in large directed graphs, in: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2014, pp. 941–950.
[13] T.A. Almeida, J.M.G. Hidalgo, A. Yamakami, Contributions to the study of sms spam filtering: new collection and results, in: Proceedings of the 11th ACM Symposium on Document Engineering, ACM, 2011, pp. 259–262.
[14] N. Wu, M. Wu, S. Chen, Real-time monitoring and filtering system for mobile sms, in: Industrial Electronics and Applications, 2008. ICIEA 2008. 3rd IEEE Conference on, IEEE, 2008, pp. 1319–1324.
[15] J.M. Gómez Hidalgo, G.C. Bringas, E.P. Sánz, F.C. García, Content based sms spam filtering, in: Proceedings of the 2006 ACM Symposium on Document Engineering, ACM, 2006, pp. 107–114.
[16] Y. Xiang, M. Chowdhury, S. Ali, Filtering mobile spam by support vector machine, in: CSITeA’04: Third International Conference on Computer Sciences, Software Engineering, Information Technology, E-Business and Applications, International Society for Computers and Their Applications (ISCA), 2004, pp. 1–4.
[17] D. Sculley, G.M. Wachman, Relaxed online svms for spam filtering, in: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2007, pp. 415–422.
[18] S. Hameed, P. Hui, LENS: LEveraging Anti-Social Network Against Spam, Technical Report No. IFI-TB-2010-02, Institute of Computer Science, University of Göttingen, Germany, 2010.
[19] J.G. James, J. Hendler, Reputation network analysis for email filtering, in: Proc. of the Conference on Email and Anti-Spam (CEAS), Mountain View, Citeseer, 2004.
[20] P.-A. Chirita, J. Diederich, W. Nejdl, Mailrank: using ranking for spam detection, in: Proceedings of the 14th ACM International Conference on Information and Knowledge Management, ACM, 2005, pp. 373–380.


[21] Z. Li, H. Shen, Soap: a social network aided personalized and effective spam filter to clean your e-mail box, in: INFOCOM, 2011 Proceedings IEEE, IEEE, 2011, pp. 1835–1843.
[22] G. Stringhini, G. Wang, M. Egele, C. Kruegel, G. Vigna, H. Zheng, B.Y. Zhao, Follow the green: growth and dynamics in twitter follower markets, in: Proceedings of the 2013 Conference on Internet Measurement Conference, ACM, 2013, pp. 163–176.
[23] F. Benevenuto, G. Magno, T. Rodrigues, V. Almeida, Detecting spammers on twitter, in: Collaboration, Electronic Messaging, Anti-Abuse and Spam Conference (CEAS), vol. 6, 2010, p. 12.
[24] J. Martinez-Romo, L. Araujo, Detecting malicious tweets in trending topics using a statistical analysis of language, Expert Syst. Appl. 40 (8) (2013) 2992–3000.
[25] S. Yardi, D. Romero, G. Schoenebeck, et al., Detecting spam in a twitter network, First Monday 15 (1) (2009).
[26] H. Gao, Y. Yang, K. Bu, Y. Chen, D. Downey, K. Lee, A. Choudhary, Spam ain’t as diverse as it seems: throttling osn spam with templates underneath, in: Proceedings of the 30th Annual Computer Security Applications Conference, ACM, 2014, pp. 76–85.
[27] B. Viswanath, M.A. Bashir, M. Crovella, S. Guha, K.P. Gummadi, B. Krishnamurthy, A. Mislove, Towards detecting anomalous user behavior in online social networks, in: 23rd USENIX Security Symposium (USENIX Security 14), 2014, pp. 223–238.
[28] Making progress on spam, (https://blog.twitter.com/2008/making-progress-spam). Accessed: 2016-05-23.
[29] Y. Boshmaf, D. Logothetis, G. Siganos, J. Lería, J. Lorenzo, M. Ripeanu, K. Beznosov, Integro: leveraging victim prediction for robust fake account detection in osns, in: NDSS, vol. 15, 2015, pp. 8–11.
[30] Q. Cao, M. Sirivianos, X. Yang, T. Pregueiro, Aiding the detection of fake accounts in large scale social online services, in: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, USENIX Association, 2012, p. 15.
[31] A.-L. Barabási, R. Albert, Emergence of scaling in random networks, Science 286 (5439) (1999) 509–512.
[32] M.E. Newman, Clustering and preferential attachment in growing networks, Phys. Rev. E 64 (2) (2001) 025102.
[33] J.M. Kleinberg, Navigation in a small world, Nature 406 (6798) (2000) 845.
[34] F. Heider, The Psychology of Interpersonal Relations, Psychology Press, 2013.
[35] J. Leskovec, D. Huttenlocher, J. Kleinberg, Signed networks in social media, in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, ACM, 2010, pp. 1361–1370.
[36] A. Anderson, D. Huttenlocher, J. Kleinberg, J. Leskovec, Discovering value from community activity on focused question answering sites: a case study of stack overflow, in: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2012, pp. 850–858.
[37] M. McPherson, L. Smith-Lovin, J.M. Cook, Birds of a feather: homophily in social networks, Annu. Rev. Sociol. (2001) 415–444.
[38] S. Jeong, G. Noh, H. Oh, C.-K. Kim, Follow spam detection based on cascaded social information, Inf. Sci. 369 (2016) 481–499.
[39] H. Kwak, C. Lee, H. Park, S. Moon, What is twitter, a social network or a news media? in: Proceedings of the 19th International Conference on World Wide Web, ACM, 2010, pp. 591–600.
[40] R. Milo, S. Itzkovitz, N. Kashtan, R. Levitt, S. Shen-Orr, I. Ayzenshtat, M. Sheffer, U. Alon, Superfamilies of evolved and designed networks, Science 303 (5663) (2004) 1538–1542.