Computer Communications 150 (2020) 455–462
Contents lists available at ScienceDirect
Computer Communications journal homepage: www.elsevier.com/locate/comcom
Predicting the security threats on the spreading of rumor, false information of Facebook content based on the principle of sociology
Xiaomeng Wang a,∗, Binxing Fang a,b, Hongli Zhang a, Xing Wang a
a Research Center of Computer Network and Information Security Technology, Harbin Institute of Technology, Harbin, China
b Cyberspace Institute of Advanced Technology, Guangzhou University, Guangzhou, China
ARTICLE INFO
Keywords: Information diffusion; Linear regression; Popularity; Social networks
ABSTRACT
With the rapid development of the Internet of Things (IoT), frequent communication among a huge number of heterogeneous smart devices over Online Social Networks (OSNs) has become viable and efficient. Growing volumes of user submissions, including online content, videos and comments, increasingly affect people's lives, leading to an explosive propagation of information and posing security threats through the spread of rumors, false information and inappropriate online speech. The goal of popularity prediction for online content is to accurately predict future popularity from the early diffusion status. Existing models for popularity prediction are mostly based on discovering network features or fitting a time-varying function, and seldom draw on principles of sociology. In this paper, we find a high linear correlation between the proportion of faithful fans (users who frequently share a Facebook homepage's content) among the early sharers and the future popularity. The statistical results on Facebook suggest that the principle of mainstream fatigue plays an important role in the prediction task. Furthermore, an experimental study illustrates the effectiveness of the proposed method.
∗ Corresponding author. E-mail address: [email protected] (X. Wang).
https://doi.org/10.1016/j.comcom.2019.11.042
Received 25 September 2019; Received in revised form 29 October 2019; Accepted 26 November 2019; Available online 9 December 2019
0140-3664/© 2019 Published by Elsevier B.V.
1. Introduction
In recent years, the convergence of the ‘‘Internet of Things’’ and ‘‘Online Social Networks’’ has gradually become feasible, and more and more smart devices connect to social networks. A growing number of online platforms, each gathering thousands of users, are becoming very popular. As one of the largest online social networks, Facebook had about 1 billion users by 2015. Its pages include star homepages for social sciences, celebrities and government agencies, as well as a number of ordinary users' homepages, which post real-time messages to attract public attention. Facebook is not only a social network but also an increasingly important distribution channel for big data, allowing real-time access to smart devices. Considering its user scale and interaction pattern, Facebook can be understood as a mapping of human society onto the Internet. Because posting and receiving information is so easy, users express their views freely and topics of all kinds can be posted at any time, posing security threats through the spread of rumors, false information and inappropriate online speech. Therefore, it is necessary to know in advance whether a message will burst, which has great application value for rumor prevention [1], election forecasting and advertisement placement [2]. Predicting the popularity of messages on famous homepages can help people capture hot events more easily and help advertisers act more rationally to maximize their benefit.
The popularity prediction problem stems from uneven statistical distributions in which a few samples receive most of the attention, such as the distribution of wealth, of population, and of friends on dating websites. Studies have shown that most online content attracts little attention, while only a small proportion gains a lot of user attention. This uneven distribution dates back to the Italian economist Vilfredo Pareto's famous ‘Pareto principle’, in which he found that 20% of the population accounted for 80% of social wealth. In the Internet age, Barabasi and Albert [3] published a landmark article in Science, in which they found that the distributions of most complex networks, such as the actor cooperation network, the World Wide Web and the power grid of the western United States, follow a power law with exponent 2 < 𝛾 < 3. In the age of online social networks, Kwak et al. [4] found that the 10% most popular users' videos on YouTube attracted nearly 80% of users' attention, while the remaining 90% received only 20%. The main task of popularity prediction in online social networks is to predict the future popularity of user-generated content based on observation of the initial propagation process. However, it is still very difficult to get an accurate result, because information diffusion is influenced by many factors, such as network topology, group behavior and semantic content. During the last few years, researchers have
been devoting themselves to improving prediction accuracy. However, these methods rarely consider sociological theory, which limits their prediction accuracy. In this paper, we introduce the mainstream fatigue theory into the popularity prediction task. We extend the SH model (Szabo and Huberman) with the mainstream fatigue theory, which shows a high linear correlation on a Facebook dataset. The proposed model is named MFL (the Mainstream Fatigue theory with the Linear regression model). The main contributions of this paper are summarized as follows:
(1) We propose an improved popularity prediction method that takes the mainstream fatigue theory into account.
(2) We find a high linear correlation between the proportion of frequently sharing users among the early sharers and the future popularity.
(3) Experiments on a Facebook dataset show that the proposed model achieves higher accuracy than the other models in long-term popularity prediction.
2. Related work
Researchers have devoted great effort to the prediction problem, and comprehensive surveys have been conducted. Most methods fall into three categories: those based on group state, on regression, and on time series.
Methods based on group state divide the nodes in the social network into several states and analyze the trend of popularity evolution by simulating the state transfer process. Saeed et al. [5] used infectious disease models to study the spread of Twitter messages, arguing that when infected (I) nodes in the social network post relevant tweets, their fans become newly susceptible, so the total number keeps growing. Abdullah et al. [6] improved the classic SIR epidemic model to simulate the propagation of Twitter messages. Matsubara et al. found that the popularity distribution of blogs obeys a power law and that user attention changes periodically, and proposed a prediction model with a dynamic infection rate on the basis of the traditional SI model. Li et al. considered the influence of the underlying topology on transmission and proposed a network cascade method for predicting the popularity of external videos on Renren.
Regression methods are usually committed to finding the key influencing factors in the process of information diffusion, exploring the relationship between these factors and popularity, and transforming the prediction task into a classification or regression problem. Szabo et al. [7] found that early and future popularity show a strong linear relationship after a logarithmic transformation, and were the first to use regression to predict later popularity. Chang et al. [8] found that, on online video sites, the popularity of a TV series is correlated with that of its previously released episodes, and that the proportion of random viewers decreases over time. Based on this finding, they improved the traditional regression model. Bao et al. [9] improved the SH model by considering the link density and the depth in the prediction model. Kim et al. [10] found that early page views are related to the final page views, and proposed a regression based on an exponential function. Cheng et al. [11] analyzed the hot topics of online social networks from a temporal perspective and proposed an autoregressive average model to predict the number of posts. Zhu et al. [12] first introduced the concept of diffusion acceleration and combined it with the early popularity to build a multiple regression model that predicts the number of microblog shares.
Methods based on time series assume that the diffusion of online content is continuous in time, and model the future trend using the numerical series observed at historical time points. Crane et al. [13] analyzed the time series of the transmission process of five million YouTube videos and found that 90% of the videos can be accurately described by a Poisson process, while the remaining videos follow a power-law distribution after the propagation peak. Yang et al. [14] studied the evolutionary mechanism of user-generated content and proposed six types of temporal popularity patterns by analyzing 580 million tweets and 170 million blog posts. Lerman et al. [15] considered interest and visibility in the voting process on Digg and proposed a time series model to predict the final number of votes. Li et al. [16,17] proposed a cascade method to predict the popularity of exogenous videos. Hu et al. [18,19] proposed a time-sequence-based method that divides the propagation process into four stages to improve short-term prediction accuracy for burst events. Gao et al. [20] proposed a reinforced Poisson process method, which models the decay of information diffusion together with a preferential attachment mechanism.
The above methods have had some success in predicting popularity, but for a hyper-massive online social network like Facebook, prediction accuracy still needs to be improved. Methods based on group state use mathematical models to reproduce the process of information diffusion from a microscopic perspective, but the node attributes and state transfer probabilities in these models are too idealized, so they apply only to estimating the extent of propagation on a fixed network topology. Time series methods use fitting functions to characterize real-time popularity trends; they work well for short-term tasks, but in long-term prediction the accumulation of deviation may gradually reduce accuracy. Regression methods aim to establish a mapping between the early and the future popularity, which requires extracting features from the popularity evolution, and are suitable for long-term prediction. In this paper, we analyze in depth the communication mechanism of messages on Facebook homepages and propose a popularity prediction model based on regression analysis. For the first time, the model introduces the ‘mainstream fatigue theory’ from sociology as a key feature of the regression equation, in the form of connection strength, and predicts the final popularity of messages in combination with the early popularity. Experiments show that the proposed method improves predictive performance in comparison with other primary models.
3. Preparatory work
In this paper, popularity means the number of shares generated below a piece of content on a Facebook homepage. We focus on predicting the final popularity of online content. As shown in Fig. 1, we crawled hundreds of the most famous homepages on Facebook.
Fig. 1. The structure of the popularity prediction model.
Fig. 2. The lifecycle of online content on Facebook: (a) Hourly variation of user activity on average; (b) The trend of the share popularity on average.
3.1. Problem assumptions
This paper discusses how to predict the popularity of user-generated messages on Facebook homepages, where users can comment on, like, and share messages. The target of popularity prediction is to estimate accurately, from early observations, the popularity that published content will reach after a period of time. For each message posted on any open Facebook homepage, we know who has shared it by the time of observation. Given a content $m$, we define its release time as $T_0$, the predict time as $T_{predict}$ and the reference time as $T_{reference}$. Popularity increases over time from the release at $T_0$ and is nearly constant once the time exceeds the life cycle. In general, $T_0 < T_{reference} < T_{predict}$. In addition, we define the time at which message $m$ receives its $k$th share as $t_k^m$. The diffusion process up to $T_{reference}$ can then be defined as $\{t_k^m\}$, where $k \in (0, n_m]$ and $n_m$ is the number of shares obtained by message $m$ within the full training period $(0, T_{reference}]$. We use $B_m$ to denote the actual number of shares of content $m$, and $B_m'$ to denote the predicted number. In summary, the prediction task can be defined as follows: based on the popularity of message $m$ during the diffusion $\{t_k^m\}$, predict the number of shares $B_m'$ at the later time $T_{predict}$.
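As a rough illustration of the problem definition, the early popularity and the prediction target can both be read off a message's share timestamps. The sketch below is not the authors' code: the function name `shares_by`, the timestamp list and the choice of hours as the time unit are assumptions made for illustration. The 3 h reference time and 7 day predict time are the settings the paper adopts in Section 3.2.

```python
from bisect import bisect_right

def shares_by(share_times, t_release, horizon_hours):
    """Count shares received in (t_release, t_release + horizon_hours]."""
    times = sorted(share_times)
    lo = bisect_right(times, t_release)                  # skip shares at or before release
    hi = bisect_right(times, t_release + horizon_hours)  # include shares up to the horizon
    return hi - lo

# Hypothetical share timestamps for one message, in hours.
share_times = [0.2, 0.5, 1.1, 2.9, 3.0, 7.5, 26.0, 100.0]
T0 = 0.0

early = shares_by(share_times, T0, 3.0)        # observed popularity at T_reference = 3 h
final = shares_by(share_times, T0, 7 * 24.0)   # target popularity at T_predict = 7 days
print(early, final)
```

A predictor then only sees `early` (plus any features computed from the early sharers) and is scored against `final`.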
3.2. Facebook content lifecycle
In this section, we introduce an important feature, user activity, which measures how frequently users participate in sharing content. We first analyze how the activity of Facebook users varies over the day, obtained by counting the average number of messages published by all users in each hour of the day. The more messages users publish on average within an hour, the more active they are in that hour. Fig. 2(a) shows the hourly variation of user activity. User activity differs significantly across time periods and is much higher in the daytime than at midnight, which follows the human daily routine. The period between 4 a.m. and 12 noon shows the lowest user activity, while 18:00 to 22:00 is the most active period, in line with users' usage habits and rest patterns. Hence, the popularity of online content is affected by its release time. In addition, online content becomes less interesting as time goes by: its popularity grows more slowly over time and finally becomes approximately constant after a certain period. For example, content on Digg spreads rapidly and reaches 80% of its total popularity in only 24 h, while videos on YouTube spread relatively slowly and accumulate only 50% of their final share popularity within seven days.
In this paper, we introduce the concept of relative activity, a one-dimensional vector that indicates the relative activity intensity of users in the $i$th hour of the 24 h day. It is calculated as follows: first, calculate the average number of shares per hour, $M$, over all messages in the dataset; then calculate the total shares $S[i]$ of the $i$th hour. The relative activity of the $i$th hour is
\[ S'[i] = \frac{S[i]}{M} \tag{1} \]
For Facebook, the lifecycle of content is shown in Fig. 2(b). The horizontal axis indicates each hour after the content is published on those famous homepages, and the vertical axis indicates the average number of shares per content in that hour. The average popularity remains constant after 150 h. Therefore, the proposed model uses ‘‘day’’ as the time unit for final popularity. To predict the final popularity accurately, we set $T_{predict}$ to 7 days. In addition, users' sharing behavior is most concentrated in the first 12 h after a message is published, and $T_{reference}$ is set to 3 h based on real-time considerations.
Fig. 3. Log–log diagram of the relationship between final popularity and faithful fans proportion.
3.3. The principle of mainstream fatigue
In this section, we introduce a sociological theory called ‘‘mainstream fatigue theory’’ into popularity prediction, which is based on the classical theory of ‘‘weak ties’’. The American sociologist Mark Granovetter [9] proposed the weak ties theory, in which tie strength could be a linear combination of emotional intensity, mutual trust and relations. According to this classical theory, the widespread dissemination of information is driven by weak relationships. On social networks, the topology made up of friend relationships reflects past social characteristics at a macro level. Ferrara et al. [21] found that weak relationships on Facebook have a significant effect on information diffusion. However, it is the graph constructed from social interactions that shows how information will diffuse.
On a Facebook homepage, interaction frequency differs significantly among fans: only a small percentage of faithful fans share most content, while the majority of fans contribute only a small number of shares. When a piece of content grows popular, however, the faithful fans make up only a small proportion of its total shares. We build an interaction graph between the homepage and all users who have ever shared one of its messages. Users who seldom share are called non-mainstream nodes; a popular piece of content always involves many non-mainstream nodes in its diffusion. Users who interact frequently and often share content are called faithful fans, and they constitute the mainstream nodes. Based on these node interaction relationships, the concept of the mainstream fatigue parameter is proposed, which quantifies how frequently a user shares messages. The mainstream fatigue parameter $f$ represents the share frequency of user $j$ relative to homepage $k$:
\[ f = \frac{c_{jk}}{\sum_{j=1}^{n_k} c_{jk}} \tag{2} \]
where $c_{jk}$ is the number of times user $j$ has shared content from homepage $k$, $n_k$ is the number of users who have shared a message of homepage $k$ at least once in its history, and $f$ is the tie strength of user $j$ on homepage $k$.
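Eqs. (1) and (2) are both simple normalizations, and a minimal sketch may make them concrete. The helper names, the input layout (a list of 24 hourly totals, and a flat log with one user id per share) and the reading of $M$ as the mean of the 24 hourly totals are assumptions for illustration, not details confirmed by the paper.

```python
from collections import Counter

def relative_activity(hourly_shares):
    """Eq. (1): S'[i] = S[i] / M for each of the 24 hours.

    Assumption: M is taken as the mean of the 24 hourly totals.
    """
    if len(hourly_shares) != 24:
        raise ValueError("expected 24 hourly totals")
    mean = sum(hourly_shares) / 24
    return [s / mean for s in hourly_shares]

def fatigue_parameters(share_log):
    """Eq. (2): f_j = c_jk / sum_j c_jk for a single homepage k.

    share_log holds one user id per share event on that homepage.
    """
    counts = Counter(share_log)
    total = sum(counts.values())
    return {user: c / total for user, c in counts.items()}

# Hypothetical data: a quiet first half of the day and a busy second half.
activity = relative_activity([10] * 12 + [30] * 12)
# Hypothetical share log: user "u1" shared three times, "u2" once.
f = fatigue_parameters(["u1", "u1", "u1", "u2"])
print(activity[0], activity[23], f)
```

Under this reading, the relative activities average to 1 over the day and the fatigue parameters of one homepage sum to 1, so a faithful fan is simply a user with a comparatively large $f$.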
Through repeated experiments, we found that, for a given number of shares, the smaller the proportion of faithful fans among the early sharers, the more widely the content will spread. We crawl all users' share histories, sort the users by the number of contents they have shared from the same homepage, and select the top 3% of users, who share often, as the reference group. As Fig. 3 shows, there is a significant linear correlation (blue solid line) between early and final share popularity in log–log coordinates. This suggests that content that eventually spreads widely is not limited to a closed community in the early period: more strangers, as non-mainstream nodes, must be involved to promote the information diffusion.
4. Model
In this section, we describe in detail how the proposed model works. First, we study the SH model [7], which is based on classic linear regression. Szabo and Huberman found a strong linear correlation between the logarithm of the early popularity and the logarithm of the
late popularity. In the SH model, early popularity is the only input, so the model suits a wide range of scenarios. It is often used to make long-term predictions, but its prediction accuracy is low because it uses too few features. Considering the linear relationship between final popularity and the faithful fans proportion, we introduce the mainstream characteristic of the sharers into the model. For each online content on a Facebook homepage, there is a pair consisting of the share popularity observed in the first 3 hours and the final popularity. We assume that the logarithm of the final popularity is a linear combination of the logarithms of the mainstream proportion and the early popularity, and propose the regression equations (3) and (4):
\[ \ln B_m' = \alpha_1 \ln B_m + \alpha_2 \ln f + \alpha_3 \tag{3} \]
\[ B_m' = \exp[\alpha_1 \ln B_m + \alpha_2 \ln f + \alpha_3] \tag{4} \]
where $B_m'$ is the predicted popularity of content $m$, $B_m$ is its observed early share popularity (the input also used by the SH model), the parameters $\alpha_1$, $\alpha_2$, $\alpha_3$ are learned from the training data, and $f$ is the faithful fans proportion among all users who shared the content in the most recent 12 months. Further, because the release time of a message may influence its real diffusion ability, we introduce a relative popularity $B_m^*$ to correct the prediction:
\[ B_m^* = \frac{B_m}{S'[i]} \tag{5} \]
Substituting Eq. (5) into Eq. (4), we get the final prediction model, Eq. (6):
\[ B_m' = \exp[\gamma_1 \ln B_m^* + \gamma_2 \ln f + \gamma_3] \tag{6} \]
where $\gamma_1$, $\gamma_2$, $\gamma_3$ are global coefficients learned from the data.
Fig. 4. The change trends of RMSE and Pearson's correlation coefficient with the increasing of faithful fans proportion f.
5. Empirical study
5.1. Dataset
Since the paper focuses on predicting the final popularity of online content, we crawled hundreds of the most famous homepages on Facebook. Millions of contents have been created on these homepages. The data contains nearly 20,000 contents and more than 2 million IDs of users who shared at least once on the crawled homepages between October 26, 2015 and October 26, 2016.
5.2. Performance criterion
In this section, we introduce RMSE, Pearson's correlation coefficient and MAPE to measure the prediction results. RMSE is frequently used to measure the differences between the values predicted by a model or an estimator and the values actually observed, as in Eq. (7):
\[ RMSE = \sqrt{\frac{\sum_{i=1}^{n} (X_{obs,i} - X_{model,i})^2}{n}} \tag{7} \]
where $X_{obs,i}$ is the actual value and $X_{model,i}$ is the value obtained from the prediction model.
Pearson's correlation coefficient is the covariance of two variables divided by the product of their standard deviations. As shown in Eq. (8), it measures the linear correlation between two variables $X$ and $Y$:
\[ r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2} \sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}} \tag{8} \]
MAPE is a common measure of prediction accuracy in statistics, which gives the average relative error of the prediction. MAPE is defined as Eq. (9):
\[ MAPE = \frac{100}{n} \sum_{t=1}^{n} \left| \frac{A_t - F_t}{A_t} \right| \tag{9} \]
where $A_t$ is the real value and $F_t$ is the predicted value.
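The three criteria of Eqs. (7) to (9) can be written directly from their definitions. The following is a minimal sketch using only the Python standard library; the function names and the sample numbers are illustrative and are not taken from the paper's dataset.

```python
import math

def rmse(observed, predicted):
    """Eq. (7): root of the mean squared difference."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def pearson_r(xs, ys):
    """Eq. (8): covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (dev_x * dev_y)

def mape(actual, forecast):
    """Eq. (9): mean absolute percentage error; actual values must be non-zero."""
    n = len(actual)
    return 100.0 / n * sum(abs((a - f) / a) for a, f in zip(actual, forecast))

# Hypothetical final share counts and predictions for three messages.
actual = [100.0, 200.0, 400.0]
forecast = [110.0, 180.0, 400.0]
print(rmse(actual, forecast), pearson_r(actual, forecast), mape(actual, forecast))
```

Note that MAPE divides by the actual value, so it is undefined for content with zero shares; filtering out low-share content before evaluation also avoids that case.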
Fig. 5. Prediction result on the famous Facebook homepage ‘‘Fox News’’. (a) SH Model based on the early popularity, (b) MFL model which mixed SH model with mainstream proportion.
Fig. 6. Prediction results for different models, MAPE is used to measure the accuracy of each model: (a) Interest homepage; (b) Entertainment homepage.
Table 1
Result based on SH model.
Famous page            RMSE          r
National Geographic    0.27832594    0.75463737
History                0.32737778    0.70040119
NBA                    0.28694586    0.77730771
Call of Duty           0.191666      0.83741442
Grey's Anatomy         0.26148896    0.75880176
Fox News               0.3087333     0.75103412
The Simpsons           0.24680658    0.70316968
Kung Fu Panda          0.21722348    0.70204075
Barack Obama           0.28865517    0.76637799
5.3. Prediction result
In this section, we first introduce the parameter selection method, and then compare the proposed model with the baselines.
Baselines: To examine the proposed model, we compare it with the following popularity prediction methods:
• SH Model: The classic popularity prediction model proposed by Szabo and Huberman, which uses linear regression on the early and final popularity.
• DSH Model: The linear regression model proposed by Bao et al. [9], which considers the link density and the depth in the prediction model. Introducing topological structure into the SH model improves its accuracy.
• RPP Model: This model has a linear cumulative function and a log-normal temporal attenuation function.
• MFL Model: The modified linear regression model proposed in this paper in Eq. (6).
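To make the SH baseline concrete, the sketch below fits a straight line to the logarithms of early and final share counts by ordinary least squares. It is an illustrative reading of the baseline as described in this paper (a log-log linear regression with a free slope and intercept), not the reference implementation of Szabo and Huberman; the function names and the data are hypothetical.

```python
import math

def fit_sh(early, final):
    """Least-squares fit of ln(final) = slope * ln(early) + intercept."""
    xs = [math.log(v) for v in early]
    ys = [math.log(v) for v in final]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def predict_sh(early_count, slope, intercept):
    """Map an early share count to a predicted final share count."""
    return math.exp(slope * math.log(early_count) + intercept)

# Hypothetical, noise-free data generated as final = 4 * early ** 1.1.
early = [10.0, 20.0, 50.0, 100.0, 400.0]
final = [4.0 * e ** 1.1 for e in early]

slope, intercept = fit_sh(early, final)
print(slope, math.exp(intercept), predict_sh(200.0, slope, intercept))
```

Because early popularity is the only input, two messages with the same early share count always receive the same prediction, which is the limitation the MFL model tries to address with the faithful fans proportion.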
Table 2
Result based on MFL model.
Famous page            RMSE          r
National Geographic    0.23250591    0.87914733
History                0.28382431    0.8444794
NBA                    0.19864445    0.83963969
Call of Duty           0.21383976    0.8149566
Grey's Anatomy         0.14858594    0.85407032
Fox News               0.17024147    0.85531719
The Simpsons           0.23632155    0.80050685
Kung Fu Panda          0.19571676    0.78950783
Barack Obama           0.18800831    0.78235282
We select the contents that have at least 10 shares to reduce sample noise in preprocessing. We use RMSE to measure the differences between predicted and observed values. As Fig. 4 shows, the RMSE gradually decreases as the mainstream proportion increases in the early stage. When $f = 1.75\%$, the RMSE curve reaches its trough and Pearson's correlation coefficient reaches its peak. This is because when the selection criterion for mainstream users is strict, few samples are introduced into the model, so $f$ is too small to give an accurate result. As $f$ increases, more samples are added and the prediction results improve. However, when $f$ exceeds the extreme point, Pearson's correlation coefficient decreases and the RMSE increases. Finally, our model degenerates to the SH model if $f = 1$.
We compare the prediction accuracy of the proposed MFL method and the SH model on the ‘‘Fox News’’ homepage in Fig. 5: Fig. 5(a) shows the SH model and Fig. 5(b) the proposed MFL model, both as scatter plots. The horizontal axis is the logarithm of the number of shares per content within 3 h, and the vertical axis is the logarithm of the final share popularity. Based on the lifecycle of Facebook content shown in Fig. 2(b), the reference time for the final popularity is set to 150 h to make sure that as much of the share information as possible is collected. What is more, we remove the samples whose total number of shares in the first three hours is less than 10, because too few shares can affect the accuracy of the prediction, and contents that fail to reach this threshold are very unlikely to suddenly reach a large popularity within a week. In addition, the red dots in the figures represent the 75% of the homepage data used as the training set, and the black dots represent the remaining 25% used as the test set. The blue solid line indicates the least-squares fit to the training set. The fit of the MFL model is clearly better than that of the SH model (see Table 2).
For further comparison of the MFL model and the SH model, we chose some famous Facebook homepages, such as ‘‘National Geographic’’ and ‘‘The Simpsons’’, as test sets from the dataset. RMSE and Pearson's correlation coefficient are used to measure the prediction results. The Pearson's correlation coefficient of the MFL model in Table 2 is about 0.8, which shows a high linear correlation for our proposed model. What is more, for these famous Facebook homepages, the RMSE is always under 0.3 and relatively stable. Overall, the proposed MFL method achieves 13% better performance than the SH model.
Fig. 6 uses MAPE to measure the accuracy of each model. The reference time $T_{reference}$ is set to 3 h, and we observe the long-term prediction performance of each model by adjusting the predict time $T_{predict}$. For the RPP model, we set the initial parameter $\varepsilon$ to 10. In Fig. 6(a), the SH model shows the worst performance. The DSH model performs better when the target time is shorter than 5 days. The MFL model is clearly better in long-term prediction of final popularity when $T_{predict}$ is longer than 5 days. Fig. 6(b) gives the MAPE comparison of these models on the entertainment homepages; the RPP model gives better results than the other models in short-term prediction. However, as the predict time grows and shares accumulate, the MFL model gradually shows an advantage in long-term popularity prediction. This is most obvious on the interest-class homepages when $T_{predict} \geq 4.5$ days, which indicates that the MFL model is better suited to long-term popularity prediction.
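The fitting procedure described in this section (a least-squares fit of Eq. (6) on 75% of the samples, evaluated on the remaining 25%) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the data is synthetic and noise-free, the coefficient values in `true_gammas` are invented, and the helper names are hypothetical. The normal equations are solved with a small Gaussian elimination so that the example needs only the standard library.

```python
import math
import random

def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_mfl(samples):
    """Fit gamma_1..3 of Eq. (6); samples are (B_m, S'[i], f, final) tuples."""
    rows = [[math.log(b / s), math.log(f), 1.0] for b, s, f, _ in samples]
    ys = [math.log(final) for _, _, _, final in samples]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    return solve_linear(xtx, xty)

def predict_mfl(gammas, b, s, f):
    """Eq. (6) with the relative popularity B* = B / S'[i] of Eq. (5)."""
    g1, g2, g3 = gammas
    return math.exp(g1 * math.log(b / s) + g2 * math.log(f) + g3)

# Synthetic samples generated from invented coefficients (illustration only).
rng = random.Random(0)
true_gammas = (1.1, -0.4, 0.7)
samples = []
for _ in range(200):
    b = rng.uniform(10, 1000)   # shares observed in the first 3 h
    s = rng.uniform(0.5, 1.5)   # relative activity of the release hour
    f = rng.uniform(0.01, 0.9)  # faithful fans proportion
    samples.append((b, s, f, predict_mfl(true_gammas, b, s, f)))

rng.shuffle(samples)
split = int(0.75 * len(samples))
train, test = samples[:split], samples[split:]

gammas = fit_mfl(train)
log_rmse = math.sqrt(
    sum((math.log(predict_mfl(gammas, b, s, f)) - math.log(y)) ** 2
        for b, s, f, y in test) / len(test)
)
print(gammas, log_rmse)
```

On real data the fit would of course not recover the coefficients exactly; the point of the sketch is only the shape of the pipeline: build the design matrix from $\ln B_m^*$ and $\ln f$, fit on the training split, and report the error on the held-out split in log space, as in Tables 1 and 2.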
6. Conclusions
In this paper, we applied a popularity prediction model based on sociological theory to address the insufficient predictive accuracy of current methods. We find a high linear correlation between the proportion of faithful fans, who frequently share a Facebook homepage's content, among the early sharers and the future popularity. The statistical results on Facebook suggest that social physics theory plays an important role in the prediction task. The experimental results show that, thanks to the introduction of the mainstream fatigue theory, the proposed model gives better results than the other models, which validates its effectiveness in popularity prediction.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
CRediT authorship contribution statement
Xiaomeng Wang: Writing - original draft. Binxing Fang: Resources. Hongli Zhang: Formal analysis. Xing Wang: Validation.
Funding
The work of this paper was supported in part by the National Key R&D Program of China under Grants No. 2017YFB0803305 and No. 2016QY03D0501.
References
[1] N. Eltantawy, J.B. Wiest, The Arab spring social media in the Egyptian revolution: reconsidering resource mobilization theory, Int. J. Commun. 5 (2011) 18.
[2] M. Gupta, J. Gao, C.X. Zhai, et al., Predicting future popularity trend of events in microblogging platforms, Proc. Amer. Soc. Inf. Sci. Technol. 49 (1) (2012) 1–10.
[3] A.-L. Barabasi, R. Albert, Emergence of scaling in random networks, Science 286 (5439) (1999) 509–512.
[4] H. Kwak, C. Lee, H. Park, et al., What is twitter, a social network or a news media?, in: Proceedings of the 19th International Conference on World Wide Web, ACM, 2010, pp. 591–600.
[5] Saeed Abdullah, Xindong Wu, An epidemic model for news spreading on twitter, in: Tools with Artificial Intelligence (ICTAI), 2011 23rd IEEE International Conference on, IEEE, 2011, pp. 163–169.
[6] Xindong Wu Abdullah, An epidemic model for news spreading on twitter, in: Tools with Artificial Intelligence (ICTAI), 2011 23rd IEEE International Conference on, IEEE, 2011, pp. 163–169.
[7] G. Szabo, B.A. Huberman, Predicting the popularity of online content, 2008, Available at SSRN 1295610.
[8] B. Chang, H. Zhu, Y. Ge, et al., Predicting the popularity of online serials with autoregressive models, in: Proceedings of the 23rd ACM International Conference on Information and Knowledge Management, ACM, 2014, pp. 1339–1348.
[9] P. Bao, H.W. Shen, J. Huang, et al., Popularity prediction in microblogging network: a case study on sina weibo, in: Proceedings of the 22nd International Conference on World Wide Web, ACM, 2013, pp. 177–178.
[10] Y. Kim, Convolutional neural networks for sentence classification, 2014, arXiv preprint arXiv:1408.5882.
[11] J. Cheng, L. Adamic, P.A. Dow, et al., Can cascades be predicted?, in: Proceedings of the 23rd International Conference on World Wide Web, ACM, 2014, pp. 925–936.
[12] A. Li, Z. Wu, D. Chen, H. Lu, G. Sun, Collaborative self-regression method with nonlinear feature based on multi-task learning for image classification, IEEE Access 6 (2018) 43513–43525.
[13] R. Crane, D. Sornette, Robust dynamic classes revealed by measuring the response function of a social system, Proc. Natl. Acad. Sci. 105 (41) (2008) 15649–15653.
[14] J. Yang, J. Leskovec, Patterns of temporal variation in online media, in: Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, ACM, 2011, pp. 177–186.
[15] K. Lerman, T. Hogg, Using a model of social dynamics to predict popularity of news, in: Proceedings of the 19th International Conference on World Wide Web, ACM, 2010, pp. 621–630.
[16] Y. Lin, X. Zhu, Z. Zheng, et al., The individual identification method of wireless device based on dimensionality reduction and machine learning, J. Supercomput. (5) (2017) 1–18.
[17] Y. Lin, C. Wang, J. Wang, Z. Dou, A novel dynamic spectrum access framework based on reinforcement learning for cognitive radio sensor networks, Sensors 16 (10) (2016) 1–22.
[18] Y. Tu, Y. Lin, J. Wang, et al., Semi-supervised learning with generative adversarial networks on digital signal modulation classification, CMC-Comput. Mater. Continua 55 (2) (2018) 243–254.
[19] Y. Hu, C. Hu, S. Fu, et al., Predicting the popularity of viral topics based on time series forecasting, Neurocomputing 210 (2016) 55–65.
[20] S. Gao, J. Ma, Z. Chen, Modeling and predicting retweeting dynamics on microblogging platforms, in: Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, ACM, 2015, pp. 107–116.
[21] E. Ferrara, P. De Meo, G. Fiumara, et al., The role of strong and weak ties in facebook: a community structure perspective, 2012, arXiv preprint arXiv:1203.0535.
[22] H. Li, X. Ma, F. Wang, et al., On popularity prediction of videos shared in online social networks, in: Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, ACM, 2013, pp. 169–178.