Developing the Cloud-integrated data replication framework in decentralized online social networks




JID:YJCSS AID:2909 /FLA

[m3G; v1.159; Prn:24/08/2015; 9:07] P.1 (1-17)

Journal of Computer and System Sciences ••• (••••) •••–•••

Contents lists available at ScienceDirect

Journal of Computer and System Sciences www.elsevier.com/locate/jcss

Songling Fu a, Ligang He b,c,∗, Xiangke Liao d, Chenlin Huang d

a College of Polytechnic, Hunan Normal University, Changsha, China
b Department of Computer Science, University of Warwick, Coventry, UK
c School of Computer Science and Electronic Engineering, Hunan University, Changsha, China
d School of Computer Science, National University of Defense Technology, Changsha, China

Article info

Article history:
Received 30 December 2014
Received in revised form 5 May 2015
Accepted 8 June 2015
Available online xxxx

Keywords:
Decentralized online social network
Cloud
Data availability
Data replication
Erasure coding

Abstract

Decentralized Online Social Network (DOSN) services have been proposed to protect data privacy. In DOSN, the data published by a user and their replicas are only stored in the friend circle of the user. Although full replication can improve Data Availability (DA), pure DOSNs may not deliver sustainable DA. This paper proposes a Cloud-assisted data replication and storage scheme, called Cadros, to improve the DA in DOSN. This paper conducts quantitative analysis about the storage capacity of Cadros, and further models and predicts the level of DA that Cadros can achieve. The data in Cadros are partitioned in such a way that the overhead caused by storing the data in the Cloud is minimized while satisfying the desired DA. This paper also proposes the data placement strategies to realize the desired DA and improve other performance. Experiments have been conducted to verify the effectiveness of Cadros.

© 2015 Elsevier Inc. All rights reserved.

1. Introduction

In the last decade, Online Social Networks (OSNs), such as Facebook [18], Twitter and Sina Weibo [19], have gained extreme popularity, with more than a billion users worldwide. OSNs allow a user to publish data to all friends in his friend circle. Currently, OSN platforms are typically centralized: the users store their data on centralized servers deployed by the OSN service providers. The service providers can utilize and analyze these data to learn the users’ private information, such as interests and personal affairs, and in the worst case may sell this information to third parties. Therefore, the current Centralized Online Social Networks (COSNs) have raised serious privacy concerns [14,15,30].

In order to address the data privacy issue, an obvious solution is to encrypt the user data stored in the centralized server [1,2,16,29]. A typical procedure of this solution is as follows [1]. The user data are first encrypted with a secret key, and the secret key is then encrypted with the public keys of the corresponding friends. After a friend receives the encrypted data and secret key, he first decrypts the secret key with his own private key, and then decrypts the user data with the secret key. However, the disadvantage of this encryption solution is that a user may have a large number of friends and may add or delete friends over time, so it is not practical to manage this many keys. Another obvious downside of this approach is that encrypting and decrypting the user data and the secret keys incur high overhead.

*

Corresponding author at: School of Computer Science and Electronic Engineering, Hunan University, Changsha, China. E-mail address: [email protected] (L. He).

http://dx.doi.org/10.1016/j.jcss.2015.06.010 0022-0000/© 2015 Elsevier Inc. All rights reserved.


Therefore, Decentralized Online Social Networks (DOSNs) have been proposed recently as a promising solution to protect data privacy [1–4]. Although the DOSN products [17] are not as popular and mature as the OSN products [18], DOSN is indeed under active research and development. In DOSNs, in order to protect the data privacy the centralized servers are bypassed and the data published by a user are stored and disseminated only among the friend circle of the user [4]. Although DOSNs can help protect the data privacy, maintaining Data Availability (DA) becomes a big challenge [11,29,32]. This is because if a friend of the user is offline, the data stored in the friend cannot be accessed by other friends. In order to achieve good data availability in DOSN, the data replication approach has been widely used [6]. Full replication is a popular replication approach. In this approach, a certain number of copies (e.g., k copies) are created for each data item published by a user and these data replicas are stored across the user’s friends in the DOSN. By doing so, if a friend is offline, the data in this offline friend can be accessed through the replicated data stored in other friends. Consequently, data availability is improved. Although data replication helps improve DA, the following characteristics of DOSN have negative impact on its data availability. First, the friends in DOSN are highly volatile [22,28]. Further, the studies [11,31] show that the online/offline states of individual friends show high correlation, which indicates that many friends in a friend circle may go offline in the same time duration. When this happens, there may not be enough online friends to contribute the sufficient storage to save the data (and their replicas) published by the user. Second, in a typical DOSN, the data published by a user are distributed among the friends in his own friend circle. 
Some friend circles may be small (e.g., with tens of friends), which may also lead to periods in which there are not enough online friends to offer adequate storage for the published data. Finally, more and more data are being generated on OSNs nowadays, while current users often use mobile devices, such as smart phones, to access the OSN services. The storage capacity of mobile devices is much more limited than that of the desktop computers used in the “old fashioned” style of accessing OSNs. Adding even more strain, a mobile device owner typically allows only a small fraction of the total storage capacity of his device to be used by the OSN client app installed on it.

There is now a dilemma. On one hand, using the centralized server to store the published data raises the data privacy concern. On the other hand, using the friend nodes as the only storage facility may raise the data availability concern, although data replication helps improve data availability. In order to further improve data availability while guaranteeing data privacy, this paper proposes a hybrid data replication and storage approach which combines the DOSN with the centralized server. Nowadays, the Cloud has become a popular storage platform. The Cloud is very suitable to be used as the centralized server in this work, because 1) it is available all the time and 2) the storage capacity offered by the Cloud can scale up and down according to the users’ demands. Thus, this work utilizes the Cloud as the centralized server and develops a Cloud-Assisted Data Replication framework in decentralized Online Social networks (Cadros).

Due to the complexity of using encryption to protect data privacy, Cadros employs the erasure coding technique [20] to prevent the Cloud service provider from knowing the content of the stored data. In the erasure coding technique, the original data are split into m data segments, which are then encoded into n new data segments.
Any r of the n encoded segments can be used to reconstruct the original data. Thus, if the number of data segments stored in the storage facility offered by a Cloud service provider is less than r, the Cloud service provider cannot reconstruct and know the original data. The erasure coding technique can also be regarded as a data replication technique, and its redundancy degree is n/m. Therefore, Cadros effectively employs two data replication techniques: all data replicas generated by full replication are stored in the friend circle, while fewer than r of the data segments generated by erasure coding are stored in the Cloud. Erasure coding saves storage space when n/m is less than k, the number of data replicas generated for each data item in full replication, which is the typical case.

The first contribution of this work is a quantitative analysis of the amount of data that Cadros can store as the result of combining the Cloud and erasure coding with DOSN.

In order to help achieve the desired DA, it is very useful to predict the behavior of the user and the friends in the DOSN, and to make judicious replication and storage decisions in advance based on the prediction. The second contribution of this work is to analyze the probabilistic behavior of the friend circle in the DOSN and predict the values of two metrics at a future time point: i) the storage capacity that the friend circle can contribute and ii) the amount of data that the friends request to update. Further, this work models the relation between these two metric values and DA, and consequently predicts the level of DA that the Cloud-assisted DOSN system can achieve at a future time point.

As discussed above, erasure coding can save storage space. However, it incurs overhead for coding and reconstructing the data: the more data are stored using erasure coding, the higher the overhead. Ideally, the overhead should be minimized.
The third contribution of this work is to develop a data partition scheme in terms of the replication techniques, i.e., to decide which portion of the published data should be stored using full replication and which using erasure coding, so that the erasure coding overhead is minimized while satisfying the desired level of DA.

The DA prediction in the second contribution only indicates that the hybrid system has the capacity to achieve a certain level of DA. Realizing that DA still depends on the underlying data placement strategy, which determines how to place the newly published data replicas among the friends. If a poor placement strategy places the data replicas on friends who are unlikely to be still online at the targeted future time point t′, then the desired level of DA will not be realized at t′ even if the hybrid system has such ability based on our probabilistic


analysis. The final contribution of this work is to develop the placement strategy for data replicas so that the predicted DA can be realized. Under the condition of satisfying the data availability, this work further proposes a number of heuristic placement strategies to optimize other performance metrics in Cadros, such as data repair cost and data access performance.

The rest of this paper is organized as follows. Section 2 discusses related work. Section 3 analyzes the storage capacity that Cadros can provide. Section 4 presents the methods to model and predict the DA. Section 5 presents the DA-driven data replication and storage scheme. Section 6 presents the data placement strategies. Section 7 presents the experimental results. Finally, Section 8 concludes the paper.

2. Related work

2.1. DOSN

To address the data privacy problem in COSNs, several decentralized approaches have been proposed [1–4]. Graffi et al. [1] advocate that online social networks will be the next main application field for the p2p paradigm. They proposed a secure, P2P-based solution for online social networks called LifeSocial.KOM, which provides the functionality of common online social networks in a totally distributed and secure manner. LifeSocial.KOM is a plugin-based solution. Any data object that is not public is first encrypted with a symmetric cryptographic key. This symmetric key is encrypted individually with the public key of every read-enabled user. The list of encrypted symmetric keys as well as the object itself are signed by the owner of the object and stored in the p2p overlay.

Buchegger et al. [2] proposed a decentralized, peer-to-peer approach coupled with encryption. Their paper focuses on describing how peer-to-peer networking can provide OSN features. The main means of privacy protection is to allow users to encrypt their data. They assume the availability of a public-key infrastructure (PKI) with the possibility of key revocation.
The data are encrypted with the public keys of the intended audience. Extra-network contacts are used to establish and verify credentials. Yeung et al. [3] adopted a decentralized approach that uses URIs as identifiers throughout, which can provide the same (or even higher) level of user interaction as many of the current popular OSN sites. Tandukar et al. [4] also proposed a decentralized OSN. With this approach, users can maintain control over their data to protect their data privacy, and forward the social data selectively to reduce the irrelevant data among the users. None of these approaches stores the data published by a user only in his friend circle.

2.2. Data availability

The existing work on improving data availability mainly focuses on designing smart data replication and data storage policies. Shakimov et al. [5] propose three schemes for storing the data in DOSNs: the cloud-based scheme, the desktop-based scheme, and the hybrid scheme combining the two. In the cloud-based scheme, the data are stored in the cloud servers. In the desktop-based scheme, two mechanisms may be used: i) the data replicas are encrypted when they are stored in potentially untrusted hosts; ii) the users take advantage of the trust embedded in the social network to store the data replicas on trustworthy friends. The drawbacks of these mechanisms come from the complexity and overhead of encryption key or trust management. The approach proposed by Koll et al. [6] exchanges recommendations among socially related nodes in order to effectively distribute a user’s data replicas among eligible nodes carefully selected in the OSN. In the approach developed by Olteanu et al. [7], the candidate nodes are prioritized when selecting where to store the data (and their replicas) published by a user. The online friends of the user have the highest priority.
When all friends are offline, the data are then stored in nodes which are not in the user’s friend circle. Buchegger et al. designed a two-tiered DOSN architecture (PeerSoN) [2]. One tier serves as a look-up service which is implemented by OpenDHT. The second tier consists of the peers and contains the user data. When a user is offline, all his data will be stored across the whole network. Cutillo et al. [8] propose a P2P-based DOSN (Safebook), in which each node is accessible through the so-called shells. The profile data is mirrored and stored in a subset of a node’s direct contacts, which forms the so-called innermost shell. The data retrieval requires traversing the shells along a path of nodes that are online and are friends with each other. Tegeler et al. [29] propose an approach called Gemstone. Gemstone protects the user’s privacy by encrypting all data using ABE, and stores the user’s data in the so-called Data Holding Agents (DHAs). If a DHA itself is offline, the data have to be passed to the offline DHA’s DHAs.

Our previous work also investigated the data availability in DOSN [33]. In [33], we established the relation between the total storage size contributed by the mobile devices in a DOSN and the average data availability of the DOSN in the long run, based on which a method is proposed to predict the average data availability of the DOSN at a time point in the near future by predicting the total storage size of the DOSN. The work in [33] is useful for DOSN designers to determine system parameters, such as the storage capacity of the DOSN, in order to achieve the desired level of data availability of the DOSN. Different from [33], this work predicts the data availability that a friend node can achieve (instead of the average data availability of the whole DOSN) by predicting both the size of the data stored in the DOSN and the data that this friend node needs to update when it goes online (presented in Section 4).
Also, this work takes into account the heterogeneity of


the storage capacity in individual mobile devices when predicting the data availability, while the work in [33] assumes that each mobile device contributes the same amount of storage space.

Further, this work makes the following contributions that the work in [33] did not address. First, this work proposes the scheme, Cadros, to improve the data availability by integrating the Cloud server into the DOSN. In Cadros, the erasure coding technique is employed to store the data in the Cloud server in order to preserve data privacy, while the data are stored in the friend circle using full replication. This work conducts a quantitative analysis of the amount of data that Cadros can store in this new hybrid data replication scheme (Section 3). Second, since erasure coding incurs higher overhead than the full replication method, this work develops a data partition scheme in terms of the replication techniques, i.e., it decides which portion of the published data should be stored using full replication and which using erasure coding, so that the erasure coding overhead is minimized while satisfying the desired level of DA (Section 5). Finally, this work develops the DA-driven placement strategies for data replicas (i.e., it decides which friend nodes the newly published data and their replicas should be stored in), so that the predicted DA can be realized (Section 6).

2.3. Cloud-assisted p2p systems

In 2006, Li and Dabek [9] argued that a node should choose its neighbors (the nodes with which it shares resources) based on existing social relationships instead of randomly when deploying a distributed storage infrastructure in peer-to-peer systems. The system is called the F2F storage system, in which nodes restrict themselves to sharing storage and network resources only with their friends. The authors argue that the F2F system provides the incentives for the nodes to cooperate with each other, which results in a more stable system.
Based on this idea, they later proposed a cooperative online backup system named Friendstore [10], which allows the users to back up their data onto trusted nodes (i.e., their friends and colleagues). In 2011, Sharma et al. [31] argued that the limitation of storing data only on friends can come to the detriment of data availability, and showed that the problem of obtaining maximal availability while minimizing redundancy is NP-complete. In 2012, Gracia-Tinedo et al. [11] showed that pure F2F storage systems present a poor QoS, mainly due to the availability correlations, and proposed a hybrid architecture called F2Box to combine F2F storage systems and the cloud storage services. F2Box uses erasure coding to replicate the data and allows users to adjust the amount of redundancy according to the availability patterns exhibited by friends. Compared with F2Box, our work uses the combination of full replication and erasure coding and proposes a scheme to minimize the overhead caused by erasure coding while satisfying the desired level of QoS. Also, our work quantitatively analyzes the storage capacity that the hybrid data replication and storage scheme in Cadros can provide. Moreover, our work develops the method to predict the data availability as well as the data placement strategies to realize the predicted data availability.

In 2011, Liu et al. [12] presented Confidant, a decentralized OSN designed to support a scalable application framework for the OSN data without compromising users’ privacy. Because a user’s data are replicated on the trusted servers controlled by his/her friends, Confidant allows the application code to run directly on these storage servers, and eliminates write conflicts for the weakly-consistent replicated data through a lightweight cloud-based state manager. FS2You [13], presented by Sun et al.
in 2009, is a large-scale and real-world online storage system with peer assistance and semi-persistent file availability, which can dramatically mitigate server bandwidth costs. P2P storage systems provide no guarantees on file availability, while the cloud-based online storage systems are able to make such guarantees, at the prohibitive cost of server bandwidth and storage. FS2You stores the data in the cloud and the peers. Peers are allowed to request help from the cloud, but only when any of the following three conditions holds: 1) there are currently active partners; 2) none of the active partners hold the desired block; 3) the aggregate download rate from active partners. FS2You can achieve a reasonable and balanced tradeoff between the p2p storage system and the cloud storage system.

In 2013, Mega et al. [16] proposed a cloud-assisted dissemination approach in social overlays. In this approach, updates to the user’s profile are always performed first on the profile store, which is encrypted and hosted in the cloud, and are then disseminated via the social overlay. When updates from the user are not received by a friend for a long time, the cloud serves as an external channel to verify their presence. The outcome is disseminated to all friends in a P2P fashion, quenching cloud access from other friends.

3. Analyzing the storage capacity of Cadros

Although it is assumed that the Cloud server can provide unlimited storage capacity, in order to protect the data privacy the number of segments stored in the Cloud server for a data item must be less than r under erasure coding. Moreover, both full replication and erasure coding can be used to replicate the data, and the redundancy degrees of these two techniques are different. This section conducts a quantitative analysis of the amount of published data that Cadros is able to store, as the result of combining the Cloud with DOSN.
Assume that k copies are generated for each data item under full replication, and that in erasure coding, the original data are split into m segments, which are then encoded into n new segments, and any r encoded segments can be used to reconstruct the original data.
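The (m, n, r) notation can be made concrete with a small sketch. This is not the coding scheme used in Cadros, which the text leaves to [20]; it assumes a Reed-Solomon-style polynomial code over the prime field GF(257) in which r equals m, and the function names and parameter values are purely illustrative. The m data symbols fix a polynomial of degree below m, the n segments are its values at n other points, and any m segments recover the data by Lagrange interpolation.

```python
P = 257  # prime modulus (assumption); every byte value 0..255 is a field element


def lagrange_eval(points, x):
    """Evaluate at x the unique polynomial of degree < len(points)
    that passes through the given (xi, yi) pairs, working modulo P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        # pow(den, -1, P) is the modular inverse (Python 3.8+)
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def encode(data, n):
    """Encode m data symbols into n segments, returned as (index, value) pairs."""
    # The data symbols sit at x = n+1 .. n+m, so no segment is a raw data symbol.
    anchors = [(n + 1 + i, d) for i, d in enumerate(data)]
    return [(x, lagrange_eval(anchors, x)) for x in range(1, n + 1)]


def decode(segments, m, n):
    """Rebuild the m data symbols from any m of the n segments."""
    if len(segments) < m:
        raise ValueError("fewer than r = m segments: cannot reconstruct")
    chosen = segments[:m]
    return [lagrange_eval(chosen, n + 1 + i) for i in range(m)]


data = [72, 105, 33, 7]        # m = 4 symbols
segments = encode(data, n=6)   # n = 6 segments, redundancy degree n/m = 1.5
recovered = decode([segments[5], segments[0], segments[3], segments[2]], m=4, n=6)
```

In this sketch a holder of fewer than m segments cannot determine the polynomial, which mirrors the requirement that the Cloud stores fewer than r segments of any data item.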


$D$ denotes the amount of published data stored by Cadros. $D_f$ and $D_e$ denote the parts of published data that are stored using full replication and erasure coding, respectively. Then Eq. (1) holds.

$$D = D_f + D_e \tag{1}$$

Since $k$ replicas are generated under full replication, the amount of data replicas generated by full replication is $k \cdot D_f$. All these data replicas have to be stored in the friend circle for the sake of data privacy. After applying erasure coding, the amount of data replicas generated is $\frac{n}{m} \cdot D_e$. Assume that the actual number of segments stored in the Cloud server is $x$. Then $(n - x)$ is the number of segments stored in the friend circle. The amount of erasure-coded data stored in the friend circle can be calculated by

$$\frac{n}{m} \cdot D_e \cdot \frac{n - x}{n} = \frac{n - x}{m} \cdot D_e.$$

$SS$ denotes the total storage space contributed by the friends in the DOSN. Then Eq. (2) holds.

$$SS = k \cdot D_f + \frac{n - x}{m} \cdot D_e \tag{2}$$

Eq. (2) can be transformed to Eq. (3).

$$D_f = \frac{1}{k} \cdot SS - \frac{n - x}{k \cdot m} \cdot D_e \tag{3}$$

Substituting $D_f$ from Eq. (3) into Eq. (1), Eq. (1) becomes Eq. (4).

$$D = \frac{k \cdot m - (n - x)}{k \cdot m} \cdot D_e + \frac{1}{k} \cdot SS \tag{4}$$
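As a quick numerical check of Eqs. (1) to (4), the following sketch evaluates them for one partition of the published data. The parameter values are illustrative assumptions, not values taken from the paper, and the function names are hypothetical.

```python
def full_replication_share(ss, k, m, n, x, d_e):
    """Eq. (3): the amount D_f that full replication can hold in what is left of SS."""
    return ss / k - (n - x) / (k * m) * d_e


def published_capacity(ss, k, m, n, x, d_e):
    """Eq. (4): total published data D = D_f + D_e."""
    return (k * m - (n - x)) / (k * m) * d_e + ss / k


# Illustrative values: SS in arbitrary units, x = 3 segments in the Cloud
# (x is assumed to be below r).
ss, k, m, n, x = 1200.0, 3, 4, 6, 3
d_e = 200.0

d_f = full_replication_share(ss, k, m, n, x, d_e)
d = published_capacity(ss, k, m, n, x, d_e)

# Eq. (2) rebuilt from the parts: storage actually consumed in the friend circle.
ss_used = k * d_f + (n - x) / m * d_e
```

With these numbers `d_f + d_e` equals `d`, and `ss_used` equals `ss`, so the three equations agree with one another.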

From Eq. (4), we can draw the following conclusions.

i) Typically, $k > \frac{n}{m}$, i.e., the redundancy of full replication is greater than that of erasure coding. In Eq. (4), therefore, $k \cdot m - (n - x) > 0$, which means that $D$ increases linearly as $D_e$ increases. Thus, when $D_e$ is 0, i.e., all data are replicated using full replication, Eq. (4) obtains the minimal value, which is $\frac{1}{k} \cdot SS$. When $D_e$ is $D$, i.e., all data are replicated using erasure coding, Eq. (4) obtains the maximum value, which is $\frac{m}{n - x} \cdot SS$.

ii) Since $k \cdot m - (n - x) > 0$, $D$ increases as $x$ increases. Since $x$ must be less than $r$, the maximum of Eq. (4) is $\frac{m}{n - r + 1} \cdot SS$.

Therefore, the range of the amount of published data that Cadros is able to store is

$$\left[ \frac{1}{k} \cdot SS,\ \frac{m}{n - r + 1} \cdot SS \right] \tag{5}$$
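The range in Eq. (5) can be evaluated directly. The sketch below uses assumed parameter values and a hypothetical function name; it also reports the ratio between the two bounds, which equals $k \cdot m / (n - r + 1)$.

```python
def capacity_range(ss, k, m, n, r):
    """Eq. (5): (minimum, maximum) amount of published data Cadros can store."""
    if not 0 < r <= n:
        raise ValueError("expected 0 < r <= n")
    lower = ss / k                 # D_e = 0: everything under full replication
    upper = m / (n - r + 1) * ss   # D_e = D with x = r - 1 segments in the Cloud
    return lower, upper


# Illustrative values only.
low, high = capacity_range(ss=1200.0, k=3, m=4, n=6, r=4)
expansion = high / low             # equals k * m / (n - r + 1)
```

For these values the lower bound is about 400 units and the upper bound about 1600 units, a fourfold difference for the same SS.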

It is straightforward to see that when the friend circle is the only storage facility and only full replication is used to generate the data replicas, the amount of published data is $\frac{1}{k} \cdot SS$. From the above discussion, it can be seen that Cadros essentially increases the amount of published data that a given SS can support, thanks to the extra storage capacity provided by the Cloud. However, it should be noted that for a user to publish a data item successfully on his mobile device, the precondition is that the user must have enough storage capacity to store his own data; otherwise the old data have to be deleted to make room for the new data. This is a reasonable condition since we are designing a friend-to-friend decentralized social network system. Also note that the storage capacity the user uses to hold the published data, which is typically only limited by the physical storage capacity of the mobile device, is different from the storage capacity that the friends of the user are willing to contribute. In case the total storage capacity contributed by all friends of the user plus the extended Cloud storage is not enough to accommodate the published data and their replicas, we can set the overwriting strategy in which the newly published data overwrite the oldest data. But as long as the overwritten data are still in the mobile device of the user, the friends can update them from the user if necessary. Namely, the overwritten data in the user’s friend circle are not lost as long as they exist in the mobile device of the user.

Another point to note is that it is possible to use multiple Clouds to store the erasure coded data. However, using multiple Clouds will not further increase the total storage capacity of Cadros. As we have discussed above, the total storage capacity of Cadros can be increased up to $\frac{m}{n - r + 1} \cdot SS$, according to Eq. (5). The increased storage capacity can be located on a single Cloud or across multiple Clouds.
The factor that limits the total storage capacity of Cadros is SS, i.e., the total storage capacity contributed by friends, not the capacity in the Cloud even if there is only one Cloud, since it is assumed that a Cloud is able to provide unlimited storage capacity.
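The overwriting strategy mentioned above, in which newly published data overwrite the oldest data once the available space is exhausted, can be sketched as a bounded first-in-first-out store. The class name, the capacity and the item sizes are illustrative assumptions; the paper does not prescribe an implementation.

```python
from collections import deque


class OverwritingStore:
    """Bounded store in which new items evict the oldest ones when space runs out."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0
        self.items = deque()  # (publish_time, size) pairs, oldest on the left

    def publish(self, publish_time, size):
        if size > self.capacity:
            raise ValueError("item larger than the whole store")
        # Evict the oldest items until the new one fits.
        while self.used + size > self.capacity:
            _, old_size = self.items.popleft()
            self.used -= old_size
        self.items.append((publish_time, size))
        self.used += size

    def held_span(self):
        """(oldest, newest) publish times still held, or None if the store is empty."""
        if not self.items:
            return None
        return self.items[0][0], self.items[-1][0]


store = OverwritingStore(capacity=10)
for t, size in [(1, 4), (2, 4), (3, 4), (4, 2)]:
    store.publish(t, size)
```

After the four publications the item published at time 1 has been evicted, so the store holds the items published at times 2 to 4.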


Fig. 1. Illustration of the data availability problem.

4. Modeling and predicting the data availability

Section 3 analyzes the amount of published data that Cadros can store given the storage capacity of the friend circle, $SS$. This section presents the method of modeling and predicting the level of Data Availability (DA).

In the DOSN system assumed in this paper, all nodes log in to and out of the OSN service dynamically, and the online and offline durations of these nodes follow certain probability distributions. The user publishes data following a certain stochastic process (e.g., a Poisson process). When a friend logs in, he tries to update the data that the user published after the friend last logged out.

The DA problem is illustrated in Fig. 1. In Fig. 1, the user publishes data at a series of time points along the time line. Assume $t_1$ is the first time point when he publishes data, $Data_1$, after he comes online, and $t_k$ is the last time point at which the user publishes data, $Data_k$, before he goes offline at the time point $t^{u}_{out}$. Now consider one of the friends in the user’s friend circle. Assume that the friend goes offline at time point $t^{f}_{out}$, just before the user publishes $Data_{k'}$ (and after the user publishes $Data_{k'-1}$), and then comes online at time point $t^{f}_{in}$, after the user goes offline. Therefore, $Data_{k'}$ to $Data_k$ are the data that the friend missed while he was offline and consequently needs to update when he comes online. Since the user is already offline, the friend can only update the missed data from other online friends where the data replicas are stored, or reconstruct the missed data from the data segments stored in the Cloud server and/or other online friends. Note that if the friend comes online before the user goes offline, the friend can update all missed data from the user directly, and data availability is not a problem under this circumstance.

When a friend comes online, assume that the total amount of data that the friend tries to update is $D_{up}$. Out of $D_{up}$, the amount of data that are stored in Cadros is $D_{st}$. The level of DA for the friend is defined as Eq. (6).

$$DA = \frac{D_{st}}{D_{up}} \tag{6}$$
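Eq. (6) can be illustrated with a few lines of code. The item names, the sizes and the convention for an empty update set are assumptions made for this sketch, not details given in the paper.

```python
def data_availability(missed, stored_ids):
    """Eq. (6): DA = D_st / D_up for one friend.

    missed:     dict mapping item id -> size of each item the friend must update
                (the sizes sum to D_up)
    stored_ids: set of item ids that are still retrievable from Cadros
    """
    d_up = sum(missed.values())
    if d_up == 0:
        return 1.0  # nothing to update; treated as fully available (assumption)
    d_st = sum(size for item, size in missed.items() if item in stored_ids)
    return d_st / d_up


# The friend missed three items; the oldest one has already been overwritten.
missed = {"Data5": 2.0, "Data6": 1.0, "Data7": 1.0}
stored = {"Data6", "Data7", "Data8"}
da = data_availability(missed, stored)
```

Here two of the four missed units are still stored, so the friend's DA is 0.5.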

The data replication frameworks typically work in the following way [6,10]. When the user publishes a data item, the data replicas are created and stored in the storage pools, which are either the online friends or the Cloud server in this work. If the storage capacity is unlimited, the newly published data will just be added. If the storage capacity is limited and the storage space is already full, the oldest data in the storage space will be replaced with the new data. Therefore, the storage capacity determines the period of the data that are stored in the pool, which is called the time window of the stored data in this paper. The time window affects the DA. For example, in Fig. 1, if the storage space can only store the data published from $t_k$ back to $t_{k''}$, i.e., the time window of the stored data is $[t_{k''}, t_k]$, then the data published earlier than $t_{k''}$ are not available to the friend who goes offline at $t^{f}_{out}$ and comes online at $t^{f}_{in}$.

One of the aims of this work is to model and predict the DA at a future time point $t'$. Eq. (6) shows that in order to predict the DA, we need to predict $D_{st}$ and $D_{up}$. This section presents the methods to predict $D_{st}$ (Section 4.1) and $D_{up}$ (Section 4.2), respectively.

4.1. Predicting $D_{st}$

According to Eq. (5) in Section 3, the amount of data that Cadros can store depends on $SS$, the total storage capacity that the friend circle can provide (all other parameters in Eq. (5) are constants). Therefore, in order to predict $D_{st}$ at time $t'$, we have to predict $SS$ at $t'$, which is denoted by $SS(t')$. The focus of this subsection is to predict $SS(t')$, given the state of the friend circle at the current time $t$.

Given the current time $t$, it can be determined which friends are online or offline. For an online friend $v_i$ at time $t$, we can know the time point at which $v_i$ logged in (i.e., became online), which is denoted by $t^{fon}_{in\_i}$ (“fon” means “online friend”, “in” means “login”). Note that in all notations in this paper, the superscript represents the role (e.g., “u” for “user”, “foff” for “offline friend”) and the subscript represents the action of the role or status (e.g., “in” for “login”, “out” for “logout”, “up” for “update data”, “off” or “on” for being in the “offline” and “online” state).


The existing work [21–25] has extensively studied the patterns of user behavior in OSNs, such as the access frequency, the online and offline durations, and the total time spent on OSNs. These patterns can be expressed by probability distributions. In this paper, we assume that the probability distributions for online and offline durations are already known.

The probability that the online friend $v_i$ does not go offline before $t'$ equals the probability that $v_i$ logs out only after $t'$ (i.e., $v_i$'s logout time, denoted by $t^{fon}_{out\_i}$, is greater than $t'$). This probability, denoted by $p(t^{fon}_{out\_i} > t')$, in turn equals the probability that $v_i$'s online duration is greater than $(t' - t^{fon}_{in\_i})$ under the condition that $v_i$'s online duration is no less than $(t - t^{fon}_{in\_i})$. It can be computed using the conditional probability shown in Eq. (7), where $t^{fon}_{on\_i}$ is the time duration of friend $v_i$ being online continuously and $F^{fon}_{on\_i}$ is the probability distribution function of $t^{fon}_{on\_i}$. The condition $(t^{fon}_{on\_i} \ge t - t^{fon}_{in\_i})$ in Eq. (7) reflects the fact that $v_i$ has been staying online for the duration of $(t - t^{fon}_{in\_i})$.

$$
\begin{aligned}
p\left(t^{fon}_{out\_i} > t'\right)
&= p\left(t^{fon}_{on\_i} > t' - t^{fon}_{in\_i} \,\middle|\, t^{fon}_{on\_i} \ge t - t^{fon}_{in\_i}\right) \\
&= \frac{p\left(t^{fon}_{on\_i} > t' - t^{fon}_{in\_i}\right)}{p\left(t^{fon}_{on\_i} > t - t^{fon}_{in\_i}\right)} \\
&= \frac{1 - F^{fon}_{on\_i}\left(t' - t^{fon}_{in\_i}\right)}{1 - F^{fon}_{on\_i}\left(t - t^{fon}_{in\_i}\right)}
\end{aligned}
\tag{7}
$$

Similarly, the probability that an offline friend $v_j$ becomes online at $t'$ equals the probability that it logs in before $t'$, which can be computed using Eq. (8), where $t^{foff}_{off\_j}$ is the time duration of the offline friend $v_j$ being offline continuously and $F^{foff}_{off\_j}$ is the probability distribution function of $t^{foff}_{off\_j}$.

$$
\begin{aligned}
p\left(t^{foff}_{in\_j} \le t'\right)
&= p\left(t^{foff}_{off\_j} \le t' - t^{foff}_{out\_j} \,\middle|\, t^{foff}_{off\_j} \ge t - t^{foff}_{out\_j}\right) \\
&= \frac{p\left(t - t^{foff}_{out\_j} \le t^{foff}_{off\_j} \le t' - t^{foff}_{out\_j}\right)}{p\left(t^{foff}_{off\_j} \ge t - t^{foff}_{out\_j}\right)} \\
&= \frac{F^{foff}_{off\_j}\left(t' - t^{foff}_{out\_j}\right) - F^{foff}_{off\_j}\left(t - t^{foff}_{out\_j}\right)}{1 - F^{foff}_{off\_j}\left(t - t^{foff}_{out\_j}\right)}
\end{aligned}
\tag{8}
$$

$S_i$ denotes the storage capacity contributed by friend $v_i$. Eq. (7) and Eq. (8) calculate the probability that a friend, either online or offline at the current time $t$, will be online at time $t'$. Therefore, the expectation of the storage capacity contributed by the friend circle, i.e., $SS(t')$, can be calculated by Eq. (9).

$$
\begin{aligned}
SS(t')
&= \sum_{i=1}^{N_{on}} S_i \cdot p\left(t^{fon}_{out\_i} > t'\right) + \sum_{j=1}^{N_{off}} S_j \cdot p\left(t^{foff}_{in\_j} \le t'\right) \\
&= \sum_{i=1}^{N_{on}} S_i \cdot \frac{1 - F^{fon}_{on\_i}\left(t' - t^{fon}_{in\_i}\right)}{1 - F^{fon}_{on\_i}\left(t - t^{fon}_{in\_i}\right)}
 + \sum_{j=1}^{N_{off}} S_j \cdot \frac{F^{foff}_{off\_j}\left(t' - t^{foff}_{out\_j}\right) - F^{foff}_{off\_j}\left(t - t^{foff}_{out\_j}\right)}{1 - F^{foff}_{off\_j}\left(t - t^{foff}_{out\_j}\right)}
\end{aligned}
\tag{9}
$$
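The prediction in Eqs. (7)–(9) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the tuple layout of the friend lists, and the use of one shared distribution function for all friends are hypothetical, and the exponential distribution is just one of the duration distributions mentioned in the evaluation.

```python
import math

def exp_cdf(rate):
    """CDF of an exponentially distributed duration (assumed distribution)."""
    return lambda d: 1.0 - math.exp(-rate * d) if d > 0 else 0.0

def p_stays_online(F_on, t, t_prime, t_in):
    """Eq. (7): probability that a friend online since t_in logs out after t_prime."""
    return (1.0 - F_on(t_prime - t_in)) / (1.0 - F_on(t - t_in))

def p_comes_online(F_off, t, t_prime, t_out):
    """Eq. (8): probability that a friend offline since t_out logs in by t_prime."""
    base = F_off(t - t_out)
    return (F_off(t_prime - t_out) - base) / (1.0 - base)

def expected_ss(online, offline, F_on, F_off, t, t_prime):
    """Eq. (9): expected storage capacity of the friend circle at t_prime.

    online  -- list of (S_i, t_in_i) for friends online at time t
    offline -- list of (S_j, t_out_j) for friends offline at time t
    """
    ss = sum(s * p_stays_online(F_on, t, t_prime, t_in) for s, t_in in online)
    ss += sum(s * p_comes_online(F_off, t, t_prime, t_out) for s, t_out in offline)
    return ss
```

For example, `expected_ss([(4.0, 20)], [(2.0, 25)], exp_cdf(0.1), exp_cdf(0.2), 31, 41)` predicts the capacity ten time units ahead for one online and one offline friend. Setting `t_prime` equal to `t` returns the capacity of the currently online friends only, as expected.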

Further, $D_{st}(t')$ can be determined from Eq. (5) in Section 3.

4.2. Predicting $D_{up}$

A friend needs to update the data through Cadros only when both of the following situations occur.

i) The friend $v_j$ is offline at time $t$ but comes online at time $t'$, and $t^{foff}_{out\_j}$ is earlier than $t^{u}_{out}$.

ii) The user is offline at time $t'$ (otherwise, the friend can update the data directly from the user).

Situation ii) can be further divided into two cases: 1) the user is online at the current time $t$, but becomes offline at time $t'$, and 2) the user is offline at $t$, and remains offline at $t'$. We now present the methods to calculate $D_{up}$ for both cases.

Case 1: the user is online at the current time $t$.

As discussed for Fig. 1, when the friend $v_j$, who is offline at the current time $t$, comes online at $t'$, $v_j$ needs to update the data only when $t^{foff}_{out\_j}$ is earlier than $t^{u}_{out}$. The time window of these data is $[t^{foff}_{out\_j}, t^{u}_{out}]$. $tw_{up\_j}(t', t^{u}_{out})$ denotes the length of the time window of the data that the offline friend $v_j$ has to update when $v_j$ comes online at $t'$ and the user's last logout time is $t^{u}_{out}$. $tw_{up\_j}(t', t^{u}_{out})$ can be calculated using Eq. (10).

$$
tw_{up\_j}\left(t', t^{u}_{out}\right) = t^{u}_{out} - t^{foff}_{out\_j}
\tag{10}
$$


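Note that $t^{u}_{out}$ is unknown at $t$ and can be any time point in the time interval $[t, t']$. Therefore, the expectation of $tw_{up\_j}(t', t^{u}_{out})$, denoted by $tw_{up\_j}(t')$, can be calculated using Eq. (11), where $f(t^{u}_{out})$ is the probability density function (pdf) of $t^{u}_{out}$.

$$
\begin{aligned}
tw_{up\_j}(t')
&= E\left(tw_{up\_j}\left(t', t^{u}_{out}\right)\right) \\
&= \int_{t}^{t'} tw_{up\_j}\left(t', t^{u}_{out}\right) \cdot f\left(t^{u}_{out}\right) \cdot dt^{u}_{out} \\
&= \int_{t}^{t'} \left(t^{u}_{out} - t^{foff}_{out\_j}\right) \cdot f\left(t^{u}_{out}\right) \cdot dt^{u}_{out}
\end{aligned}
\tag{11}
$$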

$x_{pu}(t)$ denotes the number of times that the user publishes data in the time duration $t$. $x_{pu}(t)$ is a discrete random variable. $f(x_{pu}(t))$ denotes the probability density function of $x_{pu}(t)$. $a$ denotes the average size of the data published by the user each time. The methods of analyzing and obtaining $f(x_{pu}(t))$ and $a$ have been presented in the literature [22]. $s_{pu}(t)$ denotes the total size of the data published by the user during $t$. Clearly, $s_{pu}(t) = a \cdot x_{pu}(t)$. Therefore, the probability density function of $s_{pu}(t)$, denoted by $f(s_{pu}(t))$, can be determined by Eq. (12) and the expectation of $s_{pu}(t)$ can be calculated by Eq. (13).

$$
f\left(s_{pu}(t)\right) = a \cdot f\left(x_{pu}(t)\right)
\tag{12}
$$

$$
E\left(s_{pu}(t)\right) = a \cdot E\left(x_{pu}(t)\right) = a \cdot \sum_{x=1}^{+\infty} x_{pu}(t) \cdot f\left(x_{pu}(t)\right)
\tag{13}
$$

Substituting the time duration $t$ in Eq. (13) with $tw_{up\_j}(t')$ from Eq. (11), we obtain $D_{up\_j}(t')$, i.e., the amount of data that the offline friend $v_j$ needs to update when he comes online at time $t'$, as expressed in Eq. (14).

$$
D_{up\_j}(t') = E\left(s_{pu}\left(tw_{up\_j}(t')\right)\right)
\tag{14}
$$
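As a rough numerical illustration of Eqs. (11)–(14), the sketch below integrates Eq. (11) with the trapezoid rule and then converts the resulting window into an expected data amount. The function names are hypothetical, the logout-time pdf is supplied by the caller, and the linear form $E(s_{pu}(tw)) = a \cdot \lambda \cdot tw$ assumes that publications follow a Poisson process with rate $\lambda$ (the model used in the evaluation), so it should be read as one possible instantiation rather than the general case.

```python
def expected_update_window(t, t_prime, t_out_friend, pdf_user_logout, steps=1000):
    """Eq. (11): integrate (u - t_out_friend) * f(u) for u in [t, t_prime].

    pdf_user_logout is the pdf of the user's logout time (caller-supplied).
    The trapezoid rule with `steps` intervals is used for the integral.
    """
    h = (t_prime - t) / steps
    total = 0.0
    for i in range(steps + 1):
        u = t + i * h
        weight = 0.5 if i in (0, steps) else 1.0  # end points count half
        total += weight * (u - t_out_friend) * pdf_user_logout(u)
    return total * h

def expected_update_amount(window, a, publish_rate):
    """Eqs. (13)-(14) under the Poisson assumption: E[s_pu(tw)] = a * rate * tw."""
    return a * publish_rate * window
```

For instance, with a purely illustrative uniform logout pdf on $[31, 41]$ and a friend who logged out at time 25, `expected_update_window(31, 41, 25, lambda u: 0.1)` gives the mean window length, which `expected_update_amount` then scales by the average item size and the publication rate.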

Case 2: the user is offline at time $t$.

The procedure for calculating $D_{up\_j}(t')$ for Case 2 is the same as that for Case 1. Only Eq. (11) in Case 1 should be re-derived, because $t^{u}_{out}$ is known if the user is offline at time $t$, i.e., $t^{u}_{out}$ is a constant. Therefore $tw_{up\_j}(t')$ should be calculated using Eq. (15).

$$
tw_{up\_j}(t') = t^{u}_{out} - t^{foff}_{out\_j}
\tag{15}
$$

4.3. Predicting DA

With the results of $D_{st}(t')$ and $D_{up\_j}(t')$, the level of DA that the offline friend $v_j$ can achieve when he comes online at $t'$, denoted by $DA_j(t')$, can be calculated using Eq. (16).

$$
DA_j(t') =
\begin{cases}
100\% & D_{st}(t') \ge D_{up\_j}(t') \quad \text{(a)} \\[4pt]
\dfrac{D_{st}(t')}{D_{up\_j}(t')} & D_{st}(t') < D_{up\_j}(t') \quad \text{(b)}
\end{cases}
\tag{16}
$$

Since $D_{st}(t')$ can be any value in $\left[\frac{1}{k} \cdot SS, \frac{m}{n-r+1} \cdot SS\right]$ according to Eq. (5), $DA_j(t')$ calculated by Eq. (16) falls in the corresponding range.

5. DA-driven data replication and storage in Cadros

Section 4 models and predicts the data availability. This section presents a DA-driven data replication and storage scheme based on the results in Section 4. In the DA-driven scheme, the data are partitioned in terms of the data replication and storage techniques (i.e., deciding which data should be replicated using full replication or erasure coding and stored in the friend circle or the Cloud server) according to the desired level of DA for the system. The objective of the DA-driven scheme is to minimize the overhead caused by erasure coding while trying to satisfy the desired level of DA at a future time point $t'$.

Assume that the desired level of DA is $p_t$ at $t'$. In order to achieve such data availability, the amount of data that has to be stored for the friend $v_j$ who is offline at the current time $t$ and comes online at $t'$, denoted by $D_{st\_j}(t')$, can be calculated using Eq. (17) according to Eq. (16b).

$$
D_{st\_j}(t') = D_{up\_j}(t') \cdot p_t
\tag{17}
$$


$D_{st}(t')$ is the expectation of $D_{st\_j}(t')$ over all such friends, which can be calculated using Eq. (18).

$$
D_{st}(t') = E\left(D_{st\_j}(t')\right) = \sum_{j=1}^{N_{off}} \frac{1}{N_{off}} \cdot D_{up\_j}(t') \cdot p_t
\tag{18}
$$

If $D_{st}(t')$ is no more than the upper bound of the range specified in Eq. (5), the desired DA of $p_t$ can be satisfied by Cadros at $t'$. If the desired level of DA cannot be satisfied, the method presented in Section 4 can be used to calculate the level of DA that Cadros can achieve.

Assume that the desired DA of $p_t$ can be satisfied. If $D_{st}(t')$ is no more than $\left(\frac{1}{k} \cdot SS\right)$, then all data can be stored in the friend circle using full replication. If $D_{st}(t')$ is more than $\left(\frac{1}{k} \cdot SS\right)$, hybrid replication has to be used, i.e., some data are replicated using full replication and the others using erasure coding. We now derive the data partitions in terms of the data replication and storage technique, aiming to minimize the overhead caused by erasure coding.

Out of the $D_{st}(t')$ amount of data that has to be stored at $t'$ in order to achieve the DA of $p_t$, $D_f(t')$ denotes the amount of data replicated using full replication, while $D_e(t')$ denotes the amount of data replicated using erasure coding at $t'$. Then Eq. (19) holds.

$$
D_{st}(t') = D_f(t') + D_e(t')
\tag{19}
$$

According to Eq. (2), Eq. (20) holds.

$$
SS(t') = k \cdot D_f(t') + \frac{n - x}{m} \cdot D_e(t')
\tag{20}
$$

In Eqs. (19) and (20), $D_{st}(t')$ has been calculated in Eq. (18) and $SS(t')$ in Eq. (9). $D_f(t')$ and $D_e(t')$ are the only two unknown variables. Therefore, we can combine Eqs. (19) and (20) to obtain the values of $D_f(t')$ and $D_e(t')$, as shown in Eqs. (21) and (22).

$$
D_f(t') = \frac{m \cdot SS(t') - (n - x) \cdot D_{st}(t')}{m \cdot k - (n - x)}
\tag{21}
$$

$$
D_e(t') = \frac{m \cdot k \cdot D_{st}(t') - m \cdot SS(t')}{m \cdot k - (n - x)}
\tag{22}
$$
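The closed forms in Eqs. (21) and (22) are straightforward to evaluate. The following sketch (with a hypothetical function name) simply computes them and can be used to check that the two partitions add up to $D_{st}(t')$ and consume exactly $SS(t')$.

```python
def partition_storage(ss, d_st, k, m, n, x):
    """Return (D_f, D_e) from Eqs. (21) and (22).

    ss   -- predicted storage capacity SS(t')
    d_st -- amount of data that must be stored, D_st(t')
    k    -- redundancy degree of full replication
    m, n -- number of original and encoded segments in erasure coding
    x    -- number of encoded segments kept in the Cloud
    """
    denom = m * k - (n - x)
    if denom <= 0:
        raise ValueError("m*k must exceed n-x for the partition to be defined")
    d_f = (m * ss - (n - x) * d_st) / denom
    d_e = (m * k * d_st - m * ss) / denom
    return d_f, d_e
```

With the default parameters of the evaluation ($k = 3$, $m = 5$, $n = 8$, $x = 4$), a call such as `partition_storage(300.0, 150.0, 3, 5, 8, 4)` returns a pair whose sum is 150 and which satisfies Eq. (20). When `d_st` equals `ss / k`, the erasure-coded share is zero, matching the pure full-replication case described above.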

Therefore, when Cadros replicates $D_f(t')$ amount of data using full replication and $D_e(t')$ amount of data using erasure coding as calculated by Eqs. (21) and (22), the overhead caused by erasure coding is minimized while achieving the desired level of DA. Further, the time window corresponding to $D_{st}(t')$ amount of data, which is denoted by $tw_{st}(t')$, can be determined by using Eq. (23).

$$
D_{st}(t') = E\left(s_{pu}\left(tw_{st}(t')\right)\right)
\tag{23}
$$

$t_{tl}(t')$ denotes the lower bound of the time window $tw_{st}(t')$, i.e., the earliest publication time of the stored data. $t_{tl}(t')$ can be determined by Eq. (24).

$$
t_{tl}(t') = t' - tw_{st}(t')
\tag{24}
$$

Since newer data are likely to be accessed more frequently and full replication does not incur the overhead caused by erasure coding, the newest $D_f(t')$ amount of data are replicated using full replication in Cadros. The rest of the data, i.e., the oldest $D_e(t')$ amount of data, are replicated and stored using erasure coding. $t_{io}(t')$ denotes the partition time point between the two replication techniques. $t_{io}(t')$ can be determined by Eq. (25).

$$
D_f(t') = E\left(s_{pu}\left(t' - t_{io}(t')\right)\right)
\tag{25}
$$
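Eqs. (23)–(25) define the time points implicitly through $E(s_{pu}(\cdot))$. As a hedged illustration, if publications are assumed to follow a Poisson process with rate $\lambda$ (as in the evaluation), then $E(s_{pu}(tw)) = a \cdot \lambda \cdot tw$ and the equations can be inverted directly. The function name and the linear model are assumptions of this sketch, not part of the scheme itself.

```python
def replication_time_points(t_prime, d_st, d_f, a, publish_rate):
    """Invert Eqs. (23) and (25) assuming E[s_pu(tw)] = a * publish_rate * tw.

    Returns (tw_st, t_tl, t_io): the stored time window, its lower bound,
    and the partition point between erasure coding and full replication.
    """
    per_unit_time = a * publish_rate      # expected data published per time unit
    tw_st = d_st / per_unit_time          # Eq. (23)
    t_tl = t_prime - tw_st                # Eq. (24)
    t_io = t_prime - d_f / per_unit_time  # Eq. (25)
    return tw_st, t_tl, t_io
```

Because $D_f(t') \le D_{st}(t')$, the returned values always satisfy $t_{tl} \le t_{io} \le t'$: data published in $[t_{tl}, t_{io}]$ go to erasure coding and data published after $t_{io}$ go to full replication.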

The relation among $t'$, $t_{io}(t')$ and $t_{tl}(t')$ as well as the main workings of Cadros are illustrated in Fig. 2. In Fig. 2, given the current replication time window $[t_{tl}, t]$, the data published during $[t_{tl}, t_{io}]$, denoted by $D_e$, are replicated with erasure coding and stored in the friend circle and the Cloud, and the data published during $[t_{io}, t]$, denoted by $D_f$, are replicated using full replication and stored only in the friend circle.

Cadros takes the following steps to determine the data replication and storage methods for the future time point $t'$, when the desired level of DA is $p_t$.

(1) Predicting the storage capacity $SS(t')$ using Eq. (9), $D_{st}(t')$ using Eq. (18) and the replication time window $tw_{st}(t')$ using Eq. (23).

(2) Calculating $t_{io}(t')$ using Eq. (25).

(3) $k$ copies will be created and stored in the friend circle for each data item published after $t_{io}(t')$.


Fig. 2. The Cloud-assisted data replication and storage scheme.

(4) The data items published before $t_{io}(t')$ are then split and encoded using erasure coding. $(r - 1)$ encoded data segments for each data item are migrated into the Cloud server.

After determining the portion of data that should be stored using erasure coding, the data are packed as data blocks, each block having the size of several megabytes (e.g., 2 MB). The erasure coding is then applied and divides each block into segments.

6. Data placement in Cadros

The DA model in Section 4 and the DA-driven data replication and storage scheme in Section 5 only indicate that the hybrid system has the capacity to achieve a certain level of DA. Realizing that DA still depends on the underlying data placement strategy, which determines how to place among the friends the newly published data replicas or the data replicas stored in a friend that is going offline. If a poor placement strategy places the data replicas on friends who are unlikely to still be online at the targeted future time point $t'$, then the desired level of DA will not be realized at $t'$, even though the hybrid system has such ability according to our probabilistic analysis. This section develops the DA-driven data placement strategies so that the desired level of DA can be realized (Section 6.1). Further, under the condition of satisfying the data availability, this work proposes a number of heuristic placement strategies to optimize other performance metrics in Cadros, such as data repair cost and data access performance (Section 6.2).

6.1. DA-driven data placement

The DA-driven data placement strategies take into account the availability of friends when determining the placement of data replicas among friends in Step 3 of the main workings of Cadros. Since data replicas are stored in different ways under full replication and erasure coding, different considerations are taken in Cadros. Sections 6.1.1 and 6.1.2 present the DA-driven data placement strategies for full replication and erasure coding, respectively.

6.1.1. DA-driven data placement for full replication

Since data replicas can only be stored in currently online friends, only currently online friends need to be considered in data placement strategies. The probability that the friend $v_i$ that is currently online remains online at $t'$, denoted by $p(t^{fon}_{out\_i} > t')$, can be calculated using Eq. (7). If $k_j$ copies are generated for a data item $d_j$ and stored in $k_j$ friend nodes, denoted by $V_{d_j} = \{v_1, v_2, \ldots, v_{k_j}\}$, the availability of the data item $d_j$ at the time point $t'$, denoted by $DA(d_j, t')$, can be expressed using Eq. (26), where $\prod_{i=1, v_i \in V_{d_j}}^{k_j} \left(1 - p(t^{fon}_{out\_i} > t')\right)$ is the probability that all $k_j$ nodes are offline, i.e., the probability that the data $d_j$ is not available at $t'$.

$$
DA(d_j, t') = 1 - \prod_{i=1,\, v_i \in V_{d_j}}^{k_j} \left(1 - p\left(t^{fon}_{out\_i} > t'\right)\right)
\tag{26}
$$

It can be seen that the data availability for a data item differs when a different friend set $V_{d_j}$ is selected for placing $d_j$. The DA-driven data placement strategy for full replication tries to find a $V_{d_j}$ in the friend circle that satisfies Eq. (27).

$$
\forall d_j \in D_f,\ \forall t' \in [t, t + \Delta t], \quad DA(d_j, t') \ge p_t
\tag{27}
$$
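To make Eqs. (26) and (27) concrete, the sketch below evaluates the availability of a fully replicated item and then picks a friend set greedily. The greedy rule (take the friends most likely to stay online first) is only one simple way to find a satisfying set and is an assumption of this example; the function names and the dictionary layout are hypothetical. The check is done for a single $t'$. Since the probability of remaining online does not increase with $t'$, evaluating it at $t' = t + \Delta t$ covers the whole interval in Eq. (27).

```python
def full_replication_da(p_online):
    """Eq. (26): 1 - prod(1 - p_i) over the friends that hold a copy."""
    p_all_offline = 1.0
    for p in p_online:
        p_all_offline *= 1.0 - p
    return 1.0 - p_all_offline

def pick_friends(p_by_friend, p_t, min_copies=2):
    """Greedily add the friends most likely to stay online until the
    availability reaches p_t with at least min_copies copies.

    p_by_friend -- dict mapping a friend id to p(t_out_i > t') from Eq. (7)
    Returns the chosen friend ids, or None if no prefix of the ranking works.
    """
    ranked = sorted(p_by_friend.items(), key=lambda kv: kv[1], reverse=True)
    chosen = []
    for friend, _ in ranked:
        chosen.append(friend)
        probs = [p_by_friend[f] for f in chosen]
        if len(chosen) >= min_copies and full_replication_da(probs) >= p_t:
            return chosen
    return None
```

For example, `pick_friends({"a": 0.9, "b": 0.8, "c": 0.5, "d": 0.95}, 0.99)` selects the two most reliable friends, whose combined availability already exceeds 99%.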

Besides Eq. (27), two important additional considerations need to be added when making the data placement decisions.


First, at least 2 copies of data are created in full replication, i.e., Eq. (28) holds. This makes sense because if only one copy exists in the friend circle, the system cannot restore the data item when the friend containing the data item goes offline.

$$
k_j \ge 2
\tag{28}
$$

Second, when a friend in $V_{d_j}$ goes offline, the data stored in that friend must be copied from the remaining friends in $V_{d_j}$ to other online friends in order to maintain the number of replica copies. This process is called data repair. The whole time needed to complete the data repair, denoted by $t_r$, includes the time of detecting the number of copies for each data item, the time of finding a new available online friend and the time of transferring the new copies to the new friend. If the remaining friends in $V_{d_j}$ also go offline in less than $t_r$ time, the data repair process will fail. Therefore, the condition shown in Eq. (29) needs to be satisfied, where $\Delta t$ is the one in Eq. (27).

$$
\Delta t > t_r
\tag{29}
$$

Therefore, the objective of the DA-driven data placement strategy for full replication is to find a $V_{d_j}$ in the friend circle that satisfies Eqs. (27), (28) and (29).

6.1.2. DA-driven data placement for erasure coding

For a data item $db_j$, the number of encoded data segments stored in the Cloud server, $x$, must be less than $r$, i.e., Inequality (30) holds.

$$
x < r
\tag{30}
$$

The remaining $(n - x)$ encoded segments are stored in the friend circle. The availability of $db_j$ is only determined by the availability of the $(n - x)$ segments stored in the friend circle, because the Cloud server is always available and therefore the $x$ segments stored in the Cloud can be regarded as 100% available. Under erasure coding, the original data can be restored using $r$ data segments. Therefore, if the number of available segments in the friend circle is at least $(r - x)$, the data item is available. $V_{db_j} = \{v_1, v_2, \ldots, v_{(n-x)}\}$ denotes the $(n - x)$ friends that store the $(n - x)$ segments. The number of data segments that are available at $t'$, denoted by $N_a(db_j, t')$, can be calculated using Eq. (31).

$$
N_a(db_j, t') = \sum_{i=1,\, v_i \in V_{db_j}}^{n-x} 1 \times p\left(t^{fon}_{out\_i} > t'\right)
\tag{31}
$$

The availability of the data item $db_j$ at $t'$, denoted by $DA(db_j, t')$, can then be expressed using Eq. (32).

$$
DA(db_j, t') = p\left(N_a(db_j, t') \ge (r - x)\right)
\tag{32}
$$

The DA-driven data placement strategy for erasure coding tries to find a $V_{db_j}$ in the friend circle that satisfies Eq. (33).

$$
\forall db_j \in D_e,\ \forall t' \in [t, t + \Delta t], \quad DA(db_j, t') \ge p_t
\tag{33}
$$
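One way to evaluate the probability in Eq. (32) is sketched below. It is an interpretation, not the paper's stated procedure: Eq. (31) as written sums the per-friend probabilities, whereas this sketch treats each friend's availability as an independent Bernoulli event and computes the exact probability that at least $(r - x)$ of the friend-held segments are online. The independence assumption and the function name are introduced here for illustration only.

```python
def erasure_coding_da(p_online, r, x):
    """Probability that at least (r - x) of the friend-held segments are online.

    p_online -- p(t_out_i > t') from Eq. (7) for each of the (n - x) friends
    r        -- number of segments needed to restore the data item
    x        -- number of segments kept in the (always available) Cloud
    """
    need = max(r - x, 0)
    # dist[c] = probability that exactly c of the friends seen so far are online
    dist = [1.0]
    for p in p_online:
        nxt = [0.0] * (len(dist) + 1)
        for c, prob in enumerate(dist):
            nxt[c] += prob * (1.0 - p)   # this friend is offline
            nxt[c + 1] += prob * p       # this friend is online
        dist = nxt
    return sum(dist[need:])
```

With $r = 5$ and $x = 4$, only one friend-held segment is needed, so the result reduces to one minus the probability that all $(n - x)$ friends are offline, which has the same form as Eq. (26).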

The additional condition expressed in Eq. (29) should also hold in the erasure coding case for the same reason. Therefore, the objective of the DA-driven data placement strategy for erasure coding is to find a $V_{db_j}$ in the friend circle that satisfies Eq. (33) and Eq. (29).

6.2. Heuristic strategies to optimize other performance metrics

The DA-driven data placement strategy presented above only finds any set of friends that satisfies Eqs. (27), (28) and (29) for the full replication case and Eqs. (33) and (29) for erasure coding, so that the desired level of DA can be achieved. However, when there are multiple friend sets ($V_{d_j}$ and/or $V_{db_j}$) satisfying these conditions, further considerations can be given to optimize other performance metrics. A few heuristic methods are presented as follows to select the final friend set when multiple satisfactory friend sets exist.

Mincost: the longer $\Delta t$ in Eq. (27) and Eq. (33) is, the fewer data repairs are performed and therefore the lower the data repair cost. Thus, the Mincost heuristic selects, from all satisfactory friend sets, the set ($V_{d_j}$ and/or $V_{db_j}$) that has the longest $\Delta t$.

Random: randomly selecting the friend set ($V_{d_j}$ and/or $V_{db_j}$).

Balance: selecting the friend set which has the lowest resource utilization. The resources can be storage or communication bandwidth.
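The three heuristics can be sketched as a single selection function over the candidate friend sets that already satisfy the DA conditions. The dictionary keys (`friends`, `delta_t`, `utilization`) and the function name are hypothetical, and how $\Delta t$ and the utilization of a set are measured is left to the caller.

```python
import random

def select_friend_set(candidates, strategy, rng=None):
    """Pick one friend set among candidates that all satisfy the DA conditions.

    candidates -- list of dicts with hypothetical keys:
                  'friends' (ids), 'delta_t' (longest interval for which the
                  set keeps DA >= p_t), 'utilization' (resource utilization)
    strategy   -- 'mincost', 'random' or 'balance'
    rng        -- optional random.Random instance for reproducible choices
    """
    if not candidates:
        raise ValueError("no satisfactory friend set")
    if strategy == "mincost":
        return max(candidates, key=lambda c: c["delta_t"])
    if strategy == "balance":
        return min(candidates, key=lambda c: c["utilization"])
    if strategy == "random":
        return (rng or random).choice(candidates)
    raise ValueError("unknown strategy: " + strategy)
```

Mincost favours the set with the longest $\Delta t$ and Balance the least utilized set; Random serves as a baseline.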


Fig. 3. The states of all friends at current time point. (For interpretation of the colors in this figure, the reader is referred to the web version of this article.)

6.3. Functional components in Cadros

The Cadros framework proposed in this paper consists of two main components: Cadros Cloud Service (CCS) and Cadros App (CApp). CCS resides in the Cloud. CCS receives and stores the data segments generated and sent by CApp, and sends the segments to the friends upon request. CApp runs on the user's mobile device and on the mobile device of each of the user's friends. In addition to performing the usual functions of a typical social network App, it also performs the following functions presented in this paper.

1) When the user publishes a new data item, the CApp running on the user's device sends it to all online friends.

2) The CApp running on the user's device predicts the level of DA using the method presented in Section 4, and based on the level of DA, determines the portions of data that should be stored in the friend circle and the Cloud, respectively, using the method presented in Section 5.

3) The CApp running on the user's device selects the friend devices that are used to store the replicated data and the erasure-coded data, respectively, using the method presented in Section 6.

4) The CApp running on the user's device sends the published data and its replicas to the designated friend devices.

5) The CApp running on the friend devices receives and stores the data sent by the CApp running on the user's device.

6) When a friend comes online, the CApp running on the friend's device updates the data either from the user if the user is online, or from other online friends or the Cloud if the user is offline.

7. Evaluation

A discrete simulator has been developed in this work to simulate an OSN. There are N users in the simulated OSN. Some users act as the friends of another user and update the data published by the user. The online and offline durations of the users in the simulated OSN follow the Power-Law distribution (PL) or the Exponential distribution (Exp), as observed in the literature [21,26]. The user publishes the data following the Poisson process. Based on the methods presented in this paper, the replicas of the published data are created using either full replication or erasure coding and are stored in either online friends or the simulated Cloud Server.

In order to evaluate the prediction results, the experimental scenario is designed as follows. A friend contributes a storage capacity that is randomly taken from the range $[S_{min}, S_{max}]$. The user and his friends log in and log out following the specified probability distribution during the experiment interval $[0, l]$. The current time is set to be the $m$-th minute ($m < l$). The online or offline states of all friends at time $m$ as well as the latest login or logout time before time $m$ are collected. The collected data, combined with the specified probability distributions of the friends' online and offline durations, are used to predict the total storage capacity contributed by online friends (i.e., SS) and the amount of the published data that have to be stored (i.e., D) at the future time points (i.e., the time points later than $m$). The predicted data are then compared against the data gathered from the actual simulation run.

For example, Fig. 3(a) shows the first login/logout time of each friend when the number of the friends of a user is set to be 150, while Fig. 3(b) shows the latest login and logout time of each friend when the current time is set to be the 31st min. A point above the red line (i.e., y = 0) in Fig. 3(b) represents the latest login time of a friend who is online at the 31st min, while a point below the red line represents the latest logout time of a friend who is offline at the 31st min.

Unless stated otherwise, the experimental parameters used in the simulator and the performance evaluation take the values shown in Table 1. These values are chosen based on those used in the literature [21,22,26,27].


Table 1
The experimental parameters.

Notation              Default value   Description
$N$                   150             The number of the user's friends
$a$                   1               The average size of published data
$\lambda^{pl}_{on}$   2.5             The parameter of the online time duration, which follows the power-law distribution
$\lambda^{pl}_{off}$  2.1             The parameter of the offline time duration, which follows the power-law distribution
$\lambda^{ps}_{pu}$   1               The parameter of the number of times that the user publishes data, which follows the Poisson distribution
$k$                   3               Redundancy degree of full replication
$m$                   5               The number of the original data segments needed before encoding in erasure coding
$n$                   8               The total number of the data segments after encoding in erasure coding
$r$                   5               The minimum number of the data segments needed to restore the original data in erasure coding
$x$                   4               The maximum number of the data segments stored in the Cloud in erasure coding
$p_t$                 99%             The desired level of data availability
$t$                   31              The current time point
$S_i$                 [1, 10]         The storage capacity contributed by friend $v_i$

Fig. 4. The accuracy of SS prediction.

7.1. DA model

7.1.1. Prediction accuracy for SS

Fig. 4 shows the experimental results for the accuracy of predicting SS. In the experiments, the current time $t$ is set to be the 31st min and SS at future time points is then predicted. The actual values of SS are gathered as the simulation progresses. Fig. 4(a), (b), (c) and (d) show the results under different $\lambda_{on}$ and $\lambda_{off}$ (i.e., online and offline durations). It


Fig. 5. The accuracy of D up prediction; the experimental settings are the same as those in Fig. 4.

can be seen from Fig. 4(a) that, compared with its actual values, the prediction of SS is fairly accurate in the first 10 minutes, which shows the effectiveness of the proposed prediction method. By comparing Fig. 4(a), (b), (c) and (d), we can see that the length of the accurate prediction decreases as the settings of $\lambda_{on}$ and $\lambda_{off}$ change from Fig. 4(a) to 4(d). These results indicate that the online and offline durations have an impact on the prediction accuracy. After carefully analyzing the changing trend of $\lambda_{on}$ and $\lambda_{off}$, it appears that the minimum of the online and the offline durations (i.e., $\min(1/\lambda_{on}, 1/\lambda_{off})$) determines the length of accurate prediction. The smaller the value of $\min(1/\lambda_{on}, 1/\lambda_{off})$, the shorter the length of the accurate prediction. This is because when $\min(1/\lambda_{on}, 1/\lambda_{off})$ is smaller, the friends are more dynamic and, consequently, it is more difficult to obtain an accurate prediction for future time points.

7.1.2. Prediction accuracy for $D_{up}$

In the same simulation runs that generate the results in Fig. 4, the actual values of $D_{up}$ are also collected and compared with the predicted counterparts. These results are plotted in Fig. 5, where the experimental settings in Fig. 5(a)–(d) are the same as those in Fig. 4(a)–(d). It can be seen that the $D_{up}$ prediction is rather accurate in most cases.

7.2. DA-driven data replication and storage in Cadros

7.2.1. Compare storage capacity and data availability

The aim of combining the Cloud with DOSN is to increase the storage capacity of the system and thus to improve the data availability. Fig. 6(a) compares the real amount of data updated by offline friends (real $D_{update}$, shown as the red line in the lower part of Fig. 6(a)), the storage capacity achieved by Cadros (Cadros $D_{stored}$, shown as the green line in the upper part of Fig. 6(a)) and that obtained in the case where the data are replicated only using full replication and stored only in the friend circle without the Cloud (FR $D_{stored}$, shown as the blue line in the middle of the lower part of Fig. 6(a)). In the experiments, the storage capacity available at each time point during the simulation run is recorded. We can see from Fig. 6(a) that, compared with FR, Cadros can not only store much more data but also satisfy the need of data update all the time, while FR cannot. Fig. 6(b) compares the data availability achieved by Cadros and that achieved by FR. We can see from Fig. 6(b) that the data availability of Cadros (shown as the straight line in Fig. 6(b)) is 100% all the time, while the data availability of FR (shown as the curve in Fig. 6(b)) is not always 100%.


Fig. 6. Comparing the storage capacity and data availability between Cadros and Full Replication (FR) without the Cloud. (For interpretation of the colors in this figure, the reader is referred to the web version of this article.)

Fig. 7. Comparing the overhead between Cadros and Erasure Coding (EC). (For interpretation of the colors in this figure, the reader is referred to the web version of this article.)

Fig. 8. Comparing the heuristic strategies.

7.2.2. Compare the overhead between Cadros and Erasure Coding (EC)

In Cadros, the data are partitioned in such a way that the overhead caused by erasure coding is minimized while satisfying the desired level of DA. Fig. 7 compares the overhead caused by Cadros with that caused by pure erasure coding, i.e., the case where all data are replicated using erasure coding. It can be seen from Fig. 7 that Cadros incurs much lower overhead in almost all cases. In many cases, the overhead is 0, because at those time points the data replicas can be fully stored in the friend circle without the need to encode the data and store them in the Cloud. These results indicate that Cadros is able to judiciously determine the suitable data partition points and use erasure coding or full replication to replicate different portions of data.

7.2.3. Data placement strategies in Cadros

Fig. 8 compares the heuristic data placement strategies in terms of data repair cost (Fig. 8(a)) and utilization (Fig. 8(b)). The Mincost strategy is designed to achieve the longest $\Delta t$ among all friend sets that can satisfy the desired level of DA, aiming


to reduce the data repair cost. The Balance strategy aims to balance the resource utilization among the friends. In the experiments, when a data replica is placed in a friend, the next logout time of the friend is recorded. Since multiple replicas (or segments) are generated for a data item, a data item (or the corresponding data segments) will be placed on multiple friends. Among all friends associated with a data item, the friends who log out earliest and latest are said to have the shortest and the longest available time, respectively. The shortest available time represents the time before which there is no data repair cost, while the longest available time represents the time by which all friends that store the replicas of a data item have logged out once. For both the shortest and the longest available time, the longer, the better. As can be seen from Fig. 8(a), Mincost achieves the best results in terms of both the shortest and the longest available time. These results show that Mincost is effective in reducing the occurrence of data repair, thus reducing the data repair cost. Fig. 8(b) plots the maximum storage utilization among all friends. It can be seen that the maximum storage utilization achieved by the Balance strategy is the smallest. This means that the storage utilization across the friend nodes is more balanced.

8. Conclusions

This paper proposes a Cloud-assisted data replication and storage service, called Cadros, for DOSN, aiming to improve the data availability of DOSN. This paper first conducts a quantitative analysis of the storage capacity that results from combining the Cloud with DOSN. Further, this paper models and predicts the level of DA that Cadros is able to achieve. In Cadros, the published data are partitioned in terms of the replication technique, which is either full replication or erasure coding.
The data partition obtained in this paper is optimal in the sense that the overhead incurred by erasure coding is minimized subject to satisfying the desired level of DA. This paper also proposes data placement strategies to realize the desired DA and to improve performance in terms of other metrics.

Acknowledgments

We would like to thank the users and the developer community for their help with this work. The work is supported by the Key Program of National Natural Science Foundation of China (Grant Numbers: 61402511, 61432005, 61272482, 61379146) and the Foundation of Science and Technology on Information Assurance (No. KJ-13-105, No. KJ-14-107).

References

[1] K. Graffi, C. Gross, P. Mukherjee, et al., LifeSocial.KOM: a P2P-based platform for secure online social networks, in: 2010 IEEE Tenth International Conference on Peer-to-Peer Computing (P2P), IEEE, 2010.
[2] S. Buchegger, D. Schiöberg, L.H. Vu, et al., PeerSoN: P2P social networking: early experiences and insights, in: The Second ACM EuroSys Workshop on Social Network Systems, ACM, 2009, pp. 46–52.
[3] C.A. Yeung, I. Liccardi, K. Lu, et al., Decentralization: the future of online social networking, in: W3C Workshop on the Future of Social Networking Position Papers, 2009.
[4] U. Tandukar, J. Vassileva, Selective propagation of social data in decentralized online social network, in: Advances in User Modeling, Springer, Berlin, Heidelberg, 2012, pp. 213–224.
[5] A. Shakimov, A. Varshavsky, L.P. Cox, et al., Privacy, cost, and availability tradeoffs in decentralized OSNs, in: The 2nd ACM Workshop on Online Social Networks, ACM, 2009, pp. 13–18.
[6] D. Koll, J. Li, X. Fu, With a little help from my friends: replica placement in decentralized online social networks, Technical report, University of Goettingen, Germany, January 2013.
[7] A. Olteanu, G. Pierre, Towards robust and scalable peer-to-peer social networks, in: Proceedings of the Fifth Workshop on Social Network Systems, WOSN, ACM, 2012.
[8] L.A. Cutillo, R. Molva, T. Strufe, Safebook: a privacy-preserving online social network leveraging on real-life trust, IEEE Commun. Mag. 47 (12) (2009) 94–101.
[9] J. Li, F. Dabek, F2F: reliable storage in open networks, in: IPTPS, 2006.
[10] D.N. Tran, F. Chiang, J. Li, Friendstore: cooperative online backup using trusted nodes, in: Proceedings of the 1st Workshop on Social Network Systems, ACM, 2008, pp. 37–42.
[11] R. Gracia-Tinedo, M. Sánchez-Artigas, P. Garcia-Lopez, F2BOX: cloudifying F2F storage systems with high availability correlation, in: 2012 IEEE 5th International Conference on Cloud Computing, CLOUD, IEEE, 2012, pp. 123–130.
[12] D. Liu, A. Shakimov, R. Cáceres, et al., Confidant: protecting OSN data without locking it up, in: Middleware 2011, Springer, Berlin, Heidelberg, 2011, pp. 61–80.
[13] Y. Sun, F. Liu, B. Li, et al., Fs2you: peer-assisted semi-persistent online storage at a large scale, in: INFOCOM 2009, IEEE, 2009, pp. 873–881.
[14] B. Krishnamurthy, C.E. Wills, Characterizing privacy in online social networks, in: Proceedings of the First Workshop on Online Social Networks, ACM, 2008, pp. 37–42.
[15] C. Zhang, J. Sun, X. Zhu, et al., Privacy and security for online social networks: challenges and opportunities, IEEE Netw. 24 (4) (2010).
[16] G. Mega, A. Montresor, G.P. Picco, Cloud-assisted dissemination in social overlays, in: Peer-to-Peer Computing (P2P), IEEE, 2013, pp. 1–5.
[17] Diaspora, https://joindiaspora.com/.
[18] Facebook, https://www.facebook.com/.
[19] Sina Microblog, http://weibo.com/.
[20] H. Weatherspoon, J.D. Kubiatowicz, Erasure coding vs. replication: a quantitative comparison, in: Peer-to-Peer Systems, Springer, Berlin, Heidelberg, 2002, pp. 328–337.
[21] L. Jin, Y. Chen, et al., Understanding user behavior in online social networks: a survey, IEEE Commun. Mag. 144–150 (150) (2013).
[22] F. Benevenuto, T. Rodrigues, M. Cha, et al., Characterizing user behavior in online social networks, in: Proceedings of the 9th ACM SIGCOMM Conference on Internet Measurement Conference, ACM, 2009, pp. 49–62.
[23] M. McGlohon, L. Akoglu, C. Faloutsos, Statistical properties of social networks, in: Social Network Data Analytics, Springer, US, 2011, pp. 17–42.
[24] R.E. Wilson, S.D. Gosling, L.T. Graham, A review of Facebook research in the social sciences, Perspect. Psychol. Sci. 7 (3) (2012) 203–220.


[25] A. Mislove, M. Marcon, K.P. Gummadi, et al., Measurement and analysis of online social networks, in: Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement, ACM, 2007, pp. 29–42.
[26] A.L. Barabasi, The origin of bursts and heavy tails in human dynamics, Nature 435 (7039) (2005) 207–211.
[27] T. Zhou, P. Han, et al., Towards the understanding of human dynamics, in: Science Matters: Humanities as Complex Systems, 2008, pp. 207–233.
[28] D. Stutzbach, R. Rejaie, Understanding churn in peer-to-peer networks, in: Proceedings of the 6th ACM SIGCOMM Conference on Internet Measurement, ACM, 2006, pp. 189–202.
[29] F. Tegeler, D. Koll, X. Fu, Gemstone: empowering decentralized social networking with high data availability, in: Global Telecommunications Conference, GLOBECOM 2011, IEEE, 2011, pp. 1–6.
[30] B. Krishnamurthy, C.E. Wills, Privacy leakage in mobile online social networks, in: Proceedings of the 3rd Conference on Online Social Networks, USENIX Association, 2010.
[31] R. Sharma, A. Datta, et al., An empirical study of availability in friend-to-friend storage systems, in: Peer-to-Peer Computing (P2P), IEEE, 2011, pp. 348–351.
[32] K. Rzadca, A. Datta, et al., Replica placement in p2p storage: complexity and game theoretic analyses, in: 30th International Conference on Distributed Computing Systems, ICDCS, IEEE, 2010, pp. 599–609.
[33] S. Fu, L. He, X. Liao, K. Li, C. Huang, Analyzing the impact of storage shortage on data availability in decentralized online social networks, Sci. World J. 2014 (2014) 826145, http://dx.doi.org/10.1155/2014/826145.