DPRS: A dynamic popularity aware replication strategy with parallel download scheme in cloud environments


Simulation Modelling Practice and Theory 77 (2017) 177–196


N. Mansouri, M. Kuchaki Rafsanjani, M.M. Javidi*
Department of Computer Science, Shahid Bahonar University of Kerman, Box No. 76135-133, Kerman, Iran


Article history: Received 16 January 2017; Revised 18 April 2017; Accepted 12 June 2017.

Keywords: Data replication; Parallel download; Cloud computing; Simulation.

Abstract

Cloud computing has emerged as a main approach for managing huge distributed data in different areas such as scientific operations and engineering experiments. In this regard, data replication in Cloud environments is a key strategy that reduces response time and improves reliability. One of the main features of a distributed environment is to replicate data across various sites so that popular data becomes more available. Whenever a site does not hold a needed data file, it has to fetch it from other locations; a parallel download approach is therefore applied to reduce download time, enabling a user to get various parts of a file from several sites simultaneously. In this work, we present a data replication strategy named the Dynamic Popularity aware Replication Strategy (DPRS), designed for Cloud systems and leveraging data access behavior. DPRS replicates only a small amount of frequently requested data files, based on the 80/20 idea, and determines to which site a file is replicated based on the number of requests, free storage space, and site centrality. We introduce a parallel downloading approach that replicates data fragments and downloads them in parallel to enhance overall performance. We evaluate effective network usage, mean job execution time, hit ratio, total number of replications, and percentage of storage filled using the CloudSim simulator. Extensive experiments demonstrate the effectiveness of DPRS under most access patterns. © 2017 Elsevier B.V. All rights reserved.

1. Introduction

High-speed computing and storage elements are needed for new scientific applications such as astrophysics, astronomy, aerography, and biology, since these applications work with huge datasets and complex computations. Although a supercomputer can execute such tasks, it is too costly and too difficult to use. According to Moore's Law [1], the number of transistors on a chip doubles approximately every 18 months; Kryder's Law [2] is the observation that the disk space available at the same cost doubles approximately every 13 months. Today, with high-bandwidth networks, distributed environments can replace expensive supercomputers. Distributed systems integrate dispersed storage and computation elements, decreasing system cost and utilizing otherwise idle bandwidth and CPUs. Cloud computing plays a significant role in Information Technology (IT) solutions for both engineering and business applications [3–5]. Cloud computing systems have attractive characteristics that are important for both scientific and business purposes. Clouds provide solutions to computationally intensive applications similar to HPC (High Performance Computing) environments such as supercomputing centers. From the business view, Clouds provide flexible platforms to both Cloud providers and application owners.

Corresponding author. E-mail addresses: [email protected], [email protected] (M.M. Javidi).

http://dx.doi.org/10.1016/j.simpat.2017.06.001


Cloud computing presents a unique computing ecosystem in which providers and application owners can set up an elastic relationship driven by application performance features (e.g., availability, execution time, monetary budget) [6–8]. Cloud providers are concerned with keeping their services running on a relatively unreliable platform while producing profitable revenues. Cloud application owners, on the other hand, need services and resources that meet high performance requirements [9,10]. Due to their data-intensive nature, new scientific applications can benefit if Cloud schedulers use data reuse and replication strategies when executing their workflows [11,12]. Data replication is a common technique to achieve these aims. Nowadays, different fields such as the Internet, P2P systems, and distributed databases use data replication to improve overall performance [13–16]. An efficient data replication strategy should be able to find a suitable time to copy files, determine which data should be copied, and place copies at the best sites. The key step in designing an appropriate dynamic data replication algorithm is the analysis of data access patterns. Different models of data access patterns have been introduced to describe the distribution of access counts of data in distributed environments. Breslau et al. [17] showed that the distribution of webpage accesses follows a Zipf-like distribution. Cameron et al. [18] showed that a Zipf-like distribution underlies the distribution of file accesses in data grid environments. Ranganathan and Foster [19,20] claimed that Zipf and geometric distributions can accurately model file popularity on a hierarchical data grid with infinite storage capacity. Tang et al. [21] indicated that users' file access behavior in grid environments follows Zipf-like and geometric distributions. Chang et al. [22,23] presented two data replication algorithms for a multi-tier data grid with limited storage capacity; however, they did not focus on the data access pattern [22], so when the users' access pattern changes, their approach leads to inefficient data access times. One of the important factors in data replication is data popularity. The popularity of a file shows how often the file is accessed by the sites in a distributed system. In this paper, the Dynamic Popularity aware Replication Strategy (DPRS) is proposed. DPRS places popular files in appropriate clusters/sites so as to adapt to changes in users' interest in data. DPRS determines in which cluster site a file is replicated based on the number of requests, free storage space, and site centrality. With the explosive growth of data, storing several replicas of the whole dataset in limited storage is a main challenge; it is therefore important to consider only critical data files in the replication process. The Italian economist Pareto presented the 80/20 idea: the Pareto principle states that, for many events, nearly 80% of the effects come from 20% of the causes. Breslau et al. [24] concluded, by investigating six traces, that a Zipf-like distribution describes the distribution of web page requests, where the relative probability of a request for the i'th most popular page is proportional to 1/i^α, with α usually taking values less than one. Staelin and Garcia-Molina [25] observed that some files have a much higher skew of accesses than others on very large file systems.
Gomez and Santonja [26] demonstrated that some data files are extremely critical and popular, while others are rarely or never requested in real workloads. Cherkasova and Ciardo [27] investigated the characteristics of web workloads and showed that 10% of the files requested on a server commonly account for 90% of the server requests and 90% of the bytes transferred. Xie and Sun [28] applied the 80/20 rule in a file assignment method for parallel I/O systems. These investigations show that a small portion of the data files is requested most of the time; therefore, replicating the hot files should improve overall performance while consuming minimal resources. DPRS follows the 80/20 approach, in which 80% of the data accesses mostly go to 20% of the storage, to copy only the small, frequently accessed part of the data. Another useful method is parallel downloading of replicated data. If a site does not hold a file, it has to download it from other locations. Thus, the parallel download idea, which enables a user to download various segments of a file from different sites simultaneously, is used to improve download time. Developing methods for parallel downloads of Internet documents is of significant interest in the networking area. Due to the dynamics of distributed systems, instead of replicating complete files, chunks of data files are replicated, which can then be downloaded in parallel from various sites. We present a parallel downloading technique that replicates data segments and downloads them in parallel to enhance performance. This paper presents a scheme for efficient data replication in Cloud environments. The main contributions of our strategy (DPRS) are: (1) it periodically determines file access popularity to track the variation of users' access behaviors, and then places popular files at the best locations based on this variation; (2) DPRS exploits the 80/20 approach, i.e., it copies the top 20% of frequently accessed files; (3) DPRS supports replicating and parallel downloading of replica fragments; and (4) we perform comprehensive simulations to investigate the impact of different factors on system performance. The article is structured as follows: Section 2 briefly describes the motivation for data replication in the Cloud. Section 3 presents a brief review of previous work on data replication for Cloud computing. Section 4 explains our system model; Section 5 details DPRS; Section 6 evaluates the simulation results, followed by Section 7, which analyses the tradeoff between calculation cost and accuracy level. Conclusions and future work are given in Section 8.

2. Motivations

Cloud computing is one of the hottest core technical issues of the modern era. It has appeared with broad-ranging effects across IT, business, and software engineering. According to the National Institute of Standards and Technology (NIST) description, "the Cloud computing is a model for enabling convenient, resource pooling, ubiquitous, on-demand access which can be easily delivered with various types of service provider interaction" [29].


One of the main advantages of the pay-as-you-go model is that cost can be decreased by provisioning only a certain amount of resources [30]. The customer can choose the processor, memory, hard disk, operating system, networking, access control, and any additional software needed for their environment. The resources are presented on demand to the users. This provides great advantages to industry and home users and attracts the attention of scientific communities [31]. On the other hand, a small part of the data constitutes an important and critical share of the resources shared by different research communities. The volume of data is measured in terabytes or petabytes in most categories, and such huge amounts of data are usually kept in the data centers of a Cloud environment. Therefore, data replication is used to manage big data by storing several replicas of data files in distributed locations. Data replication is necessary to enhance data accessibility, availability, and fault tolerance, while improving data access time and network load. In order to achieve these goals, different data replication algorithms have been designed for different systems such as data grids [32–34], Cloud storage [35,36], P2P [37–39], and Content Delivery Networks (CDN) [40,41]. Popularity prediction can improve the decision of which data should be replicated to enhance availability and resource utilization. Data popularity indicates how much a given part of the data is accessed by the sites; this provides critical information, such as an indication of the importance of a data file. Therefore, we can design a smarter data placement method and substantially optimize storage utilization. Also, the number of replicas of a particular file can be determined according to its popularity, which helps balance the system load, since no site is overloaded by jobs requesting popular files while other sites stay inactive holding only unpopular files.

3. Related work

The data replication approach has recently attracted new interest due to its relationship with virtualization and Cloud computing. Static replication and dynamic replication are the two most common categories of data replication, and different replication strategies have been designed by different researchers based on these two methods. In a static replication scheme, the number of replicas of each data file is set in advance, manually; the placement of replicas is also pre-decided, and these replicas are controlled manually. Such a scheme does not adjust to changes in the system and in users' access patterns. In dynamic replication methods, however, the replicas of each data file are generated and stored dynamically according to the changes in the system and in users' access patterns. In Cloud computing environments, users' access patterns may keep changing over time; hence, to obtain high availability along with better performance, replication methods should adjust dynamically to variations in the system. In the Cloud, dynamic replication algorithms are therefore considered more suitable than static ones. According to the literature, several replication algorithms have been reported that try to replicate the best file at a reasonable time and at a suitable site. These strategies are compared and summarized in Table 1. To evaluate the various approaches theoretically, we focus the comparison on the following criteria:

• Bandwidth consumption: whether the bandwidth consumption is reduced by the replication approach.
• Response time: whether the study focuses on reducing the response time of submitted jobs.
• Load balancing: whether the workload is balanced across all sites/data centers.
• Fault tolerance: whether the proposed replication algorithm is able to improve the fault tolerance of the environment.
• Energy efficiency: whether it focuses on advancing energy efficiency in data centers and computing systems.
• Consistency management: whether the several replicas of a given file are kept consistent in the presence of concurrent updates.
• Popularity: whether the replication algorithm takes the number of accesses to data into account when creating a new replica.
• Parallel downloading: whether the replication strategy uses a parallel download technique.
• Simulator used: the name of the simulator used to evaluate the proposed replication algorithm.

Usually, replication strategies try to answer some key questions: (1) which data should be copied; (2) when copies should be generated; (3) how many copies should be generated; (4) where the replicas should be stored; (5) which copies should be replaced; and (6) which replica location is the best for users. Decision-making about data replication involves several considerations and steps, as well as accounting for several factors and assumptions. Table 1 summarizes some recent replication algorithms and their assumptions at each step.

Due to the large scale of scientific data in Cloud environments, modification of the available replication strategies is inevitable. Boru et al. [42] presented a data replication strategy for Cloud computing data centers that takes into account energy consumption, network bandwidth, and communication delay, both between distributed data centers and inside each data center. The power consumption of a server depends on its CPU utilization. As described in [43–45], an idle server already consumes about two-thirds of its peak power, because servers must keep memory modules, disks, I/O resources, and other peripherals operational even when no computations are carried out. Network switches are hardware devices that include port transceivers, line cards, and the switch chassis [46]; these components contribute to the switch energy consumption. The power consumption of the switch chassis and line cards stays constant over time, while the consumption of the network ports scales with the volume of the received traffic.
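To make the server power assumption above concrete, the sketch below implements a common linear CPU-utilization power model in which an idle machine already draws about two-thirds of its peak power. This is an illustration on our part; the class name and the 300 W peak value are assumptions, not code or data taken from [42–45] or from GreenCloud.

```java
/**
 * Minimal sketch of a linear, CPU-utilization-based server power model,
 * assuming idle power is roughly two-thirds of peak power as described above.
 * The class name and numeric values are illustrative only.
 */
public class ServerPowerModel {
    private final double peakWatts;   // power at 100% CPU utilization
    private final double idleWatts;   // power at 0% CPU utilization

    public ServerPowerModel(double peakWatts) {
        this.peakWatts = peakWatts;
        this.idleWatts = peakWatts * 2.0 / 3.0;   // idle ~ 2/3 of peak (assumption)
    }

    /** Power drawn at a given CPU utilization in [0, 1]. */
    public double powerAt(double cpuUtilization) {
        double u = Math.max(0.0, Math.min(1.0, cpuUtilization));
        return idleWatts + (peakWatts - idleWatts) * u;
    }

    public static void main(String[] args) {
        ServerPowerModel model = new ServerPowerModel(300.0); // 300 W peak, assumed
        System.out.printf("idle: %.1f W, 50%% load: %.1f W, full load: %.1f W%n",
                model.powerAt(0.0), model.powerAt(0.5), model.powerAt(1.0));
    }
}
```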

Table 1. Comparison of the dynamic data replication strategies in Cloud environment. (For each strategy: main idea; whether it addresses bandwidth consumption, response time, load balancing, fault tolerance, energy efficiency, consistency, popularity, and a parallel technique; the simulator used; and the replication questions answered, with the corresponding set of parameters in parentheses.)

Boru et al. [42]
Main idea: Modeling of energy consumption characteristics of data center IT infrastructures.
Bandwidth consumption: Yes. Response time: Yes. Load balancing: No. Fault tolerant: No. Energy efficiency: Yes. Consistency: No. Popularity: No. Parallel technique: No. Simulator: GreenCloud.
Questions and parameters: 1- Where should the replicas be placed? (energy, bandwidth).

Gill and Singh [48]
Main idea: Optimizing the cost of replication using the concept of the knapsack problem.
Bandwidth consumption: Yes. Response time: Yes. Load balancing: No. Fault tolerant: No. Energy efficiency: No. Consistency: No. Popularity: Yes. Parallel technique: No. Simulator: CloudSim.
Questions and parameters: 1- When should replicas be created? (popularity crosses a threshold). 2- Which files should be replicated? (number of accesses, date of access). 3- How many replicas should be created? (availability). 4- Where should the replicas be placed? (number of user accesses, date of access).

Mansouri [14]
Main idea: Placing replicas in a load-balancing manner.
Bandwidth consumption: Yes. Response time: Yes. Load balancing: Yes. Fault tolerant: Yes. Energy efficiency: No. Consistency: No. Popularity: No. Parallel technique: No. Simulator: CloudSim.
Questions and parameters: 1- Where should the replicas be placed? (service time, failure probability, load variance, latency, storage usage, availability). 2- Which replicas should be replaced? (number of accesses, date of access, file size).

Long et al. [50]
Main idea: Modeling multi-objective optimized replication management.
Bandwidth consumption: Yes. Response time: Yes. Load balancing: Yes. Fault tolerant: Yes. Energy efficiency: Yes. Consistency: No. Popularity: No. Parallel technique: No. Simulator: CloudSim.
Questions and parameters: 1- Where should the replicas be placed? (availability, service time, load variance, energy, latency).

Kumar et al. [52]
Main idea: Using the query span as the metric to optimize replication.
Bandwidth consumption: No. Response time: Yes. Load balancing: Yes. Fault tolerant: Yes. Energy efficiency: No. Consistency: Yes. Popularity: No. Parallel technique: No. Simulator: Trace-driven simulator.
Questions and parameters: 1- Where should the replicas be placed? (access frequency). 2- Which replica location is the best for users? (transfer cost, operational cost, bandwidth).

Rajalakshmi et al. [53]
Main idea: Proposing replica cost.
Bandwidth consumption: Yes. Response time: Yes. Load balancing: No. Fault tolerant: No. Energy efficiency: No. Consistency: Yes. Popularity: No. Parallel technique: No. Simulator: Eucalyptus private Cloud infrastructure.
Questions and parameters: 1- Where should the replicas be placed? (number of accesses).

Hussein and Mousa [54]
Main idea: Predicting the user access to the blocks of each file.
Bandwidth consumption: Yes. Response time: Yes. Load balancing: No. Fault tolerant: Yes. Energy efficiency: No. Consistency: Yes. Popularity: Yes. Parallel technique: No. Simulator: CloudSim.
Questions and parameters: 1- Which files should be replicated? (popularity, failure). 2- Which replica location is the best for users? (popularity, failure). 3- Where should the replicas be placed? (popularity, failure).


Table 2. Performance results for the ARS algorithm.

Response time
Number of jobs    ARS with popularity    ARS without popularity    Loss (%)
100               3200                   3833                      16
200               4566                   5829                      21
300               6256                   8201                      23
400               7112                   10,145                    29
500               9503                   14,998                    36

Effective network usage
Number of jobs    ARS with popularity    ARS without popularity    Loss (%)
100               0.28                   0.35                      20
200               0.32                   0.44                      27
300               0.35                   0.51                      31
400               0.40                   0.69                      42
500               0.43                   0.88                      51

Table 3. Performance results for the AREN algorithm.

Response time
Number of jobs    AREN with popularity    AREN without popularity    Loss (%)
100               3677                    4833                       23
200               4908                    6832                       28
300               6012                    9501                       36
400               8711                    13,973                     38
500               11,001                  17,995                     39

Effective network usage
Number of jobs    AREN with popularity    AREN without popularity    Loss (%)
100               0.32                    0.36                       11
200               0.34                    0.45                       23
300               0.40                    0.61                       34
400               0.42                    0.77                       45
500               0.45                    0.85                       47

Table 4. Performance results for the D2RS algorithm.

Response time
Number of jobs    D2RS with popularity    D2RS without popularity    Loss (%)
100               3022                    3701                       18
200               4141                    5320                       22
300               5332                    7844                       32
400               7065                    11,349                     37
500               7021                    13,996                     49

Effective network usage
Number of jobs    D2RS with popularity    D2RS without popularity    Loss (%)
100               0.26                    0.30                       13
200               0.31                    0.39                       20
300               0.38                    0.51                       25
400               0.46                    0.71                       35
500               0.51                    0.87                       41

In addition, optimization of communication delays improves the quality of user experience of Cloud applications. The main idea of the presented replication algorithm is related to the mathematical approach in GreenCloud, a simulator focusing on energy performance and communication processes in Cloud data centers [47]. The experimental results demonstrated that replicating data close to the data consumers, i.e., the Cloud applications, can decrease energy consumption and bandwidth usage; however, consistency problems were not addressed. Gill and Singh [48] presented a Dynamic Cost-aware Re-replication and Re-balancing Strategy (DCR2S). This method models the cost of replication using the concept of the knapsack problem: whenever the cost of replication increases dramatically with respect to the user budget, the replicas should be stored at higher-cost data centers.

Table 5. Performance results for the PA algorithm.

Response time
Number of jobs    PA with popularity    PA without popularity    Loss (%)
100               3831                  4327                     11
200               4741                  6129                     23
300               6288                  8980                     29
400               8121                  12,057                   32
500               10,344                15,996                   35

Effective network usage
Number of jobs    PA with popularity    PA without popularity    Loss (%)
100               0.36                  0.41                     12
200               0.41                  0.49                     16
300               0.45                  0.67                     32
400               0.49                  0.79                     37
500               0.51                  0.85                     40

Re-replication is performed as long as the availability is lower than the availability specified in the SLA. The access history of each data file is checked to find its popularity, and the replication process for a data file is triggered when its popularity exceeds a dynamic threshold. The method also estimates the number of necessary replicas and stores them in such a way that the cost stays below the budget while keeping a high system byte effective rate; the knapsack approach is applied to minimize the replication cost. Simulation results demonstrated that the DCR2S strategy can reduce the replication cost and improve the system byte effective rate in a heterogeneous Cloud structure. The main weakness of this approach is that it considers the popularity degree only in replica placement; moreover, it ignores the load balancing of system resources. Mansouri [49] introduced an Adaptive Data Replication Strategy (ADRS) that uses average service time, failure probability, load variance, latency, and storage usage as criteria. If one site fails, a replica of the failed site's data can be created on another site to answer the requests; thus, placing popular files on sites with a lower risk of failure can minimize system latency. A novel replica replacement method was also introduced, using file availability, the last time the replica was requested, the number of accesses, and the size of the replica as criteria. ADRS was tested with the CloudSim simulator and shown to improve mean response time, effective network usage, load balancing, replication frequency, and storage usage; however, it neglects the tradeoff among quality attributes, e.g., availability and the costs of different resources. Long et al. [50] designed a Multi-objective Optimized Replication Management (MORM) strategy for storage Clouds based on the artificial immune algorithm [51]. The authors considered several criteria, e.g., mean file unavailability, mean service time, load variance, energy consumption, and mean access latency, to capture the relationship between the number of replicas, the replica layout, and their performance. Replicas are distributed among the data nodes with respect to these five objectives, and a suitable number of replicas is maintained for each data file to reach the optimal objective value. They implemented MORM using an extended CloudSim and the MATLAB toolkit. The experimental results demonstrated that the presented strategy is able to increase file availability, improve system load balancing, reduce mean service time and latency, and decrease energy consumption in the Cloud. Kumar et al. [52] developed a workload-aware data replication strategy, named SWORD, to maximize resource usage in the Cloud. They monitored and modeled the anticipated workload as a hypergraph and proposed partitioning methods that reduce the mean query span, i.e., the mean number of sites involved in the execution of a request or a transaction. They considered the query span as a metric to optimize analytical and transactional workloads and presented a data placement method based on several well-studied graph-theoretic concepts. They also investigated the application of fine-grained quorums to decrease the query spans. The simulation results confirmed that fine-grained quorums are essentially necessary due to their ability to manage the workloads. Rajalakshmi et al. [53] introduced a Dynamic Replica Selection and Placement (DRSP) scheme to enhance the availability of data in the Cloud by combining file application and replication operations in the Eucalyptus Cloud environment. The proposed strategy uses an index and catalog scheme to arrange the replicas into local or remote locations, and the indexer keeps the master and slave replica locations. The algorithm has two main steps: a site is a candidate for selection when its number of accesses is greater than the threshold value Tα, defined as the ratio of the replication threshold (RT) to the number of replicas; in the second step, the existence of enough free space at the destination site is checked. The results showed that the algorithm could improve data access performance and bandwidth utilization thanks to its automatic and transparent replica selection and placement. The main weakness of the DRSP algorithm is that it considers only a limited set of factors in the replication decision. In another algorithm, proposed by Hussein and Mousa [54], reliability and quality of service were improved using a replication factor that adaptively selects the data files for replication. This factor is determined from the data blocks and from each replica whose availability is higher than a predefined threshold.


The number of new replicas is estimated adaptively to improve the availability of each file heuristically. The problem formulation of this strategy is based on the literature [55]. The heuristic provides a dynamic replication algorithm that has low cost and manages large-scale resources and data in a suitable time. The simulation results in a Cloud environment confirmed the improvement obtained by the adaptive replication algorithm; neglecting the energy efficiency problem in the data center is its main disadvantage. Data popularity is the most important factor in the replica decision. For example, the least recently used method considers recent file accesses to determine popularity, while the least frequently used method uses the access frequency in recent times to calculate popularity. Wu et al. [56] presented a replica pre-adjustment algorithm based on trend analysis of file popularity; they calculated file popularity using a binary linear regression forecasting algorithm. Due to the numerous files in a distributed file system and their huge sizes, the probability computation of popularity can be complicated. Ye et al. [57] presented a two-layer Geo-cloud based dynamic replication algorithm called TGstag. It reduces both bandwidth consumption and mean access time under site capacity constraints; it ranks files based on popularity and replicates the most popular files. Wang et al. [58] presented data replication based on historical access records, using the popularity factor to make the replication decision: if the mean popularity of all files is higher than a threshold, a replication takes place. They assigned different values to access records in different intervals to identify popular files, using the half-life when determining the value of the records; however, a dynamic parameter would be more appropriate to adjust to varying network situations. Myint and Hunger [59] used a Markov chain model in a replication algorithm that adapts to file popularity. Some other papers have mainly focused on the popularity factor, such as [60,61]. However, these studies did not consider the popularity problem in all its aspects. They also did not take into account some significant parameters that are highlighted in this study, for example the tradeoff between computation cost and precision. In addition, several deficiencies of the available data popularity calculations were not investigated, such as the way weight values are assigned to requests. All these issues will be discussed in detail in the following sections. A comparison of the different strategies with respect to various factors such as parallel technique, popularity, fault tolerance, and response time has been presented. From this review it can be observed that there is still much to investigate in the field of data replication in Cloud environments. Although some previous works have achieved goals such as lower job response time, better network usage, and better storage usage, they did not regularly pay attention to the dynamic behavior of user access. Thus, DPRS is introduced to address this weakness: it can properly adapt to changes in users' interests by continually assessing file value and storing the necessary replicas at the best sites to enhance overall performance. In this paper we consider three issues that are discussed below.
1) The significance of considering the data popularity parameter in the replication decision is highlighted.
Several works are investigated and the way they consider data popularity is explained. There are various methods of calculating data popularity, and most works use only the number of requests. Despite the benefit of its low computation cost, this way of calculating data popularity has a main weakness: it assumes that a request performed recently has the same value as a request made very long ago, so temporal locality is not taken into account. We propose a new method for popularity determination based on the number of requests and the distribution of the requests over time. If a file attracts a lot of accesses, it is considered popular. Additionally, two other important parameters are considered:
• The data set life time (since it avoids privileging old files at the expense of new ones)
• The distribution of the requests for files over time (since it distinguishes new requests from old ones, so that old requests are not considered as having the same value)
Both metrics are significant since the popularity value of a file changes over time.
2) In contrast to the available works on data replication and energy saving in distributed systems, by considering the skew of the data access pattern we present a strategy that stores only a small amount of data (the frequently accessed files) on an active site, thus saving storage space.
3) Most of the strategies covered in this work have ignored the parallel downloading approach. Several works have integrated data replication strategies with parallel download schemes in data grid environments [62–66]. We introduce an efficient and bandwidth-sensitive parallel download scheme for the Cloud that considers the suitability of a site when determining block sizes. It places a portion of the replica on each server based on several important parameters such as the number of requests, free storage space, and distances.

4. System framework

In the system framework there are several clusters, and every cluster includes a number of sites located in a geographically close area. The proposed architecture, shown in Fig. 1, includes a Global Replica Manager (GRM) and different clusters; the GRM is the root of the tree topology. In this system, each connection consists of multiple routers and links, each cluster is constructed from several sites and a Local Replica Manager (LRM), and LANs connect the clusters. The local replica information table includes the access count, logical filename, file location, physical file name, file popularity, and master-file flag. An original file that cannot be deleted from the system is known as a master file. We describe the file popularity calculation in the next section. The file access count is defined as the number of accesses of the file.
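For concreteness, a minimal sketch of one entry of the local replica information table is given below. The field set mirrors the attributes listed above (logical filename, physical file name, file location, access count, file popularity, master file), while the Java class and field names themselves are illustrative assumptions rather than part of the proposed architecture.

```java
/** Illustrative sketch of one row of the local replica information table. */
public class ReplicaInfoEntry {
    String logicalFileName;   // logical filename
    String physicalFileName;  // physical file name
    String fileLocation;      // site holding this replica
    long   accessCount;       // number of accesses of the file
    double filePopularity;    // FPV value maintained by DPRS
    boolean masterFile;       // original file that cannot be deleted

    public ReplicaInfoEntry(String logicalFileName, String physicalFileName,
                            String fileLocation, boolean masterFile) {
        this.logicalFileName = logicalFileName;
        this.physicalFileName = physicalFileName;
        this.fileLocation = fileLocation;
        this.masterFile = masterFile;
        this.accessCount = 0;
        this.filePopularity = 0.5;  // initial access probability used by DPRS
    }

    /** Record one access in the current round. */
    public void recordAccess() {
        accessCount++;
    }
}
```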


Fig. 1. System framework.

The GRM collects file access records for all clusters and specifies which data should be copied to which locations. The local replica information table and the global replica information table are kept by the LRM and the GRM, respectively; the global table aggregates the information stored in the local tables. The GRM adds the location of each new replica to the global replica information table, and the cluster of the new replica stores the related information in its local replica table.

5. Dynamic popularity aware replication strategy

The GRM executes the DPRS strategy at the end of each round. A round is defined as a constant time interval Td in which y jobs, y ≥ 0, are sent by users. The input of each job is represented as several files. DPRS comprises five phases: aggregation of file accesses, computation of file popularity, determination of files, replica placement, and replication.

Aggregation of file access: DPRS calculates NR_c^n(f_i), the access count of each file f_i located in cluster c at round n. It arranges all files in descending order of NR_c^n(f_i) and places the sorted result into a set Set1. Using the information stored in the local replica tables, DPRS determines TF_c^n, the total number of files requested by all nodes in cluster c at round n. We assume that 1 ≤ i ≤ Nk and 1 ≤ c ≤ Nc, where Nk is the number of files in cluster c in round n, Nc is the number of clusters in the system, and n = 1, 2, 3, ….

Computation of file popularity: A popularity value for file f_i, represented by FPV_c^n(f_i), is computed as

FPV_c^n(f_i) = \begin{cases} \dfrac{FPV_c^{n-1}(f_i) + \left(NR_c^n(f_i) \times p\right)}{Size(f_i)} \times \dfrac{NR_c^n(f_i)}{TNR^n}, & NR_c^n(f_i) > 0 \\ FPV_c^{n-1}(f_i) - q, & \text{otherwise} \end{cases} \qquad (1)

Where NR_c^n(f_i) is the number of requests for file f_i coming from cluster c at round n, Size(f_i) is the size of the file, and TNR^n is the total number of requests generated in round n. We assume that p and q are constants with p < q; the reason for choosing p < q is discussed in Section 6.3. When f_i has been accessed by users in round n (NR_c^n(f_i) > 0), DPRS increases FPV_c^{n-1}(f_i) by NR_c^n(f_i) × p; otherwise, DPRS decreases FPV_c^{n-1}(f_i) by the factor q. The more popular f_i is, the higher FPV_c^n(f_i). Note that all files follow the binomial distribution in round 0, and FPV_c^0(f_i) is 0.5, i.e., the initial access probability of f_i is 0.5. Zero is the lowest value allowed for FPV_c^{n-1}(f_i). File size is an important parameter because the cluster sites have bounded storage capacity and small files need less room. When NR_c^n(f_i) is zero, i.e., file f_i was not requested at all in cluster c at round n, the size parameter plays no role in the computation of FPV_c^n(f_i), whatever the file size is.
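A minimal sketch of the per-round update of Eq. (1) is given below, assuming the values p = 0.1 and q = 0.15 recommended in Section 6.3; the class and method names are illustrative and not part of DPRS's implementation.

```java
/**
 * Sketch of the per-round popularity update of Eq. (1).
 * prevFpv  : FPV_c^{n-1}(f_i)
 * requests : NR_c^n(f_i), requests for f_i from cluster c in round n
 * totalReq : TNR^n, total requests generated in round n
 * sizeMb   : Size(f_i)
 */
public final class PopularityUpdate {
    static final double P = 0.1;    // p, reward factor (p < q)
    static final double Q = 0.15;   // q, decay factor

    static double updatePopularity(double prevFpv, long requests,
                                   long totalReq, double sizeMb) {
        if (requests > 0) {
            double ratio = (double) requests / totalReq;          // NR / TNR
            return (prevFpv + requests * P) / sizeMb * ratio;     // first branch of Eq. (1)
        }
        return Math.max(0.0, prevFpv - Q);                        // decay, floored at zero
    }

    public static void main(String[] args) {
        // A file of 2000 Mb accessed 40 times out of 500 total requests this round (illustrative numbers).
        System.out.println("new FPV = " + updatePopularity(0.5, 40, 500, 2000.0));
        // An unrequested file simply decays toward zero.
        System.out.println("decayed FPV = " + updatePopularity(0.5, 0, 500, 2000.0));
    }
}
```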


Consequently, files with FPV_c^n(f_i) > 0 will have a better chance of selection than files with FPV_c^n(f_i) = 0, even if the requested files are large. The proportion of the number of requests for f_i in round n from cluster c to the total number of requests in this round is represented by NR_c^n(f_i)/TNR^n. The mean popularity of the files over all clusters, FPV_{avg}^n(f_i), is determined as follows:

FPV_{avg}^n(f_i) = \frac{\sum_{c=1}^{NC} FPV_c^n(f_i)}{NC} \qquad (2)

Where NC is the number of clusters in the system. All clusters are used in the computation of the mean popularity of a file, even if some of them do not hold the file, since a cluster that does not hold the file might still have a high access rate for it; such clusters must therefore be included in the computation.

Determination of files: DPRS sorts the set Set1 in decreasing order of mean popularity value. It determines the number of files that should be copied (Num_i) and then chooses the first Num_i files in Set1 as cluster c's replication candidates, where

Num_i = TF_c^n \times (1 - x) \qquad (3)

Where 0 < x < 1. DPRS copies the top 20% of frequently accessed files based on the 80/20 rule; thus, we set x to 0.8. The value of x can be changed under different rules or assumptions.

Replica placement: For each candidate file f_i, the merit M_i^n of each site n is determined by

M_i^n = W_1 \times \frac{NR_i^n}{HNR_i} + W_2 \times \frac{FS_n}{TS} + W_3 \times \left(1 - \frac{Dis_n}{HDis}\right) \qquad (4)

Where M_i^n is the merit value of site n for file f_i, NR_i^n is the number of requests for file f_i from site n, and HNR_i is the highest number of requests for file f_i among the sites. FS_n is the free storage space of site n and TS is the total storage space. Dis_n is the sum of the distances between cluster site n and the other sites in the cluster, and HDis is the largest Dis value among the cluster sites. We assume that W_1 + W_2 + W_3 = 1, where W_1, W_2, and W_3 weight the three parameters above; if the three parameters have the same weight, then W_1 = W_2 = W_3 = 1/3. The values of W_1, W_2, and W_3 can be set differently according to user preference. If the site with the greatest number of requests is more important to the user, a higher weight should be given to the NR_i^n/HNR_i term by increasing W_1 and decreasing W_2 and W_3 in return. If load balancing across sites is of more interest, a higher weight should be given to the FS_n/TS term by increasing W_2. If choosing the site that is most central among the other sites in the cluster is more important, a higher weight should be given to the (1 − Dis_n/HDis) term by increasing W_3. We aggregate the distances from site n to the other sites in the same cluster to determine Dis_n; among the Dis values of the sites within the same cluster, the highest value is taken as HDis. The number of requests is a common metric of how popular a cluster site is, free storage space is considered to balance the load of the sites in the cluster, and the centrality of a site in its cluster is determined by the sum of distances. The centrality factor is considered in order to decrease the overall execution time and the bandwidth needed to fetch non-existing files. DPRS then sorts all sites by merit value M_i^n in descending order and stores the sorting result in Set3.

Replication phase: In this section, we discuss the advantages of parallel downloading and then explain replica placement in the system. Parallel download is an attractive idea for enhancing data transfer performance. The most notable difference between parallel download and single download is that parallel download opens several connections from the client to the servers and transfers the data simultaneously; by moving data from more than one site, parallel download increases the download speed. In a Cloud system, obtaining large-scale datasets from remote sites as quickly as possible using parallel download methods is a must. To the best of our knowledge, few works have attempted to propose such replication strategies for Cloud systems; most of the existing work targets grid systems. In order to access huge datasets efficiently in a Cloud system, we present a parallel download scheme. Due to the limitation of storage capacity, it is not practical for all data files to have several complete replicas across the sites. Thus, in our parallel download method, a data file is broken into N (N ∈ {2, 3, 4, …}) parts placed on N different storage sites. These replica parts are transferred from the various sites concurrently to the requesting site and reassembled there. The data access time is significantly reduced by the parallel downloading scheme, and completeness is preserved at the same time. This yields three primary benefits over traditional methods: optimized storage usage, by generating fragments; increased data access performance, by using parallel I/O techniques; and reduced unnecessary replication, by distributing the fragments among sites in an efficient way. DPRS determines whether a candidate file is already present in cluster c. If so, DPRS does nothing; otherwise, the data file is split into parts and placed on various sites. Traditional parallel download approaches divide the file into x equal parts, where x is the number of servers, and each server is assigned (file size / x) bytes to transfer to the requester; when all servers have finished their assigned portions, the transfer is complete. The traditional strategies give good results only when the conditions of the network, servers, and user are stable. Therefore, it is necessary to design a parallel download strategy that can cope with fluctuations in the environment. The unequal approach divides the file into unequal parts and improves on the static equal-split strategy.


Fig. 2. An example of the replica placement.

The main parameter in the unequal approach is the size of each fragment, which is based on the performance or throughput of each server. In this work, we introduce an efficient parallel download scheme for the Cloud that takes the suitability of each site into account when determining fragment sizes. It places a portion of the replica on each server based on several important parameters such as the number of requests, free storage space, and distances; in other words, the larger portion of the file is replicated on the better site, which reduces data transfer. DPRS selects N sites from the list Set3, divides the replica into N segments, and replicates the N segments across the N nodes. p_n indicates the portion of the replica to be replicated on node n:

p_n = \frac{M_i^n}{\sum_{j \in C} M_i^j} \qquad (5)

where M_i^n denotes the merit value of node n for file f_i, and C denotes the set of N nodes (including node n). We explain an example of the algorithm's operation with N = 2. After running step 1 of DPRS, the set of files that should be replicated is Set2 = {file1, file3, file9}, as shown in Fig. 2. In this example, there are 5 sites in the particular cluster. For file1, DPRS selects 2 sites (sites A and C) from the set Set3, which is created using (4). Using (5), the percentage to be replicated on site A is 60%; therefore, we replicate the first 60% of file1 on site A and place the remaining 40% of file1 on site C. If file1 is requested, the first 60% of file1 is downloaded from site A and the remaining 40% from site C. After both fragments are downloaded, they are reassembled as file1. File3 and file9 are placed in the same way. Therefore, we can access copies in parallel and consume less space per site to keep the replicas. In whole-file replication techniques, when a replica of a K GB file is created, K GB is required on the target site. In DPRS we need more target sites to create the fragments, which should not be a problem in a large Cloud environment; however, each target site requires less space to store its fragment. To achieve this capability, we need extra replica information in the Replica Manager of the architecture: the replica information table stores additional file characteristics such as the file name, number of segments, fragmented replica size, file popularity, access count, etc. The LRM and GRM know the exact physical locations of the sites that hold the requested replica segments. An example of parallel download is presented in Fig. 3. Modeling the largest computing system, the Cloud, requires at least an efficient abstraction of Cloud components such as Data Center, Host, Broker, and Virtual Machine. Moreover, the main packages of CloudSim, namely "org.cloudbus.cloudsim" and "org.cloudbus.cloudsim.core", and some of their classes are overridden to adapt the software to the proposed replication strategy. Specifically, various resource characteristics are modeled using built-in classes such as "Datacenter", "Host", "Vm", "Pe", and "Cloudlet". In this work, the data replication strategy is simulated in CloudSim and the program is extended in the UserCode layer. A makeCopy() method is implemented in the File class of CloudSim to copy a data file to the appropriate location, and a ReplicaManager() method is added to manage all data manipulation on a resource. Meanwhile, the Storage, NetworkTopology, and FileAttribute classes also need to be modified, i.e., various parameters of files and the network are added. Moreover, the modified CloudSim software can initiate striped data transfers and supports partial data file transfer between two nodes. Fig. 4 shows the transitions for network communication in CloudSim [67]. The parallel download approach imposes additional overhead in managing a file that is downloaded from several locations; therefore, without a careful control method, parallel download efficiency will not be reasonable when there are multiple simultaneous parallel downloads. Another advantage of the proposed method is its simple implementation. The DPRS algorithm is summarized in Fig. 5.
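The placement computation of Eqs. (4) and (5) can be sketched as follows. The weights are the default W1 = W2 = W3 = 1/3, and the sample inputs are invented for illustration, so the resulting split only mimics the kind of 60%/40% division shown in Fig. 2 rather than reproducing it.

```java
import java.util.List;

/** Sketch of the site-merit computation (Eq. 4) and fragment sizing (Eq. 5). */
public class ReplicaPlacementSketch {
    static final double W1 = 1.0 / 3, W2 = 1.0 / 3, W3 = 1.0 / 3;

    /** Eq. (4): merit of one site for a candidate file. */
    static double merit(double requests, double highestRequests,
                        double freeSpace, double totalSpace,
                        double dist, double highestDist) {
        return W1 * (requests / highestRequests)
             + W2 * (freeSpace / totalSpace)
             + W3 * (1.0 - dist / highestDist);
    }

    /** Eq. (5): portion of the replica assigned to each of the N selected sites. */
    static double[] portions(List<Double> merits) {
        double sum = merits.stream().mapToDouble(Double::doubleValue).sum();
        double[] p = new double[merits.size()];
        for (int n = 0; n < merits.size(); n++) p[n] = merits.get(n) / sum;
        return p;
    }

    public static void main(String[] args) {
        // Two candidate sites with illustrative parameter values (not taken from the paper).
        double meritA = merit(90, 100, 40_000, 60_000, 2, 5);
        double meritC = merit(50, 100, 20_000, 60_000, 3, 5);
        double[] p = portions(List.of(meritA, meritC));
        System.out.printf("site A gets %.0f%% of the file, site C gets %.0f%%%n",
                100 * p[0], 100 * p[1]);
    }
}
```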


Fig. 3. An example of parallel downloading.

Fig. 4. Network communication in CloudSim.

In summary, we answer three questions that are discussed below.
1. How can the popularity metric be considered in replication management? • We calculate the popularity based on the mean of the requests over time.
2. How many files should be replicated? • We use the 80/20 rule: since most of the file requests can be served by a small portion of the files, replicating the hotspot files should keep the system performance at a reasonable level while saving resources.
3. Where should replicas be placed? • Due to the finite storage capacity of each site and to improve transfer time, we place a portion of the replica on each server based on several important parameters such as the number of requests, free storage space, and distances. By means of these replicas, the parallel download scheme improves the performance of data transfer.

6. Simulation and performance comparison

In this section, the CloudSim structure, the impact of popularity on replication performance, the simulation setup, and the discussion of the simulation results are described.

6.1. CloudSim architecture

A schematic of the multi-layered structure of the CloudSim architecture is shown in Fig. 1. The first version of CloudSim employed SimJava [68], which provides various core functionalities, e.g., queuing and processing of events, creation of Cloud system entities (services, hosts, data centers, brokers, virtual machines), communication between elements, and administration of the simulation clock. However, in the present version, the SimJava layer has been omitted to enhance the simulation performance. The CloudSim simulation layer provides a substrate for modeling virtualized Cloud-based data center environments, including dedicated administration interfaces for virtual machines, memory, storage, and bandwidth.


Fig. 5. The DPRS algorithm.

This layer also administrates some of the most common concerns, e.g., provisioning of hosts to virtual machines, managing application execution, and monitoring the dynamic system state. A Cloud provider that wants to study the performance of different allocation policies for mapping its hosts to virtual machines implements the appropriate policy at this layer. A distinctive aspect of this layer is the provisioning of hosts to virtual machines: a Cloud host can be simultaneously shared among a set of virtual machines that run applications according to the SaaS provider's defined QoS levels. The main part of the top-most layer in CloudSim is the user code, which provides the basic entities for hosts (number of machines and their requirements), virtual machines, the number of users and their application types, and broker scheduling strategies. By extending the basic entities defined at this layer, a Cloud application developer is able to: (i) create various workload request distributions and application configurations; (ii) model Cloud availability scenarios and perform robust tests based on custom settings; and (iii) implement custom application provisioning approaches for Clouds and their federations.

6.2. Performance dependency of data replication strategies on the popularity parameter

We discuss the experimental results after carrying out simulations using the Java-based CloudSim simulator [69], which models and simulates Cloud data centers; users and resources can be created by rewriting the corresponding code. We generate 64 data centers based on the topology shown in Fig. 6 [70]. 1000 virtual machines are considered for the service providers, each with two to four processing elements. One hundred different files are stored, each with a size in the range [0.1, 10] GB, and each file is stored in fixed-size 2 GB units. All jobs are sent to the service providers following a Poisson distribution, and each job randomly needs 1 or 2 data files. At the start of the simulation, there is one replica of each file. We study the Adaptive Replication Strategy (ARS) [54], the Adaptive Replication scheme for Edge Networks (AREN) [70], the Dynamic Data Replication Strategy (D2RS) [71], and the Pre-adjust Algorithm (PA) [56]. The specific parameters of each algorithm were set to the values given in their original works. We evaluated each of these algorithms from two points of view: 1) the original version, which contains the popularity factor; and 2) a modified version, which removes the popularity factor, i.e., all files have identical popularity. The performance loss due to removing the popularity factor is quantified as

Loss\ Value = \frac{Modified\ Version\ Value - Original\ Version\ Value}{Modified\ Version\ Value} \qquad (6)
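As a quick worked check of Eq. (6), using numbers already reported in Table 2 rather than new data, the ARS response times for 100 jobs give

Loss\ Value = \frac{3833 - 3200}{3833} \approx 0.165

i.e., a loss of about 16%, which matches the first row of Table 2.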


Fig. 6. The Cloud data server architecture.

Table 6. Performance results for the DPRS algorithm.

Response time
Number of jobs    DPRS with popularity    DPRS without popularity    Loss (%)
100               1500                    2042                       26
200               1794                    2483                       28
300               3027                    4298                       31
400               4221                    6320                       33
500               5891                    8946                       35

Effective network usage
Number of jobs    DPRS with popularity    DPRS without popularity    Loss (%)
100               0.20                    0.25                       19
200               0.24                    0.32                       25
300               0.29                    0.45                       34
400               0.32                    0.53                       39
500               0.36                    0.78                       53

The results clearly show that the response time of all algorithms increases when the popularity factor is removed. In terms of ENU, the loss is equal to 51%, 47%, 41%, and 40% for the ARS, AREN, D2RS, and PA algorithms, respectively, and the loss values of these algorithms in terms of response time reach 36%, 39%, 49%, and 35%. Removing the popularity factor causes fewer of the necessary copies to be stored locally and a greater cost to be incurred in accessing files remotely. Consequently, the importance of the data popularity factor in the replication decision is confirmed. The performance results for DPRS are presented in Table 6; obviously, the popularity parameter has a significant effect on ENU and response time.

6.3. Simulation setup

As shown in Fig. 1, the proposed environment consists of a GRM and ten clusters. Table 7 summarizes the resource and job parameters. The experimental environment was also used to analyze the parameters p and q of Eq. (1). Fig. 7 shows the response time of DPRS when p < q (i.e., p = 0.1, q = 0.15), p = q (i.e., p = 0.1, q = 0.1), and p > q (i.e., p = 0.15, q = 0.1) under the Zipf distribution. The case (p = 0.1, q = 0.15) typically yields the best response time and is best able to identify the most popular files; therefore, in this study we set (p = 0.1, q = 0.15) for the simulations.
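For reproducibility, a minimal sketch of a Zipf-like request generator of the kind used for the access pattern (probability of the i-th most popular file proportional to 1/i^α, as discussed in Section 1) is given below. The class name, the choice α = 0.8, and the use of java.util.Random are our own illustrative assumptions; only the file count of 200 follows Table 7.

```java
import java.util.Random;

/** Minimal Zipf-like file-request sampler: P(rank i) proportional to 1 / i^alpha. */
public class ZipfRequestGenerator {
    private final double[] cdf;             // cumulative probabilities over file ranks
    private final Random rng = new Random(42);

    public ZipfRequestGenerator(int numberOfFiles, double alpha) {
        double[] weights = new double[numberOfFiles];
        double sum = 0.0;
        for (int i = 1; i <= numberOfFiles; i++) {
            weights[i - 1] = 1.0 / Math.pow(i, alpha);
            sum += weights[i - 1];
        }
        cdf = new double[numberOfFiles];
        double running = 0.0;
        for (int i = 0; i < numberOfFiles; i++) {
            running += weights[i] / sum;
            cdf[i] = running;
        }
    }

    /** Returns the rank (0 = most popular file) of the next requested file. */
    public int nextFileRank() {
        double u = rng.nextDouble();
        for (int i = 0; i < cdf.length; i++) {
            if (u <= cdf[i]) return i;
        }
        return cdf.length - 1;
    }

    public static void main(String[] args) {
        ZipfRequestGenerator gen = new ZipfRequestGenerator(200, 0.8); // 200 files, alpha < 1 (assumed)
        int[] hits = new int[200];
        for (int r = 0; r < 10_000; r++) hits[gen.nextFileRank()]++;
        System.out.println("requests to the most popular file: " + hits[0]);
    }
}
```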

Table 7. Simulation parameters.

Parameter                                                               Value
Total number of clusters                                                10
Number of nodes                                                         100
Number of nodes within the same cluster                                 10
Number of different files                                               200
Size of each file                                                       Between 1000 and 20,000 (Mb)
Storage space for every cluster node                                    60,000 (Mb)
Number of files accessed by a job                                       3–10
Round length                                                            100
Number of intermediate nodes between two nodes in the same cluster      1
Number of intermediate nodes between two successive clusters            3
Inter-router bandwidth                                                  10 Gb/s
Router-to-site bandwidth                                                2.5 Gb/s
User-to-router bandwidth                                                100 Gb/s
GRM-to-router bandwidth                                                 2.5 Gb/s
LRM-to-router bandwidth                                                 1 Gb/s
The duration of a round (Td)                                            1000 (s)
W1, W2, W3                                                              1/3

Fig. 7. The response time for different values of p and q.

6.4. Simulation results and analysis

6.4.1. Average response time
If the response time is defined as the duration between the sending of a job and the receiving of its answer, then the average response time is calculated from Eq. (7):

\text{Average Response Time} = \frac{\sum_{j=1}^{m} \sum_{k=1}^{m_j} \left( ts_{jk}(rt) - ts_{jk}(st) \right)}{\sum_{j=1}^{m} m_j} \qquad (7)

Where ts_jk(st) and ts_jk(rt) are the sending and receiving times of job k of user j, respectively, and m_j is the number of jobs of user j. Fig. 8 shows the average response time of the six dynamic replication algorithms for the uniform and Zipf distributions. The average response time of the MORM method is better than that of DRSP by about 32% for 1000 tasks. This is because MORM uses various criteria, i.e., the average file unavailability, average service time, load variance, energy consumption, and mean access latency, to find the relationship among the replica number, the replica layout, and their performance. By increasing the number of replicas, the ADRM strategy has a better chance of finding a suitable replica; consequently, each replica's access load is further minimized and becomes more balanced, and the average response time of the ADRM strategy further decreases. The SWORD strategy is inflexible in reasonably apportioning the replica access load, which reduces its dynamic adjustment ability and leads to a higher average response time. DPRS has the lowest response time in comparison with the SWORD, D2RS, DRSP, ADRM, MORM, and ADRS strategies. This can be related to its ability to provide an intelligent data placement that balances the system load and optimizes the job response time.
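Eq. (7) amounts to averaging the receive-minus-send times over all jobs of all users; the sketch below shows this computation with an illustrative job record (the class, record, and method names are our own, and the timestamps in main are invented).

```java
import java.util.List;

/** Sketch of the average response time metric of Eq. (7). */
public class ResponseTimeMetric {
    /** One submitted job: its send time ts(st) and the time its answer was received ts(rt). */
    record Job(double sendTime, double receiveTime) {}

    /** Average of (receive - send) over all jobs of all users. */
    static double averageResponseTime(List<List<Job>> jobsPerUser) {
        double total = 0.0;
        long count = 0;
        for (List<Job> userJobs : jobsPerUser) {
            for (Job job : userJobs) {
                total += job.receiveTime() - job.sendTime();
                count++;
            }
        }
        return count == 0 ? 0.0 : total / count;
    }

    public static void main(String[] args) {
        // Two users with illustrative job timestamps (in seconds).
        List<List<Job>> jobs = List.of(
                List.of(new Job(0, 1200), new Job(10, 1900)),
                List.of(new Job(5, 3100)));
        System.out.println("average response time = " + averageResponseTime(jobs));
    }
}
```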


Fig. 8. Average response time for different replication algorithms.

Fig. 9. Effective network usage for different replication algorithms.

6.4.2. Effective network usage
The effective network usage (ENU) reflects the efficiency of the network resource usage and is calculated from Eq. (8) [72]:

E_{enu} = \frac{N_{rfa} + N_{fa}}{N_{lfa}} \qquad (8)

where N_rfa indicates the number of times a site reads a file from a remote site, N_fa is the total number of file replication operations, and N_lfa indicates the number of times a site reads a file locally. The ENU ranges between 0 and 1; clearly, the lower the ENU, the higher the bandwidth usage efficiency. Despite the time and bandwidth consumed by data replication itself, applying this technique even in its simplest form increases network performance, and selecting an efficient replication strategy decreases future traffic. A lower ENU indicates better performance in placing data files at the proper locations. The effective network usage of the different replication algorithms is shown in Fig. 9. The MORM and ADRS algorithms improve the ENU by about 40–50%. The MORM strategy increases the number of local accesses and reduces the number of replicas, and consequently decreases both the amount of bandwidth used and the ENU. Moreover, the ENU of ADRS is about 10% lower than that of MORM, since the ADRS algorithm is able to dynamically pre-replicate the files needed by the next job to suitable site storage. The lowest ENU belongs to the DPRS strategy, due to its ability to use the available replicas at all sites without consuming network bandwidth.


Fig. 10. Replication frequency for different replication algorithms.

6.4.3. Replication frequency

The replication frequency is defined as the ratio of the number of replications to the number of data accesses, i.e., how many replications take place per data access. A lower value indicates that a method is better at locating data at the best sites. Replication consumes network bandwidth and, because of the disk I/O it causes, increases the load on the replica servers. To avoid heavy network and server load, the replication frequency should be kept as low as possible. As shown in Fig. 10, the lowest frequency belongs to the DPRS strategy, owing to its ability to store replicas according to the suitability of each site. The replication frequency of the MORM strategy is below 0.3, i.e., fewer than 30 replicas are created per 100 data accesses. The replication frequency of the D2RS strategy is higher than 1, meaning that at least one replica is created for every data access; this comparatively high replication frequency is the main drawback of its performance in the real world. The replication frequency of the SWORD strategy is similar to that of D2RS, because it performs replication and predicts future requests at every file request. In summary, as the number of replications grows, the number of file transmissions grows as well, and the consumption of network bandwidth increases considerably. The DPRS policy maintains a reasonable number of replicas as the environment changes, which improves availability and avoids unnecessary replication.

6.4.4. Storage usage

The next evaluation criterion is the occupation of the storage elements. The lower the fraction of the storage capacity that is occupied, the better the algorithm, and the less often the system faces a shortage of storage space.

$$\text{Storage Usage} = \frac{\text{Filled\_Space}}{\text{Available\_Space}} \quad (9)$$
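As a hedged illustration of the two preceding metrics, the short sketch below computes the replication frequency and the storage usage ratio of Eq. (9); all argument names are assumptions for the example, not identifiers from the paper or the simulator.

```python
# Minimal sketch: replication frequency and storage usage (Eq. (9)).
# All argument names are illustrative placeholders.

def replication_frequency(num_replications, num_accesses):
    """Replications performed per data access; lower is better."""
    return num_replications / num_accesses if num_accesses else 0.0

def storage_usage(filled_space, capacity):
    """Fraction of the available storage space that is occupied."""
    return filled_space / capacity if capacity else 0.0

# Example: 25 replications over 100 accesses, and a storage element with
# 60,000 Mb capacity of which 21,000 Mb is filled.
print(replication_frequency(25, 100))   # 0.25
print(storage_usage(21_000, 60_000))    # 0.35 -> 35% of storage used
```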

Storage usage is very informative, as it depicts the percentage of storage used during the simulation. It allows a strategy to be judged from two opposite points of view: on the one hand, the objective could be to minimize storage usage, perhaps because the resource cost is proportional to the amount being used; on the other hand, the cost might be fixed, and the goal would then be to maximize the use of the storage space. Fig. 11 compares the storage usage of the evaluated algorithms. The ADRS strategy stores the needed file at one appropriate site rather than at many sites, which decreases its storage usage. The storage usage of MORM is about 24% better than that of the SWORD strategy. The minimum storage usage belongs to the DPRS strategy, owing to its ability to replicate only the frequently accessed data, which form a small portion of the overall data set. Because of the limited storage capacity, it is not practical to keep complete replicas of all data files across the sites; DPRS breaks a data file into parts and places them on the best storage elements, which reduces the resource consumption of data replication.

6.4.5. Hit ratio

The hit ratio is defined by

$$\text{Hit ratio} = \frac{\text{Number of local file accesses}}{\text{Number of local file accesses} + \text{Number of replications} + \text{Number of remote file accesses}} \quad (10)$$


Fig. 11. Storage resources usage for different replication algorithms.

Fig. 12. Hit ratio for different replication algorithms.

This is another metric for comparing the performance of the reported and the proposed strategies, as shown in Fig. 12. The highest hit ratio belongs to the DPRS algorithm, because it increases the total number of local accesses by placing replicas at appropriate sites and avoiding unnecessary replication. This characteristic of the proposed strategy reduces remote accesses as well as the job execution time.
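For completeness, a minimal sketch of the hit-ratio computation of Eq. (10) follows; as with the earlier sketches, the counter names are illustrative placeholders.

```python
# Minimal sketch of Eq. (10): hit ratio from access and replication counters.
# Counter names are illustrative placeholders.

def hit_ratio(local_accesses, replications, remote_accesses):
    denominator = local_accesses + replications + remote_accesses
    return local_accesses / denominator if denominator else 0.0

# Example: 150 local accesses, 20 replications, 30 remote accesses.
print(hit_ratio(150, 20, 30))  # 0.75 -- higher is better
```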

7. Tradeoff between the calculation cost and the accuracy level

Different replication methods weight the requests in different ways and therefore differ in their calculation cost and in the accuracy of their results. The time interval over which requests are counted plays a key role in the weighting process. A short interval means that many weights are considered in the computation of file popularity, so the precision of the results and the calculation cost both increase. A long interval means that fewer weights are assigned to the requests, so the accuracy of the results decreases while the calculation cost drops.


Fig. 13. The number of intervals based on the granularity level.

As a consequence, there is a tradeoff between reducing the calculation cost and enhancing the precision of the results. Some users prefer high precision and accept a heavy calculation cost, while others want a low calculation cost and accept less precision. Most of the strategies discussed in this work ignore this tradeoff when considering popularity: they assume a fixed interval and determine the weight of each file based only on the interval it belongs to. This assumption is not suitable when the interval is relatively long, since a file requested at the beginning of the interval and a file requested at the end of the same interval receive identical weights despite the long period of time between them. Temporal locality is estimated more accurately when the weights are set over fine-grained intervals. We therefore offer a changeable request period, which can be decreased if the users or systems prefer higher accuracy, or increased if they want a lower calculation cost. The granularity level is defined as the factor that controls the split of the time period into intervals. When time is divided into few intervals, the granularity level value is low, so the calculation cost is not high; when time is divided into a large number of intervals, the granularity level value is high, so both the calculation cost and the accuracy are high. The number of intervals based on the granularity level is determined by:

$$\text{Num\_Intervals} = \frac{\text{Life Time}}{\text{Granularity Level}} \quad (11)$$
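The sketch below evaluates Eq. (11) and adds one illustrative way of weighting requests by interval; the linear weighting scheme and all identifiers are assumptions of this example rather than the exact weighting used by DPRS.

```python
# Minimal sketch of Eq. (11) plus an illustrative interval-based weighting of
# requests. The linear weighting and the identifiers are assumptions for the
# example, not the exact scheme used by DPRS.

def num_intervals(life_time, granularity_level):
    """Eq. (11): number of intervals obtained for a given granularity level."""
    return max(1, int(life_time / granularity_level))

def weighted_request_count(access_times, life_time, granularity_level):
    """Weight each request by the (1-based) index of its interval, so requests
    in later (more recent) intervals contribute more to the popularity score."""
    n = num_intervals(life_time, granularity_level)
    interval_len = life_time / n
    score = 0.0
    for t in access_times:
        idx = min(int(t / interval_len), n - 1)  # interval index of this request
        score += (idx + 1) / n                   # newer interval => larger weight
    return score

# Example: a 100 s lifetime split with two different granularity levels.
accesses = [5, 12, 40, 77, 90, 99]
print(num_intervals(100, 10))                     # 10 intervals
print(weighted_request_count(accesses, 100, 10))  # fine-grained weights
print(weighted_request_count(accesses, 100, 50))  # coarse weights, cheaper
```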

Fig. 13 shows the impact of the granularity level on the number of intervals. The chosen granularity level depends on the tradeoff between the accuracy and the cost factor.

8. Conclusion

The large scale of data in cluster, grid, or Cloud environments significantly increases the necessity of data management. Data replication is an effective approach to decrease user waiting time and improve data access by creating several replicas of the same data. In this study, a Dynamic Popularity aware Replication Strategy (DPRS) is proposed based on data access popularity and parallel download. DPRS determines the number of replicas as well as the appropriate sites for their placement based on the number of requests, the free storage space, and the site centrality. Moreover, for data-intensive jobs in a Cloud environment, the simple proposed parallel downloading algorithm can enhance the overall performance through its ability to transfer data fragments. The performance evaluation of DPRS and the other replication methods is carried out with the CloudSim simulator. DPRS is compared with five existing algorithms, i.e., the SWORD, D2RS, DRSP, MORM, and ADRS strategies. DPRS improves the mean response time, effective network usage, replication frequency, storage usage, and hit ratio with respect to the others; in particular, the average response time of DPRS is 36% lower than that of the MORM strategy for 1000 tasks. DPRS can save storage space in a distributed system while the availability and performance QoS requirements are ensured. Some issues remain open for future work:

1. There is no single agreed structure for the Cloud environment. Most works use a flat structure, but a general graph is a more realistic structure.

2. Security problems have not been discussed in this work. The default values may be adapted according to the reliability of the Cloud environment.

3. Another open research issue is data consistency, which is generally ignored by most papers. The files in Cloud environments can be modified over time, which raises the challenge of maintaining consistency among the several copies in distributed locations. Adaptive replica consistency is therefore critical, and new popularity metrics for read/write systems also have to be introduced.


4. An empirical investigation of server assignment scenarios based on the actual downloading process is necessary, since new phenomena may be discovered from it and an efficient server assignment strategy for parallel download can then be devised.

5. The energy-efficient virtual machine placement problem is not discussed in this work. We assumed a fixed memory bandwidth requirement and a restriction on the number of virtual machine copies. It would be better to increase the total processing requirement by a particular parameter whenever a virtual machine is copied, and communication network I/O and secondary storage should also be taken into account in this decision-making.

References [1] The information of Moore’s Law on Intel’s page: http://www.intel.com/technology/mooreslaw/index.htm. [2] Chip Walter, Kryder’s Law, Scientific American, August 2005. [3] S. Fu, L. He, X. Liao, C. Huang, Developing the Cloud-integrated data replication framework in decentralized online social networks, J. Comput. Syst. Sci. 82 (2016) 113–129. [4] C. Karakoyunlu, J.A. Chandy, Exploiting user metadata for energy-aware node allocation in a Cloud storage system, J. Comput. Syst. Sci. 82 (2016) 282–309. [5] K. Li, Power and performance management for parallel computations in Clouds and data centers, J. Comput. Syst. Sci. 82 (2016) 174–190. [6] M. Armbrust, A. Fox, R. Griffith, A.D. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, A view of Cloud computing, Commun. ACM 53 (2010) 50–58. [7] P. Mell, T. Grance, The NIST definition of Cloud computing, Computer Security Division, Information Technology Laboratory, 9, National Institute of Standards and Technology, 2011. [8] Q. Zhang, L. Cheng, R. Boutaba, Cloud computing: state-of-the-art and research challenges, J. Internet Serv. Appl. 1 (2010) 7–18. [9] J.O. Gutierrez-Garcia, K.M. Sim, A family of heuristics for agent-based Cloud bag-of-tasks scheduling, in: 2011 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery, 2011, pp. 416–423. [10] J. Taheri, A.Y. Zomaya, S.U. Khan, Genetic algorithm in finding Pareto frontier of optimizing data transfer versus job execution in grids, Concurrency Comput. (2012). [11] A. Turk, K.Y. Oktay, C. Aykanat, Query-log aware replicated declustering, IEEE Trans. Parallel Distrib. Syst. 24 (5) (2013) 987–995. [12] S. Zaman, D. Grosu, A distributed algorithm for the replica placement problem, IEEE Trans. Parallel Distrib. Syst. 22 (9) (2011) 1455–1468. [13] H. Shen, An efficient and adaptive decentralized file replication algorithm in P2P file sharing systems, IEEE Trans. Parallel Distrib. Syst. 21 (6) (2010) 827–840. [14] N. Mansouri, QDR: a QoS-aware data replication algorithm for Data Grids considering security factors, Cluster Comput. 19 (2016) 1071–1087. [15] A.H. Guroob, D.H. Manjaiah, Efficient replica consistency model (ERCM) for update propagation in data grid environment, International Conference on Information Communication and Embedded System, 2016. [16] N. Mansouri, G.H. Dastghaibyfard, E. Mansouri, Combination of data replication and scheduling algorithm for improving data availability in Data Grids, J. Netw. Comput. Appl. 36 (2013) 711–722. [17] L. Breslau, P. Cao, L. Fan, G. Phillips, S. Shenker, Web caching and Zipf-like distributions: evidence and implications, in: Proceedings of IEEE INFOCOM’99, no.1, New York, USA, 1999, pp. 126–134. [18] D.G. Cameron, R. Carvajal-Schiaffino, A. Paul Millar, C. Nicholson, K. Stockinger, F. Zini, Evaluating scheduling and replica optimisation strategies in OptorSim, The International Workshop on Grid Computing, Phoenix, Arizona, November 17, IEEE Computer Society Press, 2003. [19] K. Ranganathan, I. Foster, Decoupling computation and data scheduling in distributed data intensive applications, International Symposium for High Performance Distributed Computing, HPDC-11, 2002. [20] K. Ranganathan, I. Foster, Simulation studies of computation and data scheduling algorithms for data grids, J. Grid Comput. 1 (2003) 53–62. [21] M. Tang, B.S. Lee, C.K. Yeo, X. Tang, Dynamic replication algorithms for the multi-tier data grid, Future Gener. Comput. Syst. 21 (2005) 775–790. [22] R.S. Chang, J.S. 
Chang, S.Y. Lin, Job scheduling and data replication on data grids, Future Gener. Comput. Syst. 23 (7) (2007) 846–860. [23] R.S. Chang, H.P. Chang, A dynamic data replication strategy using access weights in data grids, J. Supercomput. 45 (3) (2008) 277–295. [24] L. Breslau, P. Cao, L. Fan, G. Phillips, S. Shenker, Web caching and Zipf-like distributions: evidence and implications, in: Proceedings of the 18th conference on computer communications, 1999, pp. 126–134. [25] C. Staelin, H. Garcia-Molina, Clustering Active Disk Data to Improve Disk Performance, 1990 Tech. Rep. CSTR-283-90, Department of Computer Science, Princeton University. [26] M.E. Gomez, V. Santonja, Characterizing temporal locality in I/O workload, in: Proceedings of the international symposium on performance evaluation of computer and telecommunication systems, 2002. [27] L. Cherkasova, G. Ciardo, Characterizing Temporal Locality and its Impact on Web Server Performance, 20 0 0 Technical Report HPL-20 0 0-82, Hewlett Packard Laboratories. [28] T. Xie, Y.A. Sun, file assignment strategy independent of workload characteristic assumptions, ACM Trans. Storage 5 (3) (2009). [29] D. Zissis, D. Lekkas, Addressing Cloud computing security issues, Future Gener. Comput. Syst. 28 (3) (2012) 583–592. [30] S. Subashini, V. Kavitha, A survey on security issues in service delivery models of Cloud computing, J. Netw. Comput. Appl. 34 (1) (2011) 1–11. [31] N. Fotiou, A. Machas, G.C. Polyzos, G. Xylomenos, Access control as a service for the Cloud, J. Internet Serv. Appl. 6 (11) (2015). [32] A. Chervenak, I. Foster, C. Kesselman, C. Salisbury, S. Tuecke, The data grid: towards an architecture for the distributed management and analysis of large scientific datasets, J. Netw. Comput. Appl. 23 (20 0 0) 187–20 0. [33] T. Hamrouni, S. Slimani, F. Ben Charrada, A survey of dynamic replication and replica selection strategies based on data mining techniques in data grids, Eng. Appl. Artif. Intell. 48 (2016) 140–158. [34] J. Ma, W. Liu, T. Glatard, A classification of file placement and replication methods on grids, Future Gener. Comput. Syst. 29 (6) (2013) 1395–1406. [35] S.R. Malik, S.U. Khan, S.J. Ewen, N. Tziritas, J. Kolodziej, A.Y. Zomaya, S.A. Madani, N. Min-Allah, L. Wang, C. Xu, Q.M. Malluhi, J.E. Pecero, P. Balaji, A. Vishnu, R. Ranjan, S. Zeadally, H. Li, Performance analysis of data intensive Cloud systems based on data management and replication: a survey, Distrib. Parallel Databases (2015) 1–37. [36] B.A. Milani, N.J. Navimipour, A comprehensive review of the data replication techniques in the Cloud environments: major trends and future directions, J. Netw. Comput. Appl. 64 (2016) 229–238. [37] M. Knoll, H. Abbadi, T. Weis, Replication in peer-to-peer systems, in: Self- Organizing Systems, 5343, 2008, pp. 35–46. [38] M. Rahmani, M. Benchaiba, A comparative study of replication schemes for structured P2P networks, in: Proceedings of the 9th International Conference on Internet and Web Applications and Services, 2014, pp. 147–158. [39] E. Spaho, L. Barolli, F. Xhafa, Data replication strategies in P2P systems: a survey, in: Proceedings of the17th International Conference on Network-Based Information Systems, 2014, pp. 302–309. [40] J. Kangasharju, J. Roberts, K.W. Ross, Object replication strategies in content distribution networks, Comput. Commun. 25 (4) (2002) 376–383. [41] A. Passarella, A survey on content-centric technologies for the current internet: CDN and P2P solutions, Comput. Commun. 35 (1) (2012) 1–32. [42] D. Boru, D. 
Kliazovich, F. Granelli, P. Bouvry, A.Y. Zomaya, Energy-efficient data replication in Cloud computing datacenters, Cluster Comput. 18 (2015) 385–402.


[43] F. Bellosa, The benefits of event driven energy accounting in power-sensitive systems, in: ACM SIGOPS European Workshop: beyond the PC: new challenges for the operating system, 20 0 0, pp. 37–42. [44] S. Pelley, D. Meisner, T.F. Wenisch, J.W. VanGilder, Understanding and abstracting total data center power, Workshop on Energy Efficient Design (WEED), 2009. [45] X. Fan, W.D. Weber, L.A. Barroso, Power provisioning for a warehouse-sized computer, in: ACM International Symposium on Computer Architecture, San Diego, 2007, pp. 13–23. [46] A. Ganesh, R.H. Katz, Greening the switch, Conference on Power aware computing and systems, 7, 2008. [47] D. Kliazovich, P. Bouvry, S.U. Khan, GreenCloud: a packet-level simulator of energy-aware Cloud computing data centers, J. Supercomput. 62 (3) (2012) 1263–1283. [48] N.K. Gill, S. Singh, A dynamic, cost-aware, optimized data replication strategy for heterogeneous Cloud data centers, Future Gener. Comput. Syst. 65 (2016) 10–32. [49] N. Mansouri, Adaptive data replication strategy in Cloud computing for performance improvement, Front. Comput. Sci. 10 (5) (2016) 925–935. [50] S.Q. Long, Y.L. Zhao, W. Chen, MORM: a multi-objective optimized replication management strategy for Cloud storage cluster, J. Syst. Archit. 60 (2014) 234–244. [51] L.W. Lee, P. Scheuermann, R. Vingralek, File assignment in parallel I/O systems with minimal variance of service time, IEEE Trans. Comput. 49 (2) (20 0 0) 127–140. [52] K. Ashwin Kumar, A. Quamar, A. Deshpande, S. Khuller, SWORD: workload-aware data placement and replica selection for Cloud data management systems, VLDB J. 23 (6) (2014) 845–870. [53] A. Rajalakshmi, D. Vijayakumar, K.G. Srinivasagan, An improved dynamic data replica selection and placement in Cloud, International Conference on Recent Trends in Information Technology (ICRTIT), 2014. [54] M.K. Hussein, M.H Mousa, A light-weight data replication for Cloud data centers environment, Int. J. Innovative Res. Comput. Commun. Eng. 2 (1) (2014) 2392–2400. [55] D.W. Sun, G.R. Chang, S. Gao, L.Z. Jin, X.W. Wang, Modeling a dynamic data replication strategy to increase system availability in Cloud computing environments, J. Comput. Sci. Technol. 27 (2) (2012) 256–272. [56] S. Wu, G. Chen, T. Gao, L. Xu, C. Song, Replica pre-adjustment strategy based on trend analysis of file popularity within Cloud environment, 12th International Conference on Computer and Information Technology, 2012. [57] Z. Ye, S. Li, J. Zhou, A two-layer geo-cloud based dynamic replica creation strategy, Appl. Math. Inf. Sci. 8 (1) (2014) 43–440. [58] Z. Wang, T. Li, N. Xiong, Y. Pan, A novel dynamic network data replication scheme based on historical access record and proactive deletion, J. Supercomput. 62 (1) (2012) 227–250. [59] J. Myint, A. Hunger, Modeling a load-adaptive data replication in cloud environments, in: Proceedings of the 3rd International Conference on Cloud Computing and Services Science, 2013, pp. 511–514. [60] M. Seddiki, M. Benchaiba, Toward a global file popularity estimation in unstructured p2p networks, in: Proceeding of the eighth International Conference on Systems and Networks Communications, 2013, pp. 77–81. [61] J. Chu, K. Labonte, B.N. Levine, Availability and popularity measurements of peer-to-peer file systems, Technical Report (2004) 4–36. [62] R. Izmailov, S. Ganguly, N. Tu, Fast parallel file replication in data grid, in: Proc. of Future of Grid Data Environments: A Global Grid Forum (GGF) Data Area Workshop, Berlin, Germany, 2004. [63] J.M. Perez, F. 
Garcia, J. Carretero, A. Calderon, J. Fernandez, A parallel I/O middleware to integrate heterogeneous storage resources on grids, in: Lecture Notes in Computer Science Series, 2970, 2004, pp. 124–131. [64] F. Garcia-Carballeira, J. Carretero, A. Calderon, J.D. Garcia, L.M. Sanchez, A global and parallel file systems for grids, Future Gener. Comput. Syst. 23 (1) (2007) 116–122. [65] J.M. Pérez, F. García-Carballeira, J. Carretero, A. Calderón, J. Fernández, Branch replication scheme: a new model for data replication in large scale data grids, Future Gener. Comput. Syst. 26 (2010) 12–20. [66] R.S. Changa, M.H. Guob, H.C. Lin, A multiple parallel download scheme with server throughput and client bandwidth considerations for data grids, Future Gener. Comput. Syst. 24 (2008) 798–805. [67] R.N. Calheiros, R. Ranjan, A. Beloglazov, C.A.F. De Rose, R. Buyya, CloudSim: a toolkit for modeling and simulation of cloud computing environments and evaluation of resource provisioning algorithms, Softw. Pract. Exper. 41 (2011) 23–50. [68] F. Howell, R. Mcnab, SimJava: A discrete event simulation library for java, in: Proceedings of the First International Conference on Web-based Modeling and Simulation, San Diego, U.S.A., 1998. [69] R. Buyya, R. Ranjan, R.N. Calheiros, Modeling and simulation of scalable Cloud computing environments and the Cloudsim toolkit: Challenges and opportunities, in: Int. Conf. High Perform. Comput. Simul., 2009, pp. 1–11. [70] G. Silvestre, S. Monnet, R. Krishnaswamy, P. Sens, AREN: a popularity aware replication scheme for Cloud storage, 18th International Conference on Parallel and Distributed Systems, 2013. [71] D.W Sun, G.R. Chang, S. Gao, L.Z. Jin, X.W. Wang, Modeling a dynamic data replication strategy to increase system availability in Cloud computing environments, J. Comput. Sci. Technol. 27 (2) (2012) 256–272. [72] D.G. Cameron, R. Carvajal-schiaffino, A. Paul Millar, C. Nicholson, K. Stockinger, F. Zini, UK Grid Simulation with OptorSim, UK e-Science All Hands Meeting, 2003.