A prediction-based dynamic replication strategy for data-intensive applications


Computers and Electrical Engineering 0 0 0 (2016) 1–13

Contents lists available at ScienceDirect

Computers and Electrical Engineering journal homepage: www.elsevier.com/locate/compeleceng

Vijaya Nagarajan∗, Mulk Abdul Maluk Mohamed

Software System Group, Department of Computer Science and Engineering, M. A. M. College of Engineering and Technology, Tiruchirappalli, Tamil Nadu, India

Article info

Article history: Received 31 December 2015; Revised 28 November 2016; Accepted 28 November 2016; Available online xxx

Keywords: Replication; Intelligent Replica Manager; Association rules; Modified apriori algorithm; Prediction

Abstract

Data-intensive applications produce huge data sets that need to be analyzed across geographically distributed nodes in a grid computing environment. Data replication is essential in this environment to reduce data access latency and to improve data availability across grid sites. In this work, an Intelligent Replica Manager (IRM) is designed and incorporated into the grid middleware for scheduling data-intensive applications. The IRM uses a multi-criteria based replication algorithm that considers multiple parameters, such as storage capacity, bandwidth and communication cost of the neighboring sites, before deciding on the selection and placement of replicas. Additionally, the future needs of a grid site are predicted in advance using a modified apriori algorithm, which is an association rule based mining technique. This IRM based strategy reduces the data availability time, data access time and make span. The simulation results show that the proposed strategy outperforms the existing strategies.

© 2016 Elsevier Ltd. All rights reserved.

1. Introduction

With growing technological advancements in the scientific field, the modern instruments and simulation tools used in e-science applications produce huge data sets. These data sets need to be analyzed and distributed to researchers located across diverse geographical regions. Here the grid serves as a promising infrastructure by integrating globally distributed heterogeneous resources across different administrative domains [1]. Grids can be classified into two major categories: computational grids and data grids. The objective of computational grids is to split a computation into several parts and execute them across the different resources in the grid. The objective of data grids is to handle huge data sets and to distribute them among several grid resources.

Scientific applications can be categorized as computation-intensive and data-intensive. Whereas computation-intensive applications demand more CPU usage, data-intensive applications process data ranging from terabytes to petabytes. Applications such as global weather prediction, the Digital Sky project, brain imaging analysis, mammographic analysis and high energy physics produce huge data sets which need to be transferred, processed and analyzed across distributed data repositories [2].

This work concentrates on the scheduling of data-intensive applications in the grid environment. Each grid site consists of data hosts and computation hosts. Any e-science application can be considered as a set of independent jobs assigned to various grid sites, and each job requires large data sets stored in various data hosts. In a data grid environment, scheduling a data-intensive application is a

Reviews processed and recommended for publication to the Editor-in-Chief by Guest Editor Dr. M. S. Kumar. Corresponding author. E-mail addresses: [email protected] (V. Nagarajan), [email protected] (M.A. Maluk Mohamed).

http://dx.doi.org/10.1016/j.compeleceng.2016.11.036 0045-7906/© 2016 Elsevier Ltd. All rights reserved.

Please cite this article as: V. Nagarajan, M.A. Maluk Mohamed, A prediction-based dynamic replication strategy for dataintensive applications, Computers and Electrical Engineering (2016), http://dx.doi.org/10.1016/j.compeleceng.2016.11.036


challenging task. These applications have large-scale runs with thousands of tasks, where each task in turn processes hundreds of input files and the size of each input data set is huge, in the order of petabytes. Hence, issues such as heterogeneity, granularity, replication, storage, security, fault tolerance, and locality must be taken care of during the planning, execution and storage phases [2–8].

Replication is one of the primary issues and has a major impact on the make span. In order to reduce the access latency, bandwidth consumption and storage server load in the internet, the most frequently accessed data sets may be replicated across different sites. Suppose data sets are stored at site 1 and are needed for the execution of a task at site 2. The data sets at site 1 are replicated and the replica is sent to site 2 to improve the performance of the job executed at site 2. If the same data sets are later needed by tasks executed at site 3, which is nearer to site 2, then the data sets are not transferred from site 1 (the permanent storage of the data sets); instead, the replicas available at site 2 may be used. The replication strategy adopted may change depending on the locality of the data sets (temporal or spatial). Data replication not only optimizes the data access cost but also improves the following [9,10]:

◦ Availability: When a job failure or resource failure happens at a particular site, the system can restore the replicated data from another site. This enhances the availability of the data.
◦ Reliability: When a replica is available at all sites, the probability of servicing the user request is high. Hence the system is more reliable.
◦ Performance: The data access delay and make span are improved when the replica is available nearer to the execution site.

There are several challenges associated with dynamic replication [11].
In the grid environment, replication is a serious issue because the grid is dynamic in nature. Users may join and leave the virtual organization of the grid at any time. Hence the replication strategy should adapt to the changing nature of the grid in order to provide better performance. The replication strategy must be designed according to the topology of the grid. The data grid may have different architectures: multi-tier, hierarchical, graph based, peer to peer and hybrid. A data replication strategy depends upon dynamic decision making, which involves when to replicate the data, where to replicate the data and which data to replicate. Whatever strategy is adopted, it should ensure that the benefit of replication is always higher than its cost. Applying optimization techniques to data replication results in faster data access, increased data availability and decreased make span.

The main contributions of this paper are:

(1) A unique model is proposed in which an Intelligent Replica Manager (IRM) is designed for scheduling data-intensive applications in the grid by considering multiple parameters for replica selection and placement.
(2) A novel multi-criteria based replication algorithm is proposed and deployed in the IRM. The algorithm considers multiple parameters, such as storage capacity, bandwidth and communication cost of the neighboring sites, before taking decisions on the placement of replicas.
(3) The IRM uses an association rule based mining technique that predicts the data sets needed by a site by finding the frequent data sets before replication.

This IRM based strategy reduces the data availability time, data access time and job execution time.

This paper is structured as follows: Section 2 presents the problem formulation. Section 3 describes the related works. Section 4 proposes the IRM. Section 5 describes the simulation results.
Finally, in Section 6, we conclude our discussion with future plans for our work.

2. Problem formulation

Even though many issues are identified as challenging in the grid scheduling process, data replication plays a prominent role where data-intensive applications are concerned. During scheduling, transferring huge data sets from one site to another requires more network bandwidth. Also, the delay incurred during the data transfer and the data availability degrade performance and affect the make span. Optimization of the make span is the ultimate aim of the grid scheduling process.

In this work, when a task executing at a site requires data sets for further processing, it places a request with the IRM. The IRM stores the request in the Knowledge Base (KB), and the replica selector present in the IRM searches the data catalogue to find the location of the required data sets. If the data sets are available in more than one location, then the least cost table is searched to find the location with minimum cost. The least cost table is constructed based on the bandwidth of the sites: if the bandwidth of a site is high, then the cost assigned is low. The site with the least cost is selected for forwarding the request.

At the remote IRM, when a new request arrives, it is stored in the KB. The correlation analyzer finds the frequent data sets using the modified apriori algorithm. The frequently used data sets are then sent to the requesting site together with the currently requested data sets. This pre-fetching of data sets spares the requesting site the overhead of placing another request, which in turn reduces the queuing delay, data access cost and data availability time. If space is not available for storing the predicted data sets at the requesting site, then the deletor present in the IRM deletes the least frequently used replica and the storage space is allocated to the predicted data sets.
Let us assume an application W consists of n tasks, W = {t1, t2, t3, …, tn}, and these tasks are data-intensive in nature, where huge data sets d1, d2, d3, …, dm are to be processed. The performance of the application processing is characterized by the following parameters [12].


• Make span M is the time elapsed between the submission of the first task and the time of the result after the execution of the last task in an e-science application execution:

$$M = \sum_{k=1}^{n} \bigl( \text{Waiting time}(t_k) + \text{Execution time}(t_k) \bigr) \tag{1}$$

• Waiting time (WT): The delay experienced by the task to collect the required data sets for execution. It consists of two components: data transfer time and data availability time.

$$\mathrm{WT}(t_k) = \mathrm{DTT}(t_k) + \mathrm{DAT}(t_k) \tag{2}$$

• Data Availability Time (DAT): Time required for the task to fetch the input data sets for execution (i.e., data staging time for computation).

• Data Transfer Time (DTT): Time required for transferring the data sets from one site to another site.
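As an illustration of Eqs. (1) and (2), the following minimal Python sketch computes the waiting time of each task and the resulting make span. The `Task` class, the function names and the sample timings are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Task:
    dtt: float        # data transfer time DTT(t_k)
    dat: float        # data availability time DAT(t_k)
    execution: float  # execution time of t_k


def waiting_time(task: Task) -> float:
    """Eq. (2): WT(t_k) = DTT(t_k) + DAT(t_k)."""
    return task.dtt + task.dat


def make_span(tasks: list[Task]) -> float:
    """Eq. (1): sum of waiting time and execution time over all tasks."""
    return sum(waiting_time(t) + t.execution for t in tasks)


# Hypothetical timings (arbitrary time units) for two tasks.
tasks = [
    Task(dtt=4.0, dat=1.0, execution=10.0),
    Task(dtt=2.0, dat=0.5, execution=8.0),
]
print(make_span(tasks))  # 25.5
```

Under this model, lowering either DTT or DAT of any task lowers the make span by the same amount, which is why the strategy targets the waiting time.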

Eq. (1) states that the waiting time and the execution time have a major impact on the make span. In order to optimize the make span, at least one of these parameters has to be optimized. The execution time depends on the resource performance, which can be optimized by selecting the best resource during scheduling. The waiting time depends on the data availability time and data transfer time, as stated in Eq. (2). Generally, grid scheduling has three phases [7]: resource discovery, where the available resources are listed; system selection, where the best resource is selected from the available resources; and job execution, where the data staging, execution and cleanup operations are carried out. This work focuses only on the third phase of the scheduling process. Here an intelligent replica manager is designed and incorporated in the middleware of the data grid. The main objective is to optimize the make span by minimizing the waiting time. In order to minimize the waiting time, as given in Eq. (2), the DAT and DTT have to be reduced.

3. Related works

Lin et al. [13] addressed the issue of placing replicas with locality assurance. The objective of the work is to select strategic locations for replica placement by balancing the workload among the replicas and guaranteeing the locality of the service for each data request. Two algorithms were proposed: MinMaxLoad places the replicas at proper locations considering the server workload, and Find R chooses the optimal number of replicas. The prediction of future data sets is not addressed in this work.

Vashisht et al. [14] proposed the Efficient Dynamic Replication Algorithm for selecting the best replica. Three main parameters, namely bandwidth, load gauge and computing capacity, are considered for selecting the best replica. A hierarchical grid topology is considered and a two-fold scheduling policy is adopted in the master node and head node.

Saadat et al.
[15] proposed PDDR, a prefetching based dynamic data replication algorithm. This algorithm predicts the future needs of a grid site in advance and prefetches the files to the requester site. The prediction is based on the past file access sequences of a particular site. The algorithm works with the assumption that all members of a virtual organization have similar interests in files.

Beermann et al. [16] proposed a prediction-based replication strategy where the prediction is based on past data popularity. Neural networks are used to forecast the data by considering the data accesses in the past, including information about the users, sites and the files accessed. Other metrics are not considered.

Khanli et al. [17] consider spatial locality for replication; future demands are also predicted in advance and the data sets are pre-replicated to the requesting site. This method uses priority and replication configuration change components for replication. The work is based on the assumption that users who work in the same context will request the same set of files with high probability.

Ranganathan et al. [18] proposed six different replication strategies: No Replication, Best Client, Cascading, Plain Caching, Caching plus Cascading, and Fast Spread. They evaluated these strategies based on two parameters, response time and bandwidth. The results showed that the Fast Spread strategy outperformed the other strategies. This work provided the base for the replication strategies available today, but the prediction of future data sets is not addressed in it.

Chang et al. [19] present a job scheduling policy for improving data availability in job execution. Jobs are dispatched to the node where the data resides. Two policies were proposed to improve the data access time in the cluster grid.

Park et al. [20] proposed Bandwidth Hierarchy based Replication (BHR) to minimize the access time and maximize the network level locality.
Here the grid sites are divided into several regions, where the bandwidth across regions is lower than the bandwidth within a region. If the required data is available within the region, then the data access time will be lower. This strategy does not consider other metrics.

Tang et al. [21] proposed two algorithms for a multi-tier data grid, Simple Bottom-Up (SBU) and Aggregate Bottom-Up (ABU), for reducing the average data access time. The basic idea of these algorithms is to replicate a file only if its access rate is higher than a pre-defined threshold value. The SBU algorithm considers the access history of an individual site, and the ABU algorithm considers the file access history of the system. The access latency is reduced, but wastage of storage is unavoidable.

Chang et al. [22] proposed a dynamic replication strategy called Latest Access Largest Weight (LALW). The LALW algorithm collects the file access details at constant time intervals, and weights are assigned based on their ages. The popularity of a file is measured based on the weight and the number of accesses. The popular file is replicated to selected sites to balance the load of the system.

Bsoul Mohammad et al. [23] proposed a Round-based Data Replication Strategy, in which time is divided into rounds and, at each round, the file to be replicated is selected based on the popularity of each file. The algorithm of this approach consists of four phases: file access aggregation, file popularity calculation, file selection and file replication.

Bsoul et al.


Fig. 1. Architecture of a data grid (CE – Computational Element, SE – Storage Element, IRM – Intelligent Replica Manager).

[24] proposed a category-based dynamic replication strategy for data grids. Every node in a data grid has files that belong to various categories. Each category is assigned a value based on the number of file accesses. When the storage capacity of the node is full, the node starts to store files that belong to the category with the highest value.

Chettaoui et al. [25] proposed a Decentralized Periodic Replication Strategy based on the Knapsack Problem. In this approach, two polynomial-time complexity algorithms are proposed. The first algorithm selects the best candidate files for replication based on file popularity, and the second places the file in the best location, which has high bandwidth. In addition, a deletion algorithm chooses the files for deletion when the storage space is not enough to accommodate new files. File prediction is not addressed in this paper.

Even though a few related works focus on prediction-based replication, the prediction is not accurate in many instances, and the predicted data sets occupy storage space without any future use. In the proposed approach, we use association rule mining and a modified apriori algorithm for prediction. Hence, the prediction is more accurate when compared to the existing strategies. In addition, data availability time and data access cost are minimized by considering multiple criteria, such as storage capacity, bandwidth and communication cost between the grid sites, before taking a replication decision.

4. Intelligent replica manager

Fig. 1 shows the architecture of the data grid. Users submit jobs to the grid portal, and the global scheduler schedules the jobs across the local schedulers in each grid site. The local scheduler schedules the job to the computational elements present at the site. The computational elements use storage elements for data retrieval and access. In this work, each site in the data grid has an IRM (Intelligent Replica Manager).
The IRM present at one site can communicate with the IRM at another site. When a task assigned to a computational element requires data sets for processing and the data sets are available locally, the request is processed without any delay. If the data sets are not available locally, then the IRM forwards the request to the remote IRM. The architecture of the IRM is given in Fig. 2. It consists of the following components:

1. Request handler: It handles the incoming requests for data sets. The request is preprocessed with the help of the correlation analyzer.
2. Correlation analyzer: It uses the knowledge base to predict the future needs of a grid site based on association rule mining with the apriori algorithm.


Fig. 2. Intelligent Replica Manager.

3. Knowledge base: It stores the incoming requests for data sets from the grid sites. Each request contains the request id and the data set ids.
4. Replica optimizer: It consists of the following components:
◦ Replica selector: Selects the best replica from the neighboring sites using the least cost table.
◦ Data catalogue: Contains the details of the data sets and the site ids.
◦ Deletor: If storage space is not available at the current site for holding the replica, the deletor deletes the least frequently used data sets at the current site.
◦ Intelligent monitor: Takes dynamic decisions on what to replicate, when to replicate and where to replicate based on storage, distance of data sets and bandwidth.

4.1. Correlation analyzer

Discovering the correlation relationships among the huge data sets can help in making dynamic decisions for prediction. It also helps in gaining insight into which data sets are frequently accessed together by the jobs and how likely they are to be accessed at the same instance. For finding the correlation relationships, our approach uses association rules [26] and the Apriori algorithm. An association rule is an implication of the form x ⇒ y. Let I be the set of data sets. A transaction T is said to contain x if and only if x ⊆ T. In this approach, the query generated by the grid site contains the site id and the data sets. A query is referred to as a transaction made in the knowledge base (KB). In Eq. (3), x ⊆ I and y ⊆ I, where I = {I1, I2, I3, …, In} and x ∩ y = ∅.

i. Rule x ⇒ y holds in the transaction set KB with support S, where S is the percentage of transactions in KB that contain x ∪ y. This is taken to be the probability P(x ∪ y), as given in Eq. (3).
ii. Rule x ⇒ y holds in the transaction set KB with confidence C, where C is the percentage of transactions in KB containing x that also contain y. This is taken to be the conditional probability P(y | x), as given in Eq. (4).

$$\mathrm{Support}(x \Rightarrow y) = P(x \cup y) \tag{3}$$

$$\mathrm{Confidence}(x \Rightarrow y) = P(y \mid x) \tag{4}$$
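Eqs. (3) and (4) can be made concrete with a short Python sketch that treats the knowledge base as a list of transactions, each being the set of data sets in one request. The function names and the sample knowledge base are hypothetical; they are used only to illustrate the two definitions.

```python
def support(kb, x, y):
    """Eq. (3): fraction of transactions that contain every item of x and y."""
    both = set(x) | set(y)
    return sum(1 for t in kb if both <= t) / len(kb)


def confidence(kb, x, y):
    """Eq. (4): among transactions containing x, the fraction also containing y."""
    x = set(x)
    with_x = [t for t in kb if x <= t]
    if not with_x:
        return 0.0  # x never occurs, so the rule has no evidence
    both = x | set(y)
    return sum(1 for t in with_x if both <= t) / len(with_x)


# Hypothetical knowledge base: one set of requested data sets per query.
kb = [{"d1", "d2"}, {"d1", "d2", "d4"}, {"d2", "d4"}, {"d1", "d3"}]

print(support(kb, {"d1"}, {"d2"}))     # 0.5 (2 of 4 transactions)
print(confidence(kb, {"d1"}, {"d2"}))  # 2 of the 3 transactions containing d1
```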

4.2. Association rule mining

It involves the following steps:

i. Finding the frequent data sets in the knowledge base using the minimum support count.
ii. Finding strong association rules from the frequent data sets using the minimum confidence constraint.

In this work, we have modified the Apriori algorithm [27] for finding the frequent data sets requested by the site.

4.3. Modified apriori algorithm

The process of the modified apriori algorithm is given in Fig. 3. Let us demonstrate the process of finding the predicted data sets using the sample knowledge base of the IRM given in the top left corner of Fig. 3. Initially, find the value of n, i.e., the number of data sets requested in the query. Then apply the following steps to find the associated data sets.


Fig. 3. Finding Associated Data sets.

Step 1: Scan the Knowledge Base for the support count of each data set. Generate the frequent 1-data sets, i.e., Level 1.
Step 2: Let i = 1.
Step 3: Repeat until level(i) < level(n + 1):
  3.1 Set the minimum support count value and generate the candidate C(i) data sets.
  3.2 Using the join operation Li x Li, generate the frequent (i + 1)-data sets, i.e., Level (i + 1).
  3.3 Apply pruning and eliminate data sets that are infrequent.
  3.4 Generate the candidate C(i + 1) data sets.
  3.5 Increment the value of i.

This algorithm differs from the apriori algorithm in the following aspects: (1) in the modified apriori algorithm, the levels are generated based on the number of data sets requested in the user's query; (2) the algorithm does not check the strength of the association rules. From the example given in Fig. 3, it is evident that when a request for data set d1 arrives, d2 is also sent along with d1 to the requesting site. Similarly, if a request for d2 arrives, then d1 is also sent along with d2.
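The level-wise search described above can be sketched in Python as follows. This is one possible reading of the modified apriori algorithm, not the authors' implementation: it assumes that levels are built only up to n + 1 for a query of n data sets, that no confidence check is applied, and that the predicted data sets are the items that co-occur with the query in a frequent (n + 1)-data set. The function names and the sample knowledge base are hypothetical.

```python
from itertools import combinations


def frequent_levels(kb, min_count, max_level):
    """Return [L1, L2, ...] up to max_level, each a list of frequent frozensets."""
    items = sorted({d for t in kb for d in t})
    level = [frozenset([d]) for d in items
             if sum(d in t for t in kb) >= min_count]
    levels = [level]
    while level and len(levels) < max_level:
        k = len(levels) + 1
        previous = set(level)
        # Join step: merge pairs of frequent (k-1)-sets into k-set candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in previous
                             for s in combinations(c, k - 1))}
        # Keep the candidates that meet the minimum support count.
        level = [c for c in candidates
                 if sum(c <= t for t in kb) >= min_count]
        levels.append(level)
    return levels


def predict(kb, query, min_count=2):
    """Data sets that appear with the query in a frequent (n+1)-data set."""
    query = frozenset(query)
    levels = frequent_levels(kb, min_count, len(query) + 1)
    if len(levels) <= len(query):
        return set()
    predicted = set()
    for itemset in levels[len(query)]:
        if query <= itemset:
            predicted |= itemset - query
    return predicted


# Hypothetical knowledge base of past requests.
kb = [{"d1", "d2", "d5"}, {"d2", "d4"}, {"d1", "d2"}, {"d3"}]
print(predict(kb, {"d1"}))  # {'d2'}
print(predict(kb, {"d2"}))  # {'d1'}
```

Stopping at level n + 1 keeps the number of knowledge base scans proportional to the query size rather than to the size of the largest frequent data set.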


Fig. 4. Graphical representation of sites linked by network links.

Table 1 Data catalogue.

Site id (node id)    Data sets
S1                   d1, d2, d5
S2                   d2, d3
S3                   d2, d4
S4                   d1, d2, d4
S5                   d1, d6
S6                   d8
S7                   d2
S8                   d6

4.4. Construction of least cost matrix

Fig. 4 shows the graphical representation of different data sites in a grid. Let G = {V, E} be a weighted graph with a weight function W ⇒ F, where F is a function that calculates the weight based on the bandwidth capacity of the site. Let the set of vertices be V = {v1, v2, v3, …, vn} and the set of edges be E = {e1, e2, e3, …, en}.

$$W(e_1, e_2) = \mathrm{cost}(S_1, S_2) \tag{5}$$

In Eq. (5), cost(S1, S2) is the bandwidth-based cost assigned between two sites S1 and S2, as given in the least cost table (Table 2). The least cost table is constructed based on the bandwidth between two sites: a low cost is assigned to a high bandwidth link and a high cost is assigned to a low bandwidth link. A shortest path is a sequence of vertices v1, v2, v3, …, vn where each consecutive pair (vi, vi+1) is an edge, such that the cost is minimum. Very often we have to find the shortest path from a source vertex s to a target vertex t.

Steps of the modified apriori example (cf. Fig. 3):

1. Scan all of the transactions in KB to count the number of occurrences of each data set. Generate the C1 data sets.
2. Let the minimum support count be 2. Filter out the data sets which do not satisfy the minimum support count. Generate the L1 data sets. This is called pruning.
3. Apply the join operation L1 x L1 and generate the C2 data sets.
4. Apply pruning and generate the L2 data sets.
5. Apply the join and generate the C3 data sets.

The length of the path is the sum of the weights on the edges. If a task executed at site S5 requires data set d2, then the availability of the data set is checked in the data catalogue. If the data set d2 is available at sites S1, S2, S3 and S4, then the IRM at S5 searches the least cost table to find the nearest site for accessing the data set d2. Of these, site S3 has the least cost, i.e., the highest bandwidth, so the IRM places a remote request to site S3. The request contains the site id and the required data sets. The knowledge base is then searched to predict the associated data sets by applying the modified apriori algorithm. Using the algorithm, data set d4 is predicted and sent along with d2. At site S5, if the storage space is not enough to store the predicted data set, then the IRM calls the deletor to delete the least frequently used data sets at S5. Table 1 shows the data catalogue and Table 2 shows the least cost table. The pseudo code of the multicriteria based replication algorithm and the modified apriori algorithm is given below.


Table 2 Least cost table.

Source    Destination    Bandwidth (Mbps)    Cost
S1        S2             10                  3
S1        S3             10                  3
S2        S5             50                  2
S2        S4             50                  2
S3        S5             50                  1
S4        S6             10                  3
S5        S7             100                 1
S6        S8             100                 1
S7        S8             100                 1
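The replica selection step can be illustrated with the data of Tables 1 and 2. The Python sketch below is an assumption-laden illustration rather than the IRM implementation: it treats every link in Table 2 as bidirectional, computes least path costs from the requesting site with Dijkstra's algorithm, and picks the cheapest site that holds the requested data set. Note that, in the tables as printed, S7 also holds d2 at the same cost as S3; the sketch breaks such ties by site id. All names are hypothetical.

```python
import heapq

# Table 1: data catalogue (site id -> data sets held).
CATALOGUE = {
    "S1": {"d1", "d2", "d5"}, "S2": {"d2", "d3"}, "S3": {"d2", "d4"},
    "S4": {"d1", "d2", "d4"}, "S5": {"d1", "d6"}, "S6": {"d8"},
    "S7": {"d2"}, "S8": {"d6"},
}

# Table 2: (source, destination, cost); assumed bidirectional here.
LINKS = [
    ("S1", "S2", 3), ("S1", "S3", 3), ("S2", "S5", 2), ("S2", "S4", 2),
    ("S3", "S5", 1), ("S4", "S6", 3), ("S5", "S7", 1), ("S6", "S8", 1),
    ("S7", "S8", 1),
]


def least_costs(source):
    """Dijkstra: minimum total link cost from source to every reachable site."""
    graph = {}
    for a, b, cost in LINKS:
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, cost in graph.get(u, []):
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def select_replica_site(requester, dataset):
    """Cheapest remote site holding the data set, or None if no site has it."""
    dist = least_costs(requester)
    holders = [s for s, held in CATALOGUE.items()
               if dataset in held and s != requester and s in dist]
    if not holders:
        return None
    return min(holders, key=lambda s: (dist[s], s))


print(select_replica_site("S5", "d2"))  # S3
```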

Multicriteria based replication algorithm

Phase I // Placement of data
1. For each data request ri from IRM at si
2. Check for the data availability in the local site
3. If data is available
   3.1 Process the job with the datasets
4. Else
5. Search the data catalogue for data availability
   5.1 If data is available at multiple sites then
   5.2 Search the least cost table
   5.3 Select the site with the minimum cost
       5.3.1 Forward the request to the IRM at remote site

Phase II // Prediction
1. For each remote request from remote IRM do the following
   1.1 Store the request in kb
   1.2 Send the requested data sets
   1.3 Call modified apriori algorithm to find frequent data sets in kb

// Modified Apriori algorithm
1. Find the number of data sets n requested in the query
2. Scan the Knowledge base to get the support count for each dataset
3. Compare support count with minimum support count and get the frequent 1-datasets i.e., Level-1
4. Let k = 1
5. Repeat until level(k) < level(n)
   5.1 Increment the value of k
   5.2 Generate set of candidate k itemsets using join operation (Lk-1 X Lk-1)
   5.3 Use Apriori property to prune the infrequent set in k-itemset
   5.4 Scan the Knowledge base to get the support count for each candidate k itemset in the final set
   5.5 Compare support count with minimum support count and get the frequent k-datasets i.e., Level-k

Phase III // Deletion
1. If the storage space is not available to store the data sets then
   1.1 Sort the count value of access frequency of the data sets in the ascending order
   1.2 Delete the data sets which has the least access frequency
2. Update the knowledge base
3. Update the least cost table
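Phase III (deletion) can be sketched as a least-frequently-used eviction loop. The Python function below is a minimal illustration under stated assumptions: replica sizes and access counts are kept in plain dictionaries, ties in access frequency are broken by name, and eviction stops as soon as the incoming data set fits. The function name, parameters and sample values are hypothetical.

```python
def make_room(replicas, access_count, capacity, incoming_size):
    """Evict least frequently used replicas until incoming_size fits.

    replicas: dict mapping replica name -> size (mutated in place).
    access_count: dict mapping replica name -> number of accesses.
    Returns the list of evicted replica names, in eviction order.
    """
    evicted = []
    used = sum(replicas.values())
    # Step 1.1: sort by access frequency in ascending order.
    by_frequency = sorted(replicas, key=lambda r: (access_count.get(r, 0), r))
    for name in by_frequency:
        if used + incoming_size <= capacity:
            break
        # Step 1.2: delete the replica with the least access frequency.
        used -= replicas.pop(name)
        evicted.append(name)
    return evicted


# Hypothetical site state: three 1 GB replicas filling a 3 GB store.
replicas = {"d1": 1, "d6": 1, "d8": 1}
access_count = {"d1": 5, "d6": 1, "d8": 3}
evicted = make_room(replicas, access_count, capacity=3, incoming_size=1)
print(evicted)  # ['d6']
```

If the incoming data set is larger than the total capacity, this sketch evicts everything and still cannot store it; a caller would need to check for that case separately.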

5. Simulation

The simulation environment is set up using GridSim [28], a Java based simulation toolkit. We have considered a network comprising five sites; each site consists of an IRM, a set of processors and storage servers. The basic assumptions are that all the processors operate at the same speed and that the files are read only. The simulation parameter values are given in Table 3. In order to evaluate the effectiveness of our proposed approach, it is compared with the following popular replication strategies, which were discussed in the related works.

i. No replication (NR), where all the files are accessed from the remote site.
ii. Least Recently Used (LRU), which always replicates the files by deleting the least recently used files in the storage site.
iii. Bandwidth Hierarchy based Replication strategy (BHR), where the replication is based on the popularity of the file in a region. If the required data is available within the region, then the data access time will be lower.

The above strategies are compared with the proposed approach based on the following metrics: data availability time, data transfer time, make span and waiting time.


Table 3 Parameter configuration.

Simulation parameters                        Values
Number of sites                              3–5
Number of files                              10–20
File size                                    1 GB
Storage size at each site                    50 GB
Number of jobs                               10–100
Number of file accesses made by each job     5–10
Bandwidth (Mbps)                             10–100
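To give a feel for the scale implied by Table 3, the following sketch estimates the time to move one file between sites and picks the replica holder with the shortest estimated transfer. It rests on several assumptions that are not taken from the paper: transfer time is modelled as size divided by bandwidth (latency and contention are ignored), decimal units are used (1 GB = 8000 Mb), and the cost is reduced to bandwidth alone, whereas the least cost table of the proposed algorithm also accounts for storage capacity and communication cost. The function names and site labels are hypothetical.

```python
def transfer_time_seconds(file_size_gb, bandwidth_mbps):
    """Idealised transfer time: file size divided by link bandwidth."""
    megabits = file_size_gb * 8 * 1000   # assumption: 1 GB = 8000 megabits
    return megabits / bandwidth_mbps


def pick_source_site(bandwidth_to_requester, file_size_gb=1):
    """Return the replica-holding site with the lowest estimated transfer time.

    bandwidth_to_requester: dict mapping site name -> bandwidth in Mbps
    on the link from that site to the requesting site (assumed layout).
    """
    return min(
        bandwidth_to_requester,
        key=lambda site: transfer_time_seconds(
            file_size_gb, bandwidth_to_requester[site]),
    )


# Extremes of the bandwidth range in Table 3 for a single 1 GB file.
slowest = transfer_time_seconds(1, 10)
fastest = transfer_time_seconds(1, 100)

# Hypothetical set of sites that all hold a replica of the requested file.
best_site = pick_source_site({"s2": 10, "s3": 100, "s4": 40})
```

Under these assumptions a single file takes between 80 s and 800 s to transfer, so choosing the source site by link bandwidth can change the transfer time by an order of magnitude.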

Fig. 5. Analysis of the waiting time.

Fig. 6. Analysis of the data transfer time.

5.1. Impact on waiting time

In this section, the proposed approach is compared with the existing strategies: NR, LRU and BHR. Fig. 5 compares the waiting time, which is the time a task needs to collect the data sets required for execution. Consistent with expectation, the mean waiting time shown in Fig. 5 is reduced by 57%, 61% and 60% when compared to the LRU, BHR and NR strategies, respectively. During execution, the future need of a grid site is predicted in advance, so the data needed for the current job is readily available at the execution site and the job is executed without delay. With the help of the modified apriori algorithm, the associated data sets are predicted and made available at the computing site in advance, which removes the need to wait for the required data sets.

5.2. Impact on data transfer time

In this section, the data transfer time of the IRM strategy from the data site to the requesting site is compared to the LRU, BHR and NR strategies. The data transfer time is the time required to transfer the data sets from one site to another. Fig. 6 shows the comparison of data transfer time. The mean data transfer time was reduced


Fig. 7. Analysis of the data availability time.

Fig. 8. Analysis of the make span.

by 48.3%, 54.2% and 53% when compared to the LRU, BHR and NR strategies, respectively. In IRM, the data sets are sent over the high-bandwidth link by referring to the least cost table. If replicas of the data sets reside at more than one site, the least cost table is searched for the highest-bandwidth link to the requesting site. This avoids transferring data over low-bandwidth links, which in turn reduces the data transfer time and the delay incurred during transfer.

5.3. Impact on data availability time and make span

Fig. 7 shows the comparison of data availability time. In the proposed strategy, the IRM predicts the required data sets in advance and pre-replicates them. This eliminates unnecessary data requests to the neighboring sites, which also reduces the queuing delay. As a result, the data availability time is reduced by 77%, 59.8% and 73.2% when compared to the LRU, BHR and NR strategies, respectively. The primary goal of this scheduling approach is to minimize the make span. As mentioned in Eq. (1), the waiting time and the execution time have a major impact on the make span. To optimize the make span, at least one of these parameters has to be optimized. The execution time depends on the resource performance, which can be optimized by selecting the best resource during scheduling, while the waiting time depends on the data availability time and data transfer time, as stated in Eq. (2). Fig. 8 shows the comparison of make span with the other strategies. From Fig. 8, it is evident that by reducing the data availability time, queuing delay and data transfer time, the overall make span is reduced by 47.2%, 40% and 41.3% relative to the LRU, BHR and NR strategies, respectively.

5.4. Prediction accuracy

In this work, we have assumed that all the processors are operating at the same speed. Fig. 9 shows the prediction accuracy of the proposed IRM approach. The prediction accuracy is the ratio of the number of predicted data sets accessed to


Fig. 9. Prediction accuracy.

the number of data sets predicted, as given in Eq. (6).

\[
\text{Prediction Accuracy} = \frac{\text{Number of predicted data sets accessed}}{\text{Number of data sets predicted}} \tag{6}
\]
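A direct reading of Eq. (6) could be computed as in the short sketch below. The function name `prediction_accuracy`, the use of sets of file names as inputs and the convention of returning 0.0 when nothing was predicted are assumptions introduced for illustration; the file names are made up.

```python
def prediction_accuracy(predicted, accessed):
    """Fraction of predicted data sets that were subsequently accessed (Eq. 6)."""
    predicted = set(predicted)
    if not predicted:
        return 0.0   # assumed convention: no predictions means zero accuracy
    hits = predicted & set(accessed)   # predicted data sets that were used
    return len(hits) / len(predicted)


# Hypothetical example: four data sets pre-replicated, three of them used later.
accuracy = prediction_accuracy({"f1", "f2", "f3", "f4"},
                               ["f1", "f3", "f4", "f9"])
```

Here three of the four predicted data sets are later accessed, giving an accuracy of 0.75; the access to "f9", which was not predicted, does not enter the ratio.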

Many prediction algorithms proposed so far make unnecessary predictions: a large share of the predicted data sets may never be used, and those data sets occupy storage space at the data sites unnecessarily. The modified apriori algorithm used in the proposed work makes accurate predictions, in the sense that most of the predicted data sets are used in successive accesses. The prediction accuracy of the proposed work is compared with that of PDDR, LALW and PHFS, described in the related works. The results show that the prediction accuracy of IRM is higher by 53.5%, 38.7% and 30.3% when compared to the LALW, PDDR and PHFS strategies, respectively.

6. Conclusion

In this work, we investigated the shortcomings of current solutions and presented a novel strategy for prediction-based dynamic replication. We designed an Intelligent Replica Manager (IRM) and deployed it in the middleware of the data grid for scheduling data-intensive applications. IRM uses a multi-criteria based replication algorithm and a modified apriori algorithm for efficient selection, prediction and placement of replicas. We evaluated our dynamic replication strategy using the GridSim toolkit. During execution, the future need of a grid site is predicted in advance, so the data sets needed for the current job are readily available at the execution site and the job is executed without delay. The results show that the proposed work improves scheduling performance by reducing the waiting time of jobs in the queue, the data availability time and the data transfer time. The modified apriori algorithm makes accurate predictions, with 90% of the predicted data sets used in successive accesses. Across all jobs, the make span was reduced by at least 40% when compared with popular strategies. In future, this work may be extended to resolve scheduling problems in the cloud environment.
Acknowledgment

The authors would like to thank the anonymous reviewers for their insightful comments and systems support group for their help.

References

[1] Foster I, Kesselman C, Tuecke S. The anatomy of the grid: enabling scalable virtual organizations. Int J Supercomput Appl 2001;15(3):200–22.
[2] Moore R, Baru C, Marciano R, Rajasekar A, Wan M. Data-intensive computing. In: The grid: blueprint for a new computing infrastructure. Morgan Kaufmann; 1999. p. 105–29.
[3] Kwok Y-K, Ahmad I. Benchmarking and comparison of the task graph scheduling algorithms. J Parallel Distrib Comput 1999;59(3):381–422.
[4] Schopf JM. Ten actions when grid scheduling. In: Grid resource management. US: Springer; 2004. p. 15–23.
[5] Ranganathan K, Foster I. Decoupling computation and data scheduling in distributed data intensive applications. In: Proceedings of the 11th IEEE international symposium on high performance distributed computing, HPDC'02; 2002. p. 352–8.
[6] Rehn J, Barrass T, Bonacorsi D, Hernandez J, Semeniouk I, Tuura L, Wu Y. PhEDEx high-throughput data transfer management system. In: Proceedings of computing in high energy and nuclear physics (CHEP); 2006.
[7] Mohamed HH, Epema DHJ. An evaluation of the close-to-files processor and data co-allocation policy in multiclusters. In: Proceedings of IEEE international conference on cluster computing; 2004. p. 287–98.
[8] Cameron DG, Carvajal-Schiaffino R, Millar AP, Nicholson C, Stockinger K, Zini F. Evaluating scheduling and replica optimisation strategies in OptorSim. In: Proceedings of the 4th international workshop on grid computing. IEEE Computer Society; 2003. p. 52–60.
[9] Mansouri N, Dastghaibyfard GH, Mansouri E. Combination of data replication and scheduling algorithm for improving data availability in data grids. J Netw Comput Appl 2013;36(2):711–22.
[10] Ranganathan K, Foster I. Identifying dynamic replication strategies for a high performance data grid. In: Proceedings of the second international workshop on grid computing; November 12, 2001. p. 75–86.


[11] Amjad T, Sher M, Ali D. A survey of dynamic replication strategies for improving data availability in data grids. Future Gener Comput Syst 2012;28(2):337–49.
[12] Kwok Y-K, Ahmad I. Benchmarking and comparison of the task graph scheduling algorithms. J Parallel Distrib Comput 1999;59(3):381–422.
[13] Lin YF, Liu P, Wu J-J. Optimal placement of replicas in data grid environments with locality assurance. In: Proceedings of the 12th international conference on parallel and distributed systems (ICPADS'06), Minneapolis, Minn, USA; July 2006.
[14] Vashisht P, Kumar R, Sharma A. Efficient dynamic replication algorithm using agent for data grid. Sci World J 2014;2014:10.
[15] Saadat N, Rahmani AM. PDDRA: A new pre-fetching based dynamic data replication algorithm in data grids. Future Gener Comput Syst 2012;28(4):666–81.
[16] Beermann T, Stewart GA, Maettig P. A popularity based prediction and data redistribution tool for ATLAS distributed data management. PoS 2014:004–19.
[17] Khanli LM, Isazadeh A, Shishavan TN. PHFS: A dynamic replication method, to decrease access latency in the multi-tier data grid. Future Gener Comput Syst 2011;27(3):233–44.
[18] Ranganathan K, Foster I. Design and evaluation of dynamic replication strategies for a high performance data grid. In: Proceedings of international conference on computing in high energy and nuclear physics; 2001.
[19] Chang R-S, Chang J-S, Lin S-Y. Job scheduling and data replication on data grids. Future Gener Comput Syst 2007;23(7):846–60.
[20] Park S-M, Kim J-H, Go Y-B, Yoon W-S. Dynamic Grid replication strategy based on internet hierarchy. In: Proceedings of International workshop on grid and cooperative computing, 1001; 2003. p. 1324–31. Lecture notes in computer science.
[21] Tang M, Lee BS, Yeo CK, Tang X. Dynamic replication algorithms for the multi-tier data grid. Future Gener Comput Syst 2005;21:775–90.
[22] Chang R-S, Chang H-P. A dynamic data replication strategy using access-weights in data grids. J Supercomput 2008;45(3):277–95.
[23] Bsoul M, Abdallah AE, Almakadmeh K, Tahat N. A round-based data replication strategy. IEEE Trans Parallel Distrib Syst 2016;27(1):31–9.
[24] Bsoul M, Alsarhan A, Otoom A, Hammad M, Al-Khasawneh A. A dynamic replication strategy based on categorization for data grid. Multiagent Grid Syst 2014;10(2):109–18.
[25] Chettaoui H, Charrada FB. A new decentralized periodic replication strategy for dynamic data grids. Scalable Comput: Pract Exp 2014;15(1):101–19.
[26] Han J, Kamber M, Pei J. Data mining, southeast asia edition: concepts and techniques. Morgan Kaufmann; 2006.
[27] Agrawal R, Srikant R. Fast algorithms for mining association rules in large databases. In: Proceedings of the 20th international conference on very large data bases, VLDB, Santiago, Chile; September 1994. p. 487–99.
[28] Sulistio A, Cibej U, Venugopal S, Robic B, Buyya R. A toolkit for modelling and simulating data Grids: an extension to GridSim. Concurr Comput: Pract Exp 2008;20(13):1591–609.


M. A. Maluk Mohamed received his B.E. degree from Bharathidasan University in 1993 and his M.E. degree from NIT, Trichy in 1995. He obtained his Ph.D. degree from IIT Madras in 2006. His research interests include Distributed Computing, Mobile Computing, Cluster Computing and Grid Computing. He has 86 publications, including 32 papers in international journals. He is a member of the ACM, IEEE, ISA, IARCS and ISTE, and a life member of the CSI. He received the VijayaRattan award from the India International Friendship Society in 2005 for specializing in science and technology.

Vijaya Nagarajan received her B.E. degree from Bharathidasan University in 2001 and her M.E. degree from Anna University in 2010. Currently, she is pursuing her research under Anna University, Chennai. She has more than 12 years of teaching experience. She has published nearly 15 research papers in various international conferences and journals. Her research areas include Distributed Systems, Grid Computing and Mobile Computing.
