A DSM-based fragmented data sharing framework for grids

Future Generation Computer Systems 26 (2010) 668–677

Po-Cheng Chen a,∗, Jyh-Biau Chang b, Ce-Kuen Shieh a, Chia-Han Lin a, Yi-Chang Zhuang c

a Institute of Computer and Communication Engineering, Department of Electrical Engineering, National Cheng Kung University, No. 1, Ta-Hsueh Road, Tainan City 701, Taiwan, ROC

b Department of Digital Applications, Leader University, No. 188, Sec. 5, Au-Chung Road, Tainan City 709, Taiwan, ROC

c Home Network Technology Center, Industrial Technology Research Institute/South, No. 31, Gongye 2nd Road, Annan District, Tainan City 709, Taiwan, ROC

Article history: Received 14 August 2008; received in revised form 17 December 2009; accepted 29 December 2009; available online 7 January 2010.

Keywords: Data grid; Distributed Shared Memory (DSM); On-demand access; Data consistency; Teamster-G

Abstract

Sharing scientific and data-capture files of gigabyte and terabyte size in conventional data grid systems is inefficient because conventional approaches copy the entire shared file to a user's local storage even when only a tiny fragment of the file is required. Such transfer schemes consume unnecessary data transmission time and local storage space, and raise the additional problem of maintaining replica synchronization. Traditional replica-consistency schemes treat shared files as read-only, consequently sacrificing any guarantee of replica consistency for mutable data. This paper presents a DSM-based fragmented data sharing framework called ‘‘Spigot’’ which transfers only the necessary fragments of large files on user demand, thereby reducing data transmission time, wasted network bandwidth and required storage space. Data waiting time is further reduced by overlapping data transmission with data analysis, and the DSM concept is used to keep replicas synchronized. Experiments with real applications show reduced turnaround time for data-intensive applications, particularly when the necessary fragments are small and the analysis time and network latency are high.

© 2010 Elsevier B.V. All rights reserved.

1. Introduction

Shared data is a critical resource in data-intensive applications in both the scientific and business domains, for example climatology research, medical diagnosis and on-line business services. As a result of advanced simulation and sensor technology, the size of these data sets has already grown to the gigabyte level and is approaching the terabyte level [1–6]. Consequently, data-intensive applications are generally run on some form of data grid system for efficient use of these data. A data grid system hides the complexity of data management issues. It provides the user with a uniform scheme for data discovery, replication and transfer. Thus, the user need not worry about heterogeneous data formats and access methods, nor about the diverse consistency requirements of mutable and immutable data. Moreover, it is not uncommon for scientists to parallelize data-intensive jobs such as climate simulation [1,7–9,2,4,5]. For performance reasons, the original job is usually divided into several sub-jobs which can be executed in parallel. In this scenario, a data grid system should first support the sub-jobs in obtaining the necessary fragments of an input file and exchanging the




intermediate results during the process. Second, it should help the sub-jobs maintain data consistency when fragments are subject to modification by multiple writers. Finally, it should perform these services with user transparency.

Several data grid systems have been proposed [7–19] to serve large-scale data-intensive applications. However, most of these systems introduce an explicit scheme for data management [7–9,18,19]; that is, the user is left to explicitly store and transfer the necessary data between the replica servers and the computing nodes. Because these systems are used by diverse data-intensive applications, the explicit scheme may be inefficient for the following three reasons:

• Non-transparent access interface. Most available data grid systems such as DataMover [13,14], SRB [12,17] and JuxMem [7–9] still use an explicit data transfer mechanism or FTP-like protocol; that is, the user is required to use an explicit API for locating and transferring files. Such explicit data access schemes forbid legacy (i.e. grid-unaware) applications from migrating directly to the grid environment without modification (re-design or re-compilation with the explicit APIs), consequently sacrificing user transparency.
• A file-level transfer approach. Most current data grid systems (e.g. SRM [13], DataMover [14]) support only file-level transfer. Consequently, they must duplicate the entire file to the computing nodes even if these nodes only need a few small fragments of the file. This


wastes network bandwidth. A further restriction of file-level transfer of large files is the prerequisite of large storage capacity at the computing node, i.e. whether the storage capacity of the computing node can accommodate the file. Unfortunately, while computing nodes provide powerful computational ability, they commonly provide only limited storage capacity [20,21]. Thus only computing nodes with adequate storage capacity can be exploited.
• No guarantee of replica consistency. Replicas of a file are dynamically generated for performance reasons in the context of a data grid [10,21,22,3,23,6]. These replicas may suffer from consistency issues [7–9,24–27,18,19]. Once a replica is updated, all other replicas need to be synchronized so as to have the same content. However, current data grid systems such as DataMover [13,14], SRB [12,17] and Kangaroo [16] treat a shared file as read-only, therefore sacrificing the guarantee of replica consistency. Although data grid systems such as Sorrento [15,24,27] are aware of the requirement for replica consistency, they still treat updates as new versions of the original file. Thus, they require manual synchronization of the updates using an explicit API. In other words, such systems are responsible only for transferring updates at the file level to replica servers. Considering the storage capacity and transmission time required for updates, the current whole-file method is very inefficient at the large scales common in a grid environment.

In order to overcome these three problems, this paper presents a DSM-based (Distributed Shared Memory based) fragmented data sharing framework named ‘‘Spigot’’. The idea of Spigot is to make accessing shared data as easy as using a water tap. Spigot has three key features, namely: (1) a transparent access interface; (2) a fragment-level transfer-on-demand approach; (3) a guarantee of replica consistency. Spigot does not implement an additional access interface library. Instead, users merely design their programs as usual, using native I/O system calls. Consequently, the user's perception of Spigot is as though he were using a local file system; that is, the I/O operations on fragments of a remote shared file in Spigot are like working on the local file system. Spigot is responsible for trapping native I/O operations and redirecting them to the corresponding Spigot I/O operations. Spigot divides a large shared file into several fragments. Each fragment consists of several blocks whose size is configurable, and each has at least one replica stored in distributed replica servers. Spigot postpones the data transmission of the necessary fragments until the user's application actually accesses the data, thereby enhancing the system's throughput by overlapping data transmission and data analysis times. Moreover, Spigot allows a fragment to be modified by multiple users who have a partnership. To maintain the data consistency of modified fragments, Spigot adopts a page-based, grid-enabled Distributed Shared Memory (DSM) system, namely Teamster-G [28–31], to support fragment-level transfer on demand. Spigot maps the contents of fragments into the DSM space and relies on the DSM system to transparently share data while taking care of consistency. It inherits a release consistency model [28,30,31] from Teamster-G and therefore does not bother users with the consistency of shared data.

The remainder of this paper is organized as follows. Section 2 briefly reviews the major data sharing schemes presented in the literature for the grid environment. Section 3 introduces the proposed Spigot scheme and describes its system model. Section 4 discusses the implementation of Spigot, while Section 5 evaluates its performance under various data transfer scenarios. Finally, Section 6 presents some concluding remarks and indicates the intended direction of future research.


2. Related work

Several notable data sharing frameworks have been proposed in the literature [7–19]. They are briefly compared with Spigot in this section.

DataMover [14] provides users with a file-level data sharing framework. It uses Storage Resource Managers (SRM) [13] as its data transfer mechanism for replicating whole files/directories from a replica server to a user's local disk. The user can therefore directly access the replica on the local disk. However, DataMover does not maintain data consistency between multiple replicas on different users' local disks. In contrast, Spigot supports fragment-level transfer, replicating only the fragments of data which are actually needed by a user. Moreover, Spigot guarantees data consistency between multiple replicas.

Although SDSC SRB (Storage Resource Broker) [12,17] provides users with a set of explicit APIs for fragmented file sharing, it does not allow fragments of a file to have multiple simultaneous writers, in order to guarantee data consistency. In contrast, Spigot supports a transparent access interface, so a grid-unaware program can be executed correctly without re-design or re-compilation and can access a remote file in the same way as a local file. Furthermore, Spigot allows fragments of a file to have multiple simultaneous writers, and maintains data consistency by use of DSM methodology.

Kangaroo [16] provides users with a transparent access interface and considers the data consistency of multiple replicas of the same file. Kangaroo's adaptation layer translates a user's POSIX I/O function calls into the underlying Kangaroo transmission function calls, whereby a grid-unaware application can execute on top of Kangaroo without re-design or re-compilation. Kangaroo provides users with a set of functions for replica synchronization, but the actual responsibility for maintaining data consistency is left to the users. In contrast, Spigot handles data consistency through its own underlying DSM sub-system.

NFSv4.r [18,19] also provides users with a transparent access interface. A user can access remote data as if the data were in the local file system. NFSv4.r uses a coarse-grained lock to protect a whole file and guarantees data consistency. In contrast, Spigot provides fine-grained locks to protect blocks of a file, and therefore may improve performance.

JuxMem [7–9] is somewhat similar to Spigot, being likewise inspired by the DSM approach to transparent data access and consistency management. However, JuxMem does not leverage existing replica location services [21,22,32], but instead implements its own P2P memory resource management scheme by adopting the JXTA framework. Once the data is produced, JuxMem directly deposits the entire shared data into the DSM space; under this scheme, only nodes which provide enough memory for data storage can be exploited. In contrast, Spigot maps only the content of a necessary fragment into the DSM space on demand. Moreover, although JuxMem declares itself to be transparent, it nevertheless requires the user to convert (i.e. re-design and re-compile) existing grid-unaware programs with its own API.

3. Spigot

Fig. 1 illustrates the general system overview of Spigot. The basic building blocks of Spigot include at least one client, at least one replica server and a centralized file lookup server. A replica server represents a grid node which provides large storage capacity and stores fragmented replicas, i.e. fragments of shared files.
A client represents a grid node which hosts a data-intensive application that directly uses the contents of fragments. The centralized file lookup server represents a catalog which records the locations of fragments. Initially, the replica servers register newly created fragments with the file lookup server. After that, the clients query the file


lookup server for the locations of the necessary fragments and then access these fragments through the facilities of Spigot. Spigot allows grid-unaware data-intensive applications to use a remote shared file in the same way as a local file. Meanwhile, multiple data-intensive applications which have a partnership can access distinct parts of a fragment concurrently. Spigot is also responsible for maintaining data consistency.

Fig. 1. System overview of Spigot.

The design principle of Spigot is to make a data-intensive application's access to data as easy as a person drawing tap water. Spigot has three key features, namely: (1) a transparent access interface; (2) a fragment-level transfer on-demand approach; (3) a guarantee of replica consistency. The remainder of this section describes these key features in detail.

3.1. A transparent access interface

Just as people who draw tap water need not concern themselves with the water source and the water transmission, data-intensive applications which consume shared data need not concern themselves with the data location and the data transmission. However, most published data grid systems [12–14,17] make it necessary for programmers to re-design or re-compile legacy applications to explicitly locate and transfer replicas. Spigot, on the other hand, hides from the users the fact that fragments of a shared file are actually deposited among the replica servers of Spigot and not on the local disk. Spigot creates a virtual file on the local disk and associates it with each shared file. It therefore does not need to implement its own access interface library, but instead allows the application to access a shared file through the associated virtual file with native I/O system calls. The disk I/O operations on the virtual file are then trapped and redirected to Spigot, but this process is totally hidden from the users. Thus, a data-intensive application can transparently access the fragments as though it were accessing a local file. This gives the user a transparent access impression.

Fragments of a shared file in Spigot may be distributed over more than one replica server. Each replica server, however, owns an independent file namespace. A conventional user who wants to access these fragments suffers significant overhead, since he must explicitly locate the fragments. Spigot alleviates this overhead by providing users with a global namespace. With a global namespace, the client accesses the fragments of a given file located on different replica servers through the logical identifier of the file. The file lookup server then transparently translates the logical locations of the fragments into their physical locations.
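Because the access interface is the native one, a grid-unaware program needs no Spigot-specific code at all. The following minimal sketch is ours, not part of the original paper: the file name is hypothetical, and /tmp/spigot-dir is the special directory introduced later in Section 4.2.1.

/* A grid-unaware application: plain POSIX I/O on a Spigot virtual file.
 * The path below is hypothetical; no Spigot API is used anywhere. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    /* Opening the virtual file triggers Spigot's lookup machinery
     * behind the scenes; the application sees an ordinary descriptor. */
    int fd = open("/tmp/spigot-dir/climate_run42.dat", O_RDWR);
    if (fd < 0) { perror("open"); return EXIT_FAILURE; }

    char buf[4096];
    /* Each read touches only the fragment holding these bytes;
     * Spigot fetches that fragment on demand. */
    if (pread(fd, buf, sizeof buf, 0) < 0)
        perror("pread");

    close(fd);
    return EXIT_SUCCESS;
}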

3.2. A fragment-level transfer on-demand approach

Under the file-level transfer approach of most available data grid systems, a shared file must be transferred entirely to a client before an application can actually access it [20,33,10–14,23,17]. After the file has been stored on a local disk, existing grid-unaware applications can access it with native I/O system calls, without re-design or re-compilation. However, transferring complete files prior to running an application may transfer unnecessary fragments, with consequent delay in application response time as well as wasted network bandwidth and storage space. Applications usually do not need to access an entire file at one time, but instead normally access the necessary fragments of the file on demand [20,10]. In the data grid context, the size of shared files is increasing rapidly. This creates the danger of exhausting the local disk space of a replica server if each file is entirely duplicated there. The predicament is more serious at a client, because a grid node which hosts data-intensive applications often provides powerful computing ability but only moderate storage capacity [20,10,3,6].

To overcome these limitations of the file-level approach, Spigot applies a fragment-level transfer approach. Spigot divides a large file into several size-configurable fragments, and each replica server stores several fragments according to its storage capacity. Moreover, Spigot postpones the data transmission of the necessary fragments until the application actually accesses the data, i.e. transmission on demand. Fig. 2 shows the breakdown of the time line of a data-intensive application under the two data transfer approaches, i.e. file-level and fragment-level on-demand. As shown in Fig. 2, although the application utilizes only 50% of the fragments of a file, the file-level transfer approach delays the CPU burst until the file is 100% transferred. The fragment-level transfer on-demand approach, on the other hand, transfers only the necessary 50% of the fragments, and additionally overlaps the data transmission with the CPU burst, thereby improving performance.

Fig. 2. Breakdown of the time line for a data-intensive application.

3.3. A guarantee of replica consistency

Existing schemes for replica consistency either treat all replicas as read-only [12–14,16,17] or treat any update as a new version of the original replica [24,15,27]. The former scheme sacrifices the possibility of sharing replicas among multiple writers. The latter burdens users with manually synchronizing updated replicas. On the other hand, strategies for transparently maintaining data consistency have been widely studied in the context of DSM systems. Although most available DSM systems are designed only for small-scale environments, advanced DSM systems [28–31] have been proposed for the grid environment. Spigot consequently adopts a page-based, grid-enabled DSM system, namely Teamster-G [28–31], as a sub-system, and exploits the data consistency protocol derived from this DSM sub-system to synchronize fragments automatically.

The basic concepts of a page-based DSM system are briefly reviewed here to expedite the understanding of the following paragraphs. The DSM system makes the virtual address spaces of the distributed nodes look like a single shared memory with a unique address space, namely a virtual global memory image (GMI) [28–31]. All node accesses to a shared page of the virtual GMI are detected by page protection, with protection states such as invalid, read-only and read/write. When a node accesses an invalid page, the DSM system transparently fetches the associated page from another node, namely the page owner.
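This page-protection mechanism can be sketched generically. The following is our illustration of the standard technique, not Teamster-G source code; fetch_page_from_owner() is a hypothetical transport routine.

/* Generic page-based DSM access detection: shared pages start as
 * PROT_NONE; the SIGSEGV handler fetches the page from its owner and
 * upgrades its protection state. */
#include <signal.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

extern void fetch_page_from_owner(void *page_addr);  /* hypothetical */

static long page_size;

static void dsm_fault_handler(int sig, siginfo_t *info, void *ctx)
{
    (void)sig; (void)ctx;
    void *page = (void *)((uintptr_t)info->si_addr & ~(page_size - 1));

    fetch_page_from_owner(page);            /* invalid -> valid      */
    mprotect(page, page_size, PROT_READ);   /* read-only state       */
    /* A later write faults again; the DSM then keeps a "twin" copy
     * of the page and grants PROT_READ | PROT_WRITE so that a diff
     * can be computed at synchronization time (see below). */
}

void dsm_install_handler(void)
{
    page_size = sysconf(_SC_PAGESIZE);
    struct sigaction sa = {0};
    sa.sa_flags = SA_SIGINFO;
    sa.sa_sigaction = dsm_fault_handler;
    sigaction(SIGSEGV, &sa, NULL);
}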



If a node writes to a read-only page which has copies on other nodes, the DSM system is responsible for comparing the original with the modified contents of the page, and then sends the difference to the nodes which hold a local copy of the page. When those nodes receive the difference, they apply it to their own local copies. Having done so, all local copies of the page are synchronized by the DSM system. Moreover, the DSM system allows multiple nodes to write concurrently to the same page. Note, however, that a race condition arises during simultaneous writes to the same part of the same file. This is a matter of concern for the user of Spigot but lies outside Spigot's control.

As shown in Fig. 3, Spigot uses its DSM sub-system to maintain the consistency of the fragments of a shared file. Spigot groups together the replica servers which store fragments of a shared file and the clients which access those fragments, joining them into a virtual GMI. Spigot then deposits the fragments of the file from the local disks of the replica servers into the virtual GMI. After that, the replica servers and clients can share the fragments without concern for fragment consistency, because Spigot automatically synchronizes the fragments according to the consistency model inherited from its DSM sub-system.

Fig. 3. Abstraction of the virtual global memory image.

Moreover, the size of the virtual GMI is equal to the size of the virtual address space of a Spigot process. The emergence of the 64-bit architecture with proper operating system support can ideally address a 16-exabyte address space, which implies that the size of the virtual GMI is virtually unlimited under the 64-bit architecture. Admittedly, most 64-bit processors on the market today do not yet fully support the 16-exabyte address space because of physical constraints; for example, the AMD64 architecture currently supports only 4 petabytes of physical memory and 256 terabytes of virtual address space. Nevertheless, it is still reasonable to regard a 256-terabyte virtual GMI as effectively unlimited for Spigot applications.

4. Implementation

This section describes the Spigot system architecture, including Spigot's three modules and three sub-systems. Fig. 4 displays the Spigot architecture, which consists of three modules, namely SUTIL (Spigot utility module), SFS (Spigot file system module) and SFPS (Spigot fragments publishing service module). In addition, Spigot adopts Teamster-G [28–31], FUSE [34] and Globus RLS (Replica Location Service) [22] as sub-systems. In Spigot, both the clients and the replica servers deploy SUTIL and adopt Teamster-G (a page-based, grid-enabled software DSM system) as the underlying DSM sub-system. The client deploys SFS and exploits FUSE to redirect the disk I/O operations of data-intensive applications to Spigot. Moreover, the replica server deploys SFPS to register newly created fragments with the file lookup server, which deploys Globus RLS to record and provide the mapping from the logical locations to the physical locations of the fragments. The following sub-sections delineate the operation of the Spigot components.

4.1. SFPS and RLS

With respect to user transparency, Spigot allows a client to locate the desired file by using a unique logical identifier, i.e. a logical file name (LFN) in the Spigot global namespace. The file lookup server deploys RLS to construct a global namespace whereby Spigot assists the client in determining the physical locations, i.e. the physical file names (PFNs), of one or more replicas of the file. The file lookup server in Spigot implements the simplest RLS deployment; that is, it uses only a single LRC (Local Replica Catalog) of Globus as the central registry of replica location information. Additionally, in order to improve the efficiency of data transmission, Spigot divides a shared file into at least one fragment. Thus, each PFN registered as an LRC entry essentially represents the location information of a fragment. However, each fragment can have a different length. Each PFN therefore has the two attributes shown in Table 1: the first attribute specifies the offset, i.e. the distance from the beginning of the file; the second specifies the length of the fragment.

Once a replica server creates new fragments according to any replicating strategy, the LRC offers the SFPS in the replica server a function for registering the fragments according to the above-mentioned PFN format. SFPS examines the LRC to determine whether or not a corresponding mapping between the LFN and the PFNs exists. If the mapping does not exist, it creates the mapping in the LRC database; otherwise it directly adds the mapping from the PFN to the existing LFN. It then adds the offset and fragment-length attributes to the PFN in the LRC database. Moreover, the LRC also offers the SFS in the clients the functionality of querying the mapping between the LFN and the PFNs. Both the fragment registration and query processes are performed using the Globus RLS APIs, i.e. globus_rls_client_lrc_*.


Fig. 4. System architecture of Spigot.

Table 1
Replica catalog of a given shared file.

PFNs of a given shared file                         Offset (byte)   Length (MB)
pfn://replica_serverX/spigot_dir/fragment0_file1    0               128
pfn://replica_serverX/spigot_dir/fragment1_file1    134217727       64
pfn://replica_serverY/spigot_dir/fragment2_file1    201326591       96
...                                                 ...             ...
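To make the role of these attributes concrete, the following sketch (ours, not Spigot source code; all names are illustrative) shows an in-memory view of the Table 1 entries and the interval test that the PFN querying process of Section 4.2.2 can use to select only the fragments touched by a read/write request.

/* Illustrative view of the Table 1 catalog and the overlap test used
 * to pick the PFNs a read/write actually touches. */
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const char *pfn;     /* e.g. pfn://replica_serverX/... */
    uint64_t    offset;  /* distance from start of file, in bytes */
    uint64_t    length;  /* fragment length, in bytes */
} fragment_entry;

/* A fragment is needed iff it overlaps [req_off, req_off + req_len). */
static int fragment_needed(const fragment_entry *f,
                           uint64_t req_off, uint64_t req_len)
{
    return f->offset < req_off + req_len &&
           req_off < f->offset + f->length;
}

size_t select_pfns(const fragment_entry *catalog, size_t n,
                   uint64_t req_off, uint64_t req_len,
                   const fragment_entry **out)
{
    size_t hits = 0;
    for (size_t i = 0; i < n; i++)
        if (fragment_needed(&catalog[i], req_off, req_len))
            out[hits++] = &catalog[i];
    return hits;  /* PFNs of the necessary fragments only */
}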

4.2. SFS and FUSE

Spigot supports client exploitation of fragments deposited among distributed replica servers. At the same time, it also promises that clients do not have to explicitly locate and transfer the necessary fragments. Accordingly, SFS is the connector between the application and Spigot. It permits existing grid-unaware applications, without modification, to access remote fragments through native I/O system calls. It is responsible for intercepting the disk I/O, querying the PFNs and demanding the necessary fragments transparently.

4.2.1. Disk I/O interception

For each shared file requested by a client, SFS establishes a virtual file in the local disk space according to the file's LFN. The client consequently can access the shared file, which consists of distributed fragments, as though accessing a local file, simply by accessing the virtual file. SFS utilizes the FUSE kernel module [34] to trap the I/O system calls issued by the application and redirect them to Spigot. In order to satisfy the I/O requirements of the client, Spigot performs the corresponding fragment-locating or fragment-transferring operations according to the trapped I/O system calls. Note that all virtual files are located in a special directory, such as /tmp/spigot-dir, in the local disk space. Only disk I/O operations on this special directory are transparently passed to SFS; Spigot therefore does not affect the operation of the normal file system.
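As an illustration of this interception role (not the actual SFS source), a minimal FUSE skeleton of the following kind would register callbacks for the trapped operations; the spigot_* bodies are placeholders for the redirection logic.

/* Minimal FUSE skeleton (high-level FUSE 2.x API) of the role SFS
 * plays.  Real SFS would translate each call into Spigot's PFN lookup
 * and virtual GMI access; here the bodies are stubs. */
#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <errno.h>
#include <sys/stat.h>

static int spigot_getattr(const char *path, struct stat *st)
{
    (void)path; (void)st;
    return -ENOSYS;          /* placeholder: ask Spigot for metadata */
}

static int spigot_read(const char *path, char *buf, size_t size,
                       off_t off, struct fuse_file_info *fi)
{
    (void)path; (void)buf; (void)size; (void)off; (void)fi;
    return -ENOSYS;          /* placeholder: copy from the virtual GMI */
}

static struct fuse_operations spigot_ops = {
    .getattr = spigot_getattr,
    .read    = spigot_read,
};

int main(int argc, char *argv[])
{
    /* Mounted over the special directory, e.g. /tmp/spigot-dir, so
     * that only I/O under that directory reaches Spigot. */
    return fuse_main(argc, argv, &spigot_ops, NULL);
}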

4.2.2. PFN querying process

Although SFS provides data-intensive applications with a transparent access impression, in actuality a shared file consists of distributed fragments. Thus, SFS also features a PFN querying process for locating the requested fragments of a shared file when the application accesses these fragments through open() or read()/write() system calls, in case these fragments have not yet been deposited in the virtual GMI. SFS uses the LFN of the file as the key for requesting all mappings of PFNs to the LFN from the LRC in the file lookup server. After SFS receives the list of requested PFNs, it further examines the attributes of these PFNs according to the current file offset and the requested read/write size (the number of bytes). It can therefore filter the PFNs of the necessary fragments out of all the mappings of PFNs to the LFN. Note that in the file-opening case, if the virtual GMI has not been constructed yet, SFS queries the LRC for the PFN of the first fragment of the file and then requests SUTIL to dynamically construct the virtual GMI.

4.2.3. Demand for a fragment

After SFS obtains the PFNs of the necessary fragments, Spigot has to deposit the fragments into the virtual GMI; after that, the application can access the fragments from the virtual GMI. During this process, Spigot actually has to cache each fragment in the local memory of the client before the application can access it. Accordingly, SFS assigns SUTIL to collaborate with the underlying DSM sub-system to transparently transfer the requested fragment from the remote replica servers into local memory. SFS passes the PFNs of the necessary fragments, the current file offset and the requested read/write size (the number of bytes) as arguments to SUTIL. If the application wants to write data into a given fragment, the data is also passed as an argument to SUTIL.

4.3. SUTIL and Teamster-G

The above-mentioned SFS is in charge of locating and requesting the necessary fragments. SUTIL is responsible for constructing the virtual GMI, for transferring the necessary fragments between clients and replica servers, and for transparently guaranteeing the data consistency of the fragments.

4.3.1. Virtual GMI construction

SUTIL is essentially a DSM application of Teamster-G, a grid-enabled, page-based DSM system which provides the user with an eager-release consistency mechanism [28,30,31]. SUTIL exploits Teamster-G to create and manage a virtual GMI for each shared file. Each virtual GMI is constructed dynamically when a client opens the shared file for the first time. The replica server which stores the first fragment of the file is regarded as the manager node of the virtual GMI, and the SUTIL in the manager node is responsible for creating the virtual GMI.


In addition, Teamster-G allows the scalability of a virtual cluster to be reconfigured [29]; that is, grid nodes can dynamically join or leave the virtual GMI sharing. Therefore, other nodes, i.e. clients/replica servers which access/store fragments of the file, can join the virtual GMI sharing on demand via the SUTIL in these nodes. This dynamic node-join process is controlled by the SUTIL in the manager node.

After the virtual GMI is constructed, the clients and the manager node can share data in the virtual GMI. However, the virtual GMI initially holds no deposited fragments. According to the PFNs of the necessary fragments, a client subsequently requests the SUTIL in the manager node to incorporate the replica servers which store the necessary fragments into the virtual GMI for dynamic sharing of the fragments. Having done so, the SUTILs in these newly joined replica servers apply memory-mapped I/O [28] to map the fragments on their local disks into the virtual GMI (i.e. the DSM memory space); meanwhile, the client also applies memory-mapped I/O to map the fragments from the virtual GMI into the virtual file provided by SFS. Consequently, a grid-unaware data-intensive application can transparently access the virtual file. Although the original data actually resides in remote memory, Teamster-G is responsible for transparently fetching the necessary fragments from the replica server to the client (as mentioned in Section 3.3).

Later, other clients may also want to access the same shared file. SFS consequently traps the open() operations when these clients open the associated virtual file. Because the virtual GMI has already been constructed, the SUTIL in the manager node simply incorporates these clients into the virtual GMI for sharing. Moreover, a client may request fragments which have not yet been deposited in the virtual GMI, either after sequentially reading through several fragments of the shared file or after moving to a new read/write position with the lseek() function. At this point, the client requests the SUTIL in the manager node to incorporate the other replica servers which contain the necessary fragments into the dynamic virtual GMI sharing.

4.3.2. Data consistency of fragments

Teamster-G guarantees the data consistency of the virtual GMI because the virtual GMI is in fact DSM memory. Suppose a client, say client I, modifies a fragment of a shared file, say fragment M. Further, suppose fragment M is originally stored in replica server Y, and that client J and client K have previously accessed it. This means that client I, client J, client K and replica server Y each hold their own local cache of fragment M, and these local caches should be synchronized. In this scenario, the SUTIL in client I is responsible for synchronizing all local caches of fragment M when the SFS in client I traps the flush() system call, that is, when client I closes the file. The SUTIL in client I compares the original with the modified version of fragment M and then sends the difference to client J, client K and replica server Y. Once client J, client K and replica server Y receive the difference, the SUTILs in these nodes apply it to their own local caches. Having done so, the local caches of fragment M achieve consistency.
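The diff step can be sketched as follows. This is our generic illustration of diff creation and application against an unmodified "twin" copy, not Teamster-G's actual encoding or wire format.

/* Diff-based page/fragment synchronization in miniature: compare a
 * modified copy against its unmodified twin, encode the changed
 * bytes, and apply the diff at a node holding another copy. */
#include <stddef.h>

/* Encode: for each changed byte, record (index, new value).
 * Returns the number of changed bytes. */
size_t make_diff(const unsigned char *twin, const unsigned char *mod,
                 size_t len, size_t *idx, unsigned char *val)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        if (twin[i] != mod[i]) { idx[n] = i; val[n] = mod[i]; n++; }
    return n;
}

/* Apply: bring a remote local cache up to date with the changes. */
void apply_diff(unsigned char *copy, const size_t *idx,
                const unsigned char *val, size_t n)
{
    for (size_t i = 0; i < n; i++)
        copy[idx[i]] = val[i];
}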
5. Performance evaluation

The performance of Spigot was evaluated by executing the benchmarks presented below over various wide-area network configurations. The practicability of the fragment-level transfer on-demand approach (‘‘fragment approach’’ for short) adopted by Spigot was compared with that of the file-level transfer approach (‘‘file approach’’ for short) used in most available data grid systems. All experiments were performed on commercial general-purpose rack-mounted servers, each running Linux 2.6.20-2925.9.fc7xen on two 2.4 GHz AMD Opteron dual-core processors with 1024 KB L2 cache, 4 GB RAM and a Gigabit Ethernet card. A PC router was used to emulate the different wide-area link latencies.


The router ran Linux 2.4.20-8 with a network emulation package, namely NIST Net v2.0.12 [35], on a four-way SMP machine with 500 MHz Pentium III Xeon processors, 512 KB L2 cache, 512 MB RAM and a 10/100 Mbps Fast Ethernet card. The remainder of this section first describes the benchmarks used in the experiments. Details of the experimental procedure and the performance observation metric are then introduced. Finally, the experimental results are presented in the last sub-section.

5.1. Benchmark

The benchmarks used in the experiments consist of a synthesized application and two real applications (FFT and 2D wavelet coding). All of them perform bulk data accesses (including read/write operations) on an 800 MB file which is initially deposited in a remote replica server. The advantage of using a synthesized application is that, by varying the experimental parameters, it can easily model the relationship between the different data transfer approaches and the diverse behaviors of real data-intensive applications. On the other hand, the use of real applications verifies the applicability of Spigot in a way that a synthesized application cannot.

The synthesized application reproduces a general data-intensive application that continually reads 4 MB of data from a large file and analyzes those data. After that, it produces 4 MB of results and overwrites them to the large file. The data analysis process is emulated by repeatedly sorting the 4 MB of data. The required analysis time (Ta) for each 4 MB of data is an experimental parameter of the synthesized application; it can be controlled by increasing or decreasing the number of iterations of the sorting procedure. Obviously, the behavior of the application is correlated with Ta: a greater Ta makes the application tend toward CPU-bound behavior, whereas a smaller Ta makes it tend toward I/O-bound behavior. However, an application may not analyze the entire file, since it may use only part of its data. Accordingly, the percentage of the total data consumed by the application (Pc) is also an experimental parameter.

The real applications used in the experiments are a Fast Fourier Transform application (FFT) and a 2D wavelet coding application (Wavelet). The behavior of the two real applications is the same as that of the synthesized application; the only difference is that the data analysis process is replaced, for the FFT application, by the one-dimensional discrete Fourier transform subroutine from the FFTW library [36] and, for the wavelet coding application, by the 2D wavelet coding subroutine from the 2D Wavelets package [37]. These particular applications were chosen because their runtime behaviors are well understood in most grid computing environments. Note that, thanks to Spigot's transparent access interface, these publicly available subroutines can be used directly without any modification. Note also that when using the real application benchmarks, the only parameter modified experimentally is Pc, in order to change the input data size; for the real applications, Ta is entirely determined by the subroutines.

5.2. Experiments

The objective of the experiments was to model the relationship between the diverse behaviors of data-intensive applications and the data transfer approaches. For performance evaluation, the benchmark was executed using two different data transfer approaches. In addition to the fragment approach adopted by Spigot, the file approach used in most available data grid systems was also tested. The file approach transferred data between the replica servers and the clients using GridFTP version 4 as its underlying transfer mechanism.
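For concreteness, the synthesized benchmark of Section 5.1 can be sketched as follows. This is our reconstruction from the description above; repeat_sort() and the file path are illustrative stand-ins, and pc is taken as a fraction between 0 and 1.

/* Sketch of the synthesized benchmark: read 4 MB, "analyze" it by
 * repeated sorting (tuning Ta via the iteration count), then
 * overwrite 4 MB of results in place. */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define CHUNK (4 * 1024 * 1024)

extern void repeat_sort(char *buf, size_t len, int iterations); /* emulates Ta */

void run_benchmark(int iterations, double pc)
{
    int fd = open("/tmp/spigot-dir/input_800mb.dat", O_RDWR);
    char *buf = malloc(CHUNK);
    off_t off = 0;
    long chunks = (long)(pc * 800) / 4;   /* pc fraction of 200 chunks */

    for (long i = 0; i < chunks; i++, off += CHUNK) {
        pread(fd, buf, CHUNK, off);           /* may stall: fragment fetch */
        repeat_sort(buf, CHUNK, iterations);  /* emulated analysis (Ta)    */
        pwrite(fd, buf, CHUNK, off);          /* overwrite with results    */
    }
    free(buf);
    close(fd);
}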


All evaluations were performed on the experimental configuration shown in Fig. 5. The experimental 800 MB file was equally divided into 200 fragments, and the mapping information of the PFNs to the LFN was recorded in the central file lookup server. Additionally, the experimental parameter Ta of the synthesized application was increased quasi-stepwise from 5 ms to 10 s, specifically Ta = 5, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500 ms and then 1, 2, 5 and 10 s. In addition, the parameter Pc of both the synthesized and the real applications was increased quasi-stepwise from 10% to 100%, specifically Pc = 10%, 20%, 40%, 80%, 100%. By adjusting these parameters, the diverse behaviors of general data-intensive applications could be evaluated as comprehensively as possible. Moreover, the performance impact caused by WAN delays was also evaluated. Accordingly, the PC router emulated round-trip times (RTT) increasing quasi-stepwise from 1 ms to 150 ms between each pair of nodes, specifically RTT = 1 ms, 10 ms, 50 ms, 100 ms, 150 ms.

Fig. 5. Performance experiment setup.

In a general scenario, once a data-intensive application is submitted, both of the evaluated transfer approaches duly fetch (replicate) the necessary fragments of the large file from the remote replica server into the local disk space of the client node, after which the application can access the necessary data. Therefore, the elapsed time the application experiences while waiting for the data fetching process to complete (the data waiting time) is the performance metric used to quantify the effect of the different data transfer approaches. For the fragment approach, the measured data waiting time is the sum of the times spent in each read operation. For the file approach, the measured waiting time is the data staging time spent on complete duplication of the large file from the replica server into the local disk space of the client node before the application can be executed. In addition, the data transmission times associated with both transfer approaches were also measured. Note that for the file approach, the data transmission time is by definition equal to the data waiting time described above. For the fragment approach, however, the data transmission time is not equal to the data waiting time, because the fragment approach can reduce the data waiting time by overlapping data transmission with data analysis. All results presented in this paper are mean values over 10 trials of each experiment.
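The overlap effect just described can be captured by a first-order model. The model below is ours, not the paper's: let $S$ be the file size, $B$ the effective bandwidth, $N = P_c \cdot 200$ the number of fragments actually read, $t_x = (S/200)/B$ the per-fragment transfer time and $T_a$ the per-fragment analysis time. Then, ignoring per-request latency and assuming perfect pipelining,

$$T^{\text{file}}_{\text{wait}} = \frac{S}{B}, \qquad T^{\text{frag}}_{\text{wait}} \approx t_x + (N - 1)\,\max\bigl(0,\; t_x - T_a\bigr).$$

When $T_a \ge t_x$, the fragment-approach waiting time collapses to a single fragment fetch, which is consistent with the flat fragment-approach curves reported below for Ta ≥ 500 ms.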

5.3. Results and discussion

Fig. 6 shows the results derived from executing the synthesized application with Ta set between 5 ms and 450 ms, that is, when the application tends to be I/O bound. It displays the relation between the data waiting times of the two transfer approaches for varying Pc while the RTT remains 100 ms. The data waiting times of the file approach remained constant as Pc changed, whereas those of the fragment approach correlated with Pc. Clearly, the proposed fragment approach is more efficient than the traditional file approach with regard to data waiting time for all Pc, and the difference becomes much more pronounced as Pc becomes low. Moreover, this finding reinforces the premise that the fragment approach prevents the waste of network bandwidth and storage space.

Fig. 6. Results of the on-demand access feature of the fragment approach.

Fig. 7 also shows results derived from executing the synthesized application. It illustrates the relation between the data waiting times of the two transfer approaches for varying Pc while the RTT remains 100 ms, with Ta ranging from 500 ms to 10 s, that is, when the application tends to be CPU bound. Like Fig. 6, Fig. 7 reveals that the data waiting times of the file approach remained constant for different Pc. Unlike Fig. 6, however, the data waiting times of the fragment approach were not affected by increasing Pc. These results are to be expected, since the fragment approach reduces the data waiting times by overlapping the data transmission time with the data analysis time when Ta is great enough, such as when Ta ≥ 500 ms. The file approach, on the contrary, is restricted to transferring the entire file from the replica servers to the clients prior to running the application, and wastes time transferring unnecessary fragments of the large file. These results suggest that general data-intensive applications can benefit from Spigot regardless of their behavior (I/O bound or CPU bound).

Fig. 7. Results of the fragment approach by overlapping the data transmission time and the data analyzing time.

Figs. 8 and 9 indicate respectively the results derived from executing the real applications, Wavelet and FFT, with the fragment approach and varying Pc for a constant RTT of 100 ms. The results obtained when the large experimental file is in the local file system of the client node are also shown for reference purposes. Note that the Ta of Wavelet averaged 340 ms while the Ta of FFT averaged 595 ms; that is, Wavelet tends to be I/O bound whereas FFT tends to be CPU bound. Therefore, the performance of the fragment approach for Wavelet is inferior to the case in which the large experimental file is in the local file system. On the other hand, the performance of the fragment approach for FFT is improved by overlapping the data transmission and data analysis times when Pc is great enough, i.e. when Pc ≥ 40%. These results are reasonable, since the data transmission between the application and Spigot actually goes through a memory copy procedure instead of disk I/O operations.

Fig. 8. Performance of executing the real application — Wavelet.

Fig. 9. Performance of executing the real application — FFT.

Finally, Fig. 10 illustrates the relation between the data transmission times of the two transfer approaches for increasing RTT while Pc remains 100%. Evidently, the data transmission times are adversely affected by increasing RTT. The histograms in Fig. 10 show that the performance of the file approach suffered dramatically as the RTT increased from 50 ms to 100 ms. Admittedly, the performance of the file approach might be improved by tuning the underlying transfer mechanism, for example by setting more parallel data connections in each GridFTP transfer. Nevertheless, the Fig. 10 results indicate that the transfer mechanism of the fragment approach (SUTIL) promises better scalability with regard to network latency than the transfer mechanism of the file approach (GridFTP).

Fig. 10. Scalability in network latency.

6. Conclusion and future work

This study has presented a DSM-based fragmented data sharing framework designated ‘‘Spigot’’, so named because it is intended to make accessing shared data as easy as using a water tap. First, Spigot allows users to design their programs with native I/O system calls. The user's perception of Spigot is just like that of a local file system, thus alleviating the traditional difficulties involved when the user must use a complicated explicit data transfer API. Second, it is conventionally believed that supporting consistent mutable replicas in a data grid environment is too costly to consider [18,19]; this study demonstrates otherwise. Spigot employs its DSM sub-system to realize the consistency of the fragments of a shared file. Third, with respect to performance, the experimental results indicate that the fragment-level transfer on-demand approach adopted by Spigot reduces application turnaround time more than the file-level transfer approach does. This is particularly true when the necessary fragments used by the application are relatively small and the required data analysis time and network latency are relatively


high. Moreover, the results show that Spigot significantly reduces the waste of network bandwidth and storage space.

In a future study, the performance of Spigot will be further enhanced by adding more efficient mechanisms for reducing the turnaround time of data-intensive applications. For example, a mechanism may be proposed for tracing the fragment access patterns of data-intensive applications so that Spigot can pre-fetch the necessary fragments. Such fragment pre-fetching would allow Spigot to complete the transfer of the necessary fragments exactly when they are needed by the application. Fragment pre-fetching is feasible because most scientific applications adopt iterative methods for solving numerical problems and therefore have regular fragment access patterns. In addition, a scheme may be presented for coupling other replica management systems [20,10,21,22,11,23] so that Spigot can find the best replica among multiple replica servers and transfer the necessary fragments from it. Furthermore, Spigot may even download the necessary fragments in parallel from multiple replica servers to reduce the data transmission time.

Acknowledgement

This work was supported by the National Science Council of Taiwan, ROC, under project No. 96-2221-E-426-004-.

References

[1] B. Allcock, I. Foster, V. Nefedova, A. Chervenak, E. Deelman, C. Kesselman, J. Lee, A. Sim, A. Shoshani, B. Drach, D. Williams, High-performance remote access to climate simulation data: A challenge problem for data Grid technologies, in: Proceedings of the 2001 ACM/IEEE Conference on Supercomputing, ACM, Denver, Colorado, 2001.
[2] J.B. Drake, P.W. Jones Jr., Overview of the software design of the community climate system model, International Journal of High Performance Computing Applications 19 (2005) 177–186.
[3] W. Hoschek, J. Jaén-Martínez, A. Samar, H. Stockinger, K. Stockinger, Data management in an international data Grid project, in: Proceedings of the First IEEE/ACM International Workshop on Grid Computing, Springer-Verlag, 2000.
[4] D. Stainforth, J. Kettleborough, M. Allen, M. Collins, A. Heaps, J. Murphy, Distributed computing for public-interest climate modeling research, Computing in Science and Engineering 4 (2002) 82–89.
[5] G.A. Stewart, D. Cameron, G.A. Cowan, G. McCance, Storage and data management in EGEE, in: Proceedings of the Fifth Australasian Symposium on ACSW Frontiers, vol. 68, Australian Computer Society, Inc., Ballarat, Australia, 2007.
[6] S. Venugopal, R. Buyya, K. Ramamohanarao, A taxonomy of data Grids for distributed data sharing, management, and processing, ACM Computing Surveys 38 (2006) 3.
[7] G. Antoniu, L. Bougé, M. Jan, JuxMem: An adaptive supportive platform for data sharing on the Grid, Scalable Computing: Practice and Experience 6 (2005) 45–55.
[8] G. Antoniu, H.L. Bouziane, M. Jan, C. Pérez, T. Priol, Combining data sharing with the master–worker paradigm in the common component architecture, Cluster Computing 10 (2007) 265–276.
[9] G. Antoniu, J.F. Deverge, S. Monnet, How to bring together fault tolerance and data consistency to enable Grid data sharing, Concurrency and Computation: Practice and Experience 18 (2006) 1705–1723.
[10] R.-S. Chang, P.-H. Chen, Complete and fragmented replica selection and retrieval in data Grids, Future Generation Computer Systems 23 (2007) 536–546.
[11] Y. Machida, S.-i. Takizawa, H. Nakada, S. Matsuoka, Intelligent data staging with overlapped execution of Grid applications, Future Generation Computer Systems 24 (2008) 425–433.
[12] R. Moore, S.-Y. Chen, W. Schroeder, A. Rajasekar, M. Wan, A. Jagatheesan, Production storage resource broker data Grids, in: Proceedings of the Second IEEE International Conference on e-Science and Grid Computing, IEEE Computer Society, 2006.
[13] A. Shoshani, A. Sim, J. Gu, Storage resource managers: Essential components for the Grid, in: Grid Resource Management: State of the Art and Future Trends, Kluwer Academic Publishers, 2004, pp. 321–340.
[14] A. Sim, J. Gu, A. Shoshani, V. Natarajan, DataMover: Robust terabyte-scale multi-file replication over wide-area networks, in: Proceedings of the 16th International Conference on Scientific and Statistical Database Management, SSDBM'04, IEEE, 2004, pp. 403–412.
[15] H. Tang, A. Gulbeden, J. Zhou, W. Strathearn, T. Yang, L. Chu, A self-organizing storage cluster for parallel data-intensive applications, in: Proceedings of the 2004 ACM/IEEE Conference on Supercomputing, IEEE Computer Society, 2004.
[16] D. Thain, J. Basney, S.-C. Son, M. Livny, The Kangaroo approach to data movement on the Grid, in: Proceedings of the 10th IEEE International Symposium on High Performance Distributed Computing, IEEE Computer Society, 2001.
[17] M. Wan, A. Rajasekar, R. Moore, P. Andrews, A simple mass storage system for the SRB data Grid, in: Proceedings of the 20th IEEE/11th NASA Goddard Conference on Mass Storage Systems and Technologies, MSS'03, IEEE Computer Society, 2003.
[18] J. Zhang, P. Honeyman, NFSv4 replication for Grid storage middleware, in: Proceedings of the 4th International Workshop on Middleware for Grid Computing, ACM, Melbourne, Australia, 2006.
[19] J. Zhang, P. Honeyman, A replicated file system for Grid computing, Concurrency and Computation: Practice and Experience 20 (2008) 1113–1130.
[20] M.S. Allen, R. Wolski, The Livny and Plank–Beck problems: Studies in data movement on the computational Grid, in: Proceedings of the 2003 ACM/IEEE Conference on Supercomputing, IEEE Computer Society, 2003.
[21] A. Chazapis, A. Zissimos, N. Koziris, A peer-to-peer replica management service for high-throughput Grids, in: Proceedings of the 2005 International Conference on Parallel Processing, IEEE Computer Society, 2005.
[22] A. Chervenak, E. Deelman, I. Foster, L. Guy, W. Hoschek, A. Iamnitchi, C. Kesselman, P. Kunszt, M. Ripeanu, B. Schwartzkopf, H. Stockinger, K. Stockinger, B. Tierney, Giggle: A framework for constructing scalable replica location services, in: Proceedings of the 2002 ACM/IEEE Conference on Supercomputing, IEEE Computer Society Press, Baltimore, MD, 2002.
[23] S. Tikar, S. Vadhiyar, Efficient reuse of replicated parallel data segments in computational Grids, Future Generation Computer Systems 24 (2008) 644–657.
[24] G. Belalem, Y. Slimani, Consistency management for data Grid in OptorSim simulator, in: Proceedings of the 2007 International Conference on Multimedia and Ubiquitous Engineering, IEEE Computer Society, 2007.
[25] A. Domenici, F. Donno, G. Pucciani, H. Stockinger, K. Stockinger, Replica consistency in a data Grid, Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment 534 (2004) 24–28.
[26] D. Düllmann, B. Segal, Models for replica synchronisation and consistency in a data Grid, in: Proceedings of the 10th IEEE International Symposium on High Performance Distributed Computing, IEEE Computer Society, 2001.
[27] C.-T. Yang, W.-C. Tsai, T.-T. Chen, C.-H. Hsu, A one-way file replica consistency model in data Grids, in: Proceedings of the 2nd IEEE Asia-Pacific Service Computing Conference, IEEE Computer Society, 2007.
[28] J.-B. Chang, C.-K. Shieh, T.-Y. Liang, A transparent distributed shared memory for clustered symmetric multiprocessors, The Journal of Supercomputing 37 (2006) 145–160.
[29] P.-C. Chen, J.-B. Chang, T.-Y. Liang, C.-K. Shieh, Y.-C. Zhuang, A multi-layer resource reconfiguration framework for Grid computing, in: Proceedings of the 4th International Workshop on Middleware for Grid Computing, ACM, Melbourne, Australia, 2006.
[30] T.-Y. Liang, C.-Y. Wu, J.-B. Chang, C.-K. Shieh, Teamster-G: A Grid-enabled software DSM system, in: Proceedings of the Fifth IEEE International Symposium on Cluster Computing and the Grid, CCGrid'05, vol. 2, IEEE Computer Society, 2005.
[31] T.-Y. Liang, C.-Y. Wu, C.-K. Shieh, J.-B. Chang, A Grid-enabled software distributed shared memory system on a wide area network, Future Generation Computer Systems 23 (2007) 547–557.
[32] V. Vlassov, D. Li, K. Popov, S. Haridi, A scalable autonomous replica management framework for Grids, in: Proceedings of the IEEE John Vincent Atanasoff 2006 International Symposium on Modern Computing, IEEE Computer Society, 2006.
[33] J. Bester, I. Foster, C. Kesselman, J. Tedesco, S. Tuecke, GASS: A data movement and access service for wide area computing systems, in: Proceedings of the Sixth Workshop on I/O in Parallel and Distributed Systems, ACM, Atlanta, GA, United States, 1999.
[34] Filesystem in Userspace, http://fuse.sourceforge.net/.
[35] NIST Net, http://snad.ncsl.nist.gov/nistnet/.
[36] FFTW, http://www.fftw.org/.
[37] 2D Wavelets, http://eeweb.poly.edu/~onur/source.html.

Po-Cheng Chen received his M.S. degree from National Cheng Kung University in 2006. He is currently a Ph.D. candidate in the Institute of Computer and Communication Engineering, Department of Electrical Engineering at National Cheng Kung University. His research interests include grid computing and virtualization techniques.


Jyh-Biau Chang is currently an assistant professor in the Department of Digital Applications at Leader University in Tainan, Taiwan. He received his B.S., M.S. and Ph.D. degrees from National Cheng Kung University in 1994, 1996, and 2005, respectively. His research interests include grid computing and distributed systems.

Chia-Han Lin received his Master's degree from the Institute of Computer and Communication Engineering, Department of Electrical Engineering at National Cheng Kung University in July 2007. His studies focused on cluster and grid computing. He is currently a software engineer at a computer network company.

Ce-Kuen Shieh is currently a professor in the Electrical Engineering Department of National Cheng Kung University in Tainan, Taiwan. He is also the chief of the computation center at National Cheng Kung University. He received his Ph.D. degree from the Electrical Engineering Department of National Cheng Kung University in 1988, and was the chairman of that department from 2002 to 2005. His research interests focus on computer networks and parallel and distributed systems.

Yi-Chang Zhuang received his B.S., M.S. and Ph.D. degrees in electrical engineering from National Cheng Kung University in 1995, 1997, and 2004. He is currently working as an engineer at the Industrial Technology Research Institute in Taiwan. His research interests include object-based storage, file systems, distributed systems, and grid computing.