Parallel Computing 29 (2003) 1481–1504
www.elsevier.com/locate/parco

A quadtree approach to domain decomposition for spatial interpolation in Grid computing environments

Shaowen Wang a,1, Marc P. Armstrong b,*

a Academic Technologies-Research Services Division of Information Technology Services and Department of Geography, 128G South Lindquist, The University of Iowa, Iowa City, IA 52242, USA
b Department of Geography and Program in Applied Mathematical and Computational Sciences, 316 Jessup Hall, The University of Iowa, Iowa City, IA 52242, USA

Received 13 May 2002; accepted 7 April 2003

Abstract

Spatial interpolation is widely used in geographical information systems to create continuous surfaces from discrete data points. The creation of such surfaces, however, can involve considerable computation, especially when large problems are addressed, because of the need to search for neighbors on which to base interpolation calculations. Computational Grids provide the computing resources to tackle spatial interpolation in a timely way. The objective of this paper is to investigate the use of domain decomposition for a distributed inverse-distance-weighted spatial interpolation algorithm; the algorithm runs using the Globus Toolkit (GT) in a heterogeneous Grid computing environment. The interpolation algorithm is modified for implementation in the Grid by using a quadtree to spatially index and adaptively decompose the interpolation problem to balance processing loads. In addition, the GT allows the distributed algorithm to couple multiple machines, potentially of different architectures, to dynamically schedule the decomposed sub-problems through Globus services and protocols (e.g., resource management, data transfer). Experiments are conducted to test how well this distributed IDW interpolation algorithm scales to heterogeneous Grid computing environments using irregularly distributed geographical data sets.
© 2003 Elsevier B.V. All rights reserved.

Keywords: Spatial interpolation; Domain decomposition; Quadtree; Computational Grid; Globus toolkit

* Corresponding author. Tel.: +1-319-335-0153; fax: +1-319-335-2725. E-mail addresses: [email protected] (S. Wang), [email protected] (M.P. Armstrong).
1 Tel.: +1-319-335-6094; fax: +1-319-335-5505.

0167-8191/$ - see front matter © 2003 Elsevier B.V. All rights reserved.
doi:10.1016/j.parco.2003.04.003


1. Introduction

Geographic information sources continue to increase in diversity as well as size. Diversity has increased because the importance of geographic information is being recognized in numerous decision-making contexts through the widespread deployment of inexpensive, location-aware data collection devices that monitor dynamic environmental, social and economic processes. For example, global positioning system (GPS) receivers continuously monitor the location of fleets of vehicles; embedded sensors monitor traffic flows and measure pollutants; and point-of-sale (POS) data fill data warehouses.

Database sizes have increased hand-in-glove with increasing diversity. Each new generation of spatial data sensors is designed to sample at higher spatial and temporal rates, and is often tuned to collect a richer set of attributes than its predecessors. Remote sensing technologies are but one bellwether of such trends, with (civilian) spatial resolution increasing from approximately 80-m during the 1970s to the 0.5-m resolution of currently available commercial systems [8]. The original 80-m systems were designed to collect information in only four spectral bands, while today's hyperspectral systems sense simultaneously in more than 1000 bands. Though these increases in resolution enhance the ability of sensors to discriminate among objects in the environment, storage requirements have increased commensurately [21].

This inexorable movement toward massive increases in data volume has taken place simultaneously with the development of new spatial analysis methods that require intensive computation to calculate results. Geographical information systems (GIS) software, for example, is increasingly linked to optimization models and statistical methods that are based explicitly on the computation of large numbers of alternative solutions to problems. Because many spatial optimization problems are NP-hard [7], they are often addressed through the use of computationally intensive heuristic search processes [24] and may require additional computation to satisfy numerous, often conflicting, criteria [31]. The computational requirements of statistical methods such as Markov-Chain Monte-Carlo (MCMC) also can be extensive when large, realistic problems are addressed [5,20]. This marriage of problem size and computational complexity has yielded GIS applications with enormous appetites for computing power [1].

The purpose of this paper is to examine the performance of a relatively straightforward, but computationally intensive, method of spatial data analysis that has been examined in previous papers: inverse-distance-weighted (IDW) interpolation (see, for example, [2,3]). In this investigation, we compute results for a set of computational experiments using the distributed computing resources that have become available as a consequence of the emerging Grid computing infrastructure. Though the Grid presents an opportunity to examine alternative strategies for decomposing geographic problems, it also presents problems related to latency and load-balancing that must be addressed in the design of parallel GIS algorithms. In the following sections of this paper, we place IDW interpolation within the broad framework of GIS tools and briefly describe the Grid as it pertains to this particular problem, with a specific emphasis placed on our problem decomposition and load-balancing strategies.


We then present results for a set of computational experiments that are designed to serve as diagnostic tests for possible load-balancing problems, and conclude with some general remarks about the suitability of Grid computing for geographic information processing.

2. Spatial interpolation

Data in GIS are often collected in point format (e.g., a temperature measurement is made at a location) and are represented by an (x, y, z) triplet, with x and y representing coordinates and z representing a measurement at that location. From these discrete observations it is a common practice to create a quasi-continuous statistical surface. Specifically, the data measured at each coordinate location are used to estimate a derived value for each cell of a regular (e.g., square) grid. The size of the grid is either fixed by instrumentation specifications (remote sensor resolution) or set to correspond with other covariates of interest in a GIS application. Interpolation is a commonly performed GIS operation, is a key element in the toolboxes provided by commercial GIS software environments, and is also described in most GIS textbooks [4].

Researchers can choose from a variety of algorithms to calculate an interpolated surface [22,29,30]. In general, each approach either uses the entire data set to make an estimate for a grid cell location (global methods) or restricts, after searching, calculations to include only those observations that fall in some neighborhood of the location for which the grid cell estimate is being made (local methods). As it is normally implemented, IDW is one example of a local interpolation method. Local approaches to interpolation are based on an underlying assumption of positive spatial autocorrelation. The crux of local interpolators is to reduce the amount of computation required to find near (however defined) neighbors on which to base calculations. An algorithm described by Clarke [6] is designed specifically to reduce the amount of searching required to compute interpolated z-values. The z-value of any desired point p is computed using the equation

Z_p = \frac{\sum_{i=1}^{k} z_i / d_i^b}{\sum_{i=1}^{k} 1 / d_i^b},

where Z_p is an interpolated value at point p; z_i is an observed value at control point i in the neighborhood of p; k is the number of points in the neighborhood of p that are used in the interpolation (often k < 10); d_i is the Euclidean distance from observed control point i to p; and b is the distance weighting factor (often b = 2).

The Clarke algorithm works by pre-sorting data into a coarse grid during the data input stage. Then, when a candidate point for a neighborhood calculation is required, search is restricted, in the first instance, to those coarse grid cells that are near in a topological sense. This approach to search restriction can reduce total computation time substantially. In spite of this advantage, researchers may need to interpolate surfaces that have irregular distributions of data points.


Such irregularities can cause large imbalances in the distribution of computation loads (due to searching) that are required to interpolate different parts of the surface.
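To make the calculation concrete, the short C sketch below evaluates the IDW estimate for a single location once its k neighboring control points have been found. It is illustrative only and is not the Clarke implementation used later in the paper; the function name and parameters are ours, and the neighbor search, which is the expensive step discussed above, is assumed to have been performed by the caller.

#include <math.h>

/* Inverse-distance-weighted estimate at location (px, py), given the k control
 * points selected by a neighborhood search.  b is the distance-weighting
 * exponent (often b = 2).  If a control point coincides with the target
 * location, its observed value is returned directly. */
double idw_estimate(double px, double py,
                    const double *cx, const double *cy, const double *cz,
                    int k, double b)
{
    double num = 0.0, den = 0.0;
    for (int i = 0; i < k; i++) {
        double d = hypot(cx[i] - px, cy[i] - py);  /* Euclidean distance d_i */
        if (d == 0.0)
            return cz[i];                 /* exact hit on a control point */
        double w = 1.0 / pow(d, b);       /* weight 1 / d_i^b */
        num += w * cz[i];                 /* accumulate z_i / d_i^b */
        den += w;                         /* accumulate 1 / d_i^b */
    }
    return num / den;                     /* Z_p */
}

In the Clarke algorithm the cost is dominated by finding those k neighbors rather than by the weighted average itself, which is why the search restriction, and later the domain decomposition, target that step.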

3. Experimental datasets

Spatial samples that define the location of control points can be specified in many ways. Though Robinson et al. [25] suggest that the best interpolation results are usually obtained by a scatter of control points that is more evenly spaced than random, in some cases the distribution is specified exogenously (e.g., the fixed location of weather stations), while in other cases the sample size and its distribution are limited by technical, budgetary, accessibility, and other constraints. For example, hydrographic surveys often collect samples in strips. In this research, four experimental datasets were created to evaluate the effects of variable control point distributions. It must be emphasized that the focus of this paper is not on evaluating the accuracy of interpolation results [32], but on the effects of control point distributions on parallel algorithm performance [9].

The four output grids are fixed at 2000 rows by 2000 columns (4 × 10^6 grid cells), and the sample used to interpolate values for each grid consists of 2000 control points. Each control data point is represented in (x, y, z) format, with z-values generated using a linear function z = f(x, y). These z-values are used only to debug and evaluate load-balancing strategies, since their magnitude has little impact on the performance of the interpolation algorithm. The first dataset (see Fig. 1(a)) has a uniform, random x-y distribution of sample data points located across the entire grid. The remaining three experimental datasets represent a set of "worst case" distributions that are intended to induce performance difficulties. Though the interpolation algorithm is applied to these datasets, what is being attempted is, strictly speaking, not interpolation, which is restricted to the convex hull of the set of data points. Extrapolative applications of interpolation algorithms are widely encountered in GIS applications, however. Consequently, in the second dataset (Fig. 1(b)), all sample data points fall into a single cluster with a specified radius (in this case equal to 0.5 grid size) in which the points are also distributed using a uniform random distribution. In the third dataset (Fig. 1(c)), the sample data points are divided into two clusters. The fourth dataset has multiple (six) clusters (Fig. 1(d)). Please note that since the number of control points remains constant (n = 2000), the density of the distributions changes in each case.
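For readers who want to reproduce the general shape of these datasets, the sketch below generates either a uniform random scatter or a clustered scatter of control points. It is an illustration under stated assumptions rather than the authors' data-generation code: the linear coefficients of z = f(x, y), the cluster centers, and the random number source are not specified in the paper and are placeholders here.

#include <stdlib.h>

typedef struct { double x, y, z; } ControlPoint;

/* Linear test surface; the paper states only that z = f(x, y) is linear, so
 * the coefficients here are arbitrary placeholders. */
static double f(double x, double y) { return 0.5 * x + 0.25 * y; }

static double unit_rand(void) { return (double)rand() / (double)RAND_MAX; }

/* Fill pts[0..n-1] with control points on a [0, grid_size) x [0, grid_size)
 * domain.  n_clusters == 0 gives a uniform random scatter (Fig. 1(a));
 * otherwise each point is placed uniformly inside a circle of the given
 * radius around one of the cluster centers (cx[], cy[]). */
void make_dataset(ControlPoint *pts, int n, double grid_size,
                  int n_clusters, const double *cx, const double *cy,
                  double radius)
{
    for (int i = 0; i < n; i++) {
        double x, y;
        if (n_clusters == 0) {
            x = unit_rand() * grid_size;
            y = unit_rand() * grid_size;
        } else {
            int c = rand() % n_clusters;
            do {  /* rejection sampling: uniform inside the cluster circle */
                x = cx[c] + (2.0 * unit_rand() - 1.0) * radius;
                y = cy[c] + (2.0 * unit_rand() - 1.0) * radius;
            } while ((x - cx[c]) * (x - cx[c]) + (y - cy[c]) * (y - cy[c])
                     > radius * radius);
        }
        pts[i].x = x;
        pts[i].y = y;
        pts[i].z = f(x, y);
    }
}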

4. Grid computing resources

The computational experiments reported in the paper were conducted using resources made available through the National Computational Science Alliance (NCSA) as part of their emerging Grid computing initiative. The Grid is based on a logic that enables researchers to overcome local computational constraints through the use of flexibly defined assemblages of computing resources (individual machines as well as clusters) connected by high performance networks to form a distributed parallel processing environment.


Fig. 1. Datasets of two thousand points: (a) a uniform random distribution; (b) one cluster with a uniform random distribution; (c) two clusters, each of which has a uniform random distribution; (d) multiple clusters, each with a uniform random distribution.

The aim is to make the formation of such assemblages nearly transparent to the end user in much the same way (metaphorically) that consumers are usually unaware of the source of electricity used to power their microwave ovens [13,14].

4.1. Grid-in-a-box (GiB) testbed

The Grid computing environment used in this study is the NCSA Grid-in-a-Box (GiB) testbed (http://www.ncsa.uiuc.edu/TechFocus/Deployment/GiB/), on which a software distribution for building Grid-enabled Linux clusters has been configured for the NCSA Grid infrastructure.


Table 1
GiB hardware environment

Domain name               PBS CPUs   Cluster CPU speed         Organization
gib01.bu.edu              4          866 MHz (Pentium III)     Boston University
gibhead.ccs.uky.edu       8          1000 MHz (Pentium III)    University of Kentucky
conshead.ncsa.uiuc.edu    4          300 MHz (Pentium II)      NCSA
cornhead.ncsa.uiuc.edu    8          300 MHz (Pentium II)      NCSA
gib.ovl.osc.edu           4          498 MHz (Pentium III)     Ohio State University
biron.cs.wisc.edu         1          996 MHz (Pentium III)     University of Wisconsin

The GiB testbed is comprised of multiple sites of NCSA partners for advanced computational services (PACS) that have the Linux (RedHat 7.1) operating system installed. Table 1 lists the GiB sites and computational resources used in this research. Our local Linux node was connected with the GiB testbed using the Globus Toolkit 2.0 (http://www.globus.org/gt2/).

4.2. Globus

The Globus Toolkit (GT) used in this paper is a product of the Globus project (http://www.globus.org) [16,17], and provides the GiB testbed with basic Grid services and protocols that include resource managers, security protocols, information services, communication services, fault tolerance services, and remote data access facilities. These services and protocols are designed to support a wide variety of applications and programming models, and are organized in a layered Grid architecture (Fig. 2). Unlike approaches that impose a uniform programming model, such as the object-oriented model defined by the Legion system [17], GT provides a "bag of services" from which developers of specific tools or applications can select to meet their needs [15]. This "bag of services" model does not impose any single programming paradigm. In this research, we focus on resource allocation using GT resource managers, although we also used GT security services and remote data access facilities.

5. Parallelization

Partial parallelization strategies [9] are applied in our parallel implementation of the Clarke algorithm [6]. The basic elements of the Clarke algorithm are illustrated in Fig. 3. Step 3 in Fig. 3 includes a k-nearest neighbor search for each empty cell and is thus computationally intensive. Consequently, a spatial domain decomposition is introduced to parallelize this step. In particular, a Morton-ordered quadtree is constructed to decompose the spatial domain [23,27] and produce adjustable, scalable interpolation tasks that are sensitive to underlying data distributions.


Fig. 2. Grid architecture (after http://www.gridforum.org and [12]). Layers and representative components:
  Application: Interpolation Grid application
  Collective (Application): Coherency control, replica selection, task management, virtual data catalog, and virtual data code catalog
  Collective (Generic): Replica catalog, replica management, co-allocation, certificate authorities, and metadata catalogs
  Resource: Access to data, access to computers, and access to network performance data
  Connect: Communication, service discovery, authentication, authorization, and delegation
  Fabric: Storage systems, clusters, networks, and network caches

Based on the tasks created by this domain decomposition, a resource specification language (RSL) script (http://www-fp.globus.org/gram/rsl_spec1.html) is generated to schedule tasks and balance computational loads among GiB sites. The generation of the RSL script takes into consideration the computing capacity of each GiB site (the number of CPUs and CPU clock speed are shown in Table 1). However, network performance and the dynamic status of the portable batch system (PBS) [10] queue at GiB sites are not evaluated due to the complexity of acquiring such information and incorporating this dynamic information into the RSL script for task scheduling. Such information could be incorporated into future versions of our load-balancing strategy, however.

5.1. Quadtrees

In the fields of image processing, computer graphics, and remote sensing, two dimensional point and region data are often indexed using quadtrees [26]. A quadtree is a representation of a regular partitioning of space where regions are split recursively until there is a constant amount of information contained in them.


Fig. 3. Steps of the Clarke interpolation algorithm. Characters in the top Grid indicate the sequence in which the algorithm searches outward (alphabetical order) from the center cell. Cells on the edge numbered by 1 are searched before cells on the edge numbered by 2 on each highlighted dashed square; the two regions that are pointed to by arrows out of the step 3 box can be parallelized in the Grid environment using the domain decomposition approach presented in this paper.

Each quadtree block (also referred to as a cell, or node) covers a portion of space that forms a hypercube in d dimensions, usually with a side length that is a power of 2 [18]. Many different varieties of quadtrees have been defined, differing in the rules that govern node splitting, the type of data being indexed, and other details [28]. Quadtree algorithms have also been implemented in parallel processing environments [19], though not in the specific analytical environment represented in this paper.

A basic quadtree in two dimensional space is a 4-way branching tree that represents a recursive binary decomposition of space wherein, at each level, a square sub-space is divided into four equal size squares, labeled the north-east (NE), north-west (NW), south-east (SE), and south-west (SW) quadrants. The quadtree node definition in our spatial domain decomposition algorithm is described as the following C language struct. The meaning of level will be described more fully in Section 5.2.


typedef struct quadnode_struct {
    int length;          /* square side length of the node */
    int level;           /* a parameter to control how deep a node should be divided */
    int control_pt_num;  /* number of control points in this node */
    int startrow;        /* start row number of this node in global grid */
    int startcol;        /* start column number of this node in global grid */
    int endrow;          /* end row number of this node in global grid */
    int endcol;          /* end column number of this node in global grid */
} Quadnode;

As implemented, the quadtree is stored in a singly linked list, as the following C language struct shows:

typedef struct quadnode_list_struct {
    Quadnode quad;
    struct quadnode_list_struct *next;  /* pointer to the next Quadnode object */
} Quadnode_list_node;

The transformation of a two dimensional representation to a single dimension required us to implement a method of linearizing two dimensional space. Morton ordering is a well-known way to perform this task. Though the original source of the Morton ordering is an obscure technical publication [23], the concept is described in a variety of widely-accessible publications [26]. In this case, the Morton order is created to link the leaf nodes of the quadtree when a quadtree splitting operation is performed. Consequently, child nodes are added to the list in the following order: SW, SE, NW, and NE (Fig. 4). Only leaf nodes are stored in the list because they contain all the required information that is needed to support the flexible decomposition of the spatial domain.

5.2. Spatial domain decomposition

Our quadtree-based spatial domain decomposition algorithm is designed for general use, and it can produce scalable geographical workloads that can be allocated to available Grid computational resources. The algorithm addresses the decomposition challenges [11] that are centered on finding efficient data partitions that are assigned to each GiB site. The Quad_Decompose algorithm (Fig. 5) has two input parameters: level and min_length. In Quad_Decompose, these two parameters, together with information about the number of control points at each quadtree node, are used to determine the level and location of recursive division. In the execution of Quad_Decompose, those parts of a region that contain a high density of control points are recursively divided until the level of their quadtree nodes reaches zero from a user-specified value (greater than 0).


Fig. 4. Construction of a quadnode list. When a quadnode is split, its four children are appended to the tail of the list in the order SW, SE, NW, NE, and the parent quadnode is removed from the list once its children have been generated.
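The list manipulation summarized in Fig. 4 can be sketched in C. The routine below is illustrative rather than the authors' implementation: it reuses the Quadnode and Quadnode_list_node types defined above, assumes a hypothetical helper count_points() that counts the control points inside a node's bounds, assumes that rows increase from south to north and columns from west to east, and, for brevity, assumes even side lengths (an odd length would fold the extra row and column into one pair of children).

#include <stdlib.h>

extern int count_points(int startrow, int startcol, int endrow, int endcol);

static Quadnode_list_node *append_child(Quadnode_list_node *tail,
                                        int level, int length,
                                        int startrow, int startcol)
{
    Quadnode_list_node *n = malloc(sizeof *n);
    n->quad.length = length;
    n->quad.level = level;
    n->quad.startrow = startrow;
    n->quad.startcol = startcol;
    n->quad.endrow = startrow + length - 1;
    n->quad.endcol = startcol + length - 1;
    n->quad.control_pt_num = count_points(n->quad.startrow, n->quad.startcol,
                                          n->quad.endrow, n->quad.endcol);
    n->next = NULL;
    tail->next = n;
    return n;
}

/* Split the quadnode pointed to by prev->next into its four children, append
 * them in Morton (SW, SE, NW, NE) order at the list tail, and unlink the
 * parent (Fig. 4).  Returns the new list tail. */
Quadnode_list_node *quad_split(Quadnode_list_node *prev,
                               Quadnode_list_node *node,
                               Quadnode_list_node *tail)
{
    int half = node->quad.length / 2;
    int lvl  = node->quad.level > 0 ? node->quad.level - 1 : 0;  /* assumed:
                                         level counts remaining divisions */
    int r = node->quad.startrow, c = node->quad.startcol;

    tail = append_child(tail, lvl, half, r,        c);         /* SW */
    tail = append_child(tail, lvl, half, r,        c + half);  /* SE */
    tail = append_child(tail, lvl, half, r + half, c);         /* NW */
    tail = append_child(tail, lvl, half, r + half, c + half);  /* NE */

    prev->next = node->next;                /* remove the split parent */
    free(node);
    return tail;
}

Appending children at the tail while removing the split parent keeps the list restricted to leaf nodes, which is all the decomposition needs.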

However, regions with a low density of control points will not stop being divided until the square length of the node is less than the parameter min_length. Regions with an intermediate density of control points will invoke either of these two rules, depending on which is satisfied first. Fig. 6 shows an example of decomposing a spatial domain through the use of these two parameters. By examining the way Clarke's interpolation algorithm searches outward over the whole grid, it was found that regions with a higher density of control points take less interpolation time than ones with a lower density, given the same area [9]. Using the two parameters, level and min_length, the Quad_Decompose algorithm produces balanced workloads of two types: extensive un-interpolated areas with large numbers of control points and compact un-interpolated areas with small (quite often, equal to zero) numbers of control points (Fig. 6). Users can specify these two parameters to determine the granularity of workloads according to the computing capacity of their accessible Grid resources. For example, users can increase min_length from 65 (in Fig. 6) to 125 (in Fig. 7) to generate larger workloads (Fig. 7). These values, which are specific to this 2000 × 2000 problem, represent side lengths measured in the grid cells generated from the Clarke algorithm. These values can easily be converted to real world units, or to ratios based on the size of the entire grid.
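Putting the two stopping rules together, a simplified driver loop might look like the following. It is a reading aid only: the authors' algorithm is the one given in Fig. 5, the density thresholds HIGH_DENSITY and LOW_DENSITY are invented placeholders (the paper does not state how high, low, and intermediate densities are distinguished), and quad_split() is the sketch above.

#include <stddef.h>

#define HIGH_DENSITY 20   /* assumed: control points per node for "high" */
#define LOW_DENSITY   2   /* assumed: control points per node for "low"  */

static int should_split(const Quadnode *q, int min_length)
{
    if (q->control_pt_num >= HIGH_DENSITY)       /* dense: divide until     */
        return q->level > 0;                     /* level reaches zero      */
    if (q->control_pt_num <= LOW_DENSITY)        /* sparse: divide until    */
        return q->length >= min_length;          /* side < min_length       */
    return q->level > 0 && q->length >= min_length;  /* intermediate: stop  */
                                                 /* at whichever rule first */
}

/* Walk the Morton-ordered list, splitting leaves until every remaining leaf
 * satisfies its stopping rule; head is a dummy node preceding the real list. */
void quad_decompose(Quadnode_list_node *head, int min_length)
{
    Quadnode_list_node *tail = head;
    while (tail->next != NULL)                   /* find the current tail */
        tail = tail->next;

    Quadnode_list_node *prev = head;
    while (prev->next != NULL) {
        Quadnode_list_node *node = prev->next;
        if (should_split(&node->quad, min_length))
            tail = quad_split(prev, node, tail); /* children appended at tail */
        else
            prev = node;                         /* keep this leaf, move on   */
    }
}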


Fig. 5. An algorithm for quadtree based domain decomposition.

5.3. Task scheduling

Dynamic task scheduling is difficult to accomplish because of the unpredictable nature of network traffic and job queues at the Grid sites. Consequently, a static strategy has been adopted to schedule tasks according to the variability of the computing capacity at each GiB site and the number of workloads used to partition the interpolation problem. The workload differences are characterized by the level, length, and control_pt_num associated with each quadtree leaf node.


Fig. 6. A quadtree to decompose the spatial domain of the one-cluster data (Fig. 1(b)); level values of quadnodes are shown when they are equal to zero; two parameters of Quad_Decompose: level = 3, min_length = 64.

In this research, level and min_length have been specified so that workload differences are small, due to the similar computing capability of each GiB site (Table 1). The same logic could be applied to create variable task sizes that could be apportioned to sites with markedly different computational characteristics. After invoking the Quad_Decompose algorithm, nodes with a level value greater than zero have no control points located within them, and it is thus very likely that these locations are computationally intensive for the interpolation problem considered in this paper. Based on this observation, and knowledge about the characteristics of the GiB hardware environment (Table 1), an algorithm called Generate_RSL (Figs. 8 and 9) was developed to automatically generate RSL scripts for task scheduling. In this example, the clock speed and the number of PBS CPUs are used as a coarse approximation of computing capacity. We organized GiB sites into three categories: sites with a large capacity (TOTAL_WEIGHTED_GIB_PBS_FAST_CPU in Fig. 9), which include gibhead.ccs.uky.edu, gib01.bu.edu, gib.ovl.osc.edu, and biron.cs.wisc.edu; sites with a small capacity (TOTAL_WEIGHTED_GIB_PBS_SLOW_CPU in Fig. 9), which include cornhead.ncsa.uiuc.edu and conshead.ncsa.uiuc.edu; and all sites (TOTAL_GIB_PBS_CPU in Fig. 9).


Fig. 7. A quadtree to decompose the spatial domain of the one-cluster data (Fig. 1(b)); level values of quadnodes are shown when they are equal to zero; two parameters of Quad_Decompose: level = 3, min_length = 125.

This approach is sufficiently flexible to accommodate a broad range of analyses that could consider, if required, other architectural characteristics of the available resources. In the Generate_RSL algorithm, jobs are mainly distinguished by the level values of their associated quadtree nodes. Jobs with a level value equal to 0 are generated from regions that either have a high density of control points, or are very close to such regions. Jobs generated from regions with a high density of control points are scheduled (addressed by 2(c.2.1) in Fig. 8) to GiB sites with a small computing capacity. Jobs generated from regions that are close to clusters with a high density of control points are scheduled (addressed by 2(c.2.2) in Fig. 8) to all GiB sites. However, jobs with a level value greater than 0 are generated from regions that have a low density of control points, and are thus more computationally intensive than jobs with a level value equal to 0. These jobs are allocated to the most capable GiB sites. For the uniform random dataset (no cluster) in Fig. 1(a), jobs are scheduled to all GiB sites.


Fig. 8. A static task scheduling algorithm through RSL.

Job scheduling to a specific category of GiB sites (e.g., fast, slow, or all GiB sites) is implemented using a uniform random distribution. Each job is assigned a random number drawn from this distribution, and the remainder of that number modulo the total computing capacity of the category is calculated. For example, the total computing capacity of all GiB sites is quantified as 60 (in Fig. 9). If a job were assigned the random number 3193350153, then 3193350153 mod 60 = 33, and the job would be assigned to gibhead.ccs.uky.edu because gibhead covers the range [12, 38] (in Fig. 9). This strategy can be used to assign jobs to any category of GiB sites while balancing loads, based on the proof shown in Fig. 10.
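The assignment rule can be made concrete with a short sketch; it is not the authors' Generate_RSL code. The site names and slot ranges are those of the all-sites category in Fig. 9, while the 1-60 slot convention, the handling of a zero remainder, and the function names are assumptions.

#include <stdio.h>
#include <stdlib.h>

typedef struct {
    const char *domain;   /* GiB site */
    int lo, hi;           /* inclusive range of capacity "slots" (Fig. 9) */
} SiteRange;

static const SiteRange all_sites[] = {
    { "gib01.bu.edu",            1, 11 },
    { "gibhead.ccs.uky.edu",    12, 38 },
    { "conshead.ncsa.uiuc.edu", 39, 42 },
    { "cornhead.ncsa.uiuc.edu", 43, 50 },
    { "biron.cs.wisc.edu",      51, 53 },
    { "gib.ovl.osc.edu",        54, 60 },
};
#define TOTAL_GIB_PBS_CPU 60

/* Map a job's random number onto a site: take the remainder modulo the total
 * capacity and return the site whose slot range contains it. */
static const char *assign_site(unsigned long random_number)
{
    int slot = (int)(random_number % TOTAL_GIB_PBS_CPU);
    if (slot == 0) slot = TOTAL_GIB_PBS_CPU;   /* assumed: slots run 1..60 */
    for (int i = 0; i < (int)(sizeof all_sites / sizeof all_sites[0]); i++)
        if (slot >= all_sites[i].lo && slot <= all_sites[i].hi)
            return all_sites[i].domain;
    return NULL;   /* unreachable if the ranges cover 1..total */
}

int main(void)
{
    /* The worked example from the text: 3193350153 mod 60 = 33 -> gibhead. */
    printf("%s\n", assign_site(3193350153UL));
    return 0;
}

Because each site owns a number of slots proportional to its weighted CPU capacity, uniformly distributed random numbers send jobs to sites in proportion to capacity, which is the basis of the load-balancing argument in Fig. 10.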

Fig. 9. Computing capacity of GiB sites. The weighted capacities and the slot ranges assigned to each site are:

Fast sites: gib01 [1, 7], gibhead [8, 23], biron [24, 25], gib [26, 29];
TOTAL_WEIGHTED_GIB_PBS_FAST_CPU = (4 * 866 + 8 * 1000 + 1 * 996 + 4 * 498) / 498 = 29

Slow sites: conshead [1, 4], cornhead [5, 12];
TOTAL_WEIGHTED_GIB_PBS_SLOW_CPU = (4 * 300 + 8 * 300) / 300 = 12

All sites: gib01 [1, 11], gibhead [12, 38], conshead [39, 42], cornhead [43, 50], biron [51, 53], gib [54, 60];
TOTAL_GIB_PBS_CPU = (4 * 866 + 8 * 1000 + 4 * 300 + 8 * 300 + 1 * 996 + 4 * 498) / 300 = 60

Fig. 10. Load balance proof.
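A sketch of the expected-load argument that Fig. 10 summarizes, under the assumption that jobs are assigned by the uniform random slot scheme described above: if a category has total weighted capacity C = \sum_i c_i and each of N jobs independently receives a slot drawn uniformly from {1, ..., C}, the probability that a given job lands on site i is c_i / C, so the expected number of jobs assigned to site i is

E[N_i] = N c_i / C,

which is proportional to the site's capacity. Provided the decomposition keeps individual job costs broadly comparable, as the level and min_length settings are intended to do, the expected completion time of each site's share of the work is then approximately equal across sites.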

The Globus RSL provides a common interchange language that is used to describe computational resources (see the RSL specification at http://www-fp.globus.org/gram/rsl_spec1.html). The various components of the Globus resource management architecture (Fig. 11) manipulate RSL strings to perform management functions in cooperation with the other system components. The Globus resource allocation manager (GRAM) provides a standardized interface to the local resource management tools that each site might have in place.


Fig. 11. Resource management architecture in this study: the IDW application sits on top of a broker (Quad_Decompose and Generate_RSL), which submits requests through DUROC to GRAM, which in turn hands jobs to the local PBS scheduler.

In this paper, PBS is used to manage (e.g., queue and run) interpolation sub-jobs submitted using GT tools. The dynamically-updated request online coallocator (DUROC) is used to coordinate a single request that may span multiple GRAMs. The Generate_RSL algorithm generates a DUROC RSL script to coordinate the use of multiple GiB site resources. One example of an RSL script is shown in Fig. 12. The GT command line tool globusrun is used to submit jobs to Globus-managed resources; globusrun takes an RSL script as input and conducts the specified computing tasks.

6. Performance evaluation

The domain decomposition and task scheduling algorithms described in the previous sections are evaluated using the four datasets shown in Fig. 1. For each dataset, the two parameters of the Quad_Decompose algorithm (level and length) were adjusted to determine the parameter configurations that yield the best performance (i.e., shortest computing time). In addition, we calculate speedup S, which is defined as

S = t_1 / t_n,    (1)

where t_1 is the run time realized on a single processor selected from the GiB node gibhead.ccs.uky.edu (the selected processor is the fastest one in the current GiB hardware environment), and t_n is the computing time when multiple processors (either at one or more GiB sites) are used to conduct computations. To calculate speedup, we used the fastest available GiB node (gibhead.ccs.uky.edu) to run the sequential Clarke interpolation algorithm; results are shown in Table 2. For the dataset shown in Fig. 1(a), experiments were conducted when no other users were accessing PBS-managed GiB resources. For the clustered datasets shown in Fig. 1(b)–(d), we did not impose this restriction, even though GiB sites were not always scheduled with PBS jobs from other users during the time when we conducted our experiments.

6.1. Uniformly distributed data

For the uniformly distributed data (Fig. 1(a)), only the parameter level was adjusted, because length does not affect the process of decomposing the spatial domain.


Fig. 12. An automatically generated RSL script.

Table 3 lists the computing time used to interpolate results for the uniformly distributed data using the Quad_Decompose algorithm. Different level values were chosen to observe which value achieves the best performance. All GiB resources are used, except that when level is equal to 1 only two sites are used, because only four jobs are generated by the Quad_Decompose algorithm.


Table 2
Sequential computing time using one processor of gibhead.ccs.uky.edu

Data sets                   No cluster (Fig. 1(a))   One cluster (Fig. 1(b))   Two clusters (Fig. 1(c))   Six clusters (Fig. 1(d))
Computing time (hh:mm:ss)   00:35:20                 30:48:29                  30:09:29                   06:12:01

Table 3
Interpolation time for Fig. 1(a) data in GiB environment

                            Number of PBS jobs submitted to each GiB site
GiB sites                   level = 1   level = 2   level = 3   level = 4   level = 5
gib01.bu.edu                1           4           12          45          185
gibhead.ccs.uky.edu         3           7           23          129         457
conshead.ncsa.uiuc.edu      0           1           3           11          76
cornhead.ncsa.uiuc.edu      0           2           16          28          135
gib.ovl.osc.edu             0           1           5           30          121
biron.cs.wisc.edu           0           1           5           13          50
Computing time (hh:mm:ss)   00:11:02    00:07:57    00:08:23    00:08:42    00:24:43

It was found that when level is equal to 2 the best performance is achieved, because of the characteristics of the Grid computing environment and the problem being addressed. In particular, there is an overhead penalty imposed on each job that is generated by the Quad_Decompose algorithm. This penalty can be expressed as

P = T_i + T_j,    (2)

where T_i is the job initiation time in steps 1, 2 and 4 in Fig. 3 (T_i varies from 0.4 s for gibhead.ccs.uky.edu to 1.5 s for cornhead.ncsa.uiuc.edu), and T_j is the sum of the job creation time (through DUROC) and scheduling time (queue operations through the local PBS job manager at each GiB site). T_j ranges widely, from a few seconds up to more than a minute, depending on network speed, the PBS queue status (whether a significant number of jobs is competing to enter a queue), and job submission failure from DUROC during the execution of globusrun (this is a rare event, but resubmission is implemented to tolerate such a fault). When level is assigned a larger value, a greater number of jobs is generated by the Quad_Decompose algorithm. Therefore, a larger time penalty is also incurred from Eq. (2), and in this case the penalty is not offset by increased performance due to parallel execution.

Since the best performance is achieved using all GiB PBS-managed resources when level is equal to 2, we fixed its value at that level and examined how different GiB sites contribute to the speedups shown in Table 3. We ranked the GiB sites according to their computing capacity and increased the number of sites used to compute results based on this rank. This strategy emulates a situation in which users would choose the best available resources first. Table 4 shows the decrease in runtime as additional sites are used. Fig. 13 shows the speedups (Eq. (1)) that were calculated from the computing time data shown in Table 4. Generally, as sites are added, greater speedups are achieved.


Table 4
Interpolation time for Fig. 1(a) data, level = 2

                            Number of PBS jobs submitted to each GiB site
GiB sites                   1 site     2 sites    3 sites    4 sites    5 sites    6 sites
gibhead.ccs.uky.edu         16         11         10         10         7          7
gib01.bu.edu                0          5          4          3          5          4
gib.ovl.osc.edu             0          0          2          1          1          2
biron.cs.wisc.edu           0          0          0          2          2          1
cornhead.ncsa.uiuc.edu      0          0          0          0          1          1
conshead.ncsa.uiuc.edu      0          0          0          0          0          1
Computing time (hh:mm:ss)   00:09:17   00:08:16   00:08:09   00:08:07   00:08:03   00:07:57

Fig. 13. Speedup with respect to used GiB sites for the dataset in Fig. 1(a); the chart plots speedup (vertical axis, roughly 3.4 to 4.6) against the number of GiB sites used (1 to 6).

The largest increase occurs between one site and two sites because the second site has a much larger computing capacity than the sites that are added later. The overall scalability of the algorithm, however, is modest in this particular experiment.

6.2. Clustered datasets

For each clustered dataset, we selected four configurations of level and length to determine the values that allow the Quad_Decompose algorithm to achieve the best performance. Level values of 3 and 4 were tested because when level is less than or equal to 2, individual jobs are still large and, because their number is fewer than 16, jobs cannot be distributed evenly among GiB sites using the Generate_RSL algorithm; when level is greater than 4, regions with a high density of control points are over-divided, so that the penalty from formula (2) is applied to a large number of small jobs decomposed from such regions. Length values of 63 and 125 were tested because if the length of a square side is greater than 125, some jobs are so computationally intensive (e.g., more than an hour on the fastest GiB node) that scheduling them could easily cause load imbalances.


On the other hand, if the length of a square side is less than 63, a large number of small jobs is generated, which increases the overhead penalty dramatically. By observing the computing time when different configurations of level and length values were used, we found that for all three clustered datasets the configuration of level = 3 and length = 63 yielded the best speedups (Tables 5–7). This observation is supported by the design of the Quad_Decompose and Generate_RSL algorithms. When level is equal to 3 and length is equal to 63, the spatial domains with both a high and a low density of control points are divided by Quad_Decompose into jobs that require computing time on the order of 1–10 min. These jobs are scheduled to GiB sites by Generate_RSL so that loads are balanced among the sites: a uniform random distribution is used by Generate_RSL to distribute the jobs in such a way that each GiB site is assigned a diversified collection of jobs (i.e., computing time varies from 1 to 10 min), but the sum of their computing time is similar across all GiB sites (see the proof in Fig. 10). Speedups calculated for all three clustered datasets are shown in Fig. 14. The calculation of speedups for the clustered datasets is based on the best performance of each dataset from Tables 5–7.

Table 5
Interpolation time for Fig. 1(b) data in GiB environment

                            Number of PBS jobs submitted to each GiB site
GiB sites                   Level=3, Length=63   Level=3, Length=125   Level=4, Length=63   Level=4, Length=125
gib01.bu.edu                126                  37                    157                  47
gibhead.ccs.uky.edu         299                  65                    365                  94
conshead.ncsa.uiuc.edu      10                   8                     25                   19
cornhead.ncsa.uiuc.edu      13                   15                    56                   62
gib.ovl.osc.edu             38                   14                    58                   15
biron.cs.wisc.edu           58                   21                    87                   19
Computing time (hh:mm:ss)   03:43:08             03:55:17              03:47:34             04:02:52

Table 6
Interpolation time for Fig. 1(c) data in GiB environment

                            Number of PBS jobs submitted to each GiB site
GiB sites                   Level=3, Length=63   Level=3, Length=125   Level=4, Length=63   Level=4, Length=125
gib01.bu.edu                138                  34                    164                  54
gibhead.ccs.uky.edu         318                  87                    426                  114
conshead.ncsa.uiuc.edu      6                    5                     19                   21
cornhead.ncsa.uiuc.edu      13                   14                    35                   33
gib.ovl.osc.edu             51                   10                    77                   20
biron.cs.wisc.edu           78                   22                    75                   14
Computing time (hh:mm:ss)   03:34:47             03:45:01              03:38:12             03:53:04


Table 7
Interpolation time for Fig. 1(d) data in GiB environment

                            Number of PBS jobs submitted to each GiB site
GiB sites                   Level=3, Length=63   Level=3, Length=125   Level=4, Length=63   Level=4, Length=125
gib01.bu.edu                54                   16                    112                  42
gibhead.ccs.uky.edu         110                  35                    284                  76
conshead.ncsa.uiuc.edu      11                   12                    45                   37
cornhead.ncsa.uiuc.edu      27                   26                    63                   71
gib.ovl.osc.edu             22                   7                     44                   16
biron.cs.wisc.edu           20                   4                     68                   4
Computing time (hh:mm:ss)   00:45:02             00:51:42              00:56:14             00:53:33

Fig. 14. Speedups of all four datasets using all GiB resources: approximately 4.444 for the no-cluster dataset, 8.284 for the one-cluster dataset, 8.425 for the two-cluster dataset, and 8.261 for the six-cluster dataset.

Fig. 14 indicates that our approach achieved better performance on the datasets with an irregular distribution of control points (the one-cluster, two-cluster, and six-cluster datasets). This is because the quadtree-based domain decomposition algorithm can divide irregular domains into a balanced set of jobs that are allocated by Generate_RSL to heterogeneous GiB sites. The observed speedups (>8) are significant for the clustered datasets: to achieve the same level of speedup using homogeneous PBS nodes such as those provided by gibhead.ccs.uky.edu, we would need at least nine PBS CPUs, each associated with a 1 GHz Intel processor.

7. Concluding discussion

The general goal of the research reported in this paper was to investigate the performance of a parallel spatial interpolation algorithm that is implemented using Grid computing resources.


A quadtree-based domain decomposition algorithm and a static task scheduling algorithm were developed and evaluated when used to interpolate uniformly distributed and clustered datasets in the Grid computing environment. The GT, and in particular its resource location and allocation modules, was used to couple PBS-managed GiB resources to conduct these experiments. The results show that for a dataset with a uniform random distribution, our domain decomposition and task scheduling algorithms scale well given the available GiB resources. More specifically, speedup increases when additional resources are used. As expected, sites with larger computing capacities contributed more to the observed increases in speedup. For clustered datasets, the quadtree-based domain decomposition and task scheduling algorithms can address the irregularities of the datasets well. Consequently, significant speedups were achieved for the clustered datasets shown in Fig. 1(b)–(d).

Further research is needed to evaluate such speedups in light of possible increases in the heterogeneity of the hardware characteristics of an expanding Grid computing environment. Future research will also investigate how much additional Grid computing resource would be required to achieve user-desired performance (e.g., near-real-time) for a given interpolation problem using the general approach developed in this paper. This type of research can be enabled by the Grid network weather service (NWS) released by the NSF middleware initiative (http://www.nsf-middleware.org/). Fault tolerance was addressed in this application using job resubmission. In the future, we will examine how Condor-G (http://www.cs.wisc.edu/condor/condorg/) can improve fault-tolerant behavior.

Acknowledgements

We would like to thank Jim Basney at the NCSA for his technical support on the use of the GiB testbed. We are also grateful to Jun Ni and Boyd Knosp in the Academic Technologies Research Services Division of Information Technology Services at the University of Iowa for insightful discussions about this research.

References

[1] M.P. Armstrong, Geography and computational science, Annals of the Association of American Geographers 90 (1) (2000) 146–156.
[2] M.P. Armstrong, R.J. Marciano, Massively parallel strategies for local spatial interpolation, Computers & Geosciences 23 (8) (1997) 859–867.
[3] M.P. Armstrong, R.J. Marciano, Inverse distance weighted spatial interpolation using a parallel supercomputer, Photogrammetric Engineering and Remote Sensing 60 (10) (1994) 1097–1103.
[4] P.A. Burrough, R.A. McDonnell, Principles of Geographical Information Systems, Oxford University Press, 1998.


[5] B.P. Carlin, S. Chib, Bayesian model choice via Markov-Chain Monte-Carlo methods, Journal of the Royal Statistical Society Series B (Methodological) 57 (3) (1995) 473–484.
[6] K.C. Clarke, Analytical and Computer Cartography, second ed., Prentice-Hall, Englewood Cliffs, NJ, 1995.
[7] T.H. Cormen, C.E. Leiserson, R.L. Rivest, Introduction to Algorithms, MIT Press/McGraw-Hill, 1990.
[8] D.J. Cowen, J.R. Jensen, C. Hendrix, M.E. Hodgson, S.R. Schill, A GIS-assisted rail construction econometric model that incorporates LIDAR data, Photogrammetric Engineering and Remote Sensing 66 (11) (2000) 1323–1328.
[9] B.E. Cramer, M.P. Armstrong, An evaluation of domain decomposition strategies for parallel spatial interpolation of surfaces, Geographical Analysis 31 (2) (1999) 148–168.
[10] K. Czajkowski, I. Foster, C. Kesselman, N. Karonis, S. Martin, W. Smith, S. Tuecke, A resource management architecture for metacomputing systems, in: Proceedings of IPPS/SPDP'98 Workshop on Job Scheduling Strategies for Parallel Processing, 1998, pp. 62–82.
[11] Y. Ding, P.J. Densham, Spatial strategies for parallel spatial modeling, International Journal of Geographical Information Systems 10 (6) (1996) 669–698.
[12] I. Foster, C. Kesselman, J. Nick, S. Tuecke, The physiology of the grid: an open grid services architecture for distributed systems integration. Available from .
[13] I. Foster, The Grid: a new infrastructure for 21st century science, Physics Today (2000) 42–47.
[14] I. Foster, C. Kesselman, The Grid: Blueprint for a New Computing Infrastructure, Morgan Kaufmann Publishers, San Francisco, CA, 1999.
[15] I. Foster, C. Kesselman, The Globus project: a status report, in: Proceedings of the 7th IEEE Heterogeneous Computing Workshop (HCW '98), 1998, pp. 4–18.
[16] I. Foster, C. Kesselman, Globus: a metacomputing infrastructure toolkit, International Journal of Supercomputing Applications and High Performance Computing 11 (2) (1997) 115–128.
[17] A. Grimshaw, A. Ferrari, G. Lindahl, K. Holcomb, Metasystems, Communications of the ACM 41 (11) (1998) 46–55.
[18] G.R. Hjaltason, H. Samet, Speeding up construction of quadtrees for spatial indexing, Computer Science Department, TR-4033, University of Maryland, College Park, MD, July 1999. Available from .
[19] E.G. Hoel, H. Samet, Data-parallel primitives for spatial operations, in: Proceedings of the 24th International Conference on Parallel Processing, Oconomowoc, WI, 1995, pp. 184–191.
[20] R.I. Jennrich, An Introduction to Computational Statistics: Regression Analysis, Prentice-Hall, Englewood Cliffs, NJ, 1995.
[21] H.J. Kramer, Observation of the Earth and its Environment: Survey of Missions and Sensors, Springer, Berlin, 2002.
[22] N. Lam, Spatial interpolation methods: a review, The American Cartographer 10 (2) (1983) 129–149.
[23] G.M. Morton, A Computer Oriented Geodetic Data Base and a New Technique in File Sequencing, IBM, Ottawa, Canada, 1966.
[24] D.T. Pham, D. Karaboga, Intelligent Optimization Techniques: Genetic Algorithms, Tabu Search, Simulated Annealing and Neural Networks, Springer, New York, 2000.
[25] A.H. Robinson, J.L. Morrison, P.C. Muehrcke, A.J. Kimerling, S.C. Guptill, Elements of Cartography, sixth ed., John Wiley and Sons, New York, 1995.
[26] H. Samet, Applications of Spatial Data Structures, Addison Wesley, Reading, MA, 1990.
[27] H. Samet, Design and Analysis of Spatial Data Structures, Addison Wesley, Reading, MA, 1989.
[28] H. Samet, The quadtree and related hierarchical data structures, ACM Computing Surveys 20 (4) (1984) 187–260.
[29] C.A. Schloeder, N.E. Zimmerman, M.J. Jacobs, Comparison of methods for interpolating soil properties using limited data, Soil Science Society of America Journal 65 (2) (2001) 470–479.


[30] D.F. Watson, G.M. Philip, A refinement of inverse distance weighted interpolation, Geo-Processing 2 (4) (1985) 315–327.
[31] N. Xiao, D.A. Bennett, M.P. Armstrong, Using evolutionary algorithms to generate alternatives for multiobjective site-search problems, Environment and Planning A 34 (4) (2002) 639–656.
[32] D. Zimmerman, C.E. Pavlik, A. Ruggles, M.P. Armstrong, An experimental comparison of ordinary and universal kriging and inverse distance weighting, Mathematical Geology 31 (4) (1999) 375–390.