MASSIVELY PARALLEL STRATEGIES FOR LOCAL SPATIAL INTERPOLATION

MARC P. ARMSTRONG and RICHARD J. MARCIANO

Department of Geography and Program in Applied Mathematical and Computational Sciences, 316 Jessup Hall, The University of Iowa, Iowa City, IA 52242, and Enabling Technologies, San Diego Supercomputer Center, P.O. Box 85608, San Diego, CA 92186-9784 (e-mail: [email protected])

(Received 9 December 1996; revised 6 April 1997)

Abstract-An inverse distance weighted interpolation algorithm is implemented using three massively parallel SIMD computer systems. The algorithm, which is based on a strategy that reduces search for control points to the local neighborhood of each interpolated cell, attempts to exploit hardware communication paths provided by the system during the local search process. To evaluate the performance of the algorithm a set of computational experiments was conducted in which the number of control points used to interpolate a 240 x 800 grid was increased from 1000 to 40,000 and the number of k-nearest control points used to compute a value at each grid location was increased from one to eight. The results show that the number of processing elements used in each experimental run significantly affected performance. In fact, a slower but larger processor grid outperformed a faster but smaller configuration. The results obtained, however, are roughly comparable to those obtained using a superscalar workstation. To remedy such performance shortcomings, future work should explore spatially adaptive approaches to parallelism as well as alternative parallel architectures.

Key Words: Parallel computing, Spatial interpolation.

INTRODUCTION

Interpolation is often used to convert geographical datasets that contain point-sampled observations into surfaces or layers that are used in subsequent analyses. Though several interpolation methods have been developed (e.g. Lam, 1983; Burrough, 1986), many require a substantial amount of computation to produce results when large data sets must be processed. Parallel processing is one way to reduce the time required to compute solutions for large and computationally intensive problems (Ding and Densham, 1996; Clematis, Falcidieno, and Spagnuolo, 1996). Previous research has reported on the computational performance of inverse distance weighted interpolation algorithms when they were implemented using MIMD (multiple instruction, multiple data stream) parallel computers. Armstrong and Marciano (1994) showed how the performance of a brute-force interpolation algorithm (MacDougall, 1984) can be significantly improved, and in a second paper (Armstrong and Marciano, 1996), a more efficient algorithm, based on the use of a restricted, or local, search (Clarke, 1990), was used to reduce computation time further. In this paper we also focus on inverse distance weighted interpolation using Clarke's local search strategy, but we shift to a more massively parallel SIMD (single instruction, multiple data stream) architecture, discuss alternative approaches to algorithm implementation, contrast parallel results to those obtained using a workstation, and sketch out a conceptual basis for a spatially adaptive approach to parallelism.

SIMD ARCHITECTURE

High performance computing technology has evolved into several distinctive architectural forms (Flynn and Rudd, 1996). Each architecture is used most effectively when a problem is matched to it in a way that exploits its particular performance characteristics. MIMD computers are often constructed from powerful but loosely connected processing elements (PEs). The control-parallel model used by these systems assumes that a problem can be coarsely divided into sub-problems that are processed autonomously on each PE. On the other hand, the data parallel model that is often adopted when SIMD systems are applied to problems relies on the use of a large number of processors (often thousands); in this approach, when arithmetic calculations are performed, say on a matrix, each element is operated on simultaneously and, as a consequence, this class of machines is said to operate synchronously. SIMD computers are particularly well suited to problems that can be represented using a regular (e.g. gridded) geometry. Because many geographical problems can be reduced to a lattice form, they can be adapted to a SIMD architecture (see, for example, Armstrong and Marciano, 1995).

Figure 1. SIMD architecture schematic.

The full implementation of a SIMD parallel program typically requires that a master program be written on a Front End (UNIX workstation) system that supports the execution of the parallel program on the computational backend or data parallel unit (DPU in Fig. 1). This master program loads the initial data onto the DPU, which is comprised of an ACU (array control unit) and the PE array. The ACU processor controls the PE array by sending data and instructions to each PE simultaneously. This programming paradigm allows sequential portions of the code to be carried out on the Front End, while parallel portions execute on the DPU. Sequential and parallel portions of a program can alternate between these sub-systems.

We used three MasPar (Blank, 1990; Nickolls, 1990) SIMD computer systems to perform the analyses reported in this paper. These SIMD machines differ in two of their characteristics (Table 1). First, the size of the PE array used by the three systems ranges from 4096 (4 K) to 16,384 (16 K) elements, and second, two types of processing elements are used. The MP-1 is an older machine that uses a slower 4-bit processor, while the MP-2 is faster and uses a 32-bit processor. Note, however, that although the MP-2 system that we used has faster PEs, as well as slightly faster connections (though Xnet communications on both machines are identical, router communications improved slightly in the MP-2: send bandwidth improved by 7% and receive bandwidth improved by almost 4%), the size of its grid (4 K) is smaller than either of the MP-1 machines. This is important because the processor grid size controls the number of data elements that can be operated on synchronously and, as a consequence, the size of the processor grid can significantly affect the performance characteristics of a SIMD system. We return to this issue later in the paper.

Table 1. MasPar SIMD computer systems used in this research

Type    PE local memory    Number of PEs      Grid size
MP-1    64 K               8 K (8192)         64 x 128
MP-1    16 K               16 K (16,384)      128 x 128
MP-2    64 K               4 K (4096)         64 x 64

IMPLEMENTATION

Several strategies have been developed to implement the process of estimating values in a grid from a collection of known control points. Among the local (Burrough, 1986, p. 153) approaches to estimation, the inverse distance weighted approach is widely used. Researchers have devised different approaches to the determination of the k-nearest control points that are used to compute values for unknown locations in this local approach to interpolation. Hodgson (1989), for example, describes a "learned search" that is based on the concept that the k-neighbors of two cells for which values must be computed are likely to share common members. This information is then used to reduce search times and, consequently, overall processing time is reduced significantly.

In a similar vein, the local search strategy used by the Clarke interpolation algorithm (Clarke, 1990) visits each "empty" cell of the interpolation grid and calculates a value for it based on the values of the k-nearest control points. Initially the search for neighbors is restricted to a surrounding square of size inc. If the requisite number of neighbors is not found, inc is incremented until a sufficient number of neighbors is contained within the bounding square. For example, if k = 3, and grid point (i,j) has a final value of inc equal to 1, this means that the three neighbor points on which the calculation of an interpolated value is based were found in the immediate vicinity (queen's move) of cell (i,j). If, on the other hand, a final value of inc was equal to 20, this means that 20 search steps were required to expand the neighborhood around (i,j) so that it encompassed at least three control points.

The Clarke algorithm was chosen for this application to evaluate the performance of local, as opposed to brute force, approaches to interpolation because of its inherent simplicity in its sequential form and because serial code that can be rendered into parallel form was readily available in a widely-used analytical cartography text (Clarke, 1990, pp. 207-210). In addition, the performance of the algorithm has been evaluated in other computer environments (see Armstrong and Marciano, 1996) and the results presented here can be compared with this earlier work. It should be noted that it is not the intent of this paper to evaluate the quality of the results produced by the Clarke algorithm; while ". . . the algorithm has been found to be highly effective and is far faster than the brute force algorithm." (Clarke, 1990, p. 207), it can introduce errors into the resulting surface when, under certain conditions, it assigns multiple input points to the center of the nearest cell.

The parallel version of Clarke's algorithm was implemented using MPL (MasPar Programming Language), an ANSI C dialect with extensions that support data parallel programming, the mapping of data structures onto the PE array, and low level inter-processor communication (MasPar, 1992). Two data grid sizes are used to evaluate the performance of the implementations. The first is relatively small and was designed specifically to help us evaluate the effect of communication overhead on the performance of the algorithm. This small problem interpolates a 24 x 80 grid from 10 control points. The second, and larger, data set more closely approximates one that might typically be encountered in a research setting: a 240 x 800 grid is interpolated from a starting configuration of 10,000 randomly distributed control points. In additional experiments the number of control points is varied from 1000 to 40,000. The use of a random distribution provides a reasonable basis on which to evaluate performance given that control point densities often exhibit spatial variability in real world sampling schemes.
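For readers who want the local search logic spelled out, the following C fragment sketches the expanding-square search and the inverse distance weighted estimate for a single grid cell. It is a serial illustration written for this discussion, assuming control point coordinates in grid units, at least k control points in the data set, an arbitrary distance exponent power, and at most sixteen neighbors; it is not the authors' MPL code.

#include <math.h>
#include <float.h>

typedef struct { double x, y, z; } ControlPoint;

/* Hypothetical serial sketch of an expanding-square (local) search plus
 * inverse distance weighting for one grid cell.  Assumes npts >= k and
 * k <= 16.  Illustration only, not the MPL implementation in the text. */
double interpolate_cell(int col, int row,
                        const ControlPoint *pts, int npts,
                        int k, double power)
{
    double best_d[16];   /* distances of the k nearest points found so far */
    int    best_i[16];   /* their indices in pts[]                         */
    int inc = 1;         /* half-width of the square search window         */
    int found;

    for (;;) {
        found = 0;
        for (int p = 0; p < npts; p++) {
            /* consider only control points inside the current square */
            if (fabs(pts[p].x - col) > inc || fabs(pts[p].y - row) > inc)
                continue;
            double d = hypot(pts[p].x - col, pts[p].y - row);
            /* ordered insertion into the list of the k nearest so far */
            int j = (found < k) ? found++ : k;
            while (j > 0 && best_d[j - 1] > d) {
                if (j < k) { best_d[j] = best_d[j - 1]; best_i[j] = best_i[j - 1]; }
                j--;
            }
            if (j < k) { best_d[j] = d; best_i[j] = p; }
        }
        if (found >= k)
            break;       /* enough neighbors fall inside the square */
        inc++;           /* otherwise enlarge the neighborhood and retry */
    }

    /* inverse distance weighted average of the k nearest neighbors */
    double num = 0.0, den = 0.0;
    for (int i = 0; i < k; i++) {
        if (best_d[i] < DBL_EPSILON)
            return pts[best_i[i]].z;   /* cell coincides with a control point */
        double w = 1.0 / pow(best_d[i], power);
        num += w * pts[best_i[i]].z;
        den += w;
    }
    return num / den;
}

The final value of inc in this sketch plays the same role as in the discussion above: it records how far the square had to grow before k control points were found.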

Communication strategies

In the interpolation algorithm the search for k-neighbors is conducted using a message passing approach. Message passing on this class of SIMD computers, however, can be performed in two ways. In the first approach, communication is supported by physical connections among each PE and its eight (queen's move) nearest neighbors, with toroidal wraparound implemented for those PEs on the edges of the array. This local mode of communication, called xnet, operates synchronously and provides the means through which each PE is able to communicate with a target PE based on a regular access pattern. For example, in parallel, all active PEs would try to fetch a value located in a PE three rows over and 10 columns up. In a typical implementation, several MPL functions might be used to implement xnet communication: xnetN, xnetNE, xnetE, xnetSE, xnetS, xnetSW, xnetW, and xnetNW. Each function takes an argument dist. If dist = 0 a local memory reference is made. If dist = 1 then for xnetNE(dist) the program would communicate with a PE in the processor grid located at distance 1 in the northeast direction. These eight functions also can be generalized in MPL through the use of xfetch and xsend, which try to connect to a PE located at distance Δy, Δx (where Δy is the number of steps in the y direction required to reach the target PE, with a positive value meaning step north and a negative value meaning step south, and Δx is the number of steps in the x direction required to reach the target PE, with a positive value meaning step east and a negative value meaning step west). Consequently, xnet requires that a communication direction and distance be specified for all PEs (arguments Δy and Δx are specified as fixed integer values).

In addition to this simple connectivity, the PE array is physically arranged in non-overlapping squares of sixteen PEs that constitute a collection of clusters (Fig. 2). The second communication approach, called global router, supports PE to PE access that is based on these underlying PE cluster connections. It is important to note, however, that when this router communication strategy is invoked, the system can be instructed to communicate with all PEs simultaneously, but in practice, it is able to communicate with only one PE per cluster at one time. This means that at most one outgoing and one incoming communication connection can be established per cluster during each time step and, consequently, contention will occur when two PEs attempt to connect to target PEs that are in the same PE cluster. If there are any one-to-many or many-to-one PE cluster connections, global router communications are forced into a sequential mode: parallel execution is precluded by the one-connection-at-a-time restriction on access into each cluster. Moreover, if a PE connects to itself or to another PE in the same cluster, then both the outgoing and incoming connection are used up for that cluster. During execution this means that only one communication access is required to communicate with four PEs that are in different clusters, but that four sequential communication accesses are required to communicate with four PEs in the same cluster.

Figure 2. A 4 x 2 arrangement of PE clusters of 16 PEs each (128 PE array).
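Because each sixteen-PE cluster accepts only one incoming router connection per time step, the number of serialized steps a router operation needs is set by the most heavily targeted cluster. The short C sketch below illustrates that bookkeeping; the function name, the assumption that sixteen consecutively numbered PEs form a cluster, and the single-incoming-connection simplification are assumptions made for this illustration, not MasPar system code.

#include <stdlib.h>

/* Estimate the number of sequential router steps for one communication
 * phase when each 16-PE cluster can accept only one incoming connection
 * per step.  target_pe[i] is the destination PE of the i-th active PE. */
int router_steps(const int *target_pe, int n_active, int nproc)
{
    int n_clusters = nproc / 16;                 /* sixteen PEs per cluster */
    int *hits = calloc(n_clusters, sizeof *hits);
    int steps = 0;

    if (hits == NULL)
        return -1;
    for (int i = 0; i < n_active; i++) {
        int cluster = target_pe[i] / 16;         /* assumed cluster numbering */
        if (++hits[cluster] > steps)
            steps = hits[cluster];               /* busiest cluster so far */
    }
    free(hits);
    return steps;
}

Four active PEs whose targets sit in four different clusters would return 1, while four targets in the same cluster would return 4, matching the behavior described above.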

Implications of alternative communication strategies

The geographical context of some applications can be characterized as being regular in the sense that neighborhood relations are fixed. In image processing applications, for example, relations in smoothing or enhancement operators are often based on a 3 x 3 cell neighborhood. In other types of applications, however, such geometrical regularity is not found. In the interpolation problem considered in this paper, at each interpolated grid location the distance and direction of search is dependent upon the distribution of control points. Because control points may be clustered, search patterns will often vary among grid cells.

Since the local search strategy used by the Clarke algorithm draws upon the idea of using values contained in neighboring grid cells, this behavior could be modeled by exploiting xnet communication capabilities. If the xnet approach were used, the interpolation array, interp_array(i,j), could be loaded directly onto the PE array and mapped onto PE(i,j)'s local memory. The main problem with this approach, however, is that it is only practical when the size of the interpolation grid is less than or equal to the size of the PE array, because xnet assumes that the pattern of communication is regular and that the data set maps entirely onto the PE grid. While xnet would work well for the 24 x 80 test case, if a dataset is introduced that is larger than the PE grid (see the right-hand column of Table 1 for the size of the virtual layers on each machine) it must be folded into a set of virtual layers, thus destroying the regular access pattern used by xnet.

To better illustrate the concept of virtual layers, consider a hypothetical machine with 100 processors: a 10 x 10 matrix would map directly onto this configuration of PEs. However, if a larger problem, say 20 x 20, were processed, it would require this larger matrix, now containing 400 elements, to be subdivided into four virtual layers to fit the problem onto the physical processor grid of size 100. In practice, the interpolation array on the PE array would be declared as a plural array (e.g. plural interp_array[...]).
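The cut-and-stack fold described above amounts to counting how many PE-grid-sized tiles are needed to cover the data grid. The following C sketch of that arithmetic (a hypothetical helper, not part of the MPL program) reproduces both the 20 x 20 example and the layer counts for the 240 x 800 problem discussed later.

/* Number of virtual layers produced when a data grid is cut into
 * PE-grid-sized tiles and stacked onto the physical PE array.
 * Hypothetical helper for illustration only. */
int virtual_layers(int data_rows, int data_cols, int pe_rows, int pe_cols)
{
    int tiles_down   = (data_rows + pe_rows - 1) / pe_rows;   /* ceiling division */
    int tiles_across = (data_cols + pe_cols - 1) / pe_cols;
    return tiles_down * tiles_across;
}

/* virtual_layers(20, 20, 10, 10)     == 4    (the hypothetical 100-PE machine)
 * virtual_layers(240, 800, 64, 64)   == 52   (4 K MP-2)
 * virtual_layers(240, 800, 64, 128)  == 28   (8 K MP-1)
 * virtual_layers(240, 800, 128, 128) == 14   (16 K MP-1)                        */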

Figure 3 illustrates this point. Assume a user-defined (plural) parallel array A(4) and a hypothetical physical PE array of four PEs, where only the first three PEs are active at the time of an xnet operation. The syntax of an xnet construct is:

xnet[singular_expression].plural_expression

Because xnet requires a singular_expression, the pattern has to be regular. Thus, xnetE[1].A = A causes all active PEs to store the value of their local A into their east neighbor's A.

Figure 3. Use of xnet constructs on small data arrays.

However, when the size of the data array exceeds the size of the PE array, the regularity of the communication pattern may be broken, prohibiting the use of xnet. As shown in Figure 4, if the xnet communication approach were adopted, the program would need to use different operations, xnetE() and xnetW(), to accomplish the same task.

Figure 4. Illustration of need to use different xnet operations on grid.

In contrast, the router construct allows each active PE to communicate with any other PE in an arbitrary way (MasPar, 1992, p. 240). It also supports the analysis of datasets that are larger than the PE array. Consequently, a general implementation of the interpolation problem requires that point-to-point PE communication be established using the global router rather than xnet. The simple example sketched out in Figure 4 could be implemented with a router construct (syntax: router[plural_index_expression].plural_expression), because it uses an index array rather than a singular direction value.

Recall, however, that the router communication strategy relies on cluster-based access. As a result of the potential for contention, additional MPL programming instrumentation was required in our implementation. For example, MPL provides constructs that check for collisions. Thus, a program can be written in a way that wraps potentially unsafe router code with "while" loops that try to re-invoke the router statement until all serialized PE communications are completed. This is illustrated in the following code fragment in which the connected() function is used. The connected(target) routine returns a "1" in those PEs that were able to establish a connection with their target PE. Because a collision will occur when two PEs attempt to connect to target PEs that are in the same sixteen node partition, by placing connected() in a loop it is possible to establish that all required PE communications have been completed. Within the loop, as long as there is still a PE in the active set, one is selected that must have a communication link established; it is sent k, and then removed from the active set:


plural int i, j, k, activeSet;

all activeSet = 0;            /* initialize all PEs in the activeSet plural array */
activeSet = 1;                /* initialize active PEs; the active set is the set
                                 of PEs that is enabled at any given time */

/* retry the router transfer until every active PE has completed it */
while ( activeSet == 1 )
{
    if ( connected( i ) )     /* returns 1 for each active PE that did achieve a
                                 router connection to its target PE */
    {
        router[ i ].j = k;    /* send k to the i-th PE's j */
        activeSet = 0;        /* PEs who have sent their k are done */
    }
}

The theoretical performance of the router approach depends on the number of collisions that are encountered during execution, given the size of the PE array and the problem. For example, given an MP-1 machine with an opsize of 32 (the number of bits in each operand), the approximate timing (in clock cycles) for a north xnet operation (xnet appears on the left hand side of a statement) is 34 * dist + 8 when dist > 1, and 41 when dist = 1 (MasPar, 1993, section 4.7), where dist was defined previously. A random communication pattern with all PEs participating takes an average of 5000 clock cycles for 32-bit operands. In the worst case, if a value must be broadcast from one PE to all other PEs, it is possible that a router statement would perform this operation in a serial manner, thus being nproc (the total number of PEs in the PE array) times slower than a theoretically optimal implementation.
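To put those figures side by side, the short C sketch below evaluates the cycle estimates quoted above; the constants are simply the approximate figures from the preceding paragraph, reproduced for illustration, and the helper name is hypothetical.

/* Approximate cycle cost of a north xnet shift on an MP-1 with 32-bit
 * operands, using the published estimates quoted in the text.  A hedged
 * illustration, not a vendor-supplied timing model. */
long xnet_cycles(int dist)
{
    if (dist < 1)
        return 0;              /* dist = 0 is a local memory reference; cost ignored here */
    return (dist == 1) ? 41L : 34L * dist + 8L;
}

/* Average cost quoted for a fully random router pattern with all PEs active. */
#define ROUTER_AVG_CYCLES 5000L

For example, a regular shift of ten rows costs xnet_cycles(10) = 348 cycles, roughly an order of magnitude less than the quoted 5000-cycle average for a random router pattern, which is why losing xnet's regular access pattern on the folded, larger-than-the-PE-array data set is costly.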

DISCUSSION OF RESULTS

The results contained in Table 2 show that the best performance on the different interpolation grid sizes varied among the systems used to compute the results. The times reported are from the start to the end of execution on the DPU (Data Parallel Unit), which includes the controlling processor (ACU) and the attached PE parallel array. Note, however, that the program is launched from the Front End workstation and that the results do not include the overhead required to load the program onto the DPU and download results. This overhead was so minimal for the grid problem sizes examined that it did not represent a measurable component.

Table 2. Timings (in seconds) for three MasPar machines on 24 x 80 and 240 x 800 problems with k = 3

Grid size    Control points    MP-2, 4 K PEs    MP-1, 8 K PEs    MP-1, 16 K PEs
24 x 80      10                4.98             8.26             8.27
240 x 800    10,000            17.02            19.01            10.86

The MP-2 with its 4 K processor grid performed best on the small (24 x 80) problem with ten control points. This is an expected result because the MP-2 is constructed from higher performance components and its processor grid can easily accommodate the entire small problem. Because of their slower processor and interconnection speeds, the 16 K and 8 K MP-1 machines provide slower, but virtually identical, results for the small problem. When the larger 240 x 800 data set is interpolated with 10,000 control points, however, the interpretation of the results is not as straightforward. Though the MP-2 with the 4 K grid outperforms the 8 K processor MP-1, it does not perform as well as the MP-1 system with its larger grid of 16 K processors.

If we return to the processor grids that were used in these experiments, the timing results obtained from the 24 x 80 data set provide a good indication of the raw performance of the processors used by each machine: because the problem is small enough to fit directly onto the physical grid of each machine without being converted into virtual layers, the speed of the processors used by the systems has the dominant effect on performance. Consequently, the MP-2 computes results at a rate that is nearly twice as fast as the MP-1 machines for the small problem. When the large (240 x 800) problem is examined, however, it is evident that the system with the 16 K processor grid, though comprised of slower individual elements, outperforms both machines with smaller grids. As shown in Table 1, the size of the processor grid determines the maximum size of the sub-matrix that can be handled during each time step executed by the different SIMD systems. This means that a 4 K-processor machine is required to map the 240 x 800 data set into 52 virtual layers, an 8 K-processor machine would map the same data set into 28 virtual layers, while a 16 K-processor machine would map it into fourteen virtual layers. Since these multiple layers are processed sequentially (the algorithm iterates through the layers), this explains why, for the 240 x 800 data set, the 16 K-processor machine clearly outperforms the 8 K-processor machine, even though they have identical processors. More tellingly, the 16 K machine even outperforms the 4 K MP-2 machine despite its faster components. This example provides a good illustration of the time and space trade-off that is inherent in SIMD computing environments.

As the number of proximal points used in interpolation (k) increases, the algorithm is required to complete a greater number of searches. It is important to note, however, that when the Clarke algorithm is used to compute results for a problem in which the size of the study area is held constant, an increase in the number of randomly distributed control points will lead to increased densities and, as a consequence, it is likely that the Clarke algorithm would find the k-nearest control points more quickly. Table 3 shows the performance of the Clarke algorithm when the number of k-neighbors is increased from one to eight and when the number of control points is increased from 1000 to 40,000.

Table 3. Complete timing results for MasPar: 16 K MP-1

Control points    k = 1    k = 2    k = 3     k = 4     k = 5     k = 6     k = 7     k = 8
1000              49.65    13.61    94.24     111.81    121.51    146.06    161.93    182.16
2000              25.88    36.36    46.41     54.52     61.98     70.24     78.40     86.85
5000              11.31    15.81    20.09     24.11     27.90     31.69     35.58     38.80
10,000            5.88     8.23     10.86     12.97     15.00     17.58     18.77     20.62
15,000            4.38     6.02     7.60      9.04      10.30     11.62     13.24     14.06
20,000            3.31     4.51     5.12      6.72      7.78      9.02      10.10     11.17
25,000            2.74     3.86     4.17      5.68      6.46      7.33      8.25      9.14
30,000            2.44     3.37     4.21      5.00      5.68      6.37      7.12      1.79
35,000            2.15     2.97     3.73      4.37      5.00      5.66      6.26      6.84
40,000            1.85     2.59     3.31      3.95      4.54      5.03      5.58      6.00

In a 240 x 800 grid, the number of control points is increased from 1000 to 40,000 and the number of proximal neighbors (k) is increased from 1 to 8.

At the outset of the Clarke algorithm, for each empty grid cell, inc = 1. If k neighbors are found within that unit square, the algorithm stops; otherwise, as described above, inc is incremented and the search is conducted in the enlarged neighborhood until k is reached. This operation is repeated for each interpolated cell in the grid and the largest of all the final values of inc (based on the results shown in Table 3) are recorded in Table 4. For example, when the 1000 point data set was processed with k = 8, the largest concentric square in which search was conducted before finding eight control points was 43. However, when the 40,000 point data set was processed with k = 1, the largest value of inc was reduced to 4. The final value of inc provides an indication of the amount of computation that is needed to find the requisite number of neighbors on which the interpolation calculations are based. As the number of k-neighbors increases, search squares tend to increase, as illustrated in each row of Table 4. If a random distribution of control points is assumed, when a smaller data set is interpolated, a larger search square would be required to locate control points. Clearly, there is a direct correlation between the values of inc recorded in Table 4 and the run times reported in Table 3. Also, there is a crude correlation between the spatial distribution of the control points and the final value of inc and thus, by transitivity, between the spatial properties of the underlying data set and the final execution time, for a given mapping.

Table 4. Values of inc for the 16 K MP-1 computed for each run in Table 3

Control points    k = 1    k = 2    k = 3    k = 4    k = 5    k = 6    k = 7    k = 8
1000              23       25       29       30       31       37       37       43
2000              15       20       23       23       24       28       28       29
5000              9        12       15       15       16       16       17       19
10,000            8        9        9        12       15       15       16       16
15,000            I        7        8        8        9        10       13       13
20,000            5        I        7        8        8        8        9        9
25,000            4        5        6        7        7        8        8        8
30,000            4        5        5        6        6        7        7        7
35,000            4        5        5        6        6        7        7        7
40,000            4        4        5        6        6        6        7        7

In a 240 x 800 grid, the number of control points is increased from 1000 to 40,000 and the number of proximal neighbors (k) is increased from 1 to 8.

Though the results we obtained appear, at first, to be promising, when they are contrasted with the levels of performance that can be obtained with widely available, and comparatively far less expensive, workstation technology, the computation times for the large (240 x 800 grid with 10,000 control points) problem appear to be lackluster. In fact, the results were not distinctly different from those obtained using a superscalar workstation (an IBM RS/6000-550). As shown in Table 5, the workstation consistently outperformed the 16 K MP-1 on this particular problem. Note that in Table 5 the computational workload was varied by increasing the number of k-nearest control points (from one to eight) on which the interpolated value at each cell is based.

Table 5. Results (in seconds) for the 240 x 800 grid with 10,000 control points using the 16 K MP-1 and RS/6000-550

               k = 1    k = 2    k = 3    k = 4    k = 5    k = 6    k = 7    k = 8
RS/6000-550    5.61     7.63     9.69     11.87    13.70    15.79    17.57    19.47
16 K MP-1      5.88     8.23     10.86    12.97    15.00    17.58    18.77    20.62

The number of k-nearest neighbors is increased from 1 to 8.

Figure 5. Alternative mappings of data to processor array.

SUMMARY AND CONCLUDING DISCUSSION

An inverse distance weighted interpolation algorithm developed by Clarke (1990) was implemented on three massively parallel SIMD computer systems. This algorithm reduces the search for control points on which to base interpolation calculations to a (usually) small subset of the study area. The number of processing elements used by each system significantly affected performance. In fact, a large set (16 K PEs) of comparatively slower processors outperformed a faster but smaller (4 K PEs) configuration. The best results obtained on these massively parallel systems, however, are roughly comparable to those obtained using a superscalar workstation.

The particular interconnection network used by the PE array relies on cluster-based access (a cluster is a 4 x 4 = 16 PE square) whereby only one outgoing and one incoming communication path is allowed per time step. Multiple paths are automatically serialized when using the global router communication primitive, thus making communication an important and sensitive bottleneck. Our results show that a cut-and-stack data grid-to-PE-cluster mapping results in communication collisions that may preempt useful parallelism. The demonstration that collisions and data locality are sensitive on this architecture is an important finding.

Because our results did not yield levels of performance that exceeded those of a typical workstation, one might be tempted to reject the SIMD dataflow approach to parallelism in favor of the MIMD approach (cf. Armstrong and Marciano, 1996). Before such a conclusion can be reached, however, alternative approaches should be explored. One alternative would center on the development of a new strategy for localizing search and matching it to the characteristics of the SIMD architecture. A different, hierarchical, approach could be contrasted with the cut-and-stack approach that we implemented to compute results. Figure 5 depicts the differences between these strategies. In the hierarchical approach, neighboring values in the data array are stored on the same PE whenever possible. This might be a good mapping for applications such as local interpolation, because storing neighboring points on the same PE might minimize PE communication. Rather than using this approach we implemented the more straightforward cut-and-stack strategy, which, while spatially more intuitive, may lead to increases in interprocessor communication overhead which, as we discussed earlier, can degrade performance.
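The contrast between the two mappings can be made concrete by writing down which PE would own a given data cell under each scheme. The index arithmetic below is a hypothetical C illustration for a pe_rows x pe_cols processor grid; it is not code from the MasPar implementation.

/* Which PE owns data cell (row, col) under the two mappings discussed above?
 * Hypothetical index arithmetic for a pe_rows x pe_cols PE grid. */

/* Cut-and-stack: the data grid is cut into PE-grid-sized tiles and the tiles
 * are stacked, so ownership depends only on a cell's position within its
 * tile; cells that straddle a tile boundary land on distant PEs. */
int pe_cut_and_stack(int row, int col, int pe_rows, int pe_cols)
{
    return (row % pe_rows) * pe_cols + (col % pe_cols);
}

/* Hierarchical (blocked): each PE holds a contiguous block of cells, so the
 * queen's-move neighbors of most cells live on the same PE. */
int pe_hierarchical(int row, int col, int data_rows, int data_cols,
                    int pe_rows, int pe_cols)
{
    int block_rows = (data_rows + pe_rows - 1) / pe_rows;   /* cells per PE (rows) */
    int block_cols = (data_cols + pe_cols - 1) / pe_cols;   /* cells per PE (cols) */
    return (row / block_rows) * pe_cols + (col / block_cols);
}

Under the blocked mapping most queen's-move neighbors share a PE index, which is the property the hierarchical strategy in Figure 5 seeks to exploit; under the cut-and-stack mapping neighbors generally sit on different PEs, which is why, as noted above, it may increase interprocessor communication overhead.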

Our results suggest that the success of alternative mappings between a problem and the computer architecture will depend on the spatial characteristics of the data set examined (Fig. 6). In this particular study the control point locations were randomly generated. Additional research should examine the effects of different levels of clustering of control point distributions on alternative parallelization strategies. If clusters are relatively isolated from one another, mapping them onto the local memory of one processor would make many of the data access patterns local to each processor, thus improving performance. The spatial pattern "signature" of a data set could be used to drive the virtualization of the data mapping onto the PE array, literally steering the type of parallel computation involved as well as the level of communication overhead, in an attempt to minimize both data collisions and the use of the router primitives. This approach, however, could only be justified if performance improvements were large enough to overcome the computational overhead required to derive the particular spatial signature of each dataset. With the increasing availability of distributed, networked, heterogeneous computing systems (Freund and Siegel, 1993), however, strategies such as this can be used to evaluate the computational characteristics of a problem so that it can be assigned to one or more architectures for optimal execution.

Figure 6. Alternative parallelization strategies driven by characteristics of problem.

Acknowledgments-This work was made possible by a grant of computer time by the Scalable Computing Laboratory, which is funded by Iowa State University and the Ames Laboratory-USDOE, Contract W-7405-Eng-82. Special thanks to Joe Metzger at SCL. M.P.A. would like to acknowledge support provided by the National Center for Geographic Information and Analysis; this paper is a contribution to Initiative 17 (Collaborative Spatial Decision-Making) of the NCGIA which is supported by a grant from the National Science Foundation SBR-8810917. Thanks also to Tim Busse, MasPar Corporation, for his assistance with the design of a new generalized router function that was needed to help solve part of the communication portion of this interpolation problem, and to Mike Hammill, SmartNode Coordinator at the Cornell Theory Center, for insightful technical ideas on how to best use the MasPar machines.

REFERENCES

Armstrong, M. P. and Marciano, R. (1994) Inverse distance weighted spatial interpolation using parallel supercomputers. Photogrammetric Engineering and Remote Sensing 60(9), 1097-1103.

Armstrong, M. P. and Marciano, R. (1995) Massively parallel processing of spatial statistics. International Journal of Geographical Information Systems 9(2), 169-189.

Armstrong, M. P. and Marciano, R. (1996) Local interpolation using a distributed parallel supercomputer. International Journal of Geographical Information Systems 10(6), 713-729.

Blank, T. (1990) The MasPar MP-1 architecture. Proceedings of the Thirty-Fifth IEEE Computer Society Spring International Conference, COMPCON 90, San Francisco, California, pp. 20-24.

Burrough, P. (1986) Principles of Geographical Information Systems for Land Resources Assessment. Oxford University Press, New York, 193 pp.

Clarke, K. C. (1990) Analytical and Computer Cartography. Prentice Hall, Englewood Cliffs, New Jersey, 290 pp.

Clematis, A., Falcidieno, B. and Spagnuolo, M. (1996) Parallel processing on heterogeneous networks for GIS applications. International Journal of Geographical Information Systems 10(6), 747-767.

Ding, Y. and Densham, P. J. (1996) Spatial strategies for parallel spatial modelling. International Journal of Geographical Information Systems 10(6), 669-698.

Flynn, M. J. and Rudd, K. W. (1996) Parallel architectures. ACM Computing Surveys 28(1), 67-70.

Freund, R. F. and Siegel, H. J. (1993) Heterogeneous processing. IEEE Computer 26(6), 13-17.

Hodgson, M. E. (1989) Searching methods for rapid grid interpolation. The Professional Geographer 41(2), 31-61.

Lam, N.-S. (1983) Spatial interpolation methods: A review. The American Cartographer 10(2), 129-149.

MacDougall, E. B. (1984) Surface mapping with weighted averages in a microcomputer. In Spatial Algorithms for Processing Land Data with a Microcomputer, Lincoln Institute Monograph #84-2, 278 pp. Lincoln Institute of Land Policy, Cambridge, Massachusetts.

MasPar (1992) MasPar Programming Language Reference Manual (ANSI C compatible MPL), Document Part Number 9302-001, Revision A3, July, 237 pp.

MasPar (1993) MasPar Parallel Application Language (MPL) User Guide, Software Version 3.2, Document Part Number 9302-0101, Revision A5, 69 pp.

Nickolls, J. R. (1990) The design of the MasPar MP-1: A cost effective massively parallel computer. Proceedings of the Thirty-Fifth IEEE Computer Society Spring International Conference, COMPCON 90, San Francisco, California, pp. 25-28.