The Hypercube as a Dynamically Reconfigurable Processor Mesh


Journal of Parallel and Distributed Computing 48, 130–142 (1998), Article No. PC971402

Joseph M. Joy,¹ Microsoft Corporation, One Microsoft Way, Redmond, Washington 98052

and R. Daniel Bergeron,² Department of Computer Science, University of New Hampshire, Durham, New Hampshire 03824

¹ E-mail: [email protected]. ² E-mail: [email protected].

We describe a technique for the efficient processing of large, multidimensional arrays on a MIMD hypercube. The technique allows the hypercube to be used as a processor mesh whose relative dimension sizes may be changed dynamically, while always keeping adjacent array elements on the same node or on physically adjacent nodes. The technique is based on a mapping scheme, called permuted Gray code mapping, which is a generalization of the binary reflected Gray code mapping. We also extend the technique to allow interleaving of the array data over the nodes of the hypercube. This technique can be used to efficiently parallelize scan-line algorithms, including operations such as volume rotation and volume rendering. © 1998 Academic Press

1. INTRODUCTION

The processor mesh is a natural topology for processing array data, especially if the processing uses coherent operations—those that involve localized regions of an array—because adjacent array data are located on adjacent processors. Coherent operations are used in applications such as volume rendering [6, 11] and volume data analysis [3] and include tasks such as interpolation between sampled points, filtering with small kernels, and small bitblt-like movements of the data. The optimal shape of the processor mesh depends on the access patterns of the operation. For example, if we need to perform some operation along the x axis of the array, the optimal shape maximizes the x size of the subarray stored on each processor. When multiple operations need to be performed successively on an array, each with a different access pattern, we would like to have an efficient mechanism for dynamically reconfiguring the processor mesh into different shapes.

The hypercube can emulate a mesh computer using the reflected Gray code mapping [5]. Although this mapping is used in several volume- and image-rendering algorithms for the hypercube [3, 12], redistribution is either inefficient or inflexible. Montani [12] suggests several different processor mesh shapes that are suitable for certain kinds of rendering problems, but does not discuss changing among mesh shapes. Despite limitations in the static storage scheme, Camahort [3] opts against dynamically migrating volume data among processors because of the high message-traffic overhead.

We use the permuted reflected Gray code (PRGC) mapping and a mapping transformation technique, which together turn a MIMD hypercube computer into a dynamically reconfigurable processor mesh. A key property maintained by the technique is physical adjacency—adjacent array elements are always stored on the same processor or on physically adjacent processors. Figure 1 shows several different distributions of a 3D array over differently shaped logical processor meshes, all of which are mapped to the same hypercube. Each outlined section of the array represents a subarray stored on a single processor. Our reconfiguration technique allows switching among the different distributions to accommodate changing array access patterns when processing the array.

Dynamic reconfiguration is achieved by means of a primitive operation, HalveDouble, that simultaneously halves the size of the logical mesh along one dimension and doubles it along another. The primitive is applied as many times as necessary to obtain the desired configuration. For example, HalveDouble is applied twice to switch from the distribution in Fig. 1a to that in Fig. 1b, and then once again to switch to that of Fig. 1c. Each invocation of HalveDouble changes the shape of the logical processor mesh and the shape of the subarray stored in each processor.

One application of the dynamic mesh reconfiguration algorithm is the parallelization of scan-line algorithms [4], such as image and volume rotation and scaling, that access data along one principal axis at a time.

FIG. 1. (top) Several 3D logical processor meshes mapped to a 16-node hypercube. (bottom) Mappings of the 3D array over the nodes; adjacent data in the array are on the same or adjacent nodes.


If the overhead of reconfiguring the mesh is small compared with the processing time of the sequential scan-line algorithm, the sequential algorithm is very easily parallelized: the mesh is reconfigured so that all the data along rows parallel to the scan-line axis are stored in a single processor, and the sequential algorithm is then run essentially unchanged on each processor. This technique is used in the parallel volume rendering algorithm for the MasPar MP-1 by Vézina et al. [16]. That algorithm uses the array transposition services available on the MP-1, which is not a hypercube. Schröder and Salem [14] describe a three-pass rotation algorithm which uses the grid communication features of the Connection Machine to perform shearing and scaling operations, but they do not make use of the hypercube architecture of that machine. If the reconfiguration overhead for this full reshaping is not small compared with the cost of the operation, the HalveDouble operation may be applied just enough times to reduce the communication required during the scan-line operation to adjacent-processor communication. We know of no image or volume processing algorithms that use such a partial reconfiguration, which our approach makes possible.

It is also possible to interleave the array data among the processors. Instead of storing a single subarray, each processor stores several smaller, widely spaced subarrays. This interleaving property is crucial for maintaining load balance in many algorithms where the computation costs are unevenly distributed over the data set, as occurs commonly when processing sparse volumetric data [11].

When processing large arrays on medium-grained hypercube computers, the HalveDouble operation takes time proportional to N/p, where N is the size of the array and p is the number of processors. Transforming a $1 \times 2^l$ mesh into a $2^l \times 1$ mesh requires l applications of HalveDouble. We have shown this to be optimal by using HalveDouble to emulate the perfect exchange [9].

We know of no other distribution scheme that allows dynamic, arbitrary reshaping of logical processor meshes, or that allows interleaving while maintaining the physical adjacency property. Chan [5] describes how to statically map meshes onto a hypercube using RGC sequences. Dynamic redistribution of arrays has been investigated for use in parallelizing matrix algorithms [1, 2, 7]. Other researchers have addressed modification of interleaving strategies, especially conversions among cyclic and block distributions [10, 13, 15]. This work does not attempt to maintain the physical adjacency property, and does not consider changing the sizes of each dimension of the logical array. Other work aims at collapsing or adding a dimension to a mesh, such as transforming a two-dimensional processor array into one dimension on a hypercube [8, 15]. Our method yields this result as a special case.

2. BACKGROUND

2.1. Notation

The $i$th bit of the binary representation of a number $a$ is denoted by $a_i$, with bit $a_0$ being the least significant bit. Thus, the binary representation of an $n$-bit number $a$ is $a_{n-1} a_{n-2} \ldots a_0$. Another representation of $a$ in terms of its bits is $a = \sum_{i=0}^{n-1} 2^i a_i$. Given some sequence of numbers $G$, the $i$th number is denoted by $G^i$ and the $j$th bit of $G^i$ is denoted by $G^i_j$.


2.2. Reflected Gray Codes

A Gray code is a sequence of numbers whose successive elements differ by exactly one bit. One particular Gray code is the reflected Gray code (RGC). The 1-bit RGC is the sequence (0, 1). Let $G$ be the $(n-1)$-bit RGC sequence, $G = (G^0, \ldots, G^{p-1})$, where $p = 2^{n-1}$. The $n$-bit RGC ($n > 1$) is defined recursively as $(0G^0, 0G^1, \ldots, 0G^{p-1}, 1G^{p-1}, 1G^{p-2}, \ldots, 1G^0)$. Thus the 2-bit RGC is (00, 01, 11, 10), and the 3-bit RGC is (000, 001, 011, 010, 110, 111, 101, 100).

2.3. RGC Mapping

A logical processor mesh whose dimension sizes are powers of two can be mapped perfectly onto a hypercube [5] using RGCs. First, the hypercube nodes (the processors) need to be ordered so that nodes adjacent in the ordering are adjacent in the hypercube. In other words, the addresses of nodes adjacent in the ordering must differ by exactly one bit, and thus the sequence of addresses of nodes in this ordering forms a Gray code. Any Gray code sequence of the correct length suffices. Second, we divide the $d$ bits of the node address into $n$ sections and order the binary numbers represented by the bits of each section using an RGC of the appropriate size. For example, to map a 2 × 4 array onto an order 3 hypercube³ (8 nodes), we partition the bits of the address into two sections, of length 1 and 2 bits, respectively. A node with address $a = a_2 a_1 a_0$ is now represented by the pair $(a_2, a_1 a_0)$, and we order the nodes along the first and second dimension according to 1-bit and 2-bit RGCs, respectively, resulting in the following 2D array of node addresses:

    000 ≡ (0, 00)   001 ≡ (0, 01)   011 ≡ (0, 11)   010 ≡ (0, 10)
    100 ≡ (1, 00)   101 ≡ (1, 01)   111 ≡ (1, 11)   110 ≡ (1, 10)

This can be generalized to a mapping of an $n$-dimensional mesh onto an order $d$ hypercube as follows: let the dimension sizes of the mesh be $2^{l_1}, 2^{l_2}, \ldots, 2^{l_n}$. Since the number of nodes in the mesh must be the same as the number of nodes in the hypercube, $l_1 + l_2 + \cdots + l_n = d$. The physical address of the node with mesh coordinates $(i_1, i_2, \ldots, i_n)$ is $G^{i_1}(l_1) G^{i_2}(l_2) \ldots G^{i_n}(l_n)$, where $G^i(l)$ is the $i$th number in the $l$-bit RGC sequence. This mapping scheme, row major RGC mapping, is useful for statically organizing the nodes of a hypercube into a mesh. Unfortunately, it is not suitable for changing the distribution at runtime, except for the transformation of a 1 × n mesh into an n × 1 mesh. This operation is equivalent to the all-to-all or perfect exchange operation, for which efficient algorithms exist [2, 7].

³ To avoid confusion between hypercube and array dimensions, we refer to the dimension of the hypercube as "order."
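To make the construction concrete, here is a small Python sketch (ours, not part of the paper; the function names are invented for illustration) that generates the n-bit RGC and applies the row major RGC mapping of Section 2.3:

```python
def rgc(nbits):
    """Return the n-bit reflected Gray code as a list of 2**nbits integers."""
    seq = [0]
    for b in range(nbits):
        # reflect the sequence built so far and set the new bit in the reflected half
        seq = seq + [(1 << b) | g for g in reversed(seq)]
    return seq

def row_major_rgc_address(coords, log_dims):
    """Row major RGC mapping: physical address of the node with the given mesh
    coordinates, for a mesh whose dimension sizes are 2**l for l in log_dims.
    The first dimension occupies the most significant group of address bits."""
    addr = 0
    for i, l in zip(coords, log_dims):
        addr = (addr << l) | rgc(l)[i]
    return addr

# The 2 x 4 example above (order 3 hypercube): the two rows of node addresses are
# [0, 1, 3, 2] (i.e. 000 001 011 010) and [4, 5, 7, 6] (i.e. 100 101 111 110).
rows = [[row_major_rgc_address((i, j), (1, 2)) for j in range(4)] for i in range(2)]
```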


3. THE DISTRIBUTION SCHEME

We now describe a mapping of array elements onto a logical processor mesh and a mapping of the logical processor mesh onto the nodes of a hypercube which together support efficient dynamic redistribution. The mappings possess the physical adjacency property, and redistribution can be efficiently implemented using the HalveDouble operation described in Section 4. For simplicity of presentation we concentrate on 2D arrays, but the extension to higher dimensions is straightforward, as indicated in Section 3.3.

3.1. Distributing an Array over a Logical Processor Mesh

Consider an $n_x \times n_y$ 2D array and an $N_x \times N_y$ 2D logical processor mesh. Note that $N_x$ and $N_y$ must both be powers of 2, since their product is the number of nodes in the hypercube, which must be a power of 2. The array is subdivided into $N_x$ sections along the x dimension and $N_y$ sections along the y dimension, resulting in $N_x N_y$ subarrays. The array is thus partitioned into an $N_x \times N_y$ array of subarrays. Each subarray is stored in the corresponding node of the logical processor mesh. When $n_x$ and $n_y$ are multiples of $N_x$ and $N_y$, respectively, all subarrays are the same size⁴ ($n_x/N_x \times n_y/N_y$) and the array is evenly distributed among the nodes. This property is not essential for our reconfiguration algorithm to work.

3.2. Mapping the Logical Mesh to the Hypercube

The nodes of the hypercube are organized into a logical mesh with the same number of dimensions as the array. The initial mapping is based on the row major RGC mapping. The reconfiguration algorithm can then be used to change the dimensions of the logical mesh. Although the new mapping is not, in general, the row major RGC mapping, it can be defined in terms of a permutation of the bits of the node addresses provided by the row major RGC mapping. This more general mapping is called a permuted reflected Gray code (PRGC) mapping. It preserves the property that neighboring nodes in the logical mesh are neighbors on the hypercube (i.e., the mesh is embedded in the hypercube).

In the following, $P[i, j]$ is the physical hypercube node address corresponding to position $(i, j)$ in the 2D logical mesh. Our mapping maps each node in an $N_x \times N_y$ logical mesh to a unique node in an $N_x N_y$ node hypercube; i.e., our mapping determines $P[i, j]$ for $i = 0 \ldots N_x - 1$, $j = 0 \ldots N_y - 1$. A PRGC mapping is defined by the triple $M = (l, m, \sigma)$, where $l = \log_2 N_x$, $m = \log_2 N_y$, $l + m = d$ (the order of the hypercube), and $\sigma = (\sigma_0, \sigma_1, \ldots, \sigma_{d-1})$ is a permutation of $(0, 1, \ldots, d-1)$. Given mapping $M$, $P[i, j]$ ($0 \le i < 2^l$, $0 \le j < 2^m$) is given by

$$P[i, j] = \sum_{k=0}^{l-1} 2^{\sigma_k} G^i_k + \sum_{k=0}^{m-1} 2^{\sigma_{l+k}} G^j_k, \qquad (1)$$

where $G^j_k$ is the $k$th bit of the $j$th element of a $d$-bit reflected Gray code sequence. The mapping represents a row major reflected Gray code mapping followed by a permutation of bits according to $\sigma$. If $\sigma = (m, m+1, \ldots, d-1, 0, 1, \ldots, m-1)$, the mapping reduces to the standard row major reflected Gray code mapping.

The lemma below shows that adjacent subarrays are stored on adjacent nodes in the hypercube. The proof essentially shows that permuting the bits of a Gray code results in another Gray code [9].

⁴ Throughout this paper, the slash, /, refers to integer division.
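For illustration, Eq. (1) can be evaluated directly. The Python sketch below (ours, reusing rgc() from the earlier sketch) computes P[i, j] for a 2D PRGC mapping M = (l, m, σ):

```python
def prgc_address(i, j, l, m, sigma):
    """P[i, j] of Eq. (1) for the 2D PRGC mapping M = (l, m, sigma);
    sigma is a permutation of 0..l+m-1."""
    G = rgc(l + m)                         # a d-bit RGC sequence, d = l + m
    addr = 0
    for k in range(l):                     # bits contributed by the first mesh index i
        addr |= ((G[i] >> k) & 1) << sigma[k]
    for k in range(m):                     # bits contributed by the second mesh index j
        addr |= ((G[j] >> k) & 1) << sigma[l + k]
    return addr

# With sigma = (m, m+1, ..., d-1, 0, ..., m-1) this reduces to the row major RGC
# mapping of Section 2.3; e.g. prgc_address(1, 2, 1, 2, (2, 0, 1)) == 0b111.
```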


LEMMA. For any 2D PRGC mapping $M = (l, m, \sigma)$, $P[i, j]$ and $P[i+1, j]$ are physical addresses of adjacent nodes on the hypercube, for all $i = 0 \ldots 2^l - 1$, $j = 0 \ldots 2^m - 1$. (By convention, $P[2^l, j] = P[0, j]$ and $P[i, 2^m] = P[i, 0]$.)

It is easy to show [9] that any property of a processor mesh that uses our mapping may easily be modified to hold for the transposed node mesh. Thus, we may conclude from the lemma that for a 2D mapping $M = (l, m, \sigma)$, $P[i, j]$ and $P[i, j+1]$ are physical addresses of adjacent nodes on the hypercube, for all $i = 0 \ldots 2^l - 1$, $j = 0 \ldots 2^m - 1$.

3.3. Extension to Higher Dimensions

We now show how to represent an $n$-dimensional logical processor mesh on an order $d$ hypercube. Let the physical addresses of a $2^{l_1} \times 2^{l_2} \times \cdots \times 2^{l_n}$ ($n$-dimensional) logical processor mesh be represented by $P[j_1, j_2, \ldots, j_n]$, where $0 \le j_i < 2^{l_i}$ for $i = 1 \ldots n$. Note that $\sum_{i=1}^{n} l_i = d$, the order of the hypercube. The $n$-dimensional PRGC mapping $M$, used to compute $P$, is defined by $M = (l_1, l_2, \ldots, l_n, \sigma)$, where (as before) the permutation vector $\sigma = (\sigma_0, \sigma_1, \ldots, \sigma_{d-1})$ is a permutation of $(0, 1, 2, \ldots, d-1)$. $\sigma$ is partitioned into $n$ sections, and each section is assigned to successive dimensions of the logical processor mesh as

$$\sigma = (\underbrace{\sigma_0, \sigma_1, \ldots, \sigma_{l_1-1}}_{\text{1st dimension}},\; \underbrace{\sigma_{l_1}, \sigma_{l_1+1}, \ldots, \sigma_{l_1+l_2-1}}_{\text{2nd dimension}},\; \ldots,\; \underbrace{\sigma_{d-l_n}, \sigma_{d-l_n+1}, \ldots, \sigma_{d-1}}_{n\text{th dimension}}).$$

Let $s_i$ denote the index of the first element of $\sigma$ assigned to dimension $i$. Note that $s_1 = 0$ and $s_i = \sum_{k=1}^{i-1} l_k$ for $1 < i \le n$. $P$ is computed from $M$ as

$$P[j_1, j_2, \ldots, j_n] = \sum_{i=1}^{n} \sum_{p=0}^{l_i - 1} 2^{\sigma_{s_i + p}} G^{j_i}_p.$$
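A direct transcription of this formula into Python (ours; it reuses rgc() from the earlier sketch) might look like the following:

```python
def prgc_address_nd(coords, log_dims, sigma):
    """P[j1, ..., jn] for the n-dimensional PRGC mapping M = (l1, ..., ln, sigma);
    coords[i] < 2**log_dims[i], and sigma is a permutation of 0..d-1, d = sum(log_dims)."""
    G = rgc(sum(log_dims))          # d-bit RGC sequence
    addr, s = 0, 0                  # s = s_i, index of the first sigma entry of dimension i
    for ji, li in zip(coords, log_dims):
        for p in range(li):
            addr |= ((G[ji] >> p) & 1) << sigma[s + p]
        s += li
    return addr
```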

4. THE HALVEDOUBLE OPERATION

The HalveDouble operation swaps half of the subarray on each node with half of the subarray on an adjacent node. This is the only communication required for the operation. Remarkably, the operation results in a hypercube configuration that is organized as a different logical processor mesh. Figure 2 illustrates this operation for the special case of halving the logical processor mesh size along the x axis and doubling the size along the y axis.

4.1. Theoretical Foundations

We now describe how each node determines its partner for the data swap, which portion of the array to swap in or out, and how to compute the position of each node in the reconfigured logical processor mesh. We use the properties of PRGC mappings specified in the theorem below. This theorem forms the basis for the HalveXDoubleY operation. Minor variations can easily be developed for the other versions of HalveDouble. (The theorem is proved in [9].)


FIG. 2. HalveXDoubleY transforms a 4 × 2 mesh into a 2 × 4 mesh. Numbers are the physical addresses of nodes. Shaded regions show the subarray halves swapped between adjacent nodes.

THEOREM. Given a 2D PRGC mapping $M = (l, m, \sigma)$, let $M' = (l-1, m+1, \sigma')$, where

$$\sigma' = (\sigma_1, \sigma_2, \ldots, \sigma_{l-1}, \sigma_0, \sigma_l, \sigma_{l+1}, \ldots, \sigma_{l+m-1}).$$

If $P$ is the logical processor mesh represented by $M$ and $P'$ is the logical processor mesh represented by $M'$, then $P$ and $P'$ are related by

$$P'[i, j] = \begin{cases} P[2i, j/2], & \text{if } (i + j + j/2) \text{ is even;} \\ P[2i+1, j/2], & \text{otherwise,} \end{cases}$$

for all $0 \le i < 2^{l-1}$ and $0 \le j < 2^{m+1}$.

Given the mappings $M$ and $M'$ defined in the theorem, it is easily shown that each pair of nodes $(P'[i, j], P'[i, j+1])$ ($0 \le i < 2^{l-1}$, $0 \le j < 2^{m+1}$, $j$ even) that are adjacent in the processor mesh $P'$ are also adjacent in the processor mesh $P$. Similarly, each pair of nodes $(P'[i, j], P'[i+1, j])$ ($0 \le i < 2^{l-1}$, $0 \le j < 2^{m+1}$, $i$ even) that are adjacent in $P'$ are also adjacent in $P$. The specific relationships between pairs of nodes in $P$ and $P'$ are given by

$$(P'[i, j], P'[i, j+1]) = \begin{cases} (P[2i, j/2],\; P[2i+1, j/2]), & \text{if } (i + j/2) \text{ is even;} \\ (P[2i+1, j/2],\; P[2i, j/2]), & \text{if } (i + j/2) \text{ is odd,} \end{cases} \qquad (2)$$

for all $0 \le i < 2^{l-1}$ and $0 \le j < 2^{m+1}$, $j$ even, and

$$(P'[i, j], P'[i+1, j]) = \begin{cases} (P[2i, j/2],\; P[2i+3, j/2]), & \text{if } (j + j/2) \text{ is even;} \\ (P[2i+1, j/2],\; P[2i+2, j/2]), & \text{if } (j + j/2) \text{ is odd,} \end{cases} \qquad (3)$$

for all $0 \le i < 2^{l-1}$ and $0 \le j < 2^{m+1}$, $i$ even.

The dimension sizes of $P'$ are $2^{l-1} \times 2^{m+1}$. Thus $P'$ is a mesh that is half the size of $P$ in one dimension and double the size of $P$ in the other. The nodes are partitioned into pairs of nodes that are adjacent in both the new mesh and the old mesh, and hence it is possible to change the distribution of an array from one mesh to the other by a single exchange of data between adjacent nodes.
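The mapping-level part of HalveXDoubleY is just the permutation rotation in the theorem. The sketch below (ours, building on the earlier prgc_address() sketch) computes M′ and checks the stated relation between P and P′ by brute force:

```python
def halve_x_double_y_mapping(l, m, sigma):
    """M' = (l-1, m+1, sigma') of the theorem: sigma_0 leaves the x group of bit
    positions and becomes the most significant entry of the y group."""
    sigma_prime = tuple(sigma[1:l]) + (sigma[0],) + tuple(sigma[l:l + m])
    return l - 1, m + 1, sigma_prime

def check_theorem(l, m, sigma):
    """Verify P'[i, j] = P[2i, j/2] or P[2i+1, j/2] according to the parity test."""
    l2, m2, sigma2 = halve_x_double_y_mapping(l, m, sigma)
    for i in range(1 << l2):
        for j in range(1 << m2):
            old_i = 2 * i if (i + j + j // 2) % 2 == 0 else 2 * i + 1
            assert prgc_address(i, j, l2, m2, sigma2) == \
                   prgc_address(old_i, j // 2, l, m, sigma)

check_theorem(2, 1, (1, 2, 0))   # the 4 x 2 -> 2 x 4 case of Fig. 2, row-major start
```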


4.2. The Algorithm

Algorithm HalveXDoubleY in Fig. 3 performs this operation for the specific case of halving along the x axis and doubling along the y axis. Simple variations handle the similar operation in the other directions. The algorithm is executed by each processor in the mesh in parallel. The condition "(i + j + j div 2) is even" for the 8 × 8 case can be represented by the following matrix, which is formatted to show the pairs of columns whose nodes swap data with their adjacent partner:

    i \ j   0 1   2 3   4 5   6 7
      0     1 0   0 1   1 0   0 1
      1     0 1   1 0   0 1   1 0
      2     1 0   0 1   1 0   0 1
      3     0 1   1 0   0 1   1 0
      4     1 0   0 1   1 0   0 1
      5     0 1   1 0   0 1   1 0
      6     1 0   0 1   1 0   0 1
      7     0 1   1 0   0 1   1 0

A node with a 1 entry invokes SendHighGetLowY(SA, node) to send the high half (farther from the origin) of its data (the array SA) to its partner node along the x axis and to receive the low half of the corresponding array from that partner (which has a 0 entry); it then reorganizes the array to reflect the new processor mesh dimensions. SendLowGetHighY is the analogous operation on the low half, invoked by a node with a 0 entry in the matrix. Functions GetPosFromAddress, GetAddressFromPos, and MapHalveXDoubleY perform the obvious operations given the PRGC mapping M.

FIG. 3. Algorithm for HalveXDoubleY: halve processor mesh along x axis; double it along y axis.
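Since the listing of Fig. 3 is not reproduced here, the following Python sketch is only our reading of the per-node procedure described above; the helper names mirror the functions mentioned in the text, but their signatures and the choice of which coordinates the parity test uses are guesses.

```python
def halve_x_double_y(my_address, M, SA):
    # M = (l, m, sigma) is the current PRGC mapping; SA is this node's subarray.
    M_prime = map_halve_x_double_y(M)              # new mapping per the theorem (Sec. 4.1)
    i, j = get_pos_from_address(my_address, M)     # this node's position in the current mesh
    partner = get_address_from_pos((i ^ 1, j), M)  # partner node along the x axis
    if (i + j + j // 2) % 2 == 0:                  # "1" entry in the matrix above
        SA = send_high_get_low_y(SA, partner)      # send high y-half, receive partner's low y-half
    else:                                          # "0" entry
        SA = send_low_get_high_y(SA, partner)      # send low y-half, receive partner's high y-half
    SA = reorganize(SA, M_prime)   # SA is now twice as wide along x and half as tall along y
    return M_prime, SA
```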


$M'$ is the new mapping. Note that the shape of each node's SA changes: it is twice its original size along the x axis and half its original size along the y axis.

It is now straightforward to reconfigure a logical processor mesh of size $2^l \times 2^m$ into one of some other size $2^{l'} \times 2^{m'}$, provided $l + m = l' + m'$: simply halve/double along one dimension of the mesh until the desired dimensions are reached. The HalveDouble operation is essentially unchanged for the higher dimensional meshes defined in Section 3.3, because each invocation involves only two dimensions; the elements of the permutation vector ($\sigma$) assigned to the other dimensions are unchanged after the operation. Details of the implementation of HalveDouble for 3D meshes are provided in [9].
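A sketch of this reshaping loop (ours; halve_y_double_x_mapping denotes the symmetric variant of the transformation shown earlier and is assumed, not defined in the paper excerpts above):

```python
def reshape_mapping(M, target_l):
    """Walk a 2D PRGC mapping to a 2**target_l x 2**(d - target_l) mesh by repeated
    HalveDouble steps (each step also implies one subarray exchange on every node)."""
    l, m, sigma = M
    while l > target_l:
        l, m, sigma = halve_x_double_y_mapping(l, m, sigma)
    while l < target_l:
        l, m, sigma = halve_y_double_x_mapping(l, m, sigma)   # symmetric variant (assumed)
    return (l, m, sigma)
```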

5. DATA INTERLEAVING

Interleaving is desirable when processing arrays that contain a substantial fraction of constant-value regions. We now extend the distribution scheme to allow interleaving while maintaining the physical adjacency property. No extra communication is incurred; the added data reorganization cost is proportional to the size of the data in each node. The interleaving scheme is compatible with the mesh reconfiguration algorithm: data remains interleaved as the shape of the mesh is changed.

The idea is to distribute the array over a virtual processor mesh (which we abbreviate to virtual mesh). The virtual mesh is then mapped to a virtual hypercube. The dimension of the virtual hypercube is larger than that of the physical hypercube; the virtual hypercube can be interpreted as an extension of the physical hypercube obtained by adding one or more virtual (hypercube) dimensions. All nodes in the virtual hypercube whose addresses differ only in bits corresponding to the virtual (hypercube) dimensions are mapped to a single physical node; i.e., subarrays corresponding to these virtual nodes are stored in the same physical node. Figure 4 illustrates this for interleaving a 1D array over a 4-node hypercube: a 16 × 1 virtual processor mesh is mapped to a 16-node virtual hypercube, and the virtual hypercube is then mapped to the 4-node physical hypercube. A larger number of virtual dimensions corresponds to a larger virtual processor mesh, and hence smaller subarrays and finer interleaving. The user decides the number of extra dimensions to be added based on the degree of interleaving required: there is a tradeoff between the communication savings of having larger coherent regions in a single node and the better load balance provided by interleaving.

FIG. 4. A 1D mesh in a 16-node virtual hypercube. The virtual hypercube is embedded in a 4-node physical hypercube (shaded areas).


5.1. Static Interleaving of 1D Arrays

We first describe a 1D array distributed over a 1D virtual mesh, which is mapped to an order $d$ virtual hypercube using row major RGC mapping, as in Fig. 4. Let $d'$ be the order of the physical hypercube ($d' \le d$). The virtual hypercube is formed by adding $d - d'$ extra bits to the left of the physical node address. The least significant $d'$ bits of the virtual node address determine the physical node address and are called the physical bits; the remaining $d - d'$ bits are the virtual bits. The row major RGC mapping maps adjacent virtual mesh nodes to adjacent nodes in the virtual hypercube. These mesh nodes are also mapped to the same or adjacent physical nodes, because the addresses of any two adjacent virtual nodes differ by exactly one bit, and hence any subsequence of the bits can differ in at most one bit. We use the least significant bits to determine the physical node address since they toggle most rapidly in the RGC sequence. Hence, consecutive nodes in the virtual mesh "cycle" among the physical nodes, with at most two consecutive mesh nodes mapped to the same physical node.

For example, if a 1D array is distributed over a 1 × 16 virtual mesh (Fig. 4), each virtual mesh node holds a subarray whose dimension size is 1/16th that of the array. The virtual mesh is mapped to the following sequence of virtual node addresses (the 4-bit RGC):

(0000, 0001, 0011, 0010, 0110, 0111, 0101, 0100, 1100, 1101, 1111, 1110, 1010, 1011, 1001, 1000).

The physical hypercube in the figure is of order 2, so the above sequence translates to the following sequence of physical node addresses (obtained by extracting the 2 least significant bits of each address):

(00, 01, 11, 10, 10, 11, 01, 00, 00, 01, 11, 10, 10, 11, 01, 00), or (0, 1, 3, 2, 2, 3, 1, 0, 0, 1, 3, 2, 2, 3, 1, 0).

Thus, node 0 gets the 1st, 8th, 9th, and 16th subarrays; node 1 gets the 2nd, 7th, 10th, and 15th; and so on.

5.2. Static Interleaving of 2D Arrays

It is now straightforward to extend this idea to the 2D row major Gray code mapping. Recall that given a $2^l \times 2^m$ logical processor mesh, the bits of a hypercube node address $a = a_{d-1} \ldots a_1 a_0$ are organized as a tuple $(a_{d-1} \ldots a_m,\; a_{m-1} \ldots a_0)$ (see Section 2.3). The bits in the first and second elements are mapped to the $i$ and $j$ indices of the node in the logical node mesh based on an RGC. Let this mapping represent the mapping of a virtual mesh to a virtual hypercube. Zero or more least significant bits of each element in the tuple (a total of $d'$ bits) are combined to obtain the address of the physical node.

Note that if $l'$ least significant bits are extracted from the first element, the array data is interleaved across $2^{l'}$ physical nodes along the first array (or virtual mesh) dimension. In other words, if one were to traverse the subarrays (or the nodes of the virtual mesh) parallel to the first dimension, the visited subarrays would be distributed over exactly $2^{l'}$ physical nodes.
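The low-bit extraction described above is easy to check numerically; this short sketch (ours, reusing rgc() from the earlier sketch) reproduces the 1D example of Section 5.1:

```python
d, d_phys = 4, 2                        # orders of the virtual and physical hypercubes
virtual_addresses = rgc(d)              # 16 x 1 virtual mesh under row major RGC mapping
physical = [a & ((1 << d_phys) - 1) for a in virtual_addresses]   # keep the d' physical bits
# physical == [0, 1, 3, 2, 2, 3, 1, 0, 0, 1, 3, 2, 2, 3, 1, 0]
```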


5.3. Dynamic Interleaving of 2D Arrays

We now describe interleaving using the 2D PRGC (permuted reflected Gray code) mapping introduced in Section 3.2 and show how the array data remains interleaved after a HalveDouble operation.

With a PRGC mapping, each mesh index is used to index into an RGC sequence, whose bits are then permuted to obtain the node address. As with row major RGC mapping, to obtain proper interleaving we need to ensure that the least significant bits of the RGC get mapped to physical bits. In other words, given the mapping $M = (l, m, \sigma)$, where $\sigma = (\sigma_0, \sigma_1, \ldots, \sigma_{d-1})$, $\sigma$ should have the following form (physical bits underlined):

$$\sigma = (\underline{\sigma_0, \ldots, \sigma_{l'-1}}, \sigma_{l'}, \ldots, \sigma_{l-1}, \underline{\sigma_l, \ldots, \sigma_{l+m'-1}}, \sigma_{l+m'}, \ldots, \sigma_{d-1}).$$

The physical bits form two sequences, $(\sigma_0, \ldots, \sigma_{l'-1})$ and $(\sigma_l, \ldots, \sigma_{l+m'-1})$, where $0 \le l' \le l$, $0 \le m' \le m$, and $l' + m' = d'$, the number of physical bits. The physical bits are the least significant bits in the addresses of each of the two dimensions of the processor mesh.

The HalveDouble operation preserves this property as long as the dimension along which the mesh is halved has at least one physical bit. To see this, consider the mapping $M' = (l-1, m+1, \sigma')$ produced by applying HalveXDoubleY to $M$. $\sigma'$ is defined by (physical bits underlined)

$$\sigma' = (\underline{\sigma_1, \sigma_2, \ldots, \sigma_{l'-1}}, \sigma_{l'}, \ldots, \sigma_{l-1}, \underline{\sigma_0, \sigma_l, \ldots, \sigma_{l+m'-1}}, \sigma_{l+m'}, \ldots, \sigma_{d-1}).$$

The physical bits of $M'$ are still the least significant bits along both mesh dimensions. As with static 2D interleaving based on row major RGC mapping, if $l'$ physical bits are assigned to an axis, the array data along that axis is interleaved over $2^{l'}$ physical nodes. One can control the number of physical bits assigned to a particular axis (and hence the shape of the subarrays stored in each node) by using the HalveDouble operation. Once the number of physical bits along an axis is reduced to 0, rows of the array parallel to this axis are entirely contained in a single physical node, so there is no benefit to further halving the size of the virtual processor mesh along this dimension.

Figure 5 illustrates the interleaving scheme applied to an order 8 virtual hypercube (256 virtual nodes) and an order 2 physical hypercube (4 nodes). The filled rectangles represent subarrays stored in physical node 0. In Fig. 5, the virtual mesh on the left has dimensions 16 × 16. One bit is reserved for the physical hypercube along each mesh dimension.

FIG. 5. Interleaving a 2D array over a 4-node hypercube. The array is first mapped to a 16 × 16 virtual mesh (each small square represents a subarray stored in a node in the mesh). The subarrays that are eventually mapped to physical node 0 are filled.


A single HalveXDoubleY operation results in the distribution on the right: the virtual mesh is now 8 × 32. The x axis now has 0 physical bits assigned to it; hence, rows parallel to the x axis are entirely contained in a single physical node.

6. SUMMARY

We have introduced a scheme for organizing the nodes of a hypercube into a mesh whose dimension sizes may be changed at run time while maintaining the physical adjacency property. Algorithm analysis and supporting performance experiments indicate that a 256-node nCube 6400 hypercube can perform over 40 HalveDouble operations per second on a 256 × 256 × 256 volume [9]. The algorithm can be shown to be optimal for transforming a 1 × n mesh into an n × 1 mesh by comparing it to an optimal perfect exchange algorithm [9]. The scheme can be extended to allow dynamic interleaving of data, which is likely to lead to better load balancing for many algorithms. Our approach has a distinct advantage over other reconfiguration schemes because it maintains the physical adjacency property even with interleaving.

The ability to organize the hypercube into a reconfigurable mesh with interleaving transforms the hypercube into a flexible "volume processor." In addition, we see the possibility of embedding the reconfiguration and interleaving capabilities into a parallel language.

REFERENCES

1. A. Agrawal, G. E. Blelloch, R. L. Krawitz, and C. A. Phillips, Four vector-matrix primitives, in "Proc. 1989 ACM Symposium on Parallel Algorithms and Architectures," pp. 292–302, June 1989.
2. D. P. Bertsekas and J. N. Tsitsiklis, "Parallel and Distributed Computation," Prentice Hall, Englewood Cliffs, NJ, 1989.
3. E. Camahort and I. Chakravarty, Integrating volume data analysis and rendering on distributed memory architectures, in "Proc. 1993 Parallel Rendering Symposium," pp. 89–96, ACM Press, Oct. 1993.
4. E. Catmull and A. R. Smith, 3D transformations of images in scanline order, Comput. Graphics (Proc. SIGGRAPH '80) 14, 3 (July 1980), 279–285.
5. T. F. Chan and Y. Saad, Multigrid algorithms on the hypercube multiprocessor, IEEE Trans. Comput. C-35, 11 (Nov. 1986), 969–977.
6. R. A. Drebin, L. Carpenter, and P. Hanrahan, Volume rendering, Comput. Graphics (Proc. SIGGRAPH '88) 22, 4 (Aug. 1988), 65–74.
7. S. L. Johnsson and C. T. Ho, Optimum broadcasting and personalized communication in hypercubes, IEEE Trans. Comput. 38, 9 (Sept. 1989), 1249–1268.
8. S. L. Johnsson and C. T. Ho, The complexity of reshaping arrays on boolean cubes, in "Proc. Fifth Distributed Memory Computing Conference," pp. 370–377, IEEE Computer Society Press, Apr. 1990.
9. J. M. Joy, "Parallel Algorithms for Volume Rendering," Computer Science Technical Report 91-16, University of New Hampshire, Dec. 1991.
10. S. D. Kaushik, C. H. Huang, R. W. Johnson, and P. Sadayappan, An approach to communication efficient data redistribution, in "Proc. International Conference on Supercomputing," pp. 364–373, ACM Press, July 1994.
11. M. Levoy, Efficient ray tracing of volume data, ACM Trans. Graphics 9, 3 (July 1990), 245–261.
12. C. Montani, R. Perego, and R. Scopigno, Parallel volume visualization on a hypercube architecture, in "Proc. 1992 Workshop on Volume Visualization," pp. 9–16, ACM Press, Oct. 1992.
13. S. Ramaswamy and P. Banerjee, Automatic generation of efficient array redistribution routines for distributed memory multicomputers, in "Proc. Fifth Symposium on the Frontiers of Massively Parallel Computation," pp. 342–349, IEEE Computer Society Press, Feb. 1995.


14. P. Schröder and J. B. Salem, Fast rotation of volume data on data parallel architectures, in "Proc. IEEE Visualization '91," pp. 50–57, IEEE Computer Society Press, Oct. 1991.
15. R. Thakur, A. Choudhary, and G. Fox, Runtime array redistribution in HPF programs, in "Proc. Scalable High Performance Computing Conference," pp. 309–316, IEEE Computer Society Press, 1994.
16. G. Vézina, P. A. Fletcher, and P. K. Robertson, Volume rendering on the MasPar MP-1, in "Proc. 1992 Workshop on Volume Visualization," pp. 3–7, ACM Press, Oct. 1992.

JOSEPH M. JOY received a B.Tech from the Indian Institute of Technology, Madras, in 1985; he received an M.S. in ocean engineering in 1989 and an M.S. in computer science in 1991, both from the University of New Hampshire. Currently, he is a technical lead in the Microsoft Windows NT Networking and Communications Group.

R. DANIEL BERGERON (Professor, computer science) has been a faculty member at the University of New Hampshire since 1974 and served as the first Computer Science Chair from 1981 to 1987 and again from 1995 to 1997. He received an Sc.B. in applied mathematics in 1966 and a Ph.D. in computer science in 1973, both from Brown University. He was founding Editor-in-Chief of ACM Transactions on Graphics and currently serves on the Editorial Board for "Computers and Graphics." He has chaired and served on numerous program committees for SIGGRAPH, IEEE Visualization, and Eurographics. He is currently serving on the Executive Board for Eurographics. Professor Bergeron has research interests and publications in the areas of computer graphics, scientific visualization, scientific database systems, user interface design, intelligent tutoring systems, software engineering, parallel and distributed computing, and expert systems. Professor Bergeron's current research is focused on the integration of multi-dimensional scientific data visualization techniques into a comprehensive scientific database environment.

Received February 23, 1995; revised October 14, 1997; accepted October 17, 1997