Journal of Systems Architecture 51 (2005) 281–295 www.elsevier.com/locate/sysarc

Generalized parallel divide and conquer on 3D mesh and torus

Ali Karci *

Department of Computer Engineering, Faculty of Engineering, Fırat University, 23119 Elazig, Turkey

Received 23 July 2003; received in revised form 12 May 2004; accepted 11 June 2004; available online 4 January 2005

* Corresponding author. Tel.: +90 424 237 2035; fax: +90 424 218 1907. E-mail address: akarci@firat.edu.tr

Abstract

In this paper, we address the problem of mapping the divide-and-conquer paradigm onto 3D mesh and torus interconnection networks. Since the binary tree is not an efficient computation structure for this purpose, we select the binomial tree as the computation structure. We propose an algorithm for divide and conquer on 3D meshes/tori, and then give the dilation of this algorithm for any 3D mesh whose side lengths are powers of 2; the congestion of the embedding is 1, since each binomial tree consists of two edge-disjoint binomial trees B(n − 1). The communication times of the proposed algorithm under the store-and-forward routing mechanism are evaluated for some specific values of the message ratio a. The results for the wormhole routing mechanism are better than those for store-and-forward routing, because the dilation of the embedding is not unit. The efficiency of the proposed algorithm is also investigated. If the sequential algorithm has a complexity (number of computations) that is quadratic in the size of the data, then the proposed algorithm is cost-optimal provided the routing mechanism is wormhole. Under store-and-forward routing, the number of computations in the sequential algorithm does not make the proposed algorithm cost-optimal: the communication time is dominant and the computation time is less significant than the communication time.

© 2004 Elsevier B.V. All rights reserved.

Keywords: Parallel computation; Binomial tree; Task graph; Divide and conquer

1. Introduction

Simulation of one interconnection network in another is an important problem in parallel computing. The simulation problem can be abstractly formulated as a problem of graph embedding, or simply embedding, where the simulated network is called the guest graph and the network on which the simulation is carried out is called the host graph. Mapping the computation onto the 3D mesh/torus is thus the embedding of a binomial tree onto this architecture.

The problem of embedding a guest graph G = (V1, E1) into a host graph H = (V2, E2) is to find a one-to-one function g: V1 → V2. The dilation cost of the embedding is max{dH(g(u), g(v)) : (u, v) ∈ E1}, where dH is the distance between two nodes in the host graph H. In other words, the dilation is the longest distance between the images of adjacent nodes of G. The embedding is adjacency preserving if the dilation of g is equal to 1. The expansion cost is defined as the ratio |V2|/|V1|. Congestion is another criterion in graph embedding: the congestion is the maximum number of times any edge of the host graph H is used by the paths corresponding to the mappings of the edges of the guest graph G. If the dilation cost is minimum, the expansion cost is minimum, and the embedding is congestion-free (congestion of 1), then the embedding is optimum. Thus, the purpose of an embedding is to minimize the dilation cost, the expansion cost and the congestion. Minimizing the dilation cost minimizes the degradation in time performance when emulating the guest architecture; minimizing the expansion cost minimizes the hardware needed by the host architecture to emulate the guest architecture; and minimizing the congestion minimizes the contention on any edge of the host graph.

The problem of tree embedding in interconnection networks has been investigated by a number of researchers. It concerns parallel computations in which the processes are organized as a tree. Such a computation starts with only one process (the root of the tree). As the computation proceeds, existing processes can create new processes (new tree nodes). When a new process is spawned, it must be immediately assigned to some processor. Example applications include divide-and-conquer algorithms, branch-and-bound computations, backtrack search algorithms, game-tree evaluation, functional and logical programs, and various numeric computations [4–9]. There are two main considerations in embedding trees onto interconnection networks. First, each parent–child pair of tree nodes should be assigned to processors that are close to each other; the maximum distance between a parent node and a child node is the dilation of the embedding, and it should be as small as possible, since it affects the communication overhead of the parallel computation. Second, the number of tree nodes assigned to a processor is called the load of the processor.

Divide and conquer arises in many computations as a problem-solving strategy. If a problem can be divided into subproblems and the subproblems are independent of each other, they may be executed in parallel, which makes divide and conquer a useful paradigm for designing large-scale parallel programs [10–15]. It is applicable to a wide range of applications. The aim of this paper is to introduce the problem of mapping degree-2 divide-and-conquer computations onto 3D meshes with or without wraparound connections (tori). We show that the computation runs in a phase-by-phase manner and that there is a regular pattern in the times at which messages are sent. Moreover, the message volumes also exhibit regularity: at each step the volume of the messages being sent is the same. We were inspired by the study of Lo et al. [1], who proposed two mapping algorithms for embedding a binomial tree onto 2D meshes; here we extend the reflecting mapping to the 3D mesh and torus. The communication times for the wormhole and store-and-forward routing mechanisms are evaluated, and the dilation and congestion of the embedding are determined. The cost-optimality of the proposed algorithm is investigated with respect to the wormhole and store-and-forward routing mechanisms and with respect to linear or quadratic computation algorithms.

Similar work exists in the literature [16,17]. Gonzalez et al. [17] proposed the execution of hypercube algorithms on meshes. Valero-Garcia et al. [16] proposed a parallel divide-and-conquer algorithm on the mesh and evaluated the performance of two embedding schemes. The dilation of their embedding scheme was evaluated with a formula that does not consider an initiating node that may lie near a side of the mesh; in that case the stated dilation does not hold and the actual dilation is larger. In this paper, we consider this case as well.

The paper consists of five sections. Section 1 is this introduction; Sections 2 and 3 review the architectures, previous work and the computation structure. Section 4 describes the embedding of the binomial tree B(n) onto the 3D mesh (torus) and its consequences; the last section concludes the paper and discusses the results of this study.
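The three embedding costs defined above can be computed mechanically for small graphs. The following is a minimal sketch (not from the paper), assuming the guest and host graphs are given as edge lists, the embedding as a Python dict, and that guest edges are routed along BFS shortest paths in the host; the reported congestion depends on that routing choice, and the helper names are illustrative.

```python
# Illustrative sketch (not from the paper): dilation, expansion and congestion
# of an embedding g of a guest graph G into a host graph H.
from collections import deque

def bfs_path(host_adj, src, dst):
    """Return one shortest path from src to dst in the host graph."""
    prev = {src: None}
    q = deque([src])
    while q:
        u = q.popleft()
        if u == dst:
            break
        for v in host_adj[u]:
            if v not in prev:
                prev[v] = u
                q.append(v)
    path, u = [], dst
    while u is not None:
        path.append(u)
        u = prev[u]
    return path[::-1]

def embedding_costs(guest_edges, host_edges, g):
    host_adj = {}
    for u, v in host_edges:
        host_adj.setdefault(u, []).append(v)
        host_adj.setdefault(v, []).append(u)
    edge_use = {}       # how many guest edges are routed over each host edge
    dilation = 0
    for u, v in guest_edges:
        path = bfs_path(host_adj, g[u], g[v])
        dilation = max(dilation, len(path) - 1)
        for a, b in zip(path, path[1:]):
            key = frozenset((a, b))
            edge_use[key] = edge_use.get(key, 0) + 1
    guest_nodes = {x for e in guest_edges for x in e}
    expansion = len(host_adj) / len(guest_nodes)   # |V2| / |V1|
    congestion = max(edge_use.values()) if edge_use else 0
    return dilation, expansion, congestion
```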

2. 3D Mesh and torus and literature

In a k-ary n-cube, where k is referred to as the radix and n as the dimension, N = k^n nodes are arranged in n dimensions, with k nodes per dimension. Each node can be identified by an n-digit radix-k address (a_1, a_2, ..., a_n); the ith digit of the address vector, a_i, gives the node position in the ith dimension. Nodes with addresses (a_1, a_2, ..., a_n) and (b_1, b_2, ..., b_n) in a k-ary n-cube are connected if and only if there exists an i, 1 ≤ i ≤ n, such that a_i = (b_i ± 1) mod k and a_j = b_j for all j ≠ i. Thus each node is connected to two neighbouring nodes in each dimension. If n is chosen as 3, the resulting architecture is a 3D torus; if the wraparound edges are removed from the torus, a 3D mesh is obtained.

The divide-and-combine, or divide-and-conquer, paradigm is a popular and useful technique in algorithm design and implementation for single-processor and multi-processor systems [2–4]. Lo and her colleagues used divide and conquer for solving a problem in parallel on 2D meshes and proposed two algorithms for this purpose [1]. There is also an algebraic theory describing this type of computation on hypercube-like interconnection networks [5].
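As a concrete illustration (not part of the paper), the adjacency rule above can be written directly in code; `torus_neighbours` and `mesh_neighbours` are illustrative helper names.

```python
# Illustrative sketch: neighbours of a node in a k-ary n-cube (torus) and in
# the corresponding mesh (wraparound links removed). A node is an n-digit
# radix-k address (a_1, ..., a_n).
def torus_neighbours(addr, k):
    """All nodes differing by +/-1 (mod k) in exactly one dimension."""
    nbrs = []
    for i in range(len(addr)):
        for delta in (-1, +1):
            nbr = list(addr)
            nbr[i] = (nbr[i] + delta) % k
            nbrs.append(tuple(nbr))
    return nbrs

def mesh_neighbours(addr, k):
    """Same, but without the wraparound edges."""
    nbrs = []
    for i in range(len(addr)):
        for delta in (-1, +1):
            if 0 <= addr[i] + delta < k:
                nbr = list(addr)
                nbr[i] = nbr[i] + delta
                nbrs.append(tuple(nbr))
    return nbrs

# A 3D torus is the case n = 3, e.g. k = 4:
print(torus_neighbours((0, 0, 3), 4))   # wraps around in the third dimension
print(mesh_neighbours((0, 0, 3), 4))    # the wraparound neighbour disappears
```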

3. Divide and conquer on 3D mesh and torus

The complete binary tree can be considered the traditional task-graph structure for a degree-2 divide-and-conquer algorithm. The computation proceeds from the root to the leaves and back up the tree in a level-by-level manner, so at any time only the nodes in a given level are active. Thus the complete binary tree is not efficient. The "keep half, send half" strategy is a simple way to improve performance. In each phase the message is divided into two parts, and the parameter a denotes the message size in each phase as a fraction of the incoming message size. Each node in the tree then performs the following computation. Every node except the root receives a problem from its parent; if the problem is small enough, the node solves it. Otherwise, it divides the problem into two parts (of sizes am and (1 − a)m) and sends the part of size am to its child to solve. In parallel, it solves the part of size (1 − a)m if that part is small enough; otherwise it continues the dividing process until it reaches a problem small enough to be solved locally. After the dividing-and-solving phase, it gets the results from its children and combines them, repeating until all children's results have been combined, and then sends the result to its parent.

The traffic volumes that occur in both stages are determined by the parameter a. The channel bandwidth is assumed identical for every edge. The case a = 1 gives uniform traffic; the case a = 1/2 is the case where the message volume is halved in each phase. Different values of a arise in practical problems. The divide stage consists of receiving a message from the parent once and sending messages to the children multiple times (once to each child). The combine stage consists of the message traffic in the opposite direction: receiving messages from the children multiple times and sending a message to the parent once. The data-flow patterns of the two stages are identical except for direction, so analysing one stage covers the other without loss of generality.

The structure of divide and conquer on the 3D mesh (or torus), viewed as a task graph, is the same as a binomial tree, and embedding a binomial tree onto the 3D mesh (or torus) with minimum congestion and dilation yields the structure of the computation on the 3D mesh (or torus). The leaves of the binomial tree represent the last processors that perform computation and return their results to the upper level. The binomial tree is defined as follows.

Binomial tree. The binomial tree B(n) is an ordered tree defined recursively [2,4]. The binomial tree B(0) consists of a single node. The binomial tree B(n) consists of two binomial trees B(n − 1) that are linked together: the root of one is the leftmost child of the root of the other. For the binomial tree B(n):

1. B(0) is a single node with no edges.
2. There are 2^n nodes.
3. The height of the tree is n.
4. There are exactly $\binom{n}{i}$ nodes at depth i, for i = 0, 1, ..., n.
5. The root has degree n, which is greater than that of any other node; moreover, if the children of the root are numbered from left to right by n − 1, n − 2, ..., 0, then child i is the root of a subtree B(i).
6. In other words, B(n) consists of two copies of B(n − 1) together with an edge connecting their roots, one of which is designated as the root of B(n). The root of B(n) has n edges, which means it has n children, namely the roots of B(n − 1), B(n − 2), ..., B(0), and the subtree rooted at any node is itself a binomial tree.

The computation graph of keep-half, send-half divide and conquer with n divide phases is B(n). In addition to the topological properties of communication in divide and conquer, it is important to specify exactly when a given edge is active. There are a total of n communication phases, numbered 1 to n. We assume that every node receives a message exactly once; after receiving a message, a node is regarded as active. Except for the leaves, each node activates one new node in each phase. If the root node receives a message of size m, it divides this message into two parts and sends one part to its child; this is the first phase. The root node has n children and there are n phases in total. After the nth phase, the leaf farthest from the root has received its message; all other leaves receive their messages at the same time. The number of nodes of any B(n) is a power of 2, so the number of nodes of any 3D mesh (or torus) into which B(n) is embedded is 1, 2, 4, 8, 16, etc.
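The recursive definition translates directly into code. The sketch below is an illustration (not the author's implementation); it builds B(n) with the bit-string labels used in this paper, the root being 11...1, and checks properties 2, 4 and 5.

```python
# Illustrative sketch: build B(n) recursively and check its properties.
from math import comb

def binomial_tree(n):
    """Return (root, edges) of B(n); edges are (parent, child) label pairs."""
    if n == 0:
        return "", []                         # B(0): a single node, no edges
    root1, edges1 = binomial_tree(n - 1)      # copy that keeps the root
    root0, edges0 = binomial_tree(n - 1)      # copy hung below the root
    edges = ([("1" + u, "1" + v) for u, v in edges1] +
             [("0" + u, "0" + v) for u, v in edges0] +
             [("1" + root1, "0" + root0)])    # edge linking the two roots
    return "1" + root1, edges

n = 4
root, edges = binomial_tree(n)
nodes = {root} | {v for e in edges for v in e}
assert len(nodes) == 2 ** n                               # property 2
assert sum(1 for p, c in edges if p == root) == n         # property 5
children = {}
for p, c in edges:
    children.setdefault(p, []).append(c)
depth, stack = {root: 0}, [root]
while stack:                                              # depths by traversal
    u = stack.pop()
    for c in children.get(u, []):
        depth[c] = depth[u] + 1
        stack.append(c)
for i in range(n + 1):                                    # property 4
    assert sum(1 for d in depth.values() if d == i) == comb(n, i)
```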

4. Embedding of B(n) into 3D mesh and torus

Let us summarize the work done in [1]. The authors proposed the use of a binomial tree to represent a divide-and-conquer algorithm. Every node of the binomial tree represents a process that performs the following computations:

• Receive a problem (of size x) from the parent (the host, if the node is the root of the tree).
• Solve the problem locally (if it is small enough), or divide the problem into two parts of sizes ax and (1 − a)x, and spawn a child process to solve one of these parts. In parallel, start solving the other part by repeating this step.
• Get the results from the children and combine them.
• Send the results to the parent (the host, if the node is the root).

Fig. 1 depicts some binomial trees. A binomial tree B(n) has 2^n nodes. The division and combine stages are organized in n phases each. In the first phase of the division stage, only the root node (11...1) is active. In the second phase, nodes 01...1 and 11...1 are active. In the third phase, nodes 111...1, 101...1, 011...1 and 001...1 are active. In general, there are 2^{i−1} active nodes in phase i.

The authors in [1] proposed two embedding schemes, called the Reflecting and Growing schemes. In this paper we use only the Reflecting scheme, since the Growing scheme has worse dilation, and we present a method to extend the Reflecting scheme to 3D. We present the embedding of B(n) into a 3D mesh (or torus) of size 2^x × 2^y × 2^z, where there are three cases (the implementations of these cases are the same, so one case suffices to illustrate the idea):

Case 1: y = 2x and x = z.
Case 2: y = x and x = 2z.
Case 3: y = x = z.

Each node of B(n) is mapped to a distinct node of the 3D mesh (or torus). The embedding is completely specified by the node mapping (the embedding of B(n) is denoted E(B(n))). The embedding of B(n) onto the 3D mesh (or torus) can be summarized as follows.

[Fig. 1. Binomial trees: panels (a)–(c) show the small trees B(0)–B(4), the construction of B(n) from two copies of B(n − 1), and B(n) as a root with subtrees B(n − 1), B(n − 2), ..., B(1), B(0).]

Definition 1. The embedding of the binomial tree into the 3D mesh or torus is specified as follows, for f ∈ Z+ (Z+ is the set of positive integers). The algorithm implementing this definition is recursive: whenever an architecture is divided into two sub-architectures, each sub-architecture is treated as a whole architecture.

(a) B(0) is a single node and is embedded in a 3D mesh (or torus) of size 2^0 × 2^0 × 2^0 (a single node).
(b) If n = 3f, the embedding of B(n) is constructed by taking two copies of E(B(3f − 1)) and placing the second copy, reflected about the xy-plane, in the z− direction, the first copy being placed in the z+ direction. The roots of z− and z+ are connected, and the root of z− is the root of E(B(3f)).
(c) If n = 3f − 1, the embedding of B(n) is constructed by taking two copies of E(B(3f − 2)) and placing the second copy, reflected about the xz-plane, in the y− direction, the first copy being placed in the y+ direction. The roots of y− and y+ are connected, and the root of y− is the root of E(B(3f − 1)).
(d) If n = 3f − 2, the embedding of B(n) is constructed by taking two copies of E(B(3f − 3)) and placing the second copy, reflected about the yz-plane, in the x+ direction, the first copy being placed in the x− direction. The roots of x− and x+ are connected, and the root of x+ is the root of E(B(3f − 2)).

There are three strategies for the order of axes used when reflecting the copies of the embedding. The order of axes in each strategy is specified with respect to the number of nodes in each dimension, as shown in Table 1. The algorithm in this paper is written with respect to Strategy-1.

Table 1
Strategies for embedding binomial trees onto the 3D mesh (or torus)

n         Strategy-1     Strategy-2     Strategy-3
3k        z− ↔ z+        y− ↔ y+        x− ↔ x+
3k − 1    y− ↔ y+        x− ↔ x+        z− ↔ z+
3k − 2    x− ↔ x+        z− ↔ z+        y− ↔ y+


Algorithm 1. ToBinary(Node_Axis, s, L, exp)
  if exp = 0 then
    return L                                  // L is the label
  else if s = '+' then                        // s is the sign
    if 0 ≤ Node_Axis < 2^exp / 2 then
      L = L ‖ 1
      return ToBinary(Node_Axis, '−', L, exp − 1)
    else
      L = L ‖ 0
      Node_Axis = Node_Axis − 2^exp / 2
      return ToBinary(Node_Axis, '+', L, exp − 1)
  else                                        // s = '−'
    if 0 ≤ Node_Axis < 2^exp / 2 then
      L = L ‖ 0
      return ToBinary(Node_Axis, '−', L, exp − 1)
    else
      L = L ‖ 1
      Node_Axis = Node_Axis − 2^exp / 2
      return ToBinary(Node_Axis, '+', L, exp − 1)

Algorithm 2. OneNodeMapping(NodeAxis_x, NodeAxis_y, NodeAxis_z, exp_x, exp_y, exp_z)
  L_x = ToBinary(NodeAxis_x, '−', Null_String, exp_x)
  L_y = ToBinary(NodeAxis_y, '−', Null_String, exp_y)
  L_z = ToBinary(NodeAxis_z, '−', Null_String, exp_z)
  if exp_x = exp_y = exp_z = 0 then
    stop and return an error message (there is no node label to be converted to binary)
  else if exp_x = exp_y = exp_z ≠ 0 then
    NodeLabel = L_z[1] ‖ L_y[1] ‖ L_x[1] ‖ L_z[2] ‖ L_y[2] ‖ L_x[2] ‖ ... ‖ L_z[r] ‖ L_y[p] ‖ L_x[q],
      where length(L_z) = r, length(L_y) = p and length(L_x) = q
  else if exp_x = exp_y > exp_z then
    NodeLabel = L_y[1] ‖ L_x[1] ‖ L_z[1] ‖ L_y[2] ‖ L_x[2] ‖ L_z[2] ‖ ... ‖ L_y[r] ‖ L_x[p] ‖ L_z[q],
      where length(L_y) = r, length(L_x) = p and length(L_z) = q
  else if exp_x > exp_y = exp_z then
    NodeLabel = L_x[1] ‖ L_z[1] ‖ L_y[1] ‖ L_x[2] ‖ L_z[2] ‖ L_y[2] ‖ ... ‖ L_x[r] ‖ L_z[p] ‖ L_y[q],
      where length(L_x) = r, length(L_z) = p and length(L_y) = q

The embedding of B(n) onto the torus or 3D mesh can be denoted by OneNodeMapping(a1, a2, a3, x, y, z), where a1 is the first dimension element

and belongs to the x-axis, a2 is the second dimension element and belongs to the y-axis, and a3 is the third dimension element and belongs to the z-axis. x, y and z are nonnegative integers that determine the size of the 3D mesh/torus, the size being 2^x × 2^y × 2^z. The algorithm for embedding a binomial tree onto the 3D mesh or torus can be explained through the mapping of a single node; the idea is then applied to all nodes, and the single-node case is handled by Algorithm 1. This algorithm converts one axis element of a node label to binary according to Definition 1. Each node in the 3D mesh or torus has a triplet label (it consists of three axis elements). ToBinary(·) converts only one element of the triplet, so it must be invoked for each element when converting a triplet node label to a single binary label; this is what Algorithm OneNodeMapping(·) does. To convert all node labels, OneNodeMapping(·) is invoked once for each node.

Let the dimensions of the 3D mesh (torus) be 2^x × 2^y × 2^z; then the binomial tree consisting of 2^{x+y+z} = 2^n nodes can be embedded onto this architecture, and B(n) is embedded onto the whole architecture. In order to embed B(n) onto a 3D mesh of size 2^x × 2^y × 2^z, two copies of B(n − 1) are taken and the issue is to decide where to embed 0‖B(n − 1) and where to embed 1‖B(n − 1). Each B(n − 1) in turn consists of two copies of B(n − 2), and again the issue is to decide where to embed 00‖B(n − 2) and 01‖B(n − 2), and so on; this process continues until B(0) is reached. As an example, Fig. 2 depicts the embeddings of B(1), B(2), B(3) and B(4).

Let the node (3, 1, 1) of the 3D mesh of size 2^2 × 2^1 × 2^1 be mapped onto B(4). First, the element a1 = 3 must be converted to binary using ToBinary(·). The algorithm is invoked as ToBinary(3, −, NullString, 2) and returns 10, so L_x = 10. The second and third elements are the same, and the 3D mesh has the same number of nodes in both of those dimensions, so the results of ToBinary(·) are the same: L_y = L_z = 1. Finally, the node (3, 1, 1) of the 3D mesh is mapped to the node 1110 of B(4), and vice versa.
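For readers who want to experiment, the following is a direct Python transcription of Algorithms 1 and 2 (an illustration under the Strategy-1 ordering, not the author's code); the way axis strings of unequal length are interleaved is an assumption, checked against the worked example above.

```python
# Illustrative transcription of Algorithms 1 and 2 into Python.
def to_binary(node_axis, sign, label, exp):
    """Convert one axis coordinate into its bit string (Algorithm 1)."""
    if exp == 0:
        return label
    half = 2 ** exp // 2
    if sign == '+':
        if node_axis < half:
            return to_binary(node_axis, '-', label + '1', exp - 1)
        return to_binary(node_axis - half, '+', label + '0', exp - 1)
    else:  # sign == '-'
        if node_axis < half:
            return to_binary(node_axis, '-', label + '0', exp - 1)
        return to_binary(node_axis - half, '+', label + '1', exp - 1)

def one_node_mapping(ax, ay, az, exp_x, exp_y, exp_z):
    """Interleave the three axis strings as in Algorithm 2."""
    lx = to_binary(ax, '-', '', exp_x)
    ly = to_binary(ay, '-', '', exp_y)
    lz = to_binary(az, '-', '', exp_z)
    if exp_x == exp_y == exp_z == 0:
        raise ValueError("there is no node label to be converted to binary")
    if exp_x == exp_y == exp_z:
        order = (lz, ly, lx)
    elif exp_x == exp_y > exp_z:
        order = (ly, lx, lz)
    else:  # exp_x > exp_y == exp_z
        order = (lx, lz, ly)
    out = []
    for i in range(max(map(len, order))):
        for s in order:          # shorter strings simply run out earlier
            if i < len(s):
                out.append(s[i])
    return ''.join(out)

# The example above: node (3, 1, 1) of the 2^2 x 2^1 x 2^1 mesh.
assert to_binary(3, '-', '', 2) == '10'
assert one_node_mapping(3, 1, 1, 2, 1, 1) == '1110'
```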

[Fig. 2. Embeddings of B(1), B(2), B(3) and B(4) onto 3D meshes, with the root of the tree and the x, y and z directions marked.]

Let us introduce an operation used in the embedding process: translation.

Definition 2. Translation of a node v by a node t is a bitwise exclusive-OR of the addresses, Tr(t, v) = t ⊕ v. Translation of a graph G(V, E) with respect to a node t is the graph Tr(t, G(V, E)) = G(Tr(t, V), Tr(t, E)), where Tr(t, V) = {Tr(t, v) : v ∈ V} and Tr(t, E) = {(Tr(t, u), Tr(t, v)) : (u, v) ∈ E}.

Translation of a graph preserves the Hamming distance between nodes, and the translation operation preserves the dimension of every edge. The topology of B(n) therefore remains unchanged under translation. So, if the initiating node is v and it is different from the node 11...1, then all nodes of B(n) are translated by the complement of v. Translation of the binomial tree can be used in the case of the torus and of the mesh with wraparound links; translation makes the dilation larger if the mesh has no wraparound links (edges). This shows that B(n) can be translated so that the proposed algorithm makes any specified node of B(n) the root of the embedding (the root of B(n)).

Theorem 1. The algorithm for embedding B(n) onto a 3D mesh (torus) of size 2^p × 2^q × 2^r is correct.
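A minimal sketch (not from the paper) of the translation operation on bit-string labels, with a check of the Hamming-distance property and of the root relabelling described above:

```python
# Illustrative sketch: translation of node labels by XOR with a fixed node t.
def translate(t, v):
    """Bitwise XOR of two equal-length binary labels."""
    return ''.join('1' if a != b else '0' for a, b in zip(t, v))

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

u, v, t = '1110', '0110', '1011'
assert hamming(u, v) == hamming(translate(t, u), translate(t, v))

# If the computation is initiated at node v rather than at 11...1, every node
# label is translated by the complement of v, so v becomes the root label:
v = '0101'
complement = translate('1' * len(v), v)
assert translate(complement, v) == '1' * len(v)
```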

Proof. While the embedding of B(n) onto the 3D mesh (torus) is taking place, the dimension that contains the maximum number of nodes is taken as the reflecting axis. There are two submeshes with respect to the reflecting plane; these two submeshes contain two copies of B(n − 1), one of which has its labels prefixed with 1 and the other prefixed with 0. This process continues until a single node is reached. The proof is by structural induction.

Base step: Fig. 2 depicts some embeddings, and these can be taken as the base step (see also Figs. 3–5).

Hypothesis step: assume the algorithm works correctly for B(n − 1).

Induction step: there are three cases for embedding B(n) onto the 3D mesh (torus):

(1) p = max{p, q, r}. The reflecting axis is the x-axis (Fig. 3). In the x+ case, the labels of the nodes nearer to the node labelled (0, 0, 0) are prefixed with 1 and the other labels are prefixed with 0.
(2) q = max{p, q, r}. The reflecting axis is the y-axis (Fig. 4); the cases described for the x-axis also apply here.
(3) r = max{p, q, r}. The reflecting axis is the z-axis (Fig. 5); the cases described for the x-axis also apply here. □

[Fig. 3. Reflecting axis is the x-axis (cases x− and x+).]

[Fig. 4. Reflecting axis is the y-axis (cases y− and y+).]

[Fig. 5. Reflecting axis is the z-axis (cases z− and z+).]

Theorem 2. The dilation of embedding B(n) onto a 3D mesh (or torus) of size 2^p × 2^q × 2^r with respect to the algorithm given in this paper is 2^{h−1} − 1, where h = max{p, q, r} and n = p + q + r.

Proof. The current reflecting axis corresponds to the dimension that contains more nodes than the others; in the case of equality, any of the dimensions can be taken as the current reflecting dimension. Because of the reflection, an edge of the binomial tree is mapped onto a path of length at most half of the diameter of the corresponding 3D mesh/torus. Let the number of nodes in the current reflecting dimension be 2^h, as illustrated in Fig. 6. The number of nodes of the linear array that determines the dilation of the embedding of B(n) onto the 3D mesh is 2^{h−2} + 2^{h−2} = 2^{h−1}, and this linear array contains 2^{h−1} − 1 edges. □

[Fig. 6. Dilation with respect to the algorithm given in this paper, in a dimension containing 2^h nodes.]

Corollary 1. The embedding of the binomial tree onto the 3D mesh/torus is not optimum, since the dilation is not unit dilation; the expansion, however, is unit expansion.

Theorem 3. The congestion of embedding B(n) onto a 3D mesh (torus) with respect to the algorithm given in this paper is 1 (the embedding is congestion-free).

Proof. The embedding obtained from the algorithm given in this paper is congestion-free, since B(n) consists of two copies of B(n − 1) that have no edges in common, and the same holds for all subtrees. At each communication step, each active node sends a message to one of its children. This process is shown in Fig. 7 for the dividing phase; the same holds for the combine phase. In the dividing phase:

1st step: the active edge is the one incident on the roots of 1‖B(n − 1) and 0‖B(n − 1).
2nd step: the active edges are those incident on the roots of 01‖B(n − 2) and 00‖B(n − 2), and on the roots of 11‖B(n − 2) and 10‖B(n − 2), and so on.

The combine phase is similar to the dividing phase, except that the directions of sending and receiving messages are reversed. □

[Fig. 7. The active edges in the dividing phase of divide-and-conquer; each edge is labelled with its activation step number.]
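The step numbering in the proof above can be reproduced with a few lines of code. The sketch below assumes the bit-string labelling used in this paper (root 11...1, with bit s flipped from 1 to 0 at step s) and is an illustration rather than the author's code.

```python
# Illustrative sketch: the tree edges of B(n) that become active at each
# dividing step; there are 2^(s-1) active edges in step s.
from itertools import product

def active_edges(n, step):
    """Edges of B(n) activated in dividing step `step` (1-based)."""
    tail = '1' * (n - step)
    return [(''.join(w) + '1' + tail, ''.join(w) + '0' + tail)
            for w in product('01', repeat=step - 1)]

n = 3
for s in range(1, n + 1):
    print(s, active_edges(n, s))
# step 1: [('111', '011')]
# step 2: [('011', '001'), ('111', '101')]
# step 3: [('001', '000'), ('011', '010'), ('101', '100'), ('111', '110')]
```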

Under wormhole routing, a message consists of a stream of flits that are routed through the network with minimal buffering, in a pipelined fashion. In the absence of contention, the time a message of m words takes to travel from one processor to another over a path of length k is

t_com = t_s + m t_w + k t_h,

where t_s is the start-up time, t_w is the time for one word to traverse an edge, and t_h is the per-hop time.
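As a small illustration (not from the paper), the latency model above can be expressed as a one-line helper; the constants in the example call are arbitrary.

```python
# Illustrative helper: the wormhole latency model quoted above for a single
# message of m words over a path of k hops (no contention).
def t_wormhole(m, k, ts, tw, th):
    """t_com = t_s + m*t_w + k*t_h."""
    return ts + m * tw + k * th

print(t_wormhole(m=1024, k=7, ts=10.0, tw=0.5, th=0.05))
```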

Theorem 4. The total communication time of a parallel architecture is the sum of all times spent transmitting messages between pairs of nodes. The total communication time for a message of size m under the wormhole routing mechanism is (without loss of generality p ≥ q ≥ r):

(a) if (1 − a) ≪ 1, then a m t_w + t_h(2^p + 2^q + 2^r − p − q − r) + t_s(p + q + r);
(b) if a ≪ 1, then a m t_w(p + q + r − 6) + t_h(2^p + 2^q + 2^r − p − q − r) + t_s(p + q + r);
(c) if a = 1/2, then

  a m t_w \left( 3 - \frac{8^{p+q} + 8^{p+r} + 8^{q+r}}{8^{p+q+r}} \right) + t_h(2^p + 2^q + 2^r - p - q - r) + t_s(p + q + r).

Proof. The active edges of the binomial tree are mapped onto distinct edges of the 3D mesh/torus, so contention need not be included in the computation of the total communication time. With respect to wormhole routing, the communication time is

  t_{com} = [t_s + a m t_w + t_h(2^{p-1} - 1)] + [t_s + a(1-a) m t_w + t_h(2^{q-1} - 1)] + [t_s + a(1-a)^2 m t_w + t_h(2^{r-1} - 1)]
          + [t_s + a(1-a)^3 m t_w + t_h(2^{p-2} - 1)] + [t_s + a(1-a)^4 m t_w + t_h(2^{q-2} - 1)] + [t_s + a(1-a)^5 m t_w + t_h(2^{r-2} - 1)]
          + [t_s + a(1-a)^6 m t_w + t_h(2^{p-3} - 1)] + [t_s + a(1-a)^7 m t_w + t_h(2^{q-3} - 1)] + [t_s + a(1-a)^8 m t_w + t_h(2^{r-3} - 1)] + \cdots
          + [t_s + a(1-a)^{3(p-2)} m t_w + t_h] + [t_s + a(1-a)^{3(q-2)+1} m t_w + t_h] + [t_s + a(1-a)^{3(r-2)+2} m t_w + t_h]
          + [t_s + a(1-a)^{3(p-1)} m t_w + t_h] + [t_s + a(1-a)^{3(q-1)+1} m t_w + t_h] + [t_s + a(1-a)^{3(r-1)+2} m t_w + t_h].

Collecting terms,

  t_{com} = a m t_w \left( \sum_{i=3}^{p} (1-a)^{3(p-i)} + \sum_{i=3}^{q} (1-a)^{3(q-i)+1} + \sum_{i=3}^{r} (1-a)^{3(r-i)+2} \right)
          + t_h \left( \sum_{i=3}^{p} (2^{i-1} - 1) + \sum_{i=3}^{q} (2^{i-1} - 1) + \sum_{i=3}^{r} (2^{i-1} - 1) + 6 \right) + t_s(p + q + r).

a ≈ 1: the summation for p is \sum_{i=3}^{p-1} (1-a)^{3(p-i)} + (1-a)^{3(p-p)} = 0 + 1, and the summations for q and r are equal to 0. The summations in the coefficient of t_h are geometric series whose total is 2^p + 2^q + 2^r − p − q − r. Hence case (a) of the theorem.

a ≈ 0: the summations in the coefficient of t_w are equal to p − 2, q − 2 and r − 2, respectively, while the coefficients of t_s and t_h do not depend on a. Hence case (b).

a = 1/2: the communication time is

  t_{com} = a m t_w \left( \sum_{i=3}^{p} \left(\tfrac{1}{2}\right)^{3(p-i)} + \sum_{i=3}^{q} \left(\tfrac{1}{2}\right)^{3(q-i)+1} + \sum_{i=3}^{r} \left(\tfrac{1}{2}\right)^{3(r-i)+2} \right) + t_h(2^p + 2^q + 2^r - p - q - r) + t_s(p + q + r)
          = a m t_w \left( 3 - \frac{8^{p+q} + 8^{p+r} + 8^{q+r}}{8^{p+q+r}} \right) + t_h(2^p + 2^q + 2^r - p - q - r) + t_s(p + q + r). □

Under the store-and-forward routing mechanism, the message being sent is stored by the receiving processor, and only after the whole message has been stored is it forwarded to the next processor on the message's path. The communication time for store-and-forward routing for a message of size m traversing k edges is therefore

  t_com = (t_s + m t_w) k,

where t_s is the start-up time and t_w is the time for one word to traverse an edge.

Theorem 5. The total communication time for a message of size m under the store-and-forward routing mechanism is (without loss of generality p ≥ q ≥ r):

(a) if (1 − a) ≪ 1, then t_s(2^p + 2^q + 2^r − p − q − r) + a m t_w(2^{p−1} − 1);
(b) if a ≪ 1, then t_s(2^p + 2^q + 2^r − p − q − r) + a m t_w(2^p + 2^q + 2^r − p − q − r).

Proof. With respect to store-and-forward routing, the communication time is

  t_{com} = [t_s + a m t_w](2^{p-1} - 1) + [t_s + a(1-a) m t_w](2^{q-1} - 1) + [t_s + a(1-a)^2 m t_w](2^{r-1} - 1)
          + [t_s + a(1-a)^3 m t_w](2^{p-2} - 1) + [t_s + a(1-a)^4 m t_w](2^{q-2} - 1) + [t_s + a(1-a)^5 m t_w](2^{r-2} - 1) + \cdots
          + [t_s + a(1-a)^{3(p-2)} m t_w] + [t_s + a(1-a)^{3(q-2)+1} m t_w] + [t_s + a(1-a)^{3(r-2)+2} m t_w]
          + [t_s + a(1-a)^{3(p-1)} m t_w] + [t_s + a(1-a)^{3(q-1)+1} m t_w] + [t_s + a(1-a)^{3(r-1)+2} m t_w].

Collecting terms,

  t_{com} = t_s \left( \sum_{i=3}^{p} (2^{i-1} - 1) + \sum_{i=3}^{q} (2^{i-1} - 1) + \sum_{i=3}^{r} (2^{i-1} - 1) \right) + 6 t_s
          + a m t_w \left( \sum_{i=3}^{p} (2^{i-1} - 1)(1-a)^{3(p-i)} + \sum_{i=3}^{q} (2^{i-1} - 1)(1-a)^{3(q-i)+1} + \sum_{i=3}^{r} (2^{i-1} - 1)(1-a)^{3(r-i)+2} \right)
          + a m t_w \left[ (1-a)^{3(p-2)} + (1-a)^{3(p-1)} + (1-a)^{3(q-2)+1} + (1-a)^{3(q-1)+1} + (1-a)^{3(r-2)+2} + (1-a)^{3(r-1)+2} \right].

The coefficient of t_s is independent of a, and equals 2^p + 2^q + 2^r − p − q − r.

a ≈ 1: only the term with i = p survives, so the coefficient of t_w is a m (2^{p−1} − 1) and the communication time is t_s(2^p + 2^q + 2^r − p − q − r) + a m t_w(2^{p−1} − 1).

a ≈ 0: the coefficient of t_w becomes 2^p + 2^q + 2^r − p − q − r, so the communication time is t_s(2^p + 2^q + 2^r − p − q − r) + a m t_w(2^p + 2^q + 2^r − p − q − r). □
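The per-phase accounting used in the two proofs can be checked numerically. The sketch below (an illustration, not the author's code) assumes p ≥ q ≥ r, the reflecting order p, q, r, p, q, r, ..., and the convention from the proofs that the last two phases in each dimension take a single hop; `phase_hops` and `total_time` are hypothetical helper names.

```python
# Illustrative sketch: accumulate the per-phase communication terms.
# Dividing phase i (i = 1, ..., p+q+r) sends a*(1-a)**(i-1)*m words.
def phase_hops(p, q, r):
    """Path lengths (in edges) of the successive dividing phases."""
    hops = []
    for j in range(max(p, q, r)):
        for dim in (p, q, r):
            if j < dim:
                hops.append(max(2 ** (dim - 1 - j) - 1, 1))
    return hops

def total_time(m, a, p, q, r, ts, tw, th, wormhole=True):
    """Sum of per-phase times: wormhole ts + x*tw + k*th, SF (ts + x*tw)*k."""
    total = 0.0
    for i, k in enumerate(phase_hops(p, q, r)):
        x = a * (1 - a) ** i * m              # words sent in phase i+1
        total += (ts + x * tw + k * th) if wormhole else (ts + x * tw) * k
    return total

print(sum(phase_hops(3, 3, 3)))                       # 15 = 2^3+2^3+2^3-9
print(total_time(4096, 0.5, 3, 3, 3, 10, 0.5, 0.05))  # wormhole, a = 1/2
print(total_time(4096, 0.5, 3, 3, 3, 10, 0.5, 0.05, wormhole=False))
```

With p = q = r = 3 the hop counts sum to 2^p + 2^q + 2^r − p − q − r = 15, matching the coefficient of t_h (wormhole) and of t_s (store-and-forward) derived above.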

The cost of a parallel system can be defined as the product of the parallel run time and the number of processors; it reflects the sum of the time that each processor spends solving the problem. The cost of solving a problem on a single processor is the execution time of the fastest known sequential algorithm. The speedup of a parallel system is the ratio of the execution time of the fastest known sequential algorithm on a single processor to the execution time of the parallel algorithm on P processors. The efficiency can be expressed as the ratio of the execution time of the fastest known sequential algorithm to the cost of solving the same problem on P processors. A parallel system is said to be cost-optimal if the cost of solving a problem on the parallel system is proportional to the execution time of the fastest known sequential algorithm on a single processor; a cost-optimal system has efficiency Θ(1). Real parallel systems do not achieve an efficiency of 1 or a speedup of P on P processors; all causes of non-optimal efficiency (interprocessor communication time and other sources of efficiency loss) are collectively referred to as the overhead due to parallel processing.

Theorem 6. The algorithm given in this paper has the following properties.

(a) If the computation algorithm for each node is linear with respect to the size of the data, then the given embedding algorithm is not cost-optimal with respect to the wormhole routing mechanism.
(b) If the computation algorithm for each node is quadratic with respect to the size of the data, then the given embedding algorithm is cost-optimal with respect to the wormhole routing mechanism.
(c) If the computation algorithm for each node is linear with respect to the size of the data, then the algorithm given in this paper is not cost-optimal with respect to the store-and-forward routing mechanism.
(d) If the computation algorithm for each node is quadratic with respect to the size of the data, then the algorithm given in this paper is not cost-optimal with respect to the store-and-forward routing mechanism.
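The definitions above can be summarized in the following minimal helpers (illustrative only; the cost-optimality test uses an arbitrary tolerance rather than a true asymptotic criterion).

```python
# Illustrative helpers: speedup, efficiency and an approximate cost-optimality
# check for measured or modelled run times.
def speedup(t_seq, t_par):
    """Ratio of the best sequential time to the parallel time."""
    return t_seq / t_par

def efficiency(t_seq, t_par, n_proc):
    """Sequential time divided by cost = n_proc * t_par."""
    return t_seq / (n_proc * t_par)

def is_cost_optimal(t_seq, t_par, n_proc, tol=0.1):
    """Cost stays within a constant factor of t_seq (approximated by tol)."""
    return efficiency(t_seq, t_par, n_proc) >= tol

print(speedup(1.0e6, 2.5e3), efficiency(1.0e6, 2.5e3, 512))
```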

Proof. (a) The computation is linear with respect to the size m of the data, so T_s = Θ(m). Let the routing mechanism be wormhole. There are three cases for specific values of a. Without loss of generality, let p = q = r and t_h = t_s. The number of processors is 2^{3p}, m is the size of the data and m = Ω(2^{3p}).

Case 1. If (1 − a) ≪ 1, the communication time T_com is a m t_w + t_h(2^p + 2^q + 2^r − p − q − r) + t_s(p + q + r). The parallel execution time is

  T_p = T_com + Θ(m/2^{3p}) = a m t_w + t_h(3·2^p) + Θ(m/2^{3p}).

Since a, t_w and t_h are constants, T_p = Θ(m) + Θ(2^p) + Θ(m/2^{3p}) = Θ(m) + Θ(2^p). The speedup and efficiency are

  S = Θ(m) / (Θ(m) + Θ(2^p)) = Θ(1)  and  E = Θ(1)/2^{3p} = Θ(1/2^{3p}).

The speedup is near 1 and the efficiency is near 0; the algorithm is not cost-optimal in this case.

Case 2. If a ≪ 1, the communication time T_com is a m t_w(p + q + r − 6) + t_h(2^p + 2^q + 2^r − p − q − r) + t_s(p + q + r). The parallel execution time is

  T_p = T_com + Θ(m/2^{3p}) = a m t_w(3p − 6) + t_h(3·2^p) + Θ(m/2^{3p}).

Since a, t_w and t_h are constants, T_p = Θ(mp) + Θ(2^p). The speedup and efficiency are

  S = Θ(m) / (Θ(mp) + Θ(2^p)) = Θ(1/p)  and  E = Θ(1/p)/2^{3p} = Θ(1/(p·2^{3p})).

The speedup is less than 1 and the efficiency is near 0; the algorithm is not cost-optimal in this case.

Case 3. If a = 1/2, the communication time T_com is a m t_w(3 − 3/2^{3p}) + t_h(3·2^p). The parallel execution time is

  T_p = T_com + Θ(m/2^{3p}) = a m t_w(3 − 3/2^{3p}) + t_h(3·2^p) + Θ(m/2^{3p}).

Since a, t_w and t_h are constants, T_p = Θ(m) − Θ(m/2^{3p}) + Θ(2^p) + Θ(m/2^{3p}) = Θ(m) + Θ(2^p). The speedup and efficiency are

  S = Θ(m) / (Θ(m) + Θ(2^p)) = Θ(1)  and  E = Θ(1)/2^{3p} = Θ(1/2^{3p}).

The speedup is near 1 and the efficiency is near 0; the algorithm is not cost-optimal in this case. If T_s is linear and the wormhole routing mechanism is used, better results are obtained for a in the interval (0, 1/2] and worse results for a in the interval (1/2, 1], since the efficiency for values of a in the first interval is greater than the efficiency obtained for values of a in the second interval.

(b) The computation is quadratic with respect to the size m of the data, so T_s = Θ(m^2). Let the routing mechanism be wormhole. There are three cases for specific values of a. Without loss of generality, let p = q = r and t_h = t_s. The number of processors is 2^{3p}, m is the size of the data and m = Ω(2^{3p}).

Case 1. If (1 − a) ≪ 1, the communication time T_com is a m t_w + t_h(2^p + 2^q + 2^r − p − q − r) + t_s(p + q + r). The parallel execution time is

  T_p = T_com + Θ(m^2/2^{6p}) = a m t_w + t_h(3·2^p) + Θ(m^2/2^{6p}).

Since a, t_w and t_h are constants, T_p = Θ(m) + Θ(2^p) + Θ(1) = Θ(m) + Θ(2^p). The speedup and efficiency are

  S = Θ(m^2) / (Θ(m) + Θ(2^p)) = Θ(2^{3p})  and  E = Θ(2^{3p})/2^{3p} = Θ(1).

The speedup is near the number of processors and the efficiency is near 1; the algorithm is cost-optimal in this case.

Case 2. If a ≪ 1, the communication time T_com is a m t_w(p + q + r − 6) + t_h(2^p + 2^q + 2^r − p − q − r) + t_s(p + q + r). The parallel execution time is

  T_p = T_com + Θ(m^2/2^{6p}) = a m t_w(3p − 6) + t_h(3·2^p) + Θ(m^2/2^{6p}) = Θ(mp) + Θ(2^p).

The speedup and efficiency are

  S = Θ(m^2) / (Θ(mp) + Θ(2^p)) = Θ(m/p)  and  E = Θ(m/p)/2^{3p} = Θ(m/(p·2^{3p})).

The efficiency is not asymptotically equal to 1; the algorithm is not cost-optimal in this case.

Case 3. If a = 1/2, the communication time T_com is a m t_w(3 − 3/2^{3p}) + t_h(3·2^p). The parallel execution time is

  T_p = T_com + Θ(m^2/2^{6p}) = a m t_w(3 − 3/2^{3p}) + t_h(3·2^p) + Θ(1) = Θ(m) + Θ(2^p).

The speedup and efficiency are

  S = Θ(m^2) / (Θ(m) + Θ(2^p)) = Θ(m)  and  E = Θ(m)/2^{3p} = Θ(1).

The speedup is near the number of processors and the efficiency is near 1; the algorithm is cost-optimal in this case. If T_s is quadratic and the wormhole routing mechanism is used, better results are obtained for a in the interval [1/2, 1] and worse results for a in the interval (0, 1/2), since the efficiency for values of a in the first interval is greater than the efficiency obtained for values of a in the second interval.

(c) The computation is linear with respect to the size m of the data, so T_s = Θ(m). Let the routing mechanism be store-and-forward. There are two cases for specific values of a. Without loss of generality, let p = q = r. The number of processors is 2^{3p}, m is the size of the data and m = Ω(2^{3p}).

Case 1. If (1 − a) ≪ 1, the communication time T_com is t_s(2^p + 2^q + 2^r − p − q − r) + a m t_w(2^{p−1} − 1). The parallel execution time is

  T_p = t_s(3·2^p − 3p) + a m t_w(2^{p−1} − 1) + Θ(m/2^{3p}) = Θ(m·2^p).

The speedup and efficiency are

  S = Θ(m)/Θ(m·2^p) = Θ(1/2^p)  and  E = Θ(1/2^p)/2^{3p} = Θ(1/2^{4p}).

The algorithm is not cost-optimal, since the efficiency is not Θ(1).

Case 2. If a ≪ 1, the communication time T_com is t_s(2^p + 2^q + 2^r − p − q − r) + a m t_w(2^p + 2^q + 2^r − p − q − r). The parallel execution time is

  T_p = (3·2^p − 3p)(a m t_w + t_s) + Θ(m/2^{3p}) = Θ(m·2^p).

The speedup and efficiency are

  S = Θ(m)/Θ(m·2^p) = Θ(1/2^p)  and  E = Θ(1/2^p)/2^{3p} = Θ(1/2^{4p}).

Again the algorithm is not cost-optimal, since the efficiency is not Θ(1).

(d) The computation is quadratic with respect to the size m of the data, so T_s = Θ(m^2). Let the routing mechanism be store-and-forward. There are two cases for specific values of a. Without loss of generality, let p = q = r. The number of processors is 2^{3p}, m is the size of the data and m = Ω(2^{3p}).

Case 1. If (1 − a) ≪ 1, the communication time T_com is t_s(2^p + 2^q + 2^r − p − q − r) + a m t_w(2^{p−1} − 1). The parallel execution time is

  T_p = t_s(3·2^p − 3p) + a m t_w(2^{p−1} − 1) + Θ(m^2/2^{6p}) = Θ(m·2^p).

The speedup and efficiency are

  S = Θ(m^2)/Θ(m·2^p) = Θ(m/2^p)  and  E = Θ(m/2^p)/2^{3p} = Θ(m/2^{4p}),

which is not Θ(1) (for m = Θ(2^{3p}) it is Θ(1/2^p)); the algorithm is not cost-optimal.

Case 2. If a ≪ 1, the communication time T_com is t_s(2^p + 2^q + 2^r − p − q − r) + a m t_w(2^p + 2^q + 2^r − p − q − r). The parallel execution time is

  T_p = (3·2^p − 3p)(a m t_w + t_s) + Θ(m^2/2^{6p}) = Θ(m·2^p).

The speedup and efficiency are the same as in Case 1, so again the algorithm is not cost-optimal.

Consequently, the proposed algorithm is cost-optimal if the routing mechanism is wormhole, the computation is quadratic and the ratio a lies in the closed interval [1/2, 1]. □
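The contrast stated in the last sentence of the proof can be seen numerically. The sketch below is illustrative only; the constants, the choice m = 2^{3p} and the unit-cost computation model are assumptions, not values from the paper.

```python
# Illustrative numeric check of the trend in Theorem 6: with wormhole routing,
# a = 1/2 and quadratic local computation, the efficiency stays roughly
# constant as the machine grows, while with store-and-forward routing and
# linear computation it collapses.
ts, tw, th = 10.0, 2.0, 0.1

for p in range(2, 8):
    n_proc = 2 ** (3 * p)
    m = n_proc                                   # data size m = Theta(2^{3p})
    # Wormhole, a = 1/2, quadratic computation (part (b), Case 3):
    t_par_wh = (0.5 * m * tw * (3 - 3 / n_proc) + th * (3 * 2 ** p - 3 * p)
                + ts * 3 * p + (m / n_proc) ** 2)
    eff_wh = m ** 2 / (n_proc * t_par_wh)        # T_s = m^2 (unit-cost model)
    # Store-and-forward, a close to 1, linear computation (part (c), Case 1):
    t_par_sf = (ts * (3 * 2 ** p - 3 * p) + m * tw * (2 ** (p - 1) - 1)
                + m / n_proc)
    eff_sf = m / (n_proc * t_par_sf)             # T_s = m
    print(p, round(eff_wh, 3), f"{eff_sf:.2e}")
```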

5. Conclusion

In this paper, we have described an algorithm for divide and conquer on the 3D mesh and torus interconnection networks. The algorithm is based on an embedding of the binomial tree onto the 3D mesh (torus), and this embedding exhibits regularity in the communication stages of divide and conquer on the binomial tree. The dilation of the embedding is 2^{h−1} − 1, where h = max{p, q, r}. There is no edge contention, since the active paths in each phase are edge-disjoint: the binomial tree consists of two copies of B(n − 1) that have no edges in common, and the same holds for all subtrees. The communication time under the wormhole routing mechanism is better than that under the store-and-forward routing mechanism, since the dilation of the embedding is not 1.

If the initial message has size m and the dividing ratio is a, then the total volume of the messages routed at each communication step is determined by a, so there is regularity in the volume of the routed messages. The algorithm proposed in this paper is also suitable for conversion to k-ary n-cubes; that is, it can be converted to all sub-interconnections of k-ary n-cubes without faulty edges/nodes. If a > 1/2, then the communication time is determined using a^i m, 1 ≤ i ≤ n. If a = 1/2, then one of the terms a^i m, (1 − a)^i m or a^r(1 − a)^s m for 0 ≤ r + s ≤ n is used to determine the communication time of the parallel computation. If a ≤ 1/2, then (1 − a)^i m for 1 ≤ i ≤ n is used to determine the communication time.

The routing mechanism and the computation algorithm determine the cost of the overall algorithm; the wormhole routing mechanism together with a quadratic computation algorithm yields a cost-optimal overall algorithm. There is no difference in the consequences of embedding a binomial tree on the 3D mesh or on the torus: all embedding parameters are the same, so the time results are also the same. Circuit-switched routing can also be applied to the algorithm proposed in this paper, since the novelty of this paper is not the routing; it is the embedding of B(n) onto mesh/torus interconnection networks, in other words, a parallel algorithm based on the divide-and-conquer paradigm.

References

[1] V. Lo, S. Rajopadhye, J.A. Telle, X. Zhong, Parallel divide and conquer on meshes, IEEE Trans. Parallel Distrib. Syst. 7 (10) (1996) 1049–1057.
[2] J. Vuillemin, A data structure for manipulating priority queues, Comm. ACM 21 (4) (1978) 309–315.
[3] S.L. Johnson, Communication in Network Architectures, VLSI and Parallel Computation, Morgan Kaufmann, 1990.
[4] T.H. Cormen, C.E. Leiserson, R.L. Rivest, Introduction to Algorithms, The MIT Press, Cambridge, 1990, p. 968.
[5] Z.G. Mou, P. Hudak, An algebraic model for divide-and-conquer algorithms and its parallelism, J. Supercomput. 2 (3) (1988) 257–278.
[6] I.-C. Wu, Efficient parallel divide and conquer for a class of interconnection topologies, in: Second Annual International Symposium on Algorithms, Taipei, 1991.
[7] H.W. Loidl, K. Hammond, On the granularity of divide and conquer parallelism, in: Glasgow Workshop on Functional Programming 1995, Ullapool, Scotland, July 8–10, 1995, Springer-Verlag.
[8] K. Gates, P. Arbenz, Parallel divide and conquer algorithms for the symmetric tridiagonal eigenproblem, Technical Report No. 222, Eidgenössische Technische Hochschule, Zürich.
[9] S. Gorlatch, N-graphs: scalable topology and design of balanced divide and conquer algorithms, J. Parallel Comput. 23 (1997) 687–698.
[10] A. Lopez, R. Alcover, J. Duato, L. Zunica, A cost-effective methodology for the evaluation of interconnection networks, J. Syst. Archit. 44 (1998) 815–830.
[11] J.-S. Chen, C.-Y. Chang, J.-P. Sheu, Efficient path-based multicast in wormhole-routed mesh networks, J. Syst. Archit. 46 (2000) 919–930.
[12] S. Loucif, M. Ould-Khaoua, On the relative performance merits of hypercube and hypermesh networks, J. Syst. Archit. 46 (2000) 1103–1114.
[13] J. Giglmayr, Routing of 2D switching networks by their embedding into cubes, Optics Laser Technol. 32 (2000) 473–491.
[14] S. Lakshmivarahan, S.K. Dhall, Ring, torus, and hypercube architectures/algorithms for parallel computing, J. Parallel Comput. 25 (1999) 1877–1906.
[15] S. Lakshmivarahan, S.K. Dhall, Ring, torus, and hypercube architectures/algorithms for parallel computing, J. Parallel Comput. 25 (1999) 1877–1906.
[16] M. Valero-Garcia, A. Gonzalez, L.D. Cerio, D. Royo, Divide-and-conquer algorithms on two-dimensional meshes, Technical Report No. UPC-DAC-1997-30, Department of Computer Architecture, Polytechnic University of Catalonia, 1997.
[17] A. Gonzalez, M. Valero-Garcia, L.D. Cerio, Executing algorithms with hypercube topology on torus multicomputers, IEEE Trans. Parallel Distrib. Syst. 6 (1995) 803–814.

Ali Karci received the B.S. degree in Computer Engineering and Information Science from Bilkent University in 1994, the M.S. degree in Computer Engineering from Fırat University in 1998, and the Ph.D. degree in Electrical-Electronics Engineering from Fırat University in 2002. His areas of interest include parallel and distributed computing, interconnection networks, evolutionary computation and genetic algorithms.