An optimal and processor efficient parallel sorting algorithm on a linear array with a reconfigurable pipelined bus system


Computers and Electrical Engineering 35 (2009) 951–965


Min He (a,*), Xiaolong Wu (a), Si Qing Zheng (b)

(a) Department of Computer Engineering and Computer Science, California State University Long Beach, 1250 Bellflower Blvd., Long Beach, CA 90840, United States
(b) Department of Computer Science, Erik Jonsson School of Engineering and Computer Science, University of Texas at Dallas, 800 West Campbell Rd., EC31, Richardson, TX 75080-3021, United States

Article history: Available online 12 January 2009

Keywords: Sorting; Parallel algorithms; Optical interconnection; Parallel computing

Abstract

Optical interconnections attract many engineers' and scientists' attention due to their potential for gigahertz transfer rates and concurrent access to the bus in a pipelined fashion. These unique characteristics of optical interconnections give us the opportunity to reconsider traditional algorithms designed for ideal parallel computing models, such as PRAMs. Since the PRAM model is far from practical, not all algorithms designed on this model can be implemented on a realistic parallel computing system. From this point of view, we study Cole's pipelined merge sort [Cole R. Parallel merge sort. SIAM J Comput 1988;17(4):770–85] on the CREW PRAM and extend it in an innovative way to an optical interconnection model, the LARPBS (Linear Array with Reconfigurable Pipelined Bus System) model [Pan Y, Li K. Linear array with a reconfigurable pipelined bus system—concepts and applications. J Inform Sci 1998;106:237–58]. Although Cole's algorithm is optimal, its communication details are not provided because it is designed for a PRAM. We close this gap in our sorting algorithm on the LARPBS model and obtain an O(log N)-time optimal sorting algorithm using O(N) processors. This is a substantial improvement over the previous best sorting algorithm on the LARPBS model, which runs in O(log N log log N) worst-case time using N processors [Datta A, Soundaralakshmi S, Owens R. Fast sorting algorithms on a linear array with a reconfigurable pipelined bus system. IEEE Trans Parallel Distribut Syst 2002;13(3):212–22]. Our solution also allows processors to be assigned and reused efficiently. We further discover two new properties of Cole's sorting algorithm, which are presented as lemmas in this paper. Published by Elsevier Ltd.

1. Introduction

Parallel computing on optical interconnection models has drawn considerable attention in recent years due to the following attractive features: high speed, high bandwidth, low error probability, gigabit transmission capacity, increased fan-out, long interconnection length, low power requirements, and freedom from capacitive bus loading, crosstalk, and electromagnetic interference. These features give optical interconnections great potential to improve the performance of algorithms designed for massively parallel processing systems. In addition, optical signals transmitted on an optical waveguide have two important properties that are not shared by electrical signals running on an electrical bus, namely, unidirectional propagation and predictable propagation delays per unit length. These two properties enable synchronized concurrent accesses of an optical bus in a pipelined fashion [4]. Compared with exclusive access for data transmission on an electrical interconnection, the ability for



an optical signal to concurrently access a waveguide increases the bandwidth of the bus and can reduce the time complexity of many parallel algorithms. That, in turn, has opened up new challenges in algorithm design for different optical interconnection models. A variety of optical interconnection models [2,5–13] have been proposed in the past decades. Among them, arrays with reconfigurable optical bus systems have received significant attention, as messages can be transmitted concurrently on a bus in a pipelined fashion and the bus can be reconfigured dynamically under program control to support different algorithmic requirements. In other words, one array can be partitioned into several independent subarrays. These subarrays can then operate as regular optical bus systems and can be used independently for different computations without interference. Hence, these architectures are well suited to many divide-and-conquer problems. Two typical representatives of reconfigurable optical bus systems are the Linear Array with Reconfigurable Pipelined Bus System (LARPBS) [2] and the Array with Reconfigurable Optical Buses (AROB) [6]. In both models, the optical buses can be dynamically reconfigured, and the time complexities of algorithms are analyzed in terms of the number of bus cycles needed for global communication and the number of arithmetic operations for local computation, where a bus cycle is defined as the time needed for a signal to travel from end to end along a bus. However, there is one major difference between these two models: in the AROB model, the processors connected to a bus are able to count optical pulses within a bus cycle, whereas in the LARPBS model counting is prohibited during a bus cycle. In order for the processors in the AROB model to count optical pulses, it is assumed that the CPU cycle time equals the optical pulse time. This assumption is not realistic because the optical pulse time is usually much shorter than the CPU cycle time of an electronic processor. Since processors on the LARPBS model are not involved in the optical bus operation except for setting switches at the beginning of a bus cycle, the LARPBS model is more realistic. A preliminary feasibility study of the LARPBS model can be found in a white paper [11] at silicon.com, and a practical implementation of the LARPBS model, the LARPBS(p), is given in [12,13]. Other practical issues of the LARPBS model, such as scalability and fault tolerance, are investigated in [14–16]. Many parallel algorithms [3,14–36] in different domains have been proposed on the LARPBS model, and the results indicate that the LARPBS model is efficient for parallel computation. We choose the LARPBS model because it is an efficient and realistic optical parallel computation model that provides the benefits of both optical buses and reconfiguration.

In this paper, we are interested in adapting an optimal sorting algorithm on the CREW PRAM to the LARPBS model. On the LARPBS model, memory is distributed among processors, and communication is much more restricted than on the PRAM. That is one major challenge. Apart from communication, there are two further challenges: first, since Cole's algorithm uses a binary sorting tree, we need a way to simulate a binary tree on a bus-based architecture; second, in order to achieve a processor complexity of O(N), we need to assign and reuse processors efficiently. How to reuse processors in Cole's algorithm has not been addressed before. The reconfigurability of the LARPBS model is a feature we want to take advantage of for solving this problem. Solutions to all the challenges mentioned above are the contributions of this paper. Furthermore, we discovered new and interesting properties of Cole's optimal sorting algorithm and present them as lemmas with proofs.

This paper is organized as follows. Section 2 gives a detailed description of the LARPBS model. Section 3 introduces related work on the LARPBS model. Section 4 describes the newly designed O(log N)-time sorting algorithm on the LARPBS model. In Section 5, we show how the algorithm works with an example. Section 6 concludes the paper.

2. The LARPBS model

A Linear Array with Reconfigurable Pipelined Bus System (LARPBS) is a folded optical bus system with three waveguides, one set of fixed delays, and two sets of flexible switches. To avoid confusion, we use two figures, Figs. 1 and 2, to illustrate the complete complement of the LARPBS model. Fig. 1 shows an LARPBS of size 6 with the following components: processors, reconfigurable switches, and the message waveguide. Fig. 2 shows an LARPBS of size 6 with the following components: processors, conditional delay switches, fixed unit delays, the reference waveguide, and the select waveguide. The three waveguides are: a message waveguide, shown in Fig. 1, used to carry messages; and a reference waveguide and a select waveguide, shown in Fig. 2, used together to carry address information. Communication on this model is carried out by adding messages and addresses on the upper half of the bus, called the transmitting segment, at the beginning of a bus cycle, and receiving them on the lower half of the bus, called the receiving segment, within the same bus cycle.

There are two sets of switches connected to each processor on the bus, namely, conditional delay switches and reconfigurable switches. The conditional delay switches are used for addressing purposes; the coincident pulse technique [4] is used to implement the addressing scheme. By exploiting the two properties of optical signals mentioned above, namely unidirectional propagation and predictable propagation delays, the optical buses enable synchronized concurrent access in a pipelined fashion, which results in a much higher bandwidth. To be specific, in the same bus cycle, the pipelined optical bus can transmit up to N messages, compared to a single message on an electrical bus of the same length, where N is the number of processors in the array. The reconfigurable switches, RST(i) and RSR(i) in Fig. 1, are used to segment one bus into several, or to connect multiple buses into one. The reconfiguration works as follows: setting RST(i) and RSR(i) to cross splits the bus at processor P_i, with the first sub-bus folding at processor P_i and the second sub-bus starting from processor P_{i+1} and folding at the last processor. Each sub-bus can independently work as an LARPBS. Switching them from cross back to straight combines the two sub-buses into one bus.
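As an illustration of this reconfiguration mechanism, the following minimal Python sketch (our illustration, not part of the model's definition; the function name and the set-based representation of the switch settings are assumptions for exposition) computes the sub-buses induced by a set of switches set to cross:

# Model of LARPBS bus segmentation: setting RST(i)/RSR(i) to cross
# splits the bus between processors P_i and P_{i+1}.
def sub_buses(n, cross):
    """Return the sub-buses of an n-processor LARPBS, given the set of
    indices i at which RST(i)/RSR(i) are set to cross."""
    buses, start = [], 0
    for i in range(n):
        if i in cross or i == n - 1:  # the bus folds here
            buses.append(list(range(start, i + 1)))
            start = i + 1
    return buses

# The setting of Fig. 1: RST(1)/RSR(1) set to cross yields two sub-buses.
print(sub_buses(6, {1}))  # [[0, 1], [2, 3, 4, 5]]

Each returned group operates as an independent LARPBS until the switches are set back to straight.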


[Fig. 1. An LARPBS model with reconfigurable switches: processors P0–P5 on the message waveguide, with 1×2 RST(i) switches on the transmitting segment and 2×1 RSR(i) switches on the receiving segment.]

[Fig. 2. An LARPBS model with conditional delay switches: processors P0–P5 on the select and reference waveguides, with fixed unit delays, conditional delay switches (with or without one unit delay introduced), and reference and select pulses.]

The switch setting in Fig. 1 configures the bus into two sub-buses. The first sub-bus consists of processors P_0 and P_1; the second consists of processors P_2, P_3, P_4, and P_5. A slash inside a switch indicates that the switch is set to cross; a dashed line inside a switch indicates that it is set to straight. Computation on the LARPBS model usually consists of a sequence of alternating communication and computation steps, synchronized by bus cycles. In other words, by "O(log N) time" we mean O(log N) bus cycles for communication plus O(log N) time for local computation. Due to the speed difference between optical pulses and electrical signals, each


processor can read at most one message from the bus during one bus cycle. For more details of this model, readers are encouraged to refer to [2]. The following basic operations [2,17] are used as building blocks in our algorithm:

- One-to-one communication: For two groups of processors G1 = (P_{i_0}, P_{i_1}, ..., P_{i_{m-1}}) and G2 = (P_{j_0}, P_{j_1}, ..., P_{j_{m-1}}), each processor in G1 sends a message to one of the processors in G2. No two processors in G1 send messages to the same processor in G2.
- Broadcast: One processor P_i in an N-processor system sends a message to all other N−1 processors P_0, P_1, ..., P_{i−1}, P_{i+1}, ..., P_{N−1}.
- Multiple multicast: For m disjoint groups of receiving processors G_k = {P_{j_{k,0}}, P_{j_{k,1}}, ...}, 0 ≤ k ≤ m−1, and m senders P_{i_0}, P_{i_1}, ..., P_{i_{m-1}}, each processor P_{i_k} broadcasts a message to all the processors in G_k, for 0 ≤ k ≤ m−1.

The time complexity of these basic operations is critical in calculating the time complexity of the sorting algorithm. Thus we quote the following lemma, proved by Pan and Li [2].

Lemma 1. One-to-one communication, broadcasting, and multiple multicasting can each be done in O(1) bus cycles on the LARPBS model.

By using these basic operations, we can focus on developing, specifying, and analyzing parallel algorithms and ignore the optical and engineering details. In Section 4.1, we introduce two new basic operations, find-max-representatives and find-min-representatives, and show that both of them take O(1) bus cycles.
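To make the semantics of these primitives concrete, the following Python sketch (our illustration with hypothetical function names; on the real bus these operations are realized in O(1) bus cycles via the coincident pulse technique) models each primitive as the set of messages delivered in one bus cycle, at most one per receiver:

def one_to_one(receivers, msgs):
    """Distinct receivers, so no processor gets two messages in one cycle."""
    assert len(set(receivers)) == len(receivers)
    return dict(zip(receivers, msgs))

def broadcast(sender, n, msg):
    """P_sender sends msg to all other N-1 processors."""
    return {j: msg for j in range(n) if j != sender}

def multiple_multicast(groups, msgs):
    """Disjoint receiver groups, one sender (and one message) per group."""
    mailbox = {}
    for group, msg in zip(groups, msgs):
        for j in group:
            assert j not in mailbox  # groups are disjoint
            mailbox[j] = msg
    return mailbox

print(one_to_one([4, 5], ["a", "b"]))                 # {4: 'a', 5: 'b'}
print(multiple_multicast([[2, 3], [5]], ["x", "y"]))  # {2: 'x', 3: 'x', 5: 'y'}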
Several algorithms [37] have been proposed for simulating the PRAM on the LARPBS model. The PRAM model [38] is an idealized shared-memory parallel computation model in which accessing a shared memory location is assumed to take constant time. Because of this assumption, when designing algorithms on this model one can concentrate on finding the parallelism in a computation without worrying about the implementation details of communication among the processors. Thus the time complexity of Cole's CREW PRAM sorting algorithm is optimistic. Since the best sorting algorithm [3] on the LARPBS model to date is still a factor of O(log log N) away from optimal, extending an optimal sorting algorithm onto the LARPBS model means eliminating this gap entirely. The questions now are: is it possible, and how do we solve the communication puzzle? There are two approaches for bringing an algorithm designed for a PRAM model onto a practical model:

- Simulate the CREW PRAM on the practical model, and then run the algorithm on the simulated CREW PRAM.
- Implement the computation operations directly on the practical model and solve the communication puzzle of the algorithm without introducing extra time complexity.

An obvious disadvantage of the first approach is the overhead introduced by the simulation. Unless the simulation can be done in constant time, the first approach cannot achieve the same time complexity. The disadvantage of the second approach is that it may not be possible if we cannot solve the communication puzzle. When both computation and communication can be done without introducing extra time complexity, we expect the second approach to achieve better results. If we take the first approach, we need to simulate the CREW PRAM on the LARPBS model. To our knowledge, the best simulation algorithm is given by Li et al. [37]. Their algorithm provides the following result: each step of a p-processor CREW PRAM computation with O(p) shared memory cells can be simulated by a p-processor LARPBS in O(log p) time. Thus, even using the best simulation algorithm, we could only achieve an O(log² N)-time sorting algorithm with O(N) processors, which is not optimal. That leaves only one approach to try: implement the computation directly on the LARPBS model and solve the communication puzzle.

3. Related work on the LARPBS model

Since the LARPBS model was proposed by Pan and Li [2], many parallel algorithms [3,14–36] in different domains have been proposed to investigate the power of this model, such as the Euclidean distance transform [15], sorting [3,14,18,27], selection [21–23], matrix multiplication [24–26], graph theory problems [27–30], permutation routing [31], sequencing problems [32–35], etc. All these results indicate that the LARPBS model is efficient for parallel computation due to its high bandwidth and the flexibility of a reconfigurable optical bus system.

Sorting is one of the fundamental problems in computer science. Since sorting is often a necessary step in solving many problems, a fast sorting algorithm can help reduce the time complexity of many parallel algorithms. Several sorting algorithms have been designed for the LARPBS model. Pan et al. [27] presented an O(1)-time sorting algorithm that uses N² processors; although it runs in constant time, its cost is not optimal because of the N² processors. Pan et al. [18] designed the first N-processor sorting algorithm, which is based on sequential quicksort and runs in O(log N) average time and O(N) worst-case time. Another sorting algorithm, which reduces the worst-case time to O(log² N) with the same processor cost, was proposed in [14]. The best previous sorting algorithm on this model runs in O(log N log log N) worst-case time [3] using N processors. None of the sorting algorithms for the LARPBS model is optimal. A parallel sorting algorithm is considered optimal (with respect to cost) if the number of


comparisons is O(N log N), the same as the sequential complexity of sorting N elements. An interesting question is therefore: is it possible to run an optimal sorting algorithm on the LARPBS model?

The first optimal sorting algorithm is the AKS sorting network [39] on the circuit model, which runs in O(log N) time and uses O(N) comparators. The second is Cole's pipelined merge sort [1], designed for the CREW PRAM (Parallel Random Access Machine) and EREW PRAM models, which also runs in O(log N) time using O(N) processors. A natural question arises: since no optimal sorting algorithm has been designed for the LARPBS model, why not investigate the possibility of extending an existing optimal algorithm onto this model? Because sorting is a fundamental problem with many applications, having an optimal sorting algorithm at our disposal would be extremely helpful for solving many scientific problems more efficiently on the LARPBS model. From this point of view, we studied and compared the optimal sorting algorithms mentioned above. Here are our findings: first, the constant in the running time of the AKS network is very large; second, comparing Cole's two PRAM sorting algorithms, the EREW PRAM version is much more complex and has a larger constant in its running time. Based on these findings, we choose to implement Cole's CREW PRAM algorithm [1] on the LARPBS model.

4. An optimal sorting algorithm on the LARPBS model

In this section, we give definitions and properties needed for designing our sorting algorithm, provide a brief description of Cole's pipelined merge sort, analyze the challenges of extending Cole's algorithm to a realistic model, and introduce our optimal sorting algorithm on the LARPBS model.

4.1. Definitions and properties

Suppose we have two sorted arrays L and J, each having N distinct elements. In the rest of this paper, we use P_i to represent a processor and lower-case letters to represent the rank of an item associated with a processor.

- Interval I(e): Define L_x = (−∞, L, +∞), i.e., L augmented with −∞ and +∞. If e ∈ L and g is the next larger item in L_x, then [e, g) is the interval induced by e, denoted I(e).
- c-cover: For a positive integer c, L is a c-cover of J if each interval induced by an item in L contains at most c items from J.
- rank: The rank of an item x in a sorted array S is the number of elements in S that are less than or equal to x. A processor's rank refers to the rank of its associated item.
- straddle: For any three items {e, f, g} with e < g, we say that e and g straddle f if e ≤ f < g; e and g are then called the two straddle items for f. Similarly, if e, f, g are associated with processors P_i, P_j, and P_k, respectively, P_i and P_k are called the two straddle processors for processor P_j.
- L → J: L is ranked in J; i.e., for each item in L we know its rank in J.
- L ↔ J: L and J are cross-ranked; i.e., for each item in L we know its rank in J, and for each item in J we know its rank in L.
- L(u): The final sorted array of the items initially at the leaves of the subtree rooted at node u.
- UP(u): A sorted subset of the items in L(u).
- external node/inside node: Node u is an external node if |UP(u)| = |L(u)|, i.e., the number of items in UP(u) equals the number of items in L(u); otherwise, it is an inside node.
- active node: An active node is either a node whose UP array is non-empty, or an external node in its first three stages.
- NEWUP(u), OLDUP(u): The corresponding UP arrays at the start of the next and previous stages, respectively.
- SUP(u): If u is an inside node, SUP(u) is the sorted array comprising every fourth item in UP(u), measured from the right end; if u is an external node, SUP(u) is every fourth, every second, and every item in UP(u) in its first, second, and third stages, respectively.
- NEWSUP(u), OLDSUP(u): The corresponding SUP arrays in the next and previous stages, respectively.
- Calculating NEWUP(u): Let u, v, and w be three nodes in the binary sorting tree where u is the parent of v and w. Then NEWUP(u) = SUP(v) ∪ SUP(w), i.e., NEWUP(u) is the result of merging arrays SUP(v) and SUP(w).

Two new basic operations for the LARPBS model are presented as follows:

- Find-min-representatives: Assume an LARPBS bus holds a sorted array of items, and consider the items' ranks in another sorted array; adjacent items may have the same rank. This operation selects the smallest item for each rank. It works as follows: each processor P_i sends its rank r to the processor on its right and compares the rank it receives with its own rank. If the two ranks do not match, P_i sets itself to be a representative. The first processor always sets itself to be a representative. This operation takes one bus cycle and one local comparison.
- Find-max-representatives works symmetrically: each processor sends its rank r to the processor on its left, and the last processor always sets itself to be a representative; the rest of the operation is the same as find-min-representatives. This operation selects the largest item for each rank and takes one bus cycle and one local comparison.

The above two operations are designed to avoid different values being sent to the same processor in one bus cycle. They are necessary because a processor on the LARPBS can receive at most one message per bus cycle.
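A minimal Python sketch of the two operations follows (our illustration; on the LARPBS the neighbor exchange is a single one-to-one communication, which we model directly on the array of ranks):

def find_min_representatives(ranks):
    """Flag the smallest item of each rank: P_0 is always a representative;
    P_i is one iff its rank differs from its left neighbor's rank."""
    return [i == 0 or ranks[i] != ranks[i - 1] for i in range(len(ranks))]

def find_max_representatives(ranks):
    """Symmetric: flag the largest item of each rank; the last processor
    is always a representative."""
    n = len(ranks)
    return [i == n - 1 or ranks[i] != ranks[i + 1] for i in range(n)]

ranks = [0, 2, 2, 2, 3]
print(find_min_representatives(ranks))  # [True, True, False, False, True]
print(find_max_representatives(ranks))  # [True, False, False, True, True]

In the next section, we briefly introduce Cole's pipelined merge sort and some lemmas that are used to prove the correctness of our algorithm.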


4.2. Cole's pipelined merge sort

Cole's pipelined merge sort uses an N-leaf binary sorting tree. A key observation Cole made in [1] is that when using a binary sorting tree to merge two sorted arrays, the merges at different levels of the tree can be pipelined. In other words, computations can be distributed over different pipeline stages to minimize the workload at each stage. Instead of merging all items in two sorted lists, the algorithm merges only sampled items from both lists. By selectively choosing items to merge in the (i−1)-th stage, the merge in the i-th stage can be completed in constant time using the result from the (i−1)-th stage. Next, we give a brief description of the algorithm.

The N inputs are placed at the leaves of the tree. The algorithm "moves up" from the leaves to the root. The task at each internal node u is to compute L(u). At intermediate steps in the computation, two arrays are maintained at node u: UP(u) and SUP(u). The items in UP(u) are a rough sample of the items in L(u). As the algorithm proceeds, the size of UP(u) increases and UP(u) becomes a more accurate approximation of L(u). When UP(u) = L(u), node u's job of computing the sorted list L(u) is completed. It becomes an external node and participates in the algorithm for only three more stages, providing samples to its parent node. The algorithm runs in 3 log N stages and stops when the root node completes the calculation of its L(u). At each stage, each node u performs the following two computations in constant time:

(1) Form the array SUP(u).
(2) Compute NEWUP(u) = SUP(v) ∪ SUP(w), where v and w are u's children and ∪ denotes merging.

The first computation follows the definition of SUP arrays given in the previous section. The second computation comprises the following two steps.

Step 1: computing NEWUP(u). An item e in NEWUP(u) comes from either SUP(v) or SUP(w), and the rank of e in NEWUP(u) is equal to the sum of its ranks in SUP(v) and SUP(w). Thus, if we cross-rank SUP(v) and SUP(w), we obtain e's rank in NEWUP(u). We describe how SUP(v) → SUP(w) is calculated in two substeps; SUP(w) → SUP(v) is computed in a similar way.

Substep 1: computing SUP(v) → UP(u). Let y be an item in UP(u) and I(y) be the interval induced by y in UP(u). Using UP(u) → SUP(v) (obtained in Step 2 of the previous stage), the processor associated with item y in UP(u) can send its rank to the items in SUP(v) that fall in the interval I(y). Each of those items in SUP(v) thereby obtains its rank in UP(u), which gives us SUP(v) → UP(u).

Substep 2: calculating SUP(v) → SUP(w). Consider an item e in SUP(v). First, we determine the two items d and f in UP(u) that straddle item e, using SUP(v) → UP(u). Then we find the items in SUP(w) that fall into the interval [d, f), using UP(u) → SUP(w) (obtained in Step 2 of the previous stage), and compare item e with those items in SUP(w) to determine its rank in SUP(w).

Substep 3: computing NEWUP(u). Using SUP(v) → SUP(w) and SUP(w) → SUP(v), each processor in the SUP(v) and SUP(w) arrays calculates its item's rank in NEWUP(u) by adding its local rank to its cross rank.

Step 2: maintaining the ranks NEWUP(u) → NEWSUP(v) and NEWUP(u) → NEWSUP(w). Given UP(u) → SUP(v), UP(u) → SUP(w), and NEWUP(u) = SUP(v) ∪ SUP(w), we can immediately deduce UP(u) → NEWUP(u). Similarly, we can obtain UP(v) → NEWUP(v). Since SUP(v) and NEWSUP(v) are obtained by sampling from UP(v) and NEWUP(v), respectively, we can find SUP(v) → NEWSUP(v) by applying the sampling rules. Thus, for every item in NEWUP(u) that came from SUP(v), we have its rank in NEWSUP(v). Now consider each item e from SUP(w). We can find e's two straddle items d and f in SUP(v) using SUP(w) → SUP(v). Next, using SUP(v) → NEWSUP(v), we can find d's and f's ranks r and t in NEWSUP(v). Item e can then determine its rank in NEWSUP(v) by comparing itself against the items whose ranks fall in the interval [r, t) in NEWSUP(v). Thus, for every item in NEWUP(u) that came from SUP(w), we have its rank in NEWSUP(v). Combining the results of the previous two paragraphs, we obtain NEWUP(u) → NEWSUP(v). NEWUP(u) → NEWSUP(w) is calculated analogously.

Next, we present three lemmas that describe features of this algorithm; we use them to prove the correctness of our sorting algorithm. The following lemma was proved in [1].

Lemma 2. UP(u) is a 3-cover for both SUP(v) and SUP(w).

We found two new features of Cole's pipelined merge sort algorithm and present them as the following two lemmas.

Lemma 3. SUP(u) is a subset of NEWSUP(u), i.e., SUP(u) ⊆ NEWSUP(u).

Proof. We prove the result by induction on the level of the nodes in the binary sorting tree. The claim is true for the nodes on the lowest active level of the sorting tree in a life-cycle, i.e., the external nodes: if u is an external node, then on the first stage after it becomes external, SUP(u) consists of every fourth item in UP(u) = L(u); on the second stage, SUP(u) consists of every second item in L(u); and on the third stage, SUP(u) consists of every item in L(u). Thus the claim holds for all external nodes in each life-cycle.


Inductive step: We show that for any node u at level k, SUP(u) is a subset of NEWSUP(u), where level k is one level above a level for which the claim holds (initially, the level of the external nodes). Suppose the claim SUP(u) ⊆ NEWSUP(u) fails at level k: there is an item e ∈ SUP(u) with e ∉ NEWSUP(u). Item e comes from UP(u), which in turn comes from the SUP array of one of u's children. Applying the same induction, we trace the origin of e down to the lowest active level, the level of the external nodes. Assume e originally comes from external node f. Then e ∈ SUP(f) but e ∉ NEWSUP(f), which contradicts what we proved in the base case. Thus the claim is true for level k. Combining the base case and the inductive step, we get SUP(u) ⊆ NEWSUP(u). □

Lemma 4. UP(u) is a subset of NEWUP(u), i.e., UP(u) ⊆ NEWUP(u).

Proof. When |UP(u)| < |L(u)|, we have NEWUP(u) = SUP(v) ∪ SUP(w) and UP(u) = OLDSUP(v) ∪ OLDSUP(w). Applying Lemma 3 at the children, OLDSUP(v) ⊆ SUP(v) and OLDSUP(w) ⊆ SUP(w); hence UP(u) ⊆ NEWUP(u). When |UP(u)| = |L(u)|, we have UP(u) = NEWUP(u). Combining the two cases, we get UP(u) ⊆ NEWUP(u). □

We note that in Cole's pipelined merge sort, the most important and complicated part is calculating ranks between arrays, and communication is assumed to take constant time without any details being provided. These facts pose challenges for implementing the algorithm on the LARPBS model, which we address in more detail in the next section.

4.3. Implementation challenges

As noted before, the PRAM is a less restrictive parallel computation model. To implement the CREW PRAM sorting algorithm on the LARPBS model, we face the following challenges:

Communication: Concurrent read on the CREW PRAM is simple and is done in constant time. On the LARPBS model, however, a concurrent read can only be done in constant time by multicasting, and only if the sender knows the addresses of all the receivers; how to inform the sender is a problem that must be solved on the LARPBS model. Passing ranks among processors is one of the major communication operations in Cole's pipelined merge sort, and inefficient handling of these operations could introduce extra time complexity into the algorithm. Details of how ranks are passed need not be considered on a PRAM but must be worked out on any realistic parallel computation model.

Representing multiple trees on an LARPBS: Cole's pipelined merge sort uses an N-leaf complete binary tree. A binary tree machine is a multiprocessor architecture in which processors are connected as a complete binary tree; it captures the essence of divide-and-conquer strategies. Although some research [40] has begun to simulate a binary tree on a bus-based architecture, we still need a simple and straightforward way to simulate multiple binary trees on an LARPBS so that the simulation does not introduce extra time complexity. Specifically, we have to keep track of three arrays for each active node in the sorting tree: UP, SUP, and NEWUP. The simulation should make operations such as assigning processors to the items in the arrays, and ranking them against each other, convenient.

Processor reuse: In order to keep the number of processors at O(N), we need to use the reconfigurability of the LARPBS model to dynamically allocate processors and change their roles. Processor reuse is mentioned in [1] without a solution being provided.

4.4. An optimal sorting algorithm on the LARPBS model

In our optimal sorting algorithm, we implement the computation operations of Cole's pipelined merge sort and add the details of the communication operations needed to make those computations possible. In Cole's pipelined merge sort, the nodes of the binary tree are classified into two categories: inside nodes and external nodes. To capture the fact that external nodes stay active for only three stages, we call each 3-stage period that starts when a set of nodes becomes external a life-cycle. Thus the algorithm runs in log N life-cycles. In our algorithm, we split Step 1 of Cole's algorithm into three separate steps to make the hidden work clearer and the computation and communication easier to understand. At the beginning, we only have UP arrays: each leaf has an UP array containing one item. The algorithm describes the operations performed at a typical internal node u with two children v and w. Here is the description of the algorithm.

Algorithm: Pipelined Merge Sort on the LARPBS
Input: a sequence of N keys, S
Output: a sorted sequence of the N keys, S'
Begin
  for LifeCycle = 1 to log N, all processors pardo
    RunLifeCycle(LifeCycle);
  endfor
End

Procedure RunLifeCycle(i)


Begin
  for stage = 1 to 3 do
    Step 1: Compute the SUP arrays by sampling from the corresponding UP arrays.
    Step 2: Compute SUP(v) → UP(u) and SUP(w) → UP(u).
    Step 3: Compute SUP(v) ↔ SUP(w).
    Step 4: Compute NEWUP(u) = SUP(v) ∪ SUP(w).
    Step 5: Maintain the ranks NEWUP(u) → NEWSUP(v) and NEWUP(u) → NEWSUP(w).
  endfor
End

In Step 1, the sampling is done as follows. For inside nodes, sampling every fourth item is accomplished by having each processor P_i in an UP array send its item to the j-th processor in the SUP array, where P_i's rank is r = m − 3 − 4j, m is the number of items in the UP array, and j ≥ 0. External nodes perform the same operation, except that the sampling rate is 2^(3−stage). This step implements the first computation in Cole's algorithm.
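The sampling rule can be expressed compactly as follows (a Python sketch of the selection only, which is our illustration; the processor-index bookkeeping of the actual one-to-one communication is omitted). Selecting every rate-th item measured from the right end reproduces, for rate 4, the ranks r = m − 3 − 4j:

def sample(up, rate):
    """Every rate-th item of the sorted array up, counted from the right end."""
    m = len(up)
    return [x for i, x in enumerate(up) if (m - i) % rate == 0]

up = [4, 6, 9, 13, 15, 17, 18, 19]
print(sample(up, 4))  # [4, 15]         inside node, or external node in stage 1
print(sample(up, 2))  # [4, 9, 15, 18]  external node in stage 2
print(sample(up, 1))  # whole array:    external node in stage 3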
Step 2 is implemented as follows. Using UP(u) → SUP(v), do a find-min-representatives operation in UP(u) based on each item's rank in SUP(v). Then each representative processor P_i in UP(u) sends its item and its rank s in UP(u) to the r-th processor P_j in SUP(v), where r is P_i's rank in SUP(v). Since different representatives have different ranks in SUP(v), this is a one-to-one communication. After receiving rank s, each processor P_j sets its reconfigurable switches to cross to form a sub-bus and broadcasts rank s in its sub-bus. Since P_{i−1} < P_j ≤ P_i, P_j's rank in UP(u) is s; see Fig. 3. All other processors on the same sub-bus as processor P_j have the same rank in UP(u). For example, in Fig. 3, since P_{i−3}, P_{i−2}, and P_{i−1} have the same rank r in SUP(v), we have P_{i−1} < P_{j−2} < P_{j−1} < P_j. As P_j ≤ P_i, the three processors on the same sub-bus, i.e., P_{j−2}, P_{j−1}, and P_j, have the same rank s. Now consider the special case P_i = P_j. According to Lemma 3 and UP(u) = OLDSUP(v) ∪ OLDSUP(w), when P_i's item comes from OLDSUP(v) we have P_i = P_j. In this case, P_j's rank in UP(u) is s + 1 according to our definition of rank, so one extra correction step is needed: after receiving a message, processor P_j in SUP(v) compares its item with the received item and increases the received rank s by one if the two items are equal. Now we have SUP(v) → UP(u); SUP(w) → UP(u) is computed in a similar way. This step implements Substep 1 of Step 1 of the second computation in Cole's algorithm. Notice that this special case is not mentioned in Cole's algorithm.

To implement Step 3, we use the already known ranks SUP(v) → UP(u) and UP(u) → SUP(w) to cross-rank SUP(v) and SUP(w). This step implements Substep 2 of Step 1 of the second computation in Cole's algorithm. We describe how to calculate SUP(v) → SUP(w); SUP(w) → SUP(v) is obtained in a similar way. SUP(v) → SUP(w) is calculated in two sub-steps: (1) find the range of an item's rank through communication; (2) calculate the exact rank using local computation.

Sub-step 1: Do a find-max-representatives operation in SUP(v) based on the items' ranks in UP(u). Each representative P_i then requests that its two straddle processors in UP(u), P_j and P_{j+1}, send their ranks r and t in SUP(w) back. This is a multiple multicast operation. The fact that only the representatives in SUP(v) are involved in the communication guarantees that any processor in this array receives at most one message per bus cycle. The number of items involved in the communication and local comparison is determined as follows (see Fig. 4). Given that P_m < P_j and P_j < P_i, we obtain P_m < P_i without any comparison. Since P_i < P_{j+1} and P_n < P_{j+1} < P_{n+1}, we can determine that P_i < P_{n+1}. That leaves the items between P_m and P_n (including P_n) undetermined; according to Lemma 2, there are at most three such items. Therefore, each representative processor broadcasts a request to the processors whose ranks are in the range (r, t] in SUP(w). The request message indicates the order in which each processor in SUP(w) should send its item back. After a maximum of three one-to-one communication operations, each representative processor in SUP(v) sets its switch to cross to form a sub-bus and broadcasts those items and the rank r in its sub-bus.

[Fig. 3. SUP(v) → UP(u): P_{i−3} and P_i are representatives; P_{j−2}, P_{j−1}, and P_j are in the same sub-bus and have rank s in UP(u).]


[Fig. 4. Finding the ranks in SUP(w) of P_i's two straddle processors P_j and P_{j+1} in UP(u). Solid lines represent data communication; dashed lines represent ranks.]

Sub-step 2: Each processor in SUP(v) now compares the (at most three) received items with its own item and determines its rank in SUP(w). For example, if a processor owns item x and received items q, y, z together with a rank r, and q < x < y < z, then x's rank in SUP(w) is r + 1.
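This local computation can be sketched in a few lines of Python (our illustration; received holds the at most three items of SUP(w), guaranteed by Lemma 2, whose ranks lie in (r, t]):

def rank_from_straddle(own_item, r, received):
    """Rank in SUP(w) = r plus the number of received items smaller than own_item."""
    return r + sum(1 for q in received if q < own_item)

# item x = 9 with received items q, y, z = 8, 10, 12 and rank r = 1:
print(rank_from_straddle(9, 1, [8, 10, 12]))  # 2, i.e., r + 1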
Step 4 implements Substep 3 of Step 1 of the second computation in Cole's algorithm. Each processor P_i in SUP(v) and SUP(w) computes its rank r in NEWUP(u) by adding its rank in SUP(v) to its rank in SUP(w). Each P_i then sends its item to the r-th processor in NEWUP(u). Note that we also obtain two by-products: SUP(v) → NEWUP(u) and SUP(w) → NEWUP(u).

Step 5 implements Step 2 of the second computation in Cole's algorithm. The resulting ranks will be used by inside nodes in Steps 2 and 5 of the next stage. We show how to compute NEWUP(u) → NEWSUP(v) in detail; NEWUP(u) → NEWSUP(w) is obtained analogously. Since NEWUP(u) = SUP(v) ∪ SUP(w), to obtain NEWUP(u) → NEWSUP(v) we only need to find SUP(v) → NEWSUP(v) and SUP(w) → NEWSUP(v). Here are the details.

Substep 1 – compute SUP(v) → NEWSUP(v). Given UP(u) → SUP(v), UP(u) → SUP(w), and NEWUP(u) = SUP(v) ∪ SUP(w), UP(u) → NEWUP(u) can be computed in UP(u) by adding each item's ranks in SUP(v) and SUP(w). UP(v) → NEWUP(v) is obtained in the same way. Next, SUP(v) → NEWSUP(v) is calculated as follows: processor P_i in UP(v) requests that processor P_j in NEWUP(v) send back its rank r in NEWSUP(v), where P_j ≤ P_i < P_{j+1}. Processor P_i then passes rank r to processor P_k in SUP(v), where P_k holds the same item as P_i and can be located by applying the sampling rule. According to Lemma 4, P_j also holds the same item as P_i. Since P_k, P_i, and P_j all hold the same item, r is the correct rank in NEWSUP(v) for P_k. Now each item in SUP(v) is ranked in NEWSUP(v); see Fig. 5.

Substep 2 – compute SUP(w) → NEWSUP(v). Do a find-max-representatives operation in SUP(w) using the items' ranks in SUP(v). Each representative processor P_i finds the two processors that straddle P_i in SUP(v) and obtains their ranks r and t in NEWSUP(v). Using SUP(w) → NEWUP(u), each representative processor in SUP(w) sends the two ranks to the processor in NEWUP(u) that holds the same item. The processors in NEWUP(u) that receive two ranks mark themselves as representatives. Since we release the processors assigned to SUP arrays and reassign them to NEWSUP arrays, the following operations are performed at the beginning of Step 2 of the next stage: each representative processor P_j in NEWUP(u) requests that the processors in NEWSUP(v) whose ranks fall in the interval (r, t] send their items back in a specified order; processor P_j then uses these items to determine its rank in NEWSUP(v). This step is similar to Step 3. Now we have SUP(w) → NEWSUP(v). Combining the two substeps, we obtain NEWUP(u) → NEWSUP(v).

Next, we compute the time complexity of the algorithm. Step 1 consists of a simple local comparison and a one-to-one communication, so it takes constant time. Step 2 consists of three basic operations, find-min-representatives, one-to-one communication, and broadcast, all of which run in constant time. Step 3 consists of a constant number of the basic communication operations find-max-representatives, multiple multicast, one-to-one communication, and broadcast, plus at most three local comparisons to determine the rank at the end of the step; thus Step 3 also runs in constant time. Step 4 consists of one local addition and a one-to-one communication, and runs in constant time.

[Fig. 5. Computing SUP(v) → NEWSUP(v): P_i in UP(v) requests rank r from P_j in NEWUP(v) and passes it to P_k in SUP(v). Solid lines represent data communication; dashed lines represent ranks.]


Step 5 consists of a constant number of one-to-one communications plus communications and computations similar to those of Step 3; thus Step 5 runs in constant time. We conclude that the time complexity of each stage is O(1). As the algorithm proceeds in 3 log N stages, we obtain the following theorem.

Theorem 1. There is an O(log N)-time optimal sorting algorithm on the LARPBS model using O(N) processors.

4.5. Processor assignment and reuse

In this algorithm, processors are assigned only to active nodes. We allocate processors for each item in the following three binary trees: the UP tree, the SUP tree, and the NEWUP tree, shown in Fig. 6. Each tree is represented by a tree-bus. The three tree-buses are aligned in a circular queue in the following order: UP tree, SUP tree, NEWUP tree. At the end of each stage, the UP tree is released and the NEWUP tree becomes the UP tree; after Step 1 of each stage, the SUP tree is released and a new SUP tree is generated after the UP tree. Inside each tree-bus, we further segment the bus into smaller sub-buses, one per node, which we call node-buses. Within a tree-bus, processors are assigned level by level, from top to bottom, in breadth-first order. Within a node-bus, items are always kept in sorted order, with each item assigned to one processor. Since the number of items on each node-bus and each tree-bus is fixed in each stage, each processor can calculate the addresses of other processors before performing any communication operations. The final result appears in the root node-bus of the UP tree-bus.

Lemma 5. The number of processors needed at any stage of the parallel merge sort algorithm on the LARPBS model is upper bounded by 4N.

Proof. According to [1], in each life-cycle the total size of the UP arrays is upper bounded by N + N/7, N + 2N/7, and N + 4N/7 on the first, second, and third stages, respectively. The total size of the SUP/NEWUP arrays is upper bounded by 2N/7, 4N/7, and 8N/7 on the first, second, and third stages, respectively. The sum of the maxima of the three array families is therefore (N + 4N/7) + 8N/7 + 8N/7 = (3 + 6/7)N < 4N. This proves the lemma. □

Based on Lemma 5, we use 4N processors for our parallel sorting algorithm. Processor reuse is based on the following two properties of the algorithm: (1) the NEWUP arrays of the current stage are the UP arrays of the next stage; (2) external nodes participate in the algorithm for only three stages. Our solution is: at the end of each stage, release the processors assigned to UP arrays and rename the NEWUP arrays to UP arrays; at the end of each life-cycle, release the processors assigned to the UP arrays of the now-inactive external nodes.

5. An example

For the original list (19, 18, 17, 4, 6, 13, 9, 15, 11, 14, 8, 3, 7, 16, 12, 10), we walk through the four life-cycles of the algorithm described in Section 4.

5.1. Life-cycle 1

The items at nodes n_{0,0}–n_{0,15} are: 19, 18, 17, 4, 6, 13, 9, 15, 11, 14, 8, 3, 7, 16, 12, 10. There is no need to calculate SUP and UP arrays for the leaves; we only need to cross-rank sibling leaves. The ranks for n_{0,0}–n_{0,15} are: 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0.

[Fig. 6. Processor assignment for sorting on the LARPBS model. Three tree-buses in a circular queue: the UP tree-bus with node-buses n1 = (a, b, c, d), n2 = (e, f, g, h), n3 = (a, e); the SUP tree-bus with n1 = (a, c), n2 = (e, g), n3 = (); and the NEWUP tree-bus with n3 = (a, c, e, g).]


These ranks will be used to generate the UP and SUP arrays for the nodes on the next level. All leaf nodes become inactive and nodes n_{1,0}–n_{1,7} become external after this cycle. See Fig. 7 for the result of life-cycle 1.

5.2. Life-cycle 2

The UP arrays for nodes n_{1,0}–n_{1,7} come from the NEWUP arrays in Fig. 7: (18, 19), (4, 17), (6, 13), (9, 15), (11, 14), (3, 8), (7, 16), (10, 12). Cross-ranking sibling nodes gives the following ranks: (2, 2), (0, 0), (0, 1), (1, 2), (2, 2), (0, 0), (0, 2), (1, 1). The UP arrays for nodes n_{2,0}–n_{2,3} are generated by merging the two SUP lists of their children. For example, 17's rank in UP(n_{2,0}) is equal to its rank in UP(n_{1,1}), which is 1, plus its rank in UP(n_{1,0}), which is 0; thus its rank in UP(n_{2,0}) is 1. The UP arrays for nodes n_{2,0}–n_{2,3} are: (4, 17, 18, 19), (6, 9, 13, 15), (3, 8, 11, 14), (7, 10, 12, 16). After this cycle, nodes n_{1,0}–n_{1,7} become inactive and nodes n_{2,0}–n_{2,3} become external nodes. See Fig. 8 for the results at the end of this life-cycle.

5.3. Life-cycle 3

Stage 1: By sampling every fourth item of the UP arrays obtained in the previous cycle, we obtain the following SUP arrays for nodes n_{2,0}–n_{2,3}: (4), (6), (3), (7). The NEWUP arrays for nodes n_{3,0} and n_{3,1} are obtained by merging their children's SUP arrays.
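The merges of this example all follow the rank-sum rule of Step 4: an item's position in the merged array is its local rank plus its rank in the sibling array. A short Python sketch (our check of the example, assuming distinct keys and 0-based ranks) reproduces the life-cycle 2 merges:

import bisect

def merge_by_rank_sums(a, b):
    """Place each item at (local rank) + (rank in the sibling array)."""
    out = [None] * (len(a) + len(b))
    for arr, other in ((a, b), (b, a)):
        for local, x in enumerate(arr):
            out[local + bisect.bisect_left(other, x)] = x
    return out

print(merge_by_rank_sums([18, 19], [4, 17]))  # [4, 17, 18, 19] = UP(n_{2,0})
print(merge_by_rank_sums([6, 13], [9, 15]))   # [6, 9, 13, 15]  = UP(n_{2,1})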

[Fig. 7. Example: sorting results at the end of life-cycle 1 ("Inact" indicates inactive nodes). Leaves n_{0,0}–n_{0,15} hold 19, 18, 17, 4, 6, 13, 9, 15, 11, 14, 8, 3, 7, 16, 12, 10; the NEWUP arrays at n_{1,0}–n_{1,7} are {18,19}, {4,17}, {6,13}, {9,15}, {11,14}, {3,8}, {7,16}, {10,12}; all higher nodes are inactive.]

[Fig. 8. Example: sorting results at the end of life-cycle 2. At n_{1,0}–n_{1,7}, SUP = UP = {18,19}, {4,17}, {6,13}, {9,15}, {11,14}, {3,8}, {7,16}, {10,12}; the NEWUP arrays at n_{2,0}–n_{2,3} are {4,17,18,19}, {6,9,13,15}, {3,8,11,14}, {7,10,12,16}; the leaves are inactive.]


Since there is only one item in each SUP array, the merge is trivial. The NEWUP arrays are: NEWUP(n_{3,0}) = (4, 6), NEWUP(n_{3,1}) = (3, 7).

Stage 2: The SUP arrays for nodes n_{2,0}–n_{2,3} are obtained by sampling every second item of their UP arrays. Table 1 shows the updated SUP arrays; the ranks in the NEWUP arrays are calculated by adding the local ranks and the cross ranks.

Stage 3: The SUP arrays for nodes n_{2,0}–n_{2,3} are now the same as their corresponding UP arrays. The SUP arrays for nodes n_{3,0} and n_{3,1} are obtained by sampling every fourth item of their UP arrays; thus SUP(n_{3,0}) = (4) and SUP(n_{3,1}) = (3). Table 2 shows how NEWUP(n_{3,0}) and NEWUP(n_{3,1}) are obtained in stage 3. Nodes n_{3,0} and n_{3,1} now become external nodes. NEWUP(n_{4,0}) = (3, 4) is obtained by merging SUP(n_{3,0}) = (4) and SUP(n_{3,1}) = (3). See Fig. 9 for the results at the end of this life-cycle.

5.4. Life-cycle 4

In this life-cycle, n_{4,0} is the only inside node. Since n_{4,0} is the root of the tree, there is no need to calculate SUP(n_{4,0}), as it will not be used. We start from UP(n_{4,0}) = (3, 4).

Stage 1: The SUP arrays for nodes n_{3,0} and n_{3,1} are obtained by sampling every fourth item of their UP arrays. Table 3 shows how NEWUP(n_{4,0}) is obtained in this stage.

Stage 2: The SUP arrays for nodes n_{3,0} and n_{3,1} are obtained by sampling every second item of their UP arrays. Table 4 shows how NEWUP(n_{4,0}) is obtained in this stage.

Table 1
Calculating NEWUP(n_{3,0}) and NEWUP(n_{3,1}) in life-cycle 3, stage 2.

                 SUP(n_{2,0})   SUP(n_{2,1})   SUP(n_{2,2})   SUP(n_{2,3})
                 (4, 18)        (6, 13)        (3, 11)        (7, 12)
Local rank       (0, 1)         (0, 1)         (0, 1)         (0, 1)
Cross rank       (0, 2)         (1, 1)         (0, 1)         (1, 2)
Rank in NEWUP    (0, 3)         (1, 2)         (0, 2)         (1, 3)

NEWUP(n_{3,0}) = (4, 6, 13, 18); NEWUP(n_{3,1}) = (3, 7, 11, 12)

Table 2
Calculating NEWUP(n_{3,0}) and NEWUP(n_{3,1}) in life-cycle 3, stage 3.

                 SUP(n_{2,0})      SUP(n_{2,1})    SUP(n_{2,2})    SUP(n_{2,3})
                 (4, 17, 18, 19)   (6, 9, 13, 15)  (3, 8, 11, 14)  (7, 10, 12, 16)
Local rank       (0, 1, 2, 3)      (0, 1, 2, 3)    (0, 1, 2, 3)    (0, 1, 2, 3)
Cross rank       (0, 4, 4, 4)      (1, 1, 1, 1)    (0, 1, 2, 3)    (1, 2, 3, 4)
Rank in NEWUP    (0, 5, 6, 7)      (1, 2, 3, 4)    (0, 2, 4, 6)    (1, 3, 5, 7)

NEWUP(n_{3,0}) = (4, 6, 9, 13, 15, 17, 18, 19); NEWUP(n_{3,1}) = (3, 7, 8, 10, 11, 12, 14, 16)

[Fig. 9. Example: sorting results at the end of life-cycle 3. At n_{2,0}–n_{2,3}, SUP = UP = {4,17,18,19}, {6,9,13,15}, {3,8,11,14}, {7,10,12,16}; at n_{3,0}, NEWUP = {4,6,9,13,15,17,18,19} and SUP = {4}; at n_{3,1}, NEWUP = {3,7,8,10,11,12,14,16} and SUP = {3}; at n_{4,0}, NEWUP = {3,4}; all lower nodes are inactive.]

Table 3
Calculating NEWUP(n_{4,0}) in life-cycle 4, stage 1.

                 SUP(n_{3,0})   SUP(n_{3,1})
                 (4, 15)        (3, 11)
Local rank       (0, 1)         (0, 1)
Cross rank       (1, 2)         (0, 1)
Rank in NEWUP    (1, 3)         (0, 2)

NEWUP(n_{4,0}) = (3, 4, 11, 15)

Table 4
Calculating NEWUP(n_{4,0}) in life-cycle 4, stage 2.

                 SUP(n_{3,0})     SUP(n_{3,1})
                 (4, 9, 15, 18)   (3, 8, 11, 14)
Local rank       (0, 1, 2, 3)     (0, 1, 2, 3)
Cross rank       (1, 2, 4, 4)     (0, 1, 2, 2)
Rank in NEWUP    (1, 3, 6, 7)     (0, 2, 4, 5)

NEWUP(n_{4,0}) = (3, 4, 8, 9, 11, 14, 15, 18)

Table 5
Calculating NEWUP(n_{4,0}) in life-cycle 4, stage 3.

                 SUP(n_{3,0})                     SUP(n_{3,1})
                 (4, 6, 9, 13, 15, 17, 18, 19)    (3, 7, 8, 10, 11, 12, 14, 16)
Local rank       (0, 1, 2, 3, 4, 5, 6, 7)         (0, 1, 2, 3, 4, 5, 6, 7)
Cross rank       (1, 1, 3, 6, 7, 8, 8, 8)         (0, 2, 2, 3, 3, 3, 4, 5)
Rank in NEWUP    (1, 2, 5, 9, 11, 13, 14, 15)     (0, 3, 4, 6, 7, 8, 10, 12)

NEWUP(n_{4,0}) = (3, 4, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19)

[Fig. 10. Example: sorting results at the end of life-cycle 4. At n_{4,0}, NEWUP = {3,4,6,7,8,9,10,11,12,13,14,15,16,17,18,19}; at n_{3,0}, SUP = UP = {4,6,9,13,15,17,18,19}; at n_{3,1}, SUP = UP = {3,7,8,10,11,12,14,16}; all other nodes are inactive.]

Stage 3: The SUP arrays for nodes n_{3,0} and n_{3,1} are the same as the corresponding UP arrays. Table 5 shows how NEWUP(n_{4,0}) is obtained in this stage. Now the whole list is sorted; see Fig. 10.

6. Conclusions

In this paper, we implemented an optimal sorting algorithm on the LARPBS model that runs in O(log N) time using O(N) processors. Our algorithm is based on Cole's CREW PRAM pipelined merge sort [1]. Although Cole's algorithm is optimal, it is designed for an ideal model, the CREW PRAM, where the implementation details of communication among processors are not considered. Unlike many theoretical models, the LARPBS model is more realistic and is likely to become a feasible architecture in the near future; communication has to be taken care of in any algorithm designed for this model. We extended Cole's


optimal sorting algorithm to the LARPBS model by working out the communication details, provided a processor-efficient solution for processor assignment and reuse, and obtained an optimal sorting algorithm on the LARPBS model. Furthermore, we discovered two new interesting properties of the algorithm and presented them as lemmas with proofs. We also provided a detailed example demonstrating how each step of the algorithm works. The previous best sorting algorithm on the LARPBS model runs in O(log N log log N) worst-case time using N processors [3]; our optimal sorting algorithm removes the O(log log N) gap. By making an optimal sorting algorithm available on the LARPBS model, we further explored the power of an optical interconnection model for parallel computing, laid a foundation for sorting on higher-dimensional optical bus systems, and opened the possibility of designing more efficient parallel algorithms that depend on sorting as a building block on the LARPBS model.

Acknowledgements

The authors would like to thank the two reviewers for their valuable suggestions for improving the quality of the paper. The authors would also like to thank Professor Burkhard Englert and Professor Arthur Gittleman from California State University Long Beach for careful reading and suggestions that improved the quality and presentation of the paper.

References

[1] Cole R. Parallel merge sort. SIAM J Comput 1988;17(4):770–85.
[2] Pan Y, Li K. Linear array with a reconfigurable pipelined bus system—concepts and applications. J Inform Sci 1998;106:237–58.
[3] Datta A, Soundaralakshmi S, Owens R. Fast sorting algorithms on a linear array with a reconfigurable pipelined bus system. IEEE Trans Parallel Distribut Syst 2002;13(3):212–22.
[4] Levitan S, Chiarulli D, Melhem R. Coincident pulse techniques for multiprocessor parallel computing. Appl Optics 1990;29(14):2024–39.
[5] Sahni S. Models and algorithms for optical and optoelectronic parallel computers. In: Proceedings of the 4th international symposium on parallel architectures, algorithms and networks; 1999. p. 2–7.
[6] Rajasekaran S, Sahni S. Sorting, selection, and routing on the array with reconfigurable optical buses. IEEE Trans Parallel Distribut Syst 1997;8(11):1123–32.
[7] Pavel S, Akl S. Matrix operations using arrays with reconfigurable optical buses. Parallel Algorithms Appl 1996;11:223–42.
[8] Wang BF, Chen GH. Constant time algorithms for transitive closure and some related graph problems on processor arrays with reconfigurable bus systems. IEEE Trans Parallel Distribut Syst 1990;1(10):500–7.
[9] Li Y, Pan Y, Zheng SQ. Pipelined time-division multiplexing optical bus with conditional delays. Optical Eng 1997;36(9):2417–24.
[10] Guo Z, Melhem RG, Hall RW, Chiarulli DM, Levitan SP. Pipelined communications in optically interconnected arrays. J Parallel Distribut Comput 1991;12(3):269–82.
[11] Roldan R, D'Auriol B. A preliminary feasibility study of the LARPBS optical bus parallel model. White paper at silicon.com; March 2003.
[12] D'Auriol B, Molakaseema R. A parameterized linear array with a reconfigurable pipelined bus system: LARPBS(p). Comput J 2005;48:115–25.
[13] D'Auriol B. The systems edge of the parameterized linear array with a reconfigurable pipelined bus system (LARPBS(p)) optical bus parallel computing model. J Supercomput. Published online by Springer on July 29, 2008.
[14] Pan Y, Hamdi M, Li K. Efficient and scalable quicksort on a linear array with a reconfigurable pipelined bus system. Future Generat Comput Syst 1997–1998;13:501–13.
[15] Chen L, Pan Y, Xu XH. Scalable and efficient parallel algorithms for Euclidean distance transform. IEEE Trans Parallel Distribut Syst 2004;15(11):975–82.
[16] Bourgeois AG, Pan Y, Prasad SK. Constant time fault tolerant algorithms for a linear array with a reconfigurable pipelined bus system. J Parallel Distribut Comput 2005;65(3):374–81.
[17] Li K, Pan Y, Zheng SQ. Parallel computing using optical interconnections. Boston: Kluwer Academic Publishers; 1998.
[18] Pan Y, Hamdi M. Quicksort on a linear array with a reconfigurable pipelined bus system. In: Proceedings of the second international symposium on parallel architectures, algorithms and networks; 1996. p. 313–9.
[19] Wang YR. Fast algorithms for block-based medial axis transform on the LARPBS. IEEE Int Conf Syst, Man Cyb 2005;4:3616–21.
[20] Wang YR. An efficient O(1) time 3D all nearest neighbor algorithm from image processing perspective. J Parallel Distribut Comput 2007;67(10):1082–91.
[21] Han Y, Pan Y, Shen H. Fast parallel selection on the linear array with reconfigurable bus system. In: Proceedings of the seventh symposium on frontiers of massively parallel computation; 1999. p. 286–93.
[22] Han Y, Pan Y, Shen H. Sublogarithmic deterministic selection on arrays with a reconfigurable bus. IEEE Trans Comput 2002;51(6):702–7.
[23] Arock M, Ponalagusamy R. A constant-time selection algorithm on an LARPBS. In: Proceedings of the 3rd Asian applied computing conference; 2005. p. 68–72.
[24] Li K. Constant time boolean matrix multiplication on a linear array with a reconfigurable pipelined bus system. J Supercomput 1997;11(4):391–403.
[25] Li K, Pan Y. Parallel matrix multiplication on a linear array with a reconfigurable pipelined bus system. In: Proceedings of the 13th international parallel processing symposium and 10th symposium on parallel and distributed processing; 1999. p. 31–5.
[26] Li K, Pan Y, Zheng SQ. Fast and processor efficient parallel matrix multiplication algorithms on a linear array with a reconfigurable pipelined bus system. IEEE Trans Parallel Distribut Syst 1998;9(8):705–20.
[27] Pan Y, Li K, Zheng SQ. Fast nearest neighbor algorithms on a linear array with a reconfigurable pipelined bus system. J Parallel Algorithms Appl 1998;13:1–25.
[28] Wang YR, Horng SJ, Wu CH. Efficient algorithms for the all nearest neighbor and closest pair problems on the linear array with a reconfigurable pipelined bus system. IEEE Trans Parallel Distribut Syst 2005;16(3):193–206.
[29] Li J, Pan Y, Shen H. More efficient topological sort using reconfigurable optical buses. J Supercomput 2003;24:251–8.
[30] Li K, Pan Y, Hamdi M. Solving graph theory problems using reconfigurable pipelined optical buses. In: Proceedings of the third workshop on optics and computer science (WOCS'99), vol. 1586; 1999. p. 911–23.
[31] Trahan JL, Bourgeois AG, Pan Y, Vaidyanathan R. An optimal and scalable algorithm for permutation routing on reconfigurable linear arrays with optically pipelined buses. J Parallel Distribut Comput 2000;60(9):1125–36.
[32] Semé D, Youlou S. Repetitions detection on a linear array with reconfigurable pipelined bus system. Int J Parallel, Emergent Distribut Syst 2007;22(3):173–83.
[33] Semé D, Youlou S. An efficient sequence alignment algorithm on a LARPBS. Lecture notes in computer science, Computational science and its applications – ICCSA, vol. 4707; 2007. p. 379–87.
[34] Semé D, Youlou S. An efficient parallel algorithm for the longest increasing subsequence problem on a LARPBS. In: Proceedings of the eighth international conference on parallel and distributed computing, applications and technologies; 2007. p. 251–8.


[35] Xu XH, Chen L, He P. Fast sequence similarity with LCS and LARPBS. In: Proceedings of ISPA workshops; 2005. p. 168–75.
[36] Arock M, Ponalagusamy R. A parallel solution for the unconstrained maximum elements problem. In: Proceedings of the international conference on advanced computing and communications; 2006. p. 12–5.
[37] Li K, Pan Y, Zheng SQ. Efficient deterministic and probabilistic simulations of PRAMs on linear arrays with reconfigurable pipelined bus systems. J Supercomput 2000;15:163–81.
[38] Fortune S, Wyllie J. Parallelism in random access machines. In: Proceedings of the 10th annual ACM symposium on theory of computing; 1978. p. 114–8.
[39] Ajtai M, Komlos J, Szemeredi E. An O(n log n) sorting network. In: Proceedings of the 15th annual ACM symposium on theory of computing; 1983. p. 1–9.
[40] Dighe OM, Vaidyanathan R, Zheng SQ. Bus-based tree structures for efficient parallel computation. In: Proceedings of the international conference on parallel processing (ICPP'93), vol. 1; 1993. p. 158–61.