JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING 48, 64–70 (1998)
ARTICLE NO. PC971401

Improved Bounds for Integer Sorting in the EREW PRAM Model

Anders Dessmark and Andrzej Lingas
Department of Computer Science, Lund University, Box 118, S-221 00 Lund, Sweden

A new simple method of exploiting nonstandard word length in the nonconservative RAM and PRAM models is considered. As a result, improved bounds for parallel integer sorting in the EREW PRAM model with standard and nonstandard word length are obtained. © 1998 Academic Press

1. INTRODUCTION

In the unit-cost analysis of algorithms for RAMs and PRAMs, where each instruction requires exactly one time unit, the machine word is often assumed to be of length logarithmic in the input size. In this way, the machine's random address space is sufficiently large to store the input. Although this assumption is reasonable, it still seems a bit arbitrary. With the exception of sorting [3, 8, 10], not much is known about problem complexity in the so-called nonconservative RAM model, where the machine word is of substantially larger size. In the introduction to [3], Albers and Hagerup argue that "algorithms using a nonstandard word length should not be hastily discarded as unfeasible and beyond practical relevance." The truth of these words has been confirmed recently by Andersson et al. in [5]. First, they iterated the O(n)-time reduction of sorting n integers of value ≤ 2^m to sorting n integers of value ≤ 2^{m/2}, due to Kirkpatrick and Reisch [10], 2⌈log log n⌉ times. In this way, they reduced the input problem to that of sorting n integers of bit length ≤ m/log^2 n in time O(n log log n). To solve the latter problem, they simply applied the nonconservative linear-time sorting algorithm of Albers and Hagerup (cf. Corollary 1 in [3]), which uses words of length O(m log n log log n). As a result, they obtained an O(n log log n)-time algorithm for sorting n integers ≤ 2^m using the conservative (for sorting) word length O(log n + m).

E-mail: [email protected], [email protected].

0743-7315/98 $25.00 Copyright © 1998 by Academic Press. All rights of reproduction in any form reserved.

The known efficient nonconservative algorithms use the restricted parallelism hidden in unit-cost operations on longer words [3, 8]. A straightforward approach packs consecutive groups of input items into consecutive long words and then solves the subproblems induced by the groups faster, taking advantage of the restricted parallelism. For instance, in the nonconservative sorting algorithm of Albers and Hagerup [3], k integers are packed into a single word and then sorted using only O(log k) operations.

We propose an alternative general approach to using the restricted parallelism of the nonconservative RAM. Simply, we divide the input problem, if possible, into k analogous subproblems. Next, we place the ith input item of the lth subproblem into the lth field of the ith machine word. Further, we run an oblivious algorithm, if possible, on such a subproblem vector, performing a number of operations proportional to that taken by a single subproblem with a standard word length. Whether some substantial merging step is then needed (as, e.g., for sorting), or the original problem is already solved, depends on the nature of the problem. We provide a general framework for efficient word-parallel simulation of k computations of an oblivious PRAM or of an oblivious RAM (see Section 3) by a PRAM or RAM, respectively.³

Our general approach yields, in particular, an improved bound on the time and the nonconservativeness factor for linear-work parallel integer sorting in the EREW PRAM model. We show that n integers of size ≤ 2^m can be sorted in time O(log n log log n) on an EREW PRAM with n/(log n log log n) processors and word length O((log n + m) log n). The previous best known bounds were O(log^2 n) time on an EREW PRAM with a linear time–processor product and word length O(m log n log log n). In [3], Albers and Hagerup also presented a method of reducing parallel conservative integer sorting to parallel nonconservative integer sorting.
Combining their method with our new bounds on nonconservative parallel integer sorting, we also obtain improved bounds on conservative parallel integer sorting in the EREW PRAM model. In particular, we show that n integers of size n^{O(1)} can be sorted in time O(log^{3/2} n) on an EREW PRAM with n/log n processors. This improves the enhanced result from the recent journal version [4] of [3], where the corresponding work bound, i.e., time–processor product, is O(n √(log n log log n)).
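The orthogonal placement described above can be sketched concretely. The following Python fragment is our own illustration, not from the paper: arbitrary-precision integers stand in for the long words of the nonconservative machine, and the field width `l` and the helper names are arbitrary choices.

```python
l = 8  # field length in bits (1 test bit + l-1 data bits per field)

def pack(subproblems):
    """Orthogonal-word representation: item i of subproblem q
    goes into field q of word i."""
    n = len(subproblems[0])
    words = []
    for i in range(n):
        w = 0
        for q, sub in enumerate(subproblems):
            w |= sub[i] << (q * l)  # place the item in the qth field
        words.append(w)
    return words

def unpack(words, k):
    """Recover the k subproblem vectors from the packed words."""
    field = (1 << l) - 1
    return [[(w >> (q * l)) & field for w in words] for q in range(k)]
```

A single word-parallel operation on `words` now acts on all k subproblems at once, which is the source of the restricted parallelism exploited in the sequel.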

2. SIMULATION OF k-WISE RAM INSTRUCTIONS

A PRAM is a synchronous parallel machine consisting of a finite number of processors and a finite number of memory cells accessible to all processors. A RAM is a PRAM with a single processor. Following [3], the PRAM instruction set is assumed to include addition, subtraction, comparison, shift, and the bitwise operations AND and OR. The shift operation shift(x, i) is given by ⌊x·2^i⌋. In an EREW PRAM, a memory cell can be accessed only by a single processor at a time. In a CREW PRAM, several processors can simultaneously read the same cell, but concurrent writing is still disallowed. In a CRCW PRAM, both concurrent reading and concurrent writing are allowed.

³ In a preliminary version of this paper [6], a similar framework for simulating k computations of an oblivious Turing machine by a RAM was presented.
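For concreteness, the shift instruction defined above, shift(x, i) = ⌊x·2^i⌋, behaves as in this small Python model (an illustrative stand-in; the function name is ours):

```python
def shift(x, i):
    """shift(x, i) = floor(x * 2**i); a negative i therefore shifts right."""
    return x << i if i >= 0 else x >> -i
```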

Usually, one assumes that a PRAM memory cell stores a bit-word of length logarithmic in the input size. In the case of sorting n integers in the range 1 through 2^m, the word length is assumed to be O(log n + m). We shall also use the so-called nonconservative model of RAM and PRAM with substantially larger word length [3, 10].

To exploit the larger word length, we shall assume, as in [3], that a word consists of at least k fields of equal length. The leftmost bit of each field serves as a test bit, and the remaining bits are used to store strings over the binary alphabet, in particular integers. Usually, k strings are stored in the k rightmost fields of a word. A k-tuple (S_1, ..., S_k), where S_i is a string sequence (x_1^i, x_2^i, ...), for i = 1, ..., k, is given in orthogonal-word representation if there are consecutive machine words such that, for i = 1, ..., k and j = 1, 2, ..., the string x_j^i is in the ith field of the jth word. The function CopyTestBit, which copies the test bits in A to the remaining positions of the corresponding fields in A, was considered in [3]. It is given by A ∧ C1 − shift(A ∧ C1, −l + 1).

Throughout the paper, we shall measure the time complexity of RAMs and PRAMs according to the unit-cost criterion with bounded word length. Assuming the notation and conventions above, we have the following useful lemma.

LEMMA 2.1. Let M be a nonconservative EREW PRAM whose word consists of at least k fields of length l. Let I be the set of the basic PRAM instructions different from the shift instruction.
After an O(log k)-time, O(1)-processor preprocessing, for any instruction in I and any pair of k-tuple arguments given in orthogonal-word representation, M can produce the corresponding k-tuple of results of the instruction, in orthogonal-word representation, in time O(1) using O(1) processors (in the case of comparison, the test bit of the ith field of the resulting word, 1 ≤ i ≤ k, is set to one iff the contents of the corresponding field of the first argument word are greater than or equal to those of the second one). Also, after an O(log k)-time, O(l)-processor preprocessing, for any k-tuple of arguments in orthogonal-word representation and 0 ≤ q < l, M can produce the corresponding k-tuple of the arguments shifted by q to the left, in orthogonal-word representation, in time O(1) using O(1) processors.

Proof. The lemma is obvious for the bitwise operations AND and OR, denoted respectively by ∧ and ∨. For the remaining instructions in I we need to use a constant number of masks. In particular, we need the mask C1, in which the test bits are set to 1 and the remaining bits to 0, and the complementary mask C0. Following [3], C1 can be computed by

    C1 ← shift(1, l − 1)
    for t ← 0 to log k
        C1 ← C1 ∨ shift(C1, 2^t · l)

Due to these masks and the separating test bits, we easily obtain the lemma for addition by A = (B + C) ∧ C0. For the comparison operation, the lemma follows from the method of Paul and Simon [12]. For two argument words A1 and A2, we can obtain the result on the test bits simply by (A1 ∨ C1) − (A2 ∧ C0).
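The primitives of Lemma 2.1, together with the comparator of Corollary 2.2 below, can be modeled in a few lines of Python. This is our own sketch: big ints play the long words, the values of `k` and `l` are arbitrary (with `k` assumed a power of two), and the helper names are ours; the formulas are the ones from the proof above and from [3].

```python
import math

k, l = 4, 8                                # k fields of l bits each

# mask C1: test bit (leftmost bit) of every field set, built by doubling
C1 = 1 << (l - 1)
for t in range(math.ceil(math.log2(k))):   # assumes k is a power of two
    C1 |= C1 << ((1 << t) * l)
C0 = ((1 << (k * l)) - 1) ^ C1             # complementary mask

def kwise_add(B, C):
    # field-wise addition: the test bits absorb the carries, C0 clears them
    return (B + C) & C0

def kwise_ge(A1, A2):
    # Paul-Simon comparison: the test bit of field i of the result is set
    # iff field i of A1 >= field i of A2 (arguments have clear test bits)
    return ((A1 | C1) - (A2 & C0)) & C1

def copy_test_bit(A):
    # CopyTestBit: spread each test bit over the l-1 data positions
    t = A & C1
    return t - (t >> (l - 1))

def kwise_max_min(A, B):
    # the k-wise comparator: field-wise maxima and minima at once
    F = copy_test_bit(kwise_ge(A, B))      # data mask where A's field >= B's
    G = C0 ^ F                             # data mask where A's field <  B's
    return (A & F) | (B & G), (B & F) | (A & G)
```

Note that no borrow or carry ever crosses a field boundary: the data values occupy l − 1 bits, so the test bits isolate the fields from one another.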

Using the result of k-wise comparison for two k-tuples in orthogonal-word representation and the function copying the test bits, we can also easily produce the corresponding k-tuples of maxima and minima in orthogonal-word representation, i.e., implement a k-wise comparator. By applying this, we can reduce the k-wise subtraction to its restricted case, where the first argument is never smaller than the second. The so-restricted k-wise subtraction can be directly implemented by a single subtraction of M. Finally, the k-wise shift by q to the left can be implemented in a straightforward way by a single shift of M by q and the bitwise AND operation with a special mask. The mask has 0's in the q rightmost positions of each field and 1's in the remaining ones. Clearly, we can create such masks for all 0 ≤ q < l in time O(log k) using O(l) processors of M.

Assuming the preprocessing of Lemma 2.1, we obtain the following corollary.

COROLLARY 2.2. The function CopyTestBit and the k-wise comparator can be implemented in constant time using a single processor of M.

3. k-WISE SIMULATION OF AN OBLIVIOUS PRAM

The PRAM operates in steps. At each step, each processor accesses some memory cell. Each instruction in the instruction set can be implemented in a standard way in a constant number of cycles. By naturally extending the definition of an oblivious RAM (cf. [11]), we say that a PRAM is oblivious if the address of the memory cell accessed in its ith step by its kth processor depends only on the input size, i, and k, and not on the input values. However, for our purposes, we need a stronger version of this definition, in which even the type of the instruction in the ith step is uniquely determined by the three parameters.
We shall say that a PRAM is strongly oblivious if the instruction (given by its type and the memory addresses of its arguments and result) performed in its ith step by its kth processor is uniquely determined by the input size, k, and i, and, if the instruction is shift(x, j), the argument j is also uniquely determined by the input size, k, and i.

By an on-line simulation of a PRAM computation we mean simulations of the single instructions that comprise the computation, in the partial execution order given by the computation. Importantly, the output produced by the simulation is required to have not only the same value but also the same placement as that of the simulated PRAM. The following lemma shows that if an oblivious PRAM does not use shift, then it can be simulated by a strongly oblivious PRAM within the same asymptotic resource bounds.

LEMMA 3.1. An oblivious PRAM which does not use shift, runs in time t(n), and uses p(n) processors and word length l(n) can be on-line simulated by a strongly oblivious PRAM running in time O(t(n)), using p(n) processors and word length l(n).

Proof. Let M be the oblivious PRAM to be simulated. The main idea is to use a PRAM with stored program (see [1]) encoding M. The PRAM assigns a processor to each processor of M. In order to perform the instructions of the corresponding processor of M (e.g., Z ← A ⊕ B), the assigned processor tests the code of the type of the current instruction for equality with the codes of all the possible instruction types u. Next, using the CopyTestBit function, it produces a mask Cu consisting entirely of 1's or 0's, depending on whether the test is positive or negative. Finally, it runs a loop over all instruction types

u, executing Z ← (A ⊕_u B) ∧ Cu + Z ∧ C̄u, where C̄u denotes the bitwise complement of Cu. As a result, only the encoded instruction takes effect. Since M is oblivious, the addresses of the arguments depend only on the input size, the number of the processor, and the number of the parallel step. The shift used for the CopyTestBit function always has the same second argument. We conclude that the simulating PRAM is strongly oblivious. It remains to observe that the generation of the program encodings for the p(n) simulated processors takes O(1) time and p(n) processors.

If the instruction set is to be extended with instructions, such as division, that can have illegal arguments, some further care must be taken in the instruction loop to avoid this difficulty. A possible solution is to replace A ⊕_u B with (A ∧ Cu + X_u ∧ C̄u) ⊕_u (B ∧ Cu + Y_u ∧ C̄u), where X_u and Y_u are legal arguments for the operation ⊕_u.

The concept of an oblivious PRAM enables us to exploit word parallelism according to the following theorem, under the assumption that if the PRAM uses concurrent write, then it is of the common, arbitrary, or priority type; see [9].

THEOREM 3.2. Let M be a strongly oblivious PRAM, or an oblivious PRAM which does not use shift, running in time t(n) and using p(n) processors and word length l(n). Suppose that M produces output composed of r(n) words. For k inputs, each of size n, the k computations of M on the k inputs can be on-line simulated by a single strongly oblivious PRAM with the same type of read–write access, running in time O(t(n) + k), using word length O(k·l(n)), and O(n + p(n) + r(n) + l(n)) processors in the strongly oblivious case or O(n + p(n) + r(n)) processors in the shift-free case.

Proof. By Lemma 3.1 we may assume w.l.o.g. that M is strongly oblivious. Divide the first n memory cells of a PRAM with word length O(k·l(n)) into consecutive fields of length Θ(l(n)). Pack the ith item of the qth input into the qth field of the ith cell.
It is clear that the packing step can be done easily in time O(k) using n EREW processors. Since M is strongly oblivious, it remains to simulate each instruction performed by M simultaneously on the k-tuples held in common memory cells. After the preprocessing specified in Lemma 2.1, this takes O(1) time and O(1) processors by that lemma. In this way, in parallel for q = 1, ..., k, the qth computation is on-line simulated on the qth track, corresponding to the qth fields of the memory cells, within the resource bounds specified in the theorem. To extract the k outputs, we first use r(n) processors to extract the output contents on the first track, then those on the second, etc., which takes O(k) time in total. It remains to observe that the simulating PRAM is strongly oblivious.

4. NONCONSERVATIVE SORTING

The acyclic AKS sorting network of constant degree and O(log n) depth is built of O(n log n) comparator gates [2]. It gives rise to an O(log n)-time, work-optimal, strongly oblivious EREW PRAM algorithm for sorting, provided that the expander needed for the AKS network is given. The expander can be constructed in logarithmic time by an EREW PRAM with n/log n processors [7]. Hence, we obtain the following lemma.

LEMMA 4.1. After an O(log n)-time, n/log n-processor preprocessing on an EREW PRAM, n integers ≤ 2^m can be sorted by a strongly oblivious EREW PRAM running in time O(log n) and using n processors and word length O(log n + m).
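To make the word-parallel use of an oblivious sorting algorithm concrete, here is a Python sketch in the spirit of combining Theorem 3.2 with an oblivious sorting network. It is our own illustration: for brevity it uses an oblivious odd-even transposition network in place of AKS, big ints as long words, and field values below 2^{l−1}; all names are ours.

```python
k, l = 4, 8
C1 = sum(1 << (q * l + l - 1) for q in range(k))   # test-bit mask
C0 = ((1 << (k * l)) - 1) ^ C1                     # data-bit mask

def kwise_max_min(A, B):
    # k-wise comparator (Corollary 2.2): field-wise max and min
    m = ((A | C1) - B) & C1            # comparison result on the test bits
    F = m - (m >> (l - 1))             # CopyTestBit: data mask where A >= B
    G = C0 ^ F
    return (A & F) | (B & G), (B & F) | (A & G)

def sort_tracks(words):
    """Run one oblivious sorting network word-wise: every compare-exchange
    acts on all k tracks (fields) at once, sorting the k groups together."""
    n = len(words)
    for step in range(n):              # odd-even transposition: n passes
        for i in range(step % 2, n - 1, 2):
            mx, mn = kwise_max_min(words[i], words[i + 1])
            words[i], words[i + 1] = mn, mx
    return words
```

Since the network is oblivious, the same sequence of compare-exchange steps sorts every track, which is exactly the effect used to sort many small groups in parallel below.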

Combining the lemma above with Theorem 3.2, we obtain the following useful lemma.

LEMMA 4.2. After an O(log n)-time, n/log n-processor preprocessing on an EREW PRAM, O(log n) groups of n/log n integers ≤ 2^m can be sorted by an EREW PRAM running in time O(log n) and using n/log n processors and word length O((log n + m) log n).

By Theorem 1(a) in [3], two sorted sequences of at most n integers ≤ 2^m can be merged in time O(log n) using an EREW PRAM with O(n log log n/log^2 n) processors and word length O((log n + m) log n) (under the assumption that the input integers are already packed into consecutive fields of consecutive machine words). Hence, by Lemma 4.2 and Brent's principle [9], we obtain the following result (cf. Theorem 1 in [3] and Theorem 3 in [5]).

THEOREM 4.3. If ⌈log m⌉ is known, a sequence of n integers ≤ 2^m can be sorted by an EREW PRAM running in time O(log n log log n) and using n/(log n log log n) processors with machine words of length O((log n + m) log n).

Proof. To combine Lemma 4.2 with Theorem 1(a) in [3], it is sufficient to note that a sequence of at most n integers ≤ 2^m can easily be packed into consecutive Θ(log n) × Θ(n/log n) fields of consecutive machine words in time O(log n) using an EREW PRAM with n/log n processors.

5. CONSERVATIVE SORTING

Albers and Hagerup derived their upper bounds on integer sorting in the EREW PRAM model from their results on nonconservative integer sorting in this model (cf. Theorems 1 and 2 in [3]). By using our results on nonconservative parallel integer sorting (i.e., Theorem 4.3), we can analogously derive stronger upper bounds on integer sorting in the EREW PRAM model. Following [3], we reduce the problem of sorting integers ≤ 2^m to that of sorting integers ≤ 2^s, where s < m, by the method of radix sorting. Taking s bits at a time, we need O(m/s) phases to carry out the reduction. We shall choose s as the largest integer satisfying s ≤ log n and s^2 ≤ log n + m.
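Sequentially, the radix-sorting reduction just described can be sketched as follows. This is our own Python stand-in: the stable built-in sort plays the role of the stabilized group sort of Theorem 4.3, and the parameter names are ours.

```python
def radix_sort(xs, m, s):
    """Sort integers < 2^m by stable-sorting on s bits per phase,
    least-significant digits first: ceil(m/s) phases in total."""
    digit = (1 << s) - 1
    for sh in range(0, m, s):
        # each phase must be stable; Python's sorted() is
        xs = sorted(xs, key=lambda x: (x >> sh) & digit)
    return xs
```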
Analogously as in [3], we implement each phase in two stages. First, we split the input integers into ⌈n/2^s⌉ groups of ≤ 2^s integers and apply the algorithm of Theorem 4.3 to each of the groups independently (stabilizing it as described in Section 2 of [3]). The splitting can easily be done in time O(log n) on an EREW PRAM with O(n/log n) processors and word length O(log n + m). Let t = max{log n, s log s}. By Brent's principle, the first stage takes time O(t) and O(n/t) processors in the EREW PRAM model with word length O(log n + m). In the second stage, we merge the sorted groups analogously as in [3, p. 469], using optimal prefix sums [9]. This takes time O(log n) on an EREW PRAM with n/log n processors [3]. We conclude that a single phase takes time O(t) and O(n/t) processors in the EREW PRAM model with word length O(log n + m). Hence, by the definition of s, we obtain the following upper bounds on conservative integer sorting in the EREW PRAM model, improving the results of [3].

THEOREM 5.1. Let s be the largest integer satisfying s ≤ log n and s^2 ≤ log n + m, and let t = max{log n, s log s}. A sequence of n integers ≤ 2^m can be sorted by an EREW

PRAM running in time O(tm/s), using n/t processors and word length O(log n + m). In particular, if m = O(log n), the time and processor bounds are, respectively, O(log^{3/2} n) and n/log n.

REFERENCES

1. A. V. Aho, J. E. Hopcroft, and J. D. Ullman, "The Design and Analysis of Computer Algorithms," Addison-Wesley, Reading, MA, 1974.
2. M. Ajtai, J. Komlós, and E. Szemerédi, An O(n log n) sorting network, in "Proc. 15th Annual ACM Symposium on Theory of Computing," pp. 1–9, 1983.
3. S. Albers and T. Hagerup, Improved parallel integer sorting without concurrent writing, in "Proc. 3rd Symposium on Discrete Algorithms," pp. 463–472, 1992.
4. S. Albers and T. Hagerup, Improved parallel integer sorting without concurrent writing, Inform. and Comput. 136 (1997), 25–51.
5. A. Andersson, T. Hagerup, S. Nilsson, and R. Raman, Sorting in linear time?, in "Proc. 27th Annual ACM Symposium on Theory of Computing," pp. 427–436, 1995.
6. A. Dessmark and A. Lingas, On the power of nonconservative PRAM, in "Proc. 21st Symposium on Mathematical Foundations of Computer Science," Lecture Notes in Computer Science 1113, pp. 303–311, 1996.
7. R. Cole and U. Vishkin, Approximate parallel scheduling. Part 1: The basic technique with applications to optimal parallel list ranking in logarithmic time, SIAM J. Comput. 17, 1 (1988), 128–142.
8. T. Hagerup and H. Shen, Improved nonconservative sequential and parallel integer sorting, Inform. Process. Lett. 36 (1990), 57–63.
9. R. M. Karp and V. Ramachandran, Parallel algorithms for shared-memory machines, in "Handbook of Theoretical Computer Science," Vol. A (J. van Leeuwen, Ed.), pp. 869–941, Elsevier, Amsterdam, 1990.
10. D. G. Kirkpatrick and S. Reisch, Upper bounds for sorting integers on random access machines, Theoret. Comput. Sci. 28 (1984), 263–276.
11. R. Ostrovsky, Efficient computation on oblivious RAMs, in "Proc. 22nd Annual ACM Symposium on Theory of Computing," pp. 514–523, 1990.
12. W. J. Paul and J. Simon, Decision trees and random access machines, in "Proc. International Symposium on Logic and Algorithmic," pp. 331–340, Zürich, 1980.

ANDERS DESSMARK is a Ph.D. student at Lund University. He received his B.S. in computer science from Lund University in 1992. His research interests include sequential and parallel algorithms, graph theory, and complexity theory.

ANDRZEJ LINGAS received the M.S. in mathematics and computer science in 1976 from Warsaw University. He was a visiting scientist at M.I.T. in 1980–1982. In 1983, he obtained the Ph.D. in computer science from Linköping University, where he was a lecturer and leader of the Laboratory for Complexity of Algorithms until 1989. In 1990 he joined Lund University, where he is currently a professor of computer science. His research interests include sequential and parallel graph and geometric algorithms, computational biology, and complexity theory. Dr. Lingas is a member of EATCS.

Received February 11, 1997; revised October 7, 1997; accepted October 20, 1997