Simulation of PRAMs with scan primitives by unbounded fan-in circuits




Information Processing Letters 68 (1998) 275-282

Rakesh K. Sinha ¹

Bell Laboratories, 600 Mountain Avenue, Murray Hill, NJ 07974-0436, USA

Received 15 December 1997; received in revised form 25 October 1998
Communicated by S.E. Hambrusch

Abstract

We show that PRAMs that have multiprefix (scan) operations available as primitives may be simulated by unbounded fan-in circuits with gates for related operations. The complexity bounds depend on the word-size of the PRAM but not on the multiprefix operations or the instruction set of the processors. © 1998 Published by Elsevier Science B.V. All rights reserved.

Keywords: Parallel algorithms; Simulation; PRAM; Circuits

1. Introduction

In the study of parallel computation, the shared-memory parallel random access machine, or PRAM, has been a very useful model. The model is attractive because it abstracts away most of the messy details of implementing algorithms on parallel machines. The flip side of this is that the model has no efficient implementation. Most work using the PRAM has considered models such as the EREW, CREW, and CRCW PRAM. In each of these models, the operations that access shared memory are reads and writes, and these are assumed to be of unit cost. In practice, global memory is distributed among processors and the read and write primitives are typically implemented by routing packets on a fixed connection network of processors. If many processors try to access the block of memory located at the same node, memory contention can severely slow the system down. Currently there is no practical solution for implementing an n-processor read or write that does better than O(log² n) for deterministic schemes or O(log n) for probabilistic schemes [12,14].

¹ Email: rks [email protected].

But we realize that the assumption of unit cost for primitives is never strictly true on any model. The usefulness of algorithmic models is in supplying a suitable abstraction of real machines that helps in algorithm design. The real concern is the following: any algorithm can be thought of as giving a two-step simulation of the given problem in hardware: first, the problem is solved in terms of primitives of the machine; second, each of those primitives is simulated in hardware. For certain problems, this two-step simulation process may be a lot more time-consuming than a direct simulation in hardware.

As an example, consider the problem of computing the parity of n bits on an n-processor machine, where each processor starts with one input bit. Let us assume that the underlying network is an r-dimensional butterfly [11, p. 440] with n = (r + 1)2^r. As far as hardware implementation goes, this problem can easily be computed in O(r) = O(log n) time as follows: First, in O(r) time, we compute the parity within each row. These partial results are stored in the local memory of processors in the rth level. It is easy to see that the butterfly network contains a binary-tree network of depth r such that the 2^r nodes in the rth level form the leaves of the tree. So, the parity of the 2^r partial results can be computed in another O(r) steps by repeatedly computing the parity of sibling nodes. Thus computing parity is no more difficult than a multiprocessor read or write. But if we restrict the primitives to be multiprocessor reads and writes, then parity requires Ω(log n/ log log n) steps even on a PRIORITY CRCW PRAM [1]. In other words, computing parity on a PRIORITY CRCW PRAM will take Ω(log n/ log log n) times the time to implement reads or writes, which is a slow-down of Ω(log n/ log log n).

One possible solution to this particular problem is to make parity a primitive operation on the PRAM model. In general, practical and theoretical work on parallel machines [3-7,9,10,13,15] has suggested that parallel prefix computations for certain multiple-arity operators be allowed at unit cost. We will call all such models multiprefix PRAMs. (Section 1.1 contains precise definitions.) For many problems, providing these extra primitives has made the algorithms simpler and/or more efficient [3,5,6]. The cost of implementing prefix computations as part of the PRAM simulations by networks of processors is that each processor or switch in that network becomes a little more complicated for each new prefix operation added.

We parameterize the multiprefix PRAM by a set 𝒪 of allowable prefix operations. It is easy to show that arbitrary parallel prefix operations can lead to unreasonable models. (Concatenation of bit strings is an associative operation and, as we will show later, using it in a parallel prefix operation permits the collection of all n input bits into a single location in a single step.) However, such operations require the transmission of n-bit values in the network.
A natural limitation then is on the bandwidth of the operation, the number of bits of information in each input and output value. It is reasonable to set this to be equal to the word-length of the machine. Our main result shows that if the bandwidth of the prefix operation is so limited (and the domain of the operation has an identity element) then the PRAM itself may be very efficiently simulated by an unbounded fan-in circuit with special gates for the operations in 𝒪.

There are several motivations for designing efficient simulations between different models. If seemingly different models have efficient simulations of each other, then that is evidence of the robustness of those models. That is, results proved on such models are indeed saying something about the complexity of the problem, rather than quirks of the model. Because researchers have had very limited success in proving lower bounds, simulation results are also attractive because they give an easy way of translating known lower bound results on the simulating model to the model being simulated. For example, Stockmeyer and Vishkin [17] showed a simulation of CRCW PRAMs (with limited instruction set) by unbounded fan-in circuits. Coupled with the known unbounded fan-in circuit lower bounds on parity [8], this simulation result gave a non-constant time lower bound for the problem of computing parity on PRAMs with a polynomial number of processors.

Notice that, other than a word-length restriction, we are not making any assumption that otherwise limits the power of the processors in the PRAM. The most important special case of our results is when the word-length is O(log p), where p is the number of processors in the machine. (log p bits are needed to specify any particular processor, so a word-length of at least log p is most desirable. Furthermore, most specific operations that have been proposed can be implemented using (log p)-bit values.) In this case, anything that can be computed by a multiprefix PRAM (with operations in 𝒪) in time T using p processors can be computed by an unbounded fan-in circuit of depth O(T) and size polynomial in p^T, having gates for the operations in 𝒪 (Corollary 5). Our results are an extension of those of Bellantoni [2], which are designed for PRIORITY CRCW PRAMs with a word-length restriction. Our main contribution is to extend them to the case of multiprefix PRAMs.
Our techniques are similar to Bellantoni’s, although there are some significant differences in the details required to handle the multiprefix operations. We should note that our simulation is not limited to prefix operations based on associative binary operations and thus we can handle a wider variety of functions, for example, threshold functions.
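Before turning to definitions, the O(log n) tree reduction for parity described in the introduction can be illustrated with a short Python sketch. The code and names are ours, not the paper's, and it reduces a flat list of partial results rather than an actual butterfly network:

```python
# Illustrative sketch (not from the paper): computing the parity of
# n bits in O(log n) rounds by pairwise reduction, mirroring the
# binary-tree combination of sibling nodes described above.

def parity_tree_reduction(bits):
    """One 'round' per tree level; siblings XOR their partial parities."""
    values = list(bits)
    rounds = 0
    while len(values) > 1:
        values = [values[i] ^ values[i + 1] if i + 1 < len(values)
                  else values[i]
                  for i in range(0, len(values), 2)]
        rounds += 1
    return values[0], rounds

# 8 leaves take exactly log2(8) = 3 rounds.
result, rounds = parity_tree_reduction([1, 0, 1, 1, 0, 1, 0, 0])
```

On a PRIORITY CRCW PRAM with only read/write primitives, by contrast, the same function needs Ω(log n/ log log n) steps [1].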


1.1. Definitions

We begin by defining a multiprefix (MP) operation. Several definitions have been proposed in the literature. Because we will be giving a simulation of multiprefix PRAMs, we adopt the strongest definition of multiprefix operations, which was proposed by Ranade et al. [15]. Intuitively, for any function ⊕, whenever a set of processors performs a ⊕-multiprefix operation on a common memory location, the result is the same as if a single prefix operation had been performed on the values ordered by the processor indices. We number the processors of the (multiprefix) PRAM as P₁, P₂, P₃, ….

Definition. A multiprefix operation MP(L, v, ⊕) takes three arguments: L is the address of a memory location; v is the private data inside the processor performing this operation; and ⊕ is a multiple-arity operator. Let S = {P_{i_j} : 1 ≤ j ≤ k} be any set of k processors such that i_j < i_{j+1} for 1 ≤ j < k. Suppose that the memory location L contains the value v₀, each processor P_{i_j} in S performs the operation MP(L, v_j, ⊕), and no processor outside S performs a multiprefix operation referring to memory location L. Then, as a result of these multiprefix operations, each processor P_{i_j} will receive ⊕(v₀, v₁, …, v_{j−1}), and L will contain ⊕(v₀, v₁, …, v_k). A valid algorithm makes sure that in any round of the computation, all processors performing a multiprefix operation referring to a particular memory location have the same multiple-arity operator.

To give an example, if the memory location L contains value 0, and for 1 ≤ i ≤ n, the ith processor P_i performs MP(L, i, +), then as a result of this operation P_i will receive 1 + 2 + ⋯ + (i − 1) and L will contain 1 + 2 + ⋯ + n.

We are now ready to give a formal definition of multiprefix PRAMs.

Definition. Let 𝒪 be any set of operations. Then an 𝒪-multiprefix PRAM is a PRIORITY CRCW PRAM with extra multiprefix primitives from the set 𝒪.
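The definition above has a simple sequential reference semantics, sketched here in Python (the function and variable names are ours, not the paper's); it reproduces the MP(L, i, +) example:

```python
# Hedged sketch of the multiprefix semantics: with the participating
# processors ordered by index, each receives the operator folded over
# the old memory value and all strictly earlier data values; the
# location ends up with the fold over all values.

def multiprefix(memory, L, requests, op):
    """requests: list of (processor_index, value); op: binary, folded left."""
    requests = sorted(requests)        # order by processor index
    received = {}
    acc = memory[L]                    # v0, the old contents of L
    for proc, v in requests:
        received[proc] = acc           # op(v0, v1, ..., v_{j-1})
        acc = op(acc, v)
    memory[L] = acc                    # op(v0, v1, ..., v_k)
    return received

# The paper's example: L holds 0 and processor P_i contributes i.
mem = {0: 0}
got = multiprefix(mem, 0, [(i, i) for i in range(1, 6)], lambda a, b: a + b)
# P_i receives 1 + 2 + ... + (i - 1); L ends with 1 + 2 + ... + 5.
```

This sequential fold is only a specification of the result; the point of the model is that the PRAM delivers it in a single unit-cost step.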
For computing a function of n inputs, we start with input values stored in n designated memory locations. The computation proceeds in steps. Each step consists of a compute, write, multiprefix, and read phase. In the compute phase, individual processors can perform


arbitrary local computation on the information they have received during earlier steps. Processors are not allowed to access the shared memory during the compute phase. For each processor, this local computation during the compute phase determines the indices of memory locations for the read, write, and multiprefix operations, as well as the values for the write and multiprefix operations. If more than one processor is trying to write to the same location L, then L receives the value that the processor with the smallest index is trying to write. In the write (multiprefix, read) phase, each processor can perform at most one write (respectively, multiprefix, read) operation. In the multiprefix phase, all processors performing a multiprefix operation referring to a particular memory location have the same multiple-arity operator. The output is the contents of specially marked memory locations at the end of the computation. The two most interesting complexity measures to us are time, defined as the number of steps, and the number of processors.

It is easy to show that arbitrary multiprefix operations can lead to unreasonable models. For example, if concatenation of bit strings is allowed as a multiprefix primitive then, with n processors, the entire input can be collected into a single location L in a single step. (Processor P_i executes MP(L, ith input, "concatenate").) There are two objections to this: first, if, as is traditional in lower bound study of PRAMs, we do not place any restrictions on the computational power of individual processors, a processor can read all input bits in one more step and compute the given function. The more serious objection is that such operations require the transmission of n-bit values in the network. A natural limitation then is on the bandwidth of the operation, defined as the number of bits of information in each input and output value. It is reasonable to set this to be equal to the word-length of the machine.
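The bandwidth objection can be made concrete with a small Python sketch, entirely ours: with a word-length of MU bits, concatenation already needs 2·MU bits for two inputs, while an operator such as addition mod 2^MU keeps every input and output within the word:

```python
# Illustrative sketch (our own, hypothetical names): why concatenation
# violates the bandwidth limit while a word-bounded operator does not.

MU = 8  # word-length in bits, assumed for illustration

def concat_bits(x, y):
    """Concatenate bit strings: output width grows with every argument."""
    return (x << MU) | y          # already 2*MU bits for two MU-bit inputs

def add_mod(x, y):
    """A bandwidth-respecting operator: MU-bit inputs, MU-bit output."""
    return (x + y) % (1 << MU)

x, y = 0b10101010, 0b01010101
wide = concat_bits(x, y)          # no longer fits in an MU-bit word
narrow = add_mod(x, y)            # stays below 2**MU
```

Folding concat_bits over n inputs produces an n·MU-bit value, which is exactly the "entire input in one location" pathology described above.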
In other words, we allow only those multiprefix operations which have bounded "bandwidth".

Definition. For any integer μ ≥ 0, VALID(μ) is the set of operations ⊕ such that
(1) for all k ≥ 0, if each of x₁, x₂, …, x_k is less than 2^μ, then ⊕(x₁, x₂, …, x_k) is also less than 2^μ, and

(2) ⊕ has a unit; in other words, there exists a 1_⊕ such that adding a string of 1_⊕'s to the argument list of ⊕ does not change its value.

Notice that VALID(μ) includes a wide variety of functions, including all threshold functions [13]. Because the majority function does not have a unit, it is not in VALID(μ). However, adding an equal number of 0's and 1's to the argument list of majority does not change its value. We can think of a more general definition of unit under which the pair (1, 0) forms a unit for the majority function. Our simulation can be easily extended to allow this more general class of functions, such as majority. We call μ the word-length of the machine and make sure that the contents of any memory location can be represented by at most μ bits.

Definition. For any p, μ > 0, and 𝒪 ⊆ VALID(μ), an MP-PRAM(𝒪, p, μ) is an 𝒪-multiprefix PRAM with p processors such that any value that any processor attempts to write is less than 2^μ. We also assume that each input value can be represented by μ bits.

Next, we define the family of circuits that will be used to simulate these PRAMs.

Definition. For any μ > 0, and 𝒪 ⊆ VALID(μ), MP-Circuit(𝒪, μ) is the set of unbounded fan-in circuits with gates computing AND, OR, NOT, and functions in 𝒪. We assume that each gate computing a function in 𝒪 receives its set of inputs encoded in binary, and has its output also encoded in binary. For small values of μ, this assumption is hardly restrictive since we can always compute any function of these inputs or outputs by additional circuitry of size O(μ2^μ).

Our main theorem is

Theorem 1. For any p, μ, T > 0, and 𝒪 ⊆ VALID(μ), any MP-PRAM(𝒪, p, μ) running in time T can be simulated by an MP-Circuit(𝒪, μ) of depth O(T) and size O(2^{4μT} p³T⁴μ).

As an illustration of how our simulation result is useful in proving lower bounds on multiprefix PRAMs, we can translate Smolensky's lower bound [16] on circuits with mod gates to the case of multiprefix PRAMs. (We state Smolensky's lower bound in the proof of Corollary 2.)

Definition. For any m > 0, MOD_m(x₁, …, x_n) is defined to be 0 if Σ x_i ≡ 0 (mod m), and 1 otherwise.

Corollary 2. If m is a prime, and r is not a power of m, then for any p = 2^{log^{O(1)} n}, any MP-PRAM({MOD_m}, p, log p) solving MOD_r on n bits runs for Ω(log n/ log log n) time.

Proof. Suppose that the MP-PRAM({MOD_m}, p, log p) runs for T steps. Then by Theorem 1 with μ = log p, there is an unbounded fan-in circuit with AND, OR, NOT, and MOD_m gates of depth O(T) and size p^{O(T)} that computes MOD_r. However, Smolensky [16] proved that any such circuit has size Ω(2^{n^{1/2T}}). Since p = 2^{log^{O(1)} n}, it follows that T = Ω(log n/ log log n). □

The remainder of this paper is devoted to proving Theorem 1.

2. Simulation

Following the model of Bellantoni [2], in our simulation we initially assume that we do not have to worry about simulating memory. This motivates the following definition.

Definition. For any p, μ > 0, and 𝒪 ⊆ VALID(μ), let a memoryless MP-PRAM(𝒪, p, μ) be an MP-PRAM(𝒪, p, μ) such that after each step its entire shared memory is reset to zero.

The simulation has two parts: first, we show that a memoryless machine can be efficiently simulated by circuits; next, we show how to simulate a general multiprefix PRAM with a memoryless multiprefix PRAM.

2.1. Simulation of memoryless PRAMs by circuits

Lemma 3. For any p, μ, T, and 𝒪 ⊆ VALID(μ), any memoryless MP-PRAM(𝒪, p, μ) running in time T can be simulated by an MP-Circuit(𝒪, μ) of depth O(T) and size O(2^{2μT} p³μT).

Proof. For any t, 1 ≤ t ≤ T, we construct a constant-depth circuit to simulate step t of the computation of the memoryless multiprefix PRAM. We refer to this as the stage(t) circuit. In any step, a processor reads μ bits of information and receives another μ bits of information by performing a multiprefix operation. So any processor by the end of step t − 1 has received 2μ(t − 1) bits of information. Together, the p processors receive 2μp(t − 1) bits of information. The stage(t) circuit takes as input these 2μp(t − 1) bits and outputs 2μpt bits which will be input to the stage(t + 1) circuit. Our task is to compute the 2μ bits of information that any processor P receives as a result of the read and multiprefix operations in step t. Since a processor does not receive any information during the write or compute phases, we do not need to simulate these two phases explicitly (except that in the last step, we need the value written into the output memory location). As we will see later, we still need to simulate these two phases at least partially. For example, in order to simulate the read phase, we need to know the index of the memory location for the read operation (which is determined during the compute phase) and the value in certain memory locations at the end of the write phase.

Before describing the stage(t) circuit, we want to bound the number of distinct memory locations that ever get accessed during the computation of the memoryless MP-PRAM. After t steps, a processor can be in one of at most 2^{2μt} possible states. Since, in any step, a processor can access the shared memory in one of three (write, multiprefix, or read) phases, there are at most

N(t) = 3p Σ_{j=1}^{t} 2^{2μj} < 3p · 2^{2μt+1}

memory locations that ever get accessed during the first t steps by any processor. If the memoryless multiprefix PRAM runs for T steps then there are at most N(T) memory locations that possibly ever get accessed. We can index these memory locations by using log N(T) bits. Next, we describe the stage(t) circuit.


First, for each processor P, we use its 2μ(t − 1) bits to extract the following information for step t:
(1) W_P, the index of the memory location into which P writes.
(2) M_P, the index of the memory location for the multiprefix operation of P.
(3) R_P, the index of the memory location that P reads.
(4) ⊕, the multiprefix operation of P.
(5) v_P, P's data for the multiprefix operation.
(6) The value that P is going to write.
Altogether this is O(log N(t) + μ) = O(log N(t)) bits of information. This information is determined during the compute phase of the PRAM. Because each processor is deterministic in nature, this information depends only on which of the 2^{2μt} possible states the processor is in. In other words, for each possible state, there is a constant vector of length O(log N(t)) giving all this information. We can construct a table of 2^{2μt} many such constant vectors and then get this information by a table look-up in constant time by using a constant-depth circuit of size O(2^{2μt} log N(t)). So, this part of the simulation can be performed by a constant-depth circuit of size O(p·2^{2μt} log N(t)).

Assume that the (circuit) gates have been arranged in such a fashion that the gates corresponding to lower-indexed processors are to the left of higher-indexed processors. We will first show how to compute the result of the multiprefix operation of processor P. We already know M_P, the index of the memory location for the multiprefix operation of P, and ⊕, the multiprefix operation of P. We need to determine the values on which ⊕ is applied. In other words, we need (i) v₀, the contents of memory location M_P before the multiprefix operation, and (ii) v_Q, the value for the multiprefix operation of Q, for each processor Q that is to the left of P and executes MP(M_P, v_Q, ⊕). Because the "write" phase precedes the "multiprefix" phase, v₀ can be determined by finding the highest-priority processor writing into location M_P. We compare M_P with W_Q for every processor Q to determine the highest-priority processor writing into M_P. Then v₀ is the value that this processor was going to write.

Consider a gate g for computing the function ⊕. The leftmost input to g is v₀. To compute the other inputs of g, compare M_P with M_Q for every processor Q strictly to the left of P. All the matched processors should have the same multiprefix operation. If there is a match (indicating that Q participates in the multiprefix operation of P) then v_Q is sent to g; otherwise the unit for ⊕ is sent to gate g. The output of this gate is what P receives as a result of the multiprefix operation.

The simulation of the read operation is done similarly. Because the "read" phase follows the "multiprefix" phase, we need to determine the contents of location R_P at the end of the multiprefix phase. As in the previous simulation, we find the value v₀ that memory location R_P will contain at the end of the write phase in step t. Next, we find out if a multiprefix operation has been performed on memory location R_P. To do this, we compare R_P with M_Q for every processor Q. All the matched processors should have the same multiprefix operation. If there is a match, the value in the location R_P has been updated by a multiprefix operation. Consider a gate g computing the multiprefix operation of these matched processors. The leftmost input to g is v₀. For each processor Q, if there is a match we send the value v_Q to gate g, otherwise we send the unit for the function computed by g. The output of this gate is what P receives as a result of the read operation.

In both these constructions, we are comparing O(log N(t)) bits of information among processors. So this can be performed by circuits of constant depth and O(p² log N(t)) size. For each processor P, placing the 2μ bits of information next to the 2μ(t − 1) bits that P has received in the first t − 1 steps (which is input to the stage(t) circuit) constitutes the input to the stage(t + 1) circuit.

The stage(t) circuit has constant depth and O(2^{2μt} p log N(t) + p² log N(t)) size. Since N(t) < 3p·2^{2μt+1}, log N(t) is O(log p + μt). This says that the stage(t) circuit has constant depth and O(p³2^{2μt}μt) size, which in turn implies that the final circuit has depth O(T) and size

Σ_{t=1}^{T} O(p³2^{2μt}μt) = O(p³μT·2^{2μT}). □
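The unit-padding trick in this construction can be sketched in Python. This is our own illustrative reformulation, not the circuitry itself: every position strictly to the left of P feeds the gate its value when it addresses the same location M_P, and the operator's unit otherwise, so a gate with fixed wiring computes exactly the prefix P should receive:

```python
# Sketch (our naming) of the gate-input selection: positions left of P
# contribute v_Q if M_Q == M_P, and the unit of the operator otherwise,
# so padding leaves the folded result unchanged.

from functools import reduce

def gate_inputs_for(p, procs, v0, unit):
    """procs: list of (M_index, value), ordered left to right by index."""
    Mp = procs[p][0]
    inputs = [v0]                      # leftmost input to the gate is v0
    for q in range(p):                 # every Q strictly to the left of P
        Mq, vq = procs[q]
        inputs.append(vq if Mq == Mp else unit)   # unit = padding
    return inputs

# Addition with unit 0; two memory locations; location 0 starts at v0 = 5.
procs = [(0, 2), (1, 7), (0, 3), (0, 4)]
inputs = gate_inputs_for(3, procs, 5, 0)    # processor 3 targets location 0
out = reduce(lambda a, b: a + b, inputs)    # what processor 3 receives
```

Here processor 1's value 7 is replaced by the unit 0 because it addresses a different location, so the fold sees only v₀ = 5 and the values 2 and 3 from earlier participants in the same multiprefix.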

Next, we show how to simulate a general multiprefix PRAM by a memoryless multiprefix PRAM.

2.2. Simulation of multiprefix PRAMs by memoryless multiprefix PRAMs

Lemma 4. For any p, μ, T > 0, and 𝒪 ⊆ VALID(μ), any MP-PRAM(𝒪, p, μ) M running in time T can be simulated by a memoryless MP-PRAM(𝒪, 2pT + n, μ) M′ running in time 2T − 1, where n is the size of the input.

Proof. The essential idea of this proof follows that of Bellantoni [2], but there are some additional complications due to the multiprefix phase. The 2pT + n processors of M′ are numbered Q₁, …, Q_n, and P′_{tj}, P″_{tj}, 1 ≤ t ≤ T, 1 ≤ j ≤ p. The p processors of M are numbered P₁, …, P_p.

We first sketch a simulation that works in the absence of multiprefix operations. Later, we will modify it to allow multiprefix operations. Each step t of M is simulated by the set of p processors P′_{t1}, …, P′_{tp} of M′. Somehow, we need to remember the contents of all the relevant memory locations of M. In the very first step of M′, Q₁, …, Q_n read the input from memory locations L₁, …, L_n. In all subsequent steps they write back the input into the corresponding locations. The general idea is that any processor updating the contents of a memory location should write that value in all subsequent steps. For any t, the set of p processors P′_{t1}, …, P′_{tp} acquire the necessary information about the input in the first t − 1 steps so that they can simulate step t of M. In all steps subsequent to t, they perform the same write and multiprefix operation as in step t. We need to arrange the priorities of processors so that the processor writing the current value has the highest priority.

It is useful to think of any processor as being in either read or write mode. In read mode, a processor receives all the information without making any writes. It then simulates the behavior of a processor in M, and makes a transition to write mode, where it writes the same value in all subsequent steps. The scheme can be further simplified as follows: any processor in "write mode" reads the memory location it wrote to in the previous step. If the memory location contains a different value (than what it last wrote), then this processor realizes that another processor with a more recent value has taken over writing into this location. It becomes inactive and does not participate in any subsequent computation.

There are several problems with this simplistic scheme if we allow multiprefix operations: any processor in its read mode should be able to get the information that the corresponding processor of M receives by a multiprefix operation. However, unlike READ, we cannot allow multiple copies of the same processor to perform a multiprefix operation on the same location. We fix this problem by doubling the number of steps. The real simulation is performed in odd steps, whereas even steps are used to pass on the information received as a result of the multiprefix operations. Another problem is that both WRITE and multiprefix operations affect memory locations, and they may interact in a tricky way. For example, if a series of writes and multiprefix operations are performed at various steps on the same memory location L, then the net result is the same as obtained by performing that series of operations starting with the last WRITE. Furthermore, we restrict all processors performing a multiprefix operation on the same memory location in M′ (as well as in M) to have the same operation. So, if for a particular memory location L the operation for round t₁ is ⊕₁ and for round t₂ it is ⊕₂, we cannot allow these two multiprefix operations to be done in the same step (which our simplistic scheme would have required in all steps subsequent to t₁ and t₂). We avoid all these complications by letting another set of p processors P″_{t1}, …, P″_{tp} remember the contents of the memory locations on which a multiprefix operation has been performed in step t. Now, it suffices for processors P′_{t1}, …, P′_{tp}, P″_{t1}, …, P″_{tp} to perform only a WRITE (and not a multiprefix) operation in all steps subsequent to t.

Assume that the processors of M have been assigned priorities according to the order P_p < P_{p−1} < ⋯ < P₂ < P₁.
The 2pT + n processors of M′ are assigned priorities according to the order Q_n < ⋯ < Q₁ < P′_{1p} < ⋯ < P′_{11} < P″_{1p} < ⋯ < P″_{11} < ⋯ < P′_{Tp} < ⋯ < P′_{T1} < P″_{Tp} < ⋯ < P″_{T1}.

In step 2t − 1, 1 ≤ t ≤ T, the processors P′_{kj}, P″_{kj}, for k ≥ t, perform the same COMPUTE as P₁, …, P_p in step t of M. The purpose of performing this COMPUTE is merely to


learn the read/write/multiprefix behavior of the processors of M. The actual read/write/multiprefix behavior of the processors of M′ will be somewhat different, as specified in the rest of this algorithm.

If t > 1 then Q₁, …, Q_n WRITE X₁, …, X_n back into input locations L₁, …, L_n, respectively.
P′_{t1}, …, P′_{tp} perform the same WRITE as P₁, …, P_p in step t of M.
P′_{k1}, …, P′_{kp}, for k < t, perform the same WRITE operation as they did in step 2k − 1.
P″_{k1}, …, P″_{kp}, for k < t, WRITE back the values they READ in step 2k − 1.
P′_{t1}, …, P′_{tp} perform the same multiprefix operation as P₁, …, P_p in step t of M.
If t = 1 then Q₁, …, Q_n READ the input X₁, …, X_n from input locations L₁, …, L_n, respectively.
P′_{k1}, …, P′_{kp}, P″_{k1}, …, P″_{kp}, for k > t, perform the same READ as P₁, …, P_p in step t of M.
P′_{k1}, …, P′_{kp}, P″_{k1}, …, P″_{kp}, for k < t, READ the locations that they wrote to in step 2t − 3. If any processor discovers that the content of the memory location is different from what it last wrote, it becomes inactive and does not participate in any subsequent computation.
P″_{t1}, …, P″_{tp} READ the memory locations where P′_{t1}, …, P′_{tp} have performed the multiprefix operation.

In step 2t, 1 ≤ t ≤ T − 1:
P′_{t1}, …, P′_{tp} WRITE the values that they received as a result of their multiprefix operation in step 2t − 1 into some designated p memory locations.
P′_{k1}, …, P′_{kp}, P″_{k1}, …, P″_{kp}, for k > t, READ those designated memory locations.

It can be proved by induction on t that the following properties hold:
(1) For any 1 ≤ r ≤ p and k > t, processors P′_{kr} and P″_{kr} in M′ have the same information at the end of step 2t as processor P_r in M has after step t.
(2) The content of any memory location L in M′ after step 2t − 1 is the same as the content of memory location L in M after step t. □

The proof of Theorem 1 follows immediately from Lemmas 3 and 4 and the observation that if the output of the MP-PRAM(𝒪, p, μ) depends on n inputs then pT ≥ n. For the special case μ = log p, we have the following corollary.



Corollary 5. For any T, p > 0, and 𝒪 ⊆ VALID(log p), any MP-PRAM(𝒪, p, log p) running in time T can be simulated by an MP-Circuit(𝒪, log p) of depth O(T) and size p^{O(T)}.

3. Conclusion

We showed that a multiprefix PRAM with restricted bandwidth can be very efficiently simulated by an unbounded fan-in circuit with special gates for computing the prefix operations. In the case of PRIORITY CRCW PRAMs, lower bounds for unbounded fan-in circuits [8] provided the pattern for PRAM lower bounds [1]. What we demonstrate is that, under natural restrictions on the multiprefix PRAM word-length, lower bounds for circuits with prefix operations directly translate into lower bounds for the multiprefix PRAM.

Parberry and Schnitger [13] defined and analyzed TRAMs (Threshold RAMs), which are CRCW PRAMs whose write resolution rule corresponds to computing a threshold function of the values that the processors are attempting to write. Parberry and Schnitger, using techniques of Stockmeyer and Vishkin [17], were able to show an efficient simulation of TRAMs by threshold circuits, which are unbounded fan-in circuits with gates computing AND, OR, NOT, and threshold functions. Our techniques can be employed to give an alternate proof of their result.

Acknowledgement I am indebted to Paul Beame for his extensive help and advice on this paper.

References

[1] P. Beame, J. Håstad, Optimal bounds for decision problems on the CRCW PRAM, J. ACM 36 (3) (1989) 643-670.
[2] S. Bellantoni, Parallel random access machines with bounded memory wordsize, Inform. and Comput. 91 (1991) 259-273.
[3] G.E. Blelloch, Scans as primitive parallel operations, IEEE Trans. Comput. 38 (11) (1989) 1526-1538.
[4] G.E. Blelloch, Vector Models for Data-Parallel Computing, MIT Press, Cambridge, MA, 1990.
[5] G.E. Blelloch, J.J. Little, Parallel solutions to geometric problems in the scan model of computation, J. Comput. System Sci. 48 (1) (1994).
[6] S. Chatterjee, G.E. Blelloch, M. Zagha, Scan primitives for vector computers, in: Supercomputing '90, IEEE, 1990, pp. 666-675.
[7] A. Gottlieb, R. Grishman, C.P. Kruskal, L. Rudolph, The NYU Ultracomputer - designing an MIMD parallel machine, IEEE Trans. Comput. 32 (1983) 175-189.
[8] J. Håstad, Computational Limitations of Small-Depth Circuits, ACM Doctoral Dissertation Award, 1986, MIT Press, Cambridge, MA, 1987.
[9] W.D. Hillis, The Connection Machine, MIT Press, Cambridge, MA, 1985.
[10] C.P. Kruskal, L. Rudolph, M. Snir, Efficient synchronization in multiprocessors with shared memory, in: Proc. Fifth Annual ACM Symposium on Principles of Distributed Computing, 1986, pp. 218-228.
[11] T. Leighton, Introduction to Parallel Algorithms and Architectures, Morgan Kaufmann, San Mateo, CA, 1991.
[12] T. Leighton, B. Maggs, S. Rao, Universal packet routing algorithms, in: Proc. 29th Annual Symposium on Foundations of Computer Science, White Plains, NY, October 1988, pp. 256-271.
[13] I. Parberry, G. Schnitger, Parallel computation with threshold functions, J. Comput. System Sci. 36 (3) (1988) 278-302.
[14] A.G. Ranade, How to emulate shared memory, J. Comput. System Sci. 42 (1991) 307-326.
[15] A.G. Ranade, S. Bhatt, S.L. Johnson, The Fluent abstract machine, in: Proc. 5th MIT Conference on Advanced Research in VLSI, 1988, pp. 71-94.
[16] R. Smolensky, Algebraic methods in the theory of lower bounds for Boolean circuit complexity, in: Proc. Nineteenth Annual ACM Symposium on Theory of Computing, New York, May 1987, pp. 77-82.
[17] L.J. Stockmeyer, U. Vishkin, Simulation of parallel random access machines by circuits, SIAM J. Comput. 13 (2) (1984) 409-422.