Very Fast Distributed Spreadsheet Computing

H. B. Zhou and L. Richter
Department of Computer Science, University of Zurich, Zurich, Switzerland
A very fast distributed spreadsheet computing program has been implemented on a distributed memory (message-passing) system of about 40 Sparcstations in a local Ethernet network. With a two-stage accelerating refinement procedure, namely the introduction of the high-speed ECE-SC¹ program and its distributed implementation, a major acceleration has been achieved: a total acceleration factor² of 373 has been reached with 10 Sparcstation 1s. Two versions of the distributed program have been implemented, via direct stream-socket programming and via the parallel virtual machine (PVM) environment, respectively. Experiments show that only a minor performance difference exists between these two implementations. The developed working mechanism, independent of the target machine and the spreadsheet program used, is, in principle, the fastest for the parallel implementation of spreadsheet computing.

Address correspondence to Dr. Honbo Zhou, E.P.M. Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831.

¹An equivalent C expression substitution of the corresponding computing subroutines in the spreadsheet framework.

²The acceleration factor denotes a general performance gain, in the sense of speedup: Speedup = T1/Tn, where T1 is the running time on one processor and Tn the time on n processors. The main acceleration is contributed by the ECE-SC substitution.
1. INTRODUCTION
Spreadsheet computing is probably the most widely used software in the commercial and accounting world (Miller, 1990). It is the operating kernel of popular business software such as Excel, Lotus, Symphony, etc. Because of its wide usefulness, it has also been built into some advanced software systems such as X Windows (the "table" program in the Andrew toolkit) and the Isis system (its "spread" program). Publications on the efficient design and implementation of spreadsheet programs appear sporadically in technical journals (Sajaniemi, 1988; Amsterdam, 1986; Eichinger-Wieschmann, 1989; Du and Wadge, 1990). In the fast-growing field of visual programming, a new form-based programming paradigm is based on the spreadsheet approach (Ambler, 1987), which may trigger interest in parallelism. Although some possible approaches to spreadsheet parallelization have been suggested (Du and Wadge, 1990), as far as we know, no previous work on this topic has been reported. Our study emphasizes the parallelization of spreadsheet computing on a distributed memory multiprocessor system.

Data dependence analysis and maintenance have been considered important factors in developing efficient sequential and parallel implementations of spreadsheets (Amsterdam, 1986; Du and Wadge, 1990; Banerjee, 1988). To exploit the parallelism in a specific spreadsheet computation, a data dependency graph (DDG), or task graph, is usually generated through data dependence analysis techniques. The DDG is then (automatically) partitioned (Sarkar, 1989) for a given architecture so that the parallel execution time is minimized. The DDG of a spreadsheet computation may be represented as a directed acyclic graph (DAG) or an undirected graph (UDG). Accordingly, two partitioning techniques are used: task allocation (Zhou, to appear (a)) when the DDG is a UDG, or task scheduling (Zhou, 1994) when the DDG is a DAG. These issues will be discussed in separate articles.

We differentiate here between static and dynamic allocation or scheduling. By static allocation or scheduling, we mean that the task graph is known beforehand at compile time. When the task graph is not known before the program executes, dynamic allocation or scheduling must be used to schedule tasks at run time. We consider only a static implementation of the distributed spreadsheet program, in the sense that we assume that for every application the task graph is known and fixed beforehand at compile time. That is, no interactive change to the problem structure of the implemented application is allowed, and no nondeterministic factors are handled at run time. The DDG of a spreadsheet computing application is partitioned into subtasks (i.e., groups of formulas or expressions), allocated or scheduled onto the network at compile time, and fixed during run time.
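As a purely illustrative aside (none of these names appear in the paper), one way to hold such a task graph is a per-cell node that records the cells its formula reads; a static partitioner would then assign each node to a processor before run time:

/* Hypothetical representation: one DDG node per spreadsheet cell, with an edge
   to every cell its formula reads.  A static partitioner assigns each node to
   a processor block at compile time, before any data arrive. */
struct ddg_node {
    int row, col;              /* position of the cell in the sheet           */
    int num_deps;              /* number of cells this cell's formula reads   */
    struct ddg_node **deps;    /* edges to those predecessor cells            */
    int processor;             /* processor (block) chosen at compile time    */
};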
With this working mechanism, only the data of the spreadsheet application (i.e., the numbers; here we ignore the string-processing functions of the spreadsheet program) can be changed at run time. Also, only the data need to be transmitted over the network during parallel execution. Thus, the communication overhead is reduced to a minimum in that no expressions need to be transmitted at run time.

In this article, a simplified prototype implementation of our distributed spreadsheet program is described. It is a simplified version in that a simple spreadsheet computation has been used, the DDG partitioning has been done manually, and interprocess communication has been avoided by simple overlapping. The aim is to establish a test bed for performance testing and future improvements.

We have chosen a sequential spreadsheet program called Spreadsheet Calculator (SC) as the basis for developing our distributed implementation. This program was originally designed and implemented by James Gosling at the University of Maryland and is available as public domain software.³ The working mechanism of most spreadsheet programs (including SC) is run-time interpretation of expressions, like that of a BASIC program, which is slow and inefficient. For some applications in commercial centers, a huge set of computing formulas in a spreadsheet may keep running for weeks or even months without change, and a real-time response is required. In such cases, the implementation of the aforementioned static parallel working mechanism will surely improve the computing speed significantly. If we assume that the potential application sites of our distributed spreadsheet program are such commercial centers, then the interactive editing function that is important in daily use can be neglected. In such a circumstance, the static parallel working mechanism is the best choice to achieve high speed.

Based on this working mechanism, we can further exploit the parallelism of a spreadsheet application in the SC framework by listing the formulas (expressions) in the spreadsheet cells verbatim as equivalent C expressions (ECE) in the slave processors and executing the expressions as compiled C code. This replacement accelerates the computation significantly.
³Current maintainer: Jeff Buhrt, Garuel Enterprises, Inc. Email: sequent!sawmill!buhrt.
It is clear now that our aim is a static parallel implementation of a specific application in the SC framework rather than a general parallelization of the SC program. The goal is to reach maximum acceleration. The implementation has been done in the following two steps, which are described in detail in the following sections:

1. The slow and computationally inefficient execution mechanism of run-time expression parsing and interpreting in SC has been replaced by run-time (sequential) evaluation of the ECE (section 2).
2. The distributed implementation is based on the ECE evaluation version of the sequential SC (section 3).

After implementing these two improvements, a total acceleration factor of 373 was achieved with 10 Sparcstations in our local networked system. The proposed working mechanism, independent of the target machine and the spreadsheet program used, is a fully exploited mechanism and should therefore, in principle, be the fastest. If this working mechanism were transplanted to some dedicated multiprocessor hardware, such as a transputer array, or to a shared memory supercomputer, a further major speedup would be achieved.

2. SEQUENTIAL ECE VERSION OF THE SC PROGRAM
To test the performance of the sequential SC program, we applied a local average operation to spreadsheets of different sizes.⁴ For comparison, we have designed a simple C (SimpC) program that directly performs the same operations. Another program, which lists the computing cell formulas verbatim as ECE code, has also been implemented; it uses parts of the original SC program (i.e., the framework, largely the user interface) but not its expression parser and evaluator. The local average operation (like that in image processing [Zhou, 1992]) is based on a local average operator, which works as shown below and in Figure 1:

C_n(i, j) = [C_o(i-1, j-1) + C_o(i-1, j) + C_o(i-1, j+1)
           + C_o(i, j-1)   + C_o(i, j)   + C_o(i, j+1)
           + C_o(i+1, j-1) + C_o(i+1, j) + C_o(i+1, j+1)] / 9,
⁴No better computation-intensive application was available at the time of implementation.
Figure 1. Spreadsheet computing cells: the 3 x 3 neighborhood from (i-1, j-1) to (i+1, j+1) around cell (i, j).
where C_n(i, j) is the new value of the spreadsheet cell at (i, j) after the local average operation and C_o(i, j) denotes the old value of the cell before the operation. The running times⁵ needed for the three different spreadsheet computations have been measured on a Sparcstation 1 and are shown in the following subsections.
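For illustration only, a loop-based C sketch of one such local average pass over a plain two-dimensional array is given below. The actual SimpC and ECE-SC code described next lists every cell expression verbatim instead of looping, and the boundary handling here is a simplifying assumption.

/* Illustrative only: one pass of the 3 x 3 local average over the interior of
   an N x N array of cell values; boundary cells are skipped for brevity. */
#define N 32

static void local_average(const double old_cells[N][N], double new_cells[N][N])
{
    int i, j;

    for (i = 1; i < N - 1; i++) {
        for (j = 1; j < N - 1; j++) {
            /* new value = mean of the nine old cells centered at (i, j) */
            new_cells[i][j] =
                (old_cells[i-1][j-1] + old_cells[i-1][j] + old_cells[i-1][j+1] +
                 old_cells[i  ][j-1] + old_cells[i  ][j] + old_cells[i  ][j+1] +
                 old_cells[i+1][j-1] + old_cells[i+1][j] + old_cells[i+1][j+1]) / 9.0;
        }
    }
}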
2.1 Spreadsheet Computing with the SC Program

By executing the local average operation on spreadsheets of different sizes (8 x 8, 16 x 16, 32 x 32, 64 x 64, 128 x 128), we obtained the time measurements shown in Table 1.
2.2 Spreadsheet Computing with a SimpC Program

The same spreadsheet computing formulas as in the SC program above have been coded in a simple C (SimpC) program. That is, the time-consuming SC framework is not included, and a much simplified data structure is used. All expressions are listed verbatim in the source code of the SimpC program, without the use of "for" loop statements in C.
⁵All time measurements in this article are in seconds.
Obviously, this is the fastest sequential implementation of the same computations; Table 1 shows the measurements. The large difference in computing time between the SC and SimpC programs (Table 1) indicates that the SC program has huge overhead due to its inefficient working mechanism of run-time expression parsing and evaluation. In the SimpC program, the expressions are compiled into executable code before execution (we may call this compile-time interpreting); therefore, the time spent is significantly reduced.
2.3 Spreadsheet Computing with the ECE Substitution in the SC Framework

Based on the idea of compile-time interpreting, we have developed a fast but simplified version of the SC program under the original framework. For a given spreadsheet application, the new ECE-SC program inserts the ECEs of that application verbatim into the source code of the original SC program as substitutions for the corresponding parsing and interpreting subroutines. Figure 2 presents a section of the verbatim ECE replacements listed in the source code of the original SC program. This is a fixed implementation with no flexibility of run-time expression editing through the SC user interface. However, with this implementation, a major acceleration has been achieved even before parallelization. The time spent is shown in Table 1. Finally, we obtain the performance curves of the same application implemented sequentially with the three different working mechanisms (Figure 3), where Log is the logarithm to base 10, Log2 is the logarithm to base 2, and size is the (equal) row or column length of the spreadsheet matrices.
3. DISTRIBUTED IMPLEMENTATION OF THE ECE-SC MECHANISM
With the sequential ECE implementation of the SC program, an acceleration factor of about 15-20 (see Tables 1 and 3) has already been reached compared with the original SC program (for the specific application used).
Table 1. Running Times of an Application in SC, SimpC, and ECE-SC Versions (times in seconds)

Version    8 x 8       16 x 16     32 x 32     64 x 64     128 x 128
SC         0.015642    0.075822    0.342660    1.436955    5.939589
SimpC      0.000291    0.001741    0.014390    0.064261    0.269416
ECE-SC     0.001035    0.002836    0.020954    0.092204    0.3770453
(*ATBL(tbl,101,1))->v = ((*ATBL(tbl,0,0))->v + (*ATBL(tbl,0,1))->v + (*ATBL(tbl,0,2))->v
                       +  (*ATBL(tbl,1,0))->v + (*ATBL(tbl,1,1))->v + (*ATBL(tbl,1,2))->v
                       +  (*ATBL(tbl,2,0))->v + (*ATBL(tbl,2,1))->v + (*ATBL(tbl,2,2))->v) / 9;
(*ATBL(tbl,101,2))->v = ((*ATBL(tbl,0,1))->v + (*ATBL(tbl,0,2))->v + (*ATBL(tbl,0,3))->v
                       +  (*ATBL(tbl,1,1))->v + (*ATBL(tbl,1,2))->v + (*ATBL(tbl,1,3))->v
                       +  (*ATBL(tbl,2,1))->v + (*ATBL(tbl,2,2))->v + (*ATBL(tbl,2,3))->v) / 9;
...

Figure 2. A section of ECE statements in the SC source code.
Because this is just a simplified prototype implementation, no fault tolerance is considered for the time being. Furthermore, consideration of run-time interprocess communication has also been deferred.
3.1 Implementation Approaches
Sockets are the essential and probably the most useful facilities supporting interprocess communication in the Unix world. Many networking tools (e.g., the File Transfer Protocol) have been implemented using connected Transmission Control Protocol (TCP, stream) sockets or unconnected User Datagram Protocol (UDP) sockets. The connected stream socket has certain speed advantages over the unconnected UDP socket and is usually used when applications such as ours need fast, established, stable communication channels. Once the stream-socket channels are established, they are kept as long as they are needed; whenever data are transmitted through the network, no additional handshaking is required, whereas unconnected socket communication requires handshaking and missed-message checking every time a package is sent (Ernst, 1990; Stevens, 1990). A minimal connection sketch is given at the end of this subsection.

Nowadays, many supporting environments for parallel programming have been developed, for instance, the commercially available EXPRESS, Parform (Cap and Strumpen, 1992), Marionette (Sullivan and Anderson, 1989), etc. A public domain package called PVM⁶ (parallel virtual machine) (Sunderam, 1990; Geist and Sunderam, 1991; Beguelin et al., 1991b) has been used here. PVM uses UDP sockets for its communication channels. For performance comparison, we have used two approaches for the distributed implementation in our study:

1. The first is based directly on TCP stream-socket programming.
2. The second is based on the PVM environment.
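For illustration only (not the actual master source, which the paper does not reproduce), establishing such a long-lived connected stream socket from the master to one slave might look roughly like this; the helper name, address format, and port are assumptions:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: open one connected stream (TCP) socket to a slave and
   keep it for the whole run, so later transmissions of cell data need no
   further per-message handshaking. */
static int connect_to_slave(const char *slave_ip, unsigned short port)
{
    struct sockaddr_in addr;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0) {
        perror("socket");
        return -1;
    }
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    addr.sin_addr.s_addr = inet_addr(slave_ip);   /* dotted-decimal address assumed */
    if (connect(fd, (struct sockaddr *) &addr, sizeof(addr)) < 0) {
        perror("connect");
        close(fd);
        return -1;
    }
    return fd;   /* descriptor stays open; only numeric cell data flow over it */
}

A matching slave would perform socket(), bind(), listen(), and accept() once at start-up and then simply read blocks of numbers from the accepted descriptor.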
3.2 Stream-Socket Implementation of the ECE-SC
The well-known client-server (or master-slave) model is the approach currently used for most distributed implementations: the client (master) sends raw data to, and collects results from, the servers (slaves). There is usually one master process and many slave processes. The slave processes may be located on the same processor as the master process or, more probably, on other remote processors. The master process connects itself to and starts up the slave processes, and then the distributed computing begins.

In our SunOS Unix network system, there are two ways to start up the slave processes. One way is to use the widely used automatic inetd daemon; the other is manual start-up by means of shell scripts. Although the inetd start-up is more convenient to manage, it involves one more fork() and some additional file-descriptor operations at run time and therefore increases the run-time overhead, so direct manual start-up has been used here. Another advantage of manual start-up is that the number and names of the live processors are known in advance. This is useful if a task graph allocation or scheduling algorithm is to be used; for our static implementation it is of special importance, because transparent task graph (process) allocation or scheduling can then be made before execution of the program. Because the start-up process is a once-for-all procedure, the time required for the slow pre-run-time manual start-up does not influence the run-time performance of the distributed program.

Figure 3. Performance curves of the sequential implementations (SC, SimpC, and ECE-SC): ln(execution time) versus log2(size).

⁶Developed cooperatively by the University of Tennessee, Oak Ridge National Laboratory, and Emory University.
3.2.1 The master (client) program. In our program, the master takes care of the user interface and handles some other management operations. At the same time, it takes charge of the raw-data package partitioning and packing (dividing the whole spreadsheet into regular groups of cells, i.e., blocks) as well as their distribution into the network. The resulting data-package collecting and unpacking (assembling the collected data blocks back into the spreadsheet cells) are also conducted by the master program.

3.2.2 The slave (server) program. The slave processes are started up by requests from the master process. The ECEs in the spreadsheet cells of every block are listed verbatim in the corresponding individual slave's source code. The ECEs are generated by an automatic expression-generating program or, in our present implementation, manually. Thus, only the data (i.e., the numbers) need to be transmitted to the slave processes through the network at run time; the expressions need not be transmitted and run in an interpretive way (as in the original SC) by the slaves at run time. The source code of one slave usually contains only one block of expressions from the spreadsheet, and a slave processor receives and computes only one block of data. The data are received from and sent back to the master according to a predefined order to secure maximum speedup. Thus, some otherwise common system calls such as fork(), select(), etc., which are used almost universally in similar situations (Stevens, 1990), have been avoided. To keep the data of every block in one package and thereby accelerate the communication (package-sending) speed, the iovec structure has been used.

3.2.3 Two spreadsheet applications. Two different spreadsheet computations have been implemented:

1. The local average operation on a 32 x 32 matrix of spreadsheet cells. To make this operation computation-intensive, it has been run between 1 and 10,000 times.
2. The computation of the future values of 10-5,000 monthly payments (deposits to a bank) of random amounts at an interest rate of 0.6% per year (0.0005 per month) for each cell of a spreadsheet with 64 x 64 cells. The difficulty of its sequential implementation is that there is no simple formula for this computation (a small iterative sketch is given below).

For the first computation, we have included overlapping boundary rows of cells (see the package sizes in Table 2) in the partitioned packages so as to avoid interprocess communication.
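To make the second item concrete: because each deposit amount differs, the future value has to be accumulated month by month. The following is a minimal illustrative sketch; the compounding convention and the monthly-rate argument (0.0005 in our setting) are assumptions, and the paper does not reproduce the actual cell code.

/* Illustrative only: future value of a series of monthly deposits of varying
   amounts, compounded month by month, since no closed-form formula applies. */
static double future_value(const double deposit[], int months, double monthly_rate)
{
    double balance = 0.0;
    int k;

    for (k = 0; k < months; k++) {
        /* add this month's deposit, then let the balance earn one month of interest */
        balance = (balance + deposit[k]) * (1.0 + monthly_rate);
    }
    return balance;
}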
Table 2. Timings (Seconds) of the Distributed ECE-SC (Local Average Operation)

          1 Sparc (sequential,   3 Sparcs (package   5 Sparcs (package   10 Sparcs (package
Loops     pointer operations)    size 12 x 32)       size 8 x 32)        size 5 x 32)
10,000    211.014554             31.988956           18.371369           14.360976
5,000     105.905128             15.727752           9.143821            6.289417
1,000     21.259582              2.924144            1.830418            0.919236
500       10.533508              1.494740            0.893314            0.492398
100       2.0888395              0.310115            0.236954            0.168675
50        1.049431               0.261679            0.147158            0.098081
10        0.209348               0.098657            0.060712            0.047155
Table 3. Timings (Seconds) of the Distributed ECE-SC (Interests)

          1 Sparc (package    4 Sparcs (package   8 Sparcs (package
Months    size 64 x 64)       size 16 x 64)       size 8 x 64)
5,000     168.45424           42.4749755          23.3598562
1,000     33.3594579          9.3929585           5.7655404
500       16.8047033          5.5842815           2.7312284
100       3.32069767          1.3180555           0.5684984
50        1.67774067          0.6418225           0.3300212
10        0.377909            0.204735            0.1543344
The second problem has no data dependence at all. Therefore, the distributed implementation of these two computations involves no interprocess communication among subtasks located on different processors. The implementation of the first application has been based on 1, 3, 5, and 10 processors, respectively; because 30 is a multiple of 3, 5, and 10, these processor numbers allow a regular partitioning (see also Table 2). For the implementation of the second computation, processor numbers of 1, 4, and 8 have been used, because there is no partitioning irregularity here. The processors are all Sparcstation 1s, each with a speed of 1.4 MFLOPS. The performance measurements are shown in Tables 2 and 3, respectively (all data are subject to minor changes depending on the network load). The performance curves of Table 2⁷ are depicted in Figure 4.
3.3 PVM Implementation of ECE-SC

For the PVM version of the distributed ECE-SC, we have implemented only the interest-computing program. Similar performance data have been collected, as shown in Table 3 and depicted in Figure 5. To compare the direct stream-socket and PVM implementations, we present here the time measurements of four processors computing the interest problem. From Table 4, we see that the time difference between these two implementations is not as big as expected. In fact, only minor differences (< 5%) have been observed. The stream-socket programming method does perform better than the PVM implementation, but it requires much more effort in design, coding, and maintenance, and the programmer has to take care of more technical details. Nevertheless, the connected-socket programming method may be a better choice when dedicated hardware is available, because it has more compact source code and is more reliable than the PVM (connectionless socket) counterpart. The time required for the PVM implementation is also less stable, because a message may occasionally get lost during the unconnected communication and the package then needs to be sent again (though this seldom happened in our tests when network traffic was not busy). This results in fluctuations in the time measurements. However, the PVM version is much easier to implement, and its maintenance is also much simpler. Another advantage of the PVM implementation is that the same source code can run on many systems (both distributed and shared memory) with no or only minor changes (Beguelin et al., 1991a).

⁷In Table 2, the implementation for the case of one Sparcstation 1 is based on the original SC source code, which involves pointer operations (Figure 2). It is much more time consuming than the normal two-dimensional (C language) array operations used by the other implementations (3, 5, and 10 Sparcstation 1s). This results in the "super-linear speedup" in Table 2.
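As a rough illustration of how only numbers travel over the network under PVM, a master-side fragment in the style of the later PVM 3 C interface might look as follows; the slave executable name, message tags, and block size are assumptions, and the PVM release used in the paper may have had an older calling convention.

#include "pvm3.h"

#define BLOCK      (8 * 64)   /* cells per slave block (assumed) */
#define TAG_DATA   1          /* hypothetical message tags       */
#define TAG_RESULT 2

int main(void)
{
    int    tids[8];
    double block[BLOCK], result[BLOCK];
    int    nslaves, s;

    /* start the slave executable (name assumed) on up to 8 hosts of the virtual machine */
    nslaves = pvm_spawn("ece_slave", NULL, PvmTaskDefault, "", 8, tids);

    for (s = 0; s < nslaves; s++) {
        /* ... fill block[] with the raw cell values destined for slave s ... */
        pvm_initsend(PvmDataDefault);    /* pack only the numbers         */
        pvm_pkdouble(block, BLOCK, 1);
        pvm_send(tids[s], TAG_DATA);     /* ship them to slave s          */
    }
    for (s = 0; s < nslaves; s++) {
        pvm_recv(tids[s], TAG_RESULT);   /* collect each computed block   */
        pvm_upkdouble(result, BLOCK, 1);
        /* ... copy result[] back into the corresponding spreadsheet cells ... */
    }
    pvm_exit();
    return 0;
}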
4. CONCLUSION AND FUTURE WORK
With the two-stage acceleration refinement described in sections 2 and 3, that is, the introduction of the high-speed ECE program and its parallelization, a working mechanism has been developed that can, in principle, lead to the fastest implementation of a spreadsheet computing program if suitable parallel hardware is available. This is the major contribution of this work. For commercial application sites that use spreadsheet programs, such as stock exchange centers, business centers, banks, etc., where the computing formulas usually remain unchanged for long periods of time and the response time is of superior importance, this working mechanism and its hardware implementation can be of great application significance.
Figure 4. Performance curves of the distributed ECE-SC (local average): ln(execution time) versus log(loops) for 1, 3, 5, and 10 Sparcstations.

Figure 5. Performance curves of the distributed ECE-SC (interest): ln(execution time) versus log(months).
This work presents a simplified prototype implementation whose aims are to collect some firsthand performance data on a distributed memory multiprocessor system and to predict the potential performance behavior when a similar working mechanism is implemented on other hardware platforms. One observation is that the time needed to compute a 32 x 32 local average with 1,000 loops under the original SC is about 1,000 x 0.34266 = 342.66 seconds (this can reasonably be deduced from the linear curve for one Sparcstation 1 in Figure 4 and from Table 1); the time needed to perform the same operation on 10 Sparcstations is 0.919236 seconds (Table 2), an acceleration factor of 342.66/0.9192, or about 373.

Although substantial acceleration has been achieved with the distributed implementation of the ECE-SC program, it is difficult to achieve good performance if the distributed implementation is limited to run on distributed memory systems such as the Sparcstation network we have used. Unfortunately, most spreadsheet applications are data rather than computation intensive, and much package passing needs to be done if a distributed memory system is used. If a shared memory system is used, this massive package-passing problem can be alleviated. This leads to the idea of implementing the parallel ECE-SC program on shared memory multiprocessor systems such as the Alliant and the Sequent. As described in the previous section, this can easily be done with the PVM system.

In the present implementation, the task allocation or scheduling has been done manually with a simplified data dependence structure. The task allocation work has been reported elsewhere (Zhou, to appear (b)). Future work will emphasize the automatic compile-time task-scheduling issue. Because many spreadsheet applications have recursive data structures, the parallel implementation of such spreadsheets is also a challenging problem, which may involve troublesome issues such as process migration. For a more general distributed implementation of spreadsheet programs, to ensure good performance we strongly recommend that the underlying system be a shared memory system or some dedicated architecture.
Table 4. Timings (Seconds) of the PVM and Stream-Socket Implementations of ECE-SC (Interest Computation on Four Processors)

Months          10        50        100       500       1,000     5,000
Stream socket   0.2041    0.6418    1.3181    5.5843    9.3930    42.4750
PVM             0.3570    0.6739    1.3201    5.6813    9.4614    43.7648
ACKNOWLEDGMENTS

We thank Dr. Clemens Cap and Edgar Lederer for many useful discussions. The idea of ECE-SC was suggested by Edgar Lederer; Dr. Cap helped with technical details. We also thank Rainer Sinkwitz for discussions on SC and PVM.
REFERENCES

Ambler, A. L., Forms: Expanding the visualness of sheet languages, in Proceedings of the 1987 Workshop on Visual Languages, Sweden.
Amsterdam, J., Build a Spreadsheet Program, BYTE (July 1986).
Banerjee, U., Dependence Analysis for Supercomputing, Kluwer, 1988.
Beguelin, A., Dongarra, J., Geist, A., Manchek, R., and Sunderam, V., The PVM and HeNCE Projects, 1991a.
Beguelin, A., Dongarra, J., Geist, A., Manchek, R., and Sunderam, V., A Users' Guide to PVM-Parallel Virtual Machine, 1991b.
Cap, C. H., and Strumpen, V., The Parform-A High Performance Platform for Parallel Computing in a Distributed Workstation Environment, TR-92.07, Department of Computer Science, University of Zurich, Zurich, Switzerland, 1992.
Du, W., and Wadge, W. W., The Eductive Implementation of a Three-Dimensional Spreadsheet, Software Practice and Experience 20 (1990).
Eichinger-Wieschmann, B., C Tabellenkalkulation, Vieweg, 1989.
Ernst, H., Interprozess-Kommunikation für verteilte Prozesse unter SunOS 4.0, Semesterarbeit, Department of Computer Science, University of Zurich, Zurich, Switzerland, 1990.
Geist, G. A., and Sunderam, V. S., Network Based Concurrent Computing on the PVM System, Oak Ridge National Laboratory, Oak Ridge, Tennessee, 1991.
Miller, R., Computer Aided Financial Analysis, Addison-Wesley, Reading, Massachusetts, 1990.
Sajaniemi, J., et al., An Empirical Analysis of Spreadsheet Calculation, Software Practice and Experience 18 (1988).
Sarkar, V., Partitioning and Scheduling Parallel Programs for Multiprocessors, Ph.D. Thesis, 1989.
Stevens, W. R., UNIX Network Programming, Prentice-Hall, Englewood Cliffs, New Jersey, 1990.
Sullivan, M., and Anderson, D., Marionette: A system for parallel distributed programming using the master/slave model, in Proc. 9th Intl. Conf. on Distributed Computing Systems, June 1989.
Sunderam, V., PVM: A Framework for Parallel Distributed Computing, Concurrency Practice and Experience 2 (1990).
Zhou, H. B., An Effective Approach for Distributed Program Allocation, J. Parallel Algorithms and Applications, to appear (b).
Zhou, H. B., Image processing in a workstation-based distributed system, in Proc. 2nd Intl. Conf. on Automation, Robotics, and Computer Vision, Singapore, September 1992.
Zhou, H. B., Two-Stage m-Way Graph Partitioning, J. Parallel Algorithms Appl., to appear (a).
Zhou, H. B., Scheduling DAGs on a Bounded Number of Processors-A New Approach, in press (1994).