
A distributed-termination experiment on a mesh-connected array of processors *

Jianjian Song
Department of Electrical Engineering, National University of Singapore, Singapore 0511

Received 6 November 1991
Revised 13 February 1992

Abstract

Song, J., A distributed-termination experiment on a mesh-connected array of processors, Parallel Computing 18 (1992) 779-791.

In totally asynchronous computation any number of processing elements (PEs) may start simultaneously. Termination detection for such computation is difficult because of the asynchrony and the lack of a single root (initiator) to accumulate termination information. This paper presents a symmetric, asynchronous, and distributed solution to this problem for a mesh-connected array of PEs, together with its implementation and evaluation on a parallel computer. Our algorithm produces fewer messages for termination detection than the various counter-token schemes. The worst time for it to finish is O(n), where n is the number of PEs on a square mesh. An experiment on a 64-node NCUBE parallel computer showed that it was at least twice as fast as a centralized termination-detection algorithm. The algorithm makes three assumptions: only nearest-neighbor communication exists; activation messages are always broadcast to all neighboring PEs to avoid deadlock; and any number of PEs can initiate the computation. It defines tokens and dependency arcs: tokens collect the termination states of regions of the mesh, and dependency arcs establish terminating precedence among connected PEs. Together they reduce the number of messages needed for termination detection. Two sufficient conditions are shown to guarantee that any one of the PEs may detect termination. Although the algorithm originated for use with totally asynchronous computations for finite element analysis, it is applicable to any iterative computation on a mesh.

Keywords. Asynchronous iteration; distributed termination; finite element computation; hypercube computer; mesh

1. Introduction

Computation for finite element analysis can be distributed on an array of processing elements (PEs). The PEs are usually connected as a mesh since a mesh is a good match to the grid patterns used in finite element analysis. Although the computation can be done by either direct or iterative solution techniques, iterative solutions have been found more suitable for utilizing the power of parallel processing [1,2].

* A preliminary version of this paper, without the experimental results, was presented at the 1988 International Conference on Parallel Processing, Chicago, Illinois, USA, August 1988. Correspondence to: Jianjian Song, Department of Electrical Engineering, National University of Singapore, Singapore 0511.

Among iterative approaches, asynchronous iterations have been attracting more attention [3,4]. Asynchronous computation has all the attributes of other distributed computations: a PE is either active in its computation or passive when finished with it; passive PEs may be activated again by messages from active PEs; and the pattern of message transmission cannot be decided a priori.

One of the challenges in applying iterative solutions is to determine when the computation is complete. An ideal solution to the determination process should have the following properties:
(1) It does not interfere with the computation process (transparency).
(2) It does not require dedicated communication channels.
(3) It does not use a predesignated processor (host, root, etc.) that observes the states of all the PEs, i.e. the solution should be fully distributed and symmetric.
(4) It should allow delay of message transmission through communication channels, i.e. transmission need not be instantaneous.

The termination-detection process can be centralized or distributed. In [1,2] and [4], termination detection was solved by appealing to a global synchronization mechanism. When a PE finishes its current computation, it reports its state to a predesignated PE (called the host or root). The host collects the states of all PEs to decide whether the computation has terminated. Global termination detection for totally asynchronous, iterative computation is difficult to implement, if not impossible, because a passive PE may be activated by a message from another PE, and this change in PE state must be made known to the host. The host may never know the real state of the computation at a given point in time, due to communication delays, unless message transmission is assumed instantaneous. Global synchronization may also be time-consuming, since global communication is usually slower than local communication; hence, PEs may idle, waiting for synchronization.

Techniques for distributed termination can be used when global synchronization is not a good choice. Distributed termination has been discussed in the literature [5-10], where two basic approaches are studied: state detection and message counting. The state detection method assumes that communication is instantaneous, so that a PE does not have to check whether any message is delayed in the communication channels; a snapshot of the states of the PEs is sufficient to decide whether the computation on the array is complete. The message counting method assumes that communication may have arbitrary delay. It uses the total sum of messages received and sent in the system to determine whether all messages sent have been received [10].

The previous state detection approaches have the following characteristics:
(1) A dedicated communication network (CN) is assumed for the purpose of termination detection. (CN is a tree in [5,9] and a ring in [6,7].)
(2) Termination detection is a continuous, trial-and-error process. A detecting probe (token or control message) or detecting wave is initiated and circulated periodically in CN until the completion of computation has been detected.
(3) All the approaches, except the one in [7], use a predesignated processor to detect termination. While the method in [7] is distributed and symmetric in the sense that any processor may detect termination, it uses a common clock, which is not desirable in practice.

There are drawbacks in these approaches.
First, CN should not be used to transmit computation messages; otherwise, both the speed of termination detection and that of computation would be reduced. There must be another network for computation messages, adding hardware and complexity to a parallel computer system. The second problem with these methods is the large number of messages (counter tokens) travelling in CN. Some messages that will eventually be destroyed must pass through several processors even though

they have become obsolete at the beginning of the journey [7]. These problems are inherent in the ring structure and in the assumption that data messages can be passed between any processor pair.

The situation is different in a mesh-connected array of PEs. Communication in the mesh is through nearest-neighbor connections, which are fixed and local; therefore, only messages between adjacent PEs exist. Taking advantage of the mesh structure, a dependency-based method is proposed in this paper for distributed termination on a rectangular mesh. When finished with its computation, a PE sends its termination signal to its neighboring PEs. One token from each side of the rectangular mesh is initiated once. The tokens travel in the mesh according to rules derived later in this paper. Their traces and positions indicate the computation status of the mesh. One or more of the PEs will eventually be able to detect termination by examining its own record of token arrivals, its state, and its neighbors' states. The method is fully distributed and symmetric in the sense that no PE has more responsibility than the others [7]. It is also asynchronous, and as such it uses fewer messages for termination detection than the counter-token or synchronous approaches. These characteristics make the method an ideal solution for asynchronous computation.

This paper describes the above method and the results of its implementation. Section 2 gives the assumptions about a rectangular mesh and PEs as well as some definitions. Section 3 proposes a method for the case where there is no activation message; in this case a PE remains passive once it becomes passive, making it easier to detect termination. Section 4 describes an improvement of the method so that it works even if activation messages exist. Section 5 presents correctness proofs and speed analysis. Section 6 reports the results of an implementation of the method on a 64-PE NCUBE parallel computer, and Section 7 concludes the paper.

2. Assumptions and definitions

This section lists some general assumptions about the mesh and defines data and control messages, states, tokens, and arrows. Messages are grouped under data type and control type. A data message carries values used in computation, e.g. coordinates or potentials. A control message carries information on the state of a PE, an arrow or a token being good examples, as seen later. A passive PE may be activated by data messages from other PEs. (Such data messages are called activation messages by some researchers [19].)

Two assumptions are made throughout the paper. One is that message transmission is instantaneous, so that the state of the mesh is equivalent to the set of all PE states. This assumption is essential to the correctness of all state-checking methods for termination detection: if it does not hold, the computation may not be complete even if every PE is idle, because there may be delayed data messages in the communication channels. The other assumption is that there is a continuous communication process that handles messages whether or not computation is in progress, so that control messages can be passed for termination detection.

As shown in Fig. 1, the array consists of PEs, each connected to six of its nearest neighbors to form a mesh. A PE may be in one of two states: active or passive. A PE is active when it is contributing to the computation and passive when it has finished its assigned computation. The white square represents an active PE and the black square a passive PE. Computation is complete when all PEs are passive and there is no data message in any communication channel. Detection of this state by one of the PEs is called distributed termination.

[Fig. 1. Structure and notations: the mesh of PEs, labelled with 3-bit addresses 000-111; the legend gives the symbols for active and passive PEs and for the four tokens 'North passive', 'South passive', 'East passive', and 'West passive'.]
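To fix ideas, the sketch below shows one way a PE's bookkeeping for this scheme might be laid out in C. All names are illustrative, not taken from the paper, and for simplicity only the four line directions used by the arrows and tokens are tracked.

    #include <stdbool.h>

    /* Illustrative per-PE bookkeeping for the termination scheme
     * (a hypothetical layout; the paper does not prescribe one). */

    enum pe_state { ACTIVE, PASSIVE };

    /* The four directions, named after the boundary each token starts from. */
    enum dir { NORTH, SOUTH, EAST, WEST, NUM_DIRS };

    struct pe_record {
        enum pe_state state;
        bool is_boundary[NUM_DIRS]; /* does this PE lie on that boundary?          */
        bool arrow_recv[NUM_DIRS];  /* valid (undestroyed) arrow from that side    */
        bool token_seen[NUM_DIRS];  /* has the 'X passive' token ever visited us?  */
        bool token_held[NUM_DIRS];  /* does the 'X passive' token reside here now? */
        int  pending_arrows;        /* arrows owed by neighbors we activated
                                       (used by rule (3) of Section 4)             */
    };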

The token is a messenger travelling in the mesh to collect global information about the state of the mesh. It is represented by half-black-and-half-white squares. Token 'North passive' indicates that all PEs north of the west-east line where the token resides (not including PEs on the line) have been passive. (This definition of a token differs from that of most token-passing schemes in that it is not a counter and is not destroyed and re-initiated repeatedly.)

The arrow is a passive-state messenger for a group of PEs on a line of the mesh. Initially, only the boundary PEs have arrows, pointing in the directions in which the arrows will move. A PE passes an arrow to another PE only if the former has an arrow and is passive. An arrow therefore carries two messages: the state of passiveness of the PE that sent the arrow, and the past states of the PEs on the segment of the line from the end where the arrow originated up to the receiving PE. When a PE receives an arrow, it knows that the PEs on that line segment have been passive.

To simplify the discussion, six combinations of arrow reception patterns are named as shown below: Single, Normal, Collision, Cross, Triple, and Summit. An arrow pointing to a PE indicates that the PE has received the arrow.

[Illustration: the six arrow reception patterns (Single, Normal, Collision, Cross, Triple, Summit).]

[Illustration: Shift-pass and Cross-pass of a token between a sender and a receiver, (a) before and (b) after the pass.]

Single and Collision are the most important combinations, since they are examined by the PEs to move the tokens, as discussed in the next paragraph and the next section.

The tokens are passed in two ways, called Shift-pass and Cross-pass, as illustrated above. The pictures show the state of two neighboring PEs before a token is passed from a passive PE to another PE, and the state after the token is passed. Shift-pass moves the south-passive token along the east-west line. Cross-pass advances the token in the direction opposite to the one it represents; it takes place when all the PEs on and south of the line where the token resides have been passive. The various situations in which the south-passive token is passed are illustrated in Fig. 2.
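As a rough sketch of these two moves for the south-passive token, continuing the hypothetical record above: send_token is an assumed communication primitive, the collision test follows my reading of Fig. 2, and boundary cases (rule (3) of Section 3) are omitted.

    /* Assumed primitive: hand the given token to the neighbor in direction 'to'. */
    extern void send_token(enum dir token, enum dir to);

    /* How a passive PE holding the 'South passive' token might forward it. */
    void forward_south_token(struct pe_record *pe)
    {
        if (pe->state != PASSIVE || !pe->token_held[SOUTH])
            return;

        if (pe->arrow_recv[EAST] && pe->arrow_recv[WEST]) {
            /* Collision: every PE on this east-west line has been passive,
             * so cross-pass the token one line north. */
            pe->token_held[SOUTH] = false;
            send_token(SOUTH, NORTH);
        } else if (pe->arrow_recv[EAST] || pe->arrow_recv[WEST]) {
            /* Single/Normal: shift-pass the token along the line, in the same
             * direction the arrow continues to travel. */
            pe->token_held[SOUTH] = false;
            send_token(SOUTH, pe->arrow_recv[EAST] ? WEST : EAST);
        }
    }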

3. Distributed termination in the absence of data messages

Absence of data messages makes the detection problem simpler, since a PE cannot become active again once it is passive. This section describes a method to detect termination in that setting. The method is based on three sets of rules: state transition rules, arrow passing rules, and token passing rules.

[Fig. 2. The dotted arrows indicate the direction(s) in which the token is passed, for each of the patterns Single, Normal, Triple, Cross, Collision, and Summit.]


The principle behind the method is that a token will never cross a line of PEs if any PE on that line has never been passive. Therefore, a token's location always indicates that the PEs on the lines the token has passed have been passive. (If there is no activation message, these PEs remain passive.) A PE maintains a record of the arrivals of tokens and arrows from its neighbors and transmits the arrows and tokens according to the rules listed below to assure termination detection.

State transition rules
(1) A PE is activated by initialization or program loading.
(2) A PE becomes passive when it has finished its assigned computation.

Arrow passing rules
An active PE does not pass any arrow. A passive PE does not pass arrows before receiving arrows from its neighbors, unless the PE is on the boundary. Every boundary PE is assigned arrows for all the directions when it is initialized. The arrows are passed according to the following rules:
(1) A boundary PE passes its arrows to all its neighbors as soon as it becomes passive.
(2) A non-boundary passive PE passes an arrow it has received so that the arrow continues to travel in the same direction. For example, if an arrow came from the north of a PE, it is passed to the south.
Rule (1) starts the termination-detection process. Rule (2) makes the state of passiveness propagate along a line; if all PEs on the line are passive, it may be possible to cross-pass a token. If a PE receives two opposite arrows from its neighbors, it passes both arrows on to its corresponding neighbors. This ensures that the line-passiveness information reaches all the PEs on the same line, which is needed to cross-pass a token.

Token passing rules
There are four different tokens, residing initially on the four boundaries of the mesh. The initial locations of the tokens are not critical as long as they are on their corresponding boundaries. A passive PE may pass tokens to other PEs. The rules for passing tokens follow (see Fig. 2 for examples):
(1) A PE in the Single or Normal state shift-passes a token in the direction in which it passes its arrow.
(2) A PE cross-passes a token if there is a collision in the token's direction, indicating that all PEs on the lines that the token has traversed have been passive.
(3) A boundary PE keeps any token it has received.

Each PE records the tokens that have visited it. Any PE may detect termination by examining the following two sufficient termination conditions:

Condition 1. A passive PE declares the computation terminated when the PE has recorded the arrivals of all four tokens. The tokens may or may not be with the PE at the time of the decision.

Condition 2. A passive PE on a boundary declares the computation terminated if it has a collision and receives the token from the opposite boundary.

The correctness of these conditions is shown in Section 5.
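A minimal sketch of the two checks, continuing the hypothetical record from Section 2. Condition 2 is shown only for a PE on the north boundary; the other three boundaries are symmetric.

    #include <stdbool.h>

    /* Sufficient termination checks of Section 3 (sketch only). */
    bool detects_termination(const struct pe_record *pe)
    {
        if (pe->state != PASSIVE)
            return false;

        /* Condition 1: all four tokens have been recorded at this PE. */
        bool all_four = true;
        for (int d = 0; d < NUM_DIRS; d++)
            all_four = all_four && pe->token_seen[d];
        if (all_four)
            return true;

        /* Condition 2 (north-boundary case): this PE has an east-west arrow
         * collision and has received the token from the opposite boundary. */
        if (pe->is_boundary[NORTH] &&
            pe->arrow_recv[EAST] && pe->arrow_recv[WEST] &&
            pe->token_seen[SOUTH])
            return true;

        return false;
    }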

Distributed-termination experiment

785

4. A solution with the existence of data messages

When data messages exist, an active PE can activate a passive PE by sending it a data message. If the active PE had received an arrow from the passive PE, the arrow should be destroyed, indicating that the passive PE has become active. For many applications of the mesh-connected array, data messages are sent to all six neighbors. For example, a potential change at a grid node in finite element analysis must be made known to all its neighboring nodes so that they may adjust themselves accordingly. Thus, in general, when a PE sends data messages it destroys the arrows from all of its neighboring PEs. Termination can be detected when data messages are present if the following state transition rule is added to the two rules defined in Section 3.

State transition rule (3)
(3) If a PE sends a data message to a passive PE, the sender may not declare itself passive until the receiver becomes passive again.

This rule guarantees that a token indicating that the receiver was passive will never be cross-passed to the next level by the sender, or by any PE on the same line as the sender, unless the receiver becomes passive again. If a PE sends a data message that activates any of its neighboring PEs, it should invalidate the arrows from the activated neighbors. It must then wait to receive fresh arrows from those neighbors.

Taking a snapshot of the activity in the array of PEs, we see many dependency graphs that represent the termination precedence among the connected PEs. A dependency graph is a directed acyclic graph (DAG). For any two adjacent PEs i and j, we let (i, j) be an arc from i to j, i being the tail and j the head, and we say that PE i depends upon PE j if PE i has destroyed an arrow from PE j; the termination of PE i is then conditional upon receiving another arrow from PE j. One sample snapshot is given in Fig. 3, where the dependency graphs are indicated by dotted arcs. The figure shows that PE 1 is dependent on PEs 2 and 3, PE 3 is dependent on PEs 4 and 5, and so on. The dependency relation must be known to the two PEs at both ends of a dependency arc. The tail PE (active) records the arc by invalidating the arrow from the head PE (passive) before it activates the latter. The head PE (passive) records where the data message came from, so that it knows the sender PE.

Fig. 3. Dependency graphs represented by the dotted arcs.
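A sketch of how rule (3) and the dependency arcs might be realized, continuing the earlier record. broadcast_data and the exact arrow bookkeeping are assumptions for illustration, not the paper's interface.

    /* Assumed primitive: deliver a data message to all neighboring PEs. */
    extern void broadcast_data(const void *msg, int len);

    /* Sending a data message: broadcast it, and turn every valid arrow we hold
     * into a dependency arc by destroying it. Each destroyed arrow is an arc
     * whose head must send us a fresh arrow before we may turn passive. */
    void send_data(struct pe_record *pe, const void *msg, int len)
    {
        broadcast_data(msg, len);
        for (int d = 0; d < NUM_DIRS; d++) {
            if (pe->arrow_recv[d]) {
                pe->arrow_recv[d] = false;  /* destroy (invalidate) the arrow */
                pe->pending_arrows++;       /* record the outgoing arc        */
            }
        }
        pe->state = ACTIVE;
    }

    /* On receiving an arrow back from a neighbor we had activated. */
    void receive_arrow(struct pe_record *pe, enum dir from)
    {
        if (!pe->arrow_recv[from]) {
            pe->arrow_recv[from] = true;
            if (pe->pending_arrows > 0)
                pe->pending_arrows--;       /* the arc to that neighbor clears */
        }
    }

    /* Rule (3): finishing local work is not enough; all arcs must be cleared. */
    bool may_declare_passive(const struct pe_record *pe, bool work_done)
    {
        return work_done && pe->pending_arrows == 0;
    }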


Fig. 4. An example of deadlock, where PE 1 and PE 2 are waiting for each other.

Deadlock may occur if the transmission of data messages is arbitrary, e.g. some neighbors receive them while others do not. Figure 4 shows one example, where PE 1 is waiting for an arrow from PE 2 through a path from PE 1, PE 3, PE 5, PE 6, PE 4, to PE 2, but PE 2 is waiting for an arrow from PE 1 as well. When there is a deadlock, there is a loop formed by the arcs, and all PEs on the loop have become passive at least once in the past. The loop could be broken if one of the PEs could pass an arrow to clear an arc, but this is impossible because none of the PEs on the loop can declare itself passive.

Deadlock will not occur if data messages generated by a PE are always sent to all its neighbors. By sending any data message to all the neighbors simultaneously, the sender informs the neighbors that it is active and establishes one outgoing arc with every passive neighbor. The sender must receive one arrow from every passive neighbor before it can become passive, which makes it impossible for the sender to be in a deadlock loop of arcs, as reasoned below.

It is easy to see that a loop of arcs cannot exist if data messages are always broadcast to all the neighboring PEs. Assume there is such a loop including three PEs, PE 1, PE 2, and PE 3, connected by two arcs, (1,2) and (2,3), from PE 1 to PE 3. There are three possible time sequences for creating the two arcs: they are created simultaneously; arc (1,2) is created before arc (2,3); or arc (1,2) is created after arc (2,3). We show that all three cases are impossible. The two arcs cannot be generated simultaneously, because PE 2 must be active to set up arc (2,3) but passive for arc (1,2) to be established. If PE 1 is passive, arc (2,1) will be established along with arc (2,3) when PE 2 broadcasts a data message to both PE 1 and PE 3. If arc (1,2) is created before arc (2,3), then arc (1,2) will be destroyed when PE 2 sends a data message to set up arc (2,3). And arc (1,2) cannot be established after arc (2,3), since PE 2 remains active while arc (2,3) exists. □

5. Correctness proof and primitive speed analysis

It can be shown that Conditions 1 and 2 in Section 3 are sufficient even in the presence of data messages. The key point is that all the PEs in the area a token has traversed must have been passive at some earlier time. They may be activated again by another PE outside the area; however, the token will not be able to pass that PE, as follows from state transition rule (3).


The proof of the first condition reads as follows. Assume that a passive PE has received the four tokens while the system has not terminated. Then there must exist one originally active PE (i.e. one that has never been passive), according to state transition rule (3) specified in Section 4; otherwise, there would be a deadlock loop. But then the tokens could not have crossed the two lines on which that active PE resides, which implies that no PE could have received all four tokens. This contradicts the assumption that a passive PE has received them. The proof of Condition 2 follows the same path: after any token has crossed the mesh from one side to the opposite side, no originally active PE can exist, which implies that every PE has been passive and the system has terminated.

Next we show that the worst time for the algorithm to detect termination is O(n), where n is the number of PEs, and the best time is O(√n), which is better than the ring-type algorithms that always have O(n) time complexity. Consider a square mesh of n PEs (√n on each side). Assuming that the system has already terminated, the worst time for the algorithm to detect the termination is O(n), which is the time for one token to travel through a ring that connects every PE. The best time is O(√n), which is the time for a token to travel across the mesh, or for the four tokens to meet at the center of the mesh. These time estimates represent the simplest situation: since tokens and arrows travel simultaneously with the actual computation, termination detection may be completely overlapped with it, so that detection can complete as soon as the computation is over. In comparison, the algorithm developed in [7] will always take O(n) time, where n is the number of PEs on a ring, since it is the last token (the counter in that paper) that starts the terminating wave.
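To give these bounds a concrete scale (illustrative arithmetic only; constant factors are ignored), consider the 64-PE mesh used in Section 6:

$$ n = 64,\quad \sqrt{n} = 8: \qquad T_{\mathrm{worst}} = O(n) \approx 64 \text{ token hops}, \qquad T_{\mathrm{best}} = O(\sqrt{n}) \approx 8 \text{ hops}, $$

whereas a ring-based detector such as [7] always needs on the order of $n = 64$ hops.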

6. An experiment on an NCUBE parallel computer

This section reports an implementation of our algorithm on a 64-PE NCUBE/7 parallel computer to compare its speed with that of a global polling scheme when asynchronous computations for 2-dimensional finite element analysis were executed. State transition rules (1) and (3), the arrow passing rules, and the token passing rules were applied in the experiment. Termination condition (2), which uses just one token, was implemented to simplify programming. The experiment was done on an NCUBE/7 parallel computer in the Department of Computer Science of the University of Minnesota, USA. The system is composed of a host computer and an array of 64 PEs connected as a 6-dimensional hypercube, which can implement a mesh easily [12]. The host computer is a SUN 3/140 system on which application programs are developed. Connected to the SUN system is an NCUBE/7 system with 64 8-MHz 32-bit processors (nodes), each having 128 Kbyte of RAM [11]. Two programs must be written to use the NCUBE: one runs on the host and the other runs on the nodes.

6.1. General structure of the node program

The general structure of the node program for our algorithm is given in Fig. 5, with its centralized counterpart included for comparison. Details of the experiment can be found in [13]. The major difference between the two detection methods lies in the amount of communication between the mesh of PEs and the host computer for the purpose of termination detection. The distributed scheme does not need help from the host to examine the termination state of the mesh; communication between the host and the mesh occurs only when one of the PEs informs the host of the completion. The centralized scheme relies on the host to poll every PE each time a newly finished PE informs the host of its state. The polling is a trial-and-error process that may repeat several times before the host can detect the termination.

The centralized detection:

    Loop
        calculations;
        process data messages;
        If (the PE becomes passive)
            inform the host of the fact;
            wait to hear from the host;
    End of loop;
    (The host polls to see if every PE is passive.)

The distributed detection:

    Loop
        calculations;
        If (a data message is to be sent) {
            clear all arrows received;
            send the data message;
        }
        receive arrows and tokens;
        process data messages;
        If (the PE is activated by a data message)
            record the sender;
        If (the PE becomes passive) {
            send the received arrows;
            send the received tokens;
            If (the PE is on the boundary) {
                check for termination;
                if terminated, inform the host;
            } /* end if PE on boundary */
        }
    End of loop;

Fig. 5. The distributed termination algorithm and its global polling counterpart.

6.2. Analysis of the experimental results

The algorithm with centralized termination is called Sync I and the one with distributed termination Sync II. Both programs implemented locally synchronous Jacobi iterations for grid generation and potential solution. There was no global synchronization among the PEs; therefore, it is reasonable to believe that the difference in the overall execution times of the two programs was caused by the termination-detection mechanisms, one being global polling and the other distributed. The node program was written so that any number of PEs may initiate the computation. For example, those PEs that have constant potential values and constant sources may all start to send the values to their neighbors without any synchronization among them.

[Fig. 6. The heat transfer problem: $\partial^2 T/\partial x^2 + \partial^2 T/\partial y^2 = 0$, where T is the temperature; the edges of the region are held at fixed temperatures (10 °C on one edge, 0 °C on the others).]
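The 'calculations' step each PE performs for this problem is a Jacobi relaxation of Laplace's equation. Below is a minimal sketch of the five-point update over a PE's local block of the grid; the block size is hypothetical, and boundary handling and inter-PE exchange of edge values are omitted.

    #define NX 8   /* local block width (illustrative)  */
    #define NY 8   /* local block height (illustrative) */

    /* One Jacobi sweep over the interior of the local block: each point moves
     * to the average of its four neighbors. Returns the largest change, which
     * the PE can compare against the requested precision to decide whether it
     * has (for now) become passive. */
    double jacobi_sweep(double t_new[NY][NX], const double t_old[NY][NX])
    {
        double max_change = 0.0;
        for (int i = 1; i < NY - 1; i++) {
            for (int j = 1; j < NX - 1; j++) {
                t_new[i][j] = 0.25 * (t_old[i-1][j] + t_old[i+1][j] +
                                      t_old[i][j-1] + t_old[i][j+1]);
                double change = t_new[i][j] - t_old[i][j];
                if (change < 0.0)
                    change = -change;
                if (change > max_change)
                    max_change = change;
            }
        }
        return max_change;
    }

When the returned change falls below the requested precision (the horizontal axis of Fig. 8), the PE reports itself passive; an incoming boundary value from a neighbor can raise it again, re-activating the PE.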


[Fig. 7. A regional aquifer with a single pump: a 3000 m-wide region, impermeable along the upper and lower edges, governed by $15\,\partial^2 \phi/\partial x^2 + 15\,\partial^2 \phi/\partial y^2 + Q = 0$, where Q is a point sink equal to 1500 m³/day at (2000 m, 1500 m). Boundary conditions: $\phi = 200$ m on the left and right edges, and $\partial \phi/\partial y = 0$ on the upper and lower edges.]

The two programs were executed on two example problems to compare their speeds. One example, from [14], is the heat transfer problem of Fig. 6. The other, from [15], is a regional aquifer problem with a single pump, illustrated in Fig. 7. The number of equations was made equal to the number of processors, so that a maximum of sixty-four grid nodes were generated on the 64-PE NCUBE/7 computer.

Figure 8 gives the results of the comparisons for the two example problems. As can be seen, the distributed termination algorithm was about twice as fast as the global polling one. The ratio was actually about 2.5 when the computation accuracy was low and the execution times therefore depended more heavily on termination detection. When the computation was intensive, the distributed termination process became complicated while the global polling remained simple; hence, the overall execution times of the two programs became closer.

[Fig. 8. The ratio of the overall execution times of Sync I to Sync II for the two example problems: the ratio (roughly between 1.5 and 3.0 on the vertical axis) is plotted against the requested precision, from 10⁻² to 10⁶ on the horizontal axis.]


7. Conclusions

The distributed termination method discussed in this paper has all the merits that other methods for distributed termination detection claim: it is asynchronous, distributed, symmetric, etc. It is also more efficient and faster than both global polling and the ring-based algorithms. It is efficient because no additional communication network is needed, messages are passed only the shortest distance necessary and only when they are needed, and global communication is not required until the computation terminates. It is faster because messages may travel in parallel with each other, unlike the sequential message passing of the ring approach, and because the mesh structure is fully utilized. The effectiveness of the algorithm was verified by an experiment on a 64-node NCUBE computer, which showed that the algorithm could be more than twice as fast as global polling for asynchronous finite element computations. The distributed algorithm's advantage should grow as the number of processors on the mesh increases. This work is unique in that, to my knowledge, there has been no other report of experimental results on distributed termination for totally asynchronous computation.

Application of the method is not limited to finite element analysis: it is useful for distributed termination detection for any computation on an array of mesh-connected PEs. An immediate example is the solution of nonlinear partial differential equations, where iterative solution techniques are essential [16,17]. The method is also applicable to multi-task cases: tokens, control messages, and data messages may be colored to represent different tasks, so that several iterative computations may be executed simultaneously. Another application of the algorithm could be the solution of three-dimensional partial differential equations.

Acknowledgement I would like to thank the High-Performance Computing Laboratory of the Department of Computer Science at the University of Minnesota for allowing me to use their computing resources (including the NCUBE/7 parallel computer) and for their technical support; in particular, Dr. Sartaj Sahni, Jon Burege, Perry Busalacchi, and Tim Mikula.

References

[1] R. Morison and S. Otto, The scattered decomposition for finite elements, Technical Report C3P 286, Caltech Concurrent Computation Group, Caltech, May 1985, pp. 1-22.
[2] D.D. Loendorf, Advanced Computer Architecture for Engineering Analysis, Ph.D. Dissertation, University of Michigan, 1983.
[3] G.M. Baudet, Asynchronous iterative methods for multiprocessors, J. ACM 25 (2) (April 1978) 226-244.
[4] D.A. Reed and M.L. Patrick, A model of asynchronous iterative algorithms for solving large, sparse, linear systems, in Proc. 1984 Internat. Conf. Parallel Processing (1984) 402-409.
[5] E.W. Dijkstra and C.S. Scholten, Termination detection of diffusing computations, Inform. Process. Letters 11 (1) (1980) 1-4.
[6] E.W. Dijkstra, W.H.J. Feijen and A.J.M. van Gasteren, Derivation of a termination detection algorithm for distributed computations, Inform. Process. Letters 16 (1983) 217-219.
[7] S.P. Rana, A distributed solution of the distributed termination problem, Inform. Process. Letters 17 (1983) 43-46.
[8] R.W. Topor, Termination detection for distributed computations, Inform. Process. Letters 18 (1984) 33-36.
[9] R. Cytron, Useful parallelism in a multiprocessing environment, in Proc. 1985 Internat. Conf. Parallel Processing (Aug. 1985) 450-457.
[10] F. Mattern, Algorithms for distributed termination detection, Distributed Comput. 2 (1987) 161-175.
[11] NCUBE Users Handbook, NCUBE Corporation, Beaverton, Oregon, 1987.


[12] Y. Saad and M.H. Schultz, Topological properties of hypercubes, Research Report YALEU/DCS/RR-389, June 1985, 1-16.
[13] J. Song and L.L. Kinney, Totally asynchronous approach to finite element analysis, Proc. 4th ISMM/IASTED Internat. Conf. on Parallel & Distributed Computing and Systems, Washington D.C. (1991) 310-315.
[14] R.L. Huston and C.E. Passerello, Finite Element Methods: An Introduction (Marcel Dekker, New York, 1984).
[15] L.J. Segerlind, Applied Finite Element Analysis (Wiley, New York, 1976).
[16] J.R. Rice, Parallel methods for partial differential equations, in The Characteristics of Parallel Algorithms (MIT Press, Cambridge, MA, 1987) 209-231.
[17] G. Birkhoff and R.E. Lynch, Numerical Solution of Elliptic Problems (SIAM, Philadelphia, 1984).
[18] D. Bertsekas and J. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods (Prentice Hall, Englewood Cliffs, NJ, 1989).
[19] O. Eriksen, A termination detection protocol and its formal verification, J. Parallel Distributed Comput. 5 (1988) 82-91.
[20] B. Szymanski, Y. Shi and N.S. Prywes, Synchronized distributed termination, IEEE Trans. Software Engrg. SE-11 (10) (1985) 1136-1140.
[21] C. Hazari and H. Zedan, A distributed algorithm for distributed termination, Inform. Process. Letters 24 (1987) 293-297.