Journal of Microcomputer Applications (1990) 13, 3-18

Flat Concurrent Prolog on transputers

U. Glässer, M. Kircher, G. Lehrenfeld and N. Vieth

University of Paderborn, Department of Mathematics and Computer Science, D-4790 Paderborn, West Germany
Flat Concurrent Prolog (FCP) is a general purpose logic programming language designed for concurrent programming and parallel execution. Starting with a concise introduction to the language and its underlying computational model, we describe how to implement a distributed FCP interpreter in a transputer environment using OCCAM. Basic techniques we used for exploiting and controlling parallelism are explained in terms of an abstract architecture. The result of mapping this abstract model onto transputers is presented as the concrete architecture. Substantial design issues are considered in detail.

1. Introduction

In recent years concurrent logic programming languages have turned out to be very suitable both for describing parallel objects and algorithms and for parallel execution. This has made them a powerful tool in programming parallel computers [1]. On the other hand, transputers provide a universal building block for constructing concurrent systems [2]. Their relatively low cost, high performance, and modular features lead to a steadily increasing usage of transputer systems. Hence, it seems worthwhile to investigate possibilities of combining a transputer hardware environment with a concurrent logic application language. Essentially, we are concerned with the question of how features inherent to the language can be matched with those inherent to transputer architectures in order to utilize a maximum of parallelism. The development of the distributed interpreter described in this paper is intended to provide a framework for further research and experimentation. It is implemented in OCCAM2 running on a Megaframe transputer system.

Among the various parallelization strategies applied in logic programming,* stream-parallelism is of main interest for programming, description, and simulation of digital systems. From the known concurrent logic programming languages based on the stream-parallel model, e.g. Relational Language [3], Concurrent Prolog [4], Parlog [5], Guarded Horn Clauses [6] etc.,† we chose Flat Concurrent Prolog (FCP), which was developed at The Weizmann Institute of Science [7]. This language is applicable at least to the whole class of multiple processor systems (MPS), of which transputer systems form a small subset. 'A MPS, also called a multicomputer in the literature, is defined as a multiprocessing system in which all processors have their own local memory, execute instructions asynchronously, and communicate with one another through message transfer. This class of computing systems includes message-passing multiprocessors as well as computer networks.' [8]

* A classification can be found in [9].
† An overview and comparison of stream-parallel logic programming languages is given in [10].


FCP is a process oriented language designed for general-purpose programming, which embodies dataflow synchronization and guarded-command indeterminacy as its basic control mechanisms. We briefly characterize the main language features as far as they are referred to in this paper. For a comprehensive introduction to the language and its applications consult [11]. Formally, an FCP program is a finite set of guarded clauses. Each guarded clause is a universally quantified Horn clause of the form:

    H :- G1, ..., Gm | B1, ..., Bn.    m, n ≥ 0    (1)

H is called the head predicate of the guarded clause. The guard predicates G1, ..., Gm are drawn from a predefined set of test predicates and build the guard part. The body part B1, ..., Bn contains any set of user-defined predicates. Guard part and body part are separated by the commitment operator '|'. Each predicate P ∈ {H, G1, ..., Gm, B1, ..., Bn} is a term of the form P = p(T1, ..., Tk), compounded of a functor p and argument terms T1, ..., Tk. A clause with an empty body part (n = 0) is called a fact. Declaratively, a guarded clause (1) can be interpreted as a logical implication. Here, the semantics of '|' and ',' become identical:

    H ← G1 ∧ ... ∧ Gm ∧ B1 ∧ ... ∧ Bn.    (2)

Operationally, a guarded clause can be interpreted as a rule stating that a process H' may fork into a network of communicating subprocesses B'1, ..., B'n provided that H' unifies with H and the guard part evaluates to true. B'1, ..., B'n represent the body part resulting from unifying the variables of H with the corresponding parameters of H'. In general, the behaviour of a process is specified by a set of guarded clauses with a common head predicate, i.e. identical functor and arity. The commitment operator '|' acts as a control mechanism for clause selection. A clause is selected for process reduction only if the process to be reduced matches the clause's head and the additional conditions defined in the guard part are fulfilled. In case several clauses are applicable for one reduction, FCP provides a mechanism to delay the reduction until enough information is available and a correct choice can be made. Applying the read-only operator '?' to a read-write variable 'X' (ordinary variable) yields the read-only variable 'X?'. The effect is that read-only variables are prevented from immediate binding to non-variable terms by unification, but can be instantiated by assigning a value to the corresponding read-write variable. This way, the execution of a reduction affecting 'X?' is delayed until 'X' becomes a non-variable.

The abstract interpreter definition shown below has been adopted from [11]. It illustrates the underlying computational model without specifying certain scheduling and clause selection policies.

INPUT: An FCP program P and a process Q.
OUTPUT: Q', if Q' was an instance of Q proved from P, or deadlock otherwise.
ALGORITHM:
  Initialize the resolvent to be Q, the input process.
  While the resolvent is not empty do:
    choose a process A' in the resolvent, and a fresh copy of a clause
    A :- G1, ..., Gm | B1, ..., Bn in P, such that A and A' unify and the
    resulting tests G'1, ..., G'm succeed (exit if such a process and
    clause do not exist);
    remove A' and add B'1, ..., B'n to the resolvent, and apply the
    unification to the resolvent and Q.
  If the resolvent is empty then output Q else output deadlock.
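To make the control flow of this abstract definition concrete, the following minimal Python sketch restates the reduction loop. It is illustrative only and is not the authors' implementation: the representation is our own, a "clause" is modelled as a function that encapsulates head unification plus guard evaluation, and variable bindings are not tracked, exactly as much is left open as in the abstract definition above.

    # Minimal sketch of the abstract FCP interpreter loop. A clause is a
    # function that, applied to a process, returns the instantiated body
    # processes (possibly []) on commitment, or None if not applicable.
    def interpret(clauses, query):
        resolvent = [query]                  # the resolvent, initialized to Q
        while resolvent:
            for process in resolvent:        # choice is nondeterministic in the model
                for clause in clauses:
                    body = clause(process)
                    if body is not None:     # commit: replace process by its body
                        resolvent.remove(process)
                        resolvent.extend(body)
                        break
                else:
                    continue                 # no clause applicable; try next process
                break
            else:
                return "deadlock"            # no process/clause pair applicable
        return "success"                     # an instance of Q was proved

    # Toy usage: count(N) :- N > 0 | count(N-1).   count(0).
    count = lambda p: ([("count", p[1] - 1)] if p[1] > 0 else []) if p[0] == "count" else None
    print(interpret([count], ("count", 3)))  # -> success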

2. A distributed FCP interpreter

The abstract interpreter's computational model described in the previous section provides a framework for the development and implementation of a distributed FCP interpreter. Before implementation aspects can be discussed in detail with regard to a transputer architecture, the more conceptual questions of how to exploit parallelism and how to control parallel actions need to be considered.

2.1 Parallelization concepts

An adequate and seemingly natural degree of granularity for work load distribution is given on the reduction layer. Hence, work load distribution strategies deal with mapping processes from the resolvent onto processors in order to reduce them. Two basic approaches can be characterized as centralized and decentralized resolvent organization.

Centralized organization depends on a two-level processor hierarchy. On the top level a host processor P0 takes control over the resolvent. On the bottom level subordinate processors P1, ..., Pn work as reducers. While the resolvent is not empty and no deadlock occurs, P0 sends a new process for reduction to a processor Pi (1 ≤ i ≤ n) as soon as it becomes idle. By sending the result of a reduction back to P0, a processor Pi indicates that it is idle. If the result is compatible with the current variable bindings it is applied to the resolvent; incompatible results are abandoned. Though this model has several advantages due to centralized control over the resolvent, its built-in communication bottleneck is the decisive reason to reject it: with an increasing number of reducers, the number of idle reducers waiting for the host would also increase dramatically.

The main objective in decentralizing the resolvent is to distribute the work and communication load uniformly. On the other hand, a model based on distributed control at the same time requires efficient strategies for minimizing control overhead. In the remainder of section 2 we present in some detail the decentralized approach we chose to implement.

2.2 Abstract architecture

With the term abstract architecture we associate the more basic issues in construction and function of a distributed interpreter, without any concrete underlying hardware architecture in mind. A real distributed interpreter is provided as the result of mapping this virtual interpreter onto a concrete target architecture. The quality of this mapping is a crucial aspect in the overall objective of improving efficiency.

Structural composition

A virtual distributed FCP interpreter is composed of processing elements and bidirectional communication links. Each processing element captures the full functionality of a sequential FCP interpreter and operates on a private local memory. Logically, this memory can be divided into a database, a local stack, and a global stack. The FCP program to be executed is contained in the database. The resolvent to be processed is kept in the global stack. An additional local stack is needed to store temporary bindings of logic variables. The processing elements represent the distributed interpreter's basic building blocks. They can be understood as atomic FCP interpreters, which perform elementary reduction operations sequentially. In principle a processing element may be connected to an arbitrary number of neighbour processing elements via communication links.


Communication is done using message passing. If a sender and a receiver taking part in a transmission are not direct neighbours, messages are routed accordingly. A specific processing element, which is provided with additional control and I/O functions, serves as host. With the exception of the host, all processing elements are built up the same way. They differ solely in a unique index introduced for identification. The set P of all processing elements, P = {PE1, ..., PEn}, together with the set C of all communication links, C ⊆ {{PEi, PEj} | PEi, PEj ∈ P, i ≠ j}, defines a configuration graph G(P, C) which determines the interpreter's static interconnection structure. An edge {PEi, PEj} ∈ C of the configuration graph thereby corresponds to a physical link between processing elements PEi and PEj. For constructing a virtual distributed interpreter any given interconnection structure is possible as long as the resulting configuration graph remains connected. With regard to a real distributed system the underlying hardware may impose further restrictions. For instance, choosing a configuration with a node degree that exceeds the maximum possible number of direct neighbours that a processor might have could heavily complicate the mapping.

Distributed computation

According to a dynamic load distribution strategy a common global resolvent R is partitioned into a set of subresolvents {R1, ..., Rn} and distributed over a network of asynchronously operating atomic FCP interpreters. Each Ri consists of a possibly empty process network {pi,1, ..., pi,ki}, ki ≥ 0, which in combination with all other subresolvents indicates the global state of computation: {p1,1, ..., p1,k1, ..., pi,1, ..., pi,ki, ..., pn,1, ..., pn,kn} = R. Shared variables establish communication channels between processes. Thus, a shared variable between two processes not residing in the same processing element also represents a logical communication channel between the corresponding processing elements. Finally, the overall process network topology is given as the interconnection of local process subnetworks (Figure 1). All atomic interpreters execute the same FCP program on a common input data set. Decentralized management of shared data is achieved by logically combining the global stacks of all processing elements within a common global address space. In this way, a variable residing in one processing element is also transparent to processes located in other processing elements.
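The two constraints on a configuration graph (connectedness, and a node degree bounded by the target hardware) are easy to check mechanically. The following Python sketch, with helper names of our own choosing and not part of the authors' OCCAM implementation, validates a candidate configuration before it is mapped onto hardware:

    from collections import deque

    def valid_configuration(pes, links, max_degree=4):
        """Check a configuration graph G(P, C): it must be connected and,
        for a one-to-one mapping onto transputers (four hardware links
        each), no node degree may exceed max_degree."""
        neighbours = {pe: set() for pe in pes}
        for a, b in links:                 # links are unordered pairs {PEi, PEj}, i != j
            neighbours[a].add(b)
            neighbours[b].add(a)
        if any(len(n) > max_degree for n in neighbours.values()):
            return False                   # degree constraint of the hardware violated
        seen, queue = {pes[0]}, deque([pes[0]])
        while queue:                       # breadth-first search for connectedness
            for nxt in neighbours[queue.popleft()]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return len(seen) == len(pes)

    # A 3-dimensional hypercube of 8 processing elements satisfies both constraints:
    pes = list(range(8))
    links = [(i, i ^ (1 << d)) for i in pes for d in range(3) if i < i ^ (1 << d)]
    print(valid_configuration(pes, links))  # -> True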

Figure 1. Network of communicating processes.


In a distributed computation, several reductions are performed in parallel at any time. Thereby, a single reduction step can always be separated into three phases:

Phase 1: According to the implemented scheduling strategy, the next process to be reduced is selected from the subresolvent in the global stack and copied onto the local stack.

Phase 2: The database is searched for an applicable clause to be used for reduction. Temporary variable bindings made during head unification and guard evaluation, in order to try a clause, are hidden in a local environment.

Phase 3: If the reduction was successful, the compatibility of the resulting variable bindings with the current local and global resolvent has to be checked. If consistency is maintained, the result is written back to the global stack. Otherwise the reduction attempt definitely fails.

Synchronization of concurrently executed variable assignments is a central issue in the development of the distributed interpreter. Scheduling, dynamic load balancing, and termination detection techniques have a heavy impact on the overall performance and need to be investigated further. The current version applies the bounded depth-first search scheduling described in [11] and Dijkstra's termination detection algorithm [12].

2.3 Logic variable representation

Representation and management of logic variables strongly affect the distributed reduction algorithm. Decentralized management leads to a number of coherent problems concerning issues like competing variable accesses, deadlock prevention, fair scheduling, etc. Though isolated solutions already exist for the majority of these problems, it remains difficult to integrate them into a closed solution that can be implemented efficiently as a distributed algorithm. A general approach to attacking these problems was first presented in [7]. For the prototype described here, we adopted some main concepts which provided an initial framework. A short overview illustrating the basic aspects is contained in this section.

Variable occurrences

Synchronization of parallel reduction operations involving shared variables is achieved by representing multiple occurrences of the same variable in the form of variable references. For each logic variable within the whole interpreter there exists exactly one variable entry containing its physical representation; this entry is denoted as the variable. In general, additional separate variable entries created for references may appear on the same or on other processing elements as the variable itself. Several related references may also be connected into a reference list involving one or more processing elements. The processing element on which a variable resides becomes the current variable owner. Direct variable access is exclusively restricted to the variable owner. As a result of this referencing mechanism, process parameters are pointers to be further evaluated in order to obtain the associated variables or terms. A process parameter always refers to a variable entry containing a possible variable value and a pointer value. In case the pointer value is 'nil', the variable entry indicates an unbound logic variable. Otherwise the pointer addresses a subsequent entry which has to be inspected to obtain the value. Running through a list of variable entries finally yields the valid variable or, in case the variable has already been instantiated, a corresponding term (Figure 2).
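The dereferencing rule just described (follow the pointer chain until an entry with pointer 'nil' or an instantiated term is reached) can be sketched as follows. The data layout is our own simplification and ignores remote references for the moment:

    class VarEntry:
        """A variable entry: a value (None while unbound) and a pointer
        to a subsequent entry (None plays the role of 'nil')."""
        def __init__(self, value=None, pointer=None):
            self.value = value
            self.pointer = pointer

    def dereference(entry):
        """Run through the list of variable entries: the result is either
        the owning entry of an unbound variable or the term it is bound to."""
        while entry.pointer is not None:   # follow the reference list
            entry = entry.pointer
        return entry.value if entry.value is not None else entry

    # Usage: two occurrences X1, X2 referring to the same variable X.
    x = VarEntry()
    x1, x2 = VarEntry(pointer=x), VarEntry(pointer=x)
    print(dereference(x1) is dereference(x2))  # -> True (same unbound variable)
    x.value = "f(a)"                           # instantiate the read-write variable
    print(dereference(x2))                     # -> f(a)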


Figure 2. Representation of logic variables by means of variable occurrences.

Classification of references

Depending on the location of the variable being addressed, there are two types of references. While local references identify variables within the local environment of a single processing element, remote references appear only between different processing elements. Remote references are composed of two unique identifiers, one for the processing element and the other for a variable placed on it. In that way, a global address space is realized (Figure 3). According to the direction of data flow, which results from the operational semantics of the read-only operator, variable references can be further classified into read-only and read-write. Read-only references are always directed from a read-only variable towards a read-write variable, because read-only variables by definition can be written solely by instantiating the corresponding read-write variables (Figure 4). In contrast to read-only references, read-write references represent bidirectional communication channels between two read-write variables: assigning a value to one of them at the same time means instantiating the other.

Variable migration

Remote references of the read-write type appear in situations where two processes running on different processing elements both hold a read-write occurrence of a common variable. In order to instantiate the variable, a process must ensure that the variable is local, i.e. the variable entry must reside on the same processing element as the process itself. A process merely holding a remote reference can first obtain access to the variable by performing an operation denoted as variable migration.

Figure 3. Classification of references.

Figure 4. Read-only references.

Variable migration simply exchanges the locations of the two entries for the variable and the remote reference (Figure 5). The referencing processing element therefore sends a message along the list of references to the processing element which is the current variable owner.

Read process

Phase 1 of a reduction step determines a process for reduction. In general, the reduction operation cannot be applied immediately, because of remote references occurring in the process parameter list. To obtain the current process state, remote references first have to be evaluated and replaced by actual values. This task is performed by the procedure read process, which sends request messages along the remote reference lists. When all values have arrived, unification of the process with a suitable program clause starts.

Write process

For a process reduction carried out on the local stack of a processing element to become effective, the resulting variable bindings have to be taken over into the global stack. Writing to variables may require variable migration in case remote references are involved. This may reveal incompatible variable bindings caused by concurrently executed reductions; if this occurs, the reduction is aborted. Synchronizing parallel process reductions is achieved by executing all write operations within each reduction step as an atomic action. The application of a locking mechanism guarantees this feature. One by one the necessary variables are locked, as soon as the reducing processing element becomes the owner. Locking simply blocks a variable against variable migration. The moment all variables are locked, the reduction can be completed. Afterwards the blocked variables are immediately released.

Figure 5. Application of variable migration.

Figure 6. Deadlock as a result of competing reduction attempts. var1, variable 1; var2, variable 2; R.Ref1, remote reference 1; R.Ref2, remote reference 2.

The procedure described so far offers the possibility of deadlocks (Figure 6), because it combines the four typical properties: mutual exclusion, no pre-emption, partial allocation, and circular wait. For recognizing and resolving deadlocks, the locking mechanism is extended by a control mechanism. The whole algorithm is then implemented in the procedure write process.
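The paper leaves this control mechanism unspecified. Purely as an illustration of the underlying idea, and not as the authors' algorithm, one conventional remedy is to remove the circular-wait condition by acquiring locks in a global total order, for example ordered by the globally unique variable address (processing element identifier, variable identifier):

    def lock_in_global_order(variables, acquire):
        """Acquire locks in one total order over the global address space;
        two competing reducers then never wait for each other in a cycle,
        so the circular-wait condition cannot arise. The pe_id/var_id
        attributes are hypothetical names for the two identifiers of a
        remote reference."""
        for var in sorted(variables, key=lambda v: (v.pe_id, v.var_id)):
            acquire(var)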

3. Implementation on transputers

Following the global model that section 2 presented as the abstract architecture, we have implemented the distributed FCP interpreter in OCCAM2 [13] on a Megaframe transputer system. The system we use is built up from transputers of the T414 family. Due to structural similarities between the abstract and the target architecture, it was convenient for our prototype implementation to choose a one-to-one mapping. Processing elements are identified with transputers, and communication links are identified with hardware links between transputers. The local memory of a transputer thereby corresponds to a processing element's local stack, global stack, and database. Obviously, this restricts us to configurations with a resulting node degree less than or equal to 4.

Implementing FCP in OCCAM essentially requires the adaptation of two different communication models. Though both languages rest upon process communication models, the applied synchronizing mechanisms are quite different. OCCAM was developed from the CSP model [14] using synchronous process communication. A sending process and a receiving process are synchronized by blocking one process until both processes are ready to transmit (Figure 7).

Figure 7. Synchronous communication behaviour.

Figure 8. Asynchronous communication behaviour.

In comparison to OCCAM, FCP is based upon the stream-parallel computation model using asynchronous process communication. By application of a buffer, a process ready for sending can proceed immediately. The buffer stores incoming information from the sending process as long as the receiving process is busy. On the other hand, a receive operation may be blocked: a receiving process which tries to read from an empty buffer is disabled until the sender has again written to the buffer (Figure 8). For adapting both communication models, each processing element is supplied with a special communication unit, described below as part of the concrete architecture.
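This adaptation can be pictured with an explicit buffer between sender and receiver. The sketch below uses a Python thread and queue in place of an OCCAM process and channels (our own modelling, not the OCCAM source): send returns immediately, while receive blocks on an empty buffer.

    import queue, threading

    class BufferedChannel:
        """Asynchronous channel: a buffer decouples sender and receiver,
        as required by FCP's stream-parallel communication model."""
        def __init__(self):
            self._buffer = queue.Queue()   # unbounded buffer between the two processes

        def send(self, message):
            self._buffer.put(message)      # sender is enabled immediately, never blocks

        def receive(self):
            return self._buffer.get()      # blocks while the buffer is empty

    # Usage: the receiver sleeps on the empty buffer until the sender writes to it.
    ch = BufferedChannel()
    threading.Thread(target=lambda: ch.send("token")).start()
    print(ch.receive())                    # -> token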

3.1 Concrete architecture

The basic functional units that a processing element is composed of are shown in Figure 9. For reasons of simplicity and space restrictions we have implemented the prototype’s host element without the full functionality of an atomic FCP interpreter. Instead of the distributor and reducer units being implemented in ordinary processing elements, the host element applies a specific host unit to fulfil its arbiter and I/O functions. Nevertheless, both processing element types are supplied with an identical communication unit for simulating asynchronous transfer of data and messages (Figure 10).

Figure 9. Concrete architecture of a processing and a host element.

Figure 10. Logical structure of a communication unit.

Communication unit

The communication unit has four uniquely numbered input and output ports (Figure 10). Using a routing matrix R and the processing element identification contained in an incoming message, it identifies the port to be selected so that the message travels along a shortest path. If a message addressed to processing element i is received by processing element j, then the communication unit of processing element j determines from R the correct port identifier k (0 ≤ k ≤ 3) through which the message is forwarded. The sender procedure sketched below forwards one message, together with its term in the case of a data transfer, and then signals that it is idle:
sender( CHAN OF ANY from, back, to)
  ADT message :
  ADT buffer :
  WHILE true
    SEQ
      -- take the next message from the input port
      receive.message( from, message)
      IF
        message.id = 'data transfer'
          -- a data transfer carries a term that must be forwarded as well
          receive.term( from, buffer)
        TRUE
          SKIP
      -- pass the message (and term, if any) to the selected output port
      send.message( to, message)
      IF
        message.id = 'data transfer'
          send.term( to, buffer)
        TRUE
          SKIP
      -- report readiness for the next message
      back ! 'idle'
Flat Concurrent Prolog on Transputers from

back

13

to t

Scheduling Global Stack

Process Cltstrlbutlon Read Process Wrote Process I

1L

4

I

\1 to.reducer

Figure 11.

from reducer

Logical structure

from.dist

of the distributor.

to.dlst

” Clause Selection Actual

Unlflcotlon Algortthm Built-In Procedures

Process

Local Stack

Ltobose I

i

1-1

Figure 12.

Logical

structure

of the reducer.

Atomic interpreter

An atomic interpreter functionally divides into a distributor and a reducer unit. While the distributor (Figure 11) maintains control over the subresolvent, the reducer (Figure 12) performs all activities involved in the reduction of a single process. The interaction between the distributor and its reducer is as follows. First of all, the distributor selects a process from the subresolvent and applies the procedure read process to update the current process state. The result is then delivered to the reducer. Following the implemented clause selection policy, the reducer now searches the database for applicable clauses in order to compute a reduction. All the reducer needs to fulfil this task is the unification algorithm together with implementations of the built-in test predicates for guard evaluation. The result of a reduction is afterwards delivered back to the distributor. By application of the procedure write process the distributor modifies the resolvent accordingly, provided that the variable bindings are consistent.

Beside process scheduling, the distributor also determines the work load distribution. The plain process distribution policy we have implemented applies a connection table to find a processing element's direct neighbours. For each direct neighbour, a processing element continuously updates information about the current work load situation. A processing element holding more than one process in its own resolvent distributes part of its work load if there is a neighbour with an empty resolvent (a sketch of this rule is given below). Communication between the distributor and the communication unit of the same processing element also requires processing of asynchronously generated messages. Hence, it is realized the same way as communication between different processing elements.

Host

Since the host has to fulfil several arbiter and I/O tasks, it is supplied with some extra functional units not contained in ordinary processing elements (Figure 13). With the channels keyboard and screen it holds a connection to the user. Via these channels the host loads the input program and an initial process to start with, and outputs computation results. A parser is used for syntactically checking input program clauses before they are distributed to the processing elements. Holding a copy of the current input program and the process to be computed in its local database makes it possible to output this data to the user on request. A state table recording the computation states of the subordinate processing elements is needed for deadlock detection. If a computation was successful, the initial process is updated by application of read process to yield the final result.
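The distribution rule referred to above condenses into a few lines. The helper names are ours, and the actual OCCAM code also interleaves this with message handling; the sketch shows only the decision itself:

    def distribute_load(resolvent, neighbour_load, send_process):
        """Plain process distribution policy: a processing element holding
        more than one process hands one over to any direct neighbour whose
        resolvent it currently believes (via the connection table's load
        information) to be empty."""
        for neighbour, load in list(neighbour_load.items()):
            if load == 0 and len(resolvent) > 1:
                send_process(neighbour, resolvent.pop())  # offload one process
                neighbour_load[neighbour] = 1             # update the local view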

3.2 Application

A simple illustrative example from the domain of hardware description and simulation [15], which we used for testing our interpreter, describes a 2-bit fulladder module composed of basic combinational elements (behavioural descriptions of the basic elements are left out here). The corresponding FCP program was applied on hypercube configurations of dimensions 0, 1, 2, 3, and 4.

halfadder( As, Bs, Zs, Cs) :-
    xor( As ?, Bs ?, Zs),
    and( As ?, Bs ?, Cs).

fulladder( As, Bs, Cin, Zs, Cout) :-
    halfadder( As ?, Bs ?, Cs, Ds),
    halfadder( Cin ?, Cs ?, Zs, Es),
    or( Ds ?, Es ?, Cout).

n_bit_adder( Null, A1, B1, A2, B2, .., An, Bn, Z1, .., Zn+1) :-
    fulladder( A1 ?, B1 ?, Null ?, Z1, C1),
    fulladder( A2 ?, B2 ?, C1 ?, Z2, C2),
    ..
    fulladder( An ?, Bn ?, Cn-1 ?, Zn, Zn+1).

The table below shows results of running this program on input stimuli patterns of length 16. An optimal computation would require at least 766 successful reductions. While the overhead in reductions is relatively small for a hypercube of dimension 0 (host element + 1 processing element), it increases dramatically with higher dimensions. Essentially, this is because we have not yet implemented a process suspension mechanism. Processes which cannot be reduced, because of uninstantiated read-only variables contained in the process parameter list, are tried for reduction several times in vain before the read-only variables finally get values.

Figure 13. Logical structure of the host element.

Figure 14. 3-dimensional hypercube. HE, host element; PE, processing element.

Table 1.

Dimension   Processor count   Time    Communication count   Reduction count
0           2                 11.52   8                     792
1           3                 8.32    181                   994
2           5                 5.12    1728                  1586
3           9                 4.33    2875                  1972
4           17                4.18    3135                  2191

A much better way to handle such pending reductions would be the application of a mechanism for suspending those processes from the active resolvent. As long as the affected read-only variables are uninstantiated, suspended processes could sleep without consuming computation time. Accordingly, the overall performance so far is poor, but it can be substantially improved by applying a suitable suspension mechanism to avoid busy-waiting.
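A suspension mechanism of this kind can be sketched by hanging suspended processes onto the unbound variables they wait for. Again the names and data layout are our own illustration; the mechanism was only integrated in a later version of the interpreter (see section 4):

    suspensions = {}   # read-only variable -> processes sleeping on it

    def suspend(process, read_only_var):
        """Park a process that cannot be reduced because read_only_var is
        still uninstantiated; it consumes no computation time while asleep."""
        suspensions.setdefault(read_only_var, []).append(process)

    def wake_up(read_only_var, resolvent):
        """Called when the corresponding read-write variable gets a value:
        move all sleeping processes back into the active resolvent."""
        resolvent.extend(suspensions.pop(read_only_var, []))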

4. Concluding remarks

Based on the described abstract architecture of a distributed FCP interpreter we have implemented a first prototype in OCCAM2 running on a Megaframe transputer system. The objective of this prototype implementation was to gain experience in the application of a transputer environment as a target architecture for parallel logic programming languages. The main conclusion we can draw from the communication requirements necessary to synchronize concurrently executed reduction operations on a common variable set is that a closely coupled non-shared memory multiprocessor system, such as the transputer network, represents a suitable hardware environment for embedding a parallel logic programming language like FCP. Future developments on the 'transputer market' are directed towards the further enhancement of transputer communication facilities by support of virtual links, higher transfer rates, and automatic message routeing.

From an implementational point of view, OCCAM turned out to be unsuitable for the purposes illustrated. Essentially, it does not meet the requirements for implementing the prototype efficiently because of the absence of dynamic data structures and recursion. Static storage allocation, required for buffer and list representation for example, causes serious space limitations. Hence, the testing and evaluation of larger examples, to reveal real speedup, communication overhead, and dynamic load balancing features, was not practicable. In the meantime, due to this insight, we have decided to reimplement our prototype in Par.C [16]. Besides the ANSI standard for C, the relatively new and transputer-oriented language Par.C practically offers all parallel constructs known from OCCAM. Another interesting feature of Par.C is the ability to generate new processes dynamically at run time. Application of dynamic process generation may lead to more direct support in describing the computation model of FCP.

However, improvement does not depend on the implementation language alone. Basic improvements can also be achieved on the conceptual layer. Part of this, in the new version, was a redesign of the reducer units on the basis of Warren's Abstract Machine concept (WAM). Compiling each individual input program clause into a set of abstract reduction instructions for a virtual machine significantly reduces the unification overhead at run time [17]. While the functional architecture realized so far seems quite suitable, the mechanisms for reduction synchronization, process scheduling, work load distribution, and deadlock detection need further optimization. A primary drawback of the described experimental version, the scheduling overhead caused by busy-waiting, has also been eliminated by integrating a process suspension mechanism into the scheduling algorithm.

Main activities now concentrate on analysing the interpreter's dynamic computation behaviour in order to extract the various influences of the applied control mechanisms on the overall performance. Special efforts are taken to minimize the communication overhead caused by the chosen work-load distribution strategy.


The objective in distributing work-load is maximum utilization of the parallel hardware, but it should also take into account the related costs of interprocess communication. Distributing processes that share many common variables may result in high variable migration rates, which are relatively expensive; a better choice would be to leave such processes on the same processing element. In addition to interprocess communication costs, the amount of work associated with each process is another important criterion to be considered. While the partition into subresolvents is calculated at run time according to dynamically generated processes, certain load and communication complexity measures for each process type can be determined at compile time by static analysis of program clauses. We are currently investigating the applicability of such complexity measures in process distribution techniques.

Finally, alternative possibilities exist for mapping the abstract architecture onto transputers. A single processing element might be implemented using more than one transputer. For instance, the functional splitting into separate distributor and reducer units could actually be reflected in its hardware embedding.

Acknowledgments

We wish to thank the reviewers for their helpful comments contributing to the final version of this paper.

References

1. K. Fuchi & K. Furukawa 1987. The role of logic programming in the fifth generation computer project. New Generation Computing, 5, 3-28.
2. D. May & R. Shepherd 1985. Occam and the transputer. In Concurrent Languages in Distributed Systems (G. L. Reijens, E. L. Dagless, eds). North-Holland, 19-33.
3. K. L. Clark & S. Gregory 1981. A relational language for parallel programming. ACM Proceedings of the 1981 Conference on Functional Programming Languages and Computer Architecture, 171-178.
4. E. Shapiro 1983. A subset of Concurrent Prolog and its interpreter. ICOT Tech. Report TR-003, Tokyo.
5. K. L. Clark & S. Gregory 1984. PARLOG: Parallel Programming in Logic. Research Report DOC 84/4, Department of Computing, Imperial College of Science and Technology, London.
6. K. Ueda 1985. Guarded Horn Clauses. ICOT Tech. Report TR-103, Tokyo.
7. E. Shapiro 1987. Concurrent Prolog: Collected Papers, Vol. 2. MIT Press.
8. J.-J. Hwang, Y.-C. Chow & F. D. Anger 1988. An analysis of multiprocessing speedup with emphasis on the effect of scheduling methods. IEEE Proceedings 8th International Conference on Distributed Computing Systems, 242-248.
9. J. S. Conery & D. F. Kibler 1981. Parallel interpretation of logic programs. ACM Proceedings of the 1981 Conference on Functional Programming Languages and Computer Architecture, 163-170.
10. A. Takeuchi & K. Furukawa 1987. Parallel logic programming languages. Third International Conference on Logic Programming, LNCS 225.
11. E. Shapiro 1986. Concurrent Prolog: a progress report. IEEE Computer, 19(8), 44-58.
12. E. W. Dijkstra, W. H. J. Feijen & A. J. M. van Gasteren 1983. Derivation of a termination detection algorithm for distributed computations. Information Processing Letters, 16.
13. A. Burns 1988. Programming in OCCAM2. Addison-Wesley.
14. C. A. R. Hoare 1978. Communicating sequential processes. Communications of the ACM, 21, 666-677.
15. D. Weinbaum & E. Shapiro 1987. Hardware description and simulation using Concurrent Prolog. Proc. 1987 CHDL, 9-27.
16. Par.C User Manual 1989. Partec Inc.
17. D. H. D. Warren 1983. An Abstract Prolog Instruction Set. Technical Note 309, Artificial Intelligence Center, SRI.

U. Glässer received his diploma in computer science from the University of Paderborn in 1987. Since then he has been working as a research assistant in the Department of Computer Science at the University of Paderborn. His current research activities include parallel logic programming languages, decentralized system architectures, and fault-tolerant computing.

Georg Lehrenfeld studies computer science at the University of Paderborn. His research interests cover parallel algorithms, parallel programming languages, and automata theory.

Norbert Vieth studied computer science at the University of Paderborn and received his diploma in computer science in 1989. He now works at a software development center, Dr. Materna GmbH, in Dortmund. His research interests are in the field of parallel logic programming.