Parallel Computing 25 (1999) 321–336

OASys: An AND/OR parallel logic programming system

I. Vlahavas a,*, P. Kefalas b,1, C. Halatsis c,2

a Department of Informatics, Aristotle University of Thessaloniki, 54006 Thessaloniki, Greece
b Department of Computer Science, City College, Affiliated Institution of the University of Sheffield, 54024 Thessaloniki, Greece
c Department of Informatics, University of Athens, 15771 Athens, Greece

Received 28 February 1997; received in revised form 6 June 1998; accepted 30 July 1998

Abstract

The OASys (Or/And SYStem) is a software implementation designed for AND/OR-parallel execution of logic programs. In order to combine these two types of parallelism, OASys considers each alternative path as a totally independent computation (leading to OR-parallelism) which consists of a conjunction of determinate subgoals (leading to AND-parallelism). This computation model is motivated by the need for the elimination of communication between processing elements (PEs). OASys aims towards a distributed memory architecture in which the PEs performing the OR-parallel computation possess their own address space, while other simple processing units are assigned the AND-parallel computation and share the same address space. OASys execution is based on distributed scheduling which allows either recomputation of paths or environment copying. We discuss in detail the OASys execution scheme and demonstrate OASys effectiveness by presenting the results obtained by a prototype implementation running on a network of workstations. The results show that the speedup obtained by AND/OR-parallelism is greater than the speedups obtained by exploiting AND- or OR-parallelism alone. In addition, comparative performance measurements show that copying has a minor advantage over recomputation. © 1999 Elsevier Science B.V. All rights reserved.

* Corresponding author. Fax: 3031-998419; e-mail: [email protected]; URL: http://www.csd.auth.gr/teachers/vlahavas.html
1 Fax: +3031-287564; e-mail: [email protected]
2 Fax: +301-7219561; e-mail: [email protected]


Keywords: Parallel Prolog; AND/OR-parallelism; Copying; Recomputation

1. Introduction

Exploitation of parallelism in logic programs has drawn much attention in the logic programming community. The aim is to exploit the inherent parallelism embodied in the declarative semantics of logic programs. Various methods have been implemented to enable parallel execution of Prolog programs without affecting the language semantics.

The procedural semantics of sequential Prolog is described by a fixed execution sequence of the program clauses, i.e. a top-down, left-to-right traversal of the AND/OR tree built while trying to reduce a given query. The Prolog execution strategy uses chronological backtracking in order to explore all possible ways of reducing a goal. The principal idea behind parallel implementations of Prolog is to traverse the execution AND/OR tree in a parallel mode. Alternative paths are independent since they represent different ways of solving a goal, and can therefore be searched simultaneously (OR-parallelism). In order to solve a conjunction of goals, the different paths starting from each goal can be searched independently (AND-parallelism), as long as there exists a reconciliation method to resolve conflicting bindings.

During the last decade, research has mainly concentrated on exploiting either OR-parallelism [1,14,16] or various forms of AND-parallelism [11,13,17], reporting promising results. In practice, the implementation of a full AND/OR Prolog system is too complex to be efficient. There have been efforts to combine AND- and OR-parallelism in such a way that each type of parallelism is exploited only when it is efficient to do so [25].

This paper describes a different approach to implementing full AND/OR-parallelism in an experimental system called OASys (Or/And SYStem). OASys efficiently implements OR-parallelism by viewing the search space as many independent and deterministic computations which rarely need to communicate. In addition, this feature provides the ability to exploit the deterministic paths of the proof in an AND-parallel way. OASys aims towards an architecture in which the processing elements (PEs) performing the OR-parallel computation possess their own address space, while other simple processing units are assigned the AND-parallel computation and share the same address space. Communication is limited to scheduling operations, i.e. the allocation of work to available PEs. The latter is accomplished in two different ways: recomputing the alternative path or copying the computation state.

In the next section, we briefly describe the problems of parallelising logic programs as well as the main approaches to implementing AND/OR-parallelism. Section 3 presents the OASys computation model and Section 4 its proposed architecture. Section 5 is devoted to performance results. Finally, we conclude with Section 6, in which we outline the features of the proposed model and the objectives of our further work.


2. Implementing parallelism in logic programming

The main burden in implementing OR-parallelism is the handling of the multiple bindings of the same variable produced by different paths. A variety of methods have been implemented:

· Processors share the same address space while a binding method records the different bindings in an appropriately defined data structure [5,24]. In shared environment schemes, the reported overhead is small with respect to sequential Prolog execution. Despite their complexity, they are adopted in many successful implementations [4,14].

· Processors are independent and have their own copy of the working environment. The main advantage of this approach is locality of reference. However, there is an extra overhead due to environment copying between parent and child processes, although in some cases this proves to be faster than a shared binding scheme [2].

· Processors view the search space as totally independent computations. Therefore each processor starts computing each path from the root in a sequential Prolog-like manner, without having to communicate with the others at all. This implies recomputation of certain parts of the search space, but avoids the need either to share variable bindings or to copy environments. It is argued that this method can perform well in highly non-deterministic programs, where the previous two schemes require an excessive amount of communication [3,7,8,12,15]. However, with the recomputation method it is difficult to cope with the cut and side-effect predicates of Prolog.

In AND-parallelism, the problem has been the handling of shared variables between subgoals. The main approaches to implementing AND-parallel systems have been the following:

· Processors work in a producer–consumer way, communicating through the shared variables. This approach, called Stream AND-parallelism, is adopted by the Committed Choice Non-Deterministic languages [6,18]. However, the language issues introduced for the sake of efficiency sacrifice the semantics of Prolog to a great extent.

· Processors work in parallel only when the subgoals do not share variables at all (independent AND-parallelism) [13]. It is, however, difficult to establish the independence of variables in goals, and extra overhead is introduced by the static or dynamic analysis needed to identify such goals.

· Processors work in parallel even if they share variables (dependent AND-parallelism) [17]. The main overhead introduced in this case is the reconciliation of bindings produced by different processes.

Full AND/OR-parallelism is difficult to implement due to the overheads introduced by both types of parallelism. In practice, it is more efficient to combine OR-parallelism with independent AND-parallelism [3,4,9]. In Andorra-I [25], a substantial effort has been made to fully combine the two types. The main idea is to delay the execution of non-determinate goals while performing the available determinate computation in an AND-parallel way. When there is no more determinate computation, i.e. all computations are suspended, the leftmost goal is reduced in an OR-parallel way.

As in Andorra-I, OASys exploits both OR-parallelism and dependent AND-parallelism. However, OASys does not require any determinacy checking, since the search space is considered as independent and determinate OR-parallel branches. This feature allows computation to proceed freely without suspending any non-determinate goals. The OASys approach leads to a different and much simpler implementation aiming towards higher performance.

Our previous work [19] proposed a formal model for the resolution of AND/OR-parallelism and suggested its implementation and abstract architecture [20]. In OASys, we improve the execution scheme and eliminate the problems due to centralised control by providing efficient distributed scheduling. In addition, OASys supports two alternative ways of distributing work, namely copying and recomputation, which are determined at compile time.

3. OASys computational model

In Prolog, the search space is normally represented by an AND/OR tree, where AND nodes represent the body goals of a clause and OR nodes represent the alternative clauses matching a body goal. In OASys, the search space is represented by a tree whose nodes, called U-nodes, denote the unifications between the calling predicates and the heads of the corresponding procedures. A link between two U-nodes indicates a sequence of procedure calls as specified by the program. A link is called successful if the unifications performed in the two U-nodes produce consistent bindings, and unsuccessful otherwise. Fig. 1(a) shows a Prolog program and its corresponding AND/OR tree, while Fig. 1(b) shows the OASys search tree of the same Prolog program.

In Prolog, a node in the execution tree has already produced the variable bindings through unification. In OASys, unifications in the U-nodes are in progress while new U-nodes are continuously produced, i.e. the U-nodes of a path are produced and executed in an asynchronous and concurrent way. This has the effect that a path containing U-nodes may be speculative, i.e. some of the U-nodes generated may not have existed in the corresponding sequential Prolog execution.

OASys supports OR-parallelism by simultaneously searching different paths of the search tree. It also supports AND-parallelism by executing in parallel the U-nodes belonging to the same path. Execution of a path terminates when:
· all links are successful (a solution is found), or
· one link is unsuccessful (failure in conjunction), or
· unification in a U-node fails (mismatch).

Fig. 2 shows the links generated for the paths of the search tree of Fig. 1 and the most general unifiers found for every unification in U-nodes. The same figure presents the potential for exploiting AND/OR-parallelism in OASys by viewing the same tree as three independent computations. The grey shaded areas include U-nodes which also exist in other paths; their meaning is discussed later.
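The three termination conditions translate directly into a test over the nodes of a path. The following C fragment is a minimal sketch of one possible U-node representation and path-status check; the type and field names are our own illustration and are not taken from the OASys sources.

#include <stdbool.h>
#include <stddef.h>

typedef enum { LINK_PENDING, LINK_SUCCESSFUL, LINK_UNSUCCESSFUL } link_status;
typedef enum { U_IN_PROGRESS, U_UNIFIED, U_MISMATCH } unode_status;

typedef struct unode {
    unode_status  status;  /* state of the unification in this node      */
    link_status   link;    /* consistency of bindings with the next node */
    struct unode *next;    /* successor U-node along the same path       */
} unode;

typedef enum { PATH_RUNNING, PATH_SOLUTION, PATH_FAILURE } path_result;

/* A path terminates when all links are successful (solution found), one
 * link is unsuccessful (failure in conjunction), or a unification in a
 * U-node fails (mismatch); otherwise it is still being computed.        */
path_result path_state(const unode *n)
{
    bool all_done = true;
    for (; n != NULL; n = n->next) {
        if (n->status == U_MISMATCH || n->link == LINK_UNSUCCESSFUL)
            return PATH_FAILURE;
        if (n->status != U_UNIFIED ||
            (n->next != NULL && n->link != LINK_SUCCESSFUL))
            all_done = false;   /* some unification or link still pending */
    }
    return all_done ? PATH_SOLUTION : PATH_RUNNING;
}

Note how the check never blocks: because U-nodes are produced and executed asynchronously, a path may be reported as running (and hence speculative) long before all its unifications have completed.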


Fig. 1. Representation of search trees in Prolog and OASys.

Fig. 2. Nodes and links in OASys search tree.



4. The OASys architecture

Following the above computation model, OASys can be implemented by employing a group of PEs. Each PE is a shared-memory multiprocessor and consists of three main parts: a preprocessor, a scheduler and an engine (Fig. 3). In the following sections, we present the compilation process and the execution scheme, and we discuss some implementation issues.

4.1. Compilation

The source Prolog program is compiled into two codes, namely the Compiled Prolog (CP) code and the Compiled Clauses Dependency (CCD) code [21]. The CP code represents the definitions of the clauses and, during the startup phase, is distributed to all PE engines. The CP code is broadly based on the APIM [20]; it includes only one unify instruction and provides a mechanism to implement arithmetic and other built-in operations. The CCD code is produced by abstract interpretation of the Prolog program and, during the startup phase, is distributed to all PE preprocessors. It maps each Prolog clause to its actual address in the CP code together with the index table entry for every body subgoal. For each procedure in the program, there exists an index table which contains the addresses of the clauses of that procedure as well as the data types of the first argument of their head predicate. The preprocessor executes the CCD code and, in combination with the index tables, generates choices when there is more than one clause in the definition of a procedure.

Each Prolog clause:

    Hi :- B1, B2, ..., Bn

is represented in the CCD code notation as:

    Ci: T1(ArgType1), T2(ArgType2), ..., Tn(ArgTypen)

where Ci is the address of the clause in the CCD code used to derive the actual address of the clause in the CP code, Tj is the address of the index table of Bj, and ArgTypej is the data type of the first argument of Bj.
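To make this representation concrete, the following C fragment sketches one possible encoding of a CCD call and the index-table search that generates choices. The structures and the variable-matching rule are illustrative assumptions, not the actual OASys data layout.

typedef enum { TYPE_VAR, TYPE_ATOM, TYPE_INT, TYPE_LIST } arg_type;

typedef struct {           /* one body call Tj(ArgTypej) in a CCD clause  */
    int      index_table;  /* address of the index table of subgoal Bj    */
    arg_type first_arg;    /* data type of the first argument of Bj       */
} ccd_call;

typedef struct {           /* one entry of a procedure's index table      */
    int      cp_address;   /* address of the candidate clause in CP code  */
    arg_type head_arg;     /* type of the first argument of the head      */
} index_entry;

/* Scan an index table for clauses compatible with a call; a variable
 * matches any type. One result means a deterministic call; several
 * results mean a choice point for the scheduler to consider.            */
int matching_clauses(const index_entry *table, int entries,
                     arg_type call_arg, int *out, int max_out)
{
    int found = 0;
    for (int i = 0; i < entries && found < max_out; i++)
        if (call_arg == TYPE_VAR || table[i].head_arg == TYPE_VAR ||
            table[i].head_arg == call_arg)
            out[found++] = table[i].cp_address;
    return found;
}

Under this reading, a lookup that returns a single address corresponds to a deterministic call, while several addresses constitute a choice point whose fate is decided by the scheduler, as described in Section 4.2.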

Fig. 3. Overview of the OASys architecture.


The CCD code for the N queens program, together with the graph representation of the CCD code, is given in Appendix A.

4.2. OASys execution scheme

Each PE passes through three phases of operation: preprocessing, scheduling and execution. A PE starts generating choices by executing the CCD code (preprocessing phase) and passes them to the scheduler, which decides whether available work could be shared with other PEs (scheduling phase). The PE's engine works concurrently by consuming the choices passed from the scheduler. The sequence of choices is used as directives in order to construct the sequence of U-nodes, which in turn are assigned to special processing units (AND Processing Units, or APUs) working in parallel (execution phase). The two types of parallelism supported are handled as follows:

· OR-parallelism: The different paths of the search tree may be distributed to idle PEs. The scheduler decides whether it will assign the alternative paths to other PEs (OR-parallel execution) or keep them for its own engine (sequential execution).

· AND-parallelism: The engine executes the U-nodes of a single path in parallel, i.e. it performs the unifications within a U-node by assigning each node to an APU. This is in effect AND-parallelism, because the U-nodes in the same path form the conjunction of predicate calls which may solve the query.

These two types of parallelism are depicted in Fig. 2. The three different paths are allocated to three distinct PEs, while each U-node of the same path is allocated to a different APU.

4.2.1. The preprocessor

The preprocessor executes the CCD code, thus generating a sequence of choice points along a path which are then passed to the engine. As shown previously, a call to a predicate is represented by an index table address together with the type of its first argument. Execution of the call is actually a search in the corresponding index table to find a clause with a matching data type. If only one clause matches, the call is deterministic. Otherwise, there is a choice point and one of the alternative addresses is sent to the engine, whereas the fate of the others is decided by the scheduler. Note that the preprocessor in OASys finds the sequencing of calls, thus facilitating the indexing operation at run time. In this respect, preprocessing differs from the compile-time preprocessing of Andorra-I, which performs determinacy checking.

4.2.2. The scheduler

At any time, each PE scheduler is aware of the availability of work. This is achieved by maintaining a table of PE-ids which is updated by broadcasting a single byte indicating whether a PE has unexplored paths to share or no more available work. More analytically, when a PE completes the execution of its current path, either successfully or not, it backtracks to its local choice points, if any. When there is no more available work in a PE, it is declared idle and tries to find work by consulting its table. It chooses an entry with available work and makes a request. The PE which is asked to give work:
· is locked against other requests,
· simulates failure (dummy backtracking) up to the alternative nearest to the root,
· transmits the appropriate information, and
· unlocks and continues executing its current path.

A PE's request for work might not be answered by another PE if the latter is locked. Instead of busy waiting, the scheduler tries another entry from its table. The current implementation supports two possible ways of distributing work, which are determined at compile time. The PE with available work may either:
· send the links followed from the root to the available node and force the PE which issued the request to recompute that specific path and continue execution with one alternative, or
· copy the computation state of the available node and let the PE which issued the request continue execution with one alternative.

The effect of both techniques is illustrated in Fig. 2, where recomputation includes the portions of paths in the grey shaded areas, while copying is concerned only with the portions of paths outside those areas.

Scheduling is a crucial issue and could be optimised in various ways. For the moment, work that is nearest to the root has priority for distribution; this minimizes the amount of information transmitted. Transmission could be reduced further by pre-compilation detection of small chunks of work or by employing user annotations to indicate parts of the program that should be kept private to a PE. In addition, we are currently experimenting with a more sophisticated scheduler which can decide at run time whether copying or recomputation is more profitable on a specific occasion, taking into consideration the various factors that affect one or the other, such as the amount of information to be transmitted, the estimated communication cost, the recomputation overhead, etc. A sketch of the request protocol is given below.
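This reconstruction is hedged: the pe_entry table, try_lock() and the transfer routines are hypothetical names standing in for mechanisms the paper describes only in prose, not the actual OASys scheduler code.

#include <stdbool.h>

typedef enum { BY_RECOMPUTATION, BY_COPYING } share_mode; /* fixed at compile time */

typedef struct { int pe_id; bool has_work; } pe_entry;

/* Stubs assumed to be provided by the runtime (illustrative only). */
extern bool try_lock(int pe_id);           /* fails if the PE is already locked */
extern void unlock(int pe_id);
extern void dummy_backtrack_to_topmost(int pe_id); /* nearest-to-root choice    */
extern void send_path_links(int from, int to);     /* recomputation variant     */
extern void copy_state(int from, int to);          /* copying variant           */

/* An idle PE consults its table and asks a busy PE for work; if that PE
 * is locked, it simply tries another entry instead of busy-waiting.     */
bool request_work(int self, pe_entry *table, int n, share_mode mode)
{
    for (int i = 0; i < n; i++) {
        if (!table[i].has_work || table[i].pe_id == self) continue;
        if (!try_lock(table[i].pe_id)) continue;   /* locked: try another */
        dummy_backtrack_to_topmost(table[i].pe_id);
        if (mode == BY_RECOMPUTATION)
            send_path_links(table[i].pe_id, self); /* replay links from root   */
        else
            copy_state(table[i].pe_id, self);      /* ship the computation state */
        unlock(table[i].pe_id);
        return true;
    }
    return false;   /* no work found this round */
}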
4.2.3. The engine

The engine consists of a work pool, a control unit and a number of special processing units (APUs) which operate in parallel sharing a common memory (Fig. 4). The work pool accumulates the choices supplied by the preprocessor. The control unit executes the machine instructions, constructs U-nodes taking directives from the work pool, and distributes them to the APUs. The APUs receive the addresses of a calling predicate and the head of the corresponding procedure, and perform the unification operation efficiently.

Fig. 4. The OASys engine configuration.

Parallel execution of APUs requires synchronised access to unbound variables which may also be accessed by other APUs. This does not add considerable complexity since, according to the execution algorithm, unbound variables may be bound in any order, either left to right as in Prolog, or right to left. The mechanism used for variable bindings is a lock-test and set operation. According to this, whenever an APU tries to bind a variable, it initially tests whether the variable is still unbound. If so, the variable is locked and the APU proceeds by setting a value. Otherwise, the variable binding is retrieved and a consistency check takes place. This step is sketched in the code below.

Heap writing operations are also synchronised, by setting a lock on the starting address of the space available for writing. An APU cannot allocate space on the heap until a heap-writing operation already in progress by another APU has completed. However, access operations are free up to the locked address. A similar problem has been addressed in Ref. [10], which proposes that simple processing units configured as a systolic array could be used to speed up parallel logic inferences.
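The following C11 fragment illustrates one way an APU could perform the lock-test and set binding step; the pvar representation and consistent() are assumptions made for this example, not OASys code.

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    atomic_flag lock;   /* per-variable lock (initialise with ATOMIC_FLAG_INIT) */
    void *value;        /* NULL while the variable is unbound                   */
} pvar;

extern bool consistent(void *a, void *b);  /* binding consistency check (stub) */

bool apu_bind(pvar *v, void *val)
{
    while (atomic_flag_test_and_set(&v->lock))  /* lock-test and set */
        ;                                       /* spin until free   */
    bool ok;
    if (v->value == NULL) {        /* still unbound: bind it          */
        v->value = val;
        ok = true;
    } else {                       /* already bound: retrieve binding */
        ok = consistent(v->value, val);  /* and check consistency     */
    }
    atomic_flag_clear(&v->lock);   /* release the variable */
    return ok;
}

Because unbound variables may be bound in any order, this per-variable lock is all the synchronisation the binding step needs; no global ordering between APUs is required.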
4.3. Implementation issues

The execution scheme described above adapts quite naturally to a hybrid multiprocessor in which parts of the address space are shared among subsets of processors, as, for example, in a distributed (loosely coupled) system containing multiple shared-memory multiprocessors. In such a system, each shared-memory multiprocessor would be a natural host for a team of processors sharing a common address space. OASys can also be considered a SPMD (Single Program Multiple Data) machine, its processors executing the same program on different parts of the data, each proceeding independently of the others inside its own program.

Besides the above-mentioned ideal architecture, OASys could also be implemented as:
· A network of multiprocessor workstations, each one of which corresponds to a PE while its constituent processors correspond to the various components of the PE.
· A message-passing scalable multiprocessor architecture that provides a shared virtual address space on essentially physically distributed hardware [23]. The address space is truly shared, but it is also virtual in the sense that memory is physically distributed across different processors.

OASys could also be emulated on a shared-memory multiprocessor machine by grouping processors into teams such that each team corresponds to a PE and each processor of the team corresponds to the various components of the PE. The memory should be partitioned into as many independent areas as the number of teams. All processors in a team will share the same logical address space in each area.
5. Performance results

A prototype system for OASys has been developed in C and runs on a network of Sun workstations. We executed a number of benchmark programs, and the results are in accordance with those obtained by a simulation of the system [22]. We have chosen to run the following series of Prolog programs:
· queens8: the 8-queens problem,
· ham: the search for Hamiltonian paths in a graph,
· zebra: the "who owns the zebra" problem,
· merge500: mergesort of a list of 500 elements, and
· nrev500: naive reverse of a list of 500 elements.

The last two of the above programs exploit only AND-parallelism, while the rest exploit both types of parallelism.

5.1. Relative speed performance

There are two versions of the prototype system, the sequential and the parallel OASys. In sequential OASys, several aspects concerning synchronization, communication and scheduling have been omitted. Table 1 compares the performance of sequential OASys, and of parallel OASys employing one PE, with other well-known implementations of Prolog, namely sequential Andorra-I, Eclipse and Sicstus. All times shown are actual CPU times in seconds, obtained on the same machine (Sun Enterprise 3000). The runtimes of Andorra-I were obtained with preprocessing for determinacy checking.

Table 1
Runtimes (in seconds) for OASys and other Prolog systems

Program     Sequential OASys   Parallel OASys   Sequential Andorra-I   Eclipse   Sicstus
queens8     1.11               1.44             1.68                   0.41      0.02
hamilton    8.21               9.07             9.37                   0.84      0.29
zebra       20.01              25.69            27.11                  2.52      1.21
merge500    0.23               0.32             0.45                   0.07      0.01
nrev500     1.18               1.25             1.03                   0.25      0.04

5.2. OR-parallel performance

Table 2 shows the speedups measured for 1–10 PEs, each with a single APU, using a network with the same number of workstations as the number of PEs.

Table 2
OR-parallel speedups

        ham             queens          zebra
PEs     Copy    Recom.  Copy    Recom.  Copy    Recom.
1       1.0     1.0     1.0     1.0     1.0     1.0
2       1.5     1.5     1.6     1.6     1.7     1.7
3       1.9     1.9     2.3     2.1     2.4     2.4
5       3.1     2.9     3.4     3.2     3.8     3.7
8       5.0     4.8     5.0     4.9     5.6     5.7
10      6.2     6.0     6.1     5.7     6.5     6.5

The speedups concern only the OR-parallelism exploited by the benchmark programs. Since the programs are non-deterministic, the speedup is close to linear, as expected. It is worth noticing that the performance of recomputation and copying is similar for a small number of PEs, with copying taking precedence as the number of PEs increases. Although the data transfer in recomputation is 4–7 times less than in copying, the above results are explained by the fact that the transfer time in copying is much less than the re-execution time in recomputation. The overall amount of data that needs to be transferred in copying produces a small overhead, since current networks are capable of transferring large amounts of data within a very short time. In addition, copying is efficiently implemented, since each PE has an identical but independent logical address space and the stack portions are copied without relocating any pointers. Bearing in mind that copying is profitable when work is allocated lower in the tree, whereas recomputation is more profitable closer to the root, a hybrid scheduling scheme deciding which method is more appropriate on any given occasion is expected to result in better actual performance.

5.3. AND-parallel performance

The current prototype system implements AND-parallelism by creating the same number of processes as the number of required APUs on each workstation. Having measured the actual timings, AND-parallel performance is estimated by taking into account the following factors:
· the parallel overhead due to variable locking over sequential execution with one APU,
· the overhead due to the book-keeping operations of non-determinate nodes, which demand more processing than deterministic nodes, and
· the overhead of allocating unifications to APUs, which increases in direct proportion to the number of APUs.

Table 3 shows the estimated speedups for the same benchmark programs, referring to a single PE with 1–10 APUs. Some of the programs, like nrev500 and merge500, show good speedup since they are purely deterministic. Others, like ham and zebra, have long deterministic paths between choice points, and AND-parallelism can be exploited without much overhead.


Table 3
AND-parallel speedups

APUs    ham     queens8  zebra   nrev500  merge500
1       1.0     1.0      1.0     1.0      1.0
2       1.3     1.3      1.2     1.9      1.7
4       2.4     2.4      2.9     3.4      3.1
6       3.3     2.6      3.1     4.5      4.2
8       4.1     2.5      3.9     5.6      5.9
10      4.7     2.4      4.5     6.5      6.0

Speedup in queens8 is low and flattens as the number of APUs increases. This is because the search space has short deterministic paths between choice points, and therefore more APUs do not contribute to the exploitation of AND-parallelism. On the other hand, for all programs, we do not expect better performance with more than 10 APUs, since the preprocessor's total execution time is 8–15% of the total execution time of an engine with one APU. As the number of APUs increases, this percentage increases until the two execution times become roughly equal, and the total execution time of a PE approaches the execution time of the preprocessor, i.e. a constant.
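This limit can be made explicit with a rough Amdahl-style estimate (our own reading of the 8–15% figure, not a formula from the paper). If the preprocessor accounts for a fixed fraction f of the total one-APU execution time and only the remaining fraction is divided among k APUs, then

    T(k) ≈ f·T(1) + (1 − f)·T(1)/k

so the speedup T(1)/T(k) tends to 1/f as k grows; for f between 0.08 and 0.15 this gives a ceiling of roughly 7–12, consistent with the saturation seen in Table 3.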
5.4. AND/OR-parallel performance

Results of combining both types of parallelism are shown in Table 4 for two representative programs. The results were obtained by running queens8 and ham with different numbers of PEs, each configured with a fixed number of APUs at every run (1–6). The highest performance is reached with 10 PEs of six APUs each (i.e. 60 APUs in total).
Table 4
AND/OR-parallel speedups of the queens8 and hamilton path problems

        queens8                          ham
PEs     1 APU   2 APUs  4 APUs  6 APUs   1 APU   2 APUs  4 APUs  6 APUs
1       1.0     1.3     2.4     2.6      1.0     1.3     2.4     3.3
2       1.6     2.1     3.9     4.1      1.5     2.0     3.6     5.0
3       2.3     3.0     5.5     5.9      1.9     2.5     4.6     6.3
5       3.4     4.5     8.2     8.8      3.1     4.0     7.5     10.3
8       5.5     6.6     12.0    12.9     5.0     6.6     12.0    16.6
10      6.1     8.0     14.7    15.7     6.2     8.2     14.9    20.6

6. Conclusions and future work

In this paper we have described the OASys model for implementing full AND/OR-parallelism. OASys achieves AND/OR-parallelism by assigning each path of the search tree to a different PE while, at the same time, assigning each node of an individual path to a different processing unit (APU).

The above scheme adopts a modular design which is easily expandable and highly distributed, thus limiting communication to scheduling operations only. Allocation of work to available PEs can be achieved by either copying or recomputing. The OASys software architecture aims towards a distributed system in which the PEs rarely need to communicate, and it could be implemented in various ways. An ideal architecture, however, would be a hybrid multiprocessor containing multiple shared-memory multiprocessors connected by a message-passing network.

We have implemented a software prototype in order to investigate the model's performance. By running a series of example programs, we showed that the two scheduling methods perform similarly, with a minor precedence for copying. In both cases, the speedup obtained is linear in the number of PEs. However, there is an upper bound on speeding up the execution, which is determined by the ratio of the time required to generate the alternative paths to the time for performing the deterministic computations on these paths.

We are currently working on an OASys implementation which integrates both scheduling methods during the execution of a program. This approach aims towards a more sophisticated scheduler, which will decide on any occasion:
· which of the two methods (copying or recomputation) to use, depending on the estimated overhead,
· whether it is more profitable to keep an alternative path for sequential execution rather than sharing it with other PEs, depending on the estimated amount of work in this path, and
· which of the available alternative paths to send, depending on the estimated relative profit of sharing one path compared with the profit of sharing any other.

We believe that performance will increase, since the overhead from communication will be further optimised.

Acknowledgements

We would like to thank D. Xochellis, D. Benis, C. Berberidis and E. Sakellariou for their help with the prototype and the collection of performance data. Many thanks to Rong Yang for the help with Andorra-I, and to the anonymous referees.


Appendix A

Consider the following Prolog program for N queens:

queens([],R,R).                              % C1
queens([H|T],R,P) :-                         % C2
    delete(A,[H|T],L),
    safe(R,A,1),
    queens(L,[A|R],P).
delete(X,[X|Y],Y).                           % C3
delete(X,[Y|Z],[Y|W]) :- delete(X,Z,W).      % C4
safe([],_,_).                                % C5
safe([H|T],U,D) :-                           % C6
    H-U =\= D,
    U-H =\= D,
    D1 is D+1,
    safe(T,U,D1).

Given the query for four queens:

?- queens([1,2,3,4],K,M).                    % Query

The CCD code, the index tables and the graph representation for the CCD code are given below:

C1: true.
C2: T2(var), T3(var), T1(var).
C3: true.
C4: T2(var).
C5: true.
C6: T3(var).
Query: T1(list).


References

[1] K. Ali, R. Karlsson, The Muse OR-parallel Prolog model and its performance, in: Proceedings of the 1990 North American Conference on Logic Programming, Austin, USA, October 1990, pp. 129–162.
[2] K. Ali, R. Karlsson, Scheduling speculative work in Muse and performance results, Int. J. Parallel Programming 21 (6) (1992) 449–476.
[3] L. Araujo, J.J. Ruz, PDP: Prolog distributed processor for independent AND/OR parallel execution of Prolog, in: P. Van Hentenryck (Ed.), Proceedings of the 11th ICLP, MIT Press, 1994, pp. 142–156.
[4] U. Baron, J.C. de Kergommeaux, M. Hailperin, M. Ratcliffe, P. Robert, J. Syre, H. Westphal, The parallel ECRC Prolog system PEPSys: an overview and evaluation results, in: Proceedings of the Future Generation Computer Systems '88 Conference, Tokyo, 1988, pp. 841–849.
[5] A. Ciepielewski, S. Haridi, A formal model for OR-parallel execution of logic programs, in: Information Processing 83, IFIP, 1983, pp. 299–305.
[6] K.L. Clark, S. Gregory, Parlog: parallel programming in logic, ACM Transactions on Programming Languages and Systems 8 (1) (1986) 1–49.
[7] W.F. Clocksin, Principles of the DelPhi parallel inference machine, The Computer Journal 30 (5) (1987) 386–392.
[8] G. Gupta, M. Hermenegildo, AND–OR parallel Prolog: a recomputation based approach, New Generation Computing 11 (1993) 297–321.
[9] G. Gupta, M. Hermenegildo, E. Pontelli, V. Santos Costa, ACE: And/Or-parallel copying-based execution of logic programs, in: P. Van Hentenryck (Ed.), Proceedings of the 11th ICLP, MIT Press, 1994, pp. 93–109.
[10] Y.F. Han, D.J. Evans, Parallel inference algorithms for the connection method on systolic arrays, Int. J. Computer Math. 53 (1994) 177–188.
[11] M. Hermenegildo, K. Greene, &-Prolog and its performance: exploiting independent AND-parallelism, in: Proceedings of the International Conference on Logic Programming ICLP 90, MIT Press, 1990, pp. 253–268.
[12] Z. Lin, Self-organizing task scheduling for parallel execution of logic programs, in: Proceedings of the International Conference on Fifth Generation Computer Systems, ACM, 1992, pp. 859–868.
[13] Y. Lin, V. Kumar, AND-parallel execution of logic programs on a shared-memory multiprocessor, J. Logic Programming 10 (1991) 155–178.
[14] E. Lusk, R. Butler, T. Disz, R. Olson, R. Overbeek, R. Stevens, D.H.D. Warren, A. Calderwood, P. Szeredi, S. Haridi, P. Brand, M. Carlsson, A. Ciepielewski, B. Hausman, The Aurora OR-parallel Prolog system, New Generation Computing 7 (2/3) (1990) 243–271.
[15] S. Mudambi, J. Schimpf, Parallel CLP on heterogeneous networks, in: P. Van Hentenryck (Ed.), Proceedings of the 11th ICLP, MIT Press, 1994, pp. 124–141.
[16] T.J. Reynolds, P. Kefalas, OR-parallel Prolog and search problems in AI applications, in: D.H.D. Warren, P. Szeredi (Eds.), Proceedings of the Seventh International Conference on Logic Programming, Jerusalem, 1990, pp. 340–354.
[17] K. Shen, Exploiting AND-parallelism in Prolog: the dynamic dependent AND-parallel scheme (DDAS), in: Proceedings of the Joint International Conference and Symposium on Logic Programming, 1992, pp. 717–731.
[18] K. Ueda, Introduction to Guarded Horn Clauses, ICOT Research Center, TR-209, Tokyo, 1986.
[19] I. Vlahavas, P. Kefalas, A parallel Prolog resolution based on multiple unifications, Parallel Computing 18 (1992) 1275–1283.
[20] I. Vlahavas, P. Kefalas, The AND/OR parallel Prolog machine APIM: execution model and abstract design, The J. Programming Languages 1 (1993) 245–261.
[21] I. Vlahavas, Exploiting AND–OR parallelism in Prolog: the OASys computational model and abstract architecture, J. Systems Software 43 (1998) 45–57.
[22] I. Vlahavas, P. Kefalas, I. Sakellariou, C. Halatsis, The basic OASys model: preliminary results, in: Proceedings of the Sixth National Conference on Informatics, Greek Computer Society, Athens, 1997, pp. 723–731.


[23] D.H.D. Warren, S. Haridi, Data diffusion machine – a scalable shared virtual memory multiprocessor, in: Proceedings of the International Conference on Fifth Generation Computer Systems, 1988, pp. 943–952.
[24] D.S. Warren, Efficient Prolog memory management for flexible control strategy, in: Proceedings of the International Symposium on Logic Programming, 1984, pp. 198–202.
[25] R. Yang, T. Beaumont, V. Santos Costa, D.H.D. Warren, Performance of the compiler-based Andorra-I system, in: D.S. Warren (Ed.), Proceedings of the 10th International Conference on Logic Programming, MIT Press, 1993, pp. 150–166.