
JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING 6, 90-114 (1989)

Distributed Game-Tree Searching*

JONATHAN SCHAEFFER

Computing Science Department, University of Alberta, Edmonton, Alberta, Canada T6G 2H1

Received January 29, 1987

Conventional parallelizations of the alpha-beta (αβ) algorithm have met with limited success. Implementations suffer primarily from the synchronization and search overheads of parallelization. This paper describes a parallel αβ searching program that achieves high performance through the use of four different types of processes: Controllers, Searchers, Table Managers, and Scouts. Synchronization is reduced by having all Searchers apply the PVSplit algorithm on the subtrees they search and having a Controller process reassign idle processes to help out busy ones. Search overhead is reduced by having two types of parallel table management: global Table Managers and the periodic merging and redistribution of local tables. Experiments show that nine processors can achieve 5.67-fold speedups but, beyond that, additional processors provide diminishing returns. Given that additional resources are of little benefit, speculative computing is introduced as a means of extending the effective number of processors that can be utilized. Scout processes speculatively search ahead in the tree looking for interesting features and communicate this information back to the αβ program. In this way, the effective search depth is extended. These ideas have been tested experimentally and empirically as part of the chess program ParaPhoenix. © 1989 Academic Press, Inc.

1. INTRODUCTION

The alpha-beta algorithm (αβ) has proven to be a difficult problem to parallelize. Simulation results have been published for several inventive parallel approaches showing tremendous performance [1, 9]; however, in practice, these expectations remain unfulfilled. Among the many parallel αβ algorithms, PVSplit [10, 12] appears to have emerged as the preferred one for implementation. It is easy to understand, relatively simple to implement, and for small numbers of processors (eight or less) provides good results [11, 12, 14]. The biggest obstacles to its performance appear to be synchronization overhead, the cost incurred by processors falling idle, and search overhead, the consequences of information deficiency in a parallel environment [11].

* Funding for this research was provided by the Canadian Natural Sciences and Engineering Research Council under Grants A7902 and A8173.



Beyond a handful of processors, performance rapidly tapers off to the point where additional computing resources have a negative effect. With networks of tens or even hundreds of processors available, the problem remains how best to implement this algorithm in a parallel environment.

A parallel αβ searching program is described that runs on a network of workstations and achieves high performance by using four different types of processes: Controllers, Searchers, Table Managers, and Scouts. PVSplit can be viewed as having one Controller process that assigns subtrees to be searched (the employer) to an arbitrary number of Searcher processes (employees). The Dynamic PVSplit algorithm has all Searcher processes applying PVSplit on their subtrees. When a process runs out of work to do, the Controller can reassign it to help out busy ones. In this way, a Searcher can be both an employer and an employee. Processor idle time is reduced, but at the expense of increased search effort. Search overhead is reduced by having two types of parallel table management. First, some tables can best be shared by having a global Table Manager process that Searchers can send entry updates and queries to. Second, smaller local tables can be shared by periodic merging and redistribution. Whereas PVSplit's performance appears to level off at a speedup of roughly 5, experiments with the new algorithms show that nine processors can achieve 5.67-fold speedups but, beyond that, additional processors provide diminishing returns. It appears that the enhancements achieve a constant increase in performance without, however, altering the asymptotic nature of parallel αβ. The speedups are shown to be strongly tied to the efficiency of the search program and the size of tree considered.

Given that there is a point beyond which additional processors cannot be used effectively, the idea of speculative computing is introduced to aid the search. Scout processes are used to scout ahead in the tree looking for interesting features. In much the same way as scouts are sent out to gather intelligence for an army, these processes communicate their information back to the αβ searcher, thereby increasing the effective depth to which the program can see. Scouts are simplified αβ searchers and can be parallelized using the same routines, obtaining comparable parallel performance. Empirical results show them to be a significant enhancement to αβ search programs.

The ideas presented in this paper have been implemented in the computer chess program Phoenix and its parallel counterpart ParaPhoenix. Results are presented from experimentation in a controlled environment as well as empirical data from tournament competition. In the recent World Computer Chess Championships, ParaPhoenix ran on a network of 20 SUN 3/75 workstations and finished in a four-way tie for first place.¹ Ten SUNs were used for each of ParaPhoenix and Minix (the Scouts), with each achieving an estimated 6-fold speedup. Hence 20 SUN computers were able to achieve the performance of 12.

¹ For the tournament, the program was actually named Sun Phoenix.

2. BACKGROUND

2.1. Sequential αβ

The alpha-beta algorithm maintains a search window (α, β), providing lower and upper bounds on the minimax value of each subtree [7]. Figure 1 illustrates the structure of an αβ tree. The value of a circled node is the maximum of its sons; for square nodes, the minimum. At ALL nodes, all descendants must be searched. At CUT nodes, at least one, and possibly all, of the sons must be considered. The first path examined from root to terminal node consists entirely of ALL nodes and is called the principal variation (PV). In the context of game trees, nodes are often called positions and branches moves. The efficiency of the algorithm depends on how work can be cut off at CUT nodes. In the special case where cutoffs occur on the first son examined at all CUT nodes, only the minimal αβ tree remains. For a uniform tree of width w and depth d, the worst-case full minimax search examines w^d nodes, whereas the minimal tree has w^⌈d/2⌉ + w^⌊d/2⌋ - 1. Enhancements to αβ and application-dependent knowledge can be used to order sons of CUT nodes to achieve close to the minimal tree. Among the many αβ enhancements, transposition tables, the history heuristic, and iterative deepening have the largest effect on tree size and are briefly described here. There are others, but their effect is relatively minor [17].

FIG. 1. αβ search tree.
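For concreteness, the following is a minimal negamax-style sketch of the fixed-window αβ search described above, written over an explicit tree of nested lists. It illustrates only the cutoff mechanism; the tree representation and names are invented for the sketch and are not the Phoenix implementation.

```python
INF = float("inf")

def alpha_beta(node, alpha, beta):
    """Negamax alpha-beta over an explicit tree.

    A node is either a number (a terminal value from the side to move's view)
    or a list of child nodes.  The (alpha, beta) window bounds the minimax
    value of the subtree; once a son's score reaches beta, the remaining sons
    are cut off.
    """
    if not isinstance(node, list):       # terminal node
        return node
    best = -INF
    for child in node:                   # son ordering determines how early cutoffs occur
        score = -alpha_beta(child, -beta, -alpha)
        if score > best:
            best = score
            alpha = max(alpha, score)    # narrow the window
        if score >= beta:
            break                        # cutoff at this CUT node
    return best

if __name__ == "__main__":
    # Width-2, depth-2 tree; the second subtree is cut off after its first son.
    print(alpha_beta([[3, 5], [2, 9]], -INF, INF))   # prints 3
```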


Nodes of game trees are not necessarily distinct; identical nodes may be reached by more than one path. Transposition tables [20] attempt to take advantage of this by recording information about each subtree searched in the event that it recurs elsewhere in the tree. This has the potential for building trees smaller than the minimal tree by eliminating subtrees without performing any search.

The history heuristic is an inexpensive means for dynamically ordering successors at an interior node [17]. A move that is best in one position is likely to be best in similar positions. The heuristic remembers which moves have a history of being best and orders the moves at interior nodes based on this information. The information is stored in tables that are usually sparse.

Iterative deepening [6, 20] is a technique used to increase the probability that the best son is searched first at the root of the tree. Instead of immediately starting a depth d (or d-ply) search, a series of staged searches (depth 1, then 2, 3, ..., d) is carried out to successively deeper depths. After each iteration, the sons of the root are sorted on their minimax values from the previous iteration, providing a better ordering for the next iteration. As the search progresses, greater confidence is gained in the root ordering. By itself, iterative deepening is probably not useful, but with transposition tables it has proven to be effective [6].

The above enhancements have been implemented in the chess program Phoenix. A complete description of the search algorithm used can be found in [17]. The program is shown to search, on average, within a factor of 2 of the minimal tree without transposition tables, and less than the minimal tree with the tables.

2.2. Parallel Performance Obstacles

A measure of the degree of parallelism achieved by a program is the speedup, the ratio of the time taken by the best sequential implementation of the algorithm to that taken by the parallel version. The objective is to achieve speedups that increase linearly with the number of processors used. Unfortunately, except for a few trivially parallelizable problems, this is difficult to achieve. To understand the limitations of parallel αβ, it is important to analyze the reasons for the poor performance of its implementations.

Time overhead (TO) is the percentage loss in speedup when using N processors compared to the ideal speedup. This can be expressed as

TO = N × (Time using N CPUs) / (Time using 1 CPU) - 1.

TO is a simple quantitative measure of the total overhead. An interpretation of it is that for every 100 sec of useful computing done (as defined by the sequential program), TO sec is wasted because of parallelization. The major causes for this loss in performance are communication overhead (CO), search overhead (SO), and synchronization overhead (SY). These overheads are related


by TO = CO + SO + SY. There are other costs (such as the overhead of the operating system) which are negligible for most applications.

Communication overhead is the additional burden that a parallel program incurs when it spends time sending messages between processes. The programmer can control this cost by adjusting the size and frequency of messages sent. The communication overhead can be crudely estimated by counting the number of messages sent and multiplying this by the average cost per message.

In the sequential environment, all information is readily available for making decisions. In a distributed environment, that information may be dispersed over several machines or not be available at all, and may result in unnecessary exploration of parts of the tree. This possible increase in the tree size is called information deficiency [21] or search overhead. This loss can be approximated by using the observation that tree size is proportional to the time spent searching. SO can then be estimated by

SO = (Nodes searched with N CPUs) / (Nodes searched with 1 CPU) - 1.

An important point to note is that occasionally incomplete information can work to advantage. For example, assume that the second son, but not the first, of a CUT node will cause a cutoff. The sequential version will search the entire first subtree before considering the second and finding the cutoff. Searching them in parallel may result in finding the cutoff and halting the search of the first subtree before completion. As a result, it is possible for speedups to occur that exceed the number of processors. Unfortunately, far more common is the case where a cutoff from the first obviates the work on the second.

Synchronization overhead is the cost incurred by having processors fall idle. Ideally, all processors should be kept busy 100% of the time doing useful work. However, processes may become idle waiting for other results to become available. In practice, only a small class of problems is suitable for purely asynchronous solutions, and the simpler understanding and implementation of synchronous algorithms is preferred. This cost can be estimated using the known values of TO, CO, and SO. Alternatively, it can be approximated by having a process measure and sum all its idle times.

To achieve maximum performance from a parallel algorithm, it is necessary to minimize these overheads. The overhead may be inherent in the algorithm chosen, in which case, if performance is unsatisfactory, a new algorithm (if one exists!) may be the only solution. Usually, however, there is room for experimentation. Unfortunately, the overheads are not independent of one another, and an effort to decrease one may result in an increase in another. For example, increased communication can be used to decrease search effort. There are trade-offs where one can carefully balance the reductions in one type of overhead versus the increases in another.
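As a worked illustration of these definitions, the sketch below computes the overhead estimates from measured times, node counts, and message counts. The numbers in the example call are made up; only the formulas come from the text.

```python
def overheads(t1, tn, n, nodes_1, nodes_n, messages, cost_per_message):
    """Estimate the parallel overheads defined above, as fractions of the
    useful sequential work (multiply by 100 for percentages)."""
    to = n * tn / t1 - 1.0                   # total time overhead
    so = nodes_n / nodes_1 - 1.0             # search overhead
    co = messages * cost_per_message / t1    # crude communication estimate
    sy = to - co - so                        # synchronization, from TO = CO + SO + SY
    return to, co, so, sy

# Illustrative numbers only: a 1000-sec sequential search run on 7 processors.
print(overheads(t1=1000.0, tn=270.0, n=7,
                nodes_1=2.0e6, nodes_n=2.6e6,
                messages=4000, cost_per_message=0.005))
```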


2.3. Previous Work

Baudet's parallel aspiration search [2] subdivided the search window into disjoint intervals. Each processor searched the tree with a different window. The solution must fall within one of them and can be found in a time less than would be possible had the full range been used as a window. His experiments using randomly generated trees showed a maximum speedup of 5-6, regardless of the number of processors.

The mandatory work first approach maintains a list of subtrees which it is known must be searched [1]. Processors take the next available subtree from this list, search it, and return the results, possibly resulting in the addition of more work to the list. The algorithm has the advantage of reducing search overhead by performing only work that it has been proven must be done. Simulations have shown the possibility of tremendous speedups. However, the method has been ignored by implementors because of the excessive memory requirements needed to store all the partially evaluated nodes in the tree.

Simple tree splitting applies one processor per subtree originating at the root of the tree [5]. This scheme is the simplest way to parallelize αβ, but it suffers from synchronization overhead (few subtrees to search, each of large granularity) and search overhead (there is no convenient means for sharing information). Fishburn has shown that under best-case ordering, the speedups grow with the square root of the number of processors used [5].

The Principal Variation Splitting algorithm (PVSplit) applies tree splitting along the principal variation [10]. Recognizing that the first subtree from the root may consist of half of the entire tree [17], PVSplit applies all the processors to solving this problem before dividing up the rest of the work. In Fig. 1, this involves descending to node PV4 and applying simple tree splitting. When completed, the remaining sons at node PV3 are divided up, and so on back to the root. At each point where tree splitting occurs, the processes must synchronize before continuing. The general form of the algorithm can be found in [12], whereas an implementation algorithm can be found in [11]. The advantage of PVSplit is that it applies processors where the largest search effort is required. The major problem is the synchronization that occurs along the principal variation. It is interesting to note that the more efficient the search is, the greater this problem can be. As one approaches the minimal tree, the average number of sons examined at CUT nodes approaches 1. Not getting a cutoff quickly can result in that subtree being many times larger than other trees where the cutoff occurs quickly. The cost at synchronization points will increase since processors may remain idle longer waiting for the completion of a proportionally larger subtree.
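A compressed sketch of the PVSplit control flow is shown below, reusing the alpha_beta routine from the earlier sketch and standing in for the N Searcher processes with a thread pool. The real algorithm uses separate processes and message passing, so this only shows where the split and the synchronization point occur.

```python
from concurrent.futures import ThreadPoolExecutor

POOL = ThreadPoolExecutor(max_workers=4)    # stands in for the N Searcher processes

def pv_split(node, alpha, beta):
    """Sketch of PVSplit over the explicit-tree representation used earlier.

    The first (principal-variation) son is searched with all processors by a
    recursive PVSplit call; the remaining sons are then farmed out in parallel,
    each searched sequentially by alpha_beta().  All results are collected
    before returning, which is the synchronization point discussed in the text.
    """
    if not isinstance(node, list):
        return node
    best = -pv_split(node[0], -beta, -alpha)
    alpha = max(alpha, best)
    if best >= beta:
        return best
    # Note: the siblings all start with the window known at submission time and
    # are not aborted by each other's cutoffs, one source of search overhead.
    futures = [POOL.submit(alpha_beta, child, -beta, -alpha) for child in node[1:]]
    for f in futures:                       # wait for every sibling: synchronization
        best = max(best, -f.result())
        alpha = max(alpha, best)
    return best
```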


Previous experiments with the PVSplit algorithm in ParaPhoenix have shown that as tree size increases, synchronization costs quickly become the primary source of performance degradation, apparently growing linearly with the number of processors [11]. The experiments show that communication overhead is negligible and that search overhead appears to level off eventually. With five processors, 7-ply searches yielded a 3.66-fold speedup. Extrapolation of the diminishing returns for additional processors gave the conclusion that performance would taper off at a 4.4-fold speedup.

Marsland and Popowich have experimented with transposition tables as an aid to PVSplit [12]. They compared the effects of use of a local table for each processor and of use of one global table. Their results are strongly influenced by the small size of the tables used (8000 entries) and the high cost of communication in their system. Use of a global table showed up poorly because of table overloading and the large increase in communication. To solve both problems, they limited table entries to subtrees of greater than 2 ply in depth. With this change, a speedup of 2.91 for 5 ply with four processors was achieved. With local tables, a similar depth limitation on table entries produced a 3.27-fold speedup for 6-ply searches.

Newborn has implemented PVSplit in his chess program Ostrich [14]. He reports a 3.21-fold speedup with four processors, and 5.03 with eight. However, these figures were obtained using variable depths; the sequential program was run until an iteration was completed and the total computing time exceeded 4 min. Communication overhead was not a problem; most of the losses were the result of synchronization. To help reduce synchronization, he suggests reassigning idle processors to aid busy ones [14].

3. REDUCING SYNCHRONIZATION OVERHEAD

Experimental evidence has shown that synchronization is the major cause of the poor performance of PVSplit. The solution is obvious: reassign idle processors to assist busy ones. However, given the interdependence of the parallel overheads, one might suspect that the resulting reductions in synchronization time will be at least partially offset by increases in search and communication.

3.1. Dynamic PVSplit

One way of thinking of PVSplit is that there is one Controller process to distribute work and N Searcher processes to do the searching. This can be viewed as a process tree of depth 1 and width N (see Fig. 2a with N = 4). The problem with the algorithm is that the Controller assigns work to Searchers that is indivisible; the task must be completed by that Searcher alone. The generalized version of PVSplit, which allows arbitrary process trees, suffers from the same problem because the communication structure is fixed at program startup and is not dynamic.

FIG. 2. A sample execution sequence: (a) 3 becomes idle; (b) 2 becomes idle; (c) 1 and 4 become idle; (d) 1 becomes idle; (e) 3 becomes idle; (f) 1 becomes idle.

When assigning a subtree to be searched to a process, it is difficult to know how much effort will be required to complete the task. If this information were available, then the work could be scheduled to minimize processor idle time. Since it is not, the last piece of work outstanding at a synchronization point could take arbitrarily long to complete. A solution is to make the work assigned to Searchers divisible by having each Searcher apply PVSplit on its subtree. When the Controller runs out of subtrees to allocate, it can reassign idle Searchers to help others complete their tasks. In this way, the tree-like configuration of processes is retained; it just changes its shape dynamically. The synchronization overhead is thereby reduced by not allowing processes to become idle at synchronization points. Figure 2 illustrates a sample scenario with four Searchers.

Anthropomorphically, the Controller can be viewed as an employer of Searchers. In the PVSplit algorithm, a Searcher is always an employee of the Controller. The new approach, called Dynamic PVSplit (DPVS), allows Searchers to be both employees and employers. When a Searcher, S, is assigned work, it applies PVSplit to its subtree. Initially there is only one process to do the work: itself. When a message is received from the Controller asking whether S needs help, it can respond favorably (if there is still work to be done) or unfavorably (if there is no more work left to distribute). In the former case, S uses the process id, T, sent by the Controller to assign some of its work to. T in turn applies PVSplit to its work from S.

An important point is that most subtrees will be searched by a process alone, without assistance. Hence the overhead of applying PVSplit when only one process is doing the work must be kept small. Each Searcher saves the work to be done along its principal variation in global variables. When a process is made available to help out, the employer examines the work to be done and selects a subtree of minimum depth for the new employee to evaluate.
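One way this selection might look is sketched below, assuming the employer keeps its outstanding principal-variation work as a list of (depth, subtree) pairs. The data structure and function name are hypothetical, and the 2-ply floor follows the minimum granule of work described later in this section.

```python
def pick_work_for_helper(pv_work):
    """Sketch of an employer choosing a piece of work for a newly assigned helper.

    pv_work is assumed to be the employer's list of outstanding (depth, subtree)
    pairs saved along its principal variation.  The helper is handed the
    outstanding subtree of minimum depth, and nothing smaller than 2 ply is
    distributed (the minimum granule of work noted in the text).
    """
    candidates = [item for item in pv_work if item[0] >= 2]
    if not candidates:
        return None                          # nothing left worth handing out
    work = min(candidates, key=lambda item: item[0])
    pv_work.remove(work)                     # this subtree now belongs to the helper
    return work
```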


When the assigned work has been completed, the employer process receives a message, updates the results, and looks for new work to assign. The employer process cannot complete a node along its principal variation until all employees working on that node have returned their results. Interrupts are used to notify an employer when employees return results and when new Searchers are available to help out.²

² Actually, polling proved to be more convenient to implement. A process polls once a second on average.

Beyond a certain size or granularity of work it does not make sense to parallelize. Each work assignment has significant overhead (for example, the communication), and the amount of work assigned must justify the cost. The smaller the work, the greater the relative percentage of overhead. PVSplit is applied until the subtrees to be searched are 2 ply, the smallest unit of work that can be assigned to a process.

Each Searcher process is essentially a loop that continually results in the hiring and firing of the process by other processes. A process works for an employer until the employer has no more work to be done. Once fired, a process notifies the Controller, who then reassigns him to a process known to have work to be done. The Controller process performs iterative deepening at the root of the tree. At the start of each iteration, the principal variation is sent to a Searcher and the other N - 1 Searchers are assigned to help it out. After this is completed, the remaining moves at the root are sent in turn to the N Searchers for evaluation. When work to be done runs out, the Controller assigns idle processes to be employed by other Searchers in a way that evenly distributes the help among the processes searching the largest subtrees (in terms of depth).

DPVS is not without its problems. The immediate consequence is the information deficiency that results from splitting a task over more than one processor. Had a subtree been searched in its entirety on the same processor, the local transposition and history tables could be used to maximum benefit. By searching part of it on another processor, in all likelihood extra effort is spent searching because of the lack of relevant information in the tables.

Despite the attempt to eliminate it, synchronization overhead still exists. At nodes where tree splitting occurs, the employer process must wait until all his employees have returned with their results before moving on to other tasks. Consequently, the employer may be idle until the last employee returns. When processes complete their work, they are assigned new tasks either for the same or for a different employer. The exception to this is the evaluation of the last piece of work. Consider synchronization for a 7-ply tree with four Searchers. In PVSplit, at the end of an iteration, three processes would remain idle waiting for the return of the last 6-ply result. In DPVS, the scenario of three idle processes can occur, but here the wait may only be for a 3-ply result. As in Fig. 2g, C would be waiting for the 6-ply result from 4, who waits for


the 5-ply result from 2, who waits for 3's 4-ply result, who waits for 1's 3-ply result. Similarly, there is synchronization between iterations; the Controller cannot start dividing up work for the next iteration until the current iteration is completed. Only when the last piece of work is done is it possible to state with certainty what the principal variation for the next iteration is. Idle processes could be used to begin the next iteration, but if the principal variation changes as a result of the last piece of work completed, then this work will have to be aborted. This could be implemented but has not been, because it fails to solve the problem of the very last iteration, where synchronization overhead is the largest.

Another source of idle time occurs when a Searcher encounters a large subtree that is not along the principal variation. If this subtree is assigned to another Searcher there is no problem, since DPVS will be applied to it. But if the employer happens to search it, it will not be subdivided, since it is not on the principal variation and must be searched to completion. As a result, this process is busy but has no work to assign to others. If this happens near the end of an iteration, it is possible that there are idle processors that could be used to help out. The stronger the ordering of the tree, the less frequently this scenario will occur. Note that this problem can give rise to significant variations in the running time of the program, depending on whether the offending subtree is searched by the employer or the employee.

Finally, idle time is caused by all communication to the Controller. Searchers become idle on all requests for work. The Searcher must wait until his request is received (he may not be first in line for service), wait until work can be found (the Controller might have to find a process that is still busy with work available), and wait until the work assignment is sent back (if there is any). Thus the overhead of the Controller is more than that for PVSplit.

There is a serious problem with DPVS not addressed by the previous discussion: extendibility. DPVS requires every process to be able to communicate with all others. For a large number of processors, this is obviously unacceptable. A solution is to have each node in the process communication graph represent a pool of processors that can only help each other out. By localizing the communication, the algorithm can be readily extended.

3.2. Experiments

A collection of workstations, with its network and communications software, may be viewed as a multicomputer capable of running distributed and parallel algorithms. The facility used in the experiments described in this paper is a multicomputer called the Virtual Tree Machine [15, 16]. It is implemented on a network of autonomous VAX-11/780, SUN 2, and SUN 3 processors, each running the 4.2BSD UNIX³ operating system.

³ UNIX is a trademark of AT&T Bell Laboratories.


FIG. 3. PVSplit vs DPVS speedups (7 ply).

The experimenter's view of the facility is a collection of processing elements, each with its own local memory and peripherals. In reality, the Virtual Tree Machine consists of ordinary UNIX processes, with communication paths implemented as virtual connections between processes over a local area network. The name is a misnomer in that arbitrary connection patterns are possible, not just trees. The user's interface to the machine is a set of procedures, callable from application programs, and a collection of servers that create the processes according to a description provided by the user. This description specifies the mapping between virtual processing elements (nodes) and physical processors as well as the interconnections between nodes. Thus, the whole machine might reside on one physical processor during development, and later be distributed over the selected physical machines for productive use.

Experiments were performed comparing PVSplit (PVS) and DPVS (D). Each program used local history and transposition tables. A standard set of 24 positions [8] that has been extensively used as a benchmark for sequential and parallel chess programs was the basis for comparing performance. The speedups through seven processors are shown in Fig. 3 and the corresponding overheads in Fig. 4. The six- and seven-processor data points represent the median over three runs; other points represent a single run each. The experiments were performed on a network of SUN 2 computers, six of which were standalone machines with no operating system. The Controller and one Searcher were run on a SUN 2 under UNIX, while the other six Searchers used the standalones. The standalone machines run 13% faster than the UNIX-based system, and the results in both Figs. 3 and 4 have been adjusted to reflect this.⁴

⁴ A SUN 2 running UNIX was viewed as 1 CPU and each standalone as 1.13. Performance was calculated using these adjusted numbers of CPUs.
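The adjustment in footnote 4 might be computed along the following lines; the exact bookkeeping used for the published numbers is not spelled out in the paper, so this helper and its example values are assumptions.

```python
def adjusted_measures(t1, tn, standalones, factor=1.13):
    """Sketch of the processor-count adjustment from footnote 4 (assumed
    bookkeeping): one SUN 2 under UNIX counts as 1 CPU and each standalone
    as 1.13, and the time overhead TO is computed against that adjusted N."""
    n_eff = 1 + standalones * factor         # UNIX machine plus the standalones
    speedup = t1 / tn
    to = n_eff * tn / t1 - 1.0
    return n_eff, speedup, to

# e.g., seven machines: one SUN 2 under UNIX plus six standalones (made-up times)
print(adjusted_measures(t1=10000.0, tn=2500.0, standalones=6))
```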


FIG. 4. PVSplit vs DPVS overheads (7 ply).

For the standalones it was not possible to get accurate measures of processor idle time. Consequently, the synchronization overhead was computed from the known values of TO, SO, and CO.

The results show that the reassignment of idle processors is of some benefit, increasing the seven-processor speedup from 3.70 to 3.96. This is not a particularly significant improvement. Intuitively, one would expect the results to be better, because idle time should be reduced. Examination of Fig. 4 shows that synchronization overhead has indeed gone down but, as expected, search overhead is increasing. Previous experiments with PVSplit have suggested that search overhead levels off [11], and the results here are consistent. DPVS, however, has altered this pattern, making search overhead apparently increase linearly with the number of processors. With seven processors, DPVS's search overhead exceeds PVSplit's. The result is that the improvements in one overhead have almost been offset by losses in another.

Note in Fig. 4 that the two- and three-processor results for PVSplit have negative synchronization overhead. Strictly speaking this is not possible. It is a result of the formula used to calculate search overhead, which naively assumes all nodes in the tree to be roughly equivalent in cost. SO is an approximation based on the strong correlation between tree size and execution time. Consequently, all overhead measures are estimates and variations are to be expected.

Synchronization overhead is lower for DPVS but, more importantly, its growth rate is smaller. However, for seven processors it is still a high 37%. Given the efforts to reduce it, why is it not lower? The consequence of splitting work up is that a Searcher examines a subtree without the benefit of the information in his employer's history and transposition tables. As a result, the employee process is ineffective at ordering sons at CUT nodes and, on average, builds larger trees. Employers must wait until all employees have returned


their results before completing a node along the principal variation. Hence, the increased searching of the employee can translate into increased idleness of the employer. Whereas the average wait time at a synchronization point may be less for DPVS than for PVSplit, there are potentially many more synchronization points. Consequently, the decreased idle time at PVSplit synchronization points is largely offset by increased idle time at employer synchronization points.

Analysis of communication and the Controller shows that neither has a significant overhead. Communication overhead is estimated by counting the number of messages sent and received and multiplying it by the average cost per message. The overhead of the Controller is obtained by measuring the CPU time it uses. For seven processors, these costs add up to roughly 0.5% of the total overhead and have been ignored.

In summary, DPVS has reduced synchronization at the cost of increasing search effort. In turn, more searching implies longer waits and increasing synchronization. Since the two are intimately related, if the search overhead could be reduced, then perhaps synchronization would also decrease.

4. REDUCING SEARCH OVERHEAD

In this section, two methods for sharing information between processes to increase search efficiency are investigated.

4.1. Sharing Tables

Search overhead is caused by a deficiency of information. A solution is to increase the sharing of information at the cost of increased communication. Since transposition tables are an important enhancement to sequential αβ, it is possible they may also be useful in a parallel environment. Local tables do not allow for the sharing of information between processors. These tables are large, so it is not practical to distribute their information to everyone. An observation is that they are most effective near the root of the tree. Transpositions found there result in significantly more savings than transpositions near the terminal nodes.

One possibility is to have, in addition to the local tables, a global Table Manager. Whenever a Searcher needs to do a table lookup, it sends a request to the Table Manager to check its table, in parallel with the lookup in its own table. It then waits until it gets a response. Obviously such a scheme will not work for all nodes in the tree where table lookups occur; there are too many. Consequently, the table can be limited to storing results for positions within a few ply of the root node. Experiments showed that saving results for the first 4 ply of the tree did not cause a harmful increase in communication costs, whereas those for 5 ply did.
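The depth window for global Table Manager traffic can be captured in a small predicate, sketched below. The counting convention (the root's sons are ply 1) and the parameter names are assumptions, chosen so that the example reproduces the depth-5, -6, and -7 cases quoted in the text that follows.

```python
def use_global_table(ply_from_root, remaining_depth, root_limit=4, leaf_limit=3):
    """True if a position should be stored in / queried from the global table.

    Only positions within root_limit ply of the root are kept, and positions
    within leaf_limit ply of the terminal nodes are skipped because the saving
    there does not justify the communication.
    """
    return ply_from_root <= root_limit and remaining_depth >= leaf_limit

# Which plies below the root qualify for searches of depth 5, 6, and 7?
for depth in (5, 6, 7):
    print(depth, [p for p in range(1, depth + 1)
                  if use_global_table(p, depth - p)])
# prints: 5 [1, 2]   6 [1, 2, 3]   7 [1, 2, 3, 4]
```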


Further, querying information near the terminal nodes is not useful, since the benefits are small. Hence, information within 3 ply of terminal nodes was not saved. For low search depths, this implies that the tables are of little use. For a depth 5 search, only information relevant to the first 2 ply is saved; for depth 6 the first 3 ply is saved, and for 7 and beyond, the first 4 ply.

A potential problem is processor idle time. The local table lookup is much faster than the global query. Until that reply arrives, the process is idle. Attempts to eliminate this idle time only increase the complexity of the program. At an interior node, after a table lookup, the program searches the moves in best-to-worst order. If the search continues before the table request has returned, an inferior first move may be considered. When the query returns, it may have a suggested best move that differs from the choice made. At this point it is awkward to halt everything and back up to the point where the wrong decision was made. Consequently, the global table queries may help reduce search overhead, but at a cost of increased processor idle time and communication.

Another important sequential enhancement is the history heuristic. When a move is found best in one position, that information is recorded in the history tables to be used in all other positions where that move is legal. In a distributed environment, each processor can have its own local table. However, that reduces the heuristic's effectiveness, since a processor does not know about best moves found elsewhere. A global table that all processes can query is unsatisfactory because the history tables are used to sort moves at all interior nodes. History tables are sparse; usually fewer than 100 and rarely more than 200 entries are nonzero. It thus becomes feasible for processes to communicate the nonzero entries to each other. The goal is to have all processors' results merged into one table. A convenient time to do this is at the start of each iteration. Each Searcher can send its version of the table to the Controller. The Controller can merge the results and broadcast the resulting table. Each process can use that table and make its own local updates that eventually get sent back to the Controller at the start of the next iteration. The advantage of this scheme is that history information is shared, albeit not immediately as in the sequential case. The disadvantage is a delay in starting an iteration to communicate this information.
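A sketch of the history-table merge at the start of an iteration is given below. Each table is taken to be a sparse mapping from moves to history scores; summing the scores is one plausible combining rule, since the paper does not state which rule Phoenix uses.

```python
def merge_history(tables):
    """Sketch of the Controller merging sparse history tables at the start of
    an iteration.  Each table maps a move to its history score and only the
    nonzero entries are transmitted; summing is one plausible combining rule.
    """
    merged = {}
    for table in tables:                     # one table per Searcher
        for move, score in table.items():
            merged[move] = merged.get(move, 0) + score
    return merged                            # broadcast back to all Searchers

searcher_a = {("g1", "f3"): 40, ("e2", "e4"): 12}
searcher_b = {("e2", "e4"): 30, ("d2", "d4"): 7}
print(merge_history([searcher_a, searcher_b]))
```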

4.2. Experiments

Sharing history (H) and transposition table (T) information has been added to the program. The performance is shown in Fig. 5, with DH representing a program with DPVS and shared history information and DHT representing the same with transposition tables added.⁵

⁵ Note that there is no data point for two processors. Obviously, two processors, one Searcher and one Table Manager, do not make sense since the local table will be similar to the global one.

FIG. 5. D, DH, and DHT speedups (7 ply).

The addition of history information has provided a consistent improvement in performance, from DPVS's 3.96-fold to 4.41-fold speedups. With the Table Manager, a speedup of 4.78 for seven processors is achieved, a full processor better than PVSplit's 3.70.

The introduction of a Table Manager requires a redefinition of overheads. This process occupies a processor by itself (because of its memory requirements) but is not computation bound and is largely idle.⁶ The idle time of this process should not be reflected as part of the synchronization overhead of the others. Similarly, the search overhead is attributable to the Searchers, not the Table Manager. Assume there are N processors using N - 1 Searchers and one Table Manager. Search overhead becomes

SO’= so&$ to distribute the search effort of N - 1 processors over all N. Table overhead (TB) is TB’ = TO, - TON-, =

Time using N CPUs Time using 1 CPU ’

representing the difference in TO had it been computed using N processors rather than with the N - 1 that actually did the searching. Synchronization overhead of the Searchers can now be estimated as SY' = TO - CO - SO' - TB'.

⁶ The Controller could share the same processor since it does not require many cycles or a large memory.
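The redefined overheads can be computed directly from the formulas above; the helper below is illustrative only and is not part of the original program.

```python
def overheads_with_table_manager(to, co, so, n):
    """Redefined overheads when one of the N processors is a Table Manager and
    only N - 1 processors search (illustrative helper, not the original code)."""
    so_prime = so * (n - 1) / n      # spread the N-1 Searchers' effort over all N
    tb_prime = (to + 1.0) / n        # since TO = N*(Tn/T1) - 1, Tn/T1 = (TO + 1)/N
    sy_prime = to - co - so_prime - tb_prime
    return so_prime, tb_prime, sy_prime

print(overheads_with_table_manager(to=0.90, co=0.01, so=0.25, n=7))
```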

FIG. 6. D, DH, and DHT overheads (7 ply).

Figure 6 illustrates the overheads for the D and DHT versions of the program, with DHT's overheads calculated as above, making appropriate adjustments for the faster speeds of the standalone machines. Again, the communication overhead is well less than 1% and is assumed to be zero. Note that the negative search overhead is due to the phenomenon reported in Section 2.2, where timely cutoffs can reduce the parallel search effort below that required by the sequential program.

From Fig. 6 it is seen that TB steadily decreases. The effect of devoting one processor to table managing is less pronounced the more processors used. Synchronization and search are significantly reduced. The better ordering at CUT nodes afforded by the shared tables has reduced the search effort and, as a result, the idleness of employers. The synchronization cost now includes, in addition to the previous costs, the cost of remaining idle waiting for table queries and the time spent at the start of each iteration communicating the history information. On average, the history sharing costs 15-20 sec per 7-ply search, representing 1-2% of the seven-processor running time.

There are other improvements that can be made in the search algorithm to reduce search and synchronization costs. For example, one possibility is to communicate new search window bounds to processors as soon as they become available. In general, this and similar types of enhancements have been shown to provide only minor improvements. A variety of algorithm enhancements can be found in [18].

This brings us full circle. Synchronization overhead was perceived as the bottleneck and attempts to reduce it were successful. However, it introduced search overhead as the new bottleneck. Tackling that problem further improved performance but brought us back to synchronization. Of course, the cycle could continue, each attempt at improving the algorithm yielding marginally less for the effort expended.

5. IN PERSPECTIVE

An interesting question to ask is how the performance of a parallel searcher is related to the size of the problem solved. Intuitively, the larger the tree, the more parallelism that can be achieved. Figure 7 shows the speedups of the DHT version for 3 through 7 ply. For shallow depths, the Table Manager is of little benefit and the speedups are low. For 6 and 7 ply, performance improves as table queries begin to eliminate large portions of the tree. One could extrapolate these results to imagine the 8- or 9-ply curves that would bring us closer to the ideal speedup.

Searching the data set to a fixed depth is a misleading measure of the size of trees being searched. A 7-ply search of one position took only 332 sec, whereas 16,699 sec was needed for another. If we want to know the relationship between speedup and tree size, using a fixed-depth search is inadequate. A more meaningful measure is to classify the trees into groups with similar sequential running times. Figure 8 shows the performance of 7-ply DHT for three classifications of the sequential running time of the data: less than 10³ sec, 10³ to 10⁴ sec, and greater than 10⁴ sec. This diagram clearly illustrates that the larger the tree, the better the speedups possible.

Published results for PVSplit are often better than those shown in Fig. 3 (see Section 2.3). There are two reasons. First, the preceding discussion shows that the performance of parallel αβ searchers must be qualified by the size of task solved. ParaPhoenix could be run long enough that all test positions exceed 10,000 sec, thereby achieving a notable 6-fold (or better!) speedup for seven processors (from Fig. 8). The second reason is the efficiency of the search: the more efficient, the greater the overheads involved.

FIG. 7. DHT speedups vs search depth.

FIG. 8. Speedups and tree sizes (7 ply).

For example, in [12], PVSplit running on equipment similar to that used in this paper required 155,400 sec (2590 min) to complete 6-ply searches on the same set of test data. In comparison, ParaPhoenix took 142,171 sec to complete 7-ply searches. Thus one would expect the former to achieve a better speedup; the latter has the extra search and synchronization overhead of an additional iteration. Thus a less efficient searcher would be expected to get better speedups for the same ply searches.

Previously published results for PVSplit in ParaPhoenix showed a 3.66-fold speedup with five processors [11], compared to the current results of a 3.14-fold speedup (see Fig. 3). Whereas the implementation of PVSplit has remained the same, the chess program has evolved and its searching efficiency improved. Since Phoenix is now a better sequential program, both search and synchronization overheads are more pronounced in the parallel version. This can be seen by comparing the 1984 and current ParaPhoenix PVSplit results. This shows that results of parallel αβ searchers must be qualified not only by the size of tree searched but also by the efficiency of the sequential search program.

6. OUTSIDE THE LABORATORY

Some additional results are available using a network of 20 SUN 3/75s. These machines are roughly four times the speed of the SUN 2s used for the previous experiments. In addition, their Ethernet interface is considerably faster, reducing the cost of communication. However, these machines were not standalone, and accurate timings had to contend with operating system interference as well as occasional interruptions from other users on the system. In Fig. 9, the speedups through 19 processors are shown. Data are available only for odd numbers of processors.

The results are similar to those shown in Fig. 5 (through 7 processors), showing that scaling the hardware has preserved performance.


FIG. 9. SUN 3/75 network speedups (7 ply).

With 19 processors, a speedup of 7.64 is achieved, but clearly performance is rapidly tapering off. Comparing the DHT and D curves shows that the effort in reducing search and synchronization overheads has not substantially changed the shape of the curve; it has only shifted it to a point where more processors can be used effectively. If the D and DHT curves continue their current trends, extrapolation shows that D's performance will level off at a 5-fold speedup using 16 processors, while DHT will achieve 8 with 24 processors.

Figure 10 illustrates the overheads for Fig. 9. Steadily increasing search and synchronization are the long-range downfall for DHT. The diagram shows synchronization to be decreased but still growing linearly with the number of processors, roughly 5% per processor. Whereas PVSplit's search overhead levels off, in Fig. 10 for DHT it has an average slope of 4% per processor. The improvements in synchronization are being largely offset by increased search effort. The effect of all the enhancements to PVSplit has been to reduce the synchronization problem without altering its consequences for large numbers of processors.

Beyond 9 processors, where the speedups begin to taper off noticeably, there are several problems that DHT starts encountering that prevent better performance. One potential problem is the overloading of the Table Manager. Whereas it may be able to effectively serve roughly 10 Searchers, beyond that number the wait times for service may grow to unacceptable levels. A possible solution would be the addition of a second Table Manager. This might further reduce search and synchronization overheads, but increase the table overhead.

A second problem is just the sheer number of processors. As work to do runs out at synchronization points, more processors fall idle. The work still being performed cannot continually be subdivided to use the idle resources;


FIG. 10. SUN 3/75 network overheads (7 ply).

there is a minimum granule of work below which it is not cost effective to have it distributed. As the number of processors increases, the execution time of the program decreases, and the idle time at synchronization points increases, causing the percentage loss as a result of synchronization to increase. Thus one would expect that the larger the tree being searched, the more processors can be utilized before synchronization overhead takes over.

For a large number of processors, a third problem can be communication. If each connection used a dedicated line, then processors would not be in contention with one another. However, the machines used in the experiments were connected by a single Ethernet cable. The more processors, the greater the volume of communication and the greater the chance for communication collisions and delays. For the results in Fig. 9, the bandwidth of the Ethernet appears to be adequate and communication is not viewed as being a problem.

The conclusion is that the efforts to reduce the search and synchronization overhead have not changed the fundamental nature of αβ to resist parallelism. What has been achieved is a constant level of improvement in performance. The number of processors that can profitably be used to improve performance has been increased, but this still does not bring us close to the goal of using hundreds of processors effectively.

7. SPECULATIVE COMPUTING

7.1. Scouts

Given that there is a point beyond which processors cannot be used effectively, how best to make use of additional resources? A solution is to add


speculative computing [4] to the search, using processes to speculatively search for interesting features in the tree. It is called speculative because it is a gamble; more often than not, the effort may prove fruitless. Baudet's parallel aspiration search (Section 2.3) is speculative because the solution will fall in only one window; the other processors' computations are wasted. An example from symbolic computation is the calculation of greatest common divisors [22]. Here there are several algorithms that can be used to solve a problem, but it is not known a priori which one is best. The solution is to start all the algorithms in parallel and stop when one of them finds a solution. An example from chess programs is Newborn's experiments with a Mater process [13]. Rather than add an additional processor to his parallel αβ search program, he used this processor to look exclusively for checkmates. Because of the specialty of the task, the process was able to search much deeper than was normally possible. On the other hand, it was rare that the Mater found anything worthwhile.

Whereas the previous discussion of parallel αβ could be described in an application-independent manner, the introduction of speculative computing requires the discussion to focus on one application. The features in the tree that one wants to speculatively search for are, of course, application dependent. In this paper, αβ searching has been applied to a computer chess program. For the game of chess, winning and losing a game is highly correlated with the win or loss of material on the board. Whatever the depth to which an αβ search is performed, there is always the possibility that by searching deeper, a better result may be found. In chess programs this is particularly acute; a program may make a move that, for example, looks good in a 5-ply search, but may show its deficiencies at 6 ply. Considerations such as this have motivated chess programmers to acquire the fastest possible hardware.

To overcome this problem, the idea of a Scout is introduced. Scouts are stripped-down versions of Searchers. They are designed to be as fast as possible, scouting ahead in the tree looking for wins and losses of material at a depth beyond what an αβ-searching chess program could usually consider. All the expertise that allows a chess program to play a good game has been removed. It evaluates positions solely on the material balance of the position. This tactical search information can be used to: (1) find wins of material at depths beyond what can be achieved by conventional αβ searchers, and (2) prevent Phoenix from making a move that deeper search shows to lose material.

The job of a Scout is to find the set of tactically best moves at the root of the tree. These moves result in the maximum material advantage for the side to move. It performs an iterative search, classifying each move as inferior or tactically best.
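A sketch of the root-move classification a Scout performs is given below, with a stand-in material_score function in place of the deep material-only search; the names and the toy scores are invented for the example.

```python
def classify_root_moves(root_moves, material_score):
    """Sketch of a Scout classifying the moves at the root of the tree.

    material_score(move) stands in for the result of the deep material-only
    search after making that move, from the root player's point of view.  The
    moves achieving the maximum value form the set of tactically best moves.
    """
    scores = {move: material_score(move) for move in root_moves}
    best = max(scores.values())
    tactically_best = [m for m, s in scores.items() if s == best]
    inferior = [m for m, s in scores.items() if s < best]
    return tactically_best, inferior

# Toy scores standing in for the search results.
toy = {"Nf3": 0, "Qxb7": -3, "Bxf7+": 1, "Rxe8+": 1}
print(classify_root_moves(list(toy), toy.get))   # (['Bxf7+', 'Rxe8+'], ['Nf3', 'Qxb7'])
```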


Because of its simplicity, a Scout can execute at twice the speed of a Searcher. This is not enough; on average, a factor of between 3 and 8 is required to search an extra ply. Since Scouts are interested only in finding tactically interesting lines, some method is required to eliminate irrelevant lines from the search, thereby reducing the effort required to search an extra ply. Don Beal's null move can do this selection in a mechanical way without the use of application-dependent knowledge [3]. The algorithm is a minor change to αβ to allow an extra alternative at interior nodes, that of not making a move at all. The forfeiture of a move usually has serious consequences. If the opponent is given two moves in a row and still cannot do anything harmful, then that line of play is cut off without further search. This idea is applied at all nodes in the tree where a null move is not already part of the path from the root to that node. In this way, many innocuous moves are quickly eliminated from the search.
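A sketch of a material-only αβ search with the null move as the extra alternative at interior nodes is shown below. The game interface (moves, material, null) is invented for the sketch, and the choice to search the null move at reduced depth with the full window is an assumption; it is not the Minix code.

```python
INF = float("inf")

def scout_ab(pos, depth, alpha, beta, game, null_on_path=False):
    """Material-only alpha-beta with the null move as an extra alternative.

    `game` is an assumed interface: game.moves(pos) gives the child positions,
    game.material(pos) the material balance from the side to move, and
    game.null(pos) the same position with the move passed to the opponent.
    A null move is tried only when none is already on the path from the root.
    """
    kids = game.moves(pos)
    if depth == 0 or not kids:
        return game.material(pos)
    if not null_on_path:
        # Forfeit the move: if the opponent, given two moves in a row, still
        # cannot do anything harmful, cut this line off without further search.
        score = -scout_ab(game.null(pos), depth - 1, -beta, -alpha, game, True)
        if score >= beta:
            return score
    best = -INF
    for child in kids:
        best = max(best, -scout_ab(child, depth - 1, -beta, -alpha, game, null_on_path))
        alpha = max(alpha, best)
        if best >= beta:
            break                        # ordinary cutoff
    return best
```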

7.2. Phoenix and Minix

Scouts can be easily implemented in parallel by using the same routines used to parallelize Phoenix. They have their own Controller to distribute the work and accumulate the results and, if desired, their own Table Manager. A set of Scouts with a Controller is called Minix (Mini-Phoenix). The parallelization of Minix yields a performance comparable to that for ParaPhoenix. A sample process structure on our network of computers is shown in Fig. 11.

FIG. 11. Phoenix and Minix.


Here there are four Searchers supported by a Table Manager for Phoenix and five Scouts for Minix. For the World Computer Chess Championships, 10 SUN 3/75 computers were used for each of ParaPhoenix and Minix, each program having nine processors dedicated to Searchers/Scouts and one processor running both a Controller and a Table Manager process.

Phoenix and Minix communicate through their Controller processes. Whenever Phoenix finds a new best move, it sends that information to Minix to ensure that it is searched first at the start of the next iteration. When Phoenix has decided on its move, it asks Minix for permission to make the move. Minix can respond with one of two answers:

(1) Phoenix's intended move is in the set of tactically best moves and permission is granted to make that move.

(2) Phoenix's intended move is not tactically best. Minix vetoes Phoenix and forces it to make a different move. The move is selected from the set of tactically best moves. If there is more than one move, the move giving the highest positional score based on a 1-ply search is chosen. In this case, Minix's deeper search has found a move materially better than that of Phoenix.

Further details can be found in [19].

7.3. Results

Scout processes are new and there is not much experience using them. Under tournament conditions (40 moves to make in 2 hr), Minix usually searches 2 ply or more ahead of Phoenix. Unfortunately, Minix's performance is difficult to quantify. Many standard sets of chess problems contain mostly tactical positions, and on these Minix performs exceptionally well, finding many winning moves that would not ordinarily be found by Phoenix. However, in real tournament games, the number of times that an extra 1 or 2 ply of tactical searching makes a difference in the move selected is infrequent. Since these situations do not occur often, one might argue that the computing resources are not well used. On the other hand, these winning and losing moves can be the decisive difference in a game. If the Scouts improve even one move in a game, then they have made a significant improvement in the program's play.

Phoenix has competed in three tournaments with Minix. In the 14 games played, Minix has twice searched deeper than Phoenix to find the win of material (and the game). On two other occasions, it found lines of play that allowed a speedy conclusion in games that Phoenix was already winning. In two other games, for a total of three occurrences, Minix prevented the making of a move that would have lost material. All totaled, this represents a difference of 7 moves in 6 of 14 tournament games, approximately 1% of the moves made. The probability that Minix will make a difference is small but, when it occurs, the benefits can be large.


These data are probably insufficient to draw any meaningful conclusions. The important point to make is that Minix allows Phoenix to overcome the tactical shortsightedness that plagues programs competing against those running on faster hardware. For example, the World Computer Chess Champion Cray Blitz is consistently able to achieve 8- and 9-ply searches in the middlegame while using a Cray XMP/48. The work of Minix allows the Phoenix program to play at par tactically with this supercomputer, despite being at a large computing disadvantage. Of course, this parity may be short-lived; there is no reason why the authors of Cray Blitz could not add speculative computing to their program.

8. SUMMARY

This paper has addressed the fundamental problems that restrict the performance of parallel αβ. Unfortunately, the enhancements to reduce the search and synchronization costs have only shifted the performance curve higher, without changing its basic shape. However, there is evidence that shows the speedups are strongly tied to the granularity of work. In principle, then, one could expect excellent speedups for an arbitrary number of processors if one is just willing to run the program long enough. Given that existing parallelizations of αβ are limited to a handful of processors, Scouts have successfully provided an alternate means of using available resources profitably.

Although the work here has increased the number of processors that can profitably be used for αβ, it is just a prelude to tackling the problems posed by having hundreds or even thousands of processors. It is not clear what form the algorithms will take that are needed to use that much computing power effectively. Indeed, it is likely that algorithms such as DPVS will be inadequate.

ACKNOWLEDGMENTS

Thanks to Marius Olafsson, Don Beal, Tony Marsland, Bjorn Bjornsson, and Tim Breitkreutz for invaluable suggestions. Special thanks to Sun Microsystems Inc. for providing the equipment used for the World Computer Chess Championships. In particular, the tireless efforts of Rob Gingell are greatly appreciated.

REFERENCES

1. Akl, S. G., Barnard, D. T., and Doran, R. J. The design, analysis, and implementation of a parallel tree search algorithm. IEEE Trans. Pattern Anal. Machine Intelligence 4, 2 (1982), 192-203.

2. Baudet, G. The design and analysis of algorithms for asynchronous multiprocessors. Ph.D. thesis, Department of Computer Science, Carnegie-Mellon University, 1978.


3. Beal, D. Experiments with the null move. Advances in Computer Chess V, 1988, in press.

4. Burton, F. W. Controlling speculative computation in a parallel programming language. Proc. 5th International Conference on Distributed Computing Systems, Denver, CO, 1985, pp. 453-458.

5. Fishburn, J. Analysis of speedup in distributed algorithms. Ph.D. thesis, Computer Sciences Department, University of Wisconsin-Madison, 1981.

6. Gillogly, J. Performance analysis of the technology chess program. Ph.D. thesis, Department of Computer Science, Carnegie-Mellon University, 1978.

7. Knuth, D. E., and Moore, R. W. An analysis of alpha-beta pruning. Artificial Intelligence 6 (1975), 293-326.

8. Kopec, D., and Bratko, I. The Bratko-Kopec experiment: A comparison of human and computer performance in chess. In Clarke, M. R. B. (Ed.). Advances in Computer Chess 3. Pergamon, Elmsford, NY, 1982, pp. 57-72.

9. Lindstrom, G. The key node method: A highly-parallel alpha-beta algorithm. UUCS 83-101, Department of Computer Science, University of Utah, 1983.

10. Marsland, T. A., and Campbell, M. S. Parallel search of strongly ordered game trees. Comput. Surveys 14 (1982), 533-551.

11. Marsland, T. A., Olafsson, M., and Schaeffer, J. Multiprocessor tree-search experiments. In Beal, D. (Ed.). Advances in Computer Chess 4. Pergamon, Elmsford, NY, 1985, pp. 37-51.

12. Marsland, T. A., and Popowich, F. Parallel game-tree search. IEEE Trans. Pattern Anal. Machine Intelligence 7, 4 (1985), 442-452.

13. Newborn, M. M. Private communication, San Francisco, 1984.

14. Newborn, M. M. A parallel search chess program. Proc. ACM Annual Conference, 1985, pp. 272-277.

15. Olafsson, M., and Marsland, T. A. A Unix based virtual tree machine. Proc. CIPS Congress 85, Montreal, June 1985, pp. 176-181.

16. Olafsson, M., and Marsland, T. A. Implementation of virtual tree machines. Tech. Rep. 85-9, Department of Computing Science, University of Alberta, 1985.

17. Schaeffer, J. Experiments in search and knowledge. Ph.D. thesis, Department of Computer Science, University of Waterloo, 1986.

18. Schaeffer, J. Experiments in distributed game-tree searching. Tech. Rep. 87-2, Department of Computing Science, University of Alberta, 1987.

19. Schaeffer, J. Speculative computing. J. Internat. Comput. Chess Assoc. 3 (1987), 118-124.

20. Slate, D. J., and Atkin, L. R. Chess 4.5 - The Northwestern University Chess Program. In Frey, P. W. (Ed.). Chess Skill in Man and Machine. Springer-Verlag, New York, 1977, pp. 82-118.

21. Wah, B. W., Li, G., and Yu, C. F. Multiprocessing of combinatorial search problems. Computer 18, 6 (1985), 93-108.

22. Watt, S. M. Bounded parallelism in computer algebra. Ph.D. thesis, Department of Computer Science, University of Waterloo, 1986.