The Journal of Systems and Software 72 (2004) 389–399 www.elsevier.com/locate/jss
The design and implementation of a runtime system for graph-oriented parallel and distributed programming
J. Cao a,b,*, Y. Liu b, Li Xie a, B. Mao a, K. Zhang c
a National Key Lab for Novel Software Technology, Nanjing University, Nanjing 21008, China
b Department of Computing, Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong
c Department of Computer Science, University of Texas at Dallas, Richardson, TX 75083-0688, USA
Received 24 October 2002; accepted 28 March 2003 Available online 5 December 2003
Abstract
Graphs have been widely used in modeling, specification, and design of parallel and distributed systems. Many parallel and distributed programs can be expressed as a collection of parallel functional modules whose relationships can be defined by a graph. Often, the basic functions of communication and coordination of the parallel modules are expressed in terms of the underlying graph. Furthermore, parallel/distributed graph algorithms are used to realize various control functions. To facilitate the implementation of these algorithms, it is desirable to have an integrated approach that provides direct support for efficient operations on graphs. We have proposed a graph-oriented programming model, called GOP, which aims at providing high-level abstractions for configuring and programming cooperative parallel processes. GOP enables the programmer to configure the logical structure of a distributed program by using a logical graph and to write the program using communication and synchronization primitives based on the logical structure. In this paper, we describe the design and implementation of a portable run-time system for the GOP framework. The runtime system provides an interface with a library of programming primitives to the low-level facilities required to support graph-oriented communications and synchronization. The implementation is on top of the parallel virtual machine in a local area network of Sun workstations. We focus our discussion on the following four aspects: the software architecture, including the structure of the runtime system and the interfaces between user programs and the runtime kernel; graph representation; implementation of graph operations; and performance of the run-time system in terms of the implementation of graph-oriented communications.
© 2003 Elsevier Inc. All rights reserved.
This research is partially supported by the Hong Kong Polytechnic University under the Grant H-ZJ80. The first author is currently visiting Nanjing Univ. under the support by the National Key Lab Visitorship Fund from the Education Ministry of PR China.
* Corresponding author. Address: Department of Computing, Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong. Tel.: +852-2766-7275; fax: +852-2774-0842. E-mail address: [email protected] (J. Cao).
0164-1212/$ - see front matter 2003 Elsevier Inc. All rights reserved. doi:10.1016/S0164-1212(03)00099-2

1. Introduction
It has been well recognized that programming a parallel/distributed system is much more difficult than programming a centralized system. Many functions such as parallel execution, task mapping, interprocess communication, synchronization, and reconfiguration are very difficult to program. Supporting tools and environments can greatly simplify the programming task and are therefore in high demand. However, the methods supported by most existing facilities provide only separate and flat programming primitives. There is a lack of an integrated approach to programming parallel/distributed software in a high-level, structured way.
Graphs have been widely used in modeling, specification, and design of parallel and distributed systems. Many parallel and distributed programs can be expressed as a collection of parallel functional modules whose relationships can be defined by a graph (Chang, 1982). Often, the basic functions of communication and coordination of the parallel modules are expressed in terms of the underlying graph. Furthermore, parallel/distributed graph algorithms are used to realize various control functions. However, existing programming support provides only low-level message-passing primitives, so in order to implement the graph-based algorithms we need to manually translate the high-level model into underlying low-level programming
J. Cao et al. / The Journal of Systems and Software 72 (2004) 389–399
constructs. To facilitate the implementation of these algorithms, it is desirable to have an integrated approach that provides direct support for efficient operations on graphs.
In previous papers, we have proposed the GOP framework for programming reconfigurable parallel/distributed computer software (Cao et al., 1995; Cao et al., 1996). In our context, a parallel/distributed program consists of a collection of local programs (LPs), which can be run on multiple physical machines. LPs interact directly and explicitly with one another and coordinate to realize a common goal. Based on a novel graph-oriented programming model, GOP aims at providing high-level, convenient abstractions for configuring and programming the coordination of LPs based on user-specified graphs. The GOP framework consists of a language-level graph construct and a collection of software facilities that support graph-oriented distributed programming. GOP does not attempt to completely hide the message passing nature of the underlying hardware. The programmer is still given a message passing view of the system, but it is an abstract one, defined in terms of the user-specified logical graph.
Graphs have been implemented in some programming languages as distributed data structures (Bal, 1990; Carriero et al., 1986). A distributed data structure contains a set of data values, a collection of operations to access and manipulate these values, as well as facilities for distributing the data values to different sites and accessing them independently of their physical location (Peleg, 1990; Totty, 1992). In contrast, in the GOP model, graphs are used mainly as distributed control structures, consisting of a collection of linguistic primitives that directly support distributed graph operations required in different contexts of distributed control. Graphs are associated with the functions of local processes, rather than their data values.
They can serve the purposes of naming, grouping and configuring parallel/distributed tasks, and/or can be used as the underlying structure of mechanisms for implementing uniform message passing and process coordination.
In GOP, a distributed program consists of a user-specified logical graph to be associated with the LPs and their relationships, and a collection of local functions defined in terms of the graph and invoked by messages traversing the graph. The programmer can first define a graph specifying the configuration structure of LPs in a parallel/distributed program and then write code for implementing the program in terms of the graph and the operations on the graph. Programs are conceptually sequential, but augmented with primitives for binding LPs to vertices in a graph, with the distributed graph operations and inter-node communications completely hidden from the users. In this way, the programmer is relieved from the burden of writing code for implementing low-level message passing, task mapping and
graph operations but instead can concentrate on the structure and the logic of the distributed program. Furthermore, the programmer is given enough flexibility to exploit the semantics of the graph construct to deal with different aspects of distributed programming, such as dynamic reconfiguration, communication, and synchronization in an integrated way. A rich set of primitives should be provided for the specification of graphs and for various operations on them.
The realization of the graph-oriented model in GOP depends on an efficient implementation of the logical graph construct, which provides a logical abstraction of the underlying graph operations, simplifying the programming task. The GOP runtime system described in this paper provides an interface with a library of programming primitives to the low-level facilities required for supporting graph-oriented system configuration, process communications and synchronization. Among others, the following functions need to be implemented:
• Distribution of graphs: The first task is to design the representation of a graph in a distributed environment. A graph can be either directed or undirected, and can be represented in three different ways: centralized, partitioned and replicated. Implementations of graph operations will vary for different forms of representation.
• Mapping: The second task is to manage the mapping of graphs to underlying network processors. If the user specifies the mapping, the solution to the problem becomes straightforward. Otherwise the execution system needs to explore task scheduling techniques in order to make efficient use of system resources and/or to speed up the computations.
• Operations on graphs: The third task is the identification and the distributed implementation of various graph operations. Results of existing work on message passing protocols, group communications, and distributed graph algorithms can be used to help the implementation.
Operations on graphs can be categorized into several classes (Cao et al., 1995), including communication among the vertices of a graph, subgraph generation, graph update, and query. By placing the graph operation management facilities inside the control structures maintained by the system (rather than as part of any particular algorithm), information about the system environment and the graph's distribution can be exploited to optimize the implementation.
This paper presents the design and implementation of the run-time system supporting the GOP graph-oriented model. The implementation is on top of the parallel virtual machine (PVM) (Geist et al., 1994) in a local area network of Sun workstations. Issues related to the implementation of graph operations in a distributed
environment are discussed. Performance of the runtime system is evaluated by estimating the overheads associated with using GOP primitives as opposed to PVM. The rest of the paper is organized as follows. Section 2 briefly describes the GOP model. Section 3 discusses how GOP can be applied to programming parallel and distributed systems. Section 4 presents the architecture design of the GOP run-time system, as well as the method of graph representation in a heterogeneous environment. Section 5 describes the implementation of distributed graph operations and discusses the performance evaluation results. Section 6 concludes the paper with the discussion of our future work.
2. The graph-oriented programming model
The design of the runtime system is shaped by the requirements of graph-oriented parallel/distributed programming, together with other design goals including efficiency and portability across diverse system platforms. In this section, we briefly describe the graph-oriented model, which allows programmers to write distributed programs based on user-specified graphs.
We define a distributed program as a collection of LPs that may execute on several processors. Each LP performs operations on available data at various points in the program and communicates with other LPs. Parallelism is expressed through explicit creation of LPs, and communication between LPs is solely via message passing.
A graph $G(V, E)$ consists of a finite set $V$, whose members are called vertices, and a finite set $E$, whose members are called edges. We propose to have graphs as a language-level control construct, which consists of:
• a logical graph, whose vertices are associated with LPs and whose edges are associated with relationships between LPs. The conceptual graph can either be a logical one with no relation to the physical interconnection structure of the distributed system, or it can reflect what is actually occurring in the underlying network;
• an LPs-to-vertices mapping, which allows the programmer to bind LPs to specific vertices;
• an optional vertices-to-processors mapping, which allows the programmer to explicitly specify the mapping of the logical graph to an underlying network of processors. When the mapping specification is omitted, the underlying run-time system transparently performs the mapping;
• graph-oriented programming primitives.
Fig. 1 illustrates the components of the GOP model. The programmer first defines variables of the graph construct in a program and then creates an instance of the construct. The following steps are needed:
Program Level: LP types
Logical Graph Level: Instances of LP types mapped to logical graph vertices
Physical Network Level: Graph vertices mapped to network nodes
Fig. 1. Components of the GOP model.
Step 1: define the logical graph construct describing the logical relationship between LPs;
Step 2: define the mapping of the LPs to the graph's vertices and, if necessary, the mapping of the graph vertices to the underlying network nodes;
Step 3: instantiate the construct and associate a name with the instance.
Once the local context for the graph instance is set up, communication and coordination of LPs can be implemented by invoking operations defined on the specified graph.
A simple example will elaborate the basic idea behind the model (Cao et al., 1995). The example is to implement a global sum on a hypercube. There are two types of processes in the program: the Coordinator process, which receives and distributes the global sum, and the Participant processes, which calculate and submit their own partial sums and then collect the final global sum from the Coordinator. One way to write the program is to derive a spanning tree from the original graph, i.e. the hypercube, and base the computation on the tree (Schwan and Bo, 1990). This gives an efficient implementation, since the number of messages sent between the processes is minimal. Using basic message passing primitives such as send and receive, the program will be difficult to write and hard to read. Using the proposed graph-oriented model, the spanning tree can be derived automatically by the programming system; the programmer simply invokes such an operation and then bases the computation on the resulting spanning tree, using primitives such as Receive, SendToChildren, ReceiveFromChildren, and SendToParent.
In addition to graph-oriented communication, GOP also allows the programmer to exploit the semantics of the graph construct to deal with other aspects of distributed programming, such as dynamic reconfiguration and process synchronization. The operations on a user-specified graph can be categorized into several classes:
• Communication and Synchronization.
These operations provide various forms of communication primitives which can be used to pass messages from one node
to one or more other nodes in the graph. These primitives can be used by the LPs associated with the nodes to communicate with each other and to synchronize their operations without knowing the low-level details such as name resolution, addressing and routing. The graph construct also provides a clean and consistent way of integrating multiple, different communication primitives. In particular, the graph construct provides an ideal vehicle for group communication services. LPs structured into a graph can form a group where multicast/broadcast communication primitives can be used to achieve consistent global views.
• Subgraph generation. These operations derive subgraphs such as a shortest path between two nodes and spanning trees of a graph. Many distributed algorithms rely on the construction of some form of subgraph of the underlying control graph.
• Update. In GOP, both the logical structures of the LPs and the underlying processor network interconnection can be dynamically changed. The update operations provide the programmer with the capability to dynamically insert edges and nodes into a graph and/or delete them from it. Also, mappings can be dynamically changed during the program execution. These primitives are very useful for dynamic control functions such as dynamic reconfiguration of distributed programs.
• Query. These operations provide information about the graph, such as the number of nodes in a graph, the current location of an LP, and whether an edge exists between two nodes. This information is useful for many system control and configuration functions.
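As an illustration of the global-sum example described above, the following is a minimal, purely sequential sketch in C. It is not GOP code: the function names `tree_parent` and `hypercube_global_sum` are hypothetical, the spanning tree is assumed to be the binomial tree rooted at vertex 0 (one possible spanning tree of a hypercube, not necessarily the one GOP derives), and message sends are simulated by array updates while a counter records how many messages would be exchanged.

```c
#include <assert.h>

#define MAX_DIM   6
#define MAX_NODES (1 << MAX_DIM)

/* Parent of vertex v (v > 0) in a binomial spanning tree of the hypercube
 * rooted at vertex 0: clear the lowest set bit of v. A vertex and its parent
 * differ in exactly one bit, so every tree edge is also a hypercube edge. */
int tree_parent(int v)
{
    return v & (v - 1);
}

/* Simulates a global sum over a hypercube of dimension dim.
 * local[v] is the partial sum held by vertex v; on return, result[v] holds
 * the value known to vertex v (the global sum) and *messages holds the
 * number of simulated messages. Returns the global sum. */
int hypercube_global_sum(const int local[], int dim, int result[], int *messages)
{
    int n, v;

    assert(dim >= 0 && dim <= MAX_DIM);
    n = 1 << dim;
    *messages = 0;
    for (v = 0; v < n; v++)
        result[v] = local[v];

    /* Upward phase (role of SendToParent): children always have larger
     * indices than their parent, so going from the highest index down
     * ensures a vertex has heard from all its children before it reports. */
    for (v = n - 1; v > 0; v--) {
        result[tree_parent(v)] += result[v];
        (*messages)++;
    }

    /* Downward phase (role of SendToChildren): parents have smaller indices,
     * so ascending order ensures the parent already holds the final sum. */
    for (v = 1; v < n; v++) {
        result[v] = result[tree_parent(v)];
        (*messages)++;
    }
    return result[0];
}
```

Under these assumptions, each phase uses one message per tree edge, so a hypercube with n vertices needs 2(n - 1) messages in total.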
The GOP model has the desirable features of expressiveness and simple semantics. With GOP, parallel/distributed execution is under control of the programmer, but the physical distribution of the hardware and the programming of low-level operations can be hidden from the programmer. Furthermore, sequential programming constructs blend smoothly and easily with distributed constructs in GOP.

3. Applications of GOP
GOP can be applied to programming parallel and distributed algorithms for solving various problems. In our earlier papers (Cao et al., 1995; Cao et al., 1996), we have shown two simple examples of programming using GOP, a global sum on a hypercube and parallel matrix multiplication on a mesh. Since then, we have used GOP to build more parallel and distributed programs, including parallel merge sorting, distributed dining philosophers, and parallel back-propagation on a grid. In this section, we illustrate how these applications are programmed neatly using GOP with an example: the parallel merge sort. The experiences with building these applications and their performance will be reported in a separate paper.
The parallel merge sort algorithm is from Magee et al. (1993). The GOP program for this example has two kinds of LPs: one is a "main" LP and the others are the "sort" LPs. Each "sort" LP takes as its input two arrays of ordered integer values and merges them to produce as its output a single array of ordered values. The "main" LP specifies the logical graph construct, a "binary tree" structure, and defines the mappings. PutData() stores the integer values that need to be ordered into a database. The "sort" LP on a leaf will get the values from this database by using GetData().
Programming of the "sort" LP is very clear and easy. If the "sort" LP is on a leaf, then it gets two integer values from the database, merges them into one single array and sends it to its parent. If the "sort" LP is not on a leaf, it receives arrays of ordered values from its children, merges them, and sends the result to its parent. If the "sort" LP has no parent, it is the root of the tree and it prints the result of the merge sort.
A good scalability property can be drawn from this example. Although two arrays of ordered integer values are required as the inputs of each "sort", we can extend this to any number. That means the merge sort we write does not limit the tree structure to a "binary tree"; it can be any kind of tree. We can modify the tree structure in the "main" program and need not modify the "sort" program to cater for the structural change. The C code for the "main" LP and the "sort" LP is shown below.

    /* "main" LP: defines the binary tree, the mappings, and starts execution */
    Edge edges[] = {{0,1,0}, {0,2,0}, {1,3,0}, {1,4,0}, {2,5,0}, {2,6,0}};
    int nds[] = {0,1,2,3,4,5,6};
    int data[8] = {11, 9, 23, 21, 78, 12, 55, 43};

    int main(int argc, char **argv)
    {
        Graph g;
        char *gname;
        int gid;

        gid = DefineGraph(&g, 6, edges);      /* six edges of the tree */
        MapLpToNodes(&g, "sort", 7, nds);     /* bind the "sort" LP to all seven nodes */
        MapNodeToHost(&g, 0, "cssolar36");
        MapNodeToHost(&g, 1, "csultra5");
        MapNodeToHost(&g, 2, "cssolar36");
        MapNodeToHost(&g, 3, "cssolar36");
        MapNodeToHost(&g, 4, "cssolar36");
        MapNodeToHost(&g, 5, "cssolar36");
        MapNodeToHost(&g, 6, "cssolar36");
        DispatchGraph(&g);
        SetGraphName(gid, "tree");
        PutData(data, 8, "/tmp/sort.data");
        StartExecute(gid);
    }

    /* "sort" LP: runs on every node of the tree */
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int gid, tid, nid;
        int info;
        int data1[NUM], data2[NUM];
        char *str;
        int data_num;
        Msg msg;
        int i;
        FILE *datafile;

        gid = GetGid("tree");
        if (!IsLeaf(gid)) {                   /* not a leaf */
            info = RecvFromChildren(gid, 0, 0, &msg);
            copyData(data1, msg.data, msg.num);
            while (info != 0) {
                info = RecvFromChildren(gid, 0, 0, &msg);
                copyData(data2, msg.data, msg.num);
                data_num = sort(data1, data2);
            }
        } else {                              /* is a leaf */
            data_num = GetData("/tmp/sort.data", 1, data1);
            data_num = GetData("/tmp/sort.data", 1, data2);
        }
        if (HaveParent(gid) != 0) {
            PrepareMsg(&msg, 0, data_num, data1, "");
            SendToParent(gid, 1, 0, msg);
        } else {                              /* root: print the result */
            for (i = 0; i < data_num; i++) {
                printf("%d\n\t", data1[i]);
            }
        }
        LeaveNow();
    }

4. The design of the GOP runtime
Fig. 2 shows the architecture framework of the graph-oriented distributed programming environment. As shown in the figure, the runtime system is responsible for graph representation, graph operations and graph mapping. It mainly consists of a kernel and a library of programming primitives. A copy of the run-time kernel is run on each site in the distributed system. However, the kernel on a site is not required to be started before a user program begins execution on that site. It will be started implicitly by a remote kernel when a user program calls the library function for the first time. For this reason, the kernel at each site maintains a Local Kernel List. When the local kernel needs to interact with a kernel on another site, the local kernel first checks this list to see whether a kernel has been started at that site. If not, it will spawn a kernel on the remote site, record the relevant information, and then multicast a message to all other kernels so that they can update their local kernel lists.
The kernel is responsible for assigning a system-wide, unique identifier (gid) to a graph, maintaining the graph representation and related system information, receiving and handling messages sent by the LPs, and spawning a kernel on a remote host when needed. Fig. 3 illustrates the internal structure of the runtime kernel. Details of the kernel functional modules are described in the following sub-sections.

4.1. Distributed representation of graphs
User-specified graphs in GOP can be represented in primarily three ways, centralized, replicated and distributed, depending on the structuring of the run-time system. The greatest differences between these approaches are whether there is a centralized site/process to coordinate the graph operations and how much
Fig. 2. The GOP architecture framework.
[Fig. 3 diagram labels: LP, Library Interface, local kernel, Remote Kernel, message queue, message entry (1, 2, 3), message processing, Kernel tasks, Local graph representation. Entry 1: system maintenance; entry 2: system control; entry 3: function invoking.]
Fig. 3. Internal structure of a kernel.
coordination is required to maintain a consistent view of the graph. In our design, a decentralized approach is taken, where no process has the whole picture of the logical graph. Instead, each process maintains only partial information about the graph. Kernels on all sites cooperate to perform an operation based on the partitioned graph stored at different sites. The operations at different sites can be run in serial or in parallel. Control messages are used to synchronize the execution of localized algorithms in the LPs.
Common graph representation methods include the adjacency matrix and the adjacency list. Both can be adopted in representing the logical graphs. They are partitioned and stored on different sites of the system. At each site, only the information about the adjacent sites is maintained. We found that the two forms of graph representation have their respective merits and shortcomings. Implementation of distributed graph algorithms with an adjacency matrix is simpler and more efficient. On the other hand, adjacency lists are easy to decompose into different parts and are thus an appropriate graph representation for a distributed approach. Moreover, the size of the adjacency lists is determined by the connectivity rather than the number of vertices of the graph.
In our design, graphs are directed but must be acyclic. Using directed graphs is more convenient in representing logical relationships between LPs. It is worth noting that directed graphs do not necessarily impose constraints on inter-node communications. In fact, we adopted a complete graph approach for modeling the network communication, where any node can communicate with any other node in the network.
Each graph is assigned a system-wide, unique identifier, which is used to invoke operations on the
graph. The structure of the graph identifier is shown in Fig. 4. Multiple graphs can exist in the system and operate in parallel. A data structure, the Graph table, is used to record the names and corresponding graph identifiers in the system.

    struct NameMap {
        char gname[NAME_LEN];
        int gid;
    };

To reduce the overhead of graph storage, we design the graph representation such that graph vertices are shared between graphs and their subgraphs, which are marked by the graph identifier's first eight bits and also by the graph edges' mask tag. When we find a vertex and add it to a subgraph, we actually set this vertex's mask and its edges' masks in the subgraph to the corresponding masks and generate a new graph identifier to represent the subgraph. Although this technique makes the system somewhat more complicated, it reduces memory costs.
The kernel maintains the following data structure for the local graph representation, which is illustrated in Fig. 5.
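The mask-based sharing of vertices and edges between a graph and its subgraphs can be sketched as follows. This is a minimal illustration under stated assumptions, not the GOP kernel's code: the type `MaskedEdge`, the functions `mark_edge` and `count_edges`, and the two bit assignments are all hypothetical. The point is only that a subgraph is recorded by setting a bit on existing edges instead of copying them.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bit assignments within an 8-bit edge mask tag. */
#define MASK_SPANNING_TREE 0x01u
#define MASK_SHORTEST_PATH 0x02u

struct MaskedEdge {
    int from, to, weight;
    unsigned char mask;   /* one bit per subgraph class the edge belongs to */
};

/* Marks edge i as belonging to the subgraph class identified by bit.
 * The edge itself is not copied; only its mask changes. */
void mark_edge(struct MaskedEdge edges[], size_t i, unsigned char bit)
{
    assert(bit != 0);
    edges[i].mask |= bit;
}

/* Counts the edges of the subgraph selected by bit.
 * Passing bit == 0 selects every edge of the parent graph. */
size_t count_edges(const struct MaskedEdge edges[], size_t n, unsigned char bit)
{
    size_t i, count = 0;

    for (i = 0; i < n; i++) {
        if (bit == 0 || (edges[i].mask & bit) != 0)
            count++;
    }
    return count;
}
```

With this layout, storing an additional subgraph costs no extra edge records, which is the memory saving the paragraph above refers to; the price is that every traversal of a subgraph has to filter edges by mask.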
Bits 31-28: SM | Bits 27-20: MK | Bits 19-16: DN | Bits 15-8: APN | Bits 7-0: GN
SM - subgraph tag, which defines if this graph is some graph's subgraph
MK - mask tag
DN - identifier number of daemon
APN - application number
GN - relative graph number
Fig. 4. Global graph identifier.
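The field layout of Fig. 4 can be illustrated with a small packing and unpacking sketch. The bit boundaries used here (SM in bits 31-28, MK in 27-20, DN in 19-16, APN in 15-8, GN in 7-0) are an assumption read off the figure's tick marks, and the names `GidFields`, `gid_pack` and `gid_unpack` are hypothetical. An unsigned 32-bit type is used to keep the shifts well defined, whereas the structures in the paper store the identifier as an int.

```c
#include <assert.h>
#include <stdint.h>

/* The five fields of a global graph identifier (see Fig. 4). */
struct GidFields {
    unsigned sm;   /* subgraph tag, assumed 4 bits  */
    unsigned mk;   /* mask tag, assumed 8 bits      */
    unsigned dn;   /* daemon number, assumed 4 bits */
    unsigned apn;  /* application number, 8 bits    */
    unsigned gn;   /* relative graph number, 8 bits */
};

/* Packs the fields into one 32-bit identifier. */
uint32_t gid_pack(struct GidFields f)
{
    assert(f.sm <= 0xFu && f.mk <= 0xFFu && f.dn <= 0xFu);
    assert(f.apn <= 0xFFu && f.gn <= 0xFFu);
    return ((uint32_t)f.sm << 28) | ((uint32_t)f.mk << 20) |
           ((uint32_t)f.dn << 16) | ((uint32_t)f.apn << 8) | (uint32_t)f.gn;
}

/* Splits a 32-bit identifier back into its fields. */
struct GidFields gid_unpack(uint32_t gid)
{
    struct GidFields f;

    f.sm  = (gid >> 28) & 0xFu;
    f.mk  = (gid >> 20) & 0xFFu;
    f.dn  = (gid >> 16) & 0xFu;
    f.apn = (gid >> 8) & 0xFFu;
    f.gn  = gid & 0xFFu;
    return f;
}
```

Under this layout, a subgraph identifier would differ from its parent's identifier only in the SM and MK fields, so the parent can be recovered by clearing those bits.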
    struct IntraGraph {
        int gid;
        struct IntraItem *gitem;
    };
    struct IntraItem {
        struct IntraNode *ndp;
        struct IntraItem *next;
    };
    struct IntraNode {
        int nid;
        struct Edge_Link *in_list, *out_list;
        int close_tag;
    };

[Fig. 5 diagram labels: (a) IntraGraph entries gid_1, ..., gid_n, each with a chain of IntraItem records referring to IntraNode records node1, node2, node3, ...; (b) mask tag fields: spanning-tree, shortest-path, unused bits, type of parent graph.]
Fig. 5. (a) Internal graph structure and (b) composition of mask tag.

The graph edges' mask tag has eight bits. Except for the highest two bits, which are used to record the type of the parent graph, each bit of the mask tag represents a class of subgraphs such as shortest path, spanning tree, etc. Different subgraphs have different masks. The mask for a spanning tree also contains information about the root vertex.
The kernel also maintains information about the LP-Vertex-Host mapping. Each kernel maintains the following data structure, which holds the global mapping structure (see Fig. 6). The kernels on different sites cooperate to maintain the consistency of the information.

    struct IntraMap {
        int gid;
        struct Map *mapitem;
    };
    struct Map {
        int nid;
        char lpname[NAME_LEN];
        char hname[NAME_LEN];
        char node_where[NAME_LEN];
        int uml_tag;            /* unmap lp tag */
        int umh_tag;            /* unmap host tag */
        struct Map *next;
    };

[Fig. 6 diagram labels: IntraMap entries gid1, gid2, ..., each with a chain of Map records nid1, nid2, nid3, ...]
Fig. 6. Internal structure for mapping information.
Another data structure, called the Active processes table, is maintained; it records the task identifiers of all current processes together with related information. The system uses it to find the tid for an LP on a graph's node.

    struct ActiveMap {
        int gid;
        int nid;
        char lpname[NAME_LEN];
        int tid;
        char node_where[NAME_LEN];
    };

4.2. Handling messages
Various operations, including graph operations, are carried out by the kernel after it receives a function invocation message from a user program. The local kernel may need to interact and cooperate with other kernels residing on remote hosts. The kernel enters a loop after initialization and then waits for incoming messages from either LPs or remote kernels. There are three types of message entries: system maintenance, system control and function invocation. The system maintenance entry mainly deals with messages such as those updating the kernel list, the active application table, and others. The system control entry mainly deals with cooperation messages sent by other daemons. The function invoking entry deals with function calls from the user programs. These entries are extensible and independent of each other; entries can be added or deleted flexibly.

4.3. Library of programming primitives
The GOP runtime system provides users with a library of programming primitives as an interface between user programs and the kernel. User programs send request
messages to the kernel by calling library functions. After receiving the requests, the kernel dispatches them to different message entries according to the message tag attached. Then different graph operations are invoked. Functions in the library do not perform the real graph operations but only form request messages for users and then send them to the kernel. Graph operations are carried out by subtasks of the kernel. This separates user programs from the management of global graph structures. Fig. 7 outlines the operations occurring at the library interface and in the kernel, respectively.
In library interface:

    int graph_operation(arguments)
    {
        /* ensure kernel has been started */
        /* get the caller's tid and the kernel's tid */
        initial(&call_tid, &d_tid);
        pack the message tag;
        pack the caller's tid;
        pack other information required;
        send message package to the kernel: d_tid;
        if (required) wait for the reply;
    }

In the kernel:

    main()
    {
        do initialisation work;
        enter work loop {
            receive message;
            unpack msg tag;
            switch (msg tag) {
                entry1: unpack other arguments;
                        processing message;
                entry2: . . .;
                entry3: . . .;
            }
        }
    }
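The kernel side of the outline above can be made concrete with a small, self-contained sketch of tag-based dispatch. It assumes an in-memory array in place of the real message queue and does not use PVM; the names `EntryTag`, `Request`, `KernelStats`, `kernel_dispatch` and `kernel_work_loop`, as well as the numeric tag values, are hypothetical and chosen only to mirror the three message entries described in Section 4.2.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tag values for the three message entries. */
enum EntryTag {
    ENTRY_MAINTENANCE = 1,  /* e.g. kernel list updates        */
    ENTRY_CONTROL     = 2,  /* cooperation with other kernels  */
    ENTRY_INVOKE      = 3   /* function calls from user LPs    */
};

struct Request {
    int tag;         /* selects the message entry            */
    int caller_tid;  /* task identifier of the sender        */
    int arg;         /* stand-in for the remaining arguments */
};

struct KernelStats {
    int maintenance, control, invoked, rejected;
};

/* Routes one request to its entry. Returns 0 on success and -1 if the
 * tag does not correspond to any known entry. */
int kernel_dispatch(const struct Request *req, struct KernelStats *stats)
{
    switch (req->tag) {
    case ENTRY_MAINTENANCE:
        stats->maintenance++;
        return 0;
    case ENTRY_CONTROL:
        stats->control++;
        return 0;
    case ENTRY_INVOKE:
        stats->invoked++;
        return 0;
    default:
        stats->rejected++;
        return -1;
    }
}

/* Work loop over an in-memory queue; returns the number of requests
 * that were handled successfully. */
int kernel_work_loop(const struct Request queue[], size_t n,
                     struct KernelStats *stats)
{
    size_t i;
    int handled = 0;

    for (i = 0; i < n; i++) {
        if (kernel_dispatch(&queue[i], stats) == 0)
            handled++;
    }
    return handled;
}
```

Keeping each entry behind its own case label reflects the stated design goal that entries are independent of each other: adding an entry means adding a tag and a case, without touching the others.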
Primitives are provided for operations of different categories, such as communications, graph mapping and reconfiguration, and sub-graph generation. Communication primitives include SendToParent, SendToChildren, SendToNeighbors, SendToNode, SendToAll, RecvFromParent, RecvFromChildren, RecvFromNeighbor, and RecvFromNode. Primitives for supporting graph reconfiguration and graph mappings include AddNode, DeleteNode, AddEdge, DeleteEdge, MapLpToNodes, MapNodeToHost, UnMapLpFromNode, and UnMapNodeFromHost. Other primitives such as GetGid,
SetGname, and StartExecute are also provided to spawn each task of the graph node.
In the rest of this section, we focus on the distributed graph search and subgraph operations. These operations derive subgraphs such as a shortest path between two vertices and spanning trees. Many distributed algorithms rely on the construction of some form of subgraph of the underlying control graph. In our initial design, three primitives are provided by the run-time system to support sub-graph generation operations: shortest path, minimum-weight spanning tree, and depth-first search.
The primitive for Shortest Path has the following format:

    int G = shortest_path(int graph_id, int start_vertex, int end_vertex)

The primitive is used to find the shortest path from one vertex to another vertex in a graph whose identifier is specified as a parameter. The starting vertex and the ending vertex specify the start and end point of the shortest path in the graph. The resulting shortest path is represented as a graph, which is distributed onto different sites.
The format of the primitive for minimum-weight spanning tree is:

    int G = spanning_tree(int graph_id)

This primitive is used to calculate a minimum-weight spanning tree from the given graph, with the vertex to which the calling process is bound as the root. The resulting tree is represented as a graph, which is distributed onto different sites.
The format of the primitive for depth-first search is:

    int flag = depth_first_search(int graph_id, int target_vertex)

The primitive is used to find the location of the target vertex in the specified graph by depth-first search. The result is a flag. A false flag is returned if the target process is not found. Otherwise, the location of the target vertex, which is the identifier of the physical processor onto which the vertex is mapped, will be returned to the requesting process.
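The behaviour described for depth_first_search can be sketched on a single machine as follows. This is a centralized illustration, not the distributed implementation: the type `SmallGraph`, the functions `dfs_visit` and `dfs_locate`, the adjacency-matrix storage, and the use of -1 as the false flag are all assumptions made for the example, and host identifiers are assumed to be non-negative.

```c
#include <assert.h>

#define MAX_VERTICES 16
#define NOT_FOUND    (-1)   /* plays the role of the "false flag" */

struct SmallGraph {
    int n;                                /* number of vertices             */
    int adj[MAX_VERTICES][MAX_VERTICES];  /* adj[u][v] != 0: edge u -> v    */
    int host[MAX_VERTICES];               /* processor id vertex v maps to  */
};

/* Recursive depth-first visit from u. Returns the host of target if it
 * is reachable from u, NOT_FOUND otherwise. */
static int dfs_visit(const struct SmallGraph *g, int u, int target,
                     int visited[])
{
    int v, found;

    if (u == target)
        return g->host[u];
    visited[u] = 1;
    for (v = 0; v < g->n; v++) {
        if (g->adj[u][v] != 0 && !visited[v]) {
            found = dfs_visit(g, v, target, visited);
            if (found != NOT_FOUND)
                return found;
        }
    }
    return NOT_FOUND;
}

/* Searches for target starting at the vertex the caller is bound to. */
int dfs_locate(const struct SmallGraph *g, int start, int target)
{
    int visited[MAX_VERTICES] = {0};

    assert(g->n > 0 && g->n <= MAX_VERTICES);
    assert(start >= 0 && start < g->n);
    return dfs_visit(g, start, target, visited);
}
```

Because edges are directed, the search only follows outgoing edges, so a vertex that exists in the graph but is not reachable from the starting vertex is also reported as not found in this sketch.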
For finding the shortest path and the minimum-weight spanning tree, we chose to modify two sequential algorithms, the classical Dijkstra’s algorithm (Dijkstra, 1959) and Prim’s algorithm (Prim, 1957), for distributed implementation. Both algorithms are easy to understand and similar in nature. More importantly, the way the algorithms operate can be easily adapted to a distributed implementation based on our distributed graph representation method.

First, Dijkstra’s algorithm for finding single-source shortest paths is modified to make it suitable for a decentralized implementation. The idea of Dijkstra’s algorithm is as follows. Let T be a tree rooted at the source vertex s, let v be any other vertex, and let distance(s, v) be the length of the path in T from s to v. T is a shortest-path tree if and only if each edge (v, w) in the graph satisfies the inequality:

$$\mathrm{distance}(s, v) + \mathrm{length}(v, w) \geq \mathrm{distance}(s, w)$$

In Dijkstra’s algorithm, the shortest-path tree grows from the source, one edge at a time. The algorithm iterates over the following steps:

• First, find the closest node among the unvisited nodes, that is, those not yet examined.
• Second, if the corresponding edges fulfill the above inequality, add these edges to the shortest-path tree. In either case, mark this closest node as visited.

The iteration terminates when all the nodes are marked visited, which means that all the nodes have been examined. Dijkstra’s algorithm is a greedy algorithm, in which the optimal solution is obtained by means of locally optimal decisions.

In our modification, whenever a process receives an AWAKE message, the steps of the iteration are executed once. First, the running process adds the edge connecting itself and the closest visited node to the shortest-path tree. The information stored in the partitioned graph is sufficient for this step. Then it finds the closest unvisited node from the information given by the sender of the AWAKE message. Next, it sends an AWAKE message to the chosen node. The operation terminates when all the processes have performed the iteration once.

There are two major approaches to calculating a minimum-weight spanning tree of a graph. One approach starts with a single node and enlarges the tree greedily. The other approach treats different vertex sets as trees in a forest and adds edges to connect these trees into a single tree (Gallager et al., 1983).
The first approach is more suitable for our purpose, where the invoking process will be the root of the resulting tree. One such method was proposed by Prim (Prim, 1957). Prim’s algorithm is similar to Dijkstra’s algorithm. It starts by choosing a vertex u as the root of the spanning tree. At each stage, a new vertex is added to the minimum spanning tree by choosing the edge (u, v) from a tree node u to a non-visited node v such that the weight of this edge is the smallest among all such edges. By repeating this step, the spanning tree grows. The algorithm finishes when the minimum spanning tree spans the whole graph. The algorithm can be easily modified for distributed implementation, where nodes are explored by passing messages from one to another.
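The similarity between the two sequential algorithms can be made concrete with a short sketch. The Python function below is a centralized illustration only, not the distributed GOP implementation: the name grow_tree, the prim flag, and the adjacency-dictionary layout are assumptions made for this example. Both algorithms repeatedly pick the closest unvisited vertex; they differ only in the key used to measure "closest".

```python
import heapq


def grow_tree(graph, root, prim=False):
    """Grow a tree from `root`, one edge at a time (sequential sketch).

    graph: dict vertex -> dict of neighbor -> edge weight (undirected, so
           every edge is listed in both directions; weights non-negative).
    prim=False: key is the distance from the root (Dijkstra).
    prim=True:  key is the weight of the connecting edge alone (Prim).
    Returns (parent, key) for all vertices reachable from `root`.
    """
    parent = {root: None}
    key = {root: 0}
    visited = set()
    heap = [(0, root)]
    while heap:
        k, u = heapq.heappop(heap)
        if u in visited:
            continue  # stale heap entry, u was already examined
        visited.add(u)  # u is the closest unvisited vertex
        for v, w in graph[u].items():
            if v in visited:
                continue
            cand = w if prim else k + w
            if v not in key or cand < key[v]:
                key[v] = cand
                parent[v] = u
                heapq.heappush(heap, (cand, v))
    return parent, key


# Hypothetical three-vertex graph on which the two trees differ.
example = {
    "a": {"b": 2, "c": 3},
    "b": {"a": 2, "c": 2},
    "c": {"a": 3, "b": 2},
}
spt_parent, spt_dist = grow_tree(example, "a")             # c hangs off a
mst_parent, mst_key = grow_tree(example, "a", prim=True)   # c hangs off b
```

In the shortest-path tree, c is attached directly to a (distance 3 instead of 4), whereas in the minimum spanning tree c is attached to b, because only the weight of the connecting edge matters there.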
Algorithm. Spanning Tree (gid, nid)

    case SPANNING_TREE:
        unpack the message;   /* includes st_tag, which shows whether this message arrives for the first time (0) or not (1) */
        span_set = NULL;
        if (st_tag == 1)
            build span_set from received data;
        if (st_tag == 0) {
            find if the nid is on the local host;
            if true {
                allocate a new gid for this subgraph: spanningtree;
                send back the new gid: ngid to the caller;
                find this node’s spanning children (that is, nodes which are in this node’s out_list);
                add nodes to the span_set (which is a set of spanning nodes);
                build subgraph’s internal graph structure;
                st_tag = 1;
            } else
                forward this request message to the kernel where nid resides;
        }
        if (st_tag == 1) {
            while (span_set is not empty) {
                get a node from the span_set;
                if (this node is on the local host) {
                    get spanning children and add to span_set;
                    add this node to local subgraph;
                } else
                    send this nid and span_set to the other kernel where this node resides;   /* message tag is SPANNING_TREE */
            }
        }
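The hand-off of the span set between kernels can be simulated in a single process. The following Python sketch is an illustration under stated assumptions, not the GOP kernel code: the function name spanning_subgraph, the host_of and out_list dictionaries, and the seen set carried along with each message are all hypothetical. In particular, the seen set is added here so that the loop terminates on graphs with cycles, and the sketch assumes that a kernel stops its local loop once it forwards the span set to another kernel. It records only which nodes each host adds to its local subgraph.

```python
from collections import deque


def spanning_subgraph(host_of, out_list, root):
    """Simulate SPANNING_TREE message handling across kernels.

    host_of:  dict node -> name of the host the node is mapped to.
    out_list: dict node -> list of child nodes (outgoing edges).
    Returns dict host -> list of nodes in that host's local subgraph.
    """
    local_subgraph = {h: [] for h in set(host_of.values())}
    # The first message goes to the kernel holding the root node.
    messages = deque([(host_of[root], [root], {root})])
    while messages:
        host, span_set, seen = messages.popleft()
        span_set = list(span_set)
        while span_set:
            node = span_set.pop(0)
            if host_of[node] == host:
                # Local node: queue its spanning children and record it.
                for child in out_list.get(node, []):
                    if child not in seen:  # assumption: skip queued nodes
                        seen.add(child)
                        span_set.append(child)
                local_subgraph[host].append(node)
            else:
                # Remote node: pass it and the rest of the span set on.
                messages.append((host_of[node], [node] + span_set, seen))
                break
    return local_subgraph


# Hypothetical mapping: nodes 1 and 2 on host A, nodes 3 and 4 on host B.
hosts = {1: "A", 2: "A", 3: "B", 4: "B"}
edges = {1: [2, 3], 2: [4], 3: [4], 4: []}
result = spanning_subgraph(hosts, edges, 1)
```

With this mapping, host A processes nodes 1 and 2 locally and then forwards node 3, together with the remaining span set, to host B, which completes the subgraph with nodes 3 and 4.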
For the graph search operation, we used the distributed DFS algorithm of Awerbuch (1985) in our implementation. The algorithm can be applied to a model similar to our graph-oriented model. Each node processes the messages received from its neighbors, performs the local computations, and then sends messages to its neighbors. In addition, back edges are traversed in parallel, and only tree edges are traversed serially.

5. Implementation and performance evaluation

The GOP run-time system has been implemented on top of PVM (Geist et al., 1994) in a local area network of Sun workstations. PVM provides a unified framework that supports the execution of parallel/distributed programs in a heterogeneous environment. By building the GOP runtime on PVM, we can achieve our design goals of being portable and scalable, and supporting heterogeneous platforms.

The communication model in PVM employs asynchronous blocking send, asynchronous blocking receive, and non-blocking receive functions. PVM ensures reliable message passing and ordering. Each process in the PVM environment, including tasks, groups of tasks, and the daemon pvmd within a virtual machine, is assigned a unique task identifier. Tasks use these identifiers to communicate with each other. In our implementation, we inherit this feature to identify every process in the GOP runtime, including the kernel and the user’s application processes. We use PVM’s asynchronous blocking send to implement synchronized communication, and we use non-blocking receive to check for messages.

The GOP kernel is implemented as a daemon that keeps running from the time the user first calls the library functions until all user tasks terminate. The kernel is therefore spawned by one of the user tasks. The daemons and user programs are treated the same in terms of PVM task management. PVM configures a virtual machine either statically or dynamically. Heterogeneity is one of the most important characteristics of PVM, which we inherit in implementing the GOP runtime system. As a consequence, with a consistent library interface that hides from the user how the programming primitives are implemented, the GOP runtime is highly portable to various platforms. The PVM primitives used in our implementation are summarized in the following table.

PVM primitives used

Function         Description
pvm_bufinfo()    Returns information about the requested message buffer
pvm_exit()       Tells the local workstation that this process is leaving PVM
pvm_initsend()   Clears default send buffer and specifies message encoding
pvm_mcast()      Multicasts the data in the active message buffer to a set of tasks
pvm_mytid()      Returns the task id of the process
pvm_pkint()      Packs the active message buffer with arrays of integers
pvm_parent()     Returns the task id of the process that spawned the calling process
pvm_perror()     Prints the error status of the last PVM call
pvm_recv()       Receives a message
pvm_send()       Sends the data in the active message buffer
pvm_spawn()      Starts new PVM processes
pvm_upkint()     Unpacks the active message buffer into arrays of integers

Fig. 7. Performance evaluation results. (a) Performance of three kinds of communication primitives (SendToParent, SendToChildren, RecvFromParent) in the graph-oriented system: time (μs) versus host number. (b) Overhead compared with pvm_send for SendToParent and SendToChildren versus host number.
We have carried out a performance evaluation study to estimate the overhead of using the programming primitives of GOP as opposed to those of PVM. We use the application global_sum on a hypercube, described in Section 2, as a benchmark program that uses various primitives. The participants can be mapped to several hosts, either the same host as the coordinator or different ones. Experiments were performed with different numbers of hosts to evaluate the cost of the GOP communication primitives. We also compare the performance of these primitives with that of PVM’s communication primitives. Fig. 7(a) shows how the performance is affected by the number of hosts. Fig. 7(b) shows the overhead over PVM incurred by different primitives. The comparison shows that, although the absolute processing time is reduced when we increase the number of hosts, the overhead relative to pvm_send increases (about 40%). This overhead consists not only of the communication cost of using the PVM communication primitives, but also of the processing cost of managing graphs and so on.
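One plausible reading of the relative overhead plotted in Fig. 7(b) is the extra time of a GOP primitive expressed as a fraction of the time of the underlying pvm_send call. The paper does not state the formula, so the definition below, the function name relative_overhead, and the timings in the usage line are assumptions for illustration, not measurements from the experiments.

```python
def relative_overhead(t_gop_us, t_pvm_us):
    """Extra time of a GOP primitive as a fraction of the raw PVM time.

    Both arguments are elapsed times in microseconds. The formula
    (t_gop - t_pvm) / t_pvm is an assumed definition, not taken from
    the paper.
    """
    if t_pvm_us <= 0:
        raise ValueError("PVM baseline time must be positive")
    return (t_gop_us - t_pvm_us) / t_pvm_us


# Made-up timings for illustration only; NOT values from Fig. 7.
overhead = relative_overhead(1400.0, 1000.0)
```

Under this definition, a GOP primitive that takes 1400 μs where pvm_send takes 1000 μs has a relative overhead of 0.4, which is the scale used on the vertical axis of Fig. 7(b).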
6. Conclusions and future work

In this paper, we have presented the GOP graph-oriented framework for programming parallel/distributed systems. We described the design and implementation of the portable run-time system for GOP. Issues such as graph mapping, graph representation strategies, distributed graph algorithms, and underlying implementation platforms were discussed. The GOP runtime has been implemented using PVM in a workstation cluster environment. We measured and analyzed the performance of a benchmark application. The performance comparison and analysis show that the GOP runtime operations are quite efficient in comparison with PVM primitives: there is only a small performance penalty for using GOP. We believe that further optimization should reduce the overhead further.
References

Awerbuch, B., 1985. A new distributed depth-first-search algorithm. Information Processing Letters 20, 147–150.
Bal, H.E., 1990. Programming Distributed System. Silicon Press.
Cao, J., Fernando, L., Zhang, K., 1995. DIG: a graph-based construct for programming distributed systems. In: Proceedings of 2nd International Conference on High Performance Computing. New Delhi, India.
Cao, J., Fernando, L., Zhang, K., 1996. Programming Distributed Systems Based on Graphs, Intensional Programming I. World Scientific Pub. Co., pp. 83–95.
Carriero, N., Gelernter, D., Leichter, J., 1986. Distributed data structures in Linda. In: Proceedings of 13th ACM Symposium on Principle of Programming Languages. St. Petersburg, FL, pp. 236–242.
Chang, E.J.H., 1982. Echo algorithms: depth parallel operations on general graphs. IEEE Transactions on Software Engineering SE-8 (4), 391–401.
Dijkstra, E.W., 1959. A note on two problems in connection with graphs. Numerische Mathematik 1, 269–271.
Gallager, R.G., Humblet, P.A., Spira, P.M., 1983. A distributed algorithm for minimum-weight spanning trees. ACM Transactions on Programming Languages and Systems 5 (1), 66–77.
Geist, A. et al., 1994. PVM: Parallel Virtual Machine: A User’s Guide and Tutorial for Networked Parallel Computing. MIT Press, Cambridge, MA.
Magee, J., Dulay, N., Kramer, J., 1993. Structuring parallel and distributed programs. IEE Software Engineering Journal 8 (2), 73–82.
Peleg, D., 1990. Distributed data structures: a complexity-oriented view. In: Proceedings of 4th International Workshop on Distributed Data Algorithms. LNCS 486, pp. 71–89.
Prim, R.C., 1957. Shortest connection networks and some generalizations. Bell System Technical Journal 36, 1389–1401.
Schwan, K., Bo, W., 1990. Topologies-distributed objects on multicomputers. ACM Transactions on Computer Systems 8 (2), 111–157.
Totty, B.K., 1992. Experimental Analysis of Data Management For Distributed Data Structures. MS Thesis, Dept. of Computer Science, Univ. of Illinois at Urbana-Champaign.

Jiannong Cao received the B.Sc. degree in computer science from Nanjing University, China, in 1982, and the M.Sc. and Ph.D. degrees from Washington State University, USA, in 1986 and 1990, all in computer science. Before joining the Hong Kong Polytechnic University in 1997, where he is currently an associate professor, he was on the faculty of computer science at James Cook University and The University of Adelaide in Australia, and at the City University of Hong Kong. His research interests include parallel and distributed computing, networking, mobile computing, fault tolerance, and distributed programming environments. He has authored or co-authored more than 100 journal and conference papers in the above areas. He is a member of the IEEE Computer Society, the IEEE Communication Society, IEEE, ACM, and the American Association for the Advancement of Science (AAAS). He has served as a reviewer for international journals and conference proceedings, and also as an organizing/programme committee member for many international conferences.
Yin Liu received her M.Sc. in Computer Science from Nanjing University. Her research interests include parallel and distributed computing and fault tolerance.

Li Xie is a Professor in the Department of Computer Science and Technology at Nanjing University. He was the head of the department and vice president of Nanjing University for several years. He is also the director of the Parallel and Distributed Computing Lab in the department. His research interests include parallel and distributed computing, mobile computing, multimedia, and security.

Dr. Bing Mao received his Ph.D. in Computer Science from Nanjing University. He is currently a professor in the Department of Computer Science and Technology at Nanjing University. His research interests include parallel and distributed computing, parallel compilers, and CSCW.

Dr. Kang Zhang received his B.Eng. in Computer Engineering from the University of Electronic Science and Technology, China; and Ph.D. from the University of Brighton, UK. He is currently an associate professor in the Department of Computer Science at University of Texas at Dallas. His research areas are mainly in software visualization, parallel and distributed processing, and Internet computing. He is a senior member of IEEE.