Parallel Computing 29 (2003) 1589–1621 www.elsevier.com/locate/parco
High-level abstractions for message-passing parallel programming

Fan Chan, Jiannong Cao *, Yudong Sun

Software Management and Development Laboratory, Department of Computing, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong

Received 15 January 2003; accepted 15 May 2003
Abstract

Large-scale scientific and engineering computation problems are usually complex, and consequently the development of parallel programs for solving these problems is a difficult task. In this paper, we describe the graph-oriented programming (GOP) model and environment for building and evaluating parallel applications. The GOP model provides higher-level abstractions for message-passing parallel programming, and the software environment offers tools that ease the tasks of parallelizing, writing, and deploying scientific and engineering computing applications. We discuss the motivations and various issues in developing the model and the software environment, present the design of the system architecture and its components, and describe the evaluation of the environment, implemented on top of MPI, with a sample parallel scientific application program. With the support of the high-level abstractions provided by the proposed GOP environment, programming of parallel applications on various parallel architectures can be greatly simplified.
© 2003 Elsevier B.V. All rights reserved.

Keywords: Graph-oriented computing; Cluster computing; Programming environments
This work is partially supported by the Hong Kong Polytechnic University under the research grant H-ZJ80.
* Corresponding author. Tel.: +852-2766-7275; fax: +852-2774-0842. E-mail address: [email protected] (J. Cao).
doi:10.1016/j.parco.2003.05.008

1. Introduction

Message-passing (MP) is one of the popular paradigms used to write parallel programs. MP provides the two key aspects of parallel programming: (a) synchronization
of processes and (b) read/write access for each processor to the memory of all other processors [1]. Parallel scientific applications have traditionally been parallelized using a combination of a sequential programming language such as Fortran or C/C++ with MP facilities such as PVM [2] or MPI [3], allowing virtually complete control over the structure and degree of parallelism as well as the distribution and alignment of data across the architecture. However, programmers still find it difficult to use MP facilities to develop parallel programs, especially for large-scale scientific and engineering computing applications.

First, although the concept of MP is quite simple, developing parallel applications raises new difficulties due to the communication and synchronization of processes. MPI, PVM and other MP interfaces are low-level programming tools. Either their interfaces are simple but force the programmer to deal with low-level details, or their functions are complicated and too numerous for a non-professional programmer to use. The low-level parallel primitives make the writing of real-world parallel applications tedious and error-prone. Second, the main implementation complexity often arises from process management in MP facilities. Applications require the programming of process management, which is not an easy task. Consequently, implementation requires special techniques and skills that differ for each MP facility, and programmers have to rely on their experience, quite often in an ad hoc way.

The ability to develop parallel programs quickly and easily is becoming increasingly important to many scientists and engineers. Today, everyone seems to agree that the acceptance of parallel computing mainly depends on the quality of a high-level programming model, which should provide powerful abstractions in order to free the programmer from the burden of dealing with low-level issues such as data layout or communications [4]. A high-level programming model facilitates the building of large-scale applications and bridges the semantic gap between the application and the parallel machine. Although we cannot expect parallel programming to become as easy as sequential programming, we can avoid unnecessary difficulties by using appropriate tools.

In this paper, we describe a high-level programming methodology based on the graph-oriented programming (GOP) model, which was originally proposed as an abstract model for distributed programming [5,6]. In applying GOP to parallel programming, our observation was that many parallel programs can be modeled as a group of tasks performing local operations and coordinating with one another over a logical graph, which depicts the architectural configuration and inter-task communication pattern of the application. Most of these graphs are regular ones such as trees and meshes. Using a message-passing library such as PVM or MPI, the programmer needs to manually translate the design-level graph model into its implementation using low-level primitives. With the GOP model, such a graph metaphor is made explicit at the programming level because GOP directly supports the graph construct. By directly using the logical graph construct, the tasks of a parallel program are configured as a logical graph and implemented using a set of high-level operations defined over the graph.
The GOP runtime has been implemented on top of MPI. It provides a high-level programming abstraction (the GOP library) for building parallel applications. Graph-oriented primitives for communication, synchronization and configuration are provided at the programming level, and their implementation hides from the programmer the underlying activities associated with accessing services through MPI. The programmer can thus concentrate on the logical design of an application, ignoring unnecessary low-level details.

GOP can also support parallel software architecture. In GOP, the concept of software architecture is reified as an explicit graph object, which provides a locus for addressing architectural issues, separated from the programming of the parallel tasks. Furthermore, the system architecture design can be simplified with the graph abstraction and predefined graph types. In this paper we focus on GOP's abstractions for MP programming; readers are referred to [7] for GOP's support for software architecture. Moreover, the GOP system is portable, as its implementation is based almost exclusively on calls to MPI, a portable MP standard, and the GOP library, a user-level library portable to several operating systems.

The rest of the paper is organized as follows: Section 2 introduces the related work and background information of GOP. Section 3 presents an overview of the GOP framework, including the GOP model, the support for high-level MP programming, and the major features of VisualGOP for program development, from program construction to process mapping and compilation. Section 4 illustrates programming in GOP. Section 5 describes the implementation and performance evaluation of GOP, with a sample application. Finally, Section 6 concludes the paper with a discussion of our future work.
2. Related work

In parallel computing, the design and implementation of message-passing applications (MPAs) have been recognized as complex tasks. PVM and MPI have improved the situation, as they allow applications to be implemented independently of the underlying architecture. PVM and MPI allow for the general form of parallel computation, as programs may exhibit arbitrary communication dependencies. In general, programs forming tree inter-process communication dependencies, where each process communicates only with its parent and its child processes, are well suited to PVM, while regular ring or grid process communication dependencies are well suited to MPI. However, PVM and MPI are low-level tools and somewhat difficult to use for building applications; the step from application design to implementation remains a demanding task. Several high-level programming models have been developed to address this.

Ensemble [8–10] supports the design and implementation of MPAs (applied to MPI and PVM), particularly MPMD applications and those demanding irregular or partially regular process topologies. Applications are built by composing modular MP components. Ensemble divides the software architecture into two layers: the
abstract design and implementation (AD&I), which is the responsibility of the programmer, and the architecture-specific implementation (ASI), e.g., the MPI implementation, which is generated from the AD&I and transparent to the developer. AD&I consists of three well-separated implementation parts. The first is the virtual component, the implementation abstraction of an MP program. For example, it provides abstract names and abstract roots for collective calls and abstract point-to-point interaction, and it uses ''ports'' to replace the MPI argument types (context, rank and message tag). The second is the symbolic topology, an abstraction of a process topology, which specifies the number of processes required from each component, each process's interface, and its interaction with other processes. The last is resource allocation, the mapping of processes, as well as the location of source, executable, input and output files, onto the underlying environment.

We compare Ensemble and GOP in four aspects. First, the programming models of Ensemble and GOP are different. Ensemble is mainly for the abstract programming design of the application. Instead of compiling the programs directly, Ensemble uses a tool to translate the abstract parallel programs into pure MPI code and then compiles the code into modular MPI components. In GOP, the programmer uses the high-level GOP API to develop parallel applications; the parallel programs and the GOP library are then compiled together to form executable programs. GOP also includes a runtime system that handles graph update and synchronization, so that the graph topology for communication can be changed during runtime. In Ensemble, the program is static, so the topology for communication cannot be changed easily. Second, GOP and Ensemble have similar features in application programming: abstraction is used for referring to node names, node groups and communicating edges. Third, GOP has a flexible mapping strategy that provides both automatic and manual mapping, whereas in Ensemble mapping can be done only manually. The last difference is that the GOP syntax is independent of MPI; GOP provides a high-level abstraction library for programming parallel applications, so the programmer can design a GOP application without needing to know any low-level MPI syntax. In contrast, the programming structure of Ensemble is a mixture of MPI and its own syntax, so the programmer needs to learn both the architecture design of Ensemble and the usage of the MPI communication routines.

The Nanothreads Programming Model (NPM) [11] is a programming model for shared-memory multiprocessors. NPM can be integrated with MPI for use on distributed-memory systems. Its runtime system is based on a multilevel design that supports both models (NPM and MPI) individually but offers the capability to combine their advantages. Existing MPI codes can be executed without any changes, codes for shared-memory machines can be used directly, and the concurrent use of both models is easy. The major feature of the NPM runtime system is portability, as it is based exclusively on calls to MPI and Nthlib, a user-level threads library that has been ported to several operating systems. The runtime system supports the hybrid programming model (MPI + OpenMP) [12]. Moreover, it extends the API and the multiprogramming functionality of the NPM to clusters of multiprocessors and can support an extension of the OpenMP standard on distributed-memory multiprocessors.
There are similarities and differences between NPM and GOP. First, both models have their own API to support high-level programming design. This overcomes the insufficiency of the low-level library routines supported by MPI, providing a more efficient and easier way to design and manage an application. Second, although both of them decompose an application into a number of smaller parts, their programming structures are not the same. NPM decomposes the application into fine-grain tasks, which are executed in a dynamic multiprogrammed environment. The parallelizing compiler analyzes the source program to produce an intermediate representation, called the hierarchical task graph, which is used for mapping user tasks to the physical processors at runtime. The model allows one or several user-level ready queues to contain the ready tasks waiting for execution. When a processor finishes its current task, it picks up the next task from the ready queue; processors continuously pick up tasks from the ready queue until the program terminates. In GOP, the graph is used for several purposes, not only for mapping, and the nodes in the user-defined graph represent different processes. The programmer writes graph-oriented programs to be bound to the nodes and then maps the nodes onto the physical processors. Since the mapping is one-to-one, the programmer can do the mapping manually or let the system do it automatically. Before the application executes, the programmer can review the design and the mapping information.

The graph-based programming paradigm has been a prosperous area in visual programming for parallel computing in the past decade. Graphical programming environments, tools, and libraries have been developed to ease parallel programming and assist software development on parallel systems. CODE [13,14] is a graphical parallel programming language with which a user can create a parallel program by drawing a dataflow graph that shows the communication structure of the program. A graph consists of nodes that represent computations (or shared variables) and arcs that represent the flow of data from one computation to another. CODE provides a visual programming environment. Having drawn a graph, the user annotates it by filling in a set of forms that describe the properties of the nodes and arcs, e.g., the sequential computation (usually written in C) that a node performs. The user then selects a target machine from a machine list, and CODE translates the annotated graph into a program that can be compiled and run on the target machine. As many algorithms have dynamic program structures that depend on runtime information, CODE supports the specification of dynamic structures in which the nodes and arcs can be instantiated any number of times at runtime.

HeNCE [15] is also a graphical language for creating and running parallel programs in graphs over a heterogeneous collection of computers. Differing from CODE, the graph in HeNCE shows the control flow of a program [16]. The procedures performed by the nodes are written in C or Fortran. HeNCE is implemented on top of PVM. It provides an integrated environment for editing, compiling, executing, debugging, and tracing parallel programs. The user can specify the various machines on which a program is to run. He/she can also specify a cost matrix showing the relative costs of running the procedures on the various machines. HeNCE will
automatically schedule the procedures on particular machines based on the cost matrix.

VPE [17] is a visual parallel programming environment. It provides a simple GUI for creating MP programs and supports automatic compilation, execution, and animation of the programs. The programmer draws a graph to describe the program structure and annotates the nodes with C or Fortran text that contains simple MP calls. The VPE environment is implemented on top of PVM and runs on X-Window to build a virtual parallel machine on a collection of heterogeneous machines. The programs do not use PVM primitives directly; instead, the nodes make explicit calls to VPE MP library routines to send and receive messages via named ''ports'' attached to the nodes, so that VPE programs are substantially simpler than their PVM equivalents.

The above projects are graph-based programming languages and environments in which the graph is simply used as a visual representation of a program's structure. The code, especially the inter-node operations such as communication and synchronization, is programmed in the conventional primitives of procedural languages (e.g., C, Fortran) and MP libraries (e.g., MPI, PVM). In contrast, the GOP model harnesses the graph-oriented concept in support of parallel programming. Graph orientation means that the graph is not only used as the representation of the program structure but also as the underlying mechanism for implementing operations of the program based on the topology of the graph. For example, the communication primitives are defined using the relative references of source and destination nodes in a specific graph topology, such as predecessor and successor, parent and children, root and leaves, instead of absolute node IDs (see Section 3.1).
3. The GOP framework

In this section, we first introduce the GOP model for high-level programming of parallel applications. Then we describe the GOP system, including the GOP API. Next, we discuss the enhancements to MPI provided by GOP. Finally, we briefly describe the features of VisualGOP, which provides the high-level programming environment to support the GOP model.

3.1. The graph-oriented programming (GOP) model

In the GOP model, a parallel program is defined as a collection of local programs (LPs) that may execute on several processors. Parallelism is expressed through the explicit creation of LPs, and communication between LPs is solely via MP. GOP allows programmers to write parallel programs based on user-specified graphs, which can serve the purpose of naming, grouping and configuring LPs. The graph can also be used as the underlying structure for implementing uniform MP and LP coordination mechanisms. The key elements of GOP are a logical graph construct to be associated with the LPs of a parallel program and their relationships, and a collection of functions
Fig. 1. The GOP conceptual model (local programs bound to a logical graph, mapped onto the underlying network).
defined in terms of the graph and invoked by messages traversing the graph. As shown in Fig. 1, the GOP model consists of the following:

• A logical graph whose nodes are associated with LPs, and whose edges define the relationships between the LPs.
• An LPs-to-nodes mapping, which allows the programmer to bind LPs to specific nodes.
• An optional nodes-to-processors mapping, which allows the programmer to explicitly specify the mapping of the logical graph to the underlying network of processors. When the mapping specification is omitted, a default mapping will be performed.
• A library of language-level GOP primitives.

GOP programs are conceptually sequential but augmented with primitives for binding LPs to nodes in a graph, with the implementation of graph-oriented inter-node communications completely hidden from the programmer. The programmer first defines variables of the graph construct in a program and then creates an instance of the construct. Once the local context for the graph instance is set up, communication and coordination of LPs can be implemented by invoking operations defined on the specified graph. The sequential code of LPs can be written using any programming language such as C, C++ or Java.

A graph-oriented parallel program is defined over a logical graph G(N, E), where N is a finite set of nodes and E is a finite set of edges. Each edge of the graph links a pair of nodes in N. A graph is directed if each edge is unidirectional, and labeled if every edge is associated with a label [18]. A graph is associated with a parallel program, which consists of a collection of LPs bound to the nodes, with messages passing along the edges of the graph. Edges denote the interaction relationships between LPs. The graph can represent a logical structure that is independent of the real structure of a parallel system, and can be used to reflect the properties of the
underlying system. For example, the label on each edge may denote the cost or delay of sending a message from one site to another within the system.

A GOP program consists of a collection of LPs, which are built using the graph construct, and a main program. A graph construct consists of a directed conceptual graph, an LP-to-node mapping, and an optional node-to-processor mapping. The programmer creates an instance of a graph construct using the following steps:

• Step 1. Graph template declaration and instantiation. A Graph is a template for a logical graph describing the logical relationships between LPs. It instantiates a graph instance and associates a name with the instance. The structure of a Graph is a general type of logical graph, described as a list of nodes connected with edges. It is defined as follows in Backus-Naur Form.
<graph-construct> ::= Graph Graph-name '=' {{<node-list>}, {<edge-list>}}
<node-list> ::= <node> | <node-list>, <node>
<node> ::= node_no
<edge-list> ::= <edge> | <edge-list>, <edge>
<edge> ::= {node_no, node_no} | ε

Graph is the type identifier denoting the definition of a graph construct. The Graph-name is an identifier of a graph construct. The node_no is an integer identifier of a node.

• Step 2. Mapping in GOP. Map LPs to the conceptual nodes of a graph and the nodes to the underlying processors. The LP-to-node mapping is defined as a set of (node_no, LP_no) pairs:

<LP-to-node-mapping> ::= LNMAP LN-map-name '=' {<LN-pair-list>}
<LN-pair-list> ::= <LN-pair-list>, {node_no, LP_no} | ε

LNMAP is the type identifier of an LP-to-node mapping. LN-map-name is the name of a mapping instance. The LP named by LP_no is mapped to the node identified by node_no. The node-to-processor mapping is optional in the graph construct; it is specified by a set of (node_no, processor_no) pairs, in a form similar to the LP-to-node mapping:

<node-to-processor-mapping> ::= NPMAP NP-map-name '=' {<NP-pair-list>}
<NP-pair-list> ::= <NP-pair-list>, {node_no, processor_no} | ε

NPMAP is the type identifier of a node-to-processor mapping. NP-map-name is the name of a mapping instance. The node named by node_no is mapped to the processor identified by processor_no.
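As a concrete illustration of this syntax (our own example following the BNF above, not taken from the paper; the graph name mesh, the LP names lp0–lp3 and the processor names p0–p3 are hypothetical), the 2 × 2 mesh used later in Section 4 and its two mappings could be declared as:

Graph mesh = {{0, 1, 2, 3}, {{0, 1}, {0, 2}, {1, 3}, {2, 3}}}
LNMAP meshLN = {{0, lp0}, {1, lp1}, {2, lp2}, {3, lp3}}
NPMAP meshNP = {{0, p0}, {1, p1}, {2, p2}, {3, p3}}

In Step 3 below, these three declarations are bound together by CreateGraph into a graph instance that each LP then loads with SetUpLocalGraph.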
• Step 3. Graph construct binding. Given the declaration of a graph construct and its mappings, a graph instance can be created by binding the mappings to the graph. The routine CreateGraph is used for this purpose:

CreateGraph(Graph-name, LN-map-name, NP-map-name, Graph-instance-name);

Graph-instance-name is the identifier of a file that specifies all of the information about the newly created instance. This information is useful in establishing the operating context for the local programs. LN-map-name is the name of an LP-to-node mapping and NP-map-name is the name of a node-to-processor mapping. The local operating context is created by

SetUpLocalGraph(Graph-name, Graph-instance-name, myvid);

The myvid argument is the identifier of an instance of an LP. As a result of a call to the routine SetUpLocalGraph, a local representation of the distributed instance of the conceptual graph is created and the information required by the communication protocols is generated.

Programming based on the graph-oriented model includes creating the graph construct and writing program code for the LPs using the graph primitives. GOP allows the programmer to exploit the semantics of the graph construct to deal with various aspects of parallel programming. The graph primitives define operations on a user-specified graph, including communication and synchronization. These operations can be used to pass messages from one node to other nodes in the graph without knowing low-level details such as absolute naming, addressing and routing. In this way, the programmer is saved from the burden of writing dedicated program code for implementing task mapping and MP, and can concentrate on designing the structure and the logic of the parallel program instead.

3.2. GOP system structure

It is important to note that the GOP model is independent of any particular language and platform. It can be implemented as library routines incorporated in familiar sequential languages and integrated with different programming platforms. In this paper, however, we present an implementation of the GOP framework on MPI. The GOP software environment is illustrated in Fig. 2.

The top layer is a visual programming environment. It supports the design and construction of parallel programs. It has a highly visual and interactive user interface, and provides a framework in which the design and coding of GOP programs, and the associated information, can be viewed and modified easily and quickly. It also facilitates the compilation, mapping, and execution of programs (see Section 3.5).

A set of GOP APIs is provided for the programmer to use in parallel programming, so that the programmer can build applications based on the GOP model, ignoring the details of low-level operations and concentrating on the logic of the parallel
Fig. 2. The GOP framework (VisualGOP user interface with graph editor, local program editor, mapping and execution control, and graph representation, on top of the GOP API, GOP library, and GOP runtime with configuration manager and consistency maintenance, layered over MPI and the OS/network).
program. The GOP library provides a collection of routines implementing the GOP API. The goal of the GOP library implementation is to introduce a minimum number of services with very simple functionality in order to minimize the package overhead. The runtime system is responsible for compiling the application, maintaining the graph structure, and executing the application. On the target machine, there exist two runtimes. The first is the GOP runtime, a background process that provides graph deployment, update, query and synchronization. When deploying and updating the graph, it blocks other machines from further updating the graph and synchronizes the graph update on all machines. The other runtime is the MPI runtime, which provides a complete parallel programming library for the GOP implementation. GOP uses MPI as the low-level parallel programming facility so that processes can communicate efficiently.

3.3. The GOP library

GOP applications are implemented through wrapper functions (the GOP library) over native MPI software. In general, the GOP API adheres closely to the MPI standard. However, the GOP library provides operations that automatically perform some MPI routines and thus simplifies the API. It also allows argument lists to be simplified relative to MPI programs. For MP, GOP provides a set of routines to enable graph-oriented point-to-point and collective communication (in both blocking and non-blocking modes). In this layout, the GOP system follows the MPI standard, but it is simpler and the implementation is specifically designed for the graph-oriented framework. For example,
we use node IDs instead of process IDs to represent the different processes, so the LP bound to a node can be replaced without affecting other LPs. GOP also hides the complexity of low-level addressing, communication and process initialization from the programmer.

GOP provides three types of communication and synchronization primitives for programmers:

• point-to-point communication,
• collective communication,
• synchronization.

Point-to-point communication consists of a process that sends a message and another process that receives the message (a send/receive pair). These two operations are very important and most often used; to implement optimal parallel applications, it is essential to have a model that accurately reflects the characteristics of these basic operations. Collective communication refers to MP routines involving a group (collection) of processes. Sometimes one wishes to gather data from one or more processes and share it among all participating processes; at other times, one may wish to distribute data from one or more processes to a specific group of processes. The GOP API also contains collective communication primitives that operate on parent and child relationships: they find the parent or child nodes of the current node and then broadcast the data to all the corresponding nodes. Finally, synchronization operations are provided to support the synchronization of processes. The following is the list of GOP communication and synchronization primitives:

Point-to-point communication primitives:

MsgHandle Usend(Graph g, Node n, Msg msg, CommMode m); /* send unicast message */
MsgHandle Urecv(Graph g, Node n, Msg msg, CommMode m); /* receive unicast message */
MsgHandle SendToParent(Graph g, Msg msg, CommMode m); /* send message to parent nodes */
MsgHandle RecvFromParent(Graph g, Msg msg, CommMode m); /* receive message from parent nodes */
MsgHandle SendToChildren(Graph g, Msg msg, CommMode m); /* send message to children nodes */
MsgHandle RecvFromChildren(Graph g, Msg msg, CommMode m); /* receive message from children nodes */

Collective communication primitives:

MsgHandle Msend(Graph g, NodeGroup ng, Msg msg, CommMode m); /* send multicast message */
MsgHandle Mrecv(Graph g, NodeGroup ng, Msg msg, CommMode m); /* receive multicast message */
MsgHandle Gather(Graph g, NodeGroup ng, Msg msg, Node s); /* s collects data from all nodes in the NodeGroup */
MsgHandle Scatter(Graph g, NodeGroup ng, Msg msg, Node s); /* s distributes data to all nodes in the NodeGroup */
MsgHandle Allgather(Graph g, NodeGroup ng, Msg msg); /* data collection on all nodes in the NodeGroup */
MsgHandle Alltoall(Graph g, NodeGroup ng, Msg msg); /* data distribution among all nodes in the NodeGroup */
MsgHandle Reduce(Graph g, NodeGroup ng, Msg msg, Node s); /* reduce data to s from all nodes in the NodeGroup */
MsgHandle Allreduce(Graph g, NodeGroup ng, Msg msg); /* data reduction on all nodes in the NodeGroup */

Synchronization:

void barrier(Graph g); /* synchronize all nodes in the graph */
void barrier(Graph g, NodeGroup ng); /* synchronize this node with all (other) nodes in the NodeGroup */
Boolean isArrived(Graph g, MsgHandle handle); /* check if the msg(s) arrived */

The signatures of the above routines follow the same pattern: the first argument is the GOP graph that defines the scope of process communication. The communication target can be either a single node (Node) or a group of nodes (NodeGroup). Unlike MPI, GOP hides the message contents, type and tag by embedding them inside the Msg datatype. The number of arguments used in the GOP API is thus smaller than in MPI, which reduces the complexity of using the API routines and makes programs easier to maintain.

GOP also provides a set of operations to query information about nodes and edges in the graph. The corresponding GOP API is listed below:

Query primitives:

Edge GetEdge(Graph g, Node start, Node end); /* get the edge between the start node and the end node */
Node GetNode(Graph g, Edge edge); /* get the end node of the edge */

This query information can be generated while the application is running, and the programmer can use it for communication purposes. For example, when a programmer wants to find the neighbor node connected to the current node, he/she can use the routine GetNode to retrieve the node name at the end of a connected edge. Programming in this way lets the programmer assign node names dynamically in the communication procedures, without specifying static node names in the LP code. It therefore helps the
programmer to design the LP structure freely and produces more readable code for software maintenance.

3.4. GOP's enhancement of communication support

In this section, we discuss two major components of MPI, namely the communicator and the virtual topology. We analyze their purposes and weaknesses and describe how GOP enhances the support.

3.4.1. MPI communicator and GOP node group

The communicator plays a very important role in the MPI library. The basic functions of a communicator include the management of processes, defining the scope of communication, and communication between communicators. When two processes A and B do not belong to the same communicator, they cannot send or receive any information to or from each other. That is, when a process wants to communicate with another process, both must be included in the same communicator. When a parallel application starts, a default communicator, MPI_COMM_WORLD, is created. By default, all processes belong to MPI_COMM_WORLD and can send and receive information to and from each other. The programmer can create new communicators in addition to MPI_COMM_WORLD.

When the programmer introduces a new communicator into the application, several code statements are added. MPI does not provide an easy way to create a new communicator. The programmer first needs to specify a group of processes for the communicator; the group must be provided in the form of process IDs, prepared in an array before going on to further steps. The programmer therefore needs to remember the creation routines and all the process IDs of the communicator, which decreases the readability and increases the complexity of the program.

GOP uses the node group to simplify the above process. A node group provides the same functionality as the MPI communicator. In addition, GOP supports the management of node groups in two ways. First, GOP provides a simpler syntax: node group creation does not require specifying any array or data structure, all nodes in a node group are represented by their node names rather than process IDs (so the programmer does not need to remember the process IDs), and GOP provides functions (not shown in this paper) for adding or removing nodes in the node group directly. Second, the programmer can manage node groups with the aid of the VisualGOP tool, without modifying any part of the code. This makes the management of node groups easier and eliminates most errors arising from programming mistakes.

3.4.2. MPI virtual topology and GOP graph topology

The domains of many scientific and engineering problems are inherently two- or higher-dimensional. For parallel computing, these higher physical dimensions often lead to decompositions of a comparable number of dimensions in the computational
domain. Ordinarily, the process ranks of the decomposition are expressed in linear order (i.e., 0, 1, 2, ...). To enable a code to refer to these ranks in multidimensional coordinates similar to those that arise in the domain decomposition, the MPI library provides a number of virtual topology routines.

The MPI virtual process topology is a communicator with user-provided, high-level information describing the preferred communication pattern for the processes in the communicator. There are two types of virtual topologies. In a graph topology, the communication pattern is an undirected graph, and communication is assumed to be between processes that are adjacent in the graph. In a Cartesian topology, the communication pattern is a D-dimensional grid, and communication is assumed to be among processes at neighboring grid points, which in the MPI standard implicitly means along a dimension, not along a diagonal. From the application programmer's point of view, both types of topologies describe a preferred or likely communication pattern.

However, the MPI virtual topology is a static description of a communication pattern. It is not possible to inform the MPI implementation about when communication will take place, nor about the load (frequency, amount of data) of particular links. It is also not possible to influence how the re-ranking should be performed. There is more than one possible optimization criterion, and the application programmer has no means of informing the MPI implementation which criterion is the most relevant. For the Cartesian topology, it is application dependent whether neighbors are found only along the dimensions or may also include the processes on the diagonals. The exact structure of the neighborhood could possibly influence the remapping, but again there is no means of informing the MPI implementation which neighborhood will be used in a given application. A possible shortcoming of a different nature is that the topology remapping problems are NP-hard (for most communication architectures), so there will be a tradeoff between solution quality and invested time. It is not possible via MPI to specify how much time the MPI implementation should spend in finding a good remapping satisfying the optimization criteria, yet only the application programmer can know how much time he can afford to spend in a given situation [19]. A high-quality MPI implementation will use the topology information provided by the user to reorder the processes in the communicator to fit well with the underlying communication system. The identity mapping, that is, returning a communicator in which processes have not changed ranks, is a legal implementation of the topology mechanism, and most current MPI implementations do not go beyond this trivial realization.

In GOP, the graph topology can be specified by a user-defined graph. The graph is reconfigurable and highly customizable. The programmer can review the graph and make changes if necessary. The graph is also used as reference information for the programmer when building the application during the programming phase. Unlike the MPI virtual topology, the GOP graph supports both static and dynamic node-to-processor mapping and LP-to-node mapping. The LPs, the nodes and the processors can be remapped, providing a flexible way for the programmer to design the application by reordering the graph topology at any time. Besides remapping, if the program needs to use another graph topology or to modify a part of the graph,
the current graph can be replaced by a new one. This eliminates the extra code segments for graph reconfiguration or for rebuilding a new graph inside the application, making the management and programming of parallel applications much easier and more flexible. GOP also supports functions for querying the graph, so a program can change its computation flow according to the current graph information. The VisualGOP tool provides automatic mapping and optimization of the application (task graph scheduling), so that the programmer does not need to be involved in the design of the process management.

3.5. VisualGOP

VisualGOP is a visual programming environment that supports the design, construction and deployment of GOP programs. It helps the programmer further eliminate many coding details and simplifies many of the program development tasks. In this section, we introduce the overall architecture and the visual programming framework of VisualGOP. VisualGOP consists of the following major components:

• VisualGOP program construction. This component is provided for constructing a GOP program by using graphical aids (visual design editors). The graph editor is used to build the logical graph, representing the program structure from a visual perspective. The LP editor is used for editing the source code of the LPs in a GOP program.
• Mapping and resource management. This component gives the programmer control over the mapping of LPs to graph nodes and the mapping of graph nodes to processors (through the mapping panel). It also allows the programmer to access information about the status of the machine and network elements (through the processor panel).
• Compilation and execution. This component is used for transforming the diagrammatic representations and GOP primitives into the target machine code for execution (remote execution manager).

These components are organized to form the architecture of VisualGOP, as shown in Fig. 3. They are divided into two levels: the visual level and the non-visual level. Fig. 4 shows a screenshot of the main visual interface of VisualGOP. The non-visual level contains components responsible for maintaining a representation of the GOP program design and deployment information. This representation is kept in memory when program design takes place, and is later stored in a file. This level also contains the remote execution manager, which makes use of the stored information to execute the constructed GOP program on a given platform.

With VisualGOP, the programmer starts program development by building a highly abstracted design and then transforms it successively into an increasingly more detailed solution. More specifically, VisualGOP separates program design and configuration (i.e., the definition of the logical graph) from implementing the program
Fig. 3. The VisualGOP architecture (visual level: processor panel, mapping panel, and visual design editors; non-visual level: internal GOP structures for nodes, edges, processors and LPs, and the remote execution manager; underlying level: GOP runtime, MPI, and the parallel machines).

Fig. 4. The main screen of VisualGOP.
components (i.e., the coding of the LPs). VisualGOP helps automate the generation of a large portion of GOP code, such as the declaration and initialization of the graph construct and the generation of message templates. The visual programming process under VisualGOP consists of the following iterative stages:

(1) Visual programming. The programmer interacts with VisualGOP through a visual representation of the design of a parallel program. The design is presented as a logical graph, which consists of a set of nodes representing LPs and a set of edges representing the interactions between the LPs in the program. The programmer uses the graph editing facility to create and modify the logical graph. VisualGOP also includes an LP editor to visually manipulate the textual LP source code.
(2) Binding of local programs to graph nodes. The programmer specifies the binding of the LPs of a parallel program to the nodes of the logical graph. The programmer can do this in two ways. The first is to create the LPs and then bind them to the graph nodes. The other is to combine the two steps into one: click on a node of the graph to open the program editor, in which the code of the LP mapped to that node can be entered.
(3) Mapping of graph nodes to processors. The mapping panel of VisualGOP displays the GOP program elements (nodes, processors, LPs) in a hierarchical tree structure. LPs, nodes and processors can be added to and deleted from the panel. The programmer can use drag-and-drop to bind and unbind LPs to graph nodes and graph nodes to processors. Detailed binding information concerning a node can be viewed by selecting and clicking the node. The processor panel provides icons for displaying processors and their connections. When a processor is added, a new processor icon is shown on the panel. For node-to-processor mapping, the panel also provides drag-and-drop to bind and unbind graph nodes to one of the processors in the panel.
(4) Compiling the LPs. Source files can be distributed to and compiled on the specified processors.
(5) Executing the program. The constructed GOP program can be distributed to and executed on the specified processors. Outputs are displayed in VisualGOP.

VisualGOP provides the user with automated, intelligent assistance throughout the software design process. The facilities provided include visual programming support and automation for mapping and code generation. This creates a flexible programming environment and eliminates most of the mundane clerical tasks [20].
4. Programming in the GOP environment

This section shows the advantages of using the GOP environment to develop parallel applications. We demonstrate them with an example, comparing the differences
between programming in GOP and in MPI. GOP can support both the single program multiple data (SPMD) and the multiple program multiple data (MPMD) program structures. In our example, we use only SPMD programs, in order to compare more clearly the differences between MPI and GOP programming. Therefore, all nodes are bound to the same copy of the LP in the application.

We use an example to illustrate how GOP supports higher-level MP parallel programming. The finite difference method (FDM) is an approach to obtaining an approximate solution to a partial differential equation governing the behavior of a physical system. The method imposes a regular grid on the physical domain. It then approximates the derivative of an unknown value u at a grid point (x, y) by the values at adjacent grid points. Consider a specific partial differential equation, the Laplace equation; the approximation u_{ij} to the exact solution u(x_i, y_j) on the grid satisfies the equation

u_{ij} = \frac{1}{4}\left(u_{i+1,j} + u_{i-1,j} + u_{i,j+1} + u_{i,j-1}\right), \qquad i, j = 0, \ldots, N-1.    (1)

Eq. (1) shows that the value of u at any point is affected by its four adjacent elements. Given the initial values, the value of u at any point can be approximated by the iteration

u_{ij}^{(k)} = \frac{1}{4}\left(u_{i+1,j}^{(k-1)} + u_{i-1,j}^{(k-1)} + u_{i,j+1}^{(k-1)} + u_{i,j-1}^{(k-1)}\right)

until the predefined accuracy is reached.

When solving the problem on p processors, the grid is partitioned into p sections. The grid can be decomposed in different manners; here, a two-dimensional partition is used that generates a coarser grid. Fig. 5 shows the grid partition for four processors. Fig. 5(a) is the partition on four processors, and Fig. 5(b) is the program graph of the FDM derived from this partition. The program graph has the coarse-grid topology corresponding to the physical topology of the grid. The entry and exit nodes are omitted in the program graph because they are not indispensable for expressing the problem.
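The GOP example below concentrates on the communication side of this computation; for reference, the purely local part of one iteration of Eq. (1) over a process's mrow × mcol sub-block can be sketched in C as follows. This is our own minimal illustration, not code from the paper; the array layout (one ghost row/column on each side holding neighbor values) and the names u_old, u_new, mrow and mcol are assumptions.

/* One Jacobi sweep over the interior of a local sub-block.
   u_old and u_new are (mrow+2) x (mcol+2) arrays; row 0, row mrow+1,
   column 0 and column mcol+1 hold boundary values received from
   neighboring nodes before the sweep. */
void jacobi_sweep(double **u_old, double **u_new, int mrow, int mcol)
{
    int i, j;
    for (i = 1; i <= mrow; i++)
        for (j = 1; j <= mcol; j++)
            u_new[i][j] = 0.25 * (u_old[i+1][j] + u_old[i-1][j] +
                                  u_old[i][j+1] + u_old[i][j-1]);
}

The routine Update_Boundary_Condition shown later fills those ghost rows and columns by exchanging messages with the neighboring nodes before each sweep.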
Fig. 5. Grid partitions for four processors: (a) two-dimensional grid partition for four processors (P0–P3), (b) program graph for four processors (nodes 0–3).
Consider the four nodes of the program graph in Fig. 5(b). There are two communication edges for each node; both edges are used for sending data to and receiving data from the other processes. In VisualGOP, we create a logical graph similar to the program graph above, shown in Fig. 6. The graph will be used by the GOP runtime to answer graph queries and to resolve node names into MPI process IDs. At the same time, the LPs are written. We also need to set up a processor list for node mapping, so that processes can later be created for the LPs and can communicate through the GOP system. The LPs are mapped onto the graph nodes.

The LPs are implemented in a common programming language such as C, C++ or Java; in this paper, we use the C language as an example. The LP structure is similar to that of an MPI program. It starts with the routine Init and ends with the routine Finalize:

Init(argc, argv);
... ...
Finalize();

In the above program code, we can see that GOP provides a simple statement to hide the details from the programmer:

GOP
Init(argc, argv);    /* start GOP */
Fig. 6. Logical graph for four processors in VisualGOP.
In MPI, the programmer needs to add additional statements for program initialization, as shown below:

MPI
MPI_Init(&argc, &argv);               /* starts MPI */
MPI_Comm_rank(MPI_COMM_WORLD, &k);    /* get current process id */
MPI_Comm_size(MPI_COMM_WORLD, &p);    /* get # procs from env */

To see how GOP hides the details from the programmer, let us look inside the initialization code of GOP. The routine Init performs several operations, such as obtaining the command-line arguments, determining the number of processes in the MPI environment, getting its own process ID, getting the processor name and initializing the graph representation. The implementation is shown below:

void Init(int argc, char *argv[])
{
    FILE *fp;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Get_processor_name(processor_name, &namelen);
    /* init graph node ID mapping */
    Graph_Init();
}

Between the routines Init and Finalize, the programmer writes the code for the application. GOP communication primitives are used by processes to communicate with each other in the parallel environment. They are built on top of MPI, so calling them actually invokes MPI operations. In order to get the MPI process ID from a GOP node, a conversion from GOP node names to MPI process IDs is done inside the GOP communication primitives, as shown below for the routine GetNodeID as an example:

int GetNodeID(Node gn)
{
    int i = 0;
    if (gn == NULL) return -1;
    /* A simple method for searching the node; */
    /* a more efficient algorithm can be substituted if necessary */
    for (i = 0; i < node_count; i++) {
        if (strcmp(gn, nodes[i].name) == 0) return i;
    }
    return -1;
}
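The comment above notes that a more efficient lookup can be substituted. A minimal sketch of one possibility (our own illustration, not part of the GOP library; the NodeEntry layout and the function names are assumptions) keeps the runtime's node table sorted by name once at graph-initialization time and then resolves names by binary search:

#include <stdlib.h>
#include <string.h>

/* Assumed to mirror the GOP runtime's node table entry. */
typedef struct { char name[64]; int id; } NodeEntry;

static int cmp_entry(const void *a, const void *b)
{
    return strcmp(((const NodeEntry *)a)->name, ((const NodeEntry *)b)->name);
}

/* Sort once after the graph representation is loaded. */
void BuildNodeIndex(NodeEntry *table, int count)
{
    qsort(table, count, sizeof(NodeEntry), cmp_entry);
}

/* O(log n) replacement for the linear scan in GetNodeID. */
int LookupNodeID(const NodeEntry *table, int count, const char *name)
{
    NodeEntry key;
    const NodeEntry *hit;
    if (name == NULL) return -1;
    strncpy(key.name, name, sizeof(key.name) - 1);
    key.name[sizeof(key.name) - 1] = '\0';
    hit = bsearch(&key, table, count, sizeof(NodeEntry), cmp_entry);
    return hit ? hit->id : -1;
}

For the small graphs used in this example, the difference is negligible; the linear scan in GetNodeID is kept in the listings below.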
The following example of the implementation of the routine Usend shows how the routine parameters are translated from GOP to MPI code:

int Usend(Graph gname, Node nodename, Msg message, CommMode mode)
{
    int nodeid;   /* MPI process ID */
    /* resolve process ID from GOP runtime */
    nodeid = GetNodeID(gname, nodename);
    if (nodeid == -1) return -1;
    if (mode == SYN) {
        /* synchronous send */
        MPI_Ssend(message.data, message.length, message.datatype,
                  nodeid, message.tag, MPI_COMM_WORLD);
        return 0;
    } else if (mode == ASYN) {
        /* asynchronous send */
        MPI_Send(message.data, message.length, message.datatype,
                 nodeid, message.tag, MPI_COMM_WORLD);
        return 1;
    } else
        return -1;
}

The routine Usend defines a unicast MP primitive, delivering a message from the current node to another node in synchronous or asynchronous sending mode. First, GOP identifies the specified graph. Then, the node name is resolved into an MPI process ID by consulting the logical graph structure. Finally, the MPI routine uses the process ID to communicate with the destination node.

The last part of an LP is the finalizing routine. As with MPI, GOP releases resources such as the graph representation and the temporary variables used by the GOP library, and then calls the routine MPI_Finalize to end the MPI process.

In the FDM example, every node uses the same LP. In calls to the sending and receiving routines, a node's process ID is resolved by the GOP runtime automatically. Here is the MP routine Update_Boundary_Condition in the FDM example:

Update_Boundary_Condition(double **solution_array, int mrow, int mcol, int k)
{
    int i;
    Msg msg1, msg2, msg3, msg4;
    /* – The message template is generated by VisualGOP – */
    msg1 = createMsg(&(solution_array[mrow][1]), mcol, DOUBLE, 0);
    msg2 = createMsg(&(solution_array[0][1]), mcol, DOUBLE, 0);
    msg3 = createMsg(&(solution_array[1][1]), mcol, DOUBLE, 1);
    msg4 = createMsg(&(solution_array[mrow+1][1]), mcol, DOUBLE, 1);
    /* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */
    if (k % 2 == 0) {
        /* even numbered processes */
        Usend(ggraph, GetNode("right"), msg1, ASYN);
        Urecv(ggraph, GetNode("left"), msg2, ASYN);
        Usend(ggraph, GetNode("left"), msg3, ASYN);
        Urecv(ggraph, GetNode("right"), msg4, ASYN);
    } else {
        /* odd numbered processes */
        Urecv(ggraph, GetNode("left"), msg2, ASYN);
        Usend(ggraph, GetNode("right"), msg1, ASYN);
        Urecv(ggraph, GetNode("right"), msg4, ASYN);
        Usend(ggraph, GetNode("left"), msg3, ASYN);
    }
    ...
    /* – The message template is generated by VisualGOP – */
    msg1 = createMsg(&sbuf_up, mrow, DOUBLE, 2);
    msg2 = createMsg(&rbuf_down, mrow, DOUBLE, 2);
    msg3 = createMsg(&sbuf_down, mrow, DOUBLE, 3);
    msg4 = createMsg(&rbuf_up, mrow, DOUBLE, 3);
    /* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */
    if (k % 2 == 0) {
        /* even numbered processes */
        Usend(ggraph, GetNode("up"), msg1, ASYN);
        Urecv(ggraph, GetNode("down"), msg2, ASYN);
        Usend(ggraph, GetNode("down"), msg3, ASYN);
        Urecv(ggraph, GetNode("up"), msg4, ASYN);
    } else {
        /* odd numbered processes */
        Urecv(ggraph, GetNode("down"), msg2, ASYN);
        Usend(ggraph, GetNode("up"), msg1, ASYN);
        Urecv(ggraph, GetNode("up"), msg4, ASYN);
        Usend(ggraph, GetNode("down"), msg3, ASYN);
    }
    ...
}
In contrast, in MPI, each process needs to provide the static process IDs for communication before the program is compiled, so the programmer has to write an extra routine, neighbors, to calculate the process IDs of the neighbors of each node:

neighbors(int pid, int *left, int *right, int *up, int *down, int total_p_num)
{
    int q, r, c, proc_col, i = 0, j;
    ...
    proc_col = 1;
    for (j = 1; j <= c; j++)    /* calculate the column size of the grid */
        proc_col *= 2;
    if (pid % proc_col == 0) {                    /* the first column in the grid */
        *left = -1;                               /* tells MPI not to perform send/recv */
        *right = pid + 1;
    } else if (pid % proc_col == proc_col - 1) {  /* the last column */
        *left = pid - 1;
        *right = -1;                              /* tells MPI not to perform send/recv */
    } else {
        *left = pid - 1;
        *right = pid + 1;
    }
    if (proc_col == total_p_num) {                /* no rows in the grid */
        *up = -1;
        *down = -1;
    } else if (pid < proc_col) {                  /* the first row */
        *up = pid + proc_col;
        *down = -1;                               /* tells MPI not to perform send/recv */
    } else if (pid >= total_p_num - proc_col) {   /* the last row */
        *up = -1;                                 /* tells MPI not to perform send/recv */
        *down = pid - proc_col;
    } else
    {
        *up = pid + proc_col;
        *down = pid - proc_col;
    }
}

Update_Boundary_Condition(double **solution_array, int mrow, int mcol, int k,
                          int left, int right, int up, int down)
{
    MPI_Status status;
    int i;
    if (k % 2 == 0) {
        /* even numbered processes */
        MPI_Send(&(solution_array[mrow][1]), mcol, MPI_DOUBLE, right, 0, MPI_COMM_WORLD);
        MPI_Recv(&(solution_array[0][1]), mcol, MPI_DOUBLE, left, 0, MPI_COMM_WORLD, &status);
        MPI_Send(&(solution_array[1][1]), mcol, MPI_DOUBLE, left, 1, MPI_COMM_WORLD);
        MPI_Recv(&(solution_array[mrow+1][1]), mcol, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &status);
    } else {
        /* odd numbered processes */
        MPI_Recv(&(solution_array[0][1]), mcol, MPI_DOUBLE, left, 0, MPI_COMM_WORLD, &status);
        MPI_Send(&(solution_array[mrow][1]), mcol, MPI_DOUBLE, right, 0, MPI_COMM_WORLD);
        MPI_Recv(&(solution_array[mrow+1][1]), mcol, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &status);
        MPI_Send(&(solution_array[1][1]), mcol, MPI_DOUBLE, left, 1, MPI_COMM_WORLD);
    }
    ...
    if (k % 2 == 0) {
        /* even numbered processes */
        MPI_Send(sbuf_up, mrow, MPI_DOUBLE, up, 2, MPI_COMM_WORLD);
        MPI_Recv(rbuf_down, mrow, MPI_DOUBLE, down, 2, MPI_COMM_WORLD, &status);
        MPI_Send(sbuf_down, mrow, MPI_DOUBLE, down, 3, MPI_COMM_WORLD);
        MPI_Recv(rbuf_up, mrow, MPI_DOUBLE, up, 3, MPI_COMM_WORLD, &status);
    } else {
        /* odd numbered processes */
        MPI_Recv(rbuf_down, mrow, MPI_DOUBLE, down, 2, MPI_COMM_WORLD, &status);
        MPI_Send(sbuf_up, mrow, MPI_DOUBLE, up, 2, MPI_COMM_WORLD);
        MPI_Recv(rbuf_up, mrow, MPI_DOUBLE, up, 3, MPI_COMM_WORLD, &status);
        MPI_Send(sbuf_down, mrow, MPI_DOUBLE, down, 3, MPI_COMM_WORLD);
    }
    ...
}

Note that the communicator and status arguments of the MPI API are eliminated in the GOP API. Also, VisualGOP hides the message details from the programmer unless they need to be modified explicitly. Therefore, programming in GOP is simpler and more flexible.

Initially, VisualGOP uses four nodes to represent the program structure. When the application requires more nodes for parallel processing, the programmer can use VisualGOP to modify the program design by expanding the program graph. Fig. 7 shows the graph expansion for a mesh. The original 2 × 2 mesh (Fig. 7(a)) is expanded to a 2 × 4 and then a 4 × 4 mesh by redeploying the decomposed nodes. If there are n nodes in a mesh, they are deployed as a √n × √n array. The nodes are linked by edges according to the mesh topology. For example, if our example application (the program graph for four processors) needs to be scaled up to 16 processors, the programmer can first use the pattern of the above 2 × 2 mesh as a basic template for the program graph. With this template, the graph will be automatically duplicated into a 2 × 4 mesh, and then a 4 × 4 mesh. In the new graph, LPs will be mapped onto the new nodes automatically. The remaining task is for the programmer to add 12 new processors in VisualGOP and then map the graph to the processors. In Fig. 8, the nodes are automatically bound to the specified LPs. The programmer can make final changes to this mapping if needed. The node-to-processor mapping helps the programmer to map
Fig. 7. Graph expansion for mesh: (a) 2 × 2 mesh, (b) 2 × 4 mesh, (c) 4 × 4 mesh.
Fig. 8. The resource mapping diagram.
Currently, this is a one-to-one mapping that does not take load balancing into account. If the programmer wants to use a particular processor for a specific node, for example the most powerful one, the mapping table can be changed and the node bound to that processor. After that, the programmer can compile and execute the application directly.

In summary, GOP provides several high-level features to support MP programming in a parallel computing environment. A user-defined graph helps in better understanding and programming of the parallel structure of the code. The programmer is not required to understand the syntax and other low-level details of the underlying MPI MP system, since the GOP APIs help hide the implementation details. Moreover, the support for node-to-processor and LP-to-node mapping greatly helps the programmer deploy the application and manage the system resources.

5. Implementation and evaluation

This section describes the implementation of the GOP framework, including the system architecture of the GOP implementation and the runtime system. We also describe the results of a preliminary evaluation of the implemented system.

5.1. GOP implementation

As shown in Fig. 9, the GOP system architecture is divided into three layers: the programming layer, the compilation layer and the execution layer. In the programming layer, the programmer develops a GOP program using the high-level abstractions on top of the MP implementation. GOP exports a set of APIs through which parallel applications are implemented in a traditional programming language, e.g., C. The GOP API contains a header file of global information, which is shared among the GOP library routines. In the compilation layer, the LPs are transformed into executable programs for the target execution environment; they are compiled together with both the MPI and the GOP libraries. At the bottom, the execution layer is realized through the services of two runtime systems, the MPI runtime and the GOP runtime.
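As an illustration of the "header file of global information" mentioned above, the sketch below shows the kind of declarations such a header might contain. The actual header is not given in the paper; the type layouts here are assumptions, while the routine names and the Msg fields follow the listings shown later in this section.

/* Hypothetical sketch of a GOP API header; layouts of Graph, Node and Msg are
   assumptions, but the names match the routines listed in Section 5. */
#ifndef GOP_H
#define GOP_H

#include <mpi.h>

typedef const char *Graph;            /* logical graph name  (assumed representation) */
typedef const char *Node;             /* logical node name   (assumed representation) */
typedef enum { SYN, ASYN } CommMode;  /* synchronous / asynchronous transfer          */

typedef struct {
    void        *data;                /* message payload                     */
    int          length;              /* number of elements                  */
    MPI_Datatype datatype;            /* element type, e.g. MPI_DOUBLE       */
    int          tag;                 /* message tag                         */
    MPI_Status   status;              /* filled in on receive                */
} Msg;

int GetNodeID(Graph gname, Node nodename);   /* node name -> MPI process ID */
int Usend(Graph gname, Node nodename, Msg message, CommMode mode);
int Urecv(Graph gname, Node nodename, Msg message, CommMode mode);

#endif /* GOP_H */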
Fig. 9. The GOP program communication implementation.
The MPI runtime is the environment used for communication across the network. The GOP runtime is developed as a daemon process on top of the operating system. It helps the GOP processes dynamically resolve node names into process IDs and prepare for MPI communication. The GOP runtime also provides synchronization among all the nodes, so that each node has the most up-to-date logical graph when running GOP primitives. The GOP runtime system is implemented in C using socket communication and a synchronization scheme. Within the runtime system, a graph representation is used by the operations of the GOP primitives. A copy of the GOP runtime resides on each node so that the nodes can exchange graph information in a synchronized way. Nodes on the same machine use shared memory to access the graph. For nodes on different machines, a memory coherence protocol is needed to synchronize graph updates; we choose the sequential consistency model as the graph synchronization scheme [21].

5.2. Performance evaluation

In the remainder of this section, we present the evaluation results for the proposed high-level graph-oriented programming model. Our preliminary experiments focus on the communication between the processes of the parallel application. We first compare the performance of the GOP application library with that of the MPI library, and then identify the overhead introduced by the GOP system.
The experiments used a 20-processor SGI Origin 2000 machine [22] running IRIX64 6.5. This release of IRIX implements the MPI 1.2 standard, and all test programs are written in C. We use the example described in Section 4 for testing. For the program input, we choose the large problem sizes 256 × 256 and 512 × 512. Execution times were measured in seconds using the routine MPI_Wtime, by inserting instructions to start and stop the timers in the program code. To make the results more reliable, the lowest value from 10 measurements is chosen. The major difference between the two implementations is that the MPI program needs an extra routine to calculate the runtime processor ID for each node, while GOP needs to resolve the node names into process IDs before invoking the underlying MPI communications. Fig. 10 shows the execution times for the core code of the FDM in the MPI and GOP systems, excluding other factors such as program initialization and finalization. The MPI program performs slightly better than GOP, because GOP spends additional time manipulating the graph topology and preparing the message data for communication. However, the difference is small and the speedups achieved by the MPI and GOP programs are almost the same; the corresponding speedups are shown in Fig. 11. We also compared the performance of the MPI and GOP communication routines, taking MPI as the baseline, i.e., assuming it incurs no overhead.
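A minimal sketch of the measurement method just described is shown below. The barrier calls are an assumption added here to align the processes before timing; the paper only states that MPI_Wtime is used and that the lowest of 10 measurements is reported.

/* Sketch of the timing method: run the kernel 10 times and keep the smallest
   elapsed time.  The barriers are an assumption, not described in the paper. */
#include <mpi.h>

double time_kernel(void (*kernel)(void))
{
    double best = -1.0;
    for (int i = 0; i < 10; i++) {
        MPI_Barrier(MPI_COMM_WORLD);       /* start all processes together */
        double t0 = MPI_Wtime();
        kernel();                          /* the FDM core code under test */
        MPI_Barrier(MPI_COMM_WORLD);
        double elapsed = MPI_Wtime() - t0;
        if (best < 0.0 || elapsed < best)
            best = elapsed;
    }
    return best;                           /* the lowest of the 10 measurements */
}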
Fig. 10. Time required by MPI and GOP programs.
Fig. 11. Speedups achieved by MPI and GOP programs.
GOP spends processing time on converting node names into process IDs and on converting the message into the arguments of the MPI operations. Hence, for the MPI node-to-node communication we use a plain pair of MPI_Send and MPI_Recv routines, while in GOP we evaluate the corresponding pair of send and receive operations (routines Usend and Urecv) and calculate the overhead, as shown below:

int Usend(Graph gname, Node nodename, Msg message, CommMode mode)
{
    int nodeid;                                  /* MPI process ID */

    /* resolve process ID from GOP runtime */
    nodeid = GetNodeID(gname, nodename);
    if (nodeid == -1)
        return -1;
    if (mode == SYN) {                           /* synchronous send */
        MPI_Ssend(message.data, message.length, message.datatype,
                  nodeid, message.tag, MPI_COMM_WORLD);
        return 0;
    } else if (mode == ASYN) {                   /* asynchronous send */
        MPI_Send(message.data, message.length, message.datatype,
                 nodeid, message.tag, MPI_COMM_WORLD);
        return 1;
    } else
        return -1;
}

int Urecv(Graph gname, Node nodename, Msg message, CommMode mode)
{
    int nodeid;                                  /* MPI process ID */
    void *data;                                  /* message data */
    int typesize;

    /* resolve process ID from GOP runtime */
    nodeid = GetNodeID(gname, nodename);
    if (nodeid == -1)
        return -1;
    if ((mode != SYN) && (mode != ASYN))
        return -1;
    /* allocate a buffer for the incoming data */
    MPI_Type_size(message.datatype, &typesize);
    data = malloc((size_t)message.length * typesize);
    MPI_Recv(data, message.length, message.datatype, nodeid,
             message.tag, MPI_COMM_WORLD, &message.status);
    /* convert the received data into a GOP message */
    message.data = createMsgData(data, message.datatype);
    if (mode == SYN)                             /* synchronous receive */
        return 0;
    else                                         /* asynchronous receive */
        return 1;
}
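For reference, the MPI baseline pair mentioned above involves none of this name resolution or message conversion. The snippet below is only a sketch of such a baseline exchange; the ranks, tag and buffer handling are illustrative assumptions, not the paper's benchmark code.

/* Illustrative MPI baseline: a plain send/receive pair between two ranks,
   with none of the GOP name resolution or message conversion. */
#include <mpi.h>

void mpi_baseline_exchange(double *buf, int count, int my_rank)
{
    MPI_Status status;

    if (my_rank == 0)
        MPI_Send(buf, count, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    else if (my_rank == 1)
        MPI_Recv(buf, count, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
}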
During GOP communication, all node names are translated into the process IDs of the mapped processors. The source node invokes the routine Usend, which converts the GOP message type into the input arguments of the MPI routine MPI_Send and sends the message to the destination node. On the other side, the destination node invokes the routine Urecv and waits for the message from the source node. Once the message has arrived, it is converted into a GOP message. Fig. 12 shows the overhead of handling the GOP MP operation. The overhead is small compared with using the same procedures directly under the MPI environment. Fig. 13 shows the program initialization time. Both GOP and MPI require some setup before their library routines can be called. MPI includes an initialization routine, MPI_Init. GOP performs the same step as MPI along with the graph initialization: each running node reads the graph structure from the graph representation, converts it into its programming structure, and loads the graph into memory. The initialization times for both MPI and GOP increase linearly with the number of processes (nodes), and the difference between them is very small.
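The graph loading and name translation described above can be pictured as building a small lookup table on each node. The sketch below is an illustration only, since the runtime's actual data structures are not given in the paper: it assumes the shared graph copy is reduced to a table mapping node names to the MPI ranks they are deployed on, which is the kind of information a routine such as GetNodeID needs to consult.

/* Illustration only: a per-node table mapping logical node names to MPI ranks,
   and the kind of lookup a routine such as GetNodeID would perform. */
#include <string.h>

#define MAX_NODES 64

struct node_entry {
    char name[32];   /* logical node name, e.g. "node3"         */
    int  rank;       /* MPI process ID the node is deployed on  */
};

static struct node_entry node_table[MAX_NODES];  /* filled when the graph is loaded */
static int node_count;

static int lookup_node_id(const char *nodename)
{
    for (int i = 0; i < node_count; i++)
        if (strcmp(node_table[i].name, nodename) == 0)
            return node_table[i].rank;
    return -1;       /* unknown node name: callers treat this as an error */
}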
Fig. 12. Message-passing overhead on GOP.
Fig. 13. Graph initialization time.
6. Conclusions and future work

In this paper, we described a graph-oriented approach to providing high-level abstractions for MP parallel programming. The GOP model has the desirable features of expressiveness and simple semantics. It provides high-level abstractions for writing parallel programs, easing the expression of parallelism, configuration, communication and coordination by directly supporting logical graph operations. Furthermore, sequential programming constructs blend smoothly and easily with parallel programming constructs in GOP. We also presented VisualGOP, a visual programming environment that offers an interactive way for the programmer to develop and deploy parallel applications. We described the implementation of the
GOP environment and reported the results of an evaluation of how GOP performs compared with MPI. The results showed that GOP is as efficient as MPI for parallel programming. In our future work, we will enhance the current implementation with more programming primitives, such as graph update and subgraph generation. We will also define commonly used graph types as built-in patterns for popular programming schemes. Finally, we will develop real-world scientific and engineering computing applications using GOP.

References

[1] O. McBryan, An overview of message passing environments, Parallel Computing 20 (4) (1994) 417–447.
[2] A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, V. Sunderam, PVM: Parallel Virtual Machine: A Users' Guide and Tutorial for Networked Parallel Computing, MIT Press, Cambridge, MA, 1994.
[3] M. Snir et al., MPI: The Complete Reference, MIT Press, Cambridge, MA, 1996.
[4] M. Besch, H. Bi, P. Enskonatus, G. Heber, M. Wilhelmi, High-level data parallel programming in Promoter, in: Proceedings of the 2nd International Workshop on High-Level Parallel Programming Models and Supportive Environments, Geneva, Switzerland, 1997, pp. 47–54.
[5] J. Cao, L. Fernando, K. Zhang, Programming distributed systems based on graphs, in: Intensional Programming I, World Scientific, Singapore, 1996, pp. 83–95.
[6] J. Cao, L. Fernando, K. Zhang, DIG: a graph-based construct for programming distributed systems, in: Proceedings of the 2nd International Conference on High Performance Computing, New Delhi, India, 1995.
[7] J. Cao, X. Ma, A.T. Chan, J. Lu, WebGOP: a framework for architecting and programming dynamic distributed Web applications, in: Proceedings of the International Conference on Parallel Processing (ICPP'02), Vancouver, British Columbia, Canada, 2002.
[8] J. Cotronis, Message-passing program development by ensemble, in: PVM/MPI 97, 1997, pp. 242–249.
[9] J. Cotronis, Developing message-passing applications on MPICH under ensemble, in: PVM/MPI 98, 1998, pp. 145–152.
[10] J. Cotronis, Modular MPI components and the composition of grid applications, in: Proceedings of the 10th Euromicro Workshop on Parallel, Distributed and Network-Based Processing, 2002, pp. 154–161.
[11] P. Hadjidoukas, E. Polychronopoulos, T. Papatheodorou, Integrating MPI and nanothreads programming model, in: Proceedings of the 10th Euromicro Workshop on Parallel, Distributed and Network-Based Processing, 2002, pp. 309–316.
[12] C. Hu, H. Lu, A. Cox, W. Zwaenepoel, OpenMP for networks of SMPs, Journal of Parallel and Distributed Computing 60 (12) (2000) 1512–1530.
[13] D. Banerjee, J.C. Browne, Complete parallelization of computations: integration of data partitioning and functional parallelism for dynamic data structures, in: Proceedings of the 10th International Parallel Processing Symposium (IPPS'96), Honolulu, Hawaii, 1996, pp. 354–360.
[14] P. Newton, A graphical retargetable parallel programming environment and its efficient implementation, Ph.D. thesis, Department of Computer Science, University of Texas at Austin, 1993.
[15] C. Anglano, R. Wolski, J. Schopf, F. Berman, Developing heterogeneous applications using Zoom and HeNCE, in: Proceedings of the 4th Heterogeneous Computing Workshop, Santa Barbara, 1995.
[16] V. Shen, C. Richter, M. Graf, J. Brumfield, VERDI: a visual environment for designing distributed systems, Journal of Parallel and Distributed Computing 9 (2) (1990) 128–137.
[17] P. Newton, J. Dongarra, Overview of VPE: a visual environment for message-passing, in: Proceedings of the 4th Heterogeneous Computing Workshop, 1995.
[18] J. Cao, Z. Ren, A.T. Chan, L. Fang, L. Xie, D.X. Chen, A formalism for graph-oriented distributed programming, in: Visual Programming: From Theory to Practice (2003) 77–109.
[19] J. Traff, Implementing the MPI process topology mechanism, in: Proceedings of the IEEE/ACM SC2002 Conference, Baltimore, MD, 2002.
[20] F. Chan, J. Cao, A.T. Chan, K. Zhang, Visual programming support for graph-oriented parallel/distributed processing, submitted, 2002.
[21] C. Scheurich, M. Dubois, Correct memory operation of cache-based multiprocessors, in: Proceedings of the 14th Annual International Symposium on Computer Architecture, ACM Press, 1987, pp. 234–243.
[22] J. Laudon, D. Lenoski, The SGI Origin: a ccNUMA highly scalable server, in: Proceedings of the 24th Annual International Symposium on Computer Architecture (ISCA'97), Boulder, Colorado, USA, ACM SIGARCH Computer Architecture News, vol. 25, ACM Press, 1997, pp. 241–251.