Future Generation Computer Systems 15 (1999) 699–712
Message-passing environments for metacomputing

Matthias A. Brune a, Graham E. Fagg b, Michael M. Resch c

a Universität-GH Paderborn, Paderborn Center for Parallel Computing, Fürstenallee 11, D-33102 Paderborn, Germany
b University of Tennessee, 104 Ayres Hall, Knoxville, TN 37996-1301, USA
c High Performance Computing Center Stuttgart, Allmandring 30, D-70550 Stuttgart, Germany

Accepted 14 December 1998
Abstract

In this paper, we present the three libraries PACX-MPI, PLUS, and PVMPI, which provide message-passing between different high-performance computers in metacomputing environments. Each library supports the development and execution of distributed metacomputer applications. The PACX-MPI approach offers a transparent interface for the communication between two or more MPI environments. PVMPI allows the user to spawn parallel processes under the MPI environment. The PLUS protocol bridges the gap between vendor-specific (e.g., MPL, NX, and PARIX) and vendor-independent message-passing environments (e.g., PVM and MPI). Moreover, it offers the ability to create and control processes at application runtime. ©1999 Published by Elsevier Science B.V. All rights reserved.

Keywords: Metacomputing; Message-passing library; Distributed application; MPI; PVM
1. Introduction

The idea of metacomputing [1] emerged from the wish to utilize geographically distributed high-performance systems for solving large problems that require computing capabilities not available in a single computer. From the user's perspective, a metacomputer can be regarded as a powerful, cost-effective high-performance computing system based on existing WAN-connected HPC systems. In this way, networked supercomputers can increase the total amount of available computing power.

Message-Passing Environments. The realization of metacomputing brings several obstacles with it. One of them lies in the incompatibility of the vendor-specific message-passing environments, e.g., MPL, NX, and PARIX. On the one hand, these message-passing environments provide optimal communication on the corresponding hardware; on the other hand, it is usually not possible with these communication libraries to exchange messages with other vendors' systems. Vendor-specific message-passing environments only support 'closed world communication' within a single environment. Even with the two vendor-independent message-passing standards PVM [2] and MPI [3], it is not possible for an application to communicate from one programming environment to another.
Considering metacomputing, these limitations must be overcome by providing communication interfaces that act between message-passing environments. How the different models are linked together mainly depends on the requirements of the application scenario for which the metacomputing resource is intended. There are three obvious scenarios. First, there are homogeneous applications: a single compute-intensive application running on a cluster of supercomputers to solve a huge problem. Secondly, there are heterogeneous applications using completely different programming environments: an example might be a CFD application parallelized with MPI that is supposed to work together with an optimization tool based on PVM. Here, the goal is to bridge the gap between MPI and PVM with a metacomputing message-passing environment. Finally, there are heterogeneous applications using the same programming environment: an example might be the coupling of an ocean model and an atmosphere model that are both based on MPI.

Process Models. Another problem in building a real metacomputing environment is the static process model of most message-passing environments. Currently, only PVM offers a dynamic process model in which parallel PVM user tasks are able to spawn new processes during application runtime. However, even this model cannot spawn processes if the remote HPC machine is administered by a resource management system [4], e.g., NQS [5], PBS [6], CCS [7], Codine [8], Condor [9], or LSF [10].

Currently, three different tools enable message-passing between HPC systems:
• PACX-MPI (Parallel Computer extension) [11] was initially developed at the University of Stuttgart to connect a Cray Y-MP to an Intel Paragon, both running under MPI. It has since been extended to couple two or more MPPs into a cluster of high-performance computers.
• PLUS (Program Linkage by Universal Software-interfaces) [12], developed at the University of Paderborn, provides a transparent communication mechanism between vendor-specific (PARIX, MPL, NX) and standard parallel message-passing environments (MPI, PVM). Additionally, it offers a dynamic process model to the supported environments, with the ability to contact HPC systems running under arbitrary resource management systems.
• PVMPI [13], developed at the University of Tennessee, Knoxville, allows different MPP vendors' MPI implementations to intercommunicate in order to aid the use of multiple MPPs. Furthermore, it extends the static process model of MPI-1 to a dynamic one.

In the following, we give an overview of the three approaches PACX-MPI (Section 2), PLUS (Section 3), and PVMPI (Section 4), all of which have proven to be most useful in daily work. In Section 5, we compare the tools and discuss their different working concepts. The last sections give a brief summary and outlook.
2. PACX-MPI

PACX-MPI [14] is a library that enables the linkage of several MPPs into one single MPI-based resource. This allows one to program a metacomputer just like an ordinary MPP. The main goals of PACX-MPI were:
• no changes in the source code,
• a single system image for the programmer,
• use of the fast vendor-implemented MPI for internal communication, and
• use of a standard protocol for external communication.

2.1. Concept of PACX-MPI

To achieve these goals, PACX-MPI was designed as a multi-protocol MPI. All internal communication is done by the vendor-specific MPI implementation, while all external communication is handled by a newly developed daemon concept. As shown schematically in Fig. 1, the library mainly consists of two parts:
• A software layer that manages internal and external communication and decides which protocol to use.
Fig. 1. Schematic description of the concept of PACX-MPI.
Fig. 2. Point-to-point communication for a metacomputer consisting of two machines.
• A daemon that handles all external communication.

Depending on the requirements of the MPI standard, the communication intelligence is put either into the software layer or into the daemon itself. MPI calls that are implemented by PACX-MPI are automatically replaced by calls to PACX-MPI during the linking process, either using macro-directives for C or the MPI profiling interface for Fortran.

The creation of a global MPI_COMM_WORLD requires two numbers for each node: a local number and a global one. In Fig. 2, the local numbers are shown in the lower part of the boxes and the global numbers in the upper part. The daemon nodes have to be split off from the computational nodes. To explain their function, we describe the sequence of a point-to-point communication between global node 2 and global node 7. The sending node first checks whether the receiver is on the same MPP or not. If it is on the same machine, it does a normal MPI_Send(). If not, it creates a command-package, which has the same function as the message envelope in MPI, and transfers this command-package and the data to a communication node, the so-called MPI-server. The MPI-server compresses the data and transfers them via TCP/IP to the destination machine. There, the command-package and the data are received by the so-called PACX-server, the second communication node. The data are decompressed and passed on to the destination node.
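To make the forwarding decision concrete, the following is a minimal C sketch of the send-side logic just described. It is not PACX-MPI source code: the global-rank layout, the command-package struct and the constants below are assumptions for illustration only.

#include <mpi.h>

/* Assumed layout for this sketch: global application ranks
 * [g_first, g_first + g_count) run on this MPP; local ranks 0 and 1 are the
 * MPI-server and PACX-server daemons, application nodes start at local rank 2. */
static int g_first = 0, g_count = 4, g_myglobal = 2;
enum { MPISERVER_LOCAL_RANK = 0 };

typedef struct { int src, dest, tag, count; } pacx_cmd_t;  /* "command-package": plays the role of the MPI envelope */

static int pacx_send(void *buf, int count, MPI_Datatype dtype, int gdest, int tag)
{
    if (gdest >= g_first && gdest < g_first + g_count) {
        /* internal message: translate the global rank to a local rank and use the vendor MPI */
        return MPI_Send(buf, count, dtype, 2 + (gdest - g_first), tag, MPI_COMM_WORLD);
    }
    /* external message: hand the command-package and the data to the MPI-server daemon,
     * which compresses them and forwards them via TCP/IP to the remote PACX-server */
    pacx_cmd_t cmd = { g_myglobal, gdest, tag, count };
    MPI_Send(&cmd, (int)sizeof cmd, MPI_BYTE, MPISERVER_LOCAL_RANK, tag, MPI_COMM_WORLD);
    return MPI_Send(buf, count, dtype, MPISERVER_LOCAL_RANK, tag, MPI_COMM_WORLD);
}

In PACX-MPI itself, the corresponding information comes from the global MPI_COMM_WORLD set up at start-up rather than from hard-coded globals.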
2.2. Buffering concept of PACX-MPI

In previous PACX-MPI versions, the receiver only checked whether the sending node was on the same machine or not. This is no problem for the coupling of two machines, but it may lead to race conditions if more than two machines are involved. To avoid this, message buffering is done on the receiving side. The decision was to buffer messages without a matching MPI_Recv() at the destination node rather than at the PACX-server. This distributes both the memory requirements and the working time. In a point-to-point communication, the receiving node first checks whether the message is an internal one. If so, it is received directly using MPI. Otherwise, the receiving node checks whether the expected message is already in the buffer. Only if this is not the case is the message received from the PACX-server directly.
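The receive path can be sketched in the same hedged way; the buffer list, the helper parameters and the rank constant below are illustrative assumptions, not the actual PACX-MPI data structures.

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

enum { PACXSERVER_LOCAL_RANK = 1 };            /* local rank of the PACX-server (assumed) */

typedef struct buf_entry {                     /* externally received message without a matching receive yet */
    int src, tag, count;
    char *data;
    struct buf_entry *next;
} buf_entry_t;

static buf_entry_t *early_msgs;

static int pacx_recv(void *buf, int count, MPI_Datatype dtype,
                     int gsrc, int tag, int src_is_internal, int local_src)
{
    MPI_Status st;

    if (src_is_internal)                       /* internal message: plain vendor MPI */
        return MPI_Recv(buf, count, dtype, local_src, tag, MPI_COMM_WORLD, &st);

    /* external message: first look into the buffer of early arrivals ... */
    for (buf_entry_t **p = &early_msgs; *p; p = &(*p)->next) {
        if ((*p)->src == gsrc && (*p)->tag == tag) {
            buf_entry_t *e = *p;
            memcpy(buf, e->data, e->count);    /* sketch only: byte copy, datatypes ignored */
            *p = e->next;
            free(e->data); free(e);
            return MPI_SUCCESS;
        }
    }
    /* ... only if it is not buffered, receive it from the PACX-server directly */
    return MPI_Recv(buf, count, dtype, PACXSERVER_LOCAL_RANK, tag, MPI_COMM_WORLD, &st);
}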
Fig. 3. A broadcast operation in PACX-MPI 3.0.
2.3. Data conversion in PACX-MPI

To support heterogeneous metacomputing, PACX-MPI has to perform data conversion. Initially, we thought of having the two communication nodes handle all data conversion. However, for the MPI_PACKED datatype the receiver has to know exactly what the content of the message is. Therefore, we designed the data conversion concept as follows (a small XDR sketch follows the hostfile example in Section 2.5):
• The sending node converts the data into the XDR data format if it prepares a message for another machine. For internal communication, no additional work is incurred.
• The receiver converts the data from the XDR format into its own data representation.
• For the datatype MPI_PACKED, a data conversion to XDR format is done while executing MPI_Pack(), even for internal communication.

Because of the high overhead, data conversion can be enabled and disabled by a compiler option of PACX-MPI. This allows the optimization of applications for homogeneous metacomputing.

2.4. Global communication in PACX-MPI

The sequence of a broadcast operation from node 2 to MPI_COMM_WORLD is shown in Fig. 3. First, node 2 sends a command-package and a data package to the MPI-server. Then a local MPI_Bcast() is executed. On the second machine, the PACX-server transfers the command and the data package to the application node with the smallest number, which performs the local broadcast. This means that global operations are handled locally by nodes from the application part rather than by one of the servers.

2.5. Starting a PACX-MPI application

To use PACX-MPI, the user has to compile and link his application with the PACX-MPI library. The main difference for the user is the start-up of the application. He has to provide two additional nodes on each machine, which handle the external communication. Thus, an application that needs 1024 nodes on a single T3E takes 2 × 514 nodes if running on two separate T3Es. The user then has to configure a hostfile, which has to be identical on each machine. The hostfile contains the names of the machines, the number of application nodes, the protocol used for the communication with each machine and, optionally, the start-up command for the automatic start-up facility of PACX-MPI 3.0. Such a hostfile may look like this:

#Machine  NumNodes  Protocol  Start-up command
host1     100       tcp
host2     100       tcp       (rsh host2 mpirun -np 102 ./exename)
host3     100       tcp       (rsh host3 mpirun -np 102 ./exename)
host4     100       tcp       (rsh host4 mpirun -np 102 ./exename)
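Referring back to the data conversion of Section 2.3, the following sketch shows the kind of XDR encoding and decoding involved. It uses the standard SunRPC XDR routines and is not taken from PACX-MPI itself; the buffer handling is an illustrative assumption.

#include <rpc/xdr.h>
#include <stdio.h>

int main(void)
{
    double values[4] = { 1.0, 2.5, -3.75, 4.125 }, decoded[4];
    char wire[4 * 8];                 /* XDR encodes a double in 8 bytes */
    XDR enc, dec;
    unsigned int n = 4;

    /* sender side: convert to the machine-independent XDR representation */
    xdrmem_create(&enc, wire, sizeof wire, XDR_ENCODE);
    xdr_vector(&enc, (char *)values, n, sizeof(double), (xdrproc_t)xdr_double);

    /* ... wire[] would now be handed to the MPI-server and sent via TCP/IP ... */

    /* receiver side: convert from XDR into the local data representation */
    xdrmem_create(&dec, wire, sizeof wire, XDR_DECODE);
    xdr_vector(&dec, (char *)decoded, n, sizeof(double), (xdrproc_t)xdr_double);

    printf("decoded[2] = %g\n", decoded[2]);   /* -3.75 on any platform */
    return 0;
}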
Fig. 4. Process communication via PLUS between different programming environments.
3. PLUS

PLUS [12,15] provides a fast communication interface between different message-passing environments, including both vendor-specific (e.g., MPL, NX, PARIX) and standard environments (e.g., MPI, PVM). Existing applications can easily be modified to communicate with applications operating under other message-passing models. The original send and receive operations need not be modified in the application's source code: PLUS automatically distinguishes between 'internal' and 'external' communication partners. It supports communication over LAN, MAN, and WAN.

PLUS consists of daemons and a modular library that is linked to the application code. Only four new commands are needed to integrate PLUS into an existing code:
• plus_init() for signing on at the nearest PLUS daemon,
• plus_exit() for logging off,
• plus_system() for spawning tasks on another (remote) system, and
• plus_info() for obtaining information on the accessible (remote) tasks.

From an application's view, the PLUS communication is completely hidden behind the native communication routines used in the application. By means of macro substitution, PLUS provides the linkage to external systems without modifying the source code. Thereby, an MPI application, for example, makes implicit use of PLUS when using MPI communication routines. Data conversion between different data representations on the participating hardware platforms (little or big endian) is handled transparently by PLUS using XDR.

3.1. How it works together

In the following, we describe an illustrative example (Fig. 4) of a process communication between C/MPI and C/PVM; a usage sketch is given below. Both applications have to 'sign on' at the PLUS master daemon using the plus_init() command, with the daemon location and a logical name of the programming environment as parameters. plus_init() returns the result of the initialization. The daemons then autonomously exchange process tables and control information with each other. After successful initialization, the remote processes are able to communicate with each other via PLUS. From the MPI application, the communication is done by calling MPI_Send() with a partner identifier that is greater than the last (maximum) process identifier managed by MPI in the corresponding communicator group (e.g., MPI_COMM_WORLD). Generally, PLUS recognizes an external communication by a process identifier outside the ID range of the corresponding programming environment. Within PLUS, all process identifiers are given relative to the number of active tasks. For situations involving more than two programming environments, the plus_info() function can be used to identify remote communication partners. It returns a data structure identifying the accessible remote programming environments and the corresponding processes in consecutive order.
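The following minimal sketch illustrates this usage pattern from the MPI side. The plus_init()/plus_exit() argument lists, the daemon host name and the way the first external task is addressed are assumptions for illustration; only the call names themselves are taken from the text above.

#include <mpi.h>

/* assumed prototypes, for illustration only */
int plus_init(const char *daemon_host, const char *env_name);
int plus_exit(void);

int main(int argc, char **argv)
{
    int rank, size, payload = 42;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    plus_init("plusdaemon.example.org", "mpi");   /* sign on at the nearest PLUS daemon (host name hypothetical) */

    /* ranks 0 .. size-1 are ordinary MPI partners; any identifier >= size is
     * recognized by PLUS as an external (e.g., PVM) partner, and the
     * macro-substituted MPI_Send() is routed through the PLUS daemons.
     * This only works when the program is linked against libplus.a. */
    if (rank == 0)
        MPI_Send(&payload, 1, MPI_INT, size /* first external task */, 0, MPI_COMM_WORLD);

    plus_exit();                                  /* log off from the daemon */
    MPI_Finalize();
    return 0;
}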
Fig. 5. Remote process creation with PLUS on a Parsytec operating under the CCS resource management system.
Most of the PLUS system code is contained in the daemons. This made it possible to keep the PLUS library small. The daemons shield the application processes from the slow Internet communication: whenever a message is to be transferred between different environments, the daemons take responsibility for delivering the message while the application process is able to proceed with its computation.

3.2. Process creation and management

Today's programming environments have succeeded in making the communication aspect easier, but powerful mechanisms for the dynamic creation of new processes are still missing. As an example, the current MPI-1 standard does not provide a dynamic process model at all. Moreover, PVM cannot spawn dedicated processes that do not belong to the parallel program. Since message-passing environments are mainly used by communicating parallel programs, this situation is less than satisfying. We need a flexible and powerful interface between programming environments and resource management systems.

PLUS [15] provides plus_system() and plus_info() to spawn new processes and to obtain information on remote processes for programming environments lacking a dynamic process model (such as MPI-1). Fig. 5 illustrates the spawning of new processes, and a small sketch is given below. Each daemon maintains a small database listing all available machines with the corresponding commands to be used for spawning processes. When an application calls plus_system(), the remote PLUS daemon invokes the corresponding command for allocating the requested number of computing nodes at the target machine. This is done either with the default parameters found in a local configuration database or with the user-supplied parameters of the plus_system() call. The function plus_system() also accepts complex program descriptions specified in the Resource and Service Description (RSD) [16]. As a side effect, a new PLUS daemon may be brought up at the remote site to provide faster access.
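As announced above, a small sketch of such a spawn request follows; the plus_system() signature and the request string are assumptions, since the paper does not give the exact prototype.

#include <stdio.h>

int plus_system(const char *request);   /* assumed prototype, for illustration only */

void spawn_workers(void)
{
    /* ask the responsible PLUS daemon to allocate 16 nodes on a (hypothetical)
     * CCS-managed Parsytec partition and start the worker binary there */
    if (plus_system("CCS:parsytec nodes=16 program=./worker") < 0)
        fprintf(stderr, "plus_system() failed: no nodes allocated\n");
}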
Fig. 6. The PLUS system library libplus.a and the ISO/OSI model.
3.3. PLUS system architecture

PLUS Library. Fig. 6 shows the library libplus.a, which is composed of a translation module and a communication module. The translation module interfaces between the programming environment and the PLUS system. By means of macro substitution in the application code, PLUS checks whether the communication is internal or external. In the first case, the native message-passing call is performed at runtime, whereas in the latter case, a PLUS routine takes care of the communication with the remote system.

The communication module in PLUS is responsible for all external communication. It supports various network protocols, e.g., TCP/IP and UDP/IP, thereby providing flexible communication between various programming environments via the fastest available network protocol. Initially, we used TCP/IP as the underlying communication layer. It provides a reliable, connection-oriented service. Later, we found TCP/IP to be inappropriate because of its high overhead, especially when opening several connections via the WAN. As TCP/IP keeps a file descriptor for each connection in main memory, the maximum number of descriptors might not be sufficient to allow for many concurrent external connections. Therefore, our current implementation uses the unreliable, connection-less UDP/IP protocol with a maximum packet length of 8 KB. With UDP/IP, it is possible to communicate with an unlimited number of processes over a single descriptor. To provide a reliable service, we implemented a segmentation and desegmentation algorithm as well as a flow control with recovery. The segmentation/desegmentation algorithm splits a message into packets of 8 KB and re-assembles the packets at the receiving node (a small sketch is given at the end of this section). For error recovery, PLUS uses a sliding-window mechanism with a variant of selective reject, where only those packets that got lost (i.e., remained unacknowledged) are re-sent. The sender sends the packets from its window until either a timeout occurs or the window is empty. When a timeout for a sent packet occurs, the packet is re-sent and a new timeout is calculated. The receiver sorts the packets into its window and acknowledges receipt.

PLUS Daemons. Before running an application program, a PLUS master daemon must be brought up by the UNIX command startplus. Thereafter, the master daemon autonomously starts additional daemons that are (logically) interconnected in a clique topology. Clearly, the daemons' locations in the network are essential for the overall communication performance. PLUS daemons should be run on network routers or on powerful workstations. Moreover, it is advisable to run a daemon on the frontends (e.g., I/O nodes) of a parallel system. When using PLUS on workstation clusters, it is advisable to run at least one daemon in each sub-network. Depending on the actual communication load in the sub-network, more daemons may be started by the master daemon on request of 'normal' daemons. Any daemon whose work load (number of supported user tasks) exceeds a certain threshold can ask the master to bring up further daemons.
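The segmentation step mentioned above can be sketched as follows. The packet header layout and the function name are assumptions; the 8 KB payload limit is taken from the text, and the sliding-window retransmission itself is omitted for brevity.

#include <stdlib.h>
#include <string.h>

#define PLUS_PKT_PAYLOAD 8192          /* UDP packet payload limit used by PLUS */

typedef struct {
    int msg_id, seq, total, len;       /* assumed packet header layout */
    char data[PLUS_PKT_PAYLOAD];
} plus_packet_t;

/* Split 'len' bytes into sequence-numbered packets so that the receiver can
 * re-assemble them and selectively acknowledge each one.
 * Returns the number of packets written to *out. */
int plus_segment(const char *msg, int len, int msg_id, plus_packet_t **out)
{
    int total = (len + PLUS_PKT_PAYLOAD - 1) / PLUS_PKT_PAYLOAD;
    plus_packet_t *pkts = malloc(total * sizeof *pkts);

    for (int i = 0; i < total; i++) {
        int chunk = (i == total - 1) ? len - i * PLUS_PKT_PAYLOAD : PLUS_PKT_PAYLOAD;
        pkts[i] = (plus_packet_t){ .msg_id = msg_id, .seq = i, .total = total, .len = chunk };
        memcpy(pkts[i].data, msg + i * PLUS_PKT_PAYLOAD, chunk);
    }
    *out = pkts;
    return total;                      /* the sender keeps these until they are acknowledged */
}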
4. PVMPI

PVMPI, developed at the University of Tennessee, is a powerful combination of the proven and widely ported Parallel Virtual Machine (PVM) system and MPI. The system aims to allow different vendors' MPI implementations to inter-operate directly with each other using the normal MPI API, instead of relying on a single MPI implementation such as MPI/CH, which may be sub-optimal in performance on individual MPP systems.
Two important features of PVMPI are its transparent nature and its flexibility. Additionally, PVMPI allows flexible control over MPI applications by providing access to all the process control and resource control functions available in the PVM virtual machine.

4.1. Virtual machine resource and process control

The PVM virtual machine is defined as a dynamic collection of parallel and serial hosts. With the exception of one host, any number of hosts can join, leave, or fail without affecting the rest of the virtual machine. In addition, the PVM resource control API allows the user to add or delete hosts, check that a host is responding, shut down the virtual machine, or be notified by a user-level message that a host has been added or deleted (intentionally or not).

The PVM virtual machine is very flexible in its process control capabilities. It can start serial or parallel processes that may or may not be PVM applications. For example, PVM can spawn an MPI application as easily as it can spawn a PVM application. The PVM process control API allows any process to join or leave the virtual machine. Additionally, PVM provides plug-in interfaces for expanding its resource and process control capabilities. This extendibility has encouraged many projects to use PVM in different distributed computing environments with dedicated schedulers [17], load balancers and process migration tools [18].

4.2. MPI communicators

Although the MPI standard does not specify how processes are started, it does dictate how MPI processes enrol into the MPI system. All MPI processes join the MPI system by calling MPI_Init() and leave it by calling MPI_Finalize(); calling MPI_Init() twice causes undefined behavior. Processes in MPI are arranged in rank order, from 0 to N − 1, where N is the number of processes in a group. These process groups define the scope for all collective operations within that group. Communicators consist of a process group, a context, topology information and local attribute caching. All MPI communication can only occur within a communicator. Once all the expected MPI processes have started, a common communicator called MPI_COMM_WORLD is created for them by the system. Communications between processes within the same communicator or group are referred to as intra-communicator communications. Communications between disjoint groups are inter-communicator communications. The formation of an inter-communicator requires two separate (non-overlapping) groups and a common communicator between the leaders of each group, as shown in Fig. 7 and in the sketch below. The MPI-1 standard does not provide a way to create an inter-communicator between two separately initiated MPI applications, since no global communicator exists between them.
Fig. 7. Inter-communicator formed inside a single MPI COMM WORLD.
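For reference, the following plain MPI-1 sketch shows the group/inter-communicator mechanics summarized above within a single MPI_COMM_WORLD, which is exactly the common communicator that is missing between two separately started applications. It assumes an even number of processes and is purely illustrative.

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Comm half, inter;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* two non-overlapping groups: lower and upper half of the ranks */
    int color = (rank < size / 2) ? 0 : 1;
    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &half);

    /* the group leaders (world ranks 0 and size/2) are bridged through the
     * common peer communicator MPI_COMM_WORLD, the element that two
     * separately initiated applications do not share */
    int remote_leader = (color == 0) ? size / 2 : 0;
    MPI_Intercomm_create(half, 0, MPI_COMM_WORLD, remote_leader, 99, &inter);

    MPI_Comm_free(&inter);
    MPI_Comm_free(&half);
    MPI_Finalize();
    return 0;
}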
The scope of each application is limited by its own MPI_COMM_WORLD, which by its nature is distinct from any other application's MPI_COMM_WORLD. Since all internal details are hidden from the user and MPI communicators have relevance only within a particular runtime instance, MPI-1 implementations cannot inter-operate.

4.3. The PVMPI system

We developed a prototype system [19] to study the issues of interconnecting MPI and PVM. Three separate issues were addressed:
1. mapping identifiers and managing MPI and PVM IDs,
2. transparent MPI message-passing, and
3. start-up facilities and process management.

4.3.1. Mapping identifiers

A process in an MPI application is identified by a tuple pair {communicator, rank}. PVM provides similar functionality through the use of its group library. The PVM tuple is {group name, instance}. PVMPI provides an address mapping from the MPI tuple space to the PVM tuple space and vice versa. An initial prototype version of PVMPI [13] used such a system without any further translation (or hiding of mixed identifiers). The association of this tuple pair is achieved by registering each MPI process into a PVM group with a user-level function call. A matching dis-associate or leave call is also provided:

info = PVMPI_Register(char *group, MPI_Comm comm, int *handle);
info = PVMPI_Leave(char *group);

Both the register and the leave function are collective and blocking: all processes in the specified MPI communicator have to participate. The PVMPI_Leave() call is used to clean up MPI data structures and to leave the PVM system in an orderly way if required. Processes can register in multiple groups, although currently separate applications cannot register into a single group with this call (i.e., take the same group name). A usage sketch of these calls is given below. The register call takes each member of the communicator and makes it join a named PVM group so that its instance number within that group matches its MPI rank. Since any two MPI applications may be executing on different systems using different implementations of MPI (or even different instances of the same version), the communicator usually has no meaning outside any application-callable library. The PVM group server, however, can be used to resolve identity when the names of groups are unique. Once the application has registered, an external process can access it by using that process' group name and instance via the library calls pvm_gettid() and pvm_getinst(). When the groups have been fully formed, they are frozen and all their details are cached locally to reduce system overhead.

4.3.2. Transparent messaging

The mixing of MPI and PVM group calls requires the understanding of two different message-passing systems, their APIs, semantics and data formats. A better solution is to transparently provide interoperability of MPI applications by utilizing only the MPI API. As previously stated, MPI uses communicators to identify message universes, and not PVM group names or TIDs. Thus PVMPI could not allow users to utilize the original MPI calls for inter-application communication. The solution is to allow the creation of virtual communicators that map either onto PVM, and hence onto remote applications, or onto real MPI intra-communicators for local communication. In order to provide transparency and handle all possible uses of communicators, all MPI routines using communicators were re-implemented using MPI's profiling interface. This interface allows user library calls to be intercepted on a selective basis so that debugging and profiling tools can be linked into applications without any source code changes.
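The usage sketch announced in Section 4.3.1 follows; the surrounding MPI code and the group name are illustrative, and the prototypes are declared directly as quoted above since the paper does not name a header file.

#include <mpi.h>

/* prototypes as given in the text (header file name not specified in the paper) */
int PVMPI_Register(char *group, MPI_Comm comm, int *handle);
int PVMPI_Leave(char *group);

int main(int argc, char **argv)
{
    int handle, info;

    MPI_Init(&argc, &argv);

    /* collective over MPI_COMM_WORLD: every rank joins the PVM group "ocean"
     * so that its group instance number equals its MPI rank */
    info = PVMPI_Register("ocean", MPI_COMM_WORLD, &handle);

    /* ... normal MPI code, plus inter-application traffic as in Section 4.3.2 ... */

    info = PVMPI_Leave("ocean");       /* clean up and leave the PVM system tidily */
    MPI_Finalize();
    return info;
}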
Creating dual-role communicators within MPI would have required altering MPI's low-level structure. As this was not feasible, an alternative approach was taken: PVMPI maintains its own concept of a communicator, using a hash table
Fig. 8. MPI profiling interface controlling communicator translation.
to store the actual communication parameters. As communicators in MPI are opaque data structures, this behavior has no impact on end-user code. PVMPI communicator usage is thus completely transparent, as shown in Fig. 8. Intra- and inter-communicator communications within a single application (MPI_COMM_WORLD) proceed as normal, while inter-application communications proceed by the use of a PVMPI inter-communicator formed with the PVMPI_Intercomm_create() function:

info = PVMPI_Intercomm_create(int handle, char *gname, MPI_Comm *intercom);

This function is almost identical to the normal MPI inter-communicator create call, except that it takes a handle from the register function instead of a communicator to identify the local group, and a registered name for the remote group. The handle is used to differentiate between local groups registered under multiple names. The default call is blocking and collective, although a non-blocking version has been implemented that can time out or warn if the requested remote group has attempted to start and then failed, so that appropriate action can be taken to aid fault tolerance. PVMPI inter-communicators are freed using the normal MPI function call MPI_Comm_free(). They can be formed, destroyed and recreated without restriction. Once formed, they can be used exactly like a normal MPI inter-communicator, except that in the present version of PVMPI they cannot be used in the formation of any new communicators. PVMPI inter-communicators allow the full range of MPI point-to-point message-passing calls; a coupling sketch is given below. A number of data formatting and (un)packing options are also supported, including user-derived datatypes (i.e., mixed striding and formats). Receive operations across inter-communicators rely upon adequate buffering at the receiving end, in line with normal PVM operation.

4.3.3. Low-level start-up facilities

The spawning of MPI jobs from PVM requires different procedures depending upon the target system and the MPI implementation involved. The situation is complicated by the desire to avoid adding many additional spawn calls (the current intention of the MPI-2 forum). Instead, a number of different MPI-implementation-specific taskers have been developed that intercept the internal PVM spawn messages and then correctly initiate the MPI applications as required.
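The coupling sketch announced in Section 4.3.2 follows; the group names and the received message are examples, the declared prototypes follow the text above, and everything else is an illustrative assumption. The matching send is expected from the separately started partner application.

#include <mpi.h>

/* prototypes as given in the text (header file name not specified in the paper) */
int PVMPI_Register(char *group, MPI_Comm comm, int *handle);
int PVMPI_Leave(char *group);
int PVMPI_Intercomm_create(int handle, char *gname, MPI_Comm *intercom);

int main(int argc, char **argv)
{
    int handle, rank, value = 0;
    MPI_Comm inter;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    PVMPI_Register("ocean", MPI_COMM_WORLD, &handle);

    /* blocking and collective: connect the local group "ocean" to the
     * separately started application registered as "atmosphere" */
    PVMPI_Intercomm_create(handle, "atmosphere", &inter);

    /* ordinary MPI point-to-point calls work across the inter-communicator:
     * rank 0 receives one integer sent by rank 0 of the "atmosphere" application */
    if (rank == 0)
        MPI_Recv(&value, 1, MPI_INT, 0, 0, inter, &status);

    MPI_Comm_free(&inter);             /* freed like any MPI communicator */
    PVMPI_Leave("ocean");
    MPI_Finalize();
    return 0;
}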
Fig. 9. General resource manager and taskers handling process management.
4.3.4. Process management under a general resource manager

The PVM GRM [17] can be used with specialized PVMPI taskers to manage MPI applications in an efficient and simple manner. This provides improved performance [20] and better flexibility compared to the simple hostfile utilized by most MPIRUN systems.

When a user's spawn request is issued, it is intercepted by the GRM and an attempt is made to optimize the placement of tasks upon the available hosts. If the placement is specialized, then the appropriate taskers are used. Fig. 9 shows a system with three clusters of machines: one each for MPI/CH, LAM and general-purpose jobs. In this figure, the start request causes two MPI/CH nodes to be selected by the GRM; the actual processes are then started by the MPI/CH tasker.

4.4. Non-PVM based PVMPI, or MPI Connect

The PVMPI system suffered from the need to provide access to a PVM daemon (pvmd) at all times. On many MPP systems that enforce the use of a batch queuing job control system on top of their native runtime systems, as is the case for many IBM SP2 and Cray T3E installations, it is not possible to provide concurrent access to both a PVM daemon and the MPI application. MPI Connect is a new system based upon PVMPI that uses the MPI profiling interface as in PVMPI, but where non-local communication is forwarded via the native MPI to relay processes that route it to its final destination, as in PACX-MPI. The system uses RCDS [21] instead of the PVM group server for name resolution and resource discovery, and SNIPE [22] for inter-machine communication.
5. Comparing PACX-MPI, PLUS, and PVMPI

A comparison of the presented tools is difficult, because each of them obviously has its own specific goal. However, it may help to classify them with respect to technical concept and application support in order to evaluate their suitability for future developments.
• PACX-MPI tries to overcome the lack of an MPI implementation that fully supports multi-protocol handling on all kinds of platforms. It has evolved from very specific projects and is therefore limited in functionality. The basic concept of using a daemon has proven to be very useful. Additionally, it may help to improve outgoing traffic and adapt it to the available network resources.
• PLUS bridges the gap between vendor-specific (e.g., MPL, NX, and PARIX) and vendor-independent message-passing environments (e.g., PVM and MPI). It provides a fast point-to-point communication interface between the different models. By offering the ability to create and control parallel processes at application runtime, it enhances message-passing environments that support only static process models (e.g., MPI-1, PARIX, and MPL). To this end, it contains an interface for accessing resource management systems to allow an optimal process mapping on the target platforms.
• PVMPI tries to integrate dynamic process capabilities from PVM into MPI while allowing it to be used in a metacomputing environment. PVMPI concentrates on a cooperating application programming model where
multiple applications can join and inter-communicate using point-to-point communications and then disconnect from each other. PVMPI also gives MPI applications the ability to spawn new tasks, although this is not as well integrated as in MPI-2, since it does not allow collective communication between separately initiated applications.

From a programmer's point of view, all libraries offer some additional functionality but at the same time require some work. The amount of work is smallest for PACX-MPI, since it tries to be fully transparent for the user. On the other hand, this may reduce performance. While PACX-MPI has to cope with the problems arising from a possible mixture of global communication and point-to-point communication, the two other approaches concentrate on the latter. However, PLUS and PVMPI offer the great advantage of extending the almost static process models of the message-passing environments to dynamic ones. Thereby, metacomputer applications do not waste computing power, because they are able to dynamically allocate and deallocate resources on the basis of their runtime demands. Moreover, the PLUS package provides a user interface for accessing metacomputer resources running under a resource management system. In comparison to PACX-MPI and PVMPI, the PLUS library has a different design goal: it does not only bridge the gap between two or more MPI environments, it also offers the ability to communicate between heterogeneous message-passing environments such as PVM, MPI, PARIX, and MPL.
6. Future work

Future work will be driven by the experience gained from the projects described here. There are some lessons that we have learned:
• There are many different types of applications that need support for metacomputing. No single library can so far support all of them.
• MPI-2 [23] may be a way to bring more functionality into one single library.
• The daemon concept has turned out to be the most useful approach for handling external communication. It reduces the costs of management and bundles the work that has to be done for performance tuning and data conversion.
• The performance depends on the appropriate choice of an asynchronous communication protocol, which can improve the bandwidth significantly.
• Latency can hardly be influenced by a metacomputing library and is mainly dominated by the network.

Considering these items, the future direction will lead us towards an MPI-2 implementation. Since the standard is very large, future libraries will have to concentrate on those issues that are important for metacomputing: dynamic process management and parallel I/O. Other issues such as one-sided communication will not be addressed, as we currently see no need for them in a real metacomputing environment.
7. Conclusions

Metacomputing environments are much more difficult to access, organize and maintain than any single high-performance computing system. Unfortunately, it is complicated to implement an efficient parallel program for a metacomputer. There are several problems to deal with: different resource access methods, unstable WAN network connections, and complex load-balancing strategies on the HPC resources of the metacomputer. In spite of all these difficulties, metacomputing provides a powerful, cost-efficient method for solving well-known problems, e.g., CFD simulations, fast Fourier transforms, and Monte Carlo solvers.

In this paper, we presented three inter-message-passing libraries which support the user in developing and running metacomputer applications. The PACX-MPI library can be used for coupling two or more supercomputers operating under MPI. PVMPI integrates dynamic process capabilities from PVM into MPI while allowing it to be used in a metacomputing environment. The PLUS approach allows parallel tasks on
supercomputers or workstations to communicate transparently between the public message-passing standards and the vendor-supplied message-passing environments. Like PVMPI, PLUS enhances the static process models of its supported message-passing environments to dynamic ones.

References

[1] L. Smarr, C.E. Catlett, Metacomputing, Commun. ACM 35 (6) (1992) 45–52.
[2] A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, V. Sunderam, PVM: Parallel Virtual Machine – A User's Guide and Tutorial for Network Parallel Computing, MIT Press, Cambridge, MA, 1994.
[3] Message Passing Interface Forum, MPI: a message-passing interface standard, Int. J. Supercomputer Appl. 8 (3/4) (1994).
[4] M. Baker, G. Fox, H. Yau, Cluster Computing Review, Northeast Parallel Architectures Center, Syracuse University, New York, November 1995. http://www.npar.syr.edu/techreports/index.html
[5] C. Albing, Cray NQS: production batch for a distributed computing world, in: 11th Sun User Group Conf. and Exhibition, Brookline, USA, December 1993, pp. 302–309.
[6] A. Bayucan, R.L. Henderson, T. Proett, D. Tweten, B. Kelly, Portable Batch System: External Reference Specification, Release 1.1.7, NASA Ames Research Center, June 1996.
[7] A. Keller, A. Reinefeld, CCS resource management in networked HPC systems, in: 7th Heterogeneous Computing Workshop HCW '98 at IPPS/SPDP '98, Orlando, FL, IEEE Computer Soc. Press, Silver Spring, MD, 1998, pp. 44–56.
[8] GENIAS Software GmbH, Codine: Computing in Distributed Networked Environments. http://www.genias.de/products/codine/
[9] M.J. Litzkow, M. Livny, Condor – a hunter of idle workstations, in: Proc. 8th IEEE Int. Conf. on Distributed Computing Systems, June 1988, pp. 104–111.
[10] LSF, Product Overview. http://www.platform.com/products/
[11] T. Beisel, E. Gabriel, M. Resch, An extension to MPI for distributed computing on MPPs, in: M. Bubak, J. Dongarra, J. Wasniewski (Eds.), Recent Advances in Parallel Virtual Machine and Message Passing Interface, LNCS, Springer, Berlin, 1997, pp. 25–33.
[12] M. Brune, J. Gehring, A. Reinefeld, Heterogeneous message passing and a link to resource management, J. Supercomputing 11 (1997) 355–369.
[13] G.E. Fagg, J. Dongarra, PVMPI: an integration of the PVM and MPI systems, Calculateurs Paralleles 8 (1996) 151–166.
[14] M. Resch, D. Rantzau, H. Berger, K. Bidmon, R. Keller, E. Gabriel, A metacomputing environment for computational fluid dynamics, in: Parallel CFD Conf., Hsinchu, Taiwan, May 1998.
[15] A. Reinefeld, M. Brune, J. Gehring, Communicating across parallel message-passing environments, J. Syst. Architecture 44 (1998) 261–272.
[16] M. Brune, J. Gehring, A. Keller, A. Reinefeld, RSD – resource and service description, in: 12th Int. Symp. on High-Performance Computing Systems and Applications HPCS '98, Edmonton, Canada, Kluwer Academic Publishers, Dordrecht, 1998.
[17] G.E. Fagg, K. London, J. Dongarra, Taskers and general resource manager: PVM supporting DCE process management, in: Proc. 3rd EuroPVM Group Meeting, Munich, Springer, Berlin, October 1996, pp. 167–174.
[18] G. Stellner, J. Pruyne, Resource management and checkpointing for PVM, in: Proc. EuroPVM '95, Paris, 1995, pp. 130–136.
[19] G.E. Fagg, J. Dongarra, A. Geist, PVMPI provides interoperability between MPI implementations, in: Proc. 8th SIAM Conf. on Parallel Processing, March 1997.
[20] G.E. Fagg, S.A. Williams, Improved program performance using a cluster of workstations, Parallel Algorithms and Appl. 7 (1995) 233–236.
[21] K. Moore, S. Browne, J. Cox, J. Gettler, The Resource Cataloging and Distribution System, Technical Report, Computer Science Department, University of Tennessee, December 1996.
[22] G.E. Fagg, K. Moore, J. Dongarra, A. Geist, Scalable networked information processing environment (SNIPE), in: Proc. Supercomputing '97, San Jose, CA, November 1997.
[23] Message Passing Interface Forum, MPI-2: Extensions to the Message-Passing Interface, University of Tennessee, Knoxville, TN, 1997.
Matthias A. Brune received his Diploma in Computing Science from the University of Paderborn in 1997. Currently, he is a staff member at the Paderborn Center for Parallel Computing. His research focuses on parallel computing, in particular high-performance computing and metacomputing. He works on the design of distributed resource management software and runtime environments for metacomputers. He has published several papers at national and international conferences. Besides metacomputing, his research interests include high-speed networking protocols.
Graham E. Fagg received his BSc in Computer Science and Cybernetics from the University of Reading, England in 1991. From 1991 to 1993, he worked on CASE tools for interconnecting array processors and transputer systems. From 1994 to the end of 1995, he was a research assistant in the Cluster Computing Laboratory at the University of Reading, working on code generation tools for group communications. Since 1996, he has worked as a senior research associate at the University of Tennessee with Professor Jack Dongarra within the Innovative Computing Laboratory. His current research interests include scheduling, resource management and high speed networking. He is currently involved in the development of four different metacomputing systems: SNIPE, MPI Connect, HARNESS and MPI-2.
Michael M. Resch received his Diploma degree in Technical Mathematics from the Technical University of Graz, Austria, in 1990. From 1990 to 1993, he was with JOANNEUM RESEARCH – a leading Austrian research company – working on the numerical simulation of groundwater flow and groundwater pollution on high-performance computing systems. Since 1993, he has been with the High Performance Computing Center Stuttgart. The focus of his work is on parallel programming models. He is responsible for the development of message-passing software and numerical simulations in international and national metacomputing projects. Since 1998, he has been the head of the Parallel Computing Group of the High Performance Computing Center Stuttgart. His current research interests include parallel programming models, metacomputing and the numerical simulation of viscoelastic fluids.