Future Generation Computer Systems 8 (1992) 191-203 North-Holland
Heterogeneous network-based concurrent computing environments *

V.S. Sunderam

Department of Math & Computer Science, Emory University, Atlanta, GA 30322, USA
Abstract
Sunderam, V.S., Heterogeneous network-based concurrent computing environments, Future Generation Computer Systems 8 (1992) 191-203.

Concurrent computing on networks of heterogeneous machines has gained tremendous attention and popularity in recent months. This computing model encompasses a wide spectrum of environments, ranging from very coherent systems to loosely coupled workstations, but usually excludes distributed operating systems. Many such systems are operational (several more are under development) and have been used for an impressive range of applications. Heterogeneous network computing environments have the potential for multi-gigaflop performance, provide excellent support for applications with multifaceted requirements, enable straightforward provisions for fault-tolerance, and are easy to install, maintain, and use. These systems enhance the power, versatility, and usefulness of existing hardware, and support well-understood, pragmatic programming models. In this paper, we will survey the state of the art in heterogeneous network computing, beginning with a description of representative systems and applications. We will also attempt to identify the most pressing needs as well as the foreseeable obstacles, and pose a sampling of the major challenges that face researchers and developers in heterogeneous network computing.

Keywords: Concurrency; heterogeneous computing; distributed systems; high-performance; practical experiences.
1. Introduction

Heterogeneous network computing environments have recently proven to be effective and economical platforms for high-performance computing in a number of disciplines. Particularly in the area of computational science, where the demand for computing power is ever-increasing, network-based environments provide a very attractive alternative to traditional vector supercomputers and monolithic multiprocessors.

Correspondence to: V.S. Sunderam, Department of Math & Computer Science, Emory University, Atlanta, GA 30322, USA.
* This work was supported by the Applied Mathematical Sciences program, Office of Basic Energy Sciences, US Department of Energy, under Grant No. DE-FG05-91ER25105. An abridged version of this paper was presented at the Parallel and Distributed Workstation Systems Workshop, September 1991.
From the economic point of view, network computing systems provide supercomputing power at minimal cost; an advantage highlighted by the fact that often, existing general purpose resources provide the computing platform, thereby requiring little or no additional investment. From the technical viewpoint, network computing systems provide power and functionality equivalent to (and sometimes greater than) that of traditional high-performance hardware, for a large number of application categories.

In this paper, we will discuss various aspects of concurrent computing on networked collections of independent machines. Network-based systems range in methodology and technique from distributed operating systems to concurrent environments, language-based approaches, and low-level programming. We are primarily concerned with the last three, and in particular will emphasize concurrent computing environments.
These systems are distinguished by their operation over the native operating systems of independent machines, as a separate layer. Further, they are stand-alone systems, easily ported to a variety of machine, network, and operating system architectures. They tend to be programmed and utilized using imperative programming languages, and typically support several tools for program development and profiling. Through a combination of technological circumstance and the evolution of a number of applications in several disciplines, such network-based concurrent computing environments have become a viable platform for high-performance computing.
2. Background and generalities

Network concurrent computing essentially involves the collective utilization of independent machines that are interconnected by general purpose communication networks. In these environments, existing hardware bases are used to provide distributed or concurrent computing capabilities, usually with the help of specialized software systems. In this section, we provide some background information on such computing platforms.
2.1. Historical perspective

Network based concurrent computing originally gained popularity with the advent of powerful desktop workstation systems in the early 1980s. These computer systems were capable of delivering close to one million operations per second, were easily affordable, and soon became commonplace. Further, they were based on a common operating system (almost always a derivative of Unix), and were interconnected by relatively fast local networks. Within the host systems, communication transport mechanisms were accessible to user-level programs. This combination led to the inevitable and obvious step of utilizing these resources collectively and in cooperation. Early work in enabling distributed computing on collections of workstations included projects to support remote execution, process migration, and client-server applications.

The basic scenario is almost exactly the same in present-day computing environments.
Most environments consist of a number of workstations interconnected by LANs; many frequently include several specialized hardware platforms such as multiprocessors, graphics engines, and occasionally, vector supercomputers. The workstations are significantly faster, ranging in power from tens to almost a hundred million operations per second. Furthermore, since these workstations are 'personal' machines, they tend to be idle for a significant portion of the time, thus resulting in an abundance of unused computing cycles in the environment as a whole. Therefore, the same motivations for utilizing networked resources collectively exist currently, in even stronger form - owing to the increased capabilities of each individual machine and the maturing of support systems such as network filesystems and windowing systems. The communications network is the only component in most environments that presents a potential obstacle to some kinds of network computing. However, fiber-optic gigabit networks are imminent and should be ubiquitously available in the very near future, thus eliminating current bottlenecks in network bandwidth. While the coherence and tight coupling provided by hardware multiprocessors may never be achievable, high-speed networks and effective software systems will make network computing a viable alternative for a large class of applications.
2.2. Benefits of network computing

Network-based concurrent computing environments offer significant benefits in the following respects:
• Computing power: By providing (controlled) access to a much larger and richer hardware base, network environments can increase application performance by significant amounts. In addition to increasing the overall computing resources that may be accessed by individual users, the network environment will also enable the exploitation of specialized resources. The Plan 9 project [17] at Bell Labs is based on this notion; computing tasks are executed on machines best suited to the nature of the task, or on machines specifically designed for such tasks.
• Expandability: Incremental scaling of a network based concurrent computing environment is usually straightforward; moreover, network bandwidth, the main obstacle to scaling by larger factors, is on the verge of increasing by one or two orders of magnitude with the advent of fiber optics. Further, under the right circumstances, the network based approach can be effective in coupling several similar multiprocessors, resulting in a configuration that might be economically and technically difficult to achieve with hardware.
• Application heterogeneity: Many existing and projected applications (e.g. as mentioned in [9]) are composed of sub-algorithms that differ widely in the model of computation, programming language, and computing and data handling requirements. On typical networks with a wide mix of architectures and capabilities, such applications can benefit by executing appropriate sub-algorithms on the best suited processing elements (in terms of computing paradigms, application languages, and resource requirements). As an example, the environment simulation application described in [16] is composed of different sub-tasks, some ideally suited for vector processing, others for distributed multiprocessing, and a few for execution on high-performance graphics engines.
• Other benefits: In addition to the main benefits of providing a powerful aggregate of concurrent computing resources and supporting a variety of application requirements, network based environments have several other potential advantages. Improved resource management and utilization are possible when a networked collection as a whole is shared among multiple users. Resilience to failures (an aspect becoming increasingly important in parallel processing) is more straightforward to implement, owing to the logical independence of individual processing elements. Finally, important support services like visualization tools, I/O bandwidth and large capacity data stores, and profiling/debugging software can be integrated readily into networked environments.

2.3. Approaches to network computing

Developers of distributed or concurrent applications have several options available to them.
While the distinctions are often blurred, the following approaches are typical.

2.3.1. Distributed operating systems

In distributed operating systems, a specialized kernel executes on all the hosts on a network, and collectively manages its resources. Application programs access these resources by invoking system specific functions. The V-kernel is a well known example: it supports process groups, multicast, various security and authentication features, and is primarily programmed using the request-response paradigm. Other examples of distributed operating systems are Amoeba [15] and Clouds [6], both object based systems, and Chorus [5]. The fundamental characteristic of these systems is that the processing elements involved are not autonomous - the distributed kernel must execute on all the system nodes. From the application point of view, this restriction may not be desirable or feasible for a variety of reasons, thereby limiting the viability of distributed operating systems in many situations.

2.3.2. Distributed programming environments

A second approach is the use of distributed environments, sometimes called network operating systems. In this scenario, a software layer executes above the individual operating systems of autonomous machines that are networked, and provides various distributed facilities. Representative examples of such environments or programming systems are ISIS [4], Marionette [18], PVM [19], and Linda [1]. These distributed environments are gaining in popularity, and are proving to be effective platforms for the deployment of concurrent or distributed applications.

2.3.3. Low level programming

The alternative to distributed operating systems or environments is for applications to use appropriate collections of standard networking and remote execution facilities that computer systems typically support. Program access to transport protocols, mechanisms for communicating with name servers, and remote process execution facilities are usually standard features in contemporary operating systems.
At a slightly higher level of abstraction, remote procedure call, an extremely valuable, well-understood, and popular paradigm, may be used as the basis for constructing distributed applications. This approach is attractive when distributed operating systems or environments are inadequate, inappropriate, or inefficient, and a significant number of applications are implemented using these relatively low level networking and remote execution functions.
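To give a flavor of this lowest level, the following sketch opens a bare TCP connection through the BSD socket interface, the transport access mechanism typically available to user programs on the Unix systems discussed here; the host address, port, and payload are placeholder values, and everything above the transport (naming, framing, data representation) is left to the programmer - precisely the burden that the higher-level environments aim to remove.

    /* A minimal sketch of the low-level approach: one TCP connection
       via the BSD socket interface. Host 192.0.2.1 and port 7000 are
       placeholder values, for illustration only. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);     /* TCP endpoint */
        if (fd < 0) { perror("socket"); return 1; }

        struct sockaddr_in peer;
        memset(&peer, 0, sizeof peer);
        peer.sin_family = AF_INET;
        peer.sin_port   = htons(7000);                   /* example port */
        inet_pton(AF_INET, "192.0.2.1", &peer.sin_addr); /* example host */

        if (connect(fd, (struct sockaddr *)&peer, sizeof peer) < 0) {
            perror("connect");
            return 1;
        }

        /* Any message framing or data conversion is entirely the
           application's responsibility at this level. */
        const char msg[] = "task input";
        write(fd, msg, sizeof msg);
        close(fd);
        return 0;
    }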
2.3.4. Concurrent languages

The use of concurrent languages for network based computing is an alternative approach along a different dimension, since concurrent languages may be implemented on distributed operating systems, on concurrent environments, or directly on primitive network facilities. Several concurrent languages have recently been proposed; representative examples include Jade [14], Concurrent C [10], Orca [2], and Linda [1]. Some concurrent languages are derived from imperative bases, while others propose novel programming paradigms. The language-based approach is likely to eventually gain acceptance, but has not achieved sufficient maturity or widespread use at present. In the remainder of this paper, we will restrict our attention to concurrent environments, loosely defined as software systems that enable concurrent processing on collections of networked computer systems.
3. Concurrent environments and applications

At the present time, several concurrent processing environments have been developed and are operational, and a few have achieved production status. By and large, these systems are the outcome of research projects at universities or laboratories, but there have been occasional commercial systems, e.g. Express [13]. In this section, we will describe the salient features of the PVM system, which has been developed jointly at Emory University, Oak Ridge National Laboratory, and the University of Tennessee. We will also briefly describe a number of applications that have been reported to execute on concurrent computing environments.
3.1. The PVM system
PVM (Parallel Virtual Machine) is a software system that permits the utilization of a heterogeneous network of parallel and serial computers as a unified, general, and flexible concurrent computational resource. The PVM system supports the message passing, shared memory, and hybrid paradigms, thus allowing applications to use the most appropriate computing model for the entire application or for individual sub-algorithms. Processing elements may be scalar machines, distributed- and shared-memory multiprocessors, vector supercomputers, and special purpose graphics engines, thereby permitting the use of the best suited computing resource for each component of an application. This versatility is valuable for several large and complex applications, including global environmental modeling, fluid dynamics simulations, and weather prediction. However, the full effectiveness of the PVM system can be realized, with significant benefits, on hardware platforms as common as a local network of general purpose workstations.
3.1.1. Architectural overview

The PVM system is composed of a suite of user-interface primitives and supporting software that together enable concurrent computing on loosely coupled networks of processing elements. Some of the prominent advantages of the system are:
- The ability to execute in existing network environments without the need for specialized hardware or software enhancements or modifications.
- Support for multiple parallel computation models, particularly useful in conjunction with support for multiple hardware architectures.
- Integral provision of debugging and administrative facilities, using interactive graphical interfaces.
- Support for fault-tolerance and partially degraded execution in the presence of machine or network failures.
- Auxiliary profiling and visualization tools that permit post-mortem analysis of program behavior.

PVM may be implemented on a hardware base consisting of different machine architectures, including single CPU systems, vector machines, and multiprocessors. These computing elements may be interconnected by one or more networks, which may themselves be different (e.g. one implementation of PVM operates on Ethernet, the Internet, and a fiber optic network).
Heterogeneous network-basedconcurrentcomputing environments net, and a fiber optic network). These computing elements are accessed by applications via a standard interface that supports common concurrent processing paradigms in the form of well-defined primitives that are embedded in procedural host languages. Application programs are composed of components that are subtasks at a moderately large level of granularity. During execution, multiple instances of each component may be initiated. Figure 1 depicts a simplified architectural overview of the PVM system. Application programs view the PVM system as a general and flexible parallel computing resource that supports shared memory, message passing, and hybrid models of computation. This resource may be accessed at three different levels: the transparent mode in which component instances are automatically located at the most appropriate sites, the architecture-dependent mode in which the user may indicate specific architectures on which particular components are to execute, and the low-level mode in which a particular machine may be specified. Such layering permits flexibility while retaining the ability to exploit particular strengths of individual machines on the network. The PVM user interface is strongly typed; support for operating in a heterogeneous environment is provided in the form of special constructs that selectively perform machine-dependent data conversions where neces-
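The three access levels can be made concrete with the spawn operation. The fragment below uses the PVM 3 style C interface (pvm_spawn and its placement flags); the version of PVM described in this paper predates that interface, so the call names are illustrative rather than definitive, and the architecture name "SUN4" and host name "hostA" are example values.

    /* Hedged illustration of the three access levels, in PVM 3-style C. */
    #include "pvm3.h"

    void start_components(void)
    {
        int tid;  /* task identifier of the spawned instance */

        /* Transparent mode: the system chooses the most
           appropriate site for the component instance. */
        pvm_spawn("solver", NULL, PvmTaskDefault, "", 1, &tid);

        /* Architecture-dependent mode: any host of the
           requested architecture class. */
        pvm_spawn("solver", NULL, PvmTaskArch, "SUN4", 1, &tid);

        /* Low-level mode: a specific machine, named explicitly. */
        pvm_spawn("solver", NULL, PvmTaskHost, "hostA", 1, &tid);
    }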
The PVM user interface is strongly typed; support for operating in a heterogeneous environment is provided in the form of special constructs that selectively perform machine-dependent data conversions where necessary. Inter-instance communication constructs include those for the exchange of data structures as well as high-level primitives such as broadcast, barrier synchronization, mutual exclusion, global extrema, and rendezvous.

3.1.2. Application programming paradigms

PVM supports two general parallel programming models: tree computations, as supported by the DIB [8] and Schedule [7] packages, and crowd computations. Supporting both paradigms increases the flexibility and power of the system significantly, especially since individual subtasks within either of these models may themselves be parallel programs expressed in the other. At present, the model, individual subtasks, and their interactions are described in procedural terms; work is in progress to provide graphical specification.
Fig. 1. Architectural overview of the PVM system.
Application programs under PVM may possess arbitrary control and dependency structures. In other words, at any point in the execution of a concurrent application, the processes in existence may have arbitrary relationships with each other and, further, any process may communicate and/or synchronize with any other. This is the most unstructured form of crowd computation, but in practice a significant number of concurrent applications are more structured. Two typical structures are the tree and the 'regular crowd' structure; we use the latter term to denote crowd computations in which each process is identical, and frequently such applications also exhibit regular communication and synchronization patterns. Any specific control and dependency structure may be implemented under the PVM system by appropriate use of PVM constructs and host language control flow statements.
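As an illustration, the sketch below outlines the master side of a hypothetical 'regular crowd' computation, again in PVM 3 style C (the primitives in the version of PVM described here carry different names, so the calls are indicative only). The master enrolls, spawns identical worker instances transparently, scatters one integer of work to each, and gathers partial results; the default XDR encoding handles the machine-dependent data conversions mentioned earlier.

    /* Master side of a hypothetical regular-crowd computation, sketched
       with PVM 3-style calls. The "worker" executable name and message
       tags are assumptions for illustration. */
    #include <stdio.h>
    #include "pvm3.h"

    #define NWORKERS   4
    #define TAG_WORK   1
    #define TAG_RESULT 2

    int main(void)
    {
        int tids[NWORKERS];

        pvm_mytid();  /* enroll this process in the virtual machine */

        /* Spawn identical worker instances; placement is transparent. */
        pvm_spawn("worker", NULL, PvmTaskDefault, "", NWORKERS, tids);

        /* Scatter one unit of work to each instance. */
        for (int i = 0; i < NWORKERS; i++) {
            pvm_initsend(PvmDataDefault);   /* XDR: heterogeneity-safe */
            pvm_pkint(&i, 1, 1);
            pvm_send(tids[i], TAG_WORK);
        }

        /* Gather partial results from any instance, in any order. */
        for (int i = 0; i < NWORKERS; i++) {
            int partial;
            pvm_recv(-1, TAG_RESULT);
            pvm_upkint(&partial, 1, 1);
            printf("partial result: %d\n", partial);
        }

        pvm_exit();   /* leave the virtual machine */
        return 0;
    }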
3.2. Representative applications

A large number of 'production' applications have been reported to be operational on heterogeneous network based concurrent computing environments. In this section, we list some representative examples, along with feedback obtained directly from the researchers involved.
• At the Livermore computing center, researchers are investigating possible performance improvements by distributing physics calculations between a BBN TC-2000 and a Cray. This experiment will use a new code, under development by physicists, consisting of elementary Monte Carlo neutron transport and hydrodynamics modules. The choice of physics was arbitrary, made so as to have two very different kinds of algorithms: the Monte Carlo neutron transport will run on the BBN, while the hydrodynamics will run on the Cray. From this experiment they expect to determine whether speed-ups are sometimes possible by distributing scientific calculations across machines of different architectures.
• At Fermi National Laboratories, activities in this area have focused independently on two different sets of requirements: experimental high energy physics and theoretical high energy physics. In the early 1980s it became apparent that the data reconstruction needs of experiments could not be met with conventional computing. Special purpose computers were tried, but their programmability was not adequate for such complicated problems. Researchers needed a computer that could support Fortran programs of 100 000 lines (then; several million source lines now), and which maximized the performance/cost ratio. When 32 bit microprocessors became available with (almost) reasonable Fortran compilers, they designed modules with the micros and DRAM local memory.
These modules were plugged into VME, a standard commercial backplane protocol. The VME crates were interconnected through a Fermilab high speed inter-crate cable protocol called Branch Bus. Over 600 of these modules were built. Typically up to 100 processors worked together on an experiment's data, with each node having one data event at a time passed out to it from a host MicroVAX supporting I/O. Built by the Advanced Computer Program (ACP), the systems came to be known widely as ACP Farms. Later, as industry produced low-cost, high-performance workstations based on microprocessors, the newer generations of Fermilab farms used them, interconnected with standard networks. In these systems I/O and disk can be directly addressed by each process. In units of VAX 11/780s benchmarked on real high energy physics codes (VUPS), Fermilab now has a total installed farm capacity approaching 7000 VUPS. Workstation based farms have been based on DEC MicroVAXes, SGI RISC servers, and IBM RS/6000 series servers.
• In order to explore the software architectures of high performance, distributed, heterogeneous applications, and to address the problem of working with much larger 3D imaging problems than can be handled on a workstation, the Imaging Technologies Group at Lawrence Berkeley Laboratory is working on an application designed to use multiple backend computing systems connected to a frontend workstation via a high speed network. The goal is to visualize very large 3D images (voxel data sets) in near enough to real time that interactive exploration is practical. They are currently experimenting with different backend configurations, including the Cray-2 at NERSC, and the combination of a Cray Y-MP and a Connection Machine (CM) at the Pittsburgh Supercomputer Center (PSC). The application consists of an interactive program to control the segmentation (generation of 3D geometry from the voxel data set/3D scalar field) and display the result on a local workstation, passing the relevant parameters to the backend compute servers to do geometry generation and graphical rendering. Specifically, they are working on a suite of algorithms to semi-automatically segment MRI data sets to obtain analyzable geometrical representations of structures of interest.
• Research is in progress at the University of Zurich to perform particle motion simulation under the PVM system. Particle motion simulation is an approach to modeling and rendering objects that cannot easily be described in terms of surfaces, volume primitives, and motion paths, because the frontier between the objects and the environment varies over time or can only be defined statistically, as with fire, fog, or smoke. Particle motion simulation can also be applied to models in natural science and economics. In a first attempt, a prototype particle motion simulation application was implemented on a state of the art graphics super-workstation. To achieve real-time performance, a simple model was chosen, representing a small number of balls with a specific size and mass moving within cubic boundaries in 3-D space according to Newtonian mechanics. A generic solution applicable to a wide range of today's computer architectures, including locally available workstations, was preferred over specialized single purpose architectures, which very often cannot keep pace with future hardware and software development. The most general parallel architecture available in modern research environments is a set of scalar computers connected by fast networks; for this reason, an approach of parallel and distributed processing using a set of loosely coupled workstations under PVM was adopted.
• At a scientific computing consultancy, researchers are involved in atmospheric and ground water modeling, with particular emphasis on three dimensional transport of pollutants of various kinds. These models currently run about 1/2 to 1/3 slower than real time on Cray Y-MP and Connection Machine systems. The codes have been adapted to IBM RS/6000 clusters using PVM, executing at speeds in excess of today's supercomputers at a fraction of the cost. One application, a Lagrangian Particle Distribution Model, is an obvious choice for such coarse grained parallelism and 'scales' very well. Two RS/6000 processors coupled with token ring, accessing a 'driving' meteorological database of over 250 megabytes, run 1.96 times faster than a single processor for a model time of 24 hours; four processors achieve 3.78 times a single processor's performance.
The model is coupled to a graphics display system which interactively places, moves, and modifies emission sources and dynamically chooses the number of processors employed in the processing, while displaying the particle plume and concentrations being produced by the multiprocessed model.
• PVM is being used in a project called 'Network Synamation', which is being developed by the Xerox Design Research Institute at Cornell University. The idea is to provide a very high-level design and simulation environment based on visual programming, using AVS (Application Visualization System) to encapsulate numerical simulation routines. Typical examples include modules to solve fluid-flow, heat-transfer, and electrostatic field problems using FEM and BEM techniques. The modules are (interactively) connected together into a network which acts as a simulator for a specific problem (e.g. a xerographic print engine, a thermal print head, etc.). As some of the modules are very compute intensive, researchers are using PVM to allow one or more AVS modules to draw resources from a network of workstations. A typical example is tracking a large set of image toner particles in a xerographic developer under the influence of various electrostatic fields. In effect, the AVS module is run in parallel on a set of external machines.
4. Ongoing research and development trends
The field of network based concurrent computing is relatively young, and research on various aspects is ongoing. Although basic infrastructures have been developed, many of the necessary refinements are still evolving. In this section, we discuss some areas in which research activities are underway.

4.1. Increased power and resource usage
Standalone systems delivering several tens of millions of operations per second are commonplace, and continuing increases in power are predicted. For network computing systems, this presents many challenges.
One aspect concerns scaling to hundreds and perhaps thousands of independent machines; it is conjectured that functionality and performance equivalent to massively parallel machines can be supported on cluster environments. The Fermilab project mentioned above demonstrates feasibility for some classes of problems. Protocols to support scaling, and other system issues, are currently under investigation. Further, under the right circumstances, the network based approach can be effective in coupling several similar multiprocessors, resulting in a configuration that might be economically and technically difficult to achieve with hardware.

4.2. Failure resilience and migration

Applications with large execution times will benefit greatly from mechanisms that make them resilient to failures. Currently few platforms (especially among multiprocessors) support application level fault tolerance. In a network based computing environment, application resilience to failures can be supported without specialized enhancements to hardware or operating systems. Research is in progress to investigate and develop strategies for enabling applications to run to completion in the presence of hardware, system software, or network faults. The following approaches are being pursued (a sketch of the first appears after the list):
• Checkpointing: Work in this area is based on the belief that checkpoint/restart is an effective means of providing large-grained fault tolerance to applications. However, in a concurrent environment, especially when an application is distributed over a network, obtaining a snapshot of process state in a consistent manner can be very difficult. The PVM project proposes a solution based on the insertion of synchronization points into the application; at each synchronization point, interactions between components are temporarily suspended (thereby freezing global state) and a snapshot obtained. Synchronization points are forced by the PVM system in a manner transparent to applications, and their frequency is dynamically adjusted based on measurements of application resource utilization and inter-process communication.
• Shadow execution: An alternative mechanism to provide failure resilience is to replicate some or all components of a concurrent application. This is expensive, but may be advantageous in terms of ease of implementation and reduced overheads. Work on this approach measures quantitatively the costs of replication for several representative classes of applications. When shadow execution is applied uniformly to all components of a concurrent application, it is perhaps straightforward to implement. However, when selective shadowing is desired, a number of problems arise. For example, messages sent to component instances must also be delivered to all shadow copies, but messages sent by a replicated component should be delivered only once. The rates of execution of the original instance and shadow copies may need to be controlled, in order to maintain consistency. Input and output operations require special handling.
• Process migration: Components of long-running concurrent applications also benefit from the existence of a process migration facility. Migration is useful when it is desired to preempt anticipated shutdowns, or for load balancing. Under investigation are mechanisms for process migration in a networked concurrent computing environment, with particular focus on strategies for migration among a heterogeneous collection of machines. Almost by definition, this implies migration at well defined execution points within the application source code, since architecture incompatibilities usually preclude dynamic movement of executing processes at arbitrary points. One approach is to extend the checkpoint/restart scheme described above, so that a component may be restarted on a machine with a different architecture, thus achieving the effect of migration.
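The shape of the synchronization-point scheme can be sketched as follows. In PVM the freezing and snapshotting are transparent to the application; the helpers below (barrier_all, save_state) are hypothetical stand-ins, used only to show why a synchronization point makes a consistent snapshot possible.

    /* Illustrative skeleton of checkpointing at synchronization points.
       barrier_all() and save_state() are hypothetical placeholders for
       the transparent mechanism described above. */
    #include <stdio.h>

    static void barrier_all(void) { /* all instances rendezvous here;
                                       no messages remain in flight  */ }
    static void save_state(int step) { printf("checkpoint at step %d\n", step); }
    static void compute_phase(int step) { (void)step; /* application work */ }

    int main(void)
    {
        for (int step = 0; step < 200; step++) {
            compute_phase(step);

            /* At a synchronization point, inter-component interactions
               are quiescent, so each instance can write a local snapshot
               that is globally consistent. The interval (50 here) would
               be adjusted dynamically, as described above. */
            if (step % 50 == 0) {
                barrier_all();
                save_state(step);
            }
        }
        return 0;
    }

Restart would proceed symmetrically, with each instance reloading its most recent snapshot before resuming the loop.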
4.3. Profiling and interfacing aspects

Graphical tools enhance the effectiveness of many of the network computing systems in existence. An example is the HENCE graphical interface tool that is currently under development [3]. HENCE (Heterogeneous Network Computing Environment) is a parallel programming paradigm and tool which supports the creation, compilation, execution, debugging, and analysis of parallel programs for a heterogeneous group of computers. The HENCE programmer specifies the parallelism of a computation by drawing a graph describing the dependencies between user defined procedures, as shown in Fig. 2. HENCE will then automatically execute these procedures on a user defined collection of machines on some network. Different versions of a procedure may exist for different architectures; HENCE executes the appropriate version for a chosen target machine architecture, and maps procedures to machines based on a user defined cost matrix. The HENCE user dynamically configures a parallel collection of machines, referred to as a parallel virtual machine, on which the HENCE program is to be run. During execution, HENCE can collect trace and scheduling information which can be displayed in real time or saved to be replayed later; a snapshot of an example animation is shown in Fig. 3.
Fig. 2. Graphical application specification in HENCE.
Another type of graphical interface being investigated pertains to visualizing concurrent program behavior. For example, the present version of PVM supports programming using the PICL library [11], and post-mortem profiling using the ParaGraph [12] tool. These tools enable visualization of program behavior and help identify communication patterns, load imbalances, and processor utilization. To illustrate a few of the kinds of post-mortem analysis possible, displays from the use of the ParaGraph tool are presented in Figs. 4 and 5, for a matrix factorization application that was executed on a symmetric multiprocessor.

Figure 4 shows the Kiviat diagram, the concurrency profile, and a utilization chart. The Kiviat display gives a geometric depiction of individual processor utilization and overall load balance. The dark regions indicate recent utilization by shading a polygon formed by connecting individual processor utilizations, with the center representing an idle state and the circumference 100% utilization.
Fig. 3. Visualizing application execution behavior.
The concurrency profile indicates the percent of time that different numbers of processors are simultaneously active, while the utilization graph indicates processors that are idle, busy, and performing overhead tasks, on a time axis.

Figure 5 is a snapshot of ParaGraph displays that enable visualization of the communication aspects of an application. The matrix display indicates communication volume in thousands of bytes, color coded for different levels, for each pair of processors. A second display depicts communication patterns; although this appears as a complete graph in Fig. 5, the animated version shows interactions on a time scale. The upper display in Fig. 5 shows interaction between processing elements as a function of time. Processor activity is indicated by horizontal lines, while slanted lines show message transmission and reception events. Such a display is useful in locating bottlenecks, detecting deadlock, and as a basis for fine tuning of the application. When used on a heterogeneous network based environment, however, these displays are often skewed, owing to external factors including network load variance and the multiprogrammed nature of the individual processing elements. Techniques for normalizing such external influences, to make the displays more accurate and meaningful, are being investigated in ongoing research.
Fig. 4. Computation oriented profiling in ParaGraph.

4.4. Distributed algorithms and optimization

The performance and effectiveness of network based concurrent computing environments depend to a large extent on the efficiency of the support software, and on the minimization of overheads.
Experiences with the PVM system have identified several key factors that are being further analyzed and improved to increase overall efficiency. Efficient protocols to support high level concurrency primitives are one goal of work in this area. Particular attention is being given to exploiting the full potential of imminent fiber optic connections, using an experimental fiber network that is available. Preliminary experiments with this fiber optic network have identified several important issues. For example, the operating system interfaces to fiber networks, their reliability characteristics, and factors such as maximum packet size are significantly different from those for Ethernet. When the concurrent computing environment executes on a combination of both types of networks, the system algorithms have to be modified to cater to these differences, in an optimal manner and with minimized overheads.

Another issue concerns the data conversions that are necessary in networked heterogeneous systems. Heuristics that perform conversions only when necessary, and that minimize overheads, have been developed, and their effectiveness is being evaluated. Recent experiences with a Cray-2 have also identified the need to handle differences in word size and precision when operating in a heterogeneous environment; general mechanisms to deal with arbitrary precision arithmetic (when desired by applications) are also being developed.

A third aspect concerns the efficient implementation of inherently expensive parallel computing operations such as barrier synchronization. Particularly in an irregular environment (where interconnections within hardware multiprocessors are much faster than network channels), such operations can cause bottlenecks and severe load imbalances.

Fig. 5. Displays depicting communication related information.
Other distributed primitives for which algorithm development and implementation strategies are being investigated include the following (a sketch of the second appears after the list):
• Polling: this operation involves the collection of messages from the nodes of an interconnection network, in response to a query. Recent results demonstrate that polling can be performed using as few as 70% of the messages of the obvious 'broadcast-gather-broadcast' strategy.
• Distributed fetch-and-add: fetch-and-add is an atomic operation that may be used to achieve partial simultaneity of access to a shared variable while preserving the serialization requirement. In recent work, algorithms for a distributed version of fetch-and-add have been developed and analyzed from the viewpoints of efficiency and tolerance to failures.
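For concreteness, the fragment below expresses the semantics of fetch-and-add as a single holder process that serializes requests, written once more with PVM 3 style calls (the message tag and the centralized structure are assumptions for illustration). The distributed algorithms referred to above are designed to avoid exactly this central bottleneck, so the sketch conveys only the operation's meaning, not the published algorithms.

    /* Centralized fetch-and-add holder: receives an increment, replies
       with the previous value, then applies the increment. Sketched in
       PVM 3-style C; MSG_FAA is an assumed message tag. */
    #include "pvm3.h"

    #define MSG_FAA 10

    void faa_holder(void)
    {
        int shared = 0;   /* the shared variable being serialized */

        for (;;) {
            int incr, nbytes, tag, requester;

            pvm_recv(-1, MSG_FAA);          /* requests are serialized here */
            pvm_upkint(&incr, 1, 1);
            pvm_bufinfo(pvm_getrbuf(),      /* identify the requesting task */
                        &nbytes, &tag, &requester);

            pvm_initsend(PvmDataDefault);
            pvm_pkint(&shared, 1, 1);       /* fetch: return the old value */
            pvm_send(requester, MSG_FAA);

            shared += incr;                 /* add: apply the increment */
        }
    }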
5. Discussion

Experiences with network based computing, using a variety of methods and a diverse repertoire of applications, have demonstrated its viability. Nevertheless, network computing is not, and cannot be considered, a universal solution to all concurrent computing needs - several applications will continue to require massively parallel, closely coupled, or SIMD-based hardware multiprocessors. However, the central theme behind heterogeneous network based computing, viz. the cooperative use of multifarious, interconnected, independent computer systems, will continue to be a valid, viable, and attractive proposition. Further, the advent of high-speed fiber networks, advances in software and toolkit technology, and the evolution of heterogeneous applications will contribute to the increased effectiveness of network based concurrent computing.
References

[1] M. Arango, D. Berndt, N. Carriero, D. Gelernter and D. Gilmore, Adventures with Network Linda, Supercomput. Rev. 3 (10) (Oct. 1990).
[2] H.E. Bal, Programming Distributed Systems (Silicon Press, Summit, NJ, 1990).
[3] A. Beguelin et al., Graphical development tools for network-based concurrent supercomputing, in: Proc. ACM Supercomputing 1991 (Nov. 1991).
[4] K. Birman et al., The ISIS system manual, version 2.0, Cornell University, Computer Science Department, Sep. 1990.
[5] F. Boyer et al., Supporting an object-oriented distributed system: Experience with Unix, Mach and Chorus, in: Proc. Symp. on Experiences with Distributed and Multiprocessor Systems (Mar. 1991).
[6] P. Dasgupta et al., The design and implementation of the Clouds distributed operating system, Comput. Syst. 3 (1) (Winter 1990).
[7] J. Dongarra and D. Sorenson, SCHEDULE: Tools for developing and analyzing parallel Fortran programs, in: The Characteristics of Parallel Algorithms (MIT Press, Cambridge, MA, 1988).
[8] R. Finkel and U. Manber, DIB - A Distributed Implementation of Backtracking, ACM Trans. Programming Languages and Syst. 9 (2) (Apr. 1987) 235-256.
[9] G.C. Fox, Parallel computing comes of age: Supercomputer level parallel computations at Caltech, Concurrency: Practice and Exper. 1 (1) (1989) 63-103.
[10] N. Gehani and W. Roome, Concurrent C, Software: Practice and Exper. (Sep. 1986) 821-844.
[11] G. Geist et al., A machine independent communications library, in: Proc. Hypercube Concurrent Computers Conf. (1989).
[12] M.T. Heath, Visual animation of parallel algorithms for matrix computations, Technical report, Oak Ridge National Laboratory, 1990.
[13] A. Kolawa, The Express programming environment, Workshop on Heterogeneous Network-Based Concurrent Computing (Oct. 1991).
[14] M. Lam, Parallel programming in Jade, Workshop on Heterogeneous Network-Based Concurrent Computing (Oct. 1991).
[15] S.J. Mullender, The Amoeba distributed operating system, CWI Newsletter 11 (Jun. 1986) 21-33 and 12 (Sep. 1986) 15-23.
[16] H. Narang, R. Flanery and J. Drake, Design of a simulation interface for a parallel computing environment, in: Proc. ACM Southeastern Regional Conf. (Apr. 1990).
[17] R. Pike et al., Plan 9 from Bell Labs, Research Note, July 1990.
[18] M. Sullivan and D. Anderson, Marionette: A system for parallel distributed programming using a master/slave model, in: Proc. 9th Internat. Conf. on Distributed Computing Systems (Jun. 1989) 181-188.
[19] V.S. Sunderam, PVM: A framework for parallel distributed computing, Concurrency: Practice and Exper. 2 (4) (Dec. 1990) 315-339.

Vaidy Sunderam received a Ph.D. in Computer Science from the University of Kent, England, and is a faculty member in the Department of Mathematics & Computer Science at Emory University, Atlanta, USA. His research interests are in parallel and distributed processing, particularly high-performance concurrent computing in heterogeneous networked environments. Sunderam's recent research has focused on heterogeneous concurrent computing on general purpose networks. He is the principal architect of the PVM system for concurrent heterogeneous computing, which is in widespread use and is emerging as a de-facto standard for network computing. He is the recipient of the 1990 IBM supercomputing first prize award for his work on high-performance, network-based concurrent computing. His other recent research includes high-speed protocols for distributed systems support, graphical tools for parallel program development, concurrent stochastic simulation, and algorithms for efficient implementation of concurrency primitives.