
High-performance parallel computing in industry

Parallel Computing 23 (1997) 1217-1233

Michael Eldredge a, Thomas J.R. Hughes b,*,1, Robert M. Ferencz b, Steven M. Rifai b, Arthur Raefsky c, Bruce Herndon d

a Decision Focus, Inc., 650 Castro St., Mountain View, CA 94041, USA
b Centric Engineering Systems, Inc., 624 East Evelyn Ave., Sunnyvale, CA 94086, USA
c Silicon Graphics Computer Systems, Mountain View, CA 94039, USA
d Applied Electronics Laboratory, Stanford University, Stanford, CA 94305, USA

Received 17 January 1997; revised 27 February 1997

Abstract

We review experiences gained in the development of a commercial, engineering analysis software system for parallel computation. The lessons learned frequently complement, but occasionally are in conflict with, academic ideas about parallel computation. The thoughts presented are believed to be useful to those engaged in, or about to embark upon, parallel implementations aimed at the industrial sector. © 1997 Elsevier Science B.V.

Keywords: Computing in industry; Engineering analysis; Commercial software

1. Motivations for parallel processing

1.1. Directions in design and analysis

Many of the current directions in product design and analysis are driven by competitive and regulatory constraints, such as the need to shorten design cycles, reduce cost, meet increasingly stringent government regulations, improve quality and safety, and reduce environmental impact.

* Corresponding author. Stanford University, Division of Mechanics and Computation, Room 252 Durand Building, Stanford, CA 94305, USA. Tel.: +1-415-723-2040; e-mail: [email protected].
1 Also Professor of Mechanical Engineering and Chairman of the Division of Mechanics and Computation, Durand Building, Stanford University, Stanford, CA 94305.


These directions have increased the need for accurate product and component simulation and pressed analysts for simulations of unprecedented scale and complexity. Many of today's computations require simulation of coupled physical phenomena, modeling of more product details, resolution of finer time scales, and investigation of larger design spaces. Typically, these simulations are 3-dimensional, mathematically nonlinear, often transient and involve many coupled physical phenomena. While single processor computers improve every year, the quantum increase in computational needs requires a quantum increase in computational power.

The hardware industry is answering the increased needs with affordable high-performance parallel architectures which are built on commodity parts and compatible with workstation systems. These architectures offer the level of computational performance and increased memory capacity needed for large-scale computing. Because the underlying architecture is different from traditional uniprocessor or vector computers, this increased performance is most fully available to codes designed to take advantage of the new design. Centric Engineering Systems has developed the Spectrum Solver [14] for engineering analysis, which delivers increased simulation functionality and is architected to fully exploit modern hardware architectures. This combination enables the solution of problems at a new level of sophistication and accuracy, with the ability to deliver these results within ever tightening design schedules. Paramount in this effort is the delivery of solutions for real-world, industrial design problems.

Modern, commodity-based parallel computer systems and software designed to take advantage of these systems can now provide the capabilities required to solve increasingly large and complex engineering design problems. On distributed memory computer systems it is also possible to take advantage of the large, but not directly addressable, memory to solve problems too big for a single system's memory. Finally, the software applications need not be specialized codes differing significantly in source code and user interaction. The software is compatible with uniprocessor versions, with the addition of a few simple commands to enable execution on a parallel system.

Centric's Spectrum Solver is a unique computational simulation tool that enables engineers to capture the multiple physics of interacting fluids and solids inherent in many real-world engineering designs. This capability permits engineers to model their proposed design in a more realistic fashion, with fewer of the idealizations required for 'fluids-only' or 'solids-only' analysis. It also encourages cross-disciplinary interaction by engineering staff within the design organization. The Solver provides for the simultaneous use of multiple numerical analysis techniques in a single problem. This feature is important because the optimum technique for solution in a fluid region, for example, may not be optimum for an adjacent solid region. The Solver is complemented by the Spectrum Visualizer [15], an advanced visualization utility which allows engineers to examine and interpret the results of the simulation.

1.2. Parallel requirements

In the past, parallel computation has often been labeled an academic oddity.


We have identified several requirements for parallel computational environments which have primarily been dictated by end users and the fast-paced marketplaces in which they interact.

The primary requirement is, quite naturally, utility. Parallel computation in and of itself has no value in an industrial setting. It must be an enabling technology that is applicable to industrial design problems and the highly complex and competitive product development cycles. The technology must have a significant impact on the design cycle, otherwise any additional cost, complexity or retraining will simply not be borne.

With the wide range of general purpose parallel systems now available and the extremely short life cycles of computer hardware, software portability is a necessity. Various corporate policies or computer environment constraints additionally influence the actual systems that industrial design groups purchase. Software designed for portability can be migrated relatively quickly and easily to systems in use and is able to keep pace with constantly evolving hardware product lines. In the effort to provide significant performance gains, development for parallel computation often focuses on scalability as the ultimate goal. However, hyper-tuning for a specific platform may prevent timely migration to newer systems. In fact, computer hardware obsolescence outpaces improvements in scalability.

The complexity and growing interactivity of industrial design environments puts a premium on compatibility. In fact, compatibility has many facets in the sophisticated working environments of analysis and design. Such environments have a history and process grown around geometry modelers, mesh generators, visualization and other post processing tools, and legacy and in-house special purpose tools. Existing computer systems, networks and utility software also define the environment in which a new tool must fit. Compatibility also includes the ease of transition between the new parallel tool and associated uniprocessor versions. The user interaction and experience must be maintained and the new tools must be able to make use of previously developed simulation definitions. Finally, compatibility includes the requirement that the same answers are produced. It is expected that a parallel version return the same results as a uniprocessor version but in a significantly shorter period of time.

1.3. Engineering reasonable time

The utility provided by a software design tool can be measured in many ways. For existing tools, tweaking code or processor clock speed may allow for incrementally larger problems to be solved or an additional case to be run. However, increasing market pressures demand significantly more complex and detailed analyses than have ever been attempted. These same market forces also dictate ever shorter design cycles. In fact, this time constraint most significantly controls the potential impact of simulation on component and product design. Even with new simulation capabilities, if the solution cannot be found within the appropriate time window of the design cycle, it cannot be applied. The time limit for large simulations is often thought of as ‘over night’ although particular environments may have turnaround times of an hour or a week. Yet in each case, there exists a limit on turnaround time which sets the upper bound inside which solutions must be delivered in order to impact the design. This threshold we designate the Engineering Reasonable Time.


- Simulation of customer wind tunnel model
- Turbulent incompressible flow over a bluff body with longitudinal symmetry plane
- Reynolds number 3.7 million
- 160,192 hexahedral element mesh
- Solve for pressure, velocity and intensity of turbulence (kinematic eddy viscosity)
- Spalart-Allmaras turbulence model

Fig. 1. Bluff body turbulent flow problem.

The analysis software Spectrum, created to solve next generation, complex problems, has been designed with parallel computation in mind and is already delivering solutions to real-world design problems in Engineering Reasonable Time. An automotive test case shown in Fig. 1 is used to illustrate this point. The problem has been run on various parallel systems with excellent speedup and parallel efficiency results. (Parallel efficiency is defined as the ratio of speedup to the level of parallelism employed.) The results for the IBM Power Parallel SP2 system are summarized in Table 1. With parallel efficiencies over 80% through 16 processors, this example illustrates the excellent parallel performance that has been achieved.

Many legacy codes see a limit of 2.5-3 times speedup regardless of the number of processors. This is due to the limited amount of parallelism that can be found by optimizing compilers in such 'dusty-deck' software. While a speedup of 2.5 can be an incremental benefit in some environments, the need for significant impact requires much better return. Clearly, this example shows that such impact can be achieved.

At the same time, a 32 subdomain case can indicate diminishing returns of some sort. In this example, with approximately 160,000 finite elements, there are only 5,000 elements per processor when distributed 32 ways. Each processor is quite powerful on the IBM SP2, as is the case with most commodity based microprocessor systems today. Five thousand finite elements is a relatively small amount of work for such processors. In fact, problems with 700,000 to 1 million elements are regularly run on 32 or more processors with excellent performance.

Table 1
Bluff body parallel performance

Subdomains    Hours    Speedup    Efficiency
 1            38.6      1.00      100%
 4             9.7      3.98       99%
 8             4.9      7.88       98%
16             2.9     13.31       83%
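As a worked illustration of the definitions used here (speedup is the single-subdomain time divided by the parallel time; efficiency is speedup divided by the number of subdomains), the following trivial C sketch applies them to the Table 1 timings; its output agrees with the table up to rounding:

    #include <stdio.h>

    /* Speedup and parallel efficiency as defined above, applied to the
     * Table 1 timings (hours). Differences from the table are rounding only. */
    int main(void)
    {
        const double t1  = 38.6;                 /* 1 subdomain           */
        const double t[] = { 9.7, 4.9, 2.9 };    /* 4, 8, 16 subdomains   */
        const int    p[] = { 4, 8, 16 };

        for (int i = 0; i < 3; i++) {
            double speedup = t1 / t[i];
            printf("%2d subdomains: speedup %.2f, efficiency %.1f%%\n",
                   p[i], speedup, 100.0 * speedup / p[i]);
        }
        return 0;
    }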

Fig. 2. Bluff body - engineering reasonable time (solution time versus number of subdomains, 1 to 16, with the Engineering Reasonable Time threshold indicated).

In spite of the very good results illustrated in Table 1, such detail has proven to be of little interest to industrial users. In fact, the concept of Engineering Reasonable Time best summarizes the results of this example. In Fig. 2, it can be seen that the single processor case took approximately 40 h, but the 4-processor run returned the answer 'over night'. The increased performance of the runs with higher levels of parallelism does not directly impact the design cycle for the target client. The increased performance available does indicate that larger or more complex problems can be solved within Engineering Reasonable Time.

2. Design for high-performance parallel computation

Traditionally, parallel software designs have attempted to match the architecture of the underlying hardware systems. As a result, application development was problematic as hardware architectures evolved. In recent years, a quasi-standard architecture has materialized: a collection of high-performance microprocessors with a relatively fast interconnection network. The emergence of a single hardware abstraction shifts the focus from hardware architectures to software architectures for parallel computation.

The parallel programming models for parallel software architectures include explicit message passing, high-level parallel languages (e.g., HPF), and shared memory directives. The commercial constraints of utility, portability, and compatibility summarized earlier make explicit message passing a natural choice for many reasons. Standard messaging libraries such as MPI [2,3] or PVM [1] provide portability from workstation clusters to large parallel supercomputers. In addition, the use of standard programming languages such as C and F77 within a message passing paradigm delivers compatibility with existing codes through reuse of existing software modules with well developed algorithms and data structures. (This language compatibility is also important in simplifying software management in a commercial setting.)


Utility is created by providing users the ability to tackle difficult problems within engineering reasonable time through a reasonably efficient parallel program organization.
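As a minimal sketch of the explicit message-passing model referred to above, written with MPI in C (an illustration only, not code from the Spectrum Solver):

    #include <mpi.h>
    #include <stdio.h>

    /* Each process owns one coarse-grain subdomain; local work dominates and
     * the processes cooperate only through explicit messages.                */
    int main(int argc, char **argv)
    {
        int rank, size;
        double local, global;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which subdomain this process owns */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of subdomains        */

        /* ... long stretch of purely local subdomain computation ... */
        local = (double)(rank + 1);             /* stand-in for a locally computed quantity */

        /* one explicit communication step combines the local contributions */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("global value %g from %d subdomains\n", global, size);

        MPI_Finalize();
        return 0;
    }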

2.1. Coarse-grain application structure

Obtaining high efficiencies for high-performance parallel computer systems requires careful design. However, the principles guiding this design are the same principles necessary for modern cache-based serial architectures:
- Locality of reference
- Minimization of data movement
In other words, to maximize computational efficiency, minimize expensive data references [5]. Following this mantra, programmers of cache-based systems have attained high efficiency through intelligent high-level data organizations and regular patterns of computation. While the memory hierarchy becomes more complex in a parallel computer (i.e., data can now reside in a remote memory with expensive access times), the general concepts regarding locality and data movement remain true. A reasonable software approach to expose parallelism would be to intuitively extend the existing data and program organizations into the parallel realm rather than radically alter algorithm designs.

In many applications, a coarse-grain decomposition of a problem domain can be performed to yield any number of largely independent subdomains. In this fashion, computation may proceed independently on each subdomain until an inter-domain data dependency is reached. At this point the data dependency must be satisfied through interdomain communication. Fortunately, the coarse granularity of the subdomains allows a large amount of computation to be performed locally before communication is required. Computational simulations comprise one class of applications which can greatly improve their utility through adaptation to coarse-grain parallelism. In fact, experience with the Centric Spectrum Solver has shown the power of the coarse-grain model to meet commercial software objectives. Moreover, the coarse-grain decomposition of this multiphysics simulator illustrates the natural extension of an existing software framework to accommodate parallelism.

2.2. Coarse-grain data decomposition in a multiphysics context

In order to accommodate the divergent physical phenomena inherent in multiphysics problems, the flexibility to handle multiple computational subdomains was architected into the software. The solver allows a heterogeneous simulation domain to be separated according to the underlying physics into multiple subdomains of homogeneous physics. Similarly, multiple subdomains with the same physics can also be specified. In a typical uniprocessor simulation, this decomposition is based upon physical considerations in the problem and is provided by the user. During the solution phase, data are exchanged along subdomain interfaces in order to maintain consistency throughout the global domain. Clearly, these computations can be performed in coarse-grain parallel fashion by adapting the mechanisms created to manage subdomain data and control in the uniprocessor solver.
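A purely illustrative C sketch of what such a coarse-grain subdomain description might hold; the field names are hypothetical and are not taken from the Spectrum code:

    /* Hypothetical per-process subdomain record for a coarse-grain solver. */
    typedef struct {
        int     n_elements;       /* local elements: the bulk of the work      */
        int     n_nodes;          /* local nodes, including interface copies   */
        int     n_interface;      /* nodes shared with neighboring subdomains  */
        int    *interface_nodes;  /* local indices of those shared nodes       */
        int    *neighbor_ranks;   /* processes owning the adjacent subdomains  */
        double *solution;         /* per-node unknowns for this subdomain      */
    } Subdomain;

    /* Solution pattern: compute locally over n_elements, then exchange only the
     * n_interface values with neighbor_ranks to satisfy inter-domain dependencies. */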


Typically, the user-supplied, physically motivated subdomains are too few in number to fully utilize the available parallel resources. The original subdomains must be partitioned to expose additional parallelism [12]. This second partitioning step must be as transparent as possible to accommodate commercial users without expertise in parallel processing. With this in mind, an automated domain decomposition module is necessary. The domain decomposition module must run prior to the solution phase and partition the user-defined subdomains into a suitably large collection of parallel subdomains via grid partitioning.

Domain decomposition via grid partitioning continues to be an active area of research, and algorithms and available software continue to evolve. For example, several partitioners, such as RSB [7], TOP-DOMDEC [8], CHACO [10], and METIS [13], are currently available. The choice of partitioning algorithm can greatly affect the resulting subdomains. For this reason, a domain decomposition module should be designed to incorporate generic grid partitioning software within a complete domain decomposition framework to allow the grid partitioning kernel to be exchanged as better methods become available.

Typically, grid partitioners attempt to balance computation among subdomains with minimal interaction (communication) between subdomains [9]. To simplify this task, most algorithms take an abstract view of the grid to be partitioned by reducing it to a simple graph problem with only nodes and edges [11]. Unfortunately, this abstract viewpoint belies the complexities in the underlying physical problem to be solved. While automatic tools may work well for homogeneous domains, real-world problem domains are often comprised of homogeneous local domains used to represent a larger heterogeneous physical domain. Consequently, a good decomposition of the abstract problem representation may not result in a good decomposition of the original domain. One clear example arises when the physical behavior at a certain locality in the simulation domain depends upon outside (non-local) phenomena. In such cases, the cost of obtaining such information rises dramatically if it resides in another processor's partition. Moreover, the model must be modified to communicate across subdomain boundaries. (This is not the case in the serial code where subdomains are user-defined based upon underlying physics.) An alternate solution is to modify the initial decomposition to respect the underlying physics without communication (i.e., put dependent parts of the computation into the same subdomain). To overcome these problems, the domain decomposition module uses a combination of preprocessing and post processing steps to modify the abstract grid description and the resulting domains to improve efficiency and ensure that the physics are respected. In truth, both solutions have applicability in certain circumstances and inconsistencies must be evaluated on a case-by-case basis.
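One way to keep the grid-partitioning kernel exchangeable, as suggested above, is to hide it behind a narrow interface. The C sketch below is hypothetical; the names do not come from Spectrum or from any particular partitioning package:

    /* Abstract graph view of a user-defined subdomain (CSR adjacency). */
    typedef struct {
        int  n_vertices;
        int *xadj;       /* offsets into adjacency[] for each vertex */
        int *adjacency;  /* concatenated neighbor lists              */
        int *vwgt;       /* optional per-vertex work estimates       */
    } MeshGraph;

    /* Pluggable kernel: writes a parallel-subdomain id for each vertex into part[]. */
    typedef int (*PartitionKernel)(const MeshGraph *g, int n_parts, int *part);

    /* Framework entry point: adjust the abstract graph before partitioning and the
     * resulting partition afterwards so that the underlying physics is respected. */
    int decompose_domain(const MeshGraph *g, int n_parts, int *part,
                         PartitionKernel kernel)
    {
        /* pre-processing: e.g., tie together vertices whose physics must stay local */
        int status = kernel(g, n_parts, part);
        /* post-processing: e.g., move vertices to remove cross-partition couplings  */
        return status;
    }

With this organization the partitioning kernel can be swapped for a newer algorithm without disturbing the rest of the decomposition framework.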

2.3. Program decomposition in a multiphysics context

In the coarse-grain model, each processor of the parallel machine runs a copy of the application to solve one subdomain of the partitioned grid. Conceptually, this approach attempts to parallelize computation at the newly created subdomain level (i.e., at the outermost loop level). For example, the pseudo-code below shows the outer subdomain loop implicit in the computation.

Fig. 3. Parallel multiphysics architecture (subdomains coupled through data exchange).

    doacross (subdomains)
        perform local subdomain computations
    enddo
    communicate non-local data
    doacross (subdomains)
        ...
    enddo

In the multiphysics domain, this code structure already exists in the uniprocessor simulator. The translation and communication of data between the physically derived subdomains required a 'serial messaging' service. During the initial design of the serial application, the future parallelization of the code was also considered. Consequently, control of the subdomains and communication between them is carefully choreographed in order to create a so-called 'shared none' execution from the subdomain viewpoint. For example, the computation of global values for points along an interface is accomplished in a master-slave fashion as shown below:

    foreach (shared point)
        all slaves post update to master
    masters create global values
    foreach (shared point)
        master posts final value to slaves

In the serial implementation, updates to boundary nodes are placed in a memory buffer for use by the master. Under the coarse-grain parallel model, updates are passed along via send/receive pairs.

The separation of control from computation in the uniprocessor code proved to be a significant design decision. This separation, by design, facilitated the straightforward conversion to parallel execution. Fundamentally, the changes were limited to the dispatch operation of the global choreographer and the data exchange mechanism. A schematic description of the coarse-grain parallel architecture is presented in Fig. 3.
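A minimal message-passing rendering of this master/slave update, sketched here with MPI point-to-point calls (an illustration under assumed conventions, not the Spectrum data exchange mechanism; tags and buffer layout are hypothetical):

    #include <mpi.h>

    /* One shared interface value: slave ranks send their local contributions to
     * the master rank for that point, the master accumulates the global value
     * and posts it back to every slave.                                        */
    void update_shared_point(double *value, int master, int rank, int size)
    {
        const int tag = 0;
        if (rank == master) {
            for (int src = 0; src < size; src++) {
                if (src == master) continue;
                double contrib;
                MPI_Recv(&contrib, 1, MPI_DOUBLE, src, tag,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                *value += contrib;                 /* master creates the global value */
            }
            for (int dst = 0; dst < size; dst++)
                if (dst != master)
                    MPI_Send(value, 1, MPI_DOUBLE, dst, tag, MPI_COMM_WORLD);
        } else {
            MPI_Send(value, 1, MPI_DOUBLE, master, tag, MPI_COMM_WORLD);  /* post update     */
            MPI_Recv(value, 1, MPI_DOUBLE, master, tag,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);                  /* get final value */
        }
    }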


3. Along the road

The success of Parallel Spectrum has come with many lessons. Several of these lessons, issues and caveats along the road are discussed in this section.

3.1. System-level development issues

3.1.1. Portable and efficient parallel model

The coarse-grain parallel programming model supports the parallel performance requirements discussed above. In addition, this programming model is efficiently supported on a wide range of parallel hardware architectures. Coarse-grain parallelism is easily implemented on distributed memory, shared memory and hybrid or hierarchical memory computers. The implementation is most often by means of a message passing interface. This means that the coarse-grain parallel approach is highly portable. Industrial problems solved with Parallel Spectrum clearly reveal that the quality of the parallel performance is maintained across different hardware designs. Fig. 4, for example, illustrates that scalability is very similar for distributed memory and shared memory architectures.

3.1.2. Issues with message passing on shared memory

As shown, shared memory platforms provide excellent parallel performance and scalability for coarse-grain programs implemented via message passing. Although the programmatic interface is consistent across various architectures, the underlying implementation of message passing on shared memory systems necessarily exposes issues relating to data buffering, system resource allocation and limits, and communication bottlenecks. For example, the implementation of network PVM allocates message storage in the process address space.

Fig. 4. Comparison of distributed memory and shared memory platforms (speedup versus number of subdomains: ideal, distributed memory and shared memory).


Thus, the total size of outstanding messages is limited only by virtual memory constraints of the system, since memory for additional pending messages can simply be allocated from the process's heap. However, message passing implementations on shared memory systems typically pre-allocate a shared transfer buffer along with associated control and locking resources. Communicating processes can then map the shared memory segment into a fixed location of their own address space. With a fixed transfer area, the total size of outstanding messages is limited.

For certain uses this presents no problem. In an unconstrained producer/consumer situation, the producer may fill the transfer buffer, at which time it would block. However, the consumer continues receiving messages, which will eventually drain the transfer buffer sufficiently, allowing the producer to unblock and continue sending. However, in situations where all processes work in a two-phase manner, where every process sends in the first phase and then every process turns around and receives in the second phase, deadlock can occur. For example, if a process fills the transfer buffer during the send phase, it will block with the expectation that the intended receiver will post all its outgoing messages, turn around and begin its receive phase, thus draining the transfer buffer sufficiently that the first process can unblock and finish sending. However, if the intended receiver also fills its transfer buffer during the send phase, both processes will be blocked waiting for the other to continue on to the receive phase and drain the outgoing queue. In this case, deadlock occurs.

The simplest mechanism to avoid this situation is to ensure that the shared transfer buffer space is allocated of sufficient size to hold the largest intended collection of outstanding messages. This works well when the size can be predetermined or the largest size can be reasonably estimated. Several implementations of the PVM interface include a system specific means to specify the buffer size. For example, an environment variable can be used to declare the buffer size to be allocated at start-up:

    setenv PVMBUFSIZE 500000

Another approach would be the dynamic allocation of additional shared memory segments. In this case, additional shared memory extensions can be allocated, thus providing the additional transfer buffer space. While this segment allocation approach is unlikely to provide a contiguous memory space, this is only a minor issue since the transfer buffer is used to pack discrete size elements into messages. In the case where a message packing would overrun the memory segment, the packing would simply begin in the next segment at a small cost of the temporarily unused space. The more significant cost is due to the considerable system overhead. The existence of the additional memory segment must be recognized by cooperating processes and subsequently be mapped into the address space of each process. Access control for the shared memory segments requires system support resources such as semaphores.

While the message passing programming interface is generally implementation independent, the developer is still required to understand and often manage these resources. As an example, PVM has been implemented using the System-V shared memory interface, with several vendor specific derivative implementations. For each shared memory segment allocated, a System-V semaphore is also allocated. Upon successful completion of the parallel job, the resources (shared memory objects and semaphores) will be released and returned to the system.
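As a rough illustration of the kind of System-V resources such a shared-memory transport holds, consider the C sketch below (illustrative only, not PVM's actual implementation; the 500000-byte size simply echoes the PVMBUFSIZE example above):

    #include <sys/ipc.h>
    #include <sys/sem.h>
    #include <sys/shm.h>

    /* A shared transfer buffer plus a semaphore guarding it: unless explicitly
     * removed, both persist in the system after the processes have exited.    */
    int main(void)
    {
        int shmid = shmget(IPC_PRIVATE, 500000, IPC_CREAT | 0600); /* transfer buffer */
        int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);      /* its lock        */

        char *buf = (char *)shmat(shmid, 0, 0);  /* map segment into this address space */

        /* ... pack messages into buf and exchange them under the semaphore ... */

        shmdt(buf);                    /* unmap the segment                      */
        shmctl(shmid, IPC_RMID, 0);    /* release the segment back to the system */
        semctl(semid, 0, IPC_RMID);    /* release the semaphore as well          */
        return 0;
    }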
However, if the parallel job does not exit cleanly (frequently the case during development and debugging), the resources may not get released.

M. Eldredge et al./Paralkl

Computing 23 (1997) 1217-1233

1227

Subsequent parallel runs will allocate new resources. At some point, a start-up allocation attempt fails since there are no more free resources to allocate. While it is generally possible to terminate wayward processes and manually release system resources that have been tied down, the situation must be carefully monitored by the application developer and, more unfortunately, by the application user. In recognition of this situation, one hardware vendor even supplies a script to release all the resources being held by a user.

The coarse-grain approach enables excellent parallel performance on a wide range of system architectures. The basic design, at least implicitly, assumes a relatively high latency communication medium. Communication is assumed to be somewhat expensive in overhead and time. In achieving good performance under this assumption, a high bandwidth/low latency communication medium then provides even better communication. It has been argued that shared memory systems break down for higher degrees of parallel computation. This can certainly be true for very fine grain, frequent communication situations; however, the coarse-grain code implemented with the high latency assumption simply finds a very high performance communication medium in the shared memory bus.

Nevertheless, scalability limitations due to bus saturation are well known. At some point, the number of processors attempting access to the shared bus and the amount of data to move across it will become too great for the bus to handle. Software design approaches such as minimizing or randomizing communication may stave off saturation; however, the issue will eventually be encountered. Hardware approaches to mitigate this situation include limiting the number of processors to a manageable number, for example 4 to 8 processors. Communication between these 'hypernodes' is accomplished via a high-speed interconnection switch or network. Whether the interconnection mechanism provides a single, large address space (global shared memory) or distinct address spaces (distributed memory) for each hypernode, the coarse-grain/message passing approach easily accommodates the underlying architecture. The programming model remains the same while the underlying architecture may provide opportunities for the optimization of the communication and process/task implementations. Nevertheless, these system specific optimizations remain transparent to the developer and (in general) the end user.

3.1.3. Debugging message passing programs

Building and debugging a large engineering software product is already a complex process. In spite of the minimal re-coding required in the coarse-grain parallel approach, debugging and testing the parallel code presents significant challenges. Endemic are the issues of multiple tasks working asynchronously and in non-deterministic orders. These issues present difficulties in conceptualizing the execution and interaction of the tasks. While long-time parallel processing developers may have an appreciation for these issues, they are generally foreign to application developers migrating their application to a parallel platform.

Even with an appreciation of the inherent character of parallel execution, the lack of debugging and analysis tools remains the most significant limitation.


Notwithstanding the great advances in multiprocessing hardware and the standardization of parallel programming libraries and language constructs, WRITE() or printf() remain among the primary and most widely used debugging tools. Versions of uniprocessor debuggers have been modified to provide limited support for parallel environments. The capabilities vary widely, however. It is, in fact, the wide variance that most limits the use of tools. In today's fast changing multi-platform compute environments, software must be developed for a range of target platforms, and developers require (or at least fall back to) broadly available, standard solutions. The DBX debugger interface has been provided on most uniprocessor Unix based systems, for example. But there is no similar standard for a parallel debugging interface. The WRITE statement remains the only universal debugging tool for parallel code.

Beyond debugging tools, parallel profiling and analysis tools are required but generally lacking. The ability to analyze message traffic by type, source, destination, time, size, etc. can expose significant understanding about the correct and incorrect functioning of the parallel applications. Experienced developers have learned to include monitoring support of their own since robust or standard built-in support is almost non-existent.

Surprisingly, network PVM for workstation clusters has provided the most robust development environment. For basic development and for non-platform specific problems, it is often most helpful to test and debug on the cluster implementation. Network PVM provides a debugging option for the subtask spawn function that initiates each child process inside a separate instance of the debugger. The many windows and the need to type commands into each can be unwieldy for high degrees of parallelism. Nevertheless, tracebacks, single-stepping, break points and variable examination can all be accomplished for any task. Also, standard network and system monitoring tools provide important run time feedback on the parallel execution. The process monitor top and the graphical ethernet traffic monitor etherman are the most used tools in our environment. Staring at continually updated CPU utilization and process run and sleep patterns with top, and at communication source/destination and message volume with etherman, often provides great insight to the developer. Similar standard run time graphical tools are required on each platform.

3.2. User issues

3.2.1. Dedicated usage

Discussions of the value of parallel processing invariably lead to arguments over whether scalability or throughput is more important. In an industrial setting, however, the answer is simply throughput. To impact the ever shortening design cycles, more complex and detailed simulations must be accomplished in shorter absolute time frames. Large, high-performance parallel computers used as shared resources can obscure this goal. With batch queue systems or time-sharing, the elapsed time to solution is often much longer than for a uniprocessor run with a dedicated processor. Since improved throughput is the goal, the time to run plus the time in the queue is the time to solution. Certain classes of parallel processing involving significant input and output (such as parallel database transactions) may see little impact from time-sharing since the processor is frequently relinquished for the I/O operations.

Table 2
Launch environments supported by Parallel Spectrum

Network PVM      SGI PVM/Array     HP/Convex PVM
Cray MPT         IBM POE           IBM LoadLeveler
IBM PBS          IBM EASY

However, computationally intensive, industrial simulations often involve several hours or even days between I/O operations. In such cases, interrupting the computation for context switching and sharing significantly degrades throughput. With superscalar and cache based systems, the cost of this switching becomes even more severe. Therefore, dedicated processors allocated for a specific simulation are important. This can be accomplished with a single-user system, a policy controlled departmental system or a larger, batch controlled enterprise system. However, the proliferation of engineering workstations and personal computers has created a cultural situation where policy or batch controlled access are not accepted. Everyone wants the whole system and wants it immediately. This cultural barrier against 'in-turn dedicated use' has proven to be one of the most challenging issues in achieving significant improvements in throughput.

3.2.2. Additional system knowledge is unavoidable

As discussed, the developer must be aware of implementation issues in spite of the implementation independent programming interface. Unfortunately, this awareness is also required of the end user. This is particularly troublesome for the users of parallel applications since they are generally concerned with the actual domain of the computation rather than particulars of a given operating environment. Nevertheless, this situation is simply unavoidable. Often the first encounter with system idiosyncrasies is during installation. For example, the instructions may suggest that the system administrator be contacted to verify the appropriateness of the number of semaphores configured into the system or may request the default batch queue definition.

Parallel Spectrum has been ported to many platforms with relative ease. The portability of the coarse-grain parallel model for this type of application has facilitated the code development and performance. However, from a user perspective, each run time environment proves to be quite different. Table 2 lists several launch environments supported by Parallel Spectrum. In spite of the uniform interface and commands provided by Spectrum, initiating the parallel run varies significantly for the various environments. Parallel Spectrum provides a wrapper over each environment to assist in the process. However, the wrapper generally must be configured for each system and can only help with basic system interaction. Checking queues (if there is a queuing system), removing parallel jobs or managing multiple jobs often requires direct interaction with the specific run time environment. Unfortunately, this cannot be completely hidden from the user and must be addressed in training and documentation.

3.2.3. Scalability constraints

The definition of scalability traditionally involves looking at computational speedup compared to the number of processors used.

Fig. 5. Significant improvements in solve times highlight other phases (project time spent in problem definition/geometry/mesh generation, solve, and analysis/visualization).

Industrial engineering analysis problems present constraints which redefine scalability. Simulations involve several steps, from problem definition, analysis model creation and mesh generation to solving the analysis problem and post processing and visualization. Most of the supporting functions are extremely labor intensive, often requiring several weeks or months of manual engineering expertise. Analogous to Amdahl's Law, even if the solver stage can be sped up to require negligible time, the time required for the remaining stages remains significant (see Fig. 5). Exacerbating this situation is the lack of automatic tools for model and mesh creation. Without automatic tools, analysts are forced to work with existing, hand-generated models rather than working with models of more appropriate size and complexity for the available parallel computational power and the parallel solver application.

In most settings, the number of processors available for a particular run is controlled. This may be due to a fixed number of processors on the system, or it may be due to administrative or policy constraints. Often, there is an increasing charge rate for larger blocks of processors. More detrimental to the throughput requirement is the case where larger blocks of processors have reduced time limits and delayed start times (for example, the 'night queue'). Therefore, while a problem may be sufficiently complex to benefit from a large number of processors, in practice, it may be impossible to apply the appropriate number of processors. Small machines naturally have a limited number of processors. Policy may allow allocation of a large number of processors but only during 'off hours' such as weekends. Some systems are managed so that the large number of processors are available but with pragmatically useless allocation limits such as five minutes. Even without such policy restrictions, a well utilized system generally means that a large number of processors does not become available for long periods of time. The time in the queue may greatly exceed the time to solution if a large block of processors is requested. Therefore, while a parallel application may scale very well, in practice the necessary processor allocation and availability controls the realizable scalability.
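The Amdahl's Law analogy above can be made concrete with a small, illustrative calculation; the 50% solve fraction is a hypothetical assumption, not a figure from the text:

    #include <stdio.h>

    /* If the solve phase is a fraction f of total project time and is accelerated
     * by a factor s, the overall project speedup is 1 / ((1 - f) + f / s).        */
    int main(void)
    {
        const double f = 0.5;   /* assumed: solve is half of the project time */

        for (double s = 2.0; s <= 16.0; s *= 2.0)
            printf("solver speedup %4.1fx -> project speedup %.2fx\n",
                   s, 1.0 / ((1.0 - f) + f / s));
        return 0;
    }

Under this assumption, even a 16-fold solver speedup yields less than a 2-fold reduction in overall project time, which is why the labor-intensive supporting phases dominate once the solver is fast.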


3.2.4. Output, visualization and post processing

Successful implementation and deployment of high-performance parallel simulation tools provides great satisfaction. However, this success immediately highlights new areas that need to be addressed. In particular, the solution of larger, more detailed problems focuses attention on data volume, user versus computational views, and compatibility with post processing tools.

The ability to solve significantly larger problems in shorter time periods has led to a situation of data explosion. The volume of data generated stresses the computational environment, requiring additional disk storage, higher data bandwidth and improved post processing capabilities in order to summarize, explore and understand the voluminous data.

The domain decomposition approach provides an excellent and scalable method for parallel processing which requires no modification to the user input. Domain decomposition is performed transparently to the user, allowing the same input to be used for uniprocessor or parallel execution. However, since simulation is performed on decomposed regions, the results can be generated and stored in the decomposed regions. Therefore, subsequent interaction with the results, such as interpretation or visualization, may not correspond to user expectations. Examples include visual discontinuities in results variables. It may be determined that such issues are innocuous artifacts of the solution methodology. However, the discontinuities may introduce problems or errors in subsequent processing. For example, the computation of surface normals will be different when post-processing uniprocessor and parallel executions. Such potential problems require an additional post simulation step to restore the user view of the data. With the appropriate mapping information maintained, the resulting data can be recomposed to the original view defined by the input.

Finally, no simulation tool works in isolation. Every simulation tool interacts with a wide range of pre-processing, post processing, validation, and database tools. Tools from various vendors or those developed in-house are a part of an analyst's working environment. Multiple domains or arbitrary decompositions may preclude the use of tools which have not been specifically designed for these cases. Advanced capabilities must be carefully introduced and must include consideration for the full work flow of the user.
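A sketch of such a recomposition step, assuming the decomposition module records a local-to-global node map for each parallel subdomain (the names are hypothetical, not from the Spectrum code):

    /* Scatter each subdomain's nodal results back into the user's original node
     * ordering. Interface nodes appear in several subdomains; since the solver
     * keeps them consistent, repeated writes store the same value.              */
    void recompose_results(int n_subdomains,
                           const int *n_local_nodes,            /* nodes per subdomain         */
                           const int *const *local_to_global,   /* node maps per subdomain     */
                           const double *const *local_values,   /* nodal results per subdomain */
                           double *global_values)               /* output in original ordering */
    {
        for (int s = 0; s < n_subdomains; s++)
            for (int i = 0; i < n_local_nodes[s]; i++)
                global_values[local_to_global[s][i]] = local_values[s][i];
    }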

3.2.5. Superlinear speedup

Production use of Parallel Spectrum continues to illustrate new opportunities in product design. An automotive analysis of various interior configurations has been performed to study passenger comfort. The analysis of the coupled thermal and fluid flow, while varying design considerations such as air conditioner and heater registers, head rest configurations and external temperatures, has resulted in improved passenger comfort and savings. These problems are routinely run using 12 to 32 processors on models of over one million finite elements.

Table 3
Super-linear speedup at high degrees of parallelism

Subdomains    Time (h)    Speedup    Efficiency
14            23.2        -          -
28            11.6        2.0        100%

Fig. 6. Velocity response from automobile interior flow simulation.

Table 3 shows the performance of one such problem run on an early HP/Convex Exemplar (SPP-1000). Velocity response is illustrated in Fig. 6. For further elaboration, the interested reader is referred to [4].

Upon first examination, the perfect speedup of two by doubling the number of processors is surprising and possibly suspect. But this result reveals an additional benefit of parallel processing on common, commodity microprocessor and cache based systems. With high degrees of parallelism comes the increased overhead of communication. This penalty necessarily degrades parallel efficiency. At the same time, the increased levels of decomposition result in smaller subdomains to be solved on each processor. At some point, the subdomains become small enough to fit into an individual processor's cache. This phenomenon results in a super-linear speedup and illustrates that the efficient utilization of modern, complex processor designs can compensate for the increased communication overhead. Other applications of Parallel Spectrum may be found in [16,17].

4. Conclusions

The industrial application of parallel processing to engineering analysis and design is in its infancy. However, parallel processing is already practically and positively impacting these endeavors. Due to increased memory and processing speed, more complex and detailed analyses can be performed within shrinking design cycles. The additional user complexity of the parallel environment is small compared with the issues of engineering analysis, model building, mesh generation, and interacting physics.


It is possible to achieve portability while attaining good performance, but serious, up-front design is a primary requirement for this success.

Acknowledgements

Portions of this work were supported by the Advanced Research Projects Agency (ARPA) under Federal Cooperative Agreement F30602-95-2-0007. Additional support for this work was provided by the State of California under Defense Conversion Matching Grants C95-0248 and C96-0078. Centric Engineering Systems gratefully acknowledges this support as well as that provided by our commercial customers.

References

[1] A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, V. Sunderam, PVM: Parallel Virtual Machine - A User's Guide and Tutorial for Networked Parallel Computing, MIT Press, 1994.
[2] W. Gropp, E. Lusk, A. Skjellum, Using MPI: Portable Parallel Programming with the Message Passing Interface, MIT Press, 1994.
[3] Message Passing Interface Forum, Document for a standard message passing interface, Tech. Rep. CS-94-230, University of Tennessee, April 1994. Available on Netlib.
[4] T.P. Gielda, B.E. Webster, M.E. Hesse, D.W. Halt, The Impact of Computational Fluid Dynamics on Automotive Interior Comfort Engineering, AIAA Paper 96-0794, 34th Aerospace Sciences Meeting and Exhibit, January 1996.
[5] Z. Johan, Data Parallel Finite Element Techniques for Large-Scale Computational Fluid Dynamics, Ph.D. Thesis, Stanford University, 1992.
[7] C. Farhat, M. Lesoinne, Automatic partitioning of unstructured meshes for the parallel solution of problems in computational mechanics, Int. J. Numer. Methods Eng. 36 (1993) 745-764.
[8] H.D. Simon, Partitioning of unstructured problems for parallel processing, Comput. Syst. Eng. 2 (1991) 135-148.
[9] J.G. Malone, Automated mesh decomposition and concurrent finite element analysis for hypercube computers, Comput. Methods Appl. Mech. Eng. 70 (1988) 27-58.
[10] B. Hendrickson, R. Leland, A Multi-Level Algorithm for Partitioning Graphs, Sandia National Labs Report, Albuquerque, NM 87185-1110.
[11] A. George, J. Liu, Evolution of the minimum degree ordering algorithm, SIAM Rev. 31 (1989) 1-19.
[12] M.T. Heath, P. Raghavan, A Cartesian Nested Dissection Algorithm, Technical Report UIUCDCS-R-92-1772, Department of Computer Science, University of Illinois, Urbana, IL, 1992.
[13] G. Karypis, V. Kumar, Analysis of Multilevel Graph Partitioning, Report 95-037, University of Minnesota, Department of Computer Science, Minneapolis, MN 55455, http://www.cs.umn.edu/users/kumar/papers.html, 1995.
[14] Spectrum Theory Manual, Centric Engineering Systems, Santa Clara, CA, 1994.
[15] Spectrum Visualizer Reference Manual, Centric Engineering Systems, Santa Clara, CA, 1994.
[16] B.S. Holmes, J. Dias, B. Jaroux, T. Sassa, Y. Ban, Predicting the wind noise from the pantograph cover of a train, Int. J. Numer. Methods Fluids, to appear.
[17] R.M. Ferencz, H.L. Gabriel, Comparison of RANS and LES Calculations of Interior Flows in a Sports Utility Vehicle, Sixth International Symposium on Computational Fluid Dynamics, Lake Tahoe, CA, September 1995.