Models and Languages for High-Performance Computing
Domenico Talia, University of Calabria, Rende (CS), Italy
© 2018 Elsevier Inc. All rights reserved.
Encyclopedia of Bioinformatics and Computational Biology, doi:10.1016/B978-0-12-809633-8.20370-1
Introduction

Programming models, languages, and frameworks for parallel computers are required tools for designing and implementing high-performance applications on scalable architectures. During recent years, parallel computers ranging from tens to hundreds of thousands of processors became commercially available. They are gaining recognition as powerful instruments in scientific research, information management, and engineering applications. This trend is driven by parallel programming languages and tools that make parallel computers useful in supporting a wide range of applications, from scientific computing to business intelligence. Parallel programming languages (also called concurrent languages) permit the design of parallel algorithms as a set of concurrent actions mapped onto different computing elements (Skillicorn and Talia, 1994). The cooperation between two or more actions can be performed in many ways according to the selected language. The design of programming languages and software tools for high-performance computing is crucial for the wide dissemination and efficient utilization of these novel architectures (Skillicorn and Talia, 1998). High-level languages decrease both the design time and the execution time of parallel applications, and make it easier for new users to approach parallel computers.

Several issues must be addressed when a parallel program is designed and implemented. Many of these are questions specifically related to the nature of parallel computation, such as process structuring, communication, synchronization, deadlock, process-to-processor mapping, and distributed termination. To solve these questions in the design of an efficient parallel application, it is important to use a programming methodology that supports the designer/programmer in all the stages of parallel software development. A parallel programming methodology must address the following main issues:

1. Parallel process structuring: how to decompose a problem into a set of parallel actions;
2. Inter-process communication and synchronization: how the parallel actions cooperate to solve a problem;
3. Global computation design and evaluation: how to look at the parallel program as a whole to improve its structure and evaluate its computational costs;
4. Process-to-processor mapping: how to assign the processes that compose a program to the processors that compose a parallel computer.

Parallel programming languages offer users support in all the phases of the parallel program development process. They provide constructs, mechanisms, and techniques that support a methodology for parallel software design addressing the problems listed above. Although early parallel languages and more recent low-level tools do not provide good solutions for all the problems mentioned, in recent years significant high-level languages and environments have been developed. They can be used in all or many phases of the parallel software development process to improve the scalability and portability of applications. We discuss here representative languages and tools designed to support different models of parallelism, and we analyze both languages currently used to develop parallel applications in many areas, from numerical to symbolic computing, and novel parallel programming languages that will be used to program parallel computers in the near future.
Shared Memory Languages

Parallel languages of this class use the shared-memory model, which is implemented by parallel machines composed of several processors that share the main memory. The concept of shared memory is a useful way to separate program control flow issues from data mapping, communication, and synchronization issues. Physical shared memory is difficult to provide on massively parallel architectures, but it is a useful abstraction, even if the implementation it hides is distributed. Significant parallel languages and environments based on the shared-memory model are OpenCL, Linda, OpenMP, Java, Pthreads, Opus, SHMEM, TreadMarks, and Ease.

One way to make programming easier is to use techniques adapted from operating systems to enclose accesses to shared data in critical sections, so that exclusive access to each shared variable is guaranteed at any given time. Another approach is to provide a high-level abstraction of shared memory, often called virtual shared memory or distributed shared memory. In this case, the programming language presents a view of memory as if it were shared, although the underlying implementation may or may not be. The goal of such approaches is to emulate shared memory well enough that the same number of messages travel around the system when a program executes as would have travelled if the program had been written to pass messages explicitly; in other words, the emulation of shared memory imposes no extra message traffic. Examples of these models and systems are Linda, TreadMarks, SHMEM, and Munin. These systems emulate shared memory on distributed-memory hardware by extending techniques for cache coherence in multiprocessors to software memory coherence. This involves weakening the implementation semantics of coherence as much as possible to make the problem tractable, and then managing memory units at the operating system level.
Munin is a system that supports different consistency models in the implementation of distributed shared memory (Bennett et al., 1990). It implements a type-specific memory coherence scheme that uses different coherence models; thus, in Munin, each shared data object is supported by a memory coherence mechanism appropriate to the way in which the object is accessed. Another way is to build a programming model based on a useful set of sharing primitives for implementing shared data accessed through user-defined operations; this is the approach used by the Orca language. Here we describe the main features of two shared-memory parallel languages: OpenMP and Java.
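As a concrete illustration of the critical-section technique mentioned above, the following minimal C sketch uses the Pthreads library (one of the shared-memory environments listed earlier). The thread function, the shared counter, and the thread count are illustrative choices, not taken from the original text; the point is only that the mutex guarantees exclusive access to the shared variable.

#include <stdio.h>
#include <pthread.h>

#define NTHREADS 4

static long counter = 0;                         /* shared variable */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *work(void *arg) {
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);               /* enter the critical section */
        counter++;                               /* exclusive access to shared data */
        pthread_mutex_unlock(&lock);             /* leave the critical section */
    }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, work, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);          /* always NTHREADS * 100000 */
    return 0;
}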
OpenMP

OpenMP is an application program interface (API) that supports parallel programming on shared-memory parallel computers (OpenMP Consortium, 2002). OpenMP has been developed by a consortium of vendors of parallel computers (DEC, HP, SGI, Sun, Intel, etc.) with the aim of providing a standard programming interface for shared-memory parallel machines (as PVM and MPI are for distributed-memory machines). The OpenMP functions and directives can be used inside Fortran, C, and C++ programs. They allow for the parallel execution of code (parallel DO loops), the definition of shared data (SHARED), and the synchronization of processes. OpenMP allows a user to:
• define regions of parallel code (PARALLEL) where it is possible to use local (PRIVATE) and shared (SHARED) variables;
• synchronize processes through the definition of critical sections (CRITICAL) for shared variables;
• define synchronization points (BARRIER).
A standard OpenMP program begins its execution as a single task; when a PARALLEL construct is encountered, a set of processes is spawned to execute the corresponding parallel region of code, and when the region contains a parallel loop each process is assigned a subset of the iterations. When the execution of a parallel region ends, the results are used to update the data of the original process, which then resumes its execution. This operational model implies that support for general task parallelism is not included in this OpenMP specification. Moreover, constructs or directives for data distribution control are absent from the current releases of OpenMP. OpenMP solves code portability across different shared-memory architectures; however, it does not offer a high-level programming model for parallel software development.
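The following minimal C sketch shows how these constructs appear in practice (the array, its size, and the sum being computed are illustrative choices, not from the original text): a parallel loop distributes iterations among threads, SHARED and PRIVATE clauses control variable visibility, a CRITICAL section protects the shared accumulator, and a BARRIER defines an explicit synchronization point.

#include <stdio.h>
#include <omp.h>

#define N 1000

int main(void) {
    double a[N];
    double sum = 0.0;                    /* shared accumulator */
    int i;

    #pragma omp parallel for shared(a) private(i)
    for (i = 0; i < N; i++)
        a[i] = (double)i;                /* iterations are distributed among threads */

    #pragma omp parallel shared(sum, a) private(i)
    {
        double local = 0.0;              /* private partial sum of each thread */
        #pragma omp for
        for (i = 0; i < N; i++)
            local += a[i];

        #pragma omp critical             /* only one thread at a time updates sum */
        sum += local;

        #pragma omp barrier              /* explicit synchronization point */
    }

    printf("sum = %f\n", sum);
    return 0;
}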
Java

Java is a language that is popular because of its connection with platform-independent software delivery on the Web (Lea, 2000). However, it is an object-oriented language that also embodies a shared-memory parallel programming model. Java supports the implementation of concurrent programs through process (thread) creation (new) and execution (start). For example, the following Java instructions create three threads:

new proc(arg1a, arg1b, ...);
new proc(arg2a, arg2b, ...);
new proc(arg3a, arg3b, ...);

where proc is a class that extends the Thread class. Java threads communicate and synchronize through condition variables. Shared variables are accessed from within synchronized methods. Java programs execute synchronized methods in a mutually exclusive way, generating a critical section by associating a lock with each object that has synchronized methods. Wait and notify constructs have been defined to handle locks. The wait operation allows a thread to relinquish its lock and wait for notification of a given event. The notify operation allows a thread to signal the occurrence of an event to another thread that is waiting for that specific event. However, the notify and wait operations must be explicitly invoked within critical sections, rather than being automatically associated with section entry and exit as occurs in the monitor construct proposed about two decades ago. Thus a programmer must be careful to avoid deadlock among Java threads; the language offers no support for deadlock detection. This shared-memory programming model can be used to run Java on a sequential computer (pseudo-parallelism) or on shared-memory parallel computers. However, although the concurrent model defined in Java is based on the shared-memory model, Java was mainly designed to implement software on the Internet; therefore it must be able to support the implementation of distributed programs on computer networks. The approaches available for using Java on such platforms or on distributed-memory parallel computers are presented in Section "Object-Oriented Parallel Languages".
Distributed Memory Languages

A parallel program on a distributed-memory parallel computer (multicomputer) is composed of several processes that cooperate by exchanging data, e.g., by using messages. The processes might be executed on different processing elements of the multicomputer. In this environment, a high-level distributed concurrent programming language offers an abstraction level in which resources are defined as abstract data types encapsulated into cooperating processes. This approach reflects the model of distributed-memory architectures composed of a set of processors connected by a communication network. This section discusses imperative languages for distributed programming. Other approaches are available, such as logic, functional, and object-oriented languages; this last class is discussed in Section "Object-Oriented Parallel Languages".
Parallelism in imperative languages is generally expressed at the level of processes composed of a list of statements. We include here both languages based on control parallelism and languages based on data parallelism. Control-parallel languages use different mechanisms for process creation (e.g., fork/join, par, spawn) and for process cooperation and communication (e.g., send/receive, rendezvous, remote procedure call). Data-parallel languages, on the other hand, use an implicit approach to these issues: their run-time systems implement process creation and cooperation transparently to users. For this reason, data-parallel languages are easier to use, although they do not allow programmers to define arbitrary computation forms as control-parallel languages do. Distributed-memory languages and tools include Ada, CSP, Occam, Concurrent C, CILK, HPF, MPI, C*, and MapReduce. Some of these are complete languages, while others are APIs, libraries, or toolkits that are used inside sequential languages. We discuss some of them here, focusing in particular on the most used languages for scientific applications, such as MPI and HPF.
MPI

The Message Passing Interface or MPI (Snir et al., 1996) is a de facto standard message-passing interface for parallel applications, defined since 1992 by a forum with the participation of over 40 organizations. MPI-1 was the first version of this message-passing library; it has been extended in 1997 by MPI-2 and in 2012 by MPI-3. MPI-1 provides a rich set of messaging primitives (129), including point-to-point communication, broadcasting, barriers, reductions, and the ability to collect processes in groups and communicate only within each group. MPI has been implemented on massively parallel computers, workstation networks, PCs, etc., so MPI programs are portable across a very large set of parallel and sequential architectures. An MPI parallel program is composed of a set of similar processes running on different processors that use MPI functions for message passing. A single MPI process can be executed on each processor of a parallel computer and, according to the SPMD (Single Program Multiple Data) model, all the MPI processes that compose a parallel program execute the same code on different data. Examples of MPI point-to-point communication primitives are:

MPI_Send(msg, leng, type, ..., tag, MPI_COM);
MPI_Recv(msg, leng, type, 0, tag, MPI_COM, &st);

Group communication is implemented by primitives such as:

MPI_Bcast(inbuf, incnt, intype, root, comm);
MPI_Gather(outbuf, outcnt, outype, inbuf, incnt, ...);
MPI_Reduce(inbuf, outbuf, count, typ, op, root, ...);

For program initialization and termination, the MPI_Init and MPI_Finalize functions are used. MPI offers a low-level programming model, but it is widely used for its portability and its efficiency. It is worth mentioning that MPI-1 does not make any provision for process creation; however, the MPI-2 and MPI-3 versions provide additional features for the implementation of:
• active messages,
• process startup, and
• dynamic process creation.
MPI is becoming more and more the primary programming tool for message-passing parallel computers. However, it should be regarded as a sort of Esperanto for programming portable system-oriented software rather than a tool for end-user parallel applications, for which higher-level languages could simplify the programmer's task in comparison with MPI.
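To make the SPMD structure concrete, the following minimal C program (a sketch assuming at least two processes; the message value and tags are illustrative) combines initialization, a point-to-point exchange, a collective broadcast, and termination using the primitives listed above.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size, value = 0;

    MPI_Init(&argc, &argv);                       /* program initialization */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);         /* identity of this process */
    MPI_Comm_size(MPI_COMM_WORLD, &size);         /* number of processes */

    if (rank == 0) {
        value = 100;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);      /* point-to-point */
    } else if (rank == 1) {
        MPI_Status st;
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &st);
        printf("process 1 received %d\n", value);
    }

    /* group communication: process 0 broadcasts value to all processes */
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    MPI_Finalize();                               /* program termination */
    return 0;
}

Every process runs this same code; the rank returned by MPI_Comm_rank is what differentiates their behavior, which is the essence of the SPMD model.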
HPF

Differently from the previous toolkits, High Performance Fortran (HPF) is a complete parallel language (Loveman, 1993). HPF is the result of an industry/academia/user effort to define a de facto consensus on language extensions for Fortran-90 to improve data locality, especially for distributed-memory parallel computers. It is a language for programming computationally intensive scientific applications. A programmer writes the program in HPF using the Single Program Multiple Data (SPMD) style and provides information about desired data locality or distribution by annotating the code with data-mapping directives. Examples of data-mapping directives are Align and Distribute:

!HPF$ Distribute D2 (Block, Block)
!HPF$ Align A(I,J) with B(I+2, J+2)

An HPF program is compiled by an architecture-specific compiler, which generates the appropriate code optimized for the selected architecture. According to this approach, HPF can also be used on shared-memory parallel computers. HPF is based on the exploitation of loop parallelism: iterations of a loop body that are conceptually independent can be executed concurrently. For example, in the following loop the operations on the different elements of the matrix A are executed in parallel:

ForAll (I = 1:N, J = 1:M)
  A(I,J) = I * B(J)
End ForAll

HPF must be considered a high-level parallel language because the programmer does not need to explicitly specify parallelism and process-to-process communication.
The HPF compiler must be able to identify code that can be executed in parallel and to implement inter-process communication. So HPF offers a higher programming level with respect to PVM or MPI. On the other hand, HPF does not allow for the exploitation of control parallelism, and in some cases (e.g., irregular computations) the compiler is not able to identify all the parallelism that can be exploited in a program; in those cases it does not generate efficient code for parallel architectures.
Object-Oriented Parallel Languages

The parallel object-oriented paradigm is obtained by combining the parallelism concepts of process activation and communication with the object-oriented concepts of modularity, data abstraction, and inheritance (Yonezawa et al., 1987). An object is a unit that encapsulates private data and a set of associated operations or methods that manipulate the data and define the object behavior. The list of operations associated with an object is defined by its class. Object-oriented languages are mainly intended for structuring programs in a simple and modular way that reflects the structure of the problem to be solved. Sequential object-oriented languages are based on a concept of passive objects: at any time during the program execution, only one object is active. An object becomes active when it receives a request (message) from another object. While the receiver is active, the sender is passive, waiting for the result. After returning the result, the receiver becomes passive again and the sender continues. Examples of sequential object-oriented languages are Simula, Smalltalk, C++, and Eiffel. Objects and parallelism can be nicely integrated, since object modularity makes objects a natural unit for parallel execution. Parallelism in object-oriented languages can be exploited in two principal ways:
• using objects as the unit of parallelism, assigning one or more processes to each object;
• defining processes as components of the language.
In the first approach, languages are based on active objects. Each process is bound to a particular object for which it is created. When one process is assigned to an object, inter-object parallelism is exploited; if multiple processes execute concurrently within an object, intra-object parallelism is exploited as well. When the object is destroyed, the associated processes terminate. In the second approach, two different kinds of entities are defined: objects and processes. A process is not bound to a single object, but is used to perform all the operations required to satisfy an action. Therefore, a process can execute within many objects, changing its address space when an invocation to another object is made. Parallel object-oriented languages use one of these two approaches to support parallel execution of object-oriented programs. Examples of languages that adopted the first approach are ABCL/1, the Actor model, Charm++, and Concurrent Aggregates (Chien and Dally, 1990). In particular, the Actor model is the best-known example of this approach; although it is not a pure object-oriented model, we include it because it is tightly related to object-oriented languages. On the other hand, languages like HPC++, Argus, Presto, Nexus, Scala, and Java use the second approach. In this case, the languages provide mechanisms for creating and controlling multiple processes external to the object structure. Parallelism is implemented on top of the object organization, and explicit constructs are defined to ensure object integrity.
HPC++

High Performance C++ (Diwan and Gannon, 1999) is a standard library for parallel programming based on the C++ language. The HPC++ consortium consists of people from research groups within universities, industry, and government labs who aim to build a common foundation for constructing portable parallel applications as an alternative to HPF. HPC++ is composed of two levels:
• Level 1 consists of a specification for a set of class libraries based on the C++ language.
• Level 2 provides the basic language extensions and runtime library needed to implement the full HPC++.
There are two conventional modes of executing an HPC++ program. The first is multi-threaded shared memory, in which the program runs within one context; parallelism comes from parallel loops and the dynamic creation of threads, and this programming model is well suited to modest levels of parallelism. The second mode of execution is an explicit SPMD model, in which n copies of the same program run in n different contexts; parallelism comes from the parallel execution of different tasks, and this mode is well suited to massively parallel computers.
Distributed Programming in Java

Java is an object-oriented language that was conceived for distributed programming, although it also embodies the shared-memory parallel programming model discussed in Section "Shared Memory Languages". To develop parallel distributed programs using Java, a programmer can follow two main approaches:
• Sockets: at the lowest programming level, Java provides a set of socket-based classes with methods (socket APIs) for inter-process communication using datagram and stream sockets. The Java socket classes offer a low-level programming interface that requires the user to specify inter-process communication details; however, this approach offers an efficient communication layer.
• RMI: the Remote Method Invocation toolkit (Sun Microsystems, 1997) provides a set of mechanisms for communication among Java methods that reside on different computers with separate address spaces. This approach offers the user a higher programming layer that hides some inter-process communication details, but it is less efficient than the socket APIs.
In recent years, several efforts have been made to extend Java for high-performance scientific applications. The most significant effort is represented by the Java Grande consortium, which aimed at defining Java extensions for implementing computing-intensive applications on high-performance machines. The outcomes of several research projects on the use of Java for distributed-memory parallel computing are a set of languages and tools such as HPJava, MPIJava, JMPI, JavaSpace, jPVM, JavaPP, and JCSP.
Composition Based Languages

This section describes novel models and languages for parallel programming that have properties that make them of particular interest. Some are not yet extensively used, but they are very interesting high-level languages. The general trend observable in these languages, compared with those discussed in the previous sections, is that they are designed with stronger semantics directed towards software construction and correctness. There is also a general realization that the level of abstraction provided by a parallel programming language should be higher than was typical of languages designed in the past decade. This is no doubt partly due to the growing role of parallel computation as a methodology for solving problems in many application areas. The languages described in this section can be divided into two main classes:
• Languages based on predefined structures that have been chosen to have good implementation properties or that are common in applications (restricted computation forms or skeletons);
• Languages that use a compositional approach: a complex parallel program can be written by composing simple parallel programs while preserving the original properties.
Skeletons are predefined parallel computation forms that embody both data- and control-parallelism. They abstract away from issues such as the number of processors in the target architecture, the decomposition of the computation into processes, and communication, by supporting high-level structured parallel programming. However, with skeletons programmers cannot write arbitrary programs, but only those that are compositions of the predefined structures. The most used skeletons are geometric, farm, divide-and-conquer, and pipeline. Examples of skeleton languages are SCL, SkieCL, BMF (Bird-Meertens formalism), Gamma, and BSPlib.
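To give an idea of what a skeleton encapsulates, the following C sketch spells out by hand, with plain MPI calls, the structure that a farm skeleton would generate from a single high-level construct: a master process hands independent tasks to workers and collects their results. The task count, the tag, and the worker function are illustrative assumptions, not code from any of the languages named above.

#include <stdio.h>
#include <mpi.h>

#define NTASKS 100

static int process_task(int task) { return task * task; }   /* illustrative worker function */

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                              /* master: emit tasks, gather results */
        int next = 0, received = 0, sum = 0;
        for (int w = 1; w < size; w++) {          /* give every worker a first task */
            int msg = (next < NTASKS) ? next++ : -1;   /* -1 means "no work, stop" */
            MPI_Send(&msg, 1, MPI_INT, w, 0, MPI_COMM_WORLD);
        }
        while (received < next) {                 /* one result per issued task */
            int result;
            MPI_Status st;
            MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &st);
            sum += result;
            received++;
            int msg = (next < NTASKS) ? next++ : -1;   /* refill or terminate that worker */
            MPI_Send(&msg, 1, MPI_INT, st.MPI_SOURCE, 0, MPI_COMM_WORLD);
        }
        printf("sum of results = %d\n", sum);
    } else {                                      /* worker: process tasks until told to stop */
        for (;;) {
            int task, result;
            MPI_Recv(&task, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            if (task < 0) break;
            result = process_task(task);
            MPI_Send(&result, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}

A skeleton language hides all of this bookkeeping: the programmer only supplies the worker function and the stream of tasks.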
BSPlib

The Bulk Synchronous Parallel (BSP) model is not a pure skeleton model, but it represents a related approach in which the structured operations are single threads (Skillicorn et al., 1997). Computations are arranged in supersteps, each of which is a collection of these threads. A superstep does not begin until all of the global communications initiated in the previous superstep have completed, so there is an implicit barrier synchronization at the end of each superstep. Computational results are available to all processes after each superstep. The Oxford BSPlib is a toolkit that implements the BSP operations in C and Fortran programs (Hill et al., 1998). Some examples of BSPlib functions are listed below (a short usage sketch follows the list):
• bsp_init(n) to execute n processes in parallel,
• bsp_sync to synchronize processes,
• bsp_push_reg to make a local variable readable by other processes,
• bsp_put and bsp_get to put data to and get data from another process,
• bsp_bcast to broadcast data to n processes.
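The following C sketch follows the Oxford BSPlib interface described by Hill et al. (1998); the values being combined and the array name are illustrative assumptions, and exact function signatures may differ slightly across BSPlib implementations. It shows two supersteps: in the first, every process writes a value into a registered array on process 0 with bsp_put; after the bsp_sync that closes the superstep, process 0 can safely read all the values.

#include <stdio.h>
#include <stdlib.h>
#include "bsp.h"

int main(int argc, char *argv[]) {
    bsp_begin(bsp_nprocs());                 /* run on all available processes */
    int p = bsp_nprocs();
    int pid = bsp_pid();

    int *results = calloc(p, sizeof(int));
    bsp_push_reg(results, p * sizeof(int));  /* make the array remotely writable */
    bsp_sync();                              /* registration takes effect here */

    int local = pid * pid;                   /* some local computation */
    /* write local into results[pid] on process 0 */
    bsp_put(0, &local, results, pid * sizeof(int), sizeof(int));
    bsp_sync();                              /* end of superstep: all puts completed */

    if (pid == 0) {
        int sum = 0;
        for (int i = 0; i < p; i++)
            sum += results[i];
        printf("sum of squares = %d\n", sum);
    }

    bsp_pop_reg(results);
    free(results);
    bsp_end();
    return 0;
}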
BSPlib provides automatic process mapping and communication patterns. Furthermore, it offers performance analysis tools to evaluate the costs of BSP programs on different architectures. This feature is helpful in evaluating the performance portability of parallel applications developed with BSP on parallel computers that are based on different computation and communication structures. BSPlib offers a minimal set of powerful constructs that make the programmer's task easier than more complex models and, at the same time, it supports cost estimation at compile time.

Significant examples of compositional languages are Program Composition Notation (PCN), Compositional C++, Seuss, and CYES-C++. PCN and Compositional C++ are second-generation parallel programming languages in the imperative style. Both are concerned with composing parallel programs to produce larger programs while preserving the original properties.
Closing Remarks

Parallel programming languages support the implementation of high-performance applications in many areas, from computational science to artificial intelligence. However, programming parallel computers is more complex than programming sequential machines, because developing parallel software requires addressing several issues that are not present in sequential programming. Parallel process structuring, communication, synchronization, and process-to-processor mapping are problems that must be solved by parallel programming languages and tools to allow users to design complex parallel applications.
New models, tools, and languages proposed and implemented in recent years allow users to develop more complex programs with less effort. In recent years we have registered a trend from low-level languages towards more abstract languages that simplify the task of designers of parallel programs, and several middle-level models have been designed as a trade-off between abstraction and high performance. The research and development efforts presented here might bring parallel programming in the right direction for supporting a wider use of parallel computers. In particular, high-level languages will make the writing of parallel programs much easier than was previously possible and will help bring parallel systems into the general computer science community.
References

Bennett, J.K., Carter, J.B., Zwaenepoel, W., 1990. Munin: Distributed shared memory based on type-specific memory coherence. In: Proceedings of ACM PPOPP, pp. 168–176.
Chien, A.A., Dally, W.J., 1990. Concurrent aggregates. In: Proceedings of the 2nd SIGPLAN Symposium on Principles and Practices of Parallel Programming, pp. 187–196.
Diwan, S., Gannon, D., 1999. A capabilities based communication model for high-performance distributed applications: The Open HPC++ approach. In: Proceedings of IPPS/SPDP'99.
Hill, J.M.D., et al., 1998. BSPlib: The BSP programming library. Parallel Computing 24, 1947–1980.
Lea, D., 2000. Concurrent Programming in Java: Design Principles and Patterns, 2nd ed. Addison-Wesley.
Loveman, D.B., 1993. High Performance Fortran. IEEE Parallel & Distributed Technology, 25–43.
OpenMP Consortium, 2002. OpenMP C and C++ Application Program Interface, Version 2.0.
Skillicorn, D.B., Hill, J.M.D., McColl, W.F., 1997. Questions and answers about BSP. Scientific Programming 6, 249–274.
Skillicorn, D.B., Talia, D., 1994. Programming Languages for Parallel Processing. IEEE Computer Society Press.
Skillicorn, D.B., Talia, D., 1998. Parallel programming models and languages. ACM Computing Surveys 30, 123–169.
Snir, M., Otto, S.W., Huss-Lederman, S., Walker, D.W., Dongarra, J., 1996. MPI: The Complete Reference. The MIT Press.
Sun Microsystems, 1997. Java RMI Specification.
Yonezawa, A., et al., 1987. Object-Oriented Concurrent Programming. MIT Press.