J. Parallel Distrib. Comput. 66 (2006) 1359 – 1365 www.elsevier.com/locate/jpdc

An efficient synchronization model for OpenMP

F.C. García López ∗, N.L. Frías Arrocha

Departamento de Estadística, Inv. Operativa y Computación, Universidad de La Laguna, Avda Francisco Sánchez s/n, La Laguna, 38271, Spain

Received 24 November 2004; received in revised form 19 June 2006; accepted 28 August 2006

Abstract

It is usually difficult for OpenMP programmers to use programming techniques based on exhaustive search, such as backtracking, branch and bound, and dynamic programming. To address this problem, this paper proposes an extension to the OpenMP model consisting of a new and efficient synchronization model (the Monitor model). A translation scheme and a detailed description of the extension are also presented, together with the performance results obtained. © 2006 Elsevier Inc. All rights reserved.

Keywords: OpenMP; Synchronization model; Monitors

1. Introduction

The OpenMP model defines a set of directives and library routines for both Fortran and C/C++ [15], oriented to the parallelization of array-based computations. The model provides a higher level of abstraction than, for example, programming with POSIX threads [10]. OpenMP has been well received by the computing community, both because it is simple and because it can be applied incrementally, preserving the original sequential program. But some applications, although following the array-based model, must respect precedence relations among the data values originated from work-sharing constructs (pipelined structures). In addition, it is well known that some applications do not fit the array-based model at all, particularly those based on exhaustive search (task-queue parallelism). To enable the parallelization of such codes, new mechanisms are needed, not only to organize exclusive access to data, but also to provide synchronization and communication among threads. The OpenMP API provides constructs for exclusive access (critical) and one particular type of synchronization (barrier). Unfortunately, certain codes need other types of synchronization, and that is why we

(This work was partially supported by the Spanish MCyT project TIC2002-04242-C03-01 and the Canary Islands project TR2003/005. ∗ Corresponding author. E-mail address: [email protected] (F.C. García López).)

0743-7315/$ - see front matter © 2006 Elsevier Inc. All rights reserved. doi:10.1016/j.jpdc.2006.08.006

here propose simple extensions to the OpenMP model to cover those cases. We present these extensions fitting the OpenMP language, enabling programmers to parallelize, while maintaining the original sequential program, a large class of codes that previously required restructuring. The research community has discussed extensions to this model at length, essentially around nested parallelism [16,6,5,13,4,17,12]. Shah et al. [16] present the workqueuing model, handling recursive control as well as list and tree data structures. González et al. [6] depart from earlier approaches by creating thread groups dynamically. González et al. [5] express complex pipelined computations by defining precedence relations among the tasks originated from work-sharing constructs. Marowka [13] extends the parallel sections construct with task-index and precedence-relations matrix clauses. Dorta et al. [4] design a language supporting the programming patterns most commonly used on distributed networks. Tanaka et al. [17] show a simple and efficient implementation of OpenMP nested parallelism. Lu et al. [12] describe an implementation of OpenMP on networks of workstations (NOWs), introduce semaphores and condition variables, and point out that semaphores are suitable for pipelines and condition variables for task queues. The extension we favor does not deal with nested parallelism, but with the essence of new synchronization constructs. It is based on the monitor concept [1,7], one of the most natural and coherent synchronization and communication mechanisms. Many modern programming

languages provide some form of monitor for concurrency control. Good examples of the monitor's capabilities appear in Java ([11,14], the java.lang.Thread class) and in the POSIX standard ([10], where the monitor is implemented through libraries).

This paper is organized as follows. Section 2 motivates the proposed extensions with two types of codes: (1) codes that search for a set of solutions, or for an optimal solution satisfying some constraints (backtracking, branch and bound [8]); (2) array-based codes with precedence relations among the data values (pipelined computations, dynamic programming [8]). Section 3 overviews the syntax and semantics of the extension. Section 4 outlines some implementation details that fit the proposed model. Section 5 presents experimental results for the examples. Finally, Section 6 highlights conclusions and future work.

2. Motivating patterns

The extensions to OpenMP proposed in this paper are motivated in this section through two patterns: the first covers search techniques applicable to graphs (backtracking, branch and bound), and the second covers array-based applications characterized by a sequential propagation of computed values (pipelined computation, dynamic programming).

2.1. Search strategies

The codes in Fig. 1 exemplify search strategies. The search procedure describes a generic search scheme applicable to graphs. We call heap the data structure employed to keep the nodes not yet explored; the other two procedures (insert and extract) are data-structure operations. Parallelization is established here through directives only, without destroying the original sequential code. (Line 19) The data structure heap is declared as shared within the parallel region. The sync clause defines the monitor heap_access, containing the synchronization variable elem.
(Lines 20–21) A single thread in the team inserts the root node (single directive). (Lines 2 and 9) Concurrent access to the heap structure demands the critical construct within the insert/extract functions. (Line 4) Every time a node is inserted, the insertion is recorded on the elem variable (signal directive). (Line 10) During extraction, a thread trying to take a node before it has been created must wait until a registration from line 4 happens (wait directive). While threads are still working, each signal directive increases the count of nodes awaiting analysis. Both the wait and signal operations execute within the critical region associated with the monitor heap_access. Without sync, wait, and signal, and given that the original sequential code must not be modified, a synchronous parallel scheme would be required.

Fig. 1. Search algorithm.
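Since Fig. 1's listing is not reproduced here, the insert/extract pattern just described can be sketched directly in POSIX threads, the translation target of Section 4. This is an illustrative sketch, not the paper's code: heap_access and elem follow the names in the text, while MAX_NODES, the stack layout, and the pending counter are our own assumptions, and the barrier-like termination of the Monitor model is deliberately omitted (it is discussed next).

```c
#include <pthread.h>

#define MAX_NODES 1024            /* illustrative capacity, not from the paper */

/* Monitor "heap_access" with synchronization variable "elem",
   hand-coded the way the proposed sync clause would translate. */
static int heap[MAX_NODES];       /* unexplored nodes (kept as a stack) */
static int top = 0;
static pthread_mutex_t heap_access = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  elem        = PTHREAD_COND_INITIALIZER;
static int pending = 0;           /* counts signals, P/V-style (Section 3) */

void insert(int node) {
    pthread_mutex_lock(&heap_access);   /* critical construct */
    heap[top++] = node;
    pending++;                          /* #pragma omp signal elem */
    pthread_cond_signal(&elem);
    pthread_mutex_unlock(&heap_access);
}

int extract(void) {
    pthread_mutex_lock(&heap_access);   /* critical construct */
    while (pending == 0)                /* #pragma omp wait elem */
        pthread_cond_wait(&elem, &heap_access);
    pending--;
    int node = heap[--top];
    pthread_mutex_unlock(&heap_access);
    return node;
}
```

Because the signal is counted, an insert that precedes the matching extract is never lost, mirroring the P/V-like semantics described in Section 3.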

Such a scheme would repeat: (1) synchronize all the threads in the team (barrier construct); (2) extract heap nodes (critical construct); (3) insert heap nodes (critical construct). Even so, it is not possible to solve the stop-condition problem without modifying the code, or without adding some mechanism able to detect that all threads have found the heap empty (the end of the algorithm). Lu et al. [12] proposed a code (a task-queue implementation using the critical, cond_wait, and cond_signal constructs) that we have adapted (Fig. 2); it shows how a shared counter nwait is still needed to keep track of the number of waiting threads. Our Monitor model, in contrast, simply behaves as a barrier construct when all the working threads reach the wait function.


Fig. 3. Pipeline (N synchronization variables).

Fig. 2. Search algorithm proposed by Lu et al.
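Fig. 2's listing is not reproduced here, so the following is our own sketch, in POSIX threads rather than OpenMP directives, of the termination idea it illustrates: a shared counter nwait tracks the blocked threads, and the last thread to find the queue empty declares the algorithm finished. All names (extract_or_finish, run_workers, NTHREADS) are ours.

```c
#include <pthread.h>

#define MAX_NODES 1024
#define NTHREADS  4                 /* team size; an assumption for the sketch */

static int queue[MAX_NODES];
static int top = 0;
static int nwait = 0;               /* threads blocked waiting for work  */
static int done = 0;                /* set once every thread is waiting  */
static int processed = 0;           /* nodes handled, for the test below */
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;

/* Returns 1 with a node in *node, or 0 when the whole team is idle. */
static int extract_or_finish(int *node) {
    pthread_mutex_lock(&m);
    while (top == 0 && !done) {
        if (++nwait == NTHREADS) {      /* last thread: queue empty for all */
            done = 1;
            pthread_cond_broadcast(&c);
            nwait--;
            break;
        }
        pthread_cond_wait(&c, &m);
        nwait--;
    }
    if (done || top == 0) { pthread_mutex_unlock(&m); return 0; }
    *node = queue[--top];
    pthread_mutex_unlock(&m);
    return 1;
}

static void *worker(void *arg) {
    (void)arg;
    int node;
    while (extract_or_finish(&node))
        __sync_fetch_and_add(&processed, 1);   /* "analyze" the node */
    return NULL;
}

/* Drain the queue with NTHREADS workers; returns nodes processed. */
int run_workers(void) {
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++) pthread_join(t[i], NULL);
    return processed;
}
```

The Monitor model moves exactly this bookkeeping into the runtime: the wait directive itself detects that the whole team is blocked and releases everyone.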

All threads in the team are then synchronized and can execute the statements after the wait directive. At that point every thread has found the heap empty, so the algorithm has finished. This scheme avoids delays in detecting the stop condition.

2.2. Pipelined computations

OpenMP does not adapt well to computations characterized by a data-dependent flow of values. Fig. 3 shows an example using array-based code in which a dependence relationship between the array's iterations takes place. (Line 3) The array vec is declared as shared within the parallel construct. The number of synchronization variables (pref) declared within the monitor pipe equals the loop's size, N. (Line 13) Each iteration must wait until the work in the previous iteration has been completed (wait directive on variable pref[i-1]).

(Line 18) Once each iteration has finished, this fact is communicated to the following iteration (signal directive on variable pref[i]). (Line 7) Within a single construct, the initial iteration is marked as ready (signal directive on variable pref[0]). The code in this figure would deadlock if the directives on lines 6 and 7 were eliminated: the wait extension would block all the threads (line 13). Under the semantics described for the previous example, a barrier synchronization would then happen when all the threads reach the wait construct; no synchronization between an iteration and its predecessor would remain, and each iteration would operate independently. Usually the number of available threads is smaller than the number of iterations, so it is possible to build a code with fewer synchronization variables (Fig. 4). The only difference with respect to the previous code is the cyclic use of the nthreads synchronization variables. The extension also respects the schedule clause, which determines how iterations are mapped onto the team of threads. In the worst case, the same thread will both signal a synchronization variable and wait for it at the next iteration; performance could suffer, but not the correctness of the execution. The parallel code obtained is identical to the sequential code if the directives are ignored.
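Since Fig. 3's listing is not reproduced here, the wait/signal pipeline it describes can be sketched in POSIX threads: each iteration i waits for pref[i-1], computes, and signals pref[i], with iterations dealt cyclically to the threads as a static schedule would. N, NTHREADS, run_pipeline, and the trivial loop body vec[i] = vec[i-1] + i are our assumptions for the sketch, not the paper's code.

```c
#include <pthread.h>

#define N        8
#define NTHREADS 3

static double vec[N + 1];
static int pref[N + 1];            /* pref[i] != 0 once iteration i is done */
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;

static void signal_pref(int i) {   /* #pragma omp signal pref[i] */
    pthread_mutex_lock(&m);
    pref[i] = 1;
    pthread_cond_broadcast(&c);
    pthread_mutex_unlock(&m);
}

static void wait_pref(int i) {     /* #pragma omp wait pref[i] */
    pthread_mutex_lock(&m);
    while (!pref[i])
        pthread_cond_wait(&c, &m);
    pthread_mutex_unlock(&m);
}

static void *chunk(void *arg) {    /* iterations id+1, id+1+NTHREADS, ... */
    int id = (int)(long)arg;
    for (int i = id + 1; i <= N; i += NTHREADS) {
        wait_pref(i - 1);          /* predecessor must have finished */
        vec[i] = vec[i - 1] + i;   /* the data-dependent computation */
        signal_pref(i);            /* release the next iteration     */
    }
    return NULL;
}

double run_pipeline(void) {
    vec[0] = 0.0;
    signal_pref(0);                /* the single-construct signal of line 7 */
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, chunk, (void *)(long)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return vec[N];                 /* = 1 + 2 + ... + N */
}
```

Remove the initial signal on pref[0] and every thread eventually blocks in wait_pref, which is exactly the deadlock described above.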


Fig. 4. Pipeline (n_threads synchronization variables).

3. The Monitor model

A program begins execution as a single thread in the fork/join model defined by OpenMP [15]. This thread executes sequentially until a parallel construct is encountered; it then becomes the master thread, creating a team of threads that all execute the statements lexically enclosed within the parallel construct. Work-sharing constructs (for, sections, and single) divide the execution of the enclosed code region among the members of a team. Threads synchronize at the end of each work-sharing construct or at specific points (specified by the barrier directive). Exclusive execution is also possible through the definition of critical regions. Next we present the extensions we propose to expand the synchronization model, the Monitor model. They can be used by the threads that participate in the execution of a parallel construct. Both the syntax and the semantics we employ come from the concept known as the monitor [2].

3.1. The monitor

A monitor consists of a set of data items and a set of routines, called entry routines, that operate on them. The monitor data items can represent any resource shared by multiple tasks (threads). Tasks employ monitors to ensure exclusive access to resources, and to establish

synchronization and communication among tasks. Usually, monitor data should be manipulated only through the operations defined by its entry routines. Since only one task at a time (the active task) can execute an entry routine, mutual exclusion is assured: the monitor is locked when execution of an entry routine begins, and unlocked when the active task voluntarily gives up control of the monitor. If another task invokes an entry routine while the monitor is locked, it remains blocked until the monitor becomes unlocked. This mechanism is already available in the OpenMP model through the critical directive. For many uses it is enough that monitors provide mutual exclusion automatically. Sometimes, however, synchronization among tasks is required. For these cases monitors provide the operations wait and signal, associated with condition variables (also called event queues). A condition variable can be thought of as a queue of waiting tasks. When the active task executes a wait statement (wait q), it becomes blocked on the condition variable (q) and the monitor becomes unlocked, allowing another task to use the monitor. A task is reactivated from a condition variable (q) when another (active) task executes a signal statement (signal q). A signal statement removes one task from the specified condition variable (if such a task exists) and leaves it ready to run again. The wait and signal operations on condition variables in a monitor are similar to the P and V operations on counting semaphores [3]. The wait statement can block a task's execution, while a signal statement can cause another task to be unblocked. There are, however, differences between them. When a task executes a P operation, it does not necessarily block, since the semaphore counter may be greater than zero. In contrast, when a task executes a wait statement it always blocks.
When a task executes a V operation on a semaphore, it either unblocks a task waiting on that semaphore or, if there is no task to unblock, increments the semaphore counter. In contrast, if a task executes a signal statement when there is no task to unblock, there is no effect on the condition variable. Another difference is that tasks awoken by a V operation can resume execution without delay, whereas, because tasks execute with mutual exclusion within a monitor, tasks awoken from a condition variable are restarted only when the monitor is unlocked. Having mentioned semaphores, it is worth justifying why we decided to use monitors instead. Although semaphores are one of the most important tools for designing correct synchronization protocols, we think monitors fit the OpenMP synchronization model better: they are a natural extension to it, and do not lead to confusion as semaphores do (in our opinion, semaphores complicate, and replicate, the critical construct, and that is precisely where the trouble arises).

3.2. The monitor adapted to OpenMP

The OpenMP model needs a new clause (sync) in order to employ the concept of the monitor. We have used the term sync


because it is important for us that this kind of construct allows the synchronization of different threads. The syntax of the clause is as follows:

#pragma omp parallel ... sync(type_monitor {, type_monitor})
type_monitor = name_monitor [ ( var_sync {, var_sync} ) ]

The extension requires the presence of at least one monitor (type_monitor). Each monitor (name_monitor) may provide synchronization variables (var_sync). The synchronization variables may be of simple default types or arrays of them. These variables live in a name space separate from that of ordinary variables (as is the case with the critical directive). The concept of synchronization variable is equivalent, though not identical, to that of the condition variable. The clause is employed within a parallel construct. The critical directive should now allow a monitor previously declared within the parallel construct to appear as the optional name used to identify the critical region. In our second example (pipelined computations) we use a structured block with no statements. The critical directive's syntax could simply be extended with these modifications; in our opinion, however, it would also be correct to adopt the monitor as the exclusive-access model, as many modern programming languages do (for example Modula or Concurrent Pascal). In that case the critical construct would be preceded by the corresponding monitor clause with no synchronization variable attached. Under this semantics, the monitor's inclusion avoids the use of the identifier name within the critical construct when there is only one active monitor. To synchronize the threads, directives similar to the wait and signal statements are provided. These directives must appear inside the structured block of a critical construct so they can share the synchronization variable.
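Under this syntax, the search example of Section 2.1 would declare its monitor as shown below. This fragment uses the proposed extension, so it is not accepted by a standard OpenMP compiler; the names mirror those used in Fig. 1, and the elided bodies are placeholders.

```c
/* Proposed extension, not standard OpenMP: the sync clause declares the
   monitor heap_access with one synchronization variable, elem. */
#pragma omp parallel shared(heap) sync(heap_access(elem))
{
    #pragma omp single
    insert(root);                  /* one thread seeds the search */
    /* ... search loop calling extract/insert ... */
}

/* Inside insert, the monitor names the critical region: */
#pragma omp critical (heap_access)
{
    /* ... add the node to heap ... */
    #pragma omp signal elem        /* wait elem is used in extract */
}
```

The same monitor name thus serves both as the critical-region identifier and as the owner of the synchronization variables.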
Their syntax goes as follows:

#pragma omp wait var_sync
#pragma omp signal var_sync

The semantics of these operations is similar to that of the P and V operations on semaphores, since we need to count completed operations, as seen in the examples of the previous section. Besides the differences already described, we add one variation in order to exploit the possible blocking that arises with this type of mechanism: the blocking becomes equivalent to the synchronization associated with a barrier construct whenever all threads in the team are waiting on the synchronization variables associated with one specific monitor.

4. The implementation details

In this section we describe a translation mechanism that efficiently implements the proposed language extensions. Fig. 5

Fig. 5. Translation of extract function.

shows the code the compiler generates for the extract function (Fig. 1). (Lines 1–10) mut_t, cond_t, and monitor are types employed by our translation mechanism, which can use POSIX threads (pthread_mutex_t and pthread_cond_t). (Lines 13–19) The code declares the variables heap_access (monitor type) and elem (cond_t type) as global external variables. The lock (line 13) and unlock (line 19) statements delimit the critical directive's block. The call to the wait function (line 14) implements the synchronization among threads. The only way the synchronizations needed in these examples could be accomplished with standard OpenMP is busy waiting. Such synchronizations are coded by the programmer using variables allocated in the application address space; notice that the programmer has to introduce the flush construct to ensure memory consistency for the variables used for synchronization. These variables are not padded, so false-sharing problems may appear during synchronization, degrading performance. With our solution, false-sharing problems are eliminated and the programmer need not be aware of memory-consistency issues, as both are handled by the runtime system. Fig. 6 shows the implementation of the wait function based on POSIX threads. The code combines the P operation on semaphores with the synchronization associated with a barrier construct.
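Fig. 6's listing is not reproduced here; the following is our own single-synchronization-variable sketch of how such a wait can combine the counting P semantics with the barrier behavior: if the last team member arrives with no signal pending, everyone is released. TEAM, m_wait, and m_signal are illustrative names, and a real runtime must handle several synchronization variables per monitor and re-arm the barrier.

```c
#include <pthread.h>

static int TEAM = 1;               /* team size; 1 for the single-thread demo */
static pthread_mutex_t mon = PTHREAD_MUTEX_INITIALIZER;  /* the monitor  */
static pthread_cond_t  cv  = PTHREAD_COND_INITIALIZER;   /* one sync var */
static int pending  = 0;           /* signals not yet consumed (V count)  */
static int waiting  = 0;           /* team members blocked in m_wait      */
static int released = 0;           /* the implicit barrier has fired      */

void m_signal(void) {
    pthread_mutex_lock(&mon);
    pending++;                     /* V semantics: the signal is counted  */
    pthread_cond_signal(&cv);
    pthread_mutex_unlock(&mon);
}

/* Returns 1 if a signal was consumed (P semantics), 0 if the whole team
   was found waiting and the barrier semantics released everyone. */
int m_wait(void) {
    pthread_mutex_lock(&mon);
    waiting++;
    while (pending == 0 && !released) {
        if (waiting == TEAM) {     /* last thread in: act as a barrier */
            released = 1;
            pthread_cond_broadcast(&cv);
            break;
        }
        pthread_cond_wait(&cv, &mon);
    }
    waiting--;
    int got = 0;
    if (!released && pending > 0) { pending--; got = 1; }
    pthread_mutex_unlock(&mon);
    return got;
}
```

The return value lets the generated code distinguish a consumed signal from the barrier release, which is how the stop condition of Section 2.1 is detected without a hand-coded nwait counter.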


Fig. 6. Wait code.

5. Monitor performance

To assess the Monitor's performance, parallel measurements were taken using two benchmark programs on a Digital AlphaServer with four 466 MHz Alpha processors and 2 GB of memory, running Compaq Tru64 Unix V5.1. We chose these benchmark programs because they fit our patterns (Section 2). We also used a prototype translator that takes a C/C++ program with OpenMP directives and produces a C/C++ program using POSIX threads.

5.1. Queens

The n-queens problem consists of placing n queens on an n×n chess board so that no two of them attack each other [16]. Instead of stopping at the first solution, the code finds all possible solutions. We have rewritten the backtracking search algorithm as non-recursive code using a heap structure (a stack). The serial version does not use the stack structure but runs the n-queens code directly. The parallel version generates n subproblems, which are solved with the same code the serial version employs.

5.2. Prime factors

This program calculates the prime-number decomposition of the elements of a matrix. It has triply nested for loops (m × n × p); the outermost loop ranges over the m prime numbers used. The parallel version inserts a parallel directive (with a monitor clause that has n synchronization variables associated) at the outermost loop, and inserts the synchronization constructs (wait and signal) at the middle loop.

5.3. NAS LU

LU is a simulated CFD application from the NAS benchmarks [9]. The OpenMP version has a pipelined computation hand-coded using a synchronization vector and the flush directive. Fig. 7 presents our version of the code.

Fig. 7. NAS LU.

Performance results of the experiments are shown in Tables 1 and 2. Size denotes the number of elements used. Ts is the running time of the serial program (in seconds). Ti (i = 1, ..., 4) is the running time of the parallel program with i threads. The columns Ts/Ti (i = 1, ..., 4) measure speedup. OMP refers to the hand-coded OpenMP versions, which achieve synchronization through busy waiting; Monitor refers to our version. The Ts/T1 column shows that the sequential overhead is not significant (<2%). All programs achieve near-linear speedup and good scalability. The overhead of using many synchronization variables (250 and 500) does not prevent good parallel performance. The Monitor codes keep the sequential version of the application unmodified. We also expect better results than the OMP codes on heavily loaded systems (more threads than processors).

6. Conclusion and future work

This paper shows that the Monitor model is a natural and simple extension to OpenMP. The Monitor significantly increases the range of algorithms that can be parallelized. The benchmark speedups show how simple it is to obtain good performance on some algorithms not easily parallelized through OpenMP.


Table 1
Monitor results

Program         Size               Ts       Ts/T1    Ts/T2    Ts/T3    Ts/T4
Queens          13                 11.28    0.996    1.864    2.666    3.407
Queens          14                 74.46    0.997    1.984    2.822    3.566
Prime Factors   50 × 500 × 10^4    50.46    0.981    1.811    2.597    3.295
Prime Factors   100 × 250 × 10^4   45.80    0.984    1.869    2.692    3.461

Table 2
OpenMP versus Monitor

Program         Version   Ts       Ts/T1   Ts/T2   Ts/T3   Ts/T4
Queens          OMP       74.46    0.99    1.98    2.83    3.66
Queens          Monitor   74.46    0.99    1.98    2.82    3.57
Prime Factors   OMP       45.80    0.99    1.90    2.72    3.47
Prime Factors   Monitor   45.80    0.98    1.87    2.69    3.46
LU              OMP       188.21   0.99    1.92    2.75    3.58
LU              Monitor   188.21   0.99    1.90    2.72    3.55

The model proposed here is also valid for distributed systems, as [12] has shown. Our future work will examine other benchmark programs to further verify these experimental results, and will test other kinds of monitors in order to compare several implementation strategies for this model.

Acknowledgments

The authors would like to thank Jose Carlos González González and Jesús Alberto González Martínez for their valuable assistance, and the referees for their relevant remarks.

References

[1] P. Brinch Hansen, Operating System Principles, Prentice-Hall, NJ, 1973.
[2] P.A. Buhr, M. Fortier, M.H. Coffin, Monitor classification, ACM Comput. Surveys 27 (1) (1995) 63–107.
[3] E.W. Dijkstra, The structure of the THE-multiprogramming system, Comm. ACM 11 (5) (1968) 341–346.
[4] A.J. Dorta, J.A. González, C. Rodríguez, F. de Sande, Towards structured parallel programming, in: Proceedings of the Fourth European Workshop on OpenMP, Rome, 2002.
[5] M. González, E. Ayguadé, X. Martorell, J. Labarta, Complex pipelined execution in OpenMP parallel applications, in: Proceedings of the International Conference on Parallel Processing, Valencia, 2001, pp. 295–304.
[6] M. González, J. Oliver, X. Martorell, E. Ayguadé, J. Labarta, N. Navarro, OpenMP extensions for thread groups and their runtime support, in: Proceedings of the Workshop on Languages and Compilers for Parallel Computing, Yorktown Heights, 2000, pp. 324–338.
[7] C.A.R. Hoare, Monitors: an operating system structuring concept, Comm. ACM 17 (10) (1974) 549–557.
[8] E. Horowitz, S. Sahni, Fundamentals of Computer Algorithms, Computer Science Press, MD, 1978.
[9] H. Jin, M. Frumkin, J. Yan, The OpenMP implementation of NAS parallel benchmarks and its performance, Technical Report NAS-99-011, NASA Ames Research Center, 1999.
[10] International Organization for Standardization (ISO), Portable Operating System Interface (POSIX)-Part 1: System Application Program Interface, ISO/IEC Standard 9945-1, 1996.

[11] D. Lea, Concurrent Programming in Java: Design Principles and Patterns, Addison-Wesley, MA, 1999.
[12] H. Lu, Y.C. Hu, W. Zwaenepoel, OpenMP on networks of workstations, in: Proceedings of Supercomputing, Orlando, 1998.
[13] A. Marowka, Extending OpenMP for task parallelism, Parallel Process. Lett. 13 (3) (2003) 341–352.
[14] S. Oaks, H. Wong, Java Threads, O'Reilly, Sebastopol, 1997.
[15] OpenMP Architecture Review Board, OpenMP Application Program Interface v2.5, www.openmp.org, 2005.
[16] S. Shah, G. Haab, P. Petersen, J. Throop, Flexible control structures for parallelism in OpenMP, Concurrency: Practice and Experience 12 (12) (2000) 1219–1239.
[17] Y. Tanaka, K. Taura, M. Sato, A. Yonezawa, Performance evaluation of OpenMP applications with nested parallelism, in: Proceedings of Languages, Compilers, and Run-time Systems for Scalable Computers, Rochester, 2000, pp. 100–112.

Félix García-López has been an Associate Professor at the University of La Laguna (Tenerife, Spain) since 1997. He obtained his M.S. in Mathematics and his Ph.D. in Computer Science at the University of La Laguna. His main interests are skeletons for parallel algorithms, synchronization models, design and analysis of concurrent algorithms, parallel metaheuristics, and distributed computing. As a result of his research, in both basic research and technology transfer, Dr. García-López has published more than 25 papers in international journals and conferences, as well as 2 book chapters. He has also collaborated with several research groups.

Nieves-Luz Frías-Arrocha's professional career might best be described as eclectic. She obtained her B.S. in English Philology at the University of La Laguna (Tenerife, Spain) in 1999, and moved into translation and literary analysis. In 2000 she served as a Human Resources Director coordinating pilots, engineers, and aeronautical staff; in 2001 she became the manager of a health-care group; and in 2005, following her business sense, she started her own company, which is now beginning to take off. Her love for literature remains present in her life. Her main academic interest is the design and analysis of parallel programming languages. She has published 3 papers in international journals.