A NOTE ON THE RELATIONSHIP BETWEEN THE EXPRESSION AND THE AUTOMATIC DETECTION OF PARALLELISM IN PROGRAMS

KARL J. OTTENSTEIN
Dept. of Mathematical and Computer Sciences, Michigan Technological University, Houghton, Michigan 49931
ABSTRACT

A certain balance between language features and compiler complexity is required to achieve reasonable speedups and machine efficiency on multiprocessors. This short note contains one picture and about a thousand words discussing this relationship between the explicit expression and the automatic detection of parallelism in programs.

INTRODUCTION

"Supercomputers" based on multiprocessors offer the fastest means known for solving many problems of importance to science and society. As the speed of individual processors approaches a limit constrained by such parameters as the speed of light, it appears to many that the most profitable means of decreasing job turnaround time is to discover highly parallel algorithms and to design machines onto which those algorithms can be effectively mapped. Much research is proceeding in both of these areas [4]. An important area receiving much less attention is the translation of programs onto multiprocessor architectures.

The issues in language and compiler design are similar for both uniprocessor and multiprocessor environments. In each setting it is advantageous to choose languages that not only support the application at hand but that also support the development of reliable software. For example, a scientist programming a problem conceptualized with vectors will find the programming task more convenient if a language supporting vector notation is available [13]. A simple example of a language feature supporting reliability is the requirement that all variables be declared: this lessens the chance that a typing error can produce a syntactically correct program. Both of these areas are human factors and software engineering topics and are outside the scope of this note.

Several groups are advocating the use of functional programming languages (e.g., SISAL [9]) over imperative ones for programming supercomputers. An advantage of functional languages is that much parallelism is easily found due to the lack of side effects [6]. While detection of parallelism is simplified with these languages, it is not clear that transformations to map programs onto real architectures are any easier¹. The choice of a functional over an imperative language appears to be primarily a human factors issue.

High-level language compilers for both uniprocessors and multiprocessors require a global analysis and optimization phase to increase object program execution speed². On most uniprocessors, such an optimizing compiler is not likely to produce speedups much greater than a factor of 2. On a multiprocessor, an optimizing compiler is required to improve code in serial bottlenecks and can be a key element in the detection of parallel tasks for allocation onto processors. Hence the compiler might be responsible for considerable speedups, depending on the nature of the program and the number of available processors. In addition, while enabling optimization in a sequential programming environment permits better data flow anomaly detection [10], enabling optimization in a concurrent programming context can also permit concurrency anomaly detection [15].

This work was partially supported by the National Science Foundation under Grants DCR-8203487, DCR-8404463, DCR-8404909 and DCR-8511439.

¹A recent paper suggests, though, that they may be [14].

²The C language is a notable exception, as most optimizations other than efficient register allocation can be performed at the source level. The source-optimized C code can be extremely difficult to read, though, so an optimizing compiler might be preferable to maintain easily understood source code.
Interactive compilers or transformation systems and integrated environments for parallel processing are being investigated at several locations (including Rice Univ. [2,3], Univ. of Illinois, Indiana Univ. and Fujitsu). Such approaches to program development are most appropriate for programming parallel supercomputers since user interaction is often required to avoid worst case assumptions by the compiler. A compiler for a single scalar processor need not be concerned with subscript dependences, for example; on a vector or multiprocessor, however, lack of information on loop bounds can prevent the determination of independence between array references.
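A small, hypothetical C fragment (not taken from any of the systems cited above) illustrates this last point. In the first loop every iteration touches only its own element, so independence can be proven; in the second, the unknown offset k and unknown bound n prevent the compiler from ruling out a dependence between the two array references, forcing worst-case (sequential) code.

    /* Illustrative only: why unknown bounds and subscripts block
       independence tests between array references.               */
    void update(double a[], int n, int k)
    {
        int i;

        /* Loop 1: iteration i reads and writes only a[i], so iterations
           are independent and may be vectorized or run in parallel.     */
        for (i = 0; i < n; i++)
            a[i] = a[i] * 2.0;

        /* Loop 2: a[i] is written while a[i + k] is read (the caller is
           assumed to keep i + k in bounds).  Without knowing k or n, the
           compiler cannot exclude that one iteration reads an element
           written by another, so it must assume a dependence.           */
        for (i = 0; i < n; i++)
            a[i] = a[i + k] + 1.0;
    }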
A SIMPLE MODEL

The remainder of this note addresses the model depicted in Figure 1. We first examine the meaning of the axes, and then the meaning of the described space.

The vertical axis captures the amount of control over the expression of parallelism that a programmer has at the source language level. This is parenthetically called programmer cost, under the assumption that the more control the programmer has, the more time it will take to program a solution that is both efficient and correct (free of race conditions, deadlock, etc.). One can argue that the programmer should want some control over parallelism, particularly if he/she is dealing with a problem that is envisioned as consisting of cooperating processes. It is infeasible that the programmer would have full control over parallelism in a high-level language, since that would imply control of pipelining, caching, and other very low-level hardware-dependent features; this limit of language expressiveness is suggested by the heavily shaded region near the top of the vertical axis.

The horizontal axis in Figure 1 describes the amount of implicit parallelism detected by a compiler. This axis is also labeled as compiler cost, since the development costs for a compiler will increase with the amount of analysis desired. The automatic detection of all parallelism is clearly unsolvable [5]; this is represented by the dark shading at the right end of the compiler cost axis. Given the parallelism that we can detect, optimal clustering of operations into processes is at worst unsolvable and is more often an exponential-time task. Below these theoretical limits, many general transformations are known to successfully rearrange operations for pipelined, vector and multiprocessor architectures [11].

[FIG. 1. Space of Effective Language-Compiler Pairings. Horizontal axis: implicit detection (compiler $), from "none" rightward; vertical axis: explicit expression (programmer $).]
The shaded area in Figure 1 is meant to suggest those pairings of languages and compilers that can yield high speedups on a multiprocessor; the unshaded area suggests pairings that cannot produce efficient mappings. Certain pairings of language and compiler should in principle yield satisfactory results. This will not always be the case, however, due to differences in programmer ability, the complexity of the software at hand and the relative importance of program portability. Here are some areas of concern regarding explicitly expressing parallelism:

• Language constructs may not allow (or be used at) the appropriate granularity for a target machine. If the language expression granularity is larger than that of the machine, a compiler would have to find additional parallelism to effectively utilize the machine. If the granularity is smaller than that of the machine, either the compiler or the operating system would have to cluster parallel regions when mapping onto processors (a sketch of such clustering follows this list).

• Constructs may not be used to their full advantage. The programmer might miss opportunities for parallelism [8] or might accidentally declare processes that race or deadlock.

• Code modifications under maintenance might yield suboptimal results, races or deadlocks due to the complexity of dependence patterns.

• Portability could suffer. The original coding would likely be done with respect to certain costs for process creation, synchronization and communication. This solution might not map well or easily to a new cost set.
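As a minimal sketch of the clustering mentioned in the first item above (hypothetical C code, not drawn from any cited system), the fragment below shows work that the programmer expressed at the granularity of one task per array element being grouped by a compiler or operating system into one contiguous chunk of iterations per processor; on a real multiprocessor each chunk would become a separate process.

    /* Illustrative only: clustering N fine-grained "tasks" (one per
       element) into contiguous chunks, one chunk per processor.      */
    #include <stdio.h>

    #define N 1000              /* number of fine-grained parallel regions */
    #define P 8                 /* number of processors assumed available  */

    /* The work as the programmer expressed it: element granularity. */
    static void element_task(double a[], int i)
    {
        a[i] = a[i] * a[i];
    }

    /* What a compiler or operating system might generate: one coarse
       task per processor, covering a cluster of element tasks.       */
    static void chunk_task(double a[], int proc)
    {
        int chunk = (N + P - 1) / P;            /* ceiling(N / P) */
        int lo = proc * chunk;
        int hi = (lo + chunk < N) ? lo + chunk : N;
        int i;
        for (i = lo; i < hi; i++)
            element_task(a, i);
    }

    int main(void)
    {
        static double a[N];
        int i, proc;
        for (i = 0; i < N; i++)
            a[i] = i;
        for (proc = 0; proc < P; proc++)        /* each call would be a  */
            chunk_task(a, proc);                /*   process in practice */
        printf("%f\n", a[N - 1]);
        return 0;
    }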
As an example of where portability can suffer due to choosing a pairing in the unshaded area, consider programming a machine such as the Denelcor HEP, where the FORTRAN compiler does no program analysis. One might distribute work from a loop into several processes that obtain work in a self-scheduling manner [7]. If one wanted to run the same program on a vector machine, the program would have to be rewritten, since current vectorizing compilers are not able to recognize the loops that have been distributed across multiple procedures. (Vectorizing compilers [1, 12] handle DO-loops quite well, so if the HEP had a more sophisticated compiler that could convert appropriate DO-loops into self-scheduled or partitioned multiprocessor codings, programs written with ordinary DO-loops would be quite portable.) It may well be that, because of the benefits to be accrued, the user of a supercomputer is willing to forsake portability and the associated savings in labor costs in favor of speed.
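The self-scheduling idiom mentioned above can be sketched roughly as follows. This is a hypothetical illustration written in C with POSIX threads rather than HEP FORTRAN, and it is not code from [7]; the essential idea is simply that each worker repeatedly claims the next unclaimed iteration from a shared counter, so the loop's iterations are distributed dynamically over however many workers exist.

    /* Illustrative only: a self-scheduled loop.  Each worker claims the
       next iteration index from a shared counter until none remain.    */
    #include <pthread.h>
    #include <stdio.h>

    #define N       100            /* loop trip count          */
    #define WORKERS 4              /* number of worker threads */

    static double a[N];
    static int next_index = 0;                       /* shared work counter */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&lock);
            int i = next_index++;                    /* claim one iteration */
            pthread_mutex_unlock(&lock);
            if (i >= N)
                break;
            a[i] = i * 2.0;                          /* the loop body */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[WORKERS];
        int w;
        for (w = 0; w < WORKERS; w++)
            pthread_create(&t[w], NULL, worker, NULL);
        for (w = 0; w < WORKERS; w++)
            pthread_join(t[w], NULL);
        printf("%f\n", a[N - 1]);
        return 0;
    }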
CONCLUDING REMARKS
A user rarely has much choice of language or compiler by the time a machine is delivered. Hopefully, this short note can add fuel to discussions that will lead to incentives for the development and use of appropriate language-compiler pairs as components of integrated programming environments for supercomputers.
ACKNOWLEDGEMENTS

Several suggestions from the referees have improved the readability of this note. I am grateful for their comments.
REFERENCES

1. Allen, J. R. and K. Kennedy, PFC: A Program to Convert FORTRAN to Parallel Form, MASC Tech. Rep. 82-6, Dept. of Math. Sciences, Rice Univ., March 1, 1982.

2. Allen, J. R. and K. Kennedy, A Parallel Programming Environment, IEEE Software 2, 4 (July 1985), 21-29.

3. Allen, R. and K. Kennedy, Programming Environments for Supercomputers, in Supercomputers: Algorithms, Architectures and Scientific Computation, F. A. Matsen and T. Tajima (editors), Univ. of Texas Press, Austin, 1986, 19-38.

4. Arvind, ed., Responses to First Survey of Parallel Processing Projects, M.I.T. Laboratory for Computer Science, Cambridge, MA, June 1985.

5. Bernstein, A. J., Analysis of Programs for Parallel Processing, IEEE Trans. Elec. Computers EC-15, 5 (Oct. 1966), 157-163.

6. Friedman, D. P. and D. S. Wise, Aspects of Applicative Programming for Parallel Processing, IEEE Trans. Computers C-27, 4 (April 1978), 289-296.

7. Jordan, H. F., Parallel Programming on the HEP Multiple Instruction Stream Computer, August 20, 1981.

8. Kuck, D. J., E. S. Davidson, D. H. Lawrie and A. H. Sameh, Parallel Supercomputing Today and the Cedar Approach, Science 231, 4741 (February 28, 1986), 967-974.

9. McGraw, J., S. Skedzielewski, S. Allan, D. Grit, R. Oldehoeft, J. Glauert, I. Dobes and P. Hohensee, SISAL: Streams and Iteration in a Single-assignment Language (Language Reference Manual, Version 1.1), Lawrence Livermore Nat'l Lab. Report M-146, July 20, 1983.

10. Ottenstein, K. J. and L. M. Ottenstein, High-Level Debugging Assistance via Optimizing Compiler Technology (Extended Abstract), in Proc. ACM SIGSOFT/SIGPLAN Soft. Eng. Symp. on High-Level Debugging, Pacific Grove, Calif., March 20-23, 1983, 155-158. Published as ACM Software Eng. Notes 8, 4 (August 1983) and as ACM SIGPLAN Notices 18, 8 (August 1983).

11. Ottenstein, K. J., A Brief Survey of Implicit Parallelism Detection, in MIMD Computation: HEP Supercomputer and its Applications, J. S. Kowalik (editor), MIT Press, May 1985.

12. Padua, D. A., D. J. Kuck and D. H. Lawrie, High-Speed Multiprocessors and Compilation Techniques, IEEE Trans. Computers C-29, 9 (Sept. 1980), 763-776.

13. Rice, J., Comments at the Purdue Workshop on Program Transformations and Optimizing Compilation Techniques for Supercomputers, September 1984.

14. Sarkar, V. and J. Hennessy, Compile-Time Partitioning and Scheduling of Parallel Programs, in Proc. ACM SIGPLAN '86 Symp. on Compiler Construction, Palo Alto, CA, June 1986.

15. Taylor, R. N. and L. J. Osterweil, Anomaly Detection in Concurrent Software by Static Data Flow Analysis, IEEE Trans. Soft. Eng. SE-6, 3 (May 1980), 265-278.