Computer Physics Communications 97 (1996) 16-35

High-performance computing - an overview

Peter Marksteiner *
Vienna University Computer Center, Universitätsstraße 7, A-1010 Vienna, Austria
Abstract
An overview of high-performance computing (HPC) is given. Different types of computer architectures used in HPC are discussed: vector supercomputers, high-performance RISC processors, various parallel computers like symmetric multiprocessors, workstation clusters, massively parallel processors. Software tools and programming techniques used in HPC are reviewed: vectorizing compilers, optimization and vector tuning, optimization for RISC processors; parallel programming techniques like shared-memory parallelism, message passing and data parallelism; and numerical libraries.
1. Introduction
The present paper is intended to review the current status of high-performance computing at a fairly elementary level, aimed at the user - the computational physicist or chemist - rather than the computer science specialist. It should be noted that the paper represents the author's personal opinions, biases, and prejudices, and possibly some of the views expressed should be taken with a grain of salt. Obviously, so large a field as high-performance computing cannot be treated exhaustively here. A more detailed discussion with particular emphasis on RISC processors can be found in Dowd [1].
* E-mail: [email protected]
What is high-performance computing, or HPC, as it is usually abbreviated? HPC is closely related to, and in many aspects the successor of "supercomputing". Throughout the history of computing, the term "supercomputer" has been applied to many different computer systems. In the late seventies and early eighties, there was little doubt about what is a supercomputer and what is not. At that time, vector processors by far outperformed all other computers in scientific and numerically intensive applications. A factor of 1000 between the performance of a scientific computer a research group or a smaller department could afford (for example, a VAX 11/730 or VAX 11/750) and a top-level supercomputer was not uncommon. Supercomputing was characterized by exclusiveness and high prices: few institutions - military sites and (mainly US) government laboratories, aerospace and oil companies and a few supercomputer centers for university users - could
afford a supercomputer, and few researchers had access to supercomputing facilities. The traditional supercomputer users with practically unlimited demand for computing power can be found in many branches of science and industry:
• Computational fluid dynamics (CFD): atmospheric research (weather forecast, pollution research, global climate modelling), wind tunnel simulations, combustion simulation, etc.
• Structural analysis: automobile and aerospace design; finite element methods.
• Seismic modelling.
• Computational chemistry, drug design, materials research.
• Fundamental research: quantum chromodynamics (QCD), cosmogony.
• Cryptography, and many others.
In recent years, there has been increasing competition to vector supercomputers. Many other computer systems have been successfully applied for tasks that previously could only be done on top-level supercomputers. Consequently, it is now less clear what constitutes a supercomputer. Many definitions of "supercomputer" have been given, none of which is entirely satisfying. Does it make sense to claim that someone who runs a program on a vector computer is doing supercomputing, whereas someone who solves a similar problem a little more slowly on a workstation is not? Somebody who runs a parallel program on two workstations is doing supercomputing since parallel processing belongs to supercomputing, but someone who runs two serial jobs on the same two workstations is not?
Since "supercomputing" is nowadays rather ill-defined, it is customary to use the more neutral term "high-performance computing". Still, different people give different definitions of HPC; here I give mine: HPC is an interdisciplinary field of research dealing with all aspects of the solution of large-scale scientific problems requiring substantial computational resources. Topics in HPC include:
• Numerically intensive applications.
• Algorithms.
• Numerical Mathematics.
• Performance analysis and optimization.
• Programming models and programming techniques.
• Compilers and tools (preprocessors, automatic vectorization and parallelization).
• Vector and parallel architectures.
• Visualization.
Some - not all - of these topics will be discussed in the following sections.
2. Architectures for high-performance computing

The execution time of any program, on any computer system, is given by the following formula:
t = t_c × CPI × N_p,    (1)
where t_c is the cycle time, CPI is the average number of cycles per instruction, and N_p is the total number of instructions executed in a particular program. N_p is also called the path length. Eq. (1) is sometimes called the fundamental law of computer performance.
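As an illustration with hypothetical numbers (not taken from any measurement in this paper): a program with a path length of N_p = 10^9 instructions, an average of CPI = 1.5 cycles per instruction, and a cycle time of t_c = 10 ns (a 100 MHz clock) needs t = 10 ns × 1.5 × 10^9 = 15 s of CPU time; halving any one of the three factors halves the execution time.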
Ideally, all three of the factors in Eq. (1) should be optimized to achieve higher performance. In practice, these three quantities are interdependent; some compromise is required. As a result of different approaches to increase the hardware performance focussing on one or the other of the above factors, several different computer architectures are available for HPC applications. These will be discussed in the next sections.

2.1. Vector supercomputers

Although the basic concepts of vector processing were introduced about thirty years ago, vector computers became widely used in scientific computing during the late seventies and reached their peak in the eighties.
2.1.1. The principles of vector processing

The operands of conventional (scalar) instructions are single data items, also called scalars:

      ADD RT, RA, RB
This pseudo-assembler instruction takes two scalar operands - i.e. two numbers - from registers, adds them, and writes the result into a target register. For our purposes, a vector is a linear array of data, say, A(1), A(2), ..., A(N). A vector computer has additional hardware features that are
capable of performing instructions on entire vectors rather than single data items, for example
      VADD VRT, VRA, VRB
This (pseudo-assembler) vector instruction adds A(1) + B(1), A(2) + B(2), ..., A(N) + B(N) and puts the result into a vector register. The maximum number N that can be handled by a single vector instruction is called the vector section size and is usually some power of 2, e.g. 64, 128, or 256. The fact that several pairs of operands (say 128) are handled by a single instruction does not mean that all 128 elements are processed simultaneously. The higher performance of vector processing is achieved through pipelining. Complex instructions (for example, floating-point multiplication) are broken into simple steps. One such step typically takes one cycle to complete. Each step is performed by a dedicated piece of hardware called a functional unit. In a conventional (von Neumann) architecture, only one functional unit is busy at a given time; the others are idle. With pipelining, it is possible to keep several functional units busy at the same time on different operands, rather similar to an assembly line in a factory. An example of pipelining is shown in Fig. 1. A floating-point multiplication is performed in 4 steps, each step requiring one cycle. Without pipelining, every four cycles one result is produced. A vector computer requires a startup time of three cycles to fill the pipe; afterwards one result per cycle is produced.

Fig. 1. The principles of pipelining (scalar versus pipelined vector execution as a function of the cycle number).

Sometimes it is possible to obtain even more: many vector computers support compound instructions like multiply-and-add, multiply-and-subtract, multiply-and-accumulate. The results of the first (multiply) operation are not stored in a register or in main memory, but are directly transferred as input to the add pipeline. After a somewhat longer startup time, two floating-point results per cycle are produced. Some vector processors have several pipes that can operate in parallel: for example, the Hitachi S-3800/180 supercomputer can perform 16 floating-point operations per cycle on 8 parallel pipes. (A vector processor with parallel pipes is not to be confused with a vector multiprocessor - all pipes are attached to a single CPU.)
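As a concrete illustration (a generic Fortran fragment, not tied to any particular machine), the following loop is the kind of code a vectorizing compiler maps onto vector loads, a compound multiply-and-add, and a vector store; with a vector section size of 128 it is processed in chunks of 128 iterations:

      DO I = 1, N
         Y(I) = Y(I) + A * X(I)
      END DO

Every iteration is independent of all others, which is what makes the loop vectorizable; loops of this form (AXPY) appear again in the bandwidth estimate of Section 2.1.2 and in Section 3.5.1.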
2.1.2. Characteristics of vector supercomputers

Most vector supercomputers are designed to achieve maximum performance, with little or no consideration of cost. Often unconventional and therefore very expensive technologies are used. Some vector computers are built using gallium arsenide (GaAs) semiconductors rather than silicon, for example, the Convex C4 and the Cray 3. However, the performance advantages of GaAs over Si are not as pronounced as has been sometimes predicted. The cycle time of vector computers is very short, most of the current top-level systems have cycle times of about 2 ns. Because of the finite speed of light (in vacuum, light propagates 30 cm per
nanosecond, in solids considerably less) short cycle times require a high packing density and therefore large-scale cooling to dissipate the heat generated. Consequently, most vector computers are not only expensive to buy but also expensive to run.
An important characteristic of a well-designed vector computer is its overall balance: the high CPU performance should be matched by an equally high memory and input/output performance. As an example, let us consider Cray's current top model, the CRAY T932. It has 32 processors and a cycle time of 2.2 ns. Each processor can deliver 4 floating-point operations per cycle, resulting in a peak performance of about 60 Gflop/s. Its (theoretical) memory bandwidth is more than 800 Gbytes/s. This bandwidth is orders of magnitude higher than the memory bandwidth of typical RISC-based multiprocessors (see Section 2.5). To appreciate this value, let us do a little arithmetic. 800 Gbytes/s corresponds to 25 Gbytes/s per processor, or 55 bytes per cycle. One of the most frequently used vector operations, AXPY (see Section 3.5.1), requires two words to be loaded and one word to be stored per iteration. Since each CPU can perform two such operations per cycle, six 8-byte words, or 48 bytes, must be moved to or from memory. The memory bandwidth is just sufficient to transfer the amount of data necessary to keep all CPUs busy.

2.1.3. Memory organization

The memory organization has, a priori, nothing to do with the type of processor. Some vector processors (for example, the IBM ES/9000) have a cache-based hierarchical memory as discussed in Section 2.2.4. Most vector computers, however, achieve their high memory performance through interleaved memory banks. The main memory is organized into n banks. A typical value of n might be 256. Words with addresses 0, n, 2n, ... belong to one memory bank, words with addresses 1, n + 1, 2n + 1, ... to the next one, and so on. Words that are adjacent in memory always belong to different memory banks. Every memory access is associated with a bank busy time of several cycles such that subsequent access to the same memory bank is delayed. Some memory access patterns (e.g. vectors with constant stride 2^n with large n) lead to performance degradation through frequent bank conflicts. Fortunately, such patterns are comparatively rare or can be avoided (for example, by changing the leading dimension of arrays). A major advantage of such a memory organization is that the memory access time is the same for all memory locations.
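As an illustration (a generic Fortran fragment with hypothetical dimensions; the number of banks and the bank busy time are machine dependent), consider summing one row of a matrix stored in column-major order. With a leading dimension of 256 - a power of two - successive accesses are 256 words apart and keep returning to the same memory bank before its busy time has elapsed:

      REAL*8 A(256,1000), S
      INTEGER J
C     row access: stride 256, repeatedly hits the same bank
      S = 0.0D0
      DO J = 1, 1000
         S = S + A(1,J)
      END DO
C     declaring the array as A(257,1000) instead - and simply ignoring
C     the extra row - changes the stride to 257, so that successive
C     accesses are spread over all banks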
2.1.4. The future of vector processing

Today, vector processing is a mature field. The capabilities and limitations of vector processing - both its hardware and software aspects (compare Section 3.2) - are well understood; major breakthroughs are not expected. Since progress in other fields like RISC processors and parallel processing is rapid, the market share of vector computers - which has always been small - is dwindling.
The main disadvantage of vector supercomputers is their high price. Consequently, many attempts have been made to make supercomputers more affordable by using cheaper components and more conventional technology. Of course, this is only possible with substantially lower performance. Such "minisupercomputers" have only been partially successful: the minisupercomputer market has been hit particularly hard by the competition of workstations and microcomputers. Many minisupercomputer projects and companies failed. Vector minisupercomputers - and, in fact, most vector computers - are mainly used for memory-intensive applications, mostly in CFD. The major advantage of vector supercomputers is not so much the superior CPU performance but rather the overall balance of CPU, memory and I/O subsystems.
In summary, there are still problems in HPC for which top-level vector supercomputers are the optimal or indeed the only possible platform, but their number is decreasing. The future of the "classical" vector supercomputer is rather uncertain.

2.2. RISC processors

Recent advances in technology have resulted in an unprecedented performance boost for workstations and microcomputers. The progress in mainframe and supercomputer performance has been much slower. Consequently, the performance gap between large
systems and microcomputers has narrowed considerably. Many numerically intensive problems that required a supercomputer a few years ago can be done today on workstations or even personal computers. Because of the resulting threat to the established mainframes' and supercomputers' market share, the new powerful microcomputers have been nicknamed "killer micros".
2.2.1. The principles of RISC processing

The key to success was the development of a new type of processors called RISC. The concept of RISC was created in the late seventies by John Cocke's team at IBM (project 801 [2]). It was motivated by the observation that programs typically spend 80% of the time in 20% of the instructions. As a consequence of this observation, the following "reduced instruction-set computer" (RISC) was postulated:
• Reduce the instruction set to eliminate little-used instructions.
• Optimize important instructions as far as possible - preferably one cycle or less.
• Replace the eliminated instructions at the software level.
Subsequently, traditional architectures have been dubbed "Complex Instruction Set Computers" (CISC). Today, RISC processors dominate the workstation and server market and are rapidly making inroads into other markets as well. Practically all major manufacturers of CPU chips also produce RISC processors. Some examples of RISC processors produced by major vendors are given in Table 1. Modern RISC processors are sometimes quite
different from what was originally proposed nearly 20 years ago. Actually, the instruction set of some RISC processors is not very "reduced" at all. On the other hand, some RISC-like features have been incorporated into typical CISC processors like the Intel 80486.
2.2.2. Characteristics of RISC processors

Most of today's RISC processors have the following in common:
• Most instructions can be completed in one cycle. The primary design goal of RISC processors is to reduce the cycles per instruction rather than the number of instructions.
• For complex (mostly floating-point) instructions requiring more than one cycle, pipelining techniques - similar to vector processors - are used. With pipelining, one floating-point result per cycle can be obtained, with compound instructions (multiply-and-add, etc.) two.
• Superscalar processors have independent functional units that can operate in parallel. A superscalar processor is able to dispatch several instructions simultaneously (e.g. an integer, a floating-point, a condition, and a load instruction).
• Some processors have two parallel floating-point pipelines and allow up to four floating-point results per cycle.
• Instructions take all operands from registers and write all results into target registers (except, of course, explicit memory instructions like LOAD and STORE). RISC processors usually have lots of registers; a typical number might be 32 general-purpose registers and 32 floating-point registers.
Table 1
Examples of RISC processors. Most processors are available in several configurations with different clock rates, cache sizes, etc. The values in the table correspond mostly to high-end configurations. The two figures for level-1 cache refer to instruction and data cache, respectively.

Processor            Cycle time (ns)  Clock rate (MHz)  Level-1 cache  Level-2 cache  Maximum flops/cycle  Theoret. peak performance (Mflop/s)
DEC Alpha A21164     3.33             300               96 kb          4 Mb           2                    600
DEC Alpha A21064A    3.63             275               16 + 16 kb     2 Mb           1                    275
MIPS R8000           11.11            90                16 + 16 kb     4 Mb           4                    360
IBM POWER2           15               66.7              32 + 128 kb    2 Mb           4                    266
IBM MPC604           7.5              133               16 + 16 kb     512 kb         2                    266
HP PA RISC 7200      8.33             120               256 + 256 kb   -              2                    240
SuperSPARC-II        13.3             75                20 + 16 kb     1 Mb           1                    75
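The last column of Table 1 is simply the product of the clock rate and the maximum number of floating-point operations per cycle: for example, the MIPS R8000 reaches at most 90 MHz × 4 flops/cycle = 360 Mflop/s, and the DEC Alpha A21164 300 MHz × 2 flops/cycle = 600 Mflop/s.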
There is a tradeoff between clock speed and what can be accomplished in a single cycle. Very high clock rates (> 200 MHz) are only possible in single-chip implementations. Some vendors (e.g. Digital) offer single-chip processors with very high clock rates (currently up to 300 MHz), others (e.g. Silicon Graphics, IBM) have multichip processors with moderate clock speeds (< 100 MHz). Since these processors perform more operations per cycle, their overall performance is about the same. With improving chip packaging technology, even for complex processors one chip is sufficient. Therefore, more and more vendors switch to single-chip implementations. Even several processors on one chip are possible.

2.2.3. RISC versus vector supercomputers

How does the (single-processor) performance of today's RISC-based systems (high-performance workstations and servers) compare to supercomputers? Looking at the fundamental law of computer performance, Eq. (1), the approaches are quite different: vectorized programs have very short path lengths, but vector operations require many cycles per instruction; whereas RISC processors optimize the cycles per instruction at the expense of greater path lengths. Both types of processors optimize the cycle time: the cycle time of high-end RISC processors is in the range of a few nanoseconds, typical values are between 3 and 10 ns (see Table 1). This comes close to the cycle time of top-level supercomputers with typical cycle times of 1-4 ns. RISC processors can usually produce several floating-point results per cycle, typically between 1 and 4. Again, this is comparable to vector supercomputers. The "raw" CPU speed of a RISC processor - which results from the combination of the two quantities listed above - comes very close to the performance of a vector supercomputer. However, the performance of "real-world" applications is not determined by CPU speed alone: very often the crucial factor is memory performance.

2.2.4. Memory hierarchies

For economic reasons, RISC-based machines usually have dynamic RAM (DRAM) rather than the faster and more expensive static RAM (SRAM). For many years, the progress in memory capacity has
been remarkably constant: the capacity quadruples approximately every three years. Currently 16 Mbit DRAM chips are the state of the art. Unfortunately, the decrease in memory access time has been considerably slower. To minimize the delays caused by the comparatively long memory access times, most RISC processors have a hierarchical memory organization with several layers of increasing capacity and decreasing access time. Between the CPU and its associated registers (which may be considered the top level of the hierarchy) and the main memory there are usually one or two levels of cache, or high-speed buffer.
Whenever the CPU requests data from main memory (e.g. a LOAD instruction), it is first checked whether the requested data item is already in the cache. If it is, the instruction can be completed very quickly, usually in one cycle. If it is not, a cache miss occurs. The smallest unit of data that is transferred from main memory to cache is a cache line. A cache line contains several words, say 128 or 512 bytes. After the entire cache line has been replaced from main memory - which requires several cycles - the LOAD instruction can complete.
In a direct-mapped cache, each memory element can only be transferred into one particular cache location. If a program accesses many memory elements that are mapped to the same cache line, the effective cache size is just that one line, even though the others may be unused, and the performance suffers from many cache misses. This problem does not occur in a fully associative cache, where a line from memory may be transferred into any cache line. However, a fully associative cache is expensive, and its size is limited. A compromise between these two types is a set-associative cache. An example of a four-way set-associative cache is shown in Fig. 2. Each square represents the amount of memory corresponding to a cache line. A data element from main memory may be transferred into any of the four cache lines below it, but not to any other.
Many processors have two levels of cache: a small on-chip cache and a larger level-2 cache on separate chips. Many of today's workstations have kilobytes of on-chip cache (e.g. 16 kbyte instruction and 16 kbyte data cache), and about a megabyte or two of combined level-2 cache.
Fig. 2. An example of a hierarchical memory: four-way set-associative cache.

An example of the difference in memory performance between workstations with hierarchical, cache-based memory and a supercomputer with interleaved memory banks is shown in Figs. 3 and 4. The timings shown were obtained by a simple benchmark program that accesses random elements of an integer array.
The numbers correspond to the total time divided by the number of elements accessed and therefore also include some software overhead. The memory access time on the Cray is constant, no matter how large the array is. The Silicon Graphics Challenge has a 16 kbyte on-chip cache and a 1 Mbyte level-2 cache. Up to an array size of 16 kbyte, the memory access time is constant; at 16 kbyte it increases sharply due to cache misses. With increasing array size the curve flattens and approaches an asymptotic limit corresponding to a cache miss at every memory access. At an array size of 1 Mbyte, the same phenomenon occurs again for the level-2 cache.
There is also a third bump in the curve at 320 kbyte. This is due to TLB misses. A "translation lookaside buffer" (TLB) contains a translation table that maps virtual addresses to physical addresses. Qualitatively similar results were found for many different RISC-based machines with hierarchical memory. A very simple model gives a good quantitative description of the experimental results [3].

Fig. 3. Memory access times on workstations and supercomputers: small arrays.
Fig. 4. Memory access times on workstations and supercomputers: large arrays.
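A minimal sketch of such a timing benchmark is given below; it is an illustration in the spirit of the measurements above, not the author's actual program, and it assumes the Fortran 90 intrinsics RANDOM_NUMBER and SYSTEM_CLOCK. The array size NELEM would be varied from a few kbytes to several Mbytes to trace out curves like those in Figs. 3 and 4; as noted above, the loop overhead and the traversal of the index array are included in the measured time.

      INTEGER, PARAMETER :: NELEM = 1000000, NACC = 1000000
      INTEGER :: A(NELEM), IDX(NACC), I, S, T1, T2, RATE
      REAL :: R(NACC)
      A = 1
C     precompute NACC random indices into the array A
      CALL RANDOM_NUMBER(R)
      IDX = INT(R * NELEM) + 1
      S = 0
      CALL SYSTEM_CLOCK(T1, RATE)
      DO I = 1, NACC
         S = S + A(IDX(I))
      END DO
      CALL SYSTEM_CLOCK(T2)
      PRINT *, 'ns/access', 1.0E9*REAL(T2-T1)/(REAL(RATE)*NACC)
C     print the sum so that the compiler cannot optimize the loop away
      PRINT *, S
      END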
2.3. Parallel processors

The basic idea of parallel computing is very simple - indeed, so deceptively simple that the difficulties connected with parallel computing have very often been underestimated. Since the performance of a single processor is limited, and since the price/performance ratio of less powerful processors is usually better, it is natural to use several processors to attack a single problem.
In the following sections some important concepts in parallel computing are presented, and various types of parallel architectures are discussed.

2.3.1. Amdahl's law
In 1967, Gene Amdahl [4] proposed a simple model to predict the performance of programs on parallel computers. This model is based on the observation that there is always a fraction σ - however small - that cannot be parallelized and must be executed in serial. The parallelizable part (1 - σ) runs N times as fast on N processors, whereas the serial part σ remains unchanged. Therefore we get the following relation between the parallel execution time t_par and the serial (single-processor) execution time t_s:

t_par = t_s (σ + (1 - σ)/N).    (2)
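As an illustration, with numbers chosen to match one of the curves in Fig. 5: for σ = 0.05 and N = 100 processors, Eq. (2) gives t_par = t_s (0.05 + 0.95/100) ≈ 0.06 t_s, i.e. the program runs only about 17 times faster although 100 processors are used.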
This model - which still neglects parallelization and communication overhead - leads to rather pessimistic performance expectations for massively parallel processors. It is known as Amdahl's law. The ratio t_s/t_par is called speedup; the optimum speedup is obviously equal to the number of processors N (linear speedup)¹; the ratio between actual and optimum speedup is called efficiency. With a simple change of variables, Amdahl's law also applies to vector performance:

t_v = t_s (σ + (1 - σ)/R).    (3)

Here σ denotes the fraction of the program that cannot be vectorized, t_s and t_v are the scalar and the vectorized execution times, respectively, and R is the ratio of execution speed in vectorized/scalar mode of the vectorizable part of the program. Because R tends to be fairly small - typical values are 2-10 - the consequences of Amdahl's law for vector performance are usually less grave than for highly parallel systems. Some numerical examples of the speedup obtained by vectorization or parallelization are shown in Fig. 5.
Basically, Amdahl's law is a fact of life that cannot be ignored or circumvented. Nevertheless, it is inadequate in one respect: it assumes that the same problem is solved on all sizes of machines. However, the main purpose of using, say, 1000 processors instead of one is not to solve the same problem 1000 times as fast, but rather to solve a much more complex problem in the same time. It is reasonable to assume that the serial part is more or less constant for problems of the same kind and different sizes (e.g. CFD calculations with different grid sizes). The question how performance data for a given problem and machine size can be extrapolated to larger problems and a larger number of processors is called scalability. In 1988, John Gustafson [5] investigated scalability taking into account the above observation and predicted acceptable performance for many massively parallel applications.

Fig. 5. Amdahl's law - numerical examples of speedup achieved through vectorization or parallelization.

¹ In a few cases a superlinear speedup greater than N is encountered. This is usually due to the fact that the parallel tasks are smaller and fit better into the processors' caches than the serial program.

2.3.2. Communications
One of the essential problems to solve in parallel computing is the communication between the processors. The performance of any communications link is characterized by two parameters, bandwidth and latency.
Bandwidth is defined as the maximum amount of data transferred per time unit. Today's local area network (LAN) technologies usually have bandwidths of several Megabits per second, e.g. Ethernet (10 Mbit/s), TokenRing (4 or 16 Mbit/s), FDDI (100 Mbit/s). "Asynchronous Transfer Mode" (ATM) offers bandwidths up to 622 Mbit/s. Such LANs are used for the coupling of workstation clusters; their bandwidth is sufficient for parallel applications with moderate amounts of communication. Dedicated parallel processors usually have proprietary networks with bandwidths in the Gigabit range.
The time interval between a transfer request and the beginning of the actual transfer is called latency. Sometimes latency is defined as the time required to transfer a message of length zero. The total latency results from the hardware latency - the actual switching time, which is usually very short (< 1 μs) - and from the software latency, which includes packet assembly, etc., and may be orders of magnitude larger.
2.3.3. Granularity

Parallelism can be exploited at various levels:
• program (job) level,
• subroutine (task) level,
• DO-loop level,
• inter-statement level,
• intra-statement level.
To parallelize small pieces of code is called "fine-grained parallelization"; to parallelize larger chunks is called "coarse-grained parallelization". Parallelization at the job level is easiest. However, how to run independent jobs (that do not communicate with each other) in parallel is more a question of batch processing and systems administration than parallel computing proper. Parallelization at the inter-statement or intra-statement level is done, for example, by vector or superscalar processors. In the following, we are mostly concerned with task parallelization and occasionally with loop parallelization. It is obvious that fine-grained parallelization is limited by the latency: unless the "grain size" - i.e., the execution time of individual tasks - is substantially larger than the latency, most of the time will be spent in communication rather than in computation.
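To put rough numbers on this (illustrative values only, not measurements from this paper): if the total latency per message is of the order of one millisecond - a plausible order of magnitude once software latency over a LAN is included - then a task that itself runs for one millisecond and exchanges a single message spends about half of its time communicating, whereas a task that runs for 100 ms keeps the communication overhead near 1%.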
2.3.4. Stability and fault tolerance

Every processor is bound to fail from time to time, be it a hardware failure or a software problem that causes a system to "crash". The stability of a system is characterized by the mean time between failures (MTBF). In scientific applications, an MTBF of, say, a few months is perfectly acceptable. This is not true in the case of massive parallelism: even if every single processor has an MTBF of several years, the mean time between failures of any processor may be days or even hours. Therefore it is essential in the design of parallel systems to take into account that failures can and will happen, and to take appropriate precautions:
• Systems should be designed with few "single points of failure", components that cause the entire system to crash when they fail. For example, the central switch in a switched topology (see Section 2.4.4.5) is a single point of failure.
• Redundancy: The reliability of a system can be increased if spare processors are available, disk errors can be minimized by using "redundant arrays of inexpensive disks", etc.
• Automatic detection of hardware failures: The system should be capable of detecting hardware failures and automatically activating spare components.
• Fault-tolerant application programs: An application that runs on, say, 1000 processors in parallel
can be coded such that it "survives" the failure of some processor or communication link. For example, the program can check constantly if all processors are alive and reachable, and reassign a task to another processor if a failure occurs.
2.4. Characteristics of parallel processors

Since Flynn's 1972 classification [6] that distinguishes between "single instruction stream-multiple data stream" (SIMD) and "multiple instruction stream-multiple data stream" (MIMD) machines, many criteria have been used to classify parallel processors. Some important hardware characteristics are the following.
2.4.1. Number of processors

Machines with a few - obviously, at least two - processors are called moderately parallel, machines with many processors are called massively parallel processors (MPPs). Sometimes a more or less arbitrary division line between moderate and massive parallelism or some equally arbitrary further subdivision is given. Today's MPPs may have several thousands of processors. The largest number quoted in [7] is 6768 for an Intel Paragon XP/S MP².

² SIMD machines with simple special-purpose processors may even have more, e.g. MasPar.

2.4.2. Type of processors

2.4.2.1. Vector processors. Most of today's vector processors are symmetric multiprocessors (see Section 2.5), with typically up to 16 or 32 processors. A few massively parallel top-level vector supercomputers are offered by Japanese vendors (e.g. Fujitsu's VP550), but very few customers can afford the extremely high prices of such systems.

2.4.2.2. Custom-made processors. The main advantage of custom-made processors is that communication facilities necessary for parallel processing can be better integrated into the processor. The main disadvantage is that such processors are produced in smaller quantities and are therefore somewhat more
expensive than "off-the-shelf" RISC processors. Also the development usually takes longer than for standard processors, consequently the single-processor performance is frequently not competitive with the latest high-performance standard RISC processors.
2.4.2.3. Standard RISC processors. Most of today's MPPs use standard RISC processors, sometimes with minor modifications. Examples are the DEC Alpha processor used in the Cray T3D and the Intel i860 in the Paragon XP/S MP. Sometimes RISC processors are equipped with additional vector units to boost floating-point performance (e.g. Thinking Machine's CM-5 or the Meiko CS-2).

2.4.3. Memory distribution

Between the extremes of totally shared and totally distributed memory there are various hybrid forms. Examples of totally shared-memory systems are vector multiprocessors like Crays and RISC-based multiprocessor servers like the Silicon Graphics POWER CHALLENGE XL (although in the latter each processor has its own cache). A workstation cluster or an IBM SP2 is a totally distributed, "shared-nothing" system: not only has each processor its own memory, it has its own disks and operating system as well.
The distinction between shared and distributed memory is not only a question of the physical memory arrangement but also of the logical memory organization. Physically distributed memory may be globally addressable and appear to all processors as a single unified memory. In such a virtual shared memory environment only the memory access time is different, otherwise the physical memory location is irrelevant. Parallel computers with virtual shared memory are Kendall Square's KSR-1 and the Convex Exemplar. The latter has an interesting hybrid architecture consisting of hypernodes with eight processors; each hypernode has physically shared memory; the whole system consisting of several hypernodes has physically distributed, virtually shared memory.

2.4.4. Topology

The geometrical arrangement of the nodes and their interconnections is called topology.
Fig. 6. Examples of parallel computer topologies (bus, 2D torus, hypercube, 3D mesh, fat tree, central switch).
A node usually consists of one or more processors, possibly local memory and disk, and appropriate communications hardware. Some examples of the most frequently used topologies are shown in Fig. 6.
2.4.4.1. Bus and ring. These are simple topologies for a small number of processors, e.g. a workstation cluster connected by a LAN (Ethernet, TokenRing, FDDI, etc.).

2.4.4.2. Hypercube. This topology has been popular for some time because of the short distance between nodes: in a hypercube with 2^d nodes, any node can be reached from any other in at most d hops. However, its main disadvantage is its inflexibility: in order to increase the number of processors one has to double it and at the same time increase the number of connections per node, which may be difficult.

2.4.4.3. Mesh and torus. The processors are arranged in a rectangle (in 2D) or in a cube (in 3D). If the processors on the edges of the rectangle or on the faces of the cube are directly connected to the processors on the opposite edge or face, the arrangement is called a (2D or 3D) torus. In a 2D mesh, the
number of hops may be fairly large between some processors. This can be compensated by short switching times. In a 3D geometry, the number of hops is smaller, but switching is somewhat more complicated because of the larger number of connections per node. A 2D rectangular mesh is used, for example, in the Intel Paragon, and a 3D Torus in the Cray T3D.
2.4.4.4. "Fat tree". A fat tree is a hierarchical organization of nodes where there are short connections with few hops between nodes on the same "branch", and more hops between nodes on different branches. The Thinking Machine Corporation's CM-5 has a fat-tree topology.

2.4.4.5. Point-to-point connections through a central switch. The advantage of a central switch is that the communication speed between any two processors is the same. Progress in switching technology is rapid; current switches can handle about 64 processors. Examples of parallel processors using a switch are the IBM SP2 and workstation clusters (or "farms") using Digital's GigaSwitch.

2.5. Symmetric multiprocessors

Most shared-memory machines are symmetric multiprocessors (SMPs). All processors are equivalent - hence the name "symmetric" - and the operating system distributes the workload among them. (Sometimes there are slight exceptions; some SMPs are not exactly symmetric.) Most of today's vector supercomputers and many mainframes are SMPs. One of the first
multiprocessing vector computers was the Cray X-MP: its multiprocessing capabilities are implied by its name. The "RISC revolution" has also come to SMPs: recently, RISC-based SMP servers have been very successful, e.g. the SGI POWER CHALLENGE XL or the Digital AlphaServer 2100 and 8200. A schematic representation of such an SMP is shown in Fig. 7.

Fig. 7. Symmetric multiprocessor (schematically: several CPUs attached through a common memory bus to the main memory).

Most of the discussion in the present section and in Section 3.4.1 refers specifically to RISC-based SMPs. Early SMPs had two or at most four processors, today 32 and more processors are possible. Although SMPs with more and more processors are being built, there are limits. One limiting factor is the memory bandwidth. Since all processors use the same memory bus, it is obvious that n processors require roughly n times the bandwidth of a single processor. Partly this problem is addressed by very large caches. However, there is a second difficulty: the caches have to be kept coherent. Several processors might have a copy of a certain memory element in their respective caches. Whenever such a memory element is updated by one processor, all other copies have to be flagged as unusable.
Since RISC-based SMPs are usually built using the same processors as workstations, it is natural to ask how an SMP compares to a workstation cluster. SMPs have several advantages:
• Parallelization is easier, and parallel processing is more efficient: communication between tasks is much faster through a common shared memory than through message passing (unless the cluster has expensive networking and switching technology).
• SMPs allow a better memory utilization: it is easier to satisfy the memory requirements of a mixed set of jobs or tasks using a single large shared memory rather than smaller distributed memory units.
• System administration is much easier since only one system has to be maintained rather than n independent operating systems.
There are, however, some disadvantages as well:
• SMPs are more expensive than equivalent cluster configurations.
• On loaded SMP systems, there may be performance degradation through bus contention.
In summary, RISC-based SMPs are very attractive HPC platforms in environments where top-level supercomputers are not required or unavailable for financial reasons.
3. Programming for HPC - tools and techniques
In recent years, the success in building ever faster computers has been astounding. The speed a machine is capable of delivering can be calculated in a straightforward manner from the cycle time and the number of operations (usually floating-point) that can be completed per cycle. This is called the "theoretical peak performance" and is usually given in megaflops (million floating point operations per second, Mflop/s), or, more and more frequently, in gigaflops. An exhaustive list of the theoretical peak performance of many computers is given in [7]. Unfortunately, in many applications the performance is just a small fraction of the theoretical peak performance: in order to approach the theoretical hardware speed, sophisticated software tools and techniques are required. This is particularly important in the case of parallel processors, but applies to any other kind of computer as well. In many respects, the development of software technology has not been as rapid and as successful as the development of hardware technology. It should be emphasized that all published benchmark numbers like LINPACK [7], SPECfp, etc., don't reflect the hardware speed alone: they measure the quality of operating systems, compilers, and optimizers as well. For a given hardware configuration, performance usually improves over time due to
more mature compilers. In the case of well-publicized benchmarks, however, this improvement may be artificially high since many vendors make special efforts to tune their compilers to produce good benchmark figures. Real-world applications may benefit somewhat less from compiler tuning.

3.1. Numerical software - design goals and quality criteria
Since the software tools and programming techniques that are required to exploit the hardware capabilities to the fullest possible extent are a major issue in HPC, they will be discussed at some length in the following sections. However, one should not forget that high performance is not the only design goal in the development of numerical software. This is not the place to discuss the large and complex field of software engineering even in the most superficial manner. I will just enumerate briefly a few properties that - apart from high performance - characterize good numerical software.
• Correctness and reliability: It is obvious that any performance improvement is useless when the results are wrong.
• Numerical stability and accuracy: A "correct" numerical program does not necessarily produce correct results. The approximations inherent in floating-point arithmetic necessarily produce errors, and the propagation of these errors throughout the calculation should be subject to analysis [8].
• Readability and ease of maintenance: Although
there is no general agreement about what constitutes good programming style - to a large extent, this is a matter of taste - a readable, well-documented program is much easier to maintain, update, and modify.
• Ease of use: Scientific programmers traditionally pay little attention to the design of a user interface. A well-designed user interface, however, increases the reliability by minimizing user errors.
• Portability: Sometimes programs are written for a particular computer, and therefore portability is not an issue. Most scientific programs, however, are intended to run on a variety of architectures and therefore should be as portable as possible: non-standard features should be avoided, and, if
present, they should be clearly identified as such so that they can be easily replaced.
These properties are not always compatible with performance. For example, a program tuned for highest possible performance on a particular architecture is usually not very portable: it will probably run inefficiently - if it runs at all - on a different machine. Also, such a highly tuned program may be rather "tricky" and therefore less readable and more error-prone. In practice, some compromise is required. Some programming techniques in HPC are independent of the platform used. However, each type of architecture requires special considerations, which will be discussed in turn.

3.2. Vectorization software
The hardware vector capabilities can be accessed at various software levels:
• Vector assembler: This is essential for highly optimized kernels, but rarely used by application programmers.
• Vectorizing compilers: For practically all vector computers a vectorizing Fortran compiler is available; some vendors also offer vectorizing C, Pascal, and other compilers.
• Vectorized program libraries: These will be discussed in Section 3.5.
• Vectorized application packages: Many application packages in quantum chemistry, molecular dynamics, finite elements, etc., are vectorized.

3.2.1. Automatic vectorization
From the application programmer's point of view, the easiest way to produce vectorized code is to let the compiler do it. Compilers usually are able to generate vector instructions automatically. Such automatically vectorizing compilers are frequently coupled with preprocessors that perform extensive analysis and transformation of the source code to enhance vectorization. Automatic vectorization is characterized by the following properties:
• Usually only DO-loops (or equivalent loop constructs in other languages) can be vectorized.
• Some loops contain constructs which inhibit
vectorization. Sometimes vector inhibitors can be removed by rewriting the code; some programs are intrinsically not vectorizable.
• The most frequent vector inhibitor is recursion: the elements of a vector depend on the results of a previous iteration of the loop. As a simple example of a recursive loop, let us consider the following code fragment:

      DO I = 2, N
         A(I) = A(I-1)
      END DO
At every iteration of the loop, the element A(I-1) is loaded into a register and subsequently stored into A(I). This loop propagates the value of A(1) throughout the array A. Since the operand of every load operation is the result of the store operation of the previous iteration, the loop cannot be replaced by a single vector load and vector store operation: this would simply shift the array A by one, and produce the incorrect result A(1), A(1), A(2), A(3), ..., A(N-1).
• Advanced techniques enable vectorization of conditional expressions (vectorization under mask). Loops containing IF-statements can frequently be vectorized if there are no branches outside the loop.
• The gather and scatter functions allow vectorization of operations on indirectly addressed arrays like A(INDEX(I)). Naturally, indirect addressing is somewhat slower than direct addressing.
• Economic analysis determines whether it is worthwhile to vectorize a loop. This depends on the length of the vector, the amount of computation done in the loop, and possibly other factors like memory access patterns. Simply put, vectorization is most efficient for loops doing lots of computations on long vectors.
• Compiler directives can be used to control the vectorization policy. Such directives are comment lines which are ignored by other compilers. As an example, consider the following loop:

      DO I = M, N
         A(I) = A(I+K)
      END DO
For negative K, this loop is recursive, as we have seen in the previous example. For positive K, however, it just shifts the elements of A, whether vectorized or not. In a particular case the programmer might know that K will always be
positive and instruct the compiler by a directive to ignore possible recursion. In some cases, the programmer might know that a loop with variable length is always very short and therefore possibly more efficient in scalar mode. For these purposes, Cray's cf77 compiler has the following directives:

CDIR$ IVDEP
CDIR$ SHORTLOOP
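A sketch of how such a directive is applied to the loop above (the directive names are the cf77 directives just quoted; other compilers use a different spelling): placed immediately before the loop, IVDEP tells the compiler to ignore the apparent recursion and to vectorize, which is safe only if K is known to be positive, as discussed above.

CDIR$ IVDEP
      DO I = M, N
         A(I) = A(I+K)
      END DO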
The success of automatic vectorization varies: the speedup is sometimes substantial, frequently moderate; in unfortunate cases, the vectorized program runs slower. In many cases, high performance can only be achieved by manual optimization and vector tuning.
The most "natural" loop ordering is not always the most efficient. Most compilers can only vectorize the innermost loop, although preprocessors can sometimes rearrange the loop order. Lastly, one should not forget scalar optimization. If vector operations are very fast compared to scalar ones, even in a highly vectorized program the scalar part requires a substantial fraction of the CPU time. There are many (mostly platform-dependent) scalar optimization techniques. The most important one is probably strength reduction, the replacement of "expensive" operations like exponentiation or division by "cheaper" ones like multiplication.
vector tuning.
3.2.2. Optimization and vector tuning

Prior to any tuning, performance analysis is essential to determine the "hot spots", the parts of the program that use most of the CPU time. Many scientific programs consist of tens of thousands of lines of code or even more; fortunately, in many cases a small amount of the code uses most of the CPU time: all optimization efforts should concentrate on these parts. Several tools for performance analysis (prof, gprof, pixie, and others) are available on various platforms.
The first goal in vector tuning is to increase the vectorized part of the program: Amdahl's law shows that even a fairly small scalar part substantially decreases the overall performance. To achieve this, one should
• Use DO-loops at all: rewrite the code, whenever possible, as loops.
• Recognize and, if possible, remove vector inhibitors.
Instead of local code modifications like rearranging loops, global reorganizations are often more efficient. In many cases it is possible to replace a scalar algorithm by a vectorizable one, or, better still, by a call to a vector library.
The fact that a program is highly vectorized does not necessarily mean that its performance is good. Therefore the second goal is to ensure efficient use of the vector capabilities. Things to be checked are vector lengths, the use of compound instructions, memory traffic and memory access patterns, etc. Particular care should be taken with nested loops. The most "natural" loop ordering is not always the most efficient. Most compilers can only vectorize the innermost loop, although preprocessors can sometimes rearrange the loop order.
Lastly, one should not forget scalar optimization. If vector operations are very fast compared to scalar ones, even in a highly vectorized program the scalar part requires a substantial fraction of the CPU time. There are many (mostly platform-dependent) scalar optimization techniques. The most important one is probably strength reduction, the replacement of "expensive" operations like exponentiation or division by "cheaper" ones like multiplication.
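A typical strength-reduction transformation is sketched below (a generic illustration; note that replacing a division by a multiplication with the reciprocal may change the result in the last bits):

C     division inside the loop ...
      DO I = 1, N
         A(I) = B(I) / C
      END DO
C     ... replaced by a multiplication with the precomputed reciprocal
      RC = 1.0D0 / C
      DO I = 1, N
         A(I) = B(I) * RC
      END DO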
3.3. Optimization for RISC processors

Although RISC processors require optimization techniques that are rather different from those used for vector processors, they could benefit from much of the experience gained with vector processing. Automatic compiler optimization usually produces good results, and software tools like optimizing preprocessors that have been originally developed for vector processors have been adapted for RISC processors. As with all automatic optimization tools, the success of preprocessor optimization varies.
To achieve maximum performance on a RISC processor, all pipelines must be kept busy as far as possible. There are two main reasons for pipeline stalls. First, the different stages of a pipeline may not be able to execute in parallel because of interdependences of the elements being processed. This problem is essentially equivalent to recursion as discussed in Section 3.2.1. Secondly, the memory bandwidth may not be sufficient to "feed" the pipelines.
The first problem is usually addressed by loop unrolling, which can often be done by an automatic preprocessor. For example, a loop of the type

      DO I = 1, N
         A(I) = ...
      END DO
is replaced by

      DO I = 1, N, 4
         A(I)   = ...
         A(I+1) = ...
         A(I+2) = ...
         A(I+3) = ...
      END DO
with appropriate "tidy-up" code to take care of values of N not divisible by 4.
The second problem can only be addressed if it is possible to rewrite the code to improve the ratio of arithmetic to memory operations. For example, the _AXPY operation requires 3 memory operations for 2 floating-point operations, and no amount of unrolling or rearranging will change this. Improvement is possible using nested loops or level-2 and level-3 BLAS (see Section 3.5.1).

3.3.1. Programming considerations for hierarchical memories

On cache-based machines, a bad memory access pattern may lead to dramatic performance degradation. The actual performance may be a tiny fraction of the theoretical peak performance. Several programming techniques help to use hierarchical memories efficiently:
• Data reuse: as many computations as possible should be performed on each data item.
• Contiguous memory access: as discussed in Section 2.2.4, with every cache miss several elements are transferred from main memory to the cache. It helps to reduce cache misses if all these elements are subsequently used. If at all possible, data elements that are adjacent in memory should be processed sequentially.
• Cache blocking: large matrices and arrays should be broken into smaller chunks that fit into the cache. To a limited extent, this can be done automatically by optimizing preprocessors; a sketch is given at the end of this section.
• Large effective cache size: some memory access patterns lead to a small effective cache size (see Section 2.2.4) and should be avoided.
A comprehensive discussion of optimization techniques for RISC processors can be found in [1] and [9]; the latter focuses on the IBM RS/6000, but much of the material in [9] is applicable to other RISC processors as well.
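As an illustration of cache blocking (a generic sketch, not taken from any particular library; in practice one would rather call a tuned level-3 BLAS routine such as DGEMM), the matrix update C = C + A·B can be arranged so that NB × NB blocks are reused while they reside in the cache:

      SUBROUTINE BLKMM(A, B, C, N, NB)
C     blocked matrix multiplication C = C + A*B for N x N matrices
C     (Fortran column-major order); NB is chosen so that three
C     NB x NB blocks fit into the cache
      INTEGER N, NB, I, J, K, II, JJ, KK
      REAL*8 A(N,N), B(N,N), C(N,N)
      DO JJ = 1, N, NB
         DO KK = 1, N, NB
            DO II = 1, N, NB
               DO J = JJ, MIN(JJ+NB-1, N)
                  DO K = KK, MIN(KK+NB-1, N)
                     DO I = II, MIN(II+NB-1, N)
                        C(I,J) = C(I,J) + A(I,K)*B(K,J)
                     END DO
                  END DO
               END DO
            END DO
         END DO
      END DO
      END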
3.4. Parallel programming

Parallel programming is a very large and active field of research. Many tools, models, languages, etc., for parallel programming have been developed. Only a few of the most important ones will be discussed in the following sections.

3.4.1. Parallelization on symmetric multiprocessors

Since a single operating system - usually some variant of Unix - has to manage the workload of several processors, the operating system itself is a parallel application and supports parallelization to a certain extent. It is possible to write parallel applications using standard system calls without specific parallel software: the fork system call enables the creation of a child process that may run in parallel with its parent, there are various mechanisms like signals and sockets for inter-process communication, etc. In fact, parallel programming and systems programming have a lot in common: many of the things that will be discussed below, like locks, events, synchronization, etc., are well known to systems programmers. However, application programmers usually work with specific parallel software, usually parallel programming languages and parallelizing compilers.
Many compilers have the capability of automatic parallelization. This works well in a few cases, but the results are usually less satisfying than for automatic vectorization. Sometimes automatic parallelization, assisted by directives, is handled by a preprocessor. Explicit parallelization rather than automatic parallelization is often more successful. It is supported by several compilers through directives and language extensions. While there are few vector extensions to standard programming languages, there are often substantial parallel additions to languages like Fortran and C. The lack of standards makes programs using parallel constructs non-portable.
These parallel additions have to take care of two main problems: first, the handling of parallel tasks, in particular their synchronization. Various synchronization mechanisms are available:
• Locks ensure sequential access to some data item. For example, a pool of tasks might be distributed to a set of processors. After finishing a task, each processor reads a counter to determine which task to perform next, and updates the counter. Without a proper locking mechanism, two processors might access the counter simultaneously and perform the same task twice.
These parallel additions have to take care of two main problems. The first is the handling of parallel tasks, in particular their synchronization. Various synchronization mechanisms are available:
• Locks ensure sequential access to some data item. For example, a pool of tasks might be distributed to a set of processors. After finishing a task, each processor reads a counter to determine which task to perform next, and updates the counter. Without a proper locking mechanism, two processors might access the counter simultaneously and perform the same task twice (a sketch of this scenario is given at the end of this subsection).
• Events are a means of communication between tasks. After reaching a certain point in its execution, a task posts an event. Other tasks may enquire whether a certain event has been posted, or wait for the event.
• Barriers are special events for which all tasks have to wait.
• Eureka synchronization causes all tasks to stop as soon as one task has reached a certain goal. This is used, for example, in parallel database searches.
Secondly, the programmer has to take care of the sharing and distribution of data. Usually, each task has its own address space. Some data are global to all tasks (in Fortran, usually through COMMON blocks). Incorrect sharing is a very common source of errors: very often data that should be declared local (or private) to a task are declared as global, and vice versa.
As an example of an error caused by incorrect synchronization, consider the following scenario: two tasks, A and B, execute in parallel. After t1 CPU seconds, A requires a result produced by B after t2 CPU seconds, where t1 is a little larger than t2. A correct program has an appropriate synchronization mechanism (e.g. B posts an event, A waits for the event). A buggy program that omits synchronization will still produce correct results in most cases: since t1 > t2, task B will probably already have produced the result at the time it is needed by A. Whether the program produces correct results or not thus depends on external circumstances like the system load. Such bugs are extremely hard to find. Another common synchronization error is deadlock: two tasks that wait for each other are blocked forever.
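The following sketch (not from the original text) shows the lock-protected task counter described above. The routines lock_on, lock_off and do_task are hypothetical placeholders for whatever primitives a particular shared-memory system provides; the creation of the lock itself is system-dependent and omitted. The counter next is shared (global) among all parallel tasks, whereas mytask is private to each task.

      SUBROUTINE worker(next, ntasks)
!     Each parallel task executes this routine.  The shared counter
!     "next" holds the number of the next unprocessed task; the lock
!     guarantees that only one processor at a time reads and updates it.
      INTEGER next, ntasks, mytask
   10 CONTINUE
         CALL lock_on()
         mytask = next
         next   = next + 1
         CALL lock_off()
         IF (mytask .GT. ntasks) RETURN
!        The tasks are independent, so they may be executed in any
!        order and on any processor.
         CALL do_task(mytask)
      GO TO 10
      END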
3.4.2. Message passing

Message passing is most suitable for distributed-memory parallel processors and networked computers, but can be implemented on any kind of parallel computer. Messages serve two basic purposes: first, to distribute the data among the processors, and secondly, to control the program flow: initiation of tasks, synchronization, locks, etc.
There are many variants of the basic message-passing operations SEND and RECEIVE: messages may be sent to a specific recipient or broadcast to all processes; a RECEIVE operation may accept a message from any processor or only from a specific sender; sending of a synchronous message is completed only after it has been received, while an asynchronous message may be retrieved by the recipient at a later time; a blocking message blocks the sending processor until the SEND operation has completed, while a processor can proceed with computations while sending a non-blocking message.
Several implementations of message-passing systems are available: TCGMSG, PVM, p4, and many others. Recently, a standard for message-passing calls has been defined, the Message Passing Interface, MPI [10]. This interface may be built on top of existing message-passing libraries.

3.4.2.1. PVM - Parallel Virtual Machine.

PVM [11] is probably the most successful and most widely used message-passing system. It was primarily designed for networks of workstations, but is successfully used on any kind of parallel computer, including shared-memory multiprocessors. PVM is particularly suited to a heterogeneous environment: it can be implemented on practically any machine running some variant of Unix. For example, it is possible to run a single program simultaneously on a vector supercomputer, a massively parallel machine, and a workstation cluster: from a PVM user's point of view, the collection of all these machines forms a single distributed-memory computer, the "parallel virtual machine".
PVM is implemented as a set of library routines callable from Fortran, C, and C++. PVM routines perform tasks such as initiating ("spawning") tasks on any computer in the network belonging to the virtual machine, packing data into a message buffer, and, of course, sending and receiving messages. PVM is in the public domain; some commercial versions are also available (e.g. optimized shared-memory versions). A minimal example of the Fortran interface is sketched below.
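The following sketch (not from the original text) indicates what the Fortran interface looks like: a parent program spawns one copy of an executable called worker (a hypothetical name) and sends it an array of 1000 double-precision numbers with message tag 1. The constants PVMDEFAULT and REAL8 are defined in the include file fpvm3.h; error handling is omitted.

      PROGRAM pvmdemo
      INCLUDE 'fpvm3.h'
      INTEGER mytid, tids(1), numt, bufid, info
      DOUBLE PRECISION x(1000)
!     Enrol this process in the parallel virtual machine.
      CALL pvmfmytid(mytid)
!     Start one copy of the executable 'worker' on any available host.
      CALL pvmfspawn('worker', PVMDEFAULT, '*', 1, tids, numt)
      x = 1.0D0
!     Pack the array into a send buffer and send it with message tag 1.
      CALL pvmfinitsend(PVMDEFAULT, bufid)
      CALL pvmfpack(REAL8, x, 1000, 1, info)
      CALL pvmfsend(tids(1), 1, info)
!     Leave the virtual machine.
      CALL pvmfexit(info)
      END

On the receiving side, the matching calls are pvmfrecv(-1, 1, bufid), where -1 accepts a message from any sender, followed by pvmfunpack(REAL8, x, 1000, 1, info); see [11] for the full interface.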
3.4.3. Data parallelism

The parallel programming methods described so far all implement control parallelism: constructs like messages, barriers, etc., control the program flow. The explicit programming of these constructs can be rather tedious and, as discussed in Section 3.4.1, somewhat error-prone.
Many applications involve operations on large arrays. In a data-parallel programming model the distribution of arrays among the processors, rather than the distribution of tasks, is specified. The details of the underlying control mechanism (e.g. message passing) are hidden from the programmer. There are a few programming languages that implement a data-parallel programming model, notably C* and HPF. Data-parallel programming models have been rather successful on a restricted class of problems, but lack general applicability.

3.4.3.1. HPF - High-Performance Fortran.

High-Performance Fortran, HPF [12], is a superset of Fortran 90 with support for data parallelism. HPF is not yet very widely used, but there are already several commercially available HPF compilers for MPPs and workstation farms. There is only one additional statement, FORALL. Although the FORALL construct looks very similar to a DO-loop, there is a basic difference: in a DO-loop the elements are processed in a specified order, whereas the FORALL statement implies that the array elements may be processed in any order or in parallel. The processor arrangement, the data mapping and the array distribution are specified by compiler directives. The following example illustrates the use of FORALL and HPF compiler directives.
      PROGRAM gaussian
!     Gaussian elimination
      INTEGER, PARAMETER :: n = 1000
      REAL, DIMENSION (n, n) :: a
!HPF$ DISTRIBUTE a(*, CYCLIC)
      CALL init(a, n)
      DO k = 1, n-1
         a(k+1:n, k) = a(k+1:n, k) / a(k, k)
         FORALL (i=k+1:n, j=k+1:n) a(i, j) = a(i, j) - a(i, k)*a(k, j)
      END DO
      CALL out(a, n)
      END PROGRAM gaussian
This particular DISTRIBUTE directive means that there is no distribution across rows (i.e., every processor receives entire columns); the cyclic column distribution means (in the case of, say, 10 processors) that processor 1 receives columns 1, 11, 21, ..., processor 2 receives columns 2, 12, 22, ..., etc.
Within the FORALL construct, each processor updates those elements of A that it has received in the above distribution, in any order whatsoever. The kth column of A is required by all processors and has to be broadcast; the necessary message-passing calls or equivalent communication functions are generated automatically by the HPF compiler.
3.5. Numerical libraries
Many mathematical problems occur over and over again in many different applications. The most widespread is probably the solution of linear systems of equations, but there are many others: fast Fourier transforms (FFTs), eigenvalue problems, special functions, integration, ordinary and partial differential equations, optimization (linear programming), etc. Routines for practically all common problems are available in numerical libraries; there are comprehensive commercial libraries and many specialized public-domain products. The use of numerical libraries has several advantages:
• Saving of programmer time: the optimization techniques discussed so far are mainly aimed at saving computer time, but time spent by humans is probably a more valuable resource. There is little to be gained from, say, programming yet another Gaussian elimination routine (except for educational purposes) since dozens are freely available.
• Library routines are usually of high quality. Many people who write scientific application programs are experts in their respective fields, e.g. physics, but not necessarily in numerical mathematics, software design, etc. (see Section 3.1); people who design numerical libraries usually are. Moreover, bugs that might be present in numerical library routines are quickly detected and fixed, since the routines are used by many different people in many different applications.
• A reasonable compromise between performance and portability: programs using numerical libraries are generally portable. The important libraries are available on many different computers, and even if a particular library is not present it is easier, for example, to replace one call to an FFT routine by another one than to recode a routine that uses some non-standard feature.
Library routines tend to be fairly efficient, and many vendors offer highly optimized libraries for various architectures (e.g. SCILIB for Crays, the Convex MLIB, ESSL for the IBM RS/6000).

3.5.1. Basic Linear Algebra Subprograms - BLAS
Originally, the Basic Linear Algebra Subprograms (BLAS, later to be called level-1 BLAS) [13] were a byproduct of the development of LINPACK [14], a public-domain library for the solution of systems of linear equations. In LINPACK, basic vector operations (and a few others) were implemented as calls to Fortran 77 subroutines and functions. Examples of BLAS routines are the _AXPY routine, which multiplies a vector by a scalar and adds it to another vector, and the _DOT function, which calculates the inner product of two vectors. (The underscore stands for any of the letters S, D, C, or Z, corresponding to the single- or double-precision, real or complex versions of the respective routines.)
In LINPACK, nearly all of the computational work is done within the BLAS routines. However, it turned out that the BLAS are not very efficient on many platforms. The main reason is the high ratio of memory operations to arithmetic operations (see Section 3.3). Subsequently, sets of routines for matrix-vector and matrix-matrix operations were defined, the "level-2" BLAS [15] and the "level-3" BLAS [16]. For example, the level-2 BLAS routine _GEMV multiplies a vector by a matrix, and the level-3 BLAS routine _GEMM performs a general matrix multiplication (see the example below).
The great advantage of the BLAS is that many optimized implementations for different architectures exist. The BLAS are easily vectorized; implementations of the level-2 or level-3 BLAS using optimization techniques like cache blocking (see Section 3.3.1) may run close to peak performance on RISC processors. Programs that use the BLAS as far as possible for compute-intensive kernels are portable - since in any case at least the unoptimized source-code BLAS are available - as well as highly efficient on a variety of platforms.
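As an illustration (a sketch, not from the original text, assuming double-precision n x n arrays a, b and c declared with leading dimension n), the update C = C + A*B, which would otherwise be written as a triply nested loop, reduces to a single call of the double-precision level-3 routine DGEMM:

!     C := alpha*A*B + beta*C with alpha = beta = 1, i.e. C := C + A*B.
!     'N' means that neither A nor B is transposed; n is used both as
!     the matrix order and as the leading dimension of a, b and c.
      CALL DGEMM('N', 'N', n, n, n, 1.0D0, a, n, b, n, 1.0D0, c, n)

An optimized DGEMM performs the blocking, unrolling and other machine-specific tuning internally, so the calling program remains portable.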
3.5.2. LAPACK

LAPACK [17] is a comprehensive public-domain package for linear algebra and eigenvalue problems.
It is the outcome of a large international collaboration involving dozens of research groups. With minimal exceptions (DOUBLE COMPLEX), LAPACK is written in standard-conforming Fortran 77. It uses level-3 BLAS whenever possible, and level-2 BLAS otherwise; the level-1 BLAS are only used on a few occasions where they have no impact on performance. LAPACK contains extensive documentation, testing and timing routines. It has been extensively tested on a variety of platforms; since LAPACK is portable, reliable, and efficient, it is highly recommended for all sorts of problems involving dense matrices.
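For example (a sketch, not from the original text), a general dense linear system Ax = b in double precision is solved by a single call to the LAPACK driver routine DGESV; the order n, the number of right-hand sides nrhs, the arrays a, b and ipiv, and the variable info are assumed to be declared by the caller:

!     Solve A*x = b for nrhs right-hand sides.  On exit, a contains the
!     LU factors of A, ipiv the pivot indices, and b the solution;
!     info = 0 indicates success.
      CALL DGESV(n, nrhs, a, n, ipiv, b, n, info)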
Current work on LAPACK focuses on the user interface and on parallelization. Besides a C version generated by the automatic conversion program f2c, a part of LAPACK has been implemented in C++, and work on a Fortran 90 version is in progress. A distributed-memory parallel version called ScaLAPACK uses the "Basic Linear Algebra Communication Subprograms" (BLACS) [18] for the distribution of data, in particular matrices. The BLACS provide a standardized interface (in Fortran or C); implementations of the BLACS for a particular machine are built on top of the vendor's communication software or on message-passing systems like PVM and MPI.
The various LAPACK implementations, the BLAS, the BLACS, PVM, and much more are available from netlib, an invaluable repository of freely available numerical libraries, mathematical software packages, documents, etc. Netlib can be accessed by electronic mail, anonymous ftp or the WorldWideWeb (http://www.netlib.org/, ftp://ftp.netlib.org; in Europe also http://www.netlib.no/).
Acknowledgements

The present review article makes no claim to originality. Apart from the references given, much of the material presented was collected from electronic sources like Usenet news (comp.sys.super, comp.benchmarks, etc.), from vendors' WorldWideWeb pages, and - last but not least - from netlib.
References

[1] K. Dowd, High Performance Computing (O'Reilly & Associates, Sebastopol, CA, 1993).
[2] G. Radin, The 801 Minicomputer, in: Proc. Symposium on Architectural Support for Programming Languages and Operating Systems, Palo Alto, CA, March 1-3, 1982; IBM Journal of Research and Development 27 (1983) 237.
[3] P. Marksteiner, unpublished results.
[4] G. Amdahl, Validity of the single-processor approach to achieving large-scale computing capabilities, in: Proc. AFIPS Conference, Atlantic City, NJ (AFIPS Press, Reston, 1967) pp. 483-485.
[5] J.L. Gustafson, The scaled-size model: a revision of Amdahl's law, in: Proc. Third International Conference on Supercomputing, eds. L.P. Kartashev and S.I. Kartashev, Boston (1988) pp. 130-133.
[6] M.J. Flynn, IEEE Transactions on Computers C-21 (1972) 948.
[7] J.J. Dongarra, Performance of Various Computers Using Standard Linear Equation Software, Technical Report CS-89-85, University of Tennessee, Knoxville (1989-1995), available online at ftp://ftp.netlib.org/benchmark/performance.
[8] A thorough discussion of these questions (focussing on linear algebra) can be found in: G.H. Golub and J.M. Ortega, Matrix Computations, 2nd ed. (Johns Hopkins University Press, Baltimore, 1989); the numerical aspects of HPC are discussed extensively in: C. Überhuber, Computer-Numerik (Springer, Berlin, Heidelberg, New York, 1995) [in German].
[9] Optimization and Tuning Guide for Fortran, C, and C++, IBM Publication No. SC09-1705-00 (1993).
[10] MPI: A Message-Passing Interface Standard, The Message Passing Interface Forum (1994), available online at ftp://ftp.netlib.org/mpi/mpi-report.ps.
[11] A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, V. Sunderam, PVM3 User's Guide and Reference Manual, ORNL/TM-12187, Oak Ridge (1994).
[12] High Performance Fortran Forum, High Performance Fortran: Language Specification, Scientific Programming 2 (1) (1993).
[13] C. Lawson, R. Hanson, D. Kincaid, F. Krogh, ACM Trans. on Math. Soft. 5 (1979) 308.
[14] J. Bunch, J. Dongarra, C. Moler, G.W. Stewart, LINPACK User's Guide (SIAM, Philadelphia, PA, 1979).
[15] J.J. Dongarra, J. Du Croz, S. Hammarling, R. Hanson, ACM Trans. on Math. Soft. 14 (1988) 1.
[16] J.J. Dongarra, I. Duff, J. Du Croz, S. Hammarling, ACM Trans. on Math. Soft. 16 (1990) 1.
[17] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov and D. Sorensen, LAPACK User's Guide, 2nd ed. (SIAM, Philadelphia, PA, 1995).
[18] R.C. Whaley, Basic Linear Algebra Communication Subprograms: Analysis and Implementation Across Multiple Parallel Architectures, Technical Report CS-94-234 (University of Tennessee, Knoxville, 1994), available online at ftp://ftp.netlib.org/lapack/lawns/lawn73.ps.