Toward an analytical solution to task allocation, processor assignment, and performance evaluation of network processors


J. Parallel Distrib. Comput. 65 (2005) 29 – 47 www.elsevier.com/locate/jpdc

Sameer M. Bataineh, Department of Computer Engineering, Jordan University of Science and Technology, P.O. Box 3030, Irbid 22110, Jordan. Fax: +962 27095046. E-mail address: [email protected] (S.M. Bataineh).

Received 18 January 2002; received in revised form 25 August 2004.

0743-7315/$ - see front matter © 2004 Published by Elsevier Inc. doi:10.1016/j.jpdc.2004.09.008

Abstract

Message-passing network-based multicomputer systems are emerging as a potential economical candidate to replace supercomputers. Despite enormous effort to evaluate the performance of those systems and to determine an optimum scheduling algorithm (a problem known to be NP-complete), we still lack a complete and good performance model for analyzing distributed computing systems. The model is complete if all system parameters, network parameters, communication-overhead parameters, and application parameters are considered explicitly in the solution. A good performance model, like a good scientific theory, should be able to explain all normal behavior, predict any abnormality in the system, and allow the designer to adjust some of the parameters, while abstracting unimportant details. In this paper, we develop a good and complete performance model, which predicts a minimum finish time and, equivalently, the maximum speedup. In addition, we develop a closed-form solution which forecasts the optimum share of the parallel job (task) that has to be assigned to each processor (node). Task assignment may then be undertaken in a distributed manner, which enhances the distributive nature of the system and, thus, improves system performance. Most importantly, our analytical solution presents a mechanism to select, based on system and application parameters, the optimum number of processors (nodes) that has to be assigned to a given parallel job. The model helps the designer to study the effect of each individual parameter on the overall system performance. It thereby becomes a tool for the designer of a multicomputer system to manage limited resources in an optimal manner, paying attention only to those parameters that are most critical.
© 2004 Published by Elsevier Inc.

Keywords: Communication overhead; Divisible jobs; Multicomputer systems; Network-based; Performance evaluation; Scheduling; Task assignment

1. Introduction

Over the last decade, the landscape of high-performance computing (HPC) has changed drastically due to a number of significant developments in the speed of processors and the speed of communication networks. With these developments, it has become possible to build an economical distributed system consisting of powerful workstations interconnected through a high-speed communication link to replace an expensive special-purpose supercomputer. Clusters of workstations (COWs) or network-based


multicomputers and massively parallel processors (MPPs) are the most prominent distributed-memory systems, with multiple address spaces, to replace supercomputers. Processors (nodes) in multicomputer systems interact through message passing. A good example is the ongoing project to build the Terascale Computing System, which will be the most powerful system in the world available for public research, achieving 6 Tflops peak capability. The system consists of 2728 Alpha processors with a Quadrics interconnect, 2.7 TBytes of memory and 50 TBytes of disk [37]. As estimated by Pfister [54], over 100,000 computer clusters are in use worldwide. Examples of such systems include the IBM SP2, DEC TruCluster, HP, Intel/Sandia ASCI Option Red, etc. The importance of distributed systems, in addition to their availability and scalability features, lies in their ability to handle


large computational loads. Therefore, interest in research that studies the performance of distributed-memory parallel systems is also growing, for both MPPs and COWs [e.g., [24,32,45,48,54]]. Uniform-access shared-memory (UMA) multiprocessor systems are beyond the scope of this work; performance evaluation and modeling of such systems has been discussed elsewhere [43,46,64]. Performance evaluation of multicomputer systems is of interest to computer designers and is a challenge to computer scientists and researchers. In evaluating the performance of multicomputer systems, one has to consider the effect of the system parameters (processor speeds, link speeds), the application parameters (the degree of divisibility of the parallel jobs and the level of interaction among tasks of the same job), fault tolerance, the number of processors available, the scheduling algorithm, and the processor-allocation algorithm. The conventional modeling techniques and experimental results (carried out on specific machines) normally used to evaluate multicomputer system performance have failed to integrate all those elements in a generic model. So there is a demand for a good performance evaluation model, one that is capable of scheduling a given load on p available processors such that the finish time is minimized. The model must be generic (independent of machine, link, application, etc.) while taking into consideration all the above-mentioned elements concerning the system parameters, application parameters, and the algorithms adopted for process allocation and scheduling. A good performance model, like a good scientific theory, should be able to explain all normal behavior, predict any abnormality in the system, and allow the designer to adjust some of the parameters, while abstracting unimportant details. Asymptotic analysis does not satisfy the first requirement [30]. On the other hand, conventional computer modeling techniques, which typically involve detailed simulation of individual hardware components, introduce too many details to be of practical use. Some of these modeling techniques measure the performance of the system as a whole; others measure a specific aspect of the computer system such as node utilization, I/O speed, operating-system performance, etc. For example, by processing NAS parallel benchmark results [56], Hwang found the assertion that utilization drops as the number of nodes increases to be false for three MPPs (IBM SP2, Cray T3D, and Intel Paragon). It is not possible to say definitively that this holds for other existing multicomputer distributed systems. Many parallel computing benchmark suites are in use [13,63]. A benchmarking methodology is proposed in [40] to identify key performance parameters based on measuring a performance vector. More important are the attempts to formalize an analytical model to study the performance of multicomputer distributed systems. There is a variety of mathematical formulations of "selection theory", which chooses an appropriate set of machines for a set of subtasks [e.g. [15,31,49,60,62]]. This work has common ground with graph-based algorithms for solving matching-related problems [14,38,50] in that

both try to reduce the communication time by "clever" algorithms. None of this work is concerned with the effect of a specific element on the system performance, say the exact effect of the link speed on the finish time. Latency is used in most of that work to refer to the communication delay, which embeds within it the start-up time, the transmission delay (link speed), and the network delay. Some impose constraints to simplify the analysis, but in doing so they hide some of the system characteristics. For example, although the goal of the work in [60] is to reduce the communication time, the time for determining matching, scheduling, and data relocation is neglected. Scheduling and data relocation in a shared-bus network model have an adverse effect on the communication time (and, hence, the execution time) [7]. The reliable and practical results for evaluating the performance of network multicomputers based on the Nectar system might not be applicable to other machines [58]. Other allocation strategies do not consider the load characteristics in the measurement of performance [e.g., [4,9,25]]; consequently, it is not feasible to study the effect of the load parameters on the system performance. Although the workload characteristics are well thought out in [29], having them in two general classes, communication-intensive and computation-intensive, does not help to explicitly study the effect of load parameters, such as the number of tasks of a given program (parallel job), on the system performance. We acknowledge that such an effect is "embedded" in the probability definition of that model. What is common among all the work on multicomputer distributed systems is that the load is divisible. For example, in [57], the particular type of parallel load used is called a fork-join job, which consists of a set of independent sequential tasks. Several previous studies have considered this kind of load under a variety of assumptions [5,51,52,57,61,65]. The application programs in [60] are a collection of unrelated programs, yet they still have to communicate to synchronize their access to shared resources; so, they can be thought of as independent tasks of a very large job. The generic class for such a load is the paradigm of divisible load. A divisible load is one that can be arbitrarily partitioned among the processors in the system. Therefore, we believe that the divisible load theory (DLT), of recent origin [8,16], has the potential to formulate an analytical model for multicomputer systems taking into consideration all the parameters of parallel systems. DLT allows a close evaluation of the integration of computation and communication in network computing. In fact, it has developed a new "calculus" for the scheduling problem. There has been a large volume of work analyzing parallel systems by exploiting DLT [11,27]. However, almost all of this work suffers from two drawbacks. First, it assumes that parallel jobs have a generic arbitrary property without giving a realistic example of such a load; hence, it repeats the same problem of other works, where one cannot study the effect of the load parameters on the system performance. Consequently, the system performance is found to be independent of the load. Second, although the DLT


uses single-port systems with a linear network, except for a few articles it does not include the start-up time as part of the communication delay. Start-up time is a major concern in multicomputer systems [2,12,23,30,44,45,47]. Ghose alleviates the first problem by assuming the load to be a large-sized matrix–vector product [34]. Ignoring the start-up time, however, he has no bound on the number of processors to be used to execute the load. This contradicts earlier findings, which state that the benefits of parallel processing can be completely negated by task synchronization (variable start-up time, as will be explained) [57]. These findings allow researchers to draw conclusions concerning appropriate processor-allocation decisions under a specific load. Processor allocation was never a concern in DLT because it is always assumed that an infinite number of processors is available. This explains the leveling off of the performance curves in the work exploiting DLT. This paper deals with a paradigm of divisible load where each job is divisible into n independent tasks that can be executed independently. Although the tasks of the same job do not communicate with each other during their execution, they have to deliver their partial results sequentially to the central processor in the system. The job is considered to be completed only after all of its tasks have finished execution. The paradigm of divisible load has typical applications involving the processing of very large data files used in signal-processing applications such as the computation of Hough transforms [18], image-processing applications like feature extraction and edge detection [33,42], and Kalman filtering. Among many other applications, the above applications lend themselves easily to matrix-manipulation-like problems. Matrix manipulation encompasses a wide range of applications such as control theory and dynamic-system simulation. Most physical systems can be approximated as linear time-invariant systems (the number of states is finite). The mathematical model of such systems is represented as a first-order matrix differential equation; those matrices reflect the dynamics of the systems, and manipulation of those matrices is needed to obtain the states of the systems. The major contribution of this paper is to develop a closed-form solution for the finish time and the speedup of a multicomputer system where the effect of the application parameters, represented by the matrix dimensions, is expressed explicitly in the equations. Moreover, the communication-latency factors are no longer embedded in the bandwidth available between the communicating machines. The start-up time, the transmission time, and the link speed are all expressed explicitly. Hence, the effect of each of these elements on the system performance can be studied exactly and independently. The closed-form equations also include coefficients of the processing elements; more specifically, the time it takes a node (workstation) to execute a floating-point operation is included as a variable in the equations. Hence, we free our model from adhering to a specific machine, computer architecture, or a certain interconnection. Moreover, our model is independent of the network topology. We also develop a


mathematical model for optimizing process allocation. The process-allocation tradeoff is very important for optimizing multicomputer system performance: if the tasks of a job are distributed to all processors in the system, the computing power is maximized, but so is the communication penalty. So it is important to distribute each job to a "set" of processors such that the finish time is minimized. This set is called the optimum set; the finish time will not be minimum if we add a processor to, or remove one from, the optimum set. This important finding enables us to design more "intelligent" schedulers and allows sequential parallelism in DLT. The rest of the paper is organized as follows. We start with system modeling, where we define exactly the applications to be used. In Section 3, the system equations are derived, and the analysis of the system equations is presented in Section 4. Finally, Section 5 presents some concluding remarks.

2. System modeling

Consider a network of workstations which consists of K homogeneous communicating front-end processors (nodes; node and processor will be used interchangeably) connected through a high-performance LAN. Front-end processors are capable of communicating and computing at the same time. Out of the K nodes there is a "central processor set" consisting of n central nodes distributed evenly over the network. The other N = K − n nodes are considered helper processors to the central processors, and each is given a unique number from 1 to N; usually N ≫ n. Parallel computation-intensive jobs arrive at the central processors in the system and are buffered in a first-come first-served (FCFS) parallel-jobs queue before they get service. A helper processor can be idle, executing a local task, or executing a task delivered by a central processor (an external task). External tasks have higher priority than local ones and they run atomically, without interruption, until they finish. Therefore, the central processors view the helper processors as either available or busy: a helper processor is available if it is idle or executing a local task, and busy if it is executing an external task. Accordingly, there are always two sets of helper processors in the system, the "available set" and the "busy set". Let v and s denote the number of processors in the available and busy sets, respectively; N = s + v. A central processor, in addition to executing its local tasks, performs three functions when there is a parallel job in its parallel-jobs buffer queue. Firstly, it stops running local jobs and determines p, the optimum number of helper processors needed to cooperate in executing the parallel job. Secondly, it distributes to the p helper processors their assigned tasks, while computing its own task (its share of the job). Finally, it collects the final results of the computations. So the starting task (distribution of the load) and the ending task (agglomeration of the results) are executed by the central processor.


Fig. 1. Linear cluster network of workstations.

Once the central processor selects p processors from the available set to execute a parallel job, it broadcasts a message informing all central processors to update their busy and available sets by adding the p selected processors to the busy set and removing them from the available set. An inverse message is broadcast when the same job is completed. It is assumed that there are always available processors to execute parallel jobs, since they have a higher priority. Therefore, at this stage, we do not consider a queuing delay either in the central processors or in the helper processors. In the literature, one finds research on queuing delay which can be incorporated in our model [54]; however, this is beyond the scope of this paper. This configuration has several advantages. Firstly, the system is fault-tolerant and does not suffer from a single-point-of-failure problem, with one exception: if a node fails, we lose the jobs for which the failed node acts as a central processor. Secondly, all cluster nodes cooperating to execute a job work together collectively to present a single, integrated computing resource, in addition to filling the conventional role of each node being used individually by interactive users. Thirdly, the central processor set can be updated dynamically and easily through software without affecting the operation of the whole system, because all processors (nodes) are identical. Finally, front-end processors can communicate and compute at the same time; therefore, broadcasting control messages to update the available set, busy set, and central processor set, or any other type of broadcast message, does not degrade the system performance.
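The set bookkeeping described above can be sketched in a few lines. This is only an illustration of the protocol, not the paper's implementation; the class and method names are ours.

```python
# Minimal sketch of the available/busy bookkeeping performed by the central processors.
# Names (HelperPool, allocate, release) are illustrative, not from the paper.

class HelperPool:
    def __init__(self, n_helpers):
        # Helper processors are numbered 1..N; all start in the available set.
        self.available = set(range(1, n_helpers + 1))
        self.busy = set()

    def allocate(self, p):
        """Select p helpers for a parallel job and return the "list" to broadcast."""
        if p > len(self.available):
            raise RuntimeError("not enough available helper processors")
        selected = sorted(self.available)[:p]
        self.available -= set(selected)
        self.busy |= set(selected)
        return selected

    def release(self, selected):
        """Inverse update, broadcast when the job completes."""
        self.busy -= set(selected)
        self.available |= set(selected)

pool = HelperPool(n_helpers=20)
job_list = pool.allocate(p=5)
print("list broadcast for the job:", job_list)
pool.release(job_list)
```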

A one-level send–receive task graph, encountered when a job can be partitioned into a disjoint union of parallel tasks, resembles our system exactly. As shown in Fig. 1, the central processor partitions a job into disjoint parallel tasks and delivers to each helper processor its task while keeping its own task. A helper processor works on its task to completion, without interruption, and then reports the result back to the central processor. Helper processors report their results in a pipelined fashion. A job is considered complete only after all its tasks are completed and their results have been sent back to the central processor. So the finish time of a job is the time when the central processor finishes; we refer to this time as the makespan of the central processor. All previous work on similar problems focused on deriving algorithms to assign a set of disjoint tasks to a set of processors such that the finish time is minimum. An excellent reference in this regard is the survey paper [41], which contains detailed descriptions and classifications of various scheduling strategies. Finding an algorithm that guarantees an optimal task assignment for such systems turns out to be NP-hard [35]. Lately, an optimal assignment algorithm has been achieved only for scheduling three tasks or fewer [43]. Our model is different in that we develop an analytical solution that gives the minimum finish time (smallest makespan); it is not an algorithm that searches for the best realization, which is NP-hard. Most importantly, our model not only gives the optimum amount of a given job (task) that has to be assigned to each computing processor, but also determines the optimum number of processors that have to participate in executing a job. Of course, this number is dynamic and is based on the application and system parameters.


An early work on a similar problem can be found in [59]. A good review of simplified theoretical models and algorithms may be found in [1,17,21,32,53]. The tasks in our model do not have precedence constraints; algorithms that deal with precedence constraints can be found elsewhere [17,19,20,22]. The major differences between our model and previous models can be outlined as follows:
• We developed a closed-form solution for the optimum finish time.
• We developed a closed-form solution to calculate the optimum amounts of the parallel job (tasks) that have to be assigned to each processor in the system.
• We developed an analytical solution to find the optimum number of processors that have to participate in executing a job.
• Task assignment can be calculated in a distributed manner, rather than being handled by a central scheduler; i.e., each processor can determine the share of the parallel job (task) that it has to process.
• The system parameters, application parameters, and communication parameters are all considered at the same time in a closed-form equation.
• We consider a realistic example, which encompasses most, if not all, applications of disjoint tasks (the divisible-job paradigm).
• The solution guarantees optimality regardless of the number of tasks.
• The model can easily be modified to accommodate various distributed systems.
Despite all of that, we do not claim that the model is complete, in the sense that it presents a complete solution to the general distributed-systems problem. However, we believe it is an excellent start. There are some limitations and constraints to be alleviated before a complete solution can be claimed. First of all, the jobs are considered arbitrarily divisible; consequently, precedence relations and synchronization problems are not considered. A queuing delay is considered neither in the central processors nor in the helper processors. The system is assumed to be homogeneous; that is, all processors are assumed to have the same computing power.

2.1. The application problem

A good example of divisible jobs which, when executed, resemble a directed acyclic one-level send–receive graph is matrix-manipulation operations. Many computer algorithms have been developed to perform matrix multiplication, matrix transposition, matrix inversion, Boolean matrix operations, the fast Fourier transform (FFT), summation of vector elements, parallel sorting, linear recurrences, and one- or two-dimensional finite-difference problems. Such operations are of major interest in many fields. For example, pairwise interaction is very important in molecular dynamics, where it is required to find the total force vector acting on each atom. The design of VLSI components is a computationally demanding process.


VLSI is a process used to build electronic components such as memory chips, microcontrollers, and microprocessors comprising thousands of millions of transistors. The most computationally demanding step in the design is the floorplan optimization process, which requires distributing a set of indivisible rectangular blocks called cells, with their interconnection information, in order to determine the minimum relative placement of these cells. Finally, the shortest-path calculation is another example that demands a mechanism to optimize matrix-like operations. As a matter of fact, the original motivation behind developing array processors was to perform parallel computation on vector or matrix types of data. In addition, benchmarks use matrices to characterize their problems; for example, a LINPACK benchmark run showed that the 7264-processor Intel ASCI Option Red achieved a sustained speed of 1.068 Tflop/s in solving a problem characterized by a 215,000 × 215,000 matrix [26]. As a particular example, in this paper we will consider matrix multiplication, which is heavily used in the process of solving linear systems of equations. Let A = [a_il] be an m × k matrix and B = [b_lj] a k × n matrix; then C = A × B, where

$$c_{ij} = \sum_{l=1}^{k} a_{il} \times b_{lj} \quad \text{for } 1 \le i \le m \text{ and } 1 \le j \le n. \tag{2.1}$$

There are mn(2k − 1) cumulative operations to be performed in Eq. (2.1): mnk multiplications and mn(k − 1) additions. The objective of parallel execution is to distribute arrays A and B over p processors so that array C is obtained in the minimum finish time. The finish time is the sum of the computation time (multiplication and addition) and the communication time. The communication time includes the start-up time, the distribution time, and the agglomeration time. The start-up time, as will be demonstrated, is very important: it has an adverse effect on the system performance and it imposes a limit on the number of processors to be used in parallel systems. Such a limit is very important for optimizing the overall system performance (the phrases "optimum system performance", "minimum finish time", and "maximum speedup" will be used interchangeably to imply the same thing). To highlight the importance of the start-up time for the overall system performance, Table 1 gives some representative data for the start-up time (ts) and the transfer time per four-byte word (tw) for 8 standard machines and networks. The numbers in the table state clearly that the start-up time cannot be neglected when analyzing the performance of distributed systems, especially network-of-workstations systems. The start-up time consists of the send overhead, which is the time the sender takes to inject a word into the network, the receive overhead, which is the time the receiver takes to get a word from the network, and the network or system latency. Martin et al. [45] found that applications are most sensitive to overhead, while network latency and bandwidth are not of major concern.


Table 1. Approximate machine parameters for some parallel computers, in microseconds.

Machine/network              ts      tw
IBM SP2                      40      0.11
Intel DELTA                  77      0.54
Intel Paragon                121     0.07
Meiko CS-2                   87      0.08
nCUBE-2                      154     2.4
Thinking Machines CM-5       82      0.44
Workstations on Ethernet     1500    5.0
Workstations on FDDI         1150    1.1

Thus, increasing the bandwidth is not the way to solve the communication-overhead problem. That is why we disagree with all analyses that embed all the communication-system parameters in the available bandwidth [10]. Such an analysis does not identify which of these parameters affects the application performance most significantly. This is an important concern because the designer of the communication subsystem wants to devote the limited resources to improving the most critical parameters.

3. System equations

Let us start by defining some useful parameters:

z: a constant that is inversely proportional to the speed of the network. z = 1 if the network is dedicated to transferring the tasks of a given parallel job; z > 1 when a link is shared, so that a node will not always find the network available. z becomes larger under heavy traffic or when there is a large number of active nodes in the system.
Tcm: the time it takes the central processor to transmit one unit of data over a dedicated link (z_dedicated = 1).
tw: the time needed to transmit one unit of data over a non-dedicated network link; tw = z·Tcm.
w: a constant that is inversely proportional to the computational speed of a node (processor). w = 1 if the node is dedicated to executing a given parallel job. If the node is connected to a network, w will be larger because the node devotes part of its computing power to handling other tasks (local or external) and other control signals.
Tcp^m: the time taken by a dedicated node (w_dedicated = 1) to perform one floating-point multiplication.
Tcp^a: the time taken by a dedicated node (w_dedicated = 1) to perform one floating-point addition.
Tcp: the time it takes a dedicated node (w_dedicated = 1) to compute the inner product of two k-dimensional vectors; Tcp = k·Tcp^m + (k − 1)·Tcp^a.
ts: the start-up time, i.e., the time to communicate a 0-byte or short (e.g., one-word) message over the network. This time is required before transmitting any data between any two nodes in the network.
Tf1: the time taken by a single processor to process the whole job.

Fig. 2. Row decomposition for matrix–vector product problem.

σ: the ratio of the time taken to transfer one word of data over the network to the time taken by a network processor to compute the inner product of two k-dimensional vectors:

$$\sigma = \frac{z\,T_{cm}}{w\,T_{cp}} = \frac{t_w}{w\,T_{cp}}.$$
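As a concrete illustration of these definitions, σ can be computed from the hardware figures used later in Section 4 (Weitek 3364 nodes on an Ethernet network of workstations). The snippet below is only a sketch, and the variable names are ours.

```python
# Sketch: computing T_cp and sigma = t_w / (w * T_cp) for Weitek 3364 nodes
# (Table 2) on workstations connected by Ethernet (Table 1).

k = 400                                   # inner dimension of the product
cycle_ns = 50                             # Weitek 3364 clock cycle time (ns)
cyc_add, cyc_mul = 2, 2                   # cycles per addition / multiplication
t_mul = cycle_ns * cyc_mul / 1000.0       # microseconds per multiplication
t_add = cycle_ns * cyc_add / 1000.0       # microseconds per addition

T_cp = k * t_mul + (k - 1) * t_add        # inner product of two k-vectors (us)
t_w = 5.0                                 # us per word on Ethernet (Table 1)
w = 1.0                                   # dedicated node
sigma = t_w / (w * T_cp)

print(f"T_cp = {T_cp:.1f} us  (approximately 0.2*k for large k)")
print(f"sigma = {sigma:.4f}")
```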

We consider two problems in our analysis: the matrix–vector product problem and the matrix–matrix product problem. Since each is analyzed under two conditions, there are four variations of the system. We refer to those variations as systems 1, 2, 3 and 4.

3.1. The matrix–vector product problem

A matrix–vector product problem is derived from Eq. (2.1) by setting n = 1. It follows that A = [a_ij] and B = b; then A × b = c, where

$$c_{i} = \sum_{j=1}^{k} a_{ij} \times b_{j} \quad \text{for } 1 \le i \le m. \tag{3.1}$$

Each element of the vector c requires k floating-point multiplications and k − 1 floating-point additions. In our analysis we will adopt the row decomposition shown in Fig. 2. It is found that row decomposition performs better than column decomposition, and in very few cases they perform the same [6]. Therefore, each processor will be assigned a number of the rows of matrix A. If processor i is assigned m_i rows of A, it will be in charge of computing m_i points of vector c and, thus, executing m_i(2k − 1) floating-point operations. The communication cost between two nodes depends on the message length, the start-up time, and the distance between the two nodes, and is given by the linear communication model [28,55]. The linear communication model is a good approximation to most currently available message-passing parallel systems [28]. A good example where this model is used for a particular matrix operation, Gaussian elimination, is found in [38]. In our analysis, we consider the distance between any two nodes to be one.

3.1.1. System 1: no communication of matrix and vector

In system 1, both matrix A and vector b are assumed to be already available at the processors that are participating in executing the product; thus, only the computed results need to be communicated back to the central processor.


Fig. 3. One-level send–receive graph for system 1.

A one-level send–receive graph reflecting this case is shown in Fig. 3. This case is justified by the fact that we have front-end processors in the system, so the tasks of the next job can be communicated while the tasks of the current job are being executed. The system-timing diagram of Fig. 3 is depicted in Fig. 4, from which we derive the following equations:

$$t_0 = m_0 w T_{cp},$$
$$t_1 = m_1 w T_{cp} + \sigma m_1 w T_{cp} + t_s,$$
$$t_2 = m_2 w T_{cp} + \sigma (m_1 + m_2) w T_{cp} + 2 t_s,$$
$$t_3 = m_3 w T_{cp} + \sigma (m_1 + m_2 + m_3) w T_{cp} + 3 t_s,$$
$$\vdots$$
$$t_i = m_i w T_{cp} + \sigma \sum_{j=1}^{i} m_j w T_{cp} + i\,t_s, \quad i = 1, 2, \ldots, p, \tag{3.2}$$

$$m_0 + m_1 + m_2 + m_3 + \cdots + m_p = m = \sum_{i=0}^{p} m_i, \tag{3.3}$$

$$t_0 = t_1 = t_2 = \cdots = t_p = T_{f(p+1)}. \tag{3.4}$$

T_f(p+1) is the optimum finish time, or equivalently the minimum finish time. It is minimum because the processors communicate with the central processor one after another, in a pipelined fashion, to deliver their partial results with no idle time between any two successive processors. The central processor is also occupied during this whole period executing its share of the job. In other words, processor i starts the communication process with the central processor immediately after processor i + 1 finishes delivering its partial results to the central processor. Obviously, no other distribution can lead to a better finish time. Using Eq. (3.4), one can obtain

$$m_i w T_{cp} = m_{i+1} w T_{cp} (1 + \sigma) + t_s, \quad i = 0, 1, 2, \ldots, p-1, \tag{3.5}$$

from which we get

$$m_i = (1+\sigma)^{p-i} m_p + \frac{t_s}{w T_{cp}} \sum_{j=i+1}^{p} (1+\sigma)^{p-j}, \quad i = 0, 1, 2, \ldots, p-1. \tag{3.6}$$

m_i and m_p are the optimum fractions of the parallel job to be assigned by the central processor to processors i and p, respectively, i = 0, 1, 2, …, p − 1. They are optimum because they lead to the minimum finish time. So, if we manage to find a closed-form solution for m_p, then we can find the exact fraction of the parallel job that has to be assigned to each processor such that the finish time is minimum. Using Eqs. (3.3) and (3.6), we obtain

$$m = m_p + m_p \sum_{i=0}^{p-1} (1+\sigma)^{p-i} + \frac{t_s}{w T_{cp}} \sum_{i=0}^{p-1} \sum_{j=i+1}^{p} (1+\sigma)^{p-j}.$$

With some algebraic manipulation of the above equation, we obtain a closed-form solution for m_p as follows:

$$m_p = \frac{\sigma^2 m - \frac{t_s}{w T_{cp}}\left((1+\sigma)^{p+1} - (1+\sigma)(p+1) + p\right)}{\sigma\left((1+\sigma)^{p+1} - 1\right)}. \tag{3.7}$$

Note that we assume that matrix A is large enough that the integer approximation of m_i (i = 0, 1, …, p) will still satisfy Eq. (3.3) and cause insignificant variation in the processing time of the individual processors.


Fig. 4. System 1 timing diagram.

Since m_p > 0, it follows that

$$t_s < \frac{t_w\,\sigma\,m}{(1+\sigma)^{p+1} - ((1+p)\sigma + 1)}. \tag{3.8}$$

Using Eq. (3.6),

$$m_0 = (1+\sigma)^{p} m_p + \frac{t_s}{w T_{cp}} \sum_{j=1}^{p} (1+\sigma)^{p-j} = (1+\sigma)^{p} m_p + \frac{t_s}{w T_{cp}}\,\frac{(1+\sigma)^{p} - 1}{\sigma}.$$

Substituting the value of m_p into the above equation, closed-form solutions for m_0 and the finish time, T_f(p+1), are found as follows:

$$m_0 = \frac{(1+\sigma)^{p} \sigma^{2} m + \frac{t_s}{w T_{cp}}\left((1+\sigma)^{p}(p\sigma - 1) + 1\right)}{\sigma\left((1+\sigma)^{p+1} - 1\right)}, \tag{3.9}$$

$$T_{f(p+1)} = t_0 = m_0 w T_{cp},$$

$$T_{f(p+1)} = \frac{(1+\sigma)^{p} \sigma^{2} m\,w T_{cp} + t_s\left((1+\sigma)^{p}(p\sigma - 1) + 1\right)}{\sigma\left((1+\sigma)^{p+1} - 1\right)}, \tag{3.10}$$

$$T_{f1} = m\,w T_{cp}.$$

Speedup is defined as the ratio of the time taken by a single processor to process the load, T_f1, to the time taken by a (p + 1)-processor network; it is denoted by S(1, p + 1) and is given by

$$S(1, p+1) = \frac{T_{f1}}{T_{f(p+1)}} = \frac{m}{m_0}. \tag{3.11}$$
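The closed forms (3.6)-(3.11) can be transcribed directly into a short routine that returns the optimal load fractions, the finish time, and the speedup of system 1. The sketch below is our own illustrative transcription; the function name and the example numbers are not from the paper.

```python
# Sketch of the system-1 closed forms (Eqs. (3.6)-(3.11)).
# m rows, p helper processors, sigma = t_w/(w*T_cp), start-up time t_s,
# and wTcp = w*T_cp (time to compute one inner product). Names are ours.

def system1_schedule(m, p, sigma, t_s, wTcp):
    a = 1.0 + sigma
    # Eq. (3.7): optimal share of the last helper processor.
    num = sigma**2 * m - (t_s / wTcp) * (a**(p + 1) - a * (p + 1) + p)
    m_p = num / (sigma * (a**(p + 1) - 1.0))
    # Eq. (3.6): shares of processors 0..p-1, then processor p.
    shares = [a**(p - i) * m_p
              + (t_s / wTcp) * sum(a**(p - j) for j in range(i + 1, p + 1))
              for i in range(p)] + [m_p]
    T_f = shares[0] * wTcp          # Eq. (3.10): finish time = m_0 * w * T_cp
    speedup = (m * wTcp) / T_f      # Eq. (3.11)
    return shares, T_f, speedup

shares, T_f, S = system1_schedule(m=400, p=7, sigma=0.0625, t_s=1000.0, wTcp=80.0)
print(f"sum of shares = {sum(shares):.3f} (should equal m)")
print(f"finish time = {T_f:.1f} us, speedup = {S:.2f}")
```

With these example values the shares sum back to m and the finish time agrees with the system-1 column of Table 4.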

3.1.2. System 2: broadcasting of matrix and vector

In this case, we release the constraint imposed on system 1, where we assumed that each node already has its task of a parallel job. So, in this case, given a parallel job, the central processor first determines the number of helper processors (p) that gives the minimum finish time and selects a "list" of p helper processors to cooperate in executing the job. Recall that each of the helper processors in the "list" has a unique number from 1 to N; the numbers of the processors in the "list" do not have to form a consecutive sequence.


Fig. 5. One-level send–receive graph for system 2.

Secondly, the central processor broadcasts matrix A, vector b and the "list". The time to broadcast the matrix, the vector and the list over the network is T_b = t_b + t_s, where t_b = [k(m + 1) + 1] t_w is the transmission time. Note that the time to transmit the list is equivalent to transmitting one word, which is t_w. The one-level send–receive task graph of this case is depicted in Fig. 5 and the timing diagram is shown in Fig. 6. From the timing diagram it is easy to obtain the system equations:

$$t_0 = m_0 w T_{cp}, \qquad t_i = m_i w T_{cp} + \sigma w T_{cp} \sum_{j=1}^{i} m_j + t_b + (i+1) t_s, \quad i = 1, 2, \ldots, p. \tag{3.12}$$

Using the same methods as before we get

$$m_i = (1+\sigma)^{p-i} m_p + \frac{t_s}{w T_{cp}} \sum_{j=i+1}^{p} (1+\sigma)^{p-j} = (1+\sigma)^{p-i} m_p + \frac{t_s}{w T_{cp}}\,\frac{(1+\sigma)^{p-i} - 1}{\sigma}, \quad i = 1, 2, \ldots, p-1,$$
$$m_0 = m_1 (1+\sigma) + \frac{2 t_s + t_b}{w T_{cp}}. \tag{3.13}$$

Using Eq. (3.3),

$$m = m_0 + m_p + m_p \sum_{i=1}^{p-1} (1+\sigma)^{p-i} + \frac{t_s}{w T_{cp}} \sum_{i=1}^{p-1} \sum_{j=i+1}^{p} (1+\sigma)^{p-j}.$$

With some algebraic manipulation of the above equation, we obtain a closed-form solution for m_p:

$$m_p = \frac{\left(m - \frac{t_b}{w T_{cp}}\right)\sigma^{2} - \frac{t_s}{w T_{cp}}\left[(1+\sigma)^{p+1} + \sigma^{2} - ((p+1)\sigma + 1)\right]}{\sigma\left((1+\sigma)^{p+1} - 1\right)}. \tag{3.14}$$

Processors that are involved in executing a parallel job will identify themselves based on the “list”, which is delivered to them by the central processor. The central processor identifies itself as processor 0, the processor with the smallest number in the list identifies itself as processor 1, the second smallest number in the “list” identifies itself as processor 2 and so on and so forth. Using Eqs. (3.13) and (3.14), all processors can determine their share of the job in parallel. The time to execute Eq. (3.13) is relatively insignificant and can be neglected.
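A minimal sketch of this distributed computation is given below: every processor evaluates Eq. (3.14) from the globally known parameters and then Eq. (3.13) for its own position in the list. The function name and the example values of t_s and t_b are ours and purely illustrative.

```python
# Sketch: each processor determines its own share for system 2 (Eqs. (3.13)-(3.14)).
# rank 0 is the central processor; ranks 1..p follow the order of the broadcast "list".

def my_share(rank, m, p, sigma, t_s, t_b, wTcp):
    a = 1.0 + sigma
    # Eq. (3.14): share of processor p, computable by everyone.
    num = (m - t_b / wTcp) * sigma**2 \
          - (t_s / wTcp) * (a**(p + 1) + sigma**2 - ((p + 1) * sigma + 1))
    m_p = num / (sigma * (a**(p + 1) - 1.0))

    def helper_share(i):
        # Eq. (3.13), first line: processors i = 1 .. p-1.
        return a**(p - i) * m_p + (t_s / wTcp) * (a**(p - i) - 1.0) / sigma

    if rank == p:
        return m_p
    if rank >= 1:
        return helper_share(rank)
    # Eq. (3.13), second line: the central processor (rank 0).
    return helper_share(1) * a + (2.0 * t_s + t_b) / wTcp

p = 5
# t_s and t_b below are arbitrary small values chosen only for illustration.
shares = [my_share(r, m=400, p=p, sigma=0.0625, t_s=200.0, t_b=100.0, wTcp=80.0)
          for r in range(p + 1)]
print(f"shares sum to {sum(shares):.3f} (should equal m = 400)")
```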


Fig. 6. System 2 timing diagram.

Since m_p > 0, it follows that

$$t_s < \frac{t_w\,\sigma\left(m - \frac{t_b}{w T_{cp}}\right)}{(1+\sigma)^{p+1} + \sigma^{2} - ((1+p)\sigma + 1)}. \tag{3.15}$$

Using Eq. (3.13),

$$m_0 = (1+\sigma)^{p} m_p + \frac{t_s}{w T_{cp}}\left(\frac{(1+\sigma)^{p} + \sigma - 1}{\sigma}\right) + \frac{t_b}{w T_{cp}}.$$

Substituting the value of m_p from Eq. (3.14) into the above equation, closed-form solutions for m_0 and the finish time, T_f(p+1), are found as follows:

$$m_0 = \frac{(1+\sigma)^{p} \sigma^{2}\left(m - \frac{t_b}{w T_{cp}}\right) + \frac{t_s}{w T_{cp}}\left((1+\sigma)^{p}(p\sigma + \sigma - 1) - \sigma + 1\right)}{\sigma\left((1+\sigma)^{p+1} - 1\right)} + \frac{t_b}{w T_{cp}}, \tag{3.16}$$

$$T_{f(p+1)} = t_0 = m_0 w T_{cp},$$

$$T_{f(p+1)} = \frac{(1+\sigma)^{p} \sigma^{2}\left(m\,w T_{cp} - t_b\right) + t_s\left((1+\sigma)^{p}(p\sigma + \sigma - 1) - \sigma + 1\right)}{\sigma\left((1+\sigma)^{p+1} - 1\right)} + t_b, \tag{3.17}$$

$$T_{f1} = m\,w T_{cp}, \qquad S(1, p+1) = \frac{T_{f1}}{T_{f(p+1)}} = \frac{m}{m_0}.$$

3.2. The matrix–matrix product problem

Each element of matrix C requires k floating-point multiplications and k − 1 floating-point additions. In our analysis we will adopt row decomposition; it is found that row decomposition performs better than column decomposition [6]. Therefore, each processor will be assigned a number of the rows of matrix A. If processor i is assigned m_i rows of A, it will be in charge of computing nm_i points of matrix C and, thus, executing nm_i(2k − 1) floating-point operations. Next, we analyze the system under two conditions: first, no communication of matrices A and B is required; second, both matrices A and B are communicated over the network (Fig. 7).

Fig. 7. Row decomposition for matrix–matrix product problem.

3.2.1. System 3: no communication of matrix A and B

The timing diagram for this case will be the same as for system 1, with one difference: the size of the task assigned to processor i becomes nm_i instead of m_i.


The system equations will be as follows:

$$t_0 = n m_0 w T_{cp}, \qquad t_i = m_i w T_{cp} + \sigma \sum_{j=1}^{i} m_j w T_{cp} + \frac{i\,t_s}{n}, \quad i = 1, 2, \ldots, p. \tag{3.18}$$

From these we get

$$m_i = (1+\sigma)^{p-i} m_p + \frac{t_s}{n w T_{cp}} \sum_{j=i+1}^{p} (1+\sigma)^{p-j}, \quad i = 0, 1, 2, \ldots, p-1. \tag{3.19}$$

Using the same technique as before, we obtain

$$m_p = \frac{\sigma^{2} m - \frac{t_s}{n w T_{cp}}\left[(1+\sigma)^{p+1} - ((p+1)\sigma + 1)\right]}{\sigma\left((1+\sigma)^{p+1} - 1\right)}. \tag{3.20}$$

Since m_p > 0, it follows that

$$t_s < \frac{n\,t_w\,\sigma\,m}{(1+\sigma)^{p+1} - ((1+p)\sigma + 1)}, \tag{3.21}$$

$$T_{f(p+1)} = t_0 = n m_0 w T_{cp},$$

$$T_{f(p+1)} = \frac{n (1+\sigma)^{p} \sigma^{2} m\,w T_{cp} + t_s\left((1+\sigma)^{p}(p\sigma - 1) + 1\right)}{\sigma\left((1+\sigma)^{p+1} - 1\right)}, \tag{3.22}$$

$$T_{f1} = n m\,w T_{cp}, \qquad S(1, p+1) = \frac{T_{f1}}{T_{f(p+1)}} = \frac{m}{m_0}.$$

3.2.2. System 4: broadcasting of matrix A and B

This case is similar to the analysis of system 2 except that t_b becomes larger (t_b = k(m + n) t_w) and the tasks are again multiplied by n; that is, the size of the task assigned to processor i becomes nm_i instead of m_i. The system equations will be as follows:

$$t_0 = n m_0 w T_{cp}, \qquad t_i = m_i w T_{cp} + \sigma w T_{cp} \sum_{j=1}^{i} m_j + t_b + \frac{(i+1) t_s}{n}, \quad i = 1, 2, \ldots, p. \tag{3.23}$$

Using the same methods as before we get

$$m_i = (1+\sigma)^{p-i} m_p + \frac{t_s}{n w T_{cp}} \sum_{j=i+1}^{p} (1+\sigma)^{p-j}, \quad i = 1, 2, \ldots, p-1,$$
$$m_0 = m_1 (1+\sigma) + \frac{2 t_s + t_b}{n w T_{cp}}. \tag{3.24}$$

Using Eq. (3.3),

$$m = m_0 + m_p + m_p \sum_{i=1}^{p-1} (1+\sigma)^{p-i} + \frac{t_s}{n w T_{cp}} \sum_{i=1}^{p-1} \sum_{j=i+1}^{p} (1+\sigma)^{p-j}.$$

With some algebraic manipulation of the above equation, we obtain a closed-form solution for m_p:

$$m_p = \frac{\left(m - \frac{t_b}{n w T_{cp}}\right)\sigma^{2} - \frac{t_s}{n w T_{cp}}\left[(1+\sigma)^{p+1} + \sigma^{2} - ((p+1)\sigma + 1)\right]}{\sigma\left((1+\sigma)^{p+1} - 1\right)}. \tag{3.25}$$

Since m_p > 0, it follows that

$$t_s < \frac{n\,t_w\,\sigma\left(m - \frac{t_b}{n w T_{cp}}\right)}{(1+\sigma)^{p+1} + \sigma^{2} - ((1+p)\sigma + 1)}. \tag{3.26}$$

Using Eq. (3.24),

$$m_0 = (1+\sigma)^{p} m_p + \frac{t_s}{n w T_{cp}}\left(\frac{(1+\sigma)^{p} + \sigma - 1}{\sigma}\right) + \frac{t_b}{n w T_{cp}},$$

$$T_{f(p+1)} = t_0 = n m_0 w T_{cp}, \qquad T_{f1} = n m\,w T_{cp},$$

$$T_{f(p+1)} = \frac{(1+\sigma)^{p} \sigma^{2}\left(n m\,w T_{cp} - t_b\right) + t_s\left((1+\sigma)^{p}(p\sigma + \sigma - 1) - \sigma + 1\right)}{\sigma\left((1+\sigma)^{p+1} - 1\right)} + t_b, \tag{3.27}$$

$$S(1, p+1) = \frac{T_{f1}}{T_{f(p+1)}} = \frac{m}{m_0}.$$
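As a quick illustration of the feasibility condition just derived, the check below evaluates the positivity requirement of Eq. (3.26) (equivalently, n/k > σ for large m) with the parameter values used later in Section 4. It is a sketch only, with our own variable names.

```python
# Sketch: is load sharing feasible for system 4 (broadcast of A and B)?
k = m = 400
n = 100
t_w = 5.0                 # us per word (Ethernet, Table 1)
wTcp = 0.2 * k            # approximate T_cp for Weitek 3364 nodes (us)
sigma = t_w / wTcp
t_b = k * (m + n) * t_w   # broadcast transmission time for system 4 (us)

print("m =", m, "must exceed t_b/(n*w*T_cp) =", t_b / (n * wTcp))
print("n/k =", n / k, "must exceed sigma =", sigma)
```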

4. System analysis

In this section, we use the analytical results developed in the previous section to study how the system parameters and application parameters affect the system performance. We will show how those results can be used to select the optimum number of available nodes (processors) in message-passing parallel systems to achieve the best performance. We will show that this number is different for different applications running on the same system, and also different for the same application running on different systems.


Table 2. Times for floating-point operations.

                          MIPS R3010    Weitek 3364    TI 8847
Clock cycle time (ns)     40            50             30
Cycles/addition           2             2              2
Cycles/multiplication     5             2              3

Fig. 8. Speedup vs number of processors with various start-up times, k = m = 400, tw = 5 μs, w = 1.

We demonstrate the effectiveness of our model through analysis of the results obtained for system 1. Next, we perform a comparative study with the other systems to highlight the impact of the various system and application parameters on the overall system performance. In our analysis, we consider a practical example using representative data from real systems. Table 2 gives the time required to perform floating-point multiplication and addition operations in three standard chips. Consider a cluster system of workstations on Ethernet with K + 1 Weitek 3364 processors. From the data in Table 1, t_w = 5 μs for workstations on Ethernet. Using the data in Table 2, T_cp = 0.2(k − 1) μs for Weitek 3364 processors; this is approximately 0.2k μs for large k, which is normally the case. Fig. 8 shows the variation of the speedup with respect to p for various values of t_s. The data used for obtaining the plot are t_w = 5 μs, k = m = 400, and w = w_dedicated = 1, since we consider a dedicated-network cluster. As shown in Fig. 8, if t_s = 0, it is always possible to get a performance improvement by adding extra processors. However, it is not possible to get an infinite speedup. The speedup has an upper bound given by

$$\lim_{p \to \infty} S(1, p+1) = \lim_{p \to \infty} \frac{(1+\sigma)^{p+1} - 1}{(1+\sigma)^{p}\,\sigma} = \frac{1+\sigma}{\sigma}. \tag{4.1}$$

If t_s > 0, it is no longer possible to obtain a continuous performance improvement by adding extra processors. On the contrary, there will be a degradation in performance if we exceed a certain number of processors. Obviously, the exact number of processors (degree of parallelism) needed to achieve the maximum speedup (minimum finish time) decreases as t_s increases. Fig. 9 shows the variation of the finish time with respect to p for various values of t_s. The data used for obtaining the plot are t_w = 5 μs, k = m = 400, and w = 1. Fig. 9 confirms the adverse effect of the start-up time on the degree of parallelism. The importance of this work stems from the fact that the start-up time is considered for the first time in analyzing real examples of parallel applications running on parallel systems in the context of divisible load theory. The previous work in divisible load theory focused on optimizing the scheduling of a generic, arbitrarily divisible load on an "unlimited" number of "available" processors [e.g., [8,11,16,27,34]]. There was no constraint on the number of processors to be used. It was only observed that the performance curves (finish time or speedup) level off after a certain number of processors, as shown in Figs. 8 and 9 for t_s = 0. The important issue now is to find the p that gives the maximum speedup (conversely, the minimum finish time) in a network computer system given a start-up time, system parameters, and application parameters. Investigating the system-timing diagram, Fig. 4, thoroughly, one observes that the time left for processor p to compute its task and communicate the result to the central processor is given by

$$T_{l_p} = T_{f(p+1)} - \left[(p-1)\,t_s + t_w \sum_{k=1}^{p-1} m_k\right]. \tag{4.2}$$

We continue to increase the number of processors, p, in order to obtain a performance improvement until the time left for processor p becomes less than its communication overhead. In other words, if T_l(p) < m_p t_w + t_s, it would be impossible to deliver part of the parallel job to processor p. This is because the time needed to compute m_p and communicate the result to the central processor exceeds the available time. A task could be delivered to processor p if we stretched T_f(p+1), but that is something we cannot do, since the finish time would no longer be minimum. Therefore, we must not deliver any load to processor p, which implies m_p = 0. If m_p = 0, it follows from Eq. (3.7) that

$$t_s = \frac{t_w\,\sigma\,m}{(1+\sigma)^{p+1} - ((1+p)\sigma + 1)}, \tag{4.3}$$

and the minimum finish time is given by

$$T_{f(p+1)} = t_0 = \frac{t_s\left((1+\sigma)^{p} - 1\right)}{\sigma}.$$

Eq. (4.3) is an important equation, which gives an upper bound on the start-up time. In other words, if m_p > 0, then

$$t_s < \frac{t_w\,\sigma\,m}{(1+\sigma)^{p+1} - ((1+p)\sigma + 1)}.$$
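This bound turns directly into a selection rule for the degree of parallelism: keep adding processors while the start-up time stays below the bound for the next processor. The sketch below is our illustrative implementation of that rule under the homogeneous model; the function names are ours.

```python
# Sketch: pick the largest p for which t_s is still below the bound of
# Eq. (4.3)/(3.8), i.e. t_s falls inside the range R_s(p) discussed below.

def ts_bound(p, m, sigma, t_w):
    # Upper bound on t_s for p helper processors (Eq. (4.3)).
    return t_w * sigma * m / ((1 + sigma)**(p + 1) - ((1 + p) * sigma + 1))

def optimal_p(m, sigma, t_w, t_s, p_max=1000):
    if ts_bound(1, m, sigma, t_w) <= t_s:
        return 0                      # even one helper is not worthwhile
    p = 1
    while p < p_max and ts_bound(p + 1, m, sigma, t_w) > t_s:
        p += 1
    return p

k = m = 400
t_w, t_s, w = 5.0, 1000.0, 1.0
sigma = t_w / (w * 0.2 * k)
print("optimal number of helper processors:", optimal_p(m, sigma, t_w, t_s))
```

For these parameters the rule returns p = 7, which is the value at which the system-1 finish time in Table 4 is minimum.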


Fig. 9. Finish time (in μs) vs number of processors with various start-up times, k = m = 400, tw = 5 μs, and w = 1.

t_s is plotted in Fig. 10(a) against the number of processors for different values of m. The data used to plot this curve are k = 400, t_w = 5 μs, and w = 1. The figure shows the upper bound on the values of t_s to be used for a given number of processors.

Fig. 10. (a) Start-up time (in μs) vs number of processors for various values of m, k = 400, tw = 5 μs, and w = 1 and (b) m = k = 400, tw = 5 μs, and w = 1.

Processors 9–13 for m = 400 from Fig. 10(a) are zoomed in on for the sake of analysis and plotted in Fig. 10(b). The values in Table 3 are derived from Fig. 10(a); it gives the exact upper bound (truncated to four decimal places) for different numbers of processors and various values of m. With the help of Fig. 10(b), the table may be read as follows: if m = 400 and 323.1994 ≤ t_s < 390.7593, the minimum finish time is obtained with p = 11. For each number of processors p, there is a range of start-up times for which the finish time is minimum. Let R_s(p) be the range of start-up times for p processors that gives the minimum finish time. This range, as shown in Fig. 10(b), is given by

$$t_s(p+1) \le R_s(p) < t_s(p); \quad \text{for } p = 11,\ 323.1994 \le R_s(11) < 390.7593.$$

Using Fig. 10(b), one notes that on crossing any of these boundaries the finish time will not be minimum. If t_s exceeds or equals the upper bound (t_s ≥ 390.7593 for p = 11), we move into R_s(10), and so the minimum finish time is achieved by decrementing the number of processors to p = 10 instead of 11. On the other hand, if t_s is less than the lower bound (t_s < 323.1994 for p = 11), the minimum finish time is obtained by distributing the load to more processors (12 instead of 11), because at that value we will be working in R_s(12). Obviously, within the same range, the smaller t_s is, the better the finish time. Recall that, from Eq. (4.3), t_s varies as a function of the system bandwidth (t_w), the processor speed, and the application parameters m and k, and so does R_s(p). This is an important finding, because it enhances the intelligence of the scheduler. The scheduler can now, based on the application parameters and system parameters, determine the value of p that gives the minimum finish time. Previously, such a number was determined based on the system parameters and independent of the application parameters.


Table 3. Setup time ts in μs for different values of p and m.

          p = 8       p = 9       p = 10      p = 11      p = 12      p = 13
m = 200   383.0109    299.7088    239.8019    195.3797    161.5997    135.3658
m = 400   766.0218    599.4175    479.6039    390.7593    323.1994    270.7317
m = 600   1149.0328   899.1263    719.4058    586.139     484.7991    406.0975
m = 800   1532.044    1198.835    959.2077    781.5187    646.3988    541.4634
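The entries of Table 3 follow from the same bound, Eq. (4.3); the short loop below regenerates them for k = 400, t_w = 5 μs, and w = 1. This is a sketch only, with our own variable names.

```python
# Sketch: regenerating the Table 3 bounds t_s(p, m) from Eq. (4.3).
k, t_w, w = 400, 5.0, 1.0
sigma = t_w / (w * 0.2 * k)
for m in (200, 400, 600, 800):
    row = [t_w * sigma * m / ((1 + sigma)**(p + 1) - ((1 + p) * sigma + 1))
           for p in range(8, 14)]
    print(f"m = {m}: " + "  ".join(f"{v:9.4f}" for v in row))
```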

Fig. 11. Start-up time (in μs) vs number of processors for various values of m, k = 400, tw = 0.54 μs, and w = 1, for Intel DELTA parallel computers.

Fig. 12. Speedup vs m for various start-up times with k = 400, tw = 5 μs, p = 15, and w = 1.


The parallel job parameters vary from one application to another, and so does the number of processors p. This gives room for parallel composition, where different program components execute concurrently on different subsets of processors in cluster computer systems. To put things into perspective, recall that the numbers used to get the above results are for a cluster of workstations on Ethernet, which are given in Table 1. For t_s = 1500 μs, it is obvious from Fig. 10(a) that only large matrix sizes, m ≥ 800, can be run efficiently on such a system, with a maximum degree of parallelism equal to 9. Therefore, most matrix operations are carried out on tightly coupled systems where the start-up time is relatively very small. A good choice from Table 1 is the Intel DELTA. t_s is plotted in Fig. 11, for the Intel DELTA, against the number of processors for different values of m. The data used to plot this curve are k = 400, t_w = 0.54 μs, and w = 1. The figure shows the upper bound on the value of t_s to be used for a given number of processors on the Intel DELTA. The acceptable working domain, as shown in Fig. 11, is above the threshold curve (t_s = 121 μs). For optimality the working point must be as close as possible to the threshold curve. So, one of the designer objectives is to reduce t_s. If the t_s of a parallel system gets smaller, its working domain gets larger and the degree of parallelism increases, which in turn improves system performance. The degree of parallelism also increases as m gets large. For example, Fig. 11 shows that a maximum of 15 processors can cooperate in executing a parallel job if m = 200. However, for m = 1000, this number can be as large as 34

Fig. 13. Speedup vs very large values of m for various start-up times with k = 400, tw = 5 μs, and p = 15.

processors. So, this asserts the fact that for each application (parallel job) a subset of the available processors (not all of them) should process the job in order to obtain optimal performance. The effect of the application parameters (m and k) on the system performance is presented in Figs. 12–15. Figs. 12 and 13 show the variation of the speedup with respect to m for various values of t_s. The data used for obtaining the plots are t_w = 5 μs, k = 400, w = 1, and p = 15.


Speed up

14

ts = 0 µs

12

ts = 200 µs

10

ts = 400 µs

8

ts = 600 µs ts = 800 µs

6

ts = 1000 µs

4

Figs. 14 and 15 show the speed up against k for various values of ts and for wide range of k values. The data used for obtaining the plots are tw = 5 s, m = 400, w = 1, and p = 15. If ts = 0, the speed up is given by Eq. (4.5) and tw tw it is not independent of k because ( = wT = 0.2k ). As cp k gets very large the speed up approaches an upper bound, which is p + 1 = 16 and gradually the effect of the startup vanishes. This can be derived from Eq. (4.5) as follows: large k (k → ∞) implies  → ∞, consequently, the speed up is obtained by taken the limit of Eq. (4.5) as  → ∞,

2

→0

k

Fig. 14. Speed up vs k for various start-up times with m = 400, tw = 5 s, p = 15, and w = 1.

ts = 0 µs

16

Speed up

15.5

ts = 0 µs ts = 200 µs ts = 400 µs ts = 600 µs ts = 800 µs ts = 1000 µs

15 14.5 14

((1 + )p+1 − 1) = p + 1. →0 (1 + )p 

lim S(1, p + 1) = lim

0 100 200 300 400 500 600 700 800 900 1000

16.5

43

13.5

This result can also be derived from the system equations (3.2) by setting ts = 0 and σ = 0. From the above analysis, it is now obvious that the designer's objective is to increase the degree of parallelism in the working domain, which can be achieved by decreasing the communication time (ts, tw) and/or increasing the computation time (m, k, wTcp). To see how the degree of parallelism varies with respect to system and application parameters, three variations of the system are considered in Sections 3.1.2, 3.2.1, and 3.2.2. We refer to these variations as systems 2–4, respectively. In system 2, an extra communication overhead is added to the system. It is expected that a more stringent constraint is imposed on the system, and so the degree of parallelism will be lower. From Eq. (3.15), ts must be positive, and since the denominator of Eq. (3.15) is always positive, it follows that

m / (k(m + 1)) > σ;    (4.7)

for large m, which is usually the case,

1/k > σ.

This implies that broadcasting the matrix A has little or no effect on the system behavior. However, k must not, based on the network bandwidth and the node speed, exceed a certain limit. If k gets larger than that limit, it would be better to execute the parallel job on a single node. Note that there is a correlation between σ and k, which by itself constitutes a whole task of future research. To elucidate this point, recall that σ = zTcm/(wTcp); thus, when k increases, we expect the processors to get busier, and so does the network; consequently, w and z increase. A very interesting problem is to find this incremental relation. System 3 is the same as system 1 except that the application is n times more computationally intensive, so it is now required to compute nm(2k − 1) floating point operations (FLOPs) compared to m(2k − 1) FLOPs in system 1.
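The feasibility test of Eq. (4.7), in the form given above, can be expressed as a simple decision rule. The Python sketch below is ours and purely illustrative; the function name and the sample parameter values are hypothetical, and the test follows the inequality as stated here rather than any code associated with the paper.

```python
# Feasibility test for system 2 (matrix A and vector b broadcast to all nodes).
# Eq. (4.7): load sharing can only help when m / (k*(m+1)) > sigma, which for
# large m reduces to 1/k > sigma; otherwise execute the job on a single node.

def broadcast_sharing_feasible(m: int, k: int, sigma: float) -> bool:
    """Return True when the inequality of Eq. (4.7) admits a positive ts."""
    return m / (k * (m + 1)) > sigma

if __name__ == "__main__":
    m, k = 400, 400
    for sigma in (0.001, 0.0025, 0.01):
        verdict = "share the load" if broadcast_sharing_feasible(m, k, sigma) else "use a single node"
        print(f"sigma = {sigma}: {verdict}")
```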


Table 4. Values of the start-up times and the finish times for various numbers of processors (p) for the four systems, where k = m = 400, n = 100, and tw = 5.

| p  | ts (system 1) | ts (system 2) | ts (system 3) | ts (system 4) | Tf (1)    | Tf (2) | Tf (3)    | Tf (4)    |
| 1  | 32,000        | −770,000      | 3,200,000     | 1,100,000     | 16,969.7  |        | 1,648,970 | 2,134,303 |
| 2  | 10,448.98     | −251,429      | 1,044,898     | 541,538.5     | 12,298.65 |        | 1,132,925 | 1,779,838 |
| 3  | 5116.802      | −123,123      | 511,680.2     | 303,284.9     | 10,224.52 |        | 875,633.7 | 1,603,188 |
| 4  | 3006.17       | −72,336       | 300,617       | 188,925.9     | 9194.804  |        | 721,847.8 | 1,497,669 |
| 5  | 1961.978      | −47,210.1     | 196,197.8     | 127,093.6     | 8691.981  |        | 619,818.5 | 1,427,720 |
| 6  | 1371.657      | −33,005.5     | 137,165.7     | 90,425.38     | 8494.229  |        | 547,367.3 | 1,378,098 |
| 7  | 1006.684      | −24,223.3     | 100,668.4     | 67,098.65     | 8490.611  |        | 493,404.7 | 1,341,185 |
| 8  | 766.0218      | −18,432.4     | 76,602.18     | 51,432.79     | 8619.465  |        | 451,769.8 | 1,312,747 |
| 9  | 599.4175      | −14,423.5     | 59,941.75     | 40,452.21     | 8843.767  |        | 418,765.9 | 1,290,242 |
| 10 | 479.6039      | −11,540.5     | 47,960.39     | 32,485.88     | 9139.929  |        | 392,040.4 | 1,272,054 |
| 39 | 16.02141      | −385.515      | 1602.141      | 1100.921      | 27,366.45 |        | 231,808.3 | 1,168,210 |
| 40 | 14.80002      | −356.125      | 1480.002      | 1017.031      | 28,146.91 |        | 231,428   | 1,168,197 |
| 41 | 13.68519      | −329.3        | 1368.519      | 940.4543      | 28,933.48 |        | 231,134.2 | 1,168,244 |
| 42 | 12.66605      | −304.777      | 1266.605      | 870.4466      | 29,725.96 |        | 230,920.2 | 1,168,348 |
| 43 | 11.73304      | −282.326      | 1173.304      | 806.3509      | 30,524.14 |        | 230,780.2 | 1,168,505 |
| 44 | 10.8777       | −261.745      | 1087.77       | 747.5875      | 31,327.83 |        | 230,708.9 | 1,168,710 |
| 45 | 10.09253      | −242.851      | 1009.253      | 693.6426      | 32,136.84 |        | 230,701.3 | 1,168,960 |
| 46 | 9.370896      | −225.487      | 937.0896      | 644.0605      | 32,951.01 |        | 230,753   | 1,169,253 |

To calculate the finish times in the four systems, ts = 1000 was used.

In system 4, we study the effect of applications that are both computation-intensive and communication-intensive. In this case, we broadcast k(m + n) points and compute nm(2k − 1) FLOPs. Eq. (3.26) suggests that for the load sharing to be feasible in system 4, we must have ts > 0. This implies that m > tb/(nwTcp), since the denominator of Eq. (3.26) is always positive for all values of σ and p. It follows that

nm / (k(m + n)) > σ;    (4.8)

for large m, which is usually the case,

n/k > σ.

For comparison purposes, the finish times for the four systems are calculated when k = m = 400, n = 100, tw = 5 ms, ts = 1000, and w = 1. The important question is: "Given the above system and application parameters, what would be the number of processors p to be used in (3.10), (3.17), (3.22), and (3.27) such that the finish time is minimum?" Given the same parameters, we calculated the corresponding values of ts for various values of p. All the results are tabulated in Table 4. The following observations are derived from the table.

• The number of processors, p, that gives the minimum finish time corresponds to the same number that gives the least ts that is greater than 1000 (the threshold value for the systems under consideration); both criteria are illustrated in the sketch after this list. As shown in Table 4, this number is 7, 0, 45, and 40 for systems 1–4, respectively. It is obvious from the table how the finish time increases as we depart from these numbers upwards or downwards.

• For system 2: Broadcasting the load (matrix A and vector b) to all processors adds an extra communication overhead. Since the inequality in Eq. (4.7) is not satisfied, the communication overhead outweighs the potential benefits of sharing the computing power of more than one processor. In order to avoid the communication overhead, it would be better to execute the whole task on a single node (i.e., p = 0). An indicator of the infeasibility of parallel execution of a given job on system 2 is that the ts values for system 2 in Table 4 are all negative.

• For system 3: In this system we study the effect of increasing the computation time of the load by considering matrix–matrix product applications. As shown in Table 4, the effect of an n times more computation-intensive application is that ts is n times greater than the value of ts for system 1. This means that the system with n times more computation-intensive applications can tolerate an n times slower network and still achieve a performance improvement. However, as shown in Table 4, the finish time does not grow with the same ratio; it is not n times greater than the finish time obtained in system 1.

• For system 4: In this case both the communication and the computation are increased. We consider the matrix–matrix product and we broadcast the two matrices to the helper processors over the network. So the computation, as in system 3, is n times greater than that in system 1; however, there is a communication overhead incurred by broadcasting both matrices A and B to the helper processors over the network. Unlike system 2, the corresponding inequality, Eq. (4.8), is satisfied, and so the benefits of parallel execution of a job outweigh the communication overhead.

• The degree of parallelism increases as we increase the computation in the application and decreases as we increase the communication. For example, when we increase the communication in system 2, the degree of parallelism drops to zero compared to p = 7 in system 1. In system 3, we increase the computation in the application, and so the degree of parallelism equals p = 45. System 4 is similar to system 3 in that both have the same computation volume, but in system 4 there is an extra communication overhead, which explains why the degree of parallelism drops to p = 40.
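The observation in the first bullet can be checked mechanically. The Python sketch below is ours and is for illustration only; the function names are hypothetical, and the dictionaries contain values copied from the system 1 columns of Table 4.

```python
# Selecting the number of helper processors p for system 1, using the two
# criteria observed in Table 4: (i) the p that minimises the finish time Tf,
# and (ii) the p whose required start-up time ts is the smallest value still
# above the threshold (1000 here). Both should return p = 7 for system 1.

TS_SYSTEM1 = {5: 1961.978, 6: 1371.657, 7: 1006.684, 8: 766.0218, 9: 599.4175}
TF_SYSTEM1 = {5: 8691.981, 6: 8494.229, 7: 8490.611, 8: 8619.465, 9: 8843.767}

def p_with_min_finish_time(tf: dict[int, float]) -> int:
    """Return the p whose finish time is smallest."""
    return min(tf, key=tf.get)

def p_with_least_ts_above(ts: dict[int, float], threshold: float) -> int | None:
    """Return the p whose ts is the least value greater than the threshold."""
    candidates = {p: v for p, v in ts.items() if v > threshold}
    return min(candidates, key=candidates.get) if candidates else None

if __name__ == "__main__":
    print("minimum finish time at p =", p_with_min_finish_time(TF_SYSTEM1))
    print("least ts above 1000 at p =", p_with_least_ts_above(TS_SYSTEM1, 1000.0))
```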

5. Concluding remarks

In this paper, we presented an analytical solution to evaluate the performance of message-passing parallel systems. The optimum amount of a divisible job (task) that has to be assigned to each processor participating in executing that job is given by a closed-form equation. This helps in designing more intelligent and effective schedulers. Moreover, it becomes possible to run the scheduling in a distributed manner. In other words, there is no need for centralized scheduling, even though the end results for a class of fork-join divisible jobs must be collected by a central processor (the job initiator). We showed that there is an upper bound on the number of processors that can take part in executing a given job. In other words, for a given parallel job running on a given parallel system, there is an optimum number of processors, which gives the minimum finish time. We showed how this number can be found and how it differs for different sizes of divisible jobs and for different network systems with variable communication overhead. The algebraic equations presented in this paper, which characterize message-passing parallel systems, are simple and can easily be adapted or modified to include or handle any system with any topology. We demonstrated this feature by extending the equations derived for the first model to handle other problems. Moreover, the model can be extended to obtain analytical solutions to other problems with a different structure, such as modeling the synchronization problem in parallel execution.

We do not claim that the model is complete in that it considers all aspects of message-passing parallel systems. However, we believe it to be an excellent start. There are some other aspects of the system to be covered. For example, work is still to be done to determine the correlation between broadcasting and start-up time. Broadcasting will increase the start-up time as it increases the network latency. This has been neglected in our model: we consider the start-up time to be the same whether or not broadcasting is used. We have done so based on the results in [47], which state that the effect of the network latency on the start-up time is insignificant. However, for completeness and exact results it is worth determining this correlation and incorporating it in the model. Work also needs to be done to incorporate the queuing delay at the FCFS parallel buffer queue in the central nodes. This time cannot be neglected, as it is a crucial part of the overall response time in heavily loaded systems. Moreover, the analytical solution presented in this paper was obtained for a homogeneous network of processors. It would be a good endeavor to extend the model to heterogeneous systems, where the nodes have different speeds and the links have different speeds too. There are many non-dedicated networks of workstations that are not identical and are connected through different LANs. We may also consider analyzing the system under two paradigms of jobs: divisible and indivisible.

Acknowledgment

The research in this paper is supported by UAEU Research Council Grant No. 19-7-11/01.

References

[1] I. Ahmad, Y.K. Kwok, On paralleling the multiprocessor scheduling problem, IEEE Trans. Parallel Distributed Systems 10 (1999) 414–432. [2] A. Alexandrov, M. Ionescu, K. Schauser, C. Scheiman, LogGP: incorporating long messages into LogP model, Proceedings of the Seventh Annual ACM Symposium on Parallel Algorithms and Architectures, 1995, pp. 95–105. [3] A. Amoura, E. Bampis, J. Konig, Scheduling algorithms for parallel gaussian elimination with communication costs, IEEE Trans. Parallel Distributed Systems 9 (1998) 676–686. [4] M. Atallah, C. Black, D. Marinescu, H. Siegel, T. Casavant, Models and algorithms for coscheduling compute-intensive task on network of workstations, Parallel Distributed Comput. 16 (1992) 319–327. [5] F. Baccelli, W.A. Massey, D. Towsly, Acyclic fork-join queuing networks, J. Assoc. Comput. Mach. 36 (1989) 615–642. [6] S. Bataineh, Matrix decomposition for minimum time execution on network-based multicomputers, Technical Report, March 2001. [7] S. Bataineh, M. Al-Ibrahim, Effect of fault tolerance and communication delay on response time in bus network multiprocessor system, Comput. Comm. J. 17 (12) (December 1994) 843–851. [8] S. Bataineh, T.G. Robertazzi, Bus oriented load sharing for a network of sensor driven processors, IEEE Trans. System Man Cybernetics 21 (5) (1991) 1202–1205. [9] A. Beguelin, J. Dongarra, G. Geist, R. Manchek, V. Sunderam, Development tools for network-based concurrent supercomputing, Proceedings of the Supercomputing 91, 1991, pp. 435–444. [10] V. Bharadwaj, D. Ghose, V. Mani, Optimal sequencing and arrangement in distributed single-level networks with communication delays, IEEE Trans. Parallel Distributed Systems 5 (1994) 968–976. [11] V. Bharadwaj, D. Ghose, V. Mani, T.G. Robertazzi, Scheduling Divisible Loads in Parallel and Distributed Systems, IEEE Computer Society Press, Los Alamitos, CA, September 1996, 292p. [12] J. Blazewicz, M. Drozdowski, Distributed processing of divisible jobs with communication startup costs, Discrete Appl. Math. 76 (1–3) (June 1997) 21–41; J. Blazewicz, M. Drozdowski, Distributed processing of divisible jobs with communication startup costs, Second International Colloquium on Graphs and Optimization, Leukerbad, Switzerland, Aug. 1994. [13] W. Blume, R. Eigenmann, Performance analysis of parallelizing compilers on the perfect benchmarks programs, IEEE Trans. Parallel Distributed Systems 3 (6) (1992) 643–656. [14] S.H. Bokhari, A shortest tree algorithm for optimal assignments across space and time in a distributed processor system, IEEE Trans. Software Eng. 7 (6) (November 1981) 583–589.


[15] S. Chen, M.M. Eshaghian, A. Khokhar, M.E. Shaaban, A selection theory and methodology for heterogeneous supercomputing, Proceedings of the Workshop Heterogeneous Processing, April 1993, pp. 15–22. [16] Y.C. Cheng, T.G. Robertazzi, Distributed Computation with communication delays, IEEE Trans. Aerospace Electron. Systems 24 (1988) 700–712. [17] T.C.E. Cheng, C.C.S. Sin, A state-of-the-art review of parallelmachine scheduling research, Europ. J. Oper. Res. 47 (1990) 271–292. [18] A.N. Choudhary, R. Ponnusamy, Implementation and evaluation of Hough transform algorithms on a shared-memory multiprocessor, J. Parallel Distributed Comput. 12 (1991) 178–188. [19] P. Chretienne, A polynomial algorithm to optimally schedule tasks on a virtual distributed system under tree-like precedence constraints, Europ. J. Oper. Res. 43 (1989) 225–230. [20] P. Chretienne, Task scheduling with interprocessor communication delays, Europ. J. Oper. Res. 57 (1992) 348–354. [21] P. Chretienne, E.G. Coffman Jr., J.K. Lenstra, Z. Liu (Eds.), Scheduling Theory and its Applications, Wiley, New York, 1995. [22] J.Y. Colin, P. Chretienne, C.P.M. Scheduling with small communication delays and task duplication, Oper. Res. 39 (3) (1991) 680–684. [23] D.E. Culler, R. Karp, D. Patterson, A. Sahay, K.E. Schauser, E. Santos, R. Subarmonian, T. Eicken, LogP:towards a realistic model of parallel computation, Proceedings of the ACM Symposium on Principles and Practice of Parallel Programming, 1993, pp. 1–12. [24] R. Davoli, L. Giachini, O. Babaoglu, A. Amoroso, L. Alvisi, Parallel computing in networks of workstations with paralex, IEEE Trans. Parallel and Distributed Systems 7 (1996) 371–384. [25] H. Dietz, W. Cohen, B. Grant, Would you run it here ...or there? (AHS: Automatic heterogeneous supercomputing), Proceedings of the International Conference on Parallel Processing, vol. II, August 1993, pp. 217–221. [26] J.J. Dongarra, The performance database server (PDS): Reports: Linpack Benchmark-Parallel, http://www.netlib.org/benchweb/. [27] M. Drozdowski, Selected Problems of Scheduling Tasks in Multiprocessor Computer Systems, Politechnika Poznanska, Book No. 321, Poznan, Poland, 1997. [28] T.H. Dunigan, Performance of the Intel iPSC/860 and n-Cube 6400 hypercubes, Parallel Comput. 17 (10, 11) (December 1991) 1,285– 1,302. [29] S.M. Figueira, F. Berman, A slowdown model for applications executing on time-shared clusters of workstations, IEEE Trans. Parallel Distributed Systems 12 (2001) 653–669. [30] I. Foster, Designing and building parallel programs, http://wwwunix.mcs.anl.gov/dbpp/. [31] R.F. Freund, Optimal selection theory for superconcurrency, Proceedings of the Supercomputing’89, 1989, pp. 699–703. [32] A. Gerasoulis, T. Yang, A comparison of clustering heuristics for scheduling directed acyclic graphs on multiprocessors, J. Parallel Distributed Comput. 16 (1992) 276–291. [33] D. Gerogiannis, S.C. Orphanoudakis, Load balancing requirements in parallel implementations of image feature extraction tasks, IEEE Trans. Parallel Distributed Systems 4 (1993) 994–1013. [34] D. Ghose, Load partitioning and trade-off study for large matrixvector computations in multicast bus network with communication dealy, J. Parallel Distributed Comput. 55 (1998) 32–59. [35] L. Hollermann, T. Hsu, D.R. Lopez, K. Vertanen, Scheduling problem in a practical allocation model, J. Combin. Optim. 1 (2) (1997) 129–149. [36] T. Hsu, J.C. Lee, D.R. Lopez, W.A. Royce, Task allocation on a network of processors, IEEE Trans. Comput. 
49 (December 2000) 1339–1353. [37] http://www.psc.edu. [38] J.J. Hwang, Y.C. Chow, F.D. Anger, C.Y. Lee, Scheduling precedence graphs in systems with interprocessor communication times, SIAM J. Comput. 18 (2) (1989) 244–257.

[39] J. Kim, D.J. Lilja, Performance-based path determination for interprocessor communication in distributed computing system, IEEE Trans. Parallel Distributed Systems 10 (1999) 316. [40] U. Krishnaswamy, I.D. Scherson, A frame for computer performance evaluation using benchmark sets, IEEE Trans. Comput. 49 (2000) 1325–1338. [41] Yu-Kwong Kwok, I. Ahmad, Static scheduling algorithms for allocating directed task graphs to multiprocessors, ACM Comput. Surveys 31 (4) (December 1999) 406–471. [42] C. Lee, M. Hamdi, Parallel image processing applications on a network of workstations, Parallel Comput. 21 (1) (1995) 137–160. [43] S. Leuteneger, M. Vernon, The performance of multiprogrammed multiprocessor scheduling algorithms, Proceedings of the Sigmetrics ‘90’, 1990, pp. 226–236. [44] X. Li, V. Bharadwaj, K.C. Chung, Divisible load scheduling with start-up costs on distributed linear networks, 12th International Conference on ISCA PDCS, FL, USA, August 1999. [45] R.P. Martin, et al., Effects of communication latency, overhead, and bandwidth in a cluster architecture, Proceedings of the 24th International Symposium on Computer Architecture, June 1997, pp. 85–96. [46] C. McCann, R. Vaswani, J. Zahorjan, A dynamic processor allocation policy for multiprogrammed shared-memory multiprocessors, ACM Trans. Comput. Systems 11 (May 1993) 146–178. [47] C.A. Moritz, M.I. Frank, LoGPC: modeling network contention in message-passing programs, IEEE Trans. Parallel Distributed Systems 12 (4) (2001) 404–415. [48] C.A. Moritz, M.I. Frank, LoGPC: modeling network contention in message-passing programs, IEEE Trans. Parallel Distributed Systems 12 (2001) 404–415. [49] B. Narahari, A. Youssef, H.A. Choi, Matching and scheduling in a generalized optimal selection theory, Proceedings of the Heterogeneous Computing Workshop, April 1994, pp. 3–8. [50] B. Narahari, A. Youssef, H.A. Choi, Matching and scheduling in a generalized optimal selection theory, Proceedings of the Heterogeneous Computing Workshop, April 1994, pp. 3–8. [51] R.D. Nelson, D. Towsley, A performance evaluation of several priority policies for parallel processing systems, J. Assoc. Comput. Mach. 40 (July 1993) 714–740. [52] R.D. Nelson, D. Towsley, A.N. Tantawi, Performance analysis of parallel processing systems, IEEE Trans. Software Eng. 14 (April 1988) 532–540. [53] M.G. Norman, P. Thanisch, Models of machines and computation for mapping in multicomputers, ACM Comput. Surveys 25 (3) (1993) 263–302. [54] G.F. Pfister, Clusters of computers: characteristics of an invisible architecture, Keynote address presented at IEEE International Parallel Processing Symposium, Honolulu, April 1996. [55] Y. Saad, M.H. Schultz, Data communication in parallel architectures, Parallel Comput. 11 (2) (August 1989) 131–150. [56] S. Saini, D.H. Bailey, NAS Parallel Benchmark Results 12-95, Technical Report NAS-95-021, NASA Ames Research Center, December 1995. [57] S.K. Setia, M.S. Squillante, S.K. Tripathi, Analysis of processor allocation in multiprogrammed, distributed-memory parallel processing systems, IEEE Trans. Parallel Distributed Systems 5 (1994) 401–420. [58] P. Steenkiste, Network-based multicomputers: a practical supercomputer architecture, IEEE Trans. Parallel Distributed Systems 7 (1996) 861–875. [59] H.S. Stone, Multiprocessor scheduling with the aid of network flow algorithms, IEEE Trans. Software Eng. 3 (1) (1977) 85–93. [60] M. Tan, H.J. Siegel, J.K. Antonio, Y.A. 
Li, Minimizing the application execution time through scheduling of subtasks and communication traffic in a heterogeneous computing system, IEEE Trans. Parallel Distributed Systems 8 (1997) 857–871. [61] D. Towsley, C.G. Rommel, J. Stankovic, Analysis of fork-join program response times on multiprocessors, IEEE Trans. Parallel Distributed Systems 1 (July 1990) 286–303. [62] M. Wang, S.D. Kim, M.A. Nichols, R.F. Freund, H.J. Siegel, W.G. Nation, Augmenting the optimal selection theory for superconcurrency, Proceedings of the Workshop on Heterogeneous Processing, March 1992, pp. 13–22. [63] S.C. Woo, M. Ohara, E. Torrie, J.P. Singh, A. Gupta, The SPLASH-2 Programs: Characterization and methodological considerations, Proceedings of the 22nd Annual International Symposium on Computer Architecture, 1995, pp. 24–36. [64] J. Zahorjan, C. McCann, Processor scheduling in shared memory multiprocessors, Proceedings of the ACM Sigmetrics Conference, 1990, pp. 214–225. [65] S. Zhou, T. Brecht, Processor pool-based scheduling for large-scale NUMA multiprocessors, Proceedings of the ACM Sigmetrics, 1991, pp. 132–142.