A detailed MPI communication model for distributed systems




Future Generation Computer Systems 22 (2006) 269–278

Thuy T. Le, Jalel Rejeb
Department of Electrical Engineering, San Jose State University, One Washington Square, San Jose, CA 95192-0084, USA
Received 15 April 2005; received in revised form 23 August 2005; accepted 23 August 2005; available online 17 October 2005
Corresponding author: Thuy T. Le. Tel.: +1 408 924 5708; fax: +1 408 924 3925; e-mail: [email protected]

Abstract

Message Passing Interface (MPI) is the most popular communication interface used in today's PC clusters and other cluster-type parallel/distributed computers. To date, the most widely used analytical model of MPI communication performance for parallel/distributed machines is the LogGP model, which is based mostly on system hardware parameters. Given the popularity of MPI, the improvements in interconnection networks over the past few years, and the development of MPI for computational Grids, the LogGP model needs to be re-evaluated, both for detailed hardware performance and for the inclusion of middleware overheads for different data structures. In this article, we use our experimental results to show that the current LogGP communication model is too limited for today's parallel/distributed systems. We propose a modification that includes important factors that have been left out of the model. We itemize the terms of the model to show the consistency and the meaning of these communication costs, which we believe to be the starting point for modeling MPI communication cost on the Grids. In this work, we start with point-to-point communication and plan to extend to other communication patterns, such as broadcast communication.

© 2005 Elsevier B.V. All rights reserved.

Keywords: MPI communication; Parallel system; Distributed system; Performance modeling

1. Introduction

The MPI performance of inter-node communication on a particular traditional distributed system or cluster [1,2] can be measured and analytically modeled. The information can then be used to estimate the arrival of transferred data for parallel application coding and programming purposes.


The communication cost parameters can be measured for a system, and the data can then be used as optimum communication costs for every MPI-based parallel application running on that system. Although the communication performance of a system depends on the current system load, using lower-bound or average values of the communication cost for parallel code development purposes is acceptable. In cluster and distributed-memory systems, communication time depends on both hardware and software parameters. The hardware parameters include the inter-network bandwidth, the network topology and the number of nodes participating in the communication process. The software parameters include the communication algorithm and implementation, the software architecture, the types of message transfer or exchange, and the structure of the message stored in memory.



As an example, the MPICH [3,4] process of transferring data from a source to a destination is a basic point-to-point communication that requires the steps summarized below (for the sender):

1. Specify the source and target addresses and memory locations (application buffers).
2. Specify the amount of data to be transferred.
3. Specify the type of data transferred.
4. Define the scheme of transferring data: Short, Eager, Rendezvous or Get [5–8]:
• Short: Data is delivered within the message envelope (used for short messages).
• Eager: Data is delivered separately from the envelope and without waiting for the receiver to request it (used for intermediate-length messages).
• Rendezvous: Data is delivered separately from the envelope, but only when the receiver requests it (used for long messages).
• Get: Data and envelope are copied by the receiver using shared memory or remote memory operations.
5. Determine the devices to use based on the destination (TCP, UDP, proprietary, shared memory, ATM, etc.) [8,9].
6. Create the (small) message envelope that contains information about the message, such as the tag, communicator, length, etc.
7. Restructure the message buffer based on the data-type.
8. Perform a series of handshakes between the source and destination in which data is transferred by calling system service routines.
9. Cancel or push the requests as needed.
10. Test and wait for completion as needed.
11. Terminate the data transfer to release the message buffers (application buffers).

Eager is the highest-performance protocol and serves as the default choice of MPICH. The data is sent to the destination immediately and, if the destination is not expecting the data, the receiver must allocate space to store the data locally. Rendezvous is the lowest-performance protocol. In this protocol, data is sent to the destination only when it is requested, that is, the data transfer is blocked while waiting for the handshake between the sender and the receiver. A sketch of the application-level calls behind steps 1–3 is shown below.
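The following C sketch shows only the application-level view of steps 1–3 (addresses, count, data-type) for a blocking point-to-point transfer; the protocol actually used (Short, Eager or Rendezvous) is chosen internally by the MPI library from the message length. The message size and tag are arbitrary illustration values, not parameters from this article.

/* Minimal blocking point-to-point transfer (illustrative sketch).
 * Build with an MPI C compiler, e.g.:  mpicc -o p2p p2p.c
 * N and the tag are arbitrary example values. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 4096                      /* number of doubles in the message */

int main(int argc, char **argv)
{
    int rank, size;
    double *buf = malloc(N * sizeof(double));

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size >= 2) {
        if (rank == 0) {
            for (int i = 0; i < N; i++) buf[i] = (double)i;
            /* steps 1-3: destination, count and data-type are given here;
               steps 4-11 are carried out by the MPI middleware */
            MPI_Send(buf, N, MPI_DOUBLE, 1, 99, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(buf, N, MPI_DOUBLE, 0, 99, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("received %d doubles\n", N);
        }
    }

    free(buf);
    MPI_Finalize();
    return 0;
}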

The Get protocol is also a high-performance protocol, since the transferred data is read directly by the receiver. This protocol requires special hardware support (such as shared memory) for direct data transfer between processors' memories. An example of this kind of implementation is the "memcopy" operation used in UNIX systems [10]. Both the Eager and Rendezvous protocols can deliver data in blocking or non-blocking mode. In a non-blocking transfer, once the data is requested the sender calls a system service routine to begin the transfer and then returns to the user without waiting for the transfer to complete. For messages of intermediate length, the Eager protocol may offer the best combination of performance and reliability, while for very long data only the Rendezvous and Get protocols can maintain the correctness of the message. The selection of the data transfer scheme is basically based on the size of the message, but it can also be based on other factors, such as the number of pending completions, the message data-type, etc. The steps of restructuring the data buffer are necessary for non-contiguous or heterogeneous data transfers. The simplest implementation for these cases is to copy the message into and out of a temporary buffer, with system service routines performing the message transfer/receive between temporary buffers (a sketch of this idea at the application level is given below). In a multiple-"request" operation, each "request" is pushed out of a global request list for execution until the list is empty. The push device runs through the list frequently, keeping track of the request pointer. The cost of handling these steps is not analytically modeled in LogGP [11–13] and is the focus of our study and the experiments described in this article. In an MPI communication, steps 1–3 are completed by the MPI calls, and steps 4–11 are in fact completed by the communication middleware that runs automatically when the high-level communication instructions are executed. The communication middleware is basically system software installed in the form of communication libraries. The library includes a number of multi-level routines that are called by the high-level application program via a predefined standard communication interface.
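The temporary-buffer idea mentioned above can be sketched at the application level with MPI_Pack/MPI_Unpack. This is only an illustration of the concept, not the MPICH-internal implementation; the block count, stride and tag are assumed example values, and the caller is assumed to have initialized MPI and allocated user_buf with at least 2 * NBLOCKS doubles.

/* Application-level sketch of the temporary-buffer ("pack/unpack") scheme
 * for a strided message: one 8-byte data block (a double) followed by an
 * 8-byte gap, repeated NBLOCKS times.  Sizes and tag are illustrative. */
#include <mpi.h>
#include <stdlib.h>

#define NBLOCKS 1024                      /* number of 8-byte data blocks */

static void exchange_strided(double *user_buf, int rank)
{
    int bytes = NBLOCKS * (int)sizeof(double);
    char *tmp = malloc(bytes);            /* contiguous temporary buffer  */
    int pos = 0;

    if (rank == 0) {
        /* pack every other double (one-double gap between data blocks)  */
        for (int i = 0; i < NBLOCKS; i++)
            MPI_Pack(&user_buf[2 * i], 1, MPI_DOUBLE,
                     tmp, bytes, &pos, MPI_COMM_WORLD);
        MPI_Send(tmp, pos, MPI_PACKED, 1, 7, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(tmp, bytes, MPI_PACKED, 0, 7, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        /* unpack back into the strided layout of the user buffer        */
        for (int i = 0; i < NBLOCKS; i++)
            MPI_Unpack(tmp, bytes, &pos, &user_buf[2 * i], 1,
                       MPI_DOUBLE, MPI_COMM_WORLD);
    }
    free(tmp);
}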


Fig. 1. LogGP point-to-point communication time.

In the LogGP model, point-to-point communication is represented as a function of the system parameters L, o, g, G and P, as shown in Fig. 1, where the parameters can be summarized as follows:

• L: Transmission latency, which is the time that the message spends on the communication link. It is the interval from the time the first bit of the message (from the sender) enters the communication link to the time the receiver receives it. L is mostly a time-dependent function of the interconnection architecture and the status of the communication traffic.
• o: Sending or receiving overhead, which is the time that the processor engages in communication activities and cannot be overlapped with any other activities. For the sender, this is the time for framing and adding protocol information, etc. For the receiver, this is the time for message recovery and error checking, etc.
• g: Message gap, which is the minimum time interval between consecutive message transmissions or receptions at the sender or receiver, respectively. For the sender, it is the interval from the time the last bit of the current message goes out to the time the first bit of the next message goes out. For the receiver, it is the interval from the time the receiver received the last bit of the current message to the time the receiver is able to receive the first bit of the next message.
• G: Gap per byte, which is the transmission time per unit of message and is not overlapped with the overhead time. G is mostly a time-dependent function of the interconnection architecture, the status of the communication traffic, and the communication algorithm of the message-passing library.
• P: Number of processors currently involved in the communication operation.

For a simple dedicated parallel/distributed system, the total communication time of a basic point-to-point communication can be described in the LogGP model as

T_p2p ≈ o_s + L + o_r   (small messages)
T_p2p ≈ o_s + (M − 1)G + L + o_r ≈ o_s + MG + L + o_r   (large messages)

The notation T_p2p represents the half round-trip time of the point-to-point communication and M represents the message length. The subscripts s and r stand for sender and receiver, respectively. The discussion above shows that the costs of all operations handled by high-level software (at the application level) are lumped together as the software overhead costs o_s and o_r. The dependency of the communication cost on the communication links is represented by the hardware parameters, the latency L and the gap G. The dependency of the communication cost on middleware operations due to the data structure and communication scheme is ignored. The model therefore gives application developers no hint of the middleware cost as a function of the data structure and communication devices. With the performance improvements of the interconnection networks of distributed systems, the middleware cost becomes even more significant.

The remainder of this article is organized as follows. Section 2 presents our experiments and the measured results. Section 3 describes the mapping of the experimental data onto the LogGP analytical model with our proposed additional factors. Section 4 describes the detailed communication parameters and their measured values to validate the consistency between the models. Section 5 concludes our research and outlines our plan for further study.
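Before turning to the experiments, the short C helper below makes the LogGP cost expressions above concrete by evaluating the small-message and large-message estimates. All parameter values in the example call are placeholders for illustration, not measured values from this article.

/* LogGP point-to-point estimate (times in microseconds, M in bytes).
 *   small messages:  T ~= os + L + or
 *   large messages:  T ~= os + M*G + L + or
 * Parameter values passed by the caller are placeholders. */
#include <stdio.h>

static double logp_p2p(double os, double L, double orecv,
                       double G, double M, double small_msg_threshold)
{
    if (M <= small_msg_threshold)
        return os + L + orecv;
    return os + M * G + L + orecv;
}

int main(void)
{
    /* hypothetical parameters: os = or = 5 us, L = 20 us,
       G = 0.005 us/byte, small-message threshold = 1 KB */
    double t = logp_p2p(5.0, 20.0, 5.0, 0.005, 64.0 * 1024.0, 1024.0);
    printf("predicted T_p2p = %.1f us\n", t);
    return 0;
}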

2. The experiments

The two computer systems that we used in our experiments are a 4-processor Itanium2 cluster (Intel IA-64) [14] and an 8-processor Fujitsu HPC2500 [15]. We used two systems in order to assess the machine dependency of the model parameters and to guarantee that our analysis is not biased toward any particular system hardware and/or MPI implementation. Our Itanium2 system has four 64-bit 1.5 GHz Itanium2 processors with 32 KB L1 cache, 256 KB L2 cache and 6 MB L3 cache per processor (both data and instruction). The system uses a 128-bit 400 MHz bus with a total of 12 GB of memory. The Fujitsu HPC2500 is based on a Symmetric Multi-Processing architecture with 1.3 GHz SPARC64 V processors [16]. The architecture supports up to 128 nodes, where each node has up to 128 CPUs with 512 GB of memory.



Each CPU has 256 KB of on-chip primary instruction and data cache and 4 MB of on-chip secondary instruction and data cache. Each node has 16 system boards and each system board has 8 CPUs and 32 GB of memory. The system used for our experiments consists of one node, one board and eight processors. The node-to-node, board-to-board and processor-to-memory communications are all handled by crossbar switches. Besides the software barrier, the system also has a hardware barrier facility that reduces the execution cost of barrier synchronization between threads. The hardware barrier implementation is an order of magnitude more efficient than the software barrier implementation and is almost independent of the number of processors participating in the process.

We performed blocking point-to-point communication experiments with various message sizes and various data stride sizes. To observe how much the data structure contributes to the communication cost, we performed initial experiments with the simple data structures summarized below:

• No stride: Contiguous message.
• 8B and stride: Each 8-byte message is separated by an 8-byte stride.
• 16B and stride: Each 16-byte message is separated by an 8-byte stride.
• 32B and stride: Each 32-byte message is separated by an 8-byte stride.

Figs. 2 and 3 show the measured half round-trip time versus message length for the data patterns described above on the Itanium2 cluster and the Fujitsu HPC2500, respectively. Each reported time value in the figures is the smallest of 30 measured values, where each value is the average of 100 measurements. For better visualization, the communication costs for several message lengths with various stride sizes are re-sketched in Figs. 4 and 5.

Fig. 2. Half round-trip times (µs) vs. message sizes (bytes) with various data sizes per stride on the Itanium2 cluster.

Fig. 3. Half round-trip times (µs) vs. message sizes (bytes) with various data sizes per stride on the Fujitsu HPC2500.

Fig. 4. Half round-trip times (µs) of some message lengths for four different data sizes per stride on the Itanium2 cluster.

Fig. 5. Half round-trip times (µs) of some message lengths for four different data sizes per stride on the Fujitsu HPC2500.

Note that since our analyses are at the application level, the random characteristics of memory access and network delays are not considered in this study. Moreover, our measurement methods assume that the receiving processor may access a message only after the entire message has arrived, and that at any given time a processor can either be sending or receiving a single message. The experimental results clearly show that the increase in communication time for non-contiguous messages must come from the middleware overheads, since there is no difference in the MPI send and receive calls for the various message structures. The results also show that the middleware overhead is a function of the message size and the stride size. It is important to recognize that the communication cost is heavily and quickly dominated by the additional cost introduced by the stride of the memory accesses. A simplified sketch of the kind of benchmark used for these measurements is shown below.
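The sketch below is a minimal blocking ping-pong benchmark for one strided pattern (8-byte data blocks separated by an 8-byte gap). It uses an MPI derived datatype (MPI_Type_vector) to generate the stride; the article does not state exactly how the original benchmarks constructed their strided messages, so this is only one plausible realization. The repetition count and message size are illustrative, and the sketch reports a single average rather than the smallest of 30 averages of 100 runs used in the real experiments.

/* Blocking ping-pong sketch for one strided pattern:
 * 8-byte (double) data blocks separated by an 8-byte gap.
 * Loop counts and message sizes are illustrative values only. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NBLOCKS 1024        /* number of 8-byte data blocks per message */
#define REPS    100         /* repetitions averaged per reported value  */

int main(int argc, char **argv)
{
    int rank;
    MPI_Datatype strided;
    double *buf = calloc(2 * NBLOCKS, sizeof(double)); /* data + gaps */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* one double of data, then one double of gap, NBLOCKS times */
    MPI_Type_vector(NBLOCKS, 1, 2, MPI_DOUBLE, &strided);
    MPI_Type_commit(&strided);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < REPS; i++) {
        if (rank == 0) {
            MPI_Send(buf, 1, strided, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, 1, strided, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, 1, strided, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, 1, strided, 0, 0, MPI_COMM_WORLD);
        }
    }
    double half_rt_us = (MPI_Wtime() - t0) / (2.0 * REPS) * 1.0e6;
    if (rank == 0)
        printf("half round-trip: %.2f us\n", half_rt_us);

    MPI_Type_free(&strided);
    free(buf);
    MPI_Finalize();
    return 0;
}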

3. The proposed analytical model

For point-to-point message transfer, all implementations share the use of a high-bandwidth protocol (such as Rendezvous) for long messages and a low-latency protocol (such as Eager) for short messages. Short and long messages are distinguished by a message threshold that is set at configuration time. Moreover, in order to minimize the latency incurred by long messages, most MPI implementations use the low-latency transmission (Eager) for sending the first fragment of a long message and then use the high-bandwidth transmission (Rendezvous) for the remaining fragments. As a result, the point-to-point communication cost can be modeled as a piecewise continuous function of the message length, as described by the LogGP model.

There are several sources of software overhead in point-to-point communication. The obvious source is the cost of the (application-level) calling function, as described by the o_s and o_r parameters in the LogGP model. The second source is the cost of preparing and packing the data for transfer and of recovering the data at the receiving end. One of the data processing activities that can be performed at the application level is the packing and memory-copying of non-contiguous data at the transmission end, as well as the unpacking and memory-copying at the receiving end. Note that if the data is contiguous, packing and memory-copying are not necessary and the data is transferred directly from the user's buffers. The middleware overhead is a function of the message size and the stride size, as shown in the experimental results reported in the previous section.

We also performed several experiments on both machines with different languages by writing our benchmark programs in C and in FORTRAN. We found that the language only contributes to the cost of the software overhead calls, which is about 25% higher for FORTRAN than for C. There is very little difference in the middleware cost between the C and FORTRAN benchmarks. For the rest of this article, we describe our experimental results and analyses based on the C benchmark only.

Another factor that should be mentioned is that the overhead communication parameters can change abruptly at some distinct message sizes due to the IP packet size, the data cache size and the handshake mechanism of the MPI implementation. For the HPC2500, distortion in the round-trip time can be seen at the 128-byte and 128 KB message sizes. The distortion at 128 bytes is due to the IP packet size and the distortion at 128 KB is due to the L1 data cache size. For the Itanium2 cluster, distortion in the communication cost can be seen at the 16 KB message size, due to the L1 data cache size. To avoid these unusual data points, we mapped our experimental data onto the LogGP model by performing the curve fitting for messages between the distortion points.

In order to observe the independence of the middleware cost from the stride amount (at least within a close range of memory access, except for the cases of cache misses and memory bank conflicts), we also performed a number of experiments on the two systems with the data structures shown in Figs. 6 and 7. The measured results are also analytically modeled with the LogGP scheme, as shown in the equations below.

Fig. 6. Various stride sizes per 8-byte data size.

Fig. 7. 8-byte stride per various data sizes.

In Fig. 6, the transferred messages have a stride for each 8-byte data block, and the stride amount varies from 8 to 64 bytes. In Fig. 7, the transferred messages have strides for various data sizes and the stride amount is constant at 8 bytes. We also performed quite a number of experiments with different stride patterns and message sizes in order to check our proposed model. Note that all of our analytical LogGP curves have (absolute) average errors of less than 3%.

Itanium2 cluster:
T_p2p^cont(M) = 36.31 + 1.58 × 10^-3 M;  |ε̄| = 2.69%
T_p2p^8B(M)  = 36.08 + 6.04 × 10^-2 M;  |ε̄| = 1.16%
T_p2p^16B(M) = 36.24 + 6.16 × 10^-2 M;  |ε̄| = 1.56%
T_p2p^32B(M) = 36.24 + 6.18 × 10^-2 M;  |ε̄| = 1.84%

|ε̄| = (1/N) Σ_{i=1}^{N} |ε_i| = (1/N) Σ_{i=1}^{N} |T_i^calculated − T_i^measured|    (1)

HPC2500:
T_p2p^cont(M) = 10.03 + 6.51 × 10^-3 M;  |ε̄| = 2.99%
T_p2p^8B(M)  = 10.29 + 8.77 × 10^-2 M;  |ε̄| = 1.66%
T_p2p^16B(M) = 10.32 + 8.77 × 10^-2 M;  |ε̄| = 1.82%
T_p2p^32B(M) = 10.15 + 8.79 × 10^-2 M;  |ε̄| = 1.74%    (2)

Itanium2 cluster:
T_p2p^8B(M)  = 35.95 + 6.05 × 10^-2 M;  |ε̄| = 1.44%
T_p2p^16B(M) = 35.96 + 3.48 × 10^-2 M;  |ε̄| = 2.77%
T_p2p^24B(M) = 37.47 + 1.88 × 10^-2 M;  |ε̄| = 1.02%
T_p2p^32B(M) = 37.86 + 1.61 × 10^-2 M;  |ε̄| = 0.52%    (3)

HPC2500:
T_p2p^8B(M)  = 10.92 + 8.78 × 10^-2 M;  |ε̄| = 2.13%
T_p2p^16B(M) = 10.73 + 5.39 × 10^-2 M;  |ε̄| = 2.21%
T_p2p^24B(M) = 11.08 + 4.14 × 10^-2 M;  |ε̄| = 1.88%
T_p2p^32B(M) = 11.48 + 3.35 × 10^-2 M;  |ε̄| = 2.33%    (4)
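The linear coefficients in Eqs. (1)–(4) come from least-square fits of the form T = a + bM to the measured half round-trip times. A self-contained sketch of such a fit is given below; the sample (M, T) pairs are made-up values, and the error is normalized to the measured time so that it is a percentage, which is an assumption on our part (the article's own error definition is Eq. (1)).

/* Ordinary least-squares fit of T(M) = a + b*M, plus the mean absolute
 * relative error of the fit in percent.  The sample points are invented
 * for illustration; real inputs would be measured half round-trip times. */
#include <stdio.h>
#include <math.h>

static void fit_line(const double *M, const double *T, int n,
                     double *a, double *b, double *err_pct)
{
    double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0, e = 0.0;
    for (int i = 0; i < n; i++) {
        sx += M[i]; sy += T[i];
        sxx += M[i] * M[i]; sxy += M[i] * T[i];
    }
    *b = (n * sxy - sx * sy) / (n * sxx - sx * sx);   /* slope     */
    *a = (sy - (*b) * sx) / n;                        /* intercept */
    for (int i = 0; i < n; i++)
        e += fabs((*a + (*b) * M[i]) - T[i]) / T[i];
    *err_pct = 100.0 * e / n;
}

int main(void)
{
    double M[] = { 256, 1024, 4096, 8192, 11264 };   /* bytes (example) */
    double T[] = { 36.7, 37.9, 42.8, 49.2, 54.1 };   /* us (example)    */
    double a, b, err;
    fit_line(M, T, 5, &a, &b, &err);
    printf("T(M) = %.2f + %.3e * M   (mean |error| = %.2f%%)\n", a, b, err);
    return 0;
}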

In Eqs. (1) and (2) above, T_p2p^cont denotes the half round-trip time of a contiguous message, and T_p2p^8B, T_p2p^16B, . . . denote the half round-trip times of messages with stride amounts equal to 8 bytes, 16 bytes, etc. for each 8-byte data block. In Eqs. (3) and (4) above, T_p2p^8B, T_p2p^16B, . . . denote the half round-trip times of messages with an 8-byte stride for each 8 bytes, 16 bytes, etc. of data, respectively. All communication times are in microseconds and all message lengths and stride amounts are in bytes. From the experimental results, it is clear that the middleware cost is independent of the stride amount, as expected. To better analyze the effect of the middleware cost for general data structures, we redefined the LogGP model as shown in Eq. (5) and graphically represented in Fig. 8:

T_p2p ≈ o_s + L + o_r + MG + Mf(φ) ≈ o + MG + Mf(φ);   o = o_s + o_r    (5)

Fig. 8. Modified LogGP point-to-point communication time.

In our approach, we include in the model a function f(φ), which depends on the "data structure" φ. The function f(φ) is in fact a system-dependent parameter. The term Mf(φ) is the effective latency (L_eff), which is the length of time the processor is engaged in the transmission or reception of a non-contiguous message over and above the cost of a contiguous transfer. This cost is upper-bounded (maximum) for data transfers with no overlap between communication and data preparation, and is lower-bounded by 0 (minimum) for data transfers with full overlap between communication and data preparation.

We performed numerous experiments with different data combination patterns and performed the corresponding curve fittings for the measured communication times. The middleware parameters for the above equations were calculated and are sketched in Fig. 9. Defining φ as the inverse of the data size per stride, the values of φ for several stride patterns are shown in Fig. 10. Again, using a least-square fit for the function f in terms of φ, the modified LogGP model for MPI point-to-point communication can be written as shown in Eq. (6):

T_p2p ≈ o_s + L + o_r + MG + Mf(φ) ≈ o_s + L + o_r + MG + M[c_φ φ + o_φ]    (6)

Itanium2: c_φ = 16.4 µs/byte,  o_φ = 0.001 µs/byte for φ ≠ 0,  o_φ = 0 for φ = 0
HPC2500:  c_φ = 10.7 µs/byte,  o_φ = 0.003 µs/byte for φ ≠ 0,  o_φ = 0 for φ = 0

Fig. 9. Function f(φ) (µs/byte) vs. data sizes per stride.

Fig. 10. φ values of several stride patterns.

We used Eq. (6) to predict the point-to-point communication times and compared the predicted values with the measured data. Our model gives a maximum error of 7%, which is adequate for using the model as a guideline for software development purposes. Note that the difference between the original LogGP model and our proposed model is the introduction of the c_φ and o_φ parameters to analytically represent the communication cost contributed by the heterogeneity of the message. The inclusion of these parameters simplifies the role of the parameter G, so that for one system architecture only one G value exists. In the original LogGP model, the parameter G is not only a function of the system hardware but also a function of the message structure. In the proposed model, the hardware parameter G is a function of the system hardware only. The message structure is analytically modeled by the c_φ and o_φ parameters, which can be determined by multi-level least-square fitting of several experimental data sets.
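To show how Eq. (6) is evaluated, the short helper below computes the modified LogGP estimate with f(φ) = c_φ φ + o_φ. The c_φ and o_φ values in the comment reflect the fitted values quoted above, but the o, G, φ and M values in the example call are placeholders (the φ of a given pattern would be read from Fig. 10), so the printed number is not a prediction from the article.

/* Modified LogGP estimate of Eq. (6):
 *   T_p2p ~= o + M*G + M*(c_phi*phi + o_phi),   with o as defined in Eq. (5).
 * Example fitted values quoted in the text: Itanium2 c_phi = 16.4 us/byte,
 * o_phi = 0.001 us/byte (phi != 0).  All call arguments below are
 * placeholders for illustration. */
#include <stdio.h>

static double modified_logp(double o, double G,
                            double c_phi, double o_phi,
                            double phi, double M)
{
    /* f(phi) vanishes for contiguous messages (phi = 0) */
    double f_phi = (phi == 0.0) ? 0.0 : c_phi * phi + o_phi;
    return o + M * G + M * f_phi;
}

int main(void)
{
    /* hypothetical inputs: o = 36 us, G = 1.6e-3 us/byte,
       c_phi = 16.4, o_phi = 0.001, phi = 0.003, M = 8 KB */
    double t = modified_logp(36.0, 1.6e-3, 16.4, 0.001, 0.003, 8192.0);
    printf("predicted T_p2p = %.1f us\n", t);
    return 0;
}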

4. Detailed analysis

To validate our developed model, we applied the concepts described in the reference to measure the round-trip time based on the following parameters:

• L_eff: The effective latency is the length of time the processor is engaged in the transmission or reception of a non-contiguous message in addition to the cost of a contiguous transfer. This parameter is similar to the term Mf(φ) discussed in the previous section. It is system dependent and is a function of the message size (M) and the message structure (strided or contiguous).
• L_b: The basic latency is the length of time the processor is engaged in the transmission or reception of a contiguous message. It is system dependent and is similar to the communication cost described by the LogGP model for contiguous messages. The basic latency can be a function of the message size (M) under a fixed unit stride (that is, no stride), together with resource contention overheads. If the system resource contention is ignored, this basic latency represents the best case of data transfer on a target system. The lower bound of the parameter equals the transferred data size divided by the hardware bandwidth.

Let superscript un denote the communication from the user space to the network interface, superscript nn denote the communication across the network, and superscript nu denote the communication from the network interface to the user space. We can then represent the detailed point-to-point communication time as shown in Eq. (7):

T_p2p ≈ [L_eff^un(M) + L_b^un] + [L_eff^nn + L_b^nn(M)] + [L_eff^nu(M) + L_b^nu]    (7)

Eq. (7) is based on three implicit transfers along the data transfer path, whose endpoints are the user memory spaces and the network interfaces. In our experiments, the number of processors P can count the processors or memory modules involved in the data transfer process. For simplicity, we assume that the communication cost from the user space to the network interface is the same as the communication cost from the network interface to the user space. As such, we use only the superscript u for user-space parameters, which are the sums of the latencies from the user space to the network interface and from the network interface to the user space. Moreover, the basic latency of the communication network is a linear function of the message size for fixed-size packet transfers across the interconnect, while the effective latency of the network can be assumed to be zero, since packets across the network are always contiguous and fixed in size. As a result, Eq. (7) becomes:

T_p2p ≈ [L_eff^u(M) + L_b^u] + [L_b^nn(M)]    (8)

Fig. 11. Effective latency (µs) vs. message sizes (bytes) with various data sizes per stride on the Itanium2 cluster.

Note that the user-space basic latency is expected not to be a function of the message size, since this parameter represents the set-up overhead; the data transfer time is accounted for by the basic latency of the network. We then measured the detailed communication parameters on the Itanium2 cluster and the Fujitsu HPC2500 system. We first measured the user-space basic latency by performing a same-source-and-destination experiment (in which a processor sends the data to itself) and measuring the total communication time versus the message size. Since this communication time also includes the time of the memory copy, we measured the cost of the "memory copy" (t_memcp) on the systems by direct measurement. Using contiguous messages, the user-space basic latency can then be obtained as follows:

L_b^u = L_b^un + L_b^nu = T_p2p,self^cont(M) − t_memcp(M)    (9)

Based on our measurements, the system-dependent user-space basic latency is almost constant, as expected. From our experiments, the average user-space basic latency is 34.5 µs for the Itanium2 cluster and 10.5 µs for the HPC2500. Since the measured user-space basic latencies are consistent with our previous experimental results, the basic network latency for contiguous messages is in fact a linear function of the message size, as reported in the previous section. The corresponding slopes are 1.58 × 10^-3 µs/byte for the Itanium2 cluster and 6.51 × 10^-3 µs/byte for the Fujitsu HPC2500. Finally, we performed blocking point-to-point same-source-and-destination experiments with various message sizes and various stride patterns. The user-space effective latencies of the two systems were calculated according to Eq. (10), and the corresponding results are sketched in Figs. 11 and 12:

L_eff^u(M) = L_eff^un(M) + L_eff^nu(M) = T_p2p,self^stride(M) − t_memcp(M) − L_b^u    (10)
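The post-processing behind Eqs. (9) and (10) is plain arithmetic on measured times. The sketch below applies it to hypothetical numbers; all timing values are invented for illustration and would in practice come from the self-send experiments and a direct memcpy measurement.

/* User-space latencies from Eqs. (9) and (10).  All input timings below
 * are hypothetical example values. */
#include <stdio.h>

int main(void)
{
    double M = 8192.0;              /* message size in bytes (example)    */
    double t_self_cont   = 42.0;    /* T_p2p,self^cont(M) in us           */
    double t_self_stride = 310.0;   /* T_p2p,self^stride(M) in us         */
    double t_memcpy      = 7.5;     /* t_memcp(M) in us                   */

    double Lb_u   = t_self_cont   - t_memcpy;            /* Eq. (9)  */
    double Leff_u = t_self_stride - t_memcpy - Lb_u;      /* Eq. (10) */

    printf("M = %.0f bytes: Lb^u = %.1f us, Leff^u = %.1f us\n",
           M, Lb_u, Leff_u);
    return 0;
}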

Fig. 12. Effective latency (µs) vs. message sizes (bytes) with various data sizes per stride on the Fujitsu HPC2500.

In these figures, each reported time is the smallest of 30 values, where each value is in fact the average of 100 measurements. We finally used the data measured in this section, together with Eq. (8), to predict the total half round-trip time of point-to-point communication with various stride patterns. The relative average discrepancy between the half round-trip times measured in this section and those measured in the previous section is less than 4%. All of our analyses are for message sizes between 256 bytes and 11 KB. We also observed that on the Itanium2 cluster, manually packing non-contiguous messages into contiguous messages before the data transfer and then manually unpacking the contiguous messages into non-contiguous messages after the data transfer always provides better performance. Similar results were obtained on the Fujitsu HPC2500, but the performance improvement is much lower than the one obtained on the Itanium2 cluster. On average, the performance improvement from manually packing/unpacking non-contiguous messages is about 20% for the Itanium2 cluster and about 6% for the Fujitsu HPC2500.

5. Conclusion

Our modified LogGP model can address the complete costs along the full communication path, which consist of the communication middleware cost inside the memory hierarchy and the interconnect cost across the network. Our work is based on the fact that previous hardware-parameterized models of MPI communication cost do not consider the influence of the memory gap on performance due to the structure of the data stored in memory. Our model is well suited to MPI performance evaluation and prediction, but it requires considerable effort to obtain the model parameters. Moreover, our analysis is limited to a number of regular access patterns, which are mainly useful for vector–matrix operations. Our model can also be extended to MPI broadcast algorithms. The two most important MPI broadcast algorithms are the linear broadcast and the tree-structured broadcast. The complexity of applying our model to broadcast communication will come from the pipelining effects, where many operations are overlapped.

References

[1] M. Baker, R. Buyya, Cluster Computing at a Glance, in: High Performance Cluster Computing, vol. 1, Prentice Hall, 1999 (Chapter 1).
[2] D. Turner, Introduction to Parallel Computing and Cluster Computers, Ames Laboratory, http://www.scl.ameslab.gov/Projects/parallel computing/.
[3] MPI Forum, MPI: A Message-Passing Interface Standard, http://www.mpi-forum.org/docs/mpi11.html.
[4] W. Gropp, E. Lusk, MPICH Working Note: Creating a New MPICH Device Using the Channel Interface, Technical Report ANL/MCS-TM-213, Argonne National Laboratory, 1995.
[5] Developer Connection: http://developer.apple.com/macosx/rendezvous/.
[6] MPI Performance Topics: http://www.llnl.gov/computing/tutorials/mpi performance/.
[7] R. Brightwell, K. Underwood, Evaluation of an Eager Protocol Optimization for MPI, Center for Computation, Computers, Information, and Mathematics, Sandia National Laboratories.
[8] R. Thakur, MPICH on Clusters: Future Directions, Mathematics and Computer Science Division, Argonne National Laboratory, http://www.mcs.anl.gov/~thakur.
[9] Internet Protocol Documents: http://www.cisco.com/univercd/cc/td/doc/cisintwk/ito doc/ip.htm.
[10] G. Banga, J. Mogul, P. Druschel, A scalable and explicit event delivery mechanism for UNIX, in: Proceedings of the Annual USENIX Conference, 1999.
[11] D. Culler, et al., LogP: towards a realistic model of parallel computation, in: Proceedings of the ACM Symposium on Principles and Practice of Parallel Programming, ACM Press, New York, 1993, pp. 1–12.
[12] A. Alexandrov, M.F. Ionescu, K.E. Schauser, C. Scheiman, LogGP: incorporating long messages into the LogP model, in: Proceedings of the Seventh Annual Symposium on Parallel Algorithms and Architectures, ACM Press, New York, 1995, pp. 95–106.
[13] Z. Xu, K. Hwang, Modeling the communication overhead: MPI and MPL performance on the IBM SP2 multicomputer, IEEE Parallel Distribut. Technol. (1996) 9–24.



[14] M.F. Guest, Intel's Itanium IA-64 Processors: Overview and Initial Experiences, CCLRC Daresbury Laboratory, http://www.cse.clrc.ac.uk/arc/ia64.shtml.
[15] N. Izuta, T. Watabe, T. Shimizu, T. Ichihashi, Overview of PRIMEPOWER 2000/1000/800 hardware, Fujitsu Sci. Tech. J. 36 (2) (2000) 121–127.
[16] SPARC64 V Microprocessor Provides Foundation for PRIMEPOWER Performance and Reliability Leadership, D.H. Brown Associates Inc., White Paper prepared for Fujitsu, September 2002.

Thuy T. Le is an associate professor in the Department of Electrical Engineering at San Jose State University. In the past 20 years, he has held several research positions at several U.S. national laboratories and has been a consultant for several private companies. He has served as advisor, general chair, technical program and session chair, and committee member of many international technical conferences and symposiums. Dr. Le earned his B.S., M.S. and Ph.D. degrees, all from the University of California at Berkeley. His current research interests include topics in digital system and logic design, cluster computing and architectures, computational Grids and parallel/distributed computing.

Jalel Rejeb is an assistant professor in the Department of Electrical Engineering at San Jose State University. Dr. Rejeb earned his B.S., M.S. and Ph.D. degrees, all from Syracuse University, New York. His current research interests include topics in networking and digital system design.