Information Processing Letters 75 (2000) 191–197
plapackJava: Towards an efficient Java interface for high performance parallel linear algebra

Eric Gamess
Departamento de Ciencias de la Computación, Universidad Central de Venezuela, Los Chaguaramos, Caracas, Venezuela
E-mail address: [email protected] (E. Gamess)

Received 3 March 2000; received in revised form 1 June 2000
Communicated by R. Backhouse
Abstract

Java is gaining acceptance as a language for high performance computing, as it is platform independent and safe. A parallel linear algebra package is fundamental for developing parallel numerical applications. In this paper, we present plapackJava, a Java interface to PLAPACK, a parallel linear algebra library. This interface is simple to use and object-oriented, with good support for initialization of distributed objects. The experiments we have performed indicate that plapackJava does not introduce a significant overhead with respect to PLAPACK. © 2000 Elsevier Science B.V. All rights reserved.

Keywords: Parallel processing; Java; Parallel linear algebra; PLAPACK
1. Introduction

There is a growing interest in using Java as a language for high performance computing, as it is platform independent, object-oriented and safe. In order for Java to be widely accepted as a language for developing parallel applications, an adequate message passing package has to be included in the Java distributions. Support for parallel linear algebra would also be welcome for scientific programming. Java RMI (Remote Method Invocation) and a socket interface are included in the JDK (Java Development Kit) from Sun Microsystems. They are frequently used for developing client/server applications. However, they are not oriented toward parallel computing, where loads and communications are usually symmetric. Since the JDK lacks a message passing package, a number of projects, such as [8,2,9,11,5], have been proposed. MPI Software
Technology [8] has developed a commercial product, mainly written in Java, that partially implements MPI (Message Passing Interface), a well-known message passing library. This product has the main features of MPI 1 and the dynamic process handling defined in MPI 2. mpiJava [2] is a Java interface that allows the user to call a native MPI library from Java; it implements most of the functions of MPI 1.1. DOGMA [9] (Distributed Object Group Metacomputing Architecture) is an architecture for parallel applications in Java. It contains MPIJ, an implementation of a subset of MPI 1.1 written entirely in Java. Java environments for PVM (Parallel Virtual Machine), another well-known message passing library, have also been developed. jPVM [11], formerly known as JavaPVM, is an interface between a native PVM library and Java. JPVM [5] is an implementation of PVM written in Java; it introduces new concepts such as thread safety and multiple communication end points per thread. These projects do not offer additional support for scientific
computing. The JDK also lacks parallel packages for scientific programming. Well-known parallel linear algebra libraries exist for other languages, mainly Fortran and C, such as ScaLAPACK [3] (Scalable Linear Algebra PACKage) and PLAPACK [1] (Parallel Linear Algebra PACKage). ScaLAPACK has a communication environment, called BLACS (Basic Linear Algebra Communication Subprograms), that is implemented on top of different message passing libraries. However, its interface is difficult to use and the user has no proper support for initialization of distributed objects. PLAPACK is a parallel library for linear algebra that uses MPI for message passing and can be invoked from Fortran 77 or C. It has a simple interface and good support for initialization of distributed objects. PLAPACK offers both basic and advanced parallel functions. Among them, we can mention parallel versions of the functions of BLAS (Basic Linear Algebra Subprograms), functions to solve a system of linear equations with several right-hand-side vectors using an LU or Cholesky factorization, functions to solve least-squares problems using an LQ or QR decomposition, and functions to find the eigenvalues and eigenvectors of a matrix. PLAPACK is mainly implemented in C with calls to BLAS and LAPACK (Linear Algebra PACKage). It is well documented and freely available. 1

1 http://www.cs.utexas.edu/users/plapack.

The high performance computing community has also developed projects to extend Java's functionality so that it offers better support for parallel scientific applications. For instance, a tool called JCI [10] (Java to C Interface generator) was proposed, which allows the user to automatically generate the interface between an existing native library and Java from the corresponding header files. Although this tool facilitates the implementation work, it does not offer a fully object-oriented API. It was used to generate a Java interface to part of MPI and ScaLAPACK. In this paper we introduce plapackJava, a Java interface to PLAPACK that has most of the features of PLAPACK and does not introduce a significant overhead. Since PLAPACK uses MPI for message passing, mpiJava [2] is the communication environment that we use for plapackJava. We also chose mpiJava
since its source code is freely available, 2 its interface is similar to the standard MPI binding for C++, and its implementation with JNI (Java Native Interface) is clean and efficient. plapackJava will be integrated into SUMA [7] (Scientific Ubiquitous Metacomputing Architecture), a metasystem oriented toward the efficient execution of Java bytecode. SUMA extends the Java virtual machine, provides quick access to distributed resources, executes sequential bytecode, and executes parallel bytecode on a group of nodes configured as a Beowulf cluster.
2. plapackJava: A Java interface to PLAPACK

plapackJava stands for “Parallel Linear Algebra PACKage for Java”. It is a set of Java classes that allows the user to call the functions of a native PLAPACK library from Java. plapackJava does not itself offer support for message passing communication, but works with mpiJava. Previous work on message passing performance using mpiJava [6] shows the potential influence of communication granularity when transmitting contiguous messages. In this work, we used a benchmark from the Parkbench [4] suite, called comms1, to estimate the communication parameters of a network by exchanging messages between two nodes. A message is sent by the master process to the slave and immediately returned by the slave. The exchange is repeated many times to obtain an accurate measurement, and the overall time is divided by the number of repeats and by 2, since each exchange consists of two message transfers (the send and its echo). By varying the size of the message, we obtain different points (n, t) in the Cartesian plane, which are fitted by least squares according to formula (1). In this way, comms1 computes r∞ and n1/2, where r∞ is the asymptotic stream rate and n1/2 is the message length giving half the asymptotic performance.

t = (n + n1/2)/r∞.    (1)
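To make the fitting step concrete, the following self-contained Java sketch (not part of comms1 or the Parkbench suite; the class and variable names are ours, and the sample data is the mpiJava column of Table 1 below) fits t = α + β·n by ordinary least squares and recovers r∞ = 1/β and n1/2 = α/β, since t = (n + n1/2)/r∞ = n1/2/r∞ + n/r∞.

// Least-squares fit of t = alpha + beta * n over measured (n, t) pairs.
// Illustrative sketch only: comms1 itself is the Parkbench Fortran 77 code.
public class HockneyFit {
    public static void main(String[] args) {
        double[] n = { 0, 10, 100, 1000, 10000, 100000, 1000000 };    // message sizes in bytes
        double[] t = { 0.31, 0.39, 0.43, 0.67, 4.76, 20.11, 109.11 }; // ms per message (mpiJava column of Table 1)

        int m = n.length;
        double sn = 0, st = 0, snn = 0, snt = 0;
        for (int i = 0; i < m; i++) {
            sn += n[i]; st += t[i]; snn += n[i] * n[i]; snt += n[i] * t[i];
        }
        double beta = (m * snt - sn * st) / (m * snn - sn * sn); // slope: ms per byte = 1/r_inf
        double alpha = (st - beta * sn) / m;                     // intercept: latency in ms = n_1/2 / r_inf

        double rInf = 1.0 / beta;     // asymptotic stream rate in bytes per ms
        double nHalf = alpha / beta;  // message length giving half the asymptotic rate, in bytes
        System.out.printf("r_inf = %.2f Mbits/s, n_1/2 = %.0f bytes%n",
                          rInf * 8 * 1000 / 1e6, nHalf);
    }
}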
The source code of comms1 is available in the Parkbench distribution for Fortran 77 using MPI or PVM. We wrote a Java version of comms1 to measure the overhead introduced by Java when transmitting contiguous messages.

2 http://www.npac.syr.edu/projects/pcrc/mpiJava.
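For reference, a minimal Java sketch of the ping-pong kernel follows. It assumes the mpiJava 1.1 API as we recall it (class mpi.MPI, capitalized Rank/Send/Recv methods); the class name, message size, repetition count and tag are illustrative, and this is not the Parkbench source.

import mpi.*;

// Ping-pong kernel in the spirit of comms1 (illustrative sketch, not the Parkbench code).
// Assumes the mpiJava 1.1 API: mpi.MPI, MPI.COMM_WORLD, capitalized Rank/Send/Recv.
public class PingPong {
    public static void main(String[] args) throws Exception {
        MPI.Init(args);
        int rank = MPI.COMM_WORLD.Rank();
        final int reps = 1000;          // number of exchanges for one message size
        final int n = 10000;            // message size in bytes
        byte[] buf = new byte[n];

        MPI.COMM_WORLD.Barrier();
        long start = System.currentTimeMillis();
        for (int i = 0; i < reps; i++) {
            if (rank == 0) {            // master: send, then wait for the echo
                MPI.COMM_WORLD.Send(buf, 0, n, MPI.BYTE, 1, 99);
                MPI.COMM_WORLD.Recv(buf, 0, n, MPI.BYTE, 1, 99);
            } else if (rank == 1) {     // slave: receive and return the message immediately
                MPI.COMM_WORLD.Recv(buf, 0, n, MPI.BYTE, 0, 99);
                MPI.COMM_WORLD.Send(buf, 0, n, MPI.BYTE, 0, 99);
            }
        }
        long elapsed = System.currentTimeMillis() - start;

        // one exchange = two message transfers, so divide by 2 * reps
        if (rank == 0)
            System.out.println("time per message (ms): " + (double) elapsed / (2 * reps));
        MPI.Finalize();
    }
}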
Fig. 1. Main classes in plapackJava’s class hierarchy.
Table 1
Times in milliseconds to send contiguous messages

Size of message (bytes)    Time for MPI    Time for mpiJava
      0                        0.15              0.31
     10                        0.17              0.39
    100                        0.18              0.43
   1000                        0.32              0.67
  10000                        3.99              4.76
 100000                       19.37             20.11
1000000                      107.72            109.11
Table 1 shows the results obtained from the execution of comms1 between two Pentium II (400 MHz) processors running Red Hat Linux 5.2. We also used MPICH 1.1.2 3 (a popular implementation of MPI), JDK 1.2 for Linux, 4 gcc and g77 version 2.95.1 from GNU (we compiled with optimization level 2) and mpiJava 1.1. Each PC had 512 Mbytes of RAM, and the two were interconnected by a 100 Mbits/s network. The asymptotic stream rate is 72.6 Mbits/s for MPI and 72.1 Mbits/s for mpiJava. The results seem to indicate that the communication performance of a Java parallel program that uses coarse-grain communication (a lot of big messages) will not be inferior to that of its Fortran 77 equivalent. However, the communication performance of a parallel program that uses fine-grain communication (a lot of small messages) can be affected by the Java overhead.

3 http://www-unix.mcs.anl.gov/mpi/mpich.
4 http://www.blackdown.org.
plapackJava implements most of the functions of PLAPACK with an object-oriented interface and is freely available at http://suma.ldc.usb.ve/desarrollo/plapackJava. plapackJava has been implemented in Java and JNI. JNI allows Java bytecode to interoperate with applications and libraries written in other languages such as C and C++. A diagram depicting the hierarchical relationships among the main classes of plapackJava is shown in Fig. 1. PLAObject is the base class from which we derive classes for the different objects of plapackJava, such as Vector, Mvector, Pvector, Pmvector, Mscalar and Matrix.
In plapackJava, the processors of a parallel computer are arranged in a two-dimensional array. The processor array can be created with one of the following two methods. The first method specifies the number of row-wise and column-wise processors of the Cartesian topology; the second takes a suggested ratio nprows/npcols. Class Comm is defined in mpiJava and is the base class for communicators.

Comm PLA.Comm_1D_to_2D(Comm comm1D, int nprows, int npcols);
Comm PLA.Comm_1D_to_2D_ratio(Comm comm1D, float ratio);
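As an illustration, the sketch below reshapes the six MPI processes of the example that follows into a 2 × 3 grid with either method. It assumes mpiJava's MPI.COMM_WORLD as the one-dimensional communicator and that the two creation methods are static methods of a class PLA; the import of the plapackJava classes is omitted because the paper does not give their package name.

import mpi.*;
// (import of the plapackJava classes omitted: the package name is not given in the paper)

// Sketch: reshaping a one-dimensional communicator into the 2 x 3 processor grid of Fig. 2.
public class GridSetup {
    public static void main(String[] args) throws Exception {
        MPI.Init(args);

        // first method: give the grid dimensions explicitly (2 rows, 3 columns)
        Comm grid = PLA.Comm_1D_to_2D(MPI.COMM_WORLD, 2, 3);

        // second method: give only the suggested nprows/npcols ratio
        Comm gridByRatio = PLA.Comm_1D_to_2D_ratio(MPI.COMM_WORLD, 2.0f / 3.0f);

        MPI.Finalize();
    }
}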
Fig. 2. Two-dimensional array of 2 × 3 processors.
Instances of class Template are patterns used to create plapackJava objects; they define the distribution block size as well as the lower bound of the indices to be used (e.g., 0 as in C or 1 as in Fortran 77). An object of type Vector is a vector to be distributed over the two-dimensional array of processors, according to the distribution block size. When creating a Vector object, the user specifies the element type (MPI.INT, MPI.FLOAT or MPI.DOUBLE), the length, the associated pattern and the relative origin. To illustrate these concepts, let us suppose that we have a parallel computer with 6 processors. With method PLA.Comm_1D_to_2D, we organize the processors in a two-dimensional array of 2 × 3 processors, as shown in Fig. 2. The declaration:

Template myTemplate = new Template(2, 0);

creates a pattern with a distribution block of two elements and specifies that the plapackJava objects instantiated with this pattern have 0 as their first index (see Fig. 3). The association between blocks (Bi) and processors (Pj,k) is made in a round-robin fashion: B0 = [t0 t1]T is associated with P0,0, B1 = [t2 t3]T with P1,0, B2 = [t4 t5]T with P0,1, and so on. The declaration:

Vector vectorX = new Vector(MPI.DOUBLE, 6, myTemplate, 3);
creates a vector of 6 double-precision elements that is aligned with element 3 of pattern myTemplate. In this way, [x0] corresponds to B1, which in turn is associated with P1,0; [x1 x2]T corresponds to B2, which is associated with P0,1; [x3 x4]T corresponds to B3, which is associated with P1,1; and [x5] corresponds to B4, which is associated with P0,2. In other words, [x0] is stored in P1,0, [x1 x2]T is stored in P0,1, [x3 x4]T is stored in P1,1, and [x5] is stored in P0,2.
An object of type Mvector is a group of aligned vectors of the same length to be distributed over the two-dimensional processor array, according to the distribution block size. An object of type Mscalar is atomic, that is, it is not distributed. However, an Mscalar object can be replicated on several processors, and one instance of the Mscalar object can be modified without changing the other instances. In order to create an Mscalar object, the user specifies the type of its elements (MPI.INT, MPI.FLOAT or MPI.DOUBLE), the number of rows and columns, and the coordinates of the processors where instances are to be stored, with respect to the two-dimensional array of processors. The constants PLA.ALL_ROWS and PLA.ALL_COLS allow the user to specify that an Mscalar object should be replicated on all of the processors in a row or on all of the processors in a column. For instance, the following declaration creates a replicated Mscalar object of size 1 × 1 with elements of type MPI.DOUBLE on all of the processors in row 2:

Mscalar myMscalar = new Mscalar(MPI.DOUBLE, 2, PLA.ALL_COLS, 1, 1, myTemplate);

An object of type Matrix is a matrix to be distributed over the two-dimensional processor array, according to the distribution block size.
Fig. 3. Example of a pattern and its associated creation of a vector.
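Putting the pieces of this example together, a sketch follows. The plapackJava imports and any further library initialization are omitted, since the paper does not show them; the constructor calls are the ones quoted above.

import mpi.*;
// (plapackJava imports omitted: the package name is not given in the paper)

// Sketch for the 6-processor scenario of Figs. 2 and 3.
public class VectorSetup {
    public static void main(String[] args) throws Exception {
        MPI.Init(args);

        // 2 x 3 processor grid, as in Fig. 2
        Comm grid = PLA.Comm_1D_to_2D(MPI.COMM_WORLD, 2, 3);

        // distribution blocks of two elements, first index 0 (Fig. 3)
        Template myTemplate = new Template(2, 0);

        // 6 double-precision elements aligned with element 3 of the pattern:
        // x0 falls in B1 (stored on P1,0), x1..x2 in B2 (P0,1),
        // x3..x4 in B3 (P1,1) and x5 in B4 (P0,2)
        Vector vectorX = new Vector(MPI.DOUBLE, 6, myTemplate, 3);

        // a 1 x 1 double-precision Mscalar replicated on all processors of a row
        Mscalar myMscalar = new Mscalar(MPI.DOUBLE, 2, PLA.ALL_COLS, 1, 1, myTemplate);

        MPI.Finalize();
    }
}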
In order to create an object of type Matrix, the user specifies, among other parameters, the type of its elements (MPI.INT, MPI.FLOAT or MPI.DOUBLE) as well as the number of rows and columns.
Classes UniArray, BiArray and their derived classes implement the interface between Java arrays and native arrays. With these classes and methods such as API_axpy_vector_to_global and API_axpy_global_to_vector for vectors, and API_axpy_matrix_to_global and API_axpy_global_to_matrix for matrices, the user can transfer data from the local memory of a processor to a distributed object and vice versa.
Type checking in plapackJava is stricter than in PLAPACK's C interface. For example, the prototypes of function Trmv (TRiangular Matrix–Vector multiplication) in PLAPACK using the C interface and in plapackJava are:
int PLA_Trmv(int uplo, int trans, int diag, PLA_Obj A, PLA_Obj x);

void PLA.Trmv(UpperLower uplo, Transpose trans, UnitDiagonal diag, PLAObject A, PLAObject x);

where A and x are the triangular matrix and the vector to be multiplied, uplo specifies whether the matrix is upper or lower triangular, trans allows the user to transpose the matrix or not before the multiplication, and diag indicates whether the matrix has a unit diagonal. The user is forced to supply coherent arguments in the plapackJava version, since the Java type system is strict and uplo, trans and diag have types UpperLower, Transpose and UnitDiagonal, respectively. All of these parameters are plain integers in the original version of PLAPACK, so there is no value check at compile time (see the sketch after the Swap prototypes below). Apart from classes UpperLower, Transpose and UnitDiagonal, plapackJava has other classes, such as APIMode, APIState, ProjectDirection and Side, which implement compile-time coherence checking. Class IntHolder is used to pass integers by reference to plapackJava methods, since Java passes primitive types only by value.
plapackJava provides an interface to most functions of BLAS and has methods that operate upon both global plapackJava objects and the local portions of these objects. For instance, method PLA.Swap allows the user to exchange objects obj1 and obj2 (see the prototypes below) in a global form, while method PLA.Local_swap exchanges the local parts of the two objects stored on the processor where the method is called. The first method is collective and must be called by all processors, while the second can be called by a single processor.

void PLA.Swap(PLAObject obj1, PLAObject obj2);
void PLA.Local_swap(PLAObject obj1, PLAObject obj2);
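As a sketch of the compile-time checking argument, consider the fragment below. The member names used (UpperLower.LOWER, Transpose.NO_TRANSPOSE, UnitDiagonal.NONUNIT) are hypothetical, since the paper names only the classes and not their members; A and x stand for a previously created triangular Matrix and a Vector.

// Hypothetical member names: the paper only names the classes UpperLower,
// Transpose and UnitDiagonal, not the constants they define.
class TrmvCheck {
    static void multiply(PLAObject A, PLAObject x) {
        PLA.Trmv(UpperLower.LOWER, Transpose.NO_TRANSPOSE, UnitDiagonal.NONUNIT, A, x);

        // Swapping the first two arguments changes their static types, so javac
        // rejects the call; the analogous mistake with the C prototype, where
        // all three flags are plain ints, compiles silently:
        // PLA.Trmv(Transpose.NO_TRANSPOSE, UpperLower.LOWER, UnitDiagonal.NONUNIT, A, x);
    }
}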
3. Experiments

In this section we present the experiments we designed to estimate the overhead introduced by Java in plapackJava. If A is a non-singular matrix and b is a vector, the problem of solving the linear system Ax = b can be tackled with an LU decomposition in plapackJava and in PLAPACK, as shown in Fig. 4.

Fig. 4. Sentences to solve a system of linear equations with an LU decomposition.

If n is the matrix size, time is the total execution time and nprocs is the number of processors, the average number of floating-point operations executed by each processor in one second is

(2/3) · n³/(time × nprocs).

We built two programs (one in Java and the other in C) to estimate the relative performance of plapackJava with respect to PLAPACK. Both programs solve randomly generated real linear algebra problems, according to the following parallel scheme:
• Process 0 randomly generates a matrix (column-wise) with values uniformly distributed in [−1.0, 1.0] and distributes the columns to the other processes. We adopted this generation scheme to use the available memory efficiently. Process 0 also generates a random solution vector with values in [−1.0, 1.0] and sends it to the other processes.
• The group of processes computes the right-hand-side vector with function PLA.Gemv (GEneral Matrix–Vector multiplication), which multiplies the matrix by the solution vector.
• The group of processes solves the problem as indicated in Fig. 4. When the matrix is found to be numerically singular, a message is sent to the user.
• The norm of the vector obtained by subtracting the computed solution from the randomly generated one is displayed, in order to verify that everything is correct.
Table 2
Results of experiments (time in seconds and performance in Mflops/process)

Matrix size    Time for PLAPACK   Performance of PLAPACK   Time for plapackJava   Performance of plapackJava
1000 × 1000         13.63                 6.11                    13.75                    6.06
2000 × 2000         52.77                12.63                    52.96                   12.58
3000 × 3000        116.85                19.25                   117.47                   19.15
4000 × 4000        242.72                21.97                   243.45                   21.91
5000 × 5000        410.97                25.34                   411.56                   25.31
6000 × 6000        674.03                26.70                   674.78                   26.67
Table 2 shows the elapsed times of these two programs on a Beowulf cluster with eight Pentium II (400 MHz) processors, as described in Section 2. We also used the standard BLAS available at Netlib, 5 LAPACK 3.0 available at Netlib, and PLAPACK 1.2. We compiled with optimization level 2. As a check of the formula above, the 6000 × 6000 problem gives (2/3) · 6000³/(674.78 × 8) ≈ 26.7 Mflops per process for plapackJava, in agreement with the last row of Table 2. Our results seem to indicate that the overhead introduced by the Java classes is negligible, as was expected, since our Java interface does not perform any run-time data conversion or array copies. To avoid array copies, we simulate bidimensional arrays in Java with unidimensional arrays, since Java only has unidimensional arrays. In other words, the Java overhead is not proportional to the matrix size, since our interface was designed to work directly on the dereferenced Java arrays rather than on copies.

5 http://www.netlib.org.
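A small self-contained illustration of this flattening follows. It is not plapackJava code; the column-major layout is chosen here only because it is the BLAS/LAPACK convention.

// Minimal illustration: a dense m x n matrix kept in a single one-dimensional
// array, in column-major order, so that the whole matrix can cross JNI as one
// contiguous array without copying.
public class FlatMatrix {
    final int m, n;
    final double[] a;          // column j starts at offset j * m

    FlatMatrix(int m, int n) {
        this.m = m; this.n = n;
        this.a = new double[m * n];
    }

    double get(int i, int j)           { return a[i + j * m]; }
    void   set(int i, int j, double v) { a[i + j * m] = v; }

    public static void main(String[] args) {
        FlatMatrix A = new FlatMatrix(3, 2);
        A.set(2, 1, 5.0);
        System.out.println(A.get(2, 1)); // prints 5.0
    }
}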
4. Conclusions and future work

In this paper we have presented plapackJava, a Java interface to PLAPACK that has good support for initialization of distributed objects. plapackJava implements most of the PLAPACK functions with a similar interface. Our experiments suggest that plapackJava does not introduce a significant overhead. plapackJava will be integrated into SUMA, a metasystem oriented toward the efficient execution of Java bytecode. It will allow a SUMA user to develop sequential applications with calls to parallel linear algebra methods that will be transparently executed on a group of remote nodes of
a Beowulf cluster. In SUMA, the user will also have support for developing parallel applications with calls to a native PLAPACK library. Although our preliminary results indicate that the overhead introduced by the interface is not significant, the poor performance of Java is something that must be taken into account. The technology of the Java virtual machine has improved significantly with the release of JIT (Just In Time) compilers, but the performance of a Java application is still inferior to that of more traditional languages. By the end of July 1999, Cygnus 6 released the first public domain compiler for Java source and bytecode. We expect to improve the performance of plapackJava and of the Java bytecode by using the Cygnus compiler. We also need to explore more closely the integration of plapackJava and SUMA.

6 http://www.cygnus.com.

References

[1] P. Alpatov, G. Baker, C. Edwards, J. Gunnels, G. Morrow, J. Overfelt, R. van de Geijn, Y. Wu, PLAPACK: Parallel Linear Algebra Package, in: Proc. SIAM Parallel Processing Conference, 1997.
[2] M. Baker, B. Carpenter, G. Fox, S.H. Ko, X. Li, mpiJava: A Java MPI interface, in: Proc. First UK Workshop on Java for High Performance Network Computing, 1998.
[3] J. Choi, J. Dongarra, R. Pozo, D. Walker, ScaLAPACK: A scalable linear algebra library for distributed memory concurrent computers, IEEE Computer Society Press, 1992, pp. 120–127.
[4] J. Dongarra, T. Hey, The Parkbench benchmark collection, Supercomputer 11 (2–3) (1995) 94–114.
[5] A. Ferrari, JPVM: Network parallel computing in Java, in: Proc. ACM 1998 Workshop on Java for High-Performance Network Computing, 1998, http://www.cs.ucsb.edu/conferences/java98/papers/jpvm.ps.
[6] E. Gamess, E. Hernández, Rendimiento del paso de mensajes contiguos y no contiguos en Java usando MPI [Performance of contiguous and non-contiguous message passing in Java using MPI], in: Proc. XXV Conferencia Latinoamericana de Informática, CLEI 1999, 1999.
[7] E. Hernández, Y. Cardinale, C. Figueira, A. Teruel, SUMA: A Scientific Metacomputer, in: Proc. ParCo99, Delft, Holland, August 1999.
[8] G. Crawford III, Y. Dandass, A. Skjellum, The JMPI™ Commercial Message Passing Environment and Specification: Requirements, Design, Motivations, Strategies, and Target Users, http://www.mpi-softtech.com/publications.
[9] G. Judd, M. Clement, Q. Snell, DOGMA: Distributed Object Group Metacomputing Architecture, in: Proc. ACM 1998 Workshop on Java for High-Performance Network Computing, 1998, http://www.cs.ucsb.edu/conferences/java98/papers/dogma.ps.
[10] S. Mintchev, V. Getov, Automatic binding of native scientific libraries to Java, in: Proc. ISCOPE, Marina del Rey, CA, Lecture Notes in Comput. Sci., Vol. 1343, Springer, Berlin, December 1997, pp. 129–136.
[11] D. Thurman, jPVM, http://www.chmsr.gatech.edu/jPVM.