Performance evaluation of virtual execution environments for intensive computing on usual representations of multidimensional arrays

Francisco Heron de Carvalho Junior*, Cenez Araújo Rezende

Mestrado e Doutorado em Ciência da Computação, Universidade Federal do Ceará, Brazil
Article history: Received 19 April 2014; Received in revised form 16 March 2016; Accepted 22 April 2016.

Keywords: Virtual execution; High performance computing; Performance evaluation; Java virtual machine (JVM); Common Language Infrastructure (CLI)
Abstract

This paper evaluates the sequential performance of virtual execution environments (VEE) belonging to the CLI (Common Language Infrastructure) and JVM (Java Virtual Machine) standards, for the usual approaches to representing multidimensional arrays in programs that perform intensive array access. Such programs are mostly found in scientific and engineering application domains. It shows that the performance profile of virtual execution is still highly influenced by the choice of both the multidimensional array representation and the VEE implementation, contradicting some results of previous related works, as well as some beliefs of HPC programmers that come from their practice with natively executed languages such as C, C++, and Fortran. Finally, recommendations are provided both for HPC programmers who want to take advantage of VEE performance for their array-intensive code and for VEE developers who are interested in improving the performance of virtual execution for the needs of HPC applications.
1. Introduction

Object-oriented programming languages based on virtual execution environments (VEE), such as Java and C#, are consolidated in the software industry, motivated by the security, safety and interoperability they provide to software products and by the productivity their supportive programming environments bring to software development and maintenance. These languages have caught the attention of Fortran and C users from computational sciences and engineering, who are interested in modern techniques and tools for improving the productivity of programming large-scale software in their particular domains of interest [1]. However, most of them still consider that the performance of VEEs does not meet the needs of numeric- and array-intensive HPC (High Performance Computing) applications.

In the mid-1990s, the performance bottlenecks of the JVM (Java Virtual Machine) for HPC were widely investigated [2]. New techniques were proposed to circumvent them, also implemented by VEEs that follow the CLI (Common Language Infrastructure) standard [3], such as .NET and Mono. In the mid-2000s, most of these efforts moved to the design of new high-level programming languages for HPC [4]. However, during this period, the performance of virtual machines was significantly improved by just-in-time (JIT) compilation. Also, CLI introduced new features for addressing HPC requirements, such as rectangular arrays, a direct interface to native code, language interoperability and unsafe memory pointers.
* Corresponding author. E-mail addresses: [email protected] (F.H. de Carvalho Junior), [email protected] (C.A. Rezende).
http://dx.doi.org/10.1016/j.scico.2016.04.005
However, after a decade of advances in the design and implementation of VEEs in the 2000s, there is still a lack of rigorous studies about their serial performance in the context of HPC applications [5]. Recent works have evaluated current JVM machines by taking parallel processing into account, including communication and synchronization costs [6,7], both in shared-memory and distributed-memory platforms, using a subset of NPB-JAV [8], a Java implementation of the NPB (NAS Parallel Benchmarks) suite [9]. Also, to our knowledge, W. Vogels reported the only comprehensive performance comparison between JVM and CLI machines published in the literature [10]. Finally, the existing works barely evaluate the different approaches to implementing multidimensional arrays in Java and C#, even though most of the computation-intensive algorithms implemented in scientific computing programs have multidimensional arrays as their main data structure. This fact becomes critical for performance, because the access time to values stored in multidimensional arrays may dominate the total execution time of these programs. The sources of overhead in intensive access to multidimensional arrays depend on the way the arrays are represented, and include poor cache performance, pointer indirections, index calculation arithmetic, array-bounds-checking (ABC), and so on.

This paper evaluates the performance of a set of numeric programs that perform intensive access to the elements of multidimensional arrays, with the following goals in mind:
• to compare the current performance of the virtual execution engines that follow the CLI and JVM standards across different approaches to implementing multidimensional arrays;
• to quantify the current performance overhead of virtual execution, compared to native execution; and
• to identify bottlenecks in the most popular current implementations of virtual machines, regarding the support of multidimensional arrays.

As far as we know, this paper is the first to report a comprehensive study of the performance of the distinct approaches to implementing multidimensional arrays in the most popular programming languages based on virtual execution, which are:
• embedded arrays, where the list of indexes for accessing a value in a multidimensional array is mapped to an index of a unidimensional array;
• jagged arrays, also known as arrays-of-arrays, where an array of n dimensions, with n > 1, is implemented as a unidimensional array of references to arrays of n − 1 dimensions;
• rectangular arrays, where array elements are stored in contiguous memory addresses, as in Fortran and C.

The results of this paper show that some results reported in previous related works cannot be generalized, either to all VEEs belonging to the same standard (e.g. JVM or CLI) or to all numeric- and array-intensive programs. Therefore, they complement the results of these related works [6,11,10,8], which lack an analysis of the different forms of implementing multidimensional arrays. Also, they show that some common-sense beliefs of HPC programmers are not valid depending on the choice of VEE, and they yield useful recommendations that help these programmers optimize the performance of array-intensive code according to the choice of VEE and multidimensional array representation. Finally, VEE developers may find, in this paper, evidence of performance bottlenecks that may help them improve the performance of VEE implementations. It is not one of the objectives of this work to identify the source of each observed performance bottleneck. To that end, a deeper systematic study is necessary, including the design and use of special-purpose microbenchmarks for testing a potentially large number of hypotheses, which we plan as future work. The data collected in the experiments reported in this paper are available at http://npb-for-hpe.googlecode.com.

In what follows, Section 2 introduces virtual execution technology and motivates the research work reported in this paper. Section 3 presents the methodology adopted in the experiments. Section 4 summarizes the data collected in the experiments and discusses their meaning. Finally, Section 5 concludes the paper, emphasizing its contributions and suggesting ideas for further work.

2. Context and related works

Programming languages based on virtual execution environments (VEE) abstract away the hardware and operating systems of computer systems. In heterogeneous environments, the benefits of virtual execution, such as security and cross-platform portability, are clear, in particular for applications that are distributed across a network, such as the internet.

Java was launched by Sun Microsystems in the 1990s. The Java Virtual Machine (JVM) has been implemented for most of the existing computer and operating system platforms. Its intermediate representation, called bytecode, is dynamically compiled through just-in-time (JIT) compilation. The JIT compiler dictates the performance of virtual execution engines such as the JVM. The most important industrial-strength implementations of the JVM are now supported by Oracle [12] and IBM [13]. Oracle has two JVM implementations, called JRockit and HotSpot, whereas J9 is the name of IBM's JVM implementation. JRockit was first introduced by Appeal Virtual Machines. It was acquired by BEA Systems in 2002, and later by Oracle in 2008. Since 2011, JRockit has been made free and publicly available by Oracle. In turn, the HotSpot virtual machines have been owned by Oracle since the acquisition of Sun Microsystems in 2010. Oracle maintains both a commercial and an open-source
implementation of HotSpot, sharing the same execution kernel. OpenJDK (http://openjdk.java.net) versions are released before the corresponding versions of the commercial HotSpot JVM. OpenJDK is currently in version 1.9, whereas the commercial version of HotSpot is in version 1.8. In turn, the last release of JRockit is 1.6, since Oracle, due to the acquisition of Sun, has merged the JRockit and HotSpot development initiatives. Java programmers consider that JRockit has better performance than HotSpot for long-running computations, due to the different JIT compilation strategies they implement. This is the reason that motivated us to include JRockit, currently a legacy technology, in this study.

The Common Language Infrastructure (CLI) [3] is an industrial standard introduced in the beginning of the 2000s by Microsoft and its partners, specifying virtual execution machines with distinguishing features, most of them useful for scientific computing programs with HPC requirements, such as:
• support for multiple programming languages and their interoperability [14];
• a strongly typed and polymorphic virtual machine code, called IL (Intermediate Language);
• a component-based architecture, with version control support;
• ahead-of-time (AOT) compilation, for reducing the impact of JIT compilation on execution time;
• rectangular arrays (Java only supports embedded and jagged arrays);
• unsafe execution using pointers, as in C, and turning array-bounds-checking (ABC) off;
• user-controlled compiler optimizations.
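As a brief illustration of two of these features (our own C# example, not taken from the standard; compiling it requires the /unsafe option), rectangular arrays are indexed with commas and unsafe code may bypass bounds checking through raw pointers:

    public class CliFeatures
    {
        public static unsafe void Main()
        {
            // Rectangular array: elements stored contiguously, comma-indexed.
            double[,] m = new double[4, 4];
            m[2, 3] = 1.0;

            // Unsafe execution: pinning the array and accessing it through a raw
            // pointer bypasses array-bounds-checking, as in C.
            fixed (double* p = &m[0, 0])
            {
                p[2 * 4 + 3] = 2.0;   // same element, addressed by a flat offset
            }
        }
    }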
CLR (Common Language Runtime) [15] and Mono [16] are interoperable implementations of CLI. CLR is the virtual execution engine of the .NET framework, developed by Microsoft for the Windows platform, whereas Mono is an open-source project started by Novell, with implementations for a wide range of platforms, including Linux, Windows and OS X. Mono and its related products are now owned by Xamarin [17]. In turn, IKVM.NET [18] is an implementation of the JVM standard on top of Mono and CLR, also including the standard Java class libraries and tools for Java and Mono/CLR interoperability.

2.1. Virtual Execution Environments (VEE) in HPC

At the end of the 1990s, the programming productivity and architecture interoperability of Java attracted the attention of programmers from the scientific and engineering communities. However, the poor performance of bytecode interpretation, the difficulties in supporting common loop optimizations, and the absence of rectangular arrays were considered prohibitive for using Java in the implementation of applications with HPC requirements. This fact motivated a number of researchers to quantify the performance bottlenecks of the JVM and to propose solutions to circumvent them [19,20], including language extensions [21], support for multidimensional arrays [22,23], special-purpose scientific computing libraries [24–26], interfaces for interoperability with native scientific computing libraries [27], and compiler support [28], including loop optimizations [29] and array-bounds-checking (ABC) elimination [30–33]. Despite the many advances motivated by these efforts, VEEs were considered attractive for HPC software only with the inception of the first efficient JIT compilers in the commercial VEEs, in the beginning of the 2000s.

.NET attempted to narrow the performance gap between native and virtual execution by introducing rectangular arrays and JIT optimizations. W. Vogels [10] was the first to compare CLI and JVM implementations, showing that CLR performance is competitive with the performance of the best JVM implementations, but without significant gains. Surprisingly, jagged arrays outperformed rectangular arrays in CLR, which was justified by the optimizations that JIT compilers apply to code based on jagged arrays. In 2003, Frumkin et al. [8] proposed NPB-JAV, a multi-threaded Java implementation of the NAS Parallel Benchmarks (NPB) [9], reporting that the performance of Java was still lagging far behind Fortran and C. Also in 2003, Nikishkov et al. [34] compared the performance of Java and C, using an object-oriented design of a three-dimensional finite element code, reporting similar performance conditioned on code tuning and the correct choice of JVM implementation.

In a special issue of the Concurrency and Computation: Practice and Experience journal devoted to the 2001 Java Grande–ISCOPE (JGI'2001) conference, published in 2003, two papers deserve special attention. The paper of Riley et al. [35] reported slowdown factors between 1.5 and 2.0 in carefully designed object-oriented Java implementations of real-world applications from computational fluid dynamics (CFD), running on JVM implementations from Sun and IBM. In turn, the paper of Bull et al. [36] directly compared the performance of the Java Grande benchmarks [37], re-written in C and Fortran, with their performance in Java, across a number of JVM implementations and execution platforms, showing small performance gaps, particularly on Intel Pentium.
The most recent works of other authors on the performance evaluation of virtual machines have been published by Taboada et al. [6] and Amedro et al. [7], with a focus on parallel programming. In contrast to the previous works, they presented results showing a negligible difference between the performance of virtual and native execution. However, their evaluation was based only on HotSpot. The results of Section 4 will show that some conclusions of these works cannot be generalized to other JVM implementations. Also, the evaluation of pure sequential execution, not affected by the communication
and synchronization costs of parallel execution, is important, since efficient sequential code is the first prerequisite for obtaining efficient parallel programs.

2.2. Multidimensional arrays

In computation-intensive programs from scientific and engineering domains, arrays of multiple dimensions are the most used data structures. Even graphs, another important data structure, are represented as arrays and matrices presenting particular patterns (e.g. the location of non-zero elements). For this reason, HPC programmers have developed techniques for efficient access to the elements of multidimensional arrays, most of them based on assumptions about their layout in memory and on particular patterns and distributions of their values, attempting to optimize spatial and temporal locality for improving cache performance [5]. To that end, Fortran, C and C++, the most popular programming languages in scientific computing, support rectangular arrays, whose elements may be accessed as in standard mathematical notation and are stored consecutively in memory.

Java was conceived without rectangular arrays, which has been viewed as an annoying limitation for using Java in HPC, together with bytecode interpretation. Java supports jagged arrays (arrays-of-arrays), where an N1 × N2 × · · · × Nk-dimensional array (k dimensions, with Ni elements in the i-th dimension, k > 1) is represented as a unidimensional array whose N1 elements are pointers to N2 × · · · × Nk-dimensional arrays. As a consequence, only the Nk elements in the last dimension of the array are guaranteed to be consecutive in memory. The memory requirements of jagged and rectangular arrays are different: jagged arrays require more memory space, due to the additional pointers for dereferencing the first k − 1 indexes of an array with k dimensions. In turn, the CLI designers decided to introduce rectangular arrays, as well as other features considered relevant in HPC, such as cross-language interoperability, integration with unmanaged code, ahead-of-time (AOT) compilation, and unsafe execution (e.g. turning off array-bounds-checking).

Due to the lack of support for rectangular arrays, Java HPC programmers often implement multidimensional arrays by using the common C/C++ technique for implementing dynamic rectangular arrays: embedding them into unidimensional arrays through an explicit mapping of their indexes to the indexes of the unidimensional array, ensuring that array elements are placed consecutively in memory. Such arrays are called embedded arrays in this paper. For instance, let a be an N1 × N2 × · · · × Nk embedded array and let A be the unidimensional array that represents a in a Java program. In the case of row-major memory layout, the element a_{i1,i2,...,ik} is mapped to the element A[index(i1, i2, ..., ik)], where
\[
  index(x_1, x_2, \ldots, x_k) \;=\; \sum_{i=1}^{k} \left( x_i \cdot \prod_{j=i+1}^{k} N_j \right).
\]
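As a concrete illustration, the sketch below (our own code, not taken from NPB) realizes this mapping in C# for a three-dimensional array. The EE versions encapsulate the arithmetic in methods like Index, whereas the EI versions inline the same expression at every access site, e.g. data[(i1 * n2 + i2) * n3 + i3] directly in the loop body:

    public sealed class Embedded3D
    {
        private readonly double[] data;  // unidimensional backing array
        private readonly int n2, n3;     // extents of the two trailing dimensions

        public Embedded3D(int n1, int n2, int n3)
        {
            this.data = new double[n1 * n2 * n3];
            this.n2 = n2;
            this.n3 = n3;
        }

        // The index function above for k = 3:
        // x1*(N2*N3) + x2*N3 + x3, factored as (x1*N2 + x2)*N3 + x3.
        private int Index(int i1, int i2, int i3)
        {
            return (i1 * n2 + i2) * n3 + i3;
        }

        public double Get(int i1, int i2, int i3) { return data[Index(i1, i2, i3)]; }

        public void Set(int i1, int i2, int i3, double v) { data[Index(i1, i2, i3)] = v; }
    }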
In the practice of HPC programmers, the code of the index function is "inlined", either explicitly, by the programmer, or implicitly, by the compiler, in order to avoid the cost of function calls. These techniques, which originate from the experience of HPC programmers with native languages, are followed by the original NPB-JAV implementation. However, this paper shows that some VEEs may suffer from overheads when such a common-sense optimization approach is adopted.

Another point to notice is that rectangular arrays are not implemented by the CLI compiler through embedded arrays, for the same reason that dynamically allocated multidimensional arrays cannot be implemented as rectangular arrays in C/C++, where only static arrays may be rectangular. Rectangular arrays are always dynamically created in C#, through the new constructor. In the current implementation, access to the entries of C# rectangular arrays with multiple dimensions is implemented by calls to a low-level function (Get), as the reader can see in Fig. 6(c). This is the reason why it is relevant to compare the performance of rectangular and embedded arrays in CLI-compliant VEEs. In fact, our results confirm that embedded and rectangular arrays present different performance figures.

3. Methodology

Achieving the goals of the experimental study announced in Section 1 requires the use of realistic code from HPC applications. For this reason, we have adopted NPB (NAS Parallel Benchmarks), a benchmark suite derived from CFD (Computational Fluid Dynamics) applications by the Advanced Supercomputing Division of NASA for evaluating the performance of parallel computing platforms [9]. NPB has reference implementations written by expert HPC programmers in Fortran, C and Java [8]. It has been widely used over the years for evaluating the performance of parallel programming languages, parallelizing compilers, and parallel execution environments.

NPB comprises eight benchmark programs, including five kernels and three simulated applications. The kernels have been designed for evaluating specific features of parallel computing platforms that are considered relevant for the performance of CFD applications. In turn, the simulated applications evaluate these platforms in real-world scenarios. Indeed, the kernels may be used to assist in the identification and characterization of performance bottlenecks in the execution of the simulated applications.

Since this study is interested in evaluating and comparing how different serial virtual execution engines perform in practical scenarios, instead of evaluating and finding bottlenecks in computing platforms, we restrict our study to the three simulated applications. They can provide better insight into the performance of VEEs in realistic scenarios.
Also, we have included the FT kernel, since it has some characteristics of the simulated applications. Thus, the NPB programs selected for the purposes of the experiment reported in this paper are:
• SP (Scalar Pentadiagonal Linear System Solver), which uses the ADI (Alternating Direction Implicit) method to solve three sets of pentadiagonal uncoupled systems of equations in the x, y, and z axes;
• BT (Block-Tridiagonal Linear System Solver), which differs from SP in that the systems of equations are block-tridiagonal;
• LU (LU Factorization), which solves the Navier–Stokes equations [38] of fluid flow problems using the SSOR (Symmetric Successive Over-Relaxation) method;
• FT (Fast Fourier Transform), which solves a three-dimensional Poisson partial differential equation (PDE) using the Fast Fourier Transform (FFT) method.

NPB specifies a set of problem classes, which define standard workloads to be applied to the programs, based on realistic problem instances. They are referenced by letters. S and W were originally designed for testing purposes, whereas A, B, C, and so on, have been defined to represent realistic workload levels. For the purposes of this paper, we adopt the problem classes S, W, A, and B.

3.1. Experimental factors

The following experimental factors and levels have been used for this performance evaluation:
• program: SP, BT, LU and FT;
• problem class: S, W, A and B;
• VEE (virtual execution engine):
  – IBM J9 JVM 1.8.0, build 2.7 (J9);
  – Oracle HotSpot 1.8.0_51, build 25.0-b70 (HotSpot);
  – IKVM 7.2.4630.5, Mono version 2.10.8.1 (IKVM);
  – Oracle JRockit 1.6.0_45, build 1.6.0_45-b06 (JRockit)²;
  – OpenJDK 1.9.0-internal, build 1.9.0-internal-admcad_2014_02_04_16_47-b00 (Open);
  – Mono 4.2.0 (Mono);
  – .NET 4.0.3 (CLR).
• array kind (multidimensional array representation):
  – embedded arrays with inline index arithmetic (EI);
  – embedded arrays with index arithmetic encapsulated in methods (EE);
  – row-major jagged arrays (JR);
  – column-major jagged arrays (JC);
  – row-major rectangular arrays (RR);
  – column-major rectangular arrays (RC).

The appendix presents the output of each VEE when queried for its version by means of the following commands³:
• \bin\java -version, for the standard virtual machines (J9, HotSpot, JRockit, Open);
• \bin\ikvm.exe -version, for IKVM (IKVM);
• \bin\mono -version, for Mono (Mono).

We have also collected execution times for native execution, using the standard implementations of SP, BT, LU and FT. The first three programs are written in C, whereas the last one is written in Fortran. We have compiled the C programs with GCC 4.8.4. In turn, the Fortran program has been compiled with fort77 1.15. All programs have been compiled with the "-O3" flag (full optimization).

3.2. Derivation of program versions

We have derived sequential Java versions of SP, BT, LU and FT from NPB-JAV [8], supported by NPB 3.0. NPB-JAV corresponds to the EI versions. For supporting the other kinds of multidimensional arrays, versions of each program have been derived for EE, JR, and JC. From the Java versions, the C# versions were derived. The RR and RC versions have been derived only in C#, which supports rectangular arrays.
2 After the acquisition of Sun in 2010, Oracle decided to fix 1.6.0 as the last release of JRockit, moving towards the idea of merging JRockit with HotSpot in further versions. Since JRockit is considered by the community as a better alternative than HotSpot for running scientific programs in Java, we decided to include it in the experiment.
3 , and are the paths where the JVMs, IKVM and Mono are installed, respectively.
EI:
    for(k=1;k<=nz-2;k++)
      for(j=jst-1;j<=jend-1;j++)
        for(i=ist-1;i<=iend-1;i++)
          for(m=0;m<=4;m++)
            rsd[m+i*isize1+j*jsize1+k*ksize1] = dt * rsd[m+i*isize1+j*jsize1+k*ksize1];

EE:
    for(k=1;k<=nz-2;k++)
      for(j=jst-1;j<=jend-1;j++)
        for(i=ist-1;i<=iend-1;i++)
          for(m=0;m<=4;m++)
            rsd[index1(k,j,i,m)] = dt * rsd[index1(k,j,i,m)];

JR:
    for(k=1;k<=nz-2;k++)
      for(j=jst-1;j<=jend-1;j++)
        for(i=ist-1;i<=iend-1;i++)
          for(m=0;m<=4;m++)
            rsd[k][j][i][m] = dt * rsd[k][j][i][m];

JC:
    for(k=1;k<=nz-2;k++)
      for(j=jst-1;j<=jend-1;j++)
        for(i=ist-1;i<=iend-1;i++)
          for(m=0;m<=4;m++)
            rsd[m][i][j][k] = dt * rsd[m][i][j][k];

RR:
    for(k=1;k<=nz-2;k++)
      for(j=jst-1;j<=jend-1;j++)
        for(i=ist-1;i<=iend-1;i++)
          for(m=0;m<=4;m++)
            rsd[k,j,i,m] = dt * rsd[k,j,i,m];

RC:
    for(k=1;k<=nz-2;k++)
      for(j=jst-1;j<=jend-1;j++)
        for(i=ist-1;i<=iend-1;i++)
          for(m=0;m<=4;m++)
            rsd[m,i,j,k] = dt * rsd[m,i,j,k];

Fig. 1. SSOR iteration of LU in C# (similar in Java, except for rectangular arrays).
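Complementing Fig. 1, which shows only the array accesses, the sketch below (our own illustration, assumed to appear inside a method body; the identifiers follow the LU code, but the extents nx, ny and nz are assumptions) indicates how rsd would be declared and instantiated in C# under the main representations:

    int nx = 64, ny = 64, nz = 64;   // assumed extents, for illustration only

    // EI/EE: a flat unidimensional array plus precomputed strides, so that
    // rsd[m + i*isize1 + j*jsize1 + k*ksize1] addresses element (k, j, i, m).
    double[] rsd = new double[5 * nx * ny * nz];
    int isize1 = 5, jsize1 = 5 * nx, ksize1 = 5 * nx * ny;

    // JR: row-major jagged array, accessed as rsdJR[k][j][i][m]; in C# each
    // level must be allocated explicitly (Java allows new double[nz][ny][nx][5]).
    double[][][][] rsdJR = new double[nz][][][];
    for (int k = 0; k < nz; k++)
    {
        rsdJR[k] = new double[ny][][];
        for (int j = 0; j < ny; j++)
        {
            rsdJR[k][j] = new double[nx][];
            for (int i = 0; i < nx; i++)
                rsdJR[k][j][i] = new double[5];
        }
    }

    // RR: row-major rectangular array, accessed as rsdRR[k, j, i, m].
    double[,,,] rsdRR = new double[nz, ny, nx, 5];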
The translation of NPB-JAV to the other multidimensional array representations, as well as to C#, is trivial. It has been necessary to modify only the pieces of code where multidimensional arrays are declared, instantiated, and accessed. For instance, Fig. 1 shows the code of the SSOR (Symmetric Successive Over-Relaxation) iteration of LU in each multidimensional array representation, in C#. The code in Java is similar, for embedded and jagged arrays.

3.3. Performance measures

The performance measures have been collected on a Dell Studio XPS 8100 desktop computer, equipped with an Intel Core i5 processor (cores: 4, RAM: 8 GB, clock: 2.66 GHz). The operating system is Ubuntu 12.04.4 LTS (GNU/Linux 3.8.0-37-generic x86_64).

An experimental case is defined as a combination of levels of the experimental factors. For each experimental case, we have collected a sample of execution times from 10 runs, with outlier elimination, defining the set of observations. The sample size has been determined from a preliminary experiment with 100 runs of problem classes S and W, where we calculated that the sample size does not need to exceed 10 in order to achieve acceptable confidence in the use of the average as the central tendency measure, assuming a normal distribution.

4. Results and discussion

The experimental data is summarized in Figs. 2, 3, 4 and 5. Each figure presents the experimental cases for one level of program (SP, BT, LU and FT, respectively), including four charts, one for each level of problem class (S, W, A and B, respectively). Thus, for a given program and problem class, each bar chart shows the average execution times for the experimental cases combining levels of VEE and array kind. The charts also include the average times of the native runs, to ease the visual comparison between native and virtual execution. So, by looking at a bar chart, one may compare the performance of the VEEs and of native execution across all multidimensional array representations considered in this work, for a given program and problem class. The reader should take care when visually comparing data between different bar charts, since they are drawn at different scales.

To help the reader interpret the results, Figs. 8 and 9 present a view of the experimental data from another perspective. Each figure contains a VEE × program matrix of charts, for the VEEs compliant with the JVM and CLI standards, respectively. Thus, each bar chart presents the ratio of the virtual to the native execution time for all experimental cases that combine levels of array kind and problem class for a given combination of program and VEE.

We start our analysis by comparing the column-major and row-major versions of all experimental cases, i.e. by comparing JR with JC and RR with RC. As expected, the execution times of the column-major versions are greater than the execution times of the row-major versions for all the experimental cases. Also, one may observe, by looking at Figs. 8 and 9, that the overhead of column-major traversal tends to increase with the workload (from smaller to bigger problem classes).
Fig. 2. SP benchmark – performance figures (average execution time in seconds).
We credit this effect to a faster decrease in cache performance, compared to row-major traversal, due to the loss of data locality when computing with bigger arrays. For the row-major versions, this effect is observed only in some exceptional cases. So, since the column-major versions are always a worse choice in terms of performance when compared with the row-major versions, we focus our attention on the row-major versions of jagged and rectangular arrays (JR and RR).

The two Oracle HotSpot VEEs (HotSpot and Open) present similar performance, even though they support different Java versions (1.8 versus 1.9), for all programs and problem classes. This is expected, since they share the same JIT core. In fact, both have been included in the experiment only for confirming the hypothesis that they may be used interchangeably for HPC programs. For this reason, we refer only to HotSpot, to simplify the following performance analyses. Table 1 compares the performance of the server and client versions of HotSpot and J9 (version 1.7), for embedded and row-major jagged arrays, showing no significant difference between client and server executions of HotSpot, contradicting previous works and the belief of Java HPC programmers. We credit this result to the optimizations that Oracle developers have introduced into the client version of the HotSpot implementations since version 1.6 [39]. In turn, for J9, most of the experimental cases show that executions in server mode outperform executions in client mode by small factors. For these reasons, the following analyses are all based on the execution of the server versions of HotSpot and J9.

In order to structure the discussion, Section 4.1 first compares JR (jagged arrays) with EI/EE (embedded arrays), the most usual approaches among Java HPC developers, since they present better performance than RR (rectangular arrays) in both Mono and .NET. In turn, the observations and discussions about the performance of RR are presented in Section 4.2.

4.1. Jagged versus embedded arrays

Table 2 shows that J9 and HotSpot present contradicting performance results for EI and JR. For J9, EI outperforms JR for all the programs, as expected, by a nearly constant factor across problem classes. On the other hand, JR outperforms EI for HotSpot, except for FT. Moreover, by looking at the HotSpot charts of Fig. 8, HotSpot presents outstanding performance overheads for EI in SP, BT and LU for all problem classes. For J9, also by observing Fig. 8, this is observed only for problem class S.
Fig. 3. BT benchmark – performance figures (average execution time in seconds).
This is a surprising result for HPC programmers, since most of them believe that embedded arrays with inline index arithmetic must be preferred, to avoid the overhead of jagged array indirections. Taboada et al. reported this effect before in an older version of HotSpot [6], using NPB-JAV (EI). It is explained by the fact that the size of a method is one of the criteria used by the HotSpot JIT compiler for triggering its dynamic compilation. Thus, they recommend factoring the code of loops into simple, independent methods as much as possible, to reduce the code size of methods. Since the JIT compiler is able to inline methods, the performance of array access may be accelerated by encapsulating array indexing arithmetic in reusable methods. After applying these refactorings, they reported high performance speedups in NPB-JAV. This relevant finding, first reported by Taboada et al. and now confirmed in this study, concerning the unexpectedly high overhead of EI in HotSpot, is the reason why we have included the EE versions in the study reported in this paper.

Table 3(a) compares EI and EE for the HotSpot VEEs, showing the speedup of using EE instead of EI for SP, BT and LU, confirming the findings of Taboada et al. However, the execution times with EE continue to be slower than those with JR, as shown in Table 4. Moreover, Table 3(b/c) shows that the common sense of HPC programmers is valid for the other VEEs, both JVM and CLI compliant, for which EE causes significant performance overhead compared to EI. The only exception is J9 when executing FT, where EE outperforms EI by a small factor. In addition, surprisingly, EE causes overhead for HotSpot when executing FT. These observations show that the recommendations of Taboada et al. can be generalized neither to all JIT compilers nor to all numeric- and array-intensive programs, as they suggest.

Table 2 also shows that JRockit and J9 present similar performance characteristics, since embedded arrays outperform jagged ones, and the encapsulation of index arithmetic in methods makes embedded arrays worse than jagged ones. The only exception occurs for SP, where JR outperforms EI in JRockit. Also, for BT, the ratio of EI to JR is smaller for JRockit than for J9. However, JRockit does not show significantly higher overheads for problem class S, as observed for the other JVM implementations (J9, HotSpot and Open), i.e. JIT compilation time does not penalize executions with small workloads, like S, for JRockit.
Fig. 4. LU benchmark – performance figures (average execution time in seconds).
Finally, the most important observation in the comparison between JRockit and the other standard JVMs is that it is outperformed by all of them for almost all experimental cases with SP, BT and LU. The only exceptions are:
• FT, where the execution times of JRockit are close to those of J9, HotSpot and Open, with a slight advantage for JRockit over J9;
• EI, for HotSpot, due to the high overheads observed for the HotSpot VMs.

The better performance of HotSpot compared to JRockit contradicts the belief of Java programmers that JRockit is better for long-running computations, since it is tuned for server execution and has the ability to perform deeper runtime optimizations that cannot be performed by HotSpot. However, remember that JRockit outperforms HotSpot only for embedded arrays (EI). This is probably the origin of that belief, since most of these programmers adopt the embedded representation for multidimensional arrays.

Concerning the CLI implementations (IKVM, Mono, and CLR), Table 5 shows the ratios of the best execution times obtained using one of them to the best execution times using a JVM implementation, for both EI and JR. It can be noticed that they present higher execution times compared to the JVM implementations, for both EI and JR, mainly for workloads bigger than S. Also, in contrast to the previous version of this study [40], where we compared Mono 2.11.3 and .NET 4.0.3, Mono now outperforms CLR for all programs and problem classes, with respect to the experimental cases involving both jagged and embedded arrays. In fact, Table 6 shows a decrease in the execution times of Mono from version 2.11.3 to version 4.2 (particularly significant for jagged arrays), simultaneously with an increase in the execution times of CLR from version 4.0.3 to 4.6.

By looking at Table 7, one may observe that it is not possible to conclude whether EI or JR presents better performance for the CLI implementations (Mono and CLR). For instance, EI performs better for BT and FT. For SP and LU, we found contrasting results. For SP, JR presents better performance for both Mono and CLR. However, the advantage of JR for Mono is insignificant, whereas it is significant for CLR. In turn, for LU, JR is better than EI for Mono, whereas EI is better than JR for CLR. Finally, the performance curve suggests that the ratio of EI to JR decreases as the problem instance increases.
Fig. 5. FT benchmark – performance figures (average execution time in seconds).
Table 1
Ratio of server to client execution times for version 1.7 of HotSpot and J9.

            HotSpot                     J9
        S     W     A     B         S     W     A     B
EI  SP  1.01  1.01  1.00  1.00      0.64  1.01  0.96  0.91
    BT  1.00  1.00  1.00  1.00      0.70  1.01  0.91  0.91
    LU  1.00  1.00  1.01  1.00      0.61  0.91  0.87  0.68
    FT  1.00  1.00  0.99  1.00      0.90  1.07  0.93  0.93
JR  SP  1.00  0.99  0.95  1.00      0.63  0.91  0.87  0.84
    BT  0.99  0.99  0.99  1.00      1.13  1.04  0.97  0.96
    LU  1.00  1.00  0.97  1.00      1.04  0.96  0.93  0.97
    FT  0.95  1.00  0.99  1.00      0.85  0.77  1.00  1.05
Table 2
Ratio of EI to JR for HotSpot, J9 and JRockit.

        HotSpot                       J9                        JRockit
     S      W      A      B       S     W     A     B       S     W     A     B
SP   5.55  21.94  18.63  18.34    0.70  0.72  0.66  0.65    0.70  1.13  1.14  1.10
BT   3.60   5.40   6.84   6.63    0.51  0.82  0.85  0.82    0.23  0.26  0.26  0.25
LU   4.77  29.90  31.04  26.94    0.58  0.81  0.83  0.84    0.52  0.58  0.59  0.59
FT   0.60   0.51   0.48   0.46    0.52  0.60  0.67  0.57    0.55  0.73  0.85  0.78
Table 3
Ratio of the execution time of EE to the execution time of EI, for each VEE.

(a)
        HotSpot                   Open
     S     W     A     B       S     W     A     B
SP   0.36  0.10  0.10  0.10    0.35  0.09  0.10  0.10
BT   0.59  0.47  0.46  0.46    0.77  0.47  0.48  0.52
LU   0.31  0.07  0.07  0.08    0.34  0.08  0.07  0.08
FT   1.03  1.23  1.27  1.14    0.97  1.23  1.29  1.15

(b)
        J9                        JRockit                   IKVM
     S     W     A     B       S     W     A     B       S     W     A     B
SP   0.73  2.30  1.63  1.70    2.07  1.86  1.84  1.79    2.08  2.09  2.08  2.07
BT   1.24  3.38  4.27  3.99    3.88  3.27  3.21  3.24    3.34  3.36  3.32  3.35
LU   0.77  1.98  1.45  1.37    1.98  2.30  2.36  2.34    2.17  2.24  2.22  2.21
FT   0.83  0.75  0.79  0.73    1.37  1.32  0.97  0.96    1.67  1.69  1.65  1.64

(c)
        Mono                      CLR
     S     W     A     B       S     W     A     B
SP   1.69  1.70  1.68  1.69    1.45  1.50  1.50  1.51
BT   1.38  1.38  1.38  1.38    1.32  1.34  1.34  1.35
LU   1.92  1.97  1.96  1.96    0.82  1.54  1.56  1.55
FT   1.03  1.03  1.02  1.02    1.09  1.07  1.07  1.06
Table 4
Ratio of EE to JR for HotSpot and Open.

        HotSpot                   Open
     S     W     A     B       S     W     A     B
SP   1.99  2.10  1.84  1.80    1.88  2.16  1.92  1.91
BT   2.12  2.55  3.18  3.05    2.73  2.54  3.37  3.53
LU   1.48  2.24  2.25  2.18    1.59  2.33  2.37  2.34
FT   0.61  0.62  0.61  0.53    0.60  0.63  0.65  0.58
Table 5
Ratio of the best execution time of CLI VEEs to the best execution time of JVMs, for EI (a), EE (b), and JR (c).

        (a) EI                (b) EE                (c) JR
     S    W    A    B      S    W    A    B      S    W    A    B
SP   0.9  1.9  2.1  2.1    0.8  2.3  2.1  2.1    0.7  2.3  2.1  2.1
BT   0.9  1.6  1.8  1.9    0.5  2.3  3.0  3.0    1.4  2.3  3.0  3.0
LU   1.0  1.6  1.6  1.5    0.8  1.9  2.1  2.1    0.7  1.9  2.1  2.1
FT   2.2  2.9  3.0  2.9    2.4  1.7  1.8  1.8    1.5  1.7  1.8  1.8
Table 6
Comparing current and previous versions of Mono and .NET.

            Mono 4.2 / 2.11.3           .NET 4.6 / 4.0.3
         S     W     A     B         S     W     A     B
SP  EI   0.95  0.96  0.95  0.96      1.19  1.18  1.19  1.17
    JR   0.74  0.73  0.71  0.71      1.19  1.17  1.14  1.16
    RR   1.00  1.02  0.98  0.99      1.13  1.17  1.15  1.16
BT  EI   0.94  0.95  0.95  0.95      1.18  1.18  1.14  1.15
    JR   0.70  0.70  0.69  0.70      1.16  1.18  1.17  1.18
    RR   1.01  1.01  0.99  1.00      1.17  1.16  1.18  1.19
LU  EI   0.90  0.94  0.94  0.94      1.26  1.19  1.18  1.17
    JR   0.63  0.72  0.72  0.71      1.14  1.17  1.16  1.15
    RR   1.00  1.03  1.00  1.02      1.18  1.16  1.17  1.18
FT  EI   0.92  0.93  0.92  0.92      1.17  1.16  1.17  1.16
    JR   0.82  0.83  0.79  0.81      1.16  1.17  1.15  1.16
    RR   0.97  0.98  0.96  0.97      1.16  1.15  1.16  1.17
Table 7
Ratio of EI to JR for Mono and CLR.

        Mono                      CLR
     S     W     A     B       S     W     A     B
SP   1.09  1.05  1.02  1.01    1.46  1.36  1.33  1.28
BT   0.60  0.59  0.59  0.58    0.61  0.61  0.60  0.59
LU   1.02  0.97  0.89  0.84    2.98  1.53  1.44  1.29
FT   0.85  0.83  0.83  0.79    0.87  0.87  0.84  0.78

(a) u[k][j][i][m]:
    ldarg.0
    ldfld float64[][][][] u
    ldloc.2
    ldelem.ref
    ldloc.1
    ldelem.ref
    ldloc.0
    ldelem.ref
    ldloc.3
    ldelem.r8

(b) u[m+i*isize1+j*jsize1+k*ksize1]:
    ldarg.0
    ldfld float64[] u
    ldloc.3
    ldloc.0
    ldarg.0
    ldfld int32 isize1
    mul
    add
    ldloc.1
    ldarg.0
    ldfld int32 jsize1
    mul
    add
    ldloc.2
    ldarg.0
    ldfld int32 ksize1
    mul
    add
    ldelem.r8

(c) u[k, j, i, m]:
    ldarg.0
    ldfld float64[0...,0...,0...,0...] u
    ldloc.2
    ldloc.1
    ldloc.0
    ldloc.3
    call instance float64 float64[,,,]::Get(int32, int32, int32, int32)

Fig. 6. CIL code for accessing a multidimensional array in Mono: (a) jagged (JR), (b) embedded (EI), (c) rectangular (RR).
Fig. 6 compares the CIL code generated by the C# compiler for accessing the array u in SP and BT, using the jagged (a), embedded (b) and rectangular (c) multidimensional array representations, respectively. An access to a jagged array is performed by a sequence of indirection operations (ldelem.ref), one for each index, followed by a fetch operation (ldelem.r8). The JIT compiler emits array-bounds-checking code to protect the ldelem instructions. In turn, the unidimensional (embedded) version performs arithmetic to calculate the array index, followed by a fetch operation. Indirection operations perform very fast when the data is in cache, explaining why jagged arrays may present better performance than index calculation arithmetic.

Fig. 7 presents another perspective on the minimal overheads of the virtual machines, in relation to native execution, for each combination of array representation, program and problem class (except S). It is complemented by Table 8, which adds the information of which array representation leads to the minimal virtual execution overhead for each virtual machine, including problem class S. The first positions are mostly dominated by J9 using EI and HotSpot/Open using JR, mainly for bigger workloads in SP, BT and LU. For FT, embedded arrays present the best performance for all experimental cases. However, surprisingly, in FT, the HotSpot VEEs with EI outperform the other VEEs, contradicting the high overheads of EI observed for the other programs (SP, BT and LU). Finally, notice that EE outperforms EI for the experimental cases combining FT and J9. These are the only experimental cases where EE outperforms EI outside the HotSpot VEEs.

From the above observations, we conclude that the best results across all the experimental cases have been obtained by J9 using EI, followed by HotSpot using JR. As expected, their performance is still worse than that of native execution, by the factors presented in Table 9, confirming, for large workloads, the results found by Riley et al. [35]. However, we think that the advantages of using object-oriented languages, regarding modularity and abstraction, make them competitive for bigger workloads, where virtual execution presents better results. Table 9 also shows the average and worst overheads for all virtual machines, showing that bad choices of virtual machine and array representation lead to high overheads, even on average.

For HPC programmers, it is a surprise that JR may be competitive with EI in Java, since they avoid jagged arrays because their elements are not consecutive in memory, making it difficult to take advantage of the spatial locality of array accesses for better performance of the system's memory hierarchy. This belief comes from their familiarity with C/C++ and Fortran. Together, these beliefs and the results of Taboada et al. [6] suggest that Java HPC programmers should use embedded arrays with indexing arithmetic encapsulated in methods (EE), with methods for accessing the elements through multiple dimensions, avoiding the scattering of index calculation arithmetic throughout the source code. However, this paper shows that the performance of this approach depends on the underlying VEE.
For J9 and JRockit, it is indeed better to use embedded arrays as an alternative to jagged arrays, and better still to inline the indexing arithmetic. On the other hand, inlining is prohibitive for HotSpot and Open, where embedded arrays, even in the EE form, are outperformed by jagged arrays. These results evidence the poor performance portability of VEEs for the needs of HPC, since programmers must write code whose efficiency depends on the choice of VEE. Unfortunately, despite the differences observed between FT and the other programs, the results of this paper are still not enough to suggest reliable criteria for making these choices through a systematic procedure.
Fig. 7. Ratios of the best execution time of each virtual machine to the corresponding native execution time.
Table 8
Ranking the VEEs according to their performance compared with the native version.

          1st          2nd          3rd          4th          5th          6th          7th
SP  S     Mono/JR      .NET/JR      JRockit/EI   IKVM/EI      Open/JR      HotSpot/JR   J9/EE
    W     Open/JR      HotSpot/JR   J9/EI        JRockit/JR   Mono/JR      .NET/JR      IKVM/EI
    A     Open/JR      J9/EI        HotSpot/JR   JRockit/JR   Mono/JR      .NET/JR      IKVM/EI
    B     Open/JR      J9/EI        HotSpot/JR   JRockit/JR   Mono/JR      .NET/JR      IKVM/EI
BT  S     Mono/EI      IKVM/EI      Open/JR      HotSpot/JR   JRockit/EI   .NET/EI      J9/EI
    W     J9/EI        HotSpot/JR   Open/JR      Mono/EI      IKVM/EI      JRockit/EI   .NET/EI
    A     J9/EI        Open/JR      HotSpot/JR   Mono/EI      IKVM/EI      JRockit/EI   .NET/EI
    B     J9/EI        Open/JR      HotSpot/JR   Mono/EI      IKVM/EI      JRockit/EI   .NET/EI
LU  S     Mono/JR      IKVM/EI      JRockit/EI   .NET/JR      HotSpot/JR   Open/JR      J9/EE
    W     Open/JR      HotSpot/JR   J9/EI        JRockit/EI   .NET/JR      Mono/EI      IKVM/EI
    A     Open/JR      HotSpot/JR   J9/EI        JRockit/EI   Mono/EI      .NET/JR      IKVM/EI
    B     Open/JR      HotSpot/JR   J9/EI        JRockit/EI   Mono/EI      IKVM/EI      .NET/JR
FT  S     Open/EI      HotSpot/EI   JRockit/EI   IKVM/EI      .NET/EI      Mono/EI      J9/EE
    W     Open/EI      HotSpot/EI   J9/EE        JRockit/EI   IKVM/EI      .NET/EI      Mono/EI
    A     Open/EI      HotSpot/EI   J9/EE        JRockit/EE   IKVM/EI      .NET/EI      Mono/EI
    B     Open/EI      HotSpot/EI   J9/EE        JRockit/EE   .NET/EI      IKVM/EI      Mono/EI

Notes: For each program and problem class, each entry is obtained by first taking the lowest execution time achieved by each VEE across all representations, and then ranking the VEEs according to this execution time, from 1st (best) to 7th (worst). This ranking can also be obtained by reading Fig. 7; the table adds the information about which array representation achieved the best result for each virtual machine.
Table 9
Comparing the overall performance of virtual execution with native execution.(a)

        Best case                   Worst case(b)                   Average(b)
     S     W     A     B        S      W      A      B          S      W      A      B
SP   4.00  1.81  1.84  1.76     49.72  41.73  36.37  34.71      14.58   9.01   9.33   9.63
BT   3.35  2.12  1.71  1.67     44.59  25.54  24.51  24.95      12.37   9.27   8.85   9.03
LU   4.70  2.30  2.03  2.24     77.52  72.16  66.91  65.21      23.60  12.13  12.00  12.73
FT   2.75  1.94  1.77  1.98     20.51  20.51  25.96  28.32       8.87   8.53   8.93  10.34

(a) Each entry in the table is the ratio of the corresponding execution time (best, worst or average among all VEEs, across all array representations) to the execution time of the native version. (b) Ignoring the column-major versions (JC and RC).
Even so, the knowledge that multidimensional jagged arrays may outperform their embedded counterparts in programs executing on the popular and freely available HotSpot VEEs (HotSpot and Open) is useful for helping HPC programmers improve the quality of their code, since jagged arrays are the natural way to implement multidimensional arrays in Java, promoting safety and code readability. On the other hand, if they have existing code using embedded arrays, they do not need to refactor it to use jagged arrays to take advantage of better performance. It is only a matter of choosing the appropriate VEE. In this situation, the better choice is J9.
Table 10
Ratio of RR to the best of EI and JR, for Mono and CLR.

        Mono                  CLR
     S    W    A    B      S    W    A    B
SP   3.8  3.7  3.4  3.4    3.0  2.8  2.7  2.6
BT   7.3  7.3  7.1  7.1    3.7  3.7  3.7  3.7
LU   4.4  4.5  4.4  4.4    2.0  3.5  3.3  3.0
FT   2.7  2.8  2.7  2.7    1.9  2.0  1.9  1.9
Table 11
Performance of rectangular arrays.

        (a)                   (b)
     S    W    A    B      S    W    A    B
SP   1.3  1.3  1.2  1.2    3.0  6.8  5.9  5.8
BT   1.4  1.4  1.4  1.4    5.1  8.4  9.3  9.4
LU   0.9  1.4  1.3  1.3    4.9  6.5  6.9  6.1
FT   1.6  1.6  1.6  1.6    4.2  5.6  5.8  5.5

(a) Ratio of the execution times of Mono to the execution times of CLR, for RR. (b) Ratio of the execution times of CLR to the best execution times across all the other virtual machines, for RR.
4.2. Rectangular versus jagged and embedded arrays

As discussed before, the CLI implementations have introduced support for rectangular arrays, also referred to as true multidimensional arrays, for the needs of scientific computing code, where it is important to traverse array elements under the assumption that they are stored consecutively in memory, taking advantage of data locality to improve the performance of the system's memory hierarchy. However, the results of all experimental cases with rectangular arrays are disappointing, for both Mono and CLR. Compared to EI and JR, RR performs significantly worse, as pointed out in Table 10. In most experimental cases, both for Mono and CLR, RR is outperformed even by JC. Indeed, CLR outperforms Mono for rectangular arrays. To quantify the differences, Table 11 shows the ratio of execution times between Mono and CLR for RR (a) and between RR on CLR and the best execution time across all the evaluated VEEs (b).

The results obtained for rectangular arrays in this paper reproduce the previous results of this study [40], which had confirmed the much earlier findings of Vogels [10]. Fig. 6(c) helps explain the overhead of rectangular array access. The reason is the dynamic nature of rectangular arrays in C#, since they are created as objects, using the new constructor. So, the sizes of the dimensions are not always known at JIT compilation time, and they are necessary for calculating the addresses of the array elements in memory. The solution of the designers of Mono and CLR is to implement rectangular array access through a call to a low-level function (Get), instead of specific CIL instructions as used for embedded and jagged arrays (Fig. 6(a/b)). This is the same reason why rectangular arrays are supported only for statically allocated arrays in C/C++. We think that the study of how to optimize rectangular array access could be a relevant research topic for improving the implementation of the VEEs that support them.

4.3. Particular features of Mono

Mono provides some additional features for improving the performance of program execution. In the following paragraphs, we discuss the most important ones.
JIT optimizations. Mini, the Mono JIT compiler, supports a set of compiler optimizations that may be applied to the CIL code. Users may enable each optimization separately through the -optimize flag. In this study, the experimental cases with Mono have been executed with all optimization flags enabled, under the assumption that HPC users will try the most aggressive alternatives to reduce execution time. Table 12(a) compares the optimized execution times with the execution times obtained by omitting the optimization flag (default), for Mono 2.11.3. For EI and JR, no performance gains have been observed. For RR, enabling all optimizations results in significant gains, between 10% and 16%.

ABC disabling. The overhead of array-bounds-checking (ABC) motivates many HPC programmers to avoid safe programming languages, such as Java and C#, despite the sophisticated static analysis techniques applied to ABC elimination [33]. Table 12(b) presents the speedup obtained by turning off ABC in EI, JR and RR, in Mono. For EI and JR, ABC may be turned off by setting the unsafe optimization flag (-O=unsafe). However, this approach does not apply to RR, forcing us to disable the emission of ABC code in the source code of Mono, which requires recompilation. The ABC overhead of JR and RR is greater than that of EI, since more dimension bounds must be checked. Also, the speedups are nearly independent of the problem class (array size).
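For concreteness, and assuming a program compiled to prog.exe (the file name is ours, for illustration), these options are passed on the Mono command line:

• mono --optimize=all prog.exe, for enabling all JIT optimizations;
• mono -O=unsafe prog.exe, for removing array-bounds checks from embedded and jagged array accesses.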
Table 12
Optimizations and ABC in Mono [40].

            (a) Optimizations       (b) ABC off
         S     W     A           S     W     A
EI  SP   1.03  1.03  1.03        1.08  1.07  1.07
    BT   1.01  1.01  1.01        1.15  1.12  1.11
    LU   0.95  1.00  1.00        1.11  1.11  1.11
    FT   0.97  0.97  0.97        1.04  1.04  1.04
JR  SP   1.04  1.04  1.05        1.28  1.30  1.28
    BT   0.98  0.98  0.99        1.26  1.27  1.27
    LU   0.88  1.03  1.03        1.33  1.34  1.32
    FT   0.96  0.96  0.95        1.11  1.12  1.12
RR  SP   1.10  1.10  1.10        1.13  1.14  1.13
    BT   1.16  1.15  1.16        1.31  1.29  1.29
    LU   1.12  1.12  1.13        1.21  1.19  1.20
    FT   1.13  1.14  1.12        1.02  1.02  1.02
5. Conclusions and further works

This paper reports an evaluation of the sequential performance of current implementations of virtual execution engines (VEEs) based on the CLI and JVM standards for different representations of multidimensional arrays, using realistic array-intensive programs, mostly found in computer applications from science and engineering domains. Concerning the three goals stated in Section 1, we have confirmed the hypothesis of the high dependence of the performance of these programs on the choice of multidimensional array representation and VEE implementation. The main observations are:

1. JVM implementations outperform CLI implementations for bigger workloads (W, A and B);
2. In J9 and HotSpot/Open, JIT compilation causes significant performance overheads for problem class S (small problem instances), making CLI implementations competitive for S;
3. The most efficient combinations of VEE and multidimensional array representation are J9 with embedded arrays (EI) and HotSpot/Open with row-major jagged arrays (JR);
4. Virtual execution is slower than native execution by factors between 1.83 and 2.32 (Table 9) for realistic workloads (A and B);
5. In CLI implementations, rectangular arrays continue to be outperformed by both jagged and embedded arrays;
6. JRockit is outperformed by both J9 and HotSpot.

The above observations help us to provide useful hints for guiding HPC programmers in some scenarios, which we characterize by means of the questions in the following paragraphs.

Given an existing array-intensive Java program that adopts a given multidimensional array representation, which is the best choice of JVM implementation for executing the program as rapidly as possible, provided that the developer is not interested in refactoring the code?

In most cases, row-major jagged arrays (JR) and embedded arrays with inline index calculation arithmetic (EI) are the two multidimensional array representations that we expect to find in existing array-intensive Java code. The former is the typical choice of a Java programmer with little or no background in programming for HPC, whereas the latter is the typical choice of an HPC programmer, accustomed to writing code in C/C++, who is forced to write code in Java for some reason (e.g. the NPB developers working on the Java reference implementation, NPB-JAV). From the third observation, the developer may choose J9 for the EI case. For the experimental cases used in this paper, this is a better alternative than refactoring the code to obtain an EE version in order to accelerate execution on a HotSpot JVM, as suggested by Taboada et al. [6]. In turn, a HotSpot JVM, either open-source or commercial, continues to be the best choice for the JR case. These choices are valid for big workloads, since dynamic JIT optimizations, as performed by J9 and HotSpot/Open, may be attractive for long-running computations, amortizing the costs of dynamic execution analysis and compilation. On the other hand, JIT compilation causes high overheads for small workloads, which is the reason why the current CLI implementations (IKVM, CLR, and Mono) perform better than the current JVM implementations for problem class S.

Given a VEE implementation of interest, what is the best choice of multidimensional array representation and programming language (Java or C#) for implementing a new array-intensive program, provided that the developer cannot move to another virtual execution engine?
Fig. 8. Relative performance of VEEs, using native execution times as the baseline (JVM).

Table 13
Ratio of the execution times of JR/HotSpot to the best execution times across all array representations and VEEs.

      SP    BT    LU    FT
S     2.2   1.2   3.4   1.8
W     1.0   1.2   1.0   1.9
A     1.0   1.0   1.0   2.0
B     1.0   1.1   1.0   2.0

From the third observation, JR is the best choice for a developer who is forced to use either HotSpot or Open. Indeed, the fact that JR outperforms EE may surprise HPC developers who have read the paper of Taboada et al. [6], since, in theory, EE provides better data locality than JR. In turn, our experimental cases show that EI is the best choice for J9. For the CLI implementations (Mono and CLR), the results do not clearly indicate whether JR or EI is the better choice. However, since we have observed similar results for both representations, we favor JR over EI, because it is the more natural representation.

The fact that rectangular arrays (RR) are outperformed by jagged and embedded arrays may also surprise HPC developers, since rectangular arrays offer the data locality of embedded arrays and provide a syntax that is even more natural than that of jagged arrays. In fact, they were introduced in C# precisely to circumvent the performance bottleneck of jagged array indirections while avoiding the intricate index arithmetic of embedded arrays. It is also worth noticing that C# code using jagged arrays is easily ported to Java if the developer decides to move from the CLI to the JVM platform. The developer also has the alternative of programming in J# instead of C# to improve code portability to the JVM platform.

IKVM, which is implemented on top of Mono by converting Java bytecode to CIL, inherits Mono's performance characteristics. However, our results show that EI outperforms JR in all experimental cases with IKVM. This is explained by the fact that the version of IKVM used in this work is implemented on top of Mono 2.10.8. Indeed, the results we have obtained for IKVM are very similar to those we obtained for Mono 2.11.3 in the previous version of this study [40]. Recall that, between Mono 2.11.3 and Mono 4.2.0, the performance of jagged arrays improved by factors between 17% and 37%, whereas the performance of embedded arrays improved by a smaller factor, between 4% and 10%, across all the experimental cases, as pointed out in Table 6.

What is the best combination of virtual execution engine, programming language and multidimensional array representation for implementing, compiling and running a new array-intensive program, provided that the developer has the flexibility to make all of these choices?

For developing new code, we recommend the adoption of JR combined with a HotSpot JVM (HotSpot or Open). Although these combinations are outperformed by their embedded counterparts on top of J9 in most experimental cases, the difference is not so significant for the bigger workloads, as shown in Table 13, and it is compensated by the greater naturalness of code written with jagged arrays in Java, which is easier to read and to maintain. This observation is valid, however, only when the code is structured so as to take advantage of dynamic JIT compilation. For instance, this is not the case of FT, for which jagged arrays are significantly outperformed by embedded arrays. Fortunately, the recommendations on how to better structure code for dynamic JIT compilation in HotSpot JVMs are relatively simple, with the additional advantage that they promote code modularization, as the sketch below illustrates.
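As a hedged illustration of such restructuring (one simple pattern consistent with the recommendations above, not a recipe prescribed by this paper), the hypothetical kernel below extracts the hot loop into a small method and caches the row reference in a local variable, so the JIT compiler sees a compact loop body without repeated a[i][j] indirections. All names and sizes are illustrative.

    // Hypothetical kernel: one simple restructuring that both modularizes
    // the code and helps JIT compilation of jagged-array loops.
    public class RowCache {
        // The hot loop lives in a small method; the row reference is a
        // local variable, so no a[i] dereference is repeated in the loop.
        static void scaleRow(double[] row, double factor) {
            for (int j = 0; j < row.length; j++) {
                row[j] *= factor;
            }
        }

        public static void main(String[] args) {
            int n = 1024;
            double[][] a = new double[n][n];
            for (int i = 0; i < n; i++) {
                scaleRow(a[i], 2.0); // one row dereference per row, not per element
            }
            System.out.println(a[0][0]);
        }
    }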
Fig. 9. Relative performance of VEEs, using native execution times as the baseline (CLI).
Given an existing array-intensive program, written in either Java or C#, that adopts a given multidimensional array representation, in which situations is it advantageous to move from one programming language to the other, or between multidimensional array representations, in order to improve the overall application performance?

If an existing array-intensive program uses either the JR or the EI representation in Java, the developer may consider moving to a HotSpot JVM (Open or HotSpot) or to J9, respectively, if he/she is not already using one of them, as pointed out in the answer to the first question. JR and EI are the most common representations one may find in preexisting code. On the other hand, if the code is written in C#, for running on either Mono or CLR, rewriting the code in Java may be a good alternative when the performance requirements are critical. Indeed, Table 5 shows that the CLI implementations present overheads
between 40% and 200% for EI and JR, compared to their JVM counterparts, for the experimental cases considered in this performance evaluation.

Some other, less usual, situations may occur. For example, a code may use a column-major jagged array representation because it was derived directly from a Fortran program. This is the case of NPB-JAV. Under HPC requirements, it is better to convert such code to its row-major version, since the experimental cases show that column-major traversal of array elements presents significant overhead (a toy illustration follows this discussion). Another situation is an existing code that adopts the EE representation, probably following the refactoring recommendations of Taboada et al. for improving HotSpot's JIT performance. Table 4 shows that a conversion to the jagged array representation (JR) may significantly accelerate the application in HotSpot and Open, making JR the best choice under HPC requirements. Finally, a developer may have an existing code that adopts rectangular arrays (RR), written under the erroneous belief that access to rectangular array elements in C# is significantly faster than access through jagged arrays. To significantly accelerate the application, while also making it more portable to the Java platform, the developer may move to JR; this is a simple conversion that can even be performed with regular expressions.

Is it advantageous, in terms of the trade-off between reaching peak performance and having access to better development tools and high-level programming techniques, to move from C or Fortran to a VEE (either Java or C#)?

Looking at Fig. 9, one may notice that the overhead of virtual execution for the bigger workloads (A and B) lies between 67% and 124% (the extremes being 1.67 for BT/B and 2.24 for LU/B) in relation to the execution times obtained through native execution. However, the performance gap between virtual and native execution could be further reduced by applying certain coding recommendations that enable additional dynamic optimizations by the JIT compiler [7], such as the refactorings proposed by Taboada et al., as well as by using techniques for optimizing array access performance, such as the unsafe blocks of C# and the ABC (array-bounds-checking) elimination compiler directives of Mono. However, the results reported in this paper show that code refactorings for JIT optimization are not portable. In particular, the refactorings proposed by Taboada et al., which yield significant speedups in HotSpot VEEs, cause significant overheads in the other VEEs. The use of unsafe blocks of C# is another non-portable technique, with the additional drawback of compromising execution safety, one of the relevant motivations for moving to a VEE in the first place. In turn, ABC elimination, despite being a portable approach, yields significant speedups only for rectangular arrays, which are not the best choice of array representation for Mono and are not affected by the -O:unsafe compiler directive.

Therefore, it is not possible to definitively either encourage or discourage a developer from moving his/her application from native to virtual execution. The decision depends on the tolerance of the application to the inevitable performance overheads, as well as on the relevance of the requirements that motivate the developer to consider using a VEE. From our results, we can only conclude that VEEs constitute viable alternatives for a significant number of HPC applications.
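As promised above, the column-major penalty is easy to reproduce outside the benchmarks. The toy Java program below (illustrative sizes, not an NPB kernel) traverses the same jagged array in both orders and prints the elapsed times; column-major order touches a different row object on every access, defeating spatial locality.

    // Toy comparison of row-major vs. column-major traversal of a Java
    // jagged array (illustrative; timings are indicative only).
    public class Traversal {
        public static void main(String[] args) {
            int n = 2048;
            double[][] a = new double[n][n];

            long t0 = System.nanoTime();
            for (int i = 0; i < n; i++)     // row-major: streams through each row
                for (int j = 0; j < n; j++)
                    a[i][j] += 1.0;
            long t1 = System.nanoTime();

            for (int j = 0; j < n; j++)     // column-major: new row object each access
                for (int i = 0; i < n; i++)
                    a[i][j] += 1.0;
            long t2 = System.nanoTime();

            System.out.printf("row-major %d ms, column-major %d ms%n",
                    (t1 - t0) / 1000000, (t2 - t1) / 1000000);
        }
    }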
Besides helping developers of array-intensive programs with HPC requirements to guide their choices regarding VEEs and multidimensional array representations, we expect the findings of this paper to suggest directions of investigation for addressing the bottlenecks of VEEs compliant with the JVM and CLI standards in supporting this kind of program, taking advantage of open-source industrial-strength implementations such as OpenJDK, Jikes RVM, SSCLI and Mono. In particular, we consider it relevant to investigate the bottlenecks in the implementation of dynamic rectangular arrays in Mono.

We plan to continue evaluating the performance of VEEs by means of the performance evaluation framework developed in this work. To that end, we plan to add new experimental cases by introducing other NPB programs, bigger workloads and other VEE implementations. We also plan to include different machine configurations, in order to evaluate how they affect the performance of VEEs and the conclusions of this study.

Appendix

In what follows, we present the output of the virtual machines used in this experimental study when asked to provide detailed information about the version used in the experiments.
• J9:

admcad@pargocad-Studio-XPS-8100:~$ /opt/ibm/java-x86_64-80/bin/java -version
java version "1.8.0"
Java(TM) SE Runtime Environment (build pxa6480ea-20130422_01)
IBM J9 VM (build 2.7, JRE 1.8.0 Linux amd64-64 Compressed References 20130419_145797 (JIT enabled, AOT enabled)
J9VM - R27_Java827_Beta_3_20130419_2138_B145797
JIT - r13.b02_20130419_36653
GC - R27_Java827_Beta_3_20130419_2138_B145797_CMPRSS
J9CL - 20130419_145797)
JCL - 20130410_01 based on Oracle jdk8-b80
• HotSpot:

admcad@pargocad-Studio-XPS-8100:~$ /opt/oracle/jdk1.8.0/bin/java -version
java version "1.8.0"
Java(TM) SE Runtime Environment (build 1.8.0-b132)
Java HotSpot(TM) 64-Bit Server VM (build 25.0-b70, mixed mode)
• IKVM:

admcad@pargocad-Studio-XPS-8100:~$ /opt/ikvm/ikvm-7.2.4630.5/bin/ikvm.exe -version
IKVM.NET Launcher version 7.2.4630.5
Copyright (C) 2002-2012 Jeroen Frijters
http://www.ikvm.net/
CLR version: 2.0.50727.1433 (64 bit)
Mono version: 2.10.8 (tarball Sex Mar 28 05:15:14 BRT 2014)
mscorlib: 2.0.0.0
ikvm: 7.2.4630.5
IKVM.Runtime: 7.2.4630.5
IKVM.OpenJDK.Core: 7.2.4630.5
System: 2.0.0.0
System.Configuration: 2.0.0.0
System.Xml: 2.0.0.0
OpenJDK version: OpenJDK 7u6 b24
• JRockit:

admcad@pargocad-Studio-XPS-8100:~$ /opt/oracle/jrockit-jdk1.6.0_45-R28.2.7-4.1.0/bin/java -version
java version "1.6.0_45"
Java(TM) SE Runtime Environment (build 1.6.0_45-b06)
Oracle JRockit(R) (build R28.2.7-7-155314-1.6.0_45-20130329-0641-linux-x86_64, compiled mode)
• Open:

admcad@pargocad-Studio-XPS-8100:~$ /opt/oracle/jdk9/bin/java -version
openjdk version "1.9.0-internal"
OpenJDK Runtime Environment (build 1.9.0-internal-admcad_2014_02_04_16_47-b00)
OpenJDK 64-Bit Server VM (build 25.0-b62, mixed mode)
• Mono:

admcad@pargocad-Studio-XPS-8100:~$ /opt/mono/bin/mono --version
Mono JIT compiler version 4.2.0 (Stable 4.2.0.179/a224653 Seg Ago 31 17:41:50 BRT 2015)
Copyright (C) 2002-2014 Novell, Inc, Xamarin Inc and Contributors. www.mono-project.com
TLS: __thread
SIGSEGV: altstack
Notifications: epoll
Architecture: amd64
Disabled: none
Misc: bigarrays softdebug
LLVM: supported, not enabled.
GC: sgen
References

[1] D.E. Post, L.G. Votta, Computational science demands a new paradigm, Phys. Today 58 (1) (2005) 35–41.
[2] Java Grande Forum Home Page, http://www.javagrande.org/, 2013.
[3] ECMA International, Common Language Infrastructure (CLI), Partitions I to VI, Tech. rep. 335, ECMA International, http://www.ecma-international.org/publications/standards/Ecma-335.htm, Jun. 2006.
[4] E. Lusk, K. Yelick, Languages for high-productivity computing – the DARPA HPCS language support, Parallel Process. Lett. 1 (2007) 89–102.
[5] A. Grama, A. Gupta, J. Karypis, V. Kumar, Introduction to Parallel Computing, Addison–Wesley, 2003.
[6] G.L. Taboada, S. Ramos, R.R. Expósito, J. Tourino, R. Doallo, Java in the High Performance Computing arena: research, practice and experience, Sci. Comput. Program. 78 (5) (2013) 425–444.
[7] B. Amedro, D. Caromel, F. Huet, V. Bodnartchouk, C. Delbé, G. Taboada, HPC in Java: experiences in implementing the NAS parallel benchmarks, in: Proceedings of the 10th WSEAS International Conference on Applied Informatics and Communications, AIC'10, 2010.
[8] M.A. Frumkin, M. Schultz, H. Jin, J. Yan, Performance and scalability of the NAS parallel benchmarks in Java, in: 17th International Symposium on Parallel and Distributed Processing, IPDPS'03, 2003, p. 139.
[9] D.H. Bailey, et al., The NAS parallel benchmarks, Int. J. Supercomput. Appl. 5 (3) (1991) 63–73.
[10] W. Vogels, Benchmarking the CLI for high performance computing, IEE Proc., Softw. 150 (5) (2003) 266–274.
[11] B. Amedro, F. Baude, D. Caromel, C. Delbé, I. Filali, F. Huet, E. Mathias, O. Smirnov, An efficient framework for running applications on clusters, grids and clouds, Springer, 2010, pp. 163–178, chapter 10.
[12] Oracle Java Development Kit, http://java.oracle.com/, May 2012.
[13] IBM Java Development Kit, http://www.java.com/java, May 2012.
[14] J. Hamilton, Language integration in the common language runtime, SIGPLAN Not. 38 (2) (2003) 19–28, http://dx.doi.org/10.1145/772970.772973.
[15] Microsoft .NET Framework, http://www.microsoft.com/net, 2012.
[16] The Mono Project, http://www.mono-project.com, 2006.
[17] Xamarin – Build mobile apps for iOS, Android, Mac and Windows, http://xamarin.com/, 2014.
[18] IKVM.NET home page, http://www.ikvm.net/, May 2012.
[19] M. Philippsen, R.F. Boisvert, V. Getov, R. Pozo, J.E. Moreira, D. Gannon, G. Fox, JavaGrande – high performance computing with Java, in: PARA '00 Proceedings of the 5th International Workshop on Applied Parallel Computing, New Paradigms for HPC in Industry and Academia, 2000, pp. 20–36.
[20] J.E. Moreira, S.P. Midkiff, M. Gupta, P.V. Artigas, P. Wu, G. Almasi, The NINJA project, Commun. ACM 44 (10) (2001) 102–109.
[21] C. van Reeuwijk, F. Kuijlman, H.J. Sips, Spar: a set of extensions to Java for scientific computation, Concurr. Comput., Pract. Exp. 15 (35) (2003) 277–297.
[22] J.E. Moreira, S.P. Midkiff, M. Gupta, Supporting multidimensional arrays in Java, Concurr. Comput., Pract. Exp. 15 (35) (2003) 317–340.
[23] G. Gundersen, T. Steihaug, Data structures in Java for matrix computations, Concurr. Comput., Pract. Exp. 16 (8) (2004) 799–815.
[24] JSci – a science API for Java, http://jsci.sourceforge.net/, May 2012.
[25] V. Todorov, Java and computing for robust statistics, in: Developments in Robust Statistics, Physica-Verlag GmbH & Co, 2002, pp. 404–416.
[26] W.B. VanderHeyden, E.D. Dendy, N.T. Padial Collins, CartaBlanca – a pure-Java, component-based systems simulation tool for coupled nonlinear physics on unstructured grids, Concurr. Comput., Pract. Exp. 15 (35) (2003) 431–458.
[27] M. Baitsch, N. Li, D. Hartmann, A toolkit for efficient numerical applications in Java, Adv. Eng. Softw. 41 (1) (2010) 75–83.
[28] Z. Budimlic, K. Kennedy, JaMake: a Java compiler environment, in: Proceedings of the Third International Conference on Large-Scale Scientific Computing, LSSC'01, 2001, pp. 201–209.
[29] P.V. Artigas, M. Gupta, S.P. Midkiff, J.E. Moreira, Automatic loop transformations and parallelization for Java, in: Proceedings of the 14th International Conference on Supercomputing, ICS '00, ACM Press, New York, USA, 2000, pp. 1–10.
[30] M. Luján, J.R. Gurd, T.L. Freeman, J. Miguel, Elimination of Java array bounds checks in the presence of indirection, in: Proceedings of the 2002 Joint ACM-ISCOPE Conference on Java Grande, JGI '02, ACM Press, New York, USA, 2002, pp. 76–85.
[31] F. Qian, L.J. Hendren, C. Verbrugge, A comprehensive approach to array bounds check elimination for Java, in: Proceedings of the 11th International Conference on Compiler Construction, CC'02, 2002, pp. 325–342.
[32] T.V.N. Nguyen, F. Irigoin, Efficient and effective array bound checking, ACM Trans. Program. Lang. Syst. 27 (3) (2005) 527–570.
[33] T. Würthinger, C. Wimmer, H. Mössenböck, Array bounds check elimination for the Java HotSpot client compiler, in: Proceedings of the 5th International Symposium on Principles and Practice of Programming in Java, PPPJ'07, ACM Press, New York, USA, 2007, p. 125.
[34] G.P. Nikishkov, Y.G. Nikishkov, V.V. Savchenko, Comparison of C and Java performance in finite element computations, Comput. Struct. 81 (24–25) (2003) 2401–2408.
[35] C.J. Riley, S. Chatterjee, R. Biswas, High-performance Java codes for computational fluid dynamics, Concurr. Comput., Pract. Exp. 15 (35) (2003) 395–415.
[36] J.M. Bull, L.A. Smith, C. Ball, L. Pottage, R. Freeman, Benchmarking Java against C and Fortran for scientific applications, Concurr. Comput., Pract. Exp. 15 (35) (2003) 417–430.
[37] J.A. Mathew, P.D. Coddington, K.A. Hawick, Analysis and development of Java Grande benchmarks, in: Proceedings of the ACM 1999 Conference on Java Grande, JAVA '99, ACM Press, New York, USA, 1999, pp. 72–80.
[38] R. Temam, Navier–Stokes Equations: Theory and Numerical Analysis, vol. 2, Revised edition, Elsevier Science Ltd., 1979.
[39] T. Kotzmann, C. Wimmer, H. Mössenböck, T. Rodriguez, K. Russell, D. Cox, Design of the Java HotSpot™ client compiler for Java 6, ACM Trans. Archit. Code Optim. 5 (1) (2008) 7:1–7:32, http://dx.doi.org/10.1145/1369396.1370017.
[40] F.H. de Carvalho Junior, C.A. Rezende, J.C. Silva, F.J.L. Magalhães, R.C. Juaçaba Neto, On the performance of multidimensional array representations in programming languages based on virtual execution machines, in: Lecture Notes in Computer Science, vol. 8129, Springer, 2013, pp. 31–45.