Data parallel language and compiler support for data intensive applications


Parallel Computing 28 (2002) 725–748 www.elsevier.com/locate/parco

Renato Ferreira a,*, Gagan Agrawal b, Joel Saltz a

a Department of Computer Science, University of Maryland, College Park, MD 20742, USA
b Department of Computer and Information Sciences, Ohio State University, Columbus, OH 43210, USA

Received 11 March 2001; received in revised form 20 November 2001

Abstract

Processing and analyzing large volumes of data plays an increasingly important role in many domains of scientific research. High-level language and compiler support for developing applications that analyze and process such datasets has, however, been lacking so far. In this paper, we present a set of language extensions and a prototype compiler for supporting high-level object-oriented programming of data intensive reduction operations over multidimensional data. We have chosen a dialect of Java with data-parallel extensions for specifying a collection of objects, a parallel for loop, and reduction variables as our source high-level language. Our compiler analyzes parallel loops and optimizes the processing of datasets through the use of an existing run-time system, called the active data repository (ADR). We show how loop fission followed by interprocedural static program slicing can be used by the compiler to extract the information required by the run-time system. We present the design of a compiler/run-time interface which allows the compiler to effectively utilize the existing run-time system. A prototype compiler incorporating these techniques has been developed using the Titanium front-end from Berkeley. We have evaluated this compiler by comparing the performance of compiler generated code with hand customized ADR code for three templates, from the areas of digital microscopy and scientific simulations. Our experimental results show that the performance of the compiler generated versions is, on average, 21% lower than, and in all cases within a factor of two of, the performance of the hand coded versions. © 2002 Elsevier Science B.V. All rights reserved.

This research was supported by NSF Grant ACR-9982087, NSF Grant CCR-9808522, and NSF CAREER award ACI-9733520.
* Corresponding author.
E-mail addresses: [email protected] (R. Ferreira), [email protected] (G. Agrawal), [email protected] (J. Saltz).



Keywords: Data intensive applications; Data parallel language; Compiler techniques; Run-time support; Java

1. Introduction

Analysis and processing of very large multi-dimensional scientific datasets (i.e. datasets in which data items are associated with points in a multidimensional attribute space) is an important component of science and engineering. Examples of these datasets include raw and processed sensor data from satellites [27], output from hydrodynamics and chemical transport simulations [23], and archives of medical images [1]. These datasets are also very large; for example, in medical imaging, the size of a single digitized composite slide image at high power from a light microscope is over 7 GB (uncompressed), and a single large hospital can process thousands of slides per day.

Applications that make use of multidimensional datasets are becoming increasingly important and share several important characteristics. Both the input and the output are often disk-resident. Applications may use only a subset of all the data available in the datasets. Access to data items is described by a range query, namely a multidimensional bounding box in the underlying multidimensional attribute space of the dataset. Only the data items whose associated coordinates fall within the multidimensional box are retrieved. The processing structures of these applications also share common characteristics. However, no high-level language support currently exists for developing applications that process such datasets.

In this paper, we present our solution towards allowing high-level, yet efficient, programming of data intensive reduction operations on multidimensional datasets. Our approach is to use a data parallel language to specify computations that are to be applied to a portion of disk-resident datasets. Our solution is based upon designing a prototype compiler using the Titanium infrastructure, which incorporates loop fission and slicing based techniques, and utilizing an existing run-time system called the active data repository (ADR) [8–10].

We have chosen a dialect of Java for expressing this class of computations. We have chosen Java because the computations we target can be easily expressed using the notion of objects and methods on objects, and a number of projects are already underway for expressing parallel computations in Java and obtaining good performance on scientific applications [4,25,36]. Our chosen dialect of Java includes data-parallel extensions for specifying collections of objects, a parallel for loop, and reduction variables. However, the approach and the techniques developed are not intended to be language specific. Our overall thesis is that a data-parallel framework will provide a convenient interface to large multidimensional datasets resident on persistent storage.

Conceptually, our compiler design has two major new ideas. First, we have shown how loop fission followed by interprocedural program slicing can be used for extracting important information from general object-oriented data-parallel loops. This technique can be used by other compilers that use a run-time system to optimize for locality or communication. Second, we have shown how the compiler and the run-time system can use such information to efficiently execute data intensive reduction computations.


Our compiler extensively uses the existing run-time system ADR for optimizing resource usage during the execution of data intensive applications. ADR integrates storage, retrieval and processing of multidimensional datasets on a parallel machine. While a number of applications have been developed using ADR's low-level API and high performance has been demonstrated [9], developing applications in this style requires detailed knowledge of the design of ADR and is not suitable for application programmers. In comparison, our proposed data-parallel extensions to Java enable programming of data intensive applications at a much higher level. It becomes the responsibility of the compiler to utilize the services of ADR for memory management, data retrieval and scheduling of processes.

Our prototype compiler has been implemented using the Titanium infrastructure from Berkeley [36]. We have performed experiments using three different data intensive application templates, two of which are based upon the virtual microscope application [16] and the third is based on water contamination studies [23]. For each of these templates, we have compared the performance of compiler generated versions with hand customized versions. Our experiments show that the performance of the compiler generated versions is, on average, 21% lower and in all cases within a factor of two of the performance of the hand coded versions. We present an analysis of the factors behind the lower performance of the current compiler and suggest optimizations that can be performed by our compiler in the future.

The rest of the paper is organized as follows. In Section 2, we further describe the characteristics of the class of data intensive applications we target. Background information on the run-time system is provided in Section 3. Our chosen language extensions are described in Section 4. We present our compiler's processing of the loops and slicing based analysis in Section 5. The combined compiler and run-time processing for execution of loops is presented in Section 6. Experimental results from our current prototype are presented in Section 7. We compare our work with existing related research efforts in Section 8 and conclude in Section 9.

2. Data intensive applications

In this section, we first describe some of the scientific domains which involve applications that process large datasets. Then, we describe some of the common characteristics of the applications we target. Data intensive applications from three scientific areas are being studied currently as part of our project.

Analysis of microscopy data: The virtual microscope [16] is an application to support the need to interactively view and process digitized data arising from tissue specimens. The virtual microscope provides a realistic digital emulation of a high power light microscope. The raw data for such a system can be captured by digitally scanning collections of full microscope slides under high power. At the basic level, it can emulate the usual behavior of a physical microscope, including continuously moving the stage and changing magnification and focus. Used in this manner, the virtual microscope can support completely digital dynamic telepathology.


Water contamination studies: Environmental scientists study the water quality of bays and estuaries using long running hydrodynamics and chemical transport simulations [23]. The chemical transport simulation models reactions and transport of contaminants, using the fluid velocity data generated by the hydrodynamics simulation. The chemical transport simulation is performed on a different spatial grid than the hydrodynamics simulation, and also often uses significantly coarser time steps. To facilitate coupling between these two simulations, there is a need for mapping the fluid velocity information from the hydrodynamics grid, averaged over multiple fine-grain time steps, to the chemical transport grid, and computing smoothed fluid velocities for the points in the chemical transport grid.

Satellite data processing: Earth scientists study the earth by processing remotely sensed data continuously acquired from satellite-based sensors, since a significant amount of Earth science research is devoted to developing correlations between sensor radiometry and various properties of the surface of the Earth [9]. A typical analysis processes satellite data for ten days to a year and generates one or more composite images of the area under study. Generating a composite image requires projection of the globe onto a two dimensional grid; each pixel in the composite image is computed by selecting the "best" sensor value that maps to the associated grid point.

Data intensive applications in these and related scientific areas share many common characteristics. Access to data items is described by a range query, namely a multidimensional bounding box in the underlying multidimensional space of the dataset. Only the data items whose associated coordinates fall within the multidimensional box are retrieved. The basic computation consists of (1) mapping the coordinates of the retrieved input items to the corresponding output items, and (2) aggregating, in some way, all the retrieved input items mapped to the same output data items. The computation of a particular output element is a reduction operation, i.e. the correctness of the output usually does not depend on the order in which the input data items are aggregated.

Another common characteristic of these applications is their extremely high storage and computational requirements. For example, ten years of global coverage satellite data at a resolution of 4 km for our satellite data processing application Titan consists of over 1.4 TB of data [9]. For our virtual microscope application, one focal plane of a single slide requires over 7 GB (uncompressed) at high power, and a hospital such as Johns Hopkins produces hundreds of thousands of slides per year. Similarly, the computation for one ten day composite Titan query for the entire world takes about 100 s per processor on the Maryland 16 node IBM SP2. The application scientists typically demand real-time responses to such queries; therefore, efficient execution is extremely important.

3. Overview of the run-time system

Our compiler effort targets an existing run-time infrastructure, called the ADR [9], that integrates storage, retrieval and processing of multidimensional datasets on a parallel machine. We give a brief overview of this run-time system in this section.


Processing of a data intensive data-parallel loop is carried out by ADR in two phases: loop planning and loop execution. The objective of loop planning is to determine a schedule to efficiently process a range query based on the amount of available resources in the parallel machine. A loop plan specifies how parts of the final output are computed. The loop execution service manages all the resources in the system and carries out the loop plan generated by the loop planning service. The primary feature of the loop execution service is its ability to integrate data retrieval and processing for a wide variety of applications. This is achieved by pushing processing operations into the storage manager and allowing processing operations to access the buffer used to hold data arriving from disk. As a result, the system avoids one or more levels of copying that would be needed in a layered architecture where the storage manager and the processing belong in different layers.

A dataset in ADR is partitioned into a set of (logical) disk blocks to achieve high bandwidth data retrieval. The size of a logical disk block is a multiple of the size of a physical disk block on the system and is chosen as a trade-off between reducing disk seek time and minimizing unnecessary data transfers. A disk block consists of one or more objects, and is the unit of I/O and communication.

The processing of a loop on a processor progresses through the following three phases: (1) Initialization: output disk blocks (possibly replicated on all processors) are allocated space in memory and initialized; (2) Local Reduction: input disk blocks on the local disks of each processor are retrieved and aggregated into the output disk blocks; and (3) Global Combine: if necessary, results computed on each processor in phase 2 are combined across all processors to compute the final results for the output disk blocks.

ADR run-time support has been developed as a set of modular services implemented in C++. ADR allows customization for application specific processing (i.e., mapping and aggregation functions), while leveraging the commonalities between the applications to provide support for common operations such as memory management, data retrieval, and scheduling of processing across a parallel machine. Customization in ADR is currently achieved through class inheritance. That is, for each of the customizable services, ADR provides a base class with virtual functions that are expected to be implemented by derived classes. Adding an application-specific entry into a modular service requires the definition of a class derived from an ADR base class for that service and providing the appropriate implementations of the virtual functions. Current examples of data intensive applications implemented with ADR include Titan [9], for satellite data processing, the virtual microscope [16], for visualization and analysis of microscopy data, and coupling of multiple simulations for water contamination studies [23].
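To illustrate the inheritance-based customization pattern described above, the following sketch shows its general shape. This is purely schematic: ADR itself is implemented in C++, and none of the class or method names below are ADR's actual API; the chunk types are placeholder stubs introduced only for the sketch.

```java
// Placeholder types for the sketch (not part of ADR).
class InputChunk { }
class OutputChunk { void clear() { } }

// Schematic illustration of customization through class inheritance.
abstract class AggregationService {
    // Virtual functions an application-specific derived class must provide.
    abstract void initialize(OutputChunk out);                          // phase 1
    abstract void localReduce(InputChunk in, OutputChunk out);          // phase 2
    abstract void globalCombine(OutputChunk mine, OutputChunk other);   // phase 3
}

// An application plugs in its own mapping and aggregation logic by deriving
// from the base class and implementing the virtual functions.
class MicroscopeAggregation extends AggregationService {
    void initialize(OutputChunk out) { out.clear(); }
    void localReduce(InputChunk in, OutputChunk out) {
        // map input pixels to output pixels and accumulate their values
    }
    void globalCombine(OutputChunk mine, OutputChunk other) {
        // merge partial results computed on another processor
    }
}
```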

4. Java extensions for data intensive computing

In this section, we describe a dialect of Java that we have chosen for expressing data intensive computations. Though we propose to use a dialect of Java as the source language for the compiler, the techniques we will be developing will be largely


independent of Java and will also be applicable to suitable extensions of other languages, such as C, C++, or Fortran 90.

4.1. Data-parallel constructs

We borrow two concepts from object-oriented parallel systems like Titanium [36], HPC++ [5], and Concurrent Aggregates [11].

• Domains and Rectdomains are collections of objects of the same type. Rectdomains have a stricter definition, in the sense that each object belonging to such a collection has a coordinate associated with it that belongs to a pre-specified rectilinear section of the domain.
• The foreach loop, which iterates over objects in a domain or rectdomain, and has the property that the order of iterations does not influence the result of the associated computations. Further, the iterations can be performed in parallel. We also extend the semantics of foreach to include the possibility of updates to reduction variables, as we explain later.

We introduce a Java interface called Reducinterface. Any object of any class implementing this interface acts as a reduction variable [18]. The semantics of a reduction variable is analogous to that used in version 2.0 of High Performance Fortran (HPF-2) [18] and in HPC++ [5]. A reduction variable has the property that it can only be updated inside a foreach loop by a series of operations that are associative and commutative. Furthermore, the intermediate value of the reduction variable may not be used within the loop, except for self-updates.
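The following fragment sketches how these constructs might be used together. It is written in the proposed dialect (the foreach construct and the multidimensional array syntax follow Titanium rather than standard Java), and the Histogram class and its fields are illustrative assumptions, not part of the language extensions themselves.

```java
// Sketch in the proposed Java dialect (not standard Java).
// A Histogram acts as a reduction variable: inside a foreach loop it may only
// be updated through commutative and associative operations, and its
// intermediate value may not be read.
class Histogram implements Reducinterface {
    private int[] bins = new int[256];
    public void add(int bin) { bins[bin]++; }   // associative, commutative update
}

class GrayLevels {
    public static Histogram count(byte[2d] image, RectDomain<2> region) {
        Histogram h = new Histogram();
        // Iterations over the rectdomain may run in any order, or in parallel;
        // the only cross-iteration writes are self-updates of h.
        foreach (p in region) {
            h.add(image[p] & 0xff);
        }
        return h;
    }
}
```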

4.2. Example code

Fig. 1 outlines an example code with our chosen extensions. This code shows the essential computation in the virtual microscope application [16]. A large digital image is stored on disks. This image can be thought of as a two dimensional array or collection of objects. Each element in this collection denotes a pixel in the image. Each pixel comprises three characters, which denote the color at that point in the image.

The interactive user supplies two important pieces of information. The first is a bounding box within this two dimensional image, which specifies the area within the original image that the user is interested in scanning. We assume that the bounding box is rectangular, and can be specified by providing the x and y coordinates of two points. The first four arguments provided by the user are integers and, together, they specify the points lowend and hiend. The second piece of information provided by the user is the subsampling factor, an integer denoted by subsamp. The subsampling factor determines the granularity at which the user is interested in viewing the image. A subsampling factor of 1 means that all pixels of the original image must be displayed. A subsampling factor of n means that n^2 pixels are averaged to compute each output pixel.


Fig. 1. Example code.
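The figure itself is not reproduced in this text. The sketch below reconstructs the kind of kernel Fig. 1 describes, written in the proposed dialect; identifiers such as VMPixel, VMImage, and the accum method are assumptions for illustration and need not match the paper's actual code.

```java
// Illustrative reconstruction of the virtual microscope kernel (proposed
// Java dialect, not standard Java). All class names are hypothetical.
public class VMScope {
    public static void main(String[] args) {
        int lowx = Integer.parseInt(args[0]), lowy = Integer.parseInt(args[1]);
        int hix  = Integer.parseInt(args[2]), hiy  = Integer.parseInt(args[3]);
        int subsamp = Integer.parseInt(args[4]);

        Point<2> lowend = [lowx, lowy];
        Point<2> hiend  = [hix, hiy];
        RectDomain<2> querybox = [lowend : hiend];     // user-specified bounding box

        VMPixel[2d] image     = VMImage.open();                       // disk-resident input
        VMPixelOut[2d] output = VMOutput.allocate(querybox, subsamp); // output collection

        // Each input pixel inside the query box contributes to one output
        // pixel; every output pixel averages subsamp * subsamp inputs.
        foreach (p in querybox) {
            Point<2> q = [(p[1] - lowend[1]) / subsamp,
                          (p[2] - lowend[2]) / subsamp];
            output[q].accum(image[p], subsamp * subsamp);
        }
    }
}
```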

The computation in this kernel is very simple. First, a querybox is created using the specified points lowend and hiend. Each pixel in the original image which falls within the querybox is read and then used to increment the value of the corresponding output pixel.

There are several advantages associated with specifying the analysis and processing of multidimensional datasets in this fashion. The program can specify the computations assuming a single processor and flat memory. It also assumes that the data is available in arrays of object references, rather than in persistent storage. It is the responsibility of the compiler and run-time system to locate individual elements of the arrays on disks. Also, it is the responsibility of the compiler to invoke the run-time system for optimizing resource usage.

4.3. Restrictions on the loops

The primary goal of our compiler is to analyze and optimize (by performing both compile-time transformations and generating code for the ADR run-time system) foreach loops that satisfy certain properties. We assume the standard semantics of parallel for loops and reductions in languages like High Performance Fortran (HPF) [18] and HPC++ [5]. Further, we require that no Java threads be spawned within such loop nests, and no memory locations read or written inside the loop nests may be touched by another concurrent thread. Our compiler also assumes that no Java exceptions are raised in the loop nests and that the iterations of the loop can be reordered without changing any of the language semantics. One potential way of enabling this is to use bounds checking optimizations [25].

5. Compiler analysis

In this section, we first describe how the compiler processes the given data-parallel data intensive loop into a canonical form. We then describe how interprocedural


program slicing can be used for extracting a number of functions which are passed to the run-time system.

5.1. Initial processing of the loop

Consider any data-parallel loop in our dialect of Java, as presented in Section 4. The memory locations modified in this loop are only the elements of collections of objects, or temporary variables whose values are not used in other iterations of the loop or in any subsequent computations. The memory locations accessed in this loop are either elements of collections or values which may be replicated on all processors before the start of the execution of the loop. For the purpose of our discussion, collections of objects whose elements are modified in the loop are referred to as left hand side or LHS collections, and the collections whose elements are only read in the loop are referred to as right hand side or RHS collections. The functions used to access elements of collections of objects in the loop are referred to as subscript functions.

Definition 1. Consider any two LHS collections or any two RHS collections. These two collections are called congruent iff

• The subscript functions used to access these two collections in the loop are identical.
• The layout and partitioning of these two collections are identical. By identical layout we mean that elements with identical indices are put together in the same disk block for both the collections. By identical partitioning we mean that the disk blocks containing elements with identical indices from these collections reside on the same processor.

Consider any loop. If multiple distinct subscript functions are used to access RHS collections and LHS collections, and these subscript functions are not known at compile-time, then tiling the output and managing disk accesses while maintaining high reuse and locality becomes a very difficult task for the run-time system. In particular, the current implementation of ADR does not support such cases. Therefore, we perform loop fission to divide the original loop into a set of loops, such that all LHS collections in any new loop are congruent and all RHS collections are congruent.

We now describe how such loop fission is performed. Initially, we focus on LHS collections which are updated in different statements of the same loop. We perform loop fission so that all LHS collections accessed in any new loop are congruent. Since we are focusing on loops with no loop-carried dependencies, performing loop fission is straightforward. An example of such a transformation is shown in Fig. 2, part (a).

We now focus on such a new loop in which all LHS collections are congruent, but not all RHS collections may be congruent. For any two RHS accesses in a loop that are not congruent, there are three possibilities:


Fig. 2. Examples of loop fission transformations.
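Fig. 2 is not reproduced here. The sketch below, written in the proposed dialect, stands in for part (b) of the figure and illustrates the kind of transformation enumerated next; the collections X, Y, Z and the particular subscript expressions are hypothetical.

```java
// Illustrative loop fission (an analogue of Fig. 2, part (b)).
// Before fission: Y and Z are accessed with different subscript functions,
// but both update X with the same associative operator (+=).
foreach (p in R) {
    X[p] += Y[p] + Z[p + [1, 0]];
}

// After fission: each new loop has a single RHS subscript function, so all
// RHS collections within a loop are congruent. Because += is associative and
// commutative, the final value of X is unchanged.
foreach (p in R) {
    X[p] += Y[p];
}
foreach (p in R) {
    X[p] += Z[p + [1, 0]];
}
```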

1. These two collections are used for calculating values of elements of different LHS collections. In this case, loop fission can be performed trivially.

2. These two collections Y and Z are used for calculating values of elements of the same LHS collection. Such an LHS collection X is, however, computed as follows:

    X(f(i)) op_i= Y(g(i)) op_j Z(h(i))

such that op_i ≡ op_j. In such a case, loop fission can be performed so that the element X(f(i)) is updated using the operation op_i with the values of Y(g(i)) and Z(h(i)) in different loops. An example of such a transformation is shown in Fig. 2, part (b).

3. These two collections Y and Z are used for calculating values of the elements of the same LHS collection and, unlike the case above, the operations used are not identical. An example of such a case is

    X(f(i)) += Y(g(i)) Z(h(i))

In this case, we need to introduce a temporary collection of objects to copy the collection Z. Then, the collection Y and the temporary collection can be accessed using the same subscript function. An example of such a transformation is shown in Fig. 2, part (c).

After such a series of loop fission transformations, the original loop is replaced by a series of loops. The property of each loop is that all LHS collections are accessed with the same subscript function and all RHS collections are also accessed with the same subscript function. However, the subscript function for accessing the LHS collections may be different from the one used to access the RHS collections.

5.1.1. Discussion

Our strategy of performing loop fission so that all LHS collections are accessed with the same subscript function and all RHS collections are accessed with the same subscript function is clearly not best suited for all classes of applications. In particular, for stencil computations, it may result in several accesses to each disk block. However, for the class of data intensive reductions we have focused on, this strategy works extremely well and simplifies the later loop execution. In the future, we will incorporate some of the techniques from parallel database join operations for loop execution, which will alleviate the need for performing loop fission in all cases.


Fig. 3. A loop in canonical form.

5.1.2. Terminology

After loop fission, we focus on one individual loop at a time. We introduce some notation about this loop which is used for presenting our solution. The terminology presented here is illustrated by the example loop in Fig. 3.

The domain over which the iterator iterates is denoted by R. Let there be n RHS collections of objects read in this loop, denoted by I_1, ..., I_n. Similarly, let the LHS collections written in the loop be denoted by O_1, ..., O_m. Further, we denote the subscript function used for accessing the right hand side collections by S_R and the subscript function used for accessing the left hand side collections by S_L. Given a point r in the range for the loop, the elements S_L(r) of the output collections are updated using one or more of the values I_1[S_R(r)], ..., I_n[S_R(r)], and other scalar values in the program. We denote by A_i the function used for creating the value which is later used for updating the element of the output collection O_i. The operator used for performing this update is op_i.
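Using this notation, a loop in canonical form (the shape Fig. 3 depicts) can be written schematically as below. This is a schematic, not compilable code: S_R, S_L, A_i, and op_i stand for the functions and operators just introduced.

```java
// Schematic of a canonical loop after fission (proposed dialect, symbolic).
// Every LHS collection O_i is accessed through the common subscript function
// S_L, every RHS collection through S_R, and O_i is updated with operator
// op_i using the value produced by the aggregation function A_i:
//
//   foreach (r in R) {
//       O_i[S_L(r)]  op_i=  A_i(I_1[S_R(r)], ..., I_n[S_R(r)]);   // i = 1..m
//   }
//
// For the virtual microscope kernel, m = n = 1, S_R is the identity, op_i is
// accumulation, and S_L maps an input pixel to its subsampled output pixel.
```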


5.2. Slicing based interprocedural analysis

We are primarily concerned with extracting three sets of functions: the range function R, the subscript functions S_R and S_L, and the aggregation functions A_1, ..., A_m. Similar information is often extracted by various data-parallel Fortran compilers. One important difference is that we are working with an object-oriented language (Java), which is significantly more difficult to analyze. This is mainly because the object-oriented programming methodology frequently leads to small procedures and frequent procedure calls. As a result, analysis across multiple procedures may be required in order to extract the range, subscript and aggregation functions. We use the technique of interprocedural program slicing for extracting these three sets of functions. We first give background information on program slicing and give references to show that program slicing can be performed across procedure boundaries, and in the presence of language features like polymorphism, aliases, and exceptions.

5.2.1. Background: program slicing

The basic definition of a program slice is as follows. Given a slicing criterion (s, x), where s is a program point in the program and x is a variable, the program slice is the subset of statements in the program such that these statements, when executed on any input, will produce the same value of the variable x at the program point s as the original program. The basic idea behind any algorithm for computing program slices is as follows. Starting from the statement s in the program, we trace any statements on which s is data or control dependent and add them to the slice. The same is repeated for any statement which has already been included in the slice, until no more statements can be added to the slice.

Slicing has been very frequently used in software development environments, for debugging, program analysis, merging two different versions of the code, and software maintenance and testing. A number of techniques have been presented for accurate program slicing across procedure boundaries [32]. Since object-oriented languages have been frequently used for developing production level software, significant attention has been paid towards developing slicing techniques in the presence of object-oriented features like object references, polymorphism, and, more recently, Java features like threads and exceptions. Harrold et al. and Tonella et al. have particularly focused on slicing in the presence of polymorphism, object references, and exceptions [17,28]. Slicing in the presence of aliases and reference types has also been addressed [3].

5.2.2. Extracting range function

We need to determine the RHS and LHS collections of objects for this loop. We also need to provide the range function R. The RHS and LHS collections of objects can be computed easily by inspecting the assignment statements inside the loop and in any functions called inside the loop. Any collection which is modified in the loop is considered an LHS collection, and any other collection touched in the loop is considered an RHS collection. For computing the domain, we inspect the foreach loop and look at the domain over which the loop iterates. Then, we compute a slice of the program using the entry of the loop as the program point and the domain as the variable.
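As a concrete (and hypothetical) illustration, the slice computed for the range of the kernel sketched in Section 4.2 would retain only the statements needed to construct querybox at the loop entry; the subsampling factor and the image accesses are not part of this slice.

```java
// Hypothetical slice for the range function of the kernel sketched earlier,
// with slicing criterion (loop entry, querybox). Statements that cannot
// affect querybox (e.g., parsing subsamp, opening the image) are excluded.
int lowx = Integer.parseInt(args[0]), lowy = Integer.parseInt(args[1]);
int hix  = Integer.parseInt(args[2]), hiy  = Integer.parseInt(args[3]);
Point<2> lowend = [lowx, lowy];
Point<2> hiend  = [hix, hiy];
RectDomain<2> querybox = [lowend : hiend];
```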


Fig. 4. Slice for subscript function (left) and for aggregation function (right).

Fig. 5. Compiler generated subscript and aggregation functions.

5.2.3. Extracting subscript functions

The subscript functions S_R and S_L are particularly important for the run-time system, as they determine the size of the LHS collections written in the loop and the RHS disk blocks from each collection that contribute to the LHS collections. The function S_L can be extracted using slicing as follows. Consider any statement in the loop which modifies any LHS collection. We focus on the variable or expression used to access an element in the LHS collection. The slicing criterion we choose is the value of this variable or expression at the beginning of the statement where the LHS collection is modified. The function S_R can be extracted similarly. Consider any statement in the loop which reads from any RHS collection. The slicing criterion we use is the value of the expression used to access the collection at the beginning of such a statement.

Typically, the value of the iterator will be included in such slices. Suppose the iterator is p. After first encountering p in the slice, we do not follow data-dependencies for p any further. Instead, the functions returned by the slice use the iterator as an input parameter.

For the virtual microscope template presented in Fig. 1, the slice computed for the subscript function S_L is shown on the left hand side of Fig. 4 and the code generated by the compiler is shown on the left hand side of Fig. 5. In the original source code, the RHS collection is accessed with just the iterator p; therefore, the subscript function S_R is the identity function. The function S_L receives the coordinates of an element in the RHS collection as a parameter (iterpt) from the run-time system and returns the coordinates of the corresponding LHS element. Titanium multidimensional points are supported by ADR as a class named ADR_Pt. Also, in practice, the command line parameters passed to the program are extracted and stored in a data structure, so that the run-time system does not need to explicitly read the args array.

5.2.4. Extracting aggregation functions

For extracting the aggregation function A_i, we look at the statement in the loop where the LHS collection O_i is modified. The slicing criterion we choose is the value of the element from the collection which is modified in this statement, at the beginning of this statement.

The virtual microscope template presented in Fig. 1 has only one aggregation function. The slice for this aggregation function is shown in Fig. 4 and the actual code generated by the compiler is shown in Fig. 5. The function Accum accessed in this code is obviously part of the slice, but is not shown here. The generated function iterates over the elements of a disk block and applies the aggregation function on each element, if that element intersects with the range of the loop and the current tile. The function is passed as parameters current_block (the disk block being processed), current_tile (the portion of the LHS collection which is currently allocated in memory), and querybox, which is the iteration range for the loop. Titanium rectangular domains are supported by the run-time system as ADR_Box.
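Since Figs. 4 and 5 are not reproduced here, the logic of the generated LHS subscript function for this example can be sketched as follows. The actual compiler output is C++ against ADR classes such as ADR_Pt; the sketch below uses the Java dialect for readability and assumes that lowend and subsamp are read from the stored parameter structure.

```java
// Sketch of the logic encoded by the generated LHS subscript function for
// the virtual microscope example (illustrative only; the real generated code
// is C++ and operates on ADR_Pt objects).
Point<2> subscriptL(Point<2> iterpt) {
    // Map the coordinates of an input pixel inside the query box to the
    // coordinates of the output pixel it contributes to.
    return [(iterpt[1] - lowend[1]) / subsamp,
            (iterpt[2] - lowend[2]) / subsamp];
}
```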


Further details of this aggregation function are explained after presenting the combined compiler/run-time loop processing.

6. Combined compiler and run-time processing

In this section we explain how the compiler and run-time system can work jointly towards performing data intensive computations.

6.1. Initial processing of the input

The system stores information about how each of the RHS collections of objects I_i is stored across disks. Note that, after we apply loop fission, all RHS collections accessed in the same loop have identical layout and partitioning. The compiler generates appropriate ADR functions to analyze the meta-data about the collections I_i, the range function R, and the subscript function S_R, and to compute the list of disk blocks of I_i that are accessed in the loop. The domain of each RHS collection accessed in the loop is S_R(R). Note that if a disk block is included in this list, it is not necessary that all elements in this disk block are accessed during the loop. However, for the initial planning phase, we focus on the list of disk blocks.

We assume a model of parallel data intensive computation in which a set of disks is associated with each node of the parallel machine. This is consistent with systems like the IBM SP and clusters of workstations. Let the set P = {p_1, p_2, ..., p_q} denote the list of processors in the system. Then, the information computed by the run-time system after analyzing the range function, the input subscript function, and the meta-data about each of the collections of objects I_i is the sets B_ij. For a given input collection I_i and a processor j, B_ij is the set of disk blocks b that contain data for the collection I_i, are resident on a disk connected to processor p_j, and intersect with S_R(R). Further, for each disk block b_ijk belonging to the set B_ij, we compute the information D(b_ijk), which denotes the subset of the domain S_R(R) which is resident on the disk block b_ijk. Clearly, the union of the domains covered by all selected disk blocks will cover the entire area of interest, or, in formal terms,

    ∀i:  ∪_{j,k} D(b_ijk) ⊇ S_R(R)

6.2. Work partitioning

One of the issues in processing any loop in parallel is work or iteration partitioning, i.e., deciding which iterations are executed on which processor. The work distribution policy we use is that each iteration is performed on the owner of the element read in that iteration. This policy is the opposite of the owner computes policy [19], which has been commonly used in distributed memory compilers, in


which the owner of the LHS element works on the iteration. The rationale behind our approach is that the system will not have to communicate blocks of the RHS collections. Instead, only replicated elements of the LHS collections need to be communicated to complete the computation. Note that the assumptions we have placed on the nature of the loops require replacing an initial loop by a sequence of canonical loops, which may also increase the net communication between processors. However, we do not consider this to be a problem for the set of applications we target.

6.3. Allocating output buffers and strip mining

The distribution of the RHS collections is decided before performing the processing, and we have decided to partition the iterations accordingly. We now need to allocate buffers to accumulate the local contributions to the final LHS objects. We use run-time analysis to determine the elements of the output collections which are updated by the iterations performed on a given processor. This run-time analysis is similar to the one performed by run-time libraries for executing irregular applications on distributed memory machines. Any element which is updated by more than one processor is initially replicated on all processors by which it is updated. Several different strategies for the allocation of buffers have been studied in the context of the run-time system [9]. Selecting among these different strategies for the compiler generated code is a topic for future research.

The memory requirements of the replicated output space are typically higher than the available memory on each processor. Therefore, we need to divide the replicated output buffer into chunks that can be allocated in the main memory of each processor. This is the same issue as strip mining or tiling used for improving cache performance. We have so far used only a very simple strip mining strategy. We query the run-time system to determine the available memory that can be allocated on a given processor. Then, we divide the LHS space into blocks of that size. Formally, we divide the LHS domain S_L(R) into a set of smaller domains (called strips) {S_1, S_2, ..., S_r}. Since each of the LHS collections of objects in the loop is accessed through the same subscript function, the same strip mining is done for each of them. In performing the computations on each processor, we iterate over the set of strips, allocate that strip for each of the m output collections, and compute the local contributions to that strip, before allocating the next strip. To facilitate this, we compute the set of RHS disk blocks that will contribute to each strip of the LHS.

6.4. Mapping input to the output

We use the subscript functions S_R and S_L for computing the set of RHS disk blocks that will contribute to each strip of the LHS, as indicated above. To do this, we apply the function S_L ∘ S_R^{-1} to each D(b_ijk) to obtain the corresponding domain in the LHS region. The output domains that each disk block can contribute towards are denoted by OD(b_ijk). If D(b_ijk) is a rectangular domain and the subscript functions are monotonic, OD(b_ijk) will be a rectangular domain and can easily be computed by applying the subscript function to the two extreme corners. If this is not the case,


the subscript function needs to be applied to each element of D(b_ijk), and the resulting OD(b_ijk) will just be a domain and not a rectangular domain. Formally, we compute the sets L_jl, for each processor j and each output strip l, such that

    L_jl = { k | ∀i, OD(b_ijk) ∩ S_l ≠ ∅ }

6.5. Actual execution

The computation of the sets L_jl marks the end of the loop planning phase of the run-time system. Using this information, the actual computations are now performed on each processor. The structure of the computation is shown in Fig. 6. In practice, the computation associated with each RHS disk block and the retrieval of disk blocks are overlapped, by using asynchronous I/O operations.

We now explain the aggregation function generated by the compiler for the virtual microscope template presented in Fig. 1, shown on the right hand side of Fig. 5. The accumulation function output by the compiler captures the Foreach element part of the loop execution model shown in Fig. 6. The run-time system computes the sets L_jl as explained previously and invokes the aggregation function in a loop that iterates over each disk block in such a set. The current compiler generated code computes the rectangular domain D(b_ijk) in each invocation of the aggregation function, by taking the intersection of the current block and the query block. The resulting rectangular domain is denoted by box. The aggregation function iterates over the elements of box.

The conditional if project() achieves two goals. First, it applies the subscript functions to determine the LHS element outputpt corresponding to the RHS element inputpt. Second, it checks if outputpt belongs to current_tile. The actual aggregation is applied only if outputpt belongs to current_tile. This test represents a significant source of inefficiency in the compiler generated code. If the tile or strip being currently processed represents a rectangular RHS region and the subscript functions are monotonic, then the intersection of the box and the tile can be performed before the inner loop. This is in fact done in the hand customization of ADR for the virtual microscope [16]. Performing such an optimization automatically is a topic for future research and beyond the scope of our current work.

Fig. 6. Loop execution on each processor.
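Fig. 6 is not reproduced in this text; the per-processor execution structure it depicts corresponds roughly to the following sketch. The names are illustrative, and in the actual implementation the retrieval of disk blocks is overlapped with their processing using asynchronous I/O.

```java
// Sketch of loop execution on each processor j (illustrative names only).
for (Strip tile : strips) {                       // foreach output strip S_l
    allocateAndInitialize(tile);                  // one buffer per LHS collection
    for (DiskBlock b : blocksFor(j, tile)) {      // RHS blocks in the set L_jl
        read(b);                                  // overlapped with processing
        for (Point r : elementsOf(b)) {           // foreach element in D(b)
            Point out = subscriptL(r);            // project: apply S_L
            if (inQueryRange(r) && tile.contains(out)) {
                aggregate(out, valueOf(b, r));    // apply A_i with operator op_i
            }
        }
    }
    globalCombine(tile);                          // combine replicated results
    writeBack(tile);                              // final values for this strip
}
```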


7. Current implementation and experimental results

In this section, we describe some of the features of the current compiler and then present experimental results comparing the performance of compiler generated customization for three codes with hand customized versions.

Our compiler has been implemented using the publicly available Titanium infrastructure from Berkeley [36]. Our current compiler and run-time system implement only a subset of the analysis and processing techniques described in this paper. Two key limitations are as follows. The compiler can only process codes in which the RHS subscript function is the identity function. It also requires that the domain over which the loop iterates is a rectangular domain and that all subscript functions are monotonic.

The Titanium language is an explicitly parallel dialect of Java for numerical computing. We could use the Titanium front-end without performing any modifications. The Titanium language includes the Point, RectDomain, and foreach loop constructs which we required for our purposes. The concept of Reducinterface is not part of the Titanium language, but no modifications to the parser were required for this purpose. Titanium also includes a large number of additional directives which we did not require, and has significantly different semantics for foreach loops.

We have used three templates for our experiments.

• VMScope1 is identical to the code presented in Fig. 1. It models a virtual microscope, which provides a realistic digital emulation of a microscope, by allowing the users to specify a rectangular window and a subsampling factor. The version VMScope1 averages the colors of neighboring pixels to create a pixel in the output image.
• VMScope2 is similar to VMScope1, except for one important difference. Instead of taking the average of the pixels, it only picks every subsamp-th element along each dimension to create the final output image. Thus, only memory accesses and copying are involved in this template; no computations are performed.
• Bess models computations associated with water contamination studies over bays and estuaries. The computation performed in this application determines the transport of contaminants, and accesses large fluid velocity datasets generated by a previous simulation.

These three templates represent data intensive applications from two important domains, digital microscopy and scientific simulations. The computations and data accesses associated with these computations are relatively simple and can be handled by our current prototype compiler. Moreover, we had access to hand coded ADR customizations for each of these three templates. This allowed us to compare the performance of compiler generated versions against hand coded versions whose performance had been reported in previously published work [16,23].

Our experiments were performed using the ADR run-time system ported to a cluster of dual processor 400 MHz Intel Pentium nodes connected by gigabit ethernet. Each node has 256 MB main memory and 18 GB of internal disk. Experiments were performed using 1, 2, 4 and 8 nodes of the cluster.


Fig. 7. Comparison of compiler and hand generated versions for VMScope1.

The ADR run-time system's current design assumes a shared nothing architecture and does not exploit multiple CPUs on the same node. Therefore, only one processor on each node was used for our experiments.

The results comparing the performance of the compiler generated and hand customized VMScope1 are shown in Fig. 7. A microscope image of 19,760 × 15,360 pixels was used. Since each pixel in this application takes 3 bytes, a total of 910 MB is required for storage of such an image. A query with a bounding box of size 10,000 × 10,000 and a subsampling factor of 8 was used. The time taken by the compiler generated version ranged from 73 s on 1 processor to 13 s on 8 processors. The speedup on 2, 4, and 8 processors was 1.86, 3.32, and 5.46, respectively. The time taken by the hand coded version ranged from 68 s on 1 processor to 8.3 s on 8 processors. The speedup on 2, 4, and 8 processors was 2.03, 4.09, and 8.2, respectively. Since the code is largely I/O and memory bound, slightly better than linear speedup is possible. The performance of the compiler generated code was lower by 7%, 10%, 25%, and 38% on 1, 2, 4, and 8 processors, respectively.

From this data, we see that the performance of the compiler generated code is very close to the hand coded one in the 1 processor case, but is substantially lower in the 8 processor case. We carefully compared the compiler generated and hand coded versions to understand these performance factors. The two codes use different tiling strategies for the LHS collections. In the hand coded version, an irregular strategy is used which ensures that each input disk block maps entirely into a single tile. In the compiler version, a simple regular tiling is used, in which each input disk block can map to multiple tiles. As shown in Fig. 5, the compiler generated code performs an additional check in each iteration, to determine if the LHS element intersects with the tile currently being processed. In comparison, the tiling strategy used for the hand coded version ensures that this check always returns true, and therefore it does not need to be inserted in the code. But, because of the irregular tiling strategy, an irregular mapping is required between the bounding box associated with each disk block and the actual coordinates on the allocated output tile. This mapping needs to be


carried out after each RHS disk block is read into memory. The time required for performing this mapping is proportional to the number of RHS disk blocks processed by each processor for each tile. Since the output dataset is actually quite small in our experiments, the number of RHS disk blocks processed by each processor per tile decreases as we go to larger configurations. As a result, the time required for this extra processing decreases. In comparison, the percentage overhead associated with the extra checks in each iteration performed by the compiler generated version remains unchanged. This difference explains why the compiler generated code is slower than the hand coded one, and why the difference in performance increases as we go to a larger number of processors.

The results comparing the performance of the compiler generated and hand coded VMScope2 are shown in Fig. 8. This code was executed on the same dataset and query as used for VMScope1. The time taken by the compiler generated version ranged from 44 s on 1 processor to 9 s on 8 processors. The hand coded version took 47 s on 1 processor and nearly 5 s on 8 processors. The speedup of the compiler generated version was 2.03, 3.31, and 4.88 on 2, 4, and 8 processors, respectively. The speedup of the hand coded version was 2.38, 4.98, and 10.0 on 2, 4, and 8 processors, respectively.

A number of important observations can be made. First, though the same query is executed for the VMScope1 and VMScope2 templates, the execution times are lower for VMScope2. This is clearly because no computations are performed in VMScope2. However, a factor of less than two difference in execution times shows that both codes are memory and I/O bound, and that even in VMScope1 the computation time does not dominate. The speedups for the hand coded version of VMScope2 are higher. This again is clearly because this code is I/O and memory bound. The performance of the compiler generated version was better by 6% on 1 processor, and was lower by 10% on 2 processors, 29% on 4 processors, and 48% on 8 processors.

This difference in performance is again because of the difference in the tiling strategies used, as explained previously. Since this template does not perform any computations, the difference in the conditionals and extra processing for each disk

Fig. 8. Comparison of compiler and hand generated versions for VMScope2.


Fig. 9. Comparison of compiler and hand generated versions for Bess.

block has a more significant effect on the overall performance. In the 1 processor case, the additional processing required for each disk block in the hand coded version becomes high enough that the compiler generated version is slightly faster. Note that the hand coded version was developed for optimized execution on parallel systems and therefore is not highly tuned for the sequential case. In the 8 processor case, the extra cost of the conditional in each iteration becomes dominant for the compiler generated version. Therefore, the compiler generated version is almost a factor of two slower than the hand coded one.

The results comparing the performance of the compiler generated and hand coded versions for Bess are shown in Fig. 9. The dataset comprises a grid with 2113 elements and 46,080 time-steps. Each time-step has four 4-byte floating point numbers per grid element, denoting previously computed simulated hydrodynamics parameters. Therefore, the storage requirements of the dataset are 1.6 GB. The Bess template we experimented with performed weighted averaging of each of the four values for each column, over a specified number of time-steps. The number of time-steps used for our experiments was 30,000. The execution times for the compiler generated version ranged from 131 s on 1 processor to 11 s on 8 processors. The speedup on 2, 4, and 8 processors was 1.98, 5.53, and 11.75, respectively. The execution times for the hand coded version ranged from 97 s on 1 processor to 9 s on 8 processors. The speedup on 2, 4, and 8 processors was 1.8, 5.4, and 10.7, respectively. The compiler generated version was slower by 25%, 19%, 24%, and 19% on 1, 2, 4, and 8 processors, respectively.

We now discuss the factors behind the difference in performance of the compiler generated and hand coded Bess versions. As with both of the VMScope versions, the compiler generated Bess performs checks for intersection with the tile for each element. The output for this application is very small, and as a result, the hand coded version explicitly assumes a single output tile. The compiler generated version cannot make this assumption and still inserts the conditionals. The amount of computation associated with each iteration is much higher for this application. Therefore, the


percentage overhead of the extra test is not as large as for the VMScope templates. The second important difference between the compiler generated and hand coded versions is how the averaging is done. In the compiler generated code, each value to be added is first divided by the total number of values being added. In comparison, the hand coded version performs the summation of all values first, and then performs a single division. The percentage overhead of this is independent of the number of processors used. We believe that this second factor is the dominant reason for the difference in performance of the two versions. This also explains why the percentage difference in performance remains unchanged as the number of processors is increased. The performance of the compiler generated code can be improved by performing the standard strength reduction optimization. However, the compiler needs to perform this optimization interprocedurally, which is a topic for future work.

As an average over these three templates and the 1, 2, 4, and 8 processor configurations, the compiler generated versions are 21% slower than the hand coded ones. Considering the high programming effort involved in managing and optimizing disk accesses and computations on a parallel machine, we believe that a 21% slow-down for automatically generated code will be more than acceptable to application developers. It should also be noted that previous work in the area of out-of-core and data intensive compilation has focused only on evaluating the effectiveness of optimizations, and not on any comparisons against hand coded versions.

Our analysis of the performance differences between compiler generated and hand coded versions has pointed us to a number of directions for future research. First, we need to consider more sophisticated tiling strategies to avoid the large performance penalties associated with performing extra tests during loop execution. Second, we need to consider more advanced optimizations like interprocedural code motion and interprocedural strength reduction to improve the performance of compiler generated code.

8. Related work

Our work on providing high-level support for data intensive computing can be considered as developing an out-of-core Java compiler. Compiler optimizations for improving I/O accesses have been considered by several projects. The PASSION project at Northwestern University has considered several different optimizations for improving locality in out-of-core applications [6,20]. Some of these optimizations have also been implemented as part of the Fortran D compilation system's support for out-of-core applications [29]. Mowry et al. have shown how a compiler can generate prefetching hints for improving the performance of a virtual memory system [26]. These projects have concentrated on relatively simple stencil computations written in Fortran. Besides the use of an object-oriented language, our work is significantly different in the class of applications we focus on. Our techniques for execution of loops are particularly targeted towards reduction operations, whereas previous work has concentrated on stencil computations. Our slicing based information extraction for the run-time system allows us to handle applications which


require complex data distributions across processors and disks and for which only limited information about access patterns may be known at compile-time.

Many researchers have developed aggressive optimization techniques for Java, targeted at parallel and scientific computations. javar and javab are compilation systems targeting parallel computing using Java [4]. Data-parallel extensions to Java have been considered by at least two other projects: Titanium [36] and HPJava [7]. Loop transformations and techniques for removing redundant array bounds checking have been developed [12,25]. Our effort is also unique in considering persistent storage, complex distributions of data on processors and disks, and the use of a sophisticated run-time system for optimizing resources. Other object-oriented data-parallel compilation projects have also not considered data residing on persistent storage [5,11,30].

Program slicing has been actively used for many software engineering applications like program based testing, regression testing, debugging and software maintenance over the last two decades [34]. In the area of parallel compilation, slicing has been used for communication optimizations by Pugh and Rosser [31] and for transforming multiple levels of indirection by Das and Saltz [15]. We are not aware of any previous work on using program slicing for extracting information for a run-time system.

Several research projects have focused on parallelizing irregular applications, such as computational fluid dynamics codes on irregular meshes and sparse matrix computations. This research has demonstrated that, by developing run-time libraries and compiler analysis that can place the corresponding run-time calls, such irregular codes can be compiled for efficient execution [2,21,24,35]. Our project is related to these efforts in the sense that our compiler also heavily uses a run-time system. However, our project is also significantly different: the language we need to handle can have aliases and object references, the applications involve disk accesses and persistent storage, and the run-time system we need to interface to works very differently.

Several run-time support libraries and file systems have been developed to support efficient I/O in a parallel environment [13,14,22,33]. They also usually provide a collective I/O interface, in which all processing nodes cooperate to make a single large I/O request. Our work is different in two important ways. First, we are supporting a much higher level of programming by involving a compiler. Second, our target run-time system, ADR, also differs from these systems in several ways. The computation is an integral part of the ADR framework, whereas with the collective I/O interfaces provided by many parallel I/O systems, data processing usually cannot begin until the entire collective I/O operation completes. Also, data placement algorithms optimized for range queries are integrated as part of the ADR framework.

9. Conclusions

In this paper, we have addressed the problem of expressing data intensive computations in a high-level language and then compiling such codes to efficiently manage data retrieval and processing on a parallel machine. We have developed data-parallel
extensions to Java for expressing this important class of applications. Using our extensions, programmers can specify the computations assuming a single processor and a flat memory.

Conceptually, our compiler design has two major new ideas. First, we have shown how loop fission followed by interprocedural program slicing can be used for extracting important information from general object-oriented data-parallel loops. This technique can be used by other compilers that use a run-time system to optimize for locality or communication. Second, we have shown how the compiler and the run-time system can use such information to efficiently execute data intensive reduction computations. This technique for processing such loops is independent of the source language.

These techniques have been implemented in a prototype compiler built using the Titanium front-end. We have used three templates, from the areas of digital microscopy and scientific simulations, for evaluating the performance of this compiler. We have compared the performance of compiler generated code with the performance of codes developed by customizing the run-time system ADR manually. Our experiments have shown that the compiler generated codes are, on the average, 21% slower than the hand coded ones, and in all cases within a factor of two. We believe that these results establish that our approach can be very effective. Considering the high programming effort involved in managing and optimizing disk accesses and computation on a parallel machine, we believe that a 21% slow-down for automatically generated code will be more than acceptable to application developers. It should also be noted that previous work in the area of out-of-core and data intensive compilation has focused only on evaluating the effect of optimizations, and not on comparisons against hand coded versions. Further, we believe that by considering more sophisticated tiling strategies and other optimizations, such as interprocedural code motion and strength reduction, the performance of the compiler generated codes can be further improved.
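
As a concrete, if simplified, illustration of the programming model summarized above, the following plain-Java sketch mirrors the structure of a reduction loop of the kind our extensions express. The names (Pixel, OutputChunk, accumulate, SubsampleTemplate) are hypothetical, and an ordinary sequential loop stands in for the parallel for loop and reduction-variable constructs provided by our dialect; the compiler generated ADR code additionally restricts processing to the data blocks that intersect the query's bounding box.

final class Pixel {
    final int x, y, value;   // coordinates in the attribute space, plus the data value
    Pixel(int x, int y, int value) { this.x = x; this.y = y; this.value = value; }
}

final class OutputChunk {
    private final long[] sum;     // reduction variable: updated only by commutative, associative additions
    private final int width, scale;

    OutputChunk(int width, int height, int scale) {
        this.width = width / scale;
        this.scale = scale;
        this.sum = new long[this.width * (height / scale)];
    }

    // Reduction update: the order in which pixels are applied does not change
    // the result, which is what allows retrieved blocks to be processed in
    // whatever order they arrive from disk.
    void accumulate(Pixel p) {
        sum[(p.y / scale) * width + (p.x / scale)] += p.value;
    }
}

final class SubsampleTemplate {
    // In the dialect this would be a parallel for loop over the subset of a
    // disk-resident collection selected by a range query.
    static OutputChunk run(Pixel[] queryResult, int w, int h, int scale) {
        OutputChunk out = new OutputChunk(w, h, scale);
        for (int i = 0; i < queryResult.length; i++) {
            out.accumulate(queryResult[i]);
        }
        return out;
    }
}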

Acknowledgements

We are grateful to Chialin Chang, Anurag Acharya, Tahsin Kurc, Alan Sussman and other members of the ADR team for developing the run-time system, developing hand customized versions of the applications, helping us with the experiments, and for many fruitful discussions we had with them during the course of this work.

References

[1] A. Afework, M.D. Beynon, F. Bustamante, A. Demarzo, R. Ferreira, R. Miller, M. Silberman, J. Saltz, A. Sussman, H. Tsang, Digital dynamic telepathology––the virtual microscope, in: Proceedings of the 1998 AMIA Annual Fall Symposium, American Medical Informatics Association, November, 1998.
[2] G. Agrawal, J. Saltz, Interprocedural compilation of irregular applications for distributed memory machines, in: Proceedings Supercomputing '95, IEEE Computer Society Press, December, 1995.
[3] H. Agrawal, R.A. DeMillo, E.H. Spafford, Dynamic slicing in the presence of unconstrained pointers, in: Proceedings of the ACM Fourth Symposium on Testing, Analysis and Verification (TAV4), 1991, pp. 60–73.
[4] A. Bik, J. Villacis, D. Gannon, Javar: a prototype Java restructuring compiler, Concurrency Practice and Experience 9 (11) (1997) 1181–1191.
[5] F. Bodin, P. Beckman, D. Gannon, S. Narayana, S.X. Yang, Distributed pC++: basic ideas for an object parallel language, Scientific Programming 2 (3) (1993) 12.
[6] R. Bordawekar, A. Choudhary, K. Kennedy, C. Koelbel, M. Paleczny, A model and compilation strategy for out-of-core data parallel programs, in: Proceedings of the Fifth ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming (PPOPP), ACM SIGPLAN Notices, vol. 30, no. 8, ACM Press, July 1995, pp. 1–10.
[7] B. Carpenter, G. Zhang, G. Fox, Y. Wen, X. Li, HPJava: data-parallel extensions to Java. Available from http://www.npac.syr.edu/projects/pcrc/July97/doc.html, December 1997.
[8] C. Chang, A. Acharya, A. Sussman, J. Saltz, T2: a customizable parallel database for multidimensional data, ACM SIGMOD Record 27 (1) (1998) 58–66.
[9] C. Chang, R. Ferreira, A. Sussman, J. Saltz, Infrastructure for building parallel database systems for multi-dimensional data, in: Proceedings of the Second Merged IPPS/SPDP (13th International Parallel Processing Symposium & 10th Symposium on Parallel and Distributed Processing), IEEE Computer Society Press, April, 1999.
[10] C. Chang, B. Moon, A. Acharya, C. Shock, A. Sussman, J. Saltz, Titan: a high performance remote-sensing database, in: Proceedings of the 1997 International Conference on Data Engineering, IEEE Computer Society Press, April, 1997, pp. 375–384.
[11] A.A. Chien, W.J. Dally, Concurrent aggregates (CA), in: Proceedings of the Second ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming (PPOPP), ACM Press, March, 1990, pp. 187–196.
[12] M. Cierniak, W. Li, Just-in-time optimizations for high-performance Java programs, Concurrency Practice and Experience 9 (11) (1997) 1063–1073.
[13] P.F. Corbett, D.G. Feitelson, The Vesta parallel file system, ACM Transactions on Computer Systems 14 (3) (1996) 225–264.
[14] P.E. Crandall, R.A. Aydt, A.C. Chien, D.A. Reed, Input/output characteristics of scalable parallel applications, in: Proceedings Supercomputing '95, December, 1995.
[15] R. Das, P. Havlak, J. Saltz, K. Kennedy, Index array flattening through program transformation, in: Proceedings Supercomputing '95, IEEE Computer Society Press, December, 1995.
[16] R. Ferreira, B. Moon, J. Humphries, A. Sussman, J. Saltz, R. Miller, A. Demarzo, The virtual microscope, in: Proceedings of the 1997 AMIA Annual Fall Symposium, American Medical Informatics Association, Hanley and Belfus, Inc., October, 1997, pp. 449–453. Also available as University of Maryland Technical Report CS-TR-3777 and UMIACS-TR-97-35.
[17] M.J. Harrold, N. Ci, Reuse-driven interprocedural slicing, in: Proceedings of the International Conference on Software Engineering, 1998.
[18] High Performance Fortran Forum, HPF language specification, version 2.0. Available from http://www.crpc.rice.edu/HPFF/versions/hpf2/files/hpf-v20.ps.gz, January, 1997.
[19] S. Hiranandani, K. Kennedy, C.-W. Tseng, Compiling Fortran D for MIMD distributed-memory machines, Communications of the ACM 35 (8) (1992) 66–80.
[20] M. Kandemir, J. Ramanujam, A. Choudhary, Improving the performance of out-of-core computations, in: Proceedings of the International Conference on Parallel Processing, August, 1997.
[21] C. Koelbel, P. Mehrotra, Compiling global name-space parallel loops for distributed execution, IEEE Transactions on Parallel and Distributed Systems 2 (4) (1991) 440–451.
[22] D. Kotz, Disk-directed I/O for MIMD multiprocessors, in: Proceedings of the 1994 Symposium on Operating Systems Design and Implementation, ACM Press, November 1994, pp. 61–74.
[23] T.M. Kurc, A. Sussman, J. Saltz, Coupling multiple simulations via a high performance customizable database system, in: Proceedings of the Ninth SIAM Conference on Parallel Processing for Scientific Computing, SIAM, March, 1999.
[24] A. Lain, P. Banerjee, Exploiting spatial regularity in irregular iterative applications, in: Proceedings of the Ninth International Parallel Processing Symposium, IEEE Computer Society Press, April, 1995, pp. 820–826.
[25] S.P. Midkiff, J.E. Moreira, M. Snir, Optimizing bounds checking in Java programs, IBM Systems Journal (1998).
[26] T.C. Mowry, A.K. Demke, O. Krieger, Automatic compiler-inserted I/O prefetching for out-of-core applications, in: Proceedings of the Second Symposium on Operating Systems Design and Implementation (OSDI '96), November, 1996.
[27] NASA Goddard Distributed Active Archive Center (DAAC), Advanced Very High Resolution Radiometer Global Area Coverage (AVHRR GAC) data, http://daac.gsfc.nasa.gov/CAMPAIGN_DOCS/LAND_BIO/origins.html.
[28] R. Fiutem, P. Tonella, G. Antoniol, E. Merlo, Flow-insensitive C++ pointers and polymorphism analysis and its application to slicing, in: Proceedings of the International Conference on Software Engineering, 1997.
[29] M. Paleczny, K. Kennedy, C. Koelbel, Compiler support for out-of-core arrays on parallel machines, in: Proceedings of the Fifth Symposium on the Frontiers of Massively Parallel Computation, IEEE Computer Society Press, February, 1995, pp. 110–118.
[30] J. Plevyak, A.A. Chien, Precise concrete type inference for object-oriented languages, in: Ninth Annual Conference on Object-Oriented Programming Systems, Languages, and Applications (OOPSLA '94), October, 1994, pp. 324–340.
[31] W. Pugh, E. Rosser, Iteration space slicing and its application to communication optimization, in: Proceedings of the International Conference on Supercomputing, 1997.
[32] T. Reps, S. Horwitz, M. Sagiv, G. Rosay, Speeding up slicing, in: Proceedings of the Conference on Foundations of Software Engineering, 1994.
[33] R. Thakur, A. Choudhary, R. Bordawekar, S. More, S. Kutipudi, Passion: optimized I/O for parallel applications, IEEE Computer 29 (6) (June, 1996) 70–78.
[34] F. Tip, A survey of program slicing techniques, Journal of Programming Languages 3 (3) (September, 1995) 121–189.
[35] J. Wu, R. Das, J. Saltz, H. Berryman, S. Hiranandani, Distributed memory compiler design for sparse problems, IEEE Transactions on Computers 44 (6) (June, 1995) 737–753.
[36] K. Yelick, L. Semenzato, G. Pike, C. Miyamoto, B. Liblit, A. Krishnamurthy, P. Hilfinger, S. Graham, D. Gay, P. Colella, A. Aiken, Titanium: a high-performance Java dialect, Concurrency Practice and Experience 9 (11) (1998) 11.