Parallel Computing 27 (2001) 1457–1478

www.elsevier.com/locate/parco

Distributed processing of very large datasets with DataCutter ☆

Michael D. Beynon a, Tahsin Kurc b, Umit Catalyurek b, Chialin Chang a, Alan Sussman a, Joel Saltz a,b,*

a Department of Computer Science, University of Maryland, College Park, MD 20742, USA
b Department of Pathology, Johns Hopkins Medical Institutions, Baltimore, MD 21287, USA

Received 16 January 2001; received in revised form 3 April 2001

Abstract

We describe a framework, called DataCutter, that is designed to provide support for subsetting and processing of datasets in a distributed and heterogeneous environment. We illustrate the use of DataCutter with several data-intensive applications from diverse fields, and present experimental results. © 2001 Published by Elsevier Science B.V.

Keywords: Multi-dimensional datasets; Data analysis; Distributed computing; Runtime systems; Component architectures

☆ This research was supported by the National Science Foundation under Grants #ACI-9619020 (UC Subcontract #10152408) and #ACI-9982087, the Office of Naval Research under Grant #N6600197C8534, Lawrence Livermore National Laboratory under Grant #B500288 (UC Subcontract #10184497), and the Department of Defense, Advanced Research Projects Agency, USAF, AFMC through Science Applications International Corporation under Grant #F30602-00-C-0009 (SAIC Subcontract #4400025559).
* Corresponding author. E-mail address: [email protected] (J. Saltz).

0167-8191/01/$ - see front matter © 2001 Published by Elsevier Science B.V.
PII: S0167-8191(01)00099-0

1. Introduction

As networks connecting computational resources get faster, an increasing number of applications are making collective use of computational resources across local and wide-area networks. One of the consequences of the ability to use collections of powerful computers is that scientific and engineering simulations are generating unprecedented amounts of data. In addition, vast amounts of data are being gathered by advanced sensors attached to various instruments such as satellites and microscopes at geographically distributed locations. There are many examples of large useful datasets from simulations [20,24], sensor data [26], and medical imaging [2] (e.g., pathology, MRI, and CT scan). On the storage side, disks continue to become larger and cheaper, making them commodity items, so it is relatively easy to set up a large disk-based storage system at low cost. For example, using off-the-shelf components, it is currently possible to build a disk-based storage cluster with about 1TB of storage space, based on six Pentium III PCs, each with two 80GB EIDE disks, for about $10,000. The availability of such low-cost systems has greatly enhanced a scientist's ability to store large-scale scientific data, resulting in many such independent dataset collections at disparate locations. Simulation or sensor datasets generated or acquired by one group may need to be accessed over a wide-area network by other groups.

The above trends result in very large datasets distributed across a network of storage and computational resources. The primary goal of generating data through large-scale simulations or sensors is to better understand the causes and effects of physical phenomena. Understanding can be achieved by querying, mining and analyzing such massive and complex datasets so that the scientist can gain insight into the problem at hand. Analysis is composed of two main steps. First, the data of interest must be extracted from all possible data in a dataset; in this step, efficient indexing schemes should be employed to quickly locate the data items that satisfy a request. The second step of analysis is the application-specific processing of data that transforms it into a new data product that can be more efficiently consumed by another program or analyzed interactively.

There is a large body of hardware and software research on archival storage systems, including distributed parallel storage systems [18], file systems [25], and data warehouses [17]. Several research projects have focused on high-performance I/O systems [9] and remote I/O [12]. In addition to many end-point solutions, the Grid [11] has been emerging in recent years as infrastructure to link distributed computational, network and storage resources, and to provide services for unified, secure, efficient and reliable access. However, providing support for efficient exploration and processing of very large scientific datasets stored in archival storage systems in a distributed, heterogeneous environment remains a challenging research issue.

In this paper we present a framework, called DataCutter, that is designed to enable exploration and analysis of scientific datasets in distributed and heterogeneous environments. The programming model in DataCutter, called filter-stream programming, represents components of a data-intensive application as a set of filters. Each filter can potentially be executed on a different host across a wide-area network. Data exchange between any two filters is described via streams, which are uni-directional pipes that deliver data in fixed size buffers. We present the implementations of several data-intensive applications, the virtual microscope (VM), out-of-core sort, and isosurface visualization, using DataCutter, and describe experimental results.


2. DataCutter

DataCutter [5,7,8] provides two core services, namely the indexing service and the filtering service. The indexing service provides support for accessing subsets of datasets via multi-dimensional range queries. The filtering service is used to instantiate and execute collections of filters on various hosts, including networks of workstations and SMP clusters. The filtering service allows users to define the connectivity of multiple filters to use together, as well as sets of concurrent filter instances that can be used collectively to perform computation; work can be directed to any running instance.

2.1. Indexing

One of our goals is to provide support for subsetting very large datasets (sizes up to petabytes). In DataCutter, a dataset consists of a set of data files and a set of index files. Data files contain the data elements of a dataset; data files can be distributed across multiple storage systems. Each data file is viewed as consisting of a set of segments. Each segment contains one or more data items, and has some associated metadata. Since each data element is associated with a point in an underlying multi-dimensional space, each segment is associated with a minimum bounding rectangle (MBR) in that space, namely a hyperbox that encompasses the points of all the data elements contained in the segment. The metadata for each segment consists of an MBR, and the offset and size of the segment in the file that contains it. Spatial indices are built from the MBRs for the segments in a dataset. When a range query is submitted, entire segments are retrieved from archival storage, even if the MBR for a particular segment only partially intersects the range query (i.e. only some of the data elements in the segment have been requested).

Efficient spatial data structures have been developed for indexing and accessing multi-dimensional datasets, such as R-trees and their variants [13]. However, storing very large datasets may result in a large set of data files, each of which may itself be very large. As a result, a single index for an entire dataset could be very large, requiring large memory space and CPU cycles for managing and searching the index. Assigning an index file for each data file in a dataset could also be expensive, because it is then necessary to access all the index files for a given search. To efficiently index very large datasets, DataCutter uses a multi-level hierarchical indexing scheme implemented via summary index files and detailed index files. The elements of a summary index file associate metadata (i.e. an MBR) with multiple segments and/or detailed index files. Detailed index file entries specify segments. A detailed index file is associated with a subset of data files, and stores the index and metadata for all segments in those data files. Data files can be organized in an application-specific way into logical groups, and each group can be associated with a detailed index file for better performance. For example, in satellite datasets, each data file may store data for one week. A detailed index file can be associated with data files grouped by month, and a summary index file can contain pointers to detailed index files for the entire range of data in the dataset. In the current implementation of the DataCutter indexing service, R-trees are used as the indexing method for summary and detailed index files.

2.2. Processing: Filter-stream programming model

A networked collection of storage and computing systems provides a powerful environment, yet introduces many unique challenges for applications. First, both computational and storage resources can be at distributed locations in the network. Second, the characteristics, capacity and power of resources, including storage, computation, and network, can vary widely. Third, the distributed resources can be shared by other applications. These characteristics have several implications for developing efficient applications. An application should be structured to accommodate the heterogeneous nature of the environment. Moreover, the application should be optimized in its use of shared resources and be adaptive to changes in their availability. For instance, it may not be efficient or feasible to perform all processing at a data server when its load becomes high. In that case, the efficiency of the application depends on its ability to perform application processing on the data as it progresses from the data source(s) to the consumer, and on the ability to move all or part of its computation to other machines that are well suited for the tasks.

Recent research on programming models for developing applications in a wide-area environment has proposed component-based models [1,14,16,21,22], in which an application is composed of multiple interacting computational objects. The filter-stream programming model in DataCutter represents processing components of an application as a set of filters. Data exchange between any two filters is described via streams, which are uni-directional pipes that deliver data in fixed size buffers. Filters are location-independent, because stream names are used to specify filter-to-filter connectivity rather than endpoint location on a specific host. This allows for the placement of filters on different hosts in a distributed environment. Therefore, processing, network and data copying overheads can be minimized by the ability to place filters on different platforms.

A filter is a user-defined object that performs application-specific processing on data. Currently, filter code is expressed using a C++ language binding by subclassing a filter base class. This provides a well-defined interface between the filter code and the filter service. The interface for method callbacks that a filter writer must implement consists of an initialization function, a processing function, and a finalization function.

class ApplicationFilter : public DC_Filter_Base_t {
public:
  int init(int argc, char *argv[]) { ... }
  int process(stream_t st[]) { ... }
  int finalize(void) { ... }
};

A stream is an abstraction used for all filter communication, and specifies how filters are logically connected. A stream also denotes a supply of data to and from the storage media, or a flow of data between two separate filters, or between a filter and a client. There are two kinds of streams in DataCutter: file streams and pipe streams. A file stream is used to access files. A pipe stream is the means of uni-directional data flow between two filters, from producer to consumer. Bi-directional data exchange can be achieved by creating two pipe streams in opposite directions between two filters.

All transfers to and from streams are through a provided buffer abstraction. A buffer represents a contiguous memory region containing useful data. Streams transfer data in fixed size buffers. The size of a buffer is determined in the init call; a filter discloses a minimum and an optional maximum buffer size for each of its streams. The actual size of the buffer allocated by the filtering service is guaranteed to be at least the minimum value. The size of the data in a buffer can be smaller than the size of the buffer. Thus, the buffer contains a pointer to the start, the length of the portion containing useful data, and the maximum size of the buffer. The current prototype implementation uses TCP for stream communication, but any point-to-point communication library could be used.

Filter operation progresses as a sequence of cycles, each of which handles a single application-defined unit-of-work (see Fig. 1). An example of a unit-of-work would be a spatial query for an image processing application that describes a region within an image to retrieve and process. A work cycle starts when the filtering service calls the init function, which is where any required resources such as memory or disk scratch space are pre-allocated. Next the process function is called to continually read data arriving on the input streams in buffers from the sending filters. A special marker is sent after the last buffer to mark the end of the current unit-of-work. The finalize function is called after all processing is finished for the current unit-of-work, to allow release of allocated resources. When a work cycle has been completed, these interface functions may be called again to process another unit-of-work.

Fig. 1. Data buffers and end-of-work markers on a stream.

3. Application execution

The manual restructuring of an application in filter-stream style is referred to as decomposing the application. In choosing an appropriate decomposition, the application designer should consider the complete dataflow path from data generation to ultimate consumption, and the granularity of each filter along this path. The main goal is to achieve efficient use of limited resources in a distributed and heterogeneous environment. Once the application is decomposed and implemented as a set of filters, execution occurs by choosing a set of hosts on which to place and instantiate the filters to perform application-specific processing.

3.1. Application model

Applications are invoked by the user, and start out as a single process on some initial host. This process is called the console process. The console process uses the DataCutter filtering service to instantiate and execute collections of filters on other hosts. The style of interaction between the filters and the console process is referred to as the application model, and is defined along the following two axes.

Detached vs. Pass-Thru. This classification is based on where the output of the last filter is directed. Detached is where filter output data are not handled by the console process. For example, output data may be processed in some manner within the filters, and consumed outright, or written to a disk file. Sorting of out-of-core data is an example of this type, where output is written directly by a filter to a file. In contrast, Pass-Thru is where filter output is consumed by the console process. Pass-Thru resembles an elaborate remote procedure call (RPC) or remote method invocation (CORBA, Java RMI), where the result is collected in the same process that initiated the remote call.

Stand-Alone vs. Client–Server. This describes the entity that initiated the work that drives the filters. Stand-Alone represents cases where the console process initiates the work. An example of this is a simulation code that decides future data requests internally, based perhaps on the current time-step. In contrast, Client–Server is the case where one process sends work to the console process, which, acting like a front-end, passes the work to the filters for processing. The work description in the Client–Server case is commonly referred to as a query; database and image servers are common examples.

3.2. Placement

Filters are the unit of placement, and each filter can potentially be executed on a different host. In addition, a filter's location may change at unit-of-work boundaries during the course of execution. Note that this does not imply true migration of code and state; rather, placement can be recomputed, the original filter stopped on its original host, and a new one created with a different placement, with a single buffer transfer supported per filter for application-controlled restoration of state. This approach avoids many of the details and overhead involved in check-pointing and process migration. Filters need to be structured appropriately to handle such events. For cases when a filter has some affinity to a particular host, the filter can be pinned to that host, so that the filter will always be placed there.

Filters are location-independent, because stream names are used to specify filter-to-filter connectivity rather than endpoint location on a specific host. This allows the placement of filters on hosts to be controlled by the filtering service. For example, two filters with a large communication volume between them should not be placed on opposite ends of a slow wide-area network connection.
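The following minimal sketch illustrates this location independence. It is not the actual DataCutter interface: the stream and host names are hypothetical, and the filter names anticipate a simplified two-stream slice of the virtual microscope pipeline of Section 4.1. Connectivity is expressed purely with stream names, while hosts appear only in a separate placement specification (Section 3.3) that can be recomputed at unit-of-work boundaries.

// A minimal sketch, not the actual DataCutter API: logical connectivity
// (by stream name) is kept separate from physical placement.
#include <cstdio>
#include <map>
#include <string>
#include <vector>

struct StreamSpec { std::string name, producer, consumer; };

int main() {
  // Filter group: who talks to whom, identified only by stream names.
  std::vector<StreamSpec> streams = {
      {"raw",    "read_data",  "decompress"},
      {"pixels", "decompress", "view"},
  };

  // Placement specification: one host per filter. read_data is pinned to the
  // data server because it has affinity to the disks holding the dataset.
  std::map<std::string, std::string> placement = {
      {"read_data",  "data-server"},      // hypothetical host names
      {"decompress", "compute-node-1"},
      {"view",       "client-host"},
  };

  for (const StreamSpec &s : streams)
    std::printf("%s on %s --[%s]--> %s on %s\n",
                s.producer.c_str(), placement[s.producer].c_str(),
                s.name.c_str(),
                s.consumer.c_str(), placement[s.consumer].c_str());
  return 0;
}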


Fig. 2. Startup process for executing an application.

3.3. Execution system

The output of the planning system is a description of how the application filters should be organized, and the execution system is responsible for following this plan at runtime. The filtering service performs all steps necessary to instantiate filters on the desired hosts, to connect all logical stream endpoints, and to call the filter's interface functions for processing work. The description of where to instantiate filters is provided by a placement specification. The specification for each filter is simply the host to execute the filter on.

Security concerns have made it difficult to start processes on remote machines in a uniform manner. To deal with this problem in the current prototype, the provided application execution daemons (appd) must be running on every host used to execute filters. In addition, a single provided Directory Daemon (dird), which is similar to an LDAP server for storing metadata, is used to record the contact information (host, port, pid) for each appd. The dird is the only process that runs on a well-defined host and port. All other ports are ephemeral, and are registered with the dird to later be queried. Currently, we require an application binary to exist on every host, which must contain at least the code for the filters that will execute on that host. The binary can contain code for all the filters, and those filters not intended to run on a given host will not be instantiated at runtime. Currently we manually compile and copy the binaries as needed, but convenience procedures to do this could be added. Deployment of application code is not a goal of this work, hence the lack of support.

Fig. 2 illustrates the startup process. The application is started by running the application binary on some host. This becomes the console process. The console queries the dird to get the relevant appd contact information, and then sends an execute command to each appd as instructed by the placement specification. The appd executes the application binary on that host, which then contacts the console process and performs some initial handshaking to set up the stream abstractions. In the current prototype, one POSIX thread is created for each filter that is to run on a host, and it creates the needed new application filter objects. The filter's thread first waits for a work buffer to arrive from the console process. When a work buffer arrives, the thread calls the init interface function. Next, the thread calls the process function. When this returns, all open streams are closed and the finalize function is called. At the completion of the unit of work, the thread loops back to check for another work buffer from the console process. After the last work buffer, filter objects are deleted before the thread exits.
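The following self-contained sketch mirrors that per-filter work cycle. The Filter interface follows the init/process/finalize callbacks of Section 2.2, but the Buffer, UnitOfWork, and driver types are illustrative stand-ins, not the actual DataCutter classes.

#include <cstdio>
#include <queue>
#include <vector>

struct Buffer { std::vector<char> data; };        // a fixed-size data buffer
struct UnitOfWork { std::queue<Buffer> input; };  // buffers for one work item

class Filter {                      // mirrors the callback interface of Section 2.2
public:
  virtual int init() = 0;                  // pre-allocate per-unit-of-work resources
  virtual int process(UnitOfWork &w) = 0;  // consume buffers until end-of-work
  virtual int finalize() = 0;              // release per-unit-of-work resources
  virtual ~Filter() {}
};

class ByteCounter : public Filter {  // toy filter: counts bytes in one unit-of-work
  long total = 0;
public:
  int init() override { total = 0; return 0; }
  int process(UnitOfWork &w) override {
    while (!w.input.empty()) {       // an empty queue stands in for the end-of-work marker
      total += static_cast<long>(w.input.front().data.size());
      w.input.pop();
    }
    return 0;
  }
  int finalize() override {
    std::printf("bytes in this unit-of-work: %ld\n", total);
    return 0;
  }
};

int main() {
  // Driver loop: the per-filter thread runs one init/process/finalize cycle
  // for each work buffer received from the console process.
  ByteCounter f;
  std::queue<UnitOfWork> pending;
  pending.push(UnitOfWork{});
  pending.back().input.push(Buffer{std::vector<char>(1024)});
  pending.back().input.push(Buffer{std::vector<char>(512)});
  while (!pending.empty()) {
    f.init();
    f.process(pending.front());
    f.finalize();
    pending.pop();
  }
  return 0;
}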

4. Application examples

In this section, we describe the implementation of several diverse applications using DataCutter. The first is a client–server application that processes queries into multi-dimensional image data. The second is an external sort application for sorting out-of-core data. We present experimental results on a variety of platforms.

4.1. The Virtual Microscope (VM)

The VM [2] is an application that supports interactive viewing and processing of digitized data arising from tissue specimens. It provides a realistic digital emulation of a high power light microscope. At the basic level, VM emulates the usual behavior of a physical microscope, including continuously moving the stage and changing magnification and focus. The Java client interface for VM is shown in Fig. 3.

Fig. 3. The VM client.

The raw data for VM is generated by digitally scanning collections of full microscope slides under high power. The digitized images from a slide are effectively a 3D dataset, since each slide can contain multiple focal planes. The uncompressed size of a slide with a single focal plane can be up to several gigabytes. A client query selects a focal plane of a slide, a 2D spatial region of interest within the focal plane, and a magnification level. The processing of the query requires projecting high resolution data onto a grid of suitable resolution (governed by the desired magnification) and appropriately compositing pixels mapping onto a single grid point, to avoid introducing spurious artifacts into the displayed image. In order to achieve high I/O bandwidth, each focal plane in a slide is regularly partitioned into data chunks, each of which is a rectangular subregion of the 2D image. Data chunks are declustered across all disks attached to the server to achieve I/O parallelism. Each pixel in a chunk is associated with a coordinate (in the x- and y-dimensions) in the entire image. Each chunk has an associated MBR based on all the pixels in the chunk. An index is created using the MBR of each chunk.

4.1.1. Filter implementation

The filter decomposition used for the VM system is shown in Fig. 4. This filter pipeline structure is natural for query-response applications. The thickness of the stream arrows indicates the relative volume of data that flows on the different streams. During query processing, the following steps are performed:

· read_data (R): Full-resolution image blocks that intersect the query region are read from the disk-resident dataset, and written to the output stream.
· decompress (D): Image blocks are read individually from the input stream. Each block is decompressed using JPEG decompression and converted into a 3-byte RGB format. The image block is then written to the output stream.
· clip (C): Uncompressed image blocks are read from the input stream. Portions of the block that lie outside the query region are removed, and the clipped image block is written to the output stream.
· zoom (Z): Image blocks are read from the input stream, subsampled to achieve the magnification requested in the query, and then written to the output stream.
· view (V): Image blocks are received for a given query, collected into a single reply, and sent to the client using the standard VM client/server protocol.

Fig. 4. Virtual microscope decomposition.

4.1.2. Experimental results

Using the filters described in Section 4.1.1, we have implemented a simple data server for digitized microscopy images, stored in the IBM HPSS archival storage system [15] at the University of Maryland. The HPSS setup has 10TB of tape storage space and 500GB of disk cache, and is accessed through a 10-node IBM SP with four multiprocessor (one 4-processor and three 2-processor) and six single-processor nodes. An existing VM client trace driver [6] was used to drive the experiments. This driver was always executed on the same host as the view filter, which is referred to as the client host. The server host is where the read_data filter is run, which is the machine containing the disks with the dataset.

4.1.3. Multi-level indexing

The first experiment isolates the impact of organizing the dataset into multiple files and using the multi-level indexing scheme. In this experiment we use a 4GB 2D compressed JPEG image dataset (90GB uncompressed). This dataset is equivalent to a digitized slide with a single focal plane of 180K × 180K RGB pixels. The 2D image is regularly partitioned into 200 × 200 pixel data chunks and stored in HPSS in a set of files. We defined five possible queries, each of which covers 5 × 5 chunks of the image (see Fig. 5). The execution times presented are response times seen by the visualization client, averaged over five repeated runs. Fig. 6 shows the results when the 2D image is partitioned into 1 × 1, 2 × 2, 4 × 4 and 10 × 10 rectangular regions, and all data segments in each region are stored in a data file. Each data file is associated with a detailed index file, and there is one summary index file for all the detailed index files for each partitioning.

Fig. 5. 2D dataset and query regions. The table lists transmitted data sizes for q5. zoom (no) is for no subsampling and zoom (8) is for a subsampling factor of 8 (in each of the two spatial dimensions).

Fig. 6. Query execution time with the dataset organized into 1 × 1, 2 × 2, 4 × 4 and 10 × 10 files. Load shows the time to open and access the files that contain segments intersecting a query. Computation shows the sum of the execution time for searching segments that intersect a query, and for processing the retrieved data via filters.

As is seen in the figure, the load time decreases as the number of files is increased. This is because HPSS loads the entire file onto the disks used as the HPSS cache when a file is opened. When there is a single file, the entire 4GB file is accessed from HPSS for each of the queries – in these experiments, all data files are purged from the disk cache after each query is processed. When the number of files increases, only a subset of the detailed index files and data files are accessed using the multi-level hierarchical indexing scheme, decreasing the time to access data segments. Note that the load time for query 5 in the 2 × 2 case is substantially larger than that of the other queries, because query 5 intersects segments from each of the four files (Fig. 5), hence the same volume of data is loaded into the disk cache as in the 1 × 1 case. The load time for that query is also larger than in the 1 × 1 case because of the overhead of seeking, loading, and opening four files instead of one. The computation time, on the other hand, remains almost constant, except for the 10 × 10 case, where it slightly increases due to the overhead of opening many files. These results demonstrate that applications can take advantage of the multi-level hierarchical indexing scheme by organizing a dataset into an appropriate set of files. However, having too many files may increase computation time, potentially decreasing overall efficiency when multiple similar queries are executed on the same dataset.

4.1.4. Varying the processing

One node of the IBM SP is used to access the stored dataset, and the client was run on a SUN workstation connected to the SP node over the department Ethernet. We experimented with different placements of the filters by running some of the filters on the same SP node where the data is accessed, as well as on the SUN workstation where the client is run. In Fig. 7 we consider varying the placement of the filters under different processing requirements. Figs. 7(a) and (b) show the query execution times when the image is viewed at the highest magnification (no subsampling) and when the subsampling factor is 8 (i.e. every 8th pixel in each dimension is output), respectively.

Fig. 7. Execution time of queries under varying processing (subsampling). R, D, C, Z, V denote the filters read_data, decompress, clip, zoom, and view, respectively. <server>–<client> denotes the placement of the filters in each set: (a) no zooming (no subsampling); (b) subsampling by a factor of 8.

There are three dominant factors in these experiments: (1) placement of the most computationally intensive filter (decompress); (2) volume of data transmitted between the two machines; (3) amount of data sent by the view filter to the client driver. Consider the first two groups of bars in the figures. The difference between the groups within each figure is the placement of the zoom filter on the server (RDCZ-V) or the client host (RDC-ZV). When there is no subsampling, query execution times remain almost the same for both placements, because the volume of data transfer between the server and client is the same in both cases. In the case of subsampling, the placement of the zoom filter makes a difference, because the volume of data sent from the server to the client decreases if the zoom filter is executed at the server. Now consider the last two groups of bars in the figures. The difference between the groups within each figure is the placement of the decompress filter (RD-CZV or R-DCZV). For the no-subsampling case, time increases substantially when decompress is placed on the client, because of the combined effects of the most computationally intensive filter (decompress) and the large amount of data being processed by view and sent to the client driver. When there is subsampling, the query execution time is not as high, because the amount of data processed by view and sent to the client driver is much lower. These experiments demonstrate the complex interactions between placement of computation and communication volume.

4.2. External sort

External sort has a long history of research in the database community and has resulted in many fast algorithms [4]. The application starts with a large unsorted data file that is partitioned across the disks on multiple nodes, and the output is a new partitioned data file that contains the same data sorted on a key field. The sample data file is based on a standard sorting benchmark that specifies 100B tuples, with the first 10B being the sort key. The distribution of the key values is assumed to be uniform, both in terms of the unsorted file as a whole and for each partition. A recent record holder for the fastest external sort is NowSort [4], and we use the pipelined version of their two-pass parallel sort as our basic algorithm.

4.2.1. Filter implementation

The implementation of external sort using filters proceeds in two phases: partition and merge. The location of the unsorted dataset dictates the number of nodes to be used for execution. There are two filters on each node, Partitioner and Sorter. The Partitioner filter reads buffers from the unsorted input file, and distributes the tuples into buckets based on the key value (see the sketch at the end of this section). When a bucket is full, it is sent to the Sorter filter on the corresponding node. The Sorter continually reads buffers from the input streams, extracts a portion of the key and appends it to a sort buffer. When the sort buffer becomes full, it is sorted using a partial-radix sort (making two passes over the keys with a radix size of 11 bits, plus a cleanup pass) and written to scratch space as a single temporary run. When all buffers have been read from the input streams, the merge phase begins with only the Sorter filters still executing. The Sorter filter then reads sorted tuples from the head of each of the temporary run files and merges them by repeatedly selecting the lowest-valued key among all merge input buffers, copying it to a single output buffer, and writing this output buffer to the sorted output file on disk. This application is essentially a parallel SPMD program, with an all-to-all communication pattern. This organization is in contrast to the previous VM application, which was structured as a processing chain pipeline.

4.2.2. Experimental results

The experimental setup is a 16-node cluster of dual 400 MHz Pentium IIs with 256MB memory per node, running Linux kernel 2.2.12. There are two interconnects, a shared Ethernet segment and a switched gigabit Ethernet channel. We use the faster switched interconnect for all experiments, and because of a problem with the network interface cards on some of the nodes, only use a maximum of eight nodes in all experiments. The nodes are isolated from the rest of the network, and the cluster was not running other jobs during the experiments. Each node has a single 18GB hard disk. All data for a particular node, including temporary data, are stored on the single local disk. The dataset consists of a single 128MB unsorted file per node. The unsorted dataset was generated randomly with a uniform key distribution. The execution time for an experiment is the maximum time across all nodes used for the experiment. Each experiment is repeated for five trials, and the execution time shown represents the average of the trials. The disk cache was cleared between executions to ensure a cold disk cache for each run.

Varying memory size. In this set of experiments we vary the amount of memory available for filters on some of the nodes while keeping it constant for filters on the remaining nodes. Our goal is to create a heterogeneous configuration in a controlled way, and observe the effects of heterogeneity on application performance. Fig. 8 shows the execution times under varying memory constraints. The solid line in all of the graphs denotes the base case, in which the size of the memory is reduced equally across all nodes, and shows the change in the execution time. The amount for the Full memory case is determined empirically to minimize execution time while consuming the least memory. Memory parameters are varied by halving the full memory amount for the 1/2 case, halving again for the 1/4 case, and so on. Constraining memory causes the filters to read, process, and write data in smaller pieces, so performance should suffer. As is shown by the solid line in the figure, the execution time increases as the size of the memory is decreased.

Fig. 8. Execution time of sorting a 1GB dataset using eight nodes. Two groups of four nodes: (1) memory usage held constant (Full), (2) memory usage reduced (1/X): (a) regular partitioning of input data and Partitioner filter output; (b) irregular partitioning of input data; (c) irregular partitioning of input data and Partitioner filter output; and (d) irregular partitioning of input data and Partitioner output, two pairs of filters per Full node.

In the experiments with a heterogeneous memory configuration, we divide the eight nodes into two sets of four nodes. The first set of nodes retains the initial amount of memory (i.e., Full memory) for all runs, while the second set has its memory reduced for each case. The left bar for each case in each graph shows the maximum of the execution times on the nodes with full memory. Similarly, the right bar for each case in each graph shows the maximum of the execution times on the nodes with reduced memory. As is shown in Fig. 8(a), we observe performance degradation similar to the base case. The nodes that use a constant amount of memory finish sooner, but the entire job runs no faster. In this experiment, both the input data to the Partitioner filter and the output of the Partitioner (i.e. the input data to the Sorter filter) on each node are regularly partitioned across all the nodes. Notice that the total amount of memory across all nodes for this experiment is larger than that for the base case, because half the nodes keep full memory. For example, for the 1/8 memory case, 350% more memory was being used in aggregate than for the 1/8 base case. Instead of reducing sort time, the extra memory results in a load imbalance between the two sets of four nodes. Hence, in the next experiment we partitioned the amount of input data for each node irregularly, to attempt to reduce overall execution time.

Fig. 8(b) shows that the execution time increases when we partition the input data based on available node memory, i.e., full nodes have more input data than nodes with reduced memory. This results from an increase in the time for the partitioning phase, because the Partitioner filters on the set of nodes with full memory have more input records to process. The execution time for the merge phase is effectively unchanged, because the amount of data sent to each node is unchanged. Fig. 8(c) shows the result of partitioning the output of the Partitioner filter (and thus the merge phase work) according to the memory usage of the receiving node. This experiment, however, moves too much work to the nodes with full memory, so that those nodes become the longest running node set. To improve performance further, we partitioned both the input data and the output of the Partitioner filter as was done in the experiment for Fig. 8(c), but executed two Sorter and two Partitioner filters on the nodes with Full memory to take advantage of the dual processors available in each node. As is seen in Fig. 8(d), the performance is better than for the previous cases. These experimental results show that application-level workload handling and careful placement of filters can deal with heterogeneity, which can have a significant impact on performance.
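To make the Partitioner step of Section 4.2.1 concrete, the following sketch distributes benchmark tuples (100-byte records with a 10-byte key, uniformly distributed) into per-node buckets. The bucket-boundary computation and the use of only the first eight key bytes are illustrative assumptions, not the code used in the experiments; in the actual filter, a bucket that fills up is sent on the stream connected to the Sorter on the matching node.

#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int TUPLE_SIZE = 100;   // benchmark record size in bytes
constexpr int KEY_SIZE   = 10;    // leading bytes of each record form the key

// Map a key to a destination node by its leading 8 bytes; with uniformly
// distributed keys, equal-width ranges balance the load across nodes.
int dest_node(const unsigned char *key, int num_nodes) {
  std::uint64_t prefix = 0;
  for (int i = 0; i < 8 && i < KEY_SIZE; ++i)
    prefix = (prefix << 8) | key[i];
  std::uint64_t range = (~0ULL / static_cast<std::uint64_t>(num_nodes)) + 1;
  return static_cast<int>(prefix / range);
}

// Append each tuple of an input buffer to the bucket of its destination node.
// In the filter, a full bucket is written to the Sorter on that node.
void partition(const std::vector<unsigned char> &input, int num_nodes,
               std::vector<std::vector<unsigned char>> &buckets) {
  for (std::size_t off = 0; off + TUPLE_SIZE <= input.size(); off += TUPLE_SIZE) {
    int node = dest_node(&input[off], num_nodes);
    buckets[node].insert(buckets[node].end(),
                         input.begin() + off,
                         input.begin() + off + TUPLE_SIZE);
  }
}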

5. Support for parallelism

In the previous sections, we investigated the effects of placement decisions on the overall performance of data-intensive applications. Our results show that the choice of placement represents an important degree of freedom in affecting application performance: placing filters with affinity to data sources near the sources, minimizing communication volume on slow links, and placing filters to deal with heterogeneity. In this section, we discuss another degree of freedom: parallelism in executing application-defined units of work via group instances and parallel filters. Group instances enable inter unit-of-work parallelism through concurrent instances of filter groups, because multiple units-of-work can be processed concurrently by different group instances. Parallel filters, on the other hand, allow a finer level of parallelism via multiple copies of a single filter within a group. The goal of parallel filters is twofold: efficient execution of parallel reduction operations in a distributed environment, and achieving intra unit-of-work parallelism.

5.1. Filter groups and group instances

A filter group is a set of running filters that are logically related and are used together to perform a computation. For example, an application's console process may implement filters for performing various image processing operations: anti-alias, subsample, reduce-colors, and clip. One such logically related set of filters could be a processing chain that first subsamples and then clips, which could then be executed as a filter group. To make use of a filter group, it must be instantiated. This involves choosing which hosts to use (placement), and how to create the individual filters (instances). The instantiation process can be repeated to create multiple independent realizations of the filter group.

In many cases, resources are shared by multiple applications. In addition, an application or a set of applications may use groups of filters for executing multiple workloads. For instance, in a collaborative environment, requests from multiple users can use the same set of filters, which collectively implement the VM pipeline (see Section 4.1). Therefore, the number of instances of filter groups to be created and the scheduling of instances are important decisions that affect overall performance. In particular, the question is: should a new instance be created, or an existing one reused? In deciding how many instances to create, the basic trade-off is between host resources and latency. When a new filter group instance is created, resources are allocated to be used for processing new work. In contrast, reuse of an existing filter group instance will share the previously allocated resources with this new work. If a new work item is appended to an existing instance, this can potentially add queuing delay to the response time of the new work item due to FIFO processing of previous work items. Creating a new instance in this case will avoid the queuing delay, but does so by concurrently running another filter that can saturate shared resources such as processors, physical memory, and disk. Since each concurrent instance processes its assigned workload independently of other instances, an application must be able to handle out-of-order delivery of results from a batch of work requests. The instance creation decision should be made carefully to avoid overloading shared resources such as memory and processors, while still achieving good performance for the application.

The DataCutter filtering service currently provides the abstraction of a filter group to the console process. Multiple concurrent running instances of any number of filter groups are supported by the DataCutter runtime system. Each filter within a filter group is executed in a separate POSIX thread context, which allows for concurrent execution. The console process can then append work to any running instance it has created. Work is handled in FIFO order by an instance. There is no ordering between work appended to concurrent filter instances. Fig. 9 provides a summary illustration of a filter group comprised of filters A and B with a single stream S for communication. This filter group is instantiated twice, each time with a different placement. The filters for the first instance are all located on host1, and are denoted by subscript 0, and the filters for the second instance are located on both hosts, denoted by subscript 1.

Fig. 9. Two identical filter group instances running with different placements.

5.2. Explicitly parallel filters

Performing generalized reduction-type operations is one of the common processing operations in many applications that make use of large multi-dimensional datasets. The processing steps in generalized reductions consist of (1) retrieving data items of interest, (2) applying application-specific transformation operations on the retrieved input items, (3) mapping the input items to output items, and (4) aggregating, in some application-specific way, all the input items that map to the same output data item. Aggregation functions are usually commutative and associative, i.e., the correctness of the output data values does not depend on the order in which input data items are aggregated. An intermediate data structure, referred to as an accumulator, can be used to hold intermediate results during processing. For example, a Z-buffer can be used as an accumulator to hold color and distance values in a rendering application. The intermediate results stored in the accumulator are post-processed to produce the final results for the output. Parallel filters aim to implement generalized reductions in a distributed and heterogeneous environment. A generalized reduction operation can be realized by a filter group that implements transformation, mapping, and aggregation operations and encapsulates the accumulator data structure. We are developing support for parallel filters of two classes, 1-filter and n-filter, which are differentiated based on the granularity of the mapping between application tasks and filters.

The 1-filter parallel filters represent an entire parallel program as a single filter instance to DataCutter. The goal is to allow the use of optimized parallel implementations of user-defined mapping, transformation, filtering and aggregation functions on a particular machine configuration. For instance, coupling a hydrodynamics simulator to a chemical transport simulator would require a series of transformation and aggregation operations on the output of the hydrodynamics simulator to create the input for the chemical transport simulator. A common transformation operation is the projection of flow values computed at points on one grid to flux values at faces of another mesh for chemical transport calculations. The projection requires solving linear systems of equations. Efficient parallel implementations exist for the solution of linear equations on distributed-memory and shared-memory platforms. In this case, a parallel implementation of the projection operation can be a 1-filter parallel filter in the group of filters that implement the operations needed for coupling the two simulators.

The n-filter parallel filters are represented as concurrent instances of the same filter. For this class, the filter code itself is the unit of parallelism, and is replicated across the parallel machine. The runtime system targets the combined use of ensembles of distributed-memory systems and SMP machines. For example, Fig. 10(a) shows a decomposition of the operations for isosurface rendering [23], which is a common scientific visualization method, into filters. This collection of filters represents an instance of n-filter parallel filters. A possible execution of the parallel filters on a heterogeneous cluster of machines is shown in Fig. 10(b).

Fig. 10. Isosurface visualization of a chemical species transport simulation: (a) filter decomposition and (b) a possible runtime instantiation.

We are in the process of developing methods for dynamic instantiation and scheduling of multiple instances of filters. For the load balance optimizations, we are currently developing adaptive methods for dynamic scheduling of data-flow into filter instances that will take into account heterogeneity and availability of system resources. We are also in the process of developing software to enable data exchange between parallel filters. The data structures (e.g., the accumulator) can be distributed among multiple filter instances. Therefore, software support is needed to enable data exchange between distributed data structures. We plan to generalize the Meta-Chaos [10] runtime library and develop an appropriate interface for filters. Meta-Chaos is a runtime library that provides support for data exchange between data structures distributed by parallel libraries in one or more parallel programs.

5.3. Transparent copies

Pipelining works well when all stages are balanced, both in terms of the relative processing time of the stages and in terms of the time of each stage compared to the communication cost between stages. Oftentimes, the processing of filter-based applications is not well balanced, which results in bottlenecks that cause other filters before and after the bottleneck filter to become idle. This imbalance and the resulting performance penalty are addressed by using what we refer to as transparent copies, where the filter is unaware of the concurrent filter replication. Furthermore, we define a copy set to be all transparent copies of a given filter that are executing on a particular host. The filtering service maintains the illusion of a single logical point-to-point stream for communication between a logical producer and a logical consumer in the filter group. When this logical producer and/or logical consumer has transparent copies, the filter service must decide, for each producer, which consumer to send a buffer to. For example, in Fig. 11, if P1 issues a buffer write operation to the logical stream that connects P to F, the choice is to send it to the copy set on host3 or host4. Each copy set shares a single buffer queue, so there is perfect demand-based balance within a single host. For distribution between copy sets (different hosts), we are developing several automatic policies: (1) round robin distribution of buffers among copy sets, (2) round robin among copy sets based on the number of copies on each host, (3) a demand-driven sliding window mechanism, and (4) a user-defined callback mechanism that allows for any distribution policy.

Fig. 11. P, F, C filter group instantiated using parallel filters.

Not all filters will operate correctly in parallel as transparent copies, because of internal filter state. For example, a filter that attempts to compute the average size of all buffers processed for a unit of work will not arrive at the correct answer, because only a subset of the total buffers for the unit-of-work is seen at any one copy, hence the internal sum of buffer sizes is less than the true total. Such cases can be annotated to prevent the filter service from adding transparent copies. In our experience, many filters do not exhibit this restriction, or an additional step can be added to combine the parallel results into a single one (this is the mechanism used by the isosurface application Merge filter described in Section 5.4).

5.4. Isosurface application experiment

Isosurface extraction is a technique for visualizing the surface of a 3D volume with a particular scalar value. Given a 3D grid with scalar values at grid points and a user-defined scalar value, called the isosurface value, an isosurface rendering algorithm extracts the surface on which the scalar value is equal to the isosurface value. The extracted surface (isosurface) is rendered to generate an image. The marching cubes algorithm [19] is a widely used algorithm for extracting isosurfaces from a rectilinear mesh. In this algorithm, volume elements (voxels) are visited one by one. At each voxel, the scalar values at the corners of the voxel are compared to the isosurface value. If the scalar values are all less than, or all greater than, the isosurface value, no surface passes through the voxel. Otherwise, using the scalar values at the corners, the algorithm generates a set of polygons that approximate the surface passing through the voxel. The polygons from the current voxel are added to a list of polygons. An image of the surface from a viewing direction can be generated from the list of polygons using a polygon rendering algorithm (e.g., the Z-buffer algorithm).

The filter-based implementation is shown in Fig. 10(a). The Read filter reads the volume data, and writes the 3D voxel elements to the output stream. The ISO filter reads voxels, extracts a list of triangles from the voxels, and writes the triangles to its output stream. The TRansform filter performs 3D viewing transformations and mapping operations on the triangles, and clips the triangles to the query viewport. The Z-buffer filter renders the transformed and clipped triangles onto a 2D Z-buffer. Each Z-buffer entry corresponds to a pixel in the image plane, and stores the color value (an RGB value) and the distance to the image plane of the current foremost point on the isosurface that maps to the pixel with respect to a viewing direction. Finally, the MergeView filter combines any number of Z-buffer output images to form the final output image, and sends it to the client for display. This application is structured as a pipeline, with the Z-buffer filter acting as an accumulator (see Section 5.2).
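Because the Z-buffer aggregation is commutative and associative, partial Z-buffers produced by transparent copies can be merged in any order. The following sketch shows the core of such a merge in the spirit of the MergeView filter; the ZBufferEntry layout and the flat image representation are illustrative assumptions, not the actual implementation.

#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// One accumulator entry per pixel: color plus distance to the image plane of
// the foremost isosurface point seen so far (illustrative layout).
struct ZBufferEntry {
  std::uint8_t r = 0, g = 0, b = 0;
  float depth = std::numeric_limits<float>::max();  // smaller means closer
};

using ZBuffer = std::vector<ZBufferEntry>;  // width * height entries, row-major

// Merge a partial Z-buffer into an accumulator: keep the closer point per
// pixel. Since the operation is commutative and associative, partial results
// from any number of copies can arrive in any order.
void merge_zbuffers(ZBuffer &accum, const ZBuffer &partial) {
  for (std::size_t i = 0; i < accum.size() && i < partial.size(); ++i)
    if (partial[i].depth < accum[i].depth)
      accum[i] = partial[i];
}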


Fig. 12. Response time for a single time-step of the isosurface visualization application. One filter per host; all copies of a given filter run on the same host.

The following preliminary experiments attempt to quantify the benefit of adjusting the number of transparent copies to better balance a filter-based application's processing pipeline. The experiment involves visualizing a single time-step of the output of a chemical species transport simulation. The volume of data produced by each of the filters is: R (700MB), ISO (11MB), TR (11MB), Z (2MB), MV (2MB). We ran the code on two configurations. The first configuration is a cluster of 2-processor 450 MHz Pentium II nodes connected by switched Gigabit Ethernet, and the second is a cluster of 2-processor 550 MHz Pentium III nodes connected by 10Mbit Ethernet. The nodes were dedicated to the experiments, to eliminate background job interference and network cross-traffic. When using multiple copies, all the copies run on a single node, so there is perfect balance among the copies due to the shared queue. Fig. 12 shows the response time for this application. The 1 Copy case is when a single copy of each filter in the filter groups is executed on a different host. For this case, the ISO filter is the most expensive (which filter is most expensive depends on the particular implementation), so the second column shows the result of running two copies of this filter. For both host configurations, we see a significant improvement from better balancing of the filter pipeline. As seen in the third column, increasing the number of copies for other filters does not improve performance, since a single copy was sufficient to keep up with the rate of the TR filter.
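Returning to the buffer-distribution policies listed in Section 5.3, the following sketch illustrates the first two (plain round-robin among copy sets, and round-robin weighted by the number of copies on each host). It is an illustrative stand-in for the filtering service's internal logic, not the actual DataCutter code.

#include <cstddef>
#include <string>
#include <vector>

// A copy set: all transparent copies of a consumer filter on one host.
struct CopySet {
  std::string host;
  int num_copies;   // used by the weighted policy
};

// Policy (1): plain round-robin among copy sets.
std::size_t next_round_robin(const std::vector<CopySet> &sets, std::size_t &cursor) {
  std::size_t chosen = cursor % sets.size();
  ++cursor;
  return chosen;
}

// Policy (2): round-robin weighted by the number of copies per host, so a
// host running two copies receives twice as many buffers as a host with one.
std::size_t next_weighted(const std::vector<CopySet> &sets, std::size_t &cursor) {
  int total = 0;
  for (const CopySet &s : sets) total += s.num_copies;
  int slot = static_cast<int>(cursor % static_cast<std::size_t>(total));
  ++cursor;
  for (std::size_t i = 0; i < sets.size(); ++i) {
    slot -= sets[i].num_copies;
    if (slot < 0) return i;   // the buffer goes to this host's shared queue
  }
  return 0;                   // unreachable when total > 0
}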

6. Related work

In this section we present several research projects that deal with data-intensive applications and, to varying degrees, with how to execute them in a wide-area environment. The ABACUS framework [3] addresses the automatic and dynamic placement of functions in data-intensive applications between clients and storage servers. ABACUS supports applications that are structured as a chain of function calls, and can partition the functions between the client and the server. Adapting Computational Data Streams (ACDS) [16] is a framework that addresses the construction and adaptation of computational data streams in an application. A computational object performs filtering and data manipulation, and data streams characterize data flow from data servers or from running simulations to the clients of the application. The runtime system of ACDS performs migration, splitting and merging of streams to adapt to runtime variations such as data volume and availability of resources. The dynamic QUery OBjects (dQUOB) [21] system consists of a runtime and compiler environment that allows an application to insert computational entities, called quoblets, into data streams. The data streams targeted by the dQUOB system are those used in large-scale visualizations, in video streaming to a large number of end users, and in large volumes of business transactions. Dv [1] is a framework for developing applications for distributed visualization of large scientific datasets on a computational grid. The Dv framework is based on the notion of active frames.

7. Conclusions

In this paper we have presented a framework, DataCutter, for developing cluster- and grid-based applications that subset and process large scientific datasets. The programming model in DataCutter, called filter-stream programming, represents components of a data-intensive application as a set of filters. DataCutter services provide support for the execution of filters in a heterogeneous environment to carry out resource-constrained, pipelined communication and processing. We presented a description of the use of DataCutter in several real data-intensive applications, the VM, out-of-core sort, and isosurface visualization. The experimental results demonstrate the impact of heterogeneity on an application, and further suggest that any static application organization will likely not perform efficiently in all cases. The DataCutter filtering service uses techniques such as careful placement of filters, multiple filter group instances, and transparent copies to adjust dynamically to the heterogeneity present in the targeted runtime environment. As a result, we have shown improved performance for these applications using DataCutter, while avoiding manual application tuning.

References

[1] M. Aeschlimann, P. Dinda, J. Lopez, B. Lowekamp, L. Kallivokas, D. O'Hallaron, Preliminary report on the design of a framework for distributed visualization, in: Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'99), Las Vegas, NV, June 1999, pp. 1833–1839.
[2] A. Afework, M.D. Beynon, F. Bustamante, A. Demarzo, R. Ferreira, R. Miller, M. Silberman, J. Saltz, A. Sussman, H. Tsang, Digital dynamic telepathology – the Virtual Microscope, in: Proceedings of the 1998 AMIA Annual Fall Symposium, American Medical Informatics Association, November 1998.
[3] K. Amiri, D. Petrou, G. Ganger, G. Gibson, Dynamic function placement in active storage clusters, Technical Report CMU-CS-99-140, Carnegie Mellon University, Pittsburgh, PA, June 1999.
[4] A. Arpaci-Dusseau, R. Arpaci-Dusseau, D. Culler, J. Hellerstein, D. Patterson, High-performance sorting on networks of workstations, in: Proceedings of the 1997 ACM SIGMOD Conference, Tucson, AZ, 1997.
[5] M. Beynon, T. Kurc, A. Sussman, J. Saltz, Design of a framework for data-intensive wide-area applications, in: Proceedings of the 9th Heterogeneous Computing Workshop (HCW2000), IEEE Computer Society Press, May 2000, pp. 116–130.
[6] M. Beynon, A. Sussman, J. Saltz, Performance impact of proxies in data intensive client–server applications, in: Proceedings of the 1999 International Conference on Supercomputing, ACM Press, June 1999.
[7] M.D. Beynon, R. Ferreira, T. Kurc, A. Sussman, J. Saltz, DataCutter: middleware for filtering very large scientific datasets on archival storage systems, in: Proceedings of the Eighth Goddard Conference on Mass Storage Systems and Technologies/17th IEEE Symposium on Mass Storage Systems, National Aeronautics and Space Administration, March 2000, NASA/CP 2000-209888, pp. 119–133.
[8] M.D. Beynon, T. Kurc, A. Sussman, J. Saltz, Optimizing execution of component-based applications using group instances, in: IEEE International Symposium on Cluster Computing and the Grid (CCGrid2001), Brisbane, Australia, May 2001, pp. 56–63.
[9] A. Choudhary, R. Bordawekar, M. Harry, R. Krishnaiyer, R. Ponnusamy, T. Singh, R. Thakur, PASSION: parallel and scalable software for input–output, Technical Report SCCS-636, NPAC, September 1994; also available as CRPC Report CRPC-TR94483.
[10] G. Edjlali, A. Sussman, J. Saltz, Interoperability of data parallel runtime libraries, in: Proceedings of the 11th International Parallel Processing Symposium, Geneva, Switzerland, April 1997, IEEE Computer Society Press.
[11] I. Foster, C. Kesselman, The GRID: Blueprint for a New Computing Infrastructure, Morgan Kaufmann, Los Altos, CA, 1999.
[12] I. Foster, D. Kohr, R. Krishnaiyer, J. Mogill, Remote I/O: fast access to distant storage, in: Fifth Workshop on I/O in Parallel and Distributed Systems (IOPADS), ACM Press, New York, 1997, pp. 14–25.
[13] V. Gaede, O. Gunther, Multidimensional access methods, ACM Comput. Surveys 30 (2) (1998).
[14] Global Grid Forum, http://www.gridforum.org.
[15] HPSS: High performance storage system, http://www4.clearlake.ibm.com/hpss.
[16] C. Isert, K. Schwan, ACDS: adapting computational data streams for high performance, in: 14th International Parallel & Distributed Processing Symposium (IPDPS 2000), Cancun, Mexico, May 2000, IEEE Computer Society Press.
[17] T. Johnson, An architecture for using tertiary storage in a data warehouse, in: the Sixth NASA Goddard Space Flight Center Conference on Mass Storage Systems and Technologies, Fifteenth IEEE Symposium on Mass Storage Systems, 1998.
[18] W.E. Johnston, B. Tierney, A distributed parallel storage architecture and its potential application within EOSDIS, in: NASA Mass Storage Symposium, March 1995.
[19] W. Lorensen, H. Cline, Marching cubes: a high resolution 3D surface reconstruction algorithm, Comput. Graphics 21 (4) (1987) 163–169.
[20] K.-L. Ma, Z. Zheng, 3D visualization of unsteady 2D airplane wake vortices, in: Proceedings of Visualization'94, October 1994, pp. 124–131.
[21] B. Plale, K. Schwan, dQUOB: managing large data flows using dynamic embedded queries, in: IEEE International Symposium on High Performance Distributed Computing (HPDC), August 2000.
[22] M. Rodriguez-Martinez, N. Roussopoulos, MOCHA: a self-extensible database middleware system for distributed data sources, in: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD00), vol. 29, no. 2, ACM Press, ACM SIGMOD Record, May 2000.
[23] W. Schroeder, K. Martin, B. Lorensen, The Visualization Toolkit: An Object-oriented Approach to 3D Graphics, second ed., Prentice-Hall, Englewood Cliffs, NJ, 1997.
[24] P.H. Smith, J. van Rosendale (Eds.), Data and Visualization Corridors: Report on the 1998 DVC Workshop Series, Technical Report CACR-164, California Institute of Technology, September 1998.
[25] M. Teller, P. Rutherford, Petabyte file systems based on tertiary storage, in: the Sixth NASA Goddard Space Flight Center Conference on Mass Storage Systems and Technologies, Fifteenth IEEE Symposium on Mass Storage Systems, 1998.
[26] US Geological Survey, Land satellite (LANDSAT) thematic mapper (TM), http://edcwww.cr.usgs.gov/nsdi/html/landsat_tm/landsat_tm.