Journal of Parallel and Distributed Computing 49, 182–203 (1998). Article No. PC971420.
Parallel Labeling of Three-Dimensional Clusters on Networks of Workstations¹

Felipe Knop² and Vernon Rego

Department of Computer Sciences, Purdue University, West Lafayette, Indiana 47907
Cluster algorithms have application in diverse areas, including the statistical mechanics of polymer solutions, spin models in physics, and the study of ecological systems. Most parallel cluster labeling algorithms are designed for SIMD and MIMD multiprocessors and are based on relaxation methods. We present a parallel 3-D cluster labeling algorithm for distributed-memory environments, based on mapping tables. The proposed algorithm focuses on minimizing interprocess communication to enhance execution performance on workstation networks. We implemented the algorithm with the aid of the EcliPSe parallel replication toolkit, exploiting special tree-combining and data reduction features of the system. We report on performance results for experiments conducted on workstation clusters. © 1998 Academic Press
1. INTRODUCTION

The problem of cluster identification is a critical step in the study of fundamental aspects of the statistical mechanics of polymer solutions [17]. A delicate and controversial problem in this area is that of the conformational properties of polymer chains in disordered media [25]. The question is: how is the conformation of a macromolecule in solution affected when the solution is confined to a quenched random medium? Potentially relevant experimental realizations include polymers trapped in porous media (such as a sandstone), transport of polymer through microporous membranes, and gel permeation chromatography. Despite rather intensive theoretical and numerical studies in recent years, the problem remains unsolved, and the available partial solutions appear to contradict one another [24].

Apart from applicability to studies of materials, there is a more fundamental theoretical interest in the solution to this problem.
¹ Research supported in part by ONR-9310233, ARO-93G0045, and BMDO-34798-MA.
² Research supported in part by CNPq-Brazil Process 260059/91.9. Current address: IBM Corporation, 522 South Road, Poughkeepsie, NY 12601, MS P963.
There is a close relationship [10, 13] between long chain polymers and scale-invariant phenomena at the critical point [28] (e.g., the liquid–gas phase transition or the ferromagnet–paramagnet phase transition). The recognition of such relationships, based on scaling and universality, led to a recent Nobel Prize for Pierre de Gennes. For polymers in disordered media, it is possible to have distinct and competing critical behaviors, because polymers can be very large, and disorder in the medium can exhibit long-range correlations. The competition of two large length scales (in the jargon of statistical physics) makes this problem a fertile ground for investigations using the tools of critical phenomena.

The standard approach in experimental studies is to focus on a linear chain which has a restricted interaction with the medium. Because of forbidden regions (or infinite energy barriers) the chain is confined to certain parts of the medium. If the forbidden regions are randomly distributed, and the minimum length scale of these regions is much smaller than that of the polymer, then this problem can be modeled by a self-avoiding walk (SAW) on a randomly diluted 3-D lattice. First, a percolation cluster [30] is created on a finite grid by randomly diluting the grid (i.e., marking sites as inaccessible to the chain) with probability q = 1 − p > 0. The remaining sites then form connected components called clusters. Above a certain threshold p_c there will tend to be one cluster which spans the entire grid. Once this disordered cluster has been identified, a starting point of a SAW is chosen randomly, and all SAWs with a given number of steps N are generated. A SAW [27] is a random walk defining a path that does not intersect itself. On a discrete lattice, the walk is constrained to move to a nearest-neighbor site at each time step. The "self-avoiding" constraint prohibits the walk from occupying sites that have already been occupied. Details on how the SAWs are generated, based on a depth-first search, can be found in [24, 25]. During each walk, various conformational properties of the chain are measured, such as the moments of the end-to-end distance R_N and radius of gyration S_N [27]. This is repeated for a large number of disorder configurations.

In this study, we restrict our attention to parallel generation and labeling of clusters on 3-D lattices. Similar clustering algorithms have been used in many areas, including the simulation of spin models [6] near phase transition in statistical physics. The clustering work in [5] focuses on landscape ecology, including the study of ecosystems subject to fires and other sources of habitat fragmentation. The cluster labeling problem is a special case of the more general problem of finding the connected components of a graph, which has applications in areas such as computer vision and image processing.

There have been a number of proposals for performing cluster labeling in parallel. In most approaches, the input grid is divided among processors, and a local cluster labeling algorithm is run on the subgrid assigned to each process. Next, potential clusters that may span several process boundaries are identified and labeled. Examples of sequential and parallel cluster identification models for SIMD and MIMD machines can be found in [3], and several other algorithms in [1, 9, 15, 16]. Some of these algorithms assume a PRAM model and an "unlimited" number of processors [12].
A completely different approach [25, 27] is to replicate the problem, assigning a complete grid and an independent random number stream to each processor. Replication is a viable alternative in applications such as Monte Carlo, where a large number of samples, based on labeled clusters, is required. On the other hand, if a grid is too large to be accommodated by a single processor, there may be no recourse but to subdivide the grid and distribute it among processors.
As indicated earlier, our target execution environment consists of heterogeneous networked processors. Many scientific applications that previously required supercomputers can now be executed in a distributed manner by dividing the computation among heterogeneous workstations communicating over a LAN or WAN. This mode of computing is attractive because it is cheap and more readily available than supercomputer access. In addition, networked computing has fostered the development of efficient software systems which support cluster computing, motivating scientists to port their compute-intensive applications to workstation environments. We mention PVM [31], HeNCE [4], Isis [7], MPI [18], and EcliPSe [26] as examples of such software systems.

In contrast to work on parallel labeling algorithms for SIMD/MIMD hardware multiprocessors [1, 3], we seek to exploit the EcliPSe [26] tool for parallel cluster identification. The algorithm we propose attempts to enhance performance by minimizing interprocess communication. Many of the parallel cluster labeling algorithms that are reported to perform well on PRAMs or on SIMD/MIMD machines may not be appropriate for workstation clusters because of communication overheads. For example, [1, 5, 16] propose methods based on relaxation cycles, requiring repeated data transfers between neighboring processors. This can be very expensive in a workstation network environment. In designing the proposed cluster labeling algorithm, we have focused on (1) minimizing the impact of interprocess communication (partly, by avoiding the need for relaxation cycles) and (2) enabling execution—with minimal changes to legacy code—in various execution environments, including heterogeneous networks and MIMD machines. We use the EcliPSe [21, 26] simulation tool—originally built for replication-based simulation applications, to enable multiple runs for statistical accuracy—as a base for implementing cluster labeling. We were motivated by the need to gauge EcliPSe's suitability to this application, its execution performance, and the level of programming effort involved.

The rest of this paper is organized as follows. In Section 2 we specify the cluster labeling problem and a sequential solution. In Section 3 we briefly describe the EcliPSe toolkit. In Section 4 we describe our parallel algorithm and how this algorithm is implemented using features of the EcliPSe system. In Section 5 we report on execution performance, and in Section 6 we present a brief conclusion.

2. PROBLEM SPECIFICATION

We begin by describing the cluster labeling problem and, to simplify the presentation, examine how the labeling algorithm works on 2-D lattices [22]. Later, we explain how these ideas can be extended to the case of 3-D lattices.

2.1. The Cluster Labeling Problem

Let M be an N × N 2-D lattice of sites, where each site M[i, j] (0 ≤ i, j < N) is either "occupied" or "unoccupied." Site M[i, j] is said to be a neighbor of site M[k, l] iff (k = i and l = j + 1), (k = i and l = j − 1), (k = i + 1 and l = j), or (k = i − 1 and l = j).
In other words, "diagonal neighbors" are not permitted, though the proposed algorithm can be made to accommodate such generalizations. A given site M[c, d] is said to be reachable from another site M[a, b] if it can be reached through a set of occupied sites, i.e., if there exists a sequence of occupied sites S_0 = M[a, b], S_1, ..., S_n = M[c, d] such that S_{i+1} is a neighbor of S_i. A cluster C(i, j) is defined to be a set of occupied sites containing M[i, j] and all other occupied sites M[k, l] ∈ C(i, j) such that M[k, l] is reachable from M[i, j].

The cluster labeling problem is stated as follows. Given a lattice M containing both occupied and unoccupied sites, identify distinct clusters and give each a unique label, such as an integer id.

2.2. The Labeling Algorithm

A well-known sequential solution [19] to the cluster labeling problem is as follows. Traverse M row by row, in increasing row order, assigning to M[i, j] the label assigned to M[i − 1, j] if both sites are occupied. If not, assign to M[i, j] the label assigned to M[i, j − 1], but only if sites M[i, j] and M[i, j − 1] are occupied, as shown in Fig. 1a; an occupied site with no occupied upper or left neighbor receives a fresh label. During this traversal, it is possible to arrive at an occupied site with two occupied neighbors, both of which have different labels, as shown in Fig. 1b. Such a situation arises when two occupied sites, initially thought to be in distinct clusters—because of the manner in which the lattice is traversed—actually belong to the same cluster. The solution to this problem is to relabel one of the two clusters marked as distinct and to identify the site in question as belonging to the other cluster.
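For concreteness, the two traversals just described can be sketched in C as follows. This is a minimal sketch, assuming a square occupancy grid; the names grid, label, proper, and resolve are illustrative and not taken from the authors' code. The proper array used in the second traversal is explained below.

    #include <stdlib.h>

    #define N 8                   /* lattice dimension (illustrative) */

    static int proper[N * N];     /* proper[x] = y: cluster x must be relabeled y; 0 = null */

    /* Follow the proper[] chain until a label with no pending relabeling is found. */
    static int resolve(int x)
    {
        while (proper[x] != 0)
            x = proper[x];
        return x;
    }

    /* Two-pass sequential labeling. occupied[i][j] != 0 marks an occupied site;
       label[i][j] receives the site's cluster id (0 for unoccupied sites). */
    static void label_clusters(const int occupied[N][N], int label[N][N])
    {
        int next = 1;                                 /* next fresh cluster label */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                if (!occupied[i][j]) { label[i][j] = 0; continue; }
                int up   = (i > 0) ? label[i - 1][j] : 0;
                int left = (j > 0) ? label[i][j - 1] : 0;
                if (up && left) {                     /* both neighbors occupied */
                    int a = resolve(up), b = resolve(left);
                    label[i][j] = (a < b) ? a : b;
                    if (a != b)                       /* labeling conflict, as in Fig. 1b */
                        proper[(a > b) ? a : b] = (a < b) ? a : b;
                } else if (up) {
                    label[i][j] = up;
                } else if (left) {
                    label[i][j] = left;
                } else {
                    label[i][j] = next++;             /* start a new cluster */
                }
            }
        /* Second traversal: replace each label by its final representative. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                if (label[i][j])
                    label[i][j] = resolve(label[i][j]);
    }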
FIG. 1. Sequential cluster-labeling algorithm. (a) Labeling—the shaded site is labeled “2.” (b) Labeling conflict. (c) The “proper” array.
In the example shown in Fig. 1b, the site in question (i.e., the shaded location) is assigned the label 1, while the occupied neighbor previously labeled as a site in cluster 2 is now relabeled as belonging to cluster 1. To implement the relabeling phase efficiently we must avoid having to repeatedly locate all sites whose labels must be changed. As an alternative, we create an array named proper and initialize it to null. When occupied sites from cluster x are to be relabeled as belonging to cluster y, we update the label array proper using: proper[x] ← y. The array proper implements a forest of trees of sites, with each tree containing the sites of one cluster. Once a traversal of M is complete, we begin a second traversal of M, updating the label of each site M[i, j]. We do this by recursively accessing array proper until we determine that proper[x] is null, as shown in Fig. 1c. At this point, we assign site M[i, j] the label x. In this equivalence-class formulation, the cluster labeling problem is essentially an instance of the union-find problem [11]. Labeling clusters is one example of connected component labeling in graphs [2].

In the "ants in the labyrinth" method [14], a single cluster is identified by initially placing an "ant" (label) at some occupied site. At the next time step, this ant places "children" (same label) at neighboring, occupied sites that are not already populated by ants. The process continues in this manner, with the ants reproducing until each occupied site contains an ant. If each occupied site is initially given a negative label, all clusters can be identified one by one through the ant population algorithm, starting by labeling the first cluster with 0 and incrementing the label by one each time the ants have located another cluster.

The problem of cluster labeling also occurs in the related context of Monte Carlo simulations of statistical mechanical systems and, in particular, spin models. This is based on the equivalence between Potts spin models and percolation models [29], though algorithms presented for spin clusters are not directly applicable to our problem. For example, Wolff [33] proposed a constructive sweep method to "grow" single clusters by bonding together sites in a probabilistic manner. The intent here is to move between cluster configurations as quickly as possible, while measuring statistics from spin ensembles. This is different from our polymer problem, where the cluster generation and identification is done once per SAW enumeration. In a similarly motivated algorithm proposed by Swendsen and Wang [32], clusters are created constructively by probabilistically introducing bonds between neighboring spins. Sequential and parallel cluster algorithms for spin models are motivated by the need for rapid cluster updates, i.e., identifying spins which can be flipped at low cost. As opposed to cluster labeling in our problem, which is to locate the largest cluster for subsequent SAW generation, identifying spins for rapid updating leads to different sequential and parallel algorithms. Baillie and Coddington [3] and Embrechts et al. [15] proposed parallel algorithms for clustering in spin models.

3. PARALLEL EXECUTION ENVIRONMENT: THE EcliPSe TOOLKIT

The parallel algorithm that we propose is partly motivated by the facilities provided by the EcliPSe system for replicated computation. EcliPSe [20, 21] is a system that enables
replications and structured parallel computations on heterogeneous networked processors. Though designed primarily to enable straightforward parallelization of sequential codes in the simulation domain, EcliPSe has been found to be useful for general parallel computation with structured communication patterns. The system has been found to enable low-effort parallel application development, as well as to exhibit good performance in a variety of situations. An important factor that led us to examine the use of EcliPSe over other systems, in the solution of this problem, is the flexibility and power of its tree-combining mechanism. The use of this facility in the proposed algorithm is detailed in Section 4.

Porting a sequential application to a distributed EcliPSe environment requires only small changes to the original code. This includes inserting EcliPSe primitives in the original code and possibly rearranging some of the code. To configure the run-time environment, the user is required to provide a file containing the names of the machines to be used, usually (though not necessarily) "idle" workstations. The end result is a run consisting of a set of concurrently executing computational processes, coordinated by one or more monitor processes. An EcliPSe program must contain the following components:

Computation code. This is code that is run by each of the computational processes. This code is responsible for most of the "actual work" that the application performs. The computation code usually requires (input) data from and returns (result) data to a monitor process.

Monitor function(s). These are functions executed by monitor processes. A monitor process is responsible for coordinating the computation done by a set of computational processes, generating data for and collecting data from these processes, and finally determining when the computation should be terminated.

Declarations. Each type of data item that is exchanged between monitors and computational processes must be declared. By declaring data types, the user informs EcliPSe about data characteristics. Declarations provide the added advantage of making explicit the flow of information between monitors and computational processes. EcliPSe declarations are handled by a special preprocessor, which allows the user to make the declarations using a "C-like" syntax. The preprocessor reads in the declarations and produces a set of handles—analogous to "file descriptors" in UNIX systems—that are used in all subsequent EcliPSe calls. By using these handles, the data exchange primitives can transmit data in a machine-independent format.

The basic primitives available to computational processes are simple: request_data obtains data from a monitor, and put_stat sends data to a monitor. Both take as parameters an (integer) type handle obtained in the declarations part of the code and a pointer to the data to be transferred. All code related to file I/O and to the collection of statistics or data is placed inside the monitor function, which is executed by a monitor process. The main data-exchange primitives available there are produce_data (counterpart to request_data) and collect_stat (counterpart to put_stat).
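The fragment below sketches the overall shape of such a program. It is a schematic sketch only: the primitive names are those given above, but their exact signatures, the sample struct types, and the handle constants H_PARAMS and H_RESULT are stand-ins for what the EcliPSe preprocessor would actually generate from the user's declarations.

    /* Illustrative skeleton only; the types and handles below are invented stand-ins. */
    struct params { double arrival_rate; long seed; };
    struct result { double avg_delay; };

    /* Primitive names from the text; signatures assumed to be
       (integer type handle, pointer to the data being transferred). */
    extern void request_data(int handle, void *data);
    extern void put_stat(int handle, void *data);
    extern void produce_data(int handle, void *data);
    extern void collect_stat(int handle, void *data);

    enum { H_PARAMS = 1, H_RESULT = 2 };   /* stand-ins for preprocessor-generated handles */

    void computation(void)                 /* run by each computational process */
    {
        struct params p;
        struct result r;

        request_data(H_PARAMS, &p);        /* obtain input data from a monitor */
        /* ... the "actual work", e.g., generate samples or label a sublattice ... */
        put_stat(H_RESULT, &r);            /* send results toward the monitor */
    }

    void monitor_fn(void)                  /* executed by the monitor process */
    {
        struct params p = { 0.5, 42L };    /* normally read from input files */
        struct result combined;

        produce_data(H_PARAMS, &p);        /* counterpart to request_data */
        collect_stat(H_RESULT, &combined); /* counterpart to put_stat; if H_RESULT is
                                              declared a combining type, results arrive
                                              already combined (see below) */
    }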
The system provides two key features designed to cope with communication and computational bottlenecks. These are extremely useful in parallel algorithms that expand and reduce data sets.

Tree-combining. If a monitor were to receive data directly from a large number of computational processes, the amount of incoming traffic and resulting "combining"
work would make the monitor a bottleneck for the entire computation. To prevent the occurrence of such bottlenecks, EcliPSe enables processes to be organized in a virtual tree structure, based on a user-defined topology,³ with the monitor as root. Each computational process transparently sends data to its parent in the tree, instead of sending data directly to the root. Each such parent combines its own results with the results it receives from all of its children in the tree, applying the same operation that the monitor process would have applied had the tree-combining scheme not been used. As a result, the monitor at the root only needs to combine data it receives from its own children. The user may choose to employ the tree-combining scheme by declaring a data type to be a "combining type" and by specifying its corresponding combining operation. The latter may be either a user-written function or one of the standard combining operations provided by EcliPSe (i.e., averaging, summation, concatenation, and others). No change is required for the computational processes.

Data-diffusing. The virtual tree structure described above can also be used to speed up the distribution of data from a monitor to each computational process. Instead of sending data directly to each computational process, a monitor only needs to use the produce_data_diffuse primitive on an array of data items to be distributed. Data is then "diffused" down the tree, with each computational process receiving a data item. The process behaves like tree-combining in reverse. The same primitive can also be used for an efficient data broadcast. As with tree-combining, the declaration of a data type must be changed to indicate use of data-diffusing. Code for the computational processes need not change.

Figure 2 shows an example of a "typical" EcliPSe application: parallel generation of samples in a replicative stochastic simulation. The example shows the simulation of a single-server queueing system. The sample-generation details are omitted for brevity; we emphasize in the figure how the tool is used to collect and combine samples. Most replicative applications follow a similar pattern. In the example, the input parameters are broadcast and seeds are "diffused" to all samplers. The two simulation outputs—average delay in system and maximum number of customers in queue—are sent to the monitor using the tree-combining mechanism. In the example, the independently generated samples are combined using the EcliPSe-provided averaging operator. For the parallel cluster algorithm presented in Section 4, the combining operator is an integral part of the algorithm.

4. PARALLEL CLUSTER LABELING

4.1. Outline

The basic idea behind the proposed parallel labeling algorithm is as follows. First, in the 2-D case, the lattice M is divided row-wise into n strips, where n is the number of processors used. Each strip is assigned to a different processor. Next, each processor runs the sequential algorithm explained in Section 2.2 on its strip. This simple approach will lead to an incorrect labeling, since actual clusters can and will tend to span more than one strip. Therefore, it should be clear that processors will have to communicate with one another to resolve problems with cluster labels on large clusters.
³ The topology is specified in the file that contains the set of machines to be used.
FIG. 2. Parallel replicative simulation of a single-server queue in EcliPSe.
One relabeling alternative is for each processor numbered 2i + 1 to send its strip to processor 2i (i = 0, 1, ..., n/2 − 1), allowing the latter to relabel clusters in both strips. This procedure may be repeated log₂ n times, using tree combining, to yield a correctly labeled lattice M. As it turns out, this approach is impractical for workstation cluster environments, simply because of the large amount of data that would flow over the network with large lattices. In addition, tree combining would force a single processor to finally host the correctly labeled lattice, which may not be possible if the lattice is large.

Yet another alternative, and the one we adopt, requires processors to transmit only strip-boundary information, instead of an entire strip. Processor i ships its boundary to
FIG. 3. Example of mapping table.
processor i + 1, and the latter uses this information to build a mapping table, as shown in Fig. 3. Using this mapping table, processor i + 1 relabels all its clusters. Unfortunately, relabeling based solely on information obtained from processors i and i + 1 will not suffice to generate a correct labeling. This problem is illustrated in Fig. 4. Here, it is only after gathering information from processors i, j, k, and l that we find that clusters i_1 and i_2 (for example) are actually part of the same cluster. To obtain an accurate mapping table we need to examine all the processors.

A key idea in the proposed parallel algorithm lies in the use of tree combining. We exploit the tree-combining mechanism in building and merging mapping tables, instead of working with entire strips. This drastically reduces the volume of information flow between processors. Through use of EcliPSe, exploiting tree combining for this purpose is simple. Only the combining function needs to be written by the user. All combining-related communication is handled internally by EcliPSe.
FIG. 4. Example of a cluster that spans many processors.
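Relabeling a strip against a mapping table then amounts to one lookup per occupied site. A minimal sketch follows, assuming the table is kept sorted by its left-hand element so that a binary search can be used (as in the final relabeling phase of Section 4.2); the names are illustrative.

    struct entry { int from, to; };   /* "sites labeled from are relabeled to" */

    /* Relabel a strip of nsites site labels in place, using a mapping table
       t[0..tlen-1] sorted by .from. A label of 0 marks an unoccupied site;
       labels absent from the table are already final. */
    void relabel(int *labels, long nsites, const struct entry *t, long tlen)
    {
        for (long s = 0; s < nsites; s++) {
            int x = labels[s];
            if (x == 0)
                continue;
            long lo = 0, hi = tlen - 1;
            while (lo <= hi) {            /* binary search for x on the left side */
                long mid = (lo + hi) / 2;
                if (t[mid].from == x) { labels[s] = t[mid].to; break; }
                if (t[mid].from < x)
                    lo = mid + 1;
                else
                    hi = mid - 1;
            }
        }
    }

A single lookup per site suffices because, in a properly formatted table, a right-hand-side element never reappears on the left-hand side (rule 3 of Section 4.2).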
The combining operator ⊗ is designed so that the result of a combination is of the same type as the operands. To satisfy this condition, the operands and the result are all defined to be tuples of the form t = (f, l, T), where f is the first row, l is the last row, and T is the mapping table of an operand or result. Applying the combining operator over operands t_1, t_2, ..., t_k, which correspond to contiguous segments, gives the result t_r = t_1 ⊗ t_2 ⊗ ... ⊗ t_k, where t_r = (f_r, l_r, T_r) is defined by

    f_r = f_1,    l_r = l_k,    T_r = ( ⋃_{i=1,...,k} T_i ) ∪ ( ⋃_{i=1,...,k−1} T(l_i, f_{i+1}) ),

where T(l_i, f_{i+1}) denotes the mapping table generated by the boundary between rows l_i and f_{i+1}.
The combined first row, f_r, is the first row of the first segment (operand). The combined last row, l_r, is the last row of the last operand. The combined mapping table, T_r, is obtained by taking the union of all input mapping tables with all mapping tables resulting from the boundaries between each pair of adjacent input operands. The mapping table is further discussed in Section 4.2.

With the implementation of ⊗ as an EcliPSe combining function, it remains for each processor i to perform the task of generating the tuple (f_i, l_i, T_∅): the first and last of its rows and an empty mapping table. Tuples are then automatically sent by nodes to their parent node in the virtual tree topology, where the combining function is transparently applied. The whole procedure is repeated until the monitor process finally receives the tuple (f_1, l_n, T), where T is the global mapping table. The monitor must broadcast T to all processes, using EcliPSe's tree-diffusing mechanism. Each process uses the global mapping table to relabel its occupied sites. The implementation of the algorithm in EcliPSe presents an added benefit in that the use of heterogeneous machines is allowed: all information is transmitted over the network in a machine-independent format.

Adapting the 2-D algorithm to work for 3-D lattices requires some changes. The first involves the sequential part of the algorithm. Instead of examining two neighbors of each occupied site to determine a site's cluster label, the 3-D algorithm must examine three neighbors. To enable parallelization, the 3-D lattice is divided into slabs, each comprising a sequence of planes. Instead of the line boundaries arising in the 2-D case, we now have to deal with planar boundaries. Thus, while exchanging boundaries between neighboring processors, entire planes must be exchanged, instead of just rows. The initial mapping table is created as follows: an entry is created in the table if site (i, j) in the last plane of processor k's slab and site (i, j) in the first plane of processor (k + 1)'s slab are both occupied. Once the mapping tables are built, the combining operator functions just as in the case of the 2-D lattice.

Figure 5 shows examples of clusters identified by our 3-D parallel labeling program. This program was run on four processors on a 128 × 128 × 128 lattice.
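Building the initial table between two adjacent slabs follows directly from this rule. The sketch below assumes each boundary plane is stored as a flat n × n array of cluster labels, with 0 marking an unoccupied site; names and layout are illustrative.

    #include <stddef.h>

    struct entry { int from, to; };    /* entry (from, to) with from > to */

    /* Build the initial mapping table from two facing boundary planes.
       plane_a holds the labels of the last plane of the lower slab, and
       plane_b the labels of the first plane of the upper slab, both as
       flat n*n arrays indexed by i*n + j. Returns the number of entries. */
    size_t build_boundary_table(const int *plane_a, const int *plane_b,
                                int n, struct entry *table)
    {
        size_t count = 0;
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                int a = plane_a[i * n + j];
                int b = plane_b[i * n + j];
                if (a && b) {              /* both occupied: same physical cluster */
                    table[count].from = (a > b) ? a : b;   /* larger label ...     */
                    table[count].to   = (a > b) ? b : a;   /* ... maps to smaller  */
                    count++;
                }
            }
        return count;   /* about p^2 * n^2 entries on average, duplicates included */
    }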
FIG. 5. Example of clusters found by the program. Only some selected clusters, among the thousands labeled by the program, are shown in the figure.
4.2. Implementation of the Combining Operator

After the initial mapping table is built (see Fig. 3), it must be formatted in a way that enables us to identify false clusters. A cluster is false if it is identified as a cluster in its own right, when it actually is part of another cluster. The initial mapping table, obtained from the boundary between adjacent processors a and b, is simply a sequence of pairs (b_i, a_i), where a_i is the cluster number of site M_a[(N/n) − 1, i, j] (last plane of processor a's slab) and b_i is the cluster number of site M_b[0, i, j] (first plane of processor b's slab), provided both sites are occupied. We may assume that b_i > a_i, since cluster numbers initially assigned by processors are distinct, which is possible because the number of different clusters in a slab is bounded from above by half its number of sites.

Let (c_i, d_i) and (c_j, d_j) denote any two elements of the mapping table. To allow clusters to be effectively relabeled and to force them to coalesce whenever required, the final mapping table must obey the following rules:

1. ∀i: c_i > d_i.
2. j > i ⇒ c_j > c_i (the table is ordered by the left-hand-side element, with no repeated left-hand-side elements).
3. ∀i, j: d_i ≠ c_j (a right-hand-side element cannot appear on the left-hand side; that is, a cluster whose number is on the right-hand side of the final table does not need to be relabeled).
FIG. 6. Algorithm for transforming the mapping table.
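Figure 6 itself is not reproduced here. The C sketch below is one way to realize the transformation, consistent with the description that follows: entries are kept sorted by their left-hand element, duplicate left-hand sides are merged and the displaced pair reinserted (Loop 1), and right-hand sides are then resolved by binary search (Loop 2). The sorted-array layout and the recursive reinsertion are illustrative choices, not the authors' exact code.

    #include <stddef.h>

    struct entry { int from, to; };    /* invariant: from > to (rule 1) */

    /* Binary search for an entry with .from == key in t[0..n-1]; -1 if absent. */
    static int find_from(const struct entry *t, int n, int key)
    {
        int lo = 0, hi = n - 1;
        while (lo <= hi) {
            int mid = (lo + hi) / 2;
            if (t[mid].from == key) return mid;
            if (t[mid].from < key) lo = mid + 1; else hi = mid - 1;
        }
        return -1;
    }

    /* Insert (from, to) into the sorted, duplicate-free table out[0..*m-1].
       A collision (x, y), (x, z) is merged into (x, z) plus a reinserted (y, z),
       with y > z, exactly as rule 2 requires. */
    static void insert_entry(struct entry *out, size_t *m, int from, int to)
    {
        if (from == to) return;
        if (from < to) { int tmp = from; from = to; to = tmp; }  /* keep rule 1 */
        int pos = find_from(out, (int)*m, from);
        if (pos >= 0) {                        /* duplicate left-hand side */
            int other = out[pos].to;
            if (other == to) return;           /* repeated entry: drop it */
            int y = (other > to) ? other : to;
            int z = (other > to) ? to : other;
            out[pos].to = z;                   /* keep (x, z) ...                  */
            insert_entry(out, m, y, z);        /* ... and reinsert (y, z), y < x   */
        } else {
            size_t j = *m;                     /* shift entries to make room */
            while (j > 0 && out[j - 1].from > from) { out[j] = out[j - 1]; j--; }
            out[j].from = from; out[j].to = to;
            (*m)++;
        }
    }

    /* Transform in[0..n-1] into a table satisfying rules 1-3; out needs room
       for n entries. Returns the number of entries in the final table. */
    size_t transform_table(const struct entry *in, size_t n, struct entry *out)
    {
        size_t m = 0;
        for (size_t i = 0; i < n; i++)                   /* Loop 1 (with sorting) */
            insert_entry(out, &m, in[i].from, in[i].to);
        for (size_t i = 0; i < m; i++) {                 /* Loop 2: enforce rule 3 */
            int pos = find_from(out, (int)i, out[i].to); /* search earlier entries */
            if (pos >= 0) out[i].to = out[pos].to;       /* redirect to final target */
        }
        return m;
    }

For example, the initial pairs (5, 2), (5, 1), (4, 1) become (2, 1), (4, 1), (5, 1) after the two loops.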
Rule 1 dictates the manner in which clusters are relabeled: if clusters x and y are the same and x > y, then x is relabeled as y, and y remains the same. In this case, entry (x, y) is in the table, but (y, x) is not. Rule 2 is enforced to simplify searching for pairs of entries (x, y), (x, z). The final mapping table cannot have such pairs because this would imply that "sites labeled x can be relabeled as either y or z." Such pairs must be replaced by (x, z), (y, z), assuming y > z. Rule 3 makes the use of the table more efficient by obviating the need for multiple searches for a single cluster number: once entry (x, y) is found, we know that there is no need to check whether some entry (y, ...) is in the table. In the final mapping table, it is possible that, for some i, both c_i and d_i belong to the same processor, implying that the final table identifies clusters that one processor previously computed as different clusters as actually being parts of the same cluster (such as clusters i_1 and i_2 in Fig. 4).

The algorithm used to transform the initial mapping table into the final mapping table is outlined in Fig. 6. "Loop 1" in the algorithm enforces rules 1 and 2 above, and "Loop 2" enforces rule 3. In Loop 1, after (c_i, d_i) is reinserted at some position j in T (j < i, since the new c_i is max(d_{i−1}, d_i), which must be less than the original c_i because of the invariant ∀i: c_i > d_i), the algorithm must again check whether c_j = c_{j−1}. If so, the same procedure must be applied again, recursively. The procedure converges because (c_i, d_i) is always reinserted at a position j < i. If the search for the reinsertion position is done linearly from index i − 1 downward, then each step of Loop 1, considering the possibility of multiple reinsertions, takes O(|T|). The whole loop then takes O(|T|²). Each step in Loop 2 requires a binary search for (d_i, r) in [T_1, T_{i−1}]; the whole loop's execution time is O(|T| log |T|). Given the worst-case execution times of Loop 1 and Loop 2, and also that sorting T takes O(|T| log |T|), the worst-case execution time for the entire table-transformation procedure is O(|T|²).

We may combine mapping tables by exploiting essentially the same algorithm used to transform an initial mapping table obtained from two boundaries. The tables to be combined only need to be concatenated and the algorithm described in Fig. 6 applied to the resulting table. Besides allowing for easy detection of repeated entries, having the mapping table sorted also aids in the last phase of parallel cluster labeling, when the final table is broadcast⁴ to each processor, and relabeling takes place.
⁴ An optimization is used: only the portion useful to a processor is sent to it. This is done with the aid of the tree-diffusing mechanism.
Upon receiving the final table, composed of pairs (c_i, d_i), each processor determines elements f and l in the table such that the numbers c_f, c_{f+1}, ..., c_l are in the range of cluster numbers initially assigned to this processor. The final relabeling is then done by traversing the whole lattice and replacing cluster number x of a site by d_m whenever x = c_m ∈ {c_f, c_{f+1}, ..., c_l}. The search in {c_f, c_{f+1}, ..., c_l} is performed using a binary search strategy.

4.3. Performance

To evaluate the cluster-labeling algorithm experimentally, we must first create instances of clusters on a 3-D lattice. This task is accomplished by randomly marking each site as either occupied or unoccupied, using a given probability p of site occupation. Near criticality, clusters exhibit high variability in size, including clusters that are extremely small and clusters that span the entire lattice. Since results at criticality are of great interest, most simulations are performed for near-critical values of p. For different values of p, both sequential and parallel algorithms exhibit different behavior.

In the experiments reported here, we measure the execution time of the sequential algorithm as (t_E − t_B), where t_B is the time at which lattice initialization begins, and t_E is the time at which all cluster labeling ends. The same definition applies to the parallel algorithm, though here t_E is the earliest time at which all processors have completed their work. One of the inputs to the cluster-labeling program is a single cluster id. After labeling is done, each process must send to the monitor the set of coordinates belonging to the cluster with this id. The completion time t_E is defined as the earliest time at which coordinates from all processes have been received at the monitor.

4.4. Communication Costs and Tree Topology

Communication plays a major role in the execution cost of the algorithm. Besides grid size, other factors having an impact on the computation and communication costs include the number of processes, the probability p of grid point occupancy, and the underlying tree topology in EcliPSe. For example, in flat (one-level) trees, the monitor process is highly taxed in receiving and combining all boundary information from children. With taller trees, some of this work is taken off the hands of the monitor and done by children and possibly even grandchildren of the monitor. This distribution of combining work comes at a slightly increased communication cost, since one or two more steps are needed for the mapping table to reach the monitor.

For large grid configurations, messages containing boundary planes and the mapping table may be large. Since the cost of sending large messages depends mainly on the message size, it is helpful to estimate typical sizes for messages which are sent between nodes of the tree. Let e be the number of mapping table elements generated by two neighboring slabs of a 3-D grid. We may approximate this quantity as e = p²N², where N² is the number of sites on a boundary, and p² is the probability that two matching sites (belonging to distinct slabs, as depicted in Fig. 3) will both be occupied and thus generate a mapping table entry. The formula yields a slight overestimate for e, since it does not account for repeated entries in the table, which are eliminated during the combining operation. The positive bias increases with p, so that for p = 1 the formula gives e = N², while the real number of elements in the table must be 1 (all the entries in the table are identical!).
FIG. 7. An example of how communication cost (sending boundary planes and mapping tables) is evaluated. The quantity b is the number of boundaries contained in the mapping table sent to a process. The quantities n_c and n_d denote, respectively, the number of children and the number of proper descendants of a given node.
Mapping tables are combined to make new mapping tables with a larger number of elements. The number of elements in a table is approximately e × n_b, where n_b is the number of boundaries that initiated creation of the table. When mapping tables are combined at a node, the number e_o of elements in the resulting table is e_i + n_c e, where e_i is the aggregate size of the mapping tables received from the node's n_c children. As shown in Fig. 7, each such combining operation adds n_c boundaries to the table. Of these, n_c − 1 are the boundaries between consecutive children, and the last is the boundary between the node itself and its first child. As can be seen in Fig. 7, the number of boundaries in a mapping table generated by a node is always equal to the number n_d of proper descendants of this node. Therefore, we can compute e_o as e_o = n_d e, and the total number e_i of mapping table entries received by a node as e_i = (n_d − n_c) e.

To obtain the actual number of bytes transmitted and received by a node, we must take into account the size in bytes of a mapping table element (s_m), and also include the pair of boundary planes that is part of the combined data; the size of the latter is 2 s_b N², where s_b is the size in bytes of one boundary element. The number of bytes transmitted is thus given by

    e_o s_m + 2 s_b N² = n_d p² N² s_m + 2 s_b N²
and the number of bytes received is given by

    e_i s_m + n_c (2 s_b N²) = (n_d − n_c) p² N² s_m + 2 n_c s_b N².

As a numerical example, consider a 3-D lattice with N = 256, with clusters generated using p = 0.30. Suppose that the EcliPSe tree has seven nodes: a root with two children, where each child has two children of its own. In the current implementation, s_m = 8 bytes and s_b = 4 bytes. Under these conditions, each of the root's children (with n_c = 2 and n_d = 2) receives a total of about 1 Mb of data from its own children and generates about 0.59 Mb to send to its parent.

4.5. Heterogeneity and Load Balancing

Because EcliPSe allows data exchange in a machine-independent format, heterogeneous machines may be used in a single run of the parallel cluster labeling algorithm without changes in the code. The use of heterogeneous machines, however, may introduce load imbalance if the different machines have different speeds, since the completion time depends on the slowest executing machine. Having "outside" (user) computations taxing the machines may also create further load imbalance. Although the proposed algorithm, as presented, has no provision for dynamic load balancing, it is easily modified to support static load balancing. To accommodate known differences in CPU speed, different strip widths can be assigned to each machine. Faster machines may also be gainfully employed by having them host multiple children in the process tree.

4.6. Discussion

In principle, mapping tables can be combined using the algorithm described in Section 2.2, exploiting union-find algorithms which execute in almost-linear time [11]. Instead of a single cubic lattice, we would have to work with a set of N × N × 2 lattices, representing the several boundary planes. The drawback of this strategy is the amount of space that would be required by the array proper (see Section 2.2), which represents the forest of clusters. The size of the array is proportional to the number of clusters, and this number is proportional to the number of sites in the lattice (N³). We use an asymptotically slower but more practical algorithm for combining mapping tables. It requires considerably less memory and yields a mapping table which is formatted in a way that is convenient for cluster relabeling during the final phase of the clustering algorithm.

Not all parallel algorithms for cluster labeling rely on relaxation cycles. The scheme described in [9] uses mapping tables, but the tables are passed from processor to processor in a cyclic manner, requiring a number of communication steps equal to the number of processors used. A notion similar to mapping tables is used in [8], in the context of image segmentation. The algorithm described there partitions the grid into squares rather than into strips. While this approach may lead to a reduction in interprocess communication, partitioning the grid into strips enables a simpler definition of the combining operator.
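Before turning to the experiments, here is a quick check of the cost expressions of Section 4.4: the short program below reproduces the numerical example given there, with all constants being the values stated in the text.

    #include <stdio.h>

    int main(void)
    {
        const double N  = 256, p = 0.30;   /* lattice dimension and site probability */
        const double sm = 8,  sb = 4;      /* bytes per table entry / boundary element */
        const double nc = 2,  nd = 2;      /* children / proper descendants of a
                                              root's child in the seven-node tree */

        double sent = nd * p * p * N * N * sm + 2.0 * sb * N * N;
        double recv = (nd - nc) * p * p * N * N * sm + 2.0 * nc * sb * N * N;

        /* Prints roughly 0.59 Mb sent and 1.00 Mb received, matching the text. */
        printf("sent = %.2f Mb, received = %.2f Mb\n",
               sent / (1024.0 * 1024.0), recv / (1024.0 * 1024.0));
        return 0;
    }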
5. EXPERIMENTS

As mentioned earlier, we implemented the parallel cluster labeling algorithm using the EcliPSe toolkit. We report the results of experiments conducted on a network of Sun SPARCstation 5 workstations (70 MHz microSPARC-II CPU) connected by an Ethernet. A total of up to 33 machines located on two local subnets were used. Each machine was configured with 32 Mb of main memory and 100 Mb of swap space. Though we were not able to guarantee zero outside (user) influence and network traffic on the subnets, we attempted to conduct our experiments when the machines were observed to be lightly loaded. In each experiment, using a 320 × 320 × 320 lattice as input, we measure execution time as our indicator of parallel program performance. We examine the effect on execution time of increasing the number of machines, changing the virtual tree topology, and altering the parameter p (the probability that each site in the lattice is occupied).

5.1. Scalability

In this experiment, we investigate how the program scales with the number of machines used. In generating clusters, we kept the value of p = 0.31 fixed throughout. We chose this number because it is near criticality and, according to percolation theory [30], tends to yield a cluster spanning the entire grid. Unfortunately, since this also makes the mapping tables peak in size, it is also the most taxing value of p for the proposed algorithm. Because of interest in criticality, however, this value of p is often used in clustering studies.

Figure 8 shows the result of the experiment, graphing the number of processes used (excluding the monitor process) versus execution time. We measured execution time at two points in the program: one measurement was made after the monitor received and combined mapping tables ("collecting mapping tables" in Fig. 8), and the second measurement was made after all processes received the final table, relabeled their sites, and sent the sample cluster coordinates to the monitor ("complete" in Fig. 8).
FIG. 8. Execution time vs number of processes.
It was not possible to obtain the execution time for only one or two computing processes because of the prohibitively large memory space required by the program. This actually highlights one important motivation for developing a distributed algorithm. With four and eight processes, program execution performance is seen to suffer because of demands on limited physical memory for the lattice labeling operation. Measurements made at particular points in the program indicated that more than 40% of the execution time was spent in paging-in and paging-out. With 16 and 32 processes, the sublattices assigned to each process become much smaller, and shortage of memory is no longer an issue. With 32 processes, however, communication starts to become an issue. Not only does the computation-to-communication ratio decrease significantly, but the increase in the number of process boundaries forces a growth in the number of mapping table entries. Despite this, we believe that increasing problem sizes will result in better performance for the 32-process case. Finally, with an increase in the number of processes, EcliPSe's virtual tree topology and the tree-combining feature tend to have a larger impact on execution performance.

The lackluster performance that we obtained in the 32-process case highlights one problem in using a network of workstations as a computational engine: communication is several orders of magnitude slower than computation. For applications such as this, a significant amount of effort or a very large problem is required to obtain reasonable speedup. In this experiment, the serial part of the cluster labeling algorithm required relatively little execution time. More complex computations are required, for example, when the algorithm is applied to spin models in physics. For such applications, we would expect the attainable speedup to be larger. It is worth mentioning that, because of the communication traffic generated by the algorithm (for the larger configurations of processors) and because of the paging I/O (for the smaller configurations of processors), we witnessed a variation of up to 15% in execution time between runs.

Comparing the results shown above with those in the literature is difficult because:

• our results were obtained using an Ethernet as the interprocessor communication medium. This results in a much higher CPU speed vs communication speed ratio than that offered by hardware multiprocessors.
• cluster labeling algorithms can be applied in a number of areas, and it is apparent that sequential and parallel implementations can differ significantly. For example, in spin models a nontrivial computation is needed to determine whether two neighbors are actually part of the same cluster. This makes the algorithm more computationally intensive and favors better speedups.

Baillie and Coddington [3] experimented with several algorithms and were able to obtain speedups as high as 90 using 128 nodes of a MIMD machine (1024² lattice). They reported much less promising results on SIMD machines, due to either a lack of parallelism or a load imbalance in the algorithms investigated. Flanigan and Tamayo [16] reported almost linear speedup using 256 nodes on a CM-5 (16384² lattice). Berry et al. [5] reported good speedup on a MasPar MP-2, a SIMD machine (no speedup numbers, because they compare the parallel results against sequential runs on a different machine), but the results varied drastically with p.
5.2. Effect of Tree Topology

Different tree topologies yield different performance numbers because of communication costs and data-reduction effects (see Section 4.4). Execution times for specific configurations of processes can be seen in Fig. 9. In each case, the number of processes and all program parameters were kept fixed, while the underlying tree topology was varied. We use the following notation to specify tree topologies: an (x_1, x_2, x_3, ...)-tree is a tree in which the root has x_1 children, each of the x_1 children of the root has x_2 children, each of the root's grandchildren has x_3 children, and so on.

Because EcliPSe is layered upon the Conch communications library, and because the latter uses TCP, we were unable to experiment with flat trees made up of 32 processing nodes. Conch opens a TCP connection between every two processes that are directly connected in the tree; each connection requires a number of file descriptors, and there is a limit on the maximum number of file descriptors that can be in use at any given time. Tree configuration (4, 7) was simply not able to execute because of limitations with virtual memory. This is because the four processes at level 1 in the tree were required to have sufficient memory both for the sublattices on which they have to work and for the combining buffers; the latter are used by EcliPSe to hold the combined results of the seven children of each of the level 1 processes. Thus, when memory requirements begin to get large, it is best for nodes at level 1 and higher not to have many children. At level 0 (i.e., the monitor), the restriction is not so severe, since the monitor does not store any part of the lattice.

In Fig. 9, we observe that three-level trees outperform two-level trees. This happens because (1) the monitor becomes a bottleneck when data from many children must be combined (i.e., six or eight nodes at level 2, versus two or four nodes at level 2), and (2) there is less work going on in parallel during the combining of mapping tables. To increase the level of concurrency in the combining operation for two-level trees, the monitor process must receive combining data from more children. This tradeoff does not arise for taller trees. In the current experiment, the increased parallelism afforded by three-level trees was enough to offset the work required by the extra combining step.
FIG. 9. Execution times for different tree topologies.
FIG. 10. Execution time as a function of p.
5.3. Sensitivity to Variation in p

In studying the effect of different values of p on the algorithm, we kept the number of processes fixed at 32 and the tree topology fixed at (4, 3, 1 or 2), while the probability p of site occupancy was varied. The resulting execution time performance is shown in Fig. 10, and the CPU time required to perform the combining operation at the monitor is shown in Fig. 11.

For small values of p, site occupancy in the lattice is sparse and the number of clusters is low. If the number of repeated entries in a mapping table (see Section 4.4) is ignored, the number of elements in an initial mapping table averages roughly p²N², where N is the lattice dimension. Because of this, small values of p tend to result in small mapping tables. As the value of p increases, the number of clusters tends to increase, and the number of elements in a mapping table also tends to increase. Besides making
FIG. 11. Execution time of the combining function at the monitor process as a function of p. A (4, 3, 1 or 2)-tree was used.
FIG. 12. Number of elements in the final mapping table as a function of p.
the combining operation more expensive, an increase in the number of mapping table entries means that the last phase of the algorithm (where each slab is relabeled, based on the final mapping table) will take longer.

As the value of p approaches 0.31, however, clusters tend to coalesce. The number of sites per cluster tends to increase, and the number of distinct clusters tends to fall drastically. At this point, the mapping tables begin to contain a large number of repeated entries. Indeed, this phenomenon prompted the elimination of repeated entries in the initial stages of the algorithm shown in Fig. 6. As a result, the execution time of the algorithm actually decreases as p exceeds the critical value 0.3117. For values of p around 0.7, the time required for combining begins to increase slightly for a second time. Although the number of elements in the mapping tables decreases considerably, there is an increase in the number of "remappings," because the existing clusters become significantly larger, often spanning a larger number of process boundaries.

Figure 12 shows the number of entries in the final mapping table versus the probability p. This number will undoubtedly change if we change any one (or a combination) of the input lattice size, the number of processes used, or the random number seeds used in determining site occupancy. Using 32 processes, a value of p = 0.31, and a 320 × 320 × 320 lattice, the final mapping table had 230,074 elements, roughly 7500 for each of the 31 boundaries. When the final mapping table is sent back to all processes, each of the 32 processes receives roughly 7500 elements. The exceptions are the very first process, which receives only about 2000 elements (the clusters originally labeled there have the smallest overall label numbers and so generally do not need relabeling), and the last process, which receives only about 5000 (since this process has only one boundary).
in the future, workstation speeds continue to improve. This means that parallel algorithm development and modification for cluster environments will continue, if only to exploit newer computation-to-communication ratios.

We determined that the proposed algorithm for parallel cluster labeling on workstation networks minimizes the amount of data exchanged between processes: only the information necessary for the mapping tables needs to be communicated. Through use of EcliPSe's tree-combining mechanism, we were able to parallelize the combination of mapping tables. We found that using EcliPSe simplified the programming effort required, particularly in enabling data exchanges between computing processes.

Although our performance numbers have been encouraging, there is still scope for reducing communication costs and improving scalability. Performance is bound to improve if communication costs go down, such as through the use of compression schemes (i.e., reduction in boundary row and mapping table data). When using a larger number of processes, the tree topology can be expected to play a significant role in execution performance.

ACKNOWLEDGMENT

The 3-D cluster picture was obtained with Tonitza [23], a graphics package for structural biology. We thank its author, Ioana M. Martin, for help with the tool.
REFERENCES

1. J. Apostolakis, P. Coddington, and E. Marinari, A multi-grid cluster labeling scheme, Europhys. Lett. 17, 3 (January 1992), 189–194.
2. S. Baase, "Computer Algorithms: Introduction to Design and Analysis," Addison-Wesley, Reading, MA, 1978.
3. C. F. Baillie and P. D. Coddington, Cluster identification algorithms for spin models—sequential and parallel, Concurrency: Practice and Experience 3, 2 (April 1991), 129–144.
4. A. Beguelin, J. Dongarra, A. Geist, R. Manchek, K. Moore, and R. Wade, "HeNCE: A User's Guide (draft)," Technical Report, Oak Ridge National Laboratory, Argonne National Laboratory, November 1991.
5. M. Berry, J. Comiskey, and K. Minser, Parallel cluster analysis in landscape ecology, IEEE Comput. Sci. Engrg. 1, 2 (Summer 1994), 24–38.
6. K. Binder, Ed., "Topics in Current Physics," Vol. 36, pp. 1–36, Springer-Verlag, Berlin/New York, 1984.
7. K. Birman and R. Cooper, The Isis project: Real experience with a fault tolerant programming system, Oper. Systems Rev. 25 (1991), 103–107.
8. J. Browning and S. Tanimoto, Segmentation of pictures into regions with a tile-by-tile method, Pattern Recognition 15, 1 (1982), 1–10.
9. A. N. Burkitt and D. W. Heermann, Parallelization of a cluster algorithm, Comput. Phys. Comm. 54, 2 (June 1989), 201–209.
10. J. Des Cloizeaux and G. Jannink, "Polymers in Solution: Their Modelling and Structure," Oxford Univ. Press, New York, 1990.
11. T. H. Cormen, C. E. Leiserson, and R. L. Rivest, "Introduction to Algorithms," MIT Press, Cambridge, MA, 1990.
12. R. Cypher, J. L. Sanz, and L. Snyder, Hypercube and shuffle-exchange algorithms for image component labeling, J. Algorithms 10, 1 (March 1989), 140–150.
13. P. G. de Gennes, "Scaling Concepts in Polymer Physics," Cornell Univ. Press, Ithaca, NY, 1979.
14. R. Dewar and C. K. Harris, Parallel computation of cluster properties: Application to 2-d percolation, J. Phys. A 20 (1987), 985–993.
15. H. Embrechts, D. Roose, and P. Wambacq, Component labelling on a distributed memory multiprocessor, in "First European Workshop on Hypercube and Distributed Computers," pp. 5–17, 1989.
16. M. Flanigan and P. Tamayo, A parallel cluster labeling method for Monte Carlo dynamics, Internat. J. Modern Phys. C 3, 6 (1992), 1235–1249.
17. P. J. Flory, "Statistical Mechanics of Chain Molecules," Wiley–Interscience, New York, 1969.
18. B. Gropp and E. Lusk, "A Test Implementation of the MPI Draft Message-passing Standard," Technical Report, Mathematics and Computer Science Division, Argonne National Laboratory, 1992.
19. J. Hoshen and R. Kopelman, Percolation and cluster distribution, Phys. Rev. B 14 (1976), 3438–3445.
20. F. Knop, E. Mascarenhas, V. Rego, and V. Sunderam, An introduction to fault tolerant parallel simulation with EcliPSe, in "Proceedings of the Winter Simulation Conference," pp. 700–707, December 1994.
21. F. Knop, E. Mascarenhas, V. Rego, and V. Sunderam, Fail-safe concurrent simulation with EcliPSe: An introduction, Simulation Practice and Theory 3 (1995), 121–146.
22. F. Knop and V. Rego, Parallel cluster labeling on a network of workstations, in "Proceedings of the Thirteenth Brazilian Symposium on Computer Networks," pp. 533–548, May 1995.
23. I. M. Boier Martin and D. C. Marinescu, "Graphics for Macromolecular Crystallography and Electron Microscopy," Technical Report CSD TR-96-009, Purdue University, February 1996.
24. H. Nakanishi and J. Moon, Self-avoiding walks on critical percolation cluster, Physica A 191 (1992), 309–312.
25. H. Nakanishi, V. Rego, and V. Sunderam, On the effectiveness of superconcurrent computations on heterogeneous networks, J. Parallel Distrib. Comput. 24 (1995), 177–190.
26. V. J. Rego and V. S. Sunderam, Experiments in concurrent stochastic simulation: The EcliPSe paradigm, J. Parallel Distrib. Comput. 14, 1 (January 1992), 66–84.
27. M. D. Rintoul, J. Moon, and H. Nakanishi, Statistics of self-avoiding walks on randomly diluted lattice, Phys. Rev. E 49 (1994), 2790–2803.
28. H. E. Stanley, "Introduction to Phase Transitions and Critical Phenomena," Oxford Univ. Press, New York, 1971.
29. D. Stauffer, Scaling theory of percolation clusters, Phys. Rep. 54 (1978), 1–74.
30. D. Stauffer, "Introduction to Percolation Theory," Taylor & Francis, London, 1985.
31. V. S. Sunderam, PVM: A framework for parallel distributed computing, Concurrency: Practice and Experience 2, 4 (December 1990), 315–339.
32. R. H. Swendsen and J. Wang, Nonuniversal critical dynamics in Monte Carlo simulations, Phys. Rev. Lett. 58, 2 (1987), 86–88.
33. U. Wolff, Collective Monte Carlo updating for spin systems, Phys. Rev. Lett. 62, 4 (1989), 361–364.
FELIPE KNOP is a Member of the Technical Staff at IBM Corporation. He received a Masters degree in electrical engineering from the University of São Paulo, Brazil, in 1990, and Masters (1993) and Ph.D. (1996) degrees in computer sciences from Purdue University. His current research interests include parallel and distributed simulation and distributed systems.

VERNON REGO is a Professor of Computer Sciences at Purdue University. He received his M.Sc. (Hons.) in mathematics from B.I.T.S. (Pilani, India) and an M.S. and Ph.D. in computer science from Michigan State University (East Lansing) in 1985. He was awarded the 1992 IEEE/Gordon Bell Prize in parallel processing research, and is an Editor of IEEE Transactions on Computers. His research interests include parallel simulation, parallel processing, modeling, and software engineering.

Received May 5, 1997; revised December 29, 1997; accepted January 5, 1998