COMPUTER VISION AND IMAGE UNDERSTANDING
Vol. 72, No. 3, December, pp. 304–327, 1998
Article No. IV970672

Object Location by Parallel Pose Clustering

W. J. Austin and A. M. Wallace
Department of Computing and Electrical Engineering, Heriot–Watt University, Riccarton, Edinburgh EH14 4AS, United Kingdom
E-mail: [email protected], [email protected]

Received May 6, 1996; accepted November 10, 1997
This paper describes the location of 3D objects in either depth or intensity data using parallel pose clustering. A leader-based partitional algorithm is used that allows the number of clusters to be selected on the basis of the input data; this is important because the number of pose clusters cannot usually be determined in advance. In comparison with previous work, no assumptions are made about the number or distribution of data patterns, nor that the processor topology should be matched to this distribution. After overcoming a parallel bottleneck, we show that our approach exhibits superlinear speedup, since the overall computation is reduced in the parallel system. Isolated pose estimates may be eliminated from the cluster space after an initial stage, and this may be done with low probability of missing a true cluster. The algorithm has been tested using real and synthetic data on a transputer-based MIMD architecture. © 1998 Academic Press
1. INTRODUCTION

Object recognition and location can be achieved by a variety of approaches, including generalized Hough transformation [1] and pose clustering [2], which have been suggested as suitable candidates for parallel implementation. A candidate pose can be computed from any minimal subset of matched model and scene features, for example, surfaces or space curves detected in a depth or intensity image. If all such subsets are considered, these can be stored in a transformational or pose space. Clusters or accumulations of points in this space represent probable pose and object identification.

The purpose of this paper is to present a new algorithm for parallel pose clustering and to compare it with previous approaches. We include in our comparison methods based on Hough transformation and geometric hashing, since these are analogous but use fixed quantization schemes to form accumulations of evidence for object recognition and/or pose. A description of the implementation and serial complexity of other common algorithms for visual recognition can be found in [3]. Of these, parallel vision algorithms have been developed based on tree search, e.g., [4–7], maximal cliques, e.g., [8], and relaxation, e.g., [9, 10], in addition to the work on clustering and Hough transformation which we consider in some detail in Section 2. To achieve useful parallelism in the majority of cases, Wang et al. [11] observe the necessity to incorporate heuristic criteria. Simple scalable parallelism can also be achieved if multiple models or views are distributed among processors for matching or verification [8, 12]. Summaries of work on parallel vision methods, architectures, and algorithms can be found in review articles [13–15, 11], but these tend to reflect the relative paucity of high-level parallelism.

2. CLUSTERING AND PARALLEL IMPLEMENTATION

Hough transformation may be considered as a restricted form of clustering, in which the number and “diameter” of clusters are predetermined. This has an obvious appeal for parallel implementation, since it is possible to employ data parallelism either over the parameter bins or over the image points [16]. The majority of parallel Hough transform implementations deal with line detection, since the two-dimensional parameter space maps easily on to a parallel grid. Either regular image [17] or parameter space [18, 19] decompositions have been used. Since points that form a cluster in parameter space may in general be spread out in image space (and vice versa), problems can arise in parallel implementations: if the parameter space is split among the processors, matched features are broadcast to all processors and duplication of effort occurs; if the feature space is distributed, subsequent redistribution of the parameter space for peak detection may outweigh the parallel advantage [20].

These problems are compounded when a six-dimensional space is used for pose determination by Hough transformation [13, 21]. Direct mapping of the space to processors results in a very coarse quantization, and storage restrictions also limit the resolution that can be achieved. In general, a fixed parameter space is sparsely filled; feature clusters straddle cell boundaries and can be difficult to detect if they fall below the background noise level. Bhandarkar and Suk [21] employed a heuristic to circumvent this problem, rejecting bins with few votes and concentrating on fuller cells, using the available resources to perform finer quantization, similar to the strategy described by Austin et al. for plane detection [22].

Geometric hashing [23] is a process whereby models can be selected by an “inverted index” [3] which is formed off-line in advance of a recognition phase. For example, pairs of object features (sometimes termed “basis pairs”) can be used to form
invariant geometric indices, such as the angle between planes or the distance between centroids [24]. For every combination of features in every model the hash table is pre-formed, so that for a pair of planes at a given angle and separation of centroids, say, there is a list of the feature pairs and models which correspond. Recognition is analogous to Hough transformation in this respect, since after all feature pairs in the scene are considered, the hypothesised model (and matching features) are determined by a voting process. Pose determination can be refined by a standard method [25]. Object recognition systems based on geometric hashing have been implemented on a Connection Machine [26, 27]. In the former case, the pre-processing stage (i.e., construction of the hash table) is made parallel by considering all models and basis pairs in parallel. For recognition a basis system is formed in the scene and the other coordinate points are hashed in parallel. This allows both pre-processing and matching to be made parallel in the same processor network. To match successfully, the two scene points selected to form the coordinate system must be from the same model: this problem can be avoided by using all scene point pairs to form coordinate systems but this would lead to a considerable increase in the complexity of the matching phase. Wang et al. [28] noted that previous approaches require the same number of processors as there are bins in the hash table. Ignoring parallelization of the hash table, as this was created off-line, they distributed the hash table to all processors, and computation and quantization of the index from the scene features was performed in parallel. However, the same problem occurs as with Hough transformation when the image space is partitioned among processors as the peaks are distributed among the several processing elements. The quantization also creates difficulties when there are errors on the computed indices and hence the correct “peaks” may be distributed among “bins.” In a pose clustering approach, the transformational space is not defined explicitly as a quantized accumulator array. Instead the exact values of candidate poses are retained and grouped together according to some distance measure. This avoids quantization problems and only requires sufficient storage for the pose estimates and their associated cluster labels. Clustering can be split into two main classes, hierarchical and partitional [29]. Hierarchical algorithms can be further divided into agglomerative and divisive subclasses. In the agglomerative approach each data vector is initially treated as an atomic cluster and these are repeatedly merged to form higher level clusters, using a proximity measure. The aim is to identify the sub-structure of the resultant hierarchy that best represents clusters in the data vectors. Divisive hierarchical clustering can be viewed as the reverse of this process. Olson [30] has described hierarchical clustering PRAM algorithms for several distance metrics. In partitional clustering a single fixed partition of the data is sought in an attempt to recover the underlying structure of the data. This partition may be formed by minimizing a global measure such as the squared error or a local measure such as proximity. A common approach is to form K initial clusters and
iteratively refine these by moving data items between them so that the global error or cluster radius is reduced. Where least squares residuals are minimized, this is normally referred to as k-means clustering. The effectiveness of this approach is influenced by the number of clusters: if K is too low, cross-cluster contamination occurs; if K is too high, the cluster structure becomes fragmented and data that should be related is split.

The nature of hierarchical clustering implies a global communications pattern in parallel architectures, and it is thus ill suited to distributed memory MIMD architectures of the type we have used. Olson has previously hypothesized that optimal performance for such algorithms cannot be achieved on such architectures [30]. To apply the hierarchical approach in the context of pose clustering requires the heuristic setting of a threshold on the dendrogram, using a similarity measure based on the distance between pose estimates. We have favored the partitional clustering described below, considering all estimates and allocating them to leader clusters; this substitutes a threshold on allocation rather than on the formed dendrogram, still based on a distance between pose estimates. There are several key issues.

1. How are the number and distribution of initial clusters determined?
2. How are the input data patterns assigned initially?
3. What error criterion should be employed?
4. How are the patterns moved between clusters?
5. How are the cluster centers determined?
6. What is the termination condition?

In a parallel algorithm, it is necessary to consider not only the quality of the final outcome but also the parallel complexity and the ease of mapping of the algorithm to the architecture. Determination of the number and position of the initial clusters has a crucial influence on the final solution and complexity. Many authors assume a fixed K, not least because this makes it easy to map their algorithm on to a parallel architecture with a fixed number of processors. In the context of object recognition this is not desirable, since there may be more than one object, and symmetries or other common structures may give more than one cluster for a single object. The distribution is also important. Several authors [31–33] suggest that the initial clusters are determined heuristically or given by the user. Ranka and Sahni [34] also appear to assume an initial assignment, but they do vary K. They also acknowledge that different initial conditions give different results, whereas Rudolph [35] suggests that the initial condition is arbitrary, which is in direct conflict.

In previous work, the squared error criterion has almost invariably been used and the cluster centers determined by computation of the centroid of the member patterns [31–35]. Once an initial partition has been formed, cluster membership is altered to reduce this error, although in general it is difficult to guarantee a global minimum. If several values of K are chosen, then the best partition must also account for this variation, diverging from the simple squared error criterion. If k-means clustering is employed, the centroid is recomputed after each
pattern reassignment, but this is not amenable to parallel implementation because of its inherently sequential nature. Therefore the cluster centers generally remain fixed until all patterns have been considered for reassignment. As an exception, Stockman [2] suggested a fixed cluster radius rather than a squared error criterion, based on known input data characteristics. However, although he conjectured that pose clustering was suitable for parallel architectures, implementation was not discussed. The process of label reassignment and center updating is common to all the published algorithms, and it is this which offers the most scope for parallel advantage. For termination, all authors use either a stable condition in which no labels change or, failing that, a fixed number of iterations. However, when considering these published algorithms, the necessity for further iteration over different values of K must also be considered; many authors do not include this in their calculations of complexity.

There is great similarity between the several published parallel algorithms; the differences which occur in the implementations are relatively minor, and are architecture, rather than algorithm, driven. Ranka and Sahni [34], Rudolph [35], Barlos and Bourbakis [33], and Li and Fang [32] distribute the N data patterns of dimension M among P processors. For K clusters, the distance calculation and assignment can be done in O(NMK/P) time, since the distance from each pattern to each center must be computed. Then the new cluster centers can be determined, but this requires that the data vectors common to each cluster be combined. Since Ranka and Sahni use a tree topology but Rudolph a ring topology embedded in a hypercube, there are differences in the final computed complexities: O(NMK/P + MK log P) and O(NMK/P + MK) (for P ≤ K), respectively. These values depend on the number of processors available. For example, in the unlikely event that P = NMK, Ranka and Sahni conclude that O(log(NMK)) complexity is achieved. Using a similar strategy, Li and Fang used NM processors and claimed O(K log₂ M + (2K + 1) log₂ N) complexity. In a later paper [36], they compared parallel partitional and hierarchical algorithms but again assumed NM processors. This is unrealistic for our application, since N is generally very large and is certainly scene-dependent. Barlos and Bourbakis used a quad-tree topology and obtained a similar complexity for the updating process to Rudolph, concluding that if N ≫ P² this reduced to the simple O(NMK/P) of the basic iteration. All these figures can be compared with the time complexity O(NMK) of the serial algorithm.

3. A SERIAL CLUSTERING ALGORITHM FOR POSE DEFINITION

In Section 2 we reviewed several previous implementations of parallel clustering algorithms, highlighting the similarity in approach. With the exception of Stockman [2], these authors dealt with general clustering, not considering the specific application of object recognition. This is important because the assumptions about the number and distribution of data patterns and of clusters may be quite different. In this paper, we show how leader clustering followed by cluster refinement can be used successfully to determine object identity and pose, and leads to an efficient parallel implementation. In comparison with previous work, we make several contributions.

1. We make no initial assumptions about the number, K, or distribution of clusters. Rather, we derive these from the input data. This eliminates two common problems: first, the dependence of the final solution on the initial placement of clusters, and second, the need to increase the complexities of the published algorithms by iterating over several values of K and different distributions.

2. We exploit the constraints imposed by the application domain. Since the cluster space distribution is determined by the correct and incorrect correspondences between scene and model features, this leads to a pose space with clusters corresponding to correct poses, lesser clusters due to common structures between objects and parts of objects (e.g., symmetries or the common occurrence of orthogonal planes), and pseudo-randomly distributed isolated patterns due to incorrect feature correspondences. This is exploited to achieve superlinear speedup by eliminating unlikely clusters.

3. We have fully implemented and evaluated the pose clustering approach in a visual context. Although reference has been made to parallel implementation [2], this has not, to the best of our knowledge, been fully presented, and consequent problems such as cluster refinement in a distributed system have not been properly addressed. Further, we have applied the clustering approach concurrently to depth and intensity images of the same scene, rather than the more common single-modality visual system.

There are four basic stages in the clustering algorithm:

1. generation of pose estimates from matched model and scene features, (a) using depth data, (b) using intensity data;
2. initial cluster determination;
3. center determination and axis resolution;
4. iterative cluster refinement and hypothesis extraction.

Stage 1 depends on the different orthographic or perspective transformations used during image formation, and on the different segmentation algorithms used to extract surface and boundary features. Stages 2, 3, and 4 are common to depth and intensity data. The object pose may be derived from a single data source, or from both together.

3.1. Stage 1(a): Generating Pose Estimates from Depth Data

In segmented depth data, surfaces can be represented by their type, a unit orientation vector, and a location vector. This information will be referred to as a typed point–vector (p-v) pair. Four different surface types are considered: planes (normal, centroid), cylinders (axis, axis mid-point), cones (axis, vertex), and spheres (center only). Occlusion may cause problems, since it affects the position of planar centroids and cylinder axes when
FIG. 1. An illustration of the pose estimation process.
computed on visible surface patches. The absence of an orientation parameter for complete spherical surfaces implies that these cannot be used to determine rotation directly. While a direction vector could be determined for partial spheres, this would also be unstable under occlusion. Since the other orientation parameters are relatively stable, spherical surfaces are not used in the clustering process.

Geometric constraints are used to select model data during pose estimation. Generating pose estimates for all possible combinations of model and scene surfaces is computationally expensive: for a scene and model containing 30 surfaces each, it would be necessary to generate some 1.9 × 10⁵ pose estimates. To reduce this we employ an “inverted index” [3]. Using the object models, geometric constraints on angle and distance are computed for each pair of model surfaces. The inverted index associates lists of possible model surface pairs with the constraints between them. Given the constraints between a pair of scene surfaces, the index allows only those pairs of model surfaces that are consistent with the scene surfaces to be considered. The rotation can be determined from a pair of unit model vectors (v1, v2) and a pair of unit scene vectors (s1, s2) [37, 38]. The translation
vector can then be determined using a matched pair of scene and model location points. Each matched model and scene p-v pair has two location points that can be used to determine translation. Where a stable location is available (e.g., a cone vertex), this is used to generate a pose estimate for clustering for each possible rotation. Where both scene parameters are sensitive to occlusion, both of the possible translations are stored to ensure that the best estimate is recorded by the clustering process. This process is illustrated in Fig. 1, and except for the inverted index, is essentially similar to the initial stage of Stockman’s serial algorithm [2].
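As a concrete illustration of this stage, the following minimal Python/NumPy sketch recovers the rotation from two matched unit-vector pairs by SVD of a correlation matrix and then fixes the translation from one matched location point. It is a sketch of the technique, not the authors' occam implementation; the function names, and the completion of the correlation matrix with cross products for numerical rank, are assumptions made here.

```python
import numpy as np

def rotation_from_vector_pairs(m1, m2, s1, s2):
    """Best-fit rotation R with R @ m_i ~= s_i, for two matched unit vectors."""
    # Augment the two correspondences with their cross products so that the
    # 3x3 correlation matrix has full rank before the SVD is taken.
    M = np.column_stack([m1, m2, np.cross(m1, m2)])
    S = np.column_stack([s1, s2, np.cross(s1, s2)])
    C = S @ M.T                                    # correlation of scene against model
    U, _, Vt = np.linalg.svd(C)
    D = np.diag([1.0, 1.0, np.linalg.det(U @ Vt)])  # guard against reflections
    return U @ D @ Vt

def translation_from_match(R, model_point, scene_point):
    """One matched location point fixes the translation: t = p_s - R p_m."""
    return np.asarray(scene_point) - R @ np.asarray(model_point)
```

Each matched pair thus yields an (axis, angle, translation) estimate for the cluster space; where both location parameters are occlusion-sensitive, both candidate translations would be retained, as described above.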
STAGE 1(a). Generating pose estimates from depth data.
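The inverted index itself can be sketched as a hash table keyed on quantised pairwise constraints. The bin widths below are illustrative assumptions, and a practical version would also probe neighbouring bins to tolerate constraint values that fall near a bin boundary.

```python
from collections import defaultdict
import numpy as np

ANGLE_BIN_DEG, DIST_BIN_MM = 5.0, 10.0   # assumed quantisation of the constraint keys

def constraint_key(v1, p1, v2, p2):
    """Quantised (inter-vector angle, inter-point distance) for a surface pair."""
    cos_angle = np.clip(np.dot(v1, v2), -1.0, 1.0)
    angle = np.degrees(np.arccos(cos_angle))
    dist = np.linalg.norm(np.asarray(p1) - np.asarray(p2))
    return (round(angle / ANGLE_BIN_DEG), round(dist / DIST_BIN_MM))

def build_inverted_index(model_pairs):
    """model_pairs: iterable of ((id1, id2), (v1, p1, v2, p2)) tuples."""
    index = defaultdict(list)
    for ids, (v1, p1, v2, p2) in model_pairs:
        index[constraint_key(v1, p1, v2, p2)].append(ids)
    return index

# During recognition, a scene surface pair retrieves only consistent model pairs:
# candidates = index.get(constraint_key(vs1, ps1, vs2, ps2), [])
```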
A maximum of four rotation estimates per matched pair results when both surfaces have the same type: two for the hypothesized matches ((m1, s1), (m2, s2)) and ((m1, s2), (m2, s1)), and two more obtained by reversing the order of these in the rotation calculation. This has the effect of reversing the axis direction, so only the translation must be redetermined. As noted above, each rotation estimate can generate two translation vectors. This further increases the maximum number of pose estimates for each matched model and scene pair to eight.

The number of depth-based pose estimates depends on both the model and the scene data. In general, O(dc Ss²) estimates are produced, where Ss is the number of scene surfaces. Only half of the Ss² possible scene pairings need be tested, since order is not important. The value of dc depends on a number of factors, including model complexity and symmetry, the proportion of scene surfaces that are related to the model being tested, the types of surface data matched, and the effectiveness of the inverted index. Without the inverted index the value of dc would be ∼Ms²/2, where Ms is the number of model surfaces.

3.2. Stage 1(b): Generating Pose Estimates from Intensity Data

To create pose estimates from intensity data, the depth-based process is replaced by a perspective inversion algorithm for pose recovery from a single view [6]. Since no explicit geometric constraints can be applied to intensity-based feature data, it is difficult to constrain the matching processes, and a significant increase in the amount of memory required to store pose estimates is expected. A minimum of 3 matched scene and model points is required to invert the perspective transformation when the camera parameters are known [39]. We select points from the intersections of nonparallel line segments in the image and model. The use of intersection data provides easy parallelism and allows the “principle of least commitment” to be employed. Whereas grouping-based approaches enforce an early 3D interpretation on the scene, the clustering of intersection data leaves such “commitment” until clusters are extracted.

Model intersections are derived directly from the CAD model. To determine intersections between (extended) scene lines, every line segment must be compared with all other line segments and, if the intersection falls within the image area, a scene point triple is formed. This helps to reduce matching complexity by excluding parallel lines that appear to converge outside the image under perspective projection, but it can also exclude valid intersections. In each recovered scene triple (P1, P2, P3), let point P1 be the intersection point and P2, P3 be the line endpoints. Perspective inversion is applied to the scene and “matched” model triple to recover a set of possible 3D triples (P1′, P2′, P3′). Once these are known, the model–scene transformation can be determined: the unit direction vectors from P1′ to P2′ and from P1′ to P3′ are matched with the model direction vectors and used to determine the rotation matrix as in the depth case. Point P1′ and the equivalent model point are considered stable and are used to determine the
vector translation; i.e., each possible 3D triple generates a pose estimate. By assuming that the model occupies a certain portion of the image space, the camera model and known object model dimensions can be used to reject pose estimates on the basis of the distance from the camera. The serial intensity process corresponds to the depth process in Section 3.1.
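A minimal sketch of the scene-triple formation just described, assuming segments are given as endpoint pairs in pixel coordinates; all names are illustrative.

```python
import numpy as np

def cross2(u, v):
    return u[0] * v[1] - u[1] * v[0]   # z-component of the 2D cross product

def extended_intersection(seg_a, seg_b, eps=1e-9):
    """Intersection of the infinite lines through two segments, or None if parallel."""
    (a1, a2), (b1, b2) = seg_a, seg_b
    da, db = a2 - a1, b2 - b1
    denom = cross2(da, db)
    if abs(denom) < eps:
        return None
    t = cross2(b1 - a1, db) / denom
    return a1 + t * da

def scene_triples(segments, width, height):
    """segments: list of (P_start, P_end) pairs of 2D numpy arrays."""
    triples = []
    for i, seg_a in enumerate(segments):
        for seg_b in segments[i + 1:]:
            p = extended_intersection(seg_a, seg_b)
            # Discarding out-of-image intersections removes most near-parallel
            # pairs, at the cost of some valid intersections (as noted above).
            if p is None or not (0 <= p[0] < width and 0 <= p[1] < height):
                continue
            p2 = max(seg_a, key=lambda e: np.linalg.norm(e - p))  # far endpoint, line a
            p3 = max(seg_b, key=lambda e: np.linalg.norm(e - p))  # far endpoint, line b
            triples.append((p, p2, p3))   # (P1, P2, P3) with P1 the intersection
    return triples
```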
STAGE 1(b). Generating pose estimates from intensity data.
While in the depth-based case it is sufficient to store the pose and the associated feature identifiers, in the intensity-based case the recovered 3D vectors must also be stored, to allow the correlation matrix contribution to be determined during pose refinement. If these were not stored explicitly, it would be necessary to invert the perspective transformation each time a pose estimate moved between clusters during refinement. This is a simple trade-off between the cost of storing the scene vectors and the cost of redetermining them during refinement. O(ic Mt St) pose estimates are produced, where Mt is the number of model-based triples and St is the number of scene-based triples. As in the depth-based case, ic is data dependent and is difficult to determine in advance. Factors that influence ic include the rejection of pose estimates that fall outside the expected range and the variable number of estimates that may be generated from each matched triple pair.
3.3. Stage 2: Initial Cluster Determination

As discussed, clusters can arise from the true object pose, from common object structures or symmetry, or from pseudo-randomly distributed isolated chance alignments. This makes the number of clusters difficult to determine in advance. In contrast to earlier work, K is not fixed in advance [33, 35] nor determined by repeated testing [34], as this makes unrealistic assumptions about the constancy of the input data. We do not employ an arbitrary starting position [33–35] or initially link all data [2]. Rather, since the number and distribution of clusters are not known a priori, a modified leader clustering algorithm that reflects this is outlined below. A new leader is created each time a pose estimate is found to be incompatible with the existing pose leaders. For example, in Fig. 2 two pose estimates A and B are considered. Pose A is compatible with the first leader and is added to a list of associated poses. Pose B is not compatible with any of the existing leaders, and a new leader is created. Where a new estimate is compatible with two leaders, the first leader tested is chosen.
FIG. 2. Adding pose estimates to the leader cluster structure.
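A minimal sketch of this leader allocation loop, assuming each pose is held as a unit rotation axis, a rotation angle, and a translation vector; the tolerance names AngTol, AxTol, and TranTol follow the check function described in the text below, and the values given here are illustrative only.

```python
import numpy as np

ANG_TOL, AX_TOL, TRAN_TOL = np.radians(5.0), np.radians(5.0), 15.0  # assumed values

def compatible(leader, pose):
    """The check function: rotation angle, rotation axis, then translation."""
    ax_l, ang_l, t_l = leader
    ax_p, ang_p, t_p = pose
    if abs(ang_l - ang_p) > ANG_TOL:
        return False
    axis_angle = np.arccos(np.clip(np.dot(ax_l, ax_p), -1.0, 1.0))
    if axis_angle > AX_TOL:
        return False
    return np.linalg.norm(t_l - t_p) <= TRAN_TOL   # loosest check, applied last

def leader_cluster(poses):
    leaders, members = [], []            # members[k] lists poses allocated to leader k
    for pose in poses:
        for k, leader in enumerate(leaders):
            if compatible(leader, pose): # first compatible leader wins
                members[k].append(pose)
                break
        else:
            leaders.append(pose)         # incompatible with all leaders: new leader
            members.append([pose])
    return leaders, members
```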
STAGE 2. Initial leader partitioning of the cluster space.

In clustering it is common to construct a single distance measure, but it can be difficult to develop a measure that does not give undue emphasis to one particular aspect of the data [40]. In pose clustering there are two distinct components in the data to be clustered, i.e., rotation and translation, and each is treated separately. This allows separate clustering of rotation or translation if required, and is similar to (but more flexible than) the initial translation-based Hough transform approach of Linnainmaa et al. [41]. Both compatibility checks are performed by the check function. The rotational check ensures that the rotational angles of the leader and candidate poses agree within AngTol and that the axis angles are within AxTol. The distance between the leader and the candidate translations must be less than TranTol. Rotation estimates are generally more reliable than translation estimates, which are more sensitive to occlusion. This implies that TranTol should be more relaxed than AxTol and AngTol, to accommodate the spread in translation vectors that can arise from occlusion.

In comparison with other work [33–35], M is the number of clustered parameters (three for rotation and three for translation) and N is the number of input data patterns (poses). In our approach K is data dependent and increases as stage 2 progresses. If the data vectors are randomly distributed over the K clusters, then the WHILE loop carries out K/2 comparisons on average before the correct cluster is found. This gives an average bound on the WHILE loop and suggests an overall complexity of O(N M K) for leader clustering, i.e., of similar complexity to a single refinement iteration in [33–35].

3.4. Stage 3: Center Determination and Axis Resolution

As illustrated in Fig. 3, the leader for a given cluster may be some distance from the true cluster center. Once the initial clusters are known, the centers (i.e., the translation and rotation that best represent the pose of each cluster) can be calculated by least squares. Kanatani [38] shows how a correlation matrix Mk,

Mk = Σ_{i=1}^{nv} vmi vsi^T,    (1)

can be used to find the rotation that best fits a set of matched model and scene vectors. Here nv is the number of matched vectors associated with a cluster, and vmi and vsi are matched model and scene unit vectors. Whereas the initial pose estimates are determined from a pair of vectors, Mk allows individual matched vectors to be used to estimate pose. A correlation matrix is generated for each leader cluster, and rotation matrices can be determined from these using singular value decomposition (SVD). Once these are known, the related translation vectors can be determined. The axis-angle representation of rotation used during clustering can easily be determined from the rotation matrix.

FIG. 3. The leader pose may be some distance from the cluster center.

STAGE 3. Center determination and axis resolution.

During initial pose estimation, the order of consideration of the vectors is not clear and two axis directions must be generated for each matched pair of model and scene vectors. This problem does not arise in Eq. (1), and half of the initial clusters can be discarded once the true axis direction is known. Since all N pose estimates contribute to the K centers, this can be achieved in O(N M) + O(K) time for accumulation and for SVD/cluster deletion, respectively.

3.5. Stage 4: Pose Refinement and Hypothesis Extraction

FIG. 4. Clusters where refinement is required.

STAGE 4. Cluster refinement.

On completion of stage 3, the number of clusters is fixed and provides a good initial partition of the data. However, since the leader pose may be close to the edge of the cluster, some pose estimates can be incorrectly assigned in the initial partition, as illustrated in Fig. 4. During refinement the pose estimates are compared with all cluster centers and reassigned if a closer center can be determined. This process is essentially the same as a single K-value iteration of the other algorithms [33–35] and runs in O(I N M K) time. However, the number of iterations I is expected to be lower, since an appropriate K and good initial
partition are known. In many cases, a single iteration of our refinement process is all that is required to confirm the partition determined by leader clustering. Further, Ranka and Sahni [34] note that when K is not known, it is necessary to repeat the clustering process using different K values. This is not necessary in our approach since K is not fixed arbitrarily, so this again reduces the overall complexity markedly. When a pose estimate is reassigned, an entry of the form (pose identifier, old cluster, new cluster) is added to a list of cluster modifications. Once all moves have been determined, they are implemented and the correlation matrix (Mk ) for each affected cluster is updated.
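The center bookkeeping in this stage reduces to a few matrix operations. The following minimal sketch shows the per-estimate contribution and the cluster-to-cluster move of Eqs. (2) and (3) just below, using the Σ vm vs^T convention of Eq. (1); the function names are illustrative, not the authors'.

```python
import numpy as np

def contribution(mv1, sv1, mv2, sv2):
    # The term (m_v1 s_v1^T + m_v2 s_v2^T) that one pose estimate adds to M_k;
    # np.outer(a, b) builds the matrix a b^T.
    return np.outer(mv1, sv1) + np.outer(mv2, sv2)

def move_estimate(M_i, M_j, mv1, sv1, mv2, sv2):
    """Apply Eqs. (2) and (3): remove the estimate from cluster i, add it to j."""
    c = contribution(mv1, sv1, mv2, sv2)
    return M_i - c, M_j + c

# After all moves in an iteration are applied, only the clusters whose
# correlation matrices changed need their centers re-derived by SVD.
```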
When a pose estimate determined from the vector pairs (mv1, sv1) and (mv2, sv2) is moved from a cluster, the correlation matrix for that cluster, Mi, is changed according to

Mi = Mi − (mv1 · sv1^T + mv2 · sv2^T)    (2)

and the correlation matrix, Mj, of the cluster that the estimate is moved to becomes

Mj = Mj + (mv1 · sv1^T + mv2 · sv2^T),    (3)

where the term (mv1 · sv1^T + mv2 · sv2^T) is the contribution that
the pose estimate makes to the correlation matrix in Eq. (1). When all correlation matrix updates are complete, centers for those that have changed are redetermined by SVD before the next iteration begins. Refinement ends when there are no further inter-cluster moves or a maximum number of iterations (set by the user) is exceeded. This limit prevents repeated swapping of data between two closely related clusters. At the end of the refinement stage, pose hypotheses based on the final clusters are generated. In addition, each cluster identifies the unique model and scene vector matches that contributed to the pose estimation process. To further reduce complexity, only clusters of sufficient size need be considered. Intra-cluster pose variation can be used as a measure of “goodness” but cumulative error in a large cluster may outweigh the error in a small cluster that does not represent a true pose. Experiment suggests that where there is sufficient scene data, the occurrence of a model in the scene can be detected by cluster size alone. Table 1 summarizes the complexity of the serial algorithm, where N = Nd + Ni expresses the number of data patterns (pose estimates) as the sum of the depth and intensity-derived patterns,
respectively.

TABLE 1
Complexity of the Serial Algorithm

Process      Complexity
Stage 1(a)   O(dc Ss² M) ∼ O(Nd M)
Stage 1(b)   O(ic Mt St) ∼ O(Ni M)
Stage 2      O(N M K)
Stage 3      O(N M) + O(K)
Stage 4      O(I N M K)

In comparison with other work, the bounding complexity of a single iteration is O(N M K), but there is no requirement to try several values of K, and the number of iterations should be small as the initial partition should be close to the final one. This is achieved by including stages 2 and 3, which have limiting complexity O(N M K), so that the bounding complexity of the algorithm remains at O(I N M K). It should be noted that K is halved after stage 3 and that N is also reduced.

4. DEVELOPMENT OF A PARALLEL POSE CLUSTERING ALGORITHM

In principle, data parallelism is possible in each of the four stages described above. Control parallelism may also be considered by cascading any or all of the processing stages. However, there are two factors that make this unattractive.

1. Stages 3 and 4 require complete data from the previous stage before they can start.
2. The cost of communicating data between stages is considerable.

In our experiments, ∼10⁵ pose estimates can typically be generated for a single scene. Since these are required at each stage of the algorithm, it would be necessary to transfer this data from stage to stage in any pipeline. As processing progresses, further data is added to the inter-stage communications requirements: leader clustering introduces cluster labels and leader poses, while center calculation introduces correlation matrices and center values. Our preferred option is to minimize communication costs by implementing each stage of the algorithm on the same set of processors, with a requirement to communicate only minimal coordination data. The parallel software has been implemented in occam 2 on a Meiko Computing Surface. There is a single coordinating farmer, and the Nw worker processors are arranged in a grid with toroidal wrap-around. This is a good compromise between communication costs and extensibility. It allows relatively short distances from the farmer to the furthest
worker, but can be extended by the simple addition of a further row or column of processors.

The parallel algorithm is refined into 9 processes, shown in Fig. 5; data communications between farmer and workers are shown as Tx/Rx1 to Tx/Rx9. The parallel processes are described in Algorithms 1 to 9.

FIG. 5. The main processes and messages in the parallel pose clustering algorithm.

To minimize data movement in the parallel system, the boundaries between stages 1 to 3 in the serial algorithm, corresponding to processes 1 to 5 in Fig. 5, are relaxed. During pose estimation, each worker also assigns the pose estimates to local leader clusters and accumulates correlation matrices for those clusters. Once all local pose estimation has been carried out, the local cluster spaces are combined to form a global cluster space. By indexing the local spaces into this global space, most of the cluster data can remain in the worker network. Processes 6 and 7 correspond to the iterative refinement of stage 4 of the serial algorithm. Processes 8 and 9 represent the final collection and output of data, which was not included in the serial case but is now included for completeness.

ALGORITHM 1. Process 1 of the parallel algorithm (farmer).

During process 1 (see Algorithm 1), task data is generated by the farmer. For intensity clustering, scene point triples are generated; for depth data, matched model and scene surface pairs are identified using the inverted index. As the actual pose computation of stage 1 is performed during the distributed process 2 (see Algorithm 2), this task is relatively simple and the cost of distribution would outweigh the parallel benefits. The complexity of the inverted index examination is O(dc Ss² M). Tx/Rx 1 initiates process 2 (local pose estimation and leader cluster formation) on each worker. On completion, the local leader values and correlation matrices are sent to the farmer (Tx/Rx 2), where they are merged to form a global pose space (process 3; Algorithm 3). Pose estimates and their cluster assignments (representing the bulk of the local data) are retained at the
workers. The merger process is similar to the original leader formation process. When two local clusters are merged, their correlation matrices are combined by simple matrix addition and an index of merged clusters is constructed. The complexity of process 2 is

O(Nd M K/Nw + Ni M K/Nw) ∼ O(N M K/Nw),

since estimates of pose are required from indexed depth pairs and all intensity triples.

Global cluster formation is a serial step. Although it is possible to coordinate this among the workers in a similar way to Rudolph's center determination process [35], this would require more complex worker code. Unlike Rudolph, the relationship between the local clusters is not known at this stage, and indices into the merged spaces would have to be constructed after each local merger. Maintenance and transmission of these partial indices would increase the computational and communication costs. The grid-based communications used here are also more efficient than the ring-based communications used by Rudolph (the maximum distance to the coordinating process is √Nw compared to Nw). Since this step requires comparisons between all the clusters generated by the distributed pose estimations, its complexity is O(M K²).
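The process-3 merge can be sketched compactly. In this minimal sketch the local leaders and their correlation matrices arrive at the farmer, compatible clusters are combined by matrix addition, and a local-to-global index is built; the function names and data layout are illustrative, and `compatible` stands for the leader compatibility check of stage 2.

```python
def merge_local_spaces(local_spaces, compatible):
    """local_spaces: per-worker lists of (leader_pose, corr_matrix, size) tuples,
    where corr_matrix is a 3x3 numpy array supporting '+'."""
    global_clusters = []                 # merged (leader_pose, corr_matrix, size)
    index = {}                           # (worker, local id) -> global cluster id
    for w, clusters in enumerate(local_spaces):
        for c, (pose, M, n) in enumerate(clusters):
            for g, (g_pose, g_M, g_n) in enumerate(global_clusters):
                if compatible(g_pose, pose):
                    # Correlation matrices combine by simple matrix addition.
                    global_clusters[g] = (g_pose, g_M + M, g_n + n)
                    index[(w, c)] = g
                    break
            else:
                index[(w, c)] = len(global_clusters)
                global_clusters.append((pose, M, n))
    return global_clusters, index
```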
The global cluster centers are determined by farming the merged correlation matrices out to the workers (Tx/Rx 3), where the SVD-based rotation calculations are performed (process 4; Algorithm 4). In general, the number of clusters will be reduced, since previously separately defined clusters have been merged at the farmer during process 3. Hence the complexity of process 4 is O(N M K/Nw). The calculated centers are returned to the farmer (Tx/Rx 4), where the uncertainty in the axes that arose during initial pose estimation is resolved (process 5; Algorithm 5). The center information for the remaining merged clusters and an index from the local cluster spaces to the global space are then broadcast to the workers (Tx/Rx 5) to initiate iterative refinement. The complexity is O(M K), but the number of clusters is again reduced.

Refinement is broken into two separate processes in the parallel algorithm. During process 6 (see Algorithm 6), each worker compares its locally stored pose estimates with the cluster centers to determine any inter-cluster moves. If a pose estimate is reassigned, a move list entry is generated (cf. serial stage 4). Since the correlation matrices for the clusters are maintained at the farmer, the parallel move list is extended to include the contribution that a pose estimate makes to the correlation matrix (the term (mv1 · sv1^T + mv2 · sv2^T) in Eqs. (2) and (3)). The complexity of process 6 is O(N M K/Nw), since each pattern is compared with each cluster center. At the end of each iteration this list is supplied to the farmer (Tx/Rx 6), where the centers are recalculated for any modified clusters (process 7; Algorithm 7). Since a good data partition is determined by leader clustering, the number of inter-cluster moves is expected to be low, making the parallel recalculation of centers unnecessary. Thus the complexity is O(κ N M), where κ (≤ 1) denotes the fraction of patterns that are reassigned. Any updated centers are broadcast to the workers (Tx/Rx 7) to initiate the next refinement iteration. The number of iterations is a crucial parameter. In general, processes 6 and 7 combined have complexity

O(I N M K/Nw) + O(I κ N M) ∼ O(I N M K/Nw).

This offers the most direct point of comparison with the other published algorithms reviewed in Section 2, which would have complexity O(I1 I2 (N M K/Nw)), where I1 and I2 correspond to iterations at a fixed K and over variation of K, respectively.

ALGORITHM 2. Process 2 of the parallel algorithm (worker).

ALGORITHM 3. Process 3 of the parallel algorithm (farmer).

ALGORITHM 4. Process 4 of the parallel algorithm (worker).

ALGORITHM 5. Process 5 of the parallel algorithm (farmer).

ALGORITHM 6. Process 6 of the parallel algorithm (worker).

ALGORITHM 7. Process 7 of the parallel algorithm (farmer).

When refinement is complete, the workers identify the unique model–scene matches that contributed to the clusters (process 8; Algorithm 8), of complexity O(N K/Nw), and transmit these to the farmer (Tx/Rx 8). The farmer combines any related cluster information from the workers and outputs it to the user (process 9; Algorithm 9), of complexity O(K).

ALGORITHM 8. Process 8 of the parallel algorithm (worker).

ALGORITHM 9. Process 9 of the parallel algorithm (farmer).

The complexity of each stage of the parallel algorithm is summarized in Table 2. Our strategy minimizes data communication costs, so these are not presented; in practice they vary from about 0.1% (2 processors) to 4% (25 processors) of the total time. The number of clusters is reduced as processing proceeds. Since cluster formation is distributed among the workers, the clusters are merged during process 3 and then redistributed, thereby reducing their number. Process 5 discards approximately half of the clusters; this is the same as stage 3 of the serial algorithm.

TABLE 2
Process Complexity of the Parallel Algorithm

Process number   Location   Complexity
Process 1        Farmer     O(dc Ss² M) ∼ O(Nd)
Process 2        Workers    O(Nd M K/Nw + Ni M K/Nw)
Process 3        Farmer     O(M K²)
Process 4        Workers    O(N M K/Nw)
Process 5        Farmer     O(M K)
Processes 6 and 7 iterate I times:
Process 6        Workers    O(N M K/Nw)
Process 7        Farmer     O(κ M N)
Process 8        Workers    O(N K/Nw)
Process 9        Farmer     O(K)

4.1. Load Balancing among the Worker Processors

The volume of data involved in pose clustering makes it impractical to reassign data between processors once clustering has begun. Therefore the load balance is fixed by the first communications step, i.e., the distribution of the model and scene vector pairs or point triples. It is relatively easy to ensure an even distribution of work for depth data, as the possible model and scene matches are determined using explicit 3D constraints in the model index. For intensity data there are no explicit 3D constraints, and a distance-based constraint can only be applied after the pose has been determined. There are three main possibilities for distributing the intensity data in the pose clustering process.

1. Both the model and scene processing loops are split among the worker processes.
2. The outer (model) loop is split among the worker processes.
3. The inner (scene) loop is split among the worker processes.

While option 1 is employed for depth-based pose estimation, only options 2 and 3 are considered for intensity data, where all possible model and scene combinations must be considered. In general, the largest loop should be made parallel to give an even distribution of model and scene data. However, since the model and scene triples are related by geometry, a straightforward split may result in one worker (or a small number of workers) being responsible for all of the model triples that appear in the scene. One simple way to avoid this is to step through the loop that is implemented in parallel using Nw as an increment, i.e., worker i considers triples i, i + Nw, i + 2Nw, etc. As an illustration, Table 3 shows the results of applying the different strategies to scene data derived by projecting a model into the scene under a known transformation.

TABLE 3
Different Parallel Strategies and Their Effect on Load Balance

                        Time (s)                        Pose estimates
Type         Slowest   Fastest   Difference    Slowest   Fastest   Difference
Seq model    26.50     9.42      17.08         1478      402       1076
Seq scene    25.34     9.11      16.23         1458      170       1288
Step model   16.15     10.50     5.65          950       536       414
Step scene   15.97     10.95     5.02          944       573       371

In this table, “seq”
refers to parallelism using sequential runs of the loop indices and “step” refers to the use of Nw (=16) as the loop increment. The figures for the fastest and slowest workers were measured over five runs of the algorithm. Though the figures in Table 3 are specific to a single model and scene combination, the same modeling and segmentation processes are used for all scenes and similar results can be expected for other model and scene combinations of similar visual complexity. Sequential indexing results in a greater spread in execution time and number of estimates calculated between the slowest and fastest workers. There is still a noticeable difference between the slowest and fastest workers when Nw is used as a loop increment but the spread has been reduced. Based on this, we use stepped loop indices to control parallel pose estimation. An alternative approach would be to randomize the model and scene data prior to distribution.
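The two distribution strategies compared in Table 3 amount to different index sets per worker. A minimal sketch, with illustrative names; `stepped_share` implements the i, i + Nw, i + 2Nw, ... scheme used here.

```python
def stepped_share(n_items, worker_id, n_workers):
    """Worker takes items worker_id, worker_id + Nw, worker_id + 2*Nw, ..."""
    return range(worker_id, n_items, n_workers)

def sequential_share(n_items, worker_id, n_workers):
    """Worker takes one contiguous chunk of the loop indices ("seq")."""
    chunk = (n_items + n_workers - 1) // n_workers
    return range(worker_id * chunk, min((worker_id + 1) * chunk, n_items))
```

Because geometrically related triples tend to be adjacent in the loop ordering, the stepped scheme spreads them across workers, which is consistent with the narrower slowest-to-fastest spread seen in Table 3.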
4.2. Explanation and Removal of a Parallel Bottleneck

The costs of the parallel algorithm are lower than those of the serial algorithm. Since the data is split among the workers, the costs of leader clustering and of each refinement iteration become O(N M K/Nw). Unfortunately, the parallel approach introduces an additional coordination cost, that of merging the local leaders (process 3). This involves the comparison of approximately K local leaders at the farmer, with an associated cost of O(M K²), and results in a parallel bottleneck that restricts the speedup to approximately 2 when the unmodified algorithm is used.

It is instructive to compare the number of leader comparisons in the serial and parallel algorithms. Since all possible model–scene matches are considered by the serial algorithm, only a small proportion, k, of the resulting Pe pose estimates are related to possible object occurrences in the scene. In the worst case, the remaining k′Pe (where k′ = 1 − k) pose estimates form singleton clusters that require O((k′Pe)²) leader comparisons. In the parallel algorithm, the leader comparisons are split among the workers, resulting in O((k′Pe)²/Nw²) comparisons at each of the Nw workers, i.e., Nw² fewer than were required in the serial case. When the global cluster space is formed, however, all of the k′Pe related clusters are compared in the global cluster space, and this costs the same as in the serial case. The parallel algorithm thus performs more comparisons than the serial algorithm. Examining Tables 1 and 2, the limiting complexity of the serial algorithm is O(I N M K) and that of the parallel algorithm is O(I N M K/Nw + M K²). Hence the anticipated speedup is of the order

Speedup ∼ O( (I N M K Nw) / (I N M K + Nw M K²) ) = O( (I N Nw) / (I N + Nw K) ).    (4)

FIG. 6. Anticipated speedup as a function of the number of workers, Nw, and the number of clusters, K.

Figure 6 shows the effect of the bottleneck depending on the formation of the cluster space. The speedup is plotted as a function of Nw and K for fixed N = 10,000 pose estimates and I = 1 iteration. Since I N forms a product in Eq. (4), these parameters can be varied in combination with a similar result. If there are few clusters, certainly 100 or fewer, the speedup is scalable and offers a reasonable prospect of parallel advantage. If, however, K is large due to many singleton clusters (k ≪ k′), then little or no appreciable speedup is obtained. A possible solution is to reduce the value of K during process 3, i.e., to apply a threshold to limit the number of local clusters before the global cluster space is formed at the farmer. The danger of this approach is that a true cluster representing the accumulated evidence from several
correct model and scene feature correspondences is split quasi-uniformly over several workers and is hence lost. The threshold is a function of the pattern complexity and distribution and of the number of workers, and must be determined so that there is a low probability of rejecting the true cluster. In adopting this approach, there is some precedent in Stockman's suggestion of the deletion of small clusters, recognizing that most of the data patterns arise from mismatched scene and model features [2]. Rather than use a size threshold, we could alternatively limit K by returning a fixed number of clusters from each worker. This would then have some similarity with the previous work where K was fixed. However, the crucial difference is that the leader process is still used to determine where the chosen clusters originate, and is thus a much truer representation of the data structure; yet, by limiting K, we can still obtain comparable parallel advantage.

To determine a suitable threshold, we assume that there are α patterns arising from correct feature matches, which should form the true pose cluster, and β patterns which are incorrect and lie elsewhere in pose space. The problem is to determine the probability that there are no more than γ true estimates on any processor when the initial distribution occurs, i.e., the probability of missing the correct pose because the cluster is split across several processors. Assuming an even distribution of patterns, there are N/Nw of these on each processor. If we further assume that all configurations of data patterns on processors are equally likely, then we can define a generating function g(x) [42] to distribute the false pose estimates. Our model is not exact, since the order of segmentation of the scene data influences the distribution, as discussed in Section 4.1. However, we shall also present experimental evidence in Section 5 which is in general accord with the random model:

g(x) = (1 + x + x² + · · · + x^(N/Nw))^Nw.    (5)

C_all, the value of the coefficient of x^β, defines the number of possible distributions of the β false estimates. As the total number of patterns allocated to each processor is N/Nw, Eq. (5) also defines the distribution of true estimates and is the total number of possible pattern distributions. Next, we consider the constraint that there must be no more than γ true estimates on any processor. The generating function is

gγ(x) = (x^(N/Nw − γ) + · · · + x^(N/Nw))^Nw    (6)

and the value of the coefficient of x^β, Cγ, is the number of combinations. The ratio of Cγ to C_all defines the probability of missing a cluster. Equations (5) and (6) may be expressed as

g(x) = Σ_{i=0}^{Nw} C(Nw, i) (−1)^i x^((N/Nw + 1) i) · Σ_{j=0}^{∞} C(j + Nw − 1, j) x^j    (7)

gγ(x) = x^((N/Nw − γ) Nw) Σ_{i=0}^{Nw} C(Nw, i) (−1)^i x^((γ + 1) i) · Σ_{j=0}^{∞} C(j + Nw − 1, j) x^j,    (8)

where C(n, r) denotes a binomial coefficient.
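The coefficients C_all and Cγ can also be evaluated numerically by expanding the products of Eqs. (5) and (6) directly, which avoids the alternating sums of Eqs. (7) and (8). A minimal sketch with exact integer arithmetic follows; the names are illustrative, and N is assumed to be divisible by Nw.

```python
def poly_power_coeff(lo, hi, n_factors, target):
    """Coefficient of x^target in (x^lo + x^(lo+1) + ... + x^hi)^n_factors."""
    coeffs = {0: 1}
    for _ in range(n_factors):
        nxt = {}
        for deg, c in coeffs.items():
            for e in range(lo, hi + 1):
                if deg + e <= target:            # higher degrees cannot contribute
                    nxt[deg + e] = nxt.get(deg + e, 0) + c
        coeffs = nxt
    return coeffs.get(target, 0)

def miss_probability(n_patterns, n_true, n_workers, gamma):
    per = n_patterns // n_workers            # N / Nw patterns per processor
    beta = n_patterns - n_true               # number of false estimates
    # Symmetry of Eq. (5): the coefficient of x^beta equals that of x^(N - beta).
    c_all = poly_power_coeff(0, per, n_workers, min(beta, n_patterns - beta))
    # Eq. (6) after factoring out x^((per - gamma) * Nw).
    c_gamma = poly_power_coeff(0, gamma, n_workers,
                               beta - (per - gamma) * n_workers)
    return c_gamma / c_all
```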
This allows us to compute the values of C_all and Cγ by considering the appropriate values of i and j to give the x^β terms. Thus, for example, if there are 25 model and 8 scene depth features, this leads to 8400 pose estimates of which only 28 are true. If γ = 2, i.e., clusters of two or fewer patterns are discarded, the probability of missing the true cluster when 16 processors are used is 2.38 × 10⁻⁷. However, if there are only 10 true estimates, this becomes 0.304. In general, the variables in Eqs. (7) and (8) are scene dependent; however, a value of γ = 2 has been used in the experiments of Section 5, which excludes most insignificant clusters but allows a high probability of success, provided that the number of true patterns is equal to or greater than the number of processors. This assumption is appropriate to the problem and architecture; if not, then γ can be reduced, but the parallel advantage is lower. The thresholding approach used here should not be confused with that used by Olson [43], who used a data sampling technique to restrict the computational complexity of both sequential and parallel clustering, similar to the RANSAC algorithm [39].

5. EXPERIMENTAL RESULTS

To evaluate the parallel algorithms we used a small database of four relatively simple object models (A–D in Fig. 7). The algorithms were tested using synthetic and real depth and intensity data. The use of synthetic data gave us a known ground truth and allowed us to control errors when characterising the algorithms. It should be noted that the serial times are quite long; this is because we are using all the data to compute the final pose. For object A, for example, there are 22 surfaces and 234 junctions. As surface pairs and junction triples are used, an examination of Table 1 shows that there are still approximately 6.5 × 10⁴ comparisons of scene and model features to produce pose estimates, even allowing for the use of the inverted index, i.e., N ≈ 6.5 × 10⁴. Both serial and parallel algorithms can be reduced by heuristics of the type discussed in Section 2 to prune the search space. In the parallel algorithm, we do normally prune clusters of fewer than 3 estimates (γ = 2) after leader formation, but in these experiments we shall also show how variation of γ affects performance. In general, speedup, rather than the actual time taken, is most important in comparing serial and parallel processes. Finding a point of comparison with the earlier work is not always straightforward. Of the papers reviewed in Section 2, many present theoretical discussion alone and have not implemented the proposed algorithms. Where experimental results are given, speedup is used commonly and the actual times are not always presented.
FIG. 7. The object models used to test the clustering algorithm.
Rygol et al. [8] present times for model location ranging from 8 to 10 s, dependent on scene complexity. Rucklidge [44] uses arguably more complex scenes, but simple objects, and presents instances of model location on the order of 40 min on 6 Sparc workstations. Choudhary and Ponnusamy [45] appear to match features in stereo data, a task which is partly analogous to model–scene feature matching, in times falling from 20 s to 2 s as the machine is scaled from 2 to 16 processors. In our experiments, we show how we can reduce the real time to below a second on a relatively old parallel machine. There are several ways to further improve performance, using the obvious heuristic of suitably chosen feature subsets [46] or using a more modern parallel machine, which was unfortunately not possible in this study.
5.1. Synthetic Depth Data

Scene data for the depth experiments was generated by rotating and translating the model point-vector pairs by a known amount. Three different sets of scene data were generated for each model:

1. perfect scene data;
2. all surfaces perturbed with a 2 degree maximum error in orientation vectors and a 2.5 mm maximum random variation in direction vectors; and
3. all surfaces perturbed with a 2 degree maximum error in orientation vectors and a 5.0 mm maximum random variation in direction vectors.
Table 4 shows the minimum and maximum deviations of the recovered pose estimates from the true value for each object. Pose estimates within the perturbed range are recovered in each case.

TABLE 4
Variation in Pose from True Values for Depth Data

                       Angle Diff. (degrees)   Axis Diff. (degrees)   Trans. Diff. (mm)
Object   Perturb (mm)  min      max            min      max           min      max
A        2.5           0.051    0.293          0.044    0.990         0.175    0.883
A        5.0           0.124    0.303          0.0560   0.110         0.136    1.041
B        2.5           0.138    0.340          0.149    0.188         0.635    2.239
B        5.0           0.009    0.172          0.091    0.124         0.525    4.494
C        2.5           0.307    0.453          0.044    0.086         0.491    0.878
C        5.0           0.377    0.511          0.044    0.096         0.592    2.422
D        2.5           0.214    0.259          0.313    0.343         0.523    0.612
D        5.0           0.292    0.500          0.295    0.343         0.910    2.168

The speedup figures for the whole process are given in Fig. 8. Since the threshold on distributed cluster size was applied so that clusters with fewer than three estimates were not returned to the farmer processor, the algorithm achieved superlinear speedup, reducing the processing time to a second or less in all cases.

FIG. 8. Total speedup for objects A–D (depth data).

FIG. 9. Speedup characteristics for object E.

The depth-based clustering algorithm was applied to four additional synthetic scenes derived from object E in Fig. 7. Experiments were carried out using 22, 12, 7, and 3 surfaces in the scene. As for objects A–D, perfect and perturbed data were generated for each additional model. The resulting speedup characteristics in Fig. 9 show that the parallel clustering algorithm can achieve superlinear speedup providing there is sufficient
scene data. As the complexity of the scene data decreases, the cost of distributing data becomes closer to the cost of calculation and the degree of parallel advantage is reduced. Even for the least complex scene, superlinear speedup is achieved for unperturbed data until 9 processors are reached. The speedup then flattens out and falls slightly when 25 processors are reached. When all model data is used, speedup in excess of 100 for 25 workers is observed. Note that for 3 surfaces, the algorithm requires more time than for 7 surfaces. This reflects the fact that the minimum threshold value was used in the 3 surface case to ensure that the pose was detected; this generates more clusters than in the 7 surface case. 5.2. Synthetic Intensity Data Synthetic intensity data was generated by applying a known perspective transformation to the model segments of objects
A–D. Three sets of scene data were generated for each model:

1. perfect projected data;
2. 25% of the projected points moved by 1 pixel; and
3. 25% of the projected points moved by 2 pixels.

One of eight possible perturbation directions was selected at random for each perturbed segment and both end points of the line were moved in the same direction; this perturbation is sketched below.
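A minimal sketch of the segment perturbation follows, assuming the eight directions are the unit steps on the pixel grid (the paper does not enumerate them, so this is our assumption):

```python
import numpy as np

# The eight assumed perturbation directions: unit steps on the pixel grid.
DIRECTIONS = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)]

def perturb_segments(segments, fraction=0.25, pixels=1, rng=None):
    """Move both end points of a randomly chosen fraction of projected line
    segments by `pixels` in one of eight random directions.
    `segments` is an (N, 2, 2) array of 2D end-point pairs."""
    rng = rng if rng is not None else np.random.default_rng(0)
    segments = segments.copy()
    n_move = int(round(fraction * len(segments)))
    for i in rng.choice(len(segments), size=n_move, replace=False):
        dx, dy = DIRECTIONS[rng.integers(len(DIRECTIONS))]
        # broadcasting applies the same offset to both end points
        segments[i] += np.array([dx, dy]) * pixels
    return segments
```

Setting pixels to 1 or 2 reproduces, under these assumptions, the second and third intensity data sets.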
Table 5 shows the number of line intersections that were created for each of the generated scene segments.

TABLE 5
Number of Scene Intersections for the Perturbed Test Data

         Perturbation (pixels)
Object   0       1       2
A        599     528     539
B        374     348     346
C        325     301     299
D        261     240     239
Table 6 shows the minimum and maximum deviations of the recovered pose estimates from the true values for each of the object models in the synthetic intensity experiments. The zero perturbation case is included to show that pixel positioning errors contribute to pose estimation error in the intensity case. The maximum errors are greater in the intensity case than in the depth case, reflecting decreased accuracy in the location of the points used to estimate the transformation.

Figure 10 shows the speedup characteristics and some timing information for the test data. The intensity-based process is much slower than the depth-based process and the speedup characteristics are disappointing when compared with the corresponding depth characteristics in Fig. 8. However, as the object complexity decreases from object A to object D, the speedup characteristic improves. Examination of the stored cluster data shows that the unexpectedly poor performance depicted in Fig. 10 was due to the physical limitations of the transputer network rather than a problem in the algorithm itself. Each processor has 4 megabytes of available memory and, since there is no "virtual" memory, all code and data structures on any given processor must fit within this space. As dynamic allocation of memory is not available under occam, a fixed array structure is used to hold the cluster and pose data, and at most 1750 clusters and 10000 pose estimates can be stored. For a serial process, this is an absolute limit. Both the serial and parallel leader clustering processes continue until there is no more space available for data storage; a sketch of this fixed-capacity leader clustering is given after Table 6.

TABLE 6
Variation in Pose from True Values for Intensity Data

                           Angle Diff. (deg)    Axis Diff. (deg)    Trans. Diff. (mm)
Object   Perturb (pix.)    min       max        min       max       min       max
A        0                 0.0193    1.01       0.316     1.220     0.816     8.897
A        1                 0.0727    0.816      0.245     1.256     3.109     10.077
A        2                 0.0457    0.630      0.444     0.3453    2.005     17.43
B        0                 0.0543    0.102      0.0198    0.149     0.465     1.7087
B        1                 0.00927   0.100      0.074     0.233     2.318     2.933
B        2                 0.0122    0.235      0.096     0.187     1.689     2.764
C        0                 0.0178    0.108      0.0280    0.148     1.482     7.181
C        1                 0.00180   0.238      0.0396    0.183     1.320     7.729
C        2                 0.0168    0.2378     0.0442    0.153     1.562     3.719
D        0                 0.375     0.600      0.068     0.127     0.393     1.493
D        1                 0.228     0.926      0.071     0.274     1.619     2.007
D        2                 0.0354    0.541      0.0523    0.194     0.470     5.497
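The following Python sketch illustrates leader clustering under such fixed storage limits. The capacities match the figures quoted above; the compatibility test is abstracted as a predicate standing in for the paper's two compatibility measures, and the early-exit behavior is our reading of "continue until there is no more space available."

```python
MAX_CLUSTERS, MAX_POSES = 1750, 10000  # fixed array sizes quoted in the text

def leader_cluster(poses, compatible):
    """Leader clustering under fixed storage: assign each pose estimate to the
    first leader it is compatible with, otherwise promote it to a new leader.
    Stops when either fixed array is full, as the occam arrays would force."""
    leaders, members = [], []
    stored = 0
    for p in poses:
        if stored >= MAX_POSES:
            break  # no more space for pose estimates
        for k, leader in enumerate(leaders):
            if compatible(p, leader):
                members[k].append(p)
                stored += 1
                break
        else:
            if len(leaders) < MAX_CLUSTERS:
                leaders.append(p)
                members.append([p])
                stored += 1
            # else: pose estimate is dropped for lack of cluster storage
    return leaders, members
```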
Since the parallel algorithm generates pose estimates and local clusters in the worker network, more space is available for the storage of pose data. When the serial algorithm examines fewer pose estimates and clusters than the parallel algorithm, a distortion in the speedup characteristic occurs and the parallel performance appears worse than it really is. The total number of pose estimates that are generated for the 4 test objects increases as $N_w$ increases. This continues until all possible pose estimates are stored in the worker network (after 9, 4, 4, and 2 processors for objects A, B, C, and D, respectively). Clearly the point at which all possible pose estimates are stored depends on the complexity of the model and scene data. The speedup distortion that is caused by storage limitations in the serial algorithm is worse for more complex models and scenes (as illustrated in Fig. 10). A similar picture emerges if the number of stored leader clusters is examined. Again, as model complexity increases, the storage requirements for leader clusters increase. Figure 11 shows speedup for the 4 object models when an adjusted serial time, $T_{sa}$, is used. This takes differences in storage between the serial and parallel algorithms into account according to

$$T_{sa} = F_{lc}\,\frac{l_p}{l_s}\,T_s + F_{pe}\,\frac{p_p}{p_s}\,T_s,$$

where $l_p$ and $l_s$ are the numbers of leaders stored in the parallel and serial programs, $p_p$ and $p_s$ are the numbers of stored pose estimates, and $T_s$ is the serial leader clustering time. The constants $F_{lc}$ and $F_{pe}$ refer to the proportions of time spent performing leader comparisons and pose estimation, respectively (determined by experiment to be 0.6 and 0.4 in Fig. 11). This gives a conservative estimate of true speedup, since it ignores the costs of center calculation and refinement and does not reflect the increased number of comparisons in the serial algorithm if more leader clusters are allowed. The adjusted characteristics show that superlinear speedup is indeed possible for the intensity-based version of the clustering algorithm. A small helper computing $T_{sa}$ is sketched below.
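As a check on the arithmetic, a small Python helper for the adjusted serial time; the constants 0.6 and 0.4 are the experimentally determined proportions quoted above, while the helper itself is our illustration:

```python
def adjusted_serial_time(Ts, lp, ls, pp, ps, F_lc=0.6, F_pe=0.4):
    """Adjusted serial time T_sa: scale the leader-comparison and
    pose-estimation fractions of the measured serial time Ts by the
    ratios of parallel to serial storage (leaders and pose estimates)."""
    return F_lc * (lp / ls) * Ts + F_pe * (pp / ps) * Ts

# Example usage: adjusted speedup for a parallel run taking Tp seconds:
#   speedup_adj = adjusted_serial_time(Ts, lp, ls, pp, ps) / Tp
```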
5.3. Multiple Object Scenes: Variation of the Cluster Threshold

In these experiments, we ray-traced depth and intensity images in orthographic and perspective projection, respectively, using more than one CAD object model. Standard segmentation routines were employed to extract the surfaces from the depth data and the line segments from the intensity data, so that both were used in the pose clustering process. We were then able to back-project the located object models of Fig. 7 onto the scene data. For example, Fig. 12a shows the pose solution of object B on a scene containing two objects and Fig. 12b shows solutions for objects C and D on a scene containing three objects. Figure 13a shows the speedup for the depth-based pose solution displayed in Fig. 12a as a function of size threshold. This is defined with respect to included clusters, i.e., equal to γ + 1; a sketch of the worker-side filter appears below.
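A worker-side filter consistent with this definition might look as follows (a sketch only; the paper's occam implementation is not shown):

```python
def clusters_to_return(local_clusters, gamma):
    """Worker-side size threshold: only clusters with more than gamma pose
    estimates are returned to the farmer; singletons and other small local
    clusters are discarded (i.e., the size threshold is gamma + 1)."""
    return [c for c in local_clusters if len(c) > gamma]
```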
FIG. 10. Total speedup for objects A–D (intensity data).
The purpose is to show how the threshold applied to cluster size in the parallel algorithm affects the speedup significantly, in line with the theoretical expectation of our complexity analysis and the prediction of Fig. 6. For example, if no clusters are rejected, even including those consisting of a single estimate, then the speedup with 16 or 25 processors is limited to about 3 times. However, merely rejecting single-estimate clusters (γ = 1) improves this to 11 or 12 times. If the threshold is set at γ = 2, which is our common practice, then superlinear speedup is achieved. In these experiments, we did not find that the true cluster was missed, even at higher thresholds, but no appreciable benefit was obtained, as the graphs in Fig. 13a level off beyond size 3. Experimentally, this confirms the expectation that the application of the cluster threshold is relatively safe, but of course there remains the option of reducing the risk further if a longer execution time is acceptable.

The proportion of time spent on the four principal stages of the parallel algorithm is shown in Fig. 13b. In particular, this demonstrates how the leader process dominates the parallel algorithm, particularly as the size threshold increases and the refinement stage becomes less significant. This is in contrast to the earlier work described in Section 2, in which the number and distribution of clusters were fixed arbitrarily in advance and, as a result, many more refinement iterations were required.

5.4. Real Data: Depth and Intensity Processing

Figure 15 shows the speedup characteristics that are obtained when clustering is applied to the segmented depth and intensity
FIG. 11. Adjusted total speedup for the four object models.
data for object D shown in Fig. 14. Since both scenes contain limited amounts of data, a low cluster threshold size of 2 (γ = 1) was used to allow cluster detection while maintaining a parallel advantage. Superlinear speedup is achieved in both cases. However, only the depth-based algorithm detects the desired pose for all processor configurations; in the intensity-based case it is detected only when fewer than 4 processors are used. Depth-based pose estimation uses pairs of matched surfaces and, where reasonable location points are available, a cluster with a minimum size of 2 is created from each pair. This means that as $N_w$ increases, the likelihood of detecting the desired clusters using the minimum cluster size threshold of 2 remains high. In intensity-based pose estimation each matched model and scene point triple produces one pose estimate for each rotation. As $N_w$ increases, it is easy for correct poses to be assigned to local singleton clusters that are missed when the threshold size is 2. When all local clusters are considered (the threshold size is 1), the desired pose is detected but the speedup is restricted to less than 2, as shown in Fig. 15.

Figure 14 shows back-projection of the model on real depth and intensity data. While the rotation estimates are reasonable, the location estimates include some error. This may be attributed to inaccurate segmentation of the scene; both depth and intensity clustering require accurate location points to obtain a good estimate of position. In the experiments using synthetic data this is not a problem, as the scene location points are based on perturbed model data. Position estimates for depth data are based on centroid values and these can be affected by poor segmentation and occlusion in the scene. In the depth scene shown in Fig. 14 several of the visible planes are poorly defined and the
FIG. 12. Pose solutions for scenes containing more than one object.
resulting scene centroids are inaccurate. In contrast, the scene normals are accurately located and the maximum error between any pair of scene normals is 2 degrees. The intensity estimates are based on line intersections and these too can be affected by segmentation errors. In intensity images the pixel location of edges may depend on the lighting conditions that prevail when the scene is captured, and this can lead to inaccuracy in the position estimate. This is demonstrated in Table 6, where a 2 pixel error in line position can result in up to a 17 mm location error (it should be remembered that the data in Table 6 were generated directly from the model). It should also be noted that under perspective projection a relatively small change in viewpoint can result in a significant change to the perceived scene.

5.5. The Contribution of the Size Threshold to Superlinear Speedup

The introduction of the size threshold heuristic in the parallel algorithm leads to the observed superlinear speedup. Though a size threshold is applied in the serial algorithm, only the effort required during refinement is reduced; in the parallel algorithm significant improvements are also made during leader clustering. Arguably, the behavior of the parallel algorithm can be mimicked
FIG. 13. Timing data for scenes containing more than one object.
FIG. 14. Clustering applied to real images of object D.
by decomposing the problem into a number of runs of an equivalent serial algorithm. In that case, superlinear speedup is not possible, since any lessons learned in the parallel algorithm can be applied retrospectively to the serial one. In the case of depth data clustering, the serial algorithm was modified to subdivide the leader clustering process to mimic local cluster formation in the parallel algorithm. (Memory limitations on a single processor mean that a similar approach is not possible for scenes involving intensity data.) This modified serial clustering algorithm was implemented by splitting the "FOR i" loop in stage 2 to allow the serial equivalent of local clusters to be determined. After applying the size threshold, the
FIG. 15. Speedup for the real data shown in Fig. 14.
FIG. 16. The “heuristic” and parallel components of superlinear speedup for Object A.
remaining clusters were combined as in the parallel algorithm. Timing information recorded using this modified serial algorithm can be used to determine a "heuristic speedup," $S_{\text{heuristic}}$, according to

$$S_{\text{heuristic}} = \frac{T_s}{T_{\text{heuristic}}}, \qquad (9)$$
where $T_s$ is the execution time of the unchanged serial algorithm and $T_{\text{heuristic}}$ is the time measured using the modified serial algorithm; a sketch of this decomposition is given below. As an example, we show the application of this new serial algorithm to a real scene containing Object A using size thresholds in the range 2–6 (i.e., those thresholds that ensure that the correct cluster is detected) and 2, 4, 9, 16, and 25 cluster subdivisions to mimic the real parallel configurations. The resulting heuristic speedup is illustrated in Fig. 16a. Once $S_{\text{heuristic}}$ is known, the proportion of speedup due to parallelism can be determined by dividing the superlinear speedup obtained using the unmodified serial algorithm by the equivalent $S_{\text{heuristic}}$ value, as illustrated in Fig. 16b. $S_{\text{heuristic}}$ increases with the size threshold as the computational cost of the leader clustering process is reduced. The proportion of superlinear speedup due to parallel processing also increases with the size threshold. These improvements in actual parallel performance are expected, since the size threshold reduces the lengths of the messages that must be transmitted to the farmer. The cost of the refinement phase is also reduced, since fewer clusters interact. The parallel performance for lower size thresholds could be improved further by carrying out the SVDs during refinement in parallel. While the graphs in Fig. 16 show that the size heuristic contributes significantly to superlinear speedup, it should also be noted that, whereas parallel solutions offer real improvements in execution time, the time taken by the retrospectively modified serial algorithm may still be unacceptably high, particularly if sufficient memory were available to cluster intensity pose estimates. Further, the parallel solution offers additional real storage benefits that allow all clusters and pose estimates in the more demanding intensity-based case to be held in the worker network.
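The decomposition can be sketched as follows. We assume a leader_cluster routine returning (leaders, member lists), such as the earlier sketch with its compatibility test bound in, and a merge step that combines surviving clusters as the farmer would; both names are hypothetical.

```python
import time

def heuristic_serial_time(poses, n_subdivisions, gamma, leader_cluster, merge):
    """Mimic parallel local clustering serially: split the pose estimates into
    n_subdivisions groups, leader-cluster each group independently, apply the
    size threshold, then merge the survivors as the farmer would.
    The measured wall time gives T_heuristic; S_heuristic = Ts / T_heuristic."""
    start = time.perf_counter()
    survivors = []
    chunk = max(1, len(poses) // n_subdivisions)
    for i in range(0, len(poses), chunk):
        _, members = leader_cluster(poses[i:i + chunk])
        # discard small local clusters before combination, as the workers do
        survivors.extend(c for c in members if len(c) > gamma)
    merged = merge(survivors)
    return time.perf_counter() - start, merged
```

Running this with 2, 4, 9, 16, and 25 subdivisions would mirror, under these assumptions, the configurations measured for Fig. 16a.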
6. CONCLUSION

We have described a serial pose clustering algorithm for object location in which the number of clusters, K, is determined from the data using a modified leader clustering process. This is important in vision applications where K is rarely known a priori. The use of two compatibility measures allows the relative stabilities of the orientation and location vectors to be taken into account.

A parallel version of the pose-clustering algorithm has been implemented on a MIMD architecture. This has been demonstrated on scene data arising from single depth and intensity images, and on both in conjunction. When compared with the sequential algorithm, the parallel version is limited by a bottleneck when all clusters, including isolated estimates, are considered. However, application of a threshold on cluster size after the leader process results in significant speedup, which is superlinear in the majority of cases. Theoretical analysis and empirical data show that although a probability of missing the true pose must exist, it is very small. In any event, a tradeoff between threshold size and speedup can always be made, depending on the particular specification.

A further consideration in improving the performance of the parallel algorithm is the movement of data between processors. This is minimised by transferring only necessary data between the workers and the farm coordination process. Most of the data resides in the worker network and is not transferred between workers.

In comparison with published work on generalized Hough transformation, the fixed (or in some cases dynamic) quantized accumulator space is replaced by explicit pose representations and their associated cluster labels. This is considerably more space efficient and the clusters are unaffected by quantization effects, such as peak splitting between adjacent bins. The serial clustering approach described by Stockman [2] does use explicit pose storage, but his cluster assignment process is less sophisticated. (Stockman suggested, but did not present, a future parallel algorithm.) All possible links between pose candidates are generated and clusters are extracted based only on size. This does not address the compactness of the clusters or compare the
assigned pose estimates with the center value of the cluster. In k-means clustering, an iterative refinement stage that takes the center values into account is generally regarded as necessary. The cluster refinement approach described here has been compared with previous work on parallel clustering [33–36]. All the described algorithms based on iterative refinement of an initial assignment used fixed K, and all required $O(NMK/N_w)$ time per refinement iteration. Several values of K need to be tested to find a good data partition and this further increases the computational cost. Generally, an arbitrary initial cluster assignment was used; it is reasonable to expect that this will require many more refinement iterations than the work described here. In addition, earlier work required that all center values be calculated at the end of each iteration. In our work, centers are calculated only for those clusters that have changed. This reduces the amount of calculation and communication required during each iteration. Rather, most of the computational effort is expended during leader clustering. The refinement calculation can be regarded as a "fail-safe" mechanism that has been found to be largely unnecessary.

ACKNOWLEDGMENTS

This work was supported by EPSRC Grant GR/J07884. Thanks are due to Jennie Hansen for discussion of the use of generating functions.
REFERENCES

1. D. H. Ballard, Generalizing the Hough transform to detect arbitrary shapes, Pattern Recognition 13(1), 1981, 111–122.
2. G. Stockman, Object recognition and localization via pose clustering, Comp. Vision Graphics Image Process. 40, 1987, 361–387.
3. A. M. Wallace, An informed strategy for matching models to images of fabricated components, Pattern Recognition 20(3), 1987, 349–363.
4. H. Liu, T. Y. Young, and A. Das, A multilevel parallel processing approach to scene labelling problems, IEEE Trans. Pattern Anal. Mach. Intelligence 10(4), 1988, 586–590.
5. A. M. Flynn and J. G. Harris, Recognition algorithms for the Connection Machine, in Proc. of 9th Int. Joint Conf. on Artificial Intelligence, Vol. 1, Los Angeles, 1985, pp. 57–60.
6. A. M. Wallace, G. J. Michaelson, P. McAndrew, K. Waugh, and W. Austin, Dynamic control and prototyping of parallel algorithms for intermediate and high level vision, IEEE Comp. 25(2), 1992, 43–53.
7. S-Y. Hwang and S. L. Tanimoto, Parallel coordination of image operators: model, algorithm and performance, Image Vision Comp. 11(3), 1993, 129–138.
8. M. Rygol, S. B. Pollard, and C. R. Brown, MARVIN and TINA: A multiprocessor 3D vision system, Concurrency: Practice and Experience 3(4), 1991, 333–356.
9. V. Dixit and D. I. Moldovan, Semantic network array processor and its application to image understanding, IEEE Trans. Pattern Anal. Mach. Intelligence 9(1), 1987, 153–160.
10. K. Kakusho, S. Dan, and T. Kitahashi, Computer vision based on a hypothesization and verification scheme by parallel relaxation, Int. J. Comp. Vision 9(1), 1992, 13–30.
11. C-L. Wang, P. B. Bhat, and V. K. Prasanna, High performance computing for vision, Proc. IEEE 84(7), 1996, 931–945.
12. P. J. Moody, P. J. Flynn, and D. L. Cohn, Parallel hypothesis verification, Pattern Recognition 26(10), 1993, 1521–1528.
13. J. J. Little, G. Blelloch, and T. Cass, Algorithmic techniques for computer vision on a fine-grained parallel machine, IEEE Trans. Pattern Anal. Mach. Intelligence 11(3), 1989, 244–257.
14. V. Chaudhary and J. K. Aggarwal, Parallelism in computer vision: A review, in Parallel Algorithms for Machine Intelligence and Vision (V. P. Kumar, P. S. Gopalakrishnan, and L. N. Kanal, Eds.), Springer-Verlag, New York, 1990.
15. S. Yalamanchili and J. K. Aggarwal, Parallel processing methodologies for image processing and computer vision, Adv. Electronics Electron Physics 87, 1994, 259–299.
16. V. F. Leavers, Which Hough Transform?, Comp. Vision Graphics Image Process: Image Understanding 58(2), 1993, 250–264.
17. R. A. Lotufo, E. L. Dagless, D. J. Milford, A. D. Morgan, J. F. Morissey, and B. T. Thomas, Hough transform for transputer arrays, in Proc. of IEE Int. Conf. on Image Processing, 1989, pp. 122–130.
18. J. J. Little, G. Blelloch, and T. Cass, Parallel algorithms for computer vision on the Connection Machine, in Proc. of 1st Int. Conf. on Computer Vision, London, 1987, pp. 587–591.
19. A. Rosenfeld, J. Ornelas, and Y. Hung, Hough transform algorithms for mesh-connected SIMD parallel processors, Comp. Vision Graphics Image Process. 41, 1988, 293–305.
20. S. Yalamanchilli and J. K. Aggarwal, A system model for parallel image processing, Pattern Recognition 18(1), 1985, 17–28.
21. S. M. Bhandarkar and M. Suk, Recognition and localisation of objects with curved surfaces, Mach. Vision Appl. 4, 1991, 15–31.
22. W. J. Austin, A. M. Wallace, and V. Fraitot, Parallel algorithms for plane detection using an adaptive Hough transform, Image Vision Computing 9(6), 1991, 372–384.
23. H. Wolfson, Model based object recognition by geometric hashing, in Proc. of European Conference on Computer Vision, 1990, pp. 526–536.
24. W. E. L. Grimson, Object Recognition by Computer, MIT Press, Cambridge, MA, 1990.
25. R. M. Haralick, H. Joo, C-N. Lee, X. Zhuang, V. G. Vaidya, and M. B. Kim, Pose estimation from corresponding point data, IEEE Trans. Sys. Man Cybernet. 19(6), 1989, 1426–1446.
26. O. Bourdon and G. Medioni, Object recognition using geometric hashing on the Connection Machine, in Proc. of the 10th International Conference on Pattern Recognition, Vol. 2, 1990, pp. 596–600.
27. I. Rigoutsos and R. Hummel, Massively parallel model matching, IEEE Comp. 25(2), 1992, 33–42.
28. C-L. Wang, V. K. Prasanna, H. J. Kim, and A. A. Khokhar, Scalable data parallel implementations of object recognition using geometric hashing, J. Parallel Distributed Computing 21, 1994, 96–109.
29. A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice Hall, New York, 1988.
30. C. F. Olson, Parallel algorithms for hierarchical clustering, Parallel Computing 21, 1995, 1313–1325.
31. L. M. Ni and A. K. Jain, A VLSI systolic architecture for pattern clustering, IEEE Trans. Pattern Anal. Mach. Intelligence 7(1), 1985, 80–89.
32. X. Li and Z. Fang, Parallel algorithms for clustering on hypercube SIMD computers, in Proc. of Int. Conf. on Comp. Vision and Pattern Recognition, 1986, pp. 130–133.
33. F. Barlos and N. Bourbakis, A parallel image clustering algorithm on the HERMES multiprocessor structure, Engineering Appl. Artificial Intelligence 5(4), 1992, 299–307.
34. S. Ranka and S. Sahni, Clustering on a hypercube multicomputer, IEEE Trans. Parallel Distributed Systems 2(2), 1991, 129–137.
35. G. Rudolph, Parallel clustering on a unidirectional ring, in Proc. of Conf. on Transputer Applications and Systems, 1993, pp. 487–493.
36. X. Li and Z. Fang, Parallel clustering algorithms, Parallel Computing 11, 1989, 275–290.
37. P. J. Flynn and A. K. Jain, BONSAI: 3D object recognition using constrained tree search, IEEE Trans. Pattern Anal. Mach. Intelligence 13(10), 1991, 1066–1075.
38. K. Kanatani, Geometric Computation for Machine Vision, Oxford Science, London, 1993.
39. M. A. Fischler and R. C. Bolles, Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography, Comm. ACM 24(6), 1981, 381–395.
40. J. A. Hartigan, Clustering Algorithms, Wiley, New York, 1975.
41. S. Linnainmaa, D. Harwood, and L. S. Davis, Pose determination of a three-dimensional object using triangle pairs, IEEE Trans. Pattern Anal. Mach. Intelligence 10(5), 1988, 634–647.
42. A. Tucker, Applied Combinatorics, Wiley, New York, 1984.
43. C. F. Olson, Time and space efficient pose clustering, in Proc. of IEEE Conf. on Comp. Vision and Pattern Recognition, 1994, pp. 251–258.
44. W. J. Rucklidge, Locating objects using the Hausdorff distance, in Proc. of 5th Int. Conf. on Computer Vision, Cambridge, MA, 1995, pp. 457–464.
45. A. N. Choudhary and R. Ponnusamy, Run time data decomposition for parallel implementation of image processing and computer vision tasks, Concurrency: Practice and Experience 4(4), 1992, 313–334.
46. A. M. Wallace, A comparison of approaches to high level image interpretation, Pattern Recognition 21(3), 1988, 241–249.