Signal Processing 21 (1990) 49-60 Elsevier
IMAGE TEMPLATE MATCHING ON HYPERCUBE SIMD COMPUTERS*

E.L. ZAPATA, J.I. BENAVIDES, O.G. PLATA, F.F. RIVERA
Department of Electronics, Faculty of Physics, University of Santiago de Compostela, Santiago de Compostela, Spain

J.M. CARAZO
Centro Nacional de Biotecnología (CSIC), Universidad Autónoma de Madrid, Campus de Cantoblanco, 28049 Madrid, Spain

Received 23 August 1989
Abstract. We present in this work a parallel algorithm to perform image template matching (PITM) on SIMD hypercube computers with non-shared local memory. This parallel algorithm is general in the sense that it allows arbitrary dimensions for the image, the template and the hypercube. The flexibility of the PITM algorithm is rooted in the partition of the dimensions of the hypercube into four subsets, each one associated with one independent loop of the sequential algorithm (template matching in the time domain), and in the way the data are distributed in the local memories of the processing elements (consecutive storage for the template and for the matrix of cross-correlation coefficients, and shifted-consecutive storage for the image). Both the algorithmic complexity and the data redundancy are analyzed.

Keywords. Image processing, template matching, object location, parallel algorithm, hypercube SIMD computer.
1. Introduction
* This work was supported by the Ministry of Education and Science (CICYT) of Spain under contracts TIC88-0094 and MIC88-0549, and by the Xunta de Galicia under contract XUGA80406488.

One of the chief characteristics determining the success of a parallel architecture is its ability to be adapted to a large number of computational problems, which is certainly the case for hypercube computers. They offer, from the topological point of view, a good balance between node connectivity, communication diameter, algorithm embeddability and ease of programming. All these characteristics are necessary for
the design of a non-restrictive parallel architecture [4]. Nowadays there are a number of commercially available hypercube computers [6, 18] which open the way to the application of parallel processing to a wide variety of applications, such as computer vision and image processing [9], computational physics, linear algebra [1, 7] and pattern recognition [13, 20], among many others. Image template matching is a fundamental application in the field of image processing and computer vision. Some problems, such as image filtering, edge detection, image alignment and object detection, can be described using the concept of template matching [14]. Basically, given a large image field and a template, the operation of template matching is used in order to find a subimage or, in general, a number of subimages of the image which are very similar to the template. Certainly, both the 'large image field' and the template are in themselves 'images'; however, for the sake of simplicity we will name the large image field just 'image'. There exist a number of methods to detect an object placed into a large image field. All these methods are based on a definition of a measure of similarity between two images [14, 17]. Image template matching algorithms are ideal candidates to be implemented in parallel because they need large amounts of computer time and because most of the elemental computational operations are performed among neighbouring elements of the images. Parallel programming thus allows for a very substantial reduction in processing time, which is critical in real-time applications. In this line, Fang et al. [2], Kumar and Krishnan [8] and Sahni and Ranka [15] have developed parallel algorithms for template matching on SIMD hypercube computers.
However, all these algorithms impose strong constraints on the size of the image, the size of the template and the dimension of the hypercube, since they require the number of points in the image and the template to be an integer power of two and, usually, the number of processing elements of the hypercube to be equal to the number of points of the image and/or the template. In this work we present PITM, a general parallel algorithm for image template matching on hypercube computers without any restriction on the size of the image, the template or the hypercube. PITM performs a parallel calculation of the cross-correlation coefficients, which is a general measure of similarity. The rest of the work is structured in the following order. The next section presents a brief description of the image template matching approach based on the cross-correlation coefficients. Section 3 is dedicated to describing the SIMD hypercube computer and the programming technique used to obtain the PITM algorithm. Section 4 presents the PITM algorithm itself and, finally, an evaluation of its performance in terms of algorithmic complexity and data storage redundancy is presented in Section 5.
2. Cross-correlation coefficient
Template matching may be described as the comparison of a template (also called window) with a large image (also called picture) in order to obtain the location of other possible 'windows' within this 'picture', each one of these windows resembling closely the template. More precisely, let P(u, v) and T(k, l) be the elements of an image and a template with Nr x Nc and Mr x Mc pixels, respectively (0 <= u < Nr, 0 <= v < Nc, 0 <= k < Mr, 0 <= l < Mc). The cross-correlation coefficient is defined as

    C(i, j) = ( SUM_k SUM_l P(k+i, l+j) T(k, l) ) / ( SUM_k SUM_l P^2(k+i, l+j) )^{1/2},    (1)

where both double sums run over 0 <= k < Mr and 0 <= l < Mc,
and where 0 <= i <= Nr - Mr and 0 <= j <= Nc - Mc [14]. It can be easily proved that the numerator of (1) is the cross-correlation function. The choice of the cross-correlation coefficient as the similarity measure to perform the image template matching is based on two important considerations. The first one is that this method allows us not only to obtain the location of the template in the image by searching for the maximum value of the cross-correlation coefficient, but this coefficient is in itself a locally normalized measure of the goodness of this matching. That is, this method does not only locate an object, but it first determines whether it is present or not through the significance of the cross-correlation value. The second consideration is that this technique tends to behave well in real data applications, where noise and some small geometrical distortions are present. In fact, a recent survey of methods to locate a template in an image [17] rated the cross-correlation coefficient as the most accurate and stable technique when applied to model data and to noisy electron microscopy images.
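As a point of reference for the parallel algorithm that follows, eq. (1) can be evaluated sequentially as in the following sketch (Python is used here only for illustration; the function name and array layout are ours, not from the paper):

```python
from math import sqrt

def cross_correlation_coefficients(P, T):
    # Sequential evaluation of eq. (1): slide the template T over the
    # image P; the numerator is the cross-correlation and the denominator
    # is the square root of the local image energy under the template.
    Nr, Nc = len(P), len(P[0])
    Mr, Mc = len(T), len(T[0])
    C = [[0.0] * (Nc - Mc + 1) for _ in range(Nr - Mr + 1)]
    for i in range(Nr - Mr + 1):
        for j in range(Nc - Mc + 1):
            num = den = 0.0
            for k in range(Mr):
                for l in range(Mc):
                    p = P[i + k][j + l]
                    num += p * T[k][l]
                    den += p * p
            C[i][j] = num / sqrt(den) if den else 0.0
    return C
```

Since the denominator is the energy of the image window (the template energy is the same for every placement and is omitted, as in eq. (1)), the maximum of C marks the window most proportional to T.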
3. Hypercube architecture

A q-dimensional hypercube computer is a machine with Q = 2^q processing elements (PE's) interconnected as a binary q-dimensional cube [16]. If a q-bit non-negative integer number r (r = 0, 1, ..., Q - 1) is assigned to each PE, then each PE(r) has q bidirectional and non-shared links with the q PE's PE(r^(b)) (b = 0, 1, ..., q - 1) whose indices differ from r in exactly one bit.
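The link structure can be illustrated with a two-line sketch (ours, for illustration only): the q neighbours of PE(r) are obtained by complementing one bit of r.

```python
def neighbours(r, q):
    # PE(r) is linked to the q PEs whose indices differ from r in
    # exactly one bit: PE(r XOR 2^b) for b = 0, ..., q-1.
    return [r ^ (1 << b) for b in range(q)]
```

For q = 3, PE(0) is linked to PE(1), PE(2) and PE(4), and every link is bidirectional, since complementing the same bit twice returns to the original index.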
The basic operation needed in the conversion from sequential algorithms to parallel algorithms is the fragmentation of nested loops in such a way that different iterations are processed by different PE's. This
implies that the data processed and produced in the different iterations have to be distributed over the PE's. However, it is the mapping strategy of the algorithm that determines the data distribution, not vice versa. On the other hand, the processing should be distributed keeping in mind that a parallel algorithm has to be flexible enough to handle problems whose size is not related to the number of PE's of the hypercube. Recently, Zapata et al. [21] analyzed how to partition a sequential algorithm in order to process it in parallel on a hypercube computer. Their design procedure was as follows:
(1) identification of the maximum nesting level of the independent loops of the sequential algorithm (this will define the number of dimensions of the algorithmic space);
(2) partitioning of the dimensions of the hypercube into subsets associated with the independent loops of the sequential algorithm;
(3) distribution of the data arrays used in the algorithm among the PE's, according to the indexing scheme of the PE's and the data distribution mode that have been chosen;
(4) construction of the parallel algorithm; and
(5) performance optimization by refining the partition of step (2).
In [13, 19-21] the successful application of this procedure to the parallelization of numerous algorithms is shown. The sequential algorithm for the calculation of the cross-correlation coefficients (eq. (1)) has four independent nested loops, corresponding to the rows and columns of the image and the template.
Following the approach outlined above, we will perform a partition of the q dimensions of the 2^q-PE hypercube into four subsets, denoted q3, q2, q1 and q0 (with q = q3 + q2 + q1 + q0). This partition allows us to represent the index r of each PE(r) by a vector (r3, r2, r1, r0), where r = r3*2^(q2+q1+q0) + r2*2^(q1+q0) + r1*2^q0 + r0. We will then associate r3 and r2 to the rows and columns of the template, respectively, and r1 and r0 to the rows and columns of the image and the cross-correlation coefficients. In this way, each PE(r) = PE(r3, r2, r1, r0) will have three local submatrices LP(r1, r0), LT(r3, r2) and LC(r1, r0), of size mr x mc for the template and nr x nc for the image and the local cross-correlation coefficients, where mr = ceil(Mr/2^q3), mc = ceil(Mc/2^q2), nr = ceil(Nr/2^q1) and nc = ceil(Nc/2^q0).

An analysis of eq. (1) easily shows the high level of locality of the operations. Based on this observation it would seem logical to adopt a consecutive distribution scheme for the matrices P, T and C, and a pure binary indexing for the PE's. However, this data distribution would place the P(k+i, l+j) and T(k, l) elements of each one of the terms in different PE's. This, in turn, would force us to perform a routing before initiating the local evaluation of eq. (1) in each of the PE's. This problem can be solved if the consecutive distribution of the image depends on the indices r3 and r2 of each PE. That is, the new data distribution, which we call 'shifted-consecutive', takes into account the shifting of indices that is intrinsic to eq. (1). The adoption of the consecutive scheme for the matrices T and C and of the shifted-consecutive scheme for the matrix P makes the matrices T, C and P be distributed in the following way:

T: t_kl is stored in the position (k mod mr, l mod mc) of the local submatrix LT(floor(k/mr), floor(l/mc)), in all the 2^(q1+q0) PE's whose indices r3 and r2 are r3 = floor(k/mr), r2 = floor(l/mc).

C: c_ij is stored in the position (i mod nr, j mod nc) of the local submatrix LC(floor(i/nr), floor(j/nc)), in all the 2^(q3+q2) PE's whose indices r1 and r0 are r1 = floor(i/nr), r0 = floor(j/nc).

P: p_uv is stored in the position (w mod nr, z mod nc) of the local submatrix LP(floor(w/nr), floor(z/nc)), in all the 2^(q3+q2) PE's whose indices r1 and r0 are r1 = floor(w/nr), r0 = floor(z/nc), where w = (u + nr*2^q1 - r3*mr) mod nr*2^q1 and z = (v + nc*2^q0 - r2*mc) mod nc*2^q0.

We illustrate how the data would actually be distributed with an example. Let us assume a hypercube computer with 16 PE's (q = 4) on which we want to distribute the image P and the template T, of dimensions 6 x 7 and 3 x 2, respectively, considering the following partition: q3 = 1, q2 = 0, q1 = 1 and q0 = 2. From this partition and the dimensions of P and T it follows that the dimensions of the local matrices LP and LT will be 3 x 2 and 2 x 2, that is, nr = 3, nc = mr = mc = 2 (we name local matrices the parts of
Fig. 1. Image and template distributions according to the shifted-consecutive and consecutive schemes, respectively.
general matrices which are stored by a single PE). The index r of the PE's is the vector (r3, r2, r1, r0), where r3 = 0, 1; r2 = 0; r1 = 0, 1 and r0 = 0, 1, 2, 3. Figure 1 shows the distribution of the elements of P and T in the local image and template matrices with a shifted-consecutive and a consecutive distribution, respectively. It can be noticed that some of the local matrices corresponding to both the image and the template have null elements. This is because the dimensions of P and T are not an integer power of 2 (Nr = 6, Nc = 7, Mr = 3 and Mc = 2). The consecutive distribution of the cross-correlation coefficient matrix is similar to that of the template, but with the dimensions of the image.
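The local sizes and the shifted-consecutive mapping of this section can be sketched as follows (a Python illustration under our own naming; it reproduces the 16-PE example above):

```python
from math import ceil

def local_sizes(Nr, Nc, Mr, Mc, q3, q2, q1, q0):
    # Sizes of the local submatrices: LT is mr x mc; LP and LC are nr x nc.
    return (ceil(Mr / 2**q3), ceil(Mc / 2**q2),
            ceil(Nr / 2**q1), ceil(Nc / 2**q0))

def image_home(u, v, r3, r2, nr, nc, mr, mc, q1, q0):
    # Shifted-consecutive distribution of P: the copy of pixel P(u, v)
    # held by the PEs with template indices (r3, r2) is shifted by
    # (r3*mr, r2*mc) before the usual consecutive mapping.
    w = (u + nr * 2**q1 - r3 * mr) % (nr * 2**q1)
    z = (v + nc * 2**q0 - r2 * mc) % (nc * 2**q0)
    return (w // nr, z // nc), (w % nr, z % nc)  # (r1, r0), local position
```

With the partition of Fig. 1, (q3, q2, q1, q0) = (1, 0, 1, 2), local_sizes(6, 7, 3, 2, ...) gives (2, 2, 3, 2); pixel P(2, 0), as seen from the PE's with r3 = 1, lands at local position (0, 0) of the PE's with (r1, r0) = (0, 0), which is the two-row shift visible in Fig. 1.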
4. The parallel image template matching algorithm PITM

The programming language that we will use in the following to show our parallel algorithm is an extension of the language C, called ACLAN (Array C LANguage) [11, 12, 21], which is machine independent and allows the direct programming of the PE's local memories. There are two types of statements in programs written in ACLAN. One type is formed by parallel statements; they process the data distributed among the PE's, or the data distribution itself, and are executed in the PE's. The second type is formed by 'scalar' statements which control the program flow; they are executed in the computer control unit. Table 1 shows the possible types of parallel statements, together with two new operators used in ACLAN expressions. The parallel statements have the form action {mask}. The field mask is an optional parallel expression that determines whether a particular PE is active or inactive for the action carried out by that statement; the action is executed only if the mask has a value different from zero for this PE. The three types of parallel actions are assignments among local memory components. These components are specified by means of multiple operands or multiple expressions, that is, expressions that can contain one or more multiple operands. A multiple operand can express one or more values and is specified by underdimensionalizing an operand and/or by including a range of values in one or more dimensions. For example, if M is a bidimensional matrix, M[0], M[*][1] and M[0:1][2] represent the first row of M, the second column of M and the elements M[0][2] and M[1][2], respectively. Routing actions use the predefined vector 'neigh[ ]' to express the neighbouring PE(r^(b)) (receiving PE) of each PE(r) (sending PE).
Table 1
Types of parallel statements and operators in ACLAN

Symbol   Meaning               Explanation
:=       Local assignment      Data transfer from one register to another in the same PE.
:=:      Local exchange        Data swap between two registers of the same PE.
<=       Remote assignment     Data transfer from one register to another in a different PE.
<=>      Remote exchange       Data swap between two registers in directly connected PE's.
<=:      Central assignment    Data transfer from PE to CU or vice versa.
.        Bit operator          Extract a bit or a range of bits.
in       Set operator          Check if a value is in a certain range.
#        Parallel register     Identify the index of the PE.
The ACLAN code that actually performs the load of the image and the template into the local memory of the PE's, according to the data distribution schemes described in Section 3, is as follows:

Load image P[Nr, Nc]

    for (p = 0; p < (1 << (q2+q3)); p++) {
        a = p.(q2+q3-1):q2 * mr;
        b = p.(q2-1):0 * mc;
        for (z = 0; z < (1 << q0); z++)
            LP[0:nr-1][0:nc-1] <=: P[a : (nr*(1<<q1)+a-1) % (nr*(1<<q1))]
                                    [z*nc+b : ((z+1)*nc+b-1) % (nc*(1<<q0))]
                {#.(q-1):(q0+q1) == p && #.(q0-1):0 == z};
    }

Load template T[Mr, Mc]

    for (p = 0; p < (1 << (q0+q1)); p++)
        for (z = 0; z < (1 << ...
The PITM algorithm is then the following:

     1  for (i = 0; i < wr; i++)
        for (j = 0; j < wc; j++) {
     2      coef := 0.;
     3      norm := 0.;
     4      for (k = 0; k < mr; k++) {
     5          a := (i+k) % nr;
     6          (k + r3*mr) >= Mr ? RM := 1 : RM := 0;
     7          for (l = 0; l < mc; l++) {
     8              b := (j+l) % nc;
     9              (l + r2*mc) >= Mc ? MASK := 0 : MASK := !RM;
    10              buf[1] := LP[a][b]*LT[k][l] {MASK};
    11              buf[2] := LP[a][b]*LP[a][b] {MASK};
    12              buf[1] := 0. {!MASK}; buf[2] := 0. {!MASK};
    13              if ((dest := (j+l)/nc) != 0) {
    14                  buf[0] := (# + Q - dest) % (1 << q0);
    15                  for (t = 0; t < q0; t++)
                            buf[0:2](neigh[t]) <= buf[0:2] {#.t != buf[0].t};
                    }
    16              if ((dest := (i+k)/nr) != 0) {
    17                  buf[0] := (# + Q - dest*(1 << q0)) % (1 << (q0+q1));
    18                  for (t = q0; t < (q0+q1); t++)
                            buf[0:2](neigh[t]) <= buf[0:2] {#.t != buf[0].t};
                    }
    19              coef := coef + buf[1]; norm := norm + buf[2];
                }
            }
    20      for (t = q0+q1; t < q; t++) {
    21          buf[1](neigh[t]) <= coef; buf[2](neigh[t]) <= norm;
    22          coef := coef + buf[1]; norm := norm + buf[2];
            }
    23      norm ? LC[i][j] := coef/sqrt(norm) : LC[i][j] := 0;
        }
where wr = min(nr - 1, Nr - Mr) and wc = min(nc - 1, Nc - Mc). All the two-character variables of the general form 'zh' have the same meaning as the variables of the form z_h that we have used in previous sections (mr = m_r, q0 = q_0, ..., as examples). sqrt(x) is the square root of x (with x real). coef and norm are temporary variables in which the PE's accumulate the numerator and the denominator of eq. (1). buf is a vector of three elements: buf[0] is used in the routing statements to indicate the destination of the contents of buf[1] and buf[2]. These latter registers are general purpose variables associated with the numerator and denominator of eq. (1), respectively.

PITM keeps the same general structure as the sequential algorithm that evaluates eq. (1), achieving parallelism by selecting the PE's that calculate the pertinent correlations (statements 6 and 9) and introducing the necessary routing statements. Statements 20-22 form the routing loop that is necessary due to the distributed nature of the problem and, therefore, are processed for all wr*wc cross-correlation coefficients if required by the partition (q3, q2, q1, q0). On the contrary, we have introduced the routing statements 13-15 and 16-18 to resolve those particular cases in which the product terms (k, l) generated by the PE's for each value (i, j) do not belong to the same cross-correlation coefficient; their function is to move the product terms to the corresponding PE. We will use an example to illustrate the behaviour of these two blocks of routing statements. Let us consider the same example previously used in the description of the data distribution scheme. Figure 2a shows the product terms that each PE calculates for the case (i, j) = (0, 0). We only present here the first eight PE's, since similar calculations happen in the other processors.
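The displacement computed in statements 13 and 16 can be sketched as follows (our illustration, with names of our own): it depends only on (i, j, k, l) and the local sizes, not on the PE index, which is what makes a common SIMD routing possible.

```python
def product_term_displacement(i, j, k, l, nr, nc):
    # Statements 13 and 16 of PITM: the product term (k, l) needed for
    # the local coefficient (i, j) was produced dest_r PE-rows and
    # dest_c PE-columns away whenever i+k or j+l wraps past the local
    # image size (nr or nc).
    return (i + k) // nr, (j + l) // nc
```

For the example of Fig. 2 (nr = 3, nc = 2), every term of (i, j) = (0, 0) yields displacement (0, 0), so no routing is needed, while (i, j) = (0, 1) with l = 1 yields a column displacement of 1, firing statements 13-15.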
In this case we can observe that each PE always calculates product terms that belong to the same cross-correlation coefficient for the different values of (k, l). In this way, the PE's numbered 0, 1, 2, 4, 5 and 6 each calculate four product terms, corresponding to the cross-correlation coefficients C00, C02, C04, C30, C32 and C34, respectively. The PE's 3 and 7 do not perform any useful calculation, since coefficients C06 and C36 do not exist, as can be deduced from Fig. 1. For this pair (i, j) = (0, 0) the statements 13-15 and 16-18 will not be executed, since
Fig. 2a. (i, j) = (0, 0). Statements 13-18 are not executed.
Fig. 2b. (i, j) = (0, 1). Statements 13-15 are executed.
all the product terms that are evaluated in each PE belong to the same cross-correlation coefficient. Obviously, routing statements 20-22 are needed to complete the numerator of eq. (1); for example, the coefficient C00 is evaluated in PE(0) with the product terms generated by PE(8) and by itself. Figure 2b shows a different situation, (i, j) = (0, 1), in which there are product terms that belong to different cross-correlation coefficients in the same PE. We see that for the cases (k, l) = (0, 1) and (1, 1) the calculated product terms should be moved from PE(r) to PE(r - 1), in order to be placed in the correct PE. Similar situations occur in all those cases in which (j + l) >= nc (statements 13-15) or (i + k) >= nr (statements 16-18). The reason is that when the local image matrix is displaced to a position (i, j) such that the template local matrix extends beyond the local image matrix ((j + l) >= nc or (i + k) >= nr), the product term (k, l) associated with the cross-correlation coefficient (i, j) corresponding to PE(r) has been calculated in PE(r + dest), where dest is specified in statements 13 and 16, depending on the case. From the above discussion it can be deduced that, in those cases in which the product terms are incorrectly located in the PE's, the numeric distance (dest) between the source PE and the target PE is the same for all the PE's. The routing specified in statements 15 and 18 is based on the following lemma.

LEMMA 1. Given a hypercube of dimension q, where each node r has the value r + k in A, k being a constant, the algorithm

    for (t = 0; t < q; t++) A(neigh(t)) <= A {#.t != A.t}

makes each node r end up with the value r in A.

PROOF. Given a node r, let (l_{n-1}, ..., l_0) be the bits in which r + k differs from r, that is, (r + k)_{l_i} != r_{l_i}, i = 0, 1, ..., n-1 (r_s denotes the s-th bit of r). If we represent the transfer of a value from PE(r) to PE(r^(b)) at time t by r --t--> r^b, the transfer of the value r + k from PE(r) to PE(r + k) is performed in the form

    r --l0--> r^{l0} --l1--> r^{l1 l0} --l2--> ... --l(n-1)--> r^{l(n-1) ... l0} = r + k.    (2)

However, it is easy to check that the intermediate values r^{lm ... l0} and r^{lm ... l0} + k still differ in the bits l_s for all s = m+1, ..., n-1, so the following transfers also occur:

    r^{l0} --l0--> r,  r^{l1 l0} --l1--> r^{l0},  ...,  r^{l(n-1) ... l0} --l(n-1)--> r^{l(n-2) ... l0}.

In this way the unidirectional transfers in (2) are really bidirectional exchanges. Finally, if l does not belong to (l_{n-1}, ..., l_0), that is, r_l = (r + k)_l, then clearly (r^l)_l = (r^l + k)_l and, therefore, there is no transfer in either direction between r and r^l.

There are two situations in which statements 13-15 and 16-18 may be eliminated from the algorithm PITM: (1) by overdimensionalizing the local image matrix in a number of rows and/or columns equal to the number of rows and/or columns, less one, of the template local matrix (n'r = nr + mr - 1; n'c = nc + mc - 1); (2) by selecting a partition of the dimensions of the hypercube with q1 = 0 and q0 = 0. In both situations the penalty is a data redundancy that will be analyzed in detail in the next section. Finally, the ACLAN program used to unload the cross-correlation coefficients calculated by PITM is as follows:
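Lemma 1 is easy to check with a small simulation (ours, for illustration; the SIMD exchange A(neigh(t)) <= A is modelled by letting every node whose mask condition holds write to its neighbour across dimension t):

```python
def unshift(q, k):
    # Each node r starts with A = (r + k) mod 2^q.  Sweeping the q
    # dimensions and transferring A to the neighbour whenever bit t of
    # the node index differs from bit t of A leaves A = r in every node.
    Q = 1 << q
    A = [(r + k) % Q for r in range(Q)]
    for t in range(q):
        nxt = A[:]
        for r in range(Q):
            if ((r >> t) & 1) != ((A[r] >> t) & 1):
                nxt[r ^ (1 << t)] = A[r]  # transfer across dimension t
        A = nxt
    return A
```

As the lemma states, the transfers always occur in matched pairs, so the one-directional writes simulated here realize the bidirectional exchanges used in statements 15 and 18.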
Unload cross-correlation coefficients C[Nr, Nc]

    for (p = 0; p < (1 << (q2+q3)); p++)
        for (z = 0; z < (1 << q0); z++)
            C[0 : nr*(1<<q1)-1][z*nc : (z+1)*nc-1] <=: LC[0:nr-1][0:nc-1]
                {#.(q-1):(q0+q1) == p && #.(q0-1):0 == z};
5. Performance analysis

The performance of the PITM algorithm is a function of eight variables, four of them (Nr, Nc, Mr and Mc) characterizing the problem size and the other four (q3, q2, q1 and q0) characterizing how the hypercube architecture has been used. For the sake of simplicity, let us ignore in what follows the initial step of loading data into the local memory of the PE's and the final step of unloading the cross-correlation coefficients. The algorithmic complexity of PITM is O(nr*nc(mr*mc(1 + q0 + q1) + q2 + q3)), where we have assumed that wr = nr - 1 and wc = nc - 1 for simplicity. The term O(nr*nc*mr*mc) is the algorithmic complexity associated with the local calculation of the cross-correlation coefficients, and the term O(nr*nc(mr*mc(q0 + q1) + q2 + q3)) results from the needed routing operations. If we overdimensionalize the local image matrix by mr - 1 rows and mc - 1 columns, then the algorithmic complexity of PITM is reduced to O(nr*nc(mr*mc + q2 + q3)), since statements 13-18 are not necessary. Table 2, in which n1 = log2 Nr, n0 = log2 Nc, m1 = log2 Mr and m0 = log2 Mc, lists the processing time and the total algorithmic complexity for different numbers of PE's (tf is the time associated with arithmetic or logic operations and tc is the time associated with elemental routing). Since the structure of the parallel algorithm is essentially the same as that of the sequential algorithm, the algorithmic complexity of the former is the same as that of the latter, that is O(NrNcMrMc), for the case of one single PE. It is interesting to notice that in the extreme case of Q = NrNcMrMc we obtain an optimal partition if we assign q3 = m1, q2 = m0, q1 = n1 and q0 = n0. The total complexity would then be reduced to O(m1 + m0), since the latter partition is in agreement with the overdimensionalizing conditions expressed before (mr = 1, mc = 1).
Table 2
Examples of the performance of the parallel image template matching algorithm

      q3   q2   q1   q0   n. PEs      Complexity          Time
 1    0    0    0    0    1           NrNcMrMc            NrNcMrMc tf
 2    0    0    n1   0    Nr          NcMrMc n1           NcMrMc tf + NcMrMc n1 tc
 3    0    m0   0    0    Mc          NrNcMr              NrNcMr tf + NrNc m0 tc
 4    0    0    n1   n0   NrNc        MrMc(n1+n0)         MrMc tf + MrMc(n1+n0) tc
 5    m1   m0   0    0    MrMc        NrNc(m1+m0)         NrNc tf + NrNc(m1+m0) tc
 6    m1   0    n1   0    NrMr        NcMc                NcMc tf + Nc n1 tc
 7    0    m0   n1   0    NrMc        NcMr n1             NcMr tf + Nc(Mr n1 + m0) tc
 8    m1   0    n1   n0   NrNcMr      Mc n0               Mc tf + (Mc n0 + m1) tc
 9    m1   m0   n1   0    NrMrMc      Nc(m1+m0)           Nc tf + Nc(m1+m0) tc
10    m1   m0   n1   n0   NrNcMrMc    m1+m0               tf + (m1+m0) tc
In this case each PE calculates one single product element of eq. (1) and, therefore, log2 Mr + log2 Mc routing operations are needed. We have obtained the following formal result.

THEOREM 1. Image template matching can be executed in a hypercube computer with NrNcMrMc PE's, and with a constant storage for each PE, in a time O(log2 Mr + log2 Mc). Furthermore, this time is optimum.
In this way our PITM algorithm has an optimum algorithmic complexity, improving the previously reported limits [2-15]. When the size of the hypercube is NrNc PE's and the partition is q3 = q2 = 0, q1 = n1 and q0 = n0, which corresponds to the fourth line of Table 2, all PE's store a complete template and one pixel of the image. We can drastically reduce the redundancy associated with the template if we store the latter in the control unit and broadcast to all PE's the element associated with each iteration. If we overdimensionalize the local image matrix by Mr - 1 rows and Mc - 1 columns, the algorithmic complexity will be reduced to O(MrMc), which is optimum since there is no routing involved. Again, we can establish the following result.

THEOREM 2. Image template matching can be executed in a hypercube computer of size NrNc PE's, with O(MrMc) local memory per PE, in a time O(MrMc). Furthermore, this time is optimum.
In this case each PE will evaluate a complete cross-correlation coefficient. However, the data redundancy is important. This particular case is interesting for some basic operations of image processing, such as filtering or edge enhancement, because the small size of the template makes the complexity/redundancy ratio specially favourable. In cases 5, 6, 8 and 9 of Table 2 the statements 13-15 and/or 16-18 of the PITM algorithm will not be executed, since the conditions for overdimensionalizing the columns and/or rows of the local image in the PE's are fulfilled. In these cases the number of columns and/or rows of the local template is one. Based on the above results, and on the general behaviour of other parallel algorithms, we see that PITM increases the total processing speed by replicating data in the local memories of the PE's. Specifically, if we assume that LP(r1, r0) and LT(r3, r2) contain pixels (1 byte) and LC(r1, r0) contains single precision floating point numbers (4 bytes), then the data distribution used in PITM generates the redundancy

    R = 5 nr nc 2^(q3+q2) + mr mc 2^(q1+q0).    (3)
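As an arithmetic aid, eq. (3) can be written directly as code (our sketch, with the byte sizes assumed in the text):

```python
def redundancy(nr, nc, mr, mc, q3, q2, q1, q0):
    # Eq. (3): LP and LC (1 + 4 bytes per entry) are replicated over the
    # 2^(q3+q2) combinations of template indices (r3, r2), and LT (1 byte
    # per entry) over the 2^(q1+q0) combinations of image indices (r1, r0).
    return 5 * nr * nc * 2**(q3 + q2) + mr * mc * 2**(q1 + q0)
```

For the 16-PE example of Section 3 (nr = 3, nc = mr = mc = 2; q3 = 1, q2 = 0, q1 = 1, q0 = 2) this gives 5*6*2 + 4*8 = 92 bytes.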
If we consider the case in which we overdimensionalize the local image matrix in order to eliminate statements 13-18 of PITM and speed up the execution time, the additional redundancy introduced in (3) will be ( n r ( m c - 1) + ( n c ( m r - 1))2 q3+q2. Figure 3 shows in a normalized logarithmic scale the complexities associated with the local computations ( O j ( ) ) and to the routing operations with ( O r ( ) ) and without (Ors()) overdimensionalizing the local image matrix as a function of the hypercube dimension. We have considered the following example: (a) an image of 1024 x 1024 pixels and a template of 64 x 64 pixels, (b) tf = 23.25 p.s and tc = 456.3 p~s, that are the calculation and routing times, respectively, o f the hypercube N C U B E 10 [10], and (c) we have assumed a partition of the hypercube q = (q3, q2, q~, qo) such as qi - qt <~ I, i = 0, 1, 2 and t = I, 2, 3. Or( ) is the plot that presents the greatest complexity. Eliminating the term m r m ¢ ( q o + q ~ ) in the complexity achieved by overdimensionalizing the local image matrix allows us to drastically reduce the complexity associated with the data routing, as it is illustrated in Fig. 3. The changes in the slope of Ors( ) are due to the linear dependency (q2 ÷ q3) of q2 and q3 (positive slope) compared to the dependency l(q0+ql) of q0 and q~. Obviously, O~( ) is independent of the overdimensionalization and presents an optimum dependency (lq) with respect to q. The analysis of Table 2 and Fig. 3 allows us to assert that PITM presents an optimum algorithmic complexity if statements 13-18 are eliminated by overdimensionalizing the local image matrix in the PE's. This overdimensionalization introduces an additional redundancy which is not important in many of the basic image processing operations. Furthermore, this redundancy can be reduced in many cases by placing the image, the template a n d / o r the matrix of cross-correlation coefficients in the control unit (host) of the hypercube. 6. 
6. Conclusions
Image processing is a common problem area in which a relatively small number of operations has to be carried out on a great number of data. Problems of this kind are well suited to be
Fig. 3. Algorithmic complexity versus q and the overdimensionalization of the image local matrix.
implemented on a SIMD parallel architecture, that is, one that exploits spatial parallelism, because doing so greatly reduces the number of data items each PE must process, substantially improving the global processing time. This last feature is very important, as the algorithmic solutions to most problems in the field of image processing require long processing times when executed on a sequential computer. One of the most important problems within this field is the calculation of cross-correlation coefficients, which is useful in many applications. We have presented in this paper an efficient parallel solution to the problem of image template matching, executable on SIMD hypercube machines, using the cross-correlation coefficient as the similarity measure. Nowadays there is a great variety of parallel architectures, many of them commercially available. Among distributed-memory parallel architectures, hypercube machines stand out; they are commercialized by several firms, for either SIMD or MIMD parallelism. These machines are very useful for parallel computing, mostly when spatial (SIMD) parallelism is to be exploited. Their main defect is that the commercial models available today present relatively high communication latency in relation to the time used in logic/arithmetic operations.
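For reference, the sequential "time domain" algorithm that PITM parallelizes (the four independent loops mentioned in the abstract) can be sketched as follows. This is a minimal unnormalized cross-correlation sum in plain Python; the function and variable names are ours, and the paper's actual similarity measure may include normalization terms not shown here.

```python
# Minimal sequential cross-correlation template matching: four nested
# loops, two over template placements and two over the template itself.

def cross_correlation(image, template):
    """Return the matrix of (unnormalized) cross-correlation coefficients
    for every placement of the template inside the image."""
    n_r, n_c = len(image), len(image[0])
    m_r, m_c = len(template), len(template[0])
    out = [[0] * (n_c - m_c + 1) for _ in range(n_r - m_r + 1)]
    for i in range(n_r - m_r + 1):        # row placement of the template
        for j in range(n_c - m_c + 1):    # column placement
            s = 0
            for u in range(m_r):          # scan the template window
                for v in range(m_c):
                    s += image[i + u][j + v] * template[u][v]
            out[i][j] = s
    return out

img = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
tpl = [[1, 0], [0, 1]]
print(cross_correlation(img, tpl))  # [[6, 8], [12, 14]]
```

The parallel algorithm distributes these four loops over the four subsets of hypercube dimensions (q_3, q_2, q_1, q_0) described earlier.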
References

[1] P.R. Cappello, "Gaussian elimination on a hypercube automaton", J. Parallel and Distr. Comput., Vol. 4, No. 3, May 1987, pp. 288-308.
[2] Z. Fang, X. Li and L.M. Ni, "Parallel algorithms for image template matching on hypercube SIMD computers", IEEE Trans. Pattern Anal. Mach. Intell., Vol. PAMI-9, No. 6, November 1987, pp. 835-841.
[3] G.C. Fox, S.W. Otto and A.J.G. Hey, "Matrix algorithms on a hypercube I: Matrix multiplication", J. Parallel Comput., Vol. 4, No. 1, January 1987, pp. 17-31.
[4] J.P. Hayes, T.N. Mudge, Q.F. Stout, S. Colley and J. Palmer, "Architecture of a hypercube supercomputer", Proc. 15th Internat. Conf. on Parallel Processing, Philadelphia, PA, 1986, pp. 653-660.
[5] W.D. Hillis, The Connection Machine, MIT Press, Cambridge, MA, 1985.
[6] T. Johnson and T. Durham, Parallel Processing: The Challenge of New Computer Architectures, Ovum Ltd Press, London, 1987.
[7] S.L. Johnsson, "Communication efficient basic linear algebra computations on hypercube architectures", J. Parallel and Distr. Comput., Vol. 4, No. 2, March 1987, pp. 133-172.
[8] V.P.K. Kumar and V. Krishnan, "Efficient image template matching on hypercube SIMD arrays", IEEE Trans. Pattern Anal. Mach. Intell., Vol. PAMI-11, No. 6, 1989, pp. 665-669.
[9] T.N. Mudge and T.S. Abdel-Rahman, "Vision algorithms for hypercube machines", J. Parallel and Distr. Comput., Vol. 4, No. 1, January 1987, pp. 79-94.
[10] NCUBE Handbook and Programmer's Manual, NCUBE, Beaverton, OR, 1988.
[11] O.G. Plata and E.L. Zapata, "ACLAN: Programmer's manual", Internal Rep., Dept. of Electronics, Faculty of Physics, University of Santiago de Compostela, December 1987.
[12] O.G. Plata, E.L. Zapata, F.F. Rivera and R. Peskin, "An array processing language for message-passing hypercubes", Parallel Computing 89, Leiden, The Netherlands, Elsevier, Amsterdam, 1990, pp. 455-460.
[13] F.F. Rivera, M.A. Ismail and E.L. Zapata, "Parallel squared error clustering on hypercube arrays", J. Parallel and Distr. Comput., Vol. 8, No. 3, March 1990, pp. 292-299.
[14] A. Rosenfeld and A.C. Kak, Digital Picture Processing, Vols. 1 and 2, Academic Press, New York, 1982.
[15] S. Sahni and S. Ranka, "Image template matching on SIMD hypercube multicomputers", Proc. 17th Internat. Conf. on Parallel Processing, Philadelphia, PA, 1988, pp. 86-91.
[16] C.L. Seitz, "The cosmic cube", Comm. ACM, Vol. 28, No. 1, January 1985, pp. 22-33.
[17] J.P. Secilla, N. Garcia and J.L. Carrascosa, "Template location in noisy pictures", Signal Processing, Vol. 14, No. 4, June 1988, pp. 347-361.
[18] P. Wiley, "A parallel architecture comes of age at last", IEEE Spectrum, Vol. 26, No. 3, March 1987, pp. 46-50.
[19] E.L. Zapata, F.F. Rivera, J.I. Benavides and J.M. Carazo, "Multidimensional fast Fourier transform into fixed size hypercubes", IEE Proceedings Part E: Computers and Digital Techniques, to be published.
[20] E.L. Zapata, F.F. Rivera, O.G. Plata and M.A. Ismail, "Parallel fuzzy clustering on fixed size hypercube SIMD computers", J. Parallel Comput., Vol. 13, No. 3, September 1989, pp. 291-303.
[21] E.L. Zapata, F.F. Rivera and O.G. Plata, "On the partition of algorithms into hypercubes", in: D.J. Evans, ed., Advances on Parallel Computing, JAI Press, London, 1990, pp. 149-171.