Implementation and experimental evaluation of the constrained ART algorithm on a multicomputer system

Implementation and experimental evaluation of the constrained ART algorithm on a multicomputer system

SIGNAL PROCESSING Signal Processing 51 (1996) 69-76 Implementation and experimental evaluation of the constrained ART algorithm on a multicomputer s...

706KB Sizes 0 Downloads 34 Views

SIGNAL PROCESSING Signal Processing

51 (1996) 69-76

Implementation and experimental evaluation of the constrained ART algorithm on a multicomputer system’ I. Garciaa,*, J. Rota”, J. Sanjurjob,

J.M. Carazoc,d,

E.L. Zapatad

aDepartamento de Arquitectura de Computadores y Electrbnica, Universidad de Almeria, 04120-Almeria, bDepartamenio de Electrbnica y Sistemas. Universidad de Santiago de Compostela, Spain “Unidad de Bioinformbtica, Centro National de Biotecnologia. CSIC, Spain dDepartamento de Arquitectura de Computadores, Universidad de Mirlaga, Spain Received 20 July 1995; revised 4 December

Spain

1995

Abstract

Truly three-dimensional reconstruction from projections can be carried out by the well-known ART (algebraic reconstruction technique) methods. In this work we present the implementation of an additive ART algorithm based on the theory of projections onto convex sets. It takes advantage of the high sparsity of the coefficient matrix allowing for an efficient parallelization. The solution has been conceived for a multiprocessor system using the SPMD (single program multiple data) programming model and our practical evaluation has been realized on an if360based multiprocessor system. Zusammenfassung

Eine echte dreidimensionale Rekonstruktion aus Projektionen kann man mit Hilfe der bekannten ART- (algebraisthen Rekonstruktionstechnik-) Methoden durchfiihren. In dieser Arbeit stellen wir die Realisierung eines additiven ARTAlgorithmus vor, der auf der Theorie der Projektionen auf konvexe Mengen (POCS) beruht. Er nutzt die aul3erst sparliche Besetzung der Koeffizientenmatrix aus, die eine wirksame Parallelisierung erlaubt. Die Losung ist fur ein Mehrprozessorsystem entwickelt worden, das ein SPMD-(Einzelprogramm-/Mehrfachdaten-) Programmiermodell benutzt, und unsere praktische Auswertung ist mit einem Mehrprozessorsystem auf i860-Grundlage realisiert worden.

La reconstruction tri-dimensionnelle a partir de projections peut Ctre faite a l’aide des methodes bien connues de Technique de Reconstruction Algebrique (TRA). Dans ce travail, nous presentons l’implantation d’un algorithme TRA additif, base sur la theorie des projections sur des ensembles convexes (PSEC). 11tire avantage du caractere tres creux de la matrice des coefficients, permettant une parallelisation efficace. La solution a Cte conGue pour un systeme multiprocesseurs utilisant le modele de programmation Programme Unique-Donnees Multiples (PUDM), et notre evaluation pratique a Cte realisee sur un systeme multiprocesseur base sur le i860. Keywords: Parallel algorithm; Multicomputer;

Reconstruction

algorithm; ART

‘This work is supported by the Ministry of Education of Spain CICYT TIC92-0942X03. especial CSIC/DGICYT para la puesta en marcha de Unidades Asociadas. *Corresponding author: Tel.: 34 50 215021; fax: 34 50 215070; e-mail: [email protected].

0165- 1684/96!$15.00 c> 1996 Elsevier Science B.V. All rights PI1 SOl65-1684(96)00032-l

reserved

PB91-0910,

BI095-0768

and Action

70

I. Garcia et al. / Signal Processing 51 (1996) 69-76

1. Introduction The image reconstruction from projections problem (IRP) arises in a wide variety of scientific areas such as tomography, electron microscopy or astronomy. This paper concerns with data from the transmission electron microscopy field and the results correspond to the chaperonin GroEL protein of E. coli. Nevertheless, our practical implementation and improvements are general and useful in all the scientific fields mentioned above. In its most general form, the three-dimensional (3D) reconstruction problem can be defined as follows. Starting from a projection set (2D data) g, determine the object f (3D) that produces these projections g. We are interested in finding the conditions under which g is adequate for the determination off, or at least of somefl close tofin some sense. Our objective is to deal with practical cases and, consequently, we address the fact that our set of projections g is contaminated by some type of noise and a finite number of projections must be used. Algebraically, the reconstruction problem can be modeled as the solution of a linear equation system of the form g=Hf+n,

(1)

where f is the unknown vector, g the projection data vector, H is the coefficient matrix and n represents the noise of the projection space. There exist a wide variety of methods to solve the problem posed by Eq. (1). In this work we will be concerned with algebraic reconstruction techniques (ART), and it constitutes a family of iterative procedures designed to efficiently solve large systems of linear equations [ 51. This paper has been organized as follows. In Section 2 the specific procedure we use in our implementation is described. It is an additive ART algorithm formulated within the framework of the theory of projections onto convex sets (POCS) that incorporates some a priori knowledge typical of our electron microscopy problem. The algorithmic complexity of ART is proportional to the size of the volume to be reconstructed and to the number of available projections in a particular problem. This high complexity makes ART attractive for its execution in a multiprocessor

system. We will take advantage of the sparsity of the matrix H to obtain an algorithmic formulation feasible of being efficiently parallelized. Thus, in Section 3 we describe a set of considerations intended to state the algorithmic formulation for an efficient and advisable parallel solution. Section 4 describes the parallel generalized algebraic reconstruction technique algorithm (PGART). The evaluation of PGART’s performance and the execution times for a multiprocessor system are presented in Section 5.

2. Algebraic reconstruction techniques method In the field of tomography, the first approximation to the solution of (1) was formulated by Gordon [Z], leading to the reconstruction method known as the ART. ART approaches the solution of system (1) in an iterative manner by means of the following algorithm: f

k+l

+k

+

Ak+’

bk+l

-

chk+l, f">l

ilhk+ll?

k=O,l,

. ..)

hk+l’

(2)

where f” is some initial value for the unknown vector f, 1 is the so-called relaxation parameter (0 < I < 2), gk is the kth element of the vector g and hk is the kth row of the matrix H. Both g and H are cyclic, i.e. hk = hk mod(M x N2) and gk = gk mod(M X N’). The convergence and performance of this general algorithm has been extensively studied by Herman (see [6] and references therein). In Censor [l] and Herman [S] an in-depth revision of this family of methods can be found. More recently, some works aimed at building a methodology for evaluating reconstruction from projections methods have appeared [9]. In this work a minor variation of the so-called ART3 [4] method is used. It is based on the theory of POCS [14] and on the notion offeasible solution [13] and we will call it in this paper ‘generalized ART’ (GART) [ll]. This approach takes the noise term n appearing in Eq. (1) explicitly into account, and at the same time it permits the natural introduction of multiple constraints within the framework provided by the POCS theory.

I. Garcia et al. / Signal Processing

The concept of feasible solution is easily derived from the relationship between two terms: the noise n and the residual. The residual is defined as r = g - Hf*, wheref* is a certain estimation of the original signal f (for example, the result of stopping an iterative process at the kth step). Since, it is true that, iff* =fthen Y= n, it is sensible to incorporate the statistical information about the noise it, defined in (1) through the residue Y [13]. Any solution f* verifying that the residue r tends to the noise n is considered to be a feasible solution for Eq. (1). In particular, the constraint of ‘outliers removal’, i.e. the elimination of those values of the residue rk in the kth step whose module 1rk 1is higher than a certain value, is a constraint which defines a closed and convex set {f: 1g - Hf 1< 61, whose associated projector [13] is given by Plfk =fz&,

h f” + %“+I ,,;kk;11,,2hx+~ - $,hkk;;,/2

1

ifrk+l>&

j-” + ~?+‘&hlrtl

+ 6++

k+l

k+l

(3)

if rk+i > -6, If”

where rk+i =,gk+r - (hk+l,fk) and 6 will be proportional to the variance of the noise n. The incorporation of some a priori knowledge of the upper and lower bounds of the true solution {fl(a Gf < b)) 1s a common constrain [l l] whose associated projector is determined by

Pzf =fbounds = 1 f

Ib

71

the kth step it is necessary to know the value of the elements of matrix H at the (k + 1)th row. Computations of the vector hk+ 1 can be as expensive as the GART algorithm itself. For this reason, we are going to distinguish between two terms in our algorithmic description: the calculus of the coefficients of the matrix H, and the GART algorithm itself as it is shown in (3) and (4). Several methods have been proposed for obtaining the matrix H, each representing a different model for explaining the contribution produced by the 3D object onto the projection space. Methods such as the square pixels [7] or non-overlapping cubic voxel [lo] are based on computing the length of a ray across a voxel while other kind of methods make use of different interpolation schemes. Among the later ones we can count the one due to Lewitt [S] based on overlapping spherical elements, or blobs, and the so-called forward methods reported in [7]. Additionally, three different types of interpolation can be used: nearest neighbor, bilinear and modified bicubic spline. In this work we use the forward method with bilinear interpolation in order to calculate the hk,j elements of matrix H. The value of the coefficient reflects influence of voxel j (with coordinates (xj, yj, Zj) for (O
otherwise,

Ia

51 (1996) 69% 76

cos Qrn sin @, cos 0,

if f
a d f < b,

if

b cf.

- sin @, cos Qi, cos 0,

0 sin 0,

(4)

3. Algorithmic formulation of generalized ART Although Eqs. (2)-(4) show the mathematical formulation of the GART method, from a computational point of view, it is convenient to explain further several details which can help to write an algorithmic formulation for GART. Looking at Eq. (2) it is clear that in order to update the vector f at

(5) Let xk = L ukj, yk = L ok], &, = uk - xk and sy = ok - y, (0 < a,, sy < l), then the four corresponding elements of matrix H take the values given by

h,,j = (1 - 6,) (1 - ay), hk+ 1.j = ~,(l - a>!), hk+N,j = (1 - sJa,,

hk+N+r.j = ax&y,

(6) which correspond to the four pixels in the neighborhood of (uk, ok) in the projection plane.

72

I. Garcia et al. / Signal Processing 51 (1996) 69-76

The forward method provides a set of geometric and algebraic properties to the matrix H that can be used in order to exploit the parallelism and optimize the partition projection of GART onto a multiprocessor. We are now in a position to describe these properties and use them. For a system made up of an N3 voxel volume, projected onto N2 pixels in M projection directions, a sparse matrix H with N3 columns and M x N2 rows is obtained. This system of linear equations can be considered as M equation subsystems, each consisting of N3 columns and N2 rows, just like the one represented in Fig. 2(a). The N3 columns of each of the subsystems have only four non-null elements in positions h,,j, hk+ l,j, h k+N,j? hk+N+ 1,j, where k iS the first non-zero element in columnj. This geometry allows us to recognize four equation subsystems, which verify that all the equations of the same subset are orthogonal. Consequently, applying an adequate permutation of the rows of matrix H, the equation system can be defined as 4 x M subsystems of N2/4 equations each. Matrix H is transformed into the one represented in Fig. 2(b). At the level of possible data dependencies, each one of these subsystems is totally independent and, consequently, an essentially sequential algorithm of M x N2 steps becomes a parallel algorithm. Using this new structure of the matrix, the parallel GART can be described as follows. Begin Parallel GART while convergency conditions are met For each projection m (0 d m < M)

For 0 d t d 3 if t = 0: Compute hf,j; 0 < k < N2 and 0
yk =

1;:

1

hi,jfj,

(N2/4)t < k < (N2/4)(t + 1) Update f (Eq. (3)): ‘f. + Akbk

-

(gk

-

yk)

j

hf-$-$

iihkii2 if

hk

ykl

J

>

6

jj=G/,+ikCq;h;1hf+c5$+ k if

(gk

-

yk)

k <

-

6

otherwise and (Eq. (4)):

with (N2/4)t < k < (N2/4)(t + 1) and

O
I. Garcia et al. / Signal Processing 51 (1996) 69-76

distance N2/4, i.e. a dependence between equations k and k + N2/4, as the computation of coefficients and the inner loop t can be executed in parallel over the subset of N2 and N2/4 equations, respectively. On the other hand, the convergence of the PGART algorithm is accelerated with the use of the permuted H matrix due to the orthogonality of the equations that are consecutively applied [lo]. In order to verify the robustness of the PGART algorithm experimentally, a 3D reconstruction of the chaperonin GroEL protein of E. coli was done, using projections from transmission electron microscopy, obtaining valuable structural information on this biological specimen as shown in Fig. 1.

13

4. Parallel algorithm We will center the description of the parallel implementation of PGART based on the data dependencies described in Section 3. As we show in Figs. 2(b) and 2(c), each of the sparse submatrices can be represented as a vector. Fig. 2(c) is a vector representation of the matrix in Fig. 2(b), in which the fillings that appear as elements of the vector indicate the row of the matrix where values are different from zero. This can be done because the distribution of zeros is such that in each column there is a maximum of one non-null element. By means of this arrangement, the execution of a step

(a)

(b)

Fig. 2. Representation of the coefficient matrix into a dense vector.

matrix for N = 4, A4 = 1: (a) original

matrix;(b)

permuted

matrix; (c) transformation

of a sparse

14

I. Garcia et al. / Signal Processing 51 (1996) 69-76

in the loop ‘For each projection’, which is performed for indices k and j of the matrix, can be reduced to a single loop over index j. Thus, we have translated the sparse matrix-vector product into a dense vector-vector product, whose result is a subvector y of N2/4 components. This same situation is the one we find in updating the value offin PGART, which is again the sparse matrix-vector product, but in this case we obtain a vectorfof N3 components. Consequently, the complexity is N3, and the partition of the algorithm on a multiprocessor system is carried out over the N3 columns of matrix H. In order to define the data distribution among processors, the sparsity of the vectors in Fig. 2(c) has been taken into account. This sparsity is due to the fact that the algorithm must only work over those elements of the volume that project onto the projection space. At this point, we are going to show the strategy used in the distribution of the data in order to optimize the work load balance and the execution time, as well as to minimize the data redundance in the multiprocessor system. For a multiprocessor system with P PEs and a 3D data geometry, which is the reference for each element of the volume, it seems convenient to define an analogous structure for referencing the PEs. This leads to a correspondence between the jth element of the volume, at coordinates Xj, yj, zj, and those of the processor in which it will be manipulated, P,, P,, P,. However, it must be pointed out that the PEs need not be physically placed on a 3D mesh, but that the topology of interconnection can be arbitrary. In principle, we have started from a general data distribution, of the block cyclic type [3], in which the block size will be limited by N. This can be expressed by the relation bi < N/Pi (i = {x, y, z>), where (b,, b,, b,) represents the size of the 3D block. Our experiments confirmed that the data distribution that optimizes the load balance is the one with block size bi = 1, i.e. a pure cyclic distribution. With this data distribution the jth voxel, with coordinates (xj, yj, zj), will be processed in processor (P,, P,, PJ, where P, = Xj mod P,, P, = yj mod P, and P, = zj mod P,. In PGART interprocessor communications must be introduced before updating the vector f: Each

processor has to send its local values of vector y[N2/4], which will then be added to the respective partial vectors in the processors that receive the message. The cost of this function is proportional to the amount of data to be transferred (N2/4). This is the only penalty of the parallel implementation.

5. Evaluation This algorithm has been executed on a Meiko Computer Surface multiprocessor system, based on the Intel i860, at the Edinburgh Parallel Computing Centre. It is a distributed memory MIMD machine based on a message passing communication model. Data represented in the time and performance tables correspond to real executions measured on this machine for two values of N (N = 64, 128) and values of P between 1 and 16 PEs. The implementation of the algorithm has been carried out with the CHIMP and PUL software support developed at the University of Edinburgh

WI. Results for the main parameters characterizing the parallel implementation, such as memory requirements, algorithmic complexity, execution times and efficiency, as well as the important problem of load balance will be described. For a system with P PEs, the local memory is 12 x (N3/P) + (M + 2) x N2, where the dominant term is that corresponding to the space taken up by the coefficient matrix and the volume. The only data redundancy in our implementation is due to the storage of all the projections in all the processors. For a single iteration, the computational complexity is 0(4 x M x N3/P + M x N2 log, P). Two different terms can be clearly appreciated. One of them, due to communications, conditions the performance of the parallel algorithm. In an iteration, the number of data items to be exchanged is N2, interprocessor communications are of the all-to-all type, and it will be carried out in log, P steps, as specified in the PUL-EM library [12]. The other term is arithmetic and proportional to the size of the local volume in each PE. Consequently, its complexity is inversely proportional to the number of PEs. In those conditions, a decrease of the efficiency of the parallel implementation with an

I. Garcia et al. 1 Signal Processing 51 (1996) 69- 76

75

Table 1 Execution times of the parallel algorithm for a single iteration and M = 1 versus P and N Communication times (ms)

N = 64 N=128

P=l

P=2

P=4

P=8

P=

0 0

40.2 151.2

90.6 412.5

180.9 803.8

592.9 2090.7

9314.5 69241

4652.4 35008

2472.3 18318

1758.5 10873

P=8 0.99

P = 16 0.95

0.99

0.99

0.94 0.95

0.66 0.80

16

Execution total times N= 64 N = 128

18559 13848

Table 2 Work load balance and efficiency versus P and N Work load balance

N = 64

N=128

P=l

P=2

P=4

1.0 1.0

1.0 1.0

1.0 1.0

1.0 1.0

1.0 1.0

1.0 1.0

Efficiency N= 64 N= 128

increase of P should be expected. This decrease will depend on the ratio between communication and calculation times. A good use of the system can still be guaranteed if P < N/4 as shown in Tables 1 and 2. Data in Table 1 show communication and total execution times of the algorithm for a single iteration and a single projection (M = 1) as a function of the value of P. The communication time increases with the number of PEs, but the total execution time decreases with P. For example, with N = 64 and P = 4 communication time is a very small fraction of the total execution time, but for P = 16 this ratio is increased to 0.34. Table 2 is a similar representation to the one in Table 1, but in this case for the values of the efficiency and the work load balance. The efficiency is always over 0.80 for N = 128, being higher than 0.9 if the number of processors is less or equal to 8. Measurements of the work load balance were obtained as the ratio (t,,, - tmiJ/tmax, t,,, being the

execution time of the algorithm for the slowest processor and tmin for the fastest PE. The best results are obtained for those situations in which the computational load is heavier with respect to communications. Even though the specific results depend on the multiprocessor system employed, the values we have obtained are satisfactory, and in every case are over 94%. Therefore, the fact that the efficiency decreases with P is not a consequence of a load unbalance problem, but is due to the logical increase of the communication time with P.

6. Conclusions We have efficiently parallelized a generalization of the ART algorithm called GART using a message passing model on an i860 based system with an SIMD programming model and distributed memory. We have proposed a permutation of the coefficient matrix that reduces the data dependence

16

I. Garcia et al. / Signal Processing 51 (1996) 69-76

appearing in the sequential algorithm, and we have also found a solution for reducing the size of the local memory needed. We have defined a distribution of the data among processors which eliminates a possible load imbalance. We have obtained good performances for the parallel implementation in all cases. Summarizing, in this work we have proved that the GART algorithm, whose main drawback is the large amount of calculation time and memory it requires, can be implemented in an efficient manner on a multiprocessor system. References [l] Y. Censor, “Finite series-expansion reconstruction methods”, Proc. IEEE, Vol. 3, 1983, pp. 409419. [2] R. Gordon, R. Bender and G.T. Herman, “Algebraic reconstruction techniques (ART) for the three-dimensional electron microscopy and X-ray photography”, J. Theor. Biol., Vol. 29, 1970, pp. 471481. [3] A. Gupta and P. Banerjee, “Demonstration of automatic data partitioning techniques for parallelizing compilers on multicomputers”, IEEE Trans. Parallel and Distributed Systems, Vol. 3, No. 2, 1992, pp. 179-193. [4] G.T. Herman, “A relaxation method for reconstructing object from noise X-rays”, Math. Programming, Vol. 8, 1975, pp. 1-19. [S] G.T. Herman, Image Reconstructionfrom Projections: The Fundamentals of Computerized Tomography, Academic Press, New York, 1980.

C’31G.T. Herman and L.B. Meyer, “Algebraic reconstruction

techniques can be made computationally efficient”, IEEE Trans. Imaging, Vol. MI-12, No. 3, 1993, pp. 60&609. c71P.M. Joseph, “An improved algorithm for reprojecting rays through pixel images”, IEEE Trans. Med. Imaging, Vol. 1, No. 3, 1983, pp. 192-196. I?1 R.M. Lewitt, “Alternatives to voxels for images representation in iterative reconstruction algorithms”, Phys. Med. Biol., Vol. 37, No. 3, 1992, pp. 705-716. 191S. Matej, G.T. Herman, T.K. Narayan, S.S. Furuie, R.M. Lewit and P.E. Kinahan,” Evaluation of taskoriented performance of several fully 3D PET reconstruction algorithms”, Phys. Med. Biol., Vol. 39, 1994, pp. 355-367. Cl01A. Niu, C.Y. Han and W.G. Wee, “A new three-dimensional reconstruction method using algebraic reconstruction techniques”, J. X-Ray Sci. Technol., Vol. 2, 1990, pp. 95-l 16. Cl11J. Sanjurjo, E.L. Zapata, I. Garcia, J. Rota and J.M. Carazo, “Convex constraints on the residual and 3D reconstruction from projections: Proposal of a general ART”, in: F. Casacuberta and A. Sanfeliu, eds., Advances in Pattern Recognition, World Scientific, Singapore, 1994, pp. 146157. Cl21S. Trewin, Pul-em prototype user guide, Tech. Rep. EPCCKTP-PUL-EM-PROT-UG, Edinburgh Parallel Computing Centre, The University of Edinburgh, 1992. P31 H.J. Trussel and M.R. Civanlar, “The feasible solution in signal restoration”, IEEE Trans. Acoust. Speech Signal Process., Vol. 32, No. 2, 1984, pp. 201-212. Cl41DC. Youla and H. Webb, “Image restoration by the method of convex projections”, IEEE Trans. Med. Imaging, Vol. 1, No. 2, 1982, pp. 81-94.