Parallel Computing 21 (1995) 649-668
Practical aspects and experiences
Parallel processing for robot dynamics computations
Albert Y. Zomaya *
Parallel Computing Research Group, Department of Electrical and Electronic Engineering, The University of Western Australia, Nedlands, Perth, Western Australia 6907, Australia
Received 16 November 1993; revised 15 November 1994
Abstract

The computation of the dynamics plays a major role in many real-time robotic applications. The dynamics consist of a set of complex formulations that need to be computed at high sampling rates (i.e. hundreds or thousands of times per second) to facilitate real-time applications. This paper presents an inexpensive and efficient solution to this problem which employs networks of parallel processors. The dynamic equations are first functionally decomposed into a set of simple computational tasks. Then, the resulting tasks are distributed, via a suitable scheduling scheme, onto networks of parallel processors. The networks are constructed using general purpose T800 transputer chips. The speed and efficiency of the proposed method are demonstrated by a case study.

Keywords: Robotics; Parallel processing; Transputers; Multiprocessing; Scheduling
1. Introduction

There is a real demand in many robotics applications for higher operational speeds, and a solution enabling this would have clear economic benefits in terms of improving manufacturing efficiency. In general, a fully operational robotic system running in real-time requires the repeated execution of a variety of complex algorithms. Most of these algorithms, if not all, need to be computed within milliseconds (msec) or microseconds (μsec) in order to meet severe real-time constraints. Such algorithms require massive computing power that surpasses the
* Email: [email protected]
capabilities of many sequential computers. Also, for robotics applications severe limitations must be imposed on the size, weight, and power consumption levels of the computing systems [18]. An important class of robotic problems that requires massive computing power is dynamics-based computations, such as control and trajectory planning algorithms. These algorithms require the computation of the dynamics at high sampling rates (i.e. hundreds or thousands of times per second) in order to avoid poor performance [20]. Thus, methods for the fast and inexpensive computation of the dynamics would facilitate the implementation of a wide variety of algorithms. This fact was also noted by Brady [3], who outlined some thirty problems that need to be resolved to advance the state of robotics science. One of the problems that was highlighted is the development of multiprocessor systems for robotics computations. The formulation of computationally efficient dynamic models has been an active area of research for the last two decades and several methods have been developed by researchers (see [16]). Of these methods, the most commonly used are the Lagrange-Euler (LE) and the Newton-Euler (NE) formulations. The LE technique has a high computational complexity of order O(n⁴), where n is the number of links of the robot arm, but it is a well-structured and systematic formulation. On the other hand, the NE formulation is computationally efficient, of order O(n), but with a set of highly recursive equations. In this work the NE formulation is used to develop a parallel implementation to compute robot dynamics. The paper is organized as follows. Section 2 gives an overview of parallel processing and dynamics computations. Section 3 presents a description of the problem of computing robot dynamics. A solution to the problem is proposed in Section 4, followed by a case study in Section 5. Conclusions are given in Section 6.
2. Parallel robot dynamics computations: An overview

Several parallel algorithms have been proposed by researchers to compute the dynamics [17]. The basic idea is to divide the dynamics into several smaller computational tasks. Then, these tasks should be distributed optimally onto a given number of processors to achieve the fastest execution time. For practical considerations, the number of processors involved should be kept as small as possible if a cost-effective solution is to be found (see [6] and [18] and their references). The efficient distribution of these tasks onto a network of processors is performed through scheduling. The scheduling problem is solved by the optimal distribution of m tasks over p processors, while minimizing the cost involved. In this case, the cost represents the overall execution time of the algorithm, including the communication time between the different processors [8]. In general, the scheduling problem is considered to be extremely "formidable" to solve [5,7], which is further complicated by the varying execution times of the different tasks
and the arbitrary number of application processors involved [18]. Therefore, such a problem has been classified as NP-complete. The parallel computation of robot dynamics is also known to be NP-complete; that is, finding optimal task schedules for the computation of robot dynamics is, in general, computationally intractable. Several parallel algorithms for the computation of robot dynamics are highlighted below. Luh and Lin [15] introduced a technique based on a generalization of the branch-and-bound algorithm, which exhibits several significant limitations. Most importantly, their proposed method does not fully consider the recursive structure of the (NE) and the sequential dependencies of the algorithm. Furthermore, the system suffers from load imbalances because some of the processors are poorly utilized, and the interprocessor communication and synchronization of the (NE) are ignored. Lathrop [12] proposed two parallel algorithms using special purpose processors. The first is a linear parallel algorithm which is related to the method in [15]. The second is an algorithm based on the partial sum technique. Both approaches require massive buffering, which degrades the performance and causes complicated intertask communication structures, and hence no practical implementation has been made. Kasahara and Narita [11] proposed a parallel processing scheme which employs two scheduling algorithms: depth-first/implicit-heuristic-search and critical-path/most-immediate-successors-first. The algorithm was implemented on an actual multiprocessor system to prove its effectiveness. Lee and Chang [14] introduced a method based on the recursive doubling algorithm with a modified inverse perfect shuffle interconnection scheme between a set of parallel processors. Their approach may not be cost efficient and fault-tolerant due to the complex and expensive interconnection structure among the processors. Chen et al. [4] presented an approach based on the A* algorithm to develop the dynamical highest level first/most immediate successors first (DHLF/MISF) methods. The efficiency of the work was only demonstrated by simulation results. Further, Barhen [2] divided the dynamics of a 6-link robot arm into 66 computational tasks. These tasks were embedded into a concurrent computation ensemble (hypercube). Two modes of operation were presented: in the sequential mode all computations take place on the Intel 80286/80287 processors residing on the NCUBE peripheral subsystem, while in the concurrent mode the equations (tasks) are solved on hypercube nodes. For a more detailed survey the reader is referred to [18]. Most of the previous attempts did not involve any testing of the results on a parallel processing system. Results are usually presented in terms of the number of multiplications/additions and their theoretical equivalent of processor clock cycles. The results obtained in this work are the outcome of an implementation of the algorithm on a multiprocessor system. Hence, these results not only represent the processing time of multiplications/additions but also the delays caused by the communication between different processors and the limitations imposed by the hardware and software components of the computing system.
3. Robot dynamics formalism

An open chain robot mechanism consists of a chain of (n + 1) rigid links. The links (Fig. 1) are arranged such that link (i) is connected to a preceding link (i − 1) and a following link (i + 1). In the case of robot manipulators, two types of joints exist: translational and revolute joints. The translational joints are such that the adjacent links translate linearly with respect to each other along the joint axis, while the revolute joints allow adjacent links to rotate with respect to each other about the joint axis. Therefore, the motion of link (i) with reference to link (i − 1) depends on only one variable, a rotation θ_i or a translation d_i. Generally, the robot base is considered to be link (0). The last link (n) carries a gripper (hand) or a tool (drill, pincer) and is called the end effector of the robot. The location of an object in space is determined by six degrees of freedom (dof), three of which represent position and the other three orientation. If a task is performed in space without constraints, 6 dof are necessary; but if the task is performed in a plane, only 3 dof are necessary. A typical robot manipulator arm has 6 dof.

3.1. Kinematic frame assignment

To model a robot arm, coordinate frames need to be attached to each link with the z-axis directed along the joint axis. The assignment process produces four parameters: the rotation angle θ_i, the distance d_i, the link length a_i, and the twist angle α_i (see Fig. 1). The link parameters of a PUMA-560 robot arm are given in Table 1. From these parameters an orthogonal rotation matrix can be formed which transforms a vector in the (i − 1)th coordinate frame to a coordinate frame which is parallel to the (i)th coordinate frame:

A_i = \begin{bmatrix} \cos\theta_i & -\sin\theta_i\cos\alpha_i & \sin\theta_i\sin\alpha_i \\ \sin\theta_i & \cos\theta_i\cos\alpha_i & -\cos\theta_i\sin\alpha_i \\ 0 & \sin\alpha_i & \cos\alpha_i \end{bmatrix}   (1)
Fig. 1. Link organization for a typical robotic arm.
Table 1
Link parameters between the different frames of a PUMA 560-like manipulator

Link    θ       a       d       α
1       θ_1     0       0       -90
2       θ_2     a_2     d_2     0
3       θ_3     a_3     0       90
4       θ_4     0       d_4     -90
5       θ_5     0       0       90
6       θ_6     0       d_6     0
and the position of the (i)th coordinate frame with respect to the (i − 1)th coordinate frame is given by:

p_i = \begin{bmatrix} a_i \\ d_i\sin\alpha_i \\ d_i\cos\alpha_i \end{bmatrix}   (2)
For a revolute joint, θ_i changes while d_i, a_i, and α_i remain constant. For a translational joint, d_i changes while a_i, θ_i, and α_i are constant. To achieve the transformation between the different coordinate frames of an n-dof robot arm, a matrix T_n is defined such that:

T_n = A_1 A_2 \cdots A_n   (3)
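Purely as an illustration of Eqs. (1)-(3), the sketch below builds the rotation matrix A_i, the position vector p_i, and the composite rotation T_n from the link parameters of Table 1. It assumes NumPy and angles in radians, and covers only the rotational part of the chain transformation; all names are illustrative and not taken from the paper.

```python
import numpy as np

def link_rotation(theta, alpha):
    """Rotation matrix A_i of Eq. (1) for one link (angles in radians)."""
    ct, st = np.cos(theta), np.sin(theta)
    ca, sa = np.cos(alpha), np.sin(alpha)
    return np.array([[ct, -st * ca,  st * sa],
                     [st,  ct * ca, -ct * sa],
                     [0.0,      sa,       ca]])

def link_position(a, d, alpha):
    """Position vector p_i of Eq. (2), expressed in the ith frame."""
    return np.array([a, d * np.sin(alpha), d * np.cos(alpha)])

def chain_rotation(thetas, alphas):
    """Composite rotation T_n = A_1 A_2 ... A_n of Eq. (3)."""
    T = np.eye(3)
    for theta, alpha in zip(thetas, alphas):
        T = T @ link_rotation(theta, alpha)
    return T

# Example: first link of the PUMA-560-like arm of Table 1 (alpha_1 = -90 deg)
A1 = link_rotation(np.deg2rad(30.0), np.deg2rad(-90.0))
```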
Eqs. (1)-(3) are used to describe the kinematic behaviour of a robot arm.

3.2. The dynamics of robot motion

By using the (NE) formulation, the dynamic problem can be stated as follows: given the input vectors of joint positions θ(t_0), joint velocities θ̇(t_0), and joint accelerations θ̈(t_0) at any instant of time (t_0), calculate the applied force/torque τ(t_0). Note that this problem is known as the inverse dynamics problem in the robotics literature [18]. Here the term dynamics is used for convenience. The dynamic equations of motion of a robot arm can be written as:

\tau(t) = D(\theta)\ddot{\theta}(t) + C(\theta, \dot{\theta}) + H(\theta)   (4)
where τ(t) is an n × 1 applied force/torque vector for the joint actuators; θ(t), θ̇(t), and θ̈(t) are n × 1 vectors representing position, velocity and acceleration, respectively; D(θ) is an n × n effective and coupling inertia matrix; C(θ, θ̇) is an n × 1 Coriolis and centripetal effects vector; and H(θ) is an n × 1 gravitational force vector, where n is the number of degrees of freedom. The evaluation of Eq. (4) is computationally demanding and has until recently posed a major bottleneck in real-time robotics applications. The dynamic algorithm for a robot arm can be divided into two recursive phases: first, the forward iteration propagates kinematic information, such as
angular velocities, angular accelerations, and linear accelerations at the centre of mass of each link, from the inertial coordinate frame to the hand coordinate frame. Second, the backward iteration propagates the forces and moments exerted on each link from the end-effector of the manipulator to the base reference frame.

Initialization:
z_0 = (0\ 0\ 1)^T
\omega_0 = \dot{\omega}_0 = 0, \quad \dot{v}_0 = g z_0, \quad g = 9.81 m/s^2
f_{n+1} = force exerted at the end-effector
n_{n+1} = moment exerted at the end-effector

Phase 1 (forward iteration): For i = 0, ..., n − 1 do;
\omega_{i+1} = A_{i+1}^T [\omega_i + z_0 \dot{\theta}_{i+1}]   (5)
\dot{\omega}_{i+1} = A_{i+1}^T [\dot{\omega}_i + z_0 \ddot{\theta}_{i+1} + \omega_i \times (z_0 \dot{\theta}_{i+1})]   (6)
\dot{v}_{i+1} = A_{i+1}^T [\dot{v}_i] + \dot{\omega}_{i+1} \times p_{i+1} + \omega_{i+1} \times (\omega_{i+1} \times p_{i+1})   (7)
\dot{\bar{v}}_{i+1} = \dot{\omega}_{i+1} \times s_{i+1} + \omega_{i+1} \times (\omega_{i+1} \times s_{i+1}) + \dot{v}_{i+1}   (8)
F_{i+1} = m_{i+1} \dot{\bar{v}}_{i+1}   (9)
N_{i+1} = J_{i+1} \dot{\omega}_{i+1} + \omega_{i+1} \times (J_{i+1} \omega_{i+1})   (10)
End (Phase 1);

Phase 2 (backward iteration): For i = n, ..., 1 do;
f_i = A_{i+1} [f_{i+1}] + F_i   (11)
n_i = A_{i+1} [n_{i+1}] + p_i \times f_i + N_i + s_i \times F_i   (12)
\tau_i(t) = n_i^T (A_i^T z_0)   (13)
End (Phase 2);
where A_i and p_i: transformation matrix and position vector of the ith coordinate frame, respectively; ω_i and ω̇_i: angular velocity and angular acceleration of the ith coordinate frame, respectively; v̇_i: linear acceleration of the ith coordinate frame; v̄̇_i: linear acceleration of the centre of mass of link (i); F_i and N_i: net force and moment exerted on link (i), respectively;
Table 2
Sequential (NE) computational cost

Equation              Multiplications   Additions/Subtractions
A_i matrices          4n                0
p_i vectors           2n                0
ω_{i+1}  (Eq. 5)      12n               9n
ω̇_{i+1}  (Eq. 6)      18n               15n
v̇_{i+1}  (Eq. 7)      27n               21n
v̄̇_{i+1}  (Eq. 8)      18n               15n
F_{i+1}  (Eq. 9)      3n                0
N_{i+1}  (Eq. 10)     24n               18n
f_i  (Eq. 11)         9n                9n
n_i  (Eq. 12)         21n               21n
τ_i  (Eq. 13)         12n               8n
TOTAL                 150n              116n
n = 6 DOF             900               696
f_i and n_i: force and moment exerted on link (i) by link (i − 1), respectively; s_i: position of the centre of mass of link (i); J_i: inertia tensor of link (i) about its centre of mass; m_i: mass of link (i). In the previous equations, all the matrices and vectors are of size (3 × 3) and (3 × 1), respectively. The computational requirements for the previous general purpose (NE) algorithm are listed in Table 2. It can be seen from Table 2 that it takes 150n multiplications and 116n additions to compute the dynamics of an n-dof robot arm based on the (NE). It is important to note that the dynamic equations must be computed several times during the control cycle to provide a source of look-ahead information.
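As a concrete illustration of Eqs. (5)-(13), the following is a minimal sequential sketch of the two-phase recursion, assuming an all-revolute arm and per-link data (A_i, p_i, s_i, m_i, J_i) supplied as NumPy arrays. The function name, argument layout and zero-based indexing are illustrative only; this is not the paper's transputer code.

```python
import numpy as np

def newton_euler(A, p, s, m, J, qd, qdd, g=9.81,
                 f_ext=np.zeros(3), n_ext=np.zeros(3)):
    """Sequential Newton-Euler inverse dynamics, Eqs. (5)-(13), revolute joints.
    A[i] rotates frame i+1 into frame i; p[i], s[i], m[i], J[i] are the link
    quantities defined above; qd, qdd are joint velocities and accelerations."""
    n = len(m)
    z0 = np.array([0.0, 0.0, 1.0])
    w, wd = np.zeros(3), np.zeros(3)       # omega_0, omega_dot_0
    vd = g * z0                            # v_dot_0 = g z_0 (gravity at the base)
    F, N = [], []

    # Phase 1: forward iteration (base to end-effector)
    for i in range(n):
        R = A[i].T
        w_new = R @ (w + z0 * qd[i])                                  # Eq. (5)
        wd_new = R @ (wd + z0 * qdd[i] + np.cross(w, z0 * qd[i]))     # Eq. (6)
        vd = R @ vd + np.cross(wd_new, p[i]) \
             + np.cross(w_new, np.cross(w_new, p[i]))                 # Eq. (7)
        vdc = np.cross(wd_new, s[i]) \
              + np.cross(w_new, np.cross(w_new, s[i])) + vd           # Eq. (8)
        F.append(m[i] * vdc)                                          # Eq. (9)
        N.append(J[i] @ wd_new + np.cross(w_new, J[i] @ w_new))       # Eq. (10)
        w, wd = w_new, wd_new

    # Phase 2: backward iteration (end-effector to base)
    f, nm = f_ext, n_ext
    tau = np.zeros(n)
    for i in reversed(range(n)):
        R_next = A[i + 1] if i + 1 < n else np.eye(3)  # identity at the tool frame
        f_i = R_next @ f + F[i]                                       # Eq. (11)
        nm = R_next @ nm + np.cross(p[i], f_i) + N[i] \
             + np.cross(s[i], F[i])                                   # Eq. (12)
        tau[i] = nm @ (A[i].T @ z0)                                   # Eq. (13)
        f = f_i
    return tau
```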
4. The proposed method

Parallelism is achieved by distributing a given task over a number of processors, ideally in such a way that all the processors are fully utilized. As a consequence, highly parallel structures have evolved, and many have been built to meet the increasing demand for more computing power and higher processing speed [9,13]. When solving a problem of a parallelizable nature, the programmer's objective is to identify and exploit as many forms of parallelism within the problem as possible. Broadly, three major programming paradigms can be identified [10]. The first is the processor farm, where independent tasks are farmed out by a 'master' processor to a set of 'slave' or 'worker' processors. The second is the geometric paradigm, where the structure of the data is such that the problem can be distributed in some convenient manner amongst processors. Finally, the algorithmic
paradigm consists of exploiting the parallelism that exists within the tasks that make up the overall algorithm. This section describes an algorithm for the parallel computation of robot dynamics. The dynamic equations are functionally decomposed into a set of tasks (processes). The tasks represent the different terms of Eqs. (5)-(13). Of course, the objective of this work is to find a cost-effective architecture by distributing the algorithm onto simple tree-structured networks of processors. The algorithmic paradigm is naturally applicable in this case. This paradigm is motivated by the number of tasks that can be executed independently in the proposed algorithm. The software portions running on the processors are replicas of one another, with minor modifications depending on the task and the communication paths.

4.1. Reorganizing the dynamic equations
An efficient architecture would be one based on the decomposition of the computations into functional subsets of a more elementary nature, such as the computation of basic matrix and vector operations [19]. When solving a computational problem, such as the dynamics, parallelism can be exploited at different levels, which can be summarised as follows: (1) inter-chain parallelism: parallelism existing among computational chains, that is, among independent equations. For example, Eqs. (9) and (10) can be executed concurrently with no data dependency. (2) nodal-parallelism: involves the parallel execution of primitive vector arithmetic operations within the ith iteration of a given computational chain. For example, in Eq. (7), the computation of \dot{\omega}_{i+1} \times p_{i+1} and \omega_{i+1} \times (\omega_{i+1} \times p_{i+1}) can be carried out independently. (3) operational-parallelism: obtained at the operand level. For example, [a + b + c + d] can be computed as (a + b) and (c + d) in parallel, followed by [sum_1 +
sum_2]. Due to the expensive nature and cost-ineffectiveness of operational-parallelism for large-sized problems, only inter-chain and nodal-parallelism are employed in this work. The dynamics need to be divided into several computationally simpler entities. This applies to both phases of the algorithm, that is, the forward and the backward iterations (see Section 3.2). A task is represented as T_i^{phase}, where i and phase denote the task number and the phase (forward or backward), respectively.

Phase 1 (forward iteration): For i = 0, ..., n − 1 do;
T_1^1 : A_i = \begin{bmatrix} \cos\theta_i & -\sin\theta_i\cos\alpha_i & \sin\theta_i\sin\alpha_i \\ \sin\theta_i & \cos\theta_i\cos\alpha_i & -\cos\theta_i\sin\alpha_i \\ 0 & \sin\alpha_i & \cos\alpha_i \end{bmatrix}   (14)
T_2^1 : p_i = \begin{bmatrix} a_i \\ d_i\sin\alpha_i \\ d_i\cos\alpha_i \end{bmatrix}   (15)
T_3^1 : \dot{\omega}_i + z_0 \ddot{\theta}_{i+1} + \omega_i \times (z_0 \dot{\theta}_{i+1})   (16)
T_4^1 : \omega_i + z_0 \dot{\theta}_{i+1}   (17)
T_5^1 : \omega_{i+1} = (T_1^1)^T \times T_4^1   (18)
T_6^1 : \dot{\omega}_{i+1} = (T_1^1)^T \times T_3^1   (19)
T_7^1 : (T_1^1)^T \times \dot{v}_i   (20)
T_8^1 : \dot{v}_{i+1} = T_7^1 + \dot{\omega}_{i+1} \times p_{i+1} + \omega_{i+1} \times (\omega_{i+1} \times p_{i+1})   (21)
T_9^1 : \dot{\omega}_{i+1} \times s_{i+1} + \omega_{i+1} \times (\omega_{i+1} \times s_{i+1})   (22)
T_{10}^1 : N_{i+1} = J_{i+1}\dot{\omega}_{i+1} + \omega_{i+1} \times (J_{i+1}\omega_{i+1})   (23)
T_{11}^1 : \dot{\bar{v}}_{i+1} = T_9^1 + T_8^1   (24)
T_{12}^1 : F_{i+1} = m_{i+1} \times T_{11}^1   (25)
End (Phase 1);

Phase 2 (backward iteration): For i = n, ..., 1 do;
T_1^2 : f_i = T_1^1 [f_{i+1}] + F_i   (26)
T_2^2 : T_1^1 \times n_{i+1}   (27)
T_3^2 : s_i \times F_i   (28)
T_4^2 : p_i \times f_i   (29)
T_5^2 : T_3^2 + T_4^2 + T_{10}^1   (30)
T_6^2 : \tau_i(t) = (T_2^2 + T_5^2)^T (A_i^T z_0)   (31)
End (Phase 2);
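The paper's implementation runs these tasks as communicating processes on transputer networks. Purely to illustrate the inter-chain parallelism exploited by the decomposition (and not as the paper's code), the Python fragment below evaluates the independent tasks T_3^1 and T_4^1 of one iteration concurrently before T_5^1 and T_6^1 consume their results; all numerical values are arbitrary placeholders.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

# Arbitrary placeholder data for one forward iteration (link i)
z0 = np.array([0.0, 0.0, 1.0])
A1 = np.eye(3)                        # T_1^1 : rotation matrix of the link
w, wd = np.ones(3), np.zeros(3)       # omega_i, omega_dot_i
qd, qdd = 0.5, 0.1                    # joint velocity and acceleration

with ThreadPoolExecutor() as pool:
    # Inter-chain parallelism: T_3^1 and T_4^1 share no data dependency.
    t3 = pool.submit(lambda: wd + z0 * qdd + np.cross(w, z0 * qd))   # Eq. (16)
    t4 = pool.submit(lambda: w + z0 * qd)                            # Eq. (17)
    # T_5^1 and T_6^1 each consume exactly one of the results.
    w_next = A1.T @ t4.result()                                      # Eq. (18)
    wd_next = A1.T @ t3.result()                                     # Eq. (19)
```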
The decomposition above adds up to a total of 18 tasks that need to be scheduled. Tables 3 and 4 give the number of floating point operations (multiplications and additions) required to compute each of the tasks given by Eqs. (14)-(31). Now, a scheduling algorithm needs to be employed in order to organize the different tasks (or processes) in a way that achieves a fast execution time for the complete algorithm. A close examination of the dynamics (Eqs. 14-31) shows the existence of a certain amount of parallelism together with a large amount of sequentialism inherent in the natural flow of the computations. Thus, several factors have to be considered in the task distribution and implementation stages, i.e.
(1) The sequential dependency between the different tasks.
Table 3
Multiplications and additions (forward iteration)

Task      Multiplications   Additions/Subtractions
T_1^1     4                 0
T_2^1     2                 0
T_3^1     2                 2
T_4^1     0                 1
T_5^1     9                 6
T_6^1     9                 6
T_7^1     9                 6
T_8^1     18                15
T_9^1     18                12
T_10^1    24                18
T_11^1    0                 3
T_12^1    3                 0
(2) Minimization of the interaction between the different slave processors as much as possible, by enabling each processor to execute its task without the need for data from other processors.
(3) Avoiding the case of two slave processors communicating with each other through a third slave processor.
In addition, to avoid an (I/O) bottleneck, the overall (I/O) of a processor must be reduced by increasing the size of the processor's memory. It is very important to locate all sources of overhead and communication deadlocks prior to the real-time implementation. A computer program, running on a SUN SPARC workstation, is used to perform this operation and schedule all the tasks automatically. This is done through a load balancing procedure to ensure that each processor is utilized properly. The program assigns weights to the different tasks (Tables 3 and 4) according to the total number of multiplications and additions required to compute each of them. Then, according to the number of processors in the network, the algorithm distributes the different tasks by a
Table 4
Multiplications and additions (backward iteration)

Task      Multiplications   Additions/Subtractions
T_1^2     9                 9
T_2^2     9                 6
T_3^2     6                 3
T_4^2     6                 3
T_5^2     0                 6
T_6^2     3                 5
Fig. 2. A task graph for the parallel (NE)
heuristic bin-packing procedure [8]. In the worst case the heuristic solution will not deviate too much from the optimum solution. Appendix A demonstrates the procedure adopted in this work. Note that a few assumptions have to be made in this case:
(1) The different tasks are assigned to a set of identical processors.
(2) Although each processor is capable of executing any of the tasks, no processor can execute more than one task at a time.
(3) When a processor begins to execute a task (T), it must continue executing the task until (T) is completed.
Two problems are usually encountered with this type of algorithm, which is the case in this work too. First, after a few passes, several time-gaps (Δ_i) would emerge within one or more processors, for which none of the remaining tasks would fit. Second, owing to the first problem, searching for a suitable processor gets more difficult as the scheduling passes proceed.
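The listing below is a simplified stand-in for the weight-based scheduler described above (not the actual SUN SPARC program): a greedy heuristic that assigns the heaviest unscheduled task to the currently least-loaded processor. The task names and weights are only indicative, and the precedence constraints between tasks are ignored here.

```python
def schedule_tasks(task_weights, num_procs):
    """Greedy load balancing: heaviest remaining task goes to the least-loaded
    processor. 'task_weights' maps task name -> weight (e.g. mult + add count)."""
    loads = [0.0] * num_procs
    assignment = {q: [] for q in range(num_procs)}
    for task, w in sorted(task_weights.items(), key=lambda kv: -kv[1]):
        q = min(range(num_procs), key=lambda j: loads[j])
        assignment[q].append(task)
        loads[q] += w
    return assignment, loads

# Indicative weights (multiplications + additions) for the forward-iteration tasks
forward = {"T1": 4, "T2": 2, "T3": 4, "T4": 1, "T5": 15, "T6": 15,
           "T7": 15, "T8": 33, "T9": 30, "T10": 42, "T11": 3, "T12": 3}
alloc, loads = schedule_tasks(forward, 3)
```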
The parallel computation of the problem can be represented by a directed static graph S = (V, E), whose vertices v_i ∈ V represent processes and whose edges e_{ij} ∈ E represent communication paths. A task-graph representing the (NE) algorithm is given in Fig. 2. The graph shows the different tasks and the different levels of priority existing between them. In addition, the graph shows the data-dependency between the two phases of the (NE) algorithm. Note that the graph is a restricted model in that it does not allow new nodes or edges to be created during run time. The total computation time can be estimated from the task graph. The execution time of the complete schedule is determined by evaluating the cost of computation of each level, that is,
C_{level} = \max_{1 \le j \le m} (C_{T_j})   (32)

where C_{level} and m represent the cost of computation and the number of tasks in a single level, respectively. Therefore, the total cost of the complete algorithm can be approximated by:

C_{total} = \sum_{l=1}^{k} C_{level}(l)   (33)

where k is the number of levels in the task graph. It can be noted from the task graph that the number of multiplications and additions required to compute the dynamics drops to 57n multiplications and 46n additions. This implies a reduction of 62% and 60% in the number of multiplications and additions, respectively, when compared to the sequential implementation in Table 2.
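A small sketch of how Eqs. (32)-(33) can be evaluated, assuming each task's cost is known and taking a task's level to be its longest-path depth in the graph; the tiny graph fragment and the costs used here are illustrative only.

```python
def schedule_cost(costs, preds):
    """Per-level cost (max over the level's tasks, Eq. 32) and total cost
    (sum over levels, Eq. 33). 'preds' maps each task to its predecessors."""
    level = {}
    def depth(t):
        if t not in level:
            level[t] = 0 if not preds.get(t) else 1 + max(depth(u) for u in preds[t])
        return level[t]
    for t in costs:
        depth(t)
    k = max(level.values()) + 1
    per_level = [max(c for t, c in costs.items() if level[t] == l) for l in range(k)]
    return per_level, sum(per_level)

# Illustrative fragment of the forward-iteration graph (dependencies of Eqs. 16-19)
costs = {"T1": 4, "T3": 4, "T4": 1, "T5": 15, "T6": 15}
preds = {"T1": [], "T3": [], "T4": [], "T5": ["T1", "T4"], "T6": ["T1", "T3"]}
print(schedule_cost(costs, preds))   # -> ([4, 15], 19)
```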
5. A case study

In this section the proposed algorithm is applied to a 6-link robot manipulator (PUMA-560). The different tasks shown in Fig. 2 are distributed onto a network of three INMOS transputers (Fig. 3). Initially, the number of processors is limited to three to achieve an efficient and cost-effective implementation and to provide some insight into how performance can be improved by using more processors. This should make the implementation economically viable without sacrificing the robustness of the results. This situation is further facilitated by the fact that the (NE) requires less hardware than other robot dynamics algorithms [18]. For the 6-link PUMA-560 it takes 900 multiplications and 696 additions to compute the dynamics using the (NE) (see Table 2). If the parallel version of the algorithm proposed in the previous section is used, the cost is reduced to 342 multiplications and 276 additions. However, in this case, these numbers do not reflect the actual computational cost of the algorithm. The total cost of a given algorithm should incorporate both the communication and the computational costs of the algorithm, that is,

T_p = T_{comp} + T_{comm}   (34)

where T_p, T_comp, and T_comm represent the total processing time, computation time, and communication time, respectively. The idle time that processors spend waiting to receive data from other processors is included in T_comm. Hence, a real-time implementation is necessary to establish the "true" value of the overall processing time. Table 5 shows the values of the Theoretical Processing Time (TPT) of the original (NE) and the parallel version of the algorithm.

Fig. 3. A three-processor network.
Definition 5.1. TPT is defined as the time required to solve a problem if all the communications between the different processors were instantaneous (with no idle time) [18].

Table 5 shows a theoretical speedup of 2.96 resulting from the use of the parallel (NE) algorithm. Note that the speedup is defined as:

S_m = T_p(1) / T_p(m)   (35)
Table 5
Theoretical processing time (TPT)

Algorithm    Processing time on T800 (msec)
NE           0.74
Parallel     0.25
where T_p(1) is the total processing time of the sequential algorithm, and T_p(m) is the total processing time required to finish the execution of the algorithm on an m-processor network. A relative error K, which measures the difference between the finishing time of a schedule and the finishing time of an optimal schedule, is given by:

K = [(T_p(m) − T_opt(m)) / T_opt(m)] × 100%   (36)

where T_p is the finishing time of a schedule and T_opt(m) is the optimum (ideal) finishing time that can be achieved by using m processors, given by:

T_opt(m) = T_p(1) / m   (37)

Table 5 shows that T_opt(3) = 0.247, leading to K = 1.35%. Now, the network shown in Fig. 3 is employed to execute the parallel algorithm. First, the general (NE) was executed using one transputer to provide a benchmark that can be used to evaluate the efficiency of the parallel implementation. Hence, we define here a new measure called the Practical Processing Time (PPT).
Definition 5.2. PPT is defined as the time required to solve the problem in the case of a real-time implementation using a multiprocessor system.

In this case the speedup dropped to about 2.86 (for the parallel implementation). This difference is attributed to the effect of communications between processors on the overall performance of the algorithm. It is also important to note the difference between the TPT and PPT for the sequential implementation. This disparity is due to the fact that TPT does not take into account issues such as the efficiency of coding, memory access requirements, buffering, and many other factors. The TPT is only used to measure the cost of computations. Further, the efficiency of an m-processor network is measured by the utilization rate of the available processors:

E = S_m / m ≤ 1   (38)

It can be noted from Table 6 that E = 95.0%, which indicates a high rate of utilization for the different processors.
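The figures quoted in this section follow directly from Eqs. (35)-(38); the small helper below (the function name is illustrative) reproduces them from the Table 5 and Table 6 timings.

```python
def performance(T_seq, T_par, m):
    """Speedup (35), optimum time (37), relative error (36) and efficiency (38)."""
    S = T_seq / T_par                       # Eq. (35)
    T_opt = T_seq / m                       # Eq. (37)
    K = (T_par - T_opt) / T_opt * 100.0     # Eq. (36), in percent
    E = S / m                               # Eq. (38)
    return S, T_opt, K, E

print(performance(0.74, 0.25, 3))   # TPT, Table 5: S ~ 2.96, T_opt ~ 0.247, K ~ 1.35%
print(performance(6.5, 2.27, 3))    # PPT, Table 6: S ~ 2.86, E ~ 0.95
```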
Table 6
Practical processing time (PPT)

Algorithm    Processing time on T800 (msec)
NE           6.5
Parallel     2.27
Table 7
Task allocation for the forward iteration
Processors: P1, P2, P3

Table 8
Task allocation for the backward iteration

P1        P2        P3
T_1^2     T_2^2     T_3^2
T_6^2     T_5^2     T_4^2
Tables 7 and 8 show the task distribution of the two phases of the algorithm on the three-processor network.

5.1. Task distribution and performance improvement
Further improvement of performance can be achieved if the computations of the tasks shown in Tables 3 and 4 are overlapped. It can be clearly seen that the two phases of the dynamics algorithm are tightly coupled (see Fig. 2) for a given iteration. By doubling the number of processors in the network shown in Fig. 3 to a six-processor network (Fig. 4), the two phases of computations can be overlapped. In this case, the tasks listed in Table 3 are mapped onto processors p1, p2, and p3,
Fig. 4. A six-processor network.
Table 9
Task allocation for the backward iteration on the six-processor network

P6        P5        P4
T_1^2     T_2^2     T_3^2
T_6^2     T_5^2     T_4^2
while the tasks listed in Table 4 are mapped onto processors (p4, p5, p6). Now, while processors p4, p5, and p6 are busy computing the tasks of phase (2) of the algorithm (Table 4) for iteration t, processors (p1, p2, p3) can be used to compute phase (1) of the algorithm (Table 3) for iteration t + 1, and so on. For the six-processor network the task allocation strategy is the same as the three-processor network for processors (p1, p2, p3), while processors (p4, p5, p6) are allocated the tasks shown in Table 8, as shown in Table 9. Table 10 shows some of the performance measures associated with the six-processor network. By using a six-processor network the execution time of the dynamics can be reduced by 26% when compared to that of the three-processor network; however, due to the increase in processor-to-processor communications, the performance degradation associated with the six-processor network is much higher than that of the three-processor network. To further evaluate the performance of the two networks we need to measure the amount of parallelism and the effect of communications on both implementations. Amdahl [1] has pointed out that the speedup is limited by the amount of parallelism inherent in the algorithm, which can be characterised by a parameter (f), the fraction of the computation that must be done serially. He reasons that the maximum speedup of an m-processor system in executing an algorithm as a function of (f) is given by:

S_m ≤ 1 / (f + (1 − f)/m),   ∀m   (39)

Note that S_m = 1 (no speedup) when f = 1 (everything must be done serially), and S_m = m when f = 0 (everything in parallel). So, (f) can be considered as a
Table 10
Performance of the six-processor network

Performance criterion    T800
TPT                      0.18 msec
PPT                      1.68 msec
T_opt                    0.12 msec
S                        3.9
E                        64.5%
good measure of how much parallelism is exploited in an algorithm running on a multiprocessor system [18], and it is given as:

f = (m/S_m − 1) / (m − 1)   (40)
Using the above formula, the values of the f parameter for the three-processor and six-processor networks are f_3 = 0.024 and f_6 = 0.11, respectively. This indicates an increased communication rate between the different processors, which is rather logical with a larger network. In retrospect, the cost-performance issues need to be decided by the user; that is, if the user is interested in the execution time only, with no regard to cost, then the six-processor network should be satisfactory. On the other hand, if the user has limited resources and is interested in the cost-efficiency of the computing system, then the three-processor network will be more appropriate. It has been shown by Zomaya [18] that other robot dynamics techniques, which are more amenable to parallel-processing implementations, can be used to compute the dynamics of robot manipulators. However, the proposed algorithm has, for practical reasons, the following two advantages:
• The physical implementation is simple. Because the algorithm is based on the NE it requires less hardware.
• The program portions that are executed by the different processors are modular, in the sense that they are, more or less, similar in structure. This factor reduces the development and the programming cost.
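Rearranging Eq. (39) taken with equality (i.e. Eq. (40)) gives the serial fraction directly from a measured speedup; the two calls below reproduce the f values quoted above.

```python
def serial_fraction(speedup, m):
    """Serial fraction f recovered from Amdahl's law, Eq. (40)."""
    return (m / speedup - 1.0) / (m - 1.0)

print(serial_fraction(2.86, 3))   # ~0.024  (three-processor network)
print(serial_fraction(3.9, 6))    # ~0.11   (six-processor network)
```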
6. Conclusion and summary

The dynamic modelling and simulation of typical robot arms is systematic and simple in concept, but complicated in respect of the computational burden inherent in real-time applications. In this work a parallel form of the dynamics based on the Newton-Euler formulation has been developed and distributed onto a parallel-processing system. The general form of the dynamics is useful for off-line simulation purposes. However, for real-time applications a faster approach has to be used to improve the performance. The structure of the parallel algorithm used in this work was motivated by the transputer architecture. Several issues have to be addressed when a sequential algorithm is distributed onto a network of parallel processors:
• The data flow paths and the communication between the different processors, which constitute a major bottleneck in many situations.
• The division of the workload between the different co-processors.
• The idle-time that each processor spends waiting for input from other processors.
• The required computing power, which decides the size and complexity of the network.
• The amount of parallelism and sequentialism inherent in the algorithm.
• The cost of software development.
Similar scheduling strategies to the ones presented in this work are equally applicable to other types of robotics problems [18].
Acknowledgements

The author would like to acknowledge the support of the Australian Research Council.
Appendix A: The bin-packing algorithm

The assumption made is that every bin (processor) has capacity 1, and there are l items (tasks) in the list L, with 0 < TASK_i ≤ 1 for 1 ≤ i ≤ l. The bins are indexed 1, ..., m, where B_1 is called the first bin and B_m always denotes the last nonempty bin. The sum of the TASK_i packed in a bin B_j is called the content of B_j, denoted c(B_j). Hence, (1 − c(B_j)) denotes the empty space that bin B_j currently has. So, given the list L = [TASK_1, ..., TASK_l], the job of the heuristic algorithm is to pack TASK_1, TASK_2, ..., TASK_k into B_1 until TASK_{k+1} > 1 − c(B_1), then to pack the subsequent tasks into B_2 until a task exceeds 1 − c(B_2), and so on.

This work employs a Best-Fit (BF) rule to improve the performance of the algorithm. In this rule, every nonempty bin is scanned and TASK_i is put into the bin for which (1 − c(B)) is minimum. That is, assign TASK_i to B_j if

1 − c(B_j) − TASK_i ≤ 1 − c(B_k) − TASK_i   for all nonempty bins B_k;

in the case of a tie, put TASK_i into the bin of lowest index; if 1 − c(B_j) < TASK_i for all nonempty bins, put TASK_i into an empty bin of lowest index. In simpler terms,

Proc_i = (TASK_1, ..., TASK_l),   i = 1, ..., m   (A1)

where l is the number of tasks per processor, m is the number of processors, and Proc_i is the ith processor. The processing time of a single processor is

T_{Proc_i} = \sum_{j=1}^{l} (T_{TASK_j})   (A2)

and the processing time of the complete algorithm is

T_p = \sum_{i=1}^{m} (T_{Proc_i})   (A3)

where (T_{Proc_i}) is the Total Processing Time of a single processor (Proc_i), and (T_p) is the Total Processing Time of the complete algorithm. Weights are assigned to each task:

w_j = T_{TASK_j} / T_p,   0 < w_j < 1   (A4)

The cost of a group of tasks running on a single processor will be

W_i = \sum_{j=1}^{l} (w_j)   (A5)

Hence, the total computational cost of the sequential implementation of the algorithm can be given by:

W = \sum_{k=1}^{m} (W_k)   (A6)

and that of the parallel implementation by:

W_p = \max_{1 \le k \le m} (W_k)   (A7)

In the case of an optimum task-allocation strategy, the weight assigned to each processor is given as:

W_k = 1/m,   k = 1, ..., m   (A8)

However, if a near-optimum task-allocation strategy is obtained, then

W_k = 1/m + w_{of}   (A9)

where w_{of} is some offset value due to practical considerations, such as processor-to-processor communications.
References

[1] G.M. Amdahl, Validity of the single processor approach to achieving large scale computing capabilities, in: Proc. AFIPS 30 (Thompson, Washington, D.C., 1967) 483-485.
[2] J. Barhen, Hypercube ensembles: An architecture for intelligent robots, in: J.H. Graham, ed., Computer Architectures for Robotics and Automation (Gordon and Breach Science Publishers, New York, 1987) 195-236.
[3] M. Brady, Robotics Science (MIT Press, Cambridge, MA, 1989).
[4] C.L. Chen, C.S.G. Lee and E.S.H. Hou, Efficient scheduling algorithms for robot inverse dynamics computation on a multiprocessor system, IEEE Trans. Syst. Man and Cybernet. 18 (5) (1988) 729-743.
[5] E.G. Coffman, Computer and Job-Shop Scheduling Theory (Wiley, New York, 1976).
[6] A. Fijany and A. Bejczy, eds., Parallel Computation Systems for Robotics: Algorithms and Architectures (World Scientific Publishing, Singapore, 1992).
[7] M.R. Garey and D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness (W.H. Freeman, San Francisco, 1979).
[8] T.C. Hu, Combinatorial Algorithms (Addison-Wesley, Reading, MA, 1982).
[9] K. Hwang, Advanced Computer Architecture: Parallelism, Scalability, and Programmability (McGraw-Hill, New York, 1993).
[10] IEE, Parallel processing in control - the transputer and other architectures, IEE Workshop, UCNW, Bangor, Wales, United Kingdom (1988).
[11] H. Kasahara and S. Narita, Parallel processing of robot-arm control computation on a multimicroprocessor system, IEEE J. Robotics Automation 1 (2) (1985) 104-113.
[12] R.H. Lathrop, Parallelism in manipulator dynamics, Int. J. Robotics Res. 4 (2) (1985) 80-102.
[13] H.W. Lawson, Parallel Processing in Industrial Real-Time Applications (Prentice-Hall, Englewood Cliffs, NJ, 1992).
[14] C.S.G. Lee and P.R. Chang, Efficient parallel algorithm for robot inverse dynamics computation, IEEE Trans. Syst. Man Cybernet. 16 (4) (1986) 532-542.
[15] J.Y.S. Luh and C.S. Lin, Scheduling of parallel computation for a computer controlled mechanical manipulator, IEEE Trans. Syst. Man Cybernet. 12 (2) (1982) 214-234.
[16] R.J. Schilling, Fundamentals of Robotics: Analysis and Control (Prentice-Hall, Englewood Cliffs, NJ, 1990).
[17] A.Y. Zomaya, Efficient robot dynamics for high sampling rate motions: Case studies and benchmarks, Int. J. Control 54 (4) (1991) 793-814.
[18] A.Y. Zomaya, Modelling and Simulation of Robot Manipulators: A Parallel Processing Approach (World Scientific Publishing, New Jersey, 1992).
[19] A.Y. Zomaya and A.S. Morris, Transputer-arrays for the on-line computation of robot Jacobians, Concurrency: Practice and Experience 4 (5) (1992) 399-412.
[20] A.Y. Zomaya, Highly efficient transputer arrays for the computation of robot dynamics, Concurrency: Practice and Experience 4 (2) (1992) 186-205.