Copyright © IFAC Algorithms and Architectures for Rea!-Time Control, Cancun, Mexico, 1998
PERFORMANCE EVALUATION OF HETEROGENEOUS ARCHITECTURES
D. N. Ramos-Hernandez and M. O. Tokhi
The University o/Sheffield. UK. Tel: +44 (0)114 2225236. email:
[email protected]@sheffield.ac.uk.
Abstract: An investigation into the performance evaluation of beterogeneous arcbitectures is presented in this paper. A task allocation strategy for maximum efficiency is proposed and evaluated. Compiler efficiency and code optimisation are also investigated in context of performance of an arcbitecture. An adaptive filtering algorithm and a beam simulation algorithm are considered. These are implemented on a beterogeneous architecture incorporating the TMS32OC40 digital signal processing devices, the Intel i860 vector processor and the 1'805 Inmos transputers. Copyright © 1998 IFA C
Keywords: Heterogeneous arcbitectures, parallel processing, real-time signal processing and control, speedup, task allocation.
Alternative strategies where multi-processor based systems are employed, utilising high performance reduced instruction set computer (RISC) processors, DSP devices, transputers and parallel processing (PP) techniques, could provide suitable methodologies.
1. INTRODUCTION
The general guideline for the parallel real-time realisation of control-intensive algorithms is to increase the flexibility in bard ware parallelism and to exploit software parallelism. Hardware and software design trade-offs also exist in terms of cost, complexity, expandability, compatibility, and performance. Compiling for multiprocessors is mucb more involved than for uni-processors (Hwang K., 1993).
For PP with widely different architectures and different PEs, performance measurements such as MIPS, MOPS and MFLOPS of the PEs are meaningless. It is more important to rate the performance of each architecture with its PEs on the type of program likely to be encountered in a typical application. The different arcbitectures and their different clock rates, inter-processor communication speed, etc. all confuse the issue of attempting to rate the architecture. This is an inherent difficulty in selecting a parallel architecture for better performance for algorithms in signal processing and control system development applications. The ideal performance of a parallel arcbitecture demands a
Digital signal processing (DSP) devices are designed in hardware to perform concurrent add and multiply instructions and execute irregular algorithms efficiently. The Intel i860 vector processor, for instance, has been designed for high performance floating point computation and is able to efficiently process regular algorithms involving matrix manipulations. Purpose built PEs are not enougb to bridge the ever-increasing software/bardware gap.
173
is connected to the Host computer and via its link 3 to the Transtech TMB16 motherboard that contains the Intel i860 RISC processor (iS60). The iS60 1RAM has 16 Mbytes shared memory and a TS05 with 4 Mbytes of local memory. The second transputer on the TMB08 motherboard is connected to the root transputer via its link 1 and to a subnetwork of three Texas Instruments TMS32OC40 (C40) DSP devices.
perfect match between the capability of the architecture and the program behaviour. Capability of the architecture can be enhanced with better hardware technology, innovative architectural features and efficient resources management. In contrast, program behaviour is difficult to predict due to its heavy dependence on application and runtime conditions. Moreover, there are many other factors that influence program behaviour. These include algorithm design, partitioning and mapping of the algorithm, inter-processor communication, data structures, language efficiency, programmer skill, and compiler technology. Few investigations have been made to propose performance metrics for heterogeneous architectures (Yan et al., 1996; Zbang and Yan, 1995). However, these do not fully exploit the capability of all the PEs in the architecture. Tokhi et al. (1997) developed several performance metrics for sequential and parallel architectures, which ensure that the capabilities of the processors are exploited by maximising the efficiency of the architecture.
I _ _ _ _ _ _ JI In.mu
Fig.
Compilers have a significant impact on the performance of the system (Hwang, 1993; Tokhi et al, 1996). The power of a parallelising compiler is derived from its ability to analyse loops and the precision of its dependence analysis. Several compiler optimisation techniques have been employed in parallelising compilers to eliminate some of the dependencies in the programs and make them parallelisable (Haghighat et al, 1992).
1. The topology architecture.
of the
heterogeneous
3. APPLICATION ALGORITHMS The first algorithm considered is the least mean square (LMS) adaptive filter. This is of irregular nature, with uneven inner and outer loops and many multiply and accumulate functions. The second algorithm is a cantilever beam simulation algorithm. This is of regular nature and is based on matrix multiplication and addition.
Performance is related to code optimisation facility of the compiler, which may be machine dependent. The goal of program optimisation is, in general, to maximise the speed of code execution. This involves several factors such as minimisation of code length and memory accesses, exploitation of parallelism, elimination of dead code, in-line function expansion, loop unrolling and maximum utilisation of registers. The optimisation often transforms a code into an equivalent but "better" form in the same representation language (Hwang, 1993). This paper continues with the description of the hardware architecture in Section 2. The application algorithms are presented in Section 3. The performance metrics used are described in Section 4. Finally, implementation results are given in Section 5 and the paper concluded in Section 6.
3.1 LMS algorithm
The LMS adaptive filter algorithm was developed by Windrow and his co-workers (Widrow et al., 1975). It is based on the steepest descent method where the weight vector is updated according to (1)
where and Jl
"*
= weight vector, X k = input signal vector
= constant that controls the stability, and the
error is given by
eA: = Yk -
~T X k
•
3.2 The Beam simulation algorithm
Consider a cantilever beam system with a force U (x, t) applied at a distance x from its fixed end at
2. HARDWARE ARCHITECTURE A heterogeneous architecture consists of various types of processors. The hardware used to carry out the experimental investigations is shown in Fig. 1. This comprises a Transtech TMB08 motherboard with two Inmos T805 (T8) transputers. The root TS
time t . This force produces a deflection
y( x, t) of
the beam from its stationary position at the point where the force bas been applied. The dynamic equation that represent this system is
174
11
2
a4Y(X,t} a2 Y(X,t} -..!..F( } 4 + 2 x,t ax at m
Equation (4) can alternatively be expressed in terms of fixed-load execution time increments as
(2)
Sil. -_!1T.;. !1T '
where 11 is a beam constant and m is the mass of the beam. Using a finite difference discretisation of the dynamic equation of the beam in time and distance yields (Tokhi et al., 1994) 2
Ik+l =-Yk - 1 -A.. SY.\:
(M)2 +--U(x,t)
(3)
!1W
i
=I, ... , N
= Vi
!1W
= Silv !1W
; i
= I, ... ,N
(6)
It can be demonstrated that, in this manner, the architecture is conceptually transformed into an equivalent homogeneous architecture incorporating N identical virtual processors. This is achieved by the task allocation among the processors according to their computing capabilities to achieve maximum efficiency.
5. IMPLEMENTATION AND RESULTS In implementing the LMS algorithm in these investigations the number of weights were varied from 10 to 200, in increments of 10. The operation at each step involved 1000 data points. The (convergence rate parameter) J..l = 0.05, was utilised throughout. For the beam simulation algorithm a cantilever beam of length L = 0.635 m and mass m = 0.037 kg was considered. The algorithm granularity was achieved by increasing the number of beam segments from 5 to 40, in increments of 5. The execution times of the processors were obtained over 20000 iterations at each step.
To obtain a comparative performance evaluation of a uni-processor and multi-processor for an application under various processing conditions, for example with and without code optimisation, the concept of generalised sequential speedup and task allocation theory can be applied. This concept is based on providing a maximum efficiency and maximum speedup in parallel architectures. A virtual processor is defined as a reference node that represent the characteristics of all the processing elements in a heterogeneous architecture. Let the generalised sequential speedup of processor i (in a parallel architecture) to the virtual processor be Sil.;
V
(5)
Vv N N Using equation (6), for the distribution of load among the processors, the speedup and efficiency achieved with a heterogeneous architecture with N processors are N and 100% respectively. However, real experimentation shows that due to communication overheads and run-time conditions the . speedup and efficiency of the parallel architecture will be less than these values (Tokhi et. al, 1997).
The performance of a processor, as execution time, in implementing an application algorithm generally evolves linearly with the task size (Tokhi et al., 1996a; Tokhi et al., 1996b). With some processors, however, anomalies in the form of change of gradient (slope) of execution time to task size are observed. These are mainly due to run-time memory management conditions of the processor where, up to a certain task size the processor may find the available cache sufficient, but beyond this it may require to access lower level memory. Despite this the variation in the slope is relatively small and the execution time to task size relationship can be considered as linear. This means that a quantitative measure of performance of a processor in an application can adequately be given by the average ratio of task size to execution time or the average speed. Alternatively, the performance of the processor can be measured as the average ratio of execution time per unit task size, or the average (execution time) gradient. In this manner, a generalised performance measure of a processor relative to another in an application can be obtained.
t;•
N
Thus, to allow the maximum utilisation of the processors in the architecture the task increments !1 Ht; allocated to processors is given by
4. PERFORMANCE METRICS
=
I
I
m where S is a pentadiagonal matrix, entries of which depend on the physical properties and boundary conditions of the beam, Y; (i = k + l.k,k -I) represents the beam deflection at time step i , and M and ~ are increments along time and distance coordinates respectively, A..2 =(M}2(~tI12 .
Si I.
.
l = , ... ,
The beam simulation algorithm was only considered for the task allocation experiments. Investigations at compiler efficiency and code optimisation were carried out with both algorithms.
5.1 T8&C40 Implementation
Fig. 2 shows the execution times of the T8, C40 and the T8&C40 architectures in implementing the beam simulation algorithm. The characteristics of the virtual processor and of the corresponding theoretical T8&C40 heterogeneous architecture are also shown in Fig. 2. It is noted that among the uniprocessors, the C40 is considerably faster that the
(4)
175
T8. The combination, thus, results in a virtual processor with characteristics closer to that of the C40. This implies that, for the T8&C40 to achieve maximum efficiency, a large proportion of the task is to be allocated to the C40. It is noted that the actual T8&C40 has performed slower than the corresponding theoretical model. This is due to communication between the processors.
5.3 C40&i860 Implementation
Fig. 4 shows the execution times of the C40, i860 and the actual, and theoretical execution times of the C40&i860 heterogeneous architecture in implementing the beam simulation algorithm. It is noted that the actual implementation is faster than the virtual processor and faster than the i860. It is also noted that the actual implementation is very close to the virtual parallel machine. Comparing the three implementations, it is noted that with the T8&C40, the execution times of the actual implementation stay very close to the virtual processor. Task allocation was computed according to the theoretical model. However, communication overheads are evident with the T8&i860 impiementation where no task is allocated to the T8 except for the last two numbers of segments. This resulted in the same execution times as the i860 processor, except for the 35 and 40 segment steps where the execution time has increased. The results of these two implementations show that due to the disparity in capabilities of the processors, communication overhead becomes a dominant factor in the implementation. The i860&C40 implementation shows that a better task allocation is obtained between both processors and communication overheads are minimum. In this manner the best performance for this algorithm is obtained with the combination of the C40&i860.
Fig. 2. Execution times of the architecture (T8&C40) implementing the beam simulation algorithm.
5.2 T8&i860 Implementation
Another heterogeneous configuration was used to implement the beam simulation algorithm. This consists of one T8 and one i860. Fig. 3 shows the execution times corresponding to this implementation. It is noted that the i860 is extremely faster than the T8. For this reason all the tasks were allocated to the i860, with the exception of the last two values (35 and 40 segments) where one segment was allocated to the T8. The performance of the actual T8&i860 is similar to the uni-processor experiment with the i860, except, for the last two values where inter-processor communication increases the execution time for the actual implementation.
.. 1'
'/
.. -.
--
-- -- -..... -...... ... -.
..,. .... -,.,.,
_ .~;i""..:. ::..f-
~,,...:;.,'~
-..,.,::- ~:
..,-:.-........
~~.:.:.::...:::-
.. .
i
-Si,. .. eeo ,.,oc..or
-
Vi NIl
-
",.... p...II• • .ct.i\ •
.. ••
~I
. .
i.p ... __ tioII (i.oGto)
.... ..... 01 . . . . . .
Fig. 4. Execution times of the architecture (C40&i860) implementing the beam simulation algorithm.
-Si,._ra
--.-it,._ iao • •• "'.i.O . ... CUll iMP ... .,.tion
--Tl. ,.O-v ..... proc:.-.. _·lI.lM;O · Th ..... ~ntad.
5.4 Compilers Efficiency
5 u
:
,
u
_--
1" .:I
j
1 .:I
_..r
r·~:.:~·~
..,....-'-
,.~
-----
The beam simulation algorithm and the LMS algorithm were implemented for the performance evaluation of several compilers. The compilers involved are the 3L Parallel C version 2.1, Inmos ANSI C and Occam. All these compilers support PP and can be utilised with the heterogeneous architecture considered before. Occam is more hardware oriented and a straight-forward programming language for PP. It may not be as suitable as the Parallel C or ANSI C compilers for
_---------:~:.. _,_ ..- .
r"-"'::' • .;:.;:..~:...-- •
.
., .. - .......--.-
"
.... "'t-0I . . . . __
"
Fig. 3. Execution times of the (T8&i860) in implementing simulation algorithm.
"
..
architecture the beam
176
Similarly, Fig. 6a and Fig. 6b show the execution times in implementing the LMS algorithm on single T8 and two 1'8s with the compilers. It is noted that better performances with Occam compiler have been achieved with single T8. However, with two T8s, Parallel C and ANSI C were more effective. This is due to large amounts of data handling, where runtime memory management problem can be solved with Parallel C and ANSI C more efficiently than with the Occam compiler (Tokhi et al, 1996a).
numerical computation (Tokhi et al., 1995). Both algorithms were implemented on one and two transputers to obtain a comparative performance evaluation of these compilers. Fig. Sa shows the execution times in implementing the beam simulation algorithm using the three compilers with one T8 and Fig. Sb shows the execution times with two TSs. ....... tJ_on.tgori ... (Canp . . _T_)
5.5 Optimisation
s-n
The i860 processor with its respective compiler is considered in investigating its optimisation facility. Using the optimisation levels of the i860, the beam simulation algorithm and the LMS algorithm were studied. The Portland Group (PG) C compiler is an optimising compiler for the i860. It incorporates four levels of optimisation. Level 0 causes no optimisation, Level 1 generates a fast code when the program is very irregular, i.e. it has few loops and little floating point arithmetic, Level 2, does good optimisation when the functions are short and there is a significant amount of floating point arithmetic, and levels 3 and 4 give the best performance when the code contains any conditional branches (Portland Group, 1991).
_IJ_on _ .
(C....p ..
lIIgoril'm 2 Tm6)
....... -........
'~ . -------------.--------------~
i""
_ .... " . . ""oIgool ... (OpIimizalion with . .
".,---------------------------------..., Fig.
S. Performance of the compilers in implementing the beam simulation algorithm (a) with one T8, (b) with two T8s.
05
" Fig. 7. Execution times of the i860 in implementing the beam simulation algorithm with the compiler optimiser. ::;=:=!
Previous research has reported that in implementing the LMS algorithm the performance enhanced significantly with higher levels of optimisation and for the beam simulation algorithm, the enhancement was not significant beyond the first level (Tokhi et al, 1996a). The source code for both algorithms was optimised before carry out the experiments.
,., ..... ot •• tt*
Fig.
In the case of the beam simulation algorithm the implementation of the code used arrays with some loops, for this reason the compiler restructured the actual code. The results are show in Fig. 7. The best performance was obtained with optimisation level 3 and 4.
6. Performance of the compilers in implementing the LMS algorithm (a) with one TS, (b) with two T8s. 177
Tokhi, M. O. and Hossain, M. A. (1994). Selftuning active vibration control in flexible beam structures. Proceedings of IMechE-I: Journal of Systems and Control Engineering, 208, (14), pp. 263 277. Tokhi, M. O. and Hossain, M. A. (1995). CISC, RISC and DSP processors in real-time signal processing and control. Microprocessors and Microsystems, 19, (5), pp. 291-300. Tokhi, M. 0., Chambers, C. and Hossain, M. A. (1996a). Performance evolution with DSP and transputer based systems in real-time signal processing and control applications, Proceedings of UKACC International Conference on Control-96, Exeter, 02-05 September 1996, 1, pp. 371-375. Tokhi, M. 0., Hossain, M. A. and Chambers, C. -(1996b). Performance evaluation in sequential real-time processing. Research Report No. 645, Department of Automatic Control and Systems Engineering, The University of Sheffield, UK, October 1996. Tokhi, M. 0., Ramos-Hernandez D.N., Chambers, C. and Hossain, M. A. (1997). Performance evaluation of DSP, RIse and transputer based systems in real-time implementation of signal processing and control algorithms. The 4th IFAC Workshop on Algorithms and Architectures for real-time Control. Vilamoura, PorOlgal, 9-11 April 1997, pp_ 287-292. Yan, Y., Zhang, X. and Song, Y. (1996). An effective and practical performance prediction model for parallel computing on nondedicated heterogeneous NOW, Journal of Parallel and Distributed Computing, 38, (1), pp. 63-80. Zbang, X. and Yan, Y. (1995). Modeling and characterizing parallel computing performance on heterogeneous networks of workstations. Proceedings of Seventh IEEE Symposium on Parallel and Distributed Processing, San Antonio, Texas, 25-28 October 1995, pp. 25-34. Widrow, B., Glover, J.R., Mccool, J.M.,Kaunitz, J.,Williams, C.S., Heam, R.H., Zeidler, J.R., Dong, E. and Goodlin, R.c. (1975). Adaptive noise cancelling: principles and applications. Proceedings of IEEE, 63, pp. 1692-1696.
For the LMS algorithm the code has multiple nested loops so the compiler is able to recognise the structures involved and restructure the actual code. However for this algorithm few changes were done to the code. Fig. 8 shows levels 0 and four for the LMS. It can be noted that the performances with both optimisation levels are very similar. UlSoIp_
o
(OpIoft_"" .. e. ... ;150) "r------------------, • •• OpIi..aalian La_ 0
'" '"
0. .
O2
2
~ ~ ~
i
g
i
i 8 '= £ R ~ ~ ~ ~ ¥ ~ ~ """'.,;.01;..... - -
Fig. 8. Execution times of the i860 in implementing the LMS algorithm with the compiler optimiser.
6. CONCLUDING REMARKS The effectiveness of task allocation theory developed in previous work has been demonstrated in new hardware configurations for heterogeneous architectures. Compiler efficiency and code optimisation have been investigated showing that these issues affect the performance of the processors in real-time applications. Experiments with different compilers have demonstrated how these and different configurations modify the performance of the processors. In the optimisation experiments it was also shown how the regularity or irregularity of the algorithm, as well as the codification change the performance of the processors.
ACKNOWLEDGEMENTS D.N. Ramos-Hernandez acknowledges the financial support of CONACYT-MEXICO.
REFERENCES Haghighat, M. and Polychronopoulos C. (1992). Symbolic program analysis and optimization for parallelizing compilers Lecture Notes in Computer Science (757). Banerjee, U. Gelernter, D. Nicolau, A., Padua, D. (Eds). Hwang, K. (1993). Advanced Computer Architectu re: Parallelism, Scalability, Programmability. McGraw -Hill, Singapore. Portland Group Inc. (1991). PG tools user manual, Portland Group Inc.
178