JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING 42, 67–74 (1997)
ARTICLE NO. PC971307

Performance Evaluation Issues in Real-Time Parallel Signal Processing and Control

M. O. Tokhi, M. A. Hossain, M. J. Baxter, and P. J. Fleming

Department of Automatic Control and Systems Engineering, The University of Sheffield, Sheffield, United Kingdom

This paper presents an investigation into the performance evaluation issues involved in real-time parallel signal processing and control. Issues such as algorithm partitioning and mapping, interprocessor communication, granularity, regularity, compiler efficiency for numerical computation, and code optimization are investigated with several signal processing and control algorithms. The algorithms are implemented on a number of uniprocessor and multiprocessor, homogeneous and heterogeneous, parallel architectures. A comparative performance evaluation of the architectures is made, demonstrating the critical problems encountered in real-time parallel signal processing and control. © 1997 Academic Press

1. INTRODUCTION

Parallel processing (PP) has emerged as a key enabling technology in modern computing to meet the ever-increasing demand for higher performance, lower cost, and sustained productivity in real-life applications. The concept of PP, applied to different problems or to different parts of the same problem, is not new; discussions of parallel computing machines are found in the literature at least as far back as the 1920s [4]. The literature also shows a continuing research effort over the years to understand parallel computation [7]. This effort has intensified dramatically in the last few years across numerous applications, including signal processing, control, artificial intelligence, pattern recognition, computer vision, computer-aided design, and discrete event simulation.

For PP, with widely different architectures and different processing elements (PEs), raw performance measurements of the PEs such as MIPS, MOPS, and MFLOPS are meaningless. It is more important to rate the performance of each architecture, with its PEs, on the type of programs likely to be encountered in a typical application. The different architectures, with their different clock rates, memory cycle times, interprocessor communication speeds, optimization facilities, compiler performance, and so on, all confuse any attempt to rate an architecture. This is an inherent difficulty in selecting a parallel architecture, for better performance, for algorithms in various applications. The ideal performance of a parallel architecture demands a perfect match between the capability of the architecture and the program behavior. The capability of the architecture can be enhanced with better hardware technology, innovative architectural features, and efficient resource management. In contrast, program behavior is difficult to predict due to its heavy dependence on the application and on run-time conditions. Many other factors also affect program behavior, including algorithm design, partitioning and mapping of an algorithm, interprocessor communication, data structures, language efficiency, programmer skill, and compiler technology [2, 3, 8].

This paper presents an investigation into the performance evaluation issues involved in real-time parallel signal processing and control. A comparative study of the performance of several high-performance architectures in implementing several algorithms is presented, on the basis of real-time computation with communication overhead, interprocessor communication, compiler efficiency, and code optimization. Little work has previously been reported on such a comparative performance evaluation of sequential and parallel architectures in the context of real-time applications. The remainder of the paper is organized as follows. Section 2 gives a brief background to this work. Section 3 introduces the algorithms utilized. The hardware architectures utilized in this study, with the corresponding software resources, are described in Section 4. Results of the implementations are presented and discussed in Section 5 and, finally, the paper is concluded in Section 6.

2. BACKGROUND

There are various factors that play important roles in the performance of an architecture. Interprocessor communication between PEs is one of the key issues in comparing the real-time performance of a number of parallel architectures and in assessing the suitability of an algorithm. The amount of data, the frequency with which the data is transmitted, the speed of data transmission, latency, and the data transmission route are all significant in affecting the communication within the architecture. The first two factors depend on the algorithm itself and on how well it has been partitioned.

The remaining factors are a function of the hardware, with which this investigation is concerned. These depend on the interconnection strategy, whether tightly coupled or loosely coupled. Any evaluation of the performance of the interconnection must be, to a certain extent, quantitative. Once a few candidate networks have been tentatively selected, a detailed (and expensive) evaluation, including simulation, can be carried out and the best network selected for a proposed application [1].

There are three different problems to be considered in implementing algorithms on PP systems: (a) identifying parallelism in the algorithm, (b) partitioning the algorithm into subtasks, and (c) allocating the tasks to processors. These involve interprocessor communication, issues of granularity of the algorithm and of the hardware, and regularity of the algorithm. Hardware granularity is the ratio of the computational performance to the communication performance of each processor within the architecture. Similarly, task granularity is the ratio of the computational demand to the communication demand of the task. Typically, a high compute/communication ratio is desirable. The concept of task granularity can also be viewed in terms of compute time per task: when this is large, the implementation is coarse-grained; when it is small, it is fine-grained. Although large grains may ignore potential parallelism, partitioning a problem into the finest possible granularity does not necessarily lead to the fastest solution, as maximum parallelism also carries maximum overhead, particularly due to increased communication requirements. Therefore, when partitioning an algorithm and distributing it across PEs, it is essential to choose an algorithm granularity that balances useful parallel computation against communication and other overheads [11].

Regularity describes the degree of uniformity in the execution thread of the computation. Many algorithms can be expressed as matrix computations; these lead to the so-called regular iterative type of algorithms owing to their very regular structure. In implementing these types of algorithms, a vector processor will, in principle, be expected to perform better. Moreover, if a large amount of data is to be handled in these algorithms, the performance will be further enhanced if the processor has more internal data cache, instruction cache, and/or a built-in math coprocessor.

Performance is also related to the program optimization facility of the compiler, which may be machine dependent. The goal of program optimization is, in general, to maximize the speed of code execution. This involves several factors, such as minimization of code length and memory accesses, exploitation of parallelism, elimination of dead code, in-line function expansion, loop unrolling, and maximum utilization of registers. Optimization techniques include vectorization using pipelined hardware and parallelization using multiple processors simultaneously [8].
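For reference, the two granularity measures defined above can be restated symbolically (the notation is ours, not the paper's):

\[
G_{\mathrm{hw}} = \frac{\text{computational performance of a PE}}{\text{communication performance of a PE}},
\qquad
G_{\mathrm{task}} = \frac{\text{computational demand of a task}}{\text{communication demand of a task}}.
\]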

A parallel algorithm that solves a problem well using a fixed number of processors on a particular architecture may perform poorly if either of these parameters changes. It has been argued that analysis of the performance of a given parallel algorithm/architecture calls for a method that accounts for scalability: the system's ability to increase speedup as the number of processors increases. Several parallel performance metrics that measure scalability have been proposed [5, 6, 17, 18, 23].

The most widely accepted measure used to evaluate the performance of a parallel system is speedup [12, 17, 18]. Speedup S_N is defined as the ratio of the execution time T_1 on a single processor to the execution time T_N on N processors. The theoretical maximum speedup that can be achieved with a parallel architecture of N identical processors working concurrently on a problem is N, known as the ideal speedup. In practice, the speedup is much less, since some architectures do not perform to the ideal level owing to conflicts over memory access, communication delays, inefficient algorithms, and mapping that fails to exploit the natural concurrency in a computing problem [9]. In some cases, however, speedup above the ideal can be obtained, due to factors such as anomalies in programming, compilation, and architecture usage.

When speed is the goal, the primary objective is to seek the power to solve problems of some magnitude in a reasonably short period of time. Speed is a quantity that ideally would increase linearly with system size. Based on this reasoning, the isospeed approach has been proposed [18]; it is characterized by the average unit speed, defined as the achieved speed of a given computing system divided by the number of processors N. A generalized speedup has also been introduced and its relation to the traditional speedup studied [16]; it is defined as the ratio of parallel speed to sequential speed.

Another useful measure in evaluating the performance of a parallel system is efficiency E_N, defined as the ratio of S_N to N. Efficiency can be interpreted as an indication of the average utilization of the N processors, expressed as a percentage. Furthermore, this measure allows a uniform comparison of the speedups obtained from systems containing different numbers of processors. It has also been shown that the value of efficiency is related to the granularity of the system [15]; although the analysis uses an ideal model, the value of granularity can be used as a guideline during the partitioning process.

Some researchers have proposed the concept of isoefficiency as a measure of scalability of parallel algorithms [5]. Isoefficiency fixes the efficiency and measures how much the work must be increased to keep the efficiency unchanged; constant efficiency means that speedup increases linearly with system size. Thus, the concept of isoefficiency still uses speedup as the performance metric. If generalized speedup is used and the sequential speed is assumed independent of problem size (which does not hold in general), then two implementations have the same efficiency if and only if they have the same average speed; in this case the isospeed approach coincides with the isoefficiency approach.
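The two headline metrics used in the remainder of the paper follow directly from these definitions:

\[
S_N = \frac{T_1}{T_N}, \qquad E_N = \frac{S_N}{N},
\]

with $S_N = N$ (and hence $E_N = 1$) corresponding to the ideal case.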

3. ALGORITHMS

The algorithms considered in this investigation consist of the finite-difference (FD) simulation, identification, and active vibration control (AVC) of a flexible beam structure, the fast Fourier transform (FFT), and second-order correlation.

Consider a cantilever beam system with a force F(x, t) applied at a distance x from its fixed (clamped) end at time t. Discretizing the governing dynamic equation of the beam into a finite number of sections of length \Delta x and considering the deflection of each section at time steps \Delta t using the central FD method, a discrete approximation can be obtained as [20]

\[
Y_{k+1} = -Y_{k-1} - \lambda^2 S Y_k + \frac{(\Delta t)^2}{m} F(x, t), \tag{1}
\]

where m is the mass of the beam, \lambda^2 = [(\Delta t)^2/(\Delta x)^4]\mu^2 with \mu a beam constant, and S is a pentadiagonal matrix whose entries depend on the physical properties and boundary conditions of the beam. Y_i (i = k+1, k, k-1) is an (n-1) \times 1 vector representing the deflection of the end of sections 1 to n-1 of the beam at time step i (the beam being divided into n-1 sections). Equation (1) comprises the beam simulation algorithm.

An AVC system is considered for vibration suppression of the beam. The unwanted (primary) disturbance is detected by a detection sensor and processed by a controller to generate a cancelling (secondary, control) signal, resulting in cancellation at an observation point along the beam. The objective is to achieve total (optimum) vibration suppression at the observation point. Synthesizing the controller on the basis of this objective yields the required controller transfer function as [20]

\[
C = \left[ 1 - Q_1/Q_0 \right]^{-1}, \tag{2}
\]

where Q_0 and Q_1 represent the equivalent transfer functions of the system (with input at the detector and output at the observer) when the secondary source is off and on, respectively. Equation (2) is the required controller design rule, which can easily be implemented online. This involves estimating Q_0 and Q_1 using a suitable system identification algorithm, designing the controller using Eq. (2), and implementing the controller to generate the cancelling signal. A recursive least-squares parameter estimation algorithm is used to estimate Q_0 and Q_1 in the discrete-time domain in parametric form. Note that the process of estimating the parameters of the controller is referred to here as the "identification algorithm" and the implementation of the controller as the "control algorithm."

To devise an FFT algorithm for a real periodic discrete-time signal, the divide-and-conquer approach [14] is used in this investigation. The correlation algorithm utilized, on the other hand, constitutes the cross-correlation of two finite-energy signal sequences [10, 14].
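To make the structure of Eq. (1) concrete, the following is a minimal C sketch of one time step, assuming the pentadiagonal matrix S is applied as a five-point stencil with coefficients c[0..4] held constant along the beam interior. The paper does not list the entries of S (they depend on the boundary conditions), so the coefficients, names, and boundary handling here are illustrative only.

```c
#include <stddef.h>

/* One step of Y[k+1] = -Y[k-1] - lambda2*S*Y[k] + (dt*dt/m)*F,
 * with S applied as a pentadiagonal (five-point) stencil.
 * y_prev, y_curr, y_next, and f each hold n segment values. */
void beam_step(ptrdiff_t n, const double *y_prev, const double *y_curr,
               double *y_next, const double c[5], double lambda2,
               double dt2_over_m, const double *f)
{
    for (ptrdiff_t i = 0; i < n; i++) {
        double sy = 0.0;                 /* (S * Y_k)_i            */
        for (int k = -2; k <= 2; k++) {
            ptrdiff_t j = i + k;
            if (j >= 0 && j < n)         /* crude boundary cutoff  */
                sy += c[k + 2] * y_curr[j];
        }
        y_next[i] = -y_prev[i] - lambda2 * sy + dt2_over_m * f[i];
    }
}
```

Because segment i needs only y_curr[i-2..i+2], a pipeline partition over PEs requires exchanging just two boundary segments with each neighbor per step, which is the communication pattern exploited in Section 5.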

4. HARDWARE AND SOFTWARE

The uniprocessor architectures considered include the Intel 80860 (i860) RISC processor [9], the Texas Instruments TMS320C40 (C40) DSP device [19], and the INMOS T805 (T8) transputer. The homogeneous architectures considered include a network of C40s and a network of T8s. A T8 is used as the root processor with the network of C40s; it communicates with the C40s via a link adapter using serial-to-parallel communication links, while the C40s communicate with each other via parallel communication links. In the network of T8s, the processors communicate with one another via their serial links. A pipeline topology is utilized in the case of the homogeneous architectures, as it is simple to realize and reflects the structure of the algorithms considered in this study.

Two heterogeneous parallel architectures, namely an integrated i860+T8 system and an integrated C40+T8 system, are also considered. In the i860+T8 architecture the T8 and the i860 communicate with each other via shared memory. In the C40+T8 architecture the C40 and the T8 communicate with each other via serial-to-parallel or parallel-to-serial links.

The architectures considered are programmed in high-level languages, consisting of Portland ANSI C, INMOS ANSI C, 3L Parallel C, and Occam, as appropriate for the hardware used. All these programming languages support PP.

5. IMPLEMENTATIONS AND RESULTS

This section presents results of implementations highlighting important issues to consider in devising parallel solutions to real-time signal processing and control applications. It is noted in these investigations that real-time processing was achieved with all the architectures in the applications considered. Moreover, it is shown that real-time processing can be achieved with these systems in larger and more complicated applications through suitable matching between the computing requirements of the algorithm and the computing capabilities of the architecture.

5.1. Interprocessor Communication

The interprocessor communication techniques for the different architectures utilized in this study are (i) T8–T8, serial communication link; (ii) C40–C40, parallel communication link; (iii) T8–C40, serial-to-parallel communication link; and (iv) T8–i860, shared-memory communication. The performance of these interprocessor communication links was evaluated by utilizing a similar strategy for exactly the same data block, without any computation during the communication time. It is important to note that, although the C40 has potentially high data rates, these are often not achieved when using 3L Parallel C due to the routing of communications via the microkernel (which initializes a DMA channel). This, as noted later, incurs a significant setting-up delay and becomes particularly prohibitive for small data packets.

TABLE I
Interprocessor Communication Times with Various Links (LP1) and Relative to the C40–C40 Parallel Pair of Lines of Communication (LP2)

Link      C40–C40(2)  C40–C40(1)  T8–T8(2)  C40–T8(2)  C40–T8(1)  i860–T8(2)
Time (s)  0.0018      0.1691      0.018     0.031616   0.208      0.0268
LP1/LP2   1           93.889      10.00     17.5644    115.556    14.889

To investigate the interprocessor communication speed over the links indicated above, 4000 floating-point data elements were used. The communication time was measured as the total time in sending the data from one processor to another and receiving it back. In the case of the C40–T8 and C40–C40 communications, the speed of a single line of communication was also measured using bidirectional data transmission in each of the 4000 iterations; this was achieved by changing the direction of the link at every iteration when sending and receiving the data. Table I shows the communication times for the various links, where (1) represents a single bidirectional line of communication and (2) represents a pair of unidirectional lines of communication.

Note that, as expected, the C40–C40 parallel pair of lines performed fastest and the C40–T8 serial-to-parallel single line slowest of these communication links. The relative communication times show that the C40–C40 parallel pair of lines is 10 times faster than the T8–T8 serial pair and nearly 15 times faster than the i860–T8 shared-memory pair. The slower performance of the shared-memory pair, as compared to the T8–T8 serial pair, is due to the extra time required in writing data to and reading data from the shared memory. The C40–T8 serial-to-parallel pair involves a transformation from serial to parallel when transferring data from the T8 to the C40, and vice versa when transferring data from the C40 to the T8, making the link about 17.56 times slower than the C40–C40 parallel pair.

As noted, the C40–C40 parallel single line of communication performs about 94 times slower than the C40–C40 parallel pair of lines. This is due to the utilization of a single bidirectional line in which, in addition to the sequential nature of the process of sending and receiving data, extra time is required for altering the direction of the link (data flow). Moreover, there is a setting-up delay for each communication performed, on the order of 10 times the actual transmission time. These aspects are also involved in the C40–T8 serial-to-parallel single line of communication, which performed 115.556 times slower than the C40–C40 parallel pair owing to the extra time required for the serial-to-parallel transformation and vice versa.
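The round-trip measurement described above reduces to a simple ping-pong loop. The sketch below is a hedged reconstruction: link_send, link_recv, and timer_now are placeholders for the platform's channel and timer primitives (e.g., channel I/O on a transputer), not APIs taken from the paper.

```c
#include <stddef.h>

#define N_ELEMS 4000                 /* block size used in the paper */

extern void   link_send(const float *buf, size_t n);  /* placeholder */
extern void   link_recv(float *buf, size_t n);        /* placeholder */
extern double timer_now(void);       /* seconds; placeholder         */

/* Master side: send the block to the partner PE, which echoes it back
 * unchanged; no computation overlaps the communication. */
double pingpong_time(void)
{
    static float buf[N_ELEMS];
    double t0 = timer_now();
    link_send(buf, N_ELEMS);
    link_recv(buf, N_ELEMS);
    return timer_now() - t0;         /* round-trip time for the block */
}
```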

5.2. Performance and Communication Overheads

To investigate the real-time implementation of the simulation algorithm, an aluminum cantilever beam was considered. The beam was divided into 19 segments and a sample period of \Delta t = 0.3 ms was used. The total execution times achieved by the architectures in implementing the simulation algorithm over 20000 iterations were considered. The algorithm thus consists of computing the deflection of nineteen equal-length segments, where the computation for each segment requires information from the two previous and two forward segments. In implementing the algorithm on a multiprocessor architecture, the load was distributed so as to balance the communication and computation work among the PEs. To reduce communication overhead, messages were passed in blocks of two segments from one processor to another, rather than one segment at a time. Using this strategy, the algorithm was implemented on networks of up to nine T8s.

To investigate the computation time and the computation with communication overhead, the algorithm was partitioned into fine grains (one segment as one grain). Taking the computation for one segment as a base, the grains were then computed on a single T8, increasing the number of grains from 1 to 19. This, referred to as the theoretical computation time, is shown in Fig. 1 together with the corresponding actual computation time for 1 to 19 segments. Note that the theoretical computation time is greater than the actual computation time; this is due to the RISC nature of the T8 processor. The actual computation time was then utilized to obtain the actual computation for the multiprocessor system without communication overhead. Figure 2 shows the real-time performance (computation with communication overhead) and the actual computation time with 1–9 PEs. The difference between the real-time performance and the actual computation time is the communication overhead, shown in Fig. 3. It is noted in Figs. 2 and 3 that, due to communication overheads, the computing performance does not increase linearly with an increase in the number of PEs.
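The geometric (contiguous-segment) load distribution described above can be sketched as follows; the function and its output format are illustrative, not from the paper.

```c
#include <stdio.h>

/* Assign nseg contiguous beam segments to nproc pipeline PEs as evenly
 * as possible; the first (nseg % nproc) PEs receive one extra segment. */
static void partition(int nseg, int nproc)
{
    int base = nseg / nproc, extra = nseg % nproc, start = 1;
    for (int p = 0; p < nproc; p++) {
        int len = base + (p < extra ? 1 : 0);
        printf("PE %d: segments %d..%d\n", p + 1, start, start + len - 1);
        start += len;
    }
}

int main(void)
{
    partition(19, 9);   /* the 19-segment beam on the nine-T8 pipeline */
    return 0;
}
```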

FIG. 1. Execution time for flexible beam segments on a single T8.

FIG. 2. Execution time of the simulation algorithm on the transputer network.

The performance remains at nearly the same level for networks with more than 6 PEs. Note that the increase in communication overhead at the beginning, with fewer than 3 PEs, is more pronounced, and the overhead remains at nearly the same level with more than 5 PEs. This is because the communication overheads among the PEs occur in parallel.

The speedup and corresponding efficiency of the execution time for the network of up to 9 T8s are shown in Table II. As discussed earlier, an increase in the number of transputers has resulted in a decrease in the execution time. The relation, however, is not always linear. This implies that, with an increase in the number of processors, or with a transformation of the algorithm from coarse grain to fine grain, more and more communication is demanded. This is further evidenced in Table II, which shows a nonlinear variation in speedup and efficiency, implying that the algorithm, considered as a fixed load, is not well suited to this hardware, owing to communication overheads and run-time memory management problems.

5.3. Compiler

In this section, results of the performance evaluation of several compilers are presented and discussed. The compilers involved are 3L Parallel C version 2.1, INMOS ANSI C, and Occam. The simulation algorithm was coded in the three programming languages and run on a T8. The execution times thus achieved with Parallel C, ANSI C, and Occam were 3.4763, 3.6801, and 5.38 s, respectively. Note that the performances with Parallel C and ANSI C are at nearly the same level, and about 1.5 times faster than with Occam.

To investigate this further, the simulation algorithm was implemented on networks of 1–9 T8s using ANSI C and Occam. The algorithm was partitioned geometrically, with each processor allocated a defined spatial length of the beam.

FIG. 3. Communication overhead of the transputer network.

TABLE II
Speedup and Efficiency of the T8 Network for the Simulation Algorithm

Number of T8s  Two   Three  Four   Five   Six   Seven  Eight  Nine
Speedup        1.55  1.73   1.935  1.943  2.1   2.15   2.174  2.19
Efficiency     77%   58%    48%    39%    35%   31%    27%    24%

The implementations of the partitioned algorithm in Occam and C attempted to preserve the algorithmic structure as much as possible. The execution times achieved are shown in Fig. 4, and Table III shows the corresponding speedup and efficiency achieved with the two compilers. This further demonstrates that Occam produces slower executable code for this particular numerical computation, in a PP environment, as compared to ANSI C. This is due to the integer/floating-point operations and the run-time memory management of the transputer, for which the C compiler is more efficient than Occam. It is noted, however, that the performance with Occam using nine transputers is better than with ANSI C. This implies that with the reduction of grain size, i.e., the reduction of run-time memory management, the performance is enhanced with the Occam compiler.

This was further investigated by computing a linear algebraic equation. Table IV shows the performances with integer and floating-point operations, with and without arrays. It is noted that better performance is achieved throughout with the ANSI C compiler than with Occam, except where the computation involves floating-point data processing with an array. As compared to ANSI C, better performance is achieved with 3L Parallel C for both integer- and floating-type computation with an array, while better performance is achieved with the Occam compiler for floating-type computation as compared to 3L Parallel C. It is also noted that, when the amount of data handled is doubled, 1.9 times more execution time is required with Occam, against 1.87 times with each of ANSI C and 3L Parallel C. This implies that, for large amounts of data handling, the run-time memory management problem is handled more efficiently by 3L Parallel C and ANSI C than by the Occam compiler.

FIG. 4. Performance of ANSI C and Occam compilers in implementing the simulation algorithm on the network of the T8s.

TABLE III
Comparative Speedup and Efficiency of ANSI C and Occam

               Speedup           Efficiency
Number of T8s  ANSI C   Occam    ANSI C   Occam
Two            1.550    1.871    77%      93%
Three          1.730    2.32     58%      77%
Four           1.935    2.795    48%      70%
Five           1.943    2.795    39%      70%
Six            2.097    3.025    35%      50%
Seven          2.159    3.094    31%      44%
Eight          2.179    3.100    27%      39%
Nine           2.189    3.377    24%      38%

TABLE V
Execution Times of the i860 and Speedup in Implementing the Algorithms with and without Optimization

Algorithm                                Beam simulation  Identification  Beam control  Correlation  FFT
Execution time (s) with optimization     0.38             0.35            0.41          0.66         0.05
Execution time (s) without optimization  0.45             0.39            0.48          1.07         0.05
Speedup                                  1.18             1.083           1.17          1.62         1.0

5.4. Optimization

The code optimization facility of a compiler for given hardware is another important aspect of real-time implementation; optimization facilities almost always enhance real-time performance. To investigate this facility, two high-performance processors, namely an i860 and a C40, with their respective compilers, were considered. The i860 and the C40 have many optimization features [13, 21, 22]. The five algorithms introduced earlier were used to investigate the performance of the architectures with and without code optimization. Table V shows the execution times of the i860 processor in implementing the algorithms, and the speedup achieved with code optimization as compared to without. It is noted that the performance variations in implementing different algorithms with and without code optimization are not similar. Table VI shows the corresponding performance and speedup for the C40 processor with its compiler. Further investigations revealed that the level of enhancement achieved varied with the level of optimization for a given algorithm and hardware, and that the optimization level achieving the best performance for a given algorithm was not the same for another algorithm.

6. CONCLUSION

This paper has investigated the critical issues of performance evaluation in real-time PP. The performance of four different communication links, namely serial, parallel, serial-to-parallel, and shared memory, has been investigated. The communication overhead in parallel architectures has been explored within a transputer network using an FD simulation algorithm of a flexible beam in transverse vibration, and the partitioning and mapping process, speedup, and efficiency of this implementation have also been presented. It has been demonstrated that an increase in the number of transputers results in a decrease in the execution time. The relation, however, is not always linear, implying that, with an increase in the number of processors or a transformation of the algorithm from coarse grain to fine grain, more and more communication is demanded.

The performance and efficiency of the 3L Parallel C, INMOS ANSI C, and Occam compilers in numerical computation have been investigated and discussed. Comparative results have been presented in which the C compilers demonstrated better performance for numerical computation, as compared to Occam, within the application considered. Code optimization capabilities of the hardware and corresponding software have also been investigated. It has been revealed that, although code optimization enhances the performance, the level of enhancement depends on the nature of the algorithm in relation to the capabilities and characteristics of the hardware.

TABLE IV
Comparison of Compilers for Different Types of Data Processing

               Floating-type data processing      Integer-type data processing
               With array      Without array      With array      Without array
Compiler       20000/40000     20000/40000        20000/40000     20000/40000
3L Parallel C  0.1327/0.2488   0.1263/0.2333      0.1327/0.2488   0.1263/0.2333
ANSI C         0.1328/0.2444   0.0227/0.0227      0.1328/0.2444   0.0226/0.0226
Occam          0.1078/0.2052   0.1078/0.2044      0.1960/0.3825   0.1905/0.3698

Note. Equation used: z = (x + i*y − x*i)/(x*x + y*y); i = 0, 1, 2, ..., 20000. For the 40000 case the same equation is repeated a second time with another variable z1, i.e., z1 = (x + i*y − x*i)/(x*x + y*y); i = 0, 1, 2, ..., 20000. Values used: integer: x = 55, y = 25; floating: x = 55.02562, y = 25.13455.
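The micro-benchmark in the note above is simple to reconstruct. The following is a hedged sketch of the floating-point variants; only the expression, iteration count, and operand values come from the paper, while the loop structure, names, and the reading of "with array" as storing each result into an array element are ours.

```c
#define ITERS 20000

/* "Without array": a scalar z is overwritten on every iteration. */
double bench_float_scalar(void)
{
    double x = 55.02562, y = 25.13455, z = 0.0;
    for (int i = 0; i < ITERS; i++)
        z = (x + i * y - x * i) / (x * x + y * y);
    return z;
}

/* "With array": each result lands in its own element, so a store is
 * issued per iteration (z must hold at least ITERS elements). */
void bench_float_array(double *z)
{
    double x = 55.02562, y = 25.13455;
    for (int i = 0; i < ITERS; i++)
        z[i] = (x + i * y - x * i) / (x * x + y * y);
}
```

The near-identical "without array" times for ANSI C in Table IV (0.0227 s for both 20000 and 40000 iterations) suggest that this compiler hoists or folds the repeated scalar assignment, which is exactly the kind of compiler-dependent effect the comparison is probing.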

TABLE VI
Speedup of the C40 with Its Compiler due to Optimization

Algorithm                                Beam simulation  Identification  Beam control  Correlation  FFT
Execution time (s) with optimization     2.3              0.179           2.68          1.8387       1.617
Execution time (s) without optimization  2.37             0.48            2.72          4.0247       3.549
Speedup                                  1.03             2.68            1.02          2.19         2.19


7. APPENDIX: NOMENCLATURE

C         Controller transfer function.
E_N       Efficiency with N processors.
F(x, t)   Applied force at a distance x from the fixed end of the beam at time t.
i, k      Indices.
m         Mass of the beam, a constant.
n         Beam section.
N         Number of processors, tasks, or period of a sequence.
Q_0, Q_1  Transfer functions of system models.
S         Stiffness matrix.
S_N       Speedup with N processors.
T_1, T_N  Execution time with a single processor, with N processors.
Y_k       Deflection of the end of sections 1 to n of the beam at time step k.
Δt        Differential time.
Δx        Differential distance.
λ         A parameter related to Δt, Δx, and µ.
µ         Beam constant.

REFERENCES

1. Agrawal, D. P., Janakiram, V. K., and Pathak, G. C. Evaluating the performance of multicomputer configurations. IEEE Comput. 19(5) (1986), 23–37.
2. Ching, P. C., and Wu, S. W. Real-time digital signal processing system using a parallel architecture. Microprocess. Microsystems 13(10) (1989), 653–658.
3. Cvetanovic, Z. The effects of problem partitioning, allocation, and granularity on the performance of multiple-processor systems. IEEE Trans. Comput. 36(4) (1987), 421–432.
4. Denning, P. J. Parallel computing and its evolution. Comm. ACM 29(12) (1986), 1163–1167.
5. Grama, A. Y., Gupta, A., and Kumar, V. Isoefficiency: Measuring the scalability of parallel algorithms and architectures. IEEE Parallel Distrib. Tech. 1(3) (1993), 12–21.
6. Gustafson, J. L. The consequences of fixed-time performance measurement. Proceedings of the 25th International Conference on Systems Sciences, III. IEEE Computer Society Press, Los Alamitos, CA, 1992, pp. 113–124.
7. Hockney, R. W., and Jesshope, C. R. Parallel Computers 2: Architecture, Programming, and Algorithms. Adam Hilger, Bristol, 1988.
8. Hwang, K. Advanced Computer Architecture: Parallelism, Scalability, Programmability. McGraw–Hill, New York, 1993.
9. Hwang, K., and Briggs, F. A. Computer Architecture and Parallel Processing. McGraw–Hill, New York, 1985.
10. Ifeachor, E. C., and Jervis, B. W. Digital Signal Processing—A Practical Approach. Addison–Wesley, UK, 1993.
11. Nocetti, G. D. F., and Fleming, P. J. Performance studies of parallel real-time controllers. Proceedings of IFAC Workshop on Algorithms and Architectures for Real-Time Control, Bangor, UK, 1991, pp. 249–254.
12. Nussbaum, D., and Agarwal, A. Scalability of parallel machines. Comm. ACM 34(3) (1991), 57–61.
13. Portland Group Inc. PG Tools User Manual. Portland Group Inc., 1991.
14. Proakis, J. G., and Manolakis, D. G. Introduction to Digital Signal Processing. Macmillan, New York, 1988.
15. Stone, H. S. High-Performance Computer Architecture. Addison–Wesley, Reading, MA, 1987.
16. Sun, X.-H., and Gustafson, J. Toward a better parallel performance metric. Parallel Comput. 17(10) (1991), 1093–1109.
17. Sun, X.-H., and Ni, L. Scalable problems and memory-bounded speedup. J. Parallel Distrib. Comput. 19(1) (1993), 27–37.
18. Sun, X.-H., and Rover, D. T. Scalability of parallel algorithm–machine combinations. IEEE Trans. Parallel Distrib. Systems 5(6) (1994), 599–613.
19. Texas Instruments. TMS320C40 User's Guide. Texas Instruments, USA, 1991.
20. Tokhi, M. O., and Hossain, M. A. Self-tuning active vibration control in flexible beam structures. Proc. IMechE-I: J. Systems Control Eng. 208(I4) (1994), 263–277.
21. Texas Instruments. TMS320C4x User's Guide. Texas Instruments, USA, 1991.
22. Texas Instruments. TMS320 Floating-Point DSP Optimizing C Compiler User's Guide. Texas Instruments, USA, 1991.
23. Worley, P. H. The effect of time constraints on scaled speedup. SIAM J. Sci. Statist. Comput. 11(5) (1990), 838–858.

M. O. TOKHI obtained his B.Sc. in electrical engineering from Kabul University, Afghanistan in 1978 and his Ph.D. in control engineering from Heriot–Watt University, U.K. in 1988. He has worked as a lecturer at Kabul University and at Glasgow College of Technology (U.K.), and as a sound engineer in industry. He is currently employed as a senior lecturer in the Department of Automatic Control and Systems Engineering, The University of Sheffield (U.K.). His research interests include active noise and vibration control, real-time signal processing and control, parallel processing, system identification, and adaptive/intelligent control. He has authored or coauthored over 125 publications, including a book, in these areas. He is a chartered engineer, a corporate member of the IEE, and a member of the IEEE and the IIAV.

M. A. HOSSAIN obtained his M.Sc. from the Department of Applied Physics and Electronics, University of Dhaka, Bangladesh in 1987 and his Ph.D. from the Department of Automatic Control and Systems Engineering, The University of Sheffield, U.K. in 1995. He worked as a lecturer in the Department of Applied Physics and Electronics, University of Dhaka, Bangladesh from 1988 to 1995 and is currently an assistant professor in the Department of Computer Science, University of Dhaka, Bangladesh. His research interests include real-time signal processing and control, parallel processing, and adaptive active control.

M. J. BAXTER graduated from Bangor University, U.K. in 1990 with an Honours degree in computer systems engineering. He then worked in industry on hardware and software systems for instrumentation and control. In 1991 he returned to academia to pursue a Ph.D. research programme at the Department of Automatic Control and Systems Engineering, The University of Sheffield, U.K., where he was subsequently recruited as a research associate in 1993. His primary research interests are real-time systems, digital signal processing, parallel processing, heterogeneous systems, genetic algorithms, and computer-aided software engineering.

P. J. FLEMING received his B.Sc. in electrical engineering and his Ph.D. in engineering mathematics from Queen's University, Belfast in 1969 and 1973, respectively. He is currently Professor of Industrial Systems and Control in the Department of Automatic Control and Systems Engineering at the University of Sheffield, where he is Head of Department and also Director of the Rolls-Royce University Technology Centre for Control and Systems Engineering. His research interests include software for control system design and implementation, distributed and parallel processing for real-time control and instrumentation, and control applications of genetic algorithms and optimization. He has authored or coauthored over 150 publications, including four books, in these research areas, and is a Fellow of the Institution of Electrical Engineers and of the Institute of Measurement and Control.

Received March 28, 1995; revised January 23, 1997; accepted January 27, 1997