North-Holland Microprocessing and Microprogramming 25 (1989) 347 - 352
A Heuristic Algorithm for Real-Time Application Allocation to Multimicrocomputer

S. Vraneš
Institut "Mihailo Pupin", Computer Department
Volgina 15, 11060 Beograd, Yugoslavia

Multimicrocomputers, due mainly to their low cost and large potential for high throughput, better reliability and incremental system growth, have emerged as an attractive computing means for critical real-time control applications. In developing a real-time multimicrocomputer, many design issues must be addressed. As it has a great impact on response time, the allocation of workload to processors is one of the most important issues, and it must be properly resolved in order to meet very severe performance requirements. Without careful consideration, a module allocation can cause computer saturation due to excessive interprocessor communication. The major contribution of this paper is the solution of the task allocation problem in the presence of real-time constraints. Our static, deterministic, heuristic algorithm iteratively searches for a task allocation that respects the precedence relationships and yields a lower response time by reducing interprocessor communication while balancing processor load.
1. INTRODUCTION
The complexity and sophistication of the real-time data processing problems encountered particularly in computer-based weapon systems, aircraft guidance and control, electric power generation and distribution, etc., severely tax all aspects of parallel-processing technology. These systems may have response times as low as a few milliseconds, with throughput requirements ranging up to tens of millions of instructions per second. Since task allocation severely affects the response time, it is of critical importance in real-time multiprocessor applications, and allocation decisions are not easily made. Few allocation algorithms have been developed specifically for real-time applications, and even fewer have used any knowledge of application and computer system characteristics to improve allocation. Many of the models do not include precedence relationships among modules, and the more realistic of them ignore the non-time-dependent constraints of the system. Thus, current module assignment techniques usually neglect one or more of the important factors. This has motivated us to investigate a module allocation technique that takes into account all the key parameters mentioned above.

In this paper we consider a deterministic allocation algorithm which embodies two modelling domains: application and physical. As dynamic allocation techniques impose overheads that create problems in satisfying real-time requirements, we adopt a static allocation policy. Besides, in a real-time multimicrocomputer, a fixed set of application program modules resides permanently on the system (the system is solely dedicated to a given task) and all the attributes of the tasks are known a priori. It is much better to perform extensive and time-consuming analyses prior to run time, because if the allocation is done in real time, the cost of the analysis may overshadow the gain of using a multimicrocomputer system. Therefore, dynamic allocation, although general, is not acceptable in real-time systems. Although static allocation can be performed manually, we tried to automate the procedure as much as possible, in order to reduce the considerable expertise required for writing a real-time multimicrocomputer application of substantial size.

It is believed that the allocation problems for multiprocessor systems are intractable, in the sense that no polynomial-time algorithm can solve them (they are NP-complete). Moreover, the real-time applications under
investigation involve various logical and precedence relationships among modules, which further increases the complexity of the allocation problem. Therefore, we develop a heuristic algorithm for obtaining approximate solutions in a computationally efficient way. Unlike the graph-theoretic and mathematical programming approaches, the heuristic approach to the allocation problem aims only to find a suboptimal solution. Yet heuristic approaches are faster, more extensible and simpler than optimal solution techniques.
2. THE ALLOCATION ENVIRONMENT

To provide a basis for the automation of task allocation, models of the application and physical domains (the allocation environment) are built. We shall assume that the real-time application under consideration is well partitioned into tasks. That means that the inherent parallelism is exploited, the tasks' execution times are similar (to facilitate load balancing), and synchronizations and inter-task communications are eliminated as much as possible. Given the real-time constraints of the application task, a partitioning is considered feasible if the length of the critical path is less than the application hard deadline.

The logical structure and precedence relationships among modules are represented by a graph model of computation (Computation control graph [1], Control flow graph [2], Process graph [3]), with nodes representing tasks and arcs representing precedence relationships among them. Task processing requirements are represented as weights associated with nodes in the process graph. Weights associated with arcs are used to represent communication loads. This pictorial representation of the real-time application is contained in the application description file. Module hardware demands are formulated in terms familiar to programmers (components are selected by attribute, not by name), so that programmers state as little about the hardware as possible.

From the application description file, the following information is extracted:

m        total number of application tasks (graph nodes);
C(m,m)   matrix of data transmission amounts between tasks (each cell denotes the number of words communicated between task i and task j);
a        the number of arcs in the graph (dependencies);
P(a,2)   precedence relationships indicated by the partial ordering (the left member of the pair is the predecessor module and the right member is the successor module);
M(m)     maximum memory used by each task;
E(m)     execution time limit for each task (lines of machine language instructions are convenient units);
R        redundancy matrix; since module replications may be required for some real-time applications, we shall extend the allocation algorithm to handle module replications;
E(m,m)   exclusiveness matrix; to improve reliability, some critical processes must not be collocated;
hd       hard deadline of the application task.

In the configuration description file, the physical parameters of the multimicrocomputer hardware are recorded. Prior to allocation, all relevant data are copied into memory:

n        total number of processors; a simple and rough estimate of the bounds on the number of processors is as follows: the maximum value is equal to the maximum height of the load density function, and the minimum value is equal to the sum of the execution times of all modules divided by the length of the critical path. More complicated formulas that give sharper bounds require more calculation time. The algorithm starts with the minimum value. If the hard deadline cannot be met (the waiting times inserted in order to satisfy precedence relationships extend the critical path length too much), the number of processors is increased;
M(n)     memory capacity of the private memory of each processor;
D(n,n)   matrix of transmission delays between processors (describes the "distance" between processors as a cost of communication in seconds per word);
U(n)     permitted utilization of each processor;
N(p,n)   peripheral connectivity matrix (p is the number of different types of peripheral controllers);
S(n)     speed of each processor in instructions per second;
w        relative weight of computation and communication costs.
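
The rough bounds on the number of processors described for the configuration data can be illustrated with a small worked example. This is only a sketch: the function name `processor_bounds`, the sample task set and the rounding up of the lower bound to a whole processor are assumptions made for illustration and are not taken from the paper.

```python
import math

def processor_bounds(exec_times, critical_path_len, load_density):
    """Return (minimum, maximum) rough bounds on the number of processors."""
    # Minimum: total execution time divided by the critical path length.
    # Rounding up is an assumption (a fractional processor cannot be used).
    lower = math.ceil(sum(exec_times) / critical_path_len)
    # Maximum: the largest number of tasks that are active at the same time,
    # i.e. the maximum height of the load density function.
    upper = max(load_density)
    return lower, upper

# Hypothetical six-task application, times in arbitrary units.
exec_times = [4, 3, 3, 2, 2, 2]
load_density = [2, 4, 4, 3, 2, 1]  # active tasks in each unit time interval
bounds = processor_bounds(exec_times, critical_path_len=6,
                          load_density=load_density)
print(bounds)  # (3, 4)
```

With these numbers the search for a feasible allocation would start with three processors and would not need more than four.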
The features of the application task and the multimicrocomputer hardware define a multidimensional space, and the allocation problem can be formulated as an optimization problem over that space: determine the allocation matrix X(m,n) that optimizes the prescribed criteria. As it is not possible to completely satisfy the conflicting requirements of a real-time multimicrocomputer system (precedence constraints limit parallelization, load balancing requirements spoil IPC minimization, and the saturation effect increases the inefficiency of a system with many processors), a compromise has to be made: some criteria tolerances must be defined.

To minimize contention and speedup saturation, the algorithm merges heavily communicating modules. But if a new module is too large, the algorithm will not be able to produce a balanced load allocation. Therefore, the empirical threshold value for intermodule communication is

$\varepsilon_{imc} = (\text{average module execution time}) \times \alpha$

where a good range for $\alpha$ is from 1% to 10% [3]. If the average load of a processor is defined as the sum of the module execution times divided by the number of processors, the processor load threshold is

$\theta_l = (\text{average load}) \times \beta$

where $\beta$ should range from 80% to 120%. Similarly, as the timing parameters associated with the graph cannot be fixed until the allocation itself is complete, the critical path cannot be precisely determined and some tolerance has to be introduced. This is so because the decision on whether to choose the local or the global (bus) communication time for an arc depends on where the successor task will be allocated. A tolerance $\varepsilon_{cp}$ in the range of 5% to 10% is considered feasible [5]. Furthermore, if the achievable speedup of the multimicrocomputer is $\varepsilon_m$ (= 20%) less than the linear one, the system is considered saturated.

All tolerances mentioned above could easily be changed prior to allocation.

3. THE ALGORITHM

Before describing the algorithm, a few definitions are introduced:

- Critical path in a graph model of computation is the longest path from its entry node to its exit node;
- Load density function shows how many nodes are active in each time interval;
- Precedence relationships among modules of an application task are intermodule synchronization requirements, which require each module not to start its execution until all its preceding modules have finished their execution. Since the execution time of a process is variable, all the processes that have to synchronize at a given point wait for the slowest among them. This "worst case" computation speed is the basic weakness of synchronized algorithms, and it results in worse than expected speedup and processor utilization;
- Task response time is the time from a task invocation until the completion of its execution;
- Hard deadline is the maximum computer think time allowed to keep the controlled system within a "safe" region.
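
The definitions of the critical path and the load density function can be made concrete with a short sketch. It assumes integer execution times, ignores communication times on the arcs, and lets every task start as soon as all its predecessors have finished; the function names and the four-task graph are hypothetical and only serve as an illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def earliest_times(exec_time, preds):
    """Earliest start and finish time of every task.

    preds maps each task to the list of its immediate predecessors
    (every task must appear as a key, possibly with an empty list).
    """
    start, finish = {}, {}
    for task in TopologicalSorter(preds).static_order():
        # A task may start only after all its predecessors have finished.
        start[task] = max((finish[p] for p in preds[task]), default=0)
        finish[task] = start[task] + exec_time[task]
    return start, finish

def critical_path_length(exec_time, preds):
    """Length of the longest path from an entry node to an exit node."""
    _, finish = earliest_times(exec_time, preds)
    return max(finish.values())

def load_density(exec_time, preds):
    """Number of active tasks in each unit time interval [k, k+1)."""
    start, finish = earliest_times(exec_time, preds)
    horizon = max(finish.values())
    return [sum(1 for t in exec_time if start[t] <= k < finish[t])
            for k in range(horizon)]

# Hypothetical process graph: A precedes B and C, which both precede D.
exec_time = {"A": 2, "B": 3, "C": 1, "D": 2}
preds = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}

print(critical_path_length(exec_time, preds))  # 7  (path A, B, D)
print(load_density(exec_time, preds))          # [1, 1, 2, 1, 1, 1, 1]
```

In this example the maximum height of the load density function is 2, so at most two tasks are ever active at once.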
Let us now introduce our new real-time application allocation algorithm, which consists of the following steps:
1. Rewind the application description file and copy the graph model of computation into memory;
2. Rewind the multimicrocomputer hardware description file and copy relevant data into memory;
3. Define the allocation criteria threshold values ("search neighbor region");
4. Segregate the application and physical domain models and define the multidimensional feature space;
5. Preprocess the graph model of computation (if the original graph is fine grain) in order to reduce the time taken by the allocation process:
- merge eligible pairs with the largest IMC (> $\varepsilon_{imc}$) if all the predefined constraints are satisfied;
- the other criterion for lumping modules together could be the local minimization of the response times for the subgraph under consideration [6];
6. Generate a feasible initial task allocation; As some modules may have to be allocated to certain fixed processors in order to take advantage of their unique hardware capabilities (the multimicrocomputer considered here need not be homogeneous), it is convenient to allocate them first;
7. Build the linked list of processor descriptors and sort the list in ascending order of processor busy times (queue lengths);
8. Build the linked list of the unallocated modules with their neighborhood region (predecessor and successor modules) and sort the list in descending order of critical path length; Our algorithm does not rely on an exact calculation of the critical path, due to the lack of precision in determining communication time. To solve the problem one can reverse the graph [6] and perform the allocation on the reversed graph. This solves the problem because a module is allocated after its actual successors, and therefore its communication times are known.
9. Build the activity list that contains the finish times of all allocated modules;
10. Search for the best allocation in the "search neighbor region":
- choose the top processor (the next processor to be idle) from the processor list;
- select the modules from the top of the module list that fall into a prescribed range of critical path:
  • check non-time-dependent constraints; This is a good point to check any constraints which may eliminate an allocation before we proceed to the time-consuming process of checking real-time constraints and possibly inserting waiting time; Non-time-dependent constraints are:
    o peripheral demands; The H matrix is used to exert explicit control on resource allocation, so the matrix is checked to see if all of the module's hardware demands are honored;
    o memory constraints; It is very easy to determine if memory constraints are violated by comparing the proper cell of the memory capacity vector M to the storage requirements of all modules allocated to a particular processor;
    o mutual exclusiveness; By comparing the list of processes already allocated to the current processor with the proper row of the E matrix, we find out if the required mutual exclusiveness among modules is satisfied;
    o processor utilization; For reasons of reliability, a safety margin of processor idle time is kept. By calculating the new processor busy time, satisfaction of this requirement is easily checked.
    If any constraint mentioned above is violated, we reject the current allocation and choose another module candidate. If the module candidate has to be replicated (R matrix), we allocate it to the next processor on the list and repeat all the described checks for the other module instance.
  • check the P matrix and the activity list to see that the allocation does not violate any precedence relationship (all of the candidate module's immediate predecessors have completed execution);
- If no candidate is found, the smallest possible precedence waiting time (the intermodule synchronization delay due to the precedence relationships among modules) is inserted and the module that needs the smallest delay is chosen. After this insertion a new length of the critical path is calculated, as the delay can cause a violation of the application hard deadline. If this is the case, the allocation is rejected and the whole procedure is repeated with the next processor on the list.
- If there is more than one candidate, the algorithm checks the distance [7] between the candidate modules and the modules previously allocated to the current processor, and chooses the one with the smallest distance. If the distances are comparable, another criterion is introduced. Experiments have revealed [4] that allocating two consecutive modules to the same processor will yield a good response time if the execution time of the second module is much larger than that of the first one. Applying this rule, the final decision is made.
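
The non-time-dependent checks of step 10 can be sketched as a single filter applied to each candidate module before the more expensive precedence checks. This is a simplified illustration under stated assumptions: all names are hypothetical, the peripheral-demand check is omitted, and processor utilization is assumed to be the processor busy time divided by the hard deadline, which the paper does not specify.

```python
def passes_static_checks(task, proc, on_proc, mem_need, mem_cap,
                         exclusive, exec_time, busy, max_util, deadline):
    """Return True if placing `task` on `proc` violates no static constraint.

    on_proc   -- tasks already allocated to `proc`
    exclusive -- set of frozenset pairs that must not be collocated
    busy      -- current busy time of `proc`
    """
    # Memory: total storage of all tasks on the processor must fit.
    used = sum(mem_need[t] for t in on_proc) + mem_need[task]
    if used > mem_cap[proc]:
        return False
    # Mutual exclusiveness: no forbidden pair may share the processor.
    if any(frozenset((task, t)) in exclusive for t in on_proc):
        return False
    # Utilization (assumed definition): new busy time over the hard deadline
    # must stay within the permitted utilization of the processor.
    if (busy + exec_time[task]) / deadline > max_util[proc]:
        return False
    return True

# Hypothetical data for one processor that already holds task "t1".
mem_need = {"t1": 40, "t2": 30, "t3": 50}
exec_time = {"t1": 5, "t2": 4, "t3": 6}
common = dict(proc="p0", on_proc=["t1"], mem_need=mem_need,
              mem_cap={"p0": 100}, exclusive={frozenset(("t1", "t3"))},
              exec_time=exec_time, busy=5, max_util={"p0": 0.8})

print(passes_static_checks("t2", deadline=20, **common))  # True
print(passes_static_checks("t3", deadline=20, **common))  # False (exclusive with t1)
print(passes_static_checks("t2", deadline=10, **common))  # False (utilization 0.9)
```

Candidates that fail this filter are rejected early, so waiting times are computed only for modules that could legally reside on the chosen processor.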
When the proper task is allocated, the busy time of the processor is updated, the processor is reinserted at the appropriate position in the processor list, the activity list is updated and the allocated task is removed from the task list. The procedure is repeated until the task list is empty.
11. Check the load balancing; The result of our algorithm must also satisfy the load balancing constraints ($\theta_l$). If not, the last operation of our algorithm is to reallocate some modules from the processor with the longest average waiting time to the processor with the shortest average waiting time. As such a module reallocation may not necessarily provide a shorter response time, because it may increase interprocessor communication, the task response time is recomputed for each reallocation. The algorithm continues to reallocate modules until the solution converges to a balanced load distribution.
12. Calculate performance measures (speedup [8] and response time [3]).
13. Output the suboptimal allocation, linking and loading directives. Programmers have the ability to check and prove the allocation decision manually.

4. SUMMARY

Given a multimicrocomputer system made up of n processors and a real-time program made up of m modules, the problem of allocating the program over the processors in order to improve performance is addressed in this paper. We hope that our results extend prior research in this area by explicitly taking real-time applications into account and permit the efficient utilization of multimicrocomputer architectures for a wide range of real-time problems of practical interest.

REFERENCES
[1] Alex Kapelnikov, "Analytic Modeling Methodology for Evaluating the Performance of Distributed Multiple Computer Systems", PhD Dissertation, UCLA, Computer Science Dept., December 1986.
[2] Wesley Chu, Kim Leung, "Module Replication and Assignment for Real-Time Distributed Processing Systems", Proceedings of the IEEE, Vol. 75, No. 5, May 1987.
[3] Leslie Joan Holloway, "Task Assignment in a Resource Limited Distributed Processing Environment", PhD Dissertation, UCLA, Computer Science Dept., 1982.
[4] Lance Min-Tsun Lan, "Characterization of Intermodule Communication and Heuristic Task Allocation for Distributed Real-Time Systems", PhD Dissertation, UCLA, Computer Science Dept., 1985.
[5] Milos D. Ercegovac, "Multiprocessor System Evaluation and Programming Environment", UCLA, Computer Science Dept., Technical Report CSD-86~066, April 1986.
[6] T.M. Ravi, M.D. Ercegovac, T. Lang, R.R. Muntz, "Static Allocation for a Data-Flow Multiprocessor System", UCLA, Computer Science Dept., Technical Report CSD-860028, November 1986.
[7] V.B. Gylys, J.A. Edwards, "Optimal Partitioning of Workload for Distributed Systems", Proc. Compcon Fall 76, pp. 353-357.
[8] M. Ajmone Marsan, G. Balbo, G. Conte, "Comparative Performance Analysis of Single Bus Multiprocessor Architectures", IEEE Transactions on Computers, Vol. C-31, December 1982, pp. 1179-1191.