Signal Processing: Image Communication 8 (1996) 113-130
Elsevier
High-speed moving picture coding using adaptively load balanced multiprocessor system
S.H. Choi a,* and K.T. Park b
a LG Electronics Inc., Image and Media Laboratory, 16 Woomyeon-Dong, Seocho-Gu, Seoul, 137-140 South Korea
b Department of Electronic Engineering, Yonsei University, 134 Shinchun-Dong, Seodamun-Gu, Seoul, 127-749 South Korea
Abstract
A multiprocessor system for high-speed processing of hybrid picture coding algorithms such as H.261, MPEG or digital HDTV is presented in this study. Using a combination of a highly parallel 32-bit microprocessor, DCT and motion estimation function specific devices, a new processing module is designed for a high-performance coding system. We constructed the motion picture coder using the geometrical parallel processing technique since a single module alone cannot perform hybrid encoding algorithms at high speed, and also analyzed the processing time and communication overhead. Theoretical calculations and experimental results show that the efficiency of geometrical parallel processing system falls off as the difference of the computational amount among regions allocated to each processing module increases. An adaptive load balancing technique is proposed to resolve this performance degradation in geometrical partitioning which results from unbalanced processing time between regions. In the proposed algorithm, load estimation using DCT coefficients and load reallocation using the linear programming method are employed for optimal load balancing so that each processing module has a balanced processing time. A more balanced processing time is obtained using the adaptive load balancing technique compared to the method using only geometrical partitioning. This results in an increase of overall efficiency.
1. Introduction
Generally, there are two methods of implementing picture coding systems in hardware. The first uses high-performance processors and the second uses application specific integrated circuits (ASICs). The former has the advantages of easy development and expansion, but the cost of system implementation is rather high and the performance of the developed system is limited.
*Corresponding author.
Implementation using ASICs is usually done with VLSI design tools such as VHDL or with various function specific devices. This method offers high-speed coding and is convenient for mass production. However, modifying the algorithm is difficult and a great deal of development effort is needed [1]. In this study, function specific devices are used for simple and repetitive operations such as motion estimation and DCT, and flexible high-speed processors are used for operations requiring complex data structures and control. Hence, the implemented system allows easy algorithm modification together with high-speed execution [2].
0923-5965/96/$15.00 © 1996 Elsevier Science B.V. All rights reserved. SSDI 0923-5965(95)00040-2
However, since a single processing module alone cannot process motion pictures at high speed, a geometric parallel processing method, which distributes the data to be processed equally among all the processing modules, is employed. Because all the modules execute the same algorithm, modification and expansion of the algorithm are easy, and the increase in performance is nearly linear in the number of modules when the amount of data to be processed is uniform across modules. In motion picture coding, inverse quantization, IDCT, picture reconstruction, RLC, and VLC are all skipped when the quantized coefficients are all zero. Therefore, allocating an equal amount of picture data to each module does not necessarily result in equal processing time, because the computational load is unbalanced, and the overall system performance does not increase linearly with the number of modules. To resolve this imbalance in computational load, load balancing is used. Load balancing can be applied to parallel processing systems with unequal computational load among processors so that system performance increases in proportion to the number of processors. In this method, the rate of performance increase is determined by the degree of load imbalance, and is also affected by overhead costs such as load prediction and communication. It is, therefore, critical to choose processors with small communication overhead and to select a load prediction measure with small computational load.
In this paper, the computational and communication load in geometrical processing of motion picture coding is analyzed, along with the performance degradation caused by changes in the statistical characteristics of pictures. To resolve this performance degradation, the computational load imbalance is predicted using the correlation between the maximum value in the 8 x 8 spatial domain and the maximum range of the DCT coefficients, and an adaptive load balancing algorithm using linear programming is proposed to enhance the performance of the parallel processing motion picture coding system. Section 2 describes the geometrical partitioning along with the communication and computational load of motion picture coding. Section 3 discusses load prediction for adaptive load allocation, and Section 4 presents optimal load reallocation using the predicted load information. The experimental results and discussion are given in Section 5, and the conclusion is stated in Section 6.
2. Motion picture coding using parallel processing technique

2.1. Geometrical partitioning

Fig. 1 shows the parallel processing technique which solves a problem by appropriately distributing the data to the processors in the system and combining their partial results. In this case, all the processors execute the same algorithm and only the data allocated to each processor are different. Assuming that algorithm A is to be executed for the given data D, if the total data D is represented as the union of non-overlapping parts, $D = \bigcup_i d_i$, then the processing task for each processing module can be represented as a pair $(A, d_i)$ [3].

Fig. 1. The task allocation in geometrical processing. (The host processor distributes Data n-1, Data n and Data n+1 to Processor n-1, Processor n and Processor n+1.)

For P processors executing the same algorithm for the same amount of data, the total processing time $T_{total}$ is

$$T_{total} = T_{reg\_0} = T_{reg\_1} = \cdots = T_{reg\_k} = \cdots = T_{reg\_(P-1)}, \qquad T_{reg\_k} = \frac{T_{one}}{P} + T_{comm\_k}. \tag{1}$$

The processing time $T_{reg\_k}$ is the sum of the computation time ($T_{one}/P$) and the communication time $T_{comm\_k}$. If the amount of data to be processed by each of the P processors is not equal, and hence the processing time of each processor is not equal, the total processing time is then

$$T_{total} = \mathrm{MAX}(T_{reg\_0}, T_{reg\_1}, \ldots, T_{reg\_k}, \ldots, T_{reg\_(P-1)}),$$

$$T_{reg\_k} = \begin{cases} T_{proc\_k} + T_{comm\_k} & \text{for non-overlapped mode,} \\ \mathrm{MAX}(T_{proc\_k}, T_{comm\_k}) & \text{for overlapped mode.} \end{cases} \tag{2}$$

Here the overlapped mode is for cases where communication and computation can be executed simultaneously, such as with the Inmos Transputer or the TI TMS320C40. If the communication time is shorter than the computation time, the overall execution time will be equal to the computation time. The overall efficiency E of a partitioning method is given by

$$E = \frac{T_{one}}{P\,T_{total}}. \tag{3}$$

2.2. Task partitioning of picture data

In the task partitioning method, all the slave processing modules in the system execute the same algorithm with a portion of the entire data and pass the result to the master processor. There are two major methods of allocating picture data to processing modules [4]. Boxwise partitioning is a two-dimensional allocation scheme suited for parallel processing systems with a large number of processors. Stripwise partitioning is a one-dimensional allocation method which is a special case of the boxwise partitioning. Considering actual picture coding applications, the stripwise partitioning is appropriate for applications such as MPEG1 [6] where there are only a small number of macroblocks to process in a slice, but this is not true for applications with a large number of macroblocks in a slice, for example MP@ML (Main profile at Main level) and MP@HL (Main profile at High level) in MPEG2. In this case, because the number of slices allocated to a processing module is small, encoding only with stripwise partitioning will result in an excessive amount of data exchange between neighboring processing modules for motion estimation and compensation. Furthermore, with a large search region, data exchange must take place between non-neighboring processing modules. So the boxwise partitioning can increase the efficiency of a parallel coding system when there is a large number of macroblocks per slice. In this paper, a general boxwise partitioning scheme is presented (see Fig. 2). For boxwise allocation, the number of processors P is $2 \times 2^n$ (P = 2, 4, 8, 16, ... for n = 0, 1, 2, 3, ...), and $P = P_x P_y$, where $P_x$ is the number of processors in the x direction and $P_y$ is that in the y direction. The communication load required for data transfer between the main processor and slave processors can be represented iteratively by merging two previous results, and its total $T_{PI}$ is given by

$$T_{PI} = \beta[(D_o/P) + 2(D_o/P) + 4(D_o/P) + \cdots + 2^{(k-1)}(D_o/P)] = \beta(D_o/P)[1 + 2^1 + 2^2 + \cdots + 2^{(k-1)}] = \beta(D_o/P)(2^k - 1), \tag{4}$$

where

$$\beta = \frac{\text{1 byte transmission time}}{\text{processor cycle time}}.$$
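As a rough illustration of the timing model in (1)-(4), the following Python sketch evaluates the total processing time, the efficiency E and the collection time $T_{PI}$. The function names and all numeric values are illustrative assumptions, not measurements from this study, and the number of communication stages k is simply passed in as a parameter.

```python
def region_time(t_proc, t_comm, overlapped=False):
    """Per-region time, Eq. (2): sum in non-overlapped mode, max in overlapped mode."""
    return max(t_proc, t_comm) if overlapped else t_proc + t_comm


def total_time(t_procs, t_comms, overlapped=False):
    """Total time, Eq. (2): the slowest region determines when the picture is done."""
    return max(region_time(p, c, overlapped) for p, c in zip(t_procs, t_comms))


def efficiency(t_one, t_total, num_procs):
    """Efficiency, Eq. (3): E = T_one / (P * T_total)."""
    return t_one / (num_procs * t_total)


def collection_time(beta, d_out, num_procs, k):
    """Collection time, Eq. (4): beta * (Do/P) * (1 + 2 + ... + 2**(k-1))."""
    return beta * (d_out / num_procs) * sum(2 ** i for i in range(k))


# Hypothetical numbers: T_one = 100 time units on one module, P = 4 modules,
# 5 time units of communication per region.
T_ONE, P = 100.0, 4
balanced = total_time([25.0] * 4, [5.0] * 4)
unbalanced = total_time([40.0, 20.0, 20.0, 20.0], [5.0] * 4)
overlapped = total_time([25.0] * 4, [5.0] * 4, overlapped=True)

e_balanced = efficiency(T_ONE, balanced, P)
e_unbalanced = efficiency(T_ONE, unbalanced, P)
e_overlapped = efficiency(T_ONE, overlapped, P)
```

With these assumed values the unbalanced allocation has the lowest efficiency, and overlapping communication with computation hides the communication time entirely in the balanced case.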
Fig. 2. Allocation of picture data to the processors (processing modules 0 to 3): (a) a boxwise allocation; (b) a stripwise allocation.
The factor $\beta$ is included to quantitatively reflect the communication time along with the computation time, $D_o$ is the total output data size, and k is the parameter which represents the stage of communication and is related to the number of processors in the system. That is, the partial results spread over P slave processors are collected by the master through k communication stages. Therefore, k is the total number of communication stages and depends on the number of processors P:

P/2 partial results after completion of interval 1,
P/4 (= P/2^2) partial results after completion of interval 2,
...
P/2^k partial results after completion of interval k,

$$P/2^k = 1, \qquad k = \log_2 P. \tag{5}$$

The data collection algorithm is given by

For i = 0 to log2P do
For all processing modules j do in parallel (0 ...

Substituting k in (5) to (4) yields (6).

$$T_{PI} = \beta(D_o/P)(P - 1). \tag{6}$$

The processor number is given row by row. Except for the total amount of output data $D_o$, the amount of data transferred from the slave processors to the master processor can be found in the same way as above.

2.3. Motion picture coding using geometrical partitioning

Fig. 3 shows the task allocation of the hybrid motion picture coding algorithm [5]. The computational amount when implementing the algorithm shown in Fig. 3 on a single processing module is

$$T_{one} = [T_{ME} + T_{DECL} + T_{DCT} + T_{Q} + (T_{IQ} + T_{IDCT} + T_{RLC} + T_{HUFF}) \times (1 - SF)]N, \tag{7}$$

where $T_{ME}$ is the motion estimation time, $T_{DECL}$ is the MC/no MC, intra/nonintra decision time, $T_{(I)DCT}$ is the (I)DCT time for six 8 × 8 blocks, $T_{(I)Q}$ is the (I)Quantization time for six 8 × 8 blocks, $T_{RLC}$ is the RLC time for six 8 × 8 blocks, $T_{HUFF}$ is the Huffman coding time for six 8 × 8 blocks, SF is the skip factor (= 1 - number of coded blocks/number of blocks in picture) and N is the number of packets to be processed in a picture. The computational amount of $T_{(I)DCT}$ and $T_{(I)Q}$ equals the (I)DCT and (inverse) quantization time of the six 8 × 8 blocks in a macroblock. The skip factor (SF) is included to reflect the skipping of IDCT, inverse quantization, RLC, VLC and reconstruction when all of the quantized coefficients are zero, and N is the number of data packets, which in this case is the number of macroblocks in a picture to be encoded.

Fig. 3. Allocation of motion picture coding tasks. (Blocks in the diagram: ME, DECISION, DCT, Q, IQ, IDCT, REC, RLC & VLC, BUFFER.)

In the geometrical parallel processing method, the computational amount of a module is affected by the number of macroblocks allocated to it, as described below. $T_{reg\_k}$ for motion picture coding is

$$T_{reg\_k} = [T_{ME} + T_{DECL} + T_{DCT} + T_{Q} + (T_{IQ} + T_{IDCT} + T_{RLC} + T_{HUFF}) \times (1 - SF_k)]N_{reg\_k}, \tag{8}$$

where $SF_k$ is the skip factor in the kth region and $N_{reg\_k}$ is the number of packets to be processed in the kth region. Obviously, the execution time is greatly affected by $SF_k$ when the number of data packets is fixed. The total execution time equals the processing time of the region with the smallest $SF_k$. The communication load in the partitioning method can be divided into the following three elements:

$$T_{comm\_k} = T_{DP} + T_{PI} + T_{OVR}, \tag{9}$$

where $T_{DP}$ is the data partition time, $T_{PI}$ is the partial results integration time and $T_{OVR}$ is the overlapped data interchange time. Though the amount of communication may vary depending on the result, the methods of communication for the data partition time and the partial results integration time are similar, with opposite directions of data flow. That is, $T_{DP}$ is the time it takes to partition the picture data to the slave processing modules, and $T_{PI}$ is the time required to send the partial results back to the master. $T_{OVR}$ is the extra time needed to move data of the previous frame processed by a neighboring module when estimating and compensating motion along the boundaries of the partitioned picture.

3. Adaptive data allocation

The overall performance of the coder is degraded because blocks whose quantized coefficients are all zero skip part of the processing, which causes unbalanced execution time among processing modules [7, 8]. An adaptive data allocation method is used to predict the load of each region and allocate data to slave modules based on the prediction. There are two types of adaptive data allocation, static and dynamic [9]. Static allocation is a method in which the work load is allocated prior to execution and no change is made during the execution. On the other hand, dynamic allocation predicts the load based on input data and determines the amount of data to be allocated to each processing module. Also, for load prediction and allocation either centralized control or distributed control can be used [9]. In the centralized control scheme the master processor predicts the load for allocation, whereas in the distributed control scheme a given amount of data is first allocated to each processing module, which performs load prediction, and based on this prediction the load is reallocated for balanced task execution. In this study, dynamic data allocation is used together with distributed control to adapt well to the changes in statistical properties of picture data.

Along with the control schemes described above, the method of load prediction is also a critical element. It is essential to choose accurate measures with simple arithmetic.
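The effect of the regional skip factor in (8) can be illustrated with a short Python sketch. The per-macroblock stage times and the skip factors below are hypothetical values chosen only for illustration; they are not measurements from the implemented system.

```python
# Hypothetical per-macroblock stage times (arbitrary units), not measured values.
STAGE_TIME = {"ME": 4, "DECL": 1, "DCT": 3, "Q": 2,
              "IQ": 2, "IDCT": 3, "RLC": 1, "HUFF": 2}


def region_time(sf_k, n_reg_k, t=STAGE_TIME):
    """Eq. (8): the stages after Q run only for the fraction (1 - SF_k) of packets."""
    always = t["ME"] + t["DECL"] + t["DCT"] + t["Q"]
    skippable = t["IQ"] + t["IDCT"] + t["RLC"] + t["HUFF"]
    return (always + skippable * (1 - sf_k)) * n_reg_k


# Four regions with the same number of macroblocks but different skip factors.
skip_factors = [0.75, 0.5, 0.25, 0.0]
times = [region_time(sf, 100) for sf in skip_factors]

t_total = max(times)                      # the slowest region finishes last
t_one = sum(times)                        # one module processing every region
efficiency = t_one / (len(times) * t_total)
```

Even though every region receives the same number of macroblocks, the region with the smallest skip factor determines the total time, so the efficiency in this assumed case stays below 1.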
3. Load prediction for adaptive data allocation

3.1. Load prediction in the spatial domain

The load prediction in picture coding is similar to the bit-rate prediction since, in most cases, pictures with large computational load yield a high bit-rate. Three different types of measures can be used for predicting the amount of information contained in a picture. The first measure uses the distribution of pixel values, the second measure is derived from the spatial correlation between intraframe pixels, and the third is obtained from the temporal correlation between interframe pixels [10]. The first-order statistics used for bit-rate prediction (mean, variance, and entropy) are given below,

$$m = \sum_{k=0}^{M} k\,p_k, \tag{10}$$

$$\sigma^2 = \sum_{k=0}^{M} (k - m)^2 p_k, \tag{11}$$

$$E = -\sum_{k=0}^{M} p_k \log_2 p_k, \tag{12}$$

where $p_k$ is the relative density of pixels with the grey level k, and M is 255, the maximum grey level value. For finding the spatial measure, first let G(x, y, t) be the grey level of the pixel at (x, y) in frame t. The vertical and horizontal differences between neighboring pixels can be represented as

$$D_v(x, y, t) = G(x, y, t) - G(x, y - 1, t), \qquad D_h(x, y, t) = G(x, y, t) - G(x - 1, y, t). \tag{13}$$

If the probabilities of $D_v$ and $D_h$ being equal to k are $d_{vk}$ and $d_{hk}$, respectively, the vertical entropy and horizontal entropy of the pixel are then

$$VE(x, y, t) = -\sum_{k=-M}^{M} d_{vk} \log_2 d_{vk}, \tag{14}$$

$$HE(x, y, t) = -\sum_{k=-M}^{M} d_{hk} \log_2 d_{hk}. \tag{15}$$

For the temporal measure, the difference between frames using motion vectors $v_x$ and $v_y$ in the direction of the x and y axes, respectively, can be defined as

$$D_t(x, y, t) = G(x, y, t) - G(x + v_x, y + v_y, t - 1). \tag{16}$$

Here, when $d_{tk}$ is the probability of $D_t$ being equal to k, the difference variation and difference entropy are

$$DV_t(x, y, t) = \sum_{k=-M}^{M} (k - m)^2 d_{tk}, \tag{17}$$

$$DE_t(x, y, t) = -\sum_{k=-M}^{M} d_{tk} \log_2 d_{tk}, \tag{18}$$

respectively. There are, however, three major problems with load prediction using the above indices. The first problem is that separate operations are required since the above indices are not a part of the encoding process. Second, a large amount of computation is needed for the load prediction since the operation involves multiplication of pixel values. Third, the measures cannot correctly reflect the amount of computation since they give only an approximation of the load, which can change due to various factors during the encoding procedure. Therefore, load prediction and adaptive load balancing using the indices described above pose difficulties in enhancing the performance of the system.
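To make the cost of these spatial-domain indices concrete, the following Python sketch computes the first-order statistics (10)-(12) and the difference entropies (13)-(15) for a small picture. The helper names and the tiny test picture are illustrative assumptions, and the picture is stored as a list of rows so that `image[y][x]` plays the role of G(x, y, t).

```python
import math
from collections import Counter


def distribution(values):
    """Relative frequency of each value (p_k, d_vk or d_hk in the text)."""
    n = len(values)
    return {k: c / n for k, c in Counter(values).items()}


def entropy(values):
    """Entropy in bits of the empirical distribution of `values`."""
    return -sum(p * math.log2(p) for p in distribution(values).values())


def first_order_stats(pixels):
    """Mean, variance and entropy of the grey levels, Eqs. (10)-(12)."""
    p = distribution(pixels)
    mean = sum(k * pk for k, pk in p.items())
    var = sum((k - mean) ** 2 * pk for k, pk in p.items())
    return mean, var, entropy(pixels)


def difference_entropies(image):
    """Vertical and horizontal difference entropies, Eqs. (13)-(15)."""
    h, w = len(image), len(image[0])
    dv = [image[y][x] - image[y - 1][x] for y in range(1, h) for x in range(w)]
    dh = [image[y][x] - image[y][x - 1] for y in range(h) for x in range(1, w)]
    return entropy(dv), entropy(dh)


# Tiny hypothetical 2 x 4 picture: flat along each row, different between rows.
image = [[0, 0, 0, 0],
         [255, 255, 255, 255]]
mean, var, ent = first_order_stats([g for row in image for g in row])
ve, he = difference_entropies(image)
```

Every index needs a separate pass over the pixels with multiplications or logarithms, which is the overhead the paragraph above objects to.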
3.2. Load prediction in the DCT domain

It is desirable to utilize the graph shown in Fig. 3 for adaptive load balancing of motion pictures. That is, the load can be determined accurately and encoding time can be saved significantly if the cause of load unbalance, all of the quantized coefficients being zero, can be detected. There are two approaches in doing this. The first method uses the pixel values in the spatial domain before performing DCT.

$$F(u, v) = \frac{c(u)c(v)}{4} \sum_{x=0}^{7} \sum_{y=0}^{7} f(x, y) \cos\left(\frac{(2x + 1)u\pi}{16}\right) \cos\left(\frac{(2y + 1)v\pi}{16}\right), \qquad u, v = 0, 1, 2, 3, 4, 5, 6, 7,$$

where

$$c(p) = \begin{cases} 1/\sqrt{2} & \text{if } p = 0, \\ 1 & \text{otherwise.} \end{cases} \tag{19}$$

The relation between the maximum spatial value $f_{max}$ and the maximum DCT coefficient value $F_{max}$ for an 8 x 8 block size is given in (20).

$$F_{max} \le F_{max\_upper}, \qquad F_{max\_upper} = 8 f_{max}. \tag{20}$$

The maximum spatial pixel value $f_{max}$ is found first, and the upper bound $F_{max\_upper}$ can then be calculated. If $F_{max\_upper}$ satisfies (21), DCT, quantization, inverse quantization, IDCT, reconstruction, RLC and VLC are all skipped, thus allowing a significant reduction in the overall amount of calculation. In (21), quantizer_scale represents the quantization stepsize and non_intra_quant[u][v] is the motion picture quantization matrix value applied to F(u, v).

$$\frac{8 F_{max\_upper}}{\text{quantizer\_scale} \times \text{non\_intra\_quant}[u][v]} < 1. \tag{21}$$

The inequality in (21) can cause error in load prediction since it is based on the upper bound for $F_{max}$. So a second prediction is carried out after the DCT to predict the subsequent operations more accurately. The maximum coefficient $F_{max}$ is determined after DCT, and if this value satisfies the condition of (22) then quantization, inverse quantization, IDCT, reconstruction, RLC and VLC are skipped.

$$\frac{8 F_{max}}{\text{quantizer\_scale} \times \text{non\_intra\_quant}[u][v]} < 1. \tag{22}$$

Compared to predicting load in the spatial domain, the two methods described above significantly reduce the amount of computation, since the prediction is done during the encoding process, and they can predict the load more accurately. Both (21) and (22) are used for load prediction in this study.

4. Optimal load allocation

Based on the load information found in Section 3, the processing modules with heavier load pass the load to those modules with lighter load.

4.1. System load table

To determine the load status of each processing module, the load information of each slave processing module is passed to the master processor, where the loads are compared and one of three load statuses (light, moderate, or heavy) is assigned to each module.

$$dl_k = l_k - l_{mean}, \tag{23}$$

where $l_k$ is the calculated load of the kth processing module and

$$l_{mean} = \frac{l_0 + l_1 + \cdots + l_{(P-1)}}{P}.$$

The status of a processing module can be determined as in (24) based on the above differential load information.
$$\text{kth processor load} = \begin{cases} \text{light:} & dl_k < -l_{th}, \\ \text{moderate:} & |dl_k| \le l_{th}, \\ \text{heavy:} & dl_k > l_{th}, \end{cases} \tag{24}$$
where $l_{th}$ is the load transition threshold ($\ge 0$). The load transition threshold value is introduced to prevent modules from balancing their loads when there is only a slight load difference. In such a case, the overhead involved in data exchange and scheduling causes performance degradation. The load table for the overall system can be made using the status of each processing module in (24), and based on this load table the overall system efficiency can be maximized. A few definitions are needed for constructing the load table.

Definition 1. The distance $d_{ij}$ between processing modules i and j is the shortest distance between i and j. Hence the diameter of a system with P modules is the maximum distance between two modules.

$$\mathrm{diameter}(P) = \mathrm{MAX}\{d_{ij} \text{ for all } i, j \text{ in } P\}. \tag{25}$$
Definition 2. The binary output function of gate $g_k$, which is equivalent to the load condition of the kth processing module, is

$$g_k = \begin{cases} 0 & \text{load condition is light,} \\ w_{max} & \text{load condition is heavy or moderate,} \end{cases} \tag{26}$$

where $w_{max} = \mathrm{diameter}(P) + 1$.

Definition 3. A processing module with heavy load passes the load to modules with $g_k = 0$ and the results are returned. Here the distance between the heavily loaded module i and the rest of the modules is

$$w_{ik} = \begin{cases} d_{ik} & \text{if } g_k = 0, \\ w_{max} & \text{if } g_k = w_{max}. \end{cases} \tag{27}$$

4.2. Task reallocation

Using the load information of the overall system derived in Section 3, tasks can be reallocated. A task reallocation tree can be constructed as shown in Fig. 4 for a parallel processing system with P processing modules where initially n modules have heavy loads and m modules have light loads. At each level, n heavily loaded modules allocate tasks to m lightly loaded modules. The tree is expanded to depth n while continuously varying load and load status, and the minimum cost path from the root node to the terminal node is determined. The total cost is expressed in (28), where it is defined as the communication cost incurred during task reallocation plus the penalty of residual load not allocated to lightly loaded modules.

$$\text{Total cost} = \sum_{i=0}^{n-1} \sum_{k=0}^{m-1} c_1 w_{ik} X_{ik}\, \mathrm{MIN}(dl_i, dl_k) + \sum_{i=0}^{n-1} \sum_{k=0}^{m-1} c_2 X_{ik}\, \mathrm{MAX}((dl_i - dl_k), 0). \tag{28}$$

The cost is calculated for each of the $m^n$ possible paths down to the terminal node and the task is reallocated along the minimum cost path. However, this requires a large amount of computation, and because the total overload of a heavily loaded module is fully allocated to a single other module (though lightly loaded), there is still a little imbalance in distributing load along the minimum cost path.

Fig. 4. Task reallocation tree. (A heavy task is represented by a node; allocation to a light processor is represented by a branch.)
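The exhaustive search over the reallocation tree can be sketched in Python as follows. This is only one possible reading of (28), not the authors' implementation: it assumes that $X_{ik} = 1$ when heavy module i sends its overload to light module k, that each heavy module chooses exactly one destination, that overloads and spare capacities are taken as magnitudes, and that $c_1 = 1$ and $c_2 = 10$ (the text does not give these coefficients). The loads and distances are those of the eight-module example in Section 4.4.

```python
from itertools import product

C1, C2 = 1, 10                # assumed cost coefficients (not given in the text)
overload = [20, 44, 12]       # heavy modules P2, P5, P7
capacity = [48, 8, 20]        # light modules P0, P1, P4 (magnitude of dl_k)
w = [[2, 1, 3],               # w[i][k]: distance from heavy module i to light module k
     [2, 2, 1],
     [4, 3, 3]]


def path_cost(path):
    """Cost of one root-to-leaf path; path[i] is the light module chosen by heavy i."""
    remaining = list(capacity)            # load status is updated level by level
    cost = 0
    for i, k in enumerate(path):
        moved = min(overload[i], remaining[k])     # MIN term of (28)
        residual = overload[i] - moved             # MAX(..., 0) term of (28)
        cost += C1 * w[i][k] * moved + C2 * residual
        remaining[k] -= moved
    return cost


# One branch per light module at each of the n levels: m**n paths in total.
paths = list(product(range(len(capacity)), repeat=len(overload)))
best_cost, best_path = min((path_cost(p), p) for p in paths)
```

Under these assumptions the minimum-cost path sends P2 to P4, P5 to P0 and P7 to P1, and 4 units of the overload of P7 remain unallocated, which illustrates the residual imbalance noted above.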
4.3. Linear programming

The aforementioned problem can be easily resolved through linear programming. Linear programming is a mathematical tool used for distributing a limited amount of resources for reaching a goal. The goal can be cost minimization, profit maximization, or optimal time allocation. A linear programming model is set by defining objective functions and constraint sets that consist of equalities and inequalities in both directions.

$$\text{Minimize (or maximize) } z = cx + z_0 \quad \text{subject to } Ax \le b, \quad x_1, x_2, \ldots, x_n \ge 0, \tag{29}$$

where

$$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}, \quad b = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{bmatrix}, \quad x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}, \quad \text{and} \quad c = [c_1 \; c_2 \; \cdots \; c_n].$$

The $x_i$ are decision variables and $c_i$, $a_{ij}$ and $b_i$ are deterministic parameters. A non-negative vector x satisfying the system constraints in (29) is a feasible solution. Therefore, the final solution of the linear program is the feasible vector which minimizes the objective function. For finding the solution, the simplex method is used in this study. The basic operation of simplex is the pivot operation used for solving systems of first-order (linear) equations. A pivot operation transforms m equations into an equivalent set of linear equations in which a chosen variable remains in only one equation and is removed from the remaining m - 1 equations. The coefficient of the variable chosen as the pivot is unity. This process can be repeated for a linear system with m pivots. The resulting form, called the canonical form, is crucial to the simplex method. At this time, we define basic variables as those that have coefficient one in one equation and zero in the others. Transforming (29) to the canonical form yields the basic variables in (30).

$$\begin{aligned} x_1 + a_{1(m+1)} x_{m+1} + \cdots + a_{1n} x_n &= b_1, \\ x_2 + a_{2(m+1)} x_{m+1} + \cdots + a_{2n} x_n &= b_2, \\ &\;\;\vdots \\ x_m + a_{m(m+1)} x_{m+1} + \cdots + a_{mn} x_n &= b_m, \\ c_{(m+1)} x_{m+1} + \cdots + c_n x_n &= -z_0 + z. \end{aligned} \tag{30}$$

The above canonical form has the following three properties: (i) the system constraints contain m basic variables, (ii) the system has a feasible solution, and (iii) the objective function has no basic variables. For a linear system expressed in the canonical form, a solution which minimizes the objective function can be found from the above three properties. Since the solution is found from n variables with m different pivots, the number of basic solutions is

$${}_nC_m = \frac{n!}{m!\,(n - m)!}. \tag{31}$$
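The pivot operation that produces the canonical form (30) can be sketched in Python as follows. The two-equation system is a hypothetical example, not taken from this study, and the sketch shows only the pivot step, not a complete simplex implementation. Exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from math import comb


def pivot(rows, r, c):
    """Pivot on rows[r][c]: scale row r so the pivot is 1, then clear column c elsewhere."""
    rows = [[Fraction(v) for v in row] for row in rows]
    p = rows[r][c]
    rows[r] = [v / p for v in rows[r]]
    for i, row in enumerate(rows):
        if i != r and row[c] != 0:
            factor = row[c]
            rows[i] = [v - factor * pv for v, pv in zip(row, rows[r])]
    return rows


# Hypothetical system, one augmented row [a_i1, a_i2, a_i3 | b_i] per equation:
#   2*x1 + 1*x2 + 1*x3 = 8
#   1*x1 + 3*x2 + 0*x3 = 9
system = [[2, 1, 1, 8],
          [1, 3, 0, 9]]
canonical = pivot(pivot(system, 0, 0), 1, 1)   # make x1 and x2 the basic variables

# Setting the non-basic variable x3 to 0 gives the basic solution x1 = b1, x2 = b2.
basic_solution = (canonical[0][3], canonical[1][3], Fraction(0))
num_basic_candidates = comb(3, 2)              # Eq. (31) with n = 3, m = 2
```

After the two pivots each basic variable has coefficient one in exactly one equation and zero in the other, which is the structure of (30).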
4.4. Load balancing using linear programming

In this section, an actual adaptive load allocation procedure is described using an example with 8 processing modules ($P_x = 4$, $P_y = 2$). First, the system load table (Table 2) is derived from the measured load information (Table 1) using (23) and (24).

Table 1
The measured load information

Processor    0     1     2     3     4     5     6     7
Load         10    50    78    58    38    102   58    70

Table 2
The system load table

Processor    0     1     2     3     4     5     6     7
dl_k         -48   -8    +20   0     -20   +44   0     +12

The positive numbers in Table 2 represent overloaded processing modules and the negative numbers represent underloaded modules. As a middle step before setting the linear model equation, load imbalance information can be found. The underloaded modules #0, 1, 4 in Table 3 become the destinations #1, 2, 3, and the overloaded modules #2, 5, 7 are the sources #1, 2, 3. The value of $w_{ik}$ represents the distance between modules defined in Definition 3.

Table 3
The load imbalance information

Des. \ Src.   1 (P2)         2 (P5)         3 (P7)         Supplies
1 (P0)        x1, w20 = 2    x2, w50 = 2    x3, w70 = 4    48
2 (P1)        x4, w21 = 1    x5, w51 = 2    x6, w71 = 3    8
3 (P4)        x7, w24 = 3    x8, w54 = 1    x9, w74 = 3    20
Demands       20             44             12             76

Src.: overloaded processor. Des.: underloaded processor.

Using this table, the constraints and objective function can be expressed as

$$\text{Minimize } z = 2x_1 + 2x_2 + 4x_3 + 1x_4 + 2x_5 + 3x_6 + 3x_7 + 1x_8 + 3x_9$$

subject to

$$\begin{aligned} x_1 + x_2 + x_3 &= 48, \\ x_4 + x_5 + x_6 &= 8, \\ x_7 + x_8 + x_9 &= 20, \\ x_1 + x_4 + x_7 &= 20, \\ x_2 + x_5 + x_8 &= 44, \\ x_3 + x_6 + x_9 &= 12. \end{aligned} \tag{32}$$

The equations in (32) take on the same form as (29). Changing the equations to the canonical form and finding the optimal solution yields Table 4.

Table 4
The load balancing table

Des. \ Src.   1 (P2)      2 (P5)      3 (P7)      Supplies
1 (P0)        x1 = 20     x2 = 24     x3 = 4      48
2 (P1)        x4 = 0      x5 = 0      x6 = 8      8
3 (P4)        x7 = 0      x8 = 20     x9 = 0      20
Demands       20          44          12          76

Based on this table, the system performance can be enhanced by passing the load between processing modules. For example, module #2 passes its extra load to the underloaded module #0 and receives the processed result. Module #5 passes its extra load to modules #0 and #4, and module #7 passes its extra load to modules #0 and #1. Using the load allocation information, an overloaded processing module passes a data packet composed of DCT coefficient blocks and quantization step sizes to an underloaded processing module, which then processes the packet and returns the result data packet consisting of bit-stream and reconstructed blocks. This load reassignment takes place after the load estimation during the actual coding process, and for the overlapped mode processors described in Section 2.1 the communication time for passing data packets does not affect the overall execution time. Therefore, all the processing modules in the parallel processing system terminate almost simultaneously owing to the balanced load allocation.
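The worked example can be checked with a short Python sketch that derives the differential loads from Table 1, builds the supplies and demands of Table 3, and verifies the allocation of Table 4 against (32). The threshold $l_{th} = 0$ is an assumption, since the text does not state the value used, and the exhaustive search over integer allocations stands in for the simplex method purely as an independent check of the minimum cost.

```python
# Measured loads of processors 0..7 (Table 1).
loads = [10, 50, 78, 58, 38, 102, 58, 70]
L_TH = 0   # assumed threshold; the text does not state the value used

l_mean = sum(loads) / len(loads)
dl = [l - l_mean for l in loads]                                   # Eq. (23)
status = ["light" if d < -L_TH else "heavy" if d > L_TH else "moderate"
          for d in dl]                                             # Eq. (24)

src = [k for k, s in enumerate(status) if s == "heavy"]            # overloaded
des = [k for k, s in enumerate(status) if s == "light"]            # underloaded
demands = [int(dl[k]) for k in src]
supplies = [int(-dl[k]) for k in des]

# Distances w_ik from Table 3, indexed [destination row][source column].
w = [[2, 2, 4],
     [1, 2, 3],
     [3, 1, 3]]


def cost(x):
    """Objective of (32) for a 3 x 3 allocation x[destination][source]."""
    return sum(w[r][c] * x[r][c] for r in range(3) for c in range(3))


def feasible(x):
    """Row sums must equal the supplies, column sums the demands, all x >= 0."""
    rows_ok = all(sum(x[r]) == supplies[r] for r in range(3))
    cols_ok = all(sum(x[r][c] for r in range(3)) == demands[c] for c in range(3))
    return rows_ok and cols_ok and all(v >= 0 for row in x for v in row)


table4 = [[20, 24, 4],
          [0, 0, 8],
          [0, 20, 0]]


def brute_force_minimum():
    """Smallest cost over all integer allocations (exhaustive check, not simplex)."""
    best = None
    for x1 in range(demands[0] + 1):
        for x2 in range(demands[1] + 1):
            for x4 in range(supplies[1] + 1):
                for x5 in range(supplies[1] - x4 + 1):
                    x3 = supplies[0] - x1 - x2
                    x6 = supplies[1] - x4 - x5
                    x7 = demands[0] - x1 - x4
                    x8 = demands[1] - x2 - x5
                    x9 = supplies[2] - x7 - x8
                    x = [[x1, x2, x3], [x4, x5, x6], [x7, x8, x9]]
                    if feasible(x) and (best is None or cost(x) < best):
                        best = cost(x)
    return best


z_table4 = cost(table4)
z_min = brute_force_minimum()
```

The allocation of Table 4 satisfies every constraint of (32), and its cost equals the minimum found by the exhaustive search; other allocations with the same cost may exist.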
5. Experimental results

A parallel processing module for high-speed motion picture coding is designed, and the processing time and efficiency of geometrical parallel processing and of the proposed adaptive load balancing method using the implemented processing modules are compared and analyzed. The processing module shown in Fig. 5, which combines a high-speed processor and function specific devices, is implemented.

Fig. 5. The implemented processing module.
Fig. 6. Regional SF_k of the test images: (a) Football image; (b) Table tennis image. (Curves for Task 0 to Task 3 plotted against # of frames.)
The following criteria were considered in choosing a processor for the high-performance processing module:

(i) computational power,
(ii) high-speed communication capability between processors, and
(iii) easy H/W development.

A 32-bit RISC structured Transputer for parallel processing is chosen as the main processor in this study. The features of the Transputer T805 include [15]:

(i) instruction throughput: up to 30 MIPS,
(ii) data bus width: 32 bits, address space: 4 Gbytes,
(iii) on-chip memory: 4 Kbytes,
(iv) communication channels: four 20 Mbps links,
(v) internal hardware scheduler, and
(vi) two-dimensional block move instruction, CRC generation instruction support.

The 20 Mbps links and the internal hardware scheduler are specifically for parallel processing. These functions allow high-speed communication between processors and efficiently support multiprocessing.
Fig. 7. Regional encoding time of the test images: (a) Football image; (b) Table tennis image. [Plots versus # of frames; curves: Task 0, Task 1, Task 2, Task 3.]
In addition, the 2-D block move instruction can be used readily for motion compensation.

Current international standard algorithms for motion picture compression include motion estimation and DCT operations. The DCT is an 8 × 8 2-D operation requiring approximately 1024 MAC operations. For example, processing of CIF images in H.261 requires approximately 73 MMAC, and even with a fast algorithm it requires 14 MMAC. MPEG coding requires many more operations. For fast DCT, INMOS's IMSA121, which has the following features, is used [16]:
(i) 8 × 8 transform size, (ii) 8 × 8 calculation time: up to 3.2 μs, (iii) multi-function (DCT, IDCT, filtering, matrix transpose), and (iv) Dx port and clipping capability.
Among these features, the Dx port is particularly useful. It allows the DCT operation to be performed on the difference data between the current and the reconstructed previous frames. During the decoding process, inverse quantized DCT coefficients can be inverse transformed, added to the 8 × 8 block data of the previous frame, and clipped to the valid data range to construct the current frame.

Motion estimation requires the largest number of operations in motion picture coding. Simple operations such as subtraction, absolute value operation, and accumulative addition are needed for motion estimation. For example, motion estimation for a CIF image within a −8 ~ +7 search range requires 780 MOPS. This is higher than the DCT requirement and calls for a function-specific device [16]. For motion estimation, SGS-THOMSON's STI3220 containing 256 systolic arrays is used. Its features include
(i) pixel rate: up to 18 MHz, (ii) block size: 8 × 4n, 16 × 4n, (iii) search range: −8 ~ +7, and (iv) error function: full search with MAE.
The search range of the STI3220 is −8 ~ +7, but a wider range is required in many cases, including MPEG coding, where motion estimation is performed for up to three previous frames, making the search range from −48 to +47. Since this is beyond the search range of the STI3220, the −8 ~ +7 range search is performed four times to cover a −16 ~ +15 search range, and the telescopic search technique is used when the frame interval is over 1.

Fig. 8. Performance improvement measure and the maximum load difference: (a) Football image; (b) Table tennis image. [Plots versus # of frames; curves: Dmean, Dmax.]

Fig. 9. The encoding time improvement using load balancing: (a) Football image; (b) Table tennis image. [Plots versus # of frames; curves: Geometric, Load balanced.]

The parallel processing system implemented in this study uses four processing modules, two in each of the x and y directions. The picture size allocated to the first two modules is 176 × 128, and processing modules #2 and #3 each process a 176 × 112 picture. 352 × 240 MPEG SIF images with N = 15 and M = 1 are used for the experiment. Load balancing is not performed for I pictures, and RLC and Huffman coding are not included for simplicity in the
experiment. The results show the superiority of the proposed adaptive coding method.

Fig. 6 shows the regional SFk of the Football and Table tennis test sequences. Fig. 12 shows the 100th, 110th, 120th, 130th, 140th and 150th pictures of the Table tennis sequence. Consistent with Fig. 7(b), Task0 and Task2 have larger SFk values and shorter processing times than Task1 and Task3, since they are slow-varying background regions. From the load information obtained from each region during encoding, the performance improvement measure Dmean and the maximum load difference Dmax can be defined as in (33):

$$D_{mean} = \frac{l_{max} - l_{mean}}{l_{mean}}, \qquad D_{max} = \frac{l_{max} - l_{min}}{l_{mean}}, \tag{33}$$

where $l_{max} = \mathrm{MAX}(l_0, l_1, \ldots, l_{P-1})$ and $l_{min} = \mathrm{MIN}(l_0, l_1, \ldots, l_{P-1})$.

Dmean is the normalized difference between the maximum and average load found using the DCT coefficients mentioned in Section 2. The overall performance of the system can be improved in proportion to this figure by load balancing. Dmax is the normalized difference between the maximum and minimum load. The efficiency of the system decreases significantly with larger Dmax values when only the geometrical partitioning is used. Comparing the two picture sequences, a higher performance improvement can be expected from load balancing for the Table tennis sequence than for the Football sequence.

Figs. 9 and 10 compare the processing time and efficiency of the geometrical partitioning and the adaptive load balancing. The adaptive load balancing improves the processing time and efficiency by approximately 10%. The performance improvement brought by load balancing is not as high as the indices in Fig. 8 suggest, for two reasons. First, RLC and Huffman coding are not performed in the experiment, because the execution time of these operations differs for each block and is difficult to quantify. Second, Task0 and Task1 have longer processing times than Task2 and Task3 because of the difference in allocated picture size, which requires 11 more macroblocks to be processed. Table 5 compares the geometrical partitioning and the adaptive load balancing. The decrease in the average coding time with the adaptive load balancing is due to the performance improvement over the geometrical partitioning brought by skipping quantization according to the results of load prediction (see Figs. 11 and 12).

Fig. 10. The efficiency improvement using load balancing: (a) Football image; (b) Table tennis image. [Plots versus # of frames; curves: Geometric, Load balanced.]

Table 5. The encoding time improvement using the load balancing

                             Average            Worst              Deviation of
                             processing time    processing time    the worst time
(a) Football image
    Geometrical processing   2.223              2.407              0.185
    Adaptive load balancing  2.206              2.309              0.103
(b) Table tennis image
    Geometrical processing   2.030              2.281              0.251
    Adaptive load balancing  1.922              2.021              0.099
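As an illustration of Eq. (33), the following minimal sketch computes Dmean and Dmax from a list of per-module loads. The function name `load_indices` and the sample load values are hypothetical and chosen only for illustration, and the mean load is assumed to be the arithmetic average over the P modules.

```python
def load_indices(loads):
    """Return (D_mean, D_max) of Eq. (33) for a list of per-module loads."""
    if not loads:
        raise ValueError("loads must be non-empty")
    l_max = max(loads)
    l_min = min(loads)
    # Assumption: l_mean is the arithmetic mean over the P modules.
    l_mean = sum(loads) / len(loads)
    if l_mean <= 0:
        raise ValueError("mean load must be positive")
    d_mean = (l_max - l_mean) / l_mean  # normalized (max - mean) load
    d_max = (l_max - l_min) / l_mean    # normalized (max - min) load
    return d_mean, d_max


# Hypothetical loads for P = 4 modules (illustrative values, not measured data).
example_loads = [1.0, 3.0, 1.0, 3.0]
d_mean, d_max = load_indices(example_loads)
print(f"D_mean = {d_mean:.2f}, D_max = {d_max:.2f}")
```

Both indices are zero when every module carries the same load, and both grow as the loads of the partitioned regions diverge.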
Table 6. The performance of the implemented system

                        Single        Geometrical    Adaptive
                        processing    processing     load balancing
(a) Football image
    Time (s)            8.46          2.407          2.309
    Efficiency (%)      100           87.9           91.6
    Speed up            1             3.51           3.66
(b) Table tennis image
    Time (s)            7.527         2.281          2.021
    Efficiency (%)      100           82.6           92.9
    Speed up            1             3.30           3.72
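The Football entries of Table 6 are consistent with the usual definitions, in which the speed-up is the single-processor time divided by the parallel time and the efficiency is the speed-up divided by the number of modules (P = 4 here). The sketch below assumes these standard definitions, which this section does not state explicitly, and uses a hypothetical helper name `speedup_efficiency` to recompute the entries from the tabulated times.

```python
def speedup_efficiency(t_single, t_parallel, num_modules):
    """Return (speed-up, efficiency in %) under the standard definitions."""
    speedup = t_single / t_parallel
    efficiency = 100.0 * speedup / num_modules
    return speedup, efficiency


P = 4  # number of processing modules used in the experiment

# Football image times (s) taken from Table 6.
T_SINGLE = 8.46
for label, t_par in [("Geometrical processing", 2.407),
                     ("Adaptive load balancing", 2.309)]:
    s, e = speedup_efficiency(T_SINGLE, t_par, P)
    print(f"{label}: speed-up {s:.2f}, efficiency {e:.1f}%")
```

Applying the same formulas to the Table tennis times gives efficiencies within about 0.2 percentage points of the tabulated values.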
Fig. 11. The picture of implemented processing module.
Fig. 12. 100, 110, 120, 130, 140 and 150th Table tennis picture.
5. Conclusion
A processing module for motion picture coding is implemented by combining a 32-bit microprocessor with function-specific ME and DCT devices. With a number of these modules, motion picture coding is performed using the parallel processing technique, and its performance is analyzed in terms of operation and communication time. Both the theoretical study and the experimental results show that performance does not increase linearly with the number of processing modules, because the computational load is unbalanced among the partitioned regions. To resolve this problem, an adaptive load balancing technique utilizing load prediction is proposed. It is shown that this technique stabilizes the coding efficiency even for regions with widely varying coding times. The proposed adaptive load balancing guarantees performance equal or superior to that of the geometrical partitioning. The experimental results show approximately 10% improvement in processing time and efficiency. RLC and Huffman coding are not included in the experiment, because the execution time of these operations differs for each block and is difficult to quantify. Further improvement is expected if these functions are included.

References
[1] C. Hoek, R. Heiss and D. Mueller, "An array processor approach for low bit rate video coding", Signal Processing: Image Communication, Vol. 1, No. 2, October 1989, pp. 213-223.
[2] S.H. Choi et al., "Hierarchical heterogeneous multiprocessor system for real-time motion picture coding", SPIE's Visual Commun. Image Process., 1994, pp. 1777-1787.
[3] R.S. Cok, Parallel Programs for the Transputer, Prentice-Hall, Englewood Cliffs, NJ, 1990.
[4] L.R. Scott, "Load balancing on message passing architectures", J. Parallel Distributed Comput., Vol. 13, 1991, pp. 312-324.
[5] H. Jeschke, K. Gaedke and P. Pirsch, "Multiprocessor performance for real time processing of video coding applications", IEEE Trans. Circuits Systems Video Technol., Vol. 2, No. 2, June 1992, pp. 221-230.
[6] ISO-IEC/JTC1/SC29/WG11, CD2-11172, "Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s", ISO-IEC/JTC1/SC29/WG11, November 1991.
[7] N. Ohta et al., "Efficient video coding for a multiprocessor system using an adaptive data allocation technique", ICIP '89 Conf. Proc., Vol. 1, No. 2, 1989, pp. 65-68.
[8] S.N. Choi, K.T. Park et al., "Efficient video coding using multiprocessor system", ICCT Conf. Proc. Image Process., Vol. I, 1992, pp. 02.01.1-02.01.4.
[9] S.H. Bokhari, Assignment Problems in Parallel and Distributed Computing, Kluwer Academic Publishers, Dordrecht, 1987.
[10] A. Leon-Garcia et al., "Prediction of bit rate sequences of encoded video signals", IEEE J. Selected Areas Commun., Vol. 9, No. 3, April 1991, pp. 305-313.
[11] W.K. Pratt and W.-H. Chen, "Scene adaptive coder", IEEE Trans. Commun., Vol. COM-32, 1984, pp. 225-232.
[12] R.C.H. Lin and R.M. Keller, "The gradient model load balancing method", IEEE Trans. Software Engrg., Vol. SE-13, No. 1, January 1987, pp. 32-38.
[13] M. Tsuchiya et al., "A task allocation model for distributed computing systems", IEEE Trans. Comput., Vol. C-31, No. 1, January 1982, pp. 41-47.
[14] P.R. Thie, An Introduction to Linear Programming and Game Theory, Wiley, New York, 1988.
[15] INMOS, The Transputer Databook, 2nd Edition, INMOS, 1989.
[16] SGS-Thomson Microelectronics, Image Processing Data Book, 1st Edition, SGS-Thomson Microelectronics, October 1990.