ARTICLE IN PRESS
Real-Time Imaging 10 (2004) 1–7
Motion estimation using a frame-based adaptive thresholding approach Shou-Yi Tseng Department of Computer and Information Science, Shoochow University, 56, Sec. 1, Kwei-Yang St., Taipei, Taiwan
Abstract
This paper proposes two frame-based adaptive thresholding algorithms that reduce the computation required by block-based motion estimation in real-time, low-rate video coders. The proposed algorithms compute an adaptive threshold for each frame and use it to prejudge whether the block-matching computation for each block is worthwhile. Based on the difference between the current and reference frames, the first algorithm determines the threshold in terms of the optimal trade-off between run-time and distortion, and the second algorithm determines the threshold according to a user-specified percentage of the total number of blocks. The experimental results demonstrate that the proposed algorithms significantly reduce the amount of computation compared to previous algorithms, while almost fully maintaining the quality of the reconstructed image. © 2003 Elsevier Ltd. All rights reserved.
E-mail address: [email protected] (Shou-Yi Tseng). 1077-2014/$ - see front matter © 2003 Elsevier Ltd. All rights reserved. doi:10.1016/j.rti.2003.09.014
1. Introduction
In video coding systems, motion estimation exploits the high correlation between neighboring frames to reduce temporal redundancy in video compression. Because it significantly improves bit rate reduction, motion estimation has been widely adopted and has become critical in video coding. The straightforward technique for motion estimation is the full search algorithm [3], which divides the current frame into non-overlapping, equal-sized blocks and searches for each of them in the reference frame within the search area. The displacement of the matched block is termed the motion vector. Although the full search reduces distortion effectively, its computational cost is a major concern in video coding. Many fast motion estimation algorithms have been developed recently [1,2,4–7,12]. However, most of these algorithms focus on efficiently searching the search area for the block that best matches each block in the current frame. From the perspective of the whole frame, regardless of the motion estimation algorithm used, some blocks can be accurately predicted using the motion vector, while others cannot. As in the MPEG standard [9], each coded block must specify whether the motion vector is applied. Therefore, the motion estimation
procedure requires a threshold for separating these two kinds of blocks. Shi and Xia [10] presented a thresholding multiresolution block matching algorithm that uses a predefined mean absolute difference (MAD) value to filter out inefficient blocks before further block matching. Recently, Yang et al. [13] presented a computation reduction algorithm for motion estimation that detects the non-zero motion blocks using a threshold value before further block matching. Both algorithms achieve a good trade-off between time and distortion. However, their threshold values are predefined and obtained by off-line computation, and many experiments must be conducted to obtain feasible constant threshold values for various video sequences. Moreover, the activity differs among video sequences and also among the frames within a sequence, so applying a constant threshold to all cases in real-time environments is difficult. Therefore, an adaptive threshold that can identify the most important portion of the motion estimation computation for a frame is critical, especially for real-time, low-rate video coders. Accordingly, a first draft of the two proposed adaptive thresholding algorithms was presented in [11]. This paper first notes that the effect of motion estimation for each block can be prejudged by the sum of absolute differences (SAD) between the current block and the block at the same position in the reference frame. Based on this observation, two adaptive thresholding algorithms
for motion estimation are proposed. The first algorithm determines a threshold in terms of optimal trade-off between run-time and distortion. The second algorithm determines a threshold according to a user-specified percentage of the total number of blocks. The experimental results demonstrate that the proposed algorithms can significantly reduce the amount of computation compared to the previous algorithms, while almost fully maintaining the quality of the reconstructed image. The rest of this paper is organized as follows. First, Section 2 makes some observations regarding the motion estimation procedure. Section 3 then presents the proposed algorithms. Subsequently, some experimental results are demonstrated in Section 4. Finally, concluding remarks are presented in Section 5.
2. Observations
This section measures the distortion reduction achieved by motion estimation, first from the perspective of each block and then from that of the whole frame. Although the motion vector search time required for each block varies among search algorithms, the full search algorithm is applied to every block to provide a common basis for comparison.
2.1. Measure the effect of motion estimation for each block
While searching for the motion vector of a block, the distortion between the current block and each candidate block is measured, and the candidate block with minimum distortion is used to obtain the motion vector. To measure the distortion, SAD is fast to compute and is the most popular criterion among common motion estimation algorithms. The SAD of a block is defined as
\[ \mathrm{SAD} = \sum_{(i,j) \in \mathrm{block}} \left| B(i,j) - B'(i+m,\, j+n) \right|, \tag{1} \]
where B(i, j) denotes the pixel value at position (i, j) of the current frame, B'(i, j) represents the corresponding pixel value of the reference frame, and (m, n) is the motion vector. Before a motion vector is searched for in the search area, a direct prediction uses the block at the same position in the reference frame. The resulting prediction error, termed the original SAD (OSAD) of the block, is defined as
\[ \mathrm{OSAD} = \sum_{(i,j) \in \mathrm{block}} \left| B(i,j) - B'(i,j) \right|, \tag{2} \]
where B(i, j) and B'(i, j) are the pixel values at the same position in the current and reference frames, respectively. To measure the gain from motion estimation, the reduced distortion is defined as
\[ d\mathrm{SAD} = \mathrm{OSAD} - \mathrm{SAD}. \]
Observing the correlation between OSAD and dSAD shows that blocks with a high OSAD tend to achieve a large distortion reduction through motion estimation. Fig. 1 is a scatter diagram of the OSAD and dSAD of the blocks in a frame. Each point represents a block: the x-axis is the OSAD of the block, and the y-axis is the dSAD of the same block. Most of the points lie just under the x = y line, which indicates that most of the blocks can be predicted effectively using the motion vectors; therefore, their dSAD and OSAD are nearly equal. To measure the strength of the association between the two variables dSAD and OSAD, the coefficient of correlation r is applied. r is computed by the following formula [8]:
\[ r = \frac{n\left(\sum XY\right) - \left(\sum X\right)\left(\sum Y\right)}{\sqrt{\left[n\left(\sum X^{2}\right) - \left(\sum X\right)^{2}\right]\left[n\left(\sum Y^{2}\right) - \left(\sum Y\right)^{2}\right]}}, \tag{3} \]
where X and Y are the values of the two variables corresponding to the OSAD and dSAD of the blocks, n is the total number of blocks, ΣX is the sum of the X values, ΣY is the sum of the Y values, ΣX² is the sum of the squares of the X values, (ΣX)² is the square of the sum of the X values, ΣY² is the sum of the squares of the Y values, and (ΣY)² is the square of the sum of the Y values. The computed correlation coefficient r lies between −1.00 and +1.00, and a value closer to +1 implies a stronger positive correlation. The experimental results indicate that the correlation coefficient of the two variables in Fig. 1 exceeds 0.8 in almost all cases. Therefore, the OSAD can serve as a criterion for prejudging the effectiveness of a block in motion estimation. Although the correlation between OSAD and dSAD is very high, in a few cases, as shown in Fig. 1, some points have a high OSAD value but a very small dSAD value. The motion estimation and compensation scheme has almost no effect on these blocks. In this case, the OSAD value cannot be used to prejudge the effectiveness of the motion estimation. In other words, these blocks cannot be filtered out using the proposed adaptive thresholding algorithm. However, such blocks are few compared with the thousands of blocks in a frame, and do not affect the results of the proposed algorithm.
Fig. 1. Correlation between original distortion and reduced distortion.
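As an illustration of Eqs. (1)–(3), the following Python sketch computes the OSAD, the full-search SAD and the correlation coefficient r for frames stored as nested lists of pixel values. It is a minimal sketch under stated assumptions, not the implementation used in the experiments: the function names (`sad`, `osad`, `full_search`, `pearson_r`), the frame representation and the rule of skipping candidates that fall outside the reference frame are all choices made here for illustration.

```python
import math


def sad(cur, ref, top, left, m, n, size):
    """Eq. (1): SAD between a current block and the reference block displaced by (m, n)."""
    return sum(
        abs(cur[top + i][left + j] - ref[top + i + m][left + j + n])
        for i in range(size)
        for j in range(size)
    )


def osad(cur, ref, top, left, size):
    """Eq. (2): SAD against the co-located reference block (zero motion vector)."""
    return sad(cur, ref, top, left, 0, 0, size)


def full_search(cur, ref, top, left, size, d):
    """Exhaustive search over displacements in [-d, d]; returns (best SAD, (m, n))."""
    rows, cols = len(ref), len(ref[0])
    best = (osad(cur, ref, top, left, size), (0, 0))
    for m in range(-d, d + 1):
        for n in range(-d, d + 1):
            # Assumption: candidates outside the reference frame are skipped.
            if not (0 <= top + m and top + m + size <= rows
                    and 0 <= left + n and left + n + size <= cols):
                continue
            cost = sad(cur, ref, top, left, m, n, size)
            if cost < best[0]:
                best = (cost, (m, n))
    return best


def pearson_r(xs, ys):
    """Eq. (3): coefficient of correlation between two equally long sequences."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
```

The dSAD of a block is then `osad(...) - full_search(...)[0]`, and applying `pearson_r` to the per-block OSAD and dSAD lists of a frame gives the r discussed above.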
2.2. Observe the effect from a frame perspective
The conventional motion estimation procedure for a frame calculates the motion vectors in raster scan order of the blocks. Fig. 2 displays the relationship between the distortion and the percentage of blocks processed during the motion estimation procedure for an example frame. The x-axis represents the percentage of blocks processed, while the y-axis is the sum of the SAD over all the blocks. The curve labeled (a) in Fig. 2 shows how the distortion decreases when the blocks are processed in raster scan order. To ensure that the more effective blocks are computed first, the motion estimation procedure can instead calculate the motion vectors in decreasing order of block OSAD. The curve labeled (b) in Fig. 2 shows how the distortion decreases when the blocks are processed in this order. In Fig. 2(b), the point with the largest curvature represents the optimal trade-off point (OTP) between the running time and distortion. After the OTP, the effect of motion estimation declines significantly, but the computation time required per block remains unchanged. Restated, Fig. 2(b) shows that less than 10% of the motion estimation computation effort contributes more than 90% of the distortion reduction. The position of the OTP may vary among frames. However, the behavior shown in Fig. 2(b) is common in video sequences, especially low activity ones. Based on these observations, given limited running time, the more effective blocks among the thousands of blocks in a frame should be selected for motion estimation and the less effective blocks should be skipped. Consequently, a frame-based adaptive threshold is required to separate these two kinds of blocks for each frame in the video sequence. The proposed thresholding algorithms employ the high correlation between dSAD and OSAD to optimize the threshold.
Fig. 2. Typical example of the motion estimation procedure for a frame. (a) Raster scan order; (b) OSAD decreasing order.
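The two curves of Fig. 2 can be reproduced in outline as follows. This minimal Python sketch assumes that the OSAD and the post-search SAD of every block are already known; the function `distortion_curve` and the six-block toy data are hypothetical and serve only to show why processing blocks in decreasing OSAD order front-loads the distortion reduction. Locating the OTP itself is not attempted here.

```python
def distortion_curve(osads, sads, order):
    """Total frame distortion after each block in `order` has been motion-estimated."""
    total = sum(osads)              # before any search, every block uses the zero vector
    curve = [total]
    for k in order:
        total -= osads[k] - sads[k]  # dSAD gained by searching block k
        curve.append(total)
    return curve


# Toy frame with six blocks (illustrative values, not measured data).
osads = [10, 400, 20, 900, 15, 30]
sads = [9, 40, 18, 60, 15, 25]

raster = list(range(len(osads)))
by_osad = sorted(raster, key=lambda k: osads[k], reverse=True)

curve_a = distortion_curve(osads, sads, raster)    # analogue of Fig. 2(a)
curve_b = distortion_curve(osads, sads, by_osad)   # analogue of Fig. 2(b)
```

In this toy case, `curve_b` has already dropped from 1375 to 175 after two of the six blocks, whereas `curve_a` reaches a comparable level only after the fourth block.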
3. Proposed algorithms
In the proposed algorithms, an adaptive threshold based on the OSAD is generated for each frame to divide its blocks into two parts. Regardless of the motion estimation algorithm applied, motion vectors are calculated for the blocks whose OSAD exceeds the threshold, while the blocks whose OSAD is below the threshold are assigned zero motion vectors.
3.1. Algorithm I
In the first algorithm, the OSADs of all blocks are first calculated and pushed onto a heap in decreasing order. The motion vectors are then computed in order of block OSAD, and the motion vector estimation process stops when a number of consecutive small dSADs occur. The pseudo-code of the proposed Algorithm I is presented below.
Algorithm I
Step 1: Visit all the blocks in raster scan order. Compute the OSAD for each block. Push the block position and its OSAD onto the heap in decreasing order.
Step 2: Let small_dsad_counter be zero.
Step 3: While ((small_dsad_counter < e) and (heap is not empty)) do
    Pop a block from the heap.
    Calculate the motion vector of this block.
    Let dSAD = OSAD − SAD.
    If ((dSAD / OSAD) < 10%) then
        /* A small dSAD found. */
        Add one to small_dsad_counter.
    else
        /* Reset the small dSAD counter. */
        Let small_dsad_counter be zero.
  Wend.
  /* The threshold of this frame is the OSAD value of the last popped block. */
Step 4: While (heap is not empty) do
    Pop a block from the heap.
    Set the motion vector (0,0) to this block.
  Wend.
In Algorithm I, the occurrence of e consecutive small dSADs means that the effective blocks have already been computed, and thus the OTP described in Section 2 has been reached. Therefore, the distortion is controlled in terms of the optimal trade-off between running time and distortion. However, for high activity frames the consecutive-small-dSAD condition is rarely met early, so the motion vector calculation stops late. Consequently, with Algorithm I the motion estimation time may vary considerably among frames because of their differences in activity. Algorithm II, on the other hand, is applicable if the running time for each frame is limited.
3.2. Algorithm II
In Algorithm II, a percentage p of the total number of blocks is given. The p-percentile of the OSADs in the current frame is used as the desired threshold for the frame. Because each frame contains thousands of blocks, the OSADs are obtained by randomly sampling a small set of blocks, and the p-percentile is then estimated from the sampled data. The pseudo-code of the proposed Algorithm II is presented below.
Algorithm II
Given a percentage p of the total number of blocks.
Step 1: Randomly sample m block positions of the frame.
Step 2: Compute the OSAD for each of the m selected blocks.
Step 3: Sort the OSADs in decreasing order.
Step 4: Let i = ⌊p m⌋.
Step 5: Let the threshold g be the average of the ith, (i − 1)th, and (i + 1)th of the sorted OSADs.
Step 6: For each block in the frame do
    If the OSAD of the block > g then
        Calculate the motion vector of this block.
    else
        Set the motion vector (0,0) to this block.
  EndFor.
In Algorithm II, the given percentage p is used to estimate the OSAD value that serves as the threshold. A formula from statistics [8] provides a guideline for choosing an appropriate sample size m:
\[ m = \left( \frac{z\, s}{E} \right)^{2}, \tag{4} \]
where E is the maximum allowable error, s is the estimated standard deviation of the population, and z is the value from the standard normal distribution corresponding to the desired level of confidence. In the experiments, the standard deviation of the OSAD depends on the activity of the video sequence, ranging from 500 to 900 for blocks of size 8 × 8. If an error of one is allowed for each pixel, then the value of E is 8 × 8 = 64. Generally, the desired confidence level is 95%, so the z value is 1.96. s can be estimated roughly using a small sample, such as one of size ten. Suppose the estimated s is 500; the above formula then gives m = (1.96 × 500/64)² = 234.473, so the desired m is 234. By this formula, the sample size is determined by the allowable error and the variability of the population being studied, and is not related to the total number of blocks in a frame. The sample size m is therefore relatively small compared to the total number of blocks in a frame, and the time required for computing the threshold g in Steps 1–5 of Algorithm II is relatively short compared to the total motion estimation time. Thus the motion estimation time for each frame can be controlled using Algorithm II.
4. Experimental results
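For reference, the control flow of Algorithms I and II from Section 3 can be sketched in Python as follows. This is a hedged sketch rather than the author's implementation. Several details are assumptions made here: `search(k)` is a caller-supplied function returning the SAD of block k after motion search; `patience` stands for the number of consecutive small dSADs that stops Algorithm I; a block with OSAD equal to zero is counted as a small dSAD to avoid a division by zero; and the ranks i − 1 and i + 1 in Algorithm II are clamped to the valid range.

```python
import heapq
import math
import random


def algorithm_one(osads, search, patience=3, ratio=0.10):
    """Search blocks in decreasing OSAD order until `patience` consecutive
    blocks each gain less than `ratio` of their OSAD.

    Returns (sads of searched blocks, zero-motion block indices, threshold)."""
    heap = [(-o, k) for k, o in enumerate(osads)]   # negate: heapq is a min-heap
    heapq.heapify(heap)
    sads, small, threshold = {}, 0, None
    while small < patience and heap:
        neg_osad, k = heapq.heappop(heap)
        block_osad = -neg_osad
        sads[k] = search(k)
        threshold = block_osad                      # OSAD of the last popped block
        dsad = block_osad - sads[k]
        # Assumption: a block with OSAD == 0 cannot improve and counts as small.
        if block_osad == 0 or dsad / block_osad < ratio:
            small += 1
        else:
            small = 0
    zero_mv = sorted(k for _, k in heap)            # remaining blocks get (0, 0)
    return sads, zero_mv, threshold


def sample_size(z, s, e):
    """Eq. (4): m = (z * s / E)^2, truncated to an integer."""
    return int((z * s / e) ** 2)


def algorithm_two_threshold(osad_of, positions, p, m, rng):
    """Steps 1-5: estimate the threshold g from m randomly sampled positions."""
    sample = rng.sample(positions, m)
    values = sorted((osad_of(pos) for pos in sample), reverse=True)
    i = math.floor(p * m)                           # 1-based rank in sorted order
    # Assumption: ranks i-1 and i+1 are clamped to the valid range 1..m.
    lo, hi = max(i - 1, 1), min(i + 1, m)
    window = values[lo - 1:hi]
    return sum(window) / len(window)


def blocks_to_search(osads, g):
    """Step 6: indices of blocks whose OSAD exceeds the threshold g."""
    return [k for k, o in enumerate(osads) if o > g]


# The sample size worked out in the text (z = 1.96, s = 500, E = 64).
m = sample_size(1.96, 500, 64)
rng = random.Random(0)   # fixed seed only to make the sketch repeatable
```

Passing the random generator explicitly keeps the sampling repeatable, which is convenient when comparing thresholds across runs.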
The experiments are conducted using three test sequences: Susie, Football and Tennis. For each sequence, 40 consecutive frames are tested, and each frame has 720 × 480 pixels. The full search algorithm [3] and the 3-step search algorithm [5] are applied for calculating the motion vector of each block. The block size is 8 × 8 and the maximum horizontal and vertical search displacement is ±7. The peak signal-to-noise ratio (PSNR) is used as a measure of the quality of the motion-compensated image. The PSNR is defined as
\[ \mathrm{PSNR} = 10 \log_{10} \frac{255^{2}}{\mathrm{MSE}}, \tag{5} \]
and the MSE is defined as
\[ \mathrm{MSE} = \frac{\sum_{i}^{M} \sum_{j}^{N} \left[ f(i,j) - \hat{f}(i,j) \right]^{2}}{M \times N}, \tag{6} \]
where f(i, j) denotes the pixel value in the current frame, \hat{f}(i, j) represents the corresponding predicted pixel value, and M and N are the height and width of the frame, respectively. All the motion vectors in a frame comprise a motion field. Fig. 4 presents examples of motion fields computed from the two frames of the Tennis sequence shown in Fig. 3, where the full search algorithm is applied to each of the blocks in the current frame. In Fig. 4, the arrows represent the direction and magnitude of the motion vectors, and the dots represent the zero motion vectors. In Fig. 4(a), the motion vectors are calculated for all blocks, while in Fig. 4(b), the motion vectors are calculated for only 20% of the blocks. Notably, the bottom-left side of the original frame, the area of the tennis table, produces corresponding motion vectors in Fig. 4(a). Since the table shows little change, these motion vectors are ineffective. Clearly, a good
threshold for filtering out these ineffective blocks can save extensive computational effort with only an insignificant increase in distortion. Table 1 demonstrates the performance of the proposed Algorithm I and compares it with that of the constant-thresholding approach as in [10,13]. To provide a comparison base, the compared methods [10,13] are implemented as the full search algorithm with the constant-thresholding approach. To eliminate the effect of block size on the threshold value, in Table 1, the threshold value (TH) is represented as a mean absolute difference (MAD):
\[ \mathrm{TH} = \frac{\mathrm{OSAD}}{\text{block size}}. \tag{7} \]
Fig. 3. Two continuous frames in Tennis sequence, frame #1 and frame #2.
Fig. 4. Examples of motion field. (a) Search all the blocks; (b) Search 20% of the blocks in the current frame.
Table 1
The performance comparisons for Algorithm I and constant threshold approach

Sequence  Method                                           TH        PSNR (dB)  Blocks computed
Football  1. Full search to all frames                     0         25.412     100.00%
          2. Algorithm I to all frames                     Adaptive  25.347     43.74%
          3. Constant                                      4         25.411     68.89%
                                                           12        25.237     37.69%
          4. Algorithm I + constant, TH 1                  7.516     25.385     48.31%
             (TH 1 + TH 2)/2                               6.633     25.396     51.81%
             (TH 1 + TH 2 + TH 3)/3                        7.792     25.380     47.43%
Tennis    1. Full search to all frames                     0         26.359     100.00%
          2. Algorithm I to all frames                     Adaptive  26.278     37.63%
          3. Constant                                      4         26.346     84.43%
                                                           12        26.103     34.13%
          4. Algorithm I + constant, TH 1                  11.56     26.131     36.20%
             (TH 1 + TH 2)/2                               10.88     26.171     39.57%
             (TH 1 + TH 2 + TH 3)/3                        13.00     26.033     29.94%
Susie     1. Full search to all frames                     0         34.011     100.00%
          2. Algorithm I to all frames                     Adaptive  33.795     37.47%
          3. Constant                                      4         33.850     43.20%
                                                           12        31.914     14.41%
          4. Algorithm I + constant, TH 1                  5.906     33.589     33.20%
             (TH 1 + TH 2)/2                               5.852     33.598     33.44%
             (TH 1 + TH 2 + TH 3)/3                        6.010     33.502     31.26%
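The PSNR values reported in Table 1 follow Eqs. (5) and (6). A minimal Python sketch of these two measures is given below, assuming 8-bit frames stored as nested lists of equal size; the function names `mse` and `psnr`, and the convention of returning infinity for identical frames, are assumptions made here.

```python
import math


def mse(cur, pred):
    """Eq. (6): mean squared error between the current and predicted frames."""
    rows, cols = len(cur), len(cur[0])
    total = sum(
        (cur[i][j] - pred[i][j]) ** 2
        for i in range(rows)
        for j in range(cols)
    )
    return total / (rows * cols)


def psnr(cur, pred):
    """Eq. (5): PSNR in dB for a peak pixel value of 255."""
    error = mse(cur, pred)
    # Assumption: identical frames are reported as infinite PSNR.
    return math.inf if error == 0 else 10 * math.log10(255 ** 2 / error)
```

For example, a 2 × 2 frame whose prediction differs by 2 and 4 at two pixels has an MSE of 5, which corresponds to a PSNR of about 41.1 dB.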
In Table 1, first, the non-thresholding approach is applied to the three test sequences, and the average PSNR is shown. In this approach, the threshold value is zero, the percentage of the number of computed blocks over the total number of blocks is 100%, and the PSNR is optimal. Second, applying the proposed Algorithm I to all the 40 frames of the Football test sequence
indicates that the computation time decreases to 43% and the PSNR decreases by 0.065 dB compared to the full search method; the corresponding results are 38% and 0.081 dB for the Tennis sequence, and 37% and 0.216 dB for the Susie sequence. In this approach, the threshold value for each frame is adapted by real-time computation. Third, the constant-thresholding approach is applied using values from 4 to 12, as suggested by Shi and Xia [10] and Yang et al. [13]. As shown in Table 1, the threshold values 4 and 12 result in either a high computation time (TH = 4) or a low PSNR (TH = 12). The problem is that a constant threshold or an off-line adaptive threshold value is not applicable in all cases. The fourth method is a combined approach. The symbol TH i in Table 1 represents the threshold value of frame i obtained using Algorithm I. In the combined approach, a threshold value is first computed using Algorithm I from the first few frames, and the constant-thresholding approach is then applied to the other frames using the TH i values. In Algorithm I, a time overhead is associated with obtaining the threshold value for each frame. If a good threshold can be obtained from the first few frames, then the constant-thresholding approach can be applied to the other frames without this overhead. Accordingly, Table 1 shows that the combined approach performs well in terms of both PSNR and computation time.
Fig. 5 shows the distortion of the proposed algorithms using the first 40 frames of the Tennis sequence. Using Algorithm I, the percentage of computed blocks differs among frames. For a low activity frame, the percentage of computed blocks can be below 10%, whereas for a high activity frame, the percentage may exceed 90%. The average percentage of computed blocks for the 40 frames is 38%. For comparison, a percentage of 38% is applied using Algorithm II with a sample size m of 300. In Fig. 5, the original MSE line shows the activity of each frame, the 100% full search line shows the optimal MSE in the motion estimation, and the lines labeled "Algorithm I" and "38% Full Search" present the proposed algorithms. If the frame activity is low, as in frames 0–22, the proposed algorithms have almost the same MSE as the 100% full search method, but the computation time is much smaller. If the frame activity is high, such as in frames 23–40, Algorithm I also has almost the same MSE as the 100% full search, but requires a longer computation time than for the low activity frames. Moreover, if the time is bounded by a user-specified percentage, say 38%, then Algorithm II may have slightly higher distortion than Algorithm I, but the running time can be controlled. The MSE is higher when applying the 3-step search algorithm than when applying the full search algorithm, as shown by the line marked "100% 3-step" in Fig. 5. If Algorithm II is applied using the 3-step search algorithm with 38%, then the MSE values closely approach those obtained with the 100% 3-step search algorithm, as shown by the line labeled "38% 3-step" in Fig. 5. Since the motion estimation time for each block differs with the 3-step search algorithm, the time reduction cannot be measured by the number of blocks computed. Table 2 lists the relationship between the percentage of blocks computed and the relative running time when applying the 3-step search algorithm to the first 40 frames of the Tennis sequence. The relative running time is the ratio of the time required for computing part of the blocks to that required for computing all the blocks. The third column of Table 2 shows that computing 38% of the blocks in a frame requires 68% of the running time required to compute all of the blocks, but the MSE only increases by 3.4%.
Table 2
The running time comparisons for Algorithm II using the 3-step search algorithm

Blocks computed         100%   50%    38%    30%    20%
Relative running time   100%   81%    68%    59%    49%
MSE increased           0      0.6%   3.4%   5.7%   9.9%
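The combined approach listed as method 4 in Table 1 can be illustrated with a short Python sketch. It assumes that the per-frame thresholds of the first few frames have already been obtained with Algorithm I and are expressed as MAD values per Eq. (7), and that the block size in Eq. (7) is the number of pixels in a block (64 for 8 × 8 blocks). The names `constant_threshold` and `blocks_above` are hypothetical.

```python
def constant_threshold(first_thresholds):
    """Average the Algorithm I thresholds of the first few frames,
    e.g. (TH 1 + TH 2 + TH 3) / 3 as in Table 1."""
    return sum(first_thresholds) / len(first_thresholds)


def blocks_above(osads, th, block_pixels=64):
    """Indices of blocks whose MAD (OSAD / block size, Eq. (7)) exceeds th.

    Assumption: block size is the pixel count of a block."""
    return [k for k, o in enumerate(osads) if o / block_pixels > th]


# Thresholds of three warm-up frames (illustrative values, not measured data).
th = constant_threshold([6.0, 5.0, 7.0])
selected = blocks_above([640, 128, 384, 385], th)
```

Only the selected blocks would then be passed to the motion search in the remaining frames; the others keep the zero motion vector.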
Fig. 5. MSE of the Tennis sequence from frames 1 to 40.
5. Conclusion
This paper presented two frame-based adaptive thresholding algorithms for motion estimation with reduced computation requirements. The first algorithm determines a threshold in terms of the optimal trade-off between run-time and distortion, and the second algorithm determines a threshold based on a user-specified percentage of computed blocks. Using the adaptive threshold, each frame can select effective blocks and remove ineffective blocks from the motion estimation procedure. The experiments have shown that the proposed algorithms outperform both the non-thresholding approach and the constant-thresholding approach. In real-time, low-rate video coders, the proposed algorithms efficiently determine thresholds for general motion estimation algorithms. In the experimental results, the combination of the proposed Algorithm I and the constant-thresholding approach performs well in terms of both PSNR and computation time. This suggests that the proposed frame-based adaptive thresholding approach could be extended to a shot-based one, which will be pursued in future work.
Acknowledgements The author would like to thank the anonymous referees for their valuable suggestions that led to improvement of this paper.
References
[1] Chalidabhongse J, Jay Kuo C-C. Fast motion vector estimation using multiresolution-spatio-temporal correlations. IEEE Transactions on Circuits System Video Technology 1997;7(3):477–89.
[2] Chimeinti A, Ferraries C, Pau D. A complexity-bounded motion estimation algorithm. IEEE Transactions on Image Processing 2002;11(4):387–92.
[3] Jain JR, Jain AK. Displacement measurement and its application in interframe image coding. IEEE Transactions on Communications 1981;29(12):1799–808.
[4] Kim JN, Choi TS. A fast full-search motion-estimation algorithm using representative pixels and adaptive matching scan. IEEE Transactions on Circuits System Video Technology 2000;10(7):1040–8.
[5] Koga T, et al. Motion compensated interframe coding for video conferencing. In: Proceedings of the National Telecommunication Conference. November 1981, New Orleans, LA. p. G 5.3.1–5.3.5.
[6] Li B, Zeng R, Liou ML. A new three-step search algorithm for block motion estimation. IEEE Transactions on Circuits System Video Technology 1994;4:438–42.
[7] Liu B, Zaccarin A. New fast algorithm for the estimation of block motion vectors. IEEE Transactions on Circuits System Video Technology 1993;3(2):148–57.
[8] Mason RD, Lind DA, Marchale WG. Statistics: an introduction. 5th ed. Thomson Publishing Inc., Singapore, Asia; 1998. [chapters 9 and 13].
[9] Mitchell JL, Pennebaker CE, Legall DJ. MPEG video compression standards. New York: Chapman and Hall; 1997.
[10] Shi YQ, Xia X. A thresholding multiresolution block matching algorithm. IEEE Transactions on Circuits System Video Technology 1997;7:437–40.
[11] Tseng SY. Efficient motion estimation algorithm using run-time and distortion optimization approach. In: Proceedings of 2003 IEEE International Conference on Multimedia and Expo (ICME 2003), Vol. I. Baltimore, MD, USA, July 2003. p. 361–4.
[12] Oh HS, Lee HK. Block-matching algorithm based on an adaptive reduction of the search area for motion estimation. Real-Time Imaging 2000;6:407–14.
[13] Yang JF, Chang YC, Chen CU. Computation reduction for motion search in low rate video coders. IEEE Transactions on Circuits System Video Technology 2002;12:948–51.