Dynamic core allocation for energy efficient video decoding in homogeneous and heterogeneous multicore architectures


Future Generation Computer Systems 56 (2016) 247–261


Rajesh Kumar Pal ∗, Ierum Shanaya, Kolin Paul, Sanjiva Prasad
Indian Institute of Technology Delhi, India

highlights

• Present dynamic core allocation for video decoding on homogeneous multicores.
• Present an energy-efficient video decoding method for heterogeneous multicores.
• Show energy savings with dynamic core allocation.
• Analyze factors influencing frame decoding time.

article info

Article history:
Received 15 January 2015
Received in revised form 23 August 2015
Accepted 16 September 2015
Available online 28 September 2015

Keywords:
Core allocation
H.264 video decoding
Embedded system
Heterogeneous multicores

abstract

This paper describes two dynamic core allocation techniques for video decoding on homogeneous and heterogeneous embedded multicore platforms, with the objective of reducing energy consumption while guaranteeing performance. While decoding a frame, the scheme measures ‘‘slack’’ and ‘‘overshoot’’ over the budgeted decode time and amortizes them across the neighboring frames to achieve the overall performance, compensating for the overshoot with the slack time. It allocates, on a per-frame basis, an appropriate number and type of cores for decoding to guarantee performance, while saving energy by using clock gating to switch off unused cores. Using the Sniper simulator to evaluate the implementation of the scheme on a modern embedded processor, we obtain an energy saving of 6%–61% while strictly adhering to the required performance of 75 fps on homogeneous multicore architectures. We achieve an energy saving of 2%–46% while meeting the performance of 25 fps on heterogeneous multicore architectures. Thus, we show that substantial energy savings can be achieved in video decoding by employing dynamic core allocation, compared with the default strategy of allocating as many cores as available.

© 2015 Elsevier B.V. All rights reserved.

1. Introduction

Contemporary video decoders for real-time, large and detailed digital movies/videos on embedded platforms require high CPU performance. H.264 [1] is one of the best video codecs in terms of compression and quality. Its compression efficiency is at least twice that of earlier codecs such as MPEG-2 and MPEG-1 [2]. The decoding process of H.264 produces video with perceptibly high quality. However, these advanced features come at the cost of increased computational requirements: the video encoders/decoders exploit advanced instruction sets (MMX/SSE/SSE2), instruction-level parallelism, and the parallelism provided by modern processors (multi/manycores). Multi-threaded implementations of the H.264 codec take advantage of the multiple cores provided



Corresponding author. E-mail addresses: [email protected] (R.K. Pal), [email protected] (I. Shanaya), [email protected] (K. Paul), [email protected] (S. Prasad). http://dx.doi.org/10.1016/j.future.2015.09.018 0167-739X/© 2015 Elsevier B.V. All rights reserved.

by embedded processors such as ARM Cortex A15 and Intel Silvermont. While playing video on devices such as mobiles and tablets, users expect high video quality as well as long battery life. These are conflicting requirements: higher video quality means better resolution and higher frame rates, which need more computation and thus more energy. The general approach in video decoders is to utilize as many cores as are available on the multicore platform. For example, in libavcodec, the leading audio/video codec library, there is a threads flag in AVCodecContext; when set to auto, it lets the decoder detect the number of available cores and spawn as many threads. Employing all available cores definitely helps meet the performance target, albeit at the cost of higher energy consumption. We show in this work that with intelligent core allocation at the frame level, performance can be guaranteed with significantly lower energy consumption. The proposed core allocation methodology can be employed easily on all embedded multicore platforms to enhance battery life while providing the desired performance.
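For concreteness, the following is a minimal sketch, not the authors' code, of how a libavcodec-based decoder can be opened with an explicit thread count instead of the default auto-detection; the helper name and error handling are our own, and field names may vary slightly across FFmpeg versions.

```c
/* Minimal sketch (not the authors' code): open an H.264 decoder with a
 * fixed slice-thread count instead of letting libavcodec auto-detect
 * the number of cores. */
#include <libavcodec/avcodec.h>

AVCodecContext *open_h264_decoder(int n_threads)
{
    const AVCodec *dec = avcodec_find_decoder(AV_CODEC_ID_H264);
    AVCodecContext *ctx = dec ? avcodec_alloc_context3(dec) : NULL;
    if (!ctx)
        return NULL;

    /* thread_count = 0 asks libavcodec to detect the core count and spawn
     * that many threads; a positive value pins the thread count instead. */
    ctx->thread_count = n_threads;
    /* Decode the slices of one frame in parallel, matching the slice-level
     * decomposition assumed throughout this paper. */
    ctx->thread_type = FF_THREAD_SLICE;

    if (avcodec_open2(ctx, dec, NULL) < 0) {
        avcodec_free_context(&ctx);
        return NULL;
    }
    return ctx;
}
```

Passing 0 for n_threads restores the default behavior of one thread per detected core.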


Fig. 1. H.264 decoder.

Soft real-time applications such as video encoders/decoders and speech/image recognition have soft deadline constraints. These applications can gracefully accommodate occasional deadline misses. For example, if the video decoder can decode and render the required number of frames (more than 24 fps) within the deadline of 1 s, in spite of a few frames locally missing their individual deadlines, an overall perceptibly acceptable video quality is achieved. From the performance perspective, these applications may miss their local deadlines occasionally but must meet the global deadline. Most frames are decoded before their local deadlines. Thus one can amortize the decode times across frames and compensate for the occasional local deadline misses while still meeting the global deadline. The technique explored in this paper is based on this observation. Isovic et al. [3] used an alternative strategy of frame skipping to meet the global deadline.

Embedded processors for multimedia communication devices often adopt heterogeneous multicore architectures in order to achieve good power efficiency when executing mixed control/data processing tasks. When playing a video, the processing resources are responsible for more than 60% of the power consumption [4,5]. This leads to a drastic decrease in the battery autonomy of mobile devices, as lithium battery technologies are not evolving fast enough to absorb the ever-growing energy requirements of such mobile architectures [6]. Due to the limitations of microprocessor fabrication technologies, it is expected that only a 20% energy saving will be achieved in the next few years [7]. Thus, one should consider optimizing the overall system, including both the hardware and the software platforms, to address the energy saving issue. To take full advantage of these multi-level energy saving opportunities, mobile system designers must deal with increasing system complexity and heterogeneity.

Various approaches, such as dynamic task scheduling, heterogeneous architectures, hybrid parallel and hybrid pipeline schemes, and frame-level parallelism, are used to obtain performance and energy efficiency in video decoding. Most prior work [8–10] has used DVS/DVFS to trade off energy against performance. To the best of our knowledge, ours is the first use of slack time to determine when and how many cores to switch on dynamically at runtime to save energy while meeting performance constraints. When decoding a frame, we measure the slack and overshoot times over a budgeted decode time and use the slack time to compensate for the overshoot. We assign a suitable number of cores on a per-frame basis to guarantee performance in homogeneous multicore architectures. The unused cores are shut off using clock/power gating, thus saving energy. For heterogeneous multicore architectures, we assign suitable types of cores in the required numbers on a per-frame basis to preserve energy while guaranteeing performance. We find that our schemes are profitable for embedded platforms, as significant benefits can be achieved in terms of energy, without changing the hardware or software, merely by controlling the core allocation dynamically. This paper makes four major contributions:

• We show that the default strategy of allocating as many cores as available on the platform leads to substantial energy wastage.

• We present a simple core allocation methodology for H.264 video decoding on homogeneous multicore platforms to meet the performance while conserving energy.

• We present a dynamic core allocation methodology for video decoding on heterogeneous multicore architectures that saves energy while meeting performance.

• We identify and analyze the factors that influence frame decoding time on multicore architectures.

In the next section we overview the default core allocation and then present the dynamic core allocation strategy that conserves energy while meeting the required performance. In Section 3 we describe our experimental methodology and the details of the test video sequences. Section 4 provides an insight into the factors influencing frame decoding time. Section 5 presents the results and their analysis. Section 6 discusses related work in this area. Section 7 concludes the paper with directions for extending this work.

2. Core Allocation for H.264 Decoder

In this section, we present an overview of core allocation in the H.264 video decoder and thereafter propose dynamic core allocation.

2.1. Default core allocation

Threaded implementations of H.264 video decoders are built in the following two fundamentally different ways.

• Functional Decomposition: As shown in Fig. 1, each frame gets decoded after passing through several functional stages: inverse transform and quantization, intra prediction, motion compensation, and deblocking filter. Each function is performed by a separate thread. This implementation has limited scalability on multicore architectures because, to increase the number of threads, we must partition a function into two or more threads. Due to the interdependence and tight coupling between sub-functions of a function, this division becomes difficult. Moreover, an unbalanced workload across the threads can cause thread waiting and synchronization delays. These limitations hamper the utilization of a large number of cores for such implementations of video decoders.

• Data Domain Decomposition: The hierarchy of data domain decomposition in H.264 is shown in Fig. 2. An H.264 video consists of many groups of pictures (GOPs). Each GOP is made up of a number of frames. Each frame consists of slices. A slice is an independent and self-contained encoding unit. A slice is further divided into macroblocks of 16 × 16 pixels. Motion estimation and entropy decoding are done at the macroblock level. Depending on the required scalability, threads can be created at different levels of this hierarchy. As we move towards lower levels of the hierarchy, more threads can be created. One significant advantage of data domain decomposition over functional decomposition is that each thread performs the same operation on different data blocks, each having the same dimensions. The scalability, thread homogeneity, and even distribution of workload amongst the threads make this type of implementation suitable for exploiting a large number of cores. Considering these factors, we decided on a data domain decomposition with a multi-threaded implementation of the H.264 decoder.



Fig. 2. Data domain decomposition in H.264.

Fig. 3. Opportunity to compensate for overshoot with slack amongst the frames. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Our video decoder is an FFmpeg [11] based multi-threaded decoder. The FFmpeg API is used in most audio/video codecs and players/applications such as VLC, MPlayer, Media Player Classic, Plex, etc. The default behavior of FFmpeg (and other contemporary decoders) is to spawn as many threads as there are cores on a platform and employ all the cores for decoding frames to render video. Utilizing all the resources ensures that performance is not only met but surpassed. However, the flip side is excessive energy consumption. We observe from our experiments that a reasonable number of threads suffices to meet the performance requirements as well as save energy. Considering the energy constraints of embedded platforms, it becomes important to have an energy-efficient core allocation that meets the performance requirements of the videos.

2.2. Dynamic core allocation

Video decoding is performed frame by frame. To achieve a perception of motion, at least 25 frames must be decoded and displayed in one second. Our proposed core allocation methods are based on three observations. First, depending on the data, the decoding of different frames may take differing amounts of time. Many frames are decoded well within the required time whereas a few frames exceed the allocated time budget. The time slack created by frames decoded early can be utilized to compensate for the overrun over the budgeted time on (a few) others. For example, Fig. 3 shows the decode times of frames using 4 cores, where each frame has a budgeted decode time of 13.88 msec at a frame rate of 72 frames per second (fps). We can see that frames 1, 2, 4, 5, 7, 8, and 10 decode earlier than the time budget and therefore create time slack (green bar), but frames 3, 6, and 9 overshoot (red bar) the time budget. The second observation is that a larger number of cores generally reduces the decoding time for a frame. The third observation is that a big core results in faster decoding of a frame as compared to a smaller one, mainly because of its more powerful and aggressive resources. Based on these key observations, we formalize dynamic core allocation methodologies for homogeneous and heterogeneous multicore architectures.

2.2.1. Homogeneous multicore architectures

The dynamic core allocation method for homogeneous multicore architectures exploits the fact that a larger number of cores decodes a video frame faster than a smaller number of cores. However, the power consumption is also higher with a larger number of cores. Our proposed method aims to save energy by trading off performance with power while maintaining the required performance. As shown in Fig. 4, frames are typically divided into slices. Slices are self-contained decoding units having no dependence on one another, permitting many-threaded implementations of decoders at the slice level. Each slice is given to a thread for decoding.1 The slice undergoes inverse quantization, inverse transform, spatial or motion prediction, and filtering stages in the process of decoding. The threads join back after slice decoding and the parent thread outputs the complete frame.

To trade off decoding time against energy consumption, we control, on a per-frame basis, how many cores are assigned for decoding. All the threads get distributed evenly over the assigned cores, and the unassigned cores are switched off using clock gating for the duration of one frame decoding. Different frames can get different numbers of cores; e.g., the ith frame can be assigned 4 cores, and for the next (i + 1)th frame, 8 cores may be assigned. Employing more cores helps in reducing the decode time of a frame whereas using fewer cores is more energy efficient.

We characterize performance in terms of the desired frame rate. The decoder extracts the encoded frame rate from the compressed bit stream and utilizes it as a measure of performance. As shown in Fig. 5, we calculate budgeted_time, the time budgeted to decode each frame, as the inverse of the frame rate. The time taken to decode a frame, decode_time, is monitored to calculate the slack (unused time) and overshoot (excess time over the allocated budget). To decide how many cores are to be allocated for frame decoding on a per-frame basis, we keep the progressive total of frame decoding times in a counter time_in_hand. The counter indicates the occasions for increasing or decreasing the core count. The video decoding starts with a core count of 2. We double the core allocation as soon as the counter crosses a lower threshold; employing more cores reduces the decode time of the subsequent frames, at the cost of higher energy consumption. We halve the core allocation if the time_in_hand counter crosses an upper threshold, thus saving energy. This dynamic adaptation guarantees adherence to the desired performance while conserving energy. The upper (Uthd) and lower (Lthd) thresholds are tweakable parameters and generally their values range between 0 and 5. As obtained empirically from our experiments, we have set Lthd and Uthd equal to 0 and 1 respectively. The threshold values influence the aggressiveness of the algorithm by deciding the amount of time_in_hand before we halve or double the core allocation. We double the core allocation when time_in_hand goes negative and halve it when time_in_hand exceeds the budgeted_time. We now discuss the reasons for increasing/decreasing the core count by a factor of 2.

1 This work presumes slice level multi-threading of the decoder.


Fig. 4. Thread mapping on allocated cores.
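To make the thread mapping of Fig. 4 concrete, the following is a minimal sketch, not the authors' implementation, of distributing slice-decoding threads round-robin over the p currently allocated cores using Linux CPU affinity; the helper name and the use of pthread affinity are our assumptions.

```c
/* Minimal sketch, assuming Linux/glibc: pin the thread decoding slice
 * `slice_id` to one of the first `p` (allocated) cores, so that slices
 * are spread evenly over the allocated cores in round-robin order. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static int pin_slice_thread(pthread_t thread, int slice_id, int p)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(slice_id % p, &set);   /* core index = slice index mod p */
    return pthread_setaffinity_np(thread, sizeof(set), &set);
}
```

With 8 slices and p = 4 allocated cores, for example, slices 0–7 land on cores 0, 1, 2, 3, 0, 1, 2, 3, giving the even distribution described above.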

In Fig. 5, we can set n = 1, 2, 3, 4, 5, . . . for the core allocation update p = n * p or p = p/n. However, with higher values of n, whenever the core allocation changes, the quantum of change is large. For example, for n = 4, the possible core allocations are 1, 4, or 16 (i.e., the core allocation changes by a factor of 4). We observe that changing the core count from 4 to 16 may well decrease the decode time, but at the cost of the excess energy consumed by the 16 cores. So it is better to go from 4 to 8 cores, as the performance benefits are similar to those of 16 cores for a majority of frames, but without the substantial extra energy cost. For certain cases where very aggressive frame decoding is required, it may make sense to go from 4 to 16. Note that having finer control on core allocation is better, as we can move in smaller steps to balance decoding time requirements against energy cost. We also note that increasing the core count linearly is less effective because reaching an appropriate core count takes longer than with an increase by a factor of 2.

For a video with a frame rate of 30 fps, 30 frames should be decoded within a second. Thus, each frame should ideally be decoded within a budgeted_time of 33 msec (1/30 fps = 0.033 s). We record the cumulative decoding time for a set of frames, and obtain the difference between the progressive budgeted_time and the cumulative decoding time as time_in_hand after each frame is decoded. The measure time_in_hand gives an idea of how much time buffer we have for trading off with energy. In case this buffer time is zero (or negative, i.e., below Lthd) we immediately increase the core count for the next frame in the hope of compensating and meeting the timing constraints. If we have surplus time (i.e., above Uthd) then we reduce the core count to save energy. This trade-off is continuously performed at runtime after each frame is decoded.

In essence, Lthd and Uthd are the control knobs to increase and decrease the number of allocated cores. Uthd determines the buffer time before the core count can be decreased. As we would like to have some buffer time before decreasing the core count, the lower bound of Uthd should be greater than 0. The value of Uthd influences how frequently the core count decreases. In general, irrespective of the video, we observe that for Uthd > 5 the number of core count changes is reduced by less than 9% and not much energy is saved. Therefore the value of Uthd should range between 0 and 5. Lthd determines the minimum buffer time, below which the core count should be increased. The higher the value of Lthd, the more frequently higher core counts are selected. If the value of Lthd is 0, the core count will be increased conservatively (i.e., only when the buffer time becomes negative). If the value of Lthd is 5, the core count increases

frequently. We find that keeping the value of Lthd > 5 results in the selection of the maximum core count for 87% of the frames. As our scheme dynamically increases/decreases the core count after every frame, it adjusts well to different types of video. This notion is validated by our results, where the scheme selects 2 cores for low definition videos and a combination of 8/4/2 cores for high definition videos. The allocation method dynamically increases/decreases the core count in order to meet the desired timing requirements while preserving energy. For a few exceptional cases of aggressive HD videos where even the maximum core count of the multicore platform is incapable of decoding the video within the desired time lines, dynamic core allocation will also suffer from the inadequacy of the hardware. However, for all other cases, the method saves energy by selecting an appropriate number of cores so that overall the required frame rate can be guaranteed.

2.2.2. Heterogeneous multicore architectures

The dynamic core allocation method for heterogeneous multicore architectures uses the fact that a big core often decodes a video frame faster than a smaller core. However, the power consumed by a bigger core is also higher due to its larger transistor count. While maintaining the required performance for video decoding, we trade off performance with power to preserve energy. Similar to the homogeneous approach, we divide a frame into a number of slices. Each slice is assigned to a thread for decoding. Taking cognizance of the type of core, big or small, we map different numbers of threads onto different cores on a per-frame basis. The number and the type of cores that are powered on change depending on the aggressiveness of decoding required to maintain the performance.

Fig. 6 shows different configurations of a heterogeneous multicore architecture motivated by the big.LITTLE architecture of ARM. The architecture consists of 4 big cores and 4 small cores of the same instruction set architecture (ISA). The big and small cores operate at 2.1 GHz and 1.5 GHz respectively. When aggressive decoding is required, the 4 big cores are run at the same time; for a simple frame, even 2 small cores can be sufficient. Any configuration can be chosen to decode frames. This improves battery life by not using more cores than are needed. Fig. 6 shows a subset of the possible configurations used in the evaluation of the proposed method. In the performance-driven configuration all 4 big cores are used for decoding and the 4 small cores are kept off. In the energy-driven configuration the 4 small cores are used and the 4 big cores are kept powered off.


Fig. 5. Core allocation at frame level for homogeneous multicore architectures.
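As a complement to the flowchart in Fig. 5, the following is a minimal sketch of the per-frame core-count controller under the assumptions of Section 2.2.1 (Lthd = 0, Uthd = 1, a starting core count of 2, and changes by a factor of 2). It is not the authors' code; decode_frame_on() and set_active_cores() are hypothetical platform hooks.

```c
/* Minimal sketch of the per-frame core-count controller of Fig. 5. */
#define MAX_CORES 8
#define L_THD 0.0                 /* lower threshold, in units of budgeted_time */
#define U_THD 1.0                 /* upper threshold, in units of budgeted_time */

extern double decode_frame_on(int frame, int cores); /* returns decode_time in s   */
extern void   set_active_cores(int cores);           /* clock-gates the other cores */

void decode_video(int n_frames, double frame_rate)
{
    double budgeted_time = 1.0 / frame_rate;  /* e.g. 1/30 fps = 0.033 s          */
    double time_in_hand  = 0.0;               /* cumulative slack minus overshoot */
    int    cores         = 2;                 /* decoding starts with 2 cores     */

    for (int f = 0; f < n_frames; f++) {
        set_active_cores(cores);
        double decode_time = decode_frame_on(f, cores);

        /* positive contribution = slack, negative = overshoot */
        time_in_hand += budgeted_time - decode_time;

        if (time_in_hand < L_THD * budgeted_time && cores < MAX_CORES)
            cores *= 2;           /* falling behind: double the core count    */
        else if (time_in_hand > U_THD * budgeted_time && cores > 2)
            cores /= 2;           /* comfortable buffer: halve to save energy */
    }
}
```

The controller only touches the core count between frames, so it can wrap any slice-threaded decoder loop without changes to the decoder itself.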

Fig. 6. Different configurations of ARM style big.LITTLE heterogeneous multicore architecture.

In the ultra-low power configuration, only two small cores are on, in the interest of power efficiency. The balanced configuration serves with one big and one small core. The proposed dynamic core allocation method switches among these configurations before every frame decoding to preserve energy while meeting the required performance, as shown in Fig. 7. The switching does not lead to any task migration or cache invalidation, as it happens after the end of one frame's decoding and before the start of the next frame's decoding. The switching overhead is negligible: calculating the next configuration and adapting to it via core power-gating takes a few tens of cycles, against a few million cycles for decoding each frame.

We have selected four configurations for evaluation purposes, but more can be used. We selected these configurations as each one has a distinct characteristic that fits a different decoding requirement very well. For example, for decoding an HD video frame with fast motion that needs heavy computation, the 4 big cores (performance-driven) configuration is the best fit, whereas whenever energy can be saved by decoding a frame with low resolution and relatively static scenes, the 2 small cores (ultra-low power) configuration is the better choice. We select the 4 configurations based on two parameters: decoding performance and energy consumption. Fig. 8 gives an insight into the characteristics of


Fig. 7. Dynamic core allocation in heterogeneous multicore architectures.


Fig. 8. Comparison of decode time and energy consumption on heterogeneous configurations. The data is averaged over selected videos and normalized to the baseline configuration (4 big cores).

these configurations for a typical video. The decode time and energy numbers are averaged over our selected videos and normalized to the baseline configuration of 4 big cores. It can be seen from the figure that the energy-driven, balanced, and ultra-low power configurations take 45%, 97%, and 110% more decoding time than the performance-driven baseline configuration, and consume 22%, 41%, and 52% less energy than the baseline configuration. The figure shows the possibilities for trading off decoding time against energy.

For switching between the configurations, we develop a threshold-based approach. Taking a cue from the behavior of the configurations shown in Fig. 8, it is best to use the performance-driven configuration when the time buffer (time_in_hand) goes below 1 unit of budgeted_time, as the next frame can then be quickly decoded to compensate for the time deficit at an additional energy cost. Therefore, we set Thd1 = 1. Above the threshold value of 1, we select the 4 small cores configuration, it being the next most powerful configuration. We note that the 2 small cores configuration is the most energy efficient but about twice as slow as the 4 big cores configuration. Therefore, before this configuration is selected we must have at least 2 units of budgeted_time in hand to compensate for the time and gain on energy. As a result we set Thd3 = 2. We choose a value between Thd1 and Thd3 to differentiate the selection of the 1 big and 1 small cores configuration from the others: when more than 1.5 units of budgeted_time accumulate, we select the 1 big and 1 small cores configuration, and thus set Thd2 = 1.5. Based on the time buffer with respect to the threshold values, an appropriate configuration is selected after each frame at runtime.

Fig. 9 shows the flowchart for core allocation for video decoding on heterogeneous multicore architectures. The method selects one of the four configurations consisting of different numbers and types of cores on a per-frame basis. For example, when the method selects the balanced configuration it implies a selection of 1 big and 1 small core. The allocation method starts with the default configuration of ultra-low power. The other parameters, including the thresholds, are initialized with the values shown in the flowchart. The encoded frame rate is extracted from the compressed bit stream by the decoder; it gives the number of frames to be decoded in a second. The budgeted_time is calculated as the inverse of the frame rate and denotes the quantum of time within which each frame should be decoded. Each frame is decoded on the selected configuration and its decoding time and energy consumption are measured. The slack or the overshoot is calculated depending on whether the current frame gets decoded within the budgeted time or outside it. The cumulative time buffer, time_in_hand, is updated with the slack or overshoot, similar to the homogeneous approach. This is the parameter that is compared with the thresholds to select an appropriate heterogeneous configuration. If the time buffer is above 2 units of budgeted_time then the ultra-low power configuration is selected; if it is between 1.5 and 2 then the balanced configuration is chosen; if the value falls between 1 and 1.5 then the energy-driven configuration is preferred; and if time_in_hand is less than 1 then the performance-driven configuration is opted for. The architecture adapts to the selected configuration by power-gating the unnecessary cores. The threads containing video slices for decoding are mapped to the available cores, and the selected configuration is used to decode the next frame. In this way the method selects different numbers and types of cores to preserve energy while adhering to the decoding timelines.
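The per-frame configuration choice described above reduces to a comparison of the time buffer against the three thresholds. The following sketch is our illustration, not the authors' code, of that selection step with Thd1 = 1, Thd2 = 1.5, and Thd3 = 2 as given in the text; the enum and function names are hypothetical.

```c
/* Minimal sketch of the per-frame configuration selection of Fig. 9. */
enum hetero_config {
    PERFORMANCE_DRIVEN,   /* 4 big cores               */
    ENERGY_DRIVEN,        /* 4 small cores             */
    BALANCED,             /* 1 big core + 1 small core */
    ULTRA_LOW_POWER       /* 2 small cores             */
};

#define THD1 1.0
#define THD2 1.5
#define THD3 2.0

/* Select the configuration for the next frame from the cumulative time
 * buffer (time_in_hand), expressed here in units of budgeted_time. */
enum hetero_config select_config(double time_in_hand, double budgeted_time)
{
    double buffer = time_in_hand / budgeted_time;

    if (buffer < THD1) return PERFORMANCE_DRIVEN; /* deficit: 4 big cores      */
    if (buffer < THD2) return ENERGY_DRIVEN;      /* 1 to 1.5: 4 small cores   */
    if (buffer < THD3) return BALANCED;           /* 1.5 to 2: 1 big + 1 small */
    return ULTRA_LOW_POWER;                       /* 2 or more: 2 small cores  */
}
```

The surrounding per-frame loop is the same as in the homogeneous sketch, with set_active_cores() replaced by a hook that power-gates cores according to the chosen configuration, starting from the ultra-low power default.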

3. Performance evaluation

In this section, we evaluate the performance of our proposed core allocation strategy, measuring how much energy can be conserved while adhering to a predefined performance by adopting our strategy of dynamic core allocation on a per-frame basis in video decoding.

3.1. Characteristics of the video traces

For this study, we take 10 videos2 whose details are given in Table 1. We use the x264 [12] encoder to encode an ensemble of H.264 test sequences. Each uncompressed source sequence is encoded with 8 slices per frame to obtain the encoded bit-stream of the video. For each video, we measure the frame decoding time, energy consumption, and energy delay product (EDP) of each frame. The reported values are the average of five runs with negligible variance.

3.2. Experimental setup

We use the Sniper [13] and McPAT [14] simulators for timing and energy measurements respectively. Our simulated processor resembles Silvermont [15], a popular embedded processor whose architectural parameters are shown in Table 2. This is a 64-bit, out-of-order microprocessor operating at a frequency of 2.4 GHz and implemented in a 22 nm technology node at a Vdd of 1 V. As most current cell phones have up to 8 cores, we evaluate the proposed allocation scheme on an 8-core embedded processor. Our video decoder is an FFmpeg [11] based multi-threaded decoder. We use libavcodec, the leading audio/video codec library, to build an 8-threaded video decoder. The method of multi-threading we use in the decoder is slice multi-threading, in which each thread decodes a slice. The slices of a frame are decoded simultaneously by concurrently running threads. The current frame is decoded completely before a new frame is taken up. We also insert a marker at the start and end of each frame, and measure

2 The videos can be accessed from http://www.cse.iitd.ac.in/~rkpal/video.html.


Fig. 9. Core allocation at frame level for heterogeneous multicore architectures.

timing and energy values between the two markers to obtain the statistics at a per-frame level. We use slice-level multi-threading where each slice is decoded by an independent thread executing on a core. The proposed method forks as many threads as the number of slices in a video. In order to fully exploit a multicore platform, it is desirable that videos are encoded with a larger number of slices than the number of available cores on the multicore platform.
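As one possible realization of the per-frame markers mentioned above, the sketch below assumes Sniper's sim_api.h magic-instruction markers; the marker IDs and the wrapper function are our own, and this is not the authors' instrumentation.

```c
/* Hedged sketch of per-frame instrumentation, assuming Sniper's
 * sim_api.h markers; timing and energy statistics are then collected
 * between the two markers for each frame. */
#include "sim_api.h"

#define FRAME_BEGIN 1
#define FRAME_END   2

void decode_one_frame_instrumented(int frame_no)
{
    SimMarker(FRAME_BEGIN, frame_no);   /* start-of-frame marker */
    /* ... slice threads decode the frame here ... */
    SimMarker(FRAME_END, frame_no);     /* end-of-frame marker   */
}
```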

While decoding, the following three scenarios may arise:

• No of slices = No of cores. Each slice gets decoded on a separate core if all the cores are allocated. If fewer cores are allocated, then the slices get evenly distributed over the allocated cores in a round-robin fashion. In this paper, we have used number of slices = max no of cores = 8.

• No of slices > No of cores. The slices get evenly distributed over the allocated cores.


Table 1
Test videos used for measurements and simulations.

Name | Content | Frames | Resolution | Video mode | Properties
aerobatics | Two aircraft display aerobatics | 150 | 1280 × 720 | 720p (HD) | Panoramic view of sky with fast moving objects
avalanche | Snow avalanche | 150 | 1280 × 720 | 720p (HD) | Fast motion with lots of scene change
carrace | Formula one car race | 150 | 1280 × 720 | 720p (HD) | Very fast moving object
cricket | A cricket shot | 150 | 480 × 360 | 360p (LD) | Low detail average motion
lawntennis | One tennis game between two players | 150 | 1280 × 720 | 720p (HD) | Almost constant environment with a moving object
riverflow | A flowing river | 150 | 1280 × 720 | 720p (HD) | High detail image with average motion
rocketlaunch | Launch of space shuttle | 150 | 1280 × 720 | 720p (HD) | Panoramic detailed view
slowmotion | Slow motion replay of a cycle jump | 150 | 1280 × 720 | 720p (HD) | Detailed motion
tabletennis | One round of table tennis | 150 | 480 × 360 | 360p (LD) | Constant environment with moving objects at low resolution
waterbody | Coverage of non-flowing water body | 150 | 1280 × 720 | 720p (HD) | Dull movement

Table 2
Architectural parameters.

Parameter | Homogeneous (Silvermont) | Heterogeneous (ARM big.LITTLE): Big | Heterogeneous (ARM big.LITTLE): Small
Type of cores | Small | Big | Small
No of cores | 8 | 4 | 4
Frequency | 2.4 GHz | 2.1 GHz | 1.5 GHz
Dispatch width | 2 | 4 | 2
Window size | 32 | 128 | 32
Private L1 instruction cache | 32KB, 8-way set associative, 4 cycle latency | 32KB, 8-way set associative, 4 cycle latency | 32KB, 8-way set associative, 4 cycle latency
Private L1 data cache | 32KB, 8-way set associative, 4 cycle latency | 48KB, 6-way set associative, 6 cycle latency | 24KB, 6-way set associative, 3 cycle latency
Shared L2 unified cache | 1MB, 16-way set associative, 12 cycle latency | 8MB, 16-way set associative, 18 cycle latency | 1MB, 16-way set associative, 12 cycle latency
Main memory | 1GB, 45 cycle access latency | 1GB, 45 cycle access latency | 1GB, 45 cycle access latency

• No of slices < No of cores. In this case, the number of cores that can be used equals at most the number of slices. The degree of parallelism is reduced in this case; there is no other effect.

4. Factors influencing frame decoding time

In this section, we provide an insight into the factors that influence frame decoding time.


4.1. Frame types and size

The Intra (I), Predicted (P), and Bidirectional-predicted (B) frame types have different sizes. The I-frame is a self-contained frame that carries all the data needed to independently create the frame picture, and therefore has the largest size. An I-frame does not depend on any other frame, is always the first frame in a video, and starts each GoP. The I-frame is the index frame and is the most important frame, as it denotes a change in the picture sequence. The loss of an I-frame can disturb all the frames in a GoP, and thus it must be ensured that I-frames are decoded properly. A P-frame carries the motion vectors that indicate the delta changes over the previous frames and therefore has a smaller size than an I-frame. Though smaller in size, a P-frame takes relatively more decoding time than an I-frame of the same size because it undergoes motion prediction whereas an I-frame does not. A P-frame depends on the previous I/P-frames for its decoding; the loss of a P-frame results in artifacts that are carried forward into subsequent frames. A B-frame depends on the previous and the following I/P-frames, and has a much smaller size. On average, the relationship between sizes based on frame type is I ≫ P ≫ B.

In Fig. 10 we show the relationship between frame sizes and decoding times. We note from the figure that the average size of an I-frame is more than 8× the average size of a P-frame. We observe that the decoding time of I-frames increases as the frame size increases. In Fig. 10(b), the decoding time of P-frames is concentrated between 0 and 10 msec for frames varying in size from 1 to 20 KB. Similar results were reported for MPEG-2 by Kumar and Srivastava [16]; we present the results for H.264. In addition, we also observed that a few fast moving videos such as carrace (results not displayed) show larger sizes for P-frames. Being a continuous video of a racing car, carrace does not have many scene cuts and thus has relatively few I-frames, but being fast moving, the content difference between frames increases the size of its P-frames.

4.2. Core count


The number of cores employed for decoding a video affects the average decoding time. The slices of a frame are distributed to the threads, which run simultaneously and decode the frame much faster on more cores. To understand the effect of core count on decoding time, we take an 8-threaded decoder and run it on 2, 4, and 8 core Silvermont configurations. Fig. 11 shows the decoding time of each frame with different numbers of cores at a frame rate of 50 fps. We take the area under the curve to find the time taken by different core counts for decoding aerobatics, consisting of 150 frames. The total times taken by 2, 4, and 8 cores for decoding are 4.7, 3.3, and 2.8 s respectively. This shows that significant time savings can be obtained by employing higher core counts. But we need to remember that higher core counts also bring higher energy consumption. To see the trade-off between performance and energy, we show in Fig. 12 the energy delay product (EDP) for different core counts. We find that, considering EDP, 4 cores is 33% and 2 cores is 72% more efficient than 8 cores. However, it may be noted that 2 cores may fail to meet the required performance constraints. Therefore a good design choice must balance the performance and energy requirements of the video decoding. Fig. 11 shows that the average frame decoding time of aerobatics on a 2-core configuration is 31.91 msec with a standard deviation of 7.49.


Fig. 10. Relationship between frame size and decoding time: (a) avalanche I-frame; (b) avalanche P-frame.

Table 3
Statistical analysis of the data set obtained from decoding 150 frames of aerobatics at 50 fps. All figures denote decode time in msec.

Cores | Minimum decode time | Maximum decode time | Range | Median | Mean | Standard deviation
2 cores | 1.62 | 51.37 | 49.75 | 30.05 | 31.91 | 7.49
4 cores | 0.42 | 71.65 | 71.23 | 6.71 | 22.22 | 24.16
8 cores | 0.41 | 113.97 | 113.56 | 10.82 | 19.13 | 20.57

Table 4
Total decoding time and EDP obtained while decoding the videos at a frame rate of 75 fps on static configurations with different numbers of cores.

Video | Decoding time, 2 cores (s) | Decoding time, 4 cores (s) | Decoding time, 8 cores (s) | EDP, 2 cores (Js) | EDP, 4 cores (Js) | EDP, 8 cores (Js)
aerobatics | 2.304 | 1.630 | 1.419 | 0.253 | 0.614 | 0.937
avalanche | 2.520 | 1.876 | 1.781 | 0.324 | 0.770 | 1.694
carrace | 3.226 | 2.300 | 1.992 | 0.503 | 1.348 | 2.250
cricket | 0.392 | 0.318 | 0.307 | 0.007 | 0.016 | 0.018
lawntennis | 2.351 | 1.556 | 1.549 | 0.269 | 0.554 | 1.130
riverflow | 2.886 | 1.998 | 1.992 | 0.421 | 0.724 | 2.220
rocketlaunch | 2.898 | 2.022 | 1.785 | 0.404 | 0.825 | 1.453
slowmotion | 2.501 | 1.847 | 1.658 | 0.303 | 0.728 | 1.192
tabletennis | 0.363 | 0.311 | 0.302 | 0.006 | 0.016 | 0.021
waterbody | 2.362 | 1.742 | 1.545 | 0.266 | 0.716 | 1.090

Fig. 11. Per-frame decoding time of aerobatics with different core count.

Fig. 12. Energy Delay Product of aerobatics with different core count.

The decoding curve for 2 cores looks uniform across most of the frames as the deviation from the mean is small. In the 4-core and 8-core configurations, the standard deviations are 24.16 and 20.57 respectively, over mean decoding times of 22.22 msec and 19.13 msec. As the average deviation from the mean decoding time is high, we observe big spikes in the decoding curves for 4 and 8 cores. The spikes for 8 cores are higher than those for 4 cores, showing that on particular frames more time is used than with 4 cores. On

the other hand, the number of spikes for 4 cores is larger than in the 8-core case. Due to the high standard deviation, we find the median to be a better measure of central tendency. We calculate medians of 6.71 msec and 10.82 msec for 4 cores and 8 cores respectively. Their decoding curves, interspersed with spikes, can be observed at these medians in Fig. 11. The minimum and maximum decode times, and other statistical parameters for the configurations, are presented in Table 3.

After a detailed analysis of the aerobatics video, we present the total decoding time and EDP for all the videos in Table 4. The static configuration with 2 cores is insufficient for decoding any HD video within the deadline of 2 s at the required frame rate of 75 fps. The LD videos (cricket and tabletennis) are decoded well within the deadline. With 4 cores, all videos except carrace and rocketlaunch meet the timelines. The 8-core static configuration is good enough to decode all videos within their deadline, but at an excessive energy cost. We can see in Table 4 that the EDP increases as the core count increases.

To further understand the impact of increasing core counts, we plot decode time and energy versus the core count in Fig. 13. The decode times are normalized to the maximum decode time, obtained with 2 cores for each video. Similarly, the energy is normalized to the maximum energy, consumed by 8 cores for each video. A common energy consumption pattern is observed across all the videos: the energy consumption increases as more cores are employed. Fig. 13 shows that the energy consumption reduces by 59.75% by using 4 cores


Fig. 13. Impact of increasing core count on decoding time and energy while video decoding on static configurations.

and by 37.49% by using 2 cores over the energy consumed on 8 cores. We find that, for the selected architectural parameters, each core costs an energy of approximately 7.6 J for the HD videos. While increasing the core count, the decode time reduces to 73.60% with 4 cores and to 68.37% with 8 cores, compared with the decoding time on 2 cores.

Frames for which threads have longer wait times, which add up to a larger frame decode time, exhibit a peak in Fig. 11. We investigate the underlying reasons for these peaks using Intel's VTune Amplifier XE [17]. Fig. 14 shows the profile of the multi-threaded decoder running aerobatics. The figure presents the thread concurrency viewpoint that shows the concurrently running threads along the time-line. Each horizontal bar represents the execution of a thread. The light green color and light skyblue color on the bar depict thread running and thread waiting respectively. We observe that threads run and wait intermittently. For frames where threads have longer wait times, the waits add up to the frame decode time and show up as peaks in Fig. 11. We also highlight an instance (black box in Fig. 14) where the parent thread (_start) waits for synchronization on 5 different condition variables and sync objects. The threads share many data objects amongst one another and thus they must synchronize to cooperatively decode each single frame. We find that on average the synchronization overhead increases as the thread count increases. This is the reason that the occasional peaks in frame decode time for 8 cores are larger than for 4 cores in Figs. 11 and 12.

5. Results and analysis

In this section, we evaluate the proposed core allocation methods for homogeneous and heterogeneous architectures and present the results.

5.1. Energy savings in homogeneous architectures

Fig. 15 shows the video decoding time and energy consumption for decoding the avalanche video at a frame rate of 75 fps on different configurations of a homogeneous multicore architecture. The decoding takes less time as we use more cores. For example, 2 cores decode the avalanche video in 2.52 s whereas 8 cores take 1.78 s. However, utilizing fewer cores is much more energy-efficient, as evident from the energy consumption of 17.71 J with 2 cores compared to 51.81 J with 8 cores. We observe an inverse relationship between decoding time and energy consumption. Note that the 1-core and 2-core configurations, though highly energy-efficient, fail to meet the decoding deadline of 2 s (150 frames at a rate of 75 fps). Using the dynamic core allocation methodology for homogeneous architectures, the decoding of avalanche takes 1.98 s at an energy cost of 43.78 J. This results from trading off performance with energy by utilizing different configurations on a per-frame basis, and shows the effectiveness of the proposed methodology: all frames are decoded at a frame rate of 75 fps within the required deadline of 2 s, with energy savings of 15.49% over the baseline (8 cores) configuration.

We use the required frame rate as a measure of performance. We tested the proposed dynamic core allocation method for

Fig. 14. Snapshot of thread concurrency viewpoint obtained from Intel VTune Amplifier showing synchronization delay while video decoding of aerobatics. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)


Table 5
Core count selected by the dynamic core allocation method with varied performance for the avalanche video. Each row gives the core count selected for frames 1–15 and 150, followed by the total energy and decode time for the schedule.

Frame rate | Core count for frames 1-15, ..., 150 | Energy (J) | Decode time (s)
30 fps | 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 ... 2 | 17.71 | 2.52
50 fps | 2 4 4 8 8 4 4 2 2 2 2 2 2 2 2 ... 2 | 19.08 | 2.47
60 fps | 2 4 4 8 8 8 8 8 8 8 8 8 4 2 2 ... 2 | 26.88 | 2.40
75 fps | 2 4 4 8 8 8 8 8 8 8 8 8 8 8 8 ... 2 | 43.78 | 1.98

Fig. 15. Decoding time vs energy consumption on different configurations of a homogeneous architecture for decoding avalanche video at a frame rate of 75 fps. The deadline of 2 s for the video is highlighted with red dashed line. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

homogeneous multicore architectures at different frame rates, namely 30, 50, 60, and 75 frames per second. We show the core-count schedule obtained for avalanche in Table 5. The core count selected by dynamic core allocation for each frame at each frame rate is shown. We observe that the allocation method assigns fewer cores for lower performance. For example, 2 cores are assigned to all the frames for a performance of 30 fps. This is because 2 cores are sufficient to decode frames well within the budgeted time of 33.34 msec per frame. On the other hand, for aggressive performance requirements, the allocation method allots varying numbers of cores to different frames in an effort to meet the performance as well as to reduce energy consumption. The allocation for 60 fps shows an increasing and then decreasing core allocation pattern for the initial few frames. We observe similar varying allocation behavior across the life cycle of video decoding for most of the HD videos at aggressive performance, i.e., frame rates higher than 60 fps. In the last two columns of the table, we also show the corresponding energy consumption and total decoding time for the schedule generated by dynamic core allocation at different frame rates. We find that the energy requirement increases as the frame decoding rate increases. When we compare with the energy consumption of 52.46 J for decoding avalanche on a static configuration of 8 cores, we find that the dynamic core allocation is approximately 16%–66% more energy efficient.

For further analysis in this study, we set the performance requirement to a frame rate of 75 fps; therefore all 150 frames of each test video must be decoded within 2 s (red dashed line in Fig. 16). The figure shows the time taken to decode 150 frames of each video on a static configuration of 8 cores. The time taken by the schedule generated by dynamic core allocation is shown with a stacked bar that contains information on the time spent on different core counts. The notation dynamic:8 denotes the amount of time spent decoding on 8 cores in our dynamic core allocation scheme. The dynamic scheme meets the performance for all the test videos. We observe that while the decode time using the static 8-core scheme may be lower than with our dynamic scheme, this provides no advantage, since the proposed dynamic scheme

Fig. 16. Frame decoding time with performance requirement of 75 fps. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 17. Energy consumption at a frame rate of 75 fps.

always meets the performance constraints. On the other hand, as shown in Fig. 17, the dynamic scheme always consumes less energy. We observe from the stacked bars of the dynamic scheme that for resource-demanding videos it assigns more cores, and for low load it favors an energy-conscious allocation of fewer cores. For LD videos, we find that dynamic allocation sticks to the lowest core count so as to meet performance and save energy. On the test videos (at 75 fps), we obtain energy savings ranging from 6% to 61%. This result highlights that the default strategy of allocating as many cores as available for video decoding, though meeting the performance, wastes large amounts of energy. On embedded multicore platforms, it is much more beneficial to apply our dynamic core allocation method for video decoding to meet the desired performance while preserving energy.

5.2. Energy savings in heterogeneous architectures

Fig. 18 shows the video decoding time and energy consumption for decoding the aerobatics video at a frame rate of 25 fps on different configurations of a heterogeneous multicore architecture. The decoding time degrades as we use smaller and fewer cores. For example, 4 big cores decode the aerobatics video in 3.38 s whereas 2 small cores take 7.11 s. However, utilizing smaller and fewer cores is much more energy-efficient, as evident from the energy


Table 6
Configurations with different core types selected by the dynamic core allocation method for heterogeneous architectures with varied performance for the aerobatics video. The configurations 4 big cores, 4 small cores, 1 big & 1 small cores, and 2 small cores are represented by 1, 2, 3, and 4 respectively. Each row gives the configuration selected for frames 1–15 and 150, followed by the total energy and decode time for the schedule.

Frame rate | Configuration for frames 1-15, ..., 150 | Energy (J) | Decode time (s)
25 fps | 1 1 3 4 4 4 4 4 4 4 4 4 4 4 4 ... 3 | 47.37 | 5.35
30 fps | 1 1 3 3 3 2 1 2 4 4 3 2 1 1 3 ... 1 | 53.95 | 4.82
50 fps | 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 ... 1 | 70.07 | 3.47

Fig. 18. Decoding time vs energy consumption on different configurations of a heterogeneous architecture for decoding aerobatics video at a performance of 25 fps.

consumption of 34.09 J with 2 small cores as compared to 71.59 J with 4 big cores. The inverse relationship between decoding time and energy consumption is again observed while decoding videos on different configurations of the heterogeneous architecture. Note that the 1 big & 1 small cores and 2 small cores configurations, though highly energy-efficient, fail to meet the decoding deadline of 6 s (150 frames at a rate of 25 fps). By using the dynamic core allocation methodology for heterogeneous architectures, which trades off performance with power by employing different configurations on a per-frame basis, we get a decode time of 5.35 s at an energy consumption of 47.37 J for aerobatics. This shows the effectiveness of the proposed methodology: all frames are decoded at a frame rate of 25 fps within the required deadline of 6 s, with energy savings of 33.83% over the baseline (4 big cores) configuration.

We also evaluate the proposed dynamic core allocation at different frame rates. Table 6 shows the heterogeneous configuration schedule obtained for decoding aerobatics at frame rates of 25, 30, and 50 frames per second. We observe that the allocation method assigns energy-efficient configurations more often at lower frame rates. For example, the 2 small cores configuration is selected frequently at 25 fps, as long as the decoding time constraints are not violated. As a result, better energy savings are achieved at lower frame rates. Similar to the homogeneous multicores, we find that the energy requirement grows as the frame decoding rate increases on heterogeneous platforms. When we compare with the energy consumption of 71.59 J for decoding aerobatics on a static configuration of 4 big cores, we find that the dynamic core allocation is approximately 2%–34% more energy efficient. At higher frame rates, we find that the allocation method mostly selects aggressive and powerful configurations in an effort to meet the timing/performance constraints. As seen from Table 6, the majority of the time 4 big cores are used at a frame rate of 50 fps, which results in lower energy savings. It may be noted that, in spite of selecting the most aggressive configuration available, the methodology may still miss the decoding deadline when the performance requirement is beyond the inherent capacity of the hardware, as in the case of 50 fps (decoding takes 3.47 s, whereas it should complete within 3 s). For this reason, we mainly focus on evaluating the proposed methodology for heterogeneous multicore architectures at 25 fps.

Fig. 19. Frame decoding time with performance requirement of 25 fps.

Fig. 20. Energy consumption at a frame rate of 25 fps.

Fig. 19 shows the time taken to decode 150 frames of each video on a baseline configuration of 4 big cores. The figure also shows the time taken by the dynamic core allocation with a stacked bar that contains information on the time spent in the different heterogeneous configurations. The notation dyn:4 small cores denotes the amount of time spent decoding on 4 small cores in our dynamic core allocation scheme. The dynamic scheme meets the performance of 25 fps for all the test videos except carrace. However, it may be noted that for carrace the allocation method mostly selects the most aggressive configuration available, i.e., 4 big cores, and thus reports the best decoding time the hardware can achieve. For the rest of the videos, the 4 big cores configuration takes less time than the dynamic allocation scheme, but without any advantage. We observe from the stacked bars of the dynamic scheme that it assigns aggressive configurations for HD videos and favors energy-efficient configurations for LD videos. For LD videos, the dynamic allocation sticks to 1 big & 1 small cores so as to meet performance and preserve energy. The decoding time of the dynamic core allocation scheme is longer than that of the baseline configuration because performance is traded off to conserve energy, as shown in Fig. 20. It can be observed that the dynamic core allocation method always consumes less energy compared to the static configuration. On the test videos (at 25 fps), we obtain energy savings ranging from 2% to 46%. This result highlights that careful allocation of the type of cores on heterogeneous multicore architectures can result in energy savings while balancing performance.


Fig. 21. Frame decoding at different frame rates on homogeneous architectures. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 22. Energy consumption at different frame rates on homogeneous architectures.

5.3. Varying frame rates

Next, we analyze the performance of dynamic core allocation at different frame rates. Figs. 21 and 22 show the decoding time and energy consumption of the videos when run with dynamic core allocation at frame rates of 30, 50, and 75 fps on homogeneous architectures. The deadlines for decoding 150 frames at 30, 50, and 75 fps are 5, 3, and 2 s (marked with red dashed lines) respectively. The notation 30 fps-2 denotes the time taken (or energy consumed) for decoding on 2 cores at a frame rate of 30 fps. We find from Fig. 21 that the dynamic core allocation scheme decodes all the test videos at the selected performance well within their respective deadlines. We notice that performance is easily met at lower frame rates because the budgeted decode time is large and smaller core counts are able to decode frames easily within the budgeted time; also, the greater accumulated slack is sufficient to compensate for the occasional overshoots. However, meeting the performance at higher frame rates, which have a small budgeted decode time, becomes challenging. This is where core allocation at the frame level helps, by balancing slack against overshoot and moving up to more cores when required. We can observe from Fig. 22 that the energy consumption for HD videos increases as the performance requirement increases.

We note that our dynamic core allocation scheme has many advantages. First, it adapts to the resource requirements of the videos and assigns what is required to meet the performance. For example, it assigns fewer cores for an LD video (cricket) and assigns more cores more often in the dynamic schedule for an HD video (carrace). Second, it conserves energy by allocating fewer cores when performance can be met with a lower core count. Third, its dynamic behavior absorbs the variability in program execution while decoding video. Fourth, it guarantees performance while


Fig. 23. Frame decoding at different frame rates on heterogeneous architectures. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 24. Energy consumption at different frame rates on heterogeneous architectures.

Note that our scheme is not aware of the nature of the video data, but it will always meet the performance requirements (if they can be satisfied with the number of available cores) and possibly save energy while doing so.

In Figs. 23 and 24, we present the decoding time and energy consumption of the videos when run with dynamic core allocation at frame rates of 25, 30, and 50 fps on heterogeneous architectures. The deadlines for decoding 150 frames at 25, 30, and 50 fps are 6, 5, and 3 s, respectively (marked with red dashed lines). The notation 25 fps-2small denotes the time taken (or energy consumed) for decoding on 2 small cores at a frame rate of 25 fps. Fig. 23 shows that all the test videos except carrace are decoded within 6 s at a frame rate of 25 fps. At 30 fps, the allocation method decodes the majority of the videos within the 5 s deadline, the exceptions being the fast-moving videos avalanche, carrace, and riverflow. At 50 fps, the inadequacy of the simulated hardware is evident from the failure to decode any of the HD videos within the stipulated time. At low frame rates a variety of configurations is selected, whereas at higher frame rates the performance-driven configurations are chosen most of the time; the allocation method picks powerful configurations for faster decoding at higher frame rates. Fig. 24 shows that the majority of the energy is consumed by the 4-big-core configuration, and that energy consumption increases with the decoding frame rate.

6. Related work

We divide the related work into two categories: dynamic techniques on homogeneous multicore architectures and dynamic techniques on heterogeneous multicore architectures.



Dynamic scheduling on homogeneous multicore architectures: Hughes et al. [18] show that the frame-level execution time variability present in multimedia applications can be exploited for frame-level architectural adaptation. Building on this observation, Hughes et al. [19] demonstrate a frame-type based architectural adaptation technique that selects a hardware configuration to save energy. In contrast to Hughes' scheme, our proposed allocation method does not depend on frame types and has no profiling overhead. Vu et al. [20] propose an adaptive dynamic scheduling scheme for H.264 decoding that employs multiple local queues to reduce lock contention and assigns tasks to neighboring cores in a cache-locality-aware fashion. They focus only on performance, without considering energy. Tuveri et al. [21] propose a dedicated multicore platform for H.264 decoding that can migrate processes among different computational tiles to support runtime adaptivity. In contrast, our approach is more generic and flexible and can be applied to any multicore architecture, as we map the threads working on independent slices to the assigned cores on a per-frame basis. Richter et al. [22] modify the decoder to make it adaptive by providing fine-grained distributed synchronization via a dedicated lock for each macroblock. They identify the dependence sequence for decoding and accordingly let threads start the decoding process early, in accordance with the dependences. Kato et al. [23] propose AIRS, which is aimed at supporting systems that run multiple interactive real-time applications, particularly on multicore platforms. AIRS provides a new CPU reservation mechanism to enhance the performance of the overall system.

Dynamic scheduling on heterogeneous multicore architectures: Tsai et al. [24] propose a video decoding framework that adopts the dynamic task partition paradigm in user space instead of kernel space. Depending on the load of each core at runtime, the frame decoding task is assigned dynamically to either the RISC core or the DSP core. The load on a core is determined from the items in its task queue, and the task granularity is set at the video slice level. This approach achieves an average decoding gain of 38.4% over static task partition approaches. Chen et al. [25] present a heterogeneous architecture composed of a mobile GPU with a configurable filtering unit and a CPU that coordinates the decoding flow. The traditional video decoding pipeline is partitioned into stages executed on either the GPU or the CPU: the CPU is responsible for coordinating the decoding flow and dispatching threads, while decoding stages such as inverse quantization and inverse transform are scheduled as parallel vector threads on the GPU. The decoding time of a 1080p H.264 video is reported to be reduced by over 50%. Liu et al. [26] implement a task-based hybrid parallel and hybrid pipeline scheme for multi-standard video decoding on a heterogeneous coarse-grained reconfigurable processor, called the reconfigurable multimedia system (REMUS). Macroblock (MB)-level, block-level, and sub-block-level decoding tasks are parallelized to improve data-processing throughput, and a hybrid pipeline scheme, in which slice-level, MB-level, block-level, and sub-block-level computations are pipelined, is used to improve efficiency. Benmoussa et al. [27] propose an end-to-end methodology to characterize and model the energy consumption of the processing resources in the context of video decoding for embedded heterogeneous platforms containing both a GPP and a DSP.
A high-level analytical model is built that estimates the consumed energy as a function of the characterization parameters together with a set of comprehensive architecture-, system- and video-related coefficients. The considered parameters are the processor frequency, the processor type (GPP or DSP), the video quality (resolution and bit-rate) and the video complexity. Mesa et al. [28] evaluate a heterogeneous manycore architecture for parallel H.264 decoding. Entropy decoding is identified as the main bottleneck, and a solution based on the simultaneous exploitation of multiple levels of parallelism is used. The impact of using different types of processors for entropy decoding and macroblock decoding is analyzed. A parallelization technique that operates at the macroblock level and a dynamic 3D-wave algorithm that exploits spatial and temporal macroblock-level parallelism are used. Cho et al. [29] propose parallelization and optimization techniques for the H.264 decoder on the Cell BE processor. The decoder removes the bottleneck to real-time performance at HD resolution in the entropy decoding stage by exploiting the frame-level parallelism available in that stage and the simultaneous multi-threading (SMT) feature available in the PPE; macroblocks in a frame are pipelined through multiple SPEs. The evaluation results indicate that the parallel H.264 decoder with CABAC entropy decoding on a single Cell BE processor meets the real-time requirement of the full HD standard at level 4.0, and performance is improved by over 18% on average. Baker et al. [30] present a parallelized implementation of the H.264/AVC decoder that addresses workload and data partitioning in the face of complex data dependences, as well as performance and scalability in terms of throughput on embedded multicore architectures. Their parallelization scheme focuses on data parallelism at the macroblock level, where macroblocks are assigned to the next SPU in round-robin fashion. This approach yields an average improvement of 6.6% across all tested bitrates and 9.3% for bitrates below 12 Mbps.

7. Conclusion

Among the most important applications running on contemporary embedded systems such as tablets and smartphones are those that play videos. On future embedded systems the demand for video is only going to increase, while energy will remain a critical constraint. This paper presents an energy-efficient approach to video decoding that can help future generations of embedded systems conserve energy while playing videos. The paper deals with both homogeneous (Silvermont) and heterogeneous (ARM big.LITTLE) variants of multicore architectures. Video decoding using all cores on embedded multicore platforms meets the performance requirement, but at an excessive energy cost. This paper presents dynamic core allocation techniques for video decoding that satisfy performance requirements while lowering energy consumption on homogeneous and heterogeneous multicore architectures. The energy savings are achieved by controlling core allocation, i.e., more (or more powerful) cores are allocated for the duration of a frame only when required. The key idea is to compensate for overshoots with the slack accumulated while decoding neighboring frames, and to increase the core count when more stringent performance requirements need to be satisfied. The result is relevant to multicore systems and power-aware computing, since it allows cores to be switched off or run at lower clock speeds, which helps conserve energy. We also test the proposed allocation method at varying frame rates and find that our dynamic technique is able to satisfy different performance constraints. In this paper, we also identify and evaluate the factors that influence the decoding time of a frame, namely frame types and their sizes, and the core count of the multicore architecture. In the future, we would like to implement and evaluate the proposed methodology on real hardware. We would also like to explore unifying the homogeneous and heterogeneous schemes.
We hope to establish a generic dynamic core allocation methodology for soft real-time applications that leads to energy savings while meeting the desired performance requirements. An interesting direction would be to combine the proposed allocation techniques with DVFS strategies. We would also like to integrate the dynamic core allocation method with cache reconfiguration approaches in an effort to improve performance while preserving energy.


References

[1] ISO/IEC 14496-10, Advanced Video Coding for Generic Audiovisual Services, in: http://www.itu.int/ITU-T/recommendations/rec.aspx?rec=11466.
[2] S.K. Kwon, A. Tamhankar, K. Rao, Overview of H.264/MPEG-4 Part 10, J. Vis. Commun. Image Represent. 17 (2) (2005) 186–216.
[3] D. Isovic, G. Fohler, L. Steffens, Timing constraints of MPEG-2 decoding for high quality video: misconceptions and realistic assumptions, in: Proceedings of 15th Euromicro Conference on Real-Time Systems, 2003, pp. 73–82.
[4] A. Carroll, G. Heiser, An analysis of power consumption in a smartphone, in: Proceedings of the 2010 USENIX Conference on USENIX Annual Technical Conference, 2010, pp. 21–35.
[5] A. Carroll, G. Heiser, The systems Hacker's guide to the galaxy energy usage in a modern smartphone, in: Proceedings of the 4th Asia-Pacific Workshop on Systems, 2013, pp. 5:1–5:7.
[6] Li-ion batteries and portable power source prospects for the next 5–10 years, J. Power Sources 136 (2) (2004) 386–394.
[7] K. Jeong, A. Kahng, A power-constrained MPU roadmap for the International Technology Roadmap for Semiconductors (ITRS), in: 2009 International SoC Design Conference, 2009, pp. 49–52.
[8] M. Mesarina, Y. Turner, Reduced energy decoding of MPEG streams, J. Multimedia Syst. 9 (2) (2003) 202–213.
[9] K. Choi, K. Dantu, W. Cheng, M. Pedram, Frame-based dynamic voltage and frequency scaling for a MPEG decoder, in: IEEE/ACM International Conference on Computer Aided Design, 2002, pp. 732–737.
[10] Z. Lu, J. Lach, M. Stan, K. Skadron, Reducing multimedia decode power using feedback control, in: Proceedings of 21st International Conference on Computer Design, 2003, pp. 489–496.
[11] FFmpeg: Open Source Codec Project, in: http://www.ffmpeg.org.
[12] x264 Project, in: http://www.videolan.org/developers/x264.html.
[13] T.E. Carlson, W. Heirman, L. Eeckhout, Sniper: exploring the level of abstraction for scalable and accurate parallel multi-core simulations, in: International Conference for High Performance Computing, Networking, Storage and Analysis, 2011, pp. 52:1–52:12.
[14] S. Li, J.H. Ahn, R.D. Strong, J.B. Brockman, D.M. Tullsen, N.P. Jouppi, McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures, in: Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, 2009, pp. 469–480.
[15] D. Kanter, Silvermont: Intel's Low Power Architecture, in: http://www.realworldtech.com/silvermont.
[16] P. Kumar, M. Srivastava, Power-aware multimedia systems using run-time prediction, in: 14th International Conference on VLSI Design, 2001, pp. 64–69.
[17] Intel VTune Amplifier XE, in: http://software.intel.com/en-us/intel-vtune-amplifier-xe.
[18] C.J. Hughes, P. Kaul, S.V. Adve, R. Jain, C. Park, J. Srinivasan, Variability in the execution of multimedia applications and implications for architecture, in: Proceedings of the 28th Annual International Symposium on Computer Architecture, 2001, pp. 254–265.
[19] C.J. Hughes, J. Srinivasan, S.V. Adve, Saving energy with architectural and frequency adaptations for multimedia applications, in: Proceedings of the 34th Annual ACM/IEEE International Symposium on Microarchitecture, 2001, pp. 250–261.
[20] D. Vu, J. Kuang, L. Bhuyan, An adaptive dynamic scheduling scheme for H.264/AVC decoding on multicore architecture, in: IEEE International Conference on Multimedia and Expo, 2012, pp. 491–496.
[21] G. Tuveri, S. Secchi, P. Meloni, L. Raffo, E. Cannella, A runtime adaptive H.264 video-decoding MPSoC platform, in: Conference on Design and Architectures for Signal and Image Processing, 2013, pp. 149–156.
[22] H. Richter, B. Stabernack, E. Muller, Adaptive multithreaded H.264/AVC decoding, in: Conference on Signals, Systems and Computers, 2009, pp. 886–890.
[23] S. Kato, R. Rajkumar, Y. Ishikawa, AIRS: supporting interactive real-time applications on multicore platforms, in: 22nd Euromicro Conference on Real-Time Systems, 2010, pp. 47–56.
[24] C.-J. Tsai, T.-F. Shen, P.-C. Liao, Dynamic task partition for video decoding on heterogeneous dual-core platforms, ACM Trans. Embedded Comput. Syst. 12 (1) (2013) 53:1–53:22.
[25] Y.-J. Chen, Y.-S. Lin, H.-F. Wu, C.-M. Chang, S.-Y. Chien, HD video decoding scheme based on mobile heterogeneous system architecture, in: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 2761–2765.


[26] L. Liu, Y. Chen, D. Wang, S. Yin, X. Wang, L. Wang, H. Lei, P. Cao, S. Wei, Implementation of multi-standard video decoder on a heterogeneous coarse-grained reconfigurable processor, Sci. China Inf. Sci. 57 (8) (2014) 82406–82420.
[27] Y. Benmoussa, J. Boukhobza, E. Senn, Y. Hadjadj-Aoul, D. Benazzouz, A methodology for performance/energy consumption characterization and modeling of video decoding on heterogeneous SoC and its applications, J. Syst. Archit. 61 (1) (2015) 49–70.
[28] M.A. Mesa, F. Cabarcas, A. Ramirez, C. Meenderinck, B. Juurlink, M. Valero, Scalability of parallel video decoding on heterogeneous manycore architectures.
[29] Y. Cho, S. Kim, J. Lee, H. Shin, Parallelizing the H.264 decoder on the Cell BE architecture, in: Proceedings of the 10th ACM International Conference on Embedded Software, 2010, pp. 49–58.
[30] M.A. Baker, P. Dalale, K.S. Chatha, S.B. Vrudhula, A scalable parallel H.264 decoder on the Cell Broadband Engine architecture, in: Proceedings of the 7th IEEE/ACM International Conference on Hardware/Software Codesign and System Synthesis, 2009, pp. 353–362.

Rajesh Kumar Pal received the M.Tech. degree in Computer Science and Engineering from the Indian Institute of Technology Kharagpur, India in 2008. He has been pursuing a Ph.D. in the Department of Computer Science and Engineering, Indian Institute of Technology Delhi, India, since 2008. His research interests include multicore/manycore architectures, adaptive and reconfigurable systems, and data security.

Ierum Shanaya received her B.E. degree in Computer Science and Engineering from the National Institute of Engineering, Mysore, India in 2006. She is currently pursuing an MS(R) at the Amar Nath and Shashi Khosla School of Information Technology, Indian Institute of Technology Delhi, India. Her research interests include virtualization, distributed systems, multicore/manycore architectures, and reconfigurable systems.

Kolin Paul received the Ph.D. in Computer Science from Bengal Engineering College, Kolkata, India in 2002. He joined the Department of Computer Science and Engineering, Indian Institute of Technology Delhi, in 2004, where he is an associate professor. He currently leads the ReMorph group, which focuses on reconfigurable computer architectures in areas as varied as FPGAs, multicores, embedded systems and quantum computing. His research interests include reconfigurable and adaptive computing, sensor fusion in deeply embedded systems, and systems issues of multicore/manycore architectures. He has served on the program committees of various conferences related to embedded systems and computer architecture, such as DSD, ATS, and Indocrypt.

Sanjiva Prasad received the M.S. and Ph.D. degrees from the Department of Computer Science, SUNY at Stony Brook, USA in 1990 and 1991, respectively. He is a Professor in the Department of Computer Science and Engineering and Head of the Khosla School of Information Technology at the Indian Institute of Technology Delhi, India. His research interests include programming languages for mobile distributed computing, and formal verification of programs, protocols, and systems. He has served on the program committees of various conferences of international repute, such as POPL, SEFM, FSTTCS, and ASIAN.