Journal of Systems Architecture 59 (2013) 1144–1156
High-level power and performance estimation of FPGA-based soft processors and its application to design space exploration
Adam Powell, Christos Savvas-Bouganis, Peter Y.K. Cheung
Imperial College London, United Kingdom
Article history: Available online 17 August 2013
Keywords: Design space exploration; Modeling; Soft processors; Performance estimation; Power estimation; Regression tree; Image compression
Abstract
This paper presents a design space exploration framework for an FPGA-based soft processor that is built on the estimation of power and performance metrics using algorithm and architecture parameters. The proposed framework is based on regression trees, a popular machine learning technique, which can capture the relationship between low-level soft-processor parameters and high-level algorithm parameters of a specific application domain, such as image compression. In doing this, the power consumption and execution time of an algorithm can be predicted before implementation and on unseen configurations of soft processors. For system designers this can result in fast design space exploration at an early stage of design.
© 2013 Elsevier B.V. All rights reserved.
1. Introduction
Soft processors on FPGAs allow for the quick design and use of microprocessors for custom applications. However, early in the design process it is difficult for the system designer to know how a particular algorithm is going to perform because specific parameters of the architecture may not yet be known. Ideally, the designer would know how the various architecture parameters of the system affect the power and execution time of an algorithm without having to implement the algorithm or generate the soft processor. Exploration of the soft-processor architecture space is a lengthy process. Based on the number of architecture combinations, the generation of all possible architectures can take 25 days for the Altera NIOS 2 for only a base set of parameters. Including other components, such as a Memory Management Unit (MMU) or a Memory Protection Unit (MPU), can increase this by as much as three to six orders of magnitude. This process would be required to find the optimal configuration for a particular application. By having a profile of power consumption and execution time as a function of hardware architecture parameters and high-level parameters that capture the main computational and memory access characteristics of the algorithm, early optimization and design space exploration can be performed. This allows for a much earlier estimate than previous work, at the expense of a constrained application space. To this end, design space exploration of the soft-processor architecture can be performed.
[email protected] (A. Powell),
[email protected] (C. Savvas-Bouganis),
[email protected] (P.Y.K. Cheung). 1383-7621/$ - see front matter Ó 2013 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.sysarc.2013.08.003
A tool that enables designers to obtain this profile of power and execution time depends on having an accurate underlying model that provides predictions based on the previously mentioned parameters. The contributions of this work are:

- A model that predicts power and performance (execution time) for domain-specific algorithms using high-level algorithm parameters and architecture parameters.
- A design space exploration framework in which, by knowing the power consumption and execution time of an algorithm, the system can be made more efficient earlier in the design process without implementing the processor or the algorithm.
- A demonstration of how the framework provides confidence measures for these predictions using characteristics of the training data as well as characteristics of the input test vectors.

For this work, the scope is limited to the prediction of power and performance for image compression methods running on FPGA-based soft processors. The use of high-level algorithm parameters constrains the domain to algorithms and applications which share the same execution characteristics. Therefore, the ability of this technique to generalize to other application domains is limited. However, architecture parameters, such as cache size, are shared by all processors. To this end, the flexibility of soft processors can be used to generate power and performance models that can be applied more generally to soft processors as well as "hard" processors. How this can be done is discussed later in the paper.
This work improves on previous work [1] by improving the parameter extraction and increasing the number of algorithms used, which leads to an increase in model accuracy. Also, the concepts of prediction confidence and design space exploration have been added. The paper is organized as follows. Section 2 discusses previous work. Section 3 discusses the chosen modeling technique and application domain. Section 4 provides a high-level description of the framework and how it will be used. Section 5 discusses the algorithms and how their parameters were extracted in order to create the underlying prediction model. Section 6 describes the construction of the model, followed by the validation method and the results of this validation. Section 7 describes the method by which the model can provide degrees of confidence for its predictions. A brief case study is presented in Section 8 that describes how this framework is used for design space exploration. Section 9 concludes the paper and provides directions for future work.

2. Related work
As the major contribution of this work is in the modeling of power and performance of soft processors at a high level of abstraction, this will be the main focus of this section. High-level estimation methods fall into three levels of abstraction: the functional level, the instruction level, and the specification level; the work in this paper is at the specification level. Low-level techniques, such as those at the register transfer level and transistor level, will not be covered. This is followed by a discussion of frameworks used for design space exploration (DSE).

2.1. Power and performance modeling
Functional-level power analysis (FLPA) considers groups of components as functional units. First introduced by Laurent et al. [2], FLPA uses these functional units as the basis of its analysis. The functional units in that work consist of the Control Unit (CU), Instruction Management Unit (IMU), Memory Management Unit (MMU), and the Processing Unit (PU). Though that work is specific to a Digital Signal Processor (DSP), Zipf et al. [3] applied FLPA to an FPGA-based soft processor using functional units consisting of the data cache, instruction cache, and integer processing units. In order to produce estimates, FLPA methods use low-level parameters of the applications running on the processor to predict the activity levels of the functional units. These parameters include cache miss rates, dispatch and fetch rates (for DSP processors), degree of parallelism, or memory access rates. In many situations, these statistics will not be available to designers as they require complex analysis of the source code or special profiling tools.

Introduced by Tiwari et al. [4], instruction-level power analysis (ILPA) methods attempt to estimate the energy consumption of an application by first characterizing the energy consumption of each instruction in the instruction set. In addition to the base energy cost of an instruction, there are additional inter-instruction costs which occur due to the sequential execution of two or more instructions. Tiwari et al. [4] developed per-instruction energy costs for two specific processors, as indeed must be done for any processor used in ILPA methods. Ou and Prasanna [5] performed ILPA specifically for the MicroBlaze soft processor, giving the reason that previous power estimation methods could not be properly applied to soft processors due to their unique construction. JouleTrack [6] is a tool that has been developed as an optimization framework built on instruction-level power estimation.
Ultimately, the issue with ILPA methods is the dependence of the characterization on the underlying architecture. There is little way to know how a particular algorithm is affected by architecture parameters, such as instruction cache size or multiplier type. In
addition, compiled source code of the application must be available. This makes ILPA methods dependent on a specific processor and architecture being fully characterized before estimation can occur. The highest level of abstraction is the specification level, which is where this work falls. The specification level includes methods that rely on high-level parameters in order to produce estimations. Specifically, these parameters do not require implementation of either the processor or the algorithm. They are parameters such as cache size, types of arithmetic operations, or clock frequency. Cambre et al. [7] examine the effect of arithmetic hardware on the energy usage of various algorithms but do not attempt to construct a predictive model. Instead, they make generalizations about the most efficient choice of arithmetic hardware; the conclusion was that, simply, choosing hardware implementations of arithmetic operations is more energy efficient than having software-based operations. Azizi et al. [8] use a large number of architecture parameters. These include the branch target buffer (BTB) size, cache sizes, fetch/decode/register-renaming latencies, ALU latency, DRAM latency, and so on. However, the work uses only a small subset of applications taken from the SPEC CPU benchmarks and does not include any application parameters in the modeling process; this confines the model to only a small subset of applications.
2.2. Applications to DSE
Modeling is also an important part of DSE research as it allows the architecture space to be explored without the need for a lengthy generation process. Machine learning techniques are popular methods for tackling this problem. Ipek et al. [9] use artificial neural networks (ANNs) to reduce the size of the architecture space as well as to provide a feature-rich DSE framework. Similarly, Cho et al. [10] use wavelet-based ANNs to analyze the workload of various applications in order to make better microarchitectural design choices. A type of linear regression is used by Hallschmid and Saleh [11] to reduce the design space and to evaluate the effects of either one or two parameters in order to optimize the energy consumption of the processor. Carrion Schafer and Wakabayashi [12] use various machine learning techniques combined with genetic algorithm (GA) techniques and source code analysis to produce a fast method for exploring the architecture design space. Lee and Brooks [13] consider a large number of architecture parameters. These parameters include cache sizes, pipeline depth, and register counts. In addition to these, the work considers a number of application-specific parameters such as branch and stall statistics and cache misses. These parameters are taken from a single execution on a "baseline" configuration of architecture parameters. By doing this, some aspects of the specific application can be captured in the modeling process without having to explicitly generate all possible architectures. In addition to modeling-based techniques for design space exploration, there are a number of techniques which progressively generate architectures in order to tune the performance of a specific algorithm. Yiannacouras et al. [14] use the SPREE architecture to perform application tuning on soft-processor architectures, while Sheldon et al. [15] use in-the-loop synthesis to optimize designs in terms of area. Dimond et al. [16] consider the use of custom instructions and multi-threaded applications in their DSE framework, which uses source code analysis, cycle-accurate simulation, and place-and-route. The issue with the above approaches is that all of them are dependent on the application being executed. If the application is optimized or changed, the DSE process must happen again in order to accommodate the changes to the application.
Ultimately, this work seeks to improve upon previous work by giving designers earlier estimates, allowing for exploration of the architecture and, potentially, the algorithm space.

3. Background
One of the main goals behind this work is to identify the relationship between high-level algorithm parameters and soft-processor architecture parameters in order to be able to predict power consumption and execution time. Machine learning techniques are powerful enough to model this relationship and provide predictions. For this work, regression trees were chosen.

3.1. Regression trees
A regression tree uses a succession of binary splits of the data using a set of observations and their responses, also known as recursive partitioning. Compared to other modeling techniques, regression trees offer many benefits. These benefits include flexibility of input data and the ability to model conditional information within the predictor variables. After a regression tree has been constructed, its structure is transparent. That is, how the model arrives at a prediction can be seen by the user, which gives insight into variable dependencies and effects. Nonparametric regression models, which include regression trees, are methods that allow the prediction and modeling of systems whose relationship between predictor and output variables is not known. Multivariate adaptive regression splines (MARS) [17] are a popular method for creating nonparametric models and are commonly compared to recursive partitioning methods, as both contain automatic variable selection and can easily handle input data, requiring little or no preparation. However, the MARS method produces a continuous, piecewise function describing the output variable. The high-level variables used in this work have discontinuous relationships with the output variables; the MARS method is therefore unsuitable for this work. In fact, this makes many modeling techniques unsuitable for use on this data, including linear regression, nonlinear regression, generalized linear models, and so on. The method of constructing regression trees used here (simply called the CART method) was introduced by Breiman et al. [18] and will not be reiterated in full. The binary splits of a regression tree divide the data into two disjoint subsets. These splits are binary tests against a predictor variable of the training data; the best split for a node is the one that produces the greatest decrease in error of the current node and its child nodes. Once it is determined that further splits are not needed at a node, that node is considered a terminal node. A few key shortfalls of regression trees, such as instability, can be handled by using multiple regression trees, called an ensemble or regression forest. While there are many ways of creating an ensemble of regression trees, the one most suited for this work is called "bagging".

3.1.1. Bagged regression trees
Bagging, or Bootstrap Aggregating, was first introduced by Breiman [19]. Using an initial learning data set L, bagging works by generating a set of k models where each model is constructed using a separate learning data set, resulting in the sequence {L_k}. {L_k} is generated by randomly sampling L uniformly and with replacement. Therefore, observations from L may not appear in {L_k} or may appear multiple times. The predicted value of the ensemble for an input vector x is the average response value of all k models using x as input.
By applying this concept to the generation of many regression trees, the prediction error of the model is decreased. Breiman [19] explains why bagging results in an increase in accuracy: bagging generally increases the accuracy of unstable predictors. By using bagging, the instability inherent in single regression trees is decreased, as is the prediction error. It is also shown that bagging generally improves prediction performance for predictors that tend to overfit, another criticism of regression trees. Other ensemble methods exist, namely random forests and boosting. Random forests [20] create trees from a random subset of parameters using bootstrapping. They are unsuitable here as the relationships between multiple variables are important to producing low-error predictions; making accurate predictions with only one of these variables therefore becomes impossible. Boosting [21] (specifically least-squares boosting in the regression case) is not suitable in this context for similar reasons. Boosted regression trees are typically small trees which use a subset of predictor variables and thus lose many important variable interactions that are key to the accuracy of the model. Both methods have been tested and yielded much higher error when compared with bagging.

3.1.2. Node statistics
Each node in a regression tree contains information on the data set associated with it. This information is the node's response-variable mean t_M, node error t_E (absolute deviation from t_M), and node size t_S of each node generated during the construction phase. The node probability t_P is defined as
$t_P = \frac{t_S}{N}$
where N is the total number of training vectors. Further, the node risk t_R is defined as the product of the node probability and node error:
$t_R = t_P \, t_E.$

For an input vector, the constructed ensemble will produce a response value for each tree within the ensemble. Therefore, for each tree there is an associated node that the response came from, along with the statistics mentioned above.

3.1.3. Outlier measure
By using multiple regression trees, a test vector can be characterized against the training vectors to determine the similarity between the two. If they are similar, the probability of high error in the prediction is lower than if the two are dissimilar. To measure this, the ideas of proximity and the outlier measure are used, first introduced by Breiman [20]. The outlier measure b for a vector v against the data set X, which has proximity measures a_X, is defined as
$b = \frac{1/\ell^2 - \tilde{a}_X}{\operatorname{mad}(a_X)}.$
Here, ℓ² is the average squared proximity of the vector v against all other vectors in X, ã_X is the median of all proximity values in a_X, and mad(a_X) is the median absolute deviation from the median ã_X of all values in a_X. Typically, this outlier measure is used in classification to determine outliers within a class. However, it will be seen in Section 7 that it can be used in regression as an indication of confidence.
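To make the preceding ideas concrete, the following sketch shows how a bagged ensemble of regression trees, per-prediction node statistics, and a proximity-based outlier measure could be assembled with scikit-learn. The data, the 43-element feature vectors, and the leaf-size setting are illustrative placeholders, and the normalization simply follows the description above; it is not code from the paper.

```python
# Minimal sketch: bagged regression trees with per-prediction node statistics
# and a proximity-based outlier measure (Sections 3.1.1-3.1.3). Illustrative only.
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.random((500, 43))          # hypothetical 43-element training vectors
y = rng.random(500) * 100          # hypothetical response (e.g. execution time)

ensemble = BaggingRegressor(
    estimator=DecisionTreeRegressor(min_samples_leaf=5),
    n_estimators=100, bootstrap=True, random_state=0).fit(X, y)

def leaf_stats(ensemble, X_train, x):
    """Average node size t_S and node probability t_P of the leaves that x falls into."""
    sizes = []
    for tree in ensemble.estimators_:
        leaves_train = tree.apply(X_train)           # leaf id of every training vector
        leaf = tree.apply(x.reshape(1, -1))[0]       # leaf id of the test vector
        sizes.append(np.sum(leaves_train == leaf))   # node size t_S for this tree
    t_S = float(np.mean(sizes))
    return t_S, t_S / len(X_train)                   # node size, node probability t_P

def outlier_measure(ensemble, X_train, x):
    """Proximity = share of trees in which two vectors land in the same leaf;
    the raw measure 1/(mean squared proximity) is normalised with the median
    and median absolute deviation, following the description in Section 3.1.3."""
    leaves_train = np.column_stack([t.apply(X_train) for t in ensemble.estimators_])
    leaves_x = np.concatenate([t.apply(x.reshape(1, -1)) for t in ensemble.estimators_])
    prox = (leaves_train == leaves_x).mean(axis=1)   # proximity of x to each training vector
    raw = 1.0 / np.mean(prox ** 2)
    med = np.median(prox)
    mad = np.median(np.abs(prox - med))
    return (raw - med) / max(mad, 1e-12)

x_test = rng.random(43)
print(ensemble.predict(x_test.reshape(1, -1)))
print(leaf_stats(ensemble, X, x_test))
print(outlier_measure(ensemble, X, x_test))
```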
3.2. Image compression domain
As the main goal of this work is to predict the power and performance of algorithms early in the design phase, a constraint must be put on the algorithm space in order to be able to extract meaningful high-level parameters. This constraint is that the algorithms chosen must have similar characteristics, which means that all algorithms should come from the same domain. For this work, the domain of image compression was chosen. Image compression algorithms, though diverse, share characteristics that allow high-level parameters to be used that can capture the individual behavior of each algorithm. These characteristics include deterministic memory accesses, memory-access-dominated execution, simple arithmetic operations, and distinct phases of execution. A wide variety of algorithms was used to construct the model. These algorithms include those that use the concept of transform coding (i.e. a transform stage followed by quantization and entropy coding) and consist of JPEG, JPEG 2000, JPEG XR, and WebP. Also included in this work are less widely used methods, namely vector quantization and quad-tree fractal compression. In the past, these were not popular due to high computation costs; with increasing compute speeds and more ubiquitous parallel processing, these methods are experiencing a resurgence. Ultimately, these algorithms represent nearly all of the methods used for image compression.

The widely used JPEG standard consists of a transform stage and an entropy encoding stage. Beginning with the Discrete Cosine Transform (DCT), the input image is transformed into the frequency domain. From here, the resulting coefficients are quantized and sent to the entropy encoder. The standard permits the use of arithmetic coding, but Huffman coding is the most commonly used entropy encoder and is the one used here. A successor to JPEG, JPEG 2000 improves upon the problems that existed in JPEG as well as adding more features. The JPEG 2000 standard uses the Discrete Wavelet Transform (DWT), which improves over the DCT by including spatial information in addition to frequency information. After the transformation, the coefficients are sent to the block coding method called Embedded Block Coding with Optimal Truncation (EBCOT). EBCOT scans DWT coefficients in an adaptive manner and then uses a type of arithmetic coding to perform entropy encoding. The most recent addition to the JPEG group of algorithms is JPEG XR. JPEG XR uses the Photo Core Transform (PCT), which is ultimately an approximation of the DCT. However, the PCT uses a combination of cascaded 2-by-2 Walsh-Hadamard transformations followed by a number of one-dimensional and two-dimensional rotation operations. Consequently, there are few arithmetic operations, and they consist mainly of additions and subtractions. The block coding for JPEG XR uses adaptive scanning of transform coefficients and a type of adaptive Huffman encoding.

Vector quantization (VQ) is a widely used method with applications in many areas, not just image compression. VQ is straightforward as it represents the original vectors of the image with a smaller, representative set of vectors called the codebook. This makes the compression ratio deterministic for VQ, whereas the quality of compression is data dependent. Fractal compression, first patented by Barnsley and Sloan [22], uses a mathematical construct called an iterated function system. It was extended by Fisher [23] to include a method of partitioning called quad-tree partitioning. The quad-tree representation of an image can be thought of as a tree-like structure with the original image as the root.
Each node of the tree corresponds to a square portion of the image and contains four child nodes that correspond to the four quadrants of that square. The actual structure of this tree depends on the image and is constructed during the encoding process. This set of nodes (or ranges) is mapped onto domains, which are twice the size of the ranges and are taken from the same image. The mapping is an affine transformation which minimizes the root-mean-squared (RMS) difference between the transformed domain pixel values and the range pixel values. The coefficients of
this transformation and the optimal domain are stored in order for the range to be compressed. Neither the full details of the algorithm nor the selection of domains will be discussed here. Developed by Google, WebP [24] was introduced as a competitor to JPEG that would allow for more flexibility in the encoding and decoding process. At its core, WebP is the key frame encoder of the VP8 video compression standard. Being designed as an internet-friendly compression algorithm, WebP is asymmetric: most of the processing is done on the encoding side to allow for fast decompression on the web browser side.

4. System overview
This section outlines the framework and how it would be used by designers to perform design space exploration. The overview of the system described in this paper is shown in Fig. 1. Ultimately, this work will result in a tool that allows designers to obtain predictions for the power and execution time of their algorithms as a function of the algorithm's parameters. Further, the designer will be able to see how the architecture parameters affect these metrics, allowing for early design space exploration. Using this tool allows designers to forgo the generation of all possible architectures, allowing for design space exploration within the confines of the tool. As the tool is based on a pre-generated model, the designer does not need the computing resources or the measurement tools and capabilities required to obtain power consumption and execution time directly. In terms of the providers of the tool, architecture generation and data collection only need to happen once, and the resulting model can then be provided to many users.

This work assumes nothing about the user's system requirements or constraints. That is, optimal architectures can be found for power versus execution time, but these architectures can vary greatly in the number and type of resources used. Therefore, this tool allows users to examine how the power and performance of the system change with respect to design choices, and the resource implications of those design choices. This work involves four parts: parameter selection, performance evaluation, prediction confidence, and the application of the model to design space exploration. First, parameter selection is how high-level parameters are extracted from the previously mentioned algorithms, as well as the parameters of the soft-processor architecture. The performance evaluation is the construction and analysis of how well the underlying modeling method performs in terms of prediction error. Prediction confidence is how the training data can be exploited to provide users of the framework with a degree of confidence in the predictions they receive. Finally, the model can be used to perform design space exploration. Specifically, the front end can show how the underlying architecture parameters affect the algorithm's performance. For example, if the predictions show that the execution time of an algorithm does not decrease much with a larger cache size, there is no reason to include a large cache and these resources could be used elsewhere.
Fig. 1. Framework overview.
5. Parameter selection

5.1. Algorithm blocks
This work proposes to break the algorithms into their building blocks rather than using the entire algorithm. The reason for this is twofold: first, compression algorithms are too complex to be represented efficiently in a high-level manner. Secondly, being able to combine blocks allows the algorithms to better represent the algorithm parameter space; that is, instead of having two algorithms (e.g. JPEG and JPEG 2000), four separate algorithms can be used to construct the model by using the combinations of the two transforms and the two coding methods. This standard model of transform coding allows a greater number of algorithms to be used as training for the model. The splitting of the blocks was done in a high-level manner so that each algorithm was broken into its two basic blocks. The split was done manually such that each block was both a logical and a computational component of the original algorithm. Another constraint on the split is that both blocks must be sequential. That is, in one execution of an algorithm, block "A" is executed followed by the execution of block "B", without interleaving execution of the two blocks.
5.1.1. Standard compression blocks
A majority of the blocks were taken from the transforms and coding methods of the three well-known JPEG standards. From the original JPEG standard, the three DCT variants used were:

- Slow integer
- Fast integer
- Floating point

These implementations were taken from the libjpeg library [25]. From the JPEG 2000 standard, the two variants of the DWT were used:

- Irreversible (lossy, using the CDF 9/7 wavelet)
- Reversible (lossless, using the CDF 5/3 wavelet)

These variants were taken from the OpenJPEG library [26]. The implementation of its block coding method, EBCOT, was also taken from here. Finally, the implementations of the PCT from JPEG XR and its block coding method were taken from the reference software as filed by the Joint Photographic Experts Group [27].
From here, algorithms were "created" by combining a transform block and a coding block taken from those shown in Table 1. In addition to these standard blocks, the non-standard methods (vector quantization, quad-tree fractal encoding, and WebP) were also separated into two blocks. However, the blocks from these methods were not combined to create "new" algorithms.

5.1.2. Vector quantization
Compression using vector quantization is a method for representing the initial set of vectors using a smaller set of prototype or codebook vectors. This process involves two distinct steps. The first is codebook generation and the second is determining which codebook vector to assign to an image block. Vector quantization was split along these lines into the initial codebook generation stage (using the Generalized Lloyd Algorithm) and the final distance calculations between the codebook vectors and image vectors. The implementation of all algorithm parts was taken from QccPack [28].

5.1.3. Fractal compression
As mentioned above, fractal compression involves the mapping of range blocks onto domain blocks through an affine transformation. Fractal compression is split into a classification stage and a partitioning stage. The classification stage is when all domains are classified and the averages of 4×4 pixel blocks within the domains are computed (this makes them the same size as the ranges).
Table 1
Standard compression blocks used for modeling.
Transform "A" blocks: DCT slow (8×8); DCT slow (4×4); DCT fast; DCT floating point; DWT reversible; DWT irreversible; PCT
Coding "B" blocks: Huffman coding; EBCOT; JXR coding
The details of classification will not be covered here and can be found in Image and Text Compression [29]; briefly, the classification of a range or domain is based on the brightness of the pixel values in each quadrant of the subimage. A range will be encoded only against domains that share the same or a similar classification. In the partitioning stage, the encoding tree is constructed and the coefficients for the affine transformations are determined. Additional nodes are added if the average RMS error across all mappings is above a threshold or if the specified minimum depth of the tree has not yet been reached. The implementation of quad-tree fractal compression was taken from Fisher [23].

5.1.4. WebP
The two blocks that the algorithm is broken into are the analysis phase and the compression phase. During analysis, the susceptibility of the image blocks to compression is determined in order to provide the best trade-off between PSNR and compression ratio. The compression phase simply implements the statistics determined in the analysis through the use of transforms and entropy encoding. For WebP, there is a compression parameter that drastically changes the complexity of the analysis and compression phases. This parameter determines the depth of block analysis that is performed, the amount of rate-distortion optimization, and the type of entropy encoding. As this is a complex algorithm, both a low-complexity version and a high-complexity version of WebP were used in the modeling process. The parameter ranges from 1 to 5, and the values chosen were 2 and 4. These values were chosen as they represent both the diversity of WebP and realistic choices for this parameter. A value of 1 typically results in poor compression ratio and quality; a value of 2 increases the complexity by a small amount and provides better compression. On the higher end, a value of 5 invokes a high-complexity entropy encoder which would be unsuitable for embedded applications, so the next step down was taken. The implementation was taken from Google's source code [24].

5.2. Algorithm parameters
The selection of parameters to represent an algorithm is crucial to the construction of an accurate model. There is a close interaction between the algorithm being executed and the architecture on which it is executed. By having representative parameters, the model can capture this interaction. At the same time, these parameters must represent the algorithm at a high level so that they can be estimated prior to implementation. This presents a challenge: selecting high-level parameters that can still indicate the algorithm's dependency on the low-level soft-processor architecture. Two types of algorithm parameters have been chosen. The first (type I) are parameters that require knowledge only of the algorithm to be implemented, such as the precision of arithmetic operations. The second (type II) are parameters that are estimated using knowledge of the algorithm to be implemented and knowledge of other image compression methods. Note that each block contains its own set of parameters; that is, each algorithm will have two sets of these parameters, corresponding to each block.

5.2.1. Type I parameters
For the first type of parameters, the parameters chosen are:

- Presence of floating-point operations
- Presence of arbitrary multiplication
- Presence of arbitrary division
- Average working set size
- Variable working set size
- Working set dependence
- Dependency set size
- Dependency function arithmetic complexity
- Dependency function memory complexity
Floating-point operations refers to the presence of a non-trivial amount of floating-point arithmetic. The definition of "non-trivial" is left to the designer and whether they feel the number of floating-point operations is significant enough to affect power and execution time. Though no floating-point hardware is considered in this work, the presence of floating-point operations can greatly affect the execution time of an algorithm. Arbitrary multiplication and arbitrary division refer to the presence of arbitrary (that is, non-power-of-two) multiplications and divisions. The average working set size refers to the common set sizes that compression algorithms work on. For instance, the DCT typically works on 8×8 pixel partitions of the image. This value should give an indication of how much the algorithm will be affected by both instruction and data cache sizes. The other related parameter here is the variable working set size. Some blocks, such as the DCT, always have the same working set size. Others, such as EBCOT, have working set sizes that vary throughout execution. The working set dependence parameter indicates whether or not there is a data dependency between two separate working sets. In the DCT, there is no such dependency: all blocks are computed separately with no interactions between any two blocks. On the other hand, WebP uses intra-frame prediction from video compression, which involves large amounts of data dependency and operations between working sets. Again, this parameter can show how much a block is affected by the sizes of the caches. The data dependency function describes how the blocks interact with each other and the nature of this interaction. For instance, this function could be a distance calculation between corresponding pixels, pixel comparisons, and so on. The dependency function f(B, S) has two parts: the dependency set size and the dependency function complexity. Here, B is the set of all image blocks to be compressed and S is the set of blocks which are needed to compress B. For example, codebook generation in the VQ algorithm requires calculating the distance between each image block and each codebook entry. The value of this parameter would then be the cardinality of the set S, which would be
$|S| = |B| \cdot N_{CB}$

for N_CB codebook entries. This parameter can also be used to indicate iterations. In this case, codebook generation requires multiple iterations. The dependency set size would then be

$|S| = |B| \cdot N_{CB} \cdot T$

for T iterations. The complexity parameters describe the number of operations that are required for the single execution of the dependency function. For instance, if the data dependence is a distance calculation between two blocks, there are two loops: one to loop over the columns and the other to loop over the rows. Each iteration of the inner loop calculates the Euclidean distance d between the two pixel values from images x_CB and x_IB, which is
$d = d + \sqrt{x_{CB}(i,j)^2 + x_{IB}(i,j)^2}$

For one iteration of the inner loop, there are four memory accesses: two for the elements in the blocks, one to read the current value of d, and one to save the updated value. Similarly, there are five
arithmetic operations for the distance calculation and two for the loop increments. So for blocks of size N_X by N_Y, the total arithmetic operations are
$N_A = 7 \, N_Y N_X N_{blocks} N_{CB}.$

The total memory operations for the dependency function would be

$N_M = 4 \, N_Y N_X N_{blocks} N_{CB}.$

Ultimately, these dependency functions are simple enough to know prior to implementation, or they can be implemented in code or pseudocode with little overhead.

5.2.2. Type II parameters
Type II parameters are useful because they use the designer's knowledge base to estimate parameters of the algorithm to be implemented against those algorithms that were used to generate the model. The type II parameters are:
- Total arithmetic operations
- Total memory operations
- Total operations
- Ratio of divisions to arithmetic operations
- Ratio of multiplications to arithmetic operations
Counts of the total memory and arithmetic operations are important as they give an indication of the complexity of the algorithm block. While the inclusion of the total operations may seem redundant, it is done to address a shortcoming of regression trees, which are unable to handle additive data efficiently. The ratio of multiplications to arithmetic operations and the ratio of divisions to arithmetic operations give the model an idea of the dependency of the algorithm block on the higher-latency arithmetic operations, which can be affected by the choice of specialized arithmetic hardware.
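As an illustration of how the dependency-function counts of Section 5.2.1 and the Type II totals above might be tabulated before implementation, the sketch below applies the worked VQ example (7 arithmetic and 4 memory operations per inner-loop iteration); the image and codebook dimensions are illustrative, not taken from the paper.

```python
# Sketch: tallying the dependency-function operation counts of Section 5.2.1
# and the Type II totals/ratios of Section 5.2.2 for the VQ distance example.

def vq_distance_counts(n_x, n_y, n_blocks, n_cb, iterations=1):
    """Per the worked example: 7 arithmetic and 4 memory operations per
    inner-loop iteration of the codebook distance calculation."""
    inner_iters = n_x * n_y * n_blocks * n_cb * iterations
    return {"arith": 7 * inner_iters, "mem": 4 * inner_iters}

def type2_parameters(arith_ops, mem_ops, mult_ops=0, div_ops=0):
    """Type II parameters: totals plus multiplication/division ratios."""
    total = arith_ops + mem_ops
    return {
        "total_arith": arith_ops,
        "total_mem": mem_ops,
        "total_ops": total,
        "mult_ratio": mult_ops / arith_ops if arith_ops else 0.0,
        "div_ratio": div_ops / arith_ops if arith_ops else 0.0,
    }

# Example: a 128x128 image split into 4x4 blocks (1024 blocks) matched against a
# hypothetical 256-entry codebook.
counts = vq_distance_counts(n_x=4, n_y=4, n_blocks=1024, n_cb=256)
print(counts)
print(type2_parameters(counts["arith"], counts["mem"]))
```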
5.3. Architecture parameters
The selection of soft-processor architecture parameters is limited to those offered to designers by the companies that provide them, apart from clock frequency. The particular soft processor chosen for this work is the Altera NIOS 2. However, the actual choice of soft processor is not likely to affect the overall conclusions of this work; implementations of the various soft processors are likely to be similar enough not to cause results to differ by a significant amount. The parameters that were examined are shown in Table 2. The NIOS 2 allows additional parameters to be changed, including the addition of tightly coupled caches, a Memory Management Unit (MMU), or a Memory Protection Unit (MPU). Including just tightly coupled caches increases the number of combinations to 88,200, while including an MMU or MPU increases the number of combinations to 4 × 10⁶ and 9 × 10⁸, respectively. If tightly coupled caches and an MMU or MPU are considered, the number of combinations becomes 1 × 10⁸ and 2 × 10¹⁰, respectively. At these numbers of combinations, the generation and handling of the data becomes a problem of its own. There exist methods for sampling this space effectively, but this is beyond the scope of this work.

Table 2
NIOS 2 parameter summary.
Category | Parameter | Choices
Data cache | Cache size | 7
Data cache | Line size | 3
Data cache | Burst transfers | 2
Instruction cache | Cache size | 7
Instruction cache | Burst transfers | 2
Multipliers | Multiplier type | 3
Division | HW or SW division | 2
Clock frequency | 50 MHz | 1 (fixed)
Total combinations: 3528 (3288 with board limitations)
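The base parameter space of Table 2 is small enough to enumerate directly, as the sketch below shows; only the number of choices per parameter follows the table, while the concrete values listed are placeholders, since the paper does not enumerate them here.

```python
# Sketch: enumerating the base architecture space of Table 2. The per-parameter
# value lists are placeholders; only their lengths (7,3,2,7,2,3,2) follow the
# table, giving 3528 combinations before board limitations are applied.
from itertools import product

space = {
    "dcache_size_kb":    [0.5, 1, 2, 4, 8, 16, 32],    # 7 choices (placeholder values)
    "dcache_line_bytes": [4, 16, 32],                   # 3
    "dcache_burst":      [False, True],                 # 2
    "icache_size_kb":    [0.5, 1, 2, 4, 8, 16, 32],     # 7
    "icache_burst":      [False, True],                 # 2
    "multiplier":        ["none", "lut", "embedded"],   # 3 (placeholder names)
    "hw_divide":         [False, True],                 # 2
}

configs = [dict(zip(space, combo)) for combo in product(*space.values())]
print(len(configs))   # 3528
```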
6. Performance evaluation

6.1. Data acquisition
The training data used in the construction of the predictive models is measured from the actual execution of the algorithms on a single soft processor. For each algorithm block, execution time and power measurements were taken on every combination of architecture parameters. Each algorithm performed compression on a grayscale image of size 128 × 128 pixels, taken arbitrarily from the FERET [30] face dataset.
6.1.1. Power measurement
The development board used for the experiments is the Cyclone III Starter Kit. It contains a Cyclone III EP3C25 FPGA in addition to SSRAM, SDRAM, and flash memory. The board also contains components to assist in JTAG communication and other I/O operations. To measure power, the development board offers two sense resistors for measuring the current drawn by two of the power supply rails: the 1.2 V supply and the 2.5 V supply. The FPGA core is the only device that uses the 1.2 V supply, while the 2.5 V supply is used by the SSRAM, flash, SDRAM, and a small amount of I/O circuitry. A Keithley 199 DMM was used to measure the voltage across the sense resistors. It was controlled from a PC using the IEEE-488 standard. It has a resolution of 5½ digits at a measured sampling rate of 45 Hz. Even though the NIOS 2 processor runs at 50 MHz and the SDRAM at 100 MHz, there is sufficient capacitance on the sense resistor to create an averaging effect and obtain an accurate measurement. An average is taken over the execution of the algorithm block to obtain the average power consumed.
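A minimal sketch of the power computation described above, converting sense-resistor voltage samples into an average rail power; the resistor value and the sample values are hypothetical placeholders rather than board-specific figures.

```python
# Sketch: average rail power from sense-resistor voltage samples (Section 6.1.1).
# R_SENSE and the sample values are placeholders, not board-specific figures.
import numpy as np

R_SENSE = 0.010      # ohms, hypothetical sense-resistor value
V_RAIL = 1.2         # volts, FPGA core supply

def average_power(v_sense_samples, v_rail=V_RAIL, r_sense=R_SENSE):
    """Mean power drawn from the rail over the execution of an algorithm block."""
    current = np.asarray(v_sense_samples) / r_sense   # I = V_sense / R_sense
    return float(np.mean(v_rail * current))           # P = V_rail * I, averaged

# 45 Hz DMM samples taken while the block executes (placeholder values, in volts).
samples = [0.0021, 0.0023, 0.0022, 0.0024]
print(f"{average_power(samples) * 1e3:.1f} mW")
```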
6.1.2. Execution time measurement
The NIOS 2 supports the use of gprof, the GNU profiler. Gprof works by sampling the program counter to determine the time spent in each function, and by injecting profiling calls into functions to record the call graph. The overhead introduced by these injected calls is detected by gprof itself and is excluded from the final measurement. The other issue with gprof is sampling error. In practice, the expected sampling error of a function is proportional to the square root of the number of samples of that function times the sampling period (1 ms) [31]. The algorithm blocks executed here are repeated multiple times for a total execution time of at least several seconds, ensuring that the sampling error was within 1–5% of the measured values. The execution times for the algorithm blocks are reported in milliseconds (ms).
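Assuming the square-root error model above, a short calculation gives the total run time (and hence the repetition count) needed to keep the sampling error inside the 1–5% band; the 70 ms block time used in the example is arbitrary.

```python
# Sketch: choosing the repetition count so that gprof's sampling error stays
# under a target fraction, assuming expected error ~ sqrt(n_samples) * period.
import math

def repetitions_needed(block_ms, target_rel_err, sampling_period_ms=1.0):
    # error/total ~ sqrt(total*T)/total = sqrt(T/total)  =>  total >= T / err^2
    total_ms = sampling_period_ms / target_rel_err ** 2
    return total_ms, math.ceil(total_ms / block_ms)

total_ms, reps = repetitions_needed(block_ms=70, target_rel_err=0.01)
print(total_ms, reps)   # 10000.0 ms of total run time, i.e. 143 repetitions
```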
6.2. Model construction
Every algorithm block was executed on every combination of architecture parameters, and its execution time, core power, and off-chip device power were measured for each execution. From here, the measurements of separate blocks were combined to form a complete compression algorithm. First, a combination of one block from each column in Table 1 is taken to form this complete algorithm. Additionally, the separate blocks from fractal compression, VQ, and WebP are included without using combinations of these. Each combination of algorithm blocks generates three sets (one for each metric) of N training vectors, where N is the total number of combinations of architecture parameters. An ensemble was created for each of the three metrics being predicted. For the measured values, the total execution time is the sum of each individual block's execution time. For core and off-chip device power, the mean of each individual block's power consumption is taken. Ultimately, a training vector consists of the parameters of the algorithm and the parameters of the architecture on which the algorithm was executed, with the respective metric as its response value. As regression trees are unable to handle additive data, an additional set of parameters is added to the training vectors. These additional parameters are the combination of the corresponding parameters of the two blocks. In this way, the training data incorporates the combination of parameters instead of each parameter on its own. Ultimately, each training vector will be 3·n_B + n_A elements long, where n_B is the number of parameters for an algorithm block and n_A is the number of architecture parameters. The coefficient on n_B accounts for the two algorithm blocks plus the combination of the two sets of parameters. For this set of architecture and algorithm parameters, n_B = 12 and n_A = 7, resulting in 43 parameters used to predict each of the 3 response variables.
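The following sketch shows how a single 43-element training vector and its response values might be assembled. Combining the corresponding block parameters by element-wise addition is an assumption made for illustration; the paper does not state the exact combination rule.

```python
# Sketch: building one training vector of Section 6.2. Each block contributes
# n_B = 12 parameters, the architecture n_A = 7, plus a combined block-A/block-B
# set, giving 3*n_B + n_A = 43 features. Element-wise addition is an assumed
# combination rule, used here only for illustration.
import numpy as np

def training_vector(block_a, block_b, arch):
    a = np.asarray(block_a, dtype=float)   # 12 block-A algorithm parameters
    b = np.asarray(block_b, dtype=float)   # 12 block-B algorithm parameters
    combined = a + b                       # assumed element-wise combination
    return np.concatenate([a, b, combined, np.asarray(arch, dtype=float)])

def responses(block_a_meas, block_b_meas):
    """Execution time is summed across blocks; core and off-chip power are averaged."""
    return {
        "exec_time_ms": block_a_meas["time"] + block_b_meas["time"],
        "core_power_mw": (block_a_meas["core"] + block_b_meas["core"]) / 2,
        "offchip_power_mw": (block_a_meas["offchip"] + block_b_meas["offchip"]) / 2,
    }

vec = training_vector(block_a=np.ones(12), block_b=np.zeros(12), arch=np.ones(7))
print(vec.shape)   # (43,)
```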
6.3. Model validation method
For the final versions of the three models, all combinations of algorithm blocks would be used to generate them. However, there is a need here to perform validation on the models to have an idea of how well they can generalize to an unseen image compression algorithm. To do this, cross-validation must be used. Cross-validation [32] is a method that tests how well a predictive model can generalize to an unknown test set. k-fold cross-validation works by selecting k subsets of the data and then generating the model using k − 1 subsets while using the remaining subset as a test set. In the case of the models generated here, k is the number of algorithm combinations, as discussed in the previous section. Each fold is a separate algorithm combination, ensuring that the predictive ability of the model is tested in the fairest, and most realistic, way possible. This work presents the result of each fold so the model can be evaluated for its predictive performance on any given algorithm. The idea is that the folds were chosen at a high level of abstraction and therefore the analysis must be done at this high level as well; it will be seen that the algorithm being predicted has a large effect on the accuracy of the model.

One issue that must be addressed here concerns extrapolation, as regression forests are unable to extrapolate. Because of this, care must be taken when including algorithms in the error analysis whose responses are at the lower and upper limits; models attempting to predict these extreme values will invariably perform poorly. However, if the extreme values are shared by a number of algorithms then there is no issue. For core and off-chip device power, the measured values are normally distributed with the tails containing values from a number of different algorithms, so there are no issues concerning extrapolation. However, the measured values for execution time across all algorithms, shown in Fig. 2, behave differently. While there is a high concentration of values at short execution times, there is a very long tail which contains the execution times of a small number of algorithms. Also, execution times are much more dispersed than the core or off-chip device power measurements. A common dispersion metric, the quartile coefficient of dispersion (QCD), is defined as

$\mathrm{QCD} = \frac{\mathrm{IQR}}{Q_3 + Q_1}, \qquad \mathrm{IQR} = Q_3 - Q_1$

where Q_1 and Q_3 are the first and third quartiles and IQR is the interquartile range. The QCD is useful as it provides a scale-invariant metric for dispersion. For the core and off-chip device power measurements, the QCD is 0.05 and 0.04, respectively. For execution time, this value is 0.76, indicating a much more dispersed data set. What this means for regression trees is that problems can arise due to a lack of data when characterizing the less dense regions. The longest execution times belong to a single algorithm: the fractal compression algorithm. The distribution of algorithms in the tail (refer to Fig. 2), starting at 6 × 10⁴ ms, is shown in Table 3. Further, fractal compression is the only algorithm to have any execution times over 13 × 10⁴ ms. While fractal compression has the longest execution times, the other algorithms in Table 3 can be expected to have higher levels of error due to this same issue.

Table 3
Execution time "tail" membership.
Algorithm | Percent of samples (%)
Fractal | 61
Vector quantization | 28
WebP method 4 | 11

To evaluate the models, both absolute and relative error are used. Absolute error ε_A for a predicted value p and a measured value m is defined as

$\epsilon_A = |p - m|.$

Similarly, relative error
ε_R is defined as

$\epsilon_R = \frac{\epsilon_A}{m}.$
In addition to these, the coefficient of determination R² is also given as a metric for a model's predictive performance. The R² value is a measure of how well a given model fits a set of data. If ȳ is the mean of the measured values, y_i are the measured values, and ŷ_i are the predicted values, R² is defined as
Fig. 2. Execution time measurement distribution.
$R^2 = 1 - \frac{\mathrm{SSR}}{\mathrm{SST}}$

where SSR is the residual sum of squares and SST is the total sum of squares. These are defined as

$\mathrm{SSR} = \sum_i (y_i - \hat{y}_i)^2, \qquad \mathrm{SST} = \sum_i (y_i - \bar{y})^2.$

For linear regression, the R² value is limited to the interval [0, 1]. With non-linear regression methods, such as regression trees, the interval becomes (−∞, 1]. A negative R² value indicates that simply choosing the mean of the data results in a better fit than the actual predictive model. As the models are dealing with estimated values, this error is actually the residual error. However, for brevity, any reference to error in the model evaluation process should be taken to mean the residual error on the test data.
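The validation metrics above and the per-algorithm folds translate directly into code; in the sketch below, `fit_ensemble` and the sample records are placeholders standing in for the bagged ensembles and measurement data described earlier.

```python
# Sketch: the error metrics of Section 6.3 and the leave-one-algorithm-out folds.
# `fit_ensemble` and the sample records are placeholders, not the paper's code.
import numpy as np

def abs_err(p, m):
    return np.abs(p - m)

def rel_err(p, m):
    return np.abs(p - m) / m

def r_squared(y, y_hat):
    ssr = np.sum((y - y_hat) ** 2)          # residual sum of squares
    sst = np.sum((y - np.mean(y)) ** 2)     # total sum of squares
    return 1.0 - ssr / sst

def qcd(values):
    """Quartile coefficient of dispersion: IQR / (Q3 + Q1)."""
    q1, q3 = np.percentile(values, [25, 75])
    return (q3 - q1) / (q3 + q1)

def leave_one_algorithm_out(samples, fit_ensemble):
    """One cross-validation fold per algorithm combination."""
    results = {}
    for alg in sorted({s["algorithm"] for s in samples}):
        train = [s for s in samples if s["algorithm"] != alg]
        test = [s for s in samples if s["algorithm"] == alg]
        model = fit_ensemble(train)                       # placeholder training call
        y = np.array([s["response"] for s in test])
        y_hat = np.array([model.predict([s["features"]])[0] for s in test])
        results[alg] = {"mean_abs": float(np.mean(abs_err(y_hat, y))),
                        "mean_rel_pct": 100 * float(np.mean(rel_err(y_hat, y))),
                        "r2": float(r_squared(y, y_hat))}
    return results
```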
6.4. Results
This section examines the results of predicting each individual metric. To our knowledge, there is currently no work that attempts to predict power consumption and execution time for algorithms at this level of abstraction. Because of this, no direct comparison is possible.

6.4.1. Core power
FPGA core power represents the total power consumption of all LUTs, registers, multipliers, and memory being used on the FPGA core during execution. As mentioned above, the value being predicted here is the mean of the current drawn over the entire execution of the algorithm. Over all architectures and algorithms, the core power ranged from 159 mW to 272 mW. The full overview of the errors for each fold in cross-validation is shown in Table 4. The table shows that predicting an algorithm's core power consumption is accurate, with an overall mean absolute error of only 2.47 mW, ranging from 1.35 mW to 5.20 mW per fold. The R² values are all close to 1 and range from 0.82 to 0.99, which indicates that all models fit the test data well.

In terms of framework generalization, core power is very dependent on the underlying technology. Larger cache sizes and LUT-based multipliers, for example, require more logic elements and interconnect, which increases power consumption. Only the slight differences between the different soft processors would need to be taken into account in order to make this model more general. While this stays true for other programmable logic devices (such as CPLDs), it is not the case for hard processors, where there is a more complex relationship between cache sizes and implementation details. However, core power can give an indication of the activity level of pipeline components, which is largely processor-independent. Using this information, parameters could be inferred that would allow the use of FLPA methods, in addition to the model presented here, to predict the core power consumption of hard processors.

Table 4
Core power per algorithm fold performance.
Algorithm | Mean ε_A (mW) | Mean ε_R (%) | R²
DCT 4×4 + EBCOT | 1.49 | 0.8 | 0.98
DCT 8×8 + EBCOT | 1.35 | 0.7 | 0.99
DCT float + EBCOT | 2.21 | 1.1 | 0.97
DCT fast + EBCOT | 1.65 | 0.8 | 0.98
DCT float + Huffman | 2.31 | 1.2 | 0.97
DCT fast + Huffman | 1.80 | 0.9 | 0.98
DCT 4×4 + Huffman | 1.53 | 0.8 | 0.98
DCT 8×8 + Huffman | 1.45 | 0.7 | 0.99
DCT 4×4 + JXR | 1.47 | 0.8 | 0.98
DCT 8×8 + JXR | 1.54 | 0.8 | 0.98
DCT float + JXR | 2.44 | 1.3 | 0.97
DCT fast + JXR | 1.99 | 1.0 | 0.97
DWT irr. + EBCOT | 3.01 | 1.5 | 0.94
DWT irr. + Huffman | 2.92 | 1.5 | 0.93
DWT irr. + JXR | 3.07 | 1.6 | 0.93
DWT rev. + EBCOT | 3.05 | 1.6 | 0.92
DWT rev. + Huffman | 2.48 | 1.3 | 0.95
DWT rev. + JXR | 3.15 | 1.7 | 0.93
Fractal | 3.91 | 1.9 | 0.92
PCT + EBCOT | 2.27 | 1.1 | 0.96
PCT + Huffman | 2.03 | 1.0 | 0.97
PCT + JXR | 1.94 | 1.0 | 0.97
Vector quantization | 5.20 | 2.7 | 0.82
WebP method 2 | 3.82 | 2.0 | 0.93
WebP method 4 | 3.65 | 1.8 | 0.94
Mean | 2.47 | 1.3 | —
6.4.2. Off-chip device power
The off-chip device power is the power consumed by a number of off-chip devices on the development board. It is measured using the sense resistor on the 2.5 V power supply. The 2.5 V supply powers the DDR SDRAM, the SSRAM, the flash memory, and some small I/O circuits. For this work, the SSRAM and the flash were not used. Ultimately, off-chip device power represents how the activity of the soft processor relates to memory usage and accesses to off-chip devices. Over all architectures and algorithms, the off-chip device power ranged from 475 mW to 652 mW. The full overview of the errors for each fold in cross-validation is shown in Table 5. Here, the off-chip device power predictions are accurate, with an overall mean absolute error of only 8.53 mW, ranging from 4.71 mW to 17.96 mW per fold. The R² values show that the model does not fit the data as well as the core power model and range from 0.29 to 0.90.

Unlike core power, off-chip device power consumption is virtually independent of the implementation of the processor itself, whether as a soft-core or a hard-core processor. The off-chip device power model is therefore dependent on the memory technology. The Cyclone III Starter Kit uses SDRAM, which is typical in many embedded situations, making the off-chip device model generalizable to many system configurations regardless of processor implementation.

6.4.3. Execution time
Execution time is the time taken to complete one execution of the block in question. The reported execution time is the average time taken over multiple runs of an algorithm block. The full overview of the errors for each fold in cross-validation is shown in Table 6. Though having an overall relative error of 18.3%, the model exhibits high levels of error in four cases: floating-point
Table 5
Off-chip device power per algorithm fold performance.
Algorithm | Mean ε_A (mW) | Mean ε_R (%) | R²
DCT 4×4 + EBCOT | 6.13 | 1.1 | 0.90
DCT 8×8 + EBCOT | 5.51 | 1.0 | 0.94
DCT float + EBCOT | 11.65 | 2.1 | 0.84
DCT fast + EBCOT | 7.84 | 1.4 | 0.87
DCT float + Huffman | 8.96 | 1.7 | 0.85
DCT fast + Huffman | 6.07 | 1.1 | 0.87
DCT 4×4 + Huffman | 5.38 | 1.0 | 0.89
DCT 8×8 + Huffman | 5.17 | 1.0 | 0.92
DCT 4×4 + JXR | 4.71 | 0.9 | 0.94
DCT 8×8 + JXR | 4.86 | 0.9 | 0.95
DCT float + JXR | 8.16 | 1.5 | 0.91
DCT fast + JXR | 6.20 | 1.2 | 0.90
DWT irr. + EBCOT | 15.42 | 2.7 | 0.71
DWT irr. + Huffman | 14.82 | 2.7 | 0.53
DWT irr. + JXR | 8.65 | 1.6 | 0.89
DWT rev. + EBCOT | 12.81 | 2.3 | 0.30
DWT rev. + Huffman | 7.44 | 1.4 | 0.63
DWT rev. + JXR | 12.79 | 2.4 | 0.29
Fractal | 8.75 | 1.6 | 0.89
PCT + EBCOT | 7.04 | 1.2 | 0.88
PCT + Huffman | 5.11 | 0.9 | 0.90
PCT + JXR | 5.65 | 1.0 | 0.91
Vector quantization | 17.96 | 3.4 | 0.57
WebP method 2 | 7.60 | 1.4 | 0.89
WebP method 4 | 8.51 | 1.5 | 0.88
Mean | 8.53 | 1.6 | —
DCT + Huffman, irreversible DWT + JPEG XR coding, fractal compression, and WebP method 2. As discussed in Section 6.3, the high error for fractal compression can be explained by the fact that the ensemble is attempting to extrapolate. Ultimately, this should not be an issue for users of this framework, as it is unlikely that users will want an algorithm with a longer execution time than fractal compression; the longest execution time for fractal compression is 160 × 10³ ms for a single 128 × 128 pixel image.
Table 6
Execution time per algorithm fold performance.
Algorithm | Mean ε_A (ms) | Mean ε_R (%) | R²
DCT 4×4 + EBCOT | 70 | 1.0 | 1.00
DCT 8×8 + EBCOT | 60 | 0.9 | 1.00
DCT float + EBCOT | 2495 | 24.4 | 0.56
DCT fast + EBCOT | 1266 | 16.8 | 0.76
DCT float + Huffman | 2498 | 76.7 | −2.44
DCT fast + Huffman | 52 | 8.0 | 0.81
DCT 4×4 + Huffman | 64 | 9.5 | 0.70
DCT 8×8 + Huffman | 42 | 5.9 | 0.88
DCT 4×4 + JXR | 64 | 3.7 | 0.98
DCT 8×8 + JXR | 47 | 2.7 | 0.99
DCT float + JXR | 1017 | 24.8 | 0.67
DCT fast + JXR | 56 | 3.4 | 0.99
DWT irr. + EBCOT | 1694 | 18.0 | 0.74
DWT irr. + Huffman | 424 | 15.8 | 0.78
DWT irr. + JXR | 1429 | 40.4 | 0.01
DWT rev. + EBCOT | 205 | 2.8 | 0.99
DWT rev. + Huffman | 76 | 11.0 | 0.45
DWT rev. + JXR | 60 | 3.9 | 0.99
Fractal | 55,348 | 88.3 | −2.80
PCT + EBCOT | 83 | 1.1 | 1.00
PCT + Huffman | 71 | 11.3 | 0.34
PCT + JXR | 76 | 4.3 | 0.98
Vector quantization | 10,454 | 31.6 | 0.73
WebP method 2 | 5565 | 38.4 | 0.09
WebP method 4 | 3716 | 12.3 | 0.91
Mean | 3477 | 18.3 | —
The error in the other cases is due to the relationship between variable importance and the algorithm being predicted. Table 7 shows the average change in mean squared error (MSE) for the top 20 variables in descending order. The variables at the top are those which contribute the most to the decrease in error of the model in fitting the training set. For instance, the floating-point DCT has very similar parameters to the other DCT variants. When it is excluded from the construction process, the arithmetic precision parameter becomes less well-defined. From Table 7, it can be seen that both the arithmetic operations and the precision of block A are important in predicting execution time. In fact, higher levels of error can be seen in all of the algorithms that contain the floating-point DCT. This same phenomenon is the cause of high levels of error for different algorithms on different parameters. Ultimately, this is an effect of the chosen validation process. In practice, these ill-defined parameters can be detected using node statistics and the outlier measure, which maintains the model's feasibility in these cases. Models for execution time are highly dependent on the implementation of a processor, either as a soft or a hard processor. Execution time is reported here as "wall-clock" time in milliseconds; expressing this as a clock cycle count means the model can be made more general. Generalization of the model would then rely on the non-trivial task of characterizing the differences in clock cycle latency of components in the pipeline. The latency due to off-chip memory devices would not be affected by processor choice but still plays an important role in determining execution time. Because of this, the latency of other off-chip memory technologies would need to be characterized.
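A ranking in the spirit of Table 7 can be extracted from a bagged ensemble as sketched below, using the impurity-based (MSE-decrease) importances of scikit-learn trees averaged over the ensemble; this is an analogue of the paper's measure rather than a reproduction of it, and the feature labels are illustrative.

```python
# Sketch: a variable-importance ranking in the spirit of Table 7, taken as the
# impurity-based (MSE-decrease) importances of each bagged tree averaged over
# the ensemble. An analogue of the paper's measure, not a reproduction of it.
import numpy as np

def average_importance(bagged_ensemble, feature_names):
    per_tree = [tree.feature_importances_ for tree in bagged_ensemble.estimators_]
    mean_importance = np.mean(per_tree, axis=0)
    return sorted(zip(feature_names, mean_importance),
                  key=lambda pair: pair[1], reverse=True)

# Usage with a fitted BaggingRegressor (see the sketch in Section 3.1) and the 43
# feature labels of Section 6.2, e.g. ["blockB_arith_ops", "arch_multiplier", ...]:
# for name, value in average_importance(ensemble, feature_names)[:20]:
#     print(f"{name:30s} {value:.4f}")
```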
7. Providing prediction confidence

In Section 3, node statistics and the outlier measure were discussed as ways to give the user an indication of the confidence of a prediction. This section demonstrates how to put those concepts into practice. As the power predictions are typically accurate, they are not considered here; only execution time is examined for prediction confidence.
Table 7
Execution time variable importance.

Group          Variable name                 Δ MSE
Block B        Arithmetic operations         9.68 × 10³
Architecture   Multiplier type               1.75 × 10³
Architecture   Inst. cache size              1.66 × 10³
Block A        Arithmetic operations         4.53 × 10²
Block A        Precision                     4.11 × 10²
Architecture   Data cache size               2.33 × 10²
Block A        Dependency set size           2.26 × 10²
Block B        Multiplication ratio          1.96 × 10²
Block A        Variable working set size     1.73 × 10²
Total          Memory operations             1.64 × 10²
Architecture   Data cache line size          8.11 × 10¹
Total          Total operations              8.07 × 10¹
Block A        Average working set size      5.64 × 10¹
Total          Arithmetic operations         5.22 × 10¹
Block A        Memory operations             3.18 × 10¹
Architecture   Divider type                  1.42 × 10¹
Block A        Total operations              3.53 × 10⁰
Block A        Multiplication ratio          3.47 × 10⁰
Block B        Average working set size      3.32 × 10⁰
Architecture   Inst. cache burst transfers   4.49 × 10⁻¹
First, a relationship must be established between the error of a given input test vector, the outlier measure of that vector, and the node statistics associated with the prediction of that vector. The outlier measure is plotted against the absolute error across all test sets in Fig. 3. It can be seen that a test vector with an outlier measure between 4 and 10 has a higher probability of exhibiting large error. Similarly, node probability is plotted against absolute error in Fig. 4. Here, only nodes with very small probability, around 1 × 10⁻⁴, which corresponds to the minimum node size of a tree, contain high error in the test set. From these observations, it is reasonable to hypothesize that a test vector with an outlier measure between 4 and 10 and a node probability between 0 and 1 × 10⁻⁴ has a greater probability of containing high error. To test this, the set of test vectors meeting these conditions (unstable) was compared against the set that did not (stable). The mean, median, and interquartile range (IQR) of their absolute-error distributions are shown in Table 8. The table shows that the absolute errors of the unstable test vectors are more dispersed and tend to be larger than those of the stable vectors. Ultimately, the outlier measure and node probability of a test vector are measured and presented to the user as prediction confidence.
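The decision rule above can be expressed directly in code. The sketch below assumes a fitted scikit-learn RandomForestRegressor and interprets "node probability" as the fraction of training samples that fall into the same leaf as the test vector, averaged over the trees; the outlier measure is assumed to be computed separately from Breiman-style proximities. Both the interpretation and the helper names are assumptions, not the implementation used in this work.

```python
# Sketch: flagging low-confidence ("unstable") predictions from node statistics
# and a proximity-based outlier measure, using the thresholds reported above.
# Assumes a fitted sklearn RandomForestRegressor `forest`; the definition of
# node probability here (mean leaf occupancy over trees) is an interpretation.
import numpy as np

def node_probability(forest, X_train, X_test):
    """For each test vector: fraction of training samples sharing its leaf,
    averaged over all trees in the ensemble."""
    train_leaves = forest.apply(X_train)   # shape (n_train, n_trees)
    test_leaves = forest.apply(X_test)     # shape (n_test, n_trees)
    n_train = X_train.shape[0]
    probs = np.empty(X_test.shape[0])
    for i, leaves in enumerate(test_leaves):
        occupancy = [(train_leaves[:, t] == leaf).sum() / n_train
                     for t, leaf in enumerate(leaves)]
        probs[i] = float(np.mean(occupancy))
    return probs

def flag_unstable(outlier_measure, node_prob,
                  outlier_range=(4.0, 10.0), prob_threshold=1e-4):
    """True where a prediction should be reported with low confidence."""
    outlier_measure = np.asarray(outlier_measure)
    node_prob = np.asarray(node_prob)
    in_band = (outlier_measure >= outlier_range[0]) & (outlier_measure <= outlier_range[1])
    sparse_leaf = node_prob < prob_threshold
    return in_band & sparse_leaf
```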
Fig. 3. Absolute error versus outlier measure.

Fig. 4. Absolute error versus node probability.
Table 8
Test vector distribution statistics.

Test vectors   Mean Δ (ms)   Median Δ (ms)   IQR Δ (ms)
Stable         583           85              580
Unstable       3609          1690            3692
8. Application to design space exploration

This section describes how the high-level model is applied to design space exploration. Consider a designer looking to use a soft processor for an image compression application. Because the soft processor is part of a larger system, resource usage is an important consideration; it will be shown how the framework can be used to select suitable cache sizes. The parameters of the algorithm are known and are input into the model, together with the fact that the cache sizes are the design space parameters of interest. The issue is not the cache sizes themselves, but rather the resources used by the various sizes of cache: in the FPGA core, both caches use M9K memory blocks, of which only a limited number are available. The trade-offs considered here are execution time and both types of power, core and off-chip device; total power, the sum of core and off-chip device power, is used in what follows.
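As a concrete illustration of this setup, the sketch below enumerates candidate instruction- and data-cache sizes and queries trained power and execution-time models for each combination. The model objects, the algorithm-parameter vector, and the mapping from cache size to M9K blocks are hypothetical placeholders, not the actual Nios II memory map.

```python
# Sketch: enumerating the cache-size design space and predicting each metric.
# `time_model` / `power_model` stand for the trained regression-tree ensembles;
# `algo_params` is the high-level algorithm parameter vector; the M9K mapping
# below is a hypothetical placeholder, not the actual Nios II memory map.
import itertools
import numpy as np

def m9k_blocks(cache_bytes):
    # Hypothetical: assume one M9K block per KiB of cache data plus one for tags.
    return int(np.ceil(cache_bytes / 1024)) + 1

def enumerate_design_space(time_model, power_model, algo_params,
                           inst_cache_sizes=(4096, 8192, 16384, 32768),
                           data_cache_sizes=(2048, 4096, 8192, 16384)):
    points = []
    for icache, dcache in itertools.product(inst_cache_sizes, data_cache_sizes):
        x = np.concatenate([np.asarray(algo_params, dtype=float),
                            [icache, dcache]]).reshape(1, -1)
        points.append({
            "icache": icache,
            "dcache": dcache,
            "m9k": m9k_blocks(icache) + m9k_blocks(dcache),
            "time_ms": float(time_model.predict(x)[0]),
            "power_mw": float(power_model.predict(x)[0]),
        })
    return points
```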
The framework calculates predictions of each metric for every combination of the architecture parameters of interest, given the algorithm parameters. From here, two types of multi-objective analysis can be performed. First, the Pareto-optimal architectures in terms of execution time versus power are found and presented to the designer along with their corresponding resource usage. Second, optimization can be performed to determine the best configuration of cache sizes for either execution time or power. The first analysis finds the optimal architectures when considering power and execution time; once these are found, their resource usage is presented to the designer. Less optimal design points are also shown so that a more complete picture of the design space is obtained. An example is shown in Fig. 5. In addition to the design points, the user can hover over a data point to see the associated architecture parameters and resource usage, as shown in the figure. Because the optimal architectures here have not been constrained by memory block usage, the designer retains flexibility in the final system. Optimizing for memory block usage instead allows the designer to select a cache size configuration that conforms to either a power or an execution time budget. Fig. 6 shows M9K usage versus total power for an example algorithm: for each number of memory blocks used, the lowest-power architectures are presented to the user. This allows the designer to see how total power is affected by the various combinations of cache sizes. As in the previous approach, the user can inspect the specific architecture parameters of each point.
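A minimal sketch of these two selections, operating on the design-point dictionaries produced by the enumeration sketch above (the dictionary keys are the same assumed names):

```python
# Sketch: the two multi-objective selections described above, applied to the
# list of design-point dicts produced by enumerate_design_space().
def pareto_front(points):
    """Keep the points not dominated in (execution time, power): no other point
    is at least as good on both metrics and strictly better on one."""
    front = []
    for p in points:
        dominated = any(
            q["time_ms"] <= p["time_ms"] and q["power_mw"] <= p["power_mw"]
            and (q["time_ms"] < p["time_ms"] or q["power_mw"] < p["power_mw"])
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front, key=lambda p: p["time_ms"])

def min_power_per_m9k(points):
    """For each M9K block count, the lowest-power design point (cf. Fig. 6)."""
    best = {}
    for p in points:
        blocks = p["m9k"]
        if blocks not in best or p["power_mw"] < best[blocks]["power_mw"]:
            best[blocks] = p
    return dict(sorted(best.items()))
```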
9. Conclusions and future work
This paper has presented a design space exploration framework for FPGA-based soft processors. Underlying the framework is an accurate predictive model which uses architecture parameters as well as high-level algorithm parameters.
Fig. 5. Optimal architectures: Total power versus execution time by M9K block usage.
Fig. 6. Architectures: Total power versus M9K blocks.
Additionally, the model is able to provide a level of prediction confidence to the user based on the input vector's relation to the training data. Using this framework, a designer can also perform algorithm space exploration by comparing the metrics and optimizations of multiple algorithms and choosing the best one. These predictions are available very early in the design phase, long before an algorithm implementation is possible, and they remove the need to spend 25 days generating all architecture combinations to see how a specific algorithm performs on a soft processor. Ultimately, this allows for fast design space exploration in the very early stages of design for current and future image compression algorithms. At the moment, the model only predicts the power and performance of a single soft-processor system using SDRAM. However, by characterizing differences in implementations and power consumption, the model can become more general and allow for richer design space exploration among not only architecture parameters but different architectures as well. An extension of this work is sensitivity analysis: because the algorithm parameters are estimated by a human, they are subject to human error. Future work therefore includes analysing the sensitivity of the model to such errors, predicting the effects of soft errors in the multipliers or dividers, and extending the system to include predictions for dual soft-processor systems.
Adam Powell received the B.S. degree in computer engineering from the University of Kansas in Lawrence, Kansas, USA, and the M.Sc. degree from Imperial College London, London, UK, in 2008 and 2009, respectively. Currently he is working towards his Ph.D. degree with an interest in image processing and performance modeling.
Christos Savvas-Bouganis received the M.Eng. (Honors) degree in computer engineering and informatics from University of Patras, Patras, Greece, and the M.Sc. and Ph.D. degrees from Imperial College London, London, UK, in 1998, 1999, and 2004, respectively. In 2007, he joined the faculty at Imperial College London. His research interests include reconfigurable architectures for signal processing, computer vision and machine learning algorithms targeting reconfigurable hardware.
Peter Y.K. Cheung received the B.S. degree with first class honors from Imperial College of Science and Technology, University of London, London, UK, in 1973. Since 1980, he has been with the Department of Electrical and Electronic Engineering, Imperial College, where he is currently a Professor of digital systems and head of the department. His research interests include VLSI architectures for signal processing, asynchronous systems, reconfigurable computing using FPGAs, and architectural synthesis.