Feedback-directed specialization of code


Computer Languages, Systems & Structures 36 (2010) 2 -- 15

Contents lists available at ScienceDirect

Computer Languages, Systems & Structures journal homepage: www.elsevier.com/locate/cl

Minhaj Ahmad Khan∗,1, Bahauddin Zakariya University, Multan, Pakistan

ARTICLE INFO

Article history: Received 16 July 2008; Received in revised form 17 October 2008; Accepted 5 January 2009

Keywords: Programming languages; Optimizations; Specialization; Compilers; Dynamic code generation

ABSTRACT

Based on feedback information, a large number of optimizations can be performed by the compiler. This information indicates the changing behavior of an application and can be used to specialize code accordingly. Code specialization facilitates compiler optimizations by providing information regarding the variables in the code. It is, however, difficult to select the variables which maximize the benefit of specialization. The overhead of specialization and the increase in code size are also the main issues when specializing code. This paper suggests a novel method for improving performance using specialization based on feedback information and analysis. The code is iteratively specialized after selecting candidate variables through a heuristic, followed by generation of optimized templates. These templates require a limited set of instructions to be specialized at runtime and are valid for a large number of values. The overhead of runtime specialization is further minimized through an optimal software cache of template clones whose instantiation can be performed at static compile time. Experiments have been performed on Itanium-II (IA-64) and Pentium-IV processors using the icc and gcc compilers. A significant improvement in terms of execution speed and reduction of code size has been achieved for the SPEC and FFTW benchmarks. © 2009 Elsevier Ltd. All rights reserved.

1. Introduction

Feedback information provides an opportunity for compilers to optimize code accordingly. It can be used either online or offline. The online feedback systems implemented in Trident [1], ADORE [2] and Jikes RVM [3] optimize code during execution of the program. For such systems, the overhead of instrumentation, profiling and performing optimizations becomes very large and may impact the performance of the application. On the other hand, some feedback-directed optimization approaches [4–6] trace the behavior of the application offline, as a separate run. These systems then adapt the code by performing different optimizations together with code specialization. Based on profiling, code specialization enables many optimizations such as constant propagation, dead-code elimination, software pipelining, loop unrolling, etc. [6–8]. The size of the code after specialization, however, can be large, and may thereby degrade performance due to instruction cache misses and the overhead of branches. The large code size implicitly impacts the heuristics of inlining, unrolling and pipelining, and consequently, the compilers resort to generating sub-optimal code (in terms of scheduling, register allocation, etc.) [8,9]. Moreover, profiling all memory locations/variables incurs a large overhead and may not always be beneficial.

∗ Tel.: +33 13 92 54 064. E-mail addresses: [email protected], [email protected] (M.A. Khan). 1 Most of this work was accomplished when the author was affiliated with UVSQ, France. 1477-8424/$ - see front matter © 2009 Elsevier Ltd. All rights reserved. doi:10.1016/j.cl.2009.01.001


In order to mitigate the code size increase, runtime specialization of values is performed by dynamic compilation systems [10–15] and off-line partial evaluators [16,17]. However, these must perform time-consuming activities, including code generation/optimization and memory allocation, which require hundreds of calls to amortize their overhead.

This article suggests an approach aimed at improving the performance of applications through feedback-directed specialization. The profiling overhead is minimized by using efficient analysis of the code and restricting specialization to the parameters where it would most probably be beneficial. Based on the parameters found after code analysis, we make use of a declarative approach to specialize the code. The code is specialized to obtain optimized templates at static compile time. An optimal cache of template clones is generated, where each clone can work for a large set of values by adapting it at runtime. This adaptation requires modification of a small number of binary instructions. The selection of appropriate variables for specialization, the low cost of runtime activities, and the optimizations obtained at static compile time enable our approach to obtain a significant improvement in performance.

The remaining part of the paper is organized as follows. Section 2 describes a summarized background of existing specialization approaches, whereas Section 3 elaborates the context of feedback-directed specialization with an example. The approach of feedback-directed specialization is described in Section 4. The implementation steps of the suggested approach are given in Section 5. The experimental setup and the performance results, together with the overhead, are given in Section 6. A comparison with existing approaches is provided in Section 7, followed by concluding remarks in Section 8.

2. Background: template-based specialization

Consider the code in Fig. 1 of the most time-consuming function, train_match, from the SPEC benchmark 179.art. When the variables in the source code are given constant values, the compiler generates highly optimized code. For example, the compiler performs software pipelining or better scheduling if it knows the loop trip count or the dependence distance. Depending upon the use of the variable, the compiler can also perform other optimizations such as partial evaluation, cache prefetching, loop unrolling, a reduced number of spills and fills, etc.

2.1. Notion of a template

Since the values of most of the variables are unknown at static compile time, an attempt could be made to keep many specialized versions for possible runtime values in order to benefit from the optimizations. Keeping many specialized code versions, however, degrades performance due to a large number of instruction cache misses, the overhead of branches, and an indirect impact on optimizations which are largely dependent on code size, such as inlining, loop unrolling or scheduling. The idea here is to generate a template [8,18,19] which, when instantiated with runtime values, is equivalent to the statically specialized code.

Fig. 1. 179.art Benchmark: the inner loop bound numf1s is a good candidate for loop specialization as it would let the compiler perform optimizations such as software pipelining, better unrolling, prefetching, etc.


Fig. 2. Object code generated by icc v 9.0 on IA-64 after specialization of code with different values of numf1s loop bound. (a) numf1s = 23. (b) numf1s = 25.

2.2. Template generation

If the code in Fig. 1 is compiled using different specialized values of the numf1s parameter, we obtain very similar versions of code.2 These versions contain similar code in terms of the optimizations, including register allocation and code scheduling; however, they differ by some constants which depend on the value of the specialized parameter. Fig. 2 shows two similar code versions obtained after specialization of the code (with numf1s = 23 and 25) for the train_match function. Any of these versions can therefore be used as a template, which can be transformed during execution into the other version through runtime instantiation. Since the template is generated at static compile time after specialization of the code, it is more optimized, thereby making this approach much better than performing entire code optimizations at runtime. The template-based optimization approaches [8,18,19] are oriented towards performance improvement through specialization while keeping the code size to a minimum. The overhead of runtime specialization, especially for parameters whose values change frequently, and the selection of appropriate variables are the main issues in these approaches. These drawbacks reduce the scope (candidates) of specialization, and also limit the performance gain.

3. Context of feedback-directed specialization

The specialization of code is more effective if it is applied to hot blocks of code [2,3,6,7,19]. Therefore, the variables in these blocks are much better candidates than variables in other parts of the code. Although the impact of specialization on the compiler optimizations varies, these variables produce better results through iterative specialization. However, finding these variables requires feedback from profiling the code. Within a function, the execution frequency of the basic blocks can be found through vertex (block) profiling [20]. The usage count of the variables in the code, together with the execution frequency of the blocks, highlights good candidates for specialization. For the code in Fig. 1, block profiling makes the variables and the loop trip counts (e.g. numf1s) of the innermost loops good candidates for specialization. The variables which are not modified and have a higher usage count (including their dependents) may be specialized, and are more effective than other variables. This eliminates the need for an exhaustive search of candidate variables to be used for specialization.

The feedback-directed specialization approach (shown in Fig. 3) is based on the above intuition, and can be used to improve the performance of the application. It works by specializing code at both static and dynamic compile times. At static compile time, data-flow analyses coupled with vertex profiling select a subset of candidate variables for specialization. This is followed by the generation of templates through the specialization of the code for a variable. An optimal cache of template clones is then generated, which reduces the overhead of runtime activities. A template clone generated at static compile time can be used for other versions through very efficient runtime specialization, also termed instantiation of the template. The initial instantiation of template clones is performed at static compile time for the initial profiled values. The runtime instantiation is performed by the clone specializer, also generated at static compile time. The runtime specialization overhead is small since no analyses/optimizations need to be performed during execution. Moreover, a template clone requires a very small number of binary instructions to be modified during execution. This makes the performance equivalent to statically specialized code, with a small code size. The process of specialization is iterated for the selected subset of variables, as found after data-flow analysis and the heuristic using the vertex profile.

2 In the 8192 compilations performed, there were only 247 versions which differed in optimizations, i.e. only 3% of the code versions suffice for all possible specialized versions. This ratio further reduces for larger values.


Fig. 3. Feedback-directed specialization of code. (Pipeline: candidate code → static analysis and block profiling → candidate variables → template generation → generation of optimal cache of template clones → specializer generation and initial instantiations of the clones → final code (template + runtime specializer) → performance evaluation.)

4. Applying feedback-directed specialization of code

A major obstacle to applying specialization is the search for the candidate parameters that are most useful for specialization [6,21]. Many aggressive optimizations invoked by the compilers are related to integer variables. Even if specialization could work for other types, such as floating-point data (fixed values for partial evaluation) or structured types (for alignment, etc.), this article takes into account only integral parameters. This section describes the main steps that are performed for the feedback-directed specialization approach.

4.1. Specialization effectiveness

The candidate variables are ordered with respect to their appropriateness for specialization, also termed specialization effectiveness (SE). This ordering helps our approach to iterate the template generation within a limited amount of time. The SE value is calculated by taking into account the vertex profile and the instances of candidate variables together with their dependent variables. The candidates for specialization within a function may comprise global variables and input parameters that are not modified in any of its blocks. Assume that a function consists of blocks b1, b2, ..., bn. We can find the specialization candidates by analyzing the blocks. Let I represent the set of variables (global variables and input parameters) that represent information input to the function, and let K(bj) represent the set of variables modified in block bj. The set of candidate variables, V, is then given by

V = I − ( ⋃_{j=1}^{n} K(bj) ).   (1)
Subsequently, the computation of SE is performed through vertex profiling, for the variables in the set V, as found in Eq. (1). In order to perform vertex profiling, the code is instrumented at block level, and the profile provides the number of times the blocks in the control flow graph are executed. Let V = {V1 , V2 , . . . , Vm } be the set of m variables (found after data-flow analysis in Eq. (1)), and R = R1 , R2 , . . . , Rk be the set of k variables referenced in the blocks B = b1 , b2 , . . . , bn . We can now compute the SE through the algorithm given below.


Algorithm 1. Algorithm for calculating SE
Require: Set of variables V
Ensure: SE for the variables in the set V
for each candidate variable Vi ∈ V do
    Let SEi = 0
    for each block bj ∈ B do
        Let Freqj = execution frequency of block bj
        for each variable Rk ∈ R in block bj do
            if the statement containing Rk is a successor of the statement containing Vi in the DDG then
                Let RCk = reference count of Rk in bj
                SEi += Freqj ∗ RCk
            end if
        end for
    end for
end for

The algorithm is based on the intuition that a variable with a large number of dependents used in frequently executed parts of the code is a better candidate for specialization than other variables. An ordering of variables in terms of the SE metric may then be defined. This heuristic eliminates the need for an exhaustive search of the parameter to be used for specialization, and makes it possible to iterate specialization over a limited set of variables.

4.2. Template generation and cloning with optimal cache

The templates are generated by analyzing the object code generated by the compiler. The versions compiled for the set of profiled values are compared, and those versions which have similar register allocation, scheduling and optimizations, and differ only by some immediate constants at corresponding locations, are placed in a class. Any of the versions in a class can therefore be used as the template. Let T[X1^v, X2^v, ..., Xn^v] be a template obtained after specialization of the parameter with value v, where X1^v, X2^v, ..., Xn^v are the immediate constants. These constants of the template version differ from other versions at n locations, whose values are based on n arbitrary functions, so that Xi^v = fi(v), ∀i = 1, ..., n. Since the versions of a class differ only by immediate constants, the functions fi(v), ∀i = 1, ..., n, are similar in all these versions. This ensures that only data at specified instructions of the template needs to be modified to obtain another version.
Moreover, the functions fi do not need to be computed at runtime, since the required runtime data (to be inserted at runtime) can be obtained at static compile time for the profiled values of the specialized parameter. We define the instantiation I of the template with a value vnew to be a transformation which adapts the template to another version:

I(T[X1^v, X2^v, ..., Xn^v], vnew) = T[X1^vnew, X2^vnew, ..., Xn^vnew],   (2)

where Xi^vnew = fi(vnew) and Xi^v = fi(v), ∀i = 1, ..., n.

If the runtime values change with large frequency, the instantiations become costly, mainly due to the modification of data at specific locations of the template. This drawback reduces the impact of the optimizations obtained after static specialization of the code. We address this issue through a software cache of template clones. Given a set of p profiled values, v1, v2, ..., vp, we generate clones of the template, T^i[X1^v, X2^v, ..., Xn^v], ∀i = 1, ..., Copt, where Copt is the optimal size of the cache of template clones that can be obtained for the profiled values. Intuitively, the optimal cache size represents the threshold (number of templates) beyond which an increase in cache size does not improve the cache hits to a large extent for the given profiled values. The computation of the optimal cache size reduces the possible penalties of instruction cache/TLB misses that could otherwise occur due to a large number of template clones in the cache. These penalties might reduce the performance gain, or even result in a slowdown of the application. Given a set of n cache sizes, the optimal cache size Copt is found by taking into account the cache hits under a least frequently used (LFU) replacement policy. Let CHi represent the number of cache hits with a cache of size i, and let p be the number of profiled values; the optimal cache size is computed as

Copt = min(i)   where   ((CH_{i+1} − CH_i) ∗ 100) / p < 1,   ∀ 1 ≤ i ≤ n.   (3)

The Copt number of clones are generated by specializing the code with similar values (for every clone) to represent different versions. The object code of these template clones is then analyzed to generate a specializer that would instantiate the clones during execution of the application.


Fig. 4. Runtime specializer and template code snippets.

For the initial profiled values, the runtime instantiations can be avoided by instantiating the template clones at static compile time through the transformation

I(T[X1^v, X2^v, ..., Xn^v], vi) = T^i[X1^{vi}, X2^{vi}, ..., Xn^{vi}],   ∀i = 1, ..., Copt, where 1 ≤ Copt ≤ p.   (4)

The final code therefore comprises the (statically instantiated) template clones, a specializer, and the standard code as a fallback. The specialization and template generation are iterated for the topmost variables (having larger SE values than others) to find the code with the best performance.

5. Implementing feedback-directed specialization for efficient instantiations of template clones

Value profiling [7] needs to be performed for the selected variables3 to specialize the code with the profiled values. The code is instrumented to capture the values of the selected variables at routine level. The feedback-directed specialization approach then proceeds as follows.

5.1. Generation of templates

The code is specialized with the profiled values, and object code analysis is performed to generate a template that can work for a large set of values. During analysis, the object code of the specialized versions is compared to find equivalent versions which meet the coherency conditions. The equivalent specialized code versions which differ (at corresponding locations) in immediate constants are placed in a class, as described in Section 4.2. The first version of the class is then selected as the template. The instructions of the template which differ in other versions of the class are termed candidate instructions. These object code instructions (in the template clones) will be modified at runtime by an efficient runtime specializer. The runtime instantiation therefore adapts a single template to different values of the specialized variable.

5.2. Generating optimal cache of template clones

The optimal cache is found through a cache simulator which incorporates the LFU replacement strategy. The specialized template clones of optimal software cache size are then generated, which can be instantiated with a dynamic specializer. The code is also instrumented with a call to a small wrapper (shown in Fig. 5) that redirects execution control to the cache manager (containing the specializer invocation) and the proper specialized clone, or falls back to the standard code.

5.3. Generation of specializers and data for instantiations

The overhead of runtime instantiation is reduced through the use of statically specialized data (i.e., data already computed at static compile time). For each candidate instruction (found in the previous step), this specialized data can be obtained from the specialized object code versions by comparing versions and extracting immediate constants, as described in Section 4.2. The specialized data approach eliminates the need to perform any computations at runtime. The runtime specializer contains the self-modifying code that will modify instructions of the template clones during execution. It is provided with the information regarding the locations of binary instructions. This information can easily be gathered for each candidate instruction in the specialized object code versions. For the example considered (Figs. 1 and 2), the template clone containing the slots X0, X1, and X2 and the runtime specializer code are shown in Fig. 4. Each instruction is specialized at runtime by inserting elements (Data[0], Data[1], and Data[2]) of static data at the corresponding slots. Since these values are extracted from the object code versions at static compile time (as described earlier in this section), the overhead incurred at runtime is reduced to storing values in the processor cache.

3 The limited set of topmost variables having large SE.


Fig. 5. Wrapper code and cache management for template clones.

Table 1
Configuration of the architectures used.

Processor                | Speed (GHz) | Compilers
Intel IA-64 (Itanium-II) | 1.5         | gcc v 4.3, icc v 9.0 with -O3
Intel Pentium-4 (R)      | 3.20        | gcc v 4.3, icc v 8.1 with -O3

Table 2
Summary of overhead reduction through optimal cache.

Benchmark  | Function          | Max. cache | Optimal cache | ORP
164.gzip   | send_bits         | > 20       | 13            | 99.98
175.vpr    | expand_neighbours | > 20       | 1             | 0
181.mcf    | primal_bea_mpp    | > 20       | 1             | 0
197.parser | hash              | > 20       | 6             | 84.09
254.gap    | NewBag            | > 20       | 12            | 88.27
256.bzip2  | hbAssignCodes     | 5          | 5             | 100
300.twolf  | term_newpos_a     | 2          | 2             | 100
177.mesa   | sample_linear_1d  | 10         | 6             | 99.15
179.art    | train_match       | 2          | 1             | 0
183.equake | smvp              | 1          | 1             | 0
5.4. Generating cache management code for the specializer

The code to manage the cache (Fig. 5, on the left) of template clones at runtime is generated. It uses the LFU replacement policy to select the cache slot for which the instantiation is performed if the new value is not found in the template cache. The specializer for the selected slot is then invoked to update the instructions of the corresponding template clone, adapting it to the new runtime value. In addition, a wrapper (Fig. 5, on the right) is generated which contains the code to redirect control to the cache manager, the specialized code, and the original code as a fallback.

5.5. Instantiation of template clones

Since the data required for the instantiations are already computed, the instantiations of the template clones are very efficient. At static compile time, the initial instantiations can be performed to reduce the number of runtime instantiations. The static instantiations of the template clones are performed for the initial set of profiled values: the data array values are inserted at the specific slots of the template clones. This instantiation makes the performance of the code equivalent to statically specialized code. The runtime instantiations are required when new values arrive for which the instantiation was not already performed. This runtime activity is limited to storing pre-computed immediate constants at the specific slots of the template clones.


Fig. 6. Cache behavior of SPECINT benchmarks with cache size on x-axis and the percentage of cache hits (in gray) and misses (in black) on y-axis. (a) 164.gzip benchmark. (b) 175.vpr benchmark. (c) 197.parser benchmark. (d) 256.bzip2 benchmark. (For interpretation of the reference to colour in this figure legend, the reader is referred to the web version of this article.)

Through this approach, both instrumentation and specialization are performed entirely at static compile time, and therefore the invocation of the instantiated clones incurs only the overhead of cache management. This means that, with the minimum possible code size increase and minimum overhead, we are able to obtain a large number of optimizations due to the specialization performed at static compile time.


Fig. 7. Performance results of SPEC CPU2000 benchmarks. The speedups vary depending upon the compiler optimizations. (a) Speedup percentage with icc compiler. (b) Speedup percentage with gcc compiler.

6. Experimental results

The HySpec framework [8,19] has been modified to implement the steps that are necessary to perform feedback-directed specialization. It provides a flexible, automated4 approach for specializing integral variables. This section presents the results of feedback-directed specialization applied to the SPEC CPU2000 and FFTW3 benchmarks. The experiments have been performed for the top five variables w.r.t. their SE, using the platforms whose configuration is given in Table 1.

6.1. Cache behavior

We have used 20 as the threshold value for the maximum cache size of the template clones. Table 2 gives the optimal cache and the overhead reduction percentage (ORP), computed as 100 ∗ (1 − Number of instantiations with feedback-directed cache / Number of instantiations without cache), together with the maximum cache (the cache size required for reducing the cache misses to zero) for some hot functions in the SPEC benchmarks using the icc compiler on the IA-64 architecture. For the 164.gzip, 197.parser, 254.gap, 256.bzip2, 300.twolf, and 177.mesa benchmarks, the ORP is large; for the remaining benchmarks it is very small, since the cache hits/misses at the optimal cache size either become constant or do not vary largely. Fig. 6 depicts the behavior of the cache of template clones for some functions of the SPEC benchmarks given in Table 2. The benchmarks 256.bzip2, 300.twolf, 179.art, and 183.equake all require a cache which produces 100% hits with a small cache size. On the contrary, the benchmarks 164.gzip, 175.vpr, 197.parser, 254.gap, and 177.mesa all continue to produce cache misses even with the threshold cache size of 20.

4 The functions and compilation parameters are specified in a configuration file.


Fig. 8. Speedup to size increase ratio (SSIR) for SPEC benchmarks. SSIR metric shows that with small code size increase we can obtain good performance for many benchmarks. (a) icc compiler. (b) gcc compiler.

6.2. Performance results of SPEC benchmarks The speedup percentage obtained for different SPEC benchmarks5 with reference inputs is shown in Fig. 7. The performance of the code is mostly dependent on the optimizations invoked by the compilers. In the 179.art benchmark, the code benefits from loop optimizations including unrolling and data prefetching which resulted in good performance on IA-64 architecture. Similarly, for Pentium-IV, the loop optimizations resulted in better performance of the application as compared to un-specialized (original) code. The 177.mesa and 183.equake benchmarks benefit mainly from partial evaluation, however, the compilers could not perform aggressive optimizations for a large part of the code. The hot regions of code in the 188.ammp benchmark contain variables which do not fulfill the required criteria of specialization, and for the remaining part of the code, there is a very small speedup. Loop optimizations and partial evaluation for the 164.gzip benchmark are performed on Pentium-IV by the compilers, however, on Itanium-II, the compilers generate code with similar unroll factor for both the un-specialized and the specialized versions. In the 175.vpr, 181.mcf, 197.parser, and 254.gap benchmarks, the lack of integer variables in the hot functions results in a very small impact on the optimizations performed after specialization. Similarly, in the 300.twolf benchmark, the specialized code is optimized with reduced number of loads and cache prefetches on IA-64, whereas, on Pentium-IV, the optimizations for the specialized code are very similar to those for the un-specialized code. The specialization of code in the 256.bzip2 benchmark improves performance on IA-64 and Pentium-IV using the gcc compiler. The loop optimizations and partial evaluation of code are performed to obtain significant improvement. 
Using the icc compiler, the loop-based optimizations were similar to those for un-specialized code; however, partial evaluation produced a small improvement in performance. The speedup to size increase ratio (SSIR) for the SPEC benchmarks, calculated as

SSIR = (Speedup ∗ Size of standard unspecialized code) / (Size of code after feedback-directed specialization),

is given in Fig. 8. The SSIR is large for the 179.art benchmark. However, for the other benchmarks, even with good speedup, the SSIR value is not very large. In most of the cases producing speedup, the size of the standard code is small, and the addition of the specialized clones and the runtime specializer code thereby reduces the SSIR factor.

5 The HySpec parser supports only the C language.


Fig. 9. Performance results of FFTW. (a) Speedup percentage with icc compiler. (b) Speedup percentage with gcc compiler.

Fig. 10. Speedup to size increase ratio (SSIR) for DFT computation. (a) icc compiler. (b) gcc compiler.


Fig. 11. Reduction factor (RF), obtained for different codelets. RF metric shows the efficacy of our approach over manual exhaustive specialization. RF = 7 means that manually specialized codelet size was 7 times larger than our specialized code. (a) icc compiler. (b) gcc compiler.

6.3. Performance results of FFTW benchmark

The FFTW library [22,23] contains C routines, called codelets, to compute the discrete Fourier transform (DFT) of real and complex data. It uses the best combination of codelets (called wisdom) to compute the DFT of arbitrary input size in O(n log n). Figs. 9 and 10 show, respectively, the speedup and the SSIR values obtained for calculating DFTs of powers of 2. The calculation of a single DFT may comprise repeated invocations of multiple codelets. The speedup varies depending upon the compiler optimizations for the specific processor. For smaller values and smaller code sizes, the compilers tend to produce better code with more optimizations. Since large codelets are invoked for computing DFTs of large sizes, the compiler-generated code is less optimized. The FFTW library also contains many versions of already specialized codelets. These versions are specialized with stride values to improve the performance. Fig. 11 shows the reduction factor (RF), which is calculated as Size of already specialized codelets / Size after feedback-directed specialization. This metric provides a comparison between our approach and static specialization. The RF obtained is large for almost all the codelets, and it increases for codelets with large radix since they have a large code size after static specialization.

6.4. Runtime overhead

A summarized view of the overhead with respect to application execution time is presented in Table 3. The runtime overhead is small since a static data array is incorporated. With this approach, the values are calculated at static compile time (for the profiled values of the specialized variable), and therefore no runtime computation needs to be performed for immediate values. On average, the overhead of modifying data at precise locations on Itanium-II and Pentium-IV is eight cycles6 and two cycles, respectively.

6 On Itanium-II, different binary instruction formats require extraction of different bit-sets.


M.A. Khan / Computer Languages, Systems & Structures 36 (2010) 2 -- 15

Table 3
Avg. specialization overhead w.r.t. execution time.

             IA-64                 Pentium-IV
Benchmark    icc (%)   gcc (%)     icc (%)   gcc (%)
SPEC         2.04      2.11        1.98      2.26
FFTW         3.97      3.31        3.91      3.23

Table 4
Performance comparison of the feedback-directed specialization approach with template-based generalized specialization.

                                    IA-64                         Pentium-IV
                                    icc           gcc             icc           gcc
Approach                            Max.   Avg.   Max.   Avg.     Max.   Avg.   Max.   Avg.
Feedback-directed specialization    1.25   1.04   1.58   1.09     1.12   1.03   1.19   1.04
Generalized specialization          1.12   1.02   1.27   1.05     1.08   1.02   1.14   1.03

7. Related work

Our previous work on the template-based generalized specialization approach [8] provides an efficient way to specialize code while keeping the code size small. A comparison between the feedback-directed approach and the generalized approach on the SPEC benchmarks is provided in Table 4. In the generalized specialization approach, the lack of analysis at static compile time and the runtime overhead limit the gain after specialization. For the SPEC benchmarks, it improves performance by up to 27% over un-specialized code. In addition, the runtime overhead incurred during execution reduces the number of candidate functions that might benefit from specialization. The feedback-directed specialization, in contrast, improves the performance gain by using analyses at static compile time. It minimizes the runtime overhead through a cache of template clones which can be instantiated at static compile time. For the SPEC benchmarks, it results in much better performance, attaining up to 58% improvement over un-specialized code. For both approaches, however, the performance depends upon the optimizations performed by the compiler for the specific architecture.

The Dynamo [24] system developed at HP Labs takes as input the native instruction stream and optimizes code during execution without requiring programmer intervention. It identifies hot traces of code, generates an optimized equivalent for each hot trace, and stores it in a fragment cache. However, the size of the software cache cannot be fixed, and addition/removal of code from the cache incurs overhead. The feedback-directed specialization works more efficiently, since its runtime overhead is reduced to storing limited data into a cache.

The ADORE [2] runtime optimization system makes effective use of performance monitoring hardware to perform data cache prefetching at runtime. A parallel thread performs phase detection and selects the hot traces to optimize the code.
The traces are selected by sampling hardware counters, followed by detection of data reference patterns and locality; the prefetches are then inserted into the code. In contrast, our specialization approach makes the code a candidate for a large number of optimizations at static compile time. The compilers are then able to insert more accurate prefetches if the related stride values become available.

Similarly, the Trident framework [1] performs basic compiler optimizations together with value specialization. The frequently executed branches are profiled, and subsequently events are generated to perform optimizations of hot traces of code. The value specialization is then performed by copy-propagating the invariants. The extensive analysis and runtime code generation activities make it suitable for code having a large number of invariants.

In work related to code specialization, the C-Mix [25] and Tempo [17,26] specializers perform partial evaluation of code. The C-Mix partial evaluator analyzes the code and makes use of specialized constructs to perform partial evaluation. Similarly, the Tempo specializer performs partial evaluation of code at static and dynamic compile times. The partial evaluation at static compile time is useful only for the constants in the code; it does not help in cases where the input data set is obtained during execution. The runtime specialization by Tempo is done by invoking the gcc compiler. The large overhead incurred due to analysis and code generation makes it suitable for situations where multiple invocations of code show invariant behavior. In contrast, the feedback-directed specialization approach obtains optimizations at static compile time and works for code where the values become known only at runtime. The template used in our approach requires modification of a limited set of binary instructions.
A code/data patching approach described in [27] is used to reduce the interpreter dispatch overhead by patching constants and jump target addresses. The fragments of code extracted from an interpreter can then be used in a native compiler, and portability is thereby also achieved. It is similar to our framework of feedback-directed specialization, which provides a platform-independent approach to generate the runtime specializers and the templates. However, the overhead incurred through our runtime instantiation is only related to storing data at cache locations, avoiding the need for copying any code fragments.

Many other specialization systems [8,10,19,28–30] perform runtime optimizations with computations and/or time-consuming code generation. Although some of them adopt a declarative approach, there is always a need for analysis or computations to be performed during execution. These systems differ from feedback-directed specialization, where analyses and computations are entirely offline. Most of the other dynamic code generation and/or optimization systems discussed above, and others suggested in [11–13,31,32], do not make use of generic templates to reduce the code size increase after specialization.


8. Conclusion and future work

The feedback-directed specialization approach focuses on improving the overall application performance, achieving static specialization with a small code size increase through a lightweight dynamic specialization mechanism. It makes use of static analysis to select the candidate variables for specialization of code. Subsequently, a heuristic based on vertex (block) profiling is used to estimate the gain after specialization. The best candidates are then iteratively specialized to generate versions, and an analysis of the object code of these versions is used to generate the templates. These templates are highly optimized due to the specialization performed at static compile time, and can be used for a large range of the specialized parameter values. The runtime instantiation of the templates is very efficient and requires modification of a small number of binary instructions during execution.

This approach improves the performance of applications with a limited increase in code size. For the SPEC and FFTW benchmarks, the feedback-directed specialization provides significant performance improvement. It is a step towards a fully automated specialization compiler that would take as input directives from the specialization language, together with transformations that can be applied to the loops in specialized blocks of the code.

References

[1] Zhang W, Calder B, Tullsen D. An event-driven multithreaded dynamic optimization framework. In: PACT-05; 2005.
[2] Lu J, Chen H, Yew P-C, Hsu W-C. Design and implementation of a lightweight dynamic optimization system. Journal of Instruction-Level Parallelism 2004;6:1–24.
[3] Arnold M. Online profiling and feedback-directed optimization of Java; 2002. URL citeseer.ist.psu.edu/article/arnold02online.html.
[4] Barreteau M, Bodin F, Brinkhaus P, et al. OCEANS: optimizing compilers for embedded applications. In: Euro-Par'98; 1998.
[5] Pettis K, Hansen RC. Profile guided code positioning. In: PLDI '90: proceedings of the ACM SIGPLAN 1990 conference on programming language design and implementation. New York, NY, USA: ACM Press; 1990.
[6] Muth R, Watterson SA, Debray SK. Code specialization based on value profiles. In: Static analysis symposium; 2000. URL citeseer.ist.psu.edu/muth00code.html.
[7] Calder B, Feller P, Eustace A. Value profiling. In: MICRO-30; 1997. URL citeseer.ist.psu.edu/calder97value.html.
[8] Khan MA, Charles H-P, Barthou D. An effective automated approach to specialization of code. In: LCPC-07, Urbana, Illinois; October 11–13, 2007.
[9] Zhou H, Conte TM. Code size efficiency in global scheduling for ILP processors. In: INTERACT; 2002. p. 79.
[10] Poletto M, Hsieh WC, Engler DR, Kaashoek FM. 'C and tcc: a language and compiler for dynamic code generation. ACM TOPLAS 1999;21:324–69.
[11] Grant B, Mock M, Philipose M, Chambers C, Eggers SJ. Annotation-directed run-time specialization in C. In: PEPM'97. USA: ACM Press; 1997. URL citeseer.ist.psu.edu/63006.html.
[12] Piumarta I. CCG: dynamic code generation for C and C++. Technical Report 25, INRIA Rocquencourt; October 2003.
[13] Grant B, Mock M, Philipose M, Chambers C, Eggers SJ. DyC: an expressive annotation-directed dynamic compiler for C. Technical Report, Department of Computer Science and Engineering, University of Washington; 1999.
[14] Leone M, Lee P. Optimizing ML with run-time code generation. Technical Report, School of Computer Science, Carnegie Mellon University; 1995.
[15] Meloan S. The Java HotSpot performance engine: an in-depth look. Technical Report, Sun Microsystems; 1999.
[16] Consel C, Danvy O. Tutorial notes on partial evaluation. In: ACM symposium on principles of programming languages; 1993. p. 493–501.
[17] Consel C, Hornof L, Marlet R, et al. Tempo: specializing systems applications and beyond. ACM Computing Surveys 1998;30:19–24. URL citeseer.ist.psu.edu/consel98tempo.html.
[18] Khan MA, Charles H-P, Barthou D. Improving performance of optimized kernels through fast instantiations of templates. Revised version of the CPC-2007 article "Improving performance of optimized kernels through fast template-based specialization". Concurrency and Computation: Practice and Experience 2008;21:59–70.
[19] Khan MA, Charles H-P, Barthou D. Reducing code size explosion through low-overhead specialization. In: INTERACT-2007, Phoenix; 2007.
[20] Knuth D, Stevenson F. Optimal measurement points for program frequency counts. BIT 1973;13:313–22.
[21] Suganuma T, Yasue T, Debray S, et al. A dynamic optimization framework for a Java just-in-time compiler. In: OOPSLA'01. USA: ACM Press; 2001.
[22] Frigo M, Johnson SG. The design and implementation of FFTW3. Proceedings of the IEEE 2005;93(2).
[23] Frigo M, Johnson SG. FFTW: an adaptive software architecture for the FFT. In: Proceedings of the IEEE international conference on acoustics, speech, and signal processing, vol. 3, Seattle, WA; 1998. URL citeseer.ist.psu.edu/frigo98fftw.html.
[24] Bala V, Duesterwald E, Banerjia S. Dynamo: a transparent dynamic optimization system. ACM SIGPLAN Notices 2000;35(5):1–12. URL citeseer.ist.psu.edu/bala00dynamo.html.
[25] Makholm H. Specializing C: an introduction to the principles behind C-Mix. Technical Report, Computer Science Department, University of Copenhagen; June 1999.
[26] Noël F, Hornof L, Consel C, Lawall JL. Automatic, template-based run-time specialization: implementation and experimental study. In: ICCL'98; 1998.
[27] Ertl MA, Gregg D. Retargeting JIT compilers by using C-compiler generated executable code. In: PACT'04. Washington, DC, USA: IEEE CS; 2004.
[28] Leone M, Lee P. Dynamic specialization in the Fabius system. ACM Computing Surveys 1998;30:133–38.
[29] Leone M, Dybvig RK. Dynamo: a staged compiler architecture for dynamic program optimization. Technical Report, Indiana University; 1997.
[30] Leone M, Lee P. A declarative approach to run-time code generation. In: Workshop on compiler support for system software (WCSSS); 1996. URL citeseer.ist.psu.edu/leone96declarative.html.
[31] Engler DR, Proebsting TA. DCG: an efficient, retargetable dynamic code generation system. In: ASPLOS-94, San Jose, California; 1994. URL citeseer.ist.psu.edu/engler94dcg.html.
[32] Consel C, Hornof L, Noël F, Noyé J, Volanschi N. A uniform approach for compile-time and run-time specialization. In: International seminar on partial evaluation, Dagstuhl Castle, Germany. Berlin: Springer; 1996. URL citeseer.ist.psu.edu/consel96uniform.html.