Source level merging of independent programs


Yosi Ben Asher, Moshe Yuda
CS Department, Haifa University, Israel

Article history: Received 31 July 2007; received in revised form 10 December 2008; accepted 9 February 2009; available online 4 March 2009.
Keywords: Compilers; Source-level merging

Abstract

Here we describe a technique for merging two (and hence more) independent C programs at source level. Due to the independence of the programs, the merged program has more parallelism that can be extracted by the underlying compiler and CPU. It is therefore expected that the execution time of the merged program will be shorter than the time obtained by executing the two programs separately. The usefulness of such merging for embedded systems has been studied and demonstrated in the works of Dean and others with the Thrint compiler for merging threads at assembly level. The main contribution of this work is an efficient algorithm for matching sub-components that considers the internal structure of the sub-components and not only their execution frequency. Two novel techniques for balancing the merge of sub-components are presented:

• Residual loop merging (RLM) as a way to merge loops with different nesting and execution frequency levels.

• Using the remaining iterations formed after merging two non-equal loops (loops with different numbers of iterations) in future mergings of other loops.

These two abilities allow the proposed algorithm to simplify the matching process and to overcome merging problems related to deeply nested structures. We also consider the problem of merging function calls and make extensive use of cloning (and not only inlining, as was the case in previous works). The final tool is the first complete system for merging C programs at source level supporting both profile-based and structure-based matching. The main use of merging is to speed up embedded systems, which usually execute independent threads or processes that can potentially be merged. Our experimental results suggest that the proposed merging technique can speed up the execution of two independent programs by 10%–20% for about half of the mergings tested.

1. Introduction

In this paper we study how to merge, at source level, two concurrent programs executed by one CPU into a single program executed by the same CPU. The goal is to increase the instruction-level parallelism (ILP) of the merged program compared to a separate execution of the two programs. The term ``merge'' indicates that structures from both programs are actually mixed, as opposed to just executing them one after the other or concurrently. We assume that the two programs (denoted Pa and Pb) are independent processes that can be safely merged into a single program (denoted Pab). The merging itself is done at source level, attempting to fuse loops and other statements from Pa and Pb. Unlike loop transformations, and in particular loop fusion [1],




merging can be done in any order, as Pa and Pb are independent. This type of merging can be useful for superscalar machines, and in particular for the VLIW and DSP cores used in embedded systems. In addition to improving ILP and other parameters, merging at source level allows the programmer to continue editing and debugging the merged program. The improvements in the scheduling of Pab may include: eliminating pipeline stalls, hiding memory latencies, filling branch delay slots and generating larger VLIW instructions. Apart from scheduling, merging can be used to reduce the number of branch instructions (using one loop instead of two). There is also an obvious benefit in eliminating the context switches involved in the concurrent execution of Pa and Pb. Merging (denoted M(Pa, Pb)) is not always guaranteed to be successful. Some known limiting factors indicated in previous works are code expansion and register pressure. However, there are additional factors that are even more significant:

• The underlying compiler can fail to detect the parallelism exposed by the merging.


Fig. 1. An example demonstrating the complexity of source level merging.

For example, assume that we are merging the loop for (; p->next != NULL; p = p->next) {... p ...} with the loop for (; q->next != NULL; q = q->next) {... q ...}; the underlying compiler may fail to detect that the references through p and q are disjoint (non-aliasing). It is thus important to pass mutual-independence information to the underlying compiler. Unfortunately, current compilers hardly support this: only absolute independence, through the ``restrict'' keyword, is supported, and not mutual independence as proposed in [11].¹
• Pa and/or Pb can already have a relatively large degree of parallelism in their loops, filling most of the VLIW slots of the CPU. In this case merging such loops will not be as beneficial, and the ratio time(Pa + Pb)/time(Pab) will be close to one.
• The merged loops can produce more cache misses than the sequential or concurrent execution of Pa and Pb. Assume that we merge two loops, one accessing an array A[i] and the other accessing B[i], and that for each i, A[i] and B[i] are mapped to the same cache line. Consequently, each access to A[i] in the merged program may evict B[]'s elements from the current cache line, and vice versa for an access to B[i]. Thus every access to an array in Pab results in a cache miss (assuming a non-associative cache). One option is to change the start address of A[] by inserting a random number of dummy variables before its declaration, so that the probability that A[i] and B[i] fall into the same cache line is small.

A successful merging of Pa and Pb depends on the ability to match and merge suitable sub-components (inner loops, then-parts, else-parts and function calls) of Pa and Pb, overcoming different nesting levels, conditional statements, function calls and different execution frequencies.² Finding a good matching of sub-components in Pa and Pb is the main difficulty addressed by the proposed scheme. Fig. 1 contains an example for which an efficient merging requires non-trivial considerations. Here a good merging must overcome the following problems:

• The ability to split the 300 iterations of Sa between Sb's for-loops of the if-statement and the last for (i = 0; i < 100; i++) Lb5 for-loop.
• It is not known in advance which of the two for-loops of Sb's if-statement will be executed. Thus the remainder of Sa's iterations that is merged with the last for-loop (for (i = 0; i < 100; i++) Lb5) must be dynamic.

Let M(S1, S2) denote the merging operation between two components S1 and S2. The solution (Fig. 2) is to use dynamic remainders (R), allowing the merging M(Sa, Sb) of Sa's remaining iterations with the last loop of Sb, as sketched below.

Fig. 2. Using dynamic remainders for the merging of Fig. 1.
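Since the code of Figs. 1 and 2 is not reproduced here, the following is a minimal hedged C sketch of the dynamic-remainder idea. The loop bodies La, Lb3, Lb4, Lb5 and the bounds N1, N2 of Sb's two conditional loops are assumptions; only the 300 and 100 iteration counts come from the text.

    int i = 0;                          /* Sa's index, live across all fragments */
    if (cond) {                         /* Sb's if-statement (condition assumed) */
        for (int j = 0; j < N1; j++) {  /* then-loop merged with part of Sa      */
            Lb3(j);
            if (i < 300) La(i++);
        }
    } else {
        for (int j = 0; j < N2; j++) {  /* else-loop merged with part of Sa      */
            Lb4(j);
            if (i < 300) La(i++);
        }
    }
    for (int k = 0; k < 100; k++) {     /* Sb's last loop merged with the        */
        Lb5(k);                         /* dynamic remainder of Sa: how much of  */
        if (i < 300) La(i++);           /* Sa is left is known only at run time  */
    }
    for (; i < 300; i++) La(i);         /* iterations of Sa still left over      */

The remainder is ``dynamic'' precisely because the live index i, rather than a compile-time constant, records how many of Sa's iterations have already been consumed.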

In this work we explore several ways of ``forwarding remaining iterations'' as a means to balance the iterations when merging sub-components of two different programs. In conclusion, a complete source-level merging scheme should address the following components:

• A specific merging scheme for each combination of statement types (e.g., merging if (E) S1 else S2 with for (E1; E2; E3) if (E4) S4). The proposed scheme covers all basic types of statements (loops, if-statements and function calls) and also considers special combinations for which optimized mergings exist (e.g., merging two function calls).
• Good heuristics and algorithmic techniques for optimized matching of sub-components. Special measures combining nesting levels, if-then-else structure and execution frequencies have been developed.
• A way to balance the expected size and number of iterations of the loops being merged (usually handled by loop unrolling). This includes loop unrolling and loop splitting.
• The ability to use the ``remaining iterations'' of merged loops in future mergings of other sub-components. Using dynamic remainders in different forms is the main focus of the proposed system. In particular, we use dynamic remainders as a way to merge nested loops with different nesting levels and execution frequencies.

We remark that merging can also be done at assembly/RTL level by ``shuffling'' instructions from the two programs. This approach was pursued by the Thrint compiler [5,7]. Though shuffling instructions in the compiler's scheduler is potentially simpler than source-level merging, it does not seem to handle the structural matching considerations that can be handled at source level. In addition, loop indexes, array references and type information are usually not available at assembly level. We have therefore selected source-level merging as the main technique for the proposed tool. Note that the Thrint compiler did not perform full scheduling-based merging; it mainly filled empty gaps caused by context switches and synchronization.

¹ There are other flags that can be used to mark independence, such as -fargument-noalias in GCC, indicating that function arguments cannot alias; however, it is currently not possible to pass non-alias information of the kind produced by merging to the compiler's scheduler.
² The tool includes a profile mechanism that counts how many times each basic block and each loop are executed.

2. Background

One of the first works on merging techniques is [2], showing how concurrent threads can be merged to form one sequential program. [2] used merging to obtain a correct sequential version of a parallel program so that it could be debugged sequentially. The main result of that work is a recursive merging technique for mutually dependent while-loops (through shared variables); the merging method of [2] was based on a recursive application of a loop merging technique. In a sequence of works, Dean et al. developed a thread merging technique called STI (Software Thread Integration) for interleaving multiple threads at assembly level [5,7]. The resulting compiler (Thrint) recursively applies a simple loop merging technique while preserving the real-time constraints of events in Pa and Pb. A major goal of STI in Thrint is to fill ``dead'' waiting times of one thread with useful code of another thread. The Thrint compiler is likewise based on a recursive application of the Basic Loop Merging (BLM) technique, adjusted to handle time constraints. More recently, Dean et al. studied the potential of source-level merging to increase ILP.


Fig. 3. BLM merging with nested loops.

The first work [14] shows that the ILP level can be improved by merging two cloned procedures [4]. The second work [15] presents a source-level STI scheme based on loop transformations [1]. The goal is to improve the efficiency of the modulo scheduling stage of the underlying compiler. This framework combines loop fusion, loop unrolling, loop splitting and loop peeling [1] to merge loops so that their iteration ranges are balanced. It seems that no complete automatic tool was implemented. An important contribution of Dean's works is the evaluation of STI for specific embedded-systems applications, including TI's Image/Video libraries [15], JPEG compression [14] and refreshing an LCD in software [7]. Dean also showed that STI can be used to replace hardware units of embedded systems by software threads [6]. The main contributions of the method proposed here are the following:

• Residual loop merging (RLM) as a way to merge loops with different nesting and execution frequency levels. Thus we are able to activate the merge on programs with much more complex structure than those tested in previous works. The usefulness of RLM has been verified experimentally, yielding successful mergings that could not have been obtained otherwise (i.e., by the methods used in previous works).
• Forwarding remaining iterations of merged loops and function calls to be merged with other sub-components. In previous works the range of remainders was fixed and not dynamic, as is the case in the proposed scheme. In addition, in previous works remainders could be forwarded only to the ``next'' loop, compared to the larger degree of flexibility (up/down the AST and splitting) in the proposed scheme.
• An efficient algorithm for matching sub-components that considers the internal structure of the sub-components and not only their execution frequency, as was the case in previous works.
• A complete, fully automated tool for source-level merging. Unlike Thrint, loops whose iteration range is not bounded (such as while-loops) are merged.
• Using both inlining and cloning to merge function calls, according to the expected benefit.
• Using filtering rules to avoid harmful mergings of sub-components, such as creating unbalanced if-then-else statements that cannot be predicated by the underlying compiler.

Finally, we include a discussion comparing software merging with its hardware analogue, SMT [12]. Simultaneous multithreading (SMT) is an emerging superscalar architecture allowing several independent processes to dynamically issue instructions in the same cycle. If, in a given cycle, a certain process is not using a

resource, that resource can be used by the next instruction of another active process. Merging can be viewed as a compilation-technique analogue of SMT: by merging processes, the underlying compiler and CPU trade explicit parallelism for implicit ILP. Note that software merging can use a larger scope of analysis than the hardware scheduler of an SMT CPU, which makes its decisions over a small window of instructions. In addition, software merging can be used to increase the size of processes and thus possibly improve the efficiency of SMT CPUs. Assume that we have six programs that can be merged, while our SMT CPU has only two hardware contexts. We can tune the merging so that it optimally generates two programs, selecting the best partition of the six programs into two subsets of three programs each.

3. Residual loop merging (RLM)

Here we describe residual loop merging (RLM), a technique for forwarding remaining iterations between consecutive activations of an inner loop when merging loops with different nesting levels and different execution frequencies. We first describe the basic form of loop merging (BLM), extensively used in previous works. In this form nested loops are merged following the nesting levels, i.e., the two external loops are merged and then, recursively, their bodies. Note that in the example of Fig. 3, Pa and Pb have the same nested structure, and if the execution frequencies match (na1 = nb1, na2 = nb2) then BLM will increase ILP significantly (A[i + k]--; and B[j + l]--; can be executed in parallel). The BLM technique can fail to increase ILP if this condition of structural similarity and equal sub-execution frequencies is not met. In the example of Fig. 3, if na1 = 10, nb1 = 100 and na2 = 10, nb2 = 100, then BLM will increase ILP by only 10% in spite of the large total number of iterations. Another limiting factor is the case where the iteration range of an inner loop changes, e.g., if the inner loop of Pa in Fig. 3 is for (k = 0; k < na2 + i; k++) A[i + k]--;. In particular, conditional execution of an inner loop will defeat BLM-based merging, e.g., if Pa's inner loop is conditioned by the check if (A[i] > 10). In general, BLM is effective when there is a good matching between the structure and frequencies of all nested loops, but it may fail otherwise. In this work we make extensive use of another form of basic merging called ``residual loop merging'' (RLM). The basic case of RLM is when Pa is a nested loop and Pb contains a single loop. In this case applying BLM implies that Pb is merged with the external loop of Pa, whereas with RLM Pb can be merged with one of the inner loops of Pa, as depicted in Fig. 4 and sketched below.
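For concreteness, here is a minimal hedged sketch of the two forms on the shapes of Fig. 3; the single-loop Pb of the RLM case, with bound nb, is an assumption.

    /* BLM: outer merged with outer, inner with inner (effective when
     * na1 = nb1 and na2 = nb2; remainder loops for unmatched
     * iterations are omitted). */
    for (i = 0, j = 0; i < na1 && j < nb1; i++, j++)
        for (k = 0, l = 0; k < na2 && l < nb2; k++, l++) {
            A[i + k]--;               /* Pa's inner body                   */
            B[j + l]--;               /* Pb's inner body, now ILP-parallel */
        }

    /* RLM: Pb is a single loop of nb iterations merged with Pa's inner
     * loop; Pb's iterations are consumed across consecutive activations
     * of that inner loop, leaving a residual at the end. */
    int j2 = 0;                       /* Pb's index, live across i         */
    for (i = 0; i < na1; i++)
        for (k = 0; k < na2; k++) {
            A[i + k]--;
            if (j2 < nb) B[j2++]--;   /* one Pb iteration, if any remain   */
        }
    for (; j2 < nb; j2++) B[j2]--;    /* residual iterations of Pb         */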


Fig. 6. Remainder generation in BLM.

Fig. 4. Residual loop merging (RLM).

Fig. 7. Remainders returned by the RLM.

Fig. 5. Using a common index when merging two loops.

Note that RLM allows greater flexibility in overlapping iterations, as a direct BLM would have merged the loop of Pb with the if-statement of Pa. We remark that in some cases of BLM and RLM it is possible to use one common index in Pab, as depicted in Fig. 5 (note that the variables were renamed).

4. Creation and propagation of remainders

An important aspect of the proposed recursive scheme is the way remainders are handled. A remainder (the remaining iterations of a for-loop, e.g., for (i = i; i < n; i++) Si) is formed by merging two loops, either by BLM or by RLM. Let M(Sa, Sb) denote an application of a recursive merge of two statements Sa and Sb. Remainders are useful for balancing the merging of loops between two statement lists. For example, in the merging M({S1; S2; S3}, {S4; S5; S6}) let the expected execution frequencies be S1 = 100, S2 = 130, S3 = 220, S4 = 220, S5 = 70, S6 = 150. A direct recursive merge

{ M(S1, S4); M(S2, S5); M(S3, S6) }

will yield a very poor balance of the execution frequencies: 100–220; 130–70; 220–150. One possible way to improve the matching is to try different partitions, such as

{ M({S1; S2}, {S4}); M({S3}, {S5; S6}) }

yielding 230–220; 220–220. However, there is no guarantee that S4 can be fully merged with {S1; S2}, and similarly S3 with {S5; S6}. Here we use a different solution, based on forwarding remainders from one merging to the next. For example, we can forward a remainder of 120 iterations from the merge M(S1, S4) to the merge M(S2, S5), yielding 100–220 → 120; 130–(70 + 120) → 60; 220–(150 + 60). Remainders are preferable to partitioning as they are a dynamic solution rather than a static one.

Remainders are generated by BLM as follows. Consider the example in Fig. 6, wherein the execution frequency of the first loop is assumed to be larger than that of the second loop, so the remaining iterations of the first loop are returned. Note that the remaining iterations of the second loop are executed in case the prediction that the first loop has more iterations than the second one turns out to be false. Remainders are also generated by RLM mergings. In the following example S2 is a simple loop and S1 is a nested loop such that most of its statements are executed by its inner loop. Note that remainders returned by RLM can come either from the nested loop or from the simple loop (see Fig. 7).

Once a remainder has been created during the recursive merging, it should be propagated up the recursion levels until it can be used. The propagation of remainders occurs mainly through the merging of if-then-else statements (as depicted in Fig. 8). This includes case-statements, which are converted to if-statements before the actual merging begins. Merging two if-then-else statements cannot propagate remainders, since the then-part and the else-part may return different loops as remainders (as depicted in Fig. 9). Note that dynamic remainders can be generated for any form of while-loop; however, special care must be taken where loops contain break-, continue- or return-statements. In particular, the remainder of a loop with a break-statement must contain a check that the break-statement was not executed in the main part of the loop.

5. Merging function calls

An important advantage of source-level merging is the ability to continue the merge into function calls, e.g., M(Sa, Sb = f(x, y − x)). There are two possibilities for such a merge:

• Apply inlining [3]: M(Sa, inlined_body_f(x, y − x)), or
• Apply cloning [4]: call a ``merged'' function fab(Sa.vars, Sb.vars) instead of the two original calls Sa and Sb.

The body of fab(), in the case of cloning, is the merge of the functions called in Sa and in Sb. Inlining eliminates parameter passing and for this reason is usually preferable to cloning. Note that the code-size expansion is approximately the same for cloning and for inlining. The current implementation uses cloning when merging two calls, and inlining in all other cases. An additional condition for using cloning is that both functions f1(), f2() have the same ``relative weight'' (to be defined later). Otherwise one function is significantly ``larger'' than the other, yielding a large amount of ``remaining iterations'' that cannot be used after cloning; in such a case inlining is preferable, as it allows the merging process to forward the remaining iterations.
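The following hedged C sketch illustrates cloning; the bodies of f1 and f2 are hypothetical, while the f1_f2(x + y, &a, u, v, &b) signature follows the rewriting described below, where return values are passed back through pointers so that return-statements can be eliminated.

    void f1_f2(int x1, int *ra, int u, int v, int *rb) {
        int s1 = 0, s2 = 0;
        for (int i = 0; i < 100; i++) {  /* a loop of f1 merged (BLM)       */
            s1 += x1 + i;                /* ...with a same-range loop of f2 */
            s2 += u * i - v;
        }
        *ra = s1;                        /* replaces "return s1;" in f1     */
        *rb = s2;                        /* replaces "return s2;" in f2     */
    }
    /* call site: f1_f2(x + y, &a, u, v, &b); replaces the two calls
     * a = f1(x + y); and b = f2(u, v);                                     */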


Fig. 8. Propagating a remainder.


Fig. 9. Remainders cannot be propagated with if-then-else mergings.

The merging process is stopped after a fixed ``depth'' (two levels) of function merging, so that recursion unfolding and overly deep inlining/cloning levels are avoided.

Return statements in functions must be eliminated when merging function calls (both for inlining and for cloning); otherwise a part of one of the merged functions may not be executed. This is done by passing the returned value through a parameter rather than through the return-statement. For example, M(a = f1(x + y);, b = f2(u, v);) can be replaced by a call to f1_f2(x + y, &a, u, v, &b), where the returned values are passed back through &a and &b. When a return statement is eliminated, the execution should ``jump'' to the remaining part of the ``other'' function (the one whose return statement has not been reached yet). The ``jumping'' operation uses a goto-statement to jump out of loop nests, as depicted in Fig. 10. There are further details and cases in merging function calls, such as merging a call with a loop; however, most of them are technical and will not be described here. Break/continue statements are handled similarly.

6. The merging algorithm

Merging is a recursive process applied to the two abstract syntax trees of the programs being merged, using the following operations:

1. Traversing the two trees, computing five measures for each node (loops, if-statements, ...).
2. Based on these measures, selecting which sub-components of Sa and Sb will be merged next.
3. Applying a specific merging scheme for each possible combination of statement types; e.g., merging a loop L with an if-then-else is performed by merging L with both the then-part and the else-part.
4. Forwarding remaining loop iterations to be used by the merging of other sub-components.

As explained earlier, a problematic aspect of obtaining optimized mergings is accounting for the effect of the nesting structure of sub-components. Fig. 11 illustrates this problem, showing that (based on the structure of the sub-trees) it is better to merge S3 with A2 than with A3 or A4, which have the same size but the ``wrong'' internal structure. Potentially, the formal matching problem between two trees (as depicted in Fig. 11) could be solved using integer linear programming (ILP). Formally, this problem can be defined as finding an optimal mapping of the nodes of a weighted rooted tree L to the nodes of another weighted rooted tree R such that:

• Leaves are mapped to leaves and internal nodes to internal nodes.

• If a node u ∈ L has been mapped to v ∈ R then all their fathers should be mapped as well.

• The mapping preserves the tree partial order defined by both L and R.

• The sum of the minimum weights of each pair of leaves that have been mapped to each other is maximal.

It seems that the set of ILP constraints needed to solve this problem would be at least quadratic in the number of nodes of the two trees. This follows from the fact that if a leaf u has been mapped to another leaf v, then we must add constraints that all the ``fathers'' of u and the fathers of v are mapped as well. We remark that we do not know whether this tree-mapping problem is NP-complete. In contrast, the proposed technique (matching based on grading + propagation of remainders) is basically a linear top-down pass wherein the mapping is chosen based on grades that capture the ``structure'' of each sub-tree. Moreover, the proposed algorithm uses dynamic remainders, as opposed to the static mapping that would have been obtained by an ILP matching. In the first step of the recursive merging procedure the algorithm computes the following five measures (as depicted in Fig. 12). Note that these measures can change during the merging procedure as remaining iterations are forwarded to other components of the programs. Thus, when a remainder is forwarded and attached to a node u, the five measures of u must be re-calculated. This re-calculation does not increase the linear complexity of the proposed algorithm since: (a) there is no need to recalculate the measures of the fathers of u; and (b) remainders are forwarded only at the lowest levels of the syntax tree.

nesting depth (nD) - the maximal nesting level of sub-loops, in multiples of 10. The nD of a function call is the nD of the function definition, since during merging the call may be inlined.
number of iterations (iT) - the average number of S's iterations (using profile information) every time S is activated. The iT of a function call is 1.
residual frequency (rF) - the number of times that S will be executed by the loops S is nested in (based on their iTs).
overall frequency (oF) - the weighted sum of the oFs of S's inner statements, where:

• the oF of an assignment/function-call is its rF.
• the oF of an if-statement is the maximum of the oFs of the then-part and the else-part.

• the oF of a block (list of statements) or of a loop is the sum of the oFs of its inner statements. Note that this measure is affected by the number of statements in the body of each loop.

relative weight (rW) - the ratio of S.oF to the sum of the oFs of all the statements in the same block as S (including S itself).

Note that these measures can yield good distinctions between different structures (see the illustrative fragment below). Structural merging decisions are mainly made when merging two lists of statements, merge({S1; ...; Sk}, {S1′; ...; Sm′}). It is in this case that we must decide which Si and Sj′ should be merged (``mapped to each other'' in the tree-matching formalism). In all other cases (if-then-else, for-loops, etc.) no such decision is needed, as the mergings are determined by the types of the statements being merged. Merging two lists of statements is also the place where remainders formed by a recursive merge M(Si, Sj′) can be propagated to other mergings. Thus we first describe the merging of this special case, merge({S1; ...; Sk}, {S1′; ...; Sm′}).
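Since Fig. 12 is not reproduced here, the following hypothetical C fragment illustrates how the five measures combine; all numbers are illustrative assumptions, not taken from the paper's example.

    for (i = 0; i < 10; i++) {     /* nD = 20 (two nesting levels x 10), iT = 10 */
        x++;                       /* rF = 10 (runs once per outer iteration)    */
        for (j = 0; j < 5; j++)    /* nD = 10, iT = 5, rF = 10                   */
            a[j] += x;             /* rF = 10 * 5 = 50; its oF = rF = 50         */
    }
    /* oF(inner loop) = 50; oF(outer loop) = oF(x++) + oF(inner) = 10 + 50 = 60.
     * If the outer loop shares its block with one statement of oF = 20, then
     * rW(outer loop) = 60 / (60 + 20) = 0.75.                                   */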


Fig. 10. Merging with function calls.


Fig. 11. Structural considerations in matching sub components for merging.

Fig. 12. An example demonstrating how the five measures are computed.

As indicated earlier, forwarding remainders can occur only when merging two lists of statements. Given two lists of statements that should be merged, M_lists({S1; REST1}, {S2; REST2}), there are three options for a recursive merge:

opt1: M(S1, S2); M_lists({REST1}, {REST2})
opt2: S1; M_lists({REST1}, {S2; REST2})
opt3: S2; M_lists({S1; REST1}, {REST2})

Among other considerations, the choice is made by computing which option attains the minimum of the following cost function:

int op({S1;REST1}, {S2;REST2}, int *opt)
{
    /* base cases for empty lists omitted for clarity */
    k1 = |S1.rW - S2.rW| + op(REST1, REST2, opt);
    k2 = op({S1;REST1}, REST2, opt);
    k3 = op(REST1, {S2;REST2}, opt);
    if (k1 <= k2 && k1 <= k3)      { *opt = 1; return(k1); }
    else if (k2 <= k1 && k2 <= k3) { *opt = 2; return(k2); }
    else                           { *opt = 3; return(k3); }
}

The algorithm for merging two lists of statements is as follows:

M_lists({S1;REST1}, {S2;REST2})
{
    int opt;
    k = op({S1;REST1}, {S2;REST2}, &opt);   /* grade the merging options */
    if ((S1.nD <= 10 && S2.nD <= 10) or     /* Case-1: non-nested loops */
        (|S1.rW - S2.rW| < Threshold) or    /* Case-2: same relative weight */
        (opt == 1)) {
        R = M(S1, S2);       /* opt1: merge S1,S2 and forward the remainder */
        if (R is from S1) {
            Recalculate measures for {R;REST1};
            M_lists({R;REST1}, {REST2});
        } else {
            Recalculate measures for {R;REST2};
            M_lists({REST1}, {R;REST2});
        }
    }
    else if (opt == 2)       /* opt2: no remainder */
        S1; M_lists({REST1}, {S2;REST2});
    else
        S2; M_lists({S1;REST1}, {REST2});
}

For clarity, several technical aspects of this merging have been omitted, e.g., handling the case R == ∅. Following is an explanation motivating the two checks that affect the choice of opt1:

Case-1: S1.nD <= 10 && S2.nD <= 10 - both S1 and S2 are non-nested loops or simple statements, hence they can be safely merged following one of the cases of the previous section.
Case-2: |S1.rW − S2.rW| ≤ Threshold - both S1 and S2 have the same relative weight. For example, consider M_lists({S1; S2; S3}, {S1′; S2′; S3′}) where S1.oF = 10, S2.oF = 20, S3.oF = 20 and S1′.oF = 100, S2′.oF = 50, S3′.oF = 350. Then

10 / (10 + 20 + 20) = 1/5 = 100 / (100 + 50 + 350),

hence S1.rW = S1′.rW = 1/5, and we eventually do M(S1, S1′); M(S2, S2′); M(S3, S3′); as expected.


The overall merging is a recursive process starting with the two roots of the ASTs (sa, sb) of the two programs being merged:

M(Sa,Sb) {
    measures: Sa.nD, Sa.iT, Sa.rF, Sa.oF, Sa.rW, Sb.nD, Sb.iT, Sb.rF, Sb.oF, Sb.rW are computed.
    if Sa or Sb are while/do-loops then they are converted to for-loops.
    apply splitting and unrolling if Sa and Sb are for-loops.
    if both Sa and Sb are function calls then merge the bodies of the two functions,
        or create a clone function fab() and call it instead of Sa and Sb.
    if Sa or Sb is a function call (but not both) then inline the function call.
    if Sa or Sb is a case-statement it is converted to nested if-then-else.
    R = empty
    CASE Sa,Sb:
        {Sa;Sb}          when either Sa or Sb is an assignment.
        M_2if(Sa,Sb)     when both Sa and Sb are if-statements.
        R = M_1if(Sa,Sb) when either Sa or Sb is an if-statement.
        R = M_blm(Sa,Sb) when both Sa and Sb are non-nested for-loops, or both are nested.
        R = M_rlm(Sa,Sb) when both Sa and Sb are for-loops and one (Sa or Sb) is a nested loop.
        M_lists(Sa,Sb)   when Sa or Sb is a statement-list {S1;...}.
                         M_lists recalculates the measures in case a remainder is moved.
        {Sa;Sb}          otherwise
    ENDCASE
    return(R);
}

Fig. 13 depicts the merging process of two programs, and in particular the way remainders are forwarded for future mergings. This example illustrates how the algorithm works, though for clarity some of the more technical details are not included.

1. Initially the merging algorithm applies M_lists, as both Sa and Sb are lists of statements: Sa = {for (...) a[i] = i; if (...) ...} and Sb = {s = 10; for (...) c[k] = k; s = 0;}.
2. M_lists merges the first two loops by calling M(for (i = 0; i < 50; i++) a[i] = i, for (k = 0; k < 150; k++) c[k] = k).
3. M() merges the two loops using M_blm(), and the remainder R = for (k = 50; k < 150; k++) c[k] = k is attached back to the second program. M_lists continues by merging if (x < y) ... with {R = for (k = 50; k < 150; k++) c[k] = k; s = 0;}.
4. This last merging is performed by M_1if(if (x < y) ..., R = for (k = 50; k < 150; k++) c[k] = k), which recursively merges R with both the then-part and the else-part of if (x < y) for (j = 0; j < 50; j++) b[j] = j; else x = y;. These two mergings yield the final remainder R = for (k = k; k < 150; k++) c[k] = k, which is merged by M_lists with the last statement s = 0. Note that the remainder returned by M_1if() starts with k = k, where k can be 100 if the then-part of if (x < y) ... was executed, or 150 if the else-part was executed.
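Since the merged program of Fig. 13 is not reproduced above, the following hedged C reconstruction traces steps 1–4; the declarations are added for self-containment, and x, y are assumed to be set elsewhere.

    int a[50], b[50], c[150];              /* sizes assumed from the bounds */
    int x, y, s, i, j, k;

    s = 10;                                /* Sb's first statement          */
    for (i = 0, k = 0; i < 50; i++, k++) { /* steps 2-3: M_blm of the loops */
        a[i] = i;
        c[k] = k;                          /* leaves remainder R: k=50..149 */
    }
    if (x < y) {                           /* step 4: M_1if merges R with   */
        for (j = 0; j < 50; j++, k++) {    /* both branches                 */
            b[j] = j;
            c[k] = k;                      /* then-part: k reaches 100      */
        }
    } else {
        x = y;
        for (; k < 150; k++) c[k] = k;     /* else-part: k reaches 150      */
    }
    for (; k < 150; k++) c[k] = k;         /* final dynamic remainder k = k */
    s = 0;                                 /* merged with Sb's last stmt    */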


Following are a few remarks regarding the way unrolling and splitting [1] are used in the merging algorithm. These two operations are designed to balance the number of iterations and the oFs of the two loops being merged:

Splitting ``separates'' a loop with n iterations into k > 1 consecutive loops, each with n/k iterations. Unrolling a loop u > 0 times modifies the loop such that the size of its body increases by a factor of u + 1 and its number of iterations decreases by a factor of u + 1. Fig. 14 contains an example wherein the loop of Pa is first split into two loops, each unrolled by a factor u = 1. The unrolling factor is determined as u = min(max(loop1.iT, loop2.iT) / min(loop1.iT, loop2.iT), max(loop1.rW, loop2.rW) / min(loop1.rW, loop2.rW)). The unrolling factor is also restricted not to exceed a certain threshold, as a way to prevent large code expansion. The decisions as to which loop should be split and then unrolled are based on the differences between the .iT and the .rW values of each loop. Note that unrolling + splitting is not always an efficient solution. Consider, for example, merging a loop (L1) of 10 iterations whose body contains two loops of 50 iterations each, with a loop (L2) of 1000 iterations whose body contains only one statement. Clearly, unrolling L2 100 times will not yield an efficient merging. It is better to change L2 into a nested loop (apply loop tiling), for (i = 0; i < 1000; i += 100) for (j = i; j < i + 100; j++) ..., and then split the inner loop into two loops of 50 iterations each. This type of optimization is also supported in the implementation.
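As an illustration, here is a hedged sketch of splitting followed by unrolling with u = 1 on a hypothetical 100-iteration loop; body(i) is a placeholder for the loop body.

    for (i = 0; i < 100; i++) body(i);            /* original loop */

    /* after splitting into k = 2 consecutive loops: */
    for (i = 0; i < 50; i++) body(i);
    for (i = 50; i < 100; i++) body(i);

    /* after unrolling each loop once (u = 1): body size doubled,
     * iteration count halved, ready to be merged with two
     * different partner loops.                                    */
    for (i = 0; i < 50; i += 2)   { body(i); body(i + 1); }
    for (i = 50; i < 100; i += 2) { body(i); body(i + 1); }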
7. Experimental results

The implemented system extends the CTool parser to include all parts of the merging algorithm: the profile infrastructure, loop unrolling, loop splitting and the merging algorithm described above. Our approach for verifying the usefulness of the proposed system was to merge a random sample of programs from known benchmarks, rather than specific examples from real-world applications. The idea is that the fraction of successful mergings can serve as an indicator of the success probability of any specific merge made in the future. In this respect, out of the 135 mergings reported here only 51 were unsuccessful (no improvement compared to serial execution), implying a success probability of more than 0.5 for future mergings. We believe that the potential for finding independent programs to merge in real-life applications is clearly large, as embedded systems usually contain many control/DSP tasks executed as independent parallel processes by the underlying real-time operating system. The real question is not to show some specific examples where merging works, but rather to predict or systematically evaluate the potential of merging in general. Note that the proposed tool is intended to be used selectively by the developer of an embedded system at the final stage of development. Since for an application with n processes there are n²/2 possible mergings to consider, a success probability of more than 0.5 is a sufficient drive to justify the effort. The case of self-merging is particularly important, as in many cases the same algorithm/code is used for multiple tasks.

In the experiments we used two sets of programs. The first set includes five simple benchmarks (about 60 lines each) from the integer DSPstone benchmark suite [17]. The second set includes larger (about 600 lines each) and more complex programs from the MiBench benchmark [8]. In addition, we also applied merging to the Livermore loops, a benchmark containing small loops representing numerical kernels [13]. All programs were selected at random; since we test all possible pair combinations, it is not practical to test all the programs in a benchmark. Each of the chosen programs contains arithmetic computations and extensive use of array references. All programs were tested on the Itanium II (1.6 GHz) and IBM's Power4, using the GCC, XLC or ICC compilers with optimization level O3.


Fig. 13. A detailed example of merging and forwarding remainders.

Fig. 14. Splitting the loop of Pa into two loops, each unrolled by a factor u = 1.

The following tables summarize the improvements (the percentage reduction in execution time) obtained by merging two programs, compared to executing them sequentially. A '0' value indicates an unsuccessful merging, with a degradation of 5%–10%. Negative results are not considered here, since it is assumed that the user keeps only profitable mergings (for representative inputs). These improvements do not include the operating-system overhead of concurrent execution (e.g., context switches) and focus solely on the improvement in instruction-level parallelism. Unmarked or zero entries indicate no improvement or a small degradation in execution time due to the merging. Testing all possible combinations helps establish the robustness of the results, as each program has a different nesting level, loop structure and function calls. Clearly, the highest results (a difference of about 10%) are on the diagonal, as self-mergings allow optimal matching of sub-components. The first group of experiments, studying the usefulness of merging on the Itanium, is described in Fig. 15. The numbers in the tables are the percentage reduction in runtime compared to running the two programs serially. Improvements were also obtained for IBM's XLC compiler and for GCC on the Power4 (Regatta), as depicted in Fig. 16. Note that only very few mergings in the second table of Fig. 16 (ICC + DSPstone) were successful. The reason is that ICC is a very good parallelizer and the loops of DSPstone are highly parallel; hence merging cannot be used to increase ILP (instruction-level parallelism). The next set of experiments (Fig. 17) studies the benefit of merging for a DSP CPU such as the TI-C67x. In this case a simulator for the TI-C67x had to be used, hence only small loops such as the Livermore loops could be simulated. The TI-C67x can execute

Fig. 15. Experimental results of merging Mibench and dsp-stone using ICC and GCC.

Fig. 16. Experimental results using IBM’s XLC compiler.


Fig. 17. DSP results for TI-C67x.

Fig. 18. dsp-stone results for TI-C67x.

more arithmetic operations per cycle (up to 8 operations) than the Itanium, hence it was interesting to compare the mergings on the TI-C67x with those on the Itanium. Indeed, the mergings were more successful on the TI-C67x than on the Itanium. Similar improvements (Fig. 18) were also obtained by the TI-C67x for the DSPstone benchmark, in comparison to the Itanium + ICC. Note that the results on the DSPstone benchmark for GCC-Itanium are significantly better than those for TI; however, the results are not absolute times but relative improvements, so it is expected that GCC will show more benefit than ICC or TI. This is because GCC is probably a less parallelizing and less optimizing compiler than TI/ICC, and hence can benefit more from merging. The average code-size expansion in these mergings was about 1.5, with a maximal value of 1.8.

The programs used in the above experiments were selected based on their relation to DSP applications. Some of these programs were small (about 100 lines) and some were larger, but all were under 500 lines of code. In order to verify that the method works for larger programs, we tested it on some programs from SPEC 2000, such as equake, with about 1500 lines of code. These mergings obtained about 10% improvement for all input sizes (small, medium and large); the code expansion in these cases was about 40%.

We have tested the necessity of RLM and of forwarding remaining iterations for a successful merging. This comparison was done by disabling the use of RLM and letting the remaining iterations ``stay'' right after the merged loop. The decisions as to which sub-components should be merged were made based on the oF alone. In all the experiments (Fig. 19), turning RLM off either reduced the improvements to zero or had no effect at all. We thus consider three cases:

• RLM was not activated by the merging algorithm, hence turning it off had no effect (denoted X+).
• RLM was activated by the merging algorithm but there was no improvement; obviously in this case turning RLM off had no effect (denoted X0).
• RLM was activated and there was an improvement; turning RLM off reduced this improvement to zero (denoted X−).

There were no cases where turning RLM off improved performance. The tables in Fig. 19 (Livermore and MiBench) summarize the effect of turning RLM off for TI (X = T), ICC on Itanium (X = I) and GCC on Itanium (X = C). As expected, T+, C+ and I+ occurred on the diagonals, since


self-mergings do not require RLM, and a direct recursive merging of components is optimal (both programs have the same structure). In the first table of Fig. 19, apart from the diagonal, there are 11 X+s, 11 X−s and 8 X0s, compared to 4 X+s, 13 X−s and 2 X0s in the second. The second table of Fig. 19 contains larger and more complex programs, so structural matching is more important; indeed, the effect of turning RLM off was more crucial here: only 2 X0s + 4 X+s (the negative results) compared to 13 X−s (the positive results). The first table (Livermore kernels) also shows that RLM is essential, but only in about half of the cases. For both tables in Fig. 19 it follows that RLM + remainders are essential, as every time there was an improvement, it was eliminated when RLM + remainders were turned off.

Next we consider profiling techniques that allow non-intrusive measurement of internal CPU events. We used the Oprofile system [16] to measure some of these internal events and thus gain better insight into the effect of program merging on pipeline usage, caching and other factors. Fig. 20 depicts the difference in various CPU events before merging (programs executed one after the other) and after merging. As expected, for these complex programs merging was less effective (about 5% improvement); however, they are useful as indicators of the effect of program size on the merging. As explained in the introduction, merging can improve the scheduling, allowing the scheduler of the compiler/CPU to exploit unused slots and clock cycles. We measured the number of cycles in which the entire machine pipeline was cleared (MC) and found that it was significantly reduced (see the MC column in Fig. 20). The number of retired mispredicted branches (BR) was reduced for about half of the programs and remained the same for the rest. This is expected, since in merging hot loops are merged into a single loop, hence the number of mispredicted branches is reduced. The effect of merging conditional statements can be more complex, but it is also expected to reduce mispredicted branches. The number of retired instructions (INS) did increase due to merging, but not significantly (see the INS column in Fig. 20). As expected, merging increased the number of 2nd + 3rd level cache misses (CH) and the number of instruction TLB misses, but not significantly. Moreover, the effect of these ``bad'' events does not seem to be related to the size of the merged programs. This somewhat surprising result suggests that the performance of the proposed technique depends mainly on scheduling and hence can probably reach higher than 20%. We remark that the low cache-miss values in fft–fft ... enc–enc are due to the fact that these programs use few arrays and access the elements of these arrays in a regular, non-repeating manner.

7.1. Practicality

The practicality of the proposed method depends on how frequent are the cases where an embedded system contains independent, continuously executing threads/processes that can be merged. We believe that such cases are frequent in embedded systems, hence the potential usefulness of the proposed scheme. Fig. 21 depicts a typical structure of an embedded system, showing multiple decoding/encoding processes that (if not executed in hardware) can potentially be merged. For example, in a cellular phone there are parallel channels for decoding and encoding incoming and outgoing audio signals, usually executed as concurrent processes on a DSP core.
Note that self-mergings of the same program are also expected to occur, since some of the channels' structures can be identical. Self-mergings are an important case for two reasons. First, as indicated earlier, self-mergings are frequent in the channel structure of embedded systems. Second, since the two programs are identical, their structural matching is easier. Finally, in a self-merging, if one of the programs has low ILP utilization then so does the other, and the merging is likely to be very successful.


Fig. 19. Turning the RLM off.

Fig. 20. Effect of merging on counted CPU events.

Fig. 21. Schematic channel configuration often found in embedded systems.

Some of these ``channel mergings'' are included in the benchmarks used in Section 7. In particular, the self-merging of ADPCM and the self-merging of SHA are such cases, where ADPCM is Adaptive Differential Pulse Code Modulation, used in communication devices, and SHA is a secure hash algorithm often used in the secure exchange of cryptographic keys. Apart from the systematic experiments of Section 7, we made some separate tests of practical ``channel'' mergings to further support the claim that mergings and self-mergings can be useful in embedded systems. These include Viterbi's encoding/decoding algorithm for audio signals, GSM (a speech transcoding algorithm used in cell phones) and Blowfish (a symmetric block cipher with a variable-length key). The Viterbi encoder/decoder merging obtained an 8% improvement, the GSM merging 10%, and Blowfish 8% (for ICC on Itanium). Previous works related to the Thrint compiler have also demonstrated this potential by showing successful mergings (though at assembly level) of a selected set of embedded-systems programs. This potential was also demonstrated in the work on simultaneous multithreaded DSPs [9,10], where the merging was done not at source level but at runtime, using an SMT simulator. The mergings in [10] combined the code of 6–9 threads usually executed in parallel in cellular phones, as follows: an MPEG-2 encoder and decoder, a speech encoder and decoder (GSM standard), and a channel encoder and decoder (with channel modulation).³ Another example [14] where merging can be useful is cellular base stations performing Viterbi decoding of multiple independent data streams.

³ Up to 4 MPEG encoder threads were used, each processing a quarter of the original image.

Practical use of merging systems implies that the merging of such processes should be done so that real-time restrictions are met. Though in this work we study the potential of source-level merging, and not the merging of specific applications, we have implemented a mechanism to control the ratio at which loops are merged (e.g., in [10] the decoding rate of the MPEG decoder was reduced to match the processing rate of the other threads). Thus, if Pa is intended to run twice as fast as Pb, this ratio will be preserved by the underlying tool.

8. Conclusions

This work presents an algorithm for merging two programs into a single program at the source level. The approach is a software equivalent of simultaneous multithreading (SMT). This work goes beyond previous works by considering how to merge the remainder of a loop into a recursively merged tail. The contribution of this work is the technique of handling the tails of non-equivalent loops in the process of recursively merging sub-components. The proposed scheme makes extensive use of the ability to forward ``remaining iterations'' from the merging of two sub-components to following mergings of other sub-components. Forwarding remaining iterations has several modes, in particular the ability to use repeated executions of inner loops to complete the iterations of large loops. This is (to the best of our knowledge) the first complete tool for source-level merging of C programs. The effectiveness of the proposed scheme for embedded systems has been studied via a sequence of experiments showing an expected improvement of up to 20%. We have presented results of merging programs from DSP-related benchmarks using several compilers on different architectures.


The conclusions of the experimental results are as follows. Ideally, merging could achieve an improvement of 100% by ``filling up'' the unused ``slots'' in the execution of one program with instructions of the second program. This did not happen (a maximum of 48% was obtained), mainly due to: (A) existing parallelism in the original programs, and (B) the lack of ability to communicate the independence of instructions to the underlying compiler. Some mergings were not successful, mainly due to a high degree of parallelism in at least one of the merged programs. Our results show that cache misses, branch prediction accuracy and register pressure are not limiting factors, even when merging programs with several thousands of lines of code. The key factor limiting the current RLM capabilities is mainly the inability to propagate independence information to the compiler's scheduler. Future research includes:

• Studying the possibility of combining the dynamic-remainders method with a static tree-matching algorithm. Such an algorithm can be based on known techniques for tree isomorphism and related problems.
• Using profile information for matching nested if-statements with different true/false profile behavior.
• Exploring the option of merging arrays and linked lists (not only code) to improve cache efficiency and reduce the overhead of memory allocation.
• Extending the merging technique to handle dependent threads.


References

[1] D.F. Bacon, S.L. Graham, O.J. Sharp, Compiler transformations for high-performance computing, ACM Computing Surveys 26 (4) (1994) 345–420.
[2] Yosi Ben-Asher, Esti Stein, Basic results in automatic transformations of shared memory parallel programs into sequential programs, in: Third Asian Computing Science Conference, Kathmandu, Nepal, 1997.
[3] Pohua P. Chang, Scott A. Mahlke, William Y. Chen, Wen-mei W. Hwu, Profile-guided automatic inline expansion for C programs, Software - Practice and Experience 22 (5) (1992) 349–369.
[4] Keith D. Cooper, Mary W. Hall, Ken Kennedy, A methodology for procedure cloning, Computer Languages 19 (2) (1993) 105–117.
[5] Alexander G. Dean, Compiling for fine-grain concurrency: Planning and performing software thread integration, in: RTSS '02: Proceedings of the 23rd IEEE Real-Time Systems Symposium, IEEE Computer Society, Washington, DC, USA, 2002, p. 103.
[6] Alexander G. Dean, John Paul Shen, Hardware to software migration with real-time thread integration, in: 24th EUROMICRO Conference, EUROMICRO'98, vol. 1, 1998.
[7] Alexander G. Dean, John Paul Shen, Techniques for software thread integration in real-time embedded systems, in: RTSS, 1998, pp. 322–333.
[8] M. Guthaus, J. Ringenberg, D. Ernst, T. Austin, T. Mudge, R. Brown, MiBench: A free, commercially representative embedded benchmark suite, 2001.
[9] Stefanos Kaxiras, Alan D. Berenbaum, Girija Narlikar, Simultaneous multithreaded DSPs: Scaling from high performance to low power, Lucent Technologies Technical Memorandum 10009600-001024-13TM/10009639-001024-06TM, 2001.
[10] Stefanos Kaxiras, Girija Narlikar, Alan D. Berenbaum, Zhigang Hu, Comparing power consumption of an SMT and a CMP DSP for mobile phone workloads, in: CASES '01: Proceedings of the 2001 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, 2001.
[11] D. Koes, M. Budiu, G. Venkataramani, S. Goldstein, Programmer specified pointer independence, in: Workshop on Memory System Performance, MSP, 2004.
[12] Jack L. Lo, Susan J. Eggers, Joel S. Emer, Henry M. Levy, Rebecca L. Stamm, Dean M. Tullsen, Converting thread-level parallelism to instruction-level parallelism via simultaneous multithreading, ACM Transactions on Computer Systems 15 (3) (1997) 322–354.
[13] F.H. McMahon, The Livermore Fortran kernels test of the numerical performance range, in: Performance Evaluation of Supercomputers, Elsevier Science B.V., 1988, pp. 143–186.
[14] Won So, Alex Dean, Procedure cloning and integration for converting parallelism from coarse to fine grain, in: Interaction between Compilers and Computer Architectures, IEEE Computer Society, 2003.
[15] Won So, Alexander G. Dean, Complementing software pipelining with software thread integration, SIGPLAN Notices 40 (7) (2005) 137–146.
[16] SourceForge: Project Info - Oprofile. http://sourceforge.net/projects/oprofile.
[17] V. Zivojnovic, J.M. Velarde, C. Schlager, H. Meyr, DSPstone: A DSP-oriented benchmarking methodology, 1994.

Yosi Ben Asher received his Ph.D. in computer science from the Hebrew University in 1991. He is currently a lecturer at the Department of Computer Science, University of Haifa, Israel. His research interests include parallel systems, compilers, distributed Web applications and reconfigurable networks. Current projects include: high-level synthesis, program merging, source-level compilation, and distributed e-commerce.

Moshe Yuda completed his M.Sc. in 2007 on program merging. He is currently working at Elvarion Ltd. on scheduling algorithms for cellular base stations.