Applied Mathematical Modelling 22 (1998) 533–543
Transient and steady-state performance modeling of parallel processors

Marlin H. Mickle
School of Engineering, University of Pittsburgh, 348 Benedum Hall, Pittsburgh, PA 15261, USA

Received 10 December 1996; received in revised form 12 May 1998; accepted 2 June 1998
Abstract

A mathematical procedure is presented as a design/analysis methodology to model computer performance including: (1) Instruction Set Architecture (ISA), (2) specific hardware implementations of the ISAs, and (3) any benchmark program in one closed form solution. Examples are given employing three classical 8-bit processors, and a classical benchmark program, for both CISC and RISC type implementations. The formulation provides a complete solution giving both steady-state and transient results. © 1998 Elsevier Science Inc. All rights reserved.

Keywords: Instruction set architecture; Parallel processing; Cache memory; Performance modeling; Assembly language; Memory contention
1. Background

Early analysis of computer systems considered the instruction set architecture (ISA) at a variety of levels [1–4]. Although this subject has not received significant recent attention, the multitude of microprocessors available today presents an excellent array of examples to demonstrate the underlying principles and ISA differences using specific benchmarks [5]. In addition, with the current VLSI capability to produce numerous types of dedicated special purpose processors, this method of modeling provides an analytical model that gives insight into solutions for wide ranging alternatives, including ISAs and hardware implementations based on various parameter choices.

Previous results [6] have shown how the discrete Markov analysis can be applied to computer systems (processor/memory), including the fundamental mathematical derivations. A particular example was given [6] to show how this method of design/analysis can be applied to obtain the clock doubling capability such as the Intel Overdrive® and the IBM Blue Lightning®. The method provides the basis of a procedure to analyze the design and/or modification of an ISA to evaluate its performance under (1) specific system implementations, e.g., cache size, memory speed, multiple processors with local and/or shared memory [4], and (2) specific hardware designs involving clocking, i.e., clocks per instruction (CPI), RISC vs. CISC, etc.
Tel.: +1 412 624 8000; fax: +1 412 624 8003; e-mail: [email protected].
While simulations can be employed in many design/analysis situations, the necessity to include details representing a wide variety of hardware alternatives requires a tremendous effort to obtain a statistical result (probability) that is in fact a combination of many related factors. The method of this paper avoids the details of simulation and provides a complete solution, including both the steady state and the transient. System performance in the presence of interrupts, task switching, etc., can only be accurately represented using a transient analysis. The details of the mathematical derivation for the transient are included in [6]. Simulation has some use in such analyses, but it does not conveniently accommodate alternative ISAs or changes in the hardware implementation of the CPI alternatives discussed in this paper. The results of this paper provide for the inclusion of detailed hardware actions and interactions, such as clock cycles per instruction (CPI) including their location within the total execution interval, to obtain a tractable mathematical result for design, analysis and optimization.

2. Introduction

This paper is an extension of previous results [6] to provide a design/analysis methodology to model computer performance including: (1) the ISA, (2) specific hardware implementations of the ISAs, and (3) any benchmark program in one closed form solution. Examples will be given employing three classical 8-bit processors, and a classical benchmark program [5], for both CISC and RISC type implementations. The mathematical formulation provides a complete solution giving both steady-state and transient results.

Specifically, the performance measure modeled is the contention and lack of contention for memory accesses by executing processors. The accesses may be instruction or data fetches and data writes. The model incorporates: (1) the executing program, (2) the ISA, and (3) the hardware implementation of the ISA in terms of clock cycles for the fetch/read/write/execute segments of the total instruction execution. The model provides both the transient and steady-state memory contention for executing programs. An example is used to demonstrate an application to parallel computation, making it possible to analyze the interrelationships of the ISA, the hardware implementation, and the application level programs to be executed on the parallel processors.

One major point of analysis is the interaction of programs executing on individual processors and the latency caused by memory conflicts or contentions. For example, the interactions among memories and processors have been modeled in a closed form using Markov chains [7]. A more recent text [8] proposes performance measures for multiple memories and processors using the same type of assumptions as in [7], while again only achieving a steady-state result. The modeling technique presented in this paper allows very specific (nonsimplifying) data and/or parameter values in a mathematically tractable form giving the complete solution. The analysis is based on dynamic measures of software execution parameters that are used in a standard mathematical systems technique. The basis of comparison is the average number of useful execution cycles per unit time. The method is demonstrated using specific benchmarks programmed for three different microprocessors. The example and type of processor result in no loss of generality while permitting the reader to see the fundamental nature of the application and the procedure.
Any given ISA can be implemented in a variety of hardware configurations with specific forms of assembled code at the binary level, sometimes termed the immediate language.
Fig. 1. Two (2) parallel processor architecture.
For each assembly language, two implementations in terms of hardware are given, where one is in fact the manufacturer's hardware implementation, and the other is a gross approximation of a general hardware implementation. Both are shown to have similar results for each ISA.

3. The two processor two memory example

The benchmark program is the classical data movement problem of Osborne [5] used in the comparison of an array of microprocessors. The effect of a cache for each type of parallel implementation is then demonstrated using the same form of analysis. The two processor implementation is shown in Fig. 1. The parallel implementation is a configuration of two processors accessing two different memories to solve the benchmark program that moves an array of bytes from one buffer to another. The two buffers in this example are located in two different memories that can be accessed by both processors. The benchmark assembly language programs for the three processors are shown in Fig. 2. These benchmarks are analyzed on a dynamic basis where their loops represent the primary computational effort.

The methodology presented in this paper works for multiple processors (any n) and multiple memories (any m), with or without cache. Although the examples included are two (2) processor, two (2) memory systems, this dimensional choice is made solely to keep the matrices and state diagrams to a reasonable size for publication, with no loss of generality. The method is applicable to any number of memories or processors, and the two numbers (processor/memory) are not required to be equal.
Fig. 2. Benchmark program for 8051, 8085, and 6800.
4. The method of analysis

The data requirements for the analysis are: (1) the probability of memory access for the instructions in the benchmark, (2) the time basis of execution of the individual instructions as given in the manufacturer's data sheets, and (3) the simple conceptual state model of the multiple processor configuration. All three of these requirements are easy to obtain, and the mathematics of the analysis can be performed on a hand held calculator. Smaller benchmarks are used so that the reader can see the direct result of various implementations of the instructions and their hardware implementation. The same procedure can be followed for larger programs using available software analysis tools that count the occurrence of each instruction during execution, i.e., the dynamic analysis.

Consider first the parallel benchmark. The program for P1 is assumed to be stored in M1, and the program for P2 is stored in memory M2 of Fig. 1. The object of the program is to have P1 move 64 bytes from M2 to M1, and P2 is to move 64 bytes from memory M1 to M2. The two processor configuration and memory map are assumed to have the hardware to direct the memory reads and writes to the proper processor as well as stall any memory request when two competing requests are received. From the above description, there will be memory conflicts when both P1 and P2 try to access M1, and when both P1 and P2 try to access M2.

The processor hardware divides the crystal cycles into a certain number of machine cycles to execute each instruction. The point of the research reported here is to first analyze the ISA of a number of processors, in particular the 8051, 8085, and the 6800, in a generalized type of implementation. Consider first the general form of implementation. The hardware implementation will be assumed to take: (1) one clock cycle for the fetch of each instruction byte; (2) one clock cycle for a data read; and (3) one clock cycle for a data write. In both cases, the data are the 64 bytes contained in the memory buffer. In addition, it will be assumed that (4) one additional cycle is required to perform any internal processor steps, i.e., the execution. The additional clock cycle accounts for the fact that during the instruction timings, there are typical periods during the execution when the memory is not active. The above four assumptions regarding cycles will be the same for the 8085, 8051, and 6800, in order to compare the ISAs.

The point of investigation here is the ISA, which is a characteristic distinct from the hardware that implements the ISA. This is a point that is sometimes lost when dealing with an assembly language implementation. The concern is the ISA, not the syntactical implementation of the ISA indicated by the assembly language benchmarks or the hardware execution of the immediate language in the binary (hex) load module. The three processors indicated in Fig. 2 will have the fetches, reads and writes as shown in Table 1. In Table 1, the Fetches and Reads are from the same memory and the Writes are to the other memory.

Table 1
Cycles for each processor/function

                                  8085   8051   6800
Fetches (by byte)                 8      6      8
Reads (data)                      1      1      2
Writes (data)                     1      1      2
Executes (one per instruction)    6      5      5
Total                             16     13     17
Table 2
States based on memory requests

State 1: 1 M1 request,  1 M2 request
State 2: 2 M1 requests, 0 M2 requests
State 3: 0 M1 requests, 2 M2 requests
State 4: 0 M1 requests, 0 M2 requests
State 5: 1 M1 request,  0 M2 requests
State 6: 0 M1 requests, 1 M2 request
Based on the data of Table 1, it is possible to establish the probabilities for each processor accessing M1, M2, or having no memory access, p_ij, where i indicates the processor and j indicates the memory (M1 or M2), with 0 implying no memory operation. These are given below in Eq. (1) for the 8085 processor. The probabilities can be arrived at by a number of different methods. The first applies to relatively simple code where the code/data fetches can be simply counted, including any looping. A second is by using traditional code analyzers or simulators. A third is by choosing the desired result and working backwards to see what ISA or hardware implementation would be required to achieve the desired result.

$$
P = \begin{bmatrix} p_{11} & p_{12} & p_{10} \\ p_{21} & p_{22} & p_{20} \end{bmatrix}
  = \begin{bmatrix} 9/16 & 1/16 & 6/16 \\ 1/16 & 9/16 & 6/16 \end{bmatrix}. \tag{1}
$$

With the probabilities of Eq. (1), there are six possible states for the 2P/2M system. These six states are shown in Table 2. This method of formulating the problem is similar to that of Stone [7]. The mathematical analysis is based on a Markov state process [9] as opposed to Markov chains [6]. The steady-state results are the same. However, the Markov modeling method allows the transient solution to be obtained directly in addition to the steady-state result.

The states of Table 2 are shown in the state diagram of Fig. 3, where the state number is shown at the top inside each circle, with 1,0 or 0,0, etc., representing the M1, M2 requests for that particular state. The calculation of the state transition probabilities is straightforward. For example, assume the system is in state 1. Both memory requests can be satisfied in that state. Thus, the probability of going to state 4 is the probability of P1 not having a memory request, p10 = 6/16, times the probability that P2 will not have a memory request, p20 = 6/16, i.e., 36/256.
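As an added illustration (not part of the original paper), the probabilities of Eq. (1) and the state 1 to state 4 transition worked above can be reproduced in a few lines of Python. The dictionary `cycles` and the helper `access_probs` are illustrative names; the only inputs are the cycle counts of Table 1 and the stated assumption that fetches and reads go to a processor's own memory, writes go to the other memory, and execute cycles use no memory.

```python
from fractions import Fraction as F

# Cycle counts from Table 1: (fetches, reads, writes, executes)
cycles = {"8085": (8, 1, 1, 6), "8051": (6, 1, 1, 5), "6800": (8, 2, 2, 5)}

def access_probs(fetch, read, write, execute):
    """Return (p_own, p_other, p_none): fetches and reads hit the processor's
    own memory, writes hit the other memory, execute cycles use no memory."""
    total = fetch + read + write + execute
    return F(fetch + read, total), F(write, total), F(execute, total)

p_own, p_other, p_none = access_probs(*cycles["8085"])
print(p_own, p_other, p_none)   # 9/16 1/16 3/8 (= 6/16), matching Eq. (1)

# From state 1 both requests are served, so each processor issues a new request.
# Probability of moving to state 4 (no requests) is p10 * p20 = 6/16 * 6/16.
print(p_none * p_none)          # 9/64, i.e., 36/256
```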
Fig. 3. Six state diagram for the two (2) 8085 parallel processor architecture.
5. The state diagram and transitions

The calculation for transitions from state 1 to states 2 and 3 is similar. The primary difference comes when there is a transition from state 1 to states 5 and 6. Consider for example the transition from state 1 to state 5. In state 1, again, both memory requests are going to be satisfied; thus the probability of going to state 5 is the probability of (1) P1 requesting M1 times the probability of P2 having no memory request, plus (2) P1 having no memory request times the probability of P2 requesting M1.

In states 2 and 3, there are two memory requests for the same memory. Thus, only one memory request can be satisfied. It is necessary to establish a policy as to which request will be satisfied. In this development, it will be assumed that the P2 request will be satisfied in state 2, and the P1 request will be satisfied in state 3. The transition probabilities are thus based on the probabilities for the processor whose request is satisfied. The complete set of states and transition probabilities is given in the state transition diagram of Fig. 3.

The state diagram of Fig. 3 provides a simple method to analyze the steady-state behavior of the system. The probabilities of being in each of the six states form the vector of absolute state probabilities, P, shown in Eq. (2). Given this vector, the steady-state result can be obtained as follows [9]:

$$
P = (P_1, P_2, P_3, P_4, P_5, P_6) = P\,\mathbf{P}, \tag{2}
$$

where the matrix on the right is the state transition probability matrix of Fig. 3. Eq. (2) is a set of six homogeneous linear simultaneous equations with rank n − 1. Rank n − 1 implies there is one less independent equation than the number of unknowns. However, the sum of the probabilities must be 1, and this equation provides the nth independent equation [9]. Solving n − 1 of the equations given in Eq. (2) together with the sum of the probabilities will give the vector of absolute state probabilities, P. Eq. (2) for the specific example of the 8085 is given in Eq. (3):

$$
P = P \begin{bmatrix}
82/256 & 9/256 & 9/256 & 36/256 & 60/256 & 60/256 \\
9/16   & 1/16  & 0     & 0      & 6/16   & 0      \\
9/16   & 0     & 1/16  & 0      & 0      & 6/16   \\
82/256 & 9/256 & 9/256 & 36/256 & 60/256 & 60/256 \\
82/256 & 9/256 & 9/256 & 36/256 & 60/256 & 60/256 \\
82/256 & 9/256 & 9/256 & 36/256 & 60/256 & 60/256
\end{bmatrix}. \tag{3}
$$

The steady-state solution (vector of absolute state probabilities) is given in Eq. (4):

$$
P = (0.33721,\ 0.03488,\ 0.03488,\ 0.13081,\ 0.23111,\ 0.23111). \tag{4}
$$
From Eq. (4), the four states in which execution will proceed without interruption are states 1, 4, 5, and 6. Adding the probabilities of being in these states gives the probability of no memory contention during execution. This result is given in Eq. (5). Eq. (6) gives the probability of execution being delayed due to a memory contention, or alternatively the probability of only one processor executing.

$$
P_1 + P_4 + P_5 + P_6 = 0.93024. \tag{5}
$$

$$
P_2 + P_3 = 0.06976. \tag{6}
$$
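As a short numerical check, added here and not from the paper itself, Eq. (2) can be solved exactly as described above: one of the homogeneous equations is replaced by the normalization condition. The sketch below does this with NumPy for the 8085 transition matrix of Eq. (3); the variable names are incidental.

```python
import numpy as np

# State transition matrix of Eq. (3) for two 8085 processors (rows sum to 1).
row_free = [82/256, 9/256, 9/256, 36/256, 60/256, 60/256]
T = np.array([row_free,
              [9/16, 1/16, 0, 0, 6/16, 0],   # state 2: P2's request is served
              [9/16, 0, 1/16, 0, 0, 6/16],   # state 3: P1's request is served
              row_free, row_free, row_free])

# Steady state: P = P T together with sum(P) = 1.
# Transpose to (T' - I) P' = 0 and replace the last equation by the normalization.
A = T.T - np.eye(6)
A[-1, :] = 1.0
b = np.zeros(6)
b[-1] = 1.0
P = np.linalg.solve(A, b)

print(np.round(P, 5))  # approximately (0.33721, 0.03488, 0.03488, 0.13081, 0.23110, 0.23110); cf. Eq. (4)
print(round(P[[0, 3, 4, 5]].sum(), 4))  # approximately 0.9302, the no-contention probability of Eq. (5)
```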
In states 1, 4, 5, and 6, two (2) memory cycles can be executed simultaneously because there are no memory conflicts and no processor is waiting.
Table 3
Results for the three different processors

Processor    X//Y
8085         1.93024
8051         1.91766
6800         1.86442
In states 2 and 3, only one (1) memory cycle can be executed due to the contention at the memory, and thus only one processor can be served. Thus, on the average,

$$
2(0.93024) + 1(0.06976) = 1.93024 \text{ simultaneous memory cycles} \tag{7}
$$
can be achieved in the time of one (1) typical memory contention (or execution) cycle. Note that in the current example, one cycle is used for each fetch, execute, read, and write. This is a measure of the effectiveness of the two (2) 8085 processors in parallel based on the instruction set architecture executing this benchmark.

In general, for the two (2) processor parallel computer, the best that can be expected is 2; more generally, the number of processors represents the optimum execution of the parallel combination. The notation can be viewed simply as follows: with two (2) parallel processors of computational capability X and Y, in cycles per second, the best possible capability for the parallel combination is X + Y. The achieved parallel combination will be noted as X//Y. The problem of parallel computing in general can then be said to be one of obtaining X//Y as close to X + Y as possible. This is the concept, and no other notation will be pursued at this point. The results for all three processors are given in Table 3. From Table 3, it can be said that there is a difference among the three ISAs.

6. A simple steady-state calculation method

The above result can be obtained using an alternate method which is easily implemented on a hand held calculator, avoiding the solution of the linear simultaneous equations. When the probability matrix, P, of Eq. (2) is raised to a power, P^n, in the limit all of its rows are equal, and each row is the vector P of Eqs. (2) and (4) [6]. In most hand held calculators, it is a simple procedure to raise a matrix to a power. Because of the eigenvalues of the matrix involved, the exponent value required to reach this steady state is typically less than 20. In any case, it is easy to look at the result to see if the rows are all equal. If they are not, choose a larger value of n. Calculators such as the TI-82 can also invert the matrix and multiply it onto the right-hand side vector. The matrix exponentiation has the advantage of not involving any divides, which in general makes the process more computationally stable.

7. The effect of a program cache

The 2P/2M system of Fig. 1 is now modified by including an instruction cache as shown in Fig. 4. The notation 2P/2C/2M will be used for the system of Fig. 4. The 2P/2C/2M example using the 8085 processors has the probability matrix of Eq. (8); with the instruction fetches served from the cache, only the single data read and the single data write cycles reach memory. The introductory treatment of cache has been previously reported [6].

$$
P = \begin{bmatrix} p_{11} & p_{12} & p_{10} \\ p_{21} & p_{22} & p_{20} \end{bmatrix}
  = \begin{bmatrix} 1/16 & 1/16 & 14/16 \\ 1/16 & 1/16 & 14/16 \end{bmatrix}. \tag{8}
$$
Fig. 4. 2P/2C/2M example.
The state transition diagram for the probabilities of Eq. (8) is given in Fig. 5. The equation for P is given in Eq. (9):

$$
P = P \begin{bmatrix}
2/256 & 1/256 & 1/256 & 196/256 & 28/256 & 28/256 \\
1/16  & 1/16  & 0     & 0       & 14/16  & 0      \\
1/16  & 0     & 1/16  & 0       & 0      & 14/16  \\
2/256 & 1/256 & 1/256 & 196/256 & 28/256 & 28/256 \\
2/256 & 1/256 & 1/256 & 196/256 & 28/256 & 28/256 \\
2/256 & 1/256 & 1/256 & 196/256 & 28/256 & 28/256
\end{bmatrix}. \tag{9}
$$

The steady-state solution (vector of absolute state probabilities) is given in Eq. (10). Eqs. (11) and (12) give the probabilities for no memory contention (no execution delay, 2 fetches) and memory contention (execution delay, 1 fetch), respectively.

$$
P = (0.00826,\ 0.00413,\ 0.00413,\ 0.75930,\ 0.11209,\ 0.11209). \tag{10}
$$

$$
P_1 + P_4 + P_5 + P_6 = 0.99174. \tag{11}
$$

$$
P_2 + P_3 = 0.00826. \tag{12}
$$
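The matrix-power shortcut of Section 6 is easy to check numerically. The sketch below, an illustration added here rather than part of the original paper, raises the cache-case matrix of Eq. (9) to the 20th power with NumPy; every row of the result converges to the steady-state vector of Eq. (10).

```python
import numpy as np

# Transition matrix of Eq. (9) for two 8085 processors with instruction caches.
row_free = [2/256, 1/256, 1/256, 196/256, 28/256, 28/256]
T = np.array([row_free,
              [1/16, 1/16, 0, 0, 14/16, 0],
              [1/16, 0, 1/16, 0, 0, 14/16],
              row_free, row_free, row_free])

# Section 6: raise the matrix to a power until all rows agree; each row is then
# the vector of absolute state probabilities.
Tn = np.linalg.matrix_power(T, 20)
print(np.round(Tn[0], 5))   # approximately (0.00826, 0.00413, 0.00413, 0.75930, 0.11208, 0.11208)
print(np.allclose(Tn, Tn[0]))   # True: all rows are (numerically) equal

# Average simultaneous memory cycles, as in Eq. (7), for the cache case.
print(round(2 * Tn[0, [0, 3, 4, 5]].sum() + Tn[0, [1, 2]].sum(), 5))   # approximately 1.99174
```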
Fig. 5. 8085 two (2) parallel processor architecture with cache.
Table 4
Execute cycle comparison

            Two-fetch probability   One-fetch probability   X//Y
No cache    0.93024                 0.06976                 2(0.93024) + 1(0.06976) = 1.93024
Cache       0.99174                 0.00826                 2(0.99174) + 1(0.00826) = 1.99174
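For reference, the quoted speed-up follows directly from the X//Y column of Table 4: 1.99174/1.93024 ≈ 1.032, i.e., roughly a 3.2% improvement.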
The results of Eqs. (11) and (12) can now be compared with the no cache version of Eqs. (5) and (6), respectively, in Table 4. The complete set of all results for all processors is given in Table 8. Based on the results of Table 4, the speed-up obtained by adding the cache with two (2) 8085 processors is approximately 3.2%. This shows that for the specific example chosen, the use of a cache does not provide a significant increase in performance for this benchmark. This result is expected due to the efficiencies of the program, including looping.

8. The effect of the execute cycle

The execute cycle that was added to the example hardware implementation was to account in some way for the time during execution when the memory bus is inactive and can be used by another processor. This was done to apply the same execution timing (a function of hardware implementation) to all ISAs to see the effects of only the ISA on execution efficiency in a parallel configuration. Table 5 gives the counts from Table 1 without the execute cycle that was added. The vector of absolute state probabilities for the 8085 of Table 5 is given below in Eq. (13). The state transition diagram in this case has only 3 states, with 2 fetches in state 1 and 1 fetch in each of state 2 and state 3.

$$
P = (0.83333,\ 0.08333,\ 0.08333). \tag{13}
$$
Using these results, the average fetch per cycle for the no cache situation is given in Table 6. These results reflect the number of bytes to encode the instructions necessary (chosen) to execute the benchmark. This is one typical measure of an ISA. The comparison can now be made between the use of execute or no execute cycles for the 8085 ISA. The extra time interval for the memory bus gives an additional 5.3% (from Tables 4 and 6). Thus, for this particular example, the speed-up of the memory read with respect to the total instruction execution gives a better result than the addition of an instruction cache, i.e., 5.3% vs. 3.2% (from Table 4).
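The 5.3% figure follows from the two 8085 X//Y values: 1.93024 with the execute cycle (Table 4) versus 1.83333 without it (Table 6), and 1.93024/1.83333 ≈ 1.053.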
Table 5
Memory access cycles for the three ISAs

           8085   8051   6800
Fetches    8      6      8
Reads      1      1      1
Writes     1      1      1
Total      10     8      10
Table 6
Three-state X//Y with no cache

        Two-fetch probability   One-fetch probability   X//Y
8085    0.83333                 0.16667                 2(0.83333) + 1(0.16667) = 1.83333
8051    0.80000                 0.20000                 2(0.80000) + 1(0.20000) = 1.80000
6800    0.83333                 0.16667                 2(0.83333) + 1(0.16667) = 1.83333
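As another added illustration, not from the original paper, the three-state model of Section 8 can be set up directly from the counts of Table 5. Under the same assumptions as before (fetches and reads to a processor's own memory, writes to the other, P2 served in state 2 and P1 served in state 3), the sketch below reproduces the two-fetch probabilities of Table 6; the exponent 30 is an arbitrary value large enough for convergence.

```python
import numpy as np

# Memory access cycles from Table 5 (no execute cycle): (fetches, reads, writes)
table5 = {"8085": (8, 1, 1), "8051": (6, 1, 1), "6800": (8, 1, 1)}

for name, (fetch, read, write) in table5.items():
    total = fetch + read + write
    p_own, p_other = (fetch + read) / total, write / total   # every cycle is a memory cycle
    # Three states: 1 = one request per memory, 2 = both request M1, 3 = both request M2.
    T = np.array([
        [p_own**2 + p_other**2, p_own*p_other, p_other*p_own],  # both requests served
        [p_own, p_other, 0.0],                                  # state 2: P2's request served
        [p_own, 0.0, p_other],                                  # state 3: P1's request served
    ])
    P = np.linalg.matrix_power(T, 30)[0]    # Section 6's matrix-power shortcut
    xy = 2 * P[0] + (P[1] + P[2])
    print(name, np.round(P, 5), round(xy, 5))
# Prints approximately:
#   8085 (0.83333, 0.08333, 0.08333) 1.83333
#   8051 (0.80000, 0.10000, 0.10000) 1.80000
#   6800 (0.83333, 0.08333, 0.08333) 1.83333
```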
9. Actual manufacturer execution cycles

The manufacturer data sheets were used to obtain the actual number of cycles of the clock that are required for memory access and for execution, which leaves the memory bus free. These are given in Table 7. The actual cycles were used to calculate the cycle efficiency for each of the three processors, assuming no cache and cache, with the complete tabulation given in Table 8.

10. Summary

All of the given cases were solved for the cycle efficiency with and without cache for the actual cycle timing of the data sheets. All results are shown, including the results of the general ISA execution cycles initially presented. Table 8 gives the summary of results for this paper. The "General implementation" columns reflect how the three assembly languages perform in solving this particular benchmark program in a two processor parallel configuration. From the table, it can be seen that there is little difference among the three assembly languages. This is not entirely unexpected. The primary purpose here is to demonstrate how this instruction set architecture comparison can be performed.
Table 7
Actual cycles

           8085   8051   6800
Fetches    24     36     11
Reads      3      4      2
Writes     3      4      2
Executes   9      4      9
Total      39     48     24
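To connect Table 7 with Table 8, the sketch below (an added illustration, not from the paper) pushes the actual cycle counts through the same six-state model used throughout; the resulting values track the "Data book cycles / No cache" column of Table 8. The helper name `xy` is an arbitrary choice.

```python
import numpy as np

# Actual cycles from Table 7: (fetches, reads, writes, executes)
table7 = {"8085": (24, 3, 3, 9), "8051": (36, 4, 4, 4), "6800": (11, 2, 2, 9)}

def xy(p_own, p_other, p_none):
    """X//Y for the six-state 2P/2M model (P2 served in state 2, P1 in state 3)."""
    free = [p_own*p_own + p_other*p_other, p_own*p_other, p_other*p_own,
            p_none*p_none, p_own*p_none + p_none*p_other, p_other*p_none + p_none*p_own]
    T = np.array([free,
                  [p_own, p_other, 0, 0, p_none, 0],
                  [p_own, 0, p_other, 0, 0, p_none],
                  free, free, free])
    P = np.linalg.matrix_power(T, 30)[0]    # Section 6's matrix-power shortcut
    return 2 * (P[0] + P[3] + P[4] + P[5]) + (P[1] + P[2])

for name, (fetch, read, write, execute) in table7.items():
    total = fetch + read + write + execute
    print(name, round(xy((fetch + read) / total, write / total, execute / total), 5))
# Prints approximately 1.89655, 1.86842 and 1.91034; compare the
# "Data book cycles / No cache" column of Table 8.
```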
Table 8
Summary results (X//Y)

            General implementation                Data book cycles
Processor   No cache    Cache      % Inc          No cache    Cache      % Inc
8085        1.93024     1.99174    3.19           1.89656     1.98736    4.79
8051        1.91766     1.98736    3.63           1.86842     1.95172    4.46
6800        1.86442     1.96960    5.64           1.91034     1.98508    3.91
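Each "% Inc" entry in Table 8 is the percentage increase of the cache X//Y over the corresponding no cache value; for the 6800 general implementation, for example, 1.96960/1.86442 ≈ 1.0564, i.e., 5.64%.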
The "Data book cycles" columns reflect the implementation of the instruction set into a particular hardware configuration. To perform the analysis, the actual clock cycles for each of the instructions involved were used to determine the relative amount of time that the memory read/write operations take. The most significant result here is the amount of improvement that is obtained by employing a cache. The primary reason is that the amount of time spent in memory activity is greater than in the General Implementation, where a single machine cycle was assumed for each instruction execution.

The methodology presented provides a detailed, tractable modeling method to obtain a closed form expression that includes details of executing programs with a variety of ISAs, and a multiplicity of hardware implementations for the ISAs in terms of CPI, including the ability to keep the clocks for fetch distinct from the clocks for execute. The simple probability matrix P^n in the mathematical model, P(n) = P(0)P^n, provides the complete solution.

References

[1] A. Lunde, Empirical evaluation of some features of instruction set processor architectures, Communications of the ACM 20 (3) (1977) 143–153.
[2] M.J. Flynn, J.D. Johnson, S.P. Wakefield, On instruction sets and their formats, IEEE Transactions on Computers C-34 (3) (1985) 242–254.
[3] R.J. Baron, L. Higbie, Computer Architecture, Chapter 2, Addison-Wesley, Reading, MA, 1992.
[4] J.L. Hennessy, D.A. Patterson, Computer Architecture: A Quantitative Approach, 2nd ed., Morgan Kaufmann, San Francisco, 1996.
[5] A. Osborne, An Introduction to Microprocessors: Some Real Microprocessors, vol. 2, Osborne and Associates, Inc., 1978.
[6] M.H. Mickle, W.G. Vogt, Stochastic modeling and dynamic analysis of computer architectures, Control and Computers 23 (2) (1995) 38–43.
[7] H.S. Stone, Introduction to Computer Architecture, 2nd ed., Science Research and Associates, Chicago, 1980.
[8] J.P. Hayes, Computer Architecture and Organization, McGraw-Hill, Boston, 1998.
[9] M.H. Mickle, T.W. Sze, Optimization in Systems Engineering, Intext Educational Publishers, Scranton, 1971.