Performance Analysis of Hierarchical Cache-Consistent Multiprocessors *


Mary K. Vernon, Computer Sciences Department, University of Wisconsin, Madison, WI 53706, U.S.A.

Rajeev Jog, Hewlett-Packard, 19420 Homestead Road, Cupertino, CA 95014, U.S.A.

Gurindar S. Sohi, Computer Sciences Department, University of Wisconsin, Madison, WI 53706, U.S.A.

Received 9 March 1989

This paper investigates the potential performance of hierarchical, cache-consistent multiprocessors. We have developed a mean-value queueing model that considers bus latency, shared memory latency and bus interference as the primary sources of performance degradation in the system. A key feature of the model is its choice of high-level input parameters that can be related to application program characteristics. Another important property of the model is that it is computationally efficient: very large systems can be analyzed in a matter of seconds. Results of the model show that system topology has an important effect on overall performance. We find optimal two-level, three-level and four-level topologies that distribute the bus traffic uniformly across all levels in the hierarchy. We provide processing power estimates for the optimal topologies, under a particular set of workload assumptions. For example, the optimal three-level topology supports 512 processors each with a peak processing rate of 4 MIPS, and provides an effective 1400 (1700) MIPS in processing power, if the buses operate at 20 (40) MHz. This result assumes 22% of the data references are to globally shared data and that the shared data is read on the average by a significant fraction of processors between write operations. Results of our study also indicate that for reasonably low cache miss rates (3% at level 0), and 20 MHz buses, the bus subnetwork saturates with processor speeds of 6-8 MIPS, at least for topologies of five or fewer levels. Finally, we present parametric results that indicate how performance is affected by one of the parameters that characterizes data sharing in the workload.

Keywords: Multiprocessors, Hierarchical Multiprocessors, Performance Analysis, Mean Value Analysis, Cache Consistency Protocols.

* This work was supported by the National Science Foundation under grant numbers DCR-8451405 and CCR-8706722, and by grants from IBM, Cray Research, Inc. and the AT&T Foundation.

Performance Evaluation 9 (1988/89) 287-302. North-Holland.

Mary K. Vernon received a B.S. degree with Departmental Honors in Chemistry in 1975, and the M.S. and Ph.D. degrees in Computer Science in 1979 and 1983, from the University of California at Los Angeles. She is currently an Assistant Professor of Computer Science at the University of Wisconsin-Madison. Her research interests include techniques for performance modeling of parallel systems, performance Petri nets, and analysis of multiprocessor design issues. In 1985 Dr. Vernon received a Presidential Young Investigator Award from the National Science Foundation. She is a member of the IEEE, the ACM, the C.S. Advisory Board of the Computer Measurement Group, and the board of ACM SIGMETRICS.

Rajeev Jog was born in Bombay on August 5, 1963. He received the B.Tech. degree in Electrical Engineering from the Indian Institute of Technology, Bombay, in 1985 and the M.S. degree in computer science from the University of Wisconsin-Madison in 1987. In 1988 he joined Hewlett-Packard, Cupertino, CA, where he is currently a performance analyst in the Hardware Systems Performance group. His interests include performance modeling and analysis, computer architecture, multiprocessors and computer networks. Mr. Jog is a member of the Association for Computing Machinery. From 1980 to 1985 he was a recipient of a Government of India National Talent Scholarship.

Gurindar S. Sohi received his B.E. (Hons.) degree in Electrical Engineering from the Birla Institute of Technology and Science, Pilani, India in 1981 and the M.S. and Ph.D. degrees in Electrical Engineering from the University of Illinois, Urbana-Champaign, in 1983 and 1985, respectively. Since September 1985, he has been with the Computer Sciences Department at the University of Wisconsin-Madison, where he is currently an Assistant Professor. His interests are in computer architecture, parallel and distributed processing and fault-tolerant computing.

0166-5316/89/$3.50 © 1989, Elsevier Science Publishers B.V. (North-Holland)


1. Introduction

The design of high performance processor-memory interconnect architectures is currently of critical importance to the success of large-scale shared memory multiprocessors. An interconnect architecture that provides each processor with a high-speed cache for shared memory, and that maintains consistency among all copies of shared data, provides an elegant programming model at all levels of software development. Within the past year or two, several proposals have appeared for large-scale multiprocessor interconnect architectures that are based on multiple shared buses and a class of efficient cache consistency protocols known as snooping cache protocols [1,3,10]. In this paper we are interested in the strictly hierarchical architecture defined by Wilson [10], and by Jog et al. [4]. This architecture consists of (1) a hierarchical organization of cache memories interconnected by shared buses, and (2) a snooping cache consistency protocol. We call this architecture, illustrated in Fig. 1, the TREEBus architecture.

In this paper, we develop an analytical model to evaluate the performance of the TREEBus interconnect. The model equations are intuitive in the sense that each can be explained in terms of the mechanics of the TREEBus architecture. Our performance model considers bus latency, shared memory latency, and bus interference as the primary sources of performance degradation in the system. Cache interference is not considered, but could be added in the future. An important feature of our model is that we have designed the input parameters to be easily related to application program characteristics. The parameters include the fraction of memory references that are to data blocks shared locally within a cluster, the fraction of references to blocks shared globally among all processors, and the probability that a processor finds the shared blocks it references invalid due to writes by other processors. Another important characteristic of the model is its computational efficiency; performance estimates can be obtained for very large systems in a matter of seconds. We have used it to investigate systems with up to 2048 processors in this paper.

Our confidence in the viability of the model is derived primarily from two sources. First, although the model is approximate and iterative, it uses a technique that was shown to be surprisingly accurate for evaluating single-bus cache-consistency protocols [8]. Second, the model produces results that can be explained, after careful thought, in terms of expected system behavior. Furthermore, our results agree qualitatively with independent simulation results in [9] and [10].

The remainder of this paper is organized as follows. Section 2 contains a description of the machine organization and of the write-back cache consistency protocol. Section 3 presents the memory request response time equations and Section 4 discusses the model input parameters, output measures, and iterative solution technique. Section 5 contains the initial results of our experiments with this model. Finally, Section 6 contains the conclusions of this study.

2. The TREEBus architecture

In the TREEBus architecture (Fig. 1), the processors form the leaves of the tree and the shared memory is at the root of the tree. All memories, with the possible exception of the global shared memory, are organized as caches. The processors form level 0 of the system. Each processor is connected to a level 1 cache. Several level 1 caches are connected to a level 1 bus. Such a collection of processors on a single bus forms a super-processor or cluster. Several of these clusters are now connected through caches to another shared bus to form the next level in the hierarchy. This can be applied recursively to an arbitrary number of levels in the system. The entire system forms a tree with a branching factor of $n_i$ at level i. The peer caches of a cache C at level i are the $n_i - 1$ caches connected to the same bus. The descendants of a cache C are all caches that are present in the subtree whose root is C.

Data in shared memory is organized into fixed-size blocks, which are the units over which consistency is maintained. Copies of data blocks in the shared memory migrate to the level 1 caches in response to processor read and write operations. The cache consistency protocol ensures that copies of the data do not become inconsistent. If a processor request cannot be serviced by its level 1 cache, it appears on its level 1 bus. The bus request will either be serviced by a level 1 peer cache or by the level 2 cache.

[Fig. 1. A three level TREEBus system. The figure shows the shared memory at the root; level 3, level 2, and level 1 buses and caches; the processors at the leaves; and the 0th, 1st, and 2nd cousin processor groups marked.]

The level 2 cache in turn may have to propagate a request to its higher level bus to fetch the data and/or to invalidate other copies. When data is fetched, it filters through the levels of the tree until it reaches the cache of the requesting processor.

In order to simplify the implementation of the cache consistency protocol, the caches must satisfy an inclusion property [1]. This means that a cache must contain a few bits of tag information for all blocks that are present in any of its descendant caches. Due to the inclusion property, the size of each cache (memory) must be quite large¹. Given the density of modern-day dynamic RAMs, large caches for this purpose are feasible.

In Subsection 2.1, we define the cache block states used in our description of the cache consistency protocol for the TREEBus system. In Subsection 2.2 we outline the important features of the TREEBus protocol.

¹ Wilson suggests that the cache size for a level i cache is an order of magnitude times the sum of the level i − 1 cache sizes [10].

2.1. Cache block states

Each cache in the system has two directories that each contain two bits of state information per cache block. The processor directory for a level i cache services processor or super-processor (i.e. level i − 1 bus) requests. The bus directory for the level i cache services level i bus transactions. Since the two directories of the cache serve distinct purposes, we choose to distinguish between the states that are maintained in each directory. In other words, our definition of the TREEBus cache consistency protocol uses non-identical dual directories [4]. This facilitates a uniform protocol state description for all levels of the hierarchy. Both directories are updated in a single atomic operation.

The entry for a block in the processor directory can specify one of the following three states:
(1) Clean. The processor can read from the block but cannot write into the block without informing other caches. No write-back is necessary if the block is purged.
(2) Dirty. The processor can read and write from this block without informing any other caches. A dirty block purged from a level i cache must be written back into the level i + 1 cache.
(3) Invalid. The block does not contain valid data.

The entry for a cache block in the bus directory can specify one of the following four states:
(1) Owner Updated Sublet (OUS). This cache is responsible for supplying data in response to a bus request. The cache has an updated copy of the data which it can supply right away. For a write request, the cache must also invalidate copies of the block that have been supplied, i.e. sublet, to its descendant caches.
(2) Owner Not Updated Sublet (ONS). This cache is responsible for supplying data in response to a bus request, but it does not have the latest copy of the data. It must obtain the up-to-date data from one of its descendants, by initiating a recall operation. In case of a write request, it must also invalidate copies of the block in its descendants. For a level 1 cache, the ONS state is merged with the OUS state.
(3) Non-Owner Valid Sublet (NVS). The cache is not responsible for supplying the data. However, since the cache and possibly its descendants have a copy of the block, it must invalidate all such copies on a bus write request.
(4) Invalid (I). The cache does not need to perform any action and can ignore the bus transaction.
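For concreteness only, the two directory entries can be sketched as a pair of two-bit state fields per block. The Python rendering below is our illustration, not code from the paper; the numeric encodings are arbitrary:

```python
from enum import Enum

class ProcState(Enum):
    """Processor-directory states (two bits per cache block)."""
    CLEAN = 0    # readable; a write must first inform other caches; no write-back on purge
    DIRTY = 1    # readable and writable locally; written back to the level i+1 cache on purge
    INVALID = 2  # the block holds no valid data

class BusState(Enum):
    """Bus-directory states (two bits per cache block)."""
    OUS = 0  # Owner, Updated, Sublet: supplies data directly on a bus request
    ONS = 1  # Owner, Not updated, Sublet: must recall the latest data from a descendant
    NVS = 2  # Non-owner, Valid, Sublet: must invalidate its copies on a bus write request
    I = 3    # Invalid: may ignore the bus transaction
```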

2.2. Operation of the data coherence protocol

The main features of the multilevel consistency protocol are outlined below. For further details, the reader is referred to [10] or [4].

2.2.1. Read requests

A READ request is a hit if the block's state in the processor directory is either clean or dirty. If a READ request hits in a cache at level i, the requested block is read from the cache. If the request misses in the cache at level i, the request propagates to the bus at level i. A peer cache on the level i bus will respond to this request only if its bus directory indicates that it is the Owner of the block. If a supplying peer cache has the block marked dirty in its processor side directory, it is instead marked clean and the block is also written back to the level i + 1 cache. The requesting cache marks the block clean in its processor directory and NVS in its bus directory. If no peer cache is the owner, the level i + 1 cache is responsible for supplying the block. In this case, the supplied block is marked as OUS in the bus directory of the requesting cache.

A peer cache may be the Owner of a requested block without having an up-to-date copy. In this case, the data is recalled from the descendant of the owner cache to whom it has been sublet, by propagating the request down to the bus at level i − 1. Note that a read request does not purge the block from the supplying cache or its descendants. On a given bus, only one of the caches connected to it can be the owner of the block. It is possible, however, that several caches, all connected to different buses, may simultaneously have the same block marked in an owner state. The owner caches are responsible for servicing requests that appear on the buses they are connected to.

2.2.2. Write requests and block invalidation

If the processor issues a write request to a block that is marked dirty in the processor directory, the write can proceed without informing anybody else. However, if the block is marked clean, other caches must be informed about the write. In this case, an INVALIDATE request is placed on the bus. The peer caches that have a copy of the block respond by changing the states of the bus and processor directories to invalid; the level i + 1 cache changes the state of its bus directory to ONS and its processor directory to dirty. (The ONS state indicates that the cache is the owner of the block, but it no longer has an up-to-date copy of the data since one of its descendants is modifying it.) Each peer cache at level i that responds to the invalidation request may also need to propagate an INVALIDATE request downward to its level i − 1 bus, to invalidate copies of the block in its descendants. Similarly the level i + 1 cache may need to propagate the INVALIDATE request upward to its level i + 1 bus. Therefore, a single INVALIDATE request at a level i bus can possibly fork into one upward and several downward INVALIDATE requests. The downward requests continue to propagate on all branches that have valid copies of the block. An upward INVALIDATE request stops propagating when it reaches a cache that is already in an ONS state or when it reaches the shared memory at the root of the tree.

When a level i + 1 parent cache determines that the INVALIDATE request need not propagate upward, it sends an invalidate acknowledge signal back to the level i cache. The level i cache in turn propagates the invalidate acknowledge signal to the level i − 1 cache and eventually the processor that initiated the invalidate operation receives the invalidate acknowledge signal. At this point, the processor can proceed. Note that the processor does not wait for downward INVALIDATE requests.

It is possible that a processor might read data from a cache block that another processor is trying to invalidate. If the read occurs before the invalidate arrives, we assume this is a valid behavior of the system². Our protocol guarantees that two or more processors never update the value of the same block simultaneously. The acknowledgement from the upward INVALIDATE will only occur for the first of two simultaneous requests to reach a shared cache common to their propagation paths. The other INVALIDATE request will generate a recall request when it reaches the common shared cache. Note that our model in Section 3 ignores the case of colliding invalidations.

If the processor issues a write request to a block that is invalid in its cache, a READ-MOD (or read-with-the-intent-to-modify) request is generated on the bus. Since the block must be read into the cache of the requesting processor, and since other caches that have the block must be informed about the write, the state changes that take place in the caches when a READ-MOD operation is issued are the same as the combined changes for a READ request and an INVALIDATE request. Both recall and invalidate operations may need to be carried out. In this case, the cache must wait until both are completed, i.e., the delay for a cache to respond to a READ-MOD request is the maximum of the delay for the data and the invalidate acknowledge to arrive.

² Note that this problem exists in any multiprocessor that has more than one level of cache memory (for example an on-chip and an off-chip cache) and allows multiple cached copies of shared data.

3. A model of the TREEBus system

We present a performance model for the hierarchical TREEBus architecture, including the cache consistency protocol outlined in the previous section. The model uses the 'customized mean value analysis' technique introduced in [8] to construct a set of equations that compute mean values of performance quantities in terms of mean values of model inputs. The technique proved to be surprisingly accurate for the analysis of single-bus cache consistency protocols and has been used to model an alternative large-scale cache-consistent multiprocessor architecture in [6].

In this section, we develop equations for mean response times for memory requests and mean bus utilizations in terms of the following input parameters:

(1) Input parameters that characterize the TREEBus machine:
• $n_i$ is the degree of branching from level i to level i − 1.
• $d_{cache,i}$ is the latency for retrieving a block of data from a level i cache in the system.
• $T_{r,i}$ is the (constant) duration of a type r request on the level i bus. A type r request may be type addr, data or both.
• $T_{supply,i}$ is the (constant) time it takes to transfer data from one side of a level i cache to another.
• $\tau$ is the mean time a processor executes between memory requests, including the local cache access time if the request hits in the cache.

(2) Input parameters that characterize the application workload:
• $m_{M,i}$ is the miss rate in the memory at level i. This is the probability that the request will be forwarded to the (i+1)th bus. We use $m_{M,0}$ to denote the probability that a processor request misses in its level 0 memory (i.e., its level 1 cache).
• $P_{C,i}$ is the probability that a READ request on the level i bus can be satisfied by a peer cache at the same level.
• $m_{PCS,i}$ is the miss rate seen by read requests that will be satisfied by a peer cache at level i. This is the probability that a block needs to be 'recalled'.
• $m_{R,i}$ is the miss rate seen by recall requests on the level i bus.
• $P_{INV}$ is the probability that the processor request is a write request to a data block that is currently in state Clean.
• $P_{\overline{ONS}|cache\,supply,i}$ is the probability that a block is not in state ONS in the level i memory, given that the block is supplied by a peer cache.

The workload parameters given above are very low-level parameters. The derivation of these parameters from higher-level input parameters that relate more directly to application program characteristics is addressed in Section 4.

In the development of the mean-value model below, notation is defined as needed, and is summarized in Appendix I. We treat cache access as being conflict free, and assume that contention arises primarily at the buses in the system. We believe, based on previous experience in analyzing single-bus cache consistency protocols [7,8], that the former is a second-order effect as compared with the latter, and can be ignored in an initial analysis of system performance. We first consider only READ requests. In Subsection 3.4 we comment on how the equations are modified to account for READ-MOD and INVALIDATE requests. The experiments described in Section 5 were run using the complete model, including the extensions for invalidation traffic. Subsection 3.5 discusses the performance measures that can be computed from the model outputs.

3.1. Preliminaries

3.1.1. Tree organization

Let L be the total number of levels in the bus hierarchy. For each bus at level i, we can partition the set of processors in the system into L − i + 1 generations of 'cousins' in the tree. Fig. 1 shows the processor partition for a level 1 bus. The 0th generation of cousins is the set of processors in the subtree rooted at the level i bus. Requests from these processors may propagate directly up to the level i bus. For k, L − i ≥ k ≥ 1, the kth cousins of a given level i bus are the processors whose requests may propagate up to a level i + k bus, and then down to the level i bus in the form of recall or invalidation requests. These processors are leaf nodes of the subtree rooted at the level i + k bus that is an ancestor of the level i bus, excluding the subtree rooted at level i + k − 1 that contains the level i bus itself.

Using the following notation:
• $N_{k,i}$ is the number of kth cousins that can generate requests on a level i bus,
the number of 0th cousins for a bus at level i is the product of the degrees of branching for levels 1 through i:

$$N_{0,i} = \prod_{j=1}^{i} n_j,$$

and the number of kth cousins, $1 \leq k \leq L - i$, is

$$N_{k,i} = (n_{i+k} - 1) \prod_{j=1}^{i+k-1} n_j.$$
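As a concrete check of the cousin counts, a small helper (ours, not from the paper; the topology below is one of those studied in Section 5):

```python
import math

def cousins(n, i, k):
    """N_{k,i}: number of kth cousins that can generate requests on a level i bus.

    n[j] is the branching factor from level j to level j-1 (1-indexed; n[0] unused).
    """
    if k == 0:
        return math.prod(n[1:i + 1])                  # prod_{j=1..i} n_j
    return (n[i + k] - 1) * math.prod(n[1:i + k])     # (n_{i+k} - 1) * prod_{j=1..i+k-1} n_j

# The N1 x 8 x 4 topology of Section 5 with N1 = 16 (512 processors in all):
n = [None, 16, 8, 4]
print(cousins(n, 1, 0), cousins(n, 1, 1), cousins(n, 1, 2))  # 16 112 384, summing to 512
```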

Note that the model assumes a symmetric system, though it is easily extended for the asymmetric case.

3.1.2. Request propagation probabilities

A read request that must propagate to another level requires two separate operations on the level i bus: an 'address' operation (type = addr), and a data transfer operation (type = data), which is the delayed response to the address operation. A bus request that does not propagate to another level before being satisfied requires a fixed number of cycles to transmit an address plus a data block. We label this type of operation as type both³.

Using the input miss rates $m_{M,i}$, $P_{C,i}$, $m_{PCS,i}$, and $m_{R,i}$, we can compute the probabilities that requests of various types propagate to a level i bus. Throughout the remainder of this paper, we use the subscript i to denote a level in the TREEBus hierarchy, the subscript k to denote the generation of the cousin that generated the request that has reached level i, and the subscript r to denote the type of bus operation required by the bus request. We use the following notation for the request propagation probabilities:
• $\delta_{k,i}$ is the probability that a request on the ith bus from a kth cousin will propagate up to the bus at level i + 1.
• $\mu_{k,i}$ is the probability that a request on the ith bus from a kth cousin will propagate down to a bus at level i − 1.
• $\alpha_{k,i}$ is the probability that a request from a kth cousin propagates to a particular level i bus.
• $\beta_{k,r,i}$ is the probability that a request for the ith bus from a kth cousin will require an operation of type r.
The propagation probabilities for read requests are computed in a straightforward way from the miss rates and peer cache supply probabilities. The formulae are given in Appendix II.

³ Note that we have not explicitly modeled the bus traffic for writing back blocks that are purged in the level i cache. This traffic can be included indirectly by adding the time for transmitting the purged data block, weighted by the probability that the purged block is dirty, to $T_{addr,0}$, the time for the 'address' cycle of READ requests from 0th cousins. (See the definition of $T_{r,i}$.)

3.2. Response times for block requests

Below we develop a set of recursive equations for the total mean time between memory requests.

We use the following notation:
• $R$ is the mean total time between memory requests made by a processor.
• $t_{k,i}$ is the mean response time for class k requests at level i. These are the requests generated by the kth cousins of i. The response time includes the time for retrieving data from lower level buses or recalling data from higher level buses, if necessary.
• $t_{Q,k,i}$ is the mean time a request from a kth cousin spends waiting for and receiving service from the level i bus.
• $\overline{W}_{k,i}$ is the mean waiting time at the level i bus for a request from a kth cousin.

Each processor is assumed to continuously execute the following cycle: it spends some time executing, then makes a memory reference. The total time between processor memory requests will consist of the processor's execution time, plus the delay to retrieve data remotely if the request misses in the cache:

$$R = \tau + m_{M,0}\,(t_{0,1} + T_{supply,1}). \tag{1}$$

The mean response time for READ requests from a kth cousin at level i is the sum of the time the request waits for and receives service on the bus, plus the mean delay for propagating the request to other bus levels whenever this is needed. With probability $\mu_{k,i}$ the request will propagate to level i − 1 and become a request from a (k+1)th cousin. With probability $\delta_{k,i}$, the request will propagate to level i + 1. Otherwise, the request is satisfied at level i. Thus,

$$t_{k,i} = t_{Q,k,i} + (1 - \delta_{k,i} - \mu_{k,i})\,d_{cache,i} + \delta_{k,i}\,(t_{k,i+1} + T_{supply,i+1}) + \mu_{k,i}\,(t_{k+1,i-1} + T_{supply,i}). \tag{2}$$

The mean time a READ request spends in the queue for the level i bus is the waiting time plus the service time for the request, weighted by the probability that the request is a particular type of operation:

$$t_{Q,k,i} = \beta_{k,both,i}\,(\overline{W}_{k,i} + T_{both,i}) + \beta_{k,addr,i}\,(\overline{W}_{k,i} + T_{addr,i} + \overline{W}_{k,i} + T_{data,i}).$$

Since $\beta_{k,data,i} = \beta_{k,addr,i}$, this last equation can be rewritten as follows:

$$t_{Q,k,i} = \sum_{r \in \{addr,\,data,\,both\}} \beta_{k,r,i}\,(\overline{W}_{k,i} + T_{r,i}). \tag{3}$$
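Equations (2) and (3) define a mutual recursion over bus levels. The sketch below is ours, not from the paper; `p` is a hypothetical bundle of the model inputs, and boundary handling at the root and leaf levels is elided:

```python
def response_time(k, i, p):
    """Sketch of equations (2)-(3): mean response time of a class-k request at level i.

    p.beta, p.delta, p.mu hold the propagation probabilities; p.T, p.T_supply,
    p.d_cache the timing inputs; p.W the current bus waiting-time estimates.
    Termination relies on delta being 0 at the root bus and mu being 0 at level 1.
    """
    t_q = sum(p.beta[k, r, i] * (p.W[k, i] + p.T[r, i])
              for r in ("addr", "data", "both"))                    # equation (3)
    t = t_q + (1.0 - p.delta[k, i] - p.mu[k, i]) * p.d_cache[i]     # satisfied at level i
    if p.delta[k, i] > 0.0:                                         # propagates up a level
        t += p.delta[k, i] * (response_time(k, i + 1, p) + p.T_supply[i + 1])
    if p.mu[k, i] > 0.0:                                            # recalled one level down
        t += p.mu[k, i] * (response_time(k + 1, i - 1, p) + p.T_supply[i])
    return t
```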


3.3. Modeling bus contention

To complete the response time calculations in equations (1)-(3), we need to compute the mean bus waiting times, $\overline{W}_{k,i}$, for each generation of cousins k and each level i. We will refer to k as the class of the request on the level i bus. The following notation is used to derive the bus waiting times in this section:
• $U_{r,i}$ is the utilization of the level i bus by operations of type r. Recall that r may be addr, data, or both.
• $f_{Q,k,r,i}$ is the fraction of time a kth cousin spends in the queue or using the ith bus for a type r request.
• $f_{B,k,r,i}$ is the fraction of time a kth cousin actually uses the level i bus for a type r request.
• $\overline{Q}_{r,i|k}$ is the mean queue length of type r requests at the level i bus, as seen by an arriving customer belonging to class k.
• $P_{BUSY,r,i|k}$ is the probability that the bus is in use by an operation of type r when a customer of class k arrives at the level i bus.

We treat each bus in the system as a first-come first-serve queue with a deterministic service time for each request type⁴, and we apply mean value techniques for queueing network models to calculate bus waiting times [5]. Accordingly, the mean queue length seen by a customer of class k arriving to the level i bus queue is estimated by the steady state mean queue length of the bus if the requesting processor is removed from the system. The queue length is approximately the sum of the fraction of time that each other processor spends in the bus queue. Thus, the arrival queue lengths are computed for each request type as follows:

$$\overline{Q}_{r,i|k} = (N_{k,i} - 1)\,f_{Q,k,r,i} + \sum_{c \neq k} N_{c,i}\,f_{Q,c,r,i}. \tag{4}$$

The fraction of time a processor spends in the queue for a particular type of operation is given in equation (8) below. Note that the mean queue length includes the request on the bus, if any, when the class k request arrives. The class k request will wait for the remaining bus access time, or residual life, of the request in service, plus one full service time for every other request in the queue when it arrives. Since the request in service has a deterministic service time of duration $T_{r,i}$, its mean residual life is $\frac{1}{2}T_{r,i}$. The probability that a class k request finds a type r request on the bus is $P_{BUSY,r,i|k}$. The mean waiting time for a class k request is thus:

$$\overline{W}_{k,i} = \sum_r \left[ \left( \overline{Q}_{r,i|k} - P_{BUSY,r,i|k} \right) T_{r,i} + P_{BUSY,r,i|k}\,\frac{T_{r,i}}{2} \right]. \tag{5}$$

The probability that the level i bus is in use by a type r request when a customer of class k arrives is given by:

$$P_{BUSY,r,i|k} = \frac{U_{r,i} - f_{B,k,r,i}}{1 - \sum_r f_{Q,k,r,i}}, \tag{6}$$

where $U_{r,i}$ is the bus utilization by type r requests:

$$U_{r,i} = \sum_{k=0}^{L-1} N_{k,i}\,f_{B,k,r,i}. \tag{7}$$

The fraction of time a type r request from a kth cousin waits for and uses the level i bus is the product of the probability the request propagates to the bus at level i, the probability the request requires a type r operation, and the time the request spends waiting for and performing the type r operation, divided by R:

$$f_{Q,k,r,i} = \frac{\alpha_{k,i}\,\beta_{k,r,i}\,(\overline{W}_{k,i} + T_{r,i})}{R}. \tag{8}$$

Similarly, the fraction of time for which a type r request from a kth cousin is using the level i bus is:

$$f_{B,k,r,i} = \frac{\alpha_{k,i}\,\beta_{k,r,i}\,T_{r,i}}{R}. \tag{9}$$

This completes the bus waiting time equations.

⁴ Note that the mean value performance estimates we compute from the model equations hold for other non-preemptive non-service-time dependent disciplines, such as random order.
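To make equation (5) concrete, a direct transcription follows (our sketch; in the full model the inputs would come from equations (4) and (6), and the service times from the machine parameters):

```python
def waiting_time(Q, P_busy, T):
    """Equation (5): mean wait of a class-k request at a level-i bus.

    Q[r]      -- arrival queue length of type-r requests (equation (4))
    P_busy[r] -- probability a type-r operation is in service on arrival (equation (6))
    T[r]      -- deterministic service time of a type-r bus operation
    The request found in service contributes its mean residual life T[r]/2.
    """
    return sum((Q[r] - P_busy[r]) * T[r] + P_busy[r] * (T[r] / 2.0) for r in T)

# Example with the Section 5 service times in bus cycles; Q and P_busy are made up:
T = {"addr": 2.0, "data": 4.0, "both": 6.0}
print(waiting_time({"addr": 0.3, "data": 0.3, "both": 0.5},
                   {"addr": 0.1, "data": 0.2, "both": 0.3}, T))   # 3.4 cycles
```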

3.4. Extensions to the model for invalidations

The equations developed in Subsection 3.3 considered only READ requests. In this section we outline the changes needed to include READ-MOD and INVALIDATE requests in the model. We use the same notation for classes of invalidation requests which are generated by kth 'cousins' of a level i bus, but add an additional prime (′) notation to distinguish these requests. Thus, a class 0′ request is an invalidation request from a 0th cousin (i.e., a direct descendant) of the bus. The propagation probabilities for invalidation requests are similar in flavor to the read request propagation probabilities in Appendix II, but are more complex.

Four extensions to equations (1) through (9) are required to model the invalidation traffic. First, equation (1) must be modified to include the response time for invalidation requests that are generated by write hits to clean blocks in the level 1 cache. Letting $P_{INV}$ be the probability that the processor request is a write request to a data block that is currently in state Clean, the additional term in equation (1) is:

$$P_{INV}\,(t'_{0,1} + T_{supply,1}).$$

Second, equation (2) applies for READ-MOD requests, but must be modified to include a term for the mean response time for an invalidation request at the next higher level, if any. The probability that a READ-MOD request requires invalidation at higher levels is the probability that it hits in the memory, plus the probability that it is supplied by a peer cache and is not in state ONS in the memory. The probability of this second case, denoted by $P_{\overline{ONS}|cache\,supply,i}$, is computed from the model inputs defined in Section 4. Third, equations which are analogous to equations (2) and (3) must be developed for the invalidation traffic. Invalidations generate three types of requests on the bus: type inv, type ack, and the combined type invack. These types must be included in the straightforward formulation of the analogs to (2) and (3). Finally, minor but careful changes are required to include invalidation request types in the bus waiting time equations. Care must be taken to sum over the appropriate classes (primed and non-primed), and request types, in those equations.

3.5. Model solution and outputs

The mean-value memory request response times computed in equations (1)-(3) are a function of the bus waiting time values computed in equations (4)-(9). The bus waiting time estimates are in turn a function of the mean request response times (see equations (8)-(9)). Thus, the model equations must be solved iteratively. We begin the iteration by computing memory request response times for mean bus waiting times equal to zero. The estimate of R is then used to compute more accurate estimates of mean bus waiting times. At the end of each iteration, we compare the values of all waiting time estimates with the estimates from the previous iteration. When the sum of the absolute values of the differences in the estimates is less than some very small value, ε, the solution terminates. For experiments with up to five bus levels and 1536 processors, the solution terminates in a matter of seconds on a DEC VAXstation II.

The performance measure we are primarily interested in is the overall processing power, which can be expressed as speedup, efficiency, or effective MIPS. We are also interested in bus utilization measures at each level in the bus hierarchy, since these measures can help elucidate causes of performance degradation. The bus utilizations are computed as described in Subsection 3.3. Efficiency is computed as τ/R. Speedup is computed by multiplying the efficiency by the number of processors in the system, and the effective MIPS rate is computed by multiplying the speedup by the peak MIPS rate of the individual processors.
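The solution procedure just described is a successive-substitution fixed point and translates almost directly into code. The sketch below is ours, not from the paper: `solve_mva`, `response_time`, and `waiting_times` are hypothetical names standing for implementations of equations (1)-(3) and (4)-(9):

```python
def solve_mva(response_time, waiting_times, keys, eps=1e-9, max_iter=1000):
    """Iterate the mean-value equations to a fixed point.

    response_time(W) -> R        implements equations (1)-(3)
    waiting_times(R, W) -> W'    implements equations (4)-(9)
    keys: the (class k, level i) pairs indexing the bus waiting times
    """
    W = {key: 0.0 for key in keys}               # begin with zero bus waiting times
    for _ in range(max_iter):
        R = response_time(W)                     # mean time between memory requests
        W_new = waiting_times(R, W)              # refined waiting-time estimates
        # terminate when the summed absolute change falls below eps
        if sum(abs(W_new[key] - W[key]) for key in keys) < eps:
            return R, W_new
        W = W_new
    raise RuntimeError("MVA iteration did not converge")
```

The performance measures then follow directly from the converged R: efficiency is τ/R, speedup is efficiency times the number of processors, and effective MIPS is speedup times the peak per-processor MIPS rate.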

4. Model inputs

The set of six workload parameters for each level in the TREEBus hierarchy, defined at the beginning of Section 3, is fairly small given the complexity of the system being modeled. However, the relation between these parameters and the application level program is indirect and nonintuitive. Furthermore, the complex interdependencies among these parameters make assignment of parameter values extremely difficult. We have thus designed a set of higher-level input parameters that relate more directly to application program characteristics and contain only very weak interdependencies. The characterization of parallel program behavior is a very difficult and open problem. We propose the workload parameters in this section as a starting point for developing a workload submodel for hierarchical machines that has an optimal set of independent parameters. The mean-value model inputs are derived from the high-level workload parameters in a straightforward, albeit complex, manner. Examples in Appendix III illustrate the equations.

The high-level workload parameters, which embody the paradigm we use to characterize the sharing of data in application programs, are as follows:
• $f_P$ is the fraction of memory references which are to private cache blocks. Private cache blocks are cache blocks which are only accessed by one process.
• $f_S^{\kappa}$ is the fraction of memory references which are to class κ shared cache blocks. Class κ cache blocks contain data that is shared among the processors in a subtree rooted at the level κ bus.
• $f_{IW}^{\kappa}$ is the fraction of references to class κ shared blocks that miss in the cache due to being written by another processor. This parameter is a measure of the frequency with which all other processors write a block, relative to the frequency at which a particular processor reads and writes the block, averaged over all blocks in class κ.
• $f_{RI}$ is the overall fraction of processors which read a shared block between writes to the block by a particular processor. At first glance, it might appear that this parameter is determined by the previous parameter, $f_{IW}^{\kappa}$. However, this is not the case. For example, $f_{IW}^{\kappa}$ is small if one processor reads a block many times between another processor's write to the block. In this case, $f_{RI}$ could be large (if other processors also read the block) or small (if no other processors read the block).
• $f_{W|P}$ is the fraction of references to private blocks that are write operations.
• $f_{W|S}$ is the fraction of references to shared blocks (of any class) that are write operations. Note that the unconditional probability, $f_W$, is easily computed from parameters (1), (2), (5), and (6).
• $m_{T,i}$ is the 'traditional' miss rate for memory references at the level i cache. This is the probability a memory reference misses in the cache if it is a reference to a private block or if it is a reference to a shared block and does not miss for cache-consistency reasons.
• $f_{INV|W,P}$ is the fraction of writes to private data blocks that find the block in state Clean. In general, we expect this parameter to be very small, since we are assuming very large caches, and since the first access to a data block by a process tends to be a write request. Note that the corresponding input parameter for writes to shared blocks that find the block in state Clean is related to the $f_{IW}^{\kappa}$ parameters, and is computed from them in an appropriate way.

We note that real values of the above application-related parameters are unknown at this time. This is partly due to the fact that the development of large-scale parallel applications necessarily lags behind the design of these systems. It is also partly due to the lack of a well-defined workload model specifying which parameters are important to measure. However, we have found the above parameters to be very useful for performing parametric studies of the TREEBus architecture. They allow us to gain an understanding of which types of parallel applications might run well on the machine. (See Section 5.)
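As an illustration only, the high-level parameter set can be captured in a small record type. The field names below are our transliteration of the symbols above, and the example instance uses the values assumed later in Section 5.1:

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class Workload:
    f_P: float              # fraction of references to private blocks
    f_S: Dict[int, float]   # f_S[kappa]: fraction of references to class-kappa shared blocks
    f_IW: Dict[int, float]  # f_IW[kappa]: shared-block miss probability due to others' writes
    f_RI: float             # fraction of sharers that read a block between writes
    f_W_P: float            # fraction of private references that are writes
    f_W_S: float            # fraction of shared references that are writes
    m_T: Dict[int, float]   # 'traditional' miss rate at each cache level
    f_INV_WP: float = 0.0   # writes to private blocks that find the block Clean

# Section 5.1 values: all shared data is globally shared (class L, here L = 3),
# with traditional miss ratios from item (v) of that parameter list.
w = Workload(f_P=0.95, f_S={3: 0.05}, f_IW={3: 0.25}, f_RI=0.5,
             f_W_P=0.10, f_W_S=0.30,
             m_T={1: 0.03, 2: 0.01, 3: 0.001, 4: 0.0001})
```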

5. Performance results

In this section, we use the model developed above to analyze several aspects of the TREEBus architecture. We perform several parametric studies to evaluate overall system performance, and to gain an understanding of how performance is affected by key model parameters. We are especially interested in the question of whether, and under what application characteristics, the bus at the root of the hierarchical tree structure becomes a bottleneck. We believe that the results in the remainder of this section are qualitatively significant, although further validation studies are required to assess quantitative accuracy.

5.1. The effect of topology

We would like to know how the performance of TREEBus changes as we vary the topology of the system. In order to evaluate the effect of topology, we have solved the model for selected topologies, varying the number of processors, but holding all other inputs fixed. In this set of experiments, we use the following input parameter values:
(i) The bus speed is 20 MHz, giving a bus cycle time of 50 nanoseconds.
(ii) The processors each have a peak performance of 4 MIPS, and each processor makes an average of 1.3 memory references per instruction, counting the instruction fetch. Note that τ is the reciprocal of the product of these two values, or 3.85 bus cycles.
(iii) The service times, measured in bus cycles, for the various classes of bus requests are: $T_{addr} = 2.0$, $T_{data} = 4.0$, $T_{both} = 6.0$, $T_{inv} = T_{ack} = 1.0$ and $T_{invack} = 2.0$. These values hold for the buses at all levels.
(iv) The memory latencies are: $T_{supply,i} = 100$ nanoseconds, and $d_{cache,i} = 750$ nanoseconds, for all levels i.
(v) The traditional miss ratios for the level 1, 2, 3 and 4 caches are 3%, 1%, 0.1% and 0.01%, respectively. The values of these miss ratios are based on the expected sizes of the caches at the various levels. Note that invalidations of shared blocks will result in a higher total miss ratio.
(vi) 5% of all references (or about 22% of all data references) are to blocks containing globally shared data [2].
(vii) 10% of private references (including instruction references) are writes and 30% of references to shared data are writes.
(viii) The probability of finding a shared data block invalidated, $f_{IW}$, is 0.25 for all levels κ.
(ix) 50% of the processors that share a data block read it between successive invalidations of the block.

Fig. 2 shows the speedup achieved for the selected system topologies versus the total number of processors in the system. Fig. 3 gives results for 40 MHz buses and $d_{cache} = 500$ nanoseconds. Each topology is identified by the notation N1 × i × j × k, where N1 is the number of processors connected to each level 1 bus, and i (j, k) is the branching factor on the level 2 (3, 4) bus. Consider, for example, the N1 × 12 topology in Fig. 2. When N1 = 20, this system has a total of 240 processors, and achieves a speedup of about 150. At this point, one or more of the buses becomes saturated. To connect more processors, we must use an alternate TREEBus topology.

Our first observation is that, for a fixed number of levels, the point at which bus saturation occurs is dependent on the topology. For example, the N1 × 8 × 4 topology saturates at 512 (768) processors, whereas the N1 × 4 × 4 topology saturates at 256 (512) processors, if the buses operate at 20 (40) MHz.
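For reference, the topology notation converts to processor counts, and effective MIPS back to efficiency, with a few lines of arithmetic. This helper is ours, with the example figures taken from the abstract:

```python
import math

def total_processors(n1, *branching):
    """Processor count of an N1 x i x j x ... topology."""
    return n1 * math.prod(branching)

print(total_processors(16, 8, 4))   # 512: the optimal three-level N1 x 8 x 4 topology
# The abstract quotes ~1400 effective MIPS for 512 processors of 4 MIPS peak at
# 20 MHz, i.e. an overall efficiency of roughly 1400 / (512 * 4) = 0.68.
```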

[Fig. 2. The effect of the TREEBus topology; 4 MIPS, 20 MHz, $f_S^L = 0.05$, $f_{IW}^L = 0.25$, $f_{RI} = 0.5$. The figure plots speedup versus number of processors for selected two- through five-level topologies.]

The bus utilizations estimated by the model show that the topology determines which level buses become the bottlenecks in the system, and that the higher performance topologies have bus utilizations that are better balanced across the various levels. For example, the level 1 bus is the bottleneck in the N1 × 4 × 4 topology, the level 3 (root) bus is the bottleneck in the N1 × 4 × 8 topology, and bus traffic is more evenly distributed among the buses at different levels in the more optimal N1 × 8 × 4 topology.

For a given number of processors, all nonsaturated topologies have the same performance. Thus, the optimal configuration for a given number of processors is the topology with the minimum number of levels such that the speedup in Fig. 3 is still in the unsaturated region. Conversely, the optimal topology for a fixed number of levels is the one which can accommodate the largest number of processors before becoming saturated. (This is also the topology that distributes the traffic most evenly across all levels in the hierarchy.) For the topologies and parameter values investigated, the optimal two- (three-, four-, five-) level topology is N1 × 12 (N1 × 8 × 4, N1 × 12 × 2 × 2, N1 × 8 × 2 × 2 × 2). In general, it appears that the optimal configurations are those which look 'pyramidal', with the branching factor being high at levels 1 and 2, and decreasing at each succeeding level.

We note that, for the parameter values used in this experiment, it appears that in the 40 MHz system the optimal 5-level configuration can support well over a thousand processors.

[Fig. 3. The effect of the TREEBus topology; 4 MIPS, 40 MHz, $f_S^L = 0.05$, $f_{IW}^L = 0.25$, $f_{RI} = 0.5$. The figure plots speedup versus number of processors for the selected topologies.]

In the 20 MHz system, the four and five-level topologies can support well over 500, and perhaps as many as 1000 processors.

5.2. Increasing processor speed

We next address how the performance of TREEBus changes with increases in processor speed. The increases in total computing power achieved by faster processors can then be traded off against the increase in cost. In these experiments, we vary the number of processors and the processor speed, considering only the optimal two-, three-, or four-level topology for each system size, and holding all other parameters fixed at the values used in Fig. 2.

Fig. 4 gives the results. The performance measure plotted on the y axis is the effective MIPS rate, which is defined in Subsection 3.5. As long as the bus subsystem is not saturated, increasing the MIPS rate of the processors will increase the total effective MIPS delivered by the multiprocessor. In the 512-processor system, replacing 2-MIPS processors with 6-MIPS processors increases the total effective processing power from 895 MIPS to about 1730 MIPS. At this point, the system is close to saturation and further increasing the speed of the individual processors to 8 MIPS increases the total effective MIPS only marginally. Further investigation of five-level topologies is needed to evaluate the benefit of 6-MIPS processors for systems larger than 512 processors, and the benefit of processor speeds higher than 6 MIPS for systems larger than 256 processors.

[Fig. 4. The effect of increasing processor speed; 20 MHz buses, $f_{RI} = 0.5$. The figure plots effective MIPS versus number of processors for processor speeds from 2 to 10 MIPS.]

5.3. The effect of interleaved reads

Our final set of experiments investigates the implications of the value of $f_{RI}$ used in the earlier experiments. This parameter denotes the fraction of processors that share a data block which actually read the block between successive writes to the block (on the average). $f_{RI}$ was set to 0.5 in our previous experiments. The first read access by a processor after a block is written by another processor will cause the block to appear in all caches on the path from the supplier's bus to itself. Future read accesses to the block by other processors in the same subtree are satisfied within the subtree itself, thereby amortizing the cost of the initial read. Thus, increasing the value of $f_{RI}$ increases the probability that a request is satisfied at the lower levels of the TREEBus, thereby decreasing the latency of requests and reducing the traffic on the higher level buses, both of which contribute to a higher system performance.

Fig. 5 shows the effect of varying $f_{RI}$ in an N1 × 12 × 2 × 2 system with 4 MIPS processors, 20 MHz buses, and 22% of data references to globally shared data (i.e., data that is shared by all processors). The results show that under the assumption that all shared data is shared globally, system performance degrades sharply with increasing system size for low values of $f_{RI}$. High values of $f_{RI}$ may be appropriate for applications that incrementally build up shared data (i.e., shared knowledge or state information) that is then read by a large number of processors in the system. On the other hand, if the application involves mostly pairwise sharing of data, the value of $f_{RI}$ will be very small.

[Fig. 5. The effect of global reads between invalidates; 4 MIPS processors, 20 MHz buses, N1 × 12 × 2 × 2 topology. The figure plots efficiency (%) versus number of processors for several values of $f_{RI}$ between 0.01 and 0.50.]

However, some applications with low values of $f_{RI}$, including applications that have pairwise sharing of data, can be mapped onto the TREEBus system to take advantage of locality on the level 1 bus. Investigating the optimal configurations and potential performance of TREEBus for these workloads is a key component of our continuing experiments with the model.

6. Conclusions and future work

In this paper, we have analyzed several performance aspects of hierarchical shared-memory multiprocessors. The mean-value queueing analysis includes bus latency, shared memory latency, and bus interference as the primary sources of system performance degradation. The model also has a small set of relatively independent input parameters that are easily related to application-level program characteristics.

Our analysis shows that a hierarchical organization of caches and shared buses can be configured for very high performance, as long as the total bandwidth in the bus subsystem can support the total load placed on it by the processors. For workloads where most of the shared data is shared globally, our results can be used to select an optimal topology for a given system size. For 20 MHz buses, our results also show that optimal topologies of more than 256 processors reach saturation with 6 to 8 MIPS processors. Finally, our results show that system performance is highly dependent on the fraction of processors that read a block containing shared data between writes to the block by a given processor. If this fraction is high, the locality of sharing within a cluster appears to be less important than the 'traditional' miss rate for non-shared data.

Our current work includes further investigation of the performance of hierarchical cache-coherent multiprocessors, using the performance model developed in this paper. In particular, we are examining the effects of shared data locality and access patterns, and the interplay between these two properties. Future plans include improving the performance model by adding cache and memory interference, improving the workload model, extending the model to study the benefits of adding an exclusive state (requiring read-notify operations) to the cache coherence protocol, and investigating real workload parameter values.

Appendix I. Mean value model notation

The following notation is used in developing the mean value model:
• $N_{k,i}$ is the number of kth cousins that can generate requests on a level i bus.
• $\delta_{k,i}$ is the probability that a request on the ith bus from a kth cousin will propagate up to the bus at level i + 1.
• $\mu_{k,i}$ is the probability that a request on the ith bus from a kth cousin will propagate down to a bus at level i − 1.
• $\alpha_{k,i}$ is the probability that a request from a kth cousin propagates to a particular level i bus.
• $\beta_{k,r,i}$ is the probability that a request for the ith bus from a kth cousin will require an operation of type r.
• $R$ is the mean total time between memory requests made by a processor.
• $t_{k,i}$ is the mean response time for class k requests at level i. These are the requests generated by the kth cousins of i.
• $t_{Q,k,i}$ is the mean time a request from a kth cousin spends waiting for and receiving service from the level i bus.
• $\overline{W}_{k,i}$ is the mean waiting time at the level i bus for a request from a kth cousin.
• $U_{r,i}$ is the utilization of the level i bus by operations of type r. Recall that r may be addr, data, or both.
• $f_{Q,k,r,i}$ is the fraction of time a kth cousin spends in the queue or using the ith bus for a type r request.
• $f_{B,k,r,i}$ is the fraction of time a kth cousin actually uses the level i bus for a type r request.
• $\overline{Q}_{r,i|k}$ is the mean queue length of type r requests at the level i bus, as seen by an arriving customer belonging to class k.
• $P_{BUSY,r,i|k}$ is the probability that the bus is in use by an operation of type r when a customer of class k arrives at the level i bus.

Appendix II. Read request propagation probabilities

For read requests, the propagation probabilities defined in Subsection 3.1.2 are:

$$\delta_{k,i} = \begin{cases} (1 - P_{C,i})\,m_{M,i}, & k = 0, \\ 0, & k = 1, \ldots, L - i, \end{cases}$$

$$\mu_{k,i} = \begin{cases} P_{C,i}\,m_{PCS,i}, & k = 0, \\ m_{R,i}, & k = 1, \ldots, L - i, \end{cases}$$

$$\alpha_{k,i} = \begin{cases} m_{M,0} \displaystyle\prod_{j=1}^{i-1} \delta_{0,j}, & k = 0, \\[2ex] m_{M,0} \left[ \displaystyle\prod_{j=1}^{i+k-1} \delta_{0,j} \right] \dfrac{\mu_{0,i+k}}{n_{i+k} - 1} \displaystyle\prod_{j=i+1}^{i+k-1} \dfrac{m_{R,j}}{n_j}, & k = 1, \ldots, L - i, \end{cases}$$

and, for k = 0, ..., L − i:

$$\beta_{k,r,i} = \begin{cases} \delta_{k,i} + \mu_{k,i}, & r \in \{addr,\,data\}, \\ 1 - (\delta_{k,i} + \mu_{k,i}), & r = both. \end{cases}$$
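A direct transcription of the simpler entries of this table into code may be helpful. The sketch below is ours, with the miss rates passed in as per-level arrays:

```python
def delta(k, i, P_C, m_M):
    """delta_{k,i}: probability a request on bus i propagates up to bus i+1."""
    return (1.0 - P_C[i]) * m_M[i] if k == 0 else 0.0

def mu(k, i, P_C, m_PCS, m_R):
    """mu_{k,i}: probability a request on bus i propagates down to a bus at i-1."""
    return P_C[i] * m_PCS[i] if k == 0 else m_R[i]

def beta(k, r, i, P_C, m_M, m_PCS, m_R):
    """beta_{k,r,i}: probability the request needs a type-r bus operation."""
    p = delta(k, i, P_C, m_M) + mu(k, i, P_C, m_PCS, m_R)
    return p if r in ("addr", "data") else 1.0 - p
```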

Appendix III. Example calculations of low level model parameters from the high level model inputs

Shared blocks may miss due to invalidations caused by interleaved accesses by other processors. In addition, both private and shared blocks that do not miss for the above reason may miss for 'traditional' reasons such as startup effects and block purging:

$$m_{M,0} = f_{P,0}\,m_{T,0} + \sum_{\kappa=1}^{L} f^{\kappa}_{S,0}\,f^{\kappa}_{IW}.$$

A block that reaches the memory at level i > 0 will have missed in all the level i caches. If this block is shared, then it will be supplied by a peer cache at some higher level, and thus will miss in this memory. It will do so because it has been invalidated by processors outside the subtree rooted at the level i bus. Hence the miss ratio at level i is:

$$m_{M,i} = f_{P,i}\,m_{T,i} + \sum_{\kappa=i+1}^{L} \left( f^{\kappa}_{SI,i} + m_{T,i}\,f^{\kappa}_{ST,i} \right).$$

Note that $f^{\kappa}_{ST,i}$ is the fraction of references that reach the level i memory that are shared but have propagated because of the traditional miss ratios, and $f^{\kappa}_{SI,i}$ is the fraction that have missed due to invalidations by other processors.

We now find the probability that an invalidate request will continue to propagate towards the root level when the block is supplied by a cache at level i. This will be given by 1 − the probability that the block is in state ONS in the level i memory given that the block is supplied by a peer cache at level i. The cache supply occurs only for shared blocks. The invalidate must have occurred by a processor in the subtree rooted at level i, excepting the subtree where the current request is coming from. (Otherwise, the memory would not be an 'owner' of the block.) There are $N_{0,i} - N_{0,i-1}$ such processors, out of a total of $N_{0,\kappa} - N_{0,i-1}$ processors that can potentially have done the invalidate. This has to be adjusted for the fact that, after the invalidate, this access must be the first out of all requests for the block that makes it to the level i bus. There are $n_i - 1$ such possible requests, adjusted by $f_{RI}$, plus the current request. Hence we have:

$$P_{\overline{ONS}|cache\,supply,i} = 1 - \frac{N_{0,i} - N_{0,i-1}}{N_{0,\kappa} - N_{0,i-1}} \cdot \frac{1}{(n_i - 1)\,f_{RI} + 1}, \qquad i = 1, \ldots, L.$$

References

[1] J.-L. Baer and W.-H. Wang, Architectural Choices For Multi-Level Cache Hierarchies, Tech. Rep. 87-01-04, Department of Computer Sciences, University of Washington, Seattle, WA 98195, 1987.
[2] F. Darema-Rogers, G.F. Pfister and K. So, Memory access patterns of parallel scientific programs, Proc. SIGMETRICS International Symposium on Computer Performance Modeling, Measurement and Evaluation (1987) 46-58.
[3] J.R. Goodman and P.J. Woest, The Wisconsin Multicube: A new large-scale cache-coherent multiprocessor, Proc. 15th Annual Symposium on Computer Architecture (Honolulu, HI, 1988) 422-431.
[4] Rajeev Jog, G.S. Sohi and M.K. Vernon, The TREEBus Architecture and its Analysis, Computer Sciences Tech. Rep. #747, University of Wisconsin-Madison, Madison, WI 53706, 1988.
[5] E.D. Lazowska, J. Zahorjan, G.S. Graham and K.C. Sevcik, Quantitative System Performance: Computer System Analysis Using Queueing Network Models (Prentice-Hall, Englewood Cliffs, NJ, 1984).
[6] S. Leutenegger and M.K. Vernon, A mean-value performance analysis of a new multiprocessor architecture, Proc. ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, 1988.
[7] M.K. Vernon and M. Holliday, Performance analysis of multiprocessor cache consistency protocols using generalized timed Petri nets, Proc. SIGMETRICS International Symposium on Computer Performance Modeling, Measurement and Evaluation (1986) 9-17.
[8] M.K. Vernon, E.D. Lazowska and J. Zahorjan, An accurate and efficient performance analysis technique for multiprocessor snooping cache-consistency protocols, Proc. 15th Annual Symposium on Computer Architecture (Honolulu, HI, 1988) 308-315.
[9] A.W. Wilson, Organization and Statistical Simulation of Hierarchical Multiprocessors, Ph.D. Thesis, Department of Electrical and Computer Engineering, Carnegie-Mellon University, Pittsburgh, PA, 1985.
[10] A.W. Wilson, Hierarchical cache/bus architecture for shared memory multiprocessors, Proc. 14th Annual Symposium on Computer Architecture (Pittsburgh, PA, 1987) 244-252.