Efficient performance models for layered server systems with replicated servers and parallel behaviour

The Journal of Systems and Software 80 (2007) 510–527 www.elsevier.com/locate/jss

Tariq Omari, Greg Franks *, Murray Woodside, Amy Pan

Department of Systems and Computer Engineering, Carleton University, 1125 Colonel By Drive, Ottawa, Ont., Canada K1S 5B6

Available online 6 September 2006

Abstract

Capacity planning for large computer systems may require very large performance models, which are difficult or slow to solve. Layered queueing models solved by mean value analysis can be scaled to dozens of servers and hundreds of service classes, with large class populations, but this may not be enough. A common feature of planning models for large systems is structural repetition expressed through replicated subsystems, which can provide both scalability and reliability, and this replication can be exploited to scale the solution technique. A model has recently been described for symmetrically replicated layered servers, and their integration into the system, with a mean-value solution approximation. However, parallelism is often combined with replication; high-availability systems use parallel data-update operations on redundant replicas, to enhance reliability, and grid systems use parallel computations for scalability. This work extends the replicated layered server model to systems with parallel execution paths. Different servers may be replicated to different degrees, with different relationships between them. The solution time is insensitive to the number of replicas of each replicated server, so systems with thousands or even millions of servers can be modelled efficiently.

© 2006 Elsevier Inc. All rights reserved.

Keywords: Analytic performance model; Client–server performance; Layered queueing networks; Replication; Parallelism

* Corresponding author. Tel.: +1 613 5205726; fax: +1 613 5205727. E-mail addresses: [email protected] (T. Omari), greg@sce.carleton.ca (G. Franks), [email protected] (M. Woodside). 0164-1212/$ - see front matter © 2006 Elsevier Inc. All rights reserved. doi:10.1016/j.jss.2006.07.022

1. Introduction

Large computer systems often use server replication to provide capacity, reliability or a combination of the two (Lazowska et al., 1984; Smith, 1990). To plan and manage these, it is useful to predict properties such as capacity and delay with performance models, such as layered queueing models (which describe the layering of services). Even the most efficient computational techniques eventually run into limitations on the number of distinct servers and service classes they can model, and in general there is a need for ways to extend the reach of analytic modelling techniques. One well-known approach to simplify a model state space or solution complexity is to discover and exploit symmetry (Sanders and Meyer, 1991; Capra et al., 1999; Woodside, 1983; Sheihk et al., 1997), so this work considers application-level symmetry in server systems. Large models often have a structure based on replication of servers and subsystems, either because replication is a central feature of the system, or as a simplifying assumption in making the model.

Replication is deliberately introduced into systems to improve both performance and reliability. If a server is a bottleneck, the system capacity can be increased by adding copies of the server in some way. One solution is to add threads to a software server, and if necessary to run the server on a multiprocessor. Another solution is to introduce replica servers, which are separate computing nodes. This is the solution adopted by cluster and grid computing, by proxy web servers, and in replicated databases, and this is the context of models for replicated tasks and subsystems. Replication provides redundancy in case of failures, geographic separation to reduce vulnerability to fires and other disasters, modular upgrade capability through addition of nodes, separation of network access traffic, and placement of services near to distributed users to reduce access latencies. For example, the DNS naming service maintains copies of name-to-address mappings for computers and other resources, and is relied on for day-to-day access to services across the Internet. The USENET system maintains replicas of items posted to electronic bulletin boards across the Internet, the replicas being held within or close to the various organizations that provide access to it. Google uses replication for better request throughput (Barroso et al., 2003). The internal services are replicated across many machines to obtain sufficient capacity. Other practical examples include web databases (Loukopoulos et al., 2002), grid computing (Lamehamedi et al., 2003), and safety critical systems like air traffic control (Das and Woodside, 2004).

Replication is also introduced into planning models to approximately represent subsystems which are nearly symmetrical, such as departmental administrative networks. The planning exercise may assume symmetry as a useful simplification for decision-making, and for budgeting equipment and effort.

In solving models, symmetry has been exploited for efficient solution in different ways for different modelling paradigms. State explosion in state-based modelling approaches (such as Stochastic Petri Nets or Stochastic Process Algebras) gives a strong incentive for finding state-based simplifications (Sanders and Meyer, 1991). These techniques give an exact solution through lumping of states, based on symmetry and exact Markov Chain aggregation. Because they scale up better for some kinds of large systems, capacity planning is often done with queueing models as in Menascé et al. (1994). Smith (1990) has described queueing and extended queueing models for modelling application behaviour. Symmetry was exploited for a queueing model in Woodside (1983), which represented an unbounded set of nodes in a distributed system by a single queueing station.

Layered queueing is a kind of extended queueing which captures the behaviour of layered services in many kinds of systems (Franks et al., 1999; Kähkipuro, 2001; Ramesh and Perros, 2000; Menascé, 2002; Rolia and Sevcik, 1995; Franks et al., 1996), and replication has been analyzed by Sheihk et al. (1997), based on replicated “areas” in the system. Another approach, for systems with more general patterns of interaction among the replicated servers, is described in Omari et al. (2005). Any replicated server can interact with any other server, and with different numbers of different replica groups.

As in Omari et al. (2005), this paper considers layered server systems with groups of replicated elements or subsystems which partition or share a workload. It extends Omari et al. (2005) with servers that may fork internal parallel threads that interact with other servers. The central idea is that each group of replicated entities is represented only once, and its properties are computed once. The results are then used for all the members of the group. This is an approximation, because jobs processed by some systems do distinguish between instances of replicated components (Spitznagel and Garlan, 1998).

However, the approximation is justified by the greater simplicity in determining and expressing the model, as well as by the shorter solution time. The approach makes the time and space complexity of the solution insensitive to the number of replicas of any server, and to the number of groups of replicas.

2. Layered queueing network (LQN) models

An LQN model is a canonical form for extended queueing networks that represent layered service. In a layered service a server, while executing a service, may request a lower layer service and wait for it to complete. The service time of the upper server includes the queueing delay and service time of the lower server, and this may extend through multiple layers. LQN was developed for modelling software servers with, for example, blocking remote procedure calls to lower layer software servers; however, it applies to any extended queueing network in which resource usages are nested, with lower layer usages inside higher layer usages.

2.1. Notation

Fig. 1 shows a notation for LQNs, applied to a web server hosting an application. The large parallelograms denote “tasks” which represent software servers, with small parallelograms denoting “entries” which are interfaces for service classes. One task may have several entries, and may have a multiplicity representing multithreading (indicated by a parameter in curly brackets, such as {20} for the Webserver task). Entries make requests to other entries, and to a host processor. The details of an entry are specified by a sequence of “activities” shown as small rectangles with predecessor-successor relationships, with a first activity triggered by the entry. In the Webserver task, entry http triggers execution of the activity init, followed by an AND-fork, and two activities static and dynamic in parallel, and joining with activity fin. At the end of activity fin, a reply is sent by Webserver to the caller of entry http, and the service is over.
The Webserver task runs on a processor P2, indicated by a circle within the Server Node boundary. Activities have execution-demand parameters (which demand execution by the host processor and are given in time units) and arcs to indicate requests for other services (a directed arc from the activity, to an entry of another task). For example activity dynamic has 0.4 time units of host demand and makes one request to entry cgi of task AppServer. Some entries show no activity detail because they have just a single activity (or two, as described below) and their parameters are attached to the entry instead. An entry also requests service from the processor to which its task is allocated, indicated by an arc from the task to the processor. An example is the arc from task Webserver to processor P2. An entry without specified detail has one or more activities. An example with one activity is entry serv, with one time unit of host demand. There may also be two activities, one before sending a reply and one after, which are called “phases”. A second phase of service is common in software servers, either to perform “cleanup” after the service, or as a performance optimization for logging or delayed writes. The existence of a second phase is indicated by the entry parameters, as shown in Fig. 1 for the browser entry. The entry has two demand parameters [0, 1] and its request arc to http has two visit rate parameters (0, 1), one parameter for each of the two phases.

Fig. 1. A layered queueing network. (The figure shows the tasks User {200}, Webserver {20}, Disk, AppServer, Database, RemoteWS and FileServer with processors P1 to P6, arranged in six layers; solid arrows denote synchronous requests.)

A synchronous or blocking-RPC (remote procedure call) request from an entry or an activity to an entry returns a reply to the requester, indicated in Fig. 1 by solid arrows with closed arrowheads. For example, task AppServer makes a request to task Database which then makes a request to task FileServer. While task FileServer is servicing the request, tasks Database and AppServer are blocked. Alternatively, a request may be forwarded to another entry for later reply, or may not return any reply (an asynchronous request); these request types are not used in Fig. 1.

In Fig. 1 the tasks and processors are arranged in layers, with requests always from a higher-layer entity to a lower-layer one. An arc representing a service request is annotated with a “visit rate” parameter (in brackets) which gives the average number of requests made, per execution of the requesting activity or entry. Note that requests may jump over layers. The reason for layering the model is to capture blocking delays in the real system. This “active server” feature (Woodside et al., 1995) is the key difference between layered and ordinary queueing networks. The entry service time is not constant but is determined by its lower servers. Thus the essence of layered queueing is a form of simultaneous resource possession. In software systems delays and congestion are heavily influenced by synchronous interactions such as RPCs or rendezvous, and the LQN model captures these delays by incorporating the lower layer queueing and service into the service time of the upper layer server.

2.2. Parallel execution notation

Parallel execution is shown in the Webserver task where entry http is specified using activities. Activities are the lowest level of granularity in a performance model and are linked together in a directed graph to indicate precedence.


When a request arrives at an entry, it triggers the first activity of the activity graph. Subsequent activities may follow sequentially, or may fork into multiple paths which later join. The fork may take the form of an ‘AND’, which means that all the activities on the branches after the fork can run in parallel, or of an ‘OR’, which chooses one of the branches with a specified probability. In Fig. 1, a request that is received by entry http of task Webserver is processed using an activity called init that represents the main thread of control, after which the main thread of the task is AND-forked into two threads. One thread has the activity dynamic, and the other thread has the activity static. These threads make requests to lower level servers and they run logically in parallel, although they run on the same processor which serializes them and which may impose contention delays on both of them. When both replies from the lower level servers are received, both threads join into one thread and continue processing using the activity fin. Activity fin, in turn, generates the reply to be sent from the entry http to the entry browser in the User task.

3. Layered queueing with replicated servers

Replication is used to add resources to a system, which are modelled by tasks and processors in the LQN. When an entity is replicated it is replaced by a replication group, a set of entities with the same properties. In symmetrical replication, the interactions of all the replicas in the group are also similar; that is, they have the same number of clients with the same properties, and servers which are all similar. In effect “similar” entities have numeric parameters with the same value, and interactions with corresponding entities which are themselves “similar”. For example, a multithreaded task with m threads gives replicas with m threads each.

3.1. Notation

The notation for defining replication in LQN models is illustrated by the example in Fig. 2, with no activity detail or parallel execution. Beginning from Fig. 2(a) we wish to replicate task Client twice, and task Server three times, and have each Client make two requests to each Server. Each replica will run on its own processor. The notation for this is shown in Fig. 2(b), with three new elements:

• a replication count K for each replicated task and processor, in angle brackets, as <K>,
• a fanout count O for each arc, showing how many separate target tasks there are for each source replica,
• a fanin count I for each arc, showing how many separate source tasks there are for each target replica.

Fig. 2. Replication of a simple client–server model: (a) simple, (b) replicated and (c) expanded. (In (b), Client <2> makes requests to Server <3> with visit rate (2), O=3, I=2; in (c), the replicas are Client_1, Client_2 and Server_1 to Server_3.)

All of these have default values of unity and are not shown if they equal unity. The meaning of the replication notation is shown in the expanded model of Fig. 2(c), where each replica is shown separately, with a replica number appended to the name, as in Client_1. We see that each Client makes requests to three Server replicas, expressing the fanout of 3, and each Server receives requests from two Client replicas, expressing the fanin of 2. Because the number of processor replicas matches the number of task replicas, each replica task has its own processor. Note that for an original request from a task with $K_C$ replicas, to a task with $K_S$ replicas,

$$\text{total request arcs} = K_C \cdot O = K_S \cdot I,$$

as well as $O \le K_S$. In our present work all requests between activities and entries of the same pair of original tasks must have the same fanout and fanin.

A more elaborate case is illustrated in Fig. 3, with two groups of clients and two groups of servers. We replicate tA twice, tB twice, tC three times and tD twice, with each client distributing its requests across all replicas of each server. Fig. 3(c) shows how the replicas interact. The same architecture, with parallel execution within task tB, is shown in Fig. 4. In Fig. 4, requests from activity b3 go from one replica of tB to one replica of tD, since the fanout and fanin are both the default value of 1.
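The two constraints on replication parameters stated above can be checked mechanically before a model is expanded or solved. The following is a minimal Python sketch; the function name `check_replication` and its argument names are illustrative assumptions and are not part of the LQN notation or of any solver.

```python
def check_replication(k_source: int, k_target: int, fanout: int, fanin: int) -> int:
    """Validate replication parameters for one request arc.

    Returns the total number of request arcs in the expanded model.
    Raises ValueError if K_C * O != K_S * I or if O > K_S.
    (Hypothetical helper for illustration only.)
    """
    for name, value in (("k_source", k_source), ("k_target", k_target),
                        ("fanout", fanout), ("fanin", fanin)):
        if value < 1:
            raise ValueError(f"{name} must be a positive integer, got {value}")
    if fanout > k_target:
        raise ValueError("fanout O must not exceed the number of target replicas K_S")
    if k_source * fanout != k_target * fanin:
        raise ValueError("K_C * O must equal K_S * I")
    return k_source * fanout


# Fig. 2(b): Client <2> calls Server <3> with O=3, I=2.
total_arcs = check_replication(k_source=2, k_target=3, fanout=3, fanin=2)
```

For the Fig. 2(b) parameters both products equal 6, which matches the six request arcs drawn in the expanded model of Fig. 2(c).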

Fig. 3. Replication in a larger simple model: (a) simple, (b) replicated and (c) expanded. (In (b), tA <2> calls tC <3> with (2), O=3, I=2; tB <2> calls tC <3> with (3), O=3, I=2, and calls tD <2> with (4).)

Fig. 4. Replication in a model with parallelism: (a) replicated and (b) expanded. (Task tB <2> has activities b1 to b4, with an AND-fork after b1 and an AND-join before b4.)

3.2. Semantics of replication

A model with replicated servers implies an expanded model of the actual system, in which every replica is a separate model entity. To expand a model $M_R$ with replicated servers into the corresponding expanded model $M_E$, proceed as follows:

• For each original replicated server task t in $M_R$, with $K_S$ replicas, create $K_S$ replica tasks with the same entry and activity structure. The tasks, entries and activities are distinct (with separate names) but have the same parameters, and symmetric service requests, as follows.
• For each original request arc from an entry/activity of $M_R$ to an entry e of $M_R$, with fanout O, create O request arcs of the same type (and the same visit rate parameter) from each corresponding entry/activity of $M_E$. Each arc has as target one of the entries in $M_E$ that corresponds to entry e, as follows.
• The target entries for the request arcs are chosen as follows: the O target tasks for the arcs from the first source replica are chosen in sequence from the $K_T$ target replicas; for the next source replica the sequence is continued, modulo $K_T$.
• For each replicated processor in $M_R$ with $K_P$ replicas, create the $K_P$ replica processors. For the replicated tasks allocated to this processor in $M_R$, allocate the first replica task in $M_E$ to the first replica processor in $M_E$, and then allocate each succeeding replica task to the next replica processor, modulo $K_P$. Commonly there are equal numbers of replicas of tasks and processors, and each task is allocated to a separate processor.
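The expansion rules listed above can be illustrated with a short Python sketch. It covers only the round-robin choice of target replicas and the allocation of replica tasks to replica processors; the function names, the `name_index` naming of replicas (following Client_1 in Fig. 2(c)) and the tuple representation of arcs are assumptions made for illustration, not the data structures of any solver.

```python
def expand_arcs(source: str, k_source: int, target: str, k_target: int, fanout: int):
    """Expand one replicated request arc into arcs between individual replicas.

    Targets are taken in sequence from the target replicas; each source
    replica continues where the previous one stopped, modulo k_target.
    """
    arcs = []
    next_target = 0  # running position in the target sequence
    for s in range(k_source):
        for _ in range(fanout):
            arcs.append((f"{source}_{s + 1}", f"{target}_{next_target % k_target + 1}"))
            next_target += 1
    return arcs


def allocate(task: str, k_task: int, processor: str, k_proc: int):
    """Allocate replica tasks to replica processors in sequence, modulo k_proc."""
    return {f"{task}_{i + 1}": f"{processor}_{i % k_proc + 1}" for i in range(k_task)}


# Fig. 2: Client <2> -> Server <3> with fanout 3.
client_server = expand_arcs("Client", 2, "Server", 3, 3)

# Fig. 6: DB1 <2> -> db1Disk <4> with fanout 2.
db_disk = expand_arcs("DB1", 2, "db1Disk", 4, 2)
```

With the Fig. 6 parameters, the sketch sends DB1_1 to db1Disk_1 and db1Disk_2, and DB1_2 to db1Disk_3 and db1Disk_4, so each database replica gets its own pair of disks.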

This distributes the requests evenly across the target replicas. It follows that for any two request arcs between the entries of the same requester and target tasks in $M_R$, the arcs created in $M_E$ from one task replica go to entries of the same target task in $M_E$. Thus the interaction relationships are per task. When three or more replicated tasks interact, all with the same replication number and fanout/fanin values, then K subsystems are automatically created with the same interaction pattern within each one.

The expressive power of this replication model is high. It supports any number of replicas of any task, with any layered interactions, subject only to the constraints expressed by $K_C \cdot O = K_S \cdot I$ and $O \le K_S$. In our present work O and I are constrained to be integers; fractional fanout or fanin could however be interpreted as distributing a single request stream among a number of targets.

3.3. Some examples

The expressive power of this replication model will be explored with a few examples, which will also explain the notation further.

3.3.1. Enterprise systems

Fig. 5 shows a financial application with a very large user population served by 10 regional application servers. The application integrates information from two different databases, covering different aspects of the business. The users are initially modelled as belonging to 100 subregions, with a fanin from 10 subregions to one RegionalServer, and a further fanin from 10 RegionalServers to each database. The solution effort of this model will not be affected by changes in the number of RegionalServer replicas, say from 10 to 100, or in the number of users.

Fig. 5. Example of a financial application.

Fig. 6 modifies this model in two ways. First, database DB1 is replicated for reliability, in such a way that each write interaction is sent by the RegionalServer to update both replicas, but each read interaction goes to only one. The reduced number of reads is represented by showing a visit rate of 5 to represent 10 Query interactions, sent to half of the replicated servers (this will give the correct average access rate at the server). The details of read and write locking are not represented here. Second, detail is added to the modelling of DB1, by describing the data storage subsystem explicitly with two disks for each node. A disk is described by a combination of a task (for the logical operations) and a processor (for the disk device). Since there are two replicas of DB1, there are four replicas of db1Disk. Similar detail is not added to DB2 in this case. We notice that the allocation of requests to replicas puts the requests from DB1_1 (the first replica) on the first two disk replicas, and from DB1_2 on the third and fourth, creating two distinct subsystems for the two database replicas.

Fig. 6. Example of a financial application.

3.3.2. Middleware-based replication

Fig. 7(a) is a model of a CORBA middleware system from Petriu et al. (2000), without showing the details of entries. Clients access a set of Servers through the ORB, which forwards requests according to the service required, and the currently available Servers. The net tasks represent network delays in the model. The dashed arcs represent forwarding of requests, so that the eventual reply is sent to the originator, and the forwarder is unblocked. Fig. 7(b) shows the same system, but exploiting or assuming symmetry between the operations beginning with net2, and with net3. The representation is naturally more compact and solution effort will be reduced, and will be unaffected in going from 2 paths to, say, 1000 paths.

Fig. 7. LQN model for the H-ORB architecture (Petriu et al., 2000): (a) non-replicated and (b) replicated.

3.3.3. Distributed high-assurance system

Fig. 8 shows a model analyzed in Das and Woodside (2004), describing an en-route air traffic control system with replication for availability and reliability. The analysis used the replication model and algorithms described in this paper, and added additional logic for reconfiguring the system based on the availability of replicas. The AND and OR notations in the figure refer to relationships used to compute availability and to reconfigure the system, and will not be explained here. The open-headed request arcs between entries denote asynchronous messages generated periodically from the radar and its front-end processing.

Fig. 8. Air traffic control system.

4. Layered solution strategy

The LQNS tool constructs queueing submodels for clusters of servers at different layers and applies a fixed-point iteration to the submodels, to find a steady-state solution for delays and resource utilizations. There are several strategies for submodel construction, but we will consider the one illustrated in Fig. 9 (corresponding to the web server model in Fig. 1). Tasks that only make requests, which normally represent users and load sources, are placed in the top layer. Other tasks are ordered by the greatest request depth from the top to any of their entries.

The lth layer submodel is created with two groups of queueing stations. There is a server station for each server task at depth l, and a source station for each client task which makes a request to any server task at depth l. A task appears as a server station in exactly one layer submodel, but it also appears as a source station in each submodel where a lower layer server appears as a server station. Each source station represents the requests coming from one client task by a number of identical customers in a routing chain (Lazowska et al., 1984). Their number is equal to the multiplicity (number of threads) of the client task, and their service time is the mean delay between the end of one request and making the next. For the example in Fig. 3(a) the top layer submodel has one chain for task tA and one for task tB, with the following parameters:

Chain 1: $N_1 = I_{AC} = 2;\ V_{1A} = 1;\ V_{1C} = 2;\ V_{1,pA} = 1$

Chain 2: $N_2 = I_{BC} = 2;\ V_{2B} = 1;\ V_{2C} = 3;\ V_{2D} = 4;\ V_{2,pB} = 1$

where tasks tA, tB, ... are denoted by A, B, ..., processors are pA, pB and

$N_c$: the number of customers in chain c
$V_{cy}$: the number of visits of chain c to station y in the queueing submodel

The service time for chain c at the source station is the sum of delays to the client task in the LQN, at servers (tasks or processors) which are represented in other submodels. If client task t gives rise to source station (and chain) c in layer l, the source station service time is $S_c$ given by

$$S_c = \sum_{l' \neq l} \; \sum_{e \in E(l')} R_{ce} + (\text{idle time}) \qquad (1)$$

where

$R_{ce}$: delay to task t at any entry e, per request to task m
$E(l)$: the set of LQN entries of tasks at depth l (and thus in layer l)
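The layering rule described in this section (pure clients in the top layer, every other task at its greatest request depth from the top) amounts to a longest-path computation on the request graph. The Python sketch below illustrates that rule under the assumption that the request graph is acyclic; the function `layer_depths` and the dictionary representation of requests are hypothetical and do not describe how LQNS stores a model. Processors are omitted for brevity.

```python
def layer_depths(calls: dict[str, set[str]]) -> dict[str, int]:
    """Assign each task the greatest request depth from the top layer.

    `calls` maps a task name to the set of tasks it makes requests to.
    Tasks that nobody calls get depth 1. Assumes an acyclic request graph.
    """
    # Invert the request graph: for each task, who makes requests to it?
    callers: dict[str, set[str]] = {task: set() for task in calls}
    for src, targets in calls.items():
        for dst in targets:
            callers.setdefault(dst, set()).add(src)

    memo: dict[str, int] = {}

    def depth(task: str) -> int:
        if task not in memo:
            # One layer below the deepest caller; pure clients have no callers.
            memo[task] = 1 + max((depth(c) for c in callers[task]), default=0)
        return memo[task]

    return {task: depth(task) for task in callers}


# Task-level requests of Fig. 3(a): tA calls tC; tB calls tC and tD.
fig3_layers = layer_depths({"tA": {"tC"}, "tB": {"tC", "tD"}})

# A request that jumps over a layer: c is placed below b, not beside it.
jump_layers = layer_depths({"a": {"b", "c"}, "b": {"c"}})
```

In the second example task c is called from both layer 1 and layer 2, and it is placed in layer 3, its greatest depth.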

Fig. 9. Submodels for the LQN model of Fig. 1.

The treatment of processors as servers to the tasks allocated to them, and the ‘‘idle time’’ term (which accounts for delays when the task t is idle) are explained in Franks et al. (1999). Within each layer submodel, the solver applies standard Mean Value Analysis (MVA) approximations to solve the model, and special approximations to deal with special features, as described in the thesis (Franks et al., 1999). These features include non-exponential service with multiclass FIFO queues, servers with second phases and parallel branches within a service. The basic idea of MVA is to find the arrival-instant mean queue length and use it to find the residence time of an average customer. Then the delays along a customer path are used to determine the system delays, throughputs and utilizations. Using iterative MVA approximations such as Bard-Schweitzer and Linearizer (Chandy and Neuse, 1982), these results define a new mean queue length.
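The iteration described above, in which an estimated arrival-instant queue length gives residence times, which in turn give throughput and a new queue length, can be illustrated for the simplest case. The following Python sketch is the textbook Bard–Schweitzer approximation for a single closed chain with a think time; it is an assumption-laden illustration of the idea only, not the multi-chain algorithm or the special approximations used inside LQNS. The function name and parameters are hypothetical.

```python
def bard_schweitzer(demands, population, think_time=0.0, tol=1e-10, max_iter=10_000):
    """Approximate MVA for one closed chain visiting FIFO queueing stations.

    demands[k]  : total service demand of a customer at station k
    population  : number of customers N in the chain (N >= 1)
    think_time  : delay outside the queueing stations
    Returns (throughput, residence_times, queue_lengths).
    """
    n = population
    k = len(demands)
    queue = [n / k] * k  # initial guess: customers spread evenly
    for _ in range(max_iter):
        # Arrival-instant queue length is estimated as Q_k * (N - 1) / N.
        resid = [d * (1.0 + q * (n - 1) / n) for d, q in zip(demands, queue)]
        throughput = n / (think_time + sum(resid))
        new_queue = [throughput * r for r in resid]  # Little's law per station
        if max(abs(a - b) for a, b in zip(new_queue, queue)) < tol:
            return throughput, resid, new_queue
        queue = new_queue
    raise RuntimeError("Bard-Schweitzer iteration did not converge")


# One customer: no queueing, so residence times equal the demands.
x1, r1, q1 = bard_schweitzer([1.0, 2.0], population=1, think_time=1.0)

# Ten customers sharing two stations.
x10, r10, q10 = bard_schweitzer([0.1, 0.2], population=10, think_time=1.0)
```

For a single customer the approximation is exact, since there is nobody to queue behind. For larger populations the throughput is bounded by the reciprocal of the largest demand, which gives a quick sanity check on the result.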

eA [2]

eB

tA

tB b1 [1] & b3 [1]

b2 [1] & b4 [1] (2)

(4) (3) eC [3]

pA

4.1. Solution with parallelism Parallelism within a task, as in Webserver in Fig. 1, modifies the structure and parameters of the layer submodels. In a layer submodel where the task appears as a client, forking of the flow creates threads and effectively adds customers to the queueing network and joining removes them, a behaviour which violates the conditions for product-form networks. LQNS uses the approach of Mak and Lundstrom (1990), to model the additional concurrency with additional customer chains. An additional thread class is defined for each parallel subpath in the activity network within the task (thus Webserver has three thread classes, a main class which includes activities init and fin, and two classes for the subpaths with activities static and dynamic, respectively). In the layer submodel each thread class becomes a separate source node and customer chain, with multiplicity equal to the task multiplicity. For example, in Fig. 10, the task ‘tB’ runs the activities ‘b2’ and ‘b3’ in parallel because of the fork. Three routing chains are created: one for the main thread of control consisting of activities ‘b1’ and ‘b4’, and one each for activities ‘b2’ and ‘b3’. Each of these routing chains has one customer since there is only one copy of task tB. The corresponding queueing network for this model is shown in Fig. 11. Routing chain 2 corresponds to activities ‘b1’ and ‘b4’, chain 3 to activity ‘b2’ and chain 4 to activity ‘b3’. The service time at the source station is again given by Eq. (1), but with delays and idle time calculated for each thread. The separate threads created by forks and joins are modelled as if they were independent, but actually they are synchronized at the fork and join points. This synchronization is accounted for approximately in the solver using the ‘‘overlap’’ adjustment of Mak and Lundstrom (1990) in calculating the contention at the servers. 

Fig. 10. Threaded non-replicated tasks.

Fig. 11. Queueing network for the non-replicated model of Fig. 10.

This adjustment to the method of surrogates (Jacobson and Lazowska, 1982; Lazowska et al., 1984) removes the delays calculated by
MVA for contention between chains that do not interfere with each other. For example, in Figs. 10 and 11, chain 2 cannot interfere with either chain 3 or chain 4 because chain 2 is blocked waiting for the join while chains 3 and 4 run, and chains 3 and 4 are effectively suspended while chain 2 runs. The overlap adjustment significantly improves the accuracy of the surrogate delay approach (Mak and Lundstrom, 1990; Franks and Woodside, 1998).

4.2. Solution with replication

With replication (but no parallelism) the solution algorithm represents each set of replicated servers by just one server, and the LQNS replication algorithm described in Pan (1996) and Omari et al. (2005) solves for the contention delays and the service times of that replica. The algorithm constructs a reduced version of any layer submodel that contains servers that represent replicated tasks. It includes only one server for each group of replicas, and a
set of source stations and chains for the clients of each server. For each server, one chain is constructed for each group of replicated client tasks whose instances visit it, with a population defined by the number of potential customers to the server. This is equal to the product of the client task multiplicity and the fanin parameter to the server task (representing replication of the client tasks). Each chain visits just one source station and one server station, but a server may be traversed by many chains. The chains in the submodel for the top layer of the example of Fig. 3(b) have the parameters:

Chain 1 (for tA visiting tC): $N'_1 = I_{AC} = 2$, $V'_{1A} = 1$, $V'_{1C} = 2$
Chain 2 (for tA visiting pA): $N'_2 = I_{A,pA} = 1$, $V'_{2A} = 1$, $V'_{2,pA} = 1$
Chain 3 (for tB visiting tC): $N'_3 = I_{BC} = 2$, $V'_{3B} = 1$, $V'_{3C} = 3$
Chain 4 (for tB visiting pB): $N'_4 = I_{B,pB} = 1$, $V'_{4B} = 1$, $V'_{4,pB} = 1$
Chain 5 (for tB visiting tD): $N'_5 = I_{BD} = 1$, $V'_{5B} = 1$, $V'_{5D} = 4$

where $N'_c$ is the number of customers in the chain, A, B, ... denote servers for the LQN tasks, pA and pB denote the processors for tA and tB, and as before $V'_{cy}$ is the number of visits by a client of chain $c$ to server $y$. Notice that the requests modelled by Chain 2 without replication are now modelled by Chain 3, Chain 4, and Chain 5, which have different numbers of customers. Splitting the customer chains according to the server they visit is necessary if different fanout values can be applied to different server tasks in the LQN, since there is one server station in the layer submodel for each server task in the LQN.

4.3. Source service time calculation with replication

The source service time calculation of Eq. (1) is modified to add the delay due to visits to replicas which are not represented in the submodel. As an example, the modified source service time $S'_{1A}$ of Chain 1 is given as

$$S'_{1A} = S_{1A} + (O_{AC} - 1) R_{1C} + O_{A,pA} R_{2,pA} \qquad (2)$$

where:

$S_{1A}$: source service time for chain 1, calculated according to Eq. (1)
$O_{AC}$: fanout from task A to task C
$R_{1C}$: delay to task A (which generates chain 1) at task C, $R_{1C} = Y_{AC} W_{AC}$
$Y_{AC}$: number of visits from task A to task C
$W_{AC}$: delay for one visit from task A at task C (service time plus queueing time)
$O_{A,pA}$: fanout from task A to processor pA
$R_{2,pA}$: delay to task A (which generates chain 2) at processor pA

In general, the modified source service time $S'_c$ for a chain $c$ generated by a client task $t$ for its requests to server task or processor $m$ becomes, in the model with replication:

$$S'_c = S_c + (O_{tm} - 1) \sum_{e \in E(m)} R_{te} + \sum_{m' \neq m} \; \sum_{e \in E(m')} O_{te} R_{te}$$

where:

$S_c$: service time for chain $c$, calculated by Eq. (1)
$O_{tm}$: fanout of client task $t$ to server $m$ in the LQN model
$O_{te}$: fanout of client task $t$ to a particular entry $e$ in the LQN model
$R_{te}$: delay to task $t$ at any entry $e$, per request to server $m$
$E(m)$: the set of all entries of task $m$

This equation makes use of the fact that $R_{te} = 0$ if task $t$ does not make any requests to entry $e$. The first sum includes entries of the task $m$ visited by the chain; the second sum includes terms for other entries in the same layer submodel and other layers. Since the chain does not visit them in the submodel, their delays are added to the source station service time.
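The general source service time adjustment of Section 4.3 can be sketched as a small function. This is an illustration only, not the LQNS code: the function name, the dictionary layout and the numeric values are hypothetical, and the fanout $O_{te}$ to an entry is assumed to equal the fanout to the task or processor that owns the entry.

```python
def modified_source_service_time(s_c, target, fanout, delay):
    """Return S'_c for a chain whose submodel server is `target`.

    s_c    : source service time S_c from Eq. (1)
    target : name of the server m represented in the chain
    fanout : dict, server name -> fanout O_tm of the client to that server
    delay  : dict, server name -> dict of entry name -> delay R_te
             (entries that are never called simply carry R_te = 0)
    """
    # Delay at the (O_tm - 1) replicas of m that are not in the submodel.
    missing_replicas = (fanout[target] - 1) * sum(delay[target].values())
    # Delay at all replicas of every other server the client calls
    # (assumption: O_te is taken as the fanout to the owning server).
    other_servers = sum(fanout[m] * r
                        for m, entries in delay.items() if m != target
                        for r in entries.values())
    return s_c + missing_replicas + other_servers


# Hypothetical numbers shaped like Eq. (2): task A calls tC and processor pA.
s_prime = modified_source_service_time(
    s_c=10.0,
    target="tC",
    fanout={"tC": 3, "pA": 1},
    delay={"tC": {"eC": 2.0}, "pA": {"pA": 0.5}},
)
print(s_prime)  # 10 + (3 - 1) * 2.0 + 1 * 0.5 = 14.5
```

With a fanout of 1 to the target server the first correction vanishes, which matches the non-replicated case.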

4.4. Solving models with replication and parallelism

The LQNS replication algorithm described in Pan (1996) and Omari et al. (2005) assumes that the task does not have parallelism or heterogeneous threads. In this paper, the algorithm has been modified in order to remove this restriction. A source station and customer chain are constructed for each server and for each thread class that visits it, with the number of customers equal to the product of the client task’s multiplicity and the fanin parameter to the server task. Each chain visits just one source station and one server station, but a server may be traversed by many chains. Consider the model shown in Fig. 4(a). Its upper-most submodel is shown in Fig. 12(a), and the underlying queueing model for this submodel is shown in Fig. 12(b). The chains for the queueing model are constructed as follows:

Chain 1 (for thread 1 of tA): $N'_1 = I_{tA,pA} = 1$, $V'_{1,tA} = 1$, $V'_{1,pA} = 1$
Chain 2 (for thread 1 of tA): $N'_2 = I_{tA,tC} = 2$, $V'_{2,tA} = 1$, $V'_{2,tC} = 2$
Chain 3 (for thread 1 of tB): $N'_3 = I_{tB,tC} = 2$, $V'_{3,tB} = 1$, $V'_{3,tC} = 3$
Chain 4 (for thread 1 of tB): $N'_4 = I_{tB,pB} = 1$, $V'_{4,tB} = 1$, $V'_{4,pB} = 1$
Chain 5 (for thread 2 of tB): $N'_5 = I_{tB,pB} = 1$, $V'_{5,tB} = 1$, $V'_{5,pB} = 1$
Chain 6 (for thread 3 of tB): $N'_6 = I_{tB,pB} = 1$, $V'_{6,tB} = 1$, $V'_{6,pB} = 1$
Chain 7 (for thread 3 of tB): $N'_7 = I_{tB,tD} = 1$, $V'_{7,tB} = 1$, $V'_{7,tD} = 4$

where $V'_{c,y}$ is the number of visits by a client of chain $c$ to a server $y$ and $N'_c$ is the number of customers in the chain. Task tB has three threads which generate five chains. The first thread executes activities b1 and b4, and generates chains 3 and 4; the second thread executes activity b2 on processor pB using chain 5; and the third thread executes activity b3 on processor pB using chain 6 and calls task tD using chain 7. Servers tC, pB, and tD each have only one chain for each thread of the client tB that visits them. Server tD has no chain for task tA because task tA does not make any request to task tD. This means that a client task can have more than one chain corresponding to each of its threads, but a server can have one and only one chain for each thread that makes requests or calls to it.
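The chain construction rule of Section 4.4 can be sketched as follows, using the seven chains listed for Fig. 12. The data layout and the names `Chain` and `build_chains` are hypothetical, and the multiplicity of one instance of each client task is assumed to be 1, so that each population equals the fanin.

```python
from collections import namedtuple

Chain = namedtuple("Chain", "number client thread server population visits")


def build_chains(calls, multiplicity, fanin):
    """Create one chain per (client thread, server) pair.

    calls        : list of (client, thread, server, visits) tuples
    multiplicity : dict, client -> multiplicity of one client instance
    fanin        : dict, (client, server) -> fanin I to the server
    """
    chains = []
    for number, (client, thread, server, visits) in enumerate(calls, start=1):
        # Population = client multiplicity x fanin to the server task.
        population = multiplicity[client] * fanin[(client, server)]
        chains.append(Chain(number, client, thread, server, population, visits))
    return chains


# Calls and fanin values taken from the chain list above.
calls = [
    ("tA", 1, "pA", 1), ("tA", 1, "tC", 2),
    ("tB", 1, "tC", 3), ("tB", 1, "pB", 1),
    ("tB", 2, "pB", 1), ("tB", 3, "pB", 1), ("tB", 3, "tD", 4),
]
fanin = {("tA", "pA"): 1, ("tA", "tC"): 2, ("tB", "tC"): 2,
         ("tB", "pB"): 1, ("tB", "tD"): 1}
multiplicity = {"tA": 1, "tB": 1}  # assumption: single-threaded instances

chains = build_chains(calls, multiplicity, fanin)
for ch in chains:
    print(ch)
```

Grouping the result by server shows that pB is traversed by chains 4, 5 and 6, one for each thread of tB that runs on it.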

Fig. 12. The top-most submodel and queueing network for the replicated model of Fig. 4. Objects in (a) are annotated with the chains used in the queueing network in (b): (a) layer submodel and (b) queueing network model.

4.5. Source station service time calculation

The source service time calculation of Eq. (1) is modified to consider behaviour of just one thread class per source station, and to add the delay due to visits to replicas which are not represented in the submodel. The delays that would be seen at these stations are added to the source station service time. The modified source station service time $S'_2$ for chain 2 (which is generated by thread 1 of tA, visiting tC) is given as

$$S'_2 = S_2 + (O_{tA,tC} - 1) R_{2,tC} + R_{1,pA} \qquad (3)$$

where:

$S_2$: source service time for chain 2, calculated by Eq. (1)
$R_{1,pA}$: delay imposed on thread 1 of tA by processor pA (service time plus queueing time)
$R_{2,tC}$: delay imposed on thread 1 of tA by task tC, $R_{2,tC} = Y_{2,tC} W_{2,tC}$
$Y_{2,tC}$: number of visits to task tC by chain 2
$W_{2,tC}$: delay for a single visit from thread 1 of tA to task tC

Similarly,

$$S'_3 = S_3 + (O_{tB,tC} - 1) R_{3,tC} + O_{tB,pB} R_{4,pB} \qquad (4)$$

In addition to the delay for the other replica tC servers, the modified service time of chain 3 of client tB includes delays to the thread in the LQN, from other service requests. In this case the requests are to the processor, and they are incurred by chain 4. In general, with parallelism the source station service time for a customer chain $c$ generated by thread $thr$ of task $t$ making requests to server task $m$ is calculated by

$$S'_c = S_c + (O_{tm} - 1) \sum_{e \in E(m)} R_{thr,e} + \sum_{m' \neq m} \; \sum_{e \in E(m')} O_{te} R_{thr,e} \qquad (5)$$

where:

$S_c$: source station service time for chain $c$
$O_{tm}$: fanout of client task $t$ to server $m$ in the LQN model
$R_{thr,e}$: total delay to activities in the thread which generates chain $c$, from requests to entry $e$, per request made to entries of task $m$
$E(m)$: the set of entries of task $m$

4.6. Implementation

The algorithm that solves replicated tasks with fork-join thread patterns, shown in Fig. 13, has been implemented in the Layered Queueing Network Solver (LQNS). After the model is loaded, the solver checks during initialization whether it has replication. If it does, then it creates the chains for the replicated clients and servers from the servers’ point of view, as explained in the previous sections. It sets the number of customers of a chain equal to the fanin of that chain multiplied by the population of an instance of the replicated client. The solver solves each LQN submodel by MVA. If the submodel has any of its clients or servers replicated, then it modifies the clients’ service times as explained before. This is because a customer of a specific client’s thread might visit more than one chain, so the service time of a chain needs to be modified to account for the delay incurred when the customers visit all other chains visited by that thread.

The solution of a specific submodel that has replication is iterative. The client service times are modified, and then the submodel is solved using MVA. Then the new service times are used to update the clients’ service times, and the submodel is solved again. This process is repeated until the submodel solution converges. This is called the ‘‘inner’’ fixed-point iteration, and it uses the multivariate Newton–Raphson method. When the replicated submodel converges, the solver moves to the next submodel and uses the results of the current submodel as input to the next submodel if needed. This process of iterating between the submodels is called the ‘‘outer’’ fixed-point iteration. It is repeated until the outer iteration converges or reaches an iteration limit.
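The shape of the inner iteration can be illustrated with a toy example. The sketch below is not the pseudo-code of Fig. 13 and it does not use Newton–Raphson; it uses plain successive substitution, which is simpler to show. The function names are hypothetical, the submodel has one client chain and one replicated server, and the delay at each missing replica is assumed equal to the delay at the replica that is represented.

```python
def toy_submodel(source_service_time, demand=1.0):
    """Residence time at one server for a chain of 2 customers.

    Exact single-chain MVA, with the source station acting as a delay
    server whose service time is `source_service_time`.
    """
    # Population 1: no queueing, so residence time equals the demand.
    queue_one = demand / (source_service_time + demand)
    # Population 2: the arrival sees the queue left by one customer.
    return demand * (1.0 + queue_one)


def inner_iteration(base_service_time, fanout, solve_submodel,
                    tol=1e-10, max_iter=1000):
    """Iterate between the service time update and the submodel solution."""
    delay = 0.0
    for iteration in range(1, max_iter + 1):
        # Add the delay at the (fanout - 1) replicas missing from the submodel.
        service_time = base_service_time + (fanout - 1) * delay
        new_delay = solve_submodel(service_time)
        if abs(new_delay - delay) < tol:
            return service_time, new_delay, iteration
        delay = new_delay
    raise RuntimeError("inner iteration did not converge")


service_time, delay, iterations = inner_iteration(1.0, 2, toy_submodel)
print(service_time, delay, iterations)
```

For this toy case the fixed point satisfies delay^2 + delay - 3 = 0, so the delay converges to (sqrt(13) - 1)/2, roughly 1.303.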

Fig. 13. Pseudo-code for ‘‘inner’’ iteration.

Fig. 14. LQN model for a typical search engine.

Table 1
Results of solving the LQN model of Fig. 14

K        Client response time (ms)   IndexServer utilization   Number of computational steps
1        15,341,500                  1.0                       188
100      153,276                     1.0                       214
500      30,533.9                    1.0                       240
1000     15,249.7                    1.0                       240
2000     8216.24                     0.844                     331
2500     7221.13                     0.685                     272
5000     6062.56                     0.281                     100
10000    5798.32                     0.120                     70

5. Results and analysis

To demonstrate the replicated solver, several models are shown below. The first subsection demonstrates the scalability of the solution technique. The second subsection presents several examples, in both their replicated and expanded forms, which are used to demonstrate the accuracy of the solution. Finally, the technique is used on an industrial management information system example.

5.1. Scalability

The replicated model in Fig. 14 is a hypothetical implementation of a typical search engine. It consists of one million customers requesting services from 100K index servers, which access, in turn, 50K document servers and 10K ranking servers, where K is a scaling parameter (so 100K denotes 100 × K). The parameter K was varied from 1 to 10,000 to change the replication level of the components. Table 1 shows the number of times the core one-step MVA calculation is executed to solve the various configurations, which is an indication of the complexity of the calculation. The fourth column shows that the number of steps is approximately 240 on average, regardless of the scaling, when the index server is saturated. The algorithm is much more efficient when the model is not bottlenecked because the iterations are sensitive to small changes in throughputs when the corresponding utilizations are high. The non-replicated model would not be solvable when K = 10,000 as there would be 1.6 million stations (100K + 50K + 10K = 160K servers). It took less than 20 ms to solve the LQN for any value of K on a Pentium 4 2.8 GHz machine running Windows XP using LQNS version 3.8.

5.2. Accuracy

The example system shown in Fig. 4 is used to consider the accuracy of the approximations made in the replication calculation. Three different MVA queueing network solvers were used to solve the layer submodels for the replicated model: the Schweitzer approximation, Linearizer, and exact MVA. The fully expanded system was solved using simulation; results are shown with 95% confidence intervals. Tables 2–4 list the response times, throughputs and task utilizations respectively, and the relative error (approx − sim)/sim × 100% of these values when compared to the simulation runs. These results show that the Schweitzer approximation gives the closest results to the full system on average, with errors of less than 3% in magnitude for all of the cases, and in some instances the results lie within the bounds of the confidence intervals of the simulation. Linearizer and exact MVA fare less well, though the errors from these algorithms are still quite acceptable at about 5% or less in magnitude.
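As a small worked check of the relative error definition, the sketch below recomputes the Schweitzer error for the two cycle times reported in Table 2. The function name is hypothetical; the numbers are the simulation and Schweitzer values from that table.

```python
def relative_error_percent(approx, sim):
    """Relative error (approx - sim) / sim, expressed in percent."""
    return (approx - sim) / sim * 100.0


# (simulation mean, Schweitzer result) for the cycle times in Table 2.
cycle_times = {"tA": (33.10, 33.94), "tB": (73.19, 73.58)}

for task, (sim, approx) in cycle_times.items():
    print(f"{task}: {relative_error_percent(approx, sim):.1f}%")
# prints "tA: 2.5%" and "tB: 0.5%"
```

The tables report error magnitudes, so a result below the simulation value would appear there without its sign.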

Table 2
Cycle time results for the replication example (the Schweitzer, Linearizer and Exact MVA columns are for the replicated model)

Task   Full model (simulation)   Schweitzer   % Error   Linearizer   % Error   Exact MVA   % Error
tA     33.10 ± 0.40              33.94        2.5       31.54        4.7       31.64       4.7
tB     73.19 ± 0.79              73.58        0.5       70.79        3.3       70.92       3.1

Table 3
Throughput results for the replication example (the Schweitzer, Linearizer and Exact MVA columns are for the replicated model)

Task   Full model (simulation)   Schweitzer   % Error   Linearizer   % Error   Exact MVA   % Error
tA     0.0302 ± 0.0003           0.0295       2.3       0.0317       5.0       0.0316      4.6
tB     0.0137 ± 0.0001           0.0136       0.7       0.0141       2.9       0.0141      2.9
tC     0.2025 ± 0.0009           0.2005       1.0       0.2128       5.1       0.2122      4.8
tD     0.0546 ± 0.0014           0.0544       0.4       0.0565       3.5       0.0564      3.3

Table 4
Task utilization results for the replication example (the Schweitzer, Linearizer and Exact MVA columns are for the replicated model)

Task   Full model (simulation)   Schweitzer   % Error   Linearizer   % Error   Exact MVA   % Error
tC     0.6077 ± 0.0045           0.6016       1.0       0.6385       5.1       0.6368      4.8
tD     0.2733 ± 0.0081           0.2718       0.5       0.2826       3.4       0.2820      3.2

The higher errors for exact MVA and Linearizer may be explained by inspecting the MVA algorithm. Modifying the service time of the client delay server with the delay at the ‘missing’ stations is essentially estimating the residence times of the chains at the stations, $R$, in the MVA algorithm. However, in the case of exact MVA and Linearizer, the $R$ values for different populations are required in the MVA iteration. For exact MVA, the $R$ values for the populations from 0 to $N$ are required. For Linearizer, the $R$ values for populations $N$ and $N - 1_c$ are required. By modifying the service time of the client, an estimate of $R$ for population $N$ is used, which is fixed throughout the iteration. That is, it is used even though, for exact MVA, the $R$ values for the whole range of populations from 0 to $N$ are needed. The estimate of $R$ is therefore incorrect for the other populations and produces bigger errors in exact MVA and Linearizer. The error for the Schweitzer approximation is low on average since, in this algorithm, only the delays for population $N$ are needed. In this case, the estimated $R$ is correct, or nearly so. The error in the results is due to the Schweitzer approximation itself. The Schweitzer approximation results for the full models are very close to their corresponding replicated model results, also using Schweitzer. In addition, the error for the cycle time result is increased since the cycle time is obtained by multiplying the calculated delay at the client by the number of visits to a server.

In other words, the error from the replication algorithm appears in the delay result of the client (delay server), which is magnified in the client cycle time result by the number of visits. The Schweitzer algorithm is computationally more efficient than the Linearizer or exact MVA algorithms. Therefore, it is recommended to use the Schweitzer algorithm in solving larger replicated models.

Since the Schweitzer MVA approximation gives the best results on average, the space and time complexity relative to this algorithm is discussed. The space requirement for Schweitzer is proportional to the product of the number of chains, $C$, and the number of stations, $N$, i.e. $O(CN)$. The time requirement per iteration of the algorithm is also proportional to this product. The replication algorithm reduces the number of chains and the number of stations needed, thereby reducing the space and time requirement for each Schweitzer iteration by $O\left(\sum_{m=1}^{M} (K_m - 1)\right)$, where $K_m$ is the number of replicas at server $m$, $M$ is the total number of replicated task sets, and $N = \sum_{m=1}^{M} K_m$. The replication iteration introduced for solving each submodel increases the time by an unknown factor. Finally, the time complexity for one iteration of the LQNS inter-layer submodel solution is $O(LN^2)$ or $O\left(L \left(\sum_{m=1}^{M} K_m\right)^2\right)$, where $L$ is the number of layers. (This is derived from the time complexity of one iteration of Schweitzer, which is $O(CN)$.) Since the replication algorithm reduces the number of stations by representing a set of replicas by one station, the time complexity of one LQNS iteration is reduced to $O(LM^2)$, that is, by a factor of $(N/M)^2$. Table 5 shows the difference in the complexity of the three MVA algorithms when solving this model. The column labelled ‘‘operations’’ in the table is the number of times the residence time is computed by the MVA solver. The results show that the Schweitzer approximation is two orders of magnitude more efficient than Linearizer and seven orders of magnitude more efficient than exact MVA. The advantages of the replication approximation are obvious when comparing the calculation time, the ease of reading the results, and the simplification of the model between the full model and the replicated model. Further, changing the level of replication with the replicated model amounts to a simple parameter change, which can simplify parametric analysis.

Table 5
Time complexity

Algorithm    Operations
Schweitzer   1.08 × 10^4
Linearizer   1.46 × 10^6
Exact        1.21 × 10^11

5.3. Industrial management information system

This example has two configurations, shown in Figs. 15 and 16. It describes a large Management Information System, which accesses two backend databases (called RF and BC) through local servers (shown as the ‘‘LAN servers’’) which do routing and some processing. Some of the LANs are connected to the backbone network and others are connected to a wide area network (WAN). The configurations differ in that the second has ‘‘regional servers’’ (shown as ‘‘RS_DB’’) to off-load work from one of the two databases. The model uses simplifying assumptions to obtain symmetry in the entities. In the model, the workstations are identical sources of workload, and each LAN has the same number of workstations. The number of LANs attached to the WAN equals that attached to the backbone. The two databases, however, are different.

The model is shown in Fig. 15. The users and their workstations are the two sets of client tasks at the top, one set for the backbone and one for the WAN connection. The LAN servers are in the middle; they fork their requests to the database servers and WAN. The two database servers are modelled as server tasks in the bottom layer. The LANs are modelled as delay servers attached to the Client workstations which use them, and the WAN and backbone are similar delay servers attached to the LAN servers. The database server subsystems have been greatly simplified, and the effects of the communications front end and the storage are incorporated into the parameters of the database server tasks. The LQN model in Fig. 15 was used to study the performance of the system without regional servers, under varying
configurations and parameters. The response time at the workstations was determined when the number of workstations in the system is increased, and the results are shown in Fig. 17, labeled as the ‘‘base case’’. There are 10 workstations attached to each LAN. Note that there are two LAN_W attached to 20 WS_W workstations, and there are two LAN_B attached to 20 WS_B workstations. The number of workstations is increased by attaching new LAN’s to the network. As expected, the response time increases with the number of workstations. The response time increases dramatically at more than 600 clients since the RF database saturates between 500 and 600 clients. The RF database is the first component to saturate and is the bottleneck of the system. After saturation, the response time continues to increase linearly.

Fig. 15. LQN for the base case of the database system.

Fig. 16. LQN for the regional servers’ case.

Fig. 17. Performance of database system. [Plot: response time of Client versus number of workstations, for the base case and the regional servers case.]

Fig. 18. Effect on performance of off-loading to regional servers. [Plot: response time at Clients versus fraction of requests to regional server.]

The second case using regional servers is shown in Fig. 16. Three regional servers, RS_DB, were introduced into the design to off-load the work at the RF database, by handling transactions that are local to the region of the originating workstation. The regional servers are given the same parameters and entries as the RF database. The visit ratios to the regional servers and RF database are determined by the fraction of RF database requests that
are routed to the regional servers. Fig. 18 shows the effect of off-loading to regional servers on the response time at clients. Fig. 17 (regional servers’ case) shows the response time seen at the client with 20% of the RF database traffic routed to the regional servers. Clearly, the response time at the client is improved and the RF database now saturates only between 700 and 800 clients. This example shows a typical use of the replicated solver in a planning context, and is based on a real industrial system. It also shows the use of replication level (of the LAN servers) as a parameter, rather than a structural change in the model.

6. Conclusions

The approach to describing and solving models with parallel threads in replicated components and subsystems, described here, expands our capability to use analytic models for planning. It makes it easier to describe large parallel systems since each group of replicas is described only once. It is scalable in time and space, since the solution is insensitive to replication levels. Thus it is much faster than the solution of the full expanded model. This approach gives an advantage over previous replication-based solvers using state-based models (providing better scalability due to use of Mean Value Analysis), over replication analysis in queueing networks (because it handles extended queueing models with simultaneous resource possession) and over other work in replicated layered queues (in that it handles replicas with fanin). A subtle advantage of the present approach is that it makes replication a parameter of the model, so it can be rapidly studied as a parameter change, rather than requiring re-structuring for each level of replication. The approximations necessary to use the present approach introduce some error. Using the Schweitzer MVA algorithm for the layer submodel solutions, errors under 3% were introduced.
It is interesting that MVA algorithms which are more accurate for product-form queueing networks (Linearizer and exact MVA) gave larger errors.

Acknowledgements

The authors wish to acknowledge the value of conversations with Jerry Rolia during the development of these ideas, and the financial support of Bell Canada, CITO (the Centre for Innovation in Telecommunications) and NSERC (Natural Sciences and Engineering Research Council of Canada).

References

Barroso, L.A., Dean, J., Holzle, U., 2003. Web search for a planet: the Google cluster architecture. IEEE Micro 23 (2), 22–28.
Capra, L., Dutheillet, C., Giuliana Franceschinis, J.M.I., 1999. Towards performance analysis with partially symmetrical SWN. In: Proceedings of the Seventh International Workshop on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS’99). IEEE Computer Society Press, College Park, MD, USA, pp. 148–155.
Chandy, K.M., Neuse, D., 1982. Linearizer: a heuristic algorithm for queueing network models of computing systems. Commun. ACM 25 (2), 126–134.
Das, O., Woodside, M., 2004. Dependability modeling of self-healing client–server applications. In: Lemos, R.D., Gacek, C., Romanovsky, A. (Eds.), Architecting Dependable Systems II, Lecture Notes in Computer Science, vol. 3069. Springer-Verlag, pp. 266–285.
Franks, G., Majumdar, S., Neilson, J., Petriu, D., Rolia, J., Woodside, M., 1996. Performance analysis of distributed server systems. In: The Sixth International Conference on Software Quality (6ICSQ). American Society for Quality Control (ASQC), Ottawa, Ont., Canada, pp. 15–26.
Franks, G., Woodside, M., 1998. Performance of multi-level client–server systems with parallel service operations. In: Proceedings of the First International Workshop on Software and Performance (WOSP ’98), ACM Sigmetrics, Association for Computing Machinery, Santa Fe, NM, pp. 120–130.
Franks, R.G., 1999. Performance analysis of distributed server systems. Ph.D. thesis, Department of Systems and Computer Engineering, Carleton University, Ottawa, Ont., Canada, December.
Jacobson, P.A., Lazowska, E.D., 1982. Analyzing queueing networks with simultaneous resource possession. Commun. ACM 25 (2), 142–151.
Kähkipuro, P., 2001. UML-Based performance modeling framework for component-based systems. In: Dumke, R., Rautenstrauch, C., Schmietendorf, A., Scholz, A. (Eds.), Performance Engineering: State of the Art and Current Trends, vol. 2047. Springer-Verlag, Berlin.
Lamehamedi, H., Shentu, Z., Szymanski, B., Deelman, E., 2003. Simulation of dynamic data replication strategies in data grids. In: International Parallel and Distributed Processing Symposium (IPDPS’03). IEEE Computer Society Press, Nice, France.
Lazowska, E.D., Zahorjan, J., Graham, S.G., Sevcik, K.C., 1984. Quantitative System Performance; Computer System Analysis Using Queueing Network Models. Prentice-Hall, Englewood Cliffs, NJ.
Loukopoulos, T., Ahmad, I., Papadias, D., 2002. An overview of data replication on the internet. In: International Symposium on Parallel Architectures Algorithms and Networks (I-SPAN ’02). IEEE Computer Society Press, Makati City, Metro Manila, Philippines, pp. 31–38.
Mak, V.W., Lundstrom, S.F., 1990. Predicting performance of parallel computations. IEEE Trans. Parallel Distrib. Syst. 1 (3), 257–270.
Menascé, D.M., 2002. Two-level iterative queuing modeling of software contention. In: Proceedings of the Tenth IEEE/ACM International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS 2002), Fort Worth, TX.
Menascé, D.A., Almeida, V.A.F., Dowdy, L.W., 1994. Capacity Planning and Performance Modeling: From Mainframes to Client–Server Systems. Prentice-Hall, Englewood Cliffs, NJ.
Omari, T., Franks, G., Woodside, M., Pan, A., 2005. Solving layered queueing networks of large client–server system with symmetric replication. In: Proceedings of the Fifth International Workshop on Software and Performance (WOSP ’05), Association for Computing Machinery, Palma de Mallorca, Spain, pp. 159–166.
Pan, A.M., 1996. Solving stochastic rendezvous networks of large client–server systems with symmetric replication. Master’s thesis, Department of Systems and Computer Engineering, Carleton University, oCIEE96-06, September.
Petriu, D., Amer, H., Majumdar, S., Abdull-Fatah, I., 2000. Using analytic models predicting middleware performance. In: Proceedings of the Second International Workshop on Software and Performance (WOSP 2000), ACM Sigmetrics, Association for Computing Machinery, Ottawa, Ont., Canada, pp. 189–194.
Ramesh, S., Perros, H.G., 2000. A multilayer client–server queueing network model with synchronous and asynchronous messages. IEEE Trans. Softw. Eng. 26 (11), 1086–1100.
Rolia, J.A., Sevcik, K.A., 1995. The method of layers. IEEE Trans. Softw. Eng. 21 (8), 689–700.
Sanders, W.H., Meyer, J.F., 1991. Reduced base model construction methods for stochastic activity networks. IEEE J. Select. Areas Commun. 9 (1), 25–36.
Sheihk, F., Rolia, J., Garg, P., Frolund, A.S.S., 1997. Layered modelling of large scale distributed applications. In: Proceedings of the 1st World Congress on Systems Simulation. Quality of Service Modelling, Singapore, pp. 247–254.
Smith, C.U., 1990. Performance engineering of software systems. The SEI Series in Software Engineering. Addison Wesley.
Spitznagel, B., Garlan, D., 1998. Architecture-based performance analysis. In: Deng, Y., Gerken, M. (Eds.), Proceedings of the 10th Conference on Software Engineering and Knowledge Engineering (SEKE’98). Knowledge Systems Institute, pp. 146–151.
Woodside, C.M., 1983. Performance potential of communications interface processors. In: Proceedings of the Eighth Data Communications Symposium, Association for Computing Machinery. North Falmouth, MA, USA, pp. 245–253.
Woodside, C.M., Neilson, J.E., Petriu, D.C., Majumdar, S., 1995. The stochastic rendezvous network model for performance of synchronous client–server-like distributed software. IEEE Trans. Comput. 44 (8), 20–34.

Tariq Omari is a Ph.D. candidate in the Department of Systems and Computer Engineering at Carleton University, Canada. He received his M.S. in computer engineering from the University of Wisconsin-Milwaukee, USA in 2000. He worked as a software engineer for Catalyst International Inc., and Intel Corporation, USA. His research interests include software performance engineering, mobile computing, computer networks, and distributed systems.

Greg Franks is an Assistant Professor at the Department of Systems and Computer Engineering at Carleton University. His areas of interest are computer systems performance analysis, operating systems, and internet protocol routing. His principal area of research is analytic performance modelling, where he has several years of experience both in an industrial and an academic setting. He received the Ph.D. degree from Carleton University, and has taught topics ranging from microprocessor interfacing to functional and logic programming languages.

Murray Woodside does research in all aspects of performance and dependability of software. Much of this work is based on a special form of queueing analysis called layered queueing (also known as ‘‘active servers’’) which he has applied to distributed systems of many kinds. He received the Ph.D. degree in Control Engineering from Cambridge University, England and has taught and done research in stochastic control, optimization, queuing theory, performance modelling of communications and computer systems, and software performance. In the period 1995–1999 he was Vice-Chair and Chair of Sigmetrics, the ACM Special Interest Group on performance. He is an Associate Editor of Performance Evaluation.

Amy Pan obtained an M.Eng. degree in Electrical Engineering at Carleton University, studying software modeling, and works in the telecommunications industry.