Computer Communications 31 (2008) 119–128 www.elsevier.com/locate/comcom
A hybrid closed queuing network approach to model dataflow in networked distributed processors

Vidhyacharan Bhaskar a,*, Kondo Hloindo Adjallah b

a Department of Electronics and Communication Engineering, S.R.M. University, S.R.M. Nagar, Kattankulathur 603203, Kanchipuram District, Tamilnadu, India
b Institute of Computer Science and Engineering of Troyes, Universite de Technologie de Troyes, 10000 Troyes, France

* Corresponding author. Tel.: +91 9884661184. E-mail address: [email protected] (V. Bhaskar).

Received 17 October 2006; received in revised form 12 October 2007; accepted 14 October 2007. Available online 22 October 2007.
Abstract

In this paper, a hybrid closed queuing network model is proposed to model dataflow in networked distributed processing systems. Multi-threading is useful in reducing latency by switching among a set of threads in order to improve processor utilization. Two sets of processors exist: synchronization processors, which handle load/store operations, and execution processors, which handle arithmetic/logic and control operations. A closed queuing network model is suitable for a large number of job arrivals. Both single-server and multiple-server models are discussed. The normalization constant is derived using a recursive algorithm for the given model. Performance measures such as average response times and average system throughput are derived and plotted against the total number of processors in the closed queuing network model. Other important performance measures, such as processor utilizations, average queue lengths, average waiting times and relative utilizations, are also derived.
© 2007 Published by Elsevier B.V.

Keywords: Synchronization and execution processors; Queue lengths; Response times; Utilizations and throughput
1. Introduction

Multi-threaded processors use a fast context switch to bridge latencies caused by memory accesses or by synchronization operations. In processors which contain multiple functional units, multi-threading achieves higher instruction issue rates. Load/store, synchronization and execution operations of different threads of control are executed simultaneously by appropriate functional units. The search for an optimal solution in modeling a multi-threaded architecture using a queuing network model is an approach based on the evaluation of performance measures such as the average response time, the average throughput, the average waiting time, the average queue
length, the optimum number of processors for a good balance of utilizations, and the maximum queue capacity. For a large number of arrivals, the multi-threaded architecture can be modeled by a closed queuing network with single and multiple servers.

In open queuing network models, a job is scheduled to main memory and is able to compete for active resources, such as synchronization and execution processors, immediately on its arrival [1]. In embedded micro-controllers for diagnostics, the number of main-memory partitions is limited, so an additional queue to hold diagnostic tasks is necessary. This queue is called a job-scheduler queue [1]. Such a network is said to be multiple-resource holding, because a job simultaneously holds main memory and an active device. Such a network cannot be solved by product-form methods [2].

The jobs entering the queuing model from outside have a specific arrival rate. If this external arrival rate is low, the
probability that the job has to wait in the scheduler queue is also low. So, the open queuing network model is a good solution when the arrival rate is low. In other words, an open queuing network model is a light-load approximation. If the external arrival rate is high, the probability that there is at least one customer in the job-scheduler queue is very high. The departure of a job from the active set of processors immediately triggers the scheduling of an already waiting job into main memory. Thus, a closed queuing network model becomes imperative.

Input to the synchronization processor is in the form of threads and comprises a statically determined sequence of RISC-style instructions [3]. Threads (sequences of tasks) are scheduled dynamically to be executed by the execution processors. The threads have a bounded execution time [3]. Our model also represents a distributed shared memory (DSM) system model in which all processors share a common address space [4]. So, the memory access time depends on the location of the accessed data. Multi-threading can also achieve higher instruction rates on processors which contain multiple functional units (e.g. super-scalars) or multiple processing elements (e.g. chip multi-processors) [5]. To achieve higher performance, it is therefore necessary to optimize the number of synchronization and execution units to find an appropriate multi-threaded model.

In the context of multi-threading, a "thread" is defined to be a sequence of instructions. So, the "life-time of a thread" is defined to be the time interval between the arrival of a thread into the system and its departure out of the system. In [6], each thread is composed of conventional control-flow instructions. These instructions do not retain functional properties and must handle Write-After-Write (WAW) and Write-After-Read (WAR) dependencies [7]. In [8], a few limitations of the pure dataflow computational model are presented: (1) too fine-grained (instruction level) multi-threading, and (2) difficulty in exploiting the memory hierarchies and registers [8]. However, in the model developed in [9], the instructions within a thread retain the functional properties of the dataflow model and thus eliminate the need for complex hardware. Our work models the dataflow instructions appearing in [9].

In [10], a hybrid closed queuing network model is discussed for a multi-threaded dataflow architecture, where the synchronization processors are modeled as a network of single servers, and the execution processors are modeled as a network of multiple servers. Performance measures such as response times, queue lengths, and waiting times are discussed. In this paper, we propose a hybrid closed queuing network model to model dataflow in networked distributed processors, where the synchronization processors are modeled as a network of multiple servers, and the execution processors are modeled as a network of single servers.

The objective of this work is to present a hybrid queuing network to model dataflow in currently existing distributed
processors, compute performance measures such as response times, queue lengths, waiting times, throughput and utilizations, and also present state diagrams and steady-state balance equations for (2, 2) queuing networks. Average response times give us an idea of the life-time of a thread, average queue lengths indicate the number of threads serviced, average waiting times indicate the number of threads waiting in a queue, throughput indicates the fraction of the threads which are successfully serviced, and the utilizations indicate the effectiveness of the distributed processors.

Section 2 presents the block diagram of the hybrid closed queuing network model with both single and multiple servers, and describes in detail the calculation of the normalization constant and utilizations. Section 3 discusses performance measures such as queue lengths, response times, waiting times, throughput and utilizations related to the hybrid model. Section 4 discusses the simulation results of the queuing model. Finally, Section 5 presents the conclusions.

2. System model

An example of a Dual CPU Core chip microprocessor is shown in Fig. 1. The queuing network model for Fig. 1 models the CPU Core units as synchronization and execution processors, and the bus interfaces as arrival rates. Arrivals into the Dual CPU Core chip processor and departures out of the Dual Core chip processor occur through the bus interface.

A networked distributed processor consists of a decentralized system composed of a number of processors that can exchange their processing results with a unique goal. The distributed organization is enabled by real-time communication hardware such as a data bus, or wireless or wired communication. Thus, a group of processors can exchange data and information in real-time with a unique computing objective, while being distributed in a local area. This group of processors is called "networked distributed processors". This architecture enables local execution as well as execution in collaboration. This collaboration could be intimate, i.e., the distributed processors may access one another's memories and can execute at the same time.
Fig. 1. Dual CPU Core chip microprocessor.
Fig. 2. Hybrid closed queuing network model.
Distributed processing, because of the relatively large communication overheads, is best suited to exploit coarser grains of parallelism, such as the parallelism that occurs among large blocks of instructions. Fig. 2 represents an approximation to an open queuing network model having both single and multiple servers under heavy-load conditions. This closed queuing network model is hybrid in the sense that it contains both single and multiple servers in its queuing network.

More often, a large algorithm which requires distributed processing may contain a mix of different tasks with quite different processing requirements. One consequence is that homogeneous machines (i.e., machines with identical and symmetrically connected processors) can, at best, be designed so that their processors are compatible with the "average workload". A distributed processing system is not required to be homogeneous, and can process heterogeneous tasks. Bottlenecks invariably develop in the processing of non-average workloads. Bottlenecks can also occur in real-world scenarios, where a server which is modeled to represent a distributed processor is unavailable for service. In summary,

• bottlenecks may not occur when each distributed processor is required to process an average workload, in which case the whole system is smooth enough to accept a large number of arrivals, and
• bottlenecks may occur when the distributed processors are (i) required to process non-average workloads, or (ii) unavailable for service.

Consider Fig. 2. The total number of active jobs, N, is called the degree of multi-programming [1]. Each job circulating in the closed network is said to be an active job and must be allocated a partition of main memory. The number of active jobs, k_1, at server 1 among the SPs is the sum of the number (either 0 or 1) of jobs (tasks) currently being served by server SP_1 and the number of jobs, k_SP, waiting in the queue, Q_SP. The number of active jobs, k_i, at node i (2 ≤ i ≤ m) among the SPs is the number of jobs currently being served (either 0 or 1) by server SP_i (2 ≤ i ≤ m). The number of active jobs, k'_j, at node j among the EPs (denoted by EP_j) is the number of jobs (tasks) at node j
(including any in service). For a closed queuing model with m servers for Q_SP and n servers, one for each Q_EPj (1 ≤ j ≤ n), N is the total number of tasks waiting in queues Q_SP and Q_EPj (1 ≤ j ≤ n), plus the total number of tasks currently being served by servers SP_i (1 ≤ i ≤ m) and EP_j (1 ≤ j ≤ n), or N = k_1 + k_2 + ... + k_m + k'_1 + k'_2 + ... + k'_n, where N is the total number of jobs in the system. The total number of tasks in the system is restricted to N due to the constraints of finite main memory [11]. Ready jobs will wait at their terminals until some job leaves from the active set, at which time one of the ready jobs enters the active set and is allowed to compete for the system resources [11].

Let the external arrival rate be λ, the arrival rate to all SPs be λ_0, and the arrival rate to each EP_j be λ'_j (1 ≤ j ≤ n). The total arrival rate to all SPs is equal to the sum of the external arrival rate and the arrival rate to all EPs. Hence,

\lambda_0 = \lambda + \lambda' = \lambda + \sum_{j=1}^{n} \lambda'_j.   (1)

Let p'_j, 1 ≤ j ≤ n, be the probability that a job will be served by the jth execution processor, EP_j. Under heavy-load conditions, as soon as a job leaves the network, another job enters the network. So, we can define p_0 as the "probability of entering the new program path". Now,

\lambda'_j = \lambda_0 \, p'_j \quad \forall\, 1 \le j \le n.   (2)

Substituting (2) in (1) and rearranging,

\lambda_0 = \frac{\lambda}{1 - \sum_{j=1}^{n} p'_j},   (3)

with

p_0 + \sum_{j=1}^{n} p'_j = 1.   (4)

Substituting (4) in (3), we have

\lambda_0 = \frac{\lambda}{p_0}.   (5)

From (2) and (5), the arrival rate to EP_j (1 ≤ j ≤ n) is

\lambda'_j = \frac{\lambda}{p_0}\, p'_j.   (6)
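As a quick illustration of Eqs. (1)-(6), the short Python sketch below (not part of the paper; the numerical values of λ, p_0 and p'_j are assumptions chosen only for the example) computes the SP and EP arrival rates from the external arrival rate and the branching probabilities.

# Sketch of Eqs. (1)-(6): arrival rates in the closed network.
# lam, p0 and p_prime are illustrative values, not taken from the paper.
lam = 1000.0                                   # external arrival rate, lambda
p0 = 0.4                                       # probability of entering the new program path
p_prime = [0.2, 0.15, 0.1, 0.05, 0.05, 0.05]   # p'_j for the n = 6 EPs

assert abs(p0 + sum(p_prime) - 1.0) < 1e-12    # Eq. (4): p0 + sum_j p'_j = 1

lam0 = lam / p0                                # Eq. (5): total arrival rate to the SPs
lam_prime = [lam0 * pj for pj in p_prime]      # Eq. (6): arrival rate to each EP_j

# Eq. (1) as a consistency check: lam0 = lam + sum_j lam'_j
assert abs(lam0 - (lam + sum(lam_prime))) < 1e-9
print(lam0, lam_prime)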
Let μ_i be the service rate (rate at which tasks are being served) of server i among the SPs, and let μ'_j be the service rate of server j among the EPs. The utilization of each SP_i (1 ≤ i ≤ m) is

c_i = \frac{\lambda_0}{\mu_i} = \frac{\lambda}{p_0 \mu_i},   (7)

and the utilization of each EP_j (1 ≤ j ≤ n) is

c'_j = \frac{\lambda'_j}{\mu'_j} = \frac{\lambda p'_j}{p_0 \mu'_j}.   (8)

Consider the closed network of queues shown in Fig. 2. The state of the network is given by an (m + n)-tuple vector, s = (k_1, k_2, ..., k_m, k'_1, k'_2, ..., k'_n). Assuming that the service times of all servers are exponentially distributed, the stochastic process modeling the behavior of the network is a finite-state homogeneous continuous-time Markov chain (CTMC), which can be shown to be irreducible and recurrent non-null [1] (assuming that 0 < p'_j ≤ 1 for all j = 1, 2, ..., n). The transition probability matrix, T, of the discrete-time Markov chain (DTMC) is given by

T = \begin{bmatrix} p_0 & p'_1 & p'_2 & \cdots & p'_n \\ 1 & 0 & 0 & \cdots & 0 \\ 1 & 0 & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & 0 & 0 & \cdots & 0 \end{bmatrix}.   (9)

Notice that all rows sum to 1, i.e., p_0 + \sum_{j=1}^{n} p'_j = 1. The DTMC is finite, and if we assume that 0 < p'_j ≤ 1 for all j = 1, 2, ..., n, then the DTMC can be shown to be irreducible and periodic. Then, the relative visit count vector v = (v_0, v'_1, ..., v'_n) can be obtained by solving the system of equations

v = vT.   (10)

If we observe the system for a real-time interval of duration τ, then v_iτ can be interpreted to be the average number of visits to node i in that interval [1]. The term v_0τ represents the average number of visits to node i (1 ≤ i ≤ m), and the term v'_jτ represents the average number of visits to node j (1 ≤ j ≤ n). In this sense, v_0 can be thought of as a relative visit count to node i in the SPs and v'_j as a relative visit count to node j in the EPs. For the network of queues shown in Fig. 2, (10) becomes

[v_0 \; v'_1 \; \ldots \; v'_n] = [v_0 \; v'_1 \; \ldots \; v'_n]\, T.   (11)

The system of linear equations represented by (11) is

v_0 = v_0 p_0 + v'_1 + v'_2 + \ldots + v'_n
v'_1 = v_0 p'_1
v'_2 = v_0 p'_2
  \vdots
v'_n = v_0 p'_n.   (12)

It is clear that only n of the above (n + 1) equations are independent. v_0 can be chosen as any real value that will aid us in our computations. The usual choices for v_0 are 1/p_0, μ_1, or 1. If we choose v_0 = 1/p_0, then from (12), we have

v'_1 = p'_1/p_0
v'_2 = p'_2/p_0
  \vdots
v'_n = p'_n/p_0.   (13)

Here, v_0 is the relative throughput of node i (1 ≤ i ≤ m). The relative throughput of node j (1 ≤ j ≤ n) is

v'_j = \frac{p'_j}{p_0}.   (14)

The relative utilization of node i (1 ≤ i ≤ m) is

\rho_i = \frac{v_0}{\mu_i} = \frac{1}{p_0 \mu_i}.   (15)

The relative utilization of node j (1 ≤ j ≤ n) is

\rho'_j = \frac{v'_j}{\mu'_j} = \frac{p'_j}{p_0 \mu'_j}.   (16)
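The fixed point in (10)-(13) is easy to check numerically. The sketch below (an illustration, not code from the paper; p_0 and the p'_j values are assumed) builds the matrix T of Eq. (9) and verifies that the vector v = (1/p_0, p'_1/p_0, ..., p'_n/p_0) satisfies v = vT.

import numpy as np

# Assumed branching probabilities with p0 + sum(p'_j) = 1.
p0 = 0.4
p_prime = np.array([0.2, 0.15, 0.1, 0.05, 0.05, 0.05])
n = len(p_prime)

# Transition probability matrix T of Eq. (9): from the SP node a job re-enters
# the new program path (p0) or branches to EP_j (p'_j); every EP returns to the SPs.
T = np.zeros((n + 1, n + 1))
T[0, 0] = p0
T[0, 1:] = p_prime
T[1:, 0] = 1.0

# Relative visit counts of Eq. (13), with the choice v0 = 1/p0.
v = np.concatenate(([1.0 / p0], p_prime / p0))

print(np.allclose(v @ T, v))   # True: v solves v = vT, i.e., Eqs. (10)-(12)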
Substituting ρ_i and ρ'_j in the expression for the steady-state probability [12],

p(k_1, k_2, \ldots, k_m, k'_1, k'_2, \ldots, k'_n) = \frac{1}{C(N)} \prod_{i=1}^{m} \rho_i^{k_i} \prod_{j=1}^{n} (\rho'_j)^{k'_j}
  = \frac{1}{C(N)} \prod_{i=1}^{m} \left(\frac{1}{p_0 \mu_i}\right)^{k_i} \prod_{j=1}^{n} \left(\frac{p'_j}{p_0 \mu'_j}\right)^{k'_j},   (17)

where C(N) is the normalization constant chosen so that the sum of the steady-state probabilities is one. The expression for the steady-state probability in (17) satisfies the steady-state balance equations derived from the state diagrams in Appendix A. The normalization constant can be expressed as [1]

C(N) = \sum_{s \in I} \prod_{i=1}^{m} \rho_i^{k_i} \prod_{j=1}^{n} (\rho'_j)^{k'_j},   (18)

where the state space I = \{(k_1, k_2, \ldots, k_m, k'_1, k'_2, \ldots, k'_n) \mid k_i \ge 0 \text{ and } k'_j \ge 0 \; \forall i, j, \text{ and } \sum_{i=1}^{m} k_i + \sum_{j=1}^{n} k'_j = N\}, and s = (k_1, k_2, \ldots, k_m, k'_1, k'_2, \ldots, k'_n) is a particular state of the network. We now introduce a new vector of relative utilizations of devices i (1 ≤ i ≤ m) in SPs and devices j (1 ≤ j ≤ n) in EPs, ρ_new = [ρ_1 ρ_2 ... ρ_m ρ'_1 ρ'_2 ... ρ'_n]. The recursive algorithm for computing the normalization constant is [1]

C_k(l) = \rho_{new}(k)\, C_k(l-1) + C_{k-1}(l),   (19)
where k = 1, 2, ..., m, m+1, ..., m+n and l = 1, 2, ..., N. Also [1],

C_k(0) = 1 \quad \forall\, k = 1, 2, \ldots, m, m+1, \ldots, m+n.   (20)

We now write a convolution algorithm to represent the above procedure for computing the normalization constant.

Convolution algorithm [1]:
{initialize} C[0] := 1;
for l := 1 to N do
  C[l] := 0;
for k := 0 to (m + n - 1) do
  for l := 1 to N do
    C[l] := C[l] + ρ_new(k) * C[l - 1];
  end
end
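For concreteness, a small Python sketch of the same convolution recursion is given below, together with the construction of ρ_new from Eqs. (15) and (16). It is an illustration only; the service rates and branching probabilities are assumed values, not the ones used in the paper's simulation.

def normalization_constants(rho_new, N):
    """Convolution algorithm of Eqs. (19)-(20): returns [C(0), C(1), ..., C(N)]."""
    C = [1.0] + [0.0] * N              # C[0] := 1, C[l] := 0 for l = 1..N
    for rho in rho_new:                # one pass per device, k = 0 .. m+n-1
        for l in range(1, N + 1):
            C[l] += rho * C[l - 1]     # C[l] := C[l] + rho_new(k) * C[l-1]
    return C

# Example with assumed parameters: m = 2 SPs, n = 2 EPs, N = 10 jobs.
p0, p_prime = 0.4, [0.35, 0.25]
mu, mu_prime = [4.0, 5.0], [6.0, 8.0]

rho_sp = [1.0 / (p0 * m_i) for m_i in mu]                       # Eq. (15)
rho_ep = [pj / (p0 * mj) for pj, mj in zip(p_prime, mu_prime)]  # Eq. (16)

C = normalization_constants(rho_sp + rho_ep, N=10)
print(C[-1])    # C(N), the normalization constant of Eq. (18)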
The utilization of the ith device among the SPs when there are N jobs in the system is [1]

U_i(N) = \rho_i \frac{C(N-1)}{C(N)} = \frac{1}{p_0 \mu_i} \frac{C(N-1)}{C(N)}.   (21)

Similarly, the utilization of the jth device among the EPs when there are N jobs in the system is [1]

U'_j(N) = \rho'_j \frac{C(N-1)}{C(N)} = \frac{p'_j}{p_0 \mu'_j} \frac{C(N-1)}{C(N)}.   (22)

The ratio of the actual utilizations of SPs and EPs when there are N jobs in the system is given by U_i(N)/U'_j(N). From (21) and (22), the ratio of actual utilizations is

\frac{U_i(N)}{U'_j(N)} = \frac{\rho_i}{\rho'_j} = \frac{1}{p'_j}\left(\frac{\mu'_j}{\mu_i}\right).   (23)

Eq. (23) explains the reason for calling ρ_i and ρ'_j "relative utilizations".

3. Performance measures

3.1. Queue lengths

The probability that there are k or more jobs at node i is given by [1]

P(N_i \ge k) = \rho_i^{k} \frac{C(N-k)}{C(N)}.   (24)

The average queue length at node i in the SPs when there are N jobs in the system is given by

E[L_i(N)] = \sum_{l=1}^{N} \rho_i^{l} \frac{C(N-l)}{C(N)}.   (25)

Similarly, the average queue length at node j in the EPs when there are N jobs in the system is given by

E[L'_j(N)] = \sum_{l=1}^{N} (\rho'_j)^{l} \frac{C(N-l)}{C(N)}.   (26)

The expected total number of jobs in the system is equal to the sum of the expected number of jobs in SPs and EPs, i.e.,

L = \sum_{i=1}^{m} E[L_i(N)] + \sum_{j=1}^{n} E[L'_j(N)]
  = \sum_{i=1}^{m} \sum_{l=1}^{N} \rho_i^{l} \frac{C(N-l)}{C(N)} + \sum_{j=1}^{n} \sum_{l=1}^{N} (\rho'_j)^{l} \frac{C(N-l)}{C(N)}
  = \frac{1}{C(N)} \left\{ \sum_{i=1}^{m} \sum_{l=1}^{N} \left(\frac{1}{p_0 \mu_i}\right)^{l} C(N-l) + \sum_{j=1}^{n} \sum_{l=1}^{N} \left(\frac{p'_j}{p_0 \mu'_j}\right)^{l} C(N-l) \right\}.   (27)
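The device utilizations and mean queue lengths of Eqs. (21), (22) and (25)-(27) follow directly from the table of normalization constants. The sketch below is illustrative; it repeats the convolution helper and reuses the same assumed parameters as the previous sketch.

def normalization_constants(rho_new, N):
    C = [1.0] + [0.0] * N
    for rho in rho_new:
        for l in range(1, N + 1):
            C[l] += rho * C[l - 1]
    return C

# Assumed example parameters (not the paper's): m = 2 SPs, n = 2 EPs, N = 10.
N, p0 = 10, 0.4
mu, mu_prime, p_prime = [4.0, 5.0], [6.0, 8.0], [0.35, 0.25]
rho_sp = [1.0 / (p0 * m_i) for m_i in mu]
rho_ep = [pj / (p0 * mj) for pj, mj in zip(p_prime, mu_prime)]
C = normalization_constants(rho_sp + rho_ep, N)

# Eqs. (21)-(22): actual utilizations U_i(N) and U'_j(N).
U_sp = [rho * C[N - 1] / C[N] for rho in rho_sp]
U_ep = [rho * C[N - 1] / C[N] for rho in rho_ep]

# Eqs. (25)-(27): average queue lengths and expected total number of jobs L.
def mean_queue_length(rho):
    return sum(rho ** l * C[N - l] / C[N] for l in range(1, N + 1))

L_sp = [mean_queue_length(rho) for rho in rho_sp]
L_ep = [mean_queue_length(rho) for rho in rho_ep]
print(U_sp, U_ep, sum(L_sp) + sum(L_ep))   # last value: total number of jobs, Eq. (27)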
3.2. Response times

The time spent by a job (also called a request) in a queue and a server in the SPs is called the average response time in the SPs. The average response time of a job at node i in the SPs (1 ≤ i ≤ m) is given by

R_{SP_i} = \frac{E[L_i(N)]}{\lambda_0} = \frac{1}{\lambda_0} \sum_{l=1}^{N} \rho_i^{l} \frac{C(N-l)}{C(N)} = \frac{p_0}{\lambda} \sum_{l=1}^{N} \left(\frac{1}{p_0 \mu_i}\right)^{l} \frac{C(N-l)}{C(N)},   (28)

since λ_0 = λ/p_0 and ρ_i = 1/(p_0 μ_i). The average response time of a task at node j in the EPs (1 ≤ j ≤ n) is given by

R_{EP_j} = \frac{E[L'_j(N)]}{\lambda'_j} = \frac{1}{\lambda_0 p'_j} \sum_{l=1}^{N} (\rho'_j)^{l} \frac{C(N-l)}{C(N)} = \frac{p_0}{\lambda p'_j} \sum_{l=1}^{N} \left(\frac{p'_j}{p_0 \mu'_j}\right)^{l} \frac{C(N-l)}{C(N)},   (29)

since λ'_j = λ_0 p'_j, ρ'_j = p'_j/(p_0 μ'_j), and λ_0 = λ/p_0. The time spent by N requests in all the queues and servers is called the total response time. The total response time of the system (SPs and EPs) is given by

R = \sum_{i=1}^{m} R_{SP_i} + \sum_{j=1}^{n} R_{EP_j}
  = \frac{p_0}{\lambda} \left\{ \sum_{i=1}^{m} \sum_{l=1}^{N} \left(\frac{1}{p_0 \mu_i}\right)^{l} \frac{C(N-l)}{C(N)} + \sum_{j=1}^{n} \frac{1}{p'_j} \sum_{l=1}^{N} \left(\frac{p'_j}{p_0 \mu'_j}\right)^{l} \frac{C(N-l)}{C(N)} \right\}.   (30)
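The response times of Eqs. (28)-(30) use the same queue-length sums divided by the arrival rates. A sketch (again with assumed example parameters and the same inline convolution helper) is:

def normalization_constants(rho_new, N):
    C = [1.0] + [0.0] * N
    for rho in rho_new:
        for l in range(1, N + 1):
            C[l] += rho * C[l - 1]
    return C

# Assumed example parameters (not the paper's).
N, lam, p0 = 10, 1000.0, 0.4
mu, mu_prime, p_prime = [4.0, 5.0], [6.0, 8.0], [0.35, 0.25]
rho_sp = [1.0 / (p0 * m_i) for m_i in mu]
rho_ep = [pj / (p0 * mj) for pj, mj in zip(p_prime, mu_prime)]
C = normalization_constants(rho_sp + rho_ep, N)

lam0 = lam / p0                                  # Eq. (5)
def queue(rho):                                  # E[L(N)] for one node, Eqs. (25)-(26)
    return sum(rho ** l * C[N - l] / C[N] for l in range(1, N + 1))

R_sp = [queue(rho) / lam0 for rho in rho_sp]                            # Eq. (28)
R_ep = [queue(rho) / (lam0 * pj) for rho, pj in zip(rho_ep, p_prime)]   # Eq. (29)
print(sum(R_sp) + sum(R_ep))                                            # Eq. (30): total response time R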
3.3. Waiting times

The total waiting time at all queues in the SPs is given by

E[W] = \sum_{i=1}^{m} \left( R_{SP_i} - \frac{1}{\mu_i} \right) = \sum_{i=1}^{m} \left( \frac{p_0}{\lambda} \sum_{l=1}^{N} \left(\frac{1}{p_0 \mu_i}\right)^{l} \frac{C(N-l)}{C(N)} - \frac{1}{\mu_i} \right),   (31)

where 1/μ_i is the service time of SP_i (1 ≤ i ≤ m). The total waiting time at all queues in the EPs is given by

E[W'] = \sum_{j=1}^{n} \left( R_{EP_j} - \frac{1}{\mu'_j} \right) = \sum_{j=1}^{n} \left( \frac{p_0}{\lambda p'_j} \sum_{l=1}^{N} \left(\frac{p'_j}{p_0 \mu'_j}\right)^{l} \frac{C(N-l)}{C(N)} - \frac{1}{\mu'_j} \right),   (32)

where 1/μ'_j is the service time of EP_j (1 ≤ j ≤ n). The total waiting time in all the queues increases with N. The total waiting time in the system (SPs and EPs), E[W_s], is given by

E[W_s] = \sum_{i=1}^{m} \left( \frac{p_0}{\lambda} \sum_{l=1}^{N} \left(\frac{1}{p_0 \mu_i}\right)^{l} \frac{C(N-l)}{C(N)} - \frac{1}{\mu_i} \right) + \sum_{j=1}^{n} \left( \frac{p_0}{\lambda p'_j} \sum_{l=1}^{N} \left(\frac{p'_j}{p_0 \mu'_j}\right)^{l} \frac{C(N-l)}{C(N)} - \frac{1}{\mu'_j} \right).   (33)

3.4. Utilizations

The steady-state probability of having all jobs served by the EPs is given by

p(0, 0, \ldots, 0, l'_1, l'_2, \ldots, l'_n) = \frac{1}{C(N)} \prod_{j=1}^{n} (\rho'_j)^{l'_j} = \frac{1}{C(N)} \prod_{j=1}^{n} \left(\frac{p'_j}{p_0 \mu'_j}\right)^{l'_j},   (34)

where \sum_{j=1}^{n} l'_j = N. The steady-state probability of having at least one job served by the SPs is given by

U_0 = 1 - p(0, \ldots, 0, l'_1, l'_2, \ldots, l'_n) = 1 - \frac{1}{C(N)} \prod_{j=1}^{n} \left(\frac{p'_j}{p_0 \mu'_j}\right)^{l'_j}.   (35)

If U_0 > 1 - U_0, or equivalently, U_0 > 1/2, we have more SP utilization. This indicates that the execution of the program is dominated by the SPs. If U_0 < 1/2, we have more EP utilization. In this case, the execution of the program is dominated by the EPs. When U_0 = 1/2, the program execution is said to be balanced.

3.5. System throughput

The average throughput of the ith node in the SPs (1 ≤ i ≤ m) is given by

E[T_i(N)] = \mu_i p_0 U_i(N) = \frac{C(N-1)}{C(N)}.   (36)

The average throughput of the jth node in the EPs (1 ≤ j ≤ n) is given by

E[T'_j(N)] = \mu'_j p_0 U'_j(N) = p'_j \frac{C(N-1)}{C(N)}.   (37)

Now, the system throughput is provided only by the contribution of the SPs [1]. The system throughput is given by

E[T(N)] = \sum_{i=1}^{m} E[T_i(N)] = \sum_{i=1}^{m} \frac{C(N-1)}{C(N)}.   (38)
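The per-node and system throughputs of Eqs. (36)-(38) need only the ratio C(N-1)/C(N). A small illustrative sketch (assumed parameters, same convolution helper) is:

def normalization_constants(rho_new, N):
    C = [1.0] + [0.0] * N
    for rho in rho_new:
        for l in range(1, N + 1):
            C[l] += rho * C[l - 1]
    return C

# Assumed example parameters (not the paper's): m = 2 SPs, n = 2 EPs.
N, p0 = 10, 0.4
mu, mu_prime, p_prime = [4.0, 5.0], [6.0, 8.0], [0.35, 0.25]
rho_sp = [1.0 / (p0 * m_i) for m_i in mu]
rho_ep = [pj / (p0 * mj) for pj, mj in zip(p_prime, mu_prime)]
C = normalization_constants(rho_sp + rho_ep, N)

ratio = C[N - 1] / C[N]
T_sp = [m_i * p0 * rho * ratio for m_i, rho in zip(mu, rho_sp)]       # Eq. (36): each equals C(N-1)/C(N)
T_ep = [mj * p0 * rho * ratio for mj, rho in zip(mu_prime, rho_ep)]   # Eq. (37): each equals p'_j C(N-1)/C(N)
print(T_sp, T_ep, sum(T_sp))                                          # Eq. (38): system throughput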
4. Simulation results

A simulation is performed for the closed queuing network model with single and multiple servers. The number of synchronization processors (m) is chosen to be 5, and the number of execution processors (n) is chosen to be 6. The service rates, μ_i and μ'_j, and the probabilities of getting serviced at the EPs, p'_j (1 ≤ j ≤ n), are appropriately chosen such that there is a constant probability of entering the new program path (p_0 = 0.4). The estimated numbers of jobs in the SPs (L1), in the EPs (L2), and in the whole system (L = L1 + L2) are shown in Table 1. It is found that the estimated total number of jobs in the system is close to N, where N is chosen as 10 in the simulation. The actual utilizations of the devices among the SPs, U_i(N) (1 ≤ i ≤ m), and the actual utilizations of the devices among the EPs, U'_j(N) (1 ≤ j ≤ n), are tabulated in Table 2. The normalization constant is computed from the convolution algorithm by pre-computing the utilizations at each of the nodes in the SPs and EPs. The normalization constant is plotted against N, the number of jobs in the system, in Fig. 3. The system throughput is obtained from (38) and is plotted in Fig. 4 against the total number of processors (m + n). The system throughput increases as the total number of processors increases.
Table 1. Total number of jobs in SPs, EPs, and system

(m, n)    L1        L2        L = L1 + L2
(1, 1)    6.606     2.399     9.006
(1, 2)    7.8595    1.3639    9.2234
(2, 2)    6.8062    1.2926    8.0987
(3, 2)    7.0231    1.2648    8.2879
(3, 3)    7.2867    0.83732   8.12404
(3, 4)    7.40105   0.69072   8.09178
(4, 4)    7.60932   0.68807   8.2974
(5, 4)    7.86156   0.68634   8.5479
(5, 5)    7.8789    0.65013   8.529
(5, 6)    7.8899    0.6277    8.5177
Table 2. Actual utilizations of SPs and EPs

(m, n)   U_i(N), i = 1, 2, ..., m                          U'_j(N), j = 1, 2, ..., n
(1, 1)   (0.625)                                           (0.375)
(1, 2)   (0.68965)                                         (0.20689, 0.10344)
(2, 2)   (0.51282, 0.2564)                                  (0.15384, 0.07692)
(3, 2)   (0.43795, 0.21897, 0.14598)                        (0.1313, 0.06569)
(3, 3)   (0.46511, 0.23255, 0.15503)                        (0.04651, 0.0697, 0.031)
(3, 4)   (0.4761, 0.23809, 0.1587)                          (0.04761, 0.0238, 0.0317, 0.0238)
(4, 4)   (0.4255, 0.2127, 0.1418, 0.1063)                   (0.04255, 0.02127, 0.0283, 0.02127)
(5, 4)   (0.3921, 0.19607, 0.1307, 0.098, 0.078)            (0.03942, 0.0196, 0.0261, 0.0196)
(5, 5)   (0.39421, 0.1971, 0.1314, 0.09855, 0.07884)        (0.03942, 0.0197, 0.01314, 0.01971, 0.007884)
(5, 6)   (0.3955, 0.1977, 0.1318, 0.0988, 0.0791)           (0.03955, 0.01977, 0.01318, 0.00988, 0.00791, 0.00659)
Fig. 3. Normalization constant of a closed queuing network versus the total number of jobs.
Fig. 4. Response time and throughput versus the total number of processors (SPs and EPs).
The arrival rate (λ = 1000) and p_0 (= 0.4) are kept constant throughout the simulation.

The throughput curve shown in Fig. 4 is obtained by choosing a fixed service rate each on the SP side and the EP side. As the total number of servers (processors) increases from 1 to 9, the throughput increases gradually. The service rate of each server is useful enough to keep the throughput gradually increasing. Beyond a point, (m + n) ≥ 10, the service rate of each additional server (processor) is not really useful in improving the throughput beyond that obtained when (m + n) = 9. In other words, for (m + n) ≥ 10, the additional processors become redundant to the system. This is because the number of tasks in the system is restricted to N, and does not increase beyond N (closed queuing network). In the numerical example considered for simulation in Fig. 4, the servicing ability of nine processors in the hybrid network is just enough to service N tasks with the highest throughput. The system is said to have attained the "load balancing" state with four servers from the SP side and five servers from the EP side giving the highest throughput.

In Fig. 4, the response time is also plotted against the total number of processors in the system. As the total number of processors increases, the response time increases for a closed queuing network model. One can notice that the response time grows with the number of distributed processors in the system. As the total number of processors increases, the number of processors on which the tasks must be treated (serviced) also increases. Defining a "cycle" as the time duration during which a task enters the queuing network, gets serviced by each and every server in the network and exits the network, the response time of the tasks is expected to increase as the total number of processors increases. This management and operating mode of the tasks can be found in distributed systems where the distributed processors have to process the tasks one by one. A task can be serviced by only one processor at a given time. Also, a task must traverse through several processors inside the distributed system before it can exit the system. In such architectures, processors are embedded in distributed devices and serve as Watchdog Agents which perform signal processing tasks for the diagnostics of networked devices. An overview of the efforts undertaken for intelligent maintenance systems (IMS) towards developing Watchdog Agents for different applications is presented in [13]. In Fig. 4, the system throughput increases as the total number of processors (m + n) increases. Since the total number of tasks in the system is a constant for a closed queuing network, the number of successfully serviced tasks (throughput) reaches a maximum value of 0.8 when m + n ≥ 9.
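The saturation behaviour described above can be examined with the formulas of Sections 2 and 3. The sketch below is only an illustration: the per-server service rates and the equal branching probabilities are assumptions, since the paper does not list the exact values used for Fig. 4. It sweeps (m, n) pairs of increasing size and prints the system throughput of Eq. (38), which levels off once the fixed population N becomes the limiting factor.

def normalization_constants(rho_new, N):
    C = [1.0] + [0.0] * N
    for rho in rho_new:
        for l in range(1, N + 1):
            C[l] += rho * C[l - 1]
    return C

def system_throughput(m, n, N=10, p0=0.4, mu=1.0, mu_p=1.0):
    """Eq. (38) for m SPs and n EPs with identical (assumed) service rates."""
    p_prime = [(1.0 - p0) / n] * n                   # equal branching to the EPs
    rho_sp = [1.0 / (p0 * mu)] * m                   # Eq. (15)
    rho_ep = [pj / (p0 * mu_p) for pj in p_prime]    # Eq. (16)
    C = normalization_constants(rho_sp + rho_ep, N)
    return m * C[N - 1] / C[N]

for m, n in [(1, 1), (2, 2), (3, 3), (4, 4), (5, 6), (8, 8), (10, 10), (12, 12)]:
    print(m + n, round(system_throughput(m, n), 4))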
5. Conclusions

In this paper, we have introduced a closed network of queues with multiple servers to model dataflow in a multi-processor system. The instruction streams are executed simultaneously (multi-threading) to minimize the loss of CPU cycles. The hybrid model presented in this paper describes an architecture of embedded and distributed systems for diagnosis and forecasting, for which we studied some operational performances. A convolution algorithm is used to compute the normalization constant as a function of the degree of multi-programming (number of active jobs) in the queuing model. The system performance measures are derived knowing the normalization constant. The normalization constant is plotted against the degree of multi-programming. The numbers of jobs in the SPs, the EPs, and the system are tabulated for different (m, n). The response times and throughput are found to increase with the number of processors. The system throughput approaches an optimum value when the number of (synchronization + execution) processors is greater than or equal to nine. The optimum value for throughput is obtained by achieving a good balance of utilizations between the pipelines (SPs and EPs). Thus, we have found the optimum number of functional units in a multi-threaded model to achieve higher instruction rates.

Appendix A. State diagrams and steady-state balance equations
This appendix presents the state diagrams and steady-state balance equations for the (m, n) = (2, 2) network in the hybrid model shown in Fig. 2.

Case (i): Let there be zero jobs in the CPUs and N jobs in the I/O units initially (with p_0 + p'_1 + p'_2 = 1).

Consider Fig. 5. Here, a CPU is modeled as a synchronization processor and an I/O unit is modeled as an execution processor. It is assumed that there are zero jobs in the CPUs and N jobs in the I/O units initially. The steady-state balance equation is obtained by equating the rates of flow into and out of state (0, 0, x, N-x):

p(0, 0, x, N-x)\,[\mu'_1 + \mu'_1 + \mu'_2 + \mu'_2]
  = p(1, 0, x-1, N-x)\frac{\mu_1 p'_1}{p'_1 + p'_2} + p(0, 1, x-1, N-x)\frac{\mu_2 p'_1}{p'_1 + p'_2}
  + p(0, 1, x, N-x-1)\frac{\mu_2 p'_2}{p'_1 + p'_2} + p(1, 0, x, N-x-1)\frac{\mu_1 p'_2}{p'_1 + p'_2}

\Rightarrow p(0, 0, x, N-x) = \frac{1}{2(p'_1 + p'_2)(\mu'_1 + \mu'_2)} \{ p'_1[\mu_1 p(1, 0, x-1, N-x) + \mu_2 p(0, 1, x-1, N-x)] + p'_2[\mu_1 p(1, 0, x, N-x-1) + \mu_2 p(0, 1, x, N-x-1)] \}.   (39)

Fig. 5. State diagram for the m = 2, n = 2 network when there are zero jobs in the CPUs and N jobs in the I/O units initially.
Let p_0 be the probability of entering the new program path, and p'_1 and p'_2 be the probabilities that a job will get serviced by I/O1 and I/O2, respectively. Let μ_1 and μ_2 be the service rates of the CPUs, and μ'_1 and μ'_2 be the service rates of the I/O units.

Case (ii): Let there be N jobs in the CPUs and zero jobs in the I/O units initially.

Consider Fig. 6. Here, we have N jobs in the CPUs and zero jobs in the I/O units initially. The steady-state balance equation is obtained by equating the rates of flow into and out of state (x, N-x, 0, 0):

p(x, N-x, 0, 0)\left[\frac{\mu_1 p'_1}{p'_1 + p'_2} + \frac{\mu_1 p'_2}{p'_1 + p'_2} + \frac{\mu_2 p'_2}{p'_1 + p'_2} + \frac{\mu_2 p'_1}{p'_1 + p'_2}\right]
  = p(x-1, N-x, 1, 0)\mu'_1 + p(x-1, N-x, 0, 1)\mu'_2 + p(x, N-x-1, 0, 1)\mu'_2 + p(x, N-x-1, 1, 0)\mu'_1

\Rightarrow p(x, N-x, 0, 0) = \frac{1}{\mu_1 + \mu_2} \{ \mu'_1[p(x-1, N-x, 1, 0) + p(x, N-x-1, 1, 0)] + \mu'_2[p(x-1, N-x, 0, 1) + p(x, N-x-1, 0, 1)] \}.   (40)

Consider Fig. 7. Here, we have k_1 jobs in CPU1, k_2 jobs in CPU2, k'_1 jobs in I/O1 and k'_2 jobs in I/O2 initially. For a state (k_1, k_2, k'_1, k'_2) with k_1 > 0, k_2 > 0, k'_1 > 0, k'_2 > 0, the steady-state balance equation is obtained by equating the rates of flow into and out of state (k_1, k_2, k'_1, k'_2):

p(k_1, k_2, k'_1, k'_2)\left[\frac{\mu_1 p'_1}{p'_1 + p'_2} + \frac{\mu_1 p'_2}{p'_1 + p'_2} + \frac{\mu_2 p'_2}{p'_1 + p'_2} + \frac{\mu_2 p'_1}{p'_1 + p'_2} + \mu'_1 + \mu'_1 + \mu'_2 + \mu'_2\right]
  = p(k_1 - 1, k_2, k'_1 + 1, k'_2)\mu'_1 + p(k_1 - 1, k_2, k'_1, k'_2 + 1)\mu'_2
  + p(k_1, k_2 - 1, k'_1 + 1, k'_2)\mu'_1 + p(k_1, k_2 - 1, k'_1, k'_2 + 1)\mu'_2
  + p(k_1 + 1, k_2, k'_1 - 1, k'_2)\frac{\mu_1 p'_1}{p'_1 + p'_2} + p(k_1 + 1, k_2, k'_1, k'_2 - 1)\frac{\mu_1 p'_2}{p'_1 + p'_2}
  + p(k_1, k_2 + 1, k'_1 - 1, k'_2)\frac{\mu_2 p'_1}{p'_1 + p'_2} + p(k_1, k_2 + 1, k'_1, k'_2 - 1)\frac{\mu_2 p'_2}{p'_1 + p'_2},

i.e.,

p(k_1, k_2, k'_1, k'_2) = \frac{1}{\mu_1 + \mu_2 + 2(\mu'_1 + \mu'_2)} \Big\{ \mu'_1[p(k_1 - 1, k_2, k'_1 + 1, k'_2) + p(k_1, k_2 - 1, k'_1 + 1, k'_2)]
  + \mu'_2[p(k_1 - 1, k_2, k'_1, k'_2 + 1) + p(k_1, k_2 - 1, k'_1, k'_2 + 1)]
  + \frac{\mu_1}{p'_1 + p'_2}[p'_1\, p(k_1 + 1, k_2, k'_1 - 1, k'_2) + p'_2\, p(k_1 + 1, k_2, k'_1, k'_2 - 1)]
  + \frac{\mu_2}{p'_1 + p'_2}[p'_1\, p(k_1, k_2 + 1, k'_1 - 1, k'_2) + p'_2\, p(k_1, k_2 + 1, k'_1, k'_2 - 1)] \Big\}.   (41)
Fig. 6. State diagram for the m = 2, n = 2 network when there are N jobs in the CPUs and zero jobs in the I/O units initially.
Fig. 7. State diagram for the m = 2, n = 2 network when there are k_1 jobs in CPU1, k_2 jobs in CPU2, k'_1 jobs in I/O1 and k'_2 jobs in I/O2 initially.
Eqs. (39)-(41) provide the steady-state balance equations for the (2, 2) hybrid queuing model. We represent the steady-state probability as

p(k_1, k_2, k'_1, k'_2) = \frac{1}{C(N)} \rho_1^{k_1} \rho_2^{k_2} (\rho'_1)^{k'_1} (\rho'_2)^{k'_2},

where C(N) is chosen such that \sum_{k_1 + k_2 + k'_1 + k'_2 = N} p(k_1, k_2, k'_1, k'_2) = 1. As in Section 2, the relative utilizations can be chosen such that ρ_i = 1/(p_0 μ_i) for i = 1, 2, and ρ'_j = p'_j/(p_0 μ'_j) for j = 1, 2.
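As an illustration of this product-form representation (a sketch with assumed numerical rates, not values from the paper), the block below enumerates every state of a (2, 2) network with population N, normalizes the product-form terms, and cross-checks the resulting C(N) against the convolution algorithm of Section 2.

from itertools import product

def normalization_constants(rho_new, N):
    C = [1.0] + [0.0] * N
    for rho in rho_new:
        for l in range(1, N + 1):
            C[l] += rho * C[l - 1]
    return C

# Assumed example rates for the (2, 2) network.
N, p0 = 10, 0.4
mu, mu_prime, p_prime = [4.0, 5.0], [6.0, 8.0], [0.35, 0.25]
rho = [1.0 / (p0 * m) for m in mu] + [pj / (p0 * mj) for pj, mj in zip(p_prime, mu_prime)]

# Enumerate all states (k1, k2, k1', k2') with k1 + k2 + k1' + k2' = N.
states = [s for s in product(range(N + 1), repeat=4) if sum(s) == N]
weights = {s: rho[0] ** s[0] * rho[1] ** s[1] * rho[2] ** s[2] * rho[3] ** s[3] for s in states}

C_N = sum(weights.values())                        # C(N) of Eq. (18) by direct enumeration
probs = {s: w / C_N for s, w in weights.items()}   # normalized steady-state probabilities

print(abs(C_N - normalization_constants(rho, N)[N]) < 1e-9 * C_N)   # True
print(abs(sum(probs.values()) - 1.0) < 1e-12)                       # True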
References

[1] K. Trivedi, Probability & Statistics with Reliability, Queuing and Computer Science Applications, Prentice-Hall, New Jersey, 1982.
[2] J. Jackson, Networks of waiting lines, Oper. Res. 5 (1957).
[3] B. Shankar, L. Roh, W. Bohm, W. Najjar, Control of parallelism in multithreaded code, in: Proc. of the Intl. Conference on Parallel Architectures and Compilation Techniques (PACT-95), June 1995.
[4] W. Grunewald, T. Ungerer, A multithreaded processor design for Distributed Shared Memory (DSM) systems, in: Proc. of the Intl. Conference on Advances in Parallel and Distributed Computing, 1997.
[5] M. Lam, R. Wilson, Limits of control flow on parallelism, in: Proc. of the 19th Intl. Symposium on Computer Architecture (ISCA-19), May 1992, pp. 46-57.
[6] S. Sakai, Architectural and software mechanisms for optimizing parallel computations, in: Proc. of the 1993 Intl. Conference on Supercomputing, July 1993.
[7] K.M. Kavi, J. Arul, R. Giorgi, Execution and cache performance of the scheduled dataflow architecture, J. Universal Comp. Sci. (2000).
[8] M. Takesue, A unified resource management and execution control mechanism for dataflow machines, in: Proc. of the 14th Intl. Symposium on Computer Architecture (ISCA-14), June 1987, pp. 90-97.
[9] K.M. Kavi, R. Giorgi, J. Arul, Scheduled dataflow: execution paradigm, architecture, and performance evaluation, IEEE Trans. Comp. 50 (8) (2001) 834-846.
[10] V. Bhaskar, A hybrid closed queuing network model for multi-threaded dataflow architecture, Elsevier J. Comp. Electr. Eng. 31 (8) (2005) 556-571.
[11] L. Kleinrock, Queuing Systems, Volume II: Computer Applications, John Wiley & Sons Inc., 1976.
[12] W. Gordon, G. Newell, Closed queuing systems with exponential servers, Oper. Res. 15 (1967).
[13] D. Djurdjanovic, J. Lee, J. Ni, Watchdog Agent: an infotronics based prognostics approach for product performance assessment and prediction, Advanced Engineering Informatics (Special issue on Intelligent Maintenance Systems) 17 (3-4) (2003) 109-125.