The Journal of Systems and Software 83 (2010) 1823–1837
A Superscalar software architecture model for Multi-Core Processors (MCPs)

Gyu Sang Choi a,∗, Chita R. Das b

a Department of Information and Communication Engineering, Yeungnam University, Sojae Building #202-1, 214-1 Dae-dong, Gyeongsan-si, Gyeongsangbuk-do 712-749, Republic of Korea
b Department of Computer Science and Engineering, Pennsylvania State University, University Park, PA 16802, United States
Article history: Received 8 June 2009; received in revised form 29 April 2010; accepted 29 April 2010; available online 15 June 2010.

Keywords: Multi-Core, SuperScalar, Software architecture model, Multi-thread
Abstract

Design of high-performance servers has become a research thrust to meet the increasing demand of network-based applications. One approach to designing such architectures is to exploit the enormous computing power of Multi-Core Processors (MCPs), which are envisioned to become the state-of-the-art in processor architecture. In this paper, we propose a new software architecture model, called SuperScalar, suitable for MCP machines. The proposed SuperScalar model consists of multiple pipelined thread pools, where each pipelined thread pool consists of multiple threads and each thread takes a different role. The main advantages of the proposed model are global information sharing by the threads and a minimal memory requirement due to fewer threads. We have conducted in-depth performance analyses of the proposed scheme along with three prior software architecture schemes (Multi-Process (MP), Multi-Thread (MT) and Event-Driven (ED)) via an analytical model. The performance results indicate that the proposed SuperScalar model shows the best performance across all system and workload parameters compared to the MP, MT and ED models. Although the MT model shows competitive performance with fewer processing cores and a smaller data cache size, the advantage of the SuperScalar model becomes obvious as the number of processing cores increases.
1. Introduction

Improving the performance of server platforms has become a critical issue to cope with the increasing use of network-based services. The critical nature of many online transactions and distributed services mandates the design of high-performance servers. Several software and hardware scale-up techniques have been proposed to enhance the performance of a server platform (Cardellini et al., 2002). Although these mechanisms can improve the performance of a server to varying degrees, any server design should exploit the novelty of state-of-the-art architectural trends. This paper is motivated by this context and attempts to show how a server design can benefit from recent architectural innovations. Towards this goal, we focus on the design of servers using Multi-Core Processor (MCP) architectures. Recently, Quad-Core CPUs (Intel, 2009a; Wikipedia, 2009) were released by Intel and AMD, and Octo-Core CPUs (Intel, 2009b) will be released soon to target the high-performance server market.
This research was supported by the Yeungnam University research grants in 2010, and in part by NSF grants EIA-0202007, CCR-0208734, CCF-0429631 and CNS-0509251.
∗ Corresponding author. E-mail addresses: [email protected] (G.S. Choi), [email protected] (C.R. Das).
doi:10.1016/j.jss.2010.04.068
The Cell processor design, released by a consortium of IBM, Sony and Toshiba, incorporates eight Synergistic Processing Elements (SPEs) on a single chip (Pham et al., 2005). Thus, with advances in deep submicron technology, MCP architectures have become a reality, and by the end of the decade, MCPs with billions of transistors are likely to dominate the high-performance computing landscape (Benini and Micheli, 2002; Edenfeld et al., 2004; Bell and Gray, 2002). With technology scaling down to 35 nm, it would be possible to fabricate MCPs with up to 32/64 processing cores (TILERA, 2009; Michael et al., 1997). Hence, we expect that future servers will be designed on MCP systems to provide high performance. Implementation of a server on any hardware platform needs a software architecture which can support the required functionalities; thus, a software architecture is an abstract concept for modeling a server architecture. We can implement any type of server (e.g. file, Web and database servers) using a software architecture scheme. Three software architecture models (Multi-Process (MP) (The Apache Software Foundation, 2003), Multi-Thread (MT) (The Apache Software Foundation, 2003) and Event-Driven (ED) (Pai et al., 1999)) have been proposed to implement a specific server architecture on a single-CPU machine. In light of the recent interest in multi-core architectures, it is important to design efficient software architectures for MCP machines. Instead of designing a software scheme to implement any server architecture, most prior studies (Welsh et al., 2001; Choi
et al., 2005; Ruan et al., 2005) have mainly proposed and evaluated Web server architectures on SMP/SoC machines. Welsh et al. (2001) proposed a new Web server architecture, called the Staged Event-Driven Architecture (SEDA), combining the ED and MT models, and conducted a performance comparison through a real implementation on a small SMP machine. Choi et al. (2005) proposed a new Web server architecture, called PIPELINED, for SMP/SoC machines. They conducted a performance comparison among several Web server architectures through simulation and showed that their proposed model can outperform other models across various system and workload parameters. Ruan et al. (2005) evaluated the impact of Simultaneous Multithreading (SMT) on various Web servers with three versions of the Intel Xeon processor and showed that SMT has limited ability to yield significant performance improvements; their evaluation was conducted on 2 CPUs and 4 SMT contexts. Moreover, Kelly et al. (2008) conducted a performance evaluation of parallel servers using analytical models. However, that study did not compare the performance of different software models. Thus, in this paper we propose a new software architecture for MCP machines and develop a simple queuing model to analyze the performance of the three prior schemes and the proposed scheme. While our prior study (Choi et al., 2005) focused only on a Web server, this study extends that work to a generic software architecture model, in order to evaluate the performance and scalability of software architecture models in the MCP domain. The novelty of this study is that we propose a new software architecture model which can be applied to any kind of server, including Web, database and file servers, while other studies (Choi et al., 2005; Welsh et al., 2001; Park et al., 2001) proposed customized server models for a specific server type (e.g. Web, database or file servers). Thus, this paper mainly focuses on general software architecture models in the MCP domain, instead of Web server architecture models, and our study can be easily applied to shared-memory multiprocessing. First, to understand the performance implications of current software architecture models, we analyze the memory usage of three prior software architecture models: MP (The Apache Software Foundation, 2003; Menasce, 2003), MT (The Apache Software Foundation, 2003; Menasce, 2003) and ED (Pai et al., 1999). This is done by measuring the memory requirements of an Apache Web server (The Apache Software Foundation, 2003) for the MP and MT models and a Flash Web server (Pai et al., 1999) for the ED model on a Sun Solaris machine. The data cache and memory overhead analyses of the three models in an MCP environment reveal that the Multi-Thread (MT) model is ideal in providing a large data cache per server to enhance throughput. However, the memory consumption of this model can be significant with a large number of threads. Thus, an MT model with a relatively small number of threads can provide high throughput on MCP-based machines. Based on this rationale, we propose a new software architecture model, called SuperScalar, for MCP systems. The SuperScalar software architecture consists of a pipelined thread pool per processing core, where each pipelined thread pool can have several threads to support any specific server design.
Compared to the prior MT model, each thread in the SuperScalar design executes only one specified step, while each thread in the MT model is responsible for executing all processing steps. The main advantage of the proposed model is that the threads can share global information (e.g. the data cache). Thus, like the MT model, it needs relatively little memory to maintain the global information. However, unlike the MT model, it can alleviate the memory overhead by limiting the total number of threads to M × K, where M is the number of processing cores and K is the number of processing steps.
To evaluate the performance (throughput) of the proposed SuperScalar scheme and the three prior schemes, we have developed a simple closed queuing network model, which can provide accurate performance estimates. Using the analytical model as a design tool, we conduct several performance analyses by varying critical system parameters such as the number of processing cores, disk speed and memory size with several synthetic workloads. The main conclusions of this paper are as follows. First, our proposed SuperScalar software architecture model shows the best performance across various environments and workloads compared to the MP, MT and ED models. The MP model exhibits the worst performance due to the smaller data cache size available per process. Second, the MP and ED models suffer from decreasing data cache sizes with an increasing number of processing cores in an MCP machine due to little sharing of global information. Third, the MT model can provide throughput competitive with the SuperScalar model in smaller system configurations. However, as the number of processing cores increases, the SuperScalar model becomes a clear winner. All these results indicate that the SuperScalar model is a viable candidate for deploying MCP-based server architectures. The rest of this paper is organized as follows: in Section 2, we provide a summary of the prior software architecture models. Section 3 analyzes the memory requirements of the prior software architecture models. The SuperScalar software architecture model is presented in Section 4. Section 5 describes the queuing models of these software architectures, the simulator platform and the validation of the queuing model. The performance results are analyzed in Section 6, followed by concluding remarks in the last section.
2. Software architecture models

A typical server, such as a Web, file or Database server, consists of several processing steps. For example, an HTTP server consists of eight request processing steps. The first step, accept client connection, accepts an incoming connection from a client based on the socket operations. Second, the read request operation reads and parses an HTTP request from the client's connection. Third, the find file operation checks whether the requested file exists in the file system and whether the client has appropriate permissions. Fourth, the send response header step sends an HTTP response header to the client through a socket connection. Next, the check cache step checks whether the requested data is in the memory cache. On a cache hit, the read file operation reads the requested data from the memory cache. On a cache miss, the disk access operation reads the requested data from the file system, after which the Web server reads the data from the memory cache via read file. Finally, the send data step transmits the requested content to the client. In particular, for larger files, the read file and send data steps are repeated until all of the requested content has been transmitted (Pai et al., 1999). In a Database server, while the first and second steps are similar to the Web server, the third step parses and checks whether the requested query is valid. If the query is not valid, the Database server sends an error message to the user. With a valid query, the fourth step checks whether the result of the query is in a data cache. On a cache hit, the Database server simply sends the response to the user. On a cache miss, the fifth step handles the query to generate a result; in this case, the Database server might need to access the disk system. The last step sends the response to the user. In a file server, all processing steps are quite similar to a Web server. In a typical server, the number of steps is 7 or 8, but there could be more steps, such as in online transactions or complex applications. In addition, the number of cache accesses is usually one per request, but a server could make multiple cache accesses, depending on its complexity.
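As an illustration, the following minimal sketch (Python; the step names are illustrative only, summarizing the description above rather than reproducing code from the paper) lists the eight HTTP-server steps in order:

```python
# The eight request-processing steps of an HTTP server, in order.
# "read file" and "disk access" are alternatives selected by the
# outcome of "check cache" (hit vs. miss).
HTTP_STEPS = [
    "accept client connection",  # accept an incoming socket connection
    "read request",              # read and parse the HTTP request
    "find file",                 # check file existence and permissions
    "send response header",      # send the HTTP response header
    "check cache",               # is the requested data in the memory cache?
    "read file",                 # cache hit: read from the memory cache
    "disk access",               # cache miss: read from the file system
    "send data",                 # transmit content; repeats with "read file"
]                                # for large files (Pai et al., 1999)

for i, step in enumerate(HTTP_STEPS, 1):
    print(f"step {i}: {step}")
```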
G.S. Choi, C.R. Das / The Journal of Systems and Software 83 (2010) 1823–1837
1825
Fig. 1. Software architecture models. (a) Multi-Process (MP) model, (b) Multi-Thread (MT) model, and (c) Event-Driven (ED) model.
To implement these servers, three software architectures have been proposed in the literature, as shown in Fig. 1. The Multi-Process (MP) model, shown in Fig. 1(a), has a process pool, and each process is assigned to execute the basic K steps associated with servicing a request. Since it is the simplest approach to implementing a server, Web, database and file servers have been implemented using this model. Since multiple processes are employed, many requests can be served concurrently. Moreover, the MP model is fault-tolerant: because multiple processes are launched, the server keeps running even if some processes are unexpectedly terminated. However, the disadvantage of this model is the difficulty of sharing global information (e.g. shared cache information) among the processes. Each process has its own private address space, although processes can share global information through shared memory. This means that an MP-based application needs more memory to maintain the same cache size per process compared to other models. Thus, the overall performance of this model is expected to be lower than that of the other models (Markatos, 1996; Bestavros et al., 1995; Choi et al., 2005). The Multi-Thread (MT) model, on the other hand, consists of multiple kernel threads within a single shared address space. In Fig. 1(b), each thread takes care of a client's request and performs the request processing steps independently. The advantage of this model is that the threads can share global information; in particular, the data cache is shared among all threads. Thus, many high-performance applications (e.g. file, Web and database servers) have also been implemented using this model. However, not all Operating Systems (OSes) support kernel threads, and sharing the data cache information among many threads may lead to high synchronization overhead. Moreover, the memory consumption of the threads may become a performance obstacle if thousands of threads are launched in a single system. The third software architecture is the Event-Driven (ED) model, shown in Fig. 1(c). There is a global event queue in this model, and an event dispatcher reads the request associated with each event in the queue. Then, the dispatcher sends the request to the corresponding step, depending on the event. Unlike the prior models, the ED model uses non-blocking I/O operations to interleave I/O time with CPU time, and thus increases CPU utilization. Moreover, the ED
model can avoid context-switching overhead, because there is only a single process. However, the global event queue can become a performance bottleneck due to large queuing times. All these server models were originally proposed for single-CPU systems, and the scalability of these software architecture models needs to be examined for MCP systems.

3. Memory and cache hit analysis

In this section, we analyze the memory requirements and cache hit ratios of the software architecture models in an MCP machine. We define three terms to conduct the memory analysis. First, we use system memory to denote the main memory available to a specific server, although normally system memory refers to the entire main memory. For example, if the size of main memory is 1 GBytes and the OS uses 100 MBytes, then the system memory is 900 MBytes. Next, the available memory space for caching any contents is called the data cache mem. Finally, we refer to all additional memory space as memory overhead, which is equal to system memory − data cache mem. This includes the memory overhead for the application to maintain the data cache mem (e.g. the name, size and path of each item) and its processes or threads. Based on the memory analysis, we can calculate the expected cache hit ratio of an application, which in turn can be used for performance estimation. For the memory analysis, we choose a Web server, because it has been implemented with several software architecture models. We measured the memory usage using the system monitoring tool on a SunBlade2000 machine, which has a 900 MHz UltraSPARC III Cu processor (64-bit), 1 GBytes main memory and a 36 GBytes hard drive.

3.1. Memory usage in software architecture models

First, an MP-based Web server in a single node typically has 16 or 32 processes, and each process has its own data cache. Thus, the data cache mem per process reduces proportionally when we launch multiple processes. In addition, each server process incurs some overhead to maintain the data cache. To exam-
Fig. 2. Memory usage of the Apache and Flash web servers. (a) 20 MBytes cache size in an MP-based Apache Web server, (b) 20 MBytes cache size in an MT-based Apache Web server, and (c) 100 MBytes cache size in an ED-based Flash Web server.
ine the scalability problem of the MP model, we measure the total memory usage of an Apache Web server by varying the number of server processes in a single node, with the data cache mem set to 20 MBytes per process. Fig. 2(a) shows the memory usage as a function of the number of processes. As the number of processes increases, the memory usage increases by about 24 MBytes per process, due to the 20 MBytes data cache mem and other overheads (i.e. the memory overhead). Thus, the total memory usage is around (memory overhead + data cache mem) × P, where P is the number of Web server processes. Unlike the MP model, since threads in an MT model can share the global cache information, the memory requirement should not change significantly with an increase in the number of threads. To verify this, we increased the number of threads in an Apache Web server (The Apache Software Foundation, 2003) from 8 to 64, and measured the memory usage as shown in Fig. 2(b). We again set the data cache mem to 20 MBytes. Each thread consumed about 80 KBytes in the MT model configuration. In addition, each thread of the Network File System (NFS) daemon in the Sun Solaris OS consumes around 48 KBytes, so 1000 NFS server threads will need around 48 MBytes of memory if they are launched on 16 processors (maximum 64 threads per processor) (Sun Microsystems). Based on these observations, the problem with the MT model is the memory consumption of the threads. This overhead can become a major bottleneck in an MCP node. For example, if these machines have 16–64 processing cores, the total number of threads, T, can be in the hundreds or thousands, because the number of threads per CPU is 64 or 128.
Next, we ran a Flash Web server (Pai et al., 1999) on the Sun Solaris machine to analyze the memory usage of an ED model. The memory overhead of a Flash Web server consists of several terms. First, the major component of the memory overhead is the space required for maintaining the information of cached files, which is 850 Bytes per file. If we assume that the average Web file size is 15 KBytes (Pai et al., 1999) and the data cache mem is 100 MBytes, the maximum number of cached Web files is approximately 6800, and the memory overhead to maintain 100 MBytes of data cache mem is approximately 5 MBytes. Second, since the maximum number of path translation entries in a Flash Web server is 6000 and the maximum size of a path translation entry is about 1024 Bytes, the path translation consumes around 6 MBytes. Third, two helper processes in a Flash Web server, the read and path translation helpers, consume an additional 3 MBytes (Pai et al., 1999). Thus, besides the data cache mem itself, a Flash Web server needs an additional 14 MBytes to maintain the 100 MBytes data cache mem in main memory. While this memory overhead seems small for a single event-driven server process, it increases linearly with the number of servers. Fig. 2(c) shows the memory usage as a function of the number of event-driven Flash Web servers in a single node. We fix the data cache mem size to 100 MBytes and increase the number of processes from 1 to 8 in a node. In Fig. 2(c), for a single Web server, the memory overhead is only around 5 MBytes for 100 MBytes of data cache mem. However, when the number of server processes is eight, the memory overhead becomes 120 MBytes, which is even larger than the data cache mem size.
Table 1
Memory overhead in software architecture models, where F is the number of cached items, M is the number of processing cores, S is the additional memory overhead to maintain a single cache item, P is the number of processes per processing core, T is the number of threads per processing core, K is the number of steps, Y is the memory consumption per process, and X is the memory consumption per thread.

Software architecture    Memory overhead (Bytes)
MP model                 F × S × M × P + M × P × Y
MT model                 F × S + T × M × X
ED model                 F × S × M + M × Y
SuperScalar model        F × S + K × M × X
3.2. Memory overhead in software architecture models

Now, we estimate the memory overhead of the three prior models in an MCP system. Here, we assume that all applications have the same cache structure, even though they are based on different software architecture models. Assuming 16 or 32 server processes per processing core in an MP model, the total number of processes, P, could be very high in an MCP system. In the MT-based model, there is only one process, no matter how many threads are running. In the ED model, one process usually runs in one node, and thus we may launch one ED-based process per processing core. Based on this, we compute the memory overhead of each software architecture model in Table 1, when the system memory size is known. We assume that the size of an average cached item is A and the average size to maintain each cache element is S Bytes. Since the maximum number of cached files is F = data cache mem/A, the memory requirement for keeping the information of these entries is F × S Bytes. The total memory overhead of an MP-based application is (F × S × M × P + M × P × Y) Bytes, where M is the number of processing cores, P is the number of processes per processing core, and a single process consumes around Y Bytes
of memory. In an MT-based application, the total memory usage does not change significantly with the number of threads. The memory overhead can be calculated as (F × S + T × M × X) Bytes, since we assume that a single thread consumes X Bytes and T is the number of threads per processing core. For an ED-based application, the memory overhead is (F × S × M + M × Y) Bytes, with one ED-based server process per processing core.
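The Table 1 expressions translate directly into code. The following sketch (Python) encodes the four overhead formulas and, as an example, solves for the MT model's data cache mem under the parameter values used in Section 3.3; all names and concrete values are illustrative.

```python
# Memory overhead formulas from Table 1 (all sizes in bytes).
# F: cached items, S: per-item bookkeeping bytes, M: processing cores,
# P: processes per core, T: threads per core, K: steps,
# Y: bytes per process, X: bytes per thread.
def overhead_mp(F, S, M, P, Y):          return F * S * M * P + M * P * Y
def overhead_mt(F, S, M, T, X):          return F * S + T * M * X
def overhead_ed(F, S, M, Y):             return F * S * M + M * Y
def overhead_superscalar(F, S, M, K, X): return F * S + K * M * X

KB, MB, GB = 1024, 1024**2, 1024**3
A, S = 10 * KB, 500              # average item size and per-item overhead
M, T, K, X = 16, 128, 8, 80 * KB
system_memory = 1 * GB

# data cache mem = system memory - memory overhead, with F = cache / A.
# For the MT model this gives F*(A + S) + T*M*X = system_memory:
F_mt = (system_memory - T * M * X) // (A + S)
print("MT data cache mem:", F_mt * A // MB, "MBytes")
print("MT memory overhead:", overhead_mt(F_mt, S, M, T, X) // MB, "MBytes")
```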
3.3. data cache mem in software architecture models

Next, we calculate the available data cache mem of the three software architecture schemes, which is (system memory − memory overhead), since we assume that the main memory is used for data caching only. In this subsection, we examine the relationship among the data cache mem size, the number of processing cores and the system memory. In this experiment, we set the average file size (A) to 10 KBytes, the average memory consumption to maintain each cache element (S) to 500 Bytes (it is 850 Bytes in Choi et al. (2005)), the memory consumption per process (Y) to 1 MBytes and the memory consumption per thread (X) to 80 KBytes. Fig. 3 shows the variation of the available data cache mem as a function of the number of processing cores and the system memory. Fig. 3(a) shows that the data cache mem size in the MP model shrinks drastically as the number of processing cores increases. In Fig. 3(b), the data cache mem for the MT-based server is only slightly reduced, because there is a single server process and a thread can share cache information with other threads. However, as we will see later, the performance of the MT model might suffer from thread memory consumption as the number of threads increases. Fig. 3(c) shows that the data cache mem of the ED-based model also reduces as the number of processing cores increases. This is attributed to the non-sharing nature of the global information.
Fig. 3. data cache mem size in software architecture models. (a) Multi-Process model, (b) Multi-Thread model, and (c) Event-Driven model.
3.4. Cache hit ratio in software architecture models

Finally, we predict the cache hit ratios of the software architecture models based on the previously calculated data cache mem. Since the cache hit ratio significantly affects the performance of a server, we can predict performance based on the calculated cache hit ratios. In this experiment, we assume that the data set size is 4 GBytes. To measure the cache hit ratio, we use the perfect cache model in (Carrera et al., 2002), which contains the most highly accessed items up to the data cache mem size. Several studies (Breslau et al., 1999; Almeida et al., 1996; Fonseca et al., 2003) support that requests from a fixed user community have a skewed distribution according to Zipf's law, which is given by

p(i) = C / i^α,  where  C = ( Σ_{i=1}^{E} i^{−α} )^{−1},

where E is the total number of contents and α (i.e. the skewness factor) is equal to 1. A larger α (α > 1) means that the contents have a more skewed popularity compared to a lower α, and thus the temporal locality of the contents is high. When we assume that F is the number of cached files, the cache hit ratio is simply calculated (Carrera et al., 2002) as

Cache hit ratio = Σ_{i=1}^{F} p(i).

Fig. 4 shows the cache hit ratios of the three software architecture models. The MP model shows the worst cache hit ratio, since the global information cannot be shared among the processes. As the number of processing cores increases, the cache hit ratio is significantly reduced. Since the cache hit ratio of an MP-based server is dramatically reduced, the throughput would be low due to frequent disk accesses. When the number of processing cores is 16 with 1 GBytes system memory, the cache hit ratio is around 70%. In the MT model, the cache hit ratio (88.2% with 16 processing cores and 1 GBytes system memory) is much higher than in the MP model, and it is only slightly reduced as the number of processing cores increases. This is because the threads can share global information and there is only one process. The ED model shows a worse cache hit ratio than the MT model as the number of processing cores increases. With 16 processing cores and 1 GBytes system memory, the cache hit ratio is around 85%. While the difference in cache hit ratio between the MT and ED models is only around 3%, the impact of this difference on performance is significant, because the disk access latency is 1000 or 10,000 times larger than the cache access latency. Thus, increasing the cache hit ratio is crucial for performance improvement.
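The two formulas above combine into a short routine. The sketch below (Python, illustrative values) computes the perfect-cache hit ratio; with α = 1.0, a 4 GBytes data set and 1 GBytes of data cache mem it yields roughly 0.898, close to the 89.75% locality reported later in Table 5.

```python
def zipf_hit_ratio(F, E, alpha=1.0):
    """Perfect-cache hit ratio: the F most popular of E items are cached.

    p(i) = C / i**alpha, C = 1 / sum_{i=1..E} i**(-alpha),
    hit ratio = sum_{i=1..F} p(i).
    """
    C = 1.0 / sum(i ** -alpha for i in range(1, E + 1))
    return C * sum(i ** -alpha for i in range(1, F + 1))

GB, KB = 1024**3, 1024
A = 10 * KB                      # average item size
E = (4 * GB) // A                # items in the 4 GBytes data set
F = (1 * GB) // A                # items that fit in 1 GBytes of cache
print(f"cache hit ratio: {zipf_hit_ratio(F, E):.4f}")   # ~0.898
```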
4. A SuperScalar software architecture

In this section, we propose a new software architecture, called SuperScalar, which takes advantage of the MT model but mitigates the memory overhead by limiting the number of threads. Fig. 5(a) depicts the logical structure of the SuperScalar model. Unlike the MT model, here each server operation is mapped onto a thread. A thread first executes step 1 of an operation and then forwards it to the next thread (step 2), and the pipelined operation continues until the last step. We refer to these threads as a pipelined thread pool. A SuperScalar model can have multiple pipelined thread pools. Due to its similarity to superscalar execution in computer architecture, we call this model a SuperScalar software architecture. The most important characteristic of this model is that there is only a single process, even though there are multiple pipelined
Fig. 4. Cache hit ratio in software architecture models. (a) Multi-Process model, (b) Multi-Thread model, and (c) Event-Driven model.
Fig. 5. A SuperScalar software architecture model. (a) Architecture, (b) data cache mem, and (c) cache hit ratio.
thread pools. A pipelined thread pool can be launched on each processing core in an MCP system, and the threads can share the global information. Since the data cache mem size in a single-process model (i.e. the MT model) is larger than that in other models in a multi-CPU environment, the proposed model needs relatively little memory to maintain the global information compared to the MP and ED models. The main reason why the proposed model is suitable for an MCP environment is its almost constant data cache mem size (Fig. 5(b)). The multiple processes in an MP or ED model must each have their own private cache, and thus the data cache mem size reduces as shown in Fig. 3(a) and (c). Moreover, the SuperScalar model can alleviate the memory overhead compared to the MT model. The total number of threads in the proposed model is K × M, while it is T × M in the MT model, where M is the number of processing cores, T is the number of threads per processing core and K is the number of steps in an application. Since K × M is much smaller than T × M, the threads in the MT model are more likely to compete with each other for access to the shared information (i.e. the data cache mem). The total number of threads in the SuperScalar model can be 10–20 times smaller than in the MT model. Next, we calculate the memory overhead and data cache mem sizes of the SuperScalar software architecture. As shown in Table 1, the memory overhead of the SuperScalar model is equal to F × S + K × M × X Bytes. Like the previous experiment, we again use the perfect cache model (Carrera et al., 2002) and assume that the access pattern follows the Zipf distribution and the data set size is 4 GBytes. Fig. 5(b) shows the data cache mem size variation of this model when the number of processing cores and the system memory are varied. The memory requirement of this model is similar to the MT model, since these two models have only a single process. Based on these memory overhead and data cache mem sizes, we can predict the cache hit ratio of the SuperScalar software architecture model as shown in Fig. 5(c). With 16 processing cores and 1 GBytes system memory, the cache hit ratio is around 89.2%.
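To make the structure concrete, the following is a minimal sketch (Python; all names and the choice of the cache-access step are illustrative, not taken from the paper) of a single pipelined thread pool: K stage threads connected by queues, one thread per processing step, all sharing one global cache protected as a critical section.

```python
import queue
import threading

K = 4                                  # processing steps in this sketch
stages = [queue.Queue() for _ in range(K + 1)]   # stage s reads stages[s]
shared_cache = {}                      # global information shared by all threads
cache_lock = threading.Lock()          # the data cache is a critical section

def stage_worker(step):
    """One thread per step: process a request, forward it to the next stage."""
    while True:
        req = stages[step].get()
        if step == 2:                  # assume step 2 is the cache-access step
            with cache_lock:
                req["data"] = shared_cache.get(req["key"])
        stages[step + 1].put(req)      # pipelined hand-off to the next thread

pool = [threading.Thread(target=stage_worker, args=(s,), daemon=True)
        for s in range(K)]             # one pipelined thread pool (K threads)
for t in pool:
    t.start()

stages[0].put({"key": "/index.html", "data": None})
print("completed:", stages[K].get(timeout=1.0))
```

An MCP system would launch one such pool per processing core, giving K × M threads in total, all attached to the same shared cache.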
5. Performance modeling

In this section, we propose a closed queuing network model for analyzing the four software architectures, and validate the model through extensive simulation. While throughput and average request service time are the two most critical parameters in designing a server, in this paper we primarily focus on throughput as the objective function. For quantifying the throughput, we model the system as a closed queuing network instead of an open queuing network, so that we can control the number of users accessing the system.1

5.1. A queuing model

The queuing model consists of three parts. First, it should capture the number of sequential steps k in a request. Second, if a model has cache accesses, the cache contention overhead should be included in the service time computation. Third, the closed network should capture the cache and disk access behaviors in the overall system model. These three parts are described next. For modeling the service time of a request, we assume that a job consists of k independent and sequential steps and that the average service time of each step is 1/μ. In Fig. 6(a), this job structure is represented by the task precedence graph for a single-server multiprogramming system (Chang and Wallace, 1992). This task precedence graph can be translated into a stochastic queuing model as shown in Fig. 6(b) (Chang and Wallace, 1992). In the queuing model, the probability that a job will leave the system is q (= 1/k), and (1 − q) is the probability that the job rejoins the system through the feedback loop. The total service time is k/μ, since a job passes through the system k times.
1 The real system should be a combination of open and closed queuing networks, and we plan to investigate such a hybrid model in the future.
Fig. 6. Modeling of a task precedence graph. (a) A k-step task precedence graph and (b) the corresponding queuing model.
Nowadays, a server usually employs a data cache in main memory to boost its performance by reducing the number of disk accesses. If a cache miss occurs, the application reads the requested data from a disk. Since the disk access time is 1000 or 10,000 times larger than the cache access time, it is important to consider the impact of the data cache on application performance. To model the data cache using a queuing model, we need to consider the serialization overhead of accessing the cache in a multiprocessor system. That is, a process or a thread on a certain processor accesses the cache as a critical section, while the other competing threads/processes must wait for it. However, it is difficult to capture such synchronization in a queuing model (Jain, 1991; Trivedi, 2002). Nelson (1990) used a synchronizing queue to add the delay time due to the synchronization overhead of a parallel application in a multi-processor system. However, the limitation of this approach is that it can only represent CPU-intensive jobs that encounter multiple barrier synchronizations (Chang and Wallace, 1992); it does not model disk access behavior, since jobs are assumed to be CPU-intensive and the impact of I/O operations is not considered. Chang and Wallace (1992) extended Nelson's work and proposed a multiple feedback queuing model to analyze multi-programmed multi-processor systems. In this paper, we extend Nelson's work (Nelson, 1990) and develop a queuing model for an MCP system with a data cache in main memory. We calculate the synchronization overhead as follows. The probability for a job to access the critical section is 1/k, since the application consists of k sequential steps and we assume that there is only one step that accesses the critical section (the data cache). The probability that a job does not access the critical section is (k − 1)/k. Thus, the probability that i processes simultaneously access the critical section is C(m, i) × (1/k)^i × ((k − 1)/k)^(m−i), where m is the number of processing cores, i is the number of processes and C(m, i) is the binomial coefficient. The probability of a synchronization contention is Σ_{i=2}^{m} C(m, i) × (1/k)^i × ((k − 1)/k)^(m−i), since at least two processes or threads must access the cache simultaneously. Therefore, the overall synchronization overhead is [Σ_{i=2}^{m} C(m, i) × (1/k)^i × ((k − 1)/k)^(m−i)]/μ. We add the average synchronization overhead (the overall synchronization overhead divided by m) to 1/μ to obtain the average service time in an M/M/m queuing model, since m jobs can run concurrently on the m processing cores. Fig. 7 depicts the closed queuing model of a Multi-Core Processor with a disk system, where the m processing cores are represented as an M/M/m queue. Let us assume that there are N clients and the average think time is Z. In addition, we set the data cache miss probability per request to p. After a client sends a request to a server, the request arrives at the central queue, and each step of the request can run on any processing core. This is a reasonable assumption, since a thread can run on any available processor in a multi-processor system. A request goes back to the queue with probability ((1 − q) × (1 − p)) and leaves the server with probability q = 1/k, since we model the k steps using a feedback loop. Once a client receives the response from the server, it sends the next request after an average think time Z. Since the disk access probability is p and there are k steps in a server, the disk access probability of each step is ((1 − q) × p).
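The contention term above is a binomial sum and is easy to evaluate numerically. The following sketch (Python, illustrative values) computes the average synchronization overhead that is added to 1/μ:

```python
from math import comb

def avg_sync_overhead(m, k, mu):
    """Average synchronization overhead per job, added to 1/mu.

    A job is in its single cache-access step with probability 1/k;
    contention requires at least two simultaneous accesses (i >= 2).
    """
    p = 1.0 / k
    contention = sum(comb(m, i) * p**i * (1.0 - p)**(m - i)
                     for i in range(2, m + 1))
    return (contention / mu) / m       # overall overhead divided by m cores

mu = 1.0 / 100e-6                      # 1/mu = 100 microseconds per step
for m in (2, 4, 8, 16):
    print(m, "cores:", f"{avg_sync_overhead(m, k=5, mu=mu) * 1e6:.2f} us")
```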
Fig. 7. A queuing model for a multi-core processor with a disk system.
We use the Mean Value Analysis (MVA) algorithm (Jain, 1991; Trivedi, 2002) to solve the queuing model. The MVA algorithm, shown in Algorithm 1, calculates the average throughput for a given system configuration. Now, we expand the generic queuing model for analyzing the four software architecture schemes. First, in the MP model, there is no need to model the cache access overhead, since each process has its own cache; the cache is not a critical section in the MP model. Thus, we can simply apply an M/M/m queuing model without the synchronization overhead for the MP model. Like the MP model, the ED model has no additional cache access overhead, since each process has its own cache. Next, we expand the M/M/m queuing model for the MT and SuperScalar models. In the MT model, there is only one process, no matter how many threads are launched in an MCP system, and these threads share the cache. The cache must therefore be treated as a critical section, because only one thread can access the cache at a given time. This cache access overhead is the same as the synchronization overhead explained above. In the SuperScalar model, we launch a pipelined pool of k threads per processing core, even though there is only one process; thus (k × m) threads might run on an MCP machine. Of these (k × m) threads, only m threads access the data cache, since we assume that only one step out of the k processing steps needs to access the cache. Thus, we need to consider the same cache access overhead as in the MT model. Algorithm 1 shows the MVA algorithm used to solve these four software architecture models. Here, we set the number of devices to 2, since we only model the CPU and the disk. We calculate the cache access overhead using the N, M, C and K values. For the MP and ED models, we do not include the cache access overhead in the service time S1. However, for the MT and SuperScalar models, we add the synchronization overhead (i.e. cache access overhead() in Algorithm 1) to the CPU service time (S1). We set the average disk access time (D) as S2. In addition, we calculate the cache hit ratio (i.e. calculated cache hit ratio() in Algorithm 1) based on the system and workload parameters, as described in Section 3. Since a disk access results from a cache miss, the disk visit count (V2) is equal to 1.0 − calculated cache hit ratio(), while the CPU visit count (V1) is the number of steps (K in Algorithm 1). The rest of the MVA algorithm calculates the throughput of the four software architecture models using the N, S, V and Z values.

Algorithm 1. The Mean Value Analysis (MVA) algorithm for software architecture models.
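The body of Algorithm 1 is not reproduced in the text; the following is a plausible reconstruction (Python) of the exact MVA recursion for the two-device closed network of Fig. 7, with the visit counts described above. One simplification is labeled in the code: the m-core CPU is treated as a single queueing center with service time S1/m, whereas the paper models it as an M/M/m station, so this sketch only approximates the multi-server behavior.

```python
def mva_throughput(N, K, s_cpu, s_disk, hit_ratio, Z, m):
    """Approximate MVA for the closed CPU+disk network of Fig. 7.

    N: clients, K: steps per request, s_cpu: CPU time per step,
    s_disk: disk access time, Z: think time, m: processing cores.
    Visit counts: V_cpu = K, V_disk = 1 - hit ratio (one access per miss).
    """
    V = [K, 1.0 - hit_ratio]           # visit counts: CPU, disk
    S = [s_cpu / m, s_disk]            # per-visit service times (CPU approx.)
    Q = [0.0, 0.0]                     # mean queue lengths
    X = 0.0
    for n in range(1, N + 1):          # add one customer at a time
        R = [V[i] * S[i] * (1.0 + Q[i]) for i in range(2)]
        X = n / (Z + sum(R))           # system throughput with n customers
        Q = [X * r for r in R]         # Little's law per device
    return X

# Validation-style example (Table 3 values): the disk (4.2 ms per miss,
# 10% miss ratio) bounds throughput near 1/(0.1 * 4.2 ms) ~ 2380 req/s.
x = mva_throughput(N=100, K=5, s_cpu=100e-6, s_disk=4.2e-3,
                   hit_ratio=0.90, Z=0.0, m=4)
print(f"throughput: {x:.0f} requests/s")
```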
Table 3
System parameters for model validation.

Average file size    10 KBytes
α                    1.0
1/μ                  100 μs
Number of clients    100
Think time           0.0
5.2. Simulator

We have developed a simulator platform which can model the MP, MT, ED and SuperScalar software architecture models. The simulator receives system configuration parameters, such as the number of processing cores and the system memory size, as input. The client module generates a request using an exponential distribution of the average file size, and sends the request to a server. The main part of the simulator is the request processing module based on the software architecture schemes. Whenever a request arrives, the application module processes the basic operations. The simulator calculates the cache hit ratios of the schemes from the system parameters. Then, the simulator decides whether the requested data is in the data cache or not. Using the α value of the Zipf distribution, we model workloads with different localities. Finally, a client sends the next request after an average think time Z, when the previous request is completed. In our experiments, we set Z to zero, because this is primarily done to stress the server. Since disk latency is a critical factor in quantifying server performance, we modeled three hard drives, representing fast, medium and slow drives from Western Digital Technologies (Western Digital Corporation, 2009). The hard drive specifications are given in Table 2. In addition, we extended our simulator to model Operating System (OS) scheduling and synchronization overheads, since these can significantly affect server performance, while we mainly focus on architectural tradeoffs such as the number of processors and the data cache size in the server. In our simulator, both the OS scheduling overhead and the synchronization overhead are 20 μs (Appleton, 1999). Throughput (the number of completed requests per second) is the objective function analyzed in this paper.
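For the client module, a Zipf-distributed request stream with exponentially distributed sizes can be generated as in the sketch below (Python with numpy; all names and values are illustrative, not the paper's actual simulator code).

```python
import numpy as np

def make_zipf_sampler(num_items, alpha, rng):
    """Sampler of item ids 1..num_items with p(i) proportional to 1/i**alpha."""
    weights = 1.0 / np.arange(1, num_items + 1) ** alpha
    probs = weights / weights.sum()
    return lambda n: rng.choice(num_items, size=n, p=probs) + 1

rng = np.random.default_rng(42)
E = (4 * 1024**3) // (10 * 1024)       # 4 GBytes data set, 10 KBytes items
sample_ids = make_zipf_sampler(E, alpha=1.0, rng=rng)

ids = sample_ids(5)                                # which items are requested
sizes = rng.exponential(scale=10 * 1024, size=5)   # average size 10 KBytes
for i, sz in zip(ids, sizes):
    print(f"request: item {i}, size {sz / 1024:.1f} KBytes")
```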
Table 2
Performance parameters of three hard drives.

Hard drive model              Fast     Medium   Slow
Capacity (GBytes)             1000     640      500
Average latency (ms)          2.99     4.20     5.35
Rotational speed (RPM)        10,000   7200     5400
Data transfer rate (GBits/s)  3        3        0.8
5.3. Model validation

We have validated the queuing analytical model with synthetic workloads by comparing its results with those of the simulator. Table 3 shows the system parameters used for the model validation. We fixed the number of clients at 100 and varied the number of processing cores from 1 to 16. We set the average CPU service time to 100 μs, the think time to zero and the number of steps (K) to 5. We use the fast disk model shown in Table 2; the disk access time for 10 KBytes is about 4.2 ms. We repeated the experiments for two cache hit ratios, 90% and 99%.2 We obtained the results of the simulator with a 95% confidence level, and the difference between the max and min values is less than 3%; thus, we do not draw the confidence intervals in our graphs. In this paper, we mainly focus on a simulation-based methodology, since we plan to implement a file server and a Database server based on the proposed model in the future. Fig. 8 depicts the throughput results of the four software architecture models obtained from the analysis. In Fig. 8(a) and (b), we set the cache hit ratio to 90%, while the cache hit ratio is 99% in Fig. 8(c) and (d). The throughput results of the analytical model match those of the simulator almost exactly when we vary the number of processing cores in the four software architecture schemes. This is because the disk system is the performance bottleneck due to the low cache hit ratio. Next, we increase the cache hit ratio to 99%, which reveals the cache access overheads of the four models. In Fig. 8(c), the MP model shows linear throughput improvement up to 6 processing cores, before the disk system becomes the performance bottleneck. For the MT model, the simulator shows less throughput (within 6%) than the queuing model. This is because the OS scheduling and synchronization overheads reduce the throughput of the simulator in the MT model. The interesting point is that, unlike Fig. 8(a), the MT model shows about 30% less throughput than the MP model. This is because the cache access contention is high at a 99% cache hit ratio, due to the very low cache miss rate in the MT model, while in the MP model there is no cache access contention because each process has its own cache. The reason why the MP model shows better performance than the MT model as the number of cores increases is that there is no contention for the data cache in the MP model, while in the MT model threads must compete with other threads to access the data cache. In Fig. 8(d), the throughput results from analysis and simulation are very close for the ED model. Like the MP model, the ED model has no cache access contention. In the SuperScalar model, the throughput is almost 30% less than that of the ED model. For the SuperScalar model, the simulator shows less throughput (within 6%) than the queuing model, because the OS scheduling and synchronization overheads reduce the throughput of the simulator in this model as well. In this model validation, it is very hard to differentiate the performance of the software architecture models, since we used the same cache hit ratio across all four models.
2 We conducted experiments with two more cache hit ratios, 95% and 98%. The results of the analytical model are very close to those of the simulator; these results are not included due to space limitations.
Fig. 8. Analytical model validation of the four software architecture schemes, where N = 100, α = 1.0, K = 5 and A = 10 KBytes. (a) The MP and MT models with 90% cache hit ratio, (b) the ED and SuperScalar models with 90% cache hit ratio, (c) the MP and MT models with 99% cache hit ratio and (d) the ED and SuperScalar models with 99% cache hit ratio.
However, in the next section we will clearly show the advantage of the SuperScalar model compared to the prior three models.
6. Performance analysis

In this section, we present a performance comparison of the four software architecture models under various system parameters, using the analytical model. The simulator results in these cases also match the queuing model well; however, the simulator results are not plotted, for clarity. In all the following experiments, we set the data set size to 4 GBytes, the think time (Z) to zero and the memory overhead to maintain a single cached item to 500 Bytes. In addition, Table 4 shows the modeled performance parameters of the OS. In our experiments, both the OS scheduling overhead and the synchronization overhead are 20 μs (Appleton, 1999), and we set the size of the OS-level cache to 500 MBytes. The OS can significantly affect server performance, since it usually caches frequently used data in OS memory. In our experiments, we assume that a server consists of multiple steps and each step has the same amount of work.
Table 4
Performance parameters of the operating system.

Scheduling overhead       20 μs
Synchronization overhead  20 μs
OS-level caching          500 MBytes
6.1. Impact of synthetic workloads

First, we show the performance results with various synthetic workloads. We generated three synthetic workloads, as explained in Section 5, by varying the shape parameter (α) of the Zipf distribution. We use 3 GBytes as the system memory and 10 KBytes as the average cached file size (A), while the number of clients is fixed at 1000. Moreover, we assume that each request consists of 10 sequential steps and each step has the same amount of work. To better understand our results, Table 5 presents the cache locality for the given data set size and α values. We measured the localities of the three workloads using the method described in Section 3.
Table 5
Locality of the workloads with 4 GBytes data set size.

Shape parameter (α)   data cache mem   Locality (%)
0.9                   3.5G             98.33
                      3.0G             96.43
                      2.0G             91.57
                      1.0G             83.71
1.0                   3.5G             99.01
                      3.0G             97.87
                      2.0G             94.87
                      1.0G             89.75
1.1                   3.5G             99.65
                      3.0G             99.24
                      2.0G             98.14
                      1.0G             96.15
Fig. 9. Performance comparison among the software architecture models by varying the shape parameter (α) of the Zipf distribution, where the number of clients = 1000 and system memory = 3 GBytes. (a) α = 0.9, (b) α = 1.0 and (c) α = 1.1.
Fig. 9 shows that the proposed SuperScalar scheme outperforms all prior models when we vary the number of processing cores in an MCP machine. The MP model shows the worst throughput across all synthetic workloads. Since the MP model requires more space for the memory overhead, and this overhead increases as the number of processing cores increases, it suffers from a high cache miss ratio, and thus the disk becomes the performance bottleneck. In Fig. 9(a), we use the synthetic workload with a low locality (α = 0.9) in the Zipf distribution. As shown in Table 5, the synthetic workload with α = 0.9 incurs relatively high cache misses and frequent disk accesses compared to the other cases. In Fig. 9(a), the SuperScalar model shows up to 300% and 80% improvement in throughput compared to the MP and ED models, respectively. OS-level caching might have a positive impact on the performance of the MP and ED models, since the OS might cache frequently accessed data across all processes. In contrast, OS-level caching might have a negative impact on the performance of the MT and SuperScalar models, since the frequently accessed data are cached in duplicate in the OS-level and application caches. This is because there is only one process in the MT and SuperScalar models, while there are multiple processes in the MP and ED models. The performance difference between the SuperScalar model and the MP and ED models increases with the number of processing cores. This is because the data cache mem of the MP and ED models is greatly reduced as the number of cores increases. The MT model shows performance comparable to the SuperScalar model, since it does not suffer from high disk access rates, thanks to its larger data cache mem. Interestingly, the throughput of the SuperScalar model does not increase linearly beyond 3 processing cores in Fig. 9(a), because the disk becomes the performance bottleneck due to the low locality of the workload. Unlike the SuperScalar model, the throughput of the MT model decreases beyond 3 processing cores. This is
because the memory consumption of the threads increases, and thus the available cache memory decreases, as the number of threads increases. In addition, OS scheduling and synchronization overheads negatively affect the performance of the MT model as the number of threads increases. The ED model incurs more memory overhead than the SuperScalar and MT models, because the multiple processes in the ED model cannot share global information, except for the 500 MBytes OS-level cache. The ED model needs 4 times more memory overhead than the SuperScalar model for 16 processing cores. Thus, the available data cache mem in the ED model decreases as the number of processing cores increases. Fig. 9(b) shows the results for the synthetic workload with α = 1.0 in the Zipf distribution. As shown in Table 5, it has medium locality compared to the other two synthetic workloads. The overall throughput of all models increases by up to 80%, but still shows a trend similar to Fig. 9(a). The throughput of the SuperScalar model levels off beyond 4 processing cores, while the throughput of the MT model reduces as the number of processing cores increases. In the case of the ED model, the throughput decreases beyond 4 processing cores, because the data cache mem reduces as the number of processing cores increases. The workload with α = 1.1 in Fig. 9(c) has the highest locality, and thus the disk is no longer the performance bottleneck. With this workload, the SuperScalar model outperforms all other models noticeably. Unlike the results for the other two workloads, the MT model saturates beyond 8 processing cores. This is because the memory, OS scheduling and synchronization overheads of the threads increase as the processing cores increase. The MT model in our experiment launches 128 threads per processing core, and thus there are more than a thousand threads in a 16-core MCP. In contrast, the number of threads in the SuperScalar model is around 10 times less than in the MT model. Thus, the proposed model significantly reduces the memory overhead. In addition, the
Fig. 10. Performance comparison of software architectures by varying the system memory with N = 1000, α = 1.0 and A = 10 KBytes. (a) 2.5 GBytes system memory and (b) 3.5 GBytes system memory.
MT model incurs more OS scheduling and synchronization overhead than the SuperScalar model due to its 128 threads per processing core.
6.2. Impact of data cache size

In this experiment, we examine the impact of the data cache mem by varying the system memory, with the number of clients fixed at 1000. We set the shape parameter α = 1.0 in the Zipf distribution for this experiment. Fig. 10(a) depicts the throughput results of the four software architectures for 2.5 GBytes system memory. In this experiment, all four architecture schemes incur a large number of cache misses due to the smaller data cache mem size. Thus, the throughputs of all models are bounded by the disk speed; in other words, the throughput is not scalable with the number of processing cores. The proposed model shows the best throughput, while the performance of the MP model is the worst, as expected. The throughput of the ED model increases until the number of processing cores is 3, and decreases after that. The ED model shows throughput competitive with the MT model, since the OS uses 500 MBytes as an OS-level cache within the 2.5 GBytes system memory. More processing cores in the SuperScalar model do not reduce the data cache mem significantly, and thus it yields stable throughput. The throughput results with 3 GBytes system memory are plotted in Fig. 9(b). The performance difference between the SuperScalar and MT models becomes more evident as the system memory increases. Since the cache miss rate decreases, the throughputs of all servers improve drastically. However, the disk is still the bottleneck for all models, as seen in Fig. 9(b). In Fig. 10(b), the throughput results with
3.5 GBytes system memory show that all servers benefit from a large cache. In the ED model, the OS-level cache (i.e. 500 MBytes) provides little performance improvement with 3.5 GBytes system memory. The SuperScalar model shows up to 25% throughput improvement compared to the MT model. The results of this section indicate that the cache size has a great impact on the throughput of all software architectures.
6.3. Impact of average file size

In this subsection, we explore the performance impact of the average file size on the software architecture schemes. In this experiment, we varied the average file size (A) from 1 KBytes to 100 KBytes and set the system memory to 3.5 GBytes. In addition, α in the Zipf distribution is set to 1.0 and the number of clients is fixed at 1000. Fig. 11 shows the throughput results for two file sizes (1 KBytes and 100 KBytes); the throughput results with a 10 KBytes average file size are depicted in Fig. 10(b). The throughput with a 100 KBytes average file size, shown in Fig. 11(b), is two times higher than with the smaller average file size in Fig. 11(a). This is because the memory overhead depends on the number of cached items: if the average file size is small (i.e. 1 KBytes), the data cache keeps more cached files than with a larger average file size, and thus consumes more memory to maintain them. Table 6 shows the cache hit ratios of the models for the two average file sizes on a 16-core MCP. As shown in Table 6, with a 1 KBytes average file size the cache hit ratios of the models are significantly lower than with a 100 KBytes average file size. This explains the throughput difference between Fig. 11(a) and (b).
Fig. 11. Performance comparison by varying the average file size with N = 1000 and α = 1.0. (a) 1 KBytes and (b) 100 KBytes.
Table 6
Locality of the workloads for 4 GBytes data set size and 3.5 GBytes system memory.

Average file size (A)   Model         Locality (%)
1 KBytes                MP            67.97
                        MT            96.22
                        ED            85.24
                        SuperScalar   96.57
100 KBytes              MP            90.79
                        MT            98.24
                        ED            98.08
                        SuperScalar   98.73
6.4. Impact of disk speed

The performance of a software architecture is significantly affected by the speed of the hard drive. In all prior experiments, we used the parameters of a high-performance hard drive, given in Table 2. In this subsection, we evaluate the performance impact of slow and medium hard drive models on the software architectures. We set α = 1.0 and N = 1000, and fix the system memory at 3.5 GBytes, which should clearly reveal the impact of the disk model on software architecture performance. Fig. 12 shows the throughput results with the medium and slow hard drives. The performance results with the fast hard drive model are plotted in Fig. 10(b); they show two times higher throughput than the slow hard drive model in Fig. 12(a). This is because the average disk access latency of the slow disk is two times that of the fast disk, while the ratio of disk accesses to the total number of completed requests is almost the same across all hard drive models. With the slow hard drive in Fig. 12(a), the throughput of the ED and MP models deteriorates more severely: as the number of processing cores increases, the available data cache memory shrinks, resulting in frequent disk accesses. The medium-speed hard drive results in Fig. 12(b) show a similar tendency, although with better performance than the slow hard drive model. In Fig. 10(b), the proposed model with the fast disk yields much better throughput, since the disk is no longer the performance bottleneck. The performance of the SuperScalar and MT models in Fig. 12(a) and (b) increases until the number of processing cores reaches 4 with the slow disk and 5 with the medium disk; after these points, the disk becomes the performance bottleneck in these two models. In particular, the MT model saturates earlier because it has a higher memory overhead to maintain more threads compared to the SuperScalar model.
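A simple bound makes the role of the disk explicit: once the disk saturates, throughput is capped near 1/(m × S_d), where m is the fraction of completed requests that access the disk and S_d is the disk service time. The sketch below uses assumed service times, not the hard drive parameters of Table 2.

# Illustrative disk-bound throughput cap (assumed service times): once the
# disk saturates, X <= 1 / (miss_ratio * disk_service_time).
def disk_bound_throughput(miss_ratio, disk_service_ms):
    return 1000.0 / (miss_ratio * disk_service_ms)     # requests per second

for name, svc_ms in (("fast", 5.0), ("medium", 7.5), ("slow", 10.0)):
    cap = disk_bound_throughput(miss_ratio=0.05, disk_service_ms=svc_ms)
    print(f"{name:6s} disk: throughput cap = {cap:6.0f} req/s")

Doubling the disk service time halves the cap, matching the factor-of-two gap between the fast and slow hard drive results.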
6.5. Other system configurations

As outlined in Section 2, a server in a data center could involve more than 10 processing steps to handle a request. In addition, the server could incur multiple data cache accesses, depending on its complexity. In this subsection, we vary the number of steps and the number of cache accesses to examine the impact of synchronization overhead and frequent cache accesses on performance. We set α to 1.0 to model the workload, and the system memory size is fixed at 3.5 GBytes. In Fig. 13(a), when the number of steps is 15 and the number of cache accesses is 2, the throughput difference between the SuperScalar and MT models increases, while the other models suffer from a high cache miss ratio. This is because multiple accesses to the data cache incur high synchronization overhead and more disk accesses compared to the other experiments; thus, the disk becomes the performance bottleneck in this experiment.
Fig. 12. Performance comparison by varying the disk speed with N = 1000, α = 1.0 and A = 10 KBytes. (a) A slow disk and (b) a medium disk.
Fig. 13. Performance comparison by varying the number of steps and thread configurations with N = 1000, α = 1.0 and A = 10 KBytes. (a) 15 steps with two critical sections and (b) different thread configurations.
Next, we conducted an experiment with two additional thread models. The first limits the MT model to 10 threads per core (the number of threads per core in the SuperScalar model) instead of 128 threads. The second models the Apache Web server 2.0 (The Apache Software Foundation, 2003) in our simulator; the Apache Web server 2.0 can consist of multiple processes, each of which can have multiple threads. In Fig. 13(b), REDUCED MT denotes the MT model with the limited number of threads and APACHE denotes the Apache Web server model. REDUCED MT shows no performance degradation beyond 5 cores, while MT performs worse as the number of cores increases; this is because REDUCED MT has little memory overhead compared to the MT model due to its small number of threads. The APACHE model shows significant performance degradation beyond 2 cores, since each process in the APACHE model cannot share its cache with the other processes. In this experiment, the SuperScalar model noticeably outperformed all other models, even with these two additional thread models. Finally, we also varied the number of clients to examine scalability in terms of the number of concurrent connections; since we observed no throughput improvement beyond 1000 clients, we omit these results due to space limitations.
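For concreteness, the schematic below sketches the SuperScalar-style configuration compared above: K pipeline steps, a small fixed pool of threads per step, and one global cache shared by every thread. The step logic, thread counts and request names are placeholders, not the implementation evaluated in the simulator.

# Schematic sketch (placeholder logic): a pipelined thread pool with K
# steps, a few threads per step, and one cache shared by all threads.
import queue
import threading

K_STEPS = 5                  # sequential steps per request
THREADS_PER_STEP = 2         # small fixed pool, unlike 128 threads/core in MT
shared_cache = {}            # global data cache visible to every thread
cache_lock = threading.Lock()
stages = [queue.Queue() for _ in range(K_STEPS + 1)]   # stages[K_STEPS] = done

def worker(step):
    while True:
        req = stages[step].get()
        if req is None:                        # shutdown sentinel
            return
        with cache_lock:                       # a hit by any thread benefits all
            if req not in shared_cache:
                shared_cache[req] = object()   # simulate fetching from disk
        stages[step + 1].put(req)              # hand off to the next step

threads = [threading.Thread(target=worker, args=(s,))
           for s in range(K_STEPS) for _ in range(THREADS_PER_STEP)]
for t in threads:
    t.start()

requests = ["/a.html", "/b.html", "/a.html"]   # "/a.html" hits on its reuse
for req in requests:
    stages[0].put(req)
for _ in requests:                             # wait for completions
    stages[K_STEPS].get()
for s in range(K_STEPS):                       # stop the workers
    for _ in range(THREADS_PER_STEP):
        stages[s].put(None)
for t in threads:
    t.join()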
7. Conclusions

Design of high-performance servers is essential for providing adequate support to the increasing demand for various services. Although several prior studies (Welsh et al., 2001; Choi et al., 2005; Ruan et al., 2005) have focused on the design of Web server architectures for small-scale SMP machines, there is no software architecture to support server design on multi-core machines, which are likely to become the de facto server architectures in the near future. In this paper, we have proposed a software architecture, called SuperScalar, to implement servers (Web, file and database servers) on MCP machines. The SuperScalar architecture consists of multiple pipelined thread pools, where each pipelined thread pool is composed of K sequential steps that implement a higher-level server architecture. Although the overall idea of the proposed architecture is similar to the MT concept, it differs from the latter in the number of threads and in the functionalities assigned to different threads. The main advantage of the proposed model is that its threads can share global information such as the data cache structure; thus, like the MT model, it needs relatively little memory to maintain the global information. To analyze the performance of the prior and proposed software architectures, we have developed a simple closed queuing network model that captures the caching and disk access behavior of a software architecture. Performance analyses of the three prior software architecture models (MP, ED and MT) and the proposed SuperScalar model show that the proposed model delivers the best performance across various system configurations and workloads. The MT model is a close competitor of the SuperScalar model for several system configurations, specifically when the number of processors is small; however, the proposed model outperformed the MT model as the number of processors increased. Furthermore, while the MT and ED models suffered from low locality and inadequate system memory, the proposed model exhibited good throughput across all experimental conditions. The MP model, as expected, is the worst performer. Thus, the SuperScalar architecture is a viable candidate for implementing any server on multi-core machines. We plan to implement a file server and a database server based on the proposed model in the future. In addition, a hybrid model, designed as a combined open and closed queuing network, may be a better model for MCP-based software architectures.
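As one illustration of this modeling style, exact Mean Value Analysis solves such a closed network iteratively over the client population; the service demands below are assumed values, not the parameters used in the experiments.

# Illustrative exact MVA for a closed queuing network (assumed demands).
def mva_throughput(demands, n_clients, think_time=0.0):
    q = [0.0] * len(demands)                        # mean queue lengths
    x = 0.0
    for n in range(1, n_clients + 1):
        resid = [d * (1 + qi) for d, qi in zip(demands, q)]  # residence times
        x = n / (sum(resid) + think_time)                    # throughput
        q = [x * r for r in resid]
    return x

# Assumed per-request service demands (seconds) at CPU, cache and disk.
print(f"throughput ~ {mva_throughput([0.001, 0.0005, 0.004], 1000):.0f} req/s")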
References

Almeida, V., Bestavros, A., Crovella, M., Oliveira, A., 1996. Characterizing reference locality in the WWW. In: Proceedings of the Fourth International Conference on Parallel and Distributed Information Systems (PDIS), December, pp. 92–103.
Appleton, R., 1999. Improving context switching performance for idle tasks in Linux. In: Proceedings of Computers and Their Applications (CATA-99), pp. 237–239.
Bell, G., Gray, J., 2002. What's next in high-performance computing? Communications of the ACM 45 (2), 91–95.
Benini, L., Micheli, G.D., 2002. Networks on chips: a new SoC paradigm. IEEE Computer 35 (1), 70–78.
Bestavros, A., Carter, R.L., Crovella, M.E., Cunha, C.R., Heddaya, A., Mirdad, S.A., 1995. Application-level document caching in the Internet. In: Proceedings of the 2nd International Workshop on Services in Distributed and Networked Environments, p. 166.
Breslau, L., Cao, P., Fan, L., Phillips, G., Shenker, S., 1999. Web caching and Zipf-like distributions: evidence and implications. In: Proceedings of IEEE INFOCOM'99, March, vol. 1, pp. 126–134.
Cardellini, V., Casalicchio, E., Colajanni, M., Yu, P.S., 2002. The state of the art in locally distributed web-server systems. ACM Computing Surveys (CSUR) 34 (2), 263–311.
Carrera, E.V., Rao, S., Iftode, L., Bianchini, R., 2002. User-level communication in cluster-based servers. In: Proceedings of the Eighth International Symposium on High-Performance Computer Architecture (HPCA'02), pp. 248–259.
Chang, C., Wallace, V.L., 1992. Multiple feedback queue as a model of general purpose multiprocessor systems. In: CSC'92: Proceedings of the 1992 ACM Annual Conference on Communications, pp. 493–500.
Choi, G.S., Kim, J.-H., Ersoz, D., Das, C.R., 2005. A multi-threaded PIPELINED Web server architecture for SMP/SoC machines. In: Proceedings of the 14th International Conference on World Wide Web, pp. 730–739.
Edenfeld, D., Kahng, A.B., Rodgers, M., Zorian, Y., 2004. 2003 technology roadmap for semiconductors. IEEE Computer 37 (1), 47–56.
Fonseca, R., Almeida, V., Crovella, M., Abrahao, B., 2003. On the intrinsic locality properties of web reference streams. In: Proceedings of IEEE INFOCOM, March, pp. 448–458.
Intel, 2009a. Intel Core2 Quad Processors. Available from http://www.intel.com/products/processor/core2quad/index.html.
Intel, 2009b. Intel Previews Intel Xeon Nehalem-EX Processor. Available from http://www.intel.com/pressroom/archive/release/20090526comp.html.
Jain, R., 1991. The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling. John Wiley & Sons, Inc., New York.
Kelly, T., Shen, K., Zhang, A., Stewart, C., 2008. Operational analysis of parallel servers. In: IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, September, pp. 1–10.
Markatos, E.P., 1996. Main memory caching of web documents. In: Proceedings of the Fifth International World Wide Web Conference on Computer Networks and ISDN Systems, pp. 893–905.
Menasce, D.A., 2003. Web server software architectures. IEEE Internet Computing 7 (6), 78–81.
Nelson, R., 1990. A performance evaluation of a general parallel processing model. In: SIGMETRICS'90: Proceedings of the 1990 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pp. 13–26.
Pai, V., Druschel, P., Zwaenepoel, W., 1999. Flash: an efficient and portable web server. In: Proceedings of the USENIX '99 Annual Technical Conference, June, pp. 199–212.
Park, J.-H., Kanitkar, V., Delis, A., 2001. Logically clustered architectures for networked databases. Distributed and Parallel Databases 10 (2), 161–198.
Pham, D., Asano, S., Bolliger, M., Day, M., Hofstee, H., Johns, C., Kahle, J., Kameyama, A., Keaty, J., Masubuchi, Y., Riley, M., Shippy, D., Stasiak, D., Wang, M., Warnock, J., Weitzel, S., Wendel, D., Yamazaki, T., Yazawa, K., 2005. The design and implementation of a first-generation cell processor. In: IEEE International Solid-State Circuits Conference (ISSCC), pp. 184–186.
Ruan, Y., Pai, V.S., Nahum, E., Tracey, J.M., 2005. Evaluating the impact of simultaneous multithreading on network servers using real hardware. SIGMETRICS Performance Evaluation Review, 315–326.
Sun Microsystems. NFS Server Performance and Tuning Guide for Sun Hardware. http://docs.sun.com/app/docs/doc/806-2195-10.
The Apache Software Foundation, 2003. The Apache HTTP Server Project. http://httpd.apache.org.
TILERA, 2009. TILE64 Processor Family. Available from http://www.tilera.com/products/TILE64.php.
Trivedi, K.S., 2002. Probability and Statistics with Reliability, Queuing, and Computer Science Applications, 2nd ed. John Wiley & Sons, Inc.
Waingold, E., Taylor, M., Srikrishna, D., Sarkar, V., Lee, W., Lee, V., Kim, J., Frank, M., Finch, P., Barua, R., Babb, J., Amarasinghe, S., Agarwal, A., 1997. Baring it all to software: Raw machines. IEEE Computer 30 (9), 86–93.
Welsh, M., Culler, D., Brewer, E., 2001. SEDA: an architecture for well-conditioned, scalable Internet services. In: Proceedings of the Eighteenth Symposium on Operating Systems Principles (SOSP'01), October.
Western Digital Corporation, 2009. Enterprise Hard Drives. http://www.westerndigital.com/en/products/.
Wikipedia, 2009. Opteron. Available from http://www.wikipedia.org/wiki/Opteron.
Gyu Sang Choi received the Ph.D. degree in computer science and engineering from Pennsylvania State University. He was a research staff member at the Samsung Advanced Institute of Technology (SAIT) in Samsung Electronics from 2006 to 2009. Since 2009, he has been with Yeungnam University, where he is currently an assistant professor. His research interests include embedded systems, storage systems, parallel and distributed computing, supercomputing, cluster-based Web servers, and data centers. He is now working on embedded systems and storage systems, while his prior research has been mainly focused on improving the performance of clusters. He is a member of the IEEE.

Chita R. Das received the MSc degree in electrical engineering from the Regional Engineering College, Rourkela, India, in 1981 and the Ph.D. degree in computer science from the Center for Advanced Computer Studies, University of Louisiana at Lafayette, in 1986. Since 1986, he has been with the Pennsylvania State University, where he is currently a professor in the Department of Computer Science and Engineering. His main areas of interest are parallel and distributed computer architectures, cluster computing, mobile computing, Internet quality of service (QoS), multimedia systems, performance evaluation, and fault-tolerant computing. He has served on the editorial boards of the IEEE Transactions on Computers and IEEE Transactions on Parallel and Distributed Systems. He is a fellow of the IEEE and a member of the ACM.