Future Generation Computer Systems 28 (2012) 1110–1120
A weighted-fair-queuing (WFQ)-based dynamic request scheduling approach in a multi-core system

Guohua You, Ying Zhao ∗

College of Information Science and Technology, Beijing University of Chemical Technology, 100029 Beijing, PR China

∗ Corresponding author. E-mail address: [email protected] (G. You).
Article info

Article history: received 18 November 2010; received in revised form 15 April 2011; accepted 13 July 2011; available online 3 November 2011.

Keywords: Dynamic requests; Scheduling; Web server; Multi-core; Threads; Hard affinity
Abstract

A popular website is expected to deal simultaneously with a large number of dynamic requests within a reasonable mean response time. The performance of websites depends mainly on hardware performance and on the processing strategy for dynamic requests. To improve hardware performance, more and more web servers are adopting multi-core CPUs; nevertheless, requests are still commonly scheduled on a first-come-first-served (FCFS) basis. Although FCFS is a reasonable and fair strategy for request sequences, it takes into account neither the distribution of dynamic request service times nor the characteristics of multi-core CPUs. In the present paper, in order to solve these problems, a new dynamic request scheduling approach is proposed. The new approach schedules the dynamic requests according to the distribution of their service times, based on a weighted-fair-queuing (WFQ) system, and exploits the performance of multi-core CPUs by means of the hard affinity method in the O/S. Simulation experiments have been carried out to evaluate the new scheduling approach, and the results show that it can eliminate the ping-pong effect and efficiently reduce the mean response time.

© 2011 Elsevier B.V. All rights reserved.
1. Introduction

With the huge development of the Internet industry, people increasingly rely on the web for daily activities such as e-commerce, online banking, stock trading, reservations, and product merchandising. Consequently, popular web sites are expected to deal simultaneously with large numbers of requests without noticeable degradation of response time. Moreover, dynamic and personalized content delivery has increased sharply with the application of server-side scripting technologies. Web pages incorporating the latest customized information are generated dynamically, but they are not cacheable, so the generation of these dynamic web pages places a heavy load on the web server. Furthermore, with the progress of broadband communication technology, web servers tend to become performance bottlenecks. The performance of web servers depends mainly on the hardware and on the request scheduling strategy. At the same time, with the development of multi-core technology, web servers have mostly adopted multi-core CPUs to improve hardware performance in the past few years. A multi-core system integrates two or more processing cores into one silicon chip [1–3]. In this type of design, every processing core has its own private L1 cache and a shared L2 cache [4,5]. All the processing cores share the main memory and the system bandwidth. Fig. 1 shows the architecture of a multi-core system.
When web servers adopt multi-core CPUs, some new problems arise. There are usually one or more thread pools in the web server, in which threads are usually in the blocked state. When a request arrives, the web server fetches a blocked thread from the thread pool, assigns the request to the thread, and executes the thread. The processing result is saved into the I/O buffer queue and then sent to the network under the scheduling of the I/O management module. This is the procedure of the web server, modeled in the light of queuing network theory [6]; the procedure is shown in Fig. 2. Obviously, the web server is a service application comprising multiple threads. To improve the service performance of multi-core web servers, the scheduling strategy for multiple threads must take into account the characteristics of the multi-core CPUs. If there are multiple threads in the multi-core system, the O/S will usually assign these threads to different cores for reasons of performance and load balance [7]. But in some cases this does not result in good performance. Generally, when a thread is running, the O/S transfers its data from main memory or the L2 cache to the core's private L1 cache. If two threads have shared data and the O/S assigns them to different cores, the O/S will continually transmit the shared data back and forth between the private L1 caches of the cores during the execution of the threads. This is the ping-pong effect, which greatly degrades the performance of a multi-core system. Usually, there are many dynamic requests that ask for the same dynamic page in a multi-core web server.
Fig. 1. The architecture of multi-core CPUs.
When the threads that deal with these dynamic requests are assigned to different cores, the ping-pong effect easily arises. Another factor that influences the performance of a web server is the request scheduling strategy. The requests to a web server are usually of two types. Static requests ask for a file (including media files) from the web server. Dynamic requests ask for some kind of processing from the web server; the processing is usually programmed in the server using a scripting language (JSP, ASP, etc.), and the result is usually a dynamically generated page [8]. Usually, the processing of static requests is simple. The procedure has two steps: reading a file from a disk or cache and transferring it through the network interface. The disk and network resources are the main bottlenecks for this type of web object, and the service time of a static request is proportional to the size of the file [9]. The processing of dynamic requests, however, is complicated. Many dynamic requests include some personalized information (such as location and personal data), so the contents of the dynamic responses cannot be known in advance and must be retrieved from the web servers. They must be generated dynamically each time and cannot be fully cached [10]. In general, many dynamic requests are very simple and do not require intensive server resources (such as summing bill items), but some dynamic requests are very complex and require intensive use of web server resources (such as the content of a secure e-commerce site, which requires Secure Sockets Layer (SSL) processing with intensive CPU use). So the service times of dynamic requests differ greatly, and usually follow a heavy-tailed distribution [8]. In the present paper, we mainly discuss dynamic requests. Many request scheduling strategies have been proposed. Cherkasova [11] proposed using shortest-job first (SJF) scheduling for static requests. In 1998, the α scheduling strategy was developed at HP Labs [11]. Schroeder and Harchol-Balter [12] demonstrated an additional benefit of using the shortest remaining processing time (SRPT) for static requests. Elnikety et al. [13] proposed preferential scheduling for dynamic requests in a transparent fashion.
They addressed the starvation question by using an aging mechanism. Actually, most servers, e.g. Apache [14], employ the first-come-first-served (FCFS) strategy. FCFS is fair and starvation free [15], but it is a traditional system-centric scheduling approach [16], and it considers neither the characteristics of multi-core web servers nor the distribution of dynamic request service times. As a consequence, we propose a new dynamic request scheduling approach for a multi-core web server, which fully considers the distribution of the dynamic request service times and the characteristics of the multi-core web server, and improves the performance of the multi-core web server in an efficient manner.

The remainder of the paper is organized as follows. Section 2 introduces the related work. The new dynamic request scheduling approach is described in Section 3. Section 4 introduces the simulation experiments on the new approach and presents an evaluation of the performance. Finally, we present our conclusions and future work in Section 5.

2. Related work

2.1. Request scheduling strategy

In this section, some proposed request scheduling strategies are introduced. FCFS is a fair strategy, but SJF and SRPT have shorter average waiting times [17]. In addition, α scheduling and weighted fair queuing are also introduced.

(1) First-come-first-served (FCFS): in the FCFS strategy, requests are handled in the sequence of their arrival. FCFS is fair, but newly arriving large requests take a long time to process; as a result, the overall average waiting time increases.

(2) Shortest-job first (SJF): in SJF, requests with small service times have precedence over requests with longer service times. In this way, the overall mean waiting time is reduced. However, SJF needs to know the service time of requests beforehand. Because requests with longer service times have lower priority, there will be starvation on long-term heavily loaded web servers, i.e., when there are many requests for small files in the web server.

(3) Shortest remaining processing time (SRPT) [18]: in SRPT, the request with the least remaining processing time is scheduled and processed with precedence over requests with a longer processing time. Each request is divided into sub-requests, only the first of which is scheduled at the request arrival time. The next sub-request is qualified to be scheduled only if its previous sub-request has been completed. This scheme approximates round-robin scheduling. Like SJF, SRPT unfairly penalizes requests with longer processing times in order to give priority to requests with shorter remaining processing times.

(4) α scheduling: α scheduling is a scheduling strategy that is adjustable between (fair and starvation-free) FCFS and SJF [11].
Fig. 2. Basic model of a web server [6].
Fig. 3. A weight-fair queuing station [15].
By means of a dynamic priority method, in which the priorities of requests increase while they wait, it evades starvation; consequently, it is potentially superior to SJF or SRPT. An incoming request is assigned a priority based on the size of the requested file and is then inserted into a priority queue. The longer a request waits in the queue, the higher its priority. For a request u, its priority P(u) is given as

P(u) = Q + α / S(u),    (1)
where S(u) is the service time of request u. The parameter Q is the value of a clock which starts at zero and increases by S(u) each time a request u has been handled. Obviously, using the above method, the priority P(u) of request u will always increase; however, overflow of the priorities can be eliminated by resetting the clock each time the queue becomes empty, or by assigning the lowest priority to requests once they reach a given threshold. α scheduling is an attractive strategy, but it is difficult to determine an appropriate parameter: if α is too high, α scheduling is close to SJF and may lead to starvation; if α is too low, α scheduling is close to FCFS and the overall mean waiting time is long. The value of α cannot be computed precisely, but must be intuitively estimated or approximately determined.

(5) Weighted fair queuing (WFQ): in this approach, the incoming requests are classified by a classifier on the basis of different criteria: request object type, service time of the request, URL of the request, etc. (see Fig. 3). If the requests are classified based on their URLs, the classifier extracts the URL from each request, and requests with the same URL are assigned to the same queue. Thus the appropriate queues are determined for the requests, and a single request queue becomes multiple request queues after classification. After classification, the request queues are continually handled by the web server with different priorities: each request queue i gets a share Gi of the web server processing capacity given by

Gi = wi / Σ_{j=1}^{I} wj,    (2)

where I is the number of request queues and wi is the weight of request queue i. Khayari proposed class-based interleaving WFQ (CI-WFQ) based on WFQ [15]. Moreover, in this paper, WFQ is the basis of the new dynamic request scheduling approach in a multi-core web server.
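To make the classification and capacity-sharing step concrete, the following sketch classifies requests by URL and computes the share Gi of Eq. (2) for one queue. This is our illustration, not code from the paper; the Request type and field names are assumptions.

```cpp
#include <map>
#include <string>
#include <vector>

// Illustrative WFQ station: one request queue per URL, share per Eq. (2).
struct Request { std::string url; };

struct WfqStation {
    std::map<std::string, std::vector<Request>> queues; // request queue per URL
    std::map<std::string, double> weight;               // w_i of each queue

    // Classifier: requests with the same URL go to the same queue.
    void enqueue(const Request& r) { queues[r.url].push_back(r); }

    // G_i = w_i / sum_{j=1..I} w_j : this queue's share of capacity.
    double share(const std::string& url) const {
        double total = 0.0;
        for (const auto& kv : weight) total += kv.second;
        return total > 0.0 ? weight.at(url) / total : 0.0;
    }
};
```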
2.2. CPU affinity

To overcome the ping-pong effect in a multi-core system, we introduce CPU affinity. CPU affinity is the capacity to bind a process or thread to a specific CPU core, which can provide very efficient use of the processor/core cache [19]. The policy whereby the O/S assigns threads or processes to a suitable core is called soft affinity. Furthermore, developers can also set the affinity of processes or threads explicitly in the O/S, which is called hard affinity. The emergence of multi-core processors poses new demands on schedulers [20]. By means of CPU affinity, some related work on improving the performance of applications in a multi-core system by scheduling threads or processes among different cores has been done. Chonka et al. [21] improved security application efficiency by restricting the security application to one specific core in a multi-core system. Yang et al. [22] integrated monitoring-oriented programming (MOP) with a multi-core system and achieved high performance by means of the O/S hard affinity method. Islam et al. [23] greatly improved the performance of a multi-classifier classification technique in a multi-core framework by running each classifier process in parallel within its dedicated core. Chonka et al. [24] proposed a new ubiquitous multi-core framework and improved security application efficiency greatly. Chonka et al. [25] applied the ubiquitous multi-core (UM) framework to a multimedia application to take advantage of multi-core systems to speed up computations and allow real-time multimedia applications. Feng et al. [26] applied the affinity mechanism to a scale-invariant feature transform (SIFT) and obtained a 2–10% performance improvement compared to the default OS scheduling policy. Terboven et al. [27] combined thread affinity, processor binding, and explicit data migration and obtained a speedup of 25% for 64 threads on a Sun Fire E25K, a satisfying result for this code taking into account Amdahl's law. But the above works did not address web server performance optimization in a multi-core system by means of affinity methods. In this paper, we propose a new dynamic request scheduling approach, which applies the WFQ strategy to the web server and exploits multi-core CPU performance based on the hard affinity method in the O/S.

3. Dynamic request scheduling approach

3.1. Description of the approach

In a website, although there are a lot of dynamic requests, the types of dynamic request are limited. Many dynamic requests differ only in their parameters (for example, http://www.experimentexample.com/Web.aspx?name=tom&age=23 and http://www.experimentexample.com/Web.aspx?name=mike&age=18), but the dynamic page requested is the same one. We consider dynamic requests that request the same page to be of the same type; their processing procedures are similar. Generally, incoming requests are assigned to the threads in a thread pool, which handle these requests. When the same type of dynamic request is assigned to these threads, the threads execute the same code, because the same types of dynamic request ask for the same page.
Fig. 4. Dynamic request scheduling approach.
Thus, these threads have shared data. Furthermore, according to the thread scheduling strategy adopted by the multi-core system, the O/S always tries to assign these threads to different processing cores for load balance [7], so the shared data will be continually transferred back and forth between the L1 caches of the different processing cores; this is the ping-pong effect. However, one can obtain higher performance by assigning these threads with shared data to the same core in a multi-core system rather than allocating them to different processing cores. To solve this problem, we assign the threads with shared data to the same processing core by means of the hard affinity method in the O/S, which can improve the performance of a multi-core system, avoid the ping-pong effect, and speed up the response of the multi-core web server. Because the types of dynamic request in a website are limited, the access frequency and the mean service time of each kind of dynamic request can be obtained from log files, and a lookup table can be created for calculating the weights of the request queues. As shown in Fig. 4, when the dynamic requests arrive at the web server from the TCP queue, they are classified by the classifier on the basis of their URLs, and requests with the same URL are assigned to the same request queue. Thus, multiple HTTP request queues are established. The weight of each request queue can be calculated on the basis of the access frequency and the mean service time of this kind of dynamic request, obtained from the lookup table; the weight of a request queue is then used to calculate the share of CPU capacity occupied by that dynamic request queue. In a multi-threaded system, CPU time is allocated to every thread equally during a CPU cycle, since threads are scheduled based on the round-robin strategy. Therefore, the number of threads represents the proportion of CPU capacity, and the number of threads allocated to each request queue can be set according to the weights of the request queues. In order to improve performance and avoid the ping-pong effect, all threads that process the same request queue should be assigned to the same processing core. Moreover, for the sake of load balance between different cores, we calculate the thread allocation strategy by means of a genetic algorithm. After the thread allocation strategy is decided, the dynamic requests waiting in the request queues are assigned to these threads one by one, and the threads that have received dynamic requests begin to execute. After these threads complete, the results of execution are the newly generated dynamic pages, which are sent to the I/O buffer and scheduled onto the network by the I/O management module; these are the responses.
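The sketch below summarizes this pipeline in code form. It is only a schematic reading of the approach; the type names and fields are our own illustrative assumptions, not the authors' implementation.

```cpp
#include <string>
#include <vector>

struct Request { std::string url; };

struct RequestQueue {
    std::string url;               // identifies one dynamic page type
    double weight;                 // W_i from the log-file lookup table (Eq. (3))
    int threadCount;               // lambda_i service threads (Eq. (4))
    int core;                      // core chosen by the GA of Section 3.2.3
    std::vector<Request> pending;  // waiting dynamic requests
};

// Classify an incoming request into its queue. The threadCount service
// threads of each queue are all pinned to queue.core with hard affinity,
// so threads sharing the page's data never migrate between L1 caches.
void classify(const Request& r, std::vector<RequestQueue>& queues) {
    for (auto& q : queues)
        if (q.url == r.url) { q.pending.push_back(r); return; }
    queues.push_back({r.url, 0.0, 1, 0, {r}});  // unseen page type: minimal queue
}
```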
3.2. Calculation of the scheduling parameters

3.2.1. Weight of a dynamic request queue

The dynamic requests in the same request queue are of the same type. After obtaining the number of visits of each kind of request in a specified time interval from log files, we can calculate the percentage Ci of the access count of request queue i in the total access count of all request queues. Likewise, we can calculate the average service time Ti of the requests in request queue i. Thus, the weight Wi of request queue i can be calculated as

Wi = Ci Ti,    (3)

where Σ_{i=1}^{M} Ci = 1 and M is the total number of request queues. The access frequency and the mean service time are the two main factors that influence the CPU load. Therefore, the weight of a request queue indicates the impact of that request queue on the CPU load: the greater the weight, the larger the impact on the CPU load.

3.2.2. Number of threads for a request queue

According to the weight Wi of request queue i, the number λi of threads used to handle request queue i can be calculated with the following formula:

λi = (Wi / Σ_{j=1}^{M} Wj) H,    (4)
where H is the total number of threads in the thread pool and M is the total number of request queues. λi is the number of threads used to handle request queue i; it is rounded to an integer. Sometimes the value of λi is close to zero; therefore, we set the minimum value of λi to one so that all kinds of request can be handled. The weights of the request queues are decided by the access frequency and the mean service time. A request queue with a greater weight leads to a heavier CPU load, so more threads serve it.
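A minimal sketch of this calculation follows (our illustration, assuming the access shares Ci and mean service times Ti have already been extracted from the log files):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Eq. (3): W_i = C_i * T_i; Eq. (4): lambda_i = W_i / sum_j W_j * H,
// rounded to an integer and clamped to at least one thread per queue.
std::vector<int> allocateThreads(const std::vector<double>& C,  // access shares, sum to 1
                                 const std::vector<double>& T,  // mean service times (ms)
                                 int H)                         // thread pool size
{
    std::vector<double> W(C.size());
    double sumW = 0.0;
    for (std::size_t i = 0; i < C.size(); ++i) sumW += (W[i] = C[i] * T[i]);

    std::vector<int> lambda(C.size());
    for (std::size_t i = 0; i < C.size(); ++i) {
        int n = static_cast<int>(W[i] / sumW * H + 0.5);  // rounding
        lambda[i] = std::max(1, n);                       // minimum of one thread
    }
    return lambda;
}
```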
3.2.3. Load balance between cores

In order to avoid the ping-pong effect between threads, the threads serving the same request queue should be assigned to the same processing core as a whole. Moreover, the numbers of threads that serve different request queues are quite different; after these thread groups are allocated to the processing cores as wholes, the numbers of threads on different cores may differ greatly too. This gives rise to a new problem: load balance between cores. To maintain a load balance between cores and avoid the ping-pong effect between threads, we must allocate the threads serving the same request queue to the same processing core while keeping the number of threads on the different cores evenly distributed. We can solve this problem by means of a genetic algorithm. A genetic algorithm (GA) is a stochastic optimization method that works on the principle of evolution via natural selection [28]. If the CPU has N cores and the number of request queues is M, we define the chromosome of the GA as

R1 R2 · · · Rj · · · RM
where Rj is an integer and 0 ≤ Rj ≤ N − 1. Rj is a gene in the chromosome, and it represents the serial number of the core to which the threads serving request queue j are assigned. For example, if Rj = 1, then the threads that serve request queue j are assigned to core 1. So a chromosome stands for a thread assignment solution. For brevity, we define the threads that serve the same request queue as a service thread group, denoted STG. If an STG serves request queue j, it is denoted STG_j, and the number of threads in STG_j is λj, calculated according to (4). From the chromosome, we can acquire all the STGs that are allocated to core i. If we define the number of STGs on core i as Bi, then we can enumerate all the STGs on core i: STG_{A1}, STG_{A2}, . . . , STG_{Ak}, . . . , STG_{A_{Bi}}, 1 ≤ k ≤ Bi. STG_{Ak} is the service thread group that serves request queue Ak, and the number of threads in STG_{Ak} is λ_{Ak}, calculated based on (4). Here, k is the serial number of STG_{Ak} among the STGs on core i, and Ak is the serial number of STG_{Ak} among the STGs on all the cores. If the total number of threads on core i is Xi, then Xi can be calculated by the following formula:

Xi = Σ_{k=1}^{Bi} λ_{Ak}.    (5)
So we can get the number of threads on every core: X1, X2, . . . , XN. We define D(X) as the variance of X1, X2, . . . , XN. If D(X) is large, the numbers of threads on different cores differ greatly; so, to keep the load balance between cores, a lower value of D(X) is favorable. Therefore, we define the fitness function of the GA as

f(e) = 1 / (D(X) + 1),    (6)
where D(X) is the variance of X1, X2, . . . , XN and e is a chromosome. Because a lower value of D(X) helps keep the load balance between cores, and D(X) may sometimes be zero, we use the reciprocal of D(X) + 1 as the fitness function. As a result, a larger value of f(e) means a better load balance between cores; the value of f(e) is greater than 0 and less than or equal to 1. According to Fig. 5, the GA has the following procedure.

(1) Initial population: a population is a collection of chromosomes. The population size L is typically problem dependent and can be determined experimentally. The initial population is usually generated randomly: we generate L chromosomes by assigning a random integer, ranging from 0 to N − 1, to every gene in each chromosome.

(2) Calculation of the fitness value: we use the fitness function, given by (6), and the method above to calculate the fitness value of every chromosome in the population.

(3) Selection: we select the fitter chromosomes by means of the roulette-wheel method based on the fitness value of every chromosome. The greater the fitness value of a chromosome, the larger its probability of being chosen. We repeat the selection operation as many times as there are chromosomes.

(4) Crossover: we randomly choose a couple of chromosomes from the population. For the selected chromosomes, we decide whether to perform crossover based on the crossover probability. If crossover is performed, it generates a new couple of chromosomes by exchanging portions of the two old chromosomes. We repeat the crossover operation until L chromosomes are generated.

(5) Mutation: we randomly choose a chromosome from the population. Each gene of the chromosome is allowed a random change with very small probability. If a mutation happens, the gene is assigned a random integer, ranging from 0 to N − 1 and different from the original value of the gene.

(6) Termination condition: in the GA, the generational processes (2)–(5) are repeated. At each iteration, the chromosome with the largest fitness value is recorded. If the largest fitness value does not change over five iterations, the iterations are ended. When the iterations come to an end, we take the recorded chromosome with the largest fitness value; from this chromosome, we obtain the best thread assignment solution, which maintains a load balance between the cores.
Fig. 5. The calculation procedure of thread allocation solution based on the genetic algorithm.
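The condensed sketch below implements the procedure of Fig. 5 under stated assumptions: the population size and the crossover and mutation probabilities are illustrative values (the paper leaves them to be determined experimentally), and the mutation step does not enforce that the new gene value differ from the old one.

```cpp
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

using Chromosome = std::vector<int>;  // gene R_j = core of request queue j

// Eq. (6): f(e) = 1 / (D(X) + 1), with D(X) the variance of the per-core
// thread counts X_1..X_N computed from Eq. (5).
double fitness(const Chromosome& e, const std::vector<int>& lambda, int N) {
    std::vector<double> X(N, 0.0);
    for (std::size_t j = 0; j < e.size(); ++j) X[e[j]] += lambda[j];  // Eq. (5)
    double mean = std::accumulate(X.begin(), X.end(), 0.0) / N;
    double var = 0.0;
    for (double x : X) var += (x - mean) * (x - mean);
    return 1.0 / (var / N + 1.0);
}

Chromosome evolve(const std::vector<int>& lambda, int N, std::mt19937& rng) {
    const int L = 50;                                     // assumed population size
    std::uniform_int_distribution<int> gene(0, N - 1);
    std::uniform_real_distribution<double> u(0.0, 1.0);
    std::vector<Chromosome> pop(L, Chromosome(lambda.size()));
    for (auto& c : pop) for (auto& g : c) g = gene(rng);  // (1) initial population

    Chromosome best = pop[0];
    for (int stale = 0; stale < 5; ) {                    // (6) stop after 5 stale rounds
        std::vector<double> f(L);                         // (2) fitness values
        double sum = 0.0;
        for (int i = 0; i < L; ++i) sum += (f[i] = fitness(pop[i], lambda, N));

        std::vector<Chromosome> next;                     // (3) roulette-wheel selection
        for (int i = 0; i < L; ++i) {
            double r = u(rng) * sum, acc = 0.0;
            int k = 0;
            while (k < L - 1 && (acc += f[k]) < r) ++k;
            next.push_back(pop[k]);
        }
        if (lambda.size() > 1) {                          // (4) one-point crossover
            std::uniform_int_distribution<std::size_t> cut(1, lambda.size() - 1);
            for (int i = 0; i + 1 < L; i += 2)
                if (u(rng) < 0.8)                         // assumed crossover probability
                    for (std::size_t j = cut(rng); j < lambda.size(); ++j)
                        std::swap(next[i][j], next[i + 1][j]);
        }
        std::uniform_int_distribution<std::size_t> pos(0, lambda.size() - 1);
        for (auto& c : next)                              // (5) mutation
            if (u(rng) < 0.02) c[pos(rng)] = gene(rng);   // assumed mutation probability
        pop = std::move(next);

        ++stale;                                          // record the best chromosome
        for (const auto& c : pop)
            if (fitness(c, lambda, N) > fitness(best, lambda, N)) { best = c; stale = 0; }
    }
    return best;  // thread-to-core assignment with balanced load
}
```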
4. Experiment and evaluation

4.1. Experiment setup

4.1.1. Dynamic request web server (DRWS)

To validate the new dynamic request scheduling approach, we developed a new web server simulation program, called DRWS. DRWS is a single-process web server, which incorporates a thread pool and runs on Windows. For simplicity, it only handles ''GET'' dynamic requests. The size of the thread pool can be customized; we set the default value to 200. Generally, the threads are blocked. We created ten DLL files to simulate ten requested dynamic pages, which include different code to accomplish different functions. When a request arrives, DRWS fetches a thread from the thread pool to handle the request. The thread loads and executes a specific DLL file on the basis of the URL and parameters of the request, which simulates the generation of a dynamic page. The execution times of the DLL files differ because the generation times of the dynamic pages differ. Also, many requests may arrive for the same DLL file; we can assign these requests to the same core by the hard affinity method to avoid the ping-pong effect. In the Windows O/S, we can use the application programming interface (API) SetThreadAffinityMask to set the affinity of threads. The API has two parameters: hThread and dwThreadAffinityMask. The former is the handle to the thread whose affinity mask is to be set, while the latter is the affinity mask for the thread. A thread affinity mask is a bit vector in which each bit represents a logical processor that the thread is allowed to run on.
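For example, pinning a thread to one core reduces to building the right mask. The fragment below uses only the documented Win32 call; the helper name is ours:

```cpp
#include <windows.h>

// Pin a thread to a single core using hard affinity. Bit k of the mask set
// to 1 means the thread may run on logical processor k, so a mask with
// exactly one bit set binds the thread to exactly one core.
void pinThreadToCore(HANDLE hThread, int core) {
    DWORD_PTR mask = static_cast<DWORD_PTR>(1) << core;  // e.g. core 2 -> 0b100
    SetThreadAffinityMask(hThread, mask);
}
```

In DRWS, all λi threads serving request queue i would be pinned with the same mask, so the whole service thread group stays on the core chosen by the GA.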
Fig. 6. The control flow of DRWS.
Moreover, DRWS classifies and schedules the incoming requests based on the WFQ strategy, and tries to allocate the threads with shared data to the same core by means of affinity methods in the O/S. Fig. 6 shows the control flow of DRWS. When a request arrives, it can be rejected directly, or after a time-out, if the request queues are full. Otherwise, it is classified by URL and enters the appropriate request queue. If there are no free threads that match the request's type, it waits in the queue. When free threads become available, the request is assigned to a thread, and then the corresponding DLL file is loaded based on the URL and parameters of the request. After the execution of the DLL file, the result is sent to the client; this is the response.

4.1.2. FCFS and SJF

To compare with the dynamic request scheduling approach, we also developed programs based on FCFS and SJF, respectively. The FCFS strategy is simple. When a request arrives, it is queued if the request queue is not full, and then sent to the thread pool in FCFS order. If there are free threads in the thread pool, it is assigned to a thread; the thread loads the corresponding DLL file based on the URL and executes it. The result is sent to the network; this is the response. SJF is a little more complicated. When a request arrives, it is inserted into the request queue, according to its service time, in SJF order if the request queue is not full. The service time of a dynamic request is required to be known in advance; in practice, we can obtain the average service time of each kind of dynamic request from log files in an offline manner [29]. When the request is assigned to a thread, the corresponding DLL file is loaded and executed. After execution of the DLL file, the result is sent to the network; this is the response.

4.1.3. Dynamic web page simulation

In order to simulate the generation of dynamic pages, we created ten DLL files instead of ten dynamic pages. The DLL files are identified and loaded according to the URLs and parameters of the requests. When a request is assigned to a thread, the corresponding DLL file is loaded and executed so as to simulate the generation procedure of the dynamic page. The functions of these DLL files are distinct: some connect to a database and retrieve data, while others just execute a simple calculation, so the execution times of these DLL files differ accordingly. All these operations simulate the generation procedure of dynamic web pages. Furthermore, the threads that execute the same dynamic web page have shared data, so we set shared data in these DLL files. The default size of the shared data is 2 KB.
4.1.4. Sending-requests module and arrival process

To simulate the visiting behavior of users to dynamic websites, we designed a sending-requests module, which can automatically send requests to DRWS, or to the programs with the FCFS or SJF strategies, following a specific arrival process. In [30,31], which study the impact of different scheduling strategies on the performance of a queuing station, it is assumed that the arrival process is a Poisson process. However, studies of world-wide-web traffic have shown that the arrival process cannot be described by a Poisson process [32–34], and in [8] it is shown that the number of request arrivals per second over a long period clearly reflects a heavy-tailed distribution. In our experiment, we adopted two arrival processes: a Poisson process and a heavy-tailed distribution. We compare the experimental results of the two kinds of arrival mode in Section 4.3.3; the default arrival process is the heavy-tailed distribution. The traffic that reflects a heavy-tailed distribution is generated by the on–off heavy-tailed model, which models traffic as a superposition of a large number of on–off sources with a Pareto distribution of on and/or off periods. The Pareto distribution is a simple heavy-tailed distribution. If the traffic is generated by a Poisson process, the time between successive packets follows a negative exponential distribution.

4.2. Calculation of the parameters

4.2.1. Access frequency and mean service time

Actually, the access frequency of the dynamic web pages could be calculated by adding a counter to each dynamic request queue, which accumulates the number of visits for the same dynamic web page. In our experiment, for simplicity, we manually set the access frequency of the dynamic pages on the basis of the access frequency distribution at the Beijing University of Chemical Technology (BUCT) web site, so that it is as close to reality as possible. Moreover, we also need to know the service time of each kind of dynamic request beforehand. We can acquire the average service time of each type of request from log files in an offline manner; for more details, see [29]. In our experiment, we set the execution times of the ten DLL files on the basis of the average service times of ten kinds of dynamic request, which were obtained from the log files of the BUCT website. In practice, the execution time of a DLL file is indeed approximately the mean service time of the corresponding dynamic requests. The access frequencies and the mean service times are shown in Table 1. According to Table 1, we can draw the service time distribution histogram (shown in Fig. 7). Furthermore, we drew the Poisson, Pareto, and negative exponential fitted curves of the service time distribution histogram with Origin 8.0 (see Fig. 7). As mentioned above, the Pareto distribution is a heavy-tailed distribution. We can see that the fitted curve of the Pareto distribution fits the service time distribution very well, as its adjusted R-squared value is 0.99505.
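As an illustration of how such traffic can be generated (our sketch, not the paper's code; the shape value is an assumption), Pareto-distributed on/off periods can be drawn by inverse-transform sampling, while Poisson arrivals use exponential inter-arrival gaps:

```cpp
#include <cmath>
#include <random>

// Pareto period length for an on-off source, by inverse-transform sampling.
// For Pareto on/off periods with shape a (1 < a < 2), the Hurst parameter
// of the aggregate traffic is commonly taken as H = (3 - a) / 2, so e.g.
// a = 1.6 corresponds to H = 0.7; a is tuned per queue as in Table 2.
double paretoPeriod(std::mt19937& rng, double shape, double scale) {
    std::uniform_real_distribution<double> u(0.0, 1.0);
    return scale / std::pow(1.0 - u(rng), 1.0 / shape);  // 1 - u(rng) is in (0, 1]
}

// For the Poisson arrival process, inter-arrival times are exponential.
double poissonGap(std::mt19937& rng, double requestsPerSecond) {
    return std::exponential_distribution<double>(requestsPerSecond)(rng);
}
```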
Table 1
Calculation of scheduling parameters.

Request queue | Access frequency (Hz) | Percentage of access frequency (%) | Mean service time (ms) | Weight of request queue | Percentage of weight (%) | Number of threads serving each request queue | Serial number of core that threads are allocated to
1 | 232 | 6.12 | 12 | 0.73 | 1.74 | 3 | 2
2 | 465 | 12.25 | 45 | 5.51 | 13.09 | 26 | 2
3 | 904 | 23.86 | 25 | 5.97 | 14.16 | 28 | 1
4 | 484 | 12.78 | 23 | 2.94 | 6.98 | 14 | 1
5 | 224 | 5.91 | 56 | 3.31 | 7.86 | 16 | 2
6 | 174 | 4.59 | 239 | 10.98 | 26.06 | 52 | 0
7 | 186 | 4.91 | 16 | 0.79 | 1.87 | 4 | 2
8 | 46 | 1.21 | 87 | 1.06 | 2.51 | 5 | 1
9 | 628 | 16.58 | 32 | 5.31 | 12.59 | 25 | 3
10 | 446 | 11.77 | 47 | 5.53 | 13.14 | 26 | 3
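As a worked check on the table, consider request queue 2: by (3), its weight is W2 = 0.1225 × 45 ≈ 5.51; the ten weights sum to about 42.13, so its percentage of weight is 5.51/42.13 ≈ 13.09%; and by (4), with H = 200 threads, it receives 0.1309 × 200 ≈ 26 threads, as listed above.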
Fig. 7. Service time distribution histogram.

Fig. 8. The distribution of the response times with different scheduling strategies.
If the departure process of the requests were a Poisson process, the service times of the requests would follow a negative exponential distribution. So we also drew the fitted curve according to a negative exponential distribution; its adjusted R-squared value is 0.98777. Likewise, the fitted curve based on a Poisson distribution has an adjusted R-squared value of 0.75627. From Fig. 7 and the adjusted R-squared values of the three fitted curves, the fitted curve of the Pareto distribution is closest to the distribution of the service times of the requests. So the service time distribution of the requests clearly reflects a heavy-tailed distribution, and the service times of the dynamic requests are indeed in accordance with a heavy-tailed distribution as well [8].

4.2.2. Weight

After obtaining the access frequency and the mean service time, we can calculate the weight of each request queue through (3). The results are also given in Table 1. We then calculate the percentage of each weight, from which we get the number of service threads for each request queue.

4.2.3. Thread number of each request queue

In the thread pool, the default number of threads is 200, and according to (4), we can divide the threads into several parts with different numbers, which provide service to the requests in the corresponding request queues. Under certain circumstances, the calculated number of threads may be close to zero; to ensure that such requests can still be handled, we set the minimum number of corresponding threads to one.

4.2.4. Load balance between cores

In order to avoid the ping-pong effect in a multi-core system, the same type of threads must be assigned to the same core as a whole. Furthermore, we must maintain a load balance between the cores.
So we used the GA to assign the threads, as described in Section 3.2.3. The results are shown in Table 1; for example, the threads of the third, fourth, and eighth request queues are assigned to core 1.

4.3. Result evaluation

4.3.1. Response time distribution

We measured the response time of each dynamic request under the three kinds of scheduling strategy and analyzed the response time distribution of the dynamic requests. The results are shown in Fig. 8. From Fig. 8, the response times of most requests scheduled by SJF and DRWS lie within the first 1000 ms, which means that SJF and DRWS have the shorter mean response times. The response time distribution of the requests under the FCFS strategy is more even than under SJF or DRWS, which agrees with the fairness of FCFS and means that FCFS has a longer mean response time. In the first 500 ms, SJF serves more requests than DRWS, which reveals the characteristic of the SJF strategy that requests with shorter service times have higher priority. But SJF also has more requests over 3000 ms than DRWS or FCFS, which shows that starvation may occur under SJF. Moreover, DRWS has more requests than FCFS and SJF in the short response time area (less than 1000 ms), and fewer requests than SJF in the long response time area (over 3000 ms), which shows that the proposed dynamic request scheduling approach has a shorter mean response time and avoids starvation.

4.3.2. Changing the number of threads

We measured and calculated the mean response time and the percentage of dropped requests while changing the number of threads in the multi-core system. The results are shown in Figs. 9 and 11. From Fig. 9, we can see that the mean response times of the three scheduling strategies all decrease as the number of threads increases.
Fig. 9. The mean response time as function of the number of threads for the three scheduling strategies.
Fig. 11. The percentage of dropped requests as function of the number of threads for the three scheduling strategies.
Under the FCFS strategy, requests with shorter service times will have longer waiting times, which leads to a longer mean response time for the whole queue. Moreover, FCFS cannot eliminate the ping-pong effect, so the mean response time of FCFS declines slowly as threads are added. Therefore, FCFS always has a longer mean response time than DRWS and SJF. The curves of SJF and DRWS are more complicated. As Fig. 10 shows, SJF is a single shared-queue model in which the threads indistinguishably serve all the requests from the request queue, whereas DRWS is a multiple separated-queue model in which each type of
request is only handled by the corresponding threads. The single shared-queue model has better performance than the multiple separated-queue model according to queuing theory [35], so SJF should have a shorter mean response time than DRWS. However, considering the characteristics of SJF, starvation occurs when not enough threads are available or when many requests with short service times exist, so requests with long service times have to wait in the request queue. When the corresponding responses are not received for a long time, the clients in our experiment send these requests again. This leads to a greater possibility of congestion and a longer mean response time in SJF.
Fig. 10. The schematic diagrams of SJF (a) and DRWS (b).
Therefore, starvation in SJF could increase the mean response time, while more threads reduce the congestion and starvation of SJF [36]. At the two low-thread points of 50 and 100 threads (see Fig. 9), the notable starvation caused by the deficient threads in SJF gives rise to an increase in the mean response time of SJF. The effect of starvation in SJF offsets the fact that DRWS should have a longer mean response time than SJF based on queuing theory; as a result, DRWS and SJF are close in mean response time at these two points on the whole. Furthermore, because the starvation in SJF is lightened as threads increase, the curve of SJF is a little steeper than that of DRWS from 50 to 100 threads. Now consider the number of threads continuing to increase. The starvation in SJF is completely removed when enough threads are available (at 150 threads). After the effect of starvation is eliminated, SJF should have a shorter mean response time than DRWS on the basis of queuing theory, which explains why DRWS has a higher response time than SJF at 150 threads. At 200 threads, the threads of SJF and DRWS continue to increase, and the processing ability of SJF improves too. However, because the ping-pong effect begins to appear with the increase in threads, the mean response time of SJF drops slowly. Meanwhile, each request queue of DRWS also improves further, and DRWS eliminates the ping-pong effect, so the mean response time of DRWS declines more sharply. Therefore, the mean response time of DRWS becomes close to that of SJF again. Over 250 threads, with further increases in threads, the ping-pong effect in SJF and FCFS becomes notable, which explains why their mean response times decline more slowly. Nevertheless, because DRWS eliminates the ping-pong effect, its mean response time declines more rapidly than that of SJF and FCFS, so DRWS has the smaller response time for large numbers of threads. With the increase in threads, the congestion of the three scheduling strategies declines, and the fluctuation of the mean response time for the three strategies decreases too; consequently, the errors of the mean response times of SJF, FCFS, and DRWS decrease (see Fig. 9). From Fig. 11, with the increase in the number of threads, the percentages of dropped requests all decline quickly. When the number of threads is over 300, the percentage of dropped requests is close to zero, the impact of the scheduling strategies on the percentage of dropped requests gradually vanishes, and the variation of the percentage of dropped requests of DRWS becomes close to that of SJF.

4.3.3. Different arrival processes

In this experiment, we adopted two kinds of arrival process: a Poisson process and a heavy-tailed distribution. We measured and calculated the mean response time of each request queue in DRWS, and show the results in Fig. 12. The on–off heavy-tailed model models traffic as a superposition of a large number of on–off sources with heavy-tailed on and/or off periods [37]. It has the characteristic of long-range dependence (LRD) and the feature of self-similarity. The degree of self-similarity can be described by the Hurst parameter H (0.5 < H < 1); the self-similarity becomes obvious and the number of bursts increases as the Hurst parameter increases [38]. Many studies have shown that self-similar network traffic amplifies the queuing delay [39–41].
So a larger Hurst parameter will enlarge the queue length and increase the mean response time. A Poisson process has the characteristic of short-range dependence (SRD), and its traffic bursts usually happen on a small time scale. However, traffic bursts of the on–off heavy-tailed model happen on both small and large time scales.
Fig. 12. Mean response time of each request queue using a Poisson process and heavy-tailed distribution, respectively.
Consequently, the burst probability of the on–off heavy-tailed model is usually larger than that of the Poisson process when the self-similarity of traffic is obvious. According to queuing theory, more bursts lead to a longer queuing delay and a longer mean response time. Furthermore, Gospodinova and Todorov [42] proved that, as the Hurst parameter increases, the queue length of a queuing system with LRD input (an SSM/M/1 queuing system) is much higher than that of the classical M/M/1 model with SRD input. To be as close as possible to reality, we adopted different theoretical values of the Hurst parameter for different request queues by adjusting the shape parameters of the Pareto distribution in our experiment (see Table 2).

Table 2
Hurst parameter for different request queues.

Request queue | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
Hurst parameter | 0.71 | 0.73 | 0.85 | 0.89 | 0.54 | 0.72 | 0.92 | 0.88 | 0.61 | 0.83

From Table 2, we can see that most request queues (request queues 1, 2, 3, 4, 6, 7, 8, and 10) have higher Hurst parameters (H > 0.7). Therefore, for these request queues, the on–off heavy-tailed model has the longer mean response time, compared to that of the Poisson process (see Fig. 12). But request queues 5 and 9 have smaller Hurst parameters (H < 0.7) than the other request queues. When the Hurst parameter is small, the self-similarity is not obvious and the number of bursts is lower, so request queues 5 and 9 have fewer traffic bursts than the other request queues. Fewer bursts lead to a shorter mean response time, so request queues 5 and 9 have shorter mean response times than the other request queues (see Fig. 12). Moreover, the network traffic of request queues 5 and 9 generated by the Poisson process has the characteristics of SRD as well, so traffic bursts also exist in it. When the Poisson process has more bursts than the on–off heavy-tailed model for request queues 5 and 9, the former has the longer mean response time (see Fig. 12).

4.3.4. Changing the shared data

In our experiment, the DLL files have shared data to simulate dynamic web pages with shared data. We changed the size of the shared data in the ten DLL files and measured the mean response time for the three scheduling strategies; the results are shown in Fig. 13.

Fig. 13. Mean response time for the three scheduling strategies when the size of the shared data is changed.

From Fig. 13, we can see that the mean response times of FCFS and SJF increase as the size of the shared data increases, whereas the mean response time of DRWS changes little. The reason is that the new dynamic request scheduling approach removes the ping-pong effect; therefore, when the size of the shared data increases, the impact on the mean response time of DRWS is slight.

5. Conclusions and future work

In this paper, we have studied several scheduling algorithms in a web server and the application of the hard affinity method
in a multi-core system. In order to avoid the ping-pong effect and improve the performance of handling dynamic requests in a multi-core web server, we propose a new dynamic request scheduling approach, which applies the affinity method to the multi-core web server based on the WFQ request scheduling algorithm and tries to maintain a load balance between the cores in the multi-core system. We have described the principle of the new dynamic request scheduling approach and have given the calculation formulas for the scheduling parameters. Furthermore, we developed DRWS, a simulation program for a web server based on the new dynamic request scheduling approach, and performed simulation experiments with it. We analyzed the key performance indices and compared them with those of the SJF and FCFS strategies. The new dynamic request scheduling approach is close to SJF in mean response time and in the percentage of dropped requests, and it avoids both the ping-pong effect and starvation. As future work, we want to improve the performance of the approach and enable the scheduling parameters to be adjusted automatically. Moreover, we want to improve the experimental method to obtain more accurate results.

Acknowledgment

This paper has been partially supported by the National Grand Fundamental Research 973 Program of China (No. 2011CB706900).

References

[1] Multi-core from Intel: products and platforms. http://www.intel.com/multicore/products.htm.
[2] AMD multi-core products. http://multicore.amd.com/en/Products.
[3] P. Kongetira, K. Aingaran, K. Olukotun, Niagara: a 32-way multithreaded Sparc processor, IEEE Micro 25 (2005) 21–29.
[4] P.M. Gorder, Multicore processors for science and engineering, Computing in Science & Engineering 9 (2007) 3–7.
[5] J.M. Calandrino, J.H. Anderson, D.P. Baumberger, A hybrid real-time scheduling approach for large-scale multicore platforms, in: ECRTS'07: 19th Euromicro Conference on Real-Time Systems, IEEE, Pisa, Italy, 2007, pp. 247–258.
[6] R.D. van der Mei, R. Hariharan, P.K. Reeser, Web server performance modeling, Telecommunication Systems 16 (2001) 361–378.
[7] S.B. Siddha, Multi-core and Linux kernel. http://oss.intel.com/pdf/mclinux.pdf.
[8] E. Hernández-Orallo, J. Vila-Carbó, Web server performance analysis using histogram workload models, Computer Networks 53 (2009) 2727–2739.
[9] Q. Zhang, A. Riska, W. Sun, E. Smirni, G. Ciardo, Workload-aware load balancing for clustered web servers, IEEE Transactions on Parallel and Distributed Systems 16 (2005) 219–233.
[10] W. van der Weij, S. Bhulai, R. van der Mei, Dynamic thread assignment in web server performance optimization, Performance Evaluation 66 (2009) 301–310.
[11] L. Cherkasova, Scheduling strategy to improve response time for web applications, in: Proceedings on High-Performance Computing and Networking, in: LNCS, vol. 1401, Amsterdam, Holland, 1998, pp. 305–314.
[12] B. Schroeder, M. Harchol-Balter, Web servers under overload: how scheduling can help, ACM Transactions on Internet Technology 6 (2006) 20–52.
[13] S. Elnikety, E. Nahum, J. Tracey, W. Zwaenepoel, A method for transparent admission control and request scheduling in e-commerce web sites, in: Proceedings of the 13th International World Wide Web Conference, ACM, New York, USA, 2004, pp. 276–286.
[14] Apache. The Apache Software Foundation. http://www.apache.org.
[15] R. El Abdouni Khayari, Class-based weighted fair queueing: validation and comparison by trace-driven simulation, International Journal of Communication Systems 18 (2005) 975–994.
[16] R.V. Bossche, K. Vanmechelen, J. Broeckhove, An evaluation of the benefits of fine-grained value-based scheduling on general purpose clusters, Future Generation Computer Systems 27 (2011) 1–9.
[17] M. Harchol-Balter, B. Schroeder, N. Bansal, M. Agrawal, Size-based scheduling to improve web performance, ACM Transactions on Computer Systems 21 (2) (2003) 207–233.
[18] M.E. Crovella, R. Frangioso, M. Harchol-Balter, Connection scheduling in web servers, in: USITS'99: Proceedings of the 2nd USENIX Symposium on Internet Technologies and Systems, USENIX Association, Boulder, Colorado, USA, 1999, pp. 243–254.
[19] R. Bolla, R. Bruschi, PC-based software routers: high performance and application service support, in: PRESTO'08: Workshop on Programmable Routers for Extensible Services of Tomorrow, ACM, Seattle, Washington, USA, 2008, pp. 27–32.
[20] Z.C. Papazachos, H.D. Karatza, Gang scheduling in multi-core clusters, Future Generation Computer Systems 27 (2011) 1153–1165.
[21] A. Chonka, W. Zhou, K. Knapp, Y. Xiang, Protecting information systems from DDoS attack using multi-core methodology, in: Proceedings of the IEEE 8th International Conference on Computer and Information Technology, IEEE, Sydney, Australia, 2008, pp. 270–275.
[22] Y. Lu, J. Tang, J. Zhao, X. Li, A case study for monitoring-oriented programming in multi-core architecture, in: IWMSE'08: Proceedings of the 1st International Workshop on Multicore Software Engineering, ACM, Leipzig, Germany, 2008, pp. 47–52.
[23] R. Islam, W. Zhou, Y. Xiang, A.N. Mahmood, Spam filtering for network traffic security on a multi-core environment, Concurrency and Computation: Practice and Experience 21 (2009) 1307–1320.
[24] A. Chonka, S.K. Chong, W. Zhou, Y. Xiang, Multi-core defense system (MSDS) for protecting computer infrastructure against DDoS attacks, in: Proceedings of the 2008 Ninth International Conference on Parallel and Distributed Computing, IEEE, Dunedin, New Zealand, 2008, pp. 503–508.
[25] A. Chonka, W. Zhou, L. Ngo, Y. Xiang, Ubiquitous multicore (UM) methodology for multimedia, International Journal of Multimedia and Ubiquitous Engineering 4 (2009) 145–156.
[26] H. Feng, E. Li, Y. Chen, Y. Zhang, Parallelization and characterization of SIFT on multi-core systems, in: IISWC 2008: IEEE International Symposium on Workload Characterization, IEEE, Seattle, USA, 2008, pp. 14–23.
[27] C. Terboven, D. an Mey, D. Schmidl, H. Jin, T. Reichstein, Data and thread affinity in OpenMP programs, in: MAW'08: Proceedings of the 2008 Workshop on Memory Access on Future Processors, ACM, New York, USA, 2008, pp. 377–384.
[28] L. Ai, M. Tang, C. Fidge, Partitioning composite web services for decentralized execution using a genetic algorithm, Future Generation Computer Systems 27 (2011) 157–172.
[29] S. Sharifian, S.A. Motamedi, M.K. Akbari, A content-based load balancing algorithm with admission control for cluster web servers, Future Generation Computer Systems 24 (2008) 775–787.
[30] S.C. Borst, O.J. Boxma, R. Núñez-Queija, Heavy tails: the effect of the service discipline, in: Computer Performance Evaluation: Modelling Techniques and Tools, LNCS, vol. 2324, 2002, pp. 1–30.
[31] S.C. Borst, O.J. Boxma, R. Núñez-Queija, A.P. Zwart, The impact of the service discipline on delay asymptotics, Performance Evaluation 54 (2) (2003) 175–206.
[32] M. Crovella, A. Bestavros, Self-similarity in world wide web traffic: evidence and possible causes, IEEE/ACM Transactions on Networking 5 (6) (1997) 835–846.
[33] R. El Abdouni Khayari, R. Sadre, B. Haverkort, The pseudo-self-similar traffic model: application and validation, Performance Evaluation 56 (1–4) (2004) 3–22.
[34] W. Willinger, M. Taqqu, A. Erramilli, A bibliographical guide to self-similar traffic and performance modeling for modern high-speed networks, in: Stochastic Networks: Theory and Applications, 1996, pp. 339–366.
[35] M. Gui, Y. Jiang, Z. Zhang, Comparison of two queuing models with multiple servers, Computer Engineering and Applications 44 (13) (2008) 44–46.
[36] Z. Xie, X. Li, Study of network congestion rate based on a queuing theory model, Computer Engineering and Design 28 (17) (2007) 4172–4174.
[37] S. Sarvotham, R. Riedi, R. Baraniuk, Network and user driven alpha–beta on–off source model for network traffic, Computer Networks 48 (2005) 335–350.
[38] W.E. Leland, M.S. Taqqu, W. Willinger, D.V. Wilson, On the self-similar nature of Ethernet traffic, ACM SIGCOMM Computer Communication Review 23 (1993) 183–193.
[39] A. Erramilli, O. Narayan, W. Willinger, Experimental queueing analysis with long-range dependent packet traffic, IEEE/ACM Transactions on Networking 4 (2) (1996) 209–223.
[40] Z. Sahinoglu, S. Tekinay, On multimedia networks: self-similar traffic and network performance, IEEE Communications Magazine 37 (1) (1999) 48–52.
[41] K. Park, G. Kim, M. Crovella, On the effect of traffic self-similarity on network performance, in: Proceedings of the 1997 SPIE International Conference on Performance and Control of Network Systems, Bellingham, USA, 1997, pp. 296–310.
[42] P. Gospodinova, G. Todorov, Comparative analysis of the self-similar queuing process and the classical queuing process in telecommunication networks, in: IBIC 2007: The 2nd International Business Informatics Challenge and Conference, Dublin, Ireland, 2007, pp. 296–302.
Guohua You is a Ph.D. candidate at Beijing University of Chemical Technology, P.R. China. He received his BS and MS degrees from Jilin University and Beijing University of Chemical Technology, P.R. China, in 2002 and 2009, respectively. His research interests include distributed/grid computing systems, multi-core web servers, and network security.
Ying Zhao is a professor at Beijing University of Chemical Technology, P.R. China. He received his BS from Tianjin University, P.R. China in 1987. He received his MS and Ph.D. degrees from Beijing University of Chemical Technology, P.R. China, in 1996 and 2004, respectively. His research interests include distributed/grid computing systems, network time protocols, and web services.