Scheduling overcommitted VM: Behavior monitoring and dynamic switching-frequency scaling✩ Huacai Chen, Hai Jin ∗ , Kan Hu, Jian Huang Services Computing Technology and System Lab, Cluster and Grid Computing Lab, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China

Article history: Received 1 November 2010; Received in revised form 12 May 2011; Accepted 5 August 2011; Available online 19 August 2011.

Keywords: Virtualization; Xen; Credit scheduler; Behavior monitor; Dynamic switching-frequency scaling; Variable time slice

Abstract

Virtualization enables multiple guest operating systems to run on a single physical platform. These virtual machines (VMs) may host any type of application, including concurrent HPC programs. Traditionally, VMM schedulers have focused on fairly sharing the processor resources among VMs and rarely consider the VCPUs' behaviors. However, this can result in poor application performance for overcommitted virtual machines that host concurrent programs. In this paper, we review the features of both Xen's Credit and SEDF schedulers, and show how these schedulers may seriously impact the performance of communication-intensive and I/O-intensive concurrent applications in overcommitted VMs. We analyze the origin of the problem theoretically and confirm the derived conclusion on benchmarks. A novel approach is then proposed to improve the Credit scheduler and make it more suitable for concurrent applications. Our solution has two aspects: a periodic monitor that analyzes the behavior of each VCPU in real time, and a scheduler (extended from the Credit scheduler) that dynamically scales the context switching-frequency by applying variable time slices to VCPUs according to their behaviors. The experimental results show that the extended Credit scheduler can significantly improve the performance of communication-intensive and I/O-intensive concurrent applications in overcommitted VMs, to a level comparable with undercommitted scenarios.

1. Introduction

Modern computers are becoming more and more powerful with the emergence of multi-core architectures, creating a demand for server consolidation. Consequently, there has been a resurgence of interest in machine virtualization. A virtual machine monitor (VMM) enables multiple virtual machines to share a single physical machine safely and efficiently. Specifically, it provides isolation between the virtual machines (VMs) and limits the damage a guest OS can cause. Moreover, the VMM manages the access of VMs to hardware resources, optimizing resource usage and reducing the overhead of cache coherence. Virtual machines may host any type of application, including concurrent HPC programs. Traditionally, VMM schedulers have focused on dynamically and fairly sharing the processor resources among virtual machines, rarely considering the differences in VCPUs' behaviors.

✩ This is an extended version of an earlier paper, "Dynamic Switching-Frequency Scaling: Scheduling Overcommitted Domains in Xen VMM", which appeared in Proceedings of the 39th International Conference on Parallel Processing (ICPP 2010), San Diego, CA, USA, September 13-16, 2010.
∗ Corresponding author. Tel.: +86 27 8754 3529; fax: +86 27 8755 7354. E-mail address: [email protected] (H. Jin).


However, this can result in poor application performance for overcommitted VMs (an overcommitted VM is a VM whose virtual CPUs outnumber the physical CPUs it is pinned to) if concurrent programs are hosted in them, making virtualization less desirable for applications that require strong performance isolation. Xen [1] is a stand-alone VMM, or in other words, a Type I [2] VMM. Unlike a hosted VMM (i.e., a Type II VMM), Xen runs on the hardware directly rather than on a host OS. The default scheduler in the current version of Xen is the Credit scheduler, which uses a credit/debit system to fairly share processor resources among domains ("domain" is Xen terminology for a VM). The SEDF scheduler is an alternative scheduler in Xen that allows I/O-intensive domains to achieve lower latency [3]. Despite the SEDF scheduler's advantages for workloads mixing CPU- and I/O-intensive domains, the Credit scheduler is configured as the default because it improves scheduling on multiprocessors and provides better QoS controls. A virtualized system is an overcommitted system in most cases, since the total number of VCPUs is usually larger than the total number of PCPUs (i.e., physical CPUs). For a single domain, however, over-commitment is usually considered a configuration problem. Nevertheless, improving VM scheduling for overcommitted domains still makes sense, for several reasons:


1. Not all concurrent programs are available in source code. A precompiled program needs a predefined number of threads (or processes), so it may require more VCPUs than the available PCPUs.1
2. Some concurrent programs need a specific number of threads that the PCPU count does not match. For example, if a program needs a square number of threads but the system has 8 PCPUs, compiling it with 4 threads wastes resources, while compiling it with 9 threads leads to over-commitment.
3. A domain may be undercommitted when it starts to run concurrent programs, but some of its PCPUs may later be pinned exclusively by other domains that need strong isolation. (The pinning mechanism provides stronger performance isolation, as shown in Section 3.)
4. The performance issues in overcommitted domains are mainly caused by the fact that concurrent threads cannot run simultaneously. With the current Xen architecture, scheduling decisions are made locally for each PCPU, so a similar problem appears if processes hosted in different domains communicate with each other.

We conduct experiments to show that in overcommitted domains, the Credit scheduler may seriously impair the performance of communication-intensive and I/O-intensive concurrent applications. A theoretical analysis leads to the conclusion that the reason for the performance degradation is too much "busy blocking" time, caused by too few context switches for communication- and I/O-intensive applications in overcommitted domains; hence the fixed time slice (30 ms) used by the Credit scheduler cannot favor all types of applications in all cases. This conclusion is also confirmed on benchmarks. In this paper, we propose an extension to the Credit scheduler targeted at improving the performance of communication-intensive and I/O-intensive applications in overcommitted domains. The extended scheduler profiles the behaviors of VCPUs using the PMU (performance monitoring unit) [4,5] and scales the context switching-frequency dynamically and self-adaptively by selecting different time slices for VCPUs of different types. The experimental results show that the proposed scheduler can improve the performance of overcommitted communication- and I/O-intensive applications to the level of undercommitted ones, without impairing the performance of CPU-intensive applications.

The contributions of this paper are twofold. First, the issues induced by Xen's Credit scheduler in overcommitted domains are exposed in our experiments, and the source of the problem is explored theoretically, motivating future work in this area. Second, a solution based on behavior monitoring and dynamic switching-frequency scaling is proposed. The proposed solution is implemented as an extension to the Credit scheduler which combines the advantages of both the Credit and SEDF schedulers.

The rest of this paper is organized as follows. Section 2 briefly surveys related work. In Section 3 we introduce our motivation by describing the problem via experiments. In Section 4 we first analyze the problem in detail, followed by the design and implementation of behavior monitoring and dynamic switching-frequency scaling. After that, we evaluate the performance of our solution in Section 5. Finally, Section 6 concludes the paper.

2. Related works

A VMM provides an abstraction layer between the VMs running their own software stacks and the actual hardware.

1 To simplify the problem, in this paper we assume that the number of VCPUs of a domain equals the number of threads of the concurrent program hosted in the domain.

Researchers categorize VMMs into two types [2]: a Type I VMM is a stand-alone VMM, which runs directly on hardware, whereas a Type II VMM runs on a host OS as a module. Type I VMMs, e.g., VMware ESX Server [6,7] and Xen [1], are more robust since they can prevent the system from being crashed by buggy device drivers in a host OS; on the other hand, stand-alone VMMs cannot reuse the process scheduling of a host OS for VM scheduling. BVT (borrowed virtual time), SEDF (simple earliest deadline first) and the Credit scheduler [8] are the major schedulers that have been used by Xen, with the Credit scheduler being the default in the current version. The Credit scheduler is a fair-share scheduler with both WC (work-conserving) and NWC (non-work-conserving) modes and global load-balancing capability on multiprocessors [8].

The relationship between scheduling in the VMM and I/O performance has been studied [3]. It is pointed out that the Credit scheduler is not appropriate for bandwidth-intensive and latency-sensitive applications. Several extensions to Xen's Credit scheduler have been proposed to improve I/O performance: adding a highest priority named BOOST, sorting the runqueue based on remaining credits, and tickling the scheduler when events are sent. Communication-aware scheduling is proposed in [9], which implements preferential scheduling for the recipient and anticipatory scheduling for the sender. Lock-aware scheduling [10] and task-aware scheduling [11] in the VMM are both strategies to improve the performance of I/O-intensive virtual machines, either by using intrusive/non-intrusive methods to avoid lock-holder preemption, or by using partial boosting to promote the priority of I/O-intensive domains. In summary, these efforts to improve scheduling for I/O-intensive programs focus on generic cases, without considering overcommitted domains.

To reduce the "busy blocking" time, one possible way is to make all VCPUs of a logical group (e.g., the VCPUs of a concurrent program) run simultaneously; gang-scheduling [10,12] is a solution of this type. However, in the overcommitted case, gang-scheduling tries its best but still falls far short of the goal, since the available computing resources (i.e., PCPUs) are not sufficient. Another way is to detect "busy blocking" directly in the VMM. This type of approach can be used for overcommitted domains [13], but it is complicated to implement without hardware support, especially for full virtualization, which runs unmodified guest OSes.

OProfile is a system-wide profiler for Linux systems [14]; it uses the PMU of the CPU to profile a wide variety of event statistics, and the profiling data can be used to optimize systems or applications. Xenoprof [15,16] is a profiler for Xen derived from OProfile; it can profile the Xen virtual machine monitor, multiple Linux guest operating systems, and the applications running on them. In implementing dynamic switching-frequency scaling, we use techniques similar to Xenoprof to monitor domains' behaviors as the basis for selecting the length of the time slice.

3. Motivation

A VCPU cannot run on any PCPU except those it is pinned to, so pinning reduces the "available" PCPUs for a certain domain and causes many cases of overcommitted domains. However, pinning is very useful since it provides stronger performance isolation. We designed an experiment to evaluate this. In our experiment the physical platform has 8 CPU cores and 4 GB memory.
One PV (para-virtualization) guest domain is configured with 4 VCPUs and 512 MB memory, and NPB (the NAS Parallel Benchmarks) [17] is used as the workload. In the pinned case, the 4 VCPUs of the guest domain are pinned to PCPU0-PCPU3, and Domain-0 is pinned to the rest (i.e., PCPU4-PCPU7); the non-pinned case uses the default configuration without any mapping constraint.


Fig. 1. Normalized standard deviation of the NPB programs' execution times.

Fig. 1 shows the normalized standard deviation of the execution time of each NPB program.2 Domain-0 performs some routine work in a fixed pattern during the tests. The results are derived from 8 rounds of execution. A smaller standard deviation means the execution of a program is more stable, reflecting stronger performance isolation (the performance of the guest domain is not perturbed by Domain-0). From this figure we can see that for most programs the standard deviation of the pinned case is less than 80% of that of the non-pinned case; for MG and SP in particular, the relative values drop to 21% and 12% of the non-pinned case.

Another experiment shows that the performance of communication- and I/O-intensive concurrent applications may decrease steeply when the domains become overcommitted. This time we configured four PV guest domains, each with the same VCPU and memory configuration as in the previous experiment. We compare the performance (i.e., execution time) of the NPB programs in four cases3: the Credit scheduler with dispersive pinning (credit_d), the Credit scheduler with concentrated pinning (credit_c), the SEDF scheduler with dispersive pinning (sedf_d), and the SEDF scheduler with concentrated pinning (sedf_c). The total amount of CPU resources for the 4 guest domains is the same (i.e., 4 PCPUs) in all cases, and concentrated pinning is regarded as an extreme case of overcommitted domains. The two pinning policies are sketched in Fig. 2: dispersive pinning (the undercommitted case) means the 4 VCPUs of each guest domain are pinned to 4 different PCPUs, e.g., with processor ids 0, 1, 2 and 3 respectively; concentrated pinning (the overcommitted case) means the 4 VCPUs of a single domain are all pinned to one PCPU exclusively, e.g., the 4 VCPUs of the first guest domain are all pinned to PCPU0, those of the second domain to PCPU1, etc. The VCPUs of Domain-0 are pinned to the remaining PCPUs to avoid unexpected disturbance.

The performance results are shown in Fig. 3, where the execution time is the average over the 4 concurrent domains. Several observations can be made from this figure. For CPU-intensive applications, e.g., EP, the execution times in the 4 cases are nearly the same, with the Credit scheduler having a slight advantage over the SEDF scheduler. For communication-intensive programs such as IS and CG, and I/O-intensive ones such as BT, the performance of the Credit scheduler with concentrated pinning is seriously worse than in the other 3 cases, with execution times 3-45 times longer.

2 NPROCS = 4 for each program. Class A is used here for FT because FT.B needs more than 512 MB memory.
3 NPROCS = 4 for each program. Class A is used here for LU, FT, BT and SP because Class B either needs more than 512 MB memory or takes too long to complete when the Credit scheduler is used with concentrated pinning.

Table 1
Switching-frequencies (S) of NPB programs.

App.   credit_d   credit_c   sedf_d    sedf_c
EP.B   33.38      33.3       2012.72   2017.57
IS.B   994.3      72.26      2512.2    2730.67
CG.B   298.6      114.01     1307.53   2158.98
LU.A   212.86     83.27      1992.83   2049.77
MG.B   205.45     112.36     1976.01   2098.73
FT.A   225.43     75.55      1911.66   1761.25
BT.A   125.18     88.4       2008.18   2085.32
SP.A   203.85     98.35      1834.08   2083.08
FT is a mixed-type program; its execution time with concentrated pinning is about 23% longer than in the dispersive case, and the Credit scheduler has a small advantage over SEDF. For all programs, credit_d is slightly better than sedf_d and sedf_c; the reason is that the SEDF scheduler performs context switches more frequently than the Credit scheduler (confirmed by Table 1 in the next section). Consequently, communication-intensive and I/O-intensive concurrent applications perform poorly in the credit_c case (i.e., overcommitted domains with the Credit scheduler). We discuss the source of the problem theoretically and address it in the next section.

4. Behavior monitoring and dynamic switching-frequency scaling

In this section, we first analyze in detail the reason for the performance differences observed in the previous section. After that, we present an extension to the Credit scheduler with behavior monitoring and dynamic switching-frequency scaling.

4.1. Analysis: the key factor is ineffective holding time

For communication-intensive concurrent programs, it is common for one thread to block in a busy state while waiting for data from other threads. For I/O-intensive programs, a thread similarly blocks while waiting for data from devices such as disks or the network. In the Xen virtual machine environment, such "busy blocking" makes a VCPU hold a PCPU while doing nothing meaningful (it is probably spinning on a lock); in other words, it wastes time. Since gang-scheduling is not suitable for overcommitted domains and direct spin detection is complicated to implement, we consider a simple but effective method to reduce the "busy blocking" time. For the convenience of the theoretical analysis, we first define some variables as follows, some of which are illustrated in Fig. 4:

• T_H: Average holding time, the average interval during which a VCPU holds a PCPU between two adjacent VCPU context switches (i.e., the length of the time slice).
• T_S: Average scheduling time, the average time spent picking the next VCPU to run and performing the corresponding context switch.
• T_E: Average effective holding time, the average time a VCPU spends doing useful work between two context switches.
• T_I: Average ineffective holding time, the average time a VCPU spends in the "busy blocking" state between two context switches.
• T_O: Optimal execution time, the execution time of a program in the optimal case; T_S is considered equal to 0 in this case.
• T_A: Actual execution time, the execution time of a program in practice.
• S: Switching-frequency, the number of context switches per second; S = 1/T_H.


Fig. 2. Sketch of the dispersive and concentrated pinning policies.

Fig. 3. Performance of NPB programs with different schedulers and pinning policies.

The relation between T_O and T_A is given by:

T_A = T_O × (T_H / T_E) = T_O / (S × T_E).    (1)

It is clear that T_H = T_S + T_E + T_I; hence

T_I = 1/S − T_S − T_E,   T_I ≥ 0.    (2)

The maximum value of T_E depends on the behavior of the program, so T_E remains a constant for a given program as long as T_I > 0. Since T_I decreases as S increases, T_E starts to decrease with increasing S once T_I has dropped to 0. Therefore,

T_A = T_O / (S × T_E)        if T_I > 0,
T_A = T_O / (1 − T_S × S)    if T_I = 0.    (3)

In this equation, T_S is a constant for a specific scheduler, and T_O is a constant for a specific program. Our goal is to reduce T_A as much as possible, and the basic method is to reduce T_I. From (2) we can see that increasing S is one way to do so; however, increasing S further once T_I has reached 0 makes T_A increase again, which is explained by (3). We analyzed the context switching of the third experiment in Section 3 with Xentrace [18], and the results listed in Table 1 confirm that the performance of communication-intensive and I/O-intensive applications depends strongly on S. From Table 1 we can see that the switching-frequency of the SEDF scheduler is far greater than that of the Credit scheduler. This makes the ineffective holding time in sedf_c and sedf_d significantly smaller. Meanwhile, the difference in switching-frequency also explains the tiny performance gap for CPU-intensive programs (i.e., EP.B) between the two schedulers. The switching-frequencies of the Credit scheduler in the dispersive case are not very large, because the 4 VCPUs of a domain run concurrently and the ineffective holding time is inherently small. The accumulated time spent in scheduling grows with T_S, so intuitively a large switching-frequency might bring large overhead. However, by tracing the scheduler we found that T_S is about 2.0 µs for the Credit scheduler and 1.5 µs for the SEDF scheduler. This is far less than 357 µs, the value of T_H when S is 2800 switches per second (the upper bound of S we have observed). Therefore, the scheduling overhead is negligible.
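To make the trade-off captured by Eq. (3) concrete, the following minimal sketch evaluates T_A over a range of switching-frequencies. The values chosen for T_O, T_S and T_E are illustrative assumptions of ours, not measurements from this paper; the point is only the shape of the curve: T_A drops as S grows while T_I > 0, then saturates slightly above T_O once T_I reaches 0.

```c
/* Minimal numerical sketch of Eqs. (1)-(3); T_O, T_S and T_E below are
 * illustrative values assumed for demonstration, not measured data. */
#include <stdio.h>

int main(void)
{
    const double T_O = 100.0;    /* optimal execution time (s), assumed        */
    const double T_S = 2.0e-6;   /* scheduling overhead per switch (s)         */
    const double T_E = 0.5e-3;   /* useful work per switch (s), assumed        */
    double S;                    /* switching-frequency (switches per second)  */

    for (S = 100.0; S <= 3200.0; S *= 2.0) {
        double T_H = 1.0 / S;             /* holding time = time slice          */
        double T_I = T_H - T_S - T_E;     /* ineffective ("busy blocking") time */
        double T_A;
        if (T_I > 0.0)
            T_A = T_O / (S * T_E);        /* Eq. (3), first case                */
        else
            T_A = T_O / (1.0 - T_S * S);  /* Eq. (3), second case               */
        printf("S = %6.0f Hz  T_I = %9.6f s  T_A = %8.2f s\n",
               S, T_I > 0.0 ? T_I : 0.0, T_A);
    }
    return 0;
}
```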

4.2. Overview of dynamic switching-frequency scaling

The scheduler extension we have designed includes a monitor that analyzes the VMs' behaviors and an improved scheduling algorithm with a switching-frequency scaling feature. Fig. 5 illustrates the framework of the extended scheduler: the behavior monitor is the foundation, and dynamic switching-frequency scaling is an upper-layer functionality. The behavior monitor is invoked periodically for each domain. It traces the boost frequency of each domain (the number of times the domain's VCPUs become BOOST in one monitoring interval), as well as its bus transactions, with the help of the PMU. The traced information is used to distinguish CPU-intensive, communication-intensive and I/O-intensive domains. A longer time slice is used by our scheduler if a domain is recognized as CPU-intensive; by contrast, if a domain is communication-intensive or I/O-intensive, shorter time slices are applied. To facilitate further discussion, these two types of domains are called lts-domains and sts-domains, respectively. Originally, the Credit scheduler works in three steps: (1) pick the next VCPU from the runqueue; (2) apply the fixed time slice to that VCPU; (3) context switch to that VCPU. Our extension modifies the second step: if the next VCPU belongs to an lts-domain, a long time slice is applied; otherwise, a short time slice is applied.


Fig. 4. Definition of T_H, T_S, T_E and T_I.

Table 2
Switching-frequencies (S) of NPB programs (tick = 1 for the Credit scheduler).

App.   credit_d   credit_c   sedf_d    sedf_c
EP.B   332.96     332.88     2012.72   2017.57
IS.B   1044.97    720.23     2512.2    2730.67
CG.B   512.35     654.3      1307.53   2158.98
LU.A   472.98     438.97     1992.83   2049.77
MG.B   456.5      542.92     1976.01   2098.73
FT.A   663.88     730.07     1911.66   1761.25
BT.A   399.17     465.96     2008.18   2085.32
SP.A   449.48     532.66     1834.08   2083.08

Fig. 5. Framework of extended Credit scheduler.

According to our definitions in Section 4.1, the length of the time slice here is essentially the holding time of the next VCPU, and a shorter time slice may result in more frequent context switching. In order to maintain the fair-share feature of the Credit scheduler, the VCPU accounting policy is also modified to fit our extension. The default fixed time slice of the Credit scheduler is 30 ms (i.e., tick = 10, where "tick" denotes the time unit of the Credit scheduler), and this value is also used as the long time slice in our extended scheduler. However, two questions must be answered in our design: (1) how long should the short time slice be? (2) how do we distinguish lts-domains from sts-domains? These questions are answered in the rest of this section. Our research and development are incremental. In the first version, we use a simple design with just two time slice lengths (one long and one short). In Version 2, the switching-frequency scaling is smoother: short time slices change gradually from a maximum value (equal to the long time slice of Version 1) down to a minimum value. Since the behavior monitor reuses part of the infrastructure of Xenoprof and conflicts with it, another enhancement in the second version makes dynamic switching-frequency scaling coexist with Xenoprof.

4.3. Version #1: initial design

4.3.1. Variable time slice

We first answer Question 1 above. A simple way to scale the switching-frequency is to use two different time slices: the long one can be the same as the default fixed value (i.e., 30 ms), and the other should be shorter. In order to find an appropriate short time slice, we first conducted an experiment to check whether 3 ms (i.e., tick = 1) is short enough. The switching-frequencies for tick = 1 are listed in Table 2, with the data of the SEDF scheduler included again for comparison. Compared with tick = 10, the switching-frequencies of the NPB programs in the concentrated pinning case increase by a factor of 5-10, but remain far lower than in the SEDF cases. Fig. 6 shows the performance of each program.

Fig. 6. Performance of NPB programs with different schedulers and pinning policies (tick = 1 for Credit scheduler).

This figure shows that the execution time of EP increases slightly compared with tick = 10; this is due to the redundant context switching. For the other programs the performance improves to varying degrees: 479% for IS.B, 395% for CG.B, 144% for LU.A, 260% for MG.B, 8% for FT.A, 164% for BT.A and 243% for SP.A. However, for IS.B, CG.B, BT.A and SP.A there is still a wide performance gap between the Credit and SEDF schedulers, which implies that a 3 ms time slice is not short enough for most programs. After dozens of experiments and comparisons, 1 ms was selected as the short time slice in this version of our extended scheduler; it is an appropriate value in most cases. To implement the variable time slice, a property "DomainType" is added to csched_domain, the domain descriptor of the Credit scheduler, to indicate whether a domain is an lts-domain or an sts-domain. In addition, csched_schedule(), the core function of the Credit scheduler, is modified: this function returns a structure that includes the time at which csched_schedule() will be invoked again (which is also the holding time of the picked VCPU), so 30 ms is returned for an lts-domain and 1 ms for an sts-domain.
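As a rough illustration of the modified step (2), the fragment below shows how a csched_schedule()-style decision could pick the returned time slice from a per-domain type flag. The structure and helper names are simplified stand-ins of our own, not the actual Xen 3.4 source.

```c
/* Illustrative sketch only: simplified stand-ins for the csched_domain /
 * csched_schedule() changes described in the text, not the real Xen code. */
#include <stdint.h>
#include <stdio.h>

#define MILLISECS(ms) ((int64_t)(ms) * 1000000LL)   /* nanoseconds */

enum domain_type { LTS_DOMAIN, STS_DOMAIN };

struct sched_dom {
    enum domain_type domain_type;   /* set by the behavior monitor */
};

/* Time slice (ns) to program for the VCPU that was just picked:
 * 30 ms for an lts-domain, 1 ms for an sts-domain (Version 1). */
static int64_t slice_for(const struct sched_dom *d)
{
    return (d->domain_type == LTS_DOMAIN) ? MILLISECS(30) : MILLISECS(1);
}

int main(void)
{
    struct sched_dom cpu_bound = { LTS_DOMAIN };
    struct sched_dom io_bound  = { STS_DOMAIN };
    printf("lts slice: %lld ns, sts slice: %lld ns\n",
           (long long)slice_for(&cpu_bound), (long long)slice_for(&io_bound));
    return 0;
}
```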


Fig. 7. Long and short time slice in extended scheduler.

To coordinate with the variable time slice, the VCPU accounting must also be modified. As we know, accurate VCPU accounting is the key to fair sharing. In the default Credit scheduler, the holding time of a VCPU is 30 ms most of the time (the boosting mechanism may shorten the holding time of the VCPU being scheduled out, but this is not the common case), so the VCPU accounting is performed in a periodic timer handler, csched_tick(), which subtracts 100 credits every 10 ms. With a variable time slice, however, this accounting strategy is far from accurate. For example, in Fig. 7, when csched_tick() is called at T2, the current VCPU may have been scheduled in only one millisecond earlier (at T1), yet its credits are still reduced by 100 (they should be reduced by only 10). In our extended scheduler, a field named "TimeOfScheduledIn" is therefore added to csched_vcpu, the VCPU descriptor of the Credit scheduler, and accounting is no longer performed periodically but at every context switch in csched_schedule(): the credits of the current VCPU (about to be scheduled out) are reduced by the exact amount given by (4), and "TimeOfScheduledIn" of the picked VCPU (about to be scheduled in) is updated to the current time.

CreditsToSubtract = CreditsPerMsec × (TimeOfNow − TimeOfScheduledIn).    (4)

In this equation, CreditsPerMsec is 10, corresponding to 100 credits per 10 ms in the default Credit scheduler.
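The per-switch accounting of Eq. (4) can be sketched as follows. The field and function names mirror the description above (TimeOfScheduledIn, CreditsPerMsec) but are illustrative stand-ins rather than the real csched_vcpu definition; times are kept in milliseconds for simplicity.

```c
/* Illustrative sketch of the per-switch accounting of Eq. (4). */
#include <stdio.h>

#define CREDITS_PER_MSEC 10   /* 100 credits per 10 ms, as in the default Credit scheduler */

struct vcpu_acct {
    long credits;
    double time_of_scheduled_in;  /* ms; wall-clock time the VCPU was scheduled in */
};

/* Called at every context switch: charge the outgoing VCPU for exactly the
 * time it held the PCPU, and stamp the incoming VCPU. */
static void account_at_switch(struct vcpu_acct *out, struct vcpu_acct *in, double now_ms)
{
    long to_subtract = (long)(CREDITS_PER_MSEC * (now_ms - out->time_of_scheduled_in)); /* Eq. (4) */
    out->credits -= to_subtract;
    in->time_of_scheduled_in = now_ms;
}

int main(void)
{
    struct vcpu_acct a = { 300, 0.0 }, b = { 300, 0.0 };
    account_at_switch(&a, &b, 1.0);          /* a held the PCPU for 1 ms -> loses 10 credits */
    printf("a.credits = %ld\n", a.credits);  /* prints 290 */
    return 0;
}
```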

4.3.2. Behavior monitor

As mentioned in the previous subsection, all domains are divided into two types, and a flag indicates the type. Every domain is initialized as an lts-domain; during its lifetime, a domain's type can change from one to the other according to the feedback from the behavior monitor. Since the overhead may increase significantly if timer handlers are triggered too frequently, we set the length of the monitoring interval to 1 s. A VCPU in an I/O-intensive domain blocks frequently because of its massive I/O requests, and, as noted, its priority changes to BOOST once it is woken up from the blocked state. Therefore, the boost frequency (the number of boosts during a monitoring interval) can be used to distinguish an I/O-intensive domain from others. Most communication is performed by accessing memory or I/O ports, which count as "bus transactions"; our behavior monitor therefore uses the PMU of x86 processors to trace bus access events (e.g., BUS_TRAN_MEM, BUS_TRANS_IO, etc.) [4,5]. By this means, a domain can be considered communication-intensive if its bus accesses during an interval exceed a threshold. Bus accesses can be bursty, i.e., a communication-intensive domain does not necessarily access the data bus frequently in every monitoring interval. Consequently, we look back over the last N intervals, rather than only the last one, to analyze a domain's behavior: a domain is recognized as communication-intensive if the bus accesses of one or more of the last N monitoring intervals exceed the threshold. In the current implementation, N is set to 10. A similar approach is used to recognize I/O-intensive domains. To implement the behavior monitor, two fields named "BoostFreqArray" and "BusAccFreqArray" are added to csched_domain. The size of both arrays is N, so that the boosting and bus access situations of the last N intervals are recorded.

Table 3
Switching-frequencies (S) of NPB programs (with the dynamic switching-frequency scaling extension for the Credit scheduler).

App.   credit_d′   credit_c′   credit_d′′   credit_c′′   sedf_d    sedf_c
EP.B   33.28       33.3        33.27        33.35        2012.72   2017.57
IS.B   1618.82     1155.65     2499.35      2825.45      2512.2    2730.67
CG.B   956.22      1052.12     1420.67      2231.83      1307.53   2158.98
LU.A   1056.05     1082.53     1971.92      2088.01      1992.83   2049.77
MG.B   1027.73     914.68      2093.4       2206.15      1976.01   2098.73
FT.A   267.15      342         304.575      422.88       1911.66   1761.25
BT.A   1007.75     1069.01     1968.07      2117.05      2008.18   2085.32
SP.A   938.26      1122.3      1689.92      2146.85      1834.08   2083.08

Another field of csched_domain, "CurrentIndex", indicates the current record in the two arrays. The PMU is initialized during the boot phase of the host machine: the event name is set to BUS_TRAN_ANY, and the counter value can be configured by the user. After system initialization, the event counter starts automatically. The behavior monitor reuses part of the infrastructure of Xenoprof, including some hardware operations and NMI (Non-Maskable Interrupt) handling. When the NMI handler detects a PMU counter overflow, the current record in the BusAccFreqArray of the current domain is incremented. Similarly, when a VCPU is boosted, the current record in the BoostFreqArray of its hosting domain is incremented. The per-domain timer is also a field of csched_domain; it is initialized and started during domain creation. The timer handler is the main body of the behavior monitor, invoked every 1 s as mentioned before. The algorithm of the monitor is illustrated in Fig. 8. When the timer handler is invoked, it first checks BoostFreqArray and BusAccFreqArray: if any record exceeds its threshold, the DomainType of the domain is set to sts-domain; otherwise, if all records in the arrays are below the thresholds, DomainType is set to lts-domain. After that, it advances CurrentIndex and drops the oldest records. Finally, it sets the next activation time of the monitor (1 s later in our case).

4.4. Version #2: enhancements

4.4.1. Smoother scaling

The first version of dynamic switching-frequency scaling can improve the performance of overcommitted communication- and I/O-intensive applications up to the same magnitude as the undercommitted ones (confirmed by Fig. 11 in Section 5). However, taking the bus access frequency as an example (the boost frequency is similar), this version has problems: both BusAccFreqThresh and the short time slice are difficult to choose. The relationship between time slice length and bus access frequency is illustrated by the blue polyline in Fig. 9: the time slice length can only jump between two values. So for a domain whose bus access frequency is around BusAccFreqThresh, there are three situations: it is always recognized as CPU-intensive; it is always recognized as communication-intensive; or it changes between the two frequently. The third situation is acceptable, since its switching-frequency is medium and the performance is stable across repetitions. However, the performance gap between the former two situations can be very large, and this thrashing is not what we want. In addition, in Version 1 the length of the short time slice is an empirical value derived from many tests; as a result, it is a compromise and not optimal for all cases. In fact, 1 ms is still too large for some benchmarks (e.g., IS and CG, confirmed by Fig. 11 and Table 3 in Section 5).


Fig. 8. Algorithm of behavior monitor (Version 1).
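Since Fig. 8 itself is not reproduced here, the sketch below restates the Version-1 monitor logic described in Section 4.3.2 in C: scan the last N records against the thresholds, flip the domain type, advance CurrentIndex and clear the oldest slot. The threshold values and field names are assumptions chosen for illustration, not the implementation's.

```c
/* Illustrative sketch of the Version-1 behavior monitor timer handler.
 * Array/field names follow the text (BoostFreqArray, BusAccFreqArray,
 * CurrentIndex); thresholds are assumed values, not the paper's. */
#include <string.h>

#define N 10                        /* number of monitoring intervals kept */
#define BOOST_FREQ_THRESH   200     /* assumed threshold, for illustration */
#define BUS_ACC_FREQ_THRESH 500     /* assumed threshold, for illustration */

enum domain_type { LTS_DOMAIN, STS_DOMAIN };

struct monitored_domain {
    enum domain_type domain_type;
    unsigned long boost_freq[N];    /* BoostFreqArray   */
    unsigned long bus_acc_freq[N];  /* BusAccFreqArray  */
    int current_index;              /* CurrentIndex     */
};

/* Invoked once per monitoring interval (1 s) for each domain. */
static void behavior_monitor_tick(struct monitored_domain *d)
{
    int i, sts = 0;

    /* A domain is sts- if any record of the last N intervals exceeds a threshold. */
    for (i = 0; i < N; i++) {
        if (d->boost_freq[i] > BOOST_FREQ_THRESH ||
            d->bus_acc_freq[i] > BUS_ACC_FREQ_THRESH) {
            sts = 1;
            break;
        }
    }
    d->domain_type = sts ? STS_DOMAIN : LTS_DOMAIN;

    /* Advance to the next slot and drop the oldest records. */
    d->current_index = (d->current_index + 1) % N;
    d->boost_freq[d->current_index] = 0;
    d->bus_acc_freq[d->current_index] = 0;

    /* The real handler would also re-arm the per-domain timer for +1 s here. */
}

int main(void)
{
    struct monitored_domain d;
    memset(&d, 0, sizeof(d));
    d.bus_acc_freq[0] = 900;          /* one bursty interval */
    behavior_monitor_tick(&d);
    return d.domain_type == STS_DOMAIN ? 0 : 1;
}
```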

To solve these problems we need smoother control, with which the time slice length can change gradually from a maximum value to a minimum value. Four constants are defined as follows:

• SliceMax: The maximum length of the time slice, equal to the long time slice of Version 1 (i.e., 30 ms).
• SliceMin: The minimum length of the time slice. We choose 0.4 ms, which corresponds to a switching-frequency of 2500 Hz (approximately the upper bound of the switching-frequencies observed in Tables 1 and 2).
• BusAccFreqLowThresh: The low threshold of bus access frequency; SliceMax is used for a domain if its frequency is below this threshold.
• BusAccFreqHighThresh: The high threshold of bus access frequency; SliceMin is used for a domain if its frequency exceeds this threshold.

In Version 2, when the bus access frequency is between BusAccFreqLowThresh and BusAccFreqHighThresh, we use an inverse proportional function to calculate the time slice length. The relationship between the time slice length (denoted by SliceLen) and the bus access frequency (denoted by BusAccFreq) is expressed by the violet curve in Fig. 9 and the following equation:

SliceLen = SliceMax                  if BusAccFreq < L
SliceLen = A / (BusAccFreq − B)      if L ≤ BusAccFreq ≤ H
SliceLen = SliceMin                  if BusAccFreq > H.    (5)

Fig. 9. Relationship between time slice length and bus access frequency.

H and L in this equation denote the two thresholds of bus access frequency mentioned above (i.e., BusAccFreqHighThresh and BusAccFreqLowThresh). From Fig. 9 we can see that the violet curve is continuous at the two thresholds, thus:

SliceMax = A/(L − B) = 30
SliceMin = A/(H − B) = 0.4.    (6)

The two constants A and B can therefore be calculated from H and L:

A = 30(H − L)/74
B = (75L − H)/74.    (7)

Since the role of the short time slice of the first version is partly taken over by SliceMin, and SliceMin is a meaningful rather than an empirical value, the remaining problem in Version 2 is how to choose BusAccFreqLowThresh and BusAccFreqHighThresh. Fortunately, choosing a range bounded by two thresholds is looser and easier than choosing a single value. In our design, BusAccFreqLowThresh is set below the BusAccFreqThresh of Version 1 to avoid the thrashing discussed before, and BusAccFreqHighThresh is large enough to ensure smoothness. In this version of dynamic switching-frequency scaling, the behavior monitor not only distinguishes whether a domain is lts- or sts-, but also calculates the domain's SliceLen if it is an sts-domain. The new algorithm is illustrated in Fig. 10. For convenience, the algorithm shown there is simplified to consider only the bus access frequency. The logic for the boost frequency is similar, and if SliceLen is calculated from both frequencies (meaning the domain is both communication-intensive and I/O-intensive), the smaller value is used.
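A minimal sketch of Eqs. (5)-(7): the helper below maps a bus access frequency to a time slice length, deriving A and B from the continuity conditions. The two threshold values used in the demonstration are placeholders we picked, since the paper does not state the final values of BusAccFreqLowThresh and BusAccFreqHighThresh.

```c
/* Illustrative computation of SliceLen (ms) from Eqs. (5)-(7).
 * L and H are placeholder thresholds chosen here for demonstration only. */
#include <stdio.h>

#define SLICE_MAX 30.0   /* ms */
#define SLICE_MIN  0.4   /* ms */

static double slice_len(double bus_acc_freq, double L, double H)
{
    /* Continuity at the two thresholds (Eq. (6)) yields Eq. (7): */
    double A = 30.0 * (H - L) / 74.0;
    double B = (75.0 * L - H) / 74.0;

    if (bus_acc_freq < L)
        return SLICE_MAX;             /* Eq. (5), first case  */
    if (bus_acc_freq > H)
        return SLICE_MIN;             /* Eq. (5), third case  */
    return A / (bus_acc_freq - B);    /* Eq. (5), middle case */
}

int main(void)
{
    const double L = 1000.0, H = 100000.0;   /* assumed thresholds (bus accesses per interval) */
    double f;
    for (f = 500.0; f <= 200000.0; f *= 4.0)
        printf("BusAccFreq = %8.0f -> SliceLen = %6.2f ms\n", f, slice_len(f, L, H));
    return 0;
}
```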


Fig. 10. Algorithm of behavior monitor (Version 2).

When the behavior monitor is invoked, the current domain is initialized as lts- and the peak value of the bus access frequency, PeakFreq, is initialized to zero. The monitor then checks BusAccFreqArray; if any record exceeds BusAccFreqLowThresh, the domain type is set to sts- and PeakFreq is updated. The subsequent update of BusAccFreqArray is the same as in Version 1. After that, SliceLen is calculated with Eq. (5). Finally, the next activation time of the monitor is set.

4.4.2. Reconciling with Xenoprof

Xenoprof is based on PMU counters and NMI, and so is our behavior monitor. Since the behavior monitor reuses parts of the infrastructure of Xenoprof, in our earlier version they conflicted with each other; for example, starting Xenoprof while the behavior monitor is enabled crashes the system. In this subsection we describe the changes that allow them to coexist. Briefly, the life cycle of a Xenoprof session has the following steps:
(1) Init: Get the number and features of the PMU counters, then build the data structures used by the user-space tools.
(2) ReserveCounter: Allocate memory for the PMU counters and fill it with meaningful data.
(3) SetupEvents: Set up the parameters (e.g., event type, overflow count, unit mask) for every counter.
(4) EnableVIRQ: Enable the NMI interrupt handler.
(5) Start: Start the profiling procedure for every enabled counter.
(6) Profile: The main profiling procedure, performed in the NMI handler.
(7) Stop: Stop the profiling procedure for every enabled counter.
(8) DisableVIRQ: Disable the NMI interrupt handler.
(9) ReleaseCounter: Free the memory for the PMU counters.
(10) Shutdown: Destroy the data structures used by the user-space tools.

On the other hand, the life cycle of our behavior monitor is very similar to a Xenoprof session. The only difference is that the behavior monitor always runs in kernel space and has no interaction with user level, so its life cycle has no Init and Shutdown steps. The conflicts between the behavior monitor and Xenoprof have two causes: (1) Xenoprof sessions cannot interleave, but the behavior monitor is essentially one long session that begins at system boot and ends at system shutdown. (2) The behavior monitor needs a dedicated PMU counter, so this counter can no longer be used by Xenoprof. To resolve this problem, we modified the operation logic of several steps of a Xenoprof session and defined a global flag, MonitorEnabled, to indicate whether a given step is called by the behavior monitor (MonitorEnabled = false, the initial value at system boot) or by Xenoprof (MonitorEnabled = true). The modified steps are listed below:

• Init: Since the behavior monitor uses one PMU counter exclusively (we reserve the last counter for it), this step only returns n − 1 if the real number of counters is n.
• ReserveCounter: Allocate memory as in the original if MonitorEnabled is false; otherwise do nothing (the allocation only needs to be done once, and it has already been done by the behavior monitor).
• SetupEvents: Set up parameters as in the original if MonitorEnabled is false; otherwise ignore the last counter.
• EnableVIRQ: Enable the NMI handler as in the original if MonitorEnabled is false; otherwise do nothing (NMI has already been enabled by the behavior monitor).
• Start: Start profiling as in the original if MonitorEnabled is false; otherwise ignore the last counter. Our behavior monitor sets MonitorEnabled to true after this step at system boot.
• Stop: Stop profiling as in the original if MonitorEnabled is false; otherwise ignore the last counter. The behavior monitor sets MonitorEnabled to false before this step at system shutdown.


• DisableVIRQ: Disable the NMI handler as in the original if MonitorEnabled is false; otherwise do nothing (the behavior monitor still needs NMI).

• ReleaseCounter: Free memory as in the original if MonitorEnabled is false; otherwise do nothing.

With all of the above modifications, the behavior monitor and Xenoprof can coexist. The only shortcoming is that Xenoprof "loses" one usable counter, which we consider acceptable.

5. Performance evaluation

A series of experiments in this section evaluates the performance, fair-share capability and overhead of the extended Credit scheduler (both Version 1 and Version 2).

5.1. Experimental setup

The hardware platform of our testbed has a 2-way quad-core processor configuration, 8 cores in total. The processors are Intel Core2-based Xeon E5310 (1.60 GHz), with 32 KB L1 I-cache, 32 KB L1 D-cache and 4 MB L2 cache. The total amount of physical memory is 4 GB. A 160 GB SCSI hard disk and two Gigabit Ethernet cards are configured. Our work is based on Xen-3.4.3 and Linux-2.6.18.8-xen. Both Domain-0 and the unprivileged domains use Red Hat Enterprise Linux Server 5.1 (x86_64) as their operating system. NPB-3.3 (together with MPICH-1.2.7p1) [17,19,20] is used as the benchmark to obtain performance data. NPB is derived from CFD (computational fluid dynamics) codes; it was designed to compare the performance of parallel computers and is widely recognized as a standard indicator of computer performance. NPB consists of eight programs: EP (Embarrassingly Parallel), IS (Integer Sort), CG (Conjugate Gradient), LU (Lower-Upper Triangular), MG (Multi-Grid), FT (Fast Fourier Transformation), BT (Block Tridiagonal) and SP (Scalar Penta-diagonal). Among them, EP is a completely CPU-intensive application; BT involves a number of I/O operations; IS, CG, LU, MG and SP (especially IS) need a lot of inter-process communication; and FT is a mixed-type program. For these programs, the scale (class) of the problem can be specified by the user: S, W, A, B, C, D, E, from the smallest to the largest (some classes are not available for some programs). In Section 5.2, an experiment traces the switching-frequency and evaluates the performance of our extended Credit scheduler; for comparison, the results obtained with the original Credit scheduler and with the SEDF scheduler (Tables 2 and 3) are used as baseline data. In Section 5.3, another experiment compares the fair-share and load-balancing capabilities of the original Credit, SEDF and our extended Credit scheduler. Finally, Section 5.4 evaluates the overhead of the scheduler extension.

5.2. Switching-frequency and benchmark execution time

Similar to the previous experiments, four guest domains are configured in this section, with 4 VCPUs and 512 MB memory each. We analyze the performance (i.e., execution time) of the NPB programs in six cases: the two versions of the extended Credit scheduler with dispersive pinning (credit_d′ for Version 1 and credit_d′′ for Version 2), the two versions of the extended Credit scheduler with concentrated pinning (credit_c′ for Version 1 and credit_c′′ for Version 2), the SEDF scheduler with dispersive pinning (sedf_d) and the SEDF scheduler with concentrated pinning (sedf_c). Dispersive pinning here means the 4 VCPUs of each guest domain are pinned to 4 different PCPUs respectively, while concentrated pinning means the 4 VCPUs of the same domain are all pinned to one PCPU exclusively, as discussed before.

Fig. 11. Performance of NPB programs with different schedulers and pinning policies (with dynamic switching frequency scaling extension for Credit scheduler).

Table 3 shows the switching-frequencies of each program with our extended Credit scheduler; the data of the SEDF scheduler are also listed to ease comparison. It can be observed that for the CPU-intensive program (i.e., EP), the switching-frequency of our extended scheduler (both Version 1 and Version 2) is nearly the same as with the original Credit scheduler, and only 1/10 of the tick = 1 case (see Tables 1 and 2). This means the long time slice is always applied to EP (1 s/30 ms = 33.3). For communication- and I/O-intensive programs, shorter time slices are selected most of the time, which results in S around or greater than 1000 Hz in Version 1 and 2000 Hz in Version 2 (for Version 1, 1 s/1 ms = 1000, so more than 1000 Hz implies that preemptions occurred and T_H is less than 1 ms; for Version 2, S would be 2500 Hz if SliceMin were always used, so S < 2500 Hz means multiple time slice lengths are applied and S > 2500 Hz implies preemptions). Context switching with our extended Credit scheduler happens more frequently than in the tick = 1 case, benefiting the performance of communication- and I/O-intensive programs. For the mixed-type program (i.e., FT), both the long time slice and shorter time slices are used; as a result, its switching-frequency is higher than that of the CPU-intensive program and lower than those of the communication- and I/O-intensive programs.

Fig. 11 shows the performance of the NPB programs. Let us focus on the concentrated pinning case, since this is the one we want to improve. With both versions of our extension, the execution time of EP is almost the same as with the original Credit scheduler and less than with SEDF, because there is no redundant context switching. For the other programs, the execution times in Version 1 are improved to the same magnitude as SEDF. Compared with the original Credit scheduler, the performance improvements of Version 1 are significant: 1775% for IS.B, 468% for CG.B, 127% for LU.A, 314% for MG.B, 9% for FT.A, 201% for BT.A, and 284% for SP.A, respectively. Version 2 of our extension behaves even better: all execution times come very close to the SEDF and undercommitted cases, with performance improvements of 3469% for IS.B, 610% for CG.B, 132% for LU.A, 342% for MG.B, 8% for FT.A, 233% for BT.A, and 301% for SP.A, respectively. For all programs except LU, the new Credit scheduler behaves better than the tick = 1 case. As discussed in Section 4.3, for IS, CG, BT and SP the "tick = 1" solution is not quite satisfying; our extension addresses this by bringing additional performance increases of 224%, 14.7%, 13.6% and 12.0% respectively in Version 1, and 517%, 43.3%, 22.7% and 25.7% respectively in Version 2. From this figure, we can also see that dynamic switching-frequency scaling rarely incurs performance penalties in the Credit scheduler's dispersive pinning cases.

5.3. Fair share and load balancing

For analyzing the fair-share capability of our scheduler, consider a configuration with three domains, each with equal weight and one VCPU.



Fig. 12. CPU shares of three domains with different schedulers.

If they run on a machine with 2 PCPUs, it is expected that each of them is allocated around 66.7% of the total CPU resources (200%/3 = 66.7%). In this subsection, we design an experiment to compare our extended Credit scheduler with the original Credit scheduler and the SEDF scheduler. Three guest domains are configured with one VCPU and 512 MB memory each. The platform is forced to boot with only two physical CPUs by adding "maxcpus = 2" to Xen's boot options. Seven NPB programs are used to evaluate the CPU shares.4 The execution times of the NPB programs within each domain are collected; the inverse of the execution time of the program running in a domain is proportional to its CPU allocation. Fig. 12 shows the CPU shares of each domain under the different schedulers. Because of its load-balancing capability, each domain gets approximately a 33% CPU share with the original Credit scheduler (Fig. 12(A)). On the contrary, it is common for the SEDF scheduler to assign one VCPU to the first PCPU and the remaining two VCPUs to the other one; once assigned, the VCPU-PCPU mapping is not changed except by an explicit pinning operation. As a result, in Fig. 12(B) one domain gets around a 50% CPU share while the other two get only 25% each. Meanwhile, our dynamic switching-frequency scaling extension does not break the load-balancing capability, and the three domains still share the CPU resources in a fair manner (Version 1 in Fig. 12(C) and Version 2 in Fig. 12(D)).
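For reference, the way we read Fig. 12 can be reproduced with a small calculation: the share of each domain is taken as proportional to the inverse of its program's execution time. The normalization to a 100% total below is our assumption about how the figure is plotted, and the execution times are made-up examples, not measured values.

```c
/* Sketch: derive relative CPU shares from per-domain execution times,
 * normalized so the three shares sum to 100% (our assumption about Fig. 12). */
#include <stdio.h>

int main(void)
{
    double t[3] = { 300.0, 310.0, 305.0 };  /* example execution times (s), made up */
    double inv_sum = 0.0;
    int i;

    for (i = 0; i < 3; i++)
        inv_sum += 1.0 / t[i];
    for (i = 0; i < 3; i++)
        printf("DomU-%d share: %.1f%%\n", i + 1, (1.0 / t[i]) / inv_sum * 100.0);
    return 0;
}
```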

4 NPROCS = 1 for each program; IS is not used here because it needs more than one process. Class A is used for FT because FT.B needs more than 512 MB memory.

Fig. 13. The relative execution time of NPB programs with extended Credit scheduler.

5.4. Overhead

The extension of the Credit scheduler (mostly the behavior monitor) unavoidably brings some overhead, so an experiment is conducted in this subsection to quantify it. We configure one PV guest domain with 4 VCPUs and 512 MB memory. To obtain a more stable result, the VCPUs of the PV domain are pinned to 4 cores, while those of Domain-0 are pinned to the others. We enable the behavior monitor but disable the variable time slice, since the variable time slice changes the scheduling policy and the results would then not reflect the overhead of the behavior monitor alone. Fig. 13 shows the relative performance of the NPB programs with our extended Credit scheduler (the execution time with the default Credit scheduler is taken as 100%; Credit' and Credit'' denote Version 1 and Version 2 of the extended scheduler, respectively). We can see that the overhead brought by the behavior monitor is less than 2%. For FT.A in Version 2, whose overhead is the largest, the execution time is only 1.6% longer than with the default scheduler; for the other programs, the execution times with the two schedulers are very close (sometimes the extended scheduler even behaves slightly better, e.g., EP.B and CG.B in Version 1; we regard this as random variation since the differences are very small).


The behavior monitor of the second version of the extended scheduler is more complex than that of Version 1, so the overhead of Version 2 is larger for most programs. The low overhead of our solution enhances its applicability.

6. Conclusion

This paper has identified the problem that the performance of communication- and I/O-intensive concurrent applications decreases steeply in overcommitted domains with Xen's Credit scheduler. To address this problem, we traced the context switching of both the Credit and SEDF schedulers, and found that the key to improving performance is to reduce the ineffective holding time. On the basis of this observation, we proposed a solution that monitors VCPUs' behaviors and uses variable time slices to scale the switching-frequency dynamically. The proposed solution is implemented as an extended version of the Credit scheduler. With our extended scheduler, the performance of communication-intensive and I/O-intensive concurrent applications in overcommitted domains is improved to be close to the undercommitted case, while the performance of CPU-intensive applications remains the same as with the original Credit scheduler. The fair-share feature of the Credit scheduler is also maintained.

Acknowledgment

The work is supported by the National 973 Basic Research Program of China under grant No. 2007CB310900.


References

[1] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, A. Warfield, Xen and the art of virtualization, in: Proc. of the 19th ACM Symposium on Operating Systems Principles, SOSP'03, October 2003, pp. 164-177.
[2] T.K. Samuel, G.W. Dunlap, P.M. Chen, Operating system support for virtual machines, in: Proc. of the USENIX Annual Technical Conference, USENIX'03, June 2003, pp. 71-84.
[3] D. Ongaro, A.L. Cox, S. Rixner, Scheduling I/O in virtual machine monitors, in: Proc. of the 4th International Conference on Virtual Execution Environments, VEE'08, Seattle, WA, March 2008.
[4] Intel Corporation, Intel architecture developer's manual, volume 3, Appendix A (243192), 1999.
[5] Intel Corporation, Intel architecture optimization reference manual (730795001), 1999.
[6] G. Southern, Symmetric multiprocessing virtualization, Master's thesis, George Mason University, 2008.
[7] VMware Inc., VMware ESX Server 3.5 datasheet. Available at: http://www.vmware.com/products/vi/esx/.
[8] L. Cherkasova, D. Gupta, A. Vahdat, Comparison of the three CPU schedulers in Xen, ACM SIGMETRICS Performance Evaluation Review 35 (2) (2007) 42-51.
[9] S. Govindan, A.R. Nath, A. Das, B. Urgaonkar, A. Sivasubramaniam, Xen and Co.: communication-aware CPU scheduling for consolidated Xen-based hosting platforms, in: Proc. of the 3rd International Conference on Virtual Execution Environments, VEE'07, June 2007, pp. 126-136.
[10] V. Uhlig, J. LeVasseur, E. Skoglund, U. Dannowski, Towards scalable multiprocessor virtual machines, in: Proc. of the 3rd Virtual Machine Research & Technology Symposium, VM'04, San Jose, CA, May 2004.
[11] H. Kim, H. Lim, J. Jeong, H. Jo, J. Lee, Task-aware virtual machine scheduling for I/O performance, in: Proc. of the 2009 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, VEE'09, Washington, DC, USA, March 2009.
[12] C. Weng, Z. Wang, Mi. Li, X. Lu, The hybrid scheduling framework for virtual machine systems, in: Proc. of the 2009 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, VEE'09, Washington, DC, USA, March 2009.
[13] P.M. Wells, K. Chakraborty, G.S. Sohi, Hardware support for spin management in overcommitted virtual machines, in: Proc. of the 15th International Conference on Parallel Architectures and Compilation Techniques, PACT'06, Seattle, Washington, USA, September 16-20, 2006.
[14] OProfile: a system-wide profiler for Linux systems. http://oprofile.sourceforge.net/.
[15] A. Menon, J.R. Santos, Y. Turner, Diagnosing performance overheads in the Xen virtual machine environment, in: Proc. of the 1st International Conference on Virtual Execution Environments, VEE'05, June 2005, pp. 13-23.
[16] Xenoprof, Hewlett-Packard Development Company. http://xenoprof.sourceforge.net/.
[17] NAS Parallel Benchmarks, NASA. http://www.nas.nasa.gov/Software/NPB/.
[18] D. Gupta, L. Cherkasova, R. Gardner, A. Vahdat, Enforcing performance isolation across virtual machines in Xen, in: Proc. of the ACM/IFIP/USENIX 2006 International Conference on Middleware, Middleware'06, November 2006, pp. 342-362.
[19] H. Jin, M. Frumkin, J. Yan, The OpenMP implementation of NAS parallel benchmarks and its performance, NAS Technical Report NAS-99-011, NASA Ames Research Center, Moffett Field, CA, 1999.
[20] MPICH, Argonne National Laboratory. http://www-unix.mcs.anl.gov/mpi/mpich/.

Huacai Chen is a Ph.D. candidate at Huazhong University of Science and Technology (HUST), China. He is a member of the Services Computing Technology and System Lab and the Cluster and Grid Computing Lab in the School of Computer Science and Technology, HUST. His major research interests are system virtualization, high-performance computing (HPC), and operating systems.


Hai Jin is a professor of Computer Science and Engineering at Huazhong University of Science and Technology (HUST) in China. He received his Ph.D. in computer engineering from HUST in 1994. He worked for the University of Hong Kong (1998-2000) and participated in the HKU Cluster project, and was a visiting scholar at the University of Southern California (1999-2000). He is now the Dean of the School of Computer Science and Technology at HUST and the chief scientist of both ChinaGrid and Virtualization Technology for Computing System (China 973 Project).

Kan Hu received a Ph.D. degree in computer software and theory from Huazhong University of Science and Technology (HUST), China, in 2007. He is now an associate professor in the School of Computer Science and Technology at HUST. His research interests are in the areas of virtualization, database systems, and high-performance computing (HPC).

Jian Huang received a Master's degree from Huazhong University of Science and Technology (HUST), China, in 2009. He is now a graduate student in the Computer Science and Engineering Department at the Ohio State University, and a member of the Network-Based Computing Laboratory led by Professor D.K. Panda. His primary research interests include high-performance computing, distributed systems, networking and system virtualization.