Design and implementation of energy-aware application-specific CPU frequency governors for heterogeneous distributed computing systems


Future Generation Computer Systems journal homepage: www.elsevier.com/locate/fgcs

Michał P. Karpowicz ∗, Piotr Arabas, Ewa Niewiadomska-Szynkiewicz

NASK Research Institute, Wawozowa 18, 02-796 Warsaw, Poland
Institute of Control and Computation Engineering, Nowowiejska 15/19, 00-665 Warsaw, Poland

Highlights

• Design of application-specific energy-aware CPU controller is presented.
• Application-specific CPU controllers may outperform standard Linux CPU governors.
• Benchmarking methodology is proposed to identify models of CPU workload dynamics.
• Server power consumption estimate based on MSR-based measurements is proposed.

Article info

Article history: Received 17 December 2015; Received in revised form 10 March 2016; Accepted 5 May 2016; Available online xxxx

Keywords: Green computing; DVFS; Data centers; Optimal control; System identification; Linux

Abstract

This paper deals with the design of application-specific energy-aware CPU frequency scaling mechanisms. The proposed customized CPU controllers may optimize the performance of data centers in which diverse tasks are allocated to servers with different characteristics. First, it is demonstrated that server power usage can be accurately estimated based on measurements of CPU power consumption read from the model specific registers (MSRs). Next, a benchmarking methodology derived from the RFC2544 specification is proposed that allows models of CPU workload dynamics to be identified. Finally, it is demonstrated how the identified models can be applied in the design of customized energy-aware controllers that dynamically adjust the CPU frequency to application-specific workload patterns. According to the results of the experimental studies, the customized controllers may outperform the standard general-purpose governors of the Linux kernel both in terms of reachable server performance and power saving capabilities.

© 2016 Elsevier B.V. All rights reserved.

1. Introduction

To meet the demand for data processing, a variety of computing resources have been combined in large-scale heterogeneous distributed systems. This trend in technology development gives rise to a broad spectrum of computer engineering challenges, involving security, scalability, mobility, fault tolerance, energy-efficiency and performance. The design of efficient resource allocation and task scheduling algorithms has been at the cutting edge of the related research efforts [1,2].

✩ This research was partially supported by the National Science Centre (NCN) under the Grant No. 2015/17/B/ST6/01885. ∗ Corresponding author at: NASK Research Institute, Wawozowa 18, 02-796 Warsaw, Poland. E-mail address: [email protected] (M.P. Karpowicz).

http://dx.doi.org/10.1016/j.future.2016.05.011 0167-739X/© 2016 Elsevier B.V. All rights reserved.

As the support of cloud services and HPC applications in data centers requires an ever-increasing amount of electrical power, energy-efficiency has become an important factor shaping the growth of new data processing technologies. The annual cost of energy consumed by data centers was estimated at $27 billion in 2011 (www.idc.com). According to recent estimates, the share of total electricity consumption attributed to ICT infrastructures increased from about 3.9% in 2007 to 4.6% in 2012. Computing efficiency levels of 50 GFLOPS/W are expected to be reached within economically reasonable power consumption levels, currently ranging from 20 MW to 40 MW [3–5]. These limits are correlated with the costs of electricity and power provisioning. It is indeed commonly accepted that the problem of power consumption needs to be properly addressed if the growth rate of ICT is to be sustained [6–9].

A straightforward way to reduce power consumption in data centers is to take advantage of the hardware-specific power saving capabilities and to adjust the operating states of the computing elements to their variable workload [10–12]. These capabilities are typically exposed to the operating system through the ACPI-compatible interfaces (www.acpi.info). Power consumption of the processor (CPU) can be controlled with dynamic voltage and frequency scaling (DVFS) mechanisms capable of switching the rate of instruction execution (ACPI P-states) and the power saving levels (ACPI C-states) during idle periods [13]. In the Linux operating system these mechanisms are provided by two kernel modules (www.kernel.org). The CPU frequency scaling process is controlled by the governors of the cpufreq module [14,13,15]. Behavior of the CPU during idle periods is independently and simultaneously determined by the cpuidle module [16]. The modules also provide dedicated drivers, such as intel_pstate, designed for specific CPU architectures.

Each governor of the cpufreq module implements a frequency switching policy. Several standard built-in ACPI-based governors are available in recent versions of the Linux kernel. The governors named performance and powersave keep the CPU at the highest and the lowest processing frequency, respectively. The userspace governor permits user-space control of the CPU frequency. Finally, the default energy-aware governor, named ondemand, dynamically adjusts the frequency to the observed variations of the CPU workload.

This paper deals with the problem of CPU frequency control policy design and implementation in Linux-based servers supporting cloud services and HPC applications in heterogeneous computing systems. The context of this research, as well as the motivation for the proposed approach, is presented in Section 2. In order to illustrate the role that the CPU governors play in the resource allocation and job scheduling process, a general structure of a data center control system is discussed.
It is next proposed that the efficiency of data processing could be improved if application-specific CPU controllers were submitted to the computing nodes along with the scheduled batch of jobs. This way both the energy consumption and the performance of the cluster nodes could be adjusted to the expected or forecasted workload patterns. The related problems of workload identification and controller design are discussed in the following sections. The problem of power consumption modeling is addressed in Section 3. Precisely, it is demonstrated how the power consumption profiles of different types of applications can be derived from the high-resolution power usage measurements read from the CPU model specific registers (MSRs). In Section 4 the problem of CPU workload dynamics identification is addressed. In particular, a dedicated benchmarking methodology is proposed together with the design of the required high-resolution probes. Finally, in Section 5 it is shown how the developed models of power consumption and processing dynamics may support the design of dedicated energy-aware controllers efficiently adjusting the CPU frequency to the workload generated by a specified class of applications.

Overview of results

It is known that the power consumption of a server depends on the workload generated by performed operations. Since these operations use computing resources required by the executed applications, it may be possible to design customized controllers that optimize the energy consumption and performance of a server under application-specific workload profiles. This paper shows how the above concept can be realized and how the related engineering problems can be addressed in the environment of the Linux kernel. The addressed problems involve power consumption modeling, identification of CPU workload dynamics, and controller design and implementation. The presented conclusions are supported by the results of experimental studies.
Implementation details of the developed software are given as well.




First, it is demonstrated that, in the absence of external power meters operating at a high sampling rate, the total power usage of a server can be accurately estimated based on high-resolution measurements read from the processor's MSRs. Experimentally identified polynomial models of low order are presented that describe the power consumption profiles of different types of applications.

Second, in order to identify models of CPU workload dynamics, a general-purpose benchmarking methodology is proposed based on the RFC2544 specification. The proposed methodology identifies the packet filtering dynamics of the libpcap module in the Linux kernel. It is experimentally verified that accurate models of data processing dynamics can be estimated this way based on observations collected from customized kernel probes. A conjecture is also made that appropriately designed benchmarking excitation signals, characterized by a specified spectral profile, could be used to identify application-specific models of data processing.

Third, it is demonstrated how the identified models can be applied in the design of customized energy-aware controllers that dynamically adjust the CPU frequency (ACPI P-state) to application-specific workload patterns. For this goal, a stochastic control problem is formulated and solved numerically. The optimal solution to the problem is a control policy that minimizes the long-run average cost of data processing operations. The policy has the form of an application-specific frequency switching table parameterized by the server power consumption profile and the workload model introduced in the control problem formulation. Implementation of the obtained switching table in the form of a cpufreq governor is next discussed. Performance of the governor was tested in a series of webserver benchmarks and compared to the performance of the standard governors provided by the Linux kernel.
The obtained results show that, in comparison to the ondemand governor, the designed controller reduces the power consumption of the server while improving its performance or keeping it at a similar level. In particular, it is demonstrated that the customized application-specific CPU controller, tailored to heterogeneous computing environments, may outperform the general-purpose controllers, both in terms of performance and power usage efficiency.

Related work

There has been a large volume of research in the area of energy-efficient control in data centers and networks. Several recent projects may be mentioned. The ECONET project (www.econetproject.eu), co-developed by the authors of this paper, introduced dynamic power and performance control technologies, based on standby and performance scaling capabilities, improving the energy-efficiency of wired network devices. Integration of the activities of major European players in networking, focused on the design of energy-efficient, scalable and sustainable future networks, was facilitated within the TREND project (www.fp7-trend.eu). The GAMES project (www.green-datacenters.eu) considered innovative methodologies, metrics, Open Source ICT services and tools for the active management of energy efficiency of IT Service Centres. The DEEP project (www.deep-project.eu), an innovative European response to the Exascale challenge, was focused on a novel supercomputing architecture with a matching software stack and a set of optimized grand-challenge simulation applications. The DEEP-ER project (www.deep-er.eu) extended the architecture of the DEEP project by a highly scalable I/O system. Furthermore, the project introduced new memory technology to provide increased performance and power efficiency. The CRESTA project (www.crestaproject.eu) explored how the Exascale challenge can be met by building and exploring appropriate system software for Exascale platforms. The Mont-Blanc project (www.montblanc-project.eu) was focused on the design of a next-generation HPC system, embedded technologies and the development of Exascale applications. Finally, the problem of high-performance modeling and simulation for Big Data applications in heterogeneous distributed systems is addressed by the cHiPSet project (chipset-cost.eu).

An extensive survey of control engineering solutions applied in software design domains, as well as an interesting taxonomy of research directions, can be found in [17]. Designs of power budgeting control systems for data centers are proposed in [18–20,11]. The proposed systems adjust the power consumption of computing nodes in order to keep the power consumption of the entire cluster within the required bounds. Quality of services provided by the application layer is included in the design of power budgeting control systems in [21,22]. These systems coordinate the distribution of power among virtual machines within a given peak power capacity while tracking dynamic power availability and workload dynamics. To respect application performance boundaries, a combination of DVFS and CPU time allocation is used. The workload variations and power capacity changes are handled by feedback controllers. The problem of reducing the cooling energy consumption with DVFS and load balancing mechanisms is addressed in [23]. In [24] a power optimization strategy is proposed that jointly minimizes the total power consumption of servers and the data center network. Algorithms predicting the number of virtual machine requests arriving at cloud data centers, along with the amount of CPU and memory resources associated with each of these requests, are proposed in [25]. The problem of low server utilization in data centers running I/O-intensive applications is addressed, e.g., in [26]. To adjust the CPU frequency, a feedback controller is proposed that relies on energy-related system-wide measurements rather than only on CPU utilization levels.
Experimental architectures of control systems for energy-aware networks are discussed in [27–31]. In [32] a design of a performance-optimizing controller is proposed for a single application server. The paper describes the use of MIMO techniques to track desired CPU and memory utilization while capturing the related interactions between CPU and memory. A system identification approach to server workload modeling is presented in [33], together with the controller design. The decoding rate control based on DVFS, applied in multimedia-playback systems, is discussed in [34]. In [35] it is demonstrated how inherent complexities, such as multiple cores, hidden device states, and large dynamic power components, can result in high prediction errors of linear models. A DVFS-based technique that makes use of adaptive update intervals for optimal frequency and voltage scheduling is proposed in [36]. In this case the optimal control strategy is constructed in order to meet the workload processing deadlines given the workload arrival time. In [37] a technique is proposed to reduce memory bus and memory bank contention by DVFS-based control of thread execution on each core. A process identification technique applied for the purpose of CPU controller design is presented in [38]. Based on stochastic workload characterization, a feedback control design methodology is developed that leads to stochastic minimization of performance loss. The optimal design of the controller is formulated as a problem of stochastic minimization of the runtime performance error for selected classes of applications. In [39] a supervised learning technique is used in DVFS to predict the performance state of the processor for each incoming task and to reduce the overhead of state observation.

2. Data center control: an overview

Data centers respond to random streams of incoming requests by allocating computing resources and scheduling the execution of jobs subject to negotiated quality of service constraints and power consumption bounds. Therefore, the cluster control system is required to coordinate a large number of individual power and performance control loops operating at the level of cluster nodes. In order to exploit the increasing capabilities of the underlying hardware, the interacting elements of the control structure should dynamically identify and properly adjust the power states of the available resources, taking into account the provisioned power budget. The control mechanisms should respond to the observed workload and track the actions originating from the cluster resource and job management system.

Fig. 1 presents an overview of a data center control system architecture; cf. [40–42]. The racks, supplied with electric power by power distribution units (PDUs), are filled with blade servers. The racks are connected into a data center network with a hierarchy of switches (SW). The management (or upper control) layer is responsible for allocation of resources, job submission, adjustments of the interconnect settings, power budgeting and system monitoring. These tasks are executed by dedicated systems of resource allocation and job management (RJMS) and system-wide energy management (SEM). The lower (or direct) control layer is responsible for job execution and enforcement of resource usage constraints in computing nodes. The consumable resources allocated to jobs, exposed via customized APIs, include CPU cores, memory and sockets.

The challenging problem of data center control is to maximize the utilization of computing resources subject to energy consumption and quality of service constraints by means of the available hardware controllers. Various power budgeting mechanisms are used to keep the total power consumption of the data center within the available power range. The resources that remain idle can be switched to a power saving (or sleeping) mode for a configurable period of time and restored to operating mode on demand. The resources performing operations can reduce their power consumption by dynamically adjusting their performance level; cf. [28,27].

In Linux-based clusters the cpufreq governors (provided by the Linux kernel) are important and commonly applied mechanisms of energy-aware CPU performance control. Currently available resource allocation and job scheduling systems provide functions that allow the module to be configured for the purpose of job execution. The submitted configuration settings typically include the requested CPU frequency or the frequency scaling governor to be used for the submitted job. In the following sections it is argued that these mechanisms can be further developed to improve the energy-efficiency of computing nodes. Namely, along with the scheduled batch of jobs, application-specific CPU controllers, or appropriately defined control policies, could be submitted to the computing nodes and servers. The controllers could be implemented in the form of loadable kernel modules and installed in the servers on demand. Another possible solution is to submit an application-specific control policy, i.e. a frequency switching table, directly to an appropriately designed configurable governor. This way computing performance could be tailored to the heterogeneous environment and the identified workload patterns of the submitted jobs. The main scientific challenge of our approach is to accurately identify performance profiles and workload dynamics of applications, and to effectively exploit the obtained models in the controller design. In the following sections several solutions to these problems are presented and discussed.

3. Power consumption model

Hardware-specific power consumption adjustment mechanisms are typically exposed to the operating system through the ACPI (www.acpi.info) compatible interfaces. These mechanisms




Fig. 1. An overview of cluster control system architecture.

Fig. 2. Relationship between CPU clock frequency and power consumed by a processor and the server.

are exploited by data-center control systems to keep the power usage within the desired range. The power consumption of CPUs can be controlled with DVFS mechanisms through the adjustment of the processing ACPI P-states and the power saving ACPI C-states. Fig. 2 illustrates an example of a server workload trace revealing the relationships between the CPU frequency, the CPU power consumption and the total server power consumption. The experiment shows that the contribution of the CPU reaches approximately 40% of the total power usage, which confirms the known dominant role the CPU plays in the server power consumption profile [5,43,44]. This observation also motivates efforts aimed at the design of energy-aware CPU frequency controllers adapting the processing rate to the observable short-term workload.

Identification of the server power consumption model, which is necessary for the purpose of energy-aware controller design, requires accurate and high-resolution measurements. Unfortunately, these are often difficult to collect from standard power metering devices [45,46]. An alternative solution to this problem is to read power usage measurements (e.g. RAPL energy counters) directly from the MSRs available in recent CPU architectures. This approach allows CPU power usage to be measured with a high sampling rate and accuracy. However, the main problem in this case is that only a partial view of the server power consumption profile is available this way. The overall power consumption depends on the variable set of computing resources engaged in data processing, which in turn is correlated with the application profile of the server [47]. In order to determine the relationship between the power measurements collected from the MSRs and the measurements of the total power consumption of the server, various application scenarios need to be considered and analyzed. In the following subsections two such scenarios are presented, i.e., a computing server carrying out intensive arithmetic operations and a video transcoding server scaling the rate of transmitted streams. The results of the conducted experiments show that the server power consumption profile may be estimated from the MSR-based measurements by polynomial models of the following form:

p(w) = \sum_{k=0}^{m} \alpha_k w^k,    (1)

where w denotes the power read from the MSRs, p(w) the total power consumption, and \alpha_k, k = 0, \ldots, m, the model coefficients.

The obtained results allow application-specific power consumption profiles to be included in the CPU controller design procedure and suggest that controllers optimized for specific usage scenarios may be capable of increasing the efficiency of power usage above the levels provided by general-purpose controllers, such as the default Linux ondemand governor, cf. [15]. Finally, information regarding the expected power consumption profile can be included in the task scheduling mechanisms of cluster management systems.

3.1. Experimental setting: power consumption estimation

An experimental testbed was built and set up to support the presented research. It was formed of six PC Linux-based servers (Fedora 18, Linux kernel 3.6), equipped with an Intel i7 processor, 8 GB of RAM, and quad 1 Gb/s Ethernet cards (Broadcom BCM5719), with the topology depicted in Fig. 3. Server D0 played the role of the system under test; servers S1, S2, S3 and S4 were used as traffic generators and analyzers. In addition, a Windows-based server MS, powered by an Intel Xeon processor, with 8 GB of RAM and a MySQL database, was connected to the power meter PM and used to record power samples from the electricity meter connected directly to server D0. The presented setup allowed multiple experiments to be performed for various scenarios concerning computation loads, transmissions of video streams and real-time packet filtering.

Fig. 3. Testbed network.

The applied power meter (PM) allowed measurements of the total server power consumption to be collected during time intervals of approximately 8 s with a resolution of 1 W. In addition, a modified version of the power_gov¹ program was used to attain a high rate of MSR reading. The modifications covered sampling rate adjustment, which was set to 10 ms.

3.2. Computing server model

To identify the power consumption model suitable for this scenario, a number of power measurements with varying computational workload had to be performed. A modified version of the stress² benchmark was used to generate the desired workload.

Table 1
Parameters of the modified stress benchmark.

Idle time (µs)    Workload (%)
96 000            3
48 000            4
24 000            10
12 000            19
6 000             30
3 000             50
2 000             60
1 200             70
700               80
160               89
10                95
0                 100
The modification allowed the intensity of operations to be varied, which was achieved by inserting a sleep() instruction into the inner benchmark loop. As presented in Table 1, suspending computations for a specified time interval provided an instrument for reducing the mean processor workload. The second parameter used to differentiate the workload was the number of concurrently running threads, ensuring utilization of all processor cores. To identify the effects of CPU frequency adjustment, the default ondemand governor was activated, with its sampling period set to 100 ms. It must be noted that the CPU power measurements were taken every 10 ms, whereas the total power consumption measurements were collected every 8 s. Despite the different time scales, a close correlation between the total power consumption and the CPU power can be seen in Fig. 2.

¹ software.intel.com/en-us/articles/intel-power-governor.
² people.seas.harvard.edu/~apw/stress/.

Fig. 4. Comparison of the relationship between the total power consumed by the computer and the power consumed by the processor.

Table 2
Parameters of the modified validation experiment.

Idle time (µs)    Workload (%)
32 000            8
10 000            22
4 500             42
1 200             71
90                90
5                 96

To identify a more general relationship, long-term measurements were compared. Every point in Fig. 4 presents the results of one of a hundred runs of the stress benchmark; each is an average over seventy measurements of the total power and 60,000 samples of the processor power. It may easily be seen that the samples can be approximated by a linear function. Due to the vast amount of samples, the parameters of model (1) may be easily tuned using the least squares method, resulting in \alpha_0 = 27.33, \alpha_1 = 1.46 and \alpha_k = 0 for k = 2, \ldots, m. To validate the model, additional experiments were conducted for the benchmark parameters given in Table 2. A separate set of N samples was collected to allow the quality of fit to be assessed by means of the mean absolute error e_0, calculated according to the formula:

e_0 = \frac{1}{N} \sum_{i=1}^{N} |p(w_i) - p_i|,    (2)

where w_i denotes the ith sample of the processor power read from the MSRs, p_i the measured value of the total power consumption, and p(w_i) the estimated total power consumption.

Fig. 5 illustrates the results of the model validation, with ±5% error margins indicated by the lightly colored area. The model line fits within these bounds with an absolute error (2) not greater than 3.1%. Such accuracy seems sufficient for most control applications. A much higher error may be observed for only one sample, possibly associated with the wake-up of one of the system processes, e.g. file indexing using the hard disk intensively.

Fig. 5. Model validation for the computing server scenario. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

3.3. Video transcoding server model

Transcoding of video data is a high-workload task. It is processor-intensive and requires the frames to be emitted according to a rigorous time schedule to avoid quality deterioration. In the performed experiments the video flows were streamed and then transcoded with the mencoder (www.mplayerhq.hu) program. To forward the traffic streams a netcat tunnel was used. Both mencoder and netcat were run in user space, bypassing the kernel-level forwarding mechanism. This reduced the use of the Ethernet board offloading mechanisms and increased the CPU workload. The computing capacity of the target server allowed only up to six streams to be transcoded, while forcing the network ports to work well below their nominal speed (the bit rate of a single stream was approximately 20 Mb/s).

Fig. 6. Relationship between total power consumed by the computer and power consumed by the processor for the video transcoding server scenario.

Fig. 6 presents the measured power profile. When confronted with the previously discussed measurements (see Fig. 4), a significant difference can be noticed. The data points visibly follow a concave pattern, which suggests the existence of a saturation effect for a larger number of streams. The occasionally observed frame delay may indicate that the CPU processing limit was reached. Due to the application of large receive buffers the playback was not impaired; however, it was not possible to reliably transmit seven or more streams. A linear characteristic could be seen for workloads generated by up to four flows, which may also be correlated with the number of four processor cores. The saturation effect became visible for a larger number of flows, suggesting that the source of delays may be thread switching and concurrent memory access. Consequently, contrary to the computing server case, a polynomial model of higher order is required. For somewhat less demanding applications it may be valuable to restrict the number of streams to four in the considered setting, which justifies a simpler linear model. Table 3 presents the identified coefficients for both cases.

Table 3
Video transcoding server model coefficients.

Range          α_0      α_1     α_2
Full           10.6     4.43    −0.06
1–4 streams    28.49    2.21    0

Fig. 7 shows a sufficiently good fit of the quadratic model. Both models can be easily identified based on a relatively small amount of data. Furthermore, the models are also characterized by desirable disturbance and noise filtering properties. Validation of the estimated models was performed with streams of slightly (approx. 20%) lower rate, as illustrated in Fig. 7. It may be observed that both models fit well within the 5% margin (light blue), provided that the correct range is analyzed.

Fig. 7. Model validation for the video transcoding server. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

4. Data processing model

Identification of server performance dynamics is the next essential step in the controller design procedure. Due to the complexity of data processing operations, involving interactions of multiple computing elements in the stochastic environment of the operating system, it is clearly a challenging engineering problem. Indeed, many attempts, e.g. [48–50,32], have recently been made to address the key questions that need to be resolved in this context, including implementation of system performance observers, design and implementation of informative experiments, model selection, estimation and validation. Many commonly available benchmarks have also been used, providing valuable information about various aspects of server and cluster performance [51,52]. However, for the specific purpose of high-resolution dynamic model estimation a much more flexible approach is required.
Namely, it is necessary to generate variable system workload persistently exciting all the computing elements engaged and, therefore, revealing dynamics of data processing operations in the observed system outputs. Following the system identification methodology a design of RFC2544 compliant [53,54] experiment can be proposed in which the target server is forced to filter multiple streams of UDP packets incoming at bit rate characterized by a specified spectrum profile. Packet filtering is both CPU and memory intensive operation which


involves numerous data structures and procedures of the operating system network stack. For the purpose of processing dynamics estimation we propose to extend the benchmarking methodology by introducing experiments with packet streams incoming at a rate characterized by a desired, application-specific spectral profile. Consequently, the frequency response characteristics of the target server may be observed and used for model identification. The proposed approach supports the estimation of models that may be used to design CPU controllers optimizing the performance of webservers, firewalls or intrusion detection systems.

4.1. Experimental setting: processing dynamics estimation

The testbed presented in Section 3.1 was configured for the purpose of CPU workload dynamics estimation. In a series of experiments the target server, D0, was set up to operate with a fixed CPU frequency, ranging from 1.6 to 3.4 GHz, while forwarding and processing streams of UDP packets generated by the directly interconnected nodes (S1 and S2) at rate w(t) with amplitude Ak = A = 100r Mbps, r = 1, . . . , 10. Two full duplex flows were crossing the D0 node, each one using a different pair of network ports; cf. [53]. A specifically designed kernel probe was implemented to observe the performance of the packet filtering libpcap module with a sampling rate of fs = 100 Hz. The following measurements were recorded:

• CPU workload expressed as a percentage of the current processing frequency,
• CPU workload expressed in MHz as the frequency of instruction execution,
• CPU power consumption (RAPL MSRs),
• number of packets that passed the kernel packet filter (ps_recv statistic of libpcap),
• number of packets that passed the kernel packet filter and were dropped due to exhausted buffer space (ps_drop statistic of libpcap).


Listing 1. Performance probe implementation overview

/* ... */
int input_observer_run(struct sk_buff *skb, struct genl_info *info)
{
        /* ... */
}

int output_observer_run(struct sk_buff *skb, struct genl_info *info)
{
        /* ... */
}

/* ... */
static enum hrtimer_restart sampling_function(struct hrtimer *unused)
{
        if (send_signal_to_app() < 0) { /* ... */ }
        get_cpu_load();
        sampling_counter++;
        if (sampling_counter >= MAX_SAMPLES) {
                return HRTIMER_NORESTART;
        } else {
                hrtimer_forward_now(&cpuload_probe, ktime);
                return HRTIMER_RESTART;
        }
}

/* ... */
int init_module(void)
{
        /* ... */
        ktime = ktime_set(0, MS_TO_NS(sampling_rate));
        hrtimer_init(&cpuload_probe, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
        cpuload_probe.function = &sampling_function;
        hrtimer_start(&cpuload_probe, ktime, HRTIMER_MODE_REL);
        /* ... */
}

/* ... */
module_param(sampling_rate, long, S_IRUSR);
MODULE_PARM_DESC(sampling_rate, "Sampling rate");
module_param(app_pid, long, S_IRUSR);
MODULE_PARM_DESC(app_pid, "Application PID");

Details regarding the probe implementation are discussed in the next subsection. It should be pointed out that the designed experiments allowed us to observe data processing dynamics under a fixed CPU frequency. In the setting considered, under the applied sampling rate, workload variations (or transient responses) caused by a sequence of CPU frequency switching operations were not observed.

4.2. Processor workload probe

Packet filtering operations can be observed at a high sampling rate by a probe operating at the operating system kernel level. The probe can be implemented in many ways, see e.g. [55–59]. For the particular purpose of CPU workload monitoring we based our probe design on the cpufreq kernel module [14]. The probe was designed to collect measurements at the default sampling rate of fs = 100 Hz, equal to that of the cpufreq module. The measurements were started when the probe module was inserted into the kernel and were completed with its removal. The collected measurements were reported to the system log file (/var/log/messages). Listing 1 gives an overview of the implementation.

A dedicated netlink channel was also implemented between the libpcap-based process (tcpdump) and the probe in order to collect the measurements of packet capture performance. At each sampling instant the probe (Listing 1) sent a signal (send_signal_to_app()) to the observed tcpdump process to collect its performance statistics. The communication channel was established by the operations input_observer_run() and output_observer_run(), which were also responsible for data processing and storage. Listing 2 presents the netlink socket configuration.

4.3. Design of excitation signals

To excite the packet filtering server in the desired frequency band a multi-sine signal can be used. There are many well known methods to construct such inputs, see e.g. [60,61]. In the course of the conducted experiments, signals with Schroeder phases proved to be convenient for traffic generation purposes. More specifically, the following family of signals was used:

w(t) = Σ_{k=1}^{Nf} Ak sin(2π fk t + ϕk),   (3)

t = l Ts, Ts = 1/fs, fk = k fs/Nt, l = 0, . . . , Nt − 1,
Nf = 30, Nt = 1000, ϕk = −k(k − 1)π/Nf, k = 1, . . . , Nf,

where Nf denotes the number of frequencies in the multi-sine signal and Nt the number of samples. Based on the designed pattern, w(t), a traffic generation scenario was constructed. It is worth pointing out that the particular choice of frequency band may be bounded by the resolution of measurements and the capabilities of the traffic generation software. An alternative way to generate network traffic would be to replay a properly filtered traffic trace from a .pcap file, with the filter (a bandpass system) designed to excite the system at the frequencies of interest.

4.4. Model identification and validation

Fig. 8 illustrates a typically observed outcome of an experiment in which the server is forced to filter packets with the CPU operating at a fixed frequency. The network packets can be seen to arrive at the (kernel) filter in bursts. The correlated CPU workload is characterized by lower variance, which can be explained by the buffering and job scheduling operations performed by the kernel.




Fig. 8. Example of server processing dynamics. Top: CPU workload (output, % of processing frequency). Bottom: libpcap filtering rate (input, pps).

Fig. 9. Comparison of CPU response dynamics under various CPU frequencies. Top: CPU workload (%), packet drop rate (pps), CPU power usage (W) (smoothed). Bottom: libpcap filtering rate (pps) (smoothed).

Comparison of the response dynamics, packet drop rate and power consumption under various CPU frequencies is illustrated in Fig. 9. As expected, on average the CPU workload and the packet drop rate decrease with the CPU frequency, whereas the power consumption increases with it. Occasionally severe fluctuations can be observed both in the CPU workload and in the packet drop rate. This effect can be, at least partially, explained by unpredictable outcomes of kernel operations, such as context switching and memory access operations, effectively blocking (preempting) packet capture instructions. Furthermore, sudden filtering break-downs can be observed which can be correlated with the CPU saturation periods; cf. [62].

Fig. 10 presents the autocorrelation function (ACF) and the partial autocorrelation function (PACF) of the observed CPU workload, Yt, and of its first differences, Yt − Yt−1. These results suggest that the process dynamics may be described by low-order moving average models, possibly with noise integration included. The key observation that follows is that the effects of packet processing remain visible in the CPU workload throughout the applied sampling period.

Based on the collected results, linear state–space and polynomial (input–output) models were estimated, each one separately for a fixed CPU frequency, to describe the dynamics of the CPU workload. Acceptable outcomes were obtained with models of low order with noise integration. An example of a Box–Jenkins model obtained for a fixed frequency, u = 3.1 GHz, is presented below:

y(t) = [B∗(q⁻¹|u) / F∗(q⁻¹|u)] w(t) + [C∗(q⁻¹|u) / (D∗(q⁻¹|u)(1 − q⁻¹))] e(t),   (4)

B∗(q⁻¹|u) = 4.755 · 10⁻⁶ q⁻² + 4.677 · 10⁻⁶ q⁻³,
C∗(q⁻¹|u) = 1 − 0.7905 q⁻¹ + 0.07769 q⁻²,
D∗(q⁻¹|u) = 1 − 0.2198 q⁻¹ + 0.01524 q⁻²,
F∗(q⁻¹|u) = 1 − 0.4874 q⁻¹,

with w denoting the server input, e denoting white noise and q representing the forward shift operator. Another example is an ARMAX model obtained for u = 2.6 GHz:

A∗(q⁻¹|u) y(t) = B∗(q⁻¹|u) w(t) + C∗(q⁻¹|u) e(t),   (5)

A∗(q⁻¹|u) = 1 − 0.9971 q⁻¹,
B∗(q⁻¹|u) = 1.596 · 10⁻⁶ q⁻¹,
C∗(q⁻¹|u) = 1 − 0.135 q⁻¹.

Fig. 10. ACF and PACF of measured signals. Top: CPU workload. Bottom: libpcap filtering rate.

Fig. 11. CPU workload model (Box–Jenkins) validation.

Fig. 11 illustrates the results of a model validation experiment in which the five-step ahead prediction, Ŷ(t|t − k), accurately reproduces

CPU workload generated by packet filtering operations. The results of workload simulation, Ŷ(t|t − ∞), show the input smoothing and noise integration effects of the model. Workload bursts which are not related to packet filtering operations cannot be simulated by the applied linear model; however, satisfactory prediction results are attainable. Fig. 12 confirms that a satisfactory fit to the estimation data can be obtained with the applied low-order linear models. Normally distributed residuals are located inside an acceptable standard confidence interval region. For more results on server dynamics identification see e.g. [63,64,49,33].

Fig. 12. Approximately normal distribution of prediction residuals.

5. Controller design and implementation

The results of the experiments discussed in the preceding sections pave the way for the design of customized energy-aware controllers that dynamically adjust the CPU frequency (ACPI P-state) to the workload generated by a specified class of applications, tailored to the characteristics of servers in heterogeneous systems. The identified models of server dynamics support the design of system state observers and workload predictors based on which efficient feedback control can be designed. The identified server power consumption profiles, constructed based on high-frequency measurements, can be introduced into the control objective function as an adequately weighted power usage cost component. In order to take advantage of the developed models many controller design methods can be applied, including pole-placement, optimal design and adaptive control methods. A detailed discussion of control policy design methodologies and the related problems of digital controller implementation can be found e.g. in [65–67]. For an overview of the computational methods applied below, based on the dynamic programming algorithm, see e.g. [68].

Listing 2. NETLINK socket configuration example

enum {
        SIGNAL_ATTRB_UNSPEC,
        SIGNAL_ATTRB_MSG,
        __SIGNAL_ATTRB_MAX,
};
#define SIGNAL_ATTRB_MAX (__SIGNAL_ATTRB_MAX - 1)

/* Protocol attribute validation policy */
static struct nla_policy app_observer_genl_policy[SIGNAL_ATTRB_MAX + 1] = {
        [SIGNAL_ATTRB_MSG] = { .type = NLA_U32 },
};

/* Generic netlink family specification */
#define VERSION_NR 1
static struct genl_family app_observer_gnl_family = {
        .id = GENL_ID_GENERATE,
        .hdrsize = 0,
        .name = "APP_OBSERVER",
        .version = VERSION_NR,
        .maxattr = SIGNAL_ATTRB_MAX,
};

enum {
        APP_OBSERVER_UNSPEC,
        INPUT_OBSERVER_RUN,
        OUTPUT_OBSERVER_RUN,
        __APP_OBSERVER_MAX,
};
#define APP_OBSERVER_MAX (__APP_OBSERVER_MAX - 1)

int input_observer_run(struct sk_buff *skb, struct genl_info *info);

/* Generic netlink operations */
struct genl_ops input_observer_gnl_ops = {
        .cmd = INPUT_OBSERVER_RUN,
        .flags = 0,
        .policy = app_observer_genl_policy,
        .doit = input_observer_run,
        .dumpit = NULL,
};

int output_observer_run(struct sk_buff *skb, struct genl_info *info);

/* Generic netlink operations */
struct genl_ops output_observer_gnl_ops = {
        .cmd = OUTPUT_OBSERVER_RUN,
        .flags = 0,
        .policy = app_observer_genl_policy,
        .doit = output_observer_run,
        .dumpit = NULL,
};
/* ... */

5.1. Control policy design

The design goal was to construct a control policy π = {µ0, µ1, . . .}, i.e. a sequence of CPU frequency switching rules µt : X → U to be applied in each control stage t = 0, 1, . . . , N − 1, that minimizes the expected total cost of data processing over an infinite control horizon (N → ∞). The stage cost was defined as follows:

gγ(x, u, w) = γ gP(x, u, w) + (1 − γ) gQ(x, u, w),   0 ≤ γ ≤ 1,   (6)

i.e. it was given by a weighted sum of two control objectives, namely, the application performance cost gQ(x, u, w) and the server power consumption cost gP(x, u, w). The power consumption cost was constructed from the MSR-based measurements, as presented in Section 3. The application performance cost was modeled by a barrier function of the CPU workload, parameterized by the CPU frequency. This particular model of the performance cost was derived from the observations of saturation effects, discussed in Section 3, and of processing break-downs, discussed in Section 4. An illustration of the stage cost models obtained for fixed values of the disturbance w and the weight parameter γ is presented in Fig. 13.

Fig. 13. Stage cost model, gγ(x, u, w̄) with γ = 0.75.

In order to deal with a collection of models representing the dynamics of data processing under various CPU frequencies (see Section 4), it was assumed that the CPU workload evolution is described by the following state–space model:

x(t + 1) = [Φ(u(t)) x(t) + Γ(u(t)) w(t)]X,   (7)

where Φ(u(t)) and Γ(u(t)) (identified experimentally) depend on the CPU frequency u(t) = µt(x(t)) applied by the controller in period t, and [·]X denotes projection on X. The following problem of stochastic optimal control was a springboard for our design of energy-aware CPU controllers:

minimize    Jπ(x0) = lim_{N→∞} E[ Σ_{t=0}^{N−1} α^t gγ(x(t), µt(x(t)), w(t)) : w(t) ∼ P ]
subject to  x(t + 1) = [Φ(u(t)) x(t) + Γ(u(t)) w(t)]X,   (8)
            x(t) ∈ X, u(t) ∈ U, w(t) ∈ W for t = 0, 1, . . . , N − 1,
over        π ∈ {{µt}_{t=0}^{N−1} | µt : X → U},

where X = [0, 100] ⊂ R denotes the set of CPU workload levels, U = {u0, . . . , um} is the set of admissible CPU frequencies, W ⊂ R is the set of random application-specific disturbances characterized by the probability distribution P, and α ∈ (0, 1) is the discount factor.

Due to the complexity of the models involved it was difficult to derive a general closed-form solution to the addressed problem. Nonetheless, under additional but rather mild assumptions it was possible to show that the frequency switching rule of the optimal energy-aware control policy differs from the switching rule of the general-purpose ondemand governor provided by the cpufreq kernel module in the Linux system. This theoretical result, presented in [15], motivated the design of controllers based on numerically obtained solutions to (8). Given the discounted infinite horizon problem formulation, the policy iteration algorithm was used to construct a stationary policy, π = {µγ, µγ, . . .}, defined by the control rule µγ : X → U parameterized by the weight γ assigned to the power consumption cost. The control rule µγ defines a CPU frequency switching table that takes advantage of both the application workload characteristics and the related server power consumption profiles in order to optimize the long-run average cost of data processing operations. Fig. 14 shows an example of the optimal control policy µγ. It should be noticed that in the general case the policy is not monotonic, which seems to result from the characteristics of the stage cost model, with multiple crossings of the cost functions corresponding to subsequent CPU frequencies.

Fig. 14. CPU frequency control policy, µγ(x), γ = 0.55.

Fig. 15. Control loop implementation, Linux kernel ver. 3.6.

5.2. Experimental studies

The energy-aware control policy µγ was implemented as a CPU frequency governor, named NASK, of the cpufreq module in the Linux kernel. For this purpose the control loop of the ondemand governor, illustrated in Fig. 15, was used and appropriately modified. The observer block was introduced to provide the controller with the system state estimate calculated based on the identified models of system dynamics. The controller block implementing the frequency switching table µγ was introduced in place of the default frequency switching rule of the ondemand governor.

The implemented controller was tested in a series of experiments conducted in the previously described testbed setting, presented in Fig. 3. The designed experiments were focused on webserver benchmarking. The nginx HTTP webserver (www.nginx.com), installed on the target server (D0) hosting several websites, was forced to respond to streams of requests originating from the directly connected servers (S1 and S2). Standard HTTP server benchmarking tools, siege (www.joedog.org) and ab (www.httpd.apache.org), were configured to generate a time-series of requests. The following performance metrics were collected during the experiments:

• accepted connections per second (cnx/s),
• handled requests per second (req/s),
• request processing time (s).


In addition, power consumption measurements were taken, both from the CPU MSRs and from the server power meter (PM). Performance of the designed controller was compared to that of the standard CPU frequency governors of the Linux kernel, namely, performance, powersave and ondemand. For this purpose, each of the webserver benchmarking experiments was carried out with the target server operating under the control of each of the compared CPU governors, respectively.

Fig. 16 presents examples of the observed trajectories of the webserver performance metrics. As expected, the trajectories obtained by the performance and powersave governors trace out an envelope, defined both in terms of energy consumption and processing performance, for the trajectories of the NASK and ondemand governors. The latter two can be seen to dynamically adjust the CPU frequency to the observed workload, which allows the power consumption to be reduced (bold lines denote MSR-based measurements) while keeping the performance metrics at a similarly high level. It can also be noticed that an increased number of accepted connections results in an increased accumulated webserver response time.

Fig. 17 summarizes the results of a large number of experiments. Each data point represents an average short-term service rate and the corresponding power consumption level reachable by the webserver while operating under a given CPU governor. Since a high service rate and low power usage define the preferred outcomes, the results in the south–east region dominate those located in the north–west region of the presented bi-objective space. It can be noticed that the mean vectors, presented with standard deviation ellipses, form the set of Pareto-efficient outcomes. The numbers presented in brackets show the average loss (negative number) or improvement (positive number) in performance and power consumption, respectively, that the NASK governor achieved in comparison to the governor considered.
According to the collected measurements, the designed governor consumed 94.6% more power than the powersave governor but simultaneously handled the incoming requests at a service rate improved by 28.9%. In comparison to the ondemand and the performance governors, the designed controller reduced the power consumption of the server by 14.7% and 25.8%, respectively. At the same time the server was able to respond to the requests at a service rate reduced by 3% and 4.9%, respectively.

Fig. 18 illustrates the results of experiments with large websites hosted by the server. Outcomes dominating those of the ondemand governor were reachable under high workloads, i.e. for streams of requests inducing service rates above the average. The light green area in the north–east region denotes outcomes dominated by those of the NASK governor. The designed controller was able to outperform the ondemand governor in terms of both performance and power saving, improving the two metrics by 2.6% and 2%, respectively.

6. Conclusions

Prior research successfully applied the control-theoretic approach to design efficient power regulation and application performance control structures for data centers. However, the obtained solutions were often limited by the lack of software probes and hardware sensors supporting high-frequency and fine-grained performance and power consumption measurements. This paper attempts to make new contributions to the field of energy-efficient computing by exploiting the possibilities provided by high-resolution sensors of modern computing hardware and software in the design of optimal controllers.

The measurements of power consumption and workload dynamics were collected at a high sampling rate from the Linux kernel level in the course of specifically designed experiments. The collected data were next used to develop maximally informative




Fig. 16. Webserver performance comparison.

Fig. 17. Comparison of performance in multi-objective space.

power consumption metrics and accurate dynamical processing models for the purpose of CPU controller design. Performance of the designed controller was compared to that of the standard ACPI-based CPU frequency governors of the Linux kernel. According to the experimental studies the customized controller may outperform the standard general-purpose governors in terms of service quality and power-saving capabilities.

The presented approach seems to be advantageous and interesting for several reasons. As was already pointed out, currently used cluster management systems provide highly scalable functions that allow CPU governors to be configured for the purpose of job execution. We propose to extend the collection of locally available CPU governors, installed in each server, with application-specific ones that can be activated on demand by the cluster management systems. The presented controller design approach may also be adjusted to take into account the power consumption and performance impact of other elements of the server, e.g. I/O or memory. High-frequency sampling of server operations is achieved with standard mechanisms and benchmarking features of the Linux kernel. The conducted experiments show that the required data can be collected at the sampling rate of the CPU governor almost without additional cost. These observations also give rise to interesting

Fig. 18. Comparison of average control performance. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

questions for future research, especially verifying the applicability of the presented approach in actual data centers and exploring the feasibility of adaptive controllers.

References

[1] D.A. Reed, J. Dongarra, Exascale computing and big data, Commun. ACM 58 (2015) 56–68.
[2] J. Shalf, S. Dosanjh, J. Morrison, Exascale computing technology challenges, in: High Performance Computing for Computational Science, VECPAR 2010, Springer, 2011, pp. 1–25.
[3] J. Dongarra, et al., The international exascale software project roadmap, Int. J. High Perform. Comput. Appl. 25 (2011) 3–60.
[4] B. Subramaniam, W. Saunders, T. Scogland, W.-c. Feng, Trends in energy-efficient computing: A perspective from the green500, in: 2013 International Green Computing Conference, IGCC, IEEE, 2013, pp. 1–8.
[5] B. Subramaniam, W.-c. Feng, Towards energy-proportional computing for enterprise-class server workloads, in: Proceedings of the 4th ACM/SPEC International Conference on Performance Engineering, ACM, 2013, pp. 15–26.
[6] W. Van Heddeghem, S. Lambert, B. Lannoo, D. Colle, M. Pickavet, P. Demeester, Trends in worldwide ICT electricity consumption from 2007 to 2012, Comput. Commun. 50 (2014) 64–76.
[7] J. Koomey, Growth in Data Center Electricity Use 2005–2010, Analytical Press, Oakland, CA, 2011.

[8] A. Sikora, E. Niewiadomska-Szynkiewicz, A federated approach to parallel and distributed simulation of complex systems, Int. J. Appl. Math. Comput. Sci. 17 (2007) 99–106.
[9] S.-Y. Jing, S. Ali, K. She, Y. Zhong, State-of-the-art research study for green cloud computing, J. Supercomput. 65 (2013) 445–468.
[10] D. Hackenberg, T. Ilsche, J. Schuchart, R. Schone, W.E. Nagel, M. Simon, Y. Georgiou, HDEEM: high definition energy efficiency monitoring, in: Energy Efficient Supercomputing Workshop, IEEE, 2014, pp. 1–10.
[11] X. Fan, W.-D. Weber, L.A. Barroso, Power provisioning for a warehouse-sized computer, in: ACM SIGARCH Computer Architecture News, Vol. 35, ACM, 2007, pp. 13–23.
[12] L.A. Barroso, U. Hölzle, The case for energy-proportional computing, IEEE Comput. 40 (2007) 33–37.
[13] J. Howard, S. Dighe, S.R. Vangal, G. Ruhl, N. Borkar, S. Jain, V. Erraguntla, M. Konow, M. Riepen, M. Gries, et al., A 48-core IA-32 processor in 45 nm CMOS using on-die message-passing and DVFS for performance and power scaling, IEEE J. Solid-State Circuits 46 (2011) 173–183.
[14] V. Pallipadi, A. Starikovskiy, The ondemand governor, in: Proceedings of the Linux Symposium, vol. 2, 2006, pp. 215–230.
[15] M.P. Karpowicz, Energy-efficient CPU frequency control for the Linux system, Concurr. Comput.: Pract. Exper. 28 (2016) 420–437. http://dx.doi.org/10.1002/cpe.3476.
[16] V. Pallipadi, S. Li, A. Belay, cpuidle: Do nothing, efficiently, in: Proceedings of the Linux Symposium, vol. 2, 2007, pp. 119–125.
[17] T. Patikirikorala, A. Colman, J. Han, L. Wang, A systematic survey on the design of self-adaptive software systems using control engineering approaches, in: ICSE Workshop on Software Engineering for Adaptive and Self-Managing Systems, SEAMS, IEEE, 2012, pp. 33–42.
[18] H. Lim, A. Kansal, J. Liu, Power budgeting for virtualized data centers, in: 2011 USENIX Annual Technical Conference, USENIX ATC'11, 2011, p. 59.
[19] C. Lefurgy, X. Wang, M. Ware, Power capping: a prelude to power shifting, Cluster Comput. 11 (2008) 183–195.
[20] J. Stoess, C. Lang, F. Bellosa, Energy management for hypervisor-based virtual machines, in: USENIX Annual Technical Conference, 2007, pp. 1–14.
[21] X. Wang, Y. Wang, Coordinating power control and performance management for virtualized server clusters, IEEE Trans. Parallel Distrib. Syst. 22 (2011) 245–259.
[22] Y. Wang, X. Wang, M. Chen, X. Zhu, Partic: Power-aware response time control for virtualized web servers, IEEE Trans. Parallel Distrib. Syst. 22 (2011) 323–336.
[23] O. Sarood, P. Miller, E. Totoni, L.V. Kalé, Cool load balancing for high performance computing data centers, IEEE Trans. Comput. 61 (2012) 1752–1764.
[24] K. Zheng, X. Wang, L. Li, X. Wang, Joint power optimization of data center network and servers with correlation analysis, in: Proceedings of IEEE INFOCOM, IEEE, 2014, pp. 2598–2606.
[25] M. Dabbagh, B. Hamdaoui, M. Guizani, A. Rayes, Energy-efficient resource allocation and provisioning framework for cloud data centers, IEEE Trans. Netw. Serv. Manag. 12 (2015) 377–391.
[26] I. Manousakis, M. Marazakis, A. Bilas, FDIO: A feedback driven controller for minimizing energy in I/O-intensive applications, in: Presented as Part of the 5th USENIX Workshop on Hot Topics in Storage and File Systems, USENIX, 2013.
[27] M.P. Karpowicz, P. Arabas, E. Niewiadomska-Szynkiewicz, Energy-aware multilevel control system for a network of linux software routers: Design and implementation, IEEE Syst. J. PP (99) (2015) 1–12. http://dx.doi.org/10.1109/JSYST.2015.2489244.
[28] E. Niewiadomska-Szynkiewicz, A. Sikora, P. Arabas, M. Kamola, M. Mincer, J. Kołodziej, Dynamic power management in energy-aware computer networks and data intensive computing systems, Future Gener. Comput. Syst. 37 (2014) 284–296.
[29] E. Niewiadomska-Szynkiewicz, A. Sikora, P. Arabas, J. Kołodziej, Control system for reducing energy consumption in backbone computer network, Concurr. Comput.: Pract. Exper. 25 (2013) 1738–1754.
[30] M. Kamola, P. Arabas, Shortest path green routing and the importance of traffic matrix knowledge, in: 24th Tyrrhenian International Workshop on Digital Communications-Green ICT, IEEE, 2013, pp. 1–6.
[31] P. Arabas, K. Malinowski, A. Sikora, On formulation of a network energy saving optimization problem, in: Fourth International Conference on Communications and Electronics, ICCE, 2012, pp. 227–232. http://dx.doi.org/10.1109/CCE.2012.6315903.
[32] N. Gandhi, D. Tilbury, Y. Diao, J. Hellerstein, S. Parekh, MIMO control of an Apache web server: modeling and controller design, in: Proceedings of the American Control Conference, vol. 6, 2002, pp. 4922–4927.
[33] P. Padala, K.-Y. Hou, K.G. Shin, X. Zhu, M. Uysal, Z. Wang, S. Singhal, A. Merchant, Automated control of multiple virtualized resources, in: Proceedings of the 4th ACM European Conference on Computer Systems, ACM, 2009, pp. 13–26.
[34] Z. Lu, J. Hein, M. Humphrey, M. Stan, J. Lach, K. Skadron, Control-theoretic dynamic frequency and voltage scaling for multimedia workloads, in: Proceedings of the 2002 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, ACM, 2002, pp. 156–163.
[35] J.C. McCullough, Y. Agarwal, J. Chandrashekar, S. Kuppuswamy, A.C. Snoeren, R.K. Gupta, Evaluating the effectiveness of model-based power characterization, in: USENIX Annual Technical Conference, 2011.


[36] M.E. Salehi, M. Samadi, M. Najibi, A. Afzali-Kusha, M. Pedram, S.M. Fakhraie, Dynamic voltage and frequency scheduling for embedded processors considering power/performance tradeoffs, IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 19 (2011) 1931–1935.
[37] M. Kondo, H. Sasaki, H. Nakamura, Improving fairness, throughput and energy-efficiency on a chip multiprocessor through DVFS, ACM SIGARCH Comput. Archit. News 35 (2007) 31–38.
[38] B. Wu, P. Li, Load-aware stochastic feedback control for DVFS with tight performance guarantee, in: 2012 IEEE/IFIP 20th International Conference on VLSI and System-on-Chip, VLSI-SoC, 2012, pp. 231–236.
[39] H. Jung, M. Pedram, Supervised learning based power management for multicore processors, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 29 (2010) 1395–1408.
[40] V.K. Vavilapalli, A.C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J. Lowe, H. Shah, S. Seth, et al., Apache Hadoop YARN: Yet another resource negotiator, in: Proceedings of the 4th Annual Symposium on Cloud Computing, ACM, 2013, p. 5.
[41] S. Jha, J. Qiu, A. Luckow, P. Mantha, G.C. Fox, A tale of two data-intensive paradigms: Applications, abstractions, and architectures, in: IEEE International Congress on Big Data, IEEE, 2014, pp. 645–652.
[42] Y. Georgiou, T. Cadeau, D. Glesser, D. Auble, M. Jette, M. Hautreux, Energy accounting and control with SLURM resource and job management system, in: Distributed Computing and Networking, Springer, 2014, pp. 96–118.
[43] L.A. Barroso, J. Clidaras, U. Hölzle, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Morgan & Claypool Publishers, 2013.
[44] M. Kambadur, M.A. Kim, An experimental survey of energy management across the stack, in: Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages & Applications, ACM, 2014, pp. 329–344.
[45] J. Mair, D. Eyers, Z. Huang, H. Zhang, Myths in power estimation with performance monitoring counters, Sustain. Comput.: Inform. Syst. 4 (2014) 83–93.
[46] M.E.M. Diouri, M.F. Dolz, O. Glück, L. Lefèvre, P. Alonso, S. Catalán, R. Mayo, E.S. Quintana-Ortí, Assessing power monitoring approaches for energy and power analysis of computers, Sustain. Comput.: Inform. Syst. 4 (2014) 68–82.
[47] D. Chisnall, There's no such thing as a general-purpose processor, Queue 12 (2014) 20.
[48] J. Rao, Y. Wei, J. Gong, C.-Z. Xu, QoS guarantees and service differentiation for dynamic cloud applications, IEEE Trans. Netw. Serv. Manag. 10 (2013) 43–55.
[49] D. Kusic, J.O. Kephart, J.E. Hanson, N. Kandasamy, G. Jiang, Power and performance management of virtualized computing environments via lookahead control, Cluster Comput. 12 (2009) 1–15.
[50] Y. Wang, X. Wang, M. Chen, X. Zhu, Power-efficient response time guarantees for virtualized enterprise servers, in: Real-Time Systems Symposium, IEEE, 2008, pp. 303–312.
[51] D. Molka, D. Hackenberg, R. Schöne, T. Minartz, W.E. Nagel, Flexible workload generation for HPC cluster efficiency benchmarking, Comput. Sci.-Res. Dev. 27 (2012) 235–243.
[52] R. Schöne, D. Hackenberg, D. Molka, Memory performance at reduced CPU clock speeds: an analysis of current x86_64 processors, in: Proceedings of the 2012 USENIX Conference on Power-Aware Computing and Systems, USENIX Association, 2012, pp. 9–14.
[53] S. Bradner, J. McQuaid, RFC 2544: Benchmarking methodology for network interconnect devices, 1999.
[54] R. Bolla, R. Bruschi, RFC 2544 performance evaluation and internal measurements for a Linux based open router, in: 2006 Workshop on High Performance Switching and Routing, IEEE, 2006, pp. 9–14.
[55] M. Desnoyers, M.R. Dagenais, The LTTng tracer: A low impact performance and behavior monitor for GNU/Linux, in: OLS (Ottawa Linux Symposium), vol. 2006, Citeseer, 2006, pp. 209–224.
[56] R. Matias, I. Beicker, B. Leitão, P.R. Maciel, Measuring software ageing effects through OS kernel instrumentation, in: IEEE Second International Workshop on Software Aging and Rejuvenation, IEEE, 2010, pp. 1–6.
[57] B. Lee, S. Moon, Y. Lee, Application-specific packet capturing using kernel probes, in: IFIP/IEEE International Symposium on Integrated Network Management, IEEE, 2009, pp. 303–306.
[58] M.-H. Wang, C.-M. Yu, C.-L. Lin, C.-C. Tseng, L.-H. Yen, KPAT: A kernel and protocol analysis tool for embedded networking devices, in: IEEE International Conference on Communications, IEEE, 2014, pp. 1160–1165.
[59] A. Khoroshilov, V. Mutilin, E. Novikov, I. Zakharov, Modeling environment for static verification of Linux kernel modules, in: Perspectives of System Informatics, Springer, 2014, pp. 400–414.
[60] R. Pintelon, J. Schoukens, System Identification: A Frequency Domain Approach, John Wiley & Sons, 2012.
[61] T. Soderstrom, P. Stoica, System Identification, Prentice Hall International, UK, 1989.
[62] M. Lassnig, T. Fahringer, V. Garonne, A. Molfetas, M. Branco, Identification, modelling and prediction of non-periodic bursts in workloads, in: Proceedings of the 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, CCGRID '10, IEEE Computer Society, Washington, DC, USA, 2010, pp. 485–494. http://dx.doi.org/10.1109/CCGRID.2010.118.
[63] M.P. Karpowicz, P. Arabas, Preliminary results on the Linux libpcap model identification, in: 20th International Conference on Methods and Models in Automation and Robotics, MMAR, IEEE, 2015, pp. 1056–1061. http://dx.doi.org/10.1109/MMAR.2015.7284025.
[64] H. Li, Workload dynamics on clusters and grids, J. Supercomput. 47 (2009) 1–20.
[65] K.J. Åström, B. Wittenmark, Adaptive Control, Dover Publications, Mineola, NY, 2013.
[66] K.J. Åström, B. Wittenmark, Computer-Controlled Systems: Theory and Design, Dover Publications, Mineola, NY, 2011.
[67] R. Istepanian, J.F. Whidborne, Digital Controller Implementation and Fragility: A Modern Perspective, Springer Science & Business Media, 2001.
[68] D.P. Bertsekas, Dynamic Programming and Optimal Control, third ed., Athena Scientific, Belmont, MA, 2005.

Michał P. Karpowicz received his Ph.D. in 2010. He is an assistant professor of computer science at the Warsaw University of Technology and the NASK Research Institute, Poland. His research interests focus on stochastic control theory, control engineering, game theory and network optimization.

Piotr Arabas received his Ph.D. in computer science from the Warsaw University of Technology, Poland, in 2004. He is currently an assistant professor at the Institute of Control and Computation Engineering at the Warsaw University of Technology, and has been with the NASK Research Institute since 2002. His research area focuses on modeling computer networks, predictive control and hierarchical systems.

Ewa Niewiadomska-Szynkiewicz received her Ph.D. in 1995 and D.Sc. in 2005. She is a professor of control and computer engineering at the Warsaw University of Technology and head of the Complex Systems Group. She is also the Director of Research at the NASK Research Institute. She is the author or co-author of three books and over 150 journal and conference papers. Her research interests focus on complex systems modeling, optimization and control, computer simulation, parallel computation, computer networks and ad hoc networks. She has been involved in a number of research projects, including EU projects, coordinated the Group's activities, and managed the organization of a number of national and international conferences.