Performance Evaluation 68 (2011) 1232–1246
Cooling-aware workload placement with performance constraints

Andrea Sansottera∗, Paolo Cremonesi
Dipartimento di Elettronica e Informazione, Politecnico di Milano, Milan, Italy
Article history: Available online 3 August 2011
Keywords: Data center; Cooling aware consolidation; CFD
Abstract

Power optimization in data centers requires either raising the temperature of the cold air supplied by the air conditioner or reducing the power consumption of the servers through careful workload allocation. Both approaches must satisfy a number of constraints, mainly the temperature at the server intakes, which should not exceed a critical threshold, and capacity and response time requirements. To tackle these issues, we formulate an optimization problem in which the total data center power has to be minimized subject to the constraints imposed by performance requirements and thermal specifications of the servers. At the heart of the optimization problem is an analytical model which takes into account the complex relationship between the performance of the servers, the allocation of workloads, the temperature of the air supplied by the conditioning unit and the heat distribution in the server room. For the easy evaluation of this relationship, we adopt a simplified yet accurate heat flow model, which we extensively validate using the data collected in several months of Computational Fluid Dynamics simulations. Extensive tests on 90 randomly generated scenarios suggest that the proposed coupled thermal-performance model can lead to a power saving of 21%. Finally, a case study is presented which is based on 1164 workload traces collected from the data center of a large telco operator. The cooling-aware workload placement suggests a saving of 8% with respect to a performance-only based strategy.

© 2011 Elsevier B.V. All rights reserved.
1. Introduction

In order to revert or slow down the growth in data center power consumption, efficiency must be sought at two strictly connected levels: the reduction of server power consumption and the increase of air conditioner efficiency. The first goal can be achieved with a careful placement of workloads that exploits the most ‘‘efficient’’ servers (e.g., in terms of watts per transaction). However, this strategy alone can lead to hot spots, i.e., portions of the data center with a larger-than-average temperature. Hot spots decrease the efficiency of the computer room air conditioner (CRAC), because the air temperature at the server inlets must be lowered in order to meet the thermal specifications of the manufacturer [1]. Hence, the reduction of server and cooling power must be tackled as a single problem. Unfortunately, consolidation and air circulation models in data centers are generally treated as two separate problems in the literature, because of the difficulties in writing an analytical relationship between performance and thermodynamic aspects. In this paper, we propose a mixed integer program that aims at minimizing the total power consumption of the data center, taking into account the requirements of both the servers and the cooling equipment. We rely on the linear heat flow model proposed in [2]. The general idea of this approach is to run a small number of Computational Fluid Dynamics (CFD) simulations and compute a cross-interference matrix, which represents the amount of heat recirculating among servers. Using the cross-interference matrix, we can introduce the power consumption of the CRAC unit into the
∗ Corresponding author.
E-mail addresses: [email protected] (A. Sansottera), [email protected] (P. Cremonesi).
doi:10.1016/j.peva.2011.07.018
objective function and consider the inlet server temperature in the constraints. Moreover, we introduce into the optimization problem constraints based on capacity and response time requirements. A number of instances of these problems have been solved, using both heuristics and an exact solver. Despite the challenging nature of the problem, the results show 16%–20% power savings over a baseline workload allocation strategy.

1.1. Main contributions

The main contributions of this paper can be summarized as follows.

Validation of the linear heat flow model. We contribute an extensive validation of the linear heat flow model, numerically verifying the invariance of the cross-interference matrix under widely different power distributions. To perform the validation, a computational fluid dynamics (CFD) description of a computer room has been developed and more than 240 configurations have been simulated, accounting for almost 10 000 CPU-hours at the CILEA supercomputing center.¹

Optimal cold air temperature. Given a data center layout, we derive a closed-form analytical relationship between the server utilizations and the optimal cold-air temperature, i.e., the highest temperature of the air supplied by the CRAC unit that prevents the server intake air temperatures from exceeding manufacturer-imposed thresholds.

Extension of previous works to include heterogeneous servers and response times. While previous works (e.g., [3,4]) have considered homogeneous computing resources and rather simple capacity constraints (e.g., thresholds on the utilization), we take into account different server characteristics, such as speedup factors and numbers of CPUs, and characterize the workloads by their service demands and arrival rates. Moreover, we consider constraints on the workload response times.

1.2. Structure of the paper

The state of the art is summarized in Section 2. In Section 3, we describe the power model of a typical data center. The abstract heat flow model of the server room is described in Section 4. Section 5 describes the setup of our computational fluid dynamics simulations and the model validation. The mathematical program for optimal workload placement and different solution strategies are described and evaluated in Section 6. In Section 6.4 we compare the energy savings obtained with our model with respect to other approaches. A case study based on real workload measurements and server specifications is presented in Section 7. Conclusions are drawn in Section 8.

2. Related works

Many works have focused on server consolidation problems, i.e., the process of combining the workloads of several different servers on a set of target servers. These works address consolidation as a mixed integer linear programming (ILP) bin packing problem. Rolia et al. [5] suggest an integer program for allocation problems in a data center. Linear and non-linear integer programming models for consolidation are presented in [6]. These models aim at minimizing the number of servers used while satisfying constraints on end-to-end response times. A similar minimization problem is formulated in [7], where response time constraints are dropped, but rules related to availability and compatibility requirements are introduced. The trend toward high power densities has led to a number of works discussing thermal models of data centers. The impact of asymmetry in layout and heat load is studied in [8], highlighting the energy savings that can be obtained through variable capacity cooling units.
In order to avoid the need for expensive ambient sensors, machine learning techniques have been adopted in [9] to estimate server inlet temperatures from on-board sensor readings, using utilization metrics to mask out the thermal effects due to the workload. An approach based on neural networks to infer thermal maps of data centers from the workload distribution, cooling configuration and physical topology has been presented in [10]. CFD simulations have been used in [11] to study the impact of a CRAC failure, showing that critical ambient temperatures are reached in less than 80 s. The problem of thermal-aware workload placement has also been tackled. In [3], the authors consider compute-intensive batch jobs characterized by the number of processor cores needed. The power consumption of the servers is defined by a finite number of power states: in one state the server is powered off, while the other states are defined by the number of active CPUs. Another simplifying assumption of this design is that the data center is homogeneous, so that the performance/power ratio of the servers is not considered. The work in [3] first introduces the idea of the heat recirculation factor (HRF), which is defined as the fraction of heat recirculating from the outlet of one server to the inlet of the other servers. The underlying assumption is that the HRF is a property of the data center layout and is independent of the load. One of the main contributions of the present paper is the validation of this assumption. The idea of characterizing heat recirculation for a given data center layout is further developed in [2]. The authors define a cross-interference matrix, representing, for each pair of servers, the fraction of heat generated by the first server which recirculates into the second one. Since this work is fundamental to our research, we present the main characteristics of the proposed heat flow model in Section 4.

¹ http://www.cilea.it.
The cross-interference matrix is used in [12] to minimize the total power consumption of a data center by deciding which servers should be turned on. The total number of servers which need to operate is assumed to be known a priori and the performance modeling aspect of the problem is factored out of the optimization model. In [13], the cross-interference matrix is used to minimize the power consumption of a storage centric data center. Given a set of tasks, the problem tackled by the authors is the association of each task to a specific replica among many available replicas. A minimum cost flow heuristic is used to solve the optimization problem. The Highest Thermostat Setting (HTS) algorithm for spatial job scheduling in a High Performance Computing (HPC) data center is presented in [1]. The cross-interference matrix is employed to determine whether a given temperature of the cold air supplied by the CRAC is sufficient to ensure that the critical temperatures at the server inlets are not exceeded. This information is exploited by allocating the jobs in a cooling-aware manner. Cooling management is performed by the spatial job scheduler, taking into account the different operating modes supported by the CRAC unit and the time necessary to switch between modes. In [4], a genetic algorithm for workload placement aiming at minimizing heat recirculation, XInt, is presented. Like [3], this work makes the assumption that servers are homogeneous and thus server power consumption is not considered. Moreover, no performance constraints are taken into account.

3. Data center power model

The power consumption of a data center can be divided into two main components: the power drawn by the servers, $P_{SERVERS}$, and the power drawn by the cooling equipment, $P_{CRAC}$. For simplicity we do not consider the power consumption of other devices, such as the power distribution units. Therefore we define the total data center power as follows:

$$P_{DC} = P_{SERVERS} + P_{CRAC}. \tag{1}$$
The power drawn by the servers can intuitively be seen as the sum of the individual power consumptions, i.e., $P_{SERVERS} = \sum_{s \in S} P_s$, where $S$ is the set of servers in the data center and $P_s$ is the power consumption of server $s$. A number of works have dealt with the problem of estimating the power consumption of a computer system. System level power models for computer systems are either based on simulations [14–16] or on resource utilization, like Mantis [17]. Similarly to Mantis, we adopt a power model based on resource utilization and consider the most power hungry component of the computer system, the CPU:

$$P_s = P_{idle,s} + P_{busy,s} U_s, \tag{2}$$

where $U_s \in [0, 1]$ is the CPU utilization. Least squares regression can be applied to the data from standard power benchmarks (e.g., SPECpower²) to estimate the linear model parameters $P_{idle}$ and $P_{busy}$. The CRAC is a heat pump that moves heat from within the computer room to the outdoors. The efficiency of the CRAC is expressed by the Coefficient of Performance (COP), which is the ratio of the heat removed from the computer room to the work supplied. This coefficient depends on the temperature $T_{sup}$ of the cold air supplied by the CRAC: the lower this temperature, the lower the COP. Therefore, the power drawn by the CRAC unit is

$$P_{CRAC} = \frac{P_{SERVERS}}{COP(T_{sup})}. \tag{3}$$
We can now define the total power consumption of the data center as follows:

$$P_{DC} = \sum_{s \in S} \left( P_{idle,s} + P_{busy,s} U_s \right) \left( 1 + \frac{1}{COP(T_{sup})} \right). \tag{4}$$
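To make the power model concrete, the following sketch fits $P_{idle}$ and $P_{busy}$ by least squares from utilization/power samples and then evaluates (4). The sample data are synthetic, and the quadratic COP curve is of the form used in [3]; its coefficients should be treated as an assumption here, not as a measurement of any specific CRAC unit.

```python
import numpy as np

def fit_power_model(utilization, power):
    # Least-squares fit of the linear model (2): P = P_idle + P_busy * U.
    X = np.column_stack([np.ones_like(utilization), utilization])
    (p_idle, p_busy), *_ = np.linalg.lstsq(X, power, rcond=None)
    return p_idle, p_busy

def total_dc_power(p_idle, p_busy, u, t_sup, cop):
    # Total data center power as in (4), given per-server utilizations u.
    p_servers = np.sum(p_idle + p_busy * u)
    return p_servers * (1.0 + 1.0 / cop(t_sup))

# Synthetic SPECpower-like samples (hypothetical numbers, for illustration).
u_samples = np.linspace(0.0, 1.0, 11)
p_samples = 300.0 + 500.0 * u_samples
p_idle, p_busy = fit_power_model(u_samples, p_samples)

# Assumed quadratic COP curve, T_sup in degrees Celsius.
cop = lambda t: 0.0068 * t**2 + 0.0008 * t + 0.458
print(total_dc_power(p_idle, p_busy, u=np.full(10, 0.5), t_sup=15.0, cop=cop))
```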
In the next section, we define an analytical relationship between the optimal temperature of the air supplied by the CRAC and the utilization of the servers.

4. The linear heat flow model

The dominant form of heat transfer within a data center is convective heat transfer, often referred to as convection. Convection is the transfer of heat from one place to another by the movement of air [18]. Convection can be ‘‘forced’’ (e.g., by means of server fans) or ‘‘natural’’, whenever air is heated and buoyancy forces alone are responsible for air motion. In natural convection, an increase in temperature produces a reduction in density, which causes air motion due to the pressure differences that arise when portions of air of different densities are subject to gravity. Because of the effects of natural convection, heat transfer between servers and CRAC units might depend on the heat produced by the servers, which, in turn, depends on their power requirements. In this section we show that the effects of natural convection in a typical data center layout may be neglected when modeling heat transfer between servers, thus allowing for a simple linear heat transfer model.
2 http://www.spec.org/power_ssj2008/.
Fig. 1. This figure, obtained by a CFD simulation, shows how the air exiting a server recirculates into the inlets of other servers.
4.1. Data center layout

The typical layout of a data center presents multiple rows of racks separated by either a hot or a cold aisle. In cold aisles, the air ventilated by the CRAC units exits the raised floor and is drawn in by the servers, whose air intakes face the cold aisle. The hot air from the servers is ventilated into hot aisles. In order to draw as much hot air as possible, the intake of the CRAC units is also placed in hot aisles. Unfortunately, some of the air inevitably recirculates among servers, as shown in Fig. 1. The figure, obtained using a CFD simulation, shows the air flow exiting one of the servers. The layout comprises two rows of racks, separated by a hot aisle, and two CRAC units. Only one of the CRAC units is active and draws a significant amount of the air exiting from the server. However, some of the hot air exhaust recirculates to the intakes of the other servers, hence impacting the inlet temperatures. The linear heat flow model, described in the following sub-section, enables the characterization of heat recirculation among servers for a given data center layout.

4.2. Model assumptions and definitions

Given a set of servers $S$, for each server $i \in S$, let $P_i$ be the power consumption of the server, $T_{in,i}$ the temperature of the air at the server intake and $T_{out,i}$ the outlet temperature. By the law of energy conservation, assuming that all the power drawn by a server is dissipated as heat and that the thermal exchange between a server and the computer room happens mostly through convection, the following equation must hold:

$$P_i = \phi_i \rho C_p (T_{out,i} - T_{in,i}), \tag{5}$$
where $\rho$ and $C_p$ are constants representing the density and the specific heat of air, while $\phi_i$ is the volumetric air flow rate. We indicate with $a_{j,i}$ the fraction of air which recirculates from the outlet of server $j$ to the inlet of server $i$. It follows that only a fraction of the volumetric air flow rate at the inlet surface of server $i$ comes from the CRAC. The inlet temperature of the server can be computed as a weighted sum of the temperatures at the outlets of the servers and the temperature of the cold air $T_{sup}$:

$$T_{in,i} = \sum_{j \in S} a_{j,i} T_{out,j} + \left( 1 - \sum_{j \in S} a_{j,i} \right) T_{sup}. \tag{6}$$
Using (5) and (6), a simple linear relationship between the outlet temperatures of the servers is found:

$$T_{out,i} = \sum_{j \in S} a_{j,i} T_{out,j} + \left( 1 - \sum_{j \in S} a_{j,i} \right) T_{sup} + \frac{1}{K_i} P_i, \tag{7}$$

where $K_i = \phi_i \rho C_p$. Let $A$ be the matrix whose elements are the $a_{j,i}$ coefficients. Moreover, let $\vec{T}_{out}$ be the vector of the temperatures at the outlets of the servers and $K$ the diagonal matrix of the constants $K_i$. The linear heat flow model can be written as the following vector equation:

$$K \vec{T}_{out} = K A' \vec{T}_{out} + K (I - A') \vec{1} \, T_{sup} + \vec{P}, \tag{8}$$

where $\vec{1}$ is the vector with all elements equal to 1. The matrix $A$ is a characteristic of the data center layout and is called the cross-interference matrix.
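Before turning to the estimation of $A$, the following sketch shows the forward direction of (8): given an assumed cross-interference matrix, the flow constants $K_i$ and the server powers, it computes the outlet and inlet temperatures. The function and variable names are ours, not from the paper, and $K$ is passed as the vector of the $K_i$ constants.

```python
import numpy as np

def heat_flow(A, K, P, t_sup):
    # Solve (8) for the outlet temperatures, rearranged as
    # (I - A') T_out = (I - A') 1 * t_sup + K^-1 P.
    m = len(P)
    t_delta = P / K                 # K^-1 P, elementwise since K is diagonal
    I = np.eye(m)
    rhs = (I - A.T) @ np.full(m, t_sup) + t_delta
    t_out = np.linalg.solve(I - A.T, rhs)
    t_in = t_out - t_delta          # from (5): P_i = K_i (T_out,i - T_in,i)
    return t_out, t_in
```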
4.3. The cross-interference matrix

In order to compute the cross-interference matrix for a data center containing $m$ servers, an equal number of profiling experiments or CFD simulations is needed. Let $T_{in}$ be the $m \times m$ matrix whose column vectors are the inlet temperature vectors measured in the $m$ different configurations. Analogously, let $P$ be the matrix of the $m$ power vectors and $T_\Delta = K^{-1} P$. From (8) it follows that

$$T_{in} + T_\Delta = A' (T_{in} + T_\Delta) + (I - A') \mathbf{1} \, T_{sup} + T_\Delta, \tag{9}$$
where $\mathbf{1}$ is the $m \times m$ matrix with all elements equal to 1. Through simple algebraic manipulations the following equation is obtained:

$$(T_{in} + T_\Delta - \mathbf{1} \, T_{sup})' A = (T_{in} - \mathbf{1} \, T_{sup})'. \tag{10}$$
Thus, $A$ can be computed by matrix inversion and multiplication or, to achieve better numerical precision, each column of $A$ can be computed by solving a linear system; a numerical sketch combining this step with the optimal supply temperature is given at the end of Section 4.4.

4.4. The optimal cold air temperature

For the correct operation of servers, the temperature of the air at the server intake must not exceed a critical threshold. Given this requirement and the cross-interference matrix representative of a given data center layout, an upper limit on the temperature of the air ventilated by the CRAC can be obtained. For convenience, we define the vector $\vec{T}_\Delta = K^{-1} \vec{P}$ and rewrite (8) as follows:
$$\vec{T}_{in} + \vec{T}_\Delta = A' (\vec{T}_{in} + \vec{T}_\Delta) + (I - A') \vec{1} \, T_{sup} + \vec{T}_\Delta. \tag{11}$$

Simplifying and grouping terms we obtain:

$$(I - A') \vec{T}_{in} = A' \vec{T}_\Delta + (I - A') \vec{1} \, T_{sup}. \tag{12}$$

Finally, assuming $I - A'$ is nonsingular, we can express $\vec{T}_{in}$ in terms of $\vec{T}_\Delta$ and $T_{sup}$:

$$\vec{T}_{in} = (I - A')^{-1} A' \vec{T}_\Delta + \vec{1} \, T_{sup}. \tag{13}$$

Let $\vec{T}_{crit}$ be the vector of the maximum temperatures allowed at the intakes of the servers; then, using (13), the following inequality must hold:

$$\vec{1} \, T_{sup} \le \vec{T}_{crit} - (I - A')^{-1} A' \vec{T}_\Delta. \tag{14}$$

This is a set of linear inequalities in one variable that can be reduced to a single inequality:

$$T_{sup} \le \min_{s \in S} \{ (\vec{T}_{crit} - (I - A')^{-1} A' \vec{T}_\Delta)_s \}, \tag{15}$$

where, for any vector $\vec{v}$, $(\vec{v})_s$ denotes the $s$-th element of the vector. Since the efficiency of the CRAC unit decreases as the temperature of the air it supplies to the computer room decreases, the optimal $T_{sup}$ is the highest temperature satisfying (15):

$$T^*_{sup} = \min_{s \in S} \{ (\vec{T}_{crit} - (I - A')^{-1} A' \vec{T}_\Delta)_s \}. \tag{16}$$

Remembering that $\vec{T}_\Delta = K^{-1} \vec{P}$ and using (2), we can express the optimal temperature of the air supplied by the CRAC as a function of the server utilization vector $\vec{U}$:

$$T^*_{sup} = \min_{s \in S} \{ (\vec{T}_{crit} - (I - A')^{-1} A' K^{-1} (\vec{P}_{idle} + \vec{P}_{busy} \vec{U}))_s \}. \tag{17}$$
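A minimal numerical sketch of Section 4.3 and of (16), under the assumption that the profiling data are arranged column-wise as in the text; the function and variable names are ours, and $K$ is again passed as a vector.

```python
import numpy as np

def estimate_cross_interference(T_in, P, K, t_sup):
    # Solve (10), (T_in + T_delta - 1*t_sup)' A = (T_in - 1*t_sup)';
    # np.linalg.solve handles one column of A per linear system,
    # which is the numerically preferable route mentioned in the text.
    T_delta = P / K[:, None]        # T_delta = K^-1 P
    lhs = (T_in + T_delta - t_sup).T
    rhs = (T_in - t_sup).T
    return np.linalg.solve(lhs, rhs)

def optimal_supply_temperature(A, K, P, t_crit):
    # Highest safe CRAC supply temperature (16) for a power vector P.
    m = len(P)
    recirc = np.linalg.solve(np.eye(m) - A.T, A.T @ (P / K))
    return np.min(t_crit - recirc)
```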
5. Model validation

The fundamental assumption of the linear heat flow model is that the power consumption of the servers has little impact on the cross-interference matrix. In other words, the fraction of air recirculating from one server to another depends mostly on the layout of the data center and not on other factors such as the power consumption of each server or the temperature of the cold air.
Fig. 2. Data center layout and server numbering used in our simulations.
5.1. Data collection

To check the validity of this hypothesis, we examined six rather different scenarios, summarized in Table 1. We considered a single data center module composed of two rows of racks, as shown in Fig. 2. A hot aisle stands between the two rows, with two CRAC units at the opposite ends of the aisle. One of the CRAC units is active, while the other is used as a backup and is therefore powered off. In the other two aisles the cold air ventilated by the CRAC unit exits from the raised floor. The two rows are composed of five 40U racks each. Each rack can host 4 high-density servers (each server is a 10U blade enclosure), for a total of 40 servers. For each test case, 40 simulations are needed to characterize the cross-interference matrix, for a total of 240 simulations across the six test cases. As in [2], during a simulation all but one server absorb a certain amount of power $P_b$, while the power consumption of the remaining server is $P_b + P_\Delta$. A significant contribution of our work is the model validation with large values of $P_\Delta$, both positive and negative. The CFD simulations were carried out using the FLUENT solver³ on the CILEA supercomputing cluster and required a total of almost 10 000 CPU-hours.

5.2. Cross-validation

Due to the complexity of heat transfer phenomena, when the coefficients estimated in one test case are used to predict the temperatures at the inlets of the servers in another test case, a certain prediction error occurs. We performed K-fold cross-validation [19], using the average of the cross-interference coefficients over five test cases to predict the temperatures in the remaining test case (a sketch of this procedure follows Table 2). The mean and standard deviation of the prediction errors for each test case are shown in Table 2. The accuracy is strikingly good, despite the vastly different power profiles used in the six scenarios: the average error ranges between 0.36 K and 1.48 K. An example of the results of the cross-validation is shown in Fig. 3. In Fig. 4 we provide a graphical representation of the cross-interference coefficients in the six test cases. The figures are almost identical, confirming that the impact of the power distribution on the cross-interference matrix is low. Thus, we may assume that heat transfer between servers is mainly due to forced convection. The main difference between the figures can be noted when comparing the first row with the last row. The lighter blobs in the figures in the top row are bigger than those in the bottom row, and the black blobs are much more pronounced in the top row than in the bottom row. This means that the recirculation coefficients are slightly larger in the bottom configuration. This can be easily explained if we consider that the first row describes a not-too-hot data center, with total server power consumption equal to 20 kW, while the bottom row describes a quite hot data center, with total server power consumption equal to 140 kW. In the hot configuration, natural convection affects the recirculation coefficients more significantly. The servers in the left row are numbered from 1 to 20, while the servers in the right row are numbered from 21 to 40. Numbers 1–4, 5–8, 9–12, and so on refer to servers in the same rack, from the bottom to the top. The racks closest to the active cooling unit are characterized by the smallest numbers. The highest heat recirculation is observed for the racks closest to the CRAC units (both the active and the inactive one).
For these racks, observing the corresponding 4 × 4 blocks, the heat generated by the servers at the bottom recirculates to the servers above, as expected. Moreover, the figure confirms that servers at the ends of the aisle have the worst impact on heat recirculation.

6. Performance-constrained workload placement

In this section we deal with the cooling-aware data center consolidation problem, i.e., the off-line mapping between workloads and servers that minimizes the total power requirement of the data center.

³ http://www.ansys.com/Products/Simulation+Technology/Fluid+Dynamics/ANSYS+FLUENT.
Table 1
The different test cases employed to validate the linear heat flow model.

Test case    Pb (W)    Pb + PΔ (W)
1            500       2500
2            500       3500
3            2500      500
4            2500      3500
5            3500      500
6            3500      2500
Table 2
K-fold cross-validation of the linear heat flow model: sample mean and standard deviation of the absolute prediction errors.

Test case    Mean error (K)    Error standard deviation (K)
1            0.3604            0.3059
2            0.3760            0.3485
3            0.4642            0.3360
4            0.5125            0.3667
5            1.4174            1.0032
6            1.4827            1.0306
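The cross-validation procedure mentioned in Section 5.2 can be sketched as follows: the matrices estimated on five test cases are averaged and the inlets of the held-out case are predicted via (13). The data layout (one estimated matrix per test case, plus a list of (power vector, measured inlet vector) pairs per case) is our own assumption.

```python
import numpy as np

def predict_inlets(A, K, P, t_sup):
    # Inlet temperatures from (13): T_in = (I - A')^-1 A' T_delta + 1*t_sup.
    m = len(P)
    return np.linalg.solve(np.eye(m) - A.T, A.T @ (P / K)) + t_sup

def cross_validate(A_by_case, runs_by_case, K, t_sup):
    errors = {}
    for held_out, runs in runs_by_case.items():
        others = [A for case, A in A_by_case.items() if case != held_out]
        A_avg = np.mean(others, axis=0)   # average cross-interference matrix
        errs = np.concatenate([np.abs(predict_inlets(A_avg, K, P, t_sup) - t_in)
                               for P, t_in in runs])
        errors[held_out] = (errs.mean(), errs.std())
    return errors
```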
Fig. 3. An example of the K -fold cross validation. The temperatures obtained in the first simulation of test case 3 are represented by the blue circles, while the red crosses mark the temperatures predicted with the cross-interference matrix. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
For this reason, in our analysis we do not consider the cost of migrating workloads. Future work will focus on on-line algorithms able to dynamically manage changing workload requirements. Once the power characteristics of the servers, the COP function of the CRAC unit and the cross-interference matrix for a given data center layout are known, this information can be exploited to reduce the power consumption of the data center through careful placement of the workloads. Given the set of workloads $C$ and the set of servers $S$, we can formulate an optimization problem in which we aim at minimizing the total power consumption. A solution is represented by $|C| \times |S|$ binary variables $x_{c,s}$ such that, for all $c \in C$ and $s \in S$, $x_{c,s} = 1$ if and only if workload $c$ is allocated on server $s$, and $x_{c,s} = 0$ otherwise. The solution is also subject to performance constraints, related to the resource utilization and response time requirements of the different workloads.

6.1. Performance modeling

In order to characterize the utilization of the servers and the response times, we need a number of performance parameters. In this work we focus our attention on CPU time, but the model can be rather intuitively extended to include other types of resources. Hence, let $d_{c,s}$ be the CPU demand of workload $c$ on a single CPU core of server $s$ and let $\lambda_c$ be the arrival rate of requests of workload $c$. For each feasible solution of the optimization problem, the utilization of a server $s$ can be computed as the sum of the utilizations of all the workloads allocated on it:

$$U_s = \frac{\sum_{c \in C : x_{c,s} = 1} d_{c,s} \lambda_c}{n_s}, \tag{18}$$

where $n_s$ is the number of CPU cores available on server $s$.
Fig. 4. Graphical representation of the recirculation coefficients. The block on the i-th row and the j-th column represents the heat recirculating from server i to server j. Lighter colors represent higher recirculation coefficients.
For each workload $c$, let $R_c^{(max)}$ be the constraint on the mean response time, usually defined in service level agreements, and let $s$ be the server on which the application is placed, i.e., $x_{c,s} = 1$. Assuming exponentially distributed inter-arrival and service times, we write these constraints using the approximation of the mean response time for multi-server queues proposed in [20]:

$$d_{c,s} \frac{n_s - 1}{n_s} + \frac{d_{c,s}/n_s}{1 - U_s} \le R_c^{(max)}. \tag{19}$$

As we will see later on, using this approximation instead of the exact formula [21] allows us to write this constraint in a linear form.
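A small sketch of how (18) and (19) are evaluated for a candidate placement; the placement is given here as a dict mapping each workload to its server, a representation we chose for brevity.

```python
def utilization(placement, d, lam, n, s):
    # Server utilization (18): summed per-workload utilizations on server s.
    return sum(d[c][s] * lam[c] for c, srv in placement.items() if srv == s) / n[s]

def response_time_ok(c, s, placement, d, lam, n, r_max):
    # Approximate mean response time (19), checked against the SLA bound.
    u = utilization(placement, d, lam, n, s)
    if u >= 1.0:
        return False                  # the queue would be saturated
    r = d[c][s] * (n[s] - 1) / n[s] + (d[c][s] / n[s]) / (1.0 - u)
    return r <= r_max[c]
```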
Table 3
Mixed Integer Non-Linear Program for performance-constrained and thermal-aware workload placement.

I      minimize $\sum_{s \in S} P_s \left( 1 + \frac{1}{COP(T_{sup})} \right)$
       subject to:
II     $\sum_{j \in S} x_{c,j} = 1$    ∀c ∈ C
III    $P_s = P_{idle,s} + P_{busy,s} U_s$    ∀s ∈ S
IV     $T_{in,s} = \sum_{j \in S} a_{j,s} \left( T_{in,j} + \frac{P_j}{K_j} \right) + \left( 1 - \sum_{j \in S} a_{j,s} \right) T_{sup}$    ∀s ∈ S
V      $U_s = \frac{\sum_{i \in C} x_{i,s} d_{i,s} \lambda_i}{n_s}$    ∀s ∈ S
VI     $T_{in,s} \le T_{crit,s}$    ∀s ∈ S
VII    $U_s \le 1$    ∀s ∈ S
VIII   $x_{c,s} \frac{d_{c,s}}{n_s} \le \left( R_c^{(max)} - d_{c,s} \frac{n_s - 1}{n_s} \right) (1 - U_s)$    ∀(c, s) ∈ C × S
IX     $x_{c,s} \in \{0, 1\}$    ∀(c, s) ∈ C × S
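The following sketch evaluates a candidate assignment against the program of Table 3: constraints V, VII and VIII are checked directly, $T_{sup}$ is set to its optimal value (17) so that IV and VI are satisfied by construction, and the objective I is returned. It is an evaluator for a given placement, not a solver; the interfaces are our own assumptions and the numeric inputs are NumPy arrays.

```python
import numpy as np

def evaluate_placement(placement, d, lam, n, p_idle, p_busy, A, K, t_crit,
                       r_max, cop):
    m = len(n)
    u = np.array([sum(d[c][s] * lam[c]
                      for c, srv in placement.items() if srv == s)
                  for s in range(m)]) / n                  # constraint V
    if np.any(u >= 1.0):              # VII, made strict to keep (19) finite
        return None
    for c, s in placement.items():                         # constraint VIII
        r = d[c][s] * (n[s] - 1) / n[s] + (d[c][s] / n[s]) / (1.0 - u[s])
        if r > r_max[c]:
            return None
    p = p_idle + p_busy * u                                # constraint III
    t_sup = np.min(t_crit - np.linalg.solve(np.eye(m) - A.T,
                                            A.T @ (p / K)))  # (17), gives VI
    return p.sum() * (1.0 + 1.0 / cop(t_sup))              # objective I
```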
6.2. Mathematical programming formulation

We formulate the Mixed Integer Non-Linear Program (MINLP) for performance-constrained and thermal-aware workload placement as shown in Table 3. The objective function I is the sum of the power consumption of the servers and of the CRAC unit. The equality constraint II ensures that every workload is allocated to one and only one server. In III, the server power consumption is expressed as a linear function of the utilization, as discussed in Section 3. The linear heat flow model is used in IV to express the relationship between the inlet temperatures and the temperature of the cold air. The utilization of the servers is defined in V. The constraints VI ensure that the thermal specifications of the servers are met, while the constraints VII guarantee that the server capacity is not exceeded. The upper bounds on the mean response times are imposed in VIII. In IX, the $x_{c,s}$ variables are defined as binary. All the constraints of the problem are linear, while the objective function is non-linear and non-convex.

6.3. Solution strategies

We have tackled the mathematical program described in the previous sub-section using the modeling language AMPL,⁴ which allows for an easy and very readable definition of optimization problems, and COUENNE,⁵ an exact solver for non-convex programs with integer variables, based on branching and bounding techniques [22]. Since solving the problem to optimality takes an amount of time exponential in the number of workloads, we set a limit of 300 s on the execution time of the solver. Two heuristics were also developed for the fast solution of the optimization problem. The first heuristic is a greedy procedure that tries to place each application on the server which guarantees the minimum increase in total power consumption. The allocation is not performed whenever it invalidates one of the performance-related constraints; in this case, a less power-efficient allocation is considered. This heuristic is called Greedy Least Power (GLP) and its pseudo-code is given in Algorithm 1. Since the GLP heuristic decides workload placement for one application at a time, it might yield solutions that are quite far from the global optimum. In order to achieve better solutions, we have designed a randomized multi-start variant which runs the GLP algorithm multiple times, with randomly shuffled lists of applications. The algorithm keeps track of the best solution achieved across the multiple executions and terminates when a given time limit is exceeded. We call this heuristic Randomized Multi-start Greedy Least Power (RMGLP). While the GLP algorithm runs in tens of milliseconds, the randomized multi-start variant can take a large amount of time to explore the solution space. The time limit was set at 60 s, which is much smaller than the one set for the exact solver. The pseudo-code is given in Algorithm 2.

6.4. Results

Nine different test cases were considered, varying the number of workloads and the utilization level of the data center as shown in Table 4. For each test case, 10 different instances of the optimization problem were generated and tested. In order to compare with the exact solver, which takes a long time to find feasible solutions in the largest instances, we considered only the servers at the bottom of the racks, i.e., 10 servers. In order to do so, we obtained a 10 × 10 cross-interference matrix with the same approach described in Section 4.
The COP curve is the one estimated for the CRAC units at the HP Labs Utility Data Center [3]. The other parameters vary across different problem instances. In particular, the number of CPUs, the idle power and the busy power were generated according to uniform discrete and continuous distributions. Let $U_d\{h_1, \ldots, h_n\}$ denote the uniform discrete distribution over the set of values $h_1, \ldots, h_n$ and $U(a, b)$ denote the uniform continuous distribution over $[a, b]$. The parameters of the distributions were chosen as follows (a generator sketch for these instances is given after the arrival rate definition):

$$n_s \sim U_d\{4, 8, 12\} \quad \forall s \in S$$
4 http://www.ampl.com/. 5 https://projects.coin-or.org/Couenne.
Algorithm 1 Greedy Least Power algorithm

  u_s ← 0  ∀s ∈ S
  p_s ← P_idle,s  ∀s ∈ S
  x_{c,s} ← 0  ∀(c, s) ∈ C × S
  for i ∈ C do
      Δ_s ← 0  ∀s ∈ S
      for j = 1, …, |S| do
          p′_h ← p_h  ∀h ∈ S
          p′_j ← p_j + (d_{i,j} λ_i / n_j) P_busy,j
          T*_sup ← optimalCold(p⃗′, T⃗_crit)
          Δ_j ← (Σ_{h∈S} p′_h)(1 + 1/COP(T*_sup))
      end for
      order ← increasingOrder(Δ⃗)
      j ← 1
      feasible ← 0
      while feasible = 0 and j ≤ |S| do
          s ← order_j
          feasible ← 1
          u′_s ← u_s + d_{i,s} λ_i / n_s
          if u′_s > 1 then
              feasible ← 0
          end if
          for c ∈ C do
              r_c ← d_{c,s}(n_s − 1)/n_s + (d_{c,s}/n_s)/(1 − u′_s)
              if r_c ≥ R_c^(max) then
                  feasible ← 0
              end if
          end for
          if feasible = 1 then
              x_{i,s} ← 1
              u_s ← u′_s
              p_s ← P_idle,s + u_s P_busy,s
          end if
          j ← j + 1
      end while
      if feasible = 0 then
          return solution not found
      end if
  end for
  return x_{c,s}  ∀(c, s) ∈ C × S

Table 4
Mean percentage of power saved with respect to the baseline strategy. For each test case, 10 randomly generated problem instances were tested.

Test case   Applications   U(total)   GLSP (%)   GLP (%)   RMGLSP (60 s) (%)   RMGLP (60 s) (%)   COUENNE (300 s) (%)
1           20             0.3        14.97      16.64     15.28               16.88               15.87
2           20             0.5        11.77      17.55     11.88               18.55               17.32
3           20             0.7        8.75       13.03     13.59               21.32               16.90
4           40             0.3        15.13      16.94     15.19               17.11               16.65
5           40             0.5        11.74      17.97     11.50               18.37               16.41
6           40             0.7        12.31      18.05     13.02               19.51               15.42
7           100            0.3        14.97      18.05     15.07               18.12               15.42
8           100            0.5        9.89       17.85     10.01               17.99               14.51
9           100            0.7        7.99       15.56     8.31                16.28               9.58
$$P_{idle,s} \sim U(300, 600)\ \mathrm{W} \quad \forall s \in S$$
$$P_{busy,s} \sim U(500, 1000) \cdot n_s\ \mathrm{W} \quad \forall s \in S.$$

To generate the service demands, we considered a reference server $\tilde{s}$ with speedup factor $f_{\tilde{s}} = 1$ and one CPU core, i.e., $n_{\tilde{s}} = 1$. We generated the service demands of the workloads on this reference server according to the following distribution:

$$d_{c,\tilde{s}} \sim U(0.05, 0.1) \quad \forall c \in C.$$
Algorithm 2 Randomized Multi-start Greedy Least Power algorithm

  start ← currentTime()
  x_best ← NULL
  P_total,best ← ∞
  while currentTime() − start < timeLimit do
      C′ ← randomShuffle(C)
      x ← greedyLeastPower(C′, S)
      if x ≠ solution not found then
          P_total ← computePowerConsumption(x)
          if P_total < P_total,best then
              x_best ← x
              P_total,best ← P_total
          end if
      end if
  end while
  if x_best ≠ NULL then
      return x_best
  else
      return solution not found
  end if

In order to appropriately scale the service demands on the different servers and obtain the service demands for any workload and server pair, we define the speedup factor $f_s$ of a server $s$ with respect to the reference server $\tilde{s}$ as

$$f_s = \frac{d_{c,\tilde{s}}}{d_{c,s}} n_s,$$

which we assume to be constant for all workloads $c \in C$. The speedup factors were generated according to a uniform distribution:

$$f_s \sim U(0.8, 1.6) \cdot n_s \quad \forall s \in S.$$
The constraints on the mean response times were set to $R_c^{(max)} = 0.15$ for all $c \in C$ (i.e., 1.5 times the largest service demand on the reference server). If $U^{(total)}$ represents the overall utilization of the data center, the total standard CPU capacity consumed is $G^{(total)} = U^{(total)} \sum_{s \in S} f_s$. This capacity was partitioned across workloads according to randomly generated weights $g_c \sim U(0, 1)$, setting the arrival rates as follows:

$$\lambda_c = \frac{g_c}{\sum_{j \in C} g_j} \cdot \frac{G^{(total)}}{d_{c,\tilde{s}}} \quad \forall c \in C.$$
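A generator for these instances might look as follows; the data layout and the seeding are our own choices, and the per-pair demands follow from the speedup definition as $d_{c,s} = d_{c,\tilde{s}} \, n_s / f_s$.

```python
import numpy as np

def generate_instance(n_servers, n_workloads, u_total, seed=0):
    rng = np.random.default_rng(seed)
    n = rng.choice([4, 8, 12], size=n_servers)        # n_s ~ Ud{4, 8, 12}
    p_idle = rng.uniform(300, 600, n_servers)         # P_idle,s ~ U(300, 600) W
    p_busy = rng.uniform(500, 1000, n_servers) * n    # P_busy,s ~ U(500, 1000) n_s W
    f = rng.uniform(0.8, 1.6, n_servers) * n          # speedup factors
    d_ref = rng.uniform(0.05, 0.1, n_workloads)       # demands on the reference server
    d = d_ref[:, None] * (n / f)[None, :]             # d_{c,s} = d_{c,s~} n_s / f_s
    g = rng.uniform(0, 1, n_workloads)                # random capacity weights
    lam = g / g.sum() * u_total * f.sum() / d_ref     # arrival rates
    r_max = np.full(n_workloads, 0.15)                # response time bounds
    return n, p_idle, p_busy, d, lam, r_max
```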
The results obtained by the different algorithms are summarized in Table 4. The percentages represent the average total power savings (cooling and server power) with respect to a baseline strategy, which greedily allocates applications to the least used server, thus achieving a certain degree of load balancing. To stress the importance of taking the cooling power into consideration, we also ran a modified version of the GLP procedure that ignores the power consumption of the CRAC. We call this algorithm Greedy Least Server Power (GLSP) and the corresponding multi-start variant Randomized Multi-start Greedy Least Server Power (RMGLSP). Since GLSP, RMGLSP and the baseline strategy are not CRAC-aware, the linear heat flow model was applied a posteriori to obtain the optimal temperature of the cold air and therefore the total power consumption. We observe that GLSP, which only considers the power drawn by the servers, can achieve savings in the range 8%–15% with respect to the baseline strategy. Thanks to its CRAC-awareness, GLP introduces additional savings ranging between 1% and 8% and, more importantly, never achieves less than 13% savings with respect to the baseline strategy. The randomized multi-start variants further increase the power efficiency of the data center, especially when the number of applications is small but the overall utilization is high: in test case 3, RMGLP improves by more than 8% on the GLP results, achieving a total of 21% power savings with respect to load-balancing workload placement. In only 60 s, RMGLP always provides more than 16% savings when compared to the baseline strategy and additional savings of 2% to 9% over the non CRAC-aware RMGLSP. On the tested problem instances, COUENNE never fails to find feasible solutions, but it is always outperformed by the RMGLP heuristic (despite running five times longer) and in all test cases except the third one it provides smaller power savings than even the GLP, which runs in a few milliseconds. This happens because the time limit forces COUENNE to stop at sub-optimal solutions far from the optimum.

In Fig. 5 we represent the fraction of workloads allocated on each server $s$, defined as $\frac{\sum_{c \in C} x_{c,s}}{|C|}$. This value is averaged over all the 90 problem instances.
Fig. 5. Fraction of workloads allocated to each server: (a) RMGLSP (non CRAC-aware); (b) RMGLP (CRAC-aware). Dark colors correspond to a high allocation density, light colors correspond to a low allocation density.
The most densely populated server is represented in black, while a server on which no application was ever allocated is represented in white. The result for RMGLSP is represented on the left (a): since this algorithm is not CRAC-aware, the allocation density is quite homogeneous over all the servers. The result for RMGLP (b), on the other hand, highlights that servers at the ends of the aisle have the worst impact on heat recirculation and that it is advantageous to allocate applications in the central part of the aisle. This result is intuitive because cold air flows into the data center from grids located in the central part of the cold aisles.

7. Case study

In this section, we apply our methodology to real workloads and server specifications. The data center under consideration hosts 200 servers of size 2U. In order to use the same cross-interference matrix estimated in Section 5, the servers are grouped in 40 blocks of 5 servers. Since the traces we have adopted contain, for each workload $c \in C$, only the CPU utilization on its current server, $u_{c,host(c)}$ (missing information about throughput and service demands), we have rewritten constraint VIII, which defines a limit on the response time, requiring that a new workload allocation does not increase the response time. In fact, using the approximation for multi-server queues with exponentially distributed service times and Poisson arrivals proposed in [20], the current response time of each workload is

$$R_c^{(current)} = d_{c,host(c)} \frac{n_{host(c)} - 1}{n_{host(c)}} + \frac{d_{c,host(c)}}{n_{host(c)} (1 - U_{host(c)})},$$

where $U_s = \sum_{i \in C : host(i) = s} u_{i,host(i)}$. Dividing both sides of the equation by the service demand, we obtain the current normalized response time

$$r_c^{(current)} = \frac{R_c^{(current)}}{d_{c,host(c)}}, \tag{20}$$
which is a known quantity, since it only depends on the number of CPU cores and the utilization of the current server. For $R_c^{(max)} = R_c^{(current)}$, we can write constraint VIII dividing both sides by $d_{c,host(c)}$:

$$x_{c,s} \frac{d_{c,s}}{d_{c,host(c)}} \frac{1}{n_s} \le \left( r_c^{(current)} - \frac{d_{c,s}}{d_{c,host(c)}} \frac{n_s - 1}{n_s} \right) (1 - U_s).$$

Using the speedup factors and taking into account the number of CPU cores, we obtain

$$x_{c,s} \frac{f_{host(c)}/n_{host(c)}}{f_s/n_s} \frac{1}{n_s} \le \left( r_c^{(current)} - \frac{f_{host(c)}/n_{host(c)}}{f_s/n_s} \frac{n_s - 1}{n_s} \right) (1 - U_s).$$
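A sketch of this check for a tentative move of workload $c$ to server $s$: everything is expressed through the normalized response time (20) and the speedup factors, so no absolute service demands are needed. The helper and its arguments are our own naming.

```python
def relocation_ok(c, s, host, f, n, u_after, r_current):
    # Speedup-based form of constraint VIII for placing workload c on s;
    # the ratio below equals d_{c,s} / d_{c,host(c)}.
    ratio = (f[host[c]] / n[host[c]]) / (f[s] / n[s])
    lhs = ratio / n[s]
    rhs = (r_current[c] - ratio * (n[s] - 1) / n[s]) * (1.0 - u_after)
    return lhs <= rhs
```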
All the parameters which appear in this constraint are either known or can be obtained from benchmark results.

7.1. Server specifications

We selected a few rack-mounted 2U servers that are representative of a typical up-to-date data center. The number of CPU cores and the critical inlet temperatures are obtained directly from the technical guides released by the manufacturers (critical temperatures are lowered by 5 K for robustness). The relative speedups and the power characteristics of the servers are based on the published results of the SPECpower benchmark.
Table 5
Server specifications used in the case study. Power characteristics and speedup factors are based on published results of the SPECpower benchmark.

Server                   CPUs                     Cores   Speedup   Pidle (W)   Pbusy (W)   Tcrit (K)
Sun Netra X4250          2 × Intel Xeon L5408     8       1.000     228.730     71.111      308.15
HP ProLiant DL385 G6     2 × AMD Opteron 2435     12      2.331     124.532     136.846     298.15
Dell PowerEdge R815      2 × AMD Opteron 6174     24      4.032     116.744     160.717     303.15
Dell PowerEdge R815      4 × AMD Opteron 6174     48      8.008     178.171     329.533     303.15
Fig. 6. Intensity and normalized response time of the 1164 workloads considered in the case study.
SPECpower collects the throughput and power consumption of the system under test under a graduated server-side Java workload. The target utilization of the system ranges from 0.1 to 1 with increments of 0.1. The power consumption at idle, with all the Java Virtual Machine instances active and configured to run the workload, is measured as well. The relative speedup $f_s$ of a server is calculated with respect to an arbitrarily chosen reference server $\tilde{s}$ as

$$f_s = \frac{X_s^{(ssj)}}{X_{\tilde{s}}^{(ssj)}}, \tag{21}$$
where $X_i^{(ssj)}$ is the maximum throughput in the SPECpower benchmark. The power consumption coefficients $P_{idle}$ and $P_{busy}$ are estimated using least squares regression on the power consumption and utilization measurements of the SPECpower benchmark. The characteristics of the servers are shown in Table 5.

7.2. Workload data

The CPU utilizations of 1164 workloads were measured in the data center of one of the largest mobile operators in the world. These values have been standardized on the basis of published SPECintRate2006 benchmark results, considering a Sun Netra X4250 as the reference server. In Fig. 6(a), the histogram of the standardized workload intensities is shown. Most workloads use very little CPU time, but some of them are particularly demanding, exceeding the capacity of the reference server. The ratio between the estimated response time and the service demand, as defined in (20), is shown in Fig. 6(b). The vast majority of workloads have response times only slightly above their service demands and hence pose an interesting challenge for our workload placement scenario because of the new formulation of constraint VIII.

7.3. Results

The results of the optimization problem with the GLSP (non CRAC-aware) and GLP (CRAC-aware) heuristics are shown in Table 6. Both strategies are able to significantly reduce the power consumption (by 4% and 8%, respectively) with respect to an allocation policy which is not power-aware. However, the GLP allocation strategy, being CRAC-aware, greatly reduces the CRAC consumption (−3700 W) at the cost of a slight increase in the server consumption (+200 W) with respect to the GLSP approach. This happens because the CRAC-aware workload allocation finds a compromise between minimizing server consumption and minimizing CRAC consumption.

8. Conclusion

In this paper, we have presented a model that, given a placement of workloads in the data center, yields the power consumption due to the servers and the cooling equipment. The model is employed within an optimization problem
Table 6
Results obtained by the Greedy Least Server Power and Greedy Least Power heuristics in the case study. Power savings are with respect to the base allocation strategy, which is not power-aware (Greedy Load Balancing).

                GLSP (non CRAC-aware)   GLP (CRAC-aware)
PSERVERS (W)    55 946                  56 143
PCRAC (W)       22 554                  18 884
PDC (W)         78 501                  75 027
Savings         3.77%                   8.03%
which aims at minimizing power consumption while satisfying constraints on the response times imposed by service level agreements. A fast yet effective heuristic has been proposed for the solution of the optimization problem. Power savings in a number of different test cases range between 16% and 21% with respect to workload placement strategies based on performance requirements only. In the future, we plan to further validate the heat flow model, testing different layouts and increasing the number of servers used. Moreover, other constraints, such as those arising from reliability or compatibility requirements, will be introduced in the mathematical program.

References

[1] A. Banerjee, T. Mukherjee, G. Varsamopoulos, S. Gupta, Cooling-aware and thermal-aware workload placement for green HPC data centers, in: Green Computing Conference, 2010 International, August 2010, pp. 245–256.
[2] Q. Tang, T. Mukherjee, S. Gupta, P. Cayton, Sensor-based fast thermal evaluation model for energy efficient high-performance datacenters, in: Intelligent Sensing and Information Processing, 2006, ICISIP 2006, Fourth International Conference on, 2006, pp. 203–208.
[3] J. Moore, J. Chase, P. Ranganathan, R. Sharma, Making scheduling ‘‘cool’’: temperature-aware workload placement in data centers, in: Proceedings of the Annual Conference on USENIX Annual Technical Conference, ATEC’05, USENIX Association, Berkeley, CA, USA, 2005.
[4] Q. Tang, S. Gupta, G. Varsamopoulos, Thermal-aware task scheduling for data centers through minimizing heat recirculation, in: Cluster Computing, 2007 IEEE International Conference on, September 2007, pp. 129–138.
[5] D. Gmach, J. Rolia, L. Cherkasova, A. Kemper, Resource pool management: reactive versus proactive or let’s be friends, Comput. Netw. 53 (2009) 2905–2922.
[6] J. Anselmi, E. Amaldi, P. Cremonesi, Service consolidation with end-to-end response time constraints, in: Software Engineering and Advanced Applications, 2008, SEAA’08, 34th Euromicro Conference, September 2008, pp. 345–352.
[7] K. Dhyani, S. Gualandi, P. Cremonesi, A constraint programming approach for the service consolidation problem, in: A. Lodi, M. Milano, P. Toth (Eds.), Integration of AI and OR Techniques in Constraint Programming for Combinatorial Optimization Problems, in: Lecture Notes in Computer Science, vol. 6140, Springer, Berlin, Heidelberg, 2010, pp. 97–101.
[8] C. Patel, R. Sharma, C. Bash, A. Beitelmal, Thermal considerations in cooling large scale high compute density data centers, in: Thermal and Thermomechanical Phenomena in Electronic Systems, 2002, ITHERM 2002, The Eighth Intersociety Conference on, 2002, pp. 767–776.
[9] J. Moore, J.S. Chase, P. Ranganathan, ConSil: low-cost thermal mapping of data centers, in: Proceedings of the Workshop on Tackling Computer Systems Problems with Machine Learning Techniques, SysML, 2006.
[10] J. Moore, J. Chase, P. Ranganathan, Weatherman: automated, online and predictive thermal mapping and management for data centers, in: Autonomic Computing, 2006, ICAC’06, IEEE International Conference on, June 2006, pp. 155–164.
[11] A. Beitelmal, C. Patel, Thermo-fluids provisioning of a high performance high density data center, Distrib. Parallel Databases 21 (2007) 227–238. doi:10.1007/s10619-005-0413-0.
[12] E. Pakbaznia, M. Pedram, Minimizing data center cooling and server power costs, in: Proceedings of the 14th ACM/IEEE International Symposium on Low Power Electronics and Design, ISLPED’09, ACM, New York, NY, USA, 2009, pp. 145–150.
[13] B. Shi, A. Srivastava, Thermal and power-aware task scheduling for Hadoop based storage centric datacenters, in: Green Computing Conference, 2010 International, August 2010, pp. 73–83.
[14] N. Vijaykrishnan, M. Kandemir, M.J. Irwin, H.S. Kim, W. Ye, Energy-driven integrated hardware-software optimizations using SimplePower, SIGARCH Comput. Archit. News 28 (2000) 95–106.
[15] H. Shafi, P.J. Bohrer, J. Phelan, C.A. Rusu, J.L. Peterson, Design and validation of a performance and power simulator for PowerPC systems, IBM J. Res. Dev. 47 (5–6) (2003) 641–651.
[16] S. Gurumurthi, A. Sivasubramaniam, M.J. Irwin, N. Vijaykrishnan, M. Kandemir, T. Li, L.K. John, Using complete machine simulation for software power estimation: the SoftWatt approach, in: High-Performance Computer Architecture, International Symposium on, 2002, p. 141.
[17] D. Economou, S. Rivoire, C. Kozyrakis, Full-system power analysis and modeling for server environments, in: Workshop on Modeling, Benchmarking and Simulation, MOBS, 2006.
[18] Y.A. Cengel, Heat Transfer: A Practical Approach, 2nd ed., McGraw-Hill Companies, 2002.
[19] R. Kohavi, A study of cross-validation and bootstrap for accuracy estimation and model selection, in: Proceedings of the 14th International Joint Conference on Artificial Intelligence, Volume 2, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1995, pp. 1137–1143.
[20] D.A. Menasce, V. Almeida, Capacity Planning for Web Services: Metrics, Models, and Methods, 1st ed., Prentice Hall PTR, Upper Saddle River, NJ, USA, 2001.
[21] L. Kleinrock, Queueing Systems, Volume I: Theory, Wiley Interscience, New York, 1975.
[22] P. Belotti, J. Lee, L. Liberti, F. Margot, A. Wächter, Branching and bounds tightening techniques for non-convex MINLP, Optim. Methods Softw. 24 (2009) 597–634.
Andrea Sansottera received an M.S. degree in Computer Engineering from Politecnico di Milano in 2010. Currently, he is a Ph.D. student at Politecnico di Milano. His main research interests are performance modeling, virtualization technologies and server consolidation.
Paolo Cremonesi is an associate professor at Politecnico di Milano. He holds an M.Sc. in Aerospace Engineering (1991) and a Ph.D. in Computer Science (1996). Before joining Politecnico di Milano in 1993, he worked with the Von Karman Institute for Fluid Dynamics in Belgium. From 2001 until 2006 he was Editor of the Elsevier Journal of Systems Architecture. His recent work focuses on Green IT, computer system performance and interactive TV. He is the author of more than 40 scientific publications concerning distributed systems, performance evaluation and capacity planning.