ISA Transactions 90 (2019) 202–212
Practice article
Optimization based resource and cooling management for a high performance computing data center
Qiu Fang a, Qi Gong b, Jun Wang c, Yaonan Wang a

a National Engineering Laboratory for Robot Vision Perception and Control Technology, Department of Control Science and Engineering, Hunan University, Changsha, Hunan, 410082, PR China
b Department of Applied Mathematics and Statistics, University of California, Santa Cruz, CA 95064, USA
c Department of Control Science and Engineering, Tongji University, Shanghai, 201804, PR China
Highlights

• On the basis of scheduling and processing jobs efficiently, a control framework is proposed for HPC data centers to coordinate hysteresis resource provisioning, thermal-aware allocation and dynamic cooling management.
• An economic model predictive control based technique is used for modeling and managing the thermal environment.
• The proposed method is compared with existing techniques; it achieves significant energy saving with low performance loss in service quality.
Article info

Article history: Received 1 August 2018; Received in revised form 27 November 2018; Accepted 23 December 2018; Available online 4 January 2019.

Keywords: Data center; Thermal management; Optimization; Model predictive control; Energy efficiency
Abstract

This paper focuses on the problem of reducing energy consumption within high-performance computing data centers, especially those with a large portion of ‘‘small-size’’ jobs. Different from previous works, the efficiency of job scheduling and processing is made the first priority. To reduce the energy of servers while maintaining the processing efficiency of jobs, a new hysteresis computing resource-provisioning algorithm is proposed to adjust the total computing resource reactively. A dynamic thermal model is presented to reflect the relationship between the computational system and the cooling system. The proposed model is used to formulate constrained optimal control problems that minimize the energy consumption of the cooling system. Then, a two-step solution is proposed. Firstly, a thermal-aware resource allocation optimizer is developed to decide where the resource should be increased or decreased. Secondly, an economic model predictive controller is designed to adjust the cooling temperature predictively along with the variation of the rack power. The performance of the proposed method is studied through simulations with a real job trace. The results show that significant energy saving can be achieved with guaranteed service quality.

© 2019 ISA. Published by Elsevier Ltd. All rights reserved.
1. Introduction

The size and the number of high-performance-computing (HPC) data centers are increasing rapidly, so the demand for power is growing intensively. A data center mainly consists of a computational system and a cooling system. The servers in the computational system are connected via a high-speed network, so they can transfer data and process tasks together. The cooling system removes the heat generated by the servers and thereby cools the server room. Typically, the power of the cooling system accounts for up to half of the total power of a data center. To improve energy efficiency, both the computational system and the cooling system should be considered. This paper focuses on the optimal control strategies for
minimizing the power consumption of the computational system (servers) and the cooling system (computer room air conditioners, CRACs) simultaneously. The power efficiency of servers can be improved through power-aware capacity provisioning, such as P-state control [1] and dynamic voltage and frequency scaling [2,3]. These approaches adjust the power performance of the CPU according to the variation of the workload. In standard air-cooled data centers, the other major causes of energy inefficiency are heat recirculation and idle running servers [4]. Air-cooled data centers are usually designed with hot–cold aisles. The drawback of the hot–cold aisle design is that the hot air may recirculate to the cold aisle, and this heat recirculation inevitably decreases the cooling performance. Due to the complex nature of the airflow inside the data center, some of the hot exhaust air from the outlets of the server racks recirculates into the inlets of other server racks. One of the reasons for the
Fig. 1. The difference between previous work and this paper.
extra energy cost of the cooling system is the mixture of the cooling air and the recirculated hot air near the inlets. Under such circumstances, to provide acceptable inlet temperatures for all servers, the supplied air temperature of the cooling system has to be decreased, which increases the cooling energy cost. Thus, reducing hot air recirculation can be studied to decrease the energy cost of data centers. The direct way to eliminate the mixture of hot and cold air is to improve the physical design of the hot–cold aisles. One option is to develop new cooling technologies, such as liquid cooling [6]. Another is cold/hot aisle containment [7], which provides greater ability to control the supply air to match the server airflow. However, these solutions have some disadvantages. One side of the container can be hot and uncomfortable for managers and engineers, and the contained environment may decrease the flexibility of maintenance and affect the overall fire protection and lighting design. Considering the difficulties and expenses, these solutions may be more suitable for designing new data centers; physically improving existing data centers may require substantial investment. As many existing air-cooled data centers were built according to the hot–cold aisle design, this paper considers air-cooled data centers as the research object and aims to develop a control method that does not have to change the layout or design of the data center. Such a control solution may be a low-cost way to improve energy efficiency. Due to heat recirculation and uneven cooling air supply, different areas in the server room have different cooling efficiencies. Hence, more jobs can be dispatched to the areas with relatively high cooling efficiency to save more energy [8,9]. Those approaches are grouped as thermal-aware job scheduling. Another effective way to save energy is to aggregate tasks on a portion of the servers and shut down the idle ones [10]. Based on the prediction of workloads, dynamic resource/server provisioning has been studied for Internet data centers [10–12]. In Refs. [1,13], idle servers were temporarily shut down to save energy. However, it takes time to recover servers from a low-power mode or to restart them, which may cause the data center to be overloaded and thus decreases the efficiency of job execution [14]. Furthermore, frequent on/off switching of servers may reduce their lifetime. Besides the computational part, setting the reference temperature of the cooling system as high as possible can improve its energy efficiency [15,16]. To investigate the benefit of turning off idle servers in HPC centers, Refs. [1,13] considered shutting down idle servers in static job assignment problems. The general control process of previous
work is shown in Fig. 1(a). Each newly arriving job has to wait for the optimization problems to be solved and the corresponding servers to be turned on. Thus, it is worth considering shutting down idle servers after solving the thermal-aware job scheduling problem, when the jobs are relatively tolerant to time delays [8]. In particular, when thermal-aware job scheduling and server provisioning are combined in one optimization problem, the optimization problem is usually NP-hard [4]; the heuristic algorithms used to solve these unified optimization problems, such as the genetic algorithm and particle swarm optimization, are time-consuming [8,17]. Even assuming the time for solving the optimization problems can be reduced by advanced computing devices or algorithms in the future, switching servers on/off still needs time. Hence, the optimization and server provisioning process is the main cause of the delay in job scheduling. If the runtime of a job is relatively long, it is more tolerant to delays and the wait time may be neglected. Since the runtime of ‘‘small-size’’ jobs is relatively short, the wait time can play a non-negligible role. For instance, in the statistics of the LLNL-Thunder job trace [5] shown in Fig. 2(a)–2(b), the biggest job size is 3860 cores and the longest job runtime is 50 h, but 76.7% of the submitted jobs have sizes below 100 cores and runtimes of less than 10 min. These small-size jobs have to be dealt with ‘‘immediately’’ to keep a good end-user experience. Hence, the server turn-off policy should be handled carefully within data centers that have a large portion of ‘‘small-size’’ jobs. Based on a dynamic thermal model, model predictive control (MPC) was proposed to reduce the energy consumption of both servers and cooling systems in Internet data centers (IDCs) by coordinating job scheduling, server provisioning, cooling management and other elements (the electric grid) [18–21]. Among them, the economic MPC (EMPC) method [22,23] was introduced to realize the possible process performance improvement achieved by consistently dynamic operation [3,18,19] (i.e., not forcing the process to operate at a pre-specified steady state). Such strategies can predictively evaluate environment changes and provide optimal solutions. However, to the best of our knowledge, applications of similar methods in HPC data centers are limited. The scheduling of parallel and distributed jobs is more complicated in HPC data centers: the size, arrival interval and execution time of jobs are subject to wide variation ranges, which makes the prediction of the workload challenging. Above all, three main aspects, i.e., server provisioning, job scheduling, and thermal dynamics, have been considered in the literature. Most of the previous control methods are based on
Fig. 2. Statistics of job size and job runtime from the LLNL-Thunder job trace [5]; 106,107 jobs recorded during 5 months.

Table 1
Comparison with related work.

Research object    Job scheduling  Server provisioning  Thermal dynamics  Predictive control  Priority of jobs
IDC [10,11,14]     √               √                                      √                   High
IDC [9]            √                                    √                                     –
IDC [16,18]        √                                    √                 √                   –
IDC [19,20]        √               √                                      √                   High
HPC [13,14]        √               √                                                          Low
HPC [2,24]         √                                    √                                     –
HPC [1,4,8]        √               √                    √                                     Low
HPC [3]            √               √                    √                 √                   Low
This work (HPC)    √               √                    √                 √                   High
steady-state optimization. When server provisioning is considered, the priority of job processing should be taken into consideration. Comparisons with the related literature are summarized in Table 1. The research objects include IDCs and HPC data centers. IDCs usually deal with network-intensive workloads, while HPC data centers process compute-intensive workloads. According to the different types of workload, network, and servers, the designs of the job scheduling and server provisioning algorithms differ; thus, the corresponding optimization problems should differ, too. However, common ideas can be found among the different solutions. As shown in Table 1, the related works have incorporated some of these aspects. Reference [3] coordinated all three aspects with a predictive control method, but the considered jobs are relatively tolerant of time delay. This paper aims at high-priority jobs in HPC data centers: all three key aspects are addressed, and the thermal environment is managed through economic model predictive control. As a result, an optimal control framework is proposed for HPC data centers, especially those with a large portion of ‘‘small-size’’ jobs. The control framework aims at achieving energy saving from the idle servers and cooling systems while maintaining the efficiency of job scheduling and processing, thus balancing the energy consumption, device lifetime and service quality. The control process is shown in Fig. 1(b). The proposed design can avoid the waiting process in most cases. Different from previous work, the proposed solution puts the priority of job scheduling and processing first. With randomly incoming jobs, a simple FCFS-backfilling policy is used to schedule and distribute jobs immediately. Then, a hysteresis resource-provisioning algorithm is proposed to adjust the extra idle servers reactively with the variation of incoming jobs. By doing this, the processing efficiency of small jobs can be guaranteed. Considering the random characteristics of server provisioning, a thermal-aware optimizer is developed for managing the server allocation, and an economic MPC controller is designed for adjusting
the cooling temperature dynamically. The thermal-aware job scheduling is integrated with server provisioning as a ‘‘feedback’’ loop. Simulations on real job traces show that the proposed method can significantly decrease the energy consumption of both the computational system and the cooling system while ensuring the efficiency of job processing. The remainder of the paper is organized as follows. Section 2 presents the structure of the control framework. Section 3 includes the job scheduling and resource provisioning algorithms. Section 4 describes the thermal models and optimization problems. The optimal controllers are detailed in Section 5. The analysis of the simulation results is given in Section 6. Finally, Section 7 makes concluding remarks.

2. System overview

The control framework illustrated in Fig. 3 consists of three main parts: the Job Manager (JM), the Resource-provisioning Analyzer (RPA), and the Optimizer. The JM schedules the incoming jobs from the task pool by priority and distributes them to the available computing resources, i.e. active servers/cores. Based on the real-time incoming jobs, the minimum resource requirements are passed to the RPA by the JM. A hysteresis resource-provisioning algorithm is developed for the RPA and provides the evaluated resource requirements, which are calculated from the minimum resource requirement and the system resource utilization. The optimizer contains two parts, i.e. the Resource Allocation Optimizer (RAO) and the MPC controller; the thermal properties are considered in the optimizer. The RAO optimizes the allocation of active cores in each rack according to the evaluated resource requirements, and the EMPC controller manages the cooling system dynamically. Detailed descriptions of these four parts are provided in Sections 3–5.

The bottom of Fig. 3 presents the illustration of a typical air-cooled data center. The components are arranged in a hot-aisle/cold-aisle configuration. In the present study, the computational system of the data center consists of N = 28 server racks, each installed with S = 36 servers. There are M = 4 CRAC units delivering cold air under the raised floor. The cold air enters the racks from their front sides, takes away the heat generated by the servers, and exits to the hot aisles. The CRAC units absorb the hot air from the computer room.

3. Hysteresis resource provisioning algorithm

This section introduces the algorithms in the JM and the RPA. Together, they manage the dispatching of jobs and provide the resource requirements.
Fig. 3. The structure of the control framework.
3.1. Preliminaries

HPC data centers usually deal with parallel and distributed jobs with large variations in job size and execution time. For simplicity, a homogeneous hardware environment is assumed, i.e., all racks are identical and all servers have the same power performance and computing capability. Submitted jobs usually include information such as the submission time, the requested time, and the requested number of cores. For an incoming job, an appropriate number of cores is arranged based on the job size, which means that several servers can process the job cooperatively at the same time. A First-Come-First-Served (FCFS)-backfilling policy [25] is implemented in the JM to arrange the priority of jobs. When the waiting queue is empty, the JM schedules jobs by the First-Come-First-Served policy; in other cases, the JM moves small jobs forward as long as they do no harm to any previous job. Assuming all cores, Ctot, are available, the JM arranges as many jobs as possible. As the server on/off strategy is considered, the arranged jobs are dealt with immediately if there are enough idle cores; otherwise, some jobs need to wait for inactive cores to get ready.

3.2. Hysteresis resource-provisioning algorithm with turn-off-delay

To efficiently process incoming ‘‘small-size’’ jobs, a number of cores are reserved in the idle state. Providing more idle cores proactively can improve service quality, but at the expense of more energy consumption. To keep the service quality at an acceptable level and improve the energy efficiency of the computational system, we propose to adjust the idle cores dynamically. To avoid turning servers on/off too frequently, which decreases the lifetime of electronic devices, a hysteresis resource-provisioning algorithm is proposed to keep the number of idle cores in a reasonable range. Because delaying the turn-off of cores is also beneficial for processing unexpected incoming jobs, a turn-off delay is taken into consideration. Let Cidle(t), Cbusy(t) and Ctot be the numbers of idle, busy and total cores at time t respectively. The number of active cores, Con(t), equals Cbusy(t) plus Cidle(t). During the processing of jobs, Cbusy(t) and Cidle(t) fluctuate with the variation of the total processing jobs.
Let λl and λu be two tuning parameters with 0 ≤ λl ≤ λu ≤ 1; λl·Ctot and λu·Ctot specify the lower and upper bounds of the idle-core variation. The hysteresis resource-provisioning algorithm is summarized in Eq. (1), and it works in four situations:
• When the number of idle cores Cidle(t) exceeds the upper bound λu·Ctot, the algorithm decreases the idle cores to the middle of the two boundaries to reduce the energy consumption.
• When the number of idle cores Cidle(t) is in between the two boundaries, the algorithm does not change the number of idle cores.
• When the number of idle cores Cidle(t) is less than the lower bound λl·Ctot, i.e. Cidle(t) < λl·Ctot and Cbusy(t) ≤ (2 − λl − λu)·Ctot/2, the algorithm increases the idle cores to the middle of the two boundaries. This action helps to provide enough idle cores for unpredictable jobs.
• For the situation when the remaining cores (besides the busy cores) are not enough to set the idle cores to the middle number, i.e. Cidle(t) < λl·Ctot and Cbusy(t) > (2 − λl − λu)·Ctot/2, all cores are kept active, Con(t + 1) = Ctot.
Ĉidle(t + 1) =
    Cidle(t),              if λl·Ctot ≤ Cidle(t) ≤ λu·Ctot,
                           or Cidle(t) < λl·Ctot and Cbusy(t) > (2 − λl − λu)·Ctot/2,
    (λl + λu)·Ctot/2,      if Cidle(t) > λu·Ctot,
                           or Cidle(t) < λl·Ctot and Cbusy(t) ≤ (2 − λl − λu)·Ctot/2.    (1)
Algorithm 1 illustrates one loop of the server provisioning algorithm. When the algorithm decides to decrease the number of idle cores, a turn-off delay is implemented. Note that the RPA only provides the requirements of resources, i.e. the number of required idle cores in the next time slot; the resource management is the responsibility of the optimizer. In this algorithm, the delay time td and Steps 3–13 can put off the turning off of servers. Such a design also provides more idle servers temporarily for handling randomly incoming jobs. The RPA provides the evaluated required cores, Ĉon(t + 1), to the optimizer.
Algorithm 1: Hysteresis Resource-provisioning with turn-off-delay

Require: the total number of installed cores, Ctot; the number of idle cores, Cidle(t); the turning-off schedule of idle cores, ToS(t − 1); the number of cores waiting to be turned off, Cwait(t − 1).
Ensure: the evaluated number of required idle cores, Ĉidle(t + 1).
1:  Obtain Ĉidle(t + 1) with Eq. (1), Cidle(t) and Ctot;
2:  if Ĉidle(t + 1) ≠ Cidle(t) then
3:    if Ĉidle(t + 1) < Cidle(t) and Cidle(t) − Ĉidle(t + 1) > Cwait(t − 1) then
4:      Add Cidle(t) − Ĉidle(t + 1) cores to be turned off to the schedule, and set the corresponding countdown length as td;
5:      Update ToS(t) and Cwait(t);
6:      Set Ĉidle(t + 1) = Cidle(t);
7:    else if Ĉidle(t + 1) < Cidle(t) and Cidle(t) − Ĉidle(t + 1) ≤ Cwait(t − 1) then
8:      Cancel Cidle(t) − Ĉidle(t + 1) cores from the schedule ToS(t) according to a last-in-first-out policy;
9:      Update ToS(t) and Cwait(t);
10:     Set Ĉidle(t + 1) = Cidle(t);
11:   else if Ĉidle(t + 1) > Cidle(t) then
12:     Cancel all cores in the schedule, ToS(t) = Null, Cwait(t) = 0; (the case that idle cores have to be increased by the RAO)
13:   end if
14: end if
15: if any countdown counter of servers in ToS(t) equals 0 at time t then
16:   Remove the cores to be turned off from ToS(t), Cwait(t) and Cidle(t);
17:   Update ToS(t), Cwait(t) and Cidle(t);
18:   Update Ĉidle(t + 1) = Cidle(t); (the case that idle cores have to be decreased by the RAO)
19: end if
20: return Ĉidle(t + 1);
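A minimal Python sketch of Algorithm 1 follows, assuming the schedule ToS is kept as a queue of (cores, deadline) pairs and reusing evaluate_idle_cores from above; the step numbers in the comments refer to Algorithm 1, and all names are illustrative:

```python
from collections import deque

def rpa_step(t, c_idle, c_busy, c_tot, lam_l, lam_u, t_d, tos, c_wait):
    """One loop of Algorithm 1 (illustrative sketch).

    tos    -- turning-off schedule: deque of (num_cores, deadline) entries
    c_wait -- number of cores currently waiting to be turned off
    """
    c_hat = evaluate_idle_cores(c_idle, c_busy, c_tot, lam_l, lam_u)   # Step 1
    if c_hat != c_idle:                                                # Step 2
        decrease = c_idle - c_hat
        if decrease > 0 and decrease > c_wait:                         # Steps 3-6
            tos.append((decrease - c_wait, t + t_d))   # schedule extra cores, countdown t_d
            c_wait = decrease
            c_hat = c_idle                             # defer the decrease
        elif decrease > 0:                             # decrease <= c_wait, Steps 7-10
            while decrease > 0 and tos:                # cancel last-in-first-out
                n, deadline = tos.pop()
                cancel = min(n, decrease)
                decrease -= cancel
                c_wait -= cancel
                if n > cancel:
                    tos.append((n - cancel, deadline))
            c_hat = c_idle
        else:                                          # c_hat > c_idle, Steps 11-13
            tos.clear()                                # idle cores must be increased by the RAO
            c_wait = 0
    while tos and tos[0][1] <= t:                                      # Steps 15-19
        n, _ = tos.popleft()                           # countdown expired: release the cores
        c_wait -= n
        c_idle -= n                                    # these cores are turned off by the RAO
        c_hat = c_idle
    return c_hat, c_idle, c_wait, tos
```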
In the simulation, the cycle times of the JM and the RPA are set to 1 s.

4. Thermal models and problem statement

Collaborating with the JM and the RPA, the optimizer manages the allocation of computing resources and the cooling supply. An abstract thermal model is presented first for temperature evaluation.

4.1. Control-oriented modeling

4.1.1. Thermal model

From a heat transfer perspective, racks and CRAC units can be regarded as different types of thermal nodes. In the thermal network, there are N + M thermal nodes, including N rack nodes and M CRAC nodes. The rack nodes consume power for processing jobs and generate heat, while the CRAC nodes consume power for removing heat. An outlet temperature and an inlet temperature are defined for each node. The outlet and inlet temperatures of rack node i are denoted by Trout,i(t) and Trin,i(t), i ∈ {1, 2, ..., N} respectively; the outlet and inlet temperatures of CRAC node j are denoted by Tcout,j(t) and Tcin,j(t) similarly, j ∈ {1, 2, ..., M}. In this study, the evolution of the outlet temperature of Rack i is approximated by a simple linear model [26], i.e. the following linear time-invariant equation (2). Although this model neglects the accumulation of thermal inventory in the physical devices, and greater rigor in the model would make the results more definitive, this simplification has little impact on the overall application conclusions. Eqs. (3) and (4) below are obtained in the same way.

Ṫrout,i(t) = −αr,i · Trout,i(t) + αr,i · Trin,i(t) + cr,i · Pr,i(t),   (2)
where αr,i is the reciprocal of the time constant of the temperature of Node i, Pr,i(t) is the power consumption of Rack i, and cr,i is the coefficient mapping the power to the variation of the outlet temperature.
Similar to Eq. (2), the evolution of the outlet temperature of the CRAC units is approximated by [26,27]

Ṫcout,j(t) = −αc,j · Tcout,j(t) + αc,j · min{Tcin,j(t), Tref,j(t)},   (3)
where αc,j is the reciprocal of a time constant, and Tref,j(t) is the reference temperature of CRAC j. The reference temperature Tref,j(t) is controllable. The min operator in Eq. (3) ensures that the outlet air temperature of a CRAC unit is not greater than its own inlet temperature. An abstract airflow model from [28] is adopted to approximate the cross-interaction of the airflow among the thermal nodes. This model applies when the mixture of the air in the server room is stable. The inlet temperature of any thermal node is approximated by a linear combination of the outlet temperatures of all thermal nodes:

Tin,i(t) = Σ_{j=1}^{N+M} φi,j · Tout,j(t),   (4)
where φi,j is a weight parameter related to the connection between the outlet temperature of Node j and the inlet temperature of Node i. Let Tin = [Trin,1, ..., Trin,N, Tcin,1, ..., Tcin,M]ᵀ and Tout = [Trout,1, ..., Trout,N, Tcout,1, ..., Tcout,M]ᵀ. Then Eq. (4) can be put in the compact form

Tin = Φ · Tout,   (5)

where Φ is the matrix of the coefficients φi,j. The coefficient matrix Φ contains (M + N) × (M + N) unknown parameters. To calculate Φ, at least M + N different pairs of inlet and outlet temperatures are needed. Given N different sets of rack power plus M different sets of reference temperature, M + N different scenarios of computational fluid dynamics (CFD) simulations are conducted; the cross-interference coefficient matrix is then calculated with Eq. (5) from the corresponding inlet and outlet temperature distributions recorded from the CFD.
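For illustration, Φ can be estimated from the recorded scenarios by ordinary least squares; a minimal sketch, where the scenario arrays are assumed to come from the CFD runs:

```python
import numpy as np

def estimate_phi(T_out_data, T_in_data):
    """Least-squares estimate of the cross-interference matrix of Eq. (5).

    T_out_data, T_in_data -- (N + M) x K arrays with one recorded steady-state
    temperature vector per column, for K >= N + M distinct CFD scenarios.
    Solves T_in = Phi @ T_out column-wise, i.e. T_out^T Phi^T = T_in^T.
    """
    phi_t, residuals, rank, _ = np.linalg.lstsq(T_out_data.T, T_in_data.T, rcond=None)
    if rank < T_out_data.shape[0]:
        raise ValueError("scenarios are not rich enough to identify Phi")
    return phi_t.T
```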
4.1.2. Power model

Servers, which are installed with CPU cores, are assumed to switch between two operating states: on (active) or off. An active server can be busy or idle. The power of an active server consists of two parts: the power needed to keep the server on, and a part that increases linearly with CPU utilization. The utilization of idle servers is assumed to be zero, and servers in the off state consume no power. The average power of the active servers in Rack i is Pidle + Pco · Ui(t), where Pidle is the base power needed to keep a server on, Ui(t) is the average utilization of all active servers, and Pco is a coefficient representing the linear growth rate of the power with CPU utilization. The total power Pr,i(t) of Rack i is obtained by summing the power of its active servers:

Pr,i(t) = (Pidle + Pco · Ui(t)) · si(t),   (6)
where the number of active servers in Rack i is denoted as si(t), with si(t) ∈ {1, 2, ..., S}. Switching the operating state of servers needs time and power. In the simulations, the transient time for switching servers on/off is assumed to be 2 min, and the transient power for turning on a server is assumed to equal the maximum power of an active server. The power of the jth CRAC unit is as follows [26]:

Pc,j(t) = Qj · (Tcin,j(t) − Tcout,j(t)) / COP(Tcout,j(t)),   (7)

where Qj is a coefficient related to the density, the specific heat capacity, and the flow rate of the air passing through CRAC j.
The rate of heat removed by CRAC j is Qj · (Tcin,j(t) − Tcout,j(t)). The inlet temperature Tcin,j(t) is assumed to be no smaller than the outlet temperature Tcout,j(t). The coefficient of performance (COP) of CRAC j can be modeled as a function of its outlet air temperature Tcout,j(t) [27]:

COP(Tcout) = 0.0068 · Tcout² + 0.0008 · Tcout + 0.458.   (8)
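The power models (6)–(8) translate directly into code; a short sketch using the parameter values of Table 2:

```python
def rack_power(u_i, s_i, p_idle=180.0, p_co=120.0):
    """Total power of Rack i, Eq. (6): s_i active servers at average utilization u_i."""
    return (p_idle + p_co * u_i) * s_i

def cop(t_cout):
    """Coefficient of performance of a CRAC unit, Eq. (8)."""
    return 0.0068 * t_cout ** 2 + 0.0008 * t_cout + 0.458

def crac_power(t_cin, t_cout, q_j=4831.2):
    """Power of CRAC j, Eq. (7); q_j in W/degC as in Table 2."""
    return q_j * (t_cin - t_cout) / cop(t_cout)

# Example: 30 active servers at 50% utilization draw (180 + 60) * 30 = 7200 W;
# cooling return air from 30 degC down to 18 degC costs about 21.7 kW at COP(18) ~ 2.68.
```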
4.1.3. State-space model

The power consumed by the CRAC nodes and the rack nodes at time t can be grouped as Pc(t) and Pr(t) respectively. Similarly, define the number of active servers in the racks as the vector s(t), and the reference temperatures of the CRACs at time t as the vector Tref(t). A continuous-time state-space model can be derived from Eqs. (2)–(5) and discretized as follows:

Tout(k + 1) = Ad · Tout(k) + Bd · u(k),
Tin(k + 1) = Cd · Tout(k + 1),   (9)
where Ad, Bd and Cd are matrices obtained through the discretization of the continuous state-space model. The input vector u(k) includes the power of the racks, Pr(k), and the reference temperatures of the CRACs, Tref(k), i.e. u(k) = [Prᵀ(k), Trefᵀ(k)]ᵀ. Since the thermal dynamics evolve more slowly than the computational dynamics, the discretization time step T can be longer than that of the JM and RPA, i.e., T ≫ 1 s. Considering that the minimum time constant of the continuous state-space model is around 3 min, the sampling period used to discretize the continuous dynamics is set as T = 1 min. The discretized state-space model (9) can be used to predict the temperature variation for different rack powers and CRAC reference temperature settings.

4.2. Power optimization problem

Different assignments of jobs and active servers change the power distribution of the racks, which impacts the temperature distribution as shown in Eqs. (2), (3) and (5), and finally affects the cooling cost according to Eq. (7). On one hand, the outlet heat of a rack can recirculate to the inlets of other racks and impact the cooling efficiency; different rack power distributions mean different heat source distributions, and the processing jobs and the number of active servers are the main factors affecting the power of the racks. On the other hand, the set-points of the CRACs impact the coefficient of performance: a higher set-point leads to lower cooling power. To reduce the cooling power and improve the cooling efficiency, the power optimization problem is to find the optimal active server distribution, job assignment and CRAC set-points that minimize the total power consumption of the racks and CRACs. As the jobs are assigned to the active servers in each rack and only a small portion of idle servers is provided, the optimization of job assignment can be done by optimizing the distribution of active servers. In the following optimization problem, only the active server distribution and the CRAC set-points are optimized; the jobs are assigned according to the allocation of active servers. Note that the timescales of the decision variables differ. Due to the variation of incoming jobs, the adjustment of idle cores/servers is discrete and random: the time between two successive idle server adjustments is random and can be very short or very long, while the incoming jobs fluctuate very frequently. Hence, the rack power changes caused by job variation are more frequent than those caused by idle core adjustment. Considering the dynamic variation of the rack power, the reference temperature of the cold air from the CRACs should be managed precisely to prevent overheating or cooling waste. To solve the above optimal control problem, an optimizer with two steps is proposed:

1. Once a new resource requirement is provided to the optimizer, the RAO chooses the allocation of idle cores to be turned on/off based on the thermal properties, to decrease heat recirculation.
2. To deal with the variation of the rack power, an EMPC controller manages the cooling air temperature predictively.
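Both steps rely on the discretized model (9). As an illustration of how Ad, Bd and Cd could be assembled, the following sketch linearizes Eq. (3) by dropping the min operator (i.e., the reference temperature is taken as the effective CRAC input) and discretizes with a zero-order hold; this is a simplified reading of the model, not the authors' exact implementation:

```python
import numpy as np
from scipy.signal import cont2discrete

def build_discrete_model(alpha_r, alpha_c, c_r, phi, T=60.0):
    """Assemble and discretize the thermal model of Eqs. (2)-(5) into Eq. (9).

    alpha_r (N,), alpha_c (M,) -- reciprocal time constants of racks and CRACs
    c_r (N,)                   -- power-to-temperature coefficients of the racks
    phi                        -- (N+M) x (N+M) cross-interference matrix, Eq. (5)
    T                          -- sampling period in seconds (1 min in the paper)
    """
    N, M = len(alpha_r), len(alpha_c)
    A = -np.diag(np.concatenate([alpha_r, alpha_c]))
    # Rack rows of Eq. (2): the inlet temperature is a mix of all outlets via phi.
    A[:N, :] += np.diag(alpha_r) @ phi[:N, :]
    B = np.zeros((N + M, N + M))                  # input u = [P_r; T_ref]
    B[:N, :N] = np.diag(c_r)                      # rack power term of Eq. (2)
    B[N:, N:] = np.diag(alpha_c)                  # CRAC reference term of Eq. (3)
    C = phi                                       # T_in = phi @ T_out, Eq. (5)
    D = np.zeros((N + M, N + M))
    Ad, Bd, Cd, _, _ = cont2discrete((A, B, C, D), T, method="zoh")
    return Ad, Bd, Cd
```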
5. Optimization strategies

To set up the optimization problems, the predicted value of Tin(t) at the beginning of the hth interval, based on information available up to the beginning of the kth interval, is denoted as T̂in(h|k) = [T̂rinᵀ(h|k), T̂cinᵀ(h|k)]ᵀ; the vector T̂out(h|k) is defined similarly. The expected values of the active server distribution s(t), the reference temperature Tref(t) and the power distribution P(t) during the hth interval are denoted by ŝ(h|k), T̂ref(h|k) and P̂(h|k) = [P̂rᵀ(h|k), P̂cᵀ(h|k)]ᵀ respectively. Define
Δŝ(h|k) = ŝ(h|k) − s(k|k),
ΔP̂r(h|k) = P̂r(h|k) − Pr(k|k).

During the prediction horizon h ∈ {k, k + 1, ..., k + Np − 1}, P̂r(h|k) and T̂ref(h|k) can be grouped as

P = [P̂r(k|k), ..., P̂r(k + Np − 1|k)],
Tref = [T̂ref(k|k), ..., T̂ref(k + Np − 1|k)].
5.1. Optimization of resource allocation

Since the RPA provides the information of when and how many idle cores should be turned on/off at random times, the allocation of active servers has to be managed occasionally. The allocation of active servers affects the distribution of rack power and the cooling efficiency. The cost function of the optimization problem is designed to minimize the power of both the servers and the CRACs. Assuming there is a resource adjustment request at the beginning of the kth interval, the optimization problem to be solved is as follows:
min_{Tref, P}  Σ_{h=k}^{k+Np−1} { ‖P̂r(h|k)‖₁ + ‖P̂c(h|k)‖₁ }   (10)

s.t. dynamics (2)–(8), (9), and

Tref,min ≤ T̂ref(h|k) ≤ Tref,max,   (11)
T̂rin(h + 1|k) ≤ Trin,max,   (12)
1ᵀ · Δŝr(h|k) ≥ ΔCidle/C0,   (13)
ΔP̂r(h|k) = Pidle · Δŝr(h|k),   (14)
Pr,min ≤ ΔP̂r(h|k) ≤ Pr,max,   (15)
T̂out(k|k) = Tout(k),   (16)
where Pr(k|k) is the power of the racks at the beginning of the kth interval, and Tref,min and Tref,max are the lower and upper limits of the reference temperature respectively. Constraints (11) and (15) specify the bounds on the input variables. The main environment requirement is given in Constraint (12): according to [29], the inlet temperature of the server racks should stay below Trin,max = 27 °C. In Constraint (13), ΔCidle denotes the total number of idle cores to be adjusted, and C0 is the number of cores per server. When ΔCidle is positive, ΔCidle/C0 stands for the minimum number of servers that should be turned on; by contrast, when ΔCidle is negative, the absolute value of ΔCidle/C0 stands for the maximum number of servers that may be turned off. Constraint (13) thus specifies that the change of the total active servers is allowed to be greater than or equal to the minimum number of server adjustments, i.e., providing more idle servers than requested or turning off fewer servers.
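The full problem (10)–(16) couples the server adjustment with the thermal dynamics over the horizon. As a simplified single-step illustration, the per-rack adjustment under Constraints (13) and (15) can be posed as a small integer program; the per-rack weight w_cool below is our assumed stand-in for the recirculation cost that the thermal model (2)–(9) would predict, so this is a sketch rather than the paper's exact RAO:

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

def allocate_servers(delta_c_idle, c_0, w_cool, s_idle, s_off, p_idle=180.0):
    """Single-step sketch of the resource allocation problem (10)-(16).

    delta_c_idle  -- requested change in idle cores (positive: turn servers on)
    c_0           -- cores per server
    w_cool        -- per-rack weights (>1 for low cooling efficiency); an
                     illustrative proxy for the horizon cost of Eqs. (2)-(9)
    s_idle, s_off -- per-rack counts of idle and inactive servers (length N)
    """
    n = len(w_cool)
    cost = p_idle * np.asarray(w_cool)            # weighted power of adding a server
    # Constraint (13): the total server change covers the requested core change.
    total = LinearConstraint(np.ones((1, n)), lb=delta_c_idle / c_0)
    # Constraint (15) analogue: per-rack change bounded by idle/off server counts.
    bounds = Bounds(lb=-np.asarray(s_idle, float), ub=np.asarray(s_off, float))
    res = milp(c=cost, constraints=[total], bounds=bounds, integrality=np.ones(n))
    return np.round(res.x).astype(int)            # per-rack active-server adjustment
```

With positive weights, turning servers on fills the best-cooled racks first, while turning servers off empties the worst-cooled racks first, which is the qualitative behavior reported in Section 6.2.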
Table 2
Part of the parameters.

Notation   Value                       Notation   Value
N          28                          1/αc,j     180 s
M          4                           Qj         4831.2 W/°C
S          36                          td         [0, ∞)
Ctot       4032                        Np         15 min
Pidle      180 W                       Nc         2 min
Pco        120 W                       Φ          see Fig. 4
1/αr,i     360 s                       λl         [0, 1]
cr,i       5.1108 × 10⁻⁶ °C/Ws         λu         [0, 1]
Fig. 5. The variation of busy cores when all servers are active.
Fig. 6. The structure of the proposed EMPC controller.
Fig. 4. The coefficient Φ.
Constraint (14) specifies that the power change of the racks equals the power change caused by the idle server adjustments. In Constraint (15), Pr,min and Pr,max are the lower and upper bounds of ΔP̂r(h|k) respectively. The vector Pr,min groups the power of the idle servers in each rack at the beginning of the kth interval; its elements are non-positive. The vector Pr,max groups the maximum power of idle servers that can be added to each rack; for instance, if the number of inactive (off) servers in Rack 1 is x, the maximum power of idle servers that can be added is x · Pidle. The elements of Pr,max are non-negative. Solving this optimization problem provides the expected adjustment of idle (active) servers, Δŝ(k|k), and the set-points of the reference temperature, T̂ref(k|k). Since the power of the racks is also affected by the variation of jobs, the obtained set-points T̂ref(k|k) are not meaningful here and are not taken as the control input of the CRACs. The prediction horizon Np of this optimization problem is set long enough to evaluate the impact on the thermal environment.

5.2. EMPC for cooling management

In the data center, the heat generation of the racks is affected by the time-varying random workloads and the operating states of the servers. The desired steady state of the cooling system may change frequently to keep the thermal environment safe and the operating cost optimal. In an attempt to realize the possible process performance improvement achieved by consistently dynamic operation, an EMPC controller is proposed to manage the cooling system in a receding horizon manner. The cost function (17) of the EMPC controller is a direct reflection of the power of the cooling system. The structure of the EMPC controller is illustrated in Fig. 6. The state-space model (9) is augmented [30] according to the lengths of the prediction and control horizons. The inputs of the EMPC controller include the measured outlet temperatures Tout(k) of the
thermal nodes (racks and CRACs) and the measured power Pr(k|k) of the racks. The reference temperature of the CRACs, T̂ref(k|k), is the output of the EMPC controller. In the EMPC controller, the following optimization problem is solved step by step to adjust the reference temperature along with the variation of the rack power. The cost function aims to minimize the power of the CRACs. At the beginning of the kth interval, the optimal control problem that defines the EMPC controller is:
min_{Tref}  Σ_{h=k}^{k+Np−1} ‖P̂c(h|k)‖₁   (17)

s.t. system dynamics (2)–(8) and (9), and

Tref,min ≤ T̂ref(h|k) ≤ Tref,max,   (18)
T̂rin(h + 1|k) ≤ Trin,max,   (19)
P̂r(h|k) = Pr(k|k),   (20)
T̂out(k|k) = Tout(k).   (21)
Constraints (18) and (19) are the same as in the previous problem. Constraint (20) fixes the input vector P̂r(h|k) at the real-time power of the racks, and the outlet temperature is initialized by Eq. (21). The EMPC controller solves the optimization problem step by step to update the optimal reference temperature T̂ref(k|k). The prediction horizon of the EMPC controller is set to 15 min, i.e., Np = 15, and the control horizon is set to 2 min, i.e., Nc = 2. In this paper, the sequential quadratic programming (SQP) method is used to solve the optimization problems, which are implemented with the function fmincon in the Matlab Optimization Toolbox. All simulations were carried out on an IBM System X blade server with 24 GB RAM and two Intel Xeon X5650 CPUs with a frequency of 2.67 GHz; the optimization problems were solved within 1 s.
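In a Python setting, the same receding-horizon step can be sketched with scipy's SLSQP solver (an SQP implementation analogous to fmincon). The helpers predict and crac_energy below are assumed wrappers around model (9) and Eqs. (7)–(8), and for brevity the control-horizon blocking (Nc = 2) is omitted; this is an illustrative sketch, not the authors' implementation:

```python
import numpy as np
from scipy.optimize import minimize

def empc_step(T_out_k, P_r_k, predict, crac_energy,
              t_ref_lo, t_ref_hi, t_rin_max, Np=15, M=4):
    """One receding-horizon step of problem (17)-(21), an illustrative sketch.

    predict(T_out_k, P_r_k, T_ref_seq) -- rolls model (9) forward Np steps and
        returns the predicted rack inlet temperatures, shape (Np, N)
    crac_energy(T_out_k, P_r_k, T_ref_seq) -- cost (17): CRAC power, Eq. (7),
        summed over the horizon, with P_r fixed at P_r_k as in Constraint (20)
    """
    def cost(x):
        return crac_energy(T_out_k, P_r_k, x.reshape(Np, M))

    def rack_inlets(x):
        return predict(T_out_k, P_r_k, x.reshape(Np, M)).ravel()

    x0 = np.full(Np * M, 0.5 * (t_ref_lo + t_ref_hi))  # warm start within (18)
    cons = [{"type": "ineq",                           # Constraint (19): T_rin <= max
             "fun": lambda x: t_rin_max - rack_inlets(x)}]
    res = minimize(cost, x0, method="SLSQP",           # SQP, analogous to fmincon
                   bounds=[(t_ref_lo, t_ref_hi)] * (Np * M),   # Constraint (18)
                   constraints=cons)
    return res.x.reshape(Np, M)[0]    # apply only the first move, then re-solve
```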
6. Simulations and discussions

The simulations are conducted based on the data center shown in Fig. 3, with N = 28 racks, each with S = 36 identical server blades.
Table 3
The comparison of energy usage (kWh), PUE, RPCs and WJs.

Scenario     λl     λu     td (min)   Rack energy   CRAC energy   Total energy   PUE    RPCs   WJs
TTSC [3]     –      –      –          9 060         2812          11 872         1.31   2446   1226
Baseline 1   0      0      0          8 920         2906          11 826         1.33   1559   831
Baseline 2   –      –      –          16 543        8145          24 688         1.49   0      0
Baseline 3   0.20   0.40   20         13 036        5889          18 925         1.45   68     7
Case I       0.20   0.40   20         13 025        5218          18 243         1.40   68     7
Case II      0.05   0.15   20         10 889        3992          14 881         1.37   105    33
Case III     0.05   0.15   0          10 287        3768          14 055         1.37   157    52
Case IV      0      0      20         9 689         3207          12 896         1.33   323    149
Fig. 7. The comparison of active cores of Baseline 1 and Case IV.
Fig. 8. The comparison of active cores of Case I, II and IV.
Each server has C0 = 4 cores; hence, the data center has Ctot = 4032 cores in total. The cooling system consists of M = 4 CRAC units. The idle power of the servers in each rack is Pidle = 180 W, and the coefficient of power to CPU utilization is Pco = 120 W. The parameters related to the models and the proposed control method are given in Table 2. A 3-day job trace obtained from LLNL-Thunder [5] is used to test the proposed method. Since the utilization rate of the original data is very high, the job trace is modified by halving the job runtimes. There are 1226 jobs submitted over the 3 days. The average job size is almost 62 cores, and the maximum size of the submitted jobs is 2048 cores. The average runtime of jobs in the modified LLNL-Thunder trace is nearly 0.67 h, and the maximum runtime is about 12 h. When all servers are kept active, the cores needed by the jobs from the trace are as shown in Fig. 5.

6.1. Different settings of the RPA

Using the modified LLNL-Thunder trace as the test trace, a series of simulations was conducted under seven different scenarios. These
Fig. 9. Comparison of active cores and busy cores of Case II.
Fig. 10. The variation of max Trin , mean Trin and mean Tref in Case II.
results are shown in Table 3. Cases I–IV stand for the proposed control method with different settings of λl, λu and td. Three baseline control strategies (Baselines 1–3) are designed for comparison. In Baseline 1, all idle servers are turned off and the active servers are distributed through thermal-aware optimization. In Baseline 2, all servers are kept active to process jobs at the fastest speed. Different from Case I, Baseline 3 distributes active servers evenly to each rack. The two-time-scale control (TTSC) method developed in Reference [3] is implemented for comparison too. Note that in Reference [3], dynamic voltage and frequency scaling (DVFS) technology is considered in the optimization problem to improve the efficiency of servers; to compare its performance with the proposed method, we simply assume that the processing frequency of servers in TTSC is set the same as in the other cases. The TTSC is a typical control solution which follows the process indicated in Fig. 1(a). The TTSC method is similar to Baseline 1 in that no extra idle servers are provided: the distribution of each job is regarded as a single optimization problem, and the servers are turned on/off along with the distribution of jobs. In TTSC, once the optimal job dispatching law and reference temperatures are obtained, the control law is tested in the same simulation environment as the other cases.
Fig. 11. The performance of Case II.
Table 3 provides the comparison of energy usage, server provisioning changes (RPCs) and waiting jobs (WJs). In Table 3 we also provide the power usage effectiveness (PUE) as a measure of power efficiency. PUE is defined as the ratio of the total energy used by the data center facilities to the energy used by the computational system, i.e., PUE = 1 + ‖Pc‖₁/‖Pr‖₁. The total energy usage of Baseline 1 is about 48% of that of Baseline 2 and is the least among all scenarios; there is no idle server in Baseline 1. Apart from the TTSC, the RPCs and WJs of Baseline 1 are obviously higher than those of the other cases, reaching 1559 and 831 respectively. Since no extra idle servers are provided, the power consumption of TTSC is close to Baseline 1, but the RPCs and WJs of TTSC are the highest: by the nature of its design, each job has to wait for the optimization and the server provisioning process. In Baseline 1, even though the idle servers are turned off, some servers are not fully utilized; hence, some small jobs can still be processed with no delay. The comparison of Baseline 1 with Case IV (and Cases II and III) indicates that the turn-off delay, which provides extra cores for future unpredictable jobs, can decrease the RPCs and WJs significantly. However, increasing the turn-off delay time leads to power consumption growth. Fig. 7 provides an example comparing the variation of active cores between Baseline 1 and Case IV: in Case IV the server turn-off delay is 20 min, while Baseline 1 has no turn-off delay, and the RPCs and WJs decreased from 1559 and 831 to 323 and 149 respectively. In all cases, the proposed method produces lower RPCs and WJs than Baseline 1 and TTSC but consumes more power. The parameters λl and λu in the RPA affect the number of idle servers: λl decides the minimum number of idle servers, and Δλ = λu − λl specifies the gap between the two boundaries in the hysteresis algorithm. The comparison of Baseline 1 and Case III shows that λl and λu can significantly affect the resulting RPCs and WJs. Fig. 8 illustrates the variation of active cores in Cases I, II and IV: higher λl and λu provide more active cores. Fig. 9 compares the active cores and busy cores of Case II; the ‘‘active’’ line is smoother than the ‘‘busy’’ line as a result of the hysteresis provisioning algorithm.
6.2. Thermal and cooling management

Throughout the simulations, the EMPC controller adjusts the reference temperatures of the CRACs dynamically while enforcing the constraint on the inlet temperatures of the racks. Fig. 10 demonstrates the variation of max Trin, mean Trin and mean Tref in Case II. Fig. 11 shows the performance of the EMPC controller in Case II. The variation of active cores and busy cores is illustrated in Fig. 11(a). Along with the variation of the workload, the total computational power of the racks is shown in Fig. 11(b), as is the optimized power of the CRACs. As the EMPC controller uses the power of the CRACs directly as the cost function, the reference temperatures of the 4 CRAC units are updated at every step. The optimized reference temperature of each CRAC is indicated in Fig. 11(c), and the outlet temperature of each CRAC in Fig. 11(d). Since the power distribution of the racks is not even and each CRAC unit has a different impact on different areas, the optimized reference temperatures are not the same. Fig. 12 illustrates the distribution of active servers and busy servers sampled each hour through the 3 days. In Fig. 13, the upper lines of the different color bars demonstrate the distribution of active and busy servers in each rack at different times. The utilization rate is at a low and a medium level at times 36 h and 5 h respectively; in such situations, more active servers are allocated to the racks with high cooling efficiency, which saves more energy from the CRACs. The utilization rate is high at time 53 h: almost all servers were turned on, and server racks with low cooling efficiency were powered on to process tasks. The results show that the RAO allocates more active servers to high cooling efficiency areas. Comparing Case I with Baseline 3, the thermal-aware resource allocation optimizer saves about 11.4% of the energy of the cooling system, and the PUE decreases from 1.452 to 1.401. The PUE in Table 3 grows from Case IV to Case I, because the number of active servers grows from Case IV to Case I, which forces more servers to be distributed to low cooling efficiency areas and thus increases the difficulty of cooling those servers.

7. Conclusions

On the basis of scheduling and processing jobs efficiently, a new control framework is proposed to coordinate multiple energy
Fig. 12. The distribution of active servers and busy servers of 28 racks sampled each hour through 3 days.
Fig. 13. The distribution of active and busy servers in different rack (Case II, at time 53 h, 5 h, and 36 h).
saving techniques, such as thermal-aware server provisioning, job allocation, and control of the cooling system, at the data center level. A hysteresis resource provisioning algorithm is proposed to achieve energy saving from idle servers while keeping the processing efficiency of jobs. The thermal-aware server allocation optimizer and the EMPC-based controller can reduce the power consumption of the cooling system while enforcing the thermal constraints. The relationship between the energy saving and the performance loss is obtained by simulations under different scenarios. The simulation results confirm that the proposed control method can balance the typically conflicting demands of energy efficiency, device lifetime, and service quality. In the future, an adaptive law will be studied to adjust the parameters of the hysteresis resource provisioning algorithm and save more energy. The proposed method may also be extended to more complex environments, such as the smart grid, cloud computing and so on.
Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grant No. 61733004, the Funding Projects of Chang-Zhu-Tan National Independent Innovation Zone under Grant No. 2017XK2102, and the Fundamental Research Funds for the Central Universities of China under Grant No. 531107051111.
References
[1] Al-Qawasmeh AM, Pasricha S, Maciejewski A, Siegel HJ, et al. Power and thermal-aware workload allocation in heterogeneous data centers. IEEE Trans Comput 2015;64(2):477–91.
[2] Han D, Shu T. Thermal-aware energy-efficient task scheduling for DVFS-enabled data centers. In: Proc. of the international conference on computing, networking and communications. IEEE; 2015, p. 536–40.
[3] Fang Q, Wang J, Gong Q, Song M. Thermal-aware energy management of an HPC data center via two-time-scale control. IEEE Trans Ind Inf 2017;13(5):2260–9.
[4] Mukherjee T, Banerjee A, Varsamopoulos G, Gupta SK, Rungta S. Spatio-temporal thermal-aware job scheduling to minimize energy consumption in virtualized heterogeneous data centers. Comput Netw 2009;53(17):2888–904.
[5] Parallel Workloads Archive. http://www.cs.huji.ac.il/labs/parallel/workload/; 2018.
[6] Li L, Zheng W, Wang X, Wang X. Data center power minimization with placement optimization of liquid-cooled servers and free air cooling. Sustain Comput: Inform Syst 2016;11:3–15.
[7] Niemann J, Brown K, Avelar V. Impact of hot and cold aisle containment on data center temperature and efficiency. Schneider Electric Data Center Science Center, White Paper 135; 2011, p. 1–14.
[8] Tang Q, Gupta SK, Varsamopoulos G. Energy-efficient thermal-aware task scheduling for homogeneous high-performance computing data centers: A cyber-physical approach. IEEE Trans Parallel Distrib Syst 2008;19(11):1458–72.
[9] Bai Y, Gu L, Qi X. Comparative study of energy performance between chip and inlet temperature-aware workload allocation in air-cooled data center. Energies 2018;11(669).
[10] Chen G, He W, Liu J, Nath S, Rigas L, Xiao L, Zhao F. Energy-aware server provisioning and load dispatching for connection-intensive internet services. In: NSDI. Berkeley, CA, USA: USENIX Association; 2008, p. 337–50.
[11] Yin X, Sinopoli B. Adaptive robust optimization for coordinated capacity and load control in data centers. In: Proc. of the 53rd conference on decision and control (CDC). IEEE; 2014, p. 5674–9.
[12] Pakbaznia E, Ghasemazar M, Pedram M. Temperature-aware dynamic resource provisioning in a power-optimized datacenter. In: Proc. of the conference on design, automation and test in Europe. European Design and Automation Association; 2010, p. 124–9.
[13] Pahlavan A, Momtazpour M, Goudarzi M. Power reduction in HPC data centers: a joint server placement and chassis consolidation approach. J Supercomput 2014;70(2):845–79.
[14] Gandhi A, Chen Y, Gmach D, Arlitt M, Marwah M. Hybrid resource provisioning for minimizing data center SLA violations and power consumption. Sustain Comput: Inform Syst 2012;2(2):91–104.
[15] Mukherjee T, Banerjee A, Varsamopoulos G, Gupta SK. Model-driven coordinated management of data centers. Comput Netw 2010;54(16):2869–86.
[16] Song M, Chen K, Wang J. Numerical study on the optimized control of CRACs in a data center based on a fast temperature-predicting model. J Energy Eng 2017;143(5).
[17] Oxley M, Jonardi E, Pasricha S, Maciejewski A, Koenig G, Siegel HJ, et al. Thermal, power, and co-location aware resource allocation in heterogeneous high performance computing systems. In: Proc. of the international green computing conference. IEEE; 2014, p. 1–10.
[18] Parolini L, Sinopoli B, Krogh BH, Wang Z. A cyber-physical systems approach to data center modeling and control for energy efficiency. Proc IEEE 2012;100(1):254–68.
[19] Fang Q, Wang J, Gong Q. QoS-driven power management of data centers via model predictive control. IEEE Trans Automat Sci Eng 2016;13(4):1557–66.
[20] Yao J, Guan H, Luo J, Rao L, Liu X. Adaptive power management through thermal aware workload balancing in internet data centers. IEEE Trans Parallel Distrib Syst 2015;26(9):2400–9.
[21] Cupelli LJ, Schutz T, Jahangiri P, Fuchs M, Monti A, Muller D. Data center control strategy for participation in demand response programs. IEEE Trans Ind Inf 2018;14(11).
[22] Rawlings JB, Amrit R. Optimizing process economic performance using model predictive control. In: Nonlinear model predictive control. Springer; 2009, p. 119–38.
[23] Ellis M, Durand H, Christofides PD. A tutorial review of economic model predictive control methods. J Process Control 2014;24(8):1156–78.
[24] Oxley MA, Jonardi E, Pasricha S, Maciejewski AA, Siegel HJ, Burns PJ, Koenig GA. Rate-based thermal, power, and co-location aware resource management for heterogeneous data centers. J Parallel Distrib Comput 2018;112:126–39.
[25] Feitelson DG, Mu'alem Weil A. Utilization and predictability in scheduling the IBM SP2 with backfilling. In: 12th intl. parallel processing symp. (IPPS). 1998, p. 542–6.
[26] Parolini L, Sinopoli B, Krogh BH. Reducing data center energy consumption via coordinated cooling and load management. In: Proc. of the 2008 conference on power aware computing and systems (HotPower), Vol. 8, 2008, p. 14.
[27] Moore JD, Chase JS, Ranganathan P, Sharma RK. Making scheduling ‘‘cool’’: Temperature-aware workload placement in data centers. In: Proc. of the USENIX annual technical conference, general track. 2005, p. 61–75.
[28] Tang Q, Mukherjee T, Gupta SK, Cayton P. Sensor-based fast thermal evaluation model for energy efficient high-performance datacenters. In: Proc. of the 4th international conference on intelligent sensing and information processing. IEEE; 2006, p. 203–8.
[29] ASHRAE. 2008 ASHRAE environmental guidelines for datacom equipment, expanding the recommended environmental envelope; 2008.
[30] Wang L. Model predictive control system design and implementation using MATLAB®. Springer Science & Business Media; 2009.