Sustainable Computing: Informatics and Systems 3 (2013) 132–147
Using transient thermal models to predict cyberphysical phenomena in data centers†

Georgios Varsamopoulos, Michael Jonas, Joshua Ferguson, Joydeep Banerjee, Sandeep K.S. Gupta∗

Impact Lab, School of Computing, Informatics and Decision Systems Engineering, Arizona State University, Tempe, Arizona, United States
Article history: Received 11 September 2012; Accepted 29 January 2013.
Keywords: Data center modeling; Cyber-physical systems; Lightweight transient models.
Abstract: Designing and configuring the layout of data centers, as well as testing thermal-aware decision (e.g., scheduling) algorithms, has been hindered by the use of CFD simulations, which are considerably slow and offer little flexibility in integrating cyber behavior (e.g., workload scheduling). Fast thermal mapping techniques, on the other hand, may rely on wrong assumptions (e.g., steady state) and thus produce incorrect conclusions, or they can be unusable for cyber-physical interactions that rely on transient phenomena (e.g., transient hot spots that cause throttling). In this paper, we propose to speed up the evaluation of designs and algorithms with the use of a heat transfer model that captures transient behavior. We demonstrate its physical relevance, provide a methodology for deriving its parameters from experiments, and show how it can be combined with heat generation and cooling models to create a complete minimal data center system model, which can be simulated in a small fraction of the time a CFD simulation takes. © 2013 Elsevier Inc. All rights reserved.
1. Introduction

Energy-aware computing has grown dramatically in recent years. From longer cell phone battery lifetimes to low-power screens and power-scaling CPUs, the net effect has been an increase in the operating efficiency of computing devices that has led to greener devices. Due to their $7.4 billion in annual electricity use [1], data centers are an important topic in green computing research. In many data centers, the majority of the energy is consumed by the support infrastructure rather than by the computing equipment itself. Researchers have developed numerous energy-aware approaches to reduce these costs [2–12]. Many of these improvements rely on a thermal model to predict the temperature at places of interest throughout data center facilities. These temperature predictions can be used to make proper management decisions (e.g., which server to assign an incoming workload to).
† Developed as part of the NSF CRI project #0855277 "BlueTool" (http://impact.asu.edu/BlueTool/). Work in this paper was also funded in part by NSF grants #0834797 and #1218505.
∗ Corresponding author at: School of Computing, Informatics and Decision Systems Engineering, Arizona State University, Tempe, AZ 85287-8809, United States. Tel.: +1 480 965 3806.
E-mail addresses: [email protected] (G. Varsamopoulos), [email protected] (M. Jonas), [email protected] (J. Ferguson), [email protected] (J. Banerjee), [email protected] (S.K.S. Gupta).
Lab website: http://impact.asu.edu/.
doi: http://dx.doi.org/10.1016/j.suscom.2013.01.008
However, the thermal models used in these approaches only predict steady-state temperatures, ignoring critical temporal aspects of thermal behavior. As data center technology shifts to higher-density deployments, e.g., containerized and contained-aisle data centers, which typically exhibit lower air-to-equipment mass ratios compared to traditional big-room data centers, transient conditions tend to arise faster and be more extreme. The most prominent conditions that warrant the use of transient models are:

• The cooling delay problem [13,14]: High-density data centers can increase localized temperatures much faster than conventional data centers. If the cooling equipment (chiller) is slow at detecting and responding to the temperature rise, the servers may quickly reach high temperatures before the chiller catches up.

• Oscillating behavior of cooling: Most conventional vapor-compression chillers have a few discrete compression states (i.e., modes), e.g., off (no compression), low (one activated compressor), and high (two activated compressors or one stronger activated compressor). With such a configuration, the chiller can remove heat at almost fixed rates; thus, if the data center produces heat at some rate between two cooling rates, the chiller will have to oscillate between those two rates. Since oscillations cannot be captured by steady-state models, transient models are needed for such cases.

• Cool-off and heat-up periods due to the thermal capacitance of equipment: Most CFD simulations ignore the thermal capacitance of solid materials because it would slow down the simulation and
because it does not much affect the converged steady state. A characteristic of such simulations is that they suggest the data center reaches steady state within only a small fraction of the time it would take in reality. This can be a problem when considering dynamic workloads, as conventional CFD simulations will predict that, in times of high workload, the data center cools down rather quickly, consequently predicting that the equipment will spend less time heated (or overheated) than in reality. In the aforementioned examples, data center equipment may spend considerably more time overheated than predicted, which not only affects the long-term prediction of lifetime, but also causes short-term performance degradation due to temperature-triggered down-throttling of CPUs. The latter condition constitutes a cyber-physical implication loop: an increase in workload (i.e., computing behavior) causes an increase in heat (i.e., physical behavior), which in turn causes performance degradation (i.e., computing behavior), something that conventional steady-state simulators cannot predict. On the other hand, transient CFD simulations take a considerable amount of time, typically greater than steady-state CFD simulations. Such long stretches of simulation time, which can take several hours or days, are not suitable for online model-based decision making (e.g., how can we place incoming workload so as to avoid down-throttling?). The research question that this paper tackles is: Is there a lightweight model that can describe the transient thermal behavior of data centers, and how can it be combined with the cooling of CRACs, the power consumption of servers, and further with the performance of servers, to make energy and performance predictions?

1.1. Overview of the paper and of contributions

This paper proposes an analytical transient heat transfer model to be used instead of CFD simulations to speed up the evaluation and
decision making in initially designing or modifying the configuration of a data center, and also to speed up the testing of thermal-aware algorithms. The main idea is presented in Fig. 1. Designing and configuring a data center for operation requires the examination of several alternative scenarios. As CFD simulations take a considerable amount of time, we propose to instead use a model (Section 3) that predicts three aspects of thermal behavior: division (spatial distribution), i.e., how the heat produced by each computing server is split and circulated to every other server and to each chiller; and temporal distribution, i.e., how the heat portion of one server to another is distributed over time, which also captures the hysteresis, i.e., how long the heat takes to travel from one unit to another. Then, we apply the methodology in Section 4 to yield the contribution curves and weights, analytically fit those curves to a convex weighted sum of gamma distributions (Appendix B), and use the analytical formulation to feed an ODE solver. The benefits of using the proposed approach instead of CFD are twofold: run times (Section 5.3) that take a small fraction of CFD run times, and the ability to introduce logic and triggers that dynamically change the behavior of the data center system, something that is hard to implement in CFD. This methodology can be incorporated into data center simulators such as GDCSim [15] so that it can be used to test the effectiveness of several thermal-aware algorithms.
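To illustrate the curve-fitting step mentioned above (fitting contribution curves to a convex weighted sum of gamma distributions, detailed in Appendix B), the sketch below fits a two-component gamma mixture to a synthetic response curve. The use of SciPy, the two-component form, and all numeric values are illustrative assumptions, not the paper's actual fitting code:

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import gamma

# Model: convex weighted sum of two gamma pdfs (weights w1 and 1 - w1).
def two_gamma(t, w1, k1, th1, k2, th2):
    return (w1 * gamma.pdf(t, k1, scale=th1)
            + (1.0 - w1) * gamma.pdf(t, k2, scale=th2))

t = np.linspace(0.01, 60.0, 400)   # seconds after the heat spike
# Synthetic "observed" contribution curve: a fast peak plus a slow tail.
observed = two_gamma(t, 0.7, 2.0, 2.0, 5.0, 4.0)

p0 = [0.5, 1.5, 1.5, 4.0, 5.0]     # initial parameter guess
popt, _ = curve_fit(two_gamma, t, observed, p0=p0, maxfev=20000)
fitted = two_gamma(t, *popt)
print(np.max(np.abs(fitted - observed)))  # residual should be near zero here
```

Once such an analytical form is obtained, it can be evaluated cheaply inside an ODE solver instead of re-running a CFD simulation for each scenario.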
The rest of the paper is organized as follows: Section 2 covers preliminary material on the physical design on which data centers are based (Section 2.1), reviews the state of the art of thermal modeling (Section 2.2), establishes a critical drawback of assuming steady state when using discontinuous cooling models (Section 2.3), demonstrates the heat capacitance problem (Section 2.4) using an experiment on BlueCenter, the data center of the BlueTool project (http://impact.asu.edu/BlueTool/), and demonstrates the benefits of using our proposed approach. Section 3 formally defines our transient heat circulation model (Section 3.1), which states that the heat observed at a moment
Fig. 1. (a) Instead of spending considerable time testing alternative scenarios for design or configuration decision making, (b) we propose to yield analytical approximations and use them instead to speed up the process (the width of a rectangle is indicative of the time its procedure takes). This simulation approach can be incorporated into DC simulators such as GDCSim.
at the air inlet of some equipment is the cumulative integral of past produced heat as it was spread over time (Eq. (5)). Then, a symmetry property is presented, which establishes that the heat contribution curve c(t), describing how heat accumulates at a moment from all of the past, is a reflection of the heat distribution curve č(t), which describes how heat produced at a moment spreads over time toward a destination air inlet (Eq. (6)) (Section 3.2). Section 4 provides a methodology for calculating the parameters of our model for a given physical or modeled data center. The methodology involves inducing heat spikes at the air outlets and then observing the temperature effect at the air inlets of the servers. The methodology is applied to a CFD model of a hypothetical containerized data center (Section 4.2). The results show that the heat distribution curves č(t) resemble exponentially fading functions. This property is leveraged to analytically fit these curves to gamma functions (as later demonstrated in Appendix B). Section 5 presents a validation of the model through: (i) a comparison of a CFD simulation on OpenFOAM [16] with a MATLAB implementation of the analytical model (Section 5.1), (ii) testing whether the linearity of the model manifests in physical models (Section 5.2), including a study of the accuracy of the model when fan speeds change (Section 5.2.1), and (iii) a small scalability test on the run times of the transient model implementation in MATLAB (Section 5.3). Section 6 discusses how to analytically combine the transient model with heat generation and cooling models. The combination, along with the analytical fitting of the curves, is used in Appendix C to create a linear system model of the data center that can be fed to an ODE solver. Section 7 provides some conclusions on the models and methods provided in this paper, with a discussion of future work.
The appendix contains more detailed information on the experiments (Appendix A), curve fitting (Appendix B) and MATLAB ODE formulation (Appendix C).
2. Preliminaries

2.1. Data center operation

A typical data center is organized as follows. All servers are situated on a raised-floor plenum and organized into rows that separate aisles; the aisles alternate between cold-air intake aisles and hot-air exhaust aisles. Cold air from the computer room air conditioner (CRAC) is supplied to the servers through perforated tiles in the raised floor. This cold air passes through the servers, where it absorbs heat, and is ejected into the hot aisle. Finally, the hot air returns to the CRAC through a ceiling intake, completing the cycle. Cooling inefficiency in this system has a number of sources, but in this paper we focus on heat recirculation (i.e., the return of hot air to the air inlets of computing equipment) and the resulting hot spots, i.e., localized elevated temperatures due to heat recirculation. CRACs are incapable of provisioning varying amounts of cool air to specific, dynamic locations in the data center. As a result, much of the equipment is cooled far below red-line temperatures in order to ensure that all equipment stays below this threshold. This is due to hot spots, i.e., locations where inlet temperatures are dramatically higher than the rest. Hot spots force the CRAC to over-cool the rest of the data center while trying to bring the hot spots to normal temperature. This negatively impacts CRAC efficiency not only through higher utilization, but also by reducing the CRAC's Coefficient of Performance (CoP), i.e., the ratio of heat removed to the work (energy) expended to remove that heat. Most models and simulators of data centers operate under the thermal isolation assumption, i.e., that the heat passing through the data center walls is negligible.

Knowledge of the heat recirculation, the CRAC efficiency, and the magnitude of over-cooling can be used to save energy in a data center by predicting how much energy would be spent under upcoming scenarios. By placing load intelligently with respect to its thermal impact, thermal-aware scheduling can allow the CRAC to operate at higher efficiency by raising the required supply temperature. To the degree that the temperature of a data center can be accurately predicted for any load, the amount of over-cooling shrinks and efficiency is gained. This highlights the potential of having a better thermal model, i.e., a mapping of how load affects the temperature throughout the data center. As long as the inlet of all equipment is kept below its manufacturer-specified red-line temperature and system performance is not overly affected, this efficiency gain causes no adverse effects on performance, e.g., no SLA (Service-Level Agreement) violations in Internet applications [17].

2.2. Existing thermal modeling techniques (steady state)

While the previous section motivated the need for an accurate thermal model, this section briefly reviews existing methods used for thermal modeling, compares them against empirical behavior, and motivates the need for a transient model. Thermal modeling of data centers requires three components (see Fig. 2):

1. Heat generation: Heat generation describes how power consumed by a server is emitted as heat into the data center. A common and practical assumption is that the mechanical (sonic) and electromagnetic energy produced is negligible compared to the heat produced, thus: \dot{Q}_{computing} \approx P_{computing} = P.
In a steady state, the heat rate output at a server's air outlet equals the heat rate at the server's air inlet plus the heat produced:

\dot{Q}_{outlet} = \dot{Q}_{inlet} + P,

which can be translated into the temperature domain as:

T^{-} = T^{+} + \frac{P}{c_p f} \qquad \text{(outlet temp. = inlet temp. + temp. rise)},

where c_p is the specific heat of air and f is the air flow rate through the server. In CFD simulation software such as the commercially
Fig. 2. Thermal modeling of a data center requires three components: a heat generation model, which describes how power consumed is introduced as heat into the data center, a heat circulation (heat flow and mixing) model, which describes how heat is distributed among the data center equipment, and a cooling model, which describes how cooling is introduced into the data center.
available FloVENT or the open-source OpenFOAM [16], heat generation is modeled as (forced) convection through the servers.

2. Heat circulation (heat flow and mixing): Heat circulation describes how heat is transferred around the room by the air. In CFD software, this is done by modeling the room as a 3D grid of Navier–Stokes equations that express physical laws such as conservation of energy, momentum, mass, and pressure in a defined geometry. Although CFD simulations can predict the temperature at all points of the grid, it is of practical interest to predict the temperature at only a few points, specifically at the equipment's air inlets and outlets. Fast approximate solvers such as those used in [18,19,10,20] all rely on the following steady-state assumption: for any constant input of heat, the data center will reach a steady thermal state such that the temperature anywhere in the data center becomes invariant to time. These approaches frequently use FloVENT for model building, but once a model is constructed they heuristically approximate temperatures. For example, the work in [10,20] constructs a heat recirculation matrix D that is used to predict the steady-state temperature at the air inlets of the servers, given the power consumption p_i of each server plus the supplied cool temperature T_{out} from the CRAC:

\begin{bmatrix} T_1 \\ T_2 \\ \vdots \\ T_n \end{bmatrix} = T_{out} + D \begin{bmatrix} p_1 \\ p_2 \\ \vdots \\ p_n \end{bmatrix}. \qquad (1)

3. Cooling: Cooling models describe how heat is removed from the air; it is analogous to heat generation. In the constant cooling model (Fig. 3(a)), the outlet of the CRAC supplies cool air at a constant temperature, regardless of the input air temperature. The power removed under this model scales linearly with inlet temperature. Previous research routinely assumes a constant cooling model because, given this model, the steady-state assumption is guaranteed to hold [18,20].
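As an illustration, the steady-state prediction of Eq. (1) reduces to a single matrix–vector operation. The sketch below uses an invented three-server recirculation matrix purely for demonstration (a real D is derived from CFD simulations or measurements):

```python
import numpy as np

# Hypothetical heat recirculation matrix D (K/W) for 3 servers:
# entry D[j][i] tells how much server i's power raises server j's inlet temp.
D = np.array([
    [0.004, 0.001, 0.000],
    [0.001, 0.004, 0.001],
    [0.000, 0.001, 0.004],
])
T_out = 18.0                           # CRAC supply temperature (deg C)
p = np.array([1000.0, 500.0, 1500.0])  # per-server power draw (W)

# Eq. (1): steady-state inlet temperatures
T_inlet = T_out + D @ p
print(T_inlet)  # [22.5, 22.5, 24.5]
```

Note that this is exactly the kind of one-shot, time-invariant prediction that the transient model of Section 3 generalizes.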
2.3. Shortcomings of steady-state cooling assumptions

The constant cooling model is only accurate for certain types of CRAC units. The empirical behavior of many types of CRACs is step-linear (Fig. 3(c)): the CRAC extracts a near-constant amount of heat while the inlet temperature is within each temperature range.

Fig. 4. Temperature sensor measurements at the CRAC inlet (Ceiling West 2) and outlet (Floor West 1) of a real data center over two weeks' time, showing recorded pairs, their average, and a linear fit [3].

The most crucial difference from the research models is that the step-linear behavior does not allow steady-state solutions to occur for the majority of scheduling patterns. To understand this, first consider that data centers are designed to be closed systems, such that the only points of energy exchange in the system are known and accounted for; this is known as the thermal isolation assumption [21]. Fig. 3(b) shows the linear cooling model, which is simply a function of the form y = mx + b. With m = 1, the power removed by the CRAC is constant; as m approaches 0, this model converges to the constant model. Consider how the total energy of a system evolves when the heat input to the system from the chassis is near-constant and the heat removed from the system follows the empirical step-linear cooling model in Fig. 4. From this perspective, in order for a data center to reach a steady thermal state, the incoming energy must equal the outgoing energy. Here we show that this is not commonly the case. The incoming power can be derived as the sum of the power of all heat sources:

P_{in} = \sum_{i=1}^{n} p_i.

P_{in} is assumed to be constant for a particular schedule. The outgoing energy can be inferred from the temperature drop across the CRAC, so long as the airflow mass is known. Thus, an equilibrium temperature exists:

P_{in} = P_{AC} = (T_{AC} - T_{AC,out})\, c_p f_{AC} \;\Rightarrow\; T_{AC} - T_{AC,out} = T_{AC} - (m T_{AC} + b) = \frac{P_{in}}{c_p f_{AC}} \;\Rightarrow\; T_{AC} = \frac{(P_{in}/c_p f_{AC}) + b}{1 - m}, \qquad (2)

where c_p is the specific heat of air and f_{AC} is the airflow rate of the AC. In the case for which m = 1, the data center can only reach equilibrium at exactly one value of P_{in} = -b\, c_p f_{AC} (from Eq. (2)). Since the slope of the step-linear cooling model is typically near 1, we can only conclude that in the majority of cases no steady state is possible. Hence, predictions made by steady-state thermal models may be false and, generally speaking, not conservative. This leads to possibly critical situations where red-line temperatures are exceeded as the temperature fluctuates. This is a problem for fast approximate solvers that do not support step-linear cooling models.
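Eq. (2) can be explored numerically. The sketch below uses illustrative parameter values (not measurements from the paper) to show how the equilibrium CRAC inlet temperature diverges as the cooling slope m approaches 1, i.e., why no practical steady state exists near m = 1:

```python
# Illustrative check of Eq. (2): equilibrium CRAC inlet temperature
# T_AC = ((P_in / (c_p * f_AC)) + b) / (1 - m)
c_p = 1005.0     # specific heat of air (J/(kg*K))
f_AC = 5.0       # CRAC airflow mass rate (kg/s), assumed value
P_in = 10_000.0  # total server power (W), assumed value
b = 5.0          # intercept of the linear cooling model (K), assumed value

for m in (0.0, 0.5, 0.9, 0.99):
    T_eq = ((P_in / (c_p * f_AC)) + b) / (1.0 - m)
    print(f"m={m:0.2f} -> equilibrium temperature rise: {T_eq:0.2f} K")
# As m -> 1 the equilibrium temperature grows without bound: for general
# P_in there is no steady state when the cooling slope is exactly 1.
```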
Fig. 3. Various cooling models assumed in the literature. (a) The constant model assumes a constant temperature of supplied cool air, regardless of the input temperature; (b) the linear model assumes a linear reduction of temperature with a constant shift; (c) the step-linear model is a set of discontinuous linear functions that best reflects empirically observed behavior [3].
Fig. 5. A rendering of the CFD model of BlueCenter used for validations in this paper. BlueCenter features a cold-hot-cold aisle layout, with isolated aisles communicating through doors. In the experiments, the doors were kept open to allow heat recirculation.
In summary, steady-state solvers cannot provide accurate temperature estimates in situations where thermal steady states do not exist [22]. Where inaccuracies occur, steady-state solvers are not conservative in their estimates [23]. When heat spikes occur on a regular basis, they can cause heat-induced performance throttling, equipment shutdown, and equipment damage over time. These are all costly consequences that a transient thermal model can help prevent.

2.4. Heat capacitance in real systems

In typical CFD simulations of data centers, only the air (i.e., the fluid) is simulated. Solid equipment acts only as thickness-free walls, some of which may have some porosity (e.g., ventilation tiles), and has no thermal exchange with the fluid. This saves a lot of simulation time and can still find good steady-state solutions. In other words, most CFD simulation models of data centers model only the air-cooling portion, i.e., they focus on the temperature of the air. An experiment was conducted in BlueCenter, a small data center at ASU that is part of the BlueTool project (a CFD model of it is shown in Fig. 5). Initially, 144 servers were turned on and the
Fig. 6. CRAC inlet temperature from experimental measurements at BlueCenter vs CFD simulation on OpenFOAM. The CFD simulations suggest that the room quickly converges to the equilibrium state which is not the case in reality.
temperature at the inlet of the CRAC was brought to 40 °C. The CRAC was then turned on. The inlet and outlet temperatures of the CRAC were recorded for two hours using sensors attached to the air ducts of the CRAC (section (a) in Fig. 6). At the end of two hours, 48 servers were turned off. The remaining 96 servers continued running for two more hours, after which another 48 servers were turned off (section (b) in Fig. 6). The remaining 48 servers were allowed to run for two more hours (section (c) in Fig. 6). There was no other variation in power output. The geometry of BlueCenter was generated using the BlueSim CFD simulator, the experimental setup was simulated, and the predicted CRAC inlet temperature was recorded. These predicted values were then compared with the experimental values; the results are shown in Fig. 6. Although the CFD simulation suggests that the data center converges to the steady-state temperature quickly (within a couple of minutes), in reality the steady state is achieved in a couple of hours. This phenomenon is due to the thermal capacitance of the materials, mainly solid objects. For a transient model to be accurate and usable, thermal capacitance must be taken into account; otherwise the model may exhibit large prediction errors.
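The slow convergence observed above is the classic signature of lumped thermal capacitance. A minimal sketch, assuming a single first-order lumped model with illustrative time constants (not values fitted to BlueCenter data), shows why an air-only simulation converges far faster than one that includes solid thermal mass:

```python
import math

def settle_time(tau, frac=0.99):
    """Time for a first-order system T(t) = T_ss + (T0 - T_ss) * exp(-t / tau)
    to close `frac` of the gap to its steady state."""
    return -tau * math.log(1.0 - frac)

# Illustrative time constants (seconds): air alone responds in minutes,
# while racks and walls add thermal mass with a much larger tau.
tau_air_only = 60.0       # assumed: air-only CFD model
tau_with_solids = 3600.0  # assumed: air plus solid thermal capacitance

print(settle_time(tau_air_only) / 60.0)       # ~4.6 minutes
print(settle_time(tau_with_solids) / 3600.0)  # ~4.6 hours
```

The minutes-versus-hours gap between these two settle times mirrors the discrepancy between the CFD prediction and the measurements in Fig. 6.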
Fig. 7. Emulation of CFD output using a preliminary version of our transient model. CFD simulations need to be done separately for each power consumption level, and predict the converging state.
Fig. 8. Simulation of the same three-server data center with a 2-mode CRAC (1 kW low / 5 kW high). It is clear that the transient simulation predicts that the data center can reach a far wider range of temperatures than with a converging simulation.
2.5. Benefits of a transient cyberphysical simulation of data centers

To demonstrate the benefits of having a lightweight transient model, we combined our proposed model with a server thermal model (i.e., heat generation) and a CRAC model (i.e., cooling) as in Appendix C, and generated three figures (Figs. 7–9). The first figure (Fig. 7) depicts a 2500-second simulation of a hypothetical data center with three servers and one CRAC, using our transient model in MATLAB with ode45. The servers work at 66% (1000 W each) for 500 s, then go to 33% (500 W each) for another 500 s, then to 100% (1500 W each) for 500 s, and finally run for another 1000 s at 1000 W each. There is strong recirculation (only about 25% of the heat produced reaches the CRAC before any other server). Air travel delays are in the range of 1–4 s from a server to other equipment (server or CRAC) and 5–8 s from the CRAC to other equipment. This simulation scenario tries to emulate what a CFD simulator would produce: a converging steady state at each stage. The next figure (Fig. 8) shows that a transient model can produce results for non-converging (oscillating) behavior, in this example due to a two-mode CRAC, for the same workload schedule. The figure shows how the inlet temperatures vary when the CRAC thermostat is set to 28 °C. The proposed model can produce these results in under one minute. The benefits of an analytical and computationally lightweight transient model are shown in the last figure (Fig. 9). In this scenario, and for the same workload schedule as above, we enhanced the model with a throttling function that reduces a server's power when its inlet temperature exceeds 30 °C. The results show that there can be several time windows of throttling, which will cause performance degradation.
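The cyber-physical feedback loop described above (heat triggers throttling, throttling reduces heat) can be sketched with a toy forward-Euler loop over a single lumped zone. This is not the paper's MATLAB/ode45 model; all constants below are invented for illustration:

```python
# Toy transient loop: one lumped zone, a 2-mode CRAC, and a throttling trigger.
# All constants are illustrative, not taken from the paper's experiments.
C = 50_000.0          # lumped thermal capacitance (J/K)
T = 25.0              # zone (inlet) temperature (deg C)
dt = 1.0              # time step (s)
P_full, P_throttled = 3000.0, 1500.0  # server power draw (W)
throttle_at = 30.0    # throttling trigger temperature (deg C)
crac_low, crac_high = 1000.0, 2500.0  # 2-mode CRAC heat removal (W)
thermostat = 28.0     # CRAC switches to high mode above this (deg C)

throttled_windows = 0
was_throttled = False
for step in range(3600):
    throttled = T > throttle_at           # cyber reacts to physical state
    P = P_throttled if throttled else P_full
    cooling = crac_high if T > thermostat else crac_low
    T += dt * (P - cooling) / C           # physical state reacts to cyber load
    if throttled and not was_throttled:
        throttled_windows += 1            # count distinct throttling windows
    was_throttled = throttled

print(throttled_windows, round(T, 2))
```

Even this crude sketch reproduces the qualitative behavior of Fig. 9: once the zone crosses the throttling threshold, the system chatters through repeated throttling windows that a converging steady-state simulation cannot reveal.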
Fig. 9. Simulation of the same three-server data center with a 2-mode CRAC (1 kW low / 5 kW high) and with throttling emulation. The simulation predicts windows of throttling, which occur in periods of high workload and which can have a significant effect on performance. A performance model integrated with such a transient simulator can also provide several insights into performance events that are unpredictable with conventional technologies.
3. A new transient model for heat circulation

This section provides both an intuitive and a formal definition of the transient thermal model motivated above. Intuitively, we propose that a useful estimate of the temperature at a point of interest can be provided by creating a weighted average of the contributing temperatures of each source, where each contributing temperature is a weighted average of that source's outlet temperatures over time. Essentially, heat added to a room differs with respect to time and location in the room, and, as the air mixes, the resulting temperature averages according to the relative air masses. If we monitor the temperature at every point where it is modified, the temperature everywhere else in the room can only be an averaging of measured temperatures at various past times. Therefore, this model requires the thermal isolation assumption, primarily to avoid the need for a bias factor, so that we can constrain the model such that the sum of all weights is 1. The other key assumption of this model is that the airflows in the data center are controlled and dominated by the fans, such that the portion of air arriving anywhere over time is invariant to temperature. This ignores changes in airflow due to temperature-induced pressure changes, such as hotter air rising faster (a likely cause of slight inaccuracy in our model).
3.1. Model definition

Our model, formally defined, uses the symbols of Table 1, where source temperatures are initialized as steady-state temperatures.

Table 1. Nomenclature.
n: number of points of interest (chassis, CRAC inlets, etc.)
t: time
T_i^+(t): entering (sink) temperature of point i at time t
T_i^-(t): exiting (source) temperature of point i at time t
T_i^∞: starting steady-state temperature of point i
c_ij(t): temporal contribution of i's temperature upon j
č_ij(t): temporal distribution of i's temperature upon j
w_ij: heat weighting factor (at sink)
u_ij: heat division factor (at source)
Q̄_ij: cumulative contributed heat from point i to j at time t
T̄_ij: cumulative contributed temperature from i to j at time t
f_i: air flow rate through a point (server or CRAC) i
p_i(t): power at server i
c_p: specific heat of air

The heart of the model is a collection of n × n temporal contribution curves c_ij(t), each denoting how heat arrives at the air inlet of a sink (receiving) server j from the air outlet of a source server i. (For notational convenience, the CRAC is also included in the ij enumeration.) A c_ij curve conforms to the following:

c_{ij}(t) : (-\infty, 0] \to [0, 1], \quad \text{with} \quad \int_{-\infty}^{0} c_{ij}(t)\, dt = 1. \qquad (3)

A key concept of the model is the contributing temperature T̄_ij(t) of a source server i to a sink server j, which is the temperature rise as affected by the history of source server i's outlet temperature T_i^-(t) through the spreading of c_ij:

\bar{T}_{ij}(t) = \int_{-\infty}^{0} c_{ij}(\tau)\, T_i^-(t+\tau)\, d\tau. \qquad (4)

In addition to the c_ij curves, the model assumes n × n weighting factors w_ij, which define how each contributing temperature T̄_ij(t) factors into the actual temperature T_j^+(t) at the air inlet of server j. The w_ij can be organized into a matrix W:

W = \begin{bmatrix} w_{11} & w_{21} & \cdots & w_{n1} \\ w_{12} & w_{22} & \cdots & w_{n2} \\ \vdots & \vdots & \ddots & \vdots \\ w_{1n} & w_{2n} & \cdots & w_{nn} \end{bmatrix}.

The core hypothesis of the model is that the temperature at the air inlet of a server j is the convex weighted sum of the temperature contributions of all servers (and of the CRAC) as they accumulate over time from the past:

T_j^+(t) = \sum_{i=1}^{n} w_{ij}\, \bar{T}_{ij}(t) = \sum_{i=1}^{n} w_{ij} \int_{-\infty}^{0} c_{ij}(\tau)\, T_i^-(t+\tau)\, d\tau, \qquad (5)

where \sum_{i=1}^{n} w_{ij} = 1, \forall j, because of the thermal isolation assumption. Also, since a contributing temperature intuitively represents the temperature of air from a source weighted over some past time interval, it should never be hotter or colder than the source has ever been: \min(T_i^-) \le \bar{T}_{ij} \le \max(T_i^-).

3.2. The symmetry property

The proposed circulation model, as defined, suggests that between a contributing server and a receiving server, at any point in time t, there is heat from the contributing server that arrives from any point in the past (-\infty, t - \delta) (see Fig. 10). The hysteresis parameter \delta expresses the delay it takes for heat to start arriving. Since the above is true for any point in time, it is also true, in a converse manner, that heat Q_i^-(t) produced at any moment t will be divided according to some division factors u_{ij} (\sum_{j=1}^{n} u_{ij} = 1, \forall i) and will start arriving after time \delta, continuing to contribute some portion of itself until +\infty, i.e., it will contribute in the future interval (t + \delta, +\infty). The insight of the model is that air and heat leaving a server outlet spreads itself out in arriving at a new location. Assume

Fig. 10. Explanation of the proposed transient model assertions: heat leaving a server i splits according to weights u_i and reaches each server j with a temporal distribution (spreading) č_ij(t), which includes some hysteresis (δ_ij). For a receiving server j, incoming heat at time t is a weighted sum of integrals of accumulated heat from the past, as dictated by the heat contribution curves. č_ij(t) mirrors c_ij(t) (Eq. (6)).
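In discrete time, the core hypothesis of Eq. (5) becomes a weighted sum of finite dot products of each source's outlet-temperature history with its sampled contribution kernel. The sketch below is a minimal illustration with invented kernels, histories, and weights, truncated to a finite window:

```python
import numpy as np

def inlet_temperature(T_out_hist, c_kernels, w):
    """Discrete version of Eq. (5) for one sink j.
    T_out_hist[i]: outlet temperatures of source i, oldest first, so
                   T_out_hist[i][-1] is the most recent sample.
    c_kernels[i]:  contribution curve c_ij sampled over the same window,
                   with c_kernels[i][-1] weighting the most recent sample;
                   each kernel sums to 1 (the discrete form of Eq. (3)).
    w[i]:          weighting factor w_ij; sum(w) == 1 (thermal isolation).
    """
    return sum(w[i] * np.dot(c_kernels[i], T_out_hist[i])
               for i in range(len(w)))

# Two sources: a hot server and the CRAC, 5-sample histories (deg C).
T_hist = [np.array([30.0, 32.0, 34.0, 36.0, 38.0]),  # server outlet, heating
          np.array([18.0, 18.0, 18.0, 18.0, 18.0])]  # CRAC supply, constant
# Kernels: server heat arrives spread over the past; CRAC air arrives promptly.
kernels = [np.array([0.1, 0.2, 0.3, 0.2, 0.2]),
           np.array([0.0, 0.0, 0.1, 0.3, 0.6])]
w = [0.4, 0.6]

Tj = inlet_temperature(T_hist, kernels, w)
print(Tj)  # 24.56
```

Note that each contributing temperature stays within the min/max bound stated above: the server's contribution (34.4 °C) lies within its outlet history [30, 38], and the CRAC's contribution equals its constant 18 °C supply.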
Assume that the heat distribution function is cˇij(τ) (this function incorporates the hysteresis as well), and that heat is distributed according to cˇij(τ) irrespective of when it is generated. Then the contributed heat at time t at the receiving server is:

Q̂ij(t) = ∫_{−∞}^{0} cˇij(−τ) uij Qi⁻(t + τ) dτ.

Therefore, we can easily calculate the function cij(t) as:

cij(t) = cˇij(−t).   (6)

Also, we can establish a relationship between the heat division factors uij and the weighting factors wij as follows:

Tj⁺(t) = Qj⁺(t) / (cp fj)
       = (1/(cp fj)) Σ_{i=1}^{n} Q̂ij(t)
       = (1/(cp fj)) Σ_{i=1}^{n} uij ∫_{−∞}^{0} Qi⁻(t + τ) cij(τ) dτ
       = Σ_{i=1}^{n} (uij/(cp fj)) ∫_{−∞}^{0} cp fi Ti⁻(t + τ) cij(τ) dτ
       = Σ_{i=1}^{n} uij (fi/fj) ∫_{−∞}^{0} Ti⁻(t + τ) cij(τ) dτ   (7)
⇒ wij = uij fi/fj.

Eqs. (6) and (7) define a symmetry relationship between heat distribution as it is generated at the sources and heat contribution as it arrives at the sinks. It should be noted that the matrix U = {uij} of the division factors is closely related to the heat recirculation matrix D (Eq. (1)). Matrix D describes, in a steady-state notion, how output heat from a server is split among all servers (and the CRAC as well), and it also performs a conversion from power to temperature [20]. The next sections answer the fundamental questions of: (i) how to derive the cij curves and the weighting factors wij (Section 4), and (ii) how accurate and useful the model is (Section 5).

4. Model learning

In this section, we present a methodology for deriving the weights wij and the cij curves for a given data center. The methodology leverages the symmetry relationship to first derive the cˇij curves and uij factors from CFD simulations, and then compute the cij(t) and wij.

4.1. Methodology for yielding wij and cij

A series of CFD simulations with high temporal resolution is required (in this case, we used the OpenFOAM CFD solver). Once the model is learned, the CFD solver is no longer required; furthermore, a CFD solver can be done away with entirely if the data can be obtained from sensors in the data center. To obtain the cˇij(t) curves, we first generate n CFD simulations, each starting from a steady state at temperature Tconst and then having a single location at a time produce a considerable temperature spike. The simulation includes a parameter specifying that the outlets of all servers (other than during the previously specified short period of extreme heat) act as "heat sinks", that is, they emit a constant temperature Ti⁻(t) = Tconst regardless of their inlet temperatures Ti⁺(t). Heat sinking ensures that eventually the heat introduced by the temperature spike dissipates and the system reverts to the pre-spike state. This procedure is performed in n separate experiments, one for each server and CRAC, to obtain their respective curves (Fig. 11). The temperature curves Tj⁺(t) observed at the air inlets are used to derive the cij(t) and wij as follows:

Fig. 11. The methodology of yielding the cˇij curves involves inducing a Dirac-like heat spike at the source server i, one source i at a time, and then observing the temperatures over time at the sink servers. The uij factors are computed by dividing the areas of each curve by the sum of all the areas at a sink.
1. Remove the bias temperature of the pre-spike steady state from the curves.
2. (Calculate uij) For each experiment i, divide the area of each temperature curve by the sum of the areas of the curves:

   uij = ∫_{0}^{∞} Tj⁺(τ) dτ / ( Σ_{j=1}^{n} ∫_{0}^{∞} Tj⁺(τ) dτ ), for each experiment i.

3. Calculate wij using Eq. (7).
4. (Calculate cˇij(t)) For each experiment i, normalize each curve to a unit area:

   cˇij(t) = Tj⁺(t) / ∫_{0}^{∞} Tj⁺(τ) dτ.
5. Calculate cij(t) using Eq. (6).

4.2. Example CFD experiment on a small containerized data center

We provide another example application of the methodology on the CFD model of a smaller, containerized data center consisting of a single row of 4 racks, each containing 6 chassis. A mesh model of this data center's layout can be found in Appendix A. Fig. 12 shows a sample set of cij curves. Note that the arrival of heat is much more compressed in these curves than in (some of) those from the medium-sized data center, likely because of the container's smaller physical dimensions.

5. Validation

Validation covers the following aspects:

• Cornerstone validity (Section 5.1). Can the model predict transient temperatures? A comparison with an equivalent CFD model is conducted to demonstrate the validity of the model.
• Linearity (Section 5.2). The transient heat model, as defined, exhibits linearity with respect to the sources and their contributions, i.e., if a source increases its output temperature, the sinks see a corresponding linear increase (according to the weights wij), and the heat from other sources is unaffected (there is no multiplication, cancellation, or other non-linear effect between two heat sources). Does this linearity actually manifest in actual data center rooms? Also, how accurate is the model
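The curve-extraction steps of Section 4.1 can be sketched numerically. The traces below are synthetic stand-ins for the CFD inlet-temperature curves, the flow rates are hypothetical, and the weight formula assumes Eq. (7) has the form wij = uij fi/fj:

```python
import numpy as np

def derive_params(T_traces, flows, i, dt):
    # T_traces[j]: bias-removed inlet-temperature trace at sink j after a
    # heat spike at source i (step 1 already applied).
    areas = np.array([Tj.sum() * dt for Tj in T_traces])      # curve areas
    u_i = areas / areas.sum()                                 # step 2: u_ij
    w_i = u_i * flows[i] / flows                              # step 3 (Eq. (7))
    c_hat = [Tj / areas[j] for j, Tj in enumerate(T_traces)]  # step 4
    return u_i, w_i, c_hat

dt = 0.1
t = np.arange(0.0, 30.0, dt)
# Synthetic traces with different delays and magnitudes (illustrative only).
traces = [3.0 * np.exp(-0.5 * (t - d) ** 2) for d in (5.0, 8.0, 12.0)]
flows = np.array([0.2, 0.3, 0.25])   # hypothetical air flow rates
u, w, c_hat = derive_params(traces, flows, i=0, dt=dt)
```

By construction, the division factors sum to one and every normalized curve has unit area.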
Fig. 12. Two examples of server inlet (T⁺(t)) traces, with the t-axis inverted to show them as cij, showing the shape of the heat contribution curves.
when the fan speeds change? We conduct CFD simulations and observe that linearity does occur.
• Scalability (Section 5.3). How do the runtimes increase as the data center population (i.e., the number of servers) increases? The test shows a polynomial-like increase in runtime.

5.1. Comparison against CFD simulation

To validate the physical relevance of the proposed model, we prepared a CFD setup on OpenFOAM as in Fig. 13. The physical setup is a 10 m long duct with a 0.1 m × 0.1 m square cross-section, with the 3 chassis and 1 CRAC inserted at distances from each other proportional to the initial delays indicated in Fig. 13. The results of the CFD simulator strongly agree with our model (see Fig. 14). The primary advantage of our simulator over the CFD simulator is that our results can be obtained within 1 s, as opposed to the approximately 1 h solution time of the CFD simulation. This enables our model to be used by data center administrators, provided that the parameters of our model are learned through the procedure described in Section 4.

5.2. Linearity of heat contributions

The simulations used to derive the cˇij curves are performed one for each server (i.e., each simulation is that of a single server
emitting heat, and measuring the inlet temperatures of all servers to create the temperature trace cˇij). It is not necessary to perform simulations for every combination of servers, due to the linearity of heat arrival as a function of the number of contributing servers. Mathematically, if cˇ_{i&j,k}(t) represents the heat arrival at server k's inlet from servers i and j emitting simultaneously:

cˇ_{i&j,k}(t) ≈ cˇ_{i,k}(t) + cˇ_{j,k}(t).

Fig. 15 shows two examples of how the linear sum of two individually simulated curves very closely approximates the values generated by both servers emitting simultaneously.
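Because heat arrival in the model is a convolution, superposition holds exactly in the model itself; only the CFD room can deviate from it. The sketch below, with a hypothetical unit-area spreading kernel, verifies that the response to two sources emitting together equals the sum of the individual responses:

```python
import numpy as np

dt = 0.1
t = np.arange(0.0, 20.0, dt)
kernel = np.where(t >= 1.0, np.exp(-(t - 1.0)), 0.0)  # hypothetical c^ curve
kernel /= kernel.sum() * dt                           # normalize to unit area

def response(source):
    # Heat arrival at a sink for a given source outlet trace (Eq. (4)).
    return np.convolve(source, kernel)[: len(t)] * dt

pulse_i = np.where((t >= 0.0) & (t < 1.0), 4.0, 0.0)  # server i's emission
pulse_j = np.where((t >= 2.0) & (t < 3.0), 2.0, 0.0)  # server j's emission
both = response(pulse_i + pulse_j)
superposed = response(pulse_i) + response(pulse_j)
```

In the CFD room the equality is only approximate (Fig. 15), since real airflow is not perfectly linear.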
5.2.1. Effect of fan speed change on linearity
The transient model's depiction of the heat arrival rate in a physical layout necessarily depends on the speed of the air flowing through the servers. Modern servers modulate fan speed according to utilization, and it is prohibitively expensive to derive cij curves for every possible fan speed combination of every server. To overcome this challenge, we instead propose modifying cij curves derived at a median fan speed. We hypothesize that cij curves can be approximated as the sum of a finite number of gamma distributions (for a more detailed analysis of this topic see Appendix B), where each individual distribution represents a different path of airflow delivering heat from server i to server j. If the fan speed is increased, these distributions shift to an earlier point, thus delivering heat earlier. Fig. 16, in comparison to Fig. 15's depiction of cˇ_{1&2,3}, exemplifies the shifting of peaks towards an earlier arrival time. Fig. 17 gives a cumulative look at this change in arrival time.
5.3. Scalability testing
Fig. 13. Basic simulation setup for the discrete-time simulator in Section 5.1. Every airflow arrives at exactly one other place. Air passing through each node changes by the amount listed above it. Air leaving a node takes time to arrive at the next node.
The proposed transient model consists of n equations for the inlet temperatures, each of them a sum of weighted integrals over n heat contribution curves. With some manipulation (see Appendix C), it is possible to bring the equations into a linear system ẋ = Ax + B. The size of the A matrix is O(n²), so we expect the runtimes to grow roughly quadratically with the data center population n. Fig. 18 shows how the runtimes scale for a simulated time of 50 s on a contemporary commodity computer
[Fig. 14 shows four panels, one per node (Server 1, Server 2, Server 3, CRAC), each plotting temperature (°C) over time (s) for the transient model against the CFD simulation.]

Fig. 14. Temperature-over-time results compared for each node between the transient model and CFD calculations. The CFD computation on OpenFOAM took about 1 h; the transient model computation on MATLAB ode45 took about 1 s.
[Fig. 15 shows two panels of cˇ curves over 0–20 s: one from the Server 1, 2, and 1&2 simulations (cˇ_{1,3}, cˇ_{2,3}, their sum, and cˇ_{1&2,3}) and one from the Server 17, 20, and 17&20 simulations (cˇ_{17,6}, cˇ_{20,6}, their sum, and cˇ_{17&20,6}).]

Fig. 15. Two examples of server inlet (T⁺(t)) traces, cˇij, showing the linear combination of individual curves.
consisting of a 2.3 GHz Intel Core i5-2410M system with 3MB L3 cache and 4GB RAM at 667 MHz.
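The quadratic growth of the state size can be illustrated with a toy stand-in for the linear system ẋ = Ax + B. SciPy's default RK45 integrator plays the role of MATLAB's ode45 here, and the matrices are random placeholders, not learned model parameters:

```python
import numpy as np
from scipy.integrate import solve_ivp

def simulate(n, t_end=50.0):
    # Schematic O(n^2) state dimension, as in the transient model.
    m = n * n
    rng = np.random.default_rng(0)
    A = -np.eye(m) + 0.01 * rng.standard_normal((m, m))  # stable-ish dynamics
    B = rng.standard_normal(m)
    return solve_ivp(lambda t, x: A @ x + B, (0.0, t_end), np.zeros(m))

sol = simulate(5)   # a hypothetical 5-node "data center"
```

Timing this function for increasing n reproduces the polynomial-like runtime growth of Fig. 18.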
6. Combining the thermal model with heat generation and cooling

As discussed earlier in Section 2.2, the minimum requirements for a thermal simulation of a data center's operation are a heat generation model (for the computing servers), a heat circulation and mixing model (for heat transfer among the equipment), and a cooling model that removes the heat (for the CRAC). This section discusses possible combinations of the proposed transient model with heat generation and cooling models.
6.1. Heat generation models

For heat generation, it is easy to use any model that conforms to the generic form defining a relationship Tj⁻ = f(Tj⁺) between the air inlet and the air outlet of a server. For example, assuming that the air takes Δt time to pass through the server, and that it picks up all of the heat produced in that interval, the outlet temperature can be defined as:

Tj⁻(t) = Tj⁺(t − Δt) + K ∫_{0}^{Δt} P(t + τ) dτ   (8)
       = Tj⁺(t − Δt) + K P Δt,  (if P(t) = P ∀t).

The above relationship can be translated into a discrete-time system: Tj⁻(n) = Tj⁺(n − 1) + K P(n) Δt. These equations describe a system that passes all generated heat instantaneously into the air. Motivated by Section 2.4, we propose a system that exhibits lumped heat capacitance, based on Newton's law of object heating:

Ṫserver,j(t) = αj (Tserver,j(t) − Tj⁺(t)) + Pj(t)/Cserver,j,   (9)
Ṫj⁻(t) = βj (Tserver,j(t) − Tj⁺(t)),   (10)
that is, the rate of temperature change Ṫserver,j(t) of server j depends on the temperature difference (ΔT) between the server
and the air, and on the heat produced by computation; the outlet temperature depends on the heat exchanged from the server into the air. The parameters αj and βj are thermodynamic constants that lump together the convection coefficient, the convection area, and the heat capacitances of the server and of the air, respectively. For cooling, it is possible to use any model that defines a transfer function Tout = f(Tin) between Tin and Tout, including the step-linear model.
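Eqs. (9) and (10) can be integrated with a simple forward-Euler step. The constants below are illustrative, and αj is taken negative so that, absent load, the server relaxes toward the inlet air temperature:

```python
def lumped_server_step(T_s, T_out, T_in, P, alpha, beta, C, dt):
    # One forward-Euler step of the lumped-capacitance server model:
    # Eq. (9) for the server body, Eq. (10) for the outlet air.
    dT_s = alpha * (T_s - T_in) + P / C
    dT_out = beta * (T_s - T_in)
    return T_s + dt * dT_s, T_out + dt * dT_out

# With no computational load, a hot server cools down to the inlet temperature.
T_s, T_out = 60.0, 30.0
for _ in range(2000):
    T_s, T_out = lumped_server_step(T_s, T_out, T_in=20.0, P=0.0,
                                    alpha=-0.5, beta=0.1, C=100.0, dt=0.05)
```

The outlet temperature keeps rising while the server is warmer than the inlet air, reflecting the heat being shed into the airflow.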
[Fig. 16 shows a panel of cˇ curves over 0–20 s with the emitting servers' fan speed doubled: cˇ_{1,3}, cˇ_{2,3}, their sum, and cˇ_{1&2,3}.]

Fig. 16. An example depicting how cˇ_{1&2,3}, when the fan speeds of the heat-emitting servers are increased (in this case doubled), shows a shifting of peaks towards an earlier time. Compare this specifically with Fig. 15's depiction of cˇ_{1&2,3} at a standard fan speed.

6.2. Cooling models

A challenging point for cooling models is the incorporation of a two-mode model. An example would be

Tn⁻(t) = { Tn⁺(t) − b1, if Tn⁺(t) < Tthres
         { Tn⁺(t) − b2, if Tn⁺(t) ≥ Tthres    (step-linear).   (11)

Since we want to insert a transient model, we propose using the step-linear model, but in a form that captures the dynamics of the temperature change:

Ṫn⁻(t) = βn (Tn⁻(t) − Tn⁺(t)) + Pn(t)/Cair,AC,   (12)

where Cair,AC is the heat capacitance of the air in the AC.
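The two-mode behavior of Eq. (11) is a one-line function; the threshold and offsets below are illustrative:

```python
def crac_outlet(T_in, T_thres=25.0, b1=5.0, b2=15.0):
    # Step-linear cooling (Eq. (11)): a larger offset b2 applies once the
    # CRAC inlet temperature reaches the threshold.
    return T_in - b1 if T_in < T_thres else T_in - b2
```

Note the discontinuity at the threshold: inlets just below and just above Tthres produce outlets about (b2 − b1) degrees apart, which is what induces the periodic temperature fluctuations discussed in Section 7.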
6.3. Putting them all together

Our convention is to use the index i = n to denote the CRAC, i.e., Tin(t) ≡ Tn⁺(t) and Tout(t) ≡ Tn⁻(t). In the first example below, we combine the transient heat model with Eqs. (8) and (11):

Example. As an example combination of the transient model with a constant heat generation model and a step-linear cooling model, the equations are:

Ti⁻(t) = Ti⁺(t) + a, i = 1 … n − 1,   (constant rate)
Tn⁻(t) = { Tn⁺(t) − b1, if Tn⁺(t) < Tthres
         { Tn⁺(t) − b2, if Tn⁺(t) ≥ Tthres    (step-linear)
Tj⁺(t) = Σ_{i=1}^{n} wij ΔT̂ij(t), ∀j.   (from Eq. (5))

The above example assumes instantaneous heat generation and cooling models, i.e., hot air that enters the CRAC instantly exits cooled at the outlet. To account for heat capacitance at the servers, we combine the transient heat model with Eqs. (9) and (10):

Example. As an example combination of the transient model with a lumped-capacitance heat generation model and a step-linear cooling model, the equations are:

Tj⁺(t) = Σ_{i=1}^{n} wij ΔT̂ij(t), ∀j,   (from Eq. (5))
Ṫs,j(t) = αj (Ts,j(t) − Tj⁻(t)) + Pj(t)/Cj, (j = 1, …, n − 1),
Ṫj⁻(t) = βj (Tj⁺(t) − Ts,j(t)), (j = 1, …, n − 1),
Ṫn⁻(t) = K (Tn⁺(t) − Tn⁻(t)) + Pn(t)/Cair,n.   (13)

We use the same convention for the CRAC, numbering it as node n. The above example still assumes an instantaneous cooling model, i.e., hot air that enters the CRAC instantly exits cooled at the outlet.

[Fig. 17 plots the cumulative sums of cˇ_{1&2,3} over 0–20 s at the standard and at double fan speed.]

Fig. 17. The cumulative sum of cˇ_{1&2,3} at both standard and double fan speed. As expected, the double-fan-speed curve shows overall earlier heat arrival.

[Fig. 18 plots runtime (s) against population (number of servers and CRACs), for populations up to 50.]

Fig. 18. Runtimes of the transient model on MATLAB ode45 for a range of data center populations, for a simulated time of 50 s. Runs performed on a 2.3 GHz Intel Core i5-2410M system with 667 MHz RAM and 3 MB L3 cache.
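The first example can be sketched as a discrete-time loop in which the cˇ curves degenerate to a pure transport delay; all constants are illustrative. With b1 < (n − 1)a < b2, the room can settle in neither the low nor the high cooling mode, reproducing the periodic fluctuation around the CRAC threshold:

```python
import numpy as np

n, steps, delay = 3, 400, 5           # nodes 0, 1: servers; node 2: CRAC
w = np.full((n, n), 1.0 / n)          # uniform mixing weights (illustrative)
a, b1, b2, T_thres = 3.0, 4.0, 8.0, 24.0
T_out = np.full((n, steps), 20.0)     # outlet-temperature histories
for k in range(delay, steps):
    T_in = w.T @ T_out[:, k - delay]  # T+_j = sum_i w_ij T-_i, delayed
    T_out[:2, k] = T_in[:2] + a                                 # constant rate
    T_out[2, k] = T_in[2] - (b1 if T_in[2] < T_thres else b2)   # step-linear
room_mean = T_out.mean(axis=0)
```

After a warm-up, the room-average temperature oscillates just below the CRAC threshold instead of converging, which is exactly the behavior a steady-state model cannot capture.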
Integration of our transient heat circulation model with non-instantaneous CRAC cooling models is left as future work.

7. Conclusions

In this paper we motivated the need for a lightweight transient heat circulation model for the quick evaluation of workload management decisions. The proposed model (Section 3.1) accurately characterizes thermal behavior as it varies over time. The addition of a time dimension allows data center operators to model the periodic temperature fluctuations implied by discontinuous, step-linear cooling models. While it maintains a good level of accuracy, its speed is rather attractive: coarse-level ODE computations take only a few seconds to complete, as opposed to several hours of CFD computations.

The core hypothesis of the model is that heat, as it is produced at the computing equipment, spreads out as it follows the air flow in the data center, and it arrives with a certain distribution at the air inlets of either other computing equipment or of the CRAC. Preliminary simulations in Section 4 show that the shape of the distribution curves resembles a combination of inverse-Gaussian functions. The splitting of the produced heat among the equipment is captured by a matrix of division factors uij, which is closely related to the heat recirculation matrix [20].

Research on this model is congruent with the objectives of the BlueTool project (http://impact.asu.edu/BlueTool), whose purpose is to develop and provide evaluation and validation tools for green data center design. Operating fast yet accurate models to produce results is a cornerstone factor for the success of the project.

Acknowledgements

We acknowledge Rose Robin Gilbert, Georgios E. Fainekos and Sunit Verma for their valuable assistance with the material in this paper.

Appendix A. Yielding the transient model parameters for a containerized data center

As mentioned in Section 4, our CFD experiments are performed using OpenFOAM on a model as shown in Fig. A.19.
In our experiments we modify two key files in the OpenFOAM framework: SetFieldsDict (located with the simulation results) and TransientSimpleFoam.C (our custom solver for tracking heat flow throughout the room). To derive the cˇij curves for an entire data center, we perform one experiment for each server i, during which we measure the inlet temperatures of each server j. Each experiment has three main phases:

1. Steady flow state – Begin the simulation with initial values (temperature, flow directions, flow speeds). We begin the simulation with equal temperatures throughout the room (291 K) and with servers and CRAC neither emitting nor removing any heat. This phase is set to run for 20 s so that the flow within the room reaches a somewhat steady state. For this phase, and all successive phases (with one exception), the exiting temperature of each server is overridden within the solver file to emit air at 291 K.

2. Heat emission – For a single second (simulation time) we emit heat from the server i specified at start time. To do this, we must remove the override term within the solver for server i to allow higher heat emission. This necessitates a recompilation of the solver before this phase begins. While the compilation
Fig. A.19. Mesh model of a hypothetical containerized data center.
is happening, SetFieldsDict is modified to include the heat addition to server i.

3. Continued measurement – To begin this phase, we must return to masking all outlet temperatures to 291 K. Once again, this requires modification of the solver and recompilation. SetFieldsDict is returned to its original state of no heat addition. Once the compilation is complete, the simulation is run until the room temperature at all locations is a steady 291 K, denoting that the remnants of heat from the heat-emission phase have finished traversing the room and have been removed by the outlet heat mask. The exact length of time the simulation takes to reach this state depends on the layout of the data center, and as such a single experiment can be used to find a rough estimate of that time. The small size of the containerized data center means that a runtime of 55 s ensures that all heat has been removed from the room.

After each simulation, the temperature trace files for each server are traversed and collected into a single matrix where columns are time steps and rows are servers. Once the cˇij curves are created using the temperature trace matrix, Section 4.1 explains how to create the cij curves. We counted the number of peaks in the 625 curves produced and found the distribution of peaks shown in Table A.2. We use this result to justify our decision to use three gamma distributions to approximate those curves.

Table A.2
Number of peaks in a curve.

Peaks in the curve:             1     2    3    4
Curves with that many peaks:   518   73    8   28
Cumulative percentile:         83%   95%  96%  100%

Appendix B. Fitting of the cˇij(t) temporal distribution curves

We use an analytical representation of the temporal distribution curves to formulate a linear system ẋ = Ax + B and have ODE solvers compute a solution. We selected to use a convex weighted sum of three component gamma distributions (with a high shape parameter) for each cˇij(t) curve, for the following reasons:

• The curves seem to have an exponential tail (i.e., exponential fade-out). The candidate distributions considered were the inverse Gaussian (IG(μ, λ)) and the high-shape Gamma(k, a). The Gamma distribution was chosen over the IG for ease of analytical integration and differentiation.
• Many curves exhibit multiple peaks; therefore we chose to represent them as weighted sums of Gamma distribution components.
• We chose the shape parameter of the gamma to be 8. Many of the curves have sharp peaks, and k = 8 was the lowest shape value that kept the fitting error low.
• The number of peaks varies from curve to curve. Using a variable number of component gamma distributions makes it hard to analytically yield, and then program, the ODE formulation of the linear system; fixing that number makes the formulation easier. In the small containerized simulation, most curves have up to three peaks, so for this example we fixed the number at three.

The chosen analytical fit is given by the following formula of three weighted gamma distributions:

c(t) = v1 γ1(t) + v2 γ2(t) + v3 γ3(t)
     = v1 Γ(t − δ1, 8, a1) + v2 Γ(t − δ2, 8, a2) + v3 Γ(t − δ3, 8, a3)
     = v1 a1⁸ (t − δ1)⁷ e^{−a1(t−δ1)} / 7! + v2 a2⁸ (t − δ2)⁷ e^{−a2(t−δ2)} / 7!
       + v3 a3⁸ (t − δ3)⁷ e^{−a3(t−δ3)} / 7!,   (v1 + v2 + v3 = 1).   (B.1)
In this formulation, there are two errors to be accounted for:

• The approximation error to the input curve, i.e., c(t) − (v1 γ1(t) + v2 γ2(t) + v3 γ3(t)) = ε_approx.
• The convexity error of the weights, i.e., 1 − (v1 + v2 + v3) = ε_convex.
Fitting was done using MATLAB's curve-fitting toolbox. We let the built-in solvers take care of the approximation error, and used an iterative approach to minimize the convexity error (Fig. B.20).
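A similar fit can be sketched with SciPy's curve_fit in place of MATLAB's toolbox. For brevity, this fits a single shape-8 gamma component to noiseless synthetic data; all parameter values are illustrative:

```python
import numpy as np
from scipy.optimize import curve_fit

t = np.linspace(0.0, 20.0, 400)

def gamma8(t, v, delta, a):
    # One weighted component of Eq. (B.1): v * Gamma(t - delta, 8, a),
    # with 7! = 5040 as the normalizing factorial.
    x = np.clip(t - delta, 0.0, None)
    return v * (a ** 8) * x ** 7 * np.exp(-a * x) / 5040.0

target = gamma8(t, 0.9, 2.0, 1.5)    # synthetic "measured" curve
popt, _ = curve_fit(gamma8, t, target, p0=[1.0, 1.5, 1.0])
residual = np.abs(gamma8(t, *popt) - target).max()
```

The full Eq. (B.1) fit would use three such components plus an outer loop enforcing the convexity constraint on the weights.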
Fig. B.20. Example Gamma fittings of the temporal distribution curves with respect to Eq. (B.1).
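The peak counting behind Table A.2 (Appendix A) can be sketched with SciPy's find_peaks; the curve below is a synthetic two-peak stand-in for a cˇij trace, not CFD data:

```python
import numpy as np
from scipy.signal import find_peaks

t = np.arange(0.0, 20.0, 0.05)

def gamma8(delay, rate):
    # Shape-8 gamma component, shifted by the hysteresis delay (7! = 5040).
    x = np.clip(t - delay, 0.0, None)
    return (rate ** 8) * x ** 7 * np.exp(-rate * x) / 5040.0

curve = 0.6 * gamma8(1.0, 4.0) + 0.4 * gamma8(6.0, 3.0)  # illustrative mix
peaks, _ = find_peaks(curve)
```

Applying this to every extracted curve and tallying the counts yields a distribution like the one in Table A.2.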
Appendix C. Formulation of a data center model as a linear system

The heat flow in a data center is modeled according to the equations described below:

Tj⁺(t) = Σ_{i=1}^{n} wij ∫_{−∞}^{0} Ti⁻(t + τ) cij(τ) dτ,   (C.1)
Ṫs,j(t) = αj (Ts,j(t) − Tj⁻(t)) + Pj(t)/Cj, (j = 1, …, n − 1),   (C.2)
Ṫj⁻(t) = βj (Tj⁺(t) − Ts,j(t)), (j = 1, …, n − 1),   (C.3)
ṪCRAC⁻(t) = K (TCRAC⁺ − TCRAC⁻) + PCRAC(t)/Cair, (from Eq. (13)).   (C.4)

The cumulative contributed temperature from server i to server j at time t can be expressed in the form:

ΔT̂ij(t) = ∫_{−∞}^{0} cij(τ) Ti⁻(t + τ) dτ,   (C.5)

where Ti⁻(t + τ) is the exiting source temperature at time t + τ, and the function cij(τ) is a summation of three gamma distributions having a variable scale (a) and a fixed shape parameter, which can be expressed in the form of Eq. (B.1). By using b⁸_{kij} = v_{kij} a⁸_{kij}/7!, for k = 1, 2, 3, we have:

cij(τ) = b⁸_{1ij} (τ − δ_{1ij})⁷ e^{−(τ−δ_{1ij}) a_{1ij}} + b⁸_{2ij} (τ − δ_{2ij})⁷ e^{−(τ−δ_{2ij}) a_{2ij}} + b⁸_{3ij} (τ − δ_{3ij})⁷ e^{−(τ−δ_{3ij}) a_{3ij}}.   (C.6)

Putting Eq. (C.6) into Eq. (C.5), with τ ≤ 0, we can express Eq. (C.5) in the form:

ΔT̂ij(t) = − ∫_{−∞}^{0} ( b⁸_{1ij} (τ + δ_{1ij})⁷ e^{(τ+δ_{1ij}) a_{1ij}} + b⁸_{2ij} (τ + δ_{2ij})⁷ e^{(τ+δ_{2ij}) a_{2ij}} + b⁸_{3ij} (τ + δ_{3ij})⁷ e^{(τ+δ_{3ij}) a_{3ij}} ) Ti⁻(t + τ) dτ.   (C.7)

We rewrite Eq. (C.7) by splitting it into three terms and assigning a separate functional variable to each term:

ΔT̂ij(t) = x_{1ij}(t) + x_{2ij}(t) + x_{3ij}(t),   (C.8)

where

x_{kij}(t) = − ∫_{−∞}^{0} b⁸_{kij} (τ + δ_{kij})⁷ e^{(τ+δ_{kij}) a_{kij}} Ti⁻(t + τ) dτ.   (C.9)

Reduction of Eq. (C.9) using repeated integration by parts, and the introduction of new variables, ends up with seven dependent functional variables, represented as z_{klij}(t) (l = 1, 2, …, 7). The set of ordinary differential equations obtained can be represented by three equations for the different values of k and l:

ẋ_{kij}(t) = −b⁸_{kij} δ⁷_{kij} e^{δ_{kij} a_{kij}} Ti⁻(t) − a_{kij} x_{kij}(t) + 7 b_{kij} z_{k1ij}(t),   (C.10)
ż_{klij}(t) = b^{8−l}_{kij} δ^{7−l}_{kij} e^{δ_{kij} a_{kij}} Ti⁻(t) − a_{kij} z_{klij}(t) − (7 − l) b_{kij} z_{k(l+1)ij}(t), (l = 1, …, 6),   (C.11)
ż_{k7ij}(t) = b_{kij} e^{δ_{kij} a_{kij}} Ti⁻(t) − a_{kij} z_{k7ij}(t).   (C.12)

Eq. (C.10) represents each term of Eq. (C.8), Eq. (C.11) represents the differential equations of z_{klij}(t) for l = 1, …, 6, and Eq. (C.12) represents the same as Eq. (C.11) but for l = 7. The time-differentiated forms of Eqs. (C.8), (C.11) and (C.12) form 22 sets of dependent ordinary differential equations. In a nutshell, the set of ordinary differential equations to be solved in order to find the outlet temperatures of the servers is:

ẋ_{1ij}(t) = −b⁸_{1ij} δ⁷_{1ij} e^{δ_{1ij} a_{1ij}} Ti⁻(t) − a_{1ij} x_{1ij}(t) + 7 b_{1ij} z_{11ij}(t),
ẋ_{2ij}(t) = −b⁸_{2ij} δ⁷_{2ij} e^{δ_{2ij} a_{2ij}} Ti⁻(t) − a_{2ij} x_{2ij}(t) + 7 b_{2ij} z_{21ij}(t),
ẋ_{3ij}(t) = −b⁸_{3ij} δ⁷_{3ij} e^{δ_{3ij} a_{3ij}} Ti⁻(t) − a_{3ij} x_{3ij}(t) + 7 b_{3ij} z_{31ij}(t),   (C.13)

ż_{klij}(t) = b^{8−l}_{kij} δ^{7−l}_{kij} e^{δ_{kij} a_{kij}} Ti⁻(t) − a_{kij} z_{klij}(t) − (7 − l) b_{kij} z_{k(l+1)ij}(t), (l = 1, …, 6),   (C.14)
ż_{k7ij}(t) = b_{kij} e^{δ_{kij} a_{kij}} Ti⁻(t) − a_{kij} z_{k7ij}(t),   (C.15)

Ṫs,j(t) = αj (Ts,j(t) − Tj⁻(t)) + Pj(t)/Cj, (j = 1, …, n − 1) (Eq. (C.2)),
Ṫj⁻(t) = βj (Tj⁺(t) − Ts,j(t)), (j = 1, …, n − 1) (from Eq. (C.3)),   (C.16)
Ṫn⁻(t) = K (TCRAC⁺ − TCRAC⁻(t)) + PCRAC(t)/Cair, (j = n).   (C.17)

For the set of Eqs. (C.14) the value of k is 1, 2 and 3, and the above equations are solved using the variable-step Runge–Kutta method incorporated in the MATLAB function ode45. To frame the differential equations in MATLAB, a matrix A is defined with coefficients corresponding to the parameters of the differential equations framed in (C.13) through (C.17), along with a matrix B consisting of the power values:

Ẋ = A X + B.   (C.18)

The matrix A is a square matrix of dimension (2n − 1 + 3n² + 21n²) × (2n − 1 + 3n² + 21n²), with the first n − 1 rows corresponding to the differentials of the server temperatures (Ts,1, …, Ts,n−1), the next n − 1 rows to the outlet temperature of each server (T1⁻, …, Tn−1⁻), the next row to the CRAC outlet temperature, the next 3n² rows to the differentials of the x variables (x_{k,1,1}–x_{k,n,n}) for k = 1, 2, 3, and the last 21n² rows to the differentials of the z variables (z_{k,l,1,1}–z_{k,l,n,n}) for k = 1, 2, 3 and l = 1, 2, …, 7. The columns represent the same quantities as the rows, except that they are the undifferentiated variables.
The X and B vectors can be represented as:

X = [ Ts,1, …, Ts,n−1, T1⁻, …, Tn−1⁻, TCRAC⁻, x_{1,1,1}, x_{2,1,1}, x_{3,1,1}, x_{1,1,2}, …, x_{3,n,n}, z_{1,1,1,1}, …, z_{3,7,n,n} ]ᵀ,   (C.19)

and

B = [ P1(t)/C1, P2(t)/C2, …, Pn−1(t)/Cn−1, 0, …, 0, PCRAC/Cair, 0, …, 0 ]ᵀ.   (C.20)
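The bookkeeping for the stacked state vector X can be captured in a couple of helper functions. The layout below follows the ordering shown in Eq. (C.19) (the component index k varying fastest); it is one possible concrete indexing, not code from the paper:

```python
def state_size(n):
    # (n-1) server temperatures + (n-1) outlet temperatures + 1 CRAC outlet
    # + 3 n^2 x-variables + 21 n^2 z-variables, as in Eq. (C.19).
    return (n - 1) + (n - 1) + 1 + 24 * n * n

def x_index(n, k, i, j):
    # 0-based position of x_{k,i,j} (k in 1..3, i and j in 1..n) inside X,
    # placed right after the 2n-1 temperature entries.
    return (2 * n - 1) + ((i - 1) * n + (j - 1)) * 3 + (k - 1)
```

Helpers like these keep the assembly of the sparse coefficient matrix A consistent with the row/column ordering described above.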
References

[1] U.S. Environmental Protection Agency, Report to Congress on server and data center energy efficiency, 2007. http://www.energystar.gov/ia/partners/prod development/downloads/EPA Datacenter Report Congress Final1.pdf
[2] Z. Abbasi, G. Varsamopoulos, S.K.S. Gupta, Thermal aware server provisioning and workload distribution for internet data centers, in: ACM International Symposium on High Performance Distributed Computing (HPDC10), 2010, pp. 130–141, http://dx.doi.org/10.1145/1851476.1851493.
[3] A. Banerjee, T. Mukherjee, G. Varsamopoulos, S.K.S. Gupta, Cooling-aware and thermal-aware workload placement for green HPC data centers, in: International Green Computing Conference (IGCC2010), 2010, pp. 245–256, http://dx.doi.org/10.1109/GREENCOMP.2010.5598306.
[4] T.D. Boucher, D.M. Auslander, C.E. Bash, C.C. Federspiel, C.D. Patel, Viability of dynamic cooling control in a data center environment, in: (IEEE) Inter Society Conference on Thermal Phenomena, 2004, pp. 593–600.
[5] J. Donald, M. Martonosi, Techniques for multicore thermal management: classification and new exploration, SIGARCH Computer Architecture News 34 (2) (2006) 78–88, http://dx.doi.org/10.1145/1150019.1136493.
[6] M. Fontecchio, Companies reuse data center waste heat to improve energy efficiency, May 2008. http://searchdatacenter.techtarget.com/news/article/0,289142,sid80 gci1314324,00.html
[7] T. Heath, B. Diniz, E.V. Carrera, W.M. Jr., R. Bianchini, in: Proceedings of the Symposium on Principles and Practice of Parallel Programming (PPoPP), Chicago, IL, USA, 2005. http://doi.acm.org/10.1145/1065944.1065969
[8] T.D. Boucher, D.M. Auslander, C.E. Bash, C.C. Federspiel, C.D. Patel, Viability of dynamic cooling control in a data center environment, Journal of Electronic Packaging 128 (2) (2006) 137–144, http://dx.doi.org/10.1115/1.2165214, http://link.aip.org/link/?JEP/128/137/1
[9] Q. Tang, S.K.S. Gupta, D. Stanzione, P. Cayton, Thermal-aware task scheduling to minimize energy usage for blade servers, in: 2nd IEEE Int'l Dependable, Autonomic, and Secure Computing (DASC'06), 2006, pp. 195–202.
[10] Q. Tang, S.K.S. Gupta, G. Varsamopoulos, Thermal-aware task scheduling for data centers through minimizing heat recirculation, in: IEEE Cluster, 2007, pp. 129–138.
[11] T. Mukherjee, A. Banerjee, G. Varsamopoulos, S.K.S. Gupta, Model-driven coordinated management of data centers, (Elsevier) Computer Networks 54 (16), Special Issue on Managing Emerging Computing Environments, 2010, pp. 2869–2886, http://dx.doi.org/10.1016/j.comnet.2010.08.011.
[12] G. Varsamopoulos, A. Banerjee, S.K.S. Gupta, Energy efficiency of thermal-aware job scheduling algorithms under various cooling models, in: International Conference on Contemporary Computing (IC3), Noida, India, 2009, pp. 568–580, http://dx.doi.org/10.1007/978-3-642-03547-0 54, http://impact.asu.edu/thesis/Varsamopoulos2009-cooling-models.pdf
[13] S. Gupta, G. Varsamopoulos, A. Haywood, P. Phelan, T. Mukherjee, BlueTool: using a computing systems research infrastructure tool to design and test green and sustainable data centers, in: I. Ahmad, S. Ranka (Eds.), Handbook of Energy-Aware and Green Computing, 1st ed., Chapman & Hall/CRC, 2012.
[14] E.K. Lee, I. Kulkarni, D. Pompili, M. Parashar, Proactive thermal management in green datacenters, The Journal of Supercomputing 60 (2) (2012) 165–195, http://dx.doi.org/10.1007/s11227-010-0453-8.
[15] S.K.S. Gupta, R.R. Gilbert, A. Banerjee, Z. Abbasi, T. Mukherjee, G. Varsamopoulos, GDCSim – an integrated tool chain for analyzing green data center physical design and resource management techniques, in: Proceedings of the International Green Computing Conference (IGCC11), IEEE, Orlando, FL, 2011.
[16] H. Jasak, A. Jemcov, Z. Tukovic, OpenFOAM: a C++ library for complex physics simulations, in: International Workshop on Coupled Methods in Numerical Dynamics (IUC), 2007, pp. 47–66. http://powerlab.fsb.hr/ped/kturbo/openfoam/papers/CMND2007.pdf
[17] T. Mukherjee, A. Banerjee, G. Varsamopoulos, S.K.S. Gupta, S. Rungta, Spatio-temporal thermal-aware job scheduling to minimize energy consumption in virtualized heterogeneous data centers, Computer Networks 53 (17) (2009) 2888–2904.
[18] J. Moore, J. Chase, P. Ranganathan, Weatherman: automated, online and predictive thermal mapping and management for data centers, in: Proceedings of the 2006 IEEE International Conference on Autonomic Computing, ICAC '06, IEEE Computer Society, Washington, DC, USA, 2006, pp. 155–164, http://dx.doi.org/10.1109/ICAC.2006.1662394.
[19] J. Moore, J. Chase, P. Ranganathan, Making scheduling "cool": temperature-aware workload placement in data centers, in: Proceedings of the Annual Conference on USENIX Annual Technical Conference, ATEC '05, USENIX Association, Berkeley, CA, USA, 2005, pp. 5–5. http://dl.acm.org/citation.cfm?id=1247360.1247365
[20] Q. Tang, S.K.S. Gupta, G. Varsamopoulos, Energy-efficient thermal-aware task scheduling for homogeneous high-performance computing data centers: a cyber-physical approach, IEEE Transactions on Parallel and Distributed Systems, Special Issue on Power-Aware Parallel and Distributed Systems 19 (11) (2008) 1458–1472, http://dx.doi.org/10.1109/TPDS.2008.111.
[21] J. Moore, R. Sharma, R. Shih, J. Chase, C. Patel, P. Ranganathan, Going beyond CPUs: the potential of temperature-aware data center architectures, in: First Workshop on Temperature-Aware Computer Systems, 2004. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.87.6147
[22] A. Kubilay, S. Zimmermann, I. Zinovik, B. Michel, D. Poulikakos, Compact thermal model for the transient temperature prediction of a water-cooled microchip module in low carbon emission computing, Numerical Heat Transfer A 59 (2011) 815–835, http://dx.doi.org/10.1080/10407782.2011.578014.
[23] L. Marshall, http://www.coolsimsoftware.com/wwwroot/LinkClick.aspx?fileticket=ap5wEZJbecQ%3D&tabid=189, 2010.
Georgios Varsamopoulos is a Research Assistant Professor in the School of Computing, Informatics and Decision Systems Engineering at Arizona State University, Tempe, AZ, USA. He received his B.E. in Computer Science and Engineering from the University of Patras, Greece, his M.S. in Computer Science from Colorado State University, Ft. Collins, Colorado, USA, and his Ph.D. in Computer Science from Arizona State University. His research interests include modeling of cyber-physical systems, performance and operation optimization, energy-aware and context-aware computing, wireless and mobile communications, and security. He is on the editorial board of Elsevier's Simulation Modelling Practice and Theory (SIMPAT), and he has served as a co-chair and reviewer for numerous conferences and as a reviewer for several journals, including IEEE Transactions on Parallel and Distributed Systems, IEEE Transactions on Mobile Computing, and Elsevier Computer Networks. His research work has been funded by the National Science Foundation, the U.S. Department of Transportation, Science Foundation Arizona (SFAz), Intel Corporation and Raytheon Corporation. He is a co-recipient of a best poster award. He is a faculty member of the Impact Lab (http://impact.asu.edu/).
Michael Jonas is a Ph.D. student in the School of Computing, Informatics, and Decision Systems Engineering at Arizona State University. He works at Microsoft Corporation doing data analytics for Windows Telemetry. His research interests include artificial intelligence, cognitive architectures, knowledge representation and modeling of complex systems, and planning.
Joshua Ferguson is currently an M.S. student in the School of Computing, Informatics, and Decision Systems Engineering at Arizona State University. His research interests include data center management, high-performance computing, and the network management of each.
Joydeep Banerjee is currently a Ph.D. student in the School of Computing, Informatics and Decision Systems Engineering at Arizona State University. His research interests include thermal management of data centers, green computing, sustainable computing and modeling of energy storage devices in data centers. He received his B.E. degree in Electronics and Telecommunication Engineering from Jadavpur University, India.
Sandeep K.S. Gupta is the Chair of the Computer Engineering Graduate Program and a Professor in the School of Computing, Informatics, and Decision Systems Engineering (SCIDSE), Arizona State University, Tempe, USA. He received the B.Tech. degree in Computer Science and Engineering (CSE) from the Institute of Technology, Banaras Hindu University, Varanasi, the M.Tech. degree in CSE from the Indian Institute of Technology, Kanpur, and the M.S. and Ph.D. degrees in Computer and Information Science from Ohio State University, Columbus, OH. His current research is focused on cyber-physical systems with emphasis on green computing, pervasive healthcare, and criticality-aware systems. Gupta's research awards include a 2009 SCIDSE best senior researcher award and a best paper award. His research has been supported by Science Foundation Arizona, the National Science Foundation, the National Institutes of Health, Intel Corp., Raytheon Missile Systems, and Northrop Grumman Corp. He has served or is currently serving on several editorial boards, including IEEE Transactions on Parallel and Distributed Systems, Springer Wireless Networks, Elsevier Sustainable Computing, and IEEE Communication Letters. Gupta has served on several program committees, including PerCom, Wireless Health, BSN, and ICDCS; has chaired or co-chaired several workshops and conferences, including GreenCom and BodyNets; and has co-edited several special issues for various journals and magazines, including IEEE Transactions on Computers (SI on Data Management and Mobile Computing), IEEE Pervasive Computing (SI on Pervasive Computing), and Proceedings of the IEEE (SI on cyber-physical systems). Gupta is a senior member of IEEE and heads the Impact Lab (http://impact.asu.edu) at ASU.