Proceedings of the 20th World Congress
The International Federation of Automatic Control
Toulouse, France, July 9-14, 2017

Available online at www.sciencedirect.com
ScienceDirect
IFAC PapersOnLine 50-1 (2017) 6709–6716

Multiscale Thermal Management of Computing Systems - The MULTITHERMAN approach

Andrea Bartolini ∗,∗∗ Christian Conficoni ∗ Roberto Diversi ∗ Andrea Tilli ∗ Luca Benini ∗,∗∗

∗ Department of Electrical, Electronic and Information Engineering, University of Bologna, Bologna, Italy (e-mail: {a.bartolini,christian.conficoni3,andrea.tilli,roberto.diversi,luca.benini}@unibo.it)
∗∗ Integrated Systems Laboratory, Dept. of Information Technology and Electrical Engineering, ETH, Zurich, Switzerland (e-mail: {barandre,lbenini}@iis.ee.eth.ch)
Abstract: This work presents the research findings of the MULTITHERMAN ERC-Advanced project, in terms of multi-scale thermal control of complex large-scale computing platforms such as High Performance Computing systems and datacenters. In this respect, the challenges and opportunities concerning thermal control of computing systems are discussed, along with the proposed innovative solutions. The control and management problem is divided into different hierarchical and scale levels (compute node and system). Then, given the control knobs and sensors available at each level, advanced control and system identification techniques are proposed to achieve a holistic thermal management of complex computing platforms, with benefits for thermal stability guarantees and energy consumption optimization. Such features are of utmost importance for advancements in next-generation large-scale computing solutions.

© 2017, IFAC (International Federation of Automatic Control) Hosting by Elsevier Ltd. All rights reserved.
Keywords: Decentralized control, System identification and adaptive control of distributed parameters systems, Static optimization problems.

⋆ This work was supported by the EU FP7 ERC Advance Project MULTITHERMAN (GA n. 291125).

2405-8963 © 2017, IFAC (International Federation of Automatic Control) Hosting by Elsevier Ltd. All rights reserved.
Peer review under responsibility of International Federation of Automatic Control.
10.1016/j.ifacol.2017.08.1168

1. INTRODUCTION AND RELATED WORKS

The pace dictated by Moore's law on the shrinking of transistor size has led to an increase of the theoretical maximum frequency of the devices, as well as to the progressive integration of an ever-growing number of cores and functional units in the same die area (Esmaeilzadeh et al., 2011). With the end of Dennard's scaling, the transistor size reduction comes with an increase of the power density and of the thermal design power (TDP). This side effect has become visible in the latest years of processor road-maps and has been known under several names: dark silicon (Esmaeilzadeh et al., 2011), power wall and thermal wall (Borkar, 1999). The term dark silicon underlines the fact that, to operate the processor at its maximum frequency while maintaining the same package and cooling solution used in previous generations, it is no longer possible to power all the logic at the same time without causing thermal dissipation issues (Esmaeilzadeh et al., 2011). Indeed, increasing the operating temperature at the silicon die can cause timing errors, as well as decrease the device reliability, life-time, and energy-efficiency due to the leakage power increase (Su et al., 2003).

In datacenters and supercomputers, hundreds or thousands of computing nodes, each composed of several processors/sockets, are integrated into racks in the same room. Thus, the power dissipation problem scales up. At this scale, the heat generated by each processor limits the final density and performance of the entire system and needs to be efficiently removed. To achieve the target performance, active cooling conveys forced air or liquid flow on the surface of each computing device. Rotating fans as well as pumps are then needed to move the fluid into each component, causing additional power consumption. Furthermore, the heat transferred to the fluid needs to be removed from the room by means of chillers and air conditioners, which cause an additional energy cost. When possible, free cooling tries to ease this cost by directly exchanging the heat from the room to the external ambient. However, this is feasible only when the environmental temperature is lower than the room air or liquid temperature (Zhang et al., 2014).

Choosing the right design is not a simple engineering problem, as the workload and load of each processor and core change dynamically based on user requests, application characteristics and computing phases. Standard design methods, based on static worst-case design, have become highly inefficient in terms of final cost and energy-efficiency.

Dynamic power and thermal management (DPM/DTM) tries to handle this issue dynamically, with several works showing opportunities in coupling low-power techniques, such as Dynamic Voltage and Frequency Scaling (DVFS) and power gating, with control of the on-chip thermal evolution (Kong et al., 2012).

Modern multicore processors expose the management of these knobs to the software stack. However, their optimal control at full scale requires a holistic solution and is challenging. Indeed, processors are characterized by multiple thermal time constants associated with the different building materials, ranging from milliseconds to seconds and minutes, while the room thermal evolution and the user applications/requests change over hours and days (Hui et al., 2004). In addition, with the increase of the number of cores integrated on the same die, thermal and power heterogeneity have become visible within the die, suggesting that fine-grain control of each core operating point and thread allocation has potential benefits in terms of performance and energy-efficiency (Beneventi et al., 2016; James and Martonosi, 2006). However, this is only the bottom of the problem and
needs to be coupled with the cooling control of the computing room to optimize the entire cooling cost.

The MULTIscale THERmal MANagement of Computing Systems (MULTITHERMAN) ERC-Advanced project (MultiTherMan, 2012) tackles this issue by proposing a holistic design of the thermal and power control policies at the different scales, with a dedicated focus on the scalability of the proposed solution. This paper collects, and gives a holistic and coherent view of, the key results in terms of thermal control of computing systems at the device level and at the large-scale level, and of their interactions.

The paper is organized as follows. Section 2 describes the available control knobs and sensors. Section 3 elaborates on the control problem and solution for the compute node, while Section 4 presents the cooling management approach for the overall computing system.

2. SYSTEM DESCRIPTION

In this Section, we describe the main control knobs available in modern computing systems, both at the component level and at the system level.

At the component level, modern processors provide built-in sensors and actuators for implementing feedback control policies targeting power and temperature. At the basis, each core features thermal sensors and architectural performance counters, which are accessible to the different layers of the software stack and allow monitoring each core's temperature, activity, and workload properties. In addition, the total power consumption of each socket can also be read by means of dedicated performance counters. Through the same interface, current systems allow throttling the frequency of each core and/or switching cores off independently (Hackenberg et al., 2015; Bortolotti et al., 2016). These knobs can be driven by the operating system by means of power governors with configurable policies. The HW itself uses this information internally to create a first layer of protection mechanisms, such as HW power capping, which bypasses O.S. decisions when critical temperatures are reached. Dynamic power and thermal management solutions use these sensors and actuators to control the temperature of each computing element and keep it below the critical threshold.

At the system level, hundreds or thousands of nodes, each composed of several processors, are integrated in racks and datacenter rooms. At this scale, the thermal condition at chassis level or machine level is handled by a global controller, which modulates the cooling effort to prevent over-temperatures or excessive local throttling of nodes. To be effective, this requires two crucial features. On one hand, many measurements have to be collected in a single end point; these are the internal telemetry of each processing element as well as the ambient temperature, the system usage, the job statistics, and the system power demand. On the other hand, the controller has to command each cooling component, including air cooling blowers, air temperature, liquid temperature and flow, and air conditioning activations.

In the next sections, we further discuss these two scenarios, describing the implementation of an optimal thermal control policy at component level (Section 3) and at system level (Section 4). We use the Intel Single-chip Cloud Computer (Howard et al., 2010) as a reference device for many-core computing components and the Galileo supercomputer (Cineca, 2015) as a reference large-scale high performance computing installation.
3. CHIP LEVEL CONTROL

In this Section, we focus on thermal management of multi-core and many-core processors. As discussed in the Introduction, the integration of many cores on a single silicon die has led to relevant thermal issues. The high variability of the computing tasks assigned to the different cores calls for dynamic run-time thermal management to achieve an optimal exploitation of the overall chip; this means orchestrating dynamic limitations on the power/frequency of each core, exploiting the per-core DVFS infrastructure. In addition, such dynamic approaches are attractive since, when suitably arranged, they allow taking advantage of the thermal capacitance of silicon and metal (or, possibly, other "heat buffers") to temporarily and safely overload the cores (this is often referred to as turbo or sprinting).

In the MULTITHERMAN project, to tackle chip-level thermal management, we have selected the Model Predictive Control (MPC) approach, since it is inherently targeted to optimal constrained control. The basic idea is to cast the temperature limits as constraints on the state values of the controlled system, while the core powers/frequencies are included in a suitable cost function, in order to keep them as close as possible to the requested ones without violating the above-mentioned thermal constraints. The requested per-core powers/frequencies are assumed to be given by a high-level supervisory unit, which manages and distributes the computing tasks on the chip cores.

An important issue in developing MPC solutions is the resulting complexity. Chips are quickly moving toward many-core architectures, hence the way the proposed MPC solution scales up w.r.t. the number of cores is crucial. In order to effectively tackle this issue, we focused on developing a core-centric distributed MPC-based approach.
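The basic idea of a constrained predictive power cap can be illustrated with a deliberately simplified sketch. It is not the MPC formulation used in the project: it assumes a hypothetical first-order local model `T[k+1] = a*T[k] + b*P + c*T_env`, holds the power constant over the prediction horizon, and all names and coefficients (`a`, `b`, `c`, `t_env`, `horizon`) are illustrative assumptions. Under these assumptions, the power closest to the requested one that keeps every predicted temperature below the limit has a closed form.

```python
def max_safe_power(t_now, t_env, t_max, a, b, c, horizon):
    """Largest constant power keeping the predicted temperature <= t_max.

    Assumed (hypothetical) model: T[k+1] = a*T[k] + b*P + c*T_env,
    with 0 < a < 1 and b > 0, so each constraint is an upper bound on P.
    """
    if not (0.0 < a < 1.0 and b > 0.0 and horizon >= 1):
        raise ValueError("expected 0 < a < 1, b > 0 and horizon >= 1")
    p_cap = float("inf")
    a_pow = 1.0   # a**k, updated inside the loop
    geom = 0.0    # 1 + a + ... + a**(k-1)
    for _ in range(horizon):
        geom += a_pow
        a_pow *= a
        # T[k] = a**k * T[0] + geom * (b*P + c*T_env) <= t_max
        p_k = ((t_max - a_pow * t_now) / geom - c * t_env) / b
        p_cap = min(p_cap, p_k)
    return p_cap


def thermal_manager_step(p_req, t_now, t_env, t_max, a, b, c, horizon=10):
    """Power closest to the request that satisfies the thermal constraint.

    If even zero power violates the limit, zero is returned (best effort).
    """
    p_cap = max_safe_power(t_now, t_env, t_max, a, b, c, horizon)
    return max(0.0, min(p_req, p_cap))


def simulate(t0, p, t_env, a, b, c, steps):
    """Roll the assumed model forward under constant power p."""
    temps, t = [], t0
    for _ in range(steps):
        t = a * t + b * p + c * t_env
        temps.append(t)
    return temps
```

Because the constraints are linear in the single decision variable, projecting the request onto the feasible interval is the exact minimizer of the tracking error in this toy setting; a full MPC would instead optimize a sequence of inputs over the horizon at every sampling step.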
Each core has its own local model predictive thermal manager, and the interaction with the rest of the chip is taken into account by probing the neighbors' temperatures and using them in the local model.

Another fundamental requirement for effectively developing the above-mentioned distributed MPC-based thermal management is the availability of dependable and compact (i.e. low-complexity) models representing the system thermal behavior. In order to comply with the proposed distributed architecture, modelling and identification also need to be oriented toward deriving a set of local interacting models.

In the following, the modelling and identification techniques adopted to obtain reliable, low-complexity, core-centric, interacting models are described. Subsequently, the key aspects of the proposed distributed MPC-based thermal management are addressed.

3.1 System Identification

The solution adopted for the identification of the chip thermal model is described by considering the Intel Single-chip Cloud Computer (SCC) as a case study. SCC is a 48-core experimental processor created by Intel Labs as a platform for many-core software research. It integrates hardware monitors and thermal sensors to track the chip workload phases and thermal behavior. The output of the built-in thermal sensors is affected by significant noise; therefore, SCC is a challenging test bench for thermal model learning strategies. Moreover, its high number of cores sets tight requirements on the complexity and scalability of the proposed algorithm. For these reasons, from the thermal management viewpoint, the SCC features are representative of a wide class of upcoming multi-core systems-on-chip.
[Figure: SCC tile floorplan with labelled tiles. Legend: CoreODD, CoreEVEN, center MISO tile, Tile 15 neighbours, Tile 16 neighbours.]
Fig. 1. SCC floorplan.

SCC architecture and experimental framework

The SCC processor has 24 dual-core tiles arranged in a 6x4 mesh. Each core is a P54C core. Each tile integrates two thermal sensors, each based on a pair of ring oscillators: one is positioned in proximity of the router and the other close to the top core L1 cache. These thermal sensors are originally uncalibrated. We used the calibration procedure presented in (Bartolini et al., 2011) to obtain a meaningful reading for each sensor. The outputs of the calibrated thermal sensors show the presence of significant white noise (Bartolini et al., 2011).

Each P54C core has two performance counters that can be programmed to track various architectural events (such as the number of instructions or cache misses) at periodic intervals. The performance counters can be accessed from the core they belong to by reading the dedicated registers. The Board Management Controller (BMC) includes a power sensor capable of measuring the full SCC chip power consumption, and an ambient temperature sensor.

A per-core power estimation for SCC can be obtained by using a power model. This has been obtained by correlating the full-chip power measurement with each core's activity and operating point (i.e. frequency) measured through the HW performance counters, as presented in (Sadri et al., 2011). Fig. 1 shows the SCC tile topology and some other details exploited later on (the focus will be on tiles 15 and 16).

Given these HW features, a set of traces suitable for thermal model identification can be generated. We designed a set of scripts that use POSIX signals to synchronously start and stop a given workload/power virus on the different SCC cores while, at the same time, collecting the outputs of the HW monitors (i.e. performance counters and thermal sensors). In this framework, we can apply a given Pseudo-Random Binary Sequence (PRBS) stress workload to the SCC cores.
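As an illustration of how a PRBS excitation can be produced, the following minimal sketch generates a PRBS-7 sequence with a linear feedback shift register and stretches each bit over a fixed number of sampling periods. The text does not state the PRBS order, seed, or dwell time used in the experiments, so the 7-bit register, the function names `prbs7` and `hold_bits`, and the hold length are assumptions made only for this example. A bit equal to 1 would correspond to running the power virus on a core and 0 to leaving it idle.

```python
def prbs7(n_samples, seed=0x7F):
    """Return n_samples bits of a PRBS-7 sequence (period 127).

    Uses the common PRBS-7 shift-register recurrence (x^7 + x^6 + 1).
    The seed must be a non-zero 7-bit value; the all-zero state is a
    fixed point of the register and would produce only zeros.
    """
    if not 0 < seed < 128:
        raise ValueError("seed must be a non-zero 7-bit value")
    state = seed
    bits = []
    for _ in range(n_samples):
        new_bit = ((state >> 6) ^ (state >> 5)) & 1
        state = ((state << 1) | new_bit) & 0x7F
        bits.append(new_bit)
    return bits


def hold_bits(bits, hold):
    """Repeat each bit `hold` times, i.e. keep the workload state for
    `hold` sampling periods before moving to the next PRBS bit."""
    return [bit for bit in bits for _ in range(hold)]


# Hypothetical usage: one on/off workload schedule for a single core.
schedule = hold_bits(prbs7(127), hold=5)
```

A longer hold time shifts the excitation energy toward lower frequencies, which matters when the slow thermal dynamics have to be excited as well as the fast ones; the value used here is arbitrary.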
The performance counters outputs are then transformed through the power model in a set of per-tile power traces and used as input vector for the model identification problem. The output vector is instead composed by the calibrated thermal sensor output for each tile 1 . The temperature readings are obtained by means of noisy sensors, then they will be affected by typical measurement noise (additive with respect to the true value). In contrast, the power estimations are derived by noiseless readings, therefore, the resulting power estimations will be affected by systematic and random errors w.r.t. the actual one, but they will not suffer the effects of typical sensor noise. In the specific case of SCC, tile is assumed as basic computing unit. Therefore, the power model is applied in each core of a tile and the results are summed up to derive the estimation of the overall tile power. Distributed thermal modelling According to the purpose of defining a scalable and distributed model, we propose to represent the SCC device as a set of interacting, local, compact models; one for each computing unit (a tile in this case, a core in general). The purpose of each local 1 We consider only the thermal sensor positioned close to the router since it is more central within the tile area.
model is to reproduce (i.e. to predict) the evolution of the local temperature, namely the output, at a given sampling time by exploiting previous samples of the output and inputs. According to standard thermodynamics, it is natural to assume that the local unit temperature is influenced by the local dissipated power and by the temperatures of the other units in the neighborhood of the considered one: this leads to a Multi-Input Single-Output (MISO) model. Using the neighbors' temperatures as inputs of a specific unit model provides the interaction among the local models, which makes it possible to capture locally the dynamics of the whole chip. In this case study, we define the neighborhood of a given unit as the set of units sharing an edge with it in a layout similar to the one reported in Fig. 1. Starting from these considerations, we first tried to represent each tile by means of a MISO ARX model whose inputs are the dissipated power related to the tile activity $P(t)$ and the temperatures $T_{n1}(t), T_{n2}(t), \dots, T_{nq}(t)$ probed on the tiles belonging to the tile neighborhood, whereas the output is the measured tile temperature $T(t)$ (Diversi et al., 2013a). This model is described by the following equation

$$A(z^{-1})\, T(t) = B(z^{-1})\, u(t) + w(t) \tag{1}$$

where $u(t) = [\, P(t)\;\; T_{n1}(t)\; \dots\; T_{nq}(t) \,]^T$ is the input vector, $T(t)$ is the output and $w(t)$ is a white process. $A(z^{-1})$ and $B_i(z^{-1})$ are polynomials in the backward shift operator $z^{-1}$ (i.e. $z^{-1} x(t) = x(t-1)$) defined as

$$\begin{aligned}
A(z^{-1}) &= 1 + a_1 z^{-1} + \dots + a_n z^{-n} \\
B(z^{-1}) &= [\, B_1(z^{-1})\;\; B_2(z^{-1})\; \cdots\; B_{q+1}(z^{-1}) \,] \\
B_i(z^{-1}) &= b_{i1} z^{-1} + \dots + b_{in} z^{-n}, \qquad i = 1, \dots, q+1
\end{aligned} \tag{2}$$
where $n$ is the model order and, according to the previous assumptions, the number of neighbor temperatures $q$ ranges from 2 to 4. It is worth noting that this model can be easily identified by means of the least squares (LS) method. Experiments were performed according to the SCC testing framework previously described to collect input-output data sequences and identify the thermal models of tiles 15 and 16 by using LS. The identification results revealed critical issues, as explained below (Diversi et al., 2013a, 2014).
– First, an ARX model of order $n = 2$ has been identified for each considered tile. The choice $n = 2$ is in line with the silicon and copper materials composing the processor die and heat spreader, as described in (Beneventi et al., 2014). In this case, the estimated coefficients of the polynomial $A(z^{-1})$ show a significant negative pole. This contrasts with the dynamics of thermal systems, where only real positive poles can be present. Moreover, the whiteness tests performed on the residuals of the LS estimates are far from being passed.
– The assumption $n = 2$ has then been relaxed and the least squares approach has been applied while increasing the model order until the whiteness test on the residuals was satisfied. This led to the choice of the order $n = 10$. In this case, the estimated coefficients of $A(z^{-1})$ lead to both negative and complex poles. Again, this does not comply with the physics of thermal systems.
On the basis of these results, it is possible to conclude that ARX models are not suitable to represent the thermal dynamics of a tile (core). As previously said, the temperature readings are affected by measurement noise, so it is reasonable to also consider an additive white noise corrupting both the tile temperature $T(t)$ and the neighbors' temperatures $T_{n1}(t), T_{n2}(t), \dots, T_{nq}(t)$. This leads to the MISO ARX model with noisy input and output described by the following relations
$$A(z^{-1})\, \bar{T}(t) = B(z^{-1})\, \bar{u}(t) + w(t) \tag{3}$$
$$u(t) = \bar{u}(t) + v_u(t) \tag{4}$$
$$T(t) = \bar{T}(t) + v_y(t), \tag{5}$$
where $\bar{u}(t) = [\, P(t)\;\; \bar{T}_{n1}(t)\; \dots\; \bar{T}_{nq}(t) \,]^T$, $u(t) = [\, P(t)\;\; T_{n1}(t)\; \dots\; T_{nq}(t) \,]^T$, $v_u(t) = [\, 0\;\; v_{u1}(t)\; \cdots\; v_{ur}(t) \,]^T$ and $A(z^{-1})$ and $B_i(z^{-1})$ are the polynomials (2). In this model $\bar{T}(t)$ and $\bar{T}_{n1}(t), \dots, \bar{T}_{nq}(t)$ denote the actual tile and neighbors' temperatures, whose available readings $T(t)$ and $T_{n1}(t), \dots, T_{nq}(t)$ are affected by the additive white noises $v_y(t)$ and $v_{u1}(t), \dots, v_{ur}(t)$. This model thus takes into account both a process noise (driven by the white process $w(t)$) and additive measurement noises on the input and output signals. It is worth noting that the dissipated tile power $P(t)$ is still considered noise-free. As previously mentioned, it is an estimate of the actual one and is affected by systematic errors. These errors are effectively included in the process noise alone. Since both inputs and output are corrupted by noise, this model belongs to the family of the so-called errors-in-variables models (Diversi et al., 2014). The following assumptions will be considered.
– The measurement noises $v_{u1}(t), \dots, v_{ur}(t)$ and the driving process noise $w(t)$ are mutually uncorrelated and uncorrelated with the noise-free input $\bar{u}(t)$.
– The sensors adopted to measure all of the temperatures in the chip are based on the same technology and embedded in the same framework, so they are assumed to show noises with similar statistical properties (e.g. the same variances).
It is worth noting that the tests are carried out under constant ambient temperature ($T_{amb}$) conditions, so that all the temperatures considered in the MISO ARX models are actually temperature gaps w.r.t. the ambient. Therefore, the identified models will be effective in predicting the difference between the tile temperatures and the ambient one under a constant (or slowly varying) ambient temperature scenario (the most common one). However, if fast variations (w.r.t. the chip thermal dynamics) in $T_{amb}$ can occur, the presented identification procedure can be adopted as well, by adding $T_{amb}$ to the input set of the MISO I/O-noisy ARX models and using the absolute temperatures instead of the temperature gaps.
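The effect of the output noise $v_y(t)$ in (5) on a plain LS fit can be illustrated numerically. The sketch below is not the identification procedure of the cited works: it simulates a hypothetical first-order, single-input instance of (3)-(5) with a noise-free power input, fits it by LS, and then repeats the fit after subtracting the measurement noise variance from the normal equations. The noise variance is assumed known here, which is a simplification, and all numerical values and function names are illustrative.

```python
import random


def simulate(n, a=0.9, b=0.5, sigma_w=0.05, sigma_v=0.5, seed=0):
    """Simulate Tbar(t) = a*Tbar(t-1) + b*P(t-1) + w(t), T(t) = Tbar(t) + v(t)."""
    rng = random.Random(seed)
    power, t_meas = [], []
    t_true, p_prev = 0.0, 0.0
    for _ in range(n):
        t_true = a * t_true + b * p_prev + rng.gauss(0.0, sigma_w)
        p_prev = rng.choice((-1.0, 1.0))  # binary excitation, noise-free input
        power.append(p_prev)
        t_meas.append(t_true + rng.gauss(0.0, sigma_v))  # noisy reading
    return power, t_meas


def fit(power, t_meas, noise_var=0.0):
    """Solve the 2x2 normal equations for (a, b).

    noise_var is subtracted from the sample second moment of T(t-1);
    noise_var = 0 gives plain least squares.
    """
    n = len(t_meas) - 1
    idx = range(1, n + 1)
    s_yy = sum(t_meas[k - 1] ** 2 for k in idx) / n - noise_var
    s_yu = sum(t_meas[k - 1] * power[k - 1] for k in idx) / n
    s_uu = sum(power[k - 1] ** 2 for k in idx) / n
    r_y = sum(t_meas[k] * t_meas[k - 1] for k in idx) / n
    r_u = sum(t_meas[k] * power[k - 1] for k in idx) / n
    det = s_yy * s_uu - s_yu ** 2
    a_hat = (r_y * s_uu - r_u * s_yu) / det
    b_hat = (s_yy * r_u - s_yu * r_y) / det
    return a_hat, b_hat


power, t_meas = simulate(100_000)
print("plain LS:   ", fit(power, t_meas))
print("compensated:", fit(power, t_meas, noise_var=0.5 ** 2))
```

With these hypothetical settings, the plain LS pole estimate is pulled noticeably below the true value 0.9, while the compensated estimate stays close to it. In the approaches referenced in the text the noise variance is not known in advance and has to be estimated together with the parameters.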
Thermal model identification

Because of the additive noises $v_y(t)$ and $v_{u1}(t), \dots, v_{ur}(t)$ on the inputs and output, model (3)-(5) cannot be identified by using least squares, since the obtained estimate is biased. To solve the identification problem, two different approaches have been exploited: the bias-compensated least squares (BCLS) (Diversi et al., 2013a, 2014) and the dynamic Frisch scheme (Diversi et al., 2013b).
– The rationale of BCLS methods consists in estimating the measurement noise variance (which is the same for all temperature readings, as previously assumed) and compensating its effect in the LS estimate to obtain an asymptotically unbiased parameter estimate. In particular, the procedure proposed in (Diversi et al., 2014) is an iterative algorithm where, at each step, an estimate of the additive noise variance is used to improve the estimate of the model parameters and vice versa. The noise variance is estimated by exploiting the statistical properties of the equation error of the MISO noisy ARX model. The resulting identification algorithm can be easily converted to a recursive version that allows tracking slow thermal model changes while the processor is working.
– The basic idea of the dynamic Frisch scheme is to search for the solution of the identification problem within a locus of solutions that are compatible with the covariance matrix of the noisy data. To select a single solution among all compatible ones, a suitable criterion is required. The selection criterion exploited in (Diversi et al., 2013b) is based on a set of both low-order and high-order Yule-Walker equations. This method is more demanding from a computational viewpoint but can also cope with temperature readings affected by noise with different variances.
Two important issues in any identification procedure are model validation and model order estimation. When using the LS method, these problems can be solved by performing a whiteness test on the residuals generated by the identified model used as a predictor of the output. In contrast, in the framework of the adopted MISO noisy ARX models this property no longer holds. Therefore, the validation and order estimation steps are carried out by evaluating the whiteness of the innovation generated by a Kalman optimal predictor based on a state-space representation of the noisy ARX model (3)-(5) (Diversi et al., 2014). This innovation is given by the difference between the measured tile temperature $T(t)$ and the optimal prediction of the actual tile temperature $\hat{\bar{T}}(t|t-1)$ computed by the Kalman predictor. It is worth stressing that this procedure has also been exploited to select the sampling time. Several experiments were performed according to the SCC testing framework previously described to collect input-output data sequences and identify the MISO noisy ARX thermal models of tiles 15 and 16. The model order $n$ and the sampling time $T_s$ have been selected by means of the procedure described above. The selected best solutions are $n = 2$ and $T_s = 50$ ms. In this case the identified poles are real and positive, so that the obtained models comply with the physics of thermal systems.
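The physical-consistency requirement on the poles (real and positive) can be checked directly from the coefficients of $A(z^{-1})$. The following sketch does this for a second-order model and converts a pole into an equivalent continuous-time time constant, under the assumption that each pole is a sampled first-order mode, i.e. $p = e^{-T_s/\tau}$. The coefficient values and the function names are hypothetical and only serve the example.

```python
import cmath
import math


def poles_second_order(a1, a2):
    """Roots of z^2 + a1*z + a2, i.e. the poles of A(z^-1) = 1 + a1 z^-1 + a2 z^-2."""
    disc = cmath.sqrt(a1 * a1 - 4.0 * a2)
    return (-a1 + disc) / 2.0, (-a1 - disc) / 2.0


def is_thermally_plausible(poles, tol=1e-9):
    """True if every pole is real and lies strictly between 0 and 1."""
    return all(abs(p.imag) < tol and 0.0 < p.real < 1.0 for p in poles)


def time_constant(pole, ts):
    """Time constant tau (same unit as ts) such that pole = exp(-ts / tau)."""
    return -ts / math.log(pole)


# Hypothetical coefficients giving poles 0.9 and 0.1 (accepted).
print(is_thermally_plausible(poles_second_order(-1.0, 0.09)))
# Hypothetical coefficients giving poles 0.3 and -0.8 (rejected: negative pole).
print(is_thermally_plausible(poles_second_order(0.5, -0.24)))
# A pole of 0.9043 at Ts = 50 ms corresponds to a time constant of about 0.5 s.
print(time_constant(0.9043, 0.05))
```

A check of this kind is only a sanity test on an identified model; it does not replace the whiteness-based validation described above.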
As an example, the following poles have been identified from a sequence of 6000 samples with $n = 2$ and $T_s = 50$ ms: $p_{1,15} = 0.9043$, $p_{2,15} = 0.0998$ for tile 15 and $p_{1,16} = 0.9366$, $p_{2,16} = 0.0723$ for tile 16. All models have been successfully validated by the whiteness tests performed on the Kalman predictor innovation sequences. The identification procedure described above has also been applied to thermal models of supercomputing nodes in a production environment, affected by quantization noise on the temperature measurements and operating in free cooling with variable ambient temperature (Diversi et al., 2016). Tests have been performed on a node of the CINECA Galileo Tier-1 supercomputer system described in Subsection 4.1. The first results are good, so the method is promising also for this kind of high-performance computing system.

3.2 Model Predictive Distributed Control

Here the distributed per-core control unit is presented, based on the modeling approach illustrated in the previous paragraphs. As mentioned in Section 2, the thermal control knob for the cores is the frequency level which, differently from the power consumption, has a nonlinear relation with the core die temperature. However, such nonlinearity can be “encapsulated” in a static map between the core power consumption and its frequency level, depending also on other known/measurable parameters, such as the supply voltage level and the CPI (clocks per instruction) of the workload (see (Bartolini et al., 2013) for details). Thus, from now on, we assume that the conversion from frequency level to power consumption and its inverse can always be performed via proper nonlinear relations. This way, linear dynamic models such as (1) can be adopted to design linear
MPC strategies (providing power as output, to be converted into frequency), which can be run more lightly and efficiently than nonlinear versions. Thus, a second-order discrete-time state-space LTI model can be considered, stemming from a state-space realization of the identified models obtained (via off-line/on-line calibration routines) by applying the algorithms described in Subsection 3.1, namely

$$x_i(t+1) = A\, x_i(t) + B\, [\, P_{C,i}(t)\;\; T_{amb}(t)\;\; T_{neighs,i}(t) \,]^T, \qquad T_i(t) = C\, x_i(t). \tag{6}$$

For the sake of simplicity, a deterministic system is considered (neglecting process and I/O noise), where $x$ is the state, $T_{amb}$ is the ambient temperature, $T_{neighs,i}$ are the temperatures of the neighbor cores and $P_{C,i}$ is the core power consumption. In order to guarantee that only the adjacent neighbors are significant, the sampling time (footnote 2) has been set to 50 ms, according to the analysis in Subsection 3.1. Typically, only one temperature per core, related to its silicon parts, is available from measurement; therefore the state $x_i(t)$ is not fully known and needs to be estimated. To this aim, a classic Luenberger observer has been designed and, to directly use the measured temperature for output injection, model (6) has been put in a basis such that the first state component corresponds to the core temperature (i.e. $C = [1\;\; 0]$). With these results at hand, the goal is to design a controller which, at each sampling time, receives a target frequency/power $f_{T,i}/P_{T,i}$ (provided by some energy mapping unit, see (Bartolini et al., 2013) for details), its own temperature plus the neighbors' and ambient temperatures, and the observer reconstruction of the non-measurable component. Based on this information, the output power/frequency has to be selected as close as possible to the requested value, keeping $T_i(t)$ under a safe limit value $T_{max}$ $\forall\, t$. As mentioned, an MPC-based approach is particularly suited for such a task. In this framework, the aforementioned problem can be mathematically formulated as follows

$$\min_{P_{C,i}} \sum_{j=0}^{N-1} \left\| P_{T,i}(t+j|t) - P_{C,i}(t+j|t) \right\|_Q^2 \quad \text{subject to } T_i(t+j+1|t) \le T_{max} \;\; \forall\, j = 0, \dots, N \tag{7}$$

where the temperatures $T_i(t+j+1|t)$ are predicted exploiting model (6), according to the information at time $t$, in a classic receding horizon fashion, while $\|\cdot\|_Q$ denotes the norm induced by a symmetric positive definite weighting matrix $Q$, and $N$ is the time horizon length, which has been set to 1. Optimization problem (7) is convex and quadratic, so it can be efficiently solved online via iterative numerical methods, thanks to the low number of variables given by the distributed approach. Alternatively, an explicit feedback form can be obtained via off-line computation (see (Bartolini et al., 2013) for a description of both implementations). The overall control scheme is reported in Fig. 2 for the sake of clarity.

Footnote 2: Clearly thermal coupling among cores increases as the time discretization window is enlarged.

Fig. 2. Chip level distributed control unit block scheme.

Fig. 3. Performance of the proposed chip-level controller compared against other solutions. Panel (a): Maximum Temperature Overshoot (Delta Kelvin degree respect TMAX). Panel (b): Percentage of time - thermal bound violation (% of time). Benchmarks: Fluidanimate, Facesim, Dedup, Bodytrack, Raytracing. Compared controllers: Centralized MPC with nonlinear f/P, Distributed MPC Solution, Centralized MPC with 1 dynamic, Centralized MPC with linear f/P, Distributed PID*.

It should be further remarked that, rigorously speaking, feasibility of problem (7) is not enough to guarantee that thermal bounds are met in the real system. The reason is that an approximated discretized model has been used for the control strategy. In-depth analysis and possible solutions to this issue are presented in (Tilli et al., 2012). Another significant application of the proposed approach, as outlined at the beginning of this Section, is the possibility to take advantage of the thermal capacitances of the cores (and of the dissipation system) to boost computational resource usage well
above the steady-state TDP for short time periods. As shown in (Tilli et al., 2015), dynamic models and predictive control can again be leveraged in a hierarchical structure, with a distributed inner MPC layer as in (7) and a centralized outer layer that speculatively sets the core power references close to target values higher than the TDP if energy can be stored in the thermal capacitances, while cutting such reference values if little or no energy can be absorbed. Moreover, this approach has been extended to cope with hard real-time application requests, namely ensuring the capability to boost computing performance (and thus power consumption) at any activation of critical periodic tasks with hard deadlines.

3.3 Experimental Results

The modeling and distributed MPC solutions presented in the previous paragraphs have been validated in a comprehensive set of scenarios in (Bartolini et al., 2013). For brevity, a summarizing result, comparing the proposed solution with other methods under different benchmark applications, is reported in Fig. 3. Data have been obtained by emulating a 4-core chip via an accurate finite element-based model. The performance metrics selected for the comparative analysis are the maximum temperature overshoot w.r.t. the constraint and the percentage of time in which the temperature bound is violated. A comparison with a standard distributed PID and with centralized MPC has been performed. For the latter, two additional variations have been considered: one with a linear approximation of the frequency-to-power function and one with first-order dynamics for the cores' thermal behavior. This highlights the importance of accurate modeling: Fig. 3 clearly shows that missing the system (algebraic) nonlinearity or the right order for the core dynamics gives worse performance even than simple PID control. Compared with the “right” centralized MPC solution, the distributed version achieves very close performance, which indicates that optimality is not compromised w.r.t. a full-knowledge centralized optimization. The benefits of distributed local per-core controllers lie in complexity reduction, as shown in Fig. 4, where the computational burden of all the ingredients of the control solution is reported for both centralized and distributed versions (implicit and explicit MPC have also been considered for the sake of completeness). Efficient and lightweight control algorithms are critical, as they need to be run on the same hardware used for computation. Furthermore, natural parallelism is achieved via the distributed approach (each core runs its own controller), which is not a trivial property to obtain for centralized optimization algorithms.
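To give a feeling for how light the per-core controller can be, the sketch below solves a one-step ($N = 1$) instance of problem (7) for a single core with one neighbor. With a scalar decision variable and a single linear constraint on the predicted temperature $T_i(t+1|t)$, the quadratic program reduces to clipping the requested power at the largest value that keeps the prediction below $T_{max}$. The matrices, temperatures and function names are hypothetical, the state is assumed to be already reconstructed by the observer, and the conversion between frequency and power is left out.

```python
def predict_temperature(x, p, t_amb, t_neigh, A, B):
    """One-step prediction of the core temperature with C = [1 0] in model (6)."""
    return (A[0][0] * x[0] + A[0][1] * x[1]
            + B[0][0] * p + B[0][1] * t_amb + B[0][2] * t_neigh)


def one_step_mpc(x, p_target, t_amb, t_neigh, A, B, t_max, p_min=0.0):
    """Minimize (p_target - p)^2 subject to the predicted T(t+1) <= t_max.

    Assumes B[0][0] > 0, i.e. more power means a hotter core at the next step.
    If even p_min violates the bound, p_min is returned (infeasible case).
    """
    free_response = predict_temperature(x, 0.0, t_amb, t_neigh, A, B)
    p_cap = (t_max - free_response) / B[0][0]
    return max(min(p_target, p_cap), p_min)


# Hypothetical second-order model with inputs [power, ambient, neighbor].
A = [[0.90, 0.05], [0.02, 0.95]]
B = [[0.50, 0.03, 0.02], [0.00, 0.01, 0.02]]
x = [70.0, 60.0]  # [core temperature, estimated second state]

p = one_step_mpc(x, p_target=20.0, t_amb=25.0, t_neigh=68.0, A=A, B=B, t_max=75.0)
print(p, predict_temperature(x, p, 25.0, 68.0, A, B))
```

For longer horizons or several coupled decision variables the closed form no longer applies and a QP solver or an explicit (off-line) solution is needed, as discussed in the text.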
Fig. 4. Computational burden results for distributed and centralized MPC. Values obtained with a C-code implementation on a 2.4 GHz dual-core.

(a) Centralized MPC Complexity
Cores | MPC_explicit (# regions) | MPC_implicit (time, us)
4     | 81                       | 7.70
8     | 6561                     | 9.00
16    | OUT                      | 24.20
48    | OUT                      | 85.50

(b) Distributed MPC Complexity (single core)
f2P: 0.061 us
Observer: 0.743 us
MPC (Impl): 4.690 us
MPC (Expl): 2 regions, 1.188 us
P2f

4. SYSTEM LEVEL CONTROL
Here the focus shifts to the full-scale system level, looking at the thermal and energetic behavior of the overall computing platform. The purpose is to define suitable management/control policies for the external cooling devices (air conditioners, chillers, pumps and so forth) which, at this level, represent the control knobs, ensuring that the temperature constraints of the entire computing platform are met. At the same time, energy efficiency is pursued, that is, the cooling elements are operated so that their power consumption is minimized. This task is performed assuming that standard node/chip-level controllers are implemented and that thermal capping has to be ensured at this level. If advanced chip-level controllers are available, as introduced in Section 3, then a profitable integration can be applied, as detailed in the last paragraph of this Section. In the considered context, two crucial steps have been performed: first, a realistic, numerically tractable thermal model of the entire system has been derived; then, an optimization problem providing the best operating point of the cooling devices in terms of energy efficiency has been formulated, based on the aforementioned model and the system thermal constraints. This procedure is detailed in the remainder of this Section, taking Galileo, a real Tier-1 hybrid-cooled (i.e. both liquid and air are used) High Performance Computing (HPC) system (Cineca, 2015), as a case study.

4.1 System Architecture and Modeling Approach

The architecture of the considered HPC system, which is rather typical for this kind of system, is shown in Fig. 5. Racks hosting the computing nodes are deployed in two rows, in a standard cold-aisle/hot-aisle arrangement. Cold air is delivered by Computer Room Air Conditioners (CRACs) into a sub-floor plenum; it rises through perforated tiles, flows through the machine with the help of the racks' internal fans, and the hot air returns to the CRACs. Two out of ten CRACs have free cooling capability (footnote 3).
As regards the liquid part, the so-called Rear Door Heat Exchangers (RDHXs) (Grimshaw et al., 2011) mounted on each rack help cool the air heated by the computing devices before it flows back to the CRACs. The liquid (water) entering the RDHXs is provided by a chiller, with adjustable and free cooling capabilities. A variable speed pump pushes water through the pipes, while a three-way valve is exploited to recirculate part of the warm return water, mixing it with the chiller outlet flow before it re-enters the RDHXs. To fully exploit such technology, advanced control of the cooling hardware is needed, moving from simple fixed working point strategies to optimal management that adapts to different environmental and workload conditions. To this aim, a crucial step is to derive a model characterizing the relations between the HPC thermal dynamics and the cooling hardware in a comprehensive fashion.

Footnote 3: That is, the CRAC refrigerating cycle is turned off while the blowers push external air at ambient temperature into the subfloor.

Fig. 5. Galileo HPC architecture and cooling system.

For a model to be used in optimization and
control, complexity should be curtailed. Bearing in mind these considerations, a lumped parameter approach, based on physics first principles, has been proposed (Conficoni et al., 2016) to derive an analytical, numerically tractable description of the complete cooling system, capturing its influence on the HPC thermal status. The main advantages of the approach are: avoidance of complex and poorly scalable Computational Fluid Dynamics (CFD) tools; characterization of the main thermal dynamics, with explicit accounting for cooling devices and free-cooling effects; scalability and suitability for optimization/control. The key idea is to describe the system macro-components as heat or cooling power sources, with their thermal interaction characterized by means of heat exchangers. This concept is illustrated in the functional scheme of Fig. 6.

Fig. 6. Galileo’s equivalent thermal circuit exploited for modeling.

For the given HPC, five heat exchange points, highlighted by thermal capacitor symbols in Fig. 6, are considered, taking their output temperatures as state variables, while the energy absorbed/rejected by them is assumed to depend on the average of the inlet and outlet temperatures. A uniform distribution is assumed for the rack temperatures, thus just one state variable is used to characterize the thermal behavior of the computing nodes (footnote 4). Similarly, RDHXs are described in an aggregate fashion. Therefore, under a few other technical assumptions (see (Conficoni et al., 2016), (Breen et al., 2010) for details), the overall system thermal dynamics reads as

$$\begin{aligned}
\dot{T}_{HPC} &= \frac{1}{C_{HPC}} \left[ P_{HPC} - R_{HPC\text{-}R}^{-1} (T_{HPC} - T_{ROOM}) - R_{HPC\text{-}A}^{-1} \left( T_{HPC} - \frac{T_{Aout} + T_{Ain}}{2} \right) \right] \\
\dot{T}_{Aout} &= \frac{1}{C_{RDin}} \left[ R_{HPC\text{-}A}^{-1} \left( T_{HPC} - \frac{T_{Aout} + T_{Ain}}{2} \right) - q_R c_{vA} \rho_A (T_{Aout} - T_{Ain}) \right] \\
\dot{T}_{ARD} &= \frac{1}{C_{ROOM}} \left[ q_R c_{vA} \rho_A (T_{Aout} - T_{Ain}) - R_{RDHX}^{-1} \left( \frac{T_{Aout} + T_{ARD}}{2} - \frac{T_{Win} + T_{Wout}}{2} \right) + R_{HPC\text{-}R}^{-1} (T_{HPC} - T_{ROOM}) - q_{CRACFC} \rho_A c_{vA} T_{ROOF} \right] \\
\dot{T}_{SF} &= \frac{1}{C_{SF}} \left[ c_{vA} \rho_A q_{CRAC} T_{ROOF} - P_{CRAC} + \underbrace{q_{CRACFC} c_{vA} \rho_A T_{amb}}_{P_{FC}} - q_{CRACtot} c_{vA} \rho_A T_{SF} \right] \\
\dot{T}_{Wout} &= \frac{1}{C_{RDout}} \left[ R_{RDHX}^{-1} \left( \frac{T_{Aout} + T_{ARD}}{2} - \frac{T_{Win} + T_{Wout}}{2} \right) - q_W c_{vW} \rho_W (T_{Wout} - T_{Win}) \right] \\
\dot{T}_{Woutch} &= \frac{1}{C_{Ch}} \left[ q_W c_{vW} \rho_W (T_{Wout} - T_{Win}) - R_{CH}^{-1} \left( \frac{T_{Wout} + T_{Woutch}}{2} - T_{0Ch} \right) \right]
\end{aligned} \tag{8}$$

Footnote 4: In modern HPCs the internal rack fan control is usually capable of guaranteeing a uniform temperature distribution among nodes, despite heterogeneous workloads.

The first four equations describe the air circuit, with temperatures, thermal capacitance and resistance variables defined according to Fig. 6, while $P_{HPC}$ is the thermal power produced by the computational workload. $P_{CRAC}$ denotes the cooling load of
the CRACs in “refrigerating mode”, $P_{FC}$ is the thermal power removed by the CRACs in free-cooling mode, and $q_{CRAC}$, $q_{CRACFC}$ are the corresponding overall air flow rates of the (refrigerating and free-cooling) CRAC blowers. These are the control inputs of the air part, which can be modulated via suitable duty cycles $\beta_j$ (for refrigerating mode) and $\gamma_j$ (for free-cooling mode) in each CRAC (see (Conficoni et al., 2016)). $q_R$ is the flow rate imposed by the rack fans, which is usually not controllable at this scale level. $\rho_A$, $\rho_W$ are the air and water densities, while $c_{vW}$, $c_{vA}$ are the specific heats. The last two equations model the liquid circuit, for which the control inputs are the flow rate $q_W$, imposed via the pump, and $T_{0Ch}$, a temperature representing the behavior of the chiller evaporator side, which is assumed adjustable. The three-way valve position $\alpha$ is also a control knob, affecting the water inlet temperature according to the flow balance equation $T_{Win} = \alpha T_{Woutch} + (1 - \alpha) T_{Wout}$, $\alpha \in [0, 1]$. Similar algebraic relations hold for the ceiling temperature $T_{ROOF}$ and the rear-door outlet temperature $T_{RD}$, and for the subfloor air temperature $T_{SF}$ and the inlet air temperature $T_{Ain}$, dictated by the mismatch between $q_R$ and the overall CRAC flow rate $q_{CRACtot}$ (see (Conficoni et al., 2016) for details). In Galileo, a cage preventing recirculation of hot and cold air flows is mounted (see Fig. 5), so the equality $q_R = q_{CRACtot}$ is enforced. Parameters in (8) can be derived via physics principles or gray-box identification techniques. It is worth remarking that the proposed modeling approach can be easily adapted to different computing platforms, with little effort in rearranging the modeling equations (see (Conficoni et al., 2015) for an application to a direct liquid cooled HPC).

4.2 Optimal Cooling Policy Design

Based on (8), an optimal management strategy, setting the cooling control knobs to optimize efficiency while keeping the HPC system in a proper thermal status, can be posed.
First, cost and constraint functions need to be defined. As mentioned, the function to be minimized is the overall cooling system power consumption. As regards the pump and the CRAC blowers, a quadratic function of the flow rate can be assumed (Ma et al., 2015). CRAC and chiller consumption generally requires sophisticated models depending on several internal variables, which are not usually available at the considered scale level. For this reason, the Carnot efficiency of the refrigerating cycle has been adopted to approximate the Coefficient of Performance (COP), namely:

$$COP_{CRAC} = \frac{T_{0CRAC}}{T_{amb} - T_{0CRAC}}, \qquad COP_{Ch} = \frac{T_{0Ch}}{T_{amb} - T_{0Ch}}.$$

Therefore, the following cost function is introduced

$$J = \sum_{i=1}^{N_C} COP_{CRAC}^{-1}\, \beta_i P_{CRACim} + \sum_{i=1}^{N_C} k_{BLi}\, \beta_i\, q_{CRACim}^2 + \sum_{j=1}^{N_{FC}} k_{BLj}\, \gamma_j\, q_{CRACjm}^2 + COP_{Ch}^{-1} P_W + k_P\, q_W^2 \tag{9}$$
where $P_W = q_W c_{vW} \rho_W (T_{Wout} - T_{Win})$ is the thermal power removed by the liquid circuit, $N_C$ and $N_{FC}$ are the numbers of CRACs operating in refrigerating and free-cooling mode, respectively, $P_{CRACim}$ and $q_{CRACim}$ are the CRAC maximum cooling load and blower flow rate, while $k_{BLi}$, $k_{BLj}$ and $k_P$ are component-specific coefficients. As far as constraints are concerned, the following temperature, flow rate, and cooling capacity limits must be met

$$\begin{aligned}
& T_{HPC} < T_{HPCmax}, \quad T_{ROOF} \le T_{ROOFmax} \\
& T_{Win} > T_{Winmin}, \quad T_{Wout} < T_{Woutmax} \\
& q_W \le q_{Wmax}, \quad q_R = q_{CRACtot}, \quad \alpha \in [0, 1] \\
& \beta_i, \gamma_j \in [0, 1] \;\; \forall\, i = 1, \dots, N_C, \;\forall\, j = 1, \dots, N_{FC}, \quad \beta_i \gamma_j = 0
\end{aligned} \tag{10}$$
where $T_{HPCmax}$ stems from the limits of the HPC computing devices; $T_{Winmin}$, the minimum RDHX water inlet temperature, is set to prevent condensation; $T_{ROOFmax}$ and $T_{Woutmax}$ are the maximum air and water temperatures that can be handled by the CRACs and the chilling system, respectively. Given all this, and assuming steady-state conditions of system (8) under constant workloads $P_{HPC}$, the cooling management strategy is cast into the following problem

$$\min_{\alpha,\, q_W,\, \beta_i,\, \gamma_j,\, \beta_j,\, T_{0ch}} J \text{ as in (9)}, \quad \text{subject to: temperature and control knob constraints (10), steady-state model constraints (8).} \tag{11}$$
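The ingredients of problem (11) that are given in closed form, namely the Carnot approximation of the COP, the valve flow balance and the cost (9), can be evaluated with a few lines of code. The sketch below is only an illustration of those formulas: it does not solve the constrained problem and does not include the steady-state model (8). All numerical values are hypothetical, temperatures passed to `carnot_cop` are assumed to be in kelvin, and the function and argument names are invented for this example.

```python
def carnot_cop(t_cold_k, t_amb_k):
    """Carnot approximation COP = T0 / (Tamb - T0), temperatures in kelvin."""
    if t_amb_k <= t_cold_k:
        raise ValueError("the approximation requires t_amb_k > t_cold_k")
    return t_cold_k / (t_amb_k - t_cold_k)


def mix_inlet(alpha, t_woutch, t_wout):
    """Three-way valve balance: T_Win = alpha*T_Woutch + (1 - alpha)*T_Wout."""
    return alpha * t_woutch + (1.0 - alpha) * t_wout


def cooling_cost(cop_crac, cop_ch, refr, free, p_w, k_p, q_w):
    """Cost (9).

    refr: list of (beta_i, P_CRACim, k_BLi, q_CRACim) for refrigerating CRACs.
    free: list of (gamma_j, k_BLj, q_CRACjm) for free-cooling CRACs.
    """
    j = sum(beta * p_max / cop_crac for beta, p_max, _, _ in refr)
    j += sum(k_bl * beta * q_max ** 2 for beta, _, k_bl, q_max in refr)
    j += sum(k_bl * gamma * q_max ** 2 for gamma, k_bl, q_max in free)
    j += p_w / cop_ch + k_p * q_w ** 2
    return j


# Hypothetical operating point.
cop = carnot_cop(280.0, 300.0)                 # 280 / 20
q_w, c_vw, rho_w = 0.001, 4186.0, 1000.0       # water flow and properties
t_win = mix_inlet(0.25, t_woutch=12.0, t_wout=20.0)
p_w = q_w * c_vw * rho_w * (20.0 - t_win)      # P_W = q_W c_vW rho_W (T_Wout - T_Win)
cost = cooling_cost(cop, cop,
                    refr=[(0.5, 10000.0, 2.0, 3.0)],
                    free=[(1.0, 1.5, 2.0)],
                    p_w=p_w, k_p=1.0e6, q_w=q_w)
print(cop, t_win, cost)
```

In an actual optimization, a function of this kind would serve as the objective handed to a nonlinear programming routine, with (10) and the steady state of (8) supplied as constraints.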
Optimal values, and the corresponding steady-state temperatures obtained by solving the problem above (footnote 5), can be exploited by the local controllers of the respective cooling components as suitable feed-forward actions and set-points. The assumption of constant workload is motivated by the long jobs typically executed on HPC nodes (Bartolini et al., 2014), which last considerably longer than the time constants of the system thermal dynamics. It is worth noting that the approach explicitly accounts for the exogenous inputs $T_{amb}$ and $P_{HPC}$, related to external conditions and the HPC load, keeping the thermal state within the prescribed set. This way, energy-aware decisions about the operation of all the cooling devices can be made in the face of “externally defined” working conditions.

Footnote 5: The problem is non-convex, thus finding global solutions is not trivial. A sequential quadratic programming method with feasible initialization points has been adopted to obtain satisfactory results in a short time.

4.3 Results

Fig. 7. Annual cooling energy consumption under different workload profiles and cooling strategies. (a) Results with the proposed optimal strategy. (b) Results for Galileo’s native strategy. (c) Cooling energy savings achieved by the proposed optimal management. Panels (a) and (b) report the annual cooling energy [MWh] by month (Jan to Dec) for the Galileo, Low (25% TDP), Med (50% TDP) and High (80% TDP) workloads; panel (c) reports the cooling energy savings [%] for the High, Med, Low and Galileo workloads.

The performance of the proposed algorithm has been validated via several numerical simulations in (Conficoni et al., 2016). Here we summarize the results, comparing our method with a standard strategy (currently adopted in Galileo) that regulates the room and chiller outlet temperatures and the pump flow rate to fixed values. Results are shown in Fig. 7. These have been obtained for the machine parameters (see Tab. I in (Conficoni et al., 2016)), considering the average hourly ambient temperatures of
the Galileo location (Bologna, Italy) and different workload levels (expressed as a fraction of the overall TDP), including the real 2015 workloads. A considerable cooling energy reduction is achieved by the proposed method for all workload scenarios (see Fig. 7 (c)). In particular, most of the efficiency gains occur in the spring/summer months, when the high ambient temperature makes cooling more critical and the advantages of a synergistic strategy, optimizing both liquid and air parts and exploiting the free-cooling capability, are more significant.

4.4 Integration and Opportunities

The presented cooling management strategy guarantees that temperature constraints are always satisfied, no matter what kind of controllers are implemented at the lower hierarchical levels. Indeed, the chip-level control methods elaborated in Section 3 already ensure thermal stability. Therefore, if they are implemented at the low hierarchical level, a profitable integration with the cooling management can be performed as follows: information about the DVFS intervention on the nodes' cores can be used at the cooling control level for a twofold purpose. On the one hand, the model parameters can be adapted to better fit the physical plant. On the other hand, the cooling effort can be increased to boost the nodes' performance and avoid frequency downscaling (i.e. letting the cores operate at the target frequency values). Clearly, in the latter scenario, the price to pay is a higher cooling consumption. Therefore, a proper trade-off must be found. Again, this decision could be cast as an optimization problem, formulated according to reasonable business and price models for the centers hosting the computing platforms. In a similar way, more speculative cooling policies could be adopted (i.e. making $T_{HPCmax}$ a constraint for the average of the nodes' die temperatures rather than the maximum value) to save energy, knowing that lower-level regulators ensure a safe thermal capping.