On-line Optimization of Power Efficiency in 3D Multicore Processors


14th IFAC Workshop on Discrete Event Systems (WODES 2018), May 30 - June 1, 2018, Sorrento Coast, Italy



IFAC PapersOnLine 51-7 (2018) 127–132

X. Chen, H. Xiao, Y. Wardi, S. Yalamanchili ∗

∗ School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332. Email: [email protected], [email protected], [email protected], [email protected].

Abstract: This paper considers the problem of maximizing power efficiency in multi-core processors by dynamic voltage frequency scaling. We formulate the problem in an on-line setting and apply to it a gradient-based stochastic approximation algorithm. We use a cycle-level, full-system simulation platform capable of realistic modeling of state-of-the-art microarchitectures executing industry-standard benchmarks (Splash 2). The optimization algorithm computes the gradients by the Infinitesimal Perturbation Analysis (IPA) sample-based sensitivity analysis technique. Despite the fact that the system's characteristics vary widely with time, and hence the optimum point varies as well, the simulation experiments exhibit a tracking of the optimal values. To our knowledge this is the first application of IPA to co-optimizing throughput performance and power in computer processors.

© 2018, IFAC (International Federation of Automatic Control) Hosting by Elsevier Ltd. All rights reserved.

1. INTRODUCTION

Computing is at an inflection point where the latency and energy cost of communication is dominating that of computation. Consequently we are seeing the re-emergence of the near-data computing paradigm, where processing is tightly integrated with memory to reduce the energy and latency cost of data movement between the memory and the CPU. This trend has been accelerated due to the exponentially increasing sizes of modern data sets and the need for commensurate increases in computing capacity and power efficiency. One particularly attractive technology solution is comprised of packaging techniques where dynamic random access memory (DRAM) die are stacked on top of each other, interconnected by through-silicon vias (TSVs), and integrated with a multicore processor die. While such compact, three-dimensional solutions provide up to three orders of magnitude reduction in the energy cost of data movement, power densities are increased, which can lead to unacceptable temperatures resulting in device degradation and ultimately failure. To reap the benefits of 3D DRAM with integrated compute, the challenge is to control power dissipation with minimal compromises in throughput performance, i.e., to improve power efficiency, defined as ops/joule or equivalently throughput/watt. Maximizing power efficiency will maximize the throughput performance achievable in thermally limited 3D multicore processors. In this paper we address this challenge and report on the design of a DVFS controller to optimize power efficiency.¹

¹ The term ‘DVFS’ refers to a widely used control concept in computer architectures, and will be explained in the sequel.

There are two broad classes of techniques for power management that have evolved in the computer architecture community: software-based and hardware-based techniques. The former are typically scheduling algorithms that seek to redistribute the heat generated by threads (programs) or processes in order to avoid exceeding peak temperatures on any core (see Lin et al. (2016) and references therein), for example by swapping threads executing on hot cores with threads executing on cooler cores. The intuition is that these approaches will create uniform thermal fields and thereby maximize the number of operations that can be completed for a given package and thermal capacity. The limitation of these approaches is that they i) are limited in the scope of feasible improvements in throughput that they can achieve, ii) are dependent on diversity of workload behaviors in applications, and iii) do not improve power efficiency, e.g., ops/joule. Hence the performance improvement using these techniques is limited.

Hardware schemes rely on multicore designs where cores are organized into distinct voltage islands, wherein the voltage and frequency of a voltage island can be set to one of several discrete values. Each setting or power state corresponds to a specific throughput capability (determined by the frequency) and peak power dissipation (determined by the frequency and voltage). The ability to adjust power states is referred to as Dynamic Voltage Frequency Scaling (DVFS), and it provides fine-grain control of the relationship between power dissipation and throughput; for example, see Wu et al. (2005). Power states of multicore processors provide a wide range of power efficiencies and therefore have received considerable attention for the purposes of regulation and optimization of power, and to a lesser extent in meeting temperature constraints.

Control schemes based on DVFS are typically designed for responding to changing application demands (Das et al. (2006); Bogdan et al. (2012); Jung and Pedram (2010)).




For instance, one can scale down the clock rate and voltage of a core whenever it runs a memory-intensive part of a thread (program), with little performance consequence but significantly lower power consumption (Lim et al. (2006); Dhiman and Rosing (2007)). Conversely, one would ramp up the clock rate and the necessary voltage during compute-intensive parts of a program in order to increase the throughput. These are conflicting requirements, since a higher clock rate typically results in higher instruction throughput but also in higher power and temperature. A recent survey of DVFS techniques can be found in Hanumaiah and Vrudhula (2014).

The design of controllers in a DVFS environment is challenging for the following two reasons. First, there is a lack of tractable (in a real-time sense) system models, where by ‘system’ we mean the relationship between the control variable (clock rate) and the controlled performance metrics (power, temperature, throughput). Second, the system's response characterization varies widely, in unpredictable ways, with application workloads. Consequently, early approaches to control design were ad hoc and based on deriving models from measurements of experiments on benchmark application programs. Formal control techniques started emerging about a decade ago (Garg et al. (2010); Lefurgy et al. (2008); Ogras et al. (2009)), and continued to evolve; examples include proportional control (Krishnaswamy et al. (2015)), PID control (Deval et al. (2015); Wu et al. (2005)), optimal control (Bogdan and Xue (2015)), model predictive control (Bogdan et al. (2012); Bartolini et al. (2013) and references therein), and a technique based on integral control with adjustable gain (Almoosa et al. (2012); Chen et al. (2016)).

Regarding the problem of maximizing power/energy efficiency, various DVFS-based control techniques have been devised, including task migration (Hanumaiah and Vrudhula (2014)), power clamping (Rountree et al. (2012)), voltage regulation (Sinkar et al. (2012)), proportional control (Krishnaswamy et al. (2015)), and optimal control and MPC (Bogdan et al. (2012); Bogdan and Xue (2015)). We mention that alternative techniques for software applications, based on principles other than DVFS (which is hardware-based), include virtual machines and machine learning; see, e.g., Lin et al. (2016); Verma and Sharma (2017), respectively, and references therein.

The approach pursued in this paper is different from the aforementioned techniques in that it is based on a stochastic approximation algorithm of the Robbins-Monro type (Kushner and Clark (1978)) in conjunction with Infinitesimal Perturbation Analysis (IPA) (Ho and Cao (1991); Cassandras and Lafortune (2008)). Both stochastic approximation and IPA have low computing costs and are amenable to implementation in real-time environments. Now, IPA often gives statistically biased estimates of performance gradients, and surrogate techniques designed to give unbiased estimates often require more computing time (see Ho and Cao (1991); Cassandras and Lafortune (2008) and references therein). However, the performance of gradient-descent algorithms, including stochastic approximation techniques, can be quite robust to errors in computing the gradients. Leveraging this robustness, we use a simple (hence fast) form of IPA and various modeling approximations designed for an optimization algorithm with as-simple-as-possible computations, perhaps at the expense of precision.
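The robustness claim can be illustrated with a small toy experiment (ours, not from the paper): a Robbins-Monro gradient ascent on a simple concave function whose gradient estimates are deliberately biased and noisy still settles near the true maximizer. All names and values below are illustrative assumptions.

```python
import random

# Toy illustration: maximize f(x) = -(x - 3)^2 by gradient ascent whose
# gradient estimates are deliberately biased (scaled by 0.8) and noisy,
# standing in for a crude, fast sensitivity estimator.
def noisy_biased_grad(x):
    true_grad = -2.0 * (x - 3.0)
    return 0.8 * true_grad + random.gauss(0.0, 0.5)

x = 0.0
for n in range(1, 2001):
    alpha = 0.8 / n ** 0.6      # step sizes of the form used later in Section 3
    x += alpha * noisy_biased_grad(x)

print(f"final iterate: {x:.3f}  (true maximizer: 3.0)")
```

Because the bias here is multiplicative, it rescales the gradient without moving its zero, and the diminishing step sizes average out the noise, so the iterates still climb to a neighborhood of the maximizer.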

The approach we use is application agnostic in the sense that the underlying model does not depend on the application program executed by the processor or core. Furthermore, it is sufficiently fast so as to be implementable according to the real-time constraints imposed by the setting of power-efficiency optimization. We mention that a similar approach has been used in earlier works by us (Chen et al. (2016); Wardi et al. (2016)), but there the objective was regulation of power or throughput to a given setpoint reference, whereas in this paper the setting is optimization-based control. Consequently the respective algorithms are different: Refs. Chen et al. (2016); Wardi et al. (2016) describe a control algorithm slated to find the zeros of an algebraic function, while this paper concerns a hill-climbing optimization algorithm.

The novelty of this paper is twofold. (i) It proposes an optimization-based control technique for maximizing power efficiency in multicore processors, a problem of current interest in microarchitectures. Furthermore, the algorithm is the first concerning the use of IPA or (to our knowledge) stochastic approximation for that problem, and it appears to be simpler to implement than extant DVFS techniques. (ii) It provides a proof-of-concept in a new strand of IPA research, which concerns the development of gradient estimators designed for simplicity at the possible expense of precision, and their use in optimization.

The rest of the paper is organized as follows. Section 2 defines the power-efficiency optimization problem and describes the system model, while Section 3 presents simulation experiments and analyzes their results.

2. PROBLEM DEFINITION, ALGORITHM, AND SYSTEM MODELING

Power efficiency generally is defined as the ratio of a throughput-related performance metric to the power required to achieve that performance. In this paper we consider both metrics as functions of the clock rate (frequency). In the framework of real-time optimization, the period of time it takes to execute a program is divided into contiguous intervals called control cycles and denoted by Cn, n = 1, 2, .... During each control cycle the control variable, namely the clock frequency, is held at a constant value which is determined at the start of the cycle. Based on that value of the control, the power efficiency and related quantities are computed during the cycle, and the results are used to change the control variable at the end of the cycle. To make matters concrete, we denote the clock frequency throughout Cn by φn, the instruction throughput measured during Cn by Θn, and the average power spent during Cn by Pn. We view Θn and Pn as functions of the applied clock rate φn and hence also use the notation Θn(φn) and Pn(φn), respectively. We define the power efficiency, denoted by Ln := Ln(φn), by
$$
L_n(\phi_n) = \frac{\Theta_n(\phi_n)^3}{P_n(\phi_n)}; \qquad (1)
$$


see Srinivasan et al. (2002) for a justification. In the sequel we will omit the functional dependence on φn when no confusion arises, and hence write Eq. (1) as Ln = Θn³/Pn.

The objective of the algorithm defined by Eq. (2), below, is to maximize the power efficiency. However, the power required for program execution by a core can vary widely during the program's execution, for reasons that will be made clear in the sequel. Putting it in the parlance of control theory, the system representing the relationship between the applied clock rate and the power is highly time varying. Therefore the functional dependence of the power efficiency during cycle Cn, Ln(φ), depends on n. Consequently we cannot pose the optimization problem as maximizing Ln(φ), but rather as tracking the maximum of Ln(φ). To this end we apply a stochastic approximation algorithm that changes φn at the end of Cn. Of course no exact maximization is possible, and all we do is apply a single step of the algorithm at the end of Cn. The algorithm is of the Robbins-Monro type (Kushner and Clark (1978)) and has the following form: Given a monotone-decreasing sequence {αn}, n = 1, 2, ..., such that
$$
\sum_{n=1}^{\infty} \alpha_n = \infty \quad \text{and} \quad \sum_{n=1}^{\infty} \alpha_n^2 < \infty,
$$
φn+1 is computed from φn by the formula
$$
\phi_{n+1} = \phi_n + \alpha_n \frac{\partial L_n}{\partial \phi_n}, \qquad (2)
$$
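As a quick check (added here for the reader; the paper states the conditions but not this remark), any step-size sequence of the form αn = a·n^(-γ) with a > 0 and γ ∈ (1/2, 1] is monotone decreasing and satisfies both requirements:
$$
\sum_{n=1}^{\infty} \frac{a}{n^{\gamma}} = \infty \quad (\gamma \le 1),
\qquad
\sum_{n=1}^{\infty} \frac{a^2}{n^{2\gamma}} < \infty \quad (2\gamma > 1).
$$
In particular, the choice αn = 0.8/n^0.6 used in the experiments of Section 3 corresponds to a = 0.8 and γ = 0.6.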

and its derivative has the following form by Eq. (1):
$$
\frac{\partial L_n}{\partial \phi_n}
= \frac{1}{P_n^2}\left( 3 P_n \Theta_n^2 \frac{\partial \Theta_n}{\partial \phi_n}
- \Theta_n^3 \frac{\partial P_n}{\partial \phi_n} \right). \qquad (3)
$$
The quantities Θn and Pn, the throughput and average power during Cn, can be measured in real time, but the estimation of the derivative terms ∂Θn/∂φn and ∂Pn/∂φn can be problematic. Therefore we will discuss approximations of these terms that are computable in real time by simple procedures. As stated in the Introduction, these approximations are defined not for precision but for simplicity of computations, and hence may incur considerable errors. However, due to the convergence-robustness of gradient-descent algorithms with respect to errors in computing the gradients, the algorithm will be shown to converge on the example problems presented in Section 3.

Frequency-to-power model. The model described in the forthcoming paragraphs can be found in standard textbooks on computer architectures such as Hennessy and Patterson (2012). Power consumed by computer cores during program execution is composed of static power and dynamic power. The static power, also called leakage power, depends on the supply voltage and is exponentially related to the temperature. Notably, static power is consumed even in the absence of circuit activity. In contrast, the dynamic power is due to the switching activity of the transistors, and it depends on the supply voltage and clock rate (frequency) as well as on a measure of the switching activity determined by the program (thread) executed by the core, referred to as the activity factor. Thus, denoting by Ps and Pd the static power and dynamic power consumed by a core, we have that P = Ps + Pd, where P denotes the total power.
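As a minimal sketch (ours, not the authors' code), Eq. (3) translates directly into a per-cycle gradient computation once the two derivative terms are supplied by the approximations discussed in the remainder of this section; the function name and argument names are illustrative.

```python
def efficiency_gradient(theta, p, dtheta_dphi, dp_dphi):
    """Eq. (3): dL/dphi for L = theta**3 / p, given the per-cycle measurements
    theta (throughput) and p (average power) together with estimates of their
    derivatives with respect to the clock frequency phi."""
    return (3.0 * p * theta**2 * dtheta_dphi - theta**3 * dp_dphi) / p**2
```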


By Eq. (3), we are interested in the derivative term ∂P/∂φ, namely the derivative of power with respect to frequency. As we shall see in the next paragraph, the dynamic power has a closed-form functional relationship to the frequency, and its derivative is easily computable. On the other hand, the frequency-to-static-power relation is much more complicated, and it is impractical to compute its derivative analytically. What we do is approximate the derivative of the power, ∂P/∂φ, in Eq. (3) by the derivative of the dynamic power, ∂Pd/∂φ, namely
$$
\frac{\partial P}{\partial \phi} \sim \frac{\partial P_d}{\partial \phi}. \qquad (4)
$$
This approximation is justified by noting that in many current applications the static power comprises about 30% of the total power, and it will be argued that this is sufficient for the algorithm to perform well on the specific simulation programs described in Section 3. An expanded argument is brought forth in Chen et al. (2016).

The relationship between clock rate and dynamic power is well established, and the following formulation thereof has been made in Hennessy and Patterson (2012). Consider a processor (or core) driven by a supply voltage V and operating at clock frequency φ. The dynamic power consumption is a function of voltage, frequency and workload, and it has the following form (Hennessy and Patterson (2012)),
$$
P_d(\phi, V, t) = \alpha(t)\, C V^2 \phi. \qquad (5)
$$
Here α(t) is the workload activity factor representing the switching activity of the processor's logic gates, and C is the total processor capacitance. The voltage-frequency relation is affine and has the form
$$
V = m\phi + V_0 \qquad (6)
$$
for given offset V0 and slope m, and m can be obtained experimentally or directly from the chip manufacturer. α(t) varies with time significantly and in an unpredictable manner, and cannot be measured in real time. Therefore we denote its dependence on time explicitly in Eq. (5), while acknowledging only implicitly that V and φ depend on time as well. By using Eq. (6) in Eq. (5) and taking derivatives we obtain, after some algebra, that
$$
\frac{\partial P_d}{\partial \phi} = P_d \left( \frac{1}{\phi} + \frac{2m}{V} \right); \qquad (7)
$$
see Almoosa et al. (2012). Observe that the RHS of Eq. (7) does not depend explicitly on α(t), whereas φ is the control variable, m can be obtained from the manufacturer, and V can be measured or computed via Eq. (6). As for Pd, we approximate it by the total power P, which can be readily measured.

Frequency-to-Throughput Model. Most modern computer architectures are based on an Out-of-Order (OOO) architecture whereby a core executes the instructions of a program, or a thread, as soon as all of the required resources become available and not necessarily according to the program order. A detailed presentation of OOO architectures can be found in Hennessy and Patterson (2012). The system described in this paper is based on a queueing model that is intractable by analysis, and hence we use IPA for the derivative term ∂Ln/∂φn. A detailed description of this model and the IPA algorithm can be found in Chen et al. (2016).
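Referring back to Eq. (7): the "some algebra" it alludes to amounts to differentiating Eq. (5) with Eq. (6) substituted, treating α(t) and C as constants during the differentiation (a short derivation added here for completeness, not spelled out in the paper):
$$
\frac{\partial P_d}{\partial \phi}
= \alpha(t) C \,\frac{\partial}{\partial \phi}\bigl(V^2 \phi\bigr)
= \alpha(t) C \bigl(2 m V \phi + V^2\bigr)
= \alpha(t) C V^2 \phi \left(\frac{2m}{V} + \frac{1}{\phi}\right)
= P_d\left(\frac{1}{\phi} + \frac{2m}{V}\right),
$$
where ∂V/∂φ = m follows from Eq. (6).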

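Putting the pieces of this section together, the following is a minimal end-of-cycle update sketch. It is an illustration, not the authors' implementation: the argument names are ours, the throughput derivative is a placeholder for the IPA estimate of Chen et al. (2016), the slope and offset of Eq. (6) are assumed values, the measured quantities are assumed to be pre-scaled to units in which the quoted step sizes are meaningful (the paper does not spell out this normalization), and the step-size rule and frequency limits are the values quoted in Section 3.

```python
def control_cycle_update(n, phi, theta, power, dtheta_dphi,
                         m=0.1, v0=0.6, phi_min=0.5, phi_max=5.0):
    """One end-of-cycle update of the clock frequency, per Eqs. (2)-(4), (6), (7).

    n            -- control-cycle index (1, 2, ...)
    phi          -- frequency applied during C_n (GHz)
    theta, power -- throughput and average power measured over C_n (pre-scaled)
    dtheta_dphi  -- estimate of dTheta/dphi (e.g., from the IPA procedure)
    m, v0        -- slope and offset of the voltage-frequency relation, Eq. (6)
                    (assumed values, in V/GHz and V)
    """
    v = m * phi + v0                                    # Eq. (6)
    dp_dphi = power * (1.0 / phi + 2.0 * m / v)         # Eqs. (4), (7), with Pd approximated by P
    dl_dphi = (3.0 * power * theta**2 * dtheta_dphi
               - theta**3 * dp_dphi) / power**2         # Eq. (3)
    alpha = 0.8 / n**0.6                                # step-size rule quoted in Section 3
    phi_next = phi + alpha * dl_dphi                    # Eq. (2)
    return min(max(phi_next, phi_min), phi_max)         # clamp to the feasible range
```

The final clamp is not part of Eq. (2) itself; it reflects the constrained frequency range [0.5 GHz, 5.0 GHz] imposed in the experiments of Section 3.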

3. SIMULATION EXAMPLES

The optimization problem of maximizing the power efficiency (as defined by (1)) is solved by the stochastic approximation algorithm defined by Eq. (2), with the derivative ∂Ln/∂φn computed according to (3) and the material in Section 2.

The experimental framework is based on a cycle-level, full-system simulation platform, Manifold (Wang et al. (2014)). The application and operating system binaries drive cycle-level models of cores, coherent caches, on-chip networks, and the DRAM system. The architectural setting consists of a 16-core x86 computer processor based on OOO state-of-the-art technology, which is simulated by Manifold at the level of the physical processes. While the cores are interconnected and thermally coupled, they can be controlled separately by their respective clock rates. In each experiment all of the 16 cores run the same program, and the results are shown for one randomly-chosen core.

We test the optimization algorithm on several programs from the industry-standard Splash-2 suite of benchmarks (Woo et al. (2005)). In each experiment the frequencies are constrained to the range [0.5 GHz, 5.0 GHz]. Some of the programs are heavily computational while others are memory intensive, and this yields noticeably different results, as will be explained in the sequel. Based on our previous experience with stochastic approximation, we chose the step sizes αn in Eq. (2) to be αn = 0.8/n^0.6. Every control cycle lasts 0.1 ms, and the algorithm takes one step at the end of each control cycle.

The first Splash-2 program, Barnes, is computationally intensive. The run time of this program, with the clock rates determined by the algorithm, is about 350 ms, hence containing 3,500 control cycles. The algorithm yielded an average of 0.64078 million instructions per control cycle. We set the initial frequency variable to φ1 = 1.0 GHz. The resulting sequence {Ln} of the power efficiencies is shown in Figure 1, and we notice that it converges quickly to a band around 5e28. The only apparent large deviation from that value is at about 310 ms, where the power efficiency dips drastically. A possible reason for that will be discussed in the sequel. In any event, it is noted that the algorithm recovers from that glitch quickly and returns the power efficiency to its former range around 5e28. We also computed the average power efficiency per iteration, and found it to be 4.9475e28 IPS³/W (the acronym IPS stands for Instructions per Second). A second experiment with the initial frequency variable set to φ1 = 4.2 GHz yielded convergence to a comparable limiting range, with an average power efficiency per iteration of 4.8426e28, but its graph is not shown due to space limitations of the paper.

Fig. 1. Barnes, power efficiency vs. time, experiment 1: φ1 = 1 GHz

The granularity of the graph in Figure 1 is insufficient to capture the transient behavior due to the initial condition φ1 = 1 GHz, and similarly for the second experiment with φ1 = 4.2 GHz. Therefore we depict, in Figure 2, the first 12 ms of the graphs for both experiments. In both cases convergence is noted in under 0.5 ms, or 5 iterations of the algorithm.

Fig. 2. Barnes, experiments 1 and 2: Zoom on first 12 ms

Figure 3 depicts the graph of the computed frequency sequence, {φn}, for the first experiment only; results for the second experiment are similar. The graph indicates a rapid convergence to a band around its average, computed at 3.048 GHz.

Fig. 3. Barnes, frequency vs. time, experiment 1

The graph of the power efficiency in Figure 1 exhibits sporadic jumps during the course of the program. These are due mainly to wide variations in the system's response to the applied frequency, and in particular, to the program's activity factor α(t) (see (5)). Such variations can trigger actions by the operating system such as redirection of instructions between cores in order to balance throughput or power among the cores. These actions are not reflected in the models described in Section 2, but are captured by the Manifold simulation. They may result in wide fluctuations in the system's output, such as the large dip in the graph in Figure 1 at about 310 ms.

We compared the average power efficiency obtained from Figure 1 (49.475e27) to the average power efficiency obtained from three constant frequencies: 0.5 GHz, 5.0 GHz, and 3.048 GHz. The first two values are the extreme values of the feasible frequency range for the optimization problem, and the third value is the average frequency computed from the run of the algorithm (Figure 3). The graphs of the resulting power efficiencies are shown in Figure 4. The lowest graph corresponds to φ = 0.5 GHz, and its average value is 1.4465e27. The middle curve corresponds to φ = 5.0 GHz, and its average value is 8.1942e27. The third, highest graph corresponds to φ = 3.048 GHz, and its average value is 32.96e27. It is not surprising that the last value is larger than the other two, since it corresponds to the average frequency computed by the algorithm. However, the average power efficiency computed from the algorithm's run (Ln, Figure 1) is even higher, at 49.475e27, thereby providing an improvement over the best result obtained from a constant frequency.

Fig. 4. Barnes: system response to constant frequencies


The second Splash-2 program we tested, Ocean-nc, is memory intensive, with over 30% of its instructions concerning memory access, from level-1 cache to DRAM. The run time of Ocean-nc is about 500 ms, but we show in the following graphs only the first 350 ms. The algorithm yielded an average of 0.639 million instructions per control cycle. In the first experiment we set the frequency variable to φ1 = 1.0 GHz. The resulting sequence {Ln} is shown in Figure 5, and its average value is 54.067e27. A second experiment, with φ1 = 4.2 GHz, yielded similar convergence with an average of 53.775e27. Subgraphs over the first 12 ms, capturing the transient behaviors of both runs of the algorithm, are shown in Figure 6. In both cases convergence is noted in under 1.0 ms, or 10 iterations of the algorithm. Figure 7 depicts the graph of the computed frequency sequence, {φn}, for the first experiment only, and the obtained average frequency is 3.766 GHz.

Fig. 5. Ocean-nc, power efficiency vs. time, experiment 1: φ1 = 1 GHz

Fig. 6. Ocean-nc, experiments 1 and 2: Zoom on first 12 ms

Fig. 7. Ocean-nc: Frequency vs. time, experiment 1

Comparing these results to those obtained for Barnes, the following differences stand out: Figure 5 exhibits much more drastic periodic downward spikes in the power efficiency for Ocean-nc than Figure 1 shows for Barnes. Correspondingly, Figure 7 exhibits much wider fluctuations in frequency than Figure 3. The reason is this: extensive memory-access instructions significantly pull down the throughput and hence the power efficiency, which explains the downward spikes in Figure 5. The algorithm attempts to stabilize the power efficiency while the program activity factor α(t) (in Eq. (5)) is subjected to wide variations, much wider than for Barnes. Moreover, the step of the algorithm defined by Eq. (2) is computed at the end of the control cycle Cn, while the derivative term ∂Ln/∂φn it uses (Eq. (3)) is based upon data gathered throughout Cn. This delay, coupled with the wide instability of the activity factor, acts to destabilize the system and explains the wide fluctuations in Figure 7 and Figure 5. Nevertheless the algorithm seems (Figure 5) to recover quickly from each one of these downward spikes.

We compare the performance of the algorithm to the response of the system to three constant frequencies, namely 0.5 GHz, 5.0 GHz, and the reported average obtained by the algorithm in the first experiment, 3.766 GHz. The results are shown in Figure 8. The lower, middle, and upper graphs correspond to the respective frequencies of 0.5 GHz, 3.766 GHz, and 5.0 GHz; the corresponding averages of {Ln} are 3.8907e27, 8.2598e27, and 17.45e27. Unlike the case of Barnes, the highest response is not for the average frequency obtained by the algorithm but rather for the frequency upper bound of 5.0 GHz. This is partly explained by the large variability of the frequency schedule computed by the algorithm, as depicted in Figure 7, which does not necessarily imply that the application of 3.766 GHz outperforms that of 5.0 GHz. In any event, the average power efficiency computed by the algorithm from the graph in Figure 5, 54.067e27 IPS³/W, is over three times the best result from the three constant-frequency experiments.

Fig. 8. Ocean-nc: system response to constant frequencies

REFERENCES

Almoosa, N., Song, W., Wardi, Y., and Yalamanchili, S. (2012). A power capping controller for multicore processors. In American Control Conference (ACC), Montreal, Canada, June 27-29, 4709–4714.



Bartolini, A., Cacciari, M., Tilli, A., and Benini, L. (2013). Thermal and energy management of high-performance multicores: Distributed and self-calibrating model-predictive controller. IEEE Transactions on Parallel and Distributed Systems, 24(1), 170–183.

Bogdan, P., Marculescu, R., Jain, S., and Gavila, R. (2012). An optimal control approach to power management for multi-voltage and frequency islands multiprocessor platforms under highly variable workloads. In Sixth IEEE/ACM Intl. Symp. Networks on Chip (NoCS), Lyngby, Denmark, May 9-11, 35–42.

Bogdan, P. and Xue, Y. (2015). Mathematical models and control algorithms for dynamic optimization of multicore platforms: A complex dynamics approach. In IEEE/ACM Intl. Conf. Computer-Aided Design, Austin, TX, November 2-6, 170–175.

Cassandras, C.G. and Lafortune, S. (2008). Introduction to Discrete Event Systems, 2nd Edition. Springer.

Chen, X., Wardi, Y., and Yalamanchili, S. (2016). IPA in the loop: Control design for throughput regulation in computer processors. In 13th Intl. Workshop on Discrete Event Systems (WODES), Xi'an, China, May 30 - June 1; also in Arxiv, Ref. arXiv:1604.02727 [math.OC].

Das, S., Roberts, D., Lee, S., Pant, S., Blaauw, D., Austin, T., Flautner, K., and Mudge, T. (2006). A self-tuning DVS processor using delay-error detection and correction. IEEE Journal of Solid-State Circuits, 41, 792–804.

Deval, A., Ananthakrishnan, A., and Forbell, C. (2015). Power management on 14 nm Intel Core-M processor. In IEEE Symp. Low-Power and High-Speed Chips (COOL CHIPS XVIII), Yokohama, Japan, April 13-15.

Dhiman, G. and Rosing, T. (2007). Dynamic voltage frequency scaling for multi-tasking systems using online learning. In ACM/IEEE Intl. Symp. Low Power Electronics and Design (ISLPED), Portland, Oregon, August 27-29, 207–212.

Garg, S., Marculescu, D., and Marculescu, R. (2010). Custom feedback control: enabling truly scalable on-chip power management for MPSoCs. In ACM/IEEE Intl. Symp. Low-Power Electronics and Design (ISLPED), Austin, TX, August 18-20, 425–430.

Hanumaiah, V. and Vrudhula, S. (2014). Energy-efficient operation of multicore processors by DVFS, task migration, and active cooling. IEEE Transactions on Computers, 63, 349–360.

Hennessy, J. and Patterson, D. (2012). Computer Architecture: A Quantitative Approach. Morgan Kaufmann.

Ho, Y.C. and Cao, X.R. (1991). Perturbation Analysis of Discrete Event Dynamic Systems. Kluwer Academic Pub.

Jung, H. and Pedram, M. (2010). Supervised learning based power management for multicore processors. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 29, 1395–1408.

Krishnaswamy, V., Brooks, J., Konstadinidis, G., McAllister, C., Pham, H., Turullols, S., Shin, J., Yang Gong, Y., and Zhang, H. (2015). 4.3 Fine-grained adaptive power management of the SPARC M7 processor. In IEEE Solid-State Circuits Conference (ISSCC), San Francisco, CA, February 22-26, 2015, 1–3.

Kushner, H.J. and Clark, D.S. (1978). Stochastic Approximation for Constrained and Unconstrained Systems. Springer-Verlag, Berlin.

Lefurgy, C., Wang, X., and Ware, M. (2008). Power capping: a prelude to power shifting. Cluster Computing, 11, 183–195.

Lim, M., Freeh, V., and Lowenthal, D. (2006). Adaptive, transparent frequency and voltage scaling of communication phases in MPI programs. In ACM/IEEE Conf. High Performance Computing, Networking, Storage, and Analysis (Super Computing), Tampa, FL, November 11-17, 1–14.

Lin, X., Xue, Y., Bogdan, P., Wang, Y., Garg, S., and Pedram, M. (2016). Power-aware virtual machine mapping in the data-center-on-a-chip paradigm. In 34th IEEE Intl. Conf. Computer Design (ICCD), Phoenix, AZ, October 3-5, 241–248.

Ogras, U., Marculescu, R., Marculescu, D., and Jung, E. (2009). Design and management of voltage-frequency island partitioned networks-on-chip. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 17, 330–341.

Rountree, B., Ahn, D.H., Supinski, B.R.D., Lowenthal, D.K., and Schulz, M. (2012). Beyond DVFS: A first look at performance under a hardware-enforced power bound. In IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), Shanghai, China, May 21-25.

Sinkar, A., Wang, H., and Kim, N. (2012). Workload-aware voltage regulator optimization for power efficient multi-core processors. In Design, Automation and Test in Europe Conference and Exhibition, March 12-16, Dresden, Germany, 1134–1137.

Srinivasan, V., Brooks, D., Gschwind, M., Bose, P., Zyuban, V., Strenski, P., and Emma, P. (2002). Optimizing pipelines for power and performance. In IEEE/ACM International Symposium on Microarchitecture (MICRO), Istanbul, Turkey, November, 333–344.

Verma, N. and Sharma, A. (2017). Workload prediction model based on supervised learning for energy efficiency in cloud. In IEEE 2nd Intl. Conf. Communication Systems, Computing and IT Applications (CSCITA 2017), Mumbai, India, April 7-8.

Wang, J., Beu, J., Behda, R., Conte, T., Dong, Z., Kersey, C., Rasquinha, M., Riley, G., Song, W., Xiao, H., Xu, P., and Yalamanchili, S. (2014). Manifold: A parallel simulation framework for multicore systems. In 19th IEEE International Symposium on Performance Evaluation of Systems and Software (ISPASS), Monterey, CA, May 23-25.

Wardi, Y., Seatzu, C., Chen, X., and Yalamanchili, S. (2016). Performance regulation of event-driven dynamical systems using infinitesimal perturbation analysis. Nonlinear Analysis: Hybrid Systems, 22, 116–136.

Woo, S., Ohara, M., Torrie, E., Singh, J., and Gupta, A. (2005). The SPLASH-2 programs: Characterization and methodological considerations. In 22nd Annual International Symposium on Computer Architecture, Santa Margherita Ligure, Italy, June 22-24, 24–36.

Wu, Q., Juang, P., Martonosi, M., and Clark, D. (2005). Voltage and frequency control with adaptive reaction time in multiple-clock-domain processors. In 11th IEEE Intl. Symp. High-Performance Computer Architecture, San Francisco, CA, February 12-16, 178–189.

Lefurgy, C., Wang, X., and Ware, M. (2008). Power capping: a prelude to power shifting. Cluster Computing, 11, 183–195. Lim, M., Freeh, V., and Lowenthal, D. (2006). Adaptive, transparent frequency and voltage scaling of communication phases in mpi programs. In ACM/IEEE Conf. High Performance Computing, Networking, Storage, and Analysis (Super Computing), Tampa, FL, November 11 - 17, 1–14. Lin, X., Xue, Y., Bogdan, P., Wang, Y., Garg, S., and Pedram, M. (2016). Power-aware virtual machine mapping in the data-center-on-a-chip paradigm. In 34th IEEE Intl. Conf. Computer Design (ICCD), Phoenix, AZ, October 3-5, 241 – 248. Ogras, U., Marculescu, R., Marculescu, D., and Jung, E. (2009). Design and management of voltage-frequency island partitioned networks-on-chip. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 17, 330–341. Rountree, B., Ahn, D.H., Supinski, B.R.D., Lowenthal, D.K., and Schulz, M. (2012). Beyond dvfs: A first look at performance under a hardware-enforced power bound. In IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), Shanghai, China, May 21 - 25. Sinkar, A., Wang, H., and Kim, N. (2012). Workloadaware voltage regulator optimization for power efficient multi-core processors. In Design, Automation and Test in Europe Conference and Exhibition, March 12–16, Dresden, Germany, 1134 – 1137. Srinivasan, V., Brooks, D., Gschwind, M., Bose, P., Zyuban, V., Strenski, P., and Emma, P. (2002). Optimizing pipelines for power and performance. In IEEE/ACM International Symposium on Microarchitecture (MICRO), Istanbul, Turkey, November, 333 – 344. Verma, N. and Sharma, A. (2017). Workload prediction model based on supervised learning for energy efficiency in cloud. In IEEE 2nd Intl. Conf. Communication Systems, Computing and IT Applications (CSCITA 2017), Mumbai, India, April 7-8. Wang, J., Beu, J., Behda, R., Conte, T., Dong, Z., Kersey, C., Rasquinha, M., Riley, G., Song, W., Xiao, H., Xu, P., and Yalamanchili, S. (2014). Manifold: A parallel simulation framework for multicore systems. In 19th IEEE International Symposium on Performance Evaluation of Systems and Software (ISPASS), Monterey, CA, May 23-25. Wardi, Y., Seatzu, C., Chen, X., and Yalamanchili, S. (2016). Performance regulation of event-driven dynamical systems using infinitesimal perturbation analysis. Nonlinear Analysis: Hybrid Systems, 22, 116–136. Woo, S., Oharat, M., Torriet, E., Singhi, J., and Guptat, A. (2005). The splash-2 programs: Characterization and methodological considerations. In 22nd Annual International symposium on Computer architectures, Santa Margherita Ligure, Italy, June 22-24, 24–36. Wu, Q., Juang, P., Martonosi, M., and Clark, D. (2005). Voltage and frequency control with adaptive reaction time in multiple-clock-domain processors. In 11th IEEE Intl. Symp. High-Performance Computer Architecture, San Francisco, CA, February 12-16, 178–189.