Accepted manuscript, Nano Communication Networks. DOI: https://doi.org/10.1016/j.nancom.2018.02.003. Received 23 July 2017; accepted 11 February 2018.

Deep Reinforcement Learning: Algorithm, Applications, and Ultra-Low-Power Implementation

Hongjia Li^a,*, Ruizhe Cai^a,*, Ning Liu^a, Xue Lin^b, Yanzhi Wang^a

a Syracuse University, Syracuse, NY, USA
b Northeastern University, Boston, MA, USA

* Hongjia Li and Ruizhe Cai contributed equally to this work.
Email addresses: [email protected] (Hongjia Li), [email protected] (Ruizhe Cai), [email protected] (Ning Liu), [email protected] (Xue Lin), [email protected] (Yanzhi Wang)

Abstract

In order to overcome the limitation of traditional reinforcement learning techniques on the restricted dimensionality of state and action spaces, the recent breakthroughs of deep reinforcement learning (DRL) in AlphaGo and Atari game playing set a good example in handling the large state and action spaces of complicated control problems. The DRL technique is comprised of an offline deep neural network (DNN) construction phase and an online deep Q-learning phase. In the offline phase, DNNs are utilized to derive the correlation between each state-action pair of the system and its value function. In the online phase, a deep Q-learning technique is adopted based on the offline-trained DNN to derive the optimal action and meanwhile update the value estimates and the DNN. This paper is the first to provide a comprehensive study of applications of the DRL framework to cloud computing and residential smart grid systems, along with efficient hardware implementations. Based on the introduction of the general DRL framework, we develop two applications, one for the cloud computing resource allocation problem and one for the residential smart grid user-end task scheduling problem. The former achieves up to 54.1% energy saving compared with baselines by automatically and dynamically distributing resources to servers. The latter achieves up to 22.77% total energy cost reduction compared with the baseline algorithm. The DRL framework is mainly utilized for complicated control problems and requires light-weight, low-power implementations in edge and portable systems. To achieve this goal, we develop an ultra-low-power implementation of the DRL framework using the stochastic computing technique, which has the potential of significantly enhancing the computation speed and reducing the hardware footprint and therefore the power/energy consumption. The overall implementation is based on effective stochastic computing-based implementations of approximate parallel counter-based inner product blocks and tanh activation functions. The stochastic computing-based implementation occupies only 57941.61 µm² of area and consumes 6.30 mW of power with a 412.47 ns delay.

Keywords: Deep reinforcement learning, algorithm, stochastic computing, ultra-low-power.

1. Introduction

Reinforcement learning [1], as an area of machine learning, has been applied to solve problems in many disciplines, such as control theory, information theory, operations research, and economics. Different from supervised learning, the agent (i.e., the learner) in reinforcement learning learns the policy for decision making through interactions with the environment. While trading off exploration against exploitation, the agent aims to maximize the cumulative reward by taking the proper action at each time step according to the current state of the environment. Due to its generality, reinforcement learning can address problems where exact models are unknown or exact methods are infeasible. To decide which action to take, the computational complexity of reinforcement learning algorithms is in general O(|A| + M) per time step, where |A| is the total number of actions the agent can choose from and M is the number of the most recently experienced state-action pairs kept in memory. Furthermore, the convergence time of reinforcement learning


algorithms is proportional to O(|A| · |S|), where |S| is the total number of states. Therefore, for real-world problems with high-dimensional action and state spaces, traditional reinforcement learning becomes less effective or even infeasible. To overcome this shortcoming of traditional reinforcement learning techniques, the recent breakthroughs of deep reinforcement learning (DRL) in AlphaGo and Atari game playing set a good example in handling large state and action spaces of complicated problems [2][3]. The DRL technique is comprised of an offline deep neural network (DNN) construction phase and an online deep Q-learning phase. In the offline phase, a convolutional neural network, as an example of a DNN, is used to derive the correlation between each state-action pair (s, a) of the system under control and its value function Q(s, a), which represents the expected cumulative (discounted) reward when the system starts at state s, follows action a, and follows a certain policy thereafter. In the online phase, a deep Q-learning technique is adopted based on the offline-trained DNN to derive the optimal action a at the current state s and at the same time update the values of Q(s, a). In this paper, we first introduce the general DRL framework, which can be widely utilized in many applications with different optimization objectives, such as resource allocation, residential smart grids, embedded system power management, and autonomous control. As discussed before, it is especially suitable for complicated system control problems with large state and action spaces. Next we present two applications of the DRL framework, one for the cloud computing resource allocation problem and one for the residential smart grid user-end task scheduling problem. The cloud computing resource allocation application automatically and dynamically distributes resources (virtual machines or tasks) to servers by establishing an efficient strategy. Through extensive experimental simulations using Google cluster traces [4], the DRL framework for cloud computing resource allocation achieves up to 54.1% energy saving compared with the baseline approach. The residential smart grid task scheduling application determines the task scheduling and resource allocation with the goal of simultaneously

maximizing the utilization of photovoltaic (PV) power generation and minimizing the user's electricity cost. The proposed DRL framework for the residential smart grid sets a reference for the industrial and commercial sectors as well. Through extensive experimental simulations with realistic task modeling, the DRL framework for residential smart grid task scheduling achieves up to 22.77% total energy cost reduction compared with the baseline algorithm. The DRL framework is mainly utilized for complicated control problems and requires light-weight, low-power implementations in edge and portable systems with a tight power/energy budget. To achieve this goal, we investigate the ultra-low-power implementation of the DRL framework using the stochastic computing technique. In stochastic computing, a datum is represented as a probabilistic number encoded by the proportion of ones in a bit-stream. Hence, the multiply and add operations, the most common operations in DNN algorithms/models, can be implemented by single gates such as AND or XNOR gates and multiplexers. As a result, stochastic computing has the potential of significantly enhancing the computation speed and reducing the hardware footprint and therefore the power/energy consumption. We develop the ultra-low-power implementation of the DRL framework for residential smart grid task scheduling. The stochastic computing-based ultra-low-power implementation occupies only 57941.61 µm² of area and consumes 6.30 mW of power with a 412.47 ns delay. The rest of this paper is organized as follows. Section 2 discusses related works on the emerging deep reinforcement learning technique. Section 3 introduces the general DRL framework. Section 4 employs the DRL framework for two applications, i.e., cloud computing resource allocation and residential smart grid task scheduling. Section 5 presents the proposed ultra-low-power hardware implementation of the DRL framework using the stochastic computing technique, followed by comprehensive experimental simulations and results. Finally, Section 6 concludes the paper.


2. Related works

In 2013, reference [2] presented the pioneering work on the first deep reinforcement learning model, which successfully learns control policies directly from high-dimensional sensory inputs. The model is a trained deep convolutional neural network applied to seven Atari 2600 games from the Arcade Learning Environment. It outperforms all previous approaches on six of the games and surpasses a human expert on three of them. In 2015, reference [5] developed the first artificial agent capable of learning to excel at a diverse array of challenging tasks. This agent uses end-to-end reinforcement learning to learn policies directly from high-dimensional sensory inputs and actions. Meanwhile, reference [6] presents an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces. Reference [7] proposes a conceptually simple and lightweight framework for deep reinforcement learning that uses asynchronous gradient descent for the optimization of deep neural network controllers. Reference [8] shows that the idea behind the Double Q-learning algorithm, which was introduced in a tabular setting, can be generalized to work with large-scale function approximation; a specific adaptation of the DQN algorithm is proposed and results in much better performance. Reference [3] proposed a new approach to computer Go, in which 'value networks' are used to evaluate board positions and 'policy networks' are used to select moves. A novel combination of supervised learning from human expert games and reinforcement learning from games of self-play is adopted for training these deep neural networks. As for low-footprint hardware implementations, reference [9] presents the first comprehensive design and optimization framework for SC-based DCNNs (SC-DCNNs), using a bottom-up approach.

3. Deep Reinforcement Learning (DRL) Framework

In this section, we present a general DRL framework, which can be utilized to solve complicated problems with large state and action spaces. The DRL


technique consists of two phases: an offline deep neural network (DNN) construction phase and an online deep Q-learning phase [2][5]. In the offline phase, a DNN is adopted to derive the correlation between each state-action pair (s, a) of the system under control and its value function Q(s, a). Q(s, a) represents the expected cumulative (discounted) reward when the system starts at state s, follows action a, and follows a certain policy thereafter. Q(s, a) for a discrete-time system is given as:

Q(s, a) = \mathrm{E}\Big[\, \sum_{k=0}^{\infty} \gamma^k\, r(k) \,\Big|\, s_0 = s,\ a_0 = a \Big]    (1)

where r(t) is the reward rate and γ is the discount rate in a discrete-time system. The Q(s, a) for a continuous-time system is given as:

Q(s, a) = \mathrm{E}\Big[\, \int_{t_0}^{\infty} e^{-\beta (t - t_0)}\, r(t)\, dt \,\Big|\, s_0 = s,\ a_0 = a \Big]    (2)

where β is the discount rate in a continuous-time system.
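To make Eqs. (1) and (2) concrete, the short sketch below accumulates a sampled discounted return for one (s, a) pair, which is the kind of Q(s, a) value estimate collected during the offline phase. The trajectory, reward values, and discount rates are hypothetical illustration data, not taken from the paper.

```python
import math

def discrete_return(rewards, gamma):
    """Sampled discounted return, Eq. (1): sum_k gamma^k * r(k)."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

def continuous_return(reward_events, beta, t0=0.0):
    """Sampled discounted return, Eq. (2): sum over observed reward events
    (t, r) of exp(-beta * (t - t0)) * r, a discretized stand-in for the integral."""
    return sum(math.exp(-beta * (t - t0)) * r for t, r in reward_events)

# Hypothetical trajectory data, for illustration only.
print(discrete_return([1.0, 0.5, 0.25, 0.0], gamma=0.9))
print(continuous_return([(0.0, 1.0), (1.5, 0.5), (4.0, 0.25)], beta=0.5))
```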

In order to construct a sufficiently accurate DNN, the offline phase needs to accumulate sufficient samples of Q(s, a) value estimates and the corresponding (s, a) pairs. These can come from a model-based procedure [10] or from actual measurement data [3]. This procedure includes simulating the control process and obtaining the state transition profile and Q(s, a) value estimates, using an arbitrary but gradually refined policy. The state transition profile is stored in an experience memory D with capacity N_D. The use of experience memory can smooth out learning and avoid oscillations or divergence in the parameters [2]. Based on the stored state transition profile and Q(s, a) value estimates, the DNN is constructed with weight set θ trained using standard training algorithms [11]. The key steps in the offline phase are summarized in lines 1-4 of Algorithm 1. For the online phase, we adopt the deep Q-learning technique based on the offline-trained DNN to select actions and update Q-value estimates. To be more specific, at each decision epoch t_k of an execution sequence, suppose the system under control is in state s_k. The DRL agent performs inference using the DNN to obtain the Q(s_k, a) value estimate for each state-action pair (s_k, a). Then, according to the ε-greedy policy, the action with the maximum Q(s_k, a) value estimate is selected with probability 1 − ε, and a random action is selected with probability ε.

Algorithm 1: The General DRL Framework

Offline DNN construction:
1  Simulate the control process using an arbitrary but gradually refined policy for a sufficiently long time;
2  Obtain the state transition profile and Q(s, a) value estimates during the process simulation;
3  Store the state transition profile and Q(s, a) value estimates in experience memory D with capacity N_D;
4  Train a DNN with features (s, a) and outcomes Q(s, a);

Online deep Q-learning:
5  foreach execution sequence do
6      foreach decision epoch t_k do
7          With probability 1 − ε select the action a_k = argmax_a Q(s_k, a), otherwise randomly select an action;
8          Execute the chosen action in the control system;
9          Observe reward r_k(s_k, a_k) during time period [t_k, t_{k+1}) and the new state s_{k+1} at the next decision epoch;
10         Store transition set (s_k, a_k, r_k, s_{k+1}) in D;
11         Update Q(s_k, a_k) based on r_k and max_{a'} Q(s_{k+1}, a') according to the Q-learning updating rule. One could use a duplicate DNN Q̂ to achieve this goal;
       end
12     Update DNN weight set θ based on the updated Q-value estimates, in a mini-batch manner;
   end


After an action, denoted by a_k, is chosen, the total reward r_k(s_k, a_k) observed during [t_k, t_{k+1}), i.e., before the next decision epoch t_{k+1}, leads to Q-value updates. The reference work [6] proposed to utilize a duplicate DNN Q̂ for Q-value estimate updating, in order to mitigate the potential oscillation of the inference results of the DNN. At the end of the execution sequence, the DNN is updated by the DRL agent using the newly observed Q-value estimates in a mini-batch manner, and will be employed in the next execution sequence. The general DRL framework is summarized in Algorithm 1.

It can be observed from the above procedure that the DRL framework is highly scalable to large state spaces, which distinguishes it from traditional reinforcement learning techniques. On the other hand, the DRL framework requires a relatively low-dimensional action space, because at each decision epoch the DRL agent needs to enumerate all possible actions under the current state and perform inference using the DNN to derive the optimal Q(s, a) value estimate; the action space in the general DRL framework therefore needs to be kept small.

4. Representative Applications of Deep Reinforcement Learning

As deep reinforcement learning can be utilized to solve complicated control problems with a large state space, we present two representative and important applications of the DRL framework, one for the cloud computing resource allocation problem and one for the residential smart grid user-end task scheduling problem.

4.1. DRL Framework for Cloud Computing Resource Allocation

4.1.1. System Model

Cloud computing has enabled convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released

Figure 1: An illustration of the cloud resource allocation system.

with minimal management effort or service provider interaction [12]. Resource allocation plays an important role in cloud computing, and requires automatic control methods to provide resource sharing in an on-demand manner. A complete resource allocation problem in cloud computing systems exhibits high dimensionality in the state and action spaces, which limits the applicability of traditional automatic decision-making approaches, such as reinforcement learning (RL)-based approaches. On the other hand, recent breakthroughs of the DRL technique show promising results in complicated control problems with high-dimensional state spaces, and can potentially resolve the limitation of traditional automatic decision-making approaches in resource allocation applications. In the cloud computing resource allocation problem, a server cluster is considered, which consists of M physical servers that can provide P types of resources. A job scheduler assigns each incoming job to one desired server. Figure 1 provides an overview of the cloud resource allocation environment. The servers process assigned jobs in a first-come-first-served manner. A job will wait in the queue until sufficient resources are released in the server. We define the latency of a job as the actual duration from its arrival time to its completion time. Figure 2 shows a scenario of executing jobs on a server. In this scenario, we assume that jobs 1, 2, 3, and 4 arrive at time instances t1, t2, t3, and t5, respectively. However, job 3 waits until job 1 is completed to


Figure 2: An example of job execution on a server.

start execution due to insufficient resources in the server. The latency of job 3 is t6 − t3, which is longer than job 3's duration. Similarly, job 4 has a latency of t7 − t5. Hence, it is crucial for the job scheduler to avoid overloading servers in order to minimize job latency. To save energy, a server has two working modes: active and sleep. T_on is the time needed by a server to transit from the sleep mode to the active mode. T_off is the time needed by a server to transit from the active mode to the sleep mode when no job is pending or running. All mode transitions are considered uninterruptible. We assume the power consumption of a server in the sleep mode is zero. Based on an empirical non-linear model in [13], the power consumption of a server in the active mode is a function of CPU utilization as follows:

P(u_t) = P(0\%) + \big(P(100\%) - P(0\%)\big)\big(2u_t - u_t^{1.4}\big)    (3)

where u_t denotes the CPU utilization of the server at time t. Please note that the power consumption of a server in the active mode with 0% CPU utilization, i.e., P(0%), is non-zero. We also incorporate the power consumption of a server during mode transitions, which is P(0%).
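A direct transcription of the power model in Eq. (3) is sketched below; the default idle and peak power values are the ones quoted later in the experimental setup (87 W and 145 W from [13]) and are used here only for illustration.

```python
def server_power(u, p_idle=87.0, p_peak=145.0):
    """Active-mode server power from Eq. (3); u is the CPU utilization
    in [0, 1] and the result is in watts."""
    return p_idle + (p_peak - p_idle) * (2.0 * u - u ** 1.4)

# A server at 50% utilization (illustrative):
print(server_power(0.5))  # between 87 W and 145 W
```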

4.1.2. DRL Formulation

We apply the DRL framework to solve the cloud computing resource allocation problem. In order to significantly reduce the action space, we adopt a continuous-time and event-driven decision making mechanism [14] in which each decision epoch coincides with the arrival time of a new job. In this way, the action at each decision epoch is simply the target server for executing the job, and the overhead of periodic value updates in discrete-time reinforcement learning systems is avoided.

State Space: At job j's arrival time t_j, we define the whole system state (denoted by s_{t_j}) as the union of the server cluster state (denoted by s_c^{t_j}) and job j's state (denoted by s_j), i.e.,

s_{t_j} = s_c^{t_j} \cup s_j    (4)

All the M servers can be equally split into I groups, G_1, G_2, ..., G_I. The state of the servers in group G_i at time t_j is g_i^{t_j}. We denote the utilization requirement of resource type p by job j as u_{jp}, and the utilization level of resource type p in server m at time t_j as u_{mp}^{t_j}. The execution duration of job j is denoted by d_j. The whole system state s_{t_j} can then be represented as:

s_{t_j} = [s_c^{t_j}, s_j] = [g_1^{t_j}, \ldots, g_I^{t_j}, s_j] = [u_{11}^{t_j}, \ldots, u_{1|P|}^{t_j}, \ldots, u_{|M||P|}^{t_j}, u_{j1}, \ldots, u_{j|P|}, d_j]    (5)
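As a small illustration of Eq. (5), the state vector is simply the concatenation of the per-server utilization levels, the job's resource requirements, and its duration. The cluster size, number of resource types, and utilization numbers below are hypothetical placeholders.

```python
import numpy as np

M, P = 4, 2                                  # hypothetical cluster size and resource types
server_util = np.random.rand(M, P)           # u_{mp}^{t_j}, one row per server
job_req = np.array([0.25, 0.10])             # u_{j1}, ..., u_{j|P|}
job_duration = np.array([3.0])               # d_j

# Eq. (5): concatenate server utilizations, job requirements, and duration.
state = np.concatenate([server_util.ravel(), job_req, job_duration])
print(state.shape)   # (M*P + P + 1,) -> (11,)
```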

Action Space: We define the index of the server for job assignment as the action. The action space for a cluster consisting of M servers is therefore defined by:

A = \{\, a \mid a \in \{1, 2, \ldots, |M|\} \,\}    (6)

Reward: The profit of the server cluster equals the total revenue from processing all incoming jobs minus the total energy cost and the reliability penalty. A job's latency includes its waiting time in the queue and its processing time in the server, and increased job latency decreases the income earned by processing the job. We adopt hot spot avoidance, which can effectively mitigate resource shortage problems and strengthen data center reliability. As for reliability, anti-co-location is introduced to keep VMs apart and to use disjoint routing paths in the data center. The instantaneous reward function is defined as

r(t) = -\omega_1 \cdot \mathit{Total\_Power}(t) - \omega_2 \cdot \mathit{Number\_VMs}(t) - \omega_3 \cdot \mathit{Reli\_Obj}(t)    (7)

where ω1, ω2, and ω3 are the weights of the three terms; Total_Power(t) denotes the total power consumption of the cluster, representing the energy cost component; Number_VMs(t) denotes the number of pending jobs in the system, representing the negative of the total revenue component; and Reli_Obj(t) represents the reliability penalty component. The total revenue of processing jobs is inversely correlated with VM (job) latencies. According to Little's Theorem [15], the average number of jobs pending in the system is proportional to the average VM (job) latency.

Action Selection and Q-value Update: The definition of the value function Q(s, a) is continuous-time based and given by Eqn. (2). At each k-th decision epoch, which coincides with the arrival time of job j, i.e., t_j, the DRL agent selects the action a_k using the ε-greedy policy. At the next, (k + 1)-st, decision epoch, triggered by the arrival of job j + 1, the Q-value update (from the k-th estimate to the (k + 1)-st estimate) is given by:

Q^{(k+1)}(s_k, a_k) \leftarrow Q^{(k)}(s_k, a_k) + \alpha \Big( \frac{1 - e^{-\beta\tau_k}}{\beta}\, r(s_k, a_k) + e^{-\beta\tau_k} \max_{a'} Q^{(k)}(s_{k+1}, a') - Q^{(k)}(s_k, a_k) \Big)    (8)

where Q^{(k)}(s_k, a_k) is the Q-value estimate at decision epoch k, r(s_k, a_k) denotes the reward function, τ_k denotes the sojourn time that the system remains in state s_k before a transition occurs, α denotes the learning rate, and β denotes the discount rate.
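A minimal sketch of the online action selection and of the continuous-time Q-value update in Eq. (8) is shown below. The list of Q-values stands in for DNN inference results, and all numeric values are illustrative placeholders rather than the paper's data.

```python
import math
import random

def select_action(q_values, epsilon):
    """Epsilon-greedy policy over a list of Q(s_k, a) estimates."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def q_update(q_sa, reward, q_next_max, tau, alpha, beta):
    """Continuous-time Q-learning update of Eq. (8)."""
    target = ((1.0 - math.exp(-beta * tau)) / beta) * reward \
             + math.exp(-beta * tau) * q_next_max
    return q_sa + alpha * (target - q_sa)

# Illustrative numbers only.
q_values = [0.2, 0.7, 0.4]            # Q(s_k, a) for three candidate servers
a_k = select_action(q_values, epsilon=0.1)
new_q = q_update(q_sa=q_values[a_k], reward=-1.5, q_next_max=0.6,
                 tau=2.0, alpha=0.05, beta=0.1)
print(a_k, new_q)
```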

Figure 3: Deep neural network for Q-value estimates with autoencoder and weight share (K = 2).

Offline DNN Construction and Online Deep Q-learning: The DRL framework for cloud computing resource allocation comprises an offline DNN construction phase and an online deep Q-learning phase. The offline phase derives the Q-value estimate for each state-action pair (s_{t_j}, a). A straightforward approach using a conventional feedforward neural network only works well with relatively small state and action spaces [2]. An alternative approach that trains multiple neural networks, each estimating Q-values for a subset of actions (e.g., assigning to server group G_i), would significantly increase the training time, by up to a factor of I. To address these issues, we harness the power of representation learning and weight sharing for DNN construction, with the basic procedure illustrated in Figure 3. Specifically, we first employ an autoencoder to extract a lower-dimensional, high-level representation of the server group state g_i^{t_j} for each possible i, denoted by ḡ_i^{t_j}. Next, for estimating the Q-value of the action of allocating a job to servers in G_i, denoted by a_i^{t_j}, the neural network Sub-Q_i takes g_i^{t_j}, s_j, all ḡ_{i'}^{t_j} (i' ≠ i), and a_i^{t_j} as input features. The dimension difference between g_i^{t_j} and ḡ_{i'}^{t_j} (i' ≠ i) reflects the relative importance of the targeted server group compared with the other groups and results in a reduction of the state space. In addition, we introduce weight sharing among all I autoencoders, as well as among all Sub-Q_i's, to reduce the total number of parameters and the training time. For the online phase, at the beginning of each decision epoch t_j, the Q(s_{t_j}, a) value estimates are derived for each state-action pair (s_{t_j}, a) by inference based on the offline-trained DNN. An action is then selected for the current state s_{t_j} using the ε-greedy policy. At the next decision epoch, the Q-value estimates are updated using Eqn. (8). After the execution of a whole control procedure, the DNN is updated in a mini-batch manner with the newly observed Q-value estimates.
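The representation-learning and weight-sharing idea can be sketched with a modern Keras-style model as below. All layer sizes (group state dimension, encoding dimension, hidden widths, number of groups) are hypothetical placeholders, the per-server action encoding within a group is omitted, and the single shared encoder/Sub-Q pair stands for the weight sharing among the I autoencoders and Sub-Q networks; this is a sketch of the idea, not the exact network of Figure 3 or the paper's TensorFlow 0.10 implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model, Input

I = 3            # number of server groups (hypothetical)
G_DIM = 16       # dimension of a raw group state g_i (hypothetical)
E_DIM = 4        # dimension of the encoded representation (hypothetical)
J_DIM = 5        # dimension of the job state s_j (hypothetical)

# One encoder shared by all groups (weight sharing among the I autoencoders).
encoder = tf.keras.Sequential([layers.Dense(8, activation="relu"),
                               layers.Dense(E_DIM, activation="relu")],
                              name="shared_encoder")

# One Sub-Q network shared by all groups.
sub_q = tf.keras.Sequential([layers.Dense(16, activation="tanh"),
                             layers.Dense(1)], name="shared_sub_q")

group_states = [Input(shape=(G_DIM,), name=f"g_{i}") for i in range(I)]
job_state = Input(shape=(J_DIM,), name="s_j")

encoded = [encoder(g) for g in group_states]
q_outputs = []
for i in range(I):
    # The targeted group keeps its full state; the other groups contribute
    # only their low-dimensional encodings.
    others = [encoded[k] for k in range(I) if k != i]
    features = layers.Concatenate()([group_states[i], job_state] + others)
    q_outputs.append(sub_q(features))   # Q-value of allocating the job to group i

model = Model(inputs=group_states + [job_state], outputs=q_outputs)
model.summary()
```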


4.1.3. Experimental Results of Cloud Computing Resource Allocation

In the simulation setup, we assume a homogeneous server cluster without loss of generality. The idle power consumption is P(0%) = 87 W, and the peak power consumption is P(100%) = 145 W [13]. We set the server power mode transition times to T_on = 30 s and T_off = 30 s. We set the number of servers in each cluster to M = 20, 30, and 40 in different experimental scenarios. We use real data center workload traces extracted from the Google cluster-usage traces collected over a month-long period in May 2011 [4]. In order to simulate the workload on a cluster of 20-40 machines, the traces are split into 200 segments with around 100,000 jobs in each segment, corresponding to a one-week workload for the M-machine cluster. TensorFlow 0.10 [16] is adopted for DNN construction. The round-robin job (VM) allocation policy is adopted as the baseline approach.

Figure 4: Performance of the server cluster with M = 20, 30, and 40: (a) Power usage; (b) Average job latency.

Based on the Google cluster traces, we run five different one-week job traces through the proposed online deep Q-learning framework and compare the average results against the baseline. Figure 4 shows the results for M = 20, 30, and 40. It can be observed that, compared with the baseline, the proposed DRL-based framework on average achieves 20.3%, 47.4%, and 54.1% power consumption savings, while the accumulated latency only increases by 9.5%, 16.1%, and 18.7%, respectively. The accumulated latencies for clusters with different numbers of servers are the same because the workload traces we tested do not overload the 20-server cluster.


Figure 5: Performance of the server cluster with ω3 = 0, 1000, and 4000: (a) Power usage; (b) Average job latency.

This result shows that the proposed DRL-based framework can provide policies that achieve significantly less power consumption without sacrificing much latency. Figure 5 presents the effect of the weight ω3, the weight of the resiliency metric. The proposed framework effectively generates policies that decrease the accumulated latency when this weight increases, because jobs are distributed more evenly. All tested cases achieve at least 47.8% power consumption saving with only a slight increase in job latency. These results show that the weights of the reward function provide effective control of the trade-off between power, latency, and resiliency.

4.2. Residential Smart Grid Task Scheduling

4.2.1. System Model

Dynamic pricing, which sets a time-variant electricity price [17], can effectively incentivize electrical energy users to adjust their energy usage to reduce their energy cost. Dynamic pricing is often coordinated with the Smart Grid, a plug-and-play integration of intelligent interconnected networks of distributed energy systems with pervasive command-and-control functions. The implementation of a dynamic pricing scheme can be achieved based on energy management systems with advanced metering infrastructure [18][19], which provides two-way communication between the electrical energy users and the grid [20]. Additionally, renewables are established around the world as mainstream sources of

Figure 6: Scenarios for executing a task i.

energy nowadays [21]. Integrating a renewable power source, photovoltaics (PV), into both the residential and power grid levels of the smart grid can increase the efficiency of the current power grid and reduce environmental impact. In this work, we reduce users' electricity cost by applying the deep reinforcement learning framework to user-end task scheduling in a Smart Grid equipped with distributed PV power generation devices under dynamic pricing. We employ a slotted time model, i.e., the task scheduling frame (one day) is divided into T = 24 time slots, each with a duration of one hour. We assume that an electrical energy user has a set of tasks that should be completed in the task scheduling frame, i.e., one day. The tasks are indexed with the integer set {1, 2, ..., N}, where N is the total number of tasks in a daily schedule. Each task i ∈ {1, 2, ..., N} has a duration D_i and a desired operating window [E_i, L_i], where E_i denotes the earliest start time and L_i denotes the latest completion time. The tasks are non-interruptible, i.e., each task needs to operate in consecutive time slots. If the actual start time of task i is θ_i, the actual completion time is θ_i + D_i − 1. We also need to guarantee that each task is completed within the task scheduling frame; therefore, we have 1 ≤ θ_i ≤ T − D_i + 1. The power consumption of task i is denoted by p_i. An inconvenience price I_i is determined by the user to represent the penalty for scheduling task i outside its desired operating window. Three scenarios of executing a task are shown in Figure 6. We assume that the residential user is equipped with a distributed PV system. The power generation of the PV system in time slot t is denoted by P_pv(t).


There is no energy storage system at the residential user, and therefore any unused energy from the PV system is wasted. In addition, the PV power generation incurs zero cost for the residential user. Therefore, it is desirable to maximize the PV power utilization through task scheduling. We denote the load power demand from task i in time slot t as P_{load,i}(t), which equals p_i if task i is scheduled in time slot t and 0 otherwise. The total load power consumption in time slot t is P_{load}(t) = \sum_{i=1}^{N} P_{load,i}(t). The power provided from the grid in time slot t is denoted as P_{grid}(t), which depends on P_{pv}(t) and P_{load}(t) as follows:

P_{grid}(t) = \begin{cases} 0, & \text{when } P_{pv}(t) \ge P_{load}(t) \\ P_{load}(t) - P_{pv}(t), & \text{otherwise} \end{cases}    (9)

We consider a dynamic price model C(t, Pgrid (t)) consisting of a time-of-use (TOU) price component and a power consumption price component.

Figure 7: The DRL framework for the smart grid application.

4.2.2. Problem Formulation

Based on the models described above, the residential smart grid user-end task scheduling problem is stated as follows. Given: the specification of each task i (1 ≤ i ≤ N), including E_i, L_i, D_i, p_i, and I_i; the PV power generation profile P_{pv}(t); and the dynamic electricity price C(t, P_{grid}(t)) for 1 ≤ t ≤ T. Find: the start time slot θ_i for each task i.


Minimize:

\mathit{Total\_Cost} = \sum_{t=1}^{T} \Big( C(t, P_{grid}(t)) \cdot P_{grid}(t) + \sum_{i=1}^{N} I_i(t) \Big)    (10)

where the inconvenience cost for task i at time slot t is given as:

I_i(t) = \begin{cases} 0, & \text{when } E_i \le t \le L_i \\ I_i, & \text{otherwise} \end{cases}    (11)

Subject to:

1 \le \theta_i \le T - D_i + 1, \quad \forall i \in \{1, 2, \ldots, N\}    (12)
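The objective in Eqs. (9)-(11) can be evaluated directly for a candidate schedule; the sketch below does so for hypothetical task, price, and PV profiles. All numbers are placeholders, the simple price function stands in for the paper's TOU-plus-consumption model C(t, P_grid(t)), and the inconvenience term is charged for slots a task occupies outside its window, which is one reading of Eq. (11).

```python
import numpy as np

T = 24
# Hypothetical inputs (placeholders, not the paper's data).
tasks = [  # (E_i, L_i, D_i, p_i, I_i)
    (8, 12, 2, 1.5, 0.3),
    (18, 22, 3, 2.0, 0.5),
]
theta = [9, 19]                      # candidate start slots, one per task
ppv = np.clip(np.sin(np.linspace(0, np.pi, T)) * 2.0, 0, None)  # toy PV profile

def price(t, p_grid):
    """Stand-in for C(t, P_grid(t)): a TOU step plus a consumption term."""
    tou = 0.25 if 17 <= t <= 21 else 0.10
    return tou + 0.02 * p_grid

def total_cost(theta):
    p_load = np.zeros(T)
    inconvenience = 0.0
    for (E, L, D, p, I), start in zip(tasks, theta):
        for t in range(start, start + D):          # non-interruptible run
            p_load[t - 1] += p                      # slots are 1-indexed
            if not (E <= t <= L):                   # Eq. (11)
                inconvenience += I
    p_grid = np.maximum(p_load - ppv, 0.0)          # Eq. (9)
    energy = sum(price(t + 1, p_grid[t]) * p_grid[t] for t in range(T))
    return energy + inconvenience                   # Eq. (10)

print(total_cost(theta))
```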

4.2.3. DRL Framework

In this section, we apply the DRL framework to generate an optimal task schedule for the electricity user with the objective of minimizing the user's electricity cost given by Eqn. (10).

State Space: In order to reduce the time complexity, we propose a new definition of the state as the vector of total load power consumption in each time slot t, i.e.,

s = [P_{load}(1), P_{load}(2), \ldots, P_{load}(T)]    (13)

Action Space: An action is defined by:

a = [\theta_1, \theta_2, \ldots, \theta_N]    (14)

where 1 ≤ θ_i ≤ T − D_i + 1 and 1 ≤ i ≤ N.

Reward: In order for the cumulative discounted reward to coincide with the (negative) total energy cost, we define the reward as the negative of the cost increase caused by scheduling the start time of task i at θ_i, i.e.,

r = -\mathit{Total\_Cost\_Increase}(i, \theta_i)    (15)
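A compact, self-contained illustration of Eqs. (13)-(15) is given below: the state is the per-slot load vector, one component of the action places a task, and the reward is the negative cost increase. The flat TOU price is hypothetical, PV is set to zero, and the inconvenience term is omitted for brevity.

```python
import numpy as np

T = 24
hours = np.arange(1, T + 1)
price = np.where((hours >= 17) & (hours <= 21), 0.25, 0.10)  # hypothetical TOU price
ppv = np.zeros(T)                                            # no PV, for brevity

def cost(p_load):
    p_grid = np.maximum(p_load - ppv, 0.0)
    return float(np.sum(price * p_grid))

# State, Eq. (13): total load power in each slot.
state = np.zeros(T)

def schedule_task(state, start, duration, power):
    """Apply one component theta_i of the action (Eq. (14)) and return the
    new state plus the reward of Eq. (15): minus the cost increase."""
    new_state = state.copy()
    new_state[start - 1:start - 1 + duration] += power
    reward = -(cost(new_state) - cost(state))
    return new_state, reward

state, r = schedule_task(state, start=10, duration=2, power=1.5)
print(r)          # negative of the added energy cost
```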

Offline DNN Construction and Online Deep Q-learning: We simulate the control process using generated task sets, following a preliminary control policy. The state transition profile and Q(s, a) value estimates are obtained through the simulation and used as the training data for offline DNN construction.

Figure 8: Power usage/distribution of the residential smart grid user over time using DRL/baseline scheduling: (a) N = 100; (b) N = 300; (c) N = 500.

We construct a three-layer artificial neural network with 26 hidden neurons, which is trained using the previously obtained training data. In the online phase, for each decision epoch k, according to the current system state s_k, the action resulting in the maximum Q(s_k, a) estimate is selected using the ε-greedy policy. The Q(s_k, a) estimates are obtained by performing inference on the offline-trained neural network. Based on the selected actions and observed rewards, the Q-value estimates are updated before the next decision epoch. At the end of one execution sequence, the neural network is updated for use in the next execution sequence. The whole procedure is illustrated in Figure 7.


Table 1: Comparison of the Total Energy Cost of the Residential User using the Proposed DRL-based Framework and the Baseline Algorithm.

Task Number    Total cost by DRL ($)    Total cost by baseline ($)    Decrease
100            7.33                     9.50                          22.77%
300            6.92                     7.91                          12.54%
500            6.72                     7.67                          12.45%

4.2.4. Experimental Results of Task Scheduling for the Residential Smart Grid User

The PV power generation profiles are provided by [22] and were measured at Duffield, VA, in 2007. We adopt the negotiation-based task scheduling algorithm [23] as our baseline system. We compare the total electricity cost for the residential smart grid user, computed by Eqn. (10), using the DRL framework and the baseline algorithm on the following test cases: 100, 300, and 500 tasks for scheduling. Figure 8 shows the comparison of the power distribution of the residential smart grid user using the DRL framework and the baseline algorithm. According to the results, the DRL framework schedules tasks to maximize the coverage of the PV power and avoid the peak of the TOU price more effectively than the baseline method. The results on the total electricity cost are summarized in Table 1; it can be observed that the DRL framework achieves 22.77%, 12.54%, and 12.45% total energy cost reductions when the number of tasks is 100, 300, and 500, respectively.

5. Ultra-Low-Power Hardware Implementation of the Deep Reinforcement Learning Framework

5.1. Background on Stochastic Computing

Stochastic Computing (SC) utilizes bit-streams to represent probabilistic numbers. Each probabilistic number is represented by the proportion of ones


in a binary bit-stream. For example, a 10-bit stream 1101001011 containing six ones represents the probabilistic number P(X = 1) = 6/10 = 0.6. The unipolar encoding format allows us to represent a number x in the range [0, 1] with a probabilistic number according to x = P(X = 1). There is also the bipolar encoding format, which can represent a number x in the range [−1, 1] with a probabilistic number according to x = 2P(X = 1) − 1. Therefore, in the bipolar scheme, the 10-bit stream 1101001011 (with P(X = 1) = 0.6) represents the number 0.2. For a number beyond [0, 1] in the unipolar encoding or [−1, 1] in the bipolar encoding, a pre-scaling operation [24] should be used. Compared with the conventional implementation of arithmetic operations in CMOS circuits, stochastic computing provides a number of advantages, including low-cost hardware implementations of arithmetic operations using standard logic elements [25]. In addition, it can achieve a high degree of fault tolerance and provides the capability to trade off computation time and accuracy without hardware changes. By allowing very high clock rates, the SC paradigm significantly simplifies the hardware implementation.

Figure 9: Stochastic Multiplication. (a) Unipolar multiplication and (b) bipolar multiplication.

Multiplication: Figure 9 shows the multiplier implementations for the two stochastic pulse stream coding formats. The multiplication of two numbers can be implemented with one AND gate in the unipolar encoding format, i.e., y = a × b = P(A = 1) · P(B = 1) = P(Y = 1), while in the bipolar encoding format it can be implemented with one XNOR gate, i.e., y = a × b = (2P(A = 1) − 1) × (2P(B = 1) − 1) = 2[P(A = 1)P(B = 1) + P(A = 0)P(B = 0)] − 1 = 2P(Y = 1) − 1.
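The bit-stream arithmetic of Figure 9 can be emulated in a few lines; the sketch below generates unipolar and bipolar streams and checks that AND and XNOR gates act as multipliers, as described above. The stream length and the example values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
L = 4096                      # bit-stream length (arbitrary)

def unipolar(x):              # encode x in [0, 1] as a random bit-stream
    return (rng.random(L) < x).astype(np.uint8)

def bipolar(x):               # encode x in [-1, 1] via P(X = 1) = (x + 1) / 2
    return unipolar((x + 1.0) / 2.0)

# Unipolar multiplication with an AND gate: y = a * b
a, b = 0.6, 0.5
y = np.bitwise_and(unipolar(a), unipolar(b))
print(y.mean())               # close to 0.30

# Bipolar multiplication with an XNOR gate: y = a * b
a, b = 0.6, -0.5
y = 1 - np.bitwise_xor(bipolar(a), bipolar(b))
print(2 * y.mean() - 1)       # close to -0.30
```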

Figure 10: Stochastic Addition. (a) OR gate, (b) MUX and (c) APC.

Addition: The addition of two numbers can be implemented using either an OR gate or a multiplexer. However, the OR gate-based addition may introduce some inaccuracy, since a 1 OR 1 operation generates a single 1 at the output. The multiplexer is widely used in both unipolar and bipolar encodings. The select signal of the multiplexer is set to select each input with equal probability, and the corresponding output is actually half of the sum. The Approximate Parallel Counter (APC) is another choice, performing a conversion from bit-streams to binary numbers. The APC counts the number of ones in a set of bit-streams by summing the input bits and outputs an approximately equivalent binary number. An APC has two basic components: an Approximate Unit (AU), which is a layer of AND gates and OR gates for accumulating the approximation, and an adder tree consisting of adders to convert the number of ones in the bit-streams to a binary number. The APC achieves the highest accuracy among all stochastic adders, especially for a large number of input bit-streams, at the cost of a higher hardware footprint and power/energy consumption. Different SC implementations of addition are shown in Figure 10.

Figure 11: Stochastic hyperbolic tangent function implementation.


Figure 12: The overall hardware structure of the DRL framework.

Hyperbolic Tangent (tanh): The hyperbolic tangent (tanh) function, used as the activation function in neural networks, is highly suitable for SC-based implementation, given that it can be easily implemented with a K-state finite-state machine (FSM) in the SC domain [26]. The FSM-based implementation can significantly reduce the hardware cost compared to its implementation in the conventional computing domain, as shown in Figure 11. In addition, in general no inaccuracy is induced by replacing the ReLU or sigmoid function with the tanh function. The FSM outputs a zero if the current state is in the left half of the states, and a one in the other states. In this configuration, the approximate transfer function is:

Stanh(K, x) \cong \tanh\Big(\frac{K}{2}\, x\Big)    (16)

where Stanh stands for the stochastic tanh function.
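The K-state FSM behind Eq. (16) can be simulated directly. The sketch below is a behavioural model only, with an arbitrary state count and input value: a saturating counter walks over a bipolar input stream and outputs 1 whenever the state is in the upper half, which approximates tanh(Kx/2).

```python
import math
import numpy as np

rng = np.random.default_rng(1)

def stanh(bits, K):
    """Behavioural model of the K-state Stanh FSM [26]."""
    state, out = K // 2, np.empty_like(bits)
    for i, b in enumerate(bits):
        state = min(K - 1, state + 1) if b else max(0, state - 1)
        out[i] = 1 if state >= K // 2 else 0      # right half of states -> output 1
    return out

x, K, L = 0.4, 8, 8192                            # arbitrary example values
bits = (rng.random(L) < (x + 1) / 2).astype(np.uint8)   # bipolar encoding of x
y = stanh(bits, K)
print(2 * y.mean() - 1, math.tanh(K * x / 2))     # the two should be close
```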

5.2. SC-Based Hardware Implementation of the DRL Framework

Figure 12 shows an overview of the hardware implementation of the DRL framework. The DRL network is implemented using APC-based inner product blocks for the matrix-vector multiplication and Btanh blocks for the tanh activation function. The input layer consists of 30 XNOR gates for processing the inputs and weights, 30 26-input APCs, and Btanh blocks as the activation function. The hidden layer mainly includes a 30-input APC. Converters between stochastic and binary numbers are employed when processing the inputs and generating the outputs. The action selection logic is implemented in the global controller. An effective pipelining technique is applied to enhance the overall throughput.


Figure 13: The proposed 30-input Approximate Parallel Counter (APC).

Table 2: Inaccuracies of the APC-Based Inner Product Blocks

APC                  256      512      1024     (bit stream length)
26-input             2.56%    2.12%    1.71%
30-input             2.34%    2.03%    1.56%
26-input improved    0.63%    0.61%    0.57%
30-input improved    0.61%    0.58%    0.55%

APC-Based Inner Product Block: The network consists of one 26-neuron input layer and one 30-neuron hidden layer. Based on this, both 26-input and 30-input APCs are implemented for the inner product blocks. The structure of the improved 30-input APC is shown in Figure 13. Its inputs are the products of the inputs and their corresponding weights in the inner product block, except for the last pair (the base). The function of the APC is to convert the


number of ones in a column of all input bit-streams to a binary number. For instance, as shown in Figure 13, since the input size is 30, the output is a 5-bit binary number (a count between 0 and 30). In order to further reduce the hardware cost, [27] proposed an APC constructed as an adder tree using inverse mirror full adders. Inverse mirror full adders are smaller and faster adders that output the inverse logic of the true sum and carry-out bits. The internal results in the even layers correspond to the number of ones in the primary inputs, while the internal results in the odd layers represent the number of zeros. Based on this APC structure, we modified the last pair of inputs with a half adder, as shown in Figure 13. Based on our experimental results shown in Table 2, compared with the original approximate parallel counter, our modified version achieves a smaller accuracy degradation. Less than 0.7% inaccuracy is introduced compared to a conventional (exact) accumulative parallel counter, while achieving more than 40% reduction in gate count [27]. Therefore, the APC-based inner product block implementation has a significant advantage in terms of power, energy, and hardware cost.

Btanh Block: We apply the Btanh block for the tanh activation function used in the DRL framework. Btanh is designed to coordinate with the APC-based adder to perform a scaled hyperbolic tangent function. It uses a saturated up/down counter to convert the binary-format outputs from the APC into a stochastic bit-stream. Details about the implementation can be found in [25].

Based on the blocks mentioned above, we implement the proposed DRL network. Table 3 shows the results of our proposed SC-based DRL hardware implementation, with the bit-stream length ranging from 256 to 1024. Table 4 presents the hardware implementation of the same network using conventional binary (fixed-point) computing, with the bit size ranging from 8 bits to 32 bits. It can be observed that the SC-based implementation achieves much smaller power and area costs compared with the binary-based hardware implementations.
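A behavioural (software) model of the APC-based inner product path is sketched below: bipolar input and weight streams are multiplied with XNOR gates, an exact popcount stands in for the approximate parallel counter, and a Btanh-style saturating counter converts the per-cycle counts back into an output bit-stream. The stream length, input width, weights, and counter depth are arbitrary assumptions, the gate-level approximations of the real APC are ignored, and the counter depth is only chosen so that the output roughly tracks the tanh of the inner product.

```python
import numpy as np

rng = np.random.default_rng(2)
N_IN, L = 30, 2048                    # inner-product width and stream length (arbitrary)

def bipolar(x):                       # encode values in [-1, 1] as bit-streams
    return (rng.random((len(x), L)) < (x[:, None] + 1) / 2).astype(np.uint8)

x = rng.uniform(-1, 1, N_IN)
w = rng.uniform(-1, 1, N_IN)
xs, ws = bipolar(x), bipolar(w)

products = 1 - np.bitwise_xor(xs, ws)              # XNOR gates: bipolar multiplication
counts = products.sum(axis=0).astype(np.int64)     # per-cycle popcount (ideal APC)

def btanh(counts, n_in, s_max):
    """Btanh-style saturating up/down counter producing an output bit-stream."""
    state, out = s_max // 2, np.empty_like(counts)
    for i, c in enumerate(counts):
        state = int(np.clip(state + (2 * int(c) - n_in), 0, s_max - 1))
        out[i] = 1 if state >= s_max // 2 else 0
    return out

y_bits = btanh(counts, N_IN, s_max=2 * N_IN)
print(2 * y_bits.mean() - 1)                 # SC estimate of the activation
print(np.tanh(np.dot(x, w)))                 # reference value (roughly comparable)
```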


Table 3: Performance of the Stochastic Computing-based Hardware Implementation of the DRL Framework

Bit Stream Length    Delay (ns)    Power (mW)    Area (µm²)
256                  412.47        6.30          57941.61
512                  824.63        6.30          57990.82
1024                 1648.95       10.76         58089.24

Table 4: Performance of the Binary-based Hardware Implementation of the DRL Framework

Bit Size    Delay (ns)    Power (mW)    Area (µm²)
8           7.60          63.31         1056958.13
16          10.53         217.79        1080106.41
32          14.76         880.25        3450187.80

6. Conclusion

In this paper, we propose a comprehensive DRL framework with an ultra-low-power hardware implementation using stochastic computing. First, a general DRL framework is presented, which can be utilized to solve problems with large state and action spaces. Within this framework, we develop two applications: one for cloud computing resource allocation, which achieves up to 54.1% energy saving compared with baselines, and one for residential smart grid task scheduling, which achieves up to 22.77% total energy cost reduction compared with the baseline. Because DRL is widely utilized in complicated control problems and in portable systems, we implement ultra-low-power hardware for the DRL framework using stochastic computing, which achieves a significant speed improvement and hardware footprint reduction. Effective stochastic computing-based implementations of approximate parallel counter-based inner product blocks and tanh activation functions are designed and adopted in the overall implementation. Experimental results demonstrate

that our stochastic computing-based implementation achieves low energy consumption and a small hardware area.

References

[1] R. S. Sutton, A. G. Barto, Reinforcement learning: An introduction, Vol. 1, MIT Press, Cambridge, 1998.
[2] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, M. Riedmiller, Playing Atari with deep reinforcement learning, arXiv preprint arXiv:1312.5602.
[3] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., Mastering the game of Go with deep neural networks and tree search, Nature 529 (7587) (2016) 484–489.
[4] C. Reiss, J. Wilkes, J. L. Hellerstein, Google cluster-usage traces: format + schema, [Online]. Available: http://code.google.com/p/googleclusterdata/wiki/TraceVersion2 (Nov. 2011).
[5] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., Human-level control through deep reinforcement learning, Nature 518 (7540) (2015) 529–533.
[6] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, D. Wierstra, Continuous control with deep reinforcement learning, arXiv preprint arXiv:1509.02971.
[7] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, K. Kavukcuoglu, Asynchronous methods for deep reinforcement learning, in: International Conference on Machine Learning, 2016.
[8] H. Van Hasselt, A. Guez, D. Silver, Deep reinforcement learning with double Q-learning, in: AAAI, 2016, pp. 2094–2100.
[9] A. Ren, J. Li, Z. Li, C. Ding, X. Qian, Q. Qiu, B. Yuan, Y. Wang, SC-DCNN: highly-scalable deep convolutional neural network using stochastic computing, arXiv preprint arXiv:1611.05939.
[10] J. Rao, X. Bu, C.-Z. Xu, L. Wang, G. Yin, VCONF: a reinforcement learning approach to virtual machines auto-configuration, in: Proceedings of the 6th International Conference on Autonomic Computing, ACM, 2009, pp. 137–146.
[11] D. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980.
[12] P. Mell, T. Grance, The NIST definition of cloud computing.
[13] X. Fan, W.-D. Weber, L. A. Barroso, Power provisioning for a warehouse-sized computer, in: ACM SIGARCH Computer Architecture News, Vol. 35, ACM, 2007, pp. 13–23.
[14] S. J. Duff, O. Bradtke Michael, Reinforcement learning methods for continuous-time Markov decision problems, Adv Neural Inf Process Syst 7 (1995) 393.
[15] D. Gross, J. F. Shortle, J. M. Thompson, C. M. Harris, Fundamentals of Queueing Theory, 4th Edition, Wiley-Interscience, New York, NY, USA, 2008.
[16] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al., TensorFlow: Large-scale machine learning on heterogeneous distributed systems, arXiv preprint arXiv:1603.04467.
[17] Dynamic pricing, Comverge [Online]. Available: http://www.comverge.com/Comverge/media/pdf/Whitepaper/ComvergeDynamic-Pricing-White-Paper.pdf.
[18] J. S. Vardakas, N. Zorba, C. V. Verikoukis, Scheduling policies for two-state smart-home appliances in dynamic electricity pricing environments, Energy 69 (2014) 455–469.
[19] J. S. Vardakas, N. Zorba, C. V. Verikoukis, A survey on demand response programs in smart grids: pricing methods and optimization algorithms, IEEE Communications Surveys & Tutorials 17 (1) (2015) 152–178.
[20] H. Farhangi, The path of the smart grid, IEEE Power and Energy Magazine 8 (1) (2010) 18–28.
[21] Renewables 2016 global status report, REN21 [Online]. Available: http://www.ren21.net/wp-content/uploads/2016/10/REN21-GSR2016FullReport-en-11.pdf.
[22] Baltimore Gas and Electric Company, [Online]. Available: https://supplier.bge.com/electric/load/profiles.asp.
[23] J. Li, Y. Wang, T. Cui, S. Nazarian, M. Pedram, Negotiation-based task scheduling to minimize user's electricity bills under dynamic energy prices, in: Green Communications (OnlineGreencomm), 2014 IEEE Online Conference on, IEEE, 2014, pp. 1–6.
[24] B. Yuan, C. Zhang, Z. Wang, Design space exploration for hardware-efficient stochastic computing: A case study on discrete cosine transformation, in: Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, IEEE, 2016, pp. 6555–6559.
[25] B. R. Gaines, Stochastic computing, in: Proceedings of the April 18-20, 1967, Spring Joint Computer Conference, ACM, 1967, pp. 149–156.
[26] B. D. Brown, H. C. Card, Stochastic neural computation. I. Computational elements, IEEE Transactions on Computers 50 (9) (2001) 891–905.
[27] K. Kim, J. Lee, K. Choi, Approximate de-randomizer for stochastic circuits, in: SoC Design Conference (ISOCC), 2015 International, IEEE, 2015, pp. 123–124.

Figure 1: Hongjia Li

Hongjia Li received her B.S. degree in Automation from Jilin University, Changchun, Jilin, China, in 2013 and the M.S. degree in Electrical Engineering from Syracuse University, Syracuse, New York, USA, in 2015. During 2011, she studied at Zhejiang University, Hangzhou, Zhejiang, China, as an exchange student for half a year. From 2015, she worked at Integrated Medical Devices, Inc., Liverpool, New York, USA, as an associate hardware engineer. Currently, she is a Ph.D. student in Computer Engineering at Syracuse University, Syracuse, New York, USA. Her research interests include deep reinforcement learning, low-power design, and computer-aided design.

Figure 2: Ruizhe Cai


Ruizhe Cai received the B.S. degree in Integrated Circuit Design and Integrated Systems from Dalian University of Technology, Dalian, Liaoning, China, in 2014, and the M.S. degree in Computer Engineering from Syracuse University, Syracuse, NY, USA, in 2016. He was a research student in Communication and Computer Engineering at the Tokyo Institute of Technology, Tokyo, Japan, from 2013 to 2014. He is currently a Ph.D. student in Computer Engineering at Syracuse University, Syracuse, NY, USA. His research interests include neuromorphic computing, deep neural network acceleration, and low-power design.

Figure 3: Ning Liu

Ning Liu received the M.S. degree in electronic engineering from Syracuse University, Syracuse, USA, in 2015. He is currently a Ph.D. student under the supervision of Prof. Yanzhi Wang at Syracuse University. His current research interests include cloud computing, neuromorphic computing, and deep learning. Dr. Xue Lin has been an assistant professor in the Department of Electrical and Computer Engineering at Northeastern University since 2017. She received her Ph.D. degree in Electrical Engineering from the University of Southern California in 2016. Her research interests include machine learning for cyber-physical systems, cloud computing, and emerging device-based system design. She received the best paper award from the International Symposium on VLSI Designs 2014, the top paper award from the IEEE Cloud Computing Conference 2014, and has a popular paper in IEEE Transactions on CAD. She is a Member of IEEE and an

Figure 4: Xue Lin

active reviewer for many conferences and journals in the area.

Figure 5: Yanzhi Wang

Yanzhi Wang has been an assistant professor at Syracuse University since August 2015. He received his B.S. degree from Tsinghua University in 2009 and his Ph.D. degree from the University of Southern California in 2014, under the supervision of Prof. Massoud Pedram. His research interests include neuromorphic computing, energy-efficient deep learning systems, deep reinforcement learning, embedded systems, and wearable devices. He has received best paper awards from the International Symposium on Low Power Electronics Design 2014 and the International Symposium on VLSI Designs 2014, the top paper award from the IEEE Cloud


Computing Conference 2014, and the best paper award and best student presentation award from ICASSP 2017. He has two popular papers in IEEE Trans. on CAD. He has received multiple best paper nominations from the ACM Great Lakes Symposium on VLSI, IEEE Trans. on CAD, the Asia and South Pacific Design Automation Conference, and the International Symposium on Low Power Electronics Design.
