An apprenticeship learning scheme based on expert demonstrations for cross-layer routing design in cognitive radio networks


Int. J. Electron. Commun. (AEÜ) 107 (2019) 221–230


Regular paper

Yihang Du*, Lei Xue, Ying Xu, Zunyang Liu
Electronic Countermeasure Institute, National University of Defense Technology, Anhui 230037, China

Article info

Article history: Received 26 February 2019 Accepted 28 May 2019

Keywords: Cognitive radio; Cross-layer routing design; Apprenticeship learning; Reinforcement responsibility rating

Abstract: In cognitive radio, Reinforcement Learning (RL) has been widely applied to the construction of the cognition engine. However, two crucial challenges remain to be resolved. First, it takes a long time of interaction with the environment before intelligent decisions can be reached. Second, agents improve their performance through trial and error, but some applications in Cognitive Radio Networks (CRN) cannot afford the resulting latency and energy expenditure. An apprenticeship learning scheme based on expert demonstrations is adopted to solve the above problems. First, the reinforcement responsibility rating is introduced to enhance the efficiency of power allocation by allowing multi-level transitions of the transmit power. To avoid the failure of expert node identification due to an SU's remote location, an adaptive radius Bregman Ball model is presented. Furthermore, Multi-Teacher Deep Q-learning from Demonstrations (MT-DQfD) is proposed to accelerate the learning procedure by sharing demonstrations derived from multiple expert nodes. Our experiments illustrate that the proposed cross-layer routing protocol reduces the training period while improving transmission quality compared to traditional algorithms. Moreover, the newly-joined nodes can achieve better performance than the experts. © 2019 Elsevier GmbH. All rights reserved.

1. Introduction

Cognitive Radio (CR) aims to settle the problem of spectrum scarcity by discovering and opportunistically utilizing spectrum holes through time, frequency and spatial multiplexing [1]. While many achievements have been obtained in CR technical fields such as spectrum sensing, channel allocation and spectrum handoff [2–4], opportunistic routing has not received considerable attention from the research community, owing to its complexity and uncertainty. It is of significant importance to take routing into consideration in CRN, since signal quality declines with power attenuation in direct transmission. Moreover, opportunistic routing can bridge the gulf between the source and destination Secondary User (SU) if there is no common frequency band on which they can transmit data flows [5].

The layered architecture is applied in traditional wireless communication and performs well with respect to portability and low design complexity [6]. However, it is not suitable for multi-hop CRN, since the location independence along a multi-hop path leads to different frequency bands at each intermediate node. Furthermore, a strictly layered design is incapable of dealing with the time-variant characteristics of CRN environments. Accordingly, cross-layer routing protocols, which comprehensively exploit the routing parameters in the lower three layers of the Open System Interconnection (OSI) model, have been proposed to achieve better overall performance [7].

Traditional protocols attempted to solve the cross-layer routing problem with explicit and rigid programming. The authors in [8] proposed a cross-layer routing protocol that employed two models, focusing on cluster formation and the calculation of energy expenditure, respectively. In [9], the authors presented a Markov random field based framework and optimized the SUs' route selection, channel allocation and medium access through Gibbs sampling. Nevertheless, fixed routing strategies require a priori information such as spectrum statistics and network topology, which becomes impractical due to the dynamic and unpredictable nature of the CRN.

In order to apperceive the environment and realize real cognition, a CR should be capable of learning and reasoning [10]. Machine learning, and reinforcement learning in particular, is appropriate for joint routing design in CRN. The authors in [11] proposed an integrated channel selection and cluster-based routing scheme, which clusters the SUs in the network and then establishes a multi-path route using Q-routing. In [12], a cross-layer routing protocol based on the Prioritized Memories Deep Q-Network (PM-DQN) was presented to address the joint routing and resource allocation problem in large-scale networks.


These works attempted to achieve cross-layer design through single-agent learning. However, multi-agent schemes are more suitable for CRN, and they have aroused the interest of researchers. The authors in [13] proposed an online multi-agent learning based solution to the joint routing and channel assignment problem, in which SUs search for routes without any global network information while controlling the interference power imposed on the Primary User (PU). In [14], the routing problem was modeled as a stochastic game, which was then solved through a non-cooperative multi-agent learning method in which each SU speculates about the other nodes' strategies via local information.

Although the multi-agent learning strategy is not affected by the network scale and can inherently enhance learning efficiency, it still needs to explore each state-action pair blindly in the beginning, which takes considerable time before converging to an optimum. In this case, apprenticeship learning is adopted to speed up the adaptation process. Unlike general learning schemes, apprenticeship learning skips the initial stage and allows newly-joined SUs to learn strategies from expert nodes. In [15], the Bregman divergence was used to determine the similarity between SUs and to identify expert nodes for the apprentice if their similarity index was smaller than a pre-set threshold. However, with a fixed threshold a newly-joined SU may fail to find a suitable expert if it is geographically isolated. An Inverse Reinforcement Learning (IRL) based apprenticeship learning method was used for spectrum decision in [16] by considering various wireless network parameters. But IRL incurs significant computational expense [17]; moreover, it can only approach the capability of the experts rather than exceed their performance.

In summary, an effective apprenticeship learning framework needs to resolve two crucial issues: expert node identification and apprentice node training [18]. Expert node identification asks how to choose expert nodes with mature strategies and similar transmission tasks. Apprentice node training investigates how to combine the apprentice's local exploration and the experts' strategies effectively to achieve even better performance than the experts.

In this paper, the cross-layer routing problem in multi-hop CRN is studied for transmission delay reduction and energy efficiency improvement. An apprenticeship learning scheme based on the Deep Q-learning from Demonstrations (DQfD) algorithm is adopted to accelerate the learning procedure while attaining transmission policies superior to the experts'. Our contributions are summarized as follows:

(i) The responsibility rating is introduced to adjust the transmit power and compress the huge action space. However, with the responsibility rating the transmit power has to hop one step at a time, which leads to a low convergence rate. The reinforcement responsibility rating is proposed to allocate the transmit power more efficiently: it allows an SU's power to transition across multiple levels in every timeslot until the power configuration reaches a proper range, which accelerates learning and achieves better performance.

(ii) In order to avoid the failure of expert node identification due to an SU's remote location, the adaptive radius Bregman Ball model is introduced to dynamically tune the Bregman Ball radius in accordance with the packet relay ratio of the SU's closest candidate expert. It provides a larger radius for isolated SUs to improve the probability of discovering expert nodes.

(iii) To the best of our knowledge, this is the first work that adopts DQfD to solve the cross-layer routing problem in multi-hop CRN. To further enhance the performance, MT-DQfD is proposed, which exploits demonstration data derived from multiple expert nodes. Simulation results illustrate that the proposed scheme outperforms traditional algorithms in terms of transmission delay, power efficiency and routing robustness.

The rest of this paper is organized as follows: the system model is described in Section 2. Section 3 introduces the reinforcement responsibility rating and models the cross-layer design problem as a Markov Decision Process (MDP). Section 4 presents the adaptive radius Bregman Ball model used to identify the expert nodes. The MT-DQfD based cross-layer routing protocol is proposed in Section 5. Section 6 presents the simulation results, and Section 7 concludes the paper.

2. System model

A multi-hop CRN consisting of M PUs and N SUs is considered, as shown in Fig. 1. The frequency bands are divided into different portions, and each sub-band is assigned to a specific PU according to the fixed spectrum allocation regulation. SUs have no licensed channels and opportunistically access the spectrum bands during the idle periods of the PUs. The lower three layers of the OSI model, comprising the physical (PHY), medium access control (MAC) and network (NET) layers, are also illustrated in Fig. 1. In the CRN scenario, each SU node selects its next hop (NET layer) among the SUs within its transmission range, and two SUs can transmit/receive data successfully only when they have common available channels (MAC layer). Moreover, every SU has to control its transmit power (PHY layer) reasonably, because the power is closely related to the transmission and interference ranges. Consequently, joint routing and resource allocation in multi-hop CRN is a complicated, interdependent problem that calls for a cross-layer design.

The available channel set of an SU consists of the Service Channels (SCH) and the Control Channel (CCH). The SCH set of SU n_i is used for data transmission and is denoted as C_i = {c_1, c_2, ..., c_m}. The CCH is reserved for exchanging control signaling between SUs. Furthermore, the PU's arrival model is considered as an ON/OFF process [19]. The Probability Density Function (PDF) of the OFF periods is given by:

f(\tau) = \begin{cases} \lambda e^{-\lambda \tau}, & \tau \ge 0 \\ 0, & \tau < 0 \end{cases}    (1)

where \lambda is the PU departure rate. Hence, the probability that an SU collides with the PU is calculated as

Fig. 1. Cognitive radio networking scenarios and the layered architecture.


P_{collision} = 1 - P(t \ge \tau) = 1 - \int_{\tau}^{\infty} f(t)\,dt = 1 - e^{-\lambda \tau}    (2)

We obtain the average departure rate \lambda(\mu, \sigma) with mean \mu and deviation \sigma from the definition in [20]. It is assumed that spectrum sensing has already been performed using one of the available spectrum detection techniques [21,22], and that all SU nodes know their local channel availability. Generally, the spectrum statistics parameterized as in [23] remain almost invariable.
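As a toy numerical illustration of Eqs. (1)–(2) (not part of the authors' simulator; the mean and deviation follow Table 2, while the clipping to a positive rate and the transmission duration are our assumptions):

```python
import numpy as np

def collision_probability(tau, lam):
    """P_collision = 1 - exp(-lambda * tau) for an exponential OFF period (Eqs. (1)-(2))."""
    return 1.0 - np.exp(-lam * tau)

# Draw a departure rate lambda(mu, sigma) as in [20] (normal deviate, clipped to stay positive)
rng = np.random.default_rng(0)
lam = max(rng.normal(loc=0.1, scale=0.05), 1e-6)   # mu = 0.1, sigma = 0.05 (Table 2)

# Collision risk for an (illustrative) transmission lasting tau time units
print(collision_probability(tau=0.01, lam=lam))
```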

3. Problem formulation for cross-layer routing

3.1. Reinforcement responsibility rating

To enhance energy efficiency and control the interference to the PU, transmit power allocation is taken into consideration. However, the action space becomes particularly large if route selection, channel access and power control are all treated as actions in the cross-layer routing design. The Q-learning framework follows the rule below to update its Q-value:

\Delta Q_{t+1}(s,a) \propto \alpha \left[ r_t + \beta \max_{b \in A} Q_t(s', b) - Q_t(s, a) \right]    (3)

where \alpha \in [0, 1) is the learning rate, r_t denotes the instantaneous reward, \beta is the discount factor, and A represents the agent's action space. Since the maximization has to be executed at each step, the computational complexity becomes high when A is huge. Consequently, the reinforcement responsibility rating is proposed in this section to compress the action space. SU n_i's single-hop transmission delay is given by:

d_i = S_{packet} \Big/ B \log_2\left(1 + \frac{g_{ijc}\, p_i}{n + \phi^{PU}_{ijc}}\right)    (4)

where S_packet is the packet size, B is the bandwidth of the SCH, g_ijc denotes the channel gain between SU n_i and n_j, \phi^{PU}_{ijc} represents the PU-to-SU interference at the receiver of SU n_i, and n is the AWGN power.

When the transmission time of an SU exceeds its historical expectation in the current stage, the SU should raise its transmit power in the next transmission to reduce the single-hop latency. By contrast, if the latency of the current data transmission is sufficiently short, the SU is inclined to cut down its power in the next timeslot for energy conservation. Based on this principle, our previous work introduced the concept of the responsibility rating [24]. However, the responsibility rating allows the transmit power to hop only one level per timeslot, which reduces the efficiency of power adaptation and extends the convergence time. We therefore present a multi-level transition mechanism called the reinforcement responsibility rating. The reinforcement responsibility rating of SU n_i at time step t is defined as:

u_i^t = \begin{cases} u_i^{t-1} + w_i^t, & \text{if } d_i^{t-1} > \bar{K}_i^{t-1} \\ u_i^{t-1} - f_i^t, & \text{if } d_i^{t-1} \le \bar{K}_i^{t-1} \end{cases}    (5)

where the reinforcement responsibility rating u_i is a nonnegative integer corresponding to one of the transmit power levels, d_i^{t-1} represents SU n_i's single-hop transmission delay at timeslot t-1, and \bar{K}_i^t is SU n_i's average transmission latency, attained from historical information:

\bar{K}_i^t = \bar{K}_i^{t-1} + \frac{1}{t}\left( d_i^t - \bar{K}_i^{t-1} \right)    (6)

The increment counter w_i^t is given by:

w_i^t = \begin{cases} w_i^{t-1} + 1, & \text{if } d_i^{t-1} > \bar{K}_i^{t-1} \\ 0, & \text{if } d_i^{t-1} \le \bar{K}_i^{t-1} \end{cases}    (7)

That is, w_i^t increases step by step if the transmission time is longer than the historical expectation for consecutive timeslots; otherwise, it returns to 0. Similarly, the decrement counter f_i^t is defined as:

f_i^t = \begin{cases} f_i^{t-1} + 1, & \text{if } d_i^{t-1} \le \bar{K}_i^{t-1} \\ 0, & \text{if } d_i^{t-1} > \bar{K}_i^{t-1} \end{cases}    (8)

In addition, if u_i^{t-1} = max{u_i} and d_i^{t-1} > \bar{K}_i^{t-1}, then u_i^t = u_i^{t-1}; if u_i^{t-1} = 0 and d_i^{t-1} \le \bar{K}_i^{t-1}, then u_i^t = 0. It can be seen from the definition that the more times d_i successively exceeds the expected latency threshold, the larger the increment of u_i becomes, and vice versa. This allows the SU's transmit power to move across multiple levels in every timeslot until the power configuration reaches a proper range, which accelerates the power arrangement procedure. Each u_i matches one of the power levels p_i (p_min \le p_i \le p_max), and the matching relation is given by

p_i(u_i^t) = \left(1 - \frac{u_i^t}{|U_i|}\right) p_{min} + \frac{u_i^t}{|U_i|} p_{max}    (9)

where |U_i| denotes the number of values u_i^t can take. Consequently, the reinforcement responsibility rating indicates how much power the SU should consume in order to strike a reasonable tradeoff between transmission latency and power efficiency. Furthermore, it reduces the size of the action space and lowers the computational load.
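A minimal Python sketch of the rating update in Eqs. (5)–(9); the function and variable names are ours, and the clipping at the top level reflects the boundary condition stated above rather than any code from the authors:

```python
def update_average_delay(K_prev, d, t):
    """Running average of the transmission latency (Eq. (6))."""
    return K_prev + (d - K_prev) / t

def update_rating(u, w, f, d_prev, K_prev, u_max):
    """One step of the reinforcement responsibility rating (Eqs. (5), (7), (8))."""
    if d_prev > K_prev:              # slower than the historical average: raise the rating
        w, f = w + 1, 0
        u = min(u + w, u_max)        # multi-level increase, kept at the top level if saturated
    else:                            # fast enough: lower the rating to save energy
        f, w = f + 1, 0
        u = max(u - f, 0)
    return u, w, f

def power_of(u, u_size, p_min, p_max):
    """Map a rating u to a transmit power level (Eq. (9))."""
    return (1 - u / u_size) * p_min + (u / u_size) * p_max
```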

3.2. Problem formulation

In this part, the cross-layer routing problem is modeled as a standard Markov Decision Process (MDP). The MDP can be described by a tuple M = \langle S_i, A_i, T_i, R_i \rangle_{i=1}^{N}, where S_i is the state space of SU n_i, A_i represents SU n_i's action space, T_i(s_i, a_i, s_i') = P(s_i' \mid s_i, a_i) denotes the transition function, and R_i(s_i, a_i) specifies the reward received by SU n_i at s_i \in S_i when performing action a_i \in A_i. The state, action and reward function are defined in detail as follows.

(1) State of node: The state of SU n_i at time step t is defined as

s_i^t = \{ e_i^t, u_i^t, f_i^t \}    (10)

where u_i^t is the reinforcement responsibility rating of SU n_i, f_i^t is its operating frequency at timeslot t, and e_i^t \in \{0, 1\} is a signal reception indicator which indicates whether the Signal-to-Interference-plus-Noise Ratio (SINR) v_i of SU n_i is above or below the threshold v_th:

e_i^t = \begin{cases} 1, & \text{if } v_i \ge v_{th} \\ 0, & \text{otherwise} \end{cases}    (11)

where v_i = g_{ijc}\, p_{u_i} / (n + \phi^{PU}_{ijc}), p_{u_i} is the transmit power of SU n_i, g_ijc denotes the channel gain between nodes n_i and n_j, \phi^{PU}_{ijc} represents the PU-to-SU interference at n_i, and n is the AWGN power. From the definition in (5), we can see that u_i^t depends on the previous value u_i^{t-1}, f_i^t relies on the previous channel selection, and e_i^t depends only on the current action selection. Consequently, the joint design problem can be modeled as a Markov process. In addition, a learning round comes to an end when e_i = 0, i.e., s_i^t = \{0, u_i^t, f_i^t\} is the terminal state in the Markov chain.


(2) Agent's action: The action of agent n_i at time slot t is described as

a_i^t = \{ n_j, c_i, p_{u_i} \}    (12)

where n_j \in N_i is SU n_i's selection of the next relay, and N_i denotes the set of neighboring nodes of SU n_i; c_i \in C_i is SU n_i's available channel, and C_i is the operating channel set of SU n_i; p_{u_i} represents n_i's transmit power corresponding to its current responsibility rating u_i^t. Since p_{u_i} depends entirely on u_i^t, the size of the action space shrinks from |N_i| \times |C_i| \times |P_i| to |N_i| \times |C_i| \times 1, while the size of the state space rises from 2 \times |C_i| to 2 \times |U_i| \times |C_i|. In other words, the responsibility rating compresses the huge action space and achieves a reasonable tradeoff between state and action space in cross-layer routing.

(3) Reward function: Our objective is to minimize the transmission latency and energy expenditure in the cross-layer routing design. The instantaneous reward function is given by

r_i^t = -\log_2( a \cdot d_i + b \cdot q_i )    (13)

where a and b are coefficients that adjust the tradeoff between transmission delay and energy expenditure, d_i represents the transmission latency of SU n_i, and q_i is the Power Consumption Ratio (PCR) defined as follows:

q_i = p_{u_i} \Big/ B \log_2\left(1 + \frac{g_{ijc}\, p_{u_i}}{n + \phi^{PU}_{ijc}}\right)    (14)

The PCR is the energy expenditure incurred per unit of throughput and is used to evaluate power efficiency. The parameters in (14) are the same as in (4) and are not detailed again here.

4. Adaptive radius Bregman Ball model

The first step of apprenticeship learning is that every newly-joined SU node appropriately selects experts with mature experience and similar transmission conditions. We assume that the learning procedure is already stable and that the expert nodes have achieved maximum cumulative rewards. Multiple network parameters are considered to evaluate the similarity between apprentice and expert nodes, as shown in Table 1. Each parameter is a pre-defined value describing the properties of an SU in the transmission task. All of the parameters constitute a high-dimensional signal (x_1, ..., x_6). Given these high-dimensional signal points, the similarity of SUs can be estimated using similarity measurement techniques such as manifold learning (as shown in Fig. 2). Manifold learning adopts the Bregman Ball model to compare complex objects [25]. We then define the concept of a Bregman Ball, which has a center d_k and radius R_k. Any data point X_t is highly similar to the center d_k if it lies inside the ball, that is:

B(d_k, R_k) = \{ X_t \in X : Dis(X_t, d_k) \le R_k \}    (15)

Table 1
Parameters for expert node identification.

Attribute | Symbol | Parameter
Channel Statistics | x_1 | Bandwidth
Channel Statistics | x_2 | Channel Gain
Channel Statistics | x_3 | PU-to-SU Interference
SU Characteristic | x_4 | SINR Threshold
SU Characteristic | x_5 | Delay Tolerance
SU Characteristic | x_6 | Location Information

Fig. 2. Bregman Ball model based expert node identification.

Here Dis(X_t, d_k) is the Symmetric Bregman Divergence (SBD), defined as:

Dis(X_t, d_k) = \left[ d_\phi(X_t, d_k) + d_\phi(d_k, X_t) \right] / 2    (16)

where d_\phi(p, q) is the Bregman divergence, which measures the manifold distance between two data points. The calculation of d_\phi(p, q) is based on a strictly convex function \phi(\cdot):

d_\phi(p, q) = \phi(p) - \phi(q) - \langle p - q, \nabla\phi(q) \rangle    (17)

where \langle \cdot, \cdot \rangle is the inner product operation, and \nabla\phi(q) denotes the gradient vector of \phi evaluated at q. The calculation of d_\phi(q, p) is similar to that of d_\phi(p, q). The reason for defining the Bregman Ball via the SBD is that symmetry and the triangle inequality are not satisfied by the original Bregman divergence [16].

If a newly-joined SU is geographically isolated, few nodes have high similarity with it, since location differences lead to differences in spectrum statistics. Even in this case, learning from expert nodes has a slight advantage over self-learning, because similarity still exists in some aspects such as the SINR threshold and delay tolerance. Thus the similarity restriction has to be relaxed for isolated nodes. If the radius R_k is a fixed threshold distance as in [16] and [25], some isolated apprentice nodes may fail to find suitable experts due to the strict similarity constraint. Therefore, the adaptive radius Bregman Ball model is presented to dynamically tune the radius. It provides a larger R_k for isolated SUs to improve the probability of discovering expert nodes.

A newly-joined SU node is clearly isolated if its closest candidate expert is located in a remote area. Furthermore, if the candidate expert node is geographically remote, few nodes select it as the next relay in the multi-hop network, so few packets are relayed by this node. The Packet Relay Ratio (PRR) of SU n_i describes the proportion of packets relayed by SU n_i among all packets originating from the source nodes, and is given by:

x_i = N_{packet_i} / N_{packet_{source}}    (18)

where N_packet_i is the number of packets relayed by SU n_i, and N_packet_source is the total number of packets originating from the source nodes. Therefore, the PRR of an intermediate SU can serve as an indicator of the SU's degree of isolation. We assume that every apprentice node has the same reference radius r_k, and the radius of the Bregman Ball is calculated as follows:

R_k = r_k \cdot (1 / x_i)    (19)

where x_i is the PRR of the apprentice SU's closest candidate expert node n_i. Thus, the radius for isolated apprentice nodes is larger, since their closest candidate experts have lower PRR, and vice versa.
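A compact sketch of the identification step in Eqs. (15)–(19). The squared Euclidean norm is assumed here as the convex generator \phi (so the Bregman divergence reduces to a squared distance), and the feature vectors are hypothetical; the paper does not specify these implementation details:

```python
import numpy as np

def bregman_div(p, q):
    """d_phi(p, q) with phi(x) = ||x||^2, which reduces to squared Euclidean distance (Eq. (17))."""
    return float(np.sum((p - q) ** 2))

def sbd(x, d):
    """Symmetric Bregman divergence (Eq. (16))."""
    return 0.5 * (bregman_div(x, d) + bregman_div(d, x))

def find_experts(apprentice, candidates, prr_closest, r_k):
    """Return indices of candidates inside the adaptive-radius Bregman Ball (Eqs. (15), (19))."""
    R_k = r_k / prr_closest                      # larger ball when the closest neighbour has low PRR
    return [i for i, d in enumerate(candidates) if sbd(apprentice, d) <= R_k]

# Toy usage: 6-dimensional attribute vectors (Table 1) with made-up values
apprentice = np.array([1.5, 0.8, 0.1, 0.3, 60.0, 0.4])
candidates = [apprentice + 0.1 * np.random.default_rng(i).normal(size=6) for i in range(5)]
print(find_experts(apprentice, candidates, prr_closest=0.05, r_k=0.01))
```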


We now describe the process of expert node identification using the adaptive radius Bregman Ball model. First, when a new SU joins the network, it broadcasts an EXPERT SEARCH message on the CCH. The candidate expert nodes within its transmission range generate reply messages and send them back to the apprentice node; each reply message contains all of the network parameters shown in Table 1. Then, using the reply messages, the newly-joined SU finds its geographically closest candidate node and acquires its PRR x_i. Finally, the newly-joined SU calculates the Bregman Ball radius R_k using Eq. (19) and selects the nodes inside the Bregman Ball as its experts.

5. MT-DQfD based cross-layer routing protocol

To speed up convergence in cross-layer routing, an apprenticeship learning scheme based on DQfD is considered. DQfD leverages demonstration data to massively accelerate learning [26]. The SUs learn from the expert nodes in three steps: demonstration data collection, pre-training, and interaction with the environment. In the demonstration collection phase, the format of an expert transition is set as e = (s, a, r, s'), and the expert SU collects high-quality transitions whose instantaneous reward exceeds a certain threshold. The expert node then transmits these demonstrations to the newly-joined SU via the CCH, and the data is stored in the apprentice's demonstration buffer. Since both the transition size and the demonstration buffer size are limited, the communication overhead of the demonstration data transmission stays within an acceptable range. In the pre-training phase, the agent samples demonstration data to pre-train its neural network. Through pre-training, the SU's strategy is biased towards the expert's using local information only, which saves considerable time and energy. Once pre-training is finished, the agent starts interacting with the real environment based on its learned policy. In this stage, the self-generated data is mainly used to update the network parameters, supplemented by the demonstration data for supervision and fine-tuning.

In the natural DQfD algorithm, the agent learns from a single expert's demonstrations. In order to avoid the biased knowledge of a particular teacher, Multi-Teacher Deep Q-learning from Demonstrations (MT-DQfD) is proposed. Learning from multiple teachers effectively integrates the advantages of different experts. For example, an SU may find a teacher with similar channel statistics to learn how to select spectrum bands, and another teacher with similar SU characteristics to learn how to select a QoS-oriented route [27]. In MT-DQfD, the demonstration buffer D_demo consists of demonstration data from different expert nodes, and the proportion of each teacher's data is based on its normalized similarity index with respect to the apprentice node. The implementation of MT-DQfD is detailed as follows.

5.1. Demonstration data collection

The apprentice node SU_a identifies its expert nodes using the adaptive radius Bregman Ball model. The normalized similarity index of expert node SU_en is derived from the SBD:

sim(SU_a, SU_{en}) = \frac{Dis(SU_{en}, SU_a)}{\sum_{i=1}^{N} Dis(SU_{ei}, SU_a)}    (20)

where Dis(SU_ei, SU_a) represents the SBD between the apprentice node SU_a and expert node SU_ei, and N is the number of SU_a's expert nodes. The apprentice node then collects demonstration data from each expert through the CCH, and the number of demonstrations taken from expert SU_en is calculated as sim(SU_a, SU_en) \cdot |D_demo|, where |D_demo| denotes the size of the demonstration buffer.
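A small sketch of how the demonstration buffer could be split among teachers following Eq. (20) as stated; the buffer size is taken from Table 2, while the expert labels and SBD values are hypothetical:

```python
def allocate_demonstrations(sbd_to_experts, buffer_size=2000):
    """Split the demonstration buffer among experts by normalized similarity index (Eq. (20))."""
    total = sum(sbd_to_experts.values())
    return {expert: int(round(dis / total * buffer_size))
            for expert, dis in sbd_to_experts.items()}

# e.g. three identified experts with their SBD to the apprentice (made-up values)
print(allocate_demonstrations({"SU_3": 0.02, "SU_7": 0.05, "SU_9": 0.03}))
```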


5.2. Pre-training

After demonstration data collection, the apprentice node starts pre-training. During the pre-training phase, the agent samples mini-batches from the demonstration buffer D_demo and adjusts the weights of its neural network using three losses: the Q-learning loss, a supervised loss, and an L2 regularization loss [26]. The Q-learning loss L_QL(Q) ensures that the update follows the Bellman equation [28], so that the network can be used as a starting point for TD learning. Since the amount of demonstration data is limited, it cannot cover all of the actions in each state; some state-action pairs may never be taken, and their Q-values would be updated to unrealistic values. The supervised loss is therefore introduced to make the Q-values of the untaken state-action pairs converge to reasonable values. It is given as follows:

J_E(Q) = \max_{a \in A} \left[ Q(s, a) + m(s, a_E, a) \right] - Q(s, a_E)    (21)

where a_E is the action taken by the expert in state s in the demonstration, and m(s, a_E, a) is a margin function defined as:

m(s, a_E, a) = \begin{cases} 0, & a = a_E \\ C, & \text{otherwise} \end{cases}    (22)

where C represents a positive value that is significantly smaller than the maximum Q-value. From Eq. (21) we can see that if Q(s, a_E) is the largest value in the Q-value set {Q(s, a)}_{a \in A}, the supervised loss J_E(Q) is 0; otherwise, J_E(Q) equals the positive value [Q(s, a) - Q(s, a_E) + C]. This forces the network to update towards the expert's Q-value distribution, which peaks at the action a_E. Furthermore, the L2 regularization loss L_L2(Q) is added to prevent over-fitting on the relatively small demonstration dataset. The overall loss used for updating the network is then given by:

L(Q) = L_{QL}(Q) + c_1 L_E(Q) + c_2 L_{L2}(Q)    (23)

where c_1 and c_2 are parameters that adjust the weighting between the losses.
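For concreteness, the per-sample supervised margin term of Eqs. (21)–(22) and the weighted sum of Eq. (23) can be sketched as below; the array-based Q representation is a placeholder, the margin value is illustrative, and the TD and L2 terms are assumed to be computed elsewhere (Eq. (3) and a standard weight-decay term, respectively):

```python
import numpy as np

def supervised_margin_loss(q_values, a_expert, C=0.8):
    """Large-margin supervised loss J_E(Q) of Eqs. (21)-(22) for one demonstration sample."""
    margins = np.full(q_values.shape, C, dtype=float)
    margins[a_expert] = 0.0                               # m(s, a_E, a)
    return np.max(q_values + margins) - q_values[a_expert]

def total_loss(td_loss, je_loss, l2_loss, c1=1.0, c2=1e-5):
    """Overall pre-training loss L(Q) of Eq. (23); c1 and c2 follow Table 2."""
    return td_loss + c1 * je_loss + c2 * l2_loss
```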

5.3. Interaction with environment

Once pre-training is finished, the agent has learned a relatively effective policy. It continues improving its strategy through interaction with the real environment in the next phase. The agent selects the action a_t with probability [29]:

\pi(s_t, a_t) = \frac{e^{Q(s_t, a_t)/\eta}}{\sum_{b \in A} e^{Q(s_t, b)/\eta}}    (24)

where \eta is a positive parameter called the temperature. A high temperature causes the action probabilities to be nearly equal, while a low temperature causes large differences in the selection probabilities of the actions. After obtaining a transition, the agent stores its self-generated data in the replay buffer D_self; the oldest data is overwritten when the buffer is full. The demonstration data from the multiple experts is stored in the separate buffer D_demo, which remains static. In this phase, both the self-generated data and the demonstration data are sampled when executing experience replay. Only the Q-learning loss is applied to the self-generated data, whereas both the Q-learning loss and the supervised loss are applied to the demonstration data.

In natural DQfD, the demonstration data is sampled with a constant proportion during the self-learning phase. However, despite the similarity between the newly-joined SU and its expert nodes, the SU may experience new radio contexts later on, so some differences still exist. This may cause the SU to select more ineffective actions in the later stage. To avoid this problem, the impact of the expert demonstrations should be weakened gradually, similar to the teacher-student model in the human world [25]. Therefore, a dynamic demonstration ratio method is introduced in MT-DQfD: we set the initial and final values of the demonstration proportion to a and b, respectively. In the self-learning phase, the demonstration proportion is reduced by a fixed step \varepsilon after each learning episode until it reaches b. The update of the demonstration ratio \upsilon after each episode is given by:

\upsilon = \begin{cases} b, & \text{if } \upsilon - \varepsilon \le b \\ \upsilon - \varepsilon, & \text{otherwise} \end{cases}    (25)

Thus a newly-joined SU learns from the expert demonstrations at first, and gradually learns the policy on its own. The details of the MT-DQfD based cross-layer protocol are described in Algorithm 1.

Algorithm 1. MT-DQfD based Joint Routing and Resource Allocation Scheme

1: Inputs: Initialize replay memory D_self = ∅, replay period M, demonstration ratio \upsilon = a, and the number of pre-training steps \kappa.
2: Initialize the Q-function with random weights \theta and the target Q-function with weights \theta^- = \theta.
3: Identify expert nodes using the adaptive radius Bregman Ball model.
4: Calculate the number of each expert's demonstrations via the normalized similarity index sim(SU_a, SU_en) and initialize the demonstration buffer D_demo.
5: Pre-training:
6: For steps t ∈ {1, 2, ..., \kappa} Do
7:   Sample a mini-batch of n transitions from D_demo.
8:   Calculate the loss L(Q) using the target network.
9:   Execute gradient descent to update \theta.
10: End For
11: Self-learning:
12: For episode ∈ {1, 2, ..., E} Do
13:   Initialize state s_1.
14:   For steps t ∈ {1, 2, ...} Do
15:     Choose action a_t based on the policy \pi derived from Q_\theta.
16:     Execute action a_t and observe reward r_t and state s_{t+1}.
17:     Store the transition (s_t, a_t, r_t, s_{t+1}) in D_self.
18:     Sample a mini-batch of n transitions from D_demo ∪ D_self, with a fraction \upsilon of the samples drawn from D_demo.
19:     Calculate the loss L(Q) using the target network.
20:     Execute gradient descent to update \theta.
21:     Every M steps, reset Q̂ = Q.
22:   End For
23:   If \upsilon - \varepsilon > b Then \upsilon ← \upsilon - \varepsilon
24:   Else \upsilon ← b
25: End For

6. Simulation experiment and result analysis

The goal of the MT-DQfD based scheme is to reduce the transmission latency while guaranteeing energy efficiency. In this section, we present simulation results to validate the apprenticeship learning based cross-layer routing protocol.

6.1. Simulation setup

The performance of our proposed scheme is evaluated using an event-driven simulator coded in Python 3.5. The network model and learning framework are built on the Python packages Networkx and Numpy, respectively. The results of the proposed MT-DQfD based cross-layer protocol are compared with (i) the Deep Q-Network (DQN) [30] based scheme; (ii) the natural DQfD [31] based scheme; and (iii) the policy of the expert with the best performance. In the simulations, we consider a square network of 1000 × 1000 m². 20 expert SU nodes and 5 newly-joined SU nodes are uniformly and randomly distributed in the area, and each node is equipped with an omni-directional antenna. The network topology is shown in Fig. 3, where the circular nodes in red and the triangular nodes in pink represent the expert SUs and the newly-joined SUs, respectively. In addition, the number of PU channels is 10, and the available transmit power of each SU node consists of 50 levels for fine-grained power allocation: {100, 110, 120, ..., 600 mW}. Other system parameters are given in Table 2.

Fig. 3. Network topology consisting of 20 expert SUs and 5 newly-joined SUs.

Table 2
Simulation parameters.

Parameter | Value
Link gain | g = eG(r/r_0)^{-m}, for r > r_0 [32]
Available spectrum | 56 MHz – 65 MHz
Bandwidth, B | B ∈ [1, 2] MHz
AWGN power, n | 10^{-7} mW
PU-to-SU interference, \phi^{PU}_{ijc} | \phi^{PU}_{ijc} ∈ [10^{-7}, 10^{-6}] mW
Packet size, S_packet | 10 KB
Mean of PU departure rate, \mu | 0.1
Deviation of PU departure rate, \sigma | 0.05
SINR threshold, v_th | v_th ∈ (−100, −80] dBm
Demonstration ratio, \upsilon | 0.1
Initial value of demo ratio, a | 0.8
End value of demo ratio, b | 0.05
Maximal transmission delay, D_th | D_th ∈ (40, 80] ms
Discount factor, \beta | 0.9
Learning rate, \alpha | 0.01
Time slot size, t | 10 ms
Temperature, \eta | 0.005
Weighting of the supervised loss, c_1 | 1
Weighting of the L2 regularization loss, c_2 | 10^{-5}
Pre-training steps, \kappa | 800
Replay period, M | 300
Replay buffer size, |D_self| | 2000
Demonstration buffer size, |D_demo| | 2000
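As a rough, hypothetical illustration of such a setup (not the authors' simulator), a random topology with the node counts of Fig. 3 and the transmit-power grid of Section 6.1 could be generated with Networkx as follows; the 250 m connection radius is an assumed value:

```python
import networkx as nx
import numpy as np

rng = np.random.default_rng(42)

# 20 expert SUs and 5 newly-joined SUs dropped uniformly in a 1000 m x 1000 m area;
# nodes within an assumed 250 m transmission range are connected.
positions = {i: tuple(rng.uniform(0, 1000, 2)) for i in range(25)}
G = nx.random_geometric_graph(n=25, radius=250, pos=positions)
roles = {i: ("expert" if i < 20 else "newly_joined") for i in G.nodes}
nx.set_node_attributes(G, roles, "role")

power_levels = np.arange(100, 601, 10)   # transmit power grid from Section 6.1 (mW)
channels = list(range(10))               # 10 PU channels
print(G.number_of_nodes(), G.number_of_edges())
```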


6.2. Simulation results

Fig. 4(a) and (b) present the expected reward as a function of the iteration index when the agent adopts the responsibility rating and the reinforcement responsibility rating, respectively. In Fig. 4(a), the reward of all algorithms rises as the iteration index increases during the start-up stage. As expected, after convergence the proposed MT-DQfD outperforms all the other schemes. The reward of DQfD is slightly lower than that of MT-DQfD and very close to the expert's, while DQN achieves the lowest expected reward. In addition, MT-DQfD converges faster than all the other schemes, followed by DQfD, and the DQN based scheme has the lowest convergence speed. This is because pre-training with expert demonstrations speeds up the learning process of the two DQfD based schemes; moreover, learning from multiple teachers avoids the biased knowledge of a particular teacher, which yields better performance. In Fig. 4(b), it can be clearly observed that the expected reward of all algorithms is higher than in Fig. 4(a). The reason is that the reinforcement responsibility rating allows the transmit power to hop multiple levels in each timeslot, instead of one level with the original responsibility rating. This reduces the number of state transitions before reaching the terminal state, which decreases the average transmission latency and PCR per episode simultaneously and thus yields a higher expected reward.

In Figs. 5 and 6, we investigate the expected transmission latency and power consumption ratio as functions of the iteration index when the responsibility rating and the reinforcement responsibility rating are applied, respectively. Fig. 5(a) illustrates that, with an increasing number of iterations, the expected single-hop latency gradually declines for all algorithms. The MT-DQfD based scheme converges faster than the other algorithms and achieves the shortest transmission latency.

Fig. 4. Expected reward vs. iteration index.

Fig. 5. Expected one-hop delay vs. iteration index.

228

Y. Du et al. / Int. J. Electron. Commun. (AEÜ) 107 (2019) 221–230

Fig. 6. Expected power consumption ratio vs. iteration index.

The single-hop delay of the expert and of the DQfD based scheme is almost the same in the end, and slightly higher than that of MT-DQfD. DQN has the slowest convergence speed and the longest transmission delay after convergence. This verifies the advantage of multi-teacher apprenticeship learning in terms of single-hop delay. As shown in Fig. 5(b), the transmission delays of MT-DQfD, DQfD, the expert and DQN are 38%, 35%, 32% and 19% lower at 800 iterations compared to Fig. 5(a). This is because the application of the reinforcement responsibility rating provides better demonstration data for the apprentice node, which trains the neural network more effectively during the pre-training phase. Owing to the contributions of the reinforcement responsibility rating in both the pre-training and self-learning stages, the transmission latency of the DQfD based schemes decreases more noticeably than that of DQN. As shown in Fig. 6, the expected PCR as a function of the iteration index follows a similar trend to Fig. 5 and is not detailed here.

Next, we compare the system performance obtained with different expert identification methods. Fig. 7(a) depicts the end-to-end delay as a function of the number of routes when the fixed radius Bregman Ball model is applied.

It can be seen that, with an increasing number of routes, the end-to-end delay declines for all algorithms. MT-DQfD achieves the lowest end-to-end delay, followed by DQfD, while DQN produces the longest total latency. This is because the reduction of each node's transmission delay contributes to the reduction of the end-to-end delay along the route. Fig. 7(b) shows the end-to-end delay as a function of the number of routes when the adaptive radius Bregman Ball model is adopted, which follows a similar trend to Fig. 7(a). However, the total transmission delays of MT-DQfD and DQfD are 14% and 9% lower at 4000 routes compared to Fig. 7(a), whereas the end-to-end delay of DQN is almost the same as in Fig. 7(a). The reason is that, in the DQfD based schemes, isolated SUs are capable of finding expert nodes when the adaptive radius Bregman Ball model is applied; the number of nodes adopting apprenticeship learning increases, so the system performance improves.

Fig. 7. End-to-end delay vs. the number of routes.


Fig. 8. Packet loss ratio vs. the number of routes.

In the DQN based scheme, all of the agents adopt self-learning, and its performance is independent of the expert identification method. In addition, the adaptive radius Bregman Ball model can help newly-joined SUs find more expert nodes, which further exploits the advantages of the multi-teacher apprenticeship learning scheme. Thus the total delay reduction of MT-DQfD is superior to that of natural DQfD.

In Fig. 8(a) and (b), we further investigate the Packet Loss Ratio (PLR) of the three algorithms when using the fixed and the adaptive radius Bregman Ball model to identify the expert nodes, respectively. In both experimental scenarios, MT-DQfD produces the lowest PLR, followed by the DQfD based scheme, and the PLR of DQN is clearly higher than that of the two apprenticeship learning schemes. Comparing Fig. 8(a) and (b), we find that the PLR of DQN remains almost the same, whereas the two DQfD based schemes achieve a lower PLR when the adaptive radius Bregman Ball model is adopted. In addition, the PLR of MT-DQfD decreases more significantly than that of DQfD. The reason for this finding is that the end-to-end delay reduction of MT-DQfD is more pronounced than that of DQfD when the adaptive radius Bregman Ball model is applied, while the new expert identification method has no impact on the total transmission latency of DQN.

Figs. 9 and 10 demonstrate the effect of the PU arrival probability on the expected single-hop delay and PCR.

Fig. 10. Expected power consumption ratio vs. probability of PU arrival.

As shown in Fig. 9, the transmission delay increases with the probability of PU arrival for all three algorithms. The reason is that frequent PU arrivals cause more interruptions of packet transmission, which results in longer latency. In addition, MT-DQfD achieves the shortest one-hop latency, followed by DQfD, while the transmission delay of DQN is much longer than that of the two DQfD based schemes. This illustrates that apprenticeship learning, which shares the experts' experience, produces better performance than the self-learning scheme. Furthermore, multi-teacher based apprenticeship learning avoids the biased knowledge of a particular teacher, which achieves better results. The effect of the PU arrival probability on the PCR, shown in Fig. 10, follows a similar trend to Fig. 9 and is not detailed here.

Fig. 9. Expected one-hop delay vs. probability of PU arrival.

7. Conclusions

In this paper, we proposed an apprenticeship learning based cross-layer routing scheme called MT-DQfD. Unlike general RL schemes, MT-DQfD allows a newly-joined SU to learn from multiple expert nodes and shortens the interaction time with the environment.


Simulation results illustrate that MT-DQfD achieves lower transmission latency, energy consumption and packet loss ratio than the DQN and natural DQfD based schemes. Furthermore, through MT-DQfD the agent can learn a policy better than the experts'. In this paper, we applied demonstration data to accelerate the learning process, in which reward design was still required. However, it may be difficult to manually specify a reward function because of the complex tradeoff among the desiderata. Generative Adversarial Imitation Learning (GAIL) combines reinforcement learning with Generative Adversarial Networks (GAN) and does not require a hand-designed reward function. Moreover, unlike IRL, which requires reinforcement learning in an inner loop, GAIL has much lower computational complexity. Our future work aims at adopting an adversarial imitation learning based scheme to accelerate the learning procedure in CRN.

Declaration of Competing Interest

We declare that we do not have any commercial or associative interest that represents a conflict of interest in connection with the work submitted.

Acknowledgements

This work is supported by the National University of Defense Technology Scientific Research Project (No. JS17-03-38).

Appendix A. Supplementary material

Supplementary data to this article can be found online at https://doi.org/10.1016/j.aeue.2019.05.041.

References

[1] Mitola J, Maguire GQ. Cognitive radio: Making software radios more personal. IEEE Pers Commun 1999;6(4):13–8.
[2] Mengbo Z, Lunwen W, Yanqing F. Distributed cooperative spectrum sensing based on reinforcement learning in cognitive radio networks. AEU-Int J Electron Commun 2018;94:359–66.
[3] Darsena D, Gelli G, Verde F. An opportunistic spectrum access scheme for multicarrier cognitive sensor networks. IEEE Sens J 2017;17(8):2596–606.
[4] Bayrakdar ME, Calhan A. Improving spectrum handoff utilization for prioritized cognitive radio users by exploiting channel bonding with starvation mitigation. AEU-Int J Electron Commun 2017;71:181–91.
[5] Pourpeighambar B, Dehghan M, Sabaei M. Multi-agent learning based routing for delay minimization in cognitive radio networks. J Network Comput Appl 2017;84:82–92.
[6] Zuo J, Dong C, Ng SX, Yang LL, Hanzo L. Cross-layer aided energy-efficient routing design for ad hoc networks. IEEE Commun Surv Tutorials 2015;17(3):1214–38.
[7] Awang A, Husain K, Kamel N, Aïssa S. Routing in vehicular ad-hoc networks: a survey on single- and cross-layer design techniques, and perspectives. IEEE Access 2017;5(99):9497–517.
[8] Singh R, Verma AK. Energy efficient cross layer based adaptive threshold routing protocol for WSN. AEU-Int J Electron Commun 2017;72:166–73.

[9] Anifantis E, Karyotis V, Papavassiliou S. A spatio-stochastic framework for cross-layer design in cognitive radio networks. IEEE Trans Parallel Distrib Syst 2014;25(11):2762–71.
[10] Amini RM, Dziong Z. An economic framework for routing and channel allocation in cognitive wireless mesh networks. IEEE Trans Netw Serv Manage 2014;11(2):188–203.
[11] Saleem Y, Yau KLA, Mohamad H, Ramli N, Rehmani MH. Joint channel selection and cluster-based routing scheme based on reinforcement learning for cognitive radio networks. In: International Conference on Computer, Communications, and Control Technology. IEEE; 2015. p. 21–25.
[12] Du Y, Zhang F, Xue L. A kind of joint routing and resource allocation scheme based on prioritized memories deep Q-network for cognitive radio ad hoc networks. Sensors 2018;18(7).
[13] Pourpeighambar B, Dehghan M, Sabaei M. Joint routing and channel assignment using online learning in cognitive radio networks. Wireless Networks 2018;2:1–15.
[14] Pourpeighambar B, Dehghan M, Sabaei M. Non-cooperative reinforcement learning based routing in cognitive radio networks. Comput Commun 2017;106:11–23.
[15] Koushik AM, Hu F, Kumar S. Intelligent spectrum management based on transfer actor-critic learning for rateless transmissions in cognitive radio networks. IEEE Trans Mob Comput 2018;PP(99):1–1.
[16] Wu Y, Hu F, Kumar S, Matyjas JD, Sun Q, Zhu Y. Apprenticeship learning based spectrum decision in multi-channel wireless mesh networks with multi-beam antennas. IEEE Trans Mob Comput 2017;PP(99):314–25.
[17] Ho J, Ermon S. Generative adversarial imitation learning; 2016.
[18] Zhao Q, Grace D. Agent transfer learning for cognitive resource management on multi-hop backhaul networks. In: Future Network & Mobile Summit. IEEE; 2013.
[19] Singh K, Moh S. An energy-efficient and robust multipath routing protocol for cognitive radio ad hoc networks. Sensors 2017;17(9):2027.
[20] Box GEP, Muller ME. A note on the generation of random normal deviates. Ann Math Statist 1958;29(2):610–1.
[21] DasMahapatra S, Sharan SN. A general framework for multiuser de-centralized cooperative spectrum sensing game. AEU-Int J Electron Commun 2018;92:74–81.
[22] Hu H, Zhang H, Yu H. Delay QoS guaranteed cooperative spectrum sensing in cognitive radio networks. AEU-Int J Electron Commun 2013;67(9):804–7.
[23] Wellens M, Riihijarvi J, Mahonen P. Evaluation of adaptive MAC-layer sensing in realistic spectrum occupancy scenarios. In: IEEE Symposium on New Frontiers in Dynamic Spectrum. IEEE; 2010.
[24] Du Y, Chen C, Ma P, Xue L. A cross-layer routing protocol based on quasi-cooperative multi-agent learning for multi-hop cognitive radio networks. Sensors 2019;19(1).
[25] Koushik AM, Hu F, Kumar S. Intelligent spectrum management based on transfer actor-critic learning for rateless transmissions in cognitive radio networks. IEEE Trans Mob Comput 2017:1–1.
[26] Hester T, Vecerik M, Pietquin O, Lanctot M, Schaul T, Piot B, et al. Deep Q-learning from demonstrations; 2017.
[27] Wu Y, Hu F, Zhu Y, Kumar S. Multi-teacher knowledge transfer for optimal CRN spectrum handoff control with hybrid priority queueing. IEEE Trans Veh Technol 2016:1–1.
[28] Shirazi AG, Amindavar H. Channel assignment for cellular radio using extended dynamic programming. AEU-Int J Electron Commun 2005;59(7):401–9.
[29] Du Z, Wang W, Yan Z, Dong W, Wang W. Variable admittance control based on fuzzy reinforcement learning for minimally invasive surgery manipulator. Sensors 2017;17(4):844.
[30] Mnih V, Kavukcuoglu K, Silver D, Graves A, Antonoglou I, Wierstra D, et al. Playing Atari with deep reinforcement learning. Comput Sci 2013.
[31] Hester T, Vecerik M, Pietquin O, Lanctot M, Schaul T, Piot B, et al. Learning from demonstrations for real world reinforcement learning; 2017.
[32] Chen X, Zhao Z, Zhang H. Stochastic power adaptation with multiagent reinforcement learning for cognitive wireless mesh networks. IEEE Trans Mob Comput 2012;12(11).