Infrastructure based spectrum sensing scheme in VANET using reinforcement learning


Vehicular Communications 18 (2019) 100161


Vehicular Communications www.elsevier.com/locate/vehcom

Christopher Chembe a, Douglas Kunda a, Ismail Ahmedy b, Rafidah Md Noor b,c,*, Aznul Qalid Md Sabri d, Md Asri Ngadi e

a School of Science, Engineering and Technology, Mulungushi University, Box 80415, Kabwe, Zambia
b Computer System and Technology, Faculty of Computer Science and IT, University of Malaya, 50603 Kuala Lumpur, Malaysia
c Centre of Mobile Cloud Computing Research, University of Malaya, Kuala Lumpur, Malaysia
d Department of Artificial Intelligent, Faculty of Computer Science and Information Technology, University of Malaya, Kuala Lumpur, Malaysia
e Department of Computer Science, Faculty of Computing, Universiti Teknologi Malaysia, Skudai, Johor, Malaysia

Article info

Article history: Received 27 September 2017; Received in revised form 22 March 2019; Accepted 15 May 2019; Available online 21 May 2019

Keywords: Reinforcement learning; Spectrum sensing; Vehicle-to-infrastructure; Cognitive radio; Adaptive sensing

Abstract

Spectrum sensing is one of the fundamental functions performed by a cognitive radio to identify vacant radio spectrum for dynamic spectrum access (DSA). However, many challenges remain before the benefits of DSA can be realized, including multipath fading, shadowing and the hidden primary user (PU) problem. These challenges are more severe in vehicular communication because of unique characteristics such as the dynamic topology caused by vehicle mobility. Furthermore, spectrum sensing depends on the PU traffic pattern, which is not known in advance. In a typical cognitive radio network, the PU plays a passive role, so a sensing technique should account for the PU traffic pattern autonomously. However, most spectrum sensing schemes proposed for vehicular communication assume a static ON/OFF PU model, which does not realistically capture the PU traffic pattern. In this paper, we propose reinforcement learning (RL) to model the traffic pattern of the PU and use the model to predict channels likely to be free in the future. The RL is implemented on the road side unit (RSU), which sends predicted vacant PU channels to vehicles on the road. Before the channels can be used, vehicles perform spectrum sensing. To account for multipath fading and shadowing, adaptive spectrum sensing is proposed. The spectrum sensing results, sensing time and PU channel capacity are combined into a scalar value that is used as the reward for RL at the RSU. The RSU continuously updates the reward for channels of interest using the sensing history from passing vehicles. Compared to history-based schemes from the literature, the RL technique proposed in this paper performs better. © 2019 Elsevier Inc. All rights reserved.

* Corresponding author at: Computer System and Technology, Faculty of Computer Science and IT, University of Malaya, 50603 Kuala Lumpur, Malaysia.
https://doi.org/10.1016/j.vehcom.2019.100161
2214-2096/© 2019 Elsevier Inc. All rights reserved.

1. Introduction

Recent years have seen growth in the number of technologies aimed at improving human lives in all sectors, transportation included. Technologies that improve communication efficiency in transportation are grouped under the umbrella of Intelligent Transportation System (ITS). ITS will play a major role in the automobile industry by providing the means for vehicular communications. ITS applications are envisioned to reduce congestion on the road, increase driver safety and provide infotainment to passengers. The types of information to be shared include road traffic conditions such as accidents, traffic jams and road works, as well as other service messages (e.g. information about free parking lots) [1]. This kind of information is shared through a vehicular ad hoc network (VANET), comprising vehicle to vehicle (V2V) communication in an ad hoc manner and vehicle to infrastructure (V2I) communication [2].

VANET is a network that relies on wireless radio frequency spectrum to facilitate V2V and V2I communications. The first standard designed to support short to medium range vehicular communication services is called dedicated short range communications (DSRC) [3]. DSRC has been allocated 75 MHz of spectrum at 5.9 GHz for ITS communications. In the United States of America, the Federal Communications Commission (FCC) allocated the 75 MHz in the 5.850-5.925 GHz band, covering 7 channels of 10 MHz each with 5 MHz reserved as a guard band [4]. Six of the 7 channels are reserved for service messages, while one channel is used as the control channel for broadcasting safety related messages.


During peak hours and accidents, the density of vehicles on the road increases, resulting in traffic jams. In those circumstances, the DSRC channels are not enough to serve all vehicles [5]. Many solutions have been proposed in the literature to overcome these communication setbacks [6]. Nevertheless, the proposed solutions rely on the 7 DSRC channels in the 5.9 GHz band. Due to the current fixed radio spectrum allocation, the DSRC channels cannot be extended beyond the 5.9 GHz band. However, the fixed allocation of radio spectrum has created what is known as artificial scarcity of radio frequency spectrum. Spectrum occupancy measurement campaigns have shown that some radio frequency bands are heavily used while others are underutilized [7]. Therefore, mechanisms are needed to exploit the underutilized frequency bands, provided no interference is caused to the primary owners of the license. Hence, in 2002 the FCC proposed a mechanism to optimize usage of underutilized radio spectrum through dynamic spectrum access (DSA) [8].

The concept of DSA is to allow unlicensed users to access channels that are allocated to licensed users for exclusive use, provided no harmful interference is caused to the primary owners. In DSA, the incumbent licensed users are called primary users (PU) while unlicensed users are called secondary users (SU). The enabling technology for DSA is the cognitive radio (CR). A CR is an intelligent software defined radio (SDR) that reconfigures its transceiver parameters based on the current radio environment [8]. The CR identifies spectrum holes in licensed frequency bands and is responsible for monitoring and tracking changes in the radio environment [8]. This task is performed through spectrum sensing, which is one of the major tasks performed by the CR.

Accurate spectrum sensing results enable maximum protection of licensed users from interference in communication channels while optimizing the usage of the radio spectrum. However, there are still challenges associated with spectrum sensing in the vehicular communication environment before CR can be realized for spectrum management in VANET. One of the major challenges is the mobility of vehicles [9]. A vehicle needs to identify spectrum holes within a short period of time before moving to another area where spectrum opportunities might not be available. On the other hand, mobility can be leveraged by vehicles to obtain spectrum opportunities at a future time and location if the speed and velocity of the vehicle are known [10]. Other challenges include multipath fading and shadowing of the PU signal due to obstacles such as tall buildings [11]. In addition, the Doppler effect causes radio signal fading, which results in a poor signal-to-noise ratio (SNR) between the sensing vehicle (SU) and the PU source.

Different solutions have been proposed to solve some of the mentioned problems [12]. Nevertheless, many of the proposed techniques have not considered the effect of PU activity duty cycles [13]. The performance of a spectrum sensing scheme depends on the PU activity model assumed. Therefore, PU activities must be well understood to maximize the accuracy of sensing results. Many PU activity models have been proposed in the literature [14,15]. In the VANET environment, the PU activity model commonly assumed by spectrum sensing schemes is a fixed ON/OFF activity model. However, this model has been shown to be unrealistic in a practical environment, because the operation of the primary system is perceived to be random [16]. In addition, the fixed ON/OFF model does not perform well in a mobile PU environment [15]. Therefore, in this paper we propose reinforcement learning (RL) to model the PU traffic pattern based on rewards.
The learned traffic pattern is used to predict PU channels likely to be free in the future, which are sent to vehicles during congestion. The RL is implemented on the road side unit (RSU). The RSU updates the PU channels based on the rewards obtained from vehicles. The reward is a scalar value obtained from the history of vehicles' sensing results, sensing time and PU channel capacity. The sensing results are obtained from vehicles using adaptive sensing. In this paper, adaptive sensing implemented by vehicles is introduced to improve the performance of the energy detector at low SNR (fading environment) or under noise uncertainty. Adaptive sensing is therefore based on an energy detector and a one-order cyclostationary feature detector. Two thresholds are used to decide when the vehicle should exploit each of the two detectors when sensing. The feature detector is used to identify the PU signal in a fading environment where the SNR is low and the energy detector performs poorly. Using RL and adaptive sensing, the proposed approach improves spectrum detection performance for vehicular communication compared to other schemes in the literature that are based on the history of sensing results. This is demonstrated by the simulation results presented in this paper.

The simulation scenario considered in this paper is V2I communication. The rationale for only considering V2I communication is that the RSU keeps the database of PU channels and the history of rewards for those channels. Since the RSU is stationary, it is appropriate for it to coordinate spectrum management, as opposed to the distributed environment in V2V. In addition, vehicles move from one place to another, so the PU might be different and the PU activity pattern for the sensed channels might differ. Therefore, other PU activity models can be used in V2V.

The rest of the paper is organized as follows. Section 2 presents a detailed review of the literature on spectrum sensing in VANET, with a focus on cooperative spectrum sensing decision. Section 3 gives a brief background to reinforcement learning. Section 4 details the modeling of the sensing scheme, including adaptive sensing and RL at the RSU. Section 5 presents the simulation work and a comparison to other schemes in the literature. Finally, Section 6 concludes the paper.

2. Review of related work

The three conventional spectrum sensing techniques commonly employed in the literature are the energy detector, the cyclostationary feature detector and the matched filter detector [13]. These detection techniques are used by individual vehicles to detect spectrum opportunities. Any vehicle on the road senses the licensed channels of interest and decides on the PU's occupancy without involving other vehicles. However, spectrum sensing performed by an individual vehicle is susceptible to various challenges such as multipath fading, shadowing, noise uncertainty and the hidden PU problem. Furthermore, PU activities are not known to the vehicles in advance and the PU system is not involved in the sensing process. Thus, when sensing is performed by an individual vehicle, the sensing performance is not optimal, which can result in interference to the PU system [17]. In addition, it is difficult to accurately model the PU activity pattern in a non-cooperative sensing environment.

To overcome some of the challenges encountered in conventional spectrum sensing methods, cooperative decision has been proposed in the literature. Vehicles on the road can use spatial and temporal diversity to cooperate in deciding the PU occupancy. Cooperative decision in VANET can be divided into two categories: centralized and distributed. Centralized cooperative decision involves vehicles sending local sensing results to a central node which is used as the fusion centre (FC). The central node can be an RSU, or a cluster head in cluster based vehicular communication. The FC is responsible for deciding on the PU occupancy state. It decides on the global sensing result from participating vehicles using either a hard fusion rule or a soft combining rule. In the hard fusion rule, each vehicle senses the desired licensed channel and sends only a binary digit to the FC [18]. The binary digit is either 1 for $H_1$, representing presence of the PU signal, or 0 for $H_0$, representing absence of the PU signal.
Once the individual results are collected, the FC makes the decision using the AND rule, OR rule, K-out-of-M rule or Majority rule [19–22]. Cooperative decision based on the hard fusion rule is simple to implement because the SUs only send a bit (0 or 1), which is easy to manipulate. Nevertheless, the hard fusion rule performs poorly at low vehicle density, especially if the K-out-of-M rule is used. In addition, the fusion rules cannot detect malicious data sent from participating vehicles. For example, the FC cannot distinguish a bit sent from a legitimate SU from one sent by a malicious SU.

To mitigate some of the deficiencies of the hard fusion rule, cooperative decision based on the soft combining approach requires that the channel sample measurements obtained at the local SU are sent to the FC [23]. The FC then combines the collected samples to derive a global PU measurement, which is compared to a predefined threshold. At the FC, the commonly used soft combining methods include weighted linear, equal gain combining and optimal combining [20,24,25]. Compared to the hard fusion approach, soft combining performs better in cooperative decision. However, soft combining is also susceptible to malicious attack. Malicious SUs can send distorted PU measurements which can affect the global result. In addition, extra bandwidth is needed for cooperating SUs to send the weight of local results to the FC. Similar to the hard combining approach, the performance of soft combining degrades at low vehicle density [21].

Other centralized cooperative decision approaches not based on hard fusion or soft combining include the renewal process method [26], the generalized likelihood ratio test (GLRT) using the covariance matrix [27], and coordinated spectrum sensing [28]. In the renewal process, the burden of spectrum decision is on the RSU [26]. It monitors the PU ON/OFF periods and selects which channels to sense. Other vehicles within the coverage area of the RSU are assigned channels to sense and report their individual results back to the RSU. Thereafter, the global result is sent back to the participating vehicles. The GLRT approach uses test statistics obtained blindly from the eigenvalues of the received covariance matrix of the SUs to detect the PU signal. In the coordinated approach, the goal is to mitigate the disadvantages of conventional techniques by choosing a coordinator, which can be either an RSU or a vehicle.

Generally, cooperative decision in the centralized approach is subject to a synchronization problem which is overlooked by the sensing methods presented above [17]. In contrast to conventional individual sensing, cooperative decision requires that sensed results are sent to the FC for global judgment. Thus, each SU incurs some delay when reporting the local sensing result to the RSU. The cumulative delay from cooperating vehicles affects the sensing performance. Another delay is incurred when the RSU sends the global result back to the cooperating vehicles. Due to the speed of vehicles, the global result from the RSU might be useless if the vehicle has moved into another area not covered by the RSU. This can result in many missed spectrum opportunities. One approach to mitigate the problems associated with synchronization is the asynchronous method proposed in [29]. Another challenge associated with centralized cooperative decision is deciding on the optimal number of participating vehicles. The performance of centralized cooperative decision at low vehicle density is poor [21]. Therefore, a trade-off needs to be established between the number of vehicles participating in the cooperative decision and the sensing performance.

In distributed cooperative decision, no infrastructure support is provided to act as FC. Hence, vehicles cooperate among themselves in an ad hoc manner to reach an agreement on the sensing results.
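As an illustration of the hard fusion rules discussed earlier in this section, the following minimal Python sketch expresses the AND, OR and Majority rules as special cases of the K-out-of-M rule applied at the FC. The function names and the convention that a decision of 1 means "PU present" ($H_1$) follow the text, but the code itself is an assumption for illustration and is not taken from the cited works.

```python
from typing import Sequence


def k_out_of_m(decisions: Sequence[int], k: int) -> int:
    """Global decision: 1 (PU present, H1) if at least k local decisions are 1."""
    if not 1 <= k <= len(decisions):
        raise ValueError("k must be between 1 and the number of decisions")
    return int(sum(decisions) >= k)


def or_rule(decisions: Sequence[int]) -> int:
    """PU declared present if any vehicle reports 1 (k = 1)."""
    return k_out_of_m(decisions, 1)


def and_rule(decisions: Sequence[int]) -> int:
    """PU declared present only if every vehicle reports 1 (k = M)."""
    return k_out_of_m(decisions, len(decisions))


def majority_rule(decisions: Sequence[int]) -> int:
    """PU declared present if more than half of the vehicles report 1."""
    return k_out_of_m(decisions, len(decisions) // 2 + 1)
```

With few reporting vehicles, a single erroneous or malicious bit can flip the OR or Majority outcome, which is one way to see why these rules degrade at low vehicle density.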
Different distributed cooperative sensing decision algorithms have been proposed in the literature, which largely fall into the following categories: consensus based [30,31], belief propagation [32] and weighted [29,33]. In the consensus based approach, each individual SU determines the local PU signal status using any conventional technique available. Once the local measurement is obtained, the vehicle exchanges this local result with its immediate neighbours; each neighbour compares the result with its own and forwards the update to other neighbours. The process continues until a consensus is reached. Implementation of the belief propagation approach is similar to the consensus based approach. The minor difference is that in belief propagation, sensing periods are divided into time slots. In each time slot the belief messages are sent to neighbours to generate new beliefs. The process continues iteratively within a time slot until a final result is computed by each individual SU with the help of belief messages from other vehicles. In weighted distributed cooperative decision, each SU assigns a weight to the local result. The weights are then shared among vehicles to determine the global PU occupancy state. The global decision is made based on the weighted majority correlation decision.

Synchronization time among sensing vehicles is a challenge even in distributed cooperative decision. Another challenge overlooked in distributed cooperative decision is the bandwidth needed to complete the iterative communication among vehicles to reach the global result. This is because vehicles have to communicate with their neighbours using the available common channel. In the VANET environment, the use of the control channel of the DSRC band has been proposed. However, this is counterintuitive because the concept of DSA is to free DSRC channels during congestion; using the control channel of the DSRC band can therefore lead to further congestion. Furthermore, distributed cooperative decision performs poorly under low vehicle density, making algorithms based on these approaches unsuitable for sparse environments such as suburban areas. In addition, the distributed cooperative decision approach is susceptible to malicious attacks that can introduce false data to be shared by other vehicles [30]. The false data lures vehicles into reaching the wrong global PU occupancy state, to the benefit of the malicious source.
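To make the iterative exchange in the consensus based approach concrete, the sketch below uses the standard average-consensus update, in which each vehicle moves its local measurement towards those of its neighbours. This particular update rule, the step size `eps` and the neighbour map are assumptions chosen for illustration; the schemes in [30,31] may differ in detail.

```python
from typing import Dict, List


def consensus_step(values: List[float],
                   neighbours: Dict[int, List[int]],
                   eps: float) -> List[float]:
    """One synchronous round: every vehicle i nudges its value towards its neighbours'."""
    return [
        v + eps * sum(values[j] - v for j in neighbours[i])
        for i, v in enumerate(values)
    ]


def run_consensus(values: List[float],
                  neighbours: Dict[int, List[int]],
                  eps: float,
                  iterations: int) -> List[float]:
    """Repeat the exchange for a fixed number of rounds and return the final values."""
    current = list(values)
    for _ in range(iterations):
        current = consensus_step(current, neighbours, eps)
    return current
```

For a connected, undirected neighbour graph and a step size below the reciprocal of the largest number of neighbours, the values approach the average of the initial measurements. Every round costs one message exchange per link, which illustrates the bandwidth demand noted above.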
In recent years, machine learning (ML) in cooperative decision has been exploited in cognitive radio networks not based on VANET [34]. Relevant to this paper, some studies exploit reinforcement learning (RL) in cognitive radio networks. For example, RL is used to optimize the sensing policy for energy efficient cognitive radio networks in [35] and [36]. In another paper [37], RL is used to address cooperation overhead problems such as the sensing delay for reporting local decisions. Nevertheless, RL algorithms have not been fully exploited in the VANET environment. In this work, RL is used to help the RSU learn the PU activity pattern in order to predict licensed channels that will be free in the future and can be used by vehicles during congestion. The RL model continuously optimizes the generated output based on the reward. When the PU is transmitting on a channel continuously, that channel is likely to get a low reward as punishment. On the other hand, when the PU is idle most of the time, that PU channel is likely to get a high reward and hence be recommended for DSA. Therefore, RL can be used to continuously update and optimize the PU activity model that is important in spectrum sensing. The updated learnt PU model is used to predict channels likely to be free in the future based on current and previous rewards.

3. Reinforcement learning

Reinforcement learning is a branch of ML that maps situations to actions in an environment in order to maximize some numerical reward signal [38]. It is defined by characterizing a learning problem rather than by characterizing learning methods, as is the case with other ML techniques such as supervised learning. Therefore, RL is suited to learning complex problems in an environment without a priori knowledge of the dynamics of the environment. This is especially true in VANET, where the PU systems are not directly involved in the sensing process. The VANET network (i.e. vehicles and RSU) needs to sense and figure out the characteristics of the PU systems as well as the PU transmission patterns.


Two prominent entities of RL are the agent and the environment [39]. The agent is an entity that learns and models the dynamics of the environment to achieve the target goal through desired actions. In this work the agent is the RSU, while the environment is the VANET, which is divided into road segments. The goal is for the RSU to find PU channels with good bandwidth throughput and long OFF periods (more spectrum opportunities) for vehicular communication during congestion. Therefore, the main function of the RSU is to define the sensing policy by guiding vehicles to sense particular channels.

The problem of PU activity modeling is formulated as a finite-horizon Markov decision process (MDP) [39]. An MDP is a Markov reward process with a decision to be made at each time instance. The MDP is considered because the RSU (agent) interacts with the VANET environment and makes decisions by selecting the channels to send to vehicles in each time step. The MDP is represented by a set of states $S$ representing the model (observations) of the environment, a reward function $R$ in a given state, the set of actions $A$ that an agent can take, a state transition probability $\rho$ and a discount factor $\gamma \in [0, 1]$. If we let the episode during which the congestion is considered be $T_E$, then the RSU interacts with the environment in time steps $t = 0, 1, 2, \ldots, T_E$. At each step $t$ the RSU (agent) executes an action $A_t$ by sending sensing information to vehicles in the VANET environment. Then at step $t + 1$ the vehicle (SU) sends back the observation $O_{t+1}$ and reward $R_{t+1} \in R$. What happens next is determined by the state $S_{t+1} \in S$. A detailed pictorial representation of RL in the VANET environment is given in Fig. 1.

Fig. 1. Reinforcement learning model.

At a high level, Fig. 1 works as follows. The RSU continuously estimates the vehicle density on the road segment. When the vehicle density exceeds some threshold $d_{th}$, the DSA mode is initialized. The RSU sends channels to passing vehicles that wish to transmit their data on licensed channels, which is termed action $A_t$. The vehicle then performs sensing based on adaptive sensing. The sensing results are sent to the RSU together with the new state $S_{t+1}$ and the reward $R_{t+1}$ for sensing and using that channel if idle. Thereafter, the RSU updates the channel status with the new state and reward. The RSU chooses channels with high reward to send to vehicles. The process continues until the congestion is cleared.

Without loss of generality, the state $S_t$ represents the PU channel status. The cumulative reward $R$ is a weighted value calculated from three parameters: channel availability status $C_{Ak}$, channel bandwidth $C_{Bw}$ and opportunistic time $T_{ot}$. Opportunistic time in this case denotes the combination of sensing and transmission time. Therefore, the reward at $t + 1$ is given by:

$$R_{t+1} = I_A\{C_{Ak}\}\,\left(w_b C_{Bw} + w_t T_{ot}\right) \tag{1}$$

where $w_b$ and $w_t$ represent the weighted contributions of $C_{Bw}$ and $T_{ot}$ respectively. The weights make sure that channels with long OFF periods and high bandwidth get a high reward. $I_A\{C_{Ak}\}$ is the indicator function for channel availability with $C_{Ak} \in \{0, 1\}$. Zero represents no channel for DSA, denoted by $H_1$, and one represents a free channel for vehicle communication, denoted by $H_0$ (further explained in the next section). The reward function $R$ for a given state $S$ and action $A$ is $R_s^a = E[R_{t+1} \mid S_t = s, A_t = a]$, where $E$ is the expectation function [39]. The value of $R$ after $t + 1$ time steps is given by $\gamma^t R$ for $\gamma \in [0, 1]$. Values of the discount factor $\gamma$ close to zero emphasize immediate rewards, while values of $\gamma$ close to one emphasize far-sighted rewards.

Since RL is based on the rewards obtained from the spectrum sensing results of vehicles on the road, it is important to explain the spectrum sensing employed in this paper. Therefore, we explain in detail the spectrum sensing using the energy detector and the one-order cyclostationary feature detector, with emphasis on adaptive sensing.

4. System model

In this paper, we consider a VANET environment based on infrastructure support where the RSU performs two main roles. The first role is estimating the density of vehicles on the road and the second is predicting PU channels likely to be free based on RL (see Fig. 1). Vehicles on the road are responsible for sensing the spectrum holes in licensed channels and transmitting on those channels if they are determined to be free. Thus, a vehicle is assumed to be equipped with two antennas, one to communicate on DSRC channels and one on vacant PU channels. In addition, segmentation of the road is assumed, with each segment covered by the RSU. It should also be noted that additional spectrum is needed during congestion.
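The reward of Eq. (1) and the RSU's preference for high-reward channels can be sketched as follows. This is a minimal illustration under stated assumptions: the default weights, the running-average update with step size `step`, the class name `RsuChannelTable` and the channel labels are hypothetical and are not specified in the text, and the discounted sum is the standard RL return rather than a formula given by the paper.

```python
from typing import Dict, Iterable, List, Sequence


def reward(channel_free: int, bandwidth: float, opportunistic_time: float,
           w_b: float = 0.5, w_t: float = 0.5) -> float:
    """Eq. (1): indicator of availability times the weighted bandwidth and time."""
    if channel_free not in (0, 1):
        raise ValueError("channel_free must be 0 (H1, busy) or 1 (H0, free)")
    return channel_free * (w_b * bandwidth + w_t * opportunistic_time)


def discounted_return(rewards: Sequence[float], discount: float) -> float:
    """Standard discounted sum: the reward at step t is scaled by discount**t."""
    return sum(discount ** t * r for t, r in enumerate(rewards))


class RsuChannelTable:
    """Hypothetical per-channel reward estimate kept at the RSU."""

    def __init__(self, channels: Iterable[str], step: float = 0.1) -> None:
        self.value: Dict[str, float] = {ch: 0.0 for ch in channels}
        self.step = step

    def update(self, channel: str, r: float) -> None:
        """Move the stored estimate a fraction `step` towards the reported reward."""
        self.value[channel] += self.step * (r - self.value[channel])

    def best(self, n: int) -> List[str]:
        """Channels with the highest estimated reward, to be sent to vehicles."""
        return sorted(self.value, key=self.value.get, reverse=True)[:n]
```

A busy channel ($C_{Ak} = 0$) always yields zero reward, so channels on which the PU transmits most of the time sink in the ranking, which matches the punishment behaviour described for the RL model.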
During congestion, vehicles move at very low speed ranging from 0-10km/h in high dense traffic jams and up to 40km/h in medium dense traffic on the road [40]. Therefore, the DSA mode is only activated during traffic congestion on the road. VANET communication is composed of frame by frame structure with each frame denoted by T consisting of sensing time T s and transmission time T − T s . In addition, vehicles need to continuously sense the licensed bands at the start of each frame to monitor the PU activities in order to avoid interference. This is necessary because vehicles in the VANET environment are unaware of the activities of the PU in advance. The flow chart representing the proposed workflow is shown in Fig. 2. The flow chart above demonstrates when to activate the DSA mode to provide extra channels to vehicles on the road. The motivating factor is reducing sensing time while maximizing sensing performance which is achieved through predictive algorithm at RSU using RL. The RSU is also responsible for determining the threshold for activating the DSA mode based on vehicle density. There are many techniques proposed for estimating vehicle density on the roads including inductive loop, surveillance cameras, pressure pads, infra-red counters, using wireless vehicle sensors and many more [41]. In this work, we adapted the beacon message approach which exploits VANET communication [42]. The RSU is responsible for estimating the aggregated vehicle density in each road segment it covers. Whenever a vehicle enters a road segment covered by the RSU, it sends a request to RSU for some services. The RSU then use these beacons from vehicles to estimate the density for a particular time. If the estimated number of vehicles is more than some threshold dth the RSU will initiate DSA mode as

C. Chembe et al. / Vehicular Communications 18 (2019) 100161

5

The non-central chi-square has additional parameter 2γ which is representative of the transmitted SNR

γ given as γ =

σs2 where σn2

σs2

is the power variance of the PU transmitted signal. Based on CLT the energy test statistic can be approximated to Gaussian distribution with large number of sample size (i.e. M ≥ 250) [18] written as:



Te ∼

Fig. 2. Flowchart of the sensing architecture. (For interpretation of the colours in the figure(s), the reader is referred to the web version of this article.)

shown in Fig. 2. In the following subsection we present detailed modeling of sensing at vehicle level and PU activity modeling later. 4.1. Adaptive sensing at individual vehicle level Adaptive spectrum sensing is proposed to increase the robustness of energy detector in low SNR and under noise uncertainty. It gets advantages from simplicity of energy detector and robustness of feature detector. Since adaptive sensing is based on energy detector and feature detector, it is in line to discuss these two in details. 4.1.1. Energy detection In this paper, we assume the PU activities follow ON/OFF model. The ON period represents active PU signal while OFF represents an idle state. If we let y be the signal sampled at the receiving SU, then xth(x = 1, 2, 3, . . . , M ) sample of the binary hypothesis can be given as:



y (x) =

n(x)

for H 0

hs(x) + n(x)

for H 1

(2)

where y (x) is the signal envelop received by the sensing vehicle. The SNR signal originating from the PU transmitter is given by s(x) while h represent the amplitude signal gain between the PU and the sensor. The PU signal is distorted by Additive White Gaussian Noise (AWGN) given by n(x) which follow a Gaussian distribution with mean zero and variance σn2 (i .e ., n(x) ∼ η(0, σn2 )) [18]. The presence of the PU signal is denoted by H 1 while absence of the PU signal is represented by H 0 in which case only noise is detected. Given M sensing samples of the PU signal, the energy test statistic ( T e ) is represented by:

$$T_e = \sum_{i=0}^{M} |y(x)|^2 \tag{3}$$

The test statistic is distributed as:

$$T_e \sim \begin{cases} \chi^2_{2M} & \text{for } H_0 \\ \chi^2_{2M}(2\gamma) & \text{for } H_1 \end{cases} \tag{4}$$

where $M$ is the number of received signal samples, and the terms $\chi^2_{2M}$ and $\chi^2_{2M}(2\gamma)$ correspond to the central and non-central chi-square distributions respectively, each with $2M$ degrees of freedom.

The distribution of the energy test statistic $T_e$ can be approximated using the central limit theorem (CLT) [43], giving:

$$T_e \sim \begin{cases} \mathcal{N}\left(M\sigma_n^2,\ 2M\sigma_n^4\right) & \text{for } H_0 \\ \mathcal{N}\left(M(\sigma_n^2+\sigma_s^2),\ 2M(\sigma_n^2+\sigma_s^2)^2\right) & \text{for } H_1 \end{cases} \tag{5}$$

The performance of the energy detector is measured by the probability of detection ($P_d$), which is the probability that the sensing technique detects the PU signal, for some threshold $\lambda$, when the PU is actually transmitting on the licensed channel. Therefore, we compare the energy test statistic ($T_e$), whose distribution is given by Eq. (5), with the threshold $\lambda$ to determine the presence of the PU signal. Using Eq. (5) and following the performance evaluation of the energy detector (ED), the probability of detection $P_d$ and the probability of false alarm $P_f$ can be written as:

$$P_{d,ED} = Q\left(\frac{\lambda - M(\sigma_n^2+\sigma_s^2)}{\sqrt{2M}\,(\sigma_n^2+\sigma_s^2)}\right) \tag{6}$$

$$P_{f,ED} = Q\left(\frac{\lambda - M\sigma_n^2}{\sqrt{2M}\,\sigma_n^2}\right) \tag{7}$$

In Eq. (6) and (7), $Q(\cdot)$ is the Gaussian Q-function, given as $Q(a) = \frac{1}{\sqrt{2\pi}}\int_a^{\infty}\exp\left(-\frac{t^2}{2}\right)dt$, and $\lambda$ denotes the decision threshold. Note that $P_{f,ED}$ in Eq. (7) depends only on the noise variance $\sigma_n^2$ and not on the power variance $\sigma_s^2$ of the transmitted PU signal. However, $P_{d,ED}$ depends on both the noise variance $\sigma_n^2$ and the power variance $\sigma_s^2$ of the PU signal. This signifies that the detection performance $P_{d,ED}$ of spectrum sensing is subject to the transmission environment, which is affected by multipath fading.

Some works have proposed to model fading behaviour in VANET using the Nakagami model [44], which is a generalized fading model. Nakagami fading considers multipath scattering of signals with comparatively large time delay spreads, with a group of signals that differ in reflection. It can be reduced to either the Rayleigh or the Rician model depending on the m-parameter used [45]. Nevertheless, the challenge is to approximate the m-parameter [45] to suit the sensing environment and the PU under consideration. Therefore, under multipath fading and shadowing, the Rayleigh fading channel in an urban environment is considered in this paper. Thus, Eq. (6) for $P_{d,ED}$ over fading channels can be averaged over the Rayleigh PDF ($f_{ray}(x) = \frac{x}{\sigma^2}\exp\left(-\frac{x^2}{2\sigma^2}\right),\ x \ge 0$) to give [46]:

$$\begin{aligned} \bar{P}_{d,ED} &= \int_0^{\infty} Q(\ldots)\, f_{ray}(\gamma)\, d\gamma \\ &= \exp\left(-\frac{\lambda}{2\sigma_n^2}\right)\sum_{i=0}^{M-2}\frac{1}{i!}\left(\frac{\lambda}{2\sigma_n^2}\right)^{i} + \left(\frac{2\sigma_n^2+\bar{\gamma}}{\bar{\gamma}}\right)^{M-1} \\ &\quad \ast \left(e^{-\frac{\lambda}{2\sigma_n^2+\bar{\gamma}}} - e^{-\frac{\lambda}{2\sigma_n^2}}\sum_{i=0}^{M-2}\frac{1}{i!}\left(\frac{\lambda\bar{\gamma}}{2\sigma_n^2(2\sigma_n^2+\bar{\gamma})}\right)^{i}\right) \end{aligned} \tag{8}$$

In Eq. (8), $\bar{\gamma}$ is the average SNR of the scattered signals due to multipath fading, while $\lambda$ is the detection threshold. In the conventional energy detector, the value of $\lambda$ is predetermined to achieve the designed constant false alarm probability rate (CFAR). The threshold $\lambda$ can therefore be determined from the noise variance $\sigma_n^2$ using Eq. (7) as:

$$\lambda = \sigma_n^2\left(Q^{-1}(P_{f,ED})\sqrt{2M} + M\right) \tag{9}$$
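As a minimal sketch of how Eqs. (6), (7) and (9) fit together, the snippet below computes the CFAR threshold and the resulting detection probability under the CLT approximation. The function names, the unit noise variance, the sample count and the SNR values are illustrative assumptions, not values taken from the simulation.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: zero mean, unit variance


def q_function(x):
    """Gaussian Q-function: tail probability of the standard normal."""
    return 1.0 - _STD_NORMAL.cdf(x)


def q_inverse(p):
    """Inverse Gaussian Q-function: the x for which Q(x) = p."""
    return _STD_NORMAL.inv_cdf(1.0 - p)


def cfar_threshold(noise_var, m, p_fa):
    """Threshold for a target false alarm probability, Eq. (9)."""
    return noise_var * (q_inverse(p_fa) * sqrt(2 * m) + m)


def false_alarm_probability(threshold, noise_var, m):
    """False alarm probability of the energy detector, Eq. (7)."""
    return q_function((threshold - m * noise_var) / (sqrt(2 * m) * noise_var))


def detection_probability(threshold, noise_var, signal_var, m):
    """Detection probability of the energy detector, Eq. (6)."""
    total = noise_var + signal_var
    return q_function((threshold - m * total) / (sqrt(2 * m) * total))


if __name__ == "__main__":
    m, noise_var = 250, 1.0  # assumed values, for illustration only
    lam = cfar_threshold(noise_var, m, p_fa=0.1)
    for snr_db in (-10, -5, 0):
        signal_var = noise_var * 10 ** (snr_db / 10)
        p_d = detection_probability(lam, noise_var, signal_var, m)
        print(f"SNR {snr_db:>4} dB -> P_d = {p_d:.3f}")
```

With the threshold fixed by the false alarm target, the detection probability can only be raised by a higher SNR or more samples, which is the limitation the adaptive scheme is meant to address.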

The fixed threshold $\lambda$ is compared with the energy statistic $T_e$ obtained from sensing to determine the occupancy state of the PU channel. Accordingly, the performance of the energy detector is measured by $P_{f,ED} = P(T_e > \lambda \mid H_0)$ and $P_{d,ED} = P(T_e > \lambda \mid H_1)$. In addition, the effect of sensing time on the average achievable transmission throughput of the cognitive network is measured as follows:

$$R = \frac{T - T_s}{T}\,(1 - P_f)\,P(H_0)\,C_{B0} + \frac{T - T_s}{T}\,(1 - P_d)\,P(H_1)\,C_{B1} \tag{10}$$

where $T$ is the frame length, consisting of the sensing period $T_s$ and the transmission time $T - T_s$, and $C_{B0}$ and $C_{B1}$ denote the channel capacity (bandwidth) of the cognitive network when operating in the absence and in the presence of PUs respectively [47]. Nevertheless, the energy detector fails to capture the aggregation of noise in communication systems that emanates from random power sources such as thermal noise in antennas [43]. In addition, the distance between the PU and the SU can affect the received signal. These issues (including multipath fading) lead to noise uncertainty in the received signal, causing the SU to make uninformed judgments and thus either interfere with the PU transmission system or miss spectrum opportunities. The problem of noise uncertainty and low SNR in spectrum sensing is addressed by using adaptive sensing.

4.1.2. One order cyclostationary feature detection

Cyclostationary feature detection is one of the robust sensing techniques reported in the literature [48]. The feature detector uses the autocorrelation function to determine the desired features of the PU signal in the frequency domain. Feature detection based on the frequency domain is sometimes called two order autocorrelation cyclostationary feature detection [49]. To reduce implementation complexity, the one order cyclostationary (OOC) feature detection technique is used [49,50]. OOC exploits the mean periodicity characteristics of the PU signal in the time domain, as opposed to the frequency domain, thus improving spectrum sensing. The OOC can be modelled as follows: when the PU signal $s_x(t)$ is transmitted through the AWGN channel $n_x(t)$, the combined received signal is given by $y_x(t) = n_x(t) + s_x(t)$. The mean function of $y_x(t)$ at the SU can be written as:

$$MF_y(t)_T = E[y_x(t)] = s_x(t) \tag{11}$$

where $MF_y(t)_T$ is the mean function and $E[\ldots]$ is the expectation operator. Note that $n_x(t)$ does not appear in the mean function because noise is random and has no periodicity. If the PU signal $s_x(t)$ is periodic with period $T_0 = 1/f_s$, then the signal $y_x(t)$ received at the SU is also periodic. Therefore, based on Eq. (11), the mean function $MF_y(t)_T$ is also time varying and periodic with period $T_0$. Given the period $T_0$, the periodicity can be extracted using synchronized averaging. Thus, $y_x(t)$ can be sampled periodically given $MF_y(t) = MF_y(t + kT_0)$ for $k = 0, 1, 2, 3, \ldots$; this characteristic is what is referred to as one order cyclostationarity. For the two hypotheses $H_0$ and $H_1$, the cumulative density function (CDF) of the envelope $MF_y(t)_T$ over AWGN for a defined threshold $\lambda$ is given by:

$$P_{f,OOC} = \exp\left(-\frac{\lambda^2}{2\delta_A^2}\right) \quad \text{for } H_0 \tag{12}$$

$$P_{d,OOC} = Q\left(\frac{\sqrt{2\gamma}}{\delta},\ \frac{\lambda}{\delta_A}\right) \quad \text{for } H_1 \tag{13}$$

in which $\gamma$ is the instantaneous SNR, $Q(\cdot,\cdot)$ represents the generalized Marcum Q-function, and $\delta_A^2 = \frac{\delta^2}{2M+1}$, with $M$ as the number of samples.

4.1.3. Adaptive sensing

The adaptive sensing approach is composed of two stages. In the first stage, the energy detector is used to detect the PU signal. To determine the PU occupancy status, the energy statistic $T_e$ is compared to the threshold given by Eq. (9). To account for noise uncertainty and low SNR, a second threshold derived from Eq. (6) is introduced to achieve a desired constant probability of detection (CPD). The two thresholds are labeled $\lambda_1$ (the former $\lambda$ from Eq. (9)) and $\lambda_2$ (derived from Eq. (6)). If the energy statistic satisfies $T_e < \lambda_1$, the final decision on the occupancy state of the PU signal is $H_0$. Alternatively, if $T_e \ge \lambda_2$, the final decision is $H_1$. However, if the energy statistic falls within $\lambda_1 < T_e < \lambda_2$, no decision is made. Instead, stage two sensing is performed using OOC. OOC detects the PU signal in the time domain, as opposed to the frequency domain. The mean function is used to determine the time peaks in the PU signal, which are then compared to a threshold $\lambda_{PU}$ based on the channel and frequency of interest. Fig. 3 depicts the flowchart of the adaptive sensing, where ED is the energy detector, OOC is the one order cyclostationary detector, and $C_{1,\ldots,k}$ represents the channels of interest with center frequencies $f_{1,\ldots,k}$.

Fig. 3. Flowchart of adaptive spectrum sensing.

The OOC stage has three possible outcomes, as noted from Fig. 3. In the first, if the peaks are determined to be those of the PU signal, the final decision is $H_1$. In the second, if the peaks are determined to be noise or no periodicity is observed, the final decision is $H_0$ and the SU can transmit on that channel. Finally, if the peaks are determined to be those of another SU, the scheme waits for a random time and attempts to sense again. This is done to increase channel reuse. Before sensing again, the SU checks whether the sensing time is less than the total sensing time allowed. If the sensing time exceeds the total sensing time allowed, the scheme assumes $H_1$ and can check other channels that might be free for DSA. However, if the sensing time is less than the total sensing time allowed, the SU can check the same channel again.

The flowchart of Fig. 3 can also be represented as an algorithm. Step 1 sets the predefined thresholds for both ED and OOC using the respective equations. The other steps follow the flowchart shown in Fig. 3. After step 6 or 14, the SU can transmit on the licensed channel $C_k$ because it is determined to be free, given that the decision is $H_0$. However, after step 8, 16 or 21, the SU is not permitted to transmit on the sensed channel because the decision is $H_1$. When the final decision is $H_1$ (the case of steps 8, 16 and 21), the SU can start the sensing process on the next licensed channel.

4.2. Primary user activity modeling

Spectrum sensing performance is highly dependent on the activity pattern (duty cycles) of the PU transmitters. This is because the duty cycles of any communication system are perceived to be random and not statically predetermined. However, much of the VANET literature considers only static PU activities or no PU model at all [13]. To model the PU activity pattern, the transition time between the ON and OFF states and vice versa must be taken into account. For instance, consider Fig. 4.

Fig. 4. Two state Markov chain transition of the PU transmitter.

The activities of the PU duty cycles can be modelled as a Markov chain with two states corresponding to ON and OFF. The Markov chain models the state of a system with a random variable that changes through time. Let the ON period be represented by $\alpha$ and the OFF period be represented by $\beta$. From Fig. 4, the ON states can be taken as the time instances when the PU is ON and when the PU is transitioning from the OFF to the ON state. Similarly, the OFF states can be taken as the time instances when the PU is OFF and when the PU is transitioning from ON to OFF. Therefore, the probabilities of the steady ON and OFF states can be represented by $\rho_{ON} = \frac{\alpha}{\alpha+\beta}$ and $\rho_{OFF} = \frac{\beta}{\alpha+\beta}$. To comprehend this further, consider Fig. 5.

Fig. 5. PU transition state within frame window T.

In Fig. 5, the frame in which a vehicle can sense and transmit on the PU channel is taken to be $T$, and there are two intervals in which the PU transmission is not static (i.e., $T_2$ and $T_4$). For example, the PU is considered to be ON for the interval $T_1$, part of $T_2$ and part of $T_4$. The total time for PU ON is therefore $T_1 + T_{2-ON} + T_{4-ON}$. On the other hand, the PU is considered to be OFF for the interval $T_3$ and parts of $T_2$ and $T_4$, giving a total of $T_{2-OFF} + T_3 + T_{4-OFF}$. Note that in Fig. 5 the transition state intervals (from Fig. 4) are demonstrated by part of $T_2$ for PU ON→OFF and part of $T_4$ for PU OFF→ON. The probability of the PU remaining in the stable ON or OFF period for $M$ sensing samples is exponentially distributed [10], given by $\exp\left(-\frac{M}{\alpha}\right)$ and $\exp\left(-\frac{M}{\beta}\right)$ respectively. Thus, the probability of the PU being ON for the sensing window is given by:

$$P_{ON} = \rho_{ON}\,\exp\left(-\frac{M}{\alpha}\right) \tag{14}$$

Similarly, the probability of the PU being OFF for the sensing window is given by:

$$P_{OFF} = \rho_{OFF}\,\exp\left(-\frac{M}{\beta}\right) \tag{15}$$

To account for transition states switching from ON to OFF and vice versa, a probability of switch state is formulated. Remember from Fig. 5 that a PU can switch from ON to OFF and vice versa within a sensing cycle. Therefore, the probability of transition ( P T ) is given by:

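$$P_T = 1 - P_{ON} - P_{OFF} = 1 - (P_{ON} + P_{OFF}) = 1 - \left[\frac{\alpha}{\alpha+\beta}\exp\left(-\frac{M}{\alpha}\right) + \frac{\beta}{\alpha+\beta}\exp\left(-\frac{M}{\beta}\right)\right] \tag{16}$$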

From Eq. (16) a plot of P T can be obtained using different sensing samples and various values of α and β . Fig. 6 shows the reliance of PU state transition for different M sensing samples given divergent PU activity patterns. The streams in the figure correspond to the values of α and β . The values used for the streams are as follows; Stream 1 (α = 40, β = 100), Stream 2 (α = 60, β = 150), Stream 3 (α = 100, β = 200), Stream 4 (α = 200, β = 10), Stream 5 (α = 120, β = 100) and Stream 6 (α = 300, β = 100). Streams in this case demonstrate random PU activity pattern. Therefore, the values are picked arbitrarily (i.e. assuming the PU activity pattern can take any of the streams given).
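The values behind such a plot can be tabulated directly from Eq. (16). The following is a minimal sketch using the six stream settings listed above; the function names and the chosen sample sizes are illustrative assumptions.

```python
from math import exp

# (alpha, beta) pairs for the six streams listed in the text.
STREAMS = {
    "Stream 1": (40, 100),
    "Stream 2": (60, 150),
    "Stream 3": (100, 200),
    "Stream 4": (200, 10),
    "Stream 5": (120, 100),
    "Stream 6": (300, 100),
}


def transition_probability(m, alpha, beta):
    """P_T from Eq. (16) for m sensing samples."""
    rho_on = alpha / (alpha + beta)
    rho_off = beta / (alpha + beta)
    p_on = rho_on * exp(-m / alpha)    # Eq. (14)
    p_off = rho_off * exp(-m / beta)   # Eq. (15)
    return 1.0 - (p_on + p_off)


def tabulate(sample_sizes=(50, 100, 250)):
    """Return {stream name: [P_T for each sample size]}."""
    return {
        name: [transition_probability(m, a, b) for m in sample_sizes]
        for name, (a, b) in STREAMS.items()
    }


if __name__ == "__main__":
    for name, values in tabulate().items():
        print(name, [round(v, 3) for v in values])
```

Evaluating the function over a range of $M$ for each stream gives the curves of $P_T$ against the number of sensing samples.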

Fig. 6. $P_T$ for different sensing samples, $M$.

Fig. 6 demonstrates that the probability of a PU state transition increases with the number of sensing samples. For instance, the chance of a PU state transition for 250 sensing samples is above 60% for all streams. This is because an increase in the number of sensing samples increases the sensing time, thereby increasing the chance for the PU to switch states. On the other hand, the values of $P_T$ decrease as the PU activity periods become longer. With increased duty cycles (i.e., increased ON or OFF periods), the chance of the PU changing state becomes minimal. For example, the values of $P_T$ for Stream 6 are relatively low compared to the other streams. This is explained by its long PU activity pattern ($\alpha = 300$, $\beta = 100$) compared to, say, Stream 1 with a short activity pattern ($\alpha = 40$, $\beta = 100$). The implication of Fig. 6 is that PU activity patterns can influence the sensing results: $P_T$ increases with the sample size and decreases with the length of the PU activity periods. Consequently, short sensing intervals are desired to account for transitional states, and few samples should be collected when spectrum sensing is conducted to accommodate dynamic state transitions. Short sensing intervals are especially important in VANET because of the mobility of vehicles. The trade-off is therefore between performing quick sensing and increasing the accuracy of the sensing results.

To account for the transition state and the sensing interval, RL is used in this work to predict the licensed channels of the PU systems that are likely to be free at a particular time. Simply put, the RSU should predict channels of the PU systems that exhibit long transitional states with long OFF periods. These channels are then communicated to vehicles upon entry to the congested road segment for sensing. The RSU is also responsible for updating the learned PU activity model using the results obtained from vehicles after sensing. The history of sensing results plays an important role, similar to a cooperative spectrum sensing decision. In the following section, we present the PU activity model based on RL.

4.2.1. Modeling the PU state activity pattern with MDP

In this work, the Bellman equations [38] are used to model the MDP for the RL problem, while temporal difference (TD) learning with lambda (TD($\lambda$)) is used to solve the problem formulation. A policy $\pi$ is a distribution over actions for a given state, $\pi(a|s) = P[A_t = a \mid S_t = s]$; it denotes the probability of action $A_t$ being selected in state $S_t$. The probability of a transition from state $s$ to another state $s'$ is given by $\rho_{ss'} = P[S_{t+1} = s' \mid S_t = s]$. A transition in VANET refers to the PU changing state from ON to OFF and vice versa (see Fig. 5). The state probability matrix $P$, which defines the transition probabilities from all states $s$ to all possible next states $s'$, is given by:

$$P = \begin{bmatrix} \rho_{ON \to ON} & \rho_{ON \to OFF} \\ \rho_{OFF \to ON} & \rho_{OFF \to OFF} \end{bmatrix} \tag{17}$$

where each row of the matrix adds up to one. The value function $v_\pi(s)$ of the MDP is the expected return when following the policy $\pi$ from state $s$, given by

$$v_\pi(s) = E_\pi[G_t \mid S_t = s] \tag{18}$$

where $G_t$ is the return, i.e., the total discounted reward from time step $t$, given by $G_t = \sum_{d=0}^{\infty} \gamma^d R_{t+d+1}$ [39], with $\gamma$ here denoting the discount factor. Similarly, the state-action function $q_\pi(s, a)$ is the expected return from state $s$ when taking action $a$ and following the policy $\pi$. Thus, the action function is given by

$$q_\pi(s, a) = E_\pi[G_t \mid S_t = s, A_t = a] \tag{19}$$

Furthermore, the value and action functions (Eq. (18) and Eq. (19)) can be decomposed into the immediate reward and the discounted value of the successor state for the return function $G_t$ as $v_\pi(s) = E_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$ and $q_\pi(s, a) = E_\pi[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$. The Bellman expectation equations for the two functions are given by

$$v_\pi(s) = \sum_{a \in A} \pi(a|s)\, q_\pi(s, a) \tag{20}$$

and

$$q_\pi(s, a) = R_s^a + \gamma \sum_{s' \in S} \rho_{ss'}\, v_\pi(s') \tag{21}$$

The value function in Eq. (20) contains the state-action function; therefore, $q_\pi(s, a)$ in Eq. (20) can be replaced by Eq. (21) to give

$$v_\pi(s) = \sum_{a \in A} \pi(a|s)\left(R_s^a + \gamma \sum_{s' \in S} \rho_{ss'}\, v_\pi(s')\right) \tag{22}$$

To find the best possible performance for the MDP, the optimal value function is used. The optimal value function, which maximizes $v_\pi(s)$ over all policies, is given by

$$v_*(s) = \max_\pi v_\pi(s) \tag{23}$$

TD($\lambda$) is used to find the optimal value function. The update equations used by TD($\lambda$) to estimate the value function are given below:

$$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \tag{24}$$

$$E_t(S) = \gamma \lambda E_{t-1}(S) + \mathbf{1}(S_t = s) \tag{25}$$

$$V(S_{t+1}) \leftarrow V(S_t) + \theta_t\, \delta_{t+1}\, E_{t+1}(S) \tag{26}$$

Where E t ( S ) is the eligibility trace, δt is the TD-error that defines the difference in values of the state that correspond to successive time steps. θt is the step size sequence. The algorithm for finding the optimal v ∗ (s) using TD (λ) is given below:
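As an illustration only, and not the authors' listing, the update rules in Eqs. (24)-(26) can be sketched as a tabular TD($\lambda$) routine with accumulating eligibility traces, written in the common textbook form where every state is updated in proportion to its trace. The two-state ON/OFF chain, the reward of 1 for a transition into OFF, and the function name are hypothetical; the paper's reward instead combines sensing result, sensing time and channel capacity.

```python
def td_lambda(episodes, states, theta=0.3, gamma=0.3, lam=0.3):
    """Tabular TD(lambda) with accumulating eligibility traces.

    episodes: iterable of episodes, each a list of
              (state, reward, next_state) transitions.
    theta: step size, gamma: discount factor, lam: trace decay.
    """
    value = {s: 0.0 for s in states}
    for episode in episodes:
        trace = {s: 0.0 for s in states}  # traces restart with each episode
        for state, reward, next_state in episode:
            # TD error, as in Eq. (24)
            delta = reward + gamma * value[next_state] - value[state]
            # Decay all traces and bump the visited state, as in Eq. (25)
            for s in states:
                trace[s] *= gamma * lam
            trace[state] += 1.0
            # Move every state's value along its trace, as in Eq. (26)
            for s in states:
                value[s] += theta * delta * trace[s]
    return value


if __name__ == "__main__":
    # Hypothetical observation sequence of PU states for one episode.
    sequence = ["ON", "OFF", "OFF", "ON", "OFF", "OFF", "OFF", "ON"]
    # Assumed reward: 1 when the channel is found free (next state OFF).
    episode = [
        (s, 1.0 if nxt == "OFF" else 0.0, nxt)
        for s, nxt in zip(sequence, sequence[1:])
    ]
    print(td_lambda([episode] * 50, states=("ON", "OFF")))
```

In this sketch a channel that is frequently observed OFF accumulates a higher value, which mirrors the idea of the RSU favouring channels with long OFF periods.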

Table 1. Simulation parameters for SUMO and Ns-3.

| Parameter (SUMO) | Value | Parameter (Ns-3) | Value |
|---|---|---|---|
| Map dimension | 13 km × 15 km | Simulation time | 400 seconds |
| Road length | 12 km | MAC/PHY | WAVE/IEEE802.11p |
| Average segment length | 1 km | DSRC channel bandwidth | 10 MHz |
| RSU placement | 500 m (segment center) | DSRC channel frequency | 5.9 GHz |
| Max. number of vehicles | 500 | WAVE TX power | 13 dBm |
| Vehicle length | 5 m, 10 m | TV modulation | 8-VSB |
| Vehicle speed | 0-30 m/s | TV transmission power | 22.2 dBm/Hz |
| Vehicle acceleration | 2.5 | TV channel bandwidth | 6 MHz |
| Vehicle deceleration | 4.5 | TV channel frequencies | 500-524 MHz |
| Sigma (driver imperfection) | 0.5 | M (Sensing Samples) | >250 |

5. Simulation and results

Simulations were used to verify the sensing scheme proposed in this paper. Implementing vehicular communication in real-world testbeds involves high costs and is not scalable in most circumstances, as noted by [51] and [52]. Therefore, simulations are used in the absence of a real testbed for VANETs [53,54]. Network Simulator 3 (Ns-3) was used in this work, while the mobility patterns of vehicles were generated using SUMO (Simulation of Urban MObility). An advantage of Ns-3 is that it decouples network functionalities at different levels. Thus, a single node can be installed with one or more physical devices communicating over different media (wireless or wired). For communication over DSRC channels, we used the WAVE module in Ns-3, which defines the IEEE 802.11p standard for PHY and MAC as well as the IEEE 1609.x protocol stack. To realize cognitive capabilities, we utilized the Spectrum module defined in Ns-3 [55]. The SpectrumAnalyzer device modeled at the physical layer is used to measure and record the power spectral density (PSD) (i.e., the energy statistic $T_e$) of the PU signal. Additional modification was done to account for OOC in adaptive sensing. In addition, the Spectrum module defines a TV transmitter model, which is used as the PU signal generator in our work. The TV transmitter model provides the PSD that is transmitted on the SpectrumChannel and detected by the SpectrumAnalyzer device. The TV transmitter operation is considered to be random ON/OFF. The rest of the parameters used in the simulation are given in Table 1.

5.1. Adaptive spectrum sensing

Adaptive spectrum sensing, presented in Section 3, is based on two detection thresholds of the energy detector. One threshold is set to achieve the desired constant false alarm probability rate (CFAR) and the other is set to achieve the desired constant probability of detection (CPD). The lower threshold for the target CFAR (i.e., 0.1) is obtained using Equation (7) and the upper threshold for the target CPD (i.e., 0.9) is obtained from Equation (6).
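A minimal sketch of the two thresholds and the resulting first-stage decision is given here, assuming the CLT forms of Equations (6) and (7) are simply solved for the threshold. The function names, the unit noise variance, the signal variance of 0.5 and the sample count of 250 are illustrative assumptions rather than the simulation settings.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: zero mean, unit variance


def q_inverse(p):
    """Inverse Gaussian Q-function: the x for which Q(x) = p."""
    return _STD_NORMAL.inv_cdf(1.0 - p)


def lower_threshold(noise_var, m, target_pf=0.1):
    """Threshold for a target false alarm probability (Equation (7) solved for lambda)."""
    return noise_var * (q_inverse(target_pf) * sqrt(2 * m) + m)


def upper_threshold(noise_var, signal_var, m, target_pd=0.9):
    """Threshold for a target detection probability (Equation (6) solved for lambda)."""
    total = noise_var + signal_var
    return total * (q_inverse(target_pd) * sqrt(2 * m) + m)


def first_stage_decision(t_e, lam1, lam2):
    """Return 'H0', 'H1', or 'OOC' when the second stage is needed."""
    if lam1 >= lam2:
        raise ValueError("expected lower threshold < upper threshold")
    if t_e < lam1:
        return "H0"   # channel declared free
    if t_e >= lam2:
        return "H1"   # PU declared present
    return "OOC"      # undecided: hand over to the OOC detector


if __name__ == "__main__":
    lam1 = lower_threshold(noise_var=1.0, m=250)
    lam2 = upper_threshold(noise_var=1.0, signal_var=0.5, m=250)
    for t_e in (200.0, 300.0, 400.0):
        print(t_e, first_stage_decision(t_e, lam1, lam2))
```

Under this approximation the ordering of the two thresholds depends on the SNR: for sufficiently weak signals the value obtained from Equation (6) can fall below the one from Equation (7), which the sketch reports as an error instead of silently swapping them.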
If the energy test statistic obtained from spectrum sensing is less than the lower threshold, the energy detector decides that the PU signal is absent. On the other hand, if the energy test statistic is above or equal to the upper threshold, the energy detector decides that the PU signal is present. However, if the test statistic falls between the lower and upper thresholds, the energy detector makes no decision; instead, further sensing is performed using feature detection in the time domain. The performance of adaptive sensing is compared to the energy detector for fading and non-fading environments.

Fig. 7 presents the comparison between adaptive sensing and the energy detector in Rayleigh fading and non-fading AWGN. The performance of adaptive sensing in non-fading AWGN is similar to that of the energy detector, as shown in Fig. 7. A probability of false alarm of 0.2 is required to reach a probability of detection of about 0.9 for both the energy detector and adaptive sensing in non-fading AWGN. This is because adaptive sensing relies mainly on the energy detector at high SNR to reduce sensing time. However, in a fading environment with low SNR, the performance of the energy detector degrades, as shown in Fig. 7. The adaptive sensing nevertheless maintains a better probability of detection than the energy detector. This is because in a fading environment with low SNR, adaptive sensing uses the feature detector (OOC). OOC extracts features by sampling the PU signal in the time domain, as discussed in Section 3. Thus, the adaptive sensing scheme could maintain a probability of detection of 0.8 (i.e., 80%) compared to 0.7 (i.e., 70%) for the energy detector at a probability of false alarm of 0.2. Nevertheless, the performance is still not optimal because the feature detector requires a long sensing period with increased sensing samples.

Fig. 7. Complementary ROC curve for adaptive sensing vs energy detector in Rayleigh fading and non-fading environment.

In Fig. 8, the probability of missed detection of adaptive sensing is compared with that of the energy detector for fading and non-fading environments. Just as observed in Fig. 7, the adaptive sensing performance is better than the energy detector in the fading environment but the same in the non-fading AWGN environment. Fig. 8 demonstrates that the probabilities of missed detection for adaptive sensing and the energy detector are similar in the non-fading AWGN environment. For instance, both approaches require $P_f \approx 0.08$ for $P_{md} = 0.1$ (i.e., $10^{-1}$). However, in the fading environment the performance of adaptive sensing is better than that of the energy detector. A probability of missed detection of 0.1 requires a probability of false alarm of about 0.2 for adaptive sensing, compared to about 0.6 for the energy detector. A higher probability of false alarm limits radio spectrum reuse, while a high probability of missed detection causes interference to PU systems.

Fig. 8. Complementary ROC for probabilities of missed detection and false alarm for adaptive sensing vs. energy detector.

Fig. 9 demonstrates the mean sensing time overhead of the proposed adaptive sensing in comparison to the energy detector in fading and non-fading environments. It shows the mean detection time against the probability of detecting the PU channel for adaptive sensing and the energy detector. For non-fading AWGN, the adaptive sensing has the same mean time as the energy detector in most cases. However, the mean detection time in the Rayleigh fading environment is higher for adaptive sensing than for the energy detector. For example, reaching a 0.9 probability of detection requires on average 11 ms for adaptive sensing compared to 2 ms for the energy detector. As stated before, in a fading environment with low SNR, the adaptive sensing uses OOC, which performs sensing in the time domain to detect the PU signal. OOC, like any feature detector, requires a long sensing time to extract the features of the signal before deciding on the PU occupancy state. Hence, adaptive sensing takes more than twice the time to reach a probability of detection as low as 0.5, as observed from Fig. 9. This is due to combining the time for the energy detector in the first stage and OOC in the second stage of adaptive sensing. On the other hand, the energy detector maintains the same mean time for both fading and non-fading environments because it only measures the power or SNR in the channel being considered and compares it to the threshold. No extra time is needed for the energy detector to check the features of the signal being sensed, as opposed to the feature detector.

Fig. 9. Comparative mean detection time for adaptive sensing and energy detector in fading and non-fading environment.

5.2. Analysis of speed of vehicle on spectrum sensing

Fig. 10. The effect of speed of vehicles on probability of detection.

The speed of a vehicle plays an important role in spectrum sensing in the VANET environment. Fig. 10 shows the performance of adaptive sensing compared to the energy detector for various vehicle speeds. As the speed of the vehicle increases, the detection performance decreases, as observed in Fig. 10. This is mostly because the vehicle is likely to miss free channels with increased speed in the road segment of interest. When the vehicle is moving at high speed, there is a high possibility of moving from an area with free licensed channels to areas without free channels. The effect of speed is more severe in fading environments than in non-fading environments, as observed in Fig. 10. For instance, a probability of detection of about 0.85 is achieved with a vehicle moving at 30 m/s for both adaptive sensing and the energy detector in non-fading AWGN. However, in the fading environment the performance of adaptive sensing is lower than that of the energy detector with increased speed. In the fading environment, adaptive sensing relies on OOC, which requires more sensing time, as observed in Fig. 9. With increased sensing time, the vehicle is likely to move into other areas where the characteristics of the sensed channels might be different. Thus, adaptive sensing is best suited to vehicles moving at low speed, especially in congested areas.

Although vehicle speed affects the detection performance and increases missed spectrum opportunities, a cooperative decision approach mitigates the effects. Cooperating vehicles use spatial and diversity gain to improve the sensing decision. Furthermore, vehicles move at relatively slow speed in congested areas, varying from 0-10 km/h (≈2.8 m/s) in highly dense traffic jams to 40 km/h (≈11 m/s) in medium traffic [40]. Thus, adaptive sensing can be used for both fading and non-fading environments in congested areas. Based on Fig. 10, the probability of detection for vehicles moving at 2.8 m/s and 11 m/s is 0.98 and 0.88 respectively, which is acceptable. Nevertheless, the performance of adaptive sensing can be enhanced by considering the PU activities. Vehicles on the road can concentrate on sensing channels that are likely to be free based on the activities of the PU system. This is achieved through reinforcement learning at the RSU. The evaluation of adaptive sensing and the influence of PU activity on spectrum sensing is presented in the next section.

5.3. Evaluation of spectrum sensing with reinforcement learning

The previous section demonstrates that adaptive sensing performs better than the energy detector in the fading environment, as observed from Fig. 8 and 9. However, the higher performance comes at the expense of increased sensing time. Specifically, Fig. 9 shows the sensing time versus the probability of detection, in which adaptive sensing has a higher mean detection time in the fading environment. In addition, adaptive sensing performs poorly when the vehicle is moving at high speed in the fading environment, as observed in Fig. 10. Therefore, reinforcement learning (RL) is used at the RSU to increase the performance of adaptive sensing. RL is used at the RSU to model the PU traffic pattern using sensing results sent from vehicles on the road. In particular, when a vehicle senses and identifies a free PU channel, it transmits data on it. After transmitting data, the vehicle sends a scalar value as reward to the RSU.

Fig. 11 demonstrates the cumulative rewards for different PU channels over the episodes. An episode denotes the time between the start and the end of congestion on the road. Fig. 11 shows the channel availability based on the cumulative reward. The channels with high bandwidth and availability time show a higher cumulative reward. For instance, Channels 1 and 3 have a low cumulative reward because they are used by the PU most of the time. Alternatively, Channels 2 and 4 have a higher cumulative reward with time because they are not constantly being used by the PU. The implication of a high reward is that the RSU will recommend these channels to vehicles for use during congestion. The reward is calculated based on Equation (1), in which the parameters used include the sensing time plus the transmission time, if any, the channel availability, denoted by either 0 or 1 depending on the result of the sensing, and the channel bandwidth. The weight contributions $w_b$ and $w_t$ from Equation (1) are 0.4 and 0.6 respectively.

Fig. 11. Cumulative rewards for PU channels over the episodes.

5.4. Performance comparison of RL based sensing to other schemes

In this section, the performance of RL based sensing is compared to other sensing schemes proposed in the literature. In particular, infrastructure based sensing schemes that use the history of sensing data to improve sensing results are considered. The first scheme is proposed by Huang et al. (2016) in [56]. They present a spectrum sensing scheme that uses historical spectrum data mining to predict spectrum availability on the next road segment. The scheme is based on a Bayesian model and a hidden Markov model (HMM) to exploit spatial and temporal correlation. The algorithm for that scheme is presented in Tables 1 and 2 of [56], while the implementation procedure is given in Table 3 of [56]. Another history based scheme is proposed by Abbassi et al. (2015) in [19]. In that scheme, the vehicles on the road utilize the history of spatial-temporal diversity and frequency patterns controlled by the RSU through a database. The RSU coordinates which PU channels are sent to passing vehicles based on the time of day. In addition, the RSU decides on the PU occupancy state from historical data using hard fusion (majority rule). The flowchart of that model is outlined in Figures 5, 6 and 7 of [19]. Another scheme based on infrastructure support, which uses a hard fusion rule at the RSU to cooperatively decide on spectrum sensing results, is proposed by Duan et al. (2013) in [21]. The RSU uses the OR, AND or majority rule to determine the final global sensing result [21]. The above schemes are considered because, with the exception of [21], they use the history of sensing results to improve the accuracy of spectrum sensing, which is also the basis of this work. The cooperative sensing proposed by [21] is based on hard fusion without considering the history of sensing results. Both [21] and [19] consider TV bands as PU channels, which are also considered in this work.
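For reference, the OR, AND and majority hard fusion rules mentioned above can be sketched as follows. This is a generic illustration of the three rules, not the implementation of [21] or [19]; the function name and the tie handling for the majority rule (a tie is treated as PU absent) are assumptions.

```python
def hard_fusion(decisions, rule="majority"):
    """Fuse local binary sensing decisions (1 = PU present, 0 = PU absent).

    Returns True when the fused decision is that the PU is present.
    """
    votes = [bool(d) for d in decisions]
    if not votes:
        raise ValueError("at least one local decision is required")
    if rule == "or":
        return any(votes)                   # one detection is enough
    if rule == "and":
        return all(votes)                   # every vehicle must detect
    if rule == "majority":
        return sum(votes) > len(votes) / 2  # strictly more than half
    raise ValueError(f"unknown fusion rule: {rule}")


if __name__ == "__main__":
    local = [1, 0, 0, 1, 1]  # hypothetical decisions from five vehicles
    for rule in ("or", "and", "majority"):
        print(rule, hard_fusion(local, rule))
```

The OR rule is the most protective of the PU, while the AND rule favours spectrum reuse; the majority rule sits between the two.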
Nevertheless, the approach in this work uses RL to learn the behaviour of the PU channels, which differs from the history-based schemes in the literature. Unlike the other schemes, RL is used to learn and model PU traffic patterns by assigning high reward to channels with long OFF periods and high bandwidth. The hard fusion rule is included in the comparison because it is one of the most widely used cooperative decision schemes and is simple to implement.

5.4.1. Detection performance under ROC curve

The performance metrics used to compare the RL-based scheme with the other schemes are the probabilities of detection, false alarm and missed detection, based on ROC curves. Other metrics are mean detection time and sensing overhead against throughput. For RL, the step size (θ) is taken to be 0.3, the discount factor is taken to be 0.3 and the return value (λ) is 0.3. Fig. 12 shows the detection performance as a complementary ROC curve of the probabilities of detection and false alarm, comparing the simulation results of the proposed scheme with those of the other schemes in a fading environment. The proposed scheme is labelled Proposed-Adaptive-RL. The result for adaptive sensing without reinforcement learning is also included. As observed from Fig. 12, the energy detection technique performs worst in the fading environment, followed by the hard fusion rule [21]. Adaptive sensing performs better than the energy detector and the hard fusion rule. However, its performance is below that of the schemes that use history to enhance sensing results. Schemes using history account for the PU traffic pattern, which is sent to vehicles on the road before sensing is performed. Regardless, the proposed scheme using RL performs better than the other history-based schemes. The approach in [56] uses HMM to predict spectrum availability on the next road segment, not the current segment. The drawback of this approach is that by the time the vehicle enters the next road segment, the PU might have become active there. On the other hand, [19] relies on hard fusion to decide which channels to send to vehicles, using history data stored in the database. This adds sensing time overhead, because the RSU must decide which channels to send to vehicles before sensing is performed. The RSU in the proposed approach uses cumulative reward to assign the PU channels with higher reward in the current road segment to vehicles on the road. From Fig. 12, the proposed adaptive sensing with RL achieves a probability of detection of 0.9 at a probability of false alarm of 0.1, whereas [56] and [19] require a probability of false alarm of 0.2 and about 0.3, respectively, to reach the same probability of detection.

Fig. 12. Complementary ROC curve of proposed scheme compared to other sensing approaches in fading environment.

Fig. 13. Performance of proposed RL based scheme compared to other approaches on the effect of speed on probability of detection.

Fig. 14. Comparison of mean detection time versus probability of detection of various sensing schemes.

5.4.2. Detection performance based on speed of the vehicle

The speed of vehicles has an impact on the probability of detection, as noted in Fig. 10. Fig. 13 shows the performance of the RL-based scheme in terms of probability of detection against the speed of the sensing vehicle, in comparison with the other schemes in a fading environment.
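As background for the comparison that follows, the reward-driven channel selection performed at the RSU can be sketched as below. This is a minimal sketch and not the exact update rule of the proposed scheme: the `scalar_reward` formula, the `RsuChannelTable` class, the channel labels and the TD(0)-style update are all assumptions made for illustration. Only the step size and discount factor of 0.3 are taken from Section 5.4.1, and the return value λ is not modelled.

```python
from dataclasses import dataclass, field
from typing import Dict, List

STEP_SIZE = 0.3  # step size quoted in Section 5.4.1
DISCOUNT = 0.3   # discount factor quoted in Section 5.4.1


def scalar_reward(channel_free: bool, sensing_time_ms: float,
                  capacity_mbps: float, frame_ms: float = 100.0) -> float:
    """Hypothetical scalar reward (an assumption, not the paper's formula)."""
    if not channel_free:
        return -1.0  # assumed fixed penalty when the PU is found active
    # Capacity weighted by the share of the frame left after sensing.
    return capacity_mbps * (frame_ms - sensing_time_ms) / frame_ms


@dataclass
class RsuChannelTable:
    """Per-channel value estimates kept at the RSU (illustrative only)."""
    values: Dict[str, float] = field(default_factory=dict)

    def update(self, channel: str, reward: float) -> float:
        old = self.values.get(channel, 0.0)
        # Assumed TD(0)-style update bootstrapping from the channel's own estimate.
        target = reward + DISCOUNT * old
        self.values[channel] = old + STEP_SIZE * (target - old)
        return self.values[channel]

    def top_channels(self, k: int) -> List[str]:
        """Return the k channels with the highest value estimates."""
        return sorted(self.values, key=self.values.get, reverse=True)[:k]


table = RsuChannelTable()
for i in range(10):
    # "ch21" is always reported free; "ch22" is busy on every second report.
    table.update("ch21", scalar_reward(True, 3.0, 6.0))
    table.update("ch22", scalar_reward(i % 2 == 0, 3.0, 6.0))
print(table.top_channels(1))
```

In this toy run the channel that is always reported free accumulates the higher value, so it is the one the RSU would send to passing vehicles first.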

The performance of adaptive sensing without RL is poor in the fading environment because the OOC detector needs more sensing samples for accurate detection. Thus, based on Fig. 13, adaptive sensing without RL performs below the other schemes. On the other hand, adaptive sensing with RL has the best performance of the schemes under consideration. This is because with RL the channels to sense are sent to the vehicles, whereas with adaptive sensing alone the vehicles have to identify channels on their own. Therefore, few sensing samples are required to confirm the presence or absence of the PU signal when using RL. In addition, the channels sent to vehicles are those with high rewards, which are likely to be free at that particular time and place. Similarly, the approach in [56] performs well because of its HMM-based channel prediction. However, [56] considers channels in the next road segment and gives no preference to particular channels, whereas RL gives preference to channels with long OFF times. The performance of the hard fusion rule proposed by [21] degrades as vehicle speed increases. With increased speed, fewer samples are collected from passing vehicles for fusion at the RSU. Hard fusion relies on a large number of vehicles to provide accurate sensing results. However, in a sparse environment there are few vehicles, and those vehicles move at high speed.

5.4.3. Comparison of detection time

Spectrum sensing detection time is another important metric in cognitive radio networks. Fig. 14 illustrates the mean detection time against probability of detection in the fading environment. As shown in Fig. 14, the conventional energy detector has the lowest mean detection time of all the schemes, 2.11 ms on average. The second best scheme in terms of average detection time is the proposed adaptive sensing with RL at about 3.18 ms, followed by [56] at about 3.25 ms.
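The mean detection times quoted above can be related to the time left for transmission in a frame. The sketch below assumes the frame length T = 100 ms used in Section 5.4.4 and treats the mean detection time as the sensing time spent in each frame. It computes only the fraction of the frame available for transmission and does not reproduce Equation (10). The function name `transmission_fraction` is hypothetical.

```python
FRAME_MS = 100.0  # frame length T, taken from Section 5.4.4

# Mean detection times (ms) quoted in Section 5.4.3.
MEAN_DETECTION_MS = {
    "energy detector": 2.11,
    "adaptive sensing with RL": 3.18,
    "[56]": 3.25,
}


def transmission_fraction(sensing_ms: float, frame_ms: float = FRAME_MS) -> float:
    """Share of a frame left for transmission after sensing (assumed model)."""
    if not 0.0 <= sensing_ms <= frame_ms:
        raise ValueError("sensing time must lie within the frame")
    return (frame_ms - sensing_ms) / frame_ms


for scheme, sensing_ms in MEAN_DETECTION_MS.items():
    print(f"{scheme}: {transmission_fraction(sensing_ms):.4f}")
```

Under this simplified model a shorter detection time always leaves more of the frame for transmission. The model ignores detection accuracy, which is why the energy detector can have the shortest detection time without being the preferred scheme.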
Overall, [19] has the longest average detection time, 7.48 ms, followed by the proposed adaptive sensing without RL at about 6.4 ms. [19] and [21] require more sensing time because they first collect sensing results from cooperating vehicles before the global decision on the PU occupancy state is made. In addition, [19] uses the history of sensing results before sending to cooperating vehicles for a global decision to be made using the majority rule, thereby increasing sensing time. The main contributing factor to the long mean detection time for adaptive sensing without RL is that in a fading environment more samples are needed to accurately quantify the PU signal. In addition, the PU signal characteristics are not known in advance, hence more time is needed to extract the features of the signal using OOC. Adaptive sensing with RL, on the other hand, senses channels sent by the RSU with known PU signal characteristics; hence few sensing samples are needed and no extra time is required to search for unknown channels.

5.4.4. Sensing time on channel throughput

Network throughput is another performance metric that is important to spectrum sensing. The channel throughput of the secondary network is measured using the transmission time on the PU network, with the parameters given in Equation (10). A frame size of T = 100 ms is used. In Fig. 15, the average throughput of adaptive sensing with RL is compared with that of adaptive sensing without RL in the fading environment. Spectrum sensing time affects the throughput of the secondary network. Long sensing periods provide accurate sensing results. However, this comes at the cost of throughput: the more time is spent on sensing, the less time is left for transmission on the identified channel within a given frame (a frame accounts for both sensing and transmission time). In the fading environment, adaptive sensing without RL requires more sensing samples to measure the features of the PU signal, hence the low throughput observed in Fig. 15. With adaptive sensing using RL, the RSU sends channels together with their PU characteristics to sense; as a result, few sensing samples are required to achieve the desired probability of detection and more time is left for transmission.

Fig. 15. Achievable average throughput of cognitive VANET network in fading environment.

6. Conclusion and future works

In this paper we have shown that reinforcement learning can be used to learn and model the traffic pattern of the PU channels. Channels with long OFF periods get high rewards while channels with short OFF periods get low rewards, as shown in Fig. 11. The learnt PU activity pattern is used to predict PU channels likely to be free in future, and these channels are sent to vehicles. Thus, RL enhances adaptive sensing, which on its own performs poorly at high speed and incurs long sensing periods, as demonstrated in Fig. 10 and 9 respectively. The adaptive sensing proposed in this paper performs better than the conventional energy detector in the fading environment, as shown by the simulation results presented in Fig. 7 and 8. With RL, vehicles only sense channels sent by the RSU that are likely to be free, thereby reducing sensing time, as shown in Fig. 14. In addition, the proposed adaptive sensing with RL performs better than state-of-the-art schemes in the literature, as noted from Fig. 12-14. Therefore, the proposed scheme can be used by vehicles on the road to identify spectrum opportunities when there is congestion in the DSRC channels. Nevertheless, more work is needed to extend this work to distributed vehicular communication (i.e. V2V). In addition, future work will look at spectrum sharing and mobility, which are part of cognitive radio networks, and will study other fading models such as the Rician model to determine their effect on spectrum sensing. It would also be interesting to test the proposed approach in a real testbed to determine its performance in the real world.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This research is supported by the Faculty of Computer Science and Information Technology, University of Malaya under a grant project GPF009D-2018.

References

[1] M. Chaqfeh, A. Lakas, I. Jawhar, A survey on data dissemination in vehicular ad hoc networks, Veh. Commun. 1 (4) (2014) 214–225.
[2] R.F. Atallah, M.J. Khabbaz, C.M. Assi, Vehicular networking: a survey on spectrum access technologies and persisting challenges, Veh. Commun. 2 (3) (2015) 125–149.
[3] S. Zeadally, R. Hunt, Y.-S. Chen, A. Irwin, A. Hassan, Vehicular ad hoc networks (vanets): status, results, and challenges, Telecommun. Syst. 50 (4) (2012) 217–241.
[4] J.B. Kenney, Dedicated short-range communications (dsrc) standards in the united states, Proc. IEEE 99 (7) (2011) 1162–1182.
[5] W. Liang, Z. Li, H. Zhang, S. Wang, R. Bie, Vehicular ad hoc networks: architectures, research issues, methodologies, challenges, and trends, Int. J. Distrib. Sens. Netw. 2015 (2015) 17.
[6] Y.P. Fallah, C. Huang, R. Sengupta, H. Krishnan, Congestion control based on channel occupancy in vehicular broadcast networks, in: Vehicular Technology Conference Fall (VTC 2010-Fall), 2010 IEEE 72nd, IEEE, 2010, pp. 1–5.
[7] D. Das, S. Das, A survey on spectrum occupancy measurement for cognitive radio, Wirel. Pers. Commun. 85 (4) (2015) 2581–2598.
[8] S. Haykin, Cognitive radio: brain-empowered wireless communications, IEEE J. Sel. Areas Commun. 23 (2) (2005) 201–220.
[9] D.B. Rawat, T. Amin, M. Song, The impact of secondary user mobility and primary user activity on spectrum sensing in cognitive vehicular networks, in: Computer Communications Workshops (INFOCOM WKSHPS), 2015 IEEE Conference on, IEEE, 2015, pp. 588–593.
[10] K.D. Singh, P. Rawat, J.-M. Bonnin, Cognitive radio for vehicular ad hoc networks (cr-vanets): approaches and challenges, EURASIP J. Wirel. Commun. Netw. 2014 (1) (2014) 1–22.
[11] I.F. Akyildiz, W.-Y. Lee, M.C. Vuran, S. Mohanty, Next generation/dynamic spectrum access/cognitive radio wireless networks: a survey, Comput. Netw. 50 (13) (2006) 2127–2159.
[12] E. Axell, G. Leus, E.G. Larsson, H.V. Poor, Spectrum sensing for cognitive radio: state-of-the-art and recent advances, IEEE Signal Process. Mag. 29 (3) (2012) 101–116.
[13] C. Chembe, R.M. Noor, I. Ahmedy, M. Oche, D. Kunda, C.H. Liu, Spectrum sensing in cognitive vehicular network: state-of-art, challenges and open issues, Comput. Commun. 97 (2017) 15–30.
[14] Y. Chen, H. Oh, A survey of measurement-based spectrum occupancy modeling for cognitive radios, IEEE Commun. Surv. Tutor. 18 (1) (2016) 848–859, https://doi.org/10.1109/COMST.2014.2364316.
[15] Y. Saleem, M.H. Rehmani, Primary radio user activity models for cognitive radio networks: a survey, J. Netw. Comput. Appl. 43 (2014) 1–16.
[16] V. Kumar, A. Sharma, S. Debnath, R. Gangopadhyay, Impact of primary user duty cycle in generalized fading channels on spectrum sensing in cognitive radio, Proc. Comput. Sci. 46 (2015) 1196–1202.
[17] T. Yucek, H. Arslan, A survey of spectrum sensing algorithms for cognitive radio applications, IEEE Commun. Surv. Tutor. 11 (1) (2009) 116–130.



[18] X. Qian, L. Hao, On the performance of spectrum sensing in cognitive vehicular networks, in: Personal, Indoor, and Mobile Radio Communications (PIMRC), 2015 IEEE 26th Annual International Symposium on, IEEE, 2015, pp. 1002–1006.
[19] S.H. Abbassi, I.M. Qureshi, H. Abbasi, B.R. Alyaie, History-based spectrum sensing in cr-vanets, EURASIP J. Wirel. Commun. Netw. 2015 (1) (2015) 1–12.
[20] D. Borota, G. Ivkovic, R. Vuyyuru, O. Altintas, I. Seskar, P. Spasojevic, On the delay to reliably detect channel availability in cooperative vehicular environments, in: Vehicular Technology Conference (VTC Spring), 2011 IEEE 73rd, IEEE, 2011, pp. 1–5.
[21] J.-Q. Duan, S. Li, G. Ning, Compressive spectrum sensing in centralized vehicular cognitive radio networks, Int. J. Future Gener. Commun. Netw. 6 (3) (2013) 1–12.
[22] X. Xu, J. Bao, Y. Luo, H. Wang, Cooperative wideband spectrum detection based on maximum likelihood ratio for cr enhanced vanet, J. Commun. 8 (12) (2013) 814–821.
[23] X. Qian, L. Hao, Spectrum sensing with energy detection in cognitive vehicular ad hoc networks, in: Wireless Vehicular Communications (WiVeC), 2014 IEEE 6th International Symposium on, IEEE, 2014, pp. 1–5.
[24] S.A. Alvi, M.S. Younis, M. Imran, et al., A weighted linear combining scheme for cooperative spectrum sensing, Proc. Comput. Sci. 32 (2014) 149–157.
[25] K. Baraka, L. Safatly, H. Artail, A. Ghandour, A. El-Hajj, An infrastructure-aided cooperative spectrum sensing scheme for vehicular ad hoc networks, Ad Hoc Netw. 25 (2015) 197–212.
[26] A. Paul, A. Daniel, A. Ahmad, S. Rho, Cooperative cognitive intelligence for internet of vehicles, IEEE Syst. J.
[27] I. Souid, H.B. Chikha, R. Attia, Blind spectrum sensing in cognitive vehicular ad hoc networks over nakagami-m fading channels, in: Electrical Sciences and Technologies in Maghreb (CISTEM), 2014 International Conference on, IEEE, 2014, pp. 1–5.
[28] X.Y. Wang, P.-H. Ho, A novel sensing coordination framework for cr-vanets, IEEE Trans. Veh. Technol. 59 (4) (2010) 1936–1948.
[29] Y. Liu, S. Xie, R. Yu, Y. Zhang, X. Zhang, C. Yuen, Exploiting temporal and spatial diversities for spectrum sensing and access in cognitive vehicular networks, Wirel. Commun. Mob. Comput. 15 (17) (2015) 2079–2094.
[30] Z. Wei, F.R. Yu, A. Boukerche, Cooperative spectrum sensing with trust assistance for cognitive radio vehicular ad hoc networks, in: Proceedings of the 5th ACM Symposium on Development and Analysis of Intelligent Vehicular Networks and Applications, ACM, 2015, pp. 27–33.
[31] A. Raza, S.S. Ahmed, W. Ejaz, H.S. Kim, Cooperative spectrum sensing among mobile nodes in cognitive radio distributed network, in: Frontiers of Information Technology (FIT), 2012 10th International Conference on, IEEE, 2012, pp. 18–23.
[32] H. Li, D.K. Irick, Collaborative spectrum sensing in cognitive radio vehicular ad hoc networks: belief propagation on highway, in: 2010 IEEE 71st Vehicular Technology Conference, 2010, pp. 1–5.
[33] M. Di Felice, K.R. Chowdhury, L. Bononi, Cooperative spectrum management in cognitive vehicular ad hoc networks, in: Vehicular Networking Conference (VNC), 2011 IEEE, IEEE, 2011, pp. 47–54.
[34] M. Bkassiny, Y. Li, S.K. Jayaweera, A survey on machine-learning techniques in cognitive radios, IEEE Commun. Surv. Tutor. 15 (3) (2013) 1136–1159.
[35] J. Oksanen, J. Lundén, V. Koivunen, Reinforcement learning based sensing policy optimization for energy efficient cognitive radio networks, Neurocomputing 80 (2012) 102–110.
[36] I. Mustapha, B.M. Ali, A. Sali, M.F.A. Rasid, H. Mohamad, An energy efficient reinforcement learning based cooperative channel sensing for cognitive radio sensor networks, Pervasive Mob. Comput. 35 (2017) 165–184.

[37] B.F. Lo, I.F. Akyildiz, Reinforcement learning-based cooperative sensing in cognitive radio ad hoc networks, in: Personal Indoor and Mobile Radio Communications (PIMRC), 2010 IEEE 21st International Symposium on, IEEE, 2010, pp. 2244–2249.
[38] R.S. Sutton, A.G. Barto, Reinforcement Learning: An Introduction, vol. 1, MIT Press, Cambridge, 1998.
[39] C. Szepesvári, Algorithms for reinforcement learning, Synth. Lect. Artif. Intell. Mach. Learn. 4 (1) (2010) 1–103.
[40] V. Tyagi, S. Kalyanaraman, R. Krishnapuram, Vehicular traffic density state estimation based on cumulative road acoustics, IEEE Trans. Intell. Transp. Syst. 13 (3) (2012) 1156–1166.
[41] T. Darwish, K.A. Bakar, Traffic density estimation in vehicular ad hoc networks: a review, Ad Hoc Netw. 24 (2015) 337–351.
[42] J.A. Sanguesa, M. Fogue, P. Garrido, F.J. Martinez, J.-C. Cano, C.T. Calafate, P. Manzoni, An infrastructureless approach to estimate vehicular density in urban environments, Sensors 13 (2) (2013) 2399–2418.
[43] D.M. Martínez, Á.G. Andrade, Adaptive energy detector for spectrum sensing in cognitive radio networks, Comput. Electr. Eng. 52 (2016) 226–239.
[44] R. He, A.F. Molisch, F. Tufvesson, Z. Zhong, B. Ai, T. Zhang, Vehicle-to-vehicle propagation models with large vehicle obstructions, IEEE Trans. Intell. Transp. Syst. 15 (5) (2014) 2237–2248.
[45] T. Islam, Y. Hu, E. Onur, B. Boltjes, J. de Jongh, Realistic simulation of ieee 802.11 p channel in mobile vehicle to vehicle communication, in: Microwave Techniques (COMITE), 2013 Conference on, IEEE, 2013, pp. 156–161.
[46] F.F. Digham, M.-S. Alouini, M.K. Simon, On the energy detection of unknown signals over fading channels, IEEE Trans. Commun. 55 (1) (2007) 21–24.
[47] S. Stotas, A. Nallanathan, Overcoming the sensing-throughput tradeoff in cognitive radio networks, in: Communications (ICC), 2010 IEEE International Conference on, IEEE, 2010, pp. 1–5.
[48] D. Bhargavi, C.R. Murthy, Performance comparison of energy, matched-filter and cyclostationarity-based spectrum sensing, in: Signal Processing Advances in Wireless Communications (SPAWC), 2010 IEEE Eleventh International Workshop on, IEEE, 2010, pp. 1–5.
[49] Y. Liu, Z. Zhong, G. Wang, D. Hu, Cyclostationary detection based spectrum sensing for cognitive radio networks, J. Commun. 10 (1) (2015) 74–79.
[50] W. Yue, B. Zheng, Spectrum sensing algorithms for primary detection based on reliability in cognitive radio systems, Comput. Electr. Eng. 36 (3) (2010) 469–479.
[51] S.A.B. Mussa, M. Manaf, K.Z. Ghafoor, Z. Doukha, Simulation tools for vehicular ad hoc networks: a comparison study and future perspectives, in: Wireless Networks and Mobile Communications (WINCOM), 2015 International Conference on, IEEE, 2015, pp. 1–8.
[52] R. Stanica, E. Chaput, A.-L. Beylot, Simulation of vehicular ad-hoc networks: challenges, review of tools and recommendations, Comput. Netw. 55 (14) (2011) 3179–3188.
[53] S. Al-Sultan, M.M. Al-Doori, A.H. Al-Bayatti, H. Zedan, A comprehensive survey on vehicular ad hoc network, J. Netw. Comput. Appl. 37 (2014) 380–392.
[54] C. Sommer, J. Härri, F. Hrizi, B. Schünemann, F. Dressler, Simulation tools and techniques for vehicular communications and applications, in: Vehicular ad hoc Networks, Springer, 2015, pp. 365–392.
[55] N. Baldo, M. Miozzo, Spectrum-aware channel and phy layer modeling for ns3, in: Proceedings of the Fourth International ICST Conference on Performance Evaluation Methodologies and Tools, in: ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), 2009, p. 2.
[56] X.-L. Huang, J. Wu, W. Li, Z. Zhang, F. Zhu, M. Wu, Historical spectrum sensing data mining for cognitive radio enabled vehicular ad-hoc networks, IEEE Trans. Dependable Secure Comput. 13 (1) (2016) 59–70.