Computers and Electronics in Agriculture 166 (2019) 104985
Double-DQN based path smoothing and tracking control method for robotic vehicle navigation
Wenyu Zhang a, Jingyao Gai b, Zhigang Zhang a,⁎, Lie Tang b,⁎, Qingxi Liao c, Youchun Ding c

a Key Laboratory of Key Technology on Agricultural Machine and Equipment, Ministry of Education, South China Agricultural University, Guangzhou 510642, China
b Department of Agricultural and Biosystems Engineering, Iowa State University, Ames, IA 50011-3270, USA
c College of Engineering, Huazhong Agricultural University, Wuhan, Hubei 430070, China
ARTICLE INFO

Keywords: Navigation control; Deep reinforcement learning; Double Deep Q-Network; Path tracking; Robotic vehicle

ABSTRACT
Automatically tracking paths with large curvatures smoothly and tightly is an acknowledged challenge in robotic vehicle navigation. This is particularly true for navigation applications in which pre-defined paths, such as crop rows in agricultural applications, must be followed accurately. This article presents a novel path smoothing and tracking control method based on the Double Deep Q-Network (Double DQN) for an automated robotic vehicle. The Double-DQN-based controller uses a deep reinforcement learning method that scores all potential actions at the current state according to their performance index and selects the best performer as the output of the network. In this study, a path-tracking algorithm was developed with a deep Q-network trained by driving a rover in a simulated virtual environment. The algorithm was tested both in simulation and on a grass field to follow paths with multiple sharp turns, and its performance was compared with that of the Pure Pursuit Control (PPC) algorithm. The results showed that the Double-DQN-based control dramatically reduced the settling time and the overshoot at corners at higher forward speeds, at the minor expense of a slightly increased rise time and steady-state error.
1. Introduction

Over the last two decades, autonomous vehicle navigation has become an important technology in many applications such as industry, the military and modern agriculture, particularly in the context of increasing demands for high operational efficiency and equipment reliability with lower labor cost and lower human health risk. A precise path-tracking algorithm is critical in applications where mobile equipment has limited working space or needs to accurately follow predefined paths, such as crop scouting and phenotyping in agriculture, where crop rows must be followed accurately. RTK-GPS has been widely adopted as a positioning sensor capable of precisely measuring the vehicle position in outdoor navigation. Various path-tracking algorithms have been developed for different vehicle models using different principles, such as PID control, fuzzy logic models, Model Predictive Control and Pure Pursuit Control.
PID control is a widely used control algorithm, mainly due to its simplicity, robustness, and popularity (Normey-Rico et al., 2001). Luo et al. (2009) developed a path-tracking navigation system for tractor navigation based on a PID controller and achieved an average offset error of less than 3 cm on straight trajectories at a travel speed of
0.8 m/s. Dong et al. (2011) designed a row tracking system with a two-layer PID control structure for a robotic white asparagus harvester, and an average offset error of 0.5 cm was achieved when tracking straight paths at a travel speed of 0.16 m/s.
Fuzzy models are able to approximate non-linear models by a set of If-Then rules without fully characterizing the non-linear systems (Precup and Hellendoorn, 2011). Fuzzy control based on fuzzy models has been applied to the navigation of many agricultural robots because the agricultural environment has variable field conditions that make the system hard to model accurately (Cho and Lee, 2000). For example, Hagras et al. (1999) developed a fuzzy controller trained by a genetic algorithm (GA) for an autonomous field vehicle. The vehicle was able to follow irregular crop rows with an ultrasonic sensor within a tolerance of 5 cm (2″).
Model Predictive Control (MPC) refers to a family of algorithms that use an explicit process model to compute a sequence of manipulated adjustments in order to optimize the future behavior of a system. MPC is able to handle constraints, process nonlinearity, uncertainty and time-delay (Ding et al., 2018). A path-tracking control method that combines a Kalman filter and a Nonlinear Model Predictive Control (NMPC) method was designed for a tractor, and was able to achieve a lateral
⁎ Corresponding authors.
E-mail address: [email protected] (L. Tang).
https://doi.org/10.1016/j.compag.2019.104985
Received 6 February 2019; Received in revised form 14 August 2019; Accepted 1 September 2019; Available online 12 September 2019
0168-1699/ © 2019 Elsevier B.V. All rights reserved.
error of 10 cm at a velocity of 12 km/h (Backman et al., 2012).
Pure Pursuit is a tracking method that calculates the path curvature required to move the vehicle from its current position to a goal point on the path at a certain distance ahead, based on the vehicle kinematic model (Coulter, 1992). It can effectively control vehicle tracking and has been widely used in automatic navigation. Pure-Pursuit based algorithms such as Clothoid Fitting (CF) pursuit (Ollero and Heredia, 1995) were developed to further improve the performance (Samuel et al., 2016). In many applications, the curved paths were simplified into polygonal lines defined by a series of way points (Xiang and Noguchi, 2014).
Traditional methods have shown limitations in following paths with sharp turns. A pure-pursuit path-tracking controller developed for an autonomous hydraulically powered articulated-steer vehicle was able to track paths without sharp turns within an error of 6–19 cm, but for a 90-degree turn the path-following error was almost 80 cm (Rains et al., 2014), significantly higher than the error over the straight paths. A navigation control system utilizing a laser range finder was developed for an autonomous tractor equipped with a trailer; it could navigate the tractor-trailer with RMS errors of 0.275 m (SD: 0.009 m), 0.373 m (SD: 0.030 m) and 0.518 m (SD: 0.022 m) for wide-curve, tight-curve and U-turn test paths, respectively (Samuel et al., 2016).
A number of algorithms smooth the curved paths before tracking them in order to boost tracking performance. For path smoothing, single clamped cubic B-spline curves were found to be well suited for planning curved paths (Elbanhawi et al., 2015). Circle segments, generalized elementary paths, and bi-elementary paths were considered as design elements for path smoothing between two straight paths (Plessen et al., 2017). These methods adopt a smooth path-planning-and-tracking approach to the problem of connecting straight-line segments.
In general, it is difficult to design a controller that can track both straight lines and sharp curves precisely with minimal overshoot. However, tracking straight lines and tracking sharp curves are not equally important in many applications. For instance, in some agricultural in-field navigation applications the vehicles follow predefined straight paths such as crop rows, in which case optimizing the rise time and overshoot when tracking linear paths is more important than tracking circular paths with high accuracy in non-planting areas (Moustris and Tzafestas, 2011; Xiang and Noguchi, 2014). A controller capable of making decisions based on the performance of tracking linear paths and of switching between tracking lines is a reasonable solution. Such a controller does not minimize the tracking error at every point on the path, but focuses on the performance during the transition between path segments when tracking polygonal paths.
In this study, a Deep Q-Network (DQN) based controller was designed for accurate linear path tracking and for smooth and fast transitions when tracking polygonal lines. This method optimizes the tracking performance based on control criteria (a reward function) learned in a simulated environment. Because the method does not rely on complex mathematical derivation, it is easily expandable to incorporate parameters such as driving velocity and steering rate constraints to better approximate real-world situations.
2. Theoretical background

In this section, the theoretical background of the DQN paradigm, and particularly the Double-DQN, is described. It is important to understand the functions of the different components of a DQN, including the formulation of the problem, the design of the optimization policy, the rewarding mechanism, and the deep learning network. The deep network, which replaces the state memory matrix of traditional Q-learning, is used to record the Q-values of all possible actions at each state. Q-learning is a simple way for agents to learn how to act optimally in controlled Markovian domains. It amounts to an incremental method for dynamic programming that imposes limited computational demands; it works by successively improving its evaluations of the quality of particular actions at particular states (Watkins and Dayan, 1992).

2.1. Deep Q network (DQN)

The DQN is a technique that combines reinforcement learning (notably Q-learning) and deep neural networks. A DQN trained to play Atari 2600 games for 38 days was able to match human performance and showed a strong decision-making ability (Mnih et al., 2015). The network acts as an approximation function that calculates the action values (or Q-values), i.e. the cumulative future rewards, of a series of pre-defined actions given a state variable s. The approximation function is denoted Q(s, ·; w), where w is the collection of weights of the network. When used as a controller, the action with the maximum Q-value at the current state s is applied to the system as the control signal.
Two important techniques in deep Q-learning are the use of a target network and the use of experience replay to improve the training speed and results (Mnih et al., 2015). The target Q-network is a separate network used to generate Q-values during training; instead of being updated at every training step, it is updated at a fixed period. The target value function for the DQN is then
$Y_t^{DQN} = R_{t+1} + \gamma \max_a Q(S_{t+1}, a;\, w_t^-)$   (1)
In the equation, YtDQN gives the target value of the DQN at time t. The reward Rt+1 is drawn from a fixed reward distribution at time t + 1, γ is the discount factor determining the agent's horizon, St+1 is the state at time t + 1, a is the action taken, and wt− are the weight parameters of the target Q-network at time t. The purpose of the target Q-network is to reduce the correlation between the target value and the action value (Q-value): updating the same network that is used to calculate the targets at every step leads to serious over-fitting, which harms network training.
For experience replay, the data used to train the DQN are randomly sampled from an experience memory bank (Lin, 1992) that stores the transitions generated by the running system. This reduces the correlation of the data samples, because consecutive experience data are strongly correlated, which can hinder the training of the DQN. Both the target network and experience replay improve the performance of the algorithm (Mnih et al., 2015).

2.2. Double DQN

Because the max operator in the DQN uses the same Q-network both to select and to evaluate an action, it tends to produce over-optimistically estimated values. To prevent this, the selection is decoupled from the evaluation, which is the idea behind Double DQN (Hasselt, 2010; Hasselt et al., 2015). Two value functions (Q-networks) are learned by alternately updating one of them, resulting in two sets of network weights, w and w′. The value function with weights w is used to determine the action to take, based on a greedy policy, while the other, with weights w′, is used to evaluate the Q-value. The Double DQN target value is expressed as
$Y_t^{DoubleQ} = R_{t+1} + \gamma\, Q\!\left(S_{t+1},\ \arg\max_a Q(S_{t+1}, a;\, w_t);\ w_t'\right)$   (2)
where YtDoubleQ gives the target value of the Double DQN at time t, and Q(s, a; w′) is the Q-value, evaluated with the weights w′, of the action a selected by the network with weights w given the state s.
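To make the selection/evaluation split concrete, the sketch below shows a uniform experience-replay buffer and the Double-DQN target of Eq. (2). PyTorch-style Python is an assumption (the paper does not name its framework), the class and function names are ours, and only the replay capacity and minibatch size are taken from Table 2; `dones` is assumed to be a 0/1 tensor marking episode termination.

```python
import random
from collections import deque

import torch


class ReplayBuffer:
    """Experience memory bank: transitions are stored as the system runs and
    sampled uniformly at random, which breaks the temporal correlation of
    consecutive samples (Section 2.1)."""

    def __init__(self, capacity=20_000):          # replay memory size, Table 2
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=16):              # minibatch size, Table 2
        return random.sample(list(self.buffer), batch_size)


def double_dqn_targets(online_net, target_net, rewards, next_states, dones, gamma=0.9):
    """Double-DQN target of Eq. (2): the online network (weights w) selects the
    greedy action, the target network (weights w') evaluates it."""
    with torch.no_grad():
        # Action selection with the online weights w
        next_actions = online_net(next_states).argmax(dim=1, keepdim=True)
        # Action evaluation with the target weights w'
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)
        # No bootstrapping past a terminal state
        return rewards + gamma * next_q * (1.0 - dones.float())
```

The target network itself would simply be refreshed by copying the online weights at the fixed update period described above.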
Fig. 1. The experimental rover and its components: (a) Husky base robot, (b) test rover, (c) RTK-GPS radio, (d) battery, (e) IMU sensor, (f) Jetson TX1 module computer, (g) RTK-GPS module, (h) structure of navigation system.
3. Material and methods

3.1. Hardware and software setup

For testing our developed algorithm, we built a test platform (Fig. 1b) based on a four-wheeled rover (Husky, Clearpath Robotics Inc., ON, Canada) (Fig. 1a). The rover uses a differential skid-steering mechanism; its main parameters are shown in Table 1. The rover was controlled by a Jetson TX1 computer. An RTK-GPS system from Swift Navigation was mounted on top of the rover to acquire centimeter-level positioning data; the accuracy of the GPS system is within ±20 mm. An inertial measurement unit (MicroStrain 3DM-GX5-15, Lord Sensing, VT, USA) was installed to obtain the posture of the rover (Fig. 1c–g). The system scheme is presented in Fig. 1h.
The software for the navigation application was built with QT 5.0 (The QT Company, CA, USA) and installed on the Jetson TX1 module running the Ubuntu 16.04 LTS operating system. The software acquired the GPS and IMU data through RS232 serial communication; the GPS update rate was 10 Hz and the IMU sampling rate was 50 Hz. The proposed algorithm output the desired curvature as the control signal to the rover based on the sensor inputs, and the controller ran at an update rate of 30 Hz.

Table 1
Main parameters of the Husky.

Parameter                   Value
External dimensions (mm)    990 × 670 × 390
Weight (kg)                 50
Max payload (kg)            75
Max speed (m/s)             1.0
User power (V)              5, 12, 24

Fig. 2. Rover state in the local coordinate system (x, y, z), where x′, y′ and z′ are the translational speeds of the rover along the x, y, z directions, respectively, and αroll, βpitch, γyaw are the rotational rates around the x, y, z axes, respectively.

3.2. Kinematics model of the rover

We limited the travel speed of the rover to less than 3 mph (4.8 km/h) to maintain the maneuverability needed for tracking paths accurately. Given this low travel speed, our control system design was based only on the kinematic model of the rover. The model of the rover, with respect to the vehicle's 2D coordinate system, describes the 6-DOF (degree-of-freedom) motion of the rover, including movements in the x, y and z directions as well as pitch, roll, and yaw rotations (Fig. 2). The differential equations for the coordinates x and y, the heading angle θ and the velocity v of the rover are defined below:
$\begin{bmatrix} \dot{x} \\ \dot{y} \\ \dot{\theta} \end{bmatrix} = \begin{bmatrix} v\cos\theta + a_x\Delta t + \alpha'_{roll}H\sin\theta + \beta'_{pitch}H\cos\theta \\ v\sin\theta + a_y\Delta t + \alpha'_{roll}H\cos\theta + \beta'_{pitch}H\sin\theta \\ \gamma'_{yaw} \end{bmatrix}$   (3)
where ax and ay are the accelerations of the rover along the x- and y-directions in the vehicle's 2D coordinate system, αroll is the roll angle of the rover relative to the horizontal plane, βpitch is its pitch angle, and γyaw is its yaw angle; the values of these variables were captured by the IMU sensor. H is the distance from the GPS antenna to the ground, and Δt is the sampling interval.
A Kalman filter was designed based on the kinematic model to fuse the GPS positioning data (WGS84 data transformed to the NED coordinate system) and the IMU sensor data (α′roll, β′pitch, γ′yaw and ax, ay, az) for localization. Kalman filter design methods can be found in Khot et al. (2008, 2006) and Welch and Bishop (2004). The state estimation is represented by the following system function:

$x_{kal}^{k} = A_{kal}\, x_{kal}^{k-1} + B_{kal}\, u_{kal}^{k-1}$   (4)

$x_{kal} = [\, x,\ y,\ \theta,\ \alpha_{roll},\ \beta_{pitch}\,]^{T}$   (5)

$u_{kal} = [\, v,\ \gamma'_{yaw},\ a_x,\ a_y,\ \alpha'_{roll},\ \beta'_{pitch}\,]^{T}$   (6)

$B_{kal} = \begin{bmatrix} \Delta t\cos\theta & 0 & 0.5\Delta t^{2}\cos\theta & -0.5\Delta t^{2}\sin\theta & \Delta t H\sin\theta & \Delta t H\cos\theta \\ \Delta t\sin\theta & 0 & 0.5\Delta t^{2}\sin\theta & 0.5\Delta t^{2}\cos\theta & \Delta t H\cos\theta & \Delta t H\sin\theta \\ 0 & \Delta t & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & \Delta t & 0 \\ 0 & 0 & 0 & 0 & 0 & \Delta t \end{bmatrix}$   (7)

where xkal and ukal are the state and control variables, and Akal is a 5 × 5 identity matrix. The process noise covariance Q and the measurement error covariance R of the Kalman filter were set as follows, based on testing:

$Q = \mathrm{diag}(0.002,\ 0.002,\ 0.002,\ 0.05,\ 0.05)$   (8)

$R = \mathrm{diag}(0.3,\ 0.3,\ 0.2,\ 0.2,\ 0.2)$   (9)

3.3. Double-DQN model development

(1) Structure of the Double-DQN network

We used an architecture in which there is a separate output unit for each action (i.e., each steering curvature) and the robot state representation is the input to the neural network. The main advantage of this type of architecture is the ability to compute the Q-values of all actions in a given state S with only a single forward pass through the network (Mnih et al., 2015). The proposed network architecture for training and control has seven layers: an input layer, five fully connected hidden layers, and an output layer (Fig. 4). The input layer has three neurons, the output layer has 800 neurons corresponding to 800 pre-defined actions, and each hidden layer has 128 neurons.
In this study, the network structure was determined by balancing the model's generalization ability against its computational cost: a network with a simple structure is fast to evaluate but generalizes poorly, whereas a complex network generalizes well but requires a longer training time. In the tests used to determine the model structure, the upper limit and the rising speed of the reward score during network training were used as the evaluation criteria. The average time for a single forward pass through the selected network on an NVIDIA Pascal™ GPU (Graphics Processing Unit) was 12 ms.

Fig. 4. Schematic illustration of the Double-DQN network. The input to the neural network is a state S produced by the simulated rover system, followed by five fully connected hidden layers and an output for each action. Each hidden layer is followed by a rectifier nonlinearity, max(0, x).

Table 2
List of hyperparameters and their values.

Hyperparameter                       Value
Minibatch size                       16
Replay memory Dr size                20,000
Target network update frequency D    40,000
Discount factor γ                    0.9
Learning rate                        0.0001
Initial exploration ε                1
Final exploration                    0.1
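As a concrete reading of Fig. 4 and the layer sizes listed above, a minimal sketch of such a network follows. PyTorch is assumed here purely for illustration; the class name and the example forward pass are ours, not the authors' implementation.

```python
import torch
import torch.nn as nn


class CurvatureQNet(nn.Module):
    """Q-network of Section 3.3(1): state (S_d, S_theta, S_v) in, one Q-value
    per pre-defined steering-curvature action out."""

    def __init__(self, n_states=3, n_hidden=128, n_hidden_layers=5, n_actions=800):
        super().__init__()
        layers, width = [], n_states
        for _ in range(n_hidden_layers):                      # five fully connected hidden layers
            layers += [nn.Linear(width, n_hidden), nn.ReLU()]  # ReLU = max(0, x)
            width = n_hidden
        layers.append(nn.Linear(width, n_actions))            # one output unit per action
        self.net = nn.Sequential(*layers)

    def forward(self, state):                                 # state: (batch, 3)
        return self.net(state)                                # Q-values: (batch, 800)


# A single forward pass scores all 800 candidate curvatures at once:
q_net = CurvatureQNet()
q_values = q_net(torch.zeros(1, 3))                           # shape (1, 800)
best_action_index = q_values.argmax(dim=1).item()
```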
(2) Input and output of the network

The path-following errors and the velocity of the rover were fed into the Double-DQN. The network structure and the composition of its input and output layers are shown in Fig. 3. The input signal consisted of the vehicle speed and the path-tracking errors. The speed plays an important role in navigation control: because the system has a response delay, the states change faster at higher vehicle speeds and the system tends to become unstable.

Fig. 3. Overview of the network structure and the formation of its input and output layers.

The lateral position deviation ed and the angular deviation eθ from the linear path were used to quantify the rover's path-tracking error. When both errors become zero, the rover is exactly on the path and has reached a steady state. The path-following errors are given by:

$\begin{bmatrix} e_d \\ e_\theta \end{bmatrix} = \begin{bmatrix} \dfrac{A_l y + B_l x + C_l}{\sqrt{A_l^{2} + B_l^{2}}} \\ \theta - \theta_l \end{bmatrix}$   (10)

where Al, Bl, and Cl are the parameters of the tracked linear path equation (Al y + Bl x + Cl = 0), and θl is the heading angle of the path direction in the 2D coordinate system.
For tracking control and for training the network, the path-following errors were nonlinearly normalized before being input to the network. The input mapping and normalization equations are given below:

$\begin{bmatrix} S_d \\ S_\theta \\ S_v \end{bmatrix} = \begin{bmatrix} 0.5\sqrt[3]{e_d / D_{rag}} \\ 0.5\sqrt[3]{e_\theta / \theta_{rag}} \\ (v - 0.5V_{rag})/V_{rag} \end{bmatrix}$   (11)

where Sd, Sθ and Sv are the input state variables of the network, and Drag, θrag and Vrag are the normalization ranges for ed, eθ and v, respectively. The cube root used for Sd and Sθ amplifies the small navigation errors that occur close to the steady state before they are fed into the network. The purpose of this nonlinear normalization is to avoid gradient vanishing and to speed up training: because the goal of path tracking is to minimize the tracking error, the distributions of the error input variables concentrate around 0 as the system converges, and gradient vanishing occurs if the input variables are always small. The nonlinear normalization therefore improves the distribution of the input signals and speeds up training.
To solve the control problem, we utilized the optimal Q-value (action value) of each action. Given a state S as the network input, the network outputs the Q-values of a series of pre-defined actions (desired curvatures in this study). Each Q-value represents the expected future tracking performance of the corresponding steering action, so the desired curvature with the maximum Q-value was selected. The mapping from the 800 output neurons to the real steering curvature aout is given below:

$a_{out} = A_{rag}\left(\dfrac{n_{max} - n_{rag}}{n_{rag}}\right)^{3}$   (12)

where Arag is the anti-normalization range, nrag is half the total number of output neurons, and nmax is the index of the neuron with the maximum action value. The cubic mapping of the output was designed to amplify the subtle detail around the zero point, because the distribution of curvatures is concentrated near the steady state. Since neighboring output vectors interfere with each other when the network weights are updated, the action values need to be normalized during network training.
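The following sketch gathers Eqs. (10)–(12), as reconstructed above, into plain Python. The function names are ours, and the default ranges (D_rag = 8 m, θ_rag = π rad, V_rag = 1.6 m/s, A_rag = 0.698 m⁻¹, n_rag = 400) are assumptions taken from the training ranges given in Section 3.5 and the 800-neuron output layer, not values the paper states for the deployed controller.

```python
import numpy as np


def path_errors(x, y, theta, A_l, B_l, C_l, theta_l):
    """Eq. (10): lateral and heading errors to the line A_l*y + B_l*x + C_l = 0."""
    e_d = (A_l * y + B_l * x + C_l) / np.hypot(A_l, B_l)
    e_theta = theta - theta_l
    return e_d, e_theta


def normalize_state(e_d, e_theta, v, D_rag=8.0, theta_rag=np.pi, V_rag=1.6):
    """Eq. (11): nonlinear input normalization. The cube root stretches the
    small errors that dominate near the steady state."""
    s_d = 0.5 * np.cbrt(e_d / D_rag)          # np.cbrt keeps the sign of e_d
    s_theta = 0.5 * np.cbrt(e_theta / theta_rag)
    s_v = (v - 0.5 * V_rag) / V_rag
    return np.array([s_d, s_theta, s_v])


def neuron_to_curvature(n_max, A_rag=0.698, n_rag=400):
    """Eq. (12): map the index of the winning output neuron (0..799) to a
    steering curvature; the cubic term concentrates resolution near zero."""
    return A_rag * ((n_max - n_rag) / n_rag) ** 3
```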
3.4. Simulation rover system architecture
Fig. 7. Experimental results of the path tracking performance when the speed of the rover was set at 1.0 (m/s): (a) Path tracking performed by the Double-DQN method; (b) path tracking performed by PPC method; (c) lateral deviation error ed by the Double-DQN method; (d) lateral deviation error ed by PPC method. 6
Table 3
Comparison of the results of the PPC and Double-DQN methods in the simulation experiment.

                     v = 0.5 m/s               v = 1.0 m/s               v = 1.5 m/s
Method       d (m)   σ (%)   tr (s)   ts (s)   σ (%)   tr (s)   ts (s)   σ (%)   tr (s)   ts (s)
PPC          2       3.5     16.2     24.5     3.3     8.3      12.0     4.0     5.3      8.4
             4       3.5     20.0     27.9     3.5     10.1     13.8     6.7     6.2      10.0
             6       3.6     23.3     31.3     4.6     11.4     16.2     9.3     7.3      11.6
             8       3.8     26.6     35.1     5.8     12.9     18.5     10.0    8.5      13.0
Double-DQN   2       0       10.1     11.2     0       5.9      7.3      0       4.1      4.5
             4       0       14.8     16.7     0       8.0      10.0     0       5.1      5.9
             6       0       18.6     21.0     0       10.0     12.1     0       6.3      7.2
             8       0       21.2     23.9     0       11.6     14.4     0       7.9      10.5
In this study, all training data came from a simulation rover system built on the kinematic model, which was also used to calculate the reward of each action given the state S from the simulation system. The state-space kinematic model, a set of recursive equations (Eq. (13)) with steering-rate constraints (Eq. (14)), is listed below:
$\begin{bmatrix} X_{k+1} \\ Y_{k+1} \\ \theta_{k+1} \\ a_{k+1} \end{bmatrix} = \begin{bmatrix} X_k + V_k T\cos\theta_k \\ Y_k + V_k T\sin\theta_k \\ \theta_k + V_k T a_k \\ F(a'_k) \end{bmatrix}$   (13)

$F(a'_k) = \begin{cases} a'_k, & |a'_k - a_k| < \xi T \\ a_k + \xi T\,\mathrm{sign}(a'_k - a_k), & |a'_k - a_k| \geq \xi T \end{cases}$   (14)

where the subscripts k and k + 1 denote the current step and the next step, respectively. The position of the rover is defined in an easting (X) and northing (Y) coordinate system, Vk and θk are the speed (m/s) and heading (rad) of the rover in that coordinate system, T is the sampling interval (s), and ak is the curvature (m−1) in the simulation system. a′k is the target curvature for ak. The rate of change of the rover's curvature is constrained; therefore, the rate of ak is limited as shown in Eq. (14), where ξ is the maximum rate of ak, set to 0.22 m−1·s−1 to maintain steady turning based on field testing.

3.5. The training algorithm for Double-DQN

The training algorithm of the Double-DQN is shown in Algorithm 1. Firstly, a start state S0 is initialized and the simulation system is run to collect the data (Sk, a′k, rk+1, Sk+1), which are stored in the simulation memory Ds at each step. The training phase employs the Pure Pursuit Control (PPC) method (Eq. (15)) to speed up training; the PPC method, which is built on the vehicle kinematics, is widely used in automatic navigation control. Its curvature is calculated using the equations listed below:

$a = \begin{cases} \dfrac{2\left(e_d\cos e_\theta + \sqrt{L_h^{2} - e_d^{2}}\,\sin e_\theta\right)}{L_h^{2}}, & |e_d| < L_h \\ \dfrac{2\,\mathrm{sign}(e_d)\cos e_\theta}{L_h}, & |e_d| \geq L_h \end{cases}$   (15)

where Lh is the look-ahead distance. Longer look-ahead distances lead to more gradual convergence to the path with less oscillation; the response of the pure pursuit tracker resembles the step response of a second-order dynamic system, and Lh acts as a damping factor (Coulter, 1992). If |ed| is larger than Lh, the term $\sqrt{L_h^{2} - e_d^{2}}$ has no real solution, so the calculation is divided into two cases. The PPC generates effective empirical data for the simulation system and is therefore used to speed up the network convergence. Secondly, the data in Ds are randomly shuffled for experience replay. Thirdly, the data in the replay memory Dr are used to train the deep network with the Double-DQN update policy. Because the network weights are updated by gradient back-propagation, feeding the data in their original order would lead to over-estimation; experience replay improves training efficiency by shuffling the input data batches (Mnih et al., 2015).

Algorithm 1. Training the Double-DQN to control the robotic vehicle.

  Initialize simulation memory Ds with capacity Ns
  Initialize replay memory Dr with capacity Nr
  Initialize the action-value function Q with random weights w by Xavier initialization
  Initialize the target action-value function Q′ with random weights w′ by Xavier initialization
  For episode = 1, M do
    For j = 1, J do
      Initialize a state S0 and a curvature a0
      For k = 0, K do
        With probability ε, select a random target curvature action a′k
        Otherwise select a′k = argmaxa Q(Sk, a; w′)
        Execute curvature action a′k with state Sk in the simulation rover system; observe reward rk+1 and state Sk+1
        Store transition (Sk, a′k, rk+1, Sk+1) in Ds
      End For
      Randomly shuffle all data in Ds and store them in Dr
      Every C steps:
        For k = 0, K do
          With probability ε, select a random target curvature action a′k
          Otherwise select a′k from Eq. (15)
          Execute curvature action a′k with state Sk in the simulation rover system; observe reward rk+1 and state Sk+1
          Store transition (Sk, a′k, rk+1, Sk+1) in Ds
        End For
        Randomly shuffle all data in Ds and store them in Dr
    End For
    For t = 1, N do
      Sample a transition (St, a′t, rt+1, St+1) from Dr
      Set yt = rt+1 if the episode terminates at t + 1; otherwise set yt = rt+1 + γ Q(St+1, argmaxa Q(St+1, a; w); w′)
      Perform a gradient descent step on the loss (yt − Q(St, a′t; w))² with respect to the network parameters w
    End For
    Every D steps set w′ = w
  End For

The Xavier initialization draws the initial weights w of the network from a Gaussian distribution (Glorot and Bengio, n.d.); after initialization the variance of a layer's output matches the variance of its input, which avoids all output values collapsing to 0. The C-step means that every C − 1 training epochs, a pre-designed set of training samples with an expected reward is inserted and used to train the network; this injection of training samples generated from prior knowledge accelerated the training process and improved the overall learning efficiency of the network.
When training the network, the state S, action a and reward r of each step are collected, and convergence is driven by evaluating the error. To this end, a discount factor γ per time step is defined (γ is set to 0.90), together with the reward r, which is the sum of a time bonus and quadratic error terms:

$r = \begin{cases} R_{win}, & |e_d| < E_d,\ |e_\theta| < E_\theta \\ R_{time} - e_d^{2}\, T - e_\theta^{2}\, T, & |e_d| \geq E_d\ \text{or}\ |e_\theta| \geq E_\theta \end{cases}$   (16)
where Rwin is the reward for successfully reaching the steady state of linear tracking, at which point the episode terminates (Rwin is set to 0.002). Rtime is the time bonus for each step (Rtime is set to 0.002). Ed (m) and Eθ (rad) are the thresholds (Ed is set to 0.15 m and Eθ to 3 degrees) used to determine whether the steady state has been reached: if the errors are smaller than the thresholds, the steady state is considered reached.
The network needs a large amount of data for training. To obtain it, each simulated line-tracking episode was run until the rover reached the steady state, and the whole episode was stored and later used for experience replay. The initial state S0 and curvature a0 were randomly drawn from set ranges; the ranges of ed, eθ, v and a were [−8, 8] m, (−π, π] rad, [0.4, 1.6] m/s and [−0.698, 0.698] m−1, respectively. The behavior policy during training was ε-greedy, with ε annealed linearly from 0.4 to 0.1 and then fixed at 0.1. Because the convergence of the network was slow in the early stage of training, a set of simulation data generated by the Pure Pursuit Control model was added into the experience replay every C steps of the simulation loop; this prior experience helped the network acquire basic weights. The other hyperparameters for training the network are listed in Table 2.
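A minimal Python sketch of the simulated rover used for training follows, combining the recursive kinematics of Eq. (13), the steering-rate constraint of Eq. (14), the PPC seeding curvature of Eq. (15) and the reward of Eq. (16) as reconstructed above. The thresholds, bonuses, ξ and Lh are the values reported in the text; the sampling interval T = 1/30 s is an assumption based on the 30 Hz controller rate in Section 3.1, and the function names are ours.

```python
import numpy as np

XI = 0.22          # maximum curvature rate (m^-1 s^-1), from field testing
T = 1.0 / 30.0     # sampling interval (s); assumed from the 30 Hz controller rate


def rate_limit(a_target, a, xi=XI, dt=T):
    """Steering-rate constraint of Eq. (14)."""
    if abs(a_target - a) < xi * dt:
        return a_target
    return a + xi * dt * np.sign(a_target - a)


def sim_step(X, Y, theta, a, a_target, v, dt=T):
    """One recursion of the simulated rover kinematics, Eq. (13)."""
    X_next = X + v * dt * np.cos(theta)
    Y_next = Y + v * dt * np.sin(theta)
    theta_next = theta + v * dt * a
    a_next = rate_limit(a_target, a, dt=dt)
    return X_next, Y_next, theta_next, a_next


def ppc_curvature(e_d, e_theta, L_h=1.6):
    """Pure Pursuit curvature of Eq. (15), used to seed the replay memory;
    L_h = 1.6 m is the look-ahead distance used in the field comparison."""
    if abs(e_d) < L_h:
        return 2.0 * (e_d * np.cos(e_theta)
                      + np.sqrt(L_h ** 2 - e_d ** 2) * np.sin(e_theta)) / L_h ** 2
    return 2.0 * np.sign(e_d) * np.cos(e_theta) / L_h


def reward(e_d, e_theta, dt=T, E_d=0.15, E_theta=np.radians(3.0),
           R_win=0.002, R_time=0.002):
    """Reward of Eq. (16): terminal bonus once both errors are inside the
    steady-state thresholds, otherwise a time bonus minus quadratic error terms."""
    if abs(e_d) < E_d and abs(e_theta) < E_theta:
        return R_win
    return R_time - e_d ** 2 * dt - e_theta ** 2 * dt
```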
Table 4
Tracking performance comparison between the Double-DQN and PPC methods at a travel speed of 0.50 m/s.

Method       Turning   σ (%)   tr (s)   ts (s)   |es| (m)
PPC          B         0       6.7      7.6      0.028
             C         0       6.7      7.7      0.018
             D         5.9     6.3      11.5     0.015
             E         21.1    7.1      14.4     0.010
             F         0       7.1      8.1      0.026
             G         14.4    6.2      14.1     0.016
             H         8.2     6.5      12.8     0.020
             Average   7.1     6.7      10.9     0.019
Double-DQN   B         0       8.2      9.9      0.035
             C         0       8.1      9.6      0.013
             D         0       7.6      10.3     0.014
             E         0       7.2      8.8      0.018
             F         0       7.9      9.3      0.027
             G         0       7.7      9.3      0.019
             H         0       7.8      9.9      0.011
             Average   0       7.9      9.6      0.020
3.6. Simulator research

The PPC method was used for performance comparison with the Double-DQN method in the simulation system prior to the field test. To visualize the difference between the PPC and Double-DQN methods, the state S was used as input to generate the output curvature a, producing their control function maps (Fig. 5). It can be observed that the Double-DQN control functions are highly nonlinear and discontinuous, whereas the PPC control function is much more gradual and smooth. The maps of the Double-DQN control functions at different speeds v display a distinctive canyon-shaped structure, with the higher ground representing more aggressive steering actions and the lower ground/valleys gentler steering actions (Fig. 5b–d). The Double-DQN responses at different travel speeds also show the adaptability of the controller: higher travel speeds lead to more eroded (wider) valleys. This makes sense because, at higher speed, the turning curvature needs to be gentler to reduce overshoot and oscillation.

Fig. 5. (a) The control function map of the PPC method; (b–d) the control function maps of the Double-DQN method at travel speeds of 0.5 m/s, 1.0 m/s, and 1.5 m/s.
Table 5
Tracking performance comparison between the Double-DQN and PPC methods at a travel speed of 0.75 m/s.

Method       Turning   σ (%)   tr (s)   ts (s)   |es| (m)
PPC          B         13.5    4.1      9.6      0.027
             C         6.6     4.6      10.8     0.011
             D         10.7    4.3      9.7      0.017
             E         8.3     4.2      8.2      0.011
             F         14.2    4.1      10.4     0.032
             G         18.4    4.8      12.1     0.027
             H         10.4    4.5      10.0     0.015
             Average   11.7    4.4      10.1     0.020
Double-DQN   B         0       5.2      8.7      0.023
             C         0       5.3      6.6      0.015
             D         0       5.3      8.5      0.013
             E         0       4.8      5.4      0.020
             F         0       5.1      6.1      0.031
             G         0       5.2      6.9      0.043
             H         0       5.1      7.9      0.013
             Average   0       5.1      7.2      0.023
3.7. Experiment area, path selection, and data capture

The experiment was carried out on a grassed field on the Iowa State University campus (Fig. 6). During data collection, the rover was driven by the navigation control application, which tracked the preset paths and stopped the rover at the end of each path. The rover position was recorded by the application at a sampling rate of 30 Hz. The test path consisted of 9 vertices (A–I) and 8 sides, forming seven sharp turning corners (Fig. 7a). The rover was navigated along the path using the Double-DQN and PPC based control methods at different travel speeds and, as in the simulation, performance indicators including the overshoot σ%, rise time tr, settling time ts and Absolute-Mean-Steady-State-Error (AMSSE) |es| were collected. The AMSSE |es| is also known as the absolute mean lateral deviation error of linear tracking.

Fig. 6. The experimental field for robotic vehicle navigation.
Table 6
Tracking performance comparison between the Double-DQN and PPC methods at a travel speed of 1.0 m/s.

Method       Turning   σ (%)   tr (s)   ts (s)   |es| (m)
PPC          B         36.7    3.8      12.7     0.0257
             C         29.2    4.0      11.4     0.020
             D         32.7    3.9      11.4     0.011
             E         27.7    4.1      11.7     0.092
             F         31.4    3.9      11.4     0.025
             G         32.8    4.0      13.3     0.074
             H         29.2    4.2      12.8     0.013
             Average   31.4    4.0      12.1     0.037
Double-DQN   B         0       4.2      5.3      0.029
             C         0       4.4      5.1      0.024
             D         9.8     3.7      11.3     0.051
             E         25.5    4.2      10.9     0.177
             F         0       4.3      5.0      0.048
             G         13.1    3.7      10.5     0.109
             H         10.4    4.0      9.4      0.032
             Average   8.4     4.1      8.2      0.067
4. Results and discussion

An experiment comparing the performance of the PPC and Double-DQN methods was designed in simulation. In the experiment setup, four initial values (2 m, 4 m, 6 m, and 8 m) of the lateral deviation error d were selected, the initial heading error θ was set to 0 rad, and three travel speeds of 0.5 m/s, 1.0 m/s and 1.5 m/s were tested, giving 4 × 3 groups of tests in the simulation experiment. Performance indicators including the overshoot σ% of the lateral deviation error ed, rise time tr and settling time ts were collected and analyzed (Table 3). The experiment results showed that the overshoot of the lateral
deviation error ed response using the Double-DQN method was less than 3%, except when the speed v was 1.5 m/s and the initial lateral deviation error d was 4 m. The rise times and settling times of the Double-DQN responses were smaller than those produced by the PPC method.
These simulation results confirm that the Double-DQN method can effectively suppress the overshoot, accelerate the convergence of the error, and improve the response quality.
Fig. 8. Comparison of variability between the Double-DQN and PPC methods in the field experiments at three different travel speeds.
4.1. Experimental results

As an illustration, plots showing the navigation performance of the rover running at 1.0 m/s are presented in Fig. 7. It can be observed that the overshoot σ% and settling time ts obtained with the Double-DQN method are clearly smaller than those obtained with the PPC method (Lh was set to 1.6 m).
Three groups of comparison experiments were used to analyze the tracking performance of the Double-DQN and PPC methods at different speeds. The speed was set to 0.5 m/s, 0.75 m/s, and 1.0 m/s, and the paths were the same as shown in Fig. 7(a). The overshoot σ%, rise time tr, settling time ts and AMSSE |es| were calculated for each turn at corners B to H (Fig. 7a) (Tables 4–6 and Fig. 8). These results show that the Double-DQN method can substantially reduce the overshoot and settling time when tracking curved paths, at the minor cost of a slightly increased rise time. An exception occurs at turn E, which is caused by a larger initial heading error θ (i.e., a sharper turning angle is required at corner E) and an inadequate travel distance to allow convergence before reaching the next turning corner.

5. Conclusions

In this paper, we developed a navigation controller based on the Double-DQN method for a small autonomous ground vehicle (a rover) to follow predefined trajectories. Our target applications are those with limited travel space, such as agricultural crop scouting and phenotyping, where minimal overshoot and settling time are always desired. The Double-DQN method, an optimized decision-making method based on the state space, was first used to train a deep network to find the best action for each state in a customized simulation environment. The network was then implemented on the rover to perform comparative studies against the widely used PPC method by tracking a path consisting of multiple right-angle turning corners and straight lines.
In the simulation test, when compared with the PPC method at travel speeds of 0.5 m/s, 1.0 m/s, and 1.5 m/s, the Double-DQN method reduced the overshoot by 3.6%, 4.3% and 5.8%, and decreased the rise time by 5.4 s, 1.8 s and 0.9 s and the settling time by 11.5 s, 4.18 s and 3.3 s, respectively. In the open-field experiment on grassland, when compared with the PPC method at travel speeds of 0.5 m/s, 0.75 m/s and 1.0 m/s, the Double-DQN method reduced the average overshoot by 7.1%, 11.7% and 23.0% and the average settling time by 1.3 s, 2.9 s and 3.9 s, at the minor cost of increasing the average rise time by 1.6 s, 0.7 s and 0.1 s and the average AMSSE by 0.001 m, 0.003 m and 0.03 m, respectively.
This research demonstrated that the Double-DQN control method can effectively improve the performance of the navigation controller according to key performance indicators such as overshoot and settling time. However, like any other method, it has constraints and room for improvement. The quality of the simulation model has a direct influence on the quality of the trained network. Also, to make the network more resilient to delays and noise, injecting these types of disturbances into the training process could be beneficial. In addition, we will further investigate ways to speed up convergence during the training process.

Acknowledgements

This work was financially supported by the National Key Research and Development Program of China (project numbers 2017YFD0700400, 2017YFD0700405, 2017YFD0700404). It was also partly supported by the National Science Foundation Major Research Instrumentation program, USA (Award ID 1625364) and by funds from the Plant Sciences Institute at Iowa State University, USA.

References
Backman, J., Oksanen, T., Visala, A., 2012. Navigation system for agricultural machines: Nonlinear Model Predictive path tracking. Comput. Electron. Agric. 82, 32–43. https://doi.org/10.1016/j.compag.2011.12.009.
Cho, S.I., Lee, J.H., 2000. Autonomous speedsprayer using differential global positioning system, genetic algorithm and fuzzy control. J. Agric. Eng. Res. 76 (2), 111–119. https://doi.org/10.1006/JAER.1999.0503.
Coulter, R.C., 1992. Implementation of the pure pursuit path tracking algorithm. Carnegie-Mellon Univ., Pittsburgh, PA, Robotics Inst.
Ding, Y., Wang, L., Li, Y., Li, D., 2018. Model predictive control and its application in agriculture: a review. Comput. Electron. Agric. https://doi.org/10.1016/j.compag.2018.06.004.
Dong, F., Heinemann, W., Kasper, R., 2011. Development of a row guidance system for an autonomous robot for white asparagus harvesting. Comput. Electron. Agric. 79 (2), 216–225. https://doi.org/10.1016/j.compag.2011.10.002.
Elbanhawi, M., Simic, M., Jazar, R.N., 2015. Continuous path smoothing for car-like robots using B-spline curves. J. Intelligent Robotic Syst. 80 (S1), 23–56. https://doi.org/10.1007/s10846-014-0172-0.
Glorot, X., Bengio, Y., n.d. Understanding the difficulty of training deep feedforward neural networks. Retrieved from http://www.iro.umontreal.
Hagras, H., Callaghan, V., Colley, M., Carr-West, M., 1999. A fuzzy-genetic based embedded-agent approach to learning and control in agricultural autonomous vehicles. In: Proceedings 1999 IEEE International Conference on Robotics and Automation (Cat. No.99CH36288C), 2, 1–6. https://doi.org/10.1109/ROBOT.1999.772444.
Hasselt, H.V., 2010. Double Q-Learning. Advances in Neural Information Processing Systems, 2613–2621. Retrieved from http://papers.nips.cc/paper/3964-double-qlearning.pdf.
Hasselt, H.V., Guez, A., Silver, D., 2015. Deep Reinforcement Learning with Double Q-learning. Retrieved from http://arxiv.org/abs/1509.06461.
Khot, L.R., Tang, L., Steward, B.L., Han, S., 2008. Sensor fusion for improving the estimation of roll and pitch for an agricultural sprayer. Biosyst. Eng. 101 (1), 13–20. https://doi.org/10.1016/j.biosystemseng.2008.05.015.
Khot, L., Tang, L., Blackmore, S., Nørremark, M., 2006. Navigational context recognition for an autonomous robot in a simulated tree plantation. Transactions of the ASABE. Retrieved from https://lib.dr.iastate.edu/abe_eng_pubs/337.
Lin, L.-J., 1992. Self-improving reactive agents based on reinforcement learning, planning and teaching. Mach. Learn. 8 (3–4), 293–321. https://doi.org/10.1007/BF00992699.
Luo, X., Zhang, Z., Zhao, Z., Chen, B., Hu, L., Wu, X., 2009. Design of DGPS navigation control system for Dongfanghong X-804 tractor. Nongye Gongcheng Xuebao/Trans. Chinese Soc. Agric. Eng. 25 (11). https://doi.org/10.3969/j.issn.1002-6819.2009.11.025.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Hassabis, D., 2015. Human-level control through deep reinforcement learning. Nature 518 (7540), 529–533. https://doi.org/10.1038/nature14236.
Moustris, G.P., Tzafestas, S.G., 2011. Switching fuzzy tracking control for mobile robots under curvature constraints. Control Eng. Practice 19 (1), 45–53. https://doi.org/10.1016/J.CONENGPRAC.2010.08.008.
Normey-Rico, J.E., Alcalá, I., Gómez-Ortega, J., Camacho, E.F., 2001. Mobile robot path tracking using a robust PID controller. Control Eng. Practice 9 (11), 1209–1214. https://doi.org/10.1016/S0967-0661(01)00066-1.
Ollero, A., Heredia, G., 1995. Stability analysis of mobile robot path tracking. In: Proceedings 1995 IEEE/RSJ International Conference on Intelligent Robots and Systems. Human Robot Interaction and Cooperative Robots, vol. 3. IEEE Comput. Soc. Press, pp. 461–466. https://doi.org/10.1109/IROS.1995.525925.
Plessen, M.G., Lima, P.F., Martensson, J., Bemporad, A., Wahlberg, B., 2017. Trajectory Planning Under Vehicle Dimension Constraints Using Sequential Linear Programming. Retrieved from http://arxiv.org/abs/1704.06325.
Precup, R.E., Hellendoorn, H., 2011. A survey on industrial applications of fuzzy control. Comput. Indust. 62 (3), 213–226. https://doi.org/10.1016/j.compind.2010.10.001.
Rains, G.C., Faircloth, A.G., Thai, C., Raper, R.L., 2014. Evaluation of a simple pure pursuit path-following algorithm for an autonomous, articulated-steer vehicle. Appl. Eng. Agric. 30 (3), 367–374. https://doi.org/10.13031/aea.30.10347.
Samuel, M., Hussein, M., Binti, M., 2016. A review of some pure-pursuit based path tracking techniques for control of autonomous vehicle. Int. J. Comput. Appl. 135 (1), 35–38. https://doi.org/10.5120/ijca2016908314.
Welch, G., Bishop, G., 2004. An Introduction to the Kalman Filter. UNC-Chapel Hill, pp. 1–16. Retrieved from http://www.cs.unc.edu/~gb.
Watkins, C.J.C.H., Dayan, P., 1992. Technical Note: Q-Learning. Mach. Learn. 8 (3–4), 279–292. https://doi.org/10.1007/BF00992698.
Xiang, Y., Noguchi, N., 2014. Development and evaluation of a general-purpose electric off-road robot based on agricultural navigation. Int. J. Agric. Biol. Eng. 7 (5), 14–21. https://doi.org/10.25165/IJABE.V7I5.1328.