Aerospace Science and Technology 94 (2019) 105403
Novel docking controller for autonomous aerial refueling with probe direct control and learning-based preview method

Yiheng Liu a,b,c, Honglun Wang a,c,*, Jiaxuan Fan a,c

a School of Automation Science and Electrical Engineering, Beihang University, 100191, Beijing, China
b Shenyuan Honors College of Beihang University, 100191, Beijing, China
c The Science and Technology on Aircraft Control Laboratory, Beihang University, 100191, Beijing, China
Article history: Received 29 April 2019; Received in revised form 6 August 2019; Accepted 15 September 2019; Available online 19 September 2019

Keywords: Unmanned aerial vehicle (UAV); Autonomous aerial refueling (AAR); Probe control; Deep learning; Reinforcement learning; Preview control
Abstract

Autonomous aerial refueling (AAR) has long been an active research area owing to its significant applications and challenging control problems. To improve the docking precision of AAR, a novel docking controller with probe direct control and a learning-based preview method is proposed. Firstly, the controlled object is transformed from the receiver barycenter to the probe. Then, a probe direct controller is designed by combining the reference-observer-based tracking control method with the high-order sliding mode control method. Furthermore, a learning-based preview method is introduced to solve the tracking lag problem: the prediction of the drogue motion is considered in the reference signal. A novel learning algorithm, named deep learning and reinforcement learning (DLRL), which combines deep learning (DL) and reinforcement learning (RL) spatially rather than structurally as in deep reinforcement learning (DRL), is proposed to generate the preview time adaptively, and a novel preview index is proposed to suit it. Through the combination of the probe direct controller and the learning-based preview method, the proposed docking controller improves the tracking precision significantly. The effectiveness of the proposed method is demonstrated by simulations.
1. Introduction

Unmanned aerial vehicles (UAVs) are rapidly increasing in popularity for military and civil applications [1–4]. Autonomous aerial refueling (AAR) [5,6], as a technology that can satisfy the long-endurance and long-range requirements of UAVs, has attracted considerable research interest. In general, there are two refueling approaches [5,6]: the flying-boom method and the probe-and-drogue method. In this paper, we focus on the probe-and-drogue method (PDR), as shown in Fig. 1. Receiver docking control is a key problem in the AAR docking process; the main task of the receiver docking controller [10] is to control the probe to track the drogue precisely. Much research has discussed docking controllers [7–16] for the receiver in the AAR docking process. For instance, an optimal Nonzero Set Point with Control Rate Weighting (NZSP-CRW) control structure proposed in [15,16] can estimate reference states of the receiver and control the receiver to track and dock
with a stationary drogue. In [9], the NZSP-CRW control structure was modified to remove its drawback of being unable to track a moving drogue, yielding the reference-observer-based tracking controller (ROTC). ROTC does not require a drogue model or prior knowledge of the drogue position. In [7,10,12], the receiver nonlinear 6-DOF rigid model is transformed into affine nonlinear form to facilitate nonlinear controller design; nonlinear control methods such as active disturbance rejection control (ADRC) [18] and sliding mode control (SMC) [17] are then applied to receiver docking control. However, some problems remain in existing work, which can be summarized as follows: 1) Controlled object. The probe is located on the receiver nose, at some distance from the barycenter. The task of the AAR docking process is to control the probe to insert into the drogue, but in most studies the probe is controlled indirectly by controlling the barycenter position, which is not precise because the effect of attitude variation on the probe is ignored [8]. 2) Reference signal. The light drogue, affected by multiple perturbations, moves rapidly [35], but the future information of the drogue motion is not considered in the reference signal in most studies. This may cause a tracking lag problem because the system dynamic response of the receiver is much slower than the drogue motion [12].
Fig. 1. Configuration of the hose-drogue aerial refueling system.
For the first problem, to the best of our knowledge, relatively little specialized research [7,8] on probe direct control has been conducted. In [8], the probe is controlled directly using a very simple linear controller, and the disturbance rejection ability is limited. In [7], the probe is controlled via translational motion and rotational motion; however, the barycenter variation during attitude variation is omitted, which decreases the tracking precision. For the second problem, preview control [12,21,22] is a solution. For AAR, preview control uses the prediction of the drogue motion as the reference signal, which can partly compensate for the tracking lag. The preview time, which determines the prediction length of the drogue motion, is a significant parameter of preview control. In [12], a fuzzy logic controller is used to determine the preview time. Nevertheless, the fuzzy logic controller depends mainly on human intuition, which means the logic may not be optimal.

With the development of machine learning (ML) technology, many tough problems have been solved. Reinforcement learning (RL) [23–32] and deep learning (DL) [33–40] are two key methods in the field of ML. RL is well suited to the preview time selection problem mentioned above because RL can obtain an optimal control policy for an agent by utilizing the response from an unknown system [27]. By combining DL and RL, deep reinforcement learning (DRL) [25,29] is designed to achieve better performance than RL, but DRL is hard to train due to its relatively complicated DL model.

According to the above analyses, we propose a novel docking controller based on probe direct control and a learning-based preview method. Firstly, the controlled object is transformed from the barycenter to the probe, and a suitable control scheme is proposed for probe direct control. ROTC [9] is designed as the outer-loop (probe loop and attitude loop) controller because the probe dynamics cannot be transformed into affine nonlinear form. Then, to improve the disturbance rejection ability, a high-order sliding mode (HOSM) controller [17] is designed for the inner loop (angular rate loop) and the ground speed loop. Secondly, the learning-based preview method is introduced to solve the tracking lag problem: the prediction of the drogue motion is considered in the reference signal, and a novel learning algorithm that contains DL and RL is proposed to select the preview time. The proposed learning algorithm combines DL and RL spatially rather than structurally as in DRL, and is named deep learning and reinforcement learning (DLRL). A novel preview index is also proposed to suit DLRL. Finally, through the combination of the probe direct controller and the learning-based preview method, the proposed docking controller can control the probe to track the drogue precisely and quickly.
The main contributions of this paper can be summarized as follows: 1) The controlled object is transformed from the receiver barycenter to the probe, and a suitable control scheme designed by combining ROTC and HOSM is proposed for probe direct control. 2) A learning-based preview method is introduced to solve the tracking lag problem: the prediction of the drogue motion is considered in the reference signal, and a novel learning algorithm named DLRL is proposed to obtain an optimal preview time selection policy. 3) Based on the proposed probe direct controller and learning-based preview method, a novel docking controller with high tracking precision, good disturbance rejection ability, and a simple control structure is proposed for the AAR docking process.

The paper is organized as follows. The problem formulation, including the 6-DOF rigid model of the receiver and the probe dynamics, is presented in Sec. 2. In Sec. 3, the detailed design process of the novel docking controller for AAR is illustrated. Simulations, comparisons, and analyses are shown in Sec. 4. The paper ends with concluding remarks in Sec. 5.

2. Receiver modeling with probe dynamics

The 6-DOF rigid model of the receiver, including the wind effect and probe dynamics, is described by the following translational and rotational dynamics equations [20]:
$$
\begin{cases}
\dot{x}_b = V\big[\cos\beta\cos\alpha\cos\theta\cos\psi + \sin\beta(\sin\phi\sin\theta\cos\psi - \cos\phi\sin\psi) + \cos\beta\sin\alpha(\sin\phi\sin\psi + \cos\phi\sin\theta\cos\psi)\big] \\
\dot{y}_b = V\big[\cos\beta\cos\alpha\cos\theta\sin\psi + \sin\beta(\sin\phi\sin\theta\sin\psi + \cos\phi\cos\psi) + \cos\beta\sin\alpha(\cos\phi\sin\theta\sin\psi - \sin\phi\cos\psi)\big] \\
\dot{z}_b = V\big[-\cos\beta\cos\alpha\sin\theta + \sin\beta\sin\phi\cos\theta + \cos\beta\sin\alpha\cos\phi\cos\theta\big]
\end{cases} \tag{1}
$$

$$
\begin{cases}
\dot{x}_p = \dot{x}_b + \cos\theta\cos\psi\,(q z_{bp} - r y_{bp}) + (\sin\theta\cos\psi\sin\phi - \sin\psi\cos\phi)(r x_{bp} - p z_{bp}) + (\sin\theta\cos\psi\cos\phi + \sin\psi\sin\phi)(p y_{bp} - q x_{bp}) \\
\dot{y}_p = \dot{y}_b + \cos\theta\sin\psi\,(q z_{bp} - r y_{bp}) + (\sin\theta\sin\psi\sin\phi + \cos\psi\cos\phi)(r x_{bp} - p z_{bp}) + (\sin\theta\sin\psi\cos\phi - \cos\psi\sin\phi)(p y_{bp} - q x_{bp}) \\
\dot{z}_p = \dot{z}_b - \sin\theta\,(q z_{bp} - r y_{bp}) + \cos\theta\sin\phi\,(r x_{bp} - p z_{bp}) + \cos\theta\cos\phi\,(p y_{bp} - q x_{bp})
\end{cases} \tag{2}
$$

$$
\begin{cases}
\dot{V}_k = g(\sin\beta\sin\phi\cos\theta + \cos\beta\sin\alpha\cos\phi\cos\theta - \cos\alpha\cos\beta\sin\theta) + \frac{1}{m}(-D + T\cos\alpha\cos\beta) \\
\dot{\beta}_k = p\sin\alpha - r\cos\alpha + \frac{g}{V}(\cos\alpha\sin\beta\sin\theta + \cos\beta\sin\phi\cos\theta - \sin\alpha\sin\beta\cos\phi\cos\theta) + \frac{1}{mV}(-S - T\cos\alpha\sin\beta) \\
\dot{\alpha}_k = q - r\sin\alpha\tan\beta - p\cos\alpha\tan\beta + \frac{g}{V}(\sec\beta\sin\alpha\sin\theta + \cos\alpha\sec\beta\cos\phi\cos\theta) + \frac{\sec\beta}{mV}(-L - T\sin\alpha)
\end{cases} \tag{3}
$$

$$
\begin{cases}
\dot{\phi} = p + q\sin\phi\tan\theta + r\cos\phi\tan\theta \\
\dot{\theta} = q\cos\phi - r\sin\phi \\
\dot{\psi} = (q\sin\phi + r\cos\phi)\sec\theta
\end{cases} \tag{4}
$$
$$
\begin{cases}
\dot{p} = \dfrac{1}{I_x I_z - I_{xz}^2}\Big[(I_y I_z - I_z^2 - I_{xz}^2)rq + (I_x I_{xz} - I_y I_{xz} - I_z I_{xz})pq + I_z L + I_{xz} N\Big] \\
\dot{q} = \dfrac{1}{I_y}\Big[(I_z - I_x)pr - I_{xz}p^2 + I_{xz}r^2 + M\Big] \\
\dot{r} = \dfrac{1}{I_x I_z - I_{xz}^2}\Big[(I_x^2 - I_x I_y + I_{xz}^2)pq - (I_x I_{xz} - I_y I_{xz} - I_z I_{xz})rq + I_{xz} L + I_x N\Big]
\end{cases} \tag{5}
$$

$$
\begin{cases}
V = V_k - u_w \\
\beta = \beta_k - \dfrac{v_w}{V} \\
\alpha = \alpha_k - \dfrac{w_w}{V}
\end{cases} \tag{6}
$$

where V_k is the ground speed; β_k, α_k are the flow angles (sideslip angle and angle of attack) without wind disturbances; p, q, r are the angular rates in the body frame; φ, θ, ψ are the Euler attitude angles; x_b, y_b, z_b are the barycenter position in the inertial frame; x_p, y_p, z_p are the probe position in the inertial frame; x_bp, y_bp, z_bp are the probe position relative to the barycenter in the body frame; u_w, v_w, w_w are the wind speed components transformed to the body frame; V, β, α are the airspeed and flow angles considering wind disturbances; L, D, S are the lift, drag, and lateral force; L, M, N are the moments about the body-frame axes; and T is the thrust.

The wind disturbances are transformed from the inertial frame to the body frame of the receiver, and their effect is added to the flow angles and the airspeed, as shown in (6). From (2), the probe motion consists of two parts: the translational motion of the barycenter and the rotational motion due to the attitude. Because of the presence of the angular rates, the probe dynamics cannot be transformed into affine nonlinear form, so nonlinear control methods are difficult to apply [10]. We therefore linearize the receiver model with probe dynamics, from the probe loop (2) to the attitude loop (3)–(4), for the ROTC design. Before the linearization, the receiver model needs to be trimmed. Considering that the mean value of the tanker vortex is nonzero and changes the trim values used to obtain the stability derivatives during linearization, the trim values are corrected under the tanker vortex using (7). The effect of the tanker vortex on the receiver is shown in Fig. 2.

Fig. 2. Effect of tanker vortex for the receiver.

$$
\begin{cases}
T\cos\alpha = D + mg\sin(\omega_d + \gamma) \\
L = mg\cos(\omega_d + \gamma) - T\sin\alpha \\
C_M = 0 \\
\omega_d = \dfrac{W_d}{V}
\end{cases} \tag{7}
$$

where C_M denotes the pitch moment; ω_d denotes the tanker vortex angle; and γ denotes the flight path angle. It is obvious that ω_d and γ have a similar effect on the equilibrium values. Under the tanker vortex, the receiver is trimmed as if in a climbing state [8]. Then, the linear model of the probe loop and attitude loop can be obtained:

$$
\begin{cases}
\dot{\mathbf{x}}_r = \mathbf{A}_r\mathbf{x}_r + \mathbf{B}_r\mathbf{u}_r \\
\mathbf{y}_r = \mathbf{C}_r\mathbf{x}_r
\end{cases} \tag{8}
$$

where y_r = [y_p z_p]^T denotes the output vector, x_r = [β α φ θ ψ y_p z_p]^T denotes the state vector, and u_r = [p q r]^T denotes the control vector.

The angular rate loop (5) is transformed into affine nonlinear form, which is written as

$$
\dot{\mathbf{x}}_\omega = \mathbf{f}_\omega + \mathbf{B}_\omega\mathbf{u}_{act} \tag{9}
$$

where x_ω = [p q r]^T denotes the angular rates; u_act = [δ_a δ_e δ_r]^T denotes the actuators of the receiver; and f_ω, B_ω denote the lumped disturbances and the control input matrix of the angular rate loop.

Note that V_k and x_p are controlled by the throttle; thus, a separate HOSM controller is used to control V_k. The ground speed loop is transformed into affine nonlinear form, which is written as

$$
\dot{V}_k = f_{V_k} + B_{V_k}\delta_T \tag{10}
$$

where f_{V_k} and B_{V_k} denote the lumped disturbance and the control input gain of the ground speed loop.
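The probe kinematic coupling in (2) is simply the barycenter velocity plus the body-frame cross product ω × r_bp rotated into the inertial frame. Below is a minimal NumPy sketch of that relation; the function and variable names are illustrative and not taken from the paper's code.

```python
import numpy as np

def probe_velocity(v_b, euler, omega, r_bp):
    """Probe inertial velocity per Eq. (2): v_p = v_b + R_ib(phi, theta, psi) @ (omega x r_bp).

    v_b   : barycenter velocity in the inertial frame, shape (3,)
    euler : (phi, theta, psi) Euler angles [rad]
    omega : (p, q, r) body angular rates [rad/s]
    r_bp  : probe position relative to the barycenter in the body frame, shape (3,)
    """
    phi, theta, psi = euler
    cph, sph = np.cos(phi), np.sin(phi)
    cth, sth = np.cos(theta), np.sin(theta)
    cps, sps = np.cos(psi), np.sin(psi)
    # Body-to-inertial rotation matrix; its entries are exactly the
    # trigonometric coefficients that appear in Eq. (2).
    R_ib = np.array([
        [cth * cps, sth * cps * sph - sps * cph, sth * cps * cph + sps * sph],
        [cth * sps, sth * sps * sph + cps * cph, sth * sps * cph - cps * sph],
        [-sth,      cth * sph,                   cth * cph],
    ])
    return np.asarray(v_b) + R_ib @ np.cross(omega, r_bp)
```

For example, a pure pitch rate q with the probe mounted ahead of the barycenter (x_bp > 0) produces a vertical probe velocity even when the barycenter is not moving, which is exactly the attitude effect that probe direct control exploits.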
3. Novel docking controller

For the general trajectory tracking control problem of an aircraft, the controlled object is the barycenter. For AAR docking, however, the target is to control the probe to insert into the drogue. The probe is located on the receiver nose, at some distance from the receiver barycenter; thus, it is affected by the rotational motion and the translational motion of the receiver at the same time [7], whereas general barycenter control only considers the translational motion. Furthermore, the system dynamic response of the receiver is much slower than that of the drogue [12], which causes a tracking lag problem if the future information of the drogue motion is not considered in the reference signal. Aiming at these problems, we propose a novel docking controller based on probe direct control and a learning-based preview method. As shown in Fig. 3, by transforming the controlled object from the barycenter to the probe, the translational motion and the rotational motion of the receiver can be combined when tracking the drogue. To compensate for the tracking lag, the learning-based preview method is proposed to add the prediction of the drogue motion to the reference signal, and the novel DLRL-based preview time controller is proposed to generate the preview time adaptively. The detailed design process of the proposed controller is illustrated in this section.

3.1. Probe direct controller

Considering the actual flight conditions in the AAR docking process, the following assumptions are given first.

Assumption 1. All the state variables of the receiver (x_r, x_ω and V_k) can be obtained through direct or indirect measurement.

Assumption 2. All the "lumped disturbances" of the receiver (f_ω and f_{V_k}) are bounded; they are differentiable and their derivatives are bounded.
Fig. 3. Schematic diagram of the receiver motion in tracking drogue.
Fig. 4. Structure of the proposed probe direct controller.
As shown in Fig. 4, the proposed probe direct controller consists of two parts: ROTC and HOSM. ROTC is designed as the outer-loop (probe loop and attitude loop) controller, which uses the angular rates as control inputs and the probe position as output, so the probe is treated directly as the controlled object. HOSM is designed as the inner-loop (angular rate loop) controller and the ground speed loop controller to reject the main disturbances. The proposed control structure is simple because it has only three control loops: the outer loop, the inner loop, and the ground speed loop.

3.1.1. Reference-observer-based tracking controller

The reference-observer-based tracking controller (ROTC) [9] contains two parts: a reference observer and an LQR controller. The reference observer (ROB) estimates the reference control inputs and the reference state vector from the reference output vector; its purpose is to generate the reference states that the receiver should follow so that it can track the reference trajectory. The LQR controller is designed as a state-feedback PI controller placed after the ROB; it controls the receiver to track the states estimated by the ROB, while the control inputs estimated by the ROB are used as feedforward control. The dynamics of the ROB are defined as
$$
\dot{\hat{\mathbf{X}}}_r = \begin{bmatrix}\dot{\hat{\mathbf{x}}}_r \\ \dot{\hat{\mathbf{u}}}_r\end{bmatrix} = \begin{bmatrix}\mathbf{A}_r & \mathbf{B}_r \\ \mathbf{0} & \mathbf{0}\end{bmatrix}\begin{bmatrix}\hat{\mathbf{x}}_r \\ \hat{\mathbf{u}}_r\end{bmatrix} + \begin{bmatrix}\mathbf{L}_1 \\ \mathbf{L}_2\end{bmatrix}\big(\mathbf{y}_r^* - \mathbf{C}_r\hat{\mathbf{x}}_r\big) \tag{11}
$$
where 0 represents null matrices of appropriate dimensions; x̂_r denotes the reference state vector estimated by the ROB; û_r denotes the reference control inputs estimated by the ROB; y_r^* denotes the reference trajectory; and L = [L_1^T L_2^T]^T denotes the design parameter of the ROB, which is calculated using the LQR method. Then, the errors are defined as
$$
\mathbf{e}_c = \hat{\mathbf{x}}_r - \mathbf{x}_r,\qquad \mathbf{e}_a = \mathbf{x}_r^* - \hat{\mathbf{x}}_r,\qquad \mathbf{e}_r = \mathbf{x}_r^* - \mathbf{x}_r,\qquad \mathbf{u}_c = \mathbf{u}_r - \hat{\mathbf{u}}_r \tag{12}
$$
$$
\dot{\mathbf{e}}_c = \mathbf{A}_r\mathbf{e}_c - \mathbf{B}_r\mathbf{u}_c + \mathbf{L}_1\mathbf{e}_o
$$

where e_c denotes the tracking error of the estimated state vector; e_o = y_r^* − C_r x̂_r denotes the estimation error of the reference output vector; e_a denotes the estimation error of the reference state vector; e_r denotes the tracking error of the reference state vector; u_c denotes the control input of the LQR controller; and u_r denotes the ROTC control input, which consists of the LQR control input u_c and the feedforward control input û_r.
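The text states only that the ROB gain L is "calculated using the LQR method". As one possible reading of that statement, the sketch below builds the augmented matrices of (11) and solves the dual (filter) Riccati equation with SciPy; the weighting matrices Q and R are placeholders, not the paper's actual values.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

def rob_gain(A_r, B_r, C_r, Q, R):
    """Observer gain L = [L1; L2] for the augmented system of Eq. (11),
    obtained via LQR/Kalman duality (an assumed interpretation of the
    paper's 'calculated using the LQR method').

    Q must be (n+m)x(n+m) and R pxp, where n = states, m = reference
    inputs, p = outputs.
    """
    n, m = B_r.shape
    p = C_r.shape[0]
    # Augmented dynamics of X = [x_r; u_r]; see the block matrix in (11).
    A_bar = np.block([[A_r, B_r], [np.zeros((m, n)), np.zeros((m, m))]])
    C_bar = np.hstack([C_r, np.zeros((p, m))])
    # Filter algebraic Riccati equation through the dual pair (A_bar^T, C_bar^T).
    P = solve_continuous_are(A_bar.T, C_bar.T, Q, R)
    L = P @ C_bar.T @ np.linalg.inv(R)
    return L[:n, :], L[n:, :]   # L1 (state part), L2 (input part)
```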
To improve the disturbance rejection ability of the outer loop, integral action is introduced. A new state vector is defined as

$$
\dot{\mathbf{z}} = \hat{\mathbf{y}}_r - \mathbf{y}_r = \mathbf{C}_r(\hat{\mathbf{x}}_r - \mathbf{x}_r) = \mathbf{C}_r\mathbf{e}_c \tag{13}
$$
Then, the control law of the LQR controller is
$$
\mathbf{u}_c = \begin{bmatrix}\mathbf{K}_I & \mathbf{K}_P\end{bmatrix}\begin{bmatrix}\mathbf{z} \\ \mathbf{e}_c\end{bmatrix} \tag{14}
$$
where K = [K_I K_P] denotes the design parameter of the LQR controller and is calculated using the LQR method. Finally, the total control law of ROTC consists of the LQR control law u_c and the feedforward control input û_r. ROTC generates reference angular rates according to the reference probe position; these reference angular rates account for the rotational and translational motion of the receiver simultaneously because the controlled object is the probe.

3.1.2. High order sliding mode controller

In fact, the docking controller could be completed using ROTC alone, but the linear control method is insufficient in terms of disturbance rejection. The HOSM controller has good anti-disturbance ability and tracking performance [17]. Thus, HOSM controllers are designed for the angular rate loop and the ground speed loop to reject the main disturbances of the receiver, because the wind disturbances and the modeling error mainly exist in these two loops. Firstly, we define the tracking errors
$$
\mathbf{e}_\omega = \mathbf{u}_r - \mathbf{x}_\omega,\qquad e_{V_k} = V_k^* - V_k \tag{15}
$$
where eω denotes tracking error of angular rates and e V k denotes tracking error of ground speed. For the angular rate loop, based on the HOSM method and super-twisting algorithm, the control law is
$$
\mathbf{u}_{act} = \mathbf{B}_\omega^{-1}\Big(-\hat{\mathbf{f}}_\omega + \dot{\bar{\mathbf{u}}}_r + \boldsymbol{\beta}\,\mathrm{sig}(\mathbf{s})^{0.5} + \int\boldsymbol{\alpha}\,\mathrm{sign}(\mathbf{s})\,dt + \mathbf{B}_r^T\mathbf{e}_c\Big) \tag{16}
$$
where the sliding manifold vector is designed as s = e_ω; f̂_ω = f_ω + ε_0 is the estimate of f_ω given by the HOSMO; $\dot{\bar{\mathbf{u}}}_r = \dot{\mathbf{u}}_r + \boldsymbol{\varepsilon}_1$ is the estimate of the first derivative of the command signal u_r; the symbol ε denotes an estimation error; and α and β are the control parameters to be designed. The lumped disturbance f_ω is taken as an extended state, and the augmented angular rate loop dynamics are obtained as
$$
\begin{cases}
\dot{\mathbf{x}}_\omega = \mathbf{f}_\omega + \mathbf{B}_\omega\mathbf{u}_{act} \\
\dot{\mathbf{f}}_\omega = \mathbf{h}_\omega
\end{cases} \tag{17}
$$

where h_ω is the derivative of the lumped disturbance f_ω. Then, the HOSMO is established for the augmented angular rate loop dynamics:

$$
\begin{cases}
\dot{\hat{\mathbf{x}}}_\omega = \hat{\mathbf{f}}_\omega + \mathbf{B}_\omega\mathbf{u}_{act} + 1.5\,\mathbf{C}_\omega^{0.5}\,\mathrm{sig}(\mathbf{x}_\omega - \hat{\mathbf{x}}_\omega)^{0.5} \\
\dot{\hat{\mathbf{f}}}_\omega = 1.1\,\mathbf{C}_\omega\,\mathrm{sign}(\mathbf{x}_\omega - \hat{\mathbf{x}}_\omega)
\end{cases} \tag{18}
$$

where $\mathrm{sig}(\mathbf{x}_\omega - \hat{\mathbf{x}}_\omega)^{\lambda} = [\,|p-\hat{p}|^{\lambda}\mathrm{sign}(p-\hat{p})\ \ |q-\hat{q}|^{\lambda}\mathrm{sign}(q-\hat{q})\ \ |r-\hat{r}|^{\lambda}\mathrm{sign}(r-\hat{r})\,]^T$ and C_ω = diag([C_ω1 C_ω2 C_ω3]) is the observer design parameter.

For the ground speed loop, the control law is

$$
\delta_T = B_{V_k}^{-1}\Big(-\hat{f}_{V_k} + H_{V_k} e_{V_k} + \dot{\bar V}_k^*\Big) \tag{19}
$$

where H_{V_k} is a positive constant that decides the convergence rate; δ_T is the throttle opening; $\dot{\bar V}_k^* = \dot V_k^* + \varepsilon_2$ is the estimate of the first derivative of the command signal V_k^*; and f̂_{V_k} = f_{V_k} + ε_3 is the estimate of f_{V_k} given by the HOSMO. The lumped disturbance f_{V_k} is taken as an extended state, and the augmented ground speed loop dynamics are obtained as

$$
\begin{cases}
\dot V_k = f_{V_k} + B_{V_k}\delta_T \\
\dot f_{V_k} = h_{V_k}
\end{cases} \tag{20}
$$

where h_{V_k} is the derivative of the lumped disturbance f_{V_k}. Then, the HOSMO is established for the augmented ground speed loop dynamics:

$$
\begin{cases}
\dot{\hat V}_k = \hat f_{V_k} + B_{V_k}\delta_T + 1.5\,C_{V_k}^{0.5}\,\mathrm{sig}(V_k - \hat V_k)^{0.5} \\
\dot{\hat f}_{V_k} = 1.1\,C_{V_k}\,\mathrm{sign}(V_k - \hat V_k)
\end{cases} \tag{21}
$$

where $\mathrm{sig}(V_k - \hat V_k)^{\lambda} = |V_k - \hat V_k|^{\lambda}\mathrm{sign}(V_k - \hat V_k)$ and C_{V_k} is the observer design parameter.

Theorem 3. Under Assumptions 1 and 2, using the observers (11), (18), (21) and the controllers (14), (16), (19), the asymptotic convergence of the control outputs to the commands y_r^*, V_k^* and the asymptotic stability of the closed-loop system are guaranteed if the observer and controller parameters are selected properly.

3.1.3. Stability proof

Before proving Theorem 3, note that the estimation error e_o of the ROB converges to zero and is bounded during the convergence procedure according to [9]. The estimates ($\dot{\bar V}_k^*$, $\dot{\bar{\mathbf{u}}}_r$) of the first derivatives of the command signals (V_k^*, u_r) are obtained from a tracking differentiator (TD) [18]; the estimation errors (ε_1, ε_2) converge to zero uniformly and are bounded [18]. In addition, the estimation errors (ε_0, ε_3) of the HOSMO converge to zero after a finite-time transient and remain bounded throughout the convergence procedure, as demonstrated in [17].

Firstly, the following Lyapunov function is considered for the ground speed loop:

$$
V_0 = 0.5\,e_{V_k}^2 \tag{22}
$$

Its time derivative is

$$
\begin{aligned}
\dot V_0 &= e_{V_k}\dot e_{V_k} = e_{V_k}\big(\dot V_k^* - f_{V_k} - B_{V_k}\delta_T\big) \\
&= e_{V_k}\Big(\dot V_k^* - f_{V_k} - B_{V_k}B_{V_k}^{-1}\big(-\hat f_{V_k} + H_{V_k}e_{V_k} + \dot{\bar V}_k^*\big)\Big) \\
&= -e_{V_k}H_{V_k}e_{V_k} + e_{V_k}(\varepsilon_3 - \varepsilon_2) \\
&\le -H_{V_k}|e_{V_k}|^2 + |e_{V_k}|\Delta_0 = -|e_{V_k}|\big(H_{V_k}|e_{V_k}| - \Delta_0\big)
\end{aligned} \tag{23}
$$

where Δ_0 is the residual estimation error of the HOSMO and TD, and it is bounded. Thus, V̇_0 is negative if H_{V_k} is selected properly, and therefore e_{V_k} converges to zero asymptotically. Then, for the outer loop, a Lyapunov function is designed as
$$
V_1 = 0.5\,\mathbf{e}_r^T\mathbf{e}_r \tag{24}
$$

Its time derivative is

$$
\begin{aligned}
\dot V_1 &= \mathbf{e}_r^T\dot{\mathbf{e}}_r = \mathbf{e}_r^T(\dot{\mathbf{e}}_c + \dot{\mathbf{e}}_a) = \mathbf{e}_r^T(\mathbf{A}_r\mathbf{e}_c - \mathbf{B}_r\mathbf{u}_c + \mathbf{L}_1\mathbf{e}_o + \dot{\mathbf{e}}_a) \\
&= \mathbf{e}_r^T\Big(\mathbf{A}_r\mathbf{e}_c - \mathbf{B}_r[\mathbf{K}_I\ \mathbf{K}_P]\begin{bmatrix}\int\mathbf{C}_r\mathbf{e}_c\,dt \\ \mathbf{e}_c\end{bmatrix} + \mathbf{L}_1\mathbf{e}_o + \dot{\mathbf{e}}_a\Big) \\
&= \mathbf{e}_r^T\Big[-(\mathbf{B}_r\mathbf{K}_P-\mathbf{A}_r)\mathbf{e}_r - \mathbf{B}_r\mathbf{C}_r\mathbf{K}_I\!\int\!\mathbf{e}_r\,dt + \mathbf{B}_r\mathbf{C}_r\mathbf{K}_I\!\int\!\mathbf{e}_a\,dt + (\mathbf{B}_r\mathbf{K}_P-\mathbf{A}_r)\mathbf{e}_a + \mathbf{L}_1\mathbf{e}_o + \dot{\mathbf{e}}_a\Big] \\
&\le -\|\mathbf{e}_r\|\,\big\|\mathbf{B}_r\mathbf{K}_P-\mathbf{A}_r+\mathbf{B}_r\mathbf{C}_r\mathbf{K}_I t\big\|\,\|\mathbf{e}_r\| + \|\mathbf{e}_r\|\Delta_r = -\|\mathbf{e}_r\|\Big(\big\|\mathbf{B}_r\mathbf{K}_P-\mathbf{A}_r+\mathbf{B}_r\mathbf{C}_r\mathbf{K}_I t\big\|\,\|\mathbf{e}_r\| - \Delta_r\Big)
\end{aligned} \tag{25}
$$
where Δ_r is the residual estimation error of the ROB and is bounded according to [9]. Thus, V̇_1 is negative if K is selected properly, and therefore e_r converges to zero asymptotically. For the angular rate loop, a Lyapunov function is designed as
$$
V_2 = 0.5\,\mathbf{e}_r^T\mathbf{e}_r + 0.5\,\mathbf{e}_\omega^T\mathbf{e}_\omega \tag{26}
$$
Its time derivative is
$$
\begin{aligned}
\dot V_2 &= \mathbf{e}_r^T\dot{\mathbf{e}}_r + \mathbf{e}_\omega^T\dot{\mathbf{e}}_\omega \\
&= \mathbf{e}_r^T(\mathbf{A}_r\mathbf{e}_c - \mathbf{B}_r\mathbf{u}_c + \mathbf{L}_1\mathbf{e}_o + \dot{\mathbf{e}}_a) + \mathbf{e}_\omega^T(\dot{\mathbf{u}}_r - \mathbf{f}_\omega - \mathbf{B}_\omega\mathbf{u}_{act}) \\
&= \mathbf{e}_r^T\Big(\mathbf{A}_r\mathbf{e}_c - \mathbf{B}_r[\mathbf{K}_I\ \mathbf{K}_P]\begin{bmatrix}\int\mathbf{C}_r\mathbf{e}_c\,dt \\ \mathbf{e}_c\end{bmatrix} + \mathbf{L}_1\mathbf{e}_o + \dot{\mathbf{e}}_a\Big) + \mathbf{e}_r^T\mathbf{B}_r\mathbf{e}_\omega \\
&\quad + \mathbf{e}_\omega^T\Big(\dot{\mathbf{u}}_r - \mathbf{f}_\omega - \mathbf{B}_\omega\mathbf{B}_\omega^{-1}\big(-\hat{\mathbf{f}}_\omega + \dot{\bar{\mathbf{u}}}_r + \boldsymbol{\beta}\,\mathrm{sig}(\mathbf{s})^{0.5} + \textstyle\int\boldsymbol{\alpha}\,\mathrm{sign}(\mathbf{s})\,dt + \mathbf{B}_r^T\mathbf{e}_r\big)\Big) \\
&= \mathbf{e}_r^T\Big[-(\mathbf{B}_r\mathbf{K}_P-\mathbf{A}_r)\mathbf{e}_r - \mathbf{B}_r\mathbf{C}_r\mathbf{K}_I\!\int\!\mathbf{e}_r\,dt + \mathbf{B}_r\mathbf{C}_r\mathbf{K}_I\!\int\!\mathbf{e}_a\,dt + (\mathbf{B}_r\mathbf{K}_P-\mathbf{A}_r)\mathbf{e}_a + \mathbf{L}_1\mathbf{e}_o + \dot{\mathbf{e}}_a\Big] \\
&\quad + \mathbf{e}_\omega^T\Big(-\boldsymbol{\beta}\,\mathrm{sig}(\mathbf{e}_\omega)^{0.5} - \textstyle\int\boldsymbol{\alpha}\,\mathrm{sign}(\mathbf{e}_\omega)\,dt\Big) + \mathbf{e}_\omega^T(\boldsymbol{\varepsilon}_0 - \boldsymbol{\varepsilon}_1) \\
&\le -\|\mathbf{e}_r\|\Big(\big\|\mathbf{B}_r\mathbf{K}_P-\mathbf{A}_r+\mathbf{B}_r\mathbf{C}_r\mathbf{K}_I t\big\|\,\|\mathbf{e}_r\| - \Delta_r\Big) - \|\mathbf{e}_\omega\|\Big(\|\boldsymbol{\beta}\|\,\|\mathbf{e}_\omega\|^{0.5} + \big\|\textstyle\int\boldsymbol{\alpha}\,dt\big\| - \Delta_1\Big)
\end{aligned} \tag{27}
$$
where Δ_1 is the residual estimation error of the HOSMO and TD, and it is bounded. Thus, V̇_2 is negative if α and β are selected properly, and therefore e_ω converges to zero asymptotically. For the whole control system, a Lyapunov function is designed as
$$
V_3 = 0.5\,e_{V_k}^2 + 0.5\,\mathbf{e}_r^T\mathbf{e}_r + 0.5\,\mathbf{e}_\omega^T\mathbf{e}_\omega \tag{28}
$$
Its time derivative satisfies
$$
\begin{aligned}
\dot V_3 &= e_{V_k}\dot e_{V_k} + \mathbf{e}_r^T\dot{\mathbf{e}}_r + \mathbf{e}_\omega^T\dot{\mathbf{e}}_\omega \\
&\le -|e_{V_k}|\big(H_{V_k}|e_{V_k}| - \Delta_0\big) - \|\mathbf{e}_r\|\Big(\big\|\mathbf{B}_r\mathbf{K}_P-\mathbf{A}_r+\mathbf{B}_r\mathbf{C}_r\mathbf{K}_I t\big\|\,\|\mathbf{e}_r\| - \Delta_r\Big) - \|\mathbf{e}_\omega\|\Big(\|\boldsymbol{\beta}\|\,\|\mathbf{e}_\omega\|^{0.5} + \big\|\textstyle\int\boldsymbol{\alpha}\,dt\big\| - \Delta_1\Big)
\end{aligned} \tag{29}
$$
Thus, V̇_3 is negative if the controller design parameters H_{V_k}, K, α and β are selected properly. Therefore, the errors e_{V_k}, e_r and e_ω asymptotically converge to zero; that is, V_k converges to V_k^*, x_ω converges to u_r, and x_r converges to x_r^* asymptotically.

3.2. Learning-based preview method

Preview control has attracted much attention for its various applications, such as autonomous vehicle guidance [21], flight control [12], and manufacturing control [22]. For the drogue tracking problem in AAR, preview control, as an extra degree of freedom (DOF) available in controller design, can yield better performance [12]. If the current drogue position y*(t) is set as the reference signal, then
Fig. 5. Schematic diagram of tracking status.
$$
\begin{cases}
\mathbf{y}(t) + \mathbf{e}_k = \mathbf{y}^*(t - \lambda_d) \\
\mathbf{e}_t = \mathbf{e}_k + \mathbf{y}^*(t) - \mathbf{y}^*(t - \lambda_d) = \mathbf{e}_k + \mathbf{e}_d
\end{cases} \tag{30}
$$
where y(t) denotes the current probe position; e_k denotes the control error; e_d denotes the tracking lag error; e_t denotes the total tracking error, which contains e_k and e_d; and y*(t − λ_d) denotes the drogue position λ_d earlier. λ_d denotes the tracking lag time, i.e., the receiver response time, which contains the sensor delay, program run time, and receiver action time. Thus, the tracking lag problem occurs. In order to compensate for e_d, the preview reference signal is set to
$$
\mathbf{y}_p(t) + \mathbf{e}_b = \mathbf{y}^*(t + \lambda_p) \tag{31}
$$
where y_p(t) denotes the preview reference signal; y*(t + λ_p) denotes the drogue position λ_p later; e_b denotes the drogue motion prediction error; and λ_p is the preview time. If the preview method is applied, the reference signal is transformed from y*(t) to y_p(t), and then
$$
\mathbf{y}(t) = \mathbf{y}_p(t - \lambda_d) - \mathbf{e}_k = \mathbf{y}^*(t + \lambda_p - \lambda_d) - \mathbf{e}_k - \mathbf{e}_b \tag{32}
$$
Thus, the tracking delay λ_d is compensated by the preview time λ_p if λ_p is set properly. λ_d is time-varying because the receiver response time differs with the receiver state, so the preview time λ_p is a significant adaptive parameter of the preview method. In this section, a learning-based preview time controller that combines DL and RL spatially is proposed to obtain an optimal preview time selection policy.

3.2.1. Q-learning based reinforcement learning

The optimal preview time selection policy is hard to obtain because the whole system, which consists of the hose-drogue assembly model, the receiver nonlinear 6-DOF rigid model, the probe direct controller, the wind perturbation model, and so on, is too complicated. RL, which arose from machine learning and computational intelligence, can obtain an optimal control policy for an agent by using the response from an unknown system [27]. There are many RL algorithms, such as Q-learning (QL) [28], deep Q-learning (DQN) [29], actor-critic (AC) [30], policy gradient (PG) [31], and adaptive dynamic programming (ADP) [32]. As a traditional algorithm, QL is easy to train because of its discretized states and actions. Thus, a QL based preview time selection method is proposed.

Aiming at the characteristics of the AAR docking process, a novel index is proposed to reflect the tracking status. As shown in Fig. 5, in the left half-plane of the dashed line, y_t lags y_d, and in the right half-plane, y_t leads y_d. The degree of lag or lead is related to the gradient k of y_d and the tracking error e = y_t − y_d. We define a new preview index e_p:
$$
e_p = e\,\mathrm{sgn}(k) \tag{33}
$$
Table 1
Algorithm 1: Training algorithm for QL.
1   Initialize the Q table Q(s, a) arbitrarily
2   for episode = 1, ..., N
3       Initialize an episode
4       for t = 1, ..., T
5           Observe the current state s_t
6           Select an action a_{R_t} according to the Q table Q_{t-1}(s_t, :) (ε-greedy policy)
7           Calculate the previous reward r_{t-1}(e_{s_t}, e_{s_{t-1}}) according to s_t and s_{t-1}
8           Calculate the target Q-value according to (36)
9           Update the Q table according to (37)
10          Perform a_{R_t} in the environment
11      end for
12  end for
where e_p > 0 means y_t leads y_d; e_p < 0 means y_t lags behind y_d; and |e_p| represents the degree of lag or lead. As an index of QL, e_p needs to be discretized into e_s. Neither lag nor lead is desired, which means e_s = 0 is the best situation. Thus, the reward signal is designed as
$$
r_{t-1}(e_{s_t}, e_{s_{t-1}}) =
\begin{cases}
|e_{s_{t-1}}| - |e_{s_t}|, & e_{s_{t-1}} \neq 0 \ \vee\ e_{s_t} \neq 0 \\
1, & e_{s_{t-1}} = 0 \ \wedge\ e_{s_t} = 0
\end{cases} \tag{34}
$$
where e_{s_{t-1}} denotes the previous e_s; e_{s_t} denotes the current e_s; and s denotes the state of QL. e_s is one factor of s; the other factor of s is the current preview time λ_p. The action selection policy is
$$
a_{R_t} = A\big[\arg\max\,\mathbf{Q}_{t-1}(s_t, :)\big] \tag{35}
$$
where a_{R_t} denotes the current action; A denotes the action set; Q_{t-1}(s_t, :) denotes the s_t-th row of the previous Q table; and arg max[Q_{t-1}(s_t, :)] returns the index of the maximum value of Q_{t-1}(s_t, :). The training rule for QL is
$$
Q_t^*(s_{t-1}, a_{t-1}) = r_{t-1}(s_t, s_{t-1}) + \gamma\max\mathbf{Q}_{t-1}(s_t, :) \tag{36}
$$

$$
Q_t(s_{t-1}, a_{t-1}) = Q_{t-1}(s_{t-1}, a_{t-1}) + \eta\big[Q_t^*(s_{t-1}, a_{t-1}) - Q_{t-1}(s_{t-1}, a_{t-1})\big] \tag{37}
$$

where Q_t^* denotes the target Q-value, γ denotes the decay parameter, and η denotes the learning rate. The detailed training algorithm for QL is shown in Table 1.
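For concreteness, a minimal NumPy sketch of the tabular update (34)–(37) is given below. The action set and hyper-parameters follow Table 3 (A = {+1, 0, −1}, η = 0.01, γ = 0.9, greedy parameter 0.9), but the state indexing is an illustrative assumption and the closed-loop AAR simulation that produces the next discretized index is not included here.

```python
import numpy as np

N_ES, N_LP = 5, 6                 # discretized preview-index values and preview-time slots (assumed sizes)
ACTIONS = np.array([1, 0, -1])    # action set A from Table 3 (change of preview-time slot)
GAMMA, ETA, EPS_GREEDY = 0.9, 0.01, 0.9

Q = np.zeros((N_ES * N_LP, len(ACTIONS)))   # Q table over state s = (e_s, lambda_p)

def state_index(e_s, lp):
    """Map (discretized preview index in {-2..2}, preview-time slot) to a row of Q."""
    return (e_s + 2) * N_LP + lp

def reward(e_s_prev, e_s_curr):
    """Reward signal of Eq. (34)."""
    if e_s_prev == 0 and e_s_curr == 0:
        return 1.0
    return abs(e_s_prev) - abs(e_s_curr)

def select_action(s):
    """epsilon-greedy selection around Eq. (35): be greedy with probability 0.9."""
    if np.random.rand() < EPS_GREEDY:
        return int(np.argmax(Q[s]))
    return np.random.randint(len(ACTIONS))

def q_update(s_prev, a_prev, s_curr, r):
    """Target value (36) and table update (37)."""
    q_target = r + GAMMA * np.max(Q[s_curr])
    Q[s_prev, a_prev] += ETA * (q_target - Q[s_prev, a_prev])
```

In the actual training of Algorithm 1, each step would apply the chosen preview-time change, run one preview-control sampling step of the closed-loop simulation, and observe the new discretized preview index before calling the update.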
3.2.2. LSTM based deep learning

QL has a simple structure, which means it has relatively low precision. Thus, the selected action may not be optimal when the Q-values of all actions corresponding to the current state, Q_{t-1}(s_t, :), are close. In this situation, a greedy policy is better; namely, the preview time that drives the next preview index e_{s_{t+1}} to its optimal value is the best choice. Thus, an accurate prediction ê_{s_{t+1}} of the preview index, given the selected action a_t and the current preview index e_{s_t}, is needed. Deep learning has good perceptual ability, and long short-term memory (LSTM) networks have been widely and successfully used in time-series prediction problems [36–39]. Thus, a deep LSTM based neural network is proposed to predict the discretized preview index e_{s_{t+1}}.

As shown in Fig. 6, the proposed LSTM neural network has several layers: the input layer, several LSTM layers, a fully connected (FC) layer, a Softmax layer, and the output layer. An LSTM layer consists of many LSTM cells, and each LSTM cell has three gates: the input gate, the output gate, and the forget gate. The input gate controls the flow of input activations into the memory; the output gate controls the output flow of cell activations into the rest of the network; and the forget gate scales the cell state, adaptively forgetting the cell's memory. Long-term dependence can be handled by the LSTM structure. The detailed calculation of the LSTM block is written as
$$
\begin{cases}
\mathbf{f}_t = \sigma(\mathbf{W}_f[\mathbf{h}_{t-1}, \mathbf{l}_t] + \mathbf{b}_f) \\
\mathbf{i}_t = \sigma(\mathbf{W}_i[\mathbf{h}_{t-1}, \mathbf{l}_t] + \mathbf{b}_i) \\
\mathbf{o}_t = \sigma(\mathbf{W}_o[\mathbf{h}_{t-1}, \mathbf{l}_t] + \mathbf{b}_o)
\end{cases} \tag{38}
$$

$$
\begin{cases}
\tilde{\mathbf{m}}_t = \tanh(\mathbf{W}_m[\mathbf{h}_{t-1}, \mathbf{l}_t] + \mathbf{b}_m) \\
\mathbf{m}_t = \mathbf{f}_t \circ \mathbf{m}_{t-1} + \mathbf{i}_t \circ \tilde{\mathbf{m}}_t \\
\mathbf{h}_t = \mathbf{o}_t \circ \tanh(\mathbf{m}_t)
\end{cases} \tag{39}
$$
where f_t denotes the forget gate; i_t denotes the input gate; o_t denotes the output gate; the W terms denote weight matrices; the b terms denote bias vectors; σ denotes the sigmoid function; m̃ denotes the updated cell state; m denotes the cell state; and h denotes the output vector. The proposed LSTM network computes the mapping from the input sequence N_1 = [a_{D_t}  e_{p_t}  τ_t  λ_{p_t}] to the output N_7 = ê_{s_{t+1}}, where τ_t is the curvature of the drogue trajectory and λ_{p_t} is the current preview time. The forward propagation is written as
Fig. 6. Structure of the proposed LSTM neural network.

$$
\mathbf{N}_{j+1} = \mathrm{LSTM}(\mathbf{N}_j)\qquad (1 \le j \le N) \tag{40}
$$
Fig. 7. Structure of the preview time controller.
where LSTM denotes an LSTM layer, which is described by (38)–(39); N_j denotes the j-th layer and N denotes the number of LSTM layers. There are several LSTM layers in the proposed network; a deep LSTM can make better use of parameters by distributing them over space through multiple layers [36].
$$
\mathbf{N}_{N+2} = \mathbf{W}_{N+2}\mathbf{N}_{N+1} + \mathbf{b}_{N+2} \tag{41}
$$
where W_{N+2} and b_{N+2} denote the weight matrix and bias vector of the fully connected layer. This layer is used to synthesize the outputs of the LSTM layers.
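As a minimal sketch of the network in Fig. 6, the PyTorch model below stacks the LSTM layers, the FC layer, and the Softmax/cross-entropy output described next, using the hyper-parameters reported in Table 3 (3 LSTM layers, 20 nodes, Adam, learning rate 0.001). The class name, dropout rate and weight-decay value are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

class PreviewIndexNet(nn.Module):
    """Stacked-LSTM classifier mapping the sequence [a_D, e_p, tau, lambda_p]
    to one of the 5 discretized preview-index classes (Eqs. (38)-(44))."""
    def __init__(self, n_features=4, hidden=20, n_layers=3, n_classes=5, p_drop=0.2):
        super().__init__()
        # dropout is applied between stacked LSTM layers, as in the training setup
        self.lstm = nn.LSTM(n_features, hidden, num_layers=n_layers,
                            batch_first=True, dropout=p_drop)
        self.fc = nn.Linear(hidden, n_classes)   # fully connected layer, Eq. (41)

    def forward(self, x):                        # x: (batch, time, features)
        out, _ = self.lstm(x)
        logits = self.fc(out[:, -1, :])          # last time step fed to the FC layer
        return logits                            # Softmax is applied inside the loss

model = PreviewIndexNet()
# Cross-entropy (log-Softmax + NLL) matches the loss L = -sum_i y_i* ln(y_i);
# weight_decay provides the L2 weight regularization mentioned in the text.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss()
```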
$$
F_s(n_j) = \frac{e^{n_j}}{\sum_{k=1}^{5} e^{n_k}} \tag{42}
$$

$$
\mathbf{N}_{N+3} = \big[F_s(n_1)\ \cdots\ F_s(n_5)\big] \tag{43}
$$

where F_s denotes the Softmax function and N_{N+3} denotes the Softmax layer. This layer generates the probability of every category according to N_{N+2}.

$$
N_{N+4} = \arg\max(\mathbf{N}_{N+3}) = \hat{e}_{s_{t+1}} \tag{44}
$$

where N_{N+4} denotes the output layer and ê_{s_{t+1}} denotes the predicted preview index. In the training process of the proposed network, dropout layers are added between the LSTM layers to avoid overfitting. The loss function used in training consists of the classification error and an L2 weight regularization term, which is also utilized to prevent overfitting:

$$
L = -\sum_{i=1}^{K} y_i^* \ln \hat{y}_i
$$

where L denotes the loss function; y_i^* denotes the indicator that the input belongs to the i-th class; and ŷ_i denotes the output for the i-th class, i.e., the value given by the Softmax function. The trainable parameters are then updated according to the gradient of L. Note that the training set and the test set are obtained from simulations; parameter perturbations and wind perturbations are included in the simulation model to reflect the actual situation more realistically. Based on these data, the trained network is more robust to the actual situation.

Remark 1. We use a classification model to predict the discretized preview index ê_{s_{t+1}} rather than a regression model to predict the exact preview index ê_{p_{t+1}}, because the preview system is too active for ê_{p_{t+1}} to be predicted precisely, and a proper discretization of e_{s_{t+1}} improves the prediction precision.

The prediction of the preview index is obtained using (40)–(43). By trying all feasible actions, every corresponding ê_{s_{t+1}} can be obtained from the proposed LSTM network. The expected action a_{D_t} is then obtained by selecting the action corresponding to the minimum ê_{s_{t+1}}, as shown in (45), where net(A) denotes all the ê_{s_{t+1}} forecast by the proposed network for the actions in A.

$$
a_{D_t} = A\big[\arg\min\,\mathrm{net}(A)\big] \tag{45}
$$

3.2.3. Preview time controller

According to the above analyses, RL considers global optimality but its precision is relatively low, whereas DL is greedy but has high prediction precision. Therefore, RL and DL are combined spatially to overcome their respective disadvantages, and a novel preview time controller is proposed. The proposed method is named deep learning and reinforcement learning (DLRL). The detailed process is shown in Table 2, where ξ denotes the threshold value of the action selection policy.

Table 2
Algorithm 2: Learning-based preview time controller algorithm.
1   Initialize the LSTM model and the Q table
2   for t = 1, ..., T
3       Observe the current state s_t
4       if (max(Q(s_t, :)) − min(Q(s_t, :))) > ξ
5           Select an action a_{R_t} according to (35)
6       else
7           Predict the next state s_{t+1} for the different a_{D_t} using the LSTM model
8           Select an action a_{D_t} according to (44)
9       end if
10      Perform a_{R_t} / a_{D_t} in the environment
11  end for

Remark 2. The proposed DLRL method is easier to train than DRL and has good performance. Compared to the deep neural network used by DRL, the Q table, which is the RL model of DLRL, is easy to train and converges readily because of its simple structure. The deep neural network (LSTM) used in the DL part of DLRL is trained by supervised learning, which means the DL part has a capable network and is also easy to train.

The structure of the preview time controller is shown in Fig. 7. The decision maker selects an appropriate action to determine the preview time λ_p according to DL and RL; λ_p is calculated by (46). λ_m denotes the tolerable maximum value of λ_p and is determined by the drogue prediction model: within λ_m, the drogue prediction precision can be guaranteed. A deep learning model is used to predict the drogue motion, which is our previous work [35] and is not the key point of this paper. D_p denotes the predicted drogue position at the preview time λ_p.
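A compact sketch of the decision maker of Table 2 and Fig. 7 is given below, combining the Q table with the LSTM classifier. The `predict_index` callable stands in for the trained network of Sec. 3.2.2; the names are illustrative, and the "minimum" of (45) is read here as the predicted index closest to zero, in line with the stated optimum e_s = 0.

```python
import numpy as np

ACTIONS = np.array([1, 0, -1])   # action set A (change of preview-time slot), Table 3
XI, LAMBDA_M = 1.0, 5            # decision threshold xi and max preview time, Table 3

def select_preview_action(Q, s, e_p, tau, lam_p, predict_index):
    """Algorithm 2: use the Q table when it clearly prefers one action,
    otherwise fall back to the greedy LSTM prediction."""
    q_row = Q[s]
    if q_row.max() - q_row.min() > XI:          # RL branch, Eq. (35)
        return ACTIONS[int(np.argmax(q_row))]
    # DL branch: predict e_s(t+1) for every candidate action and pick the
    # action whose predicted index is closest to zero.
    predicted = [predict_index(np.array([a, e_p, tau, lam_p])) for a in ACTIONS]
    return ACTIONS[int(np.argmin(np.abs(predicted)))]

def update_preview_time(lam_p, action):
    """Preview-time update with the saturation of Eq. (46)."""
    return int(np.clip(lam_p + action, 0, LAMBDA_M))
```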
Table 3
Simulation parameters of the proposed docking controller.
Initial flight conditions: altitude 7010 m; ground speed 200 m/s; tracking error requirement e_d = sqrt(e_y^2 + e_z^2) ≤ 0.3 m.
ROTC: Q_r1 = diag(20, 20, 20, 20, 20, 20, 20), Q_r2 = diag(10, 10, 10), Q_r = diag(Q_r1, Q_r2), R_r = diag(10, 10); Q_c1 = 0.1 diag(1, 1), Q_c2 = 0.1 diag(1, 1, 1, 1, 1, 50, 100), Q_c = diag(Q_c1, Q_c2), R_c = diag(100, 100, 100).
HOSM: α = 1.1 × 0.7 × diag(1, 1, 1), β = 1.5 × 0.7^{0.5} × diag(1, 1, 1), H_{V_k} = 1.2, C_ω = 0.2 diag(1, 1, 1), C_{V_k} = 0.05.
Reinforcement learning: A = [1, 0, −1], η = 0.01, γ = 0.9, e_g = 0.9.
Deep learning: max iterations 3000, learning rate 0.001, LSTM nodes 20, optimizer Adam, N = 3.
Decision maker: ξ = 1, λ_m = 5.
The probe direct controller uses D_p as the desired signal; the probe is then controlled to track D_p by the probe direct controller, and the tracking lag is compensated. Note that the sampling step size of the preview time controller is 1 s and that of the probe direct controller is 0.02 s; the difference between D_p and the current drogue position is distributed over the 50 sampling steps of the probe direct controller.
$$
\lambda_{p_0} = 0,\ t = 0;\qquad
\lambda_{p_t} =
\begin{cases}
\lambda_{p_{t-1}} + a_t, & 0 < (\lambda_{p_{t-1}} + a_t) < \lambda_m \\
0, & (\lambda_{p_{t-1}} + a_t) \le 0 \\
\lambda_m, & (\lambda_{p_{t-1}} + a_t) \ge \lambda_m
\end{cases}
\quad t > 0 \tag{46}
$$
4. Simulations and comparisons

To verify the validity of the proposed docking controller for AAR, extensive simulations are conducted on a hardware platform with 32 GB RAM and an Intel Core i7-6850K at 3.60 GHz, using MATLAB 2018b. The equivalent model and parameters of the receiver are given in [19] and [20], respectively. The proposed docking controller is compared with the conventional ROTC, the probe direct controller (Section 3.1), and ADRC [12]. Then, the proposed deep learning and reinforcement learning (DLRL) based preview time controller is compared with DL, RL, DRL, and the fuzzy logic controller (FL) [12]. All simulation parameters of the proposed docking controller are shown in Table 3, where Q_r, R_r denote the LQR design parameters of the reference observer; Q_c, R_c denote the design parameters of the LQR controller; α, β denote the design parameters of the angular rate loop controller; H_{V_k} denotes the design parameter of the ground speed loop; C_ω, C_{V_k} denote the observer design parameters of the HOSMO; A denotes the action set of RL; η denotes the learning rate of RL; γ denotes the decay parameter of RL; e_g denotes the greedy parameter of RL; ξ denotes the threshold of the Q value; and λ_m denotes the tolerable maximum value of λ_p.

Remark 3. The drogue motion is separated into two axes, the y axis and the z axis of the tanker frame. The receiver has a different tracking status in these two axes, because the y axis is related to the rolling motion and the z axis to the pitching motion of the receiver. Different from [12], we apply a different preview time in each axis.

4.1. Simulation conditions

The following simulations are conducted in the presence of the tanker vortex and light turbulence, as shown in Fig. 8, where W_t denotes the turbulence and W denotes the total wind perturbation. Each simulation runs for 80 s to obtain the following results.
Fig. 8. Wind perturbations.
4.2. Reinforcement learning and deep learning

The reinforcement learning agent is trained for 50 episodes by placing it in the environment consisting of the well-designed probe direct controller and the 6-DOF nonlinear receiver model; the well-trained Q table is then obtained, as shown in Fig. 9. s_1 denotes the discretized preview index e_s, which has five values: −2 denotes a large negative value, −1 a small negative value, 0 zero, 1 a small positive value, and 2 a large positive value; these values denote the degree of lead or lag. s_2 denotes the preview time λ_p. The state of the Q table consists of s_1 and s_2, and a is the action of the Q table.

For deep learning, the performance of the DL model mainly depends on the number of LSTM layers and the number of LSTM nodes in every layer. Therefore, the influence of these parameters on performance is analyzed to select a suitable set. As shown in Fig. 10, different numbers of LSTM layers and nodes are paired for the performance test. For real-time considerations, the numbers of LSTM layers and nodes are limited to 4 and 30, respectively. According to the test results, the number of LSTM layers is finally set to 3 and the number of nodes to 20: over-fitting occurs when the numbers of layers and nodes increase, and under-fitting occurs when they decrease. Note that the training set and the test set are both obtained from the simulation environment, which contains the hose-and-drogue model, the receiver model, and the probe direct controller. After enough training iterations, the proposed LSTM network performs well on the test set, as shown in Fig. 11.
Fig. 9. Q table of two axes.
Fig. 10. Comparison of DL models with different parameters.
Fig. 11. Performance of DL on a part of the test set. (For interpretation of the colors in the figure(s), the reader is referred to the web version of this article.)
The prediction accuracies for the y axis and z axis are 97.50% and 95.36%, respectively, which means the ê_{s_{t+1}} corresponding to every feasible action can be obtained accurately. Using the proposed learning algorithm, deep learning and reinforcement learning (DLRL), the variations of e_s and λ_p generated by the preview time controller are shown in Fig. 12. The red dashed lines are the discretization thresholds of e_p. The unit of λ_p is 0.1 s, i.e., a value of 1 denotes a 0.1 s preview time. The variation of λ_p is closely related to e_p. In the y axis, e_p varies over a large range.
Thus, λ_p of the y axis changes substantially to suppress e_p. In the z axis, e_p varies over a small range; thus, λ_p of the z axis changes only slightly.

4.3. Docking controller

Note that the following simulations are performed in the presence of aerodynamic parameter perturbations to reflect the influence of modeling error.
Fig. 12. Variation of preview index and preview time.
Fig. 13. Tracking performance comparison of different docking controllers.
Fig. 14. Distance error comparison of different docking controllers.
The aerodynamic parameter perturbations are generated randomly with amplitudes within ±20%, and the following comparison results are obtained under the same perturbation conditions. The traditional ROTC [9], ADRC [12], the proposed probe direct controller, and the proposed probe direct controller with the preview method are compared in terms of tracking performance, as shown in Fig. 13. Compared with the traditional ROTC and ADRC, the proposed probe direct controller, which transforms the controlled object from the barycenter to the probe, improves the tracking performance markedly.
In Fig. 14, the improvement brought by the learning-based preview method can be seen clearly. e_d = sqrt(e_y^2 + e_z^2) denotes the distance from the probe to the drogue, and the IAE index denotes the integral of e_d, i.e., IAE = ∫_0^t e_d dt. The detailed results are shown in Table 4. The probe direct controller clearly reduces both the max error and the max IAE relative to ROTC, and the probe direct controller with the preview method reduces them further. In conclusion, the proposed docking controller has better performance.
Table 4
Distance error comparison of different docking controllers.
Controller:  ROTC      | ADRC      | Probe direct controller | Probe+Preview
Max error:   0.8499 m  | 0.3864 m  | 0.2648 m                | 0.1993 m
Max IAE:     813.6 m   | 580.1 m   | 393.0 m                 | 230.9 m

Table 5
Distance error comparison of different preview controllers.
Method:      FL        | DL        | RL        | DRL       | DLRL
Max error:   0.2532 m  | 0.2102 m  | 0.1991 m  | 0.2025 m  | 0.1993 m
Max IAE:     289.8 m   | 250.7 m   | 282.2 m   | 246.5 m   | 230.9 m
Fig. 15. Distance error comparison of different preview time controller.
Fig. 16. The variation of receiver states.
Fig. 15 compares the distance errors of the different preview time controllers: FL denotes the fuzzy logic controller [12]; DL denotes the deep learning (LSTM) controller; RL denotes the reinforcement learning (QL) controller; DRL denotes the deep reinforcement learning (DQN) controller; and DLRL denotes the proposed deep learning and reinforcement learning controller. To guarantee fairness and show the real performance of the different preview time controllers, these comparisons are conducted with the same probe direct controller and the same number of training iterations. The detailed results are shown in Table 5. DLRL has the minimum value of both indexes, which means the proposed DLRL has the best performance among these five preview time controllers.

The detailed variations of the receiver states and control inputs are shown in Fig. 16 and Fig. 17. It is obvious that the variation range of the receiver attitudes is suppressed by transforming the controlled object from the barycenter to the probe, because the attitude variation has been taken into account in the probe dynamics. In Fig. 17, the probe direct controller with the preview method sometimes generates larger control inputs than the probe direct controller alone, because at these moments the preview time changes and the controller tracks a different reference signal, which requires large control inputs.

Fig. 17. Variation of receiver control inputs.

The real-time performance of the proposed algorithm is shown in Fig. 18. In one simulation, the running time of each sampling step is shorter than the sampling step size, which means that the proposed algorithm satisfies the real-time requirements.

Fig. 18. Real-time performance of the proposed algorithm.

5. Conclusion

In this paper, a novel docking controller for AAR with probe direct control and a learning-based preview method is proposed to improve the tracking precision of the docking process. Firstly, the controlled object is transformed from the barycenter to the probe of the receiver, and a suitable control scheme designed by combining ROTC and HOSM is proposed for probe direct control. The proposed probe direct control method improves the docking precision significantly and has good anti-disturbance ability. Secondly, the learning-based preview method is proposed to compensate for the tracking lag. The novel learning algorithm, named DLRL, combines DL and RL spatially rather than structurally as in DRL, and a DLRL-based preview time controller is proposed to obtain the optimal preview time selection policy. The learning-based preview method further improves the tracking precision on top of the probe direct control method. Finally, combining the above, the novel docking controller is proposed. The proposed docking controller has higher tracking precision, good disturbance rejection ability, and a simple control structure, which is demonstrated by the extensive simulation results.

Declaration of competing interest

There is no conflict of interest.

Acknowledgements

This research has been funded in part by the National Natural Science Foundation of China under Grants 61673042 and 61175084, and by the Graduate Innovation Practice Foundation of Beihang University under Grant YCSJ-01-201914.

References

[1] N. Zhang, W. Gai, M. Zhong, et al., A fast finite-time convergent guidance law with nonlinear disturbance observer for unmanned aerial vehicles collision avoidance, Aerosp. Sci. Technol. 86 (2019) 204–214.
[2] D. Lee, S. Kim, J. Suk, Formation flight of unmanned aerial vehicles using track guidance, Aerosp. Sci. Technol. 76 (2018) 412–420.
[3] J. Li, C. Gao, C. Li, et al., A survey on moving mass control technology, Aerosp. Sci. Technol. 82–83 (2018) 594–606.
[4] J. Li, C. Gao, W. Jing, et al., Nonlinear vibration analysis of a novel moving mass flight vehicle, Nonlinear Dyn. 90 (1) (2017) 733–748.
[5] P. Thomas, U. Bhandari, S. Bullock, et al., Advances in air to air refueling, Prog. Aerosp. Sci. 71 (2014) 14–35.
[6] J. Nalepka, J. Hinchman, Automated aerial refueling: extending the effectiveness of UAVs, in: AIAA Modeling and Simulation Technologies Conference and Exhibit, 2005, pp. 6005–6012.
[7] Z. Su, H. Wang, et al., Probe motion compound control for autonomous aerial refueling docking, Aerosp. Sci. Technol. 72 (2018) 1–13.
[8] S. Kriel, J. Engelbrecht, T. Jones, Receptacle normal position control for automated aerial refueling, Aerosp. Sci. Technol. 29 (1) (2013) 296–304.
[9] M. Tandale, R. Bowers, J. Valasek, Trajectory tracking controller for vision-based probe and drogue autonomous aerial refueling, J. Guid. Control Dyn. 29 (4) (2006) 846–857.
[10] Z. Su, H. Wang, N. Li, et al., Exact docking flight controller for autonomous aerial refueling with back-stepping based high order sliding mode, Mech. Syst. Signal Process. 101 (2018) 338–360.
[11] J. Wang, V. Patel, C. Cao, et al., Novel L1 adaptive control method for aerial refueling with guaranteed transient performance, J. Guid. Control Dyn. 31 (1) (2008) 182–193.
[12] Z. Su, H. Wang, P. Yao, et al., Back-stepping based anti-disturbance flight controller with preview method for autonomous aerial refueling, Aerosp. Sci. Technol. 61 (2017) 95–108.
[13] A. Dogan, S. Sato, W. Blake, Flight control and simulation for aerial refueling, in: AIAA Guidance, Navigation, and Control Conference and Exhibit, 2005, pp. 6264–6278.
[14] A. Dogan, S. Venkataramanan, Nonlinear control for reconfiguration of unmanned-aerial-vehicle formation, J. Guid. Control Dyn. 28 (4) (2005) 667–678.
[15] J. Valasek, K. Gunnam, J. Kimmett, et al., Vision-based sensor and navigation system for autonomous air refueling, J. Guid. Control Dyn. 28 (5) (2005) 979–989.
[16] J. Kimmett, J. Valasek, J.L. Junkins, Autonomous aerial refueling utilizing a vision based navigation system, in: Proceedings of the 2002 AIAA GNC Conference, 2002, paper 5569.
[17] A. Levant, Sliding order and sliding accuracy in sliding mode control, Int. J. Control 58 (6) (1993) 1247–1263.
[18] J. Han, From PID to active disturbance rejection control, IEEE Trans. Ind. Electron. 56 (3) (2009) 900–906.
[19] A. Barfield, J. Hinchman, An equivalent model for UAV automated aerial refueling research, in: AIAA Modeling and Simulation Technologies Conference and Exhibit, 2005, pp. 6006–6012.
[20] E. Kim, Control and Simulation of Relative Motion for Aerial Refueling in Racetrack Maneuver, The University of Texas at Arlington, 2007.
[21] C. Göhrle, A. Schindler, A. Wagner, O. Sawodny, Road profile estimation and preview control for low-bandwidth active suspension systems, IEEE/ASME Trans. Mechatron. 20 (5) (2015) 2299–2310.
[22] A. Ozdemir, P. Seiler, G. Balas, Design tradeoffs of wind turbine preview control, IEEE Trans. Control Syst. Technol. 21 (4) (2013) 1143–1154.
[23] J. Carapuço, R. Neves, N. Horta, Reinforcement learning applied to Forex trading, Appl. Soft Comput. 73 (2018) 783–794.
[24] F. Fathinezhad, V. Derhami, M. Rezaeian, Supervised fuzzy reinforcement learning for robot navigation, Appl. Soft Comput. 40 (2016) 33–41.
[25] Y. Wang, S. Geng, H. Gao, A proactive decision support method based on deep reinforcement learning and state partition, Knowl.-Based Syst. 143 (2018) 248–258. [26] D. Muse, C. Weber, S. Wermter, Robot docking based on omnidirectional vision and reinforcement learning, Knowl.-Based Syst. 19 (2006) 324–332. [27] A. Khater, A. El-Nagar, M. El-Bardini, et al., Adaptive T–S fuzzy controller using reinforcement learning based on Lyapunov stability, J. Franklin Inst. 355 (14) (2018) 6390–6415. [28] C. Watkins, P. Dayan, Q-learning, Mach. Learn. 8 (3–4) (1992) 279–292. [29] V. Mnih, K. Kavukcuoglu, D. Silver, et al., Human-level control through deep reinforcement learning, Nature 518 (7540) (2015) 529–533. [30] V. Konda, J. Tsitsiklis, Actor-Critic Algorithms, Advances in Neural Information Processing Systems, 2000. [31] R. Sutton, D. McAllester, S. Singh, et al., Policy gradient methods for reinforcement learning with function approximation, in: Advances in Neural Information Processing Systems, 2000, pp. 1057–1063. [32] Q. Wei, D. Liu, H. Lin, Value iteration adaptive dynamic programming for optimal control of discrete-time nonlinear systems, IEEE Trans. Cybern. 46 (3) (2016) 840–853. [33] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (7553) (2015) 436. [34] J. Schmidhuber, Deep learning in neural networks: an overview, Neural Netw. 61 (2015) 85–117. [35] Y. Liu, H. Wang, Z. Su, et al., Deep learning based trajectory optimization for UAV aerial refueling docking under bow wave, Aerosp. Sci. Technol. 80 (2018) 392–402. [36] H. Sak, A. Senior, F. Beaufays, Long short-term memory recurrent neural network architectures for large scale acoustic modeling, in: Fifteenth Annual Conference of the International Speech Communication Association, 2014. [37] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Comput. 9 (8) (1997) 1735–1780. [38] H. Palangi, L. Deng, Y. Shen, et al., Deep sentence embedding using long shortterm memory networks: analysis and application to information retrieval, IEEE/ACM Trans. Audio Speech Lang. Process. 24 (4) (2016) 694–707. [39] X. Ma, Z. Tao, Y. Wang, et al., Long short-term memory neural network for traffic speed prediction using remote microwave sensor data, Transp. Res., Part C, Emerg. Technol. 54 (2015) 187–197. [40] A.S. Qureshi, A. Khan, A. Zameer, et al., Wind power prediction using deep neural network based meta regression and transfer learning, Appl. Soft Comput. 58 (2017) 742–755.