Procedia Computer Science 140 (2018) 376–382
Complex Adaptive Systems Conference with Theme: Cyber Physical Systems and Deep Learning, CAS 2018, 5 November – 7 November 2018, Chicago, Illinois, USA
Learning to Operate an Excavator via Policy Optimization

Benjamin J. Hodel
Caterpillar, Inc., 501 SW Jefferson Ave, Peoria, IL, USA
Abstract

This paper provides a case study for optimizing a deep neural network policy to control an excavator and perform bucket-leveling. The policy mimics human behavior, which traditional control algorithms find difficult because of the unstructured earthmoving environment and excavator system dynamics. The approach in this paper relies on integrating a proprietary simulator, Dynasty, with the OpenAI Gym framework. By exposing the simulation engine in a manner compatible with OpenAI Gym, we benchmarked several reinforcement learning algorithms against an excavator bucket-leveling control problem. The paper provides results for the experiment and discusses techniques to effectively find policies that converge on smooth machine operation.

© 2018 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0/). Selection and peer-review under responsibility of the Complex Adaptive Systems Conference with Theme: Engineering Cyber Physical Systems.

Keywords: construction; operator; autonomous; reinforcement learning
1. Introduction

Product development of large, heavy earthmoving machines benefits from running virtual simulations of new designs before any prototypes or production machines are built. By evaluating the performance and durability of the engine, structures, powertrain, hydraulic implements and other systems using only a digital model, designs can be iterated many times before any physical parts are constructed or assembled. It is critical, therefore, that the simulation is operated in a way that is representative of human operators. A program that does this is called an operator model.

Creating operator models that mimic human behavior to perform implement control is difficult using conventional techniques. Traditional development of operator models has focused on rule-based logic or the use of proportional-integral-derivative (PID) controllers applied to a state machine. Simple rule-based models, however, are often brittle and become unsatisfactory when the design parameters of the machine or the boundary conditions of the simulation are changed. PID controllers require manual tuning and suffer similar sensitivity to the initial simulation parameters. In both cases, robust human-like trajectories are hard to achieve.
There are advanced control methods such as iterative learning control (ILC) [1], model predictive control (MPC) [2], model reference adaptive control (MRAC) [3] and fuzzy control [4] which are better suited for complicated behavior by imposing a more complex control structure on the flow of data. However, these advanced control methods require careful design and control theory expertise to tune them correctly. In some cases, they require complete knowledge of the system dynamics or they do not guarantee transient learning behavior [5]. Besides model complexities, some simulation scenarios are virtually impossible to control due to the chaotic system dynamics. Realistic loading use cases involve high loads, hydraulic deadband and stall pressures, kinematic friction, and variable material densities. For this reason, leading researchers are now looking at reinforcement learning approaches to model operator behavior [6].

In this article, we present a new approach for creating operator models which applies reinforcement learning algorithms to an excavator simulation to iteratively learn a bucket-leveling control policy. We compare four reinforcement learning algorithms – random search, the hill-climbing method, the cross-entropy method (CEM), and trust region policy optimization (TRPO) – and show that the latter performs better than the others in terms of accuracy and guaranteed convergence.

2. OpenAI Environments

In 2016, OpenAI released OpenAI Gym as a toolkit for reinforcement learning research [7]. This toolkit contains a set of simulation environments that work with a common interface, allowing practitioners to try different algorithms and benchmark the results. As a result, the OpenAI Gym interface is now held as a standard for benchmarking and testing the effectiveness of new algorithms. This impact can be seen in the wider community in the many user scripts that rely on OpenAI Gym, as well as in the use of OpenAI Gym as a benchmark when publishing improvements on policy gradient methods [8] or new approaches to reinforcement learning, such as evolution strategies [9]. The usefulness of Gym has led other simulators to include support for it, including physics engines such as MuJoCo, Bullet, and the Robot Operating System (ROS) / Gazebo simulator [10].

Our machine simulation models, however, were not in a format that could easily be moved to these supported simulators. As a result, we chose to bring the OpenAI algorithms to our existing simulation environment. We have a long history using Dynasty, a proprietary multi-body physics simulation engine. Dynasty is designed for the scenarios of our products and we have a vast array of models already built for it. Therefore, we chose to create an interface using its existing co-simulation extension [11] that would make Dynasty conform to the OpenAI application programming interface. By extending our simulation engine to support the required OpenAI Gym environment methods, we get the immediate value of state-of-the-art algorithms and can easily adapt the interface to any new ones that are published.
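To make the adaptation concrete, the sketch below shows the general shape of a Gym-compatible wrapper around a co-simulated Dynasty model. This is a minimal sketch rather than the actual interface: the `dynasty_cosim` module and its method names are hypothetical placeholders for the proprietary co-simulation extension, while the action and observation spaces correspond to the lever commands and bucket measurements used in the experiment described in the next section.

```python
import numpy as np
import gym
from gym import spaces

# Hypothetical binding to the proprietary co-simulation extension.
import dynasty_cosim


class DynastyExcavatorEnv(gym.Env):
    """OpenAI Gym wrapper exposing a Dynasty excavator model through the
    standard reset()/step() interface."""

    def __init__(self, model_file="excavator_leveling.dyn", dt=0.01):
        # Boom, stick, and bucket lever commands in the range [-100, 100].
        self.action_space = spaces.Box(low=-100.0, high=100.0, shape=(3,), dtype=np.float32)
        # Bucket-tooth X, Y, Z position and bucket pitch angle relative to the ground.
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(4,), dtype=np.float32)
        self.sim = dynasty_cosim.load(model_file)   # hypothetical call
        self.dt = dt

    def reset(self):
        self.sim.reset()                            # hypothetical call
        return self._observe()

    def step(self, action):
        self.sim.set_lever_commands(action)         # hypothetical call
        self.sim.advance(self.dt)                   # hypothetical call
        obs = self._observe()
        reward = self._reward(obs)
        done = self.sim.episode_finished()          # hypothetical call
        return obs, reward, done, {}

    def _observe(self):
        x, y, z, pitch = self.sim.bucket_state()    # hypothetical call
        return np.array([x, y, z, pitch], dtype=np.float32)

    def _reward(self, obs):
        # Placeholder: the leveling reward (penalties on trajectory deviation
        # and jerk) is discussed in Section 3.
        return 0.0
```

With such a wrapper in place, any agent written against the Gym API can interact with the Dynasty model without knowing anything about the co-simulation layer.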
Dynasty provides more than solid contact kinematics. It is a full system simulation software tool that is widely used by our company to predict the transient and frequency-domain behavior of vehicles and vehicle systems. Dynasty integrates many types of systems, including hydraulics, engines, drivelines, rigid and flexible linkages, electronic controls, and cooling systems. Dynasty contains over 500 predefined components (e.g. torque converters, springs, spool valves, etc.) which can be graphically connected to build a mathematical model of the physical system. The model may consist of a sub-system or an entire vehicle or engine system.

3. Excavator Bucket Leveling Experiment

Many operator modeling problems exist, for example bucket positioning control (excavators, wheel loaders), steering control (articulated trucks, motor graders), and blade control (track-type tractors, motor graders). Any algorithm which provides a better method for creating a robust operator model for one application will benefit all other machine simulations that we currently perform. We selected excavator bucket-leveling as a representative task to study since it is both straightforward and yet challenging to model.
Fig. 1. Hydraulic excavator performing a bucket-leveling task
3.1. Experimental Setup

Leveling, also called bucket-dragging, is a difficult task: it may look simple, but it requires full skill from a human operator to perform (see Fig. 1). The goal is to move the excavator’s bucket from an extended position (away from the machine) to a retracted position (close to the machine) while maintaining the bucket face in a flat, horizontal position along the ground using a sweeping motion. The vertical position of the bucket tooth should stay flush with the ground, and the bucket pitch angle should remain zero relative to the ground. Inadvertent movements which raise the bucket will cause it to lose contact with the ground, whereas movements which lower it will gouge the ground. Errors in either direction will result in a poorly-leveled ground surface. To do this accurately, without noticeable deviations, an operator must vary three degrees of freedom simultaneously. Leveling is challenging for conventional controls to solve, especially if hydraulic deadband, pump delays, joint friction, and ground engagement are added to the simulation.

To run reinforcement learning on the Dynasty simulation engine we needed to provide the simulation model, a reward function, a policy architecture, and a learning algorithm. With Dynasty integrated with OpenAI Gym, we supplied the boom, stick, and bucket linkages. For simplicity, we chose a model in which the lever commands of the boom, stick, and bucket were directly proportional to the speed of their respective cylinders. The Dynasty environment requires boom, stick, and bucket lever commands as the input actions (ranging from -100 to 100), and outputs observations in the form of absolute bucket-tooth position (in the X, Y, and Z directions) and bucket pitch angle relative to the ground.

The reward function was hand-coded to motivate proper leveling technique. A neural network with three hidden layers was chosen as the policy architecture, with on the order of 10 nodes per layer. The inputs were standardized, and the actions were passed through a hyperbolic tangent function so they could be multiplied by 100 to produce the correct range.

We ran the following algorithms and compared the results: random search; the hill-climbing method; the cross-entropy method (CEM), a gradient-free method [12]; and trust region policy optimization (TRPO), a gradient-based policy iteration method which uses a constraint on the KL divergence to control learning [13]. The random search method simply picks new policy weights and keeps them only when the cumulative reward is better than that of the last policy. Hill-climbing is similar, but it perturbs the previous policy weights (by a noise scaling factor, e.g. 0.1) and keeps them only when the cumulative reward is better than that of the last policy.
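A minimal sketch of the policy and the simplest of the search methods, written under stated assumptions, is shown below. The three hidden layers, the sigmoid activations after the first two of them, the tanh output scaled to the ±100 lever range, and the keep-if-better perturbation with a 0.1 noise scale follow the description above; the layer width, the weight initialization, the episode horizon, and the environment object (any Gym-style environment, such as a Dynasty wrapper) are illustrative assumptions.

```python
import numpy as np


class Policy:
    """Feed-forward policy: three hidden layers, sigmoid activations after the
    first two, and a tanh output scaled to the lever-command range [-100, 100]."""

    def __init__(self, obs_dim=4, hidden=10, act_dim=3, rng=None):
        rng = rng or np.random.default_rng()
        sizes = [obs_dim, hidden, hidden, hidden, act_dim]
        self.weights = [rng.normal(0.0, 0.1, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
        self.biases = [np.zeros(n) for n in sizes[1:]]

    def act(self, obs):
        x = np.asarray(obs, dtype=float)                  # standardized observations assumed
        for i, (W, b) in enumerate(zip(self.weights[:-1], self.biases[:-1])):
            x = x @ W + b
            if i < 2:
                x = 1.0 / (1.0 + np.exp(-x))              # sigmoid after the first two hidden layers
        x = np.tanh(x @ self.weights[-1] + self.biases[-1])
        return 100.0 * x                                  # boom, stick, and bucket lever commands


def rollout(env, policy, horizon=500):
    """Run one episode and return the cumulative reward."""
    obs, total = env.reset(), 0.0
    for _ in range(horizon):
        obs, reward, done, _ = env.step(policy.act(obs))
        total += reward
        if done:
            break
    return total


def hill_climb(env, policy, iterations=200, noise_scale=0.1):
    """Perturb the previous weights and keep them only if the return improves."""
    best = rollout(env, policy)
    for _ in range(iterations):
        backup = [W.copy() for W in policy.weights]
        for W in policy.weights:
            W += noise_scale * np.random.randn(*W.shape)  # perturb in place
        score = rollout(env, policy)
        if score > best:
            best = score                                  # keep the perturbed weights
        else:
            policy.weights = backup                       # revert to the previous policy
    return policy, best
```

Random search fits the same loop by drawing entirely new weights each iteration instead of perturbing the previous ones; CEM and TRPO replace the update step with their own sampling and gradient machinery.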
3.2. Experimental Results

The results of the experimental runs are shown in Fig. 2 and Fig. 3. We found that the TRPO algorithm outperformed the other methods, as it has a nearly monotonic learning curve, which shows the benefit of being able to follow the policy gradient. CEM also did quite well, but saturated at about the 60th iteration, since that is where we caused the extra covariance to taper off. The difference between TRPO and CEM is more pronounced than the plot indicates when sample episodes are inspected. The fact that TRPO continues to increase the total reward to above -1000 makes it clearly preferable to the other methods. The random and hill-climbing methods did not do as well: they either have no real mechanism to improve the policy efficiently (random search) or get easily stuck in local optima (hill-climbing).

Fig. 2. Learning curve and sample TRPO rollout (after 200 policy iterations) without jerk penalty or noise filter: (a) learning curve; (b) TRPO rollout.

All the methods showed problems in the action space, however. They favor policies that result in rapid movement of the lever commands, sometimes oscillating between full-stick negative and full-stick positive (see Fig. 2b). This is common in control problems and is known as “bang-bang” control. These bang-bang actions are likely optimal according to the learning algorithm, but smooth commands are desired for machine operation. Smooth controls are more human-like and result in less severity on the machine hydraulics and structures. Therefore, to avoid the oscillation problem the reward function was modified to penalize this behavior, a solution known as “jerk minimization” [14].
Similar to the way humans minimize jerk (the third derivative of position) when moving their limbs, we encoded this jerk minimization by applying a small penalty for bucket jerk (both radial translation and rotation). However, with jerk penalized, the TRPO method ceased to function: it relies on adding Gaussian noise to the simulation actions to explore the action space. To make it possible for TRPO to work with the jerk penalty, we had to pass the noise through a low-pass digital filter before adding it to the commands. We chose to use an eight-pole infinite impulse response (IIR) filter designed to cut off frequencies above 2 Hz. This removed the high-frequency actions that would incur a high jerk value while still allowing sufficient policy exploration.

Fig. 3. Learning curve and sample TRPO rollout (after 200 policy iterations) with jerk penalty and noise filter: (a) learning curve; (b) TRPO rollout.

The jerk-penalized results are shown in Fig. 3a and 3b. In the plots with filtered actions, the jerk movements are minimized and the sampled lever commands are much smoother. The true policy actions will have even less ripple during deployment since they will not include any additive noise at all.
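The exploration-noise filtering can be sketched as follows. The text specifies only an eight-pole IIR low-pass filter with a 2 Hz cutoff applied to the Gaussian exploration noise; the Butterworth realization, the 100 Hz control rate, and the noise standard deviation used here are assumptions for illustration.

```python
import numpy as np
from scipy.signal import butter, lfilter

# Assumed control rate and an assumed Butterworth realization of the
# eight-pole IIR low-pass filter with a 2 Hz cutoff.
FS_HZ = 100.0
CUTOFF_HZ = 2.0
b, a = butter(N=8, Wn=CUTOFF_HZ, btype="low", fs=FS_HZ)


class FilteredGaussianNoise:
    """Gaussian exploration noise passed through the low-pass IIR filter,
    maintained independently for each lever command (boom, stick, bucket)."""

    def __init__(self, act_dim=3, sigma=10.0):  # sigma is an illustrative choice
        self.sigma = sigma
        # One filter state per action channel so the filter can run sample by sample.
        self.zi = [np.zeros(max(len(a), len(b)) - 1) for _ in range(act_dim)]

    def sample(self):
        noise = np.random.randn(len(self.zi)) * self.sigma
        filtered = np.empty_like(noise)
        for i, n in enumerate(noise):
            y, self.zi[i] = lfilter(b, a, [n], zi=self.zi[i])
            filtered[i] = y[0]
        return filtered


# During training, the filtered noise is added to the policy's lever commands:
#     action = np.clip(policy.act(obs) + noise.sample(), -100.0, 100.0)
```

Because the filter removes the high-frequency components of the noise before it reaches the commands, the perturbed actions no longer incur large jerk penalties, while the low-frequency exploration needed by TRPO is preserved.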
Regarding construction of a proper reward function for bucket-leveling: initially we gave positive reward proportional to being within an acceptable band of the target tooth vertical position, velocity, and bucket angle. However, this scheme developed certain pathological trajectories, so we changed to a negative-reward (or penalty) approach whereby we assigned negative values proportional to the radial and angular deviations relative to the target trajectory. This worked much better. Even so, balancing the different reward components was only achieved through trial and error by applying different weights.

Selecting the right policy architecture was also crucial. Initially, the network had only one layer with no bias terms and no activation function. This resulted in unsatisfactory behavior where the excavator stick slowed to a complete stop in the middle of the travel (since all inputs became zero). To resolve this problem, we increased the number of layers to three, added bias terms, and placed sigmoid activation functions after the first two hidden layers. With that, the randomly-initialized weights made it possible for travel to continue. We ran tests with one, two, and three hidden layers, but did not observe much sensitivity to the network depth.
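A penalty-style reward of the kind described above might look like the following sketch. The weights and the finite-difference jerk estimate are illustrative assumptions; the actual weights were found by trial and error, as noted.

```python
import numpy as np


def leveling_reward(radial_err, angle_err, jerk, w_radial=1.0, w_angle=1.0, w_jerk=0.01):
    """Negative reward (penalty) proportional to the radial and angular deviations
    from the target trajectory, plus a small penalty on bucket jerk.
    The weights here are illustrative, not the tuned values."""
    return -(w_radial * abs(radial_err) + w_angle * abs(angle_err) + w_jerk * abs(jerk))


def finite_difference_jerk(positions, dt):
    """Approximate jerk (the third derivative of position) from a sampled trajectory."""
    p = np.asarray(positions, dtype=float)
    return np.diff(p, n=3) / dt ** 3
```

The penalty is evaluated at every simulation step and summed over the episode to form the return that the learning algorithms maximize.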
4. Conclusions

We introduced the concept of using reinforcement learning to find a policy which can perform excavator bucket-leveling with human-like realism in a virtual simulation. The approach here relied on adapting Dynasty, a proprietary simulation engine, to the OpenAI Gym environment interface to take advantage of the wealth of algorithms that have been developed for it. We have shown that reinforcement learning can learn an acceptable excavator-leveling neural network policy. We found that TRPO had excellent results and outperformed the other methods we tested. A key finding was that optimal policies often incur bang-bang control behavior, which is undesirable for machine design and performance reasons. We showed that this behavior can be avoided if a jerk penalty is added to the reward function and if the exploration noise is low-pass filtered.

More work is required to show that reinforcement learning can extend to all the needs of operator modeling, but the fact that the TRPO algorithm has a nearly monotonic improvement curve is encouraging and gives promise that tractable policies may be found for difficult simulation scenarios yet to be encountered. We believe that this preliminary work shows the rich potential of reinforcement learning for operator modeling, and that it may have on-machine applications beyond simulation, involving operator-assist features and machine-level autonomy.

Acknowledgements

This work was conceived and supported by Rajiv Shah, Caterpillar (now with DataRobot), and Matthew Lossmann, Caterpillar.

References

[1] D. A. Bristow, M. Tharayil, and A. G. Alleyne. A survey of iterative learning control. IEEE Control Systems, 26(3):96–114, 2006.
[2] C. E. Garcia, D. M. Prett, and M. Morari. Model predictive control: Theory and practice—A survey. Automatica, 25(3):335–348, 1989.
[3] K. B. Pathak and D. M. Adhyaru. Survey of model reference adaptive control. In Engineering (NUiCONE), 2012 Nirma University International Conference on, pages 1–6. IEEE, 2012.
[4] C.-C. Lee. Fuzzy logic in control systems: fuzzy logic controller. I. IEEE Transactions on Systems, Man, and Cybernetics, 20(2):404–418, 1990.
[5] S. Ding, J. Szalko, and M. Lossmann. Survey of advanced control techniques (Caterpillar internal white paper). 2016.
[6] S. Dadhich, U. Bodin, F. Sandin, and U. Andersson. Machine learning approach to automatic bucket loading. In Control and Automation (MED), 2016 24th Mediterranean Conference on, pages 1260–1265. IEEE, 2016.
[7] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym. CoRR, abs/1606.01540, 2016. URL http://arxiv.org/abs/1606.01540.
[8] S. Gu, T. P. Lillicrap, Z. Ghahramani, R. E. Turner, and S. Levine. Q-Prop: Sample-efficient policy gradient with an off-policy critic. CoRR, abs/1611.02247, 2016. URL http://arxiv.org/abs/1611.02247.
[9] T. Salimans, J. Ho, X. Chen, and I. Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. CoRR, abs/1703.03864, 2017. URL http://arxiv.org/abs/1703.03864.
[10] I. Zamora, N. G. Lopez, V. M. Vilches, and A. H. Cordero. Extending the OpenAI Gym for robotics: a toolkit for reinforcement learning using ROS and Gazebo. arXiv preprint arXiv:1608.05742, 2016.
[11] S. Saikia, S. Ozili, K. C. Paranjothi, and T. Rajendran. Controller model integration for virtual product development. Technical report, SAE Technical Paper, 2013.
[12] I. Szita and A. Lörincz. Learning Tetris using the noisy cross-entropy method. Neural Computation, 18(12), 2006.
[13] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1889–1897, 2015.
[14] C. Elkan. Reinforcement learning with a bilinear Q function. In S. Sanner and M. Hutter, editors, Recent Advances in Reinforcement Learning: 9th European Workshop, EWRL 2011, Athens, Greece, September 9-11, 2011, Revised and Selected Papers, volume 7188, pages 78–88. Springer, 2012.