Paper:

# Learning Quadcopter Maneuvers with Concurrent Methods of Policy Optimization

## Pei-Hua Huang and Osamu Hasegawa

Tokyo Institute of Technology

J3-13, 4259 Nagatsuta, Midori-ku, Yokohama 226-8502, Japan

This study presents an aerial robotic application of deep reinforcement learning that applies an asynchronous learning framework and trust region policy optimization to a simulated quad-rotor helicopter (quadcopter) environment. In particular, we optimized a control policy asynchronously through interaction with concurrent instances of the environment. The control system was benchmarked and then extended with two continuous state-action tasks for the quadcopter: hovering control and balancing an inverted pole. These maneuvers require continuous actions that finely control small changes in the quadcopter's acceleration in order to maximize the scalar reward defined for each task. The simulation results demonstrated improved learning speed and reliability on these tasks.
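The idea of optimizing one control policy asynchronously against concurrent environment instances can be illustrated with a minimal sketch. It is not the authors' implementation, and several pieces are assumptions made only for illustration: `ToyHoverEnv` is a hypothetical 1-D altitude task rather than the quadcopter simulator, the policy is a two-weight linear-Gaussian controller, and the update is a plain REINFORCE policy-gradient step with gradient-norm clipping, swapped in for trust region policy optimization to keep the example short. The constants `SIGMA`, `HORIZON`, and `STEP_SIZE` are likewise arbitrary.

```python
import math
import random
import threading
from concurrent.futures import ThreadPoolExecutor

SIGMA = 0.5       # fixed exploration noise of the Gaussian policy (assumed)
HORIZON = 40      # steps per episode (assumed)
STEP_SIZE = 1e-3  # learning rate (assumed)


class ToyHoverEnv:
    """Hypothetical 1-D hover task: state is (height error, vertical velocity)."""

    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.z, self.v = 0.0, 0.0

    def reset(self):
        self.z, self.v = self.rng.uniform(-1.0, 1.0), 0.0
        return self.z, self.v

    def step(self, accel, dt=0.05):
        accel = max(-5.0, min(5.0, accel))  # bounded continuous action
        self.v += accel * dt
        self.z += self.v * dt
        reward = -(self.z ** 2 + 0.1 * self.v ** 2)  # scalar reward, best at hover
        return (self.z, self.v), reward


class SharedPolicy:
    """Linear-Gaussian policy whose weights are shared by all workers."""

    def __init__(self):
        self.w = [0.0, 0.0]
        self.updates = 0
        self.lock = threading.Lock()

    def snapshot(self):
        with self.lock:
            return list(self.w)

    def apply(self, grad):
        # Clip the gradient norm to 10 so a single noisy episode cannot dominate.
        norm = math.sqrt(sum(g * g for g in grad))
        scale = min(1.0, 10.0 / norm) if norm > 0 else 1.0
        with self.lock:
            for i, g in enumerate(grad):
                self.w[i] += STEP_SIZE * scale * g  # gradient ascent on return
            self.updates += 1


def worker(policy, seed, episodes):
    """Interact with a private environment instance and update the shared policy."""
    env, rng = ToyHoverEnv(seed), random.Random(seed + 1000)
    for _ in range(episodes):
        w = policy.snapshot()  # may already be stale when the update is applied
        state = env.reset()
        steps = []
        for _ in range(HORIZON):
            mean = w[0] * state[0] + w[1] * state[1]
            action = rng.gauss(mean, SIGMA)
            next_state, reward = env.step(action)
            steps.append((state, action - mean, reward))
            state = next_state
        # Reward-to-go for every time step of the episode.
        togo, acc = [], 0.0
        for _, _, r in reversed(steps):
            acc += r
            togo.append(acc)
        togo.reverse()
        # REINFORCE: grad log N(a; w.s, SIGMA) = (a - w.s) / SIGMA^2 * s
        grad = [0.0, 0.0]
        for (s, dev, _), g in zip(steps, togo):
            coeff = dev / SIGMA ** 2 * g
            grad[0] += coeff * s[0]
            grad[1] += coeff * s[1]
        policy.apply(grad)


def train(n_workers=4, episodes=50):
    policy = SharedPolicy()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(worker, policy, seed, episodes)
                   for seed in range(n_workers)]
        for f in futures:
            f.result()  # re-raise any exception from a worker
    return policy
```

Calling `train(n_workers=4, episodes=50)` returns the shared policy after every worker has finished. Python threads are used here only to show the structure: each worker owns its environment, reads a possibly stale copy of the weights, and writes its update without waiting for the others. A practical setup would more likely use separate processes or machines.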
