Off-Policy Natural Policy Gradient Method for a Biped Walking Using a CPG Controller
Yutaka Nakamura, Takeshi Mori, Yoichi Tokita,
Tomohiro Shibata, and Shin Ishii
Theoretical Life Science Lab., Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma, Nara 630-0192, Japan
Motor control schemes using a central pattern generator (CPG) controller, inspired by the mechanism of animals’ rhythmic movements, have been studied. We previously proposed a reinforcement learning (RL) method, called the CPG-actor-critic model, as an autonomous learning framework for a CPG controller. Here, we propose an off-policy natural policy gradient RL algorithm for the CPG-actor-critic model that addresses the “exploration-exploitation” problem by meta-controlling the “behavior policy.” We apply this RL algorithm to an automatic control problem using a biped robot simulator. Computer simulations demonstrate that a CPG controller trained with the new algorithm enables the biped robot to walk stably and efficiently.
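As a rough illustration of the general idea behind an off-policy natural policy gradient, the following minimal sketch updates the mean of a one-dimensional Gaussian target policy using actions drawn from a wider, more exploratory behavior policy. It is not the algorithm of this paper and involves no CPG controller or actor-critic structure: the toy reward, the fixed standard deviations `sigma` and `sigma_b`, the learning rate, and all function names are assumptions chosen only for this example. Importance weights correct for the mismatch between the two policies, and the gradient is rescaled by an estimate of the Fisher information.

```python
import numpy as np


def gaussian_logpdf(a, mean, std):
    """Log-density of a 1-D Gaussian, evaluated elementwise."""
    return -0.5 * ((a - mean) / std) ** 2 - np.log(std) - 0.5 * np.log(2 * np.pi)


def off_policy_npg_step(theta, rng, sigma=0.5, sigma_b=1.0, n=2000, lr=0.5):
    """One natural-gradient update of the target policy mean `theta`.

    Hypothetical toy setting (not from the paper):
      target policy   pi(a) = N(theta, sigma^2)
      behavior policy b(a)  = N(theta, sigma_b^2), wider for exploration
      reward          r(a)  = -(a - 2)^2, maximized at a = 2
    """
    # Sample actions from the behavior policy, not from the target policy.
    a = rng.normal(theta, sigma_b, size=n)
    r = -(a - 2.0) ** 2

    # Self-normalized importance weights pi(a) / b(a).
    w = np.exp(gaussian_logpdf(a, theta, sigma) - gaussian_logpdf(a, theta, sigma_b))
    w /= w.sum()

    # Score function of the target policy with respect to theta.
    score = (a - theta) / sigma ** 2

    # Importance-weighted policy gradient with a reward baseline.
    baseline = np.sum(w * r)
    grad = np.sum(w * score * (r - baseline))

    # Importance-weighted estimate of the Fisher information (scalar here).
    fisher = np.sum(w * score ** 2)

    # Natural gradient step: precondition the gradient by the inverse Fisher.
    return theta + lr * grad / fisher


rng = np.random.default_rng(0)
theta = 0.0
for _ in range(50):
    theta = off_policy_npg_step(theta, rng)
print(theta)  # should end up close to the reward maximum at 2
```

In this sketch the behavior policy is held fixed; meta-controlling it, as the abstract describes, would amount to adapting a quantity such as `sigma_b` during learning.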
This article is published under a Creative Commons Attribution-NoDerivatives 4.0 International License.
Copyright © 2005 by Fuji Technology Press Ltd. and Japan Society of Mechanical Engineers. All rights reserved.