
JRM Vol.17 No.6 pp. 636-644
doi: 10.20965/jrm.2005.p0636
(2005)

Paper:

Off-Policy Natural Policy Gradient Method for a Biped Walking Using a CPG Controller

Yutaka Nakamura, Takeshi Mori, Yoichi Tokita,
Tomohiro Shibata, and Shin Ishii

Theoretical Life Science Lab., Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma, Nara 630-0192, Japan

Received:
February 10, 2005
Accepted:
June 21, 2005
Published:
December 20, 2005
Keywords:
reinforcement learning, off-policy learning, biped walking, central pattern generator (CPG)
Abstract
Inspired by the mechanisms underlying animals’ rhythmic movements, motor control schemes using a central pattern generator (CPG) controller have been studied. We previously proposed a reinforcement learning (RL) framework called the CPG-actor-critic model as an autonomous learning scheme for a CPG controller. Here, we propose an off-policy natural policy gradient RL algorithm for the CPG-actor-critic model, which addresses the “exploration-exploitation” problem by meta-controlling the “behavior policy.” We apply this RL algorithm to an automatic control problem using a biped robot simulator. Computer simulations demonstrated that, trained by our new algorithm, the CPG controller enables the biped robot to walk stably and efficiently.
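To make the abstract's ingredients concrete, the following is a minimal, self-contained Python sketch of a generic off-policy natural policy gradient update: actions are drawn from a broader "behavior policy," importance weights correct the samples toward the target policy, and the update applies the inverse of an estimated Fisher information matrix to the ordinary policy gradient. It is not the paper's CPG-actor-critic implementation; the linear-Gaussian policy, `behavior_sigma`, and `toy_reward` are illustrative assumptions only.

```python
# Sketch of an off-policy natural policy gradient update (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

STATE_DIM = 3
theta = np.zeros(STATE_DIM)   # target-policy parameters (linear Gaussian mean)
sigma = 0.5                   # target-policy noise
behavior_sigma = 1.0          # broader noise: the exploratory "behavior policy"

def log_prob(a, s, w, sd):
    """Log density of a Gaussian policy N(w.s, sd^2) at action a."""
    mu = w @ s
    return -0.5 * ((a - mu) / sd) ** 2 - np.log(sd) - 0.5 * np.log(2 * np.pi)

def grad_log_prob(a, s, w, sd):
    """Gradient of the target-policy log density with respect to w."""
    return (a - w @ s) / sd**2 * s

def toy_reward(s, a):
    """Synthetic one-step reward, used only to make the sketch runnable."""
    return -(a - 0.8 * s[0]) ** 2

for it in range(200):
    grads, weights, rewards = [], [], []
    for _ in range(100):
        s = rng.normal(size=STATE_DIM)
        # sample the action from the BEHAVIOR policy, not the target policy
        a = theta @ s + behavior_sigma * rng.normal()
        # importance weight: target density / behavior density
        w_is = np.exp(log_prob(a, s, theta, sigma) -
                      log_prob(a, s, theta, behavior_sigma))
        grads.append(grad_log_prob(a, s, theta, sigma))
        weights.append(w_is)
        rewards.append(toy_reward(s, a))
    G = np.array(grads)
    w_is = np.array(weights)
    r = np.array(rewards)
    # importance-weighted vanilla policy gradient estimate
    g = (w_is * r) @ G / w_is.sum()
    # importance-weighted Fisher information estimate (small ridge for stability)
    F = (G * w_is[:, None]).T @ G / w_is.sum() + 1e-6 * np.eye(STATE_DIM)
    # natural gradient step: F^{-1} g
    theta += 0.1 * np.linalg.solve(F, g)

print("learned parameters:", theta)
```

In this toy setting the parameters converge toward the reward optimum even though every sample is generated by the broader behavior policy; the natural gradient rescales the update by the local geometry of the policy family rather than by the raw parameterization.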
Cite this article as:
Y. Nakamura, T. Mori, Y. Tokita, T. Shibata, and S. Ishii, “Off-Policy Natural Policy Gradient Method for a Biped Walking Using a CPG Controller,” J. Robot. Mechatron., Vol.17 No.6, pp. 636-644, 2005.
