JRM Vol.17 No.6 pp. 636-644
doi: 10.20965/jrm.2005.p0636


Off-Policy Natural Policy Gradient Method for a Biped Walking Using a CPG Controller

Yutaka Nakamura, Takeshi Mori, Yoichi Tokita,
Tomohiro Shibata, and Shin Ishii

Theoretical Life Science Lab., Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma, Nara 630-0192, Japan

Received: February 10, 2005
Accepted: June 21, 2005
Published: December 20, 2005
Keywords: reinforcement learning, off-policy learning, biped walking, central pattern generator (CPG)

Inspired by the mechanisms underlying animals’ rhythmic movements, motor control schemes using a central pattern generator (CPG) controller have been studied. We previously proposed a reinforcement learning (RL) framework, called the CPG-actor-critic model, for the autonomous learning of a CPG controller. Here, we propose an off-policy natural policy gradient RL algorithm for the CPG-actor-critic model, which alleviates the “exploration-exploitation” problem by meta-controlling the “behavior policy.” We applied this RL algorithm to an automatic control problem using a biped robot simulator. Computer simulations demonstrated that, with our new algorithm, the CPG controller enables the biped robot to walk stably and efficiently.
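The abstract names two ingredients that can be sketched concretely: a natural policy gradient update (the vanilla gradient preconditioned by the inverse Fisher information matrix) and off-policy correction via importance weights between the target policy and a separate behavior policy. The following is a minimal illustrative sketch only, not the paper's CPG-actor-critic implementation: it assumes a hypothetical linear-Gaussian policy and a toy one-dimensional task, and all function and variable names are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

def score(theta, s, a, sigma=1.0):
    # Score function: gradient of log N(a; theta.s, sigma^2) w.r.t. theta
    # for a linear-Gaussian policy (an assumption for this sketch).
    return (a - theta @ s) * s / sigma**2

def natural_pg_update(theta, behavior_theta, episodes, alpha=0.1, sigma=1.0):
    """One off-policy natural policy gradient step.

    episodes: list of (state, action, reward) samples collected under the
    behavior policy parameterized by behavior_theta."""
    grads, weights, rewards = [], [], []
    for s, a, r in episodes:
        grads.append(score(theta, s, a, sigma))
        # Importance weight = target density / behavior density (log form).
        logw = (-(a - theta @ s) ** 2 + (a - behavior_theta @ s) ** 2) / (2 * sigma**2)
        weights.append(np.exp(logw))
        rewards.append(r)
    grads = np.array(grads)
    w = np.array(weights)
    r = np.array(rewards)
    # Importance-weighted vanilla gradient estimate.
    g_hat = ((w * r)[:, None] * grads).mean(axis=0)
    # Fisher information estimated from weighted outer products of scores.
    F = (w[:, None, None] * np.einsum("ni,nj->nij", grads, grads)).mean(axis=0)
    F += 1e-6 * np.eye(len(theta))  # regularize for invertibility
    # Natural gradient: precondition by the inverse Fisher matrix.
    return theta + alpha * np.linalg.solve(F, g_hat)

# Toy demo: 1-D state, reward peaks when action a equals 1.
theta = np.zeros(1)
behavior = np.zeros(1)  # behavior policy; here equal to the target policy
episodes = []
for _ in range(2000):
    s = np.array([1.0])
    a = float(behavior @ s + rng.normal())
    episodes.append((s, a, -(a - 1.0) ** 2))
theta_new = natural_pg_update(theta, behavior, episodes)
```

When the behavior policy differs from the target policy, the importance weights reweight each sample so the gradient estimate remains (approximately) unbiased for the target policy; the paper's meta-control of the behavior policy builds on this same mechanism.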

Cite this article as:
Yutaka Nakamura, Takeshi Mori, Yoichi Tokita,
Tomohiro Shibata, and Shin Ishii, “Off-Policy Natural Policy Gradient Method for a Biped Walking Using a CPG Controller,” J. Robot. Mechatron., Vol.17, No.6, pp. 636-644, 2005.
References:
  [1] S. Grillner, P. Wallen, L. Brodin, and A. Lansner, “Neuronal network generating locomotor behavior in lamprey: circuitry, transmitters, membrane properties and simulations,” Annual Review of Neuroscience, Vol.14, pp. 169-199, 1991.
  [2] Y. Fukuoka, H. Kimura, and A. H. Cohen, “Adaptive dynamic walking of a quadruped robot on irregular terrain based on biological concepts,” International Journal of Robotics Research, Vol.22, No.3-4, pp. 187-202, 2003.
  [3] G. Taga, Y. Yamaguchi, and H. Shimizu, “Self-organized control of bipedal locomotion by neural oscillators in unpredictable environment,” Biological Cybernetics, Vol.65, pp. 147-159, 1991.
  [4] Y. Nakamura, T. Mori, and S. Ishii, “Natural policy gradient reinforcement learning for a CPG control of a biped robot,” International Conference on Parallel Problem Solving from Nature (PPSN VIII), pp. 972-981, 2004.
  [5] S. Kakade, “A natural policy gradient,” Advances in Neural Information Processing Systems, Vol.14, pp. 1531-1538, 2001.
  [6] J. Peters, S. Vijayakumar, and S. Schaal, “Reinforcement learning for humanoid robotics,” Third IEEE International Conference on Humanoid Robotics, Germany, 2003.
  [7] S. B. Thrun, “The role of exploration in learning control with neural networks,” Handbook of Intelligent Control: Neural, Fuzzy and Adaptive Approaches (D. A. White and D. A. Sofge, Eds.), Van Nostrand Reinhold, Florence, Kentucky, 1992.
  [8] S. Ishii, W. Yoshida, and J. Yoshimoto, “Control of exploitation-exploration meta-parameter in reinforcement learning,” Neural Networks, Vol.15, No.4, pp. 665-687, 2002.
  [9] H. Kimura and S. Kobayashi, “An analysis of actor/critic algorithms using eligibility traces: Reinforcement learning with imperfect value function,” Proc. of the 15th International Conference on Machine Learning, pp. 278-286, 1998.
  [10] E. Uchibe and K. Doya, “Competitive-cooperative-concurrent reinforcement learning with importance sampling,” Proc. of the International Conference on Simulation of Adaptive Behavior: From Animals to Animats, pp. 287-296, 2004.
  [11] C. R. Shelton, “Policy improvement for POMDPs using normalized importance sampling,” Proc. of the 17th Conference on Uncertainty in Artificial Intelligence, pp. 496-503, 2001.
  [12] D. Precup, R. S. Sutton, and S. Dasgupta, “Off-policy temporal-difference learning with function approximation,” Proc. of the 18th International Conference on Machine Learning, pp. 417-424, 2001.
  [13] R. S. Sutton and A. G. Barto, “Reinforcement Learning: An Introduction,” MIT Press, 1998.
  [14] M. Sato and S. Ishii, “Reinforcement learning based on on-line EM algorithm,” Advances in Neural Information Processing Systems, Vol.11, pp. 1052-1058, 1999.
  [15] M. Sato, Y. Nakamura, and S. Ishii, “Reinforcement learning for biped locomotion,” International Conference on Artificial Neural Networks (ICANN 2002), pp. 777-782, 2002.
  [16] V. R. Konda and J. N. Tsitsiklis, “Actor-critic algorithms,” SIAM Journal on Control and Optimization, Vol.42, No.4, pp. 1143-1166, 2003.
  [17] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” Advances in Neural Information Processing Systems, Vol.12, pp. 1057-1063, 2000.
  [18] S. J. Bradtke and A. G. Barto, “Linear least-squares algorithms for temporal difference learning,” Machine Learning, Vol.22, pp. 33-57, 1996.
  [19] J. Yoshimoto, S. Ishii, and M. Sato, “System identification based on on-line variational Bayes method and its application to reinforcement learning,” Artificial Neural Networks and Neural Information Processing (ICANN/ICONIP 2003), LNCS 2714, Springer-Verlag, pp. 123-131, 2003.
  [20] D. J. C. MacKay, “Information Theory, Inference, and Learning Algorithms,” Cambridge University Press, 2003.
