Proposal of the Continuous-Valued Penalty Avoiding Rational Policy Making Algorithm

Kazuteru Miyazaki

doi:10.20965/jaciii.2012.p0183

single-jc.php

JACIII Vol.16 No.2 pp. 183-190

doi: 10.20965/jaciii.2012.p0183

(2012)

Paper:

Views over last 60 days: 640

Proposal of the Continuous-Valued Penalty Avoiding Rational Policy Making Algorithm

Kazuteru Miyazaki

Research Department, National Institution for Academic Degrees and University Evaluation, 1-29-1 Gakuennishimachi, Kodaira, Tokyo 187-8587, Japan

Received:

September 4, 2011

Accepted:

December 20, 2011

Published:

March 20, 2012

Keywords:

reinforcement learning, profit sharing, PARP, Exploitation-oriented Learning (XoL)

Abstract

Applying reinforcement learning to actual problems, sometimes requires the treatment of continuousvalued input and output. We previously proposed a process called Exploitation-oriented Learning (XoL) to strongly enhance successful experience and thereby reduce the number of trial-and-error searches. A method based on Penalty-Avoiding Rational Policymaking (PARP) is proposed as a XoL method corresponding to continuous-valued input, but types of action treating continuous-valued output are not executed. We study the treatment of continuous-valued output suitable for a XoL method in which the environment includes both a reward and a penalty. We extend PARP in continuous-valued input to continuousvalued output. We apply our proposal to the pole-cart balancing problem and the biped LEGO robot, and confirm its effectiveness.

Cite this article as:

K. Miyazaki, “Proposal of the Continuous-Valued Penalty Avoiding Rational Policy Making Algorithm,” J. Adv. Comput. Intell. Intell. Inform., Vol.16 No.2, pp. 183-190, 2012.

Data files:

References

[1] R. S. Sutton and A. G. Barto, “Reinforcement Learning: An Introduction,” A Bradford Book, MIT Press, 1998.
[2] V. Heidrich-Meisner and C. Igel, “Evolution Strategies for Direct Policy Search,” Parallel Problem Solving from Nature (PPSN X), 5199 of LNCS, pp. 428-437, Springer-Verlag, 2008.
[3] K. Ikeda, “Exemplar-Based Direct Policy Search with Evolutionary Optimization,” Proc. of 2005 Congress on Evolutionary Computation CEC2005, pp. 2357-2364, 2005.
[4] T. Matsui, T. Goto, and K. Izumi, “Acquiring a Government Bond Trading Strategy Using Reinforcement Learning,” J. of Advanced Computational Intelligence and Intelligent Informatics, Vol.13, No.6, pp. 691-696, 2009.
[5] K. Merrick and M. L. Maher, “Motivated Reinforcement Learning for Adaptive Characters in Open-Ended Simulation Games,” Proc. of the Int. Conf. on Advanced in Computer Entertainment Technology, pp. 127-134, 2007.
[6] A.Miyamae, J. Sakuma, I. Ono, and S. Kobayashi, “Instance-based Policy Learning by Real-coded Genetic Algorithms and Its Application to Control of Nonholonomic Systems,” J. of The Japanese Society for Artificial Intelligence, Vol.24, No.1, pp. 104-115, 2009 (in Japanese).
[7] J. Randlov and P. Alstrom, “Learning to Drive a Bicycle Using Reinforcement Learning and Shaping,” Proc. of the 15th Int. Conf. on Machine Learning, pp. 463-471, 1998.
[8] P. Stone, R. S. Sutton, and G. Kuhlamann, “Reinforcement Learning toward RoboCup Soccer Keepaway,” Adaptive Behavior, Vol.13, No.3, pp. 165-188, 2005.
[9] J. Yoshimoto, M. Nishimura, Y. Tokita, and S. Ishii, “Acrobot control by learning the switching of multiple controllers,” J. of Artificial Life and Robotics, Vol.9, No.2, pp. 67-71, 2005.
[10] T. Watanabe, K. Miyazaki, and H. Kobayashi, “A New Improved Penalty Avoiding Rational Policy Making Algorithm for Keepaway with Continuous State Spaces,” J. of Advanced Computational Intelligence and Intelligent Informatics, Vol.13, No.6, pp. 675-682, 2009.
[11] K. Miyazaki and S. Kobayashi, “Reinforcement Learning for Penalty Avoiding Policy Making,” Proc. of the 2000 IEEE Int. Conf. on Systems, Man and Cybernetics, pp. 206-211, 2000.
[12] K. Miyazaki and S. Kobayashi, “Exploitation-oriented Learning PS-r#,” J. of Advanced Computational Intelligence and Intelligent Informatics, Vol.13, No.6, pp. 624-630, 2009.
[13] A. Uemura, A. Ueno, and S. Tatsumi, “A Profit Sharing Method for Forgetting Past Experiences Effectively,” Trans. of the Japanese Society for Artificial Intelligence, Vol.21, No.1, pp. 81-93, 2006 (in Japanese).
[14] S. Kato and H. Matuo, “Theory of Profit Sharing in Dynamic Environment,” IEICE Trans. D, Vol.84, No.7, pp. 1067-1075, 2001 (in Japanese).
[15] K. Miyazaki, M. Yamamura, and S. Kobayashi, “A Theory of Profit Sharing in Reinforcement Learning,” J. of The Japanese Society for Artificial Intelligence, Vol.9, No.4, pp. 580-587, 1994 (in Japanese).
[16] D. Tamashima, S. Koatsu, T. Okamoto, and H. Hirata, “Profit Sharing Using a Dynamic Reinforcement Function Considering Expectation Value of Reinforcement,” IEEJ Trans. Electronics, Information and Systems. Vol.129, No.C(7), pp. 1339-1347, 2009 (in Japanese).
[17] K. Miyazaki and S. Kobayashi, “Learning Deterministic Policies in Partially Observable Markov Decision Processes,” Proc. of the 5th Int. Conf. on Intelligent Autonomous System, pp. 250-257, 1998.
[18] H. Kimura, “Reinforcement Learning in Multi-Dimensional State-Action Space Using Random Rectangular Coarse Coding and Gibbs Sampling,” Proc. of the IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, pp. 88-95, 2007.
[19] J. C. Santamaria, R. S. Sutton, and A. Ram, “Experiments with Reinforcement Learning in Problems with Continuous State and Action Spaces,” Adaptive Behavior, Vol.6, No.2, pp. 163-218, 1998.
[20] T. Tateyama, S. Kawata, and Y. Shimomura, “A Reinforcement Learning Algorithm for Continuous State Spaces using Multiple Fuzzy-ART Networks,” Proc. of SICE-ICCAS 2006, pp. 2445-2450, 2006.
[21] K.Miyazaki and S. Kobayashi, “A Reinforcement Learning System for Penalty Avoiding in Continuous State Spaces,” J. of Advanced Computational Intelligence and Intelligent Informatics, Vol.11, No.6, pp. 668-676, 2007.
[22] D. Benedettelli, “Creating Cool MINDSTORMS NXT Robots,” Apress, 2008.

This article is published under a Creative Commons Attribution-NoDerivatives 4.0 Internationa License.

[1] [1] R. S. Sutton and A. G. Barto, “Reinforcement Learning: An Introduction,” A Bradford Book, MIT Press, 1998.

[2] [2] V. Heidrich-Meisner and C. Igel, “Evolution Strategies for Direct Policy Search,” Parallel Problem Solving from Nature (PPSN X), 5199 of LNCS, pp. 428-437, Springer-Verlag, 2008.

[3] [3] K. Ikeda, “Exemplar-Based Direct Policy Search with Evolutionary Optimization,” Proc. of 2005 Congress on Evolutionary Computation CEC2005, pp. 2357-2364, 2005.

[4] [4] T. Matsui, T. Goto, and K. Izumi, “Acquiring a Government Bond Trading Strategy Using Reinforcement Learning,” J. of Advanced Computational Intelligence and Intelligent Informatics, Vol.13, No.6, pp. 691-696, 2009.

[5] [5] K. Merrick and M. L. Maher, “Motivated Reinforcement Learning for Adaptive Characters in Open-Ended Simulation Games,” Proc. of the Int. Conf. on Advanced in Computer Entertainment Technology, pp. 127-134, 2007.

[6] [6] A.Miyamae, J. Sakuma, I. Ono, and S. Kobayashi, “Instance-based Policy Learning by Real-coded Genetic Algorithms and Its Application to Control of Nonholonomic Systems,” J. of The Japanese Society for Artificial Intelligence, Vol.24, No.1, pp. 104-115, 2009 (in Japanese).

[7] [7] J. Randlov and P. Alstrom, “Learning to Drive a Bicycle Using Reinforcement Learning and Shaping,” Proc. of the 15th Int. Conf. on Machine Learning, pp. 463-471, 1998.

[8] [8] P. Stone, R. S. Sutton, and G. Kuhlamann, “Reinforcement Learning toward RoboCup Soccer Keepaway,” Adaptive Behavior, Vol.13, No.3, pp. 165-188, 2005.

[9] [9] J. Yoshimoto, M. Nishimura, Y. Tokita, and S. Ishii, “Acrobot control by learning the switching of multiple controllers,” J. of Artificial Life and Robotics, Vol.9, No.2, pp. 67-71, 2005.

[10] [10] T. Watanabe, K. Miyazaki, and H. Kobayashi, “A New Improved Penalty Avoiding Rational Policy Making Algorithm for Keepaway with Continuous State Spaces,” J. of Advanced Computational Intelligence and Intelligent Informatics, Vol.13, No.6, pp. 675-682, 2009.

[11] [11] K. Miyazaki and S. Kobayashi, “Reinforcement Learning for Penalty Avoiding Policy Making,” Proc. of the 2000 IEEE Int. Conf. on Systems, Man and Cybernetics, pp. 206-211, 2000.

[12] [12] K. Miyazaki and S. Kobayashi, “Exploitation-oriented Learning PS-r#,” J. of Advanced Computational Intelligence and Intelligent Informatics, Vol.13, No.6, pp. 624-630, 2009.

[13] [13] A. Uemura, A. Ueno, and S. Tatsumi, “A Profit Sharing Method for Forgetting Past Experiences Effectively,” Trans. of the Japanese Society for Artificial Intelligence, Vol.21, No.1, pp. 81-93, 2006 (in Japanese).

[14] [14] S. Kato and H. Matuo, “Theory of Profit Sharing in Dynamic Environment,” IEICE Trans. D, Vol.84, No.7, pp. 1067-1075, 2001 (in Japanese).

[15] [15] K. Miyazaki, M. Yamamura, and S. Kobayashi, “A Theory of Profit Sharing in Reinforcement Learning,” J. of The Japanese Society for Artificial Intelligence, Vol.9, No.4, pp. 580-587, 1994 (in Japanese).

[16] [16] D. Tamashima, S. Koatsu, T. Okamoto, and H. Hirata, “Profit Sharing Using a Dynamic Reinforcement Function Considering Expectation Value of Reinforcement,” IEEJ Trans. Electronics, Information and Systems. Vol.129, No.C(7), pp. 1339-1347, 2009 (in Japanese).

[17] [17] K. Miyazaki and S. Kobayashi, “Learning Deterministic Policies in Partially Observable Markov Decision Processes,” Proc. of the 5th Int. Conf. on Intelligent Autonomous System, pp. 250-257, 1998.

[18] [18] H. Kimura, “Reinforcement Learning in Multi-Dimensional State-Action Space Using Random Rectangular Coarse Coding and Gibbs Sampling,” Proc. of the IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, pp. 88-95, 2007.

[19] [19] J. C. Santamaria, R. S. Sutton, and A. Ram, “Experiments with Reinforcement Learning in Problems with Continuous State and Action Spaces,” Adaptive Behavior, Vol.6, No.2, pp. 163-218, 1998.

[20] [20] T. Tateyama, S. Kawata, and Y. Shimomura, “A Reinforcement Learning Algorithm for Continuous State Spaces using Multiple Fuzzy-ART Networks,” Proc. of SICE-ICCAS 2006, pp. 2445-2450, 2006.

[21] [21] K.Miyazaki and S. Kobayashi, “A Reinforcement Learning System for Penalty Avoiding in Continuous State Spaces,” J. of Advanced Computational Intelligence and Intelligent Informatics, Vol.11, No.6, pp. 668-676, 2007.

[22] [22] D. Benedettelli, “Creating Cool MINDSTORMS NXT Robots,” Apress, 2008.