JACIII Vol.16 No.2 pp. 183-190
doi: 10.20965/jaciii.2012.p0183
(2012)

Paper:

Proposal of the Continuous-Valued Penalty Avoiding Rational Policy Making Algorithm

Kazuteru Miyazaki

Research Department, National Institution for Academic Degrees and University Evaluation, 1-29-1 Gakuennishimachi, Kodaira, Tokyo 187-8587, Japan

Received:
September 4, 2011
Accepted:
December 20, 2011
Published:
March 20, 2012
Keywords:
reinforcement learning, profit sharing, PARP, Exploitation-oriented Learning (XoL)
Abstract
Applying reinforcement learning to actual problems sometimes requires the treatment of continuous-valued input and output. We previously proposed a framework called Exploitation-oriented Learning (XoL) that strongly reinforces successful experiences and thereby reduces the number of trial-and-error searches. A method based on Penalty Avoiding Rational Policy Making (PARP) has been proposed as an XoL method that handles continuous-valued input, but it does not handle actions with continuous-valued output. We study the treatment of continuous-valued output suitable for an XoL method in an environment that includes both a reward and a penalty. We extend PARP from continuous-valued input to continuous-valued output, apply our proposal to the pole-cart balancing problem and a biped LEGO robot, and confirm its effectiveness.
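
As a rough illustration of the Profit Sharing and penalty-avoidance ideas the abstract refers to, the minimal Python sketch below reinforces the rules fired during a rewarded episode with a geometrically decaying credit and suppresses rules judged to lead to a penalty. It is a simplified, discrete tabular approximation intended only for orientation, not the continuous-valued PARP algorithm proposed in the paper; every identifier in it (ProfitSharingAgent, act, end_episode) is hypothetical.

from collections import defaultdict

class ProfitSharingAgent:
    """Illustrative tabular Profit Sharing with penalty-rule suppression.

    A simplified sketch, not the paper's continuous-valued PARP method;
    states and actions are assumed discrete here for brevity.
    """

    def __init__(self, n_actions, decay=None):
        self.n_actions = n_actions
        # Geometric credit-assignment rate; a ratio around 1/n_actions is a
        # common choice related to the rationality condition of Profit Sharing.
        self.decay = decay if decay is not None else 1.0 / n_actions
        self.weights = defaultdict(lambda: [1.0] * n_actions)
        self.penalty_rules = set()   # (state, action) pairs judged to lead to a penalty
        self.episode = []            # rules fired in the current episode

    def act(self, state):
        # Greedy choice over rules not marked as penalty rules; fall back to
        # all actions if every rule in this state has been marked.
        candidates = [a for a in range(self.n_actions)
                      if (state, a) not in self.penalty_rules]
        if not candidates:
            candidates = list(range(self.n_actions))
        action = max(candidates, key=lambda a: self.weights[state][a])
        self.episode.append((state, action))
        return action

    def end_episode(self, reward=0.0, penalty=False):
        if penalty and self.episode:
            # Mark the final rule as a penalty rule so it is avoided later.
            self.penalty_rules.add(self.episode[-1])
        elif reward > 0.0:
            # Distribute the reward backwards along the episode with
            # geometrically decaying credit.
            credit = reward
            for state, action in reversed(self.episode):
                self.weights[state][action] += credit
                credit *= self.decay
        self.episode = []

On the pole-cart balancing task, for instance, one would call act() at every step and call end_episode(penalty=True) when the pole falls, so that the rules immediately preceding the failure are avoided in later episodes; end_episode(reward=1.0) after a sufficiently long balanced run would reinforce the successful episode.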
Cite this article as:
K. Miyazaki, “Proposal of the Continuous-Valued Penalty Avoiding Rational Policy Making Algorithm,” J. Adv. Comput. Intell. Intell. Inform., Vol.16 No.2, pp. 183-190, 2012.
References
[1] R. S. Sutton and A. G. Barto, “Reinforcement Learning: An Introduction,” A Bradford Book, MIT Press, 1998.
[2] V. Heidrich-Meisner and C. Igel, “Evolution Strategies for Direct Policy Search,” Parallel Problem Solving from Nature (PPSN X), Vol.5199 of LNCS, pp. 428-437, Springer-Verlag, 2008.
[3] K. Ikeda, “Exemplar-Based Direct Policy Search with Evolutionary Optimization,” Proc. of the 2005 Congress on Evolutionary Computation (CEC2005), pp. 2357-2364, 2005.
[4] T. Matsui, T. Goto, and K. Izumi, “Acquiring a Government Bond Trading Strategy Using Reinforcement Learning,” J. of Advanced Computational Intelligence and Intelligent Informatics, Vol.13, No.6, pp. 691-696, 2009.
[5] K. Merrick and M. L. Maher, “Motivated Reinforcement Learning for Adaptive Characters in Open-Ended Simulation Games,” Proc. of the Int. Conf. on Advances in Computer Entertainment Technology, pp. 127-134, 2007.
[6] A. Miyamae, J. Sakuma, I. Ono, and S. Kobayashi, “Instance-based Policy Learning by Real-coded Genetic Algorithms and Its Application to Control of Nonholonomic Systems,” J. of The Japanese Society for Artificial Intelligence, Vol.24, No.1, pp. 104-115, 2009 (in Japanese).
[7] J. Randlov and P. Alstrom, “Learning to Drive a Bicycle Using Reinforcement Learning and Shaping,” Proc. of the 15th Int. Conf. on Machine Learning, pp. 463-471, 1998.
[8] P. Stone, R. S. Sutton, and G. Kuhlmann, “Reinforcement Learning toward RoboCup Soccer Keepaway,” Adaptive Behavior, Vol.13, No.3, pp. 165-188, 2005.
[9] J. Yoshimoto, M. Nishimura, Y. Tokita, and S. Ishii, “Acrobot control by learning the switching of multiple controllers,” J. of Artificial Life and Robotics, Vol.9, No.2, pp. 67-71, 2005.
[10] T. Watanabe, K. Miyazaki, and H. Kobayashi, “A New Improved Penalty Avoiding Rational Policy Making Algorithm for Keepaway with Continuous State Spaces,” J. of Advanced Computational Intelligence and Intelligent Informatics, Vol.13, No.6, pp. 675-682, 2009.
[11] K. Miyazaki and S. Kobayashi, “Reinforcement Learning for Penalty Avoiding Policy Making,” Proc. of the 2000 IEEE Int. Conf. on Systems, Man and Cybernetics, pp. 206-211, 2000.
[12] K. Miyazaki and S. Kobayashi, “Exploitation-oriented Learning PS-r#,” J. of Advanced Computational Intelligence and Intelligent Informatics, Vol.13, No.6, pp. 624-630, 2009.
[13] A. Uemura, A. Ueno, and S. Tatsumi, “A Profit Sharing Method for Forgetting Past Experiences Effectively,” Trans. of the Japanese Society for Artificial Intelligence, Vol.21, No.1, pp. 81-93, 2006 (in Japanese).
[14] S. Kato and H. Matuo, “Theory of Profit Sharing in Dynamic Environment,” IEICE Trans. D, Vol.84, No.7, pp. 1067-1075, 2001 (in Japanese).
[15] K. Miyazaki, M. Yamamura, and S. Kobayashi, “A Theory of Profit Sharing in Reinforcement Learning,” J. of The Japanese Society for Artificial Intelligence, Vol.9, No.4, pp. 580-587, 1994 (in Japanese).
[16] D. Tamashima, S. Koatsu, T. Okamoto, and H. Hirata, “Profit Sharing Using a Dynamic Reinforcement Function Considering Expectation Value of Reinforcement,” IEEJ Trans. Electronics, Information and Systems, Vol.129, No.C(7), pp. 1339-1347, 2009 (in Japanese).
[17] K. Miyazaki and S. Kobayashi, “Learning Deterministic Policies in Partially Observable Markov Decision Processes,” Proc. of the 5th Int. Conf. on Intelligent Autonomous Systems, pp. 250-257, 1998.
[18] H. Kimura, “Reinforcement Learning in Multi-Dimensional State-Action Space Using Random Rectangular Coarse Coding and Gibbs Sampling,” Proc. of the IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, pp. 88-95, 2007.
[19] J. C. Santamaria, R. S. Sutton, and A. Ram, “Experiments with Reinforcement Learning in Problems with Continuous State and Action Spaces,” Adaptive Behavior, Vol.6, No.2, pp. 163-218, 1998.
[20] T. Tateyama, S. Kawata, and Y. Shimomura, “A Reinforcement Learning Algorithm for Continuous State Spaces using Multiple Fuzzy-ART Networks,” Proc. of SICE-ICCAS 2006, pp. 2445-2450, 2006.
[21] K. Miyazaki and S. Kobayashi, “A Reinforcement Learning System for Penalty Avoiding in Continuous State Spaces,” J. of Advanced Computational Intelligence and Intelligent Informatics, Vol.11, No.6, pp. 668-676, 2007.
[22] D. Benedettelli, “Creating Cool MINDSTORMS NXT Robots,” Apress, 2008.
