Proposal of the Continuous-Valued Penalty Avoiding Rational Policy Making Algorithm
Research Department, National Institution for Academic Degrees and University Evaluation, 1-29-1 Gakuennishimachi, Kodaira, Tokyo 187-8587, Japan
Applying reinforcement learning to real-world problems sometimes requires handling continuous-valued input and output. We previously proposed Exploitation-oriented Learning (XoL), an approach that strongly reinforces successful experience and thereby reduces the number of trial-and-error searches. A method based on Penalty-Avoiding Rational Policy Making (PARP) has been proposed as an XoL method for continuous-valued input, but it does not handle actions with continuous-valued output. We study how to treat continuous-valued output in an XoL method for environments that include both a reward and a penalty, and we extend PARP for continuous-valued input to continuous-valued output. We apply our proposal to the pole-cart balancing problem and a biped LEGO robot, and confirm its effectiveness.