Exploitation-Oriented Learning PS-r^#

Kazuteru Miyazaki; Shigenobu Kobayashi

doi:10.20965/jaciii.2009.p0624

JACIII Vol.13 No.6 pp. 624-630

(2009)

Paper:

Kazuteru Miyazaki^* and Shigenobu Kobayashi^**

^*Department of Assessment and Research for Degree Awarding, National Institution for Academic Degrees and University Evaluation, 1-29-1 Gakuennishimachi, Kodaira, Tokyo 187-8587, Japan

^**Graduate School of Interdisciplinary Science and Engineering, Tokyo Institute of Technology, 4259 Nagatsuta, Midori-ku, Yokohama, Kanagawa 226-8502, Japan

Received:

April 15, 2009

Accepted:

June 28, 2009

Published:

November 20, 2009

Keywords:

reinforcement learning, profit sharing, PS-r^*, partially observed Markov decision process, exploitation-oriented learning (XoL)

Abstract

Exploitation-oriented learning (XoL) is a novel approach to goal-directed learning from interaction. Whereas reinforcement learning is focused more on learning and ensures optimality in Markov decision process (MDP) environments, XoL aims to learn a rational policy that obtains rewards continuously and very quickly. PS-r^*, a form of XoL, learns a useful rational policy that is not inferior to a random walk in partially observed Markov decision processes (POMDPs) with a single type of reward. PS-r^*, however, requires O(MN²) memory, where N is the number of sensory input types and M is the number of actions. We propose PS-r^#, which learns a useful rational policy in the POMDP using O(MN) memory. The effectiveness of PS-r^# is confirmed in numerical examples.
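
To make the memory comparison concrete, the following minimal Python sketch counts table entries under the two growth orders stated in the abstract, O(MN²) and O(MN). It is illustrative arithmetic only and is not the authors' implementation of PS-r^* or PS-r^#: the function name `table_entries`, the unit constant factors, and the choice of M = 4 actions are assumptions made for the example.

```python
def table_entries(num_inputs: int, num_actions: int) -> dict:
    """Count entries for tables that grow as M*N^2 and as M*N.

    Assumption: constant factors are taken to be 1. The abstract gives
    only the asymptotic orders, not exact table layouts.
    """
    return {
        "order_MN2": num_actions * num_inputs ** 2,  # O(MN^2) growth
        "order_MN": num_actions * num_inputs,        # O(MN) growth
    }


if __name__ == "__main__":
    # Hypothetical problem sizes: N sensory input types, M = 4 actions.
    for n in (10, 100, 1000):
        counts = table_entries(num_inputs=n, num_actions=4)
        ratio = counts["order_MN2"] // counts["order_MN"]
        print(f"N={n}: MN^2={counts['order_MN2']}, MN={counts['order_MN']}, ratio={ratio}")
```

Under these assumptions the ratio between the two counts equals N, so the saving grows linearly with the number of sensory input types.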

Cite this article as:

K. Miyazaki and S. Kobayashi, “Exploitation-Oriented Learning PS-r^#,” J. Adv. Comput. Intell. Intell. Inform., Vol.13 No.6, pp. 624-630, 2009.

This article is published under a Creative Commons Attribution-NoDerivatives 4.0 International License.
