JACIII Vol.13 No.6 pp. 624-630
doi: 10.20965/jaciii.2009.p0624
(2009)

Paper:

Exploitation-Oriented Learning PS-r#

Kazuteru Miyazaki* and Shigenobu Kobayashi**

*Department of Assessment and Research for Degree Awarding, National Institution for Academic Degrees and University Evaluation, 1-29-1 Gakuennishimachi, Kodaira, Tokyo 187-8587, Japan

**Graduate School of Interdisciplinary Science and Engineering, Tokyo Institute of Technology, 4259 Nagatsuta, Midori-ku, Yokohama, Kanagawa 226-8502, Japan

Received: April 15, 2009
Accepted: June 28, 2009
Published: November 20, 2009
Keywords: reinforcement learning, profit sharing, PS-r*, partially observed Markov decision process, exploitation-oriented learning (XoL)
Abstract

Exploitation-oriented learning (XoL) is a novel approach to goal-directed learning from interaction. Whereas reinforcement learning focuses more on learning and ensures optimality in Markov decision process (MDP) environments, XoL aims to learn a rational policy that obtains rewards continuously and very quickly. PS-r*, a form of XoL, learns a useful rational policy that is not inferior to a random walk in partially observed Markov decision processes (POMDPs) with a single type of reward. PS-r*, however, requires O(MN^2) memory, where N is the number of types of sensory input and M is the number of actions. We propose PS-r#, which learns a useful rational policy in the POMDP using O(MN) memory. The effectiveness of PS-r# is confirmed in numerical examples.
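
To give a rough sense of the difference between the two memory bounds stated in the abstract, the following minimal sketch counts entries for a quadratic and a linear bound at a few input sizes. It is illustrative only: the function name `table_entries`, the chosen values of N and M, and the assumption of a constant factor of one are all hypothetical, and the sketch does not reproduce the data structures of PS-r* or PS-r#.

```python
def table_entries(num_inputs: int, num_actions: int) -> dict:
    """Return illustrative entry counts for the two memory bounds.

    num_inputs  -- N, the number of types of sensory input
    num_actions -- M, the number of actions
    Constant factors hidden by the O-notation are assumed to be 1.
    """
    return {
        "O(MN^2)": num_actions * num_inputs ** 2,  # bound stated for PS-r*
        "O(MN)": num_actions * num_inputs,         # bound stated for PS-r#
    }


if __name__ == "__main__":
    # Hypothetical problem sizes, chosen only to show how the gap grows.
    for n in (10, 100, 1000):
        counts = table_entries(num_inputs=n, num_actions=4)
        ratio = counts["O(MN^2)"] // counts["O(MN)"]
        print(f"N={n}: {counts}, ratio={ratio}")
```

Under these assumptions the ratio between the two counts equals N, so the saving grows in proportion to the number of sensory input types.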

Cite this article as:
K. Miyazaki and S. Kobayashi, “Exploitation-Oriented Learning PS-r#,” J. Adv. Comput. Intell. Intell. Inform., Vol.13, No.6, pp. 624-630, 2009.
