Reinforcement Learning for Penalty Avoidance in Continuous State Spaces

Kazuteru Miyazaki; Shigenobu Kobayashi

doi:10.20965/jaciii.2007.p0668

JACIII Vol.11 No.6 pp. 668-676 (2007)

Paper:

Kazuteru Miyazaki^* and Shigenobu Kobayashi^**

^*Department of Assessment and Research for Degree Awarding, National Institution for Academic Degrees and University Evaluation, 1-29-1 Gakuennishimachi, Kodaira-city, Tokyo 187-8587, Japan

^**Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology, 4259 Nagatsuta, Midori-ku, Yokohama, Kanagawa 226-8502, Japan

Received: January 15, 2007

Accepted: March 19, 2007

Published: July 20, 2007

Keywords: reinforcement learning, Profit Sharing, continuous state spaces

Abstract

Reinforcement learning is learning to adapt to an environment using rewards, a special kind of input, as clues. To obtain rational policies quickly, profit sharing (PS) [6], the rational policy making algorithm (RPM) [7], the penalty avoiding rational policy making algorithm (PARP) [8], and PS-r* [9] are used; these are called PS-based methods. Applying reinforcement learning to real problems sometimes requires handling continuous-valued input. A method based on RPM [10] has been proposed as a PS-based method for continuous-valued input, but it assumes that only rewards exist and cannot suitably handle penalties. We studied how to treat continuous-valued input in a PS-based method when the environment includes both rewards and penalties. Specifically, we propose extending PARP to continuous-valued input so that it simultaneously pursues rewards and avoids penalties. We applied our proposal to the pole-cart balancing problem and confirmed its validity.

Cite this article as:

K. Miyazaki and S. Kobayashi, “Reinforcement Learning for Penalty Avoidance in Continuous State Spaces,” J. Adv. Comput. Intell. Intell. Inform., Vol.11 No.6, pp. 668-676, 2007.

References

[1] P. Abbeel and A. Y. Ng, “Exploration and apprenticeship learning in reinforcement learning,” Proceedings of the 22nd International Conference on Machine Learning, pp. 1-8, 2005.
[2] L. Chrisman, “Reinforcement Learning with perceptual aliasing: The Perceptual Distinctions Approach,” Proceedings of the 10th National Conference on Artificial Intelligence, pp. 183-188, 1992.
[3] H. Kimura and S. Kobayashi, “An analysis of actor/critic algorithms using eligibility traces: reinforcement learning with imperfect value function,” Proceedings of the 15th International Conference on Machine Learning, pp. 278-286, 1998.
[4] H. Kimura, “Reinforcement Learning in multi-dimensional state-action space using random tiling and Gibbs sampling,” Transactions of the Society of Instrument and Control Engineers, Vol.42, No.12, 2006 (in Japanese).
[5] H. Kita, I. Ono, and S. Kobayashi, “Theoretical Analysis of the Unimodal Normal Distribution Crossover for Real-coded Genetic Algorithm,” Proceedings of 1998 IEEE Int. Conf. on Evolutionary Computation, pp. 529-534, 1998.
[6] K. Miyazaki, M. Yamamura, and S. Kobayashi, “On the Rationality of Profit Sharing in Reinforcement Learning,” 3rd International Conference on Fuzzy Logic, Neural Nets and Soft Computing, Iizuka, Japan, pp. 285-288, 1994.
[7] K. Miyazaki and S. Kobayashi, “Learning Deterministic Policies in Partially Observable Markov Decision Processes,” International Conference on Intelligent Autonomous System (IAS) 5, pp. 250-257, 1998.
[8] K. Miyazaki and S. Kobayashi, “Reinforcement Learning for Penalty Avoiding Policy Making,” 2000 IEEE International Conference on Systems, Man, and Cybernetics, pp. 206-211, 2000.
[9] K. Miyazaki and S. Kobayashi, “An Extension of Profit Sharing to Partially Observable Markov Decision Processes: Proposition of PS-r^* and its Evaluation,” Journal of the Japanese Society for Artificial Intelligence, Vol.18, No.5, pp. 286-296, 2003 (in Japanese).
[10] K. Miyazaki and S. Kobayashi, “Reinforcement Learning Systems based on Profit Sharing in Robotics,” Proceedings of the 36th International Symposium on Robotics, 2005.
[11] A. Y. Ng, D. Harada, and S. J. Russell, “Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping,” Proceedings of the 16th International Conference on Machine Learning, pp. 278-287, 1999.
[12] A. Y. Ng and S. J. Russell, “Algorithms for Inverse Reinforcement Learning,” Proceedings of the 17th International Conference on Machine Learning, pp. 663-670, 2000.
[13] I. Ono and S. Kobayashi, “A Real-coded Genetic Algorithm for Function Optimization Using Unimodal Normal Distribution Crossover,” Proceedings of the 7th International Conference on Genetic Algorithms, pp. 246-253, 1997.
[14] J. C. Santamaria, R. S. Sutton, and A. Ram, “Experiments with Reinforcement Learning in Problems with Continuous State and Action Spaces,” Adaptive Behavior, Vol.6, No.2, pp. 163-218, 1998.
[15] R. S. Sutton, “Learning to Predict by the Methods of Temporal Differences,” Machine Learning, 3, pp. 9-44, 1988.
[16] R. S. Sutton and A. Barto, “Reinforcement Learning: An Introduction,” A Bradford Book, The MIT Press, 1998.
[17] T. Tateyama, S. Kawata, and Y. Shimomura, “A Reinforcement Learning Algorithm for Continuous State Spaces using Multiple Fuzzy-ART Networks,” Proceedings of SICE-ICCAS 2006, 2006.
[18] C. J. H. Watkins and P. Dayan, “Technical note: Q-learning,” Machine Learning, 8, pp. 279-292, 1992.

This article is published under a Creative Commons Attribution-NoDerivatives 4.0 International License.
