Paper:

# Reinforcement Learning for Penalty Avoidance in Continuous State Spaces

## Kazuteru Miyazaki^{*} and Shigenobu Kobayashi^{**}

^{*}Department of Assessment and Research for Degree Awarding, National Institution for Academic Degrees and University Evaluation, 1-29-1 Gakuennishimachi, Kodaira-city, Tokyo 187-8587, Japan

^{**}Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology, 4259 Nagatsuta, Midori-ku, Yokohama, Kanagawa 226-8502, Japan

Reinforcement learning involves learning to adapt to environments through the presentation of rewards – special input – serving as clues. To obtain quick rational policies, profit sharing (PS) [6], rational policy making algorithm (RPM) [7], penalty avoiding rational policy making algorithm (PARP) [8], and PS-r* [9] are used. They are called PS-based methods. When applying reinforcement learning to actual problems, treatment of continuous-valued input is sometimes required. A method [10] based on RPM is proposed as a PS-based method corresponding to the continuous-valued input, but only rewards exist and penalties cannot be suitably handled. We studied the treatment of continuous-valued input suitable for a PS-based method in which the environment includes both rewards and penalties. Specifically, we propose having PARP correspond to continuous-valued input while simultaneously targeting the attainment of rewards and avoiding penalties. We applied our proposal to the pole-cart balancing problem and confirmed its validity.

*J. Adv. Comput. Intell. Intell. Inform.*, Vol.11, No.6, pp. 668-676, 2007.

- [1] P. Abbeel and A. Y. Ng, “Exploration and apprenticeship learning in reinforcement learning,” Proceedings of the 21st International Conference on Machine Learning, pp. 1-8, 2005.
- [2] L. Chrisman, “Reinforcement Learning with perceptual aliasing: The Perceptual Distinctions Approach,” Proceedings of the 10th National Conference on Artificial Intelligence, pp. 183-188, 1992.
- [3] H. Kimura and S. Kobayashi, “An analysis of actor/critic algorithms using eligibility traces: reinforcement learning with imperfect value function,” Proceedings of the 15th International Conference on Machine Learning, pp. 278-286, 1998.
- [4] H. Kimura, “Reinforcement Learning in multi-dimensional stateaction space using random tiling and Gibbs sampling,” Transactionof the Society of Instrument and Control Engineers, Vol.42, No.12, 2006 (in Japanese).
- [5] H. Kita, I. Ono, and S. Kobayashi, “Theoretical Analysis of the Unimodal Normal Distribution Crossover for Real-coded Genetic Algorithm,” Proceedings of 1998 IEEE Int. Conf. on Evolutionary Computation, pp. 529-534, 1998.
- [6] K. Miyazaki, M. Yamamura, and S. Kobayashi, “On the Rationality of Profit Sharing in Reinforcement Learning,” 3rd International Conference on Fuzzy Logic, Neural Nets and Soft Computing, Iizuka, Japan, pp. 285-288, 1994.
- [7] K. Miyazaki and S. Kobayashi, “Learning Deterministic Policies in Partially Observable Markov Decision Processes,” International Conference on Intelligent Autonomous System (IAS) 5, pp. 250-257, 1998.
- [8] K. Miyazaki and S. Kobayashi, “Reinforcement Learning for Penalty Avoiding Policy Making,” 2000 IEEE International Conference on Systems, Man, and Cybernetics, pp. 206-211, 2000.
- [9] K. Miyazaki and S. Kobayashi, “An Extension of Profit Sharing to Partially Observable Markov Decision Processes: Proposition of PS-r
^{*}and its Evaluation,” Journal of the Japanese Society for Artificial Intelligence, Vol.18, No.5, pp. 286-296, 2003 (in Japanese). - [10] K. Miyazaki and S. Kobayashi, “Reinforcement Learning Systems based on Profit Sharing in Robotics,” Proceedings of the 36th International Symposium on Robotics, 2005.
- [11] A. Y. Ng, D. Harada, and S. J. Russell, “Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping,” Proceedings of the 17th International Conference on Machine Learning, pp. 278-287, 1999.
- [12] A. Y. Ng and S. J. Russell, “Algorithms for Inverse Reinforcement Learning,” Proceedings of the 17th International Conference on Machine Learning, pp. 663-670, 2000.
- [13] I. Ono and S. Kobayashi, “A Real-coded Genetic Algorithm for Function Optimization Using Unimodal Normal Distribution Crossover,” Proceedings of the 7th International Conference on Genetic Algorithms, pp. 246-253, 1997.
- [14] J. C. Santamaria, R. S. Sutton, and A. Ram, “Experiments with Reinforcement Learning in Problems with Continuous State and Action Spaces,” Adaptive Behavior, Vol.6, No.2, pp. 163-218, 1998.
- [15] R. S. Sutton, “Learning to Predict by the Methods of Temporal Differences,” Machine Learning, 3, pp. 9-44, 1988.
- [16] R. S. Sutton and A. Barto, “Reinforcement Learning: An Introduction,” A Bradford Book, The MIT Press, 1998.
- [17] T. Tateyama, S. Kawata, and Y. Shimomura, “A Reinforcement Learning Algorithm for Continuous State Spaces using Multiple Fuzzy-ART Networks,” Proceedings of SICE-ICCAS 2006, 2006.
- [18] C. J. H. Watkins and P. Dayan, “Technical note: Q-learning,” Machine Learning, 8, pp. 55-68, 1992.