Reinforcement Learning for Penalty Avoidance in Continuous State Spaces
Kazuteru Miyazaki* and Shigenobu Kobayashi**
*Department of Assessment and Research for Degree Awarding, National Institution for Academic Degrees and University Evaluation, 1-29-1 Gakuennishimachi, Kodaira-city, Tokyo 187-8587, Japan
**Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology, 4259 Nagatsuta, Midori-ku, Yokohama, Kanagawa 226-8502, Japan
Reinforcement learning involves learning to adapt to an environment using rewards, a special type of input, as clues. Profit sharing (PS), the rational policy making algorithm (RPM), the penalty avoiding rational policy making algorithm (PARP), and PS-r* are used to obtain rational policies quickly; they are called PS-based methods. Applying reinforcement learning to actual problems sometimes requires handling continuous-valued input. A method based on RPM has been proposed as a PS-based method that handles continuous-valued input, but it assumes that only rewards exist and cannot suitably handle penalties. We studied how to treat continuous-valued input in a PS-based method when the environment includes both rewards and penalties. Specifically, we propose extending PARP to continuous-valued input so that it simultaneously pursues rewards and avoids penalties. We applied the proposed method to the pole-cart balancing problem and confirmed its validity.
-  P. Abbeel and A. Y. Ng, “Exploration and apprenticeship learning in reinforcement learning,” Proceedings of the 22nd International Conference on Machine Learning, pp. 1-8, 2005.
-  L. Chrisman, “Reinforcement Learning with perceptual aliasing: The Perceptual Distinctions Approach,” Proceedings of the 10th National Conference on Artificial Intelligence, pp. 183-188, 1992.
-  H. Kimura and S. Kobayashi, “An analysis of actor/critic algorithms using eligibility traces: reinforcement learning with imperfect value function,” Proceedings of the 15th International Conference on Machine Learning, pp. 278-286, 1998.
-  H. Kimura, “Reinforcement Learning in multi-dimensional state-action space using random tiling and Gibbs sampling,” Transactions of the Society of Instrument and Control Engineers, Vol.42, No.12, 2006 (in Japanese).
-  H. Kita, I. Ono, and S. Kobayashi, “Theoretical Analysis of the Unimodal Normal Distribution Crossover for Real-coded Genetic Algorithm,” Proceedings of 1998 IEEE Int. Conf. on Evolutionary Computation, pp. 529-534, 1998.
-  K. Miyazaki, M. Yamamura, and S. Kobayashi, “On the Rationality of Profit Sharing in Reinforcement Learning,” 3rd International Conference on Fuzzy Logic, Neural Nets and Soft Computing, Iizuka, Japan, pp. 285-288, 1994.
-  K. Miyazaki and S. Kobayashi, “Learning Deterministic Policies in Partially Observable Markov Decision Processes,” International Conference on Intelligent Autonomous Systems (IAS) 5, pp. 250-257, 1998.
-  K. Miyazaki and S. Kobayashi, “Reinforcement Learning for Penalty Avoiding Policy Making,” 2000 IEEE International Conference on Systems, Man, and Cybernetics, pp. 206-211, 2000.
-  K. Miyazaki and S. Kobayashi, “An Extension of Profit Sharing to Partially Observable Markov Decision Processes: Proposition of PS-r* and its Evaluation,” Journal of the Japanese Society for Artificial Intelligence, Vol.18, No.5, pp. 286-296, 2003 (in Japanese).
-  K. Miyazaki and S. Kobayashi, “Reinforcement Learning Systems based on Profit Sharing in Robotics,” Proceedings of the 36th International Symposium on Robotics, 2005.
-  A. Y. Ng, D. Harada, and S. J. Russell, “Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping,” Proceedings of the 16th International Conference on Machine Learning, pp. 278-287, 1999.
-  A. Y. Ng and S. J. Russell, “Algorithms for Inverse Reinforcement Learning,” Proceedings of the 17th International Conference on Machine Learning, pp. 663-670, 2000.
-  I. Ono and S. Kobayashi, “A Real-coded Genetic Algorithm for Function Optimization Using Unimodal Normal Distribution Crossover,” Proceedings of the 7th International Conference on Genetic Algorithms, pp. 246-253, 1997.
-  J. C. Santamaria, R. S. Sutton, and A. Ram, “Experiments with Reinforcement Learning in Problems with Continuous State and Action Spaces,” Adaptive Behavior, Vol.6, No.2, pp. 163-218, 1998.
-  R. S. Sutton, “Learning to Predict by the Methods of Temporal Differences,” Machine Learning, 3, pp. 9-44, 1988.
-  R. S. Sutton and A. Barto, “Reinforcement Learning: An Introduction,” A Bradford Book, The MIT Press, 1998.
-  T. Tateyama, S. Kawata, and Y. Shimomura, “A Reinforcement Learning Algorithm for Continuous State Spaces using Multiple Fuzzy-ART Networks,” Proceedings of SICE-ICCAS 2006, 2006.
-  C. J. H. Watkins and P. Dayan, “Technical note: Q-learning,” Machine Learning, 8, pp. 55-68, 1992.