JACIII Vol.11 No.6 pp. 668-676
doi: 10.20965/jaciii.2007.p0668


Reinforcement Learning for Penalty Avoidance in Continuous State Spaces

Kazuteru Miyazaki* and Shigenobu Kobayashi**

*Department of Assessment and Research for Degree Awarding, National Institution for Academic Degrees and University Evaluation, 1-29-1 Gakuennishimachi, Kodaira-city, Tokyo 187-8587, Japan

**Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology, 4259 Nagatsuta, Midori-ku, Yokohama, Kanagawa 226-8502, Japan

Received: January 15, 2007
Accepted: March 19, 2007
Published: July 20, 2007
Keywords: reinforcement learning, Profit Sharing, continuous state spaces
Reinforcement learning is a form of learning in which an agent adapts to its environment using rewards, a special kind of input, as clues. Profit sharing (PS) [6], the rational policy making algorithm (RPM) [7], the penalty avoiding rational policy making algorithm (PARP) [8], and PS-r* [9] are used to obtain rational policies quickly; these are called PS-based methods. Applying reinforcement learning to real problems sometimes requires handling continuous-valued input. A PS-based method for continuous-valued input, built on RPM, has been proposed [10], but it assumes that only rewards exist and cannot handle penalties suitably. We study how a PS-based method should treat continuous-valued input when the environment includes both rewards and penalties. Specifically, we propose extending PARP to continuous-valued input so that it pursues rewards and avoids penalties at the same time. We applied the proposed method to the pole-cart balancing problem and confirmed its validity.
Cite this article as:
K. Miyazaki and S. Kobayashi, “Reinforcement Learning for Penalty Avoidance in Continuous State Spaces,” J. Adv. Comput. Intell. Intell. Inform., Vol.11 No.6, pp. 668-676, 2007.
References:
[1] P. Abbeel and A. Y. Ng, “Exploration and Apprenticeship Learning in Reinforcement Learning,” Proceedings of the 22nd International Conference on Machine Learning, pp. 1-8, 2005.
[2] L. Chrisman, “Reinforcement Learning with Perceptual Aliasing: The Perceptual Distinctions Approach,” Proceedings of the 10th National Conference on Artificial Intelligence, pp. 183-188, 1992.
[3] H. Kimura and S. Kobayashi, “An Analysis of Actor/Critic Algorithms using Eligibility Traces: Reinforcement Learning with Imperfect Value Function,” Proceedings of the 15th International Conference on Machine Learning, pp. 278-286, 1998.
[4] H. Kimura, “Reinforcement Learning in Multi-dimensional State-action Space using Random Tiling and Gibbs Sampling,” Transactions of the Society of Instrument and Control Engineers, Vol.42, No.12, 2006 (in Japanese).
[5] H. Kita, I. Ono, and S. Kobayashi, “Theoretical Analysis of the Unimodal Normal Distribution Crossover for Real-coded Genetic Algorithm,” Proceedings of 1998 IEEE Int. Conf. on Evolutionary Computation, pp. 529-534, 1998.
[6] K. Miyazaki, M. Yamamura, and S. Kobayashi, “On the Rationality of Profit Sharing in Reinforcement Learning,” 3rd International Conference on Fuzzy Logic, Neural Nets and Soft Computing, Iizuka, Japan, pp. 285-288, 1994.
[7] K. Miyazaki and S. Kobayashi, “Learning Deterministic Policies in Partially Observable Markov Decision Processes,” International Conference on Intelligent Autonomous System (IAS) 5, pp. 250-257, 1998.
[8] K. Miyazaki and S. Kobayashi, “Reinforcement Learning for Penalty Avoiding Policy Making,” 2000 IEEE International Conference on Systems, Man, and Cybernetics, pp. 206-211, 2000.
[9] K. Miyazaki and S. Kobayashi, “An Extension of Profit Sharing to Partially Observable Markov Decision Processes: Proposition of PS-r* and its Evaluation,” Journal of the Japanese Society for Artificial Intelligence, Vol.18, No.5, pp. 286-296, 2003 (in Japanese).
[10] K. Miyazaki and S. Kobayashi, “Reinforcement Learning Systems based on Profit Sharing in Robotics,” Proceedings of the 36th International Symposium on Robotics, 2005.
[11] A. Y. Ng, D. Harada, and S. J. Russell, “Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping,” Proceedings of the 16th International Conference on Machine Learning, pp. 278-287, 1999.
[12] A. Y. Ng and S. J. Russell, “Algorithms for Inverse Reinforcement Learning,” Proceedings of the 17th International Conference on Machine Learning, pp. 663-670, 2000.
[13] I. Ono and S. Kobayashi, “A Real-coded Genetic Algorithm for Function Optimization Using Unimodal Normal Distribution Crossover,” Proceedings of the 7th International Conference on Genetic Algorithms, pp. 246-253, 1997.
[14] J. C. Santamaria, R. S. Sutton, and A. Ram, “Experiments with Reinforcement Learning in Problems with Continuous State and Action Spaces,” Adaptive Behavior, Vol.6, No.2, pp. 163-218, 1998.
[15] R. S. Sutton, “Learning to Predict by the Methods of Temporal Differences,” Machine Learning, Vol.3, pp. 9-44, 1988.
[16] R. S. Sutton and A. G. Barto, “Reinforcement Learning: An Introduction,” A Bradford Book, The MIT Press, 1998.
[17] T. Tateyama, S. Kawata, and Y. Shimomura, “A Reinforcement Learning Algorithm for Continuous State Spaces using Multiple Fuzzy-ART Networks,” Proceedings of SICE-ICCAS 2006, 2006.
[18] C. J. H. Watkins and P. Dayan, “Technical Note: Q-learning,” Machine Learning, Vol.8, pp. 55-68, 1992.
