Introduction of Fixed Mode States into Online Reinforcement Learning with Penalties and Rewards and its Application to Biped Robot Waist Trajectory Generation
Seiya Kuroda*, Kazuteru Miyazaki**, and Hiroaki Kobayashi***
*Panasonic Factory Solutions Co., Ltd., 1375 Kamisukiawara, Showa-cho, Nakakoma-gun, Yamanashi 409-3895, Japan
**Research Department, National Institution for Academic Degrees and University Evaluation, 1-29-1 Gakuennishimachi, Kodaira, Tokyo 187-8587, Japan
***Department of Mechanical Engineering Informatics, Meiji University, 1-1-1 Higashimita Tama-ku, Kawasaki, Kanagawa 214-8571, Japan
In long-term reinforcement learning tasks, learning efficiency degrades severely because the agent's probabilistic action selection often causes the task to fail, making it difficult to reach the goal and receive a reward. To address this problem, this paper proposes a fixed mode state. When the agent acquires an adequate reward, the normal state is switched to a fixed mode state, in which the agent selects actions greedily, i.e., it deterministically selects the action with the highest weight. First, the paper combines Online Profit Sharing reinforcement learning with the Penalty Avoiding Rational Policy Making algorithm and then introduces fixed mode states into the combined method. The target task is then formulated: learning a modified waist trajectory for dynamically stable walking, based on the statically stable walking of a biped robot. Finally, simulation results are presented and the effectiveness of the proposed method is discussed.
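The difference between a normal state and a fixed mode state can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the roulette (weight-proportional) selection used for normal states, and the simple reward threshold used as the switching rule are all assumptions made for illustration, since the abstract only says that an "adequate reward" triggers the switch and that fixed mode selects the highest-weight action.

```python
import random


def select_action(weights, fixed_mode, rng=random):
    """Return an action index for one state.

    weights    -- list of non-negative action weights for this state
    fixed_mode -- True if the state has been switched to fixed mode
    """
    if fixed_mode:
        # Fixed mode: greedy, deterministic choice of the highest-weight action.
        return max(range(len(weights)), key=lambda a: weights[a])
    # Normal state: probabilistic choice. Roulette selection proportional to
    # the weights is an assumption here, not taken from the abstract.
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def update_mode(fixed_mode, reward, threshold):
    """Hypothetical switching rule: enter fixed mode once the acquired
    reward reaches a threshold, and stay there afterwards."""
    return fixed_mode or reward >= threshold


# Usage: a state with three actions and made-up weights.
weights = [0.1, 0.7, 0.2]
mode = update_mode(False, reward=1.0, threshold=0.5)  # switches to fixed mode
action = select_action(weights, mode)                 # always picks index 1
```

In this sketch a state in fixed mode always returns the same action, which removes the chance of a random exploratory step breaking a long episode, while normal states keep exploring.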