JACIII Vol.21 No.5 pp. 939-947
doi: 10.20965/jaciii.2017.p0939


Adaptation Method of the Exploration Ratio Based on the Orientation of Equilibrium in Multi-Agent Reinforcement Learning Under Non-Stationary Environments

Takuya Okano* and Itsuki Noda**

*Fujitsu Limited
4-1-1 Kamikodanaka, Nakahara-ku, Kawasaki-shi, Kanagawa 211-8588, Japan

**National Institute of Advanced Industrial Science and Technology (AIST)
1-1-1 Umezono, Tsukuba, Ibaraki 305-8560, Japan

Received: March 19, 2017
Accepted: July 21, 2017
Published: September 20, 2017
Keywords: reinforcement learning, exploration ratio, multi-agent learning

In this paper, we propose a method for adapting the exploration ratio in multi-agent reinforcement learning. Adapting the exploration ratio is important in multi-agent learning because it is one of the key parameters that affect learning performance. In our observations, the adaptation method can adjust the exploration ratio suitably (but not optimally) according to the characteristics of the environment. We investigated the evolutionary adaptation of the exploration ratio in multi-agent learning. We conducted several experiments in which the exploration ratio was adapted in a simple evolutionary way, namely, mimicking advantageous exploration ratio (MAER), and confirmed that MAER always acquires a relatively lower exploration ratio than the optimal value for the change ratio of the environment. We also propose a second evolutionary adaptation method, namely, win or update exploration ratio (WoUE). The experimental results showed that WoUE can acquire a more suitable exploration ratio than MAER, and that the obtained ratio was near-optimal.
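To make the term "exploration ratio" concrete, the following minimal Python sketch shows epsilon-greedy action selection, where the exploration ratio is the probability of taking a random action, together with a simple "copy the more successful agent's ratio" step. The abstract does not give the update rules of MAER or WoUE, so `mimic_step`, its peer-sampling scheme, and its `noise` parameter are hypothetical illustrations of an evolutionary, mimicking-style adaptation and should not be read as the authors' algorithms.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Return a random action with probability epsilon, else the greedy action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def mimic_step(epsilons, rewards, rng, noise=0.01):
    """Hypothetical evolutionary step (an assumption, not the paper's rule).

    Each agent compares itself with one randomly chosen peer and copies the
    peer's exploration ratio, perturbed by small uniform noise, if the peer
    earned a strictly higher reward. Ratios are clipped to [0, 1].
    """
    n = len(epsilons)
    updated = list(epsilons)
    for i in range(n):
        j = rng.randrange(n)
        if rewards[j] > rewards[i]:
            mutated = epsilons[j] + rng.uniform(-noise, noise)
            updated[i] = min(1.0, max(0.0, mutated))
    return updated


if __name__ == "__main__":
    rng = random.Random(0)
    action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.1, rng=rng)
    new_eps = mimic_step([0.05, 0.2, 0.4], rewards=[1.0, 3.0, 2.0], rng=rng)
    print(action, new_eps)
```

In this sketch all agents share the same adaptation rule and differ only in their current exploration ratio, so the population of ratios drifts toward whatever values were recently rewarded. Whether that drift ends near the optimum is exactly the question the abstract raises for MAER.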



Last updated on Oct. 20, 2017