Paper:

# Exploitation-Oriented Learning with Deep Learning – Introducing Profit Sharing to a Deep Q-Network –

## Kazuteru Miyazaki

National Institution for Academic Degrees and Quality Enhancement of Higher Education

1-29-1 Gakuennishimachi, Kodaira, Tokyo 185-8587, Japan

Deep learning is currently attracting significant interest. Deep Q-networks (DQNs), which combine deep learning with Q-learning, have produced excellent results for several Atari 2600 games. In this paper, we propose an exploitation-oriented learning (XoL) method that incorporates deep learning to reduce the number of trial-and-error searches. We focus on profit sharing (PS), which is an XoL method, and combine it with a DQN to obtain the proposed DQNwithPS method. This method is compared with a DQN on Atari 2600 games. We demonstrate that the proposed DQNwithPS method can learn stably with fewer trial-and-error searches than a DQN alone requires.
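As a rough illustration of the profit sharing idea named above, the following minimal tabular sketch reinforces every rule (state-action pair) of an episode when a reward arrives, giving less credit to rules further from the reward. It is not the DQNwithPS method itself, whose details are not given here. The function name `profit_sharing_update`, the geometric decay ratio of 0.5, and the toy episode are all assumptions made for illustration.

```python
from collections import defaultdict


def profit_sharing_update(weights, episode, reward, decay=0.5):
    """Distribute a reward backward over the rules of one episode.

    weights: mapping from (state, action) to accumulated credit.
    episode: list of (state, action) pairs, oldest first.
    decay:   assumed geometric ratio (hypothetical value); each step
             further from the reward receives `decay` times less credit.
    """
    credit = reward
    # Walk from the rule that directly earned the reward back to the start.
    for state, action in reversed(episode):
        weights[(state, action)] += credit
        credit *= decay
    return weights


# Toy usage: a three-step episode that ends with a reward of 1.0.
weights = defaultdict(float)
episode = [("s0", "right"), ("s1", "right"), ("s2", "up")]
profit_sharing_update(weights, episode, reward=1.0, decay=0.5)

for rule, value in weights.items():
    print(rule, value)
```

In this sketch the weights change only when a reward is actually obtained, which is the sense in which such a method exploits successful experience rather than relying on extended trial and error.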

*J. Adv. Comput. Intell. Intell. Inform.*, Vol.21, No.5, pp. 849-855, 2017.
