TD Learning with Neural Networks
Information Science, Osaka-Kyoiku University, 4-698-1 Asahiga-Oka, Kashiwara-shi, Osaka 582-8582, Japan
Temporal difference (TD) learning, proposed by Sutton in the late 1980s, is a prediction method that uses the predictions already obtained to improve predictions of future outcomes. Applying TD learning to neural networks can improve their prediction performance, provided certain problems are solved. The major problems are as follows: 1) The prediction Pt at time t is assumed to be a scalar in Sutton’s original paper, which raises the question of what rule should update the weight vector when the neural network has multiple outputs. 2) How are the individual components of the gradient vector ∇wPt with respect to the weight vector w derived? This paper proposes how to handle these problems when TD learning is used with a neural network, focusing on the TD(0) algorithm, which is often used in TD learning. It proposes a rule for updating the weight vector of a two-output neural network to address problem 1) and explains the rule’s validity. It then proposes a method for computing every component of ∇wPt.
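For orientation, the scalar case that the abstract starts from can be sketched in a few lines. The sketch below is not the two-output rule proposed in this paper. It only illustrates Sutton's TD(0) update, w ← w + α(Pt+1 − Pt)∇wPt, for a single scalar output. The network shape (one sigmoid hidden layer, one sigmoid output, no biases) and the names `predict_with_grad` and `td0_step` are assumptions made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict_with_grad(w_hidden, w_out, x):
    """Return the scalar prediction P and the components of grad_w P.

    Assumed (hypothetical) architecture: sigmoid hidden units, one sigmoid output.
    w_hidden holds one weight row per hidden unit; w_out holds hidden-to-output weights.
    """
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    p = sigmoid(sum(v * hj for v, hj in zip(w_out, h)))
    dp = p * (1.0 - p)  # derivative of the output sigmoid
    # dP/dv_j = dp * h_j
    grad_out = [dp * hj for hj in h]
    # dP/dw_ji = dp * v_j * h_j * (1 - h_j) * x_i  (chain rule through hidden unit j)
    grad_hidden = [[dp * v * hj * (1.0 - hj) * xi for xi in x]
                   for v, hj in zip(w_out, h)]
    return p, grad_hidden, grad_out


def td0_step(w_hidden, w_out, x_t, target, alpha):
    """One TD(0) update: w <- w + alpha * (target - P_t) * grad_w P_t.

    target is the next prediction P_{t+1}, or the final outcome at the last step.
    Returns new weight lists; the inputs are left unchanged.
    """
    p_t, g_hidden, g_out = predict_with_grad(w_hidden, w_out, x_t)
    delta = alpha * (target - p_t)
    new_hidden = [[w + delta * g for w, g in zip(row, g_row)]
                  for row, g_row in zip(w_hidden, g_hidden)]
    new_out = [v + delta * g for v, g in zip(w_out, g_out)]
    return new_hidden, new_out


if __name__ == "__main__":
    w_hidden = [[0.1, -0.2], [0.3, 0.4]]
    w_out = [0.5, -0.5]
    x_t, x_next = [1.0, 0.5], [0.2, 0.9]
    p_next, _, _ = predict_with_grad(w_hidden, w_out, x_next)  # P_{t+1} as target
    w_hidden, w_out = td0_step(w_hidden, w_out, x_t, p_next, alpha=0.5)
    print(predict_with_grad(w_hidden, w_out, x_t)[0])
```

With a scalar prediction the TD error Pt+1 − Pt is a single number that scales every gradient component. With two or more outputs there is one error per output, so this sketch does not extend directly, which is the first problem the paper addresses.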