I am trying to set up a Deep Deterministic Policy Gradient algorithm (Reinforcement Learning) with Flux. Let me give a context-free explanation of what I am trying to achieve.
Say I have two networks A and C. Network A takes as input an arbitrary vector that we call `s` and returns a vector `a`. C takes the concatenation `vcat(s, a)` as input and outputs a predicted value of taking action `a` when state `s` is observed.
I want to train A so that it learns to output actions that will have a good value as predicted by C. Assume that C is already properly trained. Here is what I did:
```julia
using Flux, Statistics
using Flux.Tracker

opt = ADAM()
A = Chain(Dense(5, 3, relu), Dense(3, 2, relu))  # size of a is (2, 1)
C = Chain(Dense(7, 4, relu), Dense(4, 1))        # 7 inputs is 5 + 2 (s + a)

# ... assume C is trained here

loss(s, A, C) = -mean(C(vcat(s, A(s))))

function onetrainingiteration()
    data = rand(5, 8000)  # generate 8000 random s vectors
    xs = Flux.params(A, C)
    gs = Tracker.gradient(() -> loss(data, A, C), xs)
    Tracker.update!(opt, Flux.params(A), gs)
end
```
Notice that I compute the gradient using both A's and C's parameters, but I only pass A's parameters to `update!()`. The idea is that Tracker needs the parameters of both networks to compute the gradient, but I do not want the update to touch C, otherwise it would bias C toward outputting a very large estimate.
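As far as I understand reverse-mode AD in Tracker, any parameter you do not ask gradients for is simply treated as a constant during the backward pass, while still participating in the forward computation. So one variant worth trying is to request gradients only for A's parameters in the first place; this is a hedged sketch, not a confirmed fix, and it assumes the same `A`, `C`, `loss` and `opt` as defined in the snippet above (the function name `actoriteration` is mine):

```julia
function actoriteration()
    data = rand(5, 8000)       # generate 8000 random s vectors
    ps = Flux.params(A)        # only the actor's parameters are differentiated
    # C is still evaluated inside loss, but its parameters are treated
    # as constants by the gradient computation.
    gs = Tracker.gradient(() -> loss(data, A, C), ps)
    Tracker.update!(opt, ps, gs)
end
```

If this behaves the same as passing both parameter sets to `gradient` and only A's to `update!`, then the parameter bookkeeping is at least not the problem.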
Obviously, the reason I am posting here is that A does not seem to learn at all. Would anyone do this differently? Am I right to pass C's parameters to `gradient()`?
For those a bit familiar with RL: A is an actor network and C is a critic. By criticizing A, C is supposed to help it improve its decision making (its policy).