I am trying to set up a Deep Deterministic Policy Gradient algorithm (Reinforcement Learning) with Flux. Let me give a context-free explanation of what I am trying to achieve.
Say I have two networks A and C. Network A takes as input an arbitrary vector that we call s and returns a vector a. Network C takes the concatenation vcat(s, a) as input and outputs a predicted value of taking action a when s is observed.
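To be concrete about the data flow (using the sizes from the code below; the variable name v is just for illustration), this is the forward pass I have in mind:
s = rand(5)           # an observed state vector (size 5 here)
a = A(s)              # the actor's proposed action (size 2 here)
v = C(vcat(s, a))     # the critic's predicted value of taking a when s is observed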
I want to train A so that it learns to output actions that will have a good value as predicted by C. Assume that C is properly trained already. What I did is:
using Flux, Statistics
using Flux: Tracker

opt = ADAM()
A = Chain(Dense(5, 3, relu), Dense(3, 2, relu)) # actor: size of a is (2,1)
C = Chain(Dense(7, 4, relu), Dense(4, 1))       # critic: 7 inputs is 5 + 2 (s + a)
... # assume C is trained here
loss(s, A, C) = -mean(C(vcat(s, A(s))))         # minimize the negative of the critic's estimate

function onetrainingiteration()
    data = rand(5, 8000) # generate 8000 random s vectors
    xs = Flux.params(A, C)
    gs = Tracker.gradient(() -> loss(data, A, C), xs)
    Tracker.update!(opt, Flux.params(A), gs)
end
Notice that I compute the gradient using both A's and C's parameters, but I only give A's parameters to update!(). The idea is that Tracker needs the parameters of both networks to compute the gradient, but I do not want the update to touch C, otherwise it would bias C towards outputting very large value estimates.
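To make the alternative explicit, this is what I mean by not giving C's parameters to gradient(); just a sketch of the same training step (the function name is only for illustration):
function onetrainingiteration_onlyA()
    data = rand(5, 8000)                              # same random s vectors as above
    xs = Flux.params(A)                               # only the actor's parameters
    gs = Tracker.gradient(() -> loss(data, A, C), xs) # gradients only w.r.t. A's parameters
    Tracker.update!(opt, xs, gs)
end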
Obviously, if I am posting here it is because A does not seem to learn at all. Would anyone do this differently? Am I right to give C's parameters to gradient()?
For those a bit familiar with RL, A is an actor network and C is a critic. By criticizing A, C is supposed to help it improve its decision making (its policy).