I am trying to set up a Deep Deterministic Policy Gradient (DDPG) algorithm (reinforcement learning) with Flux. Let me give a context-free explanation of what I am trying to achieve.

Say I have two networks A and C. Network A takes as input an arbitrary vector `s` and returns a vector `a`. Network `C` takes the concatenation `vcat(s, a)` as input and outputs a predicted value of taking action `a` when `s` is observed.

I want to train A so that it learns to output actions that will have a good value as predicted by C. Assume that C is properly trained already. Here is what I did:

```julia
using Flux, Statistics
using Flux.Tracker

opt = ADAM()
A = Chain(Dense(5, 3, relu), Dense(3, 2, relu)) # size of a is (2, 1)
C = Chain(Dense(7, 4, relu), Dense(4, 1))       # 7 inputs = 5 + 2 (s + a)
... # assume C is trained here
loss(s, A, C) = -mean(C(vcat(s, A(s))))

function onetrainingiteration()
    data = rand(5, 8000) # generate 8000 random s vectors
    xs = Flux.params(A, C)
    gs = Tracker.gradient(() -> loss(data, A, C), xs)
    Tracker.update!(opt, Flux.params(A), gs)
end
```

Notice that I compute the gradient using both A's and C's parameters, but I only give A's parameters to `update!()`. The idea is that Tracker needs the parameters of both networks to compute the gradient, but I do not want the update to touch C, since that would bias C toward outputting a very large estimate.
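For comparison, here is the variant I also considered, where only A's parameters are passed to `gradient()`. This assumes (and I am not sure this holds) that Tracker still backpropagates through C's forward pass even when C's parameters are not in the tracked set:

```julia
# Variant: track only A's parameters. Assumption (unverified): Tracker still
# differentiates through C's computation even though C's parameters are not
# in the tracked set, so gs holds gradients for A's parameters only.
function onetrainingiteration_A_only()
    data = rand(5, 8000)               # 8000 random s vectors
    xs = Flux.params(A)                # only A's parameters
    gs = Tracker.gradient(() -> loss(data, A, C), xs)
    Tracker.update!(opt, xs, gs)       # update A; C is never touched
end
```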

Obviously, if I am posting here it is because A does not seem to learn at all. Would anyone do this differently? Am I right to give C's parameters to `gradient()`?

For those a bit familiar with RL, A is an actor network and C is a critic. By criticizing A, C is supposed to help it improve its decision making (its policy).