# Training a network to optimize output value of another network

I am trying to set up a Deep Deterministic Policy Gradient algorithm (reinforcement learning) with Flux. Let me give a context-free explanation of what I am trying to achieve.
Say I have two networks, A and C. Network A takes an arbitrary vector `s` as input and returns a vector `a`. Network C takes the concatenation `vcat(s, a)` as input and outputs the predicted value of taking action `a` when `s` is observed.

I want to train A so that it learns to output actions that will have a good value as predicted by C. Assume that C is properly trained already. What I did is:

```julia
using Flux, Flux.Tracker, Statistics # Statistics provides mean()

opt = ADAM()
A = Chain(Dense(5,3,relu), Dense(3,2,relu)) # size of a is (2,1)
C = Chain(Dense(7,4,relu), Dense(4,1)) # 7 inputs = 5 + 2 (s + a)
... # assume C is trained here

loss(s, A, C) = -mean(C(vcat(s, A(s))))

function onetrainingiteration()
    data = rand(5, 8000) # generate 8000 random s vectors
    xs = Flux.params(A, C)
    gs = Tracker.gradient(() -> loss(data, A, C), xs)
    Tracker.update!(opt, Flux.params(A), gs)
end
```

Notice that I compute the gradient using both A and C’s parameters but I only give A’s parameters to `update!()`. The idea is that to compute the gradient, Tracker needs the parameters of both networks but I do not want the update to mess with C, otherwise it would bias C to output a very large estimation.
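For what it's worth, here is a minimal sketch (same Tracker-era Flux API as above; the variable names are illustrative) of passing only A's parameters to `gradient()`. Tracker still backpropagates *through* C's forward computation either way; it simply records gradients only for the parameters you hand it, so C's weights cannot be touched.

```julia
using Flux, Flux.Tracker, Statistics

A = Chain(Dense(5, 3, relu), Dense(3, 2, relu))  # actor
C = Chain(Dense(7, 4, relu), Dense(4, 1))        # critic (assumed trained)

loss(s) = -mean(C(vcat(s, A(s))))

s  = rand(5, 16)
θA = Flux.params(A)                       # track only the actor's weights
gs = Tracker.gradient(() -> loss(s), θA)  # gradients still flow through C's layers

for p in θA
    Tracker.update!(p, -0.001 * gs[p])    # plain SGD step on A only; C is untouched
end
```

The point is that "needing C to compute the gradient" and "tracking C's parameters" are separate things: the chain rule passes through C's layers regardless of which parameter set you register.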

Obviously, I am posting here because A does not seem to learn at all. Would anyone do this differently? Am I right to give C's parameters to `gradient()`?

For those a bit familiar with RL, A is an actor network and C is a critic. By criticizing A, C is supposed to help it improve its decision making (its policy).


Just checking, did you intend to include that negative sign in your loss? Should C be outputting a lesser or greater value as A improves its output mapping?

C outputs a value desired to be high, not a cost. So yes C should be outputting a greater value as A improves.

I guess you understand that, but let me clarify for those who might not: we want the output of C to be maximized, so -C to be minimized. In standard supervised learning, `loss()` is an error that we want to minimize.

Ok, good to know, I just wanted to make sure this was the case.

Next question: What do the loss values look like across, say, 10 iterations? Do they oscillate around, or do they not budge at all?

However, I am still questioning whether `xs = Flux.params(A, C)` is useful. I tried with only A's parameters and it appears to work just as well, so including C's may simply waste computation on gradients that are never used.
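With only A tracked, the training iteration from the first post can be slimmed down; here is a sketch under the same assumptions (Tracker-era Flux, `A`, `C`, and `opt` defined as in the original code, C already trained), using the per-parameter `update!` form shown in the Flux documentation of that era:

```julia
using Flux, Flux.Tracker, Statistics
using Flux.Optimise: update!

function onetrainingiteration!()
    data = rand(5, 8000)                 # 8000 random s vectors
    θA = Flux.params(A)                  # C's parameters are deliberately omitted
    gs = Tracker.gradient(() -> -mean(C(vcat(data, A(data)))), θA)
    for p in θA
        update!(opt, p, gs[p])           # only A's parameters are updated
    end
end
```

This avoids computing and storing gradients for C's weights that would be discarded anyway.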