Training a network to optimize the output value of another network

I am trying to set up a Deep Deterministic Policy Gradient (DDPG) algorithm (Reinforcement Learning) with Flux. Let me give a context-free explanation of what I am trying to achieve.
Say I have two networks, A and C. Network A takes as input an arbitrary vector that we call s and returns a vector a. Network C takes as input the concatenation vcat(s,a) and outputs a predicted value of taking action a when s is observed.

I want to train A so that it learns to output actions that C predicts to have a high value. Assume that C is already properly trained. Here is what I did:

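using Flux, Statistics   # Statistics provides mean
using Flux.Tracker       # Tracker-based Flux, as used in the calls below

opt = ADAM()
A = Chain(Dense(5,3,relu), Dense(3,2,relu)) # size of a is (2,1)
C = Chain(Dense(7,4,relu), Dense(4,1)) # 7 inputs is 5 + 2 (s + a)
... #assume C is trained here

# Negated so that minimizing the loss maximizes the value predicted by C
loss(s, A, C) = -mean(C(vcat(s,A(s))))

function onetrainingiteration()
    data = rand(5,8000) # generate 8000 random s vectors
    xs = Flux.params(A, C)
    gs = Tracker.gradient(() -> loss(data, A, C), xs)
    Tracker.update!(opt, Flux.params(A), gs) # only A's parameters are updated
end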

Notice that I compute the gradient using both A's and C's parameters, but I only give A's parameters to update!(). The idea is that Tracker needs the parameters of both networks to compute the gradient, but I do not want the update to touch C; otherwise it would bias C toward outputting very large estimates.
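For reference, the gradient this loss asks for can be written out with the chain rule. This is only a sketch: the symbols θ (A's parameters), N (batch size) and s_i (the i-th state in the batch) are notation I am introducing here, not something from the code above.

```latex
L(\theta) = -\frac{1}{N} \sum_{i=1}^{N} C\bigl(s_i, A_\theta(s_i)\bigr)

\nabla_\theta L(\theta)
  = -\frac{1}{N} \sum_{i=1}^{N}
    \left( \frac{\partial A_\theta(s_i)}{\partial \theta} \right)^{\!\top}
    \nabla_a C(s_i, a) \Big|_{a = A_\theta(s_i)}
```

C's weights enter this expression only as fixed constants inside \nabla_a C: the gradient has to flow through C with respect to its action input, but no derivative with respect to C's own parameters appears.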

Obviously, I am here because A does not seem to learn at all. Would anyone do this differently? Am I right to give C's parameters to gradient()?

For those a bit familiar with RL, A is an actor network and C is a critic. By criticizing A, C is supposed to help it improve its decision making (its policy).


Just checking, did you intend to include that negative sign in your loss? Should C be outputting a lesser or greater value as A improves its output mapping?

C outputs a value that we want to be high, not a cost. So yes, C should output a greater value as A improves.

I guess you understand that, but I'll clarify for those who might not: we want the output of C to be maximized, so -C has to be minimized. In standard supervised learning, loss() is an error that we want to minimize.

Ok, good to know, I just wanted to make sure this was the case.

Next question: What do the loss values look like across, say, 10 iterations? Do they oscillate around, or do they not budge at all?

Follow-up: Are your gradients non-zero?


Well! To answer your question I had to make a few changes to my code. What I was doing before was training C at the same time as A, as proposed in the DDPG paper. To get a plot that actually reflects the evolution of A's loss without being influenced by the changes to C, I trained C for 1000 iterations and then A for 100 iterations, and the loss actually improved quite well.

Better still, I have a simulation that I use to test the network. With a random actor it returns a value of -3600. After 100 iterations it reached -600, where the optimum is known to be -363.

So this seems like a good start!

However, I am still wondering whether xs = Flux.params(A, C) is useful. I tried with only A's parameters and it appears to work just as well, so computing C's gradients may simply be wasted computation.
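To illustrate why asking for A's parameters alone can be enough, here is a minimal hand-written sketch in plain Julia (no Flux). Everything in it is a hypothetical stand-in rather than the networks above: the actor is a linear map W * s, the critic is C(s, a) = -‖a - M * s‖² with a frozen matrix M, and the names critic, dC_da, actor_grad and train_actor are made up for this example. The chain rule is applied by hand, which is the step an automatic differentiation tool would otherwise do.

```julia
using Statistics  # mean

# Hypothetical stand-ins, not the networks from the post:
#   actor   A(s)    = W * s            (W is trainable)
#   critic  C(s, a) = -‖a - M * s‖²    (M is frozen, "already trained")
M = [1.0 0.0 -1.0 0.5 0.0;
     0.0 2.0  0.0 0.0 1.0]   # critic parameters, never updated
W0 = zeros(2, 5)             # initial actor parameters

critic(S, A) = -sum(abs2, A .- M * S; dims=1)   # 1×N row of predicted values
actor_loss(W, S) = -mean(critic(S, W * S))      # same form as loss(s, A, C)

# Gradient of the critic with respect to its action input only.
dC_da(S, A) = -2 .* (A .- M * S)

# Chain rule: dL/dW = -(1/N) * Σ_i dC/da_i * s_i'
actor_grad(W, S) = -(dC_da(S, W * S) * S') ./ size(S, 2)

# Plain gradient descent on the actor parameters alone.
function train_actor(W, S; lr=0.1, steps=200)
    for _ in 1:steps
        W = W - lr .* actor_grad(W, S)
    end
    return W
end

S = rand(5, 200)             # batch of 200 random states
M_before = copy(M)
W = train_actor(W0, S)
loss_before = actor_loss(W0, S)
loss_after = actor_loss(W, S)
println("actor loss: ", loss_before, " -> ", loss_after)
```

The actor's gradient only uses the critic's derivative with respect to the action, and M is read but never written. Under that reading, passing C's parameters to gradient() would compute extra gradients that the update never uses.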
