Network Not Updating, Flux Julia

Hello people. I am trying to implement an actor-critic network using Flux. However, for some reason my network is not being updated. Here is a sample of the code.

state_dim1 = 14
output_dim = 1
function actor_model(state_dim)
    return Chain(
        Dense(state_dim, 100),
        Dense(100, 200),
        Dense(200, 150),
        # Dense(150, 150),
        Dense(150, 1, tanh))
end

struct Join{T, F}
    combine::F
    paths::T
end

# allow Join(op, m1, m2, ...) as a constructor
Join(combine, paths...) = Join(combine, paths)
Flux.@functor Join
(m::Join)(xs::Tuple) = m.combine(map((f, x) -> f(x), m.paths, xs)...)
(m::Join)(xs...) = m(xs)

function critic_model(state_dim, output_dim)
    return Chain(
        Join(vcat,
            Chain(Dense(state_dim => 100, σ), Dense(100 => 64)), # branch 1
            Dense(output_dim => 64, tanh)                        # branch 2
        ),
        Dense(128, 84, relu),
        Dense(84, 10, relu),
        Dense(10, 1, relu))
end
critic = gpu(critic_model(state_dim1,output_dim))
target_critic = gpu(critic_model(state_dim1,output_dim))

actor = gpu(actor_model(state_dim1))

target_actor = gpu(actor_model(state_dim1))



println(Flux.params(actor)[1,1][8:16],":Initial params actor")
# println(Flux.params(critic)[1,1][1:8],":Initial params critic")

s1 = [52.256588,3.001099,47.256588,1.0010991,52.815,2.0,52.565,2.0,52.315,2.0,52.065,2.0,51.815,2.0] |>gpu
a1 = [-0.43706045] |>gpu
r1 = [-5.0] |>gpu
s2 = [52.222652,3.0011091,47.222652,1.0011091,52.815,2.0,52.565,2.0,52.315,2.0,52.065,2.0,51.815,2.0] |>gpu

a2          = target_actor(s2)
next_val    = target_critic((s2,a2))
y_expected  = r1 .+ 0.2.*next_val 

#### critic update
critic_loss(x,y) = Flux.mse(x, y)
prms_critic      = Flux.params(critic)
opt              = Flux.Adam()
data_critic      = [(s1,a1,y_expected)] |> gpu
Flux.train!((x,y,z) -> critic_loss(critic((x,y)),z), prms_critic, data_critic, opt)


#### actor update
actor_loss(x,y)  = -1*sum(critic((x,y)))
opt              = Adam(0.5)
prms_actor       = Flux.params(actor)
data_actor       = [(s1,s1)] |> gpu
Flux.train!((x,y) -> actor_loss(x,actor(y)), prms_actor, data_actor, opt)

println(Flux.params(actor)[1,1][8:16])
# println(Flux.params(critic)[1,1][1:8])

Julia Version 1.6.7
I am not able to figure out the error in this case. Any help is appreciated. Thanks


Try skipping the last relu, as it will make all gradients zero if the output from the last layer is negative.
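Something like this is what I mean (just a sketch of your critic_model with the final relu dropped, everything else unchanged):

function critic_model(state_dim, output_dim)
    return Chain(
        Join(vcat,
            Chain(Dense(state_dim => 100, σ), Dense(100 => 64)), # branch 1
            Dense(output_dim => 64, tanh)                        # branch 2
        ),
        Dense(128, 84, relu),
        Dense(84, 10, relu),
        Dense(10, 1))   # linear output, so negative values still give a gradient
end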

I see. However, that did not solve the problem. The params of the actor network do not change either, even though there is no relu activation in it.

Stacking multiple dense layers without activations is not going to give your function any more flexibility, since it is then just an affine transform, but it will add a bunch of extra parameters, which slows down learning in most cases. I would either have the single layer with Dense(100, 1, tanh) or put some activations in the intermediate layers.
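For instance, something along these lines (just a sketch; relu in the hidden layers is an arbitrary choice):

function actor_model(state_dim)
    return Chain(
        Dense(state_dim, 100, relu),  # nonlinearities between layers, so the
        Dense(100, 200, relu),        # extra parameters actually add capacity
        Dense(200, 150, relu),
        Dense(150, 1, tanh))
end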

Removing the relu from the critic seems reasonable; otherwise you might randomly get zero gradients depending on the initialization and data. If you make larger updates with more data, keeping it might be more acceptable, since it is then more plausible that some of the data will still generate a positive output and thus lead to some gradient.

It also seems the actor has a similar problem. When I ran it and checked the output for s1, it was -1, indicating that the value is very negative before the tanh and thus will have a very small gradient. Checking the gradient of the action w.r.t. the parameters, it then really does seem to be 0, or at least very close. Testing the gradients with some random input gives non-zero values.

julia> actor(s1)
1-element Vector{Float64}:
 -1.0

julia> gs = Flux.gradient(() -> sum(actor(s1)), Flux.params(actor))
Grads(...)

julia> gs.grads
IdDict{Any, Any} with 9 entries:
  Float32[0.0]                      => Float32[0.0]
  Float32[-0.0531993 -0.0839223 … … => Float32[0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0…
  Float32[0.0, 0.0, 0.0, 0.0, 0.0,… => Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0  …  0.0,…
  Float32[-0.0853877 0.130838 … 0.… => Float32[0.0 -0.0 … -0.0 0.0]
  Float32[-0.0815599 -0.130409 … -… => Float32[0.0 0.0 … -0.0 0.0; 0.0 0.0 … -0.0 0.0; … ; 0.0 0.0 … -0.…
  Float32[0.0633241 0.111354 … -0.… => Float32[-0.0 0.0 … -0.0 -0.0; -0.0 0.0 … -0.0 -0.0; … ; -0.0 0.0 …
  Float32[0.0, 0.0, 0.0, 0.0, 0.0,… => Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0  …  0.0,…
  Float32[0.0, 0.0, 0.0, 0.0, 0.0,… => Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0  …  0.0,…
  :(Main.s1)                        => [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,…

julia> gs = Flux.gradient(() -> sum(actor(randn(size(s1)))), Flux.params(actor))
Grads(...)

julia> gs.grads
IdDict{Any, Any} with 8 entries:
  Float32[0.0]                      => Float32[0.688863]
  Float32[-0.0531993 -0.0839223 … … => Float32[-0.0834978 0.0162966 … -0.100557 0.141133; -0.0920515 0.0…
  Float32[0.0, 0.0, 0.0, 0.0, 0.0,… => Float32[0.010946, 0.105696, -0.0541519, -0.0183322, 0.000768424, …
  Float32[-0.0853877 0.130838 … 0.… => Float32[-0.0757515 0.195297 … -0.215118 -0.0997762]
  Float32[-0.0815599 -0.130409 … -… => Float32[0.00461565 -0.00612626 … -0.00670385 -0.00224172; 0.04456…
  Float32[0.0633241 0.111354 … -0.… => Float32[-0.0199989 -0.0170654 … -0.0421886 -0.0202436; 0.0306439 …
  Float32[0.0, 0.0, 0.0, 0.0, 0.0,… => Float32[-0.0588204, 0.0901291, -0.0234954, 0.0110632, -0.0666493,…
  Float32[0.0, 0.0, 0.0, 0.0, 0.0,… => Float32[-0.0973696, -0.107344, 0.0529834, 0.0645531, -0.00943013,…

Recreating the actor, I got different results (since the random initialization was different):

julia> actor = gpu(actor_model(state_dim1))
Chain(
  Dense(14 => 100),                     # 1_500 parameters
  Dense(100 => 200),                    # 20_200 parameters
  Dense(200 => 150),                    # 30_150 parameters
  Dense(150 => 1, tanh),                # 151 parameters
)                   # Total: 8 arrays, 52_001 parameters, 203.629 KiB.

julia> actor(s1)
1-element Vector{Float64}:
 0.999993465657055

julia> gs = Flux.gradient(() -> sum(actor(randn(size(s1)))), Flux.params(actor))
Grads(...)

julia> gs.grads
IdDict{Any, Any} with 8 entries:
  Float32[0.182963 0.0816542 … 0.1… => Float32[-0.0372346 -0.0957257 … 0.0396487 0.0540225]
  Float32[-0.0363748 -0.103407 … -… => Float32[0.0134421 0.0336876 … 0.0912891 -0.0231217; 0.0151213 0.0…
  Float32[-0.0789767 0.225775 … -0… => Float32[0.0548423 0.0282836 … 0.049636 -0.108771; 0.0129213 0.006…
  Float32[0.0, 0.0, 0.0, 0.0, 0.0,… => Float32[-0.0767889, -0.0180921, -0.122244, 0.00583031, -0.0116575…
  Float32[0.0]                      => Float32[0.999765]
  Float32[0.0659612 0.0139627 … -0… => Float32[-0.152804 0.0411203 … -0.109576 0.0850265; -0.0681948 0.0…
  Float32[0.0, 0.0, 0.0, 0.0, 0.0,… => Float32[0.18292, 0.081635, -0.0876073, 0.00315569, 0.192231, 0.16…
  Float32[0.0, 0.0, 0.0, 0.0, 0.0,… => Float32[0.0947302, 0.106564, -0.0905477, 0.100901, -0.0278349, -0…

So now it seems like I do get non-zero gradients through the actor. This also worked in the full update together with the critic, making updates to the actor network.


Thanks for the detailed explanation. I did try out your suggestions, like adding relu to the layers. However, what I am noticing is that removing the tanh from the final layer does let the network update in my case.

What I am unable to understand is the code snippet in your answer. If the actor's network design you are using remains the same, why is it able to update in your case? Did you change anything else?

Also, one more thing I notice is that upon running the update function more than once, the param values seem to remain the same. I guess it might be possible that the network does not change much after the first train!.

Yes, that is what I mentioned: the gradient vanishes thanks to the tanh, so removing it could be one option if it is not needed for your specific problem.
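One way to see whether this is what is happening (just a debugging sketch) is to look at the gradients directly before calling train!; if every entry is zero, train! cannot change anything no matter how many times you run it:

gs = Flux.gradient(() -> actor_loss(s1, actor(s1)), Flux.params(actor))
for p in Flux.params(actor)
    g = gs[p]
    # total absolute gradient per parameter array; all zeros means no update
    println(g === nothing ? 0 : sum(abs, g))
end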

Because of the random initialization of the network, I got gradients in one case but not in the other. This is an effect of combining tanh with a single datapoint: if that single datapoint happens to generate a large (negative or positive) value through the net, tanh will generate a gradient of 0.
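To make that concrete, the derivative of tanh is 1 - tanh(x)^2, which is essentially zero once the pre-activation is large in magnitude. A small illustration:

# the tanh derivative collapses towards 0 for saturated inputs,
# which blocks the gradient to all earlier layers
for x in (0.5, 5.0, 20.0)
    println("x = $x:  1 - tanh(x)^2 = ", 1 - tanh(x)^2)
end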

I think you pretty much have an implementation that should work; it is just that the single datapoint causes problems in random cases for you, so if you train on batches of data instead it should hopefully do better.
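As a rough sketch of what that could look like (untested, and the second transition here just reuses a1 and y_expected as placeholders), batching means stacking the vectors into matrices so every layer sees a (features × batchsize) input:

# stack several transitions column-wise into matrices
states  = reduce(hcat, [s1, s2])                   # 14 × 2
actions = reduce(hcat, [a1, a1])                   # 1 × 2 (placeholder actions)
targets = reduce(hcat, [y_expected, y_expected])   # 1 × 2 (placeholder TD targets)

data_critic = [(states, actions, targets)]
Flux.train!((x, y, z) -> Flux.mse(critic((x, y)), z), Flux.params(critic), data_critic, opt)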


Got it… Thanks for the help.