Network Not Updating, Flux Julia

Nishant_Mohanty · September 16, 2022, 6:14am

Hello people. I am trying to implement an actor-critic network using Flux. However, for some reason my network is not being updated. Here is a sample from the code.

using Flux

state_dim1 = 14
output_dim = 1
function actor_model(state_dim)
    return Chain(
            Dense(state_dim, 100),
            Dense(100, 200),
            Dense(200, 150),
            # Dense(150, 150),
            Dense(150,1,tanh))
end

struct Join{T, F}
    combine::F
    paths::T
end

# allow Join(op, m1, m2, ...) as a constructor
Join(combine, paths...) = Join(combine, paths)
Flux.@functor Join
(m::Join)(xs::Tuple) = m.combine(map((f, x) -> f(x), m.paths, xs)...)
(m::Join)(xs...) = m(xs)

function critic_model(state_dim,output_dim)
    return Chain(
        Join(vcat,
            Chain(Dense(state_dim => 100, σ), Dense(100 => 64)), # branch 1
            Dense(output_dim => 64, tanh)                        # branch 2        
            ),
            Dense(128,84,relu),Dense(84,10,relu),Dense(10,1,relu)
       )
end
critic = gpu(critic_model(state_dim1,output_dim))
target_critic = gpu(critic_model(state_dim1,output_dim))

actor = gpu(actor_model(state_dim1))

target_actor = gpu(actor_model(state_dim1))



println(Flux.params(actor)[1,1][8:16],":Initial params actor")
# println(Flux.params(critic)[1,1][1:8],":Initial params critic")

s1 = [52.256588,3.001099,47.256588,1.0010991,52.815,2.0,52.565,2.0,52.315,2.0,52.065,2.0,51.815,2.0] |>gpu
a1 = [-0.43706045] |>gpu
r1 = [-5.0] |>gpu
s2 = [52.222652,3.0011091,47.222652,1.0011091,52.815,2.0,52.565,2.0,52.315,2.0,52.065,2.0,51.815,2.0] |>gpu

a2          = target_actor(s2)
next_val    = target_critic((s2,a2))
y_expected  = r1 .+ 0.2.*next_val 

#### critic update
critic_loss(x,y) = Flux.mse(x, y)
prms_critic      = Flux.params(critic)
opt              = Flux.Adam()
data_critic      = [(s1,a1,y_expected)] |> gpu
Flux.train!((x,y,z) -> critic_loss(critic((x,y)),z), prms_critic, data_critic, opt)


#### actor update
actor_loss(x,y)  = -1*sum(critic((x,y)))
opt              = Adam(0.5)
prms_actor       = Flux.params(actor)
data_actor       = [(s1,s1)] |> gpu
Flux.train!((x,y) -> actor_loss(x,actor(y)), prms_actor, data_actor, opt)

println(Flux.params(actor)[1,1][8:16])
# println(Flux.params(critic)[1,1][1:8])

Julia Version 1.6.7
I am not able to figure out the error in this case. Any help is appreciated. Thanks

DrChainsaw · September 16, 2022, 6:23am

Try skipping the last relu, as it will make all gradients zero if the value going into it (the last layer's pre-activation) is negative.
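
A minimal sketch of this effect, using only Base Julia. The `relu` and `drelu` helpers and the scalar values are hand-written for illustration (they are assumptions, not Flux's own definitions). The point is that the derivative is zero whenever the pre-activation is negative, so by the chain rule no parameter upstream receives any gradient.

```julia
# Hand-written stand-ins for illustration (assumption: not Flux's own definitions).
relu(z) = max(z, zero(z))
drelu(z) = z > 0 ? one(z) : zero(z)   # derivative of relu w.r.t. its input

# One scalar "last layer": y = relu(w * h + b), with made-up values.
w, b, h = 0.7, -1.0, 0.5
z = w * h + b                         # pre-activation, negative for these values

# Chain rule: dy/dw = drelu(z) * h and dy/db = drelu(z).
dy_dw = drelu(z) * h
dy_db = drelu(z)

println("z = ", z, ", y = ", relu(z))
println("dy/dw = ", dy_dw, ", dy/db = ", dy_db)   # both are 0.0 because z < 0
```

With a single data point, one negative pre-activation is enough to zero the whole update.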

Nishant_Mohanty · September 16, 2022, 6:26am

I see. However, that did not solve the problem. The params of the actor network do not change either, even though there is no relu activation in it.

albheim · September 16, 2022, 9:43am

Nishant_Mohanty:

function actor_model(state_dim)
    return Chain(
            Dense(state_dim, 100),
            Dense(100, 200),
            Dense(200, 150),
            # Dense(150, 150),
            Dense(150,1,tanh))
end

Creating multiple dense layers without activation is not going to give your function any more flexibility, since the composition is still just an affine transform, but it will add a bunch of extra parameters, which slows down learning in most cases. I would either use a single layer, Dense(state_dim, 1, tanh), or put some activations in the intermediate layers.
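
As a rough illustration of the affine-transform point, the sketch below uses plain matrices rather than Flux layers. The sizes (4 -> 3 -> 2) and all variable names are made up for the example. It checks that two activation-free layers give the same output as one layer with merged weights.

```julia
using Random

Random.seed!(0)

# Two activation-free layers with hypothetical sizes 4 -> 3 -> 2.
W1, b1 = randn(3, 4), randn(3)
W2, b2 = randn(2, 3), randn(2)

two_layers(x) = W2 * (W1 * x .+ b1) .+ b2

# The equivalent single affine layer: W = W2 * W1, b = W2 * b1 + b2.
W, b = W2 * W1, W2 * b1 .+ b2
one_layer(x) = W * x .+ b

x = randn(4)
println(two_layers(x) ≈ one_layer(x))   # true, up to floating-point rounding
```

The stacked version has more parameters to train but represents exactly the same family of functions.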

Removing the relu from the critic seems reasonable; otherwise you might randomly get zero gradients depending on the initialization and data. If you make larger updates with more data, it may be fine to keep it, since it is then more plausible that some of the data will still generate a positive output and thus lead to some gradient.

It also seems the actor has a similar problem. When I ran it and checked the actor's output at s1, it was -1, indicating that the value before the tanh is strongly negative and the tanh thus has a very small gradient. Checking the gradient of the action w.r.t. the parameters, it really does seem to be 0, or at least very close. Testing the gradients with a random input gives non-zero values.

julia> actor(s1)
1-element Vector{Float64}:
 -1.0

julia> gs = Flux.gradient(() -> sum(actor(s1)), Flux.params(actor))
Grads(...)

julia> gs.grads
IdDict{Any, Any} with 9 entries:
  Float32[0.0]                      => Float32[0.0]
  Float32[-0.0531993 -0.0839223 … … => Float32[0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0…
  Float32[0.0, 0.0, 0.0, 0.0, 0.0,… => Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0  …  0.0,…
  Float32[-0.0853877 0.130838 … 0.… => Float32[0.0 -0.0 … -0.0 0.0]
  Float32[-0.0815599 -0.130409 … -… => Float32[0.0 0.0 … -0.0 0.0; 0.0 0.0 … -0.0 0.0; … ; 0.0 0.0 … -0.…
  Float32[0.0633241 0.111354 … -0.… => Float32[-0.0 0.0 … -0.0 -0.0; -0.0 0.0 … -0.0 -0.0; … ; -0.0 0.0 …
  Float32[0.0, 0.0, 0.0, 0.0, 0.0,… => Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0  …  0.0,…
  Float32[0.0, 0.0, 0.0, 0.0, 0.0,… => Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0  …  0.0,…
  :(Main.s1)                        => [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,…

julia> gs = Flux.gradient(() -> sum(actor(randn(size(s1)))), Flux.params(actor))
Grads(...)

julia> gs.grads
IdDict{Any, Any} with 8 entries:
  Float32[0.0]                      => Float32[0.688863]
  Float32[-0.0531993 -0.0839223 … … => Float32[-0.0834978 0.0162966 … -0.100557 0.141133; -0.0920515 0.0…
  Float32[0.0, 0.0, 0.0, 0.0, 0.0,… => Float32[0.010946, 0.105696, -0.0541519, -0.0183322, 0.000768424, …
  Float32[-0.0853877 0.130838 … 0.… => Float32[-0.0757515 0.195297 … -0.215118 -0.0997762]
  Float32[-0.0815599 -0.130409 … -… => Float32[0.00461565 -0.00612626 … -0.00670385 -0.00224172; 0.04456…
  Float32[0.0633241 0.111354 … -0.… => Float32[-0.0199989 -0.0170654 … -0.0421886 -0.0202436; 0.0306439 …
  Float32[0.0, 0.0, 0.0, 0.0, 0.0,… => Float32[-0.0588204, 0.0901291, -0.0234954, 0.0110632, -0.0666493,…
  Float32[0.0, 0.0, 0.0, 0.0, 0.0,… => Float32[-0.0973696, -0.107344, 0.0529834, 0.0645531, -0.00943013,…

Recreating the actor, I got different results (since the random initialization was different):

julia> actor = gpu(actor_model(state_dim1))
Chain(
  Dense(14 => 100),                     # 1_500 parameters
  Dense(100 => 200),                    # 20_200 parameters
  Dense(200 => 150),                    # 30_150 parameters
  Dense(150 => 1, tanh),                # 151 parameters
)                   # Total: 8 arrays, 52_001 parameters, 203.629 KiB.

julia> actor(s1)
1-element Vector{Float64}:
 0.999993465657055

julia> gs = Flux.gradient(() -> sum(actor(randn(size(s1)))), Flux.params(actor))
Grads(...)

julia> gs.grads
IdDict{Any, Any} with 8 entries:
  Float32[0.182963 0.0816542 … 0.1… => Float32[-0.0372346 -0.0957257 … 0.0396487 0.0540225]
  Float32[-0.0363748 -0.103407 … -… => Float32[0.0134421 0.0336876 … 0.0912891 -0.0231217; 0.0151213 0.0…
  Float32[-0.0789767 0.225775 … -0… => Float32[0.0548423 0.0282836 … 0.049636 -0.108771; 0.0129213 0.006…
  Float32[0.0, 0.0, 0.0, 0.0, 0.0,… => Float32[-0.0767889, -0.0180921, -0.122244, 0.00583031, -0.0116575…
  Float32[0.0]                      => Float32[0.999765]
  Float32[0.0659612 0.0139627 … -0… => Float32[-0.152804 0.0411203 … -0.109576 0.0850265; -0.0681948 0.0…
  Float32[0.0, 0.0, 0.0, 0.0, 0.0,… => Float32[0.18292, 0.081635, -0.0876073, 0.00315569, 0.192231, 0.16…
  Float32[0.0, 0.0, 0.0, 0.0, 0.0,… => Float32[0.0947302, 0.106564, -0.0905477, 0.100901, -0.0278349, -0…

So now it seems like I do get non-zero gradients through the actor. This then also worked in the full update together with the critic to make updates to the agent network.

Nishant_Mohanty · September 16, 2022, 2:23pm

Thanks for the detailed explanation. I did try out your suggestions, like adding relu to the layers. However, what I am noticing is that having no tanh in the final layer does update the network in my case.

What I am unable to understand is the code snippet in your answer. If the actor network design you are using remains the same, why is it able to update in your case? Did you change anything else?

Also, one more thing I notice is that upon running the update function more than once, the param values seem to remain the same. I guess it might be possible that the network does not change much after the first train!.

albheim · September 16, 2022, 3:00pm

Yes, that is what I mentioned: the gradient vanishes because of the tanh, so removing it could be one option if it is not needed for your specific problem.

Because of the random initialization of the network, I got gradients in one case, but not in the other case. This is an effect of combining tanh with a single datapoint, so if that single datapoint happens to generate a large (negative or positive) value through the net, tanh will generate 0 as gradient.
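
To make the saturation concrete, here is a small sketch in Base Julia. `dtanh` is a hand-written helper (an assumption for illustration, not a Flux function) based on the identity d/dz tanh(z) = 1 - tanh(z)^2, and the sample values of z are made up.

```julia
# Derivative of tanh written out by hand: d/dz tanh(z) = 1 - tanh(z)^2.
dtanh(z) = 1 - tanh(z)^2

for z in (0.0, 1.0, 5.0, -30.0)
    println("z = ", z, "  tanh(z) = ", tanh(z), "  dtanh(z) = ", dtanh(z))
end

# At z = -30.0, tanh(z) rounds to exactly -1.0 in Float64,
# so the computed derivative is exactly 0.0, not merely small.
```

This matches the actor(s1) output of exactly -1.0 above: once the output has rounded to ±1, every gradient that has to pass through the tanh is multiplied by zero.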

I think you pretty much have an implementation that should work; it is just that the single datapoint causes problems in random cases for you, so if you train on batches of data instead, it should hopefully do better.
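
A hedged sketch of what a batched update could look like. It assumes the implicit-parameter API (`Flux.params`, `Flux.Optimise.update!`) of the Flux 0.13 series used in this thread; newer Flux releases favour explicit gradients. The layer sizes, the batch of 32 random states, and the stand-in loss (negative mean action, with no critic involved) are all made up for illustration.

```julia
using Flux

# Hypothetical actor with a hidden activation; 14 inputs as in the thread.
actor = Chain(Dense(14 => 64, relu), Dense(64 => 1, tanh))

# A batch is a matrix with one column per sample (here 32 made-up states).
states = randn(Float32, 14, 32)

ps = Flux.params(actor)

# Stand-in loss: negative mean action over the batch (no critic in this sketch).
gs = Flux.gradient(() -> -sum(actor(states)) / size(states, 2), ps)

opt = Flux.Adam(1e-3)
Flux.Optimise.update!(opt, ps, gs)

println(size(actor(states)))   # (1, 32): one action per column
```

With many columns, it is much less likely that every sample saturates the tanh at once, so the averaged gradient is usually non-zero.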

Nishant_Mohanty · September 19, 2022, 7:19pm

Got it… Thanks for the help.
