Creating multiple Dense layers without activations is not going to give your function any more flexibility: the composition of affine maps is still just a single affine transform, but it adds a bunch of extra parameters, which slows down learning in most cases. I would either keep a single layer, Dense(100 => 1, tanh), or put activations in the intermediate layers.
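To see why the stacked layers add nothing, note that two Dense layers without activations collapse algebraically into one affine map. A minimal sketch (layer sizes are arbitrary, just for illustration):

```julia
using Flux

# Two Dense layers with no activation (identity) compose into one affine map.
m = Chain(Dense(4 => 8), Dense(8 => 1))

# The same function as a single affine transform:
# W2*(W1*x + b1) + b2 == (W2*W1)*x + (W2*b1 + b2)
W = m[2].weight * m[1].weight
b = m[2].weight * m[1].bias .+ m[2].bias

x = randn(Float32, 4)
m(x) ≈ W * x .+ b  # holds up to floating-point error: no extra expressiveness
```

So the intermediate layers only add parameters, not representational power, unless a nonlinearity sits between them.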
Removing the relu from the critic seems reasonable; otherwise you might randomly get zero gradients depending on the initialization and the data. If you make larger updates with more data, it could be more acceptable to keep it, since it is then more plausible that some of the data will still produce a positive pre-activation and thus a non-zero gradient.
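The zero-gradient failure mode is easy to reproduce in isolation. A hypothetical critic (sizes made up for the example) with a relu on the output layer gives exactly zero gradient whenever the final pre-activation happens to be negative:

```julia
using Flux

# Hypothetical critic with relu on the output layer. If the last layer's
# pre-activation is negative for the given input, the output is 0.0 and
# the gradient of every parameter is exactly zero, so nothing is learned.
critic = Chain(Dense(4 => 8, relu), Dense(8 => 1, relu))

x = randn(Float32, 4)
gs = Flux.gradient(() -> sum(critic(x)), Flux.params(critic))
# When critic(x) == [0.0], all entries in gs.grads are zero for this input.
```

Whether that happens depends on the random initialization and the input, which is why the behavior can look so arbitrary.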
The actor seems to have a similar problem. When I ran it and checked the output for s1, it was -1.0, indicating the pre-tanh value is strongly negative, so the tanh is saturated and its gradient is essentially zero. Checking the gradient of the action with respect to the parameters confirms this: they really are zero, or at least very close to it. Evaluating the gradient at a random input instead gives non-zero values.
```julia
julia> actor(s1)
1-element Vector{Float64}:
 -1.0

julia> gs = Flux.gradient(() -> sum(actor(s1)), Flux.params(actor))
Grads(...)

julia> gs.grads
IdDict{Any, Any} with 9 entries:
  Float32[0.0] => Float32[0.0]
  Float32[-0.0531993 -0.0839223 … … => Float32[0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0…
  Float32[0.0, 0.0, 0.0, 0.0, 0.0,… => Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 … 0.0,…
  Float32[-0.0853877 0.130838 … 0.… => Float32[0.0 -0.0 … -0.0 0.0]
  Float32[-0.0815599 -0.130409 … -… => Float32[0.0 0.0 … -0.0 0.0; 0.0 0.0 … -0.0 0.0; … ; 0.0 0.0 … -0.…
  Float32[0.0633241 0.111354 … -0.… => Float32[-0.0 0.0 … -0.0 -0.0; -0.0 0.0 … -0.0 -0.0; … ; -0.0 0.0 …
  Float32[0.0, 0.0, 0.0, 0.0, 0.0,… => Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 … 0.0,…
  Float32[0.0, 0.0, 0.0, 0.0, 0.0,… => Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 … 0.0,…
  :(Main.s1) => [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,…

julia> gs = Flux.gradient(() -> sum(actor(randn(size(s1)))), Flux.params(actor))
Grads(...)

julia> gs.grads
IdDict{Any, Any} with 8 entries:
  Float32[0.0] => Float32[0.688863]
  Float32[-0.0531993 -0.0839223 … … => Float32[-0.0834978 0.0162966 … -0.100557 0.141133; -0.0920515 0.0…
  Float32[0.0, 0.0, 0.0, 0.0, 0.0,… => Float32[0.010946, 0.105696, -0.0541519, -0.0183322, 0.000768424, …
  Float32[-0.0853877 0.130838 … 0.… => Float32[-0.0757515 0.195297 … -0.215118 -0.0997762]
  Float32[-0.0815599 -0.130409 … -… => Float32[0.00461565 -0.00612626 … -0.00670385 -0.00224172; 0.04456…
  Float32[0.0633241 0.111354 … -0.… => Float32[-0.0199989 -0.0170654 … -0.0421886 -0.0202436; 0.0306439 …
  Float32[0.0, 0.0, 0.0, 0.0, 0.0,… => Float32[-0.0588204, 0.0901291, -0.0234954, 0.0110632, -0.0666493,…
  Float32[0.0, 0.0, 0.0, 0.0, 0.0,… => Float32[-0.0973696, -0.107344, 0.0529834, 0.0645531, -0.00943013,…
```
Recreating the actor, I got different results (since the random initialization was different):
```julia
julia> actor = gpu(actor_model(state_dim1))
Chain(
  Dense(14 => 100),                     # 1_500 parameters
  Dense(100 => 200),                    # 20_200 parameters
  Dense(200 => 150),                    # 30_150 parameters
  Dense(150 => 1, tanh),                # 151 parameters
)                   # Total: 8 arrays, 52_001 parameters, 203.629 KiB.

julia> actor(s1)
1-element Vector{Float64}:
 0.999993465657055

julia> gs = Flux.gradient(() -> sum(actor(randn(size(s1)))), Flux.params(actor))
Grads(...)

julia> gs.grads
IdDict{Any, Any} with 8 entries:
  Float32[0.182963 0.0816542 … 0.1… => Float32[-0.0372346 -0.0957257 … 0.0396487 0.0540225]
  Float32[-0.0363748 -0.103407 … -… => Float32[0.0134421 0.0336876 … 0.0912891 -0.0231217; 0.0151213 0.0…
  Float32[-0.0789767 0.225775 … -0… => Float32[0.0548423 0.0282836 … 0.049636 -0.108771; 0.0129213 0.006…
  Float32[0.0, 0.0, 0.0, 0.0, 0.0,… => Float32[-0.0767889, -0.0180921, -0.122244, 0.00583031, -0.0116575…
  Float32[0.0] => Float32[0.999765]
  Float32[0.0659612 0.0139627 … -0… => Float32[-0.152804 0.0411203 … -0.109576 0.0850265; -0.0681948 0.0…
  Float32[0.0, 0.0, 0.0, 0.0, 0.0,… => Float32[0.18292, 0.081635, -0.0876073, 0.00315569, 0.192231, 0.16…
  Float32[0.0, 0.0, 0.0, 0.0, 0.0,… => Float32[0.0947302, 0.106564, -0.0905477, 0.100901, -0.0278349, -0…
```
So now I do get non-zero gradients through the actor. This also worked in the full update together with the critic, producing updates to the agent network.
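For reference, combining the two suggestions above, a minimal sketch of the actor with activations in the intermediate layers (relu is my assumption; the sizes are taken from the printout above, and your actual actor_model may differ):

```julia
using Flux

# Sketch: same layer sizes as the printed Chain, but with relu between
# layers so the extra depth actually adds nonlinearity rather than
# collapsing into a single affine map.
actor_model(state_dim) = Chain(
    Dense(state_dim => 100, relu),
    Dense(100 => 200, relu),
    Dense(200 => 150, relu),
    Dense(150 => 1, tanh),
)

actor = actor_model(14)
actor(randn(Float32, 14))  # 1-element output in (-1, 1)
```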