Creating multiple dense layers without activations is not going to give your function any more flexibility, since the composition is still just a single affine transform; it only adds a bunch of extra parameters, which slows down learning in most cases. I would either use the single layer Dense(100, 1, tanh) or put activations in the intermediate layers.
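To illustrate the point numerically, here is a small sketch in Python/NumPy (the thread itself uses Flux-style layers, so the weight shapes here are just hypothetical stand-ins): two stacked linear layers with no activation compute exactly the same function as one affine layer with far fewer parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two stacked "dense" layers with no activation: y = W2 @ (W1 @ x + b1) + b2
W1, b1 = rng.normal(size=(100, 4)), rng.normal(size=100)
W2, b2 = rng.normal(size=(1, 100)), rng.normal(size=1)

x = rng.normal(size=4)
two_layer = W2 @ (W1 @ x + b1) + b2

# The same function collapsed into a single affine layer: y = W @ x + b
W = W2 @ W1          # shape (1, 4) -- far fewer parameters than the stack
b = W2 @ b1 + b2
one_layer = W @ x + b

assert np.allclose(two_layer, one_layer)  # identical outputs for any x
```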
Removing the relu from the critic seems reasonable; otherwise you might randomly get zero gradients depending on the initialization and data. If you make larger updates with more data, it could be more acceptable to keep it, since it is then more plausible that some of the data will still produce a positive pre-activation and thus contribute some gradient.
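The relu dead-gradient effect can be seen directly (a minimal NumPy sketch, independent of the actual network in the thread): wherever the pre-activation is negative, both the output and the gradient are exactly zero, so a single unlucky datapoint produces no update at all.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# d relu / dz is 0 wherever the pre-activation is negative
z = np.array([-2.0, -0.5, 0.3, 1.5])
grad = (z > 0).astype(float)

# A single datapoint whose pre-activation happens to be negative:
# output 0, gradient 0 -- nothing upstream of the relu gets updated.
print(relu(-2.0), grad[0])   # 0.0 0.0
```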
The actor seems to have a similar problem: when I ran it and checked the value in s1, it was -1, indicating the pre-tanh value is strongly negative and the gradient will therefore be very small. Checking the gradient of the action w.r.t. the parameters, it really does seem to be zero, or at least very close to it, while testing the gradients with some random inputs gives non-zero values.
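This is exactly what tanh saturation looks like numerically. A quick NumPy check (using tanh'(z) = 1 - tanh(z)^2, with -4 as an arbitrary example of a large negative pre-activation):

```python
import numpy as np

# tanh'(z) = 1 - tanh(z)^2: near zero the slope is ~1,
# but once |z| is large almost no gradient flows back.
def dtanh(z):
    return 1.0 - np.tanh(z) ** 2

print(np.tanh(-4.0))   # ~ -0.9993: the output looks like the "-1" seen in s1
print(dtanh(-4.0))     # ~ 0.0013: gradient has essentially vanished
print(dtanh(0.1))      # ~ 0.99: a small random input still gives a gradient
```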
Thanks for the detailed explanation. I did try out your suggestions, such as adding relu to the intermediate layers. However, what I am noticing is that removing the tanh from the final layer does let the network update in my case.
What I am unable to understand is the code snippet in your answer. If the actor's network design remains the same, why is it able to update in your case? Did you change anything else?
Also, one more thing I notice is that upon running the update function more than once, the parameter values seem to remain the same. I guess it might be that the network does not change much after the first training step.
Yes, that is what I mentioned: the gradient vanishes because of the tanh, so removing it could be one option if it is not needed for your specific problem.
Because of the random initialization of the network, I got gradients in one case but not in the other. This is an effect of combining tanh with a single datapoint: if that single datapoint happens to produce a large (negative or positive) value through the net, tanh will generate a gradient of (almost) zero.
I think you pretty much have an implementation that should work; it is just that the single datapoint causes problems in random cases for you, so if you train on batches of data instead it should hopefully do better.
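Why batching helps can be sketched with the same tanh-derivative math (a NumPy toy, not the actual network from the thread; the pre-activation values and batch size are made up for illustration): a single saturated datapoint yields a gradient of essentially zero, but averaging over a batch almost always includes some unsaturated samples, so the update survives.

```python
import numpy as np

rng = np.random.default_rng(1)

# Gradient of tanh at the pre-activation z is 1 - tanh(z)^2.
def dtanh(z):
    return 1.0 - np.tanh(z) ** 2

# One unlucky datapoint with a large pre-activation: gradient ~ 0,
# so the corresponding training step barely moves the parameters.
single_grad = dtanh(np.array([-6.0])).mean()   # ~ 2.5e-5

# A batch of pre-activations: some samples land in the unsaturated
# region near zero, so the averaged gradient stays usable.
batch_grad = dtanh(rng.normal(scale=2.0, size=32)).mean()

print(single_grad, batch_grad)
```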