Some even simpler examples might be:
julia> f(x) = inv(inv(x));
julia> g(x) = cbrt(x)^3;
julia> all(f(x)==x for x in -10:10)
true
julia> all(g(x)≈x for x in -10:10)
true
julia> gradient(f, 0)
(NaN,)
julia> gradient(g, 0)
(NaN,)
If the chain of functions (or rather, the chain of their derivatives) contains singularities, at intermediate steps, then the final gradient will tend to be NaN. Even if it’s obvious to a human that things ought to cancel.
The individual components all seem correct here. I think these examples could be fixed by returning incorrect gradients near singularities, e.g. replacing the gradient at x==0
with one slightly off of it, like gradient(cbrt, 0 + eps())
. But this may have horrible consequences elsewhere, I’m not sure.