Autodiff-ing a function defined by the result of Roots.find_zero fails

Hi all!

I am trying to ForwardDiff a function that is defined by solving a non-linear equation that depends on parameters. The code below reproduces the error. It first redefines exp by solving log(x) - a = 0

using Roots, ForwardDiff

f( x, a ) = log( x ) - a
∂f_x( x, a ) = ForwardDiff.derivative( t -> f( t, a ), x )

mexp( a ) = find_zero(
    (
        x -> f( x, a ),
        x -> ∂f_x( x, a )
    ),
    1, Roots.Newton()
)

# Test
mexp( log( π ) ) ≈ π # true

# Differentiate
dmexp(a) = ForwardDiff.derivative( mexp, a )
dmexp( 1 ) # error!

Full stacktrace:

ERROR: MethodError: no method matching Float64(::ForwardDiff.Dual{ForwardDiff.Tag{typeof(mexp), Int64}, Float64, 1})
Closest candidates are:
  (::Type{T})(::Real, ::RoundingMode) where T<:AbstractFloat at rounding.jl:200
  (::Type{T})(::T) where T<:Number at boot.jl:772
  (::Type{T})(::AbstractChar) where T<:Union{AbstractChar, Number} at char.jl:50
  ...
Stacktrace:
  [1] convert(#unused#::Type{Float64}, x::ForwardDiff.Dual{ForwardDiff.Tag{typeof(mexp), Int64}, Float64, 1})
    @ Base ./number.jl:7
  [2] init_state(M::Roots.Newton, F::Roots.Callable_Function{Val{2}, Val{true}, Tuple{var"#11#13"{ForwardDiff.Dual{ForwardDiff.Tag{typeof(mexp), Int64}, Int64, 1}}, var"#12#14"{ForwardDiff.Dual{ForwardDiff.Tag{typeof(mexp), Int64}, Int64, 1}}}, Nothing}, x::Int64)
    @ Roots ~/.julia/packages/Roots/LYDH4/src/Derivative/newton.jl:63
  [3] init(𝑭𝑿::ZeroProblem{Tuple{var"#11#13"{ForwardDiff.Dual{ForwardDiff.Tag{typeof(mexp), Int64}, Int64, 1}}, var"#12#14"{ForwardDiff.Dual{ForwardDiff.Tag{typeof(mexp), Int64}, Int64, 1}}}, Int64}, M::Roots.Newton, p′::Nothing; p::Nothing, verbose::Bool, tracks::Roots.NullTracks, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ Roots ~/.julia/packages/Roots/LYDH4/src/find_zero.jl:283
  [4] solve(𝑭𝑿::ZeroProblem{Tuple{var"#11#13"{ForwardDiff.Dual{ForwardDiff.Tag{typeof(mexp), Int64}, Int64, 1}}, var"#12#14"{ForwardDiff.Dual{ForwardDiff.Tag{typeof(mexp), Int64}, Int64, 1}}}, Int64}, M::Roots.Newton, p::Nothing; verbose::Bool, kwargs::Base.Pairs{Symbol, Union{Nothing, Roots.NullTracks}, Tuple{Symbol, Symbol}, NamedTuple{(:p, :tracks), Tuple{Nothing, Roots.NullTracks}}})
    @ Roots ~/.julia/packages/Roots/LYDH4/src/find_zero.jl:469
  [5] find_zero(f::Tuple{var"#11#13"{ForwardDiff.Dual{ForwardDiff.Tag{typeof(mexp), Int64}, Int64, 1}}, var"#12#14"{ForwardDiff.Dual{ForwardDiff.Tag{typeof(mexp), Int64}, Int64, 1}}}, x0::Int64, M::Roots.Newton; p::Nothing, verbose::Bool, tracks::Roots.NullTracks, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ Roots ~/.julia/packages/Roots/LYDH4/src/find_zero.jl:215
  [6] find_zero(f::Tuple{var"#11#13"{ForwardDiff.Dual{ForwardDiff.Tag{typeof(mexp), Int64}, Int64, 1}}, var"#12#14"{ForwardDiff.Dual{ForwardDiff.Tag{typeof(mexp), Int64}, Int64, 1}}}, x0::Int64, M::Roots.Newton)
    @ Roots ~/.julia/packages/Roots/LYDH4/src/find_zero.jl:206
  [7] mexp(a::ForwardDiff.Dual{ForwardDiff.Tag{typeof(mexp), Int64}, Int64, 1})
    @ Main ~/Dropbox/Research/dynamic_games/hvb/tests.jl:21
  [8] derivative
    @ ~/.julia/packages/ForwardDiff/pDtsf/src/derivative.jl:14 [inlined]
  [9] dmexp(a::Int64)
    @ Main ~/Dropbox/Research/dynamic_games/hvb/tests.jl:31
 [10] top-level scope
    @ ~/Dropbox/Research/dynamic_games/hvb/tests.jl:32

Output of versioninfo:

Julia Version 1.8.0
Commit 5544a0fab76 (2022-08-17 13:38 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 20 × 12th Gen Intel(R) Core(TM) i9-12900HK
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, goldmont)
  Threads: 10 on 20 virtual cores
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 10

Roots v2.0.4 and ForwardDiff v0.10.32.

Many thanks for your help! :pray:

You really don’t want to do this, even if it works — you should implement the chain rule yourself using the implicit-function theorem, or use a package like ImplicitDifferentiation.jl. See also sections 3 and 5 of these notes. (Unless the Roots.jl package does this for you, which it does not currently: Roots.jl#325.)

The problem is that applying AD naively to a nonlinear solver will waste a lot of effort trying to propagate the derivatives through all of the intermediate steps, in order to exactly differentiate the error in the solution (from the finite number of steps). If you assume that the solution is converged to high precision, you can do much better by directly applying the chain rule to the equation being solved.

5 Likes

I recently had a similar problem - trying to ForwardDiff through a root finding step within a function that has a ton of other steps inside it, and ForwardDiff did fail (see here: confused by this root finding example · Issue #595 · JuliaDiff/ForwardDiff.jl · GitHub). Note that I was using Bisection instead of the default Order0 method, and instead of erroring out here, like in the above example, Bisection just silently fails and reports incorrect derivatives.

I am aware of the ImplicitDifferential package but as far as I can tell it only works with reverse mode approaches, which were not working in my setup for other reasons. More generally, I do know about the broader idea of computing derivatives myself using the chain rule + implicit differentiation, but given that the root finding step was but one of many steps in a much longer function that I needed a gradient of, I ended up giving up and just using finite differences, which I know is an even worse sin :wink:

For what its worth - the fact that ForwardDiff silently fails here, instead of explicitly failing, was pretty frustrating, and was something that took me forever to figure out, since I had the (apparently) mistaken impression that if ForwardDiff gives you a derivative, its giving you the right one.

Instead of making this response purely a rant, I will try to turn it into a productive question. Suppose a user like me is hoping to use ForwardDiff derivatives. Is there any (approximately) canned approach to doing what ImplicitDifferentiation is doing, but stay within the ForwardDiff universe? If I have a function f(x,theta) with many intermediate steps, in which step n of N is to perform a root finding operation, must I break f into an f1 which does the first n-1 steps, an f2 which does the root finding step, and an f3 which does the remaining steps? Should I have ForwardDiff create gradients for f1 and f3, manually work out the implicit gradient in f2`, and then manually combine those gradients to get at my complete gradient?

Until we get to the issue @stevengj opened, you can make this work in the short term by using one(a) as the initial value instead of 1 in your mexp function.

That Bisection error is odd. It arises from the number of steps taken by that algorithm and some interaction with ForwrdDiff. I’m unclear where that bug lurks. Anyways, until a better solution is implemented, avoiding Bisection is suggested (use the default A42 for bracketing).

Interesting. Can you explain why?

Your comment generalizes, and I have no idea why. Can you shed some light on this?

f( x, a ) = log( x ) - a
∂f_x( x, a ) = ForwardDiff.derivative( t -> f( t, a ), x )

mexp( a, x0 ) = find_zero(
    (
        x -> f( x, a ),
        x -> ∂f_x( x, a )
    ),
    convert( eltype( a ), x0 ), # New
Roots.Newton()
)

# Test
mexp( log( π ), 1.0 ) ≈ π

# Differentiate
dmexp( a, x0 ) = ForwardDiff.derivative( a -> mexp( a, x0 ), a )
dmexp( 1.0, 1.0 )
dmexp( log( 17.0 ), 1.0 )

In light of @stevengj’s response, I will probably just implement these derivatives myself. But I’m curious as to why this version works.

As an algorithm, Bisection is not differentiable because there is no continuous connection from an input to an output to collect the duals on. That is not the case for Falsi or Newton type methods. But of course, you don’t need to actually differentiate the algorithm so it’s a moot point.

2 Likes

I’m not sure I understand your point here. Sure, you don’t have to differentiate the algorithm (which afaiu is what ForwardDiff would be doing in this example), but perhaps you can. [The ability to do so comes up as an example of the power of autodiff here (also available as notes somewhere)]. If this feature is available, and because programmer time is a very scarce resource, we may want to exploit it if the result is sufficiently efficient for one’s purpose.

No you cannot. There is no arithmetic connection between the input and the answer when you use bisection. Think about it. What is the derivative of the root with respect to the brackets? Zero because as long as the root is in there you get the same solution. So now run the algorithm. Is it in the first half or the second? Midpoint is calculated using the average of the top and bottom, so duals of the brackets are transferred and scaled to the midpoint. Keep doing it. Therefore the duals are the derivatives of the brackets. Therefore when it tells you that they are zero it is correct: it correctly computed that the solution is constant with respect to the brackets.

It requires a manual fix using the implicit differentiation rule to handle, and a definition of what are parameters to f, which is precisely why the SciML tools have such explicit parameterization (see NonlinearSolve.jl)

I’m not using bisection. When you said

I understood you were referring to Newton-like methods. In that case you can differentiate the algorithm. As you say, you don’t have to, but it does save programmer time if one can just ForwardDiff (or an alternative autodiff) a function as the one defined above. So the point is not moot.

I understood your comment as “there are other (perhaps theoretically sounder) tools for the job, so there’s no need to talk about this particular tool”. I think this particular tool is still useful. Apologies if I misunderstood you.

For some reason I thought you were calling Bisection.

Yes, Newton-like methods are autodiff-able. Just use NonlinearSolve.jl whose implementation is differentiable, and it has overloads (and it outperforms Roots.jl, https://twitter.com/ChrisRackauckas/status/1544743542094020615)

1 Like

To close out this conversation from the Roots end, the newest version, closing the issue opened based on this thread, provides a different solution:

With ForwardDiff it is important – due to how types propagate – to start with a type matching the parameter:

using Roots, ForwardDiff, Test

f( x, a ) = log( x ) - a
fₓ( x, a ) = ForwardDiff.derivative( t -> f( t, a ), x )

mexp(a) = find_zero((f, fₓ), 1, Roots.Newton(), a)  # x0 = 1 -- not one(a)!
@test mexp(log(π)) ≈ π
@test_throws MethodError ForwardDiff.derivative(mexp, 1)

mexp_(a) = find_zero((f, fₓ), one(a), Roots.Newton(), a)
@test ForwardDiff.derivative(mexp_, 1) ≈ exp(1)

# using the closure style you started with
mexp1(a) = find_zero((x -> f(x,a), x->fₓ(x,a)), one(a), Roots.Newton())
@test ForwardDiff.derivative(mexp1, 1) ≈ exp(1)

The adjoint method, mentioned by @stevengj, can avoid the extra effort of pushing the AD machinery through the algorithm:

a = log(π)
xᵅ = find_zero((f,fₓ), 1, Roots.Newton(), a)
fx = ForwardDiff.derivative(x -> f(x, a), xᵅ)
fp = ForwardDiff.derivative(a -> f(xᵅ, a), a) # gradient if `a` is a vector
@test - fp / fx ≈ exp(a)

This led to a new frule and rrule in Roots so that AD programs should be able to differentiate directly (with the parameter passed in to find_zero, not within a closure):

using Zygote # uses rrule from Roots
@test first(Zygote.gradient(mexp, 1)) ≈ exp(1)
1 Like

This is great. Others have provided very useful input, but since this addresses the exact original question, I feel like this should be accepted as a solution.

Many thanks for your work on this!