Compile FastDifferentiation derivatives one time only

Hi @brianguenter, I’m currently using FastDifferentiation.jl, and I must say, the package is fantastic—it’s incredibly fast. Thank you very much for sharing this!

I have a question regarding the differentiation processes, such as derivative and jacobian. I believe these processes are computationally expensive, so I think it would be better if they could be handled at compile-time. However, I’m unsure how to achieve this. Initially, I tried using Val(func) and evaluating the differentiation within a generated function, but this approach resulted in a world age issue. Am I heading in the wrong direction?

Indeed the symbolic differentiation and compilation phase is costly, so you only want to run it once. This is made possible with FastDifferentiation.make_function. An alternative is to call FastDifferentiation.jl from DifferentiationInterface.jl and use the preparation mechanism, which creates the function for you.
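For concreteness, here is a rough sketch of both routes (the function and variable names are just illustrative):

using FastDifferentiation

@variables x y
f = [cos(x), sin(y), x^3 * y^4]
symbolic_jac = jacobian(f, [x, y])

# expensive step: symbolic differentiation + compilation, done once
jac_exe = make_function(symbolic_jac, [x, y])

jac_exe([1.0, 2.0]) # cheap: just evaluates the compiled code

# or via DifferentiationInterface.jl, where preparation plays the same role
using DifferentiationInterface
backend = AutoFastDifferentiation()
g(v) = v[1]^2 + sin(v[2])
prep = prepare_gradient(g, backend, zeros(2))    # expensive, done once
value_and_gradient(g, prep, backend, [1.0, 2.0]) # cheap, reuses the compiled function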


Thank you for creating a new topic.

I understand that ideally the preparation step can be avoided, but it is cumbersome in certain cases. For example, in my field, there are many functions that use AD and are called repeatedly. In such situations, the preparation step can be tedious because it is required for every use of AD. When using ForwardDiff.jl with SArray, no such preparation calls are necessary; the derivative can be evaluated immediately. This significantly simplifies the AD process for users.
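For illustration, this is the kind of direct call I mean (with a toy function):

using ForwardDiff, StaticArrays, LinearAlgebra

x = @SVector rand(3)
ForwardDiff.gradient(v -> 2v ⋅ v, x) # no preparation object needed, and the result is again an SVector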

Generated functions most often are the wrong direction.


FastDifferentiation.make_function relies on RuntimeGeneratedFunctions.jl, and RuntimeGeneratedFunction instances evade the world age issue because their methods inline an internal @generated function that draws from a method body cache, so in a roundabout way it was the right direction.
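For reference, a minimal sketch of that pattern (following the usage shown in the RuntimeGeneratedFunctions.jl README, with a toy expression):

using RuntimeGeneratedFunctions
RuntimeGeneratedFunctions.init(@__MODULE__)

function build_and_call()
    ex = :(function f(x)
        return 2x
    end)
    f = @RuntimeGeneratedFunction(ex) # callable right away, no world-age error
    return f(3.0)
end

build_and_call() # returns 6.0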


Can you give a minimal example of what you’re trying to achieve? Perhaps FastDifferentiation.jl is not the right tool, and ForwardDiff.jl might indeed be more suitable? The question of whether to use them directly or through DifferentiationInterface.jl is pretty much orthogonal.

It is difficult to give a minimal example, but the situation is like this:

function func1(x::SVector)
    f = a_function_I_want_to_apply_AD_to             # this can be an anonymous function
    prep = prepare_gradient(f, AutoForwardDiff(), x) # how can I evaluate this only once effectively?
    val, grad = value_and_gradient(f, prep, AutoForwardDiff(), x)
    # further computations
end

function func2(x::SVector, y::SVector)
    # this also uses AD
end

# there are still many other functions like these...

These functions (func1, func2…) are called many times in the simulation, e.g., in calculations for the basis function and tangent modulus. So their performance significantly impacts simulation time.

You would just have to run preparation for each of the functions you differentiate outside of func123 itself, but apart from that I don’t really see an issue? Although I’m not sure how well the function generated by FastDifferentiation.jl’s make_function works with static arrays.

If I understand you correctly this is possible. There is an unexported function make_Expr which will generate an Expr representing your function.

If you call this function inside a macro it will do all the preprocessing, derivative calculation, etc., at macro evaluation time.

Something like this I think will do what you want:

module Experiments
using FastDifferentiation


macro make_my_function()
    # this body runs when the macro is expanded, so the symbolic work is done only once
    @variables x y

    f = [cos(x), sin(y), x^3 * y^4]
    jac = jacobian(f, [x, y])

    return FastDifferentiation.make_Expr(jac, [x, y], false, false)
end

function runtime()
    return @make_my_function
end
export runtime

end # module Experiments

I think someone has asked me about this before so maybe I should export this function and document it.

FastDifferentiation works well with static arrays. In fact that is generally the way to get the best performance, so long as your arrays are relatively small.
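As a rough sketch of what I mean (just passing an SVector as the input to the generated function):

using FastDifferentiation, StaticArrays

@variables x y
jac = jacobian([cos(x), sin(y), x^3 * y^4], [x, y])
jac_exe = make_function(jac, [x, y])

jac_exe(SVector(1.0, 2.0)) # a static input vector works; keep the arrays relatively small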

Hmm… So, would you suggest preparing ad_prep for every single func123 and passing it as a function argument? If so, I don’t think I could accept that approach. I could accept it for non-bitstypes like Array, but not for bitstypes such as SArray; for the latter I want an easier-to-use framework. Please imagine func123 functions distributed across various locations, sometimes deeply nested within the project or inside very small functions. In these cases, we would constantly need to either create a mutable struct to store ad_prep, or set up ad_prep at a shallow point in the project where it doesn’t impact performance and keep it around until func123 eventually uses it. I think this structure is inefficient, as the footprint of func123 becomes too large. If AD were instead handled solely within func123, it would be as straightforward as simply writing an equation.

This could be a hint for the framework I want, thank you.

Do you think, for example, that it is possible to simply write gradient(f, x) to obtain the gradient for f(x) while allowing the make_function process to be completed at compile-time?

I do not understand exactly what you are asking. Could you expand a little and maybe give an example?

In Tensorial.jl, we can use AD, for example, as follows:

julia> x = rand(Mat{3,3});

julia> gradient(norm, x) ≈ x / norm(x)
true

In this process, I just need to pass norm and x for AD. I want to achieve the same framework using FastDifferentiation.jl. To do this, the make_function (and differentiation) steps need to happen inside the gradient function. But since these steps are expensive and the size of x is known at compile-time, I would like them to be done at compile-time if possible.

I think the macro method I described earlier will do what you want.

Yes, mostly. But I think this does not work for anonymous functions and (maybe) functions defined in Main, because of the world age problem?

In a macro, we cannot determine the size of x, so the approach does not work.

Thank you, everyone. It seems that the function I want might be impossible to achieve. I’m still working through it, and I’ll report back here if I make any progress.

I have never used AD myself, but I think the structure of this problem is just that you have a couple of functions you want to differentiate with respect to some vector of known size. This can be optimized by a prepare_gradient call, which precomputes/builds/returns some structure that helps the AD. Of course you want to do this only once per function (and possibly per size of input vector, if your functions can accept vectors of different lengths).

So all you need is a preprocessing step that does this prepare_gradient and bundles up the result with the function so you can pass that object around as a single entity. I propose something like:

using DifferentiationInterface
using ForwardDiff # for the AutoForwardDiff backend

struct FunctionWithPreppedGradient{F,PG}
    func::F
    prepped_gradient::PG
    function FunctionWithPreppedGradient(f, x)
        prepped_gradient = prepare_gradient(f, AutoForwardDiff(), x)
        return new{typeof(f),typeof(prepped_gradient)}(f, prepped_gradient)
    end
end

# make it callable: calling the wrapper just evaluates the wrapped function
(f::FunctionWithPreppedGradient)(args...; kwargs...) = f.func(args...; kwargs...)

# and make gradient efficient
function DifferentiationInterface.value_and_gradient(fpg::FunctionWithPreppedGradient, backend, x)
    value_and_gradient(fpg.func, fpg.prepped_gradient, backend, x)
end

So when the function that you need the gradients of comes in, you just wrap it up into this struct and everything down the line should be efficient 🙂 Maybe you need to add additional functionality, and maybe my syntax is wrong. Think of this as a sketch of a solution rather than the final thing; as I said, I have never used AD myself yet.
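Usage might then look roughly like this (untested; for clarity I call DI’s plain value_and_gradient on the stored pieces, just to show the flow):

using StaticArrays, LinearAlgebra, DifferentiationInterface, ForwardDiff

x0 = @SVector rand(3)
fpg = FunctionWithPreppedGradient(v -> 2v ⋅ v, x0) # preparation happens once, here
fpg(x0)                                            # plain function value
# every later gradient call reuses the stored preparation:
value_and_gradient(fpg.func, fpg.prepped_gradient, AutoForwardDiff(), x0)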

@abraemer, thank you for your kind suggestion. However, as long as the preparation step is required each time, my issue remains unresolved.

I’ll provide an example to illustrate. The first version below uses DifferentiationInterface.jl with StaticArrays.jl:

using StaticArrays
using LinearAlgebra
using DifferentiationInterface
using ForwardDiff # for AutoForwardDiff()

function main()
    t = 0.0
    T = 10.0
    dt = 1.0
    vals = map(i -> rand(SVector{3}), 1:100_000)
    while t < T
        calculate!(vals)
        t += dt
    end
end

function calculate!(vals)
    for i in eachindex(vals)
        vals[i] = func1(vals[i])
    end
end

function func1(x::SVector)
    f = x -> 2x ⋅ x
    prep = prepare_gradient(f, AutoForwardDiff(), x)
    y, dydx = value_and_gradient(f, prep, AutoForwardDiff(), x)
    func2(y, dydx)
end

function func2(y::Real, dydx::SVector)
    prep = prepare_hessian(norm, AutoForwardDiff(), dydx)
    H = hessian(norm, prep, AutoForwardDiff(), dydx)
    func3(2*y, H * dydx)
end

function func3(z::Real, v::SVector)
    f = x -> z * x
    prep = prepare_jacobian(f, AutoForwardDiff(), v)
    J = jacobian(f, prep, AutoForwardDiff(), v)
    z .+ J * v
end

@time main()
  2.589184 seconds (20.00 M allocations: 3.370 GiB, 3.47% gc time)

This code is slow as expected because preparation is performed each time. To prepare only once, the code can be modified as follows:

using StaticArrays
using LinearAlgebra
using DifferentiationInterface
using ForwardDiff

function main()
    t = 0.0
    T = 10.0
    dt = 1.0
    vals = map(i -> rand(SVector{3}), 1:100_000)
    prep1 = prepare_gradient(x -> 2x ⋅ x, AutoForwardDiff(), vals[1])
    prep2 = prepare_hessian(norm, AutoForwardDiff(), vals[1])
    z = 1.0 # dummy value so the closure used only for preparation is well defined
    prep3 = prepare_jacobian(x -> z * x, AutoForwardDiff(), vals[1])
    while t < T
        calculate!(vals, prep1, prep2, prep3)
        t += dt
    end
end

function calculate!(vals, prep1, prep2, prep3)
    for i in eachindex(vals)
        vals[i] = func1(vals[i], prep1, prep2, prep3)
    end
end

function func1(x::SVector, prep1, prep2, prep3)
    f = x -> 2x ⋅ x
    y, dydx = value_and_gradient(f, prep1, AutoForwardDiff(), x)
    func2(y, dydx, prep2, prep3)
end

function func2(y::Real, dydx::SVector, prep2, prep3)
    H = hessian(norm, prep2, AutoForwardDiff(), dydx)
    func3(2*y, H * dydx, prep3)
end

function func3(z::Real, v::SVector, prep3)
    f = x -> z * x
    J = jacobian(f, prep3, AutoForwardDiff(), v)
    z .+ J * v
end

@time main()
  0.086852 seconds (22 allocations: 2.292 MiB)

This approach is much faster but requires separate preparation steps for each function, which can be cumbersome. With Tensorial.jl, however, the code can be simplified as follows:

using Tensorial

function main()
    t = 0.0
    T = 10.0
    dt = 1.0
    vals = map(i -> rand(Vec{3}), 1:100_000)
    while t < T
        calculate!(vals)
        t += dt
    end
end

function calculate!(vals)
    for i in eachindex(vals)
        vals[i] = func1(vals[i])
    end
end

function func1(x::Vec)
    f = x -> 2x ⋅ x
    dydx, y = gradient(f, x, :all)
    func2(y, dydx)
end

function func2(y::Real, dydx::Vec)
    H = hessian(norm, dydx)
    func3(2*y, H * dydx)
end

function func3(z::Real, v::Vec)
    f = x -> z * x
    J = gradient(f, v) # this automatically becomes a Jacobian in Tensorial.jl if the output of the function is a vector
    z .+ J * v
end

@time main()
  0.079240 seconds (3 allocations: 2.289 MiB)

The execution time is comparable to the second example, but the code is significantly simpler. This is possible because, when using ForwardDiff.jl, the preparation object can be a bitstype, so the preparation actually has almost no cost.


This is type piracy; I would strongly discourage it. You can define your own value_and_gradient function instead of overloading the one from DI.