Debugging type-instability allocations with ForwardDiff

I’m really struggling with my ForwardDiff.jl bindings for DifferentiationInterface.jl.
The goal is to achieve an allocation-free Jacobian-vector product (aka pushforward) for functions of the form f!(y, x). I have succeeded for pushforward! but not for value_and_pushforward!.
To reproduce my issue:

Profiling code
using DifferentiationInterface
using ForwardDiff  # needed so the AutoForwardDiff extension backend is loaded
using Chairmarks, JET
# @profview_allocs is provided by the Julia VS Code extension

f!(y, x) = copyto!(y, x)

function profile_pushforward(n)
    b = AutoForwardDiff()
    x, dx, y, dy = rand(n), rand(n), zeros(n), zeros(n)
    extras = prepare_pushforward(f!, y, b, x, dx)

    @test_opt pushforward!(f!, y, dy, b, x, dx, extras)

    @profview_allocs for _ in 1:100000
        pushforward!(f!, y, dy, b, x, dx, extras)
    end

    @be pushforward!(f!, y, dy, b, x, dx, extras)
end

function profile_value_and_pushforward(n)
    b = AutoForwardDiff()
    x, dx, y, dy = rand(n), rand(n), zeros(n), zeros(n)
    extras = prepare_pushforward(f!, y, b, x, dx)

    @test_opt value_and_pushforward!(f!, y, dy, b, x, dx, extras)

    @profview_allocs for _ in 1:100000
        value_and_pushforward!(f!, y, dy, b, x, dx, extras)
    end

    @be value_and_pushforward!(f!, y, dy, b, x, dx, extras)
end

profile_pushforward(10)
profile_pushforward(100)
profile_value_and_pushforward(10)
profile_value_and_pushforward(100)
Results
julia> profile_pushforward(10)
Benchmark: 5129 samples with 755 evaluations
min    22.925 ns
median 23.571 ns
mean   23.569 ns
max    65.959 ns

julia> profile_pushforward(100)
Benchmark: 3086 samples with 405 evaluations
min    69.160 ns
median 70.953 ns
mean   74.641 ns
max    291.783 ns

julia> profile_value_and_pushforward(10)
Benchmark: 2756 samples with 149 evaluations
min    192.268 ns (1 allocs: 32 bytes)
median 203.742 ns (1 allocs: 32 bytes)
mean   218.388 ns (1 allocs: 32 bytes)
max    642.477 ns (1 allocs: 32 bytes)

julia> profile_value_and_pushforward(100)
Benchmark: 2769 samples with 95 evaluations
min    296.947 ns (1 allocs: 32 bytes)
median 314.632 ns (1 allocs: 32 bytes)
mean   355.077 ns (1 allocs: 32 bytes, 0.04% gc time)
max    120.325 μs (1 allocs: 32 bytes, 99.15% gc time)

There is one allocation of 32 bytes in value_and_pushforward! regardless of the input size, which makes me suspect a type instability. Unfortunately, JET.@test_opt says it’s all fine, so I’m rather confused.
What’s worse, the allocation disappears if you comment out either of the following lines: https://github.com/gdalle/DifferentiationInterface.jl/blob/cc738421b6a8ccd3f0b2a807159ccc78f0c1c151/DifferentiationInterface/ext/DifferentiationInterfaceForwardDiffExt/twoarg.jl#L56-L57. So the type instability cannot be specific to either line, which seems weird to me.

My best guesses:

Ping @hill in case you have a clue


Changing the method signature in DifferentiationInterface/ext/DifferentiationInterfaceForwardDiffExt/twoarg.jl (at commit cc738421b6a8ccd3f0b2a807159ccc78f0c1c151) to

    f!::F, y, dy, ::AutoForwardDiff, x, dx, extras::ForwardDiffTwoArgPushforwardExtras{T}
) where {T,F}

so that it specializes on f! solves it for me (I first switched to ForwardDiff#master because of a similar issue fixed in JuliaDiff/ForwardDiff.jl#615, “Support specializing on functions” by j-fu), but I haven’t checked whether that is required. Probably the same trick is needed for all functions in the extension?
(Found using ProfileView.jl + Cthulhu.jl with descend_clicked().)
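Roughly, that workflow looks like this (a sketch, assuming the same variables as in profile_value_and_pushforward above):

using ProfileView, Cthulhu

# Profile the hot loop; the allocating frame shows up in the flame graph.
@profview for _ in 1:100000
    value_and_pushforward!(f!, y, dy, b, x, dx, extras)
end

# Click the suspicious bar in the flame graph, then descend into it:
descend_clicked()  # Cthulhu displays the inferred types at that frame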


Good to know that Cthulhu can catch the things JET misses!

Thanks for the suggestion, but I’m not sure this is a good idea. The “no specialization on functions” heuristic is in place for a reason, and forcing specialization on every single function we could possibly differentiate sounds expensive?
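For context, the heuristic in question: Julia does not specialize on a Function argument that is merely passed through to another call, whereas annotating it as f::F with F a type parameter forces specialization. A minimal sketch (hypothetical functions, not DI code; Base.specializations requires Julia ≥ 1.10):

# f is only passed through to map, so this compiles once for all functions:
f_pass(f, x) = map(f, x)
# The ::F type parameter forces one specialization per function type:
f_spec(f::F, x) where {F} = map(f, x)

f_pass(sin, rand(3)); f_pass(cos, rand(3));
f_spec(sin, rand(3)); f_spec(cos, rand(3));

length(collect(Base.specializations(only(methods(f_pass)))))  # 1
length(collect(Base.specializations(only(methods(f_spec)))))  # 2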

EDIT: ForwardDiff itself has forced specialization since the PR mentioned above, e.g. in gradient, so maybe it’s not a bad idea.

More generally, I’m curious as to why commenting out an unrelated line (where f! does not even play a role) makes the problem disappear. Is it because the function crosses a simplicity threshold where the compiler is able to optimize away the inefficiency?

The shuttling around f is the cost you pay and that’s relatively small here.

Just to be clear, you mean that I should indeed add f::F everywhere, even if it increases the compilation burden?

Any idea why JET didn’t catch this type instability but Cthulhu did? Cthulhu is much harder to automate, e.g. in tests.

Depends on the application. For most applications with AD, probably yes: the core cost is not compiling the setup code but compiling the alternative formulation of f that the AD generates (i.e. AD always has to specialize on f to some extent anyway, so the only question is whether you also re-specialize the surrounding code).

Certain tricks can be done with FunctionWrappers.jl and FunctionWrappersWrappers.jl to decrease this cost but that’s probably not necessary here.
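A minimal sketch of that trick, in case it’s useful (illustrative only, not DI code):

using FunctionWrappers: FunctionWrapper

# Fix the signature once: Float64 in, Float64 out. Any function matching
# it is stored behind the same concrete type, so callers compile once.
fw_sin = FunctionWrapper{Float64,Tuple{Float64}}(sin)
fw_cos = FunctionWrapper{Float64,Tuple{Float64}}(cos)

fw_sin(1.0)  # calls through a function pointer, no dynamic dispatch
typeof(fw_sin) === typeof(fw_cos)  # true, despite wrapping different functions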


I think that if ForwardDiff.jl doesn’t want to specialize on f, it should use FunctionWrappers.jl.

It is already strict about input -> output types.

FunctionWrappers will avoid specialization with much lower overhead than dynamic dispatch.

If going that far, ForwardDiff should also avoid the dynamic dispatch on chunk size.

From what I understand, ForwardDiff.jl recently decided that it does want to force specialization on f?
https://github.com/JuliaDiff/ForwardDiff.jl/pull/615

It seems rather unavoidable if you want a variable chunk size that is also a type parameter for tuples of partials? What would you do differently there?
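(For context, the chunk size really is baked into the type: the partials of a ForwardDiff.Dual are an NTuple whose length is a type parameter, so a chunk size known only at runtime must cross a dynamic dispatch at some point.)

julia> using ForwardDiff

julia> d = ForwardDiff.Dual(1.0, 1.0, 0.0);  # one value, two partials

julia> typeof(d)  # the chunk size 2 is part of the type
ForwardDiff.Dual{Nothing, Float64, 2}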

I think that is reasonable, but…
…is ForwardDiff 0.11 ever actually happening?

There is a finite set of options.
Max default chunk size is 12.
So a (not necessarily good) solution would be

struct Chunk{N} end

# Branch from the runtime Int to a compile-time chunk size. @nif unrolls
# to if/elseif over n = 1:11, with the 12th expression as the else branch,
# so any chunk_size outside 1:11 falls through to Chunk{12}.
function gradient!(f, y, x, chunk_size::Int)
    Base.Cartesian.@nif 12 n -> (n == chunk_size) n -> gradient!(f, y, x, Chunk{n}())
end
function gradient!(f, y, x, ::Chunk{chunk_size}) where {chunk_size}
    # impl: here chunk_size is a compile-time constant
end

This is not good because it is expensive to compile all 12 specializations.
I would suggest instead only supporting powers of 2, and perhaps multiples of 4, so that the options are instead 1, 2, 4, 8, and 12. Perhaps we can drop that further.
This requires supporting chunk-size-padding.
It is very likely that a chunk size of 8 is faster than 7, and depending on your hardware, it might be faster than 5 and 6 as well. So the cost of padding out the chunk size is low.
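For concreteness, the padding could be a tiny lookup; pad_chunk and SUPPORTED_CHUNKS below are made-up names, just to illustrate:

# Round a requested chunk size up to the nearest supported one
# (errors if n > 12; a real version would clamp or throw nicely).
const SUPPORTED_CHUNKS = (1, 2, 4, 8, 12)
pad_chunk(n::Int) = minimum(c for c in SUPPORTED_CHUNKS if c >= n)

pad_chunk(7)  # 8: seven partials get padded out to eight lanes
pad_chunk(3)  # 4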

I would also suggest again merging the SIMD.jl PR.

julia> using ForwardDiff, StaticArrays, BenchmarkTools

julia> x = @SVector rand(7);

julia> d(x, ::Val{N}) where {N} = ForwardDiff.Dual(x, ntuple(_->randn(), Val(N))...)
d (generic function with 1 method)

julia> for n = 5:8
           dx = d.(x, Val(n))
           print("n = $n: ")
           @btime sum(log, $dx)
       end
n = 5:   44.304 ns (0 allocations: 0 bytes)
n = 6:   48.354 ns (0 allocations: 0 bytes)
n = 7:   52.333 ns (0 allocations: 0 bytes)
n = 8:   49.872 ns (0 allocations: 0 bytes)

Rounding chunk sizes up doesn’t have much negative cost compared to a dynamic dispatch.
(We don’t have the dynamic dispatch here because of using StaticArrays.)

I mean, if it never happens, then all of the stuff that gets merged onto master is pointless. I don’t know what the breaking change from 0.10 to 0.11 would be, but if it is so scary, maybe this is an opportunity to release 1.0 so that bug fixes can at least be backported in the future.

Note that the 0.11 PR is more than a year old! See “Version 0.11” by mcabbott, Pull Request #613 · JuliaDiff/ForwardDiff.jl.

Anyway, if you want to get something into ForwardDiff, you should ignore master and merge it into the release-0.10 branch instead.

I don’t know anything about ForwardDiff maintenance, but it would be sad to consider that one of the most widely used tools in the ecosystem will never publish a new release for fear of being breaking. That’s what semver and CI are for? Unless I’m missing something @mcabbott

No, why? You should merge into master and backport, like other recent changes. Not ideal but let’s not make it worse.

It’s stuck essentially on people wondering what else should go in the same breaking release, and whether it should be 1.0. And everyone being busy. The history is that 481 took 2 years to get in, and similar indecision about 0.11 pushed us to call it a bugfix… which resulted in enough complaining that it was pulled.

Useful steps would be for someone to review 695 carefully, e.g. does it create any other surprises? Also 572, and 570 as above.

They don’t remove the need for 290 registered packages to update compat. That’s a lot of labour.
