Debugging type-instability allocations with ForwardDiff

I’m really struggling with my ForwardDiff.jl bindings for DifferentiationInterface.jl.
The goal is to achieve an allocation-free Jacobian-vector product (aka pushforward) for functions of the form f!(y, x). I have succeeded for pushforward! but not for value_and_pushforward!.
To reproduce my issue:

Profiling code

```julia
using DifferentiationInterface
using Chairmarks, JET

f!(y, x) = copyto!(y, x)

function profile_pushforward(n)
    b = AutoForwardDiff()
    x, dx, y, dy = rand(n), rand(n), zeros(n), zeros(n)
    extras = prepare_pushforward(f!, y, b, x, dx)

    @test_opt pushforward!(f!, y, dy, b, x, dx, extras)

    # @profview_allocs is provided by the Julia VS Code extension
    @profview_allocs for _ in 1:100000
        pushforward!(f!, y, dy, b, x, dx, extras)
    end

    @be pushforward!(f!, y, dy, b, x, dx, extras)
end

function profile_value_and_pushforward(n)
    b = AutoForwardDiff()
    x, dx, y, dy = rand(n), rand(n), zeros(n), zeros(n)
    extras = prepare_pushforward(f!, y, b, x, dx)

    @test_opt value_and_pushforward!(f!, y, dy, b, x, dx, extras)

    @profview_allocs for _ in 1:100000
        value_and_pushforward!(f!, y, dy, b, x, dx, extras)
    end

    @be value_and_pushforward!(f!, y, dy, b, x, dx, extras)
end
```

```
julia> profile_pushforward(10)
Benchmark: 5129 samples with 755 evaluations
min    22.925 ns
median 23.571 ns
mean   23.569 ns
max    65.959 ns

julia> profile_pushforward(100)
Benchmark: 3086 samples with 405 evaluations
min    69.160 ns
median 70.953 ns
mean   74.641 ns
max    291.783 ns

julia> profile_value_and_pushforward(10)
Benchmark: 2756 samples with 149 evaluations
min    192.268 ns (1 allocs: 32 bytes)
median 203.742 ns (1 allocs: 32 bytes)
mean   218.388 ns (1 allocs: 32 bytes)
max    642.477 ns (1 allocs: 32 bytes)

julia> profile_value_and_pushforward(100)
Benchmark: 2769 samples with 95 evaluations
min    296.947 ns (1 allocs: 32 bytes)
median 314.632 ns (1 allocs: 32 bytes)
mean   355.077 ns (1 allocs: 32 bytes, 0.04% gc time)
max    120.325 μs (1 allocs: 32 bytes, 99.15% gc time)
```

There is one allocation of 32 bytes in value_and_pushforward! regardless of the size of the input, so I’m thinking type instability. Unfortunately, JET.@test_opt says it’s all fine, so I’m rather confused.
What’s worse, the allocation disappears if you comment out either of the following lines. So the type instability cannot be specific to either line, which seems weird to me.
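As a sanity check independent of Chairmarks, a per-call allocation can be confirmed with `@allocated` after a warm-up call. A minimal sketch reusing the `f!` from above (`bytes_per_call` is just an illustrative helper, not part of DI):

```julia
f!(y, x) = copyto!(y, x)

function bytes_per_call(n)
    x, y = rand(n), zeros(n)
    f!(y, x)             # warm up so compilation is not counted
    @allocated f!(y, x)  # bytes allocated by a single call
end

bytes_per_call(10)  # 0 for this in-place copy
```

Wrapping the measurement in a function matters: at top level, untyped globals can themselves introduce allocations that have nothing to do with the function under test.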

My best guesses:

Ping @hill in case you have a clue


Changing DifferentiationInterface.jl/DifferentiationInterface/ext/DifferentiationInterfaceForwardDiffExt/twoarg.jl at cc738421b6a8ccd3f0b2a807159ccc78f0c1c151 · gdalle/DifferentiationInterface.jl · GitHub to

```julia
    f!::F, y, dy, ::AutoForwardDiff, x, dx, extras::ForwardDiffTwoArgPushforwardExtras{T}
) where {T,F}
```

so that it specializes on f! solves it for me (I first switched to ForwardDiff#master because of a similar issue fixed in Support specializing on functions by j-fu · Pull Request #615 · JuliaDiff/ForwardDiff.jl · GitHub), but I haven’t checked whether that is required. Probably the same trick is needed for all functions in the extension?
(Found using ProfileView.jl + Cthulhu.jl with descend_clicked().)
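For reference, the heuristic at play can be reproduced in isolation (a minimal sketch with made-up function names, not DI’s actual code): Julia avoids automatically specializing on a function argument that is only passed through to another call, unless the signature carries a `::F` type parameter.

```julia
apply(f, x) = f(x)

forward(f, x) = apply(f, x)                    # f only forwarded: no automatic specialization
forward_spec(f::F, x) where {F} = apply(f, x)  # ::F forces specialization on f

forward(abs, -2)       # 2
forward_spec(abs, -2)  # 2
```

Both return the same result; the difference only shows up in the generated code and in allocation behavior when the unspecialized version has to box `f`.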


Good to know that Cthulhu can catch the things JET misses!

Thanks for the suggestion, but I’m hesitant as to whether this is a good idea. The “no-specialization on functions” heuristic is in place for a reason, and forcing specialization on every single function we could possibly differentiate sounds expensive?

EDIT: ForwardDiff itself has forced specialization since the PR mentioned above, e.g. in gradient, so maybe it’s not a bad idea.

More generally, I’m curious as to why commenting out an unrelated line (where f! does not even play a role) makes the problem disappear. Is it because the function crosses a simplicity threshold where the compiler is able to optimize away the inefficiency?

The shuttling around f is the cost you pay and that’s relatively small here.

Just to be clear, you mean that I should indeed add f::F everywhere, even if it increases the compilation burden?

Any idea why JET didn’t catch this type instability but Cthulhu did? Cthulhu is much harder to automate, e.g. in tests.

Depends on the application. For most applications with AD, probably yes, since the core cost is not the compilation cost of the setup code but rather the compilation cost due to the alternative f formulation that is generated (i.e. AD always has to do some amount of specialization on f anyways, so it’s only a matter of whether you’re also re-specializing the outside code as well).

Certain tricks can be done with FunctionWrappers.jl and FunctionWrappersWrappers.jl to decrease this cost but that’s probably not necessary here.


I think that if ForwardDiff.jl doesn’t want to specialize on f, it should use FunctionWrappers.jl.

It is already strict about input->output.

FunctionWrappers will avoid specialization with much lower overhead than dynamic dispatch.

If going that far, ForwardDiff should also avoid the dynamic dispatch on chunk size.
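For context on that suggestion: a FunctionWrapper pins the argument and return types once, so calls go through a pre-computed function pointer instead of dynamic dispatch. A minimal sketch (assumes the FunctionWrappers.jl package is installed):

```julia
using FunctionWrappers: FunctionWrapper

# One concrete type for any Float64 -> Float64 callable:
# the wrapped function no longer appears in the type.
fw = FunctionWrapper{Float64,Tuple{Float64}}(x -> 2x)
fw(3.0)  # 6.0
```

Since `fw` has the same concrete type regardless of which function it wraps, code that stores or forwards it never needs to re-specialize.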

From what I understand, ForwardDiff.jl recently decided that it does want to force specialization on f?

It seems rather unavoidable if you want a variable chunk size that is also a type parameter for tuples of partials? What would you do differently there?

I think that is reasonable, but…
…is ForwardDiff 0.11 ever actually happening?

There is a finite set of options.
Max default chunk size is 12.
So a (not necessarily good) solution would be

```julia
struct Chunk{N} end

function gradient!(f, y, x, chunk_size::Int)
    Base.Cartesian.@nif 12 n -> (n == chunk_size) n -> gradient!(f, y, x, Chunk{n}())
end

function gradient!(f, y, x, ::Chunk{chunk_size}) where {chunk_size}
    # impl
end
```

This is not good because it is expensive to compile all 12 specializations.
I would suggest instead only supporting powers of 2, and perhaps multiples of 4, so that the options are instead
1, 2, 4, 8, and 12.
Perhaps we can drop that further.
This requires supporting chunk-size-padding.
It is very likely that a chunk size of 8 is faster than 7, and depending on your hardware, it might be faster than 5 and 6 as well. So the cost of padding out the chunk size is low.
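Sketched as a hypothetical helper (the name and the exact option set are illustrative), padding would just round the requested chunk size up to the nearest supported option:

```julia
# Round a requested chunk size up to the nearest supported option,
# clamping at the max default chunk size of 12.
function pad_chunk(n::Integer)
    for c in (1, 2, 4, 8, 12)
        c >= n && return c
    end
    return 12
end

pad_chunk(7)   # 8
pad_chunk(13)  # 12
```

With only five options, the `@nif`-style dispatch above shrinks to five branches, and only five specializations per function need compiling instead of twelve.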

I would also suggest again merging the SIMD.jl PR.

```julia
julia> using ForwardDiff, StaticArrays, BenchmarkTools

julia> x = @SVector rand(7);

julia> d(x, ::Val{N}) where {N} = ForwardDiff.Dual(x, ntuple(_ -> randn(), Val(N))...)
d (generic function with 1 method)

julia> for n = 5:8
           dx = d.(x, Val(n))
           print("n = $n: ")
           @btime sum(log, $dx)
       end
n = 5:   44.304 ns (0 allocations: 0 bytes)
n = 6:   48.354 ns (0 allocations: 0 bytes)
n = 7:   52.333 ns (0 allocations: 0 bytes)
n = 8:   49.872 ns (0 allocations: 0 bytes)
```

Rounding chunk sizes up doesn’t have much negative cost compared to a dynamic dispatch.
(We don’t have the dynamic dispatch here, because of using StaticArrays.)

I mean, if it never happens, then all of the stuff that gets merged onto master is pointless. I don’t know what the breaking change from 0.10 to 0.11 would be, but if it is so scary, maybe this is an opportunity to release 1.0 so that bug fixes can at least be backported in the future.

Note that 0.11 was more than a year ago! Version 0.11 by mcabbott · Pull Request #613 · JuliaDiff/ForwardDiff.jl · GitHub

Anyway, if you want to get something into ForwardDiff, you should ignore master and merge it into the release-0.10 branch instead.

I don’t know anything about ForwardDiff maintenance, but it would be sad to consider that one of the most widely used tools in the ecosystem will never publish a new release for fear of being breaking. That’s what semver and CI are for? Unless I’m missing something @mcabbott

No, why? You should merge into master and backport, like other recent changes. Not ideal but let’s not make it worse.

It’s stuck essentially on people wondering what else should go into the same breaking release, and whether it should be 1.0. And everyone being busy. The history is that 481 took 2 years to get in, and similar indecision about 0.11 pushed us to calling this a bugfix… which resulted in enough complaining that it was pulled.

Useful steps would be someone reviewing 695 carefully, e.g. does this create any other surprises? Also 572. Also 570 as above.

They don’t remove the need for 290 registered packages to update compat. That’s a lot of labour.