What are `autodiff_deferred` and `autodiff_thunk` for in Enzyme?

I think I understand the basic `autodiff` function (see this post), but it has three variants: `autodiff_deferred`, `autodiff_thunk`, and `autodiff_deferred_thunk`.

Could someone explain to me:

  • why “deferring” computations is useful for GPU or higher-order differentiation?
  • in which cases I should use the “thunk” version?
  • whether there are performance differences?

The autodiff_deferred variants are for higher-order or GPU use cases. They defer compilation until we actually call things, which allows us to detect them when compiling for the GPU or when seeing a call to them within a compilation request for autodiff.

Long-term I think autodiff_deferred should just be the normal interface.
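
The typical higher-order use looks roughly like this (the exact signature has shifted a bit between Enzyme versions, so treat it as a sketch rather than gospel):

using Enzyme

f(x) = x * x

# The inner differentiation happens inside code that is itself being differentiated,
# so it must use the deferred variant; the outer call can use plain autodiff.
# (Signature sketch; it mirrors autodiff in whatever Enzyme version you have installed.)
df(x) = autodiff_deferred(Reverse, f, Active, Active(x))[1][1]

autodiff(Reverse, df, Active, Active(3.0))  # second derivative of x^2, i.e. 2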

The thunk variants don’t immediately call the augmented primal or reverse function, but instead return callable objects (thunks) that you can use to perform the computation.


Okay thanks! Could you help me understand how to make it work with a vector output? The example in the docs is a scalar output, and I don’t understand how I’m supposed to pass the cotangent:

julia> using Enzyme

julia> g(x) = abs2.(x)
g (generic function with 1 method)

julia> x = [2.0, 3.0];

julia> dx = zero(x);

julia> dy = [5.0, 7.0];

julia> forw, rev = autodiff_thunk(ReverseSplitWithPrimal, Const{typeof(g)}, Duplicated, Duplicated{typeof(x)})
(Enzyme.Compiler.AugmentedForwardThunk{Ptr{Nothing}, Const{typeof(g)}, Duplicated{Vector{Float64}}, Tuple{Duplicated{Vector{Float64}}}, Val{1}, Val{true}(), @NamedTuple{1, 2, 3, 4, 5::Bool, 6::Bool, 7::Core.LLVMPtr{Float64, 0}, 8::Core.LLVMPtr{Float64, 0}, 9::Core.LLVMPtr{Float64, 0}, 10::Core.LLVMPtr{Float64, 0}, 11::Core.LLVMPtr{Float64, 0}, 12::Core.LLVMPtr{Float64, 0}, 13::Core.LLVMPtr{Float64, 0}, 14::Core.LLVMPtr{Float64, 0}, 15::Core.LLVMPtr{Float64, 0}}}(Ptr{Nothing} @0x00007871c8880450), Enzyme.Compiler.AdjointThunk{Ptr{Nothing}, Const{typeof(g)}, Duplicated{Vector{Float64}}, Tuple{Duplicated{Vector{Float64}}}, Val{1}, @NamedTuple{1, 2, 3, 4, 5::Bool, 6::Bool, 7::Core.LLVMPtr{Float64, 0}, 8::Core.LLVMPtr{Float64, 0}, 9::Core.LLVMPtr{Float64, 0}, 10::Core.LLVMPtr{Float64, 0}, 11::Core.LLVMPtr{Float64, 0}, 12::Core.LLVMPtr{Float64, 0}, 13::Core.LLVMPtr{Float64, 0}, 14::Core.LLVMPtr{Float64, 0}, 15::Core.LLVMPtr{Float64, 0}}}(Ptr{Nothing} @0x00007871c8880a40))

julia> tape, y, shadow_y = forw(Const(g), Duplicated(x, dx))
(var"1" = @NamedTuple{1, 2, 3, 4, 5::Bool, 6::Bool, 7::Core.LLVMPtr{Float64, 0}, 8::Core.LLVMPtr{Float64, 0}, 9::Core.LLVMPtr{Float64, 0}, 10::Core.LLVMPtr{Float64, 0}, 11::Core.LLVMPtr{Float64, 0}, 12::Core.LLVMPtr{Float64, 0}, 13::Core.LLVMPtr{Float64, 0}, 14::Core.LLVMPtr{Float64, 0}, 15::Core.LLVMPtr{Float64, 0}}(([0.0, 0.0], [4.0, 9.0], nothing, nothing, false, false, Core.LLVMPtr{Float64, 0}(0x000078715e006610), Core.LLVMPtr{Float64, 0}(0x7ffffffffffffffe), Core.LLVMPtr{Float64, 0}(0x000078715e0065e0), Core.LLVMPtr{Float64, 0}(0x000078715e181730), Core.LLVMPtr{Float64, 0}(0x000078725cb10311), Core.LLVMPtr{Float64, 0}(0x000078724ca32f60), Core.LLVMPtr{Float64, 0}(0x000078725ca6bee0), Core.LLVMPtr{Float64, 0}(0x000078724ca32dc0), Core.LLVMPtr{Float64, 0}(0x0000000009900c00))), var"2" = [4.0, 9.0], var"3" = [0.0, 0.0])

julia> y
2-element Vector{Float64}:
 4.0
 9.0

julia> rev(Const(g), Duplicated(x, dx), Duplicated(y, dy), tape)
ERROR: AssertionError: length(argtypes) + needs_tape == length(argexprs)
Stacktrace:
     ⋮ internal @ Enzyme.Compiler, GPUCompiler, Core, Unknown
 [6] (::Enzyme.Compiler.AdjointThunk{Ptr{…}, Const{…}, Duplicated{…}, Tuple{…}, Val{…}, @NamedTuple{…}})(::Const{typeof(g)}, ::Duplicated{Vector{…}}, ::Vararg{Any})
   @ Enzyme.Compiler ~/.julia/packages/Enzyme/l4FS0/src/compiler.jl:5004
Use `err` to retrieve the full stack trace.
Some type information was truncated. Use `show(err)` to see complete types.

You should pass the same arguments to the reverse thunk as you passed to the augmented primal, plus the shadow of the return if the return is Active, plus the tape.

In other words you should not pass y and dy.
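
For your example that would look roughly like:

# same arguments as the augmented forward call, then the tape;
# no y or dy here, because the return is Duplicated rather than Active
rev(Const(g), Duplicated(x, dx), tape)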

As for Julia 1.11, it is not presently supported in Enzyme.


As for your example, set shadow_y to your desired cotangent.

I don’t understand: where do I plug in dy in order to compute the VJP $\partial g(x)^\top (\mathrm{d}y)$?

shadow_y is an output, not an input, right? I never plug it back in

If it is needed to compute the reverse pass, it will be captured (by reference) in the tape.

Okay so where do I plug it in? Can you show me the code on this simple example? I’m really struggling to guess here ^^

I got it, I need to do

shadow_y .= dy
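
Putting the pieces together, the whole VJP then reads roughly like this (same g, x, dx, dy as above):

using Enzyme

g(x) = abs2.(x)
x  = [2.0, 3.0]
dx = zero(x)     # gradient accumulator (shadow of x)
dy = [5.0, 7.0]  # desired cotangent

forw, rev = autodiff_thunk(ReverseSplitWithPrimal, Const{typeof(g)}, Duplicated, Duplicated{typeof(x)})
tape, y, shadow_y = forw(Const(g), Duplicated(x, dx))

shadow_y .= dy                          # seed the cotangent in place
rev(Const(g), Duplicated(x, dx), tape)  # same args as the forward call, plus the tape

dx  # the VJP 2 .* x .* dy == [20.0, 42.0]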

But what happens if shadow_y cannot be mutated in place?

So reverse mode should only use Duplicated for mutable memory. For immutable data you should use Active for the return (and then you pass that cotangent into the reverse-pass function).
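
For a scalar return that looks roughly like this (f here is just a toy example of mine, so take it as a sketch):

using Enzyme

f(x) = sum(abs2, x)  # scalar (immutable) return
x  = [2.0, 3.0]
dx = zero(x)

forw, rev = autodiff_thunk(ReverseSplitWithPrimal, Const{typeof(f)}, Active, Duplicated{typeof(x)})
tape, y, _ = forw(Const(f), Duplicated(x, dx))

# the cotangent of the Active return is passed as an extra argument before the tape,
# instead of seeding a shadow buffer
rev(Const(f), Duplicated(x, dx), 1.0, tape)

dx  # ≈ 2 .* x == [4.0, 6.0]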

What about immutable data that has mutable fields, like Tuple{Vector,Vector}?

You can do shadow_y[1] .= dy1 and shadow_y[2] .= dy2 on the outside, just like the original vector case. This would still use Duplicated (sorry, I should have been more specific: if the derivative data is immutable at the top level, i.e. lives in a register, you should use Active; in this case the inner differentiable data sits in a mutable container, so you should still use Duplicated — here a vector, even if it is behind a tuple).
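
Roughly like this, with a toy h returning a tuple of vectors (again a sketch):

using Enzyme

# the tuple itself is immutable, but its fields are mutable vectors,
# so the return stays Duplicated and each field of the shadow is seeded in place
h(x) = (abs2.(x), 2 .* x)
x   = [2.0, 3.0]
dx  = zero(x)
dy1 = [5.0, 7.0]
dy2 = [1.0, 1.0]

forw, rev = autodiff_thunk(ReverseSplitWithPrimal, Const{typeof(h)}, Duplicated, Duplicated{typeof(x)})
tape, y, shadow_y = forw(Const(h), Duplicated(x, dx))

shadow_y[1] .= dy1
shadow_y[2] .= dy2
rev(Const(h), Duplicated(x, dx), tape)

dx  # accumulates 2 .* x .* dy1 .+ 2 .* dy2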


I’m tempted to use the autodiff_thunk option everywhere and put the part

forw, rev = autodiff_thunk(ReverseSplitWithPrimal, Const{typeof(g)}, Duplicated, Duplicated{typeof(x)})

in a preparation step, so that it only runs once when I want to compute several gradients in a row.
Is that reasonable? Does autodiff call autodiff_thunk like that under the hood? Or is there a conceptual difference?

No, they are not equivalent, at least from a performance standpoint. The reason is that if you know you’re computing the forward and reverse pass together in one go, Enzyme can perform many more optimizations.

Within autodiff there is a compilation cache, so asking for the same function and activities multiple times won’t incur additional cost.
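
So something like this only pays the compilation cost on the first iteration (sketch):

using Enzyme

f(x) = sum(abs2, x)

for _ in 1:100
    x  = rand(3)
    dx = zero(x)
    autodiff(Reverse, f, Active, Duplicated(x, dx))  # compiled once, cached afterwards
end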

We could also make a thunk for combined mode (currently thunk only supports split mode), if desired.

Okay, I thought the thunk was specific to split mode; it’s helpful to know that it could exist in combined mode too.

Does the output of autodiff_thunk (before even running the forward pass) save some time if we reuse it for different inputs, or do you think it would be negligible?
Because for different inputs we cannot reuse the forward pass of course.

Hey @wsmoses sorry for pinging again, I was just curious about the purpose of the thunk mechanism.
One obvious application would be computing a Jacobian, where you only do one forward sweep and then as many pullbacks as there are output dimensions. But if I call the thunked pullback more than once, will the answer still be correct? Or would one reverse sweep somehow alter the tape, so that the pullback closure no longer works? I’m asking specifically because that’s the situation in Tapir.jl, where each reverse sweep must be directly preceded by a forward sweep.

See also Split reverse mode for Tapir · Issue #115 · withbayes/Tapir.jl · GitHub

The thunks are stateless and can be called as many times as you like.

However, the extra “tape” (really a value cache) for the reverse pass may be different, depending on the function (e.g. if a shadow pointer is captured/overwritten/etc.).

If certain properties hold (or you swap out all uses of the captured shadow with your new shadow), you can do that.

Of course, use Enzyme’s BatchDuplicated etc. to perform as many reverse passes in one reverse sweep as you want (thunk or no thunk).
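
For the g from earlier in the thread, the one-forward-sweep, many-reverse-sweep Jacobian pattern would look roughly like the sketch below; it reuses the shadow_y buffer captured in the tape, so it is only valid when the caveats above hold for your function:

using Enzyme

g(x) = abs2.(x)
x  = [2.0, 3.0]
dx = zero(x)

forw, rev = autodiff_thunk(ReverseSplitWithPrimal, Const{typeof(g)}, Duplicated, Duplicated{typeof(x)})
tape, y, shadow_y = forw(Const(g), Duplicated(x, dx))

J = zeros(length(y), length(x))
for i in eachindex(y)
    dx .= 0            # reset the gradient accumulator for this row
    shadow_y .= 0
    shadow_y[i] = 1.0  # seed the i-th cotangent
    rev(Const(g), Duplicated(x, dx), tape)
    J[i, :] .= dx
end

J  # for this g, a diagonal matrix with 2 .* x on the diagonal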
