What are `autodiff_deferred` and `autodiff_thunk` for in Enzyme?

gdalle · March 25, 2024, 7:26am

I think I understand the basic autodiff function (see this post) but it has three variants:

autodiff_deferred
autodiff_thunk (which seems broken on 1.11)
autodiff_deferred_thunk

Is there someone who can explain to me:

why “deferring” computations is useful for GPU or higher-order?
in which cases I should use the “thunk” version?
whether there are performance differences?

vchuravy · March 25, 2024, 12:16pm

The autodiff_deferred variants are for higher-order or GPU use cases. They defer compilation until we actually call things. This allows us to detect them when compiling for the GPU, or when we see a call to them within a compilation request for autodiff.

Long-term I think autodiff_deferred should just be the normal interface.

The thunk variants don’t immediately call the augmented primal or reverse function, but instead return callable objects (thunks) that you can use to perform the computation.

gdalle · March 25, 2024, 12:57pm

Okay thanks! Could you help me understand how to make it work with a vector output? The example in the docs is a scalar output, and I don’t understand how I’m supposed to pass the cotangent:

julia> using Enzyme

julia> g(x) = abs2.(x)
g (generic function with 1 method)

julia> x = [2.0, 3.0];

julia> dx = zero(x);

julia> dy = [5.0, 7.0];

julia> forw, rev = autodiff_thunk(ReverseSplitWithPrimal, Const{typeof(g)}, Duplicated, Duplicated{typeof(x)})
(Enzyme.Compiler.AugmentedForwardThunk{Ptr{Nothing}, Const{typeof(g)}, Duplicated{Vector{Float64}}, Tuple{Duplicated{Vector{Float64}}}, Val{1}, Val{true}(), @NamedTuple{1, 2, 3, 4, 5::Bool, 6::Bool, 7::Core.LLVMPtr{Float64, 0}, 8::Core.LLVMPtr{Float64, 0}, 9::Core.LLVMPtr{Float64, 0}, 10::Core.LLVMPtr{Float64, 0}, 11::Core.LLVMPtr{Float64, 0}, 12::Core.LLVMPtr{Float64, 0}, 13::Core.LLVMPtr{Float64, 0}, 14::Core.LLVMPtr{Float64, 0}, 15::Core.LLVMPtr{Float64, 0}}}(Ptr{Nothing} @0x00007871c8880450), Enzyme.Compiler.AdjointThunk{Ptr{Nothing}, Const{typeof(g)}, Duplicated{Vector{Float64}}, Tuple{Duplicated{Vector{Float64}}}, Val{1}, @NamedTuple{1, 2, 3, 4, 5::Bool, 6::Bool, 7::Core.LLVMPtr{Float64, 0}, 8::Core.LLVMPtr{Float64, 0}, 9::Core.LLVMPtr{Float64, 0}, 10::Core.LLVMPtr{Float64, 0}, 11::Core.LLVMPtr{Float64, 0}, 12::Core.LLVMPtr{Float64, 0}, 13::Core.LLVMPtr{Float64, 0}, 14::Core.LLVMPtr{Float64, 0}, 15::Core.LLVMPtr{Float64, 0}}}(Ptr{Nothing} @0x00007871c8880a40))

julia> tape, y, shadow_y = forw(Const(g), Duplicated(x, dx))
(var"1" = @NamedTuple{1, 2, 3, 4, 5::Bool, 6::Bool, 7::Core.LLVMPtr{Float64, 0}, 8::Core.LLVMPtr{Float64, 0}, 9::Core.LLVMPtr{Float64, 0}, 10::Core.LLVMPtr{Float64, 0}, 11::Core.LLVMPtr{Float64, 0}, 12::Core.LLVMPtr{Float64, 0}, 13::Core.LLVMPtr{Float64, 0}, 14::Core.LLVMPtr{Float64, 0}, 15::Core.LLVMPtr{Float64, 0}}(([0.0, 0.0], [4.0, 9.0], nothing, nothing, false, false, Core.LLVMPtr{Float64, 0}(0x000078715e006610), Core.LLVMPtr{Float64, 0}(0x7ffffffffffffffe), Core.LLVMPtr{Float64, 0}(0x000078715e0065e0), Core.LLVMPtr{Float64, 0}(0x000078715e181730), Core.LLVMPtr{Float64, 0}(0x000078725cb10311), Core.LLVMPtr{Float64, 0}(0x000078724ca32f60), Core.LLVMPtr{Float64, 0}(0x000078725ca6bee0), Core.LLVMPtr{Float64, 0}(0x000078724ca32dc0), Core.LLVMPtr{Float64, 0}(0x0000000009900c00))), var"2" = [4.0, 9.0], var"3" = [0.0, 0.0])

julia> y
2-element Vector{Float64}:
 4.0
 9.0

julia> rev(Const(g), Duplicated(x, dx), Duplicated(y, dy), tape)
ERROR: AssertionError: length(argtypes) + needs_tape == length(argexprs)
Stacktrace:
     ⋮ internal @ Enzyme.Compiler, GPUCompiler, Core, Unknown
 [6] (::Enzyme.Compiler.AdjointThunk{Ptr{…}, Const{…}, Duplicated{…}, Tuple{…}, Val{…}, @NamedTuple{…}})(::Const{typeof(g)}, ::Duplicated{Vector{…}}, ::Vararg{Any})
   @ Enzyme.Compiler ~/.julia/packages/Enzyme/l4FS0/src/compiler.jl:5004
Use `err` to retrieve the full stack trace.
Some type information was truncated. Use `show(err)` to see complete types.

wsmoses · March 25, 2024, 1:20pm

You should pass the reverse thunk the same args as the forward (augmented primal) thunk, plus the shadow return if the return is Active, plus the tape.

In other words you should not pass y and dy.

As for Julia 1.11, it is not presently supported in Enzyme.

wsmoses · March 25, 2024, 1:20pm

As for your example, set shadow_y to your desired cotangent.

gdalle · March 25, 2024, 2:44pm

I don’t understand: where do I plug in dy in order to compute the VJP $\partial g(x)^\top (\mathrm{d}y)$?
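
For reference, the VJP for this particular g can be written out by hand (assuming real inputs, so that abs2 is just squaring):

$$
g(x) = \begin{pmatrix} x_1^2 \\ x_2^2 \end{pmatrix}, \qquad
\partial g(x) = \begin{pmatrix} 2x_1 & 0 \\ 0 & 2x_2 \end{pmatrix}, \qquad
\partial g(x)^\top \mathrm{d}y = \begin{pmatrix} 2 x_1 \, \mathrm{d}y_1 \\ 2 x_2 \, \mathrm{d}y_2 \end{pmatrix}
$$

With x = (2, 3) and dy = (5, 7) this gives (20, 42), which is the value dx should hold once the cotangent has been pulled back.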

gdalle · March 25, 2024, 2:51pm

shadow_y is an output, not an input, right? I never plug it back in

wsmoses · March 25, 2024, 3:37pm

If it is needed to compute the reverse pass it will be captured (by reference) in the tape

gdalle · March 25, 2024, 3:51pm

Okay so where do I plug it in? Can you show me the code on this simple example? I’m really struggling to guess here ^^

gdalle · March 25, 2024, 4:49pm

I got it, I need to do

shadow_y .= dy
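
Putting the pieces of the thread together, the whole computation might look like the sketch below. It assumes the Enzyme version used in this thread (the thunk API may differ in other releases) and reuses g, x, dx and dy from the example above.

```julia
using Enzyme

g(x) = abs2.(x)

x  = [2.0, 3.0]
dx = zero(x)       # shadow of the input; the VJP is accumulated here
dy = [5.0, 7.0]    # cotangent to pull back

# Compile the split forward/reverse thunks (same call as earlier in the thread).
forw, rev = autodiff_thunk(ReverseSplitWithPrimal, Const{typeof(g)},
                           Duplicated, Duplicated{typeof(x)})

# Forward sweep: returns the tape, the primal output and the output's shadow.
tape, y, shadow_y = forw(Const(g), Duplicated(x, dx))

# Seed the output shadow in place with the desired cotangent.
shadow_y .= dy

# Reverse sweep: same arguments as the forward call, plus the tape (no y, no dy).
rev(Const(g), Duplicated(x, dx), tape)

dx   # mathematically, the VJP here is 2 .* x .* dy
```

The in-place assignment matters: the tape captures shadow_y by reference, so rebinding the name (shadow_y = dy) would not be seen by the reverse pass.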

gdalle · March 25, 2024, 4:57pm

But what happens if shadow_y cannot be mutated in place?

wsmoses · March 25, 2024, 5:51pm

In reverse mode, you should only use Duplicated for mutable memory. For immutable data you should use Active for the return (and then you pass the shadow return into the reverse pass function).
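
A minimal sketch of the Active-return case, assuming the same Enzyme version as above. The scalar-valued function f is a hypothetical example, not taken from the thread; the point is only where the cotangent goes in the reverse call.

```julia
using Enzyme

f(x) = sum(abs2, x)   # hypothetical scalar-valued function (immutable Float64 return)

x  = [2.0, 3.0]
dx = zero(x)          # shadow of the input; the gradient is accumulated here

# The return annotation is Active because the output is an immutable scalar.
forw, rev = autodiff_thunk(ReverseSplitWithPrimal, Const{typeof(f)},
                           Active, Duplicated{typeof(x)})

# Forward sweep: there is no mutable output shadow to fill in this case.
tape, y, shadow_y = forw(Const(f), Duplicated(x, dx))

# Reverse sweep: the cotangent (here 1.0) is passed explicitly, just before the tape.
rev(Const(f), Duplicated(x, dx), 1.0, tape)

dx   # mathematically, the gradient of sum(abs2, x) is 2 .* x
```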

gdalle · March 25, 2024, 6:49pm

What about immutable data that has mutable fields, like Tuple{Vector,Vector}?

wsmoses · March 25, 2024, 6:56pm

You can do shadow_y[1] .= dy and shadow_y[2] .= dy on the outside, just like the original vector case. This would still use Duplicated. Sorry, I should have been more specific: if the derivative data is immutable at the top level (a register value), you should use Active. In this case the inner differentiable data is in a mutable construct (a vector, even if behind a tuple), so you should still use Duplicated.

gdalle · March 26, 2024, 1:40pm

I’m tempted to use the autodiff_thunk option everywhere and put the part

forw, rev = autodiff_thunk(ReverseSplitWithPrimal, Const{typeof(g)}, Duplicated, Duplicated{typeof(x)})

in a preparation step, so that it only runs once when I want to compute several gradients in a row.
Is that reasonable? Does autodiff call autodiff_thunk like that under the hood? Or is there a conceptual difference?

wsmoses · March 26, 2024, 4:05pm

No, they are not equivalent, at least from a performance standpoint. The reason is that if you know you’re computing the forward and reverse pass together in one go, Enzyme can do a lot more performance optimizations.

Within autodiff there is a compilation cache, so asking for the same function and activities multiple times won’t incur additional compilation cost.
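
As an illustration, computing several gradients in a row with plain combined-mode autodiff could look like the sketch below. The function f and the inputs are hypothetical; the only assumption about Enzyme is the usual autodiff(Reverse, f, Active, Duplicated(x, dx)) call.

```julia
using Enzyme

f(x) = sum(abs2, x)   # hypothetical example function

inputs = ([2.0, 3.0], [1.0, 4.0])

# Each call has the same function and activities, so after the first call
# the compiled derivative should come from the compilation cache.
grads = map(inputs) do x
    dx = zero(x)                                     # fresh shadow for each input
    autodiff(Reverse, f, Active, Duplicated(x, dx))  # forward and reverse in one go
    dx
end
```

No separate preparation step is written here: the first call pays for compilation and later calls with the same signature reuse it.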

wsmoses · March 26, 2024, 4:06pm

We could also make a thunk for combined mode (currently thunk only supports split mode), if desired.

gdalle · March 26, 2024, 4:47pm

Okay, I thought the thunk was specific to split mode; it is helpful to know that it could exist in combined mode too.

Does the output of autodiff_thunk (before even running the forward pass) save some time if we reuse it for different inputs, or do you think it would be negligible?
Because for different inputs we cannot reuse the forward pass of course.

gdalle · April 5, 2024, 6:04pm

Hey @wsmoses sorry for pinging again, I was just curious about the purpose of the thunk mechanism.
One obvious application would be computing a Jacobian, where you only do one forward sweep and then as many pullbacks as there are output dimensions. But if I call the thunk-ed pullback more than once, will the answer still be correct? Or would one reverse sweep somehow alter the tape, so that the pullback closure no longer works? I’m asking specifically because that’s the situation in Tapir.jl, where each reverse sweep must be directly preceded by a forward sweep.

See also Split reverse mode for Tapir · Issue #115 · withbayes/Tapir.jl · GitHub

wsmoses · April 5, 2024, 7:02pm

The thunks are stateless and can be called as many times as you like.

However, the extra “tape” (really value cache) for the reverse pass may be different, depending on the function (e.g. if a shadow pointer is captured/overwritten/etc).

If certain properties hold (or you swap out all uses of the captured shadow with your new shadow), you can reuse the same tape for several reverse calls.

Of course, you can use Enzyme’s BatchDuplicated etc. to perform as many reverse passes in one reverse sweep as you want (thunk or no thunk).
