AD pipeline and Hessian–vector products

To be clear, this isn’t a question about the state of AD in Julia. That has been covered quite well in a variety of sources. Instead I have the following question: what is the state of the overall AD pipeline in Julia?

As I see it, the AD frameworks are only a portion of this. There are a bunch of different packages to tackle different approaches to AD or to perform AD in a variety of specific scenarios.

One level below this seems to be ChainRules, which defines a common set of simple base level derivatives. Do all Julia AD packages rely on ChainRules? If not, would that be the ideal scenario, i.e. one core set of rules?

Above the AD frameworks there seems to be a growing number of interface packages. AbstractDifferentiation.jl and DifferentiationInterface.jl are the two main ones I am aware of. It seems that the latter has replaced (or will replace) the former, but that is a bit unclear. These seem to use ADTypes.jl as a way to integrate with a desired backend AD framework.

Related to this are SparseDiffTools.jl and AutoDiffOperators.jl, which are higher level tools/operators for working with AD.

To turn this into a question, is the following the correct AD pipeline?
ChainRules → AD Framework (e.g. Enzyme) → ADTypes → DifferentiationInterface → High-level tools (e.g. AutoDiffOperators)

My interest here comes from the optimization space, where some of my research requires fast matrix-free Hessian-vector product operators. Given a twice differentiable function f:\mathbb{R}^n\to\mathbb{R}, I would like to construct an operator H_x:\mathbb{R}^n\to\mathbb{R}^n which maps v\to\nabla^2f(x)v. Beyond performance for a single operation, this would ideally be efficient for multiple v and updating x.

I have implemented such an operator using a variety of AD backends, but I certainly don’t think they are the most performant. Further, this seems like something that should be implemented well in a single location so as to be the most reusable across the Julia ecosystem. Does this exist, and if not, where should it exist?


I am not sure what the protocol is on calling people out, but @gdalle I thought you might have a bit to say on this topic, especially surrounding the higher level interfaces and operators.


Nested AD (higher derivatives) like this can be a tricky subject, especially for large n where efficiency is important, but I think at this point there may be good solutions for the problem you are interested in.

If I understand your notation correctly, what you want is the following Hessian–vector product, which is equivalent to a directional derivative of \nabla f, and can be efficiently implemented by forward-over-reverse, i.e.

\underbrace{\frac{\partial^2 f}{\partial x^2}}_\mathrm{Hessian} v= \left. \frac{d}{d\alpha} \left( \left. \nabla f \right|_{x+\alpha v} \right) \right|_{\alpha = 0}

where (for large n) the \nabla f is implemented with reverse-mode (e.g. via Enzyme.jl or Zygote.jl) and the scalar derivative d/d\alpha is implemented with forward mode (e.g. ForwardDiff.jl). In this way, you avoid ever explicitly computing the Hessian matrix, and the computational cost should be proportional to the cost of computing f(x) only once.
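A quick stdlib-only sanity check of this identity, with no AD packages involved: the quadratic below has a closed-form gradient, so the directional derivative of \nabla f can be approximated by central finite differences in \alpha and compared against the exact Hessian–vector product.

```julia
using LinearAlgebra

# f(x) = ½ xᵀA x with symmetric A, so ∇f(x) = A*x and the Hessian is A everywhere
A = [4.0 1.0; 1.0 3.0]
f(x) = 0.5 * dot(x, A * x)
∇f(x) = A * x

x = [1.0, 2.0]
v = [0.5, -1.0]

# d/dα ∇f(x + α v) at α = 0, via central differences
ε = 1e-6
Hv = (∇f(x + ε*v) - ∇f(x - ε*v)) / (2ε)

@assert Hv ≈ A * v   # matches the Hessian–vector product
```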

See e.g. the discussion and example here of a closely related problem: Nested AD with Lux etc - #12 by stevengj

For example, this works fine for me:

using LinearAlgebra
f(x) = norm(x)^3 * x[end]  + x[1]^2 # example ℝⁿ→ℝ function

import Zygote, ForwardDiff

# compute (∂²f/∂x²)v at x
function Hₓ(x, v)
    ∇f(y) = Zygote.gradient(f, y)[1]
    return ForwardDiff.derivative(α -> ∇f(x + α*v), 0)
end

I couldn’t get the analogous thing to work with Enzyme, but maybe @wsmoses knows the trick.

Yeah, it seems like DifferentiationInterface.jl should include some kind of Hessian–vector product interface. It needs to be in a higher-level package like that one because it may involve combining multiple AD packages as I did above, and the implementation can be rather non-obvious (especially because not all AD combinations support nesting).


Thank you for your question, I’ll do my best to answer point by point!
Note that @hill and I are also giving a talk at JuliaCon where we hope to cover some of this stuff.

No they don’t:

  • ForwardDiff.jl has its own set of Dual number overloads
  • Enzyme.jl has its own set of rules
  • Tapir.jl has its own set of rules

There are also bridges between the various rule sets.

The main limitation with ChainRules.jl at the moment is the lack of support for mutation. If you want to fix that, the project is looking for a new maintainer! And by the way, thank you @oxinabox for all you’ve done, you’ve been an inspiration to me and many others :heart:

Disclaimer: I’m the lead developer of DifferentiationInterface.jl, shortened as DI from now on.
My hope is that DI should become the new standard, at least for single-argument functions x -> y that eat and spit arrays or numbers. Within these limits, DI is more efficient, better tested and has broader backend coverage than AbstractDifferentiation.jl.
It is still unclear what should happen with multiple-argument functions (xa, xb, xc) -> y. The original plan we hashed out with @mohamed82008 was for AbstractDifferentiation.jl to take over this aspect as a wrapper around DI, for the few backends that support it. However, several people have been asking for multi-arg functionality in DI itself, especially since the SciML ecosystem has started to use my package. Since I have a bit more dev time available than the team around AbstractDifferentiation.jl, perhaps we could re-discuss that arrangement?

DI uses ADTypes.jl, which has become the standard in the SciML and Turing ecosystems. AbstractDifferentiation.jl still uses its own structs because it predates ADTypes.jl, but that’s rather easy to fix.

AutoDiffOperators.jl is a way to turn functional syntax into matrix multiplication syntax. It is still based on AbstractDifferentiation.jl, but its maintainer wants to switch to DI in the near future.

SparseDiffTools.jl automates the computation of sparse Jacobians and Hessians, but only for a few specific backends (mainly ForwardDiff.jl and FiniteDiff.jl). Our main project with @hill these past few months has been to design a better, generic sparse AD pipeline to replace it, and we’re nearly there:

  • SparseConnectivityTracer.jl (Adrian’s baby) is a very lightweight sparsity pattern detector, much faster than Symbolics.jl in our benchmarks on the OptimizationProblems.jl suite.
  • SparseMatrixColorings.jl takes care of matrix coloring and decompression, which are needed to reduce the number of matrix-vector products in sparse AD. We have preliminary benchmarks against the C++ reference in the field, called ColPack (still WIP with @amontoison because the Julia interface ColPack.jl is a bit buggy). Those comparisons suggest that we’re nearly always faster than ColPack.jl, although we don’t offer quite as many options as the underlying C++ library.
  • DI includes a backend-agnostic mechanism for computing sparse Jacobians and Hessians, given a sparsity pattern and a coloring. I released the last necessary feature yesterday morning and never tried to profile the code, so there are still large margins of improvements.

Despite the fact that all of this was developed by only 2 people in the last 6 months, the first benchmarks on the famous Brusselator are quite promising. We get 100x speedup on sparsity detection and 10x speedup on coloring compared to the time-honored SparseDiffTools.jl pipeline. That explains why the JuliaSmoothOptimizers organization now uses SparseConnectivityTracer.jl instead of Symbolics.jl.
The only disappointing aspect is a 1.5x slowdown on the Jacobian evaluation itself, which is the crucial part. But let me stress this again: the whole code is backend-agnostic and was literally released this week, so give me a few more days :wink:

Yes, but I would include bigger stuff like Optimization.jl and DifferentialEquations.jl with the high-level tools. Indeed, @ChrisRackauckas and @Vaibhavdixit02 have been very supportive of DI from day 1, and they’re planning to include it in much of the SciML ecosystem within the coming months. If anyone wants to hack on this at JuliaCon, hit me up!

Of course, another important high-level tool is deep learning packages. Using DI in Lux.jl and especially Flux.jl is a bit more tricky because they need to differentiate wrt arbitrary structs (and have varying notions of what a “gradient” in this case means, e.g. whether scalar fields should be included or not). Turing.jl is another worthy challenge, for which I still need to fix a few performance issues with DI’s handling of Enzyme.jl.

My friend, you’re in luck! One of the perks of DI is the ability to combine different backends to perform second-order AD. As you’ve pointed out, such functionality has to live outside of the individual AD packages, which is why I implemented it there.

Two different backends can be put together inside a SecondOrder struct, which you can then feed to the hvp operator. This does exactly what you want, and it adapts to the combination of backends you give it (although the best choice remains forward-over-reverse, as @stevengj pointed out).
Two caveats:

  • Not every pair of backends will work, and I would advise sticking to well-known combinations like ForwardDiff.jl over Zygote.jl. However I managed to get second-order Enzyme.jl working as well, using deferred autodiff.
  • Second-order AD will be less efficient than first-order AD. One of my main pain points is that closures make it harder to “prepare” (tape, cache, etc.) the inner differentiation part. See this issue if you have any ideas to unblock me.
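To give a flavor of the intended usage, here is a sketch against a mid-2024 version of DI. Treat it as pseudocode: the exact signatures of `prepare_hvp` and `hvp` may differ in the version you install, so check the current docs.

```julia
using DifferentiationInterface
import ForwardDiff, Zygote

f(x) = sum(abs2, x) * x[end]

# forward-over-reverse: outer ForwardDiff.jl, inner Zygote.jl
backend = SecondOrder(AutoForwardDiff(), AutoZygote())

x, v = rand(5), rand(5)

extras = prepare_hvp(f, backend, x)   # tape/cache what can be reused across calls
Hv = hvp(f, backend, x, v, extras)    # ∇²f(x) * v, without materializing the Hessian
```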

Do you mean several v's at once, or in a sequence during an iterative linear solve? If you have several v's at once, there’s a threshold beyond which it might be worth constructing the whole Hessian matrix, especially if it’s sparse?

Sorry for the wall of text, and don’t hesitate to reach out via DM or on GitHub if you need help using DI. It’s all still rather experimental, so I’m not claiming performance will beat the existing code everywhere, especially because most existing code is customized for a specific AD backend. However, I do believe that:

  • If performance is not up to your standards, most of the time it can be solved with a better implementation. I’m not an expert in every single backend, so open issues or PRs and help me out! In my view, for single-argument functions, the current design of DI does not limit performance in any meaningful way.
  • In some cases, a little performance drop is acceptable to reduce the huge amount of duplication and maintenance troubles. At the moment, tons of packages have extensions handling Zygote.jl, ForwardDiff.jl, Enzyme.jl and friends, which contain essentially the same code. My goal with DI was to write such code only once, test and optimize the hell out of it, then let everyone use it. Time will tell if it worked!

Thanks @stevengj for the response. I maybe should have been a bit more clear in that I have implemented a variety of Hvp operators for Newton-type optimization algorithms in SFN.jl. My ForwardDiff/Zygote implementation follows a similar approach to the one you provided.

The problem is, as you mentioned, the implementation can be non-obvious and varies considerably based on the desired AD backends. Even for the operators I have that do work, it is unclear if they are optimized properly. Some, like the Enzyme-based operator, work most of the time but fail under some strange circumstances.

All that is to say I could greatly benefit from a centralized and optimized set of more complex AD tools like this, and I think there are others who would feel the same.

Thanks for the great and thorough response @gdalle!

A quick follow up regarding ChainRules. Is the lack of mutation support in ChainRules what is preventing widespread adoption, or is it likely that AD backends would continue to implement their own rules? It seems that a centralized set of rules would be very beneficial to future AD development.

When I say multiple v's in discussing Hessian-vector product operators, both several at once (i.e. a Hessian-matrix product) and multiple in sequence are important. Ideally, in my application, the number of Hvp operations for a fixed x would be much less than n, where the Hessian is n\times n. We are also targeting very large n, so a matrix-free operator is desirable. Sparsity is another matter.
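For concreteness, the “several at once” case is a Hessian–matrix product HV, which any matrix-free Hv operator can serve column by column. A stdlib-only sketch, using a quadratic whose Hessian is known in closed form as a stand-in for an AD-based operator:

```julia
using LinearAlgebra

A = Symmetric([4.0 1.0 0.0; 1.0 3.0 1.0; 0.0 1.0 2.0])
f(x) = 0.5 * dot(x, A * x)   # Hessian of f is A everywhere
hvp(x, v) = A * v            # stand-in for an AD-based matrix-free Hv operator

x = ones(3)
V = [1.0 0.0; 0.0 1.0; 0.0 0.0]   # two directions at once
HV = reduce(hcat, (hvp(x, V[:, j]) for j in axes(V, 2)))

@assert HV == A * V   # column-by-column Hvp reproduces the Hessian–matrix product
```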

The other aspect is how can we update the operator “in-place” so to speak. I.e. given a new x, how much of the operator can we reuse.

A non-optimal Hvp operator, say from a few extra allocations when updating x, may not be the primary performance bottleneck, but similar to your last point, it would be great to have centralized code that can be heavily tested and optimized.

Thanks again!

I’m going to chime in and offer a different perspective on this. Enzyme lead, so obvious biases and will use Enzyme for some examples since I have more experience with that.

I think the overall workflow is similar to what you describe, with some key differences

User Code/Applications → Derivative Rules → AD Framework → AD-based algorithms/operators

  1. First we have user code/applications, which is the code you want to apply some derivative-based algorithm to. Perhaps a neural network model, or a scientific code. Depending on the restrictions of the downstream tools, your code will need to have certain properties. For example, some code isn’t legal to write in Julia due to the Julia language / compiler (e.g. unsafe GC, Python syntax, etc). More commonly, you’ll see restrictions from the other parts of the workflow. If Zygote is your AD framework, you can’t use mutation. Similarly, if ChainRules is a rule provider, you cannot change the values of any arrays between the forward pass and the pullback – even if the tool supports mutation [more on this later]! And similarly, perhaps your AD-based algorithm assumes all values are positive or real, so your code needs to have this property as well.

  2. Derivative rules. The most widely known set in Julia is ChainRules. It’s a great project by some great people, but it has its limitations. Many AD tools, like Enzyme, Zygote, etc. have their own rules systems as well, which better adapt to the needs of the AD tool. This allows one to define more efficient operations and guarantee that the rules conform with the language semantics. Many tools (including Enzyme) support the ability to import rule definitions from ChainRules, instead of using their internal definitions, or its EnzymeRules interface (see Enzyme.jl/src/Enzyme.jl at a21e60ddbdab383f7caba147514afc95a8bdb150 · EnzymeAD/Enzyme.jl · GitHub).

So why don’t we just use ChainRules for everything?

Part of it is mutation like you say, but it is somewhat deeper. Enzyme also supports the ability to avoid unnecessary differentiation, if a value is not requested by the user, or otherwise can be proven to be zero. By default many chain rules will densely compute all values (making code take significantly longer).

Moreover, use of ChainRules within Enzyme (or another mutation-aware AD tool) may often result in incorrect code, even if the function the rule is defined on is mutation free.

For example, consider the following function and custom rule (I haven’t run this, so consider it pseudocode).

using ChainRules

sumsq(x) = sum(x .* x)

function ChainRules.rrule(::typeof(sumsq), x)
    Y = sumsq(x)
    function sumsq_pullback(Ȳ)
        return ChainRules.NoTangent(), 2 .* x .* Ȳ
    end
    return Y, sumsq_pullback
end

_, pb = ChainRules.rrule(sumsq, [2.0, 3.0])
@show pb(1.0)
(ChainRulesCore.NoTangent(), [4.0, 6.0])

We compute the sum of squares, so naturally the derivative is 2 * x * dout, which is what our pullback returns. What happens, however, if we mutate x before running the pullback? For example:

function sumsqzero(x)
	res = sumsq(x)
	x .= 0
	return res
end

If we were to use ChainRules in a reverse-mode AD tool (it doesn’t matter which), then if it doesn’t crash you’ll get something similar to the following:

function sumsqzero(x, dout)
	res, sumsq_pb = ChainRules.rrule(sumsq, x)
	x .= 0
	grad_set_x_zero() # defined by AD tool
	return sumsq_pb(dout)
end

The value of x from our original program has been overwritten, so the pullback will be multiplying by zero instead. Since a lot of ChainRules-based ADs assume nothing can mutate (including outside of the chain rule itself), this isn’t a problem for them. But it does mean that most of the chain rules out there are actually wrong for mutation-aware tools. A fix for this case would be defining a rule like the following.

function ChainRules.rrule(::typeof(sumsq), x)
    Y = sumsq(x)
    copy_x = copy(x)
    function sumsq_pullback(Ȳ)
        return ChainRules.NoTangent(), 2 .* copy_x .* Ȳ
    end
    return Y, sumsq_pullback
end

Now even if x is overwritten by subsequent code, we’ll get the correct result!
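The failure mode here is ordinary closure capture, and it can be reproduced without any AD machinery at all (plain Julia, mirroring the two pullbacks above):

```julia
x = [2.0, 3.0]

pullback_alias(Ȳ) = 2 .* x .* Ȳ        # captures x by reference, like the first rule
copy_x = copy(x)
pullback_copy(Ȳ) = 2 .* copy_x .* Ȳ    # captures a snapshot, like the fixed rule

x .= 0  # subsequent code mutates x before the pullback runs

pullback_alias(1.0)  # [0.0, 0.0] — wrong, the gradient information is gone
pullback_copy(1.0)   # [4.0, 6.0] — correct
```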

Of course, if a lot of the rules defined by the system are wrong/would cause bugs for a tool because they have a different set of assumptions, it doesn’t make sense to import wrong things – especially if the rule system also lacks other features like activity (hence Enzyme preferring its own internal rules).

Of course we want to keep this gotcha away from users, so if you read the Enzyme docs for import_rrule you’ll see more information that lets us avoid this in some cases to make auto importing easy.

    import_rrule(::fn, tys...)

Automatically import a ChainRules.rrule as a custom reverse mode EnzymeRule. When called in batch mode, this
will end up calling the primal multiple times which results in slower code. This macro assumes that the underlying
function to be imported is read-only, and returns a Duplicated or Const object. This macro also assumes that the
inputs permit a .+= operation and that the output has a valid Enzyme.make_zero function defined. It also assumes
that overwritten(x) accurately describes if there is any non-preserved data from forward to reverse, not just
the outermost data structure being overwritten as provided by the specification.

Finally, this macro falls back to almost always caching all of the inputs, even if it may not be needed for the
derivative computation.

As a result, this auto importer is also likely to be slower than writing your own rule, and may also be slower
than not having a rule at all.

Use with caution.

Enzyme.@import_rrule(typeof(Base.sort), Any);
  3. AD Frameworks (like Zygote, Enzyme, ReverseDiff, ForwardDiff, etc). Like you say, these have been touched on elsewhere so I won’t discuss them much more. However, as a slightly biased update, we just finished adding significant improvements to handle most (though not all) type-unstable code, BLAS including some cuBLAS (but not LAPACK), nicer error messages, and more. Still a lot to go, but if you look at my GitHub history of multiple commits a day, including weekends, you’ll see that we’re slowly but steadily getting things easier to use and faster to run.

These tend to provide low level utilities (for example Enzyme.autodiff which handles arbitrary argument count, differentiating with respect to some but not other variables, etc), as well as high level utilities (like Enzyme.gradient and Enzyme.jacobian, or Zygote.gradient / ForwardDiff.gradient).

I’m also going to lump in the interface packages, like AbstractDifferentiation.jl and DifferentiationInterface.jl, in here. The AD packages all have slightly different conventions for how to call their high- and low-level APIs, so these packages intend to make it easy to try out one tool versus the other.

However, I will caution that this is not free. For example, every year there are numerous papers that mention trying to improve “abstraction without regret”, including some of my own, for example on automatically improving tensor arithmetic.

Fundamentally, any interface (be it for differentiation or something else) will need to decide what type of API to provide to users. A higher level API can make it easier to use (but may miss optimizations and thus result in slower code), or provide access to lower level utilities which give the users both the power and the burden of more effective usage. This disconnect is actually why most AD tools provide both (e.g. pushforward/pullback and gradient for ForwardDiff/ReverseDiff-like tools and autodiff / gradient / jacobian for Enzyme-like ones).

The interface packages have a definite purpose in the exploration of AD as well as making it easier to use, but I will offer a word of caution before blindly applying them. As an example, whereas in the past AbstractDifferentiation.jl was quite successful in exploring what AD tools one can use, it wasn’t adopted in some large applications due to issues with its Tuple handling. I can’t find the link to it offhand, but I remember a discussion where a big application wanted to switch to AbstractDifferentiation.jl, but when compared with using Zygote.jl directly, its tuple handling caused significant slowdowns, making machine learning (or whatever the domain was) unviable. DifferentiationInterface is a great package which is trying to lift some of the AbstractDifferentiation.jl limitations, but it similarly comes with its own set of limitations. For example, it (presently) assumes that inputs are either scalars/arrays (making things difficult for direct use in ML models), and does not have support for either functions with multiple arguments, or enabling/disabling differentiation with respect to some variables. All of these are serious (but potentially resolvable) limitations. For example, some code which could be efficiently differentiated with a tool directly might be significantly slower when using DI, or possibly not be possible to differentiate at all! Same with AbstractDifferentiation.jl. It’s a bit out of scope here, but Switch to DifferentiationInterface by gdalle · Pull Request #29 · tpapp/LogDensityProblemsAD.jl · GitHub describes some issues with the creation of closures (which may be created by some of the interface packages).

To be clear, both Guillaume and Adrian on the DI side are working diligently to identify which of these limitations can be lifted, but I think it’s critical that we don’t put the cart before the horse here and avoid a situation like what happened with Diffractor, which was expected to solve everybody’s problems (and then couldn’t fulfill those goals). In that regard, one should definitely follow both DifferentiationInterface.jl and AbstractDifferentiation.jl closely, but ultimately select the tool that matches their use case best (which can be changed as things develop). For example, core SciML libraries like SciMLSensitivity and Optimization.jl as well as ML libs are explicitly not using DI.jl yet because it doesn’t have support for multiple arguments / structured data, and it doesn’t make sense to risk a lack of support for user code / drop performance as a result. Other users like ContinuousNormalizingFlows.jl just take a gradient of a single-argument function which takes an array, which matches the DI abstraction perfectly!

The same can be said of Enzyme.jl as well. It supports a specific set of code whose features have been growing over time. Any time it comes up I try to caution what is expected to be supported, and similarly, as it has grown more feature support, it has been growing in adoption in places that make sense (for example, with recent BLAS and other support we’ve now been added as an AD backend within the ML libraries Flux/Lux). At its start it only had support for GPUCompiler-compatible code that looks like a kernel, and definitely nothing with a GC-able object, type instability, or more. It’s grown significantly over time, adding support for parallelism, forward mode, batching, GC, type instabilities, BLAS, and most recently cuBLAS.

  4. Derivative-based algorithms/optimizations. This is your standard backprop or Bayesian inference, etc. Plenty has been said about this elsewhere so I’ll leave it for now, but these similarly have their own set of limitations and expectations (often along the lines that it is legal to rerun code multiple times – perhaps preventing someone from using a function which reads from a file or increments a counter).

As a side note @stevengj and @RS-Coop, getting back to your original question of an HVP, since this seems sufficiently useful for folks, I went ahead and just tagged a release of Enzyme with a helper hvp function (as well as an in-place hvp! and even hvp_and_gradient!)

using Enzyme
using LinearAlgebra

f(x) = norm(x)^3 * x[end]  + x[1]^2 # example ℝⁿ→ℝ function

x = collect(Float64, 1:5)
v = collect(Float64, 6:10)
@show Enzyme.hvp(f, x, v)
# 5-element Vector{Float64}:
#  1164.881764812144
#  1749.5486430921133
#  2346.2155213720825
#  2942.882399652052
#  6431.86668789933

using Zygote, ForwardDiff
function Hₓ(x, v)
    ∇f(y) = Zygote.gradient(f, y)[1]
    return ForwardDiff.derivative(α -> ∇f(x + α*v), 0)
end

@show Hₓ(x, v)
# 5-element Vector{Float64}:
#  1164.881764812144
#  1749.5486430921133
#  2346.2155213720825
#  2942.8823996520514
#  6431.86668789933

If you look at the definition, it’s pretty simple and similar to @stevengj’s one from above (with some minor changes to try getting better perf). If you want a version which supports more arguments, or has a constant input, you’d just pass those to the corresponding autodiff calls.

@inline function gradient_deferred!(::ReverseMode, dx::X, f::F, x::X) where {X<:Array, F}
    autodiff_deferred(Reverse, f, Active, Duplicated(x, dx))
end

@inline function hvp!(res::X, f::F, x::X, v::X) where {F, X}
    grad = make_zero(x)
    Enzyme.autodiff(Forward, gradient_deferred!, Const(Reverse), DuplicatedNoNeed(grad, res), Const(f), Duplicated(x, v))
    return nothing
end

@inline function hvp(f::F, x::X, v::X) where {F, X}
    res = make_zero(x)
    hvp!(res, f, x, v)
    return res
end

And yeah if anything doesn’t work please open an issue and we’ll try to get it fixed quickly!

Of course it would be great to have a unified rule definition system. But even just designing a unified AD calling system is extremely complex, and rule definition is much harder. So I fear that’s something we may have to live with.

The DI version of hvp comes with a prepare_hvp function, where the outer differentiation step is taped/cached/whatever. That, combined with the in-place hvp! version, allows some savings in memory and runtime. But it is not yet perfect, and while I have some improvements in store, I still don’t have a good solution for preparation of the inner differentiation step.

Regarding non-array inputs, it’s not so much about support as it is about testing. At the moment, I feel confident guaranteeing array support because the DI test suite includes every combination of numbers, vectors and matrices as function inputs and outputs. We also test on other types like static and GPU arrays. As soon as we move to general structs, which are indeed useful for deep learning, the question becomes: can we even promise anything?
There are subtle differences between AD systems in the way nested structs are handled (when they are handled at all), and I trust the infinite imagination of Julia programmers to come up with edge cases that will break even the smartest of interfaces. Thus I don’t feel confident promising that “arbitrary structs can be differentiated”. A more realistic promise would be “arbitrary structs are allowed in principle for this specific list of backends, and the test suite includes a few tricky edge cases, as well as whatever Lux.jl / Flux.jl need to work well”. I think that’s something we could strive for.

While that is very true for Enzyme and more advanced AD systems, I would like to temper it a little bit for the old-school systems. Packages like ForwardDiff.jl, ReverseDiff.jl, FiniteDiff.jl and friends only support one array input anyway. And for these, DI’s generic preparation mechanism allows nearly optimal performance if the preparation result is reused. If you find performance isn’t up to par in such simple cases, it’s an implementation flaw, not a design flaw.

Awesome, thanks! I’ll make sure the DI bindings rely on that.

I agree, and as a result, what @hill and I are trying to do with DI is underpromise and (hopefully) overdeliver. If you use DifferentiationInterface.gradient on an input that is not an array, with some backends it might already work (Zygote and Enzyme for example). Why am I still saying that the interface is limited to arrays? Because I haven’t figured out how to reliably test generic structs yet. Once I have a good test suite for this part, you can trust me to announce it.

Thank you for recognizing that. I think it’s normal and probably healthy for a very ambitious project like DI to start simple and then grow in scope. A few months ago, we didn’t support second-order operators. A few weeks ago, our sparse AD utilities were still limited. Give me a few more weeks or months, and multi-argument support or activity specification are far from infeasible. The main limitation is not lack of will, it’s human hours devoted to this project, and I plan to devote a lot more of them still.

I don’t think that’s true.

For example, the following code fails with DI but could be differentiated with Enzyme directly.

julia> using Enzyme, DifferentiationInterface

julia> function f(x)
           sum(x[1]) * sum(x[2])
       end
f (generic function with 1 method)

julia> x = ([2.0, 3.0], [5.0, 7.0])
([2.0, 3.0], [5.0, 7.0])

julia> DifferentiationInterface.value_and_gradient(f, AutoEnzyme(Forward), x)

ERROR: MethodError: no method matching BatchDuplicated(::Tuple{Vector{Float64}, Vector{Float64}}, ::Tuple{Tuple{Float64, Float64}, Tuple{Float64, Float64}})

Closest candidates are:
  BatchDuplicated(::T1, ::Tuple{Vararg{T1, N}}) where {T1, N}
   @ EnzymeCore ~/.julia/packages/EnzymeCore/a2poZ/src/EnzymeCore.jl:106
  BatchDuplicated(::T1, ::Tuple{Vararg{T1, N}}, ::Bool) where {T1, N}
   @ EnzymeCore ~/.julia/packages/EnzymeCore/a2poZ/src/EnzymeCore.jl:106

 [1] (::Enzyme.var"#97#98"{Tuple{Tuple{Tuple{Float64, Float64}, Tuple{Float64, Float64}}}, typeof(f), Tuple{Vector{Float64}, Vector{Float64}}})(i::Int64)
   @ Enzyme ~/.julia/packages/Enzyme/qd8AI/src/Enzyme.jl:1104
 [2] ntuple
   @ ./ntuple.jl:19 [inlined]
 [3] #gradient#96
   @ ~/.julia/packages/Enzyme/qd8AI/src/Enzyme.jl:1103 [inlined]
 [4] gradient
   @ ~/.julia/packages/Enzyme/qd8AI/src/Enzyme.jl:1099 [inlined]
 [5] gradient(f::Function, backend::AutoEnzyme{ForwardMode{FFIABI}}, x::Tuple{Vector{Float64}, Vector{Float64}}, extras::DifferentiationInterfaceEnzymeExt.EnzymeForwardGradientExtras{2, Tuple{Tuple{Tuple{Float64, Float64}, Tuple{Float64, Float64}}}})
   @ DifferentiationInterfaceEnzymeExt ~/.julia/packages/DifferentiationInterface/ifUK5/ext/DifferentiationInterfaceEnzymeExt/forward_onearg.jl:73
 [6] value_and_gradient(f::typeof(f), backend::AutoEnzyme{ForwardMode{FFIABI}}, x::Tuple{Vector{Float64}, Vector{Float64}}, extras::DifferentiationInterfaceEnzymeExt.EnzymeForwardGradientExtras{2, Tuple{Tuple{Tuple{…}, Tuple{…}}}})
   @ DifferentiationInterfaceEnzymeExt ~/.julia/packages/DifferentiationInterface/ifUK5/ext/DifferentiationInterfaceEnzymeExt/forward_onearg.jl:80
 [7] value_and_gradient(f::typeof(f), backend::AutoEnzyme{ForwardMode{FFIABI}}, x::Tuple{Vector{Float64}, Vector{Float64}})
   @ DifferentiationInterface ~/.julia/packages/DifferentiationInterface/ifUK5/src/first_order/gradient.jl:68
 [8] top-level scope
   @ REPL[17]:1
Some type information was truncated. Use `show(err)` to see complete types.

julia> dx = ([Enzyme.autodiff(Forward, f, Duplicated(x, ([1.0, 0.0], [0.0, 0.0])))[1],
              Enzyme.autodiff(Forward, f, Duplicated(x, ([0.0, 1.0], [0.0, 0.0])))[1]],
             [Enzyme.autodiff(Forward, f, Duplicated(x, ([0.0, 0.0], [1.0, 0.0])))[1],
              Enzyme.autodiff(Forward, f, Duplicated(x, ([0.0, 0.0], [0.0, 1.0])))[1]])
([12.0, 12.0], [5.0, 5.0])

julia> @show dx
dx = ([12.0, 12.0], [5.0, 5.0])
([12.0, 12.0], [5.0, 5.0])

Presumably the reason this fails is that you’re just directly calling the syntactic-sugar Enzyme.gradient(Forward, …) wrapper we provide, which per its docstring does not support this:

    Enzyme.gradient(::ForwardMode, f, x; shadow=onehot(x))

Compute the gradient of an array-input function `f` using forward mode.

Of course, as shown above Enzyme can, in fact, differentiate this function when calling autodiff directly, but DI is not passing in the right setup to do so.


Regarding ChainRules, and rules in general, I think it is easy to ignore all the complexities of implementation. All you need to know is how to compute the derivative, right? @wsmoses and @gdalle, you both make good points regarding why it isn’t as simple as that. It does seem unfortunate to think of many a developer rewriting essentially the same rule. Maybe I am exaggerating the amount of effort there. Clearly work has been put into sharing rules, and it sounds like it’s not so simple.

Is ComponentArrays.jl not the solution to the non-standard (i.e. non-array) input for AD systems when using something like Lux.jl for deep learning? Perhaps it would be nice for DI to support all possible inputs someone could cook up, but that seems like a lot to ask when you have an ideal solution (imo) with ComponentArrays.jl. Working with a flat vector also seems to be the way DL in Julia is going with explicitly parameterized models, and from my research perspective this is much appreciated.

Great, thanks for this! I will try to see if this gives some insight on how to fix my Enzyme Hvp operator. I had previously run into some issues with repeated products (i.e. sequential v). I will be sure to open an issue if I uncover something deeper. It does seem simple in hindsight, but I think it can be difficult to put this together on one’s own, so it is really nice to have it exported by the AD software.

Thanks @wsmoses for contributing your thoughts, and thanks to both you and @gdalle for all the hard dev work!

ComponentArrays is just an array with indexing on top. That’s kind of the problem. It works really well as the abstraction if the core underlying data is (or should be) a Vector{T}. However, multiple arguments is generally a Tuple, Tuple{T1,T2,...,TN}. Crucially, tuples can contain different types. So say, for example, you want f(u,p,t) where p is some user structure and u is a Vector{Float64}; you cannot generically always put that into a ComponentArray.

It does work well for many cases though where the object really should be a vector. For example, Newton optimization methods do linear algebra on that vector, so clumping the vector of neural network parameters together in order to do a quasi-Newton scheme can improve performance of those parts.
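The flatten-into-one-vector idea can be sketched in a few lines of plain Julia (`flatten`/`unflatten` here are hypothetical helpers for illustration, not ComponentArrays.jl's actual API); it works precisely because every leaf shares the element type Float64:

```julia
# Flatten a (homogeneous) named tuple of arrays into one Vector{Float64} and back
params = (W = [1.0 2.0; 3.0 4.0], b = [5.0, 6.0])

flatten(nt) = reduce(vcat, map(vec, values(nt)))

function unflatten(nt, v)
    arrays = Any[]
    i = 1
    for a in values(nt)
        n = length(a)
        push!(arrays, reshape(v[i:i+n-1], size(a)))
        i += n
    end
    return NamedTuple{keys(nt)}(Tuple(arrays))
end

v = flatten(params)      # [1.0, 3.0, 2.0, 4.0, 5.0, 6.0] (column-major order)
back = unflatten(params, v)

@assert back.W == params.W && back.b == params.b
```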

That is a good find, and once we move to general structs I might fix it by calling autodiff directly. However, it is not a surprise that forward-mode backends don’t work for a gradient on general structs, because you need to iterate over all “input directions”, and that is hard to generalize (as shown by the convoluted definition of dx).
DI also fails with reverse mode Enzyme here, but for a different reason: because it expects eltype(x) to be a numeric type, which is not satisfied for a tuple of arrays. I guess we’ll need some kind of recursive eltype in this situation in order to build the one given to the pullback.
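Such a recursive eltype could look something like the following (a hypothetical sketch for illustration, not DI's actual implementation): it digs through nested containers until it reaches a numeric element type.

```julia
# Dig through nested containers until a numeric element type is found
recursive_eltype(::Type{T}) where {T<:Number} = T
recursive_eltype(::Type{<:AbstractArray{T}}) where {T} = recursive_eltype(T)
recursive_eltype(::Type{T}) where {T<:Tuple} =
    promote_type(map(recursive_eltype, fieldtypes(T))...)
recursive_eltype(x) = recursive_eltype(typeof(x))

@assert recursive_eltype(([2.0, 3.0], [5.0, 7.0])) == Float64  # tuple of arrays
@assert recursive_eltype([[1, 2], [3]]) == Int                 # nested arrays
```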