State of automatic differentiation in Julia

I don’t think that FD should be confused with AD.

5 Likes

It’s the same algorithm, you just store the perturbations in a different dimension.

4 Likes

This surprised me greatly. Firstly, I always heard they were fundamentally different; and secondly, I thought finite differences were very vulnerable to noise and sensitive to step size.

Yes, I am aware of the theoretical connection, but for people new to the concept, approaching it like this may be confusing, especially as FD requires close attention to a host of numerical issues that do not afflict AD.

3 Likes

They aren’t necessarily. Forward mode and reverse mode are fundamentally different: they have different computational complexities, different actions, etc. Forward mode computes Jacobian-vector products, while reverse mode computes vector-transpose-Jacobian products (Jv vs v’J). One computes columns of Jacobians, while the other computes rows.
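To make the Jv versus v’J distinction concrete, here is a small sketch in plain Julia. The function `f` and its hand-derived `jacobian_f` are made up for illustration; no AD package is involved, the Jacobian is simply written out so that the two kinds of product can be compared.

```julia
# Hypothetical example function f: R^2 -> R^2 with a hand-derived Jacobian.
f(x) = [x[1]^2 * x[2], x[1] + sin(x[2])]
jacobian_f(x) = [2x[1]*x[2] x[1]^2;
                 1.0 cos(x[2])]

x = [1.5, 0.5]
J = jacobian_f(x)

e1 = [1.0, 0.0]

# J * v: the kind of product one forward-mode pass produces.
# Seeding with e1 picks out the first column of J.
column1 = J * e1

# v' * J: the kind of product one reverse-mode pass produces.
# Seeding with e1 picks out the first row of J.
row1 = vec(e1' * J)
```

With n inputs and m outputs, building the full Jacobian this way takes n seeds in forward mode and m seeds in reverse mode, which is where the different cost profiles come from.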

But forward mode and finite difference? Both compute Jacobian-vector products ((f(x + epsilon*v) - f(x)) / epsilon), both compute columns of Jacobians, and both have the same computational complexity in every case. They are fundamentally the same algorithm. What’s different is that, when x is a real number, autodiff stores its internal values as an N+1 dimensional number, where N is the number of ongoing perturbations. Finite difference can only do one perturbation at a time, and stores that perturbation as a small piece at the end of the original number (hence having fewer digits and the numerical accuracy issues). So forward-mode AD with a chunk size of 1 and finite difference should usually have around the same computational cost, and with higher chunk sizes AD gets a constant-factor reduction from calculating the primal fewer times, but it’s really not huge. Even with higher chunk sizes, we’ve seen FiniteDiff.jl is usually within 2x of ForwardDiff.jl (because the primal is usually less complex than the derivative, so you’re tacking on a bunch of extra calculations anyway). The main difference is really just accuracy, because you’re not mixing the perturbation dimension with the primal dimension.

I always point to https://mitmath.github.io/18337/lecture9/autodiff_dimensions and hope it’s helpful. So yes, finite differencing done correctly is essentially the same algorithm but with a higher error floor.
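A minimal sketch of that comparison, in plain Julia with no packages. The `Dual` type, `forward_derivative`, and `finite_difference` below are hypothetical helpers written for this post, not the ForwardDiff.jl or FiniteDiff.jl implementations, and only the few operations the example needs are defined.

```julia
# Minimal dual number: `val` is the primal and `der` is the perturbation,
# kept in its own slot instead of in the low digits of `val`.
struct Dual
    val::Float64
    der::Float64
end

Base.:*(a::Dual, b::Dual) = Dual(a.val * b.val, a.der * b.val + a.val * b.der)
Base.sin(a::Dual) = Dual(sin(a.val), cos(a.val) * a.der)

# Works on both Float64 and Dual inputs.
g(x) = (x * x) * sin(x)

# Forward mode with one perturbation ("chunk size 1"): seed der = 1, read it back.
forward_derivative(f, x) = f(Dual(x, 1.0)).der

# Finite difference: the perturbation h is added into the primal itself.
finite_difference(f, x; h = sqrt(eps(Float64))) = (f(x + h) - f(x)) / h

x0 = 1.3
exact = 2 * x0 * sin(x0) + x0^2 * cos(x0)   # derivative of x^2 * sin(x)

ad_error = abs(forward_derivative(g, x0) - exact)
fd_error = abs(finite_difference(g, x0) - exact)
```

Both routines call `g` once per perturbation direction (plus one primal call for the finite difference); the difference shows up in `ad_error` versus `fd_error`, since the forward difference loses roughly half the available digits.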

12 Likes

Another practical difference between FD and forward mode is that FD can treat a function completely as a black box, requiring only function evaluations, whereas forward AD needs to decompose the function into primitives with known derivative rules. So with FD you might even differentiate through some exotic web API or, say, some Fortran solver you’re calling, which makes it very robust from a compatibility perspective.
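A small sketch of the black-box point, again in plain Julia. Here `blackbox` is a made-up stand-in for an opaque routine (think of a wrapped Fortran solver) that only accepts `Float64`, and `Dual` is a toy dual number, not ForwardDiff’s type.

```julia
# Toy dual number (hypothetical, for illustration only).
struct Dual
    val::Float64
    der::Float64
end

# Stand-in for an opaque routine: it only accepts plain Float64 inputs,
# so nothing can be traced through its internals.
blackbox(x::Float64) = exp(-x) * x^2

x = 1.0
h = cbrt(eps(Float64))

# Finite differencing only needs function evaluations, so it works.
fd = (blackbox(x + h) - blackbox(x - h)) / (2 * h)

# Pushing a dual number through the same routine fails with a MethodError,
# because the routine exposes no method for the dual type.
ad_failed = try
    blackbox(Dual(x, 1.0))
    false
catch err
    err isa MethodError
end

exact = exp(-x) * (2 * x - x^2)
```

In a real setting the AD route would need a hand-written derivative rule for the opaque call, while the finite-difference route needs nothing beyond the ability to call it.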

7 Likes

I’m interested in knowing how trustworthy the results from Zygote/Flux and Yota currently are.

Zygote seems to already be used in Flux, but in its docs there is “At least, that’s the idea. We’re still in beta so expect some adventures”. I can’t really use it if there’s a chance that some bug is going to make my results completely off.

Yota doesn’t explicitly state anything similar, but I can’t really tell what sort of position it’s in.

@dfdx @MikeInnes

1 Like

If you’re depending on it for a rocket launch, don’t, because numerical stuff always goes wrong and it’s impossible to guarantee otherwise.

Other than that, it’s pretty “accurate”. Of course edge cases exist, but then again, don’t expect it to accurately handle everything in the mathematically possible universe.

4 Likes

Jerry summarized it pretty well. Currently Yota passes all its 140 tests (plus a bunch of tests in Lilith.jl, which further challenge Yota’s autodiff), but the universe of possible code paths is many orders of magnitude larger. There’s no bug-free software, only undertested software. The question is what level of trust is acceptable for your task. With Yota/Lilith, I haven’t encountered numerical errors in a while (there was an issue in a third-party library, but it was caught in tests and didn’t get to master). However, in the world of living software, with changing versions of the language and libraries and a variety of platforms and use cases, reliability can only be checked by practical use.

5 Likes

As a rule, if you’re at all concerned about the correctness of the gradients you’re getting from an AD tool, you should test them using finite differencing, e.g. with FiniteDifferences.jl.
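As a rough illustration of such a check, here is a sketch that uses only Base Julia rather than the FiniteDifferences.jl API. The helper `fd_gradient`, the test function `f`, and both candidate gradients are made up; in practice the candidate would come from the AD tool under test.

```julia
# Central-difference gradient, one coordinate at a time (hypothetical helper).
function fd_gradient(f, x::Vector{Float64}; h = cbrt(eps(Float64)))
    g = similar(x)
    for i in eachindex(x)
        step = zeros(length(x))
        step[i] = h
        g[i] = (f(x + step) - f(x - step)) / (2 * h)
    end
    return g
end

f(x) = sum(sin, x) + x[1] * x[2]

# Stand-ins for gradients returned by a tool under test.
correct_gradient(x) = cos.(x) .+ [x[2], x[1], 0.0]
buggy_gradient(x) = cos.(x)              # silently drops the x[1] * x[2] term

x = [0.3, -1.2, 2.0]
reference = fd_gradient(f, x)

# Compare with a tolerance well above the finite-difference error floor.
matches(g) = maximum(abs.(g .- reference)) < 1e-6

correct_ok = matches(correct_gradient(x))
buggy_ok = matches(buggy_gradient(x))
```

The tolerance is the judgment call here: too tight and the finite-difference error itself triggers false alarms, too loose and small bugs slip through.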

14 Likes

I think the real question here is one of soundness, not completeness: I understand that no AD system will be able to give me gradients on completely arbitrary code, but giving me a gradient that is incorrect instead of an error is a big no-no!

Finite differencing is always a great way to double-check gradients, though ideally the onus of testing should lie on the libraries themselves, not users.

1 Like

There’s a lot of safety in the current implementations. ForwardDiff’s tag implementation is especially good at blocking perturbation confusion. The only thing that I know of that’s a little wonky is Zygote’s handling of nothing. It’s not incorrect in any known way other than turning an absence of gradient definitions into zero gradients, which is sometimes weird and should error, IMO.
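For anyone who hasn’t run into perturbation confusion, here is a sketch of the classic example with a deliberately naive, untagged dual number. Everything below is a toy written for illustration and is not how ForwardDiff is implemented; it only shows the failure mode that tagging is meant to prevent.

```julia
# Naive dual number with a single, untagged perturbation slot.
struct Dual
    val::Float64
    der::Float64
end

Base.:+(a::Dual, b::Dual) = Dual(a.val + b.val, a.der + b.der)
Base.:*(a::Dual, b::Float64) = Dual(a.val * b, a.der * b)

# Seed a perturbation of 1 and read it back. Nothing records which
# derivative call a given perturbation belongs to.
naive_derivative(f, x::Float64) = f(Dual(x, 1.0)).der

# d/dx [ x * d/dy (x + y) ] at x = 1.
# The inner derivative is 1 for every x, so the correct answer is 1.
outer(x) = x * naive_derivative(y -> x + y, 1.0)

result = naive_derivative(outer, 1.0)   # gives 2.0, not 1.0
```

The inner call sees the outer perturbation carried by `x` and counts it as its own, so no error is raised and the number is simply wrong. Tagging each perturbation with the derivative call that created it is what lets a library tell the two apart.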

7 Likes

And usually authors of these libraries do extensive testing of all new gradients (excluding maybe trivial functions from textbooks like sin() or well-known gradients like matrix-matrix multiplication). However, there are always corner cases where it’s not so easy to figure out the right gradient or even behavior.

For example, consider the following function:

loss(x) = sum(x) / length(x)

There are 2 paths connecting x and loss: via sum() and via length(). sum() is no problem, since it has a well-known derivative, but length() is not so unambiguous. Usually, we work in spaces with a fixed number of dimensions, e.g. R^n, so length(x) == n is constant, and so the derivative should be zero (or not propagated? that’s another non-trivial question). On the other hand, Julia is not pure math, and in practice there might be a use case in which we must calculate the derivative of length() as well.

I haven’t seen such use cases, so in Yota I used the first approach: stop propagation through length. If, however, someone encounters such a scenario, they won’t get an exception. They won’t even get a zero derivative, since loss still depends on x via sum(). The result will just be wrong. But, honestly, I don’t know how to prevent it.
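To spell out what the first approach gives for this function, here is a sketch in plain Julia. It does not call Yota; `grad_loss` is a hand-written gradient under the assumption that length(x) is a constant n, compared against a central difference on a fixed-length vector.

```julia
loss(x) = sum(x) / length(x)

# Gradient if length(x) is treated as a constant n: every entry is 1/n.
grad_loss(x) = fill(1 / length(x), length(x))

x = [2.0, -1.0, 4.0, 0.5]
n = length(x)

# i-th unit vector of length n (hypothetical helper).
unit(i, n) = [j == i ? 1.0 : 0.0 for j in 1:n]

# Central difference per coordinate; the vector length never changes here.
h = 1e-4
fd_grad = [(loss(x + h * unit(i, n)) - loss(x - h * unit(i, n))) / (2 * h) for i in 1:n]

max_diff = maximum(abs.(grad_loss(x) .- fd_grad))
```

A finite-difference check like this only perturbs the entries of x, so it agrees with the constant-length convention and cannot expose the scenario where the length itself is supposed to carry a derivative.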

Other tricky use cases:

  • derivatives of iterate() and Base.indexed_iterate() w.r.t. the iterator state
  • getindex() and view() w.r.t. indices
  • convert()
  • gradients w.r.t. global variables, etc.

There is also mutable state, control flow, exception handling, tasks, multithreading and many other things that can, in some scenario, mess up the result in a way that you don’t even notice. So the only way to improve the robustness of an AD system is to put it to use as much as possible, but look at the results with just a little skepticism.

7 Likes

I don’t think so. Finite differencing acts on the output, Auto Diff acts on the code.

E.g., any bozo, including me, can write a finite difference approximation to a derivative. It takes an expert to write forward-mode AD code.

You may be surprised to hear this, but this is the case for most users — most of us prefer correct results.

All of the Julia AD packages mentioned above are free software, so this may not be a good way to phrase this. If you believe that you care about correctness more than other users, you should consider contributing to tests and/or reviewing code.

Generally, it is unclear what you are expecting from this discussion. Like all software, Julia’s AD libraries are not guaranteed to be bug-free, despite careful implementation and testing. Major errors are rare and are usually fixed quickly, but can nevertheless happen.

10 Likes

Just to clarify my previous reply: your expectations are correct, and in, say, 99.5% of cases AD systems work reliably. But the remaining 0.5% constitute a long, long tail of rare use cases which are impossible to cover without an actual user base. So please keep in mind the 0.5% probability of a mistake, but be strict with the tools you use and report any errors you encounter. They are super important for the healthy development of any library!

10 Likes

As far as I’ve heard, it is possible to write ‘provably correct software’ with formal methods. I don’t know much about it though, I think only a small number of languages can be used for this.

2 Likes

It’s provably correct in terms of types, but not in terms of values. That’s a form of correctness which IMO doesn’t prove very much, so I think the wording chosen in that discipline is completely overblown.

8 Likes

I don’t think that is true at all. Formal proof systems sometimes use math formalized via homotopy type theory (Homotopy type theory - Wikipedia) instead of set theory, but they aren’t just proving things about types, or that objects will have a certain type at a certain place in the program. In this context, formal methods are about proving that a program fulfills a specification (Formal specification - Wikipedia), and that’s not just on the level of types.

Maybe you are thinking of statically typed languages, where the compiler checks the types are right at compile time?

2 Likes