State of automatic differentiation in Julia

Just to clarify my previous reply: your expectations are correct, and in, say, 99.5% of cases AD systems work reliably. But the remaining 0.5% constitute a long tail of rare use cases that are impossible to cover without an actual user base. So please keep in mind the 0.5% probability of a mistake, be strict with the tools you use, and report any errors you encounter - they are super important for the healthy development of any library!

10 Likes

As far as I’ve heard, it is possible to write ‘provably correct software’ with formal methods. I don’t know much about it though, I think only a small number of languages can be used for this.

2 Likes

It’s provably correct in terms of types, but not in terms of values. That’s a form of correctness which IMO doesn’t prove very much, so I think the wording chosen in that discipline is completely overblown.

8 Likes

I don’t think that is true at all. Formal proof systems sometimes use math formalized via homotopy type theory (Homotopy type theory - Wikipedia) instead of set theory, but they aren’t just proving things about types, or that objects will have a certain type at a certain place in the program. In this context, formal methods are about proving that a program fulfills a specification (Formal specification - Wikipedia), and that’s not just on the level of types.

Maybe you are thinking of statically typed languages, where the compiler checks the types are right at compile time?

2 Likes

Yes, I’m talking about the Haskell sense of provability. Like the kind discussed in testing - Can Haskell functions be proved/model-checked/verified with correctness properties? - Stack Overflow . It’s nice stuff and finds a lot, but it definitely doesn’t prove that your program has no numerical bugs, which is something I’ve found important to mention because I’ve seen the marketing cause people to think that a passing proof means there cannot be bugs in the software.

4 Likes

Indeed, there are formally verified automatic differentiation systems out there! Shows | Microsoft Docs.

As @ChrisRackauckas mentioned, this does not guarantee numerical stability, but it does give you a way to verify that any rewrite you do for numerical stability still calculates the correct mathematical quantity. And there also exist tools for automatic numerical stability rewrites, eg. http://herbie.uwplse.org/.

3 Likes

Let’s be clear: no one involved in automatic differentiation disagrees with you there.
That would be a major bug.
But bugs happen sometimes.
If someone tells you their software has no bugs, then either:

  1. The code is trivial,
  2. They are lying, or at least massively over-confident, or
  3. They are working on formal methods, and in many cases points 1 and/or 2 also apply*

*(though that formally verified AD system posted above looks pretty cool, here is the GitHub repo. Really cool stuff; the docs do highlight the limitations, which are not critical, though Eigen is a big dependency not to have proven. Sadly it seems abandoned).

The closest thing you will normally get on real software to a promise of no bugs is a support contract with indemnification.
I believe you can probably get one of those for Flux/Zygote from Julia Computing, though I am not sure how readily they support things outside the language itself; it’s a conversation to have with them if interested.
In general, having made multiple support requests for the gnarliest of bugs through Julia Computing’s enterprise support, I can say that I am pretty impressed with it.
Would recommend, for circumstances where that seriously matters.

Finite differences is always a great way to double check gradients, though ideally the onus of testing should lie on the libraries themselves, not users.
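As a minimal sketch of what such a double-check looks like (the function name `finite_diff_grad` and the example function are my own, not from any library mentioned here), a central-difference gradient can be compared against a hand-derived one:

```python
def finite_diff_grad(f, x, h=1e-6):
    """Central-difference gradient of f: R^n -> R at the point x (a list of floats)."""
    grad = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        grad.append((f(xp) - f(xm)) / (2 * h))
    return grad

# Example: f(x, y) = x^2 * y, whose analytic gradient is (2xy, x^2).
f = lambda v: v[0] ** 2 * v[1]
g = finite_diff_grad(f, [3.0, 2.0])
# Compare against the hand-derived gradient (12, 9) within a tolerance.
assert abs(g[0] - 12.0) < 1e-4 and abs(g[1] - 9.0) < 1e-4
```

The same comparison works with an AD-computed gradient in place of the hand-derived one, which is essentially what the library test suites below automate.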

Like @dfdx says, to the best of my knowledge all the libraries themselves do test extensively with finite differencing.
Zygote, Nabla, and ChainRules certainly do.

In ChainRules these are some of the ones we test most extensively.
I think something like 10% of the tests in ChainRules are for various overloads of *.
Only finite differencing can be trusted: when testing against well-known textbook rules, it is too easy to make a mistake with a missed -, or a transpose instead of an adjoint, etc., and then to make the same mistake twice when you write your tests.
And it’s really bad when you mess up something fundamental.

Please don’t, though. We don’t need any more of them. :joy:
But seriously, if you look at the code of FiniteDiff.jl or FiniteDifferences.jl you will see it is a bit more complex than you might expect.
They were written by experts, and are probably 2-3 orders of magnitude more accurate than what a naive implementation would get (on certain problems).
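A toy illustration of why the details matter (my own sketch, far simpler than what those libraries actually do): even just switching from the naive one-sided rule to a central difference, at the same step size, cuts the error by several orders of magnitude on a smooth function.

```python
import math

x, h = 1.0, 1e-6
exact = math.exp(1.0)  # d/dx exp(x) = exp(x)

# Naive one-sided difference: truncation error is O(h).
fwd = (math.exp(x + h) - math.exp(x)) / h
# Central difference: truncation error is O(h^2) for the same step size.
ctr = (math.exp(x + h) - math.exp(x - h)) / (2 * h)

err_fwd = abs(fwd - exact)
err_ctr = abs(ctr - exact)
assert err_ctr < err_fwd / 100  # central is orders of magnitude closer here
```

The real libraries go much further than this (higher-order rules, adaptive step sizes), which is where the extra accuracy comes from.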

Also, forward-mode AD is probably easier to write than you think, but again, please don’t :grin: we already have 3 decent efforts. Though it is good to do for educational purposes; it makes a nice blog post.
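For a sense of just how small the core idea is, here is an educational sketch of forward mode via dual numbers (Python for brevity; the class and function names are mine, and this covers only + and * - a real implementation handles every primitive):

```python
class Dual:
    """Minimal dual number a + b*eps with eps^2 = 0, for forward-mode AD."""
    def __init__(self, val, der=0.0):
        self.val, self.der = val, der

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.der + other.der)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (a + a'eps)(b + b'eps) = ab + (a'b + ab')eps
        return Dual(self.val * other.val,
                    self.der * other.val + self.val * other.der)
    __rmul__ = __mul__

def derivative(f, x):
    """Seed a unit dual part; the dual part of the result is f'(x)."""
    return f(Dual(x, 1.0)).der

# d/dx (x^3 + 2x) at x = 3 is 3x^2 + 2 = 29
assert abs(derivative(lambda x: x * x * x + 2 * x, 3.0) - 29.0) < 1e-12
```

The hard part of a production system is not this kernel but covering the long tail of primitives and edge cases, which is exactly the point made above.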

10 Likes

Formal systems are great to reason about, but often they are quite impractical in use. Consider the following example: we are building an AD system using operator overloading (just for simplicity). We wrap all inputs in, say, a TrackedArray type, and overload a number of allowed operations so that they not only compute the result, but also keep track of the computational graph:

A = TrackedArray(rand(3, 4))
B = TrackedArray(rand(4, 5))
C = A * B    # TrackedArray

Everything works fine while we only work with tracked arrays. But then we encounter a non-tracked object not present in the computational graph. It may be just a constant like π, or a global variable which the user of our system decided to depend on, or the result of a call to a 3rd-party function. It may also be a call to some function of AbstractArray which we didn’t overload, so our tracking fails.

To make our AD system provably correct we have a choice:

  1. Prove that all the cases above are properly covered.
  2. Forbid these use cases / throw an error if they are encountered.

Given that Julia is a dynamic language, I think option (1) is just not possible. Forbidding these use cases may work, but it will leave you with a very limited set of available features (in both your own code and the code of libraries you’d like to use).

Note that there are practical systems which sit somewhere in between formal systems and a complete programming language. A prominent example is PyTorch’s AD: in its current state, PyTorch and dependent libraries provide a wide range of primitive functions covering almost everything you may want to use without leaving the “tracked universe”. However, the price is high - the PyTorch authors had to redefine thousands of functions already implemented in NumPy / SciPy & friends.

3 Likes

I don’t see the real similarity between PyTorch’s operator overloading AD, and formal methods.
Can you explain more?

(I am not really familiar with formal methods so probably missed something)

In Julia, you can write foo(A, B) and it will work even if foo is not overloaded for tracked A and B. So Julia array types make an open ecosystem that everybody can extend, and extensions can be mixed in a variety of ways.
In PyTorch, you usually can’t do it. PyTorch tensors are different from TensorFlow’s tensors, from NumPy arrays, from Python lists, etc. If you run:

import numpy as np
import torch

t = torch.FloatTensor([1, 2, 3])
np.sum(t)

you will get an error because you mixed up 2 closed systems. But on the other hand, this error will force you to rewrite this expression as:

t.sum()

which creates the correct computational graph.

This doesn’t make PyTorch a formal system, but rather a closed system, with a smaller chance of making an error, but also with many restrictions.

4 Likes

Herbie is a really cool project: I want to integrate that with ModelingToolkit somehow.

6 Likes

Any software can potentially have unknown major bugs. In the case of Flux and Zygote, though, what is troubling to me is that there are known major bugs (i.e., by the definition above, bugs that silently produce incorrect gradients) in fundamental functionality like backprop, RNNs, and broadcasting (e.g. Flux.jl#1209, Zygote.jl#563) that remain unaddressed for months and do not seem to be prioritized.

15 Likes

I share your frustrations. I can speak for what the ML community effort is planning on doing to address this.

  1. (in progress) automated benchmarking of standard models/datasets to catch performance regressions; this process will also help collect and expose fundamental issues with differentiating common models.
  2. (planning stage) establish a triage process similar to what is done for Julia itself. The idea here would be to track longstanding issues across the Flux ecosystem and discuss them live on a call. Issues will be assigned to people, or some comment will be made about why an issue is not prioritized.

We still need to discuss/coordinate (2) with the core maintainers, but our view is that there is just too large a backlog of issues on Flux/Zygote and not enough man-hours available. So we will try and help. I’ll post back once we have the GH infrastructure up and running, then people can file the issues they want triaged.

14 Likes

If it hasn’t already been mentioned, you can use complex step to verify AD in your application. See here: Complex Step Differentiation » Cleve’s Corner: Cleve Moler on Mathematics and Computing - MATLAB & Simulink

Complex step is usually costlier than AD, so we just use it for verifying AD of the specific function of interest in our applications (computational fluid dynamics), and then use AD thereafter.
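For anyone unfamiliar with the trick described in that article: for an analytic real function f, evaluating it at a complex point x + ih gives f'(x) ≈ Im(f(x + ih)) / h, with no subtractive cancellation, so h can be made tiny. A minimal sketch (my own helper name):

```python
import cmath
import math

def complex_step(f, x, h=1e-20):
    """Complex-step derivative: f'(x) ~= Im(f(x + i*h)) / h.
    No subtraction of nearly equal values, so h can be extremely small."""
    return f(complex(x, h)).imag / h

# d/dx sin(x) at x = 0.5 is cos(0.5); accurate to machine precision.
d = complex_step(cmath.sin, 0.5)
assert abs(d - math.cos(0.5)) < 1e-15
```

Note the requirement that f be analytic and callable on complex inputs, which is the limitation discussed below.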

It is my understanding that, for example, the ChainRulesTestUtils.jl tests use this and/or other similar techniques to verify the correctness of rules.

Isn’t that just a less-convenient version of dual numbers? Which is what ForwardDiff already uses? Though I guess perhaps the point is that complex math libraries are widespread and already well-tested?

7 Likes

It’s a less convenient version of dual numbers which only works if the function is analytic. It’s also just as fast if you make it SIMD the same way dual numbers does.

4 Likes

That’s my understanding.
Less convenient.
Does weird things if nested, or if there are actual complex numbers involved.

ChainRules tests against FiniteDifferences.jl.
Which is high-accuracy finite differencing using magic I don’t get but that people worked out.
Which will run on absolutely any function without even the possibility of hitting a different code path.

It’s generally accurate to 1e-14 or so.
Though the latest release seems to have a regression on accuracy we are currently chasing down.

I have often thought about testing ChainRules against ForwardDiff, but I keep deciding finite differencing is better since it uses the same code path as the primal.
And because if we ever start using ChainRules under the hood for ForwardDiff, the tests would become circular.

5 Likes

Complex step usually isn’t more convenient than AD, but it can be a useful way to verify derivatives in some fields (especially if the derivatives you want to verify were found using AD). Syam mentioned CFD which is one example where complex step is widely used for that purpose. Finite differencing is a good option but requires more function calls to improve accuracy. That’s less desirable for scenarios where a single function call takes many minutes, and the result may still not be quite as accurate as a complex step. Often the check is against a few directional derivatives, in random directions, because the input size is large so checking the full gradient with complex step or finite differencing is too expensive.

You actually can use complex step with complex functions (multicomplex step) and nested functions; it just requires more effort. Typically you need to overload some complex functions with analytic counterparts (e.g., abs) and/or ones that have numerical issues at small step sizes (e.g., some complex trig functions, depending on the language).

But why would you invest that effort in Julia, where solutions like ForwardDiff.Dual seamlessly integrate into the language?

Complex step differentiation can be a useful hack in some languages, and it’s a neat trick to know about in general, but particularly in Julia it has little practical value.

1 Like