Automatic Differentiation

Hey,

I am still having a bit of a hard time figuring out what Flux.jl actually does. One important aspect seems to be making gradients available automatically via tracking. How does that differ from the approach in ForwardDiff.jl, i.e. why is it not possible/feasible/sensible to get the gradient of my loss with respect to all parameters using ForwardDiff?

The current tracking implementation seems to be very similar to what TensorFlow with eager execution does, right?

Thanks,

Kevin

Flux uses reverse-mode AD, while ForwardDiff uses forward mode (as its name implies). See e.g. Wikipedia for the difference.

Reverse mode is ideal for \mathbb{R}^n \to \mathbb{R} functions, since a single backward pass yields the whole gradient, while forward mode is suited to the opposite case, its cost growing with the number of inputs. That said, in practice ForwardDiff can be fine for a small n.
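
To make the difference concrete, here is a minimal sketch (assuming the Tracker-era Flux API, roughly Flux ≤ 0.9, where reverse-mode gradients come from Flux.Tracker) comparing the two modes on a scalar loss of many parameters:

using ForwardDiff
using Flux   # Tracker-era Flux, where Flux.Tracker provides reverse-mode AD

# A scalar loss of n inputs: the typical R^n -> R case.
loss(w) = sum(abs2, w) / 2

w = randn(1_000)

# Forward mode: work scales with the number of inputs, since the gradient is
# assembled from dual-number sweeps over chunks of w.
g_fwd = ForwardDiff.gradient(loss, w)

# Reverse mode: one recorded forward pass plus one backward pass gives the
# whole gradient, independent of the number of inputs.
g_rev = Flux.Tracker.gradient(loss, w)[1]

g_fwd ≈ Flux.Tracker.data(g_rev)   # same numbers, very different cost scaling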

2 Likes

Thanks for the quick answer. Reverse > forward due to speed considerations, I guess? But why not mix both? There are still some cases where Flux cannot provide a gradient but ForwardDiff can; couldn't ForwardDiff be used as a default fallback when the tracker encounters a 'dead end'?

All AD libraries have limitations (but these are best reported as an issue with a minimal example so that they can be fixed). That said, I find ForwardDiff to be the most robust in practice, so it is indeed useful as a fallback.

Also, as you said, mixed-mode AD can be useful, especially if the problem has structure that can be exploited.
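
For example, here is a sketch of wiring ForwardDiff in as a fallback for a single function the tracker cannot see through, using Tracker's custom-gradient hooks (Tracker.track and Tracker.@grad); blackbox is a hypothetical stand-in, not anything from Flux:

using Flux, ForwardDiff
using Flux.Tracker: TrackedArray, track, data, @grad

# Hypothetical scalar function the tracker cannot differentiate on its own.
blackbox(x::AbstractVector) = sum(sin, x) * exp(-sum(abs2, x))

# Intercept tracked inputs and register a custom pullback...
blackbox(x::TrackedArray) = track(blackbox, x)

# ...whose gradient is supplied by forward mode. Reverse mode on the outside,
# forward mode inside this one node: a small instance of mixing the two.
@grad function blackbox(x)
    xv = data(x)
    blackbox(xv), Δ -> (Δ .* ForwardDiff.gradient(blackbox, xv),)
end

# The surrounding loss is still differentiated by the tracker as usual.
w = param(randn(5))
l = sum(w .^ 2) + blackbox(w)
Flux.Tracker.back!(l)
Flux.Tracker.grad(w)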

I would recommend that you consult a good source on AD for a deeper understanding. I find

@book{griewank2008evaluating,
  title={Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation},
  author={Griewank, Andreas and Walther, Andrea},
  volume={105},
  year={2008},
  publisher={SIAM}
}

especially nice.

5 Likes

Thanks again!

I believe both ReverseDiff.jl and Flux use mixed-mode AD.
https://github.com/FluxML/Flux.jl/blob/39dcfd3933fff5cc498e0330de0064f521aad9a7/src/tracker/lib/array.jl

It's partly explained in this paper, if I remember correctly: [1810.08297] Dynamic Automatic Differentiation of GPU Broadcast Kernels.
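
The gist of the broadcast trick, sketched here with plain ForwardDiff calls rather than Flux's actual implementation: in y = f.(x) each output depends on a single input, so the reverse-mode pullback of the broadcast can be assembled from cheap scalar forward-mode derivatives.

using ForwardDiff

f(x) = tanh(x)^2            # some scalar function being broadcast

x = randn(4)
y = f.(x)

# Given the adjoint Δ of y, the adjoint of x is Δ[i] * f'(x[i]), where each
# f'(x[i]) comes from a single scalar dual-number (forward-mode) evaluation.
Δ = ones(length(y))
x̄ = Δ .* ForwardDiff.derivative.(f, x)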

1 Like

I saw this today: A Benchmark of Selected Algorithmic Differentiation Tools on Some Problems in Computer Vision and Machine Learning.

The code is given in ADBench - AutoDiff Benchmarks.

What do you think?

The abstract describes the approach as

"a skilled programmer devoting roughly a week to each tool produced the timings we present."

I would have thought that a skilled programmer would notice that a library named ForwardDiff lacking reverse AD may not be an accident, and would proceed to try, if nothing else, at least one of the reverse-mode packages (e.g. ReverseDiff.jl).
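
Something along these lines would do as a first attempt (a sketch only; f and x below are placeholders for the benchmark objective and input, not the benchmark code):

using ReverseDiff

f(x) = sum(abs2, x)          # placeholder objective
x = randn(100)               # placeholder input

# One-shot reverse-mode gradient:
ReverseDiff.gradient(f, x)

# Or, for repeated evaluations, record and compile the tape once:
const tape = ReverseDiff.compile(ReverseDiff.GradientTape(f, x))
out = similar(x)
ReverseDiff.gradient!(out, tape, x)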

I have failed to find the actual Julia code, but I wonder if at least the low-hanging optimizations were applied.

2 Likes

I was surprised to see that MATLAB's libraries were faster.
Julia should beat MATLAB even without those "low-hanging fruits".

It is really strange…

I am not sure; note that if you don't specify the config, the ForwardDiff methods themselves are type unstable, since the chunk size is calculated dynamically.

EDIT: the repo is somewhat disorganized, but the Julia code appears to be here. It seems it does not precalculate the chunk size or configuration, and it also has various readily apparent type stability problems.
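
For reference, a minimal sketch of what precalculating the configuration looks like (objective and x0 are placeholders, not the benchmark code):

using ForwardDiff

objective(x) = sum(abs2, x)    # placeholder for the benchmark objective
x0 = randn(32)                 # placeholder input

# Fix the chunk size at compile time and reuse the config, instead of letting
# ForwardDiff pick a chunk dynamically on every call.
const cfg = ForwardDiff.GradientConfig(objective, x0, ForwardDiff.Chunk{8}())
out = similar(x0)
ForwardDiff.gradient!(out, objective, x0, cfg)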

2 Likes

Code appears to be here: ADBench/usr/awf/Julia at master · microsoft/ADBench · GitHub
There should be plenty of optimizations left to be done here: Julia's forward diff is considerably slower than C++'s finite differences. Julia's ReverseDiff has a number of problems, so I would imagine they tried it, it didn't work, and they didn't pursue it further (which is fair game). There are references to ReverseDiffSource in the code, though. In any case, writing efficient Julia code is not completely trivial, and using Julia's autodiff tools efficiently is even harder (at the moment at least), so we can't really blame the author there.

Since C++ was included, I think it is reasonable to give Julia the same amount of effort as it would take to write it in C++. That would take a lot, so it should at least cover following the advice in the performance tips. IMO that would be the minimum to make this comparison interesting: nothing more than some applications of @code_warntype and benchmarking.
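
Along the lines of the following, say (with a stand-in objective; not the benchmark code):

using BenchmarkTools, ForwardDiff

# Stand-in objective; the actual benchmark problems would go here instead.
rosenbrock(x) = sum(100 .* (x[2:end] .- x[1:end-1] .^ 2) .^ 2 .+ (1 .- x[1:end-1]) .^ 2)
x = randn(30)

# Check the objective itself for type instabilities (red `Any`s in the output),
@code_warntype rosenbrock(x)

# then benchmark the differentiated call with interpolated arguments.
@btime ForwardDiff.gradient($rosenbrock, $x)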

1 Like