Automatic Differentiation

Hey,

I am still having a bit of a hard time figuring out what Flux.jl actually does. One important aspect seems to be making gradients available automatically via tracking. How does that differ from the approach in ForwardDiff.jl, i.e. why is it not possible/feasible/sensible to get the gradient of my loss with respect to all parameters using ForwardDiff?

The current tracking implementation seems to be very similar to what TensorFlow with eager execution does, right?

Thanks,

Kevin

Flux uses reverse-mode AD, while ForwardDiff uses forward mode (as its name implies). See e.g. Wikipedia for the difference.

Reverse mode is ideal for \mathbb{R}^n \to \mathbb{R} functions, since a single backward pass yields the whole gradient, while forward mode is suited to the opposite case, its cost growing with the number of inputs. That said, in practice ForwardDiff can be fine for a small n.
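
To make the difference concrete, here is a minimal sketch (assuming the Tracker-era Flux API, roughly Flux ≤ 0.9, where reverse-mode gradients come from Flux.Tracker) comparing the two modes on a scalar loss of many parameters:

using ForwardDiff
using Flux   # Tracker-era Flux, where Flux.Tracker provides reverse-mode AD

# A scalar loss of n inputs: the typical R^n -> R case.
loss(w) = sum(abs2, w) / 2

w = randn(1_000)

# Forward mode: work scales with the number of inputs, since the gradient is
# assembled from dual-number sweeps over chunks of w.
g_fwd = ForwardDiff.gradient(loss, w)

# Reverse mode: one recorded forward pass plus one backward pass gives the
# whole gradient, independent of the number of inputs.
g_rev = Flux.Tracker.gradient(loss, w)[1]

g_fwd ≈ Flux.Tracker.data(g_rev)   # same numbers, very different cost scaling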

2 Likes

Thanks for the quick answer. Reverse > forward due to speed considerations, I guess? But why not mix both? There are still some cases where Flux cannot provide a gradient but ForwardDiff can; couldn't ForwardDiff be used as a default fallback when the tracker encounters a 'dead end'?

All AD libraries have limitations (but these are best reported as an issue with a minimal example so that they can be fixed). That said, I find ForwardDiff to be the most robust in practice, so it is indeed useful as a fallback.

Also, as you said, mixed-mode AD can be useful, especially if the problem has structure that can be exploited.
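
For example, here is a sketch of wiring ForwardDiff in as a fallback for a single function the tracker cannot see through, using Tracker's custom-gradient hooks (Tracker.track and Tracker.@grad); blackbox is a hypothetical stand-in, not anything from Flux:

using Flux, ForwardDiff
using Flux.Tracker: TrackedArray, track, data, @grad

# Hypothetical scalar function the tracker cannot differentiate on its own.
blackbox(x::AbstractVector) = sum(sin, x) * exp(-sum(abs2, x))

# Intercept tracked inputs and register a custom pullback...
blackbox(x::TrackedArray) = track(blackbox, x)

# ...whose gradient is supplied by forward mode. Reverse mode on the outside,
# forward mode inside this one node: a small instance of mixing the two.
@grad function blackbox(x)
    xv = data(x)
    blackbox(xv), Δ -> (Δ .* ForwardDiff.gradient(blackbox, xv),)
end

# The surrounding loss is still differentiated by the tracker as usual.
w = param(randn(5))
l = sum(w .^ 2) + blackbox(w)
Flux.Tracker.back!(l)
Flux.Tracker.grad(w)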

I would recommend that you consult a good source on AD for a deeper understanding. I find

@book{griewank2008evaluating,
  title={Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation},
  author={Griewank, Andreas and Walther, Andrea},
  volume={105},
  year={2008},
  publisher={SIAM}
}

especially nice.

5 Likes

Thanks again!

I believe both ReverseDiff.jl and Flux use mixed-mode AD.
https://github.com/FluxML/Flux.jl/blob/39dcfd3933fff5cc498e0330de0064f521aad9a7/src/tracker/lib/array.jl

It's partly explained in this paper, if I remember correctly: [1810.08297] Dynamic Automatic Differentiation of GPU Broadcast Kernels.
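
The gist of the broadcast trick, sketched here with plain ForwardDiff calls rather than Flux's actual implementation: in y = f.(x) each output depends on a single input, so the reverse-mode pullback of the broadcast can be assembled from cheap scalar forward-mode derivatives.

using ForwardDiff

f(x) = tanh(x)^2            # some scalar function being broadcast

x = randn(4)
y = f.(x)

# Given the adjoint Δ of y, the adjoint of x is Δ[i] * f'(x[i]), where each
# f'(x[i]) comes from a single scalar dual-number (forward-mode) evaluation.
Δ = ones(length(y))
x̄ = Δ .* ForwardDiff.derivative.(f, x)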

1 Like

I saw this today: A Benchmark of Selected Algorithmic Differentiation Tools on Some Problems in Computer Vision and Machine Learning.

The code is given in ADBench - AutoDiff Benchmarks.

What do you think?

The abstract describes the approach as

"a skilled programmer devoting roughly a week to each tool produced the timings we present."

I would have thought that a skilled programmer would notice that a library named ForwardDiff lacking reverse AD may not be an accident, and would proceed to try, if nothing else, at least one of the reverse-mode packages (e.g. ReverseDiff.jl).
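
Something along these lines would do as a first attempt (a sketch only; f and x below are placeholders for the benchmark objective and input, not the benchmark code):

using ReverseDiff

f(x) = sum(abs2, x)          # placeholder objective
x = randn(100)               # placeholder input

# One-shot reverse-mode gradient:
ReverseDiff.gradient(f, x)

# Or, for repeated evaluations, record and compile the tape once:
const tape = ReverseDiff.compile(ReverseDiff.GradientTape(f, x))
out = similar(x)
ReverseDiff.gradient!(out, tape, x)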

I have failed to find the actual Julia code, but I wonder if at least the low-hanging optimizations were applied.

2 Likes

I was surprised to see that MATLAB's libraries were faster.
Julia should beat MATLAB even without those "low-hanging fruits".

It is really strange…

I am not sure; note that if you don't specify the config, the ForwardDiff methods themselves are type unstable, since the chunk size is calculated dynamically.

EDIT: the repo is somewhat disorganized, but the Julia code appears to be here. It seems it does not precalculate the chunk size or configuration, and it also has various readily apparent type stability problems.
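
For reference, a minimal sketch of what precalculating the configuration looks like (objective and x0 are placeholders, not the benchmark code):

using ForwardDiff

objective(x) = sum(abs2, x)    # placeholder for the benchmark objective
x0 = randn(32)                 # placeholder input

# Fix the chunk size at compile time and reuse the config, instead of letting
# ForwardDiff pick a chunk dynamically on every call.
const cfg = ForwardDiff.GradientConfig(objective, x0, ForwardDiff.Chunk{8}())
out = similar(x0)
ForwardDiff.gradient!(out, objective, x0, cfg)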

2 Likes

Code appears to be here: ADBench/usr/awf/Julia at master · microsoft/ADBench · GitHub
There should be plenty of optimizations left to be done here: Julia's forward diff is considerably slower than C++'s finite differences. Julia's ReverseDiff has a number of problems, so I would imagine they tried it, it didn't work, and they didn't pursue it further (which is fair game). There are references to ReverseDiffSource in the code, though. In any case, writing efficient Julia code is not completely trivial, and using Julia's autodiff tools efficiently is even harder (at the moment at least), so we can't really blame the author there.

Since C++ was included, I think it is reasonable to give Julia the same amount of effort as it would take to write it in C++. That would take a lot, so it should at least cover following the advice in the performance tips. IMO that would be the minimum to make this comparison interesting: nothing more than some applications of @code_warntype and benchmarking.
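
Along the lines of the following, say (with a stand-in objective; not the benchmark code):

using BenchmarkTools, ForwardDiff

# Stand-in objective; the actual benchmark problems would go here instead.
rosenbrock(x) = sum(100 .* (x[2:end] .- x[1:end-1] .^ 2) .^ 2 .+ (1 .- x[1:end-1]) .^ 2)
x = randn(30)

# Check the objective itself for type instabilities (red `Any`s in the output),
@code_warntype rosenbrock(x)

# then benchmark the differentiated call with interpolated arguments.
@btime ForwardDiff.gradient($rosenbrock, $x)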

1 Like