Automatic Differentiation


#1

Hey,

I am still having a bit of a hard time figuring out what Flux.jl actually does. One important aspect seems to be making gradients available automatically via tracking. How does that differ from the approach in ForwardDiff.jl, i.e. why is it not possible/feasible/sensible to get the gradient of my loss with respect to all parameters using ForwardDiff?

The current tracking implementation seems to be very similar to what TensorFlow with eager execution does, right?

Thanks,

Kevin


#2

Flux uses reverse-mode AD, while ForwardDiff uses forward-mode AD (as its name implies). See e.g. Wikipedia for the difference.

Reverse mode is ideal for $\mathbb{R}^n \to \mathbb{R}$ functions (many inputs, one scalar output), while forward mode suits the opposite case, $\mathbb{R} \to \mathbb{R}^m$. That said, in practice, for small $n$ ForwardDiff can be fine.
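
To make that concrete, here is a minimal sketch comparing the two on an $\mathbb{R}^n \to \mathbb{R}$ loss. It assumes a Tracker-era Flux where `Flux.Tracker.gradient` is available (newer Flux versions expose `Flux.gradient` instead), and the `loss` function is just a made-up example:

```julia
using ForwardDiff, Flux

# A scalar loss of many parameters: R^n -> R (made-up example).
loss(w) = sum(abs2, w) / 2

w = randn(1_000)

# Forward mode: the work grows with the number of inputs n,
# since each input direction needs its own dual-number sweep
# (ForwardDiff chunks these, but the scaling in n remains).
g_fwd = ForwardDiff.gradient(loss, w)

# Reverse mode: one forward pass plus one backward pass yields
# the gradient with respect to all n inputs at once.
g_rev = Flux.Tracker.gradient(loss, w)[1]

g_fwd ≈ Flux.Tracker.data(g_rev)   # both equal w for this loss
```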


#3

Thanks for the quick answer. Reverse > forward due to speed considerations, I guess? But why not mix both - there are still some cases where Flux cannot provide a gradient but ForwardDiff can - couldn’t ForwardDiff be used as a default fallback when the tracker encounters a ‘dead end’?


#4

All AD libraries have limitations (but these are best reported as an issue with a minimal example so that they can be fixed). That said, I find ForwardDiff to be the most robust in practice, so it is indeed useful as a fallback.

Also, as you said, mixed-mode AD can be useful, especially if the problem has structure that can be exploited.
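
As a rough illustration of the fallback idea, one could register a custom gradient for an operation the tracker cannot handle and compute that gradient with ForwardDiff. This is only a sketch against Tracker-era Flux: `track`, `data` and `@grad` are Tracker's custom-gradient hooks, and `blackbox` is a hypothetical scalar function standing in for the ‘dead end’.

```julia
using Flux, ForwardDiff
using Flux.Tracker: TrackedArray, track, data, @grad

# Hypothetical scalar-valued function the tracker has no rule for.
blackbox(x::AbstractVector) = sum(sin, x) * exp(-sum(abs2, x))

# Route tracked inputs through `track` so the call lands on the tape.
blackbox(x::TrackedArray) = track(blackbox, x)

# Backward rule: ask ForwardDiff for the gradient of the primal call,
# then scale it by the incoming sensitivity Δ (the output is a scalar).
@grad function blackbox(x)
  y = blackbox(data(x))
  return y, Δ -> (Δ .* ForwardDiff.gradient(blackbox, data(x)),)
end
```

In practice the missing rule is still worth reporting upstream, but this kind of shim is handy while experimenting.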

I would recommend that you consult a good source on AD for a deeper understanding. I find

@book{griewank2008evaluating,
  title={Evaluating derivatives: principles and techniques of algorithmic differentiation},
  author={Griewank, Andreas and Walther, Andrea},
  volume={105},
  year={2008},
  publisher={SIAM}
}

especially nice.


#5

Thanks again!


#6

I believe both ReverseDiff.jl and Flux use mixed-mode AD.

It’s partly explained in this paper, if I remember correctly: https://arxiv.org/abs/1810.08297