ReverseDiff.GradientTapes for functions of the form x -> f(x, y) when gradient is to be calculated wrt x only?


I typically differentiate log-likelihood functions of the form x -> f(x, data) with respect to x, where x is a multi-dimensional input, data is some fixed dataframe, and f is a scalar-valued function. To do so, I use either ForwardDiff or ReverseDiff. I was recently looking back at the ReverseDiff documentation, which recommends using a ReverseDiff.GradientTape to prerecord f. I was wondering if there is a way to pre-compile “tapes” for functions of the form x -> f(x, data). The relevant links which made me ask this are

While searching for a solution, I also came across discussions of Capstan.jl, but I do not yet understand how this new package (which seems to be generating a lot of excitement) will improve on the existing implementations in ForwardDiff.jl and ReverseDiff.jl.

I found tapes to be very fragile in practice (a lot of seemingly innocuous Julia code has branches, which will break things).
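To make both points concrete, here is a minimal sketch of recording a tape for x -> f(x, data) by closing over the data, followed by a small illustration of the branch problem. The `loglik` function, the `data` NamedTuple, and `g` are hypothetical stand-ins invented for this example, not anything from the original question.

```julia
using ReverseDiff

# Hypothetical log-likelihood: unit-variance Gaussian errors around a line.
loglik(beta, data) =
    -0.5 * sum(abs2(data.y[i] - beta[1] - beta[2] * data.x[i]) for i in eachindex(data.y))

# Fixed data (assumed values, for illustration only).
data = (x = [1.0, 2.0, 3.0], y = [1.1, 1.9, 3.2])

# Close over `data` so the recorded function takes only `beta`.
tape = ReverseDiff.GradientTape(beta -> loglik(beta, data), zeros(2))
compiled = ReverseDiff.compile(tape)

grad = zeros(2)
ReverseDiff.gradient!(grad, compiled, [0.0, 1.0])

# Branch fragility: the tape stores only the branch taken while recording.
g(x) = x[1] > 0 ? sum(x) : -sum(x)
g_tape = ReverseDiff.GradientTape(g, [1.0, 1.0])                 # records the `sum(x)` branch
stale = ReverseDiff.gradient!(zeros(2), g_tape, [-1.0, -1.0])    # replays that branch
fresh = ReverseDiff.gradient(g, [-1.0, -1.0])                    # re-traces, takes the other branch
```

Because the closure captures `data` at recording time, the tape would have to be recorded again if the data were replaced. The `stale` result comes from replaying the branch recorded at the first input, so it should disagree with `fresh`, which is the kind of silent failure described above.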

I have created a simple interface package

which allows you to define a \mathbb{R}^n \to \mathbb{R} callable (so put the data in, e.g., a struct), then differentiate it via either ForwardDiff, ReverseDiff, Flux, or Zygote (experimental).

The docs have a worked example.
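Independently of the package, the "put the data in a struct" idea can be sketched in plain Julia. This is only an illustration of the callable pattern using ForwardDiff directly; the `LogLik` type, its fields, and the numbers are hypothetical and do not reflect the package's own interface.

```julia
using ForwardDiff

# Hypothetical container: the fixed data travels with the callable.
struct LogLik{T}
    x::Vector{T}
    y::Vector{T}
end

# Calling the object evaluates the log-likelihood at `beta` (R^2 -> R).
(ll::LogLik)(beta) =
    -0.5 * sum(abs2(ll.y[i] - beta[1] - beta[2] * ll.x[i]) for i in eachindex(ll.y))

ll = LogLik([1.0, 2.0, 3.0], [1.1, 1.9, 3.2])

# The object is an ordinary one-argument callable, so it can be handed to AD as is.
grad = ForwardDiff.gradient(ll, [0.0, 1.0])
```

Since `ll` takes a single argument, the same object could be passed to a different AD backend without rewriting a closure each time.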

Thanks, this seems very helpful and I will check it out.

Yeah, my experience with ReverseDiff has been the same. I have found ForwardDiff to be the most robust. Has this been anyone else’s experience?

My go-to AD code for, say, a function from \mathbb{R}^2 to \mathbb{R} looks something like

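using ForwardDiff

function calculate_gradient(x, data, cfg)
    # Val{false}() skips the config tag check, since cfg was built from a different closure
    ForwardDiff.gradient!(Array{Float64}(undef, 1, 2), b -> f(b, data), x, cfg, Val{false}())
end

# beta (a length-2 input) and y (the data) must already be defined at this point
const cfg = ForwardDiff.GradientConfig(x -> f(x, y), beta, ForwardDiff.Chunk{2}())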

and I have found this to be the most robust in my experiments, even though the documentation suggests using ReverseDiff for problems where f:\mathbb{R}^n \rightarrow \mathbb{R} with n > 1 (while also mentioning that ForwardDiff can be faster for low-dimensional inputs).