AD in an SGD loop: one call to 'gradient' or many?

A general question about how to think about automatic differentiation.

I would have thought of the gradient as a function: it takes a location and returns the uphill vector at that location.

If an AD library returned such a function, one would obtain that function once, outside the gradient descent loop, i.e. (pseudocode)

loss(x) = ...                  # some differentiable loss
dloss(x) = gradient(loss,x)    # dloss() is constructed only once, outside the loop
x = randinit()
for iter = 1:n
    x = x - 0.01 * dloss(x)
end

However, if AD libraries only return a vector, not a function, then it is necessary to call the AD inside the iteration.

Does anyone know which it is?

Honestly, I am finding the Flux and Knet docs hard to understand on this point! The examples seem to be either too high-level (using Adam and an SGD wrapper), too simple (not involving GD at all), or written assuming one already knows how AD works internally. (I hope to, but not yet :slight_smile:)

You can choose whatever you want. Either of

gradfun(x) = DiffLibrary.gradient(f,x)
for ...
    g = gradfun(x)
end

or

for ...
    g = DiffLibrary.gradient(f,x)
end

would work. You always have to call the AD inside the loop; there is no way around that, since the AD is what calculates the gradient. Depending on the AD library, more or less of the gradient computation is handled and optimized by the compiler: with a tape-based reverse AD, a lot happens at runtime, whereas with ForwardDiff or Zygote much of it is optimized by the compiler. Either way, you still have to call the gradient function in every iteration.
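
To make that concrete, here is a minimal sketch of the second pattern, assuming Zygote as the AD backend (any library that exposes gradient(f, x) works the same way); the loss, learning rate, and iteration count are made up for illustration:

using Zygote

loss(x) = sum(abs2, x .- 3)              # toy loss, just for illustration

function descend(loss, x0; lr = 0.01, n = 100)
    x = copy(x0)
    for _ in 1:n
        g = Zygote.gradient(loss, x)[1]  # the AD call happens in every iteration
        x = x .- lr .* g                 # plain gradient descent step
    end
    return x
end

descend(loss, randn(4))                  # moves x toward the vector of 3s

Wrapping the call in a gradfun, as in the first pattern, changes nothing about when the work happens: the gradient is still recomputed at each new x.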


Thank you.

From your answer I guess that constructing the gradfun() function, i.e. this step,

gradfun(x) = DiffLibrary.gradient(f,x)

is not so expensive compared to evaluating it, so there is no motivation to put it outside the loop?

No, nothing is done at compile time in one of those cases that cannot be done in the other; they should be exactly equivalent in performance.
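
If you want to check this yourself, a quick comparison with BenchmarkTools (here using ForwardDiff in place of the DiffLibrary placeholder, with a made-up f) should show both forms taking the same time:

using ForwardDiff, BenchmarkTools

f(x) = sum(abs2, x)
gradfun(x) = ForwardDiff.gradient(f, x)   # wrapper defined once, outside any loop

x = randn(100)
@btime gradfun($x)                        # via the wrapper
@btime ForwardDiff.gradient(f, $x)        # direct call, as written inside the loop
# both execute the same AD code; the wrapper adds no overhead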

Not too familiar with the Julia implementations, but I found this document, which seems to be a decent explanation of AD without getting too detail-oriented too quickly, if you're interested (I particularly found the comparisons to numeric differentiation and symbolic differentiation helpful): http://jmlr.org/papers/volume18/17-468/17-468.pdf.

Feel free to correct me, but my quick read of it is that:

  • Automatic differentiation numerically evaluates the derivative at a given point. It is similar to symbolic differentiation in that it uses the chain rule, and may even coincide with it in some cases, but if there is a branch in your code, AD only follows the path that is relevant for your input (see the sketch after this list). You could write a function (outside a loop) that calls AD, but every time you call that function in the loop, you are still doing the full evaluation of the derivative at the new point.
  • If you want to create the gradient function outside the loop, I think you want symbolic differentiation, which would return a function that computes the derivative at any input point (where hopefully computing the function is much cheaper than finding it). The article (and wikipedia and a bunch of other sites) have a lot to say about how symbolic differentiation can be bad for complicated computer programs though.
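
To illustrate the branching point from the first bullet, here is a small sketch, assuming ForwardDiff and a made-up f with a branch; the AD evaluates the derivative only along the branch actually taken for the given input:

using ForwardDiff

f(x) = x > 0 ? x^2 : sin(x)        # a function whose definition branches on x

ForwardDiff.derivative(f, 2.0)     # follows the x^2 branch:  2x     = 4.0
ForwardDiff.derivative(f, -1.0)    # follows the sin branch:  cos(x) ≈ 0.5403

A symbolic differentiator would instead have to produce one expression covering both branches; the AD only ever sees the path that your particular input triggers.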

Thank you, helpful comments.

In particular, I now understand that the gradient “function” cannot be constructed outside the loop if the code contains branching.