AD in an SGD loop: one call to 'gradient' or many?

A general question about how to think about automatic differentiation.

I would have thought the gradient is a function: one that takes in a location and returns the up-hill vector at that location.

If an AD library returned such a function, then one would construct it once, outside of the gradient descent loop, i.e. (pseudocode)

loss(x) = ...
dloss = gradient(loss)    # dloss() is constructed only once, outside the loop
x = randinit()
for iter = 1:n
    x = x - 0.01 * dloss(x)
end

However, if AD libraries only return a gradient vector, not a function, then it is necessary to call the AD inside the loop.

Does anyone know which it is?

Honestly, I am finding the Flux and Knet docs hard to understand on this point! The examples seem to be either too high-level (using Adam and an SGD wrapper), too simple (not involving gradient descent at all), or written assuming one already knows how AD works internally. (I hope to, but not yet :slight_smile:)

You can choose whatever you want. Either of

# option 1: wrap the AD call in a function first
gradfun(x) = DiffLibrary.gradient(f, x)
for ...
    g = gradfun(x)
end

# option 2: call the AD directly
for ...
    g = DiffLibrary.gradient(f, x)
end

would work. You always have to call the AD inside the loop; there’s no way around that, since the AD is what calculates the gradient. Depending on the AD library, more or less of the gradient computation is handled and optimized by the compiler: with a tape-based reverse-mode AD, a lot happens at runtime, while with ForwardDiff or Zygote, a lot is optimized at compile time. Either way, you still have to call the gradient function in the loop.
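To make this concrete, here is a minimal sketch in plain Python (the thread is about Julia's Flux/Knet, but the point is library-agnostic). It uses a toy forward-mode AD built on dual numbers; `Dual`, `gradient`, and `gradfun` are made-up names for illustration, not any real library's API. Note that wrapping the AD call in `gradfun` constructs nothing ahead of time; the differentiation work runs on every call inside the loop:

```python
from dataclasses import dataclass

@dataclass
class Dual:
    # Forward-mode AD: carry (value, derivative) through the computation.
    val: float
    der: float

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other, 0.0)
        return Dual(self.val * other.val,
                    self.der * other.val + self.val * other.der)
    __rmul__ = __mul__

    def __sub__(self, other):
        other = other if isinstance(other, Dual) else Dual(other, 0.0)
        return Dual(self.val - other.val, self.der - other.der)

def gradient(f, x):
    # Evaluate df/dx at the point x. This runs f itself,
    # so all the real work happens here, on every call.
    return f(Dual(x, 1.0)).der

loss = lambda x: (x - 3.0) * (x - 3.0)   # minimum at x = 3

# Pattern 1: wrap first. gradfun is just a closure; nothing is precomputed.
gradfun = lambda x: gradient(loss, x)

# Pattern 2 would call gradient(loss, x) directly in the loop; same work.
x = 0.0
for _ in range(200):
    x = x - 0.05 * gradfun(x)

print(round(x, 3))   # converges near 3.0
```

Both patterns do identical work per iteration; the wrapper only changes how the call is spelled.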


Thank you.

From your answer I guess that constructing the gradfun() function, i.e. this step

gradfun(x) = DiffLibrary.gradient(f,x)

is not so expensive compared to evaluating it, so there is no motivation to move it outside the loop?

Correct: nothing is done at compile time in one of the cases that cannot be done in the other. They should be exactly equivalent in performance.

Not too familiar with the Julia implementations, but I found this document, which seems to be a decent explanation of AD without getting too detail-oriented too quickly, if you’re interested (I particularly found the comparisons to numeric differentiation and symbolic differentiation helpful):

Feel free to correct me, but my quick read of it is that:

  • Automatic differentiation numerically evaluates the derivative at a given point. It is similar to symbolic differentiation in that it uses the chain rule, and may even coincide with it in some cases, but if there is a branch in your code, AD takes only the path relevant to your input. You can write a function (outside a loop) that calls AD, but every time you call that function inside the loop, you still do the full evaluation of the derivative at the new point.
  • If you want to create the gradient function outside the loop, I think you want symbolic differentiation, which returns a function that computes the derivative at any input point (where, hopefully, evaluating that function is much cheaper than deriving it). The article (and Wikipedia and a bunch of other sites) has a lot to say about how symbolic differentiation can go badly for complicated computer programs, though.
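A toy illustration of the branching point, in plain Python (not any real AD library; `Dual` and `gradient` are hypothetical names for a minimal forward-mode AD). The branch is decided by the value of the input, so each call differentiates only the path actually taken, and there is no single expression that could have been built once beforehand:

```python
class Dual:
    def __init__(self, val, der):
        self.val, self.der = val, der
    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o, 0.0)
        return Dual(self.val * o.val, self.der * o.val + self.val * o.der)
    __rmul__ = __mul__
    def __gt__(self, c):       # comparisons look only at the value,
        return self.val > c    # so control flow follows the primal path

def f(x):
    if x > 0:            # branch chosen by the *value* of x
        return x * x     # derivative 2x on this path
    return -3.0 * x      # derivative -3 on this path

def gradient(f, x):
    # forward-mode AD: run f on (x, dx=1) and read off the derivative
    return f(Dual(x, 1.0)).der

print(gradient(f, 2.0))    # 4.0  -- only the x*x path was differentiated
print(gradient(f, -1.0))   # -3.0 -- only the -3x path
```

Each call re-runs `f`, so the gradient is always an evaluation at a point, never a precomputed expression covering both branches.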

Thank you, helpful comments.

In particular, I now understand why the gradient “function” cannot be constructed once outside the loop when the code contains branching.