A general question about how to think about automatic differentiation.
I might have thought that the gradient is a function: it takes in a location and returns the uphill vector at that location.
If an AD library returned such a function, one would construct it once, outside the gradient descent loop, e.g. (pseudocode):
```
loss(x) = ...
dloss = gradient(loss)   # dloss is constructed only once, outside the loop
x = randinit()
for iter = 1:n
    x = x - 0.01 * dloss(x)
end
```
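For what it's worth, this is exactly the usage I am hoping for. My (possibly wrong) reading of the AutoGrad.jl README (the AD underneath Knet) is that `grad(loss)` returns such a function. A minimal sketch of what I mean, with a made-up quadratic loss:

```julia
using AutoGrad                    # Knet's AD; grad(f) as I understand its README

loss(x) = sum(abs2, x .- 3)       # toy quadratic loss, just for illustration

dloss = grad(loss)                # built once, outside the loop; dloss(x) should return the gradient of loss at x

function descend(x; lr = 0.01, n = 100)
    for _ in 1:n
        x = x .- lr .* dloss(x)   # only the returned gradient function is called inside the loop
    end
    return x
end

descend(randn(5))
```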
However, if AD libraries only return a gradient vector, not a function, then it is necessary to call the AD inside the loop on every iteration.
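For concreteness, here is what I think that second pattern looks like with Zygote (the AD behind Flux); again a minimal sketch with a made-up loss, so please correct me if I have the API wrong:

```julia
using Zygote                      # Flux's AD

loss(x) = sum(abs2, x .- 3)       # toy quadratic loss, just for illustration

function descend(x; lr = 0.01, n = 100)
    for _ in 1:n
        g = gradient(loss, x)[1]  # AD is invoked on every iteration and returns the gradient value at x
        x = x .- lr .* g
    end
    return x
end

descend(randn(5))
```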
Does anyone know which it is?
Honestly, I am finding the Flux and Knet docs hard to understand on this point! The examples seem to be either too high-level (using Adam or an SGD wrapper), too simple (not involving gradient descent at all), or written as if one already knows how AD works internally. (I hope to, but not yet.)