Evaluation time of generic gradient vs gradient at a particular input using Flux.jl

I am a beginner in ML coming from the scientific computing field, getting started with Flux.jl for a simple deep learning problem (not very deep, actually) with an MLP neural network (NN). As in the usual mini-batch stochastic gradient training approach, I need to evaluate the gradient of the NN with respect to its parameters at several inputs in each epoch. I have tried both constructing a generic gradient function and evaluating the gradient directly at each specific input, using the gradient function from Flux/Zygote. My question: I don't see any advantage in evaluation time with the generic gradient function when it is evaluated at new input values. Is there a faster way to do this, where the gradient information can be reused, since the architecture of the NN is fixed? The related code snippets and benchmarks follow:

using Flux, BenchmarkTools

model = Chain(Dense(1,5,relu), Dense(5,5,relu), Dense(5,1,identity))
eval_model(x) = model([x])[1]   # index into the 1-element output so the result is a scalar (gradient needs a scalar)
par = Flux.params(model)

x_test = 0.5;
gr_generic(x) = gradient(() -> eval_model(x), par)
gr_specific = gradient(() -> eval_model(x_test), par)

# Timings taken after two warm-up @btime runs, so first-call overheads are excluded in all the cases below
julia> @btime gr_specific = gradient(() -> eval_model(x_test), par)
  33.552 μs (311 allocations: 20.70 KiB)

julia> @btime gr_generic(x_test)
  33.836 μs (313 allocations: 21.03 KiB)

# Also evaluating the generic one at a new value, to see if it gives any advantage:
x_test_new = 0.3

julia> @btime gr_generic(x_test_new)
  33.797 μs (313 allocations: 21.03 KiB)

# compared to a new gr_specific evaluation below:
julia> @btime gr_specific_new = gradient(() -> eval_model(x_test_new), par)
  33.742 μs (311 allocations: 20.70 KiB)

We can see that both of the above are nearly the same (the generic one is actually very slightly slower, with a few extra allocations), so having a generic gradient function brings no advantage here when it is evaluated at a new input value. Which of the two is recommended? And if there is a way to reuse the gradient information so that evaluations become faster, that would be really great!

Thanks a lot!

Can you clarify what “generic” and “specific” mean here? To my eyes, both versions are doing the exact same thing. That is, there is no “reuse of gradient information” at all.


Thanks a lot for your reply! I realised over the last few days that the "generic" and "specific" versions do the same thing. But I am still unsure whether it is possible to reuse any of the gradient information when evaluating the gradient at many inputs. Is there some functionality that remembers the autodiff graph, making later evaluations faster than the first run rather than reconstructing it every single time? Or does this already happen internally in Zygote? I am unsure because the run time stays the same even after the first run.

Zygote is a source-to-source AD, so instead of getting back an explicit graph as you would in TensorFlow, it compiles a function that returns the gradients. After the first compilation, subsequent calls can use the cached version of this function (and of the generated functions it calls). You can see this effect by putting @time on the first call to gradient, or by using @benchmark (which also reports maximum times) instead of just @btime.
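For example, a minimal sketch of how to observe this (it assumes a fresh Julia session plus the model, eval_model, par and gr_generic definitions from the original post, so the very first gradient call still includes compilation):

using Flux, BenchmarkTools

# first call: the reported time includes Zygote compiling the gradient code
@time gr_generic(0.7)

# second call: the compiled gradient code is reused, so this is much faster
@time gr_generic(0.7)

# @benchmark then samples the already-compiled version many times and reports
# the minimum, median, mean and maximum times
@benchmark gr_generic(0.7)

The cached code is keyed on the types involved, not on the input value, which is why evaluating at a new x does not trigger recompilation.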


Thanks a lot for clearing up my misconception. Good to know that Julia caches the compiled gradient function. I'll try out @time and @benchmark as you suggested, and I'll read more on what Zygote is doing under the hood. 🙂