When I calculate the output of a function and a directional derivative using forward-mode autodiff (ForwardDiff, TaylorDiff, etc.), is the calculation of the derivatives carried out in parallel?
By parallel I mean: assuming both my function and its derivatives are GPU-friendly (e.g. a neural network), does calculating the derivatives alongside the result make the overall evaluation slower? Or could I hope for “free” derivatives given enough memory?
Just to clarify, I think there are three ways to interpret your question:
1. If the function runs on the GPU (parallelizing over input), does the derivative run on the GPU too?
2. Assuming the derivative runs on the GPU (parallelizing over input), how much will one derivative slow down the primal program?
3. Assuming the derivative runs on the GPU (parallelizing over input), how many derivatives are computed simultaneously (parallelizing over directions/tangents)?
As far as ForwardDiff.jl is concerned:
Regarding 1: it depends on the operator. If I remember correctly, ForwardDiff.derivative will run fine on GPU arrays, but ForwardDiff.gradient will fail due to scalar indexing.
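A minimal sketch of that distinction, assuming CUDA.jl and ForwardDiff.jl are installed and a GPU is available (the functions `f`, `g` and the array `x` are just illustrative, and exact behavior can vary across package versions):

```julia
using CUDA, ForwardDiff

x = CUDA.rand(Float32, 1_000)

# ForwardDiff.derivative differentiates with respect to a scalar input, so the
# Dual number just flows through GPU broadcasts (Duals are isbits-friendly).
f(t) = sum(sin.(t .* x))
ForwardDiff.derivative(f, 1.0f0)   # typically works on GPU arrays

# ForwardDiff.gradient treats the array itself as the input and seeds/extracts
# partials element by element, which trips CUDA.jl's scalar-indexing guard.
g(v) = sum(sin, v)
# ForwardDiff.gradient(g, x)       # expected to error: scalar indexing disallowed
```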
Regarding 2: in forward mode, autodiff theory tells us that evaluation shouldn’t be slowed down too much when derivatives are propagated alongside the primals… but that’s not always true in practice. For instance, if the primal function takes an optimized code path for Matrix{Float64} (like a BLAS call), the derivative requires working with Matrix{Dual{Float64}}, which falls back to a much slower generic pure-Julia implementation.
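A rough illustration of that effect (the matrix size is arbitrary; timings are machine-dependent, and the first `@time` of each call includes compilation, so run it twice or use BenchmarkTools.@btime for a fair comparison):

```julia
using ForwardDiff, LinearAlgebra

A, B = rand(256, 256), rand(256, 256)

# Primal path: Matrix{Float64} * Matrix{Float64} dispatches to an optimized BLAS gemm.
@time A * B

# Dual path: attaching one partial per entry turns these into Matrix{Dual},
# so multiplication falls back to Julia's generic (non-BLAS) matmul.
D = ForwardDiff.Dual.(A, rand(256, 256))
E = ForwardDiff.Dual.(B, rand(256, 256))
@time D * E   # typically much slower than the BLAS call above
```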
Regarding 3: this is controlled by the so-called chunk size, which sets how many partials are stored in each Dual number, i.e. how many tangent directions are propagated simultaneously.
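For example, a small sketch using ForwardDiff’s documented chunk-size option (the function `f`, the input dimension, and the chunk size 10 are arbitrary choices for illustration):

```julia
using ForwardDiff

f(x) = sum(abs2, x)
x = rand(100)

# With Chunk{10}, each Dual carries 10 partials, so 10 tangent directions are
# propagated per pass and the 100-entry gradient takes ceil(100 / 10) = 10
# evaluations of f.
cfg = ForwardDiff.GradientConfig(f, x, ForwardDiff.Chunk{10}())
ForwardDiff.gradient(f, x, cfg)
```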