Vectorized gradient using ForwardDiff

Hi and greetings :slight_smile:

I am trying to compute derivatives (later on up to the 3rd or 4th degree) of vectorized functions using ForwardDiff. The functions are generally in the following form:

using BenchmarkTools
using StaticArrays
using ForwardDiff

# broadcasting inside function
vars = [rand(1000) for _ in 1:2]
func(vector_of_vars::Vector{<:Vector{<:Number}}) = 5.0 .* vector_of_vars[1] .* vector_of_vars[1] .* vector_of_vars[1] .* vector_of_vars[2]

@btime func(vars) # -> 657.443 ns (1 allocation: 7.94 KiB)

Some research yielded the following approach: (Vectorization of ForwardDiff.gradient function - #4 by rdeits)

# vectorize outside the function
vars_rows = [rand(2) for _ in 1:1000]
func_rows(vars_rows::Vector{<:Number}) = 5 * vars_rows[1] * vars_rows[1] * vars_rows[1] * vars_rows[2] 
@btime func_rows.(vars_rows) # -> 1.158 μs (4 allocations: 8.00 KiB)

grad_func_rows = x -> ForwardDiff.gradient(func_rows, x)
@btime grad_func_rows.(vars_rows) # 229.625 μs (4500 allocations: 297.19 KiB)

Seeing as there is this enourmous performance decrease, I tried to see if I can copy and reimplement some simplified basis of ForwardDiff with some help from various sources including the paper from ForwardDiff and Youtube yielded the following:

struct Dual{T, N} <: Number

import Base: convert, promote_rule, show, *
Base.convert(::Type{<:Dual}, val::Real) = Dual(val, zero(val))
Base.promote_rule(::Type{<:Dual}, ::Type{<:Number}) = Dual, x::Dual) = print(io, x.value, " + ", x.partials, "𝜖")
*(x::Dual, y::Dual) = Dual(x.value .* y.value, (x.value .* y.partials .+ x.partials .* y.value))

vars_dual = [[Dual(rand(), @SVector[1.0, 0.0]) for _ in 1:1000], [Dual(rand(), @SVector[0.0, 1.0]) for _ in 1:1000]]
@btime func(vars_dual) # -> 1.613 μs (2 allocations: 23.48 KiB)

The manual approach uses the vectorization inside the function and (I guess) has better access to SIMD and other optimizations.
On top of the decrease seen here, in my real code, the function evaluations are quite nested and go many levels deep, which favors the vectorization inside rather than outside even more (I would think).

So my questions is, can I somehow make ForwardDiff work with the vector evaluations. Maybe by manually creating a vectors of Duals using the ForwardDiff infrastructure? Do you have any other insights I might have missed?

Thank you very much in advance :slight_smile:

Can you elaborate a bit on what you’re asking? It isn’t clear to me. Your second code chunk that defines vars_rows and func_rows doesn’t run because vars_rows is a Vector{Vector{Float64}}, not a Vector{<:Number} (more on that later), so I’m confused how you can even report timings for that. I would guess you had some other method defined for func_rows in your REPL and you’re not benchmarking what you think you’re benchmarking.

A couple generic notes from looking at your code, though:

  • You should interpolate globals when you use @btime with the syntax @btime func($vars).
  • Your type annotations as written are likely not achieving anything helpful because Vector{<:Vector{<:Number}} covers a lot of very generic and inadvisable composite types, and so I would guess the only time this annotation would actually change anything is by throwing an error if you passed in something that wasn’t <:Vector on the inner container, like a Vector{NTuple{N,Float64}} or something where the function would make perfect sense to evaluate, and so it will just sort of reduce composability without helping performance. In general, for a function like this I think it would be best to just not use any type annotations. If you want them for correctness and for forcing specialization (which I’m not even sure happens, the compiler is complicated and I’m not an expert), you could use my_function(x::Vector{Vector{T}}) where{T}. But again, that won’t help performance I don’t think and will just reduce composability.
  • Writing your own Dual number methods where you don’t have to is pretty inadvisable, and if all you’re trying to accomplish is the output of your first func then you definitely don’t have to.
  • I think your mental model of “vectorization” is a bit different from what people on this forum might mean. I would guess you’re coming from R or something, where it means the same thing as “broadcasting” in the Julia community. And a Vector{T} data structure means a chunk of contiguous memory representing several objects of type T, not some structure that necessarily signals any kind of broadcasting.

ForwardDiff certainly passes around vectors of Dual types, so I think the answer to your literal question is that you shouldn’t have to think about these details and that the performance issues are coming from somewhere else. What would probably be more productive is to think about things like type stability. Loops in Julia are fast, and so maybe you should consider writing a version of your function that’s just a bunch of loops and then use @code_warntype or something so make sure there are no instabilities, and then use that to obtain a benchmarking standard.

1 Like

Thank you for the reply :slight_smile:

My core problem is that I have equations from R^n → R^1 that I want do evaluate for many points at once. In the example in the above func I have two input variables and one output, which is evaluated for 1000 rows at the time. In my original code, the equations are represented as deeply nested binary trees. (I am writing/adapting a Symbolic Regression library for thermodynamic property modelling).

In the second code snippet, the function (and the gradient) is evaluated for the two input variables for each row individually. But this seems way more inefficient and will become more so in my original code, when the binary tree has to be traversed for each for the rows of data.

As for the ambigous types, you are quite right. At first, I had Float64 rather than Number, but ForwardDiff requires a more broad definition so that the duals can be evaluated.
Regarding the mistaken type in the second snippet: I call the function with " . ", making the individual inputs ::Vector{<:Number}.

In short, my “desired” behavior would be:

vars = [rand(1000) for _ in 1:2]
func(vector_of_vars::Vector{<:Vector{<:Number}}) = 5.0 .* vector_of_vars[1] .* vector_of_vars[1] .* vector_of_vars[1] .* vector_of_vars[2]

grad_func = x -> ForwardDiff.gradient(func, x, "some config to expect vectors")
vector_of_gradients = grad_func(vars) # outputs the gradients wrt the the two input variables at each of the 1000 rows

I also had some approach (that I forgot to save weeks ago) where the vectors were accpted in a different form using jacobian, but were differentiated wrt each of the rows giving me (for this case) a 1000x1000 matrix (in contrast to the 1000x2 I need) with diagonal entries, while the remainder was 0.0. Maybe this could also be a lead to make this work, but I can quite wrap my head around formulating the problem in this way. And I think there should be way to “just specify” 1000 rows, differentiate wrt the two input vectors.

Thanks for all the other pointers. I will have a closer look. I learned Julia during Advent of Code last December, and you are right, have been programming in Python before that.


Ah, I see, thanks for the clarification. And welcome to the community! Please excuse my error about the second code snippet. I didn’t actually paste that line into my REPL and missed that you were broadcasting.

I think the simplest answer to your question is just to slightly reshape the way you’re using the data. Here is what I mean:

using ForwardDiff, StaticArrays, BenchmarkTools

func(x) = 5.0 .* x[1] .* x[1] .* x[1] .* x[2] # your version
func_rn_to_r(x) = 5*x[2]*x[1]^3 # a version that takes a Vector{T} and returns a T.

# Since the new version of the function is R^n -> R, you can make a new function that 
# returns the gradient:
func_grad(x) = ForwardDiff.gradient(func_rn_to_r, x)

# Now if you reshape your points, you can just broadcast that func_grad directly.
# Note that I'm using StaticArrays here for some boost in speed and reduced allocations.
vars = [rand(1000) for _ in 1:2]
vars_reshape = [@SVector [vars[1][j], vars[2][j]] for j in 1:1000]

println("Original timing:")
@btime func($vars) # 751 ns, 1 alloc (7.94 KiB)

println("Reshaped timing:")
@btime func_rn_to_r.($vars_reshape) # 429.9 ns, 1 alloc (7.94 KiB)

println("ForwardDiff gradient:")
@btime func_grad.($vars_reshape) # 827 ns, 1 alloc (15.75 KiB)

I know this is just an MWE and so maybe the reshaping isn’t so simple in your real case. But given the discussion, I think the best general advice here would be to instead adjust the way you’re handling the data and use your AD programs in the simple way like this snippet does.

Does this get to your question, or am I still off?

1 Like

Thank you for showing that the problem lies in the typestabilty. This proves now conclusively that I need to take a couple of steps back and try and to rethink my equation structures.

I had many different iterations of the equation structures and finally decided to take the a less typestable one, since I create thousands of equations, used to evaluate them once, and then mutate them. The typestabler equation structures were a lot faster for evaluation, but very slow and cumbersome to mutate and create. But now I started constant fitting, for which I require a lot more evaluation, which may tip the scales in favor of the other structures. And on top, the performance of the derivatives speaks for itself. And seeing as I will have to go as deep as the 4th or 5th degree, the change of structure is definitely worth while.

Thank you very much again and see you around :slight_smile:

Good stuff! If you try higher-order AD and your compile times get rough, you might consider looking at TaylorSeries.jl, which does “Taylor-mode” AD and is designed for higher-order derivatives. You might also consider Symbolics.jl if your functions are simple enough, which could literally compute the symbolic derivatives for you and then compile them into functions that, at least in my experience, are as good as hand-written ones.

Good luck!