Also note that my package DifferentiationInterface.jl is meant to make your life easier in such situations, by hiding away the details of preallocation and performance optimization. Take a look at the tutorial if you’re interested.
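For reference, a minimal sketch of what using it looks like (assuming a recent DifferentiationInterface version, where backend objects like AutoForwardDiff come from ADTypes and are re-exported, and preparation precedes the backend in the argument list):

```julia
using DifferentiationInterface
import ForwardDiff

f(x) = sum(x) .* ones(length(x))  # same toy function as below, out-of-place

backend = AutoForwardDiff()
x = ones(10)

# one-shot Jacobian
J = jacobian(f, backend, x)

# preparation hides the JacobianConfig / preallocation details for repeated calls
prep = prepare_jacobian(f, backend, x)
J2 = jacobian(f, prep, backend, x)
```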
My first tests with foo!() instead of foo() fail: the resulting Jacobian is a zero matrix.
Should ForwardDiff.jacobian() be replaced by ForwardDiff.jacobian!() as well?
Can you provide an MWE?
Here is an attempt:
function foo!(y, x)
    y = sum(x) * ones(length(x))
    return y
end
x1 = ones(10); y1 = zeros(length(x1))
y1 = foo!(y1,x1)
ForwardDiff.jacobian(foo!, y1, x1)
which yields the output
10×10 Matrix{Float64}:
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
The issue here is that you’re not modifying the argument y passed to the function in place. You’re creating a new object (which happens to have the same name) and filling it with the desired output. What you need to do is mutate the very y that the function receives.
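To see the difference in isolation, here is a tiny standalone demonstration (the names rebind! and mutate! are just for illustration):

```julia
function rebind!(y, x)
    y = fill(sum(x), length(x))  # rebinds the local name y; the caller's array is untouched
    return y
end

function mutate!(y, x)
    y .= sum(x)  # writes into the caller's array
    return y
end

x = ones(3)
y = zeros(3)
rebind!(y, x)
# y is still [0.0, 0.0, 0.0]
mutate!(y, x)
# y is now [3.0, 3.0, 3.0]
```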
A simple fix is the following:
function foo!(y, x)
    # notice the dot to denote elementwise assignment
    y .= sum(x) * ones(length(x))
    return y
end
but a more clever one would be
function foo!(y, x)
    y .= sum(x)
end
You may want to take a look at the docs page on performance tips to improve such in-place functions and perhaps avoid allocations altogether.
Thx again.
I am now obtaining the correct Jacobian with in-place functions, but the improvement in memory allocations remains marginal.
Version 1 - Using jacobian (no bang)
function foo!(y, x)
    # need to mutate the very y that the function receives
    # notice the dot to denote elementwise assignment
    y .= sum(x) * ones(length(x))
    return y
end
x1 = ones(10); y1 = zeros(length(x1))
y1 = foo!(y1,x1)
cfg = ForwardDiff.JacobianConfig(foo!,y1,x1)
@btime ForwardDiff.jacobian(foo!, y1, x1, cfg)
resulting in
528.361 ns (4 allocations: 2.88 KiB)
Version 2 - Using jacobian! (with bang)
x1 = ones(10); y1 = zeros(length(x1))
y1 = foo!(y1,x1)
cfg = ForwardDiff.JacobianConfig(foo!,y1,x1)
@btime ForwardDiff.jacobian!(DiffResults.JacobianResult(y1,x1),foo!,y1,x1,cfg)
resulting in
600.000 ns (6 allocations: 3.05 KiB)
More soon.
The call to ones allocates a new array. It is unnecessary here, as broadcasting the assignment takes care of writing into all of y anyway:
function baz!(y, x)
    y .= sum(x)
end
# Better to interpolate global variables when benchmarking to avoid spurious allocations
julia> @btime foo!($y1,$x1);
143.241 ns (2 allocations: 288 bytes)
julia> @btime baz!($y1,$x1);
16.323 ns (0 allocations: 0 bytes)
Thx!
That drastically reduces the number of allocations (hurray!).
The CPU time, however, remains roughly equal to that of the earlier, more naive implementations (see Post #18 above).
function foo!(y, x)
    # need to mutate the very y that the function receives
    # notice the dot to denote elementwise assignment
    y .= sum(x)
    return y
end
x1 = ones(10); y1 = zeros(length(x1))
y1 = foo!(y1,x1)
cfg = ForwardDiff.JacobianConfig(foo!,y1,x1)
@btime ForwardDiff.jacobian(foo!, $y1, $x1, cfg)
432.161 ns (2 allocations: 1.75 KiB)
x1 = ones(100); y1 = zeros(length(x1))
y1 = foo!(y1,x1)
cfg = ForwardDiff.JacobianConfig(foo!,y1,x1)
@btime ForwardDiff.jacobian(foo!, $y1, $x1, cfg)
24.792 μs (5 allocations: 80.30 KiB)
x1 = ones(1000); y1 = zeros(length(x1))
y1 = foo!(y1,x1)
cfg = ForwardDiff.JacobianConfig(foo!,y1,x1)
@btime ForwardDiff.jacobian(foo!, $y1, $x1, cfg)
2.425 ms (3 allocations: 7.63 MiB)
Try it again while interpolating (putting a $ on) the cfg object too, since it is also a global variable.
Hardly any change:
287.826 ns (1 allocation: 896 bytes) # problem size 10
24.666 μs (4 allocations: 79.08 KiB) # problem size 100
2.399 ms (2 allocations: 7.63 MiB) # problem size 1000
For optimal performance, you also want to pre-allocate the output matrix:
using ForwardDiff, BenchmarkTools
foo!(y, x) = y .= sum(x)
function benchmark_trivial_jacobian(n)
    x = ones(n)
    y = similar(x)
    J = similar(y, length(y), length(x))
    cfg = ForwardDiff.JacobianConfig(foo!, y, x)
    @btime ForwardDiff.jacobian(foo!, $y, $x, $cfg)
    @btime ForwardDiff.jacobian!($J, foo!, $y, $x, $cfg)
    return nothing
end
But it doesn’t change much for larger problems, except for memory use:
julia> benchmark_trivial_jacobian(10)
370.061 ns (1 allocation: 896 bytes)
304.813 ns (0 allocations: 0 bytes)
julia> benchmark_trivial_jacobian(100)
27.706 μs (19 allocations: 89.31 KiB)
26.636 μs (17 allocations: 11.14 KiB)
julia> benchmark_trivial_jacobian(1000)
3.515 ms (169 allocations: 7.73 MiB)
3.423 ms (167 allocations: 104.89 KiB)
Thanks. I much appreciate all your input.
I would like to better understand why the version with all the bells and whistles (buffered input, preallocated output) is not faster than the more naive implementation.
Stated differently, I wonder why the reduction in the number of allocations does not result in a reduction of CPU time.
Is this due to the Jacobian being dense? Is looking into a MWE with a sparse Jacobian worthwhile here?
At some point you’re limited by
- the time it takes your function to run
- the overhead caused by autodiff
If you know for a fact that your Jacobian is sparse, then yes you can accelerate it significantly by making use of that knowledge. See the DifferentiationInterface sparse tutorial for an example.
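As a rough sketch of the sparse approach, following the pattern from that tutorial (it assumes the SparseConnectivityTracer and SparseMatrixColorings companion packages for sparsity detection and coloring):

```julia
using DifferentiationInterface
import ForwardDiff
using SparseConnectivityTracer, SparseMatrixColorings

# sparse backend: detect the sparsity pattern, then color columns
# so that several Jacobian columns are computed per dual-number pass
backend = AutoSparse(
    AutoForwardDiff();
    sparsity_detector = TracerSparsityDetector(),
    coloring_algorithm = GreedyColoringAlgorithm(),
)

f(x) = x .^ 2  # toy function with a diagonal (hence very sparse) Jacobian
x = ones(4)
J = jacobian(f, backend, x)  # returned as a sparse matrix
```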
It may also be worth pondering whether you need a full Jacobian at all. For instance, if you compute J only to take its products with some vector, there are faster ways to achieve this.
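For example, with ForwardDiff alone, the product J(x) * v is a directional derivative, computable in a single dual-number pass without ever materializing J (the function f below is just a placeholder):

```julia
using ForwardDiff

f(x) = [sum(x), prod(x)]  # placeholder function

x = [1.0, 2.0, 3.0]
v = [1.0, 0.0, 0.0]

# J(x) * v equals d/dt f(x + t*v) evaluated at t = 0
jvp = ForwardDiff.derivative(t -> f(x .+ t .* v), 0.0)
```

This costs one dual-number evaluation of f, versus on the order of length(x) such evaluations (in chunks) for the full Jacobian.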
Thanks again.
The Jacobian in the application I target is dense, and the application requires solving a linear system with it.
Does using ForwardDiff on a function that calls hcubature with a buffer as an argument cause particular challenges?
Without the buffer, things work fine; with it, I get a type conversion error.
You’re gonna need PreallocationTools.jl to automatically generate buffers with the right element type.
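A minimal sketch of how that could look (bar! and the buffer size are hypothetical, just to illustrate the DiffCache/get_tmp pattern):

```julia
using ForwardDiff, PreallocationTools

# hypothetical function that needs a scratch buffer
function bar!(y, x, cache)
    buf = get_tmp(cache, x)  # returns a buffer whose eltype matches x (Dual-aware)
    buf .= 2 .* x
    y .= sum(buf)
    return y
end

x = ones(5)
y = zeros(5)
cache = DiffCache(zeros(5))  # preallocates both a Float64 and a Dual buffer

J = ForwardDiff.jacobian((y, x) -> bar!(y, x, cache), y, x)
```

The same cache then works for both the plain evaluation (Float64 input) and the differentiated call (Dual input), without the type conversion error.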
Aha! Will check it out. Thx again.