Allocations using dualcache

I have a rather complicated function that I’ve written in a non-allocating manner, and am now trying to use ForwardDiff.jacobian! to generate a non-allocating jacobian. The function has many substeps, so a pre-allocated cache is required, and I’ve used the dualcache function from PreallocationTools.jl to create the required caches. While this works (it produces the correct answer when tested against a hand-written allocating jacobian routine), it still creates a large number of allocations. I’ve already read quite a few discussions on this topic; some of the posted solutions have helped, but allocations still remain.

The problem seems to be that while get_tmp(cache, vec) creates no allocations when vec is a float vector, it does allocate when called on a dual vector. Here’s a somewhat minimal example of my basic code structure:

using ForwardDiff, PreallocationTools
using LinearAlgebra

struct Solver
    A::Matrix{Float64}
    cache::PreallocationTools.DiffCache{Vector{Float64}, Vector{Float64}}
end

const ChunkSize = 12;

function test(testfunc!, N)
    print("\nN = ", N, "\n")

    # ChunkSize = ForwardDiff.pickchunksize(N)
    A = rand(N, N);
    B = rand(N);
    solver = Solver(A, dualcache(B, ChunkSize));

    C = similar(B);
    J = similar(A);
    func!(y, x) = testfunc!(y, x, solver);

    print("...Testing function\n")
    @time func!(C, B);

    cfg = ForwardDiff.JacobianConfig(func!, C, B, ForwardDiff.Chunk{ChunkSize}());
    print("...Testing jacobian\n")
    @time ForwardDiff.jacobian!(J, func!, C, B, cfg);
end

function example1!(y::Vector{<:Real}, x::Vector{<:Real}, solver::Solver)
    mul!(y, solver.A, x)
    return y
end
function example2!(y::Vector{<:Real}, x::Vector{<:Real}, solver::Solver)
    tmp = get_tmp(solver.cache, x)
    mul!(tmp, solver.A, x)
    @. y = tmp*tmp
    return 0
end
function example3!(y::Vector{<:Real}, x::Vector{<:Real}, solver::Solver)
    a = @allocated tmp = get_tmp(solver.cache, x); println(a)
    mul!(tmp, solver.A, x)
    @. y = tmp*tmp
    return 0
end

function testmany(funcs)
    for func in funcs
        print("\nTesting ", typeof(func), "\n")
        test(func, 100)
        test(func, 200)
    end
end

testmany([example1!, example2!, example3!])
testmany([example1!, example2!, example3!])

example1! is trivial, but runs correctly, in that neither the function nor the jacobian allocate:

Testing typeof(example1!)

N = 100
...Testing function
  0.000005 seconds
...Testing jacobian
  0.000211 seconds

N = 200
...Testing function
  0.000009 seconds
...Testing jacobian
  0.001240 seconds

example2! is closer to my case, and we now see allocations appear in the jacobian only:

Testing typeof(example2!)

N = 100
...Testing function
  0.000004 seconds
...Testing jacobian
  0.000221 seconds (9 allocations: 92.812 KiB)

N = 200
...Testing function
  0.000010 seconds
...Testing jacobian
  0.001268 seconds (34 allocations: 346.109 KiB)

The allocations clearly scale with the problem size. example3! just wraps an @allocated block around the get_tmp call to track its allocations:

Testing typeof(example3!)

N = 100
...Testing function
  0.000005 seconds
...Testing jacobian
10560
10560
10560
10560
10560
10560
10560
10560
10560
  0.000985 seconds (82 allocations: 95.219 KiB)

N = 200
...Testing function
  0.000008 seconds
...Testing jacobian
20848
20848
20848
20848
20848
20848
20848
20848
20848
20848
20848
20848
20848
20848
20848
20848
20848
  0.003006 seconds (172 allocations: 350.688 KiB)

And we see that the call to get_tmp is producing allocations, but only when called on a dual vector. Just to double-check what’s going on in the simplest possible setting, I also tried:

let
A = zeros(10);
DA = zeros(ForwardDiff.Dual{ForwardDiff.Tag{nothing, Float64}, Float64, ChunkSize}, 10);
cache = dualcache(A, ChunkSize);
a = @allocated get_tmp(cache, A);  println(a)
a = @allocated get_tmp(cache, A);  println(a)
a = @allocated get_tmp(cache, DA); println(a)
a = @allocated get_tmp(cache, DA); println(a)
end

which returns:

0
0
1168
1168

Perhaps how I’m constructing the dualcache is wrong?

As a side note, if I uncomment the line:

# ChunkSize = ForwardDiff.pickchunksize(N)

from the definition of the function test, then allocations appear even for example1!, though they don’t scale with the problem size:

Testing typeof(example1!)

N = 100
...Testing function
  0.000006 seconds
...Testing jacobian
  0.000175 seconds (2 allocations: 1.250 KiB)

N = 200
...Testing function
  0.000009 seconds
...Testing jacobian
  0.001134 seconds (2 allocations: 1.250 KiB)

@code_warntype test(example1!, 100) now reports a type instability, while the original definition didn’t.
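I suspect this is because assigning ChunkSize inside test makes it a non-constant local, so the type Chunk{ChunkSize} is only known at runtime. Here is a Base-only sketch of what I believe is the same effect (the names f, unstable, and stable are made up for illustration; Val stands in for ForwardDiff.Chunk):

```julia
# Hypothetical illustration: a function parameterized on a Val type,
# analogous to code parameterized on ForwardDiff.Chunk{ChunkSize}.
f(::Val{N}) where {N} = N + 1

# The Val's type parameter depends on a runtime value, so the call to f
# cannot be statically dispatched; @code_warntype flags its result as Any.
unstable(n) = f(Val(n))

# The parameter is a compile-time constant, so the call is fully inferred.
stable() = f(Val(12))
```

If the analogy holds, the small fixed-size (non-scaling) allocations for example1! would just be the boxing from this dynamic dispatch, which is why moving ChunkSize out to a global const makes them disappear.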

I’m running Julia v1.7.3 on a Mac, with ForwardDiff v0.10.30.

Thanks!

Tracing a little further back in the source, it appears that some combination of the view / restructure in:

function get_tmp(dc::DiffCache, u::AbstractArray{T}) where {T <: ForwardDiff.Dual}
    nelem = div(sizeof(T), sizeof(eltype(dc.dual_du))) * length(dc.du)
    if nelem > length(dc.dual_du)
        enlargedualcache!(dc, nelem)
    end
    ArrayInterfaceCore.restructure(dc.du, reinterpret(T, view(dc.dual_du, 1:nelem)))
end

(lines 52-58 of PreallocationTools.jl at master in SciML/PreallocationTools.jl on GitHub) is the culprit. Indeed, just defining my_get_tmp as below:

function my_get_tmp(dc::PreallocationTools.DiffCache, u::Vector{T}) where {T <: ForwardDiff.Dual}
    return reshape(reinterpret(T, dc.dual_du), size(dc.du))
end
function my_get_tmp(dc::PreallocationTools.DiffCache, u::Vector{T}) where {T <: Float64}
    return dc.du
end

and using it in place of get_tmp yields jacobians with no allocations. Of course, this is a less flexible solution.
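For what it’s worth, the reshape-over-reinterpret pattern in my_get_tmp reuses the backing memory directly, with no copy. A Base-only sketch of the mechanism, using ComplexF64 as a stand-in for a dual number (both are plain bits types wider than Float64); the names du and dual_du mirror the DiffCache fields:

```julia
du      = zeros(4)   # plays the role of dc.du (the Float64 cache)
dual_du = zeros(8)   # plays the role of dc.dual_du (the widened backing buffer)

# Reinterpret the whole buffer (no view), then reshape to the primary
# cache's shape; both operations wrap the existing memory without copying.
tmp = reshape(reinterpret(ComplexF64, dual_du), size(du))

# Writing through tmp lands in the original buffer:
tmp[1] = 1.0 + 2.0im
# now dual_du[1] == 1.0 and dual_du[2] == 2.0
```

This is presumably why skipping the view (at the cost of requiring the backing buffer to match the requested chunk size exactly) avoids the allocations that the view / restructure path incurs.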