Not using PreallocationTools.jl correctly

I have run into the same problem that others have run into before me. Unfortunately, I cannot solve it, even though the question has been asked in (slightly) different guises before (e.g. link1, link2).

The problem: I have a function that pre-allocates a buffer for efficiency. The function works as it should; however, I cannot auto-differentiate it easily. I discovered PreallocationTools.jl, but evidently I am not using it correctly: when I attempt to use it, far too many memory allocations are made.

Side note: I plan to use ForwardDiff.jl for the auto-differentiation, though reverse-mode differentiation would probably make more sense for my case.

I have created an MWE below that includes two functions:

  • function setup_loglikelihood sets up and returns a function that evaluates a mock log-likelihood. This mock log-likelihood function would suffice if we are only interested in evaluating the log-likelihood, but not auto-differentiating it.
  • function setup_loglikelihood_tool sets up and returns a function that evaluates a mock log-likelihood. It attempts to use PreallocationTools.jl, but something seems to be terribly wrong given its poor performance.
using LinearAlgebra, ForwardDiff, PreallocationTools

# This function does not make any considerations for auto-differentiation
function setup_loglikelihood(X, Y)

    # X are the Q × N inputs
    Q, N = size(X)

    # Y are D × N targets/outputs 
    D = size(Y, 1)
    
    # pre-allocate array for predictions we get from W*X
    # where W are the parameters of the model of size D×Q      
    pred_storage = zeros(D, N)
    
    function loglikelihood(param)

        W = reshape(param, D, Q)

        mul!(pred_storage, W, X)
        
        -sum(abs2.(Y - pred_storage))

    end

    return loglikelihood

end


# This function uses a cache created with PreallocationTools.jl, but does something wrong
function setup_loglikelihood_tool(X, Y)

    # X are the Q × N inputs
    Q, N = size(X)

    # Y are D × N targets/outputs 
    D = size(Y, 1)
    
    function loglikelihood!(pred_storage, param)

        pred_storage = get_tmp(pred_storage, param)

        W = reshape(param, D, Q)

        mul!(pred_storage, W, X)
        
        -sum(abs2.(Y - pred_storage))

    end

    return loglikelihood!

end

Using the code above, I try the following:

# create mock data and parameter
X = randn(4,10) # mock inputs
Y = randn(5, 10) # mock outputs
W = randn(4,5) # mock parameters

# get log-likelihood function that does not account for autodifferentiation
logl = setup_loglikelihood(X, Y)
logl(vec(W)) # evaluates with no problems
ForwardDiff.gradient(logl, vec(W)) # this throws error as expected

# get log-likelihood function that attempts to account for autodifferentiation
logl! = setup_loglikelihood_tool(X, Y)
C = DiffCache(similar(Y)) # create cache
logl!(C, vec(W)) # returns same result as logl(vec(W)) above

# takes **very** long and produces **far too** many memory allocations
@time ForwardDiff.gradient(w -> logl!(C, w), vec(W))

Evidently, I am not using PreallocationTools.jl correctly. What am I doing wrong?

What happens when you run @time a second time? You may just be seeing the effects of just-in-time compilation (and of benchmarking with non-constant global variables). In general, it is better to benchmark with dedicated tools like BenchmarkTools.jl or Chairmarks.jl.
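
For example, something like the following (g is just a wrapper name I'm introducing for the benchmark, assuming the definitions from your post are in scope):

using BenchmarkTools

# Wrap the call so the benchmarked expression is a single function call;
# the $ interpolation keeps global-variable lookups out of the timing itself.
g(w) = logl!(C, w)
@btime ForwardDiff.gradient($g, $(vec(W)))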

If I run @time again, I get the same number of allocations. Also: @time ForwardDiff.gradient(w -> logl!(C, w), vec(W)) shows that it takes 0.56 seconds to run! I am aware that BenchmarkTools.jl is the proper way of timing things, but here the number of allocations points to a grave mistake in the example.

This isn’t an answer to the question you directly posed, but I think a better solution to your underlying problem is to use Enzyme.jl.

Here’s how I’d do this with Enzyme:

using LinearAlgebra, Enzyme

function setup_loglikelihood(X, Y)
    Q, N = size(X)
    D = size(Y, 1)
    pred_storage = zeros(D, N)
    
    function loglikelihood(param)
        W = reshape(param, D, Q)
        mul!(pred_storage, W, X)
        -sum(abs2, Y - pred_storage)
    end
    return loglikelihood
end

and then

# create mock data and parameter
X = randn(4,10) # mock inputs
Y = randn(5, 10) # mock outputs
W = randn(4,5) # mock parameters
logl = setup_loglikelihood(X, Y)

# Create a 'shadow' of W. After autodiff, this will have the property that
# dW[i] == ∂logl(W)/∂W[i]  (logl returns a scalar)
dW = make_zero(W)

# Create a 'shadow' of logl. This is necessary because logl is a closure that
# stores data which is mutated during differentiation.
dlogl = make_zero(logl)

# Run the autodiff
Enzyme.autodiff(Reverse, Duplicated(logl, dlogl), Duplicated(W, dW))

Looking at the gradient accumulated into dW we see

julia> dW
4×5 Matrix{Float64}:
 -20.4806    -1.63579    10.6586    6.69816   1.45061
 -11.3634    16.5473     19.6695   -2.8299   -9.35841
  26.4333    -0.970109  -19.8093    5.12396  23.1772
  -4.74061  -14.9736     -8.79992  13.625     9.38448

This seems to agree with what I find using finite differences.
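
For completeness, here is a quick version of that check with a plain central difference (fd_gradient and h are names I'm introducing here, not anything from the code above):

# Finite-difference gradient by central differences, one parameter at a time.
function fd_gradient(f, W; h = 1e-6)
    g = similar(W)
    for i in eachindex(W)
        Wp = copy(W); Wp[i] += h
        Wm = copy(W); Wm[i] -= h
        g[i] = (f(Wp) - f(Wm)) / (2h)
    end
    return g
end

fd_gradient(logl, W) ≈ dW   # should be true up to finite-difference error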

You are right, this is not the answer, but this is useful to me. Thank you.

I just tried out the following, which indicates a type instability that I can't really interpret properly but which is probably the culprit:

@code_warntype ForwardDiff.gradient(w -> logl!(C, w), W) # takes very long and produces far too many memory allocations

MethodInstance for ForwardDiff.gradient(::var"#118#119", ::Matrix{Float64})
  from gradient(f::F, x::AbstractArray) where F @ ForwardDiff ~/.julia/packages/ForwardDiff/UBbGT/src/gradient.jl:16
Static Parameters
  F = var"#118#119"
Arguments
  #self#::Core.Const(ForwardDiff.gradient)
  f::Core.Const(var"#118#119"())
  x::Matrix{Float64}
Body::Any
1 ─ %1 = ForwardDiff.GradientConfig(f, x)::ForwardDiff.GradientConfig{ForwardDiff.Tag{var"#118#119", Float64}, Float64, _A, Array{ForwardDiff.Dual{ForwardDiff.Tag{var"#118#119", Float64}, Float64, _A1}, 2}} where {_A, _A1}
│   %2 = (#self#)(f, x, %1)::Any
└──      return %2

All I understand is that there is a type instability, but I don't understand why. Neither do I know how to fix it.

Here’s the performance I see with Enzyme.jl:

julia> let A = 4, B = 10, C = 5
       # create mock data and parameter
       X = randn(A,B) # mock inputs
       Y = randn(C, B) # mock outputs
       W = randn(A,C) # mock parameters
       logl = setup_loglikelihood(X, Y)

       println("Primal benchmark: ")
       @btime $logl($W)

       println("Gradient benchmark: ")
       @btime Enzyme.autodiff(Reverse, Duplicated(logl, dlogl), Duplicated(W, dW)) setup=begin
           logl=$logl
           dlogl = make_zero(logl)
           W = $W
           dW = make_zero(W)
       end
       end
Primal benchmark: 
  133.595 ns (5 allocations: 1.02 KiB)
Gradient benchmark: 
  826.548 ns (9 allocations: 1.98 KiB)

which seems to more or less make sense given that this is a gradient compared to a single primal evaluation.

I somehow managed to solve this, though I don’t understand exactly why this solution works. Basically, it has to do with the chunking in ForwardDiff.jl.

(This reply, How to make this function compatible with ForwardDiff? - #9 by ChrisRackauckas, pointed me to the solution.)

Staying with the code I posted above, the following does the trick:

aux(x) = logl!(C,x)
cfg1 = ForwardDiff.GradientConfig(aux, W, ForwardDiff.Chunk{1}())
# The call below is now fast and allocates far less
ForwardDiff.gradient(aux, W, cfg1) 

However, even though the above works, I am not sure why it does. So, to anyone who has the same problem and is reading this: I am not sure whether this is a good solution. Perhaps more knowledgeable people can offer a comment.
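
If the problem really is a mismatch between the chunk size baked into the DiffCache and the chunk size ForwardDiff uses (that is my assumption here, following the linked reply), an alternative to Chunk{1} would be to make the two match explicitly; chunk, C2, aux2, and cfg2 below are new names introduced just for this sketch:

chunk = ForwardDiff.pickchunksize(length(W))   # chunk size ForwardDiff would pick for W
C2 = DiffCache(similar(Y), chunk)              # cache sized for exactly that chunk
aux2(x) = logl!(C2, x)
cfg2 = ForwardDiff.GradientConfig(aux2, W, ForwardDiff.Chunk{chunk}())
ForwardDiff.gradient(aux2, W, cfg2)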

Also, letting ForwardDiff.jl decide about the chunking on its own works even better:

cfg1 = ForwardDiff.GradientConfig(aux, W, ForwardDiff.Chunk{1}());
@btime ForwardDiff.gradient(aux, W, cfg1)
  10.771 μs (161 allocations: 39.89 KiB)

# VS

cfg = ForwardDiff.GradientConfig(aux, W);
@btime ForwardDiff.gradient(aux, W, cfg)
  2.956 μs (21 allocations: 18.17 KiB)

Obviously, it doesn’t beat the Enzyme solution above.

How much of a difference it makes depends a lot on the problem size too. For the small size you’re showing here, the difference is a modest ~2x-3x:

function bench(a=4, b=10, c=5)
    # create mock data and parameter
    X = randn(a,b)
    Y = randn(c, b)
    W = randn(a,c) 
    logl = setup_loglikelihood(X, Y)

    println("Enzyme gradient: ")

    dW_enzyme = @btime let dW = make_zero(W)
        Enzyme.autodiff(Reverse, Duplicated(logl, dlogl), Duplicated(W, dW))
        dW
    end setup = begin
        logl=$logl
        dlogl = make_zero(logl)
        W = $W
    end

    logl! = setup_loglikelihood_tool(X, Y)
    C = DiffCache(similar(Y)) # create cache
    aux(x) = logl!(C,x)
    cfg = ForwardDiff.GradientConfig(aux, W);
    dW_fwd = @btime ForwardDiff.gradient($aux, $W, $cfg)
    
    nothing
end
julia> bench(4, 10, 5)
Enzyme gradient: 
  697.062 ns (8 allocations: 1.30 KiB)
  1.609 μs (13 allocations: 9.20 KiB)

but if we scale it up, it can balloon to an over 1000x performance difference

julia> bench(50, 100, 60)
Enzyme gradient: 
  74.259 μs (11 allocations: 117.52 KiB)
  84.935 ms (1502 allocations: 148.84 MiB)

(note the microseconds vs milliseconds), and this deficit will continue to get worse as the problem grows. Presumably this is mostly due to the algorithmic advantage of reverse mode versus forward mode.
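
A back-of-the-envelope way to see that advantage (my own rough estimate, not something measured above): a scalar loss with $n$ parameters needs roughly $\lceil n/\text{chunk} \rceil$ forward-mode sweeps of the primal (the default chunk size is capped at 12, if I remember correctly), while reverse mode needs only a small constant number $c$ of sweeps, so

$$\frac{\text{cost}_{\text{forward}}}{\text{cost}_{\text{reverse}}} \sim \frac{\lceil n/\text{chunk}\rceil}{c},$$

which for $n = 50 \cdot 60 = 3000$ already predicts a gap in the hundreds before any allocation overhead is counted.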

The Enzyme solution is great (and I will most likely consider it in the future), but for now I would still like to know how to properly use PreallocationTools.jl with ForwardDiff.jl for a function that makes use of a pre-allocated array. My tentative solution seems to work, but there is still the type instability I mention above.
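
For the record, here is a minimal sketch of the direction I am currently leaning towards (make_gradient, w0, and ∇logl are names I am making up here; the idea is simply to keep the cache, the closure, and the GradientConfig out of untyped global scope, which may or may not be the source of the instability):

function make_gradient(X, Y, w0)
    logl! = setup_loglikelihood_tool(X, Y)
    chunk = ForwardDiff.pickchunksize(length(w0))
    C = DiffCache(similar(Y), chunk)   # cache chunk size matches the gradient's
    aux = w -> logl!(C, w)
    cfg = ForwardDiff.GradientConfig(aux, w0, ForwardDiff.Chunk{chunk}())
    return w -> ForwardDiff.gradient(aux, w, cfg)
end

∇logl = make_gradient(X, Y, vec(W))
∇logl(vec(W))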