Large memory allocation in ForwardDiff

shiroghost · July 5, 2020, 11:49pm

Hi all,

I am a bit new to Julia, so I am sure that I should have a type stability problem somewhere, but I cannot find the mistake…

I am trying to determine a simple hessian of a user supplied function. The user gives a function lmfcn in the example. Then via a couple of closures, it is transformed in another function whose Hessian I am interested in:

using ForwardDiff

function do_hess(cs::Function, x::Vector{Float64}, y::Vector{Float64})

    n = length(x)
    m = length(y)

    xav = zeros(Float64, n+m)
    for i in 1:n
        xav[i] = x[i]
    end
    for i in n+1:n+m
        xav[i] = y[i-n]
    end

    @time hess = ForwardDiff.hessian(x -> cs(x[1:n], x[n+1:end]), xav)
    @time hess = ForwardDiff.hessian(x -> cs(x[1:n], x[n+1:end]), xav)
    @time hess = ForwardDiff.hessian(x -> cs(x[1:n], x[n+1:end]), xav)
    @time hess = ForwardDiff.hessian(x -> cs(x[1:n], x[n+1:end]), xav)
    return hess
end

function defs(f::Function, x, w)
    function chisq(p, d)
	tot = (d[1] - f(x[1], p))^2*w[1]
	for i in 2:length(d)
	    tot = tot + (d[i] - f(x[i], p))^2*w[i]
	end
	return tot
    end
    return chisq
end


x = rand(10)
w = rand(10).^2
lmfcn(x, p) = p[1] .+ p[2].*x
chisq = defs(lmfcn, x, w)

do_hess(chisq, [1.0, 1.0], [1.0*n for n in 1:10])

I know there are better ways of doing this, (this is just a minimal working example). I just want to understand why this example allocates so much memory, and why 2secs are needed to determine the Hessian of a relatively simple function…

 julia14 mem.jl
  2.770448 seconds (3.57 M allocations: 161.756 MiB, 0.82% gc time)
  2.269592 seconds (2.54 M allocations: 108.139 MiB, 0.67% gc time)
  2.284117 seconds (2.54 M allocations: 108.139 MiB, 0.55% gc time)
  2.287905 seconds (2.54 M allocations: 108.331 MiB, 0.88% gc time)

Many thanks!

longemen3000 · July 6, 2020, 5:05am

i tried some things, the most gains were obtained using views in the fx definition. preallocating hess helped too. here is the modified code:

using ForwardDiff

function do_hess(cs::Function, x::Vector{Float64}, y::Vector{Float64})

    n = length(x)
    m = length(y)
    hess=zeros(n+m,n+m) #trying preallocated result
    xav = zeros(Float64, n+m)
    for i in 1:n
        xav[i] = x[i]
    end
    for i in n+1:n+m
        xav[i] = y[i-n]
    end
    #trying views,this helped a lot!
    function fx(x0)
        x1 = @view x0[1:n]
        x2 = @view x0[n+1:end]
        return cs(x1,x2)
    end
    @time hess = ForwardDiff.hessian!(hess,fx, xav)
    @time hess = ForwardDiff.hessian!(hess,fx, xav)
    @time hess = ForwardDiff.hessian!(hess,fx, xav)
    @time hess = ForwardDiff.hessian!(hess,fx, xav)
    return hess
end


#this code seems fine
function defs(f::Function, x, w)
    function chisq(p, d)
	tot = (d[1] - f(x[1], p))^2*w[1]
	for i in 2:length(d)
	    tot = tot + (d[i] - f(x[i], p))^2*w[i]
	end
	return tot
    end
    return chisq
end

#adding const, didnt help a lot
const x = rand(10)
const w = rand(10).^2
lmfcn(x, p) = p[1] .+ p[2].*x
chisq = defs(lmfcn, x, w)

do_hess(chisq, [1.0, 1.0], [1.0*n for n in 1:10])

julia> do_hess(chisq, [1.0, 1.0], [1.0*n for n in 1:10])
  1.885206 seconds (3.14 M allocations: 138.086 MiB, 
0.95% gc time)
  0.000066 seconds (9 allocations: 34.703 KiB)
  0.000113 seconds (9 allocations: 34.703 KiB)       
  0.000050 seconds (9 allocations: 34.703 KiB)       
12×12 Array{Float64,2}:
  7.9521      4.82835     …  -0.192059  -1.66344
  4.82835     3.8532         -0.166405  -0.0145785   
 -0.97447    -0.47211         0.0        0.0
 -1.91046    -1.59864         0.0        0.0
 -0.0554433  -0.0458173       0.0        0.0
 -1.56975    -1.23606     …   0.0        0.0
 -0.0202298  -0.00383929      0.0        0.0
 -0.0698391  -0.0106075       0.0        0.0
 -1.23267    -1.14003         0.0        0.0
 -0.263742   -0.140256        0.0        0.0
 -0.192059   -0.166405    …   0.192059   0.0
 -1.66344    -0.0145785       0.0        1.66344

first time takes a lot for compilation, the subsequent times are fast.

shiroghost · July 6, 2020, 9:14am

Many thanks!

Unfortunately I still do not understand the issue…

Your modification runs faster, but the large allocation of the first time time does not seem to be for compilation. If I change the closure by adding at the end of the code:

println("Change closure")
x = rand(10)
w = rand(10).^2
lmfcn(x, p) = p[1] .+ p[2].*x
chisq = defs(lmfcn, x, w, 2, 10)

do_hess(chisq, [1.0, 1.0], [1.0*n for n in 1:10])

one again finds many allocations a a very slow running time the first time:

  2.809938 seconds (4.11 M allocations: 188.278 MiB, 1.06% gc time)
  0.000029 seconds (9 allocations: 34.703 KiB)
  0.000033 seconds (9 allocations: 34.703 KiB)
  0.000015 seconds (9 allocations: 34.703 KiB)
Change closure
  1.652488 seconds (688.18 k allocations: 27.695 MiB)
  0.000019 seconds (9 allocations: 34.703 KiB)
  0.000013 seconds (9 allocations: 34.703 KiB)
  0.000014 seconds (9 allocations: 34.703 KiB)

I would expect the first call large time/allocation to not be relevant if I change the function.

Still, it is very difficult to understand what is wrong with my original code. Ok, the views, save 2 allocations per function call, and preallocating hess saves another. But in order to explain 2.54M allocations, ForwardDiff needs to call my function 1M times, which for evaluating the Hessian of a simple function of 12 variables is clearly absurd.

In summary, I still think that something fishy is going on here when using closures to pass data to the function that is being differentiated…

Do you (anybody) understand this?

Bernard_GODARD · July 6, 2020, 12:47pm

In the first call to time you are benchmarking the julia compilation of the function for your argument types.

Also use benchrmak tools @btime instead of @time for benchmarking.

Also the hessian pre-allocation is not used in your code since you are re-binding the hessian to a different matrix not modifying the array content.

longemen3000 · July 6, 2020, 8:14pm

From my (not perfect) understanding of ForwardDiff, the derivative is compiled one time per function signature, so when you pass a closure, ForwardDiff must compile each time, as two closures are considered different functions, and there isn’t a way for ForwardDiff to know if two closures are the same.
A good thing is that the compile time is about the same if you are going to increase the input length.

longemen3000 · July 7, 2020, 5:16am

ok i saw something that really helped: at the start:

function do_hess(cs::Function, x::Vector{Float64}, y::Vector{Float64})
...

replace with:

function do_hess(cs::T, x::Vector{Float64}, y::Vector{Float64}) where T
...

this parametric dispatch triggers the ability to make specialized version of do_hess that is compiled for only chisq, so if you execute do_hess multiple times, only the first one will pay the compilation time

s-broda · July 7, 2020, 8:54am

Note also that ForwardDiff seems to do some recompilation when you call a function with the same signature with an array of different length:

julia> using ForwardDiff

julia> @time ForwardDiff.gradient(sum, [1., 2., 3., 4.]);
  0.106838 seconds (301.52 k allocations: 15.820 MiB)

julia> @time ForwardDiff.gradient(sum, [1., 2., 3., 4.]);
  0.000014 seconds (4 allocations: 608 bytes)

julia> @time ForwardDiff.gradient(sum, [1., 2., 3., 4., 5.]);
  0.119015 seconds (308.93 k allocations: 16.164 MiB)

julia> @time ForwardDiff.gradient(sum, [1., 2., 3., 4., 5.]);
  0.000020 seconds (4 allocations: 816 bytes)

Related: Another post on package compilation time - #15 by s-broda

kristoffer.carlsson · July 7, 2020, 9:46am

Yes, the “partials” are stored in a Tuple which has different lengths depend on the input size. Unless you manually specify the “chunk size”.

shiroghost · July 7, 2020, 12:02pm

Ok, now a summary of what I have learned.

it seems that ForwardDiff is too smart. If you change the closure significantly, it does a lot of work that is able to speed up a lot many computations of the same/similar function. But, if your goal is to compute hessians of functions, defined via closures that change significantly from call to call the overhead is too much.

In the end, I have ended up with a mixed solution:

1.- I use ForwardDiff to define the derivatives/second derivatives of the elementary functions. (Note that these are compiled once, and ForwardDiff will be very efficient doing these computations).

2.- Then I define my own Poor’s man Automatic differentiation. The main differences with ForwardDiff, is that it basically only works for Float64 types, and that I define custom operations between dual numbers and Float64 numbers without promoting the Float64 numbers to dual numbers. For example:

Base.:+(h1::hyperd, h2::hyperd) = hyperd(h1.v+h2.v, h1.d1+h2.d1, h1.d2+h2.d2, h1.dd+h2.dd)
Base.:+(h1::hyperd, h2::Number) = hyperd(h1.v+h2, h1.d1, h1.d2, h1.dd)
Base.:+(h1::Number, h2::hyperd) = hyperd(h1+h2.v, h2.d1, h2.d2, h2.dd)

In my very poor understanding of all this, promoting to Dual numbers is at the core of this large allocation/first execution time.

This is part of a module that I am writing about something completely unrelated to AD, but the undocumented relevant code is here:

My poor’s man implementation of hessian:

My hyperd type is similar to ForwardDiff but with a fixed chunk size of 1. This avoids allocations at the cost of more evaluations of the function:

mutable struct hyperd
    v::Float64
    d1::Float64
    d2::Float64
    dd::Float64
end

Now some results, for the basic code:

using ForwardDiff
import ADerrors
function do_hess(cs::Function, x::Vector{Float64}, y::Vector{Float64})

    n = length(x)
    m = length(y)

    xav = zeros(Float64, n+m)
    for i in 1:n
        xav[i] = x[i]
    end
    for i in n+1:n+m
        xav[i] = y[i-n]
    end

    #trying views,this helped a lot!
    function fx(x0)
        x1 = @view x0[1:n]
        x2 = @view x0[n+1:end]
        return cs(x1,x2)
    end
    
    hess = zeros(n+m, n+m)
    print("ForwardDiff timming: ")
    @time ForwardDiff.hessian!(hess, fx, xav)
    print("ForwardDiff timming: ")
    @time ForwardDiff.hessian!(hess, fx, xav)
    println("Custom hessian")
    print("hyperd_hessian timming: ")
    @time ADerrors.hyperd_hessian!(hess, fx, xav)
    print("hyperd_hessian timming: ")
    @time ADerrors.hyperd_hessian!(hess, fx, xav)
    println("Exit")
    
    return hess
end

function defs(f, x, w, n, m)
    chisq(p, d) = sum((d .- f(x, p)).^2 .*w)
    return chisq
end


x = rand(10)
w = rand(10).^2
@. lmfcn(x, p) = p[1] + p[2]*x
function lm2(x, p)
    res = Vector{eltype(p)}(undef, length(x))
    for i in eachindex(res)
        res[i] = p[1] + p[2]*x[i] + 1.0/x[i]
    end
    return res
end

chisq = defs(lm2, x, w, 2, 10)
do_hess(chisq, [1.0, 1.0], [1.0*n for n in 1:10])

println("Change closure")
x = rand(10)
w = rand(10).^2
chisq = defs(lmfcn, x, w, 2, 10)

do_hess(chisq, [1.0, 1.0], [1.0*n for n in 1:10])

println("Change closure")
x = rand(10)
w = rand(10).^2
@. lmfcn(x, p) = p[1]/x + p[2]*x^2
chisq = defs(lmfcn, x, w, 2, 10)

do_hess(chisq, [1.0, 1.0], [1.0*n for n in 1:10])

RESULTS:

ForwardDiff timming:   5.681562 seconds (5.28 M allocations: 235.013 MiB, 2.66% gc time)
ForwardDiff timming:   0.000039 seconds (11 allocations: 61.453 KiB)
Custom hessian
hyperd_hessian timming:   0.000080 seconds (4.15 k allocations: 211.578 KiB)
hyperd_hessian timming:   0.000076 seconds (4.15 k allocations: 211.578 KiB)
Exit
Change closure
ForwardDiff timming:   4.629693 seconds (4.51 M allocations: 187.885 MiB, 0.94% gc time)
ForwardDiff timming:   0.000031 seconds (11 allocations: 61.453 KiB)
Custom hessian
hyperd_hessian timming:   0.162594 seconds (111.11 k allocations: 5.882 MiB, 13.89% gc time)
hyperd_hessian timming:   0.000361 seconds (7.27 k allocations: 306.641 KiB)
Exit
Change closure
ForwardDiff timming:   1.905636 seconds (1.76 M allocations: 72.483 MiB, 6.92% gc time)
ForwardDiff timming:   0.000031 seconds (11 allocations: 61.453 KiB)
Custom hessian
hyperd_hessian timming:   0.036707 seconds (57.31 k allocations: 2.639 MiB)
hyperd_hessian timming:   0.000437 seconds (9.37 k allocations: 366.359 KiB)

So yes, my implementation takes about 2-10 times longer than ForwardDiff once it has compiled the basic functions. This seems to be true no matter if you change the closure/function. The only important thing for my implementation to perform is that the basic operations (sin, cos, product of hyperd, etc…) are precompiled.

On the other hand, once ForwardDiff is tuned, it is much faster than my implementation. But the overhead of this tunning is about 1M hessian evaluations!

Final Question

To the experts/developers of ForwardDiff. Would not be more efficient to hard-code the operations between dual numbers and normal numbers, without promotion? (… notice that maybe I am not understanding ForwardDiff code very well…).

One last thing: Many thanks for the help!

kristoffer.carlsson · July 7, 2020, 12:34pm

Are your closures really arbitrary new functions or is it just parameters you are changing in them? Where do you get 1 million “arbitrary” functions from?

I don’t think that is true.

shiroghost · July 7, 2020, 1:03pm

They are typically new functions. The 1M comes assuming that with precompilation the hyperd approach would perform, since it only requires the precompilation of a handful of functions that are type stable. 4-5 sec to compute one hessian of a function of 10 variables is a big overhead, and it seems that it cannot be solved with precompilation

But I perfectly understand that my goals are very concrete.

Well, I am sure that you are right about this. I though that each element of the arrays x, w that enter in the closure would need to be promoted to some Dual representation and that this could, in part, be responsible of the large allocation/times.

Many thanks.

kristoffer.carlsson · July 7, 2020, 1:07pm

So where do they come from? To get a million of them you must be generating them with metaprogramming or evolutionary algorithms etc?

shiroghost · July 7, 2020, 1:52pm

I never use 1M functions. I meant that the overhead of computing the first Hessian with ForwardDiff is approximately like the computation of 1M hessians (after the proper workspaces/whatever is properly done).

My use case is just the opposite: I only need to do this a handful of times (2-10). It just seems unreasonable to me to wait 1 minute and that the computer time is dominated by the determination of hessians of relatively simple functions.

Note that the hyperd approach scales terribly with the number of variables. This is not really something that can compete with ForwardDiff, and it is easy to see if you just use 1000 variables.

I see this hyperd thing as a a small hack for a very particular use case…

kristoffer.carlsson · July 7, 2020, 2:35pm

Oh, I see. Makes sense now. Did you try tell ForwardDiff to use a chunk size of 1?

shiroghost · July 7, 2020, 3:06pm

Fantastic! This seems to do the job. I assume that all the huge allocation/run time was in fact due to the euristics in the determination of the chunk size (in fact the documentation says that this euristic is not type stable, but I never though that this could be the cause).

Now setting the chunk size to any smallish number (1, 2, 4, 8), produces consistent and very small run times. In fact for my cases it really does not matter the number a lot. Basically is always better than the hyperd hack.

Many thanks, Now I have to git rm some files…

Topic		Replies	Views
ForwardDiff.jl \ Optim.jl and recompilation for any vector size Optimization (Mathematical)	4	556	January 10, 2021
Fast f(x) ∂f∂x and ∂2f∂x2 Performance autodiff	5	577	July 30, 2021
ForwardDiff.hessian! with StaticArrays, unexpected allocations and performance Performance staticarrays , autodiff	4	195	May 13, 2024
Large amount of memory allocation when using autodiff and IPNewton Performance	1	114	January 30, 2024
Nesting ForwardDiff inside ReverseDiff? General Usage question , differentiation	5	1275	January 4, 2018

Large memory allocation in ForwardDiff

Related topics