Enzyme autodiff: Why am I getting allocations?

I’m using Enzyme to perform automatic differentiation.

I thought that since the primal code is type-stable and non-allocating, the Enzyme autodiff call would be too. However, the minimized example below gives me this output:

Autodiff time
  0.000005 seconds (4 allocations: 192 bytes)

It seems quite fragile. Moving things around, adding or removing @inline or @noinline sometimes changes the number of allocations, but in the full example of my code (not just this minimal example), I struggle to get rid of all of them.

using StaticArrays, Profile, PProf
using Enzyme: Reverse, autodiff, Active, Const, Duplicated

function mwe()
    function func_to_be_differentiated(mat_vec::Vector, vec::Vector)
        L = length(vec)
        for i = 1:L
            mat = mat_vec[i] 
            a = vec[i]
            new_column = SVector(a, a, a)
            for j = 1:3
                mat[j, 1] = new_column[j]
            end
            mat[1, 1] = vec[i]
        end
        nothing
    end
    
    vec = zeros(3)
    dvec = copy(vec)
    matvec = [zeros(3, 1) for i in eachindex(vec)]
    dmatvec = deepcopy(matvec)

    function f(_matvec, vec)
        func_to_be_differentiated(_matvec, vec)
        _matvec[3][1, 1]
    end

    f(matvec, vec)

    args = (Duplicated(matvec, dmatvec), Duplicated(vec, dvec))
    println("Autodiff time")
    @time autodiff(Reverse, f, args...)
    # @profview test_autodiff(f, args; N=1000000)
end

mwe()

I’ve been looking at the VS Code @profview, which attributes the allocations sometimes to getindex or setindex calls and sometimes to the first line of the loop for i = 1:L, and I’ve been looking at pprof of a Profile.Allocs.@profile sample_rate=1 profile, which I include, but I don’t understand what’s going on.

I would upload the profile, but it is not a supported file type… alloc-profile.pb.gz
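
For reference, this is roughly how I collected that allocation profile (a sketch; it reuses mwe() from above and assumes PProf’s allocation-profile viewer):

using Profile, PProf

Profile.Allocs.clear()
Profile.Allocs.@profile sample_rate=1 mwe()   # record every allocation during the call
PProf.Allocs.pprof()                          # open the result in the pprof web viewer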

@wsmoses

Can you share the full code (I don’t see the test autodiff defn, but I’m on mobile rn).

Also, passing the Duplicated in rather than constructing it at the call site may be it, but I would need the full code for a deeper look.

Really sorry - I improved the MWE after writing my original comment but forgot to paste the final version! It’s updated now.

Can reverse mode ever be fully non-allocating? Even in Enzyme?

I don’t see any fundamental reason it can’t be, especially for simple functions.

Consider
y = x₁² + x₂²

which is computed by mutating an input vector x (squaring each element in place) and then reducing it with sum.

e.g.

function test_vector()
    function f(x::Vector)
        for i in eachindex(x)
            x[i] = x[i]^2
        end
        sum(x)
    end
    x = [1.0, 2.0]
    dx = zero(x)
    @time autodiff(Reverse, f, Duplicated(x, dx))
    @show dx
end
test_vector()

gives

 0.000001 seconds
dx = [2.0, 4.0]

i.e. no allocations during autodiff, as dx is already preallocated.

Does the @btime macro from the BenchmarkTools package also report allocations, especially when interpolating arguments via $? I think it shouldn’t make a difference here, because everything is already inside functions, but…
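
Something like this is what I have in mind (a sketch; square_and_sum! is just a stand-in for the f from the test_vector example above):

using BenchmarkTools
using Enzyme: Reverse, autodiff, Duplicated

function square_and_sum!(x)
    for i in eachindex(x)
        x[i] = x[i]^2
    end
    return sum(x)
end

x = [1.0, 2.0]
dx = zero(x)
# @btime reports allocations alongside the timing; interpolating with $ avoids
# measuring global-variable access. Note that the benchmark repeatedly mutates x.
@btime autodiff(Reverse, square_and_sum!, Duplicated($x, $dx))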

Yeah, I’ve been careful to make sure everything is inside a function so that “global shenanigans” don’t cause spurious allocations.

The problem could possibly be related to using a vector of matrices, i.e. Vector{Matrix{T}}. I haven’t seen any examples of this being done with Enzyme in the documentation.

Could this cause something like what I’m seeing?

A possible fix for me right now would be to refactor to an Array{T, 3} rather than a vector of matrices, since all the matrices are the same size anyway. This would be a bit more involved in my full code (it’s structs of matrices rather than plain matrices) compared to the example, but it’s fundamentally possible. A rough sketch of what I mean is below.
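
Something like this (hypothetical; it mirrors func_to_be_differentiated from my MWE, with the third dimension playing the role of the vector index):

function func_to_be_differentiated_3d!(mats::Array{Float64,3}, vec::Vector{Float64})
    for i in eachindex(vec)
        a = vec[i]
        # same writes as the Vector-of-Matrix version, but into one flat array
        for j = 1:3
            mats[j, 1, i] = a
        end
        mats[1, 1, i] = vec[i]
    end
    return nothing
end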

Can you live with things being fast, but allocating? If it is fast enough and if yes, I would move on with life.

My understanding is that reverse-mode autodiff always needs to store a tape, whose length increases with the depth of the function call (like the number of layers in a neural net). But maybe I’m wrong.

Yes

That is not right. Enzyme uses source-to-source reverse mode. The reverse pass is just new code with different inputs. Building that code allocates, but since it is independent of the input values it allocates and compiles once, like any other Julia code. When you call it, you pass not only the forward vectors but also the pullback storage to write into. With that, it indeed can be, and often is, non-allocating.

Reverse-mode AD (including Enzyme) indeed can be non-allocating.

Like Chris said, Enzyme does not store an instruction tape. It instead generates a new function which computes the derivative. Consequently, instructions aren’t pushed/popped/etc.

However, non-allocating code can in some cases require allocations in the derivative.

For example, consider the following:

function foo(x)
    res = 0.0
    for i in 1:10
        res += x[i] * x[i]
        x[i] = 0   # overwrites the value the reverse pass will need
    end
    return res
end

The corresponding gradient here is to perform dx[i] += 2 * (original value of x[i]). However, the original value of x was overwritten with 0. Thus you need to cache x (or do some other fancy optimization to work around it). Incidentally, this is why Enzyme has a ton of really interesting/novel cache/aliasing optimizations. I’d recommend reading https://c.wsmoses.com/papers/EnzymeGPU.pdf and https://dl.acm.org/doi/pdf/10.1145/3572848.3577475 for examples.
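
To make the caching concrete, here is a hand-written sketch of what a combined forward-plus-reverse pass for foo conceptually has to do (this is not the code Enzyme generates, just the idea):

function foo_and_gradient!(dx, x)
    cache = similar(x, 10)      # extra storage forced by the overwrite of x
    res = 0.0
    for i in 1:10
        cache[i] = x[i]         # save the original value before it is destroyed
        res += x[i] * x[i]
        x[i] = 0
    end
    # reverse pass: d(res)/d(x[i]) = 2 * (original x[i])
    for i in 10:-1:1
        dx[i] += 2 * cache[i]
    end
    return res
end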

In any case, @Larbino1, I haven’t had a chance to look more deeply at your problem, but since I see a vector of vectors, I have a hypothesis.

What happens if you turn mat_vec into an SArray? My hypothesis here is, incidentally, one of aliasing. Specifically, Julia stores vectors as pointers to pointers, which means alias analysis cannot easily tell that stores/loads to vectors don’t overwrite memory from other locations. In this sense you’re reading mat_vec[i], which we also need to read in the corresponding loop of the reverse pass (though now for shadow(mat_vec[i])). However, my hypothesis is that the info from Julia is insufficient for LLVM alias analysis to prove that the store to mat[1,1] cannot overwrite mat_vec[i], thus forcing one to cache each mat_vec[i] load, lest we get the wrong answer (see above). Alternatively, if you add some extra alias info, like the Julia const construct, that may help here.

I will add a comment here, however: Enzyme’s forward mode, which does not need to cache variables, should not add extra allocations. The previous cache comment only applies to reverse mode (or technically also Enzyme’s split forward mode, but we’ve yet to expose that in the Julia API).

Also, you can use Enzyme.API.printall!(true) to see the LLVM input code Enzyme sees as well as the derivative code it generates. Similarly, Enzyme.API.printperf!(true) will print warnings where Enzyme may need to cache (because it couldn’t prove or otherwise optimize things such that it could avoid the cache, or because it thought the cache was a worthwhile compute-vs-memory tradeoff), as well as what it actually ended up caching.
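
For example, something along these lines (a sketch reusing f and the Duplicated arguments from the MWE; the flags are global, so set them before the call you want to inspect):

import Enzyme
using Enzyme: Reverse, autodiff, Duplicated

Enzyme.API.printall!(true)   # dump the LLVM input Enzyme sees and the derivative it generates
Enzyme.API.printperf!(true)  # warn where Enzyme decided it had to cache values
autodiff(Reverse, f, Duplicated(matvec, dmatvec), Duplicated(vec, dvec))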

edit:

I just confirmed that making matvec an MArray resolves the issue. In this case it’s a single pointer, so alias analysis has an easier time (and things also possibly get unrolled, etc.).

using StaticArrays
using Enzyme: Reverse, autodiff, Active, Const, Duplicated

function mwe()
    function func_to_be_differentiated(mat_vec, vec::Vector)
        L = length(vec)
        for i = 1:L
            mat = mat_vec[i] 
            a = vec[i]
            new_column = SVector(a, a, a)
            for j = 1:3
                mat[j, 1] = new_column[j]
            end
            mat[1, 1] = vec[i]
        end
        nothing
    end
    
    vec = zeros(3)
    dvec = copy(vec)
    matvec = MVector{3}([zeros(3, 1) for i in eachindex(vec)])
    dmatvec = deepcopy(matvec)

    function f(_matvec, vec)
        func_to_be_differentiated(_matvec, vec)
        _matvec[3][1, 1]
    end

    f(matvec, vec)

    args = (Duplicated(matvec, dmatvec), Duplicated(vec, dvec))
    println("Autodiff time")
    @time autodiff(Reverse, f, args...)
    # @profview test_autodiff(f, args; N=1000000)
end
mwe()

Thanks @ChrisRackauckas and @wsmoses for clarifying! I guess my question was rather “can we always ensure that derivatives are non-allocating in Enzyme”, to which the answer seems to be “no, even when we provide duplicated inputs, but Enzyme does try its very best”

I disagree with that. You can ensure your code doesn’t require any allocations as follows:

  • it has no allocations in the primal program
  • all inputs are recomputable (aka not overwritten by a potentially aliasing store)

I’m sorry, I don’t fully understand what you mean by this. If an input is mutated and I provide Duplicated storage to Enzyme, in which situations will it remain allocation-free?

If there are no instructions which may write to the memory of that input.

For example

function f(out, in)
    out[1] = in[1] * in[2]
end

If out and in are marked noalias (i.e. the output array can be proven not to be the same as the input array), this will not allocate. In C/C++ I would mark this with restrict on the arguments. Julia has some partial ways of marking noalias, like Core.Experimental.Const, iirc.

The reason aliasing must be proven here is because we need to access in[1] to compute the derivative. If it’s overwritten we will be conservative and cache it. If out is possibly the same array as in it may be overwritten and we will be conservatively correct.
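
For illustration, here is a minimal sketch of how that kind of alias info can be expressed on the Julia side. Note this is an assumption on my part: the construct may actually be Base.Experimental.Const used inside Base.Experimental.@aliasscope rather than Core.Experimental.Const.

using Base.Experimental: @aliasscope, Const

# Hypothetical sketch: Const(in) promises that, within the @aliasscope block,
# stores made in the block do not modify in, so its loads need not be re-checked
# (or, in the autodiff setting, conservatively cached).
function f_noalias(out, in)
    @aliasscope begin
        out[1] = Const(in)[1] * Const(in)[2]
    end
    return nothing
end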

Technically my claim above is an over-approximation as well, as Enzyme would only cache variables needed to compute the derivative. For example, if this were x1 + x2, we would not need either x1 or x2 to compute the derivative. For a more thorough explanation I’d recommend reading the Enzyme papers.

I’ll check the papers out, thank you for the summary!

Thank you so much for all of your replies. Your insight is incredibly helpful, both the explanation of when caching is necessary and your pointers to other sources/debug printing options to help me improve my understanding!

I am not familiar with ‘alias info like the Julia const construct’. Is this the Core.Experimental.Const that you refer to later? A quick google yields nothing. Can you explain what this is, or point me towards somewhere I can learn about it?