Non-escaping array buffer re-use and the sufficiently smart compiler

This comes up pretty often in deep learning land because in-place operations do not generally please the AD framework. A quick demo:

julia> function test(n)
           a = ones(Int64, n, n)
           a = a .+ 2
           b = sum(a, dims=2)
           b = b .* 3
           b = b .+ 0
           return b
       end
test (generic function with 1 method)

julia> function test2(n)
           a = ones(Int64, n, n)
           a .+= 2
           b = sum(a, dims=2)
           b .*= 3
           b .+= 0
           return b
       end
test2 (generic function with 1 method)

julia> @btime test(1000);
  1.946 ms (11 allocations: 15.28 MiB)

julia> @btime test2(1000);
  1.578 ms (7 allocations: 7.64 MiB)

Looking at the output of @code_typed (and briefly at @code_llvm), I see 4 and 1 calls to jl_alloc_array_2d for test and test2 respectively. Are there any circumstances where the compiler could identify that the allocations behind a and earlier incarnations of b do not escape the function and thus can be re-used? Am I missing any semantic issues that might arise from this type of optimization?

4 Likes

For some reason julia is especially bad about doing this optimization on it’s native arrays, but there are other options. For instance, https://github.com/chriselrod/StrideArrays.jl was designed with this in mind. Unfortunately, it doesn’t manage to improve on your example, but it is capable of eliding a lot of array operations from the heap and doing them on the stack: https://chriselrod.github.io/StrideArrays.jl/dev/stack_allocation/

Maybe @Elrod has some thoughts on how to make your example work

using StrideArrays, LoopVectorization
function test2(n)
           a = ones(Int64, n, n)
           a .+= 2
           b = sum(a, dims=2)
           b .*= 3
           b .+= 0
           return b
       end
function test3(::Val{N}) where {N}
    _a = StrideArray{Int64}(undef, (StaticInt(N),StaticInt(N)))
    b = StrideArrays.preserve_buffer(_a)
    a = PtrArray(_a)
    GC.@preserve begin b
        @avx a .= 1
        @avx a .+= 2
        b = Matrix{Int}(undef, size(a,1), 1)
        @avx for j in axes(a,2)
            bi = 0
            for i in axes(a,1)
                bi += a[i,j]
            end
            b[i] = bi
        end
    end # GC.@preserve
    @avx b .*= 3
    @avx b .+= 0    
    return b
end

I get

julia> @btime test2(100)'
  10.006 ÎĽs (7 allocations: 79.16 KiB)
1Ă—100 adjoint(::Matrix{Int64}) with eltype Int64:
 900  900  900  900  900  900  900  900  900  900  900  900  900  900  900  900  900  900  900  900  …  900  900  900  900  900  900  900  900  900  900  900  900  900  900  900  900  900  900  900  900

julia> @btime test3(Val(100))'
  3.413 ÎĽs (1 allocation: 896 bytes)
1Ă—100 adjoint(::Matrix{Int64}) with eltype Int64:
 900  900  900  900  900  900  900  900  900  900  900  900  900  900  900  900  900  900  900  900  …  900  900  900  900  900  900  900  900  900  900  900  900  900  900  900  900  900  900  900  900

About 3x faster, and a tiny fraction of the memory use.

This will only work up to a certain size. You probably can’t stack allocate a 1000x1000 array in Julia – the stack isn’t big enough, so you’ll probably get a crash.
It also has to be statically sized.
In Fortran, it’s easy to set a bigger stack size and to stack allocate dynamically sized arrays. Hopefully we’ll get there sooner rather than later.

I was a bit heavy handed with test3. Something wasn’t working right somewhere, so I’ll have to take more of a look at it later.
StrideArrays is far from mature – I haven’t used it much myself yet – so it needs a lot of ironing out.
If you pin down any problematic functions that don’t work (and which aren’t solved with @gc_preserve, please file an issue.

sum(a,dims=2) apparently isn’t type stable when used with StrideArrays, for example…

EDIT:
There aren’t any semantic problems. It’s just that (currently – will probably be fixed eventually) allocating base arrays is opaque to the compiler. mutable structs aren’t, so you can just create a mutable struct holding an NTuple for memory, and the compiler can optimize it, like in the above example with just 1 allocation.
But it’s fairly easy to mess up at the moment.

1 Like

@Elrod interesting developments in this area! I think this has the potential to unlock a rather dramatic performance increase in a lot of code bases, without having to go the route to preallocate everything at the start if an expensive computation.

Regarding the macro gc_preserve, I find that I often encounter code that allocates a lot of small arrays in a loop in order to construct one larger array. Nothing but the large array escapes the loop. Is this situation applicable to StrideArrays? Assuming static arrays have already been ruled out due to the need for mutation.

1 Like

I agree. It’s something I’m excited for.

Note that there isn’t really anything special about StrideArrays – StaticArrays.MArrays work similarly.
I just added PtrArray and gc_preserve to make it easier to get this to work (but the macro needs a lot of work, in particular, to have it apply to a block instead of just a function call).

I defined a handful of functions to do this, so that this should work with some code written naively. It should’ve worked in a test3 example that looks a lot more similar to test2, but there are a lot of rough edges.
And it only takes 1 to problem to make the array allocate, so that everything else was for naught.

You’re welcome to try it and file issue/prs if/when something isn’t working as intended or allocates needlessly

I am not sure if I understand what’s going on here, both b and a somehow come from the same _a?

My use case would be something like

function outer_fun()
    amat = StrideArray{T}(undef, (StaticInt(48), StaticInt(48)))
    bmat = StrideArray{T}(undef, (StaticInt(48),))
    expensive_fun_mutating_amat_bmat(amat, bmat, ...)
    return amat \ bmat
end

Do I need to use any of the @gc_preserve, GC.@preserve, preserve_buffer on the arrays or the function call?
My naive try with just sending the StrideArrays into the function gave me an

ERROR: "Memory access for SubArray{Float64, 1, SMatrix{6, 7, Float64, 42}, Tuple{UnitRange{Int64}, Int64}, true} not implemented yet."

so I’m either trying this too early, or I have missed something :grinning_face_with_smiling_eyes:

1 Like

b is the memory of _a, while a is an array backed by that memory.

julia> @macroexpand @gc_preserve foo(A, B)
quote
    var"#20###A#258" = A
    var"#21###buffer#259" = StrideArrays.preserve_buffer(var"#20###A#258")
    var"#22###A#260" = B
    var"#23###buffer#261" = StrideArrays.preserve_buffer(var"#22###A#260")
    $(Expr(:gc_preserve, :(foo(StrideArrays.maybe_ptr_array(var"#20###A#258"), StrideArrays.maybe_ptr_array(var"#22###A#260"))), Symbol("#21###buffer#259"), Symbol("#23###buffer#261")))
end

I was just manually doing what @gc_preserve will do once I update it for working on blocks instead of calls (aside: Expr(:gc_preserve, ...) is from GC.@preserve).
The reason for separating the memory is that GC.@preserve will generally allocate the entire object if the memory to be protected is held by a struct. Preserving only the memory will preserve said memory without forcing it to be heap allocated.

Ah, that’s probably LoopVectorization complaining about SMatrix{6, 7, Float64, 42}.
I haven’t implemented support for SMatrix yet, because I haven’t found out a way to actually make them fast. The two best possibilities so far are:

  1. convert to MArray, then convert back. In my tests, even thought the MArrays weren’t heap allocated, it’d often actually copy them onto a different part of the stack anyway for no reason.
  2. add dummy wrapper for them that supports the StridedPointer interface, and cross fingers that code gen on memory loads is okay. It wasn’t in my tests.

EDIT: Actually, I’m tired. StaticArrays should work with LoopVectorization in that they should fail check_args so that it uses a fallback array. This must have come from StrideArrays itself trying to get stridedpointers without checking first.

2 Likes

Great discussion, thanks all. StridedArrays in particular is very cool.

I was unfortunately a bit imprecise when describing the problem statement, so here’s what I was trying to get at: given code that does a bunch of out of place operations on heap-allocated (or in accelerator memory) arrays, what additional information is required to detect and re-use these already allocated buffers? Transforming intermediate allocations to allocas is a nice to have but probably not realistic given the array sizes in question (can you even do this on a GPU?). I would assume shape, array type, layout and type promotion metadata needs to be propogated through every operation? This in addition to what was mentioned above about the compiler having more insight into array allocations in the first place.

If this sounds suspiciously like the linear algebra optimizer that has come up in a couple of ML calls, that’s because it more or less is. However, I purposefully chose to focus on the lowest hanging fruit. No BLAS, input size + type + layout == output size + type + layout, no type promotion and only element-wise broadcasting operations that the compiler already knows about (if the appearance of simdloop.jl is any indication).

This has been top of mind recently as we move Flux’s optimizers from in-place to out of place operations. It would be great to not have to pull in the entire XLA compiler just to optimize a bunch of element-wise ops!

3 Likes

This is a totally undocumented implementation detail AFAICT, but I realized that you can pass a stack size to the secrete second argument to Task. So I’ve been wondering if we can do some evil things with

t0 = current_task()
t1 = Task(BIGGER_STACK_SIZE) do
    DO_SOME_COMPUTATION()
    yieldto(t0)
end
yieldto(t1)

It’s kinda like a segmented stack, but manual.

4 Likes

Hm, yeah it seems that what you’re asking for is actually rather tricky. The problem is that you don’t want Zygote to observe any in-place code, so whatever code transformation happens, needs to happen after Zygote does it’s transformations.

This is complicated because not only does it limit the number of tools that could be used to tackle this, but it’s also made harder by the fact that Zygote generates a lot of code, so the code that needs to be a analyzed for buffer re-use and such is more complicated, so complicated that Julia’s optimizer quickly hits its limits and bails out.

As long as we’re living in a pre-Diffractor.jl world, this seems like a problem that’s much better approached from the LLVM side than the Julia side, so I’d be tempted to say XLA really shouldn’t be dismissed here.

1 Like

Definitely, I singled out Zygote because you can’t easily convert code from out-of-place → in-place when working with it, but this applies equally to any old Julia function.

I have a few technical and philosophical objections to XLA itself:

  1. Building the XLA compiler is a daunting task on the best of days because Tensorflow is a big pile of spaghetti. Certain non-core components are available as JLLs, but getting the beast to cooperate with CUDA.jl and more would be no small feat.
  2. HLO is great iff. you like programming while wearing a straitjacket. It balks on moderately complex control flow.
  3. XLA favours more of a “batch compilation” model which seems anathema to Julia’s philosophy of interactivity/Just Doing Things Fast Instead of Staging Them for Offload.

Granted LLVM IR isn’t much better here, which is why you see me prattle on about MLIR so much :wink: . In particular, arrays as values and shape propogation feel like a natural fit for a language like Julia.

1 Like

By “easy to set a bigger stack size” I meant ulimit -s unlimited or some number, but this doesn’t seem to work in Julia.

This code snippet didn’t work for me (wanted to try if using test3(Val(1000)) as DO_SOME_COMPUTATION() wouldn’t crash) in that it hangs. I’ll keep this in mind though, and look into it later.
Octavian was actually getting stack overflow errors on 32-bit Julia, so it had to stop allocating temporaries on the stack in 32-bit builds and is now instead using a malloced pointer as a pretend-stack (which relies on there not being task-migration, otherwise using threadid() to pick a block won’t work). It’s probably better to keep doing this, because it’d like to run code on the main task, but that’d be worth considering for longer running computations that want a bigger stack.

1 Like