Memory caching for reducing allocations

I am experimenting with the idea of automatically reusing memory from previous allocations. For some workloads like parameter search on fixed data I know that allocations will have the same size between iterations. Preallocating often requires modifying functions deep into call stack which is quite inconvenient.

My current attempt is to have a mem_cache object with a dictionary of previously finalized arrays (or rather their underlying Memory objects). However, this requires overriding Array (or Memory) constructor which I believe leads to invalidations and is considered bad practice.

Is there a better way to achieve this? And is this approach viable overall as a targeted optimization at call site? Thanks!

using BenchmarkTools
using Random, Statistics

const mem_cache = (; lock=Threads.SpinLock(), dict=Dict{Int, Vector{Memory{Float64}}}())
const cache_on = Base.ScopedValues.ScopedValue(false) # toggle array caching

# reuse cached memoery if available, otherwise use internal `jl_alloc_array_1d` to allocate
# register finalizer that stores memory in cache
function cached_vector_allocator(nels)
    arr = lock(mem_cache.lock) do
        cache_vec = get(mem_cache.dict, nels, nothing)
        if !isnothing(cache_vec) && !isempty(cache_vec)
            Base.wrap(Array, pop!(cache_vec), (nels, ))
        else
            @ccall jl_alloc_array_1d(Vector{Float64}::Any, nels::Csize_t) :: Vector{Float64}
        end
    end
    finalizer(arr) do x
        lock(mem_cache.lock) do
            cache_vec = get!(() -> Vector{Memory{Float64}}(), mem_cache.dict, nels)
            push!(cache_vec, x.ref.mem)
        end
    end
    return arr
end

# use cached vectors for large arrays
@inline function Array{Float64, 1}(::UndefInitializer, m::Int64)
    if cache_on[] && (m >= 100_000)
        cached_vector_allocator(m)
    else
        @ccall jl_alloc_array_1d(Vector{Float64}::Any, m::Csize_t) :: Vector{Float64}
    end
 end

and benchmarking

function test_seq(arr; n_iter=100)
    for _ in 1:n_iter
        median(arr)
    end
end

function test_par(arr; n_iter=100)
    Threads.@threads for _ in 1:n_iter
        median(arr)
    end
end

Random.seed!(42)
arr = randn(5_000_000)


# Sequential benchmark
@btime test_seq($arr) # 5.907 s (900 allocations: 3.79 GiB)

GC.gc()
@btime Base.ScopedValues.with(cache_on => true) do
    empty!(mem_cache.dict)
    test_seq($arr)
end # 4.010 s (940 allocations: 642.37 MiB)


# Parallel benchmark
GC.gc()
@btime test_par(arr) # 2.250 s (922 allocations: 3.79 GiB)

GC.gc()
@btime Base.ScopedValues.with(cache_on => true) do
    empty!(mem_cache.dict)
    test_par($arr)
end # 1.198 s (955 allocations: 527.93 MiB)

Isn’t GitHub - MasonProtter/Bumper.jl: Bring Your Own Stack useful here?

Bumper.jl is useful but requires changes to your code (like placing @noescape’s throughout). In my example median(arr) makes a copy - and this is hidden in implementation which is a problem for Bumper.jl’s approach.

1 Like

So something like CUDA’s memory pool?

Memory management · CUDA.jl

Using the NVIDIA CUDA Stream-Ordered Memory Allocator, Part 1 | NVIDIA Technical Blog

To be accurate, you finalize garbage or actively freed objects; a cache has live references that prevent both. To cache Julia-level garbage, you’d need to change the internal memory management in C and rebuild Julia. However, Arrays or other containers can be garbage while their associated Memory is not, so you could indeed change the constructors for all the core containers to check a Memory cache instead of allocating every time. However, you can’t control how third-party developers allocate the public Memory in their packages, so getting down to internals has its benefits.

You’re modifying Julia’s core implementation, so this would be a fork, not a typical package we just import. I think it’s hypothetically possible for a heavily pirating and invalidating package to interactively change base Julia, but even if that doesn’t break anything, we’d have to recompile a lot. At this point, making in-place functions would be less work, but there are definitely scenarios where preallocation isn’t feasible, and if we can afford to use extra memory for caching then we’ll take the performance boost (why CUDA does it).

1 Like

You could implement a manual memory arena which you request objects from, though this won’t be “hooked into” by your dependencies. There is currently no way to replace the julia GC wholesale for e.g. one function call (and all allocations happening within).

In theory you might be able to cook something up with GPUCompiler.jl (check out AllocCheck.jl for some inspiration) to modify the generated LLVM-IR, which would allow you to replace the internal calls to the GC with calls to your custom arena. Maybe a future capability for Compiler plugins (see Pull requests · JuliaLang/julia · GitHub for previous attempts at adding this) would make this easier, but for now, there is only the hard way.

2 Likes

Thank you, I will have a look in GPUCompiler.jl based approaches.

I was wondering about the feasibility of “scoped type piracy” where the methods are redefined within particular scope / context. So in my example I would only override Array{Float64, 1}(::UndefInitializer, m::Int64) within my function call. That reminds me of Cassette.jl (there was an example changing sin to cos within a function call) and also about this talk https://www.youtube.com/watch?v=3fmwk_Wo788 (“methodtable overlays”, starting from ~18:00). But if I am not mistaken Cassette is more or less incompatible with recent julia while compiler plugins are not part of julia yet.

Another hack I was thinking of is to remember the method table before doing the piracy and then restore it back (possibly be removing entries with newer world age?). This, however, I do not know if is even possible in principle.

AllocArrays.jl can alleviate this restriction in some cases; similar on an AllocArray uses the bump allocator, so you don’t have to modify library code, only to your top-level types/code. It is a much simpler solution in some ways than tools like GPUCompiler (it is just Bumper + dispatch), but more limited.

1 Like