Seamless integration of Bumper.jl

Hi All,

I have a rather naive question. I have a function that computes the heuristic value of a state in planning using a graph neural network. The function takes a state, converts it to a graph, runs the network, and returns a single scalar value. So it seems like an ideal candidate for Bumper.jl.

So ideally, I would like to wrap my code in something like @no_escape(my code goes here) and have all allocations magically happen through the bump allocator.

Is this possible? Is there a plan for it? I could see this being useful, for example, for training neural networks.

Probably difficult, because the packages you use need to have explicit Bumper support, which most packages currently don’t have.
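
For reference, explicit Bumper.jl usage looks roughly like this (a minimal sketch; the function is just illustrative):

using Bumper

function add_one_sum(x::Vector{Int})
    @no_escape begin
        # Bump-allocate a scratch array; it must not escape this block.
        y = @alloc(Int, length(x))
        y .= x .+ 1
        sum(y)
    end
end

Every temporary has to be requested explicitly with @alloc, which is why library code needs to opt in.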

You can use AllocArrays.jl.


That would make sense. I naively tried AllocArrays, but with no luck:

julia> @benchmark map(states) do s
           with_allocator(buffer) do
               r = model(pddle(s))
               AllocArrays.reset!(buffer)
               r
           end
       end
BenchmarkTools.Trial: 6 samples with 1 evaluation.
 Range (min … max):  190.468 ms … 201.215 ms  β”Š GC (min … max): 13.92% … 17.44%
 Time  (median):     193.559 ms               β”Š GC (median):    14.00%
 Time  (mean Β± Οƒ):   194.199 ms Β±   3.948 ms  β”Š GC (mean Β± Οƒ):  14.67% Β±  1.60%

  β–ˆ   β–ˆ    β–ˆ               β–ˆβ–ˆ                                 β–ˆ
  β–ˆβ–β–β–β–ˆβ–β–β–β–β–ˆβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ˆβ–ˆβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ˆ ▁
  190 ms           Histogram: frequency by time          201 ms <

 Memory estimate: 985.04 MiB, allocs estimate: 20556.

julia> @benchmark map(model ∘ pddle, states)
BenchmarkTools.Trial: 6 samples with 1 evaluation.
 Range (min … max):  182.934 ms … 192.560 ms  β”Š GC (min … max): 10.02% … 14.59%
 Time  (median):     188.578 ms               β”Š GC (median):    11.65%
 Time  (mean Β± Οƒ):   188.544 ms Β±   3.385 ms  β”Š GC (mean Β± Οƒ):  11.97% Β±  1.61%

  β–ˆ                           β–ˆ β–ˆ          β–ˆ          β–ˆ       β–ˆ
  β–ˆβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ˆβ–β–ˆβ–β–β–β–β–β–β–β–β–β–β–ˆβ–β–β–β–β–β–β–β–β–β–β–ˆβ–β–β–β–β–β–β–β–ˆ ▁
  183 ms           Histogram: frequency by time          193 ms <

 Memory estimate: 985.04 MiB, allocs estimate: 20137.

I think we should probably ask @ericphanson for more help. But in general, you are still at the mercy of Bumper.jl’s limitations. For full control, we should wait until MMTk (which allows custom GC heuristics) is integrated as a GC backend; then we can see how developers are able to use it.

It’s hard to tell: are you actually using an AllocArray there?

How do I check this?
It seems to me that I am not using it.

Well, it’s hard to say without your actual code. The way AllocArrays works is that you create an AllocArray yourself, pass it into your code, and then whenever the code calls similar, that call will use the bump allocator to produce a new AllocArray with bump-allocated memory. For example:

using Flux, AllocArrays

model = Chain(Dense(1 => 23, tanh), Dense(23 => 1; bias=true), only)
data = [[x] for x in -2:0.001f0:2]
alloc_data = AllocArray.(data)  # wrap the inputs so downstream `similar` calls bump-allocate

function run_model(b, model, data)
    with_allocator(b) do
        result = sum(model, data)
        reset!(b)  # free all bump-allocated memory at once
        return result
    end
end

b = BumperAllocator(2^25) # 32 MiB
run_model(b, model, alloc_data)

The idea here is that we can skip the intermediate allocations that may be hidden in library code, as long as that library code allocates via similar.
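
To make the mechanism concrete, here is a hand-rolled sketch (my own illustration, not from AllocArrays’ docs):

using AllocArrays

b = BumperAllocator(2^20)          # 1 MiB buffer (illustrative size)
a = AllocArray(rand(Float32, 10))

with_allocator(b) do
    tmp = similar(a)               # inside with_allocator, this is bump-allocated
    tmp .= 2 .* a
    s = sum(tmp)                   # reduce to a plain scalar before resetting
    reset!(b)                      # invalidates tmp; the scalar is safe to return
    s
end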

See here for a more complicated example, using this for low-allocation inference of a Flux model.


Now I see, thanks a lot, this makes perfect sense.

That might be a bit complicated to change though.

Would it just be this?
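
Something along these lines, assuming pddle(s) returns a plain array that can be wrapped (names taken from the benchmark above):

map(states) do s
    with_allocator(buffer) do
        r = model(AllocArray(pddle(s)))  # wrap the input so intermediates bump-allocate
        AllocArrays.reset!(buffer)
        r
    end
end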

edit: though make sure r is a scalar here, not a 1-element array or such! You can use a CheckedAllocArray for a slow version that verifies all memory accesses are safe (useful for testing).
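
For example, reusing the names from the Flux sketch above (just an illustration; expect it to be much slower):

checked_data = CheckedAllocArray.(data)  # validates every access against live memory
run_model(b, model, checked_data)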


Nice, this is neat.


BTW if you get it working, I’d be interested to see the before-and-after benchmarking 🙂. When I tried it on that Flux model I saw 100x less allocation but similar runtime; I’m curious how it fares in other settings.

Sorry for the slightly off-topic question: is Bumper.jl still developed? It seems that the last activity was 7 months ago.

It seems pretty featureful at this point, and none of the open issues look like bad bugs, so my assumption is that it is reasonably mature and that code changes would be motivated by compat updates, new ideas, or bugfixes. Therefore, 7 months of inactivity doesn’t seem like a bad sign. But @Mason could probably say more.
