Avoiding Allocations with User-Provided Closures

I’m writing a query API that allows a user to pass in a closure to be run on every entry satisfying the query. However, this usually results in undesirable boxing. For example, here’s the kind of API I want to offer, with the query replaced by a simple foreach function.

function foreach(f, x)
    for i = 1:length(x)
        f(x[i])
    end
end

function bench(x)
    count = 0
    @time foreach(x) do val
        count += 1
    end
    count
end

x = rand(1000)
bench(x)
#   0.000005 seconds (489 allocations: 7.641 KiB)

Now, I know why this is allocating: Ints aren’t mutable, so the compiler boxes count so that the closure can reassign it. But this is a rather non-intuitive bug for my users. I want to provide this kind of API rather than some pre-canned count etc. function, because it is flexible and the compiler should be able to optimise for it. How can I provide an API like this that doesn’t suck performance-wise?

Is this performance of captured variables in closures · Issue #15276 · JuliaLang/julia · GitHub? If so, maybe educate your users until the bug is fixed? See FastClosures.jl
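For reference, FastClosures.jl provides a @closure macro that works around issue #15276 for read-only captures by rewriting the closure with a let block. A minimal sketch (note that it doesn’t help when the closure reassigns the captured variable, as count += 1 does here):

```julia
using FastClosures

function pick(flag)
    # y is assigned on two branches, which triggers the Core.Box
    # slowdown from issue #15276 when y is captured by a plain closure.
    if flag
        y = 2
    else
        y = 3
    end
    # @closure rewrites this to `let y = y; () -> y + 1; end`,
    # so y is captured as a local binding instead of being boxed.
    g = @closure () -> y + 1
    return g()
end

pick(true)   # == 3
```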

Yeah, you may just need to help your users out a little bit. A good rule of thumb, which might not be hard for you to convey to your users, is that anything the closure modifies should be wrapped in a Ref:

julia> function foreach(f, x)
           for i = 1:length(x)
               f(x[i])
           end
       end
foreach (generic function with 1 method)

julia> function bench(x)
           count = Ref(0)
           foreach(x) do val
               count[] += 1
           end
           count[]
       end
bench (generic function with 1 method)

julia> @btime bench($x)
  253.535 ns (0 allocations: 0 bytes)
1000

(note that in this case, the compiler is even able to skip actually allocating the Ref, which is pretty cool).

If you want to get really fancy, you can actually check fieldtypes(typeof(f)) and look for Core.Box in order to print a helpful warning to your users. This might even be a legitimate use case for an @generated function, e.g.:

julia> box_warn(x) = nothing
box_warn (generic function with 1 method)

julia> @generated function box_warn(::T) where {T <: Function}
         if Core.Box in fieldtypes(T)
           :(@warn("Oh no! A Core.Box!"))
         end
       end
box_warn (generic function with 2 methods)

julia> function bench_bad(x)
           count = 0
           foreach(x) do val
               count += 1
           end
           count
       end
bench_bad (generic function with 1 method)

julia> function bench_good(x)
           count = Ref(0)
           foreach(x) do val
               count[] += 1
           end
           count[]
       end
bench_good (generic function with 1 method)

julia> bench_bad(x)
┌ Warning: Oh no! A Core.Box!
└ @ Main REPL[49]:3
1000

julia> bench_good(x)
1000

But I’m not sure this level of complexity is actually a good idea.

Edit: you also need to call box_warn from within foreach, e.g.

function foreach(f, x)
    box_warn(f)
    for i = 1:length(x)
        f(x[i])
    end
end

(Thanks @Mason)


@rdeits Did you mean to include a redefinition of foreach in your code that uses box_warn?

If user functions need some state (like count), I believe it’s much better to design the API around foldl (or reduce) rather than foreach. The example in the OP would look like:

count = foldl(x; init=0) do count, val
    count += 1
end

Slightly less trivial example:

(count, sum) = foldl(x; init=(0, false)) do (count, sum), val
    count += 1
    sum += val
    (count, sum)
end

Making this kind of operation composable would lead you to transducers. FYI, here is my implementation: https://github.com/tkf/Transducers.jl
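For a flavour of that style (a sketch based on my reading of the Transducers.jl documentation; the exact API may have changed since):

```julia
using Transducers

# Map is a transducer: a reusable, composable description of
# "apply 2x to each element", independent of any particular reduction.
xf = Map(x -> 2x)

# foldl fuses the transducer with the reducing step, so no
# intermediate array is allocated.
foldl(+, xf, 1:3)  # == 2 + 4 + 6 == 12
```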


Fixed, thanks!

Thanks all!

So to summarise: capturing variables in closures is a known performance issue. The direct solution is to wrap any captured variable that the closure modifies in a Ref, whose allocation the compiler can often optimise away.

The indirect (and perhaps more Julian?) solution would be to separate out this particular usage of the query API as a more “functional” reduce/foldl, which I think makes sense in the example I gave, because logically the user is computing a reduction.

Users would still have the option to pass in a plain closure, but they’d have to be careful about capturing mutable variables.
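As a sketch, the two API styles side by side (hypothetical names; queryforeach and queryfold just stand in for whatever the real query API looks like):

```julia
# Callback style: any state the closure modifies must go through a Ref
# to avoid boxing.
function queryforeach(f, results)
    for r in results
        f(r)
    end
    return nothing
end

# Fold style: state is threaded explicitly through the accumulator,
# so nothing is captured and nothing needs to be boxed.
function queryfold(f, results; init)
    acc = init
    for r in results
        acc = f(acc, r)
    end
    return acc
end

results = rand(1000)

n = Ref(0)
queryforeach(r -> (n[] += 1), results)

m = queryfold((acc, r) -> acc + 1, results; init = 0)
# n[] == m == 1000
```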


Bumping this back alive for a moment.

I’m also trying to avoid allocations when updating a local variable from within a closure. However, whilst the Ref trick works in a minimal working example, in my real code the compiler isn’t clever enough to avoid allocating the Ref.

Are there any obvious conditions under which the compiler can’t elide the heap allocation of the Ref?

(I’m avoiding posting the code since it’s a fair chunk.)

Feel free to post whatever code might be relevant and debug info you’ve been able to produce into a [details] block. For example:

Details hidden. Click to expand.

Put some code here.

This is made with:

[details="Details hidden. Click to expand."]
Put some `code` here.
[/details]

Alright, well, I have an MWE anyway.

As you can see, each call to search() allocates. Moreover, if I allocate the found Ref globally first and pass it into the function during benchmarking, the whole benchmark registers 0 allocations.

Why is the allocation not being elided in this case?

using StaticArrays
using BenchmarkTools

const Position = SVector{2, Int8}

function search(gamestate, position)
    found = Ref{Bool}(false)
    iterator(gamestate) do piece
        found[] = found[] || (piece == position)
    end
    return found[]
end

function iterator(f, gamestate)
    for piece in gamestate
        f(piece)
    end
    return nothing
end

gamestate = [Position(rand(Int8, 2)...) for _ in 1:200]
needle = Position(1, 2)

b = @benchmark search($gamestate, $needle)
display(b)

Benchmarking shows the one pesky allocation, although --track-allocations shows a lot of 0s.

BenchmarkTools.Trial: 10000 samples with 918 evaluations.
 Range (min … max):  116.427 ns … 929.549 ns  ┊ GC (min … max): 0.00% … 87.12%
 Time  (median):     121.033 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   121.472 ns ±  12.790 ns  ┊ GC (mean ± σ):  0.18% ±  1.49%

            ▁▃▄▃▄▆▇█▇▇█▅▅▅▅▃▃▁                                   
  ▂▂▂▂▂▃▃▄▅▇███████████████████▇▅▆▄▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂ ▄
  116 ns           Histogram: frequency by time          131 ns <

 Memory estimate: 16 bytes, allocs estimate: 1.

If you do @inline function iterator(...), the allocation also disappears. However, it becomes slower (using Julia 1.8.4). Not sure why.

using StaticArrays
using BenchmarkTools

const Position = SVector{2, Int8}

function search(gamestate, position)
    found = Ref{Bool}(false)
    iterator(gamestate) do piece
        found[] = found[] || (piece == position)
    end
    return found[]
end

functor(found, position) =
  let f = found, p = position
    function(piece)
      f[] |= (piece == p)
      nothing
    end
  end

function search2(gamestate, position)
    found = Ref{Bool}(false)
    iterator(functor(found, position), gamestate)
    return found[]
end

function search3(gamestate, position)
    found = Ref{Bool}(false)
    @inline iterator(functor(found, position), gamestate)
    return found[]
end

function iterator(f, gamestate)
    for piece in gamestate
        f(piece)
    end
    return nothing
end

search3 is faster and doesn’t allocate. The only difference between search2 and search3 is the inlining hint.
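One way to rationalise this (a sketch of the escape-analysis intuition, with hypothetical helper names; whether each allocation is actually elided depends on the Julia version):

```julia
# The Ref escapes into a deliberately non-inlined call, so the
# optimiser can't prove it stays local and it is likely to be
# heap-allocated.
@noinline bump_noinline!(r) = (r[] += 1; nothing)

function escaping()
    r = Ref(0)
    bump_noinline!(r)
    return r[]
end

# With inlining, the whole lifetime of the Ref is visible to the
# optimiser, so the allocation can typically be removed.
@inline bump_inline!(r) = (r[] += 1; nothing)

function local_only()
    r = Ref(0)
    bump_inline!(r)
    return r[]
end
```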


Thanks for the replies here.

In my case, both for the MWE and my actual code, adding @inline at either the call site or on the function definition has elided the allocation.
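For reference, the definition-site form applied to the MWE above looks like this (a sketch; as noted earlier, inlining behaviour and its effect on timings can vary between Julia versions):

```julia
using StaticArrays

const Position = SVector{2, Int8}

# @inline on the definition means every caller inlines iterator, so
# the Ref in search never escapes and its allocation can be elided.
@inline function iterator(f, gamestate)
    for piece in gamestate
        f(piece)
    end
    return nothing
end

function search(gamestate, position)
    found = Ref(false)
    iterator(gamestate) do piece
        found[] = found[] || (piece == position)
    end
    return found[]
end

gamestate = [Position(rand(Int8, 2)...) for _ in 1:200]
search(gamestate, first(gamestate))  # == true
```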