I’m writing a query API that lets a user pass in a closure to be run on every entry satisfying the query; however, this usually results in undesirable boxing. For example, here’s the kind of API I want to offer, with the query replaced by a simple foreach loop:
function foreach(f, x)
    for i = 1:length(x)
        f(x[i])
    end
end

function bench(x)
    count = 0
    @time foreach(x) do val
        count += 1
    end
    count
end
x = rand(1000)
bench(x)
# 0.000005 seconds (489 allocations: 7.641 KiB)
Now, I know why this is allocating: Ints are immutable and count is reassigned inside the closure, so the compiler boxes count so that it can be updated from the closure. But this is a rather non-intuitive gotcha for my users. I want to provide this kind of API, rather than some pre-canned count function and friends, because it’s flexible and the compiler should be able to optimise for it. How can I offer an API like this that doesn’t suck performance-wise?
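For reference, the boxing is visible in the closure’s captured fields; here’s a minimal sketch (not part of my API) showing it:

function show_capture()
    count = 0
    f = val -> (count += 1)   # `count` is reassigned inside the closure...
    fieldtypes(typeof(f))     # ...so it is captured as a Core.Box: this returns (Core.Box,)
end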
Yeah, you may just need to help your users out a little bit. A good rule of thumb, which might not be hard for you to convey to your users, is that anything the closure modifies should be wrapped in a Ref:
julia> function foreach(f, x)
           for i = 1:length(x)
               f(x[i])
           end
       end
foreach (generic function with 1 method)
julia> function bench(x)
           count = Ref(0)
           foreach(x) do val
               count[] += 1
           end
           count[]
       end
bench (generic function with 2 methods)
julia> @btime bench($x)
253.535 ns (0 allocations: 0 bytes)
1000
(note that in this case, the compiler is even able to skip actually allocating the Ref, which is pretty cool).
If you want to get really fancy, you can actually check fieldtypes(typeof(f)) and look for Core.Box in order to print a helpful warning to your users. This might even be a legitimate use case for an @generated function, e.g.:
julia> box_warn(x) = nothing
box_warn (generic function with 3 methods)
julia> @generated function box_warn(::T) where {T <: Function}
           if Core.Box in fieldtypes(T)
               :(@warn("Oh no! A Core.Box!"))
           end
       end
box_warn (generic function with 3 methods)
julia> function bench_bad(x)
           count = 0
           foreach(x) do val
               count += 1
           end
           count
       end
bench_bad (generic function with 1 method)
julia> function bench_good(x)
           count = Ref(0)
           foreach(x) do val
               count[] += 1
           end
           count[]
       end
bench_good (generic function with 1 method)
julia> bench_bad(x)
┌ Warning: Oh no! A Core.Box!
└ @ Main REPL[49]:3
1000
julia> bench_good(x)
1000
But I’m not sure this level of complexity is actually a good idea.
Edit: you also need to call box_warn inside foreach itself, along these lines (sketch):
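function foreach(f, x)
    box_warn(f)              # generated body is empty when `f` has no boxed captures, so this is a no-op
    for i = 1:length(x)
        f(x[i])
    end
end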
If user functions need some state (like count), I believe it’s much better to design the API around foldl (or reduce) rather than foreach. The example in the OP would look like:
count = foldl(x; init=0) do count, val
    count += 1
end
Slightly less trivial example:
(count, sum) = foldl(x; init=(0, false)) do (count, sum), val
    count += 1
    sum += val
    (count, sum)
end
So to summarise, reassigning captured variables inside closures is a known performance issue. The direct solution is to wrap any variable the closure modifies in a Ref, whose allocation the compiler can often optimise away.
The indirect (and perhaps more Julian?) solution would be to separate out this particular usage of the query API as a more “functional” reduce/foldl, which I think makes sense in the example I gave, because logically the user is computing a reduction.
Users would still have the option to pass in a plain closure, but they’d have to be careful about capturing mutable variables.
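To make the second option concrete, here is a rough sketch of what a fold-style entry point could look like. Query, matches, and queryfoldl are illustrative names invented for this sketch, not an existing API:

struct Query{F}
    predicate::F
end

matches(q::Query, entry) = q.predicate(entry)

function queryfoldl(op, q::Query, entries; init)
    acc = init
    for entry in entries
        matches(q, entry) || continue   # only visit entries satisfying the query
        acc = op(acc, entry)            # the user-supplied reducer carries the state
    end
    return acc
end

# Counting matches without capturing (and boxing) a local variable:
entries = rand(1000)
q = Query(x -> x > 0.5)
nmatches = queryfoldl((count, _) -> count + 1, q, entries; init = 0)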
I’m also trying to avoid allocations when updating a local variable from within a closure. However, whilst the Ref trick works in a minimal working example, in my real code the compiler isn’t clever enough to avoid allocating the Ref.
Are there any obvious conditions that stop the compiler from keeping the Ref on the stack (i.e. from eliding the heap allocation)?
(I’m avoiding posting the code since it’s a fair chunk.)
In the example below, each call to search() allocates. Moreover, if I allocate the found Ref globally first and pass it into the function during benchmarking, the whole benchmark registers 0 allocations.
Why is the allocation not being elided in this case?
using StaticArrays
using BenchmarkTools

const Position = SVector{2, Int8}

function search(gamestate, position)
    found = Ref{Bool}(false)
    iterator(gamestate) do piece
        found[] = found[] || (piece == position)
    end
    return found[]
end

function iterator(f, gamestate)
    for piece in gamestate
        f(piece)
    end
    return nothing
end

gamestate = [Position(rand(Int8, 2)...) for _ in 1:200]
needle = Position(1, 2)

b = @benchmark search($gamestate, $needle)
display(b)
Benchmarking shows one pesky allocation, although --track-allocations shows a lot of 0s.
BenchmarkTools.Trial: 10000 samples with 918 evaluations.
Range (min … max): 116.427 ns … 929.549 ns ┊ GC (min … max): 0.00% … 87.12%
Time (median): 121.033 ns ┊ GC (median): 0.00%
Time (mean ± σ): 121.472 ns ± 12.790 ns ┊ GC (mean ± σ): 0.18% ± 1.49%
▁▃▄▃▄▆▇█▇▇█▅▅▅▅▃▃▁
▂▂▂▂▂▃▃▄▅▇███████████████████▇▅▆▄▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂ ▄
116 ns Histogram: frequency by time 131 ns <
Memory estimate: 16 bytes, allocs estimate: 1.
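For comparison, here is the same search written three ways: the original do-block closure, an explicit functor, and the functor with an inlining hint at the call site.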
using StaticArrays
using BenchmarkTools

const Position = SVector{2, Int8}

function search(gamestate, position)
    found = Ref{Bool}(false)
    iterator(gamestate) do piece
        found[] = found[] || (piece == position)
    end
    return found[]
end

functor(found, position) =
    let f = found, p = position
        function (piece)
            f[] |= (piece == p)
            nothing
        end
    end

function search2(gamestate, position)
    found = Ref{Bool}(false)
    iterator(functor(found, position), gamestate)
    return found[]
end

function search3(gamestate, position)
    found = Ref{Bool}(false)
    @inline iterator(functor(found, position), gamestate)
    return found[]
end

function iterator(f, gamestate)
    for piece in gamestate
        f(piece)
    end
    return nothing
end
search3 is faster and doesn’t allocate. The only difference between search2 and search3 is the inlining hint.
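Note that the call-site @inline used in search3 requires Julia 1.8 or newer. If you control iterator, an alternative (a sketch, not taken from the thread above) is to put the @inline hint on the definition itself, which should have the same effect for every caller in this example, though @inline is still only a hint to the compiler:

@inline function iterator(f, gamestate)
    # Inlining the loop into the caller lets the compiler see that the functor's
    # Ref never escapes, so its allocation can be elided; the trade-off is that
    # this body gets inlined into every caller.
    for piece in gamestate
        f(piece)
    end
    return nothing
end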