It is known that push! is not very fast. There are various ways of avoiding push!, all of which make the code uglier. Unfortunately, I get a 30% slowdown in my real code (graph search) when using push!, even though the push! there targets long-lived temporaries and hence almost never triggers a reallocation.
Since I did some microbenchmarks on this, I thought I would post them here and see what you think, especially whether the benchmarks are misleading and whether anything can be done about this.
My test case is filtering a Vector, either into a newly allocated output or into a pre-allocated output vector.
So, let’s set up some data:
using StaticArrays
using BenchmarkTools
srand(0)
V1=rand(Float64, 20_000);
V4=rand(SVector{4,Float64}, 20_000);
V8=rand(SVector{8,Float64}, 20_000);
@inline filterfun1(x) = (@fastmath x[1] < 0.7)
# Predicate with parameters: true if x lies within distance r of the point v
struct filtparams{N}
    v::SVector{N,Float64}
    r::Float64
end
@inline function (fp::filtparams{N})(x::SVector{N,Float64}) where {N}
    s = 0.0
    @inbounds @fastmath @simd for i in 1:N
        s += (x[i] - fp.v[i]) * (x[i] - fp.v[i])
    end
    return @fastmath sqrt(s) < fp.r
end
filterfun4 = filtparams{4}(rand(SVector{4,Float64}), 0.95)
filterfun8 = filtparams{8}(rand(SVector{8,Float64}), 1.15);
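As a quick plausibility check (not timed), the predicates accept roughly the fractions reported in the output below:
count(filterfun1, V1)  # ≈ 14_000 of 20_000, cf. the "filter N elements" lines below
count(filterfun4, V4)  # ≈ 15_900
count(filterfun8, V8)  # ≈ 16_400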
I will test the built-in filter, as well as the following variants:
function fastfilt(pred, V, out=Vector{eltype(V)}())
    # over-allocate to length(V), write accepted elements in place, shrink at the end
    resize!(out, length(V))
    outlen = 0
    pos = 0
    @inbounds while pos < length(V)
        pos += 1
        if pred(V[pos])
            outlen += 1
            out[outlen] = V[pos]
        end
    end
    resize!(out, outlen)
    out
end
function fastfilt_branchfree(pred, V, out=Vector{eltype(V)}())
    # same idea as fastfilt, but branch-free: always store the element,
    # and only advance the output index when the predicate holds
    resize!(out, length(V))
    outlen = 1
    pos = 0
    @inbounds while pos < length(V)-1
        pos += 1
        out[outlen] = V[pos]
        outlen += ifelse(pred(V[pos]), 1, 0)
    end
    outlen -= 1
    # the last element gets an ordinary branchy check
    if pred(V[end])
        outlen += 1
        out[outlen] = V[end]
    end
    resize!(out, outlen)
    out
end
function pushfilt(pred, V, out=Vector{eltype(V)}())
    resize!(out, 0)
    @inbounds for v in V
        pred(v) && push!(out, v)
    end
    out
end
and test them in the following way:
function runtests(pred, V)
    println("copy $(length(V)) elements / $(sizeof(V)) bytes")
    @btime copy($V)
    out = similar(V)
    ell = length(filter(pred, V))
    println("filter $(ell) elements")
    @btime filter($pred, $V)
    println("pushfilter")
    @btime pushfilt($pred, $V)
    @btime pushfilt($pred, $V, $out)
    println("fastfilter")
    @btime fastfilt($pred, $V)
    @btime fastfilt($pred, $V, $out)
    println("fastfilter_branchfree")
    @btime fastfilt_branchfree($pred, $V)
    @btime fastfilt_branchfree($pred, $V, $out)
    println("\n")
    nothing
end
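Before timing anything, it is worth a quick check (not part of runtests) that the variants agree with the built-in filter, e.g.:
filter(filterfun1, V1) == fastfilt(filterfun1, V1) ==
    fastfilt_branchfree(filterfun1, V1) == pushfilt(filterfun1, V1)   # should be true
and similarly for the other predicate/vector pairs.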
Now, let’s run some tests.
For simply filtering floats with a basically free predicate, we get:
runtests(filterfun1, V1)
copy 20000 elements / 160000 bytes
15.445 μs (2 allocations: 156.33 KiB)
filter 14016 elements
202.288 μs (14 allocations: 256.70 KiB)
pushfilter
189.827 μs (14 allocations: 256.70 KiB)
171.366 μs (0 allocations: 0 bytes)
fastfilter
93.226 μs (2 allocations: 156.39 KiB)
90.069 μs (0 allocations: 0 bytes)
fastfilter_branchfree
27.224 μs (2 allocations: 156.39 KiB)
26.546 μs (0 allocations: 0 bytes)
For the somewhat larger struct, we get:
runtests(filterfun1, V4)
runtests(filterfun4, V4)
copy 20000 elements / 640000 bytes
73.337 μs (2 allocations: 625.08 KiB)
filter 14022 elements
304.179 μs (14 allocations: 1.00 MiB)
pushfilter
277.417 μs (14 allocations: 1.00 MiB)
207.107 μs (0 allocations: 0 bytes)
fastfilter
117.116 μs (2 allocations: 625.14 KiB)
111.964 μs (0 allocations: 0 bytes)
fastfilter_branchfree
74.003 μs (2 allocations: 625.14 KiB)
67.018 μs (0 allocations: 0 bytes)
copy 20000 elements / 640000 bytes
72.549 μs (2 allocations: 625.08 KiB)
filter 15868 elements
400.494 μs (14 allocations: 1.00 MiB)
pushfilter
385.974 μs (14 allocations: 1.00 MiB)
323.148 μs (0 allocations: 0 bytes)
fastfilter
196.761 μs (2 allocations: 625.14 KiB)
188.847 μs (0 allocations: 0 bytes)
fastfilter_branchfree
134.430 μs (2 allocations: 625.14 KiB)
143.983 μs (0 allocations: 0 bytes)
and for the even larger one:
runtests(filterfun1, V8)
runtests(filterfun8, V8)
copy 20000 elements / 1280000 bytes
145.466 μs (2 allocations: 1.22 MiB)
filter 14029 elements
500.910 μs (14 allocations: 2.00 MiB)
pushfilter
451.093 μs (14 allocations: 2.00 MiB)
256.953 μs (0 allocations: 0 bytes)
fastfilter
164.108 μs (2 allocations: 1.22 MiB)
149.048 μs (0 allocations: 0 bytes)
fastfilter_branchfree
143.316 μs (2 allocations: 1.22 MiB)
117.722 μs (0 allocations: 0 bytes)
copy 20000 elements / 1280000 bytes
145.203 μs (2 allocations: 1.22 MiB)
filter 16425 elements
783.321 μs (15 allocations: 3.00 MiB)
pushfilter
733.565 μs (15 allocations: 3.00 MiB)
447.273 μs (0 allocations: 0 bytes)
fastfilter
299.561 μs (2 allocations: 1.22 MiB)
281.358 μs (0 allocations: 0 bytes)
fastfilter_branchfree
245.805 μs (2 allocations: 1.22 MiB)
234.719 μs (0 allocations: 0 bytes)
What do we take from this? To me, the most disturbing part is the variant where we filter into a pre-allocated vector of sufficient size: no reallocation can ever happen there, yet push! is still far slower, so it is the ccall itself that is killing us. I am also quite surprised that the ccall overhead does not go through the roof for the larger structs, in a context where we profit quite a lot from keeping values in our valuable registers.
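For reference, push! on a Vector boils down to roughly the following in Base on 0.6 (paraphrased from memory, not verbatim): the array is grown by one element via a ccall into the runtime, whether or not the existing buffer already has enough capacity.
function push!(a::Array{T,1}, item) where T
    itemT = convert(T, item)                             # convert first
    ccall(:jl_array_grow_end, Void, (Any, UInt), a, 1)   # ccall happens even if capacity suffices
    a[end] = itemT
    return a
end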
Also, avoiding push! appears to be worth it in inner loops, even if you need to implement your own exponential resizing scheme (a minimal sketch of what I mean follows below). Branch-free versions are quite situational, so I am not sure whether to advertise them. If the vector-resizing operations became built-in, one could maybe avoid a lot of the overhead (calling into the C runtime only when an actual memory reallocation is necessary). If Vector could stop storing the redundant size(V,1) and use only length(V), one could get rid of even more overhead; I am not sure whether the latter is possible without causing problems elsewhere.
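Here is a minimal sketch of the "roll your own exponential resizing" idea (grow_push! and growfilt are made-up names, and this variant is not among the benchmarks above): keep an explicit fill counter and only call resize! when the current length is exhausted, so the call into the runtime happens O(log n) times instead of once per element.
# Hypothetical helper, just to illustrate the idea.
@inline function grow_push!(out::Vector{T}, n::Int, x::T) where {T}
    n += 1
    if n > length(out)
        resize!(out, 2 * length(out) + 1)  # exponential growth; only here do we hit the runtime
    end
    @inbounds out[n] = x
    return n                               # the caller keeps track of the fill level
end

function growfilt(pred, V, out=Vector{eltype(V)}())
    n = 0
    @inbounds for v in V
        if pred(v)
            n = grow_push!(out, n, v)
        end
    end
    resize!(out, n)   # shrink to the actual number of accepted elements
    out
end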
PS: I benchmarked against copy, not copy!, which is why the filter variants have any chance of beating the copy baseline at all. All timings are on Julia 0.6.2.