Speed of push! and filter

performance

#1

It is known that push! is not very fast. There are various ways of avoiding push!, all of which make code uglier. Unfortunately, I get a 30% slowdown in my real code when using push! (graph search; the push! is into long-lived temporaries, and hence almost never reallocs).

Since I did some microbenchmarks on this, I thought to post them here, and see what you think, especially whether the benchmarks are misleading and whether something can be done about this.

My testcase will be to filter a Vector; either into a newly allocated output, or into a pre-allocated output vector.

So, let’s set up some data:

using StaticArrays
using BenchmarkTools
srand(0)
V1=rand(Float64, 20_000);
V4=rand(SVector{4,Float64}, 20_000);
V8=rand(SVector{8,Float64}, 20_000);

@inline filterfun1(x) = (@fastmath x[1]<0.7 )


struct filtparams{N}
    v::SVector{N,Float64}
    r::Float64
end

@inline function (fp::filtparams{N})(x::SVector{N,Float64}) where {N}
    s = 0.0
    @inbounds @fastmath @simd for i in 1:N
        s += (x[i] - fp.v[i]) *(x[i] - fp.v[i])
    end
    return @fastmath sqrt(s) < fp.r
end


filterfun4 = filtparams{4}(rand(SVector{4,Float64}), 0.95)
filterfun8 = filtparams{8}(rand(SVector{8,Float64}), 1.15);

I will test the built in filter, as well as the following variants:

function fastfilt(pred, V, out=Vector{eltype(V)}())
    resize!(out, length(V))
    outlen = 0
    pos = 0
    @inbounds while pos < length(V)
        pos += 1
        if pred(V[pos])
            outlen += 1
            out[outlen] = V[pos]
        end
    end
    resize!(out, outlen)
    out
end

function fastfilt_branchfree(pred, V, out=Vector{eltype(V)}())
    resize!(out, length(V))
    outlen = 1
    pos = 0
    @inbounds while pos < length(V)-1
        pos += 1
        out[outlen] = V[pos]
        outlen += ifelse(pred(V[pos]), 1, 0)
    end
    outlen -= 1
    if pred(V[end])
        outlen += 1
        out[outlen] = V[end]
    end
    resize!(out, outlen)
    out
end

function pushfilt(pred, V, out=Vector{eltype(V)}())   
    resize!(out, 0)
    @inbounds for v in V
        pred(v) && push!(out, v)
    end
    out
end

and test in the following way:

function runtests(pred, V)
    println("copy $(length(V)) elements / $(sizeof(V)) bytes")
    @btime copy($V)
    out = similar(V)
    ell = length(filter(pred, V))
    println("filter $(ell) elements")   
    @btime filter($pred,$V)
    
    println("pushfilter")
    @btime pushfilt($pred,$V)
    @btime pushfilt($pred,$V, $out)


    println("fastfilter")
    @btime fastfilt($pred,$V)
    @btime fastfilt($pred,$V, $out)
    
    println("fastfilter_branchfree")
    @btime fastfilt_branchfree($pred,$V)
    @btime fastfilt_branchfree($pred,$V, $out)    
    
    println("\n")
    nothing
    
end

Now, let’s run some tests.

For simply filtering floats with basically free predicate, we get:

runtests(filterfun1, V1)

copy 20000 elements / 160000 bytes
  15.445 μs (2 allocations: 156.33 KiB)
filter 14016 elements
  202.288 μs (14 allocations: 256.70 KiB)
pushfilter
  189.827 μs (14 allocations: 256.70 KiB)
  171.366 μs (0 allocations: 0 bytes)
fastfilter
  93.226 μs (2 allocations: 156.39 KiB)
  90.069 μs (0 allocations: 0 bytes)
fastfilter_branchfree
  27.224 μs (2 allocations: 156.39 KiB)
  26.546 μs (0 allocations: 0 bytes)

For the somewhat larger struct, we get:

runtests(filterfun1, V4)
runtests(filterfun4, V4)

copy 20000 elements / 640000 bytes
  73.337 μs (2 allocations: 625.08 KiB)
filter 14022 elements
  304.179 μs (14 allocations: 1.00 MiB)
pushfilter
  277.417 μs (14 allocations: 1.00 MiB)
  207.107 μs (0 allocations: 0 bytes)
fastfilter
  117.116 μs (2 allocations: 625.14 KiB)
  111.964 μs (0 allocations: 0 bytes)
fastfilter_branchfree
  74.003 μs (2 allocations: 625.14 KiB)
  67.018 μs (0 allocations: 0 bytes)


copy 20000 elements / 640000 bytes
  72.549 μs (2 allocations: 625.08 KiB)
filter 15868 elements
  400.494 μs (14 allocations: 1.00 MiB)
pushfilter
  385.974 μs (14 allocations: 1.00 MiB)
  323.148 μs (0 allocations: 0 bytes)
fastfilter
  196.761 μs (2 allocations: 625.14 KiB)
  188.847 μs (0 allocations: 0 bytes)
fastfilter_branchfree
  134.430 μs (2 allocations: 625.14 KiB)
  143.983 μs (0 allocations: 0 bytes)

and for the even larger one:

runtests(filterfun1, V8)
runtests(filterfun8, V8)

copy 20000 elements / 1280000 bytes
  145.466 μs (2 allocations: 1.22 MiB)
filter 14029 elements
  500.910 μs (14 allocations: 2.00 MiB)
pushfilter
  451.093 μs (14 allocations: 2.00 MiB)
  256.953 μs (0 allocations: 0 bytes)
fastfilter
  164.108 μs (2 allocations: 1.22 MiB)
  149.048 μs (0 allocations: 0 bytes)
fastfilter_branchfree
  143.316 μs (2 allocations: 1.22 MiB)
  117.722 μs (0 allocations: 0 bytes)


copy 20000 elements / 1280000 bytes
  145.203 μs (2 allocations: 1.22 MiB)
filter 16425 elements
  783.321 μs (15 allocations: 3.00 MiB)
pushfilter
  733.565 μs (15 allocations: 3.00 MiB)
  447.273 μs (0 allocations: 0 bytes)
fastfilter
  299.561 μs (2 allocations: 1.22 MiB)
  281.358 μs (0 allocations: 0 bytes)
fastfilter_branchfree
  245.805 μs (2 allocations: 1.22 MiB)
  234.719 μs (0 allocations: 0 bytes)

What do we take from this? To me, the most disturbing part is the variant where we filter into a pre-allocated vector of sufficient size; there, it is the ccall that is killing us. I am quite surprised that the ccall overhead is not going through the roof for larger structs, in a context where we profit quite a lot from our valuable registers.

Also, avoiding push appears to be worth it for inner loops, even if you need to implement your own exponential resizing scheme. Branch-free versions are quite situational, so I’m not sure whether to advertise them. If the vector-resizing operations became builtin, then one could maybe avoid a lot of the overhead (call into C-runtime only if an actual memory resize is necessary). If vector could stop storing the useless size(V,1), using length(V) only, then one could get rid of even more of the overhead. Not sure whether the latter is possible without problems at other places.

PS. I tested against copy, not copy!, which is why there is any chance of getting faster than it. Also, on julia 0.6.2.