Interesting micro-benchmark

The data set I’m processing is organized in variable-length chunks, and I have to filter over all the chunks and build a set of data for several categories. I noticed that my program was starting to take a LONG time. I was appending results into an array and was of course suspicious that this was the problem (hint: it was).

I decided to write a quick microbenchmark, and I thought the results would be of interest to my fellow Julia users. The obvious question is: should you

  1. append via concatenation (no ! lol)
  2. push!
  3. pre-allocate an array I know is big enough and then just return a count of the number of items put in that array

Quick summary of results, for 1,000,000 elements total (order of results: append, push!, pre-allocated):

41.963682 seconds (1.23 M allocations: 37.291 GiB, 25.72% gc time)
Any[-0.334331, 0.943863, 0.111534, -2.54138, 0.572677]
0.095106 seconds (1.02 M allocations: 25.614 MiB, 39.89% gc time)
Any[-0.334331, 0.943863, 0.111534, -2.54138, 0.572677]
0.020536 seconds (24.78 k allocations: 9.025 MiB)
[-0.334331, 0.943863, 0.111534, -2.54138, 0.572677]

Now the results for 100,000 elements total:

0.575038 seconds (292.42 k allocations: 394.206 MiB, 22.88% gc time)
Any[-1.40151, 1.02218, -0.947789, 1.10541, 0.432075]
0.030933 seconds (124.86 k allocations: 4.881 MiB, 7.59% gc time)
Any[-1.40151, 1.02218, -0.947789, 1.10541, 0.432075]
0.021361 seconds (24.78 k allocations: 2.158 MiB)
[-1.40151, 1.02218, -0.947789, 1.10541, 0.432075]

Look at that append time and how much it drops between the two sizes! Very interesting. I’m assuming that’s a classic case of “not everything fits into cache anymore”. The other noticeable result is the (kind of) low penalty for using push!. Yeah, I know it’s not really low; after all, it’s a factor of 4 in the 10^6 case. But since the total time is 0.1 s, who cares?

Needless to say, I changed my code over to push! and it runs much faster now :grin:

Edit: a quick note about PyPlot. Do NOT invoke scatter repeatedly on small chunks of x,y data. Save it all up and plot it all at once; that was the other thing that was killing the run time of my code.
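For illustration, here is a sketch of that pattern; the `chunks` structure and the `gather` helper are my own hypothetical names, not the original code:

```julia
# Hypothetical: `chunks` is a vector of (x, y) pairs of coordinate vectors.
# Accumulate everything first, then make a single scatter call at the end.
function gather(chunks)
    xs, ys = Float64[], Float64[]
    for (cx, cy) in chunks
        append!(xs, cx)   # grows in place, amortized O(1) per element
        append!(ys, cy)
    end
    xs, ys
end

# xs, ys = gather(chunks)
# scatter(xs, ys)   # one PyPlot call instead of one call per chunk
```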

The benchmark code (for 10^6 elements):

function test1(y)
    x = []
    for i=1:10000
        # same as x = vcat(x, y[(i-1)*100+1:i*100])
        x = [x ; y[(i-1)*100+1:i*100]]
    end
    x
end
function test2(y)
    x = []
    for i=1:1000000
        push!(x, y[i])
    end
    x
end
function test3(y)
    x = Array{Float64,1}(undef, 1000000)
    for i=1:1000000
        x[i] = y[i]
    end
    x
end

y = randn(1000000)
x = @time test1(y)
println(x[1:5])
x = @time test2(y)
println(x[1:5])
x = @time test3(y)
println(x[1:5])

You’ll see better results if you use a typed Float64[] empty array instead of [] (which can hold anything).
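For instance (a minimal sketch of the difference):

```julia
a = []          # Vector{Any}: every element is boxed, access is type-unstable
b = Float64[]   # Vector{Float64}: elements are stored inline, unboxed
push!(a, 1.0)
push!(b, 1.0)
eltype(a)       # Any
eltype(b)       # Float64
```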

Also you’re not seeing caching behavior in the appending case — it’s simply O(n^2) behavior. Each time you append, Julia needs to construct a brand new, yet bigger array and copy all the previous elements into it. That’s why a factor of 10 increase in size led to a factor of ~100 increase in runtime.
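A back-of-the-envelope sketch of that quadratic cost (my own illustration; `copied` is a hypothetical helper, not part of the benchmark): concatenating chunk k copies all k*c elements accumulated so far, so the total work grows like c·n²/2.

```julia
# Elements copied when building n chunks of size c by repeated concatenation:
# step k constructs a fresh array of k*c elements and copies everything into it.
copied(n, c) = sum(k * c for k in 1:n)   # = c * n * (n + 1) / 2

copied(10_000, 100) / copied(1_000, 100)   # ≈ 100x the work for 10x the data
```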


Doh! I should have realized that.

Yes, that’s a definite improvement. For the low-N case it actually made push! faster than pre-allocation.

Thanks for the advice.

Try sizehint! before push!:

function test4(y)
    x = eltype(y)[]
    sizehint!(x,1000000)
    for i=1:1000000
        push!(x, y[i])
    end
    x
end

y = randn(1000000)
x = @btime(test1(y)) # 20.245 s (1039981 allocations: 37.28 GiB)
x = @btime(test2(y)) # 26.125 ms (1000020 allocations: 24.26 MiB)
x = @btime(test3(y)) # 1.284 ms (2 allocations: 7.63 MiB)
x = @btime(test4(y)) # 5.721 ms (2 allocations: 7.63 MiB)
# Edit: below are the same timings on Julia v1.1
@btime(test2(y)); # 6.759 ms (20 allocations: 9.00 MiB)
@btime(test3(y)); # 1.240 ms (2 allocations: 7.63 MiB)
@btime(test4(y)); # 5.791 ms (2 allocations: 7.63 MiB)

I didn’t know about sizehint!, thanks.

Interesting that your run shows a much larger disparity between cases 2 and 3.
On my system there is almost no difference between cases 2, 3, and 4.

Edit: test run from a Cygwin shell command line.

I’m running Julia v1.2, not sure if that affects some of the timings. In case it does, I hope it’s for the better! :smiley:

Edit: I updated my post above with timings for Julia v1.1. Unfortunately, the timings got worse… :frowning:

Edit 2: Actually, it was Juno that caused the second case to be much slower… Re-running the benchmark in the terminal yields similar results between Julia v1.2 and Julia v1.1, but Juno is consistently slower.

Edit 3: Juno is consistent with the terminal if I interpolate y into the benchmark, my bad. Code with interpolation below for reference:

@btime(test2($y)); # 6.759 ms (20 allocations: 9.00 MiB)
@btime(test3($y)); # 1.240 ms (2 allocations: 7.63 MiB)
@btime(test4($y)); # 5.791 ms (2 allocations: 7.63 MiB)

Why would append allocate every time, while push! grows the array in a more clever way?

append! is totally fine. x = [x; elem] is just bad (it has quadratic runtime).
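For instance, the concatenation loop from test1 could be rewritten with append! (a sketch, assuming the same 10^6-element y; append! grows the array in place with amortized constant cost per element):

```julia
function test1_append(y)
    x = eltype(y)[]
    for i in 1:10_000
        # append! copies the chunk into x without rebuilding the whole array
        append!(x, @view y[(i-1)*100+1:i*100])
    end
    x
end
```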

The actual order of perf issues is: (1) FFI overhead for push!, (2) the kernel dealing with page faults, (3) the growth strategy (sizehint!). Reuse is the most important thing.

julia> function pushcopy(y)
           x = eltype(y)[]
           for e in y
               push!(x, e)
           end
           x
       end
pushcopy (generic function with 1 method)

julia> function pushcopy2(y)
           x = eltype(y)[]
           sizehint!(x, length(y))
           for e in y
               push!(x, e)
           end
           x
       end
julia> using BenchmarkTools
julia> y=rand(10^7);
julia> run(`cat /sys/kernel/mm/transparent_hugepage/enabled`);
always [madvise] never
julia> @btime copy($y); @btime pushcopy2($y); @btime pushcopy($y);
  55.340 ms (2 allocations: 76.29 MiB)
  106.840 ms (2 allocations: 76.29 MiB)
  181.064 ms (24 allocations: 129.00 MiB)
#shell> sudo echo "always" > /sys/kernel/mm/transparent_hugepage/enabled
julia> run(`cat /sys/kernel/mm/transparent_hugepage/enabled`);
[always] madvise never
julia> @btime copy($y); @btime pushcopy2($y); @btime pushcopy($y);
  19.822 ms (2 allocations: 76.29 MiB)
  81.166 ms (2 allocations: 76.29 MiB)
  114.986 ms (24 allocations: 129.00 MiB)

julia> z=copy(y);
julia> @btime z.=y;
  8.241 ms (2 allocations: 32 bytes)

This is the reference speed on my machine: I have 10 GB/s memcopy, and everything that is slower needs explanation.