Interesting micro-benchmark

The data set I’m processing is organized in variable-length chunks, and I have to filter over all the chunks and build a set of data for several categories. I noticed that my program was starting to take a LONG time. I was appending results into an array and was of course suspicious that this was the problem (hint: it was).

I decided to write a quick microbenchmark, and I thought the results would be of interest to my fellow Julia users. The obvious question is: should you

  1. append via concatenation (no ! lol)
  2. push!
  3. pre-allocate an array I know is big enough and then just return a count of the number of items put in that array

Quick summary of results, for 1,000,000 elements total (order of results: append, push!, pre-allocated):

41.963682 seconds (1.23 M allocations: 37.291 GiB, 25.72% gc time)
Any[-0.334331, 0.943863, 0.111534, -2.54138, 0.572677]
0.095106 seconds (1.02 M allocations: 25.614 MiB, 39.89% gc time)
Any[-0.334331, 0.943863, 0.111534, -2.54138, 0.572677]
0.020536 seconds (24.78 k allocations: 9.025 MiB)
[-0.334331, 0.943863, 0.111534, -2.54138, 0.572677]

Now the results for 100,000 elements total:

0.575038 seconds (292.42 k allocations: 394.206 MiB, 22.88% gc time)
Any[-1.40151, 1.02218, -0.947789, 1.10541, 0.432075]
0.030933 seconds (124.86 k allocations: 4.881 MiB, 7.59% gc time)
Any[-1.40151, 1.02218, -0.947789, 1.10541, 0.432075]
0.021361 seconds (24.78 k allocations: 2.158 MiB)
[-1.40151, 1.02218, -0.947789, 1.10541, 0.432075]

Look at that append time and how much it drops between the two sizes! Very interesting. I’m assuming that’s a classic case of “not everything fits into cache anymore”. The other noticeable result is the (kind of) low penalty for using push!. Yeah, I know it’s not really low; after all, it’s a factor of 4 in the 10^6 case. But since the total time is 0.1 s, who cares?

Needless to say, I changed my code over to push! and it runs much faster now :grin:

Edit: a quick note about PyPlot. Do NOT invoke scatter repeatedly on small chunks of x,y data. Save it all up and plot it all at once; that was the other thing that was killing the run time of my code.
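For illustration, here is a sketch of that pattern; the `chunks` structure and the `gather` helper are my own hypothetical names, not the original code:

```julia
# Hypothetical: `chunks` is a vector of (x, y) pairs of coordinate vectors.
# Accumulate everything first, then make a single scatter call at the end.
function gather(chunks)
    xs, ys = Float64[], Float64[]
    for (cx, cy) in chunks
        append!(xs, cx)   # grows in place, amortized O(1) per element
        append!(ys, cy)
    end
    xs, ys
end

# xs, ys = gather(chunks)
# scatter(xs, ys)   # one PyPlot call instead of one call per chunk
```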

The benchmark code (for 10^6 elements):

function test1(y)
    x = []
    for i=1:10000
        # same as x = vcat(x, y[(i-1)*100+1:i*100])
        x = [x ; y[(i-1)*100+1:i*100]]
    end
    x
end
function test2(y)
    x = []
    for i=1:1000000
        push!(x, y[i])
    end
    x
end
function test3(y)
    x = Array{Float64,1}(undef, 1000000)
    for i=1:1000000
        x[i] = y[i]
    end
    x
end

y = randn(1000000)
x = @time test1(y)
println(x[1:5])
x = @time test2(y)
println(x[1:5])
x = @time test3(y)
println(x[1:5])

You’ll see better results if you use a typed Float64[] empty array instead of [] (which can hold anything).
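For instance (a minimal sketch of the difference):

```julia
a = []          # Vector{Any}: every element is boxed, access is type-unstable
b = Float64[]   # Vector{Float64}: elements are stored inline, unboxed
push!(a, 1.0)
push!(b, 1.0)
eltype(a)       # Any
eltype(b)       # Float64
```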

Also you’re not seeing caching behavior in the appending case — it’s simply O(n^2) behavior. Each time you append, Julia needs to construct a brand new, yet bigger array and copy all the previous elements into it. That’s why a factor of 10 increase in size led to a factor of ~100 increase in runtime.
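A back-of-the-envelope sketch of that quadratic cost (my own illustration; `copied` is a hypothetical helper, not part of the benchmark): concatenating chunk k copies all k*c elements accumulated so far, so the total work grows like c·n²/2.

```julia
# Elements copied when building n chunks of size c by repeated concatenation:
# step k constructs a fresh array of k*c elements and copies everything into it.
copied(n, c) = sum(k * c for k in 1:n)   # = c * n * (n + 1) / 2

copied(10_000, 100) / copied(1_000, 100)   # ≈ 100x the work for 10x the data
```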


Doh! I should have realized that.

Yes, that’s a definite improvement. For the low-N case it actually made push! faster than pre-allocation.

Thanks for the advice.

Try sizehint! before push!:

function test4(y)
    x = eltype(y)[]
    sizehint!(x,1000000)
    for i=1:1000000
        push!(x, y[i])
    end
    x
end

y = randn(1000000)
x = @btime(test1(y)) # 20.245 s (1039981 allocations: 37.28 GiB)
x = @btime(test2(y)) # 26.125 ms (1000020 allocations: 24.26 MiB)
x = @btime(test3(y)) # 1.284 ms (2 allocations: 7.63 MiB)
x = @btime(test4(y)) # 5.721 ms (2 allocations: 7.63 MiB)
# Edit: below are the same timings on Julia v1.1
@btime(test2(y)); # 6.759 ms (20 allocations: 9.00 MiB)
@btime(test3(y)); # 1.240 ms (2 allocations: 7.63 MiB)
@btime(test4(y)); # 5.791 ms (2 allocations: 7.63 MiB)

I didn’t know about sizehint!, thanks.

Interesting that your run shows a much larger disparity between cases 2 and 3.
On my system there is almost no difference between cases 2, 3, and 4.

Edit: test run from a Cygwin shell command line.

I’m running Julia v1.2, not sure if that affects some of the timings. In case it does, I hope it’s for the better! :smiley:

Edit: I updated my post above with timings for Julia v1.1. Unfortunately, the timings got worse… :frowning:

Edit 2: Actually, it was Juno that caused the second case to be much slower… Re-running the benchmark in the terminal yields similar results between Julia v1.2 and Julia v1.1, but Juno is consistently slower.

Edit 3: Juno is consistent with the terminal if I interpolate y into the benchmark, my bad. Code with interpolation below for reference:

@btime(test2($y)); # 6.759 ms (20 allocations: 9.00 MiB)
@btime(test3($y)); # 1.240 ms (2 allocations: 7.63 MiB)
@btime(test4($y)); # 5.791 ms (2 allocations: 7.63 MiB)

Why would append allocate every time, while push! grows the array in a more clever way?

append! is totally fine. x = [x; elem] is just bad (it has quadratic runtime).
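For instance, the concatenation loop from test1 could be rewritten with append! (a sketch, assuming the same 10^6-element y; append! grows the array in place with amortized constant cost per element):

```julia
function test1_append(y)
    x = eltype(y)[]
    for i in 1:10_000
        # append! copies the chunk into x without rebuilding the whole array
        append!(x, @view y[(i-1)*100+1:i*100])
    end
    x
end
```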

The actual order of perf issues is: (1) FFI overhead for push!, (2) the kernel dealing with page faults, (3) the growth strategy (sizehint!). Reuse is the most important thing.

julia> function pushcopy(y)
           x = eltype(y)[]
           for e in y
               push!(x, e)
           end
           x
       end
pushcopy (generic function with 1 method)

julia> function pushcopy2(y)
           x = eltype(y)[]
           sizehint!(x, length(y))
           for e in y
               push!(x, e)
           end
           x
       end
julia> using BenchmarkTools
julia> y=rand(10^7);
julia> run(`cat /sys/kernel/mm/transparent_hugepage/enabled`);
always [madvise] never
julia> @btime copy($y); @btime pushcopy2($y); @btime pushcopy($y);
  55.340 ms (2 allocations: 76.29 MiB)
  106.840 ms (2 allocations: 76.29 MiB)
  181.064 ms (24 allocations: 129.00 MiB)
#shell> sudo echo "always" > /sys/kernel/mm/transparent_hugepage/enabled
julia> run(`cat /sys/kernel/mm/transparent_hugepage/enabled`);
[always] madvise never
julia> @btime copy($y); @btime pushcopy2($y); @btime pushcopy($y);
  19.822 ms (2 allocations: 76.29 MiB)
  81.166 ms (2 allocations: 76.29 MiB)
  114.986 ms (24 allocations: 129.00 MiB)

julia> z=copy(y);
julia> @btime z.=y;
  8.241 ms (2 allocations: 32 bytes)

This is the reference speed on my machine: I have 10 GB/s memcopy, and everything that is slower needs explanation.